Over the past decade, supervised classification problems have been identified in several actuarial fields, such as risk management, projection modeling, and fraud and anomaly detection. In many of these problems, the classification task involves a highly imbalanced dataset, i.e., the number of instances of the relevant class is extremely small in comparison to the total number of instances. In such situations, classical supervised machine learning frameworks can be misleading (when an inappropriate evaluation metric is used) or ineffective (when an inappropriate classifier is used).
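To illustrate how an evaluation metric can mislead on imbalanced data, here is a minimal sketch in plain Python. The dataset is hypothetical (1,000 instances with 10 in the relevant class), and the "classifier" is a trivial one that always predicts the majority class; neither comes from the session material.

```python
# Hypothetical labels: 1 = relevant (minority) class, 0 = majority class.
y_true = [1] * 10 + [0] * 990

# A trivial baseline that never predicts the relevant class.
y_pred = [0] * len(y_true)


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) with respect to the given positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(y_true, y_pred)

# Accuracy: share of all instances classified correctly.
accuracy = (tp + tn) / (tp + fp + fn + tn)

# Recall (sensitivity): share of relevant instances that were found.
recall = tp / (tp + fn) if (tp + fn) else 0.0

# Specificity: share of majority instances that were correctly rejected.
specificity = tn / (tn + fp) if (tn + fp) else 0.0

# Balanced accuracy: mean of recall and specificity.
balanced_accuracy = (recall + specificity) / 2

print(f"accuracy          = {accuracy:.3f}")
print(f"recall            = {recall:.3f}")
print(f"balanced accuracy = {balanced_accuracy:.3f}")
```

Under these assumptions the baseline reaches 99% accuracy while detecting none of the relevant instances, which is why metrics such as recall or balanced accuracy tend to be more informative in this setting.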
In this web session, we will present several techniques to tackle these issues. More specifically, we will discuss external approaches (data preprocessing, such as over- and undersampling procedures) as well as internal approaches (modification of classifiers, e.g., balanced versions of random forests and support vector machines). After a concise introduction to imbalanced classification and the techniques above, we will turn theory into practice by implementing entire machine learning workflows in Python and R for two real-world use cases: churn prediction and fraud detection.
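As a rough sketch of the two families of techniques, the following plain-Python example shows random over- and undersampling (external approaches) and a simple class-weighting heuristic of the kind a cost-sensitive classifier could consume (an internal approach). The function names, the toy data, and the choice of purely random resampling are assumptions made for illustration; they are not taken from the session's workflows.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until all classes are equally large."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            j = rng.choice(idx)
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, target))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


# Hypothetical toy data: 100 rows, 5 of them in the relevant class.
X = [[i] for i in range(100)]
y = [1] * 5 + [0] * 95

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
weights = balanced_class_weights(y)

print("original:    ", dict(Counter(y)))
print("oversampled: ", dict(Counter(y_over)))
print("undersampled:", dict(Counter(y_under)))
print("class weights:", weights)
```

In a full workflow, resampling would normally be applied to the training split only, so that the evaluation data keeps the original class distribution. The weighting formula is the same heuristic that scikit-learn applies when `class_weight="balanced"` is set on classifiers such as random forests and support vector machines.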