In both life and general insurance, many predictive modelling tasks involve outcomes that occur infrequently, such as policy lapses, claims, or fraud. This leads to class imbalance: a situation where the target variable's classes are not represented equally in the data, with one class (e.g. policy lapse) often vastly outnumbered by the other. Left unaddressed, class imbalance can produce classification models that look accurate overall yet overlook rare but critical events.
This web session demonstrates how class imbalance in training data can be addressed in Python, using a Life Insurance Lapse Prediction Case Study.
Topics Covered:
- What class imbalance is, why it matters, and how it affects classification model performance (the ‘Accuracy Paradox’).
- Step-by-step demonstrations using Python libraries (pandas, scikit-learn, imbalanced-learn) for data preparation, rebalancing techniques, ML model development and model evaluation.
- A range of Rebalancing Techniques, including:
  - Oversampling (e.g. SMOTE)
  - Undersampling
  - Hybrid resampling
  - Cost-sensitive learning
- Application of Rebalancing Techniques across a range of ML classification models, including:
  - Naïve Bayes
  - Logistic Regression
  - Decision Trees
  - Random Forests
  - Gradient Boosting
  - Neural Networks
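As a minimal sketch of applying a rebalancing technique across models, the snippet below fits two of the classifiers listed above with and without cost-sensitive learning, here via scikit-learn's `class_weight="balanced"` option, which weights each class inversely to its frequency (the dataset and settings are hypothetical, not the session's code):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a lapse dataset: roughly 5% positives.
X, y = make_classification(
    n_samples=5_000, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

results = {}
for Model, kwargs in [
    (LogisticRegression, {"max_iter": 1_000}),
    (RandomForestClassifier, {"random_state": 0}),
]:
    # Plain model vs the same model with inverse-frequency class weights.
    plain = Model(**kwargs).fit(X_tr, y_tr)
    weighted = Model(class_weight="balanced", **kwargs).fit(X_tr, y_tr)
    results[Model.__name__] = (
        recall_score(y_te, plain.predict(X_te)),
        recall_score(y_te, weighted.predict(X_te)),
    )
    print(Model.__name__, "recall (plain, weighted):", results[Model.__name__])
```

The same pattern extends to the other listed models: tree-based learners accept `class_weight` directly, while gradient boosting and neural network libraries typically expose an equivalent sample- or class-weight argument.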
- A structured evaluation of rebalancing techniques, comparing their impact on model performance using metrics such as:
  - Precision
  - Recall
  - F1-score
  - ROC-AUC
  - Lift
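The metrics above can be computed with scikit-learn on a small set of hypothetical hold-out labels and scores (invented here purely for illustration). Lift has no scikit-learn function, so it is computed directly as the positive rate among the highest-scored cases divided by the overall positive rate. The all-negative baseline also shows the Accuracy Paradox in miniature: it scores 70% accuracy on this toy set while catching no positives at all.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)

# Hypothetical hold-out labels and model scores (illustrative only).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.15, 0.05, 0.3, 0.6, 0.25, 0.9, 0.8, 0.4])
y_pred = (y_score >= 0.5).astype(int)

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("F1:", f1_score(y_true, y_pred))
print("ROC-AUC:", roc_auc_score(y_true, y_score))

# Accuracy Paradox: predicting "no lapse" for everyone is 70% accurate
# on this toy set, yet it identifies zero lapses.
naive = np.zeros_like(y_true)
print("Naive accuracy:", accuracy_score(y_true, naive))

# Lift at the top 30%: how much more concentrated the positives are among
# the highest-scored cases than across the whole book.
k = 3
top_k = np.argsort(y_score)[::-1][:k]
lift = y_true[top_k].mean() / y_true.mean()
print("Lift@30%:", lift)
```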