Fraud Detection
Vehicle Insurance Claim Fraud
A 16-model fraud detection pipeline benchmarked on 15,420 claims (5.99% fraud). AdaBoost achieves the highest recall (89.2%). XGBoost tuned with RandomizedSearchCV reaches a CV AUC of 0.9847. SHAP analysis identifies Fault (37.9%) as the dominant fraud indicator.
XGBoost CV AUC: 0.9847
AdaBoost Recall: 89.2%
Voting Ensemble AUC: 0.819
Top SHAP feature: Fault (37.9%)
Dataset
15,420 insurance claims, 33 features, 5.99% fraud
Approach
SMOTE → 16-model benchmark → RandomizedSearchCV HPO → SHAP analysis
Tech Stack
Python, XGBoost, LightGBM, CatBoost, SMOTE, SHAP, Scikit-learn
Keywords
XGBoost, SMOTE, SHAP, Insurance, RandomizedSearchCV, AdaBoost
Visualizations: 6 Charts
Deep Dive
Complete fraud detection pipeline for vehicle insurance with severe class imbalance (5.99% fraud rate).
Dataset
- 15,420 claims: 14,497 legitimate + 923 fraud (5.99%)
- 33 features: vehicle details, accident area, policy type, deductible, police report, agent type
- Engineered features: Claim_Delay, Policy_Claim_Gap, VehicleAge_Price_Ratio, High_Risk_Score
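The class counts above are enough to show why accuracy is a poor yardstick for this dataset. The following is a minimal sketch (the function name `baseline_metrics` is illustrative, not part of the project) that scores a trivial classifier which labels every claim as legitimate:

```python
# Class counts taken from the dataset summary above.
N_LEGIT = 14_497
N_FRAUD = 923


def baseline_metrics(n_legit: int, n_fraud: int) -> dict:
    """Metrics for a trivial classifier that labels every claim legitimate."""
    total = n_legit + n_fraud
    true_negatives = n_legit  # every legitimate claim is labelled correctly
    true_positives = 0        # no fraudulent claim is ever flagged
    return {
        "fraud_rate": n_fraud / total,
        "accuracy": (true_negatives + true_positives) / total,
        "recall": true_positives / n_fraud,
    }


metrics = baseline_metrics(N_LEGIT, N_FRAUD)
print(f"fraud rate: {metrics['fraud_rate']:.2%}")  # 5.99%
print(f"accuracy:   {metrics['accuracy']:.2%}")    # 94.01%
print(f"recall:     {metrics['recall']:.2%}")      # 0.00%
```

A model that never flags fraud still scores about 94% accuracy while catching nothing, which is why the metrics that matter here are those computed on the fraud class.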
Imbalance Strategy
- SMOTE oversampling on the training set only (6% → 50% fraud)
- class_weight='balanced' for estimators that support it
- Evaluation: Recall and Average Precision rather than accuracy, which is misleading at a 6% fraud rate
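The oversampling step can be illustrated with a simplified, standard-library-only sketch of the SMOTE idea: each synthetic fraud row is an interpolation between a real fraud row and one of its nearest fraud neighbours. This is not the implementation the project used (the source does not say which one it was; in practice a library version such as imbalanced-learn's `SMOTE` is the usual choice), and the helper name `smote_sketch` and the toy two-feature data are assumptions made for illustration.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between a random
    minority row and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy TRAINING split only (hypothetical data): 94 legitimate rows, 6 fraud rows.
rng = random.Random(42)
train_legit = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(94)]
train_fraud = [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(6)]

# Add enough synthetic fraud rows to reach a 50/50 class balance.
n_new = len(train_legit) - len(train_fraud)
synthetic_rows = smote_sketch(train_fraud, n_new, k=3)
train_fraud_balanced = train_fraud + synthetic_rows

fraud_share = len(train_fraud_balanced) / (len(train_fraud_balanced) + len(train_legit))
print(f"fraud share after oversampling: {fraud_share:.0%}")
```

Only the training split is resampled. The test split keeps its original 6% fraud rate so that recall and average precision reflect real-world conditions.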
16-Model Leaderboard (7 of 16 models shown, sorted by Recall)
| Model | AUC | Recall | Precision / Notes |
|---|---|---|---|
| AdaBoost | 0.780 | 0.892 | Best recall |
| Naive Bayes | 0.62 | 0.789 | Low |
| Logistic Regression | 0.765 | 0.778 | Moderate |
| Decision Tree | 0.731 | 0.703 | n/a |
| Random Forest | 0.796 | 0.662 | n/a |
| Voting (XGB+LGB+CB) | 0.819 | 0.641 | Best AUC |
| XGBoost | 0.814 | 0.638 | n/a |
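The leaderboard ranks one model first on recall and a different one first on AUC. A small from-scratch sketch of the three metrics shows why that can happen: recall and precision depend on the decision threshold, while AUC depends only on how the scores rank fraud against legitimate claims. The labels and scores below are toy values invented for illustration, not project outputs.

```python
def recall_precision(y_true, y_pred):
    """Recall and precision for the positive (fraud = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


def roc_auc(y_true, scores):
    """Probability that a random fraud claim outscores a random legitimate
    claim, with ties counted as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def predict(scores, threshold):
    """Flag a claim as fraud when its score reaches the threshold."""
    return [int(s >= threshold) for s in scores]


# Toy data: 3 fraud claims, 7 legitimate claims (hypothetical scores).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05]

auc = roc_auc(y_true, scores)
strict = recall_precision(y_true, predict(scores, 0.5))    # fewer flags
lenient = recall_precision(y_true, predict(scores, 0.25))  # more flags

print(f"AUC: {auc:.3f}")
print(f"threshold 0.50 -> recall {strict[0]:.3f}, precision {strict[1]:.3f}")
print(f"threshold 0.25 -> recall {lenient[0]:.3f}, precision {lenient[1]:.3f}")
```

Lowering the threshold raises recall from 2/3 to 1.0 and lowers precision from 2/3 to 0.6, while the AUC stays the same. A model that flags aggressively can therefore lead on recall without leading on AUC.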
RandomizedSearchCV: XGBoost (40 iterations, 5-fold)
- CV AUC: 0.9847
- Best parameters: subsample=0.7, max_depth=7, n_estimators=500, learning_rate=0.05
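Conceptually, a randomized search draws a fixed number of parameter combinations at random, scores each one with cross-validation, and keeps the best. The sketch below shows that loop with the standard library only. The search space is hypothetical (the source reports only the winning values), and `toy_cv_score` is a placeholder that trains no model; it stands in for the mean 5-fold AUC a real search would compute.

```python
import random

# Hypothetical search space; only the reported best values come from the source.
PARAM_SPACE = {
    "subsample": [0.6, 0.7, 0.8, 0.9, 1.0],
    "max_depth": [3, 5, 7, 9],
    "n_estimators": [100, 300, 500, 800],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
}


def random_search(score_fn, param_space, n_iter=40, seed=0):
    """Sample n_iter random configurations and return the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = score_fn(params)  # in a real search: mean AUC over 5 CV folds
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_cv_score(params):
    """Placeholder score: the fraction of parameters that match the values
    reported as best in the source. It does NOT train or evaluate a model."""
    reported_best = {
        "subsample": 0.7,
        "max_depth": 7,
        "n_estimators": 500,
        "learning_rate": 0.05,
    }
    return sum(params[k] == v for k, v in reported_best.items()) / len(reported_best)


best_params, best_score = random_search(toy_cv_score, PARAM_SPACE, n_iter=40, seed=0)
print(best_params, best_score)
```

With scikit-learn, the same loop is handled by `RandomizedSearchCV(estimator, param_distributions, n_iter=40, cv=5, scoring="roc_auc")`. When oversampling is part of the workflow, it generally belongs inside the cross-validation pipeline, so that each validation fold contains only real claims. Resampling before the folds are split tends to inflate CV scores relative to hold-out scores.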
SHAP Feature Importance (XGBoost)
| Feature | Contribution |
|---|---|
| Fault | 37.9% |
| Deductible | 12.9% |
| BasePolicy | 12.2% |
| VehicleCategory | 8.1% |
| PoliceReportFiled | 6.4% |
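The source does not state how the contribution percentages were derived. A common convention, assumed here, is to take the mean absolute SHAP value of each feature and normalize so that all features sum to 100%. The sketch below applies that convention to a tiny hypothetical SHAP matrix. In the real pipeline the matrix would come from an explainer such as `shap.TreeExplainer(model).shap_values(X)`.

```python
def shap_contributions(shap_values, feature_names):
    """Share of total mean |SHAP| per feature, in percent, sorted descending."""
    n_rows = len(shap_values)
    mean_abs = [
        sum(abs(row[j]) for row in shap_values) / n_rows
        for j in range(len(feature_names))
    ]
    total = sum(mean_abs)
    shares = {name: 100 * value / total for name, value in zip(feature_names, mean_abs)}
    return dict(sorted(shares.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical SHAP values: one row per claim, one column per feature.
feature_names = ["Deductible", "Fault", "BasePolicy"]
toy_shap = [
    [-0.5, 1.5, 0.5],
    [0.5, -1.5, -0.5],
]

contributions = shap_contributions(toy_shap, feature_names)
for name, share in contributions.items():
    print(f"{name}: {share:.1f}%")
```

Because the absolute value is taken before averaging, a feature that pushes some claims toward fraud and others away from it still counts as influential. The five features listed in the table sum to 77.5%, so the remaining features account for the rest.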
5-Fold CV Stability (mean ± std)
| Model | Score |
|---|---|
| Random Forest | 0.8619 ± 0.0010 |
| XGBoost | 0.8529 ± 0.0023 |
| LightGBM | 0.8505 ± 0.0016 |
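A "mean ± std" stability figure summarizes the per-fold scores of one model. The sketch below shows the arithmetic on hypothetical fold scores, since the source reports only the summaries. It assumes the population standard deviation, which is what NumPy's default `std()` returns. The source does not say which variant was used.

```python
from statistics import mean, pstdev


def summarize_folds(fold_scores):
    """Return (mean, population standard deviation) of per-fold scores."""
    return mean(fold_scores), pstdev(fold_scores)


# Hypothetical scores from 5 folds of a single model.
fold_scores = [0.80, 0.82, 0.81, 0.79, 0.83]

fold_mean, fold_std = summarize_folds(fold_scores)
print(f"{fold_mean:.4f} ± {fold_std:.4f}")
```

A small standard deviation relative to the gap between models suggests the ranking is not an artifact of one lucky split.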