Fraud Detection

Vehicle Insurance Claim Fraud

16-model fraud pipeline for 15,420 claims (5.99% fraud). AdaBoost achieves the highest recall (89.2%). XGBoost tuned with RandomizedSearchCV reaches a CV AUC of 0.9847. SHAP identifies Fault (37.9%) as the dominant fraud indicator.

XGBoost CV AUC: 0.9847
AdaBoost Recall: 89.2%
Voting Ensemble AUC: 0.819
Top SHAP feature: Fault (37.9%)
Dataset

15,420 insurance claims, 33 features, 5.99% fraud

Approach

SMOTE → 16-model benchmark → RandomizedSearchCV HPO → SHAP analysis

Tech Stack
Python, XGBoost, LightGBM, CatBoost, SMOTE, SHAP, Scikit-learn
Keywords
XGBoost, SMOTE, SHAP, Insurance, RandomizedSearchCV, AdaBoost
Visualizations: 6 Charts
Deep Dive

Complete fraud detection pipeline for vehicle insurance with severe class imbalance (5.99% fraud rate).

Dataset

  • 15,420 claims: 14,497 legitimate + 923 fraud (5.99%)
  • 33 features: vehicle details, accident area, policy type, deductible, police report, agent type
  • Engineered: Claim_Delay, Policy_Claim_Gap, VehicleAge_Price_Ratio, High_Risk_Score

Imbalance Strategy

  • SMOTE oversampling on training set (6% → 50% fraud)
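  • class_weight='balanced' for estimators that accept it
  • Evaluation: Recall and Average Precision rather than accuracy, which is misleading at a 6% fraud rate (labelling every claim legitimate already scores about 94%)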
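
The oversampling step can be illustrated with a small sketch of the SMOTE idea: each synthetic fraud row is an interpolation between a real fraud row and one of its nearest fraud neighbours. This is a simplified, standard-library illustration on made-up two-feature data, not the pipeline's actual code; the function name `smote_sketch` and the toy split sizes are assumptions. In practice the `SMOTE` class from imbalanced-learn is the usual choice, and categorical columns need encoding (or a variant such as `SMOTENC`) before interpolation makes sense.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic rows by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base row (excluding the row itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy, made-up training split: 94 legitimate rows and 6 fraud rows (about 6% fraud).
data_rng = random.Random(42)
legit = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(94)]
fraud = [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(6)]

# Reaching a 50% fraud share means adding (legit - fraud) synthetic fraud rows.
new_fraud = smote_sketch(fraud, n_synthetic=len(legit) - len(fraud), k=3)
balanced_fraud = fraud + new_fraud
fraud_share = len(balanced_fraud) / (len(balanced_fraud) + len(legit))
print(f"fraud share after oversampling: {fraud_share:.0%}")  # 50%
```

Resampling only the training portion keeps synthetic rows out of the data used for evaluation, which is why the step is restricted to the training set.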

16-Model Leaderboard (sorted by Recall)

Model                  AUC     Recall   Precision / Notes
AdaBoost               0.780   0.892    Best recall
Naive Bayes            0.62    0.789    Low
Logistic Regression    0.765   0.778    Moderate
Decision Tree          0.731   0.703
Random Forest          0.796   0.662
Voting (XGB+LGB+CB)    0.819   0.641    Best AUC
XGBoost                0.814   0.638
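
To make the recall and precision columns and the voting row concrete, here is a minimal sketch. It assumes soft voting (averaging each model's predicted fraud probability), which the write-up does not specify, and it uses made-up probabilities for six claims; the helper names `soft_vote` and `precision_recall` are hypothetical.

```python
def soft_vote(prob_lists):
    """Average per-model fraud probabilities claim by claim."""
    n_models = len(prob_lists)
    return [sum(row) / n_models for row in zip(*prob_lists)]


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (fraud = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up fraud probabilities from three hypothetical models for six claims.
y_true = [1, 0, 1, 0, 0, 1]
xgb_p = [0.9, 0.2, 0.4, 0.6, 0.1, 0.7]
lgb_p = [0.8, 0.3, 0.5, 0.4, 0.2, 0.6]
cb_p = [0.7, 0.1, 0.3, 0.2, 0.3, 0.8]

avg_p = soft_vote([xgb_p, lgb_p, cb_p])

# Default threshold: one fraud case is missed, no false alarms.
y_pred = [1 if p >= 0.5 else 0 for p in avg_p]
precision, recall = precision_recall(y_true, y_pred)

# Lower threshold: every fraud case is caught, at the cost of one false alarm.
y_pred_low = [1 if p >= 0.35 else 0 for p in avg_p]
precision_low, recall_low = precision_recall(y_true, y_pred_low)
```

The two thresholds show the trade-off visible in the leaderboard: a model or threshold that flags more claims raises recall and usually lowers precision.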

RandomizedSearchCV — XGBoost (40 iterations, 5-fold)

  • CV AUC: 0.9847
  • Best parameters: subsample=0.7, max_depth=7, n_estimators=500, learning_rate=0.05
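
The search procedure can be sketched with the standard library alone: sample 40 random configurations, score each on 5 folds, and keep the configuration with the best mean score. This mirrors what scikit-learn's `RandomizedSearchCV` does with `n_iter=40` and `cv=5`, but it is only an illustration. The search space below is hypothetical (only the four reported best values come from the results above), and `toy_score` is a stand-in for fitting XGBoost on four folds and scoring the fifth.

```python
import random

# Hypothetical search space; the real ranges are not given in the write-up.
SEARCH_SPACE = {
    "subsample": [0.6, 0.7, 0.8, 0.9, 1.0],
    "max_depth": [3, 5, 7, 9],
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
}


def random_search(score_fold, space, n_iter=40, n_folds=5, seed=0):
    """Sample n_iter configurations; keep the one with the best mean fold score."""
    rng = random.Random(seed)
    best_params, best_score, n_fits = None, float("-inf"), 0
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        fold_scores = [score_fold(params, fold) for fold in range(n_folds)]
        n_fits += n_folds
        mean_score = sum(fold_scores) / n_folds
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score, n_fits


def toy_score(params, fold):
    """Stand-in for 'fit on 4 folds, score the 5th'; not a real model."""
    return (
        1.0
        - abs(params["learning_rate"] - 0.05)
        - 0.01 * abs(params["max_depth"] - 7)
        - 0.001 * fold
    )


best_params, best_score, n_fits = random_search(toy_score, SEARCH_SPACE)
print(n_fits)  # 40 iterations x 5 folds = 200 fits
```

Random search trades exhaustiveness for cost: the hypothetical grid above has 240 combinations, while 40 samples require only 200 model fits.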

SHAP Feature Importance (XGBoost)

Feature              Contribution
Fault                37.9%
Deductible           12.9%
BasePolicy           12.2%
VehicleCategory      8.1%
PoliceReportFiled    6.4%
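
One common way to arrive at percentage contributions like these is to take the mean absolute SHAP value per feature and normalize so that all features sum to 100%. The write-up does not say how its percentages were derived, so the sketch below is an assumption, and the SHAP values in it are made up for illustration.

```python
def shap_contributions(shap_values, feature_names):
    """Return each feature's share of total mean |SHAP|, in percent."""
    n_rows = len(shap_values)
    mean_abs = [
        sum(abs(row[j]) for row in shap_values) / n_rows
        for j in range(len(feature_names))
    ]
    total = sum(mean_abs)
    shares = {name: 100 * m / total for name, m in zip(feature_names, mean_abs)}
    # Sort from largest to smallest contribution.
    return dict(sorted(shares.items(), key=lambda kv: kv[1], reverse=True))


# Made-up SHAP values (rows = claims, columns = features), for illustration only.
features = ["Fault", "Deductible", "BasePolicy"]
toy_shap = [
    [0.6, -0.2, 0.2],
    [-0.4, 0.1, -0.1],
    [0.5, -0.3, 0.0],
]
shares = shap_contributions(toy_shap, features)
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

Under that reading, the five features listed above account for 77.5% of the total, with the remainder spread across the other features.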

5-Fold CV Stability

  • RF: 0.8619 ± 0.0010
  • XGB: 0.8529 ± 0.0023
  • LGB: 0.8505 ± 0.0016
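
A "mean ± spread" figure of this kind is typically the mean and standard deviation of the per-fold scores. The sketch below uses made-up fold scores and the population standard deviation (NumPy's default); which variant produced the figures above is not stated, so treat that choice as an assumption.

```python
import statistics


def summarize_folds(scores):
    """Mean and population standard deviation of per-fold scores."""
    return statistics.fmean(scores), statistics.pstdev(scores)


# Made-up scores for five folds, for illustration only.
toy_folds = [0.861, 0.863, 0.862, 0.860, 0.864]
mean, std = summarize_folds(toy_folds)
summary = f"{mean:.4f} ± {std:.4f}"
print(summary)  # 0.8620 ± 0.0014
```

A small standard deviation relative to the gaps between models suggests the ranking is not an artifact of one lucky fold.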