Fraud Detection

Vehicle Insurance Claim Fraud

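A 16-model fraud detection benchmark on 15,420 vehicle insurance claims (5.99% fraud). AdaBoost achieves the highest recall (89.2%), an XGBoost model tuned with RandomizedSearchCV reaches a CV AUC of 0.9847, and SHAP analysis identifies Fault as the dominant fraud indicator (37.9% contribution).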


XGBoost CV AUC: 0.9847

AdaBoost Recall: 89.2%

Voting Ensemble AUC: 0.819

Top SHAP feature: Fault (37.9%)

Dataset

15,420 insurance claims, 33 features, 5.99% fraud

Approach

SMOTE → 16-model benchmark → RandomizedSearchCV HPO → SHAP analysis

Tech Stack

Python, XGBoost, LightGBM, CatBoost, SMOTE, SHAP, scikit-learn

Keywords

XGBoost, SMOTE, SHAP, Insurance, RandomizedSearchCV, AdaBoost

Visualizations: 6 Charts

Deep Dive

Complete fraud detection pipeline for vehicle insurance with severe class imbalance (5.99% fraud rate).

Dataset

▸15,420 claims: 14,497 legitimate + 923 fraud (5.99%)
▸33 features: vehicle details, accident area, policy type, deductible, police report, agent type
▸Engineered features: Claim_Delay, Policy_Claim_Gap, VehicleAge_Price_Ratio, High_Risk_Score
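As a quick check on the class counts above, the sketch below computes the fraud rate and the score of a trivial baseline that labels every claim legitimate. Only the two counts come from the dataset summary; the variable names are illustrative.

```python
# Class counts taken from the dataset summary above.
n_legit = 14_497
n_fraud = 923
n_total = n_legit + n_fraud

fraud_rate = n_fraud / n_total

# A baseline that predicts "legitimate" for every claim is right on
# every legitimate claim and wrong on every fraudulent one.
baseline_accuracy = n_legit / n_total
baseline_recall = 0 / n_fraud  # it never flags a single fraud case

print(f"total claims:      {n_total}")
print(f"fraud rate:        {fraud_rate:.2%}")
print(f"baseline accuracy: {baseline_accuracy:.2%}")
print(f"baseline recall:   {baseline_recall:.2%}")
```

The baseline reaches roughly 94% accuracy while catching no fraud at all, which is why the rest of the write-up reports recall and AUC instead.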

Imbalance Strategy

▸SMOTE oversampling on training set (6% → 50% fraud)
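▸class_weight='balanced' for estimators that support it
▸Evaluation: recall and average precision rather than accuracy, which is misleading at a 6% fraud rate (predicting every claim as legitimate already scores about 94%)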
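To make the oversampling step concrete, here is a minimal sketch of the SMOTE idea: each synthetic fraud row is an interpolation between a real fraud row and one of its nearest fraud neighbours. This is a simplified stand-in written with NumPy and scikit-learn, not the project's code; in practice the `SMOTE` class from imbalanced-learn would normally be used. The function name `smote_sketch`, the toy data, and the 940/60 split are assumptions chosen to mimic a 6% minority class.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between a
    sampled minority row and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Ask for k + 1 neighbours: for distinct points, the closest match
    # to each row is the row itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh_idx = nn.kneighbors(X_min)
    neigh_idx = neigh_idx[:, 1:]  # drop the self-match in column 0

    base = rng.integers(0, len(X_min), size=n_new)  # row to start from
    pick = rng.integers(0, k, size=n_new)           # which neighbour to use
    neighbour = neigh_idx[base, pick]
    gap = rng.random((n_new, 1))                    # position along the segment
    return X_min[base] + gap * (X_min[neighbour] - X_min[base])


# Toy training split (illustrative only): 940 legitimate, 60 fraud rows.
rng = np.random.default_rng(42)
X_legit = rng.normal(0.0, 1.0, size=(940, 4))
X_fraud = rng.normal(1.5, 1.0, size=(60, 4))

# Add enough synthetic fraud rows to reach a 50/50 training set.
n_new = len(X_legit) - len(X_fraud)
X_syn = smote_sketch(X_fraud, n_new)

X_bal = np.vstack([X_legit, X_fraud, X_syn])
y_bal = np.concatenate([np.zeros(len(X_legit), int),
                        np.ones(len(X_fraud) + n_new, int)])
print(f"fraud share after oversampling: {y_bal.mean():.0%}")
```

Only the training split is resampled; the test split keeps its original class ratio so that recall and average precision reflect real-world conditions.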

16-Model Leaderboard (7 of 16 shown, sorted by Recall)

Model	AUC	Recall	Notes
AdaBoost	0.780	0.892	Best recall
Naive Bayes	0.62	0.789	Low precision
Logistic Regression	0.765	0.778	Moderate precision
Decision Tree	0.731	0.703	-
Random Forest	0.796	0.662	-
Voting (XGB+LGB+CB)	0.819	0.641	Best AUC
XGBoost	0.814	0.638	-
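The leaderboard shows that the model with the best recall is not the one with the best AUC. A small sketch with scikit-learn metrics illustrates why: recall and precision depend on the decision threshold, while AUC and average precision depend only on how the scores rank the claims. The labels, scores, and thresholds below are toy values, not project data.

```python
from sklearn.metrics import (
    average_precision_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy labels and model scores (illustrative only): 1 = fraud.
y_true = [0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.6, 0.3, 0.7, 0.4]


def threshold_metrics(threshold):
    """Recall and precision after cutting the scores at `threshold`."""
    y_pred = [int(s >= threshold) for s in y_score]
    return recall_score(y_true, y_pred), precision_score(y_true, y_pred)


# Threshold-free metrics: computed from the score ranking alone.
auc = roc_auc_score(y_true, y_score)
avg_precision = average_precision_score(y_true, y_score)

# Threshold-dependent metrics at two cut-offs.
recall_default, precision_default = threshold_metrics(0.50)
recall_low, precision_low = threshold_metrics(0.35)

print(f"AUC={auc:.3f}  average precision={avg_precision:.3f}")
print(f"threshold 0.50: recall={recall_default:.2f} precision={precision_default:.2f}")
print(f"threshold 0.35: recall={recall_low:.2f} precision={precision_low:.2f}")
```

Lowering the threshold catches the second fraud case without changing the AUC, so a model can lead on recall while trailing on AUC.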

RandomizedSearchCV — XGBoost (40 iterations, 5-fold)

▸CV AUC: 0.9847
▸Best parameters: subsample=0.7, max_depth=7, n_estimators=500, learning_rate=0.05
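A randomized search of this shape can be sketched as follows. The candidate grid is an assumption built around the reported best values, and scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost so the example runs without extra packages. It accepts the same four parameter names, and `XGBClassifier` could be dropped in where xgboost is installed. The demo call uses a reduced `n_iter` and fold count on toy data to keep it quick; the defaults mirror the 40 iterations and 5 folds stated above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Hypothetical search space; only the reported best values are from the source.
PARAM_DISTRIBUTIONS = {
    "subsample": [0.6, 0.7, 0.8, 1.0],
    "max_depth": [3, 5, 7],
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.01, 0.05, 0.1],
}


def run_search(X, y, n_iter=40, n_splits=5, seed=0):
    """Sample n_iter parameter sets and score each by stratified CV AUC."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    search = RandomizedSearchCV(
        GradientBoostingClassifier(random_state=seed),  # stand-in for XGBoost
        param_distributions=PARAM_DISTRIBUTIONS,
        n_iter=n_iter,
        scoring="roc_auc",
        cv=cv,
        random_state=seed,
    )
    return search.fit(X, y)


# Toy imbalanced data (about 6% positives), not the insurance dataset.
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.94, 0.06], random_state=0
)
search = run_search(X, y, n_iter=3, n_splits=3)  # reduced for a fast demo
print(search.best_params_)
print(f"best CV AUC: {search.best_score_:.4f}")
```

If oversampling is part of the workflow, it would need to run inside each training fold (for example through an imbalanced-learn pipeline) so that the validation folds keep the original class ratio.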

SHAP Feature Importance (XGBoost)

Feature	Contribution
Fault	37.9%
Deductible	12.9%
BasePolicy	12.2%
VehicleCategory	8.1%
PoliceReportFiled	6.4%
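The five features listed account for 77.5% of the total; the remainder is spread across the other features. One common way to obtain percentages like these, assumed here since the write-up does not state its normalization, is to take each feature's mean absolute SHAP value and divide by the sum over all features. The sketch below does this on a made-up SHAP matrix; `shap_contributions` is a hypothetical helper name.

```python
import numpy as np


def shap_contributions(shap_values, feature_names):
    """Return each feature's share of total mean |SHAP|, largest first."""
    mean_abs = np.abs(np.asarray(shap_values, dtype=float)).mean(axis=0)
    shares = mean_abs / mean_abs.sum()
    order = np.argsort(shares)[::-1]
    return [(feature_names[i], float(shares[i])) for i in order]


# Made-up SHAP matrix (3 claims x 3 features), for illustration only.
toy_shap = [
    [0.6, -0.2, 0.1],
    [-0.4, 0.1, -0.1],
    [0.5, -0.3, 0.1],
]
toy_names = ["Fault", "Deductible", "BasePolicy"]

ranking = shap_contributions(toy_shap, toy_names)
for name, share in ranking:
    print(f"{name}\t{share:.1%}")
```

With a real model, the matrix would come from a SHAP explainer run on the fitted XGBoost model rather than from hand-written numbers.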

5-Fold CV Stability

▸Random Forest: 0.8619 ± 0.0010
▸XGBoost: 0.8529 ± 0.0023
▸LightGBM: 0.8505 ± 0.0016
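A stability figure of the form mean ± standard deviation can be produced with stratified 5-fold cross-validation, as in the sketch below. The metric is assumed to be ROC AUC, since the write-up does not name it, and the data is synthetic, so the printed numbers will not match those reported.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (about 6% positives), not the insurance dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.94, 0.06], random_state=0
)

# Stratified folds keep the fraud share roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)

# One score per fold; the spread across folds is the stability measure.
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"RF: {scores.mean():.4f} ± {scores.std():.4f}")
```

A small standard deviation across folds indicates that the score does not depend heavily on which claims land in the validation fold.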
