Fraud Detection (Featured)
IEEE-CIS Fraud Detection
Full ML pipeline on 590K transactions with 433 features. LightGBM alone reaches AUC 0.9648; a stacking ensemble (LightGBM, XGBoost, CatBoost, Random Forest) with advanced feature engineering on Vesta behavioral features reaches 0.9565.
LightGBM AUC: 0.9648
Stacking AUC: 0.9565
Baseline (LR) AUC: 0.8506
Features (after FE): 459
Dataset
590,540 transactions, 433 features, 3.5% fraud rate
Approach
Stacking ensemble with StratifiedKFold cross-validation and behavioral feature engineering
Tech Stack
Python, LightGBM, XGBoost, CatBoost, Scikit-learn, Pandas, NumPy
Keywords
LightGBM, XGBoost, CatBoost, Stacking, Feature Engineering, StratifiedKFold
Visualizations (6 charts)
Deep Dive
Production-grade fraud detection on one of Kaggle's hardest tabular datasets: 590,540 transactions, 433 features, and a 3.5% fraud rate.
Dataset
- 590,540 transaction records joined with 144,233 identity records
- 433 features: Vesta-engineered V1–V339 plus card, email, device, and M-columns
- Fraud rate: 3.5%, which calls for careful stratified CV and threshold tuning
- 12 columns with more than 90% missing values were dropped
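The join and the missing-value filter described above can be sketched as follows. This is a minimal illustration on made-up data, not the project's code: it assumes a left join on `TransactionID` (so transactions without identity data are kept), and the helper name `merge_and_drop_sparse` and its `max_missing` parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def merge_and_drop_sparse(transactions, identity, max_missing=0.90):
    """Left-join identity onto transactions, then drop mostly-empty columns.

    Hypothetical helper: the 0.90 cut-off mirrors the ">90% missing" rule.
    """
    # A left join keeps every transaction, including those with no identity row.
    df = transactions.merge(identity, on="TransactionID", how="left")
    # Fraction of missing values per column, measured after the join.
    missing_frac = df.isna().mean()
    sparse_cols = missing_frac[missing_frac > max_missing].index.tolist()
    return df.drop(columns=sparse_cols), sparse_cols


# Tiny synthetic stand-in for the two tables (all values are invented).
transactions = pd.DataFrame({
    "TransactionID": range(1, 21),
    "TransactionAmt": np.linspace(10.0, 200.0, 20),
    "isFraud": [0] * 19 + [1],
})
identity = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4, 5, 6],
    "DeviceType": ["mobile", "desktop"] * 3,
    "id_07": [np.nan] * 5 + [1.0],
})

merged, dropped = merge_and_drop_sparse(transactions, identity)
print(merged.shape, dropped)
```

Measuring missingness after the join matters here: identity columns are empty for every transaction that has no identity record, so their missing fraction is higher in the merged table than in the identity table alone.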
Feature Engineering
| Group | Features |
|---|---|
| Time | Hour of day, day of week, TransactionDT periodic cycles |
| Card behavioral | Mean/std/count of TransactionAmt per card1–card6 group |
| Email match | P_emaildomain == R_emaildomain flag |
| M-columns | Aggregate T/F/missing counts across M1–M9 |
| Amount | log(TransactionAmt), cents component, round-amount flag |
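One way the feature groups in the table could be built with pandas is sketched below. It is an illustration under stated assumptions rather than the project's implementation: `TransactionDT` is treated as a count of seconds from an unknown reference point, only `card1` and `card2` are aggregated to keep the example short, `log1p` stands in for the logarithm, and all output column names (`hour`, `dow`, `amt_mean_card1`, and so on) are hypothetical.

```python
import numpy as np
import pandas as pd


def add_features(df, card_cols=("card1", "card2")):
    """Add time, card, email, M-column and amount features (illustrative)."""
    out = df.copy()

    # Time: assumes TransactionDT is in seconds from some reference point.
    out["hour"] = (out["TransactionDT"] // 3600) % 24
    out["dow"] = (out["TransactionDT"] // 86400) % 7
    # Periodic encoding so that hour 23 and hour 0 end up close together.
    out["hour_sin"] = np.sin(2 * np.pi * out["hour"] / 24)
    out["hour_cos"] = np.cos(2 * np.pi * out["hour"] / 24)

    # Card behavioral: per-group statistics of TransactionAmt.
    for col in card_cols:
        grp = out.groupby(col)["TransactionAmt"]
        out[f"amt_mean_{col}"] = grp.transform("mean")
        out[f"amt_std_{col}"] = grp.transform("std")
        out[f"amt_count_{col}"] = grp.transform("count")

    # Email match: a missing domain on either side counts as no match.
    same = out["P_emaildomain"] == out["R_emaildomain"]
    out["email_match"] = same.fillna(False).astype(int)

    # M-columns: count T, F and missing; other values are not counted.
    m_cols = [f"M{i}" for i in range(1, 10) if f"M{i}" in out.columns]
    out["m_true"] = (out[m_cols] == "T").sum(axis=1)
    out["m_false"] = (out[m_cols] == "F").sum(axis=1)
    out["m_missing"] = out[m_cols].isna().sum(axis=1)

    # Amount: assumes TransactionAmt has no missing values.
    out["amt_log"] = np.log1p(out["TransactionAmt"])
    out["amt_cents"] = (out["TransactionAmt"] * 100).round().astype(int) % 100
    out["amt_is_round"] = (out["amt_cents"] == 0).astype(int)
    return out


# Invented rows, only to show the function running end to end.
sample = pd.DataFrame({
    "TransactionDT": [86400, 90000, 180000, 200000],
    "TransactionAmt": [100.0, 49.95, 20.0, 59.0],
    "card1": [1111, 1111, 2222, 2222],
    "card2": [321, 321, 321, 555],
    "P_emaildomain": ["gmail.com", "gmail.com", None, "yahoo.com"],
    "R_emaildomain": ["gmail.com", "hotmail.com", None, None],
    "M1": ["T", "F", None, "T"],
    "M2": ["T", None, None, "F"],
})
feats = add_features(sample)
print(feats[["hour", "dow", "email_match", "m_missing", "amt_cents"]])
```

Group statistics computed on the full table use rows from every fold, so in a cross-validated setup they are often recomputed inside each training fold to avoid leakage; whether the project does this is not stated.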
Model Results — 2-Fold Stratified CV
| Model | OOF AUC |
|---|---|
| Logistic Regression | 0.8506 |
| Decision Tree | 0.8583 |
| Random Forest | 0.9032 |
| CatBoost | 0.9529 |
| XGBoost | 0.9631 |
| LightGBM | 0.9648 |
| Weighted Blend | 0.9478 |
| Stacking (LR meta) | 0.9565 |
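The out-of-fold (OOF) scores, the weighted blend, and the stacked model in the table follow a common pattern, sketched below with scikit-learn only. Everything here is an assumption made for illustration: the data is synthetic, `LogisticRegression`, `RandomForestClassifier` and `GradientBoostingClassifier` stand in for the project's LightGBM, XGBoost, CatBoost and Random Forest models, and the blend weights are arbitrary. The AUCs it prints will not match the table.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic imbalanced data as a stand-in for the real feature matrix.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.95, 0.05], random_state=0,
)

# Stand-in base models (the project's actual models are not used here).
base_models = {
    "lr": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    "gb": GradientBoostingClassifier(random_state=0),
}

# 2-fold stratified CV keeps the fraud rate roughly equal in both folds.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

# OOF probabilities: every row is scored by a model that never trained on it.
oof = {
    name: cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    for name, model in base_models.items()
}
for name, pred in oof.items():
    print(f"{name:>8}: OOF AUC = {roc_auc_score(y, pred):.4f}")

# Weighted blend with hypothetical fixed weights that sum to 1.
weights = {"lr": 0.2, "rf": 0.3, "gb": 0.5}
blend = sum(w * oof[name] for name, w in weights.items())
print(f"   blend: OOF AUC = {roc_auc_score(y, blend):.4f}")

# Stacking: a logistic-regression meta-learner trained on the OOF columns.
# It is itself scored out of fold so its AUC is not inflated.
meta_X = np.column_stack([oof[name] for name in base_models])
meta_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)
stack = cross_val_predict(
    LogisticRegression(), meta_X, y, cv=meta_cv, method="predict_proba"
)[:, 1]
print(f"   stack: OOF AUC = {roc_auc_score(y, stack):.4f}")
```

Because AUC depends only on ranking, base models with different probability scales can distort a raw weighted average; averaging ranks instead of probabilities is a common alternative.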
Key Insights
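- LightGBM edges out XGBoost (0.9648 vs 0.9631 OOF AUC); its native missing-value handling on V-columns with 40%+ missing is the likely factor, though XGBoost also handles missing values natively and the gap is small
- Card-level behavioral aggregations (mean/std of TransactionAmt per card group) are the highest-impact feature group
- Email domain matching (P vs R) boosts recall on cross-domain transactions
- The stacking meta-learner (0.9565) does not surpass LightGBM alone (0.9648), most likely because the base models' predictions are too correlated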
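A quick way to check the "too correlated" explanation is to look at the rank correlation between the base models' OOF predictions. The sketch below uses simulated scores, not the project's outputs: three columns share almost all of their signal and one is noisier, and the column names are only labels chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

# Simulated OOF scores: a shared signal plus model-specific noise.
shared = rng.normal(size=n)
oof = pd.DataFrame({
    "lightgbm": shared + 0.1 * rng.normal(size=n),
    "xgboost": shared + 0.1 * rng.normal(size=n),
    "catboost": shared + 0.1 * rng.normal(size=n),
    "random_forest": shared + 0.5 * rng.normal(size=n),
})

# Pearson correlation of ranks is the Spearman rank correlation.
corr = oof.rank().corr()
print(corr.round(3))
```

When off-diagonal values sit close to 1, the models make nearly the same ranking mistakes, so a meta-learner has little complementary signal to combine. Adding a structurally different model or a different feature view is the usual way to lower these correlations.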