The Problem
The IEEE-CIS Fraud Detection challenge presents 590,540 training transactions with 433 features and a fraud rate of only 3.5%, so the classes are heavily imbalanced.
Key Feature Engineering
- Time-based features: Hour of day, day of week, temporal drift
- Card aggregations: Mean/std/count of TransactionAmt per card1/card2
- Email domain features: same_email_domain flag, domain-level fraud rates
- M-column boolean counts: T/F/missing across M1-M9
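A minimal pandas sketch of how these feature groups might be built. This is not the original pipeline. The column names (`TransactionDT`, `TransactionAmt`, `card1`, `card2`, `P_emaildomain`, `R_emaildomain`, `M1`-`M9`) come from the competition data, while the function name and the output column names are made up for illustration. Domain-level fraud rates are left out because they are target encodings and would need to be computed out-of-fold to avoid leakage.

```python
import pandas as pd

M_COLS = [f"M{i}" for i in range(1, 10)]


def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with time, card, email, and M-count features."""
    out = df.copy()

    # Time: TransactionDT is an offset in seconds from an undisclosed origin,
    # so "dow" is only a relative 0-6 index, not a named weekday.
    out["hour"] = (out["TransactionDT"] // 3600) % 24
    out["dow"] = (out["TransactionDT"] // 86400) % 7

    # Card aggregations: mean/std/count of TransactionAmt per card key.
    for key in ("card1", "card2"):
        grouped = out.groupby(key)["TransactionAmt"]
        for stat in ("mean", "std", "count"):
            out[f"amt_{stat}_{key}"] = grouped.transform(stat)

    # Email: 1 only when both domains are present and identical.
    out["same_email_domain"] = (
        out["P_emaildomain"].notna()
        & out["R_emaildomain"].notna()
        & (out["P_emaildomain"] == out["R_emaildomain"])
    ).astype(int)

    # M columns: per-row counts of "T", "F", and missing values.
    # Note: in the Kaggle data M4 holds M0/M1/M2 rather than T/F,
    # so here it can only contribute to the missing count.
    m = out[M_COLS]
    out["M_true_cnt"] = (m == "T").sum(axis=1)
    out["M_false_cnt"] = (m == "F").sum(axis=1)
    out["M_nan_cnt"] = m.isna().sum(axis=1)
    return out
```

Computing the aggregations on train and test concatenated is a common choice in this competition, but whether the original solution did so is not stated here.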
Model Pipeline
| Model | OOF AUC |
|---|---|
| LightGBM | 0.9648 |
| XGBoost | 0.9631 |
| CatBoost | 0.9529 |
Key Insights
- Don't drop the V-columns (V1-V339): they carry Vesta's proprietary fraud signals
- Time-based CV is more realistic than StratifiedKFold, because the test transactions come later in time than the training transactions
- Card-level aggregations are the highest-impact feature group
- LightGBM's native missing-value handling likely contributes to its edge (0.9648 OOF AUC vs. 0.9631 for XGBoost)
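One way to set up the time-based CV mentioned above is a forward-chaining split on `TransactionDT`, where each validation block comes strictly after its training rows in sorted order. The sketch below is an assumption about how such a split could look, not the scheme actually used here, and `time_based_folds` is a hypothetical helper name.

```python
import numpy as np


def time_based_folds(transaction_dt, n_splits=5):
    """Yield (train_idx, valid_idx) pairs; validation always follows training."""
    # Row positions sorted by time; a stable sort keeps ties in input order.
    order = np.argsort(np.asarray(transaction_dt), kind="stable")
    # n_splits + 1 contiguous time blocks; block 0 is only ever trained on.
    blocks = np.array_split(order, n_splits + 1)
    for k in range(1, n_splits + 1):
        train_idx = np.concatenate(blocks[:k])
        yield train_idx, blocks[k]
```

A quick usage example: with six transactions and two splits, the first fold trains on the two earliest rows and validates on the next two.

```python
dt = np.array([50, 10, 40, 20, 30, 60])
for train_idx, valid_idx in time_based_folds(dt, n_splits=2):
    print(dt[train_idx], dt[valid_idx])
```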