Decision Tree
- Time series data? → TimeSeriesSplit (never shuffle time!)
- Groups that must stay together? → GroupKFold
- Imbalanced classes? → StratifiedKFold
- Default tabular classification → StratifiedKFold with 5 folds (for regression targets use plain KFold, since stratification needs class labels)
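The checklist above can be sketched as a small helper that checks the questions in the same order. This is only an illustration: `choose_splitter` and its flag names are hypothetical, the shuffle and seed settings are arbitrary choices, and the KFold fallback for regression targets is an assumption of this sketch.

```python
from sklearn.model_selection import GroupKFold, KFold, StratifiedKFold, TimeSeriesSplit


def choose_splitter(is_time_series=False, has_groups=False,
                    is_classification=True, n_splits=5):
    """Hypothetical helper: return a splitter following the checklist order."""
    if is_time_series:
        # Time-ordered data: validation rows always come after training rows.
        return TimeSeriesSplit(n_splits=n_splits)
    if has_groups:
        # Rows that share an entity must stay on the same side of the split.
        return GroupKFold(n_splits=n_splits)
    if is_classification:
        # Keeps class proportions similar in every fold.
        return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    # Assumption: regression targets have no classes to stratify on.
    return KFold(n_splits=n_splits, shuffle=True, random_state=0)


cv = choose_splitter(has_groups=True)
print(type(cv).__name__)
```

Because the questions are checked top to bottom, a dataset that is both time-ordered and grouped gets TimeSeriesSplit here; whether that priority suits a given problem is a judgment call.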
TimeSeriesSplit Example
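from sklearn.model_selection import TimeSeriesSplit

# X is assumed to be a pandas DataFrame sorted in chronological order
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    # Each fold trains on earlier rows and validates on the rows that follow
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]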
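To see why this split is safe for time-ordered data, here is a minimal self-contained check on a made-up 12-row frame (the column name `feature` and the fold count are arbitrary choices for illustration). It confirms that in every fold the training rows end before the validation rows begin, and that the training window grows from fold to fold.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Toy frame: 12 rows assumed to be already in chronological order.
X = pd.DataFrame({"feature": np.arange(12)})

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))

for train_idx, val_idx in folds:
    # Every training row precedes every validation row, so nothing from the future leaks in.
    assert train_idx.max() < val_idx.min()
    print(f"train rows {train_idx.min()}..{train_idx.max()}, "
          f"val rows {val_idx.min()}..{val_idx.max()}")
```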
GroupKFold: Preventing Leakage
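If your rows carry user IDs, patient IDs, or store IDs, use GroupKFold so the same entity never appears in both train and validation. Otherwise the model can memorize entity-specific patterns, and the validation score will overstate how well it handles entities it has never seen.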
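A minimal sketch of that idea, using made-up data: the arrays `X`, `y`, and `user_ids` are invented for illustration, with eight rows spread over four users. The loop checks that no user ends up on both sides of any split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Made-up data: 8 rows belonging to 4 users, two rows each.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
user_ids = np.array(["u1", "u1", "u2", "u2", "u3", "u3", "u4", "u4"])

gkf = GroupKFold(n_splits=4)
splits = list(gkf.split(X, y, groups=user_ids))

for train_idx, val_idx in splits:
    train_users = set(user_ids[train_idx])
    val_users = set(user_ids[val_idx])
    # The same user never appears in both train and validation.
    assert train_users.isdisjoint(val_users)
    print(f"val users: {sorted(val_users)}")
```

Note that `n_splits` cannot exceed the number of distinct groups, so very small numbers of entities limit how many folds are available.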