The Data Leakage Problem
Naive target encoding leaks future information: when you encode a category with its mean target, you use information from the same row to encode itself.
CatBoost's Solution: Ordered Target Encoding
CatBoost processes rows in a random order. When encoding row i, it only uses statistics computed from rows 0..i-1. This prevents leakage by construction.
Implementation
from catboost import CatBoostClassifier
model = CatBoostClassifier(
cat_features=['city', 'device_type', 'email_domain'],
iterations=1000,
learning_rate=0.05,
depth=6,
)
model.fit(X_train, y_train)
No manual encoding needed — pass raw strings directly.