Feature Engineering & Pipelines
“Garbage in, garbage out — the art of turning raw data into model-ready signals”
The full preprocessing pipeline: imputation (Simple, MICE), categorical encoding (OHE, Target, Ordinal), scaling (Standard, MinMax, Robust), feature creation (polynomial, interactions, log transforms), and sklearn Pipelines for leakage-free evaluation.
Key Formulas
StandardScaler
Zero mean, unit variance — sensitive to outliers
MinMaxScaler
Scales to [0,1]; sensitive to outliers, and preserves zero entries only when the feature minimum is 0
RobustScaler
Scales using median and IQR — robust to outliers
Log Transform
Compresses skewed distributions — useful for income, population counts
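For reference, the textbook definitions behind these four transforms can be written as below. The symbols are notation assumed here: \mu and \sigma are a feature's training-set mean and standard deviation, x_min and x_max its training-set extremes, and Q_1 and Q_3 its first and third quartiles. The log transform is shown in its common log(1 + x) form, which is one choice among several and only applies to non-negative values.

```latex
\begin{aligned}
\text{StandardScaler:}\quad & z = \frac{x - \mu}{\sigma} \\
\text{MinMaxScaler:}\quad & x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \\
\text{RobustScaler:}\quad & x' = \frac{x - \operatorname{median}(x)}{Q_3 - Q_1} \\
\text{Log transform:}\quad & x' = \log(1 + x), \qquad x \ge 0
\end{aligned}
```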
Why Feature Engineering Wins Competitions
Andrew Ng famously said, 'Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering.' In Kaggle competitions on tabular data, top-ranked solutions often owe more to better feature engineering than to more elaborate model architectures. On many tabular problems, a linear model with well-designed features can beat a deep network fed raw features. Features encode human knowledge about the problem domain: they are the bridge between raw measurements and the mathematical structure a model can exploit.
In the Netflix Prize ($1M), the winning team's features included complex temporal patterns, implicit user feedback signals, and movie metadata interactions, so the win did not come from model sophistication alone.
The Pipeline Mindset
Think of feature engineering as a sequence of transformations: Raw Data → Imputation (fill missing values) → Encoding (convert categoricals to numbers) → Scaling (put features on comparable scales) → Feature creation (derive new signals from existing columns) → Selection (drop noisy/redundant features). Each step must be fit on training data only and applied consistently to test data, so use scikit-learn Pipelines to guarantee this. A Pipeline is also serializable, so your preprocessing is always bundled with your model for deployment.
Data leakage is the most dangerous bug in ML: if your test data influences any preprocessing step, your evaluation is optimistic garbage. Pipelines prevent this by design.
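A small experiment can make the leakage risk concrete. The sketch below is an illustration on synthetic noise, not part of the lesson's own example: the data, the choice of SelectKBest with k=20, and the LogisticRegression classifier are all assumptions made for the demonstration. Because the features carry no signal, an honest evaluation should land near chance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Pure noise: no feature has any real relationship with the labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))
y = rng.integers(0, 2, size=100)

# Leaky: the selector is fitted on every row, including rows that
# later serve as validation folds.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=5
).mean()

# Leak-free: the selector is re-fitted on the training part of each fold.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest_acc = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky_acc:.2f}")
print(f"honest CV accuracy: {honest_acc:.2f}")
```

The leaky score looks impressive only because the selector already saw the validation labels; the pipelined score is the one that would hold up on new data.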
The 5-Stage Pipeline
Imputation: SimpleImputer (mean/median/mode/constant) or IterativeImputer (MICE multivariate)
Encoding: OrdinalEncoder for ordered categories, OneHotEncoder for nominal (use drop='first' to avoid the dummy variable trap in unregularized linear models)
Scaling: StandardScaler for Gaussian-ish data, RobustScaler when outliers exist, MinMaxScaler for bounded inputs
Feature creation: PolynomialFeatures (x², x·y interactions), date decomposition (day/month/weekday), domain transforms (log, sqrt)
Selection: VarianceThreshold, SelectKBest (mutual info / chi²), SelectFromModel (tree importances), RFECV
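Two of these stages, multivariate imputation and feature creation, can be sketched on a toy frame. The column names and values below are hypothetical. Note that IterativeImputer is still flagged experimental in scikit-learn and needs the explicit enabling import. For brevity the imputer is fitted on the whole toy frame here; in real use it belongs inside a Pipeline so that it only learns from training rows.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data with gaps in 'age' and a skewed 'income'.
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0, 46.0, np.nan],
    "income": [28_000.0, 41_000.0, 39_000.0, 250_000.0, 72_000.0, 55_000.0],
    "signup": pd.to_datetime([
        "2023-01-15", "2023-02-03", "2023-02-20",
        "2023-03-11", "2023-04-02", "2023-04-28",
    ]),
})

# Stage 1: MICE-style imputation. Each column with missing values is
# modelled as a function of the other columns.
imputer = IterativeImputer(random_state=0)
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])

# Stage 4a: a log transform compresses the long right tail of income.
df["log_income"] = np.log1p(df["income"])

# Stage 4b: date decomposition turns one timestamp into several features.
df["signup_day"] = df["signup"].dt.day
df["signup_month"] = df["signup"].dt.month
df["signup_weekday"] = df["signup"].dt.dayofweek  # Monday=0 ... Sunday=6

print(df.drop(columns="signup").round(2))
```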
Categorical Encoding Strategies
One-Hot Encoding creates a binary column per category, which works well for unordered categories with few values. With high-cardinality categoricals (cities, zip codes, product IDs), OHE explodes dimensionality. Use Target Encoding instead: replace each category with the mean target value of that category. But target encoding leaks target information unless each row's encoding is computed from other rows, for example out-of-fold within cross-validation. CatBoost's ordered target encoding addresses this by encoding each sample using only the samples that precede it in a random permutation of the data. For ordinal features (Low/Medium/High), use OrdinalEncoder with an explicit category order.
High cardinality + OHE = disaster. 10,000 zip codes → 10,000 columns, most nearly empty. Use target encoding, embedding layers, or feature hashing instead.
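One way to do target encoding without leakage is to compute it out-of-fold: each row is encoded from statistics of the other folds only. The helper below, `oof_target_encode`, is a hypothetical function written for this sketch, and its smoothing scheme (shrinking small categories toward the fold's overall mean) is an assumed design choice, not something prescribed above. The second half shows OrdinalEncoder with an explicit category order.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.preprocessing import OrdinalEncoder


def oof_target_encode(categories, target, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold target encoding (hypothetical helper for illustration)."""
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kfold.split(categories):
        fit_cat = categories.iloc[fit_idx].to_numpy()
        fit_y = target.iloc[fit_idx]
        prior = fit_y.mean()  # overall mean of the fitting folds only
        stats = fit_y.groupby(fit_cat).agg(["sum", "count"])
        # Shrink rare categories toward the prior to limit noise.
        smoothed = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Categories unseen in the fitting folds fall back to the prior.
        fold_values = categories.iloc[enc_idx].map(smoothed).fillna(prior)
        encoded.iloc[enc_idx] = fold_values.to_numpy()
    return encoded


# Synthetic high-cardinality column: 50 pseudo zip codes, binary target.
rng = np.random.default_rng(0)
zip_codes = pd.Series(rng.integers(0, 50, size=500).astype(str))
y = pd.Series(rng.integers(0, 2, size=500))
zip_te = oof_target_encode(zip_codes, y)
print(zip_te.describe())

# Ordinal feature: state the order explicitly instead of relying on
# alphabetical sorting (which would put 'High' before 'Low').
sizes = pd.DataFrame({"size": ["Low", "High", "Medium", "Low"]})
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
size_codes = ordinal.fit_transform(sizes).ravel()
print(size_codes)  # [0. 2. 1. 0.]
```

This produces one numeric column regardless of how many distinct zip codes exist, compared with one column per category under OHE.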
Scaling Choices and Their Effects
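StandardScaler: rescales each feature to mean=0 and std=1. It does not require Gaussian data, but it works best on roughly symmetric features, and outliers distort its mean and standard deviation. Strongly recommended for SVMs, regularized linear models (Lasso/Ridge), PCA, KNN, and neural networks, all of which are sensitive to feature scale. Not needed for tree-based models (Random Forest, XGBoost), because trees split on the ordering of feature values, not their magnitude. MinMaxScaler: use when the algorithm expects bounded inputs (for example, [0,1] features for neural networks with sigmoid activations). RobustScaler: use when outliers are present; it centers on the median and scales by the IQR, so extreme values have little effect on the fitted statistics.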
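The effect of an outlier on each scaler is easy to see on a single column. The six values below are made up for illustration; the `results` dictionary is just a convenience for inspecting the outputs.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature: five ordinary values and one extreme outlier (hypothetical).
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

results = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    name = type(scaler).__name__
    results[name] = scaler.fit_transform(x).ravel()
    print(f"{name:>14}: {np.round(results[name], 3)}")

# Spread of the five ordinary values after scaling.
for name, scaled in results.items():
    inliers = scaled[:5]
    print(f"{name:>14} inlier range: {inliers.max() - inliers.min():.3f}")
```

With StandardScaler and MinMaxScaler the single outlier dominates the fitted statistics, so the five ordinary values are squeezed into a very narrow band. RobustScaler keeps them spread out because the median and IQR barely move.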
sklearn Pipeline — Full Example
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
import pandas as pd
import numpy as np

# ── Sample DataFrame ───────────────────────────────────────────────────
np.random.seed(42)
n = 300
df = pd.DataFrame({
    'age': np.random.randint(18, 70, n).astype(float),
    'income': np.random.exponential(40000, n),
    'score': np.random.uniform(300, 850, n),
    'city': np.random.choice(['Paris', 'Lyon', 'Toulouse'], n),
    'occupation': np.random.choice(['engineer', 'teacher', 'doctor'], n),
    'target': np.random.randint(0, 2, n),
})

# Add some missing values
df.loc[np.random.choice(n, 20, replace=False), 'age'] = np.nan
df.loc[np.random.choice(n, 15, replace=False), 'city'] = np.nan

# Hold out a test set so the pipeline only ever learns from training rows
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# ── Define column groups ───────────────────────────────────────────
num_features = ['age', 'income', 'score']
cat_features = ['city', 'occupation']

# ── Preprocessing for numeric columns ─────────────────────────────
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', RobustScaler()),
])

# ── Preprocessing for categorical columns ─────────────────────────
# Note: combining handle_unknown='ignore' with drop needs a recent
# scikit-learn release; older versions may reject this combination.
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', drop='first')),
])

# ── Combine with ColumnTransformer ─────────────────────────────────
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, num_features),
    ('cat', categorical_transformer, cat_features),
])

# ── Full pipeline: preprocess → feature create → select → model ───
pipe = Pipeline([
    ('prep', preprocessor),
    ('poly', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
    ('select', SelectFromModel(RandomForestClassifier(n_estimators=50), threshold='median')),
    ('clf', GradientBoostingClassifier(n_estimators=200, learning_rate=0.05)),
])

# Cross-validate on the training set: every step is re-fitted inside each fold
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='roc_auc')
print(f"CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")

# Fit on the full training set, then score the untouched test set
pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")
```
Preprocessing Pitfalls
First: fitting scalers on the full dataset (before splitting) is data leakage, because test statistics contaminate training. Always fit inside a Pipeline or on X_train only. Second: OneHotEncoder may see unseen categories in test data, so use handle_unknown='ignore'. Third: imputing with the mean before splitting leaks the test mean into training. Fourth: polynomial features explode memory. With 100 features, degree=2 yields 5,150 columns (5,151 with the bias column); interaction_only=True drops the squared terms and leaves 5,050, so combine it with feature selection downstream. Fifth: target encoding without cross-validation leaks target information.
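The column counts for polynomial expansion can be checked directly rather than computed by hand. This is a quick sketch using PolynomialFeatures on a dummy array; only the number of input columns matters, so the array of zeros is just a placeholder.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Placeholder input: one row, 100 features. Values are irrelevant here.
X = np.zeros((1, 100))

# Full degree-2 expansion: 100 linear + 100 squared + 4,950 pairwise terms.
full = PolynomialFeatures(degree=2, include_bias=False).fit(X)

# Interactions only: 100 linear + 4,950 pairwise terms, no squares.
inter = PolynomialFeatures(
    degree=2, interaction_only=True, include_bias=False
).fit(X)

print(full.n_output_features_)   # 5150
print(inter.n_output_features_)  # 5050
```

Setting include_bias=True (the default) adds one constant column to either count.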
The Pipeline object in scikit-learn is not just convenient — it is required for correct cross-validation. Any preprocessing that 'learns' from data (scalers, encoders, imputers) must be inside the pipeline.