ML Learning Hub
Applied ML · beginner

Feature Engineering & Pipelines

Garbage in, garbage out — the art of turning raw data into model-ready signals

The full preprocessing pipeline: imputation (Simple, MICE), categorical encoding (OHE, Target, Ordinal), scaling (Standard, MinMax, Robust), feature creation (polynomial, interactions, log transforms), and sklearn Pipelines for leakage-free evaluation.


Prerequisites

Linear Regression
Model Evaluation

Concepts Covered

Imputation, OneHotEncoder, StandardScaler, RobustScaler, PolynomialFeatures, ColumnTransformer, Data Leakage

Key Formulas

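StandardScaler

$z = (x - \mu) / \sigma$. Zero mean, unit variance; sensitive to outliers

MinMaxScaler

$x' = (x - x_{\min}) / (x_{\max} - x_{\min})$. Scales to [0, 1]; sensitive to outliers. Zeros stay zero only when the feature minimum is 0 (MaxAbsScaler is the sparsity-preserving choice)

RobustScaler

$x' = (x - \mathrm{median}(x)) / \mathrm{IQR}(x)$. Scales using median and IQR; robust to outliers

Log Transform

$x' = \log(x)$ for $x > 0$ (or $\log(1 + x)$ when zeros occur). Compresses right-skewed distributions; useful for income, population counts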
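
The four transforms above can be checked numerically. The sketch below is a minimal illustration on a made-up column: it writes each scaler's textbook definition by hand and compares it with scikit-learn's default behaviour. The toy values and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical toy column with one large value
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# StandardScaler: (x - mean) / std, using the population std (ddof=0)
z_manual = (x - x.mean()) / x.std()
z_sklearn = StandardScaler().fit_transform(x)

# MinMaxScaler: (x - min) / (max - min), mapped to [0, 1]
mm_manual = (x - x.min()) / (x.max() - x.min())
mm_sklearn = MinMaxScaler().fit_transform(x)

# RobustScaler (default quantile_range): (x - median) / (75th - 25th percentile)
q25, q75 = np.percentile(x, [25, 75])
rb_manual = (x - np.median(x)) / (q75 - q25)
rb_sklearn = RobustScaler().fit_transform(x)

# Log transform: log1p(x) = log(1 + x), which is defined at x = 0
log_x = np.log1p(x)

print(np.allclose(z_manual, z_sklearn))
print(np.allclose(mm_manual, mm_sklearn))
print(np.allclose(rb_manual, rb_sklearn))
```

Each comparison should agree, which is a quick way to confirm what a scaler actually computes before relying on it.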

🎯

Why Feature Engineering Wins Competitions

motivation

Andrew Ng famously said, 'Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering.' In Kaggle competitions on tabular data, top-ranked solutions tend to stand out for their feature engineering more than for their model architecture. On many tabular problems, a linear model with well-designed features can match or beat a deep network fed raw features. Features encode human knowledge about the problem domain: they are the bridge between raw measurements and the mathematical structure a model can exploit.

In the Netflix Prize ($1M), the winning team's solution drew on complex temporal patterns, implicit user feedback signals, and movie metadata interactions, combined in a large blend of models rather than a single sophisticated one.

💡

The Pipeline Mindset

intuition

Think of feature engineering as a sequence of transformations: Raw Data → Imputation (fill missing values) → Encoding (convert categoricals to numbers) → Scaling (put features on comparable scales) → Feature creation (derive new signals) → Selection (drop noisy or redundant features). Each step must be fit on training data only and then applied unchanged to test data; scikit-learn Pipelines enforce this when the whole pipeline is fit on the training split. A Pipeline is also serializable, so the preprocessing ships bundled with the model at deployment.

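Data leakage is one of the most dangerous bugs in ML: if your test data influences any preprocessing step, your evaluation is optimistic garbage. Pipelines prevent this by design, provided every fitted step lives inside the pipeline and the pipeline itself is what gets cross-validated.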
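
To make the leakage risk concrete, here is a minimal sketch on pure noise. It uses supervised feature selection as the leaking step (an illustrative choice, since the effect is large and easy to see); the data, sizes, and variable names are all assumptions. Because the labels are unrelated to the features, an honest evaluation should sit near chance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))   # pure noise features
y = rng.integers(0, 2, size=100)   # labels unrelated to X

# Leaky: the selector is fit on every row, including rows later used for validation
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Honest: the selector is refit inside each training fold
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky:.2f}")
print(f"honest CV accuracy: {honest:.2f}")
```

The leaky score looks like real signal even though none exists; the only difference between the two runs is where the selector was fit.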

⚙️

The 5-Stage Pipeline

algorithm
1. Imputation: SimpleImputer (mean/median/mode/constant) or IterativeImputer (MICE-style multivariate; still experimental in scikit-learn, so it needs the sklearn.experimental enable_iterative_imputer import)
2. Encoding: OrdinalEncoder for ordered categories, OneHotEncoder for nominal ones (drop='first' avoids the dummy-variable trap, which matters mainly for unregularized linear models)
3. Scaling: StandardScaler for Gaussian-ish data, RobustScaler when outliers exist, MinMaxScaler for bounded inputs
4. Feature creation: PolynomialFeatures (x², x·y interactions), date decomposition (day/month/weekday), domain transforms (log, sqrt)
5. Selection: VarianceThreshold, SelectKBest (mutual info / chi²), SelectFromModel (tree importances), RFECV
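
Stage 4 is the least mechanical of the five, so a small sketch may help. The example below shows date decomposition with pandas and a log transform wrapped in FunctionTransformer so it can sit inside a Pipeline. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical raw data: a timestamp and a right-skewed amount
df = pd.DataFrame({
    "signup": pd.to_datetime(["2024-01-15", "2024-03-02", "2024-12-25"]),
    "income": [0.0, 40_000.0, 2_500_000.0],
})

# Date decomposition: expose calendar parts a model cannot read off a raw timestamp
df["signup_day"] = df["signup"].dt.day
df["signup_month"] = df["signup"].dt.month
df["signup_weekday"] = df["signup"].dt.weekday  # Monday = 0

# Log transform as a pipeline-ready step; log1p handles zeros
log_step = FunctionTransformer(np.log1p)
df["log_income"] = np.asarray(log_step.fit_transform(df[["income"]])).ravel()

print(df)
```

Wrapping the transform in FunctionTransformer rather than calling np.log1p directly keeps it reusable as a named pipeline step.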

🔬

Categorical Encoding Strategies

deepdive

One-Hot Encoding creates a binary column per category, which suits unordered categories with few distinct values. With high-cardinality categoricals (cities, zip codes, product IDs), OHE explodes dimensionality. Target Encoding is an alternative: replace each category with the mean target value observed for that category. Computed naively, it leaks the target, because each row's own label contributes to its encoding; compute the means out-of-fold so a row is encoded only from other rows. CatBoost's ordered target statistics address the same problem by encoding each row from the rows that precede it in a random permutation. For ordinal features (Low/Medium/High), use OrdinalEncoder with an explicit category order; the default order is alphabetical, which would give High < Low < Medium.
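
One way to apply target encoding with cross-validation folds is sketched below. The helper oof_target_encode is hypothetical (it is not a scikit-learn function), and falling back to the fold's training mean for unseen categories is an assumption; smoothing toward the global mean is a common refinement this sketch leaves out.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(cat, y, n_splits=5, seed=0):
    """Out-of-fold mean-target encoding for one categorical column (hypothetical helper)."""
    cat = pd.Series(cat).reset_index(drop=True)
    y = pd.Series(y, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        # Category means come only from rows outside the fold being encoded
        means = y.iloc[fit_idx].groupby(cat.iloc[fit_idx]).mean()
        # Categories unseen in the fitting rows fall back to the fitting rows' mean
        fold_enc = cat.iloc[enc_idx].map(means).fillna(y.iloc[fit_idx].mean())
        encoded.iloc[enc_idx] = fold_enc.to_numpy()
    return encoded

city = ["Paris", "Lyon", "Paris", "Toulouse", "Lyon",
        "Paris", "Lyon", "Toulouse", "Paris", "Lyon"]
target = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
enc = oof_target_encode(city, target)
print(enc.round(2).tolist())
```

No row's own label enters its encoding, so a column of unique IDs cannot simply memorize the target. At prediction time the encoder would instead use category means from the full training set.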

High cardinality + OHE = disaster. 10,000 zip codes → 10,000 columns, most nearly empty. Use target encoding, embedding layers, or feature hashing instead.
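
Of the alternatives just listed, feature hashing is the simplest to sketch. The example below uses scikit-learn's FeatureHasher to map zip codes into a fixed number of columns regardless of how many distinct codes exist. The zip codes, the "zip=" token prefix, and n_features=32 are illustrative assumptions; a small width increases hash collisions.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical zip codes; real data could have tens of thousands of distinct values
zip_codes = ["75001", "69002", "31000", "75001", "13008"]

# Each sample is a list of string tokens; the prefix keeps fields apart if several are hashed
samples = [[f"zip={z}"] for z in zip_codes]

hasher = FeatureHasher(n_features=32, input_type="string")
X_hashed = hasher.transform(samples)  # scipy sparse matrix with a fixed width

print(X_hashed.shape)  # (5, 32)
```

The hasher is stateless, so it never needs fitting and cannot fail on categories it has not seen. The trade-off is that distinct categories can collide in the same column.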

Scaling Choices and Their Effects

math

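StandardScaler: shifts each feature to mean 0 and scales it to standard deviation 1. It does not require Gaussian data, but it works best on roughly symmetric distributions because the mean and standard deviation are both pulled by outliers. Scaling matters for SVMs, regularized linear models (Lasso/Ridge), PCA, KNN, and neural networks, all of which are sensitive to feature magnitude. Tree-based models (Random Forest, XGBoost) do not need it: splits depend only on the ordering of feature values, not their magnitude. MinMaxScaler: use when an algorithm expects bounded inputs, such as [0, 1] features for a neural network. RobustScaler: use when outliers are present; it centers on the median and scales by the IQR, so extreme values barely affect the fitted statistics.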

RobustScaler: $x' = (x - \mathrm{median}(x)) / \mathrm{IQR}(x)$, i.e. median centering with IQR scaling
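
The practical difference between the three scalers shows up as soon as one outlier is present. The sketch below, on made-up income values, measures how much spread the nine typical values keep after scaling; the data and the ranges dictionary are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical incomes (in thousands): nine typical values plus one extreme outlier
income = np.array([30, 32, 35, 38, 40, 42, 45, 48, 50, 5000], dtype=float).reshape(-1, 1)

ranges = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(income).ravel()
    # Spread of the nine inliers after scaling (the outlier is the last element)
    ranges[type(scaler).__name__] = scaled[:-1].max() - scaled[:-1].min()

for name, spread in ranges.items():
    print(f"{name:>14}: inlier spread = {spread:.3f}")
```

StandardScaler and MinMaxScaler squeeze the typical values into a sliver because the outlier dominates their statistics, while RobustScaler keeps them spread over a usable range.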
</>

sklearn Pipeline — Full Example

code
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
import pandas as pd
import numpy as np

# ── Sample DataFrame (synthetic; the target is random, so expect AUC near 0.5) ──
np.random.seed(42)
n = 300
df = pd.DataFrame({
    'age':        np.random.randint(18, 70, n).astype(float),
    'income':     np.random.exponential(40000, n),
    'score':      np.random.uniform(300, 850, n),
    'city':       np.random.choice(['Paris', 'Lyon', 'Toulouse'], n),
    'occupation': np.random.choice(['engineer', 'teacher', 'doctor'], n),
    'target':     np.random.randint(0, 2, n),
})
# Add some missing values
df.loc[np.random.choice(n, 20, replace=False), 'age'] = np.nan
df.loc[np.random.choice(n, 15, replace=False), 'city'] = np.nan

# ── Hold out a test set before any preprocessing is fitted ─────────
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# ── Define column groups ───────────────────────────────────────────
num_features = ['age', 'income', 'score']
cat_features = ['city', 'occupation']

# ── Preprocessing for numeric columns ─────────────────────────────
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', RobustScaler()),
])

# ── Preprocessing for categorical columns ─────────────────────────
# Combining handle_unknown='ignore' with drop='first' needs scikit-learn >= 1.0;
# an unknown category is then encoded as all zeros, like the dropped one.
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', drop='first')),
])

# ── Combine with ColumnTransformer ─────────────────────────────────
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, num_features),
    ('cat', categorical_transformer, cat_features),
])

# ── Full pipeline: preprocess → interactions → feature select → model ──
pipe = Pipeline([
    ('prep', preprocessor),
    ('poly', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
    ('select', SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=42),
                               threshold='median')),
    ('clf', GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, random_state=42)),
])

# Cross-validate on the training split: each fold refits the whole pipeline,
# so preprocessing statistics never come from the validation fold
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='roc_auc')
print(f"CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")

# Final fit on the training split, then score the untouched test split
pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")
```
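
Because the preprocessing and the model live in one object, a fitted Pipeline can be saved and reloaded as a unit. The sketch below shows one way to do that with joblib on a deliberately small pipeline. The data, the file name, and the choice of joblib over plain pickle are assumptions; a saved pipeline should be loaded with the same scikit-learn version that wrote it.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[rng.integers(0, 50, size=5), 0] = np.nan   # a few missing values
y = (rng.random(50) > 0.5).astype(int)

small_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pipeline.joblib")  # hypothetical file name
    joblib.dump(small_pipe, path)
    restored = joblib.load(path)

# The restored object carries the fitted imputer, scaler, and classifier together
same = np.array_equal(small_pipe.predict(X), restored.predict(X))
print(same)
```

The restored pipeline accepts raw rows with missing values directly, so the serving code does not have to reimplement any preprocessing.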
⚠️

Preprocessing Pitfalls

pitfall

First: fitting scalers on the full dataset (before splitting) is data leakage, because test statistics contaminate training. Always fit inside a Pipeline or on X_train only. Second: OneHotEncoder may meet categories at test time that it never saw in training; use handle_unknown='ignore'. Third: imputing with the mean before splitting leaks the test mean into training. Fourth: polynomial features explode memory. 100 features at degree=2 become 5,150 columns (5,151 with the bias column); interaction_only=True drops only the 100 squared terms, leaving 5,050, so pair it with feature selection downstream. Fifth: target encoding without cross-validation leaks target information.
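
The polynomial blow-up is easy to verify before committing memory to it. The sketch below asks PolynomialFeatures how many columns it would produce for 100 input features under different settings; n_out is a hypothetical helper, and the all-zeros input is used only for its shape.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.zeros((1, 100))  # one dummy row with 100 features; only the shape matters

def n_out(**kwargs):
    """Number of output columns PolynomialFeatures yields for X (hypothetical helper)."""
    return PolynomialFeatures(degree=2, **kwargs).fit(X).n_output_features_

full = n_out()                                           # bias + 100 linear + 5,050 degree-2 terms
no_bias = n_out(include_bias=False)                      # same without the bias column
pairs = n_out(interaction_only=True, include_bias=False) # 100 linear + 4,950 pairwise products

print(full, no_bias, pairs)  # 5151 5150 5050
```

Since interaction_only=True removes only the squared terms, the column count still grows quadratically with the number of input features.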

The Pipeline object in scikit-learn is not just convenient; it is the reliable way to get correct cross-validation. Any preprocessing that 'learns' from data (scalers, encoders, imputers) must be inside the pipeline so that it is refit on each training fold.
