Machine Learning · November 5, 2024 · 6 min read

Scikit-learn Pipelines: The Right Way to Build ML Workflows

Why you should wrap preprocessing and model together in a single sklearn Pipeline: it prevents data leakage, keeps cross-validation honest, makes serialization easy, and accommodates custom transformers.

The Problem Without Pipelines

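from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# WRONG: the scaler is fit on all of X, validation rows included (data leakage!)
scaler = StandardScaler()
scaler.fit(X)  # should be fit on the training data only
X_scaled = scaler.transform(X)
cross_val_score(model, X_scaled, y)  # validation folds already shaped the scaler's mean and variance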
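
For contrast, avoiding the leak without a Pipeline means refitting the scaler by hand inside every fold. The following is a minimal sketch, not a recommended pattern: the synthetic data from make_classification, the LogisticRegression stand-in model, and the StratifiedKFold settings are all assumptions chosen for illustration and do not come from the original post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in cv.split(X, y):
    # Fit the scaler on the training fold only, so the validation
    # rows never influence its mean and variance.
    scaler = StandardScaler().fit(X[train_idx])

    # Stand-in model (assumption); any estimator would work here.
    model = LogisticRegression(max_iter=1000)
    model.fit(scaler.transform(X[train_idx]), y[train_idx])

    # Apply the train-fold scaler to the validation fold, then score.
    scores.append(model.score(scaler.transform(X[val_idx]), y[val_idx]))

mean_score = float(np.mean(scores))
```

This is leak-free, but every preprocessing step has to be repeated by hand in each fold, and it is easy to get one of them wrong.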

The Right Way

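from lightgbm import LGBMClassifier  # third-party package: pip install lightgbm
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler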

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LGBMClassifier()),
])

# Now CV is correct: the imputer and scaler are refit on the training portion of each fold
cross_val_score(pipe, X, y, cv=5)  # CORRECT
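
The "easy serialization" benefit follows from the same structure: because the preprocessing lives inside the estimator, one file holds everything needed at prediction time. Here is a rough sketch using joblib, under a few assumptions that are not from the original post: synthetic data, LogisticRegression substituted for LGBMClassifier so the example needs only scikit-learn, and a hypothetical file name pipeline.joblib in a temporary directory.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),  # stand-in model (assumption)
])
pipe.fit(X, y)

# Persist the fitted imputer, scaler, and model as a single artifact.
# The file name is hypothetical.
path = os.path.join(tempfile.mkdtemp(), 'pipeline.joblib')
joblib.dump(pipe, path)

# Reload and check that the restored pipeline predicts identically.
restored = joblib.load(path)
same_predictions = np.array_equal(pipe.predict(X), restored.predict(X))
```

The restored object accepts raw, unscaled features, so serving code does not need to reimplement any preprocessing. Since joblib files are pickle-based, load them only from sources you trust.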

Custom Transformer

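import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class LogTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data
    def transform(self, X):
        return np.log1p(np.abs(X))  # log(1 + |x|); the sign of negative values is discarded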
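
A custom transformer like this slots into a Pipeline as one more step. The sketch below repeats the class so it runs on its own; the toy array X_demo, the step names, and the choice to follow the log step with StandardScaler are assumptions for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class LogTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        # log(1 + |x|); note that the sign of negative inputs is lost
        return np.log1p(np.abs(X))


# Toy data with a wide range and one negative value (assumption).
X_demo = np.array([[0.0, 1.0], [9.0, -99.0], [99.0, 999.0]])

# The transformer on its own: TransformerMixin supplies fit_transform.
logged = LogTransformer().fit_transform(X_demo)

# The transformer as a pipeline step, followed by scaling.
prep = Pipeline([
    ('log', LogTransformer()),
    ('scaler', StandardScaler()),
])
X_out = prep.fit_transform(X_demo)
```

Inheriting from BaseEstimator also provides get_params and set_params, which is what lets tools such as cross_val_score clone the step for each fold.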

Tags: Scikit-learn, Pipeline, Data Leakage, Best Practices, ML

Ossama Elhakki

AI Engineer & ML Systems Builder — Morocco