Partial Dependence & ICE Plots
“See exactly how a model's prediction changes as you vary one feature — marginalizing everything else”
PDPs marginalize over all other features to show the average effect of one variable. ICE curves expose per-sample heterogeneity. Centered ICE removes intercept differences between curves. ALE plots fix the PDP's extrapolation problem for correlated features.
Key Formulas
Partial Dependence Function
Average model output over the observed values of the complement features; the averaging can hide interactions
ICE Curve (individual)
Prediction for sample i as feature j varies — keeps all other features at their actual values
c-ICE (centered)
ICE curve anchored at reference point x_j0 — removes intercept differences, highlights interaction shape
PDP–ICE Relation
PDP is exactly the pointwise mean of all ICE curves
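The four entries above name each formula without giving it. A sketch in standard notation follows. The symbols are assumptions, not taken from the original page: f is the fitted model, x_j the feature of interest, x_{-j}^{(i)} the remaining features of sample i, n the number of samples, and x_{j0} the c-ICE reference point.

```latex
\begin{align}
\mathrm{PD}_j(v) &= \mathbb{E}_{X_{-j}}\!\left[ f(v, X_{-j}) \right]
  \approx \frac{1}{n} \sum_{i=1}^{n} f\!\left(v, x_{-j}^{(i)}\right) \\
\mathrm{ICE}_j^{(i)}(v) &= f\!\left(v, x_{-j}^{(i)}\right) \\
\mathrm{cICE}_j^{(i)}(v) &= \mathrm{ICE}_j^{(i)}(v) - \mathrm{ICE}_j^{(i)}(x_{j0}) \\
\mathrm{PD}_j(v) &\approx \frac{1}{n} \sum_{i=1}^{n} \mathrm{ICE}_j^{(i)}(v)
\end{align}
```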
Beyond 'Which Features Matter' — How Do They Matter?
Feature importance tells you that income is the most predictive feature, but it says nothing about the shape of that relationship. Does prediction increase linearly with income? Does it plateau above €80k? Is there a threshold effect at €40k? Partial Dependence Plots (PDPs) answer these questions visually. They're the 'effect plot' counterpart to importance scores. Together they form a complete picture: importance tells you the magnitude, PDPs tell you the direction and shape. Used together with ICE curves, they also reveal whether the average PDP is a reliable summary or a misleading average of heterogeneous effects.
A PDP showing a flat relationship for a high-importance feature is a red flag — it often means the marginalisation hides interaction effects visible only in ICE curves.
The Monte Carlo Interpretation
PDP estimation is a Monte Carlo average: for a given value x_j = v, replace the j-th column of your entire dataset with v, run all n samples through the model, and average the predictions. Do this for each v on a grid. The result is the model's average response curve. The key assumption is feature independence: the marginalisation pretends x_j can take value v while all other features keep their observed values. When features are correlated (e.g., income and age), this forces the model to predict on implausible combinations (e.g., age=20 with income=200k) where it saw little or no training data. Accumulated Local Effects (ALE) plots address this by averaging only over data points that actually lie near each value of x_j.
ICE curves are the individual-level equivalent: instead of averaging, plot each sample's curve separately. If ICE curves fan out or cross, the PDP average is misleading — there are interaction effects.
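The extrapolation issue can be made concrete with a small sketch. The data below is synthetic and hypothetical (income in thousands, generated to rise with age); it only illustrates how overwriting one column manufactures rows that never occur in the observed data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical correlated features: income (in thousands) rises with age.
age = rng.uniform(18, 70, n)
income = 10 + 1.5 * age + rng.normal(0, 8, n)

# Observed data: rows that are both young and high-income.
real_implausible = int(np.sum((age < 25) & (income > 100)))

# PDP step for grid value v = 110: income is overwritten for every row,
# while age keeps its observed value.
v = 110.0
income_pdp = np.full(n, v)
synthetic_implausible = int(np.sum((age < 25) & (income_pdp > 100)))

print("observed rows with age < 25 and income > 100:", real_implausible)
print("PDP rows with age < 25 and income > 100:", synthetic_implausible)
```

Every row with age < 25 becomes a young, high-income row at this grid point, so the model's predictions there rest on a region it was never trained on.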
Computing PDP + ICE: Step by Step
1. Choose feature j and a grid of values G = {v_1, v_2, …, v_k} (scikit-learn default: 100 points between the 5th and 95th percentile).
2. For each grid value v ∈ G: take a copy of the dataset, set X_j = v for all n samples, and compute f(X_j=v, X_{-j}) for each of them. Record the mean as PDP(v) and the n individual values as one point on each ICE curve.
3. Plot PDP(v) as the bold average line. Plot each ICE curve as a faint line behind it.
4. Optionally center the ICE curves (c-ICE): subtract each curve's value at v_min so all curves start at 0. This removes intercept differences and makes differences in shape easier to see.
5. For 2D PDPs (interaction plots): fix two features j1, j2 on a grid and marginalize over all others. The result is a heatmap showing their joint effect.
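The steps above can be sketched in plain NumPy. This is a minimal illustration, not a library implementation: `pdp_ice`, `center_ice`, `pdp_2d`, and `toy_model` are hypothetical names, and `predict` stands for any function that maps an (n, d) array to n predictions.

```python
import numpy as np

def pdp_ice(predict, X, j, grid):
    """Steps 1-3: ice[i, g] is the prediction for row i with feature j set to grid[g]."""
    X = np.asarray(X, dtype=float)
    ice = np.empty((X.shape[0], len(grid)))
    for g, v in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, j] = v              # overwrite feature j for every row
        ice[:, g] = predict(X_mod)
    return ice.mean(axis=0), ice     # PDP is the pointwise mean of the ICE curves

def center_ice(ice):
    """Step 4: subtract each curve's value at the first grid point."""
    return ice - ice[:, [0]]

def pdp_2d(predict, X, j1, j2, grid1, grid2):
    """Step 5: average prediction over the data for each (v1, v2) pair."""
    X = np.asarray(X, dtype=float)
    surface = np.empty((len(grid1), len(grid2)))
    for a, v1 in enumerate(grid1):
        for b, v2 in enumerate(grid2):
            X_mod = X.copy()
            X_mod[:, j1] = v1
            X_mod[:, j2] = v2
            surface[a, b] = predict(X_mod).mean()
    return surface

def toy_model(X):
    # Hypothetical stand-in for a fitted model, with an x0 * x1 interaction.
    return X[:, 0] * X[:, 1] + 0.5 * X[:, 2]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
grid0 = np.linspace(*np.quantile(X[:, 0], [0.05, 0.95]), 20)
grid1 = np.linspace(*np.quantile(X[:, 1], [0.05, 0.95]), 15)

pdp, ice = pdp_ice(toy_model, X, j=0, grid=grid0)
c_ice = center_ice(ice)
surface = pdp_2d(toy_model, X, 0, 1, grid0, grid1)
```

Each call to `pdp_ice` costs one model evaluation per row per grid point, which is why the 2D version grows quickly with grid size.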
PDP + ICE with scikit-learn Inspection API
```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# ── Dataset ───────────────────────────────────────────────────────────────────
np.random.seed(0)
n = 2000
X = pd.DataFrame({
    "income": np.random.normal(50, 15, n).clip(15, 120),
    "age": np.random.randint(18, 70, n).astype(float),
    "credit_score": np.random.normal(650, 80, n).clip(300, 850),
})
# Synthetic target: positive only when income > 55 AND (credit_score > 660 OR age > 40)
y = (0.5*(X["income"] > 55) + 0.3*(X["credit_score"] > 660) + 0.2*(X["age"] > 40)) > 0.5
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# ── Train model ───────────────────────────────────────────────────────────────
clf = GradientBoostingClassifier(n_estimators=200, max_depth=4, random_state=42)
clf.fit(X_train, y_train)

features = ["income", "age", "credit_score"]  # feature names or indices

# ── 1. Standard PDP for 3 features ────────────────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
PartialDependenceDisplay.from_estimator(
    clf, X_train,
    features=features,
    kind="average",          # "average" = PDP only
    grid_resolution=50,      # number of grid points
    ax=axes,
)
plt.suptitle("Partial Dependence Plots (PDP)")
plt.tight_layout()
plt.show()

# ── 2. PDP + ICE overlay ──────────────────────────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
PartialDependenceDisplay.from_estimator(
    clf, X_train,
    features=features,
    kind="both",                    # "both" = PDP (bold) + ICE (faint)
    subsample=100,                  # draw 100 ICE curves for readability
    ice_lines_kw={"alpha": 0.3},    # ICE line transparency (no top-level alpha argument)
    random_state=0,                 # reproducible choice of ICE curves
    ax=axes,
)
plt.suptitle("PDP + ICE Curves: divergence reveals interactions")
plt.tight_layout()
plt.show()

# ── 3. Centered ICE (c-ICE): removes intercept differences ────────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
PartialDependenceDisplay.from_estimator(
    clf, X_train,
    features=features,
    kind="individual",              # ICE only
    centered=True,                  # anchor each curve at its leftmost value
    subsample=150,
    ice_lines_kw={"alpha": 0.2},
    random_state=0,
    ax=axes,
)
plt.suptitle("Centered ICE (c-ICE): interaction shapes visible")
plt.tight_layout()
plt.show()

# ── 4. 2D interaction PDP (heatmap) ───────────────────────────────────────────
fig, ax = plt.subplots(figsize=(6, 5))
PartialDependenceDisplay.from_estimator(
    clf, X_train,
    features=[("income", "credit_score")],  # tuple = 2D PDP
    ax=ax,
)
plt.title("2D PDP: income × credit_score interaction")
plt.tight_layout()
plt.show()

# ── 5. Manual PDP computation (educational) ───────────────────────────────────
grid = np.linspace(X_train["income"].quantile(0.05),
                   X_train["income"].quantile(0.95), 50)
pdp_vals = []
for v in grid:
    X_mod = X_train.copy()
    X_mod["income"] = v                                    # overwrite income for every row
    pdp_vals.append(clf.predict_proba(X_mod)[:, 1].mean()) # average P(y=1) at this grid value

plt.figure(figsize=(6, 3))
plt.plot(grid, pdp_vals, lw=2, color="#8b5cf6")
plt.xlabel("income")
plt.ylabel("Avg predicted P(y=1)")
plt.title("Manual PDP: income effect")
plt.tight_layout()
plt.show()
```
ALE Plots: Fixing PDP's Extrapolation Problem
Accumulated Local Effects (ALE) plots condition on the neighbourhood of x_j = v rather than marginalising over the full dataset. This respects the actual data distribution and avoids extrapolation into implausible regions (e.g., 20-year-olds with €150k income). The difference-based formulation computes the local effect of moving x_j by a small amount using only the samples near that value, then accumulates those local effects across the range of x_j. The result is a distribution-faithful counterpart to the PDP. For uncorrelated features, PDP and ALE produce near-identical plots. For correlated features, ALE is generally the more trustworthy of the two.
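A minimal sketch of a first-order ALE estimate for one numeric feature, assuming quantile bins and a count-weighted centering (one common choice among several). `ale_1d` and `toy_model` are hypothetical names, not part of scikit-learn, which has no built-in ALE function as far as this sketch assumes.

```python
import numpy as np

def ale_1d(predict, X, j, n_bins=10):
    """Return (edges, ale): the accumulated local effect of feature j at each bin edge."""
    X = np.asarray(X, dtype=float)
    # Quantile-based edges so each bin holds roughly the same number of rows.
    edges = np.unique(np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)))
    n_intervals = len(edges) - 1
    # Bin index of each row, in 0 .. n_intervals - 1.
    bins = np.clip(np.searchsorted(edges, X[:, j], side="right") - 1, 0, n_intervals - 1)

    # Move each row only to the lower and upper edge of its OWN bin,
    # so predictions stay close to the observed data.
    X_lo, X_hi = X.copy(), X.copy()
    X_lo[:, j] = edges[bins]
    X_hi[:, j] = edges[bins + 1]
    diffs = predict(X_hi) - predict(X_lo)

    # Average local effect per bin.
    counts = np.bincount(bins, minlength=n_intervals)
    sums = np.bincount(bins, weights=diffs, minlength=n_intervals)
    local = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

    # Accumulate, then center so the count-weighted mean effect is zero.
    ale = np.concatenate([[0.0], np.cumsum(local)])
    mid = 0.5 * (ale[:-1] + ale[1:])
    ale -= np.sum(mid * counts) / counts.sum()
    return edges, ale

def toy_model(X):
    # Hypothetical additive model; the true effect of feature 0 has slope 2.
    return 2.0 * X[:, 0] + X[:, 1]

rng = np.random.default_rng(0)
x0 = rng.normal(size=2000)
x1 = x0 + 0.3 * rng.normal(size=2000)   # strongly correlated with x0
X = np.column_stack([x0, x1])

edges, ale = ale_1d(toy_model, X, j=0, n_bins=10)
slopes = np.diff(ale) / np.diff(edges)
```

Because each row is only moved within its own bin, the model is never asked about combinations of x0 and x1 that lie far from the data, and the recovered slope reflects the effect of feature 0 alone.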
When PDP Lies: The Heterogeneous Effect Problem
A PDP can show a perfectly flat line for income while individual ICE curves range from strongly positive to strongly negative, because opposite-sign effects cancel in the average. This happens when there are strong interaction effects (e.g., income raises the prediction for young borrowers but lowers it for older ones). Always check ICE curves alongside PDPs. Additionally, PDPs are computationally expensive for large datasets: n × k model evaluations for each feature (n = dataset size, k = grid points). In scikit-learn, subsample=200 controls how many ICE curves are drawn, but the averaged curve is still computed on the full dataset; to cut the number of model evaluations, pass a random subset of rows instead. Finally, PDPs have no confidence intervals by default. Use bootstrapped PDPs or Gaussian process regression to get uncertainty bands.
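One way to get the bootstrapped bands mentioned above is to resample rows and recompute the PDP each time. The sketch below makes two assumptions worth stating: `pdp_curve`, `bootstrap_pdp`, and `toy_model` are hypothetical names, and the model is held fixed, so the band reflects only the variability of the averaging step, not the uncertainty from refitting the model.

```python
import numpy as np

def pdp_curve(predict, X, j, grid):
    """Plain PDP: mean prediction with feature j overwritten by each grid value."""
    out = np.empty(len(grid))
    for g, v in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, j] = v
        out[g] = predict(X_mod).mean()
    return out

def bootstrap_pdp(predict, X, j, grid, n_boot=200, seed=0):
    """Percentile band for the PDP from row resampling (model held fixed)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    curves = np.empty((n_boot, len(grid)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # sample rows with replacement
        curves[b] = pdp_curve(predict, X[idx], j, grid)
    lo, hi = np.percentile(curves, [2.5, 97.5], axis=0)
    return lo, hi

def toy_model(X):
    # Hypothetical stand-in for a fitted model.
    return X[:, 0] * X[:, 1] + X[:, 2]

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
grid = np.linspace(-2.0, 2.0, 11)

pdp = pdp_curve(toy_model, X, j=0, grid=grid)
lo, hi = bootstrap_pdp(toy_model, X, j=0, grid=grid)
```

Plotting `lo` and `hi` as a shaded region around `pdp` (for example with matplotlib's `fill_between`) gives a quick visual check of how stable the average curve is.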
If the PDP shows a flat line but permutation importance says the feature is critical, check ICE curves — you likely have a masked interaction effect.
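The masked-interaction case is easy to reproduce. This sketch uses a hypothetical toy function instead of a fitted model: feature 0 has a positive effect in one group and an equally strong negative effect in the other, and the groups are balanced.

```python
import numpy as np

def toy_model(X):
    # Hypothetical model: feature 0 helps group +1 and hurts group -1.
    return X[:, 0] * X[:, 1]

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.normal(size=n),                # feature 0: the one we vary
    np.repeat([1.0, -1.0], n // 2),    # feature 1: two balanced groups
])

grid = np.linspace(-2.0, 2.0, 21)
ice = np.empty((n, len(grid)))
for g, v in enumerate(grid):
    X_mod = X.copy()
    X_mod[:, 0] = v
    ice[:, g] = toy_model(X_mod)

pdp = ice.mean(axis=0)          # average curve
ice_spread = ice.std(axis=0)    # disagreement between individual curves

print("max |PDP|:", np.abs(pdp).max())
print("max ICE spread:", ice_spread.max())
```

The PDP is flat at zero while the ICE curves spread widely, so a large per-grid-point standard deviation of the ICE values is a simple numeric flag that the average is hiding an interaction.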