Partial Dependence & ICE Plots

“See exactly how a model's prediction changes as you vary one feature — marginalizing everything else”

PDPs marginalize over all other features to show the average effect of one variable. ICE curves expose per-sample heterogeneity. Centered ICE removes intercept differences between curves. ALE plots fix PDPs' extrapolation problem for correlated features.

Prerequisites

→Feature Importance

Concepts Covered

PDP, ICE Curves, c-ICE, ALE Plots, Marginalisation, Interaction Effects, PartialDependenceDisplay

∑Key Formulas

Partial Dependence Function

Average model output at x_j = v over the observed values of the complement features; this averages over, and can therefore hide, interactions

ICE Curve (individual)

Prediction for sample i as feature j varies — keeps all other features at their actual values

c-ICE (centered)

ICE curve anchored at reference point x_j0 — removes intercept differences, highlights interaction shape

PDP–ICE Relation

PDP is exactly the pointwise mean of all ICE curves
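
Written out, the four definitions above can be sketched as follows. The notation is an assumption made for this sketch: f is the fitted model, x_j the feature of interest, x_{-j}^{(i)} the remaining feature values of sample i, n the number of samples, and x_{j0} the c-ICE reference point.

```latex
\begin{align}
\mathrm{PD}_j(v) &= \frac{1}{n}\sum_{i=1}^{n} f\bigl(x_j = v,\; x_{-j}^{(i)}\bigr) \\
\mathrm{ICE}_j^{(i)}(v) &= f\bigl(x_j = v,\; x_{-j}^{(i)}\bigr) \\
\mathrm{cICE}_j^{(i)}(v) &= \mathrm{ICE}_j^{(i)}(v) - \mathrm{ICE}_j^{(i)}(x_{j0}) \\
\mathrm{PD}_j(v) &= \frac{1}{n}\sum_{i=1}^{n} \mathrm{ICE}_j^{(i)}(v)
\end{align}
```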

🎯

Beyond 'Which Features Matter' — How Do They Matter?

motivation

Feature importance tells you that income is the most predictive feature, but it says nothing about the shape of that relationship. Does the prediction increase linearly with income? Does it plateau above €80k? Is there a threshold effect at €40k? Partial Dependence Plots (PDPs) answer these questions visually. They are the 'effect plot' counterpart to importance scores: importance tells you the magnitude, PDPs tell you the direction and shape. Combined with ICE curves, they also reveal whether the average PDP is a reliable summary or a misleading average of heterogeneous effects.

A PDP showing a flat relationship for a high-importance feature is a red flag — it often means the marginalisation hides interaction effects visible only in ICE curves.

💡

The Monte Carlo Interpretation

intuition

PDP estimation is a Monte Carlo average: for a given value x_j = v, replace the j-th column of your entire dataset with v, run all n samples through the model, and average the predictions. Do this for each v on a grid. The result is the model's average response curve. The key assumption is feature independence: the marginalisation pretends x_j can take value v while all other features keep their observed values. When features are correlated (e.g., income and age), this creates extrapolation into implausible regions (e.g., age=20 with income=200k) that the model never saw during training. Accumulated Local Effects (ALE) plots address this by averaging only over samples whose x_j actually lies near v.

ICE curves are the individual-level equivalent: instead of averaging, plot each sample's curve separately. If ICE curves fan out or cross, the PDP average is misleading — there are interaction effects.
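
To make the extrapolation issue concrete, here is a minimal sketch on synthetic data. The age/income relationship, the grid value of 90, and the under-25 cutoff are all assumptions chosen for illustration; nothing here comes from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(18, 70, n)
# Hypothetical data: income (in thousands) rises with age, plus noise.
income = 10 + 1.2 * age + rng.normal(0, 8, n)

v = 90.0                      # one grid value the PDP would evaluate for income
young = age < 25              # rows where such an income is implausible
max_seen = income[young].max()

# The PDP step overwrites income with v for every row, including the young ones,
# so the model is queried at (age < 25, income = 90) even though no such row exists.
print(f"under-25 rows forced to income={v:.0f}: {young.sum()}")
print(f"highest income actually observed for under-25s: {max_seen:.1f}")
```

Every under-25 row ends up far outside the region where the model saw training data, yet it contributes to the PDP average with the same weight as any other row.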

⚙️

Computing PDP + ICE: Step by Step

algorithm

Choose feature j and a grid of values G = {v_1, v_2, …, v_k} (scikit-learn's default: 100 equally spaced points between the 5th and 95th percentiles of the feature).

For each grid value v ∈ G: copy the dataset, set x_j = v in all n rows, compute f(x_j = v, x_{-j}) for every row, then record the mean as PDP(v) and the n individual predictions as one point on each of the n ICE curves.

Plot PDP(v) as the average line. Plot each ICE curve as a faint line in the same colour.

Optionally center ICE curves (c-ICE): subtract each curve's value at v_min so all curves start at 0 — removes intercept noise.

For 2D PDPs (interaction plots): fix two features j1, j2 on a grid, marginalize over all others — produces a heatmap showing the joint effect.
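
The one-dimensional steps above can be sketched in plain NumPy. The helper name `ice_and_pdp`, the `predict` callable interface, and the toy model are assumptions made for this illustration, not part of any library.

```python
import numpy as np

def ice_and_pdp(predict, X, j, grid, center=False):
    """Return (ice, pdp) for column j of X.

    predict: callable mapping an (n, d) array to n predictions (assumed interface).
    ice has shape (n, len(grid)); pdp has shape (len(grid),).
    """
    X = np.asarray(X, dtype=float)
    ice = np.empty((X.shape[0], len(grid)))
    for k, v in enumerate(grid):
        X_mod = X.copy()          # one modified copy of the data per grid value
        X_mod[:, j] = v           # set x_j = v in every row
        ice[:, k] = predict(X_mod)
    if center:
        ice = ice - ice[:, [0]]   # c-ICE: anchor each curve at the first grid value
    return ice, ice.mean(axis=0)  # the PDP is the pointwise mean of the ICE curves


def predict(X):
    """Hypothetical additive model used only for this example."""
    return X[:, 0] ** 2 + 5.0 * X[:, 1]

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))

# Grid between the 5th and 95th percentile of the feature
grid = np.linspace(*np.percentile(X[:, 0], [5, 95]), 20)

ice, pdp = ice_and_pdp(predict, X, j=0, grid=grid)
c_ice, c_pdp = ice_and_pdp(predict, X, j=0, grid=grid, center=True)
```

Because the toy model is additive, the raw ICE curves differ only by a vertical offset and the centered curves collapse onto a single line. With an interaction term they would not.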

</>

PDP + ICE with scikit-learn Inspection API

code

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# ── Dataset ───────────────────────────────────────────────────────────────────
np.random.seed(0)
n = 2000
X = pd.DataFrame({
    "income":       np.random.normal(50, 15, n).clip(15, 120),
    "age":          np.random.randint(18, 70, n).astype(float),
    "credit_score": np.random.normal(650, 80, n).clip(300, 850),
})
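# Label is True when income > 55 AND (credit_score > 660 OR age > 40),
# so the synthetic target contains interactions by construction.
y = (0.5*(X["income"]>55) + 0.3*(X["credit_score"]>660) + 0.2*(X["age"]>40)) > 0.5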

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# ── Train model ───────────────────────────────────────────────────────────────
clf = GradientBoostingClassifier(n_estimators=200, max_depth=4, random_state=42)
clf.fit(X_train, y_train)

# ── 1. Standard PDP for 3 features ────────────────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
PartialDependenceDisplay.from_estimator(
    clf, X_train,
    features=["income", "age", "credit_score"],   # feature names or indices
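    kind="average",           # "average" = PDP only
    # Force brute-force averaging of predict_proba. With the default method="auto",
    # gradient boosting uses the fast "recursion" path, which plots
    # decision_function values and so differs in scale from the plots below.
    method="brute",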
    grid_resolution=50,       # number of grid points
    ax=axes,
)
plt.suptitle("Partial Dependence Plots (PDP)")
plt.tight_layout()
plt.show()

# ── 2. PDP + ICE overlay ─────────────────────────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
PartialDependenceDisplay.from_estimator(
    clf, X_train,
    features=["income", "age", "credit_score"],
    kind="both",              # "both" = PDP (bold) + ICE (faint)
    subsample=100,            # sample 100 ICE curves for readability
    ice_lines_kw={"alpha": 0.3},  # ICE line transparency (from_estimator has no alpha argument)
    ax=axes,
)
plt.suptitle("PDP + ICE Curves — divergence reveals interactions")
plt.tight_layout()
plt.show()

# ── 3. Centered ICE (c-ICE) — removes intercept bias ─────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
PartialDependenceDisplay.from_estimator(
    clf, X_train,
    features=["income", "age", "credit_score"],
    kind="individual",        # ICE only
    centered=True,            # anchor each curve at its leftmost value
    subsample=150,
    ice_lines_kw={"alpha": 0.2},  # ICE line transparency
    ax=axes,
)
plt.suptitle("Centered ICE (c-ICE) — interaction shapes visible")
plt.tight_layout()
plt.show()

# ── 4. 2D interaction PDP (heatmap) ──────────────────────────────────────────
fig, ax = plt.subplots(figsize=(6, 5))
PartialDependenceDisplay.from_estimator(
    clf, X_train,
    features=[("income", "credit_score")],  # tuple = 2D PDP
    ax=ax,
)
plt.title("2D PDP: income × credit_score interaction")
plt.tight_layout()
plt.show()

# ── 5. Manual PDP computation (educational) ──────────────────────────────────
grid = np.linspace(X_train["income"].quantile(0.05),
                   X_train["income"].quantile(0.95), 50)
pdp_vals = []
for v in grid:
    X_mod = X_train.copy()
    X_mod["income"] = v
    pdp_vals.append(clf.predict_proba(X_mod)[:, 1].mean())

plt.figure(figsize=(6, 3))
plt.plot(grid, pdp_vals, lw=2, color="#8b5cf6")
plt.xlabel("income")
plt.ylabel("Avg predicted P(y=1)")
plt.title("Manual PDP — income effect")
plt.tight_layout()
plt.show()

∑

ALE Plots: Fixing PDP's Extrapolation Problem

math

Accumulated Local Effects (ALE) plots condition on the neighbourhood of x_j = v rather than marginalising over the full dataset. This respects the actual data distribution and avoids extrapolation into implausible regions (e.g., 20-year-olds with €150k income). The feature's range is split into intervals. Within each interval, the local effect is the average change in prediction when x_j moves from the lower to the upper edge, computed only for the samples that actually fall in that interval (a finite-difference estimate of the derivative). These local effects are then accumulated and centered, giving a distribution-faithful counterpart to the PDP. For uncorrelated features, PDP and ALE produce near-identical curves up to a vertical offset. For correlated features, ALE is generally the more trustworthy of the two.
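
A minimal first-order ALE sketch in NumPy follows. The function name `ale_1d`, the quantile binning, and the toy model are assumptions for illustration, and the sketch does not handle edge cases such as a constant feature.

```python
import numpy as np

def ale_1d(predict, X, j, n_bins=10):
    """First-order ALE for column j. Returns (bin_edges, ale_at_edges)."""
    X = np.asarray(X, dtype=float)
    xj = X[:, j]
    # Quantile-based edges so each bin holds roughly the same number of rows
    edges = np.unique(np.quantile(xj, np.linspace(0, 1, n_bins + 1)))
    # Bin index in 0..len(edges)-2 for every row
    idx = np.clip(np.searchsorted(edges, xj, side="right") - 1, 0, len(edges) - 2)

    local = np.zeros(len(edges) - 1)
    counts = np.zeros(len(edges) - 1)
    for k in range(len(edges) - 1):
        rows = X[idx == k]
        if len(rows) == 0:
            continue
        lo, hi = rows.copy(), rows.copy()
        lo[:, j] = edges[k]        # move only this bin's rows to its lower edge
        hi[:, j] = edges[k + 1]    # ... and to its upper edge
        local[k] = np.mean(predict(hi) - predict(lo))
        counts[k] = len(rows)

    ale = np.concatenate([[0.0], np.cumsum(local)])   # accumulate local effects
    # Center so the count-weighted average effect is zero
    bin_mid = 0.5 * (ale[:-1] + ale[1:])
    ale -= np.sum(bin_mid * counts) / counts.sum()
    return edges, ale


def predict(X):
    """Hypothetical additive model used only for this illustration."""
    return 3.0 * X[:, 0] + X[:, 1]

rng = np.random.default_rng(0)
x0 = rng.normal(size=1000)
x1 = 0.9 * x0 + 0.1 * rng.normal(size=1000)   # strongly correlated with x0
X = np.column_stack([x0, x1])

edges, ale = ale_1d(predict, X, j=0)
```

Each row is only ever moved to the edges of its own bin, so the model is never queried far from the observed joint distribution. For this toy model the ALE curve recovers the slope of 3 on x0 despite the strong correlation with x1.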

⚠️

When PDP Lies: The Heterogeneous Effect Problem

pitfall

A PDP can show a perfectly flat line for income while individual ICE curves vary from strongly positive to strongly negative, because opposite-sign effects cancel in the average. This happens when there are strong interaction effects (e.g., income raises the prediction for young borrowers but lowers it for older ones). Always check ICE curves alongside PDPs. Additionally, PDPs are computationally expensive for large datasets: n × k model evaluations for each feature (n=dataset size, k=grid points). In scikit-learn, subsample=200 limits how many ICE curves are drawn; to reduce the number of model evaluations as well, pass a random subset of rows as X. Finally, PDPs have no confidence intervals by default. Use bootstrapped PDPs or Gaussian process regression to get uncertainty bands.
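
One way to get a bootstrapped band is sketched below. The helper `bootstrap_pdp` and the toy model are hypothetical. The band reflects only the variability of the data sample, because the model is not refit on each resample.

```python
import numpy as np

def bootstrap_pdp(predict, X, j, grid, n_boot=200, seed=0):
    """Return (pdp, lower, upper): a 95% percentile band from resampling rows of X."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # ICE matrix, computed once: shape (n, len(grid))
    ice = np.empty((n, len(grid)))
    for k, v in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, j] = v
        ice[:, k] = predict(X_mod)

    rng = np.random.default_rng(seed)
    boot = np.empty((n_boot, len(grid)))
    for b in range(n_boot):
        rows = rng.integers(0, n, size=n)     # resample rows with replacement
        boot[b] = ice[rows].mean(axis=0)      # PDP of the resampled data
    lower, upper = np.percentile(boot, [2.5, 97.5], axis=0)
    return ice.mean(axis=0), lower, upper


def predict(X):
    """Hypothetical model with an interaction, used only for this example."""
    return X[:, 0] + X[:, 0] * X[:, 1]

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
grid = np.linspace(-2, 2, 9)
pdp, lower, upper = bootstrap_pdp(predict, X, j=0, grid=grid)
```

Since the model is fixed, each row's ICE curve does not change between resamples, so the ICE matrix is computed once and only the averaging is repeated. Capturing model-fitting uncertainty as well would require refitting on each bootstrap sample.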

If the PDP shows a flat line but permutation importance says the feature is critical, check ICE curves — you likely have a masked interaction effect.
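
The masking effect is easy to reproduce with a toy model. In this sketch the model f(x) = x0 · x1 and the balanced ±1 group variable are assumptions chosen so that the cancellation is exact.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x0 = rng.uniform(-1.0, 1.0, n)           # feature of interest
x1 = np.tile([-1.0, 1.0], n // 2)        # balanced group flag that flips x0's effect
X = np.column_stack([x0, x1])

def predict(X):
    """Hypothetical model with a pure interaction: f(x) = x0 * x1."""
    return X[:, 0] * X[:, 1]

grid = np.linspace(-1.0, 1.0, 21)

# ICE: one row per sample, one column per grid value
ice = np.empty((n, grid.size))
for k, v in enumerate(grid):
    X_mod = X.copy()
    X_mod[:, 0] = v
    ice[:, k] = predict(X_mod)

pdp = ice.mean(axis=0)                   # PDP = pointwise mean of the ICE curves
slopes = (ice[:, -1] - ice[:, 0]) / (grid[-1] - grid[0])

print("max |PDP|      :", np.abs(pdp).max())
print("ICE slopes seen:", np.unique(slopes))
```

The PDP stays at essentially zero across the whole grid, while every individual curve has slope -1 or +1. The feature matters for every single prediction, yet the average hides it completely.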
