ML Learning Hub
Applied ML (intermediate)

Feature Importance & Selection

Know which features your model actually relies on — then trust it more (or less)

Permutation importance vs impurity (Gini) importance, SHAP unified attribution, drop-column importance, and how correlated features split scores unfairly — with interactive bar chart toggling between methods.

35 min
6 diagrams
6 Concepts Covered

Prerequisites

Random Forest
Gradient Boosting

Concepts Covered

Permutation Importance, Gini Importance, SHAP, Drop-Column, Feature Selection, Correlation Bias

Key Formulas

Permutation Importance

Accuracy drop when feature j is randomly shuffled — model-agnostic, works post-training

Gini Impurity Importance

Weighted impurity decrease summed over all splits on feature j and averaged across trees; fast, but biased toward high-cardinality features

SHAP (kernel)

Shapley value: each feature's average marginal contribution over all feature coalitions

Drop-Column Importance

Score drop when the model is retrained without feature j. Often treated as the gold standard, but expensive: one retraining per feature
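The four descriptions above can be written out as formulas. This is a sketch in one possible notation, not the notation of any particular library: s is the baseline validation score, s_{k,j} the score after the k-th shuffle of column j, G(t) the Gini impurity of tree node t with n_t samples and children t_L and t_R, v(t) the feature a node splits on, F the full feature set, and val(S) the model output when only the features in coalition S are known. All of these symbols are assumptions introduced here. Note that scikit-learn additionally averages impurity importance over trees and normalizes it to sum to 1.

```latex
% Permutation importance: mean score drop over K shuffles of column j
\mathrm{PI}_j = s - \frac{1}{K} \sum_{k=1}^{K} s_{k,j}

% Gini impurity of node t, with class proportions p_{t,c}
G(t) = 1 - \sum_{c} p_{t,c}^{2}

% Impurity importance: weighted impurity decrease over nodes that split on j
\mathrm{GI}_j = \sum_{t \,:\, v(t) = j} \frac{n_t}{n}
  \left[ G(t) - \frac{n_{t_L}}{n_t} G(t_L) - \frac{n_{t_R}}{n_t} G(t_R) \right]

% Shapley value: weighted average marginal contribution over all coalitions
\phi_j = \sum_{S \subseteq F \setminus \{j\}}
  \frac{|S|! \, (|F| - |S| - 1)!}{|F|!}
  \left[ \mathrm{val}(S \cup \{j\}) - \mathrm{val}(S) \right]

% Drop-column importance: score with all features minus score without j
\mathrm{DC}_j = s(f_{F}) - s(f_{F \setminus \{j\}})
```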

🎯

Why Feature Importance Is Non-Negotiable

motivation

Machine learning models are often black boxes: they produce outputs but hide their reasoning. Feature importance methods peel back that opacity by answering one question: which inputs does the model lean on most? This matters for three reasons. (1) Debugging: if your model leans heavily on 'random_noise', it is fitting noise (overfitting); if it leans on a feature that would not be available at prediction time, you have a data leakage problem. (2) Trust: regulators, doctors, and loan officers must understand model decisions. GDPR Article 22 restricts decisions based solely on automated processing, and the regulation's transparency provisions call for meaningful information about the logic involved. (3) Feature selection: importance scores guide dimensionality reduction. Dropping truly unimportant features reduces inference cost and can reduce overfitting to noise.

A credit scoring model relying heavily on zip_code might look accurate on training data while acting as a proxy for race; importance analysis surfaces this before deployment.

💡

Two Philosophies: What Does 'Important' Mean?

intuition

There are fundamentally two schools. (A) Structural importance asks 'how much did this feature help build the model?' Tree-based impurity importance is the canonical example, computed from split statistics during training. It's fast (no extra computation) but has a known bias: it inflates importance for continuous or high-cardinality features like zip_code, because they offer more candidate split points and therefore more chances to reduce impurity by luck. (B) Functional importance asks 'how much do the model's predictions degrade if I break this feature?' Permutation importance shuffles each feature independently and measures the accuracy drop. It's model-agnostic, works with any estimator, and, when computed on held-out data, assigns near-zero importance to random_noise features. The two approaches often disagree, and that disagreement is informative.

If impurity importance says zip_code is important but permutation importance says near-zero, the trees most likely used zip_code's many split points to fit noise in the training data rather than real signal.
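To make the cardinality bias concrete, here is a minimal pure-Python sketch of the Gini impurity decrease for a single split. The function names and the tiny hand-made dataset are hypothetical, and real tree libraries differ in detail (scikit-learn, for example, averages over trees and normalizes the result). The point is only that a feature with many distinct values has more thresholds to try, so it can find a positive impurity decrease even when it carries no meaning.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum_c p_c^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(labels, feature_values, threshold):
    """Weighted Gini decrease from splitting on feature <= threshold."""
    left = [y for x, y in zip(feature_values, labels) if x <= threshold]
    right = [y for x, y in zip(feature_values, labels) if x > threshold]
    n = len(labels)
    children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - children

def best_split_decrease(labels, feature_values):
    """Largest decrease over all candidate thresholds (every distinct value but the max)."""
    candidates = sorted(set(feature_values))[:-1]
    return max(
        (impurity_decrease(labels, feature_values, t) for t in candidates),
        default=0.0,
    )

# A perfectly informative feature removes all impurity
perfect = impurity_decrease([0, 0, 1, 1], [1, 2, 3, 4], threshold=2)

# Two meaningless features on the same labels (hand-made toy data)
labels = [0, 1, 1, 0, 1, 0, 0, 1]
binary_flag = [0, 0, 0, 0, 1, 1, 1, 1]   # one candidate threshold
row_id = [1, 2, 3, 4, 5, 6, 7, 8]        # seven candidate thresholds

flag_best = best_split_decrease(labels, binary_flag)
row_id_best = best_split_decrease(labels, row_id)
print("perfect split decrease:", perfect)
print("binary flag best decrease:", flag_best)
print("row id best decrease:", round(row_id_best, 4))
```

In this toy case the row identifier earns a positive decrease while the binary flag earns none, even though neither has any relationship to the labels by construction.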

⚙️

Permutation Importance: Step by Step

algorithm
1. Train your model on (X_train, y_train). Compute the baseline metric (e.g., accuracy) on (X_val, y_val).

2. For each feature j in {1, …, p}: shuffle column j in X_val (replace it with a random permutation of itself), compute the metric on the shuffled data, then restore column j.

3. Importance of j = baseline metric − shuffled metric. A large drop means the model depends heavily on that feature.

4. Repeat K times (sklearn's n_repeats defaults to 5) and average to reduce variance from the random shuffles.

5. Sort features by importance score. Negative scores (the model scores better when the feature is shuffled) usually mean the feature carries no real signal and the difference is down to chance; they are most common on small validation sets.
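The steps above can be sketched from scratch with NumPy alone. This is a minimal illustration, not scikit-learn's implementation: the function name permutation_importance_scratch and the toy_predict stand-in for a trained model are hypothetical, and it copies the validation matrix for each shuffle instead of restoring the column in place.

```python
import numpy as np

def permutation_importance_scratch(predict, X, y, n_repeats=5, seed=0):
    """Mean and std of the accuracy drop per column when that column is shuffled.

    predict: callable mapping an (n, p) array to n predicted labels.
    """
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)            # step 1: baseline metric
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):                    # step 2: one column at a time
        for k in range(n_repeats):                 # step 4: repeat K times
            X_shuffled = X.copy()
            X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
            shuffled_score = np.mean(predict(X_shuffled) == y)
            drops[j, k] = baseline - shuffled_score  # step 3: importance
    return drops.mean(axis=1), drops.std(axis=1)

# Hypothetical stand-in for a trained model: uses column 0, ignores column 1
def toy_predict(X):
    return (X[:, 0] > 0).astype(int)

rng = np.random.default_rng(42)
X_val = rng.normal(size=(500, 2))
y_val = (X_val[:, 0] > 0).astype(int)

mean_drop, std_drop = permutation_importance_scratch(toy_predict, X_val, y_val)
ranking = np.argsort(-mean_drop)                   # step 5: sort by importance
print("mean accuracy drop per column:", mean_drop.round(3))
print("columns ranked by importance:", ranking)
```

Because the toy model never reads column 1, shuffling it cannot change any prediction, so its importance is exactly zero; column 0 loses roughly half of its accuracy.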

</>

Feature Importance: Permutation & Impurity

code
python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# ── Synthetic tabular dataset ──────────────────────────────────────────────────
np.random.seed(42)
n = 1000
X = pd.DataFrame({
    "income":         np.random.normal(50, 15, n),
    "age":            np.random.randint(18, 70, n),
    "credit_score":   np.random.normal(650, 80, n),
    "loan_amount":    np.random.normal(20, 8, n),
    "employment_yrs": np.random.exponential(5, n),
    "num_accounts":   np.random.poisson(3, n),
    "random_noise":   np.random.randn(n),           # truly useless
    "zip_code":       np.random.randint(0, 10000, n), # high-cardinality noise
})
y = (
    0.4 * (X["income"] > 55)
    + 0.3 * (X["credit_score"] > 660)
    + 0.2 * (X["age"] > 35)
    + 0.1 * np.random.rand(n)
) > 0.5

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# ── 1. Train Random Forest ─────────────────────────────────────────────────────
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# ── 2. Impurity (Gini) importance: fast, built-in ──────────────────────────────
impurity_imp = pd.Series(rf.feature_importances_, index=X.columns)
print("Impurity importance:")
print(impurity_imp.sort_values(ascending=False).round(3))
# WARNING: zip_code (high cardinality) may appear inflated here

# ── 3. Permutation importance: model-agnostic, honest ──────────────────────────
perm = permutation_importance(
    rf, X_val, y_val,
    n_repeats=10,          # shuffle 10 times, take mean ± std
    scoring="accuracy",
    random_state=42,
    n_jobs=-1
)
perm_imp = pd.DataFrame({
    "mean": perm.importances_mean,
    "std":  perm.importances_std,
}, index=X.columns).sort_values("mean", ascending=False)

print("\nPermutation importance:")
print(perm_imp.round(3))
# random_noise and zip_code will be near zero or negative

# ── 4. Compare the two methods ────────────────────────────────────────────────
comparison = pd.DataFrame({
    "impurity": impurity_imp,
    "permutation": perm.importances_mean,
}).sort_values("permutation", ascending=False)
print("\nComparison (sorted by permutation):")
print(comparison.round(3))

# ── 5. Feature selection using permutation importance ────────────────────────
selected = perm_imp[perm_imp["mean"] > 0.01].index.tolist()
print(f"\nSelected features ({len(selected)}): {selected}")
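Drop-column importance is listed under Key Formulas but is not part of the listing above. A minimal sketch of it might look like the following, assuming scikit-learn and pandas are available. The helper name drop_column_importance and the three-column toy dataset are hypothetical; clone is used so that every retraining starts from the same unfitted estimator and hyperparameters.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def drop_column_importance(model, X_train, y_train, X_val, y_val):
    """Validation-accuracy drop when the model is retrained without each column."""
    baseline = clone(model).fit(X_train, y_train).score(X_val, y_val)
    drops = {}
    for col in X_train.columns:
        # Retrain from scratch on every column except `col`
        reduced = clone(model).fit(X_train.drop(columns=col), y_train)
        drops[col] = baseline - reduced.score(X_val.drop(columns=col), y_val)
    return pd.Series(drops).sort_values(ascending=False)

# Hypothetical toy data: two informative columns and one pure-noise column
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "strong_signal": rng.normal(size=n),
    "weak_signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = (X["strong_signal"] + 0.5 * X["weak_signal"] > 0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)
dc_importance = drop_column_importance(model, X_tr, y_tr, X_va, y_va)
print(dc_importance.round(3))
```

The cost is visible in the loop: one full retraining per column, plus one for the baseline, which is why this method is usually reserved for small feature sets or cheap models.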

SHAP: Unified Feature Attribution

math

SHAP (SHapley Additive exPlanations) unifies several additive attribution methods, including LIME and DeepLIFT, under a single axiomatic framework based on Shapley values. Every prediction is decomposed into a base value plus a sum of per-feature contributions ϕ_j. Unlike permutation importance, which is global, SHAP is local: it explains individual predictions, and a global ranking can be obtained by averaging |ϕ_j| over many rows. TreeSHAP computes exact Shapley values for tree ensembles in polynomial time using a path-based algorithm, making it practical for production Random Forests and XGBoost models.
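The decomposition into per-feature contributions can be checked by brute force on a tiny model. The sketch below enumerates every coalition, which is exponential in the number of features and only viable for toy cases. It is not how the shap library or TreeSHAP works internally. The names exact_shapley and toy_model are hypothetical, and "missing" features are replaced by a single baseline point, which is a simplifying assumption (SHAP typically averages over a background dataset).

```python
from itertools import combinations
from math import factorial

def exact_shapley(f, x, baseline):
    """Exact Shapley values of f at x by enumerating every coalition.

    Features outside a coalition are set to their baseline value.
    """
    p = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(p)]
        return f(mixed)

    phi = []
    for j in range(p):
        others = [i for i in range(p) if i != j]
        total = 0.0
        for size in range(p):
            # Shapley weight |S|! (p - |S| - 1)! / p! for coalitions of this size
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for S in combinations(others, size):
                with_j = value(set(S) | {j})
                without_j = value(set(S))
                total += weight * (with_j - without_j)
        phi.append(total)
    return phi

# Hypothetical toy model: an interaction between features 0 and 1, feature 2 unused
def toy_model(v):
    return 2.0 * v[0] + v[0] * v[1] + 0.0 * v[2]

x = [1.0, 2.0, 5.0]
baseline = [0.0, 0.0, 0.0]
phi = exact_shapley(toy_model, x, baseline)
base_value = toy_model(baseline)

print("contributions:", [round(v, 6) for v in phi])   # approximately [3.0, 1.0, 0.0]
print("base value + sum:", base_value + sum(phi))     # matches toy_model(x) = 4.0
```

The interaction term x0 * x1 contributes 2.0 at this point and is split evenly between the two features involved, while the unused feature receives exactly zero.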

⚠️

Correlated Features Split Importance Unfairly

pitfall

When two features are highly correlated (in real credit data, income and credit_score often are), permutation importance underestimates both. Shuffling income still leaves credit_score intact, so the model recovers most of the signal. The true joint importance is shared between them, but each individual importance looks small. Solution: treat correlated features as a group and permute or drop them together, or use SHAP with correlation-aware grouping. Single-feature drop-column importance has the same weakness, because the retrained model simply leans on the surviving correlated feature. Also be aware that permutation importance is validation-set dependent: scores change if you use different splits.

Never interpret near-zero permutation importance as 'useless' for correlated features without checking pairwise correlations first.
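A small experiment can show the effect. The sketch below assumes scikit-learn is available and uses hypothetical toy data in which column 1 is a near-copy of column 0. The helper accuracy_drop is also hypothetical: it shuffles a whole group of columns with the same row order, so the relationship inside the group is preserved while its link to the target is broken.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def accuracy_drop(model, X, y, cols, n_repeats=10, seed=0):
    """Mean accuracy drop when the listed columns are shuffled together."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    drops = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        order = rng.permutation(len(X))
        # Same row order for every column in the group
        X_perm[:, cols] = X[order][:, cols]
        drops.append(baseline - model.score(X_perm, y))
    return float(np.mean(drops))

# Hypothetical toy data: column 1 is a near-copy of column 0, column 2 is noise
rng = np.random.default_rng(0)
n = 2000
a = rng.normal(size=n)
X = np.column_stack([a, a + 0.05 * rng.normal(size=n), rng.normal(size=n)])
y = (a > 0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

drop_a = accuracy_drop(rf, X_va, y_va, [0])
drop_copy = accuracy_drop(rf, X_va, y_va, [1])
drop_group = accuracy_drop(rf, X_va, y_va, [0, 1])
print(f"shuffle column 0 alone:  {drop_a:.3f}")
print(f"shuffle column 1 alone:  {drop_copy:.3f}")
print(f"shuffle both as a group: {drop_group:.3f}")
```

Shuffling either column alone leaves the other as a backup, so each individual drop understates how much the model depends on the shared signal; shuffling them together removes that backup.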
