Feature Importance & Selection
“Know which features your model actually relies on — then trust it more (or less)”
Permutation importance vs impurity (Gini) importance, SHAP unified attribution, drop-column importance, and how correlated features split scores unfairly.
Key Formulas
Permutation Importance
Accuracy drop when feature j is randomly shuffled — model-agnostic, works post-training
Gini Impurity Importance
Weighted impurity decrease across all splits on feature j; fast, but biased toward continuous and high-cardinality features
SHAP (kernel)
Shapley value: each feature's average marginal contribution over all feature coalitions
Drop-Column Importance
Gold standard but expensive — retrain once per feature
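One common way to write the four definitions above. The notation is assumed here for illustration: s is the baseline validation score, s_{k,j} the score after the k-th shuffle of feature j, i(t) the impurity of tree node t with N_t samples and children t_L and t_R, v(S) the model's expected output when only the features in coalition S are known, p the number of features, and f_{-j} a model retrained without feature j.

```latex
\begin{align}
I_j^{\text{perm}} &= s - \frac{1}{K}\sum_{k=1}^{K} s_{k,j} \\
I_j^{\text{gini}} &= \sum_{t \,:\, \text{split on } j} \frac{N_t}{N}
    \left[ i(t) - \frac{N_{t_L}}{N_t}\, i(t_L) - \frac{N_{t_R}}{N_t}\, i(t_R) \right] \\
\phi_j &= \sum_{S \subseteq \{1,\dots,p\} \setminus \{j\}}
    \frac{|S|!\,(p-|S|-1)!}{p!}\,\bigl[ v(S \cup \{j\}) - v(S) \bigr] \\
I_j^{\text{drop}} &= s(f) - s(f_{-j})
\end{align}
```

For a forest, the impurity importance is typically averaged over trees and then normalized so the scores sum to 1.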
Why Feature Importance Is Non-Negotiable
Machine learning models are often black boxes: they produce outputs but hide their reasoning. Feature importance methods peel back that opacity by answering one question: which inputs does the model lean on most? This matters for three reasons. (1) Debugging: if your model leans heavily on 'random_noise', it is fitting noise (overfitting); if it leans heavily on a feature that would not be available at prediction time, you have a data leakage problem. (2) Trust: regulators, doctors, and loan officers must understand model decisions. GDPR Article 22, for example, restricts decisions based solely on automated processing, and is often read together with the regulation's transparency provisions as requiring meaningful information about the logic involved. (3) Feature selection: importance scores guide dimensionality reduction. Dropping truly unimportant features reduces inference cost and helps prevent overfitting to noise.
A credit scoring model relying heavily on zip_code might look accurate on training data while acting as a proxy for race. Importance analysis can surface this before deployment.
Two Philosophies: What Does 'Important' Mean?
There are fundamentally two schools. (A) Structural importance asks 'how much did this feature help build the model?' Tree-based impurity importance is the canonical example, computed from split statistics during training. It's fast (no extra computation) but has a known bias: it inflates importance for continuous or high-cardinality features such as zip_code, because they offer more candidate split points, and because it is measured on training data it can reward features the trees used only to fit noise. (B) Functional importance asks 'how much do the model's predictions degrade if I break this feature?' Permutation importance shuffles each feature independently and measures the drop in the chosen metric. It's model-agnostic, works with any fitted estimator, and, when computed on held-out data, assigns near-zero importance to features like random_noise. The two approaches often disagree, and that disagreement is informative.
If impurity importance says zip_code is important but permutation importance says near-zero, the trees most likely split on zip_code to fit training noise rather than real signal.
Permutation Importance: Step by Step
Train your model on (X_train, y_train). Compute baseline metric (e.g., accuracy) on X_val.
For feature j in {1, …, p}: shuffle column j in X_val (replace with random permutation), compute metric on shuffled data, restore column j.
Importance of j = baseline metric − shuffled metric. High drop = important feature.
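Repeat K times (sklearn's permutation_importance defaults to n_repeats=5) and average to reduce variance from random shuffles.
Sort features by importance score. Slightly negative scores (the model scores better with the feature shuffled) are usually random variation around zero; a consistently negative score suggests the feature adds noise rather than signal.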
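The steps above can be sketched from scratch in a few lines of NumPy. This is a minimal illustration, not sklearn's implementation: `permutation_importance_manual` and `toy_predict` are hypothetical names, and `toy_predict` stands in for a trained model that happens to use only the first column.

```python
import numpy as np

def permutation_importance_manual(predict, X, y, n_repeats=5, seed=0):
    """Mean and std of the accuracy drop when each column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)              # step 1: baseline metric
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):                      # step 2: one feature at a time
        for k in range(n_repeats):                   # step 4: repeat K times
            X_perm = X.copy()                        # copy, so column j is "restored" for free
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j, k] = baseline - np.mean(predict(X_perm) == y)  # step 3
    return baseline, drops.mean(axis=1), drops.std(axis=1)

def toy_predict(X):
    # Hypothetical stand-in for model.predict: relies on column 0 only
    return X[:, 0] > 0

rng = np.random.default_rng(42)
X_val = rng.normal(size=(500, 3))
y_val = X_val[:, 0] > 0

baseline, imp_mean, imp_std = permutation_importance_manual(toy_predict, X_val, y_val)
for j in np.argsort(-imp_mean):                      # step 5: sort by importance
    print(f"feature {j}: {imp_mean[j]:.3f} +/- {imp_std[j]:.3f}")
```

Because the stand-in model ignores columns 1 and 2, shuffling them cannot change its predictions, so their importance is exactly zero, while shuffling column 0 drops accuracy to roughly chance.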
Feature Importance: Permutation & Impurity
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# ── Synthetic tabular dataset ──
np.random.seed(42)
n = 1000
X = pd.DataFrame({
    "income": np.random.normal(50, 15, n),
    "age": np.random.randint(18, 70, n),
    "credit_score": np.random.normal(650, 80, n),
    "loan_amount": np.random.normal(20, 8, n),
    "employment_yrs": np.random.exponential(5, n),
    "num_accounts": np.random.poisson(3, n),
    "random_noise": np.random.randn(n),           # truly useless
    "zip_code": np.random.randint(0, 10000, n),   # high-cardinality noise
})
y = (
    0.4 * (X["income"] > 55)
    + 0.3 * (X["credit_score"] > 660)
    + 0.2 * (X["age"] > 35)
    + 0.1 * np.random.rand(n)
) > 0.5

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# ── 1. Train Random Forest ──
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# ── 2. Impurity (Gini) importance: fast, built-in ──
impurity_imp = pd.Series(rf.feature_importances_, index=X.columns)
print("Impurity importance:")
print(impurity_imp.sort_values(ascending=False).round(3))
# WARNING: zip_code (high cardinality) may appear inflated here

# ── 3. Permutation importance: model-agnostic, honest ──
perm = permutation_importance(
    rf, X_val, y_val,
    n_repeats=10,          # shuffle 10 times, take mean ± std
    scoring="accuracy",
    random_state=42,
    n_jobs=-1,
)
perm_imp = pd.DataFrame({
    "mean": perm.importances_mean,
    "std": perm.importances_std,
}, index=X.columns).sort_values("mean", ascending=False)
print("\nPermutation importance:")
print(perm_imp.round(3))
# random_noise and zip_code will be near zero or negative

# ── 4. Compare the two methods ──
comparison = pd.DataFrame({
    "impurity": impurity_imp,
    "permutation": perm.importances_mean,
}).sort_values("permutation", ascending=False)
print("\nComparison (sorted by permutation):")
print(comparison.round(3))

# ── 5. Feature selection using permutation importance ──
selected = perm_imp[perm_imp["mean"] > 0.01].index.tolist()
print(f"\nSelected features ({len(selected)}): {selected}")
SHAP: Unified Feature Attribution
SHAP (SHapley Additive exPlanations) unifies several additive attribution methods, including LIME and DeepLIFT, under a single axiomatic framework based on Shapley values. Every prediction is decomposed into a base value (the expected model output) plus a sum of per-feature contributions ϕ_j. Unlike permutation importance, which is global, SHAP is local: it explains individual predictions, and a global ranking can be obtained by averaging |ϕ_j| across samples. TreeSHAP computes exact Shapley values for tree ensembles in polynomial time using a path-based algorithm, making it practical for production Random Forest and XGBoost models.
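To make the decomposition concrete, here is a brute-force sketch of exact Shapley values for a tiny function. It is not how the shap library works internally: it enumerates every coalition (exponential cost) and, as a simplifying assumption, treats "absent" features as fixed at a single baseline point instead of averaging over background data. `shapley_values` and `toy_model` are hypothetical names.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x; absent features take their baseline value."""
    p = len(x)

    def value(coalition):
        # Model output when only the features in `coalition` are "known"
        z = [x[i] if i in coalition else baseline[i] for i in range(p)]
        return f(z)

    phi = []
    for j in range(p):
        others = [i for i in range(p) if i != j]
        total = 0.0
        for size in range(p):
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for S in combinations(others, size):
                S = set(S)
                total += weight * (value(S | {j}) - value(S))  # marginal contribution
        phi.append(total)
    return phi

def toy_model(z):
    # Hypothetical scoring function: one additive term plus one interaction
    return 2.0 * z[0] + z[1] * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
base_value = toy_model(baseline)

print("phi:", [round(v, 6) for v in phi])        # approximately [2.0, 3.0, 3.0]
print("base + sum(phi):", base_value + sum(phi))  # matches toy_model(x) up to rounding
```

The additive term is credited entirely to feature 0, the interaction term is split evenly between features 1 and 2, and the contributions plus the base value add up to the prediction.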
Correlated Features Split Importance Unfairly
When two features are highly correlated (e.g., income and credit_score), permutation importance underestimates both. Shuffling income still leaves credit_score intact, so the model recovers most of the signal. The true joint importance is shared between them, but each individual importance looks small. Single-feature drop-column importance has the same blind spot, because the retrained model simply falls back on the remaining correlated feature. Solution: permute or drop correlated features together as a group, or use SHAP with correlation-aware grouping. Also be aware that permutation importance is validation-set dependent: scores change if you use different splits.
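A small NumPy experiment illustrates the effect in its most extreme form, where one feature is an exact copy of another. Everything here is assumed for illustration: `model` is a hypothetical fixed predictor that splits its weight across both copies, and importance is measured as the increase in mean squared error, not an accuracy drop.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = x1.copy()                       # exact duplicate: the extreme case of correlation
X = np.column_stack([x1, x2])
y = x1                               # the target depends on the shared signal

def model(data):
    # Hypothetical fitted model that spreads its weight over both copies
    return 0.5 * data[:, 0] + 0.5 * data[:, 1]

def mse(data):
    return np.mean((model(data) - y) ** 2)

def mse_increase(cols):
    """Permute the given columns with one shared row order; return the MSE increase."""
    shuffled = X.copy()
    order = rng.permutation(n)
    shuffled[:, cols] = X[order][:, cols]
    return mse(shuffled) - mse(X)

single_1 = mse_increase([0])         # shuffle x1 only
single_2 = mse_increase([1])         # shuffle x2 only
joint = mse_increase([0, 1])         # shuffle both together

print(f"x1 alone: {single_1:.2f}  x2 alone: {single_2:.2f}  both: {joint:.2f}")
```

Shuffling either copy alone leaves half of the signal intact, so each individual score is far smaller than the joint score, and even their sum falls short of it.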
Never interpret near-zero permutation importance as 'useless' for correlated features without checking pairwise correlations first.
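Drop-column importance can be sketched the same way: retrain without a feature and compare validation scores. The example below is a minimal sketch under stated assumptions: ordinary least squares stands in for the model, the data are synthetic, and `fit_and_score` is a hypothetical helper. It also drops a correlated pair together to show the group-level view.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_train = 1000, 700
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)    # near-duplicate of x1
x3 = rng.normal(size=n)                # independent feature
X = np.column_stack([x1, x2, x3])
y = x1 + 0.5 * x3 + 0.1 * rng.normal(size=n)

def fit_and_score(cols):
    """Fit least squares on the training rows using `cols`; return validation R^2."""
    A = np.column_stack([X[:, cols], np.ones(n)])     # add an intercept column
    coef, *_ = np.linalg.lstsq(A[:n_train], y[:n_train], rcond=None)
    resid = y[n_train:] - A[n_train:] @ coef
    return 1.0 - np.mean(resid ** 2) / np.var(y[n_train:])

all_cols = [0, 1, 2]
baseline = fit_and_score(all_cols)

# Retrain once per feature, each time leaving that feature out
drop_importance = {
    j: baseline - fit_and_score([c for c in all_cols if c != j])
    for j in all_cols
}
# Retrain once with the correlated pair (x1, x2) removed together
group_importance = baseline - fit_and_score([2])

print({j: round(v, 3) for j, v in drop_importance.items()})
print("x1 + x2 dropped together:", round(group_importance, 3))
```

Dropping x1 or x2 alone barely changes the score because the retrained model falls back on the other copy, while dropping them together removes most of the explained variance.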