Decision Trees & Random Forest
“Recursive 20 questions: splitting reality into ever-purer regions”
How decision trees split data (Gini, entropy, information gain), pruning, then Random Forest as bagged ensemble — variance reduction, feature importance, OOB evaluation.
Prerequisites
Concepts Covered
∑Key Formulas
Gini Impurity
Probability of misclassifying a random sample. Zero = perfect purity.
Information Gain
Entropy reduction achieved by a split
Entropy
Measure of disorder / unpredictability in a node
OOB Error
Free validation estimate from samples not in each bootstrap
▶Interactive Simulation
⬡Model Architecture
The Simplest Powerful Model
Decision trees mimic human reasoning: 'If age > 40 AND smoker AND cholesterol > 200 → high risk'. They're interpretable, require no feature scaling, handle mixed types, and capture non-linear relationships. Alone, they overfit badly — but as the building block for Random Forest, XGBoost, and LightGBM, they're the most important model structure in tabular ML.
Recursive Binary Partitioning
A decision tree partitions the feature space into rectangular regions. At each step, it asks: 'Which single feature threshold best separates the classes?' It measures 'best' using impurity (Gini or entropy). The process recurses on each sub-region until a stopping criterion (max_depth, min_samples_leaf) is reached.
A depth-20 decision tree with binary splits can represent 2²⁰ ≈ 1 million different regions. That's why they overfit so catastrophically on raw data.
Gini vs. Entropy: The Split Criteria
Both measure impurity — lower is better. Gini is faster to compute (no log). Entropy is slightly more sensitive to class probabilities near 0.5. In practice, they produce nearly identical trees. For a node with classes [positive, negative] in proportions [p, 1-p]:
Why Random Forest Works: Bias-Variance Decomposition
A single deep tree has low bias but catastrophically high variance — it memorizes training noise. Random Forest exploits two tricks: (1) Bootstrap sampling grows each tree on a different random subset of data. (2) Random feature subsets (√p features per split) decorrelate the trees. Averaging decorrelated high-variance estimators reduces variance without increasing bias. Mathematically: Var(mean of n correlated trees) = ρσ² + (1-ρ)σ²/n. Reducing correlation ρ is the entire game.
The 'random' in Random Forest refers to random feature selection at each split, not just bootstrap. This is Breiman's key insight from 2001.
Random Forest Training Algorithm
For t = 1 to T (number of trees):
Bootstrap sample Dₜ from training data (n samples with replacement)
Grow decision tree hₜ on Dₜ:
At each node, randomly select m = √p features
Find best split among those m features (lowest Gini/entropy)
Expand until max_depth or min_samples_leaf is reached
Final prediction: majority vote (classification) or mean (regression)
OOB estimate: for each sample, predict using only trees that didn't see it
Production Pattern
from sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import cross_val_score, train_test_split from sklearn.datasets import make_classification import pandas as pd import shap class="tok-comment"># ── Sample data ──────────────────────────────────────────────────────── X_raw, y = make_classification(n_samples=class="tok-num">500, n_features=class="tok-num">10, n_informative=class="tok-num">5, random_state=class="tok-num">42) feature_names = [fclass="tok-str">"feature_{i}" for i in range(X_raw.shape[class="tok-num">1])] X_train, X_test, y_train, y_test = train_test_split( X_raw, y, test_size=class="tok-num">0.2, random_state=class="tok-num">42) class="tok-comment"># Train rf = RandomForestClassifier( n_estimators=class="tok-num">500, max_features=class="tok-str">'sqrt', class="tok-comment"># Random feature subsets max_depth=None, class="tok-comment"># Fully grown (pruned by min_samples_leaf) min_samples_leaf=class="tok-num">1, oob_score=True, class="tok-comment"># Free validation estimate random_state=class="tok-num">42, n_jobs=-class="tok-num">1 ) rf.fit(X_train, y_train) print(fclass="tok-str">"OOB Score: {rf.oob_score_:.4f}") class="tok-comment"># Feature importance (Mean Decrease in Impurity) importances = pd.Series(rf.feature_importances_, index=feature_names) importances.nlargest(class="tok-num">20).sort_values().plot(kind=class="tok-str">'barh') class="tok-comment"># SHAP for correct feature importance explainer = shap.TreeExplainer(rf) shap_values = explainer.shap_values(X_test)
Random Forest Pitfalls
MDI feature importance is biased toward high-cardinality features. Use permutation importance or SHAP instead.
Memory: 500 deep trees can use 2–10GB RAM. Set max_depth=15–20 in production.
Slow inference: predicting through 500 trees serially is slow. Consider n_estimators=100 for serving.
Class imbalance: use class_weight='balanced' or stratified bootstrap.
?Knowledge Check
Progress is saved in your browser — no account needed.
Need an AI engineer or data scientist?
I build custom ML models, AI agents, computer vision, and automation — from idea to production.