ML Learning Hub
Regressionbeginner

Decision Trees & Random Forest

Recursive 20 questions: splitting reality into ever-purer regions

How decision trees split data (Gini, entropy, information gain), pruning, then Random Forest as bagged ensemble — variance reduction, feature importance, OOB evaluation.

40 min
14 diagrams
7 Concepts Covered

Prerequisites

Model Evaluation
Information Theory

Concepts Covered

Gini ImpurityEntropyInformation GainPruningBaggingFeature ImportanceOOB Score

Key Formulas

Gini Impurity

Probability of misclassifying a random sample. Zero = perfect purity.

Information Gain

Entropy reduction achieved by a split

Entropy

Measure of disorder / unpredictability in a node

OOB Error

Free validation estimate from samples not in each bootstrap

Interactive Simulation

Loading visualization…
Loading visualization…

Model Architecture

Loading visualization…
Loading visualization…
🎯

The Simplest Powerful Model

motivation

Decision trees mimic human reasoning: 'If age > 40 AND smoker AND cholesterol > 200 → high risk'. They're interpretable, require no feature scaling, handle mixed types, and capture non-linear relationships. Alone, they overfit badly — but as the building block for Random Forest, XGBoost, and LightGBM, they're the most important model structure in tabular ML.

💡

Recursive Binary Partitioning

intuition

A decision tree partitions the feature space into rectangular regions. At each step, it asks: 'Which single feature threshold best separates the classes?' It measures 'best' using impurity (Gini or entropy). The process recurses on each sub-region until a stopping criterion (max_depth, min_samples_leaf) is reached.

A depth-20 decision tree with binary splits can represent 2²⁰ ≈ 1 million different regions. That's why they overfit so catastrophically on raw data.

Gini vs. Entropy: The Split Criteria

math

Both measure impurity — lower is better. Gini is faster to compute (no log). Entropy is slightly more sensitive to class probabilities near 0.5. In practice, they produce nearly identical trees. For a node with classes [positive, negative] in proportions [p, 1-p]:

Gini for binary classification
🔬

Why Random Forest Works: Bias-Variance Decomposition

deepdive

A single deep tree has low bias but catastrophically high variance — it memorizes training noise. Random Forest exploits two tricks: (1) Bootstrap sampling grows each tree on a different random subset of data. (2) Random feature subsets (√p features per split) decorrelate the trees. Averaging decorrelated high-variance estimators reduces variance without increasing bias. Mathematically: Var(mean of n correlated trees) = ρσ² + (1-ρ)σ²/n. Reducing correlation ρ is the entire game.

The 'random' in Random Forest refers to random feature selection at each split, not just bootstrap. This is Breiman's key insight from 2001.

⚙️

Random Forest Training Algorithm

algorithm
1

For t = 1 to T (number of trees):

Bootstrap sample Dₜ from training data (n samples with replacement)

Grow decision tree hₜ on Dₜ:

At each node, randomly select m = √p features

Find best split among those m features (lowest Gini/entropy)

Expand until max_depth or min_samples_leaf is reached

7

Final prediction: majority vote (classification) or mean (regression)

8

OOB estimate: for each sample, predict using only trees that didn't see it

</>

Production Pattern

code
python33 lines
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.datasets import make_classification
import pandas as pd
import shap

class="tok-comment"># ── Sample data ────────────────────────────────────────────────────────
X_raw, y = make_classification(n_samples=class="tok-num">500, n_features=class="tok-num">10,
                                n_informative=class="tok-num">5, random_state=class="tok-num">42)
feature_names = [fclass="tok-str">"feature_{i}" for i in range(X_raw.shape[class="tok-num">1])]
X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y, test_size=class="tok-num">0.2, random_state=class="tok-num">42)

class="tok-comment"># Train
rf = RandomForestClassifier(
    n_estimators=class="tok-num">500,
    max_features=class="tok-str">'sqrt',        class="tok-comment"># Random feature subsets
    max_depth=None,             class="tok-comment"># Fully grown (pruned by min_samples_leaf)
    min_samples_leaf=class="tok-num">1,
    oob_score=True,             class="tok-comment"># Free validation estimate
    random_state=class="tok-num">42,
    n_jobs=-class="tok-num">1
)
rf.fit(X_train, y_train)
print(fclass="tok-str">"OOB Score: {rf.oob_score_:.4f}")

class="tok-comment"># Feature importance (Mean Decrease in Impurity)
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances.nlargest(class="tok-num">20).sort_values().plot(kind=class="tok-str">'barh')

class="tok-comment"># SHAP for correct feature importance
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
⚠️

Random Forest Pitfalls

pitfall
1

MDI feature importance is biased toward high-cardinality features. Use permutation importance or SHAP instead.

2

Memory: 500 deep trees can use 2–10GB RAM. Set max_depth=15–20 in production.

3

Slow inference: predicting through 500 trees serially is slow. Consider n_estimators=100 for serving.

4

Class imbalance: use class_weight='balanced' or stratified bootstrap.

?Knowledge Check

Progress is saved in your browser — no account needed.

Need an AI engineer or data scientist?

I build custom ML models, AI agents, computer vision, and automation — from idea to production.