Regressionbeginner

Decision Trees & Random Forest

“Recursive 20 questions: splitting reality into ever-purer regions”

How decision trees split data (Gini, entropy, information gain), pruning, then Random Forest as bagged ensemble — variance reduction, feature importance, OOB evaluation.

40 min

14 diagrams

7 Concepts Covered

Prerequisites

→Model Evaluation

→Information Theory

Concepts Covered

Gini ImpurityEntropyInformation GainPruningBaggingFeature ImportanceOOB Score

Previous: Naïve Bayes Classifiers Next: SVM, SVR & KNN

∑Key Formulas

Gini Impurity

Probability of misclassifying a random sample. Zero = perfect purity.

Information Gain

Entropy reduction achieved by a split

Entropy

Measure of disorder / unpredictability in a node

OOB Error

Free validation estimate from samples not in each bootstrap

▶Interactive Simulation

Loading visualization…

⬡Model Architecture

Loading visualization…

🎯

The Simplest Powerful Model

motivation

Decision trees mimic human reasoning: 'If age > 40 AND smoker AND cholesterol > 200 → high risk'. They're interpretable, require no feature scaling, handle mixed types, and capture non-linear relationships. Alone, they overfit badly — but as the building block for Random Forest, XGBoost, and LightGBM, they're the most important model structure in tabular ML.

💡

Recursive Binary Partitioning

intuition

A decision tree partitions the feature space into rectangular regions. At each step, it asks: 'Which single feature threshold best separates the classes?' It measures 'best' using impurity (Gini or entropy). The process recurses on each sub-region until a stopping criterion (max_depth, min_samples_leaf) is reached.

A depth-20 decision tree with binary splits can represent 2²⁰ ≈ 1 million different regions. That's why they overfit so catastrophically on raw data.

∑

Gini vs. Entropy: The Split Criteria

math

Both measure impurity — lower is better. Gini is faster to compute (no log). Entropy is slightly more sensitive to class probabilities near 0.5. In practice, they produce nearly identical trees. For a node with classes [positive, negative] in proportions [p, 1-p]:

🔬

Why Random Forest Works: Bias-Variance Decomposition

deepdive

A single deep tree has low bias but catastrophically high variance — it memorizes training noise. Random Forest exploits two tricks: (1) Bootstrap sampling grows each tree on a different random subset of data. (2) Random feature subsets (√p features per split) decorrelate the trees. Averaging decorrelated high-variance estimators reduces variance without increasing bias. Mathematically: Var(mean of n correlated trees) = ρσ² + (1-ρ)σ²/n. Reducing correlation ρ is the entire game.

The 'random' in Random Forest refers to random feature selection at each split, not just bootstrap. This is Breiman's key insight from 2001.

⚙️

Random Forest Training Algorithm

algorithm

For t = 1 to T (number of trees):

→

Bootstrap sample Dₜ from training data (n samples with replacement)

→

Grow decision tree hₜ on Dₜ:

→

At each node, randomly select m = √p features

→

Find best split among those m features (lowest Gini/entropy)

→

Expand until max_depth or min_samples_leaf is reached

Final prediction: majority vote (classification) or mean (regression)

OOB estimate: for each sample, predict using only trees that didn't see it

</>

Production Pattern

code

python33 lines

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.datasets import make_classification
import pandas as pd
import shap

# ── Sample data ────────────────────────────────────────────────────────
X_raw, y = make_classification(n_samples=500, n_features=10,
                                n_informative=5, random_state=42)
feature_names = [f"feature_{i}" for i in range(X_raw.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.2, random_state=42)

# Train
rf = RandomForestClassifier(
    n_estimators=500,
    max_features='sqrt',        # Random feature subsets
    max_depth=None,             # Fully grown (pruned by min_samples_leaf)
    min_samples_leaf=1,
    oob_score=True,             # Free validation estimate
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
print(f"OOB Score: {rf.oob_score_:.4f}")

# Feature importance (Mean Decrease in Impurity)
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances.nlargest(20).sort_values().plot(kind='barh')

# SHAP for correct feature importance
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

⚠️

Random Forest Pitfalls

pitfall

MDI feature importance is biased toward high-cardinality features. Use permutation importance or SHAP instead.

Memory: 500 deep trees can use 2–10GB RAM. Set max_depth=15–20 in production.

Slow inference: predicting through 500 trees serially is slow. Consider n_estimators=100 for serving.

Class imbalance: use class_weight='balanced' or stratified bootstrap.

?Knowledge Check

Progress is saved in your browser — no account needed.

Naïve Bayes Classifiers

SVM, SVR & KNN

Need an AI engineer or data scientist?

I build custom ML models, AI agents, computer vision, and automation — from idea to production.

Get in touch View services