Ensemble Methods · intermediate

Gradient Boosting: XGBoost, LightGBM, CatBoost

“Many small corrections beat one big guess — sequentially chasing the residuals”

Traces the path from vanilla gradient boosting to XGBoost (tree scores), LightGBM (histogram-based splits, leaf-wise growth), and CatBoost (ordered boosting for categorical features), and closes with Optuna hyperparameter optimization (HPO) patterns.


Prerequisites

→Decision Trees

→Calculus & Optimization

Concepts Covered

Residuals · Tree Score (SSR+λT) · Histogram Binning · Leaf-wise Growth · Ordered Boosting · Optuna HPO


∑Key Formulas

Boosting Ensemble

Final prediction = sum of M weak learners, each weighted by γ

Pseudo-Residuals

Negative gradient of loss — what the next tree should learn

XGBoost Tree Score

Quality score of a leaf: G²/(H+λ), where G is the sum of first derivatives and H the sum of second derivatives of the loss over the leaf's samples

Split Gain

Improvement in objective from splitting a leaf into two
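Written out in standard notation, the four quantities above look roughly as follows. This is a reconstruction from the descriptions, not a quotation: the symbols $g_i$, $h_i$ (per-sample first and second derivatives of the loss) and $\gamma_{\text{split}}$ (a per-split complexity penalty, named to avoid clashing with the learner weight $\gamma_m$) are notational assumptions.

```latex
\begin{aligned}
F_M(x) &= F_0(x) + \sum_{m=1}^{M} \gamma_m \, h_m(x)
  && \text{(boosting ensemble)} \\[4pt]
r_{im} &= -\left[ \frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} \right]_{F = F_{m-1}}
  && \text{(pseudo-residuals)} \\[4pt]
\text{Score} &= \frac{G^2}{H + \lambda},
  \qquad G = \sum_{i \in \text{leaf}} g_i, \quad H = \sum_{i \in \text{leaf}} h_i
  && \text{(tree score of one leaf)} \\[4pt]
\text{Gain} &= \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}
  - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma_{\text{split}}
  && \text{(split gain)}
\end{aligned}
```

The split gain is simply the score of the two children minus the score of the parent, so a split is only worth making when that difference exceeds the penalty.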


🎯

Why Gradient Boosting Dominates Tabular Data

motivation

Since XGBoost's release in 2014, gradient boosting libraries (XGBoost, later LightGBM and CatBoost) have been among the most common winning approaches in Kaggle competitions on structured data. They are usually the first algorithm to try for tabular ML because they handle mixed feature types with little preprocessing, don't require feature scaling, capture complex non-linear interactions, and come with built-in regularization. Understanding how they work unlocks one of the most powerful tools in the data scientist's arsenal.

💡

The Core Idea: Fit the Mistakes

intuition

Suppose your current model predicts house prices and it's wrong by $50k on house A. Instead of retraining from scratch, train a new small tree to predict exactly that $50k error. Add it to your model. Now you're off by less. Repeat. Each new tree targets the residual errors of all previous trees combined. This is gradient boosting — gradient descent in function space.

The 'gradient' in gradient boosting refers to functional gradient descent, not parameter gradient descent. We're optimizing in the space of functions, not weights.
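The "fit the mistakes" loop can be sketched in a few lines. This is a minimal illustration under squared-error loss, not how any of the libraries are implemented; the synthetic sine data, the tree depth, and the names `learning_rate` and `n_trees` are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative data only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.3   # shrinkage applied to every tree's contribution
n_trees = 50

# F_0: the best constant guess under squared error is the mean
prediction = np.full_like(y, y.mean())
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_trees):
    residuals = y - prediction                       # what the model still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)    # add a small correction
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Each tree is deliberately weak (depth 2), and the learning rate keeps any single correction small; the accuracy comes from summing many of them.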

∑

Gradient Boosting as Gradient Descent in Function Space

math

At step m, we fit a tree hₘ to the negative gradient of the loss with respect to the current prediction F_{m-1}(x). For MSE loss L = ½(y - F(x))², the negative gradient is exactly the residual r = y - F(x). For other losses (log loss, MAE), we get different 'pseudo-residuals' — hence the generality of the framework.
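To make the pseudo-residual idea concrete, the sketch below compares the closed-form negative gradients for two losses against a finite-difference estimate. The helper names (`numeric_neg_gradient`, `log_loss`) and the sample values are made up for illustration; for log loss, F is assumed to be the raw score (log-odds).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_loss(y, F):
    return 0.5 * (y - F) ** 2

def log_loss(y, F):
    # F is the raw score (log-odds); y is 0 or 1
    p = sigmoid(F)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def numeric_neg_gradient(loss, y, F, eps=1e-6):
    # Central finite difference of the loss with respect to the prediction F
    return -(loss(y, F + eps) - loss(y, F - eps)) / (2 * eps)

y_reg, F_reg = np.array([3.0, -1.0, 0.5]), np.array([2.0, 0.0, 0.5])
y_clf, F_clf = np.array([1.0, 0.0, 1.0]), np.array([0.3, -1.2, 2.0])

mse_residual = y_reg - F_reg                # closed form for MSE: y - F
logloss_residual = y_clf - sigmoid(F_clf)   # closed form for log loss: y - p

print(mse_residual, numeric_neg_gradient(mse_loss, y_reg, F_reg))
print(logloss_residual, numeric_neg_gradient(log_loss, y_clf, F_clf))
```

In both cases the pseudo-residual has the form "target minus current prediction", only measured on a different scale, which is why the same tree-fitting machinery works for either loss.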

🔬

XGBoost: Second-Order Optimization

deepdive

Friedman's original GBM fits each tree using only first-order gradients (the pseudo-residuals). XGBoost instead uses a second-order Taylor expansion of the loss, so each leaf sees both the sum of first derivatives (G) and the sum of second derivatives (H). This gives it curvature information, much like Newton's method versus plain gradient descent. H acts as a natural adaptive step size: the optimal leaf value is −G/(H + λ), so leaves where the loss has high curvature (large H) take smaller effective steps.
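The second-order quantities are easy to compute by hand. The sketch below assumes the standard formulas (leaf weight −G/(H+λ), leaf score G²/(H+λ), gain as children minus parent); the function names and the five-point toy dataset are illustrative, and `gamma` here stands for a per-split penalty.

```python
import numpy as np

def leaf_weight(g, h, lam=1.0):
    # Newton step for one leaf: w* = -G / (H + lambda)
    return -g.sum() / (h.sum() + lam)

def leaf_score(g, h, lam=1.0):
    # Quality of a leaf: G^2 / (H + lambda)
    return g.sum() ** 2 / (h.sum() + lam)

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    # Half the score of both children minus the parent, minus the split penalty
    right_mask = ~left_mask
    children = (leaf_score(g[left_mask], h[left_mask], lam)
                + leaf_score(g[right_mask], h[right_mask], lam))
    return 0.5 * (children - leaf_score(g, h, lam)) - gamma

# Squared-error loss with current prediction F = 0: g_i = F_i - y_i, h_i = 1
y = np.array([1.0, 1.2, 0.9, 5.0, 5.3])
F = np.zeros_like(y)
g, h = F - y, np.ones_like(y)

good_split = np.array([True, True, True, False, False])   # separates low from high targets
bad_split = np.array([True, False, True, False, True])    # mixes them

print("leaf weight (lam=0):", leaf_weight(g, h, lam=0.0))
print("gain, good split:", split_gain(g, h, good_split))
print("gain, bad split: ", split_gain(g, h, bad_split))
```

With λ = 0 and squared error the leaf weight reduces to the mean residual, which recovers plain gradient boosting; a larger H (or λ) shrinks the step.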

⚖️

XGBoost vs LightGBM vs CatBoost

comparison

Three major frameworks, each with distinct architectural innovations:

XGBoost: Level-wise tree growth by default + second-order optimization. Mature and widely supported, though often slower to train than LightGBM on large data. A solid choice for small to medium datasets.

LightGBM: Leaf-wise growth (split the leaf with the highest gain first) + histogram binning (continuous values → discrete bins). Often much faster to train; the LightGBM paper reports speed-ups of up to about 20x over conventional GBDT. Best for large datasets.

CatBoost: Ordered boosting reduces target leakage. Native categorical handling (no manual encoding needed). Best when you have many categorical features.

Rule of thumb: Start with LightGBM. Use CatBoost with heavy categoricals. Use XGBoost for small datasets where training speed doesn't matter.
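Histogram binning is the main reason for the speed difference, and the idea fits in a short sketch. The code below is an illustration only, assuming quantile bin edges, squared-error gradients, and a single feature; real libraries use their own binning logic and many further optimizations.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)                                   # one continuous feature
y = (x > 0.5).astype(float) + rng.normal(scale=0.1, size=x.size)

# Squared-error loss with a constant prediction F = mean(y): g_i = F - y_i, h_i = 1
g = y.mean() - y
h = np.ones_like(y)

n_bins = 32
# Interior quantile edges (an illustrative binning choice)
edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
bin_idx = np.searchsorted(edges, x)                           # bin id in [0, n_bins - 1]

# One pass over the data builds per-bin sums of gradients and hessians
G_hist = np.bincount(bin_idx, weights=g, minlength=n_bins)
H_hist = np.bincount(bin_idx, weights=h, minlength=n_bins)

# Candidate splits are the bin boundaries only: a prefix sum gives every left/right pair
lam = 1.0
G_left, H_left = np.cumsum(G_hist)[:-1], np.cumsum(H_hist)[:-1]
G_total, H_total = G_hist.sum(), H_hist.sum()
G_right, H_right = G_total - G_left, H_total - H_left
gain = 0.5 * (G_left ** 2 / (H_left + lam)
              + G_right ** 2 / (H_right + lam)
              - G_total ** 2 / (H_total + lam))

best = int(np.argmax(gain))
best_threshold = edges[best]
print(f"best threshold ~ {best_threshold:.3f}, "
      f"chosen from {n_bins - 1} candidates instead of {x.size - 1}")
```

The trade-off is resolution: the split can only land on a bin edge, so the recovered threshold is close to, but generally not exactly, the true change point at 0.5.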

⚙️

LightGBM Leaf-Wise Growth

algorithm

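Initialize F₀(x) = log(p/(1-p)) for binary classification, where p is the fraction of positive labels in the training set

For m = 1 to M:

→ Compute pseudo-residuals rᵢ = -∂L/∂F(xᵢ)|_{F=F_{m-1}}

→ Grow tree m leaf-wise: repeatedly split the leaf with the largest gain (chosen globally, not level-by-level) until the leaf limit is reached

→ Compute leaf values: γⱼ = -Gⱼ / (Hⱼ + λ), where Gⱼ and Hⱼ are the sums of first and second derivatives of the loss over the samples in leaf j

→ Update: F_m(x) = F_{m-1}(x) + ν · γ_{leaf(x)}

→ Stop early if validation loss stops improving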
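The steps above can be sketched end to end for binary log loss. This is a teaching sketch, not LightGBM's implementation: it borrows scikit-learn's `DecisionTreeRegressor`, which grows best-first when `max_leaf_nodes` is set (ranking leaves by impurity reduction rather than by the second-order gain), and the hyperparameter values and variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, F):
    p = np.clip(sigmoid(F), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

M, nu, lam, num_leaves, patience = 200, 0.1, 1.0, 8, 10

p0 = y_tr.mean()
F0 = np.log(p0 / (1 - p0))                    # initial log-odds
F_tr = np.full(len(y_tr), F0)
F_va = np.full(len(y_va), F0)

best_loss, best_round, val_losses = log_loss(y_va, F_va), 0, []
for m in range(1, M + 1):
    p = sigmoid(F_tr)
    r = y_tr - p                              # pseudo-residuals for log loss
    h = p * (1 - p)                           # second derivatives of log loss
    # Stand-in for leaf-wise growth: best-first tree with a leaf limit
    tree = DecisionTreeRegressor(max_leaf_nodes=num_leaves,
                                 random_state=0).fit(X_tr, r)
    leaf_tr, leaf_va = tree.apply(X_tr), tree.apply(X_va)
    # Newton leaf values: gamma_j = sum(r) / (sum(h) + lambda), i.e. -G / (H + lambda)
    n_nodes = tree.tree_.node_count
    r_sum = np.bincount(leaf_tr, weights=r, minlength=n_nodes)
    h_sum = np.bincount(leaf_tr, weights=h, minlength=n_nodes)
    gamma = r_sum / (h_sum + lam)
    F_tr += nu * gamma[leaf_tr]
    F_va += nu * gamma[leaf_va]
    val_losses.append(log_loss(y_va, F_va))
    if val_losses[-1] < best_loss:
        best_loss, best_round = val_losses[-1], m
    elif m - best_round >= patience:
        break                                 # early stopping

print(f"stopped after {len(val_losses)} rounds; best round {best_round}, "
      f"validation log loss {best_loss:.4f}")
```

Note that the tree only decides the partition of the data; the values written into the leaves come from the Newton formula, not from the tree's own leaf means.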

</>

Production LightGBM with Optuna

code

python

import lightgbm as lgb
import optuna
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# ── Sample data ────────────────────────────────────────────────────────
X, y = make_classification(n_samples=1000, n_features=20,
                            n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

def objective(trial):
    params = {
        'objective': 'binary',
        'metric': 'auc',
        'learning_rate': trial.suggest_float('lr', 0.01, 0.3, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 20, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'min_data_in_leaf': trial.suggest_int('min_child', 10, 100),
        'feature_fraction': trial.suggest_float('feat_frac', 0.4, 1.0),
        'bagging_fraction': trial.suggest_float('bag_frac', 0.4, 1.0),
        'bagging_freq': 1,  # bagging_fraction has no effect unless bagging_freq > 0
        'lambda_l1': trial.suggest_float('l1', 1e-8, 10.0, log=True),
        'lambda_l2': trial.suggest_float('l2', 1e-8, 10.0, log=True),
        'verbose': -1,
    }
    # Build the Dataset per trial: a constructed Dataset pre-filters features
    # using min_data_in_leaf, so reusing one across trials can raise an error.
    dtrain = lgb.Dataset(X_train, label=y_train)
    cv_result = lgb.cv(
        params, dtrain, nfold=5,
        num_boost_round=500,
        stratified=True,
        # The early_stopping_rounds argument was removed from lgb.cv in
        # LightGBM 4.0; the callback works in both 3.x and 4.x.
        callbacks=[lgb.early_stopping(50, verbose=False)],
    )
    # The result key is 'valid auc-mean' in LightGBM 4.x and 'auc-mean' in 3.x
    auc_key = next(k for k in cv_result if k.endswith('auc-mean'))
    return max(cv_result[auc_key])

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
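One practical detail when reusing the result: `study.best_params` is keyed by the names passed to `trial.suggest_*` (`'lr'`, `'min_child'`, and so on), not by LightGBM's parameter names, so it cannot be handed to LightGBM unchanged. A small translation step handles this. The helper `to_lgb_params` and the example values below are hypothetical, written only to match the trial names used in the objective above.

```python
# Mapping from the trial names used in the objective to LightGBM parameter names
TRIAL_TO_LGB = {
    'lr': 'learning_rate',
    'num_leaves': 'num_leaves',
    'max_depth': 'max_depth',
    'min_child': 'min_data_in_leaf',
    'feat_frac': 'feature_fraction',
    'bag_frac': 'bagging_fraction',
    'l1': 'lambda_l1',
    'l2': 'lambda_l2',
}

def to_lgb_params(best_params):
    """Translate Optuna's best_params into a LightGBM params dict."""
    # Fixed (non-tuned) settings from the objective must be repeated here
    params = {'objective': 'binary', 'metric': 'auc', 'verbose': -1}
    for trial_name, value in best_params.items():
        params[TRIAL_TO_LGB[trial_name]] = value
    return params

# Made-up values standing in for study.best_params
example_best = {'lr': 0.05, 'num_leaves': 64, 'max_depth': 7, 'min_child': 20,
                'feat_frac': 0.8, 'bag_frac': 0.9, 'l1': 1e-3, 'l2': 0.1}
final_params = to_lgb_params(example_best)
print(final_params)
```

With a finished study, `to_lgb_params(study.best_params)` can be passed to `lgb.train` together with a Dataset built from `X_train`, and the resulting model scored once on the held-out `X_test`, which the search never touched.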
