Gradient Boosting: XGBoost, LightGBM, CatBoost
“Many small corrections beat one big guess — sequentially chasing the residuals”
From vanilla Gradient Boosting to XGBoost (second-order tree scores), then LightGBM (histogram-based, leaf-wise growth) and CatBoost (ordered boosting for categoricals), plus Optuna patterns for hyperparameter optimization (HPO).
Key Formulas
Boosting Ensemble
Final prediction = sum of M weak learners, each weighted by γ
Pseudo-Residuals
Negative gradient of loss — what the next tree should learn
XGBoost Tree Score
Quality of a tree structure, computed per leaf from its gradient sums (G = sum of first derivatives, H = sum of second derivatives)
Split Gain
Improvement in objective from splitting a leaf into two
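The four formulas named above can be written out as follows. This is a reconstruction in the standard notation of the gradient boosting and XGBoost literature, not a copy of the page's original rendering. The symbols h_m (the m-th weak learner), λ (L2 penalty on leaf values), T (number of leaves), and τ (per-leaf complexity penalty, which XGBoost itself calls gamma) are assumptions introduced here; τ is used only to avoid a clash with the learner weight γ.

```latex
% Boosting ensemble: additive model of M weak learners h_m, each weighted by \gamma_m
\[ F_M(x) = F_0(x) + \sum_{m=1}^{M} \gamma_m \, h_m(x) \]

% Pseudo-residuals: negative gradient of the loss at the current model F_{m-1}
\[ r_{im} = -\left[ \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right]_{F = F_{m-1}} \]

% XGBoost tree score: objective value of a tree with T leaves (lower is better),
% where G_j and H_j are the first- and second-derivative sums in leaf j
\[ \text{Obj} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \tau T \]

% Split gain: improvement from splitting one leaf into left (L) and right (R) children
\[ \text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}
   - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \tau \]
```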
Why Gradient Boosting Dominates Tabular Data
Since XGBoost's release in 2014, gradient boosting libraries (XGBoost, LightGBM, CatBoost) have been a staple of winning Kaggle solutions on structured data. They're often the strongest default for tabular ML because they don't require feature scaling, handle missing values, capture complex non-linear interactions, and come with built-in regularization; LightGBM and CatBoost also handle categorical features natively. Understanding how they work pays off on almost any tabular problem.
The Core Idea: Fit the Mistakes
Suppose your current model predicts house prices and it's wrong by $50k on house A. Instead of retraining from scratch, train a new small tree to predict exactly that $50k error. Add it to your model. Now you're off by less. Repeat. Each new tree targets the residual errors of all previous trees combined. This is gradient boosting — gradient descent in function space.
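The "fit the mistakes" loop can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how any of the libraries implement it: the weak learner is a one-split stump on a single feature, the loss is squared error, and the names `fit_stump`, `boost`, and the house-price numbers are made up for the example.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # split rule: x <= t goes left
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Repeatedly fit a stump to the current residuals and add it to the model."""
    base = sum(ys) / len(ys)          # F_0: predict the mean
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # what is still wrong
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv)
                 for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history


xs = [1, 2, 3, 4, 5, 6, 7, 8]                      # e.g. number of rooms
ys = [150, 160, 210, 220, 300, 310, 400, 390]      # made-up prices in $k
base, stumps, history = boost(xs, ys)
print(f"training MSE after round 1: {history[0]:.1f}, after round 20: {history[-1]:.1f}")
```

Each round shrinks the training error a little, which is the sense in which many small corrections add up.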
The 'gradient' in gradient boosting refers to functional gradient descent, not parameter gradient descent. We're optimizing in the space of functions, not weights.
Gradient Boosting as Gradient Descent in Function Space
At step m, we fit a tree hₘ to the negative gradient of the loss with respect to the current prediction F_{m-1}(x). For MSE loss L = ½(y - F(x))², the negative gradient is exactly the residual r = y - F(x). For other losses (log loss, MAE), we get different 'pseudo-residuals' — hence the generality of the framework.
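As a quick check of the claim above, the sketch below computes pseudo-residuals for squared error and for binary log loss and compares them to a numerical derivative. It assumes the log loss is parameterized by the raw score F (the log-odds), which gives the pseudo-residual y - sigmoid(F); the function names are illustrative.

```python
import math


def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))


def mse_loss(y, f):
    return 0.5 * (y - f) ** 2


def log_loss(y, f):
    """Binary log loss as a function of the raw score f (log-odds), y in {0, 1}."""
    p = sigmoid(f)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def pseudo_residual_mse(y, f):
    return y - f              # plain residual


def pseudo_residual_logloss(y, f):
    return y - sigmoid(f)     # label minus predicted probability


def numeric_negative_gradient(loss, y, f, eps=1e-6):
    """Central-difference estimate of -dL/dF, used only to verify the formulas."""
    return -(loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)


for y, f in [(1, 0.3), (0, -1.2), (1, 2.0)]:
    analytic = pseudo_residual_logloss(y, f)
    numeric = numeric_negative_gradient(log_loss, y, f)
    print(f"y={y}, F={f}: analytic={analytic:.6f}, numeric={numeric:.6f}")
```

The same tree-fitting machinery works for either loss; only the target handed to the next tree changes.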
XGBoost: Second-Order Optimization
Friedman's original GBM fits each tree to first-order gradients (pseudo-residuals). XGBoost uses a second-order Taylor expansion of the loss, with both first derivatives (summed as G) and second derivatives (summed as H), giving it curvature information, like Newton's method vs. gradient descent. H acts as a natural adaptive step size in the leaf value: leaves where the loss has high curvature (H large) get smaller effective steps.
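To make the role of G and H concrete, here is a small sketch of the Newton-style leaf value and the split gain in the usual XGBoost form. The function names, the `penalty` argument (XGBoost's per-leaf gamma), and the numbers are assumptions chosen for illustration.

```python
def leaf_weight(G, H, lam=1.0):
    """Newton step for a leaf: -G / (H + lambda)."""
    return -G / (H + lam)


def leaf_score(G, H, lam=1.0):
    """Structure score of a leaf: G^2 / (H + lambda). Higher means a better fit."""
    return G * G / (H + lam)


def split_gain(G_L, H_L, G_R, H_R, lam=1.0, penalty=0.0):
    """Objective improvement from splitting a leaf into left and right children."""
    parent = leaf_score(G_L + G_R, H_L + H_R, lam)
    children = leaf_score(G_L, H_L, lam) + leaf_score(G_R, H_R, lam)
    return 0.5 * (children - parent) - penalty


# Same gradient sum, different curvature: larger H gives a smaller step.
print(leaf_weight(-4.0, 2.0))    # 4/3, about 1.33
print(leaf_weight(-4.0, 20.0))   # 4/21, about 0.19

# A split that separates negative-gradient samples from positive-gradient ones.
print(split_gain(G_L=-4.0, H_L=2.0, G_R=3.0, H_R=1.0))  # 115/24, about 4.79
```

A split is only worth making when its gain is positive, which is how `penalty` and `lam` act as regularizers.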
XGBoost vs LightGBM vs CatBoost
Three major frameworks, each with distinct architectural innovations:
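XGBoost: Level-wise (depth-wise) tree growth by default + second-order optimization. Mature and widely deployed. Historically slower than LightGBM on large data, though its histogram tree method narrows the gap. A solid choice for small-medium datasets.
LightGBM: Leaf-wise growth (best leaf first) + histogram binning (continuous → discrete bins). Its authors report up to roughly 20x faster training than conventional GBDT; the actual speedup depends on data size and settings. Best for large datasets.
CatBoost: Ordered boosting reduces target leakage. Native categorical handling (no manual encoding needed). Best when you have many categorical features.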
Rule of thumb: Start with LightGBM. Use CatBoost with heavy categoricals. Use XGBoost for small datasets where speed doesn't matter.
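The histogram idea from the LightGBM entry can be sketched as follows: bucket a feature into a few bins, sum gradients and Hessians per bin, then scan bin boundaries instead of every distinct value. This is a simplified illustration that assumes equal-width bins and a single feature; LightGBM's real binning and split search are more elaborate, and the function names here are hypothetical.

```python
def build_histogram(values, grads, hess, n_bins=8):
    """Bucket one feature into equal-width bins; sum gradients/Hessians per bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # avoid zero width for constant features
    G = [0.0] * n_bins
    H = [0.0] * n_bins
    for v, g, h in zip(values, grads, hess):
        b = min(int((v - lo) / width), n_bins - 1)  # clamp the max value into the last bin
        G[b] += g
        H[b] += h
    return G, H


def best_bin_split(G, H, lam=1.0):
    """Scan bin boundaries; return (gain, last_bin_on_left) of the best one."""
    G_tot, H_tot = sum(G), sum(H)
    parent = G_tot ** 2 / (H_tot + lam)
    best = (0.0, None)
    G_L = H_L = 0.0
    for b in range(len(G) - 1):
        G_L += G[b]
        H_L += H[b]
        G_R, H_R = G_tot - G_L, H_tot - H_L
        gain = 0.5 * (G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam) - parent)
        if gain > best[0]:
            best = (gain, b)
    return best


# Squared-error toy data: target jumps from 0 to 10 at value 8, current prediction is 5,
# so the gradient (prediction - target) is +5 below the jump and -5 above it.
values = list(range(16))
grads = [5.0 if v < 8 else -5.0 for v in values]
hess = [1.0] * len(values)

G, H = build_histogram(values, grads, hess, n_bins=8)
gain, left_bin = best_bin_split(G, H)
print(f"best boundary is after bin {left_bin} with gain {gain:.2f}")
```

The scan costs one pass over the bins rather than one pass over all sorted values, which is where the speed on large datasets comes from.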
LightGBM Leaf-Wise Growth
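Initialize F₀(x) = log(p/(1-p)) for binary classification, where p is the fraction of positive labels
For m = 1 to M:
    Compute gradients gᵢ = ∂L/∂F(xᵢ) and Hessians hᵢ = ∂²L/∂F(xᵢ)² at F = F_{m-1} (the pseudo-residual is rᵢ = -gᵢ)
    Grow the tree by repeatedly splitting the leaf with the highest gain (globally, not level-by-level) until num_leaves is reached
    Compute leaf values: γⱼ = -Σᵢgᵢ / (Σᵢhᵢ + λ), summing over the samples i in leaf j
    Update: F_m(x) = F_{m-1}(x) + ν · γ_{leaf(x)}
    Stop early if the validation loss has stopped improving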
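The "best leaf first" step above can be sketched with a priority queue keyed on split gain. This is a minimal single-feature illustration, not LightGBM's implementation: it uses exact split search instead of histograms, and the names `best_split` and `grow_leaf_wise` are hypothetical.

```python
import heapq


def best_split(idx, x, g, h, lam):
    """Best (gain, threshold) for the samples in idx on one feature; (0.0, None) if none helps."""
    order = sorted(idx, key=lambda i: x[i])
    G_tot = sum(g[i] for i in idx)
    H_tot = sum(h[i] for i in idx)
    parent = G_tot ** 2 / (H_tot + lam)
    best_gain, best_thr = 0.0, None
    G_L = H_L = 0.0
    for a, b in zip(order, order[1:]):
        G_L += g[a]
        H_L += h[a]
        if x[a] == x[b]:
            continue  # cannot split between equal values
        G_R, H_R = G_tot - G_L, H_tot - H_L
        gain = 0.5 * (G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam) - parent)
        if gain > best_gain:
            best_gain, best_thr = gain, (x[a] + x[b]) / 2
    return best_gain, best_thr


def grow_leaf_wise(x, g, h, num_leaves=4, lam=1.0):
    """Grow one tree best-first; return a list of (sample_indices, leaf_value)."""
    root = list(range(len(x)))
    gain, thr = best_split(root, x, g, h, lam)
    heap = [(-gain, 0, root, thr)]  # negated gain makes heapq a max-heap; counter breaks ties
    counter, final = 1, []
    while heap and len(heap) + len(final) < num_leaves:
        _, _, idx, thr = heapq.heappop(heap)  # leaf with the highest gain anywhere in the tree
        if thr is None:
            final.append(idx)  # no useful split left for this leaf
            continue
        for child in ([i for i in idx if x[i] <= thr], [i for i in idx if x[i] > thr]):
            c_gain, c_thr = best_split(child, x, g, h, lam)
            heapq.heappush(heap, (-c_gain, counter, child, c_thr))
            counter += 1
    final += [item[2] for item in heap]
    return [(idx, -sum(g[i] for i in idx) / (sum(h[i] for i in idx) + lam))
            for idx in final]


# Squared error with current prediction 0: gradient = -target, Hessian = 1.
x = [1, 2, 3, 4, 5, 6, 7, 8]
g = [0.0, 0.0, 0.0, 0.0, -1.0, -1.0, -5.0, -5.0]
h = [1.0] * 8
leaves = grow_leaf_wise(x, g, h, num_leaves=3)
for idx, value in sorted(leaves):
    print([x[i] for i in idx], round(value, 3))
```

With a budget of three leaves, the grower spends its second split wherever the gain is largest rather than finishing a level first, which is the difference from level-wise growth.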
Production LightGBM with Optuna
import lightgbm as lgb
import optuna
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# ── Sample data ────────────────────────────────────────────────────────
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)


def objective(trial):
    params = {
        'objective': 'binary',
        'metric': 'auc',
        'learning_rate': trial.suggest_float('lr', 0.01, 0.3, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 20, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'min_data_in_leaf': trial.suggest_int('min_child', 10, 100),
        'feature_fraction': trial.suggest_float('feat_frac', 0.4, 1.0),
        'bagging_fraction': trial.suggest_float('bag_frac', 0.4, 1.0),
        'bagging_freq': 1,  # bagging_fraction only takes effect when bagging_freq > 0
        'lambda_l1': trial.suggest_float('l1', 1e-8, 10.0, log=True),
        'lambda_l2': trial.suggest_float('l2', 1e-8, 10.0, log=True),
        'verbose': -1,
    }
    # Build the Dataset per trial: min_data_in_leaf changes between trials, and a
    # reused Dataset pre-filters features based on the first value it was built with.
    dtrain = lgb.Dataset(X_train, label=y_train)
    cv_result = lgb.cv(
        params, dtrain,
        nfold=5,
        num_boost_round=500,
        stratified=True,
        # Recent LightGBM versions take early stopping as a callback
        # rather than an early_stopping_rounds argument.
        callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=False)],
    )
    # The result key is 'valid auc-mean' or 'auc-mean' depending on the LightGBM version.
    auc_key = next(k for k in cv_result if k.endswith('auc-mean'))
    return max(cv_result[auc_key])


study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
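One detail worth noting in the tuning code: the names passed to trial.suggest_* ('lr', 'min_child', 'feat_frac', and so on) are Optuna labels, not LightGBM parameter names, so study.best_params cannot be handed to LightGBM unchanged. A possible way to translate them is sketched below; the helper `to_lgb_params` and the example values are hypothetical, and the dictionary stands in for a real study.best_params.

```python
# Optuna trial names used in the tuning code, mapped to LightGBM parameter names.
OPTUNA_TO_LGB = {
    'lr': 'learning_rate',
    'num_leaves': 'num_leaves',
    'max_depth': 'max_depth',
    'min_child': 'min_data_in_leaf',
    'feat_frac': 'feature_fraction',
    'bag_frac': 'bagging_fraction',
    'l1': 'lambda_l1',
    'l2': 'lambda_l2',
}

# Settings that were fixed rather than tuned.
FIXED_PARAMS = {'objective': 'binary', 'metric': 'auc', 'verbose': -1}


def to_lgb_params(best_params):
    """Translate Optuna's best_params into a LightGBM params dict."""
    params = dict(FIXED_PARAMS)
    for name, value in best_params.items():
        params[OPTUNA_TO_LGB[name]] = value
    return params


# Stand-in for study.best_params; the values are made up for illustration.
example_best = {
    'lr': 0.05, 'num_leaves': 64, 'max_depth': 7, 'min_child': 30,
    'feat_frac': 0.8, 'bag_frac': 0.9, 'l1': 1e-3, 'l2': 0.1,
}
final_params = to_lgb_params(example_best)
print(final_params)
```

The resulting dictionary can then be used to train a final model on the full training split and evaluate it on the held-out test split.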