Hyperparameter Tuning
“Automating the art of finding the right knobs to turn”
Grid Search (exhaustive), Random Search (surprisingly effective), Bayesian Optimisation (TPE/GP-based sequential search), Successive Halving, and Optuna — with interactive accuracy heatmap showing C × max_depth search space.
Key Formulas
Grid Search
Exhaustive search over all combinations in the predefined grid
Successive Halving
Progressively eliminate poor candidates, allocating more resources to promising ones
Expected Improvement
Bayesian Optimisation acquisition function — trades exploration vs exploitation
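In standard notation, the three ideas above are commonly written as follows. The symbols are assumptions introduced here, not taken from this page: n_i is the number of values for parameter i, η is the halving factor, n_0 and r_0 are the initial candidate count and per-candidate resource, μ(θ) and σ(θ) are the surrogate's posterior mean and standard deviation, and f* is the best score observed so far (maximisation).

```latex
% Grid Search: number of configurations for d parameters
N_{\text{grid}} = \prod_{i=1}^{d} n_i

% Successive Halving: candidates and per-candidate resource at round k = 0, 1, 2, \dots
n_k = \left\lfloor \frac{n_0}{\eta^{k}} \right\rfloor, \qquad
r_k = r_0 \, \eta^{k}, \qquad
n_k \, r_k \approx n_0 \, r_0 \ \text{(cost per round)}

% Expected Improvement under a Gaussian posterior
\mathrm{EI}(\theta) = \mathbb{E}\!\left[\max\bigl(f(\theta) - f^{*},\, 0\bigr)\right]
                    = \bigl(\mu(\theta) - f^{*}\bigr)\,\Phi(z) + \sigma(\theta)\,\phi(z),
\qquad z = \frac{\mu(\theta) - f^{*}}{\sigma(\theta)}
```

Here Φ and φ are the standard normal CDF and PDF. The first EI term rewards a high predicted score (exploitation), the second rewards high uncertainty (exploration).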
Why Hyperparameters Matter
A random forest with max_depth=5 might score 0.72 AUC. The same algorithm with max_depth=12, min_samples_leaf=3, max_features='sqrt' scores 0.89 AUC. That 17-point gap is pure hyperparameter tuning — the algorithm didn't change, the data didn't change. Hyperparameters are parameters that are not learned from data; they control the learning process itself. Choosing them well is often the difference between a mediocre model and a production-ready one.
The learning rate is often the single most important hyperparameter in gradient-based models. Too high and training diverges; too low and convergence is slow or stalls in a poor region of the loss surface. Tune it first.
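A minimal sketch of the learning-rate effect, assuming a toy objective f(x) = x² and plain gradient descent. The function name `gradient_descent` and the three learning rates are illustrative choices, not values from any real model.

```python
def gradient_descent(lr, steps=50, x0=1.0):
    """Minimise f(x) = x**2 (gradient 2x) and return the final distance |x| from the optimum."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # standard gradient step
    return abs(x)

# Too high, too low, and a reasonable learning rate on the same problem.
for lr in (1.1, 0.001, 0.3):
    print(f"lr={lr}: |x| after 50 steps = {gradient_descent(lr):.3g}")
```

Each step multiplies x by (1 − 2·lr). With lr = 1.1 that factor has magnitude above 1, so the iterate grows without bound. With lr = 0.001 the iterate barely moves in 50 steps. With lr = 0.3 it shrinks rapidly towards the optimum at 0.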
Grid vs Random vs Bayesian
Grid Search evaluates every combination in the Cartesian product of parameter values. It is thorough, but its cost grows exponentially with the number of parameters: 10 parameters × 5 values each = 5¹⁰ ≈ 10M configurations, each multiplied by the number of CV folds. Random Search samples n_iter random combinations. It is surprisingly effective because most hyperparameter spaces have only a few dimensions that truly matter, and random sampling tries more distinct values along each of those dimensions than a grid of the same size. Bayesian Optimisation maintains a probabilistic model of the objective surface (a Gaussian Process or Tree-structured Parzen Estimator) and sequentially suggests the configuration that maximizes an acquisition function such as expected improvement, so it learns from previous evaluations and focuses on promising regions.
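Random Search with n_iter=60 often matches or beats a much larger grid: 60 independent draws include at least one configuration from the top 5% of the search space with probability 1 − 0.95⁶⁰ ≈ 95%. Bayesian Optimisation tends to outperform both when each evaluation is expensive (e.g., training a large neural net), because it needs fewer trials to reach a good region.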
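The arithmetic behind these comparisons can be checked with a short standard-library sketch. The helper names `grid_size` and `prob_hit_top_fraction` are hypothetical, and the 3 × 3 grid is only an illustration of how many distinct values each strategy tries per parameter.

```python
import math
import random
from itertools import product

def grid_size(param_grid):
    """Number of configurations in the Cartesian product of a grid."""
    return math.prod(len(values) for values in param_grid.values())

def prob_hit_top_fraction(n_iter, top=0.05):
    """P(at least one of n_iter independent uniform draws lands in the best `top` fraction)."""
    return 1.0 - (1.0 - top) ** n_iter

# 10 parameters with 5 values each.
grid = {f"p{i}": range(5) for i in range(10)}
print("grid configurations:", grid_size(grid))
print("P(top 5% hit in 60 draws):", round(prob_hit_top_fraction(60), 3))

# With 9 evaluations over two parameters, a 3 x 3 grid tries only 3 distinct
# values of the first parameter, while 9 random draws try 9.
rng = random.Random(0)
grid_points = list(product([0.1, 0.5, 0.9], repeat=2))
random_points = [(rng.random(), rng.random()) for _ in range(9)]
distinct_grid = len({p[0] for p in grid_points})
distinct_rand = len({p[0] for p in random_points})
print("distinct values of parameter 1:", distinct_grid, "(grid) vs", distinct_rand, "(random)")
```

If only the first parameter matters, the grid has effectively spent 9 evaluations to test 3 settings, which is the intuition for why random search covers the important dimensions better.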
Bayesian Optimisation Loop
Fit a surrogate model (Gaussian Process) to previous (θ, score) observations
Use acquisition function (Expected Improvement, UCB) to select next θ
EI: explore where uncertainty is high OR where expected gain is high
Evaluate the actual objective: train model with θ, compute CV score
Add new observation to dataset, refit surrogate
Repeat until budget exhausted — return best θ found
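The loop above can be sketched in pure Python on a one-dimensional toy problem. Everything here is an assumption made for illustration: `objective` is a hypothetical stand-in for a cross-validated score, the surrogate is a zero-mean Gaussian Process with a squared-exponential kernel and a hand-picked length scale, and candidates come from a fixed grid over [0, 1]. Real libraries handle kernel fitting and candidate search far more carefully.

```python
import math
from statistics import NormalDist

def objective(theta):
    """Hypothetical stand-in for a CV score; it peaks at theta = 0.7."""
    return math.exp(-((theta - 0.7) ** 2) / 0.02)

def rbf(a, b, length=0.15):
    """Squared-exponential kernel (the length scale is an assumed value)."""
    return math.exp(-((a - b) ** 2) / (2 * length ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def gp_posterior(xs, ys, x_new, jitter=1e-4):
    """Posterior mean and std of the GP surrogate at x_new, refit from all observations."""
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * w for k, w in zip(k_star, solve(K, ys)))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mean, math.sqrt(max(var, 1e-12))

def expected_improvement(mean, std, best):
    """EI for maximisation under a Gaussian posterior."""
    z = (mean - best) / std
    return (mean - best) * NormalDist().cdf(z) + std * NormalDist().pdf(z)

xs = [0.1, 0.5, 0.9]                         # initial observations
ys = [objective(x) for x in xs]
candidates = [i / 100 for i in range(101)]   # search space: grid over [0, 1]

for _ in range(12):                          # evaluation budget
    best = max(ys)
    untried = [c for c in candidates if c not in xs]
    # Select the candidate with the highest expected improvement.
    theta = max(untried, key=lambda c: expected_improvement(*gp_posterior(xs, ys, c), best))
    xs.append(theta)                         # evaluate the real objective and record it
    ys.append(objective(theta))

best_theta = xs[ys.index(max(ys))]
print(f"best theta = {best_theta:.2f}, score = {max(ys):.3f}")
```

The surrogate is rebuilt from scratch on every call, which is wasteful but keeps the "refit, select, evaluate, record" cycle easy to follow.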
Halving Search: Speed Without Sacrifice
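HalvingGridSearchCV and HalvingRandomSearchCV implement successive halving: start with all candidates on a minimal resource budget (few training samples or estimators), keep the top 1/η fraction, multiply the per-candidate budget by η, and repeat. With η = 2, the first 4 rounds of a 1024-candidate grid cost 1024×1 + 512×2 + 256×4 + 128×8 = 4096 resource units, because every round costs the same 1024 units. Continuing until a single candidate remains takes 10 rounds and 10 × 1024 = 10,240 units, whereas training all 1024 candidates at the final-round budget of 512 units would cost 1024 × 512 = 524,288 units, about 51× more. In practice this gives a 10–100× speedup for large grids, usually with little loss in quality. Note that scikit-learn's default is factor=3 rather than 2.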
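A minimal standard-library sketch of the successive halving schedule, assuming η = 2 and abstract "resource units". The `successive_halving` function and the `noisy_score` evaluator are hypothetical illustrations, not scikit-learn's implementation.

```python
import math
import random

def successive_halving(candidates, evaluate, min_resource=1, eta=2):
    """Return (winner, total resource units spent)."""
    survivors = list(candidates)
    resource, total_cost = min_resource, 0
    while len(survivors) > 1:
        # Score every survivor at the current per-candidate budget.
        scores = {c: evaluate(c, resource) for c in survivors}
        total_cost += len(survivors) * resource
        # Keep the top 1/eta fraction, then give survivors eta times more budget.
        keep = max(1, len(survivors) // eta)
        survivors = sorted(survivors, key=scores.get, reverse=True)[:keep]
        resource *= eta
    return survivors[0], total_cost

rng = random.Random(0)

def noisy_score(candidate, resource):
    """Hypothetical evaluation: true quality candidate/1023 plus noise that shrinks with budget."""
    return candidate / 1023 + rng.gauss(0, 0.3 / math.sqrt(resource))

winner, cost = successive_halving(range(1024), noisy_score)
full_cost = 1024 * 512  # every candidate trained at the final-round budget
print(f"winner={winner}, halving cost={cost}, full cost={full_cost}, "
      f"ratio={full_cost / cost:.1f}x")
```

The cost depends only on the schedule, not on the scores, so the ratio is fixed by the number of candidates and η. Which candidate wins does depend on the noise: a good candidate can be eliminated early if its low-budget score is unlucky, which is the quality risk that halving trades for speed.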
For neural networks, use Keras Tuner or Optuna rather than sklearn's search — they support asynchronous parallel trials, early stopping integration, and neural-specific search spaces.
All Three Methods in scikit-learn
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     cross_val_score, train_test_split)
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from scipy.stats import uniform, randint
import optuna  # for Bayesian

# --- Sample data ---
X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# --- Parameter space ---
param_grid = {
    'n_estimators': [100, 200, 400],
    'max_depth': [3, 5, 7, 9],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'subsample': [0.7, 0.8, 1.0],
    'min_samples_leaf': [1, 3, 5],
}

# --- 1. Grid Search (exhaustive, expensive) ---
gs = GridSearchCV(GradientBoostingClassifier(), param_grid,
                  cv=5, scoring='roc_auc', n_jobs=-1)
gs.fit(X_train, y_train)
print(f"Grid best: {gs.best_score_:.4f} {gs.best_params_}")

# --- 2. Random Search (fast, almost-as-good) ---
param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(2, 12),
    'learning_rate': uniform(0.005, 0.3),
    'subsample': uniform(0.6, 0.4),
}
rs = RandomizedSearchCV(GradientBoostingClassifier(), param_dist,
                        n_iter=60, cv=5, scoring='roc_auc',
                        n_jobs=-1, random_state=42)
rs.fit(X_train, y_train)
print(f"Random best: {rs.best_score_:.4f} {rs.best_params_}")

# --- 3. Optuna (Bayesian, best quality) ---
def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 2, 12),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
    }
    model = GradientBoostingClassifier(**params)
    return cross_val_score(model, X_train, y_train,
                           cv=3, scoring='roc_auc').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100, n_jobs=4)
print(f"Optuna best: {study.best_value:.4f} {study.best_params}")
Hyperparameter Tuning Pitfalls
First: tuning on the test set inflates performance estimates. Tune using only cross-validation on the training data.
Second: nested cross-validation is needed for an unbiased performance estimate whenever the same data drives model selection or hyperparameter tuning. The inner loop selects hyperparameters; the outer loop estimates generalization error.
Third: the "winner's curse". With 1000 random configurations, the best CV score is optimistic by chance alone, because it is the maximum of 1000 noisy estimates. Use a holdout set to verify the best configuration.
Fourth: don't tune everything simultaneously. Fix the learning rate first, then regularization, then architecture.
Overfitting to the validation set is real. With enough hyperparameter trials, you will find a configuration that accidentally scores well on your CV folds but generalizes poorly. Always do a final evaluation on a truly held-out test set.
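The selection bias described here can be reproduced with a small simulation. It is a sketch under loud assumptions: every configuration has the same true score (`TRUE_SCORE`), and each validation estimate is that score plus Gaussian noise with an assumed spread (`NOISE_STD`). Neither number comes from a real experiment.

```python
import random
import statistics

rng = random.Random(0)
TRUE_SCORE = 0.80   # every configuration is equally good by construction
NOISE_STD = 0.02    # assumed spread of a validation estimate around the truth

def best_of(n_configs):
    """Best observed validation score among n_configs equally good configurations."""
    return max(rng.gauss(TRUE_SCORE, NOISE_STD) for _ in range(n_configs))

results = {}
for n in (1, 10, 100, 1000):
    # Average the "best score" over 200 repeated searches to smooth out luck.
    results[n] = statistics.mean(best_of(n) for _ in range(200))
    print(f"{n:>5} configs: mean best validation score = {results[n]:.3f} "
          f"(true = {TRUE_SCORE})")
```

The reported best score rises with the number of trials even though no configuration is actually better than any other. A final evaluation on untouched test data is what brings the estimate back to the true value.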