Model Evaluation & Metrics
“Accuracy is a lie — learning to choose the right metric for the real problem”
Complete guide: accuracy, precision, recall, F1, ROC-AUC, confusion matrix, PR curves, cross-validation (StratifiedKFold, TimeSeriesSplit), and choosing the right metric for your task.
Key Formulas
F1 Score
Harmonic mean of precision and recall
AUC-ROC
Probability that a random positive ranks higher than a random negative
MCC
Matthews Correlation Coefficient: uses all four confusion-matrix cells, so it stays informative under class imbalance
RMSE
Root Mean Squared Error — penalizes large errors, same units as target
R² Score
Fraction of variance explained; 1=perfect, 0=predict mean, <0=worse than mean
MAE
Mean Absolute Error — robust to outliers, interpretable in target units
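The names above can be written out explicitly. The block below is a reference sketch using the usual textbook definitions; the symbols (TP, FP, TN, FN for confusion-matrix counts, P and R for precision and recall, s(·) for the classifier score, ȳ for the mean target) are notation chosen here, not taken from the original page.

```latex
\begin{align*}
F_1 &= \frac{2PR}{P + R} = \frac{2\,TP}{2\,TP + FP + FN} \\[4pt]
\mathrm{AUC} &= \Pr\bigl(s(x^{+}) > s(x^{-})\bigr) \\[4pt]
\mathrm{MCC} &= \frac{TP \cdot TN - FP \cdot FN}
  {\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \\[4pt]
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \\[4pt]
R^2 &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \\[4pt]
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
\end{align*}
```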
When Accuracy Kills
Imagine predicting cancer (0.1% prevalence). A model that predicts 'no cancer' for everyone achieves 99.9% accuracy — and kills patients. In fraud detection (0.5% fraud rate), high accuracy is meaningless. The choice of metric is a business decision, not a technical one. Getting it wrong can mean deploying a model that optimizes for the wrong thing entirely.
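The screening scenario above can be checked with a few lines of arithmetic. This is a minimal sketch on a hypothetical population of 100,000 people (the population size is an assumption for illustration; only the 0.1% prevalence comes from the text), using a "model" that always predicts the negative class.

```python
# Hypothetical screening population: 100,000 people, 0.1% prevalence.
n_total = 100_000
n_positive = 100  # 0.1% of 100,000

y_true = [1] * n_positive + [0] * (n_total - n_positive)
y_pred = [0] * n_total  # always predict "no cancer"

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

accuracy = correct / n_total          # share of all predictions that are right
recall = true_positives / n_positive  # share of actual positives that are found

print(f"accuracy = {accuracy:.3f}")  # 0.999
print(f"recall   = {recall:.3f}")    # 0.000
```

Accuracy looks excellent while recall is zero: every actual case is missed.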
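Amazon's experimental résumé-screening model, reported by Reuters in 2018, was scrapped after it was found to downgrade résumés associated with women. The bias was learned from historical hiring data, and a single aggregate score such as accuracy would not have exposed it.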
The Confusion Matrix as a Complete Picture
Every prediction falls into four categories: True Positive (correctly predicted positive), False Positive (predicted positive, actually negative), True Negative (correctly predicted negative), False Negative (predicted negative, actually positive). From these four numbers, every classification metric derives. FP = Type I error (false alarm). FN = Type II error (miss). Which matters more depends entirely on the application.
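To make "every classification metric derives from these four numbers" concrete, here is a small sketch that computes several metrics from raw counts. The function name `metrics_from_counts` and the example counts are hypothetical, chosen only for illustration; returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
import math

def metrics_from_counts(tp, fp, tn, fn):
    """Derive common classification metrics from the four confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }

# Illustrative counts: 100 actual positives, 900 actual negatives.
m = metrics_from_counts(tp=80, fp=40, tn=860, fn=20)
for name, value in m.items():
    print(f"{name:12s} {value:.3f}")
```

With these counts accuracy is 0.94 while precision is only about 0.67, which is the kind of gap the single accuracy number hides.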
ROC Curve: Threshold-Independent Evaluation
A classifier produces a score, not just a binary prediction. The threshold we apply to convert score → label is a design choice. The ROC curve shows all possible tradeoffs by sweeping the threshold from 0 to 1: plotting TPR (recall) vs FPR (1-specificity). AUC = 0.5 means random guessing, AUC = 1.0 is perfect. AUC has a beautiful probabilistic interpretation: P(score(positive) > score(negative)).
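The probabilistic interpretation can be verified directly by counting positive/negative pairs. This is a minimal sketch; `pairwise_auc` is a hypothetical helper, and counting ties as half a win is the usual convention, assumed here. It is quadratic in the number of samples, so it is meant for understanding rather than for large datasets.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

No threshold appears anywhere in the computation, which is why AUC summarizes ranking quality rather than performance at one operating point.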
Stratified K-Fold: The Right Way to Validate
A single hold-out split leaves part of the data unused for training and gives a high-variance estimate. K-Fold cross-validation uses every sample for validation exactly once and for training in the remaining folds. Stratified K-Fold ensures each fold has roughly the same class distribution as the full dataset, which is critical for imbalanced problems. TimeSeriesSplit prevents temporal leakage: future data never informs past predictions, because each validation block comes after its training data.
StratifiedKFold: maintains class proportions in each fold
TimeSeriesSplit: all training data comes before validation data in time
GroupKFold: ensures all samples from the same group (patient, user) are in the same fold
RepeatedStratifiedKFold: repeat K-Fold N times with different random seeds → lower variance estimate
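The guarantees listed above can be checked on toy data. The sketch below assumes scikit-learn and NumPy are installed; the arrays `X`, `y`, and `groups` are made up for illustration (20 samples, 20% positives, 5 groups of 4 samples).

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, TimeSeriesSplit

# Toy data, purely illustrative.
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 4)       # 20% positive class
groups = np.repeat(np.arange(5), 4)    # e.g. 5 patients, 4 samples each

# StratifiedKFold: each validation fold keeps the class proportions.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
fold_pos_counts = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print("positives per fold:", fold_pos_counts)

# TimeSeriesSplit: training indices always precede validation indices.
tss = TimeSeriesSplit(n_splits=4)
time_ok = all(tr_idx.max() < val_idx.min() for tr_idx, val_idx in tss.split(X))
print("train always before validation:", time_ok)

# GroupKFold: no group appears on both sides of a split.
gkf = GroupKFold(n_splits=5)
group_ok = all(
    set(groups[tr_idx]).isdisjoint(groups[val_idx])
    for tr_idx, val_idx in gkf.split(X, y, groups)
)
print("groups never shared across train/validation:", group_ok)
```

TimeSeriesSplit assumes the rows are already sorted by time; it does not sort them for you.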
Complete Evaluation Pipeline
```python
from sklearn.metrics import (
    classification_report, roc_auc_score, f1_score,
    average_precision_score, matthews_corrcoef, confusion_matrix
)
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np

# ── Sample data + model ────────────────────────────────────────────────
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = GradientBoostingClassifier(n_estimators=100, random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_probs = np.zeros(len(y_train))

for fold, (tr_idx, val_idx) in enumerate(skf.split(X_train, y_train)):
    model.fit(X_train[tr_idx], y_train[tr_idx])
    oof_probs[val_idx] = model.predict_proba(X_train[val_idx])[:, 1]
    print(f"Fold {fold+1} AUC: {roc_auc_score(y_train[val_idx], oof_probs[val_idx]):.4f}")

# Full OOF evaluation
print(f"\nOOF AUC: {roc_auc_score(y_train, oof_probs):.4f}")
print(f"OOF AUC-PR: {average_precision_score(y_train, oof_probs):.4f}")
print(f"MCC: {matthews_corrcoef(y_train, oof_probs > 0.5):.4f}")

# Optimal threshold by F1
thresholds = np.linspace(0.01, 0.99, 200)
f1s = [f1_score(y_train, oof_probs > t) for t in thresholds]
best_threshold = thresholds[np.argmax(f1s)]
print(f"Optimal threshold: {best_threshold:.3f}, F1: {max(f1s):.4f}")
```
Regression Metrics: When MSE Isn't Enough
Classification has accuracy, F1, AUC. Regression has a whole family of metrics — each sensitive to different types of errors. Choosing the wrong one can hide catastrophic failures in your model.
MAE (Mean Absolute Error): Σ|yᵢ−ŷᵢ|/n — robust to outliers, same units as target, intuitive. Lower is better.
MSE (Mean Squared Error): Σ(yᵢ−ŷᵢ)²/n — penalizes large errors heavily. Differentiable everywhere. Lower is better.
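RMSE: √MSE; same units as the target, penalizes large errors. A common metric in Kaggle regression competitions.
R² (coefficient of determination): 1 − Σ(yᵢ−ŷᵢ)²/Σ(yᵢ−ȳ)², equivalently 1 − MSE/Var(y) with the population variance of y; fraction of variance explained. 1=perfect, 0=predicts mean, <0=worse than mean.
MAPE: (1/n)·Σ|yᵢ−ŷᵢ|/|yᵢ|; percentage error, intuitive for business. Undefined when yᵢ=0, explodes when yᵢ is near zero, and penalizes over-predictions more than under-predictions of the same size.
RMSLE (log-scale RMSE): √(Σ(log(ŷᵢ+1)−log(yᵢ+1))²/n); measures relative rather than absolute error, so large targets dominate less, and penalizes under-predictions more than over-predictions of the same absolute size. Requires non-negative values; often used for count data.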
Huber Loss: quadratic for small errors, linear for large — best of MAE+MSE, robust to outliers AND differentiable.
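The definitions in this list can be implemented in a few lines, which also makes their different sensitivity to outliers visible. This is a sketch in plain Python; `regression_metrics`, the `delta` default of 1.0 for the Huber threshold, and the example values are all assumptions for illustration. It expects strictly positive targets so that MAPE and RMSLE are defined.

```python
import math

def regression_metrics(y_true, y_pred, delta=1.0):
    """Compute the regression metrics listed above from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    # Huber: quadratic inside [-delta, delta], linear outside.
    huber = sum(
        0.5 * e * e if abs(e) <= delta else delta * (abs(e) - 0.5 * delta)
        for e in errors
    ) / n
    return {
        "mae": mae,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "r2": 1.0 - (mse * n) / ss_tot,
        "mape": sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n,
        "rmsle": math.sqrt(
            sum((math.log1p(p) - math.log1p(t)) ** 2
                for t, p in zip(y_true, y_pred)) / n
        ),
        "huber": huber,
    }

y_true = [10, 20, 30, 40]
clean = regression_metrics(y_true, [12, 18, 33, 37])
outlier = regression_metrics(y_true, [12, 18, 33, 10])  # one large miss

print(f"MAE : {clean['mae']:.2f} -> {outlier['mae']:.2f}")
print(f"RMSE: {clean['rmse']:.2f} -> {outlier['rmse']:.2f}")
```

A single large miss inflates RMSE by a larger factor than MAE, which is the practical meaning of "penalizes large errors heavily".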
Ranking & Calibration Metrics
Beyond point prediction accuracy, models must sometimes rank correctly (recommendation, search) or produce well-calibrated probabilities (medical risk, finance).
Spearman ρ: rank correlation between predicted and actual — measures monotone relationship, not magnitude.
NDCG (Normalized Discounted Cumulative Gain): graded relevance, position-discounted. Used in search/recommendation.
Calibration (ECE): Expected Calibration Error — do confidence=80% predictions come true 80% of the time?
Brier Score: MSE on probabilities for binary classification — lower is better. Good for probabilistic forecasts.
Log-Loss (Cross-Entropy): −(1/n)·Σ[yᵢ·log(pᵢ)+(1−yᵢ)·log(1−pᵢ)]; penalizes confident wrong predictions heavily.
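Several of the metrics above are short enough to write from their definitions. The sketch below makes a few assumptions that should be read as choices, not as the only formulation: ECE is computed on the predicted positive-class probability with 10 equal-width bins, NDCG uses linear gain (another common variant uses 2^rel − 1), and probabilities are clipped before taking logs. All function names are hypothetical.

```python
import math

def brier_score(y_true, probs):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)

def log_loss(y_true, probs, eps=1e-15):
    """Average binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

def expected_calibration_error(y_true, probs, n_bins=10):
    """Weighted average gap between mean predicted probability and observed rate."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        frac_pos = sum(y for y, _ in members) / len(members)
        avg_prob = sum(p for _, p in members) / len(members)
        ece += len(members) / len(y_true) * abs(frac_pos - avg_prob)
    return ece

def ndcg(relevances):
    """NDCG for graded relevances given in predicted rank order (linear gain)."""
    def dcg(rels):
        return sum(r / math.log2(rank + 2) for rank, r in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

y_true = [1, 0, 1, 1, 0]
probs = [0.9, 0.1, 0.8, 0.7, 0.2]
print(f"Brier: {brier_score(y_true, probs):.3f}")
print(f"Log-loss at p=0.40 for a true positive: {log_loss([1], [0.40]):.3f}")
print(f"Log-loss at p=0.01 for a true positive: {log_loss([1], [0.01]):.3f}")
print(f"ECE when 80% predictions come true 80% of the time: "
      f"{expected_calibration_error([1, 1, 1, 1, 0], [0.8] * 5):.3f}")
print(f"NDCG, ideal order vs reversed: {ndcg([3, 2, 1, 0]):.3f} vs {ndcg([0, 1, 2, 3]):.3f}")
```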
Metric Choice is a Business Decision
Rule: choose the metric that matches the cost of errors in your application. RMSE for house prices (large errors matter more). MAE for delivery time (outlier days don't change business behavior). MAPE when relative error matters. R² to communicate the share of variance explained to stakeholders. ROC-AUC when the class balance may shift between training and deployment, since it does not depend on class prevalence. F1 when both FP and FN have costs. MCC for a balanced single-number summary on imbalanced data.
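One way to turn "match the cost of errors" into a procedure is to assign explicit costs to false positives and false negatives and pick the decision threshold that minimizes total cost. The sketch below does this on made-up scores; the cost values, the data, and the helper names `expected_cost` and `best_threshold` are all hypothetical.

```python
def expected_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Total cost of errors when predicting positive for score >= threshold."""
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    return cost_fp * fp + cost_fn * fn

def best_threshold(y_true, scores, cost_fp, cost_fn, candidates):
    """Return the first candidate threshold with the lowest total cost."""
    return min(
        candidates,
        key=lambda th: expected_cost(y_true, scores, th, cost_fp, cost_fn),
    )

# Made-up labels and model scores, for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.03, 0.12, 0.22, 0.37, 0.41, 0.62, 0.71, 0.93]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

# Misses are expensive (e.g. disease screening): threshold moves down.
miss_heavy = best_threshold(y_true, scores, cost_fp=1, cost_fn=20,
                            candidates=candidates)
# False alarms are expensive (e.g. blocking legitimate payments): threshold moves up.
alarm_heavy = best_threshold(y_true, scores, cost_fp=20, cost_fn=1,
                             candidates=candidates)

print(f"threshold when FN costs 20x FP: {miss_heavy:.2f}")
print(f"threshold when FP costs 20x FN: {alarm_heavy:.2f}")
```

The same model and the same scores lead to different operating points once the costs change, which is why the threshold should be tuned on out-of-fold or validation predictions against the business cost rather than left at 0.5.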