
Model Evaluation & Metrics

Accuracy is a lie — learning to choose the right metric for the real problem

Complete guide: accuracy, precision, recall, F1, ROC-AUC, confusion matrix, PR curves, cross-validation (StratifiedKFold, TimeSeriesSplit), and choosing the right metric for your task.


Prerequisites

Linear Regression

Concepts Covered

ROC-AUC, F1 Score, Confusion Matrix, Cross-validation, Precision-Recall, TimeSeriesSplit

Key Formulas

F1 Score

Harmonic mean of precision and recall

AUC-ROC

Probability that a random positive ranks higher than a random negative

MCC

Matthews Correlation Coefficient: a balanced single-number summary that uses all four confusion-matrix cells, well suited to imbalanced classes

RMSE

Root Mean Squared Error — penalizes large errors, same units as target

R² Score

Fraction of variance explained; 1=perfect, 0=predict mean, <0=worse than mean

MAE

Mean Absolute Error: more robust to outliers than RMSE, interpretable in target units
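
The panel above names each metric without its expression. As a reference, these are the standard textbook definitions, written with TP, FP, TN, FN for the confusion-matrix counts, P and R for precision and recall, s(·) for the classifier score, and ȳ for the mean of the targets (this notation is an assumption made here, not taken from the original page):

```latex
\begin{align*}
F_1 &= \frac{2PR}{P + R} = \frac{2\,TP}{2\,TP + FP + FN} \\
\mathrm{AUC} &= \Pr\big(s(x^{+}) > s(x^{-})\big) \\
\mathrm{MCC} &= \frac{TP \cdot TN - FP \cdot FN}
                     {\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \\
R^2 &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \\
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
\end{align*}
```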

🎯

When Accuracy Kills

motivation

Imagine predicting cancer (0.1% prevalence). A model that predicts 'no cancer' for everyone achieves 99.9% accuracy and misses every patient who has the disease. In fraud detection (0.5% fraud rate), a model that flags nothing is 99.5% accurate and catches no fraud. The choice of metric is a business decision, not a technical one. Getting it wrong can mean deploying a model that optimizes for the wrong thing entirely.

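In 2018, Reuters reported that Amazon had scrapped an experimental resume-screening tool after it was found to downgrade resumes that mentioned women's activities or colleges. The bias came from the historical hiring data the model was trained on, and an aggregate accuracy score would not have revealed it.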
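
The 0.1% prevalence arithmetic can be checked directly. This is a minimal sketch on a made-up population of 100,000 people (the population size is an assumption chosen so the counts come out whole), comparing accuracy with recall for a model that always predicts the negative class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical screening population: 100,000 people at 0.1% prevalence.
n_samples = 100_000
y_true = np.zeros(n_samples, dtype=int)
y_true[:100] = 1  # 100 positives = 0.1%

# "Model" that predicts the negative class for everyone.
y_pred = np.zeros(n_samples, dtype=int)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"Accuracy: {accuracy:.1%}")
print(f"Recall:   {recall:.1%}")
```

Accuracy rewards the 99,900 easy negatives, while recall exposes that none of the 100 positives were found.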

💡

The Confusion Matrix as a Complete Picture

intuition

Every prediction falls into four categories: True Positive (correctly predicted positive), False Positive (predicted positive, actually negative), True Negative (correctly predicted negative), False Negative (predicted negative, actually positive). Every threshold-based classification metric (accuracy, precision, recall, F1, MCC) derives from these four numbers. FP = Type I error (false alarm). FN = Type II error (miss). Which matters more depends entirely on the application.
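
To make the four counts concrete, here is a small sketch that extracts them with scikit-learn's `confusion_matrix` and derives the common metrics by hand. The ten labels are invented for illustration only:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration only.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)      # of predicted positives, how many were right
recall = tp / (tp + fn)         # of actual positives, how many were found
specificity = tn / (tn + fp)    # of actual negatives, how many were cleared
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"precision={precision:.3f} recall={recall:.3f} "
      f"specificity={specificity:.3f} accuracy={accuracy:.3f} f1={f1:.3f}")
```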

ROC Curve: Threshold-Independent Evaluation

math

A classifier produces a score, not just a binary prediction. The threshold we apply to convert score → label is a design choice. The ROC curve shows every possible tradeoff: sweep the threshold across the score range (0 to 1 for probabilities) and plot TPR (recall) against FPR (1 − specificity) at each value. AUC = 0.5 means random guessing, AUC = 1.0 is perfect. AUC also has a probabilistic interpretation: it equals P(score(positive) > score(negative)) for a randomly chosen positive and a randomly chosen negative.

True and False Positive Rates: TPR = TP / (TP + FN), FPR = FP / (FP + TN)
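
The probabilistic reading of AUC can be verified numerically. The sketch below uses synthetic Gaussian scores (an assumption made purely for illustration) and compares `roc_auc_score` with the fraction of positive/negative pairs in which the positive scores higher:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)

# Synthetic scores: positives tend to score higher than negatives.
y_true = np.concatenate([np.ones(200, dtype=int), np.zeros(800, dtype=int)])
scores = np.concatenate([rng.normal(1.0, 1.0, 200), rng.normal(0.0, 1.0, 800)])

# One (FPR, TPR) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

# Pairwise check: fraction of (positive, negative) pairs ranked correctly,
# counting ties as half a correct pair.
pos = scores[y_true == 1]
neg = scores[y_true == 0]
diff = pos[:, None] - neg[None, :]
pairwise_auc = (diff > 0).mean() + 0.5 * (diff == 0).mean()

print(f"AUC from roc_auc_score:  {auc:.4f}")
print(f"AUC from pairwise ranks: {pairwise_auc:.4f}")
```

The two numbers agree because the area under the ROC curve is exactly this pairwise ranking probability.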
🔬

Stratified K-Fold: The Right Way to Validate

deepdive

A single hold-out split never trains on the held-out portion, and its score depends heavily on which samples happen to land there. K-Fold cross-validation validates on every sample exactly once and trains on it in the other K−1 folds. Stratified K-Fold ensures each fold has the same class distribution as the full dataset, which is critical for imbalanced problems. TimeSeriesSplit prevents temporal leakage: it respects time ordering, so future data never informs predictions about the past.

1. StratifiedKFold: maintains class proportions in each fold
2. TimeSeriesSplit: all training data comes before validation data in time
3. GroupKFold: ensures all samples from the same group (patient, user) are in the same fold
4. RepeatedStratifiedKFold: repeat K-Fold N times with different random seeds → lower variance estimate
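
The guarantees each splitter makes can be checked on a toy dataset. This sketch assumes 20 samples with 20% positives and five made-up groups of four samples each, and verifies the stated property of each splitter rather than training a model:

```python
import numpy as np
from sklearn.model_selection import (
    GroupKFold, RepeatedStratifiedKFold, StratifiedKFold, TimeSeriesSplit,
)

# Toy data: 20 samples, 20% positives, 5 hypothetical groups of 4 samples each.
X = np.arange(20).reshape(-1, 1)
y = np.array([1, 0, 0, 0, 0] * 4)
groups = np.repeat(np.arange(5), 4)

# StratifiedKFold: every validation fold keeps the 20% positive rate.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
strat_rates = [y[val_idx].mean() for _, val_idx in skf.split(X, y)]

# TimeSeriesSplit: training indices always precede validation indices.
tscv = TimeSeriesSplit(n_splits=4)
time_ordered = [tr_idx.max() < val_idx.min() for tr_idx, val_idx in tscv.split(X)]

# GroupKFold: no group appears on both sides of a split.
gkf = GroupKFold(n_splits=5)
group_disjoint = [
    set(groups[tr_idx]).isdisjoint(groups[val_idx])
    for tr_idx, val_idx in gkf.split(X, y, groups=groups)
]

# RepeatedStratifiedKFold: n_splits x n_repeats train/validation pairs in total.
rskf = RepeatedStratifiedKFold(n_splits=4, n_repeats=3, random_state=0)
n_repeated_splits = rskf.get_n_splits(X, y)

print("Positive rate per stratified fold:", strat_rates)
print("Train precedes validation in time:", time_ordered)
print("Groups disjoint across each split:", group_disjoint)
print("Total repeated splits:", n_repeated_splits)
```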

</>

Complete Evaluation Pipeline

code
```python
from sklearn.metrics import (
    classification_report, roc_auc_score, f1_score,
    average_precision_score, matthews_corrcoef,
    confusion_matrix
)
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np

# ── Sample data + model ────────────────────────────────────────────────
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = GradientBoostingClassifier(n_estimators=100, random_state=42)

# ── Out-of-fold (OOF) predictions: each training sample is scored by a
#    model that never saw it during fitting ────────────────────────────
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_probs = np.zeros(len(y_train))

for fold, (tr_idx, val_idx) in enumerate(skf.split(X_train, y_train)):
    model.fit(X_train[tr_idx], y_train[tr_idx])
    oof_probs[val_idx] = model.predict_proba(X_train[val_idx])[:, 1]
    print(f"Fold {fold+1} AUC: {roc_auc_score(y_train[val_idx], oof_probs[val_idx]):.4f}")

# Full OOF evaluation (MCC needs hard labels, so threshold at 0.5)
oof_labels = (oof_probs > 0.5).astype(int)
print(f"\nOOF AUC: {roc_auc_score(y_train, oof_probs):.4f}")
print(f"OOF AUC-PR: {average_precision_score(y_train, oof_probs):.4f}")
print(f"MCC: {matthews_corrcoef(y_train, oof_labels):.4f}")

# Optimal threshold by F1, chosen on OOF predictions only
thresholds = np.linspace(0.01, 0.99, 200)
f1s = [f1_score(y_train, (oof_probs > t).astype(int)) for t in thresholds]
best_threshold = thresholds[np.argmax(f1s)]
print(f"Optimal threshold: {best_threshold:.3f}, F1: {max(f1s):.4f}")

# Refit on the full training set and evaluate once on the held-out test set
model.fit(X_train, y_train)
test_labels = (model.predict_proba(X_test)[:, 1] > best_threshold).astype(int)
print(confusion_matrix(y_test, test_labels))
print(classification_report(y_test, test_labels))
```
⚖️

Regression Metrics: When MSE Isn't Enough

comparison

Classification has accuracy, F1, AUC. Regression has a whole family of metrics — each sensitive to different types of errors. Choosing the wrong one can hide catastrophic failures in your model.

1. MAE (Mean Absolute Error): Σ|yᵢ−ŷᵢ|/n. Robust to outliers, same units as the target, intuitive. Lower is better.
2. MSE (Mean Squared Error): Σ(yᵢ−ŷᵢ)²/n. Penalizes large errors heavily. Differentiable everywhere. Lower is better.
3. RMSE: √MSE. Same units as the target, penalizes large errors. A common default in Kaggle regression competitions.
4. R² (coefficient of determination): 1 − MSE/Var(y), the fraction of variance explained. 1 = perfect, 0 = predicts the mean, <0 = worse than the mean.
5. MAPE: (1/n)·Σ|yᵢ−ŷᵢ|/|yᵢ|. Percentage error, intuitive for business. Undefined when yᵢ=0, inflated by targets near zero, and tends to favor under-prediction.
6. RMSLE (log-scale RMSE): √(Σ(log(ŷᵢ+1)−log(yᵢ+1))²/n). Measures relative error, dampens the effect of large-target outliers, and penalizes under-prediction more than over-prediction. Requires non-negative values; used for counts and targets spanning orders of magnitude.
7. Huber Loss: quadratic for errors below a threshold δ, linear above it. Combines the smoothness of MSE with the outlier robustness of MAE.
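
The list above can be compared on a single set of predictions. This sketch uses five invented targets where the last prediction is a deliberate large miss, so the difference between MAE and the squared-error metrics is visible. The `huber_loss` helper and its `delta` value are illustrative choices, not part of scikit-learn's metrics module:

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error, mean_absolute_percentage_error,
    mean_squared_error, mean_squared_log_error, r2_score,
)

# Toy targets and predictions; the last prediction is a deliberate large miss.
y_true = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0, 500.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)  # a fraction, not a percent
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))


def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for |error| <= delta, linear beyond it."""
    err = np.abs(y_true - y_pred)
    quadratic = 0.5 * err**2
    linear = delta * (err - 0.5 * delta)
    return np.mean(np.where(err <= delta, quadratic, linear))


huber = huber_loss(y_true, y_pred, delta=50.0)

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}")
print(f"R2={r2:.3f}  MAPE={mape:.3f}  RMSLE={rmsle:.3f}  Huber={huber:.2f}")
```

The single 200-unit miss accounts for most of the MAE and nearly all of the MSE, and it is enough to push R² below zero, meaning these predictions do worse than always predicting the mean.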

⚖️

Ranking & Calibration Metrics

comparison

Beyond point prediction accuracy, models must sometimes rank correctly (recommendation, search) or produce well-calibrated probabilities (medical risk, finance).

1. Spearman ρ: rank correlation between predicted and actual values. Measures monotone relationship, not magnitude.
2. NDCG (Normalized Discounted Cumulative Gain): graded relevance, position-discounted. Used in search/recommendation.
3. Calibration (ECE): Expected Calibration Error. Do predictions made with 80% confidence come true 80% of the time?
4. Brier Score: MSE on predicted probabilities for binary classification. Lower is better. Good for probabilistic forecasts.
5. Log-Loss (Cross-Entropy): −(1/n)·Σ[yᵢ·log(pᵢ) + (1−yᵢ)·log(1−pᵢ)]. Penalizes confident wrong predictions heavily.
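
A short sketch of these metrics on eight invented predictions follows. Brier score, log-loss, and NDCG come from scikit-learn and Spearman ρ from SciPy. The `expected_calibration_error` helper is a hand-written, equal-width-bin version written for this example, since ECE has several binning variants and no single canonical implementation is assumed here:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import brier_score_loss, log_loss, ndcg_score

# Toy binary labels and predicted probabilities, invented for illustration.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p_pred = np.array([0.95, 0.15, 0.85, 0.65, 0.35, 0.05, 0.75, 0.25])

brier = brier_score_loss(y_true, p_pred)
ll = log_loss(y_true, p_pred)
rho, _ = spearmanr(y_true, p_pred)

# ndcg_score expects 2D arrays: one row per query, one column per item.
ndcg = ndcg_score(y_true.reshape(1, -1), p_pred.reshape(1, -1))


def expected_calibration_error(y_true, p_pred, n_bins=5):
    """Equal-width-bin ECE: weighted mean of |observed rate - mean confidence|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in 0..n_bins-1.
    bin_ids = np.clip(np.digitize(p_pred, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - p_pred[mask].mean())
            ece += mask.mean() * gap  # weight by the share of samples in the bin
    return ece


ece = expected_calibration_error(y_true, p_pred)

print(f"Brier={brier:.4f}  LogLoss={ll:.4f}  ECE={ece:.4f}")
print(f"Spearman rho={rho:.3f}  NDCG={ndcg:.3f}")
```

These predictions rank every positive above every negative, so NDCG is perfect, yet ECE is well above zero: ranking quality and calibration are separate properties.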

🔭

Metric Choice is a Business Decision

insight

Rule: choose the metric that matches the cost of errors in your application. RMSE for house prices (large errors matter more). MAE for delivery time (outlier days don't change business behavior). MAPE when relative error matters. R² to communicate the share of variance explained to stakeholders. ROC-AUC when the class balance may shift between training and deployment, since TPR and FPR do not depend on class prevalence. F1 when both FP and FN have costs. MCC when you want a single balanced number on imbalanced data.
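
One way to act on this rule is to put explicit costs on each error type and pick the decision threshold that minimizes total cost. The sketch below is one possible approach, not a prescribed method: the 10:1 cost ratio, the synthetic dataset, and the `total_cost` helper are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical costs: a missed positive is 10x as expensive as a false alarm.
COST_FP = 1.0
COST_FN = 10.0

# Synthetic imbalanced data (about 10% positives).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_probs = model.predict_proba(X_val)[:, 1]


def total_cost(y_true, probs, threshold):
    """Total misclassification cost at a given decision threshold."""
    preds = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, preds, labels=[0, 1]).ravel()
    return COST_FP * fp + COST_FN * fn


# Evaluate a grid of thresholds on validation data and keep the cheapest.
thresholds = np.linspace(0.01, 0.99, 99)
costs = np.array([total_cost(y_val, val_probs, t) for t in thresholds])
best_threshold = thresholds[np.argmin(costs)]

print(f"Cost at threshold 0.50: {total_cost(y_val, val_probs, 0.5):.1f}")
print(f"Lowest cost {costs.min():.1f} at threshold {best_threshold:.2f}")
```

When false negatives cost more than false positives, the selected threshold typically lands below 0.5, trading extra false alarms for fewer misses. In practice the threshold should be chosen on validation data and confirmed on a separate test set.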
