Linear & Logistic Regression
“Finding the single best line through a cloud of noisy reality”
Visual deep-dive from OLS to gradient descent, R², residuals, multicollinearity, then logistic: sigmoid, log loss, L1/L2 regularization, and decision boundaries.
Key Formulas
OLS Solution: β̂ = (XᵀX)⁻¹Xᵀy
Closed-form solution minimizing squared residuals
MSE Loss: L(β) = (1/n) Σᵢ (yᵢ − ŷᵢ)²
Mean squared error, the objective being minimized
Gradient Update: β ← β − α · ∇L
Gradient descent weight update rule, with learning rate α
Sigmoid: σ(z) = 1/(1+e⁻ᶻ)
Squashes any real number to (0, 1) for probability
Why Does This Matter?
Regression sits underneath a huge range of prediction systems. Credit scores, weather forecasts, house price estimates, recommendation engines: many of them build on the ideas introduced here. Before neural networks, before ensembles, there was the line. Understanding regression deeply means understanding what 'learning' actually means mathematically.
Least squares predates electronic computers by more than a century: Gauss used it in 1801 to predict the orbit of Ceres, and Legendre published the method in 1805.
The Geometric Intuition
Imagine throwing darts at a wall. Each dart lands at an (x, y) position. You want to find the line that passes as close as possible to all darts simultaneously. 'Closest' means minimizing the vertical distances (residuals) from each dart to your line. Squaring the residuals and summing them turns this into a smooth bowl-shaped landscape over the line's parameters, and the bottom of the bowl is the OLS solution.
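To make the bowl concrete, here is a minimal sketch that scans candidate slopes for a line through the origin and picks the one with the smallest sum of squared residuals. The five data points and the helper name `sse` are made up for illustration.

```python
import numpy as np

# Hypothetical "darts": five points scattered around the line y = 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 1.9, 4.2, 5.8, 8.1])

def sse(slope):
    """Sum of squared vertical distances to the line y = slope * x."""
    residuals = y - slope * x
    return float(np.sum(residuals ** 2))

# Evaluate the loss on a grid of candidate slopes: the values trace a parabola
slopes = np.linspace(0.0, 4.0, 401)
losses = np.array([sse(s) for s in slopes])
best_slope = float(slopes[np.argmin(losses)])

# For a line through the origin, the least-squares slope has a closed form
ols_slope = float(x @ y / (x @ x))
print(best_slope, ols_slope)
```

The grid search and the closed form agree up to the grid spacing, which is the "bottom of the bowl" in one dimension.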
The Mathematics of Least Squares
We model the relationship as y = Xβ + ε where ε ~ N(0, σ²), with predictions ŷ = Xβ. Minimizing the sum of squared residuals has a beautiful closed-form solution called the Normal Equation: β̂ = (XᵀX)⁻¹Xᵀy. This works because the loss surface is a paraboloid, a bowl with exactly one minimum whenever the columns of X are linearly independent.
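As a quick sanity check, the Normal Equation can be solved on a tiny noise-free dataset. This is a sketch with made-up data (y = 1 + 2x); it solves the linear system directly instead of forming the inverse, which is the usual numerical advice.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Solve (X^T X) beta = X^T y rather than forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# At the minimum the residuals are orthogonal to every column of X
residuals = y - X @ beta_hat
print(beta_hat)          # approximately [1, 2]
print(X.T @ residuals)   # approximately [0, 0]
```

The second print is the Normal Equation itself rearranged: Xᵀ(y − Xβ̂) = 0.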
Why Maximizing Likelihood = Minimizing Squared Errors
This connection is profound. If we assume independent Gaussian noise ε ~ N(0, σ²), the likelihood of each observation y given x is proportional to exp(−(y − xᵀβ)²/(2σ²)). Taking the log of the product over all observations and negating gives the sum of squared residuals, up to a positive scale factor and an additive constant that do not depend on β. So OLS and MLE give the same estimate under Gaussian noise. It also gives linear regression a Bayesian reading: the same estimate is the posterior mode (MAP) under a flat prior on β.
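The argument can be written out in three lines. This is a sketch that assumes independent observations and a fixed, known σ²; the notation xᵢ for the i-th row of X is introduced here for convenience.

```latex
\begin{aligned}
p(y \mid X, \beta) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left(-\frac{(y_i - x_i^\top \beta)^2}{2\sigma^2}\right) \\
-\log p(y \mid X, \beta) &= \frac{n}{2}\log(2\pi\sigma^2)
    + \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 \\
\hat{\beta}_{\mathrm{MLE}} &= \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2
    = \hat{\beta}_{\mathrm{OLS}}
\end{aligned}
```

The first term of the negative log-likelihood does not involve β, and 1/(2σ²) is a positive constant, so neither changes where the minimum sits.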
The Gaussian assumption is why outliers hurt so badly — squared errors punish large residuals quadratically. Use Huber loss for robustness.
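To see the difference in how the two losses treat an outlier, here is a small sketch of the Huber loss next to the half squared error. The threshold `delta=1.0` and the three residual values are arbitrary choices for illustration.

```python
import numpy as np

def huber(residual, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    r = np.abs(residual)
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quadratic, linear)

residuals = np.array([0.5, 1.0, 10.0])   # the last value plays the outlier
squared = 0.5 * residuals ** 2           # [0.125, 0.5, 50.0]
robust = huber(residuals, delta=1.0)     # [0.125, 0.5, 9.5]
print(squared, robust)
```

The two losses agree on small residuals, while the outlier contributes 9.5 instead of 50, so it pulls the fit far less.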
Gradient Descent: Learning Step by Step
When XᵀX is not invertible (perfect multicollinearity) or the dataset is too large to form and solve the Normal Equation, we use gradient descent. Start anywhere on the loss surface, measure the slope, take a small step downhill. Repeat until convergence.
1. Initialize weights β = 0 (or random small values)
2. Compute prediction: ŷ = Xβ
3. Compute residuals: r = y − ŷ
4. Compute gradient of the MSE loss: ∇L = −(2/n) Xᵀr
5. Update: β ← β − α · ∇L
6. Repeat from step 2 until ||∇L|| < tolerance
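The steps above can be sketched as a single function with a gradient-norm stopping rule. The function name `gradient_descent`, the default learning rate, tolerance, and iteration cap, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, tol=1e-8, max_iter=10_000):
    """Minimize MSE = (1/n) * ||y - X beta||^2 by batch gradient descent."""
    n, p = X.shape
    beta = np.zeros(p)                        # initialize weights
    for iteration in range(max_iter):
        y_hat = X @ beta                      # predict
        residuals = y - y_hat                 # residuals
        grad = -(2.0 / n) * X.T @ residuals   # gradient of the MSE
        if np.linalg.norm(grad) < tol:        # stop when the slope is flat
            break
        beta = beta - lr * grad               # step downhill
    return beta, iteration

# Hypothetical data: intercept 1, slope 2, small fixed-seed noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=100)

beta_gd, n_steps = gradient_descent(X, y)
beta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_gd, beta_exact, n_steps)
```

Comparing against `np.linalg.lstsq` is a cheap way to confirm the loop has actually reached the least-squares solution rather than just running out of iterations.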
Logistic Regression: The Binary Jump
For binary outcomes we need outputs in (0,1). We pass the linear combination through the sigmoid function σ(z) = 1/(1+e⁻ᶻ), which maps ℝ → (0,1). The loss function switches from MSE to Binary Cross-Entropy (log loss).
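A minimal sketch of the pieces named above: the sigmoid, the log loss, and a gradient descent fit. The helper names, the learning rate, the iteration count, and the eight data points are assumptions for illustration; the classes overlap on purpose so that the optimum is finite.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    """Binary cross-entropy, clipped to avoid log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Batch gradient descent on the mean log loss."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (p - y) / n      # gradient of the mean log loss
        beta -= lr * grad
    return beta

# Hypothetical 1-D data with overlapping classes
x = np.array([-3.0, -2.0, -1.0, 0.5, -0.5, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

beta = fit_logistic(X, y)
initial_loss = log_loss(y, sigmoid(X @ np.zeros(2)))   # equals log(2)
final_loss = log_loss(y, sigmoid(X @ beta))
boundary = -beta[0] / beta[1]    # x where the predicted probability is 0.5
print(initial_loss, final_loss, boundary)
```

The gradient has the same shape as in the linear case, Xᵀ(prediction − target)/n, with the sigmoid output standing in for ŷ. The decision boundary is the x at which the linear combination is zero.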
From Scratch in NumPy
A compact gradient descent implementation, compared against the closed-form and Ridge solutions:
```python
import numpy as np
from sklearn.datasets import make_regression

# ── Sample data ────────────────────────────────────────────────────────
X_raw, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)
X = np.c_[np.ones(len(X_raw)), X_raw]  # prepend bias column
lam = 0.1                              # Ridge regularisation strength


class LinearRegression:
    def __init__(self, lr=0.01, n_iter=1000):
        self.lr, self.n_iter = lr, n_iter

    def fit(self, X, y):
        n, p = X.shape
        self.beta = np.zeros(p)
        for _ in range(self.n_iter):
            y_hat = X @ self.beta
            residuals = y - y_hat
            grad = -(2 / n) * X.T @ residuals  # gradient of the MSE loss
            self.beta -= self.lr * grad
        return self

    def predict(self, X):
        return X @ self.beta


# Demo
model = LinearRegression(lr=0.01, n_iter=1000).fit(X, y)
print("GD beta:", model.beta[:3].round(2))

# Closed-form (Normal Equation):
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge (L2 regularization):
# note: lam * np.eye(p) also penalizes the bias term in column 0
p = X.shape[1]
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print("OLS beta:  ", beta_ols[:3].round(2))
print("Ridge beta:", beta_ridge[:3].round(2))
```
Critical Pitfalls
Four mistakes that kill regression models in production:
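Multicollinearity: Correlated features make XᵀX near-singular, so coefficient estimates become unstable. VIF > 10 is a common rule-of-thumb red flag. Fix: Ridge regularization or PCA.
Unscaled features: Gradient descent can converge orders of magnitude more slowly when features have very different scales, because XᵀX is ill-conditioned and the loss surface becomes a long narrow valley. Standardize features first, for example with scikit-learn's StandardScaler.
Heteroscedasticity: Non-constant residual variance violates OLS assumptions. The coefficient estimates stay unbiased, but the usual standard errors and confidence intervals become unreliable. Visualize residuals vs fitted values.
Extrapolation: Linear models are dangerously confident outside the training range. Never extrapolate without domain knowledge.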
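The first two pitfalls can be checked numerically. Below is a sketch that computes variance inflation factors by hand and compares the condition number of the feature matrix before and after standardization. The helper `vif`, the synthetic columns `a`, `b`, `c`, and the fixed seed are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2) from regressing
    that column on the remaining columns plus an intercept."""
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)     # nearly a copy of a
c = 1000.0 * rng.normal(size=500)      # independent, but on a much larger scale
X = np.column_stack([a, b, c])

factors = vif(X)
print(factors)                          # large for a and b, close to 1 for c

# Standardizing removes the scale mismatch (this is what StandardScaler does);
# it does not cure the collinearity between a and b.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cond_raw, cond_std = np.linalg.cond(X), np.linalg.cond(X_std)
print(cond_raw, cond_std)
```

Scaling shrinks the condition number because it removes the gap in magnitudes, but the VIFs for `a` and `b` are unaffected by it. Those call for dropping a feature, Ridge, or PCA.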