Regression · beginner

Linear & Logistic Regression

“Finding the single best line through a cloud of noisy reality”

Visual deep-dive from OLS to gradient descent, R², residuals, multicollinearity, then logistic: sigmoid, log loss, L1/L2 regularization, and decision boundaries.

45 min

15 diagrams

8 Concepts Covered

Prerequisites

→Linear Algebra

→Probability & Statistics

Concepts Covered

Least Squares, Gradient Descent, R², Sigmoid, L1/Lasso, L2/Ridge, Decision Boundary, Overfitting


∑ Key Formulas

OLS Solution

β̂ = (XᵀX)⁻¹Xᵀy

Closed-form solution minimizing squared residuals

MSE Loss

L(β) = (1/n) ‖y − Xβ‖²

Mean squared error, the objective being minimized

Gradient Update

β ← β − α · ∇L

Gradient descent weight update rule

Sigmoid

σ(z) = 1/(1+e⁻ᶻ)

Squashes any real number to (0, 1) for probability

🎯

Why Does This Matter?

motivation

Regression underlies a large share of prediction systems. Credit scores, weather forecasts, house price estimates, and recommendation engines all build on ideas that start here. Before neural networks, before ensembles, there was the line. Understanding regression deeply means understanding what 'learning' actually means mathematically.

Least squares is more than 200 years old: Legendre published the method in 1805, and Gauss used it to predict the orbit of Ceres in 1801, long before computers existed.

💡

The Geometric Intuition

intuition

Imagine throwing darts at a wall. Each dart lands at an (x, y) position. You want to find the line that passes as close as possible to all darts simultaneously. 'Closest' means minimizing the vertical distances (residuals) from each dart to your line. Squaring the residuals turns this into a smooth, bowl-shaped loss landscape, and the bottom of the bowl is the OLS solution.
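A small numerical sketch of the bowl. For simplicity it assumes a line through the origin, so the loss depends on the slope alone; the toy data and the helper name `sse` are invented for illustration. The sum of squared residuals is then a parabola in the slope, and its lowest point on a grid of candidates lands on the closed-form least-squares slope.

```python
import numpy as np

# Toy data, invented for illustration: y is roughly 2*x plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 30)
y = 2.0 * x + rng.normal(scale=0.5, size=x.size)

def sse(slope):
    """Sum of squared vertical distances from the points to the line y = slope * x."""
    residuals = y - slope * x
    return float(residuals @ residuals)

# Evaluate the loss on a grid of candidate slopes (step 0.01).
slopes = np.linspace(0.0, 4.0, 401)
losses = np.array([sse(s) for s in slopes])
best_on_grid = float(slopes[np.argmin(losses)])

# Closed-form least-squares slope for a line through the origin.
slope_ols = float((x @ y) / (x @ x))

print("best slope on grid:", round(best_on_grid, 2))
print("closed-form slope: ", round(slope_ols, 2))
```

The two printed slopes should agree to within the grid spacing, which is the "bottom of the bowl" found two different ways.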

∑

The Mathematics of Least Squares

math

We model the relationship as y = Xβ + ε where ε ~ N(0, σ²), and predict with ŷ = Xβ. Minimizing the sum of squared residuals has a closed-form solution, the Normal Equation: β̂ = (XᵀX)⁻¹Xᵀy. This works because the loss surface is a paraboloid: a bowl with exactly one minimum whenever XᵀX is invertible.
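For reference, the closed form follows from setting the gradient of the squared-error loss to zero. This is the standard derivation, sketched with the same symbols as above and assuming XᵀX is invertible:

```latex
\begin{aligned}
L(\beta) &= \lVert y - X\beta \rVert^2 = (y - X\beta)^\top (y - X\beta) \\
\nabla_\beta L &= -2\, X^\top (y - X\beta) = 0 \\
X^\top X \,\hat\beta &= X^\top y \\
\hat\beta &= (X^\top X)^{-1} X^\top y
\end{aligned}
```

Dividing the loss by n to get the mean squared error rescales the gradient by 1/n but leaves the minimizer unchanged.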

🔬

Why Maximize Likelihood = Minimize Squared Errors

deepdive

This connection is worth spelling out. If we assume Gaussian noise ε ~ N(0, σ²), the likelihood of observing y given X is proportional to exp(−‖y − Xβ‖²/(2σ²)). Taking the negative log gives the sum of squared residuals, up to a positive scale factor and an additive constant that do not depend on β. So OLS and MLE yield the same β under Gaussian noise. It also gives linear regression a Bayesian reading: the OLS estimate is the MAP estimate under a flat (uniform) prior on β.
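Written out for n independent observations (an assumption implicit in the Gaussian noise model), the step from likelihood to squared error looks like this:

```latex
\begin{aligned}
p(y \mid X, \beta) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left(-\frac{(y_i - x_i^\top \beta)^2}{2\sigma^2}\right) \\
-\log p(y \mid X, \beta) &= \frac{n}{2}\log(2\pi\sigma^2)
    + \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2
\end{aligned}
```

The first term and the factor 1/(2σ²) do not involve β, so minimizing the negative log-likelihood over β is the same as minimizing the sum of squared residuals.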

The Gaussian assumption is why outliers hurt so badly — squared errors punish large residuals quadratically. Use Huber loss for robustness.
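To make the robustness point concrete, here is a minimal sketch comparing squared loss with Huber loss on a handful of residuals. The threshold `delta=1.0` and the sample residuals are illustrative assumptions.

```python
import numpy as np

def squared_loss(r):
    """Half squared error: grows quadratically with the residual."""
    return 0.5 * r**2

def huber_loss(r, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it."""
    abs_r = np.abs(r)
    quadratic = 0.5 * r**2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)

residuals = np.array([0.5, 1.0, 3.0, 10.0])  # the last value plays the outlier
print("squared:", squared_loss(residuals))
print("huber:  ", huber_loss(residuals))
```

With these numbers the outlier at 10 contributes 50 under squared loss but only 9.5 under Huber loss, while small residuals are treated identically by both.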

⚙️

Gradient Descent: Learning Step by Step

algorithm

When XᵀX is singular or ill-conditioned (multicollinearity), or the dataset is too large to form and solve the Normal Equation, we use gradient descent. Start anywhere on the loss surface, measure the slope, take a small step downhill. Repeat until convergence.

1. Initialize weights β = 0 (or random small values)

2. Compute prediction: ŷ = Xβ

3. Compute residuals: r = y − ŷ

4. Compute gradient of the MSE: ∇L = −(2/n) Xᵀr

5. Update: β ← β − α · ∇L

6. Repeat steps 2–5 until ‖∇L‖ < tolerance

∑

Logistic Regression: The Binary Jump

math

For binary outcomes we need outputs in (0,1). We pass the linear combination through the sigmoid function σ(z) = 1/(1+e⁻ᶻ), which maps ℝ → (0,1). The loss function switches from MSE to Binary Cross-Entropy (log loss).
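A minimal sketch of logistic regression trained by gradient descent on log loss, in NumPy. The synthetic data, the made-up `true_beta`, the learning rate, and the iteration count are all assumptions chosen for illustration. The gradient of the mean log loss with respect to β is Xᵀ(σ(Xβ) − y)/n, which has the same shape as the linear regression gradient.

```python
import numpy as np

def sigmoid(z):
    """Map real numbers to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# Synthetic, illustrative data: two features plus a bias column.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 2))
true_beta = np.array([-0.5, 2.0, -1.0])          # made up for the demo
X = np.c_[np.ones(len(X_raw)), X_raw]
y = (rng.random(200) < sigmoid(X @ true_beta)).astype(float)

beta = np.zeros(X.shape[1])
lr, n_iter = 0.1, 2000
initial_loss = log_loss(y, sigmoid(X @ beta))    # equals log(2) at beta = 0
for _ in range(n_iter):
    p = sigmoid(X @ beta)
    grad = X.T @ (p - y) / len(y)                # gradient of the mean log loss
    beta -= lr * grad
final_loss = log_loss(y, sigmoid(X @ beta))

# Decision boundary: predict class 1 where X @ beta > 0, i.e. p > 0.5.
accuracy = float(np.mean((sigmoid(X @ beta) > 0.5) == (y == 1.0)))
print("loss:", round(initial_loss, 3), "->", round(final_loss, 3))
print("accuracy:", round(accuracy, 3))
```

The decision boundary is the set of points where Xβ = 0, which is a straight line in the two-feature plane here.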

</>

From Scratch in NumPy

code

A gradient descent implementation in NumPy, checked against the closed-form OLS and Ridge solutions:


import numpy as np
from sklearn.datasets import make_regression

# ── Sample data ────────────────────────────────────────────────────────
X_raw, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)
X = np.c_[np.ones(len(X_raw)), X_raw]   # prepend bias column
lam = 0.1                                 # Ridge regularisation strength

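class LinearRegression:
    """Linear regression fitted by batch gradient descent on the MSE loss."""

    def __init__(self, lr=0.01, n_iter=1000):
        # lr: step size (alpha); n_iter: number of gradient steps
        self.lr, self.n_iter = lr, n_iter

    def fit(self, X, y):
        # X is expected to already contain the bias column
        n, p = X.shape
        self.beta = np.zeros(p)                  # start from beta = 0
        for _ in range(self.n_iter):
            y_hat = X @ self.beta                # predictions
            residuals = y - y_hat
            grad = -(2/n) * X.T @ residuals      # gradient of the MSE
            self.beta -= self.lr * grad          # step downhill
        return self

    def predict(self, X):
        return X @ self.beta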

# Demo
model = LinearRegression(lr=0.01, n_iter=1000).fit(X, y)
print("GD beta:", model.beta[:3].round(2))

# Closed-form (Normal Equation):
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
# Ridge (L2 regularization); the bias column (index 0) is left unpenalized:
p = X.shape[1]
penalty = lam * np.eye(p)
penalty[0, 0] = 0.0
beta_ridge = np.linalg.solve(X.T @ X + penalty, X.T @ y)
print("OLS beta:  ", beta_ols[:3].round(2))
print("Ridge beta:", beta_ridge[:3].round(2))

⚠️

Critical Pitfalls

pitfall

Four mistakes that kill regression models in production:

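Multicollinearity: Correlated features make XᵀX near-singular, so coefficient estimates become unstable. VIF > 10 is a common red flag. Fix: Ridge regularization or PCA.

Unscaled features: Gradient descent can converge orders of magnitude more slowly when features have very different scales, because the loss surface becomes a long, narrow valley. Standardize features first (for example with scikit-learn's StandardScaler).

Heteroscedasticity: Non-constant residual variance violates OLS assumptions and makes standard errors unreliable. Plot residuals against fitted values to check.

Extrapolation: Linear models give confident-looking predictions outside the training range with nothing to back them up. Never extrapolate without domain knowledge.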
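As a rough illustration of the multicollinearity check, the VIF for feature j can be computed as 1/(1 − R²ⱼ), where R²ⱼ comes from regressing feature j on the remaining features. The helper name `vif` and the synthetic data below are assumptions for this sketch, not part of the lesson's own code.

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column of X (no bias column expected)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.c_[np.ones(n), np.delete(X, j, axis=1)]  # add intercept
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)   # nearly a copy of a
X_demo = np.c_[a, b, c]

vif_values = vif(X_demo)
print("VIF:", vif_values.round(1))
```

The first and third columns should come out well above the VIF > 10 threshold, while the independent middle column stays near 1.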
