ML Learning Hub
Foundations · beginner

Calculus & Optimization

From derivatives to gradient descent — the engine that trains every neural network

Derivatives, partial derivatives, the chain rule (= backpropagation), and gradient descent. Then Adam, momentum, learning rate scheduling — the full story of how neural networks actually learn.

45 min
8 diagrams
8 Concepts Covered

Prerequisites

Linear Algebra

Concepts Covered

Derivatives, Chain Rule, Gradient, Gradient Descent, Adam, Momentum, Learning Rate, Convexity

Key Formulas

Gradient

∇L(θ) = (∂L/∂θ₁, …, ∂L/∂θₙ)

Vector of partial derivatives; it points in the direction of steepest ascent

Chain Rule

dy/dx = (dy/du)·(du/dx), for y = f(u) and u = g(x)

The backbone of backpropagation: compose derivatives through layers

Gradient Descent

θ ← θ − η·∇L(θ)

Iteratively move opposite to the gradient to minimize loss L

Adam Update

θ_t = θ_{t-1} − η·m̂_t / (√v̂_t + ε)

Gradient descent with adaptive per-parameter learning rates (bias-corrected 1st & 2nd moments)

🎯

Optimization Is What Makes Models Learn

motivation

Training a machine learning model is an optimization problem: find the parameters θ that minimize the loss function L(θ). Gradient descent is the workhorse algorithm that solves this for problems with millions of parameters where closed-form solutions don't exist. The chain rule makes it possible to compute gradients through arbitrarily deep compositions of functions — that's backpropagation. Without calculus, there is no learning: every weight update in every neural network, every boosted tree fitted to residuals, every SVM soft-margin solution — all of it is optimization.

GPT-3 has about 175 billion parameters. A single backward pass computes the gradient with respect to all of them via the chain rule, so one optimizer step can update every parameter at once.

💡

The Gradient as a Direction in Parameter Space

intuition

Imagine the loss function as a hilly landscape and your parameters as your position. The gradient ∇L(θ) is an arrow pointing uphill, in the direction of steepest ascent. Moving in the opposite direction (−η∇L) goes downhill, toward lower loss. The learning rate η controls step size: too large and the iterates overshoot and can diverge, too small and training takes forever. The classic problem is an elongated bowl (an ill-conditioned loss surface): the gradient points mostly across the valley rather than along it, so vanilla gradient descent zigzags between the steep walls and makes slow progress toward the minimum. Adam mitigates this by scaling each parameter's step by a running estimate of that parameter's gradient magnitude, which gives every parameter its own effective learning rate.
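
To make the zigzag concrete, here is a minimal sketch on a hypothetical elongated bowl L(θ) = ½(θ₀² + 25·θ₁²). The function, the starting point, and the learning rate are all assumptions chosen for illustration, not values from this lesson.

```python
import numpy as np

def grad_bowl(theta, k=25.0):
    # Gradient of the assumed loss L(theta) = 0.5 * (theta[0]**2 + k * theta[1]**2)
    return np.array([theta[0], k * theta[1]])

def run_gd(lr, steps=20, start=(10.0, 1.0)):
    # Plain gradient descent; returns every visited point as a (steps+1, 2) array
    theta = np.array(start, dtype=float)
    path = [theta.copy()]
    for _ in range(steps):
        theta = theta - lr * grad_bowl(theta)
        path.append(theta.copy())
    return np.array(path)

path = run_gd(lr=0.07)

# Steep axis theta[1]: multiplied by (1 - 0.07 * 25) = -0.75 per step, so it flips sign.
# Shallow axis theta[0]: multiplied by (1 - 0.07) = 0.93 per step, so it crawls.
sign_flips = int(np.sum(np.sign(path[1:, 1]) != np.sign(path[:-1, 1])))
print("sign flips along the steep axis:", sign_flips)
print("remaining distance along the shallow axis:", path[-1, 0])
```

The steep coordinate changes sign on every step while the shallow coordinate has covered only part of its distance. Raising the learning rate to speed up the shallow axis would make the steep axis diverge, which is the trade-off that per-parameter scaling is meant to ease.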

Intuition for the chain rule: if temperature change affects pressure, and pressure affects volume, how does temperature affect volume? Multiply the individual sensitivities.
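
The "multiply the sensitivities" idea can be checked numerically. The sketch below uses two made-up relations, `pressure(temp)` and `volume(pres)`; they are hypothetical stand-ins chosen for easy derivatives, not physical laws.

```python
def pressure(temp):
    # Hypothetical relation: P = 0.5 * T^2
    return 0.5 * temp ** 2

def volume(pres):
    # Hypothetical relation: V = 100 / P
    return 100.0 / pres

T = 4.0
P = pressure(T)

dP_dT = T                     # derivative of 0.5 * T^2 with respect to T
dV_dP = -100.0 / P ** 2       # derivative of 100 / P with respect to P
dV_dT_chain = dV_dP * dP_dT   # chain rule: multiply the two sensitivities

# Independent check: central finite difference on the composed function
h = 1e-6
dV_dT_numeric = (volume(pressure(T + h)) - volume(pressure(T - h))) / (2 * h)

print("chain rule:       ", dV_dT_chain)
print("finite difference:", dV_dT_numeric)
```

The two numbers should agree to several decimal places. Backpropagation applies the same multiplication layer by layer, reusing each intermediate sensitivity instead of recomputing it.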

⚙️

Adam Optimizer — Step by Step

algorithm
1. Initialize: θ, m₀=0 (1st moment), v₀=0 (2nd moment), t=0, β₁=0.9, β₂=0.999, ε=1e-8
2. Increment t ← t + 1 and compute the gradient: g_t = ∇_θ L(θ_{t-1})
3. Update biased 1st moment (momentum): m_t = β₁·m_{t-1} + (1-β₁)·g_t
4. Update biased 2nd moment (adaptive scale): v_t = β₂·v_{t-1} + (1-β₂)·g_t² (squared elementwise)
5. Bias correction: m̂_t = m_t/(1-β₁ᵗ), v̂_t = v_t/(1-β₂ᵗ)
6. Parameter update: θ_t = θ_{t-1} - η·m̂_t / (√v̂_t + ε)
7. Intuition: m̂_t is a bias-corrected running average of gradients (momentum). Dividing by √v̂_t normalizes each coordinate by its recent gradient magnitude, so parameters with consistently large gradients take smaller effective steps.
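
One consequence of the bias correction is worth seeing in numbers: at t=1 the corrected moments reduce to m̂₁ = g₁ and v̂₁ = g₁², so the first update is roughly η·sign(g₁) regardless of how large the gradient is. The helper below, `adam_first_step`, is a hypothetical name for a sketch of that single step; the gradient values are arbitrary.

```python
import numpy as np

def adam_first_step(g, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Steps 3 to 6 of the algorithm at t = 1, starting from m_0 = v_0 = 0
    m = (1 - b1) * g
    v = (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** 1)   # undoes the (1 - b1) shrinkage
    v_hat = v / (1 - b2 ** 1)   # undoes the (1 - b2) shrinkage
    return lr * m_hat / (np.sqrt(v_hat) + eps)

# Gradients that differ by six orders of magnitude
g = np.array([1000.0, 0.001, -5.0])
step = adam_first_step(g)
print(step)
```

Each entry of `step` has magnitude close to `lr` and carries the sign of its gradient. Without the bias correction, the zero-initialized moments would shrink the early updates in a way that depends on β₁ and β₂.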

</>

Gradient Descent from Scratch

code
```python
import numpy as np
import matplotlib.pyplot as plt

# ── Numerical derivatives (educational) ──────────────────────────────────────
def numerical_grad(f, x, h=1e-5):
    """Central difference approximation: (f(x+h) - f(x-h)) / 2h"""
    x = np.asarray(x, dtype=float)  # avoid integer truncation when adding h
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus  = x.copy(); x_plus[i]  += h
        x_minus = x.copy(); x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

# ── 1. Gradient Descent on simple quadratic ───────────────────────────────────
def loss(theta):
    return (theta[0] - 3)**2 + (theta[1] + 1)**2  # minimum at (3,-1)

def grad_loss(theta):
    return np.array([2*(theta[0]-3), 2*(theta[1]+1)])

theta = np.array([0., 0.])
lr = 0.1
history = [theta.copy()]

for step in range(50):
    g = grad_loss(theta)
    theta -= lr * g
    history.append(theta.copy())
    if np.linalg.norm(g) < 1e-6:  # stop early once the gradient is tiny
        print(f"Converged at step {step}")
        break

print(f"Final θ: {theta.round(4)}")  # ≈ [3, -1]

# ── 2. Adam optimizer ────────────────────────────────────────────────────────
def adam(grad_fn, theta_init, lr=0.01, n_steps=100, b1=0.9, b2=0.999, eps=1e-8):
    theta = theta_init.copy().astype(float)
    m, v = np.zeros_like(theta), np.zeros_like(theta)
    history = [theta.copy()]
    for t in range(1, n_steps+1):
        g = grad_fn(theta)
        m = b1*m + (1-b1)*g          # biased 1st moment
        v = b2*v + (1-b2)*g**2       # biased 2nd moment
        m_hat = m / (1 - b1**t)      # bias correction
        v_hat = v / (1 - b2**t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
        history.append(theta.copy())
    return theta, history

theta_adam, hist_adam = adam(grad_loss, np.array([0., 0.]), lr=0.1)
print(f"Adam θ: {theta_adam.round(4)}")

# ── 3. Chain rule in action (manual backprop) ─────────────────────────────────
# f(x) = (2x + 1)^2. df/dx = 2 * (2x+1) * 2 = 4*(2x+1)
x = 3.0
# Forward pass
u = 2*x + 1    # u = 7
f = u**2       # f = 49

# Backward pass (chain rule)
df_du = 2*u    # = 14
du_dx = 2      # constant
df_dx = df_du * du_dx   # = 28
print(f"df/dx at x=3: {df_dx}")  # analytical: 4*(2*3+1) = 28 ✓

# ── 4. Learning rate sensitivity ─────────────────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(12,3))
for ax, lr_val in zip(axes, [0.01, 0.1, 0.9]):
    theta = np.array([0.])
    losses = []
    for _ in range(100):
        g = 2*(theta[0] - 5)
        theta[0] -= lr_val * g
        losses.append((theta[0]-5)**2)
    ax.semilogy(losses)
    ax.set_title(f"lr = {lr_val}")
    ax.set_xlabel("Steps")
    ax.set_ylabel("Loss")
plt.tight_layout()
plt.show()
# lr=0.01: slow. lr=0.1: fast, monotone convergence.
# lr=0.9: theta overshoots and flips sides of the minimum every step; the loss
# curve still matches lr=0.1 because |1 - 2*lr| = 0.8 in both cases.
# For this loss any lr > 1 diverges.
```
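
The listing defines `numerical_grad` but never calls it. Its usual job is a gradient check: compare a hand-derived gradient against finite differences before trusting it in training. The sketch below repeats the quadratic loss so that it runs on its own; the test point and the choice of a relative-error metric are assumptions.

```python
import numpy as np

def loss(theta):
    return (theta[0] - 3) ** 2 + (theta[1] + 1) ** 2

def grad_loss(theta):
    # Hand-derived gradient that we want to verify
    return np.array([2 * (theta[0] - 3), 2 * (theta[1] + 1)])

def numerical_grad(f, x, h=1e-5):
    # Central differences, one coordinate at a time
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

theta = np.array([0.5, -2.0])
analytic = grad_loss(theta)
numeric = numerical_grad(loss, theta)

# Relative error is scale-free; a large value suggests a bug in grad_loss
rel_err = np.linalg.norm(analytic - numeric) / (
    np.linalg.norm(analytic) + np.linalg.norm(numeric)
)
print(f"relative error: {rel_err:.2e}")
```

Finite differences cost two loss evaluations per parameter, so this is a debugging tool for small test cases, not a replacement for backpropagation.
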
⚠️

Local Minima vs Saddle Points — What Actually Slows Training

pitfall

In high-dimensional loss landscapes (modern neural networks have millions of parameters), true local minima are rare. Most points where training appears stuck are saddle points: the gradient is zero, but the point is a minimum along some directions and a maximum along others. The gradient noise in SGD tends to push the iterate off a saddle point. The bigger practical problems are: (1) Exploding gradients in deep networks: use gradient clipping. (2) Vanishing gradients in RNNs: use gated architectures such as LSTM or GRU. (3) Poor conditioning: use batch normalization and careful weight initialization (He init for ReLU, Xavier for tanh/sigmoid).
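
A two-dimensional toy makes the saddle-point picture concrete. The sketch below assumes the textbook saddle f(x, y) = x² − y², with a hypothetical `descend` helper and an arbitrary noise scale. Because f is unbounded below, the toy only illustrates leaving the saddle, not reaching a minimum.

```python
import numpy as np

def grad_saddle(p):
    # f(x, y) = x**2 - y**2: a minimum along x and a maximum along y at the origin
    x, y = p
    return np.array([2 * x, -2 * y])

# The Hessian of f is constant; mixed-sign eigenvalues identify a saddle point
hessian = np.array([[2.0, 0.0], [0.0, -2.0]])
print("Hessian eigenvalues:", np.linalg.eigvalsh(hessian))

def descend(start, lr=0.1, steps=50, noise=0.0, seed=0):
    # Gradient descent with optional Gaussian noise added to each gradient
    rng = np.random.default_rng(seed)
    p = np.array(start, dtype=float)
    for _ in range(steps):
        g = grad_saddle(p) + noise * rng.standard_normal(2)
        p = p - lr * g
    return p

stuck = descend([0.0, 0.0])                # exact gradient is zero: never moves
escaped = descend([0.0, 0.0], noise=1e-3)  # small noise gets amplified along y

print("without noise:", stuck)
print("with noise:   ", escaped)
```

Along x the update contracts perturbations by a factor of 0.8 per step, while along y it amplifies them by 1.2, so any nonzero noise in the y component grows geometrically and carries the iterate away from the saddle.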

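For convex problems (linear regression, logistic regression, SVMs), every local minimum is a global minimum, so gradient descent with a suitably small learning rate converges to the global optimum (the non-smooth SVM hinge loss calls for subgradient steps). For neural networks, whose losses are non-convex, it finds a 'good enough' basin.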
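
For linear regression the claim is easy to test, because the global minimum is available in closed form. The sketch below uses synthetic data; the sizes, noise level, learning rate, and step count are assumptions picked so that the run converges.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.standard_normal(200)

# Closed-form least-squares solution: the global minimum of the convex MSE loss
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient descent on L(w) = mean((X w - y)^2), starting from zero
w = np.zeros(3)
lr = 0.1
for _ in range(500):
    grad = (2.0 / len(y)) * (X.T @ (X @ w - y))
    w -= lr * grad

print("max |w_gd - w_closed|:", np.max(np.abs(w - w_closed)))
```

Any other starting point leads to the same `w`, which is what convexity buys. A neural network trained from different initializations generally ends in different parameter values.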
