Calculus & Optimization
“From derivatives to gradient descent — the engine that trains every neural network”
Derivatives, partial derivatives, the chain rule (= backpropagation), and gradient descent. Then Adam, momentum, learning rate scheduling — the full story of how neural networks actually learn.
Key Formulas
Gradient
Vector of partial derivatives — points in the direction of steepest ascent
Chain Rule
The backbone of backpropagation — compose derivatives through layers
Gradient Descent
Iteratively move opposite to the gradient to minimize loss L
Adam Update
Gradient descent with adaptive per-parameter learning rates (bias-corrected 1st & 2nd moments)
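Written out in standard notation, the four formulas can be sketched as follows. The symbols are assumptions chosen to match the Adam walkthrough later in this section: θ for the parameters, η for the learning rate, and m̂, v̂ for the bias-corrected moments.

```latex
\begin{align*}
\text{Gradient:} \quad
  & \nabla L(\theta) = \left[ \frac{\partial L}{\partial \theta_1}, \dots,
    \frac{\partial L}{\partial \theta_n} \right]^{\top} \\
\text{Chain rule:} \quad
  & \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}
    \quad \text{for } y = f(u),\ u = g(x) \\
\text{Gradient descent:} \quad
  & \theta_t = \theta_{t-1} - \eta \, \nabla_\theta L(\theta_{t-1}) \\
\text{Adam update:} \quad
  & \theta_t = \theta_{t-1} - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{align*}
```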
Optimization Is What Makes Models Learn
Training a machine learning model is an optimization problem: find the parameters θ that minimize the loss function L(θ). Gradient descent is the workhorse algorithm that solves this for problems with millions of parameters where closed-form solutions don't exist. The chain rule makes it possible to compute gradients through arbitrarily deep compositions of functions — that's backpropagation. Without calculus, there is no learning: every weight update in every neural network, every boosted tree fitted to residuals, every SVM soft-margin solution — all of it is optimization.
GPT-3 has ~175 billion parameters. Thanks to the chain rule, a single backward pass computes the gradient with respect to ALL of them, and the optimizer then updates every parameter in one step.
The Gradient as a Direction in Parameter Space
Imagine the loss function as a hilly landscape and your parameters as your position. The gradient ∇L(θ) is an arrow pointing uphill. Moving in the OPPOSITE direction (−η∇L) goes downhill, toward lower loss. The learning rate η controls step size: too large and you overshoot the minimum and oscillate or diverge, too small and training takes forever. The classic problem: an elongated bowl (ill-conditioned loss surface) makes vanilla gradient descent zigzag across the valley instead of going straight to the minimum. Adam mitigates this by scaling the step for each parameter separately, based on that parameter's gradient history.
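A minimal sketch of the elongated-bowl problem, assuming a hypothetical quadratic loss L(θ) = ½(θ₀² + 50·θ₁²) chosen only for illustration. For this loss, plain gradient descent is stable only when η < 2/50 = 0.04. Just below that limit the steep coordinate flips sign on every step (the zigzag) while the flat coordinate crawls. Just above it the iterates diverge.

```python
import numpy as np

# Hypothetical ill-conditioned quadratic: L(theta) = 0.5 * (theta0^2 + 50 * theta1^2)
CURVATURE = np.array([1.0, 50.0])

def bowl_loss(theta):
    return 0.5 * np.sum(CURVATURE * theta**2)

def bowl_grad(theta):
    return CURVATURE * theta

def run_gd(lr, n_steps=100, start=(10.0, 1.0)):
    """Plain gradient descent; returns the full path as an (n_steps+1, 2) array."""
    theta = np.array(start, dtype=float)
    path = [theta.copy()]
    for _ in range(n_steps):
        theta = theta - lr * bowl_grad(theta)
        path.append(theta.copy())
    return np.array(path)

stable = run_gd(lr=0.035)    # just below the stability limit 2/50 = 0.04
unstable = run_gd(lr=0.045)  # just above it

# Count how often the steep coordinate changes sign between consecutive steps
steep = stable[:, 1]
sign_flips = int(np.sum(steep[:-1] * steep[1:] < 0))

print("sign flips in steep direction:", sign_flips)
print("flat coordinate after 100 steps:", stable[-1, 0])
print("loss with lr=0.045, start vs end:", bowl_loss(unstable[0]), bowl_loss(unstable[-1]))
```

The learning rate has to be small enough for the steepest direction, which is what makes progress along the flat direction slow.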
Intuition for the chain rule: if temperature change affects pressure, and pressure affects volume, how does temperature affect volume? Multiply the individual sensitivities.
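The same "multiply the sensitivities" idea can be sketched on a tiny two-weight network and checked against finite differences. The network (y = w2·tanh(w1·x) with a squared-error loss) and all the names below are illustrative assumptions, not something defined elsewhere in this section.

```python
import numpy as np

def forward(w1, w2, x, target):
    z = w1 * x                 # first "layer"
    a = np.tanh(z)             # nonlinearity
    y = w2 * a                 # second "layer"
    loss = (y - target) ** 2
    return a, y, loss

def backward(w1, w2, x, target):
    """Manual backprop: multiply local derivatives from the loss back to each weight."""
    a, y, _ = forward(w1, w2, x, target)
    dloss_dy = 2.0 * (y - target)
    dy_dw2 = a
    dy_da = w2
    da_dz = 1.0 - a**2         # derivative of tanh(z)
    dz_dw1 = x
    grad_w1 = dloss_dy * dy_da * da_dz * dz_dw1
    grad_w2 = dloss_dy * dy_dw2
    return grad_w1, grad_w2

def finite_diff(w1, w2, x, target, h=1e-6):
    """Central differences on the loss, used only to check the manual gradients."""
    loss_at = lambda a_w1, a_w2: forward(a_w1, a_w2, x, target)[2]
    fd_w1 = (loss_at(w1 + h, w2) - loss_at(w1 - h, w2)) / (2 * h)
    fd_w2 = (loss_at(w1, w2 + h) - loss_at(w1, w2 - h)) / (2 * h)
    return fd_w1, fd_w2

w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
grad_w1, grad_w2 = backward(w1, w2, x, target)
fd_w1, fd_w2 = finite_diff(w1, w2, x, target)

print("backprop:   ", grad_w1, grad_w2)
print("finite diff:", fd_w1, fd_w2)
```

Comparing analytic gradients with finite differences like this is a common sanity check when writing a backward pass by hand.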
Adam Optimizer — Step by Step
Initialize: θ, m₀=0 (1st moment), v₀=0 (2nd moment), t=0, β₁=0.9, β₂=0.999, ε=1e-8
Increment the step counter and compute the gradient: t ← t + 1, g_t = ∇_θ L(θ_{t-1})
Update biased 1st moment (momentum): m_t = β₁·m_{t-1} + (1-β₁)·g_t
Update biased 2nd moment (adaptive scale): v_t = β₂·v_{t-1} + (1-β₂)·g_t² (g_t² is the elementwise square)
Bias correction: m̂_t = m_t/(1-β₁ᵗ), v̂_t = v_t/(1-β₂ᵗ)
Parameter update: θ_t = θ_{t-1} - η·m̂_t / (√v̂_t + ε)
Intuition: m̂_t is a running average of gradients (momentum). Dividing by √v̂_t normalizes each parameter's step by its recent gradient magnitude, so parameters with consistently large gradients get smaller effective learning rates.
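One consequence of the steps above is easy to check by hand. At t = 1 the bias correction gives m̂₁ = g₁ and v̂₁ = g₁², so the first update is −η·g₁/(|g₁| + ε) ≈ −η·sign(g₁). Every parameter moves by about η regardless of its gradient's scale. The sketch below illustrates this with made-up gradient values. It covers the first step only, since later steps depend on the gradient history.

```python
import numpy as np

def adam_first_step(g, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """Parameter change applied by Adam at t = 1, starting from m = v = 0."""
    m = (1 - b1) * g                 # m_1 = b1*0 + (1-b1)*g
    v = (1 - b2) * g**2              # v_1 = b2*0 + (1-b2)*g^2
    m_hat = m / (1 - b1**1)          # bias correction recovers g
    v_hat = v / (1 - b2**1)          # bias correction recovers g^2
    return -lr * m_hat / (np.sqrt(v_hat) + eps)

# Illustrative gradients spanning six orders of magnitude
g = np.array([1000.0, 0.001, -5.0])
lr = 0.01

adam_step = adam_first_step(g, lr=lr)
sgd_step = -lr * g                   # plain gradient descent step, for contrast

print("Adam first step:", adam_step)
print("Plain GD step:  ", sgd_step)
```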
Gradient Descent from Scratch
import numpy as np
import matplotlib.pyplot as plt

# ── Numerical derivatives (educational) ──────────────────────────────────────
def numerical_grad(f, x, h=1e-5):
    """Central difference approximation: (f(x+h) - f(x-h)) / 2h"""
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        x_plus = x.copy()
        x_plus[i] += h
        x_minus = x.copy()
        x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

# ── 1. Gradient Descent on simple quadratic ───────────────────────────────────
def loss(theta):
    return (theta[0] - 3)**2 + (theta[1] + 1)**2  # minimum at (3,-1)

def grad_loss(theta):
    return np.array([2*(theta[0]-3), 2*(theta[1]+1)])

theta = np.array([0., 0.])
lr = 0.1
history = [theta.copy()]

for step in range(50):
    g = grad_loss(theta)
    theta -= lr * g
    history.append(theta.copy())
    if np.linalg.norm(g) < 1e-6:
        print(f"Converged at step {step}")
        break

print(f"Final θ: {theta.round(4)}")  # ≈ [3, -1]

# ── 2. Adam optimizer ────────────────────────────────────────────────────────
def adam(grad_fn, theta_init, lr=0.01, n_steps=100, b1=0.9, b2=0.999, eps=1e-8):
    theta = theta_init.copy().astype(float)
    m, v = np.zeros_like(theta), np.zeros_like(theta)
    history = [theta.copy()]

    for t in range(1, n_steps+1):
        g = grad_fn(theta)
        m = b1*m + (1-b1)*g
        v = b2*v + (1-b2)*g**2
        m_hat = m / (1 - b1**t)
        v_hat = v / (1 - b2**t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
        history.append(theta.copy())

    return theta, history

theta_adam, hist_adam = adam(grad_loss, np.array([0., 0.]), lr=0.1)
print(f"Adam θ: {theta_adam.round(4)}")

# ── 3. Chain rule in action (manual backprop) ─────────────────────────────────
# f(x) = (2x + 1)^2.  df/dx = 2 * (2x+1) * 2 = 4*(2x+1)
x = 3.0

# Forward pass
u = 2*x + 1      # u = 7
f = u**2         # f = 49

# Backward pass (chain rule)
df_du = 2*u      # = 14
du_dx = 2        # constant
df_dx = df_du * du_dx  # = 28
print(f"df/dx at x=3: {df_dx}")  # analytical: 4*(2*3+1) = 28 ✓

# ── 4. Learning rate sensitivity ─────────────────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, lr_val in zip(axes, [0.01, 0.1, 0.9]):
    theta = np.array([0.])
    losses = []
    for _ in range(100):
        g = 2*(theta[0] - 5)
        theta[0] -= lr_val * g
        losses.append((theta[0]-5)**2)
    ax.semilogy(losses)
    ax.set_title(f"lr = {lr_val}")
    ax.set_xlabel("Steps")
    ax.set_ylabel("Loss")
plt.tight_layout()
plt.show()
# lr=0.01: slow, lr=0.1: perfect, lr=0.9: oscillates
Local Minima vs Saddle Points — What Actually Slows Training
In high-dimensional loss landscapes (modern neural networks have millions of parameters or more), true local minima are thought to be rare: most points where training appears stuck are saddle points, where the gradient is zero but the surface curves upward in some directions and downward in others. The noise in stochastic gradient descent (SGD) tends to push the iterate off a saddle point, so escaping usually needs no special handling. The bigger practical problems are: (1) Exploding gradients in deep networks: use gradient clipping. (2) Vanishing gradients in RNNs: use gated architectures such as LSTM or GRU. (3) Poor conditioning: use batch normalization and a suitable weight initialization (He init for ReLU, Xavier for tanh/sigmoid).
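A small sketch of the saddle-point behavior, assuming a hypothetical two-dimensional function f(x, y) = x² + y⁴/4 − y²/2, which has a saddle at the origin and minima at (0, ±1) where f = −0.25. Gaussian noise added to the gradient stands in for SGD's minibatch noise here. That is an illustrative assumption, not a model of any particular training setup.

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + 0.25 * y**4 - 0.5 * y**2

def grad_f(p):
    x, y = p
    return np.array([2.0 * x, y**3 - y])

def descend(start, lr=0.1, n_steps=500, noise_std=0.0, seed=0):
    """Gradient descent with optional Gaussian gradient noise (a stand-in for SGD)."""
    rng = np.random.default_rng(seed)
    p = np.array(start, dtype=float)
    for _ in range(n_steps):
        g = grad_f(p) + noise_std * rng.standard_normal(2)
        p = p - lr * g
    return p

# Start exactly on the line y = 0, which leads straight into the saddle
p_plain = descend((1.0, 0.0))                  # noiseless: y never leaves 0
p_noisy = descend((1.0, 0.0), noise_std=0.01)  # noisy: y drifts off and grows

print("noiseless end point:", p_plain, "loss:", f(p_plain))
print("noisy end point:    ", p_noisy, "loss:", f(p_noisy))
```

Without noise the y-component of the gradient is exactly zero at every step, so the iterate settles on the saddle. With even a small perturbation, the negative curvature along y amplifies it until the iterate reaches one of the two minima.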
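For convex problems (linear regression, logistic regression, SVMs), gradient descent with a suitably small learning rate converges toward the global minimum, because every local minimum of a convex function is also global. For neural networks there is no such guarantee; in practice it finds a 'good enough' basin.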
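The convex case can be checked directly on linear regression, where a closed-form least-squares solution exists for comparison. The sketch below uses synthetic data and a hand-picked learning rate, both assumptions made for illustration. It runs gradient descent from two different starting points and compares the results with `np.linalg.lstsq`.

```python
import numpy as np

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.standard_normal(200)

def mse_grad(w):
    """Gradient of the mean squared error (1/n) * ||Xw - y||^2."""
    return 2.0 / len(y) * X.T @ (X @ w - y)

def fit(w0, lr=0.1, n_steps=500):
    w = np.array(w0, dtype=float)
    for _ in range(n_steps):
        w -= lr * mse_grad(w)
    return w

# Closed-form least-squares solution for reference
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

w_a = fit(np.zeros(3))
w_b = fit(np.array([10.0, -10.0, 10.0]))

print("closed form:    ", w_closed)
print("GD from zeros:  ", w_a)
print("GD from far off:", w_b)
```

Both runs end at the same weights as the closed-form solution, which is the behavior convexity guarantees. A neural network trained from two different initializations generally would not land on identical parameters.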