Neural Networks — Forward & Backpropagation
“Universal approximators built from threshold logic, optimized by calculus”
From single perceptron to multi-layer networks: forward pass, activation functions (ReLU/sigmoid/tanh), backpropagation derivation, vanishing gradients, and weight initialization strategies.
Key Formulas
Forward Pass
z[l] = W[l] · a[l-1] + b[l]
Linear transformation at layer l
Activation
a[l] = σ(z[l])
Non-linear activation applied element-wise
Backprop Delta
δ[l] = (W[l+1]ᵀ · δ[l+1]) ⊙ σ'(z[l])
Error signal propagated backwards through layer l
Weight Gradient
∂L/∂W[l] = δ[l] · a[l-1]ᵀ
Gradient of loss w.r.t. weights at layer l
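The forward pass, activation, delta, and weight-gradient steps can be traced in a few lines of NumPy. This is a minimal sketch, not a reference implementation: it assumes a sigmoid activation in every layer and a squared-error loss L = ½‖a[L] − y‖², neither of which is fixed by the text, and the names `forward` and `backward` are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Return pre-activations z[l] and activations a[l] for every layer."""
    activations = [x]                      # a[0] is the input
    pre_activations = []
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b        # z[l] = W[l] a[l-1] + b[l]
        pre_activations.append(z)
        activations.append(sigmoid(z))     # a[l] = sigma(z[l])
    return pre_activations, activations

def backward(y, weights, pre_activations, activations):
    """Return dL/dW[l] and dL/db[l] for the assumed loss 0.5 * ||a[L] - y||^2."""
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    s = sigmoid(pre_activations[-1])
    delta = (activations[-1] - y) * s * (1 - s)            # delta at the output layer
    for l in reversed(range(len(weights))):
        grads_W[l] = np.outer(delta, activations[l])       # delta[l] a[l-1]^T
        grads_b[l] = delta
        if l > 0:
            s = sigmoid(pre_activations[l - 1])
            delta = (weights[l].T @ delta) * s * (1 - s)   # propagate one layer back
    return grads_W, grads_b

# Hypothetical 3 -> 4 -> 2 network with random weights.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
x, y = rng.normal(size=3), np.array([0.0, 1.0])

zs, activations = forward(x, weights, biases)
grads_W, grads_b = backward(y, weights, zs, activations)
print([g.shape for g in grads_W])  # each gradient has the shape of its weight matrix
```

A quick sanity check for code like this is to compare one entry of `grads_W` against a central finite difference of the loss.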
The Universal Approximation Theorem
Cybenko (1989) proved that a network with a single hidden layer of sigmoidal units can approximate any continuous function on a compact domain to arbitrary precision, given enough neurons. But 'enough' can be impractically large for complex functions. For many structured functions, deep networks (many layers, fewer neurons per layer) reach the same accuracy with far fewer parameters, in some cases exponentially fewer, because they learn hierarchical representations. This is why depth matters.
For some function families, a deep network with L layers and n neurons per layer can represent functions that would require on the order of 2ⁿ neurons in a network with a single hidden layer. Depth is compression.
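One well-known illustration of this depth advantage is a sawtooth built by composing a two-unit ReLU "tent" layer with itself. The sketch below is only an illustration under that specific construction, with hypothetical function names. Each extra layer doubles the number of linear pieces, while a single hidden layer of m ReLU units on a scalar input can produce at most m + 1 pieces.

```python
def relu(v):
    return max(0.0, v)

def tent(x):
    """One layer with two ReLU units: rises 0 -> 1 on [0, 0.5], falls 1 -> 0 on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_sawtooth(x, depth):
    """Compose the tent layer `depth` times (2 * depth ReLU units in total)."""
    for _ in range(depth):
        x = tent(x)
    return x

def count_peaks(depth):
    """Count grid points i / 2**depth in [0, 1] where the output reaches 1."""
    n = 2 ** depth
    return sum(1 for i in range(n + 1) if deep_sawtooth(i / n, depth) == 1.0)

for depth in (1, 2, 3, 5):
    # columns: depth, ReLU units used, number of peaks
    print(depth, 2 * depth, count_peaks(depth))
```

At depth 5, ten ReLU units produce 16 peaks, which is 32 linear pieces. A single hidden layer would need at least 31 units to match that piece count.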
What Neurons Actually Compute
Each neuron computes a weighted sum of its inputs, w · x + b, then applies a non-linearity. The set where w · x + b = 0 is a hyperplane that separates the input space into two regions; a sigmoid neuron turns the (scaled) signed distance from that hyperplane into a smooth transition between 0 and 1. Multiple neurons in a layer create multiple hyperplanes. Deeper layers compose them, creating increasingly complex decision boundaries: curves, then curves of curves, then manifolds.
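A minimal sketch of a single sigmoid neuron makes the geometry concrete. The weights and inputs are made up for illustration. The output crosses 0.5 exactly where the weighted sum is zero, which is on the hyperplane.

```python
import math

def neuron(x, w, b):
    """Weighted sum of the inputs followed by a sigmoid non-linearity."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

w, b = [1.0, -1.0], 0.0            # hyperplane: x1 - x2 = 0

print(neuron([2.0, 0.0], w, b))    # above 0.5: positive side of the hyperplane
print(neuron([0.0, 2.0], w, b))    # below 0.5: negative side
print(neuron([1.0, 1.0], w, b))    # exactly 0.5: on the hyperplane
```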
Backpropagation: The Chain Rule at Scale
Training requires computing ∂L/∂W for every weight. Estimating each derivative numerically is infeasible: perturbing one weight at a time, a network with 100M parameters would need at least 100M separate forward passes per update. Backpropagation exploits the chain rule, reusing intermediate results, to compute all gradients in a single backward pass whose cost is a small constant multiple of one forward pass. This is the algorithm that made deep learning practical.
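The cost difference shows up even on a toy model. The sketch below assumes a chain of scalar tanh layers with a squared-error loss, chosen only to keep the code short. It computes the same gradients two ways: one forward and one backward sweep, versus two extra forward passes per weight for a central finite difference.

```python
import math

def forward(weights, x):
    """Chain of scalar layers a[l] = tanh(w[l] * a[l-1]); returns all activations."""
    activations = [x]
    for w in weights:
        activations.append(math.tanh(w * activations[-1]))
    return activations

def loss(weights, x, y):
    return 0.5 * (forward(weights, x)[-1] - y) ** 2

def grad_backprop(weights, x, y):
    """All gradients from one forward pass and one backward sweep."""
    a = forward(weights, x)
    grads = [0.0] * len(weights)
    upstream = a[-1] - y                            # dL/d(output)
    for l in reversed(range(len(weights))):
        delta = upstream * (1.0 - a[l + 1] ** 2)    # tanh'(z) = 1 - tanh(z)^2
        grads[l] = delta * a[l]                     # dL/dw[l]
        upstream = delta * weights[l]               # dL/d(input of this layer)
    return grads

def grad_finite_diff(weights, x, y, eps=1e-6):
    """Central differences: two extra forward passes per weight."""
    grads = []
    for i in range(len(weights)):
        plus = weights[:i] + [weights[i] + eps] + weights[i + 1:]
        minus = weights[:i] + [weights[i] - eps] + weights[i + 1:]
        grads.append((loss(plus, x, y) - loss(minus, x, y)) / (2 * eps))
    return grads

weights = [0.5, -1.2, 0.8, 1.5]
g_bp = grad_backprop(weights, x=0.7, y=0.3)
g_fd = grad_finite_diff(weights, x=0.7, y=0.3)
print(max(abs(p - q) for p, q in zip(g_bp, g_fd)))  # small: the two methods agree
```

With n weights the finite-difference version runs 2n forward passes, while the backprop version runs one forward and one backward sweep regardless of n.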
The Vanishing Gradient Problem
During backpropagation, the error signal picks up one factor per layer: δ[l] = (W[l+1]ᵀ · δ[l+1]) ⊙ σ'(z[l]). For sigmoid, σ'(z) ≤ 0.25 everywhere. After 10 layers, the activation derivatives alone scale the gradient by at most 0.25¹⁰ ≈ 0.000001, so unless the weights are large enough to compensate, the gradient essentially vanishes and early layers stop learning. ReLU mitigates this: its derivative is exactly 1 for z > 0, so the activation itself does not shrink gradients flowing through active units.
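The arithmetic can be checked directly. This sketch multiplies only the activation derivatives along one path through the layers and deliberately ignores the weight matrices, so it isolates the activation's contribution rather than simulating a full network.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def activation_factor(grad_fn, pre_activations):
    """Product of activation derivatives along one path through the layers."""
    factor = 1.0
    for z in pre_activations:
        factor *= grad_fn(z)
    return factor

# Best case for sigmoid: every pre-activation at 0, where sigma'(0) = 0.25.
print(activation_factor(sigmoid_grad, [0.0] * 10))   # 0.25 ** 10, about 9.5e-07
print(activation_factor(relu_grad, [1.0] * 10))      # 1.0 for an all-active path
print(activation_factor(relu_grad, [1.0, -1.0]))     # 0.0: an inactive unit blocks the path
```

The last line is the flip side of ReLU: a unit with z ≤ 0 passes no gradient at all.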
Mini-Batch SGD Training Loop
Initialize weights: He init for ReLU (W ~ N(0, 2/fan_in), i.e. standard deviation √(2/fan_in))
For each epoch, shuffle training data
For each mini-batch of size B:
Forward pass: compute activations a[1]...a[L] and loss L
Backward pass: compute δ[L] then propagate backwards
Update: W[l] ← W[l] - α · ∂L/∂W[l]
Update: b[l] ← b[l] - α · ∂L/∂b[l]
Apply learning rate scheduler (e.g. CosineAnnealingLR or ReduceLROnPlateau)
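The steps above can be sketched end to end in NumPy. Everything specific here is an assumption made for illustration: the toy dataset, the 2-16-1 architecture, the sigmoid output with binary cross-entropy, the hyperparameters, and the hand-written cosine schedule standing in for a library scheduler.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data (hypothetical): label is 1 when x0 + x1 > 0.
X = rng.normal(size=(512, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# He initialization: standard deviation sqrt(2 / fan_in), biases at zero.
fan_in, hidden = 2, 16
W1 = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(0.0, np.sqrt(2.0 / hidden), size=(hidden, 1))
b2 = np.zeros(1)

def forward(xb):
    z1 = xb @ W1 + b1
    a1 = np.maximum(z1, 0.0)                  # ReLU
    z2 = (a1 @ W2 + b2).ravel()
    p = 1.0 / (1.0 + np.exp(-z2))             # sigmoid output
    return z1, a1, p

def bce(p, yb):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-np.mean(yb * np.log(p) + (1 - yb) * np.log(1 - p)))

alpha0, batch_size, epochs = 0.1, 32, 20
initial_loss = bce(forward(X)[2], y)

for epoch in range(epochs):
    # Cosine learning-rate schedule, applied once per epoch.
    alpha = alpha0 * 0.5 * (1.0 + np.cos(np.pi * epoch / epochs))
    order = rng.permutation(len(X))           # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        z1, a1, p = forward(xb)               # forward pass
        # Backward pass: sigmoid + BCE gives an output delta of (p - y) / B.
        delta2 = ((p - yb) / len(xb))[:, None]
        grad_W2 = a1.T @ delta2
        grad_b2 = delta2.sum(axis=0)
        delta1 = (delta2 @ W2.T) * (z1 > 0)   # ReLU derivative is 0 or 1
        grad_W1 = xb.T @ delta1
        grad_b1 = delta1.sum(axis=0)
        # Gradient-descent updates for weights and biases.
        W2 -= alpha * grad_W2
        b2 -= alpha * grad_b2
        W1 -= alpha * grad_W1
        b1 -= alpha * grad_b1

final_loss = bce(forward(X)[2], y)
print(initial_loss, final_loss)               # loss before and after training
```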
PyTorch: Building and Training
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

# --- Sample dataloader ---
X_data = torch.randn(1000, 128)
y_data = torch.randint(0, 2, (1000,)).float()
dataloader = DataLoader(TensorDataset(X_data, y_data), batch_size=64, shuffle=True)


class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim, dropout=0.3):
        super().__init__()
        layers = []
        dims = [input_dim] + hidden_dims
        for i in range(len(hidden_dims)):
            layers += [
                nn.Linear(dims[i], dims[i + 1]),
                nn.BatchNorm1d(dims[i + 1]),
                nn.ReLU(),
                nn.Dropout(dropout),
            ]
        layers.append(nn.Linear(hidden_dims[-1], output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


model = MLP(input_dim=128, hidden_dims=[256, 128, 64], output_dim=1)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
criterion = nn.BCEWithLogitsLoss()  # create once, not on every batch

# Training step (one epoch)
model.train()  # enable Dropout and batch statistics in BatchNorm
for x_batch, y_batch in dataloader:
    optimizer.zero_grad()
    logits = model(x_batch).squeeze(-1)  # (B, 1) -> (B,), safe even when B == 1
    loss = criterion(logits, y_batch)
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
scheduler.step()  # once per epoch, assuming T_max=100 counts epochs
```
Critical Pitfalls
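Dead ReLU neurons: if a neuron's weights push z < 0 for all inputs, it outputs zero and receives zero gradient, so it never recovers. Use LeakyReLU, He initialization, or a lower learning rate.
Exploding gradients: nn.utils.clip_grad_norm_(model.parameters(), 1.0) is a cheap safeguard; call it after loss.backward() and before optimizer.step().
No BatchNorm: shifting activation distributions (internal covariate shift) can make deep networks unstable. A common placement is BatchNorm between the linear layer and the activation.
Learning rate: too high → loss diverges; too low → training can take orders of magnitude longer. Use a learning-rate range test (e.g. fastai's lr_find) or start at 1e-3 with AdamW.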
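To make the exploding-gradient safeguard less of a black box, here is a plain-Python sketch of the idea behind clipping by global norm. It is a simplified illustration, not PyTorch's implementation, and `clip_by_global_norm` is a hypothetical helper. All gradients are scaled by one common factor, so the update direction is preserved and only its length is capped.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale every gradient by one factor so the joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in tensor] for tensor in grads], total_norm

# Two flattened "parameter tensors"; joint norm = sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for tensor in clipped for g in tensor))
print(norm_before, norm_after)   # 13.0 before, just under 1.0 after
```

Gradients whose joint norm is already below `max_norm` pass through unchanged, because the scale factor is capped at 1.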