ML Learning Hub
Deep Learning · intermediate

Neural Networks — Forward & Backpropagation

Universal approximators built from threshold logic, optimized by calculus

From single perceptron to multi-layer networks: forward pass, activation functions (ReLU/sigmoid/tanh), backpropagation derivation, vanishing gradients, and weight initialization strategies.

60 min
10 diagrams
6 Concepts Covered

Prerequisites

Linear Algebra
Calculus & Optimization
Logistic Regression

Concepts Covered

Perceptron
Backpropagation
ReLU
Vanishing Gradient
Weight Initialization
Chain Rule

Key Formulas

Forward Pass

z[l] = W[l] · a[l-1] + b[l]

Linear transformation at layer l

Activation

a[l] = σ(z[l])

Non-linear activation applied element-wise

Backprop Delta

δ[l] = (W[l+1]ᵀ · δ[l+1]) ⊙ σ'(z[l])

Error signal propagated backwards through layer l

Weight Gradient

∂L/∂W[l] = δ[l] · a[l-1]ᵀ

Gradient of loss w.r.t. weights at layer l

🎯

The Universal Approximation Theorem

motivation

Cybenko (1989) proved that a network with a single hidden layer of sigmoidal units can approximate any continuous function on a compact domain to arbitrary precision, given enough neurons. But 'enough' can be impractically large for complex functions. For certain families of functions, deep networks (many layers, fewer neurons per layer) reach the same accuracy with exponentially fewer parameters, because they reuse intermediate features as hierarchical representations. This is why depth matters.

For some function families, a deep network with L layers and n neurons per layer can represent functions that need on the order of 2ⁿ neurons in a network with a single hidden layer. Depth is compression.

💡

What Neurons Actually Compute

intuition

Each neuron computes a weighted sum of its inputs plus a bias, then applies a non-linearity. Setting that sum to zero defines a hyperplane, so a single sigmoid neuron splits input space into two regions with a smooth transition between them. Multiple neurons in a layer create multiple hyperplanes. Deep layers compose these hyperplanes, creating increasingly complex decision boundaries: curves, then curves of curves, then manifolds.
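To make the weighted-sum-plus-non-linearity picture concrete, here is a minimal sketch of one sigmoid neuron in plain Python. The weights and bias are hypothetical values chosen so that the hyperplane is x1 + x2 - 1 = 0; they do not come from any trained model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """Weighted sum of the inputs plus a bias, followed by a sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical parameters: the hyperplane x1 + x2 - 1 = 0 splits the plane in two.
w, b = [1.0, 1.0], -1.0

on_boundary = neuron([0.5, 0.5], w, b)   # z = 0: the point lies on the hyperplane
above = neuron([2.0, 2.0], w, b)         # z = 3: output close to 1
below = neuron([-1.0, -1.0], w, b)       # z = -3: output close to 0
```

Points on the hyperplane map to exactly 0.5, and the output moves smoothly toward 0 or 1 on either side. A layer is simply several such neurons sharing the same input.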

Backpropagation: The Chain Rule at Scale

math

Training requires computing ∂L/∂W for every weight. Estimating each derivative separately by perturbing one weight at a time is infeasible: a network with 100M parameters would need on the order of 100M forward passes per update. Backpropagation exploits the chain rule to compute all gradients in a single backward pass, at a cost comparable to one forward pass (a small constant multiple of it). This is the algorithm that made deep learning practical.

∂L/∂w[l]_ij = ∂L/∂a[l]_i · ∂a[l]_i/∂z[l]_i · ∂z[l]_i/∂w[l]_ij

Chain rule applied to a single weight
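One way to see what backpropagation buys is to compare it with the perturb-one-weight-at-a-time approach on a network small enough to check by hand. The sketch below assumes a toy network with one sigmoid hidden unit, one sigmoid output unit, and a squared-error loss; the parameter values are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    w1, b1, w2, b2 = params
    a1 = sigmoid(w1 * x + b1)      # hidden activation
    a2 = sigmoid(w2 * a1 + b2)     # output activation
    return a1, a2

def loss(params, x, y):
    _, a2 = forward(params, x)
    return 0.5 * (a2 - y) ** 2

def backprop(params, x, y):
    """All four gradients from one forward pass and one backward pass."""
    w1, b1, w2, b2 = params
    a1, a2 = forward(params, x)
    delta2 = (a2 - y) * a2 * (1.0 - a2)      # dL/dz at the output
    delta1 = w2 * delta2 * a1 * (1.0 - a1)   # dL/dz at the hidden unit
    return [delta1 * x, delta1, delta2 * a1, delta2]

def finite_differences(params, x, y, eps=1e-5):
    """Central differences: two loss evaluations per parameter."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x, y) - loss(down, x, y)) / (2 * eps))
    return grads

params = [0.5, -0.3, 0.8, 0.1]   # arbitrary w1, b1, w2, b2
x, y = 1.5, 1.0
analytic = backprop(params, x, y)
numeric = finite_differences(params, x, y)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
```

The two gradient lists agree to within numerical error, but the finite-difference version needs two loss evaluations per parameter, while backpropagation reuses the same intermediate values for every parameter. This kind of comparison is also a common sanity check when implementing gradients by hand.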
🔬

The Vanishing Gradient Problem

deepdive

During backpropagation, the error signal is multiplied by a weight matrix and an activation derivative at each layer: δ[l] = (W[l+1]ᵀ · δ[l+1]) ⊙ σ'(z[l]). For sigmoid, σ'(z) ≤ 0.25 everywhere. Across 10 layers, the activation derivatives alone contribute a factor of at most 0.25¹⁰ ≈ 0.000001, so unless the weights are large enough to compensate, the gradient essentially vanishes and early layers stop learning. ReLU mitigates this: its derivative is exactly 1 for z > 0, so active units pass gradients through without shrinking them.

ReLU(z) = max(0, z),  ReLU'(z) = 1 if z > 0, else 0

ReLU and its derivative (mitigates vanishing gradients)
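The arithmetic behind the 10-layer figure can be reproduced in a few lines. This sketch multiplies only the activation derivatives along one path and deliberately ignores the weight matrices, so it isolates the activation's contribution rather than modelling a full network. The function names are illustrative.

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

def activation_factor(derivative, pre_activations):
    """Product of activation derivatives along one path through the layers."""
    factor = 1.0
    for z in pre_activations:
        factor *= derivative(z)
    return factor

depth = 10
# Best case for sigmoid: every pre-activation sits at z = 0, where the derivative peaks at 0.25.
best_case_sigmoid = activation_factor(sigmoid_prime, [0.0] * depth)
# ReLU with every unit on the path active (z > 0): each derivative is exactly 1.
active_relu = activation_factor(relu_prime, [1.0] * depth)
```

Even in the sigmoid's best case the factor is 0.25 ** 10, roughly one millionth, while an active ReLU path leaves the signal unchanged. A ReLU path that crosses an inactive unit contributes a factor of zero instead, which is the dead-neuron failure mode.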
⚙️

Mini-Batch SGD Training Loop

algorithm
1

Initialize weights: He init for ReLU (W ~ N(0, σ²) with standard deviation σ = √(2/fan_in))

2

For each epoch, shuffle training data

3

For each mini-batch of size B:

4

Forward pass: compute activations a[1]...a[L] and the loss

5

Backward pass: compute δ[L] at the output layer, then propagate backwards through the earlier layers

6

Update: W[l] ← W[l] - α · ∂L/∂W[l]

7

Update: b[l] ← b[l] - α · ∂L/∂b[l]

8

At the end of each epoch, apply the learning rate scheduler (e.g. CosineAnnealingLR, ReduceLROnPlateau)
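The loop above can be sketched without any framework. The example below is an illustration under stated assumptions, not a reference implementation: the task (fit y = x² on a grid), the network size, and the hyperparameters are all made up for the demo, the loss is mean squared error, and learning-rate scheduling is left out to keep it short.

```python
import math
import random

random.seed(0)

# Toy regression task (an assumption for illustration): learn y = x^2 on [-1, 1).
data = [(i / 32 - 1.0, (i / 32 - 1.0) ** 2) for i in range(64)]

H = 8          # hidden ReLU units
ALPHA = 0.02   # learning rate
BATCH = 16     # mini-batch size B
EPOCHS = 200

# He initialization: standard deviation sqrt(2 / fan_in); biases start at zero.
W1 = [random.gauss(0.0, math.sqrt(2.0 / 1)) for _ in range(H)]  # fan_in = 1
b1 = [0.0] * H
W2 = [random.gauss(0.0, math.sqrt(2.0 / H)) for _ in range(H)]  # fan_in = H
b2 = 0.0

def forward(x):
    """Return hidden pre-activations, hidden activations and the prediction."""
    z1 = [W1[j] * x + b1[j] for j in range(H)]
    a1 = [max(0.0, z) for z in z1]
    y_hat = sum(W2[j] * a1[j] for j in range(H)) + b2
    return z1, a1, y_hat

def mean_loss(samples):
    return sum((forward(x)[2] - y) ** 2 for x, y in samples) / len(samples)

loss_before = mean_loss(data)

for epoch in range(EPOCHS):
    random.shuffle(data)                                # reshuffle every epoch
    for start in range(0, len(data), BATCH):
        batch = data[start:start + BATCH]
        gW1, gb1, gW2, gb2 = [0.0] * H, [0.0] * H, [0.0] * H, 0.0
        for x, y in batch:
            z1, a1, y_hat = forward(x)                  # forward pass
            delta2 = 2.0 * (y_hat - y) / len(batch)     # output-layer delta (MSE)
            for j in range(H):
                # Hidden delta: back through W2, gated by the ReLU derivative.
                delta1 = W2[j] * delta2 * (1.0 if z1[j] > 0 else 0.0)
                gW2[j] += delta2 * a1[j]
                gW1[j] += delta1 * x
                gb1[j] += delta1
            gb2 += delta2
        for j in range(H):                              # gradient-descent updates
            W1[j] -= ALPHA * gW1[j]
            b1[j] -= ALPHA * gb1[j]
            W2[j] -= ALPHA * gW2[j]
        b2 -= ALPHA * gb2

loss_after = mean_loss(data)
```

Comparing `loss_before` with `loss_after` shows whether training made progress. Gradients are accumulated over the whole mini-batch before any parameter changes, so every sample in a batch is evaluated against the same weights.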

</>

PyTorch: Building and Training

code
python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

# ── Sample dataloader ──────────────────────────────────────────────────
X_data = torch.randn(1000, 128)
y_data = torch.randint(0, 2, (1000,)).float()
dataloader = DataLoader(TensorDataset(X_data, y_data), batch_size=64, shuffle=True)

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim, dropout=0.3):
        super().__init__()
        layers = []
        dims = [input_dim] + hidden_dims
        # Each hidden block: Linear -> BatchNorm -> ReLU -> Dropout
        for i in range(len(hidden_dims)):
            layers += [
                nn.Linear(dims[i], dims[i + 1]),
                nn.BatchNorm1d(dims[i + 1]),
                nn.ReLU(),
                nn.Dropout(dropout),
            ]
        # Output layer returns raw logits; the loss applies the sigmoid
        layers.append(nn.Linear(dims[-1], output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

num_epochs = 100  # equals T_max so the cosine schedule spans the whole run
model = MLP(input_dim=128, hidden_dims=[256, 128, 64], output_dim=1)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

# Training loop
model.train()  # enable dropout and batch statistics in BatchNorm
for epoch in range(num_epochs):
    for x_batch, y_batch in dataloader:
        optimizer.zero_grad()
        logits = model(x_batch).squeeze(-1)  # (B, 1) -> (B,)
        loss = criterion(logits, y_batch)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
    scheduler.step()  # one scheduler step per epoch, not per batch
⚠️

Critical Pitfalls

pitfall
1

Dead ReLU neurons: if a neuron's weights push z < 0 for all inputs, it outputs zero and receives zero gradient, so it cannot recover. Use LeakyReLU, He initialization, or a lower learning rate.

2

Exploding gradients: clipping with clip_grad_norm_(model.parameters(), 1.0) before optimizer.step() is a cheap safeguard worth keeping in most training loops.

3

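No BatchNorm: without normalization, the distribution of each layer's inputs shifts as earlier layers update (internal covariate shift), which can make deep networks unstable. A common placement is BatchNorm between the linear layer and the activation.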

4

Learning rate: too high → loss diverges; too low → training becomes needlessly slow. Run a learning-rate range test (for example fastai's lr_find) or start at 1e-3 with AdamW.
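To show what gradient-norm clipping does to the numbers, here is a simplified plain-Python sketch of the idea. It is not PyTorch's implementation: the function name `clip_by_global_norm` and the nested-list gradient layout are assumptions made for the illustration.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm <= max_norm:
        return [list(vec) for vec in grads], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in vec] for vec in grads], total_norm

# Two hypothetical parameter groups; joint norm = sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
```

All gradients are scaled by the same factor, so the update keeps its direction and only its length is capped. Gradients whose norm is already below the threshold pass through unchanged.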
