Deep Learning · intermediate

Neural Networks — Forward & Backpropagation

“Universal approximators built from threshold logic, optimized by calculus”

From single perceptron to multi-layer networks: forward pass, activation functions (ReLU/sigmoid/tanh), backpropagation derivation, vanishing gradients, and weight initialization strategies.

60 min

10 diagrams

6 Concepts Covered

Prerequisites

→Linear Algebra

→Calculus & Optimization

→Logistic Regression

Concepts Covered

Perceptron, Backpropagation, ReLU, Vanishing Gradient, Weight Initialization, Chain Rule


∑Key Formulas

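Forward Pass

z[l] = W[l] · a[l-1] + b[l], with a[0] = x

Linear transformation at layer l

Activation

a[l] = σ(z[l])

Non-linear activation applied element-wise

Backprop Delta

δ[l] = (W[l+1]ᵀ · δ[l+1]) ⊙ σ'(z[l])

Error signal propagated backwards through layer l

Weight Gradient

∂L/∂W[l] = δ[l] · a[l-1]ᵀ

Gradient of loss w.r.t. weights at layer l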


🎯

The Universal Approximation Theorem

motivation

Cybenko (1989) proved that a network with a single hidden layer of sigmoidal units can approximate any continuous function on a compact domain to arbitrary precision, given enough neurons. But 'enough' can be impractically large for complex functions. For many function classes, deep networks (many layers, fewer neurons per layer) reach the same accuracy with far fewer parameters, in some cases exponentially fewer, because they can learn hierarchical representations. This is why depth matters.

Some functions that a deep network represents with L layers of n neurons each would need on the order of 2ⁿ neurons in a single-hidden-layer network. Depth is compression.
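One concrete instance of depth acting as compression can be sketched in plain Python. The construction below is an illustrative assumption, not part of this lesson's material: a "tent" layer built from two ReLU units is composed with itself, and the number of linear pieces doubles with every added layer. A single-hidden-layer ReLU network on a scalar input with m units has at most m + 1 linear pieces, so matching depth 8 here would take at least 255 units instead of 16. The helper names `tent`, `deep_tent`, and `count_linear_pieces` are hypothetical.

```python
def relu(x: float) -> float:
    return max(0.0, x)

def tent(x: float) -> float:
    """Two-ReLU 'tent' layer on [0, 1]: rises 0 -> 1 on [0, 0.5], falls 1 -> 0 on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_tent(x: float, depth: int) -> float:
    """Compose the tent layer `depth` times (2 ReLU units per layer)."""
    for _ in range(depth):
        x = tent(x)
    return x

def count_linear_pieces(depth: int, grid: int = 4096) -> int:
    """Count linear segments of deep_tent on [0, 1] via slope sign changes.

    The grid is dyadic, so every breakpoint falls exactly on a grid point.
    """
    ys = [deep_tent(i / grid, depth) for i in range(grid + 1)]
    slopes = [b - a for a, b in zip(ys, ys[1:])]
    changes = sum(1 for s, t in zip(slopes, slopes[1:]) if (s > 0) != (t > 0))
    return changes + 1

# Pieces grow as 2**depth while the unit count grows as 2 * depth
pieces = {depth: count_linear_pieces(depth) for depth in (1, 2, 5, 8)}
print(pieces)  # {1: 2, 2: 4, 5: 32, 8: 256}
```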

💡

What Neurons Actually Compute

intuition

Each neuron computes a weighted sum of its inputs plus a bias, z = w · x + b, then applies a non-linearity. The set z = 0 is a hyperplane, so a single sigmoid neuron splits the input space into two regions along a flat boundary, with its output changing smoothly from 0 to 1 across it. Multiple neurons in a layer create multiple hyperplanes. Deeper layers compose them into increasingly complex decision boundaries: curves, then curves of curves, then manifolds.
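As a minimal sketch of that computation, the snippet below evaluates one sigmoid neuron in plain Python. The weights `w`, the bias `b`, and the helper name `neuron` are illustrative assumptions, chosen so that the hyperplane is x1 + x2 - 1 = 0.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """Weighted sum of the inputs plus a bias, followed by a sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical parameters: the hyperplane x1 + x2 - 1 = 0
w, b = [1.0, 1.0], -1.0

on_boundary = neuron([0.5, 0.5], w, b)    # z = 0, output is exactly 0.5
positive_side = neuron([2.0, 2.0], w, b)  # z = 3, output above 0.5
negative_side = neuron([0.0, 0.0], w, b)  # z = -1, output below 0.5
```

Thresholding the output at 0.5 is the same as asking which side of the hyperplane the input lies on.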

∑

Backpropagation: The Chain Rule at Scale

math

Training requires computing ∂L/∂W for every weight. Estimating each derivative separately is infeasible: perturbing one weight at a time (finite differences) in a network with 100M parameters would need at least 100M forward passes per update. Backpropagation exploits the chain rule to compute all gradients in a single backward pass, at a cost comparable to one forward pass (a small constant factor more). This is the algorithm that made deep learning practical.
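The contrast between the two approaches can be sketched with NumPy on a tiny network. Everything below is an illustrative assumption rather than this lesson's reference code: a 3-4-1 network with a tanh hidden layer, a linear output, and squared-error loss. The backward pass produces all gradients in one sweep, while the finite-difference check needs two forward passes for every single weight.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny network: 3 inputs -> 4 hidden (tanh) -> 1 linear output
x = rng.normal(size=(3, 1))
y = np.array([[1.0]])
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

def forward(W1, b1, W2, b2):
    z1 = W1 @ x + b1
    a1 = np.tanh(z1)
    z2 = W2 @ a1 + b2                          # linear output layer
    loss = 0.5 * float(((z2 - y) ** 2).sum())  # squared error
    return z1, a1, z2, loss

# Backward pass: one sweep yields the gradient for every parameter
z1, a1, z2, loss = forward(W1, b1, W2, b2)
delta2 = z2 - y                            # dL/dz2
dW2 = delta2 @ a1.T                        # dL/dW2 = delta2 · a1ᵀ
delta1 = (W2.T @ delta2) * (1 - a1 ** 2)   # tanh'(z1) = 1 - tanh(z1)^2
dW1 = delta1 @ x.T                         # dL/dW1 = delta1 · xᵀ

# Central finite differences: two extra forward passes per weight
eps = 1e-6
dW1_numeric = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        loss_plus = forward(W_plus, b1, W2, b2)[3]
        loss_minus = forward(W_minus, b1, W2, b2)[3]
        dW1_numeric[i, j] = (loss_plus - loss_minus) / (2 * eps)

max_abs_error = float(np.abs(dW1 - dW1_numeric).max())
print(f"max |backprop - numeric| = {max_abs_error:.2e}")
```

A gradient check like this is mainly useful for debugging a hand-written backward pass; frameworks such as PyTorch compute the backward pass automatically.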

🔬

The Vanishing Gradient Problem

deepdive

During backpropagation, the error signal picks up one factor per layer: δ[l] = (W[l+1]ᵀ · δ[l+1]) ⊙ σ'(z[l]). For sigmoid, σ'(z) ≤ 0.25 everywhere. Across 10 layers, the activation derivatives alone scale the gradient by at most 0.25¹⁰ ≈ 0.000001, so unless the weights are large enough to compensate, the gradient essentially vanishes and early layers stop learning. ReLU mitigates this: its derivative is exactly 1 for z > 0 (and 0 for z < 0), so gradients passing through active units are not shrunk by the activation.
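The arithmetic can be checked with a short Python sketch. It is deliberately simplified: it tracks only the activation-derivative factor along one path and ignores the weight matrices, and it evaluates the sigmoid at z = 0, its best case. The variable names are illustrative.

```python
import math

def sigmoid_prime(z: float) -> float:
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_prime(z: float) -> float:
    """Derivative of ReLU: 1 for active units, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

depth = 10

# Best case for sigmoid: the derivative peaks at z = 0 with value 0.25
sigmoid_factor = sigmoid_prime(0.0) ** depth
# An active ReLU unit (z > 0) passes the gradient through unchanged
relu_factor = relu_prime(1.0) ** depth

print(f"sigmoid: {sigmoid_factor:.2e}")  # sigmoid: 9.54e-07
print(f"relu:    {relu_factor:.2e}")     # relu:    1.00e+00
```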

⚙️

Mini-Batch SGD Training Loop

algorithm

Initialize weights: He init for ReLU (W ~ N(0, σ²) with σ = √(2/fan_in))

For each epoch, shuffle training data

For each mini-batch of size B:

→ Forward pass: compute activations a[1]...a[L] and loss L

→ Backward pass: compute δ[L], then propagate backwards

→ Update: W[l] ← W[l] - α · ∂L/∂W[l]

→ Update: b[l] ← b[l] - α · ∂L/∂b[l]

Apply learning rate scheduler (CosineAnnealing, ReduceLROnPlateau)
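To see why the initialization step in the loop above matters, here is a hedged NumPy sketch that pushes a random batch through a stack of ReLU layers and compares He initialization with a naive fixed standard deviation of 0.01. The depth, width, batch size, and the helper names `he_init`, `small_init`, and `final_activation_scale` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in: int, fan_out: int) -> np.ndarray:
    """Draw W ~ N(0, sigma^2) with sigma = sqrt(2 / fan_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

def small_init(fan_in: int, fan_out: int) -> np.ndarray:
    """Naive baseline: fixed standard deviation, independent of fan_in."""
    return rng.normal(0.0, 0.01, size=(fan_out, fan_in))

def final_activation_scale(init_fn, depth: int = 20, width: int = 256) -> float:
    """Forward a random batch through `depth` ReLU layers; return the output RMS."""
    a = rng.normal(size=(width, 512))  # columns are samples, input RMS is about 1
    for _ in range(depth):
        a = np.maximum(0.0, init_fn(width, width) @ a)
    return float(np.sqrt((a ** 2).mean()))

he_scale = final_activation_scale(he_init)        # stays on the order of 1
small_scale = final_activation_scale(small_init)  # collapses towards 0

print(f"He init:    {he_scale:.3e}")
print(f"small init: {small_scale:.3e}")
```

With He initialization the activation scale is roughly preserved from layer to layer, which keeps both the forward signal and the backpropagated gradients in a usable range at the start of training.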

</>

PyTorch: Building and Training

code


import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

# ── Sample dataloader ──────────────────────────────────────────────────
X_data = torch.randn(1000, 128)
y_data = torch.randint(0, 2, (1000,)).float()
dataloader = DataLoader(TensorDataset(X_data, y_data), batch_size=64, shuffle=True)

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim, dropout=0.3):
        super().__init__()
        layers = []
        dims = [input_dim] + hidden_dims
        for i in range(len(hidden_dims)):
            layers += [
                nn.Linear(dims[i], dims[i+1]),
                nn.BatchNorm1d(dims[i+1]),
                nn.ReLU(),
                nn.Dropout(dropout)
            ]
        # dims[-1] is the last hidden width (or input_dim if hidden_dims is empty)
        layers.append(nn.Linear(dims[-1], output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

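model = MLP(input_dim=128, hidden_dims=[256, 128, 64], output_dim=1)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.BCEWithLogitsLoss()  # expects raw logits and float targets

# The scheduler is stepped once per batch below, so T_max counts batches.
# num_epochs = 10 is an illustrative choice.
num_epochs = 10
scheduler = optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_epochs * len(dataloader)
)

# Training loop
model.train()  # enable Dropout and BatchNorm batch statistics
for epoch in range(num_epochs):
    for x_batch, y_batch in dataloader:
        optimizer.zero_grad()
        logits = model(x_batch).squeeze(-1)  # (B, 1) -> (B,)
        loss = criterion(logits, y_batch)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()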

⚠️

Critical Pitfalls

pitfall

Dead ReLU neurons: if a neuron's weights push z < 0 for all inputs, it never activates. Use LeakyReLU or proper He initialization.

Exploding gradients: if the loss spikes or turns into NaN, clip the gradient norm with clip_grad_norm_(model.parameters(), 1.0) after loss.backward() and before optimizer.step().

No BatchNorm: shifting activation distributions (internal covariate shift) can make deep networks unstable. Placing BatchNorm between the linear and activation layers is a common remedy.

Learning rate: too high → loss diverges; too low → training is needlessly slow. Run a learning-rate range test (for example fastai's lr_find) or start at 1e-3 with AdamW.
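The dead ReLU pitfall above can be monitored with a simple diagnostic. The NumPy sketch below is one possible approach, not a standard API: the helper `dead_relu_fraction` is hypothetical, and the two large negative biases are set by hand to simulate units that a bad update has pushed out of the active range.

```python
import numpy as np

def dead_relu_fraction(pre_activations: np.ndarray) -> float:
    """Fraction of units whose pre-activation z is <= 0 for every sample.

    pre_activations has shape (num_samples, num_units).
    """
    never_active = (pre_activations <= 0).all(axis=0)
    return float(never_active.mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))                       # 1000 samples, 16 features
W = rng.normal(0.0, np.sqrt(2.0 / 16), size=(16, 8))  # He-style scale, 8 units
b = np.zeros(8)
b[:2] = -100.0  # hypothetical: two biases pushed far into the negative range

dead = dead_relu_fraction(X @ W + b)
print(f"dead units: {dead:.0%}")  # dead units: 25%
```

A unit counted as dead here receives zero gradient through its ReLU on every sample in the batch, so gradient descent on that data alone cannot revive it.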
