ML Learning Hub
Deep Learning · Intermediate

Deep Learning Optimization

From SGD to Adam — the tricks that make deep networks actually train

SGD vs Momentum vs Adam vs AdamW, learning rate warmup and cosine scheduling, batch normalization, dropout, gradient clipping, and mixed-precision training — the recipe that makes modern deep networks train stably.

40 min
10 diagrams
8 Concepts Covered

Prerequisites

Neural Networks

Concepts Covered

Adam · AdamW · Momentum · Batch Normalization · Dropout · LR Warmup · Cosine Schedule · Gradient Clipping

Key Formulas

SGD + Momentum

Accumulate velocity in gradient direction — escapes shallow minima and damps oscillation in narrow valleys
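
In one common convention (the one PyTorch's SGD uses, assumed here), with velocity v, momentum coefficient μ, learning rate η and gradient g_t, the update can be written as:

```latex
\begin{aligned}
g_t &= \nabla_\theta L(\theta_{t-1}) \\
v_t &= \mu\, v_{t-1} + g_t \\
\theta_t &= \theta_{t-1} - \eta\, v_t
\end{aligned}
```

Other texts fold η into the velocity instead; the two forms differ only in how v is scaled.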

Adam

Per-parameter adaptive learning rate: m̂_t = bias-corrected running mean of the gradient, v̂_t = bias-corrected running mean of the squared gradient (an uncentered variance estimate)
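
Written out in the standard form, with decay rates β₁ and β₂, learning rate η and a small constant ε (symbols assumed to match the defaults quoted later on this page), the Adam step is roughly:

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2} \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^{t}}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^{t}} \\
\theta_t &= \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
```

All operations are element-wise, which is what makes the step size per-parameter.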

Batch Normalization

Normalize activations per mini-batch; learnable γ,β restore representational power
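
For a mini-batch of m activations x_1, …, x_m of one feature, the usual training-time form is sketched below (ε is a small stabilising constant, assumed here):

```latex
\begin{aligned}
\mu_B &= \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^{2} = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^{2} \\
\hat{x}_i &= \frac{x_i - \mu_B}{\sqrt{\sigma_B^{2} + \epsilon}}, \qquad
y_i = \gamma\, \hat{x}_i + \beta
\end{aligned}
```

At inference time μ_B and σ_B² are replaced by running averages collected during training.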

Cosine LR Schedule

Smoothly anneal the learning rate from ηmax to ηmin over T steps; in practice this often works better than step decay
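
With t the current step (0 ≤ t ≤ T), the schedule described above is commonly written as:

```latex
\eta_t = \eta_{\min} + \tfrac{1}{2}\,(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)
```

At t = 0 this gives η_max and at t = T it gives η_min.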

🎯

A Good Model with Bad Optimization Is Worthless

motivation

The same network architecture trained with vanilla SGD, good initialization, and proper scheduling can outperform a larger network trained carelessly. Optimization tricks are what separate 'works in the paper' from 'works on your GPU in production.'

The history: early deep networks failed because of vanishing gradients combined with poor initialization. The 2012 ImageNet breakthrough (AlexNet) used ReLU activations, dropout, and weight decay. ResNets (2015) added skip connections to keep gradients flowing at depths of 100+ layers. Modern transformers stack dozens to hundreds of layers and still train stably thanks to careful normalization, learning rate warmup, and gradient clipping. Each trick solved a concrete failure mode.

Learning rate is the most important hyperparameter. Being off by 10× is often the difference between a model that trains and one that diverges, so get it roughly right before you try anything else.

💡

Why Each Trick Exists

intuition

**Momentum (μ=0.9):** Gradient descent on an elongated loss bowl zigzags across the narrow dimension. Momentum keeps an exponentially decaying sum of past gradients, so components that flip sign from step to step cancel out while the consistent component accumulates. The slow zigzag becomes a smoother path toward the minimum.

**Adam:** Different parameters see gradients of very different size and frequency. In an embedding layer, rows for common words receive a gradient on almost every step, while rows for rare words are updated only occasionally. Adam divides each parameter's update by the square root of a running average of its squared gradients, so rarely updated or small-gradient parameters get larger effective learning rates.

**Batch normalization:** The original motivation was internal covariate shift: the distribution of each layer's inputs changes with every weight update, forcing subsequent layers to constantly readjust. BatchNorm re-centres and re-scales activations at each layer, which stabilises training and often permits learning rates several times higher.

**Dropout (p=0.5):** Randomly zeroes half the activations during training (it is switched off at inference). This forces the network to learn redundant representations and prevents co-adaptation of neurons, which makes it a powerful regularizer.
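
The momentum intuition can be checked on a toy problem. The sketch below is illustrative only: the quadratic bowl, the learning rate of 0.018, and the helper names `loss`, `grad` and `run` are assumptions chosen for the demo, not part of any library. It runs plain gradient descent and momentum (μ=0.9) for the same number of steps and compares the final loss.

```python
def loss(x, y):
    # Elongated quadratic bowl: 100x more curvature along y than along x.
    return 0.5 * (x * x + 100.0 * y * y)

def grad(x, y):
    return x, 100.0 * y

def run(lr, mu, steps=200):
    """Gradient descent with momentum mu; mu=0.0 is plain gradient descent."""
    x, y = 1.0, 1.0          # start away from the minimum at (0, 0)
    vx, vy = 0.0, 0.0        # velocity, one entry per parameter
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = mu * vx + gx    # accumulate velocity
        vy = mu * vy + gy
        x -= lr * vx         # step along the velocity, not the raw gradient
        y -= lr * vy
    return loss(x, y)

loss_plain = run(lr=0.018, mu=0.0)
loss_momentum = run(lr=0.018, mu=0.9)
print(f"plain GD: {loss_plain:.2e}   momentum: {loss_momentum:.2e}")
```

With plain gradient descent the learning rate is capped by the steep direction, so progress along the shallow direction is slow; momentum makes up for that by accumulating the small but consistent gradient there.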

Batch size and learning rate are coupled: doubling the batch size reduces gradient noise in much the same way as halving the learning rate. To keep training dynamics comparable, the linear scaling rule (Goyal et al. 2017) scales lr in proportion to batch size and adds a gradual warmup over the first 5 epochs.
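
As a rough sketch of the linear scaling rule with warmup: the helper names `scaled_lr` and `warmup_lr` are hypothetical, and the reference point (lr 0.1 at batch size 256) and the 156 steps per epoch are illustrative values rather than recommendations.

```python
def scaled_lr(base_lr, base_batch_size, batch_size):
    """Linear scaling rule: the learning rate grows in proportion to batch size."""
    return base_lr * batch_size / base_batch_size

def warmup_lr(step, warmup_steps, start_lr, target_lr):
    """Ramp linearly from start_lr to target_lr over warmup_steps, then hold."""
    if step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps

# Reference setup: lr 0.1 tuned at batch size 256, now training at batch size 8192.
target = scaled_lr(base_lr=0.1, base_batch_size=256, batch_size=8192)

steps_per_epoch = 156                 # hypothetical value for the demo
warmup_steps = 5 * steps_per_epoch    # 5-epoch warmup
for step in (0, warmup_steps // 2, warmup_steps):
    print(step, round(warmup_lr(step, warmup_steps, 0.1, target), 4))
```

Starting the ramp from the small-batch learning rate rather than from zero is one design choice among several; ramping from zero is also common.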

⚙️

Modern Deep Learning Training Recipe

algorithm
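1. Initialize weights: He init for ReLU layers (σ=√(2/fan_in)), Xavier (Glorot) init for tanh/sigmoid (σ=√(2/(fan_in+fan_out))).

2. Choose optimizer: Adam or AdamW (β₁=0.9, β₂=0.999, ε=1e-8, lr=3e-4) is a solid default for most tasks. SGD+momentum remains strong for ImageNet fine-tuning.

3. Add learning rate warmup: linearly ramp from 0 to the target lr over roughly the first 5% of total steps. This prevents large, destabilising updates while the freshly initialised model and the optimizer's statistics are still settling.

4. Cosine annealing: decay lr smoothly to a small floor such as 1e-6 over training. ReduceLROnPlateau (which lowers lr when the validation metric stalls) and OneCycleLR are strong alternatives.

5. Gradient clipping (max_norm=1.0): cap the global gradient norm before the update with torch.nn.utils.clip_grad_norm_(). This matters most for RNNs and Transformers.

6. Regularization: weight decay (λ=1e-4 to 1e-2) in the optimizer; with Adam, prefer AdamW's decoupled decay over a plain L2 penalty. Dropout (p=0.1–0.5). Label smoothing (ε=0.1) for classification.

7. Mixed precision (torch.cuda.amp): run most forward and backward operations in float16 while keeping weights and numerically sensitive operations (such as the loss) in float32, with loss scaling to avoid gradient underflow. On GPUs with tensor cores this often gives up to about 2× speed and roughly half the activation memory.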
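
Steps 3 and 4 above can be sketched as a single function of the step counter. This is a minimal, framework-free illustration: the name `lr_at_step` and its defaults are assumptions taken from the values in the recipe, not a library API.

```python
import math

def lr_at_step(step, total_steps, max_lr=3e-4, min_lr=1e-6, warmup_frac=0.05):
    """Linear warmup from 0 to max_lr, then cosine decay from max_lr to min_lr."""
    warmup_steps = max(1, int(round(warmup_frac * total_steps)))
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total_steps = 1000
schedule = [lr_at_step(s, total_steps) for s in range(total_steps + 1)]
peak_step = schedule.index(max(schedule))
print("peak at step", peak_step, "final lr", schedule[-1])
```

In PyTorch the same shape could be plugged into `torch.optim.lr_scheduler.LambdaLR` by returning `lr_at_step(step, total_steps) / max_lr` as the multiplier, assuming the optimizer was created with `lr=max_lr`.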

</>

Modern PyTorch Training Loop

code
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import GradScaler, autocast
from torch.utils.data import TensorDataset, DataLoader

# ── Minimal model + dataloader for the demo ────────────────────────────
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))

    def forward(self, x):
        return self.net(x)

X_data = torch.randn(512, 16)
y_data = torch.randint(0, 10, (512,))
train_loader = DataLoader(TensorDataset(X_data, y_data), batch_size=32, shuffle=True)

def train_one_epoch(model, loader, optimizer, scaler, scheduler, device, clip_norm=1.0):
    model.train()
    use_amp = device.type == "cuda"          # float16 autocast only applies on GPU
    total_loss = 0.0
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad(set_to_none=True)   # drops grads instead of filling them with zeros

        # ── Mixed precision forward pass ──────────────────────────────────────
        with autocast(enabled=use_amp):
            logits = model(X)
            loss = nn.functional.cross_entropy(logits, y, label_smoothing=0.1)

        # ── Scaled backward + gradient clipping ──────────────────────────────
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)               # unscale first so clipping sees the true gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()                         # step per batch for OneCycleLR

        total_loss += loss.item()

    return total_loss / len(loader)

# ── Setup ─────────────────────────────────────────────────────────────────────
EPOCHS = 100
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MyModel().to(device)

# AdamW (Adam with decoupled weight decay, usually preferable to Adam + L2)
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)

# OneCycle LR: warmup + cosine anneal in one schedule
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=3e-4,
    total_steps=EPOCHS * len(train_loader),   # epochs × steps_per_epoch
    pct_start=0.05,                           # 5% warmup
    anneal_strategy="cos",
)

scaler = GradScaler(enabled=(device.type == "cuda"))   # loss scaling is a no-op on CPU

for epoch in range(EPOCHS):
    avg_loss = train_one_epoch(model, train_loader, optimizer, scaler, scheduler, device)

# ── Optimizer comparison ──────────────────────────────────────────────────────
# SGD + Momentum (strong for fine-tuning pretrained CNNs)
sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4, nesterov=True)

# Adam (default for most deep learning tasks)
adam = optim.Adam(model.parameters(), lr=3e-4, betas=(0.9, 0.999))

# AdamW (Adam with properly decoupled weight decay, the usual modern recommendation)
adamw = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# ── Batch Normalization example ───────────────────────────────────────────────
class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),   # bias=False because BN has its own β
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# ── Layer Normalization (Transformers prefer LayerNorm) ───────────────────────
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_model * 4), nn.GELU(), nn.Linear(d_model * 4, d_model))
        self.norm1 = nn.LayerNorm(d_model)    # pre-norm (more stable than post-norm)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(0.1)

    def forward(self, x):
        # Pre-norm architecture (used in GPT-2+, better gradient flow)
        h = self.norm1(x)                     # normalise once, reuse as query, key and value
        x = x + self.drop(self.attn(h, h, h, need_weights=False)[0])
        x = x + self.drop(self.ff(self.norm2(x)))
        return x

# ── Learning rate finder (optional; in practice run it before training) ───────
# pip install torch-lr-finder
from torch_lr_finder import LRFinder
# Use a fresh optimizer: the finder expects one with no scheduler attached.
finder_opt = optim.AdamW(model.parameters(), lr=1e-7, weight_decay=1e-2)
finder = LRFinder(model, finder_opt, nn.CrossEntropyLoss(), device=device)
finder.range_test(train_loader, end_lr=10, num_iter=100)
finder.plot()    # look for the steepest descent; that's your max_lr
finder.reset()   # restore the model and optimizer to their state before the test
```
⚠️

BatchNorm's Hidden Failure Modes

pitfall

BatchNorm is powerful but has several subtle failure modes:

(1) **Small batch sizes:** BatchNorm computes statistics over the batch, so with very small batches (roughly batch_size < 8) the estimates are too noisy. Use GroupNorm (for example num_groups=32) or LayerNorm instead.

(2) **model.eval() is critical:** In eval mode, BatchNorm uses the running statistics accumulated during training. If you forget to call model.eval() before inference, each input is normalised with the statistics of whatever test batch it arrives in, so predictions depend on the other samples in the batch, and a single-sample batch has no meaningful statistics at all.

(3) **Fine-tuning on different distributions:** When fine-tuning a pretrained model, the stored running mean/var may not match your data. Consider re-estimating them on the new data, setting track_running_stats=False (which makes the layer always use batch statistics), or using a small learning rate for BN layers.

(4) **RNNs:** BatchNorm works poorly with variable-length sequences; use LayerNorm instead.

One of the most common PyTorch bugs (alongside wrong tensor dimensions) is forgetting model.eval(): BatchNorm and Dropout behave differently in train and eval mode.
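
The train/eval difference can be seen in a toy, framework-free version of batch norm. `TinyBatchNorm1d` below is a hypothetical single-feature class written for this illustration (no γ/β, biased variance for the running estimate); it is not PyTorch's implementation, which for instance rejects a single-sample batch in training mode instead of silently returning zero.

```python
import math
import random

class TinyBatchNorm1d:
    """Hypothetical single-feature batch norm, for illustration only."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0
        self.training = True

    def __call__(self, xs):
        if self.training:
            n = len(xs)
            mean = sum(xs) / n
            var = sum((x - mean) ** 2 for x in xs) / n
            # Exponential moving average of the batch statistics.
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in xs]

random.seed(0)
bn = TinyBatchNorm1d()
for _ in range(200):                       # "training" on batches of 32 drawn from N(5, 2^2)
    bn([random.gauss(5.0, 2.0) for _ in range(32)])

sample = [9.0]                             # one test input, about two std devs above the mean

bn.training = False                        # the flag that model.eval() flips
out_eval_mode = bn(sample)[0]              # normalised with running statistics

bn.training = True                         # simulates forgetting model.eval()
out_train_mode = bn(sample)[0]             # normalised against itself, so the signal is erased

print(f"eval mode: {out_eval_mode:.2f}   train mode: {out_train_mode:.2f}")
```

In training mode the lone sample is its own batch mean, so the output collapses to zero whatever the input was, and the call also overwrites part of the running statistics with that one sample.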
