Deep Learning Optimization
“From SGD to Adam — the tricks that make deep networks actually train”
SGD vs Momentum vs Adam vs AdamW, learning rate warmup and cosine scheduling, batch normalization, dropout, gradient clipping, and mixed-precision training — the recipe that makes modern deep networks train stably.
Key Formulas
SGD + Momentum
Accumulate a velocity from past gradients: this damps oscillation across narrow valleys and can carry the iterate through shallow local minima
Adam
Per-parameter adaptive learning rate: m̂_t = bias-corrected running mean of the gradient, v̂_t = bias-corrected running mean of the squared gradient (the uncentered variance)
Batch Normalization
Normalize activations per mini-batch; learnable γ,β restore representational power
Cosine LR Schedule
Smoothly anneal the learning rate from η_max to η_min over T steps; it often works better than step decay and needs no hand-picked decay milestones
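The four entries above describe their update rules in words only. As a reference, here they are written out in their standard textbook forms. The notation is an assumption introduced here, not taken from the page: g_t is the gradient at step t, θ the parameters, η the learning rate, μ the momentum coefficient, and μ_B, σ_B² the mini-batch mean and variance. The momentum line follows the convention PyTorch uses; other texts fold η into the velocity.

```latex
\begin{align*}
\text{SGD + Momentum:}\quad & v_t = \mu\, v_{t-1} + g_t, \qquad \theta_t = \theta_{t-1} - \eta\, v_t \\
\text{Adam:}\quad & m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
& \hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} \\
\text{Batch Normalization:}\quad & \hat x = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma\, \hat x + \beta \\
\text{Cosine LR Schedule:}\quad & \eta_t = \eta_{\min} + \tfrac{1}{2}\,(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)
\end{align*}
```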
A Good Model with Bad Optimization Is Worthless
The same network architecture trained with vanilla SGD, good initialization, and proper scheduling can outperform a larger network trained carelessly. Optimization tricks are what separate 'works in the paper' from 'works on your GPU in production.' The history: early deep networks failed because of vanishing gradients and poor initialization. The 2012 ImageNet breakthrough (AlexNet) used ReLU activations, dropout, and weight decay. ResNets (2015) added skip connections to keep gradients flowing at depths of 100+ layers. Modern transformers train stably at large depth (research variants have reached 1,000 layers) with careful normalization, learning rate warmup, and gradient clipping. Each trick solved a concrete failure mode.
Learning rate is the most important hyperparameter. A learning rate that is off by 10× is often the difference between a model that trains and one that diverges, so tune it before you try anything else.
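To see how sharp that cliff can be, here is a minimal sketch on a toy problem: plain gradient descent on f(x) = x². The function, the starting point, and the two learning rates are illustrative assumptions, not values from any real model.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Minimize f(x) = x^2 (gradient 2x) with plain gradient descent."""
    x = x0
    for _ in range(steps):
        x -= lr * 2.0 * x  # each step multiplies x by (1 - 2*lr)
    return x

good = gradient_descent(lr=0.15)  # per-step factor 0.7: shrinks toward 0
bad = gradient_descent(lr=1.5)    # per-step factor -2.0: flips sign and doubles

print(f"lr=0.15 -> |x| = {abs(good):.2e}")
print(f"lr=1.50 -> |x| = {abs(bad):.2e}")
```

The only change between the two runs is the 10× learning rate, yet one converges and the other blows up. Real loss surfaces are not quadratics, but the same stability threshold exists along their steepest directions.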
Why Each Trick Exists
**Momentum (μ=0.9):** Gradient descent on an elongated loss bowl zigzags across the narrow dimension. Momentum averages gradients over time, so the components that flip sign from step to step cancel while the consistent component along the valley floor accumulates. The slow zigzag becomes a smooth curve toward the minimum.
**Adam:** Different parameters see very different gradient magnitudes and update frequencies. In an embedding layer, rows for common words receive a gradient on almost every step, while rows for rare words are updated only occasionally. Adam divides each parameter's update by a running estimate of its gradient magnitude, so rarely updated or small-gradient parameters get larger effective learning rates.
**Batch normalization:** The original motivation was internal covariate shift: the distribution of each layer's inputs changes with every weight update, forcing subsequent layers to keep readjusting. BatchNorm re-centres and re-scales activations at each layer, which stabilises training and permits learning rates several times higher than an unnormalized network tolerates.
**Dropout (p=0.5):** Randomly zeroes half the activations during training and is switched off at inference. This forces the network to learn redundant representations and prevents co-adaptation of neurons, which makes it a powerful regularizer.
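The Adam point is easiest to see with numbers. The sketch below hand-rolls the standard Adam update for a single scalar parameter (it is not PyTorch's implementation) and feeds it two hypothetical constant gradients that differ by a factor of 10,000. The parameter names and the lr=0.1 setting are assumptions chosen for readability.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of the gradient
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of the squared gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

grads = {"large_grad": 100.0, "small_grad": 0.01}  # hypothetical constant gradients
adam_moved, sgd_moved = {}, {}
for name, g in grads.items():
    theta, m, v = 0.0, 0.0, 0.0
    for t in range(1, 6):                       # five update steps
        theta, m, v = adam_step(theta, g, m, v, t)
    adam_moved[name] = abs(theta)
    sgd_moved[name] = 5 * 0.1 * g               # plain SGD: distance scales with the gradient

print("Adam:", adam_moved)
print("SGD: ", sgd_moved)
```

With Adam both parameters travel almost the same distance (about lr per step), because the gradient scale cancels in m̂/√v̂. With plain SGD the distances differ by the same factor of 10,000 as the gradients.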
Batch size and learning rate are coupled: in terms of gradient noise, doubling the batch size has a similar effect to halving the learning rate. The linear scaling rule (Goyal et al. 2017) compensates by scaling lr in proportion to batch size, combined with a 5-epoch warmup.
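A minimal sketch of the rule, assuming a linear ramp from the base learning rate to the scaled one during warmup. The helper names `scaled_lr` and `warmup_lr` are hypothetical, and the numbers (lr 0.1 at batch size 256, scaled to batch size 8192) are used purely as an illustration.

```python
def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: learning rate grows in proportion to batch size."""
    return base_lr * batch / base_batch

def warmup_lr(epoch, base_lr, target_lr, warmup_epochs=5):
    """Linearly ramp from base_lr to target_lr; epoch may be fractional."""
    if epoch >= warmup_epochs:
        return target_lr
    return base_lr + (target_lr - base_lr) * epoch / warmup_epochs

target = scaled_lr(base_lr=0.1, base_batch=256, batch=8192)
for epoch in [0, 1, 2.5, 5, 10]:
    print(epoch, round(warmup_lr(epoch, 0.1, target), 4))
```

The warmup matters most at large batch sizes: jumping straight to the scaled learning rate on a freshly initialized network is exactly the 10×-too-large situation the rule otherwise avoids.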
Modern Deep Learning Training Recipe
Initialize weights: He init for ReLU layers (σ=√(2/fan_in)), Xavier for tanh/sigmoid (σ=√(2/(fan_in+fan_out))).
Choose optimizer: Adam (β₁=0.9, β₂=0.999, ε=1e-8, lr=3e-4) for most tasks. SGD+momentum for ImageNet fine-tuning.
Add learning rate warmup: linearly ramp from 0 to target lr over 5% of total steps — prevents large gradient steps before model settles.
Cosine annealing (or ReduceLROnPlateau): decay lr smoothly to 1e-6 over training. OneCycleLR is a strong alternative.
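Gradient clipping (max_norm=1.0): cap the global gradient norm before the optimizer step with torch.nn.utils.clip_grad_norm_(). Essential for RNNs and Transformers.
Regularization: weight decay (λ=1e-4 to 1e-2) set in the optimizer; with Adam, prefer AdamW's decoupled decay, because a plain L2 penalty is not equivalent there. Dropout (p=0.1–0.5). Label smoothing (ε=0.1) for classification.
Mixed precision (torch.cuda.amp): autocast runs matmuls and convolutions in float16 while the weights and numerically sensitive ops stay in float32, and GradScaler scales the loss so small gradients do not underflow. Expect up to roughly 2× speed and about half the activation memory on modern GPUs.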
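The warmup and cosine steps of the recipe combine into a single function of the step index. This is a hand-rolled sketch in plain Python, not a PyTorch scheduler; the name `lr_at` and its defaults (3e-4 peak, 1e-6 floor, 5% warmup, taken from the recipe) are assumptions for illustration.

```python
import math

def lr_at(step, total_steps, max_lr=3e-4, min_lr=1e-6, warmup_frac=0.05):
    """Linear warmup from 0 to max_lr, then cosine anneal down to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # progress runs from 0 (end of warmup) to 1 (last step)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

total = 1000
for step in [0, 25, 50, 525, 1000]:
    print(step, f"{lr_at(step, total):.2e}")
```

The schedule starts at zero, peaks at max_lr when warmup ends (step 50 of 1000 here), and reaches min_lr on the final step.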
Modern PyTorch Training Loop
```python
import torch
import torch.nn as nn
import torch.optim as optim
# Newer PyTorch releases expose the same tools under torch.amp
from torch.cuda.amp import GradScaler, autocast
from torch.utils.data import TensorDataset, DataLoader

# --- Minimal model + dataloader for the demo ---
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))

    def forward(self, x):
        return self.net(x)

X_data = torch.randn(512, 16)
y_data = torch.randint(0, 10, (512,))
train_loader = DataLoader(TensorDataset(X_data, y_data), batch_size=32, shuffle=True)

def train_one_epoch(model, loader, optimizer, scaler, scheduler, device, clip_norm=1.0):
    model.train()
    use_amp = device.type == "cuda"  # float16 autocast only applies on GPU
    total_loss = 0.0
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad(set_to_none=True)  # frees grads instead of zero-filling them

        # --- Mixed precision forward pass ---
        with autocast(enabled=use_amp):
            logits = model(X)
            loss = nn.functional.cross_entropy(logits, y, label_smoothing=0.1)

        # --- Scaled backward + gradient clipping ---
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)  # unscale first so clipping sees the true gradient norm
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()  # step per batch for OneCycleLR

        total_loss += loss.item()
    return total_loss / len(loader)

# --- Setup ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MyModel().to(device)
EPOCHS = 100

# AdamW: Adam with decoupled weight decay (usually preferable to Adam + L2)
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)

# OneCycleLR: warmup + cosine anneal in one schedule
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=3e-4,
    total_steps=EPOCHS * len(train_loader),  # epochs × steps_per_epoch
    pct_start=0.05,  # 5% warmup
    anneal_strategy="cos",
)
scaler = GradScaler(enabled=(device.type == "cuda"))  # loss scaling for mixed precision

for epoch in range(EPOCHS):
    avg_loss = train_one_epoch(model, train_loader, optimizer, scaler, scheduler, device)

# --- Optimizer comparison ---
# SGD + Momentum (strong for fine-tuning pretrained CNNs)
sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4, nesterov=True)
# Adam (default for most deep learning tasks)
adam = optim.Adam(model.parameters(), lr=3e-4, betas=(0.9, 0.999))
# AdamW (Adam with decoupled weight decay; the usual choice when weight decay is needed)
adamw = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# --- Batch Normalization example ---
class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),  # bias=False because BN has its own β
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# --- Layer Normalization (Transformers prefer LayerNorm) ---
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_model * 4), nn.GELU(), nn.Linear(d_model * 4, d_model))
        self.norm1 = nn.LayerNorm(d_model)  # pre-norm (more stable than post-norm)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(0.1)

    def forward(self, x):
        # Pre-norm architecture (used in GPT-2+, better gradient flow)
        h = self.norm1(x)  # normalize once, reuse as query, key, and value
        x = x + self.drop(self.attn(h, h, h)[0])
        x = x + self.drop(self.ff(self.norm2(x)))
        return x

# --- Learning rate finder (run on a fresh model, before the training loop above) ---
# pip install torch-lr-finder
from torch_lr_finder import LRFinder

# Use a separate optimizer: LRFinder rejects one that already has a scheduler attached
finder_opt = optim.AdamW(model.parameters(), lr=1e-7, weight_decay=1e-2)
finder = LRFinder(model, finder_opt, nn.CrossEntropyLoss(), device=device)
finder.range_test(train_loader, end_lr=10, num_iter=100)
finder.plot()   # look for the steepest descent; that's your max_lr
finder.reset()  # restore the model and optimizer to their initial state
```
BatchNorm's Hidden Failure Modes
BatchNorm is powerful but has several subtle failure modes:
(1) **Small batch sizes:** BatchNorm computes statistics over the batch; with batch_size < 8 the estimates are too noisy. Use GroupNorm (e.g. num_groups=32) or LayerNorm instead.
(2) **model.eval() is critical:** In eval mode, BatchNorm uses the running statistics accumulated during training. Forgetting to call model.eval() before inference makes it normalize with the statistics of whatever test batch arrives, so predictions depend on batch composition and are badly off for very small batches.
(3) **Fine-tuning on different distributions:** If you fine-tune a pretrained model, the running mean/var may not match your data. Consider setting track_running_stats=False (batch statistics are then used in eval mode too) or using a small learning rate for BN layers.
(4) **RNNs:** BatchNorm works poorly with variable-length sequences; use LayerNorm instead.
One of the most common PyTorch bugs (alongside mismatched tensor dimensions) is forgetting model.eval(): BatchNorm and Dropout behave differently in train and eval mode.
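The train/eval difference is easy to reproduce. The sketch below is a toy setup, not a real model: a lone `nn.BatchNorm1d` layer sees synthetic data with mean 5 and standard deviation 2 (both arbitrary choices), then normalizes the same two-sample batch in each mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)

# "Training": let the layer accumulate running statistics (mean ~5, var ~4)
bn.train()
with torch.no_grad():
    for _ in range(200):
        bn(torch.randn(64, 4) * 2.0 + 5.0)

x = torch.tensor([[5.0] * 4, [7.0] * 4])  # tiny inference batch of two samples

bn.eval()
with torch.no_grad():
    out_eval = bn(x)   # normalized with the running statistics

bn.train()             # simulates forgetting model.eval()
with torch.no_grad():
    out_train = bn(x)  # normalized with this batch's own mean and variance

print("eval mode: ", out_eval[:, 0])
print("train mode:", out_train[:, 0])
```

In eval mode the sample at 5.0 maps close to 0, as it should given the training distribution. In train mode the same sample maps to about -1, because the layer only sees that it is the smaller of the two inputs. Note that the train-mode call also updates the running statistics, so a forgotten model.eval() quietly corrupts them as well.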