Deep Learning January 15, 2025 8 min read

10 PyTorch Training Tricks That Cut My Training Time in Half

Mixed precision, gradient checkpointing, DataLoader tuning, torch.compile, and 6 more tricks with measured speedups on real experiments.

The 10 Tricks

1. Mixed Precision (AMP)

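# criterion, y, and optimizer are assumed to exist already (placeholders, not shown here)
scaler = torch.cuda.amp.GradScaler()  # newer releases spell this torch.amp.GradScaler('cuda')
optimizer.zero_grad()
with torch.autocast(device_type='cuda', dtype=torch.float16):
    loss = criterion(model(x), y)  # forward pass and loss run in reduced precision where safe
scaler.scale(loss).backward()      # scale the loss so small fp16 gradients don't underflow
scaler.step(optimizer)             # unscales gradients; skips the step if any are inf/NaN
scaler.update()                    # adjusts the scale factor for the next iteration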

Speedup: 1.8-2.5x on modern GPUs
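The AMP snippet above needs a CUDA device. Here is a minimal sketch of a full training step that also runs on CPU (with AMP switched off there) and shows how AMP combines with gradient clipping, one of the later tricks. The toy model, the data, the `train_step` helper, and `max_norm=1.0` are all my assumptions for illustration, not values from the experiments.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # assumption: only enable AMP when a GPU is present

# Toy model and data, purely illustrative.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # no-op when disabled


def train_step(x, y):
    """Hypothetical helper: one optimizer step with optional mixed precision."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device.type, enabled=use_amp):
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()
    # Unscale first so clipping compares against the true gradient norm.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


torch.manual_seed(0)
x = torch.randn(8, 16, device=device)
y = torch.randn(8, 1, device=device)
loss_value = train_step(x, y)
```

The `enabled=` flags keep a single code path for both devices, so the same script can be debugged on a laptop and trained on a GPU.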

2. Gradient Checkpointing

# Hugging Face Transformers helper; plain nn.Module code uses torch.utils.checkpoint instead
model.gradient_checkpointing_enable()

Memory: about 40% lower. Speed: about 15% slower, because activations are recomputed during the backward pass. Worth it for large models.
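`gradient_checkpointing_enable()` is a method on Hugging Face Transformers models. For a plain `nn.Module`, the same idea can be sketched with `torch.utils.checkpoint`. The `CheckpointedMLP` class and its sizes below are hypothetical, chosen only to keep the example small.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class CheckpointedMLP(nn.Module):
    """Hypothetical toy model: each block is recomputed during backward."""

    def __init__(self, dim=64, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(depth)]
        )

    def forward(self, x):
        for block in self.blocks:
            # Intermediate activations inside `block` are not stored;
            # they are recomputed when gradients are needed.
            x = checkpoint(block, x, use_reentrant=False)
        return x


torch.manual_seed(0)
model = CheckpointedMLP()
x = torch.randn(8, 64, requires_grad=True)
out = model(x)
out.sum().backward()
```

Checkpointing every block is the most aggressive setting. Wrapping only every second or third block trades back some memory for less recomputation.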

3. Optimal DataLoader

# Tune num_workers to your CPU core count; persistent_workers=True requires num_workers > 0
DataLoader(dataset, num_workers=4, pin_memory=True, persistent_workers=True)
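To make those flags safe to reuse across machines, one option is a small factory that only turns them on when they apply. This is a sketch under my own assumptions: `make_loader`, the random dataset, and the batch size are hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(dataset, batch_size=32, num_workers=4):
    """Hypothetical helper: build a DataLoader with the flags from trick 3."""
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        # Pinned host memory only pays off when batches are copied to a GPU.
        pin_memory=torch.cuda.is_available(),
        # persistent_workers=True raises ValueError when num_workers == 0.
        persistent_workers=num_workers > 0,
    )


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dataset = TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))

# num_workers=0 keeps this demo single-process; real runs would use several workers.
loader = make_loader(dataset, batch_size=16, num_workers=0)
for xb, yb in loader:
    # non_blocking=True lets the copy overlap with compute when the source is pinned.
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    break
```

A reasonable way to pick `num_workers` is to time a few epochs at 2, 4, and 8 and keep the smallest value at which the GPU stops waiting on data.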

4. torch.compile() (PyTorch 2.0+)

model = torch.compile(model)

Speedup: 1.2-2x depending on model

5-10. OneCycleLR, gradient clipping, fused optimizers, CUDA streams, torch.backends.cudnn.benchmark, and memory-efficient attention.
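Three of these fit in one short loop: `torch.backends.cudnn.benchmark`, a fused optimizer, and OneCycleLR. The sketch below assumes PyTorch 2.0+ and uses a toy model with made-up hyperparameters (`max_lr`, `total_steps`, batch size). Gradient clipping, CUDA streams, and memory-efficient attention are not shown.

```python
import torch
from torch import nn

# Let cuDNN auto-tune convolution algorithms; helps most when input shapes stay fixed.
torch.backends.cudnn.benchmark = True

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)
criterion = nn.MSELoss()

# Assumption: only request the fused kernel on CUDA, where it is supported.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-3, fused=(device.type == "cuda")
)

total_steps = 20  # illustrative; normally epochs * batches_per_epoch
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, total_steps=total_steps
)

lrs = []
for step in range(total_steps):
    x = torch.randn(64, 16, device=device)
    y = torch.randn(64, 1, device=device)

    optimizer.zero_grad(set_to_none=True)
    loss = criterion(model(x), y)
    loss.backward()

    lrs.append(optimizer.param_groups[0]["lr"])  # learning rate used for this step
    optimizer.step()
    scheduler.step()  # OneCycleLR is stepped once per batch, not once per epoch
```

Plotting `lrs` shows the warm-up to `max_lr` followed by the long anneal, which is a quick sanity check that `total_steps` matches the real number of batches.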



Ossama Elhakki

AI Engineer & ML Systems Builder — Morocco
