Deep Learning · October 15, 2024 · 7 min read

GPU Training Optimization: Getting the Most from Your Hardware

GPU utilization, bottleneck diagnosis, DataLoader optimization, and CUDA memory management: practical techniques for training up to 2x faster without new hardware.

Step 1: Diagnose the Bottleneck

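# Watch GPU utilization, refreshed every second; aim for 90%+ during training
nvidia-smi -l 1

# If utilization stays low, the most likely cause is a CPU/DataLoader bottleneck:
# the GPU sits idle while it waits for the next batch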
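
nvidia-smi shows that the GPU is idle, but not why. One rough way to confirm a data-loading bottleneck is to time how long each iteration waits for the next batch versus how long the training step itself takes. The sketch below is a minimal illustration, not a full profiler: `profile_loop` and `train_step` are hypothetical names. On a GPU, CUDA kernels launch asynchronously, so `train_step` would need to call `torch.cuda.synchronize()` before returning for the step timing to be meaningful.

```python
import time


def profile_loop(loader, train_step, max_batches=50):
    """Split wall time into 'waiting for data' and 'running the step'.

    loader: any iterable of batches (for example a DataLoader).
    train_step: hypothetical callable that runs one training step on a batch.
    """
    data_time = 0.0
    step_time = 0.0
    n = 0
    it = iter(loader)
    while n < max_batches:
        t0 = time.perf_counter()
        try:
            batch = next(it)          # time spent here is data-loading wait
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)             # time spent here is compute
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
        n += 1
    total = data_time + step_time
    return {
        "batches": n,
        "data_s": data_time,
        "step_s": step_time,
        "data_fraction": data_time / total if total > 0 else 0.0,
    }
```

A `data_fraction` close to 1.0 suggests the loop is mostly waiting on the loader. A value near 0 points at the model or GPU side instead.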

Fix DataLoader Bottleneck

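import os
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=64,
    # os.cpu_count() can return None, so fall back to 1 worker
    num_workers=min(os.cpu_count() or 1, 8),  # key: load batches in parallel
    pin_memory=True,                           # key: page-locked memory for faster host-to-GPU copies
    persistent_workers=True,                   # key: keep workers alive between epochs
    prefetch_factor=2,                         # batches loaded ahead per worker
)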
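
The best `num_workers` value depends on the dataset and the machine, so it can help to measure a few candidates rather than guess. The following is a rough sketch under assumptions: `benchmark_workers` and `make_loader` are hypothetical names, and the timing includes worker startup, so short runs will penalize higher worker counts.

```python
import time


def benchmark_workers(make_loader, candidates=(0, 2, 4, 8), max_batches=100):
    """Return {num_workers: seconds} for iterating up to max_batches batches.

    make_loader: hypothetical factory that builds a loader for a worker count,
    e.g. lambda n: DataLoader(dataset, batch_size=64, num_workers=n)
    """
    results = {}
    for n in candidates:
        loader = make_loader(n)
        start = time.perf_counter()
        for i, _ in enumerate(loader):
            if i + 1 >= max_batches:
                break
        results[n] = time.perf_counter() - start
    return results


def fastest(results):
    """Pick the worker count with the lowest measured time."""
    return min(results, key=results.get)
```

Because `persistent_workers=True` and `prefetch_factor` are only valid with `num_workers > 0`, a factory that includes 0 among the candidates should leave those arguments out for that case.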

CUDA Memory Management

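import torch

# Release unused cached memory between experiments
# (memory held by live tensors is not freed)
torch.cuda.empty_cache()

# Monitor: allocated = live tensors, reserved = held by PyTorch's caching allocator
print(f'Allocated: {torch.cuda.memory_allocated()/1e9:.1f}GB')
print(f'Reserved:  {torch.cuda.memory_reserved()/1e9:.1f}GB')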
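
The gap between the two numbers is memory that PyTorch's caching allocator holds but no tensor currently uses, which is roughly what `torch.cuda.empty_cache()` can hand back to the driver. A small helper makes that gap explicit. This is a sketch: `summarize_cuda_memory` is a hypothetical name, and it takes plain byte counts so it can be tried without a GPU.

```python
def summarize_cuda_memory(allocated_bytes, reserved_bytes):
    """Describe allocated vs reserved memory, both given in bytes."""
    cached_free = reserved_bytes - allocated_bytes
    return {
        "allocated_gb": allocated_bytes / 1e9,
        "reserved_gb": reserved_bytes / 1e9,
        # Reserved by the allocator but not backing any live tensor
        "cached_free_gb": cached_free / 1e9,
        "cached_fraction": cached_free / reserved_bytes if reserved_bytes else 0.0,
    }


# Usage with PyTorch (requires a CUDA device):
#   import torch
#   stats = summarize_cuda_memory(torch.cuda.memory_allocated(),
#                                 torch.cuda.memory_reserved())
```

A large `cached_fraction` after an experiment finishes is normal. It only matters when another process, or the next experiment, needs that memory.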

Gradient Accumulation (for large batches on limited VRAM)

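# Assumes model, criterion (loss function), optimizer, and loader are already defined
accumulation_steps = 4  # example value; effective batch = batch_size * accumulation_steps

optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    # Scale the loss so the accumulated gradient matches one large batch
    loss = criterion(model(x), y) / accumulation_steps
    loss.backward()  # gradients add up across micro-batches
    if (i+1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()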
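
Why divide the loss by `accumulation_steps`? With a mean-reduced loss and equally sized micro-batches, summing the scaled gradients gives the same gradient as one large batch. The pure-Python sketch below checks this on a toy one-parameter model `y_hat = w * x` with mean squared error. The function names and data values are made up for illustration.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum per-micro-batch gradients, each scaled by 1/accumulation_steps."""
    chunks = [
        (xs[i:i + micro_batch_size], ys[i:i + micro_batch_size])
        for i in range(0, len(xs), micro_batch_size)
    ]
    accumulation_steps = len(chunks)
    grad = 0.0
    for cx, cy in chunks:
        # Mirrors `loss / accumulation_steps` followed by `loss.backward()`
        grad += mse_grad(w, cx, cy) / accumulation_steps
    return grad


# Toy data (illustrative values only)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

full = mse_grad(w, xs, ys)                                 # one batch of 8
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)    # 4 micro-batches of 2
print(abs(full - accum) < 1e-9)
```

The match is exact only when every micro-batch has the same size. A smaller final batch gets slightly more weight per sample.
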
GPU, CUDA, PyTorch, Training, Performance

Ossama Elhakki

AI Engineer & ML Systems Builder — Morocco