Technique Comparison
| Method | Size Reduction | Accuracy Drop | Effort |
|---|---|---|---|
| INT8 Quantization | 4x | ~1% | Low |
| FP16 | 2x | <0.1% | Very Low |
| Pruning (30%) | 1.4x | ~2% | Medium |
| Distillation | 5-10x | 3-5% | High |
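
The FP16, INT8 and pruning ratios in the table follow largely from how many bits are stored per weight. As a rough sanity check, the sketch below reproduces them for a hypothetical 100M-parameter FP32 model. The parameter count and the helper name `size_mb` are illustrative assumptions, and the pruning figure assumes the removed weights are actually dropped from storage rather than kept as zeros.

```python
# Rough size arithmetic behind the table (illustrative numbers only).
NUM_PARAMS = 100_000_000  # hypothetical parameter count, not taken from the table


def size_mb(num_params: int, bits_per_weight: int) -> float:
    """Storage for the weights alone, in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


fp32 = size_mb(NUM_PARAMS, 32)
fp16 = size_mb(NUM_PARAMS, 16)
int8 = size_mb(NUM_PARAMS, 8)
# Pruning 30% of the weights keeps 70% of them, still stored as FP32.
pruned = size_mb(NUM_PARAMS * 70 // 100, 32)

fp16_ratio = fp32 / fp16
int8_ratio = fp32 / int8
pruned_ratio = fp32 / pruned

print(f"FP32:         {fp32:.0f} MB")
print(f"FP16:         {fp16:.0f} MB ({fp16_ratio:.1f}x smaller)")
print(f"INT8:         {int8:.0f} MB ({int8_ratio:.1f}x smaller)")
print(f"Pruned (30%): {pruned:.0f} MB ({pruned_ratio:.1f}x smaller)")
```

Distillation does not fit this arithmetic: its reduction comes from training a smaller student architecture, so the ratio depends on the student you choose.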
Post-Training Quantization (Easiest)
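import torch
import torch.nn as nn

# Dynamic quantization (CPU inference): weights are stored as INT8,
# activations are quantized on the fly at runtime.
# `model` is an existing, trained FP32 nn.Module.
model_int8 = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear, nn.LSTM},  # layer types to quantize
    dtype=torch.qint8,
)
# Typical result: 2-4x smaller and roughly 2x faster on CPU, depending on
# how much of the model consists of the quantized layer types.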
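
To make the INT8 step less of a black box, here is a minimal sketch of per-tensor affine quantization in plain Python. It only illustrates the idea of mapping floats to 8-bit codes through a scale and zero point. It is not PyTorch's internal implementation, and the function names and weight values are made up for the example.

```python
# Simplified per-tensor affine INT8 quantization (illustration only,
# not PyTorch's actual kernel).

def quantize(weights, qmin=-128, qmax=127):
    """Map floats to int8 codes; returns (codes, scale, zero_point)."""
    # The representable range must include 0.0 so that zero maps exactly.
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    codes = [max(qmin, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Recover approximate floats from int8 codes."""
    return [(q - zero_point) * scale for q in codes]


weights = [-0.51, -0.02, 0.0, 0.13, 0.76]  # made-up example values
codes, scale, zero_point = quantize(weights)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:     ", codes)
print("scale:     ", scale)
print("zero point:", zero_point)
print(f"max round-trip error: {max_error:.5f}")
```

Each weight now needs one byte instead of four, at the cost of a round-trip error on the order of one quantization step (`scale`). This is where the small accuracy drop comes from.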
Knowledge Distillation
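import torch.nn as nn
import torch.nn.functional as F

T = 4.0  # softmax temperature (example value, an assumption; tune per task)

# Student learns from the teacher's soft probabilities
teacher_logits = teacher(x).detach()  # no gradients flow into the teacher
student_logits = student(x)
# reduction="batchmean" matches the mathematical KL divergence per sample;
# the default "mean" averages over every element and under-scales the loss.
kd_loss = nn.KLDivLoss(reduction="batchmean")(
    F.log_softmax(student_logits / T, dim=-1),  # input must be log-probabilities
    F.softmax(teacher_logits / T, dim=-1),      # target is plain probabilities
) * T**2  # rescale so gradient magnitude stays comparable across temperatures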
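
The effect of the temperature `T` is easier to see with concrete numbers. The following plain-Python sketch, which uses made-up logits for a single three-class example, shows how dividing by `T` softens the teacher distribution and computes the same temperature-scaled KL term by hand. The helper names are illustrative assumptions, not library functions.

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled softmax, computed stably by subtracting the max."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, T):
    """KL(teacher || student) on softened distributions, scaled by T**2."""
    p = softmax(teacher_logits, T)  # teacher soft targets
    q = softmax(student_logits, T)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * T ** 2


teacher_logits = [6.0, 2.0, 1.0]  # made-up values for one example
student_logits = [3.0, 2.5, 0.5]

sharp = softmax(teacher_logits, T=1.0)  # nearly one-hot
soft = softmax(teacher_logits, T=4.0)   # spreads mass over the other classes
loss = kd_loss(student_logits, teacher_logits, T=4.0)

print("T=1:", [round(v, 3) for v in sharp])
print("T=4:", [round(v, 3) for v in soft])
print("distillation loss:", loss)
```

At `T=1` the teacher output is close to a hard label, so it carries little extra information. A higher temperature exposes how the teacher ranks the wrong classes, which is the signal the student is trained to match.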