NLP | March 28, 2025 | 9 min read

Fine-Tuning BERT for Production NLP: A Battle-Tested Guide

Everything I've learned fine-tuning BERT across 10+ NLP projects — tokenization, learning rate schedules, layer freezing, and deployment with ONNX.

The Fine-Tuning Recipe

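import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-multilingual-cased',  # for AR/FR/EN
    num_labels=3
)

# Discriminative learning rates: lower layers stay close to the pretrained
# weights, while the freshly initialized classifier head learns fastest.
# torch.optim.AdamW replaces the deprecated transformers.AdamW.
# Note: model.bert.pooler is in none of these groups, so it is not updated.
optimizer = torch.optim.AdamW([
    {'params': model.bert.embeddings.parameters(), 'lr': 1e-5},
    {'params': model.bert.encoder.layer[:6].parameters(), 'lr': 2e-5},
    {'params': model.bert.encoder.layer[6:].parameters(), 'lr': 3e-5},
    {'params': model.classifier.parameters(), 'lr': 5e-5},
])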
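
The optimizer above lists each submodule by hand, which makes it easy to leave a parameter out without noticing. As a minimal sketch, the same tiers can be keyed on parameter names instead. `lr_for` and `build_param_groups` are hypothetical helpers, not part of transformers, and the pooler tier is an assumption, since the recipe does not assign the pooler a rate.

```python
import re
from collections import defaultdict

LAYER_RE = re.compile(r"^bert\.encoder\.layer\.(\d+)\.")


def lr_for(name: str) -> float:
    """Map a BertForSequenceClassification parameter name to a learning rate."""
    match = LAYER_RE.match(name)
    if match:
        # Encoder layers 0-5 get the lower rate, layers 6 and up the higher one.
        return 2e-5 if int(match.group(1)) < 6 else 3e-5
    if name.startswith("bert.embeddings."):
        return 1e-5
    if name.startswith("bert.pooler."):
        return 3e-5  # assumption: treat the pooler like the upper encoder layers
    if name.startswith("classifier."):
        return 5e-5
    # Fail loudly so no parameter is silently left out of the optimizer.
    raise ValueError(f"no learning rate tier for parameter {name!r}")


def build_param_groups(named_params):
    """Group (name, parameter) pairs into optimizer param groups by learning rate."""
    by_lr = defaultdict(list)
    for name, param in named_params:
        by_lr[lr_for(name)].append(param)
    return [{"params": params, "lr": lr} for lr, params in sorted(by_lr.items())]
```

With a loaded model, the result of `build_param_groups(model.named_parameters())` can be passed to `torch.optim.AdamW` in place of the hand-written list.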

Key Lessons

  1. Start with LR=2e-5, batch=16, epochs=3-5
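  2. Warm up the learning rate over the first 10% of steps: it keeps early updates small, which helps limit catastrophic forgetting
  3. For Arabic: use CAMeLBERT or AraBERT rather than mBERT
  4. ONNX export: about 3x faster inference in my experience, with no PyTorch dependency at serving time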
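
To make the warmup lesson concrete, here is a minimal sketch of a learning rate multiplier that ramps up linearly over the first 10% of steps. The linear decay after warmup is an assumption (the list only mentions warmup). It follows the same shape as `get_linear_schedule_with_warmup` in transformers. The function name and the 1000-step run are illustrative.

```python
def warmup_linear_multiplier(step: int, total_steps: int,
                             warmup_frac: float = 0.1) -> float:
    """Return the factor applied to the base learning rate at a given step."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from 0 up to 1 across the warmup phase.
        return step / max(1, warmup_steps)
    # Assumed linear decay from 1 down to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 2e-5       # starting point from lesson 1
total_steps = 1000   # illustrative; use len(train_loader) * epochs in practice
for step in (0, 50, 100, 550, 1000):
    lr = base_lr * warmup_linear_multiplier(step, total_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

Because the function returns a multiplier and not an absolute rate, it can be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, which scales each parameter group's own base rate and so keeps the discriminative ratios intact.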

Tags: BERT, Fine-Tuning, HuggingFace, Transformers, Production
Ossama Elhakki

AI Engineer & ML Systems Builder — Morocco