NLP | January 25, 2025 | 9 min read

Sentiment Analysis for Arabic Text: BERT vs Traditional ML

Building a production sentiment classifier for Arabic customer reviews: dataset curation, preprocessing challenges, model comparison, and deployment with FastAPI.

Dataset Curation

I used three sources:

  1. ASTD (Arabic Sentiment Tweets Dataset): 10K tweets
  2. ArSAS (Arabic Speech-Act and Sentiment corpus): 21K tweets in Modern Standard Arabic (MSA) and dialectal Arabic
  3. Scraped Moroccan Darija reviews from Jumia MA
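
Combining sources like these usually means mapping each dataset's labels onto one shared schema and dropping duplicate texts. The following is a minimal sketch under stated assumptions, not the pipeline used for this post: the `LABEL_MAP` entries, the `merge_sources` helper, and the sample rows are all hypothetical, and the real label names in each dataset should be checked before reuse.

```python
from typing import Iterable

# Hypothetical mapping from source-specific label strings to a shared schema.
# The real label names in each dataset may differ; inspect them first.
LABEL_MAP = {
    "pos": "positive", "positive": "positive",
    "neg": "negative", "negative": "negative",
    "neutral": "neutral",
}

def merge_sources(sources: dict[str, Iterable[tuple[str, str]]]) -> list[dict]:
    """Merge (text, label) pairs from several sources, dropping exact duplicates."""
    seen: set[str] = set()
    merged: list[dict] = []
    for name, rows in sources.items():
        for text, label in rows:
            key = " ".join(text.split())  # collapse whitespace before comparing
            mapped = LABEL_MAP.get(label.strip().lower())
            if not key or mapped is None or key in seen:
                continue  # skip empty text, unmapped labels, and duplicates
            seen.add(key)
            merged.append({"text": key, "label": mapped, "source": name})
    return merged

# Made-up sample rows, for illustration only.
sample = {
    "astd": [("خدمة ممتازة", "POS"), ("خدمة  ممتازة", "pos")],
    "jumia_ma": [("منتج سيء", "negative"), ("لا بأس", "mixed")],
}
merged = merge_sources(sample)
print(len(merged), [row["label"] for row in merged])
```

In this sketch, rows with labels outside the shared schema (such as "mixed") are dropped rather than guessed, which keeps the merged set clean at the cost of some data.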

Preprocessing

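import re

def preprocess_ar(text):
    """Normalize Arabic text before feature extraction or tokenization."""
    text = re.sub(r'[\u064B-\u065F\u0670]', '', text)  # Remove diacritics (tashkeel)
    text = re.sub(r'[أإآ]', 'ا', text)  # Normalize alef variants to bare alef
    text = re.sub(r'ة', 'ه', text)  # Normalize teh marbuta to heh
    text = re.sub(r'[^\w\s]', ' ', text)  # Replace punctuation and symbols (including emoji) with spaces
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse repeated whitespace
    return text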
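
To make the effect of each step concrete, here is a short usage sketch. The function is repeated so the block runs on its own, and the sample strings are made up for illustration. Diacritics are written as Unicode escapes so they stay visible in the source.

```python
import re

def preprocess_ar(text):
    text = re.sub(r'[\u064B-\u065F\u0670]', '', text)  # Remove diacritics
    text = re.sub(r'[أإآ]', 'ا', text)  # Normalize alef
    text = re.sub(r'ة', 'ه', text)  # Normalize teh marbuta
    text = re.sub(r'[^\w\s]', ' ', text)  # Remove punctuation and symbols
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Alef-hamza + fatha, heh + sukun, lam, alef + tanwin fath.
greeting = "\u0623\u064E\u0647\u0652\u0644\u0627\u064B"
greeting_clean = preprocess_ar(greeting)    # diacritics gone, hamza dropped from alef

review_clean = preprocess_ar("خدمة ممتازة!!!")  # teh marbuta becomes heh, "!!!" removed

emoji_clean = preprocess_ar("جميل 😍")          # the emoji is not a word character, so it is dropped

print(greeting_clean, review_clean, emoji_clean, sep=" | ")
```

Note that the punctuation step also removes emoji, which often carry sentiment in reviews and tweets. Whether to keep them, for example by mapping them to tokens first, is a design choice worth testing against the validation set.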

Results Comparison

Model           Accuracy   F1
TF-IDF + SVM    76.2%      0.74
FastText        81.5%      0.80
AraBERT v0.2    88.1%      0.87
CAMeL-BERT      86.7%      0.85
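
The table does not say how F1 is averaged across classes. As a reference for how the two metrics relate, here is a minimal pure-Python sketch that assumes macro-averaged F1 (an assumption, not something the post confirms); the label lists are made-up examples, not the evaluation data.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, where F1 = 2TP / (2TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # gold class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up labels for illustration.
y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neu", "pos"]
print(f"accuracy={accuracy(y_true, y_pred):.3f} macro_f1={macro_f1(y_true, y_pred):.3f}")
```

Macro averaging weights every class equally, so it penalizes a model that ignores a rare class more than accuracy does. That matters when the sentiment classes are imbalanced.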

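AraBERT v0.2 scores highest on both accuracy (88.1%) and F1 (0.87), but it is 20x slower. If latency matters, FastText is the better trade-off: it gives up 6.6 accuracy points (81.5% vs 88.1%) in exchange for much faster predictions.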
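
A speed gap like this is worth re-measuring on your own hardware and batch sizes. The following is a minimal timing sketch, not the benchmark behind the numbers above: `median_latency_ms` is a hypothetical helper, and `fast_model` and `slow_model` are stand-ins to be replaced with the real inference calls.

```python
import time
from statistics import median

def median_latency_ms(predict, texts, warmup=3, runs=20):
    """Median wall-clock time in milliseconds for one predict(texts) call."""
    for _ in range(warmup):
        predict(texts)  # warm caches and lazy initialization before timing
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(texts)
        timings.append((time.perf_counter() - start) * 1000)
    return median(timings)

# Stand-in predictors (hypothetical): swap in the real model calls.
def fast_model(texts):
    return ["positive" for _ in texts]

def slow_model(texts):
    time.sleep(0.002)  # simulate a heavier model
    return ["positive" for _ in texts]

batch = ["خدمه ممتازه"] * 8
fast_ms = median_latency_ms(fast_model, batch)
slow_ms = median_latency_ms(slow_model, batch)
print(f"fast: {fast_ms:.4f} ms, slow: {slow_ms:.4f} ms")
```

The median is used rather than the mean so that a single slow run, such as a garbage-collection pause, does not skew the comparison.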

Tags: Sentiment Analysis, Arabic NLP, BERT, AraBERT, Text Classification

Ossama Elhakki

AI Engineer & ML Systems Builder — Morocco