Sentiment Analysis for Arabic Text: BERT vs Traditional ML

Dataset Curation

I used three sources:

ASTD (Arabic Sentiment Tweets Dataset): 10K tweets
ArSAS (Arabic Speech-Act and Sentiment corpus): 21K tweets, MSA + dialectal
Moroccan Darija reviews scraped from Jumia MA

Preprocessing

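import re

def preprocess_ar(text):
    """Normalize Arabic text: strip diacritics, unify alef and teh marbuta, drop punctuation."""
    text = re.sub(r'[\u064B-\u065F\u0670]', '', text)  # Remove diacritics (harakat, superscript alef)
    text = re.sub(r'[أإآ]', 'ا', text)  # Normalize hamza/madda alef variants to bare alef
    text = re.sub(r'ة', 'ه', text)  # Normalize teh marbuta to heh
    text = re.sub(r'[^\w\s]', ' ', text)  # Replace punctuation with spaces
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse runs of whitespace
    return text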
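
To see what each normalization step does, here is a small standalone sketch that repeats the function and runs it on a few inputs. The sample strings are made up for illustration and are not taken from ASTD, ArSAS, or the Jumia reviews. The patterns are written with \u escapes only so the exact code points are visible; they should match the same characters as the literal Arabic above.

```python
import re

def preprocess_ar(text):
    # Same steps as the function above, repeated so this snippet runs on its own.
    text = re.sub(r'[\u064B-\u065F\u0670]', '', text)    # diacritics
    text = re.sub(r'[\u0623\u0625\u0622]', '\u0627', text)  # alef variants -> bare alef
    text = re.sub('\u0629', '\u0647', text)              # teh marbuta -> heh
    text = re.sub(r'[^\w\s]', ' ', text)                 # punctuation -> space
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical inputs chosen to exercise one rule each.
samples = [
    "\u0645\u064E\u062F\u0652\u0631\u064E\u0633\u064E\u0629\u064C",  # "madrasa" with full diacritics
    "\u0623\u062D\u0645\u062F",                                      # "Ahmad", alef with hamza above
    "\u0631\u0627\u0626\u0639\u0629!!!",                             # "ra'i'a" plus punctuation
]

for s in samples:
    print(repr(s), "->", repr(preprocess_ar(s)))
```

Note that folding teh marbuta into heh and collapsing the alef variants loses information, so the same function has to be applied to both training and inference text.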

Results Comparison

Model	Accuracy	F1
TF-IDF + SVM	76.2%	0.74
FastText	81.5%	0.80
AraBERT v0.2	88.1%	0.87
CAMeL-BERT	86.7%	0.85
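
For readers who want to reproduce the two columns, the sketch below computes accuracy and F1 from label lists in plain Python. It assumes macro-averaged F1 over the classes present; the table does not say which averaging was used, and the labels and predictions here are hypothetical toy data, not results from any of the models.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging, see note above)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy labels, for illustration only.
y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neu", "pos"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```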

AraBERT v0.2 gives the best accuracy and F1 but is 20x slower. If latency matters, use FastText, which gives up 6.6 accuracy points (81.5% vs 88.1%).
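
A slowdown figure like this depends heavily on hardware and batch size, so it is worth measuring on your own setup. The following is a minimal timing harness, assuming each model can be wrapped in a predict(text) callable. The two models below are hypothetical stand-ins (one of them just sleeps) and are not AraBERT or FastText; swap in your own wrappers.

```python
import statistics
import time

def median_latency_ms(predict, texts, warmup=3, repeats=20):
    """Median per-call latency in milliseconds for predict(text)."""
    for text in texts[:warmup]:
        predict(text)  # untimed calls so lazy initialisation does not skew results
    timings = []
    for _ in range(repeats):
        for text in texts:
            start = time.perf_counter()
            predict(text)
            timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)

# Hypothetical stand-ins for real classifiers.
def fast_model(text):
    return len(text) % 3

def slow_model(text):
    time.sleep(0.002)  # simulate a heavier forward pass
    return len(text) % 3

texts = ["sample one", "sample two", "sample three"]
fast_ms = median_latency_ms(fast_model, texts)
slow_ms = median_latency_ms(slow_model, texts)
print(f"fast: {fast_ms:.4f} ms, slow: {slow_ms:.4f} ms, ratio: {slow_ms / fast_ms:.1f}x")
```

The median is used instead of the mean so that a few slow outlier calls do not dominate the result.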