Dataset Curation
I used three sources:
- ASTD (Arabic Sentiment Tweets Dataset): 10K tweets
- ArSAS (Arabic Speech-Act and Sentiment corpus): 21K tweets, MSA and dialectal
- Scraped Moroccan Darija reviews from Jumia MA
Preprocessing
import re

def preprocess_ar(text):
    text = re.sub(r'[\u064B-\u065F\u0670]', '', text)  # Remove diacritics (tashkeel)
    text = re.sub(r'[أإآ]', 'ا', text)  # Normalize alef variants to bare alef
    text = re.sub(r'ة', 'ه', text)  # Normalize teh marbuta to heh
    text = re.sub(r'[^\w\s]', ' ', text)  # Replace punctuation with a space
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse runs of whitespace
    return text
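A quick sanity check of the normalization steps, as a minimal sketch. The function is repeated so the snippet runs on its own, and the sample sentence is made up for illustration (it is not taken from any of the datasets).

```python
import re

def preprocess_ar(text):
    # Same steps as the function above, repeated so this snippet is self-contained.
    text = re.sub(r'[\u064B-\u065F\u0670]', '', text)  # strip diacritics
    text = re.sub(r'[أإآ]', 'ا', text)                 # alef variants -> bare alef
    text = re.sub(r'ة', 'ه', text)                     # teh marbuta -> heh
    text = re.sub(r'[^\w\s]', ' ', text)               # punctuation -> space
    return re.sub(r'\s+', ' ', text).strip()           # collapse whitespace

# Hypothetical sample with diacritics, hamza-alef, teh marbuta, punctuation, extra spaces.
sample = "أَهْلاً!! المدرسة   جميلة."
cleaned = preprocess_ar(sample)
print(cleaned)  # → اهلا المدرسه جميله
```

Order matters here: in Python 3, `\w` does not match combining marks, so if the punctuation step ran first it would replace each diacritic with a space and split words apart. Running the function twice gives the same result as running it once, so it is safe to apply to already-cleaned text.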
Results Comparison
| Model | Accuracy | F1 |
|---|---|---|
| TF-IDF + SVM | 76.2% | 0.74 |
| FastText | 81.5% | 0.80 |
| AraBERT v0.2 | 88.1% | 0.87 |
| CAMeL-BERT | 86.7% | 0.85 |
AraBERT v0.2 has the best accuracy and F1 but is 20x slower. If latency matters, use FastText, which gives up about 6.6 points of accuracy (81.5% vs. 88.1%).
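For reference, here is a minimal pure-Python sketch of how the two metrics in the table can be computed. The table does not say which F1 averaging was used, so macro averaging here is an assumption, and the toy labels are invented for illustration only.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[t] += 1  # gold class gets a false negative
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels, made up for illustration (not from the datasets above).
gold = ["pos", "neg", "neg", "neu", "pos", "neg"]
pred = ["pos", "neg", "pos", "neu", "pos", "neu"]
print(f"accuracy={accuracy(gold, pred):.3f}  macro_f1={macro_f1(gold, pred):.3f}")
```

Macro F1 weights every class equally, so it drops faster than accuracy when a minority class is predicted poorly, which is common with imbalanced sentiment labels.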