When to Use What
| Model or language | Recommended preprocessing |
|---|---|
| BERT/Transformers | Use their tokenizer ONLY |
| TF-IDF + classical ML | Lowercase, remove stopwords, lemmatize |
| FastText | Minimal (it handles subwords) |
| Arabic | Normalize alef/hamza, remove diacritics |
| French | Handle accents, elisions (l', d') |
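
The Arabic and French rows can be sketched with the standard library alone. This is a minimal illustration, not a complete normalizer: the function names `normalize_arabic` and `split_french_elisions` are hypothetical, the set of alef variants and diacritics covers only the common cases, and hamza on other carriers (ؤ, ئ) is left untouched as an assumption. For French, accents are kept and only normalized to NFC, and elided clitics are split from the following word.

```python
import re
import unicodedata

# Alef variants mapped to bare alef (U+0627). Assumption: common cases only.
ALEF_VARIANTS = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above
    "\u0625": "\u0627",  # alef with hamza below
    "\u0622": "\u0627",  # alef with madda
    "\u0671": "\u0627",  # alef wasla
})

# Tashkeel (fathatan through sukun), superscript alef, and tatweel.
ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")

# Elided clitics (l', d', j', qu', ...) at the start of a word,
# followed by a straight or curly apostrophe.
FRENCH_ELISION = re.compile(r"\b(qu|[cdjlmnst])['\u2019](?=\w)", re.IGNORECASE)


def normalize_arabic(text):
    """Strip diacritics and tatweel, then unify alef variants."""
    text = ARABIC_DIACRITICS.sub("", text)
    return text.translate(ALEF_VARIANTS)


def split_french_elisions(text):
    """Separate elided clitics from the next word: "l'homme" -> "l' homme"."""
    text = unicodedata.normalize("NFC", text)  # keep accents, in composed form
    return FRENCH_ELISION.sub(lambda m: m.group(1) + "' ", text)


print(normalize_arabic("\u0623\u064E\u062D\u0652\u0645\u064E\u062F"))
print(split_french_elisions("l'homme d'aujourd'hui"))
```

The word boundary in `FRENCH_ELISION` keeps apostrophes inside words such as "aujourd'hui" intact, since the `d` there is not at the start of a word.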
Universal Pipeline
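import re

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Requires NLTK data: 'punkt' ('punkt_tab' on newer NLTK versions) and 'wordnet'
lemmatizer = WordNetLemmatizer()

def preprocess(text, language='english'):
    text = text.lower().strip()
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation; \w keeps non-ASCII letters
    # language must be one that NLTK's Punkt models support
    tokens = word_tokenize(text, language=language)
    tokens = [t for t in tokens if len(t) > 2]  # Drop very short tokens
    if language == 'english':
        # WordNet lemmatization only covers English
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return ' '.join(tokens)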
For BERT: Trust the Tokenizer
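BERT's tokenizer already handles casing (uncased checkpoints lowercase the text and strip accents), splits rare words into WordPiece subwords, and adds the special tokens ([CLS], [SEP]). Don't apply custom preprocessing such as stopword removal or lemmatization: the model was pretrained on text tokenized this way, so altering the input first tends to hurt performance.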
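
To see why manual lemmatization is unnecessary, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece applies at inference time. It is a toy under stated assumptions: `TOY_VOCAB` is a made-up six-entry vocabulary, `wordpiece_tokenize` and `bert_like_tokenize` are hypothetical helper names, and whitespace splitting plus lowercasing stands in for the real pre-tokenization. In practice you would load the tokenizer shipped with the checkpoint rather than write your own.

```python
# Toy vocabulary (hypothetical); real BERT vocabularies have ~30k entries.
TOY_VOCAB = {"token", "##ization", "##izer", "##s", "run", "##ning"}


def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def bert_like_tokenize(text, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, apply WordPiece, add special tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    tokens.append("[SEP]")
    return tokens


print(bert_like_tokenize("Tokenizers running"))
```

Inflected forms are decomposed into a stem plus suffix pieces, so the model still sees the morphology that lemmatizing beforehand would have thrown away.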