NLP · November 1, 2024 · 7 min read

NLP Text Preprocessing: The Complete Guide for 2025

Tokenization, normalization, stemming vs lemmatization, subword encoding — and when BERT's tokenizer is better than all of them combined.

When to Use What

Task                   | Preprocessing
BERT/Transformers      | Use their tokenizer ONLY
TF-IDF + classical ML  | Lowercase, remove stopwords, lemmatize
FastText               | Minimal (it handles subwords)
Arabic                 | Normalize alef/hamza, remove diacritics
French                 | Handle accents, elisions (l', d')
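
The Arabic and French rows are easier to picture with code. Here is a minimal sketch using only the standard library. The function names `normalize_arabic` and `split_french_elisions` are made up for this post, and the exact character sets are one common choice, not a standard, so adjust them to your corpus.

```python
import re
import unicodedata

# Assumption: harakat (U+064B..U+0652), dagger alef (U+0670) and tatweel (U+0640)
# are the marks to drop. Other schemes strip more or fewer characters.
ARABIC_MARKS = re.compile('[\u064B-\u0652\u0670\u0640]')
# Alef with madda, hamza above, hamza below, and alef wasla.
ALEF_VARIANTS = re.compile('[\u0622\u0623\u0625\u0671]')
# French clitics that elide before a vowel: l', d', j', qu', ...
FRENCH_ELISION = re.compile(r"\b(qu|[cdjlmnst])['\u2019](?=\w)", re.IGNORECASE)

def normalize_arabic(text):
    """Remove diacritics and map alef/hamza variants to a bare alef."""
    text = ARABIC_MARKS.sub('', text)
    return ALEF_VARIANTS.sub('\u0627', text)

def split_french_elisions(text):
    """Compose accents (NFC) and separate elided clitics from the next word."""
    text = unicodedata.normalize('NFC', text)
    return FRENCH_ELISION.sub(r"\1' ", text)

print(split_french_elisions("l'homme d'aujourd'hui"))
# l' homme d' aujourd'hui
```

The word-boundary check keeps words like "aujourd'hui" in one piece, since the apostrophe there is not an elision.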

Universal Pipeline

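import re
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Needs NLTK data: 'punkt' ('punkt_tab' on newer NLTK releases) and 'wordnet'
lemmatizer = WordNetLemmatizer()  # Build once, not on every call

def preprocess(text, language='english'):
    text = text.lower().strip()
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation, keep letters/digits in any script
    tokens = word_tokenize(text, language=language)
    tokens = [t for t in tokens if len(t) > 2]  # Drop very short tokens
    if language == 'english':  # WordNet lemmatization is English-only
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return ' '.join(tokens)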
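
The table recommends removing stopwords for TF-IDF, but the pipeline above doesn't do it. Here is a minimal sketch of that step, assuming a tiny hand-picked stopword set. `STOPWORDS` and `remove_stopwords` are illustrative names, and a real project would more likely use a full list such as NLTK's stopwords corpus.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase, split on word characters, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stopwords]

print(remove_stopwords("The cats are sitting on the mat"))
# ['cats', 'sitting', 'mat']
```

This helps bag-of-words models because stopwords add little signal to the features. Don't run it before a transformer, which uses those words as context.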

For BERT: Trust the Tokenizer

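BERT's tokenizer already handles casing (uncased checkpoints lowercase the input and strip accents themselves), WordPiece subword splitting, and special tokens such as [CLS] and [SEP]. Don't remove stopwords, stem, or lemmatize first: the model was pretrained on text split exactly this way, so extra preprocessing strips context it expects and usually hurts performance.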
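
To see why stemming is redundant here, it helps to look at how WordPiece splits a word. The sketch below is a toy version of the greedy longest-match-first lookup, assuming a made-up six-entry vocabulary (`TOY_VOCAB`). The real tokenizer also normalizes text, splits on punctuation, and adds special tokens, none of which is shown.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary, far smaller than a real BERT vocabulary.
TOY_VOCAB = {"token", "##ization", "##ize", "##s", "play", "##ing"}

print(wordpiece("tokenization", TOY_VOCAB))  # ['token', '##ization']
print(wordpiece("playing", TOY_VOCAB))       # ['play', '##ing']
```

The suffix stays as its own piece, so the model sees both the stem and the inflection. In practice, load the pretrained tokenizer that ships with the checkpoint (for example through Hugging Face's `AutoTokenizer.from_pretrained`) and don't reimplement any of this.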

Ossama Elhakki

AI Engineer & ML Systems Builder — Morocco