NLP: Text Classification Pipeline
“Teaching machines to read — from bag-of-words to transformers”
The full classical NLP pipeline: tokenization, TF-IDF vectorization, Naïve Bayes / Logistic Regression / SVM classification, evaluation (macro-F1), word embeddings vs TF-IDF, and sentence-transformers for semantic search.
Key Formulas
TF-IDF
Term frequency × inverse document frequency — high when word is frequent in doc but rare globally
Cosine Similarity
Document similarity measure independent of document length
Perplexity
Language model quality — lower perplexity = better next-word prediction
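The three formulas named above can be written out in standard notation. The TF-IDF form follows the log(N/DF) definition used later in this lesson; the cosine and perplexity forms are the usual textbook definitions. The symbol names (tf, df, N, u, v, w_i) are labels chosen here for illustration.

```latex
\begin{align*}
\text{tf-idf}(t, d) &= \text{tf}(t, d) \cdot \log\frac{N}{\text{df}(t)} \\
\cos(\mathbf{u}, \mathbf{v}) &= \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert} \\
\text{PPL}(w_1, \dots, w_n) &= \exp\!\left(-\frac{1}{n} \sum_{i=1}^{n} \log p(w_i \mid w_1, \dots, w_{i-1})\right)
\end{align*}
```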
The NLP Revolution
In 2017, the transformer architecture had just been published and GPT-3 was still three years away. By 2023, LLMs were writing code, passing medical exams, and summarizing legal documents. The foundation of all NLP, from bag-of-words spam filters to transformer-based LLMs, is the same: represent text numerically so models can process it. Understanding the classic NLP pipeline (tokenize → vectorize → model → evaluate) gives you the mental model to understand why modern transformers work and where they differ.
Published descriptions of large language model training, such as the GPT-3 paper, still rely on classic NLP preprocessing: tokenization, deduplication, and data quality filtering of the training corpus. Fundamentals matter at scale.
The NLP Pipeline: 5 Stages
Raw text is just a sequence of encoded characters, meaningless to a model. The NLP pipeline converts it to numbers in five stages:
Tokenization: split text into tokens (words, subwords, or characters)
Vocabulary building: assign an integer ID to each unique token
Vectorization: convert token IDs to numeric vectors, either sparse (one-hot, bag-of-words counts, TF-IDF) or dense (word embeddings)
Model training: classify, cluster, generate, or retrieve
Evaluation: accuracy, F1, BLEU, or perplexity, depending on the task
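The first three stages can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production tokenizer: the two example sentences, the regex rule, and the helper names (`tokenize`, `to_ids`, `to_count_vector`) are all assumptions made for this sketch.

```python
import re
from collections import Counter

docs = ["The cat sat on the mat.", "The dog chased the cat!"]

def tokenize(text):
    """Lowercase the text and keep runs of letters/digits (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())

tokenized = [tokenize(d) for d in docs]

# Vocabulary: one integer ID per unique token, sorted so the IDs are reproducible.
unique_tokens = sorted({tok for doc in tokenized for tok in doc})
vocab = {tok: i for i, tok in enumerate(unique_tokens)}

def to_ids(tokens):
    """Map each token to its vocabulary ID."""
    return [vocab[t] for t in tokens]

def to_count_vector(tokens):
    """Bag-of-words vector: position i holds the count of the token with ID i."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in unique_tokens]

ids = [to_ids(doc) for doc in tokenized]
vectors = [to_count_vector(doc) for doc in tokenized]
print(vocab)
print(ids[0])
print(vectors)
```

Note that the count vector discards word order: it records only how often each vocabulary entry occurs, which is exactly the bag-of-words assumption.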
TF-IDF: The Classic Vectorizer
Term Frequency (TF): how often does word t appear in document d? Document Frequency (DF): how many documents contain t? Inverse Document Frequency (IDF): log(N / DF(t)), where N is the total number of documents. Words that appear in every document (the, is, of) get an IDF at or near zero, so they contribute almost nothing. Words that are specific to a few documents get a high IDF. TF-IDF = TF × IDF. The result is a sparse matrix of shape (n_docs × vocab_size) where each entry reflects how characteristic that word is for that document.
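A worked example makes the definition concrete. The sketch below applies the plain log(N/DF) formula to three tiny, made-up documents; the documents and the helper names (`idf`, `tfidf`) are assumptions for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Three pre-tokenized toy documents (invented for this example).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
N = len(docs)

# Document frequency: in how many documents does each term occur at least once?
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Unsmoothed inverse document frequency, log(N / DF)."""
    return math.log(N / df[term])

def tfidf(doc):
    """Map each term in the document to raw count times IDF."""
    tf = Counter(doc)
    return {term: count * idf(term) for term, count in tf.items()}

weights = [tfidf(doc) for doc in docs]
for doc_weights in weights:
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

Here "the" occurs in all three documents, so its weight is log(3/3) = 0, while "sat" (one document) outweighs "cat" (two documents).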
From Bag-of-Words to Word Embeddings
TF-IDF treats each word as independent — 'bank' and 'financial institution' are completely unrelated. Word embeddings (Word2Vec, GloVe, FastText) learn dense vector representations where similar words are nearby in vector space: king - man + woman ≈ queen. These 300-dimensional vectors capture semantic relationships that TF-IDF cannot. Modern sentence transformers (SBERT, all-MiniLM-L6-v2) produce fixed-length vectors for entire sentences, enabling semantic search, clustering, and zero-shot classification.
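Whatever model produces the vectors, the comparison step is usually cosine similarity. The sketch below shows that step in plain Python. The three-dimensional vectors are invented for illustration and are not real embeddings; `most_similar` is a hypothetical helper, not part of any library.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings"; real models use hundreds of dimensions.
toy_vectors = {
    "bank": [0.9, 0.1, 0.0],
    "financial institution": [0.8, 0.2, 0.1],
    "river": [0.0, 0.2, 0.9],
}

def most_similar(query, vectors):
    """Return the key whose vector has the highest cosine similarity to the query's."""
    candidates = [k for k in vectors if k != query]
    return max(candidates, key=lambda k: cosine(vectors[query], vectors[k]))

nearest = most_similar("bank", toy_vectors)
print(nearest)

# Scaling a vector changes its length but not its direction,
# so the cosine similarity with the original stays at 1.
scaled = [10 * x for x in toy_vectors["bank"]]
print(round(cosine(toy_vectors["bank"], scaled), 6))
```

The scaling check is the reason cosine similarity is described as independent of document length: only the direction of the vector matters.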
For production text classification in 2025: start with TF-IDF + LogisticRegression as baseline, then try sentence-transformers embeddings + classifier, then fine-tune a pre-trained BERT/DistilBERT if quality is still insufficient.
Text Classification Pipeline
Preprocessing: lowercasing, punctuation removal, optional stopword removal
Tokenization: word_tokenize or subword (BPE/WordPiece for transformers)
Vectorization: CountVectorizer → TfidfVectorizer → word2vec → BERT embeddings
Model: MultinomialNB (fast baseline), LogisticRegression (strong linear), SVM, fine-tuned BERT
Evaluation: macro-F1 when every class should count equally (including rare ones), weighted-F1 when the score should reflect class frequency, AUC-ROC for ranking quality
Error analysis: inspect misclassified samples → improve features or labeling
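The difference between accuracy, macro-F1, and weighted-F1 is easiest to see on an imbalanced toy example. The sketch below computes all three by hand for a degenerate classifier that always predicts the majority class. The labels and the helper `per_class_f1` are assumptions for illustration; in practice `sklearn.metrics.f1_score` with `average='macro'` or `average='weighted'` does this.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """F1 per class from true-positive, false-positive, and false-negative counts."""
    f1 = {}
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1[c] = 2 * tp / denom if denom else 0.0
    return f1

# 95% majority class; the "classifier" never predicts the minority class.
y_true = ["ham"] * 95 + ["spam"] * 5
y_pred = ["ham"] * 100

scores = per_class_f1(y_true, y_pred)
support = Counter(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(scores.values()) / len(scores)                       # every class weighted equally
weighted_f1 = sum(scores[c] * support[c] for c in scores) / len(y_true)  # weighted by class size

print(f"accuracy={accuracy:.3f} macro-F1={macro_f1:.3f} weighted-F1={weighted_f1:.3f}")
```

Accuracy and weighted-F1 both look healthy here, while macro-F1 drops to roughly half because the spam class scores zero.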
Complete NLP Classification Pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import classification_report
import numpy as np

# ── Sample text data ───────────────────────────────────────────────────
# Toy corpus: six documents repeated 40 times, so scores will be near-perfect.
corpus = [
    "machine learning algorithms data science python",
    "neural network deep learning pytorch tensorflow",
    "natural language processing text classification bert",
    "computer vision image recognition convolutional",
    "reinforcement learning reward policy agent",
    "data preprocessing feature engineering pipeline",
] * 40  # 240 samples, 6 classes
labels = list(range(6)) * 40

X_text = corpus
y = np.array(labels)

# Split by index so the same train/test rows are reused in the
# sentence-transformer section below.
train_idx, test_idx = train_test_split(
    np.arange(len(X_text)), test_size=0.2, stratify=y, random_state=42)
X_train = [X_text[i] for i in train_idx]
X_test = [X_text[i] for i in test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# ── Baseline: TF-IDF + Logistic Regression ────────────────────────
pipe_lr = Pipeline([
    ('tfidf', TfidfVectorizer(
        ngram_range=(1, 2),
        max_features=100_000,
        sublinear_tf=True,        # 1 + log(tf) dampens high frequencies
        strip_accents='unicode',
        analyzer='word',
        token_pattern=r'\w{2,}',  # ignore single-char tokens
        min_df=2,                 # ignore very rare words
    )),
    ('clf', LogisticRegression(C=1.0, max_iter=1000, class_weight='balanced')),
])

# ── Alternative: TF-IDF + LinearSVC (fast, great for text) ────────
pipe_svm = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=100_000,
                              sublinear_tf=True)),
    ('clf', LinearSVC(C=0.5, class_weight='balanced', max_iter=2000)),
])

# ── Evaluate both with cross-validation ───────────────────────────
for name, pipe in [('LR', pipe_lr), ('SVM', pipe_svm)]:
    scores = cross_val_score(pipe, X_text, y, cv=5,
                             scoring='f1_macro', n_jobs=-1)
    print(f"{name}: macro-F1 = {scores.mean():.3f} ± {scores.std():.3f}")

# ── Hold-out report for the baseline ──────────────────────────────
pipe_lr.fit(X_train, y_train)
print(classification_report(y_test, pipe_lr.predict(X_test)))

# ── Modern approach: sentence embeddings ──────────────────────────
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-MiniLM-L6-v2')
X_emb = encoder.encode(X_text, batch_size=256, show_progress_bar=True)

clf = LogisticRegression(max_iter=1000).fit(X_emb[train_idx], y_train)
print(f"Sentence-BERT accuracy: {clf.score(X_emb[test_idx], y_test):.3f}")
NLP Pipeline Pitfalls
First: fitting TfidfVectorizer on the full dataset (before splitting) leaks test information into training, because the vocabulary and IDF values are computed with test document frequencies. Always fit the vectorizer inside a Pipeline so that it only ever sees training data.
Second: using max_features without min_df. Very rare words (appearing in one or two documents) are noisy but still included. Set min_df=2 (an absolute document count) or min_df=0.001 (a fraction of documents).
Third: ignoring class imbalance. A 95% majority class makes accuracy useless; use macro-F1 and class_weight='balanced'.
Fourth: skipping stemming or lemmatization on small datasets, where 'run', 'running', and 'ran' should map to the same feature.
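The leakage and min_df points can be made concrete with a small hand-rolled vectorizer. This is a sketch under stated assumptions: the documents are invented, and `fit_idf` and `transform` are hypothetical helpers that mimic the fit/transform split of a vectorizer using the unsmoothed log(N/DF) formula.

```python
import math
from collections import Counter

# Pre-tokenized toy documents (invented for this example).
train_docs = [
    ["cheap", "pills", "now"],
    ["meeting", "at", "noon"],
    ["cheap", "flights", "now"],
]
test_docs = [
    ["cheap", "crypto", "now"],
    ["crypto", "meeting"],
]

def fit_idf(docs, min_df=1):
    """Learn vocabulary and IDF from docs, dropping terms seen in fewer than min_df docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items() if count >= min_df}

def transform(doc, idf):
    """Weight a document with an already-fitted IDF table; unseen terms are dropped."""
    tf = Counter(term for term in doc if term in idf)
    return {term: count * idf[term] for term, count in tf.items()}

idf_train_only = fit_idf(train_docs)             # correct: test data never seen
idf_leaky = fit_idf(train_docs + test_docs)      # wrong: test documents shape the features
idf_pruned = fit_idf(train_docs, min_df=2)       # rare terms removed

print("crypto" in idf_train_only, "crypto" in idf_leaky)
print(round(idf_train_only["cheap"], 3), round(idf_leaky["cheap"], 3))
print(sorted(idf_pruned))
print(transform(test_docs[0], idf_train_only))
```

In the leaky fit, the test-only word "crypto" enters the vocabulary and the IDF of "cheap" shifts, so the features already encode test-set statistics before any model is trained.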
For non-English text, use language-specific tokenizers and pre-trained multilingual models (mBERT, XLM-RoBERTa) rather than English-centric pipelines. Many NLP libraries default to English-only behavior silently.
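Two standard-library checks illustrate how English-centric defaults can fail silently. The sketch assumes nothing beyond Python's `unicodedata` module; `strip_accents` is a hypothetical helper, similar in spirit to the `strip_accents='unicode'` option used in the pipeline above.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks, removing diacritics."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# French: "pêche" (peach, fishing) and "péché" (sin) are different words,
# but accent stripping maps both to the same feature.
french = ["pêche", "péché"]
collapsed = [strip_accents(word) for word in french]
print(collapsed)

# Chinese is written without spaces, so whitespace splitting
# returns the whole sentence as a single "token".
chinese = "我爱自然语言处理"
whitespace_tokens = chinese.split()
print(len(whitespace_tokens))
```

Neither step raises an error, which is why such problems tend to surface only during error analysis.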