ML Learning Hub
Applied ML · intermediate

NLP: Text Classification Pipeline

Teaching machines to read — from bag-of-words to transformers

The full classical NLP pipeline: tokenization, TF-IDF vectorization, Naïve Bayes / Logistic Regression / SVM classification, evaluation (macro-F1), word embeddings vs. TF-IDF, and sentence-transformers for semantic search.

45 min
10 diagrams
7 Concepts Covered

Prerequisites

Neural Networks
Naïve Bayes

Concepts Covered

Tokenization, TF-IDF, Bag of Words, N-grams, Cosine Similarity, Word2Vec, Sentence Transformers

Key Formulas

TF-IDF

Term frequency × inverse document frequency — high when word is frequent in doc but rare globally

Cosine Similarity

Document similarity measure independent of document length

Perplexity

Language model quality — lower perplexity = better next-word prediction
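
In standard notation, the three quantities above are usually written as follows. This is a sketch under stated assumptions: N is the number of documents, DF_t is the number of documents containing term t, a and b are document vectors, and the perplexity form assumes a token sequence w_1 … w_T scored by a model p.

```latex
\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \log\frac{N}{\mathrm{DF}_t}

\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}

\mathrm{PPL}(w_1, \dots, w_T) = \exp\!\left( -\frac{1}{T} \sum_{i=1}^{T} \log p(w_i \mid w_1, \dots, w_{i-1}) \right)
```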

🎯

The NLP Revolution

motivation

In 2017, the transformer architecture had only just been introduced and GPT-3 didn't exist. By 2023, LLMs were writing code, passing medical exams, and summarizing legal documents. The foundation of all NLP, from bag-of-words spam filters to transformer-based LLMs, is the same: represent text numerically so models can process it. Understanding the classic NLP pipeline (tokenize → vectorize → model → evaluate) gives you the mental model to see why modern transformers work and where they differ.

Even the largest language models still rely on classic NLP preprocessing (tokenization, deduplication, data quality filtering) when their training corpora are built. Fundamentals matter at scale.

💡

The NLP Pipeline: 5 Stages

intuition

Raw text is just a sequence of Unicode characters, which is meaningless to a model until it is converted to numbers. The NLP pipeline does this in five stages: (1) Tokenization: split text into tokens (words, subwords, or characters). (2) Vocabulary building: assign an integer ID to each unique token. (3) Vectorization: convert token IDs to numeric vectors, either sparse (one-hot, TF-IDF) or dense (word embeddings). (4) Model training: classify, cluster, generate, or retrieve. (5) Evaluation: accuracy, F1, BLEU, or perplexity, depending on the task.
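
The first three stages can be sketched in plain Python. This is a minimal illustration rather than a production tokenizer: the two sample sentences and the helper names `tokenize` and `bag_of_words` are assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical two-document corpus, for illustration only
docs = ["The cat sat on the mat.", "The dog chased the cat!"]

# Stage 1: tokenization (naive lowercase word split; transformers use subword tokenizers)
def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

tokens = [tokenize(d) for d in docs]

# Stage 2: vocabulary building (sorted so that IDs are deterministic)
vocab = {tok: i for i, tok in enumerate(sorted({t for doc in tokens for t in doc}))}

# Stage 3: vectorization as bag-of-words counts, one slot per vocabulary entry
def bag_of_words(doc_tokens):
    vec = [0] * len(vocab)
    for tok, count in Counter(doc_tokens).items():
        vec[vocab[tok]] = count
    return vec

X = [bag_of_words(t) for t in tokens]

print(vocab)
print(X)
```

Each row of `X` has one entry per vocabulary item, which is why the representation grows sparse as the vocabulary grows.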

TF-IDF: The Classic Vectorizer

math

Term Frequency (TF): how often does word t appear in document d? Document Frequency (DF): how many documents contain t? Inverse Document Frequency (IDF): log(N / DF_t), where N is the total number of documents. Words that appear in every document (the, is, of) get near-zero IDF, so they contribute almost nothing. Words that are specific to a few documents get high IDF. TF-IDF = TF × IDF. The result is a sparse matrix of shape (n_docs × vocab_size) where each entry reflects how characteristic that word is for that document.

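Smooth IDF: one common form, used by scikit-learn by default, is log((1 + N) / (1 + DF_t)) + 1. The +1 inside the ratio prevents division by zero.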
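
The TF, DF, and IDF definitions can be checked by hand on a toy corpus. The sketch below assumes raw counts for TF and a natural logarithm; the three sentences are made up for illustration, and the smooth variant shown is the form scikit-learn documents for `smooth_idf=True`.

```python
import numpy as np

# Hypothetical three-document corpus
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokens = [d.split() for d in docs]
vocab = sorted({t for doc in tokens for t in doc})
N = len(docs)

# TF: raw count of term t in document d, shape (n_docs, vocab_size)
tf = np.array([[doc.count(t) for t in vocab] for doc in tokens], dtype=float)

# DF: number of documents that contain term t
df = (tf > 0).sum(axis=0)

# Plain IDF: log(N / DF_t); a term present in every document gets exactly 0
idf = np.log(N / df)

# Smooth IDF: log((1 + N) / (1 + DF_t)) + 1
idf_smooth = np.log((1 + N) / (1 + df)) + 1

tfidf = tf * idf

for term, plain, smooth in zip(vocab, idf, idf_smooth):
    print(f"{term:8s} idf={plain:.3f} smooth_idf={smooth:.3f}")
```

With plain IDF, "the" is zeroed out because it occurs in all three documents, while terms that occur in a single document ("mat", "log", "chased") receive the largest weight.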
🔬

From Bag-of-Words to Word Embeddings

deepdive

TF-IDF treats each word as an independent feature: 'bank' and 'financial institution' share no tokens, so they look completely unrelated. Word embeddings (Word2Vec, GloVe, FastText) learn dense vector representations where similar words are nearby in vector space: king - man + woman ≈ queen. These vectors, typically around 300 dimensions, capture semantic relationships that TF-IDF cannot. Modern sentence transformers (SBERT models such as all-MiniLM-L6-v2) produce fixed-length vectors for entire sentences, enabling semantic search, clustering, and zero-shot classification.

For production text classification in 2025: start with TF-IDF + LogisticRegression as baseline, then try sentence-transformers embeddings + classifier, then fine-tune a pre-trained BERT/DistilBERT if quality is still insufficient.
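
Cosine similarity is what makes "nearby in vector space" concrete. The sketch below uses hand-picked 3-dimensional vectors as stand-ins for real embeddings; the numbers are hypothetical and chosen only to illustrate the contrast with one-hot bag-of-words features.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; independent of their lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Bag-of-words view: distinct words are orthogonal one-hot vectors
bow_bank = np.array([1.0, 0.0])
bow_institution = np.array([0.0, 1.0])
bow_sim = cosine_similarity(bow_bank, bow_institution)

# Hypothetical dense "embeddings" (made-up values, not from a trained model)
emb = {
    "bank": np.array([0.9, 0.1, 0.0]),
    "financial institution": np.array([0.8, 0.2, 0.1]),
    "river": np.array([0.0, 0.2, 0.9]),
}

# Rank all entries by similarity to the query vector (a tiny semantic search)
query = emb["bank"]
ranked = sorted(emb, key=lambda k: cosine_similarity(query, emb[k]), reverse=True)

print("bag-of-words similarity:", bow_sim)
print("embedding ranking:", ranked)
```

Real semantic search works the same way, except that the vectors come from an encoder such as a sentence transformer rather than being written by hand.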

⚙️

Text Classification Pipeline

algorithm

1. Preprocessing: lowercasing, punctuation removal, optional stopword removal
2. Tokenization: word_tokenize or subword (BPE/WordPiece for transformers)
3. Vectorization: CountVectorizer → TfidfVectorizer → word2vec → BERT embeddings
4. Model: MultinomialNB (fast baseline), LogisticRegression (strong linear), SVM, fine-tuned BERT
5. Evaluation: macro-F1 when every class should count equally (including on imbalanced data), weighted-F1 when scores should reflect class frequencies, AUC-ROC for ranking quality
6. Error analysis: inspect misclassified samples → improve features or labeling
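
The preprocessing step at the start of this pipeline can be sketched with the standard library alone. The `preprocess` helper and the tiny `STOPWORDS` set are assumptions for illustration; real stopword lists are much longer and language-specific.

```python
import string

# Tiny illustrative stopword list (an assumption; not a complete list)
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(text, remove_stopwords=True):
    """Lowercase, strip ASCII punctuation, split on whitespace, optionally drop stopwords."""
    tokens = text.lower().translate(_PUNCT_TABLE).split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

sample = "The model IS trained, and evaluated!"
print(preprocess(sample))
print(preprocess(sample, remove_stopwords=False))
```

Stopword removal is left optional because it helps sparse models such as TF-IDF but is usually skipped for transformer inputs, which expect the original text.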

</>

Complete NLP Classification Pipeline

code
python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score, train_test_split
import numpy as np

# ── Sample text data ───────────────────────────────────────────────────
# Toy corpus: 6 distinct documents repeated 40 times, so test documents
# duplicate training ones and every score here will be unrealistically high.
corpus = [
    "machine learning algorithms data science python",
    "neural network deep learning pytorch tensorflow",
    "natural language processing text classification bert",
    "computer vision image recognition convolutional",
    "reinforcement learning reward policy agent",
    "data preprocessing feature engineering pipeline",
] * 40   # 240 samples, 6 classes
labels = list(range(6)) * 40
X_text = corpus
y = np.array(labels)
X_train, X_test, y_train, y_test = train_test_split(
    X_text, y, test_size=0.2, stratify=y, random_state=42)

# ── Baseline: TF-IDF + Logistic Regression ────────────────────────
pipe_lr = Pipeline([
    ('tfidf', TfidfVectorizer(
        ngram_range=(1, 2),
        max_features=100_000,
        sublinear_tf=True,          # replaces tf with 1 + log(tf) to dampen high frequencies
        strip_accents='unicode',
        analyzer='word',
        token_pattern=r'\w{2,}',    # ignore single-char tokens
        min_df=2,                   # ignore very rare words
    )),
    ('clf', LogisticRegression(C=1.0, max_iter=1000, class_weight='balanced')),
])

# ── Alternative: TF-IDF + LinearSVC (fast, great for text) ────────
pipe_svm = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=100_000, sublinear_tf=True)),
    ('clf', LinearSVC(C=0.5, class_weight='balanced', max_iter=2000)),
])

# ── Evaluate both with cross-validation ───────────────────────────
# The vectorizer sits inside the Pipeline, so it is refit on each training fold.
for name, pipe in [('LR', pipe_lr), ('SVM', pipe_svm)]:
    scores = cross_val_score(pipe, X_text, y, cv=5, scoring='f1_macro', n_jobs=-1)
    print(f"{name}: macro-F1 = {scores.mean():.3f} ± {scores.std():.3f}")

# ── Modern approach: sentence embeddings ──────────────────────────
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-MiniLM-L6-v2')
# Encode the train and test splits separately so the score is measured on held-out rows
X_train_emb = encoder.encode(X_train, batch_size=256, show_progress_bar=True)
X_test_emb = encoder.encode(X_test, batch_size=256, show_progress_bar=True)
clf = LogisticRegression(max_iter=1000).fit(X_train_emb, y_train)
print(f"Sentence-BERT accuracy: {clf.score(X_test_emb, y_test):.3f}")
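
The listing above compares Logistic Regression and LinearSVC but does not include a Naïve Bayes pipeline. A minimal sketch of that baseline follows, reusing the same toy corpus; the choice of ComplementNB with `alpha=1.0` and the name `pipe_nb` are assumptions for illustration, and MultinomialNB can be swapped in the same way.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.model_selection import cross_val_score

# Same toy corpus as the main listing: 6 documents repeated 40 times
corpus = [
    "machine learning algorithms data science python",
    "neural network deep learning pytorch tensorflow",
    "natural language processing text classification bert",
    "computer vision image recognition convolutional",
    "reinforcement learning reward policy agent",
    "data preprocessing feature engineering pipeline",
] * 40
y = np.array(list(range(6)) * 40)

# TF-IDF features are non-negative, which Naive Bayes variants require
pipe_nb = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ("clf", ComplementNB(alpha=1.0)),
])

# The vectorizer is refit on each training fold because it lives inside the Pipeline
scores = cross_val_score(pipe_nb, corpus, y, cv=5, scoring="f1_macro")
print(f"NB: macro-F1 = {scores.mean():.3f} ± {scores.std():.3f}")
```

Because the corpus consists of repeated documents, the score says little about real-world quality; the point is only that the Naïve Bayes baseline plugs into the same Pipeline and cross-validation loop.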
⚠️

NLP Pipeline Pitfalls

pitfall

First: fitting TfidfVectorizer on the full dataset (before splitting) leaks test-set information into training, because the vocabulary and IDF values are computed with test document frequencies. Always fit the vectorizer inside a Pipeline on training data only. Second: using max_features without min_df. Very rare words (appearing in 1-2 documents) are noisy but still included; set min_df=2 or min_df=0.001. Third: ignoring class imbalance. A 95% majority class makes accuracy useless; use macro-F1 and class_weight='balanced'. Fourth: not stemming or lemmatizing on small datasets, where 'run', 'running', and 'ran' should map to the same feature.
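
The class-imbalance pitfall is easy to see with numbers. The sketch below assumes a hypothetical 95/5 label split and a degenerate model that always predicts the majority class.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
# A useless "model" that always predicts the majority class
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted class as 0 without raising a warning
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"accuracy    = {acc:.3f}")       # 0.950
print(f"macro-F1    = {macro:.3f}")     # 0.487
print(f"weighted-F1 = {weighted:.3f}")  # 0.926
```

Accuracy and weighted-F1 both look strong even though the minority class is never detected. Macro-F1 averages the per-class scores with equal weight, so the zero for the minority class pulls it below 0.5.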

For non-English text, use language-specific tokenizers and pre-trained multilingual models (mBERT, XLM-RoBERTa) rather than English-centric pipelines. Many NLP libraries default to English-only behavior silently.
