
Naïve Bayes Classifiers

“Bayes' theorem + one bold assumption = a surprisingly powerful classifier”

Bayes' theorem, the conditional independence assumption, Gaussian/Multinomial/Complement/Bernoulli variants, Laplace smoothing, text classification with TF-IDF, and probability calibration — with interactive posterior probability demo.

Prerequisites

→Probability & Statistics

Concepts Covered

Bayes Theorem, MAP Decision, Laplace Smoothing, GaussianNB, MultinomialNB, ComplementNB, Calibration


∑ Key Formulas

Bayes' Theorem

Posterior = Likelihood × Prior / Evidence

Naïve Assumption

Features are conditionally independent given the class — the 'naïve' part

MAP Decision

Maximum a posteriori — drop P(x) since it's the same for all classes

Gaussian Likelihood

Gaussian NB assumes each feature is normally distributed within each class
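
The equations for the four entries above can be written in one notation. This is a reconstruction from the descriptions, assuming class C and feature vector x = (x_1, ..., x_d); the symbols \mu_{jC} and \sigma_{jC}^2 for the per-class mean and variance of feature j are labels chosen here.

```latex
\begin{align}
P(C \mid \mathbf{x}) &= \frac{P(\mathbf{x} \mid C)\, P(C)}{P(\mathbf{x})}
  && \text{(Bayes' theorem)} \\
P(\mathbf{x} \mid C) &= \prod_{j=1}^{d} P(x_j \mid C)
  && \text{(naïve assumption)} \\
\hat{C} &= \arg\max_{C} \; P(C) \prod_{j=1}^{d} P(x_j \mid C)
  && \text{(MAP decision)} \\
P(x_j \mid C) &= \frac{1}{\sqrt{2\pi\sigma_{jC}^{2}}}
  \exp\!\left(-\frac{(x_j - \mu_{jC})^{2}}{2\sigma_{jC}^{2}}\right)
  && \text{(Gaussian likelihood)}
\end{align}
```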

🎯

Why Learn Naïve Bayes?

motivation

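Naïve Bayes is a strong baseline for spam filtering and other text classification tasks, and on small datasets it can match or beat far more complex models. Training is a single pass of counting, prediction is fast, and it needs comparatively little training data. Because each feature contributes its own factor, a missing feature can simply be left out of the product. Its class predictions are often good, but its probability estimates tend to be overconfident and usually need calibration. Naïve Bayes was a mainstay of early statistical spam filters and remains the go-to baseline for text classification problems. Understanding it gives you deep intuition about probabilistic classifiers, the bias-variance tradeoff, and why a 'wrong' assumption (conditional independence) can still produce useful models in practice.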

Although its 'naïve' independence assumption is almost always violated in practice, Naïve Bayes is often competitive on text and document classification tasks: the predicted class depends only on which posterior is largest, not on the posterior values being accurate.

💡

The Probabilistic Intuition

intuition

Suppose you want to classify an email as Spam or Ham. You've seen 1000 emails. 300 were spam. The word 'Viagra' appears in 250 of the 300 spam emails and only 1 of the 700 ham emails. The word 'meeting' appears in 5 spam and 400 ham emails. For a new email containing both words, Naïve Bayes multiplies: P(spam) × P(Viagra|spam) × P(meeting|spam). This product is the unnormalized posterior. Compare it against P(ham) × P(Viagra|ham) × P(meeting|ham); the larger product wins. The 'naïve' assumption is what lets us multiply the per-word likelihoods instead of estimating the joint probability of both words appearing together.
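
The arithmetic in the example above can be checked with a few lines of Python. This is a minimal sketch that uses only the counts quoted in the text; the function name `unnormalized_posterior` and the dictionary layout are illustrative choices, not part of any library.

```python
# Counts from the example: 1000 emails, 300 spam, 700 ham.
totals = {"spam": 300, "ham": 700}
counts = {
    "spam": {"Viagra": 250, "meeting": 5},
    "ham": {"Viagra": 1, "meeting": 400},
}
n_total = sum(totals.values())


def unnormalized_posterior(label, words):
    """Return P(class) times the product of P(word | class) over the words."""
    score = totals[label] / n_total                  # prior P(class)
    for w in words:
        score *= counts[label][w] / totals[label]    # likelihood P(word | class)
    return score


words = ["Viagra", "meeting"]
score_spam = unnormalized_posterior("spam", words)
score_ham = unnormalized_posterior("ham", words)

# Dividing by the sum of both scores recovers the normalized posterior.
p_spam = score_spam / (score_spam + score_ham)
print(f"spam score: {score_spam:.6f}, ham score: {score_ham:.6f}")
print(f"P(spam | Viagra, meeting) = {p_spam:.3f}")
```

With these counts the spam score is 1/240 and the ham score is 1/1750, so the normalized spam posterior comes out at roughly 0.88 even though 'meeting' is strong evidence for ham.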

∑

Derivation of the Decision Rule

math

By Bayes' theorem: P(C|x) ∝ P(x|C)P(C). The naïve assumption factorizes P(x|C) = ∏ⱼ P(xⱼ|C). Since P(x) is constant across classes, we only need the numerator. Taking logs (for numerical stability, since products of many small probabilities underflow): log P(C|x) = log P(C) + Σⱼ log P(xⱼ|C) + const, where the constant −log P(x) does not depend on C and can be ignored when comparing classes. For Gaussian NB, P(xⱼ|C) is a Gaussian density with mean and variance estimated per feature per class. For Multinomial NB (text), P(xⱼ|C) is the smoothed relative frequency of word j in class C.
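
To see why the log form matters, the sketch below compares the direct product with the log-space sum. The numbers are hypothetical (2000 features, each with likelihood 0.01 under class A and 0.02 under class B, equal priors), and the helper `posteriors_from_logs` is an illustrative name for the usual log-sum-exp normalization.

```python
import math

n_features = 2000
prior_a, prior_b = 0.5, 0.5
lik_a, lik_b = 0.01, 0.02   # hypothetical per-feature likelihoods

# Direct products underflow to exactly 0.0, so the classes cannot be compared.
naive_a = prior_a * math.prod([lik_a] * n_features)
naive_b = prior_b * math.prod([lik_b] * n_features)

# Log-space scores: log P(C) + sum_j log P(x_j | C).
log_a = math.log(prior_a) + n_features * math.log(lik_a)
log_b = math.log(prior_b) + n_features * math.log(lik_b)


def posteriors_from_logs(log_scores):
    """Normalize log scores into probabilities, shifting by the max for stability."""
    m = max(log_scores)
    exps = [math.exp(s - m) for s in log_scores]
    z = sum(exps)
    return [e / z for e in exps]


post = posteriors_from_logs([log_a, log_b])
print("direct products:", naive_a, naive_b)
print("log scores:", log_a, log_b)
print("posteriors:", post)
```

Both direct products collapse to 0.0, while the log scores remain finite and still rank class B above class A.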

🔬

Laplace Smoothing: Avoiding Zero Probabilities

deepdive

If a word never appears in spam training emails, P(word|spam)=0 and the entire product becomes 0, so one unseen word makes the classifier completely ignore all other evidence. Laplace smoothing adds α (usually 1) to all word counts: P(xⱼ|C) = (count(xⱼ,C) + α) / (count(C) + α × |V|), where count(C) is the total number of word occurrences in class C and |V| is the vocabulary size. This ensures no probability is ever exactly 0. Larger α = more smoothing = closer to a uniform distribution (stronger prior).

Laplace smoothing is equivalent to MAP estimation of the word probabilities under a symmetric Dirichlet prior, so it acts as a regularizer. It prevents overfitting to rare words and is crucial for Multinomial NB on text data.
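
The smoothing formula can be sketched directly with word counts. The two spam documents and the five-word vocabulary below are made up for illustration, and `smoothed_word_probs` is a hypothetical helper, not a library function.

```python
from collections import Counter


def smoothed_word_probs(docs, vocabulary, alpha=1.0):
    """Estimate P(word | class) as (count + alpha) / (total + alpha * |V|)."""
    counts = Counter(
        word for doc in docs for word in doc.split() if word in vocabulary
    )
    total = sum(counts.values())                  # count(C): all word occurrences
    denom = total + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}


# Hypothetical training documents for the spam class.
spam_docs = ["free money now", "free prize now"]
vocabulary = ["free", "money", "now", "prize", "meeting"]

unsmoothed = smoothed_word_probs(spam_docs, vocabulary, alpha=0.0)
probs = smoothed_word_probs(spam_docs, vocabulary, alpha=1.0)
heavy = smoothed_word_probs(spam_docs, vocabulary, alpha=100.0)

print("alpha=0   :", unsmoothed)
print("alpha=1   :", probs)
print("alpha=100 :", heavy)
```

With α = 0 the unseen word 'meeting' gets probability 0; with α = 1 it gets 1/11 while the estimates still sum to 1; with α = 100 every word is pulled close to the uniform value 1/5.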

⚖️

Which Naïve Bayes Variant to Use?

comparison

GaussianNB: continuous features, each assumed Gaussian within each class. Good for sensor data and medical measurements. MultinomialNB: non-negative count features (word counts); fractional TF-IDF weights also work in practice. Standard for document classification. BernoulliNB: binary features (word presence/absence). It also penalizes the absence of a word, which can help on short texts. ComplementNB: reduces MultinomialNB's bias on imbalanced data by estimating each class's parameters from the documents of all other classes (its complement). Often a strong choice for text classification.

Continuous features → GaussianNB

Word count / TF-IDF features → MultinomialNB (requires non-negative)

Binary features (word present/absent) → BernoulliNB

Imbalanced text classification → ComplementNB

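Mixed types → fit one NB model per feature type and add their per-class log-likelihoods (a ColumnTransformer only preprocesses columns; it cannot combine classifiers)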
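
For the mixed-type case, one possible approach under the NB assumption is to fit a separate model per feature group and add the per-class log-posteriors, subtracting the log prior once because each model already includes it. The sketch below uses synthetic data; the variable names, the Poisson rates, and the helper `combined_log_posterior` are all assumptions made for illustration, and scikit-learn does not provide this combination as a built-in estimator.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)

# Synthetic data: two continuous features and three count features,
# both of which depend on the class label.
X_cont = rng.normal(loc=y[:, None] * 1.5, scale=1.0, size=(n, 2))
rates = np.where(y[:, None] == 1, [4.0, 1.0, 1.0], [1.0, 1.0, 4.0])
X_counts = rng.poisson(rates)

gnb = GaussianNB().fit(X_cont, y)
mnb = MultinomialNB(alpha=1.0).fit(X_counts, y)


def combined_log_posterior(Xc, Xk):
    """Combine both models: log P(C|xc) + log P(C|xk) - log P(C), then normalize."""
    log_joint = (
        gnb.predict_log_proba(Xc)
        + mnb.predict_log_proba(Xk)
        - mnb.class_log_prior_      # the prior was counted twice; remove one copy
    )
    # Row-wise log-sum-exp so each row sums to 1 in probability space.
    return log_joint - np.logaddexp.reduce(log_joint, axis=1, keepdims=True)


log_post = combined_log_posterior(X_cont, X_counts)
pred = gnb.classes_[np.argmax(log_post, axis=1)]
print(f"Training accuracy of the combined model: {(pred == y).mean():.3f}")
```

Both models are fit on the same labels, so their empirical class priors agree and subtracting either one is equivalent.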

</>

Text Classification with Naïve Bayes

code

python

import numpy as np
from sklearn.naive_bayes import ComplementNB, GaussianNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.calibration import CalibratedClassifierCV

# ── Sample text data ───────────────────────────────────────────────────
# Toy data: 12 phrases repeated 20 times. The duplicates leak across CV
# folds, so the scores below are optimistic and only illustrate the API.
X_text = [
    "buy cheap pills now", "get rich quick scheme", "free money click here",
    "meeting at 3pm tomorrow", "quarterly report attached", "team lunch Friday",
    "win a prize enter now", "limited offer act fast", "investment opportunity",
    "project deadline next week", "budget review scheduled", "please review document",
] * 20   # 240 samples
y = np.array([1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0] * 20)  # 1=spam, 0=ham
X_train, X_test, y_train, y_test = train_test_split(
    X_text, y, test_size=0.2, random_state=42, stratify=y)

# ── Text classification (spam detection) ──────────────────────────
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(
        ngram_range=(1,2),       # unigrams + bigrams
        max_features=50_000,
        sublinear_tf=True,       # log(1+tf)
        min_df=3,
    )),
    ('clf', ComplementNB(alpha=0.1)),  # often a strong choice for text
])

scores = cross_val_score(pipe, X_text, y, cv=5, scoring='f1_macro')
print(f"F1: {scores.mean():.3f} ± {scores.std():.3f}")

# ── Calibrated probabilities ───────────────────────────────────────
# NB probabilities are often poorly calibrated (overconfident)
# Use isotonic regression calibration for better probability estimates
pipe_cal = CalibratedClassifierCV(pipe, cv=5, method='isotonic')
pipe_cal.fit(X_train, y_train)
probs = pipe_cal.predict_proba(X_test)

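# ── Continuous features: GaussianNB ────────────────────────────────
# GaussianNB needs numeric inputs, so the text data above cannot be used here.
# A synthetic numeric dataset stands in for real continuous features.
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
X_num, y_num = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0,
    random_state=42)
Xn_train, Xn_test, yn_train, yn_test = train_test_split(
    X_num, y_num, test_size=0.2, random_state=42)

gnb = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', GaussianNB(var_smoothing=1e-9)),
])
gnb.fit(Xn_train, yn_train)
print(f"GaussianNB accuracy: {gnb.score(Xn_test, yn_test):.3f}")

# ── Inspect learned parameters ─────────────────────────────────────
nb = gnb.named_steps['clf']
print("Class priors:", nb.class_prior_)
print("Feature means per class:", nb.theta_)   # shape (n_classes, n_features)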

⚠️

Naïve Bayes Pitfalls

pitfall

The independence assumption means correlated features are double-counted. If 'Viagra' and 'pill' always co-occur in spam, their combined evidence is counted twice, leading to overconfident posteriors. This is why NB probabilities are usually poorly calibrated even when classifications are correct; when you need probability outputs, wrap the model in CalibratedClassifierCV. Also: MultinomialNB requires non-negative features. Standard TF-IDF is non-negative, but centered or standardized features and dense embeddings (e.g., after StandardScaler or SVD) contain negative values and will raise an error.
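
The double-counting effect is easy to reproduce. The sketch below, on synthetic data, trains BernoulliNB once on a single noisy binary feature and once on five identical copies of it; the 70%/30% feature rates and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)

# One noisy binary feature: present in about 70% of class-1 rows
# and about 30% of class-0 rows.
x = (rng.random(n) < np.where(y == 1, 0.7, 0.3)).astype(int)

X_single = x.reshape(-1, 1)
X_dup = np.repeat(X_single, 5, axis=1)   # five perfectly correlated copies

# Posterior for class 1 when the feature (or all of its copies) is present.
p_single = BernoulliNB().fit(X_single, y).predict_proba([[1]])[0, 1]
p_dup = BernoulliNB().fit(X_dup, y).predict_proba([[1] * 5])[0, 1]

print(f"P(class 1 | feature present), one copy   : {p_single:.3f}")
print(f"P(class 1 | feature present), five copies: {p_dup:.3f}")
```

The copies add no new information, yet the posterior moves from around 0.7 to well above 0.95, because the model multiplies the same evidence in five times.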

Naïve Bayes is an excellent baseline and often hard to beat on small text datasets. If it scores 85% and your complex model scores 87%, ask: is the 2% gain worth 10× the complexity and training time?
