Naïve Bayes Classifiers
“Bayes' theorem + one bold assumption = a surprisingly powerful classifier”
Bayes' theorem, the conditional independence assumption, Gaussian/Multinomial/Complement/Bernoulli variants, Laplace smoothing, text classification with TF-IDF, and probability calibration.
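Key Formulas
Bayes' Theorem
P(C|x) = P(x|C) P(C) / P(x)
Posterior = Likelihood × Prior / Evidence
Naïve Assumption
P(x|C) = ∏ⱼ P(xⱼ|C)
Features are conditionally independent given the class; this is the 'naïve' part
MAP Decision
ŷ = argmax over C of P(C) ∏ⱼ P(xⱼ|C)
Maximum a posteriori: drop P(x) since it's the same for all classes
Gaussian Likelihood
P(xⱼ|C) = exp(−(xⱼ − μ)² / (2σ²)) / √(2πσ²), where μ and σ² are the mean and variance of feature j within class C
Gaussian NB assumes each feature is normally distributed within each class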
Why Learn Naïve Bayes?
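Naïve Bayes is competitive with far more complex models on spam filtering and other text tasks. It trains in a single pass over the data, predicts in time linear in the number of features, works with relatively little training data, and can in principle skip a feature that is missing at prediction time, because each feature contributes its own independent factor. Its raw probabilities, however, tend to be overconfident and usually need calibration. Early statistical spam filters were built on Naïve Bayes, and it remains the go-to baseline for text classification problems. Understanding it gives you deep intuition about probabilistic classifiers, the bias-variance tradeoff, and why a 'wrong' assumption (conditional independence) can still produce useful models in practice.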
Although the 'naïve' independence assumption rarely holds exactly, Naïve Bayes often achieves strong accuracy on text and document classification tasks: picking the right class only requires the correct class to score highest, not that the probabilities themselves be accurate.
The Probabilistic Intuition
Suppose you want to classify an email as Spam or Ham. You've seen 1000 emails, of which 300 were spam and 700 were ham. The word 'Viagra' appears in 250 of the 300 spam emails and only 1 of the 700 ham emails. The word 'meeting' appears in 5 spam and 400 ham emails. For a new email containing both words, Naïve Bayes multiplies: P(spam) × P(Viagra|spam) × P(meeting|spam). This product is the unnormalized posterior. Compare it against P(ham) × P(Viagra|ham) × P(meeting|ham); the larger one wins. The 'naïve' assumption is what justifies this multiplication: it treats the two words as independent of each other once the class is known.
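The arithmetic of this example can be checked in a few lines. The sketch below uses only the counts stated above and unsmoothed relative frequencies; the variable names are chosen for illustration.

```python
# Counts from the worked example: 1000 emails, 300 spam, 700 ham.
n_spam, n_ham = 300, 700
n_total = n_spam + n_ham

# Class priors
p_spam = n_spam / n_total
p_ham = n_ham / n_total

# Per-class word likelihoods (unsmoothed relative frequencies)
p_viagra_spam, p_meeting_spam = 250 / n_spam, 5 / n_spam
p_viagra_ham, p_meeting_ham = 1 / n_ham, 400 / n_ham

# Unnormalized posteriors: prior times the product of word likelihoods
score_spam = p_spam * p_viagra_spam * p_meeting_spam
score_ham = p_ham * p_viagra_ham * p_meeting_ham

# Dividing by the sum of the scores (the evidence) gives a proper posterior
posterior_spam = score_spam / (score_spam + score_ham)

print(f"spam score: {score_spam:.6f}")
print(f"ham score:  {score_ham:.6f}")
print(f"P(spam | Viagra, meeting) = {posterior_spam:.3f}")
```

The spam score is roughly seven times the ham score, so the email is classified as spam with a posterior of about 0.88, even though 'meeting' on its own points strongly toward ham.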
Derivation of the Decision Rule
By Bayes' theorem: P(C|x) = P(x|C)P(C) / P(x) ∝ P(x|C)P(C). The naïve assumption factorizes P(x|C) = ∏ⱼ P(xⱼ|C). Since P(x) is constant across classes, we only need the numerator. Taking logs (for numerical stability, since products of many small probabilities underflow): log P(C|x) = log P(C) + Σⱼ log P(xⱼ|C) + const, where const = −log P(x) does not depend on C and can be ignored when comparing classes. For Gaussian NB, P(xⱼ|C) is a Gaussian density with mean and variance estimated per feature per class. For Multinomial NB (text), P(xⱼ|C) is the smoothed relative frequency of word j in class C.
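A minimal sketch of the log-space decision rule for Gaussian likelihoods, in plain Python. The class names, per-class means and variances, and the query point are made-up values for illustration, not estimates from real data; the normalization uses the standard log-sum-exp trick.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log of the normal density N(x; mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def log_posteriors(x, params, priors):
    """Return normalized log P(C|x) for every class C."""
    # Unnormalized score: log P(C) + sum_j log P(x_j | C)
    scores = {
        c: math.log(priors[c])
        + sum(gaussian_log_pdf(xj, m, v) for xj, (m, v) in zip(x, params[c]))
        for c in priors
    }
    # log P(x) via log-sum-exp, so nothing underflows
    top = max(scores.values())
    log_evidence = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {c: s - log_evidence for c, s in scores.items()}

# Why logs: 400 factors of 0.01 underflow as a product but not as a sum of logs
naive_product = math.prod([0.01] * 400)
log_sum = sum(math.log(0.01) for _ in range(400))
print(naive_product, round(log_sum, 2))

# Hypothetical parameters: (mean, variance) per feature, per class
params = {"A": [(0.0, 1.0), (5.0, 2.0)], "B": [(2.0, 1.0), (3.0, 2.0)]}
priors = {"A": 0.5, "B": 0.5}

x = (0.2, 4.8)
log_post = log_posteriors(x, params, priors)
predicted = max(log_post, key=log_post.get)
print(predicted, {c: round(math.exp(lp), 4) for c, lp in log_post.items()})
```

For classification alone, the normalization step is optional: taking the argmax of the unnormalized scores gives the same prediction.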
Laplace Smoothing: Avoiding Zero Probabilities
If a word never appears in spam training emails, P(word|spam)=0 and the entire product becomes 0: one unseen word makes the classifier completely ignore all other evidence. Laplace smoothing adds α (usually 1) to all word counts: P(xⱼ|C) = (count(xⱼ,C) + α) / (count(C) + α × |V|), where count(xⱼ,C) is the number of times word xⱼ occurs in class C, count(C) is the total number of word occurrences in class C, and |V| is the vocabulary size. This ensures no probability is ever exactly 0. Larger α means more smoothing, which pulls the estimates closer to a uniform distribution (a stronger prior).
Laplace smoothing is equivalent to placing a symmetric Dirichlet prior on the word probabilities, with α acting as a pseudo-count added to every word. It prevents overfitting to rare words and is crucial for Multinomial NB on text data.
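The effect of α can be seen on a toy corpus. This is a sketch of the smoothing formula above in plain Python; the documents, the vocabulary, and the function name smoothed_word_probs are made up for illustration.

```python
from collections import Counter

def smoothed_word_probs(docs, vocabulary, alpha=1.0):
    """Estimate P(word | class) from one class's documents with additive smoothing."""
    counts = Counter(word for doc in docs for word in doc.split())
    total = sum(counts[w] for w in vocabulary)   # count(C)
    denom = total + alpha * len(vocabulary)      # count(C) + alpha * |V|
    return {w: (counts[w] + alpha) / denom for w in vocabulary}

# Hypothetical toy data: 'meeting' and 'tomorrow' never occur in spam
spam_docs = ["buy cheap pills", "cheap pills now"]
vocabulary = ["buy", "cheap", "pills", "now", "meeting", "tomorrow"]

unsmoothed = smoothed_word_probs(spam_docs, vocabulary, alpha=0.0)
smoothed = smoothed_word_probs(spam_docs, vocabulary, alpha=1.0)
heavy = smoothed_word_probs(spam_docs, vocabulary, alpha=100.0)

print("alpha=0:  ", unsmoothed["meeting"], unsmoothed["cheap"])
print("alpha=1:  ", round(smoothed["meeting"], 4), round(smoothed["cheap"], 4))
print("alpha=100:", round(heavy["meeting"], 4), round(heavy["cheap"], 4))
```

With α = 0 the unseen word gets probability 0 and would veto the spam class outright. With α = 1 it gets 1/12, and with α = 100 every word sits close to the uniform value 1/6.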
Which Naïve Bayes Variant to Use?
GaussianNB: continuous features assumed Gaussian per class. Good for sensor data, medical measurements. MultinomialNB: count features (word counts; non-negative TF-IDF weights also work in practice). Standard for document classification. BernoulliNB: binary features (word presence/absence). Good for short texts where rare words matter. ComplementNB: addresses MultinomialNB's bias on imbalanced data by estimating each class's parameters from the complement of that class (all other classes); it often outperforms MultinomialNB on text classification.
Continuous features → GaussianNB
Word count / TF-IDF features → MultinomialNB (requires non-negative)
Binary features (word present/absent) → BernoulliNB
Imbalanced text classification → ComplementNB
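Mixed types → fit a separate NB model per feature type and add their log-likelihoods (ColumnTransformer can route columns to preprocessing steps, but not to different classifiers)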
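Because the naïve assumption factorizes the likelihood feature by feature, nothing forces every feature to share one distribution. The sketch below is a from-scratch illustration in plain Python, not a scikit-learn API: it models one continuous feature with a Gaussian and one binary feature with a smoothed Bernoulli, then adds their log-likelihoods to the log prior. The data, the feature meanings (message length, contains a link), and the function names are hypothetical.

```python
import math
from collections import defaultdict

def fit_mixed_nb(rows, labels, alpha=1.0):
    """rows: list of (continuous_value, binary_flag) pairs."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        n = len(members)
        xs = [r[0] for r in members]
        mean = sum(xs) / n
        var = sum((x - mean) ** 2 for x in xs) / n + 1e-9   # small floor avoids var == 0
        p_flag = (sum(r[1] for r in members) + alpha) / (n + 2 * alpha)  # smoothed Bernoulli
        model[label] = {"log_prior": math.log(n / len(rows)),
                        "mean": mean, "var": var, "p_flag": p_flag}
    return model

def predict_mixed_nb(model, row):
    """Pick the class with the highest log prior + Gaussian + Bernoulli log-likelihood."""
    x, flag = row
    scores = {}
    for label, p in model.items():
        log_gauss = (-0.5 * math.log(2 * math.pi * p["var"])
                     - (x - p["mean"]) ** 2 / (2 * p["var"]))
        log_bern = math.log(p["p_flag"] if flag else 1 - p["p_flag"])
        scores[label] = p["log_prior"] + log_gauss + log_bern
    return max(scores, key=scores.get)

# Hypothetical data: (message length in hundreds of characters, contains a link)
rows = [(1.0, 1), (1.2, 1), (0.8, 1), (1.1, 0),
        (3.0, 0), (2.8, 0), (3.3, 1), (2.9, 0)]
labels = ["spam"] * 4 + ["ham"] * 4

model = fit_mixed_nb(rows, labels)
print(predict_mixed_nb(model, (1.0, 1)))
print(predict_mixed_nb(model, (3.1, 0)))
```

The same idea carries over to library models: fit one model per feature group, sum the per-class log-likelihoods, and add the log prior once rather than once per model.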
Text Classification with Naïve Bayes
```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import ComplementNB, GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# ── Sample text data ───────────────────────────────────────────────────
X_text = [
    "buy cheap pills now", "get rich quick scheme", "free money click here",
    "meeting at 3pm tomorrow", "quarterly report attached", "team lunch Friday",
    "win a prize enter now", "limited offer act fast", "investment opportunity",
    "project deadline next week", "budget review scheduled", "please review document",
] * 20  # 240 samples (toy data: every message repeats 20 times)
y = np.array([1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0] * 20)  # 1=spam, 0=ham

X_train, X_test, y_train, y_test = train_test_split(
    X_text, y, test_size=0.2, random_state=42)

# ── Text classification (spam detection) ──────────────────────────────
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(
        ngram_range=(1, 2),   # unigrams + bigrams
        max_features=50_000,
        sublinear_tf=True,    # replace tf with 1 + log(tf)
        min_df=3,             # drop terms seen in fewer than 3 documents
    )),
    ('clf', ComplementNB(alpha=0.1)),  # often a strong choice for text
])
scores = cross_val_score(pipe, X_text, y, cv=5, scoring='f1_macro')
print(f"F1: {scores.mean():.3f} ± {scores.std():.3f}")

# ── Calibrated probabilities ──────────────────────────────────────────
# NB probabilities are often poorly calibrated (overconfident).
# Isotonic regression calibration gives better probability estimates.
pipe_cal = CalibratedClassifierCV(pipe, cv=5, method='isotonic')
pipe_cal.fit(X_train, y_train)
probs = pipe_cal.predict_proba(X_test)

# ── Continuous features: GaussianNB ───────────────────────────────────
# GaussianNB needs numeric features; the built-in breast cancer dataset
# serves as a stand-in for continuous measurements here.
X_num, y_num = load_breast_cancer(return_X_y=True)
Xn_train, Xn_test, yn_train, yn_test = train_test_split(
    X_num, y_num, test_size=0.2, random_state=42)

gnb = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', GaussianNB(var_smoothing=1e-9)),
])
gnb.fit(Xn_train, yn_train)
print(f"GaussianNB accuracy: {gnb.score(Xn_test, yn_test):.3f}")

# ── Inspect learned parameters ────────────────────────────────────────
nb = gnb.named_steps['clf']
print("Class priors:", nb.class_prior_)
print("Feature means per class:", nb.theta_)  # shape (n_classes, n_features)
```
Naïve Bayes Pitfalls
The independence assumption means correlated features are double-counted. If 'Viagra' and 'pill' always co-occur in spam, their combined evidence is counted twice, leading to overconfident posteriors. This is why NB probabilities are usually poorly calibrated even when classifications are correct; when you need probability outputs rather than just labels, calibrate them, for example with CalibratedClassifierCV. Also: MultinomialNB requires non-negative features. Standard TF-IDF weights are non-negative and work, but features that have been mean-centered, standardized, or projected (for example with SVD) can be negative and will be rejected.
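The double-counting effect is easy to reproduce by hand. In this sketch, a single piece of evidence is fed to the classifier as one, two, or three perfectly correlated copies; the likelihoods 0.8 and 0.2 and the equal priors are made-up numbers for illustration.

```python
def posterior_spam(n_copies, p_word_spam=0.8, p_word_ham=0.2, prior_spam=0.5):
    """Posterior for spam when the same evidence is counted n_copies times."""
    spam = prior_spam * p_word_spam ** n_copies
    ham = (1 - prior_spam) * p_word_ham ** n_copies
    return spam / (spam + ham)

for k in (1, 2, 3):
    print(f"{k} correlated copies -> P(spam | evidence) = {posterior_spam(k):.3f}")
```

The true information content never changes, yet the posterior climbs from 0.800 to 0.941 to 0.985. The predicted class stays the same, which is why accuracy can remain good while the probabilities drift toward 0 and 1.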
Naïve Bayes is an excellent baseline and often hard to beat on small text datasets. If it scores 85% and your complex model scores 87%, ask: is the 2% gain worth 10× the complexity and training time?