Foundations · beginner

Probability & Statistics

“The language of uncertainty — probability distributions, MLE, and Bayesian reasoning”

Probability distributions (Normal, Binomial, Poisson), MLE, Bayes' theorem, hypothesis testing, and the Central Limit Theorem — the language of uncertainty that underlies every loss function and evaluation metric.

50 min

9 diagrams

7 Concepts Covered

Prerequisites

→Calculus & Optimization

Concepts Covered

Normal Distribution, MLE, Bayes' Theorem, p-values, CLT, Hypothesis Testing, Confidence Intervals

∑Key Formulas

Bayes' Theorem

P(H|E) = P(E|H) · P(H) / P(E)

Update prior belief P(H) with evidence E to get posterior P(H|E)

MLE

θ̂_MLE = argmax_θ Σᵢ log p(xᵢ|θ)

Find the parameters that make the observed data most probable; equivalent to minimizing the negative log-likelihood (NLL)

Normal PDF

f(x) = 1/(σ√(2π)) · exp(-(x-μ)²/(2σ²))

Bell curve, fully specified by mean μ and standard deviation σ

Central Limit Theorem

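√n(X̄-μ)/σ → N(0,1) as n→∞

The standardized sum (or mean) of n i.i.d. random variables with finite variance approaches a Normal distribution as n→∞, which is why the Normal distribution appears so often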

🎯

Uncertainty Is Everywhere in ML

motivation

Machine learning is fundamentally about making predictions under uncertainty. Classifiers output probabilities, not just labels. Bayesian models maintain full distributions over parameters. Gaussian processes give predictive uncertainty intervals. A/B tests use hypothesis testing. Dropout in neural networks can be interpreted as approximate Bayesian inference. Without probability theory, you can't reason about whether a model is confidently wrong, whether your train/test split gives a reliable performance estimate, or whether two models actually differ. Statistics provides the tools to answer all of these.

Binary log loss (cross-entropy) IS the average negative log-likelihood of the labels under a Bernoulli model. Minimizing cross-entropy IS doing maximum likelihood estimation. They're the same objective.
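The equivalence can be checked numerically. The sketch below uses made-up labels and predicted probabilities (both are illustrative assumptions, not data from this lesson) and computes the Bernoulli negative log-likelihood and the textbook log-loss formula separately.

```python
import math

def bernoulli_nll(y, p):
    """Average negative log-likelihood of labels y under Bernoulli(p_i)."""
    total = 0.0
    for yi, pi in zip(y, p):
        likelihood = pi if yi == 1 else 1.0 - pi   # P(y_i | p_i)
        total += -math.log(likelihood)
    return total / len(y)

def binary_cross_entropy(y, p):
    """Textbook log-loss: -mean(y*log(p) + (1-y)*log(1-p))."""
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
                for yi, pi in zip(y, p)) / len(y)

y_true = [1, 0, 1, 1, 0]             # made-up labels
p_pred = [0.9, 0.2, 0.6, 0.8, 0.3]   # made-up predicted P(y = 1)

nll = bernoulli_nll(y_true, p_pred)
bce = binary_cross_entropy(y_true, p_pred)
print(f"NLL = {nll:.6f}, BCE = {bce:.6f}")   # the two values agree
```

The two functions differ only in how they select the probability of the observed label, so they agree up to floating-point rounding.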

💡

Distributions — The Most Important Ones

intuition

**Normal (Gaussian):** Bell-shaped and symmetric, and common in practice because of the CLT. Parameterized by μ (location) and σ (spread). About 68%, 95%, and 99.7% of the probability mass lies within ±1σ, ±2σ, and ±3σ of the mean.

**Binomial:** Number of successes in n independent binary trials, each with success probability p. Mean = np, variance = np(1-p).

**Poisson:** Number of events in a fixed time or space interval. The single parameter λ is both the mean and the variance.

**Bernoulli:** A single binary trial (Binomial with n = 1).

**Exponential:** Waiting time between events of a Poisson process.

**Student-t:** Like the Normal but with heavier tails; used for hypothesis tests on small samples when the variance is estimated from the data.

Knowing which distribution fits your problem is a core skill.
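The mean and variance formulas for the Binomial and Poisson can be verified directly from their probability mass functions. This is a minimal sketch using only the standard library; the parameter values (n = 10, p = 0.3, λ = 4) and the truncation of the Poisson support at 60 are arbitrary choices for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(K = k) for K ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def mean_and_variance(support, pmf):
    """Mean and variance of a discrete distribution over the given support."""
    mean = sum(k * pmf(k) for k in support)
    var = sum((k - mean) ** 2 * pmf(k) for k in support)
    return mean, var

n, p = 10, 0.3   # illustrative Binomial parameters
lam = 4.0        # illustrative Poisson rate

binom_mean, binom_var = mean_and_variance(
    range(n + 1), lambda k: binomial_pmf(k, n, p))
# The Poisson support is infinite; the mass beyond k = 60 is negligible for lam = 4.
pois_mean, pois_var = mean_and_variance(
    range(60), lambda k: poisson_pmf(k, lam))

print(f"Binomial: mean={binom_mean:.4f} (np={n*p}), var={binom_var:.4f} (np(1-p)={n*p*(1-p):.2f})")
print(f"Poisson:  mean={pois_mean:.4f}, var={pois_var:.4f} (both should equal λ={lam})")
```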

If X₁, X₂, …, Xₙ are i.i.d. with mean μ and finite variance σ², then √n(X̄-μ)/σ → N(0,1) in distribution as n→∞. This is why sample means, and statistics built from them, are approximately Gaussian once you average enough samples, even when the underlying data are not.
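A small simulation makes the statement concrete. The sketch below draws samples from a strongly skewed Exponential(1) distribution (an arbitrary choice, as are the sample size and repeat count), standardizes each sample mean as in the formula above, and checks that the resulting values behave like draws from N(0,1).

```python
import math
import random
import statistics

random.seed(0)           # fixed seed so the run is repeatable
n = 50                   # observations averaged per sample
n_repeats = 5000         # number of simulated sample means
mu, sigma = 1.0, 1.0     # mean and std of Exponential(rate=1)

z_scores = []
for _ in range(n_repeats):
    sample = [random.expovariate(1.0) for _ in range(n)]
    x_bar = statistics.fmean(sample)
    z_scores.append(math.sqrt(n) * (x_bar - mu) / sigma)   # √n(X̄-μ)/σ

z_mean = statistics.fmean(z_scores)
z_std = statistics.pstdev(z_scores)
within_196 = sum(abs(z) < 1.96 for z in z_scores) / n_repeats

print(f"mean of z   = {z_mean:.3f}  (N(0,1) has 0)")
print(f"std of z    = {z_std:.3f}  (N(0,1) has 1)")
print(f"P(|z|<1.96) = {within_196:.3f}  (N(0,1) has about 0.95)")
```

With n = 50 the approximation is already close, although some residual skew from the exponential remains; larger n tightens it further.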

⚙️

Maximum Likelihood Estimation (MLE)

algorithm

Choose a probability model p(x|θ) for your data (e.g., Normal, Binomial).

Write the likelihood: L(θ) = ∏ᵢ p(xᵢ|θ), the probability (or density) of the observed data under θ, assuming the samples are i.i.d.

Take log: ℓ(θ) = Σᵢ log p(xᵢ|θ) — log-likelihood is easier to optimize (sum vs product).

Take derivative ∂ℓ/∂θ, set to zero, solve for θ̂_MLE.

For Normal: θ̂_MLE = (μ̂=x̄, σ̂²=Σ(xᵢ-x̄)²/n) — sample mean and biased variance.

For logistic regression: no closed form → use gradient descent on the log-loss = -ℓ(θ).
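The steps above can be sketched for the Normal case. The code below generates synthetic data (the true parameters 170 and 10 and the seed are illustrative assumptions), computes the closed-form estimates, and confirms that nudging either parameter away from them lowers the log-likelihood.

```python
import math
import random
import statistics

def normal_log_likelihood(data, mu, sigma):
    """ℓ(μ, σ) = Σ log N(xᵢ | μ, σ²) for i.i.d. data."""
    n = len(data)
    return (-n * math.log(sigma)
            - 0.5 * n * math.log(2 * math.pi)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2))

random.seed(0)
data = [random.gauss(170, 10) for _ in range(200)]   # synthetic heights

# Closed-form MLE: sample mean and the biased (divide-by-n) variance.
mu_hat = statistics.fmean(data)
sigma_hat = math.sqrt(statistics.pvariance(data, mu_hat))

best = normal_log_likelihood(data, mu_hat, sigma_hat)
print(f"MLE: mu={mu_hat:.2f}, sigma={sigma_hat:.2f}, log-likelihood={best:.2f}")

# Any small move away from the MLE should reduce the log-likelihood.
nearby = []
for d_mu, d_sigma in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]:
    ll = normal_log_likelihood(data, mu_hat + d_mu, sigma_hat + d_sigma)
    nearby.append(ll)
    print(f"  shift mu by {d_mu:+.1f}, sigma by {d_sigma:+.1f}: {ll:.2f}")
```

When no closed form exists, as with logistic regression, the same log-likelihood function would instead be handed to an iterative optimizer.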

</>

Probability with SciPy & NumPy

code

python

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# ── Normal distribution ───────────────────────────────────────────────────────
mu, sigma = 170, 10          # heights in cm
dist = stats.norm(mu, sigma)

# Density evaluated on a grid, e.g. for plt.plot(x, pdf)
x = np.linspace(135, 205, 500)
pdf = dist.pdf(x)

# Probabilities
p_tall = 1 - dist.cdf(190)           # P(X > 190)
p_range = dist.cdf(180) - dist.cdf(160)  # P(160 < X < 180)
print(f"P(height > 190cm) = {p_tall:.4f}")
print(f"P(160 < height < 180) = {p_range:.4f}")

# 68-95-99.7 rule
for k in [1, 2, 3]:
    p = dist.cdf(mu + k*sigma) - dist.cdf(mu - k*sigma)
    print(f"P(μ ± {k}σ) = {p:.4f}")  # ≈ 0.68, 0.95, 0.997

# ── MLE — fitting a Normal distribution ──────────────────────────────────────
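np.random.seed(42)           # seed so the fitted values are reproducible
data = np.random.normal(170, 10, size=100)
# MLE for a Normal: sample mean and the biased std (NumPy's default ddof=0)
mu_mle, sigma_mle = data.mean(), data.std()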
print(f"\nMLE fit: μ̂={mu_mle:.2f}, σ̂={sigma_mle:.2f}")

# SciPy MLE (same result, handles any distribution)
mu_fit, sigma_fit = stats.norm.fit(data)
print(f"scipy fit: μ={mu_fit:.2f}, σ={sigma_fit:.2f}")

# ── Bayes' theorem ────────────────────────────────────────────────────────────
# Disease testing: prevalence 1%, test sensitivity 99%, specificity 95%
p_disease = 0.01
p_pos_given_disease = 0.99     # sensitivity
p_neg_given_healthy = 0.95    # specificity → P(pos|healthy) = 0.05

p_healthy = 1 - p_disease
p_pos_given_healthy = 1 - p_neg_given_healthy

# P(positive) = P(pos|disease)*P(disease) + P(pos|healthy)*P(healthy)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * p_healthy

# Bayes: P(disease | positive test)
p_disease_given_pos = (p_pos_given_disease * p_disease) / p_pos
print(f"\nP(disease | positive test) = {p_disease_given_pos:.4f}")  # ≈ 0.1667
# Counterintuitive: despite 99% sensitivity and 95% specificity, a positive result
# implies only a ~17% chance of disease, because the base rate (prior) is low.
# Ignoring the prior is the base rate fallacy.

# ── Hypothesis testing ────────────────────────────────────────────────────────
# Are two group means different?
group_a = np.random.normal(5.0, 1.5, 50)
group_b = np.random.normal(5.5, 1.5, 50)

# Student's t-test; the default assumes equal variances (equal_var=False gives Welch's test)
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"\nt-test: t={t_stat:.3f}, p={p_value:.4f}")
print("Significant at α=0.05:", p_value < 0.05)

# Bootstrap confidence interval for mean (distribution-free)
np.random.seed(42)
boot_means = [np.random.choice(group_a, size=len(group_a), replace=True).mean()
              for _ in range(10000)]
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"95% CI for group A mean: [{ci_low:.3f}, {ci_high:.3f}]")

⚠️

p-values Are Not What You Think

pitfall

A p-value < 0.05 does NOT mean 'there is a 95% chance the effect is real.' It means: 'if the null hypothesis were true, we would see data at least this extreme less than 5% of the time.' This subtle difference causes widespread misuse. ML-specific pitfalls:

(1) Multiple comparisons: if you test 20 hyperparameter configurations and report the best, you've implicitly run 20 hypothesis tests. Correct for this (e.g., with Bonferroni) or confirm the winner on a held-out set that was not used for selection.

(2) Confusing statistical significance with practical significance: with 100k samples, a trivially small effect can be highly significant.

(3) Data dredging: re-running many train/test splits until you find one where your model 'significantly beats' a baseline.
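The multiple-comparisons problem can be quantified with a short simulation. This sketch assumes the 20 tests are independent and that every null hypothesis is true, so each p-value is uniform on (0, 1); real hyperparameter comparisons are usually correlated, so treat the numbers as illustrative.

```python
import random

alpha = 0.05
n_tests = 20

# Chance of at least one false positive across independent tests at level alpha.
fwer = 1 - (1 - alpha) ** n_tests
# Bonferroni: test each hypothesis at alpha / n_tests instead.
bonferroni_alpha = alpha / n_tests
print(f"Analytic family-wise error rate: {fwer:.4f}")
print(f"Bonferroni per-test threshold:   {bonferroni_alpha:.4f}")

random.seed(1)
n_sim = 20000
hits_raw = 0
hits_bonf = 0
for _ in range(n_sim):
    # Under a true null hypothesis, a p-value is Uniform(0, 1).
    smallest = min(random.random() for _ in range(n_tests))
    hits_raw += smallest < alpha
    hits_bonf += smallest < bonferroni_alpha

rate_raw = hits_raw / n_sim
rate_bonf = hits_bonf / n_sim
print(f"Simulated rate, uncorrected: {rate_raw:.4f}")
print(f"Simulated rate, Bonferroni:  {rate_bonf:.4f}")
```

Without correction, roughly two runs in three report a "significant" best configuration even though nothing real is there; the Bonferroni threshold brings that back to about 5%.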

Effect size (Cohen's d = (μ₁-μ₂)/σ, where σ is the pooled standard deviation) tells you whether a difference matters in practice. A result with p=0.0001 and d=0.02 is statistically significant but practically meaningless.
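To see how sample size alone can drive significance, the sketch below fixes a tiny effect (d = 0.02) and computes an approximate p-value at several sample sizes. It assumes two equal-sized groups and a two-sample z-test, for which z = d·√(n/2); the helper names and toy data are hypothetical.

```python
import math
import statistics

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / (n_a + n_b - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var)

def two_sided_p_for_d(d, n_per_group):
    """Approximate two-sided p-value of a two-sample z-test for effect size d."""
    z = d * math.sqrt(n_per_group / 2)
    return 2 * statistics.NormalDist().cdf(-abs(z))

# Toy data: group means differ by one unit.
d_toy = cohens_d([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(f"Cohen's d on toy data: {d_toy:.3f}")

# Same negligible effect, growing sample size.
p_by_n = {n: two_sided_p_for_d(0.02, n) for n in (100, 10_000, 100_000)}
for n, p in p_by_n.items():
    print(f"d=0.02, n={n:>7} per group: p={p:.2g}")
```

The effect size never changes, yet the p-value moves from clearly non-significant to far below 0.05 as n grows, which is why both numbers should be reported together.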
