ML Learning Hub
Unsupervised · Intermediate

Anomaly & Outlier Detection

Finding the one-in-a-thousand data point that doesn't belong

Statistical (Z-Score, IQR fences) and algorithmic (Isolation Forest, LOF, One-Class SVM) approaches to finding rare abnormal observations — fraud detection, manufacturing defects, network intrusion.

35 min
8 diagrams
7 Concepts Covered

Prerequisites

Probability & Statistics
Model Evaluation

Concepts Covered

Z-Score, IQR, Isolation Forest, LOF, One-Class SVM, Contamination, AUC-PR

Key Formulas

Z-Score

Standard deviations from the mean: z = (x − μ) / σ; |z| > 3 is conventionally anomalous

IQR Fence

Tukey fences: points outside [Q1 − 1.5×IQR, Q3 + 1.5×IQR] are outliers (IQR = Q3 − Q1)

Isolation Score

Isolation Forest: anomalies have shorter average path lengths h(x)

LOF Score

Local Outlier Factor: ratio of the neighbours' average local density to the point's own local density; values well above 1 indicate an outlier
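
The Isolation Score and LOF Score entries can be written out as below. This is a sketch using the standard definitions from the original Isolation Forest and LOF papers; the symbols n (subsample size), k (number of neighbours), N_k(x) (the k nearest neighbours of x) and d (the distance function) are notation assumed here, not taken from this page.

```latex
% Isolation Forest: anomaly score of x for a subsample of size n.
% E[h(x)] is the path length of x averaged over all trees;
% c(n) is the average path length of an unsuccessful search in a
% binary search tree with n nodes, used as a normaliser.
s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}, \qquad
c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad
H(i) \approx \ln(i) + 0.5772

% Local Outlier Factor with k neighbours.
\mathrm{reach\text{-}dist}_k(x, o) = \max\{\, k\text{-dist}(o),\; d(x, o) \,\}

\mathrm{lrd}_k(x) = \left( \frac{1}{|N_k(x)|} \sum_{o \in N_k(x)} \mathrm{reach\text{-}dist}_k(x, o) \right)^{-1}

\mathrm{LOF}_k(x) = \frac{1}{|N_k(x)|} \sum_{o \in N_k(x)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(x)}
```

With these definitions, s close to 1 marks an anomaly and s well below 0.5 marks a normal point, while LOF near 1 means a point is about as dense as its neighbours. Note that scikit-learn's `score_samples` for Isolation Forest returns the opposite of s, so lower values are more anomalous.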

🎯

Why Anomaly Detection Matters

motivation

Credit card fraud is estimated to cost $32 billion annually. Network intrusions cause damage estimated in the trillions, and industrial equipment failures cost an estimated $50 billion per year. Anomaly detection is a critical first line of defense in all of these systems. The core challenge: labeled examples of anomalies are scarce (they are rare by definition), so most anomaly detection is unsupervised. You learn only what 'normal' looks like, then flag deviations.

In medical diagnosis, a false negative (missing cancer) is catastrophic; in fraud detection, false positives (blocking real customers) destroy revenue. Choosing the right threshold is a business decision.

💡

The Statistical Viewpoint

intuition

The simplest intuition: normal data concentrates in high-density regions, while anomalies live in low-density regions. Z-Score flags points more than k standard deviations from the mean, but it assumes roughly Gaussian data, and the mean and standard deviation are themselves distorted by the very outliers being hunted. IQR fences are non-parametric: they flag points more than 1.5×IQR below Q1 or above Q3, which makes them robust to skewed data and extreme values. Both are univariate: they check each feature independently and miss multivariate anomalies (a temperature of 20°C is normal; a pressure of 5 bar is normal; but temperature=20 AND pressure=5 together may be anomalous).
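
The two univariate rules can be compared on a tiny sample. The sketch below is illustrative only: the helper names `zscore_outliers` and `iqr_outliers` and the sensor readings are made up for this example. It also shows a known weakness of the Z-Score on small samples: a single spike inflates the standard deviation so much that its own z-score stays below 3, while the IQR fences still catch it.

```python
import numpy as np

def zscore_outliers(x, k=3.0):
    """Flag values more than k standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > k

def iqr_outliers(x, whisker=1.5):
    """Flag values outside the Tukey fences [Q1 - w*IQR, Q3 + w*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - whisker * iqr) | (x > q3 + whisker * iqr)

# Hypothetical sensor readings: nine ordinary values and one spike.
readings = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7, 10.0, 25.0])

print("z-score flags:", readings[zscore_outliers(readings)])  # nothing flagged
print("IQR flags:    ", readings[iqr_outliers(readings)])     # the 25.0 spike
```

With a population standard deviation, the largest possible |z| in a sample of n points is sqrt(n − 1), so a |z| > 3 rule can never fire on ten or fewer points. Robust alternatives (IQR fences, or z-scores built from the median) avoid this masking effect.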

⚖️

Statistical vs Algorithmic Methods

comparison

Z-Score and IQR are fast and interpretable, but they treat each feature independently, and Z-Score additionally assumes roughly Gaussian data. Isolation Forest builds random trees and measures how quickly each point can be isolated: anomalies isolate fast because they sit in sparse regions. Local Outlier Factor (LOF) compares each point's local density to its neighbors' density: if your neighbors are much denser than you, you're an outlier. One-Class SVM learns a boundary around the normal points in kernel feature space (the closely related SVDD formulation fits a minimal enclosing hypersphere). Autoencoder anomaly detection trains a neural network to reconstruct normal data, so a high reconstruction error signals an anomaly.

Isolation Forest scales to millions of points and handles high-dimensional data well. LOF is better for clustered data with varying densities. Autoencoders excel at anomaly detection in images and time series.
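
The varying-density case can be made concrete with a small synthetic example. This is a sketch under made-up assumptions: the two clusters, the `probe` point and the variable names are hypothetical. A point sitting just outside a very tight cluster looks unremarkable to a global per-feature Z-Score, because the wide second cluster dominates the mean and standard deviation, but LOF compares it only with its tight neighbours and scores it far above 1.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.1, size=(200, 2))    # tight cluster
sparse = rng.normal(loc=10.0, scale=2.0, size=(200, 2))  # wide cluster
probe = np.array([[1.0, 1.0]])                           # near, but outside, the tight cluster
X = np.vstack([dense, sparse, probe])

# Global view: per-feature z-score of the probe (last row).
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
probe_max_z = z[-1].max()

# Local view: LOF score of the probe (scikit-learn stores the negated LOF).
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
lof_scores = -lof.negative_outlier_factor_
probe_lof = lof_scores[-1]

print(f"probe max |z|: {probe_max_z:.2f}")
print(f"probe LOF:     {probe_lof:.2f}")
print(f"median LOF of tight cluster: {np.median(lof_scores[:200]):.2f}")
```

The choice of `n_neighbors` matters: it should be large enough to smooth out noise but smaller than the smallest cluster you want to treat as normal.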

⚙️

Isolation Forest Algorithm

algorithm
1. Build an ensemble of isolation trees (random binary trees), each on a random subsample of the data
2. At each node: pick a feature at random, then a random split value between that feature's minimum and maximum
3. Recurse until each point is isolated (alone in a leaf) or a depth limit is reached
4. Anomaly score is derived from the average path length h(x) across all trees
5. Short path → point isolated quickly → anomaly
6. Normal points need more splits → longer average path
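
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the helper names `path_length` and `mean_path_length` are hypothetical, each tree is grown lazily along the query point's own branch rather than built in full, and the c(n) adjustment that the full algorithm adds at depth-limited leaves is omitted.

```python
import numpy as np

def path_length(x, X, rng, depth, max_depth):
    """Number of random splits needed to isolate x within the sample X."""
    if len(X) <= 1 or depth >= max_depth:
        return depth
    f = rng.integers(X.shape[1])              # step 2: random feature
    lo, hi = X[:, f].min(), X[:, f].max()
    if lo == hi:                              # cannot split a constant feature
        return depth
    split = rng.uniform(lo, hi)               # step 2: random split value
    left = X[:, f] < split
    # Step 3: follow only the branch that x falls into.
    branch = X[left] if x[f] < split else X[~left]
    return path_length(x, branch, rng, depth + 1, max_depth)

def mean_path_length(x, X, n_trees=100, sample_size=256, seed=0):
    """Step 4: average the path length of x over many random trees."""
    rng = np.random.default_rng(seed)
    n = min(sample_size, len(X))
    max_depth = int(np.ceil(np.log2(n)))      # usual depth limit
    lengths = []
    for _ in range(n_trees):
        idx = rng.choice(len(X), size=n, replace=False)  # step 1: subsample
        lengths.append(path_length(x, X[idx], rng, 0, max_depth))
    return float(np.mean(lengths))

# Hypothetical data: a 2-D Gaussian blob, one central point, one far-away point.
data = np.random.default_rng(1).normal(size=(500, 2))
inlier_len = mean_path_length(np.array([0.0, 0.0]), data)
outlier_len = mean_path_length(np.array([6.0, 6.0]), data)
print(f"inlier path length:  {inlier_len:.2f}")
print(f"outlier path length: {outlier_len:.2f}")   # shorter: steps 5 and 6
```

The far-away point lands in an almost empty branch after only a few splits, which is exactly the signal the forest averages over.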

</>

scikit-learn Anomaly Detection

code
python
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score
from scipy import stats
import numpy as np

# ── Sample data (5% anomalies) ─────────────────────────────────────────
rng = np.random.default_rng(42)                  # seeded so the run is reproducible
X_normal, _ = make_classification(n_samples=475, n_features=10, random_state=42)
X_anom = rng.standard_normal((25, 10)) * 4       # 25 widely scattered points
X = np.vstack([X_normal, X_anom])
y_true = np.array([0] * 475 + [1] * 25)          # 0=normal, 1=anomaly

X_scaled = StandardScaler().fit_transform(X)

# ── Isolation Forest ───────────────────────────────────────────────
iso = IsolationForest(
    n_estimators=200,
    contamination=0.05,   # expected fraction of outliers
    random_state=42
)
labels_iso = iso.fit_predict(X_scaled)    # 1=inlier, -1=outlier
scores_iso = iso.score_samples(X_scaled)  # lower = more anomalous

# ── Local Outlier Factor ────────────────────────────────────────────
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels_lof = lof.fit_predict(X_scaled)

# ── One-Class SVM ───────────────────────────────────────────────────
# nu is an upper bound on the fraction of training points treated as outliers
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
labels_svm = ocsvm.fit_predict(X_scaled)

# ── Z-Score (univariate, per-feature) ──────────────────────────────
z_scores = np.abs(stats.zscore(X))
outlier_mask = (z_scores > 3).any(axis=1)  # flagged if any feature has |z| > 3

# ── Evaluate with known labels ─────────────────────────────────────
# Convert: 1=inlier → 0=normal,  -1=outlier → 1=anomaly
y_pred = (labels_iso == -1).astype(int)
# score_samples is lower for anomalies, so negate it for ranking metrics
print(f"AUC-ROC: {roc_auc_score(y_true, -scores_iso):.3f}")
print(f"AP:      {average_precision_score(y_true, -scores_iso):.3f}")
print(f"F1:      {f1_score(y_true, y_pred):.3f}")

⚠️

Anomaly Detection Pitfalls

pitfall

The contamination parameter in Isolation Forest and LOF directly controls the decision threshold. If you set contamination=0.05 but your actual anomaly rate is 0.1%, the detector still flags roughly 5% of the training points, so nearly everything it flags is normal. Always calibrate this with domain knowledge or holdout labeled data. Second pitfall: high dimensionality degrades distance- and density-based methods such as LOF (curse of dimensionality), and per-feature Z-Score checks raise more false alarms as the number of features grows. Consider dimensionality reduction such as PCA when the feature count is large (as a rule of thumb, more than about 20 features). Third: concept drift, where 'normal' changes over time. Retrain periodically or use online anomaly detection for streaming data.
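
The first pitfall is easy to reproduce. In the sketch below, the synthetic data and the `assumed_rate` value are hypothetical. The true anomaly rate is 1%, yet a model fitted with contamination=0.05 flags about 5% of the points. One alternative, shown here as an assumption rather than a recommendation from this page, is to threshold the raw scores at a quantile that reflects the rate you actually expect.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: 990 normal points and 10 shifted anomalies (1% rate).
rng = np.random.default_rng(0)
X_normal = rng.normal(size=(990, 5))
X_anom = rng.normal(loc=6.0, size=(10, 5))
X = np.vstack([X_normal, X_anom])

# Miscalibrated: contamination says 5%, so about 5% of points get flagged.
iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
flagged_fraction = np.mean(iso.predict(X) == -1)

# Alternative: pick the threshold from the scores using the expected rate.
assumed_rate = 0.01                       # hypothetical domain estimate
scores = iso.score_samples(X)             # lower = more anomalous
threshold = np.quantile(scores, assumed_rate)
flagged_by_quantile = np.mean(scores <= threshold)

print(f"flagged with contamination=0.05: {flagged_fraction:.1%}")
print(f"flagged with 1% score quantile:  {flagged_by_quantile:.1%}")
```

Because contamination only moves the threshold, the underlying scores are identical in both cases; only the number of points labelled -1 changes.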

Never evaluate anomaly detection with accuracy — class imbalance makes it meaningless. Use Precision@k, AUC-PR (area under precision-recall curve), or F1 at the chosen threshold.
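
A small constructed example shows why accuracy misleads and what the ranking metrics report instead. The labels and scores are hand-made for illustration, and `precision_at_k` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

def precision_at_k(y_true, scores, k):
    """Fraction of true anomalies among the k highest-scoring points."""
    top_k = np.argsort(scores)[::-1][:k]
    return float(np.mean(y_true[top_k]))

# Hypothetical labels: 10 anomalies among 1,000 points (1% anomaly rate).
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A detector that calls everything normal still looks excellent on accuracy.
all_normal = np.zeros(1000, dtype=int)
acc = accuracy_score(y_true, all_normal)

# Hand-made anomaly scores (higher = more anomalous):
# 5 anomalies ranked first, then 5 false alarms, then the other 5 anomalies.
scores = np.full(1000, 0.1)
scores[:5] = 0.9        # anomalies the detector is sure about
scores[10:15] = 0.8     # normal points scored too high
scores[5:10] = 0.7      # anomalies ranked below the false alarms

p_at_5 = precision_at_k(y_true, scores, k=5)
p_at_10 = precision_at_k(y_true, scores, k=10)
ap = average_precision_score(y_true, scores)   # summarises the PR curve

print(f"accuracy of 'all normal': {acc:.2f}")
print(f"precision@5:  {p_at_5:.2f}")
print(f"precision@10: {p_at_10:.2f}")
print(f"average precision: {ap:.3f}")
```

Precision@k is useful when a team can only review a fixed number of alerts, while average precision summarises ranking quality across all thresholds.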
