Anomaly & Outlier Detection
“Finding the one-in-a-thousand data point that doesn't belong”
Statistical (Z-Score, IQR fences) and algorithmic (Isolation Forest, LOF, One-Class SVM) approaches to finding rare abnormal observations — fraud detection, manufacturing defects, network intrusion.
Key Formulas
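Z-Score
$z = \frac{x - \mu}{\sigma}$
Standard deviations from the mean; |z| > 3 is conventionally anomalous
IQR Fence
$[\,Q_1 - 1.5 \cdot \mathrm{IQR},\; Q_3 + 1.5 \cdot \mathrm{IQR}\,]$
Tukey fences; points outside this interval are outliers (IQR = Q3 - Q1)
Isolation Score
$s(x, n) = 2^{-E[h(x)] / c(n)}$
Isolation Forest: anomalies have shorter average path lengths h(x), so their score approaches 1; c(n) normalises by the average path length expected for n points
LOF Score
$\mathrm{LOF}_k(x) = \frac{1}{|N_k(x)|} \sum_{o \in N_k(x)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(x)}$
Local Outlier Factor: ratio of the neighbours' local density to the point's own local density; values well above 1 indicate an outlier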
Why Anomaly Detection Matters
Credit card fraud costs $32 billion annually. Network intrusion attacks cause trillions in damage. Industrial equipment failures cost $50 billion per year. Anomaly detection is the critical first line of defense in all these systems. The core challenge: you rarely have labeled examples of anomalies (they're rare by definition), so most anomaly detection is unsupervised — you only learn what 'normal' looks like, then flag deviations.
In medical diagnosis, a false negative (missing cancer) is catastrophic; in fraud detection, false positives (blocking real customers) destroy revenue. Choosing the right threshold is a business decision.
The Statistical Viewpoint
The simplest intuition: normal data concentrates in high-density regions. Anomalies live in low-density regions. Z-Score flags points more than k standard deviations from the mean — but assumes Gaussian distributions. IQR fences are non-parametric: they flag points outside 1.5×IQR from the quartiles, making them robust to non-Gaussian data. Both are univariate — they check each feature independently and miss multivariate anomalies (a temperature of 20°C is normal; a pressure of 5 bar is normal; but temperature=20 AND pressure=5 together may be anomalous).
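As a minimal sketch of the two univariate rules described above, the snippet below applies a Z-Score threshold and Tukey fences to ten made-up sensor readings. The data and the helper names `zscore_outliers` and `iqr_outliers` are assumptions for illustration, not part of any library.

```python
import numpy as np

# Hypothetical sensor readings: nine typical values and one obvious spike.
readings = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 9.7, 10.4, 10.0, 25.0])

def zscore_outliers(x, k=3.0):
    """Flag points more than k standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > k

def iqr_outliers(x, factor=1.5):
    """Flag points outside the Tukey fences [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - factor * iqr) | (x > q3 + factor * iqr)

z_mask = zscore_outliers(readings)
iqr_mask = iqr_outliers(readings)

print("Z-Score flags:", readings[z_mask])
print("IQR flags:    ", readings[iqr_mask])
```

In this small sample the spike inflates both the mean and the standard deviation, so its own z-score stays just under 3 and the Z-Score rule flags nothing, while the quartile-based fences still catch it. That is one practical face of the robustness difference between the two rules.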
Statistical vs Algorithmic Methods
Z-Score and IQR are fast and interpretable, but both treat each feature independently, and Z-Score additionally assumes roughly Gaussian data. Isolation Forest builds random trees and measures how quickly each point can be isolated: anomalies isolate fast because they sit in sparse regions. Local Outlier Factor (LOF) compares each point's local density to its neighbors' density: if your neighbors are much denser than you, you're an outlier. One-Class SVM learns a boundary around the normal points by separating them from the origin in kernel feature space with maximum margin; with an RBF kernel this is equivalent to finding a minimal enclosing hypersphere. Autoencoder anomaly detection trains a neural network to reconstruct normal data, so a high reconstruction error signals an anomaly.
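To make the LOF density comparison concrete, here is a simplified NumPy sketch of the score. It assumes no duplicate points and takes exactly k neighbours per point (ties are ignored). The function name `lof_scores` and the toy data are illustrative assumptions; this is not scikit-learn's implementation.

```python
import numpy as np

def lof_scores(X, k=3):
    """Simplified Local Outlier Factor: values well above 1 suggest an outlier."""
    # Pairwise Euclidean distances between all points.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbour

    neighbours = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest points
    k_distance = np.sort(dist, axis=1)[:, k - 1]   # distance to the k-th nearest point

    # Reachability distance from each point to each of its neighbours:
    # max(k-distance of the neighbour, actual distance to the neighbour).
    neighbour_dist = np.take_along_axis(dist, neighbours, axis=1)
    reach_dist = np.maximum(k_distance[neighbours], neighbour_dist)

    # Local reachability density: inverse of the mean reachability distance.
    lrd = 1.0 / reach_dist.mean(axis=1)

    # LOF: mean density of the neighbours divided by the point's own density.
    return lrd[neighbours].mean(axis=1) / lrd

# Toy data: a tight cluster of five points plus one distant point.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [5, 5]], dtype=float)
scores = lof_scores(X, k=3)
print(np.round(scores, 2))
```

The cluster points score close to 1 because their density matches that of their neighbours, while the distant point scores far higher because its neighbours are much denser than it is.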
Isolation Forest scales to millions of points and handles high-dimensional data well. LOF is better for clustered data with varying densities. Autoencoders excel at anomaly detection in images and time series.
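One-Class SVM is usually used in the "learn normal, then flag deviations" style. The sketch below fits scikit-learn's `OneClassSVM` on synthetic normal-only data and scores two new points. The data and the choice of `nu=0.05` are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(300, 2))  # synthetic "normal" data only
X_new = np.array([[0.1, -0.2],    # close to the training cloud
                  [8.0, 8.0]])    # far away from it

scaler = StandardScaler().fit(X_train)

# nu upper-bounds the fraction of training points treated as outliers.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(scaler.transform(X_train))

labels = ocsvm.predict(scaler.transform(X_new))            # +1 = inlier, -1 = outlier
scores = ocsvm.decision_function(scaler.transform(X_new))  # lower = more anomalous
print(labels, np.round(scores, 3))
```

Scaling matters here because the RBF kernel is distance-based, so a feature with a much larger range would otherwise dominate the boundary.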
Isolation Forest Algorithm
Build an ensemble of isolation trees (random binary trees)
For each tree: randomly select a feature, then a random split value
Recurse until each point is isolated (alone in a leaf) or a depth limit is reached
Anomaly score is computed from the average path length across all trees
Short path → point isolated quickly → anomaly
Normal points need more splits → longer average path
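The steps above can be sketched in plain Python. This is a deliberately simplified illustration: it follows only the branch containing the query point instead of storing whole trees, and it skips the subsampling and path-length normalisation that practical implementations typically add. The names `path_length` and `average_path_length` are hypothetical.

```python
import random

def path_length(point, data, depth=0, max_depth=10, rng=random):
    """Count the random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))            # pick a random feature
    lo = min(row[feature] for row in data)
    hi = max(row[feature] for row in data)
    if lo == hi:                                   # nothing left to split on
        return depth
    split = rng.uniform(lo, hi)                    # pick a random split value
    # Keep only the side of the split that contains `point`.
    if point[feature] < split:
        side = [row for row in data if row[feature] < split]
    else:
        side = [row for row in data if row[feature] >= split]
    return path_length(point, side, depth + 1, max_depth, rng)

def average_path_length(point, data, n_trees=200, seed=0):
    """Average the path length over many random trees."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng=rng) for _ in range(n_trees)) / n_trees

# Toy data: a 2-D Gaussian cluster plus one distant point.
data_rng = random.Random(42)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(63)]
outlier = (10.0, 10.0)
data = cluster + [outlier]

outlier_depth = average_path_length(outlier, data)
typical_depth = average_path_length(cluster[0], data)
print(f"outlier: {outlier_depth:.2f} splits, cluster point: {typical_depth:.2f} splits")
```

The distant point is usually cut off within the first couple of splits, whereas a point inside the cluster needs many more, which is the signal the forest turns into an anomaly score.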
scikit-learn Anomaly Detection
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# ── Sample data (5% anomalies) ──────────────────────────────────────
rng = np.random.default_rng(42)  # seeded so the outliers are reproducible
X_normal, _ = make_classification(n_samples=475, n_features=10, random_state=42)
X_anom = rng.standard_normal((25, 10)) * 4  # 25 clear outliers
X = np.vstack([X_normal, X_anom])
y_true = np.array([0] * 475 + [1] * 25)  # 0=normal, 1=anomaly
X_scaled = StandardScaler().fit_transform(X)

# ── Isolation Forest ────────────────────────────────────────────────
iso = IsolationForest(
    n_estimators=200,
    contamination=0.05,  # expected fraction of outliers
    random_state=42,
)
labels_iso = iso.fit_predict(X_scaled)    # 1=inlier, -1=outlier
scores_iso = iso.score_samples(X_scaled)  # lower = more anomalous

# ── Local Outlier Factor ────────────────────────────────────────────
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels_lof = lof.fit_predict(X_scaled)    # 1=inlier, -1=outlier

# ── One-Class SVM ───────────────────────────────────────────────────
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
labels_svm = ocsvm.fit_predict(X_scaled)  # 1=inlier, -1=outlier

# ── Z-Score (univariate, per-feature) ───────────────────────────────
z_scores = np.abs(stats.zscore(X))
outlier_mask = (z_scores > 3).any(axis=1)  # True if any feature exceeds |z| > 3

# ── Evaluate with known labels ──────────────────────────────────────
# Convert: 1=inlier → 0=normal, -1=outlier → 1=anomaly
y_pred = (labels_iso == -1).astype(int)
print(f"Flagged by Isolation Forest: {y_pred.sum()} of {len(y_pred)}")
# score_samples is lower for anomalies, so negate it to rank anomalies highest
print(f"AUC-ROC: {roc_auc_score(y_true, -scores_iso):.3f}")
print(f"AP: {average_precision_score(y_true, -scores_iso):.3f}")
Anomaly Detection Pitfalls
The contamination parameter in Isolation Forest and LOF directly controls the decision threshold. If you set contamination=0.05 but your actual anomaly rate is 0.1%, you'll mislabel many normal points as anomalies. Always calibrate this with domain knowledge or holdout labeled data. Second pitfall: high dimensionality degrades distance-based methods such as LOF (the curse of dimensionality), and per-feature Z-Score checks raise more false alarms as the number of features grows. Consider dimensionality reduction such as PCA first when you have more than about 20 features. Third: concept drift, where 'normal' changes over time. Retrain periodically or use online anomaly detection for streaming data.
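The contamination pitfall can be illustrated with a few lines of NumPy. The sketch assumes the threshold is placed at the contamination quantile of the anomaly scores, which is the behaviour described above. The scores are synthetic and the helper `flag` is hypothetical, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 1000

# Synthetic anomaly scores (lower = more anomalous) with one real anomaly,
# i.e. a true anomaly rate of 0.1%.
scores = rng.normal(loc=0.0, scale=1.0, size=n_points)
scores[0] = -10.0

def flag(scores, contamination):
    """Flag the lowest-scoring `contamination` fraction of points."""
    threshold = np.quantile(scores, contamination)
    return scores < threshold

flagged_5pct = flag(scores, 0.05)     # contamination set far too high
flagged_01pct = flag(scores, 0.001)   # contamination matching the true rate

print("contamination=0.05  flags", int(flagged_5pct.sum()), "points")
print("contamination=0.001 flags", int(flagged_01pct.sum()), "points")
```

With contamination set to 0.05 the rule flags about 5% of the data no matter how many real anomalies exist, so nearly all of those flags are false positives here. Matching the setting to the true rate leaves only the real anomaly.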
Never evaluate anomaly detection with accuracy — class imbalance makes it meaningless. Use Precision@k, AUC-PR (area under precision-recall curve), or F1 at the chosen threshold.
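A small sketch shows why accuracy misleads and what Precision@k measures instead. The labels and scores are made up, and `precision_at_k` is a hypothetical helper written for this example.

```python
import numpy as np

def precision_at_k(y_true, anomaly_scores, k):
    """Fraction of true anomalies among the k highest-scoring points."""
    top_k = np.argsort(anomaly_scores)[::-1][:k]
    return y_true[top_k].mean()

# 95 normal points and 5 anomalies (1 = anomaly).
y_true = np.array([0] * 95 + [1] * 5)

# Made-up scores, higher = more anomalous: three anomalies rank at the top,
# two are buried among the normal points.
anomaly_scores = np.concatenate([np.linspace(0.0, 0.5, 95),
                                 [0.9, 0.8, 0.7, 0.3, 0.2]])

# A useless detector that labels everything "normal" still looks accurate.
all_normal_accuracy = (np.zeros_like(y_true) == y_true).mean()
p_at_5 = precision_at_k(y_true, anomaly_scores, k=5)

print(f"accuracy of 'always normal': {all_normal_accuracy:.2f}")
print(f"precision@5 of the scorer:   {p_at_5:.2f}")
```

The always-normal baseline reaches 95% accuracy while catching nothing, whereas Precision@5 reports directly how many of the five most suspicious points are real anomalies.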