Clustering: K-Means & DBSCAN
“Finding hidden groups in unlabeled data — no teacher required”
Unsupervised grouping of unlabeled data — K-Means (Lloyd's algorithm, inertia, K selection via elbow/silhouette), DBSCAN (core/border/noise points, ε-neighborhood, arbitrary shapes), and hierarchical clustering.
Prerequisites
Concepts Covered
∑Key Formulas
K-Means Objective
Minimize total within-cluster sum of squared distances to centroids
DBSCAN Core Point
A point is core if it has ≥ MinPts neighbours within radius ε
Silhouette Score
a = intra-cluster dist, b = nearest-cluster dist — ranges [-1, 1]
Inertia
Sum of squared distances from each point to its nearest centroid
▶Interactive Simulation
Why Cluster Unlabeled Data?
Most real-world data arrives without labels — customer transactions, genomic sequences, news articles, sensor readings. Clustering lets you discover natural structure that no human has annotated. Spotify segments listeners to build Discover Weekly. Insurance companies cluster claims to detect fraud rings. Biologists cluster gene expression profiles to find disease subtypes. The goal is always the same: find groups where members are more similar to each other than to members of other groups.
K-Means was published in 1967 and is still one of the most-used algorithms in production ML systems. Its simplicity is its strength.
K-Means: The Iterative Negotiation
Imagine dropping K flags randomly on a map. Every person on the map walks to their nearest flag. Each flag then moves to the average position of everyone who walked to it. Repeat until nobody changes their flag. This Lloyd's algorithm converges to a local minimum — it's not guaranteed to find the global optimum, which is why K-Means++ initialization (spreading initial centroids far apart) dramatically improves results in practice.
K-Means always converges but may converge to a poor local minimum. Always run with multiple random initializations (n_init=10 in scikit-learn) and keep the best.
The K-Means Objective
K-Means minimizes the sum of squared Euclidean distances from each point to its assigned centroid. The E-step assigns each point to its nearest centroid; the M-step moves each centroid to the mean of its assigned points. Each step monotonically decreases the objective, guaranteeing convergence.
Choosing K: The Elbow Method & Silhouette
K is a hyperparameter — you must choose it. The Elbow Method plots inertia vs K: inertia always decreases, but the rate of decrease elbows at the 'right' K. More rigorously, the Silhouette Score measures how well each point fits its own cluster vs the next nearest cluster. s=1 means perfect separation, s=0 means overlapping clusters, s<0 means the point is in the wrong cluster. Average silhouette over all points gives a scalar quality metric.
Try K = 2 to 10, plot inertia — look for the 'elbow'
Compute silhouette score for each K — peak = best K
Use Gap Statistic (compare to random data) for rigorously selecting K
Domain knowledge often overrides: if you know there are 5 customer segments, use K=5
K-Means vs DBSCAN vs Hierarchical
K-Means assumes spherical, equal-size clusters and requires K upfront. It fails badly on crescent-shaped or varying-density data. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds arbitrarily-shaped clusters and automatically identifies outliers — points that are not in any cluster. It requires only ε (neighborhood radius) and MinPts (minimum points to form a core). Hierarchical clustering builds a dendrogram — a tree of merges — that you cut at any level to get K clusters, giving you the full cluster structure in one pass.
Use DBSCAN when you expect noise/outliers or non-spherical clusters. Use K-Means when clusters are roughly spherical and you know K. Use Hierarchical when you want to explore multiple granularities simultaneously.
DBSCAN Step-by-Step
For each unvisited point p, find all points within radius ε (ε-neighborhood)
If |N_ε(p)| ≥ MinPts → p is a core point; start a new cluster
Expand cluster: add all density-reachable points (core point neighbors of core points)
If |N_ε(p)| < MinPts and p is reachable from a core → p is a border point
If p is not reachable from any core → p is noise (outlier, labeled -1)
Repeat until all points visited — O(n log n) with spatial index
scikit-learn Implementation
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering from sklearn.metrics import silhouette_score from sklearn.preprocessing import StandardScaler from sklearn.datasets import make_blobs import numpy as np class="tok-comment"># ── Sample data ──────────────────────────────────────────────────────── X, _ = make_blobs(n_samples=class="tok-num">300, centers=class="tok-num">4, cluster_std=class="tok-num">0.8, random_state=class="tok-num">42) class="tok-comment"># ── K-Means with K selection via silhouette ──────────────────────── X_scaled = StandardScaler().fit_transform(X) best_k, best_score = class="tok-num">2, -class="tok-num">1 for k in range(class="tok-num">2, class="tok-num">11): km = KMeans(n_clusters=k, init=class="tok-str">'k-means++', n_init=class="tok-num">10, random_state=class="tok-num">42) labels = km.fit_predict(X_scaled) score = silhouette_score(X_scaled, labels) if score > best_score: best_k, best_score = k, score km_final = KMeans(n_clusters=best_k, init=class="tok-str">'k-means++', n_init=class="tok-num">10, random_state=class="tok-num">42) labels_km = km_final.fit_predict(X_scaled) print(fclass="tok-str">"K={best_k}, silhouette={best_score:.3f}, inertia={km_final.inertia_:.1f}") class="tok-comment"># ── DBSCAN — auto-detects clusters and outliers ──────────────────── db = DBSCAN(eps=class="tok-num">0.5, min_samples=class="tok-num">5) labels_db = db.fit_predict(X_scaled) n_clusters = len(set(labels_db)) - (class="tok-num">1 if -class="tok-num">1 in labels_db else class="tok-num">0) n_noise = list(labels_db).count(-class="tok-num">1) print(fclass="tok-str">"DBSCAN: {n_clusters} clusters, {n_noise} noise points") class="tok-comment"># ── Agglomerative — no K needed upfront (use dendrogram) ────────── from scipy.cluster.hierarchy import dendrogram, linkage Z = linkage(X_scaled, method=class="tok-str">'ward') class="tok-comment"># Cut at K=class="tok-num">3 agg = AgglomerativeClustering(n_clusters=class="tok-num">3, linkage=class="tok-str">'ward') labels_agg = agg.fit_predict(X_scaled)
Common Clustering Pitfalls
Clustering without scaling is the #1 mistake: features with large scales dominate the distance metric, making K-Means find clusters based on the highest-magnitude feature regardless of others. Always scale before clustering. Second pitfall: treating cluster labels as ground truth — clusters are hypotheses, not facts. Validate with domain experts and external metrics (if labels exist, use Adjusted Rand Index). Third: assuming clusters are meaningful — always sanity-check by looking at cluster profiles.
Cluster labels are arbitrary integers that change between runs. Never hardcode 'cluster 0 = high-value customers' — always characterize clusters by their feature distributions.
?Knowledge Check
Progress is saved in your browser — no account needed.
Need an AI engineer or data scientist?
I build custom ML models, AI agents, computer vision, and automation — from idea to production.