ML Learning Hub
Unsupervisedintermediate

Clustering: K-Means & DBSCAN

Finding hidden groups in unlabeled data — no teacher required

Unsupervised grouping of unlabeled data — K-Means (Lloyd's algorithm, inertia, K selection via elbow/silhouette), DBSCAN (core/border/noise points, ε-neighborhood, arbitrary shapes), and hierarchical clustering.

40 min
12 diagrams
7 Concepts Covered

Prerequisites

Linear Algebra
Model Evaluation

Concepts Covered

K-MeansDBSCANSilhouette ScoreInertiaCore PointsElbow MethodDendrogram

Key Formulas

K-Means Objective

Minimize total within-cluster sum of squared distances to centroids

DBSCAN Core Point

A point is core if it has ≥ MinPts neighbours within radius ε

Silhouette Score

a = intra-cluster dist, b = nearest-cluster dist — ranges [-1, 1]

Inertia

Sum of squared distances from each point to its nearest centroid

Interactive Simulation

Loading visualization…
🎯

Why Cluster Unlabeled Data?

motivation

Most real-world data arrives without labels — customer transactions, genomic sequences, news articles, sensor readings. Clustering lets you discover natural structure that no human has annotated. Spotify segments listeners to build Discover Weekly. Insurance companies cluster claims to detect fraud rings. Biologists cluster gene expression profiles to find disease subtypes. The goal is always the same: find groups where members are more similar to each other than to members of other groups.

K-Means was published in 1967 and is still one of the most-used algorithms in production ML systems. Its simplicity is its strength.

💡

K-Means: The Iterative Negotiation

intuition

Imagine dropping K flags randomly on a map. Every person on the map walks to their nearest flag. Each flag then moves to the average position of everyone who walked to it. Repeat until nobody changes their flag. This Lloyd's algorithm converges to a local minimum — it's not guaranteed to find the global optimum, which is why K-Means++ initialization (spreading initial centroids far apart) dramatically improves results in practice.

K-Means always converges but may converge to a poor local minimum. Always run with multiple random initializations (n_init=10 in scikit-learn) and keep the best.

The K-Means Objective

math

K-Means minimizes the sum of squared Euclidean distances from each point to its assigned centroid. The E-step assigns each point to its nearest centroid; the M-step moves each centroid to the mean of its assigned points. Each step monotonically decreases the objective, guaranteeing convergence.

Within-cluster Sum of Squares (WCSS)
🔬

Choosing K: The Elbow Method & Silhouette

deepdive

K is a hyperparameter — you must choose it. The Elbow Method plots inertia vs K: inertia always decreases, but the rate of decrease elbows at the 'right' K. More rigorously, the Silhouette Score measures how well each point fits its own cluster vs the next nearest cluster. s=1 means perfect separation, s=0 means overlapping clusters, s<0 means the point is in the wrong cluster. Average silhouette over all points gives a scalar quality metric.

1

Try K = 2 to 10, plot inertia — look for the 'elbow'

2

Compute silhouette score for each K — peak = best K

3

Use Gap Statistic (compare to random data) for rigorously selecting K

4

Domain knowledge often overrides: if you know there are 5 customer segments, use K=5

⚖️

K-Means vs DBSCAN vs Hierarchical

comparison

K-Means assumes spherical, equal-size clusters and requires K upfront. It fails badly on crescent-shaped or varying-density data. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds arbitrarily-shaped clusters and automatically identifies outliers — points that are not in any cluster. It requires only ε (neighborhood radius) and MinPts (minimum points to form a core). Hierarchical clustering builds a dendrogram — a tree of merges — that you cut at any level to get K clusters, giving you the full cluster structure in one pass.

Use DBSCAN when you expect noise/outliers or non-spherical clusters. Use K-Means when clusters are roughly spherical and you know K. Use Hierarchical when you want to explore multiple granularities simultaneously.

⚙️

DBSCAN Step-by-Step

algorithm
1

For each unvisited point p, find all points within radius ε (ε-neighborhood)

2

If |N_ε(p)| ≥ MinPts → p is a core point; start a new cluster

3

Expand cluster: add all density-reachable points (core point neighbors of core points)

4

If |N_ε(p)| < MinPts and p is reachable from a core → p is a border point

5

If p is not reachable from any core → p is noise (outlier, labeled -1)

6

Repeat until all points visited — O(n log n) with spatial index

</>

scikit-learn Implementation

code
python37 lines
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
import numpy as np

class="tok-comment"># ── Sample data ────────────────────────────────────────────────────────
X, _ = make_blobs(n_samples=class="tok-num">300, centers=class="tok-num">4, cluster_std=class="tok-num">0.8, random_state=class="tok-num">42)

class="tok-comment"># ── K-Means with K selection via silhouette ────────────────────────
X_scaled = StandardScaler().fit_transform(X)

best_k, best_score = class="tok-num">2, -class="tok-num">1
for k in range(class="tok-num">2, class="tok-num">11):
    km = KMeans(n_clusters=k, init=class="tok-str">'k-means++', n_init=class="tok-num">10, random_state=class="tok-num">42)
    labels = km.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    if score > best_score:
        best_k, best_score = k, score

km_final = KMeans(n_clusters=best_k, init=class="tok-str">'k-means++', n_init=class="tok-num">10, random_state=class="tok-num">42)
labels_km = km_final.fit_predict(X_scaled)
print(fclass="tok-str">"K={best_k}, silhouette={best_score:.3f}, inertia={km_final.inertia_:.1f}")

class="tok-comment"># ── DBSCAN — auto-detects clusters and outliers ────────────────────
db = DBSCAN(eps=class="tok-num">0.5, min_samples=class="tok-num">5)
labels_db = db.fit_predict(X_scaled)
n_clusters = len(set(labels_db)) - (class="tok-num">1 if -class="tok-num">1 in labels_db else class="tok-num">0)
n_noise    = list(labels_db).count(-class="tok-num">1)
print(fclass="tok-str">"DBSCAN: {n_clusters} clusters, {n_noise} noise points")

class="tok-comment"># ── Agglomerative — no K needed upfront (use dendrogram) ──────────
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(X_scaled, method=class="tok-str">'ward')
class="tok-comment"># Cut at K=class="tok-num">3
agg = AgglomerativeClustering(n_clusters=class="tok-num">3, linkage=class="tok-str">'ward')
labels_agg = agg.fit_predict(X_scaled)
⚠️

Common Clustering Pitfalls

pitfall

Clustering without scaling is the #1 mistake: features with large scales dominate the distance metric, making K-Means find clusters based on the highest-magnitude feature regardless of others. Always scale before clustering. Second pitfall: treating cluster labels as ground truth — clusters are hypotheses, not facts. Validate with domain experts and external metrics (if labels exist, use Adjusted Rand Index). Third: assuming clusters are meaningful — always sanity-check by looking at cluster profiles.

Cluster labels are arbitrary integers that change between runs. Never hardcode 'cluster 0 = high-value customers' — always characterize clusters by their feature distributions.

?Knowledge Check

Progress is saved in your browser — no account needed.

Need an AI engineer or data scientist?

I build custom ML models, AI agents, computer vision, and automation — from idea to production.