ML Learning Hub
Unsupervised · intermediate

PCA & Dimensionality Reduction

Finding the directions of maximum variance: compressing data while keeping most of its information

Principal Component Analysis from scratch: eigendecomposition of the covariance matrix, variance explained, choosing k components, whitening, t-SNE/UMAP contrast, and practical applications in visualization and preprocessing.

45 min
10 diagrams
7 Concepts Covered

Prerequisites

Linear Algebra
Clustering

Concepts Covered

Eigendecomposition
Variance Explained
Covariance Matrix
Whitening
t-SNE
UMAP
Scree Plot

Key Formulas

Covariance Eigen-decomp

Σ = VΛVᵀ

Covariance matrix decomposes into eigenvectors V and eigenvalues Λ

Projection

Z = X̃ V_k

Project centered data onto top-k eigenvectors to get k-dim representation

Variance Explained

λₖ / (λ₁ + λ₂ + … + λₚ)

Fraction of total variance captured by principal component k

Reconstruction

X̂ = Z V_kᵀ + μ

Reconstruct approximate x from low-dim code z; the gap between X̂ and X measures information loss

🎯

The Curse of Dimensionality

motivation

With 1000 features, pairwise distances between points become nearly equal (the 'concentration of measure' phenomenon), so distance-based algorithms lose their discriminating power. High-dimensional data cannot be plotted directly, training gets slower, and models overfit more easily. PCA mitigates these problems by finding a low-dimensional linear subspace that captures most of the variance, which tends to suppress noise, decorrelates the features, and enables visualization. For strongly correlated data such as aligned images, a 10,000-pixel image can often be compressed to around 50 PCA components with less than 5% reconstruction error, though the exact figure depends on the dataset.

The famous 'face space' (eigenfaces) result: aligned human face images lie close to a subspace of roughly 50 dimensions within a 50,000-pixel space, and PCA finds this subspace.
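The distance-concentration claim is easy to check numerically. The sketch below is an illustration only: it assumes uniformly random points in a unit cube, and the helper name `relative_contrast` is made up for this example. It measures how much farther the farthest neighbour is than the nearest one, relative to the nearest.

```python
import numpy as np

def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """(max distance - min distance) / min distance, from one query point to all others."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_points, n_dims))   # assumption: i.i.d. uniform features
    query, others = X[0], X[1:]
    dists = np.linalg.norm(others - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

# The contrast shrinks as the number of dimensions grows
contrasts = {d: relative_contrast(500, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"dims={d:>4}  relative contrast={c:.2f}")
```

A contrast close to zero means the nearest and farthest neighbours are almost the same distance away, which is why nearest-neighbour style methods struggle in raw high-dimensional space.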

💡

The Geometrical View

intuition

Imagine a cloud of points that looks like a stretched ellipse. PCA finds the longest axis of the ellipse (PC1, the direction of maximum variance), then the next longest perpendicular axis (PC2), and so on. By projecting onto the first few principal components, you keep the most 'spread out' dimensions and discard the tightly bunched ones, which often (though not always) carry mostly noise. The principal components are the eigenvectors of the data covariance matrix, ordered by decreasing eigenvalue; each eigenvalue equals the variance along its direction.
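To make the ellipse picture concrete, here is a minimal sketch that builds a 2D cloud stretched along a known direction and checks that the top eigenvector of the covariance matrix recovers it. The 30° angle, the standard deviations, and the sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
angle = np.deg2rad(30)

# Axis-aligned cloud: std 3 along x (long axis), std 0.5 along y (short axis)
points = rng.normal(size=(2000, 2)) * np.array([3.0, 0.5])

# Rotate the cloud so its long axis points at 30 degrees
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
X = points @ R.T

# Covariance of the centered data and its eigendecomposition
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns eigenvalues in ascending order
pc1, pc2 = eigvecs[:, -1], eigvecs[:, 0]  # largest-variance direction first

# Eigenvector signs are arbitrary, so compare angles modulo 180 degrees
recovered_deg = np.degrees(np.arctan2(pc1[1], pc1[0])) % 180
print(f"PC1 angle: {recovered_deg:.1f} degrees")
print(f"variance along PC1: {eigvals[-1]:.2f}, along PC2: {eigvals[0]:.2f}")
```

The variances along PC1 and PC2 should land near the squared standard deviations used to generate the cloud (9 and 0.25), up to sampling noise.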

Eigendecomposition of the Covariance Matrix

math

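Center the data: X̃ = X - μ, where X is n×p (n samples, p features) and μ is the vector of column means. Compute the p×p covariance matrix Σ = (1/n)X̃ᵀX̃ (scikit-learn uses the unbiased factor 1/(n-1), which rescales the eigenvalues but leaves the eigenvectors unchanged). Eigendecompose: Σ = VΛVᵀ, where V is orthogonal (Vᵀ = V⁻¹, so its columns are orthonormal) and Λ is diagonal with eigenvalues λ₁≥λ₂≥…≥λₚ≥0. The first eigenvector v₁ is the direction of maximum variance. In practice scikit-learn typically works from an SVD of X̃ directly, which is more numerically stable than forming Σ explicitly.

Eigenvector equation: Σvₖ = λₖvₖ, where vₖ is the k-th principal component direction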
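The two routes mentioned above (eigendecomposing the covariance matrix, or taking an SVD of the centered data) can be compared directly. This is a sketch on synthetic data using the 1/n convention from the text; the array sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated synthetic features

Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / n                                      # 1/n convention, as in the text

# Route 1: eigendecomposition of the covariance matrix
eigvals, V = np.linalg.eigh(cov)                         # ascending order
order = np.argsort(eigvals)[::-1]                        # re-sort descending
eigvals, V = eigvals[order], V[:, order]

# Route 2: SVD of the centered data; rows of Vt are the principal directions
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_eigvals = s**2 / n                                   # eigenvalue = squared singular value / n

# Each column of V satisfies cov @ v_k = lambda_k * v_k
residual = np.linalg.norm(cov @ V - V * eigvals)

# Directions agree up to sign: |<v_k, v_k'>| should be 1 for every k
alignment = np.abs(np.sum(V * Vt.T, axis=0))

print("eigenvalues match:", np.allclose(eigvals, svd_eigvals))
print("eigen-equation residual:", residual)
print("direction alignment:", alignment.round(6))
```

The sign ambiguity is worth remembering: two correct PCA implementations can return components that differ by a factor of -1.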
🔬

Choosing the Number of Components

deepdive

Plot the cumulative explained variance ratio. The curve rises steeply at first, then flattens. A common rule of thumb is to choose the smallest k that reaches 90–95% cumulative variance. A related heuristic is the 'elbow' of the scree plot: the point where the per-component variance stops dropping sharply. The two do not always agree. Alternatively, when PCA is preprocessing for a downstream model, tune k as a hyperparameter with cross-validation. For visualization, use k=2 or k=3 regardless of explained variance, but report how much variance the plot retains, since a low figure means the picture may be misleading.

1. Compute explained_variance_ratio_ for each component
2. Plot the cumulative sum and find the smallest k where cumsum ≥ 0.95
3. For whitening (decorrelation + unit variance): set whiten=True
4. For large datasets: use IncrementalPCA (mini-batch) or TruncatedSVD (sparse input; note that it does not center the data)
5. Never fit PCA before the train/test split: fit on training data only, then transform both sets
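One way to treat k as a hyperparameter while avoiding leakage is to put the scaler, PCA, and the model in a single scikit-learn `Pipeline` and search over it with cross-validation. The sketch below is one possible setup, not a recommendation: the logistic regression classifier, the candidate values of k, and the synthetic dataset are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=20,
                           n_informative=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scaler and PCA are refit inside every CV fold, so validation folds never leak in
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),   # assumed downstream model
])

candidate_k = [2, 5, 8, 12, 20]                   # illustrative grid
param_grid = {
    "pca__n_components": candidate_k,
    "pca__whiten": [False, True],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

best_k = search.best_params_["pca__n_components"]
test_acc = search.score(X_test, y_test)           # test set touched only once, at the end
print(f"best k = {best_k}, whiten = {search.best_params_['pca__whiten']}")
print(f"held-out accuracy = {test_acc:.3f}")
```

Because the whole pipeline is refit per fold, the k chosen here reflects downstream performance rather than a fixed variance threshold.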

⚙️

PCA Algorithm

algorithm
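1. Center data: X̃ = X - mean(X, axis=0)
2. Compute SVD: X̃ = USVᵀ, where S is the diagonal matrix of singular values, written S here to avoid a clash with the covariance matrix Σ (equivalently: eigendecompose X̃ᵀX̃, whose eigenvalues are the squared singular values)
3. Sort eigenvectors by eigenvalue, descending (SVD routines already return them in this order)
4. Select top-k eigenvectors: V_k = V[:, :k]
5. Project: Z = X̃ @ V_k → k-dimensional representation
6. Reconstruct: X̂ = Z @ V_k.T + mean → measure reconstruction error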
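The six steps above can be sketched in plain NumPy. The function names (`pca_fit`, `pca_transform`, `pca_reconstruct`) are made up for this example, and the data is synthetic. The sketch also checks a useful identity: with the 1/n convention, the mean squared reconstruction error per sample equals the sum of the discarded eigenvalues.

```python
import numpy as np

def pca_fit(X, k):
    """Return the feature means, the top-k principal directions, and all eigenvalues."""
    mean = X.mean(axis=0)
    Xc = X - mean                                       # step 1: center
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # step 2: SVD
    eigvals = s**2 / len(X)                             # step 3: already sorted descending
    V_k = Vt[:k].T                                      # step 4: top-k directions as columns
    return mean, V_k, eigvals

def pca_transform(X, mean, V_k):
    return (X - mean) @ V_k                             # step 5: project

def pca_reconstruct(Z, mean, V_k):
    return Z @ V_k.T + mean                             # step 6: reconstruct

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))  # correlated synthetic data

k = 3
mean, V_k, eigvals = pca_fit(X, k)
Z = pca_transform(X, mean, V_k)
X_hat = pca_reconstruct(Z, mean, V_k)

recon_err = np.mean(np.sum((X - X_hat) ** 2, axis=1))   # mean squared error per sample
discarded = eigvals[k:].sum()                           # variance left out of the code
explained = eigvals[:k].sum() / eigvals.sum()

print(f"explained variance ratio (k={k}): {explained:.3f}")
print(f"reconstruction error: {recon_err:.4f}, discarded variance: {discarded:.4f}")
```

Keeping all components (k equal to the number of features) makes the reconstruction exact, since projecting onto a full orthonormal basis loses nothing.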

</>

scikit-learn PCA

code
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt

# ── Sample data ────────────────────────────────────────────────────────
X_raw, y = make_classification(n_samples=400, n_features=20,
                               n_informative=8, random_state=42)
# Keep y_train: the 2D plot below colors the training points by label
X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.2, random_state=42)

# ── Fit PCA ────────────────────────────────────────────────────────────
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)     # fit the scaler on train only

# Fit full PCA first to inspect variance
pca_full = PCA().fit(X_scaled)
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
k = int(np.argmax(cumvar >= 0.95)) + 1       # smallest k reaching 95%
print(f"{k} components explain ≥95% variance")

# Final PCA with chosen k
pca = PCA(n_components=k, random_state=42)
X_pca = pca.fit_transform(X_scaled)          # fit on train only
X_test_pca = pca.transform(scaler.transform(X_test))

# ── Reconstruction error ───────────────────────────────────────────────
X_reconstructed = pca.inverse_transform(X_pca)
recon_err = np.mean((X_scaled - X_reconstructed)**2)
print(f"Reconstruction MSE: {recon_err:.4f}")

# ── 2D visualisation ───────────────────────────────────────────────────
pca2 = PCA(n_components=2)
X_2d = pca2.fit_transform(X_scaled)
# Color by y_train so the labels match the 320 training points being plotted
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_train, cmap='viridis', alpha=0.7)
plt.xlabel(f"PC1 ({pca2.explained_variance_ratio_[0]:.1%})")
plt.ylabel(f"PC2 ({pca2.explained_variance_ratio_[1]:.1%})")
plt.show()
```
⚠️

PCA Pitfalls

pitfall

PCA is a linear method, so it cannot unfold non-linear manifolds. Use t-SNE or UMAP for non-linear dimensionality reduction when visualizing complex structure. Also, PCA maximizes variance, not class separation: for classification, Linear Discriminant Analysis (LDA) often separates classes better because it maximizes the ratio of between-class to within-class variance. Finally, principal components are often hard to interpret, because each one is a linear combination of all the original features. If you need interpretable features, prefer Sparse PCA or feature selection instead.

Never fit PCA on the full dataset — fit on training data only and apply the same transformation to test data. Fitting on the full dataset leaks test statistics into training.
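The variance-versus-discrimination pitfall can be seen on a toy dataset. In the sketch below (synthetic data, with a helper `fisher_ratio` defined only for this example), both classes spread widely along x but differ only along y. PCA picks the high-variance x direction, while LDA picks the direction that separates the classes. For brevity everything is fit on one sample with no held-out split, which is fine for a geometric illustration but not for model evaluation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 300

# Two classes: large shared spread along x (std 5), separated along y (means -1 and +1)
class0 = rng.normal(loc=[0.0, -1.0], scale=[5.0, 0.3], size=(n, 2))
class1 = rng.normal(loc=[0.0, 1.0], scale=[5.0, 0.3], size=(n, 2))
X = np.vstack([class0, class1])
y = np.array([0] * n + [1] * n)

def fisher_ratio(z, y):
    """Squared distance between class means divided by summed class variances (1D)."""
    z0, z1 = z[y == 0], z[y == 1]
    return (z1.mean() - z0.mean()) ** 2 / (z0.var() + z1.var())

pca = PCA(n_components=1).fit(X)
pca_dir = pca.components_[0]                # close to the x-axis (sign is arbitrary)
z_pca = pca.transform(X).ravel()

lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
z_lda = lda.transform(X).ravel()

print("PCA direction:", pca_dir.round(3))
print(f"class separation after PCA: {fisher_ratio(z_pca, y):.3f}")
print(f"class separation after LDA: {fisher_ratio(z_lda, y):.3f}")
```

The 1D PCA projection mixes the two classes almost completely, while the 1D LDA projection keeps them apart, even though PCA retains far more of the total variance.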
