PCA & Dimensionality Reduction
“Finding the directions of maximum variance — compressing information without losing it”
Principal Component Analysis from scratch: eigendecomposition of the covariance matrix, variance explained, choosing k components, whitening, t-SNE/UMAP contrast, and practical applications in visualization and preprocessing.
Key Formulas
Covariance Eigen-decomp
Covariance matrix decomposes into eigenvectors V and eigenvalues Λ
Projection
Project centered data onto top-k eigenvectors to get k-dim representation
Variance Explained
Fraction of total variance captured by principal component k
Reconstruction
Reconstruct approximate x from low-dim code z — measures information loss
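Written out in the notation used later in this section, the four formulas can be sketched as follows. This assumes the 1/n covariance convention, eigenvalues sorted in descending order, and V_k denoting the first k columns of V; the label EVR_k for the explained variance ratio is introduced here for illustration only.

```latex
\begin{align}
\Sigma &= \tfrac{1}{n}\,\tilde{X}^{\top}\tilde{X} = V \Lambda V^{\top}
  && \text{(covariance eigendecomposition)} \\
Z &= \tilde{X} V_k
  && \text{(projection onto the top-$k$ eigenvectors)} \\
\mathrm{EVR}_k &= \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j}
  && \text{(variance explained by component $k$)} \\
\hat{X} &= Z V_k^{\top} + \mu
  && \text{(reconstruction from the code $Z$)}
\end{align}
```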
The Curse of Dimensionality
With 1000 features, pairwise distances between points become almost equal (the 'concentration of measure' phenomenon), so distance-based algorithms such as k-nearest neighbours lose their discriminating power. High-dimensional data also cannot be plotted directly, training slows down, and models overfit more easily. PCA mitigates these problems by finding a low-dimensional linear subspace that captures most of the variance: it decorrelates the features, discards low-variance directions (often noise), and makes 2D or 3D visualization possible. For strongly correlated data, a 10,000-pixel image can often be compressed to 50 PCA components with less than 5% reconstruction error.
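The concentration effect is easy to observe numerically. The following is a minimal sketch, assuming points drawn uniformly from the unit hypercube; the helper name `relative_contrast` is hypothetical. It measures how much the largest pairwise distance exceeds the smallest one, relative to the smallest.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n_points=200):
    """(max - min) / min over all pairwise Euclidean distances."""
    X = rng.random((n_points, dim))            # uniform points in [0, 1]^dim
    sq = np.sum(X**2, axis=1)
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    iu = np.triu_indices(n_points, k=1)        # each pair once, no self-pairs
    d = np.sqrt(np.maximum(d2[iu], 0.0))       # clip tiny negative round-off
    return (d.max() - d.min()) / d.min()

contrast = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, c in contrast.items():
    print(f"dim={dim:5d}  relative contrast={c:.2f}")
```

The contrast shrinks as the dimension grows: in high dimensions the nearest and farthest neighbours are almost the same distance away.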
The famous 'face space' (eigenfaces) result: aligned human face images lie close to a roughly 50-dimensional subspace within a 50,000-pixel space. PCA finds this subspace.
The Geometrical View
Imagine a cloud of points that looks like a stretched ellipse. PCA finds the longest axis of the ellipse (PC1, the direction of maximum variance), then the next longest axis perpendicular to it (PC2), and so on. By projecting onto the first few principal components, you keep the most 'spread out' dimensions and discard the tightly bunched ones, which often, though not always, correspond to noise. The principal components are the eigenvectors of the data covariance matrix, ordered by their eigenvalues; each eigenvalue equals the variance of the data along its eigenvector.
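To make the ellipse picture concrete, here is a small sketch on synthetic data. The 30-degree rotation and the standard deviations 3 and 0.5 are arbitrary choices for illustration. PC1 should line up with the long axis of the cloud, and the variances along the components should be close to 9 and 0.25.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Axis-aligned cloud: standard deviation 3 along x, 0.5 along y
cloud = rng.normal(size=(2000, 2)) * np.array([3.0, 0.5])

# Rotate by 30 degrees so the long axis points along (cos 30°, sin 30°)
theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = cloud @ R.T

pca = PCA(n_components=2).fit(X)
pc1, pc2 = pca.components_                 # rows are the principal directions

long_axis = np.array([np.cos(theta), np.sin(theta)])
alignment = abs(pc1 @ long_axis)           # abs() because the sign of a PC is arbitrary

print("PC1 direction:", pc1)
print("|cos(angle to long axis)|:", round(float(alignment), 4))
print("variance along PC1, PC2:", pca.explained_variance_)
```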
Eigendecomposition of the Covariance Matrix
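Center the data: X̃ = X − μ, where μ is the vector of per-feature means. Compute the p×p covariance matrix Σ = (1/n)X̃ᵀX̃ (scikit-learn divides by n−1 instead, which rescales the eigenvalues but leaves the eigenvectors unchanged). Eigendecompose: Σ = VΛVᵀ, where V is orthogonal (its columns are orthonormal, so Vᵀ = V⁻¹) and Λ is diagonal with eigenvalues λ₁ ≥ λ₂ ≥ … ≥ λₚ ≥ 0. The first eigenvector v₁ is the direction of maximum variance, and λ₁ is the variance along it. In practice scikit-learn usually runs an SVD on X̃ directly, which is more numerically stable than forming Σ explicitly.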
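The equivalence between the two routes can be checked with plain NumPy. This is a sketch on random correlated data using the 1/n convention from the text: the covariance eigenvalues should equal the squared singular values of the centered data divided by n, and the eigenvectors should match the right singular vectors up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated features

Xc = X - X.mean(axis=0)                  # center each feature
cov = Xc.T @ Xc / n                      # p x p covariance, 1/n convention

# Route 1: eigendecomposition (eigh returns ascending order, so reverse it)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data (singular values already descending)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_eigvals = s**2 / n

same_values = np.allclose(eigvals, svd_eigvals)
# |v_i . v_j| should be the identity: same directions, possibly flipped signs
same_vectors = np.allclose(np.abs(eigvecs.T @ Vt.T), np.eye(p), atol=1e-6)
print(same_values, same_vectors)
```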
Choosing the Number of Components
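Plot the cumulative explained variance ratio. The curve rises steeply at first, then flattens. Two common heuristics are to pick the smallest k that reaches 90–95% cumulative variance, or to pick the 'elbow' where the curve flattens; the two do not always coincide. Alternatively, when PCA is preprocessing for a downstream model, tune k as a hyperparameter with cross-validation. For visualization, use k=2 or k=3 regardless of explained variance, and report how much variance the plot retains so readers know how much structure may be hidden.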
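Tuning k with cross-validation might look like the sketch below. The logistic regression classifier, the candidate grid, and the step names `scale`, `pca`, and `clf` are assumptions made for illustration; any downstream estimator could take the classifier's place.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=20,
                           n_informative=8, random_state=42)

# Scaler and PCA live inside the pipeline, so each CV fold refits them
# on its own training portion only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {"pca__n_components": [2, 5, 8, 12, 16, 20]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

best_k = search.best_params_["pca__n_components"]
print("best k:", best_k, " mean CV accuracy:", round(search.best_score_, 3))
```

Putting the scaler and PCA inside the pipeline matters: the search then selects k by downstream accuracy rather than by a variance threshold, without leaking validation data into the fit.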
Compute explained_variance_ratio_ for each component
Plot cumulative sum — find k where cumsum ≥ 0.95
For whitening (decorrelation + unit variance): set whiten=True
For large datasets: use IncrementalPCA (mini-batch) or TruncatedSVD (sparse input; note that it does not center the data)
Never fit PCA before the train/test split: fit on training data only, then transform both sets
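The effect of whiten=True from the list above can be verified directly. In this sketch on random correlated data, plain PCA scores are uncorrelated but keep their different variances, while whitened scores should have a covariance matrix close to the identity.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # correlated features

Z_plain = PCA(n_components=4).fit_transform(X)
Z_white = PCA(n_components=4, whiten=True).fit_transform(X)

# np.cov uses the n-1 denominator, the same one scikit-learn uses for whitening
cov_plain = np.cov(Z_plain, rowvar=False)   # diagonal; entries are the eigenvalues
cov_white = np.cov(Z_white, rowvar=False)   # approximately the identity

print(np.round(cov_plain, 3))
print(np.round(cov_white, 3))
```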
PCA Algorithm
Center data: X̃ = X - mean(X, axis=0)
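Compute SVD: X̃ = USVᵀ, where S is the diagonal matrix of singular values sᵢ (equivalently: eigendecompose X̃ᵀX̃, whose eigenvalues are sᵢ²)
Sort eigenvectors by eigenvalues descending (SVD routines already return the singular values in descending order)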
Select top-k eigenvectors: V_k = V[:, :k]
Project: Z = X̃ @ V_k → k-dimensional representation
Reconstruct: X̂ = Z @ V_k.T + mean → measure reconstruction error
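The steps above can be sketched in a few lines of NumPy. The function name `pca_svd` and the random test data are hypothetical; the comparison against scikit-learn is included only as a sanity check.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_svd(X, k):
    """PCA via SVD: returns scores Z, reconstruction X_hat, basis V_k, variance ratios."""
    mean = X.mean(axis=0)
    Xc = X - mean                                   # 1. center
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # 2. SVD (s is descending)
    V_k = Vt[:k].T                                  # 3-4. top-k directions as columns
    Z = Xc @ V_k                                    # 5. project
    X_hat = Z @ V_k.T + mean                        # 6. reconstruct
    evr = s**2 / np.sum(s**2)                       # explained variance ratio
    return Z, X_hat, V_k, evr[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))

Z, X_hat, V_k, evr = pca_svd(X, k=3)
mse = np.mean((X - X_hat) ** 2)
print("explained variance ratio:", np.round(evr, 3))
print("reconstruction MSE:", round(float(mse), 4))

# Sanity check against scikit-learn (component signs may differ)
ref = PCA(n_components=3, svd_solver="full").fit(X)
print(np.allclose(evr, ref.explained_variance_ratio_))
```

Keeping all components (k equal to the number of features) reconstructs the data exactly, so the reconstruction error comes only from the discarded directions.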
scikit-learn PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt

# ── Sample data ────────────────────────────────────────────────────────
X_raw, y = make_classification(n_samples=400, n_features=20, n_informative=8, random_state=42)
# Keep y_train: the scatter plot below colours the training points by label
X_train, X_test, y_train, y_test = train_test_split(X_raw, y, test_size=0.2, random_state=42)

# ── Fit PCA ────────────────────────────────────────────────────────
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Fit full PCA first to inspect variance
pca_full = PCA().fit(X_scaled)
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
k = int(np.argmax(cumvar >= 0.95)) + 1  # first index reaching 95%, as a count
print(f"{k} components explain ≥95% variance")

# Final PCA with chosen k
pca = PCA(n_components=k, random_state=42)
X_pca = pca.fit_transform(X_scaled)  # fit on train only
X_test_pca = pca.transform(scaler.transform(X_test))

# ── Reconstruction error ───────────────────────────────────────────
X_reconstructed = pca.inverse_transform(X_pca)
recon_err = np.mean((X_scaled - X_reconstructed)**2)
print(f"Reconstruction MSE: {recon_err:.4f}")

# ── 2D visualisation ───────────────────────────────────────────────
pca2 = PCA(n_components=2)
X_2d = pca2.fit_transform(X_scaled)
# c must have one label per plotted point, so use y_train rather than y
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_train, cmap='viridis', alpha=0.7)
plt.xlabel(f"PC1 ({pca2.explained_variance_ratio_[0]:.1%})")
plt.ylabel(f"PC2 ({pca2.explained_variance_ratio_[1]:.1%})")
plt.show()
PCA Pitfalls
PCA is a linear method: it cannot capture non-linear manifolds. For visualizing complex non-linear structure, use t-SNE or UMAP instead. PCA also maximizes variance, not discrimination: for classification, Linear Discriminant Analysis (LDA) often gives better separation because it uses the labels to maximize the ratio of between-class to within-class variance. Finally, each principal component is a dense linear combination of all the original features and is therefore often hard to interpret. If you need interpretable features, prefer Sparse PCA or feature selection.
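The variance-versus-discrimination pitfall can be reproduced on a toy dataset. In this sketch the data are constructed, as an assumption for illustration, so that the high-variance feature carries no class information while a low-variance feature separates the classes; the helper `separation` is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 500

# Feature 0: large variance, identical for both classes
# Feature 1: small variance, class means at -1 and +1
noise_feature = rng.normal(scale=5.0, size=2 * n)
class_feature = np.concatenate([rng.normal(-1.0, 0.3, n), rng.normal(1.0, 0.3, n)])
X = np.column_stack([noise_feature, class_feature])
labels = np.repeat([0, 1], n)

def separation(z, labels):
    """Gap between class means in units of the pooled standard deviation."""
    a, b = z[labels == 0], z[labels == 1]
    pooled_std = np.sqrt((a.var() + b.var()) / 2)
    return abs(a.mean() - b.mean()) / pooled_std

z_pca = PCA(n_components=1).fit_transform(X)[:, 0]            # ignores labels
z_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, labels)[:, 0]

sep_pca = separation(z_pca, labels)
sep_lda = separation(z_lda, labels)
print(f"class separation  PCA: {sep_pca:.2f}   LDA: {sep_lda:.2f}")
```

PCA keeps the noisy high-variance direction and the classes overlap almost completely, while LDA projects onto the direction that separates them.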
Never fit PCA on the full dataset — fit on training data only and apply the same transformation to test data. Fitting on the full dataset leaks test statistics into training.