Linear Algebra for ML
“The geometry behind every model — dot products, matrix transforms, and eigendecomposition”
Vectors, dot products, matrix multiplication, eigendecomposition and SVD — with visual intuition for how matrices transform space. The language every neural network is written in.
Key Formulas
Dot Product
Measures how aligned two vectors are — zero means orthogonal, maximum when parallel
Matrix Multiply
Composition of two linear transformations — apply B first, then A
Eigendecomposition
Eigenvectors v stay on their span under transformation A; λ is the scaling factor
SVD
Any matrix decomposes into rotation (or reflection) × axis-aligned scaling × rotation (or reflection); used in PCA, LSA, recommender systems
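For reference, the four identities named above can be written in their standard textbook forms. The symbols U, Σ, V (SVD factors) and Q, Λ (eigenvector and eigenvalue matrices) are conventional notation, assumed here rather than taken from this page's own formula cards.

```latex
\begin{align*}
\mathbf{a}\cdot\mathbf{b} &= \sum_i a_i b_i = \|\mathbf{a}\|\,\|\mathbf{b}\|\cos\theta \\
(AB)\,\mathbf{x} &= A\,(B\,\mathbf{x}), \qquad (AB)_{ij} = \sum_k A_{ik} B_{kj} \\
A\mathbf{v} &= \lambda\,\mathbf{v} \\
A &= U \Sigma V^{\top}
\end{align*}
```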
Why Linear Algebra IS Machine Learning
A neural network layer is y = Wx + b: a matrix multiplication plus a bias. Gradient descent needs the gradient of the loss, and backpropagation computes it by chaining the Jacobian matrices of each layer. PCA finds the principal eigenvectors of the covariance matrix. SVMs maximize a dot-product-based margin. Attention in Transformers is softmax(QKᵀ/√d)·V: two matrix multiplications around a softmax, with Q, K and V themselves produced by three more matrix multiplications. Nearly every forward pass, backpropagation step, and optimizer update is built from linear algebra. Understanding the geometric intuition, what matrices DO to vectors in space, is what separates engineers who debug by understanding from engineers who debug by trial and error.
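As a rough illustration of the layer and attention operations mentioned above, here is a minimal NumPy sketch. The shapes, the random inputs, and the helper names `softmax` and `attention` are assumptions chosen for the example; the attention function uses the usual scaled dot-product form, which adds a softmax and a 1/√d_k scaling around the matrix products.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Dense layer: y = W x + b maps a 4-dim input to a 3-dim output.
W = rng.standard_normal((3, 4))
b = rng.standard_normal(3)
x = rng.standard_normal(4)
y = W @ x + b
print("layer output shape:", y.shape)

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise query-key dot products
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Hypothetical sequence of 5 tokens with 8-dim queries, keys and values.
Q = rng.standard_normal((5, 8))
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 8))
out, weights = attention(Q, K, V)
print("attention output shape:", out.shape)
```

Each row of `weights` is a probability distribution over the tokens, so each output row is a weighted average of the rows of `V`.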
The dot product a·b = ‖a‖‖b‖cos(θ) is the foundation of cosine similarity (used in NLP), the kernel trick (SVMs), and attention mechanisms (Transformers).
Matrices as Space Transformers
Every m×n matrix A represents a linear transformation from ℝⁿ to ℝᵐ. Multiplying a vector v by A stretches, rotates, reflects, or projects it. The key insight: a matrix completely describes what happens to EVERY vector in the space, and you only need to know what it does to the standard basis vectors, whose images are the columns of A. For a square matrix, the determinant gives the signed volume scaling factor: |det(A)| = 2 means every region doubles in area (in 2D), and a negative sign means orientation is flipped. det = 0 means the matrix collapses space onto a lower dimension (rank-deficient, non-invertible).
Visualize any 2×2 matrix by watching where the unit square [0,1]×[0,1] gets sent. The four corners go to (0,0), the first column, the second column, and their sum.
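The unit-square picture can be checked numerically. The sketch below uses an arbitrary example matrix and a small hypothetical helper, `polygon_area` (shoelace formula), to confirm that the image of the square has area |det(A)|.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

# Corners of the unit square in counter-clockwise order, one per column:
# (0,0), (1,0), (1,1), (0,1)
square = np.array([[0.0, 1.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0]])

image = A @ square  # where each corner lands
print("transformed corners (one per row):")
print(image.T)

def polygon_area(pts):
    """Area of a polygon from its vertices (shoelace formula); pts is 2 x n."""
    x, y = pts
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

area = polygon_area(image)
print("area of image:", area)
print("|det(A)|     :", abs(np.linalg.det(A)))
```

The corner (1,0) lands on the first column of A, (0,1) on the second column, and (1,1) on their sum, which is the statement in the text.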
Eigendecomposition Step by Step
Find eigenvalues: solve det(A − λI) = 0 (characteristic polynomial). For 2×2: λ = (tr(A) ± √(tr(A)² − 4·det(A))) / 2.
For each eigenvalue λᵢ: solve (A - λᵢI)v = 0 to find the eigenvector vᵢ. Normalize: ‖vᵢ‖ = 1.
Stack eigenvectors as columns of Q: A = QΛQ⁻¹ where Λ = diag(λ₁, λ₂, …). This requires A to be diagonalizable, i.e. to have n linearly independent eigenvectors.
For symmetric matrices (covariance matrices): Q is orthogonal (Q⁻¹ = Qᵀ), eigenvalues are real.
Aⁿ = QΛⁿQ⁻¹ — large eigenvalues dominate repeated application (e.g., power iteration).
PCA: center X (subtract each column's mean), compute covariance C = XᵀX/(n−1), eigendecompose, take top-k eigenvectors as projection matrix.
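The steps above can be traced on a small symmetric matrix. This is a sketch under the assumption of a 2×2 example chosen for easy arithmetic; it checks the closed-form eigenvalues against `np.linalg.eigh`, uses the decomposition to compute a matrix power, and runs a plain power iteration to show the largest eigenvalue dominating.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # symmetric example matrix

# Step 1: eigenvalues from the 2x2 trace/determinant formula.
tr, det = np.trace(A), np.linalg.det(A)
disc = np.sqrt(tr**2 - 4 * det)
lam_closed = np.array([(tr - disc) / 2, (tr + disc) / 2])

# Steps 2-4: eigh returns ascending eigenvalues and orthonormal eigenvectors.
lam, Q = np.linalg.eigh(A)
print("closed form:", lam_closed, " eigh:", lam)

# Step 5: A^5 = Q diag(lam^5) Q^T (Q^-1 = Q^T because A is symmetric).
A5 = Q @ np.diag(lam**5) @ Q.T
print("matches matrix_power:", np.allclose(A5, np.linalg.matrix_power(A, 5)))

# Power iteration: repeated application pulls any generic start vector
# toward the eigenvector with the largest |eigenvalue|.
v = np.array([1.0, 0.0])
for _ in range(50):
    v = A @ v
    v /= np.linalg.norm(v)
rayleigh = v @ A @ v  # estimate of the dominant eigenvalue
print("power iteration estimate:", rayleigh)
```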
Linear Algebra with NumPy
import numpy as np

# ── Vectors and dot products ──────────────────────────────────────────────────
a = np.array([3., 4.])
b = np.array([1., 0.])

print(f"a·b = {np.dot(a, b):.2f}")        # 3.00
print(f"‖a‖ = {np.linalg.norm(a):.2f}")   # 5.00
print(f"cos(θ) = {np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)):.3f}")  # 0.600

# Cosine similarity (NLP/recommendation)
def cosine_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# ── Matrix operations ─────────────────────────────────────────────────────────
A = np.array([[2., 1.],
              [0., 3.]])
B = np.array([[1., 0.],
              [2., 1.]])

print("A @ B =")
print(A @ B)                                    # matrix multiply (composition)
print(f"det(A) = {np.linalg.det(A):.2f}")       # 6.00, the volume scaling factor
print(f"rank(A) = {np.linalg.matrix_rank(A)}")  # 2, full rank

A_inv = np.linalg.inv(A)
print("A @ A_inv ≈ I:", np.allclose(A @ A_inv, np.eye(2)))

# ── Eigendecomposition ────────────────────────────────────────────────────────
eigenvalues, eigenvectors = np.linalg.eig(A)
print(f"Eigenvalues: {eigenvalues}")            # [2. 3.]
print(f"Eigenvectors (columns):\n{eigenvectors.round(3)}")

# Verify: A @ v = λ * v
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    lam = eigenvalues[i]
    print(f"λ{i+1}={lam:.2f}, A@v = {(A @ v).round(3)}, λ*v = {(lam * v).round(3)}")

# Reconstruct A from eigendecomposition
Q = eigenvectors
Lambda = np.diag(eigenvalues)
A_reconstructed = Q @ Lambda @ np.linalg.inv(Q)
print("Reconstruction error:", np.linalg.norm(A - A_reconstructed))

# ── SVD ───────────────────────────────────────────────────────────────────────
M = np.random.randn(4, 3)                       # 4×3 rectangular matrix
U, S, Vt = np.linalg.svd(M, full_matrices=False)
print(f"U: {U.shape}, S: {S.shape}, Vt: {Vt.shape}")

# Low-rank approximation (keep top-k singular values)
k = 2
M_approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
print(f"Rank-{k} approx error: {np.linalg.norm(M - M_approx):.4f}")

# ── PCA from scratch ──────────────────────────────────────────────────────────
X = np.random.randn(200, 5)
X -= X.mean(axis=0)                             # center
C = (X.T @ X) / (len(X) - 1)                    # covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(C)   # eigh for symmetric matrices
idx = np.argsort(eigenvalues)[::-1]             # sort descending
PC = eigenvectors[:, idx[:2]]                   # top-2 principal components
X_proj = X @ PC                                 # project to 2D
print(f"Explained variance (%): {eigenvalues[idx[:2]] / eigenvalues.sum() * 100}")
Numerical Stability and Ill-Conditioning
The condition number of a matrix κ(A) = σ_max/σ_min (ratio of largest to smallest singular value) measures how sensitive solutions are to perturbations. High condition number → ill-conditioned → numerical errors amplify. Gradient descent converges slowly on ill-conditioned loss landscapes (elongated bowl) — this is why feature scaling matters and why Adam adapts learning rates per-parameter. Never invert a matrix directly with np.linalg.inv(A) if you're solving Ax=b — use np.linalg.solve(A,b) which is faster and more stable (uses LU factorization).
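To make the solve-versus-invert advice concrete, the sketch below compares both approaches on a Hilbert matrix, a standard example of an ill-conditioned system. The helper `hilbert`, the size n = 10, and the all-ones solution are assumptions chosen for the illustration; exact residuals will vary by platform, so none are quoted here.

```python
import numpy as np

def hilbert(n):
    """n x n Hilbert matrix H[i, j] = 1 / (i + j + 1), notoriously ill-conditioned."""
    i = np.arange(1, n + 1)
    return 1.0 / (i[:, None] + i[None, :] - 1)

n = 10
A = hilbert(n)
x_true = np.ones(n)
b = A @ x_true

x_solve = np.linalg.solve(A, b)   # LU factorization with partial pivoting
x_inv = np.linalg.inv(A) @ b      # explicit inverse, then multiply

# Residual ||Ax - b|| measures how well each answer satisfies the system.
res_solve = np.linalg.norm(A @ x_solve - b)
res_inv = np.linalg.norm(A @ x_inv - b)

print(f"residual via solve: {res_solve:.3e}")
print(f"residual via inv  : {res_inv:.3e}")
print(f"error in x via solve: {np.linalg.norm(x_solve - x_true):.3e}")
print(f"error in x via inv  : {np.linalg.norm(x_inv - x_true):.3e}")
```

Even when the residual from `solve` is tiny, the error in x itself can be large on a matrix like this, which is the sensitivity the condition number describes.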
np.linalg.cond(A) tells you the condition number. As a rule of thumb you lose about log₁₀(κ) significant digits when solving a linear system, so κ > 10⁶ means solutions will have roughly 6 fewer significant digits than the ~16 that float64 provides.
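A short sketch of how the condition number relates to singular values and to feature scaling follows. The data is hypothetical: two random features whose scales differ by a factor of 10⁴, standardized afterwards to show the effect on κ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrix: two features on very different scales.
X = np.column_stack([
    rng.standard_normal(500),          # feature on the order of 1
    1e4 * rng.standard_normal(500),    # feature on the order of 10,000
])

# Condition number two ways: ratio of singular values, and np.linalg.cond.
S = np.linalg.svd(X, compute_uv=False)  # singular values, descending
kappa_manual = S[0] / S[-1]
kappa = np.linalg.cond(X)               # 2-norm condition number by default
print(f"kappa = {kappa:.3e} (manual: {kappa_manual:.3e})")
print(f"rule-of-thumb digits lost: {np.log10(kappa):.1f}")

# Standardizing each feature brings the condition number close to 1.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
kappa_scaled = np.linalg.cond(X_scaled)
print(f"kappa after scaling = {kappa_scaled:.3f}")
```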