Transformers & Self-Attention
“Every word speaks directly to every other word — attending to the whole sentence at once”
Deep dive into attention mechanisms: scaled dot-product, multi-head attention, positional encoding, feed-forward sublayer, temperature/top-k/top-p sampling, BERT encoder vs GPT decoder.
Prerequisites
Concepts Covered
∑Key Formulas
Scaled Dot-Product
Core attention: how much each query attends to each key
Multi-Head
H parallel attention functions joined and projected
Positional Encoding
Injects token position info (no recurrence)
FFN Sublayer
Position-wise feed-forward after each attention block
▶Interactive Simulation
⬡Model Architecture
The Problem with RNNs That Transformers Solved
RNNs process sequences token by token — to understand the relationship between word 1 and word 500, information must flow through 499 intermediate states, each potentially corrupting or forgetting it (vanishing gradient). Transformers solve this by allowing any position to directly attend to any other position in a single step. This direct path, combined with parallel computation, is why Transformers replaced RNNs for almost everything.
Attention as a Soft Database Query
Think of attention as a differentiable key-value store. You have a Query (what you're looking for), Keys (descriptors of each memory), and Values (the actual content). Attention computes similarity between Query and all Keys, runs softmax to get a probability distribution, then returns a weighted sum of Values. The word 'bank' in 'river bank' will attend heavily to 'river' (high Q·K similarity) and retrieve its financial meaning — context-dependent representation.
The √d_k scaling prevents dot products from growing large (which would make softmax extremely peaked, killing gradient flow through the distribution).
Multi-Head Attention: Why Multiple Heads?
A single attention head can only attend based on one 'criterion' (e.g., syntactic subject-verb agreement). Multiple heads learn different attention patterns simultaneously: head 1 might track syntax, head 2 semantics, head 3 coreference. Each head projects Q, K, V to a lower-dimensional subspace, computes attention there, then all heads are concatenated and projected back.
BERT vs GPT: Encoder vs Decoder
BERT uses bidirectional attention — each token attends to all other tokens (past and future). This is great for understanding (classification, NER, QA) but can't generate text left-to-right. GPT uses masked (causal) attention — each token only attends to previous tokens. This enables autoregressive text generation. The mask is a lower-triangular matrix of -inf values added before softmax, zeroing out future attention.
Transformer Encoder Block
Input embeddings E = token_embed + positional_encoding
Multi-Head Self-Attention: Q=EW_Q, K=EW_K, V=EW_V
Attention(Q,K,V) = softmax(QKᵀ/√d_k)V
Add & Norm: x₁ = LayerNorm(E + Attention(E))
Feed-Forward: FFN(x₁) = ReLU(x₁W₁ + b₁)W₂ + b₂
Add & Norm: x₂ = LayerNorm(x₁ + FFN(x₁))
Repeat for L layers
Scaled Dot-Product Attention from Scratch
import torch import torch.nn.functional as F import math def scaled_dot_product_attention(Q, K, V, mask=None): class="tok-str">""" Q, K, V: (batch, heads, seq_len, d_k) """ d_k = Q.shape[-class="tok-num">1] class="tok-comment"># Attention scores scores = torch.matmul(Q, K.transpose(-class="tok-num">2, -class="tok-num">1)) / math.sqrt(d_k) class="tok-comment"># Causal mask (GPT-style) if mask is not None: scores = scores.masked_fill(mask == class="tok-num">0, float(class="tok-str">'-inf')) class="tok-comment"># Softmax over key dimension attn_weights = F.softmax(scores, dim=-class="tok-num">1) class="tok-comment"># Weighted sum of values return torch.matmul(attn_weights, V), attn_weights class MultiHeadAttention(torch.nn.Module): def __init__(self, d_model, n_heads): super().__init__() self.d_k = d_model // n_heads self.n_heads = n_heads self.W_q = torch.nn.Linear(d_model, d_model) self.W_k = torch.nn.Linear(d_model, d_model) self.W_v = torch.nn.Linear(d_model, d_model) self.W_o = torch.nn.Linear(d_model, d_model) def forward(self, Q, K, V, mask=None): B, T, D = Q.shape class="tok-comment"># Project + split into heads Q = self.W_q(Q).view(B, T, self.n_heads, self.d_k).transpose(class="tok-num">1, class="tok-num">2) K = self.W_k(K).view(B, T, self.n_heads, self.d_k).transpose(class="tok-num">1, class="tok-num">2) V = self.W_v(V).view(B, T, self.n_heads, self.d_k).transpose(class="tok-num">1, class="tok-num">2) x, weights = scaled_dot_product_attention(Q, K, V, mask) class="tok-comment"># Concat heads + project x = x.transpose(class="tok-num">1, class="tok-num">2).contiguous().view(B, T, D) return self.W_o(x), weights
?Knowledge Check
Progress is saved in your browser — no account needed.
Need an AI engineer or data scientist?
I build custom ML models, AI agents, computer vision, and automation — from idea to production.