Deep Learningadvanced

Transformers & Self-Attention

“Every word speaks directly to every other word — attending to the whole sentence at once”

Deep dive into attention mechanisms: scaled dot-product, multi-head attention, positional encoding, feed-forward sublayer, temperature/top-k/top-p sampling, BERT encoder vs GPT decoder.

90 min

9 diagrams

7 Concepts Covered

Prerequisites

→RNN / LSTM

→NLP Text Classification

Concepts Covered

Scaled Dot-ProductMulti-head AttentionPositional EncodingFFNBERTGPTTop-p Sampling

Previous: NLP: Text Classification Pipeline Next: Audio & Speech ML

∑Key Formulas

Scaled Dot-Product

Core attention: how much each query attends to each key

Multi-Head

H parallel attention functions joined and projected

Positional Encoding

Injects token position info (no recurrence)

FFN Sublayer

Position-wise feed-forward after each attention block

▶Interactive Simulation

Loading visualization…

⬡Model Architecture

Loading visualization…

🎯

The Problem with RNNs That Transformers Solved

motivation

RNNs process sequences token by token — to understand the relationship between word 1 and word 500, information must flow through 499 intermediate states, each potentially corrupting or forgetting it (vanishing gradient). Transformers solve this by allowing any position to directly attend to any other position in a single step. This direct path, combined with parallel computation, is why Transformers replaced RNNs for almost everything.

💡

Attention as a Soft Database Query

intuition

Think of attention as a differentiable key-value store. You have a Query (what you're looking for), Keys (descriptors of each memory), and Values (the actual content). Attention computes similarity between Query and all Keys, runs softmax to get a probability distribution, then returns a weighted sum of Values. The word 'bank' in 'river bank' will attend heavily to 'river' (high Q·K similarity) and retrieve its financial meaning — context-dependent representation.

The √d_k scaling prevents dot products from growing large (which would make softmax extremely peaked, killing gradient flow through the distribution).

∑

Multi-Head Attention: Why Multiple Heads?

math

A single attention head can only attend based on one 'criterion' (e.g., syntactic subject-verb agreement). Multiple heads learn different attention patterns simultaneously: head 1 might track syntax, head 2 semantics, head 3 coreference. Each head projects Q, K, V to a lower-dimensional subspace, computes attention there, then all heads are concatenated and projected back.

🔬

BERT vs GPT: Encoder vs Decoder

deepdive

BERT uses bidirectional attention — each token attends to all other tokens (past and future). This is great for understanding (classification, NER, QA) but can't generate text left-to-right. GPT uses masked (causal) attention — each token only attends to previous tokens. This enables autoregressive text generation. The mask is a lower-triangular matrix of -inf values added before softmax, zeroing out future attention.

⚙️

Transformer Encoder Block

algorithm

Input embeddings E = token_embed + positional_encoding

Multi-Head Self-Attention: Q=EW_Q, K=EW_K, V=EW_V

Attention(Q,K,V) = softmax(QKᵀ/√d_k)V

Add & Norm: x₁ = LayerNorm(E + Attention(E))

Feed-Forward: FFN(x₁) = ReLU(x₁W₁ + b₁)W₂ + b₂

Add & Norm: x₂ = LayerNorm(x₁ + FFN(x₁))

Repeat for L layers

</>

Scaled Dot-Product Attention from Scratch

code

python39 lines

import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (batch, heads, seq_len, d_k)
    """
    d_k = Q.shape[-1]
    # Attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # Causal mask (GPT-style)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    # Softmax over key dimension
    attn_weights = F.softmax(scores, dim=-1)
    # Weighted sum of values
    return torch.matmul(attn_weights, V), attn_weights

class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.W_q = torch.nn.Linear(d_model, d_model)
        self.W_k = torch.nn.Linear(d_model, d_model)
        self.W_v = torch.nn.Linear(d_model, d_model)
        self.W_o = torch.nn.Linear(d_model, d_model)

    def forward(self, Q, K, V, mask=None):
        B, T, D = Q.shape
        # Project + split into heads
        Q = self.W_q(Q).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        x, weights = scaled_dot_product_attention(Q, K, V, mask)
        # Concat heads + project
        x = x.transpose(1, 2).contiguous().view(B, T, D)
        return self.W_o(x), weights

?Knowledge Check

Progress is saved in your browser — no account needed.

NLP: Text Classification Pipeline

Audio & Speech ML

Need an AI engineer or data scientist?

I build custom ML models, AI agents, computer vision, and automation — from idea to production.

Get in touch View services