RNN, LSTM & GRU — Sequence Modeling
“Teaching networks to remember — from catastrophic forgetting to selective gated memory”
Recurrent networks for sequences: vanilla RNN (BPTT, exploding/vanishing gradients), LSTM (forget/input/output gates, cell state), GRU (simplified gating), Bi-LSTM for bidirectional context.
Key Formulas
RNN Hidden State
Hidden state mixes previous memory with current input
LSTM Cell State
Cell state updated by forget gate and input gate
LSTM Hidden
Output gate controls what to expose from cell state
GRU Update
Single update gate interpolates old and new hidden state
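The four formulas named above can be written out as follows. This is a sketch in standard textbook notation: the symbols W_x, W_h, b, the update gate z_t, and the candidate state h̃_t are assumptions, since the page names the formulas without defining them. The LSTM cell-state line matches the equation used later in this section. GRU conventions differ between references (some swap the roles of z_t and 1 - z_t), so treat the last line as one common form.

```latex
\begin{aligned}
\text{RNN hidden state:} \quad & h_t = \tanh(W_x x_t + W_h h_{t-1} + b) \\
\text{LSTM cell state:}  \quad & c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
\text{LSTM hidden:}      \quad & h_t = o_t \odot \tanh(c_t) \\
\text{GRU update:}       \quad & h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```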
Why Sequences Are Hard
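Language, time series, audio, and DNA all have temporal dependencies: the meaning of one element depends on elements that came earlier. In 'He said he would come', the second 'he' refers back to the first and 'would' is conditioned on 'said', so the words cannot be interpreted in isolation. A feedforward network applied at each timestep sees only the current input and has no memory of earlier ones. RNNs share parameters across time and maintain a hidden state that summarizes past inputs, so in principle they can carry context across sequences of any length. The challenge is making that memory selective and long-range.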
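A minimal NumPy sketch of the idea of sharing parameters across time: the same weights are applied at every step, and the hidden state is the only thing carried forward. The names `rnn_forward`, `W_x`, `W_h`, and `b` are illustrative assumptions, not taken from the page.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, b, h0=None):
    """Run a vanilla RNN over one sequence xs of shape (seq_len, input_dim)."""
    hidden_dim = W_h.shape[0]
    h = np.zeros(hidden_dim) if h0 is None else h0
    states = []
    for x_t in xs:
        # Same W_x, W_h, b at every timestep: parameters are shared across time
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)  # (seq_len, hidden_dim)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

# The same parameters handle sequences of different lengths
short = rnn_forward(rng.normal(size=(5, input_dim)), W_x, W_h, b)
long = rnn_forward(rng.normal(size=(50, input_dim)), W_x, W_h, b)
print(short.shape, long.shape)  # (5, 4) (50, 4)
```

The parameter count does not depend on sequence length, which is what lets one model process 5 steps or 50.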
The Vanishing Gradient Over Time
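In BPTT (Backpropagation Through Time), the gradient is multiplied at each timestep by the Jacobian of the hidden-state update, which contains the recurrent weight matrix W (scaled by the derivative of the activation). Roughly speaking, if the largest singular value of W is < 1, gradients shrink exponentially with the number of steps and vanish; if it is well above 1, they can grow exponentially and explode. For a sequence of 100 timesteps, the gradient reaching timestep 1 has passed through roughly W¹⁰⁰. With common initializations and a saturating activation such as tanh, it usually vanishes. LSTMs mitigate this with the cell state, a 'highway' whose update is additive and gated elementwise instead of passing through a repeated matrix multiplication.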
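The effect of repeated multiplication can be sketched numerically. As a simplifying assumption, the example below uses a scaled orthogonal matrix for W, so every singular value equals the chosen `scale`, and it ignores the tanh derivative (which would only shrink the gradient further). The function name `backprop_norms` is hypothetical.

```python
import numpy as np

def backprop_norms(scale, steps=100, dim=8, seed=0):
    """Norm of a gradient vector after each of `steps` multiplications by W^T."""
    rng = np.random.default_rng(seed)
    # Orthogonal Q from a QR decomposition; W = scale * Q has all
    # singular values equal to `scale`
    Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    W = scale * Q
    grad = np.ones(dim)
    norms = []
    for _ in range(steps):
        grad = W.T @ grad  # one step of backpropagation through time
        norms.append(np.linalg.norm(grad))
    return norms

vanishing = backprop_norms(scale=0.9)
exploding = backprop_norms(scale=1.1)
print(f"scale 0.9 after 100 steps: {vanishing[-1]:.1e}")
print(f"scale 1.1 after 100 steps: {exploding[-1]:.1e}")
```

Because the matrix is a scaled rotation, the norm changes by exactly the factor `scale` per step: 0.9¹⁰⁰ is about 2.7e-5 and 1.1¹⁰⁰ is about 1.4e4. A trained W is not orthogonal, so real gradients follow this pattern only approximately.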
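LSTM gradients flow through c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t. Along this path, the gradient from c_t to c_{t-1} is scaled elementwise by the forget gate f_t, with no multiplication by a weight matrix, so when f_t stays close to 1 the gradient passes through many steps almost unchanged.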
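A short worked version of that argument, as a sketch: it follows only the direct cell-state path and treats the gates as constants, ignoring the indirect dependence of f_t, i_t, and c̃_t on earlier states through h_{t-1}.

```latex
\begin{aligned}
\frac{\partial c_t}{\partial c_{t-1}} &\approx \operatorname{diag}(f_t)
\quad\Longrightarrow\quad
\frac{\partial c_T}{\partial c_t} \approx \prod_{k=t+1}^{T} \operatorname{diag}(f_k) \\
\text{if every } f_k = 0.99: &\quad 0.99^{100} \approx 0.37
\qquad
\text{if every } f_k = 0.5: \quad 0.5^{100} \approx 7.9 \times 10^{-31}
\end{aligned}
```

So the cell state preserves gradients only to the extent that the network learns to keep the forget gate near 1 for the information it needs to retain.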
LSTM Gate Equations
Four computations determine what to forget, learn, and output at each step: three gates (forget, input, output) and one candidate cell state. The gates use sigmoid (output between 0 and 1 = 'how much of this to let through'). The candidate cell state uses tanh (output between -1 and 1 = actual content).
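The four computations can be sketched as a single LSTM step in NumPy. The stacking order of the weights (forget, input, output, candidate) and the names `W`, `U`, `b`, and `lstm_step` are assumptions made for this illustration; libraries use their own layouts (PyTorch, for example, orders the blocks input, forget, candidate, output).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step.

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) biases,
    each stacked in the assumed order [forget, input, output, candidate].
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f_t = sigmoid(z[0:H])              # forget gate: how much old memory to keep
    i_t = sigmoid(z[H:2 * H])          # input gate: how much new content to write
    o_t = sigmoid(z[2 * H:3 * H])      # output gate: how much memory to expose
    c_tilde = np.tanh(z[3 * H:4 * H])  # candidate content in (-1, 1)
    c_t = f_t * c_prev + i_t * c_tilde
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
D, H = 3, 4
W = rng.normal(scale=0.5, size=(4 * H, D))
U = rng.normal(scale=0.5, size=(4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(6, D)):  # run six timesteps
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```

Setting the forget-gate bias strongly positive and the input-gate bias strongly negative makes the step copy c_{t-1} forward almost unchanged, which is the 'highway' behaviour described earlier.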
LSTM for Time Series Forecasting
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

# ── Sample sequential dataloader ───────────────────────────────────────
# Shape: (n_samples, seq_len, features) → predict next value
X_seq = torch.randn(500, 20, 10)   # 500 samples, seq_len=20, 10 features
y_seq = torch.randn(500)           # 500 scalar targets
dataloader = DataLoader(TensorDataset(X_seq, y_seq), batch_size=32, shuffle=True)


class LSTMForecaster(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, output_dim, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_dim, hidden_dim, num_layers,
            batch_first=True, dropout=dropout, bidirectional=False
        )
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, h0=None, c0=None):
        # x: (batch, seq_len, features)
        # Pass an initial state only when both h0 and c0 are provided
        state = (h0, c0) if h0 is not None and c0 is not None else None
        out, (hn, cn) = self.lstm(x, state)
        # Use last timestep's output
        return self.fc(self.dropout(out[:, -1, :]))


# One epoch of supervised training (no teacher forcing or scheduled
# sampling is involved: the model maps a full input window to one target)
model = LSTMForecaster(input_dim=10, hidden_dim=128, num_layers=2, output_dim=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

model.train()
for x, y in dataloader:
    pred = model(x)                         # (batch, 1)
    loss = criterion(pred.squeeze(-1), y)   # squeeze only the output dim
    optimizer.zero_grad()
    loss.backward()
    # Clip the gradient norm to guard against exploding gradients,
    # a common safeguard when training RNNs
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()