ML Learning Hub
Deep Learning · advanced

RNN, LSTM & GRU — Sequence Modeling

Teaching networks to remember — from catastrophic forgetting to selective gated memory

Recurrent networks for sequences: vanilla RNN (BPTT, exploding/vanishing gradients), LSTM (forget/input/output gates, cell state), GRU (simplified gating), Bi-LSTM for bidirectional context.

70 min
8 diagrams
7 Concepts Covered

Prerequisites

Neural Networks
Deep Learning Optimization

Concepts Covered

BPTT, Vanishing Gradient, LSTM Gates, Cell State, GRU, Bi-LSTM, Sequence-to-Sequence

Key Formulas

RNN Hidden State
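
In a common notation, with the weight names W_{xh}, W_{hh} and the bias b_h assumed here for illustration:

```latex
h_t = \tanh\left(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h\right)
```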

Hidden state mixes previous memory with current input

LSTM Cell State
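
Using the same symbols as the gradient discussion further down this page, where ⊙ denotes elementwise multiplication:

```latex
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
```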

Cell state updated by forget gate and input gate

LSTM Hidden
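
In the standard formulation, with o_t denoting the output gate:

```latex
h_t = o_t \odot \tanh(c_t)
```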

Output gate controls what to expose from cell state

GRU Update
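
One common convention, assumed here, lets the update gate z_t weight the new candidate h̃_t. Some references and libraries (PyTorch among them) swap the roles of z_t and 1 - z_t, which is equivalent up to relabeling:

```latex
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
```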

Single update gate interpolates old and new hidden state

🎯

Why Sequences Are Hard

motivation

Language, time series, audio, DNA: all of these have temporal dependencies. In 'He said he would come', 'would' takes its tense from 'said' two words earlier; in longer sentences such links can span dozens of words. A feedforward network applied per timestep processes each input independently, so it cannot see these links. RNNs share parameters across time and maintain a hidden state that summarizes past inputs, which in principle allows unbounded context. The challenge is making that memory selective and long-range in practice.

💡

The Vanishing Gradient Over Time

intuition

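In BPTT (Backpropagation Through Time), the gradient is multiplied at each timestep by the Jacobian of the hidden-state update, which contains the recurrent weight matrix W and the derivative of the activation. If the largest singular value of W is below 1, gradients vanish exponentially; if it is well above 1, they can explode. For a sequence of 100 timesteps, a gradient reaching timestep 1 has passed through roughly 100 such factors, comparable to multiplying by W¹⁰⁰. With common initializations and saturating activations, vanishing is the usual outcome. LSTMs mitigate this with the cell state, a 'highway' whose update is additive and whose only multiplicative factor is the elementwise forget gate rather than a full matrix product.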
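
As a rough illustration of the exponential behaviour described above, the sketch below uses a scalar linear recurrence h_t = w * h_{t-1}. That is an assumption made purely to keep the arithmetic visible: in a real RNN the per-step factor is a Jacobian matrix that also includes the activation derivative. The function name `gradient_factor` is hypothetical.

```python
def gradient_factor(w: float, steps: int) -> float:
    """Return d h_T / d h_0 for the scalar linear recurrence h_t = w * h_{t-1}.

    Backpropagating through `steps` timesteps multiplies the gradient by w
    once per step, so the result equals w ** steps.
    """
    grad = 1.0
    for _ in range(steps):
        grad *= w  # one chain-rule factor per timestep
    return grad


if __name__ == "__main__":
    for w in (0.9, 1.0, 1.1):
        print(f"w={w}: factor over 100 steps = {gradient_factor(w, 100):.3e}")
```

A per-step factor only 10% below 1 shrinks the gradient by more than four orders of magnitude over 100 steps, and a factor 10% above 1 grows it by roughly as much.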

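LSTM gradients flow through c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t. Along this path the gradient from c_t to c_{t-1} is scaled elementwise by the forget gate f_t instead of being multiplied by a weight matrix, so when f_t stays close to 1 the cell-state gradient is largely preserved across many steps.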
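
To make this argument concrete, consider only the direct path through the cell state and treat the gates as constants with respect to c_{t-1}. This is a simplification: the gates also depend on h_{t-1}, which contributes further terms not shown here.

```latex
\frac{\partial c_t}{\partial c_{t-1}} \approx \operatorname{diag}(f_t)
\qquad\Longrightarrow\qquad
\frac{\partial c_T}{\partial c_k} \approx \prod_{t=k+1}^{T} \operatorname{diag}(f_t)
```

With forget gates near 1 this product stays close to the identity. For example, f_t = 0.99 held over 100 steps gives about 0.37 per entry, whereas a per-step factor of 0.9 gives about 2.7e-5.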

LSTM Gate Equations

math

Four computations determine what to forget, write, and output at each step: three gates and one candidate. The gates use a sigmoid (output between 0 and 1, read as 'how much of this to let through'). The candidate cell state uses tanh (output between -1 and 1, the actual content).
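
In a common notation, the four computations read as follows. The weight and bias names (W_f, U_f, b_f, and so on) are assumptions for illustration, and σ is the logistic sigmoid:

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right)
\end{aligned}
```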

LSTM: Forget (f), Input (i), Cell candidate (c̃), Output (o) gates
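
A minimal sketch of one LSTM step for a single scalar unit, written in plain Python so that every gate is visible. The function `lstm_step` and the layout of the parameter dictionary are hypothetical, chosen for illustration; real implementations such as `nn.LSTM` operate on matrices and batches.

```python
import math


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def lstm_step(x, h_prev, c_prev, p):
    """One timestep of a scalar LSTM unit.

    `p` maps each of "f", "i", "c", "o" to a (w_x, w_h, bias) tuple.
    Returns the new hidden state and the new cell state.
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("f"))          # forget gate: how much of c_prev to keep
    i = sigmoid(pre("i"))          # input gate: how much new content to write
    c_tilde = math.tanh(pre("c"))  # candidate content in (-1, 1)
    o = sigmoid(pre("o"))          # output gate: how much of the cell to expose

    c = f * c_prev + i * c_tilde   # cell state update
    h = o * math.tanh(c)           # hidden state
    return h, c


if __name__ == "__main__":
    # Large forget bias and very negative input bias: the cell should
    # hold its stored value almost unchanged, whatever the inputs are.
    params = {"f": (0.0, 0.0, 10.0), "i": (0.0, 0.0, -10.0),
              "c": (1.0, 0.0, 0.0), "o": (0.0, 0.0, 10.0)}
    h, c = 0.0, 0.8
    for x in (1.0, -1.0, 0.5):
        h, c = lstm_step(x, h, c, params)
    print(f"cell state after 3 steps: {c:.4f}")
```

Swapping the two biases (forget gate closed, input gate open) makes the cell overwrite its content with the candidate instead, which is the selective behaviour the gates are meant to provide.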

LSTM for Time Series Forecasting

code
python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

# ── Sample sequential dataloader ───────────────────────────────────────
# Shape: (n_samples, seq_len, features) → predict next value
X_seq = torch.randn(500, 20, 10)          # 500 samples, seq_len=20, 10 features
y_seq = torch.randn(500)                   # 500 scalar targets
dataloader = DataLoader(TensorDataset(X_seq, y_seq), batch_size=32, shuffle=True)

class LSTMForecaster(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, output_dim, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_dim, hidden_dim, num_layers,
            batch_first=True, dropout=dropout,
            bidirectional=False
        )
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, h0=None, c0=None):
        # x: (batch, seq_len, features)
        # Pass an initial state only if both h0 and c0 are given;
        # otherwise nn.LSTM starts from zeros
        state = (h0, c0) if h0 is not None and c0 is not None else None
        out, (hn, cn) = self.lstm(x, state)
        # Use last timestep's output
        return self.fc(self.dropout(out[:, -1, :]))

# Plain supervised training: one forward pass per batch, single-step target
# (no teacher forcing or scheduled sampling is involved here)
model = LSTMForecaster(input_dim=10, hidden_dim=128, num_layers=2, output_dim=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Gradient clipping guards against exploding gradients in RNN training
model.train()
for x, y in dataloader:
    pred = model(x)                        # (batch, 1)
    loss = criterion(pred.squeeze(-1), y)  # drop only the output dim, keep batch dim
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
