Deep Learning (advanced)

RNN, LSTM & GRU — Sequence Modeling

“Teaching networks to remember — from catastrophic forgetting to selective gated memory”

Recurrent networks for sequences: vanilla RNN (BPTT, exploding/vanishing gradients), LSTM (forget/input/output gates, cell state), GRU (simplified gating), Bi-LSTM for bidirectional context.

70 min

8 diagrams

7 Concepts Covered

Prerequisites

→Neural Networks

→Deep Learning Optimization

Concepts Covered

BPTT, Vanishing Gradient, LSTM Gates, Cell State, GRU, Bi-LSTM, Sequence-to-Sequence

Key Formulas

RNN Hidden State

Hidden state mixes previous memory with current input

LSTM Cell State

Cell state updated by forget gate and input gate

LSTM Hidden

Output gate controls what to expose from cell state

GRU Update

Single update gate interpolates old and new hidden state
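
The four updates described above can be written in their commonly used form as follows. The weight names (W_{xh}, W_{hh}, b_h) and the GRU symbols (z_t for the update gate, \tilde{h}_t for the candidate state) are notation assumed here, not taken from the original page, and some texts swap the roles of z_t and 1 - z_t in the GRU update.

```latex
\begin{aligned}
h_t &= \tanh\left(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h\right) && \text{(RNN hidden state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(LSTM cell state)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(LSTM hidden state)} \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t && \text{(GRU update)}
\end{aligned}
```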

Why Sequences Are Hard

motivation

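Language, time series, audio, and DNA all have temporal dependencies. In 'He said he would come', the second 'he' and 'would' only make sense in light of the words before them, and in longer text such linked words can sit dozens of tokens apart. A feedforward network applied one timestep at a time processes each input independently and has no way to carry that context forward. RNNs share parameters across time and maintain a hidden state that summarizes past inputs, which in principle allows context of any length. The challenge is making that memory selective and long-range in practice.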

The Vanishing Gradient Over Time

intuition

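In BPTT (Backpropagation Through Time), the gradient is multiplied at each timestep by the Jacobian of the recurrence, which for a tanh RNN is the transposed recurrent weight matrix W scaled by the activation derivative. If the largest singular value of W is below 1, gradients shrink exponentially with the number of steps. If it is above 1, they can grow exponentially (explode). For a sequence of 100 timesteps, the gradient reaching timestep 1 has passed through roughly 100 such factors, on the order of W¹⁰⁰. Because the tanh derivative is at most 1 and common initializations keep W small, vanishing is the more frequent outcome in practice. LSTMs mitigate this with the cell state: a 'highway' whose update is an addition scaled element-wise by a gate, rather than a repeated multiplication by a weight matrix.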
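
The effect of repeated multiplication can be sketched numerically. The example below is a simplification under stated assumptions: it uses a linear recurrence h_t = W h_{t-1} (no tanh), and it builds W as a scaled orthogonal matrix so that every singular value equals `scale`. The function name `gradient_norm_through_time` is hypothetical.

```python
import torch

torch.manual_seed(0)

def gradient_norm_through_time(scale, hidden_dim=32, steps=100):
    """Norm of a gradient after `steps` backward steps through h_t = W h_{t-1}."""
    # QR of a random matrix gives an orthogonal Q (all singular values 1);
    # multiplying by `scale` sets every singular value of W to `scale`.
    q, _ = torch.linalg.qr(torch.randn(hidden_dim, hidden_dim))
    W = scale * q
    g = torch.ones(hidden_dim) / hidden_dim ** 0.5  # unit-norm gradient at the last step
    for _ in range(steps):
        g = W.T @ g  # one backward step: multiply by the transposed Jacobian
    return g.norm().item()

for scale in (0.9, 1.0, 1.1):
    print(f"scale={scale}: gradient norm after 100 steps = {gradient_norm_through_time(scale):.3e}")
```

With this construction the norm is simply scale raised to the number of steps, so a value slightly below 1 collapses toward zero and a value slightly above 1 grows by several orders of magnitude. A real tanh RNN adds the activation derivative to each factor, which can only shrink the gradient further.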

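LSTM gradients flow through c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t. Along this direct path, the derivative of c_t with respect to c_{t-1} is just f_t (element-wise), so while the forget gate stays close to 1 the cell-state gradient is passed back almost unchanged. The update is a gated addition, not a matrix multiplication followed by a squashing nonlinearity.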
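
A small autograd check can make the cell-state path concrete. This is a sketch under one loud assumption: the gate activations are treated as fixed constants, whereas in a real LSTM they depend on h_{t-1} and x_t and therefore open additional gradient paths. Under that assumption, the gradient of the final cell state with respect to the initial one should equal the element-wise product of the forget gates.

```python
import torch

torch.manual_seed(0)
steps, dim = 50, 4

c0 = torch.randn(dim, requires_grad=True)

# Assumption for illustration: gates are constants, not functions of h and x.
f = torch.full((steps, dim), 0.98)               # forget gates close to 1
i = torch.rand(steps, dim)                       # input gates in (0, 1)
c_tilde = torch.tanh(torch.randn(steps, dim))    # candidate values in (-1, 1)

# Unroll the cell-state recurrence c_t = f_t * c_{t-1} + i_t * c_tilde_t
c = c0
for t in range(steps):
    c = f[t] * c + i[t] * c_tilde[t]

c.sum().backward()

expected = f.prod(dim=0)   # product of forget gates over all steps
print("autograd d c_T / d c_0:", c0.grad)
print("product of forget gates:", expected)
```

With forget gates at 0.98, about a third of the gradient survives 50 steps. Lowering the constant toward 0.5 makes the same product shrink to nearly zero, which is why the forget gate, not the addition alone, determines how long the memory lasts.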

LSTM Gate Equations

math

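Three gates and one candidate computation determine what to forget, what to write, and what to output at each step. The three gates (forget, input, output) use a sigmoid, whose output in (0, 1) acts as 'how much of this to let through'. The candidate cell state uses tanh, whose output in (-1, 1) is the actual content to be written.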
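
In the commonly used formulation, these computations can be written as below. The parameter names (W for input weights, U for recurrent weights, b for biases) are notation assumed here. Some implementations instead concatenate h_{t-1} and x_t and use a single weight matrix per gate.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\, x_t + U_f\, h_{t-1} + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i\, x_t + U_i\, h_{t-1} + b_i\right) && \text{(input gate)} \\
o_t &= \sigma\left(W_o\, x_t + U_o\, h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c\, x_t + U_c\, h_{t-1} + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```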

LSTM for Time Series Forecasting

code

python

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

# ── Sample sequential dataloader ───────────────────────────────────────
# Shape: (n_samples, seq_len, features) → predict next value
X_seq = torch.randn(500, 20, 10)          # 500 samples, seq_len=20, 10 features
y_seq = torch.randn(500)                   # 500 scalar targets
dataloader = DataLoader(TensorDataset(X_seq, y_seq), batch_size=32, shuffle=True)

class LSTMForecaster(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, output_dim, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_dim, hidden_dim, num_layers,
            batch_first=True, dropout=dropout,
            bidirectional=False
        )
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, h0=None, c0=None):
        # x: (batch, seq_len, features)
        out, (hn, cn) = self.lstm(x, (h0, c0) if h0 is not None else None)
        # Use last timestep's output
        return self.fc(self.dropout(out[:, -1, :]))

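# Plain supervised training: each input window is mapped to its next-value target.
# (Teacher forcing and scheduled sampling apply to multi-step decoders, not to this setup.)
model = LSTMForecaster(input_dim=10, hidden_dim=128, num_layers=2, output_dim=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Gradient clipping guards against exploding gradients during BPTT
model.train()
for x, y in dataloader:
    pred = model(x)                          # (batch, 1)
    # squeeze only the output dim so a batch of size 1 keeps its batch axis
    loss = criterion(pred.squeeze(-1), y)
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()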
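
The forecaster above sets `bidirectional=False`. As a hedged sketch of the Bi-LSTM variant mentioned in the overview, the example below shows what has to change when the flag is turned on: the LSTM output width doubles, so the linear head needs `2 * hidden_dim` inputs. The class name `BiLSTMForecaster` is hypothetical, and reading the summary from `hn` rather than `out[:, -1, :]` is a design choice assumed here, because at the last timestep the backward direction has seen only a single input. A bidirectional model only makes sense when the whole input window is available at prediction time.

```python
import torch
import torch.nn as nn

class BiLSTMForecaster(nn.Module):
    """Hypothetical bidirectional variant of the forecaster above."""

    def __init__(self, input_dim, hidden_dim, num_layers, output_dim, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_dim, hidden_dim, num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0.0,  # inter-layer dropout needs >1 layer
            bidirectional=True,
        )
        # Forward and backward states are concatenated, so the head sees 2 * hidden_dim
        self.fc = nn.Linear(2 * hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch, seq_len, features)
        out, (hn, cn) = self.lstm(x)
        # hn: (num_layers * 2, batch, hidden_dim); the last two entries are the
        # top layer's final forward state and final backward state
        summary = torch.cat([hn[-2], hn[-1]], dim=-1)
        return self.fc(self.dropout(summary))

torch.manual_seed(0)
model = BiLSTMForecaster(input_dim=10, hidden_dim=128, num_layers=2, output_dim=1).eval()
x = torch.randn(4, 20, 10)   # 4 windows, seq_len=20, 10 features
with torch.no_grad():
    pred = model(x)
print(pred.shape)  # torch.Size([4, 1])
```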
