Reinforcement Learning
“An agent learns by doing — maximising cumulative reward through trial-and-error interaction”
MDP formalism, Bellman equations, Q-learning, Deep Q-Networks (DQN), policy gradients (REINFORCE), and PPO — with an interactive grid-world visualization showing Q-values converge over 200 episodes.
Prerequisites
Concepts Covered
∑Key Formulas
Bellman Equation
Optimal Q-value: immediate reward + discounted best future value. The recursive target Q-learning converges to
Q-Learning Update
Move current estimate toward the Bellman target by step size α (temporal difference learning)
Policy Gradient (REINFORCE)
Increase probability of actions that led to high returns, decrease for low-return actions
PPO Clipped Objective
Proximal Policy Optimization: clips the policy update ratio to stay within [1-ε, 1+ε] — stable large-batch training
▶Interactive Simulation
Learning by Doing — the RL Paradigm
Supervised learning needs labelled examples. Reinforcement learning needs only a reward signal — feedback on whether an action was good or bad. This is how humans learn to walk, play games, and drive: through practice, feedback, and gradually improving strategy. RL successes: AlphaGo defeated the world Go champion (2016) — trained by RL after supervised pre-training. AlphaFold's structure refinement uses RL-like optimization. ChatGPT / GPT-4 use RLHF (Reinforcement Learning from Human Feedback) to align model outputs with human preferences. Robotic manipulation systems (Boston Dynamics, Figure AI) train policies in simulation then transfer to real hardware.
DeepMind's AlphaZero learned Chess, Go, and Shogi to superhuman level in 24 hours of self-play — no human game records, pure RL from random play and the reward of winning.
The RL Framework: Agent, Environment, MDP
**Markov Decision Process (MDP):** (S, A, T, R, γ) — States S, Actions A, Transition function T(s'|s,a), Reward R(s,a,s'), discount γ∈[0,1]. **Agent:** At each timestep t, observes state sₜ, takes action aₜ according to policy π(aₜ|sₜ), receives reward rₜ. **Goal:** Find policy π* that maximises expected discounted return Gₜ = Σᵢ γⁱrₜ₊ᵢ. **Q-function Q(s,a):** Expected return starting from state s, taking action a, then following the optimal policy. If we knew Q*, we'd always pick argmax_a Q*(s,a). **Key distinction:** Model-free RL (Q-learning, PPO) learns directly from interaction without knowing T. Model-based RL (Dyna, MuZero) learns a world model T̂ and plans in imagination.
The discount factor γ controls myopia: γ=0.99 makes the agent care about rewards 100 steps in the future; γ=0 is purely greedy (only cares about immediate reward). Most problems need γ∈[0.95, 0.999].
Q-Learning: Tabular and Deep (DQN)
Initialize Q-table Q(s,a)=0 for all state-action pairs (tabular) or Q-network θ (DQN).
For each episode: reset environment, observe initial state s.
ε-greedy action selection: with probability ε pick random action (exploration), else pick argmax_a Q(s,a) (exploitation). Anneal ε from 1.0 → 0.05 over training.
Execute action a, observe reward r and next state s'. Store (s,a,r,s') in replay buffer (DQN).
TD update: Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') - Q(s,a)].
DQN extras: (1) Experience replay — sample random minibatches from buffer to break correlations. (2) Target network — use a frozen copy Q_target(θ⁻) as the update target, sync every N steps. (3) Double DQN — decouple action selection from value estimation to reduce overestimation.
Convergence: tabular Q-learning converges to Q* for finite MDPs with sufficient exploration and decaying α. Neural Q-networks are not guaranteed but work well in practice.
Q-Learning and DQN from Scratch
import numpy as np import gymnasium as gym from collections import deque import torch, torch.nn as nn, torch.optim as optim class="tok-comment"># ── class="tok-num">1. Tabular Q-Learning (FrozenLake) ─────────────────────────────────────── env = gym.make(class="tok-str">"FrozenLake-v1", is_slippery=False) n_states, n_actions = env.observation_space.n, env.action_space.n Q = np.zeros((n_states, n_actions)) alpha = class="tok-num">0.8 class="tok-comment"># learning rate gamma = class="tok-num">0.95 class="tok-comment"># discount epsilon = class="tok-num">1.0 class="tok-comment"># exploration rate for episode in range(class="tok-num">2000): s, _ = env.reset() done = False while not done: class="tok-comment"># ε-greedy if np.random.random() < epsilon: a = env.action_space.sample() else: a = np.argmax(Q[s]) s_next, r, terminated, truncated, _ = env.step(a) done = terminated or truncated class="tok-comment"># Q-learning update (Bellman target) td_target = r + gamma * np.max(Q[s_next]) * (class="tok-num">1 - terminated) Q[s, a] += alpha * (td_target - Q[s, a]) s = s_next epsilon = max(class="tok-num">0.01, epsilon * class="tok-num">0.999) class="tok-comment"># decay exploration class="tok-comment"># Evaluate greedy policy n_wins = class="tok-num">0 for _ in range(class="tok-num">100): s, _ = env.reset() done = False while not done: a = np.argmax(Q[s]) s, r, terminated, truncated, _ = env.step(a) done = terminated or truncated if r == class="tok-num">1.0: class="tok-comment"># reached the goal n_wins += class="tok-num">1 print(fclass="tok-str">"Win rate: {n_wins/class="tok-num">100:.class="tok-num">0%}") class="tok-comment"># ── class="tok-num">2. Deep Q-Network (CartPole) ───────────────────────────────────────────── class DQN(nn.Module): def __init__(self, obs_dim, n_actions): super().__init__() self.net = nn.Sequential( nn.Linear(obs_dim, class="tok-num">128), nn.ReLU(), nn.Linear(class="tok-num">128, class="tok-num">128), nn.ReLU(), nn.Linear(class="tok-num">128, n_actions) ) def forward(self, x): return self.net(x) env = gym.make(class="tok-str">"CartPole-v1") obs_dim = env.observation_space.shape[class="tok-num">0] n_actions = env.action_space.n q_net = DQN(obs_dim, n_actions) q_target = DQN(obs_dim, n_actions) class="tok-comment"># frozen target network q_target.load_state_dict(q_net.state_dict()) optimizer = optim.Adam(q_net.parameters(), lr=class="tok-num">1e-3) replay = deque(maxlen=10_000) class="tok-comment"># experience replay buffer gamma = class="tok-num">0.99 epsilon = class="tok-num">1.0 batch_size = class="tok-num">64 target_update_freq = class="tok-num">100 steps = class="tok-num">0 for episode in range(class="tok-num">500): s, _ = env.reset() total_reward = class="tok-num">0 done = False while not done: class="tok-comment"># ε-greedy action if np.random.random() < epsilon: a = env.action_space.sample() else: with torch.no_grad(): a = q_net(torch.FloatTensor(s)).argmax().item() s_next, r, terminated, truncated, _ = env.step(a) done = terminated or truncated replay.append((s, a, r, s_next, terminated)) s = s_next; total_reward += r; steps += class="tok-num">1 class="tok-comment"># Train when buffer has enough samples if len(replay) >= batch_size: batch = [replay[i] for i in np.random.choice(len(replay), batch_size, replace=False)] S, A, R, S_next, D = map(np.array, zip(*batch)) S_t = torch.FloatTensor(S); A_t = torch.LongTensor(A) R_t = torch.FloatTensor(R); S_next_t = torch.FloatTensor(S_next) D_t = torch.FloatTensor(D) with torch.no_grad(): max_next_q = q_target(S_next_t).max(class="tok-num">1)[class="tok-num">0] target = R_t + gamma * max_next_q * (class="tok-num">1 - D_t) current_q = q_net(S_t).gather(class="tok-num">1, A_t.unsqueeze(class="tok-num">1)).squeeze() loss = nn.functional.mse_loss(current_q, target) optimizer.zero_grad(); loss.backward(); optimizer.step() if steps % target_update_freq == class="tok-num">0: q_target.load_state_dict(q_net.state_dict()) epsilon = max(class="tok-num">0.05, epsilon * class="tok-num">0.995) if episode % class="tok-num">50 == class="tok-num">0: print(fclass="tok-str">"Episode {episode:4d} | reward={total_reward:class="tok-num">5.0f} | ε={epsilon:.3f}")
Policy Gradient and Actor-Critic (A2C/PPO)
Policy gradient methods directly optimise the policy π_θ instead of a value function. The actor-critic architecture uses two networks: the actor π_θ selects actions, the critic V_φ estimates state values as a baseline to reduce gradient variance. PPO (Proximal Policy Optimization) adds a clipping mechanism to prevent destructively large policy updates — the ratio r_t(θ) = π_θ(a|s)/π_θ_old(a|s) is clipped to [1-ε, 1+ε]. This makes PPO sample-efficient enough for practical use and is the standard algorithm in RLHF for LLMs.
The Deadly Triad and Common RL Failure Modes
The deadly triad (Sutton & Barto): combining (1) function approximation (neural networks), (2) bootstrapping (TD updates), and (3) off-policy learning can cause divergence. DQN avoids this with replay buffers and target networks but training can still be unstable. Practical RL pitfalls: (1) Reward shaping can introduce unintended shortcuts — the boat racing agent that learned to spin in circles collecting bonuses. (2) Sparse rewards make exploration nearly impossible — use intrinsic motivation (curiosity) or reward shaping carefully. (3) Sim-to-real gap — policies trained in simulation fail on real robots due to unmodelled physics. (4) Evaluation with fixed seeds is misleading — RL results have high variance; report mean ± std over 10+ seeds.
OpenAI's boat racing agent trained with shaping rewards learned to drive in circles collecting bonus items instead of finishing the race — a famous reward hacking failure.
?Knowledge Check
Progress is saved in your browser — no account needed.
Need an AI engineer or data scientist?
I build custom ML models, AI agents, computer vision, and automation — from idea to production.