Advanced Game Playing — Deep RL
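Double Dueling DQN with prioritized experience replay (PER, backed by a SumTree). CartPole-v1 solved at episode 300 (100-episode moving average 441.1, best evaluation 497.2/500). LunarLander-v3 solved at episode 207 (100-episode moving average 202). 134,275-parameter network with LayerNorm.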
CartPole solved: episode 300
CartPole best eval: 497.2 / 500
LunarLander solved: episode 207
Network params: 134,275
Dataset
CartPole-v1 + LunarLander-v3 (Gymnasium environments; there is no fixed dataset, transitions are collected by the agent)
Approach
Double DQN + Dueling DQN + PER (SumTree) + soft target updates, all four combined in one agent
Tech Stack
Python, PyTorch 2.10, Gymnasium 1.2.0, CUDA, NumPy
Keywords
Double DQN, Dueling DQN, PER, SumTree, CartPole, LunarLander, Gymnasium
Deep Dive
A Deep Q-Network agent that combines four standard DQN improvements: Double DQN targets, a dueling architecture, prioritized experience replay, and soft target updates.
Dueling DQN Architecture (134,275 params)
Input → Linear(256) → LayerNorm → ReLU
→ Value stream: Linear(256→128) → ReLU → Linear(128→1) = V(s)
→ Advantage stream: Linear(256→128) → ReLU → Linear(128→n_act) = A(s,a)
→ Q(s,a) = V(s) + (A(s,a) − mean over a′ of A(s,a′))
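The aggregation step and the parameter count can be checked with a few lines of plain Python. This is a minimal sketch, not the project's code: the function names and the `trunk_blocks` argument are hypothetical, and lists stand in for tensors. One observation from the arithmetic: the layers exactly as listed give 67,971 parameters for CartPole-v1 (4 inputs, 2 actions), while the stated 134,275 matches a trunk with a second Linear(256→256) → LayerNorm → ReLU block. That second block is an inference from the count, not something the page confirms.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layernorm_params(width):
    """Learnable elementwise scale and shift."""
    return 2 * width


def count_params(obs_dim, n_act, trunk_blocks=1, width=256, stream=128):
    """Count parameters of the dueling network described above.

    trunk_blocks is a hypothetical knob: 1 reproduces the layers as listed,
    2 assumes an extra Linear(256->256) + LayerNorm block in the trunk.
    """
    total = 0
    n_in = obs_dim
    for _ in range(trunk_blocks):
        total += linear_params(n_in, width) + layernorm_params(width)
        n_in = width
    total += linear_params(width, stream) + linear_params(stream, 1)      # value stream
    total += linear_params(width, stream) + linear_params(stream, n_act)  # advantage stream
    return total


def dueling_q(value, advantages):
    """Q(s,a) = V(s) + (A(s,a) - mean over actions of A(s,.))."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + (a - mean_adv) for a in advantages]


# CartPole-v1 has a 4-dimensional observation and 2 actions.
as_listed = count_params(obs_dim=4, n_act=2, trunk_blocks=1)       # 67971
with_extra_block = count_params(obs_dim=4, n_act=2, trunk_blocks=2)  # 134275
q_values = dueling_q(1.0, [2.0, 0.0])                              # [2.0, 0.0]
```

Subtracting the mean advantage makes the decomposition identifiable: without it, a constant could be shifted between V(s) and A(s,a) without changing Q(s,a).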
4 Techniques Combined
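| Technique | What It Does |
|---|---|
| Double DQN | Reduces Q-target overestimation bias: the online network selects the next action, the target network evaluates it |
| Dueling DQN | Estimates V(s) and A(s,a) in separate streams, so the state value is learned from every transition regardless of the action taken |
| PER (SumTree) | Samples high-TD-error transitions more often than uniform replay would |
| Soft target updates (τ=0.005) | Moves the target network slowly toward the online network, which stabilizes the Q-targets |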
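As an illustration of the Double DQN row, the sketch below computes targets for a small batch using plain Python lists in place of tensors. The function name and argument layout are assumptions made for this example; only γ=0.99 comes from the page.

```python
def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Double DQN target: r + gamma * Q_target(s', argmax_a Q_online(s', a)).

    q_online_next and q_target_next are per-transition lists of Q-values
    for the next state, one entry per action.
    """
    targets = []
    for r, done, q_on, q_tg in zip(rewards, dones, q_online_next, q_target_next):
        # Action selection uses the online network ...
        best_action = max(range(len(q_on)), key=q_on.__getitem__)
        # ... while evaluation uses the target network. Terminal states do not bootstrap.
        bootstrap = 0.0 if done else gamma * q_tg[best_action]
        targets.append(r + bootstrap)
    return targets


targets = double_dqn_targets(
    rewards=[1.0, 1.0],
    dones=[False, True],
    q_online_next=[[1.0, 2.0], [0.5, 0.1]],
    q_target_next=[[10.0, 3.0], [4.0, 5.0]],
)
# First transition: the online network picks action 1, so the target uses 3.0, not the max 10.0.
```

A vanilla DQN target would take the maximum of the target network's own estimates (10.0 in the first transition), which is where the overestimation comes from.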
Results
| Environment | Metric | Value |
|---|---|---|
| CartPole-v1 | Solved at episode | 300 |
| CartPole-v1 | MA-100 reward | 441.1 / 500 |
| CartPole-v1 | Best eval (20 ep) | 497.2 ± 12.2 |
| LunarLander-v3 | Solved at episode | 207 |
| LunarLander-v3 | MA-100 reward | 202 (threshold: 200) |
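The "solved at episode" rows depend on a 100-episode moving average (MA-100) crossing a threshold. A minimal sketch of that check follows; the function name is hypothetical, and the threshold is passed in explicitly because the page states it only for LunarLander-v3 (200) and does not say which value was used for CartPole-v1.

```python
from collections import deque


def first_solved_episode(rewards, threshold, window=100):
    """Return the 1-based episode at which the moving average over the last
    `window` episodes first reaches `threshold`, or None if it never does."""
    recent = deque(maxlen=window)
    running_sum = 0.0
    for episode, reward in enumerate(rewards, start=1):
        if len(recent) == window:
            running_sum -= recent[0]  # value about to be evicted by append
        recent.append(reward)
        running_sum += reward
        if len(recent) == window and running_sum / window >= threshold:
            return episode
    return None


# Synthetic rewards for illustration only: 100 poor episodes, then 100 good ones.
synthetic = [0.0] * 100 + [300.0] * 100
solved_at = first_solved_episode(synthetic, threshold=200.0)  # 167
```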
PER SumTree
A binary sum tree (a segment tree over transition priorities) gives O(log n) priority sampling and updates. β anneals from 0.4 to 1.0 over training to correct the importance-sampling bias that non-uniform sampling introduces.
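A compact sketch of such a tree in plain Python is shown below. It is illustrative rather than the project's implementation: the class layout, the stratified `sample_batch` helper, the linear β schedule, and the normalization of weights by the batch maximum are all assumptions (common choices, but the page only gives the β endpoints).

```python
import random


class SumTree:
    """Array-backed binary tree: leaves hold priorities, parents hold sums."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.nodes = [0.0] * (2 * capacity - 1)  # internal nodes first, then leaves
        self.data = [None] * capacity
        self.write = 0

    @property
    def total(self):
        return self.nodes[0]

    def update(self, leaf, priority):
        """Set one leaf's priority and propagate the change to the root: O(log n)."""
        idx = leaf + self.capacity - 1
        change = priority - self.nodes[idx]
        while True:
            self.nodes[idx] += change
            if idx == 0:
                break
            idx = (idx - 1) // 2

    def add(self, priority, item):
        """Store an item, overwriting the oldest slot once the buffer is full."""
        self.data[self.write] = item
        self.update(self.write, priority)
        self.write = (self.write + 1) % self.capacity

    def find(self, mass):
        """Descend to the leaf whose cumulative priority range contains mass: O(log n)."""
        idx = 0
        while idx < self.capacity - 1:  # still at an internal node
            left = 2 * idx + 1
            if mass < self.nodes[left]:
                idx = left
            else:
                mass -= self.nodes[left]
                idx = left + 1
        leaf = idx - (self.capacity - 1)
        return leaf, self.nodes[idx], self.data[leaf]


def sample_batch(tree, batch_size, rng=random):
    """Stratified sampling (assumed): one draw from each equal slice of the total."""
    segment = tree.total / batch_size
    return [tree.find((i + rng.random()) * segment) for i in range(batch_size)]


def beta_at(step, total_steps, beta_start=0.4, beta_end=1.0):
    """Linear anneal of beta; the schedule shape is an assumption."""
    frac = min(1.0, step / total_steps)
    return beta_start + frac * (beta_end - beta_start)


def is_weights(priorities, total, n_stored, beta):
    """Importance-sampling weights (N * P(i))^(-beta), scaled by the batch maximum."""
    weights = [(n_stored * p / total) ** (-beta) for p in priorities]
    max_w = max(weights)
    return [w / max_w for w in weights]
```

At β=1 the weights fully compensate for the sampling bias; starting at 0.4 applies a weaker correction early in training, when the Q-estimates are still changing quickly.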
Hyperparameters
lr=1e-4, γ=0.99, τ=0.005, buffer=100K, batch=64, ε: 1.0→0.01
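To make the role of τ concrete, the sketch below applies the soft (Polyak) target update to a toy one-parameter "network". The dictionary keys and function name are illustrative, not taken from the project; the values are the ones listed above. The ε decay schedule is left out because the page gives only its endpoints.

```python
# Values from the hyperparameter list; key names are hypothetical.
HYPERPARAMS = {
    "lr": 1e-4,
    "gamma": 0.99,
    "tau": 0.005,
    "buffer_size": 100_000,
    "batch_size": 64,
    "eps_start": 1.0,
    "eps_end": 0.01,
}


def soft_update(target_params, online_params, tau):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


# Toy example: a single parameter, with the online value held fixed at 1.0.
target, online = [0.0], [1.0]
for _ in range(1000):
    target = soft_update(target, online, HYPERPARAMS["tau"])
gap = online[0] - target[0]  # equals (1 - tau) ** 1000, i.e. under 1% of the initial gap
```

With τ=0.005 the gap between target and online parameters shrinks by a factor of 0.995 per update, so the target network lags the online network by a few hundred updates instead of jumping at fixed intervals as in a hard update.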