Reinforcement Learning

Advanced Game Playing — Deep RL

Double Dueling DQN with prioritized experience replay (PER, backed by a SumTree). CartPole-v1 solved at episode 300 (MA-100 = 441.1, best eval 497.2/500). LunarLander-v3 solved at episode 207 (MA-100 = 202). 134,275-parameter network with LayerNorm.

View on Kaggle

CartPole solved: episode 300

CartPole best eval: 497.2 / 500

LunarLander solved: episode 207

Network params: 134,275

Dataset

CartPole-v1 and LunarLander-v3 environments (Gymnasium, the Farama Foundation's successor to OpenAI Gym)

Approach

Double DQN + Dueling DQN + PER (SumTree) + soft target updates, all 4 improvements combined in one agent

Tech Stack

Python, PyTorch 2.10, Gymnasium 1.2.0, CUDA, NumPy

Keywords

Double DQN, Dueling DQN, PER, SumTree, CartPole, LunarLander, Gymnasium

Visualizations: 6 charts

Deep Dive

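A Deep Q-Network agent that combines four standard DQN improvements: Double DQN, a dueling architecture, prioritized experience replay (PER) backed by a SumTree, and soft target updates.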

Dueling DQN Architecture (134,275 params)

Input → Linear(256) → LayerNorm → ReLU
→ Value stream:     Linear(256→128) → ReLU → Linear(128→1)      = V(s)
→ Advantage stream: Linear(256→128) → ReLU → Linear(128→n_act)  = A(s,a)
→ Q(s,a) = V(s) + (A(s,a) − mean over a′ of A(s,a′))
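The parameter count and the dueling aggregation can be checked with plain arithmetic. The sketch below is not the project's code: the helper names and the `n_shared` argument are hypothetical. One observation worth flagging: for CartPole-v1 (4 observations, 2 actions) the single shared Linear(256) + LayerNorm block drawn in the diagram gives 67,971 parameters, while two such blocks give exactly 134,275. The sketch therefore assumes two shared blocks, which the page itself does not state.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layernorm_params(width):
    """LayerNorm learns one scale and one shift per feature."""
    return 2 * width


def dueling_param_count(obs_dim, n_actions, n_shared=2, width=256, stream=128):
    """Parameter count of a dueling MLP. n_shared is an assumption."""
    total, n_in = 0, obs_dim
    for _ in range(n_shared):
        total += linear_params(n_in, width) + layernorm_params(width)
        n_in = width
    value = linear_params(width, stream) + linear_params(stream, 1)
    advantage = linear_params(width, stream) + linear_params(stream, n_actions)
    return total + value + advantage


def dueling_q(value, advantages):
    """Q(s,a) = V(s) + A(s,a) - mean over actions of A(s,.)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


# CartPole-v1: 4 observation features, 2 actions.
cartpole_two_blocks = dueling_param_count(obs_dim=4, n_actions=2, n_shared=2)
cartpole_one_block = dueling_param_count(obs_dim=4, n_actions=2, n_shared=1)
q_values = dueling_q(value=1.0, advantages=[0.5, -0.5])

print(cartpole_two_blocks)  # 134275
print(cartpole_one_block)   # 67971
print(q_values)             # [1.5, 0.5]
```

Subtracting the mean advantage makes the decomposition identifiable: without it, a constant could be moved freely between V(s) and A(s,a) without changing Q(s,a).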

4 Techniques Combined

Technique	What It Fixes
Double DQN	Q-target overestimation bias
Dueling DQN	Separate V(s) and A(s,a) estimation
PER (SumTree)	Sample high-TD-error transitions more often
Soft target updates τ=0.005	Stable Q-target convergence
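To make the first row of the table concrete, here is a minimal sketch of the Double DQN target next to the vanilla DQN target, using plain Python lists in place of network outputs. The function names and the example Q-values are illustrative assumptions; only γ=0.99 comes from the hyperparameters listed on this page.

```python
def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """Online network selects the next action, target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


def vanilla_dqn_target(reward, done, next_q_target, gamma=0.99):
    """Target network both selects and evaluates, which biases the max upward."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


# Hypothetical Q-values for the next state (2 actions).
next_q_online = [1.0, 2.0]   # online net prefers action 1
next_q_target = [3.0, 0.5]   # target net overrates action 0

double_t = double_dqn_target(1.0, False, next_q_online, next_q_target)
vanilla_t = vanilla_dqn_target(1.0, False, next_q_target)
print(double_t, vanilla_t)
```

In this example the vanilla target follows the target network's inflated estimate for action 0, while the Double DQN target uses the target network only to score the action that the online network chose.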

Results

Environment	Metric	Value
CartPole-v1	Solved at episode	300
CartPole-v1	MA-100 reward	441.1 / 500
CartPole-v1	Best eval (20 ep)	497.2 ± 12.2
LunarLander-v3	Solved at episode	207
LunarLander-v3	MA-100 reward	202 (threshold: 200)
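The "solved at episode" rows depend on how the moving average is checked. A minimal sketch of one way to do it follows, assuming "solved" means the first episode at which the mean reward over the last 100 episodes reaches a threshold. The function name is hypothetical, the 200 threshold for LunarLander is taken from the table, and the threshold used for CartPole is not stated on this page, so it is left as a parameter.

```python
from collections import deque


def first_solved_episode(rewards, threshold, window=100):
    """Return the 1-based episode where the moving average first reaches threshold."""
    recent = deque(maxlen=window)
    running = 0.0
    for episode, reward in enumerate(rewards, start=1):
        if len(recent) == window:
            running -= recent[0]      # value about to be evicted by append
        recent.append(reward)
        running += reward
        if len(recent) == window and running / window >= threshold:
            return episode
    return None                       # never reached the threshold


# Synthetic reward curve for illustration only, not the project's data.
synthetic = [0] * 100 + [300] * 100
solved_at = first_solved_episode(synthetic, threshold=200)
print(solved_at)  # 167
```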

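PER SumTree

A binary segment tree (sum tree) stores transition priorities in its leaves and partial sums in its internal nodes, so priority sampling and priority updates both take O(log n). The importance-sampling exponent β anneals from 0.4 to 1.0 over training to correct the bias that prioritized sampling introduces.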
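A minimal pure-Python sketch of a sum tree with stratified sampling and importance-sampling weights is shown below. It is an illustration under stated assumptions, not the project's implementation: the class and function names are hypothetical, priorities are passed in already computed, the β schedule is assumed to be linear, and weights are normalized by the largest weight in the batch (one common variant).

```python
import random


class SumTree:
    """Array-backed binary tree; each internal node stores the sum of its children."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.nodes = [0.0] * (2 * capacity - 1)  # internal nodes first, then leaves
        self.data = [None] * capacity
        self.write = 0                           # next slot to overwrite (ring buffer)
        self.size = 0

    @property
    def total(self):
        return self.nodes[0]

    def update(self, leaf, priority):
        """Set one leaf's priority and propagate the change to the root: O(log n)."""
        idx = leaf + self.capacity - 1
        change = priority - self.nodes[idx]
        self.nodes[idx] = priority
        while idx > 0:
            idx = (idx - 1) // 2
            self.nodes[idx] += change

    def add(self, priority, item):
        self.data[self.write] = item
        self.update(self.write, priority)
        self.write = (self.write + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def get(self, mass):
        """Descend to the leaf whose cumulative-priority interval contains mass."""
        idx = 0
        while idx < self.capacity - 1:           # idx is still an internal node
            left = 2 * idx + 1
            if mass <= self.nodes[left]:
                idx = left
            else:
                mass -= self.nodes[left]
                idx = left + 1
        leaf = idx - (self.capacity - 1)
        return leaf, self.nodes[idx], self.data[leaf]


def beta_schedule(step, total_steps, beta_start=0.4, beta_end=1.0):
    """Linear annealing of beta (the linear shape is an assumption)."""
    frac = min(1.0, step / total_steps)
    return beta_start + frac * (beta_end - beta_start)


def is_weights(priorities, total, n, beta):
    """w_i = (n * P(i))^(-beta), normalized by the batch maximum."""
    weights = [(n * p / total) ** (-beta) for p in priorities]
    max_w = max(weights)
    return [w / max_w for w in weights]


def sample(tree, batch_size, beta):
    """Stratified sampling: one draw from each equal slice of the total priority."""
    segment = tree.total / batch_size
    leaves, priorities, items = [], [], []
    for i in range(batch_size):
        mass = random.uniform(segment * i, segment * (i + 1))
        leaf, priority, item = tree.get(mass)
        leaves.append(leaf)
        priorities.append(priority)
        items.append(item)
    return leaves, items, is_weights(priorities, tree.total, tree.size, beta)


tree = SumTree(capacity=4)
for priority, item in [(1.0, "a"), (2.0, "b"), (3.0, "c"), (4.0, "d")]:
    tree.add(priority, item)

leaves, items, weights = sample(tree, batch_size=2, beta=beta_schedule(0, 1000))
```

After a learning step, the sampled `leaves` would be passed back to `tree.update` with new priorities derived from the fresh TD errors, so that transitions the network still predicts poorly keep being sampled more often.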

Hyperparameters

lr=1e-4, γ=0.99, τ=0.005, buffer=100K, batch=64, ε: 1.0→0.01
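To give a feel for what τ=0.005 means in practice, here is a small sketch of a soft (Polyak) target update on plain Python lists. The function name and the toy one-parameter "networks" are assumptions for illustration; in a PyTorch agent the same rule would be applied to each pair of target and online parameter tensors.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * t
            for t, w in zip(target_params, online_params)]


# Toy example: the online weight is held at 1.0, the target starts at 0.0.
online = [1.0]
target = [0.0]

target = soft_update(target, online)
after_one_step = target[0]

for _ in range(199):
    target = soft_update(target, online)
after_200_steps = target[0]

print(after_one_step, after_200_steps)
```

With a fixed online weight, the remaining gap shrinks by a factor of (1 − τ) per step, so after 1/τ = 200 updates the target has closed roughly 63% of it. That slow tracking is what keeps the bootstrapped Q-targets from shifting abruptly.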
