Reinforcement Learning
Reinforcement learning is the science of learning to make decisions by trial and error. An agent takes actions in an environment, receives rewards, and gradually learns a policy — a mapping from states to actions — that maximizes long-term cumulative reward. The mathematical backbone is the Markov Decision Process. Tabular methods (Q-learning, SARSA) solve small problems exactly. Deep RL (DQN, policy gradients, PPO) scales to massive state spaces. PPO is the algorithm behind RLHF, which aligns large language models with human preferences. If supervised learning is "learning from answers," RL is "learning from consequences."
Think about how you learned to ride a bike. Nobody handed you a dataset of 10,000 labeled bike-riding examples. Instead, you got on, wobbled, fell, adjusted, and eventually figured it out through raw interaction. You had a goal (don't fall), you took actions (lean left, pedal harder), you got feedback (pain, balance, speed), and over time you developed a policy — an intuitive mapping from "what I see and feel" to "what I should do next."
That's reinforcement learning in a nutshell. And it turns out to be the framework behind game-playing AIs that beat world champions, robots that learn to walk, and — most relevant to the current AI moment — the alignment of large language models through RLHF.
The RL Problem
Supervised learning says: "Here's the right answer, learn to predict it." Reinforcement learning says: "I won't tell you the right answer. I'll tell you how good or bad the consequences of your action were. Figure it out."
This is a fundamentally different setup. In supervised learning, you get immediate, exact feedback for every prediction. In RL, you might take 100 actions before you find out whether the overall sequence was good or bad. Did you lose the chess game because of move 37, or because of move 12? This is the credit assignment problem, and it's what makes RL genuinely hard.
The Cast of Characters
Every RL problem has the same core components:
- Agent — the learner and decision-maker. This is your algorithm.
- Environment — everything the agent interacts with. A game board, a physical world, a simulated system.
- State (s) — a description of the current situation. The board position, the robot's joint angles, the pixels on screen.
- Action (a) — what the agent can do. Move left, accelerate, output a token.
- Reward (r) — a scalar signal after each action. +1 for scoring, -1 for dying, 0 for everything in between.
- Policy (π) — the agent's strategy. A mapping from states to actions (or a probability distribution over actions).
The loop is simple: the agent observes a state, picks an action according to its policy, the environment transitions to a new state and emits a reward, and the cycle repeats. The goal is to find the policy that maximizes cumulative reward over time — not just the immediate reward.
Supervised vs Reinforcement Learning
| Aspect | Supervised Learning | Reinforcement Learning |
|---|---|---|
| Feedback | Correct label for every input | Scalar reward, often delayed |
| Data | Fixed dataset, i.i.d. samples | Agent generates its own data |
| Goal | Minimize prediction error | Maximize cumulative reward |
| Exploration | Not needed — labels are given | Critical — must try things to learn |
| Credit Assignment | Trivial — one label per input | Hard — reward might be delayed by many steps |
| Stationarity | Training distribution is fixed | Distribution shifts as policy changes |
Markov Decision Processes
An MDP is the mathematical language of RL. Nearly every RL algorithm assumes (or approximates) an MDP underneath. If you understand MDPs, you understand the skeleton that all of RL hangs on.
Formal Definition
An MDP is a tuple (S, A, P, R, γ):
- S — a set of states
- A — a set of actions
- P(s'|s, a) — transition probabilities. Given state s and action a, the probability of landing in state s'.
- R(s, a, s') — reward function. The immediate reward for transitioning from s to s' via action a.
- γ ∈ [0, 1) — discount factor. How much we value future rewards relative to immediate ones.
The Markov Property
The "Markov" part means: the future depends only on the present state, not on the history of how you got there. Formally: P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ...) = P(s_{t+1} | s_t, a_t).
In chess, the current board position tells you everything you need to decide your next move — you don't need the full move history. In practice, many environments aren't perfectly Markov, but we often construct states that capture enough history to make them approximately Markov (like stacking the last 4 frames in Atari games).
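That frame-stacking trick is easy to sketch. The `FrameStack` wrapper below is a hypothetical helper (not from any particular library) that keeps the last k observations in a deque, so information a single frame can't capture, like velocity or direction of motion, is recoverable from the stacked state:

```python
from collections import deque

import numpy as np

class FrameStack:
    """Stack the last k observations so that dynamics information
    (lost in any single frame) is recoverable from the state."""
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, obs):
        # Fill the stack with copies of the first observation
        for _ in range(self.k):
            self.frames.append(obs)
        return np.stack(self.frames)

    def step(self, obs):
        # Newest frame pushes out the oldest
        self.frames.append(obs)
        return np.stack(self.frames)
```

An agent then treats the stacked array, rather than the raw frame, as its state.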
The Discount Factor γ
Why discount? Two reasons. First, mathematical: without discounting, the sum of rewards over an infinite horizon can diverge. Discounting guarantees a finite sum. Second, intuitive: a dollar today is worth more than a dollar next year. A reward now is more certain than a reward in 100 steps — the agent might die, the environment might change, the estimate might be wrong.
Setting γ = 0 makes the agent completely myopic — it only cares about the next reward. Setting γ = 0.99 makes it very far-sighted. Most practical systems use γ between 0.95 and 0.999.
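A quick back-of-the-envelope check makes the effect of γ concrete. For ten consecutive rewards of +1, the discounted return varies dramatically with γ:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1.0] * 10  # ten rewards of +1

print(discounted_return(rewards, gamma=0.0))   # 1.0  -- myopic: only the first reward counts
print(discounted_return(rewards, gamma=0.9))   # ≈ 6.51
print(discounted_return(rewards, gamma=0.99))  # ≈ 9.56 -- nearly the undiscounted sum of 10
```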
```python
import numpy as np

# A simple 4x4 grid world MDP
class GridWorld:
    """
    4x4 grid. Agent starts at (0,0), goal at (3,3).
    Actions: 0=up, 1=right, 2=down, 3=left.
    Reward: +1 at goal, -0.01 per step (encourages efficiency).
    """
    def __init__(self, size=4):
        self.size = size
        self.goal = (size - 1, size - 1)
        self.reset()

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        r, c = self.state
        if action == 0:   r = max(r - 1, 0)              # up
        elif action == 1: c = min(c + 1, self.size - 1)  # right
        elif action == 2: r = min(r + 1, self.size - 1)  # down
        elif action == 3: c = max(c - 1, 0)              # left
        self.state = (r, c)
        done = (self.state == self.goal)
        reward = 1.0 if done else -0.01
        return self.state, reward, done

# Demo: random agent in grid world
env = GridWorld()
state = env.reset()
total_reward = 0
for step in range(200):
    action = np.random.randint(4)  # random policy
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        print(f"Random agent reached goal in {step+1} steps, reward: {total_reward:.2f}")
        break
else:
    print(f"Random agent failed after 200 steps, reward: {total_reward:.2f}")
```
A random agent stumbles around the grid. The entire point of RL is to replace that random policy with something intelligent.
Value Functions
The core question of RL: how good is it to be in a given state, or to take a given action? Value functions answer this.
State-Value Function V(s)
Vπ(s) is the expected cumulative discounted reward starting from state s and following policy π thereafter:
Vπ(s) = E_π[ Σ_{t=0}^{∞} γ^t · r_{t+1} | s_0 = s ]
In words: "If I'm in state s and I follow policy π forever, how much total (discounted) reward do I expect to collect?"
Action-Value Function Q(s, a)
Qπ(s, a) adds one more degree of freedom — it tells you the expected return of taking action a in state s, then following policy π:
Qπ(s, a) = E_π[ Σ_{t=0}^{∞} γ^t · r_{t+1} | s_0 = s, a_0 = a ]
If you know Q for every (state, action) pair, the optimal policy is trivial: just pick the action with the highest Q-value in every state. This is why Q-learning is so popular — learn Q, and you've implicitly learned the policy.
The Advantage Function A(s, a)
The advantage function measures how much better action a is compared to the average action in state s:
Aπ(s, a) = Qπ(s, a) - Vπ(s)
If A(s, a) > 0, action a is better than average. If A(s, a) < 0, it's worse. The advantage function is crucial for policy gradient methods — it reduces variance by centering the reward signal. We'll see this again with PPO.
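The three quantities are tied together by Vπ(s) = Σ_a π(a|s) · Qπ(s, a): the state value is the policy-weighted average of the action values. A tiny numeric example (hypothetical Q-values, uniform random policy) shows how the advantage centers the signal:

```python
import numpy as np

# Hypothetical Q-values for one state with 3 actions,
# under a uniform random policy pi(a|s) = 1/3.
q = np.array([1.0, 2.0, 6.0])
pi = np.array([1/3, 1/3, 1/3])

v = np.dot(pi, q)   # V(s) = sum_a pi(a|s) * Q(s, a) ≈ 3.0
advantage = q - v   # A(s, a) = Q(s, a) - V(s) ≈ [-2, -1, 3]

print(v, advantage)
```

Note that the advantages average to zero under the policy; that is exactly the centering that reduces gradient variance.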
The Bellman Equations
Here's the insight that makes RL computationally tractable. The value of a state can be expressed recursively in terms of the values of successor states:
Vπ(s) = Σ_a π(a|s) · Σ_{s'} P(s'|s,a) · [ R(s,a,s') + γ · Vπ(s') ]
In English: the value of state s equals the expected immediate reward plus the discounted value of wherever you end up next. This recursive structure is the engine of RL. Every algorithm — dynamic programming, TD learning, Q-learning — is basically a different way of solving or approximating this equation.
The Bellman optimality equation describes the value under the best possible policy:
V*(s) = max_a Σ_{s'} P(s'|s,a) · [ R(s,a,s') + γ · V*(s') ]
And for Q:
Q*(s, a) = Σ_{s'} P(s'|s,a) · [ R(s,a,s') + γ · max_{a'} Q*(s', a') ]
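The recursive structure is easy to verify numerically in the simplest possible MDP: one state, one action, reward +1 every step. The Bellman equation collapses to V = 1 + γ·V, whose fixed point is V = 1/(1 - γ):

```python
gamma = 0.9

# Closed form: V = 1 + gamma * V  =>  V = 1 / (1 - gamma)
v_closed = 1.0 / (1.0 - gamma)

# The same fixed point emerges by repeatedly applying the Bellman backup.
v = 0.0
for _ in range(1000):
    v = 1.0 + gamma * v

print(v_closed)  # ≈ 10.0
print(v)         # converges to the same value
```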
Dynamic Programming
If you know the MDP completely — all transition probabilities, all rewards — you can solve for the optimal policy exactly using dynamic programming. This is the "if you had a God's-eye view" scenario.
Policy Evaluation
Given a fixed policy π, compute Vπ(s) for all states by iteratively applying the Bellman equation until convergence:
```python
def policy_evaluation(env_size, policy, P, R, gamma=0.99, theta=1e-8):
    """Evaluate a policy by iteratively solving the Bellman equation.

    Args:
        env_size: number of states
        policy: array of shape [n_states, n_actions] (probability of each action)
        P: transition probabilities P[s][a] = list of (prob, next_state, reward, done)
        R: reward function (embedded in P here)
        gamma: discount factor
        theta: convergence threshold
    """
    V = np.zeros(env_size)
    while True:
        delta = 0
        for s in range(env_size):
            v = V[s]
            new_v = 0
            for a in range(4):
                for prob, next_s, reward, done in P[s][a]:
                    new_v += policy[s][a] * prob * (reward + gamma * V[next_s] * (1 - done))
            V[s] = new_v
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    return V
```
Policy Iteration
Alternate between evaluating the current policy and improving it greedily. Guaranteed to converge to the optimal policy in a finite number of steps.
- Evaluate: compute Vπ for current policy π.
- Improve: for each state, set the new policy to the action that maximizes expected value.
- Repeat until the policy stops changing.
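The steps above fit in a few dozen lines. This is a minimal sketch, assuming the same `P[s][a] = list of (prob, next_state, reward, done)` convention as `policy_evaluation`; the policy is stored as one action index per state:

```python
import numpy as np

def policy_iteration(n_states, n_actions, P, gamma=0.99, theta=1e-8):
    """Policy iteration sketch: evaluate, improve greedily, repeat until stable."""
    def backup(s, a, V):
        # Expected one-step return of (s, a) under the current value estimates
        return sum(p * (r + gamma * V[s2] * (1 - d)) for p, s2, r, d in P[s][a])

    policy = np.zeros(n_states, dtype=int)
    while True:
        # 1. Evaluate: iterate the Bellman equation for the current policy
        V = np.zeros(n_states)
        while True:
            delta = 0.0
            for s in range(n_states):
                new_v = backup(s, policy[s], V)
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < theta:
                break
        # 2. Improve: greedy one-step lookahead
        stable = True
        for s in range(n_states):
            best = int(np.argmax([backup(s, a, V) for a in range(n_actions)]))
            if best != policy[s]:
                stable = False
            policy[s] = best
        if stable:
            return policy, V
```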
Value Iteration
Skip the separate evaluation step. Directly apply the Bellman optimality equation as an update rule:
V(s) ← max_a Σ_{s'} P(s'|s,a) · [ R(s,a,s') + γ · V(s') ]
Repeat until convergence, then extract the optimal policy from the final V.
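A minimal sketch, again assuming the `P[s][a] = list of (prob, next_state, reward, done)` convention used above:

```python
import numpy as np

def value_iteration(n_states, n_actions, P, gamma=0.99, theta=1e-8):
    """Value iteration sketch: Bellman optimality backups until convergence."""
    def backup(s, a, V):
        # Expected one-step return of (s, a) under the current value estimates
        return sum(p * (r + gamma * V[s2] * (1 - d)) for p, s2, r, d in P[s][a])

    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            best = max(backup(s, a, V) for a in range(n_actions))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Extract the greedy policy from the converged values
    policy = np.array([int(np.argmax([backup(s, a, V) for a in range(n_actions)]))
                       for s in range(n_states)])
    return policy, V
```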
Model-Free Methods
Now we enter the real world, where you don't know the transition probabilities. You can't solve the Bellman equation directly because you don't have P. Instead, you learn from experience — from actual interactions with the environment.
Monte Carlo Methods
The simplest idea: play out complete episodes, observe the total return, and average. If you visit state s and eventually get a total return of G, then V(s) ≈ average of all the G's you've seen from s.
Pros: No bias — you're using actual returns, not estimates. No bootstrapping. Works for any environment, including non-Markov ones.
Cons: High variance — one episode might give a return of +50 and the next -30, just due to randomness. And you need to wait until the episode ends, which means Monte Carlo doesn't work for continuing (non-episodic) tasks.
```python
def monte_carlo_evaluation(env, policy_fn, n_episodes=5000, gamma=0.99):
    """First-visit Monte Carlo policy evaluation."""
    returns = {}  # state -> list of observed returns
    V = {}
    for _ in range(n_episodes):
        episode = []
        state = env.reset()
        done = False
        while not done:
            action = policy_fn(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # Calculate returns backwards; overwriting means the stored G for
        # each state ends up being the one from its FIRST visit
        G = 0
        first_visit_G = {}
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            first_visit_G[state] = G  # earlier visits overwrite later ones
        for state, G in first_visit_G.items():
            returns.setdefault(state, []).append(G)
            V[state] = np.mean(returns[state])
    return V
```
Temporal Difference (TD) Learning
TD learning is the brilliant middle ground. Instead of waiting for the episode to end (Monte Carlo), you update your estimate after every single step, using the next state's estimated value as a bootstrap:
V(s) ← V(s) + α · [ r + γ · V(s') - V(s) ]
The term r + γ · V(s') is called the TD target. The difference r + γ · V(s') - V(s) is the TD error (δ). If the TD error is positive, the state was better than expected; if negative, worse.
TD learning has lower variance than Monte Carlo (because it bootstraps rather than waiting for noisy full returns) but introduces some bias (because V(s') is an estimate, not the truth). In practice, this tradeoff is usually favorable — TD methods converge faster.
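The update rule translates almost line-for-line into code. This is a minimal sketch of TD(0) policy evaluation, assuming an environment with the same `reset()`/`step()` interface as the GridWorld above:

```python
def td0_evaluation(env, policy_fn, n_episodes=2000, alpha=0.1, gamma=0.99):
    """TD(0) policy evaluation: after every step, nudge V(s) toward the
    TD target r + gamma * V(s') by a fraction alpha of the TD error."""
    V = {}
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy_fn(state)
            next_state, reward, done = env.step(action)
            v_next = 0.0 if done else V.get(next_state, 0.0)  # terminal states are worth 0
            td_error = reward + gamma * v_next - V.get(state, 0.0)
            V[state] = V.get(state, 0.0) + alpha * td_error
            state = next_state
    return V
```

Running it on the grid world with a random policy, e.g. `td0_evaluation(GridWorld(), lambda s: np.random.randint(4))`, gives estimates comparable to the Monte Carlo version, but updated online at every step rather than once per episode.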
Q-Learning
This is the most important tabular RL algorithm. Full stop. Q-learning directly estimates the optimal Q-function without needing to know the policy being followed. It's off-policy: the agent can explore with any strategy (like ε-greedy) while still learning the optimal policy.
The update rule:
Q(s, a) ← Q(s, a) + α · [ r + γ · max_{a'} Q(s', a') - Q(s, a) ]
The key insight is the max over next actions. The agent doesn't update toward what it actually did next — it updates toward the best thing it could do next. This is what makes Q-learning off-policy: it learns about the optimal policy regardless of the behavior policy used for exploration.
```python
def q_learning(env, n_episodes=10000, alpha=0.1, gamma=0.99,
               epsilon=1.0, epsilon_decay=0.9995, epsilon_min=0.01):
    """Tabular Q-learning with ε-greedy exploration."""
    Q = {}

    def get_q(state, action):
        return Q.get((state, action), 0.0)

    def get_best_action(state):
        q_values = [get_q(state, a) for a in range(4)]
        return int(np.argmax(q_values))

    rewards_per_episode = []
    for episode in range(n_episodes):
        state = env.reset()
        total_reward = 0
        done = False
        while not done:
            # ε-greedy action selection
            if np.random.random() < epsilon:
                action = np.random.randint(4)
            else:
                action = get_best_action(state)
            next_state, reward, done = env.step(action)
            # Q-learning update: use max over next state
            best_next_q = max(get_q(next_state, a) for a in range(4))
            td_target = reward + gamma * best_next_q * (1 - done)
            td_error = td_target - get_q(state, action)
            Q[(state, action)] = get_q(state, action) + alpha * td_error
            state = next_state
            total_reward += reward
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
        rewards_per_episode.append(total_reward)
    return Q, rewards_per_episode

# Train on our grid world
env = GridWorld()
Q, rewards = q_learning(env)

# Extract learned policy
print("Learned policy (0=↑, 1=→, 2=↓, 3=←):")
actions = {0: '↑', 1: '→', 2: '↓', 3: '←'}
for r in range(4):
    row = []
    for c in range(4):
        if (r, c) == (3, 3):
            row.append('★')
        else:
            best_a = max(range(4), key=lambda a: Q.get(((r, c), a), 0))
            row.append(actions[best_a])
    print(' '.join(row))

# Show learning curve
avg_last_100 = np.mean(rewards[-100:])
print(f"\nAvg reward (last 100 episodes): {avg_last_100:.3f}")
```
SARSA — On-Policy TD Control
SARSA is Q-learning's on-policy cousin. The name literally describes the update: State, Action, Reward, next State, next Action.
Q(s, a) ← Q(s, a) + α · [ r + γ · Q(s', a') - Q(s, a) ]
The difference from Q-learning: instead of max_{a'} Q(s', a'), SARSA uses Q(s', a') where a' is the action the agent actually takes next. This makes SARSA on-policy — it evaluates and improves the policy it's actually following.
| Feature | Q-Learning | SARSA |
|---|---|---|
| Update target | max_{a'} Q(s', a') | Q(s', a') — actual next action |
| Policy type | Off-policy | On-policy |
| Exploration effect | Learns optimal policy despite exploration | Learns about the policy being followed (including exploration) |
| Cliff-walking behavior | Learns the optimal (risky) path | Learns a safe path that accounts for ε-random missteps |
| Convergence | To Q* under mild conditions | To Qπ for the ε-greedy policy |
In the classic "cliff walking" example, Q-learning learns the shortest path along the cliff edge (optimal but dangerous under exploration), while SARSA learns a safer path farther from the cliff (accounting for the fact that ε-greedy will occasionally step off the edge).
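As a sketch, SARSA differs from the `q_learning` function above in just two places: the next action is sampled before the update, and the target uses that sampled action instead of the max (helper structure here is illustrative):

```python
import numpy as np

def sarsa(env, n_actions, n_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy TD control: update Q(s, a) toward r + gamma * Q(s', a'),
    where a' is the action actually chosen next by the ε-greedy policy."""
    Q = {}

    def eps_greedy(state):
        if np.random.random() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([Q.get((state, a), 0.0) for a in range(n_actions)]))

    for _ in range(n_episodes):
        state = env.reset()
        action = eps_greedy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = eps_greedy(next_state)
            # Target uses the sampled next action, not the max
            target = reward + gamma * Q.get((next_state, next_action), 0.0) * (1 - done)
            q_sa = Q.get((state, action), 0.0)
            Q[(state, action)] = q_sa + alpha * (target - q_sa)
            state, action = next_state, next_action
    return Q
```

On the grid world it is a drop-in replacement for `q_learning`, and because the target includes exploration, the values it learns reflect the ε-greedy policy rather than the greedy one.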
Deep Reinforcement Learning
Everything so far used tables — a Q-value for every (state, action) pair. That works for a 4×4 grid (16 states × 4 actions = 64 entries). But what about Atari games (210×160×3 pixel images)? Or Go (on the order of 10^170 possible board states)? Or continuous control with real-valued joint angles?
Tables can't scale. You need function approximation — neural networks that generalize across similar states. Welcome to deep RL, where everything gets more powerful and more unstable.
DQN — Deep Q-Networks
The 2013/2015 DeepMind breakthrough that played Atari games from raw pixels. The idea is deceptively simple: replace the Q-table with a neural network Q(s, a; θ) that takes a state (image) and outputs Q-values for all actions.
But naive deep Q-learning diverges. The neural network is trying to hit a moving target while standing on a moving platform. Two critical tricks stabilized it:
1. Experience Replay. Instead of learning from consecutive transitions (which are highly correlated), store transitions in a replay buffer and sample random mini-batches. This breaks temporal correlations and reuses data efficiently.
2. Target Network. Use a separate "target" network (with frozen weights θ⁻) to compute the TD target. Update the target network periodically by copying the main network's weights. This prevents the catastrophic feedback loop where the target shifts with every gradient step.
```python
import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque
import random

class DQN(nn.Module):
    """Deep Q-Network for discrete action spaces."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim)
        )

    def forward(self, x):
        return self.net(x)

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (torch.FloatTensor(np.array(states)), torch.LongTensor(actions),
                torch.FloatTensor(rewards), torch.FloatTensor(np.array(next_states)),
                torch.FloatTensor(dones))

    def __len__(self):
        return len(self.buffer)

class DQNAgent:
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99,
                 epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01,
                 target_update_freq=100):
        self.action_dim = action_dim
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.target_update_freq = target_update_freq
        self.q_net = DQN(state_dim, action_dim)
        self.target_net = DQN(state_dim, action_dim)
        self.target_net.load_state_dict(self.q_net.state_dict())
        self.optimizer = optim.Adam(self.q_net.parameters(), lr=lr)
        self.buffer = ReplayBuffer()
        self.steps = 0

    def select_action(self, state):
        if random.random() < self.epsilon:
            return random.randrange(self.action_dim)
        with torch.no_grad():
            q_values = self.q_net(torch.FloatTensor(state).unsqueeze(0))
        return q_values.argmax(dim=1).item()

    def train_step(self, batch_size=64):
        if len(self.buffer) < batch_size:
            return
        states, actions, rewards, next_states, dones = self.buffer.sample(batch_size)
        # Current Q-values
        q_values = self.q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        # Target Q-values from frozen target network
        with torch.no_grad():
            next_q = self.target_net(next_states).max(dim=1)[0]
            targets = rewards + self.gamma * next_q * (1 - dones)
        loss = nn.MSELoss()(q_values, targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        # Periodic target network update
        self.steps += 1
        if self.steps % self.target_update_freq == 0:
            self.target_net.load_state_dict(self.q_net.state_dict())
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
```
Policy Gradient Methods
DQN learns a value function and derives a policy from it. Policy gradient methods cut out the middleman: they directly parameterize and optimize the policy π(a|s; θ).
Why would you want this? Three reasons:
- Continuous actions. DQN outputs Q-values for discrete actions. But a robotic arm has continuous joint torques. Policy gradient methods naturally handle continuous action spaces.
- Stochastic policies. Sometimes the optimal policy is stochastic (think rock-paper-scissors). Value-based methods give you deterministic policies.
- Better convergence properties. Small changes in θ cause small changes in the policy, giving smoother optimization landscapes.
The Policy Gradient Theorem
The key insight: we can compute the gradient of expected reward with respect to policy parameters, even though the environment dynamics are unknown. The policy gradient theorem gives us:
∇_θ J(θ) = E_π[ ∇_θ log π(a|s; θ) · Qπ(s, a) ]
In words: adjust θ to make actions with high Q-values more likely. The gradient of log-probability tells you which direction in parameter space makes that action more probable. The Q-value tells you how much to push.
REINFORCE
The simplest policy gradient algorithm. Replace Qπ(s, a) with the actual sampled return G_t from that point forward:
```python
class REINFORCE:
    """Monte Carlo policy gradient (REINFORCE) with baseline."""
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99):
        self.gamma = gamma
        # Policy network: outputs action probabilities
        self.policy = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
            nn.Softmax(dim=-1)
        )
        # Baseline network: estimates V(s) to reduce variance
        self.baseline = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )
        self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        self.baseline_optimizer = optim.Adam(self.baseline.parameters(), lr=lr)

    def select_action(self, state):
        state_t = torch.FloatTensor(state).unsqueeze(0)
        probs = self.policy(state_t)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action)

    def update(self, log_probs, rewards, states):
        # Compute discounted returns
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + self.gamma * G
            returns.insert(0, G)
        returns = torch.FloatTensor(returns)
        # Normalize returns for stability
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        # Baseline values
        states_t = torch.FloatTensor(states)
        baselines = self.baseline(states_t).squeeze()
        # Advantage = return - baseline
        advantages = returns - baselines.detach()
        # Policy loss: -log π(a|s) · A(s,a)
        policy_loss = 0
        for log_prob, advantage in zip(log_probs, advantages):
            policy_loss -= log_prob * advantage
        # Baseline loss: MSE(V(s), G)
        baseline_loss = nn.MSELoss()(baselines, returns)
        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        self.policy_optimizer.step()
        self.baseline_optimizer.zero_grad()
        baseline_loss.backward()
        self.baseline_optimizer.step()
```
Actor-Critic Methods
REINFORCE uses full episode returns (Monte Carlo). Actor-critic methods bootstrap like TD learning — they replace the full return G_t with the TD target r + γ·V(s').
The architecture has two components:
- Actor — the policy network π(a|s; θ). Decides what to do.
- Critic — the value network V(s; w). Evaluates how good the situation is.
The critic provides the advantage estimate: δ = r + γ·V(s') - V(s). The actor uses this advantage to update its policy. The critic updates by minimizing the TD error. They train simultaneously, improving each other.
```python
class ActorCritic(nn.Module):
    """Combined actor-critic network with shared feature extraction."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        # Shared feature layers
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU()
        )
        # Actor head: action probabilities
        self.actor = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
            nn.Softmax(dim=-1)
        )
        # Critic head: state value
        self.critic = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1)
        )

    def forward(self, state):
        features = self.shared(state)
        action_probs = self.actor(features)
        state_value = self.critic(features)
        return action_probs, state_value

    def get_action(self, state):
        state_t = torch.FloatTensor(state).unsqueeze(0)
        probs, value = self.forward(state_t)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        # Squeeze to scalars so stacked tensors align with advantages later
        return action.item(), dist.log_prob(action).squeeze(0), value.squeeze()

def train_actor_critic(model, optimizer, log_probs, values, rewards,
                       next_value, gamma=0.99, entropy_coef=0.01):
    """A2C-style update using bootstrapped returns."""
    returns = []
    R = next_value.detach()
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.stack(returns)
    log_probs = torch.stack(log_probs)
    values = torch.stack(values)
    advantages = returns - values.detach()
    # Actor loss: policy gradient with advantage
    actor_loss = -(log_probs * advantages).mean()
    # Critic loss: value function regression
    critic_loss = nn.MSELoss()(values, returns)
    # Entropy bonus (sampled-action approximation): encourage exploration
    entropy = -(torch.exp(log_probs) * log_probs).mean()
    loss = actor_loss + 0.5 * critic_loss - entropy_coef * entropy
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), 0.5)
    optimizer.step()
```
A2C (Advantage Actor-Critic) runs multiple environments in parallel synchronously. A3C (Asynchronous Advantage Actor-Critic) runs them asynchronously with each worker updating a shared model. A2C is generally preferred now because synchronous updates are simpler and GPUs handle batched operations efficiently.
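A minimal synchronous vector-env wrapper (illustrative, not any specific library's API) captures the A2C idea: step every environment in lockstep and hand the learner batched transitions:

```python
import numpy as np

class SyncVectorEnv:
    """Step a batch of environments in lockstep (A2C-style), so the
    learner sees one batch of transitions per step call."""
    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        states, rewards, dones = [], [], []
        for env, a in zip(self.envs, actions):
            s, r, d = env.step(a)
            if d:  # auto-reset finished environments
                s = env.reset()
            states.append(s)
            rewards.append(r)
            dones.append(d)
        return states, np.array(rewards), np.array(dones)
```

Wrapping eight copies of the grid world, `SyncVectorEnv([GridWorld] * 8)`, yields eight transitions per `step` call, which decorrelates the data without the bookkeeping of asynchronous workers.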
PPO — Proximal Policy Optimization
PPO is the workhorse of modern RL. Published by OpenAI in 2017, it's the default algorithm for continuous control, game playing, and — critically — RLHF for language model alignment. If you learn one deep RL algorithm, make it this one.
The Problem PPO Solves
Policy gradient methods have a fundamental tension. You want to take big steps to learn quickly, but big steps can catastrophically change your policy — you overshoot, the policy collapses, and you can never recover. Standard policy gradient gives you a direction to improve, but no sense of how far to go.
TRPO (Trust Region Policy Optimization) solved this by constraining the KL divergence between old and new policies. It works brilliantly but requires computing second-order derivatives, which is expensive and complex. PPO achieves similar stability with a simple first-order method.
The Clipped Surrogate Objective
PPO's key idea is elegant. Define the probability ratio:
r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t)
If r_t > 1, the new policy is more likely to take action a_t than the old policy. If r_t < 1, it's less likely. The clipped surrogate objective is:
L(θ) = E[ min( r_t(θ) · A_t, clip(r_t(θ), 1-ε, 1+ε) · A_t ) ]
Here ε is typically 0.1 or 0.2. Let's unpack what this does:
- If advantage A_t > 0 (action was good): the objective increases as r_t increases... but only up to (1+ε). Beyond that, the gradient is clipped to zero. You can't make a good action too much more likely in one step.
- If advantage A_t < 0 (action was bad): the objective increases as r_t decreases... but only down to (1-ε). You can't make a bad action too much less likely in one step.
The effect: a "trust region" achieved through clipping rather than constrained optimization. No Hessians, no conjugate gradients. Just a simple clamp on the probability ratio. It's almost embarrassingly simple, and it works astonishingly well.
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical

class PPOAgent(nn.Module):
    """Proximal Policy Optimization with clipped surrogate objective."""
    def __init__(self, state_dim, action_dim, hidden=256,
                 lr=3e-4, gamma=0.99, lam=0.95,
                 clip_eps=0.2, epochs=10, batch_size=64):
        super().__init__()
        self.gamma = gamma
        self.lam = lam  # GAE lambda
        self.clip_eps = clip_eps
        self.epochs = epochs
        self.batch_size = batch_size
        # Shared backbone
        self.features = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh()
        )
        self.actor_head = nn.Linear(hidden, action_dim)
        self.critic_head = nn.Linear(hidden, 1)
        self.optimizer = optim.Adam(self.parameters(), lr=lr)

    def forward(self, state):
        features = self.features(state)
        logits = self.actor_head(features)
        value = self.critic_head(features)
        return logits, value.squeeze(-1)

    def get_action(self, state):
        state_t = torch.FloatTensor(state).unsqueeze(0)
        logits, value = self.forward(state_t)
        dist = Categorical(logits=logits)
        action = dist.sample()
        return action.item(), dist.log_prob(action).item(), value.item()

    def compute_gae(self, rewards, values, dones, next_value):
        """Generalized Advantage Estimation (GAE).

        GAE-lambda interpolates between TD(0) (lambda=0, low variance,
        more bias) and Monte Carlo (lambda=1, high variance, no bias).
        lambda=0.95 is the standard choice.
        """
        advantages = []
        gae = 0
        values = values + [next_value]
        for t in reversed(range(len(rewards))):
            delta = rewards[t] + self.gamma * values[t+1] * (1 - dones[t]) - values[t]
            gae = delta + self.gamma * self.lam * (1 - dones[t]) * gae
            advantages.insert(0, gae)
        returns = [adv + val for adv, val in zip(advantages, values[:-1])]
        return advantages, returns

    def update(self, states, actions, old_log_probs, advantages, returns):
        """PPO clipped surrogate update."""
        states = torch.FloatTensor(np.array(states))
        actions = torch.LongTensor(actions)
        old_log_probs = torch.FloatTensor(old_log_probs)
        advantages = torch.FloatTensor(advantages)
        returns = torch.FloatTensor(returns)
        # Normalize advantages
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        for _ in range(self.epochs):
            # Mini-batch updates
            indices = torch.randperm(len(states))
            for start in range(0, len(states), self.batch_size):
                idx = indices[start:start + self.batch_size]
                mb_states = states[idx]
                mb_actions = actions[idx]
                mb_old_log_probs = old_log_probs[idx]
                mb_advantages = advantages[idx]
                mb_returns = returns[idx]
                # Current policy evaluation
                logits, values = self.forward(mb_states)
                dist = Categorical(logits=logits)
                new_log_probs = dist.log_prob(mb_actions)
                entropy = dist.entropy().mean()
                # Probability ratio
                ratio = torch.exp(new_log_probs - mb_old_log_probs)
                # Clipped surrogate objective
                surr1 = ratio * mb_advantages
                surr2 = torch.clamp(ratio, 1 - self.clip_eps, 1 + self.clip_eps) * mb_advantages
                actor_loss = -torch.min(surr1, surr2).mean()
                # Value loss
                critic_loss = nn.MSELoss()(values, mb_returns)
                # Total loss
                loss = actor_loss + 0.5 * critic_loss - 0.01 * entropy
                self.optimizer.zero_grad()
                loss.backward()
                nn.utils.clip_grad_norm_(self.parameters(), 0.5)
                self.optimizer.step()

def ppo_training_loop(env_fn, agent, total_steps=100_000, rollout_len=2048):
    """Standard PPO training loop: collect rollouts, then update."""
    env = env_fn()
    state = env.reset()
    episode_reward = 0
    episode_rewards = []
    for step in range(0, total_steps, rollout_len):
        states, actions, rewards, dones = [], [], [], []
        log_probs, values = [], []
        # Collect rollout
        for _ in range(rollout_len):
            action, log_prob, value = agent.get_action(state)
            next_state, reward, done = env.step(action)
            states.append(state)
            actions.append(action)
            rewards.append(reward)
            dones.append(float(done))
            log_probs.append(log_prob)
            values.append(value)
            episode_reward += reward
            state = next_state
            if done:
                episode_rewards.append(episode_reward)
                episode_reward = 0
                state = env.reset()
        # Compute advantages using GAE
        _, _, next_value = agent.get_action(state)
        advantages, returns = agent.compute_gae(rewards, values, dones, next_value)
        # PPO update
        agent.update(states, actions, log_probs, advantages, returns)
        if episode_rewards:
            avg = sum(episode_rewards[-10:]) / min(10, len(episode_rewards))
            print(f"Step {step + rollout_len}, Avg Reward (last 10): {avg:.2f}")
    return episode_rewards
```
PPO Hyperparameters That Matter
| Hyperparameter | Typical Value | Effect |
|---|---|---|
| clip_eps (ε) | 0.1–0.2 | Size of the trust region. Smaller = more conservative updates. |
| GAE λ | 0.95 | Bias-variance tradeoff for advantage estimation. λ=0 → TD(0), λ=1 → Monte Carlo. |
| γ (discount) | 0.99 | How far-sighted the agent is. |
| Epochs per rollout | 3–10 | How many times to reuse collected data. Too many → overfitting to old data. |
| Mini-batch size | 64–4096 | Larger batches reduce variance but cost more compute. |
| Entropy coefficient | 0.01 | Encourages exploration. Critical early in training. |
| Value loss coefficient | 0.5 | Relative weight of critic vs actor loss. |
| Gradient clip norm | 0.5 | Prevents destructive large gradient steps. |
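The table's values can be gathered into one configuration object. The dict below is an illustrative starting point, not tuned for any particular environment; the key names are ours.

```python
# Illustrative PPO defaults -- a starting point to tune per environment.
PPO_DEFAULTS = {
    "clip_eps": 0.2,          # trust-region size for the surrogate objective
    "gae_lambda": 0.95,       # bias-variance knob for advantage estimation
    "gamma": 0.99,            # discount factor
    "epochs_per_rollout": 4,  # how many passes over each rollout
    "minibatch_size": 64,
    "entropy_coef": 0.01,     # exploration bonus weight
    "value_coef": 0.5,        # critic loss weight
    "max_grad_norm": 0.5,     # gradient clipping threshold
}

def total_loss(actor_loss, critic_loss, entropy, cfg=PPO_DEFAULTS):
    """Combine the three PPO loss terms with the configured weights."""
    return (actor_loss
            + cfg["value_coef"] * critic_loss
            - cfg["entropy_coef"] * entropy)
```

With these defaults, `total_loss` reproduces the `actor_loss + 0.5 * critic_loss - 0.01 * entropy` combination used in the update code above.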
RLHF — Reinforcement Learning from Human Feedback
This is why RL matters in 2024. RLHF is how ChatGPT, Claude, and other LLMs are aligned with human preferences. And at its core, it's PPO applied to language generation.
The Three-Stage Pipeline
Stage 1: Supervised Fine-Tuning (SFT). Take a pretrained language model and fine-tune it on high-quality demonstration data (human-written responses). This gives you a decent starting policy π_SFT.
Stage 2: Reward Model Training. Show the SFT model's outputs to human raters who rank them by preference (e.g., "Response A is better than Response B"). Train a reward model R(prompt, response) to predict these preferences. The reward model is typically initialized from the SFT model with a scalar output head replacing the language modeling head.
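The standard objective for this stage is a Bradley-Terry pairwise loss: train the reward model so that the human-preferred response scores higher than the rejected one. A minimal scalar sketch (the function name is ours):

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Minimized when the reward model assigns a higher score to the
    human-preferred response than to the rejected one.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the two scores are equal the loss is log 2; it falls toward zero as the preferred response's margin grows.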
Stage 3: PPO Optimization. Treat the language model as the RL policy. The "state" is the prompt plus tokens generated so far. The "action" is the next token. The "reward" comes from the reward model applied to the complete response. Use PPO to maximize this reward while staying close to the original SFT policy (via a KL penalty).
# Simplified RLHF training loop (pseudocode)
def rlhf_step(policy_model, ref_model, reward_model, prompts,
              kl_coef=0.1, clip_eps=0.2):
    """One PPO step of RLHF training.

    policy_model: the LLM being optimized
    ref_model: frozen copy of the SFT model (KL anchor)
    reward_model: predicts human preference scores
    """
    # 1. Generate responses from current policy
    responses = policy_model.generate(prompts)
    # 2. Score responses with reward model
    rewards = reward_model.score(prompts, responses)
    # 3. Compute KL penalty to stay close to reference
    log_probs_policy = policy_model.log_probs(prompts, responses)
    log_probs_ref = ref_model.log_probs(prompts, responses)
    kl_penalty = log_probs_policy - log_probs_ref  # per-token KL
    # 4. Adjusted reward: original reward minus KL penalty
    adjusted_rewards = rewards - kl_coef * kl_penalty.sum(dim=-1)
    # 5. PPO update on the policy model
    # (values come from the policy's value head; elided for brevity)
    advantages = compute_gae(adjusted_rewards, values)
    ppo_update(policy_model, log_probs_policy, advantages, clip_eps)
    return {
        'mean_reward': rewards.mean().item(),
        'mean_kl': kl_penalty.mean().item(),
    }
Recent alternatives to PPO for alignment include DPO (Direct Preference Optimization), which skips the reward model entirely and optimizes preferences directly. DPO reformulates the RLHF objective so the optimal policy can be expressed in closed form, turning the RL problem into a supervised learning problem on preference pairs. It's simpler and increasingly popular, though PPO-based RLHF remains the established approach.
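For concreteness, the per-pair DPO loss can be sketched with scalar sequence log-probabilities (names are ours; real implementations operate on batched token-level tensors):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    Increases the policy's log-prob margin of the chosen response over
    the rejected one, measured relative to the frozen reference model.
    beta plays the role of the RLHF KL coefficient.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen        # implicit reward, chosen
    rejected_ratio = logp_rejected - ref_logp_rejected  # implicit reward, rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log sigmoid(margin)
```

Note the structural similarity to the reward model's pairwise loss: DPO folds the reward model into the policy's own log-probability ratios, which is exactly why no separate reward model is needed.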
Model-Based Reinforcement Learning
Everything above is model-free: the agent learns directly from interactions without trying to understand how the environment works. Model-based RL takes a different approach: learn a model of the environment (transition dynamics and rewards), then use that model to plan.
The advantage is sample efficiency. If you can predict what will happen when you take an action, you can "imagine" thousands of trajectories without ever interacting with the real environment. The disadvantage is that model errors compound — small mistakes in the learned model snowball into catastrophically wrong plans.
Key Approaches
Dyna. The classic hybrid: do real interactions AND imaginary rollouts through the learned model. After every real transition, use the model to generate several simulated transitions and update Q-values on those too. Simple, effective, and easy to bolt onto any model-free algorithm.
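A minimal tabular Dyna-Q step might look like this. It is a sketch with assumed dict-based tables and a fixed discrete action set, not a reference implementation:

```python
import random
from collections import defaultdict

def dyna_q_update(Q, model, s, a, r, s_next, alpha=0.1, gamma=0.99,
                  n_planning=5, actions=(0, 1)):
    """One Dyna-Q step: a real Q-learning update plus n imagined ones.

    Q: dict mapping (state, action) -> value
    model: dict mapping (state, action) -> (reward, next_state)
    """
    def q_update(s, a, r, s_next):
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    q_update(s, a, r, s_next)      # learn from the real transition
    model[(s, a)] = (r, s_next)    # record it in the learned model
    for _ in range(n_planning):    # replay imagined transitions
        ps, pa = random.choice(list(model.keys()))
        pr, ps_next = model[(ps, pa)]
        q_update(ps, pa, pr, ps_next)
```

The only change from plain Q-learning is the planning loop at the end, which is what makes Dyna easy to bolt onto an existing model-free agent.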
World Models. Learn a compressed representation of the environment dynamics (often using VAEs or recurrent networks), then train a policy entirely inside the "dream." Ha & Schmidhuber's 2018 World Models paper trained agents to play car racing by dreaming.
MuZero (2020). DeepMind's pinnacle of model-based RL. MuZero learns a model that predicts rewards, values, and policies — but never explicitly reconstructs observations. It plans using Monte Carlo Tree Search (MCTS) over the learned latent model. Achieved superhuman performance in Go, Chess, Shogi, and Atari — all with a single algorithm and without being told the rules of the game.
| Approach | Model-Free | Model-Based |
|---|---|---|
| Sample efficiency | Low — needs many interactions | High — can plan in imagination |
| Computation per step | Low | High — planning is expensive |
| Robustness | High — no model to be wrong | Lower — model errors compound |
| Examples | DQN, PPO, SAC | Dyna, MBPO, MuZero, Dreamer |
Key Challenges in RL
RL is powerful but fragile. Here are the practical challenges that make it hard to deploy in the real world.
Sparse Rewards
In many environments, the agent gets reward 0 for almost everything and +1 only when it accidentally stumbles onto the goal. Imagine learning to play chess with a reward signal of only "you won" or "you lost" at the end of the game — no intermediate feedback. With random exploration, the agent may never reach the goal to learn from. Solutions include reward shaping (engineering intermediate rewards), curiosity-driven exploration (intrinsic motivation to visit novel states), and hindsight experience replay (relabeling failed trajectories as successful by changing the goal).
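One shaping scheme that provably preserves the optimal policy is potential-based shaping (Ng, Harada & Russell, 1999): add γΦ(s') − Φ(s) to the environment reward for some potential function Φ over states. A sketch, with a hypothetical grid-world potential:

```python
def shaped_reward(r, s, s_next, potential, gamma=0.99):
    """Potential-based reward shaping: r + gamma * Phi(s') - Phi(s).

    This form densifies the learning signal without changing which
    policy is optimal.
    """
    return r + gamma * potential(s_next) - potential(s)

# Example: in a grid world, use negative Manhattan distance to the
# goal as the potential, so steps toward the goal earn a small bonus.
goal = (4, 4)
potential = lambda s: -(abs(s[0] - goal[0]) + abs(s[1] - goal[1]))
```

Naive hand-crafted bonuses, by contrast, can change the optimal policy and invite reward hacking, which is why the potential-based form is the safe default.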
Exploration vs Exploitation
The eternal dilemma: do you exploit what you know works, or explore to find something better? Too much exploitation gets you stuck in local optima. Too much exploration wastes time on bad actions. ε-greedy is the simplest approach but scales poorly to large state spaces. More sophisticated methods include UCB (Upper Confidence Bound), Thompson Sampling, and count-based exploration bonuses that reward visiting under-explored states.
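In the bandit setting, UCB1 can be sketched in a few lines: pick the arm with the best mean reward plus an uncertainty bonus that shrinks as the arm is pulled more often (function name is ours):

```python
import math

def ucb_action(counts, values, t, c=2.0):
    """UCB1 action selection for a multi-armed bandit.

    counts[a]: times arm a was pulled; values[a]: its mean reward;
    t: total pulls so far. Untried arms are selected first.
    """
    for a, n in enumerate(counts):
        if n == 0:
            return a  # try every arm at least once
    scores = [values[a] + c * math.sqrt(math.log(t) / counts[a])
              for a in range(len(counts))]
    return max(range(len(counts)), key=lambda a: scores[a])
```

The bonus term is large for rarely-tried arms, so exploration happens automatically and tapers off as estimates become reliable, unlike ε-greedy's constant random exploration.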
Sim-to-Real Transfer
Training robots in the real world is slow, expensive, and dangerous. So you train in simulation — but policies learned in simulation often fail in reality because the simulation doesn't perfectly capture friction, latency, sensor noise, and other real-world messiness. Domain randomization (randomizing simulation parameters during training) and system identification (calibrating the simulator to match reality) help bridge this gap, but it remains one of the hardest problems in robotics.
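In practice, domain randomization can be as simple as resampling simulator parameters at the start of every training episode. The parameters and ranges below are purely illustrative:

```python
import random

def randomized_sim_params():
    """Sample fresh physics parameters for one training episode.

    Ranges are illustrative; in practice they are chosen to bracket
    plausible real-world values so the policy learns to be robust
    across all of them.
    """
    return {
        "friction": random.uniform(0.5, 1.5),
        "motor_latency_ms": random.uniform(0.0, 40.0),
        "sensor_noise_std": random.uniform(0.0, 0.05),
        "mass_scale": random.uniform(0.8, 1.2),
    }
```

A policy that succeeds across this whole distribution of simulators is more likely to treat the real world as just one more sample from it.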
Reproducibility and Hyperparameter Sensitivity
Deep RL algorithms are notoriously sensitive to hyperparameters, random seeds, and implementation details. The same algorithm can vary by 2x in performance across random seeds. Small implementation choices (how you normalize advantages, how you initialize weights, how you clip gradients) can matter more than the algorithm itself. This makes deep RL research hard to reproduce and engineering-intensive to deploy.
The RL Algorithm Landscape
Here's a quick map of how the major algorithms relate to each other:
| Algorithm | Type | On/Off-Policy | Action Space | Key Idea |
|---|---|---|---|---|
| Q-Learning | Value-based | Off-policy | Discrete | Learn Q-table via Bellman update |
| SARSA | Value-based | On-policy | Discrete | Learn Q from actual actions taken |
| DQN | Value-based | Off-policy | Discrete | Q-learning + neural nets + replay buffer |
| REINFORCE | Policy gradient | On-policy | Both | Monte Carlo policy gradient |
| A2C/A3C | Actor-critic | On-policy | Both | TD-based advantage + policy gradient |
| PPO | Actor-critic | On-policy | Both | Clipped surrogate objective |
| SAC | Actor-critic | Off-policy | Continuous | Maximum entropy framework |
| TD3 | Actor-critic | Off-policy | Continuous | Twin critics + delayed updates |
| MuZero | Model-based | Off-policy | Discrete | Learned model + MCTS planning |
What You Should Now Be Able To Do
- Explain how RL differs from supervised learning and why the credit assignment problem makes RL fundamentally harder.
- Define an MDP (S, A, P, R, γ) and explain the Markov property and the role of the discount factor.
- Write down the Bellman equation for V(s) and Q(s,a) and explain why its recursive structure is the backbone of RL.
- Distinguish dynamic programming (known model) from model-free methods (unknown model) and explain when each applies.
- Implement tabular Q-learning with ε-greedy exploration and explain why the max operator makes it off-policy.
- Explain why naive deep Q-learning diverges and how experience replay and target networks stabilize DQN.
- Describe the policy gradient theorem and explain why baselines reduce variance in REINFORCE.
- Explain PPO's clipped surrogate objective: what it clips, why it clips, and how this creates a trust region without second-order optimization.
- Walk through the RLHF pipeline (SFT → reward model → PPO) and explain why the KL penalty prevents reward hacking.
- Identify the right RL algorithm family for a given problem — discrete vs continuous actions, on-policy vs off-policy, model-free vs model-based.