Reinforcement Learning

Chapter 17: Learning Theory & Advanced ML
TL;DR

Reinforcement learning is the science of learning to make decisions by trial and error. An agent takes actions in an environment, receives rewards, and gradually learns a policy — a mapping from states to actions — that maximizes long-term cumulative reward. The mathematical backbone is the Markov Decision Process. Tabular methods (Q-learning, SARSA) solve small problems exactly. Deep RL (DQN, policy gradients, PPO) scales to massive state spaces. PPO is the algorithm behind RLHF, which aligns large language models with human preferences. If supervised learning is "learning from answers," RL is "learning from consequences."

Think about how you learned to ride a bike. Nobody handed you a dataset of 10,000 labeled bike-riding examples. Instead, you got on, wobbled, fell, adjusted, and eventually figured it out through raw interaction. You had a goal (don't fall), you took actions (lean left, pedal harder), you got feedback (pain, balance, speed), and over time you developed a policy — an intuitive mapping from "what I see and feel" to "what I should do next."

That's reinforcement learning in a nutshell. And it turns out to be the framework behind game-playing AIs that beat world champions, robots that learn to walk, and — most relevant to the current AI moment — the alignment of large language models through RLHF.

The RL Problem

Supervised learning says: "Here's the right answer, learn to predict it." Reinforcement learning says: "I won't tell you the right answer. I'll tell you how good or bad the consequences of your action were. Figure it out."

This is a fundamentally different setup. In supervised learning, you get immediate, exact feedback for every prediction. In RL, you might take 100 actions before you find out whether the overall sequence was good or bad. Did you lose the chess game because of move 37, or because of move 12? This is the credit assignment problem, and it's what makes RL genuinely hard.

The Cast of Characters

Every RL problem has the same core components: an agent (the learner and decision-maker), an environment (everything the agent interacts with), states (what the agent observes), actions (what the agent can do), rewards (scalar feedback from the environment), and a policy (the agent's strategy — a mapping from states to actions).

The loop is simple: the agent observes a state, picks an action according to its policy, the environment transitions to a new state and emits a reward, and the cycle repeats. The goal is to find the policy that maximizes cumulative reward over time — not just the immediate reward.

RL ≠ Unsupervised Learning. RL has a clear objective (maximize reward). Unsupervised learning has no reward signal at all. RL also differs from supervised learning because the "labels" (rewards) are delayed, sparse, and depend on the agent's own actions. It's a third paradigm.

Supervised vs Reinforcement Learning

Aspect | Supervised Learning | Reinforcement Learning
--- | --- | ---
Feedback | Correct label for every input | Scalar reward, often delayed
Data | Fixed dataset, i.i.d. samples | Agent generates its own data
Goal | Minimize prediction error | Maximize cumulative reward
Exploration | Not needed — labels are given | Critical — must try things to learn
Credit Assignment | Trivial — one label per input | Hard — reward might be delayed by many steps
Stationarity | Training distribution is fixed | Distribution shifts as policy changes

Markov Decision Processes

An MDP is the mathematical language of RL. Nearly every RL algorithm assumes (or approximates) an MDP underneath. If you understand MDPs, you understand the skeleton that all of RL hangs on.

Formal Definition

An MDP is a tuple (S, A, P, R, γ): S is the set of states, A the set of actions, P(s'|s,a) the transition probabilities, R(s,a,s') the reward function, and γ ∈ [0, 1] the discount factor.

The Markov Property

The "Markov" part means: the future depends only on the present state, not on the history of how you got there. Formally: P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ...) = P(s_{t+1} | s_t, a_t).

In chess, the current board position tells you everything you need to decide your next move — you don't need the full move history. In practice, many environments aren't perfectly Markov, but we often construct states that capture enough history to make them approximately Markov (like stacking the last 4 frames in Atari games).

The Discount Factor γ

Why discount? Two reasons. First, mathematical: without discounting, the sum of rewards over an infinite horizon can diverge. Discounting guarantees a finite sum. Second, intuitive: a dollar today is worth more than a dollar next year. A reward now is more certain than a reward in 100 steps — the agent might die, the environment might change, the estimate might be wrong.

Setting γ = 0 makes the agent completely myopic — it only cares about the next reward. Setting γ = 0.99 makes it very far-sighted. Most practical systems use γ between 0.95 and 0.999.
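
To make this concrete: under discount γ, a constant stream of +1 rewards sums to roughly 1/(1−γ), which is why 1/(1−γ) is often read as the agent's effective planning horizon. A quick numeric check (illustrative values only):

for gamma in [0.0, 0.9, 0.99, 0.999]:
    # Discounted sum of a constant +1 reward stream over a long horizon
    discounted_sum = sum(gamma ** t for t in range(10_000))
    print(f"gamma={gamma:<6} discounted sum ≈ {discounted_sum:7.1f}   "
          f"effective horizon ≈ {1 / (1 - gamma):6.0f} steps")

With γ = 0.99 the agent effectively "sees" about 100 steps ahead; with γ = 0.999, about 1,000.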

import numpy as np

# A simple 4x4 grid world MDP
class GridWorld:
    """
    4x4 grid. Agent starts at (0,0), goal at (3,3).
    Actions: 0=up, 1=right, 2=down, 3=left.
    Reward: +1 at goal, -0.01 per step (encourages efficiency).
    """
    def __init__(self, size=4):
        self.size = size
        self.goal = (size - 1, size - 1)
        self.reset()

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        r, c = self.state
        if action == 0: r = max(r - 1, 0)            # up
        elif action == 1: c = min(c + 1, self.size-1)  # right
        elif action == 2: r = min(r + 1, self.size-1)  # down
        elif action == 3: c = max(c - 1, 0)            # left

        self.state = (r, c)
        done = (self.state == self.goal)
        reward = 1.0 if done else -0.01
        return self.state, reward, done

# Demo: random agent in grid world
env = GridWorld()
state = env.reset()
total_reward = 0

for step in range(200):
    action = np.random.randint(4)  # random policy
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        print(f"Random agent reached goal in {step+1} steps, reward: {total_reward:.2f}")
        break
else:
    print(f"Random agent failed after 200 steps, reward: {total_reward:.2f}")

A random agent stumbles around the grid. The entire point of RL is to replace that random policy with something intelligent.

Value Functions

The core question of RL: how good is it to be in a given state, or to take a given action? Value functions answer this.

State-Value Function V(s)

Vπ(s) is the expected cumulative discounted reward starting from state s and following policy π thereafter:

Vπ(s) = E_π[ Σ_{t=0}^{∞} γ^t · r_{t+1} | s_0 = s ]

In words: "If I'm in state s and I follow policy π forever, how much total (discounted) reward do I expect to collect?"

Action-Value Function Q(s, a)

Qπ(s, a) adds one more degree of freedom — it tells you the expected return of taking action a in state s, then following policy π:

Qπ(s, a) = E_π[ Σ_{t=0}^{∞} γ^t · r_{t+1} | s_0 = s, a_0 = a ]

If you know Q for every (state, action) pair, the optimal policy is trivial: just pick the action with the highest Q-value in every state. This is why Q-learning is so popular — learn Q, and you've implicitly learned the policy.

The Advantage Function A(s, a)

The advantage function measures how much better action a is compared to the average action in state s:

Aπ(s, a) = Qπ(s, a) - Vπ(s)

If A(s, a) > 0, action a is better than average. If A(s, a) < 0, it's worse. The advantage function is crucial for policy gradient methods — it reduces variance by centering the reward signal. We'll see this again with PPO.
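
A tiny numeric sketch (made-up numbers) shows how the three quantities relate:

import numpy as np

# Hypothetical Q-values for the 4 actions in one state, and the policy's action probabilities
q = np.array([1.0, 2.0, 0.5, 1.5])    # Q(s, a) for a = 0..3
pi = np.array([0.1, 0.6, 0.1, 0.2])   # pi(a | s)

v = pi @ q                             # V(s) = sum_a pi(a|s) * Q(s, a)  -> 1.65
advantage = q - v                      # A(s, a) = Q(s, a) - V(s)

print(f"V(s) = {v:.2f}")
print("A(s, a) =", np.round(advantage, 2))   # [-0.65  0.35 -1.15 -0.15]

Only action 1 beats the policy's average here, and the advantages average to zero under π — that zero-centering is exactly what reduces variance in policy gradients.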

The Bellman Equations

Here's the insight that makes RL computationally tractable. The value of a state can be expressed recursively in terms of the values of successor states:

Vπ(s) = Σ_a π(a|s) · Σ_{s'} P(s'|s,a) · [ R(s,a,s') + γ · Vπ(s') ]

In English: the value of state s equals the expected immediate reward plus the discounted value of wherever you end up next. This recursive structure is the engine of RL. Every algorithm — dynamic programming, TD learning, Q-learning — is basically a different way of solving or approximating this equation.

The Bellman optimality equation describes the value under the best possible policy:

V*(s) = max_a Σ_{s'} P(s'|s,a) · [ R(s,a,s') + γ · V*(s') ]

And for Q:

Q*(s, a) = Σ_{s'} P(s'|s,a) · [ R(s,a,s') + γ · max_{a'} Q*(s', a') ]

Why Bellman matters. The recursive structure means we don't need to simulate infinite trajectories. We can compute values iteratively, bootstrapping from estimates. This is the mathematical foundation for every RL algorithm, from value iteration to PPO.

Dynamic Programming

If you know the MDP completely — all transition probabilities, all rewards — you can solve for the optimal policy exactly using dynamic programming. This is the "if you had a God's-eye view" scenario.

Policy Evaluation

Given a fixed policy π, compute Vπ(s) for all states by iteratively applying the Bellman equation until convergence:

def policy_evaluation(env_size, policy, P, R, gamma=0.99, theta=1e-8):
    """Evaluate a policy by iteratively solving the Bellman equation.
    
    Args:
        env_size: number of states
        policy: array of shape [n_states, n_actions] (probability of each action)
        P: transition probabilities P[s][a] = list of (prob, next_state, reward, done)
        R: reward function (embedded in P here)
        gamma: discount factor
        theta: convergence threshold
    """
    V = np.zeros(env_size)
    while True:
        delta = 0
        for s in range(env_size):
            v = V[s]
            new_v = 0
            for a in range(policy.shape[1]):  # iterate over all actions
                for prob, next_s, reward, done in P[s][a]:
                    new_v += policy[s][a] * prob * (reward + gamma * V[next_s] * (1 - done))
            V[s] = new_v
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    return V

Policy Iteration

Alternate between evaluating the current policy and improving it greedily. Guaranteed to converge to the optimal policy in a finite number of steps.

  1. Evaluate: compute Vπ for current policy π.
  2. Improve: for each state, set the new policy to the action that maximizes expected value.
  3. Repeat until the policy stops changing.

Value Iteration

Skip the separate evaluation step. Directly apply the Bellman optimality equation as an update rule:

V(s) ← max_a Σ_{s'} P(s'|s,a) · [ R(s,a,s') + γ · V(s') ]

Repeat until convergence, then extract the optimal policy from the final V.
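
Here's a sketch of value iteration, assuming the same P[s][a] = [(prob, next_state, reward, done), ...] structure used by policy_evaluation above:

import numpy as np

def value_iteration(n_states, n_actions, P, gamma=0.99, theta=1e-8):
    """Compute V* with the Bellman optimality backup, then extract the greedy policy."""
    V = np.zeros(n_states)
    while True:
        delta = 0
        for s in range(n_states):
            # Bellman optimality: take the best action value over all actions
            q_values = [sum(prob * (reward + gamma * V[next_s] * (1 - done))
                            for prob, next_s, reward, done in P[s][a])
                        for a in range(n_actions)]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break

    # Greedy policy extraction from the converged V*
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q_values = [sum(prob * (reward + gamma * V[next_s] * (1 - done))
                        for prob, next_s, reward, done in P[s][a])
                    for a in range(n_actions)]
        policy[s] = int(np.argmax(q_values))
    return V, policy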

DP requires a known model. Dynamic programming needs complete knowledge of P(s'|s,a) and R. In most real-world problems, you don't have this. You can't write down the transition probabilities of a robot walking on uneven terrain. This is why model-free methods exist.

Model-Free Methods

Now we enter the real world, where you don't know the transition probabilities. You can't solve the Bellman equation directly because you don't have P. Instead, you learn from experience — from actual interactions with the environment.

Monte Carlo Methods

The simplest idea: play out complete episodes, observe the total return, and average. If you visit state s and eventually get a total return of G, then V(s) ≈ average of all the G's you've seen from s.

Pros: No bias — you're using actual returns, not estimates. No bootstrapping. Works for any environment, including non-Markov ones.

Cons: High variance — one episode might give a return of +50 and the next -30, just due to randomness. And you need to wait until the episode ends, which means Monte Carlo doesn't work for continuing (non-episodic) tasks.

def monte_carlo_evaluation(env, policy_fn, n_episodes=5000, gamma=0.99):
    """First-visit Monte Carlo policy evaluation."""
    returns = {}  # state -> list of observed returns
    V = {}

    for _ in range(n_episodes):
        episode = []
        state = env.reset()
        done = False
        while not done:
            action = policy_fn(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Walk the episode backwards, accumulating returns-to-go
        G = 0
        episode_returns = {}  # state -> return from its FIRST visit in this episode
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            # Overwriting while walking backwards means the value that survives
            # is the return from the state's earliest (first) visit
            episode_returns[state] = G
        for state, G in episode_returns.items():
            returns.setdefault(state, []).append(G)
            V[state] = np.mean(returns[state])
    return V

Temporal Difference (TD) Learning

TD learning is the brilliant middle ground. Instead of waiting for the episode to end (Monte Carlo), you update your estimate after every single step, using the next state's estimated value as a bootstrap:

V(s) ← V(s) + α · [ r + γ · V(s') - V(s) ]

The term r + γ · V(s') is called the TD target. The difference r + γ · V(s') - V(s) is the TD error (δ). If the TD error is positive, the state was better than expected; if negative, worse.

TD learning has lower variance than Monte Carlo (because it bootstraps rather than waiting for noisy full returns) but introduces some bias (because V(s') is an estimate, not the truth). In practice, this tradeoff is usually favorable — TD methods converge faster.

The bias-variance tradeoff of RL. Monte Carlo: zero bias, high variance. TD(0): some bias, lower variance. TD(λ) and n-step methods let you interpolate between the two. GAE (Generalized Advantage Estimation), used in PPO, is exactly this idea applied to the advantage function.
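
As a concrete sketch, TD(0) policy evaluation on the GridWorld above needs only the environment and a policy function — no model:

def td0_evaluation(env, policy_fn, n_episodes=5000, alpha=0.1, gamma=0.99):
    """TD(0) policy evaluation: update V(s) after every single step."""
    V = {}  # state -> estimated value (default 0)
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy_fn(state)
            next_state, reward, done = env.step(action)
            # TD target bootstraps on the current estimate of the next state's value
            td_target = reward + gamma * V.get(next_state, 0.0) * (1 - done)
            V[state] = V.get(state, 0.0) + alpha * (td_target - V.get(state, 0.0))
            state = next_state
    return V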

Q-Learning

This is the most important tabular RL algorithm. Full stop. Q-learning directly estimates the optimal Q-function without needing to know the policy being followed. It's off-policy: the agent can explore with any strategy (like ε-greedy) while still learning the optimal policy.

The update rule:

Q(s, a) ← Q(s, a) + α · [ r + γ · max_{a'} Q(s', a') - Q(s, a) ]

The key insight is the max over next actions. The agent doesn't update toward what it actually did next — it updates toward the best thing it could do next. This is what makes Q-learning off-policy: it learns about the optimal policy regardless of the behavior policy used for exploration.

def q_learning(env, n_episodes=10000, alpha=0.1, gamma=0.99,
               epsilon=1.0, epsilon_decay=0.9995, epsilon_min=0.01):
    """Tabular Q-learning with ε-greedy exploration."""
    Q = {}

    def get_q(state, action):
        return Q.get((state, action), 0.0)

    def get_best_action(state):
        q_values = [get_q(state, a) for a in range(4)]
        return int(np.argmax(q_values))

    rewards_per_episode = []

    for episode in range(n_episodes):
        state = env.reset()
        total_reward = 0
        done = False

        while not done:
            # ε-greedy action selection
            if np.random.random() < epsilon:
                action = np.random.randint(4)
            else:
                action = get_best_action(state)

            next_state, reward, done = env.step(action)

            # Q-learning update: use max over next state
            best_next_q = max(get_q(next_state, a) for a in range(4))
            td_target = reward + gamma * best_next_q * (1 - done)
            td_error = td_target - get_q(state, action)
            Q[(state, action)] = get_q(state, action) + alpha * td_error

            state = next_state
            total_reward += reward

        epsilon = max(epsilon_min, epsilon * epsilon_decay)
        rewards_per_episode.append(total_reward)

    return Q, rewards_per_episode

# Train on our grid world
env = GridWorld()
Q, rewards = q_learning(env)

# Extract learned policy
print("Learned policy (0=↑, 1=→, 2=↓, 3=←):")
actions = {0: '↑', 1: '→', 2: '↓', 3: '←'}
for r in range(4):
    row = []
    for c in range(4):
        if (r, c) == (3, 3):
            row.append('★')
        else:
            best_a = max(range(4), key=lambda a: Q.get(((r,c), a), 0))
            row.append(actions[best_a])
    print(' '.join(row))

# Show learning curve
avg_last_100 = np.mean(rewards[-100:])
print(f"\nAvg reward (last 100 episodes): {avg_last_100:.3f}")

SARSA — On-Policy TD Control

SARSA is Q-learning's on-policy cousin. The name literally describes the update: State, Action, Reward, next State, next Action.

Q(s, a) ← Q(s, a) + α · [ r + γ · Q(s', a') - Q(s, a) ]

The difference from Q-learning: instead of max_{a'} Q(s', a'), SARSA uses Q(s', a') where a' is the action the agent actually takes next. This makes SARSA on-policy — it evaluates and improves the policy it's actually following.

Feature | Q-Learning | SARSA
--- | --- | ---
Update target | max_{a'} Q(s', a') | Q(s', a') — actual next action
Policy type | Off-policy | On-policy
Exploration effect | Learns optimal policy despite exploration | Learns about the policy being followed (including exploration)
Cliff-walking behavior | Learns the optimal (risky) path | Learns a safe path that accounts for ε-random missteps
Convergence | To Q* under mild conditions | To Qπ for the ε-greedy policy

In the classic "cliff walking" example, Q-learning learns the shortest path along the cliff edge (optimal but dangerous under exploration), while SARSA learns a safer path farther from the cliff (accounting for the fact that ε-greedy will occasionally step off the edge).
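
For comparison, a SARSA loop on the same GridWorld is almost identical to q_learning above — the only change is that the update uses the action actually chosen next, not the max. A sketch:

import numpy as np

def sarsa(env, n_episodes=10000, alpha=0.1, gamma=0.99,
          epsilon=1.0, epsilon_decay=0.9995, epsilon_min=0.01):
    """Tabular SARSA: on-policy TD control with ε-greedy exploration."""
    Q = {}

    def epsilon_greedy(state):
        if np.random.random() < epsilon:
            return np.random.randint(4)
        return int(np.argmax([Q.get((state, a), 0.0) for a in range(4)]))

    for _ in range(n_episodes):
        state = env.reset()
        action = epsilon_greedy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(next_state)   # the action we will actually take
            # SARSA target: Q(s', a') for the actual next action — no max
            td_target = reward + gamma * Q.get((next_state, next_action), 0.0) * (1 - done)
            Q[(state, action)] = Q.get((state, action), 0.0) + alpha * (td_target - Q.get((state, action), 0.0))
            state, action = next_state, next_action
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
    return Q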

Deep Reinforcement Learning

Everything so far used tables — a Q-value for every (state, action) pair. That works for a 4×4 grid (16 states × 4 actions = 64 entries). But what about Atari games (210×160×3 pixel images)? Or Go (10^170 possible board states)? Or continuous control with real-valued joint angles?

Tables can't scale. You need function approximation — neural networks that generalize across similar states. Welcome to deep RL, where everything gets more powerful and more unstable.

DQN — Deep Q-Networks

The 2013/2015 DeepMind breakthrough that played Atari games from raw pixels. The idea is deceptively simple: replace the Q-table with a neural network Q(s, a; θ) that takes a state (image) and outputs Q-values for all actions.

But naive deep Q-learning diverges. The neural network is trying to hit a moving target while standing on a moving platform. Two critical tricks stabilized it:

1. Experience Replay. Instead of learning from consecutive transitions (which are highly correlated), store transitions in a replay buffer and sample random mini-batches. This breaks temporal correlations and reuses data efficiently.

2. Target Network. Use a separate "target" network (with frozen weights θ⁻) to compute the TD target. Update the target network periodically by copying the main network's weights. This prevents the catastrophic feedback loop where the target shifts with every gradient step.

import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque
import random

class DQN(nn.Module):
    """Deep Q-Network for discrete action spaces."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim)
        )

    def forward(self, x):
        return self.net(x)

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (torch.FloatTensor(states), torch.LongTensor(actions),
                torch.FloatTensor(rewards), torch.FloatTensor(next_states),
                torch.FloatTensor(dones))

    def __len__(self):
        return len(self.buffer)

class DQNAgent:
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99,
                 epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01,
                 target_update_freq=100):
        self.action_dim = action_dim
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.target_update_freq = target_update_freq

        self.q_net = DQN(state_dim, action_dim)
        self.target_net = DQN(state_dim, action_dim)
        self.target_net.load_state_dict(self.q_net.state_dict())

        self.optimizer = optim.Adam(self.q_net.parameters(), lr=lr)
        self.buffer = ReplayBuffer()
        self.steps = 0

    def select_action(self, state):
        if random.random() < self.epsilon:
            return random.randrange(self.action_dim)
        with torch.no_grad():
            q_values = self.q_net(torch.FloatTensor(state).unsqueeze(0))
            return q_values.argmax(dim=1).item()

    def train_step(self, batch_size=64):
        if len(self.buffer) < batch_size:
            return

        states, actions, rewards, next_states, dones = self.buffer.sample(batch_size)

        # Current Q-values
        q_values = self.q_net(states).gather(1, actions.unsqueeze(1)).squeeze()

        # Target Q-values from frozen target network
        with torch.no_grad():
            next_q = self.target_net(next_states).max(dim=1)[0]
            targets = rewards + self.gamma * next_q * (1 - dones)

        loss = nn.MSELoss()(q_values, targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Periodic target network update
        self.steps += 1
        if self.steps % self.target_update_freq == 0:
            self.target_net.load_state_dict(self.q_net.state_dict())

        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

Why naive deep Q-learning diverges. Three correlated problems: (1) consecutive samples are highly correlated, violating the i.i.d. assumption of SGD; (2) the target Q-values shift with every weight update, creating a non-stationary regression problem; (3) Q-value overestimation compounds through the max operator. Experience replay fixes #1. The target network fixes #2. Double DQN addresses #3 by decoupling action selection from evaluation.
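
For reference, the Double DQN change is tiny — a sketch of how the target computation in train_step above could be swapped (the helper name is illustrative):

import torch

def double_dqn_targets(q_net, target_net, next_states, rewards, dones, gamma):
    """Double DQN target: the online network selects the next action,
    the target network evaluates it. This damps max-operator overestimation."""
    with torch.no_grad():
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)        # selection
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # evaluation
        return rewards + gamma * next_q * (1 - dones)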

Policy Gradient Methods

DQN learns a value function and derives a policy from it. Policy gradient methods cut out the middleman: they directly parameterize and optimize the policy π(a|s; θ).

Why would you want this? Three reasons:

  1. Continuous actions. DQN outputs Q-values for discrete actions. But a robotic arm has continuous joint torques. Policy gradient methods naturally handle continuous action spaces.
  2. Stochastic policies. Sometimes the optimal policy is stochastic (think rock-paper-scissors). Value-based methods give you deterministic policies.
  3. Better convergence properties. Small changes in θ cause small changes in the policy, giving smoother optimization landscapes.

The Policy Gradient Theorem

The key insight: we can compute the gradient of expected reward with respect to policy parameters, even though the environment dynamics are unknown. The policy gradient theorem gives us:

∇_θ J(θ) = E_π[ ∇_θ log π(a|s; θ) · Qπ(s, a) ]

In words: adjust θ to make actions with high Q-values more likely. The gradient of log-probability tells you which direction in parameter space makes that action more probable. The Q-value tells you how much to push.

REINFORCE

The simplest policy gradient algorithm. Replace Qπ(s, a) with the actual sampled return G_t from that point forward:

class REINFORCE:
    """Monte Carlo policy gradient (REINFORCE) with baseline."""
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99):
        self.gamma = gamma

        # Policy network: outputs action probabilities
        self.policy = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
            nn.Softmax(dim=-1)
        )

        # Baseline network: estimates V(s) to reduce variance
        self.baseline = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )

        self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        self.baseline_optimizer = optim.Adam(self.baseline.parameters(), lr=lr)

    def select_action(self, state):
        state_t = torch.FloatTensor(state).unsqueeze(0)
        probs = self.policy(state_t)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action)

    def update(self, log_probs, rewards, states):
        # Compute discounted returns
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + self.gamma * G
            returns.insert(0, G)
        returns = torch.FloatTensor(returns)

        # Normalize returns for stability
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)

        # Baseline values
        states_t = torch.FloatTensor(states)
        baselines = self.baseline(states_t).squeeze()

        # Advantage = return - baseline
        advantages = returns - baselines.detach()

        # Policy loss: -log π(a|s) · A(s,a)
        policy_loss = 0
        for log_prob, advantage in zip(log_probs, advantages):
            policy_loss -= log_prob * advantage

        # Baseline loss: MSE(V(s), G)
        baseline_loss = nn.MSELoss()(baselines, returns)

        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        self.policy_optimizer.step()

        self.baseline_optimizer.zero_grad()
        baseline_loss.backward()
        self.baseline_optimizer.step()

The variance problem. Vanilla REINFORCE has extremely high variance. One lucky episode dominates the gradient. The baseline V(s) fixes this by centering the reward signal: instead of "was this return good?" we ask "was this return better than average?" This is the advantage function in disguise: A(s,a) = G_t - V(s). Reducing variance without adding bias is the central engineering challenge of policy gradient methods.

Actor-Critic Methods

REINFORCE uses full episode returns (Monte Carlo). Actor-critic methods bootstrap like TD learning — they replace the full return G_t with the TD target r + γ·V(s').

The architecture has two components: an actor — the policy π(a|s; θ), which decides which action to take — and a critic — a value function V(s; w), which estimates how good each state is.

The critic provides the advantage estimate: δ = r + γ·V(s') - V(s). The actor uses this advantage to update its policy. The critic updates by minimizing the TD error. They train simultaneously, improving each other.

class ActorCritic(nn.Module):
    """Combined actor-critic network with shared feature extraction."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        # Shared feature layers
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU()
        )
        # Actor head: action probabilities
        self.actor = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
            nn.Softmax(dim=-1)
        )
        # Critic head: state value
        self.critic = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1)
        )

    def forward(self, state):
        features = self.shared(state)
        action_probs = self.actor(features)
        state_value = self.critic(features)
        return action_probs, state_value

    def get_action(self, state):
        state_t = torch.FloatTensor(state).unsqueeze(0)
        probs, value = self.forward(state_t)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action), value.squeeze()

def train_actor_critic(model, optimizer, log_probs, values, rewards,
                       next_value, gamma=0.99, entropy_coef=0.01):
    """A2C-style update using TD advantages."""
    returns = []
    R = next_value.detach()
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.stack(returns)  # per-step return tensors (already detached via next_value)

    log_probs = torch.stack(log_probs)
    values = torch.stack(values)
    advantages = returns - values.detach()

    # Actor loss: policy gradient with advantage
    actor_loss = -(log_probs * advantages).mean()

    # Critic loss: value function regression
    critic_loss = nn.MSELoss()(values, returns)

    # Entropy bonus: encourage exploration (rough estimate from the sampled actions'
    # log-probs rather than the full distribution entropy)
    entropy = -(torch.exp(log_probs) * log_probs).mean()

    loss = actor_loss + 0.5 * critic_loss - entropy_coef * entropy
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), 0.5)
    optimizer.step()

A2C (Advantage Actor-Critic) runs multiple environments in parallel synchronously. A3C (Asynchronous Advantage Actor-Critic) runs them asynchronously with each worker updating a shared model. A2C is generally preferred now because synchronous updates are simpler and GPUs handle batched operations efficiently.

PPO — Proximal Policy Optimization

PPO is the workhorse of modern RL. Published by OpenAI in 2017, it's the default algorithm for continuous control, game playing, and — critically — RLHF for language model alignment. If you learn one deep RL algorithm, make it this one.

The Problem PPO Solves

Policy gradient methods have a fundamental tension. You want to take big steps to learn quickly, but big steps can catastrophically change your policy — you overshoot, the policy collapses, and you can never recover. Standard policy gradient gives you a direction to improve, but no sense of how far to go.

TRPO (Trust Region Policy Optimization) solved this by constraining the KL divergence between old and new policies. It works brilliantly but requires computing second-order derivatives, which is expensive and complex. PPO achieves similar stability with a simple first-order method.

The Clipped Surrogate Objective

PPO's key idea is elegant. Define the probability ratio:

r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t)

If r_t > 1, the new policy is more likely to take action a_t than the old policy. If r_t < 1, it's less likely. The clipped surrogate objective is:

L(θ) = E[ min( r_t(θ) · A_t, clip(r_t(θ), 1-ε, 1+ε) · A_t ) ]

Here ε is typically 0.1 or 0.2. Let's unpack what this does: when the advantage is positive (the action was better than expected), the clipped term caps the ratio at 1+ε, so there is no extra reward for making that action much more likely in a single update. When the advantage is negative, the clipped term bottoms out at 1-ε, so there is no extra reward for driving the action's probability far below the old policy's. Taking the min of the clipped and unclipped terms makes the objective a pessimistic lower bound: the policy gains nothing by pushing the ratio outside [1-ε, 1+ε], but it is still penalized if the ratio has already drifted too far in the wrong direction.

The effect: a "trust region" achieved through clipping rather than constrained optimization. No Hessians, no conjugate gradients. Just a simple clamp on the probability ratio. It's almost embarrassingly simple, and it works astonishingly well.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical

class PPOAgent(nn.Module):
    """Proximal Policy Optimization with clipped surrogate objective."""
    def __init__(self, state_dim, action_dim, hidden=256,
                 lr=3e-4, gamma=0.99, lam=0.95,
                 clip_eps=0.2, epochs=10, batch_size=64):
        super().__init__()
        self.gamma = gamma
        self.lam = lam          # GAE lambda
        self.clip_eps = clip_eps
        self.epochs = epochs
        self.batch_size = batch_size

        # Shared backbone
        self.features = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh()
        )
        self.actor_head = nn.Linear(hidden, action_dim)
        self.critic_head = nn.Linear(hidden, 1)
        self.optimizer = optim.Adam(self.parameters(), lr=lr)

    def forward(self, state):
        features = self.features(state)
        logits = self.actor_head(features)
        value = self.critic_head(features)
        return logits, value.squeeze(-1)

    def get_action(self, state):
        state_t = torch.FloatTensor(state).unsqueeze(0)
        logits, value = self.forward(state_t)
        dist = Categorical(logits=logits)
        action = dist.sample()
        return action.item(), dist.log_prob(action).item(), value.item()

    def compute_gae(self, rewards, values, dones, next_value):
        """Generalized Advantage Estimation (GAE).

        GAE-lambda interpolates between TD(0) (lambda=0, low variance,
        more bias) and Monte Carlo (lambda=1, high variance, no bias).
        lambda=0.95 is the standard choice.
        """
        advantages = []
        gae = 0
        values = values + [next_value]
        for t in reversed(range(len(rewards))):
            delta = rewards[t] + self.gamma * values[t+1] * (1 - dones[t]) - values[t]
            gae = delta + self.gamma * self.lam * (1 - dones[t]) * gae
            advantages.insert(0, gae)
        returns = [adv + val for adv, val in zip(advantages, values[:-1])]
        return advantages, returns

    def update(self, states, actions, old_log_probs, advantages, returns):
        """PPO clipped surrogate update."""
        states = torch.FloatTensor(states)
        actions = torch.LongTensor(actions)
        old_log_probs = torch.FloatTensor(old_log_probs)
        advantages = torch.FloatTensor(advantages)
        returns = torch.FloatTensor(returns)

        # Normalize advantages
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        for _ in range(self.epochs):
            # Mini-batch updates
            indices = torch.randperm(len(states))
            for start in range(0, len(states), self.batch_size):
                idx = indices[start:start + self.batch_size]
                mb_states = states[idx]
                mb_actions = actions[idx]
                mb_old_log_probs = old_log_probs[idx]
                mb_advantages = advantages[idx]
                mb_returns = returns[idx]

                # Current policy evaluation
                logits, values = self.forward(mb_states)
                dist = Categorical(logits=logits)
                new_log_probs = dist.log_prob(mb_actions)
                entropy = dist.entropy().mean()

                # Probability ratio
                ratio = torch.exp(new_log_probs - mb_old_log_probs)

                # Clipped surrogate objective
                surr1 = ratio * mb_advantages
                surr2 = torch.clamp(ratio, 1 - self.clip_eps, 1 + self.clip_eps) * mb_advantages
                actor_loss = -torch.min(surr1, surr2).mean()

                # Value loss
                critic_loss = nn.MSELoss()(values, mb_returns)

                # Total loss
                loss = actor_loss + 0.5 * critic_loss - 0.01 * entropy

                self.optimizer.zero_grad()
                loss.backward()
                nn.utils.clip_grad_norm_(self.parameters(), 0.5)
                self.optimizer.step()

def ppo_training_loop(env_fn, agent, total_steps=100_000, rollout_len=2048):
    """Standard PPO training loop: collect rollouts, then update."""
    env = env_fn()
    state = env.reset()
    episode_reward = 0
    episode_rewards = []

    for step in range(0, total_steps, rollout_len):
        states, actions, rewards, dones = [], [], [], []
        log_probs, values = [], []

        # Collect rollout
        for _ in range(rollout_len):
            action, log_prob, value = agent.get_action(state)
            next_state, reward, done = env.step(action)

            states.append(state)
            actions.append(action)
            rewards.append(reward)
            dones.append(float(done))
            log_probs.append(log_prob)
            values.append(value)

            episode_reward += reward
            state = next_state

            if done:
                episode_rewards.append(episode_reward)
                episode_reward = 0
                state = env.reset()

        # Compute advantages using GAE
        _, _, next_value = agent.get_action(state)
        advantages, returns = agent.compute_gae(rewards, values, dones, next_value)

        # PPO update
        agent.update(states, actions, log_probs, advantages, returns)

        if episode_rewards:
            avg = sum(episode_rewards[-10:]) / min(10, len(episode_rewards))
            print(f"Step {step + rollout_len}, Avg Reward (last 10): {avg:.2f}")

    return episode_rewards

Why PPO dominates. PPO combines three things: (1) the stability of trust-region methods (via clipping), (2) the simplicity of first-order optimization (no Hessians), and (3) sample efficiency from multiple epochs of mini-batch updates on collected data. It's not the most sample-efficient algorithm, but it's reliable, scalable, and hard to break. That's why it's the go-to algorithm for RLHF.

PPO Hyperparameters That Matter

Hyperparameter | Typical Value | Effect
--- | --- | ---
clip_eps (ε) | 0.1–0.2 | Size of the trust region. Smaller = more conservative updates.
GAE λ | 0.95 | Bias-variance tradeoff for advantage estimation. λ=0 → TD(0), λ=1 → Monte Carlo.
γ (discount) | 0.99 | How far-sighted the agent is.
Epochs per rollout | 3–10 | How many times to reuse collected data. Too many → overfitting to old data.
Mini-batch size | 64–4096 | Larger batches reduce variance but cost more compute.
Entropy coefficient | 0.01 | Encourages exploration. Critical early in training.
Value loss coefficient | 0.5 | Relative weight of critic vs actor loss.
Gradient clip norm | 0.5 | Prevents destructive large gradient steps.

RLHF — Reinforcement Learning from Human Feedback

This is why RL matters in 2024. RLHF is how ChatGPT, Claude, and other LLMs are aligned with human preferences. And at its core, it's PPO applied to language generation.

The Three-Stage Pipeline

Stage 1: Supervised Fine-Tuning (SFT). Take a pretrained language model and fine-tune it on high-quality demonstration data (human-written responses). This gives you a decent starting policy πSFT.

Stage 2: Reward Model Training. Show the SFT model's outputs to human raters who rank them by preference (e.g., "Response A is better than Response B"). Train a reward model R(prompt, response) to predict these preferences. The reward model is typically initialized from the SFT model with a scalar output head replacing the language modeling head.
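
The preference objective in Stage 2 is typically a Bradley–Terry style pairwise loss — a minimal sketch (names are illustrative, not a specific library's API):

import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: push the score of the preferred response above
    the rejected one. Inputs: reward-model scores per comparison, shape [batch]."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()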

Stage 3: PPO Optimization. Treat the language model as the RL policy. The "state" is the prompt plus tokens generated so far. The "action" is the next token. The "reward" comes from the reward model applied to the complete response. Use PPO to maximize this reward while staying close to the original SFT policy (via a KL penalty).

# Simplified RLHF training loop (pseudocode)
def rlhf_step(policy_model, ref_model, reward_model, prompts,
              kl_coef=0.1, clip_eps=0.2):
    """One PPO step of RLHF training.

    policy_model: the LLM being optimized
    ref_model: frozen copy of the SFT model (KL anchor)
    reward_model: predicts human preference scores
    """
    # 1. Generate responses from current policy
    responses = policy_model.generate(prompts)

    # 2. Score responses with reward model
    rewards = reward_model.score(prompts, responses)

    # 3. Compute KL penalty to stay close to reference
    log_probs_policy = policy_model.log_probs(prompts, responses)
    log_probs_ref = ref_model.log_probs(prompts, responses)
    kl_penalty = log_probs_policy - log_probs_ref  # per-token KL

    # 4. Adjusted reward: original reward minus KL penalty
    adjusted_rewards = rewards - kl_coef * kl_penalty.sum(dim=-1)

    # 5. PPO update on the policy model
    #    (values would come from a value head / critic attached to the policy model)
    advantages = compute_gae(adjusted_rewards, values)
    ppo_update(policy_model, log_probs_policy, advantages, clip_eps)

    return {
        'mean_reward': rewards.mean().item(),
        'mean_kl': kl_penalty.mean().item()
    }

The KL penalty is critical. Without it, the policy model quickly learns to exploit the reward model — generating adversarial gibberish that scores high with the reward model but is meaningless to humans. This is called reward hacking. The KL divergence term anchors the optimized policy near the SFT model, preventing this collapse. Tuning kl_coef is one of the most important RLHF decisions.

Recent alternatives to PPO for alignment include DPO (Direct Preference Optimization), which skips the reward model entirely and optimizes preferences directly. DPO reformulates the RLHF objective so the optimal policy can be expressed in closed form, turning the RL problem into a supervised learning problem on preference pairs. It's simpler and increasingly popular, though PPO-based RLHF remains the established approach.
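
For intuition, the DPO loss on a batch of preference pairs can be sketched like this (β is a temperature-like hyperparameter; all names are illustrative):

import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: prefer the chosen response, measured relative to a frozen reference model.
    Inputs are summed log-probs of each full response under the policy / reference, shape [batch]."""
    chosen_margin = logp_chosen - ref_logp_chosen        # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_margin = logp_rejected - ref_logp_rejected  # log pi(y_l|x) - log pi_ref(y_l|x)
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()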

Model-Based Reinforcement Learning

Everything above is model-free: the agent learns directly from interactions without trying to understand how the environment works. Model-based RL takes a different approach: learn a model of the environment (transition dynamics and rewards), then use that model to plan.

The advantage is sample efficiency. If you can predict what will happen when you take an action, you can "imagine" thousands of trajectories without ever interacting with the real environment. The disadvantage is that model errors compound — small mistakes in the learned model snowball into catastrophically wrong plans.

Key Approaches

Dyna. The classic hybrid: do real interactions AND imaginary rollouts through the learned model. After every real transition, use the model to generate several simulated transitions and update Q-values on those too. Simple, effective, and easy to bolt onto any model-free algorithm.
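
A Dyna-Q inner loop might look like this sketch (assuming a tabular Q dict and a simple deterministic model of observed transitions — names illustrative):

import random

def dyna_q_update(Q, model, state, action, reward, next_state, done,
                  alpha=0.1, gamma=0.99, n_planning_steps=10, n_actions=4):
    """One real Q-learning update, then several 'imagined' updates from the learned model."""
    def q_update(s, a, r, s_next, d):
        best_next = max(Q.get((s_next, b), 0.0) for b in range(n_actions))
        target = r + gamma * best_next * (1 - d)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

    q_update(state, action, reward, next_state, done)    # learn from the real transition
    model[(state, action)] = (reward, next_state, done)  # update the deterministic model

    # Planning: replay imagined transitions sampled from previously seen (s, a) pairs
    for _ in range(n_planning_steps):
        s, a = random.choice(list(model.keys()))
        r, s_next, d = model[(s, a)]
        q_update(s, a, r, s_next, d)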

World Models. Learn a compressed representation of the environment dynamics (often using VAEs or recurrent networks), then train a policy entirely inside the "dream." Ha & Schmidhuber's 2018 World Models paper trained agents to play car racing by dreaming.

MuZero (2020). DeepMind's pinnacle of model-based RL. MuZero learns a model that predicts rewards, values, and policies — but never explicitly reconstructs observations. It plans using Monte Carlo Tree Search (MCTS) over the learned latent model. Achieved superhuman performance in Go, Chess, Shogi, and Atari — all with a single algorithm and without being told the rules of the game.

Approach | Model-Free | Model-Based
--- | --- | ---
Sample efficiency | Low — needs many interactions | High — can plan in imagination
Computation per step | Low | High — planning is expensive
Robustness | High — no model to be wrong | Lower — model errors compound
Examples | DQN, PPO, SAC | Dyna, MBPO, MuZero, Dreamer

Key Challenges in RL

RL is powerful but fragile. Here are the practical challenges that make it hard to deploy in the real world.

Sparse Rewards

In many environments, the agent gets reward 0 for almost everything and +1 only when it accidentally stumbles onto the goal. Imagine learning to play chess with a reward signal of only "you won" or "you lost" at the end of the game — no intermediate feedback. With random exploration, the agent may never reach the goal to learn from. Solutions include reward shaping (engineering intermediate rewards), curiosity-driven exploration (intrinsic motivation to visit novel states), and hindsight experience replay (relabeling failed trajectories as successful by changing the goal).

Exploration vs Exploitation

The eternal dilemma: do you exploit what you know works, or explore to find something better? Too much exploitation gets you stuck in local optima. Too much exploration wastes time on bad actions. ε-greedy is the simplest approach but scales poorly to large state spaces. More sophisticated methods include UCB (Upper Confidence Bound), Thompson Sampling, and count-based exploration bonuses that reward visiting under-explored states.
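
For instance, UCB1 action selection in a bandit-style setting can be sketched as:

import numpy as np

def ucb_action(q_estimates, action_counts, t, c=2.0):
    """UCB1: prefer actions that look good OR that we know little about.
    q_estimates, action_counts: arrays of shape [n_actions]; t: total steps so far."""
    bonus = c * np.sqrt(np.log(max(t, 1)) / np.maximum(action_counts, 1))
    bonus[action_counts == 0] = np.inf   # untried actions get tried first
    return int(np.argmax(q_estimates + bonus))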

Sim-to-Real Transfer

Training robots in the real world is slow, expensive, and dangerous. So you train in simulation — but policies learned in simulation often fail in reality because the simulation doesn't perfectly capture friction, latency, sensor noise, and other real-world messiness. Domain randomization (randomizing simulation parameters during training) and system identification (calibrating the simulator to match reality) help bridge this gap, but it remains one of the hardest problems in robotics.

Reproducibility and Hyperparameter Sensitivity

Deep RL algorithms are notoriously sensitive to hyperparameters, random seeds, and implementation details. The same algorithm can vary by 2x in performance across random seeds. Small implementation choices (how you normalize advantages, how you initialize weights, how you clip gradients) can matter more than the algorithm itself. This makes deep RL research hard to reproduce and engineering-intensive to deploy.

The RL practitioner's mantra. When in doubt, use PPO. Set clip_eps=0.2, GAE λ=0.95, γ=0.99, learning rate 3e-4 with Adam, normalize observations and advantages, and use multiple parallel environments. This is the "ImageNet recipe" of RL — a solid starting point that works across a wide range of problems. Deviate only when you have a specific reason.

The RL Algorithm Landscape

Here's a quick map of how the major algorithms relate to each other:

Algorithm | Type | On/Off-Policy | Action Space | Key Idea
--- | --- | --- | --- | ---
Q-Learning | Value-based | Off-policy | Discrete | Learn Q-table via Bellman update
SARSA | Value-based | On-policy | Discrete | Learn Q from actual actions taken
DQN | Value-based | Off-policy | Discrete | Q-learning + neural nets + replay buffer
REINFORCE | Policy gradient | On-policy | Both | Monte Carlo policy gradient
A2C/A3C | Actor-critic | On-policy | Both | TD-based advantage + policy gradient
PPO | Actor-critic | On-policy | Both | Clipped surrogate objective
SAC | Actor-critic | Off-policy | Continuous | Maximum entropy framework
TD3 | Actor-critic | Off-policy | Continuous | Twin critics + delayed updates
MuZero | Model-based | Off-policy | Discrete | Learned model + MCTS planning

What You Should Now Be Able To Do