Home

The Policy Gradient Theorem, REINFORCE

5/2026


In this post we'll go through the framework surrounding the policy gradient theorem and derive the REINFORCE algorithm, which underpins modern RL post-training techniques including PPO and GRPO.



Policy Gradient Methods

Under the Markov decision process (MDP) framework, an agent interacts with an environment at discrete timesteps $t=0,1,2, \dots, T-1$, forming a trajectory/episode/rollout across T timesteps: \begin{align*} S_0, A_0, R_1, S_1, A_1, R_2, ... S_{T-1}, A_{T-1}, R_T \end{align*} Namely at time t, the agent receives some representation of the environment's state $S_t \in \mathcal{S}$ and uses it to select an action $A_t \in \mathcal{A}(s)$. The agent receives a scalar reward $R_{t+1} \in \mathcal{R}$ as a result of taking action $A_t$ and transitions into a new state $S_{t+1}$ according to some probability distribution $p(s',r|s,a) \equiv Pr(S_{t+1}=s',R_{t+1}=r|S_t=s,A_t=a)$ that defines the dynamics of the MDP. Under a finite MDP (finite state space $\mathcal{S}$, finite action space $\mathcal{A}$, and finite reward space $\mathcal{R}$), we have \begin{align*} p(s',r|s,a) \geq 0, \quad \sum_{s'} \sum_{r} p(s',r|s,a) = 1 \quad \forall s \in \mathcal{S}, a \in \mathcal{A}(s) \end{align*} Assuming that the agent's goal is to maximize rewards and borrowing the idea of NPVs from finance, we normalize reward terms into a common unit of present value via a discount factor $\gamma \in [0, 1]$ before summing them to obtain the return at time $t$: \begin{align*} G_t = \sum_{k=0}^{T - t - 1} \gamma^k R_{t+k+1} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{T-t-1} R_T \end{align*} E.g. $G_3$ represents the NPV of all future rewards obtained by the agent, starting from $S_3$. A $\gamma$ of 1 treats all rewards as equal, whereas a lower discount favors short-term rewards. In most applications of RL-based LLM post-training we assume $\gamma=1$, i.e. we make no distinction between rewards obtained from response tokens near the beginning of the response vs. near the end. In fact, sequence-level rewards assign rewards to whole sequences instead. We can then define an action-value function for $\pi$: \begin{align*} q_\pi(s, a) \equiv \mathbb{E}_\pi [G_t | S_t=s, A_t=a] = \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s',r|s,a) \left( r + \gamma v_\pi(s') \right) \end{align*} representing the NPV starting from state $s$, taking action $a$ and following $\pi$ thereafter. The value function $v_\pi(s)$ represents the value of state $s$ – the NPV starting from state $s$ and following some $\pi$ used for action selection at each step: \begin{align*} v_\pi(s) \equiv \mathbb{E}_\pi [G_t | S_t=s] = \sum_{a \in \mathcal{A}(s)} \pi(a|s) q_\pi(s, a) \end{align*} where $\pi(a|s)=Pr(A_t=a | S_t=s) \in [0, 1]$ is the probability of taking action $a$ in state $s$.

Up until now, the true policy $\pi$ can be any oracle that defines the exact probability of taking action $a$ in state $s$ to achieve a specific objective. In simple finite MDPs, $\pi$ can be a medium-sized lookup table (tabular RL). But here we're interested in differentiable parametric policy models (function approximations $\pi_\theta$ of the true policy $\pi$) since the state space is enormous and technically infinitely large (any question of any length can be asked), thus unenumerable. Given $\pi_\theta(s)$ with parameters $\theta \in \mathbb{R}^d$ as the action selector, we want to find the optimal parameters $\theta^*$ that maximizes some scalar performance measure $J(\theta)$. We can define the policy model as a softmax over the action space \begin{align*} \pi_\theta(a|s) = \frac{e^{h_\theta(s, a)}}{\sum_{a' \in \mathcal{A}(s)} e^{h_\theta(s, a')}} \in [0, 1] \quad \forall a \in \mathcal{A}(s) \end{align*} where the state-action preference $h_\theta(s, a) \in \mathbb{R}$ maps a state-action pair $(s,a)$ to a scalar. Greedy action selection/decoding uses \begin{align*} A_t = \text{argmax}_{a \in \mathcal{A}(S_t)} \pi_\theta(a|S_t) \end{align*} Or, stochastic sampling methods like top-k and top-p (Nucleus sampling) can be used for better exploration. Finally we perform gradient ascent on $J(\theta)$ (or AdamW, etc.) to approximate $\theta^* = \text{argmax}_\theta J(\theta)$: \begin{align*} \theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t) \end{align*} where $\nabla_\theta J(\theta_t) \in \mathbb{R}^d$. All methods that follow this general approach of optimizing a parametric policy model via a scalar performance $J$ are called policy gradient methods.



The Policy Gradient Theorem

Up until now, we've said that $J$ should be some scalar number (higher is better) and a function of the policy weights $\theta_t$. What should $J$ be? Given a (non-random) starting state $s_0$, a natural choice is to define it as the value of said starting state: \begin{align*} J(\theta) \equiv v_{\pi_\theta}(s_0) = \mathbb{E}_{\pi_\theta} [G_t | S_t=s_0] \in \mathbb{R} \end{align*} where $v_{\pi_\theta}$ is the true value function for the current policy $\pi_\theta$. I.e. given a representation of a prefix/question $s_0$, we can generate tokens at each timestep $t+1, t+2, ...$ , and assign a reward to each token (or to the whole rollout sequence). The RHS represents the expected return, under the current policy, over all possible completions/rollouts for a given question $s_0$, according to some reward function. Now take the gradient of both sides with respect to $\theta$ (assume all grads are with respect to $\theta$ from here on out): \begin{align*} \nabla J(\theta) = \nabla v_{\pi_\theta}(s_0) \in \mathbb{R}^d \end{align*} This quanity tells us how to update the policy weights $\theta$ so as to maximize $\mathbb{E}_{\pi_\theta} [G_t | S_t=s_0]$. Whatever policy would maximize that expected return would be the optimal policy for that question. Head in that direction for this gradient update, then do it for lots of other questions you care about.

Since the prefix that $s_0$ represents can be arbitrarily long, the number of possible starting states is technically unbounded, and therefore far exceeds the number of policy weights. E.g. for a vocabulary size 250k, there's already 62 billion possible two-token prefixes, and 15 quadrillion possible three-token prefixes. Making an update based on one state necessarily affects the action selection for other states as well. On the one hand this helps with generalization, allowing the policy to operate on states it hasn't seen before during training. On the other hand, training on one state can cause degradation and forgetting in other states – a fundamental tradeoff of parametric function approximation.

The policy gradient theorem moves the theoretical expression above towards a practical algorithm: \begin{align*} \nabla J(\theta) \propto \sum_{s} \mu(s) \sum_{a} q_{\pi_\theta}(s, a) \nabla \pi_\theta(a|s) \in \mathbb{R}^d \quad \text{(Policy Gradient Theorem)} \end{align*} \begin{align*} \mu(s) = \eta(s) / \sum_{s'} \eta(s') \quad \forall s \in \mathcal{S} \quad \text{(On-policy State Distribution for $\pi_\theta$ given $s_0$)} \end{align*} where $\eta(s)$ is the expected number of visits to state $s$ under $\pi_\theta$, given starting state $s_0$. I.e. given a question $s_0$, follow $\pi_\theta$ and generate a bunch of responses. $\eta(s)$ would be the count of each token $s \in \mathcal{S}$, where $\mathcal{S}$ is the vocabulary set. $\mu(s)$ is the normalized version over all possible states, making $\mu(s)$ a proper discrete probability distribution (whose support is most likely a strict subset of the vocabulary): \begin{align*} \sum_{s} \mu(s) = 1, \quad \mu(s) \geq 0 \quad \forall s \in \mathcal{S} \end{align*} Making the expression sing:
  1. Each $\nabla \pi_\theta(a|s)$ term says "here's how to make this action more likely in this state".
  2. $\sum_{a} q_{\pi_\theta}(s, a) \nabla \pi_\theta(a|s)$ says "okay, but weigh each by the expected return". $\nabla J(\theta)$ will favor actions with higher expected returns.
  3. $\sum_{s} \mu(s) \sum_{a} q_{\pi_\theta}(s, a) \nabla \pi_\theta(a|s)$ weighs the above by their on-policy distribution, forcing the policy to spend its limited capacity optimizing for states it actually encounters. $\nabla J(\theta)$ will favor states that are more likely to be visited under the current policy.
You may wonder what happened to $s_0$ on the RHS of $\nabla J(\theta)$? It's been folded into the definition of $\eta(s)$. The whole RHS pertains to visited states and actions taken in the subsequent rollout, starting from $s_0$. The proportionality constant is absorbed by the learning rate. See http://incompleteideas.net/book/RLbook2020.pdf p. 325 for the full proof. The key is that the state transition probabilities $p(s'|s, a) = \sum_{r\in\mathcal{R}} p(s',r|s,a)$ get unrolled over time into the on-policy state distribution $\mu(s)$, which we can estimate via sampling actual trajectories/rollouts for $s_0$ during training. In general, $\nabla v_{\pi_\theta}(s_0)$ can be estimated through Monte Carlo simulation, which is exactly what REINFORCE does via sampling rollouts.

REINFORCE

REINFORCE simply introduces a sampling method whose expectation equals the RHS of the policy gradient theorem. Assume $\gamma = 1$. In the last line we use $q_{\pi_\theta}(S_t, A_t) = \mathbb{E}_{\pi_\theta}[G_t | S_t, A_t]$. \begin{align*} \nabla J(\theta) &\propto \sum_{s} \mu(s) \sum_{a} q_{\pi_\theta}(s, a) \nabla \pi_\theta(a \mid s) \\ &= \mathbb{E}_{S_t \sim \mu} \left[ \sum_a q_{\pi_\theta}(S_t, a) \nabla \pi_\theta(a \mid S_t) \right] \\ &= \mathbb{E}_{S_t \sim \mu} \left[ \sum_a \pi_\theta(a \mid S_t) q_{\pi_\theta}(S_t, a) \frac{\nabla \pi_\theta(a \mid S_t)}{\pi_\theta(a \mid S_t)} \right] \\ &= \mathbb{E}_{S_t \sim \mu} \left[ \mathbb{E}_{A_t \sim \pi_\theta(\cdot \mid S_t)} \left[ q_{\pi_\theta}(S_t, A_t) \frac{\nabla \pi_\theta(A_t \mid S_t)}{\pi_\theta(A_t \mid S_t)} \right] \right] \\ &= \mathbb{E}_{S_t \sim \mu, A_t \sim \pi_\theta} \left[ q_{\pi_\theta}(S_t, A_t) \nabla \ln \pi_\theta(A_t \mid S_t) \right] \\ &= \mathbb{E}_{\pi_\theta} \left[ G_t \nabla \ln \pi_\theta(A_t \mid S_t) \right] \end{align*}

Training Algorithm

\begin{align*} &\alpha \leftarrow \text{Select learning rate} \\ &\gamma \leftarrow 1 \\ &\bold{\theta} \leftarrow \text{Init params} \\ &\text{Loop until some validation metric is satisfactory} \\ &\quad \text{Generate rollout } S_0, A_0, R_1, S_1, A_1, R_2, ... S_{T-1}, A_{T-1}, R_T \text{ following } \pi_\theta \\ &\quad \text{For } t \in \{0, ..., T-1\} \text{ do:} \\ &\qquad G_t \leftarrow R_{t+1} + R_{t+2} + ... + R_T \\ &\qquad \bold{\theta} \leftarrow \bold{\theta} + \alpha G_t \nabla \ln \pi_\theta(A_t \mid S_t) \\ &\text{Return }\bold{\theta} \end{align*}

REINFORCE with Baseline

For a lower variance estimator of the policy gradient, we can subtract a baseline $b(s)$ from the return $G_t$ \begin{align*} \nabla J(\theta) &\propto \sum_{s} \mu(s) \sum_{a} \left(q_{\pi_\theta}(s, a) - b(s)\right) \nabla \pi_\theta(a \mid s) \\ &= \mathbb{E}_{\pi_\theta} \left[ (G_t - b(S_t)) \nabla \ln \pi_\theta(A_t \mid S_t) \right] \\ \end{align*} without introducing bias, since \begin{align*} \sum_{a} b(s) \nabla \pi_\theta(a \mid s) &= b(s) \nabla \sum_{a} \pi_\theta(a \mid s) = b(s) \nabla 1 = 0 \end{align*} This makes REINFORCE without baseline a special case with $b(s) = 0$.

References

Notation and concepts come from Sutton & Barto http://incompleteideas.net/book/RLbook2020.pdf, in particular chapter 3 on finite Markov decision processes and chapter 13 on policy gradient methods.