In this post we'll go through the framework surrounding the policy gradient theorem and derive the REINFORCE algorithm, which underpins modern RL post-training techniques including PPO and GRPO.
Under the Markov decision process (MDP) framework, an agent interacts with an environment
at discrete timesteps $t=0,1,2, \dots, T-1$, forming a trajectory/episode/rollout across T timesteps:
\begin{align*}
S_0, A_0, R_1, S_1, A_1, R_2, ... S_{T-1}, A_{T-1}, R_T
\end{align*}
Namely at time t, the agent receives some representation of the environment's
state $S_t \in \mathcal{S}$ and uses it to select an action
$A_t \in \mathcal{A}(s)$. The agent receives a scalar reward
$R_{t+1} \in \mathcal{R}$ as a result of taking action $A_t$ and
transitions into a new state $S_{t+1}$ according to some probability
distribution $p(s',r|s,a) \equiv Pr(S_{t+1}=s',R_{t+1}=r|S_t=s,A_t=a)$ that defines
the dynamics of the MDP. Under a finite MDP (finite state space
$\mathcal{S}$, finite action space $\mathcal{A}$, and finite reward
space $\mathcal{R}$), we have
\begin{align*}
p(s',r|s,a) \geq 0, \quad \sum_{s'} \sum_{r} p(s',r|s,a) = 1 \quad \forall s \in \mathcal{S}, a \in \mathcal{A}(s)
\end{align*}
Assuming that the agent's goal is to maximize rewards and borrowing
the idea of NPVs from finance, we normalize reward terms into a
common unit of present value via a discount factor
$\gamma \in [0, 1]$ before summing them to obtain the return at
time $t$:
\begin{align*}
G_t = \sum_{k=0}^{T - t - 1} \gamma^k R_{t+k+1} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{T-t-1} R_T
\end{align*}
E.g. $G_3$ represents the NPV of all future rewards obtained by the
agent, starting from $S_3$. A $\gamma$ of 1 treats all rewards as
equal, whereas a lower discount favors short-term rewards. In most
applications of RL-based LLM post-training we assume $\gamma=1$,
i.e. we make no distinction between rewards obtained from response
tokens near the beginning of the response vs. near the end. In
fact, sequence-level rewards assign rewards to whole sequences instead.
We can then define an action-value function for $\pi$:
\begin{align*}
q_\pi(s, a) \equiv \mathbb{E}_\pi [G_t | S_t=s, A_t=a] = \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s',r|s,a) \left( r + \gamma v_\pi(s') \right)
\end{align*}
representing the NPV starting from state $s$, taking action $a$
and following $\pi$ thereafter.
The value function $v_\pi(s)$ represents the value of state
$s$ – the NPV starting from state $s$ and following some
$\pi$ used for action selection at each step:
\begin{align*}
v_\pi(s) \equiv \mathbb{E}_\pi [G_t | S_t=s] = \sum_{a \in \mathcal{A}(s)} \pi(a|s) q_\pi(s, a)
\end{align*}
where $\pi(a|s)=Pr(A_t=a | S_t=s) \in [0, 1]$ is the probability of
taking action $a$ in state $s$.
Up until now, the true policy $\pi$ can be any oracle that defines the
exact probability of taking action $a$ in state $s$ to achieve a specific
objective. In simple finite MDPs, $\pi$ can be a medium-sized
lookup table (tabular RL). But here we're interested in differentiable
parametric policy models (function approximations $\pi_\theta$ of the true policy $\pi$)
since the state space is enormous and technically infinitely large
(any question of any length can be asked), thus unenumerable.
Given $\pi_\theta(s)$ with parameters
$\theta \in \mathbb{R}^d$ as the action selector, we want to
find the optimal parameters $\theta^*$ that maximizes some scalar
performance measure $J(\theta)$. We can define the policy model as
a softmax over the action space
\begin{align*}
\pi_\theta(a|s) = \frac{e^{h_\theta(s, a)}}{\sum_{a' \in \mathcal{A}(s)} e^{h_\theta(s, a')}} \in [0, 1] \quad \forall a \in \mathcal{A}(s)
\end{align*}
where the state-action preference $h_\theta(s, a) \in \mathbb{R}$ maps a state-action
pair $(s,a)$ to a scalar. Greedy action selection/decoding uses
\begin{align*}
A_t = \text{argmax}_{a \in \mathcal{A}(S_t)} \pi_\theta(a|S_t)
\end{align*}
Or, stochastic sampling methods like top-k and top-p (Nucleus sampling)
can be used for better exploration. Finally we perform gradient
ascent on $J(\theta)$ (or AdamW, etc.) to approximate
$\theta^* = \text{argmax}_\theta J(\theta)$:
\begin{align*}
\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)
\end{align*}
where $\nabla_\theta J(\theta_t) \in \mathbb{R}^d$. All methods
that follow this general approach of optimizing a parametric policy
model via a scalar performance $J$ are called policy gradient methods.