In this post we'll derive the policy gradient theorem, which underpins modern-day RL post-training techniques such as REINFORCE, and in turn PPO and GRPO.
Under the Markov decision process (MDP) framework, an agent interacts with an environment
at discrete timesteps $t=0,1,2, \dots, T-1$, forming a trajectory/episode/rollout across $T$ timesteps:
\begin{align*}
S_0, A_0, R_1, S_1, A_1, R_2, \dots, S_{T-1}, A_{T-1}, R_T
\end{align*}
Namely, at time $t$ the agent receives some representation of the environment's
state $S_t \in \mathcal{S}$, which it uses to select an action
$A_t \in \mathcal{A}(S_t)$. As a result of taking action $A_t$, the agent
receives a scalar reward $R_{t+1} \in \mathcal{R}$ and
transitions into a new state $S_{t+1}$. Assuming that the agent's
goal is to maximize long-term rewards, and borrowing the idea of net present value (NPV)
from finance, we normalize reward terms into a common unit of present value
via a discount factor $\gamma \in [0, 1]$ before summing them to obtain the return at time $t$:
\begin{align*}
G_t = \sum_{k=0}^{T - t - 1} \gamma^k R_{t+k+1} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T
\end{align*}
A $\gamma$ of 1 weights all future rewards equally, whereas a lower $\gamma$ favors short-term rewards.
E.g. $G_3$ represents the NPV of all future rewards obtained by the agent, starting from $S_3$.
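As a concrete sketch (assuming we've already recorded a reward sequence $R_1, \dots, R_T$ as a NumPy array), the returns $G_t$ for every $t$ can be computed with the backward recursion $G_t = R_{t+1} + \gamma G_{t+1}$:

```python
import numpy as np

def discounted_returns(rewards: np.ndarray, gamma: float = 0.99) -> np.ndarray:
    """Compute G_t for t = 0, ..., T-1 from rewards [R_1, ..., R_T].

    Uses the backward recursion G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0.
    """
    T = len(rewards)
    returns = np.zeros(T)
    g = 0.0
    for t in reversed(range(T)):
        g = rewards[t] + gamma * g  # rewards[t] holds R_{t+1}
        returns[t] = g
    return returns

# Three steps of reward 1 with gamma = 0.5: G_0 = 1 + 0.5 + 0.25 = 1.75
print(discounted_returns(np.array([1.0, 1.0, 1.0]), gamma=0.5))  # [1.75 1.5  1.  ]
```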
We can then define a value function
\begin{align*}
v_\pi(s) \equiv E_\pi [G_t | S_t=s]
\end{align*}
representing the value of a state $s$: the expected NPV of future rewards when starting from state $s$ and following some policy $\pi$ for action
selection at each step. And we can define an action-value function
\begin{align*}
q_\pi(s, a) \equiv E_\pi [G_t | S_t=s, A_t=a]
\end{align*}
representing the expected NPV when starting from state $s$, taking action $a$ (i.e. the state-action pair $(s,a)$), and following policy $\pi$ thereafter.
Note that the former is just a weighted average of the latter, under policy $\pi$:
\begin{align*}
v_\pi(s) = \sum_{a \in \mathcal{A}(s)} \pi(a|s) q_\pi(s, a)
\end{align*}
where $\pi(a|s)=Pr(A_t=a | S_t=s) \in [0, 1]$ is the probability of taking action $a$ in state $s$.
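To make the relationship concrete, here is a minimal sketch for a single state $s$ with three available actions; the arrays `pi_given_s` and `q_s` hold hypothetical values of $\pi(a|s)$ and $q_\pi(s,a)$:

```python
import numpy as np

# Hypothetical values for one state s with 3 available actions.
pi_given_s = np.array([0.2, 0.5, 0.3])   # pi(a|s), sums to 1
q_s        = np.array([1.0, 2.0, -0.5])  # q_pi(s, a) for each action

# v_pi(s) is the pi-weighted average of the action values.
v_s = np.dot(pi_given_s, q_s)
print(v_s)  # 0.2*1.0 + 0.5*2.0 + 0.3*(-0.5) = 1.05
```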
Up until now, $\pi$ could be any oracle that selects actions based on the current state, i.e. any
program, method, or function that maps the current state to an action choice. Now suppose we use a parametric policy (policy model) $\pi_\theta$ with parameters $\theta \in \mathbb{R}^d$ as the action selector, and we want to find the optimal parameters $\theta^*$
that maximize some scalar performance measure $J(\theta)$. For a discrete action space, we can define the policy model as a softmax over the possible actions
\begin{align*}
\pi(a|s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_{a' \in \mathcal{A}(s)} e^{h(s, a', \theta)}} \in [0, 1]
\end{align*}
where $h(s, a, \theta)$ is some function (e.g. a neural network such as a transformer) that maps a state-action pair $(s,a)$ and parameters $\theta$ to a scalar preference.
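As an illustration, here is a minimal sketch of such a softmax policy where $h(s, a, \theta) = \theta^\top x(s, a)$ is a simple linear preference over hand-crafted features $x(s, a)$ (an assumption made for brevity; in practice $h$ would typically be a neural network):

```python
import numpy as np

def softmax_policy(theta: np.ndarray, features: np.ndarray) -> np.ndarray:
    """Return pi(a|s, theta) for every action available in state s.

    features: (num_actions, d) matrix whose rows are x(s, a), so the
    preference for each action is h(s, a, theta) = theta . x(s, a).
    """
    h = features @ theta                 # one scalar preference per action
    h = h - h.max()                      # shift for numerical stability
    return np.exp(h) / np.exp(h).sum()   # softmax over the available actions

# Sample an action A_t ~ pi(.|S_t, theta) using hypothetical features.
rng = np.random.default_rng(0)
theta = rng.normal(size=4)               # d = 4 parameters
x_s = rng.normal(size=(3, 4))            # 3 available actions, 4 features each
pi_s = softmax_policy(theta, x_s)        # probabilities summing to 1
action = rng.choice(len(pi_s), p=pi_s)
```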
Then we simply perform gradient ascent on $J(\theta)$ (or use an optimizer such as AdamW) to approximate $\theta^* = \text{argmax}_\theta J(\theta)$:
\begin{align*}
\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)
\end{align*}
where $\nabla_\theta J(\theta_t) \in \mathbb{R}^d$. All methods that follow this general approach of optimizing a parametric policy by gradient ascent on a scalar performance measure $J$ are
called policy gradient methods.
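Schematically, the update is just the sketch below; `estimate_policy_gradient` is a hypothetical placeholder for whatever estimate of $\nabla_\theta J(\theta_t)$ a particular method produces (deriving such an estimator via the policy gradient theorem is the subject of the rest of this post):

```python
import numpy as np

def gradient_ascent_step(theta: np.ndarray, grad_J: np.ndarray, alpha: float = 1e-2) -> np.ndarray:
    """One update: theta_{t+1} = theta_t + alpha * grad J(theta_t)."""
    return theta + alpha * grad_J

# Training loop sketch (estimate_policy_gradient is method-specific, e.g. REINFORCE):
# theta = np.zeros(d)
# for step in range(num_steps):
#     grad_J = estimate_policy_gradient(theta)
#     theta = gradient_ascent_step(theta, grad_J)
```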