Policy Gradient Methods

Under the Markov decision process (MDP) framework, an agent interacts with an environment at discrete timesteps $t=0,1,2, \dots, T-1$, forming a trajectory/episode/rollout across T timesteps: \begin{align*} S_0, A_0, R_1, S_1, A_1, R_2, ... S_{T-1}, A_{T-1}, R_T \end{align*} Namely at time t, the agent receives some representation of the environment's state $S_t \in \mathcal{S}$, which it uses to select an action $A_t \in \mathcal{A}(s)$. The agent receives a scalar reward $R_{t+1} \in \mathcal{R}$ as a result of taking action $A_t$, and transitions into a new state $S_{t+1}$. Assuming that the agent's goal is to maximize long-term rewards and borrowing the idea of NPVs from finance, we normalize reward terms into a common unit of present value via a discount factor $\gamma \in [0, 1]$ before summing them to obtain the return at time $t$: \begin{align*} G_t = \sum_{k=0}^{T - t - 1} \gamma^k R_{t+k+1} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{T-t-1} R_T \end{align*} A $\gamma$ of 1 treats all rewards as equal, whereas a lower discount favors short-term rewards. E.g. $G_3$ represents the NPV of all future rewards obtained by the agent, starting from $S_3$. We can then define a value function \begin{align*} v_\pi(s) \equiv E_\pi [G_t | S_t=s] \end{align*} representing the value of a state $s$: the NPV starting from state $s$ and following some policy $\pi$ used for action selection at each step. And we can define an action-value function \begin{align*} q_\pi(s, a) \equiv E_\pi [G_t | S_t=s, A_t=a] \end{align*} representing the NPV starting from state $s$, taking action $a$ (state-action pair $(s,a)$), and following policy $\pi$ thereafter. Note that the former is just a weighted average of the latter, under policy $\pi$: \begin{align*} v_\pi(s) = \sum_{a \in \mathcal{A}(s)} \pi(a|s) q_\pi(s, a) \end{align*} where $\pi(a|s)=Pr(A_t=a | S_t=s) \in [0, 1]$ is the probability of taking action $a$ in state $s$.

Up until now, $\pi$ can be any oracle that selects actions based on the current state, i.e. any program, method, or function that map a current state to an action choice. Now suppose we have some parametric policy/policy model $\pi_\theta(s)$ with parameters $\theta \in \mathbb{R}^d$ as the action selector, and we want to find the optimal parameters $\theta^*$ that maximizes some scalar performance measure $J(\theta)$. We can define the policy model as a softmax over the possible actions \begin{align*} \pi(a|s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_{a' \in \mathcal{A}(s)} e^{h(s, a', \theta)}} \in [0, 1] \end{align*} where $h(s, a, \theta)$ is some transformer that maps a state-action pair $(s,a)$ and parameters $\theta$ to a scalar. Then we simply perform gradient ascent on $J(\theta)$ (or AdamW, etc.) to approximate $\theta^* = \text{argmax}_\theta J(\theta)$: \begin{align*} \theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t) \end{align*} where $\nabla_\theta J(\theta_t) \in \mathbb{R}^d$. All methods that follow this general approach of optimizing a parametric policy model via a scalar performance $J$ are called policy gradient methods.

The Policy Gradient Theorem, REINFORCE

Policy Gradient Methods

The Policy Gradient Theorem