Group Relative Policy Optimization

Background: Proximal Policy Optimization (PPO)

PPO introduces a clipped surrogate objective for policy optimization. By constraining each policy update to a proximal region of the previous policy via clipping, PPO stabilizes training and improves sample efficiency. Specifically, PPO updates the policy by maximizing the following objective:

π’₯PPO(πœƒ)=𝔼(π‘ž,π‘Ž)βˆΌπ’Ÿ[min(πœ‹πœƒ(π‘œπ‘‘|π‘ž,π‘œ<𝑑)πœ‹πœƒold(π‘œπ‘‘|π‘ž,π‘œ<𝑑)𝐴̂𝑑,Β clipΒ (πœ‹πœƒ(π‘œπ‘‘|π‘ž,π‘œ<𝑑)πœ‹πœƒold(π‘œπ‘‘|π‘ž,π‘œ<𝑑),1βˆ’πœ€,1+πœ€)𝐴̂𝑑)]

where (π‘ž,π‘Ž) is a question-answer pair from the data distribution π’Ÿ, πœ€ is the clipping range of importance sampling ratio, and 𝐴̂𝑑 is an estimator of the advantage at time step 𝑑. Given the value function 𝑉 and the reward function 𝑅, 𝐴̂𝑑 is computed using the Generalized Advantage Estimation (GAE):

$$
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \,\delta_{t+l},
$$

where

$$
\delta_l = R_l + \gamma V(s_{l+1}) - V(s_l), \qquad 0 \le \gamma, \lambda \le 1.
$$

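In practice, GAE is evaluated with a backward recursion over the trajectory, accumulating the discounted TD residuals. Below is a minimal sketch under the assumption that the value tensor carries one extra bootstrap entry $V(s_T)$; names and conventions are illustrative.

```python
import torch

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for a single trajectory.

    rewards: tensor of shape (T,)   -- R_0 ... R_{T-1}
    values:  tensor of shape (T+1,) -- V(s_0) ... V(s_T), last entry is the bootstrap value
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual delta_t
        gae = delta + gamma * lam * gae                         # running sum of (gamma*lambda)^l * delta
        advantages[t] = gae
    return advantages
```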
Group Relative Policy Optimization (GRPO)

Compared to PPO, GRPO eliminates the value function and estimates the advantage in a group-relative manner: for each question, the old policy samples a group of $G$ responses, and each response's advantage is obtained by normalizing its reward against the mean and standard deviation of the rewards within the group.
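For outcome rewards, this group-relative advantage is the group-normalized reward, broadcast to every token of the corresponding response. A minimal sketch follows; the epsilon guard and variable names are assumptions.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one question.

    rewards: tensor of shape (G,), one scalar reward per sampled response.
    The returned value for response i is used as A_hat_{i,t} for all of its tokens.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)   # z-score within the group
```

For example, `group_relative_advantages(torch.tensor([1.0, 0.0, 1.0, 0.0]))` assigns positive advantages to the two rewarded responses and negative advantages to the other two.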

Objective Function

Similar to PPO, GRPO maximizes a clipped surrogate objective, together with a KL penalty term imposed directly in the loss:

π’₯GRPO(πœƒ)=𝔼(π‘ž,π‘Ž)βˆΌπ’Ÿ1πΊβˆ‘π‘–=1𝐺1|π‘œπ‘–|βˆ‘π‘‘=1|π‘œπ‘–|[min(π‘Ÿπ‘–,𝑑(πœƒ)𝐴̂𝑖,𝑑,Β clipΒ (π‘Ÿπ‘–,𝑑(πœƒ),1βˆ’πœ€,1+πœ€)𝐴̂𝑖,𝑑)βˆ’π›½π·Β KL(πœ‹πœƒβ€–πœ‹Β ref)]

where:

  • π‘Ÿπ‘–,𝑑(πœƒ)=πœ‹πœƒ(π‘œπ‘–,𝑑|π‘ž,π‘œπ‘–,<𝑑)πœ‹πœƒold(π‘œπ‘–,𝑑|π‘ž,π‘œπ‘–,<𝑑) is the importance sampling ratio for the 𝑖-th response at time step 𝑑
  • 𝐴̂𝑖,𝑑 is the advantage for the 𝑖-th response at time step 𝑑
  • 𝛽 is the KL penalty coefficient
  • 𝐷KL(πœ‹πœƒβ€–πœ‹Β ref) is the KL divergence between the current policy and the reference policy
  • πœ€ is the clipping range of importance sampling ratio
  • 𝐺 is the group size
  • π‘œπ‘– is the 𝑖-th response
  • π‘ž is the question
  • π‘œπ‘–,<𝑑 is the sequence of tokens before position 𝑑 in response 𝑖
  • πœ‹πœƒ(π‘œπ‘–,𝑑|π‘ž,π‘œπ‘–,<𝑑) is the probability of generating token π‘œπ‘–,𝑑 under current policy
  • πœ‹πœƒold(π‘œπ‘–,𝑑|π‘ž,π‘œπ‘–,<𝑑) is the probability of generating token π‘œπ‘–,𝑑 under the old policy