Group Relative Policy Optimization

Background: Proximal Policy Optimization (PPO)

PPO introduces a clipped surrogate objective for policy optimization. By constraining each policy update to a proximal region of the previous policy via clipping, PPO stabilizes training and improves sample efficiency. Specifically, PPO updates the policy by maximizing the following objective:

π’₯PPO(πœƒ)=𝔼(π‘ž,π‘Ž)βˆΌπ’Ÿ[min(πœ‹πœƒ(π‘œπ‘‘|π‘ž,π‘œ<𝑑)πœ‹πœƒold(π‘œπ‘‘|π‘ž,π‘œ<𝑑)𝐴̂𝑑,Β clipΒ (πœ‹πœƒ(π‘œπ‘‘|π‘ž,π‘œ<𝑑)πœ‹πœƒold(π‘œπ‘‘|π‘ž,π‘œ<𝑑),1βˆ’πœ€,1+πœ€)𝐴̂𝑑)]

where (π‘ž,π‘Ž) is a question-answer pair from the data distribution π’Ÿ, πœ€ is the clipping range of importance sampling ratio, and 𝐴̂𝑑 is an estimator of the advantage at time step 𝑑. Given the value function 𝑉 and the reward function 𝑅, 𝐴̂𝑑 is computed using the Generalized Advantage Estimation (GAE):

$$
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \,\delta_{t+l},
$$

where

$$
\delta_l = R_l + \gamma V(s_{l+1}) - V(s_l), \qquad 0 \le \gamma, \lambda \le 1.
$$

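In practice, GAE is evaluated with a backward recursion over the trajectory, accumulating the discounted TD residuals. Below is a minimal sketch under the assumption that the value tensor carries one extra bootstrap entry $V(s_T)$; names and conventions are illustrative.

```python
import torch

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for a single trajectory.

    rewards: tensor of shape (T,)   -- R_0 ... R_{T-1}
    values:  tensor of shape (T+1,) -- V(s_0) ... V(s_T), last entry is the bootstrap value
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual delta_t
        gae = delta + gamma * lam * gae                         # running sum of (gamma*lambda)^l * delta
        advantages[t] = gae
    return advantages
```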
Group Relative Policy Optimization (GRPO)

Compared to PPO, GRPO eliminates the value function and estimates the advantage in a group-relative manner: for each question, the old policy samples a group of $G$ responses, and each response's advantage is obtained by normalizing its reward against the mean and standard deviation of the rewards within the group.
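For outcome rewards, this group-relative advantage is the group-normalized reward, broadcast to every token of the corresponding response. A minimal sketch follows; the epsilon guard and variable names are assumptions.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one question.

    rewards: tensor of shape (G,), one scalar reward per sampled response.
    The returned value for response i is used as A_hat_{i,t} for all of its tokens.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)   # z-score within the group
```

For example, `group_relative_advantages(torch.tensor([1.0, 0.0, 1.0, 0.0]))` assigns positive advantages to the two rewarded responses and negative advantages to the other two.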

Objective Function

Similar to PPO, GRPO maximizes a clipped surrogate objective, together with a KL penalty term imposed directly in the loss:

π’₯GRPO(πœƒ)=𝔼(π‘ž,π‘Ž)βˆΌπ’Ÿ1πΊβˆ‘π‘–=1𝐺1|π‘œπ‘–|βˆ‘π‘‘=1|π‘œπ‘–|[min(π‘Ÿπ‘–,𝑑(πœƒ)𝐴̂𝑖,𝑑,Β clipΒ (π‘Ÿπ‘–,𝑑(πœƒ),1βˆ’πœ€,1+πœ€)𝐴̂𝑖,𝑑)βˆ’π›½π·Β KL(πœ‹πœƒβ€–πœ‹Β ref)]

where:

  • π‘Ÿπ‘–,𝑑(πœƒ)=πœ‹πœƒ(π‘œπ‘–,𝑑|π‘ž,π‘œπ‘–,<𝑑)πœ‹πœƒold(π‘œπ‘–,𝑑|π‘ž,π‘œπ‘–,<𝑑) is the importance sampling ratio for the 𝑖-th response at time step 𝑑
  • 𝐴̂𝑖,𝑑 is the advantage for the 𝑖-th response at time step 𝑑
  • 𝛽 is the KL penalty coefficient
  • 𝐷KL(πœ‹πœƒβ€–πœ‹Β ref) is the KL divergence between the current policy and the reference policy
  • πœ€ is the clipping range of importance sampling ratio
  • 𝐺 is the group size
  • π‘œπ‘– is the 𝑖-th response
  • π‘ž is the question
  • π‘œπ‘–,<𝑑 is the sequence of tokens before position 𝑑 in response 𝑖
  • πœ‹πœƒ(π‘œπ‘–,𝑑|π‘ž,π‘œπ‘–,<𝑑) is the probability of generating token π‘œπ‘–,𝑑 under current policy
  • πœ‹πœƒold(π‘œπ‘–,𝑑|π‘ž,π‘œπ‘–,<𝑑) is the probability of generating token π‘œπ‘–,𝑑 under the old policy