Group Relative Policy Optimization
Background: Proximal Policy Optimization (PPO)
PPO introduces a clipped surrogate objective for policy optimization. By constraining each policy update to a proximal region around the previous policy via the clip operation, PPO stabilizes training and improves sample efficiency. Specifically, PPO updates the policy by maximizing the following objective:

$$\mathcal{J}_{\text{PPO}}(\theta) = \mathbb{E}_{(q,a)\sim\mathcal{D},\; o_{\le t}\sim\pi_{\theta_{\text{old}}}(\cdot\mid q)}\left[\min\left(\frac{\pi_\theta(o_t\mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t\mid q, o_{<t})}\hat{A}_t,\; \operatorname{clip}\left(\frac{\pi_\theta(o_t\mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t\mid q, o_{<t})},\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_t\right)\right]$$

where $(q, a)$ is a question-answer pair drawn from the data distribution $\mathcal{D}$, $\varepsilon$ is the clipping range of the importance sampling ratio, and $\hat{A}_t$ is an estimator of the advantage at time step $t$. Given the value function $V$ and the reward function $R$, $\hat{A}_t$ is computed using Generalized Advantage Estimation (GAE):

$$\hat{A}_t^{\operatorname{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\, \delta_{t+l}$$

where

$$\delta_{l} = R_{l} + \gamma V(s_{l+1}) - V(s_{l}), \qquad 0 \le \gamma, \lambda \le 1.$$
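To make the GAE recursion concrete, here is a minimal NumPy sketch under stated assumptions: the helper name `compute_gae`, the terminal bootstrap value `last_value`, and the default $\gamma$, $\lambda$ values are illustrative choices, not taken from the text above. It accumulates the TD residuals $\delta_t$ backwards, discounting by $\gamma\lambda$:

```python
import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards:    per-step rewards R_0 .. R_{T-1}
    values:     value estimates V(s_0) .. V(s_{T-1})
    last_value: bootstrap value V(s_T) for the state after the final step
    Returns advantage estimates A_hat_0 .. A_hat_{T-1}.
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    gae = 0.0
    next_value = last_value
    # Backward recursion: A_hat_t = delta_t + (gamma * lam) * A_hat_{t+1}
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual delta_t
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages

# Toy usage: a 4-step episode with a sparse reward at the end
print(compute_gae(rewards=np.array([0.0, 0.0, 0.0, 1.0]),
                  values=np.array([0.1, 0.2, 0.4, 0.7]),
                  last_value=0.0))
```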
Group Relative Policy Optimization (GRPO)
Compared to PPO, GRPO eliminates the value function and instead estimates the advantage in a group-relative manner: for a question $q$, the old policy $\pi_{\theta_{\text{old}}}$ samples a group of $G$ responses $\{o_i\}_{i=1}^{G}$, and each response's reward $R_i$ is normalized against the group,

$$\hat{A}_{i,t} = \frac{R_i - \operatorname{mean}\left(\{R_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{R_j\}_{j=1}^{G}\right)},$$

so every token of response $o_i$ shares the same advantage.
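As a small illustration of this group-relative estimate (a sketch only: the helper name `group_relative_advantages`, the `eps` guard, and the 0/1 rewards are assumed, not from the text), each response's advantage is simply the z-score of its reward within the group:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group of scalar rewards {R_i} to zero mean and unit std.

    rewards: shape (G,), one reward per sampled response o_i.
    Returns A_hat_i per response; GRPO assigns the same value to every
    token of response i, i.e. A_hat_{i,t} = A_hat_i for all t.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: a group of G = 4 responses to one question, rewarded 1.0 if correct
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # ≈ [ 1. -1. -1.  1.]
```

Because the group mean acts as the baseline here, no separate value network has to be trained or stored.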
Objective Function
Similar to PPO, GRPO maximizes a clipped objective, together with a directly imposed KL penalty term:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{(q,a)\sim\mathcal{D},\; \{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\text{old}}}(\cdot\mid q)}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left(\min\left(r_{i,t}(\theta)\hat{A}_{i,t},\; \operatorname{clip}\left(r_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_{i,t}\right) - \beta\, D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)\right)\right]$$
where:
- $r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q,\, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,\, o_{i,<t})}$ is the importance sampling ratio for the $i$-th response at time step $t$
- $\hat{A}_{i,t}$ is the advantage for the $i$-th response at time step $t$
- $\beta$ is the KL penalty coefficient
- $D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)$ is the KL divergence between the current policy $\pi_\theta$ and the reference policy $\pi_{\text{ref}}$
- $\varepsilon$ is the clipping range of the importance sampling ratio
- $G$ is the group size
- $o_i$ is the $i$-th response
- $q$ is the question
- $o_{i,<t}$ is the sequence of tokens before position $t$ in response $o_i$
- $\pi_\theta(o_{i,t}\mid q, o_{i,<t})$ is the probability of generating token $o_{i,t}$ under the current policy
- $\pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})$ is the probability of generating token $o_{i,t}$ under the old policy
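Putting the pieces together, the sketch below combines the clipped surrogate with the KL penalty and averages over tokens, responses, and the group, mirroring the objective above. It is a minimal PyTorch illustration under assumptions: the function name `grpo_loss`, the `(G, T)` tensor layout, the default $\varepsilon$ and $\beta$ values, and the use of a k3-style per-token estimator for the KL term are all choices made here, not specified by the text.

```python
import torch

def grpo_loss(logp, logp_old, logp_ref, advantages, mask,
              clip_eps=0.2, kl_beta=0.04):
    """Negated GRPO objective for one question's group (minimize to maximize J).

    logp, logp_old, logp_ref: log pi(o_{i,t} | q, o_{i,<t}) under the current,
        old, and reference policies, each of shape (G, T).
    advantages: A_hat_{i,t}, shape (G, T) (constant across t for each response i).
    mask: 1.0 for real response tokens, 0.0 for padding, shape (G, T).
    """
    # Importance sampling ratio r_{i,t}(theta)
    ratio = torch.exp(logp - logp_old)
    # Clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)
    surrogate = torch.minimum(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages,
    )
    # Per-token KL penalty against the reference policy
    # (k3-style estimator: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1 >= 0)
    log_ref_ratio = logp_ref - logp
    kl = torch.exp(log_ref_ratio) - log_ref_ratio - 1.0
    per_token = surrogate - kl_beta * kl
    # Length-normalized sum over tokens (1/|o_i|), then mean over the group (1/G)
    per_response = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    return -per_response.mean()

# Toy usage: G = 4 responses of length T = 8, no padding
G, T = 4, 8
logp_old = torch.randn(G, T)
logp_new = logp_old + 0.01 * torch.randn(G, T)
adv = torch.tensor([1.0, -1.0, -1.0, 1.0]).unsqueeze(1).expand(G, T)
print(grpo_loss(logp_new, logp_old, logp_old, adv, torch.ones(G, T)))
```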