| Challenge | Approach | Key findings | Papers |
| --- | --- | --- | --- |
| Computational inefficiency of traditional PPO for LLM training | GRPO (Group Relative Policy Optimization): eliminates the separate value model by using the mean reward of a group of outputs sampled from the same prompt as the baseline for advantage estimation. | Comparable effectiveness to PPO at lower computational cost; particularly well suited to LLM reward modeling, where rewards are often comparative in nature. | GRPO |
| Token inefficiency and overthinking in long-form reasoning | Dr. GRPO ("GRPO Done Right"): removes GRPO's response-length normalization and reward standardization, yielding an unbiased policy-gradient estimate. | Significantly improved token efficiency and better-controlled response lengths, mitigating overthinking. | Dr. GRPO |
| Instability with varying response lengths in long-form reasoning | DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization): computes the policy gradient at the token level, so longer sequences influence updates in proportion to their length. | More stable training dynamics, healthier entropy behavior, and better handling of responses of varying length. | DAPO |
| Limited policy exploration due to rigid constraints | GPG (Group Policy Gradient): simplifies the policy gradient by dropping the reference model and policy constraints, maintaining stability through group-level reward normalization. | Enhanced exploration and reduced computational cost, with more flexible policy updates. | GPG |
| Repetitive or narrow reasoning patterns | Auxiliary entropy bonus: adds an entropy term to the RL loss to encourage token diversity and prevent deterministic response patterns. | More varied and creative reasoning paths without sacrificing solution accuracy. | T1 |
| Limitations of a fixed reference model | On-policy KL normalization: combines KL normalization with Exponential Moving Average (EMA) updates of the reference model. | Dynamically updating the reference model enables more effective RL scaling while keeping training stable. | T1 |
| Value-model misalignment with strong prior policies | Value-pretraining alignment: a dedicated pretraining phase aligns the value model with the strong prior policy before RL begins. | Convergence proceeds in two stages, initial range alignment followed by crucial knowledge injection, preventing output-length collapse on long-CoT tasks. | VC-PPO, VAPO |
| Conflicting variance-bias requirements of value and policy optimization | Decoupled GAE (Generalized Advantage Estimation): uses separate λ values for the value function and the policy, allowing unbiased value estimation while retaining variance reduction for policy updates. | Mathematical analysis and experiments show improved convergence without added bias; particularly effective for trajectory-level rewards in long-CoT tasks. | VC-PPO, VAPO |
| Limited exploration under constrained policy optimization | KL-divergence removal: drops the KL penalty that keeps the policy close to the reference model, letting the reasoning policy explore more freely. | Significant performance gains when the policy distribution is allowed to shift during extended reasoning training. | Open-Reasoner-Zero, DAPO |
| Premature deterministic behavior in RL systems | Clip-Higher: decouples the lower and upper clipping ranges in PPO to promote exploration of low-probability tokens while preserving stability. | Asymmetric clipping thresholds counteract entropy collapse and maintain policy diversity throughout extended training. | DAPO |
| Ineffective gradient signals in late-stage training | Dynamic sampling: filters out prompts whose sampled group is entirely correct (accuracy 1) or entirely wrong (accuracy 0), ensuring effective gradient signals. | Faster convergence to target performance despite the computational overhead of oversampling. | DAPO, Bae et al. |
| Noisy reward signals from length-truncated samples | Overlong filtering: masks the loss contribution of samples truncated at the maximum length, so otherwise sound reasoning is not inappropriately penalized. | Ablations show substantial training-stability gains once these noisy reward signals are removed. | DAPO |
| Inconsistent advantage estimation across variable-length sequences | Length-adaptive GAE: adjusts GAE's λ based on sequence length, balancing TD-error influence between short and long outputs. | More balanced advantage estimation and improved stability across varying lengths; particularly beneficial for long-form reasoning. | VAPO |
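The group-relative baseline in the GRPO entry can be sketched in a few lines of numpy; the function name is illustrative, not from the paper:

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    # Group-relative baseline: each of the G rewards sampled for the
    # same prompt is normalized by the group's mean and std, so no
    # separate learned value model is needed.
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)
```

Because the baseline is the group mean, the advantages of each group sum to zero by construction, which is what makes comparative reward signals sufficient.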
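Dr. GRPO's two debiasing changes can be contrasted with GRPO directly: keep the mean-subtracted baseline but drop the division by the group std, and sum token log-probabilities without per-response length normalization. A minimal sketch with illustrative names:

```python
import numpy as np

def dr_grpo_advantages(rewards):
    # Subtract only the group mean; Dr. GRPO drops GRPO's division by
    # the group std, which biases updates by question difficulty.
    r = np.asarray(rewards, dtype=np.float64)
    return r - r.mean()

def dr_grpo_loss(token_logprobs, advantages):
    # Sum token log-probs per response WITHOUT dividing by that
    # response's own length (the source of GRPO's length bias); the
    # batch mean acts as a constant normalizer instead.
    per_response = [a * lp.sum() for lp, a in zip(token_logprobs, advantages)]
    return -float(np.mean(per_response))
```

With per-response length normalization removed, a long low-advantage response is no longer down-weighted, which is how the method discourages overthinking.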
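DAPO's token-level loss is easiest to see next to the sample-level alternative. In this sketch (function names are mine), the token-level version pools all tokens in the batch before averaging:

```python
import numpy as np

def sample_level_loss(token_losses):
    # GRPO-style: average within each response first, then across the
    # batch, so every response weighs equally regardless of length.
    return float(np.mean([np.mean(t) for t in token_losses]))

def token_level_loss(token_losses):
    # DAPO-style: pool all tokens in the batch, so a long response
    # contributes gradient signal in proportion to its token count.
    flat = np.concatenate([np.asarray(t, dtype=np.float64) for t in token_losses])
    return float(flat.mean())
```

For a batch of one 1-token and one 3-token response, the sample-level mean weighs the single token as much as the other three combined; the token-level mean weighs every token equally.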
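The auxiliary entropy bonus from the T1 row amounts to subtracting a scaled entropy term from the policy-gradient loss; `beta` here is an illustrative coefficient, not a value from the paper:

```python
import numpy as np

def loss_with_entropy_bonus(pg_loss, token_probs, beta=0.01):
    # Subtract beta * H(pi): since the optimizer minimizes the loss,
    # this rewards higher-entropy next-token distributions and
    # discourages collapse into deterministic, repetitive patterns.
    p = np.asarray(token_probs, dtype=np.float64)
    entropy = -np.sum(p * np.log(p + 1e-12))
    return pg_loss - beta * entropy
```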
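The EMA reference-model update behind on-policy KL normalization is a standard exponential moving average over parameters; a minimal sketch over flat parameter lists (real implementations iterate over model tensors):

```python
def ema_ref_update(ref_params, policy_params, tau=0.99):
    # The reference model drifts slowly toward the current policy, so
    # the KL term anchors to a recent policy rather than the frozen
    # initial one; tau close to 1 means a slow-moving reference.
    return [tau * r + (1.0 - tau) * p
            for r, p in zip(ref_params, policy_params)]
```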
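Decoupled GAE separates the λ used to build value-function targets from the λ used for policy advantages. A sketch over a single trajectory, assuming `values` carries one extra bootstrap entry:

```python
import numpy as np

def gae(rewards, values, gamma=1.0, lam=0.95):
    # Standard GAE backward recursion; len(values) == len(rewards) + 1.
    adv = np.zeros(len(rewards))
    last = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

def decoupled_gae(rewards, values, gamma=1.0, lam_value=1.0, lam_policy=0.95):
    # lam_value = 1 gives an unbiased (Monte Carlo) return target for
    # fitting the value model; a smaller lam_policy keeps the policy's
    # advantage estimates low-variance.
    value_targets = gae(rewards, values, gamma, lam_value) + np.asarray(values)[:-1]
    policy_adv = gae(rewards, values, gamma, lam_policy)
    return value_targets, policy_adv
```

This matters for trajectory-level rewards because with λ < 1 the value target would bootstrap heavily from an inaccurate critic, while the policy still benefits from the variance reduction.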
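Clip-Higher changes one line of the PPO surrogate: the clip interval becomes asymmetric. The threshold values below mirror commonly reported DAPO settings but should be treated as illustrative:

```python
import numpy as np

def clip_higher_objective(ratio, adv, eps_low=0.2, eps_high=0.28):
    # PPO clipped objective with decoupled ranges: raising only the
    # upper bound (1 + eps_high) lets low-probability tokens with
    # positive advantage grow further before clipping kicks in,
    # countering entropy collapse.
    ratio = np.asarray(ratio, dtype=np.float64)
    adv = np.asarray(adv, dtype=np.float64)
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(ratio * adv, clipped * adv)
```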
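Dynamic sampling's filter follows directly from the group-mean baseline: if every sampled response for a prompt gets the same reward, every advantage is zero. A sketch with illustrative names:

```python
def keep_prompt(group_rewards):
    # An all-correct or all-wrong group has zero advantage everywhere
    # under a group-mean baseline and contributes no gradient, so such
    # prompts are dropped and replaced by oversampling fresh ones.
    acc = sum(group_rewards) / len(group_rewards)
    return 0.0 < acc < 1.0

groups = [[1, 1, 1, 1], [1, 0, 1, 0], [0, 0, 0, 0]]
kept = [g for g in groups if keep_prompt(g)]
```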
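Overlong filtering reduces to masking truncated responses out of the loss; a minimal sketch, assuming per-response token losses and a truncation flag:

```python
import numpy as np

def overlong_filtered_loss(token_losses, truncated):
    # Responses cut off at the length cap get an unreliable reward
    # (the reasoning was interrupted, not necessarily wrong), so
    # their tokens are excluded from the loss entirely.
    kept = [np.asarray(t, dtype=np.float64)
            for t, trunc in zip(token_losses, truncated) if not trunc]
    if not kept:
        return 0.0
    return float(np.mean(np.concatenate(kept)))
```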
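One way to realize length-adaptive GAE is a λ schedule that grows toward 1 with sequence length, so long sequences get low-bias estimates while short ones keep low variance. This is a sketch of the idea; `alpha` is an illustrative scale parameter, not a value from the VAPO paper:

```python
def length_adaptive_lambda(seq_len, alpha=0.05):
    # lambda -> 1 as sequences grow (less bias accumulated over many
    # bootstrapped steps) and shrinks for short ones (less variance),
    # balancing TD-error influence across output lengths.
    return max(0.0, 1.0 - 1.0 / (alpha * seq_len))
```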