| Challenge | Approach | Key findings | Papers |
| --- | --- | --- | --- |
| Computational inefficiency of traditional PPO for LLM training | GRPO (Group Relative Policy Optimization): eliminates the separate value model by using the mean reward of a group of outputs sampled from the same prompt as the baseline for advantage estimation. | Comparable effectiveness to PPO at lower computational cost; particularly well suited to LLM reward modeling, where rewards are often comparative in nature. | GRPO |
| Token inefficiency and overthinking in long-form reasoning | Dr. GRPO ("GRPO Done Right"): removes GRPO's response-length normalization and reward standardization, yielding an unbiased policy-gradient estimate. | Significantly improved token efficiency and better-controlled response lengths, mitigating overthinking. | Dr. GRPO |
| Instability with varying response lengths in long-form reasoning | DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization): computes the policy gradient at the token level, so longer sequences influence updates in proportion to their length. | More stable training dynamics, healthier entropy behavior, and better handling of responses of varying length. | DAPO |
| Limited policy exploration due to rigid constraints | GPG (Group Policy Gradient): simplifies the policy gradient by dropping the reference model and policy constraints, maintaining stability through group-level reward normalization. | Enhanced exploration and reduced computational cost, with more flexible policy updates. | GPG |
| Repetitive or narrow reasoning patterns | Auxiliary entropy bonus: adds an entropy term to the RL loss to encourage token diversity and prevent deterministic response patterns. | More varied and creative reasoning paths without sacrificing solution accuracy. | T1 |
| Limitations of a fixed reference model | On-policy KL normalization: combines KL normalization with Exponential Moving Average (EMA) updates of the reference model. | Dynamically updating the reference model enables more effective RL scaling while keeping training stable. | T1 |
| Value-model misalignment with strong prior policies | Value-pretraining alignment: a dedicated pretraining phase aligns the value model with the strong prior policy before RL begins. | Convergence proceeds in two stages, initial range alignment followed by crucial knowledge injection, preventing output-length collapse on long-CoT tasks. | VC-PPO, VAPO |
| Conflicting variance-bias requirements of value and policy optimization | Decoupled GAE (Generalized Advantage Estimation): uses separate λ values for the value function and the policy, allowing unbiased value estimation while retaining variance reduction for policy updates. | Mathematical analysis and experiments show improved convergence without added bias; particularly effective for trajectory-level rewards in long-CoT tasks. | VC-PPO, VAPO |
| Limited exploration under constrained policy optimization | KL-divergence removal: drops the KL penalty that keeps the policy close to the reference model, letting the reasoning policy explore more freely. | Significant performance gains when the policy distribution is allowed to shift during extended reasoning training. | Open-Reasoner-Zero, DAPO |
| Premature deterministic behavior in RL systems | Clip-Higher: decouples the lower and upper clipping ranges in PPO to promote exploration of low-probability tokens while preserving stability. | Asymmetric clipping thresholds counteract entropy collapse and maintain policy diversity throughout extended training. | DAPO |
| Ineffective gradient signals in late-stage training | Dynamic sampling: filters out prompts whose sampled group is entirely correct (accuracy 1) or entirely wrong (accuracy 0), ensuring effective gradient signals. | Faster convergence to target performance despite the computational overhead of oversampling. | DAPO, Bae et al. |
| Noisy reward signals from length-truncated samples | Overlong filtering: masks the loss contribution of samples truncated at the maximum length, so otherwise sound reasoning is not inappropriately penalized. | Ablations show substantial training-stability gains once these noisy reward signals are removed. | DAPO |
| Inconsistent advantage estimation across variable-length sequences | Length-adaptive GAE: adjusts GAE's λ based on sequence length, balancing TD-error influence between short and long outputs. | More balanced advantage estimation and improved stability across varying lengths; particularly beneficial for long-form reasoning. | VAPO |
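The group-relative baseline in the GRPO entry can be sketched in a few lines of numpy; the function name is illustrative, not from the paper:

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    # Group-relative baseline: each of the G rewards sampled for the
    # same prompt is normalized by the group's mean and std, so no
    # separate learned value model is needed.
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)
```

Because the baseline is the group mean, the advantages of each group sum to zero by construction, which is what makes comparative reward signals sufficient.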
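Dr. GRPO's two debiasing changes can be contrasted with GRPO directly: keep the mean-subtracted baseline but drop the division by the group std, and sum token log-probabilities without per-response length normalization. A minimal sketch with illustrative names:

```python
import numpy as np

def dr_grpo_advantages(rewards):
    # Subtract only the group mean; Dr. GRPO drops GRPO's division by
    # the group std, which biases updates by question difficulty.
    r = np.asarray(rewards, dtype=np.float64)
    return r - r.mean()

def dr_grpo_loss(token_logprobs, advantages):
    # Sum token log-probs per response WITHOUT dividing by that
    # response's own length (the source of GRPO's length bias); the
    # batch mean acts as a constant normalizer instead.
    per_response = [a * lp.sum() for lp, a in zip(token_logprobs, advantages)]
    return -float(np.mean(per_response))
```

With per-response length normalization removed, a long low-advantage response is no longer down-weighted, which is how the method discourages overthinking.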
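DAPO's token-level loss is easiest to see next to the sample-level alternative. In this sketch (function names are mine), the token-level version pools all tokens in the batch before averaging:

```python
import numpy as np

def sample_level_loss(token_losses):
    # GRPO-style: average within each response first, then across the
    # batch, so every response weighs equally regardless of length.
    return float(np.mean([np.mean(t) for t in token_losses]))

def token_level_loss(token_losses):
    # DAPO-style: pool all tokens in the batch, so a long response
    # contributes gradient signal in proportion to its token count.
    flat = np.concatenate([np.asarray(t, dtype=np.float64) for t in token_losses])
    return float(flat.mean())
```

For a batch of one 1-token and one 3-token response, the sample-level mean weighs the single token as much as the other three combined; the token-level mean weighs every token equally.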
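The auxiliary entropy bonus from the T1 row amounts to subtracting a scaled entropy term from the policy-gradient loss; `beta` here is an illustrative coefficient, not a value from the paper:

```python
import numpy as np

def loss_with_entropy_bonus(pg_loss, token_probs, beta=0.01):
    # Subtract beta * H(pi): since the optimizer minimizes the loss,
    # this rewards higher-entropy next-token distributions and
    # discourages collapse into deterministic, repetitive patterns.
    p = np.asarray(token_probs, dtype=np.float64)
    entropy = -np.sum(p * np.log(p + 1e-12))
    return pg_loss - beta * entropy
```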
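The EMA reference-model update behind on-policy KL normalization is a standard exponential moving average over parameters; a minimal sketch over flat parameter lists (real implementations iterate over model tensors):

```python
def ema_ref_update(ref_params, policy_params, tau=0.99):
    # The reference model drifts slowly toward the current policy, so
    # the KL term anchors to a recent policy rather than the frozen
    # initial one; tau close to 1 means a slow-moving reference.
    return [tau * r + (1.0 - tau) * p
            for r, p in zip(ref_params, policy_params)]
```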
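Decoupled GAE separates the λ used to build value-function targets from the λ used for policy advantages. A sketch over a single trajectory, assuming `values` carries one extra bootstrap entry:

```python
import numpy as np

def gae(rewards, values, gamma=1.0, lam=0.95):
    # Standard GAE backward recursion; len(values) == len(rewards) + 1.
    adv = np.zeros(len(rewards))
    last = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

def decoupled_gae(rewards, values, gamma=1.0, lam_value=1.0, lam_policy=0.95):
    # lam_value = 1 gives an unbiased (Monte Carlo) return target for
    # fitting the value model; a smaller lam_policy keeps the policy's
    # advantage estimates low-variance.
    value_targets = gae(rewards, values, gamma, lam_value) + np.asarray(values)[:-1]
    policy_adv = gae(rewards, values, gamma, lam_policy)
    return value_targets, policy_adv
```

This matters for trajectory-level rewards because with λ < 1 the value target would bootstrap heavily from an inaccurate critic, while the policy still benefits from the variance reduction.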
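Clip-Higher changes one line of the PPO surrogate: the clip interval becomes asymmetric. The threshold values below mirror commonly reported DAPO settings but should be treated as illustrative:

```python
import numpy as np

def clip_higher_objective(ratio, adv, eps_low=0.2, eps_high=0.28):
    # PPO clipped objective with decoupled ranges: raising only the
    # upper bound (1 + eps_high) lets low-probability tokens with
    # positive advantage grow further before clipping kicks in,
    # countering entropy collapse.
    ratio = np.asarray(ratio, dtype=np.float64)
    adv = np.asarray(adv, dtype=np.float64)
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(ratio * adv, clipped * adv)
```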
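Dynamic sampling's filter follows directly from the group-mean baseline: if every sampled response for a prompt gets the same reward, every advantage is zero. A sketch with illustrative names:

```python
def keep_prompt(group_rewards):
    # An all-correct or all-wrong group has zero advantage everywhere
    # under a group-mean baseline and contributes no gradient, so such
    # prompts are dropped and replaced by oversampling fresh ones.
    acc = sum(group_rewards) / len(group_rewards)
    return 0.0 < acc < 1.0

groups = [[1, 1, 1, 1], [1, 0, 1, 0], [0, 0, 0, 0]]
kept = [g for g in groups if keep_prompt(g)]
```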
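Overlong filtering reduces to masking truncated responses out of the loss; a minimal sketch, assuming per-response token losses and a truncation flag:

```python
import numpy as np

def overlong_filtered_loss(token_losses, truncated):
    # Responses cut off at the length cap get an unreliable reward
    # (the reasoning was interrupted, not necessarily wrong), so
    # their tokens are excluded from the loss entirely.
    kept = [np.asarray(t, dtype=np.float64)
            for t, trunc in zip(token_losses, truncated) if not trunc]
    if not kept:
        return 0.0
    return float(np.mean(np.concatenate(kept)))
```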
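One way to realize length-adaptive GAE is a λ schedule that grows toward 1 with sequence length, so long sequences get low-bias estimates while short ones keep low variance. This is a sketch of the idea; `alpha` is an illustrative scale parameter, not a value from the VAPO paper:

```python
def length_adaptive_lambda(seq_len, alpha=0.05):
    # lambda -> 1 as sequences grow (less bias accumulated over many
    # bootstrapped steps) and shrinks for short ones (less variance),
    # balancing TD-error influence across output lengths.
    return max(0.0, 1.0 - 1.0 / (alpha * seq_len))
```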