# Verl Framework Application Case Analysis

**Repository Path**: esheeper/VerlWorks

## Basic Information

- **Project Name**: Verl Framework Application Case Analysis
- **Description**: Analysis of application cases of the Verl framework
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-10-27
- **Last Updated**: 2025-10-28

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# A Survey of Concrete Application Scenarios for the Verl Framework

## Overview

The `aggression.md` file collects material on every project related to the Verl repository and sorts it into categories; each category has a corresponding folder.

## **The following summary is taken from [cognition-engineering](https://github.com/gair-nlp/cognition-engineering)**

### A Concrete Illustration of Pre-training, Post-training, and Test-time Scaling (Thinking/Reasoning Ability)

![three_scaling_laws](./assets/three_scaling_laws.png)

The evolution of knowledge representation can be divided into three stages. Pre-training scaling (blue) forms isolated islands of knowledge built from basic physical concepts, with only a few inherent connections between them. Post-training scaling (green) draws these islands closer together by establishing more complex learned connections between related concepts. Test-time scaling (red) extends computation time to form dynamic reasoning paths between previously unrelated concepts, enabling multi-step reasoning across knowledge domains. The key contribution of test-time scaling is that it bridges knowledge islands that remain isolated even after pre-training and conventional post-training, connecting distant knowledge nodes.

### Recipes/Tricks for RL Scaling
#### Training Algorithm

| Problem to Solve | Method Overview | Evidence | Related Studies |
|---|---|---|---|
| Computational inefficiency in traditional PPO for LLM training | GRPO (Group Relative Policy Optimization): Eliminates the need for a separate value model by using the average reward of multiple outputs from the same prompt as the baseline for advantage calculation (sketched after this table). | Performance comparisons demonstrate computational efficiency while maintaining comparable effectiveness to traditional PPO, particularly well-suited for LLM reward modeling where rewards are often comparative in nature. | GRPO |
| Token inefficiency and overthinking in long-form reasoning | Dr.GRPO (GRPO Done Right): Addresses optimization bias in GRPO by removing response-length normalization and reward standardization, implementing an unbiased policy gradient estimation. | Experimental results show significantly improved token efficiency with better-controlled response lengths, effectively mitigating overthinking problems. | Dr.GRPO |
| Instability with varying response lengths in long-form reasoning | DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization): Implements token-level policy gradient calculation, allowing longer sequences to appropriately influence the gradient updates regardless of individual response lengths. | Comparative analysis reveals more stable training dynamics with healthier entropy management and better quality pattern recognition, particularly for handling varying response lengths effectively. | DAPO |
| Limited policy exploration due to rigid constraints | GPG (Group Policy Gradient): Simplifies the policy gradient approach by removing reference models and policy constraints while maintaining stability through group-level reward normalization. | Comparative experiments demonstrate enhanced exploration capabilities with reduced computational requirements, providing more flexible policy updates. | GPG |
| Repetitive or narrow reasoning patterns | Auxiliary entropy bonus: Incorporates an additive entropy term into the RL loss function to encourage token diversity and prevent deterministic response patterns. | Experimental results show more varied and creative reasoning paths without sacrificing solution accuracy. | T1 |
| Limitations of fixed reference models | On-policy KL normalization: Combines KL normalization with Exponential Moving Average (EMA) updates to the reference model. | Dynamic reference-model updating allows for more effective RL scaling while maintaining stable training dynamics. | T1 |
| Value model misalignment with strong prior policies | Value-Pretraining Alignment: Implements a dedicated pretraining phase for the value model to ensure alignment with strong prior policies before RL begins. | A two-stage convergence pattern shows initial range alignment followed by crucial knowledge injection, preventing collapse in output length for long-CoT tasks. | VC-PPO, VAPO |
| Conflicting variance-bias requirements between value and policy optimization | Decoupled-GAE (Generalized Advantage Estimation): Separates the λ parameter for value-function and policy optimization, allowing unbiased value estimation while maintaining variance-reduction benefits for policy updates (sketched after this table). | Mathematical analysis and experimental results demonstrate improved convergence rates without introducing additional bias, particularly effective for trajectory-level rewards in long-CoT tasks. | VC-PPO, VAPO |
| Limited exploration in constrained policy optimization | KL Divergence Removal: Eliminates the KL penalty term that constrains policy divergence from the reference model, allowing the reasoning policy to explore more freely. | Experiments reveal significant performance gains when removing constraints on policy distribution shifts during extended reasoning training. | Open-Reasoner-Zero, DAPO |
| Premature deterministic behavior in RL systems | Clip-Higher Strategy: Decouples lower and higher clipping ranges in PPO to specifically promote exploration of low-probability tokens while maintaining stability (sketched after this table). | Asymmetric clipping thresholds effectively counteract entropy collapse and maintain policy diversity throughout extended training. | DAPO |
| Ineffective gradient signals in late-stage training | Dynamic Sampling: Implements an adaptive sampling approach that filters out prompts with accuracy values of exactly 0 or 1 to ensure effective gradient signals (sketched after this table). | Comparative training curves demonstrate faster convergence to target performance despite the additional computational overhead of oversampling. | DAPO, Bae et al. |
| Noisy reward signals from length-truncated samples | Overlong Filtering: Masks the loss contribution of truncated samples that exceed the maximum length to prevent inappropriate penalization of otherwise sound reasoning (sketched after this table). | Ablation studies highlight substantial training-stability improvements when removing noisy reward signals from length-truncated samples. | DAPO |
| Inconsistent advantage estimation across variable-length sequences | Length-Adaptive GAE: Dynamically adjusts the λ parameter in GAE based on sequence length, ensuring balanced TD-error influence for both short and long outputs. | Empirical tests reveal more balanced advantage estimation and improved training stability across sequences of varying lengths, particularly beneficial for long-form reasoning. | VAPO |
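Several of the tricks above are compact enough to show in code. First, the group-relative baseline at the heart of GRPO; a minimal PyTorch sketch, where all names (`grpo_advantages`, `rewards`) are illustrative rather than taken from any specific codebase:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as used in GRPO.

    rewards: (num_prompts, group_size) scalar rewards, one row per prompt,
    one column per sampled response to that prompt. Instead of a learned
    value model, each response's baseline is the mean reward of its own
    group; GRPO additionally divides by the group standard deviation.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each, binary correctness rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```

Note that Dr.GRPO argues the `std` division (together with per-response length normalization in the loss) introduces optimization bias; that variant simply returns `rewards - mean`.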
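Decoupled-GAE separates the λ used to build the critic's regression target from the λ used for policy advantages. A minimal single-trajectory sketch, assuming a terminal value of zero; the function names and default values are illustrative:

```python
import torch

def gae(rewards: torch.Tensor, values: torch.Tensor,
        gamma: float = 1.0, lam: float = 0.95) -> torch.Tensor:
    """Standard GAE over one trajectory of T steps (terminal value = 0)."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

def decoupled_gae(rewards, values, gamma=1.0,
                  lam_value=1.0, lam_policy=0.95):
    """Decoupled-GAE: lam_value = 1.0 gives an unbiased (Monte Carlo)
    return target for the critic, while a smaller lam_policy keeps the
    policy's advantage estimates low-variance."""
    adv_policy = gae(rewards, values, gamma, lam_policy)
    returns_value = gae(rewards, values, gamma, lam_value) + values
    return adv_policy, returns_value
```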
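Clip-Higher is essentially a one-line change to the PPO surrogate. In the sketch below, `eps_low = 0.2` and `eps_high = 0.28` follow the values reported in the DAPO paper, while everything else is an illustrative assumption:

```python
import torch

def clip_higher_loss(logp_new, logp_old, advantages,
                     eps_low: float = 0.2, eps_high: float = 0.28):
    """PPO surrogate with decoupled clip ranges (DAPO's Clip-Higher).

    A larger upper range lets low-probability tokens gain probability
    mass, counteracting entropy collapse, while the lower range keeps
    updates stable. All tensors are per-token with the same shape.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    # Averaging over all tokens in the batch (rather than per sequence)
    # is DAPO's token-level aggregation: longer responses contribute
    # proportionally more tokens to the gradient.
    return -torch.min(unclipped, clipped).mean()
```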
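Dynamic Sampling and Overlong Filtering both reduce to masks over a sampled batch. A hedged sketch, assuming binary per-response correctness and a `truncated` flag produced by the rollout engine (both names are placeholders):

```python
import torch

def dapo_batch_masks(correct: torch.Tensor, truncated: torch.Tensor):
    """correct:   (num_prompts, group_size) 0/1 correctness per response.
    truncated: (num_prompts, group_size) 1 if the response hit max length.

    Dynamic Sampling: drop prompts whose group accuracy is exactly 0 or 1,
    since every group-relative advantage there is zero (no gradient signal).
    Overlong Filtering: mask out length-truncated responses so an unfinished
    but otherwise sound chain of thought is not penalized as incorrect.
    """
    acc = correct.float().mean(dim=-1)        # per-prompt group accuracy
    keep_prompt = (acc > 0) & (acc < 1)       # (num_prompts,)
    keep_response = ~truncated.bool()         # (num_prompts, group_size)
    return keep_prompt, keep_response
```

In practice DAPO keeps oversampling prompts until enough survive the dynamic-sampling filter to fill the batch, which is the "additional computational overhead" the table mentions.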
#### Reward Design

| Problem to Solve | Method Overview | Evidence | Related Studies |
|---|---|---|---|
| Uncontrolled CoT length in reasoning tasks | Cosine Length Reward: Applies a cosine-based reward shaping that prioritizes shorter, correct CoTs while penalizing short, incorrect ones (sketched after this table). | Evaluation across diverse reasoning tasks reveals stabilized CoT length with preserved performance. | Demystify |
| Reward hacking in deterministic reasoning tasks | Accuracy+Format Reward: Combines verification of answer correctness with structured formatting requirements that enforce explicit reasoning within specialized tags (sketched after this table). | Rule-based reward systems demonstrate greater resistance to reward hacking than neural alternatives while simplifying the training pipeline. | DeepSeek-R1, SimpleRL, T1, Logic-RL, STILL-3 |
| Language mixing issues in multilingual environments | Language Consistency Incentive: Calculates rewards based on the proportion of target-language words in the CoT to mitigate language mixing (sketched after this table). | User studies indicate enhanced readability despite minor performance trade-offs in multilingual contexts. | DeepSeek-R1 |
| Model overthinking and verbosity | Overthinking Length Penalty: Implements a weighted reward mechanism that penalizes excessive response length while preserving correctness to combat model overthinking. | Gradually introduced length penalties resulted in more token-efficient reasoning. | KIMI-K1.5, DAPO |
| Inaccurate reward modeling in nuanced domains | Chain-of-Thought RM: Enhances reward modeling with explicit step-by-step reasoning before the final correctness judgment, particularly for domains with nuanced evaluation criteria. | Manual verification confirmed that CoT reward models achieved significantly higher accuracy compared to classic reward models without reasoning steps. | KIMI-K1.5 |
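Rule-based accuracy-plus-format rewards are easy to sketch. The tag names and score values below are illustrative assumptions, not the exact rules of any cited system:

```python
import re

def accuracy_format_reward(response: str, gold_answer: str) -> float:
    """Combine a format check with an exact-match accuracy check.

    Format: the response must wrap its reasoning in <think>...</think>
    and its final answer in <answer>...</answer> (tag names are
    assumptions). Accuracy: the extracted answer must match the
    reference exactly.
    """
    m = re.search(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>",
                  response, flags=re.DOTALL)
    if m is None:
        return -1.0  # malformed output: format reward withheld
    answer = m.group(2).strip()
    return 1.0 if answer == gold_answer.strip() else -0.5
```

Real recipes replace the exact-match test with a domain verifier, e.g., symbolic math-equivalence checking or unit-test execution for code.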
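The cosine length reward follows the shape described above: shorter correct CoTs score highest, and short incorrect CoTs are penalized hardest (so the model is nudged to think longer when unsure). The endpoint values in this sketch are illustrative assumptions, not the published constants:

```python
import math

def cosine_length_reward(correct: bool, length: int, max_length: int) -> float:
    """Cosine-shaped reward over CoT length."""
    def cos_interp(t: int, at_zero: float, at_max: float) -> float:
        # Half-cosine interpolation from at_zero (t=0) to at_max (t=max_length).
        return at_max + 0.5 * (at_zero - at_max) * (
            1 + math.cos(math.pi * t / max_length))

    t = min(length, max_length)
    if correct:
        # Shorter correct CoTs earn the higher reward.
        return cos_interp(t, at_zero=1.0, at_max=0.5)
    # Short incorrect CoTs are penalized hardest; long ones less so.
    return cos_interp(t, at_zero=-1.0, at_max=-0.1)
```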
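The language-consistency incentive is simply a ratio. A sketch with a placeholder per-word language test; any real system would use a proper language-identification model:

```python
from typing import Callable

def language_consistency_reward(cot: str,
                                is_target_language: Callable[[str], bool]
                                ) -> float:
    """Fraction of CoT words judged to be in the target language.

    is_target_language is a placeholder callable (word -> bool) standing
    in for a real language-ID model. Returns a reward in [0, 1].
    """
    words = cot.split()
    if not words:
        return 0.0
    return sum(map(is_target_language, words)) / len(words)
```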
#### Training Data

| Problem to Solve | Method Overview | Evidence | Related Studies |
|---|---|---|---|
| Resource-constrained RL training environments | High-impact Sample Selection: Prioritizes training samples based on learning-impact measurement (sketched after this table). | Results show a significant reduction in required training data while maintaining performance. | LIMR |
| Training with noisy web-extracted data | Noise Reduction Filtering: Employs filtering mechanisms to remove noisy web-extracted data. | Filtered datasets demonstrate improved generalization capabilities on OOD tasks. | Demystify |
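One plausible reading of learning-impact-based selection: score each training prompt by how closely its reward trajectory during a pilot run tracks the model's overall learning curve, then keep the top fraction. The cosine-similarity scoring rule below is an assumption, not LIMR's exact formula:

```python
import numpy as np

def select_high_impact(sample_reward_curves: np.ndarray,
                       keep_frac: float = 0.2) -> np.ndarray:
    """Rank training prompts by an assumed learning-impact score.

    sample_reward_curves: (num_samples, num_epochs) reward of each
    training prompt over a pilot RL run. Samples whose trajectory
    tracks the overall learning curve are scored as high-impact.
    """
    overall = sample_reward_curves.mean(axis=0)          # (num_epochs,)
    num = sample_reward_curves @ overall
    den = (np.linalg.norm(sample_reward_curves, axis=1)
           * np.linalg.norm(overall) + 1e-8)
    scores = num / den                                    # cosine similarity
    k = max(1, int(keep_frac * len(scores)))
    return np.argsort(scores)[::-1][:k]                   # top-k indices
```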
#### Multi-stage Training

| Problem to Solve | Method Overview | Evidence | Related Studies |
|---|---|---|---|
| Poor readability and reasoning in direct RL approaches | Cold-start Progression: Implements a phased training approach beginning with high-quality CoT data fine-tuning before transitioning to large-scale reinforcement learning. | Models with cold-start initialization exhibit enhanced readability and reasoning capabilities compared to direct RL approaches. | DeepSeek-R1, T1, DeepScaleR, STILL-3 |
| Inefficient training with problems of varied difficulty | Strategic Sampling: Combines curriculum-based progression from simple to complex problems with prioritization of difficult cases where model performance is weakest. | Targeted sampling approaches demonstrated faster convergence and more efficient use of computational resources during training. | KIMI-K1.5 |
| Inefficient use of context in long-form reasoning | Progressive Context Scaling: Implements a multi-stage training approach that gradually increases the context window size as model performance begins to plateau at each level (sketched after this table). | Phased context-window expansion demonstrates significant improvements in both computational efficiency and final performance compared to fixed maximum-context training. | DeepScaleR |
| Performance gaps on challenging reasoning problems | Targeted Annealing: Implements a final training phase on specifically mined challenging problems with a linearly decaying learning rate to refine reasoning capabilities. | Enhanced performance on complex reasoning tasks without compromising general capabilities. | Open-Reasoner-Zero |
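Progressive context scaling reduces, in practice, to a staged schedule over the rollout's maximum response length. A minimal sketch: the 8K→16K→24K stage boundaries mirror the schedule reported for DeepScaleR, while the plateau test is an illustrative placeholder, not the published criterion:

```python
# Staged max-response-length schedule (tokens), as in DeepScaleR's recipe.
CONTEXT_STAGES = [8_192, 16_384, 24_576]

def current_max_length(stage: int) -> int:
    """Max rollout length for the given training stage."""
    return CONTEXT_STAGES[min(stage, len(CONTEXT_STAGES) - 1)]

def should_advance(recent_scores: list[float], window: int = 5,
                   min_gain: float = 0.002) -> bool:
    """Advance to the next stage once validation accuracy plateaus.

    Plateau rule (an assumption): mean improvement between two
    consecutive sliding windows falls below min_gain.
    """
    if len(recent_scores) < 2 * window:
        return False
    old = sum(recent_scores[-2 * window:-window]) / window
    new = sum(recent_scores[-window:]) / window
    return (new - old) < min_gain
```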
# Implementation of RL Scaling

#### The Practitioner’s Roadmap: How to Apply Test-Time Scaling to Your Applications?

![Practitioner’s Roadmap](./assets/workflow_tts.png)

See the [code](./simple_tts) for a handy tutorial.

# Curated Papers

See [papers](./resources/papers/).

# Long CoT Resource

| **Work** | **Application** | **Type** | **Source** | **Quantity** | **Modality** | **Link** |
|----------|----------------|----------|------------|--------------|--------------|----------|
| O1 Journey--Part 1 | Math | Synthesize | GPT-4o | 0.3K | Text | [GitHub](https://github.com/GAIR-NLP/O1-Journey) [HuggingFace](https://huggingface.co/datasets/GAIR/o1-journey) |
| Marco-o1 | Reasoning | Synthesize | Qwen2-7B-Instruct | 10K | Text | [GitHub](https://github.com/AIDC-AI/Marco-o1) |
| STILL-2 | Math, Code, Science, Puzzle | Distillation | DeepSeek-R1-Lite-Preview, QwQ-32B-preview | 5K | Text | [GitHub](https://github.com/RUCAIBox/Slow_Thinking_with_LLMs) [HuggingFace](https://huggingface.co/datasets/RUC-AIBOX/long_form_thought_data_5k) |
| RedStar-math | Math | Distillation | QwQ-32B-preview | 4K | Text | [HuggingFace](https://huggingface.co/datasets/RedStar-Reasoning/math_dataset) |
| RedStar-code | Code | Distillation | QwQ-32B-preview | 16K | Text | [HuggingFace](https://huggingface.co/datasets/RedStar-Reasoning/code_dataset) |
| RedStar-multimodal | Math | Distillation | QwQ-32B-preview | 12K | Vision, Text | [HuggingFace](https://huggingface.co/datasets/RedStar-Reasoning/multimodal_dataset) |
| S1K | Math, Science, Code | Distillation | Gemini Flash Thinking | 1K | Text | [GitHub](https://github.com/simplescaling/s1) [HuggingFace](https://huggingface.co/datasets/simplescaling/s1K) |
| S1K-1.1 | Math, Science, Code | Distillation | DeepSeek R1 | 1K | Text | [GitHub](https://github.com/simplescaling/s1) [HuggingFace](https://huggingface.co/datasets/simplescaling/s1K-1.1) |
| LIMO | Math | Distillation | DeepSeek R1, DeepSeekR1-Distill-Qwen-32B | 0.8K | Text | [GitHub](https://github.com/GAIR-NLP/LIMO) [HuggingFace](https://huggingface.co/datasets/GAIR/LIMO) |
| OpenThoughts-114k | Math, Code, Science, Puzzle | Distillation | DeepSeek R1 | 114K | Text | [GitHub](https://github.com/open-thoughts/open-thoughts) [HuggingFace](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) |
| OpenR1-Math-220k | Math | Distillation | DeepSeek R1 | 220K | Text | [GitHub](https://github.com/huggingface/open-r1) [HuggingFace](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k) |
| OpenThoughts2-1M | Math, Code, Science, Puzzle | Distillation | DeepSeek R1 | 1M | Text | [GitHub](https://github.com/open-thoughts/open-thoughts) [HuggingFace](https://huggingface.co/datasets/open-thoughts/OpenThoughts2-1M) |
| CodeForces-CoTs | Code | Distillation | DeepSeek R1 | 47K | Text | [GitHub](https://github.com/huggingface/open-r1) [HuggingFace](https://huggingface.co/datasets/open-r1/codeforces-cots) |
| Sky-T1-17k | Math, Code, Science, Puzzle | Distillation | QwQ-32B-Preview | 17K | Text | [GitHub](https://github.com/NovaSky-AI/SkyThought) [HuggingFace](https://huggingface.co/datasets/NovaSky-AI/Sky-T1_data_17k) |
| S²R | Math | Synthesize | Qwen2.5-Math-7B | 3K | Text | [GitHub](https://github.com/NineAbyss/S2R) [HuggingFace](https://huggingface.co/datasets/S2R-data/S2R-dataset) |
| R1-Onevision | Science, Math, General | Distillation | DeepSeek R1 | 155K | Vision, Text | [GitHub](https://github.com/Fancy-MLLM/R1-Onevision) [HuggingFace](https://huggingface.co/datasets/Fancy-MLLM/R1-Onevision) |
| OpenO1-SFT | Math, Code | Synthesize | - | 77K | Text | [GitHub](https://github.com/Open-Source-O1/Open-O1) [HuggingFace](https://huggingface.co/datasets/O1-OPEN/OpenO1-SFT) |
| Medical-o1 | Medical | Distillation | DeepSeek R1 | 25K | Text | [GitHub](https://github.com/FreedomIntelligence/HuatuoGPT-o1) [HuggingFace](https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT) |
| O1 Journey--Part 3 | Medical | Distillation | o1-preview | 0.5K | Text | [GitHub](https://github.com/SPIRAL-MED/Ophiuchus) [HuggingFace](https://huggingface.co/datasets/SPIRAL-MED/o1-journey-Ophiuchus) |
| SCP-116K | Math, Science | Distillation | DeepSeek R1 | 116K | Text | [GitHub](https://github.com/AQA6666/SCP-116K-open) [HuggingFace](https://huggingface.co/datasets/EricLu/SCP-116K) |
| open-r1-multimodal | Math | Distillation | GPT-4o | 8K | Vision, Text | [GitHub](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal) [HuggingFace](https://huggingface.co/datasets/lmms-lab/multimodal-open-r1-8k-verified) |
| Vision-R1-cold | Science, Math, General | Distillation | DeepSeek R1 | 200K | Vision, Text | [GitHub](https://github.com/Osilly/Vision-R1) [HuggingFace](https://huggingface.co/datasets/Osilly/Vision-R1-cold) |
| MMMU-Reasoning-Distill-Validation | Science, Math, General | Distillation | DeepSeek R1 | 0.8K | Vision, Text | [ModelScope](https://www.modelscope.cn/datasets/modelscope/MMMU-Reasoning-Distill-Validation) |
| Clevr-CoGenT | Vision Counting | Distillation | DeepSeek R1 | 37.8K | Vision, Text | [GitHub](https://github.com/Deep-Agent/R1-V) [HuggingFace](https://huggingface.co/datasets/MMInstruction/Clevr_CoGenT_TrainA_R1) |
| VL-Thinking | Science, Math, General | Distillation | DeepSeek R1 | 158K | Vision, Text | [GitHub](https://github.com/UCSC-VLAA/VL-Thinking) [HuggingFace](https://huggingface.co/datasets/UCSC-VLAA/VL-Thinking) |
| Video-R1 | Video | Distillation | Qwen2.5-VL-72B | 158K | Vision, Text | [GitHub](https://github.com/tulerfeng/Video-R1) [HuggingFace](https://huggingface.co/datasets/Video-R1/Video-R1-data) |
| Embodied-Reasoner | Embodied AI | Synthesize | GPT-4o | 9K | Vision, Text | [GitHub](https://github.com/zwq2018/embodied_reasoner) [HuggingFace](https://huggingface.co/datasets/zwq2018/embodied_reasoner) |
| OpenCodeReasoning | Code | Distillation | DeepSeek R1 | 736K | Text | [HuggingFace](https://huggingface.co/datasets/nvidia/OpenCodeReasoning) |
| SafeChain | Safety | Distillation | DeepSeek R1 | 40K | Text | [GitHub](https://github.com/uw-nsl/safechain) [HuggingFace](https://huggingface.co/datasets/UWNSL/SafeChain) |
| KodCode | Code | Distillation | DeepSeek R1 | 2.8K | Text | [GitHub](https://github.com/KodCode-AI/kodcode) [HuggingFace](https://huggingface.co/datasets/KodCode/KodCode-V1-SFT-R1) |

# Development Timeline

The images present a comprehensive timeline of test-time scaling methods applied across various AI domains from 2020 to 2025. These visualizations track the evolution of key techniques, including Parallel Sampling, Tree Search, Multi-turn Correction, and Long Chain-of-Thought (CoT), across different fields of application.
## Key Test-Time Scaling Methods

The research maps four primary test-time scaling approaches:

- **Parallel Sampling** (blue): Generating multiple candidate solutions in parallel (a minimal sketch follows the timeline figures below)
- **Tree Search** (green): Exploring decision trees to find optimal solutions
- **Multi-turn Correction** (red): Iterative refinement through multiple passes
- **Long CoT (Chain-of-Thought)** (purple): Extended reasoning chains for complex problem-solving

## Training Strategies

The methods are implemented using various training approaches:

- **SFT** (Supervised Fine-Tuning): Diamond symbol
- **DPO** (Direct Preference Optimization): Triangle symbol
- **RL** (Reinforcement Learning): Square symbol
- **Inference-only**: Circle symbol

![image](./assets/timeline_math.png)
![image](./assets/timeline_code.png)
![image](./assets/timeline_multi_modal.png)
![image](./assets/timeline_agent.png)
![image](./assets/timeline_embodied_ai.png)
![image](./assets/timeline_safety.png)
![image](./assets/timeline_rag.png)
![image](./assets/timeline_evaluation.png)
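To make the simplest of the methods above concrete, here is a minimal best-of-N parallel-sampling sketch: draw several candidate answers at a nonzero temperature and keep the one a scorer prefers. The `generate` and `score` callables are placeholders for your model and verifier/reward model:

```python
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str, float], str],
              score: Callable[[str, str], float],
              n: int = 8,
              temperature: float = 0.8) -> str:
    """Parallel sampling (best-of-N) at test time.

    generate(prompt, temperature) -> one sampled completion.
    score(prompt, completion)     -> higher is better (verifier or RM).
    More samples trade extra compute for accuracy, with no training.
    """
    candidates = [generate(prompt, temperature) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))
```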