
Authors: Jiacai Liu, Yingru Li*†, Yuqian Fu, Jiawei Wang, Qian Liu, Yu Shen*†
Work done at ByteDance. First published on Sep 17, 2025.
*Co-First Authors. †Corresponding Authors.

Figure 1. Rewards (Left) and gradient norms (Right) from our four failed GRPO TIR experiments on Qwen3-14B-Base. All experiments sample 1024 trajectories (64 prompts × 16 responses) at each training step and use a learning rate of 1e-6. The ppo_mini_batch_size is set to 1024 and 256 for the on-policy and off-policy experiments, respectively.
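To make the caption's on-/off-policy distinction concrete, here is a minimal sketch of the batch bookkeeping (the variable names are illustrative, not framework config keys): with 1024 fresh trajectories per step, a mini-batch size of 1024 yields a single, strictly on-policy update, whereas 256 yields four sequential updates, all but the first of which use a policy that has already moved.

```python
# Batch bookkeeping implied by the setup above (names are illustrative).
num_prompts = 64           # prompts sampled per training step
responses_per_prompt = 16  # GRPO group size
rollout_batch = num_prompts * responses_per_prompt  # 1024 trajectories per step

for ppo_mini_batch_size in (1024, 256):
    num_updates = rollout_batch // ppo_mini_batch_size
    # 1 update  -> strictly on-policy: every gradient step uses freshly sampled data
    # 4 updates -> off-policy: after the first mini-batch, the policy has already moved
    regime = "on-policy" if num_updates == 1 else "off-policy"
    print(f"mini_batch={ppo_mini_batch_size}: {num_updates} update(s)/step -> {regime}")
```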
<aside> 📌
Latest Update
[Dec 2] DeepSeek-V3.2 (https://arxiv.org/abs/2512.02556) adopts our geometric sequence masking to deal with the training-inference mismatch and general off-policy training instability!
[Dec 1] Our work is discussed in [Stabilizing Reinforcement Learning with LLMs: Formulation and Practices] from the Qwen team.
[Nov 12] Rollout Correction for General Off-Policy Problems was merged into VeRL: [Usage Documents][More Details] (Yingru Li)
[Nov 10] Very excited to see that our work was discussed in a vLLM blog, where they implemented “Bitwise Consistent On-Policy Reinforcement Learning”.
[Oct 31] VeRL’s fully async module has integrated our work in [PR].
[Oct 20] Slime has integrated our work in [PR]. (SGLang RL: Chenyang Zhao, Jiajun Li)
[Oct 16] Very excited to see that our work is cited by SWE-grep at Cognition, which also uses sequence-level masked importance sampling (MIS) to address the training-inference mismatch.
[Oct 15] A Megatron training crash issue (https://github.com/volcengine/verl/issues/3597) was resolved via the geometric-level masking.
[Oct 13] VeRL has integrated our work in [VeRL 0.6.0][PR], including the newly introduced sequence masking/rejection with geometric mean of importance weight (Geo-MIS/Geo-RS). (Yingru Li)
</aside>
<aside> 💡
The relentless push for faster inference has created a dangerous “training-inference mismatch” that can silently kill reinforcement learning with LLMs. Our investigation reveals a vicious cycle that is particularly acute in modern reasoning and agentic RL.
</aside>
<aside> 📖
For a rigorous theoretical breakdown of this problem, we've published a three-part blog series with more insights:
@online{liu-li-2025-rl-collapse,
  title  = {When Speed Kills Stability: Demystifying {RL} Collapse from the Training-Inference Mismatch},
  author = {Liu, Jiacai and Li, Yingru and Fu, Yuqian and Wang, Jiawei and Liu, Qian and Shen, Yu},
  year   = {2025},
  month  = sep,
  url    = {https://richardli.xyz/rl-collapse}
}
</aside>
In the rapidly advancing field of reinforcement learning for large language models (LLM-RL), a frustrating pattern of sudden training collapse is emerging. Whether in complex reasoning RL or multi-turn agentic RL, many have observed training runs that, after a period of stable learning, catastrophically fail.
We recently encountered this firsthand while conducting agentic RL experiments for multi-turn tool-integrated reasoning (TIR) on Qwen3 models. This occurred across both on-policy and off-policy variants of the GRPO algorithm on our L20 GPU cluster. Figure 1 shows the reward and gradient norm dynamics of our four crashed experiments on Qwen3-14B-Base. As training progresses, the gradient norms suddenly explode, leading to model collapse. Our initial investigation focused on common culprits:
We examined the code and confirmed that our agent loop follows a token-in-token-out process.
We tuned the hyperparameters beta1 and beta2 in the Adam optimizer.
We also applied batch-level normalization to the advantages to balance the updates (this and the Adam retuning are sketched after this list).
...
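For reference, a minimal sketch of those last two attempts; the tensor shapes, beta values, and the toy model are illustrative assumptions, not our actual training configuration:

```python
import torch

# Batch-level normalization of the advantages (illustrative shapes/values).
advantages = torch.randn(1024)  # one advantage per sampled trajectory
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

# Retuning the Adam moment coefficients beta1/beta2 (illustrative values).
model = torch.nn.Linear(8, 8)   # stand-in for the policy
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6, betas=(0.9, 0.95))
```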
However, none of these standard fixes worked. Since even the simpler on-policy experiments failed, we suspected the issue was not with the RL algorithm but with a more fundamental part of the training stack. This led us to investigate a critical and increasingly prevalent challenge in modern LLM-RL: the unavoidable gap between highly optimized inference engines and faithful training frameworks.
Rollout speed is a core bottleneck in LLM-RL. To achieve the massive throughput required, modern inference engines (e.g., vLLM, SGLang, TensorRT-LLM) employ aggressive optimization strategies such as speculative decoding, low-precision computation (INT8/FP8), and specialized, batch-variant CUDA kernels. Although these engines aim to preserve sampling fidelity, their primary objective is to maximize throughput, often measured in tokens per second. Conversely, training frameworks (e.g., FSDP, DeepSpeed, Megatron-LM) must strike a different balance, prioritizing numerical stability and precision for gradient computation, often using higher-precision formats such as FP32 for master weights and optimizer states. This divergence in optimization priorities and constraints creates an inevitable training-inference mismatch, and the relentless push for faster rollouts is making the gap wider, not smaller. One might propose enforcing identical calculations (e.g., using "batch-invariant kernels"), but such solutions come with a severe performance penalty, defeating the purpose of using a high-speed inference engine in the first place. This speed-vs-consistency trade-off is at the heart of the problem, making it a persistent challenge rather than a simple engineering fix.
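A concrete way to see this mismatch is to compare, for the very tokens that were sampled, the log-probabilities reported by the rollout engine against those recomputed by the trainer. Below is a minimal sketch of such a diagnostic; the tensors are random stand-ins for real per-token log-probs, and the geometric-mean aggregation mirrors the sequence-level importance weight used by the Geo-MIS/Geo-RS masking mentioned in the updates above:

```python
import torch

# Stand-ins for per-token log-probabilities of the *sampled* tokens under each backend.
# In practice, rollout_logprobs would come from the inference engine's returned
# logprobs (e.g., vLLM) and train_logprobs from recomputing the same sequence with
# the training framework (e.g., an FSDP forward pass).
torch.manual_seed(0)
rollout_logprobs = -2.0 + 0.1 * torch.randn(512)             # log pi_rollout of sampled tokens
train_logprobs = rollout_logprobs + 1e-3 * torch.randn(512)  # small numerical mismatch

# Per-token importance ratio pi_train / pi_rollout.
token_ratio = torch.exp(train_logprobs - rollout_logprobs)

# Sequence-level weight as the geometric mean of per-token ratios,
# i.e., exp of the mean log-ratio over the sequence.
geo_mean_ratio = torch.exp((train_logprobs - rollout_logprobs).mean())

print(f"max per-token ratio:           {token_ratio.max().item():.6f}")
print(f"geometric-mean sequence ratio: {geo_mean_ratio.item():.6f}")
```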
In our stack, this mismatch manifested between our vLLM inference sampler and our FSDP trainer. The actual parameter update was: