Authors: Jiacai Liu, Yingru Li*†, Yuqian Fu, Jiawei Wang, Qian Liu, Yu Shen*†
Work done at ByteDance. First published on Sep 17, 2025.
*Co-First Authors. †Corresponding Authors.
Figure 1. Rewards (left) and gradient norms (right) from our four failed GRPO TIR experiments on Qwen3-14B-Base. All experiments sample 1024 trajectories (64 prompts × 16 responses) at each training step and use a learning rate of 1e-6. The ppo_mini_batch_size is set to 1024 and 256 for the on-policy and off-policy experiments, respectively.
The relentless push for faster inference has created a dangerous "training-inference mismatch" that can silently kill reinforcement learning with LLMs. Our investigation reveals a vicious cycle that is particularly acute in modern reasoning and agentic RL.

If you find this post useful, please cite it as:
@misc{liu-li-2025,
  title  = {When Speed Kills Stability: Demystifying RL Collapse from the Inference-Training Mismatch},
  url    = {https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Inference-Training-Mismatch-271211a558b7808d8b12d403fd15edda},
  author = {Jiacai Liu and Yingru Li and Yuqian Fu and Jiawei Wang and Qian Liu and Yu Shen},
  year   = {2025},
  month  = sep,
}
In the rapidly advancing field of reinforcement learning for large language models (LLM-RL), a frustrating pattern of sudden training collapse is emerging. Whether in complex reasoning RL or multi-turn agentic RL, many have observed training runs that, after a period of stable learning, catastrophically fail.
We recently encountered this firsthand while conducting agentic RL experiments for multi-turn tool-integrated reasoning (TIR) on Qwen3 models. This occurred across both on-policy and off-policy variants of the GRPO algorithm on our L20 GPU cluster. Figure 1 shows the reward and gradient norm dynamics of our four crashed experiments on Qwen3-14B-Base. As training progresses, the gradient norms suddenly explode, leading to model collapse. Our initial investigation focused on common culprits:
- We examined the code and confirmed that our agent loop follows a token-in-token-out process.
- We tuned the hyperparameters beta1 and beta2 of the Adam optimizer.
- We applied batch normalization to the advantages to balance the updates (a minimal sketch follows this list).
- ...
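For concreteness, here is a minimal sketch of the advantage batch normalization we tried, i.e., the standard whitening trick used in PPO-style trainers. The function name and shapes are illustrative, not our actual training code.

```python
import torch


def normalize_advantages(advantages: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Whiten advantages across the batch to zero mean and unit variance."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)


# Example: 1024 trajectories (64 prompts x 16 responses) per training step.
advantages = torch.randn(1024)
normalized = normalize_advantages(advantages)
```

This keeps the update magnitude balanced across a batch, but, as noted above, it did not prevent the collapse.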
However, none of these standard fixes worked. Since even the simpler on-policy experiments failed, we suspected the issue was not with the RL algorithm but with a more fundamental part of the training stack. This led us to investigate a critical and increasingly prevalent challenge in modern LLM-RL: the unavoidable gap between highly-optimized inference engines and faithful training frameworks.
Rollout speed is a core bottleneck in LLM-RL. To achieve the massive throughput required, modern inference engines (e.g., vLLM, SGLang, TensorRT-LLM) employ aggressive optimization strategies such as speculative decoding, low-precision computation (INT8/FP8), and specialized, batch-variant CUDA kernels. While these engines try to preserve sampling fidelity, their primary objective is to maximize throughput, often measured in tokens per second. Conversely, training frameworks (e.g., FSDP, DeepSpeed, Megatron-LM) must strike a different balance, prioritizing numerical stability and precision for gradient computation, often using higher-precision formats like FP32 for master weights and optimizer states. This divergence in optimization priorities and constraints creates an inevitable training-inference mismatch.

The relentless push for faster rollouts is making this gap wider, not smaller. While one might propose enforcing identical calculations (e.g., using "batch-invariant kernels"), such solutions come with a severe performance penalty, defeating the purpose of using a high-speed inference engine in the first place. This speed-vs-consistency trade-off is at the heart of the problem, making it a persistent challenge rather than a simple engineering fix.
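A simple way to observe the mismatch is to log the gap between the token log-probabilities recorded by the inference engine at rollout time and those recomputed by the training framework on the same trajectories. The sketch below is illustrative PyTorch; the tensor names and shapes are assumptions, not our production diagnostics.

```python
import torch


def mismatch_stats(sampler_logprobs: torch.Tensor,
                   trainer_logprobs: torch.Tensor,
                   mask: torch.Tensor) -> dict:
    """Summarize the per-token gap between rollout-time log-probs (from the
    inference engine) and the trainer's recomputed log-probs.

    All tensors are [batch, seq_len]; `mask` is 1 for valid response tokens.
    """
    mask = mask.float()
    diff = (trainer_logprobs - sampler_logprobs) * mask
    n_tokens = mask.sum().clamp(min=1)
    # Per-token importance ratio pi_train / pi_rollout; values far from 1.0
    # flag a large training-inference mismatch.
    ratio = torch.exp(diff)[mask.bool()]
    return {
        "mean_abs_logprob_diff": (diff.abs().sum() / n_tokens).item(),
        "max_ratio": ratio.max().item(),
        "min_ratio": ratio.min().item(),
    }
```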
In our stack, this mismatch manifested between our vLLM inference sampler and our FSDP trainer. The policy gradient we actually computed was:
$$ \mathbb{E} _{x\sim \mathcal{D}}\mathbb{E} _{y\sim \textcolor{red}{\pi _{\theta}^{\mathrm{vllm}}}\left( \cdot |x \right)}\left[ R\left( x,y \right) \nabla _{\theta}\log \textcolor{blue}{\pi _{\theta}^{\mathrm{fsdp}}}\left( y|x \right) \right], $$
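In code, this objective looks like an ordinary REINFORCE/GRPO-style loss; the subtlety is entirely in where the samples come from. A minimal sketch of the Monte-Carlo estimate of the expectation above, with hypothetical tensor names rather than our actual trainer:

```python
import torch


def naive_pg_loss(trainer_logprobs: torch.Tensor,
                  rewards: torch.Tensor,
                  mask: torch.Tensor) -> torch.Tensor:
    """Monte-Carlo estimate of the policy gradient objective above.

    Trajectories y are sampled from pi_theta^vllm, but the gradient flows only
    through `trainer_logprobs`, i.e. log pi_theta^fsdp evaluated on those same
    tokens ([batch, seq_len]). `rewards` is [batch]; `mask` marks response tokens.
    """
    mask = mask.float()
    token_loss = -rewards.unsqueeze(-1) * trainer_logprobs * mask
    # This estimator is only unbiased if the sampling policy and the policy
    # being differentiated are truly identical -- which the mismatch breaks.
    return token_loss.sum() / mask.sum().clamp(min=1)
```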