
Authors: Jiacai Liu, Yingru Li*†, Yuqian Fu, Jiawei Wang, Qian Liu, Yu Shen*†
Work done at ByteDance. First published on Sep 17, 2025.
*Co-First Authors. †Corresponding Authors.

Figure 1. Rewards (left) and gradient norms (right) from our four failed GRPO TIR experiments on Qwen3-14B-Base. All experiments sample 1024 trajectories (64 prompts × 16 responses) at each training step and use a learning rate of 1e-6. The ppo_mini_batch_size is set to 1024 and 256 for the on-policy and off-policy experiments, respectively.
<aside> 📌
Latest Update
[News] Rollout Correction for General Off-Policy Problems was merged into VeRL: [Usage Documents][More Details] (Yingru Li)
[News] Very excited to see that our work is cited by SWE-grep at Cognition, which also uses sequence-level masked importance sampling (MIS) to address the training-inference mismatch.
[News] For a rigorous theoretical breakdown of this problem, we've published a new 2-part blog series. [Part 1: The Fatal Trade-off] establishes the analytical framework for the stochastic policy gradient (Bias vs. Variance), and [Part 2: The Estimator Trials] proves why the sequence-level gradient estimator has less gradient bias.
[News] VeRL fully async module has integrated our work in [PR].
[News] Slime has integrated our work in [PR]. (SGLang RL: Chenyang Zhao, Jiajun Li)
[News] A Megatron training crash issue (https://github.com/volcengine/verl/issues/3597) was resolved via geometric-level masking.
[News] VeRL has integrated our work in [VeRL 0.6.0][PR][Documents], including the newly introduced geometric importance ratio and dual masking; a minimal sketch of these ideas follows this update list. (Yingru Li)
</aside>
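To make these updates concrete: the sequence-level masked importance sampling (MIS), geometric importance ratio, and masking mentioned above all revolve around one quantity, the ratio between the training policy's and the rollout engine's probabilities of the sampled tokens. Below is a minimal illustrative sketch of that idea, assuming PyTorch tensors of per-token log-probs; the function name, tensor shapes, and trust-band thresholds are our own placeholders, not VeRL's implementation (see the linked documents for the actual options).

```python
import torch

def sequence_is_weights(trainer_logp: torch.Tensor,
                        rollout_logp: torch.Tensor,
                        response_mask: torch.Tensor,
                        trust_low: float = 0.5,
                        trust_high: float = 2.0):
    """Sketch of a sequence-level importance ratio with a simple mask.

    trainer_logp / rollout_logp: [batch, seq_len] log-probs of the sampled
    tokens under the training policy and the rollout (inference) policy.
    response_mask: [batch, seq_len], 1 on generated response tokens, 0 elsewhere.
    """
    response_mask = response_mask.float()

    # Log importance ratio summed over each response (sequence level).
    log_ratio = ((trainer_logp - rollout_logp) * response_mask).sum(dim=-1)

    # Geometric-mean variant: average the per-token log-ratio over the response
    # length so long sequences do not yield extreme ratios.
    lengths = response_mask.sum(dim=-1).clamp(min=1.0)
    geo_ratio = (log_ratio / lengths).exp()

    # Drop sequences whose ratio leaves a trust band (placeholder thresholds).
    keep = ((geo_ratio >= trust_low) & (geo_ratio <= trust_high)).float()

    # Treated as constant coefficients here; whether gradients flow through the
    # ratio depends on the estimator being used.
    return geo_ratio.detach(), keep
```

In a sketch like this, the returned weights would scale each sequence's policy-gradient loss, and masked sequences would contribute zero gradient; the exact estimators and thresholds VeRL ships are described in the documents linked above.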
<aside> 💡
The relentless push for faster inference has created a dangerous "training-inference mismatch" that can silently kill reinforcement learning with LLMs. Our investigation reveals a vicious cycle that is particularly acute in modern reasoning and agentic RL.
To cite this post:
@misc{liu-li-2025,
title = {When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch},
url = {https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Training-Inference-Mismatch-271211a558b7808d8b12d403fd15edda},
author = {Jiacai Liu and Yingru Li and Yuqian Fu and Jiawei Wang and Qian Liu and Yu Shen},
year = {2025},
month = sep,
}
In the rapidly advancing field of reinforcement learning for large language models (LLM-RL), a frustrating pattern of sudden training collapse is emerging. Whether in complex reasoning RL or multi-turn agentic RL, many have observed training runs that, after a period of stable learning, catastrophically fail.
We recently encountered this firsthand while running agentic RL experiments for multi-turn tool-integrated reasoning (TIR) on Qwen3 models, across both on-policy and off-policy variants of the GRPO algorithm on our L20 GPU cluster. Figure 1 shows the reward and gradient-norm dynamics of our four crashed experiments on Qwen3-14B-Base. As training progresses, the gradient norms suddenly explode, leading to model collapse. Our initial investigation focused on common culprits:
- We examined the code and confirmed that our agent loop follows a token-in-token-out process.
- We tuned the beta1 and beta2 hyperparameters of the Adam optimizer.
- We normalized the advantages at the batch level to balance the updates (a minimal sketch follows this list).
- ...
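For concreteness, the advantage normalization above refers to standardizing the scalar advantages across the sampled batch. The snippet below is an illustrative sketch under that assumption (names and the epsilon value are ours), not our actual training code.

```python
import torch

def normalize_advantages(advantages: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Batch-level advantage normalization: zero mean, unit std across the batch.

    `advantages` holds one scalar per sampled trajectory, e.g. 1024 values for
    64 prompts x 16 responses at each training step.
    """
    return (advantages - advantages.mean()) / (advantages.std() + eps)
```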
However, none of these standard fixes worked. Since even the simpler on-policy experiments failed, we suspected the issue was not with the RL algorithm but with a more fundamental part of the training stack. This led us to investigate a critical and increasingly prevalent challenge in modern LLM-RL: the unavoidable gap between highly-optimized inference engines and faithful training frameworks.
Rollout speed is a core bottleneck in LLM-RL. To achieve the massive throughput required, modern inference engines (e.g., vLLM, SGLang, TensorRT-LLM) employ aggressive optimization strategies like speculative decoding, low-precision computation (INT8/FP8), and specialized, batch-variant CUDA kernels. While these engines aim to preserve sampling fidelity, their primary objective is to maximize throughput, often measured in tokens per second. Conversely, training frameworks (e.g., FSDP, DeepSpeed, Megatron-LM) must strike a different balance, prioritizing numerical stability and precision for gradient computation, often using higher-precision formats like FP32 for master weights and optimizer states.

This divergence in optimization priorities and constraints creates an inevitable training-inference mismatch, and the relentless push for faster rollouts is making the gap wider, not smaller. While one might propose enforcing identical calculations (e.g., using "batch-invariant kernels"), such solutions come with a severe performance penalty, defeating the purpose of using a high-speed inference engine in the first place. This speed-vs-consistency trade-off is at the heart of the problem, making it a persistent challenge rather than a simple engineering fix.
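One way to see the mismatch directly is to compare, for the exact tokens that were sampled, the log-probabilities recorded by the inference engine at rollout time with the log-probabilities recomputed by the trainer's forward pass. The diagnostic below is only an illustrative sketch of that comparison; the function and tensor names are our assumptions, not part of vLLM, FSDP, or VeRL.

```python
import torch

@torch.no_grad()
def logprob_mismatch_stats(rollout_logp: torch.Tensor,
                           trainer_logp: torch.Tensor,
                           response_mask: torch.Tensor) -> dict:
    """Compare log-probs of the same sampled tokens under the inference engine
    (recorded at rollout time) and under the trainer's own forward pass.

    All tensors are [batch, seq_len]; response_mask is 1 on generated tokens.
    """
    resp = response_mask.bool()
    gap = (trainer_logp - rollout_logp)[resp]  # per-token log-prob gap
    ratio = gap.exp()                          # per-token ratio pi_train / pi_rollout
    return {
        "mean_logp_gap": gap.mean().item(),
        "max_abs_logp_gap": gap.abs().max().item(),
        "ratio_min": ratio.min().item(),  # values far from 1.0 signal mismatch
        "ratio_max": ratio.max().item(),
    }
```

Per-token ratios far from 1.0 are exactly the discrepancy that the importance-sampling corrections mentioned in the updates above are designed to handle.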
In our stack, this mismatch manifested between our vLLM inference sampler and our FSDP trainer. The actual parameter update was: