
Authors: Jiacai Liu, Yingru Li*†, Yuqian Fu, Jiawei Wang, Qian Liu, Yu Shen*†
Work done at ByteDance. First published on Sep 17, 2025.
*Co-First Authors. †Corresponding Authors.

Figure 1. Rewards (Left) and gradient norms (Right) from our four failed GRPO TIR experiments on Qwen3-14B-Base. All experiments sample 1024 trajectories (64 prompts × 16 responses) at each training step and use a learning rate of 1e-6. The ppo_mini_batch_size is set to 1024 and 256 for the on-policy and off-policy experiments, respectively.
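To make the caption's on-/off-policy distinction concrete, here is a minimal sketch of the batch bookkeeping (the variable names are illustrative, not framework config keys): with 1024 fresh trajectories per step, a mini-batch size of 1024 yields a single, strictly on-policy update, whereas 256 yields four sequential updates, all but the first of which use a policy that has already moved.

```python
# Batch bookkeeping implied by the setup above (names are illustrative).
num_prompts = 64           # prompts sampled per training step
responses_per_prompt = 16  # GRPO group size
rollout_batch = num_prompts * responses_per_prompt  # 1024 trajectories per step

for ppo_mini_batch_size in (1024, 256):
    num_updates = rollout_batch // ppo_mini_batch_size
    # 1 update  -> strictly on-policy: every gradient step uses freshly sampled data
    # 4 updates -> off-policy: after the first mini-batch, the policy has already moved
    regime = "on-policy" if num_updates == 1 else "off-policy"
    print(f"mini_batch={ppo_mini_batch_size}: {num_updates} update(s)/step -> {regime}")
```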
<aside> 📌
Latest Update
[Dec 2] DeepSeek-V3.2 (https://arxiv.org/abs/2512.02556) adopts our geometric sequence masking to deal with the training-inference mismatch and general off-policy training instability!
[Dec 1] Our work is discussed in [Stabilizing Reinforcement Learning with LLMs: Formulation and Practices] from the Qwen team.
[Nov 12] Rollout Correction for General Off-Policy Problems was merged into VeRL: [Usage Documents][More Details] (Yingru Li)
[Nov 10] Very excited to see that our work was discussed in a vLLM blog, where they implemented “Bitwise Consistent On-Policy Reinforcement Learning”.
[Oct 31] VeRL’s fully async module has integrated our work in [PR].
[Oct 20] Slime has integrated our work in [PR]. (SGLang RL: Chenyang Zhao, Jiajun Li)
[Oct 16] Very excited to see that our work is cited by SWE-grep at Cognition, which also uses sequence-level masked importance sampling (MIS) to address the training-inference mismatch.
[Oct 15] A Megatron training crash issue (https://github.com/volcengine/verl/issues/3597) was resolved via the geometric-level masking.
[Oct 13] VeRL has integrated our work in [VeRL 0.6.0][PR], including the newly introduced sequence masking/rejection with geometric mean of importance weight (Geo-MIS/Geo-RS). (Yingru Li)
</aside>
<aside> 💡
The relentless push for faster inference has created a dangerous “training-inference mismatch” that can silently kill reinforcement learning with LLMs. Our investigation reveals a vicious cycle that is particularly acute in modern reasoning and agentic RL.
</aside>
<aside> 📖
For a rigorous theoretical breakdown of this problem, we've published a three-part blog series with more insights:
@online{liu-li-2025-rl-collapse,
  title  = {When Speed Kills Stability: Demystifying {RL} Collapse from the Training-Inference Mismatch},
  author = {Liu, Jiacai and Li, Yingru and Fu, Yuqian and Wang, Jiawei and Liu, Qian and Shen, Yu},
  year   = {2025},
  month  = sep,
  url    = {https://richardli.xyz/rl-collapse}
}
</aside>
In the rapidly advancing field of reinforcement learning for large language models (LLM-RL), a frustrating pattern of sudden training collapse is emerging. Whether in complex reasoning RL or multi-turn agentic RL, many have observed training runs that, after a period of stable learning, catastrophically fail.
We recently encountered this firsthand while conducting agentic RL experiments for multi-turn tool-integrated reasoning (TIR) on Qwen3 models. This occurred across both on-policy and off-policy variants of the GRPO algorithm on our L20 GPU cluster. Figure 1 shows the reward and gradient norm dynamics of our four crashed experiments on Qwen3-14B-Base. As training progresses, the gradient norms suddenly explode, leading to model collapse. Our initial investigation focused on common culprits:
We examined the code and confirmed that our agent loop follows a token-in-token-out process.
We tuned the hyperparameters beta1 and beta2 in the Adam optimizer.
We also applied batch-level normalization to the advantages to balance the updates (this and the Adam retuning are sketched after this list).
...
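For reference, a minimal sketch of those last two attempts; the tensor shapes, beta values, and the toy model are illustrative assumptions, not our actual training configuration:

```python
import torch

# Batch-level normalization of the advantages (illustrative shapes/values).
advantages = torch.randn(1024)  # one advantage per sampled trajectory
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

# Retuning the Adam moment coefficients beta1/beta2 (illustrative values).
model = torch.nn.Linear(8, 8)   # stand-in for the policy
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6, betas=(0.9, 0.95))
```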
However, none of these standard fixes worked. Since even the simpler on-policy experiments failed, we suspected the issue was not with the RL algorithm but with a more fundamental part of the training stack. This led us to investigate a critical and increasingly prevalent challenge in modern LLM-RL: the unavoidable gap between highly optimized inference engines and faithful training frameworks.
Rollout speed is a core bottleneck in LLM-RL. To achieve the massive throughput required, modern inference engines (e.g., vLLM, SGLang, TensorRT-LLM) employ aggressive optimization strategies such as speculative decoding, low-precision computation (INT8/FP8), and specialized, batch-variant CUDA kernels. Although these engines aim to preserve sampling fidelity, their primary objective is to maximize throughput, often measured in tokens per second. Conversely, training frameworks (e.g., FSDP, DeepSpeed, Megatron-LM) must strike a different balance, prioritizing numerical stability and precision for gradient computation, often using higher-precision formats such as FP32 for master weights and optimizer states. This divergence in optimization priorities and constraints creates an inevitable training-inference mismatch, and the relentless push for faster rollouts is making the gap wider, not smaller. One might propose enforcing identical calculations (e.g., using "batch-invariant kernels"), but such solutions come with a severe performance penalty, defeating the purpose of using a high-speed inference engine in the first place. This speed-vs-consistency trade-off is at the heart of the problem, making it a persistent challenge rather than a simple engineering fix.
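A concrete way to see this mismatch is to compare, for the very tokens that were sampled, the log-probabilities reported by the rollout engine against those recomputed by the trainer. Below is a minimal sketch of such a diagnostic; the tensors are random stand-ins for real per-token log-probs, and the geometric-mean aggregation mirrors the sequence-level importance weight used by the Geo-MIS/Geo-RS masking mentioned in the updates above:

```python
import torch

# Stand-ins for per-token log-probabilities of the *sampled* tokens under each backend.
# In practice, rollout_logprobs would come from the inference engine's returned
# logprobs (e.g., vLLM) and train_logprobs from recomputing the same sequence with
# the training framework (e.g., an FSDP forward pass).
torch.manual_seed(0)
rollout_logprobs = -2.0 + 0.1 * torch.randn(512)             # log pi_rollout of sampled tokens
train_logprobs = rollout_logprobs + 1e-3 * torch.randn(512)  # small numerical mismatch

# Per-token importance ratio pi_train / pi_rollout.
token_ratio = torch.exp(train_logprobs - rollout_logprobs)

# Sequence-level weight as the geometric mean of per-token ratios,
# i.e., exp of the mean log-ratio over the sequence.
geo_mean_ratio = torch.exp((train_logprobs - rollout_logprobs).mean())

print(f"max per-token ratio:           {token_ratio.max().item():.6f}")
print(f"geometric-mean sequence ratio: {geo_mean_ratio.item():.6f}")
```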
In our stack, this mismatch manifested between our vLLM inference sampler and our FSDP trainer. The actual parameter update was: