Authors: Jiacai Liu, Yingru Li*†, Yuqian Fu, Jiawei Wang, Qian Liu, Yu Shen*†
Work done at ByteDance. First published on Sep 17, 2025.
*Co-First Authors. †Corresponding Authors.
Figure 1. Rewards (left) and gradient norms (right) from our four failed GRPO TIR experiments on Qwen3-14B-Base. All experiments sample 1024 trajectories (64 prompts × 16 responses) at each training step and use a learning rate of 1e-6. The ppo_mini_batch_size is set to 1024 and 256 for the on-policy and off-policy experiments, respectively.
The relentless push for faster inference has created a dangerous "training-inference mismatch" that can silently kill reinforcement learning with LLMs. Our investigation reveals a vicious cycle that is particularly acute in modern reasoning and agentic RL.

If you find this post useful, please cite it as:
@misc{liu-li-2025,
  title  = {When Speed Kills Stability: Demystifying RL Collapse from the Inference-Training Mismatch},
  url    = {https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Inference-Training-Mismatch-271211a558b7808d8b12d403fd15edda},
  author = {Jiacai Liu and Yingru Li and Yuqian Fu and Jiawei Wang and Qian Liu and Yu Shen},
  year   = {2025},
  month  = sep,
}
In the rapidly advancing field of reinforcement learning for large language models (LLM-RL), a frustrating pattern of sudden training collapse is emerging. Whether in complex reasoning RL or multi-turn agentic RL, many have observed training runs that, after a period of stable learning, catastrophically fail.
We recently encountered this firsthand while conducting agentic RL experiments for multi-turn tool-integrated reasoning (TIR) on Qwen3 models. This occurred across both on-policy and off-policy variants of the GRPO algorithm on our L20 GPU cluster. Figure 1 shows the reward and gradient norm dynamics of our four crashed experiments on Qwen3-14B-Base. As training progresses, the gradient norms suddenly explode, leading to model collapse. Our initial investigation focused on common culprits:
- We examined the code and confirmed that our agent loop follows a token-in-token-out process.
- We tuned the hyperparameters beta1 and beta2 of the Adam optimizer.
- We applied batch normalization to the advantages to balance the updates (a minimal sketch follows this list).
- ...
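For concreteness, here is a minimal sketch of the advantage batch normalization we tried, i.e., the standard whitening trick used in PPO-style trainers. The function name and shapes are illustrative, not our actual training code.

```python
import torch


def normalize_advantages(advantages: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Whiten advantages across the batch to zero mean and unit variance."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)


# Example: 1024 trajectories (64 prompts x 16 responses) per training step.
advantages = torch.randn(1024)
normalized = normalize_advantages(advantages)
```

This keeps the update magnitude balanced across a batch, but, as noted above, it did not prevent the collapse.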
However, none of these standard fixes worked. Since even the simpler on-policy experiments failed, we suspected the issue was not with the RL algorithm but with a more fundamental part of the training stack. This led us to investigate a critical and increasingly prevalent challenge in modern LLM-RL: the unavoidable gap between highly-optimized inference engines and faithful training frameworks.
Rollout speed is a core bottleneck in LLM-RL. To achieve the massive throughput required, modern inference engines (e.g., vLLM, SGLang, TensorRT-LLM) employ aggressive optimization strategies such as speculative decoding, low-precision computation (INT8/FP8), and specialized, batch-variant CUDA kernels. While these engines try to preserve sampling fidelity, their primary objective is to maximize throughput, often measured in tokens per second. Conversely, training frameworks (e.g., FSDP, DeepSpeed, Megatron-LM) must strike a different balance, prioritizing numerical stability and precision for gradient computation, often using higher-precision formats like FP32 for master weights and optimizer states. This divergence in optimization priorities and constraints creates an inevitable training-inference mismatch.

The relentless push for faster rollouts is making this gap wider, not smaller. While one might propose enforcing identical calculations (e.g., using "batch-invariant kernels"), such solutions come with a severe performance penalty, defeating the purpose of using a high-speed inference engine in the first place. This speed-vs-consistency trade-off is at the heart of the problem, making it a persistent challenge rather than a simple engineering fix.
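A simple way to observe the mismatch is to log the gap between the token log-probabilities recorded by the inference engine at rollout time and those recomputed by the training framework on the same trajectories. The sketch below is illustrative PyTorch; the tensor names and shapes are assumptions, not our production diagnostics.

```python
import torch


def mismatch_stats(sampler_logprobs: torch.Tensor,
                   trainer_logprobs: torch.Tensor,
                   mask: torch.Tensor) -> dict:
    """Summarize the per-token gap between rollout-time log-probs (from the
    inference engine) and the trainer's recomputed log-probs.

    All tensors are [batch, seq_len]; `mask` is 1 for valid response tokens.
    """
    mask = mask.float()
    diff = (trainer_logprobs - sampler_logprobs) * mask
    n_tokens = mask.sum().clamp(min=1)
    # Per-token importance ratio pi_train / pi_rollout; values far from 1.0
    # flag a large training-inference mismatch.
    ratio = torch.exp(diff)[mask.bool()]
    return {
        "mean_abs_logprob_diff": (diff.abs().sum() / n_tokens).item(),
        "max_ratio": ratio.max().item(),
        "min_ratio": ratio.min().item(),
    }
```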
In our stack, this mismatch manifested between our vLLM inference sampler and our FSDP trainer. The policy gradient we actually computed was:
$$ \mathbb{E} _{x\sim \mathcal{D}}\mathbb{E} _{y\sim \textcolor{red}{\pi _{\theta}^{\mathrm{vllm}}}\left( \cdot |x \right)}\left[ R\left( x,y \right) \nabla _{\theta}\log \textcolor{blue}{\pi _{\theta}^{\mathrm{fsdp}}}\left( y|x \right) \right], $$
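In code, this objective looks like an ordinary REINFORCE/GRPO-style loss; the subtlety is entirely in where the samples come from. A minimal sketch of the Monte-Carlo estimate of the expectation above, with hypothetical tensor names rather than our actual trainer:

```python
import torch


def naive_pg_loss(trainer_logprobs: torch.Tensor,
                  rewards: torch.Tensor,
                  mask: torch.Tensor) -> torch.Tensor:
    """Monte-Carlo estimate of the policy gradient objective above.

    Trajectories y are sampled from pi_theta^vllm, but the gradient flows only
    through `trainer_logprobs`, i.e. log pi_theta^fsdp evaluated on those same
    tokens ([batch, seq_len]). `rewards` is [batch]; `mask` marks response tokens.
    """
    mask = mask.float()
    token_loss = -rewards.unsqueeze(-1) * trainer_logprobs * mask
    # This estimator is only unbiased if the sampling policy and the policy
    # being differentiated are truly identical -- which the mismatch breaks.
    return token_loss.sum() / mask.sum().clamp(min=1)
```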