Authors: **Yingru Li, Jiacai Liu**
First published on Oct 30, 2025.
<aside> 📜
Original Blog: When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch
</aside>
<aside> 📌
In Reinforcement Learning for LLMs, the policy we use to generate data (let's call it $\mu$) is often different from the policy we want to optimize (let's call it $\pi$). This "training-inference mismatch" is not a small bug; it's a fundamental mathematical problem.
We use the Stochastic Gradient Ascent Lemma to prove that this mismatch creates two distinct failure modes: bias and variance in the gradient estimate.

In Part 1, we use Total Variation (TV) distance to measure the bias and the $\chi^2$-divergence to measure the variance. We show that these two metrics are not interchangeable, and that confusing them leads to fatally flawed solutions.

In Part 2, we extend this analysis with a direct computation of the bias and variance of different estimators.

</aside>
In LLM-RL, our goal is to optimize a target policy, $\pi = \pi_\theta(y|x)$ (our LLM, parameterized by $\theta$), to maximize an expected reward, $J(\theta)$:
$$
J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\left[ R(y \mid x) \right]
$$
Here, $R(y|x)$ is the reward for a generated sequence $y$ given a prompt $x$.
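To make this concrete, here is a minimal sketch on a toy problem: a softmax "policy" over three candidate responses to a single prompt, with hand-picked rewards (all numbers here are hypothetical, chosen only for illustration). It estimates $J(\theta)$ by Monte Carlo sampling from $\pi_\theta$.

```python
# Minimal toy sketch (hypothetical setup): a softmax policy over 3 candidate
# responses to a single prompt, and a Monte Carlo estimate of
# J(theta) = E_{y ~ pi_theta}[R(y|x)].
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.5, -0.2, 0.1])    # policy parameters (logits), one per response
rewards = np.array([1.0, 0.0, 0.3])   # R(y|x) for each candidate response y

def pi(theta):
    """Softmax policy pi_theta(y|x) over the candidate responses."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def estimate_J(theta, n_samples=10_000):
    """Monte Carlo estimate of J(theta) using samples y ~ pi_theta."""
    ys = rng.choice(len(theta), size=n_samples, p=pi(theta))
    return rewards[ys].mean()

print("exact J(theta):      ", pi(theta) @ rewards)   # enumerable here, unlike in LLM-RL
print("Monte Carlo estimate:", estimate_J(theta))
```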
We do this using stochastic gradient ascent. We need to calculate the true gradient, $g = \nabla J(\theta)$, and update our parameters:
$$
\theta_{k+1} = \theta_k + \eta \, g = \theta_k + \eta \, \nabla J(\theta_k).
$$
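Continuing the toy sketch above (reusing `pi`, `rewards`, and `rng`), the snippet below forms the standard score-function (REINFORCE) estimator of $g = \nabla J(\theta)$ from on-policy samples and takes one ascent step; the learning rate $\eta$ is an arbitrary illustrative value.

```python
# Continuing the toy sketch above (reuses pi, rewards, rng).
# Score-function (REINFORCE) estimator of g = grad J(theta), followed by one
# gradient-ascent step theta_{k+1} = theta_k + eta * g_hat.
def grad_log_pi(theta, y):
    """grad_theta log pi_theta(y|x) for a softmax policy: e_y - pi(theta)."""
    return np.eye(len(theta))[y] - pi(theta)

def estimate_grad(theta, sampling_probs, n_samples=10_000):
    """g_hat = mean of R(y|x) * grad_theta log pi_theta(y|x) over sampled y."""
    ys = rng.choice(len(theta), size=n_samples, p=sampling_probs)
    return np.mean([rewards[y] * grad_log_pi(theta, y) for y in ys], axis=0)

eta = 0.1                                 # learning rate (illustrative value)
g_hat = estimate_grad(theta, pi(theta))   # on-policy: samples come from pi_theta
theta_next = theta + eta * g_hat
```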
But here's the catch: in large-scale systems, we often cannot sample directly from the $\pi_\theta$ we are optimizing. Instead, we sample from a slightly different, mismatched policy, $\mu(y|x)$. This mismatch ($\mu \neq \pi$) happens for many reasons: it is a persistent, real-world gap between high-speed inference engines (like vLLM) and training frameworks (like FSDP), driven by differences in quantization, numerical precision, and hardware-specific kernels.
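To see why this matters, the sketch below (still reusing the toy setup above) draws samples from a slightly perturbed behavior policy $\mu$, standing in for the numerical drift between the inference engine and the trainer, but plugs them into the same estimator as if they came from $\pi_\theta$. Compared with the exact gradient, the on-policy estimate is unbiased while the mismatched one is systematically off; the size of the perturbation is arbitrary.

```python
# Continuing the toy sketch (reuses pi, rewards, grad_log_pi, estimate_grad, rng).
# Samples now come from a perturbed behavior policy mu != pi_theta, standing in
# for the numerical gap between the inference engine and the trainer, but the
# estimator still treats them as if they were drawn from pi_theta.
def exact_grad(theta):
    """Exact g = sum_y pi_theta(y|x) * R(y|x) * grad_theta log pi_theta(y|x)."""
    p = pi(theta)
    return sum(p[y] * rewards[y] * grad_log_pi(theta, y) for y in range(len(p)))

mu = pi(theta + np.array([0.05, -0.05, 0.0]))   # small, systematic numerical drift

print("exact gradient g          :", exact_grad(theta))
print("estimate, samples from pi :", estimate_grad(theta, pi(theta), n_samples=100_000))
# The mu-based estimate converges to a biased value, not to g.
print("estimate, samples from mu :", estimate_grad(theta, mu, n_samples=100_000))
```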
<aside> 📌
Because we sample from $\mu$, we can't get the true gradient $g$. We can only compute an estimator, $\hat{g}$. How do we know if $\hat{g}$ is any good?
</aside>
To understand the damage, we need a formal tool. The Stochastic Gradient Ascent Lemma gives us a precise lower bound on the progress we make in a single optimization step (assuming an $L$-smooth objective).
For one update $\theta_{k+1} = \theta_k + \eta\,\hat{g}$, the progress we make, $\mathbb{E}[J(\theta_{k+1})] - J(\theta_k)$, is bounded from below by:
$$
\mathbb{E}\left[J(\theta_{k+1})\right] - J(\theta_k) \;\ge\; \underbrace{\eta\,\|g\|^2}_{\text{ideal progress}} \;+\; \underbrace{\eta\,\big\langle g,\ \mathbb{E}[\hat{g}] - g \big\rangle}_{\text{bias term}} \;-\; \underbrace{\frac{L\eta^2}{2}\,\mathbb{E}\left[\|\hat{g}\|^2\right]}_{\text{variance (second-moment) penalty}}
$$
Where: