Authors: **Yingru Li, Jiacai Liu**

First published on Oct 30, 2025.

<aside> 📜

Original Blog: When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch

</aside>

<aside> 📌

TL;DR

In Reinforcement Learning for LLMs, the policy we use to generate data (let's call it $\mu$) is often different from the policy we want to optimize (let's call it $\pi$). This "training-inference mismatch" is not a small bug; it's a fundamental mathematical problem.

We use the Stochastic Gradient Ascent Lemma to prove that this mismatch creates two distinct failure modes:

  1. Fatal Bias: Your optimizer is actively pushed toward the wrong solution.
  2. Fatal Variance: Your optimizer is forced to a complete halt, and training stalls.

In Part 1, we use Total Variation (TV) distance to measure the bias and $\chi^2$-divergence to measure the variance. We show that these two metrics are not interchangeable, and why confusing them leads to fatally flawed solutions.

In Part 2, by directly computing the bias and variance of different estimators, we further show that:

  1. Token-level importance sampling (IS), used in classic methods such as PPO for variance reduction, has an $O(T^2 \Delta_{\max})$ bias.
    1. This bias is tolerable when the off-policiness is induced solely by policy parameter updates and can be controlled algorithmically (e.g., by the clip mechanism).
    2. However, when the mismatch is significant and has diverse sources (such as expert shift in MoE RL, or operator discrepancies between the training and inference engines), the bias becomes intolerable and training is more prone to collapse.
  2. Plain sequence-level importance sampling has zero bias but a fatal $O((1 + \bar{\chi}^2_{\max})^T)$ variance. By truncating the IS ratio, sequence-level TIS achieves a better bias-variance balance and is better suited to modern distributed LLM+RL frameworks (see the sketch after this summary).
  3. More importantly, our analysis also applies to general off-policy RL problems. </aside>
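To make these three estimators concrete, here is a minimal PyTorch sketch of the weight each one attaches to a sampled sequence of $T$ tokens. The function and argument names, and the truncation threshold `c`, are illustrative assumptions rather than the authors' implementation:

```python
import torch

def is_weights(logp_pi: torch.Tensor, logp_mu: torch.Tensor, c: float = 2.0):
    """Importance weights for one sequence of T tokens sampled from mu.

    logp_pi: log pi_theta(y_t | x, y_<t) under the training policy, shape (T,)
    logp_mu: log mu(y_t | x, y_<t) under the sampling policy, shape (T,)
    """
    log_ratio = logp_pi - logp_mu           # per-token log importance ratios

    token_level = log_ratio.exp()            # PPO-style: one ratio per token (biased, low variance)
    seq_level = log_ratio.sum().exp()        # exact sequence ratio (unbiased, variance compounds over T)
    seq_level_tis = seq_level.clamp(max=c)   # truncated sequence ratio (small bias, bounded variance)

    return token_level, seq_level, seq_level_tis
```

The clamp in the last line bounds the sequence weight, trading a small controlled bias for a variance that no longer compounds exponentially in $T$.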

The Core Problem: A Mismatch in Our Goals

In LLM-RL, our goal is to optimize a target policy, $\pi = \pi_\theta(y|x)$ (our LLM, parameterized by $\theta$), to maximize an expected reward, $J(\theta)$:

$$
J(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\,\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[R(y \mid x)\big]
$$

Here, $R(y|x)$ is the reward for a generated sequence $y$ given a prompt $x$, and $\mathcal{D}$ is the prompt distribution.

We do this using stochastic gradient ascent. We need to calculate the true gradient, $g = \nabla J(\theta)$, and update our parameters:

$$ \theta_{k+1} = \theta_k + \eta\, g_k, \quad \text{where } g_k = \nabla J(\theta_k). $$
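The true gradient itself takes the standard score-function (REINFORCE) form, which assumes the samples are drawn from $\pi_\theta$ itself:

$$
g = \nabla_\theta J(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\,\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[R(y \mid x)\,\nabla_\theta \log \pi_\theta(y \mid x)\big].
$$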

But here's the catch: in large-scale systems, we often cannot sample directly from the $\pi_\theta$ we are optimizing. Instead, we sample from a slightly different, mismatched policy, $\mu(y|x)$. This mismatch ($\mu \neq \pi$) happens for many reasons; in practice, it is a persistent, real-world gap between high-speed inference engines (like vLLM) and the training framework (like FSDP), driven by differences in quantization, numerical precision, and hardware-specific kernels.
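In practice, this gap can be monitored directly from the per-token log-probabilities that both engines already compute for the sampled tokens. The sketch below is a hypothetical diagnostic (the helper name, tensor layout, and returned statistics are assumptions, not part of vLLM, FSDP, or the original post); it also yields a Monte Carlo estimate of the $\chi^2$-divergence that Part 1 uses to measure variance:

```python
import torch

def mismatch_stats(sampler_logps: torch.Tensor, trainer_logps: torch.Tensor) -> dict:
    """Quantify the mu-vs-pi gap on a batch of sequences sampled from mu.

    sampler_logps: per-token log mu(y_t | x, y_<t) from the inference engine, shape (B, T)
    trainer_logps: per-token log pi_theta(y_t | x, y_<t) recomputed by the trainer, shape (B, T)
    """
    per_token_gap = trainer_logps - sampler_logps      # log pi - log mu for each sampled token
    seq_ratio = per_token_gap.sum(dim=-1).exp()        # sequence-level IS ratio pi(y|x) / mu(y|x)
    return {
        "mean_abs_token_gap": per_token_gap.abs().mean().item(),
        "seq_ratio_mean": seq_ratio.mean().item(),              # ~1 when mu is close to pi
        "seq_ratio_max": seq_ratio.max().item(),                # heavy tails foreshadow variance blow-up
        "chi2_estimate": (seq_ratio.pow(2).mean() - 1).item(),  # estimates chi^2(pi || mu) from mu-samples
    }
```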

<aside> 📌

Because we sample from $\mu$, we can't get the true gradient $g$. We can only compute an estimator, $\hat{g}$. How do we know if $\hat{g}$ is any good?

</aside>

Our Analytical Tool: The Stochastic Gradient Ascent Lemma

To understand the damage, we need a formal tool. The Stochastic Gradient Ascent Lemma gives us a precise formula for the progress we make in a single optimization step (assuming an $L$-smooth objective).
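Here, $L$-smoothness means the gradient of $J$ is $L$-Lipschitz:

$$
\|\nabla J(\theta) - \nabla J(\theta')\| \;\le\; L\,\|\theta - \theta'\| \qquad \text{for all } \theta, \theta'.
$$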

The progress we make, $\mathbb{E}[J(\theta_{k+1})] - J(\theta_k)$, is bounded by:

$$ \mathbb{E}\big[J(\theta_{k+1})\big] - J(\theta_k) \;\ge\; \eta\,\big\langle g,\ \mathbb{E}[\hat{g}]\big\rangle \;-\; \frac{L\eta^2}{2}\,\mathbb{E}\big[\|\hat{g}\|^2\big] $$

Where: