Authors: **Yingru Li, Jiacai Liu**
First published on Oct 31, 2025.
<aside> 📜
Original Blog: When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch
</aside>
@online{liu-li-2025-rl-collapse,
title = {When Speed Kills Stability: Demystifying {RL} Collapse from the Training-Inference Mismatch},
author = {Liu, Jiacai and Li, Yingru and Fu, Yuqian and Wang, Jiawei and Liu, Qian and Shen, Yu},
year = {2025},
month = sep,
url = {https://richardli.xyz/rl-collapse}
}
In Part 1, we established our analytical framework. We learned that the Stochastic Gradient Ascent (SGA) Lemma is our "microscope" for judging any gradient estimator, $\hat{g}$, used in an off-policy system where we sample from $\mu$ but optimize $\pi$.
Our microscope reveals two distinct failure modes: **bias**, where the estimator's expectation drifts away from the true gradient, and **variance**, where individual samples fluctuate so wildly that the update is dominated by noise.
Our goal is simple: find an estimator $\hat{g}$ that simultaneously controls both bias and variance.
We will now put the most common estimators on trial. For this analysis, we'll define
$$ f(y) := \nabla_\theta \log \pi(y|x) \cdot R(y|x) $$
as our target function, where the true gradient is $g = \mathbb{E}_\pi[f(y)]$.
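For readers who prefer code, here is a minimal PyTorch-style sketch of $f(y)$ expressed as a surrogate loss whose autograd gradient is (minus) the per-sample policy gradient term; the function and tensor names are illustrative assumptions, not the original implementation.

```python
import torch

def per_sample_surrogate(logits, token_ids, reward):
    """Surrogate loss for f(y) = grad_theta log pi(y|x) * R(y|x).

    Differentiating the returned scalar gives -f(y); SGA ascends on f(y).
    logits: [T, V] trainer-policy logits for the sampled sequence y.
    token_ids: [T] sampled token ids. reward: scalar R(y|x).
    All names here are illustrative assumptions.
    """
    # Per-token log-probabilities of the sampled tokens under pi.
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    # Sequence log-probability: log pi(y|x) = sum_t log pi(y_t | x, y_<t).
    seq_log_prob = token_log_probs.sum()
    return -seq_log_prob * reward
```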
The sequence-level importance sampling estimator, $\hat{g}_{\text{seq}}(y) = \rho(y)\, f(y)$ with $y \sim \mu$, is the theoretically "purest" estimator. It corrects for the mismatch by re-weighting every sample by the full sequence-level ratio, $\rho(y) = \pi(y) / \mu(y)$.
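In practice the ratio is usually computed in log-space from per-token log-probabilities to avoid underflow on long sequences. A minimal sketch, assuming we already have the token log-probabilities under both the trainer policy $\pi$ and the sampler $\mu$ (names are mine):

```python
import torch

def sequence_level_ratio(logp_pi_tokens, logp_mu_tokens):
    """rho(y) = pi(y) / mu(y) for one sampled sequence y ~ mu.

    Inputs are per-token log-probabilities of the same sampled tokens,
    one vector under the trainer policy pi, one under the sampler mu.
    log rho(y) = sum_t [log pi(y_t|...) - log mu(y_t|...)].
    """
    log_rho = (logp_pi_tokens - logp_mu_tokens).sum()
    return log_rho.exp()
```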
**Term B (Bias):** This estimator is perfectly unbiased. By the definition of importance sampling:
$$ \mathbb{E}_\mu[\hat{g}_{\text{seq}}] = \mathbb{E}_\mu\left[ \frac{\pi(y)}{\mu(y)} f(y) \right] = \sum_y \mu(y) \frac{\pi(y)}{\mu(y)} f(y) = \sum_y \pi(y) f(y) = \mathbb{E}_\pi[f(y)] = g $$
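This identity is easy to sanity-check numerically: sample from $\mu$, re-weight by $\rho(y)$, and the Monte Carlo average converges to $\mathbb{E}_\pi[f(y)]$ up to noise. A small NumPy sketch with made-up distributions and scores:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 4 possible "sequences", arbitrary pi, mu, and scores f(y).
pi = np.array([0.10, 0.20, 0.30, 0.40])
mu = np.array([0.25, 0.25, 0.25, 0.25])
f  = np.array([1.0, -2.0, 0.5, 3.0])   # stand-in for f(y), one scalar per y

true_g = np.sum(pi * f)                 # E_pi[f(y)]

# Sample from mu, re-weight each sample by rho(y) = pi(y) / mu(y).
ys = rng.choice(4, size=200_000, p=mu)
est_g = np.mean((pi[ys] / mu[ys]) * f[ys])

print(true_g, est_g)  # agree up to Monte Carlo error
```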