Authors: **Yingru Li, Jiacai Liu**
First published on Oct 31, 2025.
<aside> 📜
Original Blog: When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch
</aside>
In Part 1, we established our analytical framework. We learned that the Stochastic Gradient Ascent (SGA) Lemma is our "microscope" for judging any gradient estimator, $\hat{g}$, used in a mismatched system where we sample from $\mu$ but optimize $\pi$.
Our microscope reveals two distinct failure modes: a bias term (Term B) and a variance term (Term V).
Our goal is simple: find an estimator $\hat{g}$ that simultaneously controls both fatal bias and fatal variance.
We will now put the most common estimators on trial. For this analysis, we'll define
$$ f(y) := \nabla_\theta \log \pi(y|x) \cdot R(y|x) $$
as our target function, where the true gradient is $g = \mathbb{E}_\pi[f(y)]$.
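To make these objects concrete, here is a minimal numerical sketch. It assumes a toy three-sequence "bandit" with a softmax policy; the names `theta` and `reward` are illustrative stand-ins for $\theta$ and $R(y|x)$, not part of the original setup:

```python
import numpy as np

theta = np.array([0.2, -0.5, 1.0])   # policy logits (toy theta)
reward = np.array([1.0, 0.0, 2.0])   # R(y|x) for each of 3 sequences y

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

pi = softmax(theta)                  # pi(y|x)

def f(y):
    """f(y) = grad_theta log pi(y|x) * R(y|x).

    For a softmax policy the score is one-hot(y) - pi."""
    e = np.zeros_like(theta)
    e[y] = 1.0
    return (e - pi) * reward[y]

# True gradient g = E_pi[f(y)], computed exactly on the toy example.
g = sum(pi[y] * f(y) for y in range(3))
print(g)
```

Since the softmax score $\nabla_\theta \log \pi(y|x)$ sums to zero across components, so does $g$, which is a quick sanity check on the computation.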
First on trial: the **sequence-level importance sampling (IS)** estimator, $\hat{g}_{\text{seq}}$. This is the theoretically "purest" estimator: it corrects for the mismatch by re-weighting every sample by the full sequence-level ratio, $\rho(y) = \pi(y) / \mu(y)$.
**Term B (Bias): PASSES.** This estimator is perfectly unbiased. By the definition of importance sampling:
$$ \mathbb{E}_\mu[\hat{g}_{\text{seq}}] = \mathbb{E}_\mu\left[ \frac{\pi(y)}{\mu(y)} f(y) \right] = \sum_y \mu(y) \frac{\pi(y)}{\mu(y)} f(y) = \sum_y \pi(y) f(y) = \mathbb{E}_\pi[f(y)] = g $$
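This identity is easy to check by Monte Carlo. The sketch below (an assumption-laden toy, not the post's actual training setup) uses a three-sequence softmax policy $\pi$ and a deliberately mismatched sampler $\mu$, and confirms that the $\rho$-weighted sample mean converges to the exact $g = \mathbb{E}_\pi[f(y)]$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.5, 1.0])   # policy logits (toy theta)
reward = np.array([1.0, 0.0, 2.0])   # R(y|x) for each sequence y

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

pi = softmax(theta)                  # learner policy pi(y|x)
mu = np.array([0.5, 0.3, 0.2])       # mismatched sampler mu(y|x)

def f(y):
    e = np.zeros_like(theta)
    e[y] = 1.0
    return (e - pi) * reward[y]      # f(y) = grad log pi(y|x) * R(y|x)

# Exact target gradient g = E_pi[f(y)].
g_true = sum(pi[y] * f(y) for y in range(3))

# Sequence-level IS: sample y ~ mu, weight each f(y) by rho(y) = pi(y)/mu(y).
ys = rng.choice(3, size=200_000, p=mu)
F = (np.eye(3)[ys] - pi) * reward[ys][:, None]   # f(y) for each sample, vectorized
g_hat = np.mean((pi[ys] / mu[ys])[:, None] * F, axis=0)

print(np.abs(g_hat - g_true).max())  # shrinks as the sample size grows
```

The estimate matches $g$ up to Monte Carlo error, which is exactly what Term B passing means: the bias is zero regardless of how far $\mu$ drifts from $\pi$ (the cost of that drift shows up in the variance instead).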