Authors: **Yingru Li, Jiacai Liu**
First published on Oct 31, 2025.
<aside> 📜
Original Blog: When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch
</aside>
@online{liu-li-2025-rl-collapse,
title = {When Speed Kills Stability: Demystifying {RL} Collapse from the Training-Inference Mismatch},
author = {Liu, Jiacai and Li, Yingru and Fu, Yuqian and Wang, Jiawei and Liu, Qian and Shen, Yu},
year = {2025},
month = sep,
url = {https://richardli.xyz/rl-collapse}
}
In Part 1, we established our analytical framework. We learned that the Stochastic Gradient Ascent (SGA) Lemma is our "microscope" for judging any gradient estimator, $\hat{g}$, used in an off-policy system where we sample from $\mu$ but optimize $\pi$.
Our microscope reveals two distinct failure modes: **bias**, where the estimator's expectation drifts away from the true gradient, and **variance**, where individual samples fluctuate so wildly that the update is dominated by noise.
Our goal is simple: find an estimator $\hat{g}$ that simultaneously controls both bias and variance.
We will now put the most common estimators on trial. For this analysis, we'll define
$$ f(y) := \nabla_\theta \log \pi(y|x) \cdot R(y|x) $$
as our target function, where the true gradient is $g = \mathbb{E}_\pi[f(y)]$.
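For readers who prefer code, here is a minimal PyTorch-style sketch of $f(y)$ expressed as a surrogate loss whose autograd gradient is (minus) the per-sample policy gradient term; the function and tensor names are illustrative assumptions, not the original implementation.

```python
import torch

def per_sample_surrogate(logits, token_ids, reward):
    """Surrogate loss for f(y) = grad_theta log pi(y|x) * R(y|x).

    Differentiating the returned scalar gives -f(y); SGA ascends on f(y).
    logits: [T, V] trainer-policy logits for the sampled sequence y.
    token_ids: [T] sampled token ids. reward: scalar R(y|x).
    All names here are illustrative assumptions.
    """
    # Per-token log-probabilities of the sampled tokens under pi.
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    # Sequence log-probability: log pi(y|x) = sum_t log pi(y_t | x, y_<t).
    seq_log_prob = token_log_probs.sum()
    return -seq_log_prob * reward
```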
The sequence-level importance sampling estimator, $\hat{g}_{\text{seq}}(y) = \rho(y)\, f(y)$ with $y \sim \mu$, is the theoretically "purest" estimator. It corrects for the mismatch by re-weighting every sample by the full sequence-level ratio, $\rho(y) = \pi(y) / \mu(y)$.
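In practice the ratio is usually computed in log-space from per-token log-probabilities to avoid underflow on long sequences. A minimal sketch, assuming we already have the token log-probabilities under both the trainer policy $\pi$ and the sampler $\mu$ (names are mine):

```python
import torch

def sequence_level_ratio(logp_pi_tokens, logp_mu_tokens):
    """rho(y) = pi(y) / mu(y) for one sampled sequence y ~ mu.

    Inputs are per-token log-probabilities of the same sampled tokens,
    one vector under the trainer policy pi, one under the sampler mu.
    log rho(y) = sum_t [log pi(y_t|...) - log mu(y_t|...)].
    """
    log_rho = (logp_pi_tokens - logp_mu_tokens).sum()
    return log_rho.exp()
```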
**Term B (Bias):** This estimator is perfectly unbiased. By the definition of importance sampling:
$$ \mathbb{E}_\mu[\hat{g}_{\text{seq}}] = \mathbb{E}_\mu\left[ \frac{\pi(y)}{\mu(y)} f(y) \right] = \sum_y \mu(y) \frac{\pi(y)}{\mu(y)} f(y) = \sum_y \pi(y) f(y) = \mathbb{E}_\pi[f(y)] = g $$
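This identity is easy to sanity-check numerically: sample from $\mu$, re-weight by $\rho(y)$, and the Monte Carlo average converges to $\mathbb{E}_\pi[f(y)]$ up to noise. A small NumPy sketch with made-up distributions and scores:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 4 possible "sequences", arbitrary pi, mu, and scores f(y).
pi = np.array([0.10, 0.20, 0.30, 0.40])
mu = np.array([0.25, 0.25, 0.25, 0.25])
f  = np.array([1.0, -2.0, 0.5, 3.0])   # stand-in for f(y), one scalar per y

true_g = np.sum(pi * f)                 # E_pi[f(y)]

# Sample from mu, re-weight each sample by rho(y) = pi(y) / mu(y).
ys = rng.choice(4, size=200_000, p=mu)
est_g = np.mean((pi[ys] / mu[ys]) * f[ys])

print(true_g, est_g)  # agree up to Monte Carlo error
```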