Authors: **Yingru Li, Jiacai Liu**
First published on Oct 31, 2025.
<aside> 📜
Original Blog: When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch
</aside>
In Part 1, we established our analytical framework. We learned that the Stochastic Gradient Ascent (SGA) Lemma is our "microscope" for judging any gradient estimator, $\hat{g}$, used in a mismatched system where we sample from $\mu$ but optimize $\pi$.
Our microscope reveals two distinct failure modes: a bias term (Term B) and a variance term (Term V).
Our goal is simple: find an estimator $\hat{g}$ that simultaneously controls both fatal bias and fatal variance.
We will now put the most common estimators on trial. For this analysis, we'll define
$$ f(y) := \nabla_\theta \log \pi(y|x) \cdot R(y|x) $$
as our target function, where the true gradient is $g = \mathbb{E}_\pi[f(y)]$.
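To make these objects concrete, here is a minimal numerical sketch. It assumes a toy three-sequence "bandit" with a softmax policy; the names `theta` and `reward` are illustrative stand-ins for $\theta$ and $R(y|x)$, not part of the original setup:

```python
import numpy as np

theta = np.array([0.2, -0.5, 1.0])   # policy logits (toy theta)
reward = np.array([1.0, 0.0, 2.0])   # R(y|x) for each of 3 sequences y

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

pi = softmax(theta)                  # pi(y|x)

def f(y):
    """f(y) = grad_theta log pi(y|x) * R(y|x).

    For a softmax policy the score is one-hot(y) - pi."""
    e = np.zeros_like(theta)
    e[y] = 1.0
    return (e - pi) * reward[y]

# True gradient g = E_pi[f(y)], computed exactly on the toy example.
g = sum(pi[y] * f(y) for y in range(3))
print(g)
```

Since the softmax score $\nabla_\theta \log \pi(y|x)$ sums to zero across components, so does $g$, which is a quick sanity check on the computation.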
First on trial: the **sequence-level importance sampling (IS)** estimator, $\hat{g}_{\text{seq}}$. This is the theoretically "purest" estimator: it corrects for the mismatch by re-weighting every sample by the full sequence-level ratio, $\rho(y) = \pi(y) / \mu(y)$.
**Term B (Bias): PASSES.** This estimator is perfectly unbiased. By the definition of importance sampling:
$$ \mathbb{E}_\mu[\hat{g}_{\text{seq}}] = \mathbb{E}_\mu\left[ \frac{\pi(y)}{\mu(y)} f(y) \right] = \sum_y \mu(y) \frac{\pi(y)}{\mu(y)} f(y) = \sum_y \pi(y) f(y) = \mathbb{E}_\pi[f(y)] = g $$
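This identity is easy to check by Monte Carlo. The sketch below (an assumption-laden toy, not the post's actual training setup) uses a three-sequence softmax policy $\pi$ and a deliberately mismatched sampler $\mu$, and confirms that the $\rho$-weighted sample mean converges to the exact $g = \mathbb{E}_\pi[f(y)]$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.5, 1.0])   # policy logits (toy theta)
reward = np.array([1.0, 0.0, 2.0])   # R(y|x) for each sequence y

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

pi = softmax(theta)                  # learner policy pi(y|x)
mu = np.array([0.5, 0.3, 0.2])       # mismatched sampler mu(y|x)

def f(y):
    e = np.zeros_like(theta)
    e[y] = 1.0
    return (e - pi) * reward[y]      # f(y) = grad log pi(y|x) * R(y|x)

# Exact target gradient g = E_pi[f(y)].
g_true = sum(pi[y] * f(y) for y in range(3))

# Sequence-level IS: sample y ~ mu, weight each f(y) by rho(y) = pi(y)/mu(y).
ys = rng.choice(3, size=200_000, p=mu)
F = (np.eye(3)[ys] - pi) * reward[ys][:, None]   # f(y) for each sample, vectorized
g_hat = np.mean((pi[ys] / mu[ys])[:, None] * F, axis=0)

print(np.abs(g_hat - g_true).max())  # shrinks as the sample size grows
```

The estimate matches $g$ up to Monte Carlo error, which is exactly what Term B passing means: the bias is zero regardless of how far $\mu$ drifts from $\pi$ (the cost of that drift shows up in the variance instead).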