Authors: Yingru Li†, Jiawei Xu, Ziniu Li, Jiacai Liu, Yuxuan Tong, Wei Liu, Longtao Zheng, Zhenghai Xue, Yaxiang Zhang, Tianle Cai, Ge Zhang, Qian Liu, and Baoxiang Wang
† Project Lead, Co-first author
First published on Dec. 20, 2025.

Figure 1. Gradient Norm (Left) and AIME25 Score (Right) under multi-turn TIR training. We adopt full on-policy training on Qwen2.5-7B, with both the rollout batch size and mini-update size set to 128. The maximum response length is 8192, the group size is 16, and the maximum number of turns is 5. The vertical dotted line marks the point at which the compared methods collapse, coinciding with a sudden surge in gradient norm. Notably, our Optimal Token Baseline yields a stable gradient norm, which results in stable training and a higher score.
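For readers who want the Figure 1 setup at a glance, here is the same configuration restated as a hypothetical Python dict; the key names are illustrative and not tied to any particular training framework:

```python
# Hypothetical configuration mirroring the Figure 1 training setup.
# Key names are illustrative, not an actual framework API.
figure1_training_config = {
    "base_model": "Qwen2.5-7B",
    "on_policy": True,             # full on-policy training
    "rollout_batch_size": 128,
    "mini_update_size": 128,
    "max_response_length": 8192,
    "group_size": 16,              # responses sampled per prompt
    "max_turns": 5,                # multi-turn TIR rollouts
}
```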
TL;DR
@online{li2025optimaltokenbaseline,
  title  = {The Optimal Token Baseline: Variance Reduction for Long-Horizon LLM-RL},
  author = {Yingru Li and Jiawei Xu and Ziniu Li and Jiacai Liu and Yuxuan Tong and Wei Liu and Longtao Zheng and Zhenghai Xue and Yaxiang Zhang and Tianle Cai and Ge Zhang and Qian Liu and Baoxiang Wang},
  year   = {2025},
  month  = dec,
  url    = {https://richardli.xyz/optimal-token-baseline}
}
Reinforcement Learning (RL) has become the standard approach for LLM reasoning and agentic tasks. However, practitioners face a persistent bottleneck: Training Collapse. As shown in Figure 1, models often learn effectively for hundreds of steps before the gradient norm suddenly surges, causing performance to crater.
The instability of RL in this regime is not merely a tuning issue but a structural consequence of how gradient variance scales with trajectory length and reward sparsity.
This can be analyzed through the lens of the standard REINFORCE gradient estimator:
$$ \hat{g}(\tau) = R(\tau) \cdot S(\tau), $$
where $\tau = (x, y)$ is the trajectory comprising the given prompt $x$ and the generated response $y = (y_1, y_2, \ldots, y_T)$, whose horizon length $T = |y|$ is a random variable. $R(\tau)$ is the scalar reward and $S(\tau) = \sum_{t=1}^T s_t$ is the total trajectory score, where $s_t = \nabla_\theta \log \pi_{\theta}(y_t | x, y_{<t})$. The noise in the gradient is driven by two factors (see the sketch after this list):
1. **Long horizons.** $S(\tau)$ is a sum of $T$ token-level score terms, so its magnitude grows with the trajectory length, which is itself random in multi-turn rollouts.
2. **Sparse rewards.** $R(\tau)$ is a single outcome-level scalar attached to the whole trajectory, so it multiplies every token's score with the same high-variance signal.
When these factors coincide, the gradient variance explodes, leading to training instability.
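To make the estimator concrete, here is a minimal PyTorch-style sketch of the per-trajectory REINFORCE estimate via the usual surrogate loss. This is not the authors' implementation; `logprobs` and `reward` are assumed to come from the surrounding rollout loop.

```python
# Minimal sketch of g_hat(tau) = R(tau) * S(tau) via a surrogate loss.
import torch

def reinforce_surrogate_loss(logprobs: torch.Tensor, reward: float) -> torch.Tensor:
    """logprobs: shape (T,), log pi_theta(y_t | x, y_<t) with grad enabled."""
    # S(tau) = sum_t grad log pi_theta(y_t | x, y_<t); autograd produces this
    # sum of token scores when we backprop through the summed log-probs.
    score_sum = logprobs.sum()
    # Scaling by the scalar reward gives a loss whose gradient is
    # -R(tau) * S(tau); note that both T and R(tau) scale the gradient norm,
    # which is exactly the variance mechanism described above.
    return -reward * score_sum

# Usage: loss = reinforce_surrogate_loss(token_logprobs, R); loss.backward()
```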
A common approach to reduce gradient variance is to introduce a baseline $B(x)$, yielding
$$ \tilde{g}(\tau) = (R(\tau) - B(x)) \cdot S(\tau) . $$
Because $\mathbb{E}_{\tau \sim \pi_\theta}\left[S(\tau) \mid x\right] = 0$, subtracting any baseline that depends only on the prompt $x$ leaves the estimator unbiased; the remaining question is which baseline reduces the variance the most.
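As an illustration, the sketch below implements the baseline-adjusted estimator with a simple prompt-level group-mean baseline, a common choice in group-based RL training. This is not the Optimal Token Baseline introduced in this post, and all tensor names are illustrative.

```python
# Sketch of (R(tau) - B(x)) * S(tau) with a group-mean baseline B(x),
# estimated from the G responses sampled for the same prompt.
import torch

def baseline_adjusted_loss(logprobs_per_response: list[torch.Tensor],
                           rewards: torch.Tensor) -> torch.Tensor:
    """logprobs_per_response: G tensors of token log-probs, one per sampled y.
    rewards: shape (G,), scalar reward R(tau_i) for each response."""
    baseline = rewards.mean()            # B(x): depends only on the prompt's group
    advantages = rewards - baseline      # R(tau_i) - B(x)
    losses = [-(adv.detach()) * lp.sum() # treat the advantage as a constant weight
              for adv, lp in zip(advantages, logprobs_per_response)]
    return torch.stack(losses).mean()
```

Backpropagating through this loss yields the group-averaged $(R(\tau_i) - B(x)) \cdot S(\tau_i)$ update (negated, since it is written as a loss to minimize).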