Authors: Yingru Li†, Jiawei Xu, Ziniu Li, Jiacai Liu, Yuxuan Tong, Wei Liu, Longtao Zheng, Zhenghai Xue, Yaxiang Zhang, Tianle Cai, Ge Zhang, Qian Liu, and Baoxiang Wang
† Project Lead, Co-first author
First published on Dec. 20, 2025.

Figure 1. Gradient Norm (Left) and AIME25 Score (Right) under multi-turn TIR training. We adopt full on-policy training on Qwen2.5-7B, with both the rollout batch size and mini-update size set to 128. The maximum response length is 8192, the group size is 16, and the maximum number of turns is 5. The vertical dotted line marks the point at which the compared methods collapse, coinciding with a sudden surge in gradient norm. Notably, our Optimal Token Baseline yields a stable gradient norm, which results in stable training and a higher score.
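For readers who want the Figure 1 setup at a glance, here is the same configuration restated as a hypothetical Python dict; the key names are illustrative and not tied to any particular training framework:

```python
# Hypothetical configuration mirroring the Figure 1 training setup.
# Key names are illustrative, not an actual framework API.
figure1_training_config = {
    "base_model": "Qwen2.5-7B",
    "on_policy": True,             # full on-policy training
    "rollout_batch_size": 128,
    "mini_update_size": 128,
    "max_response_length": 8192,
    "group_size": 16,              # responses sampled per prompt
    "max_turns": 5,                # multi-turn TIR rollouts
}
```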
TL;DR
@online{li2025optimaltokenbaseline,
  title  = {The Optimal Token Baseline: Variance Reduction for Long-Horizon LLM-RL},
  author = {Yingru Li and Jiawei Xu and Ziniu Li and Jiacai Liu and Yuxuan Tong and Wei Liu and Longtao Zheng and Zhenghai Xue and Yaxiang Zhang and Tianle Cai and Ge Zhang and Qian Liu and Baoxiang Wang},
  year   = {2025},
  month  = dec,
  url    = {https://richardli.xyz/optimal-token-baseline}
}
Reinforcement Learning (RL) has become the standard approach for LLM reasoning and agentic tasks. However, practitioners face a persistent bottleneck: Training Collapse. As shown in Figure 1, models often learn effectively for hundreds of steps before the gradient norm suddenly surges, causing performance to crater.
The instability of RL in this regime is not merely a tuning issue but a structural consequence of how gradient variance scales with trajectory length and reward sparsity.
This can be analyzed through the lens of the standard REINFORCE gradient estimator:
$$ \hat{g}(\tau) = R(\tau) \cdot S(\tau), $$
where $\tau = (x, y)$ is the trajectory comprising the given prompt $x$ and the generated response $y = (y_1, y_2, \ldots, y_T)$, whose horizon length $T = |y|$ is a random variable. $R(\tau)$ is the scalar reward and $S(\tau) = \sum_{t=1}^T s_t$ is the total trajectory score, where $s_t = \nabla_\theta \log \pi_{\theta}(y_t | x, y_{<t})$. The noise in the gradient is driven by two factors (see the sketch after this list):
1. **Long horizons.** $S(\tau)$ is a sum of $T$ token-level score terms, so its magnitude grows with the trajectory length, which is itself random in multi-turn rollouts.
2. **Sparse rewards.** $R(\tau)$ is a single outcome-level scalar attached to the whole trajectory, so it multiplies every token's score with the same high-variance signal.
When these factors coincide, the gradient variance explodes, leading to training instability.
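To make the estimator concrete, here is a minimal PyTorch-style sketch of the per-trajectory REINFORCE estimate via the usual surrogate loss. This is not the authors' implementation; `logprobs` and `reward` are assumed to come from the surrounding rollout loop.

```python
# Minimal sketch of g_hat(tau) = R(tau) * S(tau) via a surrogate loss.
import torch

def reinforce_surrogate_loss(logprobs: torch.Tensor, reward: float) -> torch.Tensor:
    """logprobs: shape (T,), log pi_theta(y_t | x, y_<t) with grad enabled."""
    # S(tau) = sum_t grad log pi_theta(y_t | x, y_<t); autograd produces this
    # sum of token scores when we backprop through the summed log-probs.
    score_sum = logprobs.sum()
    # Scaling by the scalar reward gives a loss whose gradient is
    # -R(tau) * S(tau); note that both T and R(tau) scale the gradient norm,
    # which is exactly the variance mechanism described above.
    return -reward * score_sum

# Usage: loss = reinforce_surrogate_loss(token_logprobs, R); loss.backward()
```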
A common approach to reduce gradient variance is to introduce a baseline $B(x)$, yielding
$$ \tilde{g}(\tau) = (R(\tau) - B(x)) \cdot S(\tau) . $$
Because $\mathbb{E}_{\tau \sim \pi_\theta}\left[S(\tau) \mid x\right] = 0$, subtracting any baseline that depends only on the prompt $x$ leaves the estimator unbiased; the remaining question is which baseline reduces the variance the most.
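As an illustration, the sketch below implements the baseline-adjusted estimator with a simple prompt-level group-mean baseline, a common choice in group-based RL training. This is not the Optimal Token Baseline introduced in this post, and all tensor names are illustrative.

```python
# Sketch of (R(tau) - B(x)) * S(tau) with a group-mean baseline B(x),
# estimated from the G responses sampled for the same prompt.
import torch

def baseline_adjusted_loss(logprobs_per_response: list[torch.Tensor],
                           rewards: torch.Tensor) -> torch.Tensor:
    """logprobs_per_response: G tensors of token log-probs, one per sampled y.
    rewards: shape (G,), scalar reward R(tau_i) for each response."""
    baseline = rewards.mean()            # B(x): depends only on the prompt's group
    advantages = rewards - baseline      # R(tau_i) - B(x)
    losses = [-(adv.detach()) * lp.sum() # treat the advantage as a constant weight
              for adv, lp in zip(advantages, logprobs_per_response)]
    return torch.stack(losses).mean()
```

Backpropagating through this loss yields the group-averaged $(R(\tau_i) - B(x)) \cdot S(\tau_i)$ update (negated, since it is written as a loss to minimize).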