Yaxiang Zhang*, Yingru Li*†, Jiacai Liu, Ziniu Li, Jiawei Xu, Qian Liu
†Corresponding Author, *Co-first Authors. First published on Dec. 20.

Figure 1(a): Training-inference mismatch indicator under the baseline (green line, constant learning rate) and learning-rate decay (yellow line). We train Qwen3-4B with an initial learning rate of 1e-6 and batch size = ppo_mini_batch_size = 64, i.e., fully on-policy. Oversampling and rejection sampling (for groups whose rewards are all 0 or all 1) are applied. Our experiments suggest that the training-inference mismatch can be effectively suppressed around step 3k simply by shrinking the update size, demonstrating that this mismatch is not static random noise stemming solely from numerical precision limits, but a dynamic issue in the training process.

Figure 1(b): Corresponding validation performance.

Figure 1(c): Pseudo-code for the proposed LR scheduler.
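Figure 1(c) contains the actual pseudo-code. For readers who prefer runnable code, below is a minimal, hypothetical Python sketch of a mismatch-aware LR decay. The class name, the EMA smoothing, the threshold, and the decay factor are all illustrative assumptions, not the exact scheduler from the figure.

```python
class MismatchAwareLRScheduler:
    """Hypothetical sketch: shrink the learning rate once the smoothed
    training-inference mismatch indicator grows past a threshold.
    All constants here are illustrative, not the values from Figure 1(c)."""

    def __init__(self, base_lr=1e-6, decay=0.8, min_lr=1e-7,
                 threshold=0.3, ema_beta=0.95):
        self.lr = base_lr           # initial LR, matching the 1e-6 in Figure 1(a)
        self.decay = decay          # multiplicative shrink factor (assumed)
        self.min_lr = min_lr        # floor so updates never vanish entirely (assumed)
        self.threshold = threshold  # mismatch level that triggers decay (assumed)
        self.ema_beta = ema_beta    # smoothing constant for the indicator (assumed)
        self.mismatch_ema = 0.0

    def step(self, logppl_abs_diff: float) -> float:
        """Call once per RL step with the current mismatch indicator."""
        # Smooth the noisy per-step indicator with an exponential moving average.
        self.mismatch_ema = (self.ema_beta * self.mismatch_ema
                             + (1.0 - self.ema_beta) * logppl_abs_diff)
        # If the smoothed mismatch is large, shrink the update size.
        if self.mismatch_ema > self.threshold:
            self.lr = max(self.lr * self.decay, self.min_lr)
        return self.lr
```

In each training step, `scheduler.step(logppl_abs_diff)` would return the learning rate to feed into the optimizer.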
<aside> 💡
The Problem: Reinforcement Learning (RL) training for LLMs is notoriously unstable. While recent studies attribute this to "training-inference mismatch" (caused by hybrid engines), standard fixes like Importance Sampling might fail during longer training runs.
The Insight: We analyze this instability through an optimization lens. We find that as training progresses, gradient noise and training-inference mismatch increase simultaneously. This suggests that the "mismatch" is not merely a static numerical issue, but a dynamic problem coupled with the model's optimization trajectory.
The Solution: A specialized Learning Rate (LR) Scheduler.
</aside>
@online{
  title  = {Beyond Precision: Why Training-Inference Mismatch is an Optimization Problem and How Simple LR Scheduling Fixes It},
  author = {Yaxiang Zhang and Yingru Li and Jiacai Liu and Ziniu Li and Jiawei Xu and Qian Liu},
  year   = {2025},
  month  = dec,
  url    = {https://richardli.xyz/mismatch-lr-schedule}
}
Reinforcement Learning (RL) has proven capable of incentivizing LLMs to perform better on reasoning and other complex tasks. Nevertheless, RL is also famous for its training instability. Some prior work suggests that this issue may stem from the use of hybrid engines in RL training, which introduces a mismatch between training and inference. To measure the degree of this mismatch, we first introduce the Log Perplexity (Log ppl) of a trajectory $\tau$:
$$ \log \text{ppl}(\tau, \theta)= -\sum_{t=1}^{T} \log \pi_{\theta}(y_t|x,y_{<t}) $$
where $\tau = (x, y)$ is the trajectory comprising the given prompt $x$ and the generated response $y = (y_1, y_2, \ldots, y_T)$, and $\theta$ denotes the model weights. In a single RL training step, the model generates $N$ responses (the rollout number). Our metric for training-inference mismatch is defined as:
$$ \log\text{ppl}_{\text{abs-diff}} = \frac{1}{N} \sum_{i=1}^{N}\left|\log\text{ppl}(\tau_i,\pi_\theta^{\text{train}})-\log\text{ppl}(\tau_i, \pi_\theta^{\text{inference}})\right| $$
This measures the average perplexity discrepancy among different sequences. Taking the training of Qwen3-4B on the dapo_filter dataset as an example, we observe two key phenomena:
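For concreteness, here is a minimal sketch of how this indicator could be computed from per-token log-probabilities. The function names and the NumPy-based implementation are our own illustration, not code from any particular training framework.

```python
import numpy as np

def log_ppl(token_logprobs) -> float:
    """Sequence-level log perplexity as defined above:
    the negative sum of per-token log-probabilities."""
    return -float(np.sum(token_logprobs))

def logppl_abs_diff(train_logprobs, infer_logprobs) -> float:
    """Training-inference mismatch indicator: mean absolute gap between the
    log-ppl of each rollout scored by the training engine and by the
    inference engine (same weights, different kernels/numerics)."""
    diffs = [abs(log_ppl(lp_tr) - log_ppl(lp_inf))
             for lp_tr, lp_inf in zip(train_logprobs, infer_logprobs)]
    return float(np.mean(diffs))

# train_logprobs[i] / infer_logprobs[i] hold the per-token log-probs of
# rollout i as recomputed by the trainer and as reported by the inference
# engine, respectively (e.g., a hybrid-engine setup with vLLM rollouts).
```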

Figure 2(a): Validation accuracy on the AIME24 benchmark.

Figure 2(b): Validation accuracy on the AIME25 benchmark.

Figure 2(c): Indicator of the degree of training-inference mismatch.
We infer that the degrading performance and the growing training-inference mismatch may both be related to deteriorating optimization dynamics. To investigate this, we plot the time-smoothed L2 norm of the gradients in Figure 3. The L2 norm reflects a combination of signal and noise. Since the training data usually becomes less informative over time (the model has already learned the easy patterns), an increasing L2 norm implies that noise is coming to dominate the update direction. Moreover, this trend coincides with the widening gap between the training and inference distributions, suggesting a potential correlation between the two.
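As a reference for reproducing this diagnostic, below is a minimal sketch of the time-smoothed gradient L2 norm, assuming a standard PyTorch training loop; the smoothing constant is an assumption on our part.

```python
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    """Global L2 norm over all parameter gradients (call after backward())."""
    squared_sum = 0.0
    for p in model.parameters():
        if p.grad is not None:
            squared_sum += p.grad.detach().float().pow(2).sum().item()
    return squared_sum ** 0.5

class EMAMeter:
    """Exponential moving average, used here for the time-smoothed curve."""
    def __init__(self, beta: float = 0.99):  # beta is an assumed smoothing constant
        self.beta, self.value = beta, None

    def update(self, x: float) -> float:
        self.value = x if self.value is None else self.beta * self.value + (1.0 - self.beta) * x
        return self.value

# Per training step: grad_norm_meter.update(global_grad_norm(model)) after loss.backward().
```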
<aside> 📖
Noise includes bias and variance, both of which grow with response length. For a more detailed discussion of why the noise in gradient estimation is severe in RL, readers may refer to https://richardli.xyz/post/rl-collapse-part1/, https://richardli.xyz/post/rl-collapse-part2/, and The Optimal Token Baseline.
</aside>