Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs

Qwen Pilot, Alibaba Group | Published at ICLR 2026

<aside> 👉

TL;DR / Summary

RL fine-tuning with verifiable rewards (RLVR) improves reasoning through surprisingly sparse changes: the RL model's next-token distributions are nearly identical to the base model's at most positions, differing only at a small set of critical decision points.

→ Through cross-sampling experiments, we show that these few divergent distributions are functionally important: inserting a small fraction of RL-sampled tokens into base generations recovers RL performance gains, while replacing a few RL decisions with base tokens in RL generations collapses performance to base-model levels (a minimal sketch of this procedure follows the summary).

→ At these high-divergence positions, RLVR typically reweights choices already considered by the base model rather than “inventing” new tokens, though the extent of such novel-token introduction varies across methods.

Key insight: RLVR works less like rewriting the model and more like applying sparse, high-impact modifications at certain token positions in reasoning. As a result, the RL policy behaves almost identically to the base policy, differing only at a small number of critical decision points that steer generation toward reasoning trajectories that remain accessible under the base model.

</aside>
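To make the cross-sampling idea concrete, here is a minimal sketch of one direction of the experiment: decode with the base model, but defer to the RL model at positions where the two next-token distributions diverge strongly. Everything here is illustrative rather than the paper's exact protocol: the checkpoint names are placeholders (the RLVR-tuned name in particular is hypothetical), decoding is greedy for simplicity, and the KL threshold is an arbitrary assumption.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints for illustration; the RLVR-tuned name is hypothetical.
BASE_ID = "Qwen/Qwen2.5-7B"
RL_ID = "Qwen/Qwen2.5-7B-RLVR"

tok = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID).eval()
rl = AutoModelForCausalLM.from_pretrained(RL_ID).eval()

@torch.no_grad()
def cross_sample(prompt: str, kl_threshold: float = 0.5, max_new_tokens: int = 256) -> str:
    """Greedy-decode with the base model, but take the RL model's token at
    positions where KL(rl || base) over the next token exceeds a threshold."""
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logp_base = torch.log_softmax(base(ids).logits[0, -1], dim=-1)
        logp_rl = torch.log_softmax(rl(ids).logits[0, -1], dim=-1)
        kl = torch.sum(logp_rl.exp() * (logp_rl - logp_base))
        # High-divergence positions are the "critical decision points":
        # insert the RL model's choice there, keep the base choice elsewhere.
        chooser = logp_rl if kl > kl_threshold else logp_base
        next_id = chooser.argmax().view(1, 1)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)
```

The reverse direction of the experiment, replacing a few RL decisions with base tokens in RL generations, is the same loop with the roles of the two models swapped.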


Introduction

Reinforcement learning with verifiable rewards (RLVR) has emerged as a key paradigm for improving reasoning in LLMs. Methods such as Group Relative Policy Optimization (GRPO) (Shao et al., 2024) have pushed models to achieve impressive results on challenging math and reasoning benchmarks.

But despite these successes, we still don't fully understand how RL fine-tuning changes model behavior. Most evaluations look at aggregate metrics such as accuracy, reward, and response length. These are useful, but they're like looking at a city from an airplane: you see the big picture but miss the street-level details.

This raises a simple question:

What actually changes in a language model when we fine-tune it with RLVR?

Answering this question turns out to be harder than it first appears. What does it actually mean for a language model to change? The tension is reminiscent of the Ship of Theseus:

If we gradually replace the planks of a ship one by one, at what point, if any, does it become a different ship?

RL fine-tuning raises a similar puzzle. Fine-tuning updates the model through many small, incremental adjustments. The resulting model can, from some perspectives, behave quite differently, solving harder reasoning problems and producing longer reasoning chains, yet its overall structure may remain largely intact. Are we fundamentally changing the model, or simply steering it while preserving its underlying identity?

To make this analogy concrete, we first need to identify what the “planks” of a language model correspond to. While the model’s parameters are difficult to interpret directly, we can observe its behavior: the next-token distributions it produces at each step of generation. These distributions fully characterize the model: given any context, it is defined by the probabilities it assigns to the next token. From this perspective, the next-token distributions naturally serve as the model’s “planks”. RL fine-tuning may alter them differently across contexts, but the extent and structure of these changes are not immediately clear.

This suggests a natural approach: instead of reasoning abstractly about whether a model has “changed”, we can directly track how these observable components evolve under RL fine-tuning. Understanding RLVR therefore becomes a question of which parts of the model’s behavior change, and how those changes are structured and distributed across tokens and contexts.
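These “planks” are directly observable: a single forward pass over any fixed text yields the model's next-token distribution at every position. A minimal sketch, where the model name is a placeholder and any causal LM would illustrate the point:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B"  # placeholder; any causal LM works here
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

@torch.no_grad()
def next_token_distributions(text: str) -> torch.Tensor:
    """One next-token distribution per position: row t is p(token t+1 | tokens 0..t)."""
    ids = tok(text, return_tensors="pt").input_ids
    logits = model(ids).logits[0]          # [seq_len, vocab_size]
    return torch.softmax(logits, dim=-1)   # the model's "planks" for this context
```

Comparing these rows between the base and RL checkpoints, position by position, is exactly the kind of tracking proposed above.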

Recent work has also begun exploring RLVR through a token-level perspective, highlighting the role of high-entropy tokens (Wang et al., 2025), as well as KL divergence and rank-shift statistics (Huan et al., 2025; Chen et al., 2026). But a deeper question remains:

How do token distributions actually change with RLVR, and to what extent do these changes drive reasoning improvements?

In this work, we study RLVR through a token-level lens of change, comparing next-token distributions between base and RL models and linking these differences to their functional impact on sequence-level reasoning performance.
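Concretely, the token-level comparison can be phrased as two per-position statistics: the KL divergence between the two models' next-token distributions, and the rank shift of the token actually taken. A hedged sketch of this idea, where the function and variable names are our own rather than the paper's:

```python
import torch

@torch.no_grad()
def token_level_shift(base_model, rl_model, ids: torch.Tensor):
    """Per-position KL(rl || base) and rank shift along one token sequence.
    `ids`: [1, seq_len] token ids of a single generation (e.g. an RL rollout)."""
    logp_base = torch.log_softmax(base_model(ids).logits[0, :-1], dim=-1)
    logp_rl = torch.log_softmax(rl_model(ids).logits[0, :-1], dim=-1)
    taken = ids[0, 1:].unsqueeze(-1)  # token realized at each position

    # Divergence between the two next-token distributions at each step.
    kl = torch.sum(logp_rl.exp() * (logp_rl - logp_base), dim=-1)

    # Rank of the realized token under each model (0 = most likely token).
    rank_base = (logp_base > logp_base.gather(-1, taken)).sum(-1)
    rank_rl = (logp_rl > logp_rl.gather(-1, taken)).sum(-1)
    return kl, rank_base - rank_rl  # positive shift: RL promoted the token
```

Positions where `kl` spikes are candidates for the sparse critical decision points; if `rank_base` is already small at those positions, that supports the reweighting picture, in which RLVR promotes choices the base model was already considering.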

The central picture that emerges is surprisingly simple:

RLVR leaves most token distributions largely unchanged, but applies sparse, targeted modifications at critical decision points, guiding generation toward more effective reasoning trajectories that remain accessible to the base model.