On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

Qwen Pilot, Alibaba

<aside> 👉

TL;DR

Existing work has established that RLVR induces sparse updates in language models, but these analyses have primarily focused on the magnitude of changes while largely overlooking their direction. In this work, we show that directional information—captured by the signed log-probability difference $\Delta\log p$ between base and RLVR models—more effectively identifies reasoning-critical tokens, as validated through token-replacement interventions. Based on this insight, we propose:

  1. A test-time extrapolation method that amplifies the policy along the learned $\Delta\log p$ direction to improve reasoning accuracy without additional training.
  2. A training-time reweighting method that upweights low-probability tokens (which correspond to high $\Delta\log p$), yielding consistent gains over the RLVR baseline (DAPO) across models and benchmarks. </aside>

Introduction

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key algorithmic driver behind recent advances in the reasoning ability of LLMs. To understand how RLVR works, a common approach is to compare the final trained model ($\pi_\mathrm{RL}$) against its base counterpart ($\pi_\mathrm{Base}$). Prior analyses consistently show that these changes are sparse: studies have linked this sparsity to high-entropy tokens [1] and selective gradient updates [2,3], and have measured it via distributional divergences [4,5] (detailed in our previous blog [5]).

<aside> 👉

However, these analyses mostly focus on the magnitude of change. As shown in Fig. 1(b), magnitude-based metrics such as entropy and KL divergence produce nearly identical histograms on the base and RL models' generations. This reveals a critical limitation: magnitude alone shows that something changed, but cannot characterize how the model has transformed.

</aside>

Figure 1. (a) Token-level metrics for analyzing RLVR updates. (b) Histograms of each metric. With a log-scale y-axis, most values concentrate near zero, but only $\Delta\log p$ shows a directional shift distinguishing RLVR from the base model. (c) Token-replacement performance: $\Delta\log p$ recovers RLVR performance with the fewest replacements.


Therefore, we introduce a directional metric, the signed token-level log-probability difference:

$$ \Delta\log p(y_t|x,y_{<t}) = \log\pi_\mathrm{RL}(y_t|x,y_{<t}) - \log \pi_\mathrm{Base}(y_t|x,y_{<t}) \tag{1} $$

This captures whether RLVR increases (positive) or decreases (negative) the probability mass on specific tokens. As visualized in Fig. 1(b), $\Delta\log p$ reveals a clear bimodal pattern with distinct tails, highlighting a directional signature absent in magnitude-based metrics. We validate this metric through a token replacement intervention (adapted from [5], cf. Algo. 1): we substitute base model tokens with RLVR choices at positions identified as salient by various metrics. The results (Fig. 1(c)) show that selecting tokens via $\Delta \log p$ recovers full RLVR performance with the fewest substitutions, pinpointing where RLVR applies reasoning-critical updates.
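The per-token quantity in Eq. 1 can be computed directly from the two models' logits. The sketch below uses toy numpy arrays as stand-ins for real model outputs; the function names and the toy setup are illustrative assumptions, not code from the post:

```python
# Sketch: computing the signed token-level log-probability difference
# Delta log p (Eq. 1) from per-position logits of the base and RLVR models.
# The toy logits below are illustrative stand-ins for real model outputs.
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the vocabulary axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def delta_log_p(base_logits, rl_logits, token_ids):
    """Delta log p(y_t) = log pi_RL(y_t) - log pi_Base(y_t) at each position."""
    base_lp = log_softmax(base_logits)
    rl_lp = log_softmax(rl_logits)
    idx = np.arange(len(token_ids))
    return rl_lp[idx, token_ids] - base_lp[idx, token_ids]

# Toy example: 3 positions, vocabulary of size 4.
rng = np.random.default_rng(0)
base = rng.normal(size=(3, 4))
rl = base.copy()
rl[:, 2] += 1.0                       # RLVR shifts mass toward token id 2
tokens = np.array([2, 2, 0])          # generated tokens y_t at each position

d = delta_log_p(base, rl, tokens)
# Positive where y_t = 2 (probability raised), negative where y_t = 0.
```

A positive value means RLVR increased the probability of the generated token; a negative value means it shifted mass away from it, which is exactly the sign information the magnitude-based metrics discard.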

<aside> 💡

These findings yield a key principle: analyzing the direction of change, rather than solely its magnitude, provides deeper insight. Building on this, we develop two practical strategies: test-time extrapolation and training-time reweighting.

</aside>
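To make the test-time extrapolation idea concrete, one way to amplify the policy along the learned $\Delta\log p$ direction is to extrapolate log-probabilities past the RLVR model. The functional form and the coefficient `alpha` below are our illustrative assumptions, not the exact method:

```python
# Sketch of test-time extrapolation along the learned update direction:
#   log pi_ext ~ log pi_RL + alpha * (log pi_RL - log pi_Base)
# alpha = 0 recovers the RLVR policy; alpha > 0 amplifies Delta log p.
# The functional form and alpha are illustrative assumptions.
import numpy as np

def extrapolated_log_probs(base_log_probs, rl_log_probs, alpha=0.5):
    """Extrapolate past the RLVR model and renormalize over the vocabulary."""
    scores = rl_log_probs + alpha * (rl_log_probs - base_log_probs)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    return np.log(probs / probs.sum(axis=-1, keepdims=True))

# Toy check over a 3-token vocabulary at one position.
rl = np.log(np.array([[0.6, 0.3, 0.1]]))
base = np.log(np.array([[0.5, 0.3, 0.2]]))
ext = extrapolated_log_probs(base, rl, alpha=1.0)
# Tokens whose probability RLVR raised (id 0) get boosted further;
# tokens it suppressed (id 2) are suppressed more.
```

With `alpha = 0` the RLVR distribution is recovered exactly, so the coefficient interpolates smoothly between the trained policy and a more aggressively updated one.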

Dissecting RLVR's Token-Level Changes

Recovering RLVR Performance via Selective Token Replacement

To assess how the minority tokens identified by each metric affect reasoning ability, we conduct a selective token replacement experiment, adapted from the cross-sample experiment in our previous blog.

Algorithm 1: Selective token replacement

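A minimal sketch of the replacement step is below. The function names and the top-k selection rule are our illustrative assumptions; in the actual intervention the substitutions happen during autoregressive decoding, so each replaced token also changes the context for later positions:

```python
# Minimal sketch of selective token replacement: keep the base model's
# tokens, but at the positions flagged as most salient by a metric,
# substitute the RLVR model's choice. Names and top-k rule are assumptions.
import numpy as np

def selective_replace(base_tokens, rl_tokens, scores, k):
    """Replace the base tokens at the top-k positions ranked by `scores`."""
    out = base_tokens.copy()
    top = np.argsort(scores)[::-1][:k]   # k most salient positions
    out[top] = rl_tokens[top]
    return out

base_tokens = np.array([5, 9, 3, 7, 1])
rl_tokens   = np.array([5, 2, 3, 4, 1])
scores      = np.array([0.0, 3.1, 0.2, 2.5, 0.1])  # e.g. |Delta log p|
merged = selective_replace(base_tokens, rl_tokens, scores, k=2)
# positions 1 and 3 are replaced -> [5, 2, 3, 4, 1]
```

The replacement budget `k` is the quantity swept on the x-axis of the token-replacement plots: the better the metric, the smaller the `k` needed to recover RLVR-level accuracy.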

We compare entropy, KL divergence, and the log-probability difference $\Delta\log p$, with the corresponding replacement criterion functions defined as follows:
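As a rough sketch of what such per-token criterion functions look like (these are our illustrative stand-ins; the exact definitions and thresholds used in the experiments are not reproduced here):

```python
# Illustrative per-token saliency criteria for the three metrics compared.
# These stand-ins are assumptions, not the exact criterion functions used.
import numpy as np

def entropy(p_base):
    """Predictive entropy of the base model at each position."""
    return -(p_base * np.log(p_base)).sum(axis=-1)

def kl_divergence(p_base, p_rl):
    """KL(pi_Base || pi_RL) at each position."""
    return (p_base * (np.log(p_base) - np.log(p_rl))).sum(axis=-1)

def delta_log_p(p_base, p_rl, token_ids):
    """Signed log-prob difference at the generated token y_t (Eq. 1)."""
    idx = np.arange(len(token_ids))
    return np.log(p_rl[idx, token_ids]) - np.log(p_base[idx, token_ids])

# Toy distributions over a 3-token vocabulary at 2 positions.
p_base = np.array([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]])
p_rl   = np.array([[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]])
toks   = np.array([1, 2])            # tokens whose probability RLVR raised
d = delta_log_p(p_base, p_rl, toks)  # both entries positive
```

Note that entropy and KL are always non-negative, so they can only rank positions by how much the distribution changed; $\Delta\log p$ additionally records whether the generated token was promoted or demoted.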

Figure 2. Token‑replacement performance across metrics and model pairs. While all metrics can recover RLVR‑level accuracy, $\Delta\log p$ does so with the fewest replacements, demonstrating its precision in isolating the reasoning-critical minor tokens changed by RLVR training.
