On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

Qwen Pilot, Alibaba

<aside> 👉

TL;DR

Existing work has established that RLVR induces sparse updates in language models, but these analyses have primarily focused on the magnitude of changes while largely overlooking their direction. In this work, we show that directional information—captured by the signed log-probability difference $\Delta\log p$ between base and RLVR models—more effectively identifies reasoning-critical tokens, as validated through token-replacement interventions. Based on this insight, we propose:

  1. A test-time extrapolation method that amplifies the policy along the learned $\Delta\log p$ direction to improve reasoning accuracy without additional training.
  2. A training-time reweighting method that upweights low-probability tokens (which correspond to high $\Delta\log p$), yielding consistent gains over the RLVR baseline (DAPO) across models and benchmarks. </aside>

Introduction

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key algorithmic driver behind recent advances in the reasoning ability of LLMs. To understand how RLVR works, a common approach is to compare the final trained model ($\pi_\mathrm{RL}$) against its base counterpart ($\pi_\mathrm{Base}$). Prior analyses consistently show that these changes are sparse: studies have linked this sparsity to high-entropy tokens [1] and selective gradient updates [2,3], and have measured it via distributional divergences [4,5] (detailed in our previous blog [5]).

<aside> 👉

However, these analyses mostly focus on the magnitude of change. As shown in Fig. 1(b), magnitude-based metrics such as entropy and KL divergence produce nearly identical histograms on the base and RL models' generations. This reveals a critical limitation: magnitude alone shows that something changed, but cannot characterize how the model has transformed.

</aside>

Figure 1. (a) Token-level metrics for analyzing RLVR updates. (b) Histograms of each metric. With a log-scale y-axis, most values concentrate near zero, but only $\Delta\log p$ shows a directional shift distinguishing RLVR from the base model. (c) Token-replacement performance: $\Delta\log p$ recovers RLVR performance with the fewest replacements.


Therefore, we introduce a directional metric, the signed token-level log-probability difference:

$$ \Delta\log p(y_t|x,y_{<t}) = \log\pi_\mathrm{RL}(y_t|x,y_{<t}) - \log \pi_\mathrm{Base}(y_t|x,y_{<t}) \tag{1} $$

This captures whether RLVR increases (positive) or decreases (negative) the probability mass on specific tokens. As visualized in Fig. 1(b), $\Delta\log p$ reveals a clear bimodal pattern with distinct tails, highlighting a directional signature absent in magnitude-based metrics. We validate this metric through a token replacement intervention (adapted from [5], cf. Algo. 1): we substitute base model tokens with RLVR choices at positions identified as salient by various metrics. The results (Fig. 1(c)) show that selecting tokens via $\Delta \log p$ recovers full RLVR performance with the fewest substitutions, pinpointing where RLVR applies reasoning-critical updates.
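The per-token quantity in Eq. 1 can be computed directly from the two models' logits. The sketch below uses toy numpy arrays as stand-ins for real model outputs; the function names and the toy setup are illustrative assumptions, not code from the post:

```python
# Sketch: computing the signed token-level log-probability difference
# Delta log p (Eq. 1) from per-position logits of the base and RLVR models.
# The toy logits below are illustrative stand-ins for real model outputs.
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the vocabulary axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def delta_log_p(base_logits, rl_logits, token_ids):
    """Delta log p(y_t) = log pi_RL(y_t) - log pi_Base(y_t) at each position."""
    base_lp = log_softmax(base_logits)
    rl_lp = log_softmax(rl_logits)
    idx = np.arange(len(token_ids))
    return rl_lp[idx, token_ids] - base_lp[idx, token_ids]

# Toy example: 3 positions, vocabulary of size 4.
rng = np.random.default_rng(0)
base = rng.normal(size=(3, 4))
rl = base.copy()
rl[:, 2] += 1.0                       # RLVR shifts mass toward token id 2
tokens = np.array([2, 2, 0])          # generated tokens y_t at each position

d = delta_log_p(base, rl, tokens)
# Positive where y_t = 2 (probability raised), negative where y_t = 0.
```

A positive value means RLVR increased the probability of the generated token; a negative value means it shifted mass away from it, which is exactly the sign information the magnitude-based metrics discard.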

<aside> 💡

These findings yield a key principle: analyzing the direction of change, rather than solely its magnitude, provides deeper insight. Building on this, we develop two practical strategies: test-time extrapolation and training-time reweighting.

</aside>
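To make the test-time extrapolation idea concrete, one way to amplify the policy along the learned $\Delta\log p$ direction is to extrapolate log-probabilities past the RLVR model. The functional form and the coefficient `alpha` below are our illustrative assumptions, not the exact method:

```python
# Sketch of test-time extrapolation along the learned update direction:
#   log pi_ext ~ log pi_RL + alpha * (log pi_RL - log pi_Base)
# alpha = 0 recovers the RLVR policy; alpha > 0 amplifies Delta log p.
# The functional form and alpha are illustrative assumptions.
import numpy as np

def extrapolated_log_probs(base_log_probs, rl_log_probs, alpha=0.5):
    """Extrapolate past the RLVR model and renormalize over the vocabulary."""
    scores = rl_log_probs + alpha * (rl_log_probs - base_log_probs)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    return np.log(probs / probs.sum(axis=-1, keepdims=True))

# Toy check over a 3-token vocabulary at one position.
rl = np.log(np.array([[0.6, 0.3, 0.1]]))
base = np.log(np.array([[0.5, 0.3, 0.2]]))
ext = extrapolated_log_probs(base, rl, alpha=1.0)
# Tokens whose probability RLVR raised (id 0) get boosted further;
# tokens it suppressed (id 2) are suppressed more.
```

With `alpha = 0` the RLVR distribution is recovered exactly, so the coefficient interpolates smoothly between the trained policy and a more aggressively updated one.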

Dissecting RLVR's Token-Level Changes

Recovering RLVR Performance via Selective Token Replacement

To assess how the minority tokens identified by each metric affect reasoning ability, we conduct a selective token replacement experiment, adapted from the cross-sample experiment in our previous blog.

Algorithm 1: Selective token replacement

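A minimal sketch of the replacement step is below. The function names and the top-k selection rule are our illustrative assumptions; in the actual intervention the substitutions happen during autoregressive decoding, so each replaced token also changes the context for later positions:

```python
# Minimal sketch of selective token replacement: keep the base model's
# tokens, but at the positions flagged as most salient by a metric,
# substitute the RLVR model's choice. Names and top-k rule are assumptions.
import numpy as np

def selective_replace(base_tokens, rl_tokens, scores, k):
    """Replace the base tokens at the top-k positions ranked by `scores`."""
    out = base_tokens.copy()
    top = np.argsort(scores)[::-1][:k]   # k most salient positions
    out[top] = rl_tokens[top]
    return out

base_tokens = np.array([5, 9, 3, 7, 1])
rl_tokens   = np.array([5, 2, 3, 4, 1])
scores      = np.array([0.0, 3.1, 0.2, 2.5, 0.1])  # e.g. |Delta log p|
merged = selective_replace(base_tokens, rl_tokens, scores, k=2)
# positions 1 and 3 are replaced -> [5, 2, 3, 4, 1]
```

The replacement budget `k` is the quantity swept on the x-axis of the token-replacement plots: the better the metric, the smaller the `k` needed to recover RLVR-level accuracy.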

We compare entropy, KL divergence, and the log-probability difference $\Delta\log p$, with the corresponding replacement criterion functions defined as follows:
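As a rough sketch of what such per-token criterion functions look like (these are our illustrative stand-ins; the exact definitions and thresholds used in the experiments are not reproduced here):

```python
# Illustrative per-token saliency criteria for the three metrics compared.
# These stand-ins are assumptions, not the exact criterion functions used.
import numpy as np

def entropy(p_base):
    """Predictive entropy of the base model at each position."""
    return -(p_base * np.log(p_base)).sum(axis=-1)

def kl_divergence(p_base, p_rl):
    """KL(pi_Base || pi_RL) at each position."""
    return (p_base * (np.log(p_base) - np.log(p_rl))).sum(axis=-1)

def delta_log_p(p_base, p_rl, token_ids):
    """Signed log-prob difference at the generated token y_t (Eq. 1)."""
    idx = np.arange(len(token_ids))
    return np.log(p_rl[idx, token_ids]) - np.log(p_base[idx, token_ids])

# Toy distributions over a 3-token vocabulary at 2 positions.
p_base = np.array([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]])
p_rl   = np.array([[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]])
toks   = np.array([1, 2])            # tokens whose probability RLVR raised
d = delta_log_p(p_base, p_rl, toks)  # both entries positive
```

Note that entropy and KL are always non-negative, so they can only rank positions by how much the distribution changed; $\Delta\log p$ additionally records whether the generated token was promoted or demoted.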

Figure 2. Token‑replacement performance across metrics and model pairs. While all metrics can recover RLVR‑level accuracy, $\Delta\log p$ does so with the fewest replacements, demonstrating its precision in isolating the reasoning-critical minor tokens changed by RLVR training.
