Qwen Pilot, Alibaba Group | Published on March 20, 2026
📝 Paper PDF | 🤗 Hugging Face | 🤖 ModelScope | 🐱 GitHub | 📊 SwanLab
💡 Summary (TL;DR)
<aside>
⚡
- We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO-style training scales effectively, it typically relies on outcome-based rewards (ORM) that assign a uniform advantage to every token in a trajectory.
- We argue that this coarse-grained credit assignment imposes a performance ceiling, as the model cannot distinguish critical reasoning steps from trivial tokens.
- FIPO addresses this limitation by incorporating a discounted future-KL divergence into the policy update, effectively establishing a dense advantage formulation that re-weights token-level advantages based on subsequent trajectory behavior.
- Empirically, we find that this granular reinforcement allows the model to break through the length stagnation observed in standard baselines. Evaluated on Qwen2.5-32B-Base, FIPO extends the average chain-of-thought length from 4,000 to over 10,000 tokens, driving AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0%.
- Our results demonstrate that establishing a dense advantage formulation is a promising direction for evolving ORM-based GRPO algorithms to unlock the reasoning potential of base models.

Figure 1: FIPO vs. Baselines Performance Comparison on AIME2024. FIPO demonstrates that pure RL training alone is sufficient not only to outperform other pure-RL baselines (the reproduced DAPO and DeepSeek-R1-Zero-32B), but also to surpass o1-mini. This performance gain is accompanied by the generation of significantly longer responses on average.
</aside>
Introduction
The defining shift in modern LLM development is the move toward inference-time scaling. Models like OpenAI’s o-series [1], Google’s Gemini [2] and DeepSeek’s R1 [3] have proven that by scaling Chain-of-Thought (CoT) through Reinforcement Learning with Verifiable Rewards (RLVR), we can unlock elite-level reasoning. However, the path to reproducing these results remains opaque. Most state-of-the-art recipes rely on undisclosed algorithms and, crucially, a heavy dependency on large-scale, long-CoT synthetic data to "seed" the model’s reasoning capabilities. For the community, this raises a fundamental question:
🤔 Can we elicit deep reasoning from a base model that initially exhibits no such tendencies, without relying on long CoT synthetic data or opaque "warm-up" SFT?
<aside>
👉
- Recent efforts like DAPO [4] have successfully replicated GRPO-style training on clean base models. Yet, these models often hit a structural ceiling where reasoning trajectories plateau at intermediate lengths (~ 4,000 tokens). This is a direct consequence of coarse-grained credit assignment: because standard GRPO [5] distributes a uniform advantage to every token based solely on the final answer, it fails to provide the specific signal needed for the model to evolve the complex token trajectories required for elite-level tasks.
- To circumvent these sparse advantage signals, some frameworks revert to the PPO [6] framework. A notable example is Open-Reasoner-Zero[7], which incorporates a value model to provide token-level feedback and succeeds in pushing reasoning lengths to the 10,000-token regime. However, despite the added complexity of a value model, its final performance remains comparable to the more efficient, value-free DAPO.
- Other high-performing methods like VC-PPO [8], VAPO [9], and T-PPO [10] achieve comparable and even superior results, but they rely on value models pre-trained from models that have already undergone Supervised Fine-Tuning (SFT) with long-CoT data. Though effective, this methodology could introduce an external knowledge prior through the value model, creating a confounding factor: it becomes difficult to discern whether the performance gains stem from the policy optimization algorithm itself or are simply inherited from the pre-trained value model.
</aside>
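To make the coarse-grained credit assignment concrete, here is a minimal sketch of the group-normalized outcome advantage used in GRPO-style training: one scalar per trajectory, broadcast uniformly to every token. The function name and toy rewards are illustrative, not from the paper.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, seq_lens):
    """Group-normalized outcome advantage, GRPO-style.

    Each trajectory i in the group gets a single scalar
    a_i = (r_i - mean(r)) / std(r), and every token in that
    trajectory receives the same a_i -- the uniform,
    coarse-grained credit assignment discussed above.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    scalars = [(r - mu) / sigma for r in rewards]
    # Broadcast: token-level advantages carry no per-token information.
    return [[a] * n for a, n in zip(scalars, seq_lens)]

# A group of 4 sampled trajectories with binary verifier rewards.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0], seq_lens=[3, 2, 4, 3])
print(adv[0])  # [1.0, 1.0, 1.0] -- identical advantage at every position
```

Every position in a correct trajectory is reinforced equally, whether it holds a pivotal deduction or a filler token; this is the sparsity FIPO targets.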
<aside>
🧠
To bridge this gap, we introduce Future-KL Influenced Policy Optimization (FIPO). Instead of relying on a Critic to estimate token-level values, FIPO achieves advantage density by incorporating Future-KL divergence.
- By re-weighting the advantage of each token based on the cumulative behaviors of its subsequent trajectory, FIPO highlights the specific tokens that drive logical progression.
- When applied to Qwen2.5-32B-Base, a model with no prior exposure to long-CoT data, FIPO effectively breaks the 4,000-token plateau of standard baselines. It enables a progressive lengthening of reasoning chains to over 10,000 tokens, pushing AIME 2024 performance to 58.0%.
- By eschewing the need for a value model and starting from a vanilla base model, FIPO achieves performance comparable to, and in some cases superior to, these pre-trained value-model-based approaches.
- Our results show that establishing a dense advantage formulation is a promising direction for evolving ORM-based GRPO algorithms to unlock the inherent reasoning potential of base models.
</aside>
Future-KL Influenced Policy Optimization:
Probability Shift captures the update signal:
Our method is grounded in our recent investigations into the dynamics of Large Language Models (LLMs) during reinforcement learning.
<aside>
🔑
Recap:
- Sparse, Targeted Shifts: We found that RLVR works less like rewriting the model and more like applying sparse, high-impact modifications at specific token positions in reasoning.
- A metric that captures such shifts: We found that directional information, captured by the signed log-probability difference $\Delta\log p$ between the base and RLVR models, more effectively identifies reasoning-critical tokens.
For more details, see our blogs:
**On the Direction of RLVR Updates for LLM Reasoning [11]**
**Towards an Understanding of RLVR, Part I: The Ship of Theseus of Language Models [12]**
</aside>
Inspired by these insights, we identify the token-level probability shift as the atomic unit for our credit assignment mechanism. Formally, we define the probability shift at time step t as the log-space difference between the current policy and the old policy:
$$
\Delta\log p(y_t|x,y_{<t}) = \log\pi_\theta(y_t|x,y_{<t}) - \log \pi_{\text{old}}(y_t|x,y_{<t}) \tag{1}
$$
<aside>
💡
This term serves as a differential signal capturing the instantaneous policy drift:
- Positive Shift ($\Delta\log p_{t}$ > 0): Indicates that the current policy has increased the likelihood of the token relative to the old policy. This typically suggests that the training objective is reinforcing this specific reasoning step.
- Negative Shift ($\Delta\log p_{t}$ < 0): Implies that the policy is suppressing this token, and with it the trajectories it initiates, signaling that the updated model is actively down-weighting this specific step relative to the old policy.
Unlike traditional KL penalties, which treat this drift primarily as a regularization cost to be minimized, we interpret $\Delta\log p_{t}$ as a directional signal of behavioral adjustment, thereby explicitly coupling the optimization objective to the generative dynamics.
</aside>
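Eq. (1) reduces to an elementwise difference of log-probabilities gathered at the sampled tokens. A minimal sketch, assuming the per-token log-probs of both policies are already available (function and variable names are our own):

```python
def probability_shift(logp_new, logp_old):
    """Per-token probability shift (Eq. 1).

    Inputs are lists holding log pi_theta(y_t | x, y_<t) and
    log pi_old(y_t | x, y_<t) at each sampled token y_t; the output
    is the signed drift Delta log p_t at every position.
    """
    return [new - old for new, old in zip(logp_new, logp_old)]

# Toy 3-token trajectory: a positive entry means the update is
# reinforcing that token, a negative entry means it is suppressed.
logp_old = [-2.0, -0.5, -1.2]
logp_new = [-1.5, -0.5, -2.0]
print(probability_shift(logp_new, logp_old))  # [0.5, 0.0, -0.8]
```

In a real training loop these log-probs would come from gathering each policy's logits at the sampled token ids; the sign pattern is what Future-KL aggregates next.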
Future-KL:
While $\Delta\log p_{t}$ captures the local distributional shift, reasoning is inherently a sequential process: a token's true significance depends on the trajectory it initiates. To capture this causal influence, we define Future-KL as the cumulative signed probability shift from the current step t to the end of the sequence T:
$$
\mathrm{FutureKL}_t = \sum_{k=t}^{T} M_k \cdot\gamma^{k - t} \cdot \Delta \log p_k, \quad
M_k = \mathbb{I}\left( \frac{\pi_\theta(y_k|x,y_{<k})}{\pi_{\text{old}}(y_k|x,y_{<k})} \le c \right).\tag{2}
$$
This summation is mathematically equivalent to the log-likelihood ratio of the joint probability distributions over the subsequent trajectory. It can thus be interpreted as a sample-based estimate of the KL divergence restricted to the future horizon, measuring the cumulative deviation of the current policy from the old policy over the remainder of the trajectory. We therefore term this metric Future-KL.
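Because the discounted sum in Eq. (2) telescopes, all T values can be computed in a single backward pass via the recursion $F_t = M_t \Delta\log p_t + \gamma F_{t+1}$. A minimal sketch, assuming per-token shifts at the sampled tokens (so the importance ratio equals $\exp(\Delta\log p_k)$); the default values of $\gamma$ and $c$ here are illustrative placeholders, not the paper's settings:

```python
import math

def future_kl(delta_logp, gamma=0.99, c=10.0):
    """Discounted Future-KL (Eq. 2) at every position t, in O(T).

    FutureKL_t = sum_{k>=t} M_k * gamma^(k-t) * delta_logp_k, where the
    indicator M_k drops terms whose importance ratio
    pi_theta / pi_old = exp(delta_logp_k) exceeds the clip threshold c.
    Computed by the backward scan F_t = M_t*delta_logp_t + gamma*F_{t+1}.
    """
    out = [0.0] * len(delta_logp)
    running = 0.0
    for t in reversed(range(len(delta_logp))):
        mask = 1.0 if math.exp(delta_logp[t]) <= c else 0.0
        running = mask * delta_logp[t] + gamma * running
        out[t] = running
    return out

# Toy trajectory: the negative shift at the last token propagates
# backwards, discounted by gamma, into earlier tokens' Future-KL.
print(future_kl([0.5, 0.0, -0.8], gamma=0.9))
```

Note how an early token's Future-KL is dominated by what follows it, which is exactly the property used to re-weight token-level advantages.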