Qwen Pilot, Alibaba Group | Published on March 20, 2026

📝 Paper PDF | 🤗 Hugging Face | 🤖 ModelScope | 🐱 GitHub | 📊 SwanLab

💡 Summary (TL;DR)

<aside> ⚡

Figure 1: FIPO vs. Baselines Performance Comparison on AIME2024. FIPO demonstrates that pure RL training alone is sufficient to not only outperform other pure RL baselines (the reproduced DAPO and Deepseek-R1-Zero-32B), but also surpass o1-mini. This performance gain is accompanied by the generation of significantly longer responses on average.

</aside>

Introduction

The defining shift in modern LLM development is the move toward inference-time scaling. Models like OpenAI’s o-series [1], Google’s Gemini [2] and DeepSeek’s R1 [3] have proven that by scaling Chain-of-Thought (CoT) through Reinforcement Learning with Verifiable Rewards (RLVR), we can unlock elite-level reasoning. However, the path to reproducing these results remains opaque. Most state-of-the-art recipes rely on undisclosed algorithms and, crucially, a heavy dependency on large-scale, long-CoT synthetic data to "seed" the model’s reasoning capabilities. For the community, this raises a fundamental question:

🤔 Can we elicit deep reasoning from a base model that initially exhibits no such tendencies, without relying on long CoT synthetic data or opaque "warm-up" SFT?

To bridge this gap, we introduce Future-KL Influenced Policy Optimization (FIPO). Instead of relying on a Critic to estimate token-level values, FIPO obtains a dense, token-level advantage signal by incorporating the Future-KL divergence.

Future-KL Influenced Policy Optimization:

Probability Shift captures the update signal:

Our method is grounded in our recent investigations into the dynamics of Large Language Models (LLMs) during reinforcement learning.

<aside> 🔑

Recap:

For more details, see our blogs:

**On the Direction of RLVR Updates for LLM Reasoning [11]**

**Towards an Understanding of RLVR, Part I: The Ship of Theseus of Language Models [12]**

</aside>

Inspired by these insights, we identify the token-level probability shift as the atomic unit for our credit assignment mechanism. Formally, we define the probability shift at time step t as the log-space difference between the current policy and the old policy:

$$ \Delta\log p(y_t|x,y_{<t}) = \log\pi_{\theta}(y_t|x,y_{<t}) - \log \pi_{\mathrm{old}}(y_t|x,y_{<t}) \tag{1} $$

<aside> 💡

This term serves as a differential signal capturing the instantaneous policy drift:

Unlike traditional KL penalties, which treat this drift primarily as a regularization cost to be minimized, we interpret $\Delta\log p_{t}$ as a directional signal of behavioral adjustment, thereby explicitly coupling the optimization objective to the generative dynamics.

</aside>
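As a concrete illustration of Eq. (1), the probability shift is just an elementwise difference of per-token log-probabilities. A minimal NumPy sketch (the function name and toy numbers are ours, not from the paper):

```python
import numpy as np

def probability_shift(logp_new, logp_old):
    """Per-token probability shift (Eq. 1): Delta log p_t = log pi_theta - log pi_old,
    evaluated on the sampled tokens y_t of a rollout."""
    return np.asarray(logp_new) - np.asarray(logp_old)

# Toy example: probabilities of three sampled tokens before and after an update.
logp_old = np.log([0.20, 0.50, 0.10])
logp_new = np.log([0.25, 0.45, 0.10])
delta = probability_shift(logp_new, logp_old)
# Positive entries mean the update reinforced that token,
# negative entries mean it was suppressed, zero means no drift.
```

The sign of each entry is exactly the "directional signal" discussed above: it tells us whether the policy moved toward or away from the sampled token.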

Future-KL:

While $\Delta\log p_{t}$ captures the local distributional shift, reasoning is inherently a sequential process in which the true significance of a token depends on the trajectory it initiates. To capture this causal influence, we define Future-KL as the cumulative signed probability shift from the current step $t$ to the end of the sequence $T$:

$$ \mathrm{FutureKL}_t = \sum_{k=t}^{T} M_k \cdot \gamma^{k - t} \cdot \Delta \log p_k, \quad M_k = \mathbb{I}\left( \frac{\pi_\theta(y_k|x,y_{<k})}{\pi_{\mathrm{old}}(y_k|x,y_{<k})} \le c \right).\tag{2} $$

This summation is mathematically equivalent to the log-likelihood ratio of the joint probability distributions for the subsequent trajectory. It can thus be interpreted as a sample-based estimate of the KL divergence restricted to the future horizon, measuring the cumulative deviation of the current policy from the old policy over the remainder of the trajectory. We therefore term this metric Future-KL.
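The quantity in Eq. (2), and its equivalence to a joint log-likelihood ratio, can be sketched directly. In the code below the function name, the toy numbers, and the choice of clipping threshold `c` are our own illustrative assumptions:

```python
import numpy as np

def future_kl(logp_new, logp_old, gamma=1.0, c=10.0):
    """FutureKL_t per Eq. (2): discounted cumulative signed probability shift
    over the suffix y_{t:T}, masking out tokens whose probability ratio
    pi_theta / pi_old exceeds the threshold c."""
    delta = np.asarray(logp_new) - np.asarray(logp_old)   # Delta log p_k
    mask = (np.exp(delta) <= c).astype(float)             # M_k
    T = len(delta)
    out = np.empty(T)
    for t in range(T):
        ks = np.arange(t, T)
        out[t] = np.sum(mask[t:] * gamma ** (ks - t) * delta[t:])
    return out

# Toy rollout of four tokens.
logp_old = np.log([0.20, 0.50, 0.10, 0.30])
logp_new = np.log([0.25, 0.45, 0.10, 0.35])
fkl = future_kl(logp_new, logp_old, gamma=1.0, c=10.0)

# With gamma = 1 and no clipping active, FutureKL_0 equals the
# log-likelihood ratio of the joint probability of the whole trajectory.
joint_ratio_0 = np.sum(np.asarray(logp_new) - np.asarray(logp_old))
```

With `gamma = 1` and an inactive mask, the sum telescopes into the log-ratio of the suffix's joint probabilities, which is the sample-based Future-KL interpretation described above.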