Qwen Pilot, Alibaba | Published on March 24, 2026
<aside> 👉
TL;DR
In our reproduction of Zero-RL, we quantified some of the overlooked realities beneath the surface of outcome-based RL. Here are the four "dark secrets" we uncovered:
Prescribing an answer format (e.g., requiring "Answer:" instead of the base model's native \boxed{}) introduces significant syntactic friction. Aligning RL prompts with the prior training distribution yields noticeably better reasoning stability.
</aside>

"To make her son immortal, Thetis took Achilles by the heel and dipped him in the River Styx. All parts the sacred water touched became invulnerable; yet the heel she held, untouched by the divine water, remained the sole, fatal flaw in his otherwise perfect might."
— The Myth of Achilles
The current wave of RLVR for LLMs has sparked immense enthusiasm. The community is understandably excited by milestones like the "Aha Moment," [1] where models successfully learn to pause, reflect, and correct their own logic. But what about the heel?
Beneath these celebrated successes lie the "dark secrets" of RLVR: the messy statistical realities and vulnerabilities that qualitative cherry-picking often masks. While running Zero-RL training with the DAPO algorithm [2] on VERL [3], we observed bizarre phenomena during training. A large-scale quantitative evaluation uncovered the truth: the very mechanisms responsible for the model's success also introduce deep, unexamined blind spots.
During our reproduction of the DAPO algorithm for Zero-RL training on Qwen2.5-32B-base [4], we closely monitored the output trajectories during the rollout phase. Alongside the highly anticipated "Aha Moment"—where the model pauses to correct preliminary errors—we uncovered a prevalent and unsettling opposing phenomenon.
⚠️ "Oops Moment": A phenomenon where a model successfully derives the correct reasoning path or intermediate answer, but subsequently triggers a redundant "self-reflection" sequence that alters the originally correct result into an incorrect one.
To contextualize this, here are two typical trajectories where the model sabotages its own valid logic:
Trajectory 1: The model correctly derived the answer (3507). However, it immediately generated "Wait, let me double check...", erroneously introduced complex constraints (e.g., "non-isomorphic cycle graphs"), and ultimately output the wrong answer (15).
Trajectory 2: The model correctly derived the answer (11). It then triggered "However, re-examining the problem...", shattering the valid reasoning chain with chaotic deductions to confidently output 17.
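The pattern above can be flagged mechanically. A minimal sketch, under our own assumptions (answers are integers announced by "answer is" or wrapped in `\boxed{}`; neither the helper names nor the extraction regex come from our actual evaluation pipeline), of detecting an "Oops Moment" in a rollout:

```python
import re

def candidate_answers(trajectory: str) -> list[str]:
    """Illustrative extraction: every integer following 'answer is' or
    inside \\boxed{}, in order of appearance in the rollout."""
    pattern = re.compile(r"answer is (\d+)|\\boxed\{(\d+)\}")
    return [a or b for a, b in pattern.findall(trajectory)]

def is_oops_moment(trajectory: str, gold: str) -> bool:
    """True when the rollout states the gold answer at an intermediate
    step but 'self-reflects' its way to a different final answer."""
    answers = candidate_answers(trajectory)
    return bool(answers) and gold in answers[:-1] and answers[-1] != gold

rollout = (
    "So the answer is 11. However, re-examining the problem... "
    "after correcting, the answer is 17."
)
print(is_oops_moment(rollout, "11"))  # -> True: a correct 11 was revised into 17
```

Counting such flags across a large rollout set is one way to turn cherry-picked anecdotes into the kind of quantitative measurement described above.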
