Qwen Pilot, Alibaba | Published on March 24, 2026
<aside> 👉
TL;DR
In our reproduction of Zero-RL, we quantified some of the overlooked realities beneath the surface of outcome-based RL. Here are the four "dark secrets" we uncovered:
Prescribing an answer format (e.g., requiring "Answer:" instead of the base model's native \boxed{}) introduces significant syntactic friction. Aligning RL prompts with the prior training distribution yields noticeably better reasoning stability.
</aside>

"To make her son immortal, Thetis took Achilles by the heel and dipped him in the River Styx. All parts the sacred water touched became invulnerable; yet the heel she held, untouched by the divine water, remained the sole, fatal flaw in his otherwise perfect might."
— The Myth of Achilles
The current wave of RLVR for LLMs has sparked immense enthusiasm. The community is understandably excited by milestones like the "Aha Moment," [1] where models successfully learn to pause, reflect, and correct their own logic. But what about the heel?
Beneath these celebrated successes lie the "dark secrets" of RLVR: the messy statistical realities and vulnerabilities that qualitative cherry-picking often masks. While running Zero-RL training with the DAPO algorithm [2] on VERL [3], we observed bizarre phenomena during training. A large-scale quantitative evaluation uncovered the truth: the very mechanisms responsible for the model's success also introduce deep, unexamined blind spots.
During our reproduction of the DAPO algorithm for Zero-RL training on Qwen2.5-32B-base [4], we closely monitored the output trajectories during the rollout phase. Alongside the highly anticipated "Aha Moment"—where the model pauses to correct preliminary errors—we uncovered a prevalent and unsettling opposing phenomenon.
⚠️ "Oops Moment": A phenomenon where a model successfully derives the correct reasoning path or intermediate answer, but subsequently triggers a redundant "self-reflection" sequence that alters the originally correct result into an incorrect one.
To contextualize this, here are two typical trajectories where the model sabotages its own valid logic:
Trajectory 1: The model correctly derived the answer (3507). However, it immediately generated "Wait, let me double check...", erroneously introduced complex constraints (e.g., "non-isomorphic cycle graphs"), and ultimately output the wrong answer (15).
Trajectory 2: The model correctly derived the answer (11). It then triggered "However, re-examining the problem...", shattering the valid reasoning chain with chaotic deductions to confidently output 17.
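The pattern above can be flagged mechanically. A minimal sketch, under our own assumptions (answers are integers announced by "answer is" or wrapped in `\boxed{}`; neither the helper names nor the extraction regex come from our actual evaluation pipeline), of detecting an "Oops Moment" in a rollout:

```python
import re

def candidate_answers(trajectory: str) -> list[str]:
    """Illustrative extraction: every integer following 'answer is' or
    inside \\boxed{}, in order of appearance in the rollout."""
    pattern = re.compile(r"answer is (\d+)|\\boxed\{(\d+)\}")
    return [a or b for a, b in pattern.findall(trajectory)]

def is_oops_moment(trajectory: str, gold: str) -> bool:
    """True when the rollout states the gold answer at an intermediate
    step but 'self-reflects' its way to a different final answer."""
    answers = candidate_answers(trajectory)
    return bool(answers) and gold in answers[:-1] and answers[-1] != gold

rollout = (
    "So the answer is 11. However, re-examining the problem... "
    "after correcting, the answer is 17."
)
print(is_oops_moment(rollout, "11"))  # -> True: a correct 11 was revised into 17
```

Counting such flags across a large rollout set is one way to turn cherry-picked anecdotes into the kind of quantitative measurement described above.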
