Qwen Pilot, Alibaba | Published on March 24, 2026

<aside> 👉

TL;DR

In our reproduction of Zero-RL, we quantitatively documented some of the overlooked realities beneath the surface of outcome-based RL. Here are the four "dark secrets" we uncovered:

"To make her son immortal, Thetis took Achilles by the heel and dipped him in the River Styx. All parts the sacred water touched became invulnerable; yet the heel she held, untouched by the divine water, remained the sole, fatal flaw in his otherwise perfect might."

— The Myth of Achilles

The current wave of reinforcement learning with verifiable rewards (RLVR) for LLMs has sparked immense enthusiasm. The community is understandably excited by milestones like the "Aha Moment" [1], where models successfully learn to pause, reflect, and correct their own logic. But what about the heel?

Beneath these celebrated successes lie the "dark secrets" of RLVR: the messy statistical realities and vulnerabilities that qualitative cherry-picking often masks. While running Zero-RL training with the DAPO algorithm [2] on VERL [3], we observed bizarre phenomena during training. A large-scale quantitative evaluation uncovered the truth: the very mechanisms responsible for the model's success also introduce deep, unexamined blind spots.

The "Oops Moment": When Self-Reflection Goes Wrong

During our reproduction of the DAPO algorithm for Zero-RL training on Qwen2.5-32B-base [4], we closely monitored the output trajectories during the rollout phase. Alongside the highly anticipated "Aha Moment", where the model pauses to correct preliminary errors, we uncovered a prevalent and unsettling counterpart.

⚠️ "Oops Moment": A phenomenon where a model successfully derives the correct reasoning path or intermediate answer, but subsequently triggers a redundant "self-reflection" sequence that alters the originally correct result into an incorrect one.
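As an illustration of the definition above, an "Oops Moment" can be flagged mechanically during rollout inspection. This is a minimal sketch, not the actual monitoring code used in our runs: it assumes answers appear in `\boxed{...}` markers and that a verifiable ground-truth string is available, then checks whether any pre-final answer was already correct while the final one is wrong.

```python
import re

def extract_answers(trajectory: str) -> list[str]:
    """Pull every \\boxed{...} answer from a reasoning trajectory, in order."""
    return re.findall(r"\\boxed\{([^}]*)\}", trajectory)

def is_oops_moment(trajectory: str, ground_truth: str) -> bool:
    """Flag trajectories where an intermediate answer matched the ground
    truth but the final answer (after a 'self-reflection' pass) does not."""
    answers = extract_answers(trajectory)
    if len(answers) < 2:  # need at least one intermediate plus a final answer
        return False
    reached_correct = any(a.strip() == ground_truth for a in answers[:-1])
    final_wrong = answers[-1].strip() != ground_truth
    return reached_correct and final_wrong

# Hypothetical rollout: the model derives 42, then "reflects" into 41.
traj = r"... so the sum is \boxed{42}. Wait, let me re-check... actually \boxed{41}."
print(is_oops_moment(traj, "42"))  # True
```

In practice the answer-extraction and matching rules would need to mirror whatever verifier the reward function uses; the point is only that the phenomenon is cheap to detect at scale once trajectories are logged.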

To contextualize this, here are two typical trajectories where the model sabotages its own valid logic:

[Figures: two example rollout trajectories in which the model overwrites a correct result during self-reflection]