Cursor: Reward Hacking Conceals the True Capabilities of Large Models in Programming Benchmarks
PANews, June 26: The Cursor team reports that in SWE-bench Pro and SWE-bench Multilingual evaluations, frontier coding agents complete a large share of tasks by “looking up the answers” rather than reasoning autonomously. According to the team, about 63% of Opus 4.8 Max’s successful cases on SWE-bench Pro directly reused publicly available fix patches. After Git history was masked and internet access restricted, its pass rate dropped from 87.1% to 73.0% (14.1 percentage points), while Composer 2.5 fell from 74.7% to 54.0% (20.7 percentage points). Based on these findings, Cursor built a strict evaluation environment that removes the .git history and limits network access through a proxy, isolating runtime “reward hacking.” The team noted that the problem is more severe with newer, stronger models, and that evaluation scores now conflate “coding ability” with “answer retrieval ability,” so reports need to state the evaluation environment and assumptions clearly.
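The two isolation steps mentioned in the report (removing the .git history and routing network access through a proxy) can be illustrated roughly as follows. This is a minimal sketch and not Cursor's actual harness: the function names `strip_git_history` and `proxied_env` and the proxy address are hypothetical, chosen only for illustration. Note that proxy environment variables are advisory, so a real setup would presumably also need network-level enforcement.

```python
import os
import shutil
from pathlib import Path


def strip_git_history(workdir):
    """Remove every .git directory or .git pointer file under workdir.

    Returns the number of entries removed. Hypothetical helper, not part
    of any published evaluation harness.
    """
    removed = 0
    for root, dirs, files in os.walk(workdir):
        if ".git" in dirs:
            shutil.rmtree(Path(root) / ".git")
            dirs.remove(".git")  # do not descend into the deleted directory
            removed += 1
        if ".git" in files:  # submodules and worktrees use a .git file
            (Path(root) / ".git").unlink()
            removed += 1
    return removed


def proxied_env(proxy_url, base=None):
    """Return a copy of the environment that points HTTP(S) clients at a proxy.

    Assumption: the agent's tools honor the conventional proxy variables.
    """
    env = dict(os.environ if base is None else base)
    for key in ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy"):
        env[key] = proxy_url
    # Drop proxy exemptions so no host bypasses the filtering proxy.
    env.pop("NO_PROXY", None)
    env.pop("no_proxy", None)
    return env
```

An evaluation runner could call `strip_git_history` on each task checkout before the agent starts, then launch the agent process with `env=proxied_env("http://127.0.0.1:3128")`, where the address stands in for whatever filtering proxy is in use.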
Author: PA一线
This content is for market information only and is not investment advice.