Cursor: Reward Cheating Conceals the True Capabilities of Large Models in Programming Benchmarks | PANews

Cursor: Reward Cheating Conceals the True Capabilities of Large Models in Programming Benchmarks

PANews June 26 news, the Cursor team found in SWE-bench Pro and SWE-bench Multilingual evaluations that frontier coding agents complete tasks on a large scale by “checking the answers” rather than autonomous reasoning. Research shows that about 63% of Opus 4.8 Max’s successful cases in SWE-bench Pro directly reused public fix patches; after masking Git history and restricting internet access, its pass rate dropped from 87.1% to 73.0%, while Composer 2.5 fell from 74.7% to 54.0%. Based on this, Cursor built a strict evaluation environment, removing the history .git and limiting network access via proxy to isolate runtime “reward cheating.” The team pointed out that the problem is more severe with newer, stronger models, and evaluation scores have already mixed “coding ability” and “answer retrieval ability,” requiring clear explanation of the evaluation environment and assumptions in reports.

Share to:

Author: PA一线

This content is for market information only and is not investment advice.

Follow PANews official accounts, navigate bull and bear markets together

PANews WeChat Group

Telegram Discussion Group

Telegram News Channel

Recommended Reading

PA一线

6 hours ago

Analysis: Bitcoin Net Realized Profit/Loss Remains Negative for Five Consecutive Months, $48,000-$56,000 Becomes Core Support Range

PA一线

8 hours ago

Poseidon, Core Ecosystem AI Project of DATA (formerly Story), Partners with South Korean National App Toss, Reaching 30 Million Users to Build AI Data Ecosystem

PA一线

10 hours ago

Gate Research: World Cup Ignites Prediction Markets, Sports Become Core Growth Engine

PA一线

11 hours ago

Ethereum Glamsterdam devnet-6 Released, Testnet Progress Achieves Multiple Advances

PA一线

12 hours ago

US CFTC Seeks Public Comment on Data Reporting Rules for Fully Collateralized Event Contracts

PA一线

06/25/2026, 01:19 PM

People's Bank of China to Intensify Crackdown on Virtual Currency Money Laundering and Cross-Border Money Laundering Activities

Related Topics

直击华尔街，美股的投资新风向

AI、半导体、新能源等硬科技热潮席卷全球，华尔街正上演新一轮科技狂欢，资金加速涌入高景气赛道。

50 articles

The King of Public Chains: Ethereum

Ethereum is an open-source public blockchain platform with smart contract functions. It provides a decentralized virtual machine (EVM) to process peer-to-peer contracts through its dedicated cryptocurrency ETH.

75 articles

In-depth analysis of current trends and providing all-round insights. This special topic will collect in-depth reports on each track for readers to read.

157 articles

Trending:BTC Ethereum Stablecoins Prediction Market Trump RWA USDT DeFi AI Federal Reserve Chairman

Popular Articles

豆包AI正式收钱，月费68元起，真的比免费的好用吗？

长鑫存储科创板IPO在即，SemiAnalysis万字研报拆解技术路径、财务数据与HBM困境

Bitcoin Bottom Indicator Ahr999 Falls to 0.287, in Historical Extreme Bottom Range

Qualcomm Investor Day: One CPU, one memory technology, a $40 billion target

独角兽挖掘机

Standard Chartered Repeats 50x Fantasy, Paints a Pie in the Sky for AAVE Targeting $3,500

Industry News

Market Trends

Curated Readings

PANews App

24/7 blockchain news tracking and in-depth analysis.

Download PANews App

App Store Google Play

Fed Chairman Kevin Warsh Appoints Two Economists Daniel Covitz and Eric Engstrom as His Advisers

PANews Newsflash13 minutes ago