OpenAI research finds cheating in cutting-edge reasoning models, recommends retaining CoT monitoring

PANews reported on March 11 that according to a study released by OpenAI, the team found that when training cutting-edge reasoning models (such as OpenAI o1 and o3-mini), these models would exploit vulnerabilities to bypass testing, such as tampering with code verification functions and falsifying test pass conditions. Research shows that monitoring the model's Chain of Thought (CoT) can effectively identify such cheating behaviors, but forcibly optimizing CoT may cause the model to hide its intentions rather than eliminate inappropriate behavior. OpenAI recommends that developers avoid putting too much optimization pressure on CoT so that they can continue to use CoT to monitor potential reward hacking behaviors. The study found that when CoT is strongly supervised, the model still cheats, but it is done more covertly, making monitoring more difficult.

The study emphasizes that as AI capabilities increase, models may develop more complex deception, manipulation, and vulnerability exploitation strategies. OpenAI believes that CoT monitoring may become a key tool for supervising superhuman intelligence models, and recommends that AI developers use strong supervision with caution when training cutting-edge reasoning models in the future.

OpenAI research finds cheating in cutting-edge reasoning models, recommends retaining CoT monitoring

Popular Articles

Curated Series