PANews reported on April 11 that OpenAI has open-sourced a new benchmark, BrowseComp, designed to evaluate how well AI agents can locate hard-to-find information on the Internet. The benchmark contains 1,266 highly challenging questions that simulate an AI "online treasure hunt" through complex webs of information, with answers that are difficult to find but easy to verify. The questions span fields such as film, television, technology, and history, and are significantly harder than those in existing tests such as SimpleQA.
According to the AIGC Open Community, the benchmark is extremely difficult: OpenAI's own GPT-4o and GPT-4.5 achieve accuracy rates of only 0.6% and 0.9%, respectively, which is close to zero, and even GPT-4o with browsing enabled reaches just 1.9%. By contrast, OpenAI's latest agent model, Deep Research, scores 51.5%.
