Author: Claude, Deep Tide TechFlow
Deep Dive: A recent test run by The New York Times in collaboration with AI startup Oumi found that Google Search's AI Overviews feature is accurate roughly 91% of the time. But given that Google processes about 5 trillion searches annually, that still translates to tens of millions of incorrect answers per hour. Worse, even when the answer is correct, more than half of the cited links fail to support it.
Google is feeding users misinformation on an unprecedented scale, and most people are completely unaware of it.
According to The New York Times, AI startup Oumi was commissioned to evaluate the accuracy of Google's AI Overviews using SimpleQA, an industry-standard benchmark developed by OpenAI. The test covered 4,326 search queries and was run twice: once in October of last year (when AI Overviews was powered by Gemini 2) and again in February of this year (after the upgrade to Gemini 3). Gemini 2 scored approximately 85% accuracy, while Gemini 3 improved to 91%.
91% sounds good, but it's a different story at Google's scale. Google processes approximately 5 trillion search queries annually; at a 9% error rate, AI Overviews would serve up roughly 51 million inaccurate answers per hour, nearly 1 million per minute.
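As a back-of-envelope check (a minimal sketch: the 5-trillion and 9% figures come from the reporting above, and it assumes errors are spread evenly and that every search triggers an AI Overview, which overstates the real exposure):

```python
# Back-of-envelope: wrong AI Overview answers per hour at Google scale.
QUERIES_PER_YEAR = 5_000_000_000_000  # ~5 trillion searches/year (Google's figure)
ERROR_RATE = 0.09                     # 9% inaccurate (Oumi's Gemini 3 result)
HOURS_PER_YEAR = 365 * 24             # 8,760 hours

errors_per_hour = QUERIES_PER_YEAR * ERROR_RATE / HOURS_PER_YEAR
errors_per_minute = errors_per_hour / 60

print(f"{errors_per_hour:,.0f} wrong answers/hour")     # ~51,369,863
print(f"{errors_per_minute:,.0f} wrong answers/minute")  # ~856,164
```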
The answer is correct, but the source is wrong.
More worrying than raw accuracy is whether the cited sources actually "ground" the answers.
Oumi's data shows that in the Gemini 2 era, 37% of correct answers suffered from "unsubstantiated citations": the links attached to the AI summary did not actually support the information it provided. After the upgrade to Gemini 3, that proportion actually rose, jumping to 56%. In other words, the model is getting more answers right while becoming worse at showing its work.
Oumi CEO Manos Koukoumidis's question hits the nail on the head: "Even if the answer is correct, how do you know it's correct? How do you verify it?"
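Conceptually, what Oumi is measuring is an entailment problem: does the cited page actually support the claim in the summary? Below is a minimal sketch of such a check using an off-the-shelf natural language inference model. This illustrates the general technique only; it is not Oumi's HallOumi pipeline, and the model choice and threshold are purely illustrative assumptions.

```python
# Illustrative grounding check: does a cited source actually support a claim?
# Generic NLI-based sketch -- NOT Oumi's HallOumi pipeline.
from transformers import pipeline

# Off-the-shelf natural language inference model (illustrative choice).
nli = pipeline("text-classification", model="roberta-large-mnli")

def citation_supports(claim: str, source_text: str, threshold: float = 0.7) -> bool:
    """Return True if the source text entails the claim."""
    # NLI convention: premise = the cited source, hypothesis = the AI's claim.
    result = nli({"text": source_text, "text_pair": claim})[0]
    return result["label"] == "ENTAILMENT" and result["score"] >= threshold

claim = "The Eiffel Tower is about 330 meters tall."
source = "Following a 2022 antenna installation, the Eiffel Tower stands 330 m high."
print(citation_supports(claim, source))  # True only if the model finds entailment
```

An "unsubstantiated citation" in Oumi's sense would be a correct answer whose attached links all fail a check like this one.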
The problem is compounded by AI Overviews' heavy reliance on low-quality sources. Oumi found that Facebook and Reddit were the second and fourth most cited sources, respectively, and that Facebook was cited in 7% of inaccurate answers versus 5% of accurate ones.
A BBC reporter's fake article "poisoned" Google's AI results within 24 hours.
Another serious flaw of AI Overviews is its susceptibility to manipulation.
As a test, a BBC reporter published a deliberately fabricated article; within 24 hours, Google's AI summary was presenting the false information to users as fact.
This means that anyone familiar with how the system works could potentially "poison" AI search results by posting fake content and boosting its traffic. Google spokesperson Ned Adriance responded that the search AI feature is built on the same ranking and security mechanisms used to block spam, and stated that "most of the examples in the tests were unrealistic queries that people wouldn't actually search for."
Google countered: The test itself was flawed.
Google has raised several concerns regarding Oumi's research. A Google spokesperson stated that the research "contains serious flaws," citing reasons including: the SimpleQA benchmark itself contains inaccurate information; Oumi uses its own AI model, HallOumi, to evaluate the performance of another AI, potentially introducing additional errors; and the test content does not reflect users' actual search behavior.
Google's internal testing showed that Gemini 3, when run standalone outside the Google Search framework, produced false outputs as much as 28% of the time. Google emphasized, however, that AI Overviews leverages the search ranking system to improve accuracy and outperforms the bare model.
However, as PCMag's review points out, there is a logical paradox here: if your defense is that "the report saying our AI is inaccurate also relied on potentially inaccurate AI," that probably won't increase users' confidence in the accuracy of your product.

