
Author: Guo Xiaojing, Tencent Technology
Edited by Xu Qingyang
The world's top AI models can pass medical licensing exams, write complex code, and even beat human experts in math competitions, but they repeatedly fail in a children's game called Pokémon.
This high-profile attempt began in February 2025 when a researcher at Anthropic launched a Twitch stream of "Claude playing Pokémon Red" to coincide with the release of Claude Sonnet 3.7.
Two thousand viewers flooded into the live stream. In the public chat area, viewers offered advice and encouragement to Claude, gradually transforming the live stream into a public observation of AI capabilities.
Sonnet 3.7 can, at best, be described as "playing" Pokémon, and "playing" doesn't mean "winning." It can get stuck for dozens of hours at crucial points and make basic mistakes that even a child player wouldn't.
This is not the first time Claude has tried this.
Early versions performed even worse: some wandered aimlessly on the map, some got stuck in an infinite loop, and many couldn't even leave the starting village.
Even with its significantly enhanced abilities, Claude Opus 4.5 still makes inexplicable mistakes. On one occasion, it circled outside the gym for four whole days without ever getting in, simply because it didn't realize it needed to cut down a tree blocking its path.
Why did a children's game become AI's Waterloo?
Because what Pokémon demands is precisely the ability that AI lacks most today: to reason continuously in an open world without explicit instructions, to remember decisions made hours ago, to understand implicit causal relationships, and to make long-term plans among hundreds of possible actions.
These things come easily to an 8-year-old, yet they remain an insurmountable gap for AI models that claim to "surpass humans."
01 Does the gap in toolsets determine success or failure?
In comparison, Google's Gemini 2.5 Pro successfully completed a Pokémon game of comparable difficulty in May 2025. Google CEO Sundar Pichai even joked in public that the company had taken a step toward "Artificial Pokémon Intelligence."
However, this result cannot be simply attributed to the Gemini model being "smarter" on its own.
The key difference lies in the toolset used by the model. Joel Zhang, the independent developer who runs the Pokémon livestream on Gemini, likens the toolset to an "Iron Man suit": the AI doesn't enter the game empty-handed, but is placed in a system that can call upon various external capabilities.
Gemini's toolset offers more support, such as translating game visuals into text to compensate for the model's weakness in visual understanding, and providing customized puzzle-solving and path-planning tools. In contrast, Claude's toolset is simpler, and its approach more directly reflects the model's true capabilities in perception, reasoning, and execution.
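Neither company has published its harness in full, but the division of labor described above can be sketched as a minimal tool registry. All names and behaviors here (`read_screen`, `find_path`, the canned return strings) are illustrative assumptions, not the actual APIs used in either livestream:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    """A capability the harness exposes to the model."""
    name: str
    description: str            # injected into the model's prompt
    run: Callable[[str], str]

def read_screen(_: str) -> str:
    # Stand-in for the vision step: convert raw pixels into a
    # textual description the language model can reason over.
    return "Player at (4, 7). Tree blocking the exit to the north."

def find_path(destination: str) -> str:
    # Stand-in for a pathfinding helper (e.g. A* over the tile map).
    return f"Route to {destination}: UP, UP, LEFT, UP"

TOOLBOX = [
    Tool("read_screen", "Describe the current game screen as text", read_screen),
    Tool("find_path", "Plan a walking route to a named location", find_path),
]

# A richer toolbox (Gemini's, per the article) shifts work off the model;
# a sparser one (Claude's) exposes the model's raw ability more directly.
prompt_section = "\n".join(f"- {t.name}: {t.description}" for t in TOOLBOX)
print(prompt_section)
```

The design question the article raises is exactly this trade-off: every helper added to `TOOLBOX` makes the run more likely to succeed, and less informative about the model itself.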
In routine tasks, these differences are not obvious.
When a user asks a chatbot a question that requires looking something up online, the model automatically invokes a search tool. In long-horizon tasks like Pokémon, however, differences between toolsets are magnified to the point of deciding success or failure.
02 Turn-based gameplay exposes the shortcomings of AI's "long-term memory"
Because Pokémon employs a strict turn-based system and does not require immediate reactions, it has become an excellent "training ground" for testing AI. At each step, the AI only needs to combine the current screen, the goal prompt, and the available actions, then reason its way to a clear instruction such as "press the A button."
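The step just described reduces to a small observe-reason-act loop. This is a minimal sketch under assumed names (`agent_step`, the stub `ask_model` callback); a real harness would call an actual LLM API in place of the stub:

```python
VALID_BUTTONS = {"A", "B", "UP", "DOWN", "LEFT", "RIGHT", "START"}

def agent_step(screen_text: str, goal: str, ask_model) -> str:
    """One turn: bundle observation and goal, get one button press back."""
    prompt = (
        f"Goal: {goal}\n"
        f"Screen: {screen_text}\n"
        f"Valid actions: {sorted(VALID_BUTTONS)}\n"
        "Reply with exactly one button name."
    )
    choice = ask_model(prompt).strip().upper()
    # Guard against the model inventing an action the emulator lacks.
    return choice if choice in VALID_BUTTONS else "A"

# Stub model standing in for a real LLM call.
print(agent_step("Dialogue box open.", "Talk to Professor Oak", lambda p: "a"))
# → A
```

Each call is self-contained, which is precisely the weakness the next paragraph describes: nothing in this loop carries memory from one step to the next unless the harness injects it.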
This seems to be the interaction format that large language models excel at.
The crux of the problem lies in the "discontinuity" of the time dimension. Claude Opus 4.5 has accumulated over 500 hours of play and executed roughly 170,000 steps, but the model is effectively re-initialized at each step, forced to search for clues within a narrow context window. This mechanism makes it like an amnesiac relying on sticky notes to maintain a picture of the world: it cycles endlessly through fragments of information and never achieves the qualitative leap in experience that a human player makes.
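The "amnesiac with sticky notes" mechanism can be illustrated with a fixed-size note buffer. This is a simplified sketch, not Anthropic's actual memory scheme; the class name and window size are invented for illustration:

```python
from collections import deque

class NotepadMemory:
    """Fragmented memory: only the most recent notes fit in context."""

    def __init__(self, window: int = 5):
        # Older notes silently fall off the front once the window fills.
        self.notes = deque(maxlen=window)

    def jot(self, note: str) -> None:
        self.notes.append(note)

    def context(self) -> str:
        """Everything the model gets to 'remember' at the next step."""
        return "\n".join(self.notes)

mem = NotepadMemory(window=3)
for step in range(1, 6):
    mem.jot(f"Step {step}: explored route {step}")

# Steps 1 and 2 are gone: the agent keeps only its newest sticky notes,
# so a decision made hours ago may be invisible at the current step.
print(mem.context())
```

Real harnesses use summarization rather than blind truncation, but the failure mode is the same: whatever falls outside the window is simply not part of the model's world.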
In fields like chess and Go, AI systems have long surpassed human capabilities, but these systems are highly customized for specific tasks. In contrast, Gemini, Claude, and GPT, as general-purpose models, frequently defeat humans in exams and programming competitions, yet repeatedly falter in a children's game.
This contrast itself is highly enlightening.
According to Joel Zhang, the core challenge facing AI is its inability to consistently execute a single, clearly defined goal over a long period of time. "If you want an agent to do real work, it can't forget what it did five minutes ago," he points out.
This capability is an indispensable prerequisite for realizing the automation of cognitive labor.
Independent researcher Peter Whidden offered a more intuitive description. He once open-sourced a Pokémon algorithm based on traditional AI. "The AI knows almost everything about Pokémon," he said. "It's trained on massive amounts of human data and clearly knows the correct answers. But when it comes to execution, it becomes incredibly clumsy."
In the game, this gap between "knowing" and "doing" is constantly amplified: the model may know it needs to find a certain item yet be unable to reliably locate itself on the two-dimensional map; it knows it should talk to an NPC yet repeatedly fails at pixel-level movement.
03 Behind the Evolution of Capabilities: The Unbridged Gap of "Instinct"
Nevertheless, the progress in AI is clearly visible. Claude Opus 4.5 significantly outperforms its predecessors in self-recording and visual understanding, allowing it to progress further in games. After completing Pokémon Blue, Gemini 3 Pro went on to tackle the even more challenging Pokémon Crystal without losing a single battle, something Gemini 2.5 Pro could never have achieved.
Meanwhile, Anthropic's Claude Code toolset lets models write and run their own code. It has been applied to retro games such as RollerCoaster Tycoon and is reportedly able to manage a virtual theme park successfully.
These cases reveal a less intuitive reality: AI equipped with the right toolset can be highly efficient in knowledge-based tasks such as software development, accounting, and legal analysis, even if it still struggles to handle tasks requiring real-time responses.
The Pokémon experiment also revealed another intriguing phenomenon: models trained on human data exhibit behavioral characteristics similar to those of humans.
In its technical report on Gemini 2.5 Pro, Google noted that the model's reasoning quality significantly decreased when the system simulated a "panic state," such as when a Pokémon was about to faint.
When Gemini 3 Pro finally completed Pokémon Blue, it left itself a note that the mission did not require: "To end poetically, I will return to my original home, have one last conversation with my mother, and retire my character."
In Joel Zhang's view, this behavior was unexpected and carried a certain human emotional projection.
04 The "digital Long March" that AI struggles to overcome extends far beyond Pokémon
Pokémon is not an isolated case. In the pursuit of artificial general intelligence (AGI), developers have found that even an AI that can ace the bar exam still hits insurmountable walls in complex games like the following.
NetHack: The Abyss of Rules

This 1980s dungeon crawler is a nightmare for AI research: it is highly randomized and has a "permanent death" mechanism. Facebook AI Research found that even models capable of writing code perform far worse than human beginners at NetHack, which demands common sense, logic, and long-term planning.
Minecraft: The Disappearance of a Sense of Purpose

While AI can craft wooden pickaxes and even mine diamonds, independently defeating the Ender Dragon remains a fantasy. In open worlds, AI often "forgets" its original purpose during resource gathering that can stretch across dozens of hours, or gets hopelessly lost navigating complex terrain.
StarCraft II: The Gap Between Generality and Specialization

While customized models have defeated professional players, general-purpose models like Claude and Gemini crumble almost instantly when made to play directly from visual input. They remain inadequate at handling the uncertainty of the "fog of war" and at balancing micro-management with macro-level construction.
RollerCoaster Tycoon: An Imbalance Between the Micro and the Macro

Managing a theme park requires tracking the status of thousands of guests. Even with rudimentary management capabilities, Claude Code is easily overwhelmed by large-scale financial collapses or unforeseen events; a single lapse in reasoning can bankrupt the park.
Elden Ring and Sekiro: The Gap in Physics Feedback

Games with this kind of fast, physical feedback are extremely unfriendly to AI. Given current visual-processing latency, the character has often already died while the AI is still "thinking" about the boss's move. Millisecond-level reaction requirements sit above the natural ceiling of the model's turn-by-turn interaction loop.
05 Why has Pokémon become a litmus test for AI?
Today, Pokémon is gradually becoming an informal yet highly persuasive benchmark in the field of AI evaluation.
The models from Anthropic, OpenAI, and Google garnered hundreds of thousands of comments on Twitch livestreams. Google detailed Gemini's game progress in a technical report, and Pichai publicly mentioned the achievement at the I/O developer conference. Anthropic even set up a "Claude Plays Pokémon" demonstration area at an industry conference.
“We’re a group of super tech enthusiasts,” admitted David Hershey, head of applied AI at Anthropic. But he emphasized that it’s not just for fun.
Unlike traditional benchmarks that rely on one-off question-and-answer sessions, Pokémon can continuously track a model's reasoning, decision-making, and goal-oriented progress over a very long period of time, which is closer to the complex tasks that humans would like AI to perform in the real world.
The challenges AI faces in Pokémon continue to this day. But it is precisely these recurring difficulties that clearly outline the uncharted boundaries of general artificial intelligence.
Wuji, a special contributor, also contributed to this article.
