Japan's AI Dark Horse Emerges: How a 7B Small Model Challenges Fable and Mythos?

On June 22, 2026, the new model Fugu released by Sakana AI caused a stir in the AI community. In the rigorous SWE-Bench Pro and TerminalBench benchmarks, Fugu Ultra scored 73.7 and 82.1 respectively, surpassing GPT-5.5 and Claude Opus 4.8, and was even claimed to be on par with the export-controlled Fable 5 and Mythos Preview. Surprisingly, the core of this system that tops engineering and reasoning capabilities is not a behemoth with hundreds of billions of parameters, but a model with only 7B parameters. It doesn't do the work itself, but acts as a "foreman" dynamically orchestrating the world's top large models. This counter-intuitive architecture not only shatters the myth that "parameters equal justice" but also reflects Japan's AI breakthrough path under computational constraints.

The 7B Parameter "Foreman": Fugu's Counter-Intuitive Architecture

To understand Fugu's peculiarity, one must first look at its origin. Sakana AI was founded in Tokyo in 2023 by Llion Jones, co-author of the Transformer paper, and former Google researcher David Ha. From its inception, the company has carried a "nature-inspired" gene, dedicated to using evolutionary algorithms and natural swarm intelligence to solve AI problems. In 2025, Sakana AI received investments from giants like NVIDIA and Google, reaching a valuation of over $2.5 billion. But even with the backing of giants, Japan still lacks the massive computational infrastructure and data pools found in China and the US. Under these resource constraints, Sakana AI did not choose to directly compete with hundred-billion-parameter large models but took an "orchestration" path.

Fugu's official positioning is "a multi-agent orchestration system as a single base model." In traditional AI architecture, a large model is a "monolithic behemoth"; a user inputs a prompt, and the model calculates from the first layer of the neural network to the last, outputting a result. This mode is highly efficient for simple problems, but when facing complex, multi-step engineering tasks, it often suffers from hallucinations or logical breaks.

Fugu completely changes this paradigm. Its core is a 7B parameter model trained via reinforcement learning, called the RL Conductor. This 7B model itself does not directly generate the final answer but plays the role of a "foreman." When a user submits a task through a single OpenAI-compatible API, the RL Conductor dynamically analyzes the task type and then distributes sub-tasks to the world's top models in the agent pool, such as GPT-5, Gemini 3.1 Pro, or Claude Opus 4.8. It is responsible for scheduling, verifying, and synthesizing the outputs of these models, ultimately delivering a result that has undergone multiple checks.

The theoretical support for this architecture comes from two papers at ICLR 2026: "TRINITY: An Evolved LLM Coordinator" and "Learning to Orchestrate Agents in Natural Language with the Conductor." The papers detail how a small parameter model can "command" large models through reinforcement learning. This changes the paradigm of test-time scaling. In the past, computation was mainly used for deep reasoning within the model, i.e., making the model "grind out" an answer; now, computation is used for external scheduling, verification, and synthesis. Traditional large models are all-purpose monoliths, while Fugu is a team of experts. The 7B RL Conductor proves that the number of model parameters is no longer the sole standard for capability; knowing how to call tools and external agents can also achieve a leap in performance.

The Truth Behind the Benchmarks: Rivaling Fable and Surpassing GPT-5.5

The direct reason Fugu caused a sensation was its benchmark scores on rigorous tests. In the AI industry, benchmarks are the hard currency for measuring model capability, but different benchmarks have completely different focuses. The SWE-Bench Pro and TerminalBench 2.1 chosen by Sakana AI are both "tough nuts" leaning towards real-world engineering environments.

SWE-Bench Pro focuses on software engineering capabilities, requiring models to locate and fix bugs in real code repositories. According to data published on the Sakana AI console, Fugu Ultra scored 73.7 on SWE-Bench Pro. In comparison, Claude Opus 4.8 scored 69.2, GPT-5.5 scored 58.6, and Gemini 3.1 Pro scored 54.2. On TerminalBench 2.1, another test for system operation capabilities, Fugu Ultra scored 82.1, surpassing GPT-5.5's 78.2 and Opus 4.8's 74.6. These two tests not only examine the model's code generation ability but also its logical stability and tool-calling ability in multi-step, long-chain tasks. Fugu Ultra's lead means it is less prone to mid-way crashes or deviating from the target than monolithic models when handling complex engineering problems.

More attention was drawn to the comparison between Fugu and Fable 5 and Mythos Preview. Anthropic's Fable series and the Mythos series from another frontier lab represent the current pinnacle of AI reasoning capabilities. However, due to export controls or incomplete public release, these two models were not included in Fugu's agent pool. Sakana AI officially claims that Fugu Ultra is "on par" with Fable 5 and Mythos Preview on engineering and science benchmarks, but it must be clear that this comparison is not based on tests within the same pool. Fugu's scores are based on the actual operational results of its own system, while the data for Fable and Mythos are based on scores publicly reported by their respective manufacturers.

This comparison standard has sparked some controversy in the developer community. Some argue that test conditions in different environments are difficult to fully align, making direct score comparisons unfair. However, other developers point out that in the absence of a unified testing environment, referencing manufacturer-reported data is an industry convention. Setting aside the controversy with Fable and Mythos, Fugu Ultra's surpassing of GPT-5.5 and Opus 4.8 on SWE-Bench Pro and TerminalBench 2.1 is a solid comparison under the same conditions. This surpassing is not because Fugu's underlying model is smarter than GPT-5.5, but because the RL Conductor is more precise in task decomposition and expert scheduling. In experiments requiring multiple rounds of reasoning and verification, such as AutoResearch, Rubik's Cube solving, and mechanical design, Fugu also consistently showed advantages. This indicates that when handling "long, messy, multi-step" real-world workflows, the multi-agent orchestration architecture is indeed more resilient than monolithic models.

Real-World Development Scenario Testing: Code Review and Long Session Stability

For developers and AI tool users, benchmarks are only a reference; what truly determines a model's usability is its performance in real work scenarios. Fugu underwent Beta testing with nearly 500 early users before its release, and their feedback revealed Fugu's unique value in practical applications.

Code review is one of the most frequently used AI scenarios for developers. Traditional monolithic models, when reviewing code, often only find superficial syntax errors or common logic flaws. In Beta testing, some developers reported that Fugu showed exceptional meticulousness in code reviews, capable of identifying deep-seated architectural bugs that other tools could only find a few surface-level issues for. This difference stems from Fugu's architecture. Upon receiving a code review task, the RL Conductor can separately call models specialized in static analysis, logical reasoning, and security auditing to conduct multi-angle cross-validation on the same piece of code. This "expert consultation" model naturally discovers more hidden problems than a single model's "solo effort."

Another frequently mentioned advantage is long session stability. When building AI Agent products, one of the most troublesome issues for developers is the model's "persona drift" in long sessions. As the number of dialogue turns increases, monolithic models often forget the initial setup or deviate in instruction following. Some enterprise executives reported after testing that Fugu's Persona in long sessions was exceptionally stable, with almost no drift occurring. This is because the RL Conductor itself is not responsible for maintaining long-text memory; it is only responsible for accurately selecting the most suitable underlying model to generate a reply in each dialogue turn based on the current context. This "separation of control and generation" architecture greatly enhances the stability of the Agent during long-term operation.

In the field of cybersecurity, Fugu also demonstrated end-to-end practical capabilities. In tests, Fugu could independently complete the entire process from reconnaissance, XSS/SQLi vulnerability detection, to authentication auditing, and generate a complete penetration test report, strictly adhering to instructions not to overstep and damage the system. The completion of such complex tasks relies on the RL Conductor's precise orchestration of the security toolchain and the capabilities of different large models.

Furthermore, Token efficiency is another highlight of Fugu. Traditional large models, when dealing with complex problems, often generate lengthy chains of thought, consuming a large number of Tokens. Fugu's RL Conductor avoids meaningless long CoT consumption through precise routing. Official and early tests show it can significantly reduce the waste of invalid Tokens. For developers billed per Token, this not only means cost reduction but also improved response speed.

The Achilles' Heel of Underlying Dependencies: The Cost of Multi-Agent Orchestration

Although Fugu performs impressively in architecture and benchmarks, as a tool for practical work, it is not without weaknesses. The multi-agent orchestration architecture, while bringing performance breakthroughs, also brings non-negligible risks and limitations.

The core issue is the risk of underlying dependencies. Fugu's agent pool is highly dependent on the underlying APIs of US giants like GPT, Claude, and Gemini. Although the RL Conductor has dynamic routing capabilities and can switch to other models if one fails or is rate-limited, this only circumvents the risk of a single supplier and does not, and cannot, detach from the entire US AI infrastructure ecosystem. If these underlying models collectively raise prices, impose large-scale rate limits, or change API terms, Fugu's cost structure and stability will be directly impacted. This model of "residing" on others' infrastructure has inherent fragility in commercialization and long-term stability.

Next is the trade-off between latency and cost structure. Although the RL Conductor saves on invalid Token consumption through precise routing, multi-agent orchestration inevitably involves multiple API calls and inter-model communication. For real-time interactive scenarios requiring extremely low latency, such as real-time voice conversations or high-frequency trading assistance, Fugu Ultra's "deep thinking and scheduling" time may be longer than directly calling a monolithic model. In scenarios with extremely high demands for response speed, Fugu's architectural advantage may instead become a drag on the experience.

Additionally, controversy over comparison fairness persists. As mentioned earlier, Fugu claims to rival Fable and Mythos, but the latter two were not included in Fugu's agent pool. In the developer community, some voices question whether this comparison based on manufacturer-reported data has practical reference value. After all, the performance of different models varies greatly under different task distributions, and a simple total score comparison may obscure specific strengths and weaknesses. For developers who need to accurately assess model capabilities, the lack of data from tests within the same pool means caution is still needed during selection.

Orchestration Over Compute Power: Japan's Asymmetric Breakthrough in Large Models

Stepping beyond specific product reviews, Fugu's birth has deeper implications for Japan's large model ecosystem. In the global AI arms race, Japan is in an awkward position. It lacks the continuous stream of top-tier computing power and cutting-edge algorithm accumulation of the US, nor does it have China's massive data pools and fierce market competition environment. More severely, Japan also faces the risk of export controls on US frontier models (like Fable/Mythos). Against this backdrop, Sakana AI's "evolutionary algorithm" and "multi-agent orchestration" path demonstrates an "asymmetric breakthrough" logic for resource-constrained countries.

Japan is not without its own large model manufacturers. NTT launched tsuzumi, and institutions like ELYZA, Rinna, and LLM-jp are also striving to train native language models. However, most of these manufacturers follow the traditional route of "training from scratch," making it difficult to compete with top Chinese and American models in parameter scale and general capabilities. Sakana AI is the only laboratory among them with global frontier influence that champions an "asymmetric architecture."

Fugu's dynamic routing capability is essentially helping Japanese enterprises and institutions establish "AI Sovereignty." Under computational constraints, rather than spending huge sums to train a hundred-billion-parameter model that is inferior to GPT-5.5 in every aspect, it is better to train a smart 7B "foreman." This foreman can flexibly access the world's best models based on task requirements. If one day a certain US model is subject to export controls or cut off, the RL Conductor can quickly route tasks to other available models, or even access specialized models native to Japan. This architecture gives Japan a degree of autonomy and risk resistance in the use of AI capabilities.

OmniTools, in observing the global AI tool ecosystem, found that large model capabilities are gradually leveling out, and the main battlefield of competition is shifting from pure parameter stacking to toolchains and implementation scenarios. Fugu's emergence precisely confirms this trend. It no longer pursues ultimate performance in a single model but aims for system-level optimization. This approach holds significant reference value for countries and regions that do not have an advantage in computing power and data.

Of course, this "asymmetric breakthrough" also has its ceiling. As long as the core technology of the underlying models remains in the hands of a few giants, the capability ceiling of the orchestration system will be limited by the underlying models. Fugu proves that a 7B model can be an excellent commander, but it cannot create capabilities that the underlying models do not possess. For Japan's large models to truly achieve a breakthrough, in addition to innovation in orchestration architecture, continuous investment in underlying computing power, core algorithms, and high-quality data is still needed. Fugu is an ingenious system-level innovation, but it is not a panacea. For developers and enterprise users, Fugu offers a highly competitive new option in complex engineering scenarios, but when using it, one must also be clearly aware of the fragility of its underlying dependencies and the trade-offs in latency costs.