Everything Beyond the Model Is Harness: DeepSeek Enters the Fray, Why Has the Main Battlefield of Domestic AI Competition Shifted?

In mid-to-late May 2026, Deepseek internally formed a new Harness team, focusing on code agent products, internally benchmarking against Anthropic's Claude Code. Former Jane Street star quantitative engineer Cui Tianyi joined the team in March, and senior researcher Chen Deli publicly confirmed this and is responsible for recruitment. In Deepseek's recruitment JD, a formula is explicitly written: "Model + Harness = Agent." As the capabilities of foundational large models gradually level out, the era of purely competing on parameters is passing. Deepseek's move to build a toolchain team signals that the main battlefield of domestic AI competition is shifting from "training large models" to "building toolchains and office implementation."

Why is Deepseek building Harness itself?

For a long time, developers' expectations for Deepseek remained on open-sourcing more powerful foundational models. But strong coding ability does not mean developers will adopt it as a productivity tool. What truly changes the way of working is not code answers in a chat box, but an engineering agent that can enter the terminal, understand projects, read and write files, run commands, and fix errors. Before the official move, the developer community had already created various open-source terminal Agents based on Deepseek models. By forming the Harness team at this time, Deepseek aims to control interface design rights and the training data loop, incorporating the paths blazed by the community into an official flagship product.

To understand this strategic intent, one must first clarify what Harness actually is. For readers without a technical background, the term "Harness" might be somewhat unfamiliar. In Deepseek's formula, the model is responsible for reasoning, and Harness is responsible for everything else. Originally meaning "horse tack" or "safety harness" in engineering, in the AI field, it refers to the Agent's "runtime infrastructure."

For a more accessible understanding, we can liken the large model to the "brain" and "intelligence" of a highly intelligent worker, while Harness is this worker's "job description, KPI assessment standards, office blast walls, and toolbox." It is not a "scaffold" assembled before running, nor a "framework" providing building blocks, but a continuously running system. It is responsible for orchestrating execution loops, dispatching tool calls, managing context, performing security checks, and handling error recovery and state persistence. The large model itself is stateless and lacks environmental interaction capabilities; it can only receive text input and output text. Harness compensates for these deficiencies, enabling the model to truly interact with the external world and execute specific tasks.

Why must a foundational model company master this runtime itself? The core reason is that Agent products are not only an outlet for model capabilities but also a training ground for them. Deepseek's JD emphasizes "achieving the co-evolution of model and Harness." In real complex tasks, models encounter failures caused by environmental constraints and abnormal tool returns. Harness records these failure trajectories, which can feed back into model training, creating a flywheel effect. If left to the community to build, model manufacturers would lose the most critical application-layer data feedback, becoming mere providers of computing power and weights.

From an engineering perspective, optimizing Harness determines an Agent's success more than simply optimizing prompts. According to technical expert analysis, during Agent operation, tool output accounts for 67.6% of what the Agent actually sees in context, while system prompts account for only 3.4%. This means most of the model's "field of vision" is occupied by the results of tool calls. If Harness improperly handles the format of tool outputs or fails to effectively compress redundant information, the model falls into "context rot," leading to a sharp decline in subsequent reasoning quality.

Even more fatal is the problem of compound errors. An Agent process containing 10 steps, each with 99% reliability, has an end-to-end success rate of about 90%; when task complexity increases to 50 steps, the success rate plummets to 60%. In real codebase maintenance or enterprise office automation scenarios, continuous operations of dozens of steps are the norm. At this point, no matter how strong the model's reasoning ability is, it cannot compensate for the cumulative probabilistic loss. Only through error handling and recovery mechanisms in Harness can retries or path corrections be made when a step fails. This is precisely the engineering value of Harness, and the reason Deepseek must build it itself.

Tencent as a connector, Alibaba for front-end penetration: The differentiated paths of tech giants' toolchains

Deepseek's pivot is not an isolated case. According to industry media reports, strengthening Agent capabilities has become an important development direction for domestic foundational large models in 2026. Foundational models are gradually becoming "utilities like water, electricity, and coal," with the main competitive battlefield shifting to the application layer. Other domestic tech giants are also seeking differentiated positioning through toolchains, but their paths vary, reflecting differences in each company's ecological endowments and target users.

In June 2026, Tencent played a new enterprise Agent card, launching WorkBuddy Enterprise Edition. Its core positioning is an all-scenario workplace intelligent agent desktop workbench, aiming to move from individual efficiency enhancement to organizational collaboration. WorkBuddy Enterprise Edition supports multi-Agent parallelism and business system Connector access, attempting to seize the unified entry point for AI office work. Tencent's positioning logic relies on its vast WeCom and Tencent Cloud ecosystem. For large enterprises, the pain point of AI office work lies not in the ultimate experience of single-point tools, but in whether it can connect isolated internal office systems. By acting as a connector, Tencent enables Agents to directly dispatch enterprise data and processes, focusing on organizational-level collaboration and complex task delivery. The advantage of this path is high barriers to entry; once integrated into core business processes, replacement costs are enormous. The challenge lies in requiring strong enterprise service capabilities and customized support.

Alibaba has taken a different path, choosing to lower the automation threshold on the Web side. Alibaba open-sourced PageAgent, a pure front-end browser-based GUI Agent framework. This framework requires no backend deployment; one line of code allows a website to integrate AI operator capabilities. Alibaba's positioning logic lies in empowering Web developers, turning any webpage into an AI-native application instantly. Given the reality that many traditional enterprise systems cannot provide API interfaces, achieving automation through front-end DOM manipulation is a pragmatic, disruptive path. The advantage of this path is its lightweight nature and ease of integration, enabling rapid coverage of a vast number of long-tail websites; however, frequent changes in front-end DOM structures may pose stability challenges, placing higher demands on Harness's error recovery capabilities.

In comparison, companies are no longer simply competing on model benchmarks but are building toolchains based on their own ecological endowments. Tencent acts as a connector, Alibaba pursues front-end penetration, and Deepseek enters through the most essential code engineering scenario for developers. This divergence indicates that the domestic AI industry has recognized there is no perfect universal Agent, only vertical solutions polished through heavy-duty Harness engineering in specific scenarios. For enterprise procurement, choosing which toolchain essentially means choosing which automation path: deep binding with an office ecosystem, flexible embedding into existing Web systems, or empowering developers' engineering workflows.

Viktor's $20 million ARR proves: Enterprises are willing to pay for autonomous execution

The maturation of toolchains is changing the paradigm of AI participation in the office domain. The logic of native Copilots is "draft and wait for humans to finish"; AI generates a piece of copy or code, but the final step still requires human intervention for modification and execution. In this model, AI is merely an efficiency tool and cannot truly replace labor. Enterprise employees need to constantly monitor AI output, verify, and implement it, which actually increases cognitive load.

Clear signals of a paradigm shift have already emerged in overseas markets. As a reference for overseas trends, Polish AI office automation company Viktor positions itself as an AI employee within Slack, achieving $20 million in annual recurring revenue (ARR) without a sales team, serving 30,000 enterprises, and securing $75 million in Series A funding in May 2026. Viktor's model represents the endgame for new AI employees: possessing a cloud computer, capable of long-duration continuous operation, firmly grasping massive context, and directly delivering results.

Viktor is positioned as a Tier 3 AI Coworker, meaning it no longer handles simple Q&A but complex, multi-step, long-running tasks like marketing audits, ad management, and lead research. There is significant willingness to pay on the enterprise side for this kind of AI that requires no final human confirmation and can operate continuously for long periods. This explosion of commercial data proves that the value anchor of office automation has shifted from "assisted generation" to "autonomous execution."

Domestic manufacturers' layout of Harness and Agent toolchains is precisely to embrace this trend. When Harness can provide sufficient safety guardrails, state persistence, and error recovery capabilities, AI can transform from an "intern" requiring constant human supervision into an "outsourcer" that can independently deliver work results. Enterprise procurement focus will also shift from model parameter size to whether the Agent can run stably for 8 hours without crashing, and whether it can automatically handle API rate limits and webpage structure changes. For developers, this means the focus of building AI applications will shift from "how to write good prompts" to "how to design a robust runtime environment."

Token explosion and the engineering moat of "thick frameworks"

After shifting to toolchain competition, the challenges faced by enterprise procurement and developers in actual implementation have not diminished but have become more focused on the engineering level.

The primary issue is the token explosion problem. Long-running Agents, in the cycle of "think, act, feedback," easily cause context to swell rapidly due to redundant tool outputs. The developer community widely discusses this challenge, believing it not only drives up inference costs but also leads to model attention dispersion and a sharp increase in task failure rates. For example, when executing a web data scraping task, if Harness stuffs the entire webpage's HTML source code unchanged into the context, the model quickly gets lost in redundant information and forgets the original task goal. Therefore, Harness's context compression and memory management capabilities become core evaluation metrics for enterprise procurement. An excellent Harness must know which historical information can be discarded and which tool return results need summarization, testing deep engineering architecture capabilities rather than the model's own intelligence.

This has also raised developer vigilance against "thin wrapper" frameworks. If the Harness launched by a large model manufacturer is merely a simple API wrapper providing basic chat windows and tool call interfaces, it will lack practical debugging value. The fragility in production environments requires Harness to possess "thick framework" features such as sandbox isolation, fine-grained permission control, and checkpoint restart. Only a runtime with deep engineering moats can truly meet the stability demands of enterprise-grade applications. For instance, in code execution scenarios, Harness must provide a secure sandbox environment to prevent malicious code generated by the model from damaging the host system; in long-running tasks, it must support checkpoint restart to avoid restarting the entire task from scratch due to network fluctuations.

Furthermore, geopolitical factors have left a vast market vacuum for domestic Harness solutions. Top overseas engineering agent products like Claude Code impose access restrictions on mainland China and Chinese-funded enterprises. Unable to directly use these top-tier tools, domestic developers can only seek domestic alternatives. Deepseek's formation of the Harness team is not only a response to technological trends but also an answer to this massive substitution demand.

For enterprise procurement and developers, understanding the value of Harness means no longer being dazzled by flashy chat demonstrations when choosing AI products, but instead questioning what its error recovery mechanism is, what its context management strategy is, and whether it can truly integrate into existing workflows. In the toolchain competition phase, enterprises should prioritize evaluating a manufacturer's engineering delivery capability and ecological compatibility rather than simply comparing model benchmarks; developers should focus on the openness of the Harness framework and the completeness of its debugging toolchain, choosing a platform that provides deeply controllable runtime.