Behind DeepSeek's 85% Speed Increase: Large Models Bid Farewell to Parameter Race, Embracing Cost War

On June 27, 2026, a paper titled "DSpark: Confidence-Scheduled Speculative Decoding Based on Semi-Autoregressive Generation" attracted industry attention. It was co-authored by DeepSeek founder Liang Wenfeng in collaboration with Peking University. The paper reported striking figures: when DSpark was deployed on the DeepSeek-V4 online service system handling real user traffic, per-user generation speed rose by 60% to 85% (Flash version) and 57% to 78% (Pro version); in offline or high-concurrency scenarios, aggregate throughput increased by 51% to 400%.

These gains are not the linear increase that comes from simply stacking hardware; they reflect a qualitative change in the underlying inference architecture. To understand the true value of these numbers, one must first see the computational black holes in the large model inference process.

The Truth Behind the 85% Speedup: Computation Eaten by "Invalid Verification"

Current mainstream large models mostly use autoregressive generation. Simply put, when generating text, the model outputs one token (roughly, a word or word fragment) at a time. For each token generated, the model must attend over all of the previous context to calculate the probability distribution of the next one. This sequential generation method leaves the GPU's parallel computing capability severely idle. More critically, large model inference is often "memory-bound": the time per step is set by how fast data can be read from memory, not by how fast the GPU can compute.

At the hardware level, this memory bottleneck is mainly reflected in the reading and writing of the KV Cache (Key-Value Cache). To avoid recomputing historical context, the model stores the Key and Value vectors computed for each previous token in video memory. As the sequence length increases, the KV Cache volume grows linearly. When generating each new token, the GPU computing units must wait for the model weights and the huge KV Cache to be moved from video memory to the computing cores. This means the GPU spends most of its time not doing matrix multiplications but waiting for data. The idle computing units, combined with the bandwidth pressure from repeated reading and writing of video memory, constitute the most fundamental cost black hole in large model inference.
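The linear growth is easy to put numbers on. The sketch below is a back-of-the-envelope estimate only: the function name `kv_cache_bytes` and the configuration (32 layers, 8 KV heads, head dimension 128, 16-bit values) are illustrative assumptions, not the dimensions of DeepSeek-V4 or any model named in this article.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one sequence.

    Each token stores one key vector and one value vector (hence the factor 2)
    per layer and per KV head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len


# Hypothetical configuration, chosen only for illustration.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for seq_len in (1_024, 8_192, 65_536):
    mib = kv_cache_bytes(seq_len, **cfg) / 2**20
    print(f"{seq_len:>6} tokens -> {mib:,.0f} MiB")
```

Under these assumptions every token adds 128 KiB, so the cache for a single 65,536-token conversation already reaches 8 GiB, all of which is read again at each decode step.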

To break this sequential bottleneck, the industry introduced speculative decoding. The basic idea is to introduce a smaller, faster draft model that guesses the next few possible words, which are then batch-verified by the target large model. If the draft guesses correctly, the large model confirms multiple words at once, greatly speeding up generation; if a guess is wrong, the large model keeps the words before the first error, substitutes its own choice at that position, and discards the rest of the draft.
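A minimal sketch of the verify step, assuming greedy decoding (a draft token is accepted only if it equals the target's own top choice). The names `verify_draft` and `toy_target` are hypothetical, and the toy target is a stand-in rule rather than a real model. Production systems score all draft positions in one batched forward pass and often use a probabilistic acceptance rule instead of exact matching.

```python
def toy_target(seq):
    """Stand-in for the target model: always continues with (last + 1) mod 10."""
    return (seq[-1] + 1) % 10


def verify_draft(draft_tokens, target_next_token, prefix):
    """Accept draft tokens while they match the target's own choice.

    Returns the extended sequence and the number of accepted draft tokens.
    The Python loop is for clarity only; a real system scores every draft
    position in a single batched forward pass.
    """
    seq = list(prefix)
    accepted = 0
    for tok in draft_tokens:
        expected = target_next_token(seq)
        if tok != expected:
            seq.append(expected)  # the target's token replaces the first wrong guess
            return seq, accepted  # everything after it in the draft is discarded
        seq.append(tok)
        accepted += 1
    seq.append(target_next_token(seq))  # bonus token when the whole draft matched
    return seq, accepted


seq, accepted = verify_draft([4, 5, 9, 9], toy_target, prefix=[1, 2, 3])
print(seq, accepted)
```

Here the first two draft tokens match, the third does not, and the fourth is never used: one round yields three new tokens (two accepted plus the target's correction) instead of one.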

However, traditional speculative decoding schemes create new computational waste while speeding up. Early parallel draft methods (such as Medusa) or autoregressive draft methods (such as Eagle3) often adopt a blind guessing strategy. Taking parallel draft as an example, it has the draft model guess multiple possible next words in parallel at the same time point, but this ignores the sequential dependencies between words. Although autoregressive draft considers dependencies by having the draft model generate a long sequence on its own, as the sequence gets longer, the probability that the whole draft survives verification falls roughly geometrically, because a single wrong word invalidates everything after it.
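The diminishing return from longer drafts can be made concrete with a simplified model. The sketch assumes each draft token is accepted independently with the same probability `alpha` and that verification stops at the first rejection; real acceptance rates vary by position and content, and the value 0.8 is purely illustrative.

```python
def expected_accepted(alpha, k):
    """Expected number of accepted tokens from a k-token draft.

    Token i is only useful if tokens 1..i were all accepted, which happens
    with probability alpha**i under the independence assumption.
    """
    return sum(alpha ** i for i in range(1, k + 1))


for k in (2, 4, 8, 16):
    print(f"draft length {k:>2}: {expected_accepted(0.8, k):.2f} tokens accepted on average")
```

With `alpha = 0.8` the expectation can never exceed 4 tokens however long the draft is, so each doubling of draft length buys less while the target still has to verify every position.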

When the target large model verifies these long draft sequences, the problem emerges. The target model finds that the first few words are guessed correctly, but most of the latter part is wrong. Under traditional verification mechanisms, the target model must perform complete forward propagation computation on the entire draft sequence. During this computation, the model not only loads its own huge parameter weights but also processes the additional KV Cache brought by the draft. This means it consumes a large amount of GPU computing power and memory bandwidth while reviewing those drafts that are destined to be discarded. This "invalid verification" might not be noticeable under low concurrency, but under the high pressure of DeepSeek-V4's real traffic, the computational waste is dramatically magnified, not only slowing down actual response speed but also making already high inference costs even worse. DSpark was designed to reclaim this consumed computing power.

Intern and Mentor: How DSpark Makes Speculative Decoding No Longer Blind

DSpark's core technical mechanism can be summarized as semi-autoregressive generation and confidence-scheduled speculative decoding. These two academic-sounding terms essentially reshape the collaborative relationship between the draft model and the target model, fundamentally changing the logic of GPU memory read/write and computation scheduling.

We can use a simple analogy to understand this process. Suppose the target large model is a rigorous mentor, and the draft model is a highly responsive intern. In traditional speculative decoding, the mentor asks the intern to blindly guess the next ten sentences to write. The intern writes quickly but often goes off-topic by the fifth sentence. The mentor looks at it and finds that the first four sentences are usable, but the last six are all wasted. The effort the mentor spends correcting these six wasted sentences is the wasted computing power.

DSpark's semi-autoregressive generation mechanism is like equipping the intern with an auxiliary brain that considers contextual logic. When guessing, the intern no longer blindly diverges from the context but can self-correct to some extent based on what has already been generated. Technically, this means the draft model, when generating multiple candidate tokens, uses the hidden state from the previous step to guide the next step's generation, thereby improving the hit rate of long draft sequences and mitigating the decay in acceptance rate towards the end.

Even more crucial is confidence scheduling. Under DSpark, when submitting the draft, the intern attaches a confidence score to each part, indicating how certain they are about the guess. The mentor then dynamically schedules based on this score.

In the specific algorithm logic, the draft model's output layer outputs not only the probability distribution over the vocabulary but also an additional scalar value as the confidence of the predicted token. DSpark segments the draft sequence using a dynamic threshold. If a segment's confidence stays above the threshold, the system judges it as likely correct and classifies it as high priority; once the confidence drops below the threshold, the system considers the subsequent draft tokens highly likely to be wrong.

In the underlying GPU computation logic, this scheduling changes the previous "one-size-fits-all" batch verification model. For high-confidence (likely correct) parts, the mentor assigns them to high-priority computation streams and focuses on rapid verification; for low-confidence parts, the mentor directly discards them or places them in low-priority computation streams to reduce verification resource investment. This avoids wasting precious memory bandwidth and compute cycles on drafts that are bound to be wrong.
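The scheduling idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not DSpark's actual implementation: the function name `split_by_confidence`, the fixed threshold of 0.5 (the mechanism described above uses a dynamic one), and the sample scores are all hypothetical.

```python
def split_by_confidence(tokens, confidences, threshold=0.5):
    """Split a draft into (high_priority, low_priority) parts.

    The draft is cut at the first token whose confidence falls below the
    threshold. Everything from that token onward is treated as low priority,
    because a wrong token invalidates all tokens after it.
    """
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    cut = len(tokens)
    for i, conf in enumerate(confidences):
        if conf < threshold:
            cut = i
            break
    return tokens[:cut], tokens[cut:]


draft = ["the", "cat", "sat", "on", "mars"]
scores = [0.95, 0.90, 0.80, 0.45, 0.70]  # hypothetical confidence values
verify_now, deprioritized = split_by_confidence(draft, scores)
print(verify_now, deprioritized)
```

Note that "mars" lands in the low-priority part even though its own score is above the threshold: once "on" is doubtful, nothing after it can be accepted anyway.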

Through this mechanism, DSpark effectively reduces the computational waste from invalid verification. According to media reports like Zhidixi, on the Qwen3 series models, DSpark's average acceptance length increased by 26.7% to 30.9% compared to the previous generation Eagle3, and by 16.3% to 18.4% compared to DFlash. This means that within the same time, the target large model can accept more correct drafts, and generation speed naturally increases. This optimization does not add any extra hardware investment; it purely relies on improved scheduling algorithms to squeeze out computing power.

The Compute Ledger: Why Reducing Waste Matters More Than Simply Adding GPUs

From an engineering perspective, DSpark is an elegant algorithm optimization. But from a business perspective, it is a survival guide for large model companies in a brutal cost war.

The financial model of the large model industry is extremely sensitive to inference costs. In OmniTools' previous analysis of OpenAI's leaked financials, we saw a shocking cost structure: annual revenue of 13 billion but operating loss of 20.9 billion. Behind this burn-rate-for-scale approach, in addition to high training costs, the continuous bleeding comes from inference compute consumption. Although training costs for large models are huge, they are usually one-time or periodic capital expenditures; inference costs, however, are operational costs. Every user API call, every model-generated response, consumes real money in GPU compute, electricity, and depreciation.

We can break down a typical API call cost structure. Suppose a user inputs a 1000-word prompt and asks the model to generate a 1000-word response. In traditional autoregressive mode, the Prefill phase processes 1000 input words, followed by 1000 sequential generation steps in the Decode phase. Each generation step must load the massive model weights and read/write the ever-growing KV Cache. In a typical hundred-billion-parameter model, a single Decode step's computation might take only a few milliseconds, but data movement could take tens of milliseconds.
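A rough lower bound on the data-movement time per decode step follows from dividing the bytes that must be read by the available memory bandwidth. The numbers below are hypothetical (100 billion parameters in 16-bit precision, sharded across 8 accelerators with 2 TB/s of memory bandwidth each), and the estimate ignores KV Cache reads, sparse activation, batching, and any overlap of transfer with compute.

```python
def min_decode_step_ms(n_params, bytes_per_param, bandwidth_bytes_per_s):
    """Lower bound on one decode step: time to stream every weight once, in ms."""
    return n_params * bytes_per_param / bandwidth_bytes_per_s * 1000


# Hypothetical deployment, for illustration only.
step_ms = min_decode_step_ms(
    n_params=100e9,
    bytes_per_param=2,
    bandwidth_bytes_per_s=8 * 2e12,
)
print(f"at least {step_ms:.1f} ms per decode step")
```

Under these assumptions a single step cannot finish in less than about 12.5 ms, no matter how fast the arithmetic units are, which is why verifying several draft tokens in one pass over the weights is attractive.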

If traditional blind speculative decoding is used, although the draft model quickly generates 200 words, the target model finds during verification that only the first 50 are correct, and the last 150 are wrong. So when processing those 150 words, the GPU computing units mobilized, the memory bandwidth consumed, and the electricity used all become sunk costs. When there are tens of millions of API calls per day, the cumulative computational waste from invalid verification can directly reflect on the company's quarterly financial statements, becoming a bottomless pit that devours profits.
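The arithmetic of this example is simple enough to write down. The 200 drafted and 50 accepted words come from the paragraph above; the call volume of 10 million per day, with one such verification round per call, is an assumption added purely to show scale.

```python
def wasted_fraction(drafted, accepted):
    """Share of drafted positions that the target verified and then threw away."""
    if drafted <= 0 or not 0 <= accepted <= drafted:
        raise ValueError("need drafted > 0 and 0 <= accepted <= drafted")
    return (drafted - accepted) / drafted


drafted, accepted = 200, 50      # figures from the example in the text
calls_per_day = 10_000_000       # assumption: one such round per API call

print(f"wasted share: {wasted_fraction(drafted, accepted):.0%}")
print(f"wasted positions per day: {(drafted - accepted) * calls_per_day:,}")
```

Three quarters of the verification work in this scenario produces nothing, which at the assumed volume is 1.5 billion discarded positions a day.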

When user volume surges, if inference efficiency is low, companies have only two options: either restrict user access or frantically purchase GPUs to expand capacity. The former loses market share, the latter bleeds cash flow. In an era where capital is becoming more rational, the model of relying on endless financing to plug computing holes is no longer sustainable. More fatally, as model parameter scale increases and context lengths expand, the compute consumption per inference rises steeply. If computational waste like invalid verification is allowed to persist, the loss gap for large model companies will keep widening as user scale grows.

The inference optimization path represented by DSpark offers a third solution. By reducing the computational waste of invalid verification, DeepSeek-V4 increases aggregate throughput in high-concurrency scenarios by up to 400% without adding hardware clusters. Since a 400% increase is a fivefold total, the same server farm can, in the best case, handle up to five times as many user requests as before.
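Percentage increases and capacity multipliers are easy to confuse, so a small conversion helps. The sketch uses the 51% and 400% throughput figures quoted earlier; the baseline fleet of 1,000 GPUs is a hypothetical number, and the calculation assumes capacity scales linearly with GPU count.

```python
import math


def throughput_multiplier(percent_increase):
    """Convert an 'increase by X%' figure into a total multiplier."""
    return 1 + percent_increase / 100


def gpus_needed(baseline_gpus, percent_increase):
    """GPUs needed to serve the same load after the speedup, rounded up."""
    return math.ceil(baseline_gpus / throughput_multiplier(percent_increase))


for pct in (51, 400):
    print(f"+{pct}% -> {throughput_multiplier(pct):.2f}x throughput, "
          f"{gpus_needed(1000, pct)} GPUs for the load of 1000")
```

At the top of the range the same traffic fits on a fifth of the hardware; at the bottom of the range the saving is closer to a third.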

For large model companies, the value of such engineering optimization far exceeds simply stacking computing power. Adding GPUs means linear cost growth and linear capacity increase: each added GPU brings additional procurement costs, operational costs, and energy pressure. In contrast, efficiency improvements at the algorithm level multiply capacity under fixed costs. As the industry enters a price war, whoever can lower the compute cost per API call can offer more competitive pricing while maintaining gross margins. DSpark not only makes models faster but also makes the business loop of large models healthier, giving companies the confidence to survive in the era of thin margins.

Open-Sourcing DeepSpec: Handing Weapons to Small and Medium Teams

Alongside the release of the DSpark paper, DeepSeek has also open-sourced the DeepSpec framework. According to its official GitHub, DeepSpec is open-sourced under the MIT license, includes speculative decoding algorithm modules such as DSpark, DFlash, and Eagle3, and is compatible with mainstream open-source models like Qwen3 and Gemma.

This is a move that deserves high industry attention. In the current AI ecosystem, closed-source giants like OpenAI and Anthropic also use speculative decoding or similar inference acceleration architectures at the bottom layer, but they rarely open-source these full-stack inference optimization toolchains. Small and medium AI startup teams that want to train an efficient draft model for their fine-tuned model often need to build everything from scratch.

We can imagine the real situation of a startup team with only a few A100 GPUs. They might have fine-tuned a vertical domain model but found during deployment that the generation speed is extremely slow and user experience is terrible. If they want to implement speculative decoding themselves, they need to understand CUDA operator development and KV Cache memory management, and to design the training pipeline for the draft model. This means not only high labor costs but also a long trial-and-error cycle. Many small and medium teams exhaust their funds precisely at this stage.

DeepSpec's open-sourcing directly delivers the most advanced inference optimization weapons to the entire industry. In the specific framework design, DeepSpec provides highly modular interfaces. Developers do not need to rewrite the inference engine from the bottom up; they only need to specify the main model path and draft model path in the configuration file, and the framework automatically handles the complete process of draft generation, confidence calculation, and target model verification.

For teams wanting to train their own draft models, DeepSpec provides a standardized data distillation module that can distill the hidden states of the target large model for the draft model to learn, greatly lowering the data preparation threshold. Developers no longer need to explore how to build a semi-autoregressive draft model or repeatedly debug confidence scheduling parameters. Through the standardized toolchain provided by DeepSpec, small and medium teams only need to configure parameters according to the documentation to equip their models with speculative decoding capabilities and enjoy the benefits of dramatically increased generation speed.

The Main Battlefield Shifts: Everything Beyond the Model Is Harness

The release of DSpark and its successful deployment on DeepSeek-V4 confirms an irreversible industry trend: the main battlefield of large model competition has shifted.

As OmniTools pointed out in the article "Everything Beyond the Model Is Harness: DeepSeek Enters, Why Has the Main Battlefield of Domestic AI Competition Changed?", when the base model parameters of various companies have reached hundreds of billions and performance on various benchmarks tends to homogenize, simply competing on parameters can no longer create differentiation. What determines the cost of an AI product, and ultimately whether it lives or dies, is the system-level engineering capability outside the model, such as toolchains, inference scheduling architecture, and API routing: the so-called Harness.

For those unfamiliar with underlying engineering, the term "Harness" might be abstract. In software engineering, Harness usually refers to a "test harness" or "framework," but in the context of large models, it refers to the entire system engineering infrastructure surrounding the base model. This includes but is not limited to: API routing and distribution systems that manage user requests, inference scheduling architectures that accelerate generation (such as speculative decoding and KV Cache management), toolchains that enable the model to call external tools and search the web, and disaster recovery monitoring systems that ensure high availability. The base model is like an engine, while the Harness is the chassis, transmission, and drive shaft. No matter how powerful the engine, if the Harness is weak, the power can't reach the wheels.

DSpark is a typical contest at the Harness level. It does not change the base parameters of DeepSeek-V4 or improve the model's intelligence, but through extreme engineering optimization, it solves the problem of computational waste under real traffic, directly improving user experience (response speed) and the company's financial model (inference cost). Future large model competition will increasingly manifest as this invisible system-level engineering contest. Whoever has smarter inference scheduling, more efficient KV Cache management, and higher speculative decoding hit rates will be able to output more with the same compute reserves. This shift from "parameter competition" to "engineering implementation" requires companies to understand not only algorithms but also systems, hardware, and real business traffic characteristics.

Limitations and Boundaries: DSpark Is Not a Panacea

Although DSpark has demonstrated astonishing optimization effects in long-text generation and high-concurrency scenarios, it is not a panacea for solving large model inference problems. Every technical solution has its applicable boundaries, and DSpark is no exception.

First is the extremely high storage threshold. According to data disclosed in the DSpark paper, training a draft model adapted to a medium-scale model (such as Qwen3-4B) requires a target cache volume of up to 38TB. This number might be just a routine configuration for top-tier companies with massive compute clusters, but for small and medium teams with limited resources, 38TB of high-speed storage access is itself a difficult hurdle to overcome. This means that although DeepSpec has open-sourced the code, small and medium teams wanting to fully reproduce and deploy DSpark-level optimization still need to overcome the real barrier of hardware resources. Storage costs and I/O bottlenecks could become a realistic gap hindering the widespread adoption of this technology.

Second is the limitation in optimization phases. Large model inference is typically divided into two phases: Prefill and Decode. The Prefill phase is when the model reads the user's input prompt and generates the first token; this phase is compute-intensive, and GPU computing power is fully utilized. The Decode phase is the subsequent token-by-token generation process, which is memory-bound and where speculative decoding mainly plays a role. DSpark primarily targets the Decode process. For the user-critical first-token latency (i.e., Prefill phase), DSpark's optimization effect is relatively limited. In extremely short-text interaction scenarios (such as simple Q&A or command execution), since the generated content itself is small, the acceleration potential of speculative decoding is compressed, and DSpark's speed advantage might not be obvious.
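Why short responses benefit less can be shown with a simple latency model: only the Decode part of a request shrinks, while the Prefill time stays fixed. All numbers below are hypothetical (0.5 s of Prefill, 20 ms per generated token), and treating the article's 85% per-user figure as a 1.85x Decode speedup is an assumption made for illustration.

```python
def end_to_end_speedup(prefill_s, decode_s_per_token, n_tokens, decode_speedup):
    """Overall request speedup when only the decode phase gets faster."""
    before = prefill_s + decode_s_per_token * n_tokens
    after = prefill_s + decode_s_per_token * n_tokens / decode_speedup
    return before / after


# Hypothetical timings, for illustration only.
for n_tokens in (20, 200, 2000):
    s = end_to_end_speedup(prefill_s=0.5, decode_s_per_token=0.02,
                           n_tokens=n_tokens, decode_speedup=1.85)
    print(f"{n_tokens:>4} generated tokens -> {s:.2f}x end to end")
```

Under these assumptions a 20-token answer gets only about 1.26x faster overall, while a 2,000-token answer approaches the full 1.85x, which matches the observation that long-text generation is where the technique pays off.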

Finally, there is the issue of adapting to heterogeneous computing power. Currently, DSpark's online verification is mainly based on DeepSeek-V4's specific service architecture, and its adaptation and actual speedup on non-Nvidia hardware (such as various domestic computing chips) lack public testing data. Under different hardware architectures, the scheduling logic of memory bandwidth and computing units differs, and whether the confidence scheduling strategy needs to be re-tuned remains an engineering problem to be solved.

There is no silver bullet in technology optimization. DSpark proves the huge potential of eliminating computational waste through algorithmic scheduling, providing a powerful tool for the large model industry to win the cost war. But in actual implementation, developers still need to rationally evaluate the input-output ratio of this solution based on their own business scenarios, hardware reserves, and latency sensitivity. The journey of engineering large models has only just entered the deep waters.