Behind DeepSeek's 85% Speedup: Large Models Leave the Parameter Race Behind and Start a Cost War

PANews

On June 27, 2026, a paper titled "DSpark: Confidence Scheduling for Speculative Decoding Based on Semi-Autoregressive Generation" drew industry attention. The paper was authored by Liang Wenfeng, founder of DeepSeek, and was jointly completed with Peking University. The paper revealed a set of striking data: when DSpark was deployed in the DeepSeek-V4 online service system, handling real user traffic, the per-user generation speed increased significantly by 60% to 85% (Flash version) and 57% to 78% (Pro version); in offline or high-concurrency scenarios, aggregate throughput increased by 51% to 400%.

These gains do not come from stacking more hardware; they come from a change in the underlying inference architecture. To understand their significance, one must first see where compute is wasted during large model inference.

The Truth Behind the 85% Speedup: Computation Power Consumed by "Ineffective Verification"

Most mainstream large models currently generate text autoregressively. In simple terms, the model produces output one token at a time, and the probability distribution of each new token is conditioned on the entire preceding context. Because every step depends on the one before it, generation is serial, which leaves the GPU's parallel computing capability severely underutilized. More critically, large model inference is typically memory-bound: it is limited by how fast data can be moved rather than by how fast it can be computed.

At the hardware level, this memory access bottleneck is primarily reflected in the read and write operations of the KV Cache (Key-Value Cache). To avoid recalculating historical contexts, the model stores the hidden states generated at each step in the form of Keys and Values in the graphics memory. As the sequence length increases, the size of the KV Cache grows linearly. When generating each new word, the GPU's computing units must wait for the massive KV Cache to be transferred from memory to the computing core. This means that for the vast majority of the time, the GPU is not performing matrix multiplications but waiting for data. The idle state of computing units, combined with the bandwidth pressure from repeated read and write operations on memory, constitutes the fundamental cost black hole of large model inference.
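The linear growth of the KV Cache can be made concrete with a little arithmetic. The sketch below assumes a conventional layout in which every layer stores one key and one value vector per KV head per token. The layer count, head count, head size, and 2-byte precision are hypothetical values chosen for illustration and are not DeepSeek-V4's configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to hold keys and values for every layer and cached token."""
    # Factor 2: one tensor for keys and one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model configuration (assumption, not from the article).
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for seq_len in (1_000, 2_000, 4_000):
    size_mib = kv_cache_bytes(seq_len=seq_len, **cfg) / 2**20
    print(f"{seq_len:>5} tokens -> {size_mib:,.1f} MiB per sequence")
```

Under these assumed values the cache is 125 MiB at 1,000 tokens and doubles each time the sequence length doubles. The whole cache has to be read again for every newly generated token.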

To break this serial bottleneck, the industry has introduced speculative decoding technology. The basic idea is to introduce a smaller, faster draft model that guesses a few potential next words, which are then batch-verified by the target large model. If the draft guesses correctly, the large model can confirm multiple words in one go, greatly speeding up the process; if a guess is wrong, the large model keeps the words accepted up to that point, substitutes its own prediction at the first mismatch, and discards the rest of the draft.
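The draft-then-verify loop can be sketched in a few lines. This is a minimal illustration of the greedy variant only, in which a drafted token is accepted when it equals the target model's own top choice. The function names and the toy target model are hypothetical, and a real system would score all draft positions in a single batched forward pass rather than in a Python loop.

```python
from typing import Callable, List, Sequence


def verify_draft(prefix: Sequence[int],
                 draft: Sequence[int],
                 target_next: Callable[[Sequence[int]], int]) -> List[int]:
    """Greedy verification: keep draft tokens up to the first mismatch."""
    accepted: List[int] = []
    context = list(prefix)
    for tok in draft:
        expected = target_next(context)
        if tok != expected:
            # The target's own token replaces the first wrong guess;
            # everything after it in the draft is discarded.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        context.append(tok)
    # Whole draft accepted: the target still contributes one extra token.
    accepted.append(target_next(context))
    return accepted


def toy_target(context: Sequence[int]) -> int:
    """Stand-in for the large model: always predicts last token + 1 (mod 10)."""
    return (context[-1] + 1) % 10


print(verify_draft([1, 2, 3], [4, 5, 9, 9], toy_target))
```

In this toy run the first two drafted tokens are kept, the third is replaced by the target's prediction, and the fourth is thrown away even though it was drafted and checked.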

However, traditional speculative decoding schemes generate new waste in computational power even as they accelerate processing. Early parallel draft schemes (like Medusa) or autoregressive draft schemes (like Eagle3) often use blind guessing strategies. In parallel drafting, the draft model guesses several upcoming words simultaneously, which ignores the sequential dependencies between them. Autoregressive drafting respects these dependencies and lets the draft model generate a long string first, but the chance that the whole string survives verification shrinks roughly geometrically as the string grows, because a word only counts if every word before it was accepted.
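How quickly long drafts lose value can be estimated under a simplifying assumption: each drafted word is accepted independently with the same probability alpha. Real acceptance rates vary by position and content, so the numbers below are illustrative only, and the value of alpha is hypothetical.

```python
def prefix_survival(alpha: float, k: int) -> float:
    """Probability that all of the first k drafted tokens are accepted."""
    return alpha ** k


def expected_accepted(alpha: float, k: int) -> float:
    """Expected number of accepted draft tokens out of k drafted.

    Token i is only useful if tokens 1..i are all accepted,
    which happens with probability alpha**i.
    """
    return sum(alpha ** i for i in range(1, k + 1))


alpha = 0.8  # assumed per-token acceptance rate
for k in (5, 10, 20):
    print(f"draft length {k:>2}: "
          f"P(all accepted) = {prefix_survival(alpha, k):.3f}, "
          f"expected accepted = {expected_accepted(alpha, k):.2f}")
```

Under this model the expected number of accepted words can never exceed alpha / (1 - alpha), which is 4 for alpha = 0.8. Drafting 20 words instead of 10 adds verification work but almost no accepted output.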

When the target large model verifies these long draft strings, the problems become apparent. The target model finds that while the first few words are correct, most of the subsequent words are wrong. Under traditional verification mechanisms, the target model must run a full forward pass over the entire draft string. During this pass, the model must not only load its massive parameter weights but also handle the additional KV Cache introduced by the draft. This means that a significant amount of GPU compute and memory bandwidth is spent checking drafts that are destined to be discarded. This "ineffective verification" may not be obvious under low concurrency, but under the pressure of DeepSeek-V4's real traffic, the waste is drastically magnified, slowing down actual response times and further exacerbating already high inference costs. DSpark aims to reclaim this lost computational power.

Intern and Mentor: How DSpark Stops Speculative Decoding from Being Blind

The core technical mechanism of DSpark can be summarized as semi-autoregressive generation and confidence scheduling speculative decoding. These two seemingly esoteric academic terms fundamentally reshape the collaborative relationship between draft models and target models, changing the logic of GPU memory read/write and computational scheduling.

We can understand this process with a simple analogy. Suppose the target large model is a meticulous mentor, and the draft model is a quick-reacting intern. In traditional speculative decoding, the mentor asks the intern to blindly guess the next ten sentences. While the intern may write quickly, they often go off track by the fifth sentence. The mentor sees that the first four sentences are usable, but the last six are all wasted drafts. The effort the mentor spends correcting these six wasted drafts is equivalent to the wasted computational power.

The semi-autoregressive generation mechanism of DSpark acts as an auxiliary brain for the intern, one that keeps track of how earlier and later parts of the draft fit together. When guessing, the intern does not detach completely from the context but can self-correct based on what has already been generated. In technical terms, when generating multiple candidate words the draft model uses the hidden state from the previous step to guide the next step, which improves the hit rate of long drafts and slows the decline in acceptance rate toward the end of a draft.

More crucial is the innovation in confidence scheduling. Under the DSpark framework, when the intern submits a draft, they assign a confidence score to each part of the content, indicating their level of certainty about that portion. The mentor will dynamically adjust the grading process based on this score.

In specific algorithm logic, the output layer of the draft model not only outputs the probability distribution over the vocabulary but also an additional scalar value representing the confidence of that predicted word. DSpark employs a dynamic threshold mechanism to segment the draft sequences. If a segment of draft consistently exceeds the confidence threshold, it is deemed highly likely to be correct and classified as high priority; if the confidence drops below the threshold, the system judges the probability of error in subsequent drafts to be extremely high.
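The segmentation step can be illustrated with a short sketch. It assumes the simplest possible rule: keep the leading run of words whose confidence stays at or above a threshold and treat everything from the first drop onward as suspect. The article describes the threshold as dynamic; the fixed threshold, the function name, and the sample scores here are hypothetical simplifications.

```python
from typing import List, Sequence, Tuple


def split_by_confidence(tokens: Sequence[str],
                        confidences: Sequence[float],
                        threshold: float) -> Tuple[List[str], List[str]]:
    """Split a draft at the first token whose confidence is below threshold.

    Returns (high_priority, suspect). Tokens after the first low-confidence
    position are treated as suspect even if their own score is high, because
    they were drafted on top of a guess that is likely wrong.
    """
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    cut = len(tokens)
    for i, score in enumerate(confidences):
        if score < threshold:
            cut = i
            break
    return list(tokens[:cut]), list(tokens[cut:])


draft = ["The", "cache", "grows", "quadratically"]
scores = [0.95, 0.90, 0.40, 0.92]  # hypothetical draft-model confidences
high, suspect = split_by_confidence(draft, scores, threshold=0.7)
print("high priority:", high)
print("suspect:", suspect)
```

Note that the last word is classed as suspect despite its high score, since it follows a low-confidence word.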

At the GPU computing logic level, this scheduling changes the previous "one-size-fits-all" batch verification model. For parts with high confidence (likely correct) from the intern, the mentor allocates them to a high-priority computation stream for quick grading; for low-confidence parts, the mentor either chooses to discard them outright or puts them in a low-priority computation stream to reduce verification resource investment. This avoids the target model wasting precious memory bandwidth and computing cycles on drafts that are certain to be incorrect.
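One way to picture this scheduling is as a two-pass allocation of a fixed verification budget across concurrent requests. The sketch below is an assumption-laden illustration, not DSpark's actual scheduler: the `Draft` class, the token budget, and the rule of serving confident prefixes first and low-confidence tails only with leftover budget are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Draft:
    request_id: str
    confidences: List[float]  # one score per drafted token


def confident_prefix_len(confidences: List[float], threshold: float) -> int:
    """Length of the leading run of tokens at or above the threshold."""
    n = 0
    for score in confidences:
        if score < threshold:
            break
        n += 1
    return n


def plan_verification(drafts: List[Draft], threshold: float,
                      budget: int) -> Dict[str, int]:
    """Decide how many leading draft tokens of each request to verify."""
    plan = {d.request_id: 0 for d in drafts}
    remaining = budget
    # Pass 1 (high priority): confident prefixes are served first.
    for d in drafts:
        take = min(confident_prefix_len(d.confidences, threshold), remaining)
        plan[d.request_id] += take
        remaining -= take
    # Pass 2 (low priority): leftover budget goes to low-confidence tails.
    for d in drafts:
        tail = len(d.confidences) - confident_prefix_len(d.confidences, threshold)
        take = min(tail, remaining)
        plan[d.request_id] += take
        remaining -= take
    return plan


drafts = [Draft("A", [0.9, 0.9, 0.3, 0.2]), Draft("B", [0.95, 0.4, 0.4, 0.4])]
print(plan_verification(drafts, threshold=0.7, budget=4))
```

With a budget of four tokens, both confident prefixes are verified in full and only one low-confidence token is checked. The remaining four drafted tokens are dropped without costing the target model anything.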

Through this mechanism, DSpark effectively reduces the computational waste caused by ineffective verification. According to reports from media such as Zhiyuan, the average acceptance length of DSpark on the Qwen3 series models improved by 26.7% to 30.9% over the previous generation Eagle3 and by 16.3% to 18.4% over DFlash. This means that within the same timeframe, the target large model can adopt more correct drafts, naturally increasing the generation speed. This optimization does not involve any additional hardware investment and relies solely on improvements in the scheduling algorithm to extract computational power.

Computational Ledger: Why Reducing Waste is More Important than Simply Adding Cards

From an engineering perspective, DSpark represents a brilliant algorithm optimization. But from a business perspective, it is a survival guide for large model vendors in the brutal cost battle.

The financial model of the large model industry is particularly sensitive to inference costs. In a previous analysis by OmniTools regarding OpenAI's leaked financials, we saw a shocking cost structure: an annual revenue of 13 billion but an operating loss of 20.9 billion. Beyond the high training costs, which are one-time or phase capital expenditures, the ongoing bleeding comes from the consumption of inference computational power. While the training costs of large models are substantial, inference costs are operational costs incurred with every API call and every model-generated response, consuming real GPU power, electricity, and depreciation.

We can specifically deduce the cost structure of a typical API call. If a user inputs a prompt of 1,000 words and requests the model to generate a response of 1,000 words, in the traditional autoregressive mode, the Prefill phase processes the 1,000 input words, followed by the Decode phase performing 1,000 serial generations. Each generation requires loading the massive model weights and reading/writing the increasingly large KV Cache. In a typical model with hundreds of billions of parameters, a single Decode step's computation may require only a few milliseconds, but data transfer could take several tens of milliseconds.
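The gap between compute time and data-transfer time in a Decode step can be estimated with a back-of-the-envelope model. Every number below is a hypothetical placeholder (parameter count, precision, bandwidth, peak throughput), chosen only to show the shape of the calculation. It also assumes a batch of one, roughly two floating-point operations per parameter per token, and ignores the KV Cache reads that would add further transfer time.

```python
def decode_step_times(weight_bytes: float, kv_bytes: float, flops: float,
                      mem_bandwidth: float, peak_flops: float):
    """Return (transfer_seconds, compute_seconds) for one Decode step.

    transfer: every weight and cached key/value is read once from memory.
    compute:  arithmetic work divided by peak throughput.
    """
    transfer_s = (weight_bytes + kv_bytes) / mem_bandwidth
    compute_s = flops / peak_flops
    return transfer_s, compute_s


# Hypothetical placeholders, not measurements of any real model or GPU.
params = 70e9                 # parameters touched per token
weight_bytes = params * 2     # 2 bytes per parameter
flops = 2 * params            # ~2 FLOPs per parameter per token
mem_bandwidth = 2e12          # bytes per second
peak_flops = 3e14             # FLOPs per second

transfer_s, compute_s = decode_step_times(
    weight_bytes, 0.0, flops, mem_bandwidth, peak_flops)
print(f"transfer: {transfer_s * 1e3:.1f} ms, compute: {compute_s * 1e3:.2f} ms")
print(f"transfer / compute ratio: {transfer_s / compute_s:.0f}x")
```

Whatever the exact figures, the structure is the same: the weights are read once per step regardless of how many tokens that step confirms, which is why confirming several tokens per step pays off.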

If traditional blind speculative decoding is used, although the draft model rapidly generates 200 words, the target model during verification discovers that only the first 50 are correct, while the remaining 150 are all wrong. Thus, in processing the last 150 words, the GPU computational units activated, memory bandwidth consumed, and electricity utilized all become sunk costs. With tens of millions of API calls daily, this cumulative computational waste from ineffective verification can directly reflect on vendors' quarterly financial statements, becoming a bottomless pit devouring profits.
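The arithmetic in this example is simple enough to write down. The sketch below computes the share of verification work that is thrown away and scales it by a daily call volume. The figure of ten million calls and the assumption of one such draft round per call are hypothetical, standing in for "tens of millions of API calls".

```python
def wasted_fraction(drafted: int, accepted: int) -> float:
    """Share of verified draft tokens that end up discarded."""
    if drafted <= 0 or not 0 <= accepted <= drafted:
        raise ValueError("need drafted > 0 and 0 <= accepted <= drafted")
    return (drafted - accepted) / drafted


drafted, accepted = 200, 50           # figures from the example above
calls_per_day = 10_000_000            # hypothetical daily call volume

waste = wasted_fraction(drafted, accepted)
wasted_tokens_per_day = (drafted - accepted) * calls_per_day

print(f"wasted share of verification work: {waste:.0%}")
print(f"discarded token verifications per day: {wasted_tokens_per_day:,}")
```

Three quarters of the verification work in this example produces nothing, and at the assumed volume that amounts to 1.5 billion discarded token verifications per day.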

When user growth surges, if inference efficiency is low, vendors have only two choices: either restrict user access or wildly purchase GPUs for expansion. The former loses market share, while the latter undermines cash flow. As capital becomes more rational, relying on infinite financing to fill computational power gaps is no longer sustainable. More critically, as model parameter counts and context lengths increase, the compute consumed by each inference rises steeply. If the waste from ineffective verification is allowed to persist, vendors' losses will grow in step with their user base.

The inference optimization path represented by DSpark offers a third solution. By reducing the computational waste of ineffective verification, DeepSeek-V4 increased aggregate throughput in high-concurrency scenarios by up to 400% without increasing hardware clusters. This means that the same server group can handle several times the user requests compared to before.
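A throughput gain expressed as a percentage increase translates directly into a capacity multiplier and a per-request hardware cost. The sketch below assumes hardware cost is fixed and the extra capacity is fully used, which is an idealization. The 51% and 400% inputs are the aggregate throughput range quoted from the paper.

```python
def capacity_multiplier(gain_pct: float) -> float:
    """An increase of gain_pct percent means (1 + gain_pct / 100) times the capacity."""
    return 1 + gain_pct / 100


def relative_cost_per_request(gain_pct: float) -> float:
    """Hardware cost per request relative to baseline, at fixed hardware cost."""
    return 1 / capacity_multiplier(gain_pct)


for gain in (51, 400):
    print(f"+{gain}% throughput -> {capacity_multiplier(gain):.2f}x capacity, "
          f"{relative_cost_per_request(gain):.0%} of baseline cost per request")
```

At the top of the quoted range, a 400% increase means five times the requests on the same cluster, or one fifth of the hardware cost per request under these assumptions.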

For large model vendors, the value of this engineering optimization far surpasses simply stacking computational power. Adding cards raises capacity linearly but raises costs linearly too: each new card brings procurement costs, operational costs, and energy pressure. Improvements in algorithmic efficiency, by contrast, multiply capacity while hardware costs stay fixed. As the industry enters a price war phase, those who can reduce computational costs per API call can provide more competitive pricing while ensuring profit margins. DSpark not only speeds up the model but also makes the business loop of large models healthier, providing vendors the confidence to thrive in this low-margin era.

Open Source DeepSpec: Arm the Small and Medium Teams

Simultaneously with the release of the DSpark paper, DeepSeek also open-sourced the DeepSpec framework. According to the official GitHub page, DeepSpec is open-sourced under the MIT license, containing speculative decoding algorithm modules like DSpark, DFlash, and Eagle3, and is compatible with mainstream open-source models such as Qwen3 and Gemma.

This is an action that deserves significant industry attention. In the current AI ecosystem, closed-source giants like OpenAI and Anthropic also use speculative decoding or similar inference acceleration architectures at the core level, but they rarely open-source these full-stack inference optimization toolchains. For small to medium AI startup teams, wanting to train an efficient draft model for their fine-tuned models often requires building from scratch.

We can imagine a startup team with just a few A100 GPUs facing a real dilemma. They may have fine-tuned a model for a particular vertical but find the generation speed extremely slow during deployment, leading to a poor user experience. If they wish to implement speculative decoding themselves, they need knowledge of CUDA operator development, memory management for KV Cache, and to design the training process for the draft model. This not only implies high labor costs but also suggests a lengthy trial and error cycle. Many small and medium teams exhaust their funding during this phase.

The open-sourcing of DeepSpec directly equips the entire industry with advanced inference optimization tools. In its framework design, DeepSpec provides a highly modular interface. Developers do not need to rewrite the inference engine from the ground up; they only need to specify the main model path and draft model path in a configuration file, and the framework will automatically manage the complete process of draft generation, confidence computation, and target model verification.

For teams wanting to train their own draft models, DeepSpec offers standardized data distillation modules that can distill the hidden states of the target large model for use by the draft model, significantly lowering the data preparation threshold. Developers no longer need to figure out how to build semi-autoregressive draft models or repeatedly debug confidence scheduling parameters. Through the standardized toolchain provided by DeepSpec, small and medium teams only need to configure parameters according to the documentation to incorporate speculative decoding capabilities into their models and enjoy a significant increase in generation speed.

This open-source strategy not only accelerates the overall technological iteration within the industry but also, to some extent, weakens the barriers that closed-source giants have in terms of inference efficiency. When the most advanced engineering optimization capabilities become industry public infrastructure, the focus of competition in large models will inevitably shift to higher-level application innovations and more fundamental data quality.

Battlefield Shift: Everything Beyond the Model Belongs to Harness

The release of DSpark and its successful deployment on DeepSeek-V4 confirm an irreversible industry trend: the main battlefield of large model competition has shifted.

As noted by OmniTools in the article "Everything Beyond the Model Belongs to Harness: Why the Competition Battlefield for Domestic AI Has Changed," when the parameter sizes of base models from various vendors reach the hundreds of billions and achieve similar performance in various benchmark tests, simply competing on parameters can no longer create differentiation. What determines the survival and cost of an AI product is the system-level engineering capability beyond the model, such as toolchains, inference scheduling architectures, and API routing, collectively referred to as Harness.

For those unfamiliar with underlying engineering, the term "Harness" may seem abstract. In software engineering, "harness" typically refers to a test harness or framework, but in the context of large models it refers to the entire system engineering infrastructure surrounding the base model. It includes but is not limited to: an API routing distribution system managing user requests, an inference scheduling architecture responsible for accelerating generation (such as speculative decoding and KV Cache management), a toolchain enabling the model to access external tools and network search, and a disaster recovery monitoring system ensuring high availability. The base model is like an engine, while the Harness encompasses the chassis, gearbox, and drive shaft. No matter how powerful the engine, if the Harness fails, the power cannot reach the wheels.

DSpark is a typical Harness-level play. It does not alter the base parameters of DeepSeek-V4 or improve the intelligence of the model, but through extreme engineering optimization, it resolves the issue of computational waste under real traffic, directly improving user experience (response speed) and the vendor's financial model (inference costs). The future competition among large models will increasingly manifest as this invisible system-level engineering contest. The smarter the inference scheduling, the more efficient the KV Cache management, and the higher the hit rate of speculative decoding, the more output can be achieved with the same computational power reserves. This shift from "parameter competition" to "engineering implementation competition" requires vendors to understand not only algorithms but also systems, hardware, and the real characteristics of traffic.

Limitations and Boundaries: DSpark is Not a Panacea

Despite exhibiting remarkable optimization effects in long text generation and high-concurrency scenarios, DSpark is not a panacea for large model inference problems. Any technical solution has its applicable boundaries, and DSpark is no exception.

First is the extremely high storage threshold. According to the data disclosed in the DSpark paper, training a draft model suitable for a medium-sized model (like Qwen3-4B) requires a target cache volume as high as 38TB. This number may just be routine for top vendors with massive computational clusters, but for resource-limited small and medium teams, 38TB of fast storage access itself is a hurdle that is hard to cross. This means that although DeepSpec has open-sourced the code, small and medium teams wishing to fully replicate and deploy optimizations at the DSpark level still need to overcome the reality of hardware resource constraints. Storage costs and I/O bottlenecks could realistically become barriers to the widespread adoption of this technology.

Second is the limited scope of the optimization. Large model inference is generally divided into two stages: Prefill and Decode. In the Prefill phase the model reads the user's prompt and produces the first token; this phase is compute-bound and already makes full use of the GPU. The Decode phase is the subsequent serial generation process; it is memory-bound and is where speculative decoding does its work. DSpark mainly addresses Decode, so it does little for time to first token (which is determined by Prefill), a metric users are extremely sensitive to. In very short interactions (such as simple Q&A or command execution), little content is generated, so the room for speculative decoding to help shrinks and DSpark's speed advantage is less pronounced.
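The effect of speeding up only the Decode phase follows Amdahl's law: the untouched Prefill time caps the overall gain. The sketch below applies a 1.85x Decode speedup (the upper end of the per-user figure quoted for the Flash version) to two hypothetical requests. The Prefill and Decode durations are made-up illustrative values, not measurements.

```python
def end_to_end_speedup(prefill_s: float, decode_s: float,
                       decode_speedup: float) -> float:
    """Overall latency speedup when only the Decode phase gets faster."""
    old_total = prefill_s + decode_s
    new_total = prefill_s + decode_s / decode_speedup
    return old_total / new_total


decode_speedup = 1.85  # an 85% increase in generation speed

# Hypothetical timings in seconds: (prefill, decode).
scenarios = {
    "long generation": (0.5, 20.0),
    "short Q&A": (0.5, 0.5),
}
for name, (prefill_s, decode_s) in scenarios.items():
    gain = end_to_end_speedup(prefill_s, decode_s, decode_speedup)
    print(f"{name}: {gain:.2f}x end-to-end")
```

With these assumed timings the long generation gets about 1.81x end to end, close to the full Decode speedup, while the short exchange gets only about 1.30x. Time to first token is unchanged in both cases.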

Lastly, there’s the issue of heterogeneous computational power adaptation. Currently, the online verification of DSpark is predominantly based on the specific service architecture of DeepSeek-V4, and there is a lack of public testing data regarding its adaptation and actual acceleration ratios on non-Nvidia hardware (like various domestic computational chips). Differences in scheduling logic of memory bandwidth and computational units across different hardware architectures raise questions about whether the confidence scheduling strategy needs to be retuned, which remains an engineering problem to be solved.

Technical optimization is not a silver bullet. DSpark demonstrates the tremendous potential of eliminating computational waste through algorithmic scheduling, giving the large model industry a powerful weapon in the cost war. However, in practice, developers must still rationally assess the return on investment of this approach based on their business scenarios, hardware reserves, and sensitivity to latency. The engineering of large models has only just entered deep waters.

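Disclaimer: This article represents only the personal views of the author and does not represent the position or views of this platform. This article is for information sharing only and does not constitute investment advice of any kind to anyone. Any dispute between users and the author is unrelated to this platform. If an article or image published on this page involves infringement, please send the relevant proof of rights and proof of identity by email to support@aicoin.com, and the platform's staff will review it.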
