Author: Deep Tide TechFlow
On March 25, U.S. tech stocks rose broadly, with the Nasdaq 100 in the green, but one group of stocks bled against the trend:
SanDisk fell 3.50%, Micron dropped 3.4%, Seagate slid 2.59%, and Western Digital fell 1.63%. Across the storage sector, it felt as if someone had suddenly cut the power at a party.
The culprit is a paper, or more accurately, Google Research's formal promotion of a paper.
What does this paper do?
To understand this matter, one needs to clarify a concept in AI infrastructure that receives little external attention: KV Cache.
When you converse with a large language model, the model does not re-understand your question from scratch each time. It keeps the entire context of the conversation around as key-value pairs: the KV Cache, the model's short-term working memory.
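The mechanics can be sketched in a few lines. This is a toy single-head cache with made-up dimensions, not any framework's actual implementation:

```python
import numpy as np

d_head = 4  # hypothetical tiny head dimension; real models use e.g. 64-128

class KVCache:
    """Minimal sketch of a per-head KV cache (illustrative only)."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        # Each new token contributes one key and one value vector, forever.
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q):
        # Attention over everything cached so far: softmax(K q / sqrt(d)) . V
        K = np.stack(self.keys)            # (seq_len, d_head)
        V = np.stack(self.values)          # (seq_len, d_head)
        scores = K @ q / np.sqrt(d_head)
        w = np.exp(scores - scores.max())  # numerically stable softmax
        w /= w.sum()
        return w @ V

rng = np.random.default_rng(0)
cache = KVCache()
for _ in range(5):                         # five tokens of context
    cache.append(rng.normal(size=d_head), rng.normal(size=d_head))
out = cache.attend(rng.normal(size=d_head))
print(len(cache.keys), out.shape)          # the cache grows one entry per token
```

The point of the sketch: nothing is ever evicted, so the cache's footprint scales with every token the conversation accumulates.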
The problem is that the KV Cache grows linearly with the length of the context window. At context lengths in the millions of tokens, the GPU memory consumed by the KV Cache can exceed the model's own weights. For an inference cluster serving a large number of users simultaneously, this is a real and dauntingly expensive infrastructure bottleneck.
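A back-of-envelope calculation shows why. The figures below assume a Llama-3-8B-like configuration (32 layers, 8 KV heads, head dimension 128, fp16); they are illustrative, not numbers from the paper:

```python
# Back-of-envelope KV cache sizing (illustrative, Llama-3-8B-like config).
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2                  # fp16
seq_len = 1_000_000                 # a million-token context

# K and V are each stored per layer, per KV head, per token.
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
total_gb = per_token * seq_len / 1e9

print(per_token)        # 131072 bytes (128 KiB) of cache per token
print(round(total_gb))  # ~131 GB, versus ~16 GB for the fp16 weights
```

At these assumed settings the cache for one million-token request is roughly eight times the size of the model weights themselves.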
The paper first appeared on arXiv in April 2025 and will be officially published at ICLR 2026. Google Research calls the algorithm TurboQuant: it compresses the KV Cache to 3 bits, cutting memory usage by at least 6 times, with no training or fine-tuning required, usable out of the box.
The specific technical approach is in two steps:
Step one, PolarQuant. Instead of representing vectors in the standard Cartesian coordinate system, it transforms them into polar coordinates, a radius plus a set of angles. This simplifies the geometry of high-dimensional space and lets the subsequent quantization run at lower distortion.
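The geometric intuition can be sketched by splitting a vector into 2-D pairs and quantizing each pair's angle to a few bits. This toy code captures only the idea of quantizing in polar form; it is not Google's actual PolarQuant algorithm:

```python
import numpy as np

def polar_quantize(x, angle_bits=3):
    """Toy polar quantizer: store each 2-D pair as (radius, coarse angle)."""
    pairs = x.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])       # angle in (-pi, pi]
    levels = 2 ** angle_bits
    # Snap each angle to the nearest of 2**angle_bits grid points.
    q = np.round((theta + np.pi) / (2 * np.pi) * levels) % levels
    return r, q.astype(int), levels

def polar_dequantize(r, q, levels):
    theta = q / levels * 2 * np.pi - np.pi
    pairs = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return pairs.reshape(-1)

rng = np.random.default_rng(1)
x = rng.normal(size=8)
r, q, levels = polar_quantize(x)
x_hat = polar_dequantize(r, q, levels)
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(rel_err)   # coarse, but the angular error is bounded by the grid spacing
```

With 3 angle bits the angular error is at most pi/8, so the reconstruction error is bounded no matter what the input is; that bounded-distortion property is what makes polar-style quantization attractive.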
Step two, QJL (Quantized Johnson-Lindenstrauss). After PolarQuant does the main compression, TurboQuant applies a 1-bit QJL transform to correct the residual error in an unbiased way, preserving the accuracy of inner-product estimates, which is critical for the Transformer attention mechanism to function properly.
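The flavor of a 1-bit JL-style estimator can be shown in a few lines: project with random Gaussians, keep only the sign on one side, and rescale so the inner-product estimate stays unbiased. This uses a textbook sign-projection identity, not the paper's actual QJL construction:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 16, 20000              # m projections; large m only to show the math
x, y = rng.normal(size=d), rng.normal(size=d)

G = rng.normal(size=(m, d))   # random Gaussian projection matrix
bits = np.sign(G @ x)         # x survives as just 1 bit per projection
# For g ~ N(0, I): E[sign(g.x) * (g.y)] = sqrt(2/pi) * <x, y> / ||x||,
# so rescaling yields an unbiased estimate of the inner product <x, y>.
est = np.linalg.norm(x) * np.sqrt(np.pi / 2) * np.mean(bits * (G @ y))

true = float(x @ y)
print(est, true)              # close, despite x being reduced to 1 bit/projection
```

The key property, as in the article's description, is unbiasedness: averaged over the random projections, the 1-bit estimate neither systematically over- nor under-states the inner product that attention depends on.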
The results: on the LongBench benchmark, covering question answering, code generation, and summarization, TurboQuant matched or surpassed the strongest existing baseline, KIVI; on the needle-in-a-haystack retrieval task it achieved perfect recall; and on an NVIDIA H100, 4-bit TurboQuant sped up the attention computation by 8 times.
Traditional quantization methods carry an original sin: every time a chunk of data is compressed, extra storage is needed for "quantization constants" that record how to decompress it, and this metadata often costs an additional 1 to 2 bits per value. That may not sound like much, but at million-token contexts those bits add up at a dispiriting rate. TurboQuant eliminates this overhead entirely through PolarQuant's geometric rotation and QJL's 1-bit residual correction.
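The arithmetic behind that overhead is simple group-quantization bookkeeping. The group size and storage formats below are typical choices for illustration, not figures from the paper:

```python
# Why per-group quantization metadata hurts at scale (illustrative numbers).
group_size = 64                 # values sharing one set of quantization constants
scale_bits = 16                 # fp16 scale stored per group
zero_bits = 16                  # fp16 zero-point stored per group

overhead_per_value = (scale_bits + zero_bits) / group_size
print(overhead_per_value)       # 0.5 extra bits on top of every quantized value

# Smaller groups (often needed to keep accuracy) cost proportionally more:
print((scale_bits + zero_bits) / 16)   # 2.0 extra bits per value
```

On a 3-bit payload, even half a bit of metadata is a 17% tax, which is why eliminating it matters at million-token scale.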
Why is the market in a panic?
The directness of the conclusion is hard to ignore: a model that needs 8 H100s to serve a million-token context would, in theory, need only 2. Inference providers could handle more than 6 times as many concurrent long-context requests on the same hardware.
This is a knife aimed straight at the storage sector's core narrative.
Over the past two years, Seagate, Western Digital, and Micron were lifted onto the altar by the AI capital frenzy on one underlying logic: large models keep "remembering" more, the appetite for long context windows is bottomless, and storage demand will keep exploding. Seagate's stock rose over 210% in 2025, and its 2026 production capacity is already sold out.
The emergence of TurboQuant directly challenges the premise of this narrative.
Wells Fargo tech analyst Andrew Rocha's comments are most direct: "As the context window grows, the data storage in KV Cache expands explosively, and memory demand follows suit. TurboQuant is directly attacking this cost curve... If widely adopted, it will fundamentally question how much memory capacity is truly needed."
But Rocha also used a key premise: IF.
What is genuinely worth debating here
Was the market's reaction too extreme? The answer is likely: a bit.
First, the headline-grabbing "8x speedup". Several analysts pointed out that this benchmark compares the new technique against an old 32-bit, unquantized system, not against the optimized systems already widely deployed. The improvement is real, just not as dramatic as the headline suggests.
Second, the paper only tested small models. Every TurboQuant evaluation used models of at most about 8 billion parameters. What truly keeps storage vendors up at night are the 70-billion or even 400-billion-parameter giants, where KV Cache sizes become astronomical. How TurboQuant performs at those scales is unknown.
Third, Google has not released any official code. As of now, TurboQuant exists in neither vLLM, llama.cpp, Ollama, nor any other mainstream inference framework. Community developers have reproduced early implementations from the paper's mathematical derivations, and one early reproducer noted bluntly that if the QJL error-correction module is implemented incorrectly, the output turns straight into garbage.
But this does not mean that the market's concerns are unfounded.
This is the collective muscle memory from the 2025 DeepSeek moment at play. That event taught the entire market a harsh lesson: efficiency breakthroughs at the algorithmic level can completely change the narrative of expensive hardware overnight. Since then, any efficiency breakthrough from top AI labs triggers a conditioned reflex in the hardware sector.
Moreover, this signal comes from Google Research, not an obscure university lab—this company has enough engineering capability to turn papers into production-grade tools, and it is itself one of the largest consumers of AI inference globally. Once TurboQuant lands internally, the server procurement logic for Waymo, Gemini, and Google Search will quietly change.
A script that history keeps repeating
Here lies a classic debate that deserves serious attention: The Jevons Paradox.
The 19th-century economist William Stanley Jevons observed that improvements in steam-engine efficiency did not reduce Britain's coal consumption but dramatically increased it: efficiency gains lowered the cost of use, which stimulated far wider adoption.
The proponents' logic: if Google makes a model run in 16GB of VRAM, developers will not stop there; they will spend the saved compute on models 6 times more complex, larger multimodal data, and even longer contexts. The efficiency unlocked by software ends up serving demand that was previously priced out of reach.
But this rebuttal has a premise: the market needs time to digest and expand again. During the period it takes for TurboQuant to transition from a paper to a production tool, and from a production tool to an industry standard, can the hardware demand expansion fill the "gap" created by efficiency quickly enough?
No one knows the answer. The market is pricing in this uncertainty.
The real significance of this matter for the AI industry
More noteworthy than the fluctuations of storage stocks is a deeper trend revealed by TurboQuant.
The main battlefield of the AI arms race is shifting from "stacking compute power" to "extreme efficiency".
If TurboQuant can prove its performance claims at large model scales, it brings a fundamental shift: long-context inference turns from a luxury only top labs can afford into the industry default.
And this race for efficiency is precisely Google's home turf: compression algorithms that are mathematically near-optimal, chasing the limits of Shannon's information theory rather than brute-force engineering. TurboQuant's distortion is provably within roughly a 2.7x constant factor of the information-theoretic lower bound.
This means that similar breakthroughs will not be limited to just one in the future. It represents a whole research path maturing.
For the storage industry, perhaps the more sobering question is not "Will this affect demand this time?" but rather: As the cost curve of AI inference continues to be lowered by the software layer, how wide can the hardware layer's moat remain?
The current answer is: still very wide, but not wide enough to disregard such signals.
Disclaimer: This article represents only the author's personal views and does not represent the position or views of this platform. It is shared for informational purposes only and does not constitute investment advice to anyone. Any dispute between users and the author is unrelated to this platform. If any article or image on this page involves infringement, please email proof of rights and identity to support@aicoin.com, and the platform's staff will investigate.