In the new phase of the AI inference explosion, GPUs remain scarce and in tight supply, but memory that can both hold data and move it quickly is becoming the new protagonist.
Continuing from the previous article, let's look at HBM in detail. Demand for HBM is still running red-hot, while a wave of tiered-memory workarounds and architecture-level alternatives such as the TPU are emerging to ease the squeeze.
The market's core question about memory is this: is it a cyclical industry, or is it AI infrastructure?
If the former, everyone is waiting for the music to stop; if the latter, the future looks entirely different. This may be less a question of judgment than a piece of complicated arithmetic, and this article attempts to work through it from logic and facts.
1. Why does inference make memory the protagonist?
When a large model runs inference, three things must happen for every token it generates:
Read all of the model's parameters from memory once and stream them to the compute cores;
Read the intermediate states of every preceding token (the KV cache) once as well;
Then perform the matrix multiplications that produce the next token.
The third step is the calculation, while the first two steps are transport.
The total time for transport usually exceeds the total time for computation.
This fact applies to almost all models with over 10 billion parameters.
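To make the data movement behind these three steps concrete, here is a deliberately tiny, single-layer sketch in NumPy. The shapes are made up and nothing resembles a real model's dimensions; the point is only the access pattern: every generated token re-reads the weight matrix and the entire KV cache, while the compute itself is a handful of matrix multiplications.

```python
import numpy as np

# Toy single-layer decode step with miniature, made-up shapes, illustrating
# only the per-token access pattern; a real model repeats this across dozens
# of layers and billions of parameters.
d_model, n_prev = 64, 16
W        = np.random.randn(d_model, d_model)   # "model weights": read in full for every token
kv_cache = np.random.randn(n_prev, d_model)    # states of all previous tokens: also read every token
x        = np.random.randn(d_model)            # current token's hidden state

scores  = kv_cache @ x                          # attend over every cached token (reads the KV cache)
attn    = np.exp(scores) / np.exp(scores).sum() # softmax over previous tokens
context = attn @ kv_cache                       # weighted sum of cached states
y       = context @ W                           # matrix multiply that yields the next state (reads W)
```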
Take a 70-billion-parameter open-source model (Llama 3 70B): at FP16 precision its weights occupy about 140 GB, and generating each token requires reading that 140 GB from HBM into the GPU's compute cores. To keep tokens flowing smoothly, say 30 tokens per second, the link between HBM and the compute cores must carry about 4.2 TB per second. This is why the H100 SXM5's HBM bandwidth is pushed to 3.35 TB/s: below bandwidth of this order, inference of a 70B model starts to lag.
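The arithmetic behind these figures is simple enough to write out. The weight size, target token rate, and H100 bandwidth come from the paragraph above; the FP16 compute throughput is an approximate outside assumption, used only to show how lopsided transport is relative to compute:

```python
# Back-of-envelope decode arithmetic for a 70B FP16 model on one H100 SXM5.
# Weight size and HBM bandwidth are from the text; the FLOP/s figure is an
# approximate assumption for dense FP16 tensor-core throughput.
params          = 70e9                           # parameters
bytes_per_param = 2                              # FP16
weight_bytes    = params * bytes_per_param       # ~140 GB read per token

hbm_bandwidth = 3.35e12                          # H100 SXM5 HBM bandwidth, bytes/s
compute_rate  = 1.0e15                           # ~1 PFLOP/s FP16 (assumption)

transport_time = weight_bytes / hbm_bandwidth    # ~42 ms per token
compute_time   = 2 * params / compute_rate       # ~0.14 ms per token (~2 FLOPs per parameter)

print(f"per token: transport ~{transport_time*1e3:.1f} ms vs compute ~{compute_time*1e3:.2f} ms")
print(f"bandwidth-bound ceiling: ~{1/transport_time:.0f} tokens/s")
print(f"bandwidth needed for 30 tokens/s: {30*weight_bytes/1e12:.1f} TB/s")
```

At batch size 1, moving the weights takes roughly 300 times longer than the matrix multiplications themselves, and 3.35 TB/s caps a single H100 at about 24 tokens per second for this model, which is exactly why a 30-token-per-second target translates into a 4.2 TB/s bandwidth requirement.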
Bandwidth is one constraint; capacity is another. If a model's total parameter count exceeds a single GPU's HBM capacity, the model must be split into pieces and spread across multiple GPUs, so-called tensor parallelism. But splitting breaks the model apart: what could have been computed in one pass now takes several, and the GPUs must exchange intermediate results, so communication overhead becomes the new bottleneck.
Therefore, both capacity and bandwidth are important, but they have different emphases.
Capacity determines: can the model fit on a single card, does it need to be split, and what is the communication overhead after splitting?
Bandwidth determines: once the model fits, how fast can tokens come out, and how low can latency go?
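Reusing the H100 figures from above, here is a minimal numerical sketch of how the two questions get answered. The 80 GB per-GPU capacity is an assumption not stated in the text, and KV-cache memory and communication overhead are ignored for simplicity:

```python
import math

# Capacity question vs. bandwidth question, for Llama 3 70B at FP16 on H100s.
# The 80 GB per-GPU capacity is an assumption; KV cache and inter-GPU
# communication overhead are ignored to keep the sketch minimal.
weight_bytes  = 140e9      # 70B params x 2 bytes (FP16), from the text
gpu_capacity  = 80e9       # H100 HBM capacity, bytes (assumption)
gpu_bandwidth = 3.35e12    # H100 HBM bandwidth, bytes/s, from the text

# Capacity: does the model fit on one card, or must it be split?
gpus_needed = math.ceil(weight_bytes / gpu_capacity)   # -> 2, so tensor parallelism is required
print(f"GPUs needed just to hold the weights: {gpus_needed}")

# Bandwidth: once the weights fit, each card streams only its own shard,
# so the ideal token-rate ceiling scales with aggregate bandwidth.
ceiling = gpus_needed * gpu_bandwidth / weight_bytes
print(f"ideal bandwidth-bound ceiling: ~{ceiling:.0f} tokens/s")
```

In practice the communication overhead described above eats into that ideal ceiling, which is why fitting a model on a single device is worth so much.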
In addressing the demands of inference, NVIDIA and AMD have taken different paths:
NVIDIA's latest flagship, Rubin R200, comes with 288 GB of HBM4 for a single GPU, with a memory bandwidth of 22 TB/s;
AMD's next-generation MI455X comes with 432 GB of HBM4 for a single GPU, with a memory bandwidth of 19.6 TB/s.
AMD has 50% more capacity, but 11% less bandwidth.
NVIDIA emphasizes bandwidth—moving data faster.
AMD emphasizes capacity—keeping the model intact without splitting.
The two companies target different customers: AMD is aiming at the open-source crowd running very large models like 405B and 671B, while NVIDIA is targeting commercial SaaS inference that demands high concurrency and low latency.
Recently, a new player headed for IPO has made yet another bet: Cerebras's WSE-3 carries only 44 GB of on-chip SRAM per wafer, but its memory bandwidth reaches 21 PB/s, roughly 950 times that of NVIDIA's Rubin. It trades roughly 7 times less capacity for three orders of magnitude more bandwidth; Cerebras reads the problem differently from both NVIDIA and AMD.
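Running the same bandwidth-bound arithmetic across the three chips makes the positioning visible. The capacity and bandwidth figures are the ones quoted above; the 405-billion-parameter model stored in FP8 is an illustrative choice based on the open-source models mentioned, and compute limits and inter-chip communication are ignored:

```python
import math

# Devices needed to hold the weights, and the ideal single-stream token
# ceiling, for a hypothetical 405B-parameter model in FP8 (~405 GB of weights).
# Capacity/bandwidth figures are as quoted in the text; compute limits and
# inter-chip communication are ignored.
weight_bytes = 405e9    # 405B params x 1 byte (FP8)

chips = {
    "NVIDIA Rubin R200": (288e9, 22e12),    # (capacity bytes, bandwidth bytes/s)
    "AMD MI455X":        (432e9, 19.6e12),
    "Cerebras WSE-3":    (44e9,  21e15),
}

for name, (capacity, bandwidth) in chips.items():
    n = math.ceil(weight_bytes / capacity)   # devices needed to hold the weights
    ceiling = n * bandwidth / weight_bytes   # ideal tokens/s if every device streams its shard
    print(f"{name:18s} devices: {n:3d}   ideal ceiling: ~{ceiling:,.0f} tokens/s")
```

The output mirrors the positioning: the MI455X holds a 405B FP8 model on a single device, the R200 needs two devices but wins on per-stream speed, and the WSE-3 needs around ten wafers yet pushes memory bandwidth so far that compute and interconnect, not memory, become the real limits.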
2. Inference tightens two bottlenecks simultaneously.
This is the 12th article in the "AI Investment Map" series, several days and many drafts in the making, and still about 18,000 words after editing. The full report continues on the public account:
"HBM Panorama Research Report: From Training to Inference, the Protagonist is No Longer the GPU"
https://mp.weixin.qq.com/s/ch6D62c-4OsOllHfzf4jMA