In the new phase of the AI reasoning explosion.

BTCdayu
3 hours ago

In the new phase of the AI inference explosion, GPUs remain scarce and supply-constrained, but memory that can both hold data and move it quickly is becoming the new protagonist.

Continuing from the previous article, let's look at HBM in detail. Demand for HBM is still explosive, while tiered-memory relief schemes and architecture-level alternatives such as the TPU are on the way to ease the shortage.

The market's core question about memory is this: is it a cyclical industry, or AI infrastructure?

If the former, everyone is waiting for the music to stop; if the latter, the future looks entirely different. This may be less a matter of judgment than a tractable math problem, and this article attempts to work it out from logic and facts.

1. Why does inference make memory the protagonist?

When a large model runs inference, three things must happen for each token generated:

Read all the parameters of the entire model from memory once and send them to the computing core;

Read the intermediate state (called KV Cache) of all the tokens prior to this token once as well;

Then perform matrix multiplication to calculate the next token.
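The three steps above can be sketched as a toy decode step. This is an illustrative single-projection sketch with made-up shapes, not a real transformer; it only shows which parts are memory traffic and which part is compute.

```python
import numpy as np

def decode_step(weights, kv_cache, x):
    """One toy decode step: steps 1-2 are memory reads, step 3 is compute."""
    # Step 1: stream all model weights from memory (here, one matrix).
    w = weights                    # (d, d), read in full for every token
    # Step 2: stream the KV cache of every previously generated token.
    scores = kv_cache @ x          # (n_past,), attend over past tokens
    # Step 3: the actual computation, a matrix multiply.
    h = w @ x                      # (d,), next hidden state
    return h, scores

d = 8
weights = np.ones((d, d), dtype=np.float16)
kv_cache = np.ones((5, d), dtype=np.float16)   # 5 past tokens cached
h, scores = decode_step(weights, kv_cache, np.ones(d, dtype=np.float16))
print(h.shape, scores.shape)   # (8,) (5,)
```

Steps 1 and 2 scale with model size and context length respectively, which is why both weight bandwidth and KV-cache traffic matter.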

The third step is the calculation, while the first two steps are transport.

The total time for transport usually exceeds the total time for computation.

This fact applies to almost all models with over 10 billion parameters.

A 70-billion-parameter open-source model (Llama 3 70B) has about 140 GB of weights at FP16. Generating each token requires reading that 140 GB from HBM into the GPU's compute cores. To sustain smooth generation, say 30 tokens per second, the HBM-to-core bandwidth would need to carry about 4.2 TB per second. The H100 SXM5's HBM bandwidth of 3.35 TB/s falls short of even that, which is why single-GPU 70B inference is bandwidth-bound: the lower the bandwidth, the more the token stream lags.
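A quick back-of-envelope check of these figures. The H100's peak FP16 throughput of ~1e15 FLOP/s is my assumption for the compute-time comparison; exact figures vary by SKU and sparsity mode.

```python
params = 70e9
model_bytes = params * 2                  # FP16: 2 bytes/param, ~140 GB

# Bandwidth needed for a target decode rate (weights traffic only).
target_tok_s = 30
required_bw = model_bytes * target_tok_s  # 4.2e12 bytes/s = 4.2 TB/s
print(f"required: {required_bw / 1e12:.1f} TB/s")

# What the H100's 3.35 TB/s actually supports for this model.
h100_bw = 3.35e12
max_tok_s = h100_bw / model_bytes
print(f"H100 ceiling: {max_tok_s:.1f} tokens/s")

# Memory-bound check: per-token transfer time vs compute time.
flops_per_token = 2 * params              # ~2 FLOPs per parameter per token
peak_flops = 1e15                         # assumed H100 FP16 dense peak
transfer_ms = model_bytes / h100_bw * 1e3
compute_ms = flops_per_token / peak_flops * 1e3
print(f"transfer {transfer_ms:.1f} ms vs compute {compute_ms:.2f} ms")
```

Transfer time per token (~42 ms) exceeds compute time (~0.14 ms) by over two orders of magnitude, which is exactly the sense in which decoding is "transport" rather than calculation.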

Bandwidth is one constraint; capacity is another. If a model's total parameter count exceeds the HBM capacity of a single GPU, the model must be split into several parts and distributed across multiple GPUs, a scheme called tensor parallelism. But splitting breaks the model apart: what could be computed in one pass now takes several, with inter-GPU communication to exchange intermediate results, and that communication overhead becomes the new bottleneck.
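Using the capacity figures cited in this article, a minimal sketch of the capacity constraint. This counts weights only; the KV cache and activations, which a real deployment must also fit, are ignored here, so actual GPU counts are higher.

```python
import math

def gpus_needed(model_gb, hbm_per_gpu_gb):
    """Minimum GPUs just to hold the weights, i.e. the tensor-parallel degree."""
    return math.ceil(model_gb / hbm_per_gpu_gb)

weights_405b_fp16 = 405 * 2   # ~810 GB for a 405B model at FP16
print(gpus_needed(weights_405b_fp16, 80))    # H100, 80 GB    -> 11
print(gpus_needed(weights_405b_fp16, 288))   # R200, 288 GB   -> 3
print(gpus_needed(weights_405b_fp16, 432))   # MI455X, 432 GB -> 2
```

Every extra card in the split adds inter-GPU communication per token, which is why larger single-card capacity directly reduces the overhead described above.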

Therefore, both capacity and bandwidth are important, but they have different emphases.

Capacity determines: Can the model fit on a single card? Does it need to be split? What is the communication overhead after splitting?

Bandwidth determines: once the model fits, how fast can tokens be emitted, and how low can latency go?

In addressing the demands of inference, NVIDIA and AMD have taken different paths:

NVIDIA's latest flagship, Rubin R200, comes with 288 GB of HBM4 for a single GPU, with a memory bandwidth of 22 TB/s;

AMD's next-generation MI455X comes with 432 GB of HBM4 for a single GPU, with a memory bandwidth of 19.6 TB/s.

AMD has 50% more capacity, but 11% less bandwidth.

NVIDIA emphasizes bandwidth—moving data faster.

AMD emphasizes capacity—keeping the model intact without splitting.
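A rough way to see the trade-off: for a model that fits on one card, the bandwidth-bound decode ceiling follows directly from the spec-sheet numbers. This counts weights traffic only, ignoring KV-cache reads and any compute overlap, so it is an upper bound.

```python
model_bytes = 140e9   # the same 70B FP16 model, ~140 GB of weights

# Spec figures as cited in this article.
for name, bw in [("R200", 22.0e12), ("MI455X", 19.6e12)]:
    print(f"{name}: at most {bw / model_bytes:.0f} tokens/s")
```

Both ceilings are far above the H100's ~24 tokens/s for this model; the gap between the two next-generation parts only becomes decisive at high concurrency, which matches the customer split described below.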

The target customer groups of the two companies differ: AMD aims at the open-source crowd that runs super large models like 405B and 671B; NVIDIA targets the SaaS crowd that requires high concurrency and low-latency commercial inference.

Recently there is a new player heading for IPO: Cerebras, whose WSE-3 carries only 44 GB of on-chip SRAM per wafer-scale chip, but its memory bandwidth reaches 21 PB/s, roughly 950 times that of NVIDIA's Rubin. That is about one-seventh the capacity traded for three orders of magnitude more bandwidth; Cerebras is making a different bet than NVIDIA or AMD.

2. Inference tightens two bottlenecks simultaneously.

This is the 12th article in my "AI Investment Map" series. It took several days and many drafts, and still runs to 18,000 words after trimming; feel free to share and bookmark it, and consider starring this account.

The formatting here is a mess; read it directly on the public account:

"HBM Panorama Research Report: From Training to Inference, the Protagonist is No Longer the GPU"

https://mp.weixin.qq.com/s/ch6D62c-4OsOllHfzf4jMA


Disclaimer: This article represents the author's personal views only and does not reflect the position or views of this platform. It is shared for informational purposes and does not constitute investment advice of any kind. Any dispute between users and the author is unrelated to this platform. If any article or image on this page infringes your rights, please email proof of rights and identity to support@aicoin.com, and platform staff will investigate.

