qinbafrank|Apr 29, 2026 09:31
A very in-depth article. Starting from first principles of GPU architecture evolution, it answers the market's long-standing concern: why must the HBM demand of each GPU grow exponentially, and why doesn't HBM demand stagnate or periodically collapse the way traditional DRAM does? Recording the key points as a reading note.

1. The core KPI of the AI inference era has completely changed. In the CPU era, the top KPI was "performance/FLOPS" (the higher the benchmark score, the better). In the AI inference era (especially after the rise of agentic flows), the top KPI has become token economics: token throughput per unit cost / per unit of electricity, plus token generation speed. The essence of Nvidia's "AI factory" is to output the most tokens at the lowest cost while maximizing token speed, so the Pareto frontier has to keep moving up and to the right.

2. First-principles formula for token throughput (the core conclusion): token throughput = HBM size × HBM bandwidth. The throughput bottleneck is HBM size: every request carries a hot KV cache that must sit in HBM, and as the batch size grows the KV cache grows linearly, so HBM capacity has to grow linearly with it (otherwise it is like a shuttle bus with too small a cabin that needs multiple trips to move everyone). The bottleneck on each user's token generation speed is HBM bandwidth: generating each token requires repeatedly reading the weights and the KV cache from HBM, so the higher the bandwidth, the faster the decode (the wider the bus door, the faster passengers get on and off). The complete analogy: throughput = bus capacity (HBM size) × door width (HBM bandwidth). If each generation is to double token throughput, HBM's size × bandwidth product must double as well. This is the hardware ceiling, and software optimization cannot fundamentally replace it (see the sketch after this note).

3. The essential difference between the CPU era and the AI era. CPU era: DDR was only an "auxiliary" and upgraded extremely slowly (DDR3 to DDR5 took 15 years), because CPUs hide memory latency with large caches and superscalar execution, everyday workloads need little bandwidth or capacity, and application sizes grew slowly. AI/GPU era: the computing paradigm has shifted to being memory bound. Inference is memory: KV cache + context length + many concurrent requests put all the pressure on HBM. HBM has gone from icing on the cake to the decisive factor.

4. Checking against reality. Nvidia's token throughput curve from A100 to Rubin Ultra almost completely overlaps the HBM size × bandwidth curve on a log axis (see Figure 2 of the article). Even though 100% utilization is unattainable, HBM remains the ceiling of the whole system. Jensen Huang has to push the three suppliers (Samsung, Hynix, Micron) to keep upgrading, otherwise the GPUs cannot be sold.

5. Software optimization cannot change the hardware requirement. Optimizations such as an LPU moving weights into SRAM only improve the Pareto curve along another dimension; the hardware ceiling is still set by HBM. Just as in the CPU era, no matter how fast the software was, CPU vendors still had to keep pushing benchmark performance upward.
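To make the arithmetic in point 2 concrete, here is a minimal back-of-envelope sketch in Python of the size × bandwidth argument. Every number in it (the 70B model, KV-cache bytes per token, context length, and the two hypothetical GPUs gpu_a and gpu_b) is an illustrative assumption of mine, not a figure from the article or any vendor.

```python
# Back-of-envelope sketch of "token throughput ~ HBM size x HBM bandwidth".
# All numbers are illustrative assumptions, not measured or vendor figures.

GB, TB = 1e9, 1e12

# Hypothetical model / workload
weight_bytes = 70e9          # 70B parameters at 1 byte each (8-bit weights)
kv_bytes_per_token = 160e3   # KV-cache bytes per token, all layers combined
context_len = 8192           # tokens of context held per request

def max_batch(hbm_bytes):
    """Capacity bound: every concurrent request's KV cache must sit in HBM,
    so the batch size scales with whatever HBM is left after the weights."""
    free = hbm_bytes - weight_bytes
    return int(free // (kv_bytes_per_token * context_len))

def decode_rates(hbm_bytes, hbm_bw):
    """Bandwidth bound: each decode step re-reads the weights once plus every
    request's KV cache, so steps/sec <= bandwidth / bytes moved per step."""
    batch = max_batch(hbm_bytes)
    bytes_per_step = weight_bytes + batch * kv_bytes_per_token * context_len
    per_user_tok_s = hbm_bw / bytes_per_step   # one token per request per step
    return batch, per_user_tok_s, batch * per_user_tok_s

# Two hypothetical GPUs: more HBM lets more requests ride along (higher
# aggregate tokens/sec), and more bandwidth keeps each user's decode speed
# from collapsing as the batch grows: the two levers in the bus analogy.
for name, size, bw in [("gpu_a", 96 * GB, 4 * TB), ("gpu_b", 192 * GB, 8 * TB)]:
    batch, per_user, total = decode_rates(size, bw)
    print(f"{name}: batch={batch:4d}  per-user tok/s={per_user:5.1f}  "
          f"aggregate tok/s={total:7.0f}")
```

Under these made-up assumptions, the larger GPU serves roughly five times as many concurrent requests at essentially the same per-user speed (the gain exceeds 2x because the weight footprint is a fixed cost), which is exactly the "bigger bus, wider door" behavior the note describes.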