Nvidia's share is only 48%: where are the opportunities in the era of inference?
This is the ninth article in a planned 100-article AI investment research series, roughly 20,000 words long. Bookmark it; not many readers will finish it in one sitting.
The previous articles covered Intel, AMD, and ARM. Their stock prices have risen sharply over the past year: AMD has doubled, Intel has tripled, and ARM has hit an all-time high. After a run like this, two simple questions arise: can we still invest in the names that have already risen? And are there still opportunities among those that have not?
To answer these questions, we cannot avoid one core term: inference. Look through the analysis of why those companies rose, and the word appears again and again.
So: how big is the inference track? What stage is it at? Which companies will benefit, and in what way? Which are already priced in by the market, and which are not?
These questions are worth answering before anything else.
1. How big is the track?
Model training is "writing the program"; inference is "calling that program every day." After GPT was trained, hundreds of millions of people asked it questions daily, and every Q&A consumed inference compute. When Claude Code runs a task, the agent may loop for a hundred rounds, and every round is inference.
Multiple industry studies and media reports point in the same direction: once models enter production, inference becomes the dominant share of lifecycle cost, commonly estimated at 80-90%. In other words, in the AI era, 8 to 9 of every 10 dollars on the compute bill will be burned on inference.
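A minimal back-of-the-envelope sketch of why this happens. Every number below is an illustrative assumption, not a real figure from any vendor: training is paid once, while inference is paid on every call, so at scale inference dominates.

```python
# Back-of-the-envelope lifecycle cost: training is a one-off bill, inference
# is billed on every call. All numbers are illustrative assumptions.

TRAINING_COST = 50_000_000     # hypothetical one-off training cost, $
COST_PER_QUERY = 0.01          # hypothetical inference cost per Q&A, $
QUERIES_PER_DAY = 100_000_000  # hypothetical daily query volume
DAYS = 365

inference_cost = COST_PER_QUERY * QUERIES_PER_DAY * DAYS
share = inference_cost / (TRAINING_COST + inference_cost)
print(f"inference share of year-one compute cost: {share:.0%}")  # -> 88%
```

Under these made-up inputs, inference is already 88% of year-one spend, squarely in the 80-90% range the studies cite; push the query volume higher and the share only climbs.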
However, the market has spent the past three years talking almost exclusively about training, because training is the more exciting story: who has more H100s, whose parameter count is bigger, who trains the next-generation model first. Inference was treated as an afterthought.
That cognitive bias is now being reversed, and this reversal is the fundamental reason this batch of semiconductor companies has been revalued over the past year.
So the inference track is big, but how big exactly? It can be measured from five angles.
First, the number of users. ChatGPT has 900 million weekly active users and 50 million paying users. The Chinese figures are even more striking: daily token usage is expected to rise from 100 billion at the start of 2024 to 140 trillion by 2026, a 1,400-fold increase. This market is still far from saturated.
Second, usage intensity. OpenAI processed 6 billion tokens per minute in October 2025 and 15 billion by April 2026, a 2.5-fold rise in six months. Enterprise revenue now exceeds 40% of the total, and enterprise users consume tokens at dozens of times the intensity of consumers.
Third, context length. Early models handled a few hundred tokens of context; DeepSeek's current API documentation lists a 1M-token context window for V4 Pro / Flash, with maximum output of 384K tokens. The longer the input, the more memory and compute every single inference consumes.
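The easiest place to see why long contexts are expensive is the KV cache a decoder must hold in GPU memory, which grows linearly with context length. A minimal sketch, using a hypothetical 70B-class model shape (not DeepSeek's actual architecture):

```python
# KV-cache memory per request grows linearly with context length.
# The model shape below is a hypothetical 70B-class decoder.

def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x for keys and values, stored per layer, per KV head, per token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

for ctx in (4_096, 128_000, 1_000_000):
    print(f"{ctx:>9} tokens -> {kv_cache_bytes(ctx) / 2**30:6.1f} GiB per request")
# A 4K context needs ~1.3 GiB of KV cache; a 1M-token context needs ~305 GiB
# for a single request, before the model weights are even counted.
```

Under these assumptions, one million-token request by itself outgrows the memory of any single accelerator, which is exactly why long-context serving drives demand for more and bigger inference hardware.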
Fourth, the models themselves are becoming more compute-hungry. Reasoning models such as OpenAI o1, DeepSeek R1, and Claude's thinking modes first "think" internally, emitting thousands or even tens of thousands of tokens, before answering. Jensen Huang has said reasoning models can require substantially more compute, potentially orders of magnitude more.
In the past, you asked the AI a question and it answered immediately; now you hand it a complex problem and it ponders for half a minute before responding. That half minute of "thinking" is pure additional compute consumption.
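The multiplier is easy to put a rough number on. The token counts below are made up for illustration, not measurements of any specific model:

```python
# Rough decode-work multiplier from hidden "thinking" tokens.
# Both token counts are illustrative assumptions.

visible_answer_tokens = 300  # what a non-reasoning model might emit
thinking_tokens = 10_000     # hypothetical hidden chain-of-thought tokens

multiplier = (thinking_tokens + visible_answer_tokens) / visible_answer_tokens
print(f"decode tokens per question: ~{multiplier:.0f}x")  # ~34x more decode work
```

If the hidden chain of thought is ten thousand tokens against a three-hundred-token answer, the same question costs roughly 34 times the decode work, which is the arithmetic behind Huang's "orders of magnitude" remark.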
Fifth, agents. One agent task typically calls the model 10-100 times. OpenAI Codex alone has passed 3 million weekly active users, and that is one product from one company. One industry estimate puts the total compute consumed by agents at more than ten times that of a plain chat model of the same parameter scale.
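The shape of the problem is a loop. The sketch below is a generic agent skeleton, with call_model() as a stand-in for any LLM API (the function and its stop condition are hypothetical); the point is that one task fans out into many inference calls, each re-reading a growing history:

```python
# Generic agent skeleton: one user task fans out into many inference calls.
# call_model() is a stub standing in for any LLM API.

def call_model(prompt: str) -> str:
    # Each invocation is one inference request: a prefill pass over the whole
    # prompt, then token-by-token decode. Stubbed so the sketch runs.
    return "DONE"

def run_agent(task: str, max_steps: int = 50) -> str:
    history = [task]
    for _ in range(max_steps):        # each iteration = at least one model call
        action = call_model("\n".join(history))
        history.append(action)
        if action == "DONE":          # hypothetical stop condition
            break
    return history[-1]

run_agent("refactor this repo")
# At max_steps=50, a single task can consume 50 inference calls, and every
# call re-reads an ever-growing history, so cost grows faster than linearly.
```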
Multiply these five factors together, and the claim that total inference demand will expand dramatically over the next three to five years stops being an exaggeration and becomes an increasingly mainstream view.
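To make the multiplication concrete, here are purely illustrative multipliers for the five factors; every number is an assumption, not a forecast:

```python
# Purely illustrative multipliers for the five factors above.
# Each value is an assumption, not a forecast.

factors = {
    "more users":       3,    # user-base growth
    "higher intensity": 2.5,  # tokens per user
    "longer contexts":  2,    # compute per call from longer inputs
    "reasoning models": 10,   # hidden thinking tokens per answer
    "agents":           20,   # model calls per task
}

total = 1.0
for multiplier in factors.values():
    total *= multiplier
print(f"combined demand multiplier: ~{total:,.0f}x")  # ~3,000x under these guesses
```

Even with modest guesses for each factor, the product lands in the thousands: the factors compound rather than add, which is why the demand curve bends geometrically.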
Economics has an old name for this dynamic: the Jevons Paradox. Making a resource more efficient to use can increase its total consumption, because the falling unit cost unlocks far more applications. After steam engines became more efficient, Britain's coal consumption soared; as the unit price of an inference token falls, AI query volumes soar. Same script. The IEA estimates that global data center electricity consumption, about 1.5% of world electricity use in 2024, will roughly double to 945 TWh by 2030, roughly the combined annual electricity consumption of Germany and France.
Concrete moves on the industry front further solidify this argument:
Anthropic's ARR is projected to rise from $1 billion at the end of 2024 to $30 billion by early 2026, a 30-fold increase in just 14 months. To support that curve, the company locked in over 11 GW of computing power between late 2025 and early 2026, including a $21 billion order for TPUs from Broadcom. OpenAI has committed to deploying 10 GW of custom chips. Google has raised its 2026 TPU shipment target by 50%, to 6 million units.
Cloud vendors' capital expenditure figures are even more direct. Google plans $175-185 billion of capex in 2026, nearly double 2025; Amazon will invest $200 billion in 2026; Meta plans a 65% increase to $118 billion. Total capex across the eight major cloud vendors will exceed $600 billion in 2026, up roughly 40% year over year.
Put it all together and the conclusion is simple: the AI inference demand curve has already outrun the supply capacity of any single hardware vendor.
That is the full backdrop of the inference track: the training era was about "creating a god"; the inference era is about that god being called by hundreds of millions of people every day, with each agent calling it a hundred times and each answer preceded by tens of thousands of thinking tokens. From the former to the latter, compute consumption does not grow linearly; it takes a geometric leap.
2. Which stocks will benefit?
A large track does not mean every company benefits, and the data already shows Nvidia's monopoly starting to loosen.
As of 2026, Nvidia holds roughly 48.2% of the global AI inference chip market, AMD about 16.7%, and the ASIC camp about 18.5% (Google TPU 7.8%, AWS Inferentia 5.2%, other ASICs 5.5%), while Chinese domestic inference chips account for 16.6% in total.
Nvidia still holds over 80% of the training market, but in inference it is now below half, at 48.2%.
Why is this happening?
In the training era, Nvidia won on comprehensive strength: high-performance GPUs plus NVLink high-speed interconnect plus the CUDA ecosystem. For training workloads, that combination simply outclassed everything else.
Read the full article: Nvidia's share is only 48%: where are the opportunities in the era of inference?
https://mp.weixin.qq.com/s/pbT3QSbtLpEoJc1T4gL2Sw