
The "Chinese Tax" of AI Large Models: Why is Chinese more Token-consuming than English?

深潮TechFlow · 23 hours ago
When engineers smooth out the edges of language in exchange for efficiency, the wisdom that unwittingly grows in the cracks quietly disappears.

In the days following the release of Opus 4.7, complaints flooded X. Some said a single conversation used up all their session quota, while others reported that the cost of running the same code had more than doubled compared to last week; some even shared screenshots showing that their $200 Max subscription reached its limit in less than two hours.


Independent developer BridgeMind admits that Claude is the best model in the world, but also the most expensive. His Max subscription hit the limit in under two hours; fortunately, he had bought two subscriptions. | Image source: X@bridgemindai

Anthropic's official pricing remains unchanged, at $5 per million input tokens and $25 per million output tokens. However, this version introduced a new tokenizer, and Claude Code raised the default effort setting from high to xhigh. Combined, the two changes made token consumption for the same task 2 to 2.7 times what it was before.
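To see how an unchanged price list still produces a bigger bill, a back-of-the-envelope calculation helps. The per-token prices are the ones quoted above; the task size and the 2 to 2.7× inflation factor in this sketch are illustrative assumptions, not measured figures.

```python
# Back-of-the-envelope cost estimate: same task, old vs. new tokenizer.
# Prices are the official Opus rates quoted above; the token counts and
# the 2.0-2.7x inflation factors are illustrative assumptions.

PRICE_IN = 5 / 1_000_000    # USD per input token
PRICE_OUT = 25 / 1_000_000  # USD per output token

def task_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single request."""
    return input_tokens * PRICE_IN + output_tokens * PRICE_OUT

# A hypothetical coding task before the update.
old = task_cost(input_tokens=40_000, output_tokens=8_000)

# The same task after the update, assuming tokens inflate 2.0x to 2.7x.
new_low = task_cost(int(40_000 * 2.0), int(8_000 * 2.0))
new_high = task_cost(int(40_000 * 2.7), int(8_000 * 2.7))

print(f"old: ${old:.2f}, new: ${new_low:.2f}-${new_high:.2f}")
# The per-token price never changed, but the bill scales with the inflation factor.
```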

I saw two statements related to Chinese in these discussions. One was that token consumption for Chinese under the new tokenizer has hardly increased, allowing Chinese users to avoid this price hike. The other, more interesting observation was that classical Chinese consumes fewer tokens than modern Chinese; conversing with AI in classical Chinese can save costs.

The first statement suggests that Claude has made some optimization for Chinese, but Anthropic’s release documentation does not mention any adjustments related to Chinese.

The second statement is harder to explain. Classical Chinese is evidently more difficult for human readers than modern Chinese; how could text that is harder for humans be cheaper for the AI?

So I conducted a test using 22 segments of parallel text (including business news, technical documents, classical texts, daily conversations, etc.), sending them through 5 tokenizers (Claude 4.6 and 4.7, GPT-4o, Qwen 3.6, DeepSeek-V3), and comparing the token counts for each segment across models.


Test text:

1. Everyday conversations in Chinese and English (travel, forum help requests, writing requests)

2. Technical documents in Chinese and English (Python documentation, Anthropic documents)

3. News articles in Chinese and English (NYT political news, NYT business news, official statements from Apple)

4. Literary excerpts in Chinese and classical Chinese (“Memorial on the Expedition,” “Tao Te Ching”)
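For anyone who wants to reproduce this kind of comparison, the sketch below shows one way to collect the counts. It assumes tiktoken's o200k_base encoding for GPT-4o and Hugging Face tokenizers for Qwen and DeepSeek (the repo IDs are my assumptions); Claude's tokenizer is not published, so its counts would have to come from Anthropic's token-counting API and are omitted here. The sample pair is a placeholder, not one of the 22 test segments.

```python
# Sketch of the counting setup: tokenize parallel Chinese/English segments and
# compare counts. Claude counts are omitted because its tokenizer is not public;
# they would come from Anthropic's token-counting API instead.
import tiktoken
from transformers import AutoTokenizer

gpt4o = tiktoken.get_encoding("o200k_base")  # GPT-4o's encoding
# The Hugging Face repo IDs below are assumptions chosen for illustration.
qwen = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
deepseek = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3", trust_remote_code=True)

def counts(text: str) -> dict[str, int]:
    return {
        "gpt4o": len(gpt4o.encode(text)),
        "qwen": len(qwen.encode(text, add_special_tokens=False)),
        "deepseek": len(deepseek.encode(text, add_special_tokens=False)),
    }

# Placeholder parallel pair; the real test used 22 parallel segments.
en = "Artificial intelligence is reshaping the global information infrastructure."
cn = "人工智能正在重塑全球的信息基础设施。"

en_counts, cn_counts = counts(en), counts(cn)
for name in en_counts:
    ratio = cn_counts[name] / en_counts[name]
    print(f"{name:9s} en={en_counts[name]:3d} cn={cn_counts[name]:3d} cn/en={ratio:.2f}")
```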

After testing, both statements received partial verification, but the facts are slightly more complex than the rumors.

Chinese Tax

First, here are the conclusions:

1. On both Claude and GPT, Chinese has always been more expensive than English.

2. On Qwen and DeepSeek, Chinese is actually cheaper than English.

3. The Opus 4.7 tokenizer update that triggered the turbulence inflated token counts almost exclusively for English; Chinese stayed essentially unchanged.

Looking at the specific numbers: the full series of models prior to Claude Opus 4.7 (including Opus 4.6, Sonnet, and Haiku) used the same tokenizer. Under that tokenizer, Chinese consistently consumed more tokens than the equivalent English content, with the cn/en ratio ranging from 1.11× to 1.64×.

The most extreme scenario appeared in NYT style business news: the same content in Chinese consumed 64% more tokens, equating to paying 64% more.

On Claude 4.6 and earlier, token consumption for Chinese is significantly higher than on the other models (red box).

The most extreme case appeared in NYT-style business news: for the same content, the Chinese version consumed 64% more tokens (green box).

GPT-4o's o200k tokenizer fares better: most cn/en ratios fall between 1.0× and 1.35×, and a few scenarios drop below 1. Chinese still tends to be more expensive overall, but the gap is much smaller than on Claude.

The domestic models Qwen 3.6 and DeepSeek-V3 show the opposite pattern. Both have cn/en ratios well below 1, meaning that for the same content the Chinese version actually uses fewer tokens than the English version. DeepSeek even reached 0.65×, making the Chinese version of the same sentence roughly a third cheaper than the English one.

Under the new tokenizer in Opus 4.7, only English inflated. English token counts grew by 1.24× to 1.63×, while Chinese largely stayed at 1.000×, almost unchanged. The bill shock that English-speaking developers experienced was something Chinese users genuinely did not feel. The likely reason is that Chinese text was already being processed at character granularity under the old version, leaving very little room for further splitting.

Compared with 4.6, Opus 4.7 consumes more tokens for English, while Chinese remains unchanged.

During testing I also noticed something else. The differences in token consumption are not just a billing issue; they directly affect the size of the workspace. In a 200k context window, the old Claude tokenizer fits 40% to 70% less Chinese material than English material.

For the same task, such as having the AI analyze a lengthy document or summarize a set of meeting notes, Chinese users can feed the model less material, so the model has less context to draw on. The result: they pay more money and get a smaller workspace.

Putting the four sets of data together, a question naturally arises:

Why does the token count differ when the same content is in different languages? Why is Chinese more expensive on Claude and GPT, while cheaper on Qwen and DeepSeek?

The answer lies in the tokenizer concept mentioned multiple times above.

How Many Pieces Can a Chinese Character Be Cut Into?

Before the model reads any text, the tokenizer splits the input into individual tokens. You can picture the tokenizer as the AI's "brick cutter": when you input a sentence, it is responsible for breaking that sentence into standardized bricks (tokens). The model never sees the text itself; it only sees the numbers of the bricks. The more bricks you use, the more you pay.

The way English is cut is relatively intuitive; for example, "intelligence" is likely one token, and "information" is also one token, with one word corresponding to one billing unit.


But when Chinese hits this step, problems appear. Sending the same sentence, "Artificial intelligence is reshaping the global information infrastructure," through GPT-4's cl100k tokenizer and the Qwen 2.5 tokenizer produces completely different results.

GPT-4 essentially makes each Chinese character its own token, whereas Qwen recognizes phrases as single tokens; for example, "人工智能" (artificial intelligence) counts as one token in Qwen.


In the case of 16 Chinese characters, GPT-4 outputs 19 tokens, while Qwen outputs only 6 tokens.
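The difference in granularity is easiest to see by decoding each token back into text. The sketch below assumes the sentence in the figure is 人工智能正在重塑全球信息基础设施 (the 16-character Chinese original of the sentence quoted above); your exact counts may differ slightly from the 19-versus-6 figures reported here.

```python
# Decode each token back into text to see the cutting granularity: cl100k
# (the GPT-4 era tokenizer) cuts mostly character by character, while Qwen's
# tokenizer keeps whole words such as 人工智能 together.
import tiktoken
from transformers import AutoTokenizer

sentence = "人工智能正在重塑全球信息基础设施"  # assumed 16-character original

cl100k = tiktoken.get_encoding("cl100k_base")
qwen = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # illustrative repo ID

gpt4_ids = cl100k.encode(sentence)
gpt4_pieces = [cl100k.decode([i]) for i in gpt4_ids]  # partial UTF-8 bytes may show as "�"

qwen_ids = qwen.encode(sentence, add_special_tokens=False)
qwen_pieces = [qwen.decode([i]) for i in qwen_ids]

print(len(gpt4_ids), gpt4_pieces)   # many small, character-sized pieces
print(len(qwen_ids), qwen_pieces)   # fewer, word-sized pieces
```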

Why do these discrepancies occur? The reason lies in an algorithm called BPE (Byte Pair Encoding).

BPE works by counting which character combinations appear most frequently in the training corpus, then merging high-frequency combinations into a single token to be added to the vocabulary.
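A toy version of one round of BPE training makes the mechanism concrete: count adjacent pairs, merge the most frequent pair into a new symbol, repeat. This is a didactic sketch, not the training code of any production tokenizer.

```python
# Toy BPE: repeatedly merge the most frequent adjacent pair of symbols.
# Didactic sketch only; real tokenizers train on huge corpora and operate on
# bytes, but the merge rule is the same idea.
from collections import Counter

def bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Start with each word as a sequence of single characters.
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]   # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        for w in words:                     # apply the merge in place
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

# On an English-heavy toy corpus, "ti", "tio", "tion" get merged within a few rounds.
print(bpe_merges(["station", "nation", "motion", "action", "fraction"], 4))
```

On an English-heavy corpus, common fragments are merged into single symbols after only a few rounds, which is exactly why frequent English chunks end up as whole tokens.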

During the GPT-2 era, the majority of the training corpus consisted of English. Character combinations in English (like th, ing, tion) appeared repeatedly and were quickly consolidated into tokens. Chinese characters appeared too infrequently in that corpus to be included in the vocabulary and were thus treated as raw bytes. Each Chinese character, occupying 3 bytes, became 3 tokens.
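The three-bytes-becomes-three-tokens arithmetic is easy to check: in UTF-8, a common CJK character occupies three bytes, and in a byte-level BPE with no merges covering those bytes, each byte is billed as its own token.

```python
# A common CJK character is 3 bytes in UTF-8; English letters are 1 byte each
# but get merged into whole-word tokens by an English-heavy BPE.
for s in ("tion", "智", "能"):
    raw = s.encode("utf-8")
    print(f"{s!r}: {len(raw)} bytes -> {[hex(b) for b in raw]}")

# With no merges covering these byte sequences, a byte-level BPE bills one
# token per byte, so a single character like 智 costs 3 tokens.
```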


BPE's merging is determined by the character frequency in the training corpus. Under the dominance of English data, Chinese UTF-8 bytes could not be merged into entire characters.

This illustrates how the same sentence is split under different tokenizers.

Is Classical Chinese Really Cheaper?

In all models, classical Chinese consumes fewer tokens than modern Chinese, and even fewer than English.

The classical Chinese example led me to realize that the token count alone does not reveal much. However, thinking further along this line, I recalled a layer I had previously overlooked.

Examples such as spark, flame, and glow appear in written language and in names, implying brightness and intensity.

The number itself carries no structural information about the character. To the model, the relationship between 38721 and 38722 is as meaningless as that between 1 and 10000; the information contained in the character's structure is sealed away. The fact that "焱" stacks three "火" simply does not exist at the level of the token ID.

Of course, the model can indirectly learn that "焱," "炎," and "灼" often appear in similar contexts through extensive training data, but this route is more indirect than utilizing radical information directly.

A paper published in January 2025 in MIT Press’s "Computational Linguistics" titled "Tokenization Changes Meaning in Large Language Models: Evidence from Chinese" addressed this question.

Growing Radicals from Fragments

Author David Haslett noted a historical coincidence.

CJK characters are ordered roughly by radical in Unicode, so characters sharing a radical have similar UTF-8 encodings. | Image source: GitHub

This means that when the tokenizer breaks Chinese characters into three UTF-8 byte tokens, characters sharing radicals will share the first token. During training, the model repeatedly sees these shared byte patterns and may learn that "characters with the same first token often belong to the same semantic category." This functionally approximates the way humans infer meanings through radicals.
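The byte-level side of this claim can be verified directly. The characters below all carry the 火 (fire) radical; because Unicode orders CJK ideographs roughly by radical, their code points sit close together and their UTF-8 sequences share leading bytes. Whether one or two of those leading bytes end up fused into the "first token" depends on the tokenizer's learned merges.

```python
# Characters carrying the 火 (fire) radical sit close together in Unicode, so
# their UTF-8 byte sequences share leading bytes. A tokenizer that splits them
# into byte-level tokens therefore tends to give same-radical characters the
# same first token(s).
fire_family = ["火", "灯", "灼", "炎", "焱"]

for ch in fire_family:
    print(f"{ch}  U+{ord(ch):04X}  {[hex(b) for b in ch.encode('utf-8')]}")
```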

Haslett designed three experiments to validate this.

The first experiment asked GPT-4, GPT-4o, and Llama 3 whether "茶" and "茎" contain the same semantic radical.

The second experiment had the model score the semantic similarity of two Chinese characters.

The third experiment gave the model an odd-one-out task: pick the character that does not belong to the same category as the others.

Each experiment varied two factors: whether the two characters genuinely share a radical, and whether they share the first token under the tokenizer. This 2×2 design made it possible to separate the effect of the radical from the effect of the token.
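Operationally, "shares the first token" can be checked mechanically against a tokenizer, while "shares a radical" is a linguistic label. The sketch below illustrates how the two factors can be crossed; the character pairs and radical judgments are my own illustrative picks, not the paper's stimulus set.

```python
# Sketch of the 2x2 condition assignment: (shares a radical) x (shares the
# first token). Character pairs and radical labels are illustrative only.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 era tokenizer

def first_token(ch: str) -> int:
    return enc.encode(ch)[0]

pairs = [
    ("茶", "茎", True),   # both carry the grass radical 艹
    ("茶", "松", False),  # grass radical vs. wood radical 木
    ("火", "炎", True),   # both carry the fire radical 火
    ("火", "沐", False),  # fire radical vs. water radical 氵
]

for a, b, shares_radical in pairs:
    shares_first = first_token(a) == first_token(b)
    print(f"{a}/{b}  shares_radical={shares_radical}  shares_first_token={shares_first}")
```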

All three experiments reached a consistent conclusion: when Chinese characters are cut into multiple tokens (as under GPT-4's old tokenizer, where 89% of characters were segmented into multiple tokens), the model is more accurate at recognizing shared radicals; when characters are encoded as a single token (under GPT-4o's new tokenizer, where only 57% of characters are still split into multiple tokens), accuracy drops.

In other words, the hypothesis above holds. Fragmenting Chinese characters does cost more, but the byte sequences preserve traces of the radicals, and the model learns from them. Conversely, encoding characters as whole-character tokens lowers the cost, but the radical information is sealed inside an opaque ID, and the model can no longer extract that clue from the byte sequence.

It is important to note that this conclusion is limited to task-specific semantic subtleties related to character forms; it cannot be equated with an overall decline in the model's Chinese understanding, logical reasoning, or long-text generation. Moreover, beyond the tokenizer, GPT-4 and GPT-4o differ substantially in architecture, training corpus, and parameter count, so the accuracy changes cannot be attributed entirely to the change in token granularity.

This finding has also been validated in an engineering context. A 2024 study of GPT-4o found that after the new tokenizer packed certain combinations of Chinese characters into one long token, the model actually misunderstood them more often. When researchers used dedicated Chinese tokenizers to re-segment these long tokens and fed the result back to the model, comprehension accuracy improved.
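The general shape of that fix can be sketched as follows: pre-segment the Chinese input with a dedicated word segmenter before it reaches the model, so that under-trained long tokens fall apart into ordinary word-sized pieces. The choice of jieba here is my assumption for illustration; the article does not say what tooling the cited study used.

```python
# Sketch: re-segment Chinese text with a dedicated word segmenter (jieba is an
# illustrative choice) before sending it to the model, so that rare long tokens
# are broken back into ordinary word-sized pieces.
import jieba

def presegment(text: str) -> str:
    # Insert spaces at word boundaries; the model then sees word-sized chunks
    # instead of whatever long multi-character tokens its BPE happens to contain.
    return " ".join(jieba.lcut(text))

prompt = presegment("人工智能正在重塑全球的信息基础设施")
print(prompt)  # e.g. "人工智能 正在 重塑 全球 的 信息 基础设施"
```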

The current mainstream consensus in the global large-model industry remains that tokenizers optimized for the target language, processing whole words or characters, significantly improve overall model performance. Whole-character and whole-word encoding not only sharply reduces token costs and increases the effective information that fits in the context window, but also shortens sequence length, reduces inference latency, and improves stability on long texts. The task-specific advantages identified in the paper do not outweigh the bulk of the performance gains in broader Chinese NLP scenarios.

However, this issue still points to one of the hardest problems in large systems: you can optimize what you designed, but you cannot optimize what you do not know you have. The Unicode Consortium's radical-ordered encoding, designed for human lookup, intersected with BPE's inadvertent fragmentation of low-frequency Chinese data and fortuitously recreated part of the process of human literacy inside the black box of a neural network. But when engineers set out to eliminate the "Chinese tax" by merging characters into whole tokens and cutting costs, that accidentally born semantic channel closed as well.

History does not unfold as a linear progression; rather, it is a fluid constantly deforming under various constraints.

Some capabilities are designed, while others happen to remain untouched.

