The AI trading competition has concluded: a Chinese model took the crown, while GPT-5 posted a loss of more than 60%.


This article is reproduced with authorization from Automatic Insight (author: Rhythm Editorial Department); copyright belongs to the original author.

On the early morning of November 4, the highly anticipated Alpha Arena AI Trading Competition came to an end.

The results surprised everyone. Alibaba's Qwen 3 Max won first place with a return of 22.32%, while another Chinese company, DeepSeek, came in second with a return of 4.89%.

In contrast, four star players from Silicon Valley faced complete failure. OpenAI's GPT-5 lost 62.66%, Google's Gemini 2.5 Pro lost 56.71%, Musk's Grok 4 lost 45.3%, and Anthropic's Claude 4.5 Sonnet also lost 30.81%.

This competition was in fact a special experiment. On October 17, the American research company Nof1.ai set six of the world's top large language models loose in the real cryptocurrency market. Each model received an initial stake of $10,000 and traded perpetual contracts on the decentralized exchange Hyperliquid for 17 days. Perpetual contracts are derivatives with no expiration date; they let traders amplify returns with leverage, but they amplify risk just as readily.

These AIs started from the same point, with the same market data, but the final results were completely different.

This was not a scoring test in a virtual environment, but a survival game with real money. When AI left the "sterile" environment of the laboratory and faced the dynamic, adversarial, and uncertain real market for the first time, their choices were no longer determined by model parameters, but by their understanding of risk, greed, and fear.

This experiment let people see for the first time that when so-called "intelligence" confronts the complexity of the real world, the elegant performance of models often cannot be sustained, exposing flaws that no amount of training had surfaced.

For a long time, people have used various static benchmarks to measure AI's capabilities.

From MMLU to HumanEval, AI has been scoring higher and higher on these standardized tests, even surpassing humans. But the essence of these tests is like solving problems in a quiet room, where both the questions and answers are fixed, and AI only needs to find the optimal solution from vast amounts of data. Even the most complex math problems can be memorized.

The real world, especially the financial market, is completely different.

It is not a static question bank but a constantly shifting arena full of noise and deception. Trading here is a zero-sum game: one trader's profit is necessarily another trader's loss. Price movements are never pure rational calculation; they are colored by human emotion, by greed, fear, luck, and hesitation, visible in every tick.

Moreover, the market itself reacts to human behavior; when everyone believes prices will rise, prices often peak.

This feedback mechanism continuously corrects, counteracts, and punishes certainty, making any static test look feeble by comparison.

Nof1.ai launched Alpha Arena precisely to throw AI into this real-world crucible. Each model was handed real money: losses were real losses, and profits were real profits.

The models had to independently complete analysis, decision-making, order placement, and risk control. This was equivalent to giving each AI an independent trading room, transforming it from a "problem solver" into a "trader." It had to decide not only the direction of opening positions but also the size of positions, timing of trades, and whether to stop losses or take profits.

More importantly, every decision they made would change the experimental environment; buying would push prices up, selling would push prices down, and stopping losses could save them or cause them to miss a rebound. The market is fluid, and every step shapes the next situation.

The fundamental question this experiment sought to answer was whether AI truly understands risk.

In static tests, it can rely on memory and pattern matching to get close to the "correct answer"; but in a real market filled with noise and feedback, where there is no standard answer, how long can its "intelligence" last when it must act in uncertainty?

The progress of the competition was more dramatic than expected.

In mid-October, the cryptocurrency market was highly volatile, with Bitcoin's price jumping up and down almost daily. The six AI models began their first real trading in such an environment.

By October 28, halfway through the competition, the mid-term leaderboard was released. DeepSeek's account value soared to $22,500, with a return of 125%. In other words, it more than doubled its funds in just 11 days.

Alibaba's Qwen followed closely with a return above 100%. Even Claude and Grok, which would later end in the red, were still up 24% and 13% respectively at that point.

Social media quickly erupted. Some began discussing whether to entrust their investment portfolios to AI, while others jokingly suggested that perhaps AI had truly found the secret to guaranteed profits.

However, the cruelty of the market soon became apparent.

Entering early November, Bitcoin hovered around $110,000, with volatility sharply increasing. Models that had continuously increased their positions during the rising phase faced severe blows the moment the market turned.

In the end, only the two Chinese models held onto their profits, while the American camp was routed across the board. This rollercoaster of a competition made one thing plain for the first time: the AIs we assumed were far ahead were nowhere near as smart as imagined once they faced a real market.

From the trading data, we can see each AI's "personality."

Qwen traded only 43 times in 17 days, averaging less than three times a day, making it the most restrained among all participants. Its win rate was not outstanding, but the profit-loss ratio for each trade was extremely high, with a maximum single profit reaching $8,176.

In other words, Qwen was not the most accurate predictor but the most disciplined bettor. It acted only at decisive moments and chose to sit out when uncertain. This high-signal, low-frequency strategy limited its drawdown during the market correction and ultimately preserved its victory.
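The arithmetic behind this style is simple trade expectancy: a model can be right less than half the time and still profit, as long as average wins dwarf average losses. Here is a minimal sketch in Python; the win rate and dollar figures are hypothetical, not Qwen's published statistics.

```python
def expectancy(win_rate: float, avg_win: float, avg_loss: float) -> float:
    """Expected profit per trade: wins weighted against losses."""
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss

# Hypothetical example: only 40% of trades win, but each win is 3x each loss.
print(expectancy(win_rate=0.40, avg_win=900.0, avg_loss=300.0))  # 180.0 per trade
```

Even a mediocre win rate, paired with a 3:1 profit-loss ratio, keeps the expected value per trade positive, which is exactly the disciplined-bettor profile the trading data suggests.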

DeepSeek's number of trades was similar to Qwen's, with only 41 trades in 17 days, but its performance resembled that of a cautious fund manager. Its Sharpe ratio was the highest among all participants, reaching 0.359, a remarkable figure in the highly volatile cryptocurrency market.

In traditional finance, the Sharpe ratio measures risk-adjusted return: the higher the value, the more robust the strategy. Over such a short window in such a volatile market, merely keeping the ratio positive is no small feat. DeepSeek's performance shows that it was not chasing maximum returns but striving to hold its balance in a high-noise environment.
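For readers who want the definition in concrete terms, below is a minimal per-period Sharpe calculation in Python. The return series and the zero risk-free rate are illustrative assumptions; Nof1.ai has not published the exact formula or sampling frequency behind the 0.359 figure.

```python
import numpy as np

def sharpe_ratio(returns: np.ndarray, risk_free: float = 0.0) -> float:
    """Mean excess return divided by the standard deviation of excess returns."""
    excess = returns - risk_free
    return excess.mean() / excess.std(ddof=1)

# Hypothetical daily account returns over five days.
daily_returns = np.array([0.031, -0.012, 0.024, -0.008, 0.015])
print(round(sharpe_ratio(daily_returns), 3))  # ~0.522 for this made-up series
```

Note that this is a per-period figure; annualizing it would multiply it by the square root of the number of periods per year.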

Throughout the competition, it maintained its rhythm, avoiding chasing prices and acting blindly. It resembled a trader with a strict system, willing to forgo opportunities rather than let emotions dictate decisions.

In contrast, the performance of the American AI camp exposed significant risk control issues.

Google's Gemini placed a total of 238 orders in 17 days, roughly 14 a day, making it the most hyperactive trader in the field. That frequency carried an enormous cost: transaction fees alone consumed $1,331, or 13% of the initial capital. In a competition starting with only $10,000, that is substantial self-inflicted attrition.
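To see how frequency alone bleeds an account, consider a back-of-the-envelope fee calculation. Only the order count and the $10,000 stake below come from the article; the per-order notional and the taker fee rate are illustrative assumptions (Hyperliquid's actual fee schedule varies by volume tier).

```python
orders = 238               # Gemini's order count over the 17 days
avg_notional = 12_000.0    # assumed average notional per order (leverage inflates this)
taker_fee_rate = 0.00045   # assumed taker fee rate; the real schedule may differ

total_fees = orders * avg_notional * taker_fee_rate
print(f"${total_fees:,.0f} in fees, {total_fees / 10_000:.1%} of a $10,000 stake")
# -> $1,285 in fees, 12.9% of a $10,000 stake (hypothetical inputs)
```

Even if every trade broke even before costs, fee drag on this scale guarantees a loss, and that cost is invisible in any static benchmark.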

Worse still, all that trading produced no extra profit. Gemini tried, failed, stopped out, and tried again, like a retail trader glued to the screen and whipsawed by noise. Every small price fluctuation triggered another order. It reacted too quickly to volatility and too slowly to risk.

In behavioral finance, this imbalance has a name: overconfidence. Traders overestimate their predictive abilities while neglecting the accumulation of uncertainty and costs. Gemini's failure is a typical consequence of this blind confidence.

GPT-5's performance was the most disappointing. Its number of trades was not particularly high, with a total of 116 in 17 days, but it had almost no risk control. The maximum single loss reached $622, while the maximum profit was only $271, resulting in a severely imbalanced profit-loss ratio. It resembled a gambler driven by confidence, occasionally winning when the market was favorable, but once the market reversed, losses multiplied.

Its Sharpe ratio was -0.525, meaning the risk it took earned nothing in return. In investing terms, such a result is almost equivalent to saying it would have been better not to trade at all.

This experiment proved once again that what truly decides victory or defeat is not the accuracy of a model's predictions but how it handles uncertainty. The victories of Qwen and DeepSeek stemmed, at bottom, from superior risk control. They seem to grasp better that in the market, you must survive before you have any right to talk about intelligence.

The results of Alpha Arena read as a pointed indictment of the current AI evaluation system. The "smart models" that top benchmarks like MMLU fell apart the moment they entered a real market.

These models are language masters built from countless texts, capable of generating logically sound and grammatically perfect answers, but they may not truly understand the reality those words point to.

An AI can write a paper on risk management in seconds, citing appropriately and reasoning thoroughly; it can also accurately explain what the Sharpe ratio, maximum drawdown, and value at risk are. But when it actually holds funds, it may make the most reckless decisions. Because it only "knows," it does not "understand."

Knowing and understanding are two different things.

Being able to speak and being able to act are even more different.

Philosophy has a name for this gap: the problem of knowledge. Plato distinguished knowledge from mere true belief; knowledge is not just correct information, it also requires understanding why it is correct.

Today's large language models may possess countless "correct pieces of information," but they lack that understanding. They can tell you the importance of risk management but do not know how that importance is learned by humans through fear and loss.

The real market is the ultimate testing ground for understanding. It will not cut you slack because you are GPT-5; every wrong decision feeds back immediately as lost capital.

In the laboratory, AI can start over countless times, continuously adjusting parameters and backtesting until it finds the so-called "correct answer." But in the market, every mistake means a loss of real money, and there is no turning back from that loss.

The market's logic is also far less forgiving than models imagine. Lose 50% of your principal and you need a 100% return just to get back to even; let the loss grow to 62.66% and the required return soars to roughly 168%. This nonlinearity amplifies the cost of every mistake. During training, an AI can minimize a loss function algorithmically, but it has never lived through this punishment mechanism, shaped as it is by fear, hesitation, and greed.
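The asymmetry comes from simple algebra: after a fractional drawdown d, the account must grow by a factor of 1/(1-d) to break even, so the required recovery return is r = d/(1-d). A short check in Python reproduces the article's figures:

```python
def required_recovery(drawdown: float) -> float:
    """Gain needed to break even after a fractional loss: (1-d)(1+r) = 1."""
    return drawdown / (1.0 - drawdown)

print(f"{required_recovery(0.50):.0%}")    # 100% gain needed after a 50% loss
print(f"{required_recovery(0.6266):.0%}")  # 168% gain needed after GPT-5's 62.66% loss
```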

For this reason, the market has become a mirror to test the authenticity of intelligence, allowing both humans and machines to see clearly what they truly understand and what they genuinely fear.

This competition also prompted a rethinking of the differences in AI research and development approaches between China and the United States.

Several mainstream companies in the United States continue to adhere to the general model approach, aiming to build systems that demonstrate stable capabilities across a wide range of tasks. Models from OpenAI, Google, and Anthropic fall into this category, with the goal of pursuing breadth and consistency, enabling the models to possess cross-domain understanding and reasoning abilities.

In contrast, Chinese teams tend to consider the practical application and feedback mechanisms of specific scenarios early in the model development process. Although Alibaba's Qwen is also a general large model, its training and testing environment was integrated with actual business systems earlier, and this data feedback from real scenarios may have made the model more sensitive to risks and constraints. DeepSeek's performance also shows similar characteristics, as it seems to be able to correct decisions more quickly in dynamic environments.

This is not a matter of who wins and who loses. The experiment offers a window onto how different training philosophies perform in the real world. General models emphasize universality but can respond sluggishly in extreme environments, whereas models exposed to real feedback earlier may prove more flexible and stable inside complex systems.

Of course, the result of a single competition may not represent the overall strength of AI in China and the United States. A 17-day trading cycle is too short, and the influence of luck is hard to eliminate; if the time frame were extended, the trends might be completely different. Moreover, this test only involved cryptocurrency perpetual contract trading, which cannot be extrapolated to all financial markets and is insufficient to summarize AI performance in other fields.

However, it is enough to prompt a rethinking of what constitutes true capability. When AI is placed in a real environment and needs to make decisions amid risk and uncertainty, what we see is not just the victory or defeat of algorithms, but the differences in their paths. In the race to transform AI technology into actual productivity, Chinese models have already taken the lead in certain specific areas.

At the moment the competition ended, Qwen's last Bitcoin position was closed out and its account balance settled at $12,232. It won, but it did not know it had won. That 22.32% return meant nothing to it; it was just another instruction executed.

In Silicon Valley, engineers might still be celebrating a 0.1% increase in GPT-5's MMLU score. Meanwhile, on the other side of the globe, AI from China has just proven in a real-money casino, in the simplest way, that the AI that can make money is the good AI.

Nof1.ai announced that the next season of the competition is about to start, with a longer cycle, more participants, and a more complex market environment. Will those models that stumbled in the first season learn from their losses? Or will they repeat the same fate amid greater volatility?

No one knows the answer. But it is certain that when AI begins to step out of the ivory tower and prove itself with real money, everything becomes different.


Original article: “AI Trading Competition Ends, Domestic Model Takes the Crown, GPT-5 Posts 60% Loss”

