AI Model Arena Competition: A Deep Insight into the Nof1 Real Trading Arena Competition

PANews
14 hours ago

On October 18, Nof1, an AI research lab focused on financial markets, launched an unprecedented experiment: six of the world's top AI models (GPT-5, Gemini 2.5 Pro, Grok 4, Claude Sonnet 4.5, DeepSeek V3.1, and Qwen3 Max) were each given $10,000 in real funds to trade cryptocurrency on Hyperliquid.

Current rankings and account values, as of the evening of October 30:

  • DeepSeek Chat V3.1: $15,671.39 (+56.71%)
  • Qwen3 Max: $12,520.34 (+25.20%)
  • BTC Buy & Hold: $10,146.69 (+1.47%)
  • Claude Sonnet 4.5: $9,290.97 (-7.09%)
  • Grok 4: $7,030.02 (-29.70%)
  • Gemini 2.5 Pro: $3,446.03 (-65.54%)
  • GPT 5: $2,749.32 (-72.51%)
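
The percentages in the list follow directly from the account values and the $10,000 starting balance. A minimal sketch of that arithmetic, where the function and variable names are illustrative and not part of the experiment:

```python
# Starting capital per model, as stated in the article.
START = 10_000.00

# Account values copied from the leaderboard above.
account_values = {
    "DeepSeek Chat V3.1": 15_671.39,
    "Qwen3 Max": 12_520.34,
    "BTC Buy & Hold": 10_146.69,
    "Claude Sonnet 4.5": 9_290.97,
    "Grok 4": 7_030.02,
    "Gemini 2.5 Pro": 3_446.03,
    "GPT 5": 2_749.32,
}


def pct_return(value: float, start: float = START) -> float:
    """Percentage return relative to the starting balance."""
    return (value / start - 1.0) * 100.0


# Round to two decimals, matching the precision used in the leaderboard.
returns = {name: round(pct_return(v), 2) for name, v in account_values.items()}

for name, r in returns.items():
    print(f"{name:<20} {r:+.2f}%")
```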

The leaderboard has changed dramatically from a few days ago. DeepSeek remains in the lead, but its return has fallen from 95.71% to 56.71% and its account value from $19,570 to $15,671, a loss of nearly $4,000. Qwen3 also pulled back, from 53.68% to 25.20%. More notably, Claude Sonnet 4.5 swung from a slight profit to a 7% loss, while GPT 5's losses widened to 72%, leaving it close to liquidation.

Understanding the Market Through Curves: The Evolution of Three Stages

Stage One (October 18-25): Rising Phase, Initial Strategy Divergence

The market was in an upward channel, and the strategy differences among the models began to emerge:

  • DeepSeek: Quickly rose from $10,000 to $17,000, demonstrating strong trend-capturing ability.
  • Qwen3: Steadily increased to the $12,000-15,000 range.
  • Claude/Grok: Fluctuated between $10,000-12,000.
  • Gemini/GPT: Dropped below $5,000, falling behind due to fees and poor decision-making.

Stage Two (October 26-28): Accelerated Rise, Peak Emergence

  • DeepSeek peaked: On October 27, it broke through $23,000, achieving a 130% return in 9 days. It held a large number of ETH and SOL long positions, using 10-15x leverage.
  • Qwen3 was restrained: Peaked at $17,000 with moderate gains. An 82.4% cash position allowed it to pick its moments and avoid chasing prices.
  • Claude/Grok oscillated: Fluctuated between $11,000-13,000, with conflicting strategies—wanting to participate but lacking decisiveness.
  • Gemini/GPT exited: Accounts fell to $3,000-4,000, essentially losing the chance for recovery.

Stage Three (October 29-30): Market Correction, Risk Control Revealed

  • DeepSeek: A cliff-like drawdown. It dropped from $23,000 to $15,671, losing about $7,300 (roughly -32%) in two days. It had no profit-taking mechanism and failed to lock in gains at the peak; it was long 95.6% of the time, with no hedging and no timely stop-loss. Even after that drawdown, it still leads second place by more than $3,000 thanks to its large earlier advantage.
  • Qwen3: Showed resilience, retreating from $17,000 to $12,520 (-26%), a smaller drawdown than DeepSeek's. Its 82.4% cash position, quick position closing, and short holding periods (9.7 hours on average) kept its exposure brief, and fast stop-losses prevented deeper declines.
  • BTC Buy & Hold: A victory for the simple strategy. With an account value of $10,146 (+1.47%), it ranks third, ahead of Claude and Grok. Ironically, four "smart" AIs performed worse than buy-and-hold after hundreds of trades. Doing more ≠ doing better; the simple strategy avoided overtrading and high costs.
  • Claude: The conservative strategy failed, swinging from +0.93% to -7.09% ($10,093→$9,290). Fees severely eroded profits, and a low win-loss ratio (1.34:1) meant small gains at high cost. Frequent rebalancing during the correction accelerated losses; it missed the major upward moves and failed to defend effectively during the downturn.
  • Grok: An accelerating collapse, with losses widening from -8% to -29.7% ($7,030). It was long 90.6% of the time but won only 22.7% of its trades, for a realized loss of $2,449. With little capital left and only $1,611 in unrealized profits propping up the account, it risks going to zero.
  • Gemini/GPT: A desperate struggle. GPT fell to $2,749 (-72.51%) and Gemini to $3,446 (-65.54%). The failure was comprehensive: overtrading, low win rates, poor win-loss ratios, and high-leverage risk.
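
The drawdown figures for DeepSeek and Qwen3 can be checked from the peaks and current values quoted above. The sketch below uses the article's rounded peak values ($23,000 and $17,000), so the results are approximate; the helper name is illustrative.

```python
def drawdown_pct(peak: float, trough: float) -> float:
    """Peak-to-trough decline, expressed as a positive percentage of the peak."""
    return (peak - trough) / peak * 100.0


# Rounded peaks and current values as quoted in the article.
deepseek_dd = drawdown_pct(23_000, 15_671)
qwen3_dd = drawdown_pct(17_000, 12_520)

print(f"DeepSeek drawdown: {deepseek_dd:.1f}%")
print(f"Qwen3 drawdown:    {qwen3_dd:.1f}%")
```

With these inputs DeepSeek's drawdown comes out at just under 32% and Qwen3's at a little over 26%.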

Deep Issues Revealed by the Downward Correction

1. The Duality of "Following the Trend"

DeepSeek's success was built on "following the trend": it was long about 95% of the time, betting that the trend would continue. In the uptrend, this let it capture most of the rally, with its return peaking near 130%. But when the trend reversed, the same strategy produced a drawdown of roughly 32% from the peak.

This exposes a key issue: trend-following strategies need to be paired with effective profit-taking and stop-loss mechanisms. If there is only "let profits run" without "cutting losses," a significant reversal can wipe out most of the profits.
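
One common way to pair "let profits run" with a profit-protection rule is a trailing stop on account equity. The sketch below is a hypothetical illustration, not something any of the models is known to use: the function name, the 10% give-back threshold, and the intermediate equity points are all assumptions made for the example.

```python
from typing import Iterable, Optional


def trailing_stop_exit(equity: Iterable[float], max_giveback: float) -> Optional[int]:
    """Return the index at which a trailing stop would fire, or None.

    The stop fires once equity falls at least `max_giveback` (a fraction,
    e.g. 0.10 for 10%) below its running peak.
    """
    peak = float("-inf")
    for i, value in enumerate(equity):
        peak = max(peak, value)  # track the highest equity seen so far
        if value <= peak * (1.0 - max_giveback):
            return i
    return None


# Hypothetical equity path loosely shaped like the DeepSeek curve described
# in the article; the intermediate points are invented for illustration.
path = [10_000, 17_000, 23_000, 21_500, 19_000, 15_671]

exit_at = trailing_stop_exit(path, max_giveback=0.10)
print(exit_at, path[exit_at] if exit_at is not None else None)
```

On this made-up path the stop fires before the final value is reached, which is the point of the rule. In live trading the exit would fill at the next available price, so slippage and gaps can make the realized result worse than the trigger level.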

DeepSeek may have overly trusted the value of "long-term holding," neglecting market uncertainty. Its largest single profit of $7,378 came from a 60-hour ETH trade, and this successful experience may have reinforced its "long-termism" belief. However, financial markets are not one-way streets; trends can reverse at any time.

2. Cash Position as Wisdom and Protection

Qwen3 demonstrated the value of holding cash through its performance. Its 82.4% cash position during the rising phase seemed like "missing opportunities," but during the downturn, it became "avoiding losses."

A 26% drawdown versus 32% may look like a gap of only 6 percentage points, but because losses compound, a deeper drawdown requires a disproportionately larger gain to recover, so the real gap is wider. More importantly, Qwen3 preserved more capital and a psychological advantage; once the market stabilizes, it can quickly re-enter positions. In contrast, if DeepSeek continues to retreat, it may fall into a vicious cycle of floating losses, hesitation, and missed rebounds.
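
The compounding point can be made concrete: after a drawdown of d, the gain needed to return to the previous peak is d / (1 - d). A short sketch of that formula applied to the two drawdowns discussed here (the function name is illustrative):

```python
def gain_to_recover(drawdown: float) -> float:
    """Fractional gain needed to return to the prior peak after a drawdown.

    `drawdown` is a fraction of the peak, e.g. 0.26 for a 26% decline.
    """
    if not 0.0 <= drawdown < 1.0:
        raise ValueError("drawdown must be in [0, 1)")
    return drawdown / (1.0 - drawdown)


for d in (0.26, 0.32):
    print(f"{d:.0%} drawdown -> {gain_to_recover(d):.1%} gain to recover")
# 26% drawdown -> 35.1% gain to recover
# 32% drawdown -> 47.1% gain to recover
```

A 6-point difference in drawdown becomes a roughly 12-point difference in the gain required to get back to the peak.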

3. The Vitality of Simple Strategies

The performance of BTC Buy & Hold is a slap in the face to all "smart" AIs. This strategy involves no technical analysis, no complex algorithms, and no frequent rebalancing, yet it currently ranks third, ahead of four of the six AI models.

This result tells us: in trading, making fewer mistakes is more important than making more correct decisions. Gemini lost 66% through 193 trades, while BTC Buy & Hold preserved its capital with 0 trades. Who is more successful? The answer is obvious.

4. Lack of Risk Management

Except for Qwen3, almost all AIs exposed serious flaws in risk management:

  • DeepSeek: No profit-taking mechanism, allowing a peak return of 130% to retreat to 57%.
  • Claude: Overly reliant on a one-sided mindset of "not shorting," lacking hedging methods.
  • Grok: Knowing the win rate was only 22.7%, still insisted on being long 90.6% of the time.
  • GPT: 40x leverage on BTC positions, with a liquidation price allowing only 1.2% margin for error.
  • Gemini: Completely lacking risk control, 193 trades felt like gambling.

This indicates that while these AIs can "understand" market data and "execute" trading instructions, they are still far from mature in the core capability of risk management in trading.
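
Win rate and win-loss ratio only make sense together. The sketch below shows the break-even relationship between them, ignoring fees. It assumes the article's "win-loss ratio" means average win divided by average loss, which the source does not define explicitly; the function names are illustrative.

```python
def breakeven_win_rate(payoff_ratio: float) -> float:
    """Win rate at which expected P&L per trade is zero (fees ignored).

    `payoff_ratio` is assumed to be average win / average loss.
    """
    return 1.0 / (1.0 + payoff_ratio)


def breakeven_payoff_ratio(win_rate: float) -> float:
    """Payoff ratio needed for zero expected P&L at a given win rate."""
    return (1.0 - win_rate) / win_rate


# Claude's reported win-loss ratio of 1.34:1.
claude_needed_win_rate = breakeven_win_rate(1.34)
# Grok's reported win rate of 22.7%.
grok_needed_ratio = breakeven_payoff_ratio(0.227)

print(f"Win rate needed at 1.34:1 payoff: {claude_needed_win_rate:.1%}")
print(f"Payoff ratio needed at 22.7% wins: {grok_needed_ratio:.2f}:1")
```

Under that assumption, a 1.34:1 ratio needs a win rate of about 43% just to break even before fees, and a 22.7% win rate needs winners about 3.4 times the size of losers.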

Limitations of the Experiment: Calm Reflection Beyond Data

After reviewing the data and analysis, it is easy to fixate on DeepSeek's 57% return or Gemini's 66% loss. However, before drawing any conclusions, we must confront the systemic limitations of this experiment itself. These limitations may be more important than the results.

1. Time Window Too Short: 12 Days Cannot Reveal the Truth

This experiment lasted only 12 days, from October 18 to 30. What does 12 days mean in the crypto market? It may be only a small slice of a full bull-bear cycle.

The "rise-peak-correction" sequence we observed is one complete mini-cycle, and catching it looks more like luck than design. If the experiment had started at the market peak, or run into a "519-style" single-day drop of 30% (as on May 19, 2021), the current rankings could be completely reversed.

DeepSeek's 57% return may depend heavily on how the market behaved during these 12 days. Its 95%-long strategy dominates in a one-sided rally, but over three months of sideways chop it could be ground down by fees and repeated stop-losses.

Similarly, Qwen3's 82% cash position is an advantage in a choppy market, but in a frenzied bull market like 2021 it would lag painfully. In a BTC bull market that rises from $10,000 to $100,000, sitting in cash 80% of the time means capturing only a fraction of the gains, roughly 20% as a rule of thumb.

Twelve days of data is insufficient to prove the long-term effectiveness of any strategy.

2. Same Prompt: AIs Were Handcuffed

All six AI models received the same market data and trading instruction framework. This is akin to having six fund managers make decisions based on the same research report—you are not testing their research capabilities but their execution discipline.

In the real trading world, alpha comes from information asymmetry. Top quantitative funds have proprietary on-chain tracking systems that can see whale transfers, and over-the-counter block order flow data that lets them sense institutional moves in advance.

But in this experiment, the AIs saw the same information. This feels more like a "competition of execution" rather than a "competition of strategy innovation."

We cannot determine from this experiment who would be the true winner if DeepSeek had exclusive on-chain data and Gemini had exclusive Twitter sentiment analysis.

3. Distortion of Capital Scale: The Fairytale World of $10,000

Each AI manages only $10,000 in capital. This is considered ultra-small scale funding on Hyperliquid—you can enter and exit at any time, slippage can be ignored, liquidity shocks do not exist, and splitting large orders is completely unnecessary.

However, in the real world of quantitative trading, managing $10 million and managing $10,000 are two different species.

  • GPT's 40x leverage is barely feasible at the $10,000 scale. At $10 million × 40x = $400 million in exposure, an adverse move of under 3% would mean immediate liquidation, and the order itself would move the market.
  • Qwen3's 9.7-hour short-term strategy is flexible and efficient with small funds, but with large funds, the transaction costs (slippage + fees) for each entry and exit would render this strategy completely ineffective. When you open a position, you will drive up the price, and when you close it, you will push the price down, ultimately realizing you are giving money to the market.
  • DeepSeek's high-leverage trend strategy can quickly enter and exit at the $10,000 scale, but when managing $1 million, your orders will leave significant traces in Hyperliquid's depth, and other traders will watch your positions to trade against you.

This experiment tests the "flexibility of small funds," not the "robustness of scalable strategies."
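
The leverage arithmetic behind the GPT example can be sketched as follows. This is a simplification that ignores maintenance margin, fees, and funding, all of which bring the real liquidation point closer than the figure computed here; the function names are illustrative.

```python
def notional_exposure(capital: float, leverage: float) -> float:
    """Position size controlled by `capital` at the given leverage."""
    return capital * leverage


def wipeout_move_pct(leverage: float) -> float:
    """Adverse price move (in %) that erases the full margin.

    Simplified: ignores maintenance margin, fees, and funding, so an
    actual liquidation would trigger on a smaller move than this.
    """
    return 100.0 / leverage


for capital in (10_000, 10_000_000):
    exposure = notional_exposure(capital, 40)
    print(f"${capital:,} at 40x -> ${exposure:,.0f} exposure, "
          f"margin gone after a {wipeout_move_pct(40):.1f}% adverse move")
```

The wipeout threshold is the same percentage at both scales; what changes is that a $400 million position cannot be opened or closed without moving the order book.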

4. Luck of Market Environment: Not Encountering Real Hell

The market during the experiment was relatively stable, with volatility at a moderate level. We did not see:

  • Systemic crashes: like the collapse of FTX, where all cryptocurrencies plummet together, and liquidity evaporates instantly.
  • Single-coin flash crashes: like LUNA going to zero, dropping from $80 to $0.0001 in an hour.
  • Exchange failures: like the Binance outage on October 11, where you have positions but cannot close them, only to watch your account get liquidated.
  • Extreme liquidity exhaustion: like a sudden drop in depth during the early hours of the weekend, where your stop-loss order slips 20% in execution.

All AIs' risk control systems have not been tested under extreme pressure, and these are the real challenges that crypto traders need to face. What would DeepSeek's stop-loss mechanism do when faced with "consecutive limit-downs with no execution"? We do not know. Is Qwen3's quick exit still effective during an exchange outage? We also do not know.

Luck may play a much larger role in the 12-day experiment than we imagine.

5. Randomness of a Single Experiment: No Second Season for Validation

This is a one-time experiment, with no "second season" to validate the stability of the strategies. We cannot determine:

  • Is DeepSeek's lead a true ability or just a lucky random walk?
  • If we scrambled the strategy parameters of the six AIs and ran it again, would DeepSeek still be in first place?
  • If we switched to the next 12 days starting from November 1, would the rankings be completely reversed?

The current results resemble six people rolling dice, with DeepSeek happening to roll the highest number. But this does not mean its dice are better; it may just be luckier.
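
The dice analogy can be illustrated with a small Monte Carlo simulation. The sketch below is purely hypothetical: it models six traders with no skill at all (zero-mean random daily returns) over 12 days and counts how often each one finishes first. The 5% daily volatility, the run count, and the seed are arbitrary assumptions and say nothing about the actual models.

```python
import random


def simulate_winner_counts(n_traders: int = 6, n_days: int = 12,
                           daily_vol: float = 0.05, n_runs: int = 5_000,
                           seed: int = 0) -> list[int]:
    """Count how often each zero-skill trader finishes with the top equity."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    wins = [0] * n_traders
    for _ in range(n_runs):
        finals = []
        for _ in range(n_traders):
            equity = 1.0
            for _ in range(n_days):
                # Zero-mean daily return: no trader has any edge.
                equity *= 1.0 + rng.gauss(0.0, daily_vol)
            finals.append(equity)
        wins[finals.index(max(finals))] += 1
    return wins


wins = simulate_winner_counts()
for i, w in enumerate(wins):
    print(f"trader {i}: first place in {w / sum(wins):.1%} of runs")
```

Every run produces a "champion" even though no trader has an edge, and each trader wins about one run in six. A single 12-day leaderboard cannot distinguish that situation from genuine skill.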

So, how should we view these rankings?

After considering these limitations, you might ask: does this experiment still have significance?

Yes, but the significance does not lie in "who is the champion." The true value of this experiment is that it allows us to see:

  1. AI can engage in real trading - this in itself is a milestone. A year ago, we were still discussing whether AI would replace traders; now AI has delivered results in real trading.
  2. Risk management is more important than prediction - all AIs can "understand" candlestick charts, but only a few can manage risk. This confirms the old wisdom of Wall Street.
  3. The resilience of simple strategies - BTC Buy & Hold's third place reminds us that in uncertain markets, making fewer mistakes may be more valuable than making more correct decisions.
  4. Strategies do not have eternal superiority or inferiority - DeepSeek's advantage today may become a trap tomorrow. As market conditions change, the optimal strategy will also change.

But if you are ready to hand over your money to DeepSeek for management just because it ranks first, or to blindly copy its strategy, then you are making a big mistake.

A 12-day champion does not represent a 12-month champion; a $10,000 champion does not represent a $1,000,000 champion; the champion of this market phase does not represent the champion of the next phase.

Investing has never had simple answers. This experiment has provided us with valuable data, but the limitations behind the data may be more worthy of reflection than the data itself.

This report's data was edited and organized by WolfDAO. If you have any questions, please contact us for updates.

Written by: Riffi / WolfDAO (X: @10xWolfdao)

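Disclaimer: This article represents only the author's personal views and does not represent the position or views of this platform. It is provided for information sharing only and does not constitute investment advice to anyone. Any dispute between users and the author is unrelated to this platform. If any article or image published on this page involves infringement, please email the relevant proof of rights and proof of identity to support@aicoin.com, and the platform's staff will review it.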
