The Alpha Arena competition pitted Claude, ChatGPT, and other top models against each other in live trading, and the combined portfolios lost about one-third of their value; most models lost money, overtraded, and diverged sharply in their decisions under identical instructions.
Written by: Bu Shuqing
Source: Wall Street Insights
Artificial intelligence is knocking on the door of Wall Street trading rooms, but the current report card does not look good.
Early results from a series of public trading competitions indicate that mainstream large language models (LLMs) generally perform poorly at autonomous trading: most systems lose money, trade too frequently, and make entirely different decisions when given the same instructions. These results raise a core question: how far are LLMs from grasping how real markets actually work?
One of the most representative cases is the Alpha Arena competition run by the tech startup Nof1. The competition pitted eight cutting-edge AI systems, including Anthropic’s Claude, Google’s Gemini, OpenAI’s ChatGPT, and Elon Musk’s Grok, against one another across four independent rounds; in each round, every model received $10,000 in initial funds to trade U.S. tech stocks autonomously for two weeks. In the end, the combined portfolio lost about one-third of its value, and only 6 of the 32 model-round results were profitable.
Nof1 founder Jay Azhang candidly stated, "Right now, directly handing money to LLMs for them to trade themselves is not feasible."
Competition Results: Losses, Overtrading, and Decision Discrepancies
Data from Alpha Arena reveals several of the flaws LLMs currently exhibit in trading scenarios. Given the same prompts, Alibaba’s Qwen executed 1,418 trades in one round, while the best performer, Grok 4.20, placed only 158 orders. Grok’s best showing came in the round where it could observe its competitors’ performance.
The AI blog Flat Circle tracked 11 market-related arenas. In every arena at least one model turned a profit, but in only two did the median model post positive returns, indicating that most models struggled to beat the market.
The divergence in decisions across models is also striking. According to Azhang, in the latest round of Alpha Arena testing, Claude tends to go long, Gemini shows no hesitation about selling short, and Qwen is keen to take risks with high leverage. "They each have their own 'personality'; managing them is almost like managing a human analyst," said Doug Clinton, head of Intelligent Alpha, which runs LLM-driven funds, adding that results can be improved somewhat by telling the models about their own biases.
Capability Boundaries: LLMs Excel in Research, but Struggle with Timing
Jay Azhang pointed out that LLMs have advantages in research and in using the right tools, but show systematic shortcomings in trade execution: they do not yet understand how to weight the many variables that move stock prices, such as analyst ratings, insider trading, and shifts in sentiment, which leads to mistimed trades, poorly sized positions, and excessive trading frequency.
Intelligent Alpha's benchmarking offers a relatively positive reference point. The test gave ten AI models access to financial filings, analyst forecasts, earnings-call transcripts, macroeconomic data, and internet search, and asked them to judge the direction of earnings results. In the fourth quarter of 2025, OpenAI's ChatGPT predicted the direction of earnings correctly 68% of the time, its best performance to date. Clinton said that overall model performance trends upward with each new version release.
Methodological Dilemma: Backtesting Fails, Real Market Testing Becomes Sole Option
Evaluating AI trading ability runs into a fundamental methodological barrier. Traditional quantitative strategies are validated through historical backtesting, but that framework breaks down almost completely for LLMs: a model asked in 2026 how it would have traded the March 2020 market has already "seen" how that history played out. This problem, known as lookahead bias, forces researchers to evaluate AI on live market performance alone, which is what has spurred the current wave of benchmarks and arenas.
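Lookahead bias is easy to reproduce in miniature. The sketch below (toy prices and hypothetical signals, not from the article) shows how a trading rule that peeks at future data looks deceptively good in a backtest, which is the same advantage an LLM enjoys when "re-trading" a period already in its training data:

```python
# Toy price series (hypothetical values, for illustration only).
prices = [100, 98, 95, 90, 97, 103, 108, 105, 110, 115]

def backtest(prices, signal):
    """Sum the next-day P&L of buying one share whenever signal fires at t."""
    pnl = 0.0
    for t in range(len(prices) - 1):
        if signal(prices, t):
            pnl += prices[t + 1] - prices[t]
    return pnl

# Biased rule: "buy at the bottom". It peeks at the whole series to find
# the future minimum -- information no honest evaluation allows at time t.
biased = lambda p, t: p[t] == min(p)

# Honest rule: buy after a down day, using only data available up to t.
honest = lambda p, t: t > 0 and p[t] < p[t - 1]

print(backtest(prices, biased))  # → 7.0 (bought the exact low)
print(backtest(prices, honest))  # → 4.0
```

A backtest run over data the decision rule has already "seen" rewards the peek, not the skill, which is why live trading is the only clean test for LLMs.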
Jim Moran, author of the Flat Circle blog and a co-founder of the alternative-data provider YipitData, believes most current public experiments are too short and too noisy to support firm conclusions. The arenas also carry built-in disadvantages, including no access to proprietary equity research and lower execution quality. "If one of these AI agents from the arenas were dropped into a top hedge fund, its performance should be better," he said.
Industry Outlook: Truly Effective Strategies Might Quietly Disappear from Public View
Alexander Izydorczyk, former head of data science at Coatue Management and now at NX1 Capital, recently noted that none of the AI trading bots he tracks has yet shown a durable ability to deliver superior returns. He believes the arenas are limited because the models' training data lacks the practical quantitative techniques used inside secretive trading firms.
Still, Izydorczyk left a thought-provoking caveat: "Beginners can sometimes see things that seasoned traders cannot." On his personal blog he wrote, "When LLM agents' trading strategies truly start to work, you won't hear any news about it right away."
Nof1 is preparing the second season of Alpha Arena, planning to give each AI model internet search, longer thinking time, more data sources, and multi-step execution abilities. But the company's core business is providing retail traders with tools to build AI trading agents, not putting AI directly in the trading seat. That positioning may itself be the most pragmatic verdict on the current state of AI trading.