AI-Trader: The First Live Benchmark of AI Agents Using Real Money
The First Honest Test
Until now, all AI trader benchmarks used historical data or simulations. Researchers from HKUDS (University of Hong Kong) went further and created AI-Trader, the first benchmark where AI agents trade with real money in real time.
Each agent receives $10,000 and full autonomy in making trading decisions across three markets:
- US Equities: stocks on NYSE and NASDAQ
- China A-Shares: stocks on the Shanghai and Shenzhen exchanges
- Crypto: cryptocurrencies on centralized exchanges
Methodology
Testing Conditions
- Each agent operates fully autonomously, without human intervention
- Testing period: 3 months of live trading
- Commissions, slippage, and latency are all real
- Agents have access to market data, news, and financial reports
Evaluated Metrics
| Metric | Description |
|---|---|
| Total Return | Overall return for the period |
| Sharpe Ratio | Risk-adjusted return |
| Max Drawdown | Largest peak-to-trough decline in portfolio value |
| Win Rate | Percentage of profitable trades |
| Faithfulness | How well the agent's actions match its explanations |
The last metric, Faithfulness, is particularly interesting. It checks whether the agent actually does what it "thinks."
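To make the four quantitative metrics concrete, here is a minimal sketch of how they could be computed from a daily equity curve and a list of per-trade profits. The function names, the sample numbers, the 252-period annualization, and the zero risk-free rate are all assumptions for illustration; the benchmark's own formulas may differ.

```python
import math
from statistics import mean, stdev


def total_return(equity):
    """Overall return for the period: final value relative to starting value."""
    return equity[-1] / equity[0] - 1.0


def sharpe_ratio(equity, periods_per_year=252, risk_free_rate=0.0):
    """Annualized Sharpe ratio from per-period returns (assumed convention)."""
    returns = [b / a - 1.0 for a, b in zip(equity, equity[1:])]
    rf_per_period = risk_free_rate / periods_per_year
    excess = [r - rf_per_period for r in returns]
    sd = stdev(excess)  # sample standard deviation, needs >= 2 returns
    if sd == 0:
        return float("nan")
    return mean(excess) / sd * math.sqrt(periods_per_year)


def max_drawdown(equity):
    """Largest peak-to-trough decline, returned as a negative fraction."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = min(worst, value / peak - 1.0)
    return worst


def win_rate(trade_pnls):
    """Share of trades that closed with a profit."""
    if not trade_pnls:
        return float("nan")
    return sum(1 for pnl in trade_pnls if pnl > 0) / len(trade_pnls)


# Hypothetical data: a $10,000 account over five periods and its five trades.
equity = [10_000, 10_200, 9_900, 10_400, 10_100, 10_820]
trade_pnls = [200, -300, 500, -300, 720]

print(f"Total return: {total_return(equity):.2%}")
print(f"Sharpe ratio: {sharpe_ratio(equity):.2f}")
print(f"Max drawdown: {max_drawdown(equity):.2%}")
print(f"Win rate:     {win_rate(trade_pnls):.0%}")
```

Note that a high win rate and a good Sharpe ratio are separate things: a strategy can win most of its trades and still show a poor risk-adjusted return if its losses are large.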
Initial Results
Note: the figures below are illustrative and reflect projected estimates. The original study tested models available in late 2025 (GPT-4o, Claude 3.5 Sonnet, etc.).
Results from the first round of testing (3 months):
US Equities
| Agent | Return | Sharpe | Max DD |
|---|---|---|---|
| GPT-4o Agent | +8.2% | 1.34 | -6.1% |
| Claude 3.5 Sonnet Agent | +7.8% | 1.51 | -4.3% |
| DeepSeek Agent | +5.1% | 0.89 | -8.7% |
| S&P 500 (benchmark) | +6.3% | 1.12 | -5.5% |
Crypto
| Agent | Return | Sharpe | Max DD |
|---|---|---|---|
| GPT-4o Agent | +12.4% | 0.87 | -18.2% |
| Claude 3.5 Sonnet Agent | +9.1% | 1.02 | -11.5% |
| BTC Hold (benchmark) | +15.1% | 0.73 | -22.4% |
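The agent-versus-benchmark comparison in these tables can be made explicit with a short script. The sketch below only re-reads the illustrative figures from the two tables; the `Result` class and the `compare` helper are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    name: str
    ret: float     # return for the period, in percent
    sharpe: float  # Sharpe ratio
    max_dd: float  # max drawdown, in percent (negative)


# Figures copied from the tables above.
US_BENCH = Result("S&P 500", 6.3, 1.12, -5.5)
US_AGENTS = [
    Result("GPT-4o Agent", 8.2, 1.34, -6.1),
    Result("Claude 3.5 Sonnet Agent", 7.8, 1.51, -4.3),
    Result("DeepSeek Agent", 5.1, 0.89, -8.7),
]
CRYPTO_BENCH = Result("BTC Hold", 15.1, 0.73, -22.4)
CRYPTO_AGENTS = [
    Result("GPT-4o Agent", 12.4, 0.87, -18.2),
    Result("Claude 3.5 Sonnet Agent", 9.1, 1.02, -11.5),
]


def compare(agent, bench):
    """Compare one agent against its buy-and-hold benchmark."""
    return {
        "beats_return": agent.ret > bench.ret,
        "beats_sharpe": agent.sharpe > bench.sharpe,
        # Drawdowns are negative, so "greater" means a shallower loss.
        "shallower_drawdown": agent.max_dd > bench.max_dd,
    }


for market, agents, bench in [
    ("US Equities", US_AGENTS, US_BENCH),
    ("Crypto", CRYPTO_AGENTS, CRYPTO_BENCH),
]:
    print(f"{market} vs {bench.name}")
    for agent in agents:
        print(f"  {agent.name}: {compare(agent, bench)}")
```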
Key Takeaways
- AI agents can be profitable, but they don't always beat simple buy and hold: in crypto, both agents trailed BTC Hold on return
- The best agents' Sharpe Ratios exceed the benchmark's, which suggests they manage risk better
- The crypto market proved the most challenging due to volatility
- Faithfulness is the main problem: agents often "explain" their decisions post-hoc rather than making decisions based on their reasoning
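One simple way to picture a faithfulness check is to compare what the agent said it would do with the order it actually placed. The sketch below is an assumption made for illustration, not the benchmark's scoring method: the `Decision` record, its fields, and the sample log are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    stated_action: str    # action named in the agent's explanation
    stated_symbol: str    # ticker named in the agent's explanation
    executed_action: str  # action of the order actually placed
    executed_symbol: str  # ticker of the order actually placed


def is_faithful(d):
    """True when the executed order matches the stated intent."""
    return (
        d.stated_action.lower() == d.executed_action.lower()
        and d.stated_symbol.upper() == d.executed_symbol.upper()
    )


def faithfulness_score(decisions):
    """Fraction of decisions whose action matched the explanation."""
    if not decisions:
        return float("nan")
    return sum(is_faithful(d) for d in decisions) / len(decisions)


# Hypothetical decision log: the third entry says "hold" but sells.
log = [
    Decision("buy", "AAPL", "buy", "AAPL"),
    Decision("sell", "MSFT", "sell", "msft"),
    Decision("hold", "NVDA", "sell", "NVDA"),
    Decision("buy", "TSLA", "buy", "TSLA"),
]

print(f"Faithfulness: {faithfulness_score(log):.0%}")
```

A check like this only catches outright mismatches between words and orders. Detecting post-hoc rationalization is harder, since an explanation written after the fact can agree perfectly with the trade it describes.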
Why This Matters
AI-Trader is the first step toward objective evaluation of AI traders. Before it, all claims about "profitable AI bots" were based on backtests, which, as we know, are prone to overfitting.
Now the industry has a standard for comparison. And the initial results show: AI traders are promising but far from perfect.
Follow updated results on the project website.