
Two answers, one click, $100 million: the business behind Chatbot Arena

People have voted 82 million times, for free, on which AI answer is better. That turned into a company valued at $1.7 billion – and the most influential AI ranking in the world. How the model works, where it creaks, and why nothing like it exists in German. With a duel you can try yourself.

$100M annualized revenue – built from free user votes

Two anonymous AI answers facing off – the Chatbot Arena principle

Perhaps the most successful AI business model of the year consists of a single question: "Which answer is better – A or B?" Millions of people answer it voluntarily and unpaid on Arena (formerly LMArena, née Chatbot Arena). In late June, the company behind it announced it had reached $100 million in annualized revenue from those collected votes – less than a year after launching its commercial offering. We looked at how free clicks became a billion-dollar company, where the method is vulnerable – and what the German-speaking market is missing.

In brief

  • $100M annualized revenue, announced June 29, 2026 – roughly nine months after the September 2025 launch of the paid "AI Evaluations" product. Around the turn of the year the rate was still about $30M.
  • 82 million+ votes from 700 million+ conversations and 10 million+ monthly visitors – per the company's self-reported figures.
  • Valuation: $1.7 billion after a $150M Series A in January 2026; Arena has raised $250M in total.
  • The product isn't the website – it's the data: AI labs pay for detailed analytics on how their models land with real people.
  • The method has documented weaknesses: a widely cited study accuses the leaderboard of structurally favoring big labs, and in 2025 Meta demonstrated how a polished model variant can game the ranking.

How the duel works

The principle is deliberately simple. You ask a question, two anonymous AI models answer side by side, you vote: A, B, tie, or both bad. Only after you vote does the site reveal which models you just compared. That ordering is the core of the method: if you don't know whether an answer came from OpenAI or an unknown open-source model, you vote without brand goggles.

From millions of such pairwise comparisons, Arena computes a ranking – since December 2023 no longer with chess-style Elo but with the statistically more robust Bradley-Terry model, in which beating a strong model counts for more than beating a weak one. Since 2024 there has also been a "Style Control" view that factors out formatting and answer length – because people measurably prefer longer, nicely formatted answers regardless of substance.
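To make the ranking step concrete, here is a minimal sketch of fitting a Bradley-Terry model to pairwise votes. It is not Arena's implementation: the function names, the toy votes for models "A", "B" and "C", and the choice of the classic iterative (minorization-maximization) update are all assumptions made for illustration. Ties and "both bad" votes are simply left out.

```python
from collections import defaultdict

def fit_bradley_terry(battles, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Illustrative sketch only. Ties are ignored, and every model needs
    at least one win and one loss for its estimate to stay finite.
    """
    wins = defaultdict(float)    # total wins per model
    games = defaultdict(float)   # number of duels per unordered pair
    models = set()
    for winner, loser in battles:
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n in games.items():
                if m in pair:
                    (other,) = pair - {m}
                    # Duels against strong opponents add less to the
                    # denominator, so wins against them weigh more.
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom
        # Rescale so strengths average 1; only their ratios matter.
        total = sum(updated.values())
        strength = {m: s * len(models) / total for m, s in updated.items()}
    return strength

def win_probability(strength, a, b):
    """Model's probability that a beats b in a single duel."""
    return strength[a] / (strength[a] + strength[b])

# Hypothetical toy votes: 10 duels per pair of models.
battles = (
    [("A", "B")] * 7 + [("B", "A")] * 3
    + [("B", "C")] * 6 + [("C", "B")] * 4
    + [("A", "C")] * 8 + [("C", "A")] * 2
)
strength = fit_bradley_terry(battles)
for name in sorted(strength, key=strength.get, reverse=True):
    print(name, round(strength[name], 3))
```

Unlike a running Elo score, the fit uses all votes at once, so the result does not depend on the order in which the duels happened.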

Try it: your first AI duel

This is what the principle feels like – with two real, unedited answers from GPT-5.5 and Claude Opus 4.8 to the same everyday German business task (the answers are in German – that's the point):

KI-Duell · demo

Can you spot the better AI answer?

Two AI models, the same task – anonymised, in random order. Vote first, then we reveal who wrote what.

The task

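Du führst einen kleinen Handwerksbetrieb. Schreib eine kurze, freundliche E-Mail-Absage an einen Bewerber (Herr Weber) für die Stelle als Bürokraft: Er war dir sympathisch, aber eine andere Bewerberin hatte mehr Erfahrung. Du möchtest ihn gern für die Zukunft im Blick behalten. Maximal 110 Wörter.

(In English: You run a small trade business. Write a short, friendly rejection email to an applicant, Mr. Weber, for the office assistant position: you liked him, but another applicant had more experience. You would like to keep him in mind for the future. Maximum 110 words.)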

Your vote is stored only as an anonymous daily counter – no cookies, no IP address.

From Berkeley project to billion-dollar company

Chatbot Arena launched in 2023 as a research project at UC Berkeley (the LMSYS group). Only in April 2025 did it become a company, founded by Anastasios Angelopoulos, Wei-Lin Chiang and Ion Stoica. After that, things moved fast:

When           | What
2023           | Launch as the university research project "Chatbot Arena"
April 2025     | Spun out as a company (LMArena)
May 2025       | Seed round: $100M at a $600M valuation
September 2025 | Launch of the paid "AI Evaluations" product
January 2026   | Series A: $150M at a $1.7B valuation; rebrand to "Arena"
June 2026      | $100M annualized revenue; "Agent Mode" growing 10% weekly, per Arena

Payment isn't by subscription but by consumption: AI developers buy analytics on how their models perform in real user duels – broken down by task type, language and weaknesses. The business model can be summarized with remarkable candor: users supply the labor, the platform sells the distillate. Notably, no paying customers have been publicly named – that's part of the picture too.
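Because revenue is consumption-based, "annualized" is a run rate rather than contracted income. The short sketch below puts the two reported figures in perspective. It rests on two assumptions that the cited figures do not confirm: that "annualized" means the most recent month multiplied by twelve, and that about six months lie between the roughly $30M rate around the turn of the year and the $100M rate in late June. The function names are illustrative.

```python
def monthly_from_annualized(annualized_usd):
    # Assumption: annualized revenue = latest monthly revenue x 12.
    return annualized_usd / 12

def implied_monthly_growth(start, end, months):
    # Constant compound growth rate that takes `start` to `end`.
    return (end / start) ** (1 / months) - 1

monthly = monthly_from_annualized(100e6)
growth = implied_monthly_growth(30e6, 100e6, months=6)
print(f"${monthly / 1e6:.1f}M per month, {growth:.0%} growth per month")
```

Under these assumptions, a $100M run rate corresponds to roughly $8.3M of revenue in the latest month, and the jump from $30M implies compound growth of about 22% per month.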

The cracks in the leaderboard

The more important the ranking became, the more closely researchers looked – and found plenty.

The study "The Leaderboard Illusion" (April 2025, Cohere Labs together with researchers from Princeton, Stanford, MIT and others) documented that large labs were allowed to test private model variants in the arena before release, with only the best result going public – Meta alone reportedly tested 27 private variants before the Llama 4 launch. The study also found that 19.2% of all battle data came from Google models and 20.4% from OpenAI models, while 83 open-weight models together shared 29.7%. Arena disputed parts of the study, saying its pre-release testing policy had been public since March 2024 and that open models made up 40.9% of the leaderboard, but it acknowledged room for improvement.

The Llama 4 Maverick case demonstrated the same month how gameable a vote-based ranking is: Meta submitted an experimental version tuned for likability that shot to #2. The version of the same model that was actually released subsequently landed at rank 32. Arena apologized and tightened its rules.

The third crack is subtler: crowd votes measure what pleases – not what's correct. Longer, formatted, agreeable answers win systematically. That's exactly why the Style Control view exists; and exactly why experts warn against confusing a popularity ranking with a capability benchmark.

And in German?

Arena does maintain a German category leaderboard – it emerges as a by-product when the system auto-classifies German-language duels. It currently holds about 136,000 votes across 276 models, and the top ranks are statistically tied within the margin of error. For comparison: the global text leaderboard has over 7.1 million votes. To our knowledge, no German-first blind-voting platform centered on German tasks exists – Arena's own interface is in English.
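A back-of-the-envelope sketch shows why so few votes leave the top ranks tied. It assumes, unrealistically, that votes are spread evenly across all 276 models, and it uses a textbook normal-approximation margin of error for a win rate rather than whatever interval method Arena applies to its ratings. The function names are illustrative.

```python
import math

def average_appearances(votes, models):
    # Each vote compares two models, so it counts once for each of them.
    return 2 * votes / models

def margin_of_error(appearances, win_rate=0.5, z=1.96):
    # Normal-approximation 95% half-width for an observed win rate.
    return z * math.sqrt(win_rate * (1 - win_rate) / appearances)

german = average_appearances(votes=136_000, models=276)
print(f"about {german:.0f} duels per model, "
      f"win rate known to within ±{margin_of_error(german):.1%}")
```

Under these assumptions the average model has just under 1,000 duels, which pins its win rate down only to about ±3 percentage points. Models whose true win rates differ by less than that cannot be told apart, and the margin halves only when the number of votes quadruples.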

With the German AI Benchmark we already measure monthly how well frontier models handle German business tasks – scored by a cross-vendor judge panel, not a crowd. What's missing is the second half of the picture: what German users actually prefer. Whether we build KI-Duell into a full tool – fresh duels every day, an audience ranking from real votes, GDPR-clean with no sign-up – depends on how many people play the demo above and tell us.

Assessment

Arena's $100M milestone proves one thing above all: human preference data is the scarcest commodity in the AI industry. Benchmarks can be replicated, training data can be bought – but no competitor can retroactively collect 82 million honest, blind user judgments. That's exactly why Silicon Valley pays for them.

At the same time, the criticism shows a vote ranking is no oracle: it measures popularity among those who participate, with all the documented distortions. If you're choosing AI models for your own business, read Arena ranks as one signal among several – alongside task-specific tests like our benchmark and simply trying models on your concrete use case. Popularity is not suitability; the most interesting insights appear exactly where the two diverge.

Sources

The two answers in the demo duel were collected unedited on July 4, 2026, through the same gateway our German AI Benchmark uses.


All analyses are based on i6eal's own measurements or on clearly labelled sources. Figures are snapshots and may change; corrections are disclosed transparently.