Perhaps the most successful AI business model of the year consists of a single question: "Which answer is better – A or B?" Millions of people answer it voluntarily and unpaid on Arena (formerly LMArena, née Chatbot Arena). In late June, the company behind it announced it had reached $100 million in annualized revenue from those collected votes – less than a year after launching its commercial offering. We looked at how free clicks became a billion-dollar company, where the method is vulnerable – and what the German-speaking market is missing.
In brief
- $100M annualized revenue, announced June 29, 2026 – roughly nine months after the September 2025 launch of the paid "AI Evaluations" product. Around the turn of the year the annualized rate was still about $30M.
- 82 million+ votes from 700 million+ conversations and 10 million+ monthly visitors – all self-reported figures from the company.
- Valuation: $1.7 billion after a $150M Series A in January 2026; Arena has raised $250M in total.
- The product isn't the website – it's the data: AI labs pay for detailed analytics on how their models land with real people.
- The method has documented weaknesses: a widely cited study accuses the leaderboard of structurally favoring big labs, and in 2025 Meta demonstrated how a polished model variant can game the ranking.
How the duel works
The principle is deliberately simple. You ask a question, two anonymous AI models answer side by side, you vote: A, B, tie, or both bad. Only after you vote does the site reveal which models you just compared. That ordering is the core of the method: if you don't know whether an answer came from OpenAI or an unknown open-source model, you vote without brand goggles.
From millions of such pairwise comparisons, Arena computes a ranking. Since December 2023 it has used the statistically more robust Bradley-Terry model instead of chess-style Elo: beating a strong model counts for more than beating a weak one. Since 2024 there has also been a "Style Control" view that factors out formatting and answer length, because people measurably prefer longer, nicely formatted answers regardless of substance.
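To make the ranking step concrete, here is a minimal sketch of a Bradley-Terry fit using the classic iterative (minorization-maximization) update. It is an illustration only, not Arena's actual pipeline: the model names A, B and C, the battle counts, and the function names are hypothetical, ties are ignored, and the sketch assumes every model has at least one win and one loss.

```python
from collections import defaultdict


def fit_bradley_terry(battles, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs."""
    wins = defaultdict(int)         # total wins per model
    pair_counts = defaultdict(int)  # number of battles per unordered pair
    models = set()
    for winner, loser in battles:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            # Sum n_ij / (s_i + s_j) over every opponent j this model has met.
            denom = 0.0
            for pair, n in pair_counts.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom
        # Rescale so the strengths average to 1 (the scale is arbitrary).
        total = sum(updated.values())
        strength = {m: s * len(models) / total for m, s in updated.items()}
    return strength


def win_probability(strength, a, b):
    """Modelled probability that a beats b."""
    return strength[a] / (strength[a] + strength[b])


# Hypothetical vote data: 10 duels per pair of models.
battles = (
    [("A", "B")] * 7 + [("B", "A")] * 3
    + [("B", "C")] * 6 + [("C", "B")] * 4
    + [("A", "C")] * 8 + [("C", "A")] * 2
)
strengths = fit_bradley_terry(battles)
for model in sorted(strengths, key=strengths.get, reverse=True):
    print(model, round(strengths[model], 3))
```

The fitted strengths translate directly into predicted win rates through `win_probability`, which is what lets a ranking built this way weigh a win against a strong opponent more heavily than a win against a weak one.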
Try it: your first AI duel
This is what the principle feels like – with two real, unedited answers from GPT-5.5 and Claude Opus 4.8 to the same everyday German business task (the answers are in German – that's the point):
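KI-Duell · demo
Can you spot the better AI answer?
Two AI models, the same task – anonymised, in random order. Vote first, then we reveal who wrote what. (The answers are in German – that's the point of a German-language duel.)
The task: Du führst einen kleinen Handwerksbetrieb. Schreib eine kurze, freundliche E-Mail-Absage an einen Bewerber (Herr Weber) für die Stelle als Bürokraft: Er war dir sympathisch, aber eine andere Bewerberin hatte mehr Erfahrung. Du möchtest ihn gern für die Zukunft im Blick behalten. Maximal 110 Wörter.
[In English: You run a small trade business. Write a short, friendly rejection email to an applicant (Mr. Weber) for the office assistant position: you found him likeable, but another applicant had more experience. You would like to keep him in mind for the future. Maximum 110 words.]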
From Berkeley project to billion-dollar company
Chatbot Arena launched in 2023 as a research project at UC Berkeley (the LMSYS group). Only in April 2025 did it become a company, founded by Anastasios Angelopoulos, Wei-Lin Chiang and Ion Stoica. After that, things moved fast: the paid "AI Evaluations" product launched in September 2025, the annualized revenue rate stood at about $30M around the turn of the year, a $150M Series A in January 2026 valued the company at $1.7 billion, and on June 29, 2026 it announced $100M in annualized revenue.
Payment isn't by subscription but by consumption: AI developers buy analytics on how their models perform in real user duels, broken down by task type, language and weaknesses. Put bluntly, the business model is this: users supply the labor, the platform sells the distillate. Notably, no paying customers have been publicly named – that's part of the picture too.
The cracks in the leaderboard
The more important the ranking became, the more closely researchers looked – and found plenty.
The study "The Leaderboard Illusion" (April 2025, Cohere Labs together with researchers from Princeton, Stanford, MIT and others) documented that large labs were allowed to test private model variants in the arena before release, with only the best result going public – Meta alone reportedly tested 27 private variants before the Llama 4 launch. The study also found 19.2% of all battle data came from Google models and 20.4% from OpenAI models, while 83 open-weight models shared 29.7%. Arena disputed parts of it – its pre-release testing policy had been public since March 2024, and open models made up 40.9% of the leaderboard – but acknowledged room for improvement.
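Why testing many private variants and publishing only the best one matters can be shown with a small simulation. This is a sketch under loudly stated assumptions, not a reconstruction of the study's analysis: every variant is assumed to be equally good, and the true score of 1300 and the measurement noise of 10 points are made-up illustrative values. Only the count of 27 variants comes from the text above.

```python
import random
import statistics


def best_of_n_score(true_score, noise_sd, n_variants, rng):
    """Measure n equally good variants once each and keep only the best result."""
    return max(rng.gauss(true_score, noise_sd) for _ in range(n_variants))


def mean_published_score(n_variants, trials=2000, true_score=1300.0,
                         noise_sd=10.0, seed=0):
    """Average published score over many simulated launches.

    true_score and noise_sd are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    return statistics.fmean(
        best_of_n_score(true_score, noise_sd, n_variants, rng)
        for _ in range(trials)
    )


single = mean_published_score(1)        # one variant, result published as is
best_of_27 = mean_published_score(27)   # 27 variants, only the best published
print(f"one variant: {single:.1f}, best of 27: {best_of_27:.1f}")
```

Even though no variant is actually better than another in this setup, the published best-of-27 score lands well above the true value, purely because the maximum of many noisy measurements is biased upward.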
Just how gameable a vote-based ranking can be was demonstrated the same month by the Llama 4 Maverick case: Meta submitted an experimental version tuned for likability, which shot to #2. The version of the model that was actually released subsequently landed at rank 32. Arena apologized and tightened its rules.
The third crack is subtler: crowd votes measure what pleases – not what's correct. Longer, formatted, agreeable answers win systematically. That's exactly why the Style Control view exists; and exactly why experts warn against confusing a popularity ranking with a capability benchmark.
And in German?
Arena does maintain a German category leaderboard – it emerges as a by-product when the system auto-classifies German-language duels. It currently holds about 136,000 votes across 276 models, and the top ranks are statistically tied within the margin of error. For comparison: the global text leaderboard has over 7.1 million votes. To our knowledge, no German-first blind-voting platform centered on German tasks exists – Arena's own interface is English.
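A rough back-of-the-envelope calculation shows why the top ranks end up statistically tied at this sample size. The sketch below uses a simple normal-approximation interval for a win rate and assumes, unrealistically, that votes are spread evenly across all models; the function name is hypothetical, and Arena's own intervals are computed differently.

```python
import math


def win_rate_margin(n_battles, win_rate=0.5, z=1.96):
    """Half-width of a normal-approximation 95% interval for a win rate."""
    return z * math.sqrt(win_rate * (1.0 - win_rate) / n_battles)


votes, models = 136_000, 276
# Each vote involves two models, so with an even spread (an assumption)
# every model takes part in this many duels on average.
battles_per_model = 2 * votes / models
margin = win_rate_margin(battles_per_model)
print(f"battles per model: {battles_per_model:.0f}")
print(f"margin at a 50% win rate: +/-{100 * margin:.1f} percentage points")
```

Under these assumptions each model has just under 1,000 duels, which leaves an uncertainty of about three percentage points on its win rate. Models whose true win rates differ by less than that cannot be reliably separated.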
With the German AI Benchmark we already measure monthly how well frontier models handle German business tasks – scored by a cross-vendor judge panel, not a crowd. What's missing is the second half of the picture: what German users actually prefer. Whether we build KI-Duell into a full tool – fresh duels every day, an audience ranking from real votes, GDPR-clean with no sign-up – depends on how many people play the demo above and tell us.
Assessment
Arena's $100M milestone proves one thing above all: human preference data is the scarcest commodity in the AI industry. Benchmarks can be replicated, training data can be bought – but no competitor can retroactively collect 82 million honest, blind user judgments. That's exactly why Silicon Valley pays for them.
At the same time, the criticism shows a vote ranking is no oracle: it measures popularity among those who participate, with all the documented distortions. If you're choosing AI models for your own business, read Arena ranks as one signal among several – alongside task-specific tests like our benchmark and simply trying models on your concrete use case. Popularity is not suitability; the most interesting insights appear exactly where the two diverge.
Sources
The two answers in the demo duel were collected unedited on July 4, 2026, through the same gateway our German AI Benchmark uses.
All analyses are based on i6eal's own measurements or on clearly labelled sources. Figures are snapshots and may change; corrections are disclosed transparently.