[{"data":1,"prerenderedAt":30},["ShallowReactive",2],{"nr-en-chatbot-arena-100-millionen-nutzerstimmen":3},{"slug":4,"title":5,"dek":6,"date":7,"time":8,"publishedAt":9,"updated":10,"updatedAt":10,"dateFmt":11,"updatedFmt":10,"kind":12,"tier":13,"author":14,"authorName":15,"topics":16,"tracker":21,"trackerLabel":22,"headlineStat":23,"image":24,"ogImage":25,"imageAlt":26,"csv":10,"minutes":27,"words":28,"html":29},"chatbot-arena-100-millionen-nutzerstimmen","Two answers, one click, $100 million: the business behind Chatbot Arena","People have voted 82 million times, for free, on which AI answer is better. That turned into a company valued at $1.7 billion – and the most influential AI ranking in the world. How the model works, where it creaks, and why nothing like it exists in German. With a duel you can try yourself.","2026-07-04","18:45","2026-07-04T18:45:00+02:00","","July 4, 2026","analyse","flagship","ideal-syka","Ideal Syka",[17,18,19,20],"Chatbot Arena","LMArena","Benchmarks","AI evaluation","\u002Fki-benchmark-deutsch","German AI Benchmark","$100M annualized revenue – built from free user votes","\u002Fnewsroom\u002Fimg\u002Fchatbot-arena-100-millionen-nutzerstimmen.webp","\u002Fog-nr\u002Fchatbot-arena-100-millionen-nutzerstimmen.en.png","Two anonymous AI answers facing off – the Chatbot Arena principle",6,1171,"\u003Cp>Perhaps the most successful AI business model of the year consists of a single question: &quot;Which answer is better – A or B?&quot; Millions of people answer it voluntarily and unpaid on \u003Cstrong>Arena\u003C\u002Fstrong> (formerly LMArena, née Chatbot Arena). In late June, the company behind it announced it had reached \u003Cstrong>$100 million in annualized revenue\u003C\u002Fstrong> from those collected votes – less than a year after launching its commercial offering. 
We looked at how free clicks became a billion-dollar company, where the method is vulnerable – and what the German-speaking market is missing.\u003C\u002Fp>\n\u003Ch2>In brief\u003C\u002Fh2>\n\u003Cul>\n\u003Cli>\u003Cstrong>$100M\u003C\u002Fstrong> annualized revenue, announced June 29, 2026 – roughly \u003Cstrong>nine months\u003C\u002Fstrong> after the September 2025 launch of the paid &quot;AI Evaluations&quot; product. Around the turn of the year the rate was still about \u003Cstrong>$30M\u003C\u002Fstrong>.\u003C\u002Fli>\n\u003Cli>\u003Cstrong>82 million+ votes\u003C\u002Fstrong> from \u003Cstrong>700 million+ conversations\u003C\u002Fstrong> and \u003Cstrong>10 million+ monthly visitors\u003C\u002Fstrong> – per the company&#39;s self-reported figures.\u003C\u002Fli>\n\u003Cli>Valuation: \u003Cstrong>$1.7 billion\u003C\u002Fstrong> after a \u003Cstrong>$150M\u003C\u002Fstrong> Series A in January 2026; Arena has raised \u003Cstrong>$250M\u003C\u002Fstrong> in total.\u003C\u002Fli>\n\u003Cli>The product isn&#39;t the website – it&#39;s the \u003Cstrong>data\u003C\u002Fstrong>: AI labs pay for detailed analytics on how their models land with real people.\u003C\u002Fli>\n\u003Cli>The method has documented weaknesses: a widely cited study accuses the leaderboard of \u003Cstrong>structurally favoring big labs\u003C\u002Fstrong>, and in 2025 Meta demonstrated how a polished model variant can game the ranking.\u003C\u002Fli>\n\u003C\u002Ful>\n\u003Ch2>How the duel works\u003C\u002Fh2>\n\u003Cp>The principle is deliberately simple. You ask a question, two anonymous AI models answer side by side, you vote: A, B, tie, or both bad. Only \u003Cstrong>after\u003C\u002Fstrong> you vote does the site reveal which models you just compared. 
That ordering is the core of the method: if you don&#39;t know whether an answer came from OpenAI or an unknown open-source model, you vote without brand goggles.\u003C\u002Fp>\n\u003Cp>From millions of such pairwise comparisons, Arena computes a ranking – since December 2023 no longer with chess-style Elo but with the statistically more robust \u003Cstrong>Bradley-Terry model\u003C\u002Fstrong>: beating a strong model counts for more than beating a weak one. Since 2024 there is also a &quot;Style Control&quot; view that factors out formatting and answer length – because people measurably prefer longer, nicely formatted answers regardless of substance.\u003C\u002Fp>\n\u003Ch2>Try it: your first AI duel\u003C\u002Fh2>\n\u003Cp>This is what the principle feels like – with two real, unedited answers from GPT-5.5 and Claude Opus 4.8 to the same everyday German business task (the answers are in German – that&#39;s the point):\u003C\u002Fp>\n\u003C!--ki-duell-demo-->\u003Ch2>From Berkeley project to billion-dollar company\u003C\u002Fh2>\n\u003Cp>Chatbot Arena launched in 2023 as a research project at UC Berkeley (the LMSYS group). Only in April 2025 did it become a company, founded by Anastasios Angelopoulos, Wei-Lin Chiang and Ion Stoica. 
After that, things moved fast:\u003C\u002Fp>\n\u003Cdiv class=\"tbl-scroll\">\u003Ctable>\n\u003Cthead>\n\u003Ctr>\n\u003Cth>When\u003C\u002Fth>\n\u003Cth>What\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003C\u002Fthead>\n\u003Ctbody>\u003Ctr>\n\u003Ctd>2023\u003C\u002Ftd>\n\u003Ctd>Launch as the university research project &quot;Chatbot Arena&quot;\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd>April 2025\u003C\u002Ftd>\n\u003Ctd>Spun out as a company (LMArena)\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd>May 2025\u003C\u002Ftd>\n\u003Ctd>Seed round: $100M at a $600M valuation\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd>September 2025\u003C\u002Ftd>\n\u003Ctd>Launch of the paid &quot;AI Evaluations&quot; product\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd>January 2026\u003C\u002Ftd>\n\u003Ctd>Series A: $150M at a \u003Cstrong>$1.7B\u003C\u002Fstrong> valuation; rebrand to &quot;Arena&quot;\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd>June 2026\u003C\u002Ftd>\n\u003Ctd>\u003Cstrong>$100M\u003C\u002Fstrong> annualized revenue; &quot;Agent Mode&quot; growing 10% weekly, per Arena\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftbody>\u003C\u002Ftable>\u003C\u002Fdiv>\n\u003Cp>Payment isn&#39;t by subscription but by consumption: AI developers buy analytics on how their models perform in real user duels – broken down by task type, language, weaknesses. 
Summarized bluntly, the business model is: \u003Cstrong>users supply the labor, the platform sells the distillate.\u003C\u002Fstrong> Notably, no paying customers have been publicly named – that&#39;s part of the picture too.\u003C\u002Fp>\n\u003Ch2>The cracks in the leaderboard\u003C\u002Fh2>\n\u003Cp>The more important the ranking became, the more closely researchers looked – and found plenty.\u003C\u002Fp>\n\u003Cp>The study \u003Cstrong>&quot;The Leaderboard Illusion&quot;\u003C\u002Fstrong> (April 2025, Cohere Labs together with researchers from Princeton, Stanford, MIT and others) documented that large labs were allowed to test \u003Cstrong>private model variants\u003C\u002Fstrong> in the arena before release, with only the best result going public – Meta alone reportedly tested \u003Cstrong>27 private variants\u003C\u002Fstrong> before the Llama 4 launch. The study also found \u003Cstrong>19.2%\u003C\u002Fstrong> of all battle data came from Google models and \u003Cstrong>20.4%\u003C\u002Fstrong> from OpenAI models, while \u003Cstrong>83 open-weight models shared 29.7%\u003C\u002Fstrong>. Arena disputed parts of the study, saying its pre-release testing policy had been public since March 2024 and that open models made up 40.9% of the leaderboard, but acknowledged room for improvement.\u003C\u002Fp>\n\u003Cp>The \u003Cstrong>Llama 4 Maverick\u003C\u002Fstrong> case showed the same month how gameable a vote-based ranking is: Meta submitted an experimental version tuned for likability that shot to \u003Cstrong>#2\u003C\u002Fstrong>. The version Meta actually released later landed at \u003Cstrong>rank 32\u003C\u002Fstrong>. Arena apologized and tightened its rules.\u003C\u002Fp>\n\u003Cp>The third crack is subtler: crowd votes measure \u003Cstrong>what pleases\u003C\u002Fstrong> – not what&#39;s correct. Longer, formatted, agreeable answers win systematically. 
That&#39;s exactly why the Style Control view exists; and exactly why experts warn against confusing a popularity ranking with a capability benchmark.\u003C\u002Fp>\n\u003Ch2>And in German?\u003C\u002Fh2>\n\u003Cp>Arena does maintain a German category leaderboard – it emerges as a by-product when the system auto-classifies German-language duels. It currently holds about \u003Cstrong>136,000 votes\u003C\u002Fstrong> across 276 models, and the top ranks are statistically tied within the margin of error. For comparison: the global text leaderboard has over \u003Cstrong>7.1 million\u003C\u002Fstrong> votes. To our knowledge, no German-first blind-voting platform centered on German tasks exists – Arena&#39;s own interface is English.\u003C\u002Fp>\n\u003Cp>With the \u003Ca href=\"\u002Fki-benchmark-deutsch\">German AI Benchmark\u003C\u002Fa> we already measure monthly how well frontier models handle German business tasks – scored by a cross-vendor judge panel, not a crowd. What&#39;s missing is the second half of the picture: \u003Cstrong>what German users actually prefer.\u003C\u002Fstrong> Whether we build KI-Duell into a full tool – fresh duels every day, an audience ranking from real votes, GDPR-clean with no sign-up – depends on how many people play the demo above and tell us.\u003C\u002Fp>\n\u003Ch2>Assessment\u003C\u002Fh2>\n\u003Cp>Arena&#39;s $100M milestone proves one thing above all: \u003Cstrong>human preference data is the scarcest commodity in the AI industry.\u003C\u002Fstrong> Benchmarks can be replicated, training data can be bought – but no competitor can retroactively collect 82 million honest, blind user judgments. That&#39;s exactly why Silicon Valley pays for them.\u003C\u002Fp>\n\u003Cp>At the same time, the criticism shows a vote ranking is no oracle: it measures popularity among those who participate, with all the documented distortions. 
If you&#39;re choosing AI models for your own business, read Arena ranks as one signal among several – alongside task-specific tests like our benchmark and simply trying models on your concrete use case. Popularity is not suitability; the most interesting insights appear exactly where the two diverge.\u003C\u002Fp>\n\u003Ch2>Sources\u003C\u002Fh2>\n\u003Cul>\n\u003Cli>\u003Ca href=\"https:\u002F\u002Ftechcrunch.com\u002F2026\u002F06\u002F29\u002Farena-the-ai-leaderboard-everyone-uses-is-now-a-100m-business\u002F\">TechCrunch: Arena, the AI leaderboard everyone uses, is now a $100M business\u003C\u002Fa> (June 29, 2026)\u003C\u002Fli>\n\u003Cli>\u003Ca href=\"https:\u002F\u002Farena.ai\u002Fblog\u002Farena-100m-revenue\u002F\">Arena blog: 100M in Revenue\u003C\u002Fa> (June 29, 2026; self-reported vote\u002Fvisitor figures)\u003C\u002Fli>\n\u003Cli>\u003Ca href=\"https:\u002F\u002Ftechcrunch.com\u002F2026\u002F01\u002F06\u002Flmarena-lands-1-7b-valuation-four-months-after-launching-its-product\u002F\">TechCrunch: LMArena lands $1.7B valuation\u003C\u002Fa> (January 6, 2026)\u003C\u002Fli>\n\u003Cli>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.20879\">Study &quot;The Leaderboard Illusion&quot;\u003C\u002Fa> (arXiv, April 2025) and \u003Ca href=\"https:\u002F\u002Farena.ai\u002Fblog\u002Four-response\u002F\">Arena&#39;s response\u003C\u002Fa>\u003C\u002Fli>\n\u003Cli>\u003Ca href=\"https:\u002F\u002Ftechcrunch.com\u002F2025\u002F04\u002F11\u002Fmetas-vanilla-maverick-ai-model-ranks-below-rivals-on-a-popular-chat-benchmark\u002F\">TechCrunch on the Llama 4 Maverick case\u003C\u002Fa> (April 11, 2025)\u003C\u002Fli>\n\u003C\u002Ful>\n\u003Cp>The two answers in the demo duel were collected unedited on July 4, 2026, through the same gateway our German AI Benchmark uses.\u003C\u002Fp>\n",1783276596442]