The big leaderboards measure almost exclusively in English. This benchmark tests leading AI models on what matters in German business: formal correspondence, bureaucratic German, law, and source faithfulness. It is one of the first to measure AI specifically on real German business tasks, with a transparent method and dated results.
Overall score across six German business dimensions (0–100 points).
The leaders sit close together: with 24 tasks per run, gaps of a few points are within measurement noise — read the larger gaps and the trend across runs as the signal.
Each cell is a model's score (0–100) in one dimension; higher is better.
| Model | Business communication | Official & bureaucratic German | Law & tax (domain knowledge) | Summarisation & faithfulness | Source faithfulness (document Q&A) | Language quality & style | Overall |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.8 (Anthropic) | 97 | 95 | 99 | 96 | 100 | 100 | 98 |
| GPT-5.5 (OpenAI) | 96 | 90 | 100 | 90 | 100 | 96 | 95 |
| Qwen3.7 Max (Alibaba) | 90 | 80 | 94 | 95 | 100 | 98 | 93 |
| Gemini 3.1 Pro (Google) | 87 | 79 | 94 | 92 | 100 | 92 | 91 |
| DeepSeek V4-Pro (DeepSeek) | 96 | 83 | 74 | 93 | 100 | 97 | 91 |
| Grok 4.3 (xAI) | 93 | 78 | 77 | 95 | 100 | 97 | 90 |
| Mistral Large 3 (Mistral) | 91 | 79 | 93 | 93 | 80 | 94 | 88 |
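The page does not state how the Overall column is aggregated. The published values are consistent with an unweighted mean of the six dimension scores, rounded half up. The following Python sketch checks that reading against the table; the aggregation rule is an assumption inferred from the numbers, not a documented formula.

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores copied from the table, in column order:
# business communication, official German, law & tax,
# summarisation, source faithfulness, language quality.
# The second tuple element is the published Overall value.
TABLE = {
    "Claude Opus 4.8": ([97, 95, 99, 96, 100, 100], 98),
    "GPT-5.5": ([96, 90, 100, 90, 100, 96], 95),
    "Qwen3.7 Max": ([90, 80, 94, 95, 100, 98], 93),
    "Gemini 3.1 Pro": ([87, 79, 94, 92, 100, 92], 91),
    "DeepSeek V4-Pro": ([96, 83, 74, 93, 100, 97], 91),
    "Grok 4.3": ([93, 78, 77, 95, 100, 97], 90),
    "Mistral Large 3": ([91, 79, 93, 93, 80, 94], 88),
}


def overall(scores):
    """Unweighted mean, rounded half up (assumed aggregation rule)."""
    mean = Decimal(sum(scores)) / Decimal(len(scores))
    return int(mean.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


for model, (scores, published) in TABLE.items():
    status = "matches" if overall(scores) == published else "differs"
    print(f"{model}: computed {overall(scores)}, published {published} ({status})")
```

`Decimal` with `ROUND_HALF_UP` is used because Python's built-in `round` rounds halves to the nearest even integer, which would turn a mean of exactly 90.5 into 90.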
No model leads everywhere. The radar shows where the three best models have their strengths and gaps across the six dimensions.
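The overall score hides that the leader varies by dimension. Here is the best model per field, read from the table above.
Business communication, led by Claude Opus 4.8 (97). Formal German business correspondence: emails, quotes, rejections in the right register.
Official & bureaucratic German, led by Claude Opus 4.8 (95). Understanding bureaucratic German and rendering it in plain language, or producing it correctly.
Law & tax (domain knowledge), led by GPT-5.5 (100). Verifiable questions on German law and tax (BGB, AO, deadlines, limitation periods).
Summarisation & faithfulness, led by Claude Opus 4.8 (96). Faithfully summarising German texts: capturing key points, never altering figures, inventing nothing.
Source faithfulness (document Q&A), with six models tied at 100. Answering strictly from a provided German document, and saying “not in the document” when the answer is absent.
Language quality & style, led by Claude Opus 4.8 (100). Grammatically flawless, idiomatic German: correct umlauts/ß, no translationese, register control.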
A primary measurement — not a re-aggregation of public leaderboards. Every model solves the same German tasks; scoring is name-blind by a cross-vendor panel against fixed rubrics and reference answers.
Only example tasks are public. The actual test set stays private so the benchmark can't be trained on or gamed.
Answers are scored by a panel of three independent judge models from different vendors (OpenAI, Google, Anthropic); the judge models themselves are not on the board. No model is ever graded by a judge from its own company (leave-one-family-out): a vendor's answers are scored only by judges from other vendors. Scoring is name-blind (the judge does not see the model name) and absolute, against a fixed rubric plus reference answer; the individual marks of the eligible judges are averaged. Results are spot-checked by hand.
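The leave-one-family-out rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's actual code: the judge identifiers, the `panel_score` helper and the plain arithmetic mean are hypothetical, and the page does not say whether a same-vendor judge is skipped entirely or its mark is merely discarded.

```python
from statistics import mean

# Hypothetical judge panel: judge id -> vendor family.
# Only the three vendor names come from the methodology text.
JUDGES = {"judge_a": "OpenAI", "judge_b": "Google", "judge_c": "Anthropic"}


def eligible_judges(candidate_vendor, judges=JUDGES):
    """Return the judges whose vendor differs from the candidate's vendor."""
    return [j for j, vendor in judges.items() if vendor != candidate_vendor]


def panel_score(candidate_vendor, marks, judges=JUDGES):
    """Average the marks (0-100) of eligible judges only.

    marks maps judge id -> mark for one name-blind answer.
    """
    used = [marks[j] for j in eligible_judges(candidate_vendor, judges) if j in marks]
    if not used:
        raise ValueError("no eligible judge mark available")
    return mean(used)


marks = {"judge_a": 90, "judge_b": 80, "judge_c": 100}
print(panel_score("Anthropic", marks))  # uses judge_a and judge_b only
print(panel_score("Mistral", marks))    # all three judges are eligible
```

One consequence of this design: models from OpenAI, Google and Anthropic are scored by two judges, while models from other vendors are scored by all three, so the number of marks behind each average differs by vendor.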
The benchmark runs at regular intervals and every result is versioned; the history across runs is the real asset.
Six example tasks — one per dimension. The actual test tasks stay private so the benchmark can't be trained on or gamed.
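„Schreibe eine freundliche Auftragsbestätigung an einen Neukunden – mit Bestellübersicht und voraussichtlichem Liefertermin. Sie-Form.“ (English: "Write a friendly order confirmation to a new customer, with an order summary and expected delivery date. Use the formal Sie form.")
„Was bedeutet es, wenn ein Steuerbescheid „bestandskräftig“ geworden ist? Erkläre es in einfachen Worten – ohne die fachliche Korrektheit zu verlieren.“ (English: "What does it mean when a tax assessment notice has become 'bestandskräftig', that is, final and no longer contestable? Explain it in simple words without losing technical accuracy.")
„Welche gesetzliche Kündigungsfrist gilt für den Arbeitgeber nach § 622 BGB bei 8 Jahren Betriebszugehörigkeit?“ (English: "Which statutory notice period applies to the employer under § 622 BGB after 8 years of service?")
Expected: 3 Monate zum Ende eines Kalendermonats (three months to the end of a calendar month)
„Fasse den folgenden zweiseitigen Geschäftsbericht-Auszug in fünf Stichpunkten zusammen, ohne Zahlen zu verändern. [Text wird gestellt]“ (English: "Summarise the following two-page excerpt from an annual report in five bullet points without changing any figures. [Text is provided]")
„Beantworte ausschließlich anhand des beigefügten Datenschutz-Dokuments: Wie lange werden Bewerberdaten gespeichert? Wenn es nicht im Dokument steht, sage das ausdrücklich.“ (English: "Answer solely on the basis of the attached data protection document: how long is applicant data stored? If the document does not say, state that explicitly.")
„Überarbeite einen holprigen, maschinell wirkenden Text zu natürlichem, idiomatischem Geschäftsdeutsch und korrigiere Grammatik- und Stilfehler. [Text wird gestellt]“ (English: "Rework a clumsy, machine-sounding text into natural, idiomatic business German and correct grammar and style errors. [Text is provided]")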
The big leaderboards measure almost only in English. But for German companies, what matters is how well a model handles German business language, bureaucratic German and German law. That's exactly what this benchmark measures — as one of the first to test AI specifically on real German business tasks rather than academic benchmarks.
Every model answers the same tasks. A panel of three independent models from different vendors, none of them on the board, scores the answers name-blind against a fixed rubric and, where possible, a reference answer; no model is ever graded by a judge from its own company. We spot-check samples by hand.
If the test set were public, models could be trained on it and the comparison would be worthless. The example task shown for each dimension is therefore not one of the private tasks; the actual set stays private and is continually refreshed. 'Private' here means not publicly listed: the task text does pass through the vendors' APIs.
The benchmark is re-measured at regular intervals; the "Last measured" date shows when it last ran. If a model returns too few valid answers, we keep its previous value rather than publish an outlier.
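The fallback rule for runs with too few valid answers might look like the sketch below. The threshold `MIN_VALID`, the function name `published_score` and the use of a plain mean over task scores are all assumptions made for illustration; the page only says "too few" and does not give a number.

```python
from typing import Optional, Sequence

# Hypothetical threshold: the source does not say how many valid
# answers out of the 24 tasks per run are required.
MIN_VALID = 20


def published_score(
    task_scores: Sequence[Optional[float]],
    previous: Optional[float],
    min_valid: int = MIN_VALID,
) -> Optional[float]:
    """Return the score to publish for one model in one run.

    task_scores holds one entry per task; None marks an invalid or
    missing answer. If fewer than min_valid answers are valid, the
    previous published value is kept instead of a new one.
    """
    valid = [s for s in task_scores if s is not None]
    if len(valid) < min_valid:
        return previous
    return sum(valid) / len(valid)


full_run = [90.0] * 24
broken_run = [90.0] * 10 + [None] * 14
print(published_score(full_run, previous=80.0))    # new value is published
print(published_score(broken_run, previous=80.0))  # previous value is kept
```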
Last measured: 22 June 2026
Scores are snapshots of an automated evaluation over a private test set; they measure performance on German business tasks, not overall “quality”. Model names and versions follow availability via the gateway used. Individual runs can fluctuate — what counts is the trend, not a single run. With 24 tasks per run, small point differences are not statistically meaningful; models within a few points should be read as tied.
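To give a feel for why a few points are within noise at 24 tasks, the sketch below computes the standard error of a mean score and an approximate 95% margin for the gap between two models. The per-task standard deviations are hypothetical inputs, since per-task scores are not published, and the two models' runs are treated as independent; a paired analysis on the shared tasks could give a tighter margin.

```python
import math

N_TASKS = 24  # tasks per run, as stated on the page


def standard_error(per_task_sd, n_tasks=N_TASKS):
    """Standard error of a mean score over n_tasks tasks."""
    return per_task_sd / math.sqrt(n_tasks)


def gap_margin(per_task_sd, n_tasks=N_TASKS, z=1.96):
    """Approximate 95% margin for the difference of two mean scores.

    Assumes both models have the same per-task standard deviation and
    that their scores are independent (a simplification).
    """
    return z * math.sqrt(2) * standard_error(per_task_sd, n_tasks)


# Hypothetical per-task standard deviations in points.
for sd in (5, 10, 15):
    print(
        f"per-task sd {sd}: standard error {standard_error(sd):.1f}, "
        f"gap needed for significance about {gap_margin(sd):.1f} points"
    )
```

Under these assumptions, even a modest per-task spread of 10 points would require a gap of more than five points between two models before it could be told apart from run-to-run variation.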
We bring the model with the best German into your process — for correspondence, documents and contracts, with an eye on cost and data protection.