German AI Benchmark

Which AI is best at German?

The big leaderboards measure almost only in English. This benchmark tests leading AI models on what matters in German business: formal correspondence, bureaucratic German, law and source faithfulness. It is one of the first to measure AI specifically on real German business tasks, with a disclosed methodology and dated results.

Status: 22 June 2026. Built on our own private test tasks. Scored name-blind by a cross-vendor panel, with the methodology disclosed.

The ranking

Overall score across six German business dimensions (0–100 points).


The leaders sit close together: with 24 tasks per run, gaps of a few points are within measurement noise — read the larger gaps and the trend across runs as the signal.
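To see why a few points can disappear in the noise, here is a rough sketch. Only the figure of 24 tasks per run comes from this page. The per-task standard deviation of 10 points is a purely illustrative assumption, not a value the benchmark reports, and the function names are hypothetical.

```python
import math

TASKS_PER_RUN = 24  # stated on this page


def standard_error(per_task_sd: float, n_tasks: int = TASKS_PER_RUN) -> float:
    """Standard error of a mean score over n_tasks independent tasks."""
    return per_task_sd / math.sqrt(n_tasks)


def gap_standard_error(per_task_sd: float, n_tasks: int = TASKS_PER_RUN) -> float:
    """Standard error of the gap between two models' mean scores.

    Assumes both models have the same per-task spread and that their
    errors are independent (a simplification, since all models solve
    the same tasks).
    """
    return math.sqrt(2) * standard_error(per_task_sd, n_tasks)


ASSUMED_SD = 10.0  # hypothetical per-task standard deviation, in points
se = standard_error(ASSUMED_SD)
gap_se = gap_standard_error(ASSUMED_SD)
print(f"standard error of one score: {se:.1f} points")
print(f"standard error of a gap:     {gap_se:.1f} points")
```

Under that assumption a single overall score is uncertain by about 2 points and a gap between two models by about 3, which is why neighbouring models on the board are best read as tied.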

Strengths by dimension

Each cell is a model's score (0–100) in one dimension.

Model	Vendor	Business communication	Official & bureaucratic German	Law & tax (domain knowledge)	Summarisation & faithfulness	Source faithfulness (document Q&A)	Language quality & style	Overall
Claude Opus 4.8	Anthropic	97	95	99	96	100	100	98
GPT-5.5	OpenAI	96	90	100	90	100	96	95
Qwen3.7 Max	Alibaba	90	80	94	95	100	98	93
Gemini 3.1 Pro	Google	87	79	94	92	100	92	91
DeepSeek V4-Pro	DeepSeek	96	83	74	93	100	97	91
Grok 4.3	xAI	93	78	77	95	100	97	90
Mistral Large 3	Mistral	91	79	93	93	80	94	88
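The page does not spell out how the overall score is derived from the six dimensions. The published figures are consistent with an unweighted mean rounded half up, which the sketch below checks. Treat that formula as an inference from the table, not as the documented method; the function name `overall` is hypothetical.

```python
import math

# (six dimension scores in table column order, published overall score)
SCORES = {
    "Claude Opus 4.8": ([97, 95, 99, 96, 100, 100], 98),
    "GPT-5.5":         ([96, 90, 100, 90, 100, 96], 95),
    "Qwen3.7 Max":     ([90, 80, 94, 95, 100, 98], 93),
    "Gemini 3.1 Pro":  ([87, 79, 94, 92, 100, 92], 91),
    "DeepSeek V4-Pro": ([96, 83, 74, 93, 100, 97], 91),
    "Grok 4.3":        ([93, 78, 77, 95, 100, 97], 90),
    "Mistral Large 3": ([91, 79, 93, 93, 80, 94], 88),
}


def overall(dimension_scores: list[int]) -> int:
    """Unweighted mean of the dimension scores, rounded half up.

    Assumed formula: it reproduces the table but is not confirmed by the page.
    """
    mean = sum(dimension_scores) / len(dimension_scores)
    return math.floor(mean + 0.5)


for model, (dims, published) in SCORES.items():
    status = "matches" if overall(dims) == published else "differs"
    print(f"{model}: computed {overall(dims)}, published {published} ({status})")
```

Rounding half up is used instead of Python's built-in `round`, which rounds exact halves to the nearest even number and would turn DeepSeek V4-Pro's mean of 90.5 into 90.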

Strength profile of the top 3

No model leads everywhere. The radar shows where the three best models have their strengths and gaps across the six dimensions.


Who leads in each dimension?

The overall score hides who leads in each individual dimension. Here is the best model per field.

Business communication

Formal German business correspondence — emails, quotes, rejections in the right register.

Claude Opus 4.8: 97/100

Official & bureaucratic German

Understanding bureaucratic German and rendering it in plain language — or producing it correctly.

Claude Opus 4.8: 95/100

Law & tax (domain knowledge)

Verifiable questions on German law and tax (BGB, AO, deadlines, limitation periods).

GPT-5.5: 100/100

Summarisation & faithfulness

Faithfully summarising German texts — capturing key points, never altering figures, inventing nothing.

Claude Opus 4.8: 96/100

Source faithfulness (document Q&A)

Answering strictly from a provided German document — and saying “not in the document” when the answer is absent.

GPT-5.5: 100/100 (tied with five other models; only Mistral Large 3 scores lower, at 80)

Language quality & style

Grammatically flawless, idiomatic German — correct umlauts/ß, no translationese, register control.

Claude Opus 4.8: 100/100
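The per-dimension leaders can be recomputed from the "Strengths by dimension" table, including ties. The sketch below is one way to do that; the scores are copied from the table, while the names `DIMENSIONS`, `SCORES` and `leaders` are hypothetical.

```python
DIMENSIONS = [
    "Business communication",
    "Official & bureaucratic German",
    "Law & tax (domain knowledge)",
    "Summarisation & faithfulness",
    "Source faithfulness (document Q&A)",
    "Language quality & style",
]

# Scores per model, in the same order as DIMENSIONS.
SCORES = {
    "Claude Opus 4.8": [97, 95, 99, 96, 100, 100],
    "GPT-5.5":         [96, 90, 100, 90, 100, 96],
    "Qwen3.7 Max":     [90, 80, 94, 95, 100, 98],
    "Gemini 3.1 Pro":  [87, 79, 94, 92, 100, 92],
    "DeepSeek V4-Pro": [96, 83, 74, 93, 100, 97],
    "Grok 4.3":        [93, 78, 77, 95, 100, 97],
    "Mistral Large 3": [91, 79, 93, 93, 80, 94],
}


def leaders(dimension_index: int) -> tuple[int, list[str]]:
    """Return the top score in one dimension and every model that reaches it."""
    top = max(row[dimension_index] for row in SCORES.values())
    tied = [m for m, row in SCORES.items() if row[dimension_index] == top]
    return top, tied


for i, name in enumerate(DIMENSIONS):
    top, models = leaders(i)
    print(f"{name}: {top}/100 ({', '.join(models)})")
```

Listing all tied models matters mainly for source faithfulness, where six of the seven models reach the maximum.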

How the benchmark works

A primary measurement — not a re-aggregation of public leaderboards. Every model solves the same German tasks; scoring is name-blind by a cross-vendor panel against fixed rubrics and reference answers.

Private test set

Only example tasks are public. The actual test set stays private so the benchmark can't be trained on or gamed.

Anonymous scoring

Scored by a panel of three independent models from different vendors (OpenAI, Google, Anthropic) that are not themselves on the board. No model is ever graded by a judge from its own company (leave-one-family-out): a vendor's answers are scored only by judges from other vendors. Scoring is name-blind (the judge does not see the model name) and absolute, against a fixed rubric plus reference answer; the individual marks are averaged. Results are hand-spot-checked.
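The leave-one-family-out rule and the averaging step could look roughly like the sketch below. The judge identifiers are the ones listed under "Transparency & status"; the function names and the example marks are hypothetical, and the real pipeline may differ.

```python
from statistics import mean

# Judge models as listed under "Transparency & status", keyed by vendor.
JUDGES = {
    "OpenAI": "openai/gpt-5.4",
    "Google": "google/gemini-3-pro-preview",
    "Anthropic": "anthropic/claude-opus-4.7",
}


def eligible_judges(candidate_vendor: str) -> list[str]:
    """Judges allowed to score a model from candidate_vendor:
    every judge except the one from the same vendor."""
    return [model for vendor, model in JUDGES.items() if vendor != candidate_vendor]


def panel_score(candidate_vendor: str, marks_by_judge: dict[str, float]) -> float:
    """Average the marks of the eligible judges only."""
    return mean(marks_by_judge[judge] for judge in eligible_judges(candidate_vendor))


# Hypothetical marks for one answer from an Anthropic model.
example_marks = {
    "openai/gpt-5.4": 90,
    "google/gemini-3-pro-preview": 94,
    "anthropic/claude-opus-4.7": 100,  # ignored: same vendor as the candidate
}
print(eligible_judges("Anthropic"))
print(panel_score("Anthropic", example_marks))
```

One consequence of this rule is that models from OpenAI, Google and Anthropic are scored by two judges, while models from vendors without a judge on the panel are scored by all three.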

Versioned

The benchmark runs at regular intervals and every result is versioned; the history across runs is the real asset.

What we test

Six example tasks — one per dimension. The actual test tasks stay private so the benchmark can't be trained on or gamed.

Business communication
„Schreibe eine freundliche Auftragsbestätigung an einen Neukunden – mit Bestellübersicht und voraussichtlichem Liefertermin. Sie-Form.“
Official & bureaucratic German
„Was bedeutet es, wenn ein Steuerbescheid „bestandskräftig“ geworden ist? Erkläre es in einfachen Worten – ohne die fachliche Korrektheit zu verlieren.“
Law & tax (domain knowledge)
„Welche gesetzliche Kündigungsfrist gilt für den Arbeitgeber nach § 622 BGB bei 8 Jahren Betriebszugehörigkeit?“
Expected: 3 Monate zum Ende eines Kalendermonats (three months to the end of a calendar month)
Summarisation & faithfulness
„Fasse den folgenden zweiseitigen Geschäftsbericht-Auszug in fünf Stichpunkten zusammen, ohne Zahlen zu verändern. [Text wird gestellt]“
Source faithfulness (document Q&A)
„Beantworte ausschließlich anhand des beigefügten Datenschutz-Dokuments: Wie lange werden Bewerberdaten gespeichert? Wenn es nicht im Dokument steht, sage das ausdrücklich.“
Language quality & style
„Überarbeite einen holprigen, maschinell wirkenden Text zu natürlichem, idiomatischem Geschäftsdeutsch und korrigiere Grammatik- und Stilfehler. [Text wird gestellt]“

Frequently asked questions

Why a dedicated German benchmark?

The big leaderboards measure almost only in English. But for German companies, what matters is how well a model handles German business language, bureaucratic German and German law. That's exactly what this benchmark measures — as one of the first to test AI specifically on real German business tasks rather than academic benchmarks.

How is it scored?

Every model answers the same tasks. A panel of three independent models from different vendors, none of them on the board, scores the answers name-blind against a fixed rubric and, where possible, a reference answer; no model is ever graded by a judge from its own company. We spot-check samples by hand.

Why aren't the test tasks public?

If the test set were public, models could be trained on it and the comparison would be worthless. The example task shown for each dimension is not one of the private tasks; the actual set stays private and is continually refreshed. 'Private' here means not publicly listed: the task text does pass through the vendors' APIs.

How current is the ranking?

The benchmark is re-measured at regular intervals; the date above shows when it last ran. If a model returns too few valid answers, we keep its previous value rather than publish an outlier.
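The carry-forward rule can be sketched as follows. The page does not say how many valid answers count as "too few", so the threshold of 20 out of 24 is a hypothetical placeholder, as are the names `RunResult` and `published_score`.

```python
from dataclasses import dataclass
from typing import Optional

MIN_VALID_ANSWERS = 20  # hypothetical threshold; the page only says "too few"


@dataclass
class RunResult:
    score: float        # overall score measured in the current run
    valid_answers: int  # number of tasks with a usable answer


def published_score(
    current: RunResult,
    previous: Optional[float],
    min_valid: int = MIN_VALID_ANSWERS,
) -> Optional[float]:
    """Publish the new score only if enough answers were valid;
    otherwise keep the previously published value (None if there is none)."""
    if current.valid_answers >= min_valid:
        return current.score
    return previous


# A run with only 9 valid answers keeps the earlier value of 90.0.
print(published_score(RunResult(score=61.0, valid_answers=9), previous=90.0))
# A complete run replaces it.
print(published_score(RunResult(score=91.0, valid_answers=24), previous=90.0))
```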

Transparency & status

Last measured: 22 June 2026

Models tested: Claude Opus 4.8 · GPT-5.5 · Qwen3.7 Max · Gemini 3.1 Pro · DeepSeek V4-Pro · Grok 4.3 · Mistral Large 3
Judge models: openai/gpt-5.4 · google/gemini-3-pro-preview · anthropic/claude-opus-4.7
Model access: Vercel AI Gateway

Scores are snapshots of an automated evaluation over a private test set; they measure performance on German business tasks, not overall “quality”. Model names and versions follow availability via the gateway used. Individual runs can fluctuate — what counts is the trend, not a single run. With 24 tasks per run, small point differences are not statistically meaningful; models within a few points should be read as tied.

Which model fits your German content?

We bring the model with the best German into your process — for correspondence, documents and contracts, with an eye on cost and data protection.
