Every major AI model on one scale — computed from the hardest public benchmarks across seven disciplines, distilled into an implied AI IQ. Transparent, conservative, with source and status.
Ranked by implied AI IQ across seven dimensions. Click a model to see what its score is made of.
No model leads everywhere. The radar shows where the three best models have their strengths and gaps — across all seven dimensions.
The overall IQ hides that each dimension has a different leader. Here's the best model per capability.
Multi-step quantitative problems at competition and research level.
Graduate-level natural science, “Google-proof”.
Recognising novel patterns unseen in training (fluid intelligence).
Turning a prompt into a working web app (human preference).
Fixing real bugs in real GitHub projects.
Operating browser, terminal and desktop autonomously (agentic).
Following instructions and admitting uncertainty instead of hallucinating.
More IQ costs more — but not linearly. The cheap models (green) often deliver most of the capability at a fraction of the price. For many tasks that's enough.
Transparent and conservative. No model is reduced to a single favourite benchmark — and missing data must never flatter a score.
Each model is scored across seven cognitive fields — from mathematics through abstract reasoning to computer use. Every dimension draws on official, public benchmarks.
Benchmark results are mapped to a 0–100 scale per dimension. Where a value is missing, it is imputed conservatively (as the lowest observed value) and marked with ≈.
The overall score is the unweighted mean of all seven dimensions. So a model's weak spots count too — not just its showcase discipline.
The score is mapped to an IQ-like scale (50–150) via a fixed, disclosed formula — as an intuition aid, not a human IQ test.
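The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual implementation: the dimension labels, the function names `impute_lowest` and `implied_iq`, and the sample scores are all hypothetical, and it assumes that "lowest observed value" means the minimum across models within the same dimension. The IQ mapping uses the page's disclosed formula, 50 plus the unweighted mean of the seven dimension scores.

```python
from statistics import mean

# Hypothetical short labels for the seven dimensions described on this page.
DIMENSIONS = (
    "math", "science", "abstract_reasoning", "web_dev",
    "bug_fixing", "computer_use", "reliability",
)


def impute_lowest(scores_by_model):
    """Replace missing (None) scores with the lowest observed value.

    Assumption: "lowest observed" is the minimum across models within the
    same dimension. Returns {model: {dim: (score, was_imputed)}}.
    """
    floors = {
        dim: min(s[dim] for s in scores_by_model.values() if s.get(dim) is not None)
        for dim in DIMENSIONS
    }
    return {
        model: {
            dim: (floors[dim], True) if scores.get(dim) is None else (scores[dim], False)
            for dim in DIMENSIONS
        }
        for model, scores in scores_by_model.items()
    }


def implied_iq(dim_scores):
    """IQ = 50 + unweighted mean of the seven 0-100 dimension scores."""
    if len(dim_scores) != len(DIMENSIONS):
        raise ValueError("expected one score per dimension")
    return 50 + mean(dim_scores)


# Made-up scores for illustration only; these are not real benchmark results.
example = {
    "model_a": dict.fromkeys(DIMENSIONS, 80.0),
    "model_b": {**dict.fromkeys(DIMENSIONS, 50.0), "abstract_reasoning": None},
    "model_c": dict.fromkeys(DIMENSIONS, 20.0),
}

filled = impute_lowest(example)
for model, dims in filled.items():
    iq = implied_iq([score for score, _ in dims.values()])
    imputed = [dim for dim, (_, was_imputed) in dims.items() if was_imputed]
    print(f"{model}: IQ {iq:.1f}, imputed (≈): {imputed}")
```

In this made-up data, the missing abstract-reasoning score for `model_b` is filled with 20.0, the lowest value observed in that dimension. That pulls its implied IQ to about 95.7, below the 100 its six known scores would give on their own, which is the sense in which missing data does not flatter a score.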
IQ = 50 + Ø(7 Dim.)
The IQ scale is an intuition aid: it does not claim a model would score this on a human IQ test. A score of 50 maps to IQ 100; a model perfect across the board would sit at 150.
No. “AI IQ” is an intuition aid: we translate benchmark performance onto a familiar scale. A language model doesn't “have” an IQ in the human sense, but the scale makes capability gaps comparable at a glance.
Because intelligence isn't one-dimensional. A model can lead on mathematics and trail on abstract reasoning. That's exactly why we average across seven dimensions instead of showing one favourite benchmark.
From official, public leaderboards (Epoch AI, ARC Prize, SWE-bench, LMArena and others). Each dimension names its benchmarks; the sources with status are listed below.
It updates automatically from the leaderboards; the top of the page shows when a value last changed. If no reliable change is found, the last verified value stands: the ranking can lag, but it won't guess.
Last checked: 22 June 2026
Benchmark scores are snapshots and vary by eval setup. Some tests (AIME, GPQA) are near-saturated at the top, and ARC-AGI-2 is officially verified for only a few models — missing values are imputed conservatively (≈). Prices are approximate per 1M output tokens, some estimated where no official figure exists. The AI IQ aggregates all of this into one comparison number — it measures benchmark performance, not “intelligence” in the human sense.
We don't just rate benchmarks — we put the right model into your process, with an eye on cost, speed and data protection.