AI IQ

How smart is each AI?

Every major AI model on one scale — computed from the hardest public benchmarks across seven disciplines, distilled into an implied AI IQ. Transparent, conservative, with source and status.

As of: 22 June 2026. From official leaderboards. Methodology disclosed, every number sourced.

The AI IQ ranking

Ranked by implied AI IQ across seven dimensions. Click a model to see what its score is made of.

Scale 50–150 · click a model for the 7 dimensions

Strength profile of the top 3

No model leads everywhere. The radar shows where the three best models have their strengths and gaps — across all seven dimensions.


Who leads in each discipline?

The overall IQ hides that each dimension has a different leader. Here's the best model per capability.

Mathematical reasoning

Multi-step quantitative problems at competition and research level.

GPT-5.5: 91/100

from: FrontierMath v2 · AIME 2025

Scientific reasoning

Graduate-level natural science, “Google-proof”.

Gemini 3.1 Pro: 94/100

from: GPQA Diamond

Abstract reasoning

Recognising novel patterns unseen in training (fluid intelligence).

GPT-5.5: 53/100

from: ARC-AGI-2

App building

Turning a prompt into a working web app (human preference).

Claude Fable 5: 92/100

from: WebDev Arena

Production engineering

Fixing real bugs in real GitHub projects.

Claude Opus 4.8: 88/100

from: SWE-bench Verified

Computer use

Operating browser, terminal and desktop autonomously (agentic).

Claude Fable 5: 85/100

from: OSWorld · BrowseComp

Reliability

Following instructions and admitting uncertainty instead of hallucinating.

Claude Fable 5: 85/100

from: AA-Omniscience · hallucination rate

Intelligence vs. cost

More IQ costs more — but not linearly. The cheap models (green) often deliver most of the capability at a fraction of the price. For many tasks that's enough.

Chart axes: cost per 1M output tokens (log scale) vs. implied AI IQ

How the AI IQ is built

Transparent and conservative. No model is reduced to a single favourite benchmark — and missing data must never flatter a score.

1. Seven dimensions
Each model is scored across seven cognitive fields, from mathematics through abstract reasoning to computer use. Every dimension draws on official, public benchmarks.

2. Score per dimension (0–100)
Benchmark results are mapped to a 0–100 scale per dimension. Where a value is missing, it is imputed conservatively (the lowest observed value) and marked with ≈.

3. Mean, not cherry-picking
The overall score is the unweighted mean of all seven dimensions. So a model's weak spots count too, not just its showcase discipline.

4. Mapping to the IQ scale
The score is mapped to an IQ-like scale (50–150) via a fixed, disclosed formula, as an intuition aid rather than a human IQ test.
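Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not the ranking's actual pipeline: it assumes "the lowest observed value" means the lowest score any model achieved in that dimension, it uses three dimensions instead of seven for brevity, and the model names, dimension keys and scores are hypothetical placeholders.

```python
from statistics import mean

# Hypothetical 0-100 scores per dimension; None marks a missing benchmark result.
# Model names and values are illustrative placeholders, not data from the ranking.
scores = {
    "model_a": {"math": 90.0, "abstract": 50.0, "computer_use": None},
    "model_b": {"math": 70.0, "abstract": 30.0, "computer_use": 60.0},
    "model_c": {"math": 80.0, "abstract": None, "computer_use": 80.0},
}


def impute_lowest(scores):
    """Fill each missing value with the lowest observed score in that dimension.

    Assumption: "lowest observed" is taken across all models in the dimension.
    Returns (filled, imputed); imputed holds the (model, dimension) pairs
    that would carry the ≈ marker.
    """
    dimensions = {dim for per_model in scores.values() for dim in per_model}
    floor = {
        dim: min(
            per_model[dim]
            for per_model in scores.values()
            if per_model.get(dim) is not None
        )
        for dim in dimensions
    }
    filled, imputed = {}, set()
    for model, per_model in scores.items():
        filled[model] = {}
        for dim in dimensions:
            value = per_model.get(dim)
            if value is None:
                value = floor[dim]
                imputed.add((model, dim))
            filled[model][dim] = value
    return filled, imputed


def overall_score(per_model):
    """Unweighted mean across all dimensions (step 3)."""
    return mean(per_model.values())


filled, imputed = impute_lowest(scores)
overall = {model: overall_score(dims) for model, dims in filled.items()}
```

Because a gap is filled with the dimension's floor rather than its average, a model with missing results can only be pulled down by imputation, never up, which is the sense in which missing data does not flatter a score.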

The formula: IQ = 50 + Ø(7 dimensions), where Ø is the unweighted mean of the seven dimension scores.

The IQ scale is an intuition aid; it does not claim a model would score this on a human IQ test. A score of 50 maps to IQ 100; a model perfect across the board would sit at 150.
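Written out in full, taking Ø as the unweighted mean of the seven 0–100 dimension scores, the mapping looks as follows. The symbols s_d are introduced here for illustration only, and the seven scores in the worked line are hypothetical, not taken from the ranking.

```latex
\mathrm{IQ} = 50 + \frac{1}{7}\sum_{d=1}^{7} s_d, \qquad 0 \le s_d \le 100

% Worked example with hypothetical scores 90, 80, 50, 90, 85, 80, 85:
\mathrm{IQ} = 50 + \frac{90 + 80 + 50 + 90 + 85 + 80 + 85}{7}
            = 50 + \frac{560}{7}
            = 130
```

The bounds follow directly: all scores at 0 give IQ 50, a mean of 50 gives IQ 100, and all scores at 100 give IQ 150.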

Frequently asked questions

Is this a real IQ test?

No. “AI IQ” is an intuition aid: we translate benchmark performance onto a familiar scale. A language model doesn't “have” an IQ in the human sense — but the scale makes capability gaps comparable at a glance.

Why isn't the best model first in every discipline?

Because intelligence isn't one-dimensional. A model can lead on mathematics and trail on abstract reasoning. That's exactly why we average across seven dimensions instead of showing one favourite benchmark.

Where do the numbers come from?

From official, public leaderboards (Epoch AI, ARC Prize, SWE-bench, LMArena and others). Each dimension names its benchmarks; the sources with status are listed below.

How current is the ranking?

It updates automatically from the leaderboards; the date at the top shows when a value last changed. If no reliable change is found, the last verified value stands: the ranking can lag, but it won't guess.
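The "last verified value stands" rule can be sketched as follows. This is a hedged illustration only: the `Entry` type and `refresh` function are hypothetical names, and how a fetched value is judged reliable is not described on this page, so it is modelled here simply as `None` when no reliable value is available.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """A stored benchmark value and the date it last changed (hypothetical type)."""
    value: float
    last_changed: date


def refresh(current: Entry, fetched: Optional[float], today: date) -> Entry:
    """Keep the last verified value unless a reliable new value arrived.

    `fetched` is None when the leaderboard could not be read or the result
    could not be verified; deciding that is out of scope for this sketch.
    """
    if fetched is None or fetched == current.value:
        return current  # nothing reliable changed: the old value stands
    return Entry(value=fetched, last_changed=today)
```

With this shape, a failed or unverifiable fetch leaves both the value and its date untouched, so the ranking may lag behind a leaderboard but never shows a guessed number.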

Sources & status

Last checked: 22 June 2026

Benchmark scores are snapshots and vary by eval setup. Some tests (AIME, GPQA) are near-saturated at the top, and ARC-AGI-2 is officially verified for only a few models — missing values are imputed conservatively (≈). Prices are approximate per 1M output tokens, some estimated where no official figure exists. The AI IQ aggregates all of this into one comparison number — it measures benchmark performance, not “intelligence” in the human sense.

Which model fits your use case?

We don't just rate benchmarks — we put the right model into your process, with an eye on cost, speed and data protection.
