State of AI

How smart is AI actually, right now?

The easy tests show nothing anymore — every top model is above 90%. It only gets interesting where models still fail. Here are the hardest open benchmarks — and how fast the gap to humans is closing.

As of 21 June 2026Curated from official leaderboards. Every number with a source and date.

The frontier — where models (still) fail

The emptier the bar, the further AI still is from humans. ARC-AGI-3 is the current low point: humans solve 100%, the best model 0.37%.

ARC-AGI-3

open

Interactive, novel reasoning in unseen mini-environments.

100 % · Human
0.37 %
Best model: Gemini 3.1 ProHumans solve 100%ARC Prize

A purpose-built RL/search system (not an LLM) reached 12.58%.

FrontierMath

climbing

Research-level, unpublished mathematics.

47.6 %
Best model: GPT-5.4Research mathematicians (hours per problem)Epoch AI

Epoch released FrontierMath v2 on 12 Jun 2026 (cleaned items).

Humanity's Last Exam

climbing

Thousands of expert questions at the edge of human knowledge.

64.5 %
Best model: Claude Mythos 5Domain experts, per fieldHLE / Safe AI

Scores vary by eval setup (~53–64%); roughly +30 points over the past year.

SWE-bench Verified

climbing

Fixing real software bugs in real GitHub projects.

80.9 %
Best model: Claude Opus 4.5Share of real issues resolvedSWE-bench

GPQA Diamond

near-saturated

PhD-level science questions, “Google-proof”.

70 % · Human
94.1 %
Best model: Gemini 3.1 ProPhD-level experts ≈ 70%Epoch AI

Models now sit above human-expert level — the test is losing its discriminating power.

The half-life of benchmarks is shrinking

Tests meant to challenge for years are now exhausted in months. MMLU (2020) lasted about four years, GPQA (2023) only two. That's the real story — not a single score, but the pace.

MMLU
4y
GPQA
2y
Humanity's Last Exam
still open
ARC-AGI-3
still open
20202021202220232024202520262027
Stanford AI Index 2026

Sources & status

Last checked: 21 June 2026

Benchmark scores are snapshots and vary by eval setup. They measure individual capabilities, not “intelligence” as a whole.

FAQ

Does “close to humans” mean AI can soon do everything?

No. These tests measure narrow capabilities. A model can answer GPQA questions above human level and almost completely fail ARC-AGI-3. “State of AI” is a progress picture, not an AGI forecast.

Why do some numbers disagree?

Different leaderboards test under slightly different conditions (prompting, tool access, test version). We name the source and date for each value — and where there's a range, we say so.

How current is this page?

It's maintained from official leaderboards; the date is shown at the top. Soon it will update automatically — until then we check it regularly by hand.

Is AI moving faster than your last plan?

We sort out what's actually relevant for your business — and implement the use cases that are worth it today.