State of AI – live progress at the capability frontier

The frontier — where models (still) fail

The emptier the bar, the further AI still is from humans. ARC-AGI-3 is the current low point: humans solve 100%, the best model 0.37%.

ARC-AGI-3

open

Interactive, novel reasoning in unseen mini-environments.

100 % · Human

0.37 %

Best model: Gemini 3.1 Pro · Humans solve 100% · ARC Prize

A purpose-built RL/search system (not an LLM) reached 12.58%.

FrontierMath

climbing

Research-level, unpublished mathematics.

47.6 %

Best model: GPT-5.4 · Research mathematicians (hours per problem) · Epoch AI

Epoch released FrontierMath v2 on 12 Jun 2026 (cleaned items).

Humanity's Last Exam

climbing

Thousands of expert questions at the edge of human knowledge.

64.5 %

Best model: Claude Mythos 5 · Domain experts, per field · HLE / Safe AI

Scores vary by eval setup (~53–64%); the top score has risen by roughly 30 points over the past year.

SWE-bench Verified

climbing

Fixing real software bugs in real GitHub projects.

80.9 %

Best model: Claude Opus 4.5 · Share of real issues resolved · SWE-bench

GPQA Diamond

near-saturated

PhD-level science questions, “Google-proof”.

70 % · Human

94.1 %

Best model: Gemini 3.1 Pro · PhD-level experts ≈ 70% · Epoch AI

Models now sit above human-expert level — the test is losing its discriminating power.
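The bars in this section can be read as the best model score divided by the human baseline. Below is a minimal sketch of that arithmetic, using only the two benchmarks for which this page gives a human figure. The function names `bar_fill` and `gap_points`, and the cap at a full bar, are illustrative assumptions rather than the page's actual rendering logic.

```python
# Human baseline and best model score (in percent), as listed on this page.
SCORES = {
    "ARC-AGI-3": {"human": 100.0, "model": 0.37},
    "GPQA Diamond": {"human": 70.0, "model": 94.1},
}


def bar_fill(model: float, human: float) -> float:
    """Fraction of the human baseline reached, capped at 1.0 (assumed cap)."""
    return min(model / human, 1.0)


def gap_points(model: float, human: float) -> float:
    """Percentage points separating human and model (negative: model ahead)."""
    return human - model


for name, s in SCORES.items():
    fill = bar_fill(s["model"], s["human"])
    gap = gap_points(s["model"], s["human"])
    print(f"{name}: bar {fill:.1%} full, gap {gap:+.2f} points")
```

Under these assumptions the ARC-AGI-3 bar is almost empty while the GPQA Diamond bar is full, which is the contrast the section describes.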

The half-life of benchmarks is shrinking

Tests meant to challenge for years are now exhausted in months. MMLU (2020) lasted about four years, GPQA (2023) only two. That's the real story — not a single score, but the pace.

MMLU: 4 years

GPQA: 2 years

Humanity's Last Exam: still open

ARC-AGI-3: still open

(Chart: benchmark lifespans on a timeline from 2020 to 2027.)

Stanford AI Index 2026
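The lifespans quoted above are simple year arithmetic: release year plus the number of years the benchmark stayed challenging. Here is a small sketch using only the release years and lifespans stated in the text. The derived "exhausted around" years are an inference from those two numbers, not figures taken from the Stanford AI Index, and the helper name `exhausted_by` is hypothetical.

```python
# (release year, approximate years until exhausted), as stated in the text.
# Humanity's Last Exam and ARC-AGI-3 are omitted: they are still open and
# this page gives no release year for them.
BENCHMARKS = {
    "MMLU": (2020, 4),
    "GPQA": (2023, 2),
}


def exhausted_by(release_year: int, lifespan_years: int) -> int:
    """Approximate year in which the benchmark stopped being challenging."""
    return release_year + lifespan_years


for name, (released, lifespan) in BENCHMARKS.items():
    year = exhausted_by(released, lifespan)
    print(f"{name}: released {released}, exhausted around {year}")

# How GPQA's lifespan compares with MMLU's.
shrink = BENCHMARKS["GPQA"][1] / BENCHMARKS["MMLU"][1]
print(f"GPQA lasted {shrink:.0%} as long as MMLU")
```

With only two completed data points this shows the direction of the trend, not a reliable rate.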

Sources & status

Last checked: 21 June 2026

Benchmark scores are snapshots and vary by eval setup. They measure individual capabilities, not “intelligence” as a whole.

FAQ

Does “close to humans” mean AI can soon do everything?

No. These tests measure narrow capabilities. A model can answer GPQA questions above human level and almost completely fail ARC-AGI-3. “State of AI” is a progress picture, not an AGI forecast.

Why do some numbers disagree?

Different leaderboards test under slightly different conditions (prompting, tool access, test version). We name the source and date for each value — and where there's a range, we say so.

How current is this page?

It's maintained from official leaderboards; the last-checked date is shown under "Sources & status". Automatic updates are planned; until then we check it regularly by hand.
