The easy tests no longer tell us anything: every top model scores above 90%. It only gets interesting where models still fail. Here are the hardest open benchmarks, and how fast the gap to humans is closing.
The emptier the bar, the further AI still is from humans. ARC-AGI-3 is the current low point: humans solve 100%, the best model 0.37%.
Interactive, novel reasoning in unseen mini-environments.
A purpose-built RL/search system (not an LLM) reached 12.58%.
Research-level, unpublished mathematics.
Epoch released FrontierMath v2, with cleaned-up items, on 12 Jun 2026.
Thousands of expert questions at the edge of human knowledge.
Scores vary by evaluation setup (about 53–64%) and are up roughly 30 percentage points over the past year.
Fixing real software bugs in real GitHub projects.
PhD-level science questions, “Google-proof”.
Models now score above human-expert level, so the test is losing its discriminating power.
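The "emptier bar" reading of ARC-AGI-3 is simple arithmetic: the gap is the human score minus the system's score. Here is a minimal Python sketch that uses only the ARC-AGI-3 figures quoted in this article. The function name `gap_to_human` is hypothetical, and the sketch assumes both scores are percentages on the same scale.

```python
def gap_to_human(human_pct: float, model_pct: float) -> float:
    """Percentage points still separating a system from the human baseline."""
    return round(human_pct - model_pct, 2)


# ARC-AGI-3 figures as quoted in this article.
HUMAN_PCT = 100.0
BEST_MODEL_PCT = 0.37
RL_SEARCH_PCT = 12.58  # purpose-built RL/search system, not an LLM

print(gap_to_human(HUMAN_PCT, BEST_MODEL_PCT))  # 99.63
print(gap_to_human(HUMAN_PCT, RL_SEARCH_PCT))   # 87.42
```

Even the specialised system leaves a gap of more than 87 points, which is why this benchmark is the current low point.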
Tests meant to challenge models for years are now exhausted within months. MMLU (2020) lasted about four years, GPQA (2023) only two. That is the real story: not a single score, but the pace.
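To make the pace concrete, the two lifetimes mentioned here can be compared directly. The sketch below uses only the release years and approximate lifetimes stated in this article. The dictionary layout and the helper `saturation_year` are hypothetical, and the resulting years are rough estimates, not official dates.

```python
# Release year and approximate useful lifetime in years, as stated in this article.
BENCHMARKS = {
    "MMLU": {"released": 2020, "lifetime_years": 4},
    "GPQA": {"released": 2023, "lifetime_years": 2},
}


def saturation_year(name: str) -> int:
    """Rough year in which a benchmark stopped separating top models."""
    entry = BENCHMARKS[name]
    return entry["released"] + entry["lifetime_years"]


for name in BENCHMARKS:
    print(f"{name}: saturated around {saturation_year(name)}")

# How much shorter GPQA's useful life was compared with MMLU's.
ratio = BENCHMARKS["GPQA"]["lifetime_years"] / BENCHMARKS["MMLU"]["lifetime_years"]
print(f"GPQA lasted {ratio:.0%} as long as MMLU")
```

On these numbers, the newer test held up for half as long as the older one.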
Last checked: 21 June 2026
Benchmark scores are snapshots and vary by eval setup. They measure individual capabilities, not “intelligence” as a whole.
No. These tests measure narrow capabilities. A model can answer GPQA questions above human level and almost completely fail ARC-AGI-3. “State of AI” is a progress picture, not an AGI forecast.
Different leaderboards test under slightly different conditions (prompting, tool access, test version). We name the source and date for each value — and where there's a range, we say so.
It's maintained from official leaderboards; the date is shown at the top. Soon it will update automatically — until then we check it regularly by hand.
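One way to keep the source, date, and range together for each value is a small record type. This is only a sketch of the idea, not the way this page is actually built. The class `ScoreEntry`, its field names, and the placeholder source string are all assumptions for illustration. The 53–64% range and the check date are the ones quoted in this article.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ScoreEntry:
    """One benchmark value with its provenance (hypothetical structure)."""
    low_pct: float
    high_pct: float
    source: str      # name of the leaderboard the value was taken from
    checked: date    # when the value was last verified by hand

    def label(self) -> str:
        # Show a single number when there is no spread, otherwise the range.
        if self.low_pct == self.high_pct:
            value = f"{self.low_pct:g}%"
        else:
            value = f"{self.low_pct:g}-{self.high_pct:g}%"
        return f"{value} ({self.source}, checked {self.checked.isoformat()})"


# "placeholder source" stands in for a real leaderboard name.
entry = ScoreEntry(53.0, 64.0, "placeholder source", date(2026, 6, 21))
print(entry.label())  # 53-64% (placeholder source, checked 2026-06-21)
```

Storing the range as two numbers, rather than as a single averaged score, keeps the spread between eval setups visible.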
We sort out what's actually relevant for your business — and implement the use cases that are worth it today.