AA‑Briefcase benchmark finds AIs largely fail at real knowledge work

In detail

Artificial Analysis builds tasks from thousands of fragmented sources (Slack, emails, meeting transcripts, large data exports).
Claude Fable 5 posts the highest rubric pass rate but meets all criteria on only 3% of tasks.
On 31 of 91 tasks no model exceeds a 50% pass rate.
Per‑task costs vary dramatically, from about $0.04 (DeepSeek V4 Flash) to over $31 (Claude Fable 5), a spread of more than 800×.

Why it matters

Companies that assume LLMs can reliably perform complex, multi‑document knowledge work risk overestimating both accuracy and value, which skews budgeting and deployment decisions. Failures range from obvious execution errors to subtle missed details.

For you

Test candidate models with pilots that mirror your real document mix, and measure strict task‑level success (for example, whether every rubric criterion is met) and per‑task cost, not just API throughput.
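
A pilot scorecard along these lines could be sketched as follows. This is a minimal illustration, not the benchmark's own scoring code: the `TaskResult` fields, the `summarize` helper, and the sample numbers are all hypothetical, and the pooled rubric pass rate used here is an assumption about how criteria might be aggregated.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One pilot task graded against a rubric (hypothetical structure)."""
    task_id: str
    criteria_passed: int
    criteria_total: int
    cost_usd: float


def summarize(results):
    """Return rubric pass rate, full-success rate, and mean cost per task."""
    if not results:
        raise ValueError("need at least one task result")
    n = len(results)
    total_criteria = sum(r.criteria_total for r in results)
    # Assumption: pass rate is pooled over all criteria across all tasks.
    rubric_pass_rate = sum(r.criteria_passed for r in results) / total_criteria
    # Strict metric: a task counts only if every criterion was met.
    full_success_rate = sum(
        r.criteria_passed == r.criteria_total for r in results
    ) / n
    mean_cost_usd = sum(r.cost_usd for r in results) / n
    return {
        "rubric_pass_rate": rubric_pass_rate,
        "full_success_rate": full_success_rate,
        "mean_cost_usd": mean_cost_usd,
    }


# Made-up pilot data for one candidate model.
pilot = [
    TaskResult("t1", 8, 10, 1.00),
    TaskResult("t2", 10, 10, 3.00),
    TaskResult("t3", 2, 10, 2.00),
    TaskResult("t4", 6, 10, 2.00),
]
summary = summarize(pilot)
print(summary)
```

Reporting both rates side by side shows the gap the benchmark highlights: a model can pass most individual criteria while fully completing few tasks.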

Sources

The Decoder