
MirrorCode benchmark: Claude Opus 4.7 reimplements a 16,000-line program in 14 hours

Epoch AI and METR have released a new benchmark in which AI models must rewrite complete programs from scratch. Claude Opus 4.7 leads with a 56% success rate.

In detail

  • MirrorCode tests 25 target programs (Unix utilities, data serialization, bioinformatics, interpreters, static analysis, cryptography, compression).
  • Claude Opus 4.7 reimplemented gotree (16,000 lines of Go code, 40+ commands) in 14 hours for $251; a human would need 2–17 weeks.
  • Overall rankings: Claude Opus 4.7 (56%), GPT-5.5 (44%), Gemini 3.1 Pro Preview (32%); the largest tasks defeat all models.
  • One large task cost $2,600 and ran continuously for 19 days, an indication that AI can already sustain demanding, long-running programming tasks.
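
To put the cost and time figures above side by side, here is a minimal Python sketch of the arithmetic. The dollar amounts, hours, days, and week range come from the summary; the 40-hour working week used to convert the human estimate into hours is an assumption, not a figure from the benchmark, and the comparison sets model wall-clock time against human working time, so the ratios are only rough.

```python
# Figures quoted in the summary above.
GOTREE_COST_USD = 251
GOTREE_HOURS = 14
LARGE_TASK_COST_USD = 2600
LARGE_TASK_DAYS = 19
HUMAN_WEEKS_LOW, HUMAN_WEEKS_HIGH = 2, 17

# ASSUMPTION (not from the source): a human working week is 40 hours.
HOURS_PER_WORK_WEEK = 40


def cost_per_hour(total_usd: float, hours: float) -> float:
    """Average spend per hour of model run time."""
    return total_usd / hours


# The 19-day task ran continuously, so count 24 hours per day.
gotree_rate = cost_per_hour(GOTREE_COST_USD, GOTREE_HOURS)
large_task_rate = cost_per_hour(LARGE_TASK_COST_USD, LARGE_TASK_DAYS * 24)

# Human working hours implied by the 2-17 week estimate, under the assumption above.
human_hours_low = HUMAN_WEEKS_LOW * HOURS_PER_WORK_WEEK
human_hours_high = HUMAN_WEEKS_HIGH * HOURS_PER_WORK_WEEK

# Ratio of human working hours to the model's 14 hours of run time.
ratio_low = human_hours_low / GOTREE_HOURS
ratio_high = human_hours_high / GOTREE_HOURS

if __name__ == "__main__":
    print(f"gotree run: ${gotree_rate:.2f} per hour")
    print(f"19-day task: ${large_task_rate:.2f} per hour")
    print(f"human estimate: {human_hours_low}-{human_hours_high} working hours")
    print(f"human hours / model hours: {ratio_low:.1f}x to {ratio_high:.1f}x")
```

Under these assumptions, the gotree run averages roughly $18 per hour, while the 19-day task averages under $6 per hour despite its much higher total cost.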

Why it matters

The benchmark shows that AI models can now tackle complex software development tasks that previously required weeks of human work. This has immediate implications for software development productivity and cost structure.

For you

Evaluate whether your organization can offload routine programming tasks (refactoring, utility development) to Claude Opus 4.7; the cost savings could be substantial.


Summaries are generated automatically and link to the original source.