Benchmarks and methods expose coding‑AI limits and cost levers

In detail

SWE‑Explore separates file search from the actual fix: it comprises 848 tasks from 203 open‑source projects and uses successful runs from multiple models to label which lines are relevant.
Google’s Gemini‑SQL2 (Gemini 3.1 Pro) reaches 80.04% execution accuracy on the BIRD text‑to‑SQL benchmark; GPT‑5.5‑xhigh scores about 72.8%.
Microsoft’s SkillOpt treats instruction ‘skills’ as trainable Markdown files and reports >20‑point gains for GPT‑5.5 on procedural tasks.
Moonshot’s open‑weights Kimi K2.7 Code targets programming workflows at a lower price per token; it comes out ahead on some agentic benchmarks but trails GPT‑5.5 on many coding tasks.
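
For readers unfamiliar with the metric: execution accuracy generally means running the generated SQL and the reference SQL against the same database and counting how often the results match. The sketch below illustrates that idea with Python's built-in sqlite3 module. It is not the BIRD scoring script. The `orders` table, the query pairs, and the helper names `same_result` and `execution_accuracy` are all invented for illustration, and comparing rows as an unordered multiset is an assumption that official scripts may handle differently.

```python
import sqlite3
from collections import Counter


def same_result(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """Return True if both queries yield the same rows, ignoring row order."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A query that fails to run counts as a miss.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    # Assumption: compare as multisets so ordering differences are not penalized.
    return Counter(predicted_rows) == Counter(gold_rows)


def execution_accuracy(conn: sqlite3.Connection, pairs: list[tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) query pairs whose results match."""
    if not pairs:
        return 0.0
    return sum(same_result(conn, p, g) for p, g in pairs) / len(pairs)


# Hypothetical toy database, invented for this example.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
    INSERT INTO orders (customer, total)
    VALUES ('ada', 30.0), ('bob', 12.5), ('ada', 7.5);
    """
)

# (predicted SQL, gold SQL) pairs: two matches, one wrong answer, one invalid query.
pairs = [
    ("SELECT COUNT(*) FROM orders WHERE customer = 'ada'",
     "SELECT COUNT(id) FROM orders WHERE customer IN ('ada')"),
    ("SELECT customer FROM orders WHERE total > 10",
     "SELECT customer FROM orders WHERE total >= 30"),
    ("SELECT totl FROM orders",
     "SELECT total FROM orders"),
    ("SELECT MAX(total) FROM orders",
     "SELECT total FROM orders ORDER BY total DESC LIMIT 1"),
]

accuracy = execution_accuracy(conn, pairs)
print(f"execution accuracy: {accuracy:.2%}")
```

Because the metric checks results rather than query text, differently worded queries can both count as correct, which is why the first and last pairs match here.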

Why it matters

SMEs building developer automation should know agents may find the right file but miss crucial lines; at the same time, modular instruction tuning and cheaper specialized models offer practical cost/performance tradeoffs.

For you

Benchmark any coding agent on your repositories (line‑level relevance) and try lightweight skill‑training or a specialized cheaper model before buying frontier tokens.
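
One way to start a line‑level check on your own repositories is to compare the lines an agent inspected or edited with the lines you consider relevant, and to report file‑level and line‑level scores side by side. The following is a minimal sketch under stated assumptions, not SWE‑Explore's scoring method: the `LineRef` type, the `score` function, and the sample file path are hypothetical, and you would need to supply the relevant‑line labels yourself (for example, from the diff of a merged fix).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineRef:
    """A single line in a repository file (hypothetical helper type)."""
    path: str
    line: int


def score(predicted: set[LineRef], relevant: set[LineRef]) -> dict[str, float]:
    """Compare what the agent touched with what was actually relevant."""
    predicted_files = {ref.path for ref in predicted}
    relevant_files = {ref.path for ref in relevant}
    hit_lines = predicted & relevant
    return {
        # Did the agent open the right files at all?
        "file_recall": (
            len(predicted_files & relevant_files) / len(relevant_files)
            if relevant_files else 0.0
        ),
        # Of the relevant lines, how many did the agent reach?
        "line_recall": len(hit_lines) / len(relevant) if relevant else 0.0,
        # Of the lines the agent reached, how many were relevant?
        "line_precision": len(hit_lines) / len(predicted) if predicted else 0.0,
    }


# Invented example: the agent finds the right file but mostly the wrong lines.
relevant = {LineRef("app/auth.py", n) for n in (40, 41, 42, 43)}
predicted = {LineRef("app/auth.py", n) for n in (10, 11, 42)}

scores = score(predicted, relevant)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

In this made‑up case file recall is perfect while line recall is only one in four, which is the kind of gap a file‑level benchmark alone would hide.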

Sources

The Decoder