In detail
- iLLaDA was pretrained on 12 trillion tokens (versus 2.3 trillion for its predecessor, LLaDA) and achieves an average benchmark score of 63.9 points, just above Qwen2.5 7B at 63.3.
- Diffusion models refine masked tokens in parallel across multiple passes rather than generating word-by-word sequentially; every position can attend to every other position simultaneously.
- Google DeepMind released DiffusionGemma at around the same time. It generates roughly four times faster but scores worse on benchmarks such as MMLU, so it is optimized for low-latency use cases rather than quality-critical production.
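To make the parallel refinement idea concrete, here is a minimal toy sketch in Python. It is not iLLaDA's or DiffusionGemma's actual decoding procedure: `toy_predict` is a hypothetical stand-in for a real model (it simply reveals the target token with a random confidence), and the even unmasking schedule is an assumption chosen for simplicity. It only illustrates the loop structure: every pass scores all masked positions at once, then commits the most confident ones.

```python
import random

MASK = None  # placeholder for a position that is still masked


def toy_predict(tokens, target, rng):
    """Hypothetical stand-in for a diffusion language model.

    Returns {position: (predicted_token, confidence)} for every masked
    position in one call. A real model would compute these from the whole
    sequence; this toy version just looks up the target token and attaches
    a random confidence.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(tokens)
        if tok is MASK
    }


def diffusion_decode(target, num_passes=4, seed=0):
    """Fill a fully masked sequence over a fixed number of passes."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    history = [list(tokens)]  # snapshot after each pass, for inspection

    for step in range(num_passes):
        masked = [i for i, tok in enumerate(tokens) if tok is MASK]
        if not masked:
            break
        # One pass: predictions for all masked positions at once.
        preds = toy_predict(tokens, target, rng)
        # Assumed schedule: unmask an even share of what is left, so the
        # final pass always fills every remaining position.
        k = -(-len(masked) // (num_passes - step))  # ceiling division
        most_confident = sorted(
            preds, key=lambda i: preds[i][1], reverse=True
        )[:k]
        for i in most_confident:
            tokens[i] = preds[i][0]
        history.append(list(tokens))

    return tokens, history


if __name__ == "__main__":
    sentence = "diffusion models fill in masked tokens in parallel".split()
    final, history = diffusion_decode(sentence, num_passes=4)
    for snapshot in history:
        print(" ".join(tok if tok is not MASK else "_" for tok in snapshot))
```

An autoregressive decoder would need one model call per token, here eight; the sketch uses four passes for the same eight tokens. Fewer passes mean lower latency but give the model fewer chances to refine its guesses, which is the speed-versus-quality trade-off described above.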
Why it matters
Diffusion models could offer a genuine alternative to autoregressive architectures when trained from scratch. This is relevant for German businesses weighing speed against quality in their AI deployments.
For you
Monitor whether diffusion models deliver practical advantages in your use cases; they may reduce latency without sacrificing quality.