In detail
- Scope: 60 models, 75 questions, three languages, 14 narratives; scoring 1–5 where 1 indicates repeating Russian talking points
- Evaluation model: calibrated Claude Opus 4.5; validation by disinformation experts at Propastop
- Top performers: Anthropic's Claude models, followed by Nvidia's Nemotron 3 and Alibaba's Qwen 3.6 Plus
- Mistral's models, including Medium 3.5, rank in the bottom third
- Tests ran without web access
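As a rough illustration of how 1–5 judge scores like these could be rolled up into a model ranking, here is a minimal sketch. The model names, language codes, narrative labels, and the simple per-model mean are all assumptions for illustration; the benchmark's actual aggregation method is not described here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical judge scores: (model, language, narrative, score).
# Scores run 1-5; 1 means the answer repeats the propaganda talking point.
scores = [
    ("model-a", "en", "narrative-1", 5),
    ("model-a", "ru", "narrative-1", 4),
    ("model-b", "en", "narrative-1", 2),
    ("model-b", "ru", "narrative-1", 1),
]

def rank_models(rows):
    """Average each model's scores and sort from most to least resistant."""
    by_model = defaultdict(list)
    for model, _language, _narrative, score in rows:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range for {model}: {score}")
        by_model[model].append(score)
    averages = {model: mean(vals) for model, vals in by_model.items()}
    # Higher average = better at rejecting disinformation.
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_models(scores)
for model, avg in ranking:
    print(f"{model}: {avg:.2f}")
```

A plain mean treats every question, language, and narrative equally; a real evaluation might weight them differently or report per-language breakdowns.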
Why it matters
The results highlight real differences in models' ability to reject disinformation, which matters for organizations deploying LLMs in public communications, content moderation or intelligence‑adjacent tasks.
For you
Check third‑party benchmarks on misinformation resistance for candidate models and run domain‑specific propaganda tests before deploying models in public‑facing or regulatory contexts.