In detail
- Method: feed anonymized real user conversations (with history) to the new model and have it generate the next response to detect and count misbehaviors
- Purpose: produce verifiable frequency estimates of problems that can be compared to real production data post‑release
- Evaluation: applied to four GPT‑5 models using ~1.3M conversations from Aug 2025–Mar 2026; strict precommitment used for GPT‑5.4
Why it matters
Realistic, traffic‑based simulation gives much more actionable risk estimates than synthetic tests, helping businesses plan mitigations and compliance controls for model deployment.
For you If you deploy conversational AI, run deployment simulations with representative traffic to estimate likely failure rates and define monitoring thresholds before launch.