Gemini · Claude · GPT · Decrypt
AI Models Can’t Agree on Basic Facts Most of the Time, Study Indicates
Compiled by KHAO Editorial — aggregated from 1 source. See llms.txt for citation guidance.
Ask five of the world's most advanced AI systems whether a statement is true, and two-thirds of the time, at least one will give you a different answer.
Key facts
- The study gave GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro with Search, and Sonar Pro the same 1,000 real-world fact-check claims submitted by actual users
- On one example claim, the models split: GPT-5.4 said it was false, Claude Opus 4.7 called it mostly true, Gemini 3 Pro said false, and Gemini 3 Pro + Search rated it true
- The statistical measure of agreement, called Krippendorff’s alpha, came in at 0.639 on a scale where 1.0 means perfect agreement and 0 means agreement no better than chance
- On 672 out of 1,000 claims, at least one model broke from the majority
Summary
Five frontier AI models disagreed on 67% of 1,000 real-world fact-check claims submitted by actual users: on 672 of them, at least one of GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro with Search, and Sonar Pro broke from the majority. Their Krippendorff's alpha of 0.639 falls below the 0.8 reliability threshold.
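
For readers who want to see how the two headline numbers are typically derived, here is a minimal Python sketch. It is an illustration under stated assumptions, not the study's actual code: it treats verdicts as unordered (nominal) labels, whereas the study may have used an ordinal variant of the statistic, and it reads "broke from the majority" as "the five verdicts were not unanimous". The function names and the toy verdicts are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of claims; each claim is the list of verdicts
    given by the raters (here, the models) for that claim.
    """
    # Coincidence matrix: every ordered pair of verdicts within a claim
    # contributes 1 / (m - 1), where m is the number of verdicts on it.
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a claim with one verdict cannot be paired
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of paired verdicts.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        return float("nan")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # only one label was ever used: alpha is undefined
    return 1.0 - observed / expected


def share_non_unanimous(units):
    """Fraction of claims on which the verdicts are not all identical."""
    split = sum(1 for ratings in units if len(set(ratings)) > 1)
    return split / len(units)


# Hypothetical toy data: 4 claims, 5 verdicts each (NOT taken from the study).
toy_verdicts = [
    ["false", "false", "false", "false", "false"],
    ["false", "mostly true", "false", "true", "false"],
    ["true", "true", "true", "true", "true"],
    ["true", "true", "false", "true", "true"],
]

print("alpha:", round(krippendorff_alpha_nominal(toy_verdicts), 3))
print("share of split claims:", share_non_unanimous(toy_verdicts))
```

Applied to the study's figures, the second function corresponds to 672 / 1,000 = 0.672, the "two-thirds" in the headline. The first shows why a fairly high alpha can coexist with frequent splits: a single dissenting model makes a claim non-unanimous, yet most pairs of verdicts on that claim still agree.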