AI Models Can’t Agree on Basic Facts Most of the Time, Study Indicates

Fri, May 29 · 5:26 PM UTC 2 min read

Compiled by KHAO Editorial — aggregated from 1 source. See llms.txt for citation guidance.


Ask five of the world's most advanced AI systems whether a statement is true, and two-thirds of the time, at least one will give you a different answer.

Key facts

The study gave GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro with Search, and Sonar Pro the same 1,000 real-world fact-check claims submitted by actual users
On one example claim, GPT-5.4 said it was false, Claude Opus 4.7 called it mostly true, Gemini 3 Pro said false, and Gemini 3 Pro + Search rated it true
The statistical measure of agreement, called Krippendorff’s alpha, came in at 0.639 on a scale where 1.0 means perfect agreement and 0 means agreement no better than chance
On 672 out of 1,000 claims, at least one model broke from the majority
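
To make the two headline numbers concrete, here is a minimal Python sketch of how a dissent rate and a Krippendorff's alpha could be computed from a table of verdicts. It is not the study's code. It assumes nominal (unordered) verdict labels and treats "broke from the majority" as "the verdicts were not unanimous"; the study's actual label scale and alpha variant are not stated here. The function names and the toy verdicts are hypothetical.

```python
from collections import Counter
from itertools import permutations


def share_with_dissent(units):
    """Fraction of claims whose verdicts are not unanimous (assumed proxy
    for 'at least one model broke from the majority')."""
    return sum(len(set(labels)) > 1 for labels in units) / len(units)


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: one list of labels per claim, one label per model.
    """
    # Coincidence matrix: each ordered pair of ratings within a claim
    # contributes 1 / (m - 1), where m is the number of ratings on that claim.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a claim with a single rating cannot be paired
        for i, j in permutations(range(m), 2):
            coincidences[(labels[i], labels[j])] += 1 / (m - 1)

    # Marginal totals per label and overall number of pairable ratings.
    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    # Observed vs. chance-expected disagreement (off-diagonal mass).
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label in use; treated as full agreement here
    return 1 - observed / expected


# Toy data (made up for illustration): 3 claims rated by 5 models.
toy = [
    ["false", "mostly true", "false", "true", "false"],
    ["true", "true", "true", "true", "true"],
    ["false", "false", "false", "false", "true"],
]
print(f"dissent share: {share_with_dissent(toy):.3f}")
print(f"alpha (nominal): {krippendorff_alpha_nominal(toy):.3f}")

# The reported dissent rate follows directly from the counts in the article.
print(f"reported: {672 / 1000:.1%}")
```

Fact-check verdicts such as false, mostly true, and true are arguably ordered, so an ordinal distance metric would give a different alpha than this nominal version.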

Summary

Five frontier AI models (GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro with Search, and Sonar Pro) were given the same 1,000 real-world fact-check claims submitted by actual users. On 672 of them, about 67%, at least one model broke from the majority. Their agreement score, a Krippendorff's alpha of 0.639, falls below the 0.8 reliability threshold.

Read full article at Decrypt →
