Frontier LLMs disagree on 67% of real-world fact-checks, study finds

A study presented 1,000 recent user-submitted claims from the fact-checking platform Lenz to five frontier models, asking each for a verdict on a four-point scale: True, Mostly True, Misleading, or False. The panel split on 67% of claims, meaning that at least one model dissented from the majority or no majority formed at all. On 34% of claims, the two most divergent verdicts sat two or more buckets apart, a substantive disagreement about the answer rather than a calibration shift. Krippendorff’s ordinal alpha came in at 0.639, indicating structured but limited agreement.
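As a rough illustration of how these three headline statistics can be computed, here is a minimal Python sketch. It is not the study's code: the function names and the four-claim `toy_panel` are invented for the example, and the alpha routine follows the standard coincidence-matrix formulation for ordinal data, assuming every model returns a verdict on every claim.

```python
from itertools import permutations

# Ordered verdict scale from the article; the index is the ordinal rank.
SCALE = ["True", "Mostly True", "Misleading", "False"]
RANK = {label: i for i, label in enumerate(SCALE)}

def split_and_spread(panel):
    """panel: one list of verdicts per claim.
    Returns (share of non-unanimous claims, share whose extreme verdicts
    sit two or more buckets apart)."""
    split = sum(1 for v in panel if len(set(v)) > 1)
    wide = sum(
        1 for v in panel
        if max(RANK[x] for x in v) - min(RANK[x] for x in v) >= 2
    )
    return split / len(panel), wide / len(panel)

def ordinal_alpha(panel):
    """Krippendorff's alpha with the ordinal distance metric."""
    k = len(SCALE)
    # Coincidence matrix: each ordered pair of verdicts on a claim
    # contributes 1 / (raters on that claim - 1).
    o = [[0.0] * k for _ in range(k)]
    for verdicts in panel:
        m = len(verdicts)
        if m < 2:
            continue  # a claim with a single verdict carries no pair information
        ranks = [RANK[x] for x in verdicts]
        for a, b in permutations(range(m), 2):
            o[ranks[a]][ranks[b]] += 1.0 / (m - 1)
    n_c = [sum(row) for row in o]  # marginal total per category
    n = sum(n_c)

    def delta2(c, g):
        # Squared ordinal distance between categories c and g.
        lo, hi = min(c, g), max(c, g)
        return (sum(n_c[lo:hi + 1]) - (n_c[lo] + n_c[hi]) / 2) ** 2

    observed = sum(o[c][g] * delta2(c, g) for c in range(k) for g in range(k))
    expected = sum(n_c[c] * n_c[g] * delta2(c, g) for c in range(k) for g in range(k))
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1.0 - (n - 1) * observed / expected

# Invented toy data: four claims, five verdicts each.
toy_panel = [
    ["True", "True", "True", "True", "True"],                 # unanimous
    ["True", "Mostly True", "True", "True", "True"],          # split, one bucket apart
    ["True", "Misleading", "False", "Mostly True", "False"],  # split, three buckets apart
    ["False", "False", "False", "False", "False"],            # unanimous
]

print(split_and_spread(toy_panel))  # (0.5, 0.25)
print(ordinal_alpha(toy_panel))
```

An alpha of 1 means perfect agreement and values near 0 mean agreement no better than chance, which is why 0.639 reads as structured but limited.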

The disagreement clusters in the middle of the rubric. Models converged easily on clear True or False verdicts but rarely reached unanimity on Mostly True or Misleading calls: of the 328 claims with a unanimous verdict, none were unanimously Mostly True and only four were unanimously Misleading. Pairwise agreement ranged from 75% (the two Gemini 3 Pro variants, which share a base model) down to 53% for Claude Opus 4.7 paired against the Gemini variants and for Gemini 3 Pro paired against Sonar Pro. Verdict distributions also diverged: some models polarized toward the extremes while others spread across the nuanced middle.
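Pairwise agreement and the breakdown of unanimous verdicts by label can be sketched as follows. This assumes agreement means an exact match on the four-point scale, which the article does not spell out; the model names and verdicts are placeholders, not data from the study.

```python
from collections import Counter
from itertools import combinations

def pairwise_agreement(verdicts_by_model):
    """Share of claims on which each pair of models gives the same verdict.
    verdicts_by_model maps a model name to its verdicts, in the same claim order."""
    agreement = {}
    for a, b in combinations(sorted(verdicts_by_model), 2):
        va, vb = verdicts_by_model[a], verdicts_by_model[b]
        agreement[(a, b)] = sum(x == y for x, y in zip(va, vb)) / len(va)
    return agreement

def unanimous_by_label(verdicts_by_model):
    """Count the claims on which every model agrees, grouped by the shared verdict."""
    per_claim = zip(*verdicts_by_model.values())
    return Counter(v[0] for v in per_claim if len(set(v)) == 1)

# Placeholder data: three hypothetical models, four claims.
toy_verdicts = {
    "model_a": ["True", "False", "Mostly True", "Misleading"],
    "model_b": ["True", "False", "Misleading", "Misleading"],
    "model_c": ["True", "False", "Mostly True", "False"],
}

agreement = pairwise_agreement(toy_verdicts)
unanimous = unanimous_by_label(toy_verdicts)
print(agreement[("model_a", "model_b")])  # 0.75
print(agreement[("model_b", "model_c")])  # 0.5
print(unanimous)
```

Even in this toy data the unanimous claims land only on True and False, mirroring the pattern described above.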

The practical takeaway is that on claims lacking a canonical answer key in the training corpus, frontier models are not interchangeable judges. Treating the majority as ground truth implies that at least one model is wrong on two-thirds of claims and at least three are wrong on 13%. Actual error rates are almost certainly higher, since even unanimous verdicts can share blind spots. The result undercuts the use of a single frontier LLM as an automated fact-checker for nuanced real-world claims.
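The arithmetic behind the "at least one wrong" and "at least three wrong" figures can be sketched by counting, for each claim, how many models depart from the most common verdict. With five models, three or more dissenters can only occur when no verdict holds a strict majority, so this sketch uses the most common verdict as the reference; that is an assumption about how the figures were derived, and the data is invented.

```python
from collections import Counter

def dissent_counts(panel):
    """For each claim, the number of verdicts that differ from the most common one.
    Ties for the most common verdict do not change the count."""
    return [len(v) - Counter(v).most_common(1)[0][1] for v in panel]

def share_with_at_least(panel, k):
    """Share of claims on which at least k models dissent from the most common verdict."""
    counts = dissent_counts(panel)
    return sum(c >= k for c in counts) / len(counts)

# Invented toy data: four claims, five verdicts each.
toy_claims = [
    ["True", "True", "True", "True", "True"],                 # 0 dissenters
    ["True", "Mostly True", "True", "True", "True"],          # 1 dissenter
    ["True", "Misleading", "False", "Mostly True", "False"],  # 3 dissenters
    ["False", "False", "False", "False", "False"],            # 0 dissenters
]

print(dissent_counts(toy_claims))          # [0, 1, 3, 0]
print(share_with_at_least(toy_claims, 1))  # 0.5
print(share_with_at_least(toy_claims, 3))  # 0.25
```

These shares are lower bounds on error under the stated assumption: a claim with zero dissenters counts as fully correct even if all five models share the same mistake.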