A study by Stanford University political science doctoral student Chloe Lim gained some attention this week, inspiring unflattering headlines like this one
from Vocativ: "Great, Even Fact Checkers Can’t Agree On What Is True."
Lim's approach somewhat resembled that of Michelle A. Amazeen of Rider University. Both researchers used tests of coding consistency to assess the accuracy of fact checkers, yet the two reached roughly opposite conclusions. Amazeen concluded that consistent results helped strengthen the inference that fact-checkers fact-check accurately. Lim concluded that inconsistent fact-checker ratings may undermine the public impact of fact-checking.
Key differences in the research procedure help explain why Amazeen and Lim reached differing conclusions.
Data Classification
Lim used two different methods for classifying data from PolitiFact and the
Washington Post Fact Checker. She converted PolitiFact ratings to a five-point scale corresponding to the
Washington Post Fact Checker's "Pinocchio" ratings, and she also divided the ratings into "True" and "False" groups, drawing the line between "Mostly False" and "Half True."
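To make the two classification schemes concrete, here is a minimal sketch in Python. The specific correspondences are our own plausible reconstruction of what Lim describes, not her actual mapping or code.

```python
# Illustrative only: our reconstruction of the two classification schemes
# described above, not Lim's actual code or her exact category mapping.

# Scheme 1: map PolitiFact's six "Truth-O-Meter" ratings onto a five-point
# scale roughly comparable to the Fact Checker's Pinocchio ratings
# (the collapse of "False" and "Pants on Fire" into one level is our assumption).
FIVE_POINT = {
    "True": 1,           # ~ Geppetto check mark
    "Mostly True": 2,    # ~ one Pinocchio
    "Half True": 3,      # ~ two Pinocchios
    "Mostly False": 4,   # ~ three Pinocchios
    "False": 5,          # ~ four Pinocchios
    "Pants on Fire": 5,  # folded into the bottom category
}

# Scheme 2: a binary split, drawing the true/false line between
# "Mostly False" and "Half True".
BROADLY_TRUE = {"True", "Mostly True", "Half True"}

def binary(rating: str) -> str:
    """Classify a PolitiFact rating as broadly true or broadly false."""
    return "True" if rating in BROADLY_TRUE else "False"

print(FIVE_POINT["Mostly False"], binary("Mostly False"))  # 4 False
```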
Amazeen opted for a different approach. Rather than try to reconcile the two rating systems, she used a binary scheme that counted every statement not rated "True" (or not given the Fact Checker's "Geppetto check mark") as false.
Amazeen's method essentially guaranteed high inter-rater reliability, because "True" judgments from the fact checkers are rare. Imagine comparing movie reviewers who use a five-star scale, but with their ratings collapsed into just two categories: great movies (five stars) and everything else. A one-star rating of "Ishtar" from one reviewer would count as agreement with a four-star rating of the same movie from another. A disagreement occurs only when one reviewer awards five stars and the other gives anything less.
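A toy calculation with invented star ratings shows how much the collapse inflates apparent agreement (the numbers are made up for illustration):

```python
# Toy illustration of the movie-reviewer analogy above; the ratings are
# invented. Collapsing a five-star scale into "great (5 stars) vs. the rest"
# makes two reviewers appear to agree even when their raw ratings differ.

reviewer_a = [1, 2, 4, 3, 5, 2]   # stars for six hypothetical movies
reviewer_b = [4, 1, 2, 4, 5, 3]

def collapse(stars: int) -> str:
    return "great" if stars == 5 else "not great"

n = len(reviewer_a)
raw_agreement = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / n
binary_agreement = sum(collapse(a) == collapse(b)
                       for a, b in zip(reviewer_a, reviewer_b)) / n

print(f"exact-star agreement: {raw_agreement:.0%}")    # 17%
print(f"collapsed agreement:  {binary_agreement:.0%}")  # 100%
```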
Professor Joseph Uscinski's reply to Amazeen's research, published in
Critical Review, put it succinctly:
Amazeen’s analysis sets the bar for agreement so low that it cannot be taken seriously.
Amazeen found high agreement among fact checkers because her method guaranteed that outcome.
Lim's methods produce more varied and robust data, though she ran into the same problem Amazeen did: two different fact-checking organizations only rarely check the same claims. As a result, both researchers worked with relatively small data sets.
The meaning of Lim's study
In our view, Lim's study rushes to its conclusion that fact-checkers disagree without giving proper attention to the most obvious explanation for the disagreement she measures.
The rating systems the fact checkers use lend themselves to subjective evaluations. We should expect that condition to lead to inconsistent ratings. When I
reviewed Amazeen's method at Zebra Fact Check, I criticized it for applying inter-coder reliability standards to a process much less rigorous than social science coding.
Klaus Krippendorff, creator of the K-alpha measure Amazeen used in her research, explained the importance of giving coders good instructions to follow:
The key to reliable content analyses is reproducible coding instructions. All phenomena afford multiple interpretations. Texts typically support alternative interpretations or readings. Content analysts, however, tend to be interested in only a few, not all. When several coders are employed in generating comparable data, especially large volumes and/or over some time, they need to focus their attention on what is to be studied. Coding instructions are intended to do just this. They must delineate the phenomena of interest and define the recording units to be described in analyzable terms, a common data language, the categories relevant to the research project, and their organization into a system of separate variables.
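For readers unfamiliar with the measure, here is a minimal sketch of how Krippendorff's alpha is typically computed in practice. It assumes the third-party Python krippendorff package and uses invented ratings; it is not Amazeen's or Lim's actual analysis.

```python
# Sketch only: invented ratings run through the third-party "krippendorff"
# package (pip install krippendorff), not data from either study.
import numpy as np
import krippendorff

# Rows are coders (here, two fact-checking outlets treated as coders);
# columns are the statements both happened to rate, scored on a shared
# five-point scale.
ratings = np.array([
    [1, 3, 5, 4, 2, 5, 3, 1],
    [2, 3, 4, 5, 2, 5, 2, 1],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```

An alpha of 1.0 would mean perfect agreement, and commonly cited guidance from Krippendorff treats values below about 0.8 with caution, which is one reason the quality of the coding instructions matters so much.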
The rating systems of PolitiFact and the
Washington Post Fact Checker are gimmicks, not coding instructions. The definitions mean next to nothing, and PolitiFact's creator, Bill Adair, has called PolitiFact's determination of Truth-O-Meter ratings "
entirely subjective."
Lim's conclusion is right. The fact checkers are inconsistent. But Lim's use of coder reliability measures is, in our view, a little like using a plumb line to measure whether a building has collapsed in an earthquake. The tool is too sophisticated for the job. The "Truth-O-Meter" and "Pinocchio" rating systems as described and used by the fact checkers do not qualify as adequate sets of coding instructions.
We've belabored the point about PolitiFact's rating system for years. It's a gimmick that tends to mislead people. And the fact-checking organizations that forgo rating systems avoid them for precisely that reason.
Lucas Graves' history of the modern fact-checking movement, "Deciding What's True: The Rise of Political Fact-Checking in American Journalism" (page 41), offers an example of the dispute:
The tradeoffs of rating systems became a central theme of the 2014 Global Summit of fact-checkers. Reprising a debate from an earlier journalism conference, Bill Adair staged a "steel-cage death match" with the director of Full Fact, a London-based fact-checking outlet that abandoned its own five-point rating scheme (indicated by a magnifying lens) as lacking precision and rigor. Will Moy explained that Full Fact decided to forgo "higher attention" in favor of "long-term reputation," adding that "a dodgy rating system--and I'm afraid they are inherently dodgy--doesn't help us with that."
Coding instructions should give coders guidelines clear enough to prevent most or all debate when deciding between two rating categories.
Lim's study, in its present form, does its best work raising questions about fact checkers' use of rating systems.