Suppose someone presents you with some data in the form of numbers in tables. These numbers may have been used as evidence to support some contention. Can you judge whether those numbers are authentic without actually repeating the entire study?
There have been cases in the past where people have reviewed other people’s data and found suspicious numeric patterns that would have been unlikely to occur naturally. One famous case involves Cyril Burt’s studies of twins, which purportedly showed that genetics played a far greater role in a person’s development than had previously been thought. In 1974, three years after Burt’s death in 1971, Leon Kamin analyzed Burt’s data and concluded that they were likely not genuine, because the statistical correlations Burt reported stayed the same to the third decimal place even though they were obtained from different sample sizes. The odds of that happening naturally are extremely low. (Not in Our Genes by R. C. Lewontin, Steven Rose, and Leon Kamin (1984) p. 103.)
But that method requires comparing two or more sets of data. From economist Tyler Cowen I learn that there is a purely statistical pattern, known as Benford’s law, that can be used to check whether a single set of data has been honestly generated or fudged.
Benford’s law says that in many naturally occurring sets of numbers, such as “the lengths of rivers, the populations of cities, molecular weights of chemicals, and any number of other categories”, the first digits are not uniformly distributed: small digits lead far more often than large ones. The probability that the first digit will be d is given by P(d) = log10(1 + 1/d), where log10 is the base-10 logarithm. So the probability of the first digit being 1 is 30.1%, 2 is 17.6%, 3 is 12.5%, 4 is 9.7%, 5 is 7.9%, 6 is 6.7%, 7 is 5.8%, 8 is 5.1%, and 9 is 4.6%.
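As a quick check of the percentages above, here is a minimal Python sketch that evaluates the formula for each digit. The function name benford_probability is my own, chosen purely for illustration.

```python
import math

def benford_probability(d):
    """Probability that the leading digit is d (1 through 9) under Benford's law."""
    if d not in range(1, 10):
        raise ValueError("leading digit must be an integer from 1 to 9")
    return math.log10(1 + 1 / d)

# Print the expected share of each leading digit as a percentage.
for d in range(1, 10):
    print(d, f"{benford_probability(d):.1%}")
```

The nine probabilities add up to exactly 1, as they must, because the sum of log10((d+1)/d) for d from 1 to 9 telescopes to log10(10).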
Given the recent turmoil in the financial sector, Jialan Wang applied this analysis to the accounting statements of businesses to see if they might reveal an increase in skullduggery and cooking of books over time. An initial analysis suggested that this was the case, but Wang later discovered that the presence of zeros and negative numbers in accounting data (not covered by Benford’s law) complicated matters: the number of zeros had increased over time, which prevented a clear conclusion from being drawn. Why the number of zeros increased is a puzzle. Wang also provides a thoughtful reflection on research biases that can creep in when studying something.
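To make the problem with zeros and negative numbers concrete, here is a rough Python sketch of one way a first-digit check might handle them. This is not Wang's method. It rests on my own assumptions: that zeros are simply dropped, that negative numbers are judged by their absolute value, and that the inputs are ordinary finite numbers. The names leading_digit and first_digit_deviation are hypothetical, and the sum of absolute differences used here is only a crude measure of mismatch, not a formal statistical test.

```python
import math
from collections import Counter

def leading_digit(x):
    """Return the first significant digit of x, or None if x is zero."""
    x = abs(x)  # assumption: the sign is ignored
    if x == 0:
        return None  # assumption: zeros have no leading digit and are skipped
    # Scientific notation puts the first significant digit first,
    # e.g. 0.0042 is formatted as '4.200000000000000e-03'.
    return int(f"{x:.15e}"[0])

def first_digit_deviation(values):
    """Return (deviation, n): the total absolute gap between the observed
    first-digit frequencies and Benford's law, and the count of values used."""
    digits = [d for d in map(leading_digit, values) if d is not None]
    n = len(digits)
    if n == 0:
        raise ValueError("no nonzero values to analyze")
    counts = Counter(digits)
    deviation = 0.0
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        observed = counts[d] / n
        deviation += abs(observed - expected)
    return deviation, n

# Powers of 2 are a standard example of a sequence that follows the law closely.
print(first_digit_deviation([2 ** k for k in range(1, 101)]))
```

A deviation near zero means the first digits track the law closely, while a set in which every digit leads equally often scores much higher. Note that dropping zeros shrinks the usable sample, so if the share of zeros rises over time, comparisons between years rest on shrinking and possibly changing sets of entries.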
Of course, people who are aware of Benford’s law can still fudge their numbers while taking it into account. But it is not always easy to manufacture numbers to fit a pattern that arises from a stochastic process.