Building a Science Detector

Oh, let us count the ways

The Defense Advanced Research Projects Agency (DARPA) Defense Sciences Office (DSO) is requesting information on new ideas and approaches for creating (semi)automated capabilities to assign “Confidence Levels” to specific studies, claims, hypotheses, conclusions, models, and/or theories found in social and behavioral science research. These social and behavioral science Confidence Levels should rapidly enable a non-expert to understand and quantify the confidence they can have in a specific research result or claim’s reliability, reproducibility, and robustness.

First off, “confidence levels?” We’ve already got “confidence intervals,” and there’s been a decades-long push to use them in place of hypothesis testing.[1][2] The technique is fully compatible with frequentism (though there it doesn’t mean what you probably think it does), and it even predates null-hypothesis significance testing! Alas, scientists find “we calculate a Cohen’s d of 0.3 ± 0.1” less satisfying to type than “we have refuted the null hypothesis.” The former exposes a fairly weak effect, while the latter comes across as bold and confident. If those won’t do, what about meta-analyses?[3]
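To see why an interval beats a verdict, here’s a minimal sketch of the effect-size-plus-interval style of reporting. The data, group names, and sample sizes are invented for illustration; the standard-error formula is the common Hedges–Olkin normal approximation, not anything DARPA or the papers above prescribe.

```python
# Hypothetical two-group comparison: report an effect size with a
# confidence interval instead of a bare "significant / not significant"
# verdict. All numbers here are made up for illustration.
from math import sqrt

def cohens_d_with_ci(a, b, z=1.96):
    """Cohen's d with an approximate 95% CI.

    Uses the normal-approximation standard error
    SE(d) = sqrt((n1+n2)/(n1*n2) + d^2 / (2*(n1+n2))).
    """
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    # Sample variances (ddof=1) and the pooled standard deviation.
    v1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)
    sd_pooled = sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    d = (m1 - m2) / sd_pooled
    se = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d, d - z * se, d + z * se

treatment = [5, 6, 7, 8, 9]
control = [4, 5, 6, 7, 8]
d, lo, hi = cohens_d_with_ci(treatment, control)
print(f"d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

With samples this small the interval comes out very wide and straddles zero, which is exactly the point: the uncertainty is right there in the report, instead of hiding behind a thumbs-up/thumbs-down significance call.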

Second, these “confidence levels” would only apply to published research. Most research never gets published, yet those results are vital to understanding how strong any one finding is.[4] We can try to estimate the rate of unpublished works, and indeed over the decades many people have tried, but there is no current consensus on how best to compensate for the problem.[5][6][7]

Third, “social and behavioral science?” The replication crisis extends much further, into biomedicine, chemistry, and so on. Physics doesn’t get mentioned much, and there’s a reason for that (beyond physicists’ love of confidence intervals). Emphasis mine:

Even if you adjust the acceptable P value, a test of statistical significance, from 0.05 to 0.005—the lower it is, the more significant your data—that won’t deal with, let’s say, bias resulting from corporate funding. (Particle physicists demand a P value below 0.0000003! And you gotta get below 0.00000005 for a genome-wide association study.)

Just think on that. “p < 0.0000003” means “if the null hypothesis is true, we would see a result this extreme in fewer than 1 in 3,333,333 trials on data like what we have observed.” If you wanted to catch one of those exceptions, you’d have to do one experiment a day for 6,326 years just to have a better than 50/50 chance of spotting it. For comparison, the odds of a particular US citizen being struck by lightning over a year are 1 in 700,000; worldwide, the yearly odds of death by snake bite are about 1 in 335,000; and over the lifetime of a US citizen, the odds of them dying by dog attack are 1 in 112,400. p < 0.0000003 is a ridiculously high bar to clear, which means a) false positives are easy to generate in physics, whether via the law of large numbers or shoddy statistical technique, b) the field has been bitten so many times by results that couldn’t be replicated, even when they were real, that it cranked the bar sky-high, or c) both.
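That 6,326-year figure is easy to sanity-check yourself. Solving 1 − (1 − p)ⁿ > 0.5 for the number of trials n:

```python
# Back-of-the-envelope check of the numbers above: at p < 0.0000003,
# how many experiments until you have a better-than-even chance of
# seeing one such false positive under a true null?
from math import log

p = 3e-7                          # the particle-physics threshold, ~1 in 3,333,333
trials = log(0.5) / log(1 - p)    # solve 1 - (1 - p)^n > 0.5 for n
years = trials / 365.25           # at one experiment per day
print(f"{trials:,.0f} trials, about {years:,.0f} years at one per day")
```

That lands at roughly 2.3 million trials, or about six millennia of daily experiments, consistent with the figure above (the last digit or two depends on how you count leap days).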

Fourth, confidence isn’t everything. The Princeton Engineering Anomalies Research lab ran studies where people tried to psychically bias random number generators. Over millions of trials, they got extremely significant results… but the hit rate was still around 50.1% versus the expected 50%. Should they have been confident that psychic abilities exist, or merely that luck and reporting bias could introduce a subtle skew into the data? Compacting those complexities into a number or label that a layperson can understand is extremely difficult, perhaps impossible.
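A quick sketch shows how a minuscule effect plus a huge trial count yields an extreme p-value. The trial count below is illustrative, not PEAR’s actual total, and the p-value uses the usual normal approximation to the binomial:

```python
# A tiny effect over millions of coin-flip-style trials produces a
# spectacular p-value. The trial count here is hypothetical, chosen
# only to illustrate the scale of the phenomenon.
from math import sqrt, erfc

n = 10_000_000           # hypothetical number of binary trials
observed_rate = 0.501    # 50.1% "hits" against a 50% chance baseline
z = (observed_rate - 0.5) * sqrt(n) / 0.5   # normal approx. to the binomial
p_two_sided = erfc(z / sqrt(2))             # two-sided tail probability
print(f"z = {z:.1f}, p = {p_two_sided:.1e}")
```

A z-score above 6 and a p-value around 10⁻¹⁰: wildly “significant,” for an effect of one extra hit per thousand trials. Statistical confidence and practical importance are simply different questions.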

Basically, what DARPA is asking for has been hashed out in the literature for decades, and the best recommendations have been ignored.[8] They may have deep pockets and influence, but what DARPA wants requires a complete overhaul in how science is conducted across the globe, spanning everything from journals to how universities are organized.[9] When even quite minor tweaks to the scientific process are met with stiff opposition, pessimism seems optimistic.

[1] Gardner, Martin J., and Douglas G. Altman. “Confidence intervals rather than P values: estimation rather than hypothesis testing.” Br Med J (Clin Res Ed) 292.6522 (1986): 746-750.

[2] Rozeboom, William W. “The fallacy of the null-hypothesis significance test.” Psychological bulletin 57.5 (1960): 416.

[3] Egger, Matthias, et al. “Bias in meta-analysis detected by a simple, graphical test.” BMJ 315.7109 (1997): 629-634.

[4] Rosenthal, Robert. “The file drawer problem and tolerance for null results.” Psychological bulletin 86.3 (1979): 638.

[5] Franco, Annie, Neil Malhotra, and Gabor Simonovits. “Publication bias in the social sciences: Unlocking the file drawer.” Science 345.6203 (2014): 1502-1505.

[6] Rosenberg, Michael S. “The file-drawer problem revisited: a general weighted method for calculating fail-safe numbers in meta-analysis.” Evolution 59.2 (2005): 464-468.

[7] Simonsohn, Uri, Leif D. Nelson, and Joseph P. Simmons. “P-curve: a key to the file-drawer.” Journal of Experimental Psychology: General 143.2 (2014): 534.

[8] Sedlmeier, Peter, and Gerd Gigerenzer. “Do studies of statistical power have an effect on the power of studies?.” Psychological bulletin 105.2 (1989): 309.

[9] Rawat, Seema, and Sanjay Meena. “Publish or Perish: Where Are We Heading?” Journal of Research in Medical Sciences: The Official Journal of Isfahan University of Medical Sciences 19.2 (2014): 87-89. Print.