There is a magic and arbitrary line in ordinary statistical testing: the p level of 0.05. What that basically means is that if the p value of a comparison between two distributions is less than 0.05, then results at least as extreme as yours would turn up less than 5% of the time by chance alone if there were no real difference. We’ll often say that having p<0.05 means your result is statistically significant. Note that there’s nothing really special about 0.05; it’s just a commonly chosen dividing line.
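If you want to see what that 5% looks like in practice, here is a minimal sketch that runs thousands of experiments in which there is no real difference at all, and counts how often p still lands below 0.05. Everything in it is an assumption made for illustration: two groups of 20 drawn from the same normal distribution, a simple two-sided z-test with a known standard deviation of 1, and function names of my own invention.

```python
import math
import random


def two_sided_p(group_a, group_b, sigma=1.0):
    """Two-sided z-test p value for a difference in means (sigma assumed known)."""
    n_a, n_b = len(group_a), len(group_b)
    diff = sum(group_a) / n_a - sum(group_b) / n_b
    z = diff / (sigma * math.sqrt(1 / n_a + 1 / n_b))
    # 2 * (1 - Phi(|z|)) written with the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(n_experiments=5000, n_per_group=20, alpha=0.05, seed=1):
    """Fraction of no-effect experiments that still come out 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # Both groups come from the same distribution: no real effect exists.
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sided_p(a, b) < alpha:
            hits += 1
    return hits / n_experiments


rate = false_positive_rate()
print(f"fraction of no-effect experiments with p < 0.05: {rate:.3f}")
```

The fraction should hover near 0.05, which is all the threshold promises: when nothing is going on, about one experiment in twenty will cross the line anyway.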
Now a paper has come out that ought to make some psychologists, who use that p value criterion a lot in their work, feel a little concerned. The researchers analyzed the distribution of reported p values in three well-regarded journals in experimental psychology, and described the pattern.
Here’s one figure from the paper.
The solid line represents the expected distribution of p values. This was calculated from some theoretical statistical work.
…some theoretical papers offer insight into a likely distribution. Sellke, Bayarri, and Berger (2001) simulated p value distributions for various hypothetical effects and found that smaller p values were more likely than larger ones. Cumming (2008) likewise simulated large numbers of experiments so as to observe the various expected distributions of p.
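To get a feel for why smaller p values should be more common than larger ones when an effect is real, here is a rough sketch of that kind of simulation. It is not the procedure used by Sellke, Bayarri, and Berger or by Cumming; the effect size (half a standard deviation), the group size (30), the z-test, and the function names are all assumptions picked for illustration.

```python
import math
import random


def two_sided_p(group_a, group_b, sigma=1.0):
    """Two-sided z-test p value for a difference in means (sigma assumed known)."""
    n_a, n_b = len(group_a), len(group_b)
    diff = sum(group_a) / n_a - sum(group_b) / n_b
    z = diff / (sigma * math.sqrt(1 / n_a + 1 / n_b))
    return math.erfc(abs(z) / math.sqrt(2))


def p_value_histogram(effect=0.5, n_per_group=30, n_experiments=10000, seed=2):
    """Count p values in ten 0.01-wide bins covering 0.00 up to 0.10."""
    rng = random.Random(seed)
    bins = [0] * 10
    for _ in range(n_experiments):
        # Group a has a genuinely higher mean than group b (hypothetical effect).
        a = [rng.gauss(effect, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        p = two_sided_p(a, b)
        if p < 0.10:
            bins[min(int(p * 100), 9)] += 1
    return bins


bins = p_value_histogram()
for i, count in enumerate(bins):
    print(f"{i / 100:.2f} <= p < {(i + 1) / 100:.2f}: {count}")
```

Under these assumptions the counts fall off steadily from the first bin onward, and nothing special happens around 0.05. That smooth decline is the shape to compare the published values against.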
The circles represent the actual distribution of p values in the published papers. Remember, 0.05 is the arbitrarily determined standard for significance; you’ll have a hard time getting accepted for publication if your observations don’t reach that level.
Notice that unusual and gigantic hump in the distribution just below 0.05? Uh-oh.
I repeat, uh-oh. That looks like about half the papers that report p values just under 0.05 may have benefited from a little ‘adjustment’.
What that implies is that investigators whose work reaches only marginal statistical significance are scrambling to nudge their numbers below the 0.05 level. It’s not necessarily that they’re actually making up data; there could be a sneakier bias: oh, we almost meet the criterion, let’s add a few more subjects and see if we can get it there. Oh, those data points are weird outliers, let’s throw them out. Oh, our initial parameter of interest didn’t meet the criterion, but this other incidental observation did, so let’s report that one and not bother with the other.
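The "add a few more subjects" habit is easy to simulate, and the result is sobering. This is only a sketch under assumptions of my own: there is no real effect, the experimenter starts with 20 subjects per group, and whenever the test misses 0.05 they add 5 more per group and test again, giving up at 50. The z-test and all the names are hypothetical too.

```python
import math
import random


def two_sided_p(group_a, group_b, sigma=1.0):
    """Two-sided z-test p value for a difference in means (sigma assumed known)."""
    n_a, n_b = len(group_a), len(group_b)
    diff = sum(group_a) / n_a - sum(group_b) / n_b
    z = diff / (sigma * math.sqrt(1 / n_a + 1 / n_b))
    return math.erfc(abs(z) / math.sqrt(2))


def run_experiment(rng, peek, start_n=20, step=5, max_n=50, alpha=0.05):
    """Return True if a no-effect experiment ends up 'significant'."""
    a = [rng.gauss(0, 1) for _ in range(start_n)]
    b = [rng.gauss(0, 1) for _ in range(start_n)]
    while True:
        if two_sided_p(a, b) < alpha:
            return True
        if not peek or len(a) >= max_n:
            return False
        # Missed the criterion: recruit a few more subjects and test again.
        a.extend(rng.gauss(0, 1) for _ in range(step))
        b.extend(rng.gauss(0, 1) for _ in range(step))


def false_positive_rate(peek, n_experiments=3000, seed=3):
    rng = random.Random(seed)
    hits = sum(run_experiment(rng, peek) for _ in range(n_experiments))
    return hits / n_experiments


fixed_rate = false_positive_rate(peek=False)
peeking_rate = false_positive_rate(peek=True)
print(f"test once at n=20:           {fixed_rate:.3f}")
print(f"keep adding subjects to 50:  {peeking_rate:.3f}")
```

Testing once keeps the false positive rate near the advertised 5%. Retesting after every top-up pushes it well above that, even though no individual step looks like cheating.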
But what it really means is that you should not trust published studies that only have marginal statistical significance. They may have been tweaked just a little bit to make them publishable. And that means that publication standards may be biasing the data.
Masicampo EJ, Lalande DR (2012). A peculiar prevalence of p values just below .05. Quarterly Journal of Experimental Psychology. PMID: 22853650