In science, there is something called the “replicability crisis”: the finding that the results of many published studies cannot be replicated. The evidence comes mainly from psychology and medicine, where meta-studies have found low replication rates. But it likely generalizes to other scientific fields as well.
At least, when people talk about the replicability crisis, they definitely seem to believe that it generalizes to all fields. And yet, one of the most commonly discussed practices is p-hacking. Excuse me, folks, but I’m pretty sure that p-hacking does not generalize to physics. In my research, we don’t calculate p-values at all!
(Background: p-hacking is the problematic practice of tweaking statistical analysis until you get a p-value that is just barely low enough to technically count as statistically significant. FiveThirtyEight has a neat toy so you can try p-hacking yourself.)
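To see why p-hacking works, here's a minimal Monte Carlo sketch of my own (not the FiveThirtyEight toy): simulate a "study" that measures 20 unrelated outcomes, all of which are pure noise, and count how often at least one outcome sneaks under p &lt; 0.05. The numbers (20 outcomes, 25 samples each) are made up for illustration.

```python
import math
import random

def z_test_p(samples, sigma=1.0):
    """Two-sided p-value for the null 'true mean = 0', with known sigma."""
    n = len(samples)
    z = (sum(samples) / n) * math.sqrt(n) / sigma
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided Gaussian tail

random.seed(0)
trials, tests_per_trial, n = 2000, 20, 25
false_positives = 0
for _ in range(trials):
    # Each simulated study tests 20 unrelated outcomes; every effect is null.
    if any(z_test_p([random.gauss(0, 1) for _ in range(n)]) < 0.05
           for _ in range(tests_per_trial)):
        false_positives += 1

print(f"Null studies with at least one p < 0.05: {false_positives / trials:.2f}")
```

With 20 shots at significance, a study of pure noise "finds" something roughly 1 − 0.95²⁰ ≈ 64% of the time.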
Here I speculate why p-values rarely appear in physics, and what sort of problems we have in their place.
The 5-sigma Higgs and other case studies
Although we don’t calculate p-values in my own field (superconductivity), they’re occasionally implied in particle physics and astrophysics. So let’s discuss a few case studies.
First, consider the discovery of the Higgs boson in 2012. This was reported as a “5-sigma” result, which is an alternative way to report a p-value: the measurement is five standard deviations away from the null expectation, corresponding to p < 3×10⁻⁷. Physicists laugh at your p < 0.05!
Next, consider OPERA’s observation in 2011 of neutrinos travelling faster than light. That was a 6-sigma result (p < 1×10⁻⁹). And then there’s BICEP2’s observation in 2014 of B-modes in the cosmic microwave background radiation, considered to be evidence for inflationary cosmology. That was a whopping 7.7-sigma result (p < 1×10⁻¹⁴). Wow!
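The sigma-to-p conversion is just the tail probability of a standard Gaussian. A quick check of the numbers above (one-sided, using only the standard library):

```python
import math

def sigma_to_p(n_sigma):
    """One-sided Gaussian tail probability for an n-sigma excess."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2))

for label, n_sigma in [("Higgs", 5.0), ("OPERA", 6.0), ("BICEP2", 7.7)]:
    print(f"{label}: {n_sigma}-sigma -> p = {sigma_to_p(n_sigma):.1e}")
# Higgs: 5.0-sigma -> p = 2.9e-07
# OPERA: 6.0-sigma -> p = 9.9e-10
# BICEP2: 7.7-sigma -> p = 6.8e-15
```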
Of course, it turns out that OPERA’s result came from a loose cable and fast clock. BICEP2’s result came from a failure to account for galactic dust correctly. The Higgs boson is still good though.
But wait a minute. How is it that the two false results had much smaller p-values than the one true result? And p wasn’t just a little smaller, it was smaller by seven orders of magnitude.
Basically, what it comes down to is that the p-value calculations are just wrong (sorry SciAm). The p-values are calculated assuming a Gaussian distribution of error, which is what you expect from purely random error. Unfortunately, past a certain number of sigmas, random error is no longer the thing you worry about most. Instead, we worry about experimental error, which has a tendency to produce huge outliers. Thus, the true distribution is non-Gaussian.
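A toy mixture model shows how a small rate of experimental error swamps the Gaussian tail. Suppose (numbers made up for illustration) that 99.9% of the time the error really is Gaussian, but 0.1% of the time something like a loose cable inflates the error scale tenfold:

```python
import math

def gauss_tail(n_sigma):
    """One-sided tail probability of a standard Gaussian."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2))

def mixture_tail(n_sigma, eps=1e-3, blowup=10.0):
    """Toy model: with probability eps, a systematic error (loose cable)
    inflates the error scale by `blowup`."""
    return (1 - eps) * gauss_tail(n_sigma) + eps * gauss_tail(n_sigma / blowup)

for s in [2, 3, 5, 6]:
    g, m = gauss_tail(s), mixture_tail(s)
    print(f"{s}-sigma: Gaussian p = {g:.1e}, true p = {m:.1e}, ratio = {m/g:.0f}")
```

At 2 sigma the loose cable barely matters, but at 6 sigma the true tail probability is dominated entirely by the systematic error and is orders of magnitude larger than the Gaussian calculation claims.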
No wonder physicists report sigmas instead of p-values: converting sigmas to p-values assumes the Gaussian tail, which is precisely the assumption that fails. The true p-values are impossible to calculate, and much higher.
What is the null hypothesis?
In my own research, p-values are worse than useless; they’re conceptually nonsensical. How would you even begin? The first step is to identify the null hypothesis, but there isn’t any null hypothesis to explain high-temperature superconductivity. Well, the null hypothesis is that superconductivity doesn’t exist. But we’ve already rejected that one to our satisfaction.
In practice, the null hypothesis is effectively, “Something is wrong with the experiment.” And for this null hypothesis, we can’t calculate p-values; we can only test it with a bunch of troubleshooting. So suppose I observe a trend in my data. If I replace the superconductor with a non-superconductor, do I see the same thing? If I change the laser settings, do I see the same thing? If I make sure there aren’t any loose wires, do I still see it? So on and so forth, and that’s what research is.
So, here’s where I have some issues with the way we talk about the replicability crisis. It’s often said that the fix is for people to publish more null results. But in physics, the “null result” is usually that there is some technical issue, probably having to do with specific details about the lab setup. Unless we have reason to believe that other researchers are affected by similar problems, it’s a waste of journal space to talk about it.
The role of publishing
Once you find some positive results, and have taken reasonable measures to be sure it isn’t the product of technical issues, it’s time to move on to the next step. The ultimate form of troubleshooting is to try the experiment in a completely different lab, where there are all sorts of subtle instrumental differences. Or, if we have a particular theory for what the trend represents, then we can test the theory with an entirely different kind of experiment. We can also get some third parties to look over the data analysis for themselves.
All of these things require the help of other scientists. And to get other scientists involved, we need to communicate, often in the form of published articles. Publishing is the ultimate form of troubleshooting.
Under this perspective, I don’t immediately see why it’s an issue that many publications have incorrect conclusions. Publication is part of the process by which we figure out that the conclusions are wrong!
The problem is when people (either scientists or journalists) hold up a conclusion as correct with a greater degree of confidence than is warranted. I have heard that this is the real problem in psychology, because some of the studies called into question were considered fairly well established.
Of course, this is not to say there aren’t methodological issues and biases in physics. I think physicists are biased towards publishing exciting conclusions, especially the kind that suggest further money should be directed to the particular kind of experiments that they have expertise in. The effects of such bias are difficult to measure because error depends on highly specific experimental details. If only we had a p-hacking problem, we could at least assess the bias via meta-analysis.