In science, there is something called the “replicability crisis”: the fact that the results of most studies cannot be replicated. The evidence appears to come mainly from psychology and medicine, where meta-studies have found low replicability rates. But it likely generalizes to other scientific fields as well.
At least, when people talk about the replicability crisis, they definitely seem to believe that it generalizes to all fields. And yet, one of the most commonly discussed practices is p-hacking. Excuse me, folks, but I’m pretty sure that p-hacking does not generalize to physics. In my research, we don’t calculate p-values at all!
(Background: p-hacking is the problematic practice of tweaking statistical analysis until you get a p-value that is just barely low enough to technically count as statistically significant. FiveThirtyEight has a neat toy so you can try p-hacking yourself.)
Here I speculate why p-values rarely appear in physics, and what sort of problems we have in their place.
The 5-sigma Higgs and other case studies
Although I say that we don’t calculate p-values in my field (superconductivity), they’re occasionally implied in particle physics and astrophysics. So let’s discuss a few case studies.
First, consider the discovery of the Higgs boson in 2012. This was reported as a “5-sigma” result. This is an alternative way to report a p-value. 5-sigma means p < 3×10⁻⁷. Physicists laugh at your p < 0.05!
Next, consider OPERA’s observation in 2011 of neutrinos travelling faster than light. That was a 6-sigma result (p < 1×10⁻⁹). And then there’s BICEP2’s observation in 2014 of B-modes in the cosmic microwave background radiation, considered to be evidence for inflationary cosmology. That was a whopping 7.7-sigma result (p < 1×10⁻¹⁴). Wow!
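If you want to check the sigma-to-p conversion yourself, here is a minimal sketch in Python, assuming the one-sided Gaussian tail convention used in particle physics:

```python
# Convert "n-sigma" significance into a one-sided Gaussian tail probability.
# This is the convention behind the rounded p-values quoted above.
from scipy.stats import norm

for sigma in (5, 6, 7.7):
    p = norm.sf(sigma)  # survival function: P(Z > sigma) for a standard normal
    print(f"{sigma}-sigma  ->  p ≈ {p:.1e}")

# Prints approximately:
# 5-sigma    ->  p ≈ 2.9e-07
# 6-sigma    ->  p ≈ 9.9e-10
# 7.7-sigma  ->  p ≈ 6.8e-15
```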
Of course, it turns out that OPERA’s result came from a loose cable and fast clock. BICEP2’s result came from a failure to account for galactic dust correctly. The Higgs boson is still good though.
But wait a minute. How is it that the two false results had much smaller p-values than the one true result? And p wasn’t just a little smaller, it was smaller by seven orders of magnitude.
Basically, what it comes down to is that the p-value calculations are just wrong (sorry SciAm). The p-values are calculated assuming a Gaussian distribution of error, which is what you expect from purely random error. Unfortunately, past a certain number of sigmas, random error is no longer the main worry. Instead, the worry is experimental error, which has a tendency to produce huge outliers. Thus, the true distribution is non-Gaussian.
And no wonder physicists report sigmas instead of p-values. Calculating p-values from Gaussian distributions is technically incorrect. The true p-values are impossible to calculate, and much higher.
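As a toy illustration of how fragile those tiny numbers are (my example, with an arbitrary fat-tailed distribution standing in for the unknown true error distribution):

```python
# Toy illustration: the same "6-sigma" deviation has wildly different tail
# probabilities under a Gaussian versus a heavy-tailed error distribution.
# The Student-t with 3 degrees of freedom is just a stand-in for "something
# with fat tails"; nobody knows the true distribution of experimental error.
from scipy.stats import norm, t

sigma = 6
print(f"Gaussian tail:    {norm.sf(sigma):.1e}")     # ~1e-9
print(f"Student-t (df=3): {t.sf(sigma, df=3):.1e}")  # ~5e-3, more than six orders of magnitude larger
```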
What is the null hypothesis?
In my own research, p-values are worse than useless; they’re conceptually nonsensical. How would you even begin? The first step is to identify the null hypothesis, but there isn’t any null hypothesis to explain high-temperature superconductivity. Well, the null hypothesis is that superconductivity doesn’t exist. But we’ve already rejected that one to our satisfaction.
In practice, the null hypothesis is effectively, “Something is wrong with the experiment.” And for this null hypothesis, we can’t calculate p-values; we can only test it with a bunch of troubleshooting. So suppose I observe a trend in my data. Now try replacing the superconductor with a non-superconductor: do I see the same thing? Now try changing the laser settings: do I see the same thing? Now make sure there aren’t any loose wires: do I still see the same thing? So on and so forth, and that’s what research is.
So, here’s where I have some issues with the way we talk about the replicability crisis. It’s said that the problem is that people need to publish more null results. But in physics, the “null result” is usually that there is some technical issue, probably having to do with specific details of the lab setup. Unless we have reason to believe that other researchers are affected by similar problems, it’s a waste of journal space to talk about it.
The role of publishing
Once you find some positive results, and have taken reasonable measures to be sure they aren’t the product of technical issues, it’s time to move on to the next step. The ultimate form of troubleshooting is to try the experiment in a completely different lab, where there are all sorts of subtle instrumental differences. Or, if we have a particular theory for what the trend represents, then we can test the theory with an entirely different kind of experiment. We can also get some third parties to look over the data analysis for themselves.
All of these things require the help of other scientists. And to get other scientists involved, we need to communicate, often in the form of published articles. Publishing is the ultimate form of troubleshooting.
From this perspective, I don’t immediately see why it’s an issue that many publications have incorrect conclusions. Publication is part of the process by which we figure out that the conclusions are wrong!
The problem is when people (either scientists or journalists) hold up a conclusion as correct with a greater degree of confidence than is warranted. I have heard that this is the real problem in psychology: some of the studies called into question were considered to be fairly well-established.
Of course, this is not to say there aren’t methodological issues and biases in physics. I think physicists are biased towards publishing exciting conclusions, especially the kind that suggest further money should be directed to the particular kind of experiments that they have expertise in. The effects of such bias are difficult to measure because error depends on highly specific experimental details. If only we had a p-hacking problem, we could at least assess the bias via meta-analysis.
robert79 says
p-values don’t assume a Gaussian distribution; the t-test (and similar tests) does. Non-parametric tests typically make no (or far fewer) assumptions about the distribution of the data.
In the worst-case scenario, Chebyshev’s theorem gives us an upper bound for the p-value in the 5-sigma case, namely p < 0.04. So you could argue that the physicists are actually using roughly the same significance level as the psychologists, although you need really fat-tailed data to get close to this value, which I doubt is the case in physics.
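For reference, a quick sketch of the arithmetic behind that bound (Chebyshev’s inequality holds for any distribution with finite variance, which is what makes it the worst case):

```python
# Chebyshev's inequality: P(|X - mu| >= k*sigma) <= 1/k**2 for ANY
# finite-variance distribution.  At k = 5 the bound is 0.04, versus ~6e-7
# for a two-sided Gaussian tail -- showing how fat the tails would have to
# be to actually get near the bound.
from scipy.stats import norm

k = 5
print(f"Chebyshev bound at {k} sigma: {1 / k**2:.3f}")        # 0.040
print(f"Two-sided Gaussian tail:      {2 * norm.sf(k):.1e}")  # ~5.7e-07
```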
Also I think that the null hypothesis in the case of testing for superconductivity is not "superconductivity does not exist" but more something like "this particular compound has a nonzero resistance at this specific temperature".
I think the main problem behind the reproducibility crisis in the social sciences is bad statistics. The assumption of normality is rarely tested, and t-tests are thrown about left and right. Combine this with bad experimental design, a lack of corrections for multiple comparisons, and the various other problems and biases you point out, and you get bad results.
Siggy says
@robert79,
I don’t think Chebyshev’s theorem applies, since when they estimate sigma they aren’t including systematic error or huge outliers from experimental error.
EnkidumCan'tLogin says
“I don’t immediately see why it’s an issue that many publications have incorrect conclusions. Publication is part of the process by which we figure out that the conclusions are wrong!”
Yes, this is exactly right. Speaking as someone in psychology, a 50% replicability rate (which is the claim I’ve heard) seems about right, and desirable. We want enough leniency to encourage exploration of promising but not yet certain areas, but not just allow anything at all, and we want to avoid obvious abuses like the pizza articles PZ has been highlighting. The only issue I have with the replicability stuff is that statistics do tend to be misused and exploratory work presented as hypothesis-driven, which is a serious mistake.
hjhornbeck says
I don’t think E. T. Jaynes gets his due; he was a hardcore Bayesian before it was cool and likely dampened the spread of frequentism into physics.
The null hypothesis is tied to the dataset you’re analyzing, so it would be more like “I will measure substantial resistance when I do X” than “superconductivity exists.” You bring up a good point, though: “all resistance will disappear” is unprovable under naive falsification, because experimental error will always conspire to create a non-zero resistance.
Technically they’re not called into question, though. The statistical power of a study is the probability that it correctly rejects a false null hypothesis. The typical power of a psychology study is in the 30-50% range; so if an overview of psychology studies finds that only 40% can be replicated, that’s what we’d expect to find even if every single one correctly rejected the null.
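To make that concrete, here is a toy simulation with illustrative numbers of my own choosing (not from any real study): every effect is real, but each replication attempt has only about 40% power, so only about 40% of replications come out significant.

```python
# Toy simulation: every "original" effect is genuine, but each replication
# attempt has only ~40% power, so only ~40% of replications reach p < 0.05.
# Sample size and effect size are arbitrary choices giving roughly 40% power.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n, effect, trials = 20, 0.57, 2000
successes = 0
for _ in range(trials):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(effect, 1.0, n)  # the effect really is there, every time
    _, p = ttest_ind(control, treated)
    successes += p < 0.05
print(f"replication rate ≈ {successes / trials:.2f}")  # roughly 0.4
```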
Physics has big problems with over-hyped studies, too, with the EM Drive being only the most recent example.
Incidentally, have you read Cohen on null hypotheses? I consider it required reading on the subject:
Siggy says
@hjhornbeck
I must interject that superconductivity isn’t quite the same as zero resistance. Vortex motion can dissipate energy. But anyway…
In psychology and medicine, the null hypothesis is generally that all effect sizes are zero. And in my research, this is sometimes applicable. For example, I might heat the sample and see what changes, and the null hypothesis is that nothing changes. But sometimes that null would itself be quite shocking, like if superconductivity persisted despite raising the temperature. So it would be strange to take it as the default hypothesis.
Or what if I’m simply trying to figure out the critical temperature of a new sample? I don’t have any default hypothesis about that, it’s just some number we measure.
I think the psychology studies in question claim to have greater power than that. I didn’t read the Reproducibility Project paper, but I did read a Comment which criticized it. Their main criticism was that the attempted replications changed the procedures of the original studies in relevant ways. Of course, the authors reply that the original studies were supposed to be robust to such variations in procedure. I don’t know what to make of that.
I have not read Cohen, but I will take a look.