On his show Last Week Tonight, John Oliver gives an excellent segment on the nature of science and how the relentless drive to hype it so as to provide sensationalist headlines has resulted in a highly distorted view of how it works. (Thanks to reader Jeff Hess for the tip.)
One of the things Oliver mentions is ‘p-hacking’: rather than designing an experiment to test a hypothesized link between two variables and then checking whether the correlation is statistically significant, you mine already existing data for any pairs of variables that happen to be correlated at a statistically significant level and, if you find some, publish just those results.
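To see how easy this is, here is a small simulation (my own illustration, not Oliver’s, using only the Python standard library): generate a batch of completely unrelated random variables, test every pair, and report only the pairs that clear the conventional significance bar.

```python
import math
import random

random.seed(1)

def correlation(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# Twenty completely unrelated random variables, 30 observations each.
n_obs, n_vars = 30, 20
data = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

# "P-hack": test all pairs, keep only the "significant" ones.
# |r| > 2/sqrt(n) is a rough two-sided 5% cutoff for uncorrelated data.
cutoff = 2 / math.sqrt(n_obs)
hits = []
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        r = correlation(data[i], data[j])
        if abs(r) > cutoff:
            hits.append((i, j, r))

n_pairs = n_vars * (n_vars - 1) // 2
print(f"{len(hits)} 'significant' correlations out of {n_pairs} pairs")
# Around 5% of the 190 pairs clear the bar by chance alone; publish just
# those, hide the rest, and you have a batch of bogus "findings".
```

None of these variables have anything to do with each other, yet the sifting procedure reliably turns up a handful of “significant” correlations.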
This whole issue of p-values is tricky. The Discover magazine blogger Neuroskeptic has produced a video that attempts to explain what it is.
Christie Aschwanden reports on a meeting of the American Statistical Association where 26 experts issued a statement that said that it is high time to stop misusing p-values because the consequences are serious.
The misuse of the p-value can drive bad science (there was no disagreement over that), and the consensus project was spurred by a growing worry that in some scientific fields, p-values have become a litmus test for deciding which studies are worthy of publication. As a result, research that produces p-values surpassing an arbitrary threshold is more likely to be published, while studies of greater or equal scientific importance may remain in the file drawer, unseen by the scientific community.
The results can be devastating, said Donald Berry, a biostatistician at the University of Texas MD Anderson Cancer Center. “Patients with serious diseases have been harmed,” he wrote in a commentary published today. “Researchers have chased wild geese, finding too often that statistically significant conclusions could not be reproduced.” Faulty statistical conclusions, he added, have real economic consequences.
One of the most important messages is that the p-value cannot tell you if your hypothesis is correct. Instead, it’s the probability of your data given your hypothesis. That sounds tantalizingly similar to “the probability of your hypothesis given your data,” but they’re not the same thing, said Stephen Senn, a biostatistician at the Luxembourg Institute of Health. To understand why, consider this example. “Is the pope Catholic? The answer is yes,” said Senn. “Is a Catholic the pope? The answer is probably not. If you change the order, the statement doesn’t survive.”
A common misconception among nonstatisticians is that p-values can tell you the probability that a result occurred by chance. This interpretation is dead wrong, but you see it again and again and again and again. The p-value only tells you something about the probability of seeing your results given a particular hypothetical explanation — it cannot tell you the probability that the results are true or whether they’re due to random chance. The ASA statement’s Principle No. 2: “P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.”
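A quick simulation (mine, not from the ASA statement; plain Python standard library) makes Principle No. 2 concrete. When the null hypothesis is true, every result is “due to chance”, yet the p-value is not 1: it is spread roughly uniformly between 0 and 1, and about 5% of experiments still come out “significant”.

```python
import random
from statistics import NormalDist

random.seed(0)
norm = NormalDist()  # standard normal distribution

def p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided z-test p-value for H0: population mean == mu0 (sigma known)."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / n ** 0.5)
    return 2 * (1 - norm.cdf(abs(z)))

# Simulate 10,000 experiments in which the null hypothesis is TRUE:
# the data really are pure noise with mean 0.
pvals = [p_value([random.gauss(0, 1) for _ in range(25)])
         for _ in range(10_000)]

# Every one of these results is "due to chance", yet the p-values do not
# cluster near 1; they are spread roughly uniformly over [0, 1], and
# about 5% dip below 0.05 anyway.
frac_significant = sum(p < 0.05 for p in pvals) / len(pvals)
print(f"fraction with p < 0.05 under a true null: {frac_significant:.3f}")
```

If the p-value really measured “the probability the result occurred by chance”, it would be close to 1 in every one of these simulated experiments; it plainly is not.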
The p-hacking process Oliver talks about is similar to the kinds of things that people like to believe in that seem to show that there are deep, mysterious underlying forces at work in life. For example, some may have seen this email that circulated some time ago showing the seemingly astounding similarities between the murders of presidents Lincoln and Kennedy.
Abraham Lincoln was elected to Congress in 1846.
John F. Kennedy was elected to Congress in 1946.
Abraham Lincoln was elected President in 1860.
John F. Kennedy was elected President in 1960.
The names Lincoln and Kennedy each contain seven letters.
Both were particularly concerned with civil rights.
Both wives lost their children while living in the White House.
Both Presidents were shot on a Friday.
Both were shot in the head.
Lincoln’s secretary, Kennedy, warned him not to go to the theatre.
Kennedy’s secretary, Lincoln, warned him not to go to Dallas.
Both were assassinated by Southerners.
Both were succeeded by Southerners.
Both successors were named Johnson.
Andrew Johnson, who succeeded Lincoln, was born in 1808.
Lyndon Johnson, who succeeded Kennedy, was born in 1908.
John Wilkes Booth was born in 1839.
Lee Harvey Oswald was born in 1939.
Both assassins were known by their three names.
Both names comprise fifteen letters.
Booth ran from the theater and was caught in a warehouse.
Oswald ran from a warehouse and was caught in a theater.
Booth and Oswald were assassinated before their trials.
This kind of thing can seem highly impressive to those who are unaware how it is possible, if time-consuming, to sift through the vast amounts of data associated with any real-life event and extract just the few items that fit a theory. Snopes has done a good job of showing the vacuity of the Lincoln-Kennedy coincidences.
Trying to give the general public an idea of how science actually works, and when we can take scientific conclusions seriously and when we should be skeptical, is the theme of the book that I am currently working on, tentatively titled The Paradox of Science.
Roeland de Bruijn says
I am waiting on that book! My brother-in-law is an intelligent dude, but once the label ‘scientific’ has been attached, he believes the findings. Which means he believes in Reiki, energy fields, traditional Chinese medicine, etc. He complains I am closed-minded, while I call him gullible.
Problem is that he wants to believe in these kinds of things. The type of book you are describing would be a perfect gift for him, so he can raise his knowledge a bit.
Please write faster…
Ben Goldacre has been banging on about this for years. His book, Bad Pharma, while something of a slog and quite the doorstop, is an impassioned plea for science to do things differently. Specifically, for trials to be registered in advance with a clear definition of what it is they’re testing. Too often trials that produce a cluster of results draw a target round the cluster after the event, and trials that don’t produce a cluster just get shelved as if they never happened. Trials that fail to produce results are just as important, but not as lucrative, which means the model for how we do science is broken.
Sure, this is bogus, but the fact that the name Ronald Wilson Reagan is composed of 6, 6, and 6 letters is extremely significant!
As far as the problem of the lack of confirmation studies goes, I would think that universities should pair up. If universities X and Y both have fairly sizable neuroscience research departments, for example, why not agree that when university X issues a new study, university Y will assign some undergrads and a mentor to repeat those experiments and issue a study, and vice versa? That way, the world gets at least one follow-up study for each new study and some undergrads get experience in trying to get published. Yes, they may have a harder time getting a non-novel study published, but the two research labs could work together to try to encourage journals to publish it. If the two universities had publicized their partnership agreement well, they could try to “shame” the journals into cooperating, because everyone agrees more follow-up studies are needed.
I meant to say graduate students, not undergrads.
“One of the most important messages is that the p-value cannot tell you if your hypothesis is correct. Instead, it’s the probability of your data given your hypothesis.”
Close. A p-value is the probability of observing data at least as extreme as the data actually observed, given that the null hypothesis is true. If statisticians can’t get the definition right, then there really is a problem.
Are p-values ever actually used “correctly” in practice? Presumably a small p-value (the probability of the data given that the null hypothesis is true) is only useful if you think the null hypothesis is unlikely to begin with. If you expect beforehand that the null hypothesis is quite likely (and therefore your results are “surprising” enough to be published), then even if the p-value is small, the probability that the null hypothesis is true could still be large. This is without even worrying about multiple comparisons etc.
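A back-of-the-envelope calculation makes this point concrete (the numbers here are my illustrative assumptions, not anyone’s measured values): even with p < 0.05, the fraction of “significant” findings that reflect a real effect depends heavily on the prior.

```python
# Illustrative assumptions: of all hypotheses tested, 10% describe a real
# effect; the tests have 80% power; the significance threshold is 0.05.
prior_real = 0.10
power = 0.80
alpha = 0.05

true_positives = prior_real * power          # real effects correctly detected
false_positives = (1 - prior_real) * alpha   # true nulls rejected anyway
p_real_given_sig = true_positives / (true_positives + false_positives)

print(f"P(effect is real | p < 0.05) = {p_real_given_sig:.2f}")  # 0.64
# Under these assumptions over a third of "significant" findings are
# spurious, even though every test used the conventional 5% threshold.
```

Change the prior to 1% and the same arithmetic gives a posterior well under 15%, which is the commenter’s point: a small p-value on its own says little when the tested hypothesis was implausible to begin with.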
Multiple comparisons first: some people do use Bonferroni correction and other techniques for handling multiple comparisons.
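For readers unfamiliar with it, Bonferroni is the bluntest of these corrections: divide the significance threshold by the number of tests performed. A minimal sketch:

```python
def bonferroni(pvals, alpha=0.05):
    """Return, for each test, whether it survives a Bonferroni correction:
    each p-value is compared against alpha divided by the number of tests."""
    m = len(pvals)
    return [p < alpha / m for p in pvals]

# Five tests: the per-test threshold becomes 0.05 / 5 = 0.01, so only
# the first p-value is still considered significant.
print(bonferroni([0.004, 0.03, 0.02, 0.20, 0.60]))
# -> [True, False, False, False, False]
```

It is deliberately conservative: it controls the chance of even one false positive across all tests, at the cost of missing some real effects.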
A small p-value by itself doesn’t say anything about the scientific significance (i.e. importance) of the results. The p-value is influenced by the size of the actual difference (if one exists) or strength of the relationship (correlation), and the sample size.
If the null hypothesis is likely to be true, then it is unlikely to be experimentally tested, because the results of such an experiment are unlikely to be interesting. That’s not necessarily what should happen, but it usually does.
@jaxkayaker: “If the null hypothesis is likely to be true, then it is unlikely to be experimentally tested, because the results of such an experiment are unlikely to be interesting. That’s not necessarily what should happen, but it usually does.”
My impression is the opposite (although I am not myself in a field that uses statistical hypothesis testing, so this is an outsider’s view). I was under the impression that experimenters are keen to search for “surprising” results, i.e., where one’s “prior” for the null hypothesis would have indicated that the null is likely to be true. Confirmation of what we already think we know is less sexy and won’t get into the high-ranking journals.
Researchers are keen for novel, interesting, surprising results. You’re confused about the definition of null hypothesis, thus we’re having a disconnect in communication.
As far as I understand, the null hypothesis is usually “that there is no effect”. Hence a result is surprising when one would otherwise assume the null likely to be true, but nevertheless one obtains some extreme data that would occur less than, e.g., 5% of the time that the null is true. However, in this case, your prior for the alternative hypothesis could be so small it is unlikely to be true even in the face of the data.
John Morales says
AMartin @11, which is where replicability is applicable.
(It’s usually “that there is no effect”, but statistically it’s specifically the hypothesis being tested, so it could be the other way around.)
I retract my statement; you know approximately what is meant by the null hypothesis. However, people like novel results and don’t like null results. Null results are rarely reported (the “file drawer problem”) and are difficult to publish, particularly in a high-visibility journal. Novelty means not previously published. One criticism of null hypothesis significance testing has been that the nulls tested are trivially false: people just don’t go around testing null hypotheses they think are actually true in the hope of stumbling on one that is surprisingly false. If they did, they would reject the null at a rate equal to the significance level (usually 5%) even when the detected differences were spurious. You’re right that if a null with a high prior were rejected, it would be found interesting and investigated further, but testing likely nulls is considered a waste of time, money and effort.