There has long been a common method used in the sciences and social sciences for deciding whether results are worth publishing. One starts with what is called the ‘null hypothesis’, a kind of baseline that might represent (say) the current conventional wisdom, and then checks whether the results of the experiment are consistent with it. If they are not consistent, the results are considered more interesting than if they were. This requires the use of statistics, and then one faces the problem of deciding whether the result is a real effect or a statistical anomaly. For a long time, something called the ‘p-value’ was used to make this decision, with a p-value of 0.05 serving as the benchmark for statistical significance.
For example, if you are testing whether a coin is fair, the null hypothesis would be that it is. Under that hypothesis, you can calculate a number M such that, for (say) N tosses of the coin, there is a 95% probability that the number of heads lies in the range N/2 - M to N/2 + M. If the actual result lies outside that range, one says that the hypothesis of a fair coin has been rejected and that the result is statistically significant. More loosely, one can say that one has 95% confidence that the null hypothesis has been falsified.
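The range described above can be computed exactly from the binomial distribution. Here is a minimal sketch using only Python's standard library; the choice of N = 100 tosses is illustrative, not from the original text, and the function assumes N is even so that N/2 is a whole number.

```python
from math import comb

def binom_pmf(n, k, p=0.5):
    """Probability of exactly k heads in n tosses of a coin with heads-probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def central_range(n, coverage=0.95):
    """Smallest symmetric range n/2 - m .. n/2 + m whose total probability
    under a fair coin is at least `coverage`. Assumes n is even."""
    mean = n // 2
    for m in range(mean + 1):
        prob = sum(binom_pmf(n, k) for k in range(mean - m, mean + m + 1))
        if prob >= coverage:
            return mean - m, mean + m, prob

low, high, prob = central_range(100)
print(low, high, round(prob, 4))  # 40 60 0.9648
```

For 100 tosses of a fair coin, the number of heads falls between 40 and 60 with probability about 96.5%; any count outside that range would be declared "statistically significant" at the 5% level. (Because the distribution is discrete, the smallest range covering at least 95% in fact covers a little more.)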
The 95% benchmark (or equivalently the p-value of 0.05) has become the standard for most purposes, but recently the whole idea of statistical significance and this method of hypothesis testing has come under fire because it allows too many inconclusive results to be published as if they were conclusive. After all, even within that framework, 5% (one in 20) of experiments would produce results outside the pre-determined range even if the coin were actually fair.
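That false-positive rate is easy to see by simulation. The sketch below (hypothetical parameters: 100 tosses per experiment, the 40–60 acceptance range for a fair coin, 10,000 repeated experiments) counts how often a genuinely fair coin is nonetheless flagged as "significant":

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

def looks_significant(n_tosses=100, low=40, high=60):
    """Toss a fair coin n_tosses times; report True if the head count
    falls outside the central 95% range computed for a fair coin."""
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    return not (low <= heads <= high)

trials = 10_000
false_positives = sum(looks_significant() for _ in range(trials))
print(false_positives / trials)  # roughly 0.035
```

The rate comes out near 3.5% rather than exactly 5%, because the discrete 40–60 range actually covers about 96.5% of outcomes; the general point stands: with a 0.05 threshold, roughly one fair coin in twenty to thirty will be published as "unfair".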
But that is not the only problem. The American Statistician published an entire issue on this topic last month, and its editorial came out strongly against this method, giving the following ‘don’ts’:
- Don’t base your conclusions solely on whether an association or effect was found to be “statistically significant” (i.e., the p-value passed some arbitrary threshold such as p < 0.05).
- Don’t believe that an association or effect exists just because it was statistically significant.
- Don’t believe that an association or effect is absent just because it was not statistically significant.
- Don’t believe that your p-value gives the probability that chance alone produced the observed association or effect or the probability that your test hypothesis is true.
- Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof).
The article goes on:
The ASA Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of “statistical significance” be abandoned. We take that step here. We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive, whether expressed in words, by asterisks in a table, or in some other way.
Regardless of whether it was ever useful, a declaration of “statistical significance” has today become meaningless. Made broadly known by Fisher’s use of the phrase, the original intention for statistical significance was simply as a tool to indicate when a result warrants further scrutiny. But that idea has been irretrievably lost. Statistical significance was never meant to imply scientific importance, and the confusion of the two was decried soon after its widespread use. Yet a full century later the confusion persists.
And so the tool has become the tyrant. The problem is not simply use of the word “significant” (although the statistical and ordinary language meanings of the word are indeed now hopelessly confused); the term should be avoided for that reason alone. The problem is a larger one, however: using bright-line rules for justifying scientific claims or conclusions can lead to erroneous beliefs and poor decision making (ASA statement, Principle 3). A label of statistical significance adds nothing to what is already conveyed by the value of p; in fact, this dichotomization of p-values makes matters worse.
For example, no p-value can reveal the plausibility, presence, truth, or importance of an association or effect. Therefore, a label of statistical significance does not mean or imply that an association or effect is highly probable, real, true, or important. Nor does a label of statistical nonsignificance lead to the association or effect being improbable, absent, false, or unimportant. Yet the dichotomization into “significant” and “not significant” is taken as an imprimatur of authority on these characteristics. In a world without bright lines, on the other hand, it becomes untenable to assert dramatic differences in interpretation from inconsequential differences in estimates. As Gelman and Stern famously observed, the difference between “significant” and “not significant” is not itself statistically significant.
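Gelman and Stern's observation can be made concrete with a hypothetical pair of studies (the numbers below are illustrative, not from either article): one study estimates an effect of 25 with a standard error of 10, the other an effect of 10 with the same standard error. Under the usual normal approximation:

```python
from math import erf, sqrt

def two_sided_p(estimate, se):
    """Two-sided p-value for a normal test of the hypothesis 'effect = 0'."""
    z = abs(estimate / se)
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

p_a = two_sided_p(25, 10)  # z = 2.5 -> p ~ 0.012, "significant"
p_b = two_sided_p(10, 10)  # z = 1.0 -> p ~ 0.317, "not significant"

# Yet the difference between the two estimates is itself far from significant:
se_diff = sqrt(10**2 + 10**2)           # standard error of the difference, ~14.1
p_diff = two_sided_p(25 - 10, se_diff)  # z ~ 1.06 -> p ~ 0.29
```

One study would be reported as a positive finding and the other as a null result, even though the data give no real grounds for saying the two studies disagree.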
A comment published in Nature makes the same point:
For several generations, researchers have been warned that a statistically non-significant result does not ‘prove’ the null hypothesis (the hypothesis that there is no difference between groups or no effect of a treatment on some measured outcome)1. Nor do statistically significant results ‘prove’ some other hypothesis. Such misconceptions have famously warped the literature with overstated claims and, less famously, led to claims of conflicts between studies where none exists.
The trouble is human and cognitive more than it is statistical: bucketing results into ‘statistically significant’ and ‘statistically non-significant’ makes people think that the items assigned in that way are categorically different6–8. The same problems are likely to arise under any proposed statistical alternative that involves dichotomization, whether frequentist, Bayesian or otherwise.
Neither journal is advocating an ‘anything goes’ approach. Instead, both are asking for more nuanced and detailed analyses and presentations of data, and both make recommendations as to what should replace the p-value and statistical significance as measures of worthiness.