# Calls to end use of statistical significance

There has long been a common method used in the natural and social sciences for deciding whether results are worth publishing. One starts out with what is called the ‘null hypothesis’, a kind of baseline that might represent (say) the current conventional wisdom, and then one sees whether the results of the experiment are consistent with it. If they are not consistent, then the results are considered to be more interesting than if they were. This requires the use of statistics, and then one has the problem of deciding whether the result is a real effect or a statistical anomaly. For a long time, something called the ‘p-value’ has been used to make this decision, with a p-value of 0.05 serving as the benchmark for statistical significance.

For example, if you are testing whether a coin is fair, then the null hypothesis would be that it is. With that hypothesis, you can calculate a number M such that, for (say) N tosses of the coin, there is a 95% probability that the number of heads lies in the range N/2-M to N/2+M. If the actual result lies outside that range, one says that the fairness of the coin has been rejected and that the result is statistically significant. More loosely, people say that they have 95% confidence that the null hypothesis has been falsified, though strictly the calculation only says how unusual the result would be if the coin were fair.
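To make the coin example concrete, here is a minimal Python sketch that finds M exactly from the binomial distribution. The function name `acceptance_half_width` is my own, invented for illustration, and the range is assumed to be centred on N/2, the expected number of heads for a fair coin.

```python
from math import comb


def acceptance_half_width(n_tosses: int, coverage: float = 0.95) -> int:
    """Smallest whole number M such that a fair coin lands within M heads
    of n_tosses / 2 with probability at least `coverage`."""
    total_outcomes = 2 ** n_tosses
    centre = n_tosses / 2
    for m in range((n_tosses + 1) // 2 + 1):
        # Count the head/tail sequences whose number of heads is within m of the centre.
        inside = sum(
            comb(n_tosses, heads)
            for heads in range(n_tosses + 1)
            if abs(heads - centre) <= m
        )
        if inside / total_outcomes >= coverage:
            return m
    raise ValueError("coverage must be between 0 and 1")


print(acceptance_half_width(100))  # 10
```

For 100 tosses this gives M = 10, so anything from 40 to 60 heads is treated as consistent with a fair coin. Because the number of heads is a whole number, the range actually covers a little more than 95% (about 96.5% here).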

The 95% benchmark (or alternatively the p-value of 0.05) has become the standard for most purposes, but recently the whole idea of statistical significance and this method of hypothesis testing has come under fire because it enables too many results that are not conclusive to be published as if they were. After all, even within that framework, it is possible that 5% (one in 20) of experiments would lie outside the pre-determined range even if the coin were actually fair.
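The one-in-20 figure adds up quickly when many experiments are run. As a rough sketch, assuming the experiments are independent and each has exactly a 5% chance of a false alarm (both simplifying assumptions of mine, as is the function name), the chance that at least one of them looks ‘significant’ even though nothing is going on is:

```python
def prob_at_least_one_false_positive(n_experiments: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent tests of a true null
    hypothesis falls outside the pre-determined range."""
    # Each test stays inside the range with probability (1 - alpha).
    return 1 - (1 - alpha) ** n_experiments


for n in (1, 5, 20):
    print(n, round(prob_at_least_one_false_positive(n), 3))
```

With 20 such experiments, the chance of at least one spurious ‘significant’ result is about 64%.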

But that is not the only problem. The American Statistician published an entire issue on this topic last month, and its editorial came out strongly against this method, giving the following ‘don’ts’:

• Don’t base your conclusions solely on whether an association or effect was found to be “statistically significant” (i.e., the p-value passed some arbitrary threshold such as p< 0.05).
• Don’t believe that an association or effect exists just because it was statistically significant.
• Don’t believe that an association or effect is absent just because it was not statistically significant.
• Don’t believe that your p-value gives the probability that chance alone produced the observed association or effect or the probability that your test hypothesis is true.
• Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof).
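The last of these points can be illustrated with the coin again. The sketch below is my own illustration, not something from the editorial: it uses the normal approximation to the binomial distribution, and the function name and the two made-up sets of tosses are hypothetical. It compares a practically negligible bias seen in a huge number of tosses with a large apparent bias seen in only a few.

```python
from math import erfc, sqrt


def two_sided_p_value(heads: int, tosses: int) -> float:
    """Two-sided p-value for the null hypothesis of a fair coin,
    using the normal approximation to the binomial distribution."""
    observed = heads / tosses
    standard_error = sqrt(0.25 / tosses)  # standard deviation of the proportion for a fair coin
    z = (observed - 0.5) / standard_error
    return erfc(abs(z) / sqrt(2))


tiny_effect = two_sided_p_value(505_000, 1_000_000)  # 50.5% heads
large_effect = two_sided_p_value(12, 20)  # 60% heads

print(f"50.5% heads in 1,000,000 tosses: p = {tiny_effect:.2e}")
print(f"60% heads in 20 tosses:          p = {large_effect:.2f}")
```

The barely biased coin gets an extremely small p-value simply because there is so much data, while the 60% result falls well short of the 0.05 threshold. The p-value on its own says little about how big or important the effect is.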

The article goes on:

The ASA Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of “statistical significance” be abandoned. We take that step here. We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p< 0.05,” and “nonsignificant” survive, whether expressed in words, by asterisks in a table, or in some other way.
Regardless of whether it was ever useful, a declaration of “statistical significance” has today become meaningless. Made broadly known by Fisher’s use of the phrase, the original intention for statistical significance was simply as a tool to indicate when a result warrants further scrutiny. But that idea has been irretrievably lost. Statistical significance was never meant to imply scientific importance, and the confusion of the two was decried soon after its widespread use. Yet a full century later the confusion persists.

And so the tool has become the tyrant. The problem is not simply use of the word “significant,” although the statistical and ordinary language meanings of the word are indeed now hopelessly confused; the term should be avoided for that reason alone. The problem is a larger one, however: using bright-line rules for justifying scientific claims or conclusions can lead to erroneous beliefs and poor decision making (ASA statement, Principle 3). A label of statistical significance adds nothing to what is already conveyed by the value of p; in fact, this dichotomization of p-values makes matters worse.

For example, no p-value can reveal the plausibility, presence, truth, or importance of an association or effect. Therefore, a label of statistical significance does not mean or imply that an association or effect is highly probable, real, true, or important. Nor does a label of statistical nonsignificance lead to the association or effect being improbable, absent, false, or unimportant. Yet the dichotomization into “significant” and “not significant” is taken as an imprimatur of authority on these characteristics. In a world without bright lines, on the other hand, it becomes untenable to assert dramatic differences in interpretation from inconsequential differences in estimates. As Gelman and Stern famously observed, the difference between “significant” and “not significant” is not itself statistically significant.

For several generations, researchers have been warned that a statistically non-significant result does not ‘prove’ the null hypothesis (the hypothesis that there is no difference between groups or no effect of a treatment on some measured outcome). Nor do statistically significant results ‘prove’ some other hypothesis. Such misconceptions have famously warped the literature with overstated claims and, less famously, led to claims of conflicts between studies where none exists.

The trouble is human and cognitive more than it is statistical: bucketing results into ‘statistically significant’ and ‘statistically non-significant’ makes people think that the items assigned in that way are categorically different. The same problems are likely to arise under any proposed statistical alternative that involves dichotomization, whether frequentist, Bayesian or otherwise.

Neither journal is advocating an ‘anything goes’ approach. Instead they are asking for more nuanced and detailed analyses and presentations of data, and both make recommendations as to what should replace the p-value and statistical significance as a measure of worthiness.

1. Sam N says

This was always, as suggested, a lazy approach for people who didn’t want to evaluate publications on general merits. I think my evaluation is probably based on a combination of effect size, p-value, and wherever possible, a visual comparison of the distributions. If the methodology seems clean, and visually, the group measures are clearly distinct, I’ve never had any issues with finding large effect sizes and ridiculously small p-values. If the group measures are not very distinct or the effect size is small, I maintain skepticism of what is causing the difference regardless of p-value. And if I see p-values close to 0.05, I may give the result a 1% weighting in my thoughts regarding the matter at hand. So I find the binary way it is regarded is useful, to give me a lot of skepticism of results that scarcely clear it.

2. Owlmirror says

FWIW, HJ Hornbeck has written about this as well, and if I understand the post (and linked posts) correctly, the problem is that the entire statistical methodology is wrongheaded, and scientists should be explicitly Bayesian rather than frequentist. Or something like that.

3. Marshall says

Another aspect is that the calculation of a p-value is dependent *entirely* on the model used. It is nothing more than the statement of how likely it is that your *model* will produce a value at least as extreme as the value observed. In the case of a coin, this is easy because the binomial distribution accurately models the results of a fair coin flip--but in more complex endeavors that actually require scientific research, simplifying models are incredibly easy to tweak in the favor of a smaller p-value.
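Even with a coin, the model matters. A well-known textbook illustration (not something from the comment above) takes 9 heads and 3 tails and asks how the tosses were planned. The sketch below assumes one-sided p-values for ‘too many heads’, and the function names are hypothetical. If the plan was to toss exactly 12 times, the binomial model applies; if the plan was to keep tossing until the third tail, the negative binomial model applies instead.

```python
from math import comb


def p_fixed_number_of_tosses(heads: int, tails: int) -> float:
    """One-sided p-value when the total number of tosses was fixed in advance
    (binomial model): P(at least `heads` heads in heads + tails fair tosses)."""
    n = heads + tails
    return sum(comb(n, k) for k in range(heads, n + 1)) / 2 ** n


def p_toss_until_tails(heads: int, tails: int) -> float:
    """One-sided p-value when tossing continued until the `tails`-th tail
    (negative binomial model): P(at least `heads` heads before that tail).
    This is the chance of at most tails - 1 tails in the first
    heads + tails - 1 tosses."""
    m = heads + tails - 1
    return sum(comb(m, j) for j in range(tails)) / 2 ** m


print(round(p_fixed_number_of_tosses(9, 3), 3))  # 0.073
print(round(p_toss_until_tails(9, 3), 3))  # 0.033
```

The same 12 tosses fall on either side of the 0.05 line depending only on which model of the experiment is assumed.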