I am not aware of any theoretical or empirical justification for imposing an arbitrary absolute threshold for P-values below which we consider a difference “significant”–and thus considered scientifically relevant–rather than taking into account the actual P-value and adjusting *how* scientifically relevant we consider a particular difference to be.

For example, if your P-value is less than 0.00001, then it is exceedingly likely that the difference you have observed reflects a real difference between the underlying populations. If your P-value is between 0.05 and 0.01, then it is reasonably likely. If your P-value is 0.25, then it is unlikely.

But why is the title of a published paper allowed to refer to a difference with a P-value of 0.49995, but a difference with a P-value of 0.50005 is verboten to consider at all? Is there any justification for this practice other than it’s-just-what-we-do?

## 25 comments

## Cuttlefish

February 9, 2014 at 2:34 pm (UTC -7)

None.

In theory, the level chosen should be a reflection of the costs of type 1 and type 2 errors–which would be worse, to claim an effect that is not there, or to miss one that is?

## Marcus Ranum

February 9, 2014 at 2:56 pm (UTC -7)

This is what is called a “vague concept” and the only way to deal with them is to be somewhat arbitrary.

If I have one grain of sand, and add another grain and another and another sooner or later I’ll have a “pile” – does taking away the grain that came before the “pile” mean it’s no longer a pile of sand? Is there a certain number of grains below which it’s no longer a beach?

Vague concepts reveal language’s true nature – as an aid to thinking that allows us to group concepts and entities. Because if I need to tell you “there’s a herd of horses out there” it’s much easier than counting them all so I can tell you “there are 35 horses out there” – we use and rely on imprecision because it’s close enough to get the job done, sometimes. Although, I find it exceptionally irritating when people get up to the edge of stereotyping, instead, e.g: “you lefties” um… Yes, “left” and “right” are a possible metric on a political spectrum but they oversimplify things exactly 98.28% too much. (I was going to write “a wee bit” but that’s also a vague concept)

## lou Jost

February 9, 2014 at 5:27 pm (UTC -7)

It is much worse than that. P-values are almost never what we really want to know. They are usually based on some null hypothesis that is known to be false, so the experimenter can always get a significant p-value if his or her sample size is large enough. For example, a common null hypothesis is that some parameter of the experimental group is the same as that of the control group. But for most parameters, especially those that are measured on a continuous scale rather than an integer scale, it is certain that the parameter’s true value in the experimental group will not be EXACTLY equal to the parameter’s true value in the control group. The difference might appear only in the tenth decimal place, but even that tiny difference can produce a significant p-value if the sample size is large enough. What we REALLY want to know is the magnitude of the difference between the parameter values in the two groups. We also need to know our uncertainty in the parameter’s value (i.e. the confidence interval). To interpret the results properly, we also have to choose a parameter whose magnitude is meaningful in some absolute sense (not all parameters are usable). I always write a paragraph or two in my scientific papers trying to get people to see that p-values are not usually useful. But most scientists don’t understand.
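
To put rough numbers on the sample-size point, here is a minimal sketch. It assumes a two-sample z-test with a known, equal standard deviation in both groups, and it assumes the observed difference in means happens to equal the true difference. The function name and the values `delta=0.001` and `sigma=1.0` are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def expected_two_sided_p(delta, sigma, n_per_group):
    """Two-sided p-value of a two-sample z-test when the observed
    difference in means equals `delta` (assumed known, equal sigma)."""
    standard_error = sigma * sqrt(2.0 / n_per_group)
    z = delta / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# The same tiny difference, tested with ever larger samples.
for n in (100, 10_000, 1_000_000, 100_000_000):
    p = expected_two_sided_p(delta=0.001, sigma=1.0, n_per_group=n)
    print(f"n per group = {n:>11,d}   p = {p:.3g}")
```

The difference never changes, only the sample size does, yet the p-value moves from nowhere near significant to far below any conventional threshold.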

## Darkling

February 9, 2014 at 5:34 pm (UTC -7)

Effect size and confidence intervals. I don’t care if your difference is statistically significant but the effect size is so small as to be biologically meaningless.

Although I do remember reading a paragraph in a now otherwise forgotten paper where they started off by describing the difference between two means and then ended by stating that the means were not significantly different.

## lou Jost

February 9, 2014 at 6:15 pm (UTC -7)

Darkling, the flip side of that is also interesting. People don’t publish insignificant results, but this approach obscures the distinction between two very different cases with insignificant p-values:

1) Sample size too small, confidence interval very large, so biologically significant effects can’t be detected even if present. Publication not warranted.

2) Large enough sample size so that confidence intervals are narrow. Can confidently say that there is no biologically significant effect. This is positive information and warrants publication in many fields.
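
The two cases above can be told apart by looking at the confidence interval rather than the p-value. The following is only a sketch: it assumes a normal approximation with a known, equal standard deviation in both groups, and `meaningful_diff` is a hypothetical threshold for what counts as a biologically significant difference.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(observed_diff, sigma, n_per_group, level=0.95):
    """Normal-approximation CI for a difference in two group means."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sigma * sqrt(2.0 / n_per_group)
    return observed_diff - half_width, observed_diff + half_width


def interpret(ci, meaningful_diff):
    """Classify a CI against a hypothetical 'meaningful' effect size."""
    low, high = ci
    if -meaningful_diff < low and high < meaningful_diff:
        return "no meaningful effect"   # case 2: narrow CI inside the band
    if low > 0 or high < 0:
        return "effect detected"        # CI excludes zero
    return "inconclusive"               # case 1: CI too wide to say


small_study = diff_confidence_interval(0.1, sigma=1.0, n_per_group=10)
large_study = diff_confidence_interval(0.1, sigma=1.0, n_per_group=10_000)
print(interpret(small_study, meaningful_diff=0.5))
print(interpret(large_study, meaningful_diff=0.5))
```

With the same observed difference, the small study cannot rule out a meaningful effect, while the large one can.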

## maudell

February 9, 2014 at 6:43 pm (UTC -7)

I believe that the p<0.05 threshold is used for purely historical reasons. Just calculating a simple multivariate OLS regression used to be pretty long, and the calculations were often made by non-scientists. So they arbitrarily picked a standard significance level. 19 out of 20 seemed to be decent. But no, it shouldn't be seen as a significant/not significant dichotomy.

## Darkling

February 9, 2014 at 6:59 pm (UTC -7)

I think that the “Historical reasons” probably date back to Fisher and since he used it everyone else did too. I can see the utility of it serving as a good rule of thumb, however it’s not magic or set in stone.

As an aside, an entertaining paper was published back in 1994 titled “The Earth Is Round (p < .05)” in the American Psychologist by J Cohen. I suppose you could say that discussion about the utility of p-values is a sign that a discipline is starting to mature… :p

## dean

February 10, 2014 at 9:55 am (UTC -7)

That’s always been how I’ve taught it to my beginning students, and that’s how it was first discussed in my early stat courses. It was also always mentioned that the .05 “threshold” was a historical tradition (as aptly noted above in two spots). I would also note that we (stat faculty, at least the ones I know) do stress that p-values are not the end-all and be-all and discussions and estimates of effect sizes and differences should be included.

Another important point, and changing my students’ feelings on this (or their guiding faculty, or people for whom I’m analyzing data) is difficult – almost as difficult as convincing them that they are not allowed to eliminate “outliers” simply to gain a better result.

## Danny W

February 10, 2014 at 9:58 am (UTC -7)

p-values are worse than meaningless for the reasons lou described. In any reasonably advanced statistical test, your null is that some statistic follows a particular distribution. But what would that even mean? Is the observable REALLY random AND follows the given distribution or is the distribution just a model for the truth? Of course, it is just a model and, as Box said, all models are wrong, but some are useful. The question is how useful?

But statistical tests based on p-values don’t even address this question. They ask, how confident can we be that the model is false? When we know it is false…

## Funk Doctor X

February 10, 2014 at 5:24 pm (UTC -7)

I always report exact p-values where I can and encourage others to do so as well. Let the reader make up their mind if the size of the effect and the reported p-value is something they consider “significant” with respect to their own research. I get that p < 0.05 is the magic threshold for getting the privilege of putting asterisks on your data. But scientists should have as much info as possible to critically evaluate whatever it is they’re reading, even if the effect is “not statistically significant”…

## DrugMonkey

February 10, 2014 at 6:23 pm (UTC -7)

It is intellectually dishonest to report p values anything other than p<0.05. (or whatever is the loosest standard for significance you personally have ever published)

## Darkling

February 10, 2014 at 6:40 pm (UTC -7)

P values are really only useful for the simplest of tests, regression of x on y or something similar. Anything more complicated than that and you should be talking about how well supported your model is compared to other plausible models. P-values become meaningless at that point. Hell, I’ve seen people in the same breath talk about how the interaction effect was significant, but that the main effects of the interaction weren’t, which is really just cargo cult statistics.

## Donovan

February 10, 2014 at 7:53 pm (UTC -7)

Right now, I’d be happy with a 0.99 P value.

@11 DrugMonkey: It’s not at all intellectually dishonest to report >0.05 significance. It would be dishonest to report it as being less, but so is misrepresenting any aspect of your research. I have submitted many reports and preliminaries with p values of >0.1, and to eager colleagues at that. An example is stream flow in New England. My hypothesized trend is there, but it’s not significant (going by memory, it was 0.062 or something). It’s clear that in the next decade, though, the differences will best the <0.05 threshold. So it's not just perfectly honest for me to report that, it'd be dishonest to claim I found no trend.

Sure, I may not get published – okay, I *won’t* get published – but the only honest thing to do is report my findings in the most straightforward way I can, letting the scientific community decide what to do next.

Which sums up my idea on p values. The 0.05, 0.01, and 0.001 might be great shortcuts when poring through literature, but every p value must be considered on its own terms. An overly ambitious experiment returning even a 0.2 p value deserves a second look, while an extremely anal scientist might scrape out some 0.000001 significance that has all of the significance of run speed between greyhounds and coffee cups.

## Darkling

February 10, 2014 at 10:53 pm (UTC -7)

## david

February 11, 2014 at 5:17 am (UTC -7)

Several situations in which there must be an accepted cutoff for significance:

Design a study to test the null hypothesis, which can be stated “the difference between two groups is less than X (being a meaningful-sized difference).” The sample size in the study is determined from X, combined with assumptions about the population variance, and a desired alpha and beta. Alpha is the famous 0.05 — the convention used generally for determining success is also the convention used for setting sample size.

Get a job at FDA. You have to decide to approve or disapprove a drug for clinical use. You can’t use a subjective ad-hoc approach. Tell the pharma companies what significance level they have to design their trials to achieve, and then act on those pre-defined thresholds. Typically, that’s two studies each with p<0.05.
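
For the study-design case above, the sample size follows from X, the assumed variance, alpha and beta. A minimal sketch, assuming the textbook normal-approximation formula for comparing two means with equal group sizes and a two-sided alpha (a simplification of the framing above; the function name and the example numbers are illustrative only):

```python
from math import ceil
from statistics import NormalDist


def n_per_group(meaningful_diff, sigma, alpha=0.05, beta=0.20):
    """Approximate per-group sample size for a two-sample comparison of
    means: n = 2 * ((z_{1-alpha/2} + z_{1-beta}) * sigma / X) ** 2."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(1.0 - beta)
    n = 2.0 * ((z_alpha + z_beta) * sigma / meaningful_diff) ** 2
    return ceil(n)


# Detect a difference of half a standard deviation with 80% power.
print(n_per_group(meaningful_diff=0.5, sigma=1.0))
# Tightening alpha to 0.01 raises the required sample size.
print(n_per_group(meaningful_diff=0.5, sigma=1.0, alpha=0.01))
```

Whatever alpha goes into this calculation is the same alpha that has to be used when judging the result, which is why the cutoff has to be fixed in advance.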

## julial

February 11, 2014 at 6:56 am (UTC -7)

Better keep this discussion away from the creos.

You guys aren’t unreservedly certain of your data, or your analysis methods, or your conclusions.

And you admit it. This is sin. ;-)

## mikka

February 11, 2014 at 10:10 am (UTC -7)

may have something to do with two-tailed vs one-tailed tests. one-tailed p = two-tailed p/2. So some tables, and some statistical software give a maximum p of 0.5 because anymore than that would mean a probability >1 (eek!).

What paper are you talking about?

## hyperdeath

February 11, 2014 at 10:57 am (UTC -7)

Actually, a p value of 0.00001 can be highly problematic when looking for rare events in large datasets. If the prior probability for each trial is 1 in a million, then a p < 0.00001 event will most likely be a false positive.
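
The arithmetic behind this is just Bayes’ rule. A minimal sketch, treating the p-value cutoff as the false-positive rate and assuming (generously) that every real event is detected, i.e. a power of 1.0; the function name is made up for illustration:

```python
def prob_real_given_significant(prior, alpha, power=1.0):
    """P(effect is real | test came out significant), by Bayes' rule."""
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Prior of 1 in a million, significance cutoff of 0.00001.
ppv = prob_real_given_significant(prior=1e-6, alpha=1e-5)
print(f"{ppv:.3f}")
```

Under these assumptions roughly 10 in every 11 “significant” hits are false positives, even with perfect power.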

## DrugMonkey

February 11, 2014 at 4:55 pm (UTC -7)

An interaction is no different from a main effect, Darkling. Draw a set of axes and an X on the plot. boom. done. no main effects and a text book interaction.

## DrugMonkey

February 11, 2014 at 4:57 pm (UTC -7)

oh and by the way PP, p-value trolling? and you have the nerve to comment on any of my topics? hhahhahahahhaa.

## Darkling

February 11, 2014 at 5:02 pm (UTC -7)

Gah. And this is why statistics courses should be a compulsory part of any graduate student’s course. By definition, a significant interaction means that the component main effects have an effect. Talking about whether the main effects are significant or not when the interaction effect is significant is gibberish.

## turkeyfish

February 12, 2014 at 3:28 pm (UTC -7)

“But why is the title of a published paper allowed to refer to a difference with a P-value of 0.49995, but a difference with a P-value of 0.50005 is verboten to consider at all? Is there any justification for this practice other than it’s-just-what-we-do?” Who publishes results with a P-value of 0.5? Keep in mind that P values are usually reported as the probability that a given null hypothesis (no difference between observed and expected based on chance alone, usually based on assuming an underlying Gaussian distribution) is rejected as being false, when in fact it is true (probability of committing a type I statistical error). A P-value of 0.05 means that such a null hypothesis would be rejected as false by chance alone, when it is actually true, only 1 time in 20. A P-value of 0.01 means that such a null hypothesis would be rejected as false by chance alone, when it is in fact true, only 1 time in 100, etc. Obviously, this isn’t the only kind of statistical error that can arise from a design, since it is also possible to accept the null hypothesis as true, when in fact it is false (probability of a type II statistical error). Assuming the law of the excluded middle, there is an obvious trade-off that must be accepted between making a statistical error of the first kind and that of the second kind. It is for these reasons, as well as practical considerations associated with the cost of sampling, that statisticians seek to identify the statistical design and test with the most power for a given design (probability that the test will reject the null hypothesis when the alternative hypothesis is true, that is the probability of not committing a Type II error), as well as attempting to identify the methods with the most specificity (measures the proportion of true negatives which are correctly identified as such) and sensitivity (measures the proportion of actual positives which are correctly identified as such).
In more complicated statistical designs, one must be sure what the author means when assigning a probability value to the rejection of the null-hypothesis and thus, exactly what constitutes the null-hypothesis. Likewise, in statistical measures of correlation or covariance, a P value reflects the probability of the statistical independence of one or more random variables and not the magnitude of the difference between those variables. Such tests may also make assumptions about the partitioning of the observed variance, so that potential interaction effects can be removed or accounted for in the statistical design. It is for these reasons that P values in one given field of investigation are different from those reported in another field of investigation, since typically statistical significance is only a guide to “meaningful differences” between expected differences and observed differences predicted or not predicted by the model for which there is no uniform or absolute scale to compare across all kinds of studies and designs. Similarly, for multiple comparisons it is also essential to correct for the fact that the chance of committing a type I statistical error increases as the number of comparisons increases. For this reason, it may be necessary to “adjust” the expected probability values to reflect the probability of “experiment wide error”, when more than one test is performed (e.g. Bonferroni or Scheffe corrections). The value of 0.05 has traditionally been found useful in most studies because of the putative modeled relationship between the absolute magnitude of observed differences of the means and the variance of the observations, as compared to those expected for the null-hypothesis, inherent in comparisons among random variables assumed to be drawn from a Gaussian distribution. However, such relationships may or may not hold or only hold approximately, if the underlying variance being modeled better fits another distribution.
Sir Ronald Fisher used a 0.05 value since it was a reasonable choice given the nature of differences observed in agronomic studies. Of course, one could always be more of a skeptic and require lower P values, but this only reflects a philosophical difference as to whether a particular observed statistical difference is more or less meaningful in a broader context than another difference given a particular set of observations, i.e. applicability or utility of a given model and assumed distribution of expected variance in a given circumstance. It does not reflect on the logic that underlies probability theory nor measure theory upon which it is based. Also, keep in mind that differences in the assumed topology of the sampling space can greatly affect the suitability of a particular model for the study of a given phenomenon. For example, it would likely be inappropriate to use probability theory or measure theory to discuss differences among phenomena that are pseudometric rather than metric in character. Unlike for “points” in a metric space, points in a pseudometric space or general topological space need not be distinguishable by the distance between them. That is, one may have d(x,y)=0 for distinct values x not equal to y, that is two different entities (sets) may have zero distances between them. In metric and probabilistic spaces, two different entities will always have a positive distance between them. That is, two distinct entities have zero distance between them, if and only if they are the same entity. Consequently, the notions of distance, measure, or probability would have either no or a possibly highly non-standard meaning in such contexts.
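
On the multiple-comparisons point, a minimal sketch of how the experiment-wide type I error grows with the number of tests, and what a Bonferroni adjustment does to it. It assumes the tests are independent, which real comparisons often are not, and the function names are made up for illustration.

```python
def family_wise_error(alpha, n_tests):
    """Chance of at least one type I error across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test threshold under the Bonferroni correction."""
    return alpha / n_tests


for m in (1, 5, 20, 100):
    raw = family_wise_error(0.05, m)
    corrected = family_wise_error(bonferroni_alpha(0.05, m), m)
    print(f"{m:>3d} tests: uncorrected {raw:.3f}, Bonferroni {corrected:.3f}")
```

Without a correction, 20 tests at 0.05 each already carry a better than even chance of at least one false positive.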

## turkeyfish

February 12, 2014 at 3:29 pm (UTC -7)

Sorry. So much for paragraphs that got lost when replying.

## Darkling

February 12, 2014 at 4:48 pm (UTC -7)

I guess the core of hypothesis testing is that a test statistic (t, F, chi-square etc) is generated based on some null hypothesis. This test statistic then allows you to see how likely your data was, if your null hypothesis were correct. Depending on the nature of your data there’s parametric and non-parametric tests that you can use.

Even if the standard techniques don’t really work for your data (due to violations of assumptions), you can still calculate a test statistic and then use bootstrap simulations to generate randomised data to see how likely your data was.
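
As a rough illustration of the resampling idea, here is a minimal sketch of a permutation test (a close cousin of the bootstrap, swapped in here because it is the simplest way to generate randomised data under the null). It assumes two independent groups and uses the absolute difference in means as the test statistic; the function name and the data are made up.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        # Shuffle the labels: under the null, group membership is arbitrary.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_resamples + 1)


control = [4.1, 5.0, 4.7, 5.3, 4.4]
treated = [6.2, 6.9, 5.8, 7.1, 6.5]
print(permutation_p_value(control, treated))
```

No distributional assumption goes into this beyond exchangeability of the observations under the null.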

Now, whether or not you consider a p-value to be significant is fairly arbitrary. It’ll depend upon your sample size and the nature of the study. If you set the threshold too high then you’ll have a greater chance of false positives, too low and a greater chance of false negatives. In a lot of cases though it’s just convention, but if you’re referring to a parameter in your paper’s title that has a p-value of 0.49 I hope it’s to simply point out that said variable didn’t explain any variation in the data (or “this result is another example of why ***ists should not be using stepwise regression”).

Still, in essence hypothesis testing is very flexible. However, issues come up when people start asking about the utility of the null hypothesis, as has been stated several times above. The test statistic is testing the null hypothesis (which is always going to be false, just not always provably so). This works if your study has only the binary choices of null and alternative (H0 and H1). As the study becomes more complicated, NHST starts to break down somewhat since there are more than two choices. Now we have H0, H1 and H2 (or worse). At this point you need some way to quantify support for the different hypotheses (likelihoods etc) because that is outside NHST’s pay grade.

## wtfwhateverd00d

February 13, 2014 at 2:59 am (UTC -7)

If you can dry up and then stop touching yourself long enough, you might find having your 12 step sponsor read you this to be interesting:

http://www.nature.com/news/scientific-method-statistical-errors-1.14700

Scientific method: Statistical errors

P values, the ‘gold standard’ of statistical validity, are not as reliable as many scientists assume.

Regina Nuzzo

12 February 2014

Also, wash your hands before giving us another recipe post, your photos smell of ass.