Most useful paper I’ve read this week

I’m teaching our science writing course in the Fall, and I’m also one of the instructors in our teachers’ workshop next month (we still have room for more participants!). And now I’ve found a useful, general, basic paper that I have to hand out.

Motulsky, HJ (2014) Common Misconceptions about Data Analysis and Statistics. JPET 351(1):200-205.

It’s written in clear, plain English, it’s brief, and it covers some ubiquitous errors, so it will be incredibly useful for our introductory biology students. You should read it, too, for background in basic statistical literacy. Here’s the abstract.

Ideally, any experienced investigator with the right tools should be able to reproduce a finding published in a peer-reviewed biomedical science journal. In fact, however, the reproducibility of a large percentage of published findings has been questioned. Undoubtedly, there are many reasons for this, but one reason may be that investigators fool themselves due to a poor understanding of statistical concepts. In particular, investigators often make these mistakes: 1) P-hacking, which is when you reanalyze a data set in many different ways, or perhaps reanalyze with additional replicates, until you get the result you want; 2) overemphasis on P values rather than on the actual size of the observed effect; 3) overuse of statistical hypothesis testing, and being seduced by the word “significant”; and 4) over-reliance on standard errors, which are often misunderstood.

I can probably open any biomedical journal and find papers that commit all four of those errors.


  1. dianne says

    P hacking; if you set a significance level of 0.05, then run 20 tests, chances are you’ll get a “significant” result from pure chance.

  2. Petal to the Medal says

    Thanks for alerting us to that very worthwhile paper. I’ve shared the link with several friends. John Oliver made some of the same points in one of his monologues a few weeks ago. I recommend it to anybody who hasn’t seen it. It’s easy to find on the Last Week Tonight website. WARNING: possibly NSFW.

  3. Raucous Indignation says

    Please let me correct that for you.

    “You can probably open any high level biomedical journal such as the New England Journal of Medicine and find that MOST papers commit all four of those errors.”

    You are welcome.

  4. latveriandiplomat says

    How many of these are “errors” and how many are “things I have to do to get enough papers published to keep my job”.

    If you find lots of people doing things the wrong way, maybe it’s the system that’s broken, in that it rewards wrong actions over correct ones.

  5. says

    It’s great to see Harvey Motulsky’s wise advice getting attention. He is the author of the statistical analysis software GraphPad; in doing documentation and support for it, he runs into these questions. I’ve known Harvey for many years, and gave him comments on the latest edition of his book, which is his reference 2014a.

  6. CB says

    I just finished reading “Statistics Done Wrong: The Woefully Complete Guide” by Alex Reinhart, and had to double-check that the article you mentioned wasn’t written by the same author – they are almost verbatim the same as each other (only the book is longer, of course).

    In my MS and PhD, I was carefully coached to do the opposite of each of these. First you explore the data (p-hacking), then you do transformations (to adjust for non-normal distributions, of course) and consolidate categories as needed for your effects to come to light. You always need to present p-values (no paper would be approved without p-values!), and you just shrug and report your “significant” findings when you discover that the average weight of cows on different diets differed (significantly, of course!) by a whopping 4 grams, rather than say the scale doesn’t even weigh a cow to the 4 gram level.

    The trick, of course, is that most reviewers were trained the same way I was, so if I depart from the “standards”, my papers get sent back for revision (or worse), asking me to add p-values and crossing out my confidence intervals as being “non-standard”.

  7. Hj Hornbeck says

    dianne @1:

    P hacking; if you set a significance level of 0.05, then run 20 tests, chances are you’ll get a “significant” result from pure chance.

    Whoops, you’ve mistaken the p-value for the false positive rate. They’re not the same. From the linked paper:

    Many scientists mistakenly believe that the chance of making a false-positive conclusion is 5%. In fact, in many situations, the chance of making a type I false-positive conclusion is much higher than 5% (Colquhoun, 2014). For example, in a situation where you expect the null hypothesis to be true 90% of the time (say you are screening lightly prescreened compounds, so expect 10% to work), you have chosen a sample size large enough to ensure 80% power, and you use the traditional 5% significance level, the false discovery rate is not 5% but rather is 36%.

    If you run those numbers for 20 tests, then your most likely outcome is seven false positives, and about 94% of the time you’ll get 4-11. Even if you only have two statistically significant tests, there’s a 59% chance that at least one is a false positive.

    Oh, and this paper’s example is a bit contrived, as the typical study doesn’t have a power of 80%; in practice, for moderately-sized effects it’s somewhere between 40 and 50%. That means that for a statistical significance level of p < 0.05, it's more realistic to say that anywhere between 47% and 53% of statistically significant results are false.

    It's no wonder scientific studies frequently contradict one another: the most commonly-used statistical techniques practically guarantee it.
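The figures in this comment can be checked with a few lines of Python. This is only a sketch under stated assumptions: it takes the paper's 10% prior and 5% threshold as given, and it treats each of 20 statistically significant results as independently being a false positive with probability equal to the false discovery rate, which is one reading of "run those numbers for 20 tests". The function names are illustrative, not from the paper.

```python
from math import comb


def false_discovery_rate(prior_real, power, alpha=0.05):
    """Share of 'significant' results that are false positives."""
    true_pos = prior_real * power          # real effects that are detected
    false_pos = (1 - prior_real) * alpha   # null effects that cross the threshold
    return false_pos / (true_pos + false_pos)


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Paper's example: 10% prior, 80% power, 5% threshold.
fdr = false_discovery_rate(0.10, 0.80)

# Assumption: 20 significant results, each independently false with probability fdr.
pmf = [binom_pmf(k, 20, fdr) for k in range(21)]
most_likely = max(range(21), key=pmf.__getitem__)
p_4_to_11 = sum(pmf[4:12])
p_at_least_one_of_two = 1 - (1 - fdr) ** 2

print(f"FDR at 80% power: {fdr:.0%}")
print(f"most likely number of false positives out of 20: {most_likely}")
print(f"P(4 to 11 false positives): {p_4_to_11:.1%}")
print(f"P(at least one of two is false): {p_at_least_one_of_two:.0%}")

# Lower, arguably more typical, power levels.
for power in (0.50, 0.40):
    print(f"FDR at {power:.0%} power: {false_discovery_rate(0.10, power):.0%}")
```

Under those assumptions the output reproduces the 36%, seven, roughly 94%, 59%, and 47% to 53% figures quoted above.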

  8. dianne says

    @6: You’re right, I was being totally lazy and not doing the math. Tsk! The saying in astronomy, apparently, is half of all 2 sigma results are wrong. Again, lazy rule of thumb, not an actual statistical test.

    OTOH, there’s also the problem of being unable to obtain a “statistically significant” result because you’re trying to sort out what is happening in a disease with an incidence of less than 1 per 100,000. Especially when the “disease” is probably multiple diseases with a single phenotype (I’m looking at you, Waldenstroem’s macroglobulinemia.)

  9. chrislawson says

    1. I see these errors all the time in prominent medical journals, and it drives me crazy. I think all peer-reviewed journals should have a dedicated statistics review team.

    2. Not quite about the p-values @6: The p-value is a measure of the extremity from the expected value of a given dataset. As a rough rule, it means the probability that a randomly generated dataset will be at least as far from the expected norm as the observed dataset. So this means that if you generate random datasets and then crunch the stats, you will arrive at a p<=0.05 around 5% of the time. So dianne was quite right. You can even try this on a spreadsheet or using a small computer program.
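The "small computer program" suggested here might look something like the following. It is a minimal sketch with several assumptions of its own: every dataset is 30 draws from a standard normal distribution (so the null hypothesis is true by construction), the test is a two-sided z-test with known standard deviation 1, and the seed and trial count are arbitrary.

```python
import random
from statistics import NormalDist, fmean


def null_p_value(rng, n=30):
    """Two-sided z-test p-value for one dataset generated under the null."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = fmean(sample) * n ** 0.5  # known sigma = 1, hypothesised mean = 0
    return 2 * (1 - NormalDist().cdf(abs(z)))


def fraction_significant(trials=10_000, alpha=0.05, seed=1):
    """Fraction of purely random datasets that reach p <= alpha."""
    rng = random.Random(seed)
    hits = sum(null_p_value(rng) <= alpha for _ in range(trials))
    return hits / trials


print(f"fraction with p <= 0.05: {fraction_significant():.3f}")
```

The printed fraction should land close to 0.05, which is the point being made: when nothing but chance is at work, about one dataset in twenty still clears the threshold.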

    It's certainly true, though, that this is not the same as the false-positive rate (or that the inverse is the false-negative rate). For instance, if your observed Z-score is exactly zero, then the p-value is 1.000 (i.e., there is no difference between your dataset and what one would expect from the probability distribution function). But obviously this doesn't mean that the result is 100% explained by randomness, and I'd even suggest that if this was reported in a paper I would be extremely suspicious about the finding.

    Meanwhile if you're p-hacking, you have a 50% chance of hitting at least one p<0.05 result by chance after 14 tests, not 20. If you do 20 tests, there's actually a 64% chance that you will get at least one p<0.05 result. And of course, this only applies to the statistics. If there is any bias in the trial design…so, yes, I certainly agree with you that there are plenty of ways to generate spurious p<0.05 findings without having to slog through scores of trials.
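The 14-test and 64% figures follow from the complement rule, assuming the tests are independent and every null hypothesis is true. A short sketch (function names are illustrative):

```python
import math


def chance_of_any_false_positive(n_tests, alpha=0.05):
    """P(at least one p < alpha) across n independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests


def tests_needed(target=0.5, alpha=0.05):
    """Smallest number of independent tests whose combined chance reaches target."""
    return math.ceil(math.log(1 - target) / math.log(1 - alpha))


print(tests_needed())                                  # tests for a 50% chance
print(f"{chance_of_any_false_positive(14):.1%}")       # chance after 14 tests
print(f"{chance_of_any_false_positive(20):.1%}")       # chance after 20 tests
```

Correlated tests, which are common when the same dataset is sliced several ways, would not follow this formula exactly.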

  10. dianne says

    I think all peer-reviewed journals should have a dedicated statistics review team.

    Some of them do. It doesn’t help. I’ve submitted to journals with statistics reviewers and they flub the biology so badly that the statistics points they raise no longer make any sense. It’s hard to find anyone who knows both.

    Meanwhile if you’re p-hacking, you have a 50% chance of hitting at least one p<0.05 result by chance after 14 tests, not 20.

    This is the basic argument used against doing routine screening labs on an asymptomatic patient: if you run a chem-14 on someone, you have a good chance of getting an abnormal result–which you’ll then feel obliged to chase–by pure chance. One way around this problem is to simply test again. Chance shouldn’t lead to the same abnormality twice. (Though it should be pointed out that the normal values are established by essentially running the test on 100* healthy people and then declaring “normal” to be where 95% of those values fall. So 5% of perfectly healthy** people have “abnormal” values because of the way the values are established.)

    *It’s not really 100, of course. That was just a convenient number to use as an example.
    **As far as we know, anyway. It could be that some of them are just not yet showing signs of illness. Also, the “normal” values of some tests had to be moved as it became clear that “healthy” and “average in the population” were two different things. Think cholesterol or hgbA1c.

  11. Hj Hornbeck says

    chrislawson @8:

    As a rough rule, it means the probability that a randomly generated dataset will be at least as far from the expected norm as the observed dataset.

    That’s not so much a rough rule as the correct rule. I’ve spilled quite a few words discussing p-values, so I can vouch for it.

    So this means that if you generate random datasets and then crunch the stats, you will arrive at a p<=0.05 around 5% of the time.

    Incorrect, you’re only considering cases where the null hypothesis is true. You also need to factor in a study’s power, and the prior probability of the null hypothesis being true. Here’s how the linked paper handles those:

    This table tabulates the theoretical results of 1000 experiments where the prior probability that the null hypothesis is false is 10%, the sample size is large enough so that the power is 80%, and the significance level is the traditional 5%. In 100 of the experiments (10%), there really is an effect (the null hypothesis is false), and you will obtain a “statistically significant” result (P < 0.05) in 80 of these (because the power is 80%). In 900 experiments, the null hypothesis is true, but you will obtain a statistically significant result in 45 of them (because the significance threshold is 5%, and 5% of 900 is 45). In total, you will obtain 80 + 45 = 125 statistically significant results, but 45/125 = 36% of these will be false positive.

    Notice that the above logic means you’d observe p < 0.05 about 12.5% of the time, not 5%. You’ll only get to 5% if the null hypothesis is true 100% of the time, which is a touch unrealistic.
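The bookkeeping in the quoted passage can be written out directly. This sketch only restates the paper's hypothetical 1000 experiments; the function and key names are made up for illustration, and the default arguments are the values quoted above.

```python
def expected_outcomes(n_experiments=1000, prior_real=0.10, power=0.80, alpha=0.05):
    """Expected counts of significant results, split by whether the effect is real."""
    real = n_experiments * prior_real   # experiments where an effect exists
    null = n_experiments - real         # experiments where the null is true
    true_pos = real * power             # real effects reaching significance
    false_pos = null * alpha            # true nulls reaching significance by chance
    significant = true_pos + false_pos
    return {
        "true_positives": true_pos,
        "false_positives": false_pos,
        "significant": significant,
        "share_significant": significant / n_experiments,
        "false_discovery_rate": false_pos / significant,
    }


for name, value in expected_outcomes().items():
    print(f"{name}: {value:g}")

# With the null true in every experiment, the share of significant results is alpha.
print(expected_outcomes(prior_real=0.0)["share_significant"])
```

Setting `prior_real=0.0` recovers the 5% figure, which is the case where every dataset is pure noise.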

  12. LicoriceAllsort says

    Ooh, thanks. I have this paper saved, as well, and re-read it every time I’m analyzing a dataset:

    Zuur AF, Ieno EN, and CS Elphick. 2010. A protocol for data exploration to avoid common statistical problems. Methods in Ecology & Evolution, 1(1), 3-14.

    And a bit more obscure but I’ve found it to be useful:

    García-Berthou E. 2001. On the misuse of residuals in ecology: testing regression residuals vs. the analysis of covariance. Journal of Animal Ecology, 70, 708–711.

  13. chrislawson says

    dianne@10: yes, overtesting is a huge problem in medicine. I frequently have to explain to people why we don’t want to do any test unless there is a specific reason for doing so plus a decent prior probability. Even on basic tests, almost everyone gets an FBE and an E/LFT’s (aka UEC) which means they get a battery of Hb, Hct, RBC count, MCV, MCH, MCHC, WCC, neutrophils, lymphocytes, monocytes, eosinophils, basophils, platelets, Na, K, Cl, HCO3, pH, anion gap, urea, eGFR, urate, AST, ALT, ALP, GGT, bilirubin, albumin, total protein, glucose…with some variations depending on the lab. That’s 30 variables. Which means even in a healthy person there’s about a 79% chance that at least one will be out of the normal range. And every further test adds to the chance of a spurious abnormal finding.

    (There’s a specialist in my region whom I refuse to refer to because he orders a list of tests that is literally 2 double-columned pages long for every patient. He was investigated by the PSR and told to knock it off — so now he tells his patients to ask their GPs to order them instead. Which is how I got to see his printed list of tests. I asked the patient which ones the specialist wanted. “Oh, he said all of them.” Unbelievable.)

    Hornbeck@11: I was indeed basing the probabilities on random datasets, i.e. where the null hypothesis is 100% correct. You’re absolutely right that in real-life research this is very unlikely; it should be seen as the absolute lower bound of probability for Type I error, and in most experimental setups the error risk is substantially higher (and given some of the shoddy designs I’ve seen, it’s almost a miracle they were only getting borderline p<0.05 results instead of p<0.001s).