Stop Assessing Science

I completely agree with PZ, in part because I’ve heard the same tune before.

The results indicate that the investigators contributing to Volume 61 of the Journal of Abnormal and Social Psychology had, on the average, a relatively (or even absolutely) poor chance of rejecting their major null hypotheses, unless the effect they sought was large. This surprising (and discouraging) finding needs some further consideration to be seen in full perspective.

First, it may be noted that with few exceptions, the 70 studies did have significant results. This may then suggest that perhaps the definitions of size of effect were too severe, or perhaps, accepting the definitions, one might seek to conclude that the investigators were operating under circumstances wherein the effects were actually large, hence their success. Perhaps, then, research in the abnormal-social area is not as “weak” as the above results suggest. But this argument rests on the implicit assumption that the research which is published is representative of the research undertaken in this area. It seems obvious that investigators are less likely to submit for publication unsuccessful than successful research, to say nothing of a similar editorial bias in accepting research for publication.

Statistical power is the probability of correctly rejecting a false null hypothesis, or equivalently one minus the probability of a Type II error. All else being equal, the larger the sample size, the greater the statistical power. Thus if your study has only a poor chance of detecting the effect it was designed to find, it is too small.
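
To make the definition concrete, here's a minimal Python sketch (my own illustration, not drawn from any of the papers quoted here; the effect and sample sizes are arbitrary). It estimates power by simulating many two-group experiments in which a real difference exists and counting how often a t-test detects it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def estimated_power(effect_size, n_per_group, alpha=0.05, n_sims=10_000):
    """Estimate power by brute force: the fraction of simulated experiments
    that correctly reject the null when a true effect of `effect_size` exists."""
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
        treatment = rng.normal(loc=effect_size, scale=1.0, size=n_per_group)
        _, p = stats.ttest_ind(treatment, control)
        if p < alpha:
            rejections += 1
    return rejections / n_sims

# Illustrative numbers only: a medium effect (d = 0.5) and two sample sizes.
print(estimated_power(effect_size=0.5, n_per_group=30))   # roughly 0.47
print(estimated_power(effect_size=0.5, n_per_group=100))  # roughly 0.94
```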

Suppose we hold fixed the theoretically calculable incidence of Type I errors. … Holding this 5% significance level fixed (which, as a form of scientific strategy, means leaning over backward not to conclude that a relationship exists when there isn’t one, or when there is a relationship in the wrong direction), we can decrease the probability of Type II errors by improving our experiment in certain respects. There are three general ways in which the frequency of Type II errors can be decreased (for fixed Type I error-rate), namely, (a) by improving the logical structure of the experiment, (b) by improving experimental techniques such as the control of extraneous variables which contribute to intragroup variation (and hence appear in the denominator of the significance test), and (c) by increasing the size of the sample. … We select a logical design and choose a sample size such that it can be said in advance that if one is interested in a true difference provided it is at least of a specified magnitude (i.e., if it is smaller than this we are content to miss the opportunity of finding it), the probability is high (say, 80%) that we will successfully refute the null hypothesis.

If low statistical power were just a matter of a few bad apples, it would be rare. Instead, as the first quote implies, it's quite common. That study found that for small effect sizes, where Cohen's d is roughly 0.25, the average statistical power was an abysmal 18%. For medium effect sizes, where d is roughly 0.5, the figure was still less than half. Since those two ranges cover the majority of effect sizes in social science, the typical study has very low power, which usually means too small a sample. The problem isn't a handful of careless researchers; low power is systemic to how science is carried out.
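
For a sense of scale, here's a quick sketch using statsmodels; the 30-subjects-per-group figure is my own illustrative guess at a "typical" study, not a number from Cohen's survey. At that size, power for small and medium effects lands right around the figures quoted above, and hitting the conventional 80% target for a small effect takes roughly 250 subjects per group:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power of a two-sample t-test at an illustrative 30 subjects per group.
for d in (0.25, 0.5):
    power = analysis.power(effect_size=d, nobs1=30, alpha=0.05)
    print(f"d = {d}: power = {power:.2f}")   # about 0.16 and 0.48

# Sample size per group needed for 80% power on a small effect.
n_needed = analysis.solve_power(effect_size=0.25, power=0.8, alpha=0.05)
print(f"n per group for 80% power at d = 0.25: {n_needed:.0f}")  # roughly 250
```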

In this fashion a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of “an integrated research program,” without ever once refuting or corroborating so much as a single strand of the network. Some of the more horrible examples of this process would require the combined analytic and reconstructive efforts of Carnap, Hempel, and Popper to unscramble the logical relationships of theories and hypotheses to evidence. Meanwhile our eager-beaver researcher, undismayed by logic-of-science considerations and relying blissfully on the “exactitude” of modern statistical hypothesis-testing, has produced a long publication list and been promoted to a full professorship. In terms of his contribution to the enduring body of psychological knowledge, he has done hardly anything. His true position is that of a potent-but-sterile intellectual rake, who leaves in his merry path a long train of ravished maidens but no viable scientific offspring.

I know, it’s a bit confusing that I haven’t clarified who I’m quoting. The first two passages come from this study:

Cohen, Jacob. “The Statistical Power of Abnormal-Social Psychological Research: A Review.” The Journal of Abnormal and Social Psychology 65, no. 3 (1962): 145.

While the last two are from this:

Meehl, Paul E. “Theory-Testing in Psychology and Physics: A Methodological Paradox.” Philosophy of Science 34, no. 2 (1967): 103–115.

That’s right, scientists have been complaining about small sample sizes for over 50 years. Fanelli et al. [2017] might provide greater detail and evidence than previous authors did, but the basic conclusion has remained the same. Nor are these two studies lone wolves in the darkness; I wrote about a meta-analysis of 16 different surveys of statistical power carried out between Cohen’s and now, all of which agree with Cohen’s findings.

If your assessments have been consistently telling you the same thing for decades, maybe it’s time to stop assessing. Maybe it’s time to start acting on those assessments, instead. PZ is already doing that, thankfully…

More data! This is also helpful information for my undergraduate labs, since I’m currently in the process of cracking the whip over my genetics students and telling them to count more flies. Only a thousand? Count more. MORE!

… but this is a chronic, systemic issue within science. We need more.

Double-Dipping Datasets

I wrote this comment down on a mental Post-It note:

nathanieltagg @10:
… So, here’s the big one: WHY is it wrong to use the same dataset to look for different ideas? (Maybe it’s OK if you don’t throw out many null results along the way?)

It followed this post by Myers.

He described it as a failed study with null results. There’s nothing wrong with that; it happens. What I would think would be appropriate next would be to step back, redesign the experiment to correct flaws (if you thought it had some; if it didn’t, you simply have a negative result and that’s what you ought to report), and repeat the experiment (again, if you thought there was something to your hypothesis).

That’s not what he did.

He gave his student the same old data from the same set of observations and asked her to rework the analyses to get a statistically significant result of some sort. This is deplorable. It is unacceptable. It means this visiting student was not doing something I would call research — she was assigned the job of p-hacking.

And both the comment and the post have been clawing away at me for a few weeks, because I haven’t been able to answer. So let’s fix that: is it always bad to re-analyze a dataset? If not, then when and how?

[Read more…]

BBC’s “Transgender Kids, Who Knows Best?” p1: You got Autism in my Gender Dysphoria!

This series on BBC’s “Transgender Kids: Who Knows Best?” is co-authored by HJ Hornbeck and Siobhan O’Leary. It attempts to fact-check and explore the many claims of the documentary concerning gender variant youth. You can follow the rest of the series here:

  1. Part One: You got Autism in my Gender Dysphoria!
  2. Part Two: Say it with me now
  3. Part Three: My old friend, eighty percent
  4. Part Four: Dirty Sexy Brains

 

Petitions seem as common as pennies, but this one stood out to me (emphasis in original).

The BBC is set to broadcast a documentary on BBC Two on the 12th January 2017 at 9pm called ‘Transgender Kids: Who Knows Best?’. The documentary is based on the controversial views of Dr. Kenneth Zucker, who believes that Gender Dysphoria in children should be treated as a mental health issue.

In simpler terms, Dr. Zucker thinks that being/querying being Transgender as a child is not valid, and should be classed as a mental health issue. […]

To clarify, this petition is not to stop this program for being broadcast entirely; however no transgender experts in the UK have watched over this program, which potentially may have a transphobic undertone. We simply don’t know what to expect from the program, however from his history and the synopsis available online, we can make an educated guess that it won’t be in support of Transgender Rights for Children.

That last paragraph is striking; who makes a documentary about a group of people without consulting experts, let alone gets it aired on national TV? It helps explain why a petition over something that hadn’t happened yet earned 11,000+ signatures.

Now if you’ve checked your watch, you’ve probably noticed the documentary came and went. I’ve been keeping an eye out for reviews, and they fall into two camps: enthusiastic support

So it’s a good thing BBC didn’t listen to those claiming this documentary shouldn’t have run. As it turns out, it’s an informative, sophisticated, and generally fair treatment of an incredibly complex and fraught subject.

… and enthusiastic opposition

The show seems to have been designed to cause maximum harm to #trans children and their families. I can hardly begin to tackle here the number of areas in which the show was inaccurate, misleading, demonising, damaging and plain false.

… but I have yet to see someone do an in-depth analysis of the claims made in this specific documentary. So Siobhan is doing precisely that, in a series of blog posts.
[Read more…]

The Odds of Elvis Being an Identical Twin

This one demanded to be shared ASAP. Here’s what you need to know:

  1. Identical or monozygotic twins occur in roughly four births per 1,000.
  2. Fraternal or dizygotic twins occur in roughly eight births per 1,000.
  3. Elvis Presley had a twin brother, Jesse Garon Presley, who was stillborn.

For simplicity’s sake, we’ll assume sex is binary and split 50/50, despite the existence of intersex fraternal twins. What are the odds of Elvis being an identical twin? The answer’s below the fold.
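
If you want to check your intuition before clicking through, here's one way to set up the calculation in Python using only the numbers above; it's my own sketch, and the worked answer below the fold may frame the conditioning differently.

```python
# Given: identical twins ~4 per 1,000 births, fraternal twins ~8 per 1,000,
# and sex assumed binary and split 50/50. We know Elvis was a twin and that
# his twin was a brother.

p_identical = 4 / 1000          # rate of identical twin births
p_fraternal = 8 / 1000          # rate of fraternal twin births

# Among twins, the prior odds of identical vs. fraternal are 4:8.
prior_identical = p_identical / (p_identical + p_fraternal)
prior_fraternal = 1 - prior_identical

# Likelihood that the co-twin is a brother, given Elvis is male:
# identical twins always share a sex; fraternal twins share it half the time.
like_identical = 1.0
like_fraternal = 0.5

posterior_identical = (prior_identical * like_identical) / (
    prior_identical * like_identical + prior_fraternal * like_fraternal
)
print(posterior_identical)  # 0.5 under these assumptions
```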

[Read more…]

Replication Isn’t Enough

I bang on about statistical power because low power indirectly raises the odds of a false positive. In brief, it forces you to run more tests to reach a statistical conclusion, stuffing the file drawer and thus making published results appear more certain than they are. In detail, see John Borghi or Ioannidis (2005). In comic, see Maki Naro.
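
Here's a back-of-the-envelope version of that argument, in the spirit of Ioannidis's framework; the 10% prior is my own illustrative assumption, not a figure from his paper. The positive predictive value of a "significant" result depends on power, the significance threshold, and the fraction of tested hypotheses that are actually true:

```python
def prob_finding_is_true(power, alpha=0.05, prior=0.1):
    """Positive predictive value of a significant result:
    true positives / (true positives + false positives)."""
    true_pos = power * prior
    false_pos = alpha * (1 - prior)
    return true_pos / (true_pos + false_pos)

# Suppose 10% of tested hypotheses are actually true (an assumption, not data).
print(prob_finding_is_true(power=0.8))   # about 0.64
print(prob_finding_is_true(power=0.2))   # about 0.31 -- low power, mostly noise
```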

The concept of statistical power has been known since 1928, the wasteful consequences of low power since 1962, and yet there’s no sign that scientists are upping their power levels. This is a representative result:

Our results indicate that the average statistical power of studies in the field of neuroscience is probably no more than between ~8% and ~31%, on the basis of evidence from diverse subfields within neuroscience. If the low average power we observed across these studies is typical of the neuroscience literature as a whole, this has profound implications for the field. A major implication is that the likelihood that any nominally significant finding actually reflects a true effect is small.

Button, Katherine S., et al. “Power failure: why small sample size undermines the reliability of neuroscience.” Nature Reviews Neuroscience 14.5 (2013): 365-376.

The most obvious consequence of low power is a failure to replicate. If you rarely try to replicate studies, you’ll be blissfully unaware of the problem; once you take replications seriously, though, you’ll suddenly find yourself in a “replication crisis.”

You’d think this would result in calls for increased statistical power, and it has, along with the occasional call for switching to a methodology that automatically incorporates power. But it’s also led to calls for more replications.

As a condition of receiving their PhD from any accredited institution, graduate students in psychology should be required to conduct, write up, and submit for publication a high-quality replication attempt of at least one key finding from the literature, focusing on the area of their doctoral research.
Everett, Jim AC, and Brian D. Earp. “A tragedy of the (academic) commons: interpreting the replication crisis in psychology as a social dilemma for early-career researchers.” Frontiers in psychology 6 (2015).


Much has been made of preregistration, publication of null results, and Bayesian statistics as important changes to how we do business. But my view is that there is relatively little value in appending these modifications to a scientific practice that is still about one-off findings; and applying them mechanistically to a more careful, cumulative practice is likely to be more of a hindrance than a help. So what do we do? …

Cumulative study sets with internal replication.

If I had to advocate for a single change to practice, this would be it.

There’s an intuitive logic to this: currently fewer than one in a hundred papers are replications of prior work, so there’s plenty of room for expansion; key figures like Ronald Fisher and Jerzy Neyman emphasized the necessity of replication; it doesn’t require any modification of technique; and the “replication crisis” is, after all, about replications. It sounds like an easy, feel-good solution to the problem.

But then I read this paper:

Smaldino, Paul E., and Richard McElreath. “The Natural Selection of Bad Science.” arXiv preprint arXiv:1605.09511 (2016).

It starts off with a meta-analysis of published surveys of statistical power, and comes to the same conclusion as above.

We collected all papers that contained reviews of statistical power from published papers in the social, behavioural and biological sciences, and found 19 studies from 16 papers published between 1992 and 2014. … We focus on the statistical power to detect small effects of the order d=0.2, the kind most commonly found in social science research. …. Statistical power is quite low, with a mean of only 0.24, meaning that tests will fail to detect small effects when present three times out of four. More importantly, statistical power shows no sign of increase over six decades …. The data are far from a complete picture of any given field or of the social and behavioural sciences more generally, but they help explain why false discoveries appear to be common. Indeed, our methods may overestimate statistical power because we draw only on published results, which were by necessity sufficiently powered to pass through peer review, usually by detecting a non-null effect.

Rather than leave it at that, though, the researchers decided to simulate the pursuit of science. They set up various “labs” that exerted different levels of effort to maintain methodological rigor, killed off labs that didn’t publish much and replaced them with mutations of labs that published more, and set the simulation spinning.

We ran simulations in which power was held constant but in which effort could evolve (μw=0, μe=0.01). Here selection favoured labs who put in less effort towards ensuring quality work, which increased publication rates at the cost of more false discoveries … . When the focus is on the production of novel results and negative findings are difficult to publish, institutional incentives for publication quantity select for the continued degradation of scientific practices.

That’s not surprising. But then they started tinkering with replication rates. To begin with, labs replicated existing results 1% of the time, those replications were guaranteed to be published, and having one of your results fail to replicate exacted a terrible toll.

We found that the mean rate of replication evolved slowly but steadily to around 0.08. Replication was weakly selected for, because although publication of a replication was worth only half as much as publication of a novel result, it was also guaranteed to be published. On the other hand, allowing replication to evolve could not stave off the evolution of low effort, because low effort increased the false-positive rate to such high levels that novel hypotheses became more likely than not to yield positive results … . As such, increasing one’s replication rate became less lucrative than reducing effort and pursuing novel hypotheses.

So it was time for extreme measures: force the replication rate to high levels, to the point that 50% of all studies were replications. All that happened was that it took longer for the overall methodological effort to drop and false positives to bloom.

Replication is not sufficient to curb the natural selection of bad science because the top performing labs will always be those who are able to cut corners. Replication allows those labs with poor methods to be penalized, but unless all published studies are replicated several times (an ideal but implausible scenario), some labs will avoid being caught. In a system such as modern science, with finite career opportunities and high network connectivity, the marginal return for being in the top tier of publications may be orders of magnitude higher than an otherwise respectable publication record.
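
To get a feel for the dynamic being described, here's a heavily simplified toy model in Python. It is not Smaldino and McElreath's actual simulation (theirs tracks power, false-positive rates, replication payoffs, and more); it only captures the core selection pressure they identify: labs that skimp on rigor produce more publishable "positive" results, and publication count is what survival depends on.

```python
import random

random.seed(1)

N_LABS = 100
GENERATIONS = 500
MUTATION_SD = 0.02

# Each lab is just an "effort" level in [0.1, 1.0]. Higher effort means more
# rigorous work, but fewer hypotheses tested (and so fewer papers) per cycle.
labs = [random.uniform(0.1, 1.0) for _ in range(N_LABS)]

def papers_this_cycle(effort):
    """Expected publications: low effort lets a lab churn out more results,
    many of them false positives, but the journal can't tell the difference."""
    hypotheses_tested = 10 * (1.1 - effort)    # cutting corners = more attempts
    positive_rate = 0.2 + 0.5 * (1 - effort)   # sloppier work "finds" more effects
    return hypotheses_tested * positive_rate

for _ in range(GENERATIONS):
    scores = [papers_this_cycle(e) + random.gauss(0, 0.5) for e in labs]
    # The least productive lab dies; a copy of the most productive one,
    # with a small mutation in effort, takes its place.
    loser = scores.index(min(scores))
    winner = scores.index(max(scores))
    child = min(1.0, max(0.1, labs[winner] + random.gauss(0, MUTATION_SD)))
    labs[loser] = child

print(f"mean effort after selection: {sum(labs) / len(labs):.2f}")  # well below where it started
```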

Replication isn’t enough. The field of science needs to incorporate more radical reforms that encourage high methodological rigor and greater power.

Index Post: P-values

Over the months, I’ve managed to accumulate a LOT of papers discussing p-values and their application. Rather than have them rot on my hard drive, I figured it was time for another index post.

Full disclosure: I’m not in favour of them. But I came to that by reading these papers, and seeing no effective counter-argument. So while this collection is biased against p-values, that’s no more a problem than a bias against the luminiferous aether or humour theory. And don’t worry, I’ll include a few defenders of p-values as well.

What’s a p-value?

It’s frequently used in “null hypothesis significance testing,” or NHST to its friends. A null hypothesis is one you hope to refute, preferably a fairly established one that other people accept as true. That hypothesis will predict a range of observations, some more likely than others. A p-value is simply the probability of some observed event happening, plus the probability of all events more extreme, assuming the null hypothesis is true. You can then plug that value into the following logic:

  1. Event E, or an event more extreme, is unlikely to occur under the null hypothesis.
  2. Event E occurred.
  3. Ergo, the null hypothesis is false.
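
As a concrete example of that logic (a toy case of my own, not from the papers below): take the null hypothesis "this coin is fair" and suppose we observe 60 heads in 100 flips.

```python
from scipy import stats

# Null hypothesis: the coin is fair (probability of heads = 0.5).
# Observation: 60 heads in 100 flips. The p-value is the probability, under
# the null, of a result at least this extreme (two-sided).
result = stats.binomtest(k=60, n=100, p=0.5, alternative='two-sided')
print(result.pvalue)  # about 0.057
```

By the logic above, p = 0.057 fails to clear the usual 0.05 bar while p = 0.049 would pass it, even though the two tell you nearly the same thing; that arbitrariness comes up repeatedly in the criticisms below.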

They seem like a weird thing to get worked up about.

Significance testing is a cornerstone of modern science, and NHST is the most common form of it. A quick check of Google Scholar finds “p-value” mentioned 3.8 million times, while its primary competitor, “Bayes Factor,” shows up only 250,000 times. At the same time, it’s poorly understood.

The P value is probably the most ubiquitous and at the same time, misunderstood, misinterpreted, and occasionally miscalculated index in all of biomedical research. In a recent survey of medical residents published in JAMA, 88% expressed fair to complete confidence in interpreting P values, yet only 62% of these could answer an elementary P-value interpretation question correctly. However, it is not just those statistics that testify to the difficulty in interpreting P values. In an exquisite irony, none of the answers offered for the P-value question was correct, as is explained later in this chapter.

Goodman, Steven. “A Dirty Dozen: Twelve P-Value Misconceptions.” In Seminars in Hematology, 45:135–40. Elsevier, 2008. http://www.sciencedirect.com/science/article/pii/S0037196308000620.

The consequence is an abundance of false positives in the scientific literature, leading to many failed replications and wasted resources.

Gotcha. So what do scientists think is wrong with them?

Well, th-

And make it quick, I don’t have a lot of time.

Right right, here’s the top three papers I can recommend:

Null hypothesis significance testing (NHST) is arguably the most widely used approach to hypothesis evaluation among behavioral and social scientists. It is also very controversial. A major concern expressed by critics is that such testing is misunderstood by many of those who use it. Several other objections to its use have also been raised. In this article the author reviews and comments on the claimed misunderstandings as well as on other criticisms of the approach, and he notes arguments that have been advanced in support of NHST. Alternatives and supplements to NHST are considered, as are several related recommendations regarding the interpretation of experimental data. The concluding opinion is that NHST is easily misunderstood and misused but that when applied with good judgment it can be an effective aid to the interpretation of experimental data.

Nickerson, Raymond S. “Null Hypothesis Significance Testing: A Review of an Old and Continuing Controversy.” Psychological Methods 5, no. 2 (2000): 241.

After 4 decades of severe criticism, the ritual of null hypothesis significance testing (mechanical dichotomous decisions around a sacred .05 criterion) still persists. This article reviews the problems with this practice, including near universal misinterpretation of p as the probability that H₀ is false, the misinterpretation that its complement is the probability of successful replication, and the mistaken assumption that if one rejects H₀ one thereby affirms the theory that led to the test.

Cohen, Jacob. “The Earth Is Round (p < .05).” American Psychologist 49, no. 12 (1994): 997–1003. doi:10.1037/0003-066X.49.12.997.

This chapter examines eight of the most commonly voiced objections to reform of data analysis practices and shows each of them to be erroneous. The objections are: (a) Without significance tests we would not know whether a finding is real or just due to chance; (b) hypothesis testing would not be possible without significance tests; (c) the problem is not significance tests but failure to develop a tradition of replicating studies; (d) when studies have a large number of relationships, we need significance tests to identify those that are real and not just due to chance; (e) confidence intervals are themselves significance tests; (f) significance testing ensures objectivity in the interpretation of research data; (g) it is the misuse, not the use, of significance testing that is the problem; and (h) it is futile to reform data analysis methods, so why try?

Schmidt, Frank L., and J. E. Hunter. “Eight Common but False Objections to the Discontinuation of Significance Testing in the Analysis of Research Data.” What If There Were No Significance Tests, 1997, 37–64.

OK, I have a bit more time now. What else do you have?

Using a Bayesian significance test for a normal mean, James Berger and Thomas Sellke (1987, pp. 112–113) showed that for p values of .05, .01, and .001, respectively, the posterior probabilities of the null, Pr(H₀ | x), for n = 50 are .52, .22, and .034. For n = 100 the corresponding figures are .60, .27, and .045. Clearly these discrepancies between p and Pr(H₀ | x) are pronounced, and cast serious doubt on the use of p values as reasonable measures of evidence. In fact, Berger and Sellke (1987) demonstrated that data yielding a p value of .05 in testing a normal mean nevertheless resulted in a posterior probability of the null hypothesis of at least .30 for any objective (symmetric priors with equal prior weight given to H₀ and Hₐ) prior distribution.

Hubbard, R., and R. M. Lindsay. “Why P Values Are Not a Useful Measure of Evidence in Statistical Significance Testing.” Theory & Psychology 18, no. 1 (February 1, 2008): 69–88. doi:10.1177/0959354307086923.
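
The numbers in that quote can be roughly reproduced with a later, simpler calibration from the same school of thought (Sellke, Bayarri and Berger's −e·p·ln(p) bound, 2001), which converts a p-value into a minimum posterior probability of the null under equal prior odds. Treat the code below as my own illustration of that bound, not the 1987 paper's exact computation.

```python
import math

def min_posterior_prob_null(p, prior_null=0.5):
    """Lower bound on Pr(H0 | data) from the -e * p * ln(p) calibration
    (valid for p < 1/e), assuming equal prior weight on H0 and H1."""
    min_bayes_factor = -math.e * p * math.log(p)   # minimum support for H0
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = min_bayes_factor * prior_odds
    return posterior_odds / (1 + posterior_odds)

for p in (0.05, 0.01, 0.001):
    print(f"p = {p}: Pr(H0 | data) >= {min_posterior_prob_null(p):.3f}")
# p = 0.05 gives about 0.29, close to the "at least .30" figure quoted above.
```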

Because p-values dominate statistical analysis in psychology, it is important to ask what p says about replication. The answer to this question is “Surprisingly little.” In one simulation of 25 repetitions of a typical experiment, p varied from .44. Remarkably, the interval—termed a p interval—is this wide however large the sample size. p is so unreliable and gives such dramatically vague information that it is a poor basis for inference.

Cumming, Geoff. “Replication and p Intervals: p Values Predict the Future Only Vaguely, but Confidence Intervals Do Much Better.” Perspectives on Psychological Science 3, no. 4 (July 2008): 286–300. doi:10.1111/j.1745-6924.2008.00079.x.

Simulations of repeated t-tests also illustrate the tendency of small samples to exaggerate effects. This can be shown by adding an additional dimension to the presentation of the data. It is clear how small samples are less likely to be sufficiently representative of the two tested populations to genuinely reflect the small but real difference between them. Those samples that are less representative may, by chance, result in a low P value. When a test has low power, a low P value will occur only when the sample drawn is relatively extreme. Drawing such a sample is unlikely, and such extreme values give an exaggerated impression of the difference between the original populations. This phenomenon, known as the ‘winner’s curse’, has been emphasized by others. If statistical power is augmented by taking more observations, the estimate of the difference between the populations becomes closer to, and centered on, the theoretical value of the effect size.

Halsey, Lewis G., Douglas Curran-Everett, Sarah L. Vowler, and Gordon B. Drummond. “The Fickle P Value Generates Irreproducible Results.” Nature Methods 12, no. 3 (March 2015): 179–85. doi:10.1038/nmeth.3288.
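
Here's a small simulation in the spirit of that passage; it's my own sketch, not the authors' code. Run many underpowered two-group experiments with a true standardized effect of 0.3, keep only the "significant" ones, and compare the surviving effect-size estimates to the truth:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

TRUE_D = 0.3       # the real standardized effect
N_PER_GROUP = 20   # deliberately underpowered
N_SIMS = 20_000

significant_ds = []
for _ in range(N_SIMS):
    control = rng.normal(0.0, 1.0, N_PER_GROUP)
    treatment = rng.normal(TRUE_D, 1.0, N_PER_GROUP)
    _, p = stats.ttest_ind(treatment, control)
    if p < 0.05:
        # Observed effect size (Cohen's d) for this "discovery".
        pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
        significant_ds.append((treatment.mean() - control.mean()) / pooled_sd)

print(f"power: {len(significant_ds) / N_SIMS:.2f}")                        # about 0.15
print(f"mean d among significant results: {np.mean(significant_ds):.2f}")  # about 0.7
```

The effects that clear the significance bar average more than twice the true value, which is the winner's curse in miniature.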

If you use p=0.05 to suggest that you have made a discovery, you will be wrong at least 30% of the time. If, as is often the case, experiments are underpowered, you will be wrong most of the time. This conclusion is demonstrated from several points of view. First, tree diagrams which show the close analogy with the screening test problem. Similar conclusions are drawn by repeated simulations of t-tests. These mimic what is done in real life, which makes the results more persuasive. The simulation method is used also to evaluate the extent to which effect sizes are over-estimated, especially in underpowered experiments. A script is supplied to allow the reader to do simulations themselves, with numbers appropriate for their own work. It is concluded that if you wish to keep your false discovery rate below 5%, you need to use a three-sigma rule, or to insist on p≤0.001. And never use the word ‘significant’.

Colquhoun, David. “An Investigation of the False Discovery Rate and the Misinterpretation of P-Values.” Royal Society Open Science 1, no. 3 (November 1, 2014): 140216. doi:10.1098/rsos.140216.

I was hoping for something more philosophical.

The idea that the P value can play both of these roles is based on a fallacy: that an event can be viewed simultaneously both from a long-run and a short-run perspective. In the long-run perspective, which is error-based and deductive, we group the observed result together with other outcomes that might have occurred in hypothetical repetitions of the experiment. In the “short run” perspective, which is evidential and inductive, we try to evaluate the meaning of the observed result from a single experiment. If we could combine these perspectives, it would mean that inductive ends (drawing scientific conclusions) could be served with purely deductive methods (objective probability calculations).

Goodman, Steven N. “Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy.” Annals of Internal Medicine 130, no. 12 (1999): 995–1004.

Overemphasis on hypothesis testing–and the use of P values to dichotomise significant or non-significant results–has detracted from more useful approaches to interpreting study results, such as estimation and confidence intervals. In medical studies investigators are usually interested in determining the size of difference of a measured outcome between groups, rather than a simple indication of whether or not it is statistically significant. Confidence intervals present a range of values, on the basis of the sample data, in which the population value for such a difference may lie. Some methods of calculating confidence intervals for means and differences between means are given, with similar information for proportions. The paper also gives suggestions for graphical display. Confidence intervals, if appropriate to the type of study, should be used for major findings in both the main text of a paper and its abstract.

Gardner, Martin J., and Douglas G. Altman. “Confidence Intervals rather than P Values: Estimation rather than Hypothesis Testing.” BMJ 292, no. 6522 (1986): 746–50.

What’s this “Neyman-Pearson” thing?

P-values were part of a method proposed by Ronald Fisher as a means of assessing evidence. The ink was barely dry on it before other people started poking holes in his work. Jerzy Neyman and Egon Pearson took some of Fisher’s ideas and came up with a new method based on long-run error control. Their method is superior, IMO, but rather than replacing Fisher’s approach it wound up being blended with it, ditching the advantages of both while preserving the faults. This citation covers the historical background:

Huberty, Carl J. “Historical Origins of Statistical Testing Practices: The Treatment of Fisher versus Neyman-Pearson Views in Textbooks.” The Journal of Experimental Education 61, no. 4 (1993): 317–33.

While the remainder help describe the differences between the two methods, and possible ways to “fix” their shortcomings.

The distinction between evidence (p’s) and error (α’s) is not trivial. Instead, it reflects the fundamental differences between Fisher’s ideas on significance testing and inductive inference, and Neyman-Pearson’s views on hypothesis testing and inductive behavior. The emphasis of the article is to expose this incompatibility, but we also briefly note a possible reconciliation.

Hubbard, Raymond, and M. J. Bayarri. “Confusion Over Measures of Evidence (p’s) Versus Errors (α’s) in Classical Statistical Testing.” The American Statistician 57, no. 3 (August 2003): 171–78. doi:10.1198/0003130031856.

The basic differences are these: Fisher attached an epistemic interpretation to a significant result, which referred to a particular experiment. Neyman rejected this view as inconsistent and attached a behavioral meaning to a significant result that did not refer to a particular experiment, but to repeated experiments. (Pearson found himself somewhere in between.)

Gigerenzer, Gerd. “The Superego, the Ego, and the Id in Statistical Reasoning.” A Handbook for Data Analysis in the Behavioral Sciences: Methodological Issues, 1993, 311–39.

This article presents a simple example designed to clarify many of the issues in these controversies. Along the way many of the fundamental ideas of testing from all three perspectives are illustrated. The conclusion is that Fisherian testing is not a competitor to Neyman-Pearson (NP) or Bayesian testing because it examines a different problem. As with Berger and Wolpert (1984), I conclude that Bayesian testing is preferable to NP testing as a procedure for deciding between alternative hypotheses.

Christensen, Ronald. “Testing Fisher, Neyman, Pearson, and Bayes.” The American Statistician 59, no. 2 (2005): 121–26.

C’mon, there aren’t any people defending the p-value?

Sure there are. They fall into two camps: “deniers,” a small group that insists there’s nothing wrong with p-values, and the much more common “fixers,” who propose making up for the shortcomings by augmenting NHST. Since a number of fixers have already been cited, I’ll just focus on the deniers here.

On the other hand, the propensity to misuse or misunderstand a tool should not necessarily lead us to prohibit its use. The theory of estimation is also often misunderstood. How many epidemiologists can explain the meaning of their 95% confidence interval? There are other simple concepts susceptible to fuzzy thinking. I once quizzed a class of epidemiology students and discovered that most had only a foggy notion of what is meant by the word “bias.” Should we then abandon all discussion of bias, and dumb down the field to the point where no subtleties need trouble us?

Weinberg, Clarice R. “It’s Time to Rehabilitate the P-Value.” Epidemiology 12, no. 3 (2001): 288–90.

The solution is simple and practiced quietly by many researchers—use P values descriptively, as one of many considerations to assess the meaning and value of epidemiologic research findings. We consider the full range of information provided by P values, from 0 to 1, recognizing that 0.04 and 0.06 are essentially the same, but that 0.20 and 0.80 are not. There are no discontinuities in the evidence at 0.05 or 0.01 or 0.001 and no good reason to dichotomize a continuous measure. We recognize that in the majority of reasonably large observational studies, systematic biases are of greater concern than random error as the leading obstacle to causal interpretation.

Savitz, David A. “Commentary: Reconciling Theory and Practice.” Epidemiology 24, no. 2 (March 2013): 212–14. doi:10.1097/EDE.0b013e318281e856.

The null hypothesis can be true because it is the hypothesis that errors are randomly distributed in data. Moreover, the null hypothesis is never used as a categorical proposition. Statistical significance means only that chance influences can be excluded as an explanation of data; it does not identify the nonchance factor responsible. The experimental conclusion is drawn with the inductive principle underlying the experimental design. A chain of deductive arguments gives rise to the theoretical conclusion via the experimental conclusion. The anomalous relationship between statistical significance and the effect size often used to criticize NHSTP is more apparent than real.

Hunter, John E. “Testing Significance Testing: A Flawed Defense.” Behavioral and Brain Sciences 21, no. 02 (April 1998): 204–204. doi:10.1017/S0140525X98331167.