Everything Is Significant!

Back in 1938, Joseph Berkson made a bold statement.

I believe that an observant statistician who has had any considerable experience with applying the chi-square test repeatedly will agree with my statement that, as a matter of observation, when the numbers in the data are quite large, the P’s tend to come out small. Having observed this, and on reflection, I make the following dogmatic statement, referring for illustration to the normal curve: “If the normal curve is fitted to a body of data representing any real observations whatever of quantities in the physical world, then if the number of observations is extremely large—for instance, on an order of 200,000—the chi-square P will be small beyond any usual limit of significance.”

This dogmatic statement is made on the basis of an extrapolation of the observation referred to and can also be defended as a prediction from a priori considerations. For we may assume that it is practically certain that any series of real observations does not actually follow a normal curve with absolute exactitude in all respects, and no matter how small the discrepancy between the normal curve and the true curve of observations, the chi-square P will be small if the sample has a sufficiently large number of observations in it.

Berkson, Joseph. “Some Difficulties of Interpretation Encountered in the Application of the Chi-Square Test.” Journal of the American Statistical Association 33, no. 203 (1938): 526–536.
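Berkson’s dogmatic statement is easy to check by simulation. The sketch below is my own toy illustration, not Berkson’s: it draws 200,000 observations from a distribution that is almost, but not exactly, normal (98% standard normal plus a little wider-tailed contamination), fits a normal curve, and runs a chi-square goodness-of-fit test. The tiny discrepancy is enough to push the statistic far past any usual critical value.

```python
import math
import random

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

random.seed(0)
N = 200_000
# 98% standard normal plus 2% wider-tailed "contamination" -- a stand-in
# for real observations that are almost, but not exactly, normal
data = [random.gauss(0, 2 if random.random() < 0.02 else 1)
        for _ in range(N)]

# fit the normal curve by method of moments, then standardize
mu = sum(data) / N
sd = math.sqrt(sum((x - mu) ** 2 for x in data) / N)
z = [(x - mu) / sd for x in data]

# chi-square goodness of fit over 14 bins: (-inf,-3], (-3,-2.5], ..., (3,inf)
edges = [-3.0 + 0.5 * i for i in range(13)] + [math.inf]
chi2, lo, cum = 0.0, -math.inf, 0.0
for hi in edges:
    p = (phi(hi) if hi != math.inf else 1.0) - cum
    cum += p
    observed = sum(1 for v in z if lo < v <= hi)
    chi2 += (observed - N * p) ** 2 / (N * p)
    lo = hi

print(f"chi-square statistic: {chi2:.1f}")
# 14 bins minus 1, minus 2 fitted parameters = 11 degrees of freedom,
# where the .05 critical value is about 19.7
```

With contamination this mild, a few hundred observations would sail through the test; at 200,000 the statistic lands far beyond the critical value, exactly as Berkson predicted.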
His prediction would be vindicated two decades later.

Experience shows that when large numbers of subjects are used in studies, nearly all comparisons of means are “significantly” different and all correlations are “significantly” different from zero. The author once had occasion to use 700 subjects in a study of public opinion. After a factor analysis of the results, the factors were correlated with individual-difference variables such as amount of education, age, income, sex, and others. In looking at the results I was happy to find so many “significant” correlations (under the null-hypothesis model) – indeed, nearly all correlations were significant, including ones that made little sense. Of course, with an N of 700 correlations as large as .08 are “beyond the .05 level.” Many of the “significant” correlations were of no theoretical or practical importance.
Nunnally, Jum. “The Place of Statistics in Psychology.” Educational and Psychological Measurement 20, no. 4 (1960): 641–650.
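Nunnally’s arithmetic checks out. The significance test for a correlation uses t = r√(n−2)/√(1−r²), and with n = 700 the t distribution is effectively normal. A quick check of my own, using only the standard library:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def corr_p_value(r, n):
    """Two-sided p-value for H0: the true correlation is zero.
    Uses t = r*sqrt(n-2)/sqrt(1-r^2); for n in the hundreds the
    t distribution is indistinguishable from the normal."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    return 2.0 * (1.0 - phi(abs(t)))

print(f"r = 0.08, N = 700: p = {corr_p_value(0.08, 700):.3f}")  # ~0.034
```

A correlation explaining well under 1% of the variance clears the .05 bar, just as Nunnally says.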
One of the common experiences of research workers is the very high frequency with which significant results are obtained with large samples. Some years ago, the author had occasion to run a number of tests of significance on a battery of tests collected on about 60,000 subjects from all over the United States. Every test came out significant. Dividing the cards by such arbitrary criteria as east versus west of the Mississippi River, Maine versus the rest of the country, North versus South, etc., all produced significant differences in means. In some instances, the differences in the sample means were quite small, but nonetheless, the p values were all very low.
Bakan, David. “The Test of Significance in Psychological Research.” Psychological Bulletin 66, no. 6 (1966): 423.
This statement and other related ones led to a crisis in psychology in the 1960s, eerily similar to the modern “replication crisis.” But while the current crisis is focused on publication bias and the process of science, this earlier one was focused on the theory behind science.
The major point of this paper is that the test of significance does not provide the information concerning psychological phenomena characteristically attributed to it; and that, furthermore, a great deal of mischief has been associated with its use. What will be said in this paper is hardly original. It is, in a certain sense, what “everybody knows.” To say it “out loud” is, as it were, to assume the role of the child who pointed out that the emperor was really outfitted only in his underwear. Little of that which is contained in this paper is not already available in the literature, and the literature will be cited.
Bakan (1966)
If one of Fisherian frequentism’s major flaws is that it exaggerates effect sizes when samples are small, the other is that it exaggerates significance when sample sizes are large. One reason for this comes from the use of what Jacob Cohen dubbed “nil hypotheses,” though other authors had observed the problem long before, going as far back as Berkson (1938).
Statisticians classically asked the wrong question – and were willing to answer with a lie, one that was often a downright lie. They asked “Are the effects of A and B different?” and they were willing to answer “no.” All we know about the world teaches us that the effects of A and B are always different – in some decimal place – for any A and B. Thus asking “Are the effects different?” is foolish.
Tukey, John W. “The Philosophy of Multiple Comparisons.” Statistical Science 6, no. 1 (February 1991): 100–116. doi:10.1214/ss/1177011945.
If you make your hypotheses too specific, they are guaranteed to be false. The falsehood is tough to detect in small samples, but it becomes a major problem with large ones.
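To see the effect, here is a toy simulation of my own: two groups whose true means differ by a trivial 0.02 standard deviations, tested with a simple two-sided z test. The small sample usually shrugs; the large one all but guarantees “significance.”

```python
import math
import random

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def two_sample_p(n, delta=0.02, seed=1):
    """Two-sided z test of 'the group means are equal' when the true
    difference is a trivial delta standard deviations (known variance 1)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(delta, 1.0) for _ in range(n)]
    z = (sum(b) / n - sum(a) / n) / math.sqrt(2.0 / n)
    return 2.0 * (1.0 - phi(abs(z)))

print(f"n = 100 per group:       p = {two_sample_p(100):.3f}")
print(f"n = 1,000,000 per group: p = {two_sample_p(1_000_000):.2g}")
```

The hypothesis “the means are exactly equal” was false from the start; all the big sample does is buy enough power to notice.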
A second reason is that p-values are biased against the null hypothesis. Because Fisherian frequentism considers only one hypothesis, it has no way to weigh the evidence for that hypothesis against the evidence for any alternative. Besides the above link, I also ran into this when I did an in-depth Bayesian analysis of Darryl Bem’s infamous paper. By the time of post number seven, the nine experiments and twenty tests that he thought were strongly in favor of precognition had turned out to be weak evidence against its existence. You don’t just have to take my word for it, though; other authors have come to the same conclusion.
In sum, tests for direction are easier than tests for existence: when applied to the same data, tests for direction are more diagnostic than tests for existence. From a Bayesian perspective, the one-sided P value is a test for direction; when this test is misinterpreted as a test for existence—as classical statisticians are wont to do—this overstates the true evidence that the data provide against a point null hypothesis.
Marsman, M., and E.-J. Wagenmakers. “Three Insights from a Bayesian Interpretation of the One-Sided P Value.” Educational and Psychological Measurement, October 5, 2016. doi:10.1177/0013164416669201.
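How badly can the evidence be overstated? Here is a toy normal model of my own (a simplification, not Marsman and Wagenmakers’ analysis): put a unit-information prior on the effect under the alternative, and the Bayes factor in favor of the null at a “just significant” z = 1.96 grows with the sample size.

```python
import math

def bf01(z, n):
    """Bayes factor for H0 (mu = 0) against H1 (mu drawn from a
    unit-information prior, mu ~ N(0, sigma^2)), given a z statistic
    computed from n observations with known variance sigma^2:
    BF01 = sqrt(n + 1) * exp(-z^2/2 * n/(n + 1))."""
    return math.sqrt(n + 1) * math.exp(-z * z / 2 * n / (n + 1))

for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: z = 1.96 (p = .05) gives BF01 = {bf01(1.96, n):,.1f}")
```

At a million observations, a result that Fisherian frequentism stamps “significant at .05” is, under this model, evidence of better than 100:1 for the null. A barely-significant p-value and strong evidence against the alternative can happily coexist.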
But there’s a third reason, one that has nothing to do with theory.
Crud factor: In the social sciences and arguably in the biological sciences, “everything correlates to some extent with everything else.” This truism, which I have found no competent psychologist disputes given five minutes reflection, does not apply to pure experimental studies in which attributes that the subjects bring with them are not the subject of study (except in so far as they appear as a source of error and hence in the denominator of a significance test). There is nothing mysterious about the fact that in psychology and sociology everything correlates with everything. Any measured trait or attribute is some function of a list of partly known and mostly unknown causal factors in the genes and life history of the individual, and both genetic and environmental factors are known from tons of empirical research to be themselves correlated.
Meehl, Paul E. “Why Summaries of Research on Psychological Theories Are Often Uninterpretable.” Psychological Reports 66, no. 1 (1990): 195–244.
Since everything correlates with everything else to some extent, at least within biological systems, manipulating one variable will always have some effect on another. Crank up the sample size, and that effect is guaranteed to reach statistical significance. Just because the dependent variable was affected, however, doesn’t show that the independent one is the primary cause. For instance, black people in the United States consistently get lower scores on school tests. A complex mix of social factors is to blame, but it’s incredibly easy to think that a single variable like race can explain it all. Isn’t that what Ockham’s Razor says, after all?
Between this and the problems with small sample sizes, nailing a statistically significant result in Fisherian frequentism is surprisingly easy. Apply some time and patience, and you can reject any hypothesis you want rejected.