Suppose that you want to demonstrate that baby boomers are more narcissistic than other generations, or that women are more agreeable and neurotic than men, or that people of different races have different amounts of intelligence. How do psychologists do that? Can they in fact do that?
Typically, the method is to come up with a bunch of questions that superficially appear to measure the intended characteristic. The questions are then “validated”, for example by checking that they all correlate with one another. Once the questionnaire is declared valid, psychologists can measure a variety of groups and make far-reaching claims about how our current political/social situation was caused all along by the thing they happen to study.
If you find this methodology questionable, but aren’t sure exactly what went wrong, you might be interested in hearing about psychometrics, the field concerned with psychological measurement. According to psychometricians, part of the problem is that psychologists are failing to follow best practices. That is the subject of this paper:
Test bias across groups
Suppose that I create a test which I claim measures intelligence. I use this test to measure people of different races, and I find that white people get higher scores than black people. Does this mean that there is a difference in average intelligence, or does it mean the test is biased? You could just check the test questions, but it’s entirely possible that the test is biased in some way that isn’t obvious. How can you prove it one way or another?
What we’d like to prove is measurement invariance: two people of equal intelligence should, on average, score the same on the test, regardless of their race. Psychologists cannot demonstrate measurement invariance directly, because they have no way to measure intelligence apart from the test.
Instead, they usually demonstrate a related property, known as predictive invariance. To do this, we first select some criterion of interest. For example, we might simply separate out one of the questions on the test and call that the criterion of interest. We then look at the relation between this criterion of interest and the results of the test. If the relationship is the same across groups, then we’ve demonstrated predictive invariance.
Now, the question is, does predictive invariance imply measurement invariance? Psychometricians have come to the conclusion that it does not. In fact, it can be proven* under common assumptions that predictive invariance contradicts measurement invariance. It is therefore fascinating to hear psychologists boast that “the issue of test bias is scientifically dead” on the basis that predictive invariance is found nearly everywhere. This appears to imply instead the opposite, that test bias is ubiquitous.
*The proof is interesting but technical, so I will discuss it at a later date.
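The tension between the two properties can be illustrated with a toy simulation (my own sketch, not from the paper). Here the test is measurement invariant by construction, but because the two groups differ in their latent means, the per-group regressions of the criterion on the test score come out with different intercepts, i.e. predictive invariance fails:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # large samples, so the estimates are stable

def simulate_group(latent_mean):
    """Simulate test scores and a criterion for one group.

    The test is measurement invariant by construction: the score
    depends on the latent trait the same way in every group.
    """
    theta = rng.normal(latent_mean, 1.0, n)      # latent trait
    test = theta + rng.normal(0.0, 1.0, n)       # unbiased, noisy test score
    criterion = theta + rng.normal(0.0, 1.0, n)  # unbiased criterion
    return test, criterion

# Two groups that differ only in their latent mean.
test_a, crit_a = simulate_group(0.0)
test_b, crit_b = simulate_group(1.0)

# Per-group regression of the criterion on the test score.
slope_a, intercept_a = np.polyfit(test_a, crit_a, 1)
slope_b, intercept_b = np.polyfit(test_b, crit_b, 1)

print(f"group A: slope={slope_a:.2f}, intercept={intercept_a:.2f}")
print(f"group B: slope={slope_b:.2f}, intercept={intercept_b:.2f}")
```

Under these toy parameters the slopes agree (about 0.5 in both groups), but the intercepts differ by about 0.5: the same test score predicts different criterion values depending on group membership, even though the test is unbiased by construction. Running the logic in reverse, observing identical regression lines across groups whose latent means differ is evidence *against* measurement invariance.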
Causal theories of testing
As mentioned earlier, survey questions are often validated by making sure they all correlate with one another. This criterion is called internal consistency. Do you necessarily expect high internal consistency?
As it turns out, it depends on your causal theory. For example, suppose I am measuring socioeconomic status (SES) using a series of questions. We call SES the “latent variable”, and the questions “indicators”. Does the latent variable cause the indicators, or do the indicators cause the latent variable? For example, you might say that education and income cause SES. When the indicators cause the latent variable, you don’t necessarily expect internal consistency at all (although internal correlations may appear anyway).
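This can be made concrete with a small simulation (my own illustration, not from the paper), using Cronbach’s alpha as the internal-consistency statistic. Indicators caused by a common latent variable come out internally consistent; independent indicators that jointly cause the latent variable need not:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50_000, 4  # respondents, items

def cronbach_alpha(items):
    """Cronbach's alpha for an (n, k) matrix of item scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Latent variable causes the indicators: every item is
# latent trait plus independent noise, so items correlate.
latent = rng.normal(0.0, 1.0, (n, 1))
caused_items = latent + rng.normal(0.0, 1.0, (n, k))

# Indicators cause the latent variable: items (think income,
# education) are independent; nothing forces them to correlate.
causing_items = rng.normal(0.0, 1.0, (n, k))

print(f"alpha, latent causes items: {cronbach_alpha(caused_items):.2f}")
print(f"alpha, items cause latent:  {cronbach_alpha(causing_items):.2f}")
```

With these toy parameters, alpha comes out around 0.8 in the first case and near zero in the second, even though both are perfectly sensible measurement setups. High internal consistency is evidence for a particular causal structure, not a universal requirement for a valid measure.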
But here’s the problem. Most measures in psychology don’t have any causal theory behind them. What the test measures is the thing measured by the test. There is no latent variable that is separate from the indicators. Under this view, it is impossible for the test to be biased, because it always measures exactly the thing it intends to measure, itself. This is well and good (not really), but then how can psychologists justify their far-reaching conclusions?
Borsboom suggests a bunch of questions that could be asked of the latent variable. Does it cause the indicators, or is it caused by the indicators? Is it discrete, or continuous? If the latent variable is continuous, are the indicators a monotonic function of it? Are they a smooth or erratic function? Part of the problem is that psychological theory doesn’t predict any particular answers to these questions. So perhaps we will only know once psychologists start following advice from psychometricians.
Borsboom spends a lot of time talking about the social obstacles to incorporating psychometrics into psychological research. I did not discuss this because I am not a researcher in this field and it’s out of my hands. Instead I focused on some of the concrete issues in common psychological methodology. First, I talked about how tests can be biased across groups, and how psychologists assess this bias incorrectly. Second, I talked about how psychologists fail to think of their tests as related to latent variables in a causal structure. But this is just scratching the surface.
Recently, there’s been controversy over the “replicability crisis” in psychology. Could it be that psychologists can’t replicate studies because of their bad psychometric practices? Sadly, I think these might be independent problems piled on top of each other. Bad psychometric practices wouldn’t lead to replication failures; they would instead lead to replicable but unjustified conclusions.
Of course, this paper was written over ten years ago. Perhaps psychologists have seen the light in the last decade? I couldn’t say.