The danger of sampling error


When analyzing a situation using data, a common pitfall is sampling error (more precisely, sampling bias): basing a conclusion on a sample that is not representative of the population at large. Many stereotypes and prejudices arise this way, because people form judgments about entire groups based on their experiences with the few members of that group they happen to encounter in their own lives.

Researchers are of course aware of this pitfall and take steps to guard against it, either by making sure the sample is properly drawn or, if necessary, by deliberately taking a sample that is not representative and making weighted adjustments after the fact. [Thanks to Crip Dyke in comment #4 for the link-MS] For example, when taking opinion polls, subgroups that make up only a small fraction of the total population are over-sampled so that each subgroup's sample is large enough for its views to be representative of that subgroup and not distorted by a few outliers. Then, when those results are folded into the total sample results, each subgroup is weighted back down to its actual share of the population. Those who do not understand this process sometimes seize upon the over-sampling of small groups to denounce the poll results as having been deliberately skewed to favor those groups.
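To make the weighting step concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the two groups, their population shares, the response counts, and the names `Stratum`, `unweighted_estimate`, and `weighted_estimate` are all hypothetical, and real pollsters use more elaborate weighting schemes than this simple one.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Stratum:
    """One subgroup of the poll (hypothetical structure for illustration)."""
    name: str
    population_share: float   # fraction of the whole population in this group
    responses: Sequence[int]  # 1 = agrees with the poll question, 0 = does not


def unweighted_estimate(strata: Sequence[Stratum]) -> float:
    """Pool every response together, ignoring how the sample was drawn."""
    pooled = [r for s in strata for r in s.responses]
    return sum(pooled) / len(pooled)


def weighted_estimate(strata: Sequence[Stratum]) -> float:
    """Weight each group's own average by its true share of the population."""
    return sum(
        s.population_share * (sum(s.responses) / len(s.responses))
        for s in strata
    )


# Made-up numbers: the small group is 10% of the population but is
# over-sampled so that it makes up half of the 1,000 people polled.
strata = [
    Stratum("large group", 0.90, [1] * 200 + [0] * 300),  # 40% agree
    Stratum("small group", 0.10, [1] * 400 + [0] * 100),  # 80% agree
]

print(f"unweighted: {unweighted_estimate(strata):.2f}")
print(f"weighted:   {weighted_estimate(strata):.2f}")
```

With these invented numbers, the pooled figure overstates agreement because the small group is over-represented in the raw sample, while the weighted figure scales each group back to its population share. The over-sampling affects only how precisely the small group's own views are measured, not how much that group counts in the total.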

Even though I am aware of this pitfall, I still succumb to it on occasion, drawing casual conclusions without thinking things through. The shelter-in-place requirements imposed due to the current pandemic brought this home to me. I live in a condominium complex that has about a hundred units. Soon after moving here last August, I came to the conclusion that it must consist mostly of older people and retirees because they were the ones I usually saw whenever I went outside, which was not often since I tend to live a mostly quiet life at home.

That was the source of my error. Because I am retired, I tend to go out in the middle of the day, which is definitely old people and retiree time. I was vaguely aware that there might be people of other ages who had to be elsewhere during the day, but I did not realize how many until this pandemic resulted in everybody being at home all the time. Now when I go out during the day for a walk, even with physical distancing, I see young parents with infants in strollers, toddlers, children playing outside, teenagers listening to their phones, joggers, and young and middle-aged people, and I realize that I live in a much more age-diverse group than I had thought.

Comments

  1. Bruce says

    Mano’s post is right. Another example may be when Americans look at local TV news. Too often, the only people of color shown on the news are connected with crimes. The white people in the audience ignore the white people connected with crime, because they see lots of other white people in their lives, so TV crime stuff is only a tiny part of the picture. But white people who rarely see any representations of people of color will thus get a large fraction of their impressions from such TV stuff. So in the minds of white people, the sample of people of color they see is mostly from those TV crime shows. This gives white people a distorted sampling of the true range of people of color.
    Now, younger people, including younger white people, are more likely to get around more, look at more people, and be less subject to the sampling bias created by TV. But overall, those who trusted TV were misled, even though it didn’t require anyone in TV to intend to mislead.

  2. says

    When I was doing research on trans* populations in the 90s and early 00s, I struggled with this quite a bit and never came up with a satisfactory response.

    Even labeling your data as relating to “out” trans* persons is suspect, since “out” is not a binary condition, but often a matter of degrees, and even among out trans* persons connectivity to trans* communities predicted availability to respond to questionnaires. So the more networked with other trans* persons someone was, the more their responses were likely to be counted in my research.

    And yet, when working on questions of trans* vulnerability to certain phenomena (poverty, violence, etc.) being isolated from community intuitively seems likely to be correlated to increasing vulnerability to such things. (The odd rich trans* person who lives on a private island notwithstanding.)

    So what do you do? It’s hard. Not least because everyone wants numbers. I could say that I found X in this sample which is suggestive of a trans* vulnerability to phenomenon Y that (exceeds/ equals/ falls below) the vulnerability found in the general population, but people would inevitably quote the number and not the suggestion, as if the specific number was in any way relevant to any group other than the sample itself.

    It was honestly a bit maddening. I still don’t know that there’s a good solution to that problem, save to produce very limited studies (a community needs assessment, for instance, is less affected by such things since it’s restricted to a particular time and place, and while other people may exist who are nominally part of some target population -- like trans* persons -- if they aren’t participating in trans* community then they also aren’t as likely to participate in the use of any new resources provided, so the bias is still there, but it can be of an acceptable nature and size for the purpose to which the study’s results are put).

  3. robert79 says

    “Researchers are of course aware of this pitfall and they take steps to take this into account, either by making sure the sample is properly drawn or, if necessary, deliberately taking a sample that is not representative and making weighted adjustments after the fact to take this into account. For example, when taking opinion polls, subgroups that make up only a small fraction of the total population are over-sampled so that that subgroups’ sizes are larger and the views are more representative of the subgroups and not distorted by a few outliers. Then when those results are folded into the total sample results, adjustments are made accordingly.”

    Oh my… This seems so obvious in hindsight, but it’s actually the first time I have seen it. (And I teach data analysis/statistics!) Do you have any examples/articles/whatever on hand where this is used?
