Why the algorithm is so often wrong


As a data scientist, the number one question I hear from friends is “How did the algorithm get that so wrong?” People don’t know it, but that’s a data science question.

For example, Facebook apparently thinks I’m trans, so they keep on advertising HRT to me. How did they get that one wrong? Surely Facebook knows I haven’t changed pronouns in my entire time on the platform.

I don’t know why the algorithm got it wrong in any particular case, but it’s not remotely surprising. For my job, I build algorithms like that (not for social media specifically, but it’s the same general idea), and as part of the process I directly measure how often the algorithm is wrong. Some of the algorithms I have created are wrong 99.8% of the time, and I sure put a lot of work into making that number a tiny bit lower. It’s a fantastically rare case where we can build an algorithm that’s right all the time.

If you think about it from Facebook’s perspective, their goal probably isn’t to show ads that understand you on some personal level, but to show ads that you’ll actually click on. How many ads does the typical person see, vs the number they click on? Suppose I never click on any ads. Then the HRT ads might be a miss, but then so is every other ad that Facebook shows me, so the algorithm hasn’t actually lost much by giving it a shot.
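
To put rough numbers on that (every figure here is invented, purely for illustration), here’s a back-of-the-envelope sketch in Python of why an ad that’s “wrong” 99.8% of the time can still be worth showing, and why nudging the click rate up even slightly is a big deal:

    impressions = 1_000_000
    revenue_per_click = 2.00   # hypothetical revenue per click

    for click_rate in (0.002, 0.004):
        clicks = impressions * click_rate
        print(f"click rate {click_rate:.1%}: wrong {1 - click_rate:.1%} of the time, "
              f"revenue ${clicks * revenue_per_click:,.0f}")

Doubling a tiny click rate doubles the ad revenue, even though the algorithm is still “wrong” the overwhelming majority of the time.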

So data science algorithms are quite frequently wrong simply as a matter of course. But why? Why can’t the algorithm see something that would be so obvious to any human reviewer?

The algorithm vs humans

There are algorithms that can perform better than humans, but those are rare cases, and often very heavyweight stuff. Google’s search engine, that’s a heavyweight algorithm. Self-driving cars, I can’t even imagine. But the vast majority of algorithms for everyday use are more rudimentary than that. You can think of them as mass-produced decision-making. A mass-produced algorithm isn’t nearly as good as boutique, human-made decisions, but then nobody’s got time to hand-select Facebook ads just for you.

There are some significant barriers to making algorithms that outperform humans. Usually, we train these algorithms using “labeled” data. That’s data where the correct answer is already known. For example, if we’re training an algorithm to identify spam, we provide a collection of e-mails, along with a flag that indicates whether each e-mail is spam or not. But if not even humans can determine whether an e-mail is spam, then where does the flag come from? If the flag comes from human labeling, then the algorithm can do no better than that, and probably does worse.
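
As a concrete sketch of what “labeled” data looks like (the emails and flags below are made up), here’s a tiny spam classifier in Python. The important part is the list of human-provided flags: the model learns to reproduce those judgments, so it can’t be any better than the people who supplied them.

    # Hypothetical emails, each flagged spam (1) or not spam (0) by a human reviewer.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    emails = [
        "WIN a FREE prize, click now",
        "Meeting moved to 3pm tomorrow",
        "Cheap meds, limited time offer",
        "Can you review my draft before Friday?",
    ]
    labels = [1, 0, 1, 0]  # the human-provided spam flags

    vectorizer = CountVectorizer().fit(emails)       # turn text into word counts
    model = MultinomialNB().fit(vectorizer.transform(emails), labels)

    # Predict on a new, unlabeled email.
    print(model.predict(vectorizer.transform(["free prize drawing this Friday"])))

And if the human reviewers mislabel some of those emails, the model happily learns the mistakes too.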

Another issue is that human labeling is often fairly close to the theoretical best you can do. The Bayes error rate is the theoretical minimum error rate, the error that remains even when the model is trained on an infinite number of rows of data. To illustrate, suppose two people have so far behaved identically on Facebook. Couldn’t they still go on to behave differently in the future? Algorithms aren’t magic; they can’t make the correct prediction for both of those people.
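
Here’s a toy illustration of that irreducible error (the data is invented): two users who look identical to the model, but who go on to behave differently. No model, no matter how much data it’s trained on, can get both of them right.

    import numpy as np

    # Hypothetical inputs: [follows tea pages, ads seen per day]
    X = np.array([
        [1, 20],   # user A
        [1, 20],   # user B -- identical, as far as the data shows
    ])
    y = np.array([1, 0])  # user A clicked the ad, user B didn't

    # The best any model can do here is pick one outcome, which is still wrong
    # for one of the two users: a 50% error rate on these rows that no amount
    # of extra training data can remove.
    best_guess = int(round(y.mean()))
    print("irreducible error:", (y != best_guess).mean())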

Although I’m not saying this is the most likely explanation, we ought to at least consider the possibility that my behavior is identical (as far as Facebook knows) to that of a person interested in HRT. It’s not safe to assume that I know what a trans person “looks like” to the algorithm, and besides, not everyone who wants HRT is trans! And really it’s just as well that Facebook can’t tell.

The two kinds of error

The Bayes error rate is the theoretical minimum error rate when a model is trained on infinite data. But real models are trained on a lot less data than that. This introduces two sources of error known as bias and variance.

Bias is the error that arises from an algorithm that is too simple and inflexible. For example, one common type of model, logistic regression, assumes that the relationship between each input and the prediction is monotonic, meaning either strictly increasing or strictly decreasing. If there’s a non-monotonic relationship, or a complex interaction between two of the input variables, the model will miss that.
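
Here’s a simulated sketch of that kind of bias (the scenario and numbers are made up): clicks happen only when ad frequency is moderate, which is a non-monotonic pattern. Plain logistic regression on the raw input can only fit a curve that goes one direction, so it ends up near the base rate, while a more flexible model picks the pattern up.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    ad_frequency = rng.uniform(0, 10, size=(5000, 1))
    # People click only at moderate ad frequency -- a non-monotonic relationship.
    clicks = ((ad_frequency[:, 0] > 3) & (ad_frequency[:, 0] < 7)).astype(int)

    for model in (LogisticRegression(), RandomForestClassifier(random_state=0)):
        model.fit(ad_frequency, clicks)
        print(type(model).__name__, "accuracy:", round(model.score(ad_frequency, clicks), 2))
    # Logistic regression lands near the 60% base rate; the random forest,
    # which can carve the input into ranges, captures the pattern almost exactly.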

Although Facebook’s attempts to advertise T to me have been a failure, it has been more successful at advertising tea. But I have enough tea for now. I’m not going to buy more for quite a while, probably not until after Facebook has forgotten about it. Facebook knows that people who click on tea ads are likely to click on even more tea ads, but why can’t it figure out when my needs are saturated? Possibly there’s no way for Facebook to know such a thing. But it’s also possible that the algorithm just isn’t flexible enough to capture such a pattern. (Also, I still click on the tea ads regardless, so I guess Facebook isn’t actually wrong.)

Variance is the error that arises from statistical flukes in the training data. The smaller the training data and the more degrees of freedom in the model, the more the model will latch onto those flukes. Data scientists can reduce variance either by collecting more data or by using a less flexible model.
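
And here’s a simulated sketch of variance (again, all numbers invented): with only a few dozen training rows and many mostly-irrelevant inputs, a very flexible model memorizes flukes in the training data and usually does worse on fresh data than a simpler, more constrained one.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(1)

    def make_data(n):
        X = rng.normal(size=(n, 20))                         # 20 inputs, mostly noise
        y = (X[:, 0] + rng.normal(size=n) > 0).astype(int)   # only column 0 matters
        return X, y

    X_train, y_train = make_data(30)       # small training set
    X_test, y_test = make_data(10_000)     # large "future" data set

    for model in (LogisticRegression(), DecisionTreeClassifier(random_state=0)):
        model.fit(X_train, y_train)
        print(type(model).__name__,
              "train:", round(model.score(X_train, y_train), 2),
              "test:", round(model.score(X_test, y_test), 2))
    # The unconstrained tree typically scores 100% on the 30 training rows but
    # does noticeably worse than the regularized logistic regression on new data.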

For example, maybe the reason Facebook doesn’t include my pronoun history is that it would open the floodgates to a whole pile of inputs about my profile history. Then the algorithm would start finding spurious relationships between, say, the number of prior jobs listed and an interest in baldness treatments.

There’s a tradeoff between bias and variance: algorithms with lower variance tend to have higher bias, and vice versa. The best algorithms will have some of both.

Exploration and testing

Although it is not the most likely explanation, another possibility is that you’re part of a test. In order to maintain or improve the quality of a data science algorithm, its owners need to keep collecting data. If the only data they have comes from showing people the ads that the algorithm already thinks they want, that imposes a selection effect that makes it hard to maintain the algorithm.

One very simple scheme to get around this is to have a test group. The test group receives totally random ads, creating an entirely unbiased data set. The issue is that users might notice something is off, so there are more complex testing schemes that are less noticeable. But the bottom line is that the algorithm isn’t always giving you its best guess, because it needs to test in order to learn.

In data science this is called the exploration/exploitation tradeoff. On the one hand the algorithm could “exploit” what it already knows. On the other hand, it could “explore” to learn more.
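
A minimal sketch of that tradeoff is the “epsilon-greedy” rule (the ad names and click rates below are invented): most of the time, show whichever ad has the best observed click rate so far, but some small fraction of the time, show a random ad so you keep learning about the others.

    import random

    random.seed(0)
    ads = ["tea", "hrt", "pizza"]
    true_click_rate = {"tea": 0.04, "hrt": 0.002, "pizza": 0.01}  # unknown to the algorithm
    shows = {ad: 0 for ad in ads}
    clicks = {ad: 0 for ad in ads}
    epsilon = 0.05  # 5% of impressions are pure exploration

    for _ in range(100_000):
        if random.random() < epsilon or min(shows.values()) == 0:
            ad = random.choice(ads)                             # explore: random ad
        else:
            ad = max(ads, key=lambda a: clicks[a] / shows[a])   # exploit: best so far
        shows[ad] += 1
        clicks[ad] += random.random() < true_click_rate[ad]

    print({ad: round(clicks[ad] / shows[ad], 3) for ad in ads})

The 5% of impressions spent on random ads are exactly the deliberately suboptimal guesses described above: wasted in the short run, but necessary for the algorithm to keep learning.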

TL;DR

Predictive algorithms are naturally wrong all the time. There is a theoretical limit to how well they can do, and in practice they cannot even do that well. Finite data introduces two more sources of error: “variance” coming from spurious patterns in the data, and “bias” coming from inflexibility in the model. And in some small percentage of cases, algorithms deliberately make inferior guesses in order to test and improve.

This isn’t a complete list of all sources of error, just an explanation of the basics. There’s also the issue of how errors can discriminate against certain groups or cause disparate impact; I left that topic out because I feel it deserves a dedicated discussion.

Data science algorithms are all around us, directly impacting our lives, so it’s fair to complain that they’re not always doing it well. But the two biggest complaints I see are that the algorithm is too often wrong, and that tech companies are collecting too much data. As I hope is now clear, these complaints are in conflict with one another. To thread that needle, it’s necessary to have a better understanding of where errors in “the algorithm” come from.

Comments

  1. Bruce says

    Siggy, if you ever mention any HRT thing on your blog, that sort of proves the Fbkkk algorithm was CORRECT to send you the ad. Any company selling, say, pizza or HRT will pay extra to target their ad to a more likely audience. The ad buyers are comparing the “targeted” rate of return versus their other options. Any ad on an FTB-related web site is more likely to be a hit than an untargeted ad. While most readers of PZ Myers’s blog don’t need HRT, they’re all more likely to click on such ads than the general population of, say, CNN viewers. That is, while these algorithms are far from good, they are sort of working if they are better than anyone else’s algorithms. If you or I were trying to place ads to sell HRT, what would be a better wide audience to show to? If a leopard is chasing us, you don’t have to run faster than the leopard, just faster than me. You’ll have lots of time to stroll away as he eats me.

  2. says

    Yeah, the HRT ads are a bit funny to me, but it’s not really as much of a mystery as I make it out to be in the OP. It’s not totally off the mark; I’m in the right crowd.

    But I would push back on saying that the HRT ads are “correct”. If their goal was to make me aware of their company, or for me to look at their website, then they were correct. But ultimately their goal is probably to sell HRT to me, and they haven’t done that, so it was incorrect. And yes, the vast majority of targeted ads are incorrect by that measure.

  3. robert79 says

    Also, in the case of ads, the algorithm doesn’t have to be perfect, it just needs to do its job well enough that Facebook makes a profit. If the percentage of ads clicked goes from 2% to 4% the algorithm will still be wrong 96% of the time but will have doubled the income from ads.

    Of course, self-driving cars are another story…
