
Ethics of accuracy

Andreas Avester summarized Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy by Cathy O’Neil. Now, I’m not sure how many readers remember this, but I’m a professional data scientist. Which doesn’t really qualify me as an authority to talk about data science, much less the ethics thereof, but, hey, it’s a thing. I have thoughts.

In my view there are two distinct¹ ethical issues with data science: 1) our models might make mistakes, or 2) our models might be too accurate. As I said in Andreas’ comments:

The first problem is obvious, so let me explain the second one. Suppose you found an algorithm that perfectly predicted people’s healthcare expenses, and started using this to price health insurance. Well then, you might as well not have health insurance at all, because each person pays the same amount with or without it. This is “fair” in the sense that everyone’s paying exactly the amount of burden they’re placing on society. But it’s “unfair” in that the amount of healthcare expenses people have is mostly beyond their control. I think it would be better if our algorithms were actually less accurate, and we just charged everyone the same price, modulo, I don’t know, smoking.
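To make the arithmetic of that scenario concrete, here is a minimal sketch. The five people and their costs are made-up numbers, the function names are hypothetical, and "perfect prediction" is simply assumed by treating each predicted cost as the true cost. It compares a premium equal to each person's own predicted cost with a single premium set at the pool average.

```python
# Hypothetical yearly healthcare costs for five people (made-up numbers).
# A "perfect" model is assumed, so predicted cost equals true cost.
predicted_costs = {"A": 500, "B": 1_000, "C": 2_000, "D": 6_500, "E": 40_000}


def risk_rated_premiums(costs):
    """Charge each person exactly their own predicted cost."""
    return dict(costs)


def community_rated_premiums(costs):
    """Charge everyone the same premium: the average cost of the pool."""
    average = sum(costs.values()) / len(costs)
    return {person: average for person in costs}


def net_transfer(costs, premiums):
    """Cost covered minus premium paid; positive means the pool subsidizes that person."""
    return {person: costs[person] - premiums[person] for person in costs}


risk_rated = risk_rated_premiums(predicted_costs)
community = community_rated_premiums(predicted_costs)

for person in predicted_costs:
    print(
        person,
        "risk-rated premium:", risk_rated[person],
        "community premium:", community[person],
        "subsidy under community rating:",
        net_transfer(predicted_costs, community)[person],
    )
```

Both schemes collect the same total, so the insurer is indifferent. Under the perfectly accurate premiums every net transfer is zero, which is the sense in which the insurance stops doing anything. Under the flat premium, the people with low costs subsidize the person with high costs.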

The ethical perils of high accuracy predictions get right to the heart of the question: what is fair? Is it fair for people to pay a price according to how much healthcare they need? Or can we admit that the distribution of health problems is already unfair to begin with? And what do we do about health problems that are at least partially within people’s control? Do we charge people the full costs, or just enough to generate an effective incentive structure?

Andreas Avester quoted a section of the book talking about car insurance, which is fairly similar to my example of health insurance. If data science enables car insurance companies to more accurately predict who will get into car accidents, then this will increase inequality. O’Neil seems to agree on this point.

However, O’Neil chose to illustrate the problem with a hypothetical case of a good driver who has to commute through a bad neighborhood late at night. An insurance company that tracks her location might conclude that she is a risky driver, and charge her more for insurance. While I agree that this is unfair, it’s a poor example, because it suggests that the solution is to make our models more and more accurate, which might exacerbate the problem! As I said in the comments,

I think O’Neil chooses that example, because it’s just easier to sympathize with the good driver who is screwed by the algorithm. It’s harder to sympathize with all the bad drivers getting screwed. But they are nonetheless getting screwed, and increasingly accurate algorithms will hurt rather than help.

Here’s why I think “too much accuracy” is a bigger problem than “too many mistakes”. Improving accuracy is already in the interest of the companies that make these models. Insurance companies don’t want to charge the good driver higher prices, because a competing company who figures out that she’s a good driver will be able to undercut their price. Companies adopt data science methods precisely because they have proven to themselves that these methods make fewer mistakes. However, when the problem is “too much accuracy”, this is a problem that can only be addressed through policy, and I don’t trust companies to support those policies.²
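The competitive pressure described above can be sketched with a toy example. Everything here is an assumption for illustration: two hypothetical insurers X and Y, made-up cost estimates, a fixed 10% markup, and drivers who always buy the cheapest quote.

```python
MARGIN_PERCENT = 10  # hypothetical markup both insurers add to their cost estimate


def quote(estimated_cost):
    """Premium offered for a given estimated yearly claim cost."""
    return estimated_cost * (100 + MARGIN_PERCENT) / 100


# Made-up drivers. Insurer Y's model is assumed accurate; insurer X's is not.
drivers = {
    "good_driver": {"true_cost": 400, "estimates": {"X": 900, "Y": 400}},
    "bad_driver": {"true_cost": 1_500, "estimates": {"X": 700, "Y": 1_500}},
}


def outcome(driver):
    """Return (winning insurer, premium paid, insurer's profit on this driver)."""
    quotes = {name: quote(est) for name, est in driver["estimates"].items()}
    winner = min(quotes, key=quotes.get)  # the driver buys the cheapest quote
    profit = quotes[winner] - driver["true_cost"]
    return winner, quotes[winner], profit


for name, driver in drivers.items():
    print(name, outcome(driver))
```

In this toy setup the inaccurate insurer X loses the good driver to Y, and wins the bad driver only by underpricing her at a loss. Either way the mistake costs X money, which is why I expect companies to chase accuracy without being told to.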

But I should also offer a contrary argument. Too much accuracy might, in the long run, be the deeper problem, but in the short run, data science is a developing field that will certainly make mistakes. And even when it makes fewer mistakes than traditional methods, it’s possible that it makes worse mistakes. For example, the task of picking out job candidates from a stack of resumes is a deeply unfair and not very accurate process. But if gender identifiers are mostly hidden, we can hope that it’s at least equally unfair to people of all genders. On the other hand, if you give the stack of resumes to a neural network, the algorithm might be able to infer gender from various clues, and you might not even know it has done so. In principle, companies want more accurate models, but in practice, they might find it expedient to accept more discriminatory models.
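Here is a rough sketch of how a model can discriminate by gender without ever being given gender. It is not a neural network. The "model" just memorizes historical hire rates for one proxy feature, and the data, the proxy keyword, and the threshold are all invented for illustration.

```python
# Each historical resume: (has_proxy_keyword, gender, was_hired).
# Gender is stored only so we can audit afterwards; the "model" never reads it.
# All counts are made up, and the historical decisions are assumed to be biased.
history = (
    [(True, "F", False)] * 30 + [(True, "F", True)] * 10
    + [(False, "F", True)] * 5 + [(False, "F", False)] * 5
    + [(True, "M", True)] * 2 + [(True, "M", False)] * 3
    + [(False, "M", True)] * 30 + [(False, "M", False)] * 15
)


def fit_hire_rates(rows):
    """'Train' by memorizing the historical hire rate for each proxy value."""
    rates = {}
    for value in (True, False):
        outcomes = [hired for proxy, _gender, hired in rows if proxy == value]
        rates[value] = sum(outcomes) / len(outcomes)
    return rates


def shortlist_rate_by_gender(rows, rates, threshold=0.5):
    """Audit: the share of each gender the model would shortlist."""
    result = {}
    for g in ("F", "M"):
        proxies = [proxy for proxy, gender, _hired in rows if gender == g]
        shortlisted = sum(rates[p] >= threshold for p in proxies)
        result[g] = shortlisted / len(proxies)
    return result


rates = fit_hire_rates(history)
audit = shortlist_rate_by_gender(history, rates)
print(rates)
print(audit)
```

With these invented numbers the model shortlists 90% of the men and 20% of the women, even though its only input is the proxy keyword. You would not see this unless you ran the audit step, which requires having the gender data the model supposedly never used.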

By the way, if any readers would be interested in me writing more about data science topics, let me know.

Footnotes:

1. I don’t mean to say these are the only two ethical issues in data science. Andreas Avester’s summary also discussed perverse incentives, which strikes me as a third distinct problem. And although it was absent from this discussion, privacy concerns are obviously a big deal.

2. Although, in the case of insurance, I’d expect insurance companies to be in favor of regulation. Because without regulation, the insurance markets might just collapse.


Comments

anat says

January 8, 2020 at 10:21 am

Of course there is the bigger question of what kind of equality or equity we consider worth striving for. Is it more fair when everyone pays the same amount, when everyone pays by their ability to pay, or by their cost to the system, or perhaps by some other idea? And the answer might change based on the thing being paid for: how essential it is to living a life of decent quality, and how much control we have over how much of it we consume.
Ketil Tveiten says

January 8, 2020 at 12:41 pm

The nice thing about these issues is that they are a luxury reserved for data-sciencers who have gotten past dealing with GIGO; and so probably still reserved for the minority.
abbeycadabra says

January 8, 2020 at 2:23 pm

Off topic, thank you for twigging me to a better way to handle footnotes, I’d been looking for a user-friendlier method.
Andreas Avester says

January 8, 2020 at 2:31 pm

However, O’Neil chose to illustrate the problem with a hypothetical case of a good driver who has to commute through a bad neighborhood late at night. An insurance company that tracks her location might conclude that she is a risky driver, and charge her more for insurance. While I agree that this is unfair, it’s a poor example, because it suggests that the solution is to make our models more and more accurate, which might exacerbate the problem!

Have you read the full book? Because otherwise it would be problematic for you to make assumptions about what the author wanted to express based upon my book review/summary and a few paragraphs I chose to quote. In her book, the author mentions dozens of examples, it was me who chose to quote this one.

Insurance companies don’t want to charge the good driver higher prices, because a competing company who figures out that she’s a good driver will be able to undercut their price.

Insurance companies should be mandated by law to charge the same price for everybody (at least in most situations).
Siggy says

January 8, 2020 at 6:39 pm

@Andreas Avester,
I don’t need to see the other examples in the book to say that this one was a bad example.

But also, this does seem to be a theme with many of the examples in the book, and not just the ones that you chose. For example, when you quoted Wikipedia, it said

If a poor student can’t get a loan because a lending model deems him too risky (by virtue of his zip code)

which seems to hint that the problem is that the model is too simplistic and relies too heavily on zip code. In other words, the problem is mistakes.

I also looked for another summary, and found one on Scientific American, and all it talks about are the perils of mistakes. Not one mention of the perils of high accuracy. The most charitable thing I can say is that Cathy seems to be aware of both sides of the problem, but clearly only one side is making an impression on readers.
Siggy says

January 8, 2020 at 6:49 pm

Insurance companies should be mandated by law to charge the same price for everybody (at least in most situations).

In the context of health insurance, this policy is called community rating. Although, sometimes price is allowed to vary by a few case characteristics. In the US under Obamacare, price can vary by age and location, and by tobacco use. So when I mentioned smoking I was thinking of the US.
Siggy says

January 8, 2020 at 6:51 pm

@abbey #3,
I take my footnote formatting directly from a guide from one of my bloggy friends.
Chris J says

January 9, 2020 at 11:34 am

The problem isn’t with accuracy per se, the problem is with insurance. The biggest lie is that insurance is about helping people who can’t afford their bills, and the smaller one is that insurance is about distributing the load of medical payments. “Insurance” is a for-profit industry, with the goal of making more money than they distribute. Thus, they are incentivized to, on average, take more money from individuals than they would pay out to those same individuals, and incentivized to keep prices cheaper to healthy folks (who they won’t pay anything out for) so that more people will buy their product.

Better data and prediction models just means they can hit that average more efficiently. The fundamental goal of insurance companies is antithetical to what we pretend insurance is about.

In my opinion, a real, honest insurance system would just be a tax based on ability to pay. A giant pool that all health costs are drawn from. It shouldn’t be based on expectation of medical expenses at all because that’s not the point, the point is to help people cover costs they wouldn’t be able to otherwise.
Siggy says

January 9, 2020 at 4:20 pm

@Chris J #8,
That’s the idea behind single payer health insurance! Or community rating + individual mandate, as we’re supposed to have in the US.

It’s a good question whether the problem I’m talking about is unique to insurance, or if there are similar problems in other realms. Consider the use of machine learning models to rate people’s credit. Are there any ethical problems posed by being able to estimate credit with increasing accuracy? Or consider the use of models in college admissions, or resume sorting.

A Trivial Knot

Everything is simple except when not
