Racial Profiling – a data mining perspective (WARNING: WONKY)

Sam Harris posted a piece called “In Defense of Profiling.”  PZ Myers posted a response explaining why that’s a terrible idea.

In general it should go without saying that I agree with PZ, unless stated otherwise.  I just want to add a little something from the perspective of a computer science nerd who’s spent a bit of time around data mining.  I also want to prove that I didn’t go to grad school for nothing.  (It cost me thousands!  <drum fill>)

In data mining, we have the twin problems of “false positives” and “false negatives.”  For example, suppose there is a certain disease that is very rare in the population, but very deadly if not detected.  Let’s say that one person in a million has this disease, which is 0.0001%.  If we devise a screening procedure for this disease, we would like to be very sure that we catch almost all people who are at risk.

So let’s say we apply this screening test to somebody, and they (unbeknownst to us) have the disease, but our test says they do not.  That is a false negative, and it can kill the person, since they won’t be treated.

Let’s say we apply the screening test to somebody who does not have the disease, and the test says that they have it.  That is a false positive.  But the consequences of a false positive are not as dire.  If the screen says you have the disease, you follow up with another test that is more rigorous, and more expensive, to find out for certain whether you need treatment.

As I said, we would like to prevent false negatives as much as we can, so we set the sensitivity of the screen to be very high.  Even so, the test will give a positive result for only one person out of every thousand who take it, and those few people who have the disease are almost certain to get a positive.

Okay, now let’s say I go to the doctor and I get a positive result.  How worried should I be that I have the disease?  As it turns out, not very worried at all.

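See, the screening test says I have the disease, but I almost certainly don’t. One in a thousand people gets a positive. One in a million people actually has the disease, and nearly all of them test positive. So out of a million people tested, about a thousand get a positive result and only about one of those is actually sick. Therefore, only one in a thousand positive tests belongs to someone who actually has the disease.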
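For the nerds who want to check my arithmetic, here is a minimal sketch in Python. It uses the two numbers from the example (one positive per thousand tests, one sick person per million) and assumes, for simplicity, that the test catches every sick person. The post only says "almost certain," so that is a rounding on my part, and the function and variable names are made up for illustration.

```python
from fractions import Fraction

# Numbers taken from the example above.
positive_rate = Fraction(1, 1_000)      # share of everyone tested who gets a positive
prevalence = Fraction(1, 1_000_000)     # share of everyone who actually has the disease

# Assumption: the test catches every sick person ("almost certain" rounded up to 1).
sensitivity = 1


def chance_sick_given_positive(prevalence, positive_rate, sensitivity=1):
    """Bayes' rule: P(sick | positive) = P(positive | sick) * P(sick) / P(positive)."""
    return sensitivity * prevalence / positive_rate


ppv = chance_sick_given_positive(prevalence, positive_rate, sensitivity)
print(ppv)  # 1/1000
```

If you make the test less than perfect (say, sensitivity of Fraction(99, 100)), the answer only gets smaller, so the "don't panic" conclusion holds either way.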

Is that clear? Good. Now let’s consider terrorists…

Comedian Kumail Nanjiani, wanted for slaying thousands of audiences with his hilarious wit.  

Pop quiz: which of these two guys is the terrorist? Answers at the bottom of the post, or you can mouse over the pictures for more information.

How many terrorists are there in the US?  I don’t know, but let’s look at it this way.  Statistics show that about five people per hundred thousand are victims of homicide each year.  If we assume roughly one killer per victim, that is also about five murderers per hundred thousand people.  The vast majority of murderers aren’t actually terrorists as such, but let’s be generous and assume that one in ten is.  That means that at any given time, 5 in a million are likely to be terrorists, or 0.0005%.
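Here is the same back-of-the-envelope estimate as a short Python sketch. The homicide rate and the "one in ten" guess come straight from the paragraph above. The one-killer-per-victim step is an assumption the estimate needs in order to turn a victim rate into a murderer rate, and the variable names are just for illustration.

```python
from fractions import Fraction

# From the paragraph above: about 5 homicide victims per 100,000 people per year.
homicide_rate = Fraction(5, 100_000)

# Assumption: roughly one murderer per victim, so the same rate applies to murderers.
murderer_rate = homicide_rate

# The deliberately generous guess: one murderer in ten is a terrorist.
terrorist_share = Fraction(1, 10)

terrorist_rate = murderer_rate * terrorist_share
per_million = terrorist_rate * 1_000_000
percent = float(terrorist_rate * 100)

print(per_million)        # 5
print(f"{percent:.4f}%")  # 0.0005%
```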

Suppose we start profiling people who simply look Middle Eastern, like mug shot number two up there.  There are about 1.5 million Arab Americans living in the US, which is 0.5% of all people here.  Similarly, there are 2.5 million people who are practicing Muslims, which accounts for 0.8% of all people living here.
