Bayes’ Theorem: Deceptively Simple

Bayes' Theorem, in classic form

Good ol’ Bayes’ Theorem. Have you ever wondered where it comes from, though? If you don’t know probability, there doesn’t seem to be any obvious logic to it. Once you’ve had it explained to you, though, it seems blindingly obvious and almost tautological. I know of a couple of good explainers, such as E.T. Jaynes’ Probability Theory, but just for fun I challenged myself to re-derive the Theorem from scratch. Took me about twenty minutes; if you’d like to try, stop reading here and try working it out yourself.

Right, let’s start with a Venn diagram.

Just your standard Venn Diagram.

This entire square represents everything we care about here. “A” is one section of this universe, “B” is another, and the two of them might overlap to some degree. The only assumptions we’re making here are that the size of A, written as “|A|,” is greater than zero, as is |B| or the size of B, and that both A and B fit entirely within the universe. We’re not saying anything else about the size of those sections, or how much they overlap. The overlapping section can be written as A ∩ B, or “the intersection of A and B.”

Let’s say we’re interested in the relative sizes between two sections of this chart. Specifically, we want to calculate this:

|A ∩ B| / |A| = ?

We can start by doing a basic math trick. If you multiply something by a number that isn’t zero, then divide by that number again, you get back the original number you started with. So let’s invoke that with the value of |B|.

|A ∩ B| / |A| = |A ∩ B| / |B| * |B| / |A|

This seems to be moving backwards, but it’ll come in handy once we finish talking about probabilities.

The easiest way to think of probability is the ratio of desired outcomes to possible outcomes. How would we calculate the probability of rolling a five on a die? Well, there’s one desired outcome (we roll the five), as well as six possible outcomes (we roll a one, two, three, and so on). One divided by six is 1/6. Pretty intuitive, right? So let’s turn the diagram above into a dartboard.

(A dart appearing and disappearing above the Venn diagram from earlier.)

Suppose we randomly pick a spot above this board, and drop a dart onto it. What is the probability it would hit A? The area of A is |A|, so there are |A| desired outcomes here. To make the math easy, we’ll scale everything so the board has an area of one unit. Thus we can write that probability as

p( A ) = |A| / 1

The same logic applies to p( B ), of course. Let’s try something a bit trickier: let’s say we’ve dropped the dart and it hit B. What is the probability that it also happened to hit A? The mathematical shorthand for that question is “p( A | B ).” We know the number of possible outcomes here, |B|, and the desired outcomes happen in the area where A intersects B, so the answer is

p( A | B ) = |A ∩ B| / |B|
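
The dartboard reading of p( A | B ) can be checked by simulation. The sketch below is only an illustration: the two rectangles standing in for A and B, the dart count, and the seed are made-up assumptions, not anything taken from the diagram. It drops darts uniformly on a unit square and compares the fraction of B-hits that also hit A against the area ratio |A ∩ B| / |B|.

```python
import random

# Hypothetical regions on a unit-square board: axis-aligned rectangles
# given as (x_min, x_max, y_min, y_max). The coordinates are made up.
A = (0.1, 0.6, 0.2, 0.7)
B = (0.4, 0.9, 0.3, 0.8)


def inside(region, x, y):
    """Return True if the point (x, y) lands inside the rectangle."""
    x_min, x_max, y_min, y_max = region
    return x_min <= x <= x_max and y_min <= y <= y_max


def estimate_p_a_given_b(n_darts=200_000, seed=0):
    """Drop darts uniformly; return (hits in A and B) / (hits in B)."""
    rng = random.Random(seed)
    hits_b = 0
    hits_both = 0
    for _ in range(n_darts):
        x, y = rng.random(), rng.random()
        if inside(B, x, y):
            hits_b += 1
            if inside(A, x, y):
                hits_both += 1
    return hits_both / hits_b


# Exact answer from areas: the overlap is [0.4, 0.6] x [0.3, 0.7].
exact = (0.2 * 0.4) / (0.5 * 0.5)  # |A ∩ B| / |B|
estimate = estimate_p_a_given_b()
print(f"estimate = {estimate:.3f}, exact = {exact:.3f}")
```

With this many darts the two numbers should agree to a couple of decimal places, though the estimate wobbles a little from seed to seed.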

A-ha! We can repeat this logic for p( B | A ), too, then shuffle around that mysterious first equation a bit. For instance, the way the equation was set up implied we must divide |B| by |A| before multiplying by the left-hand fraction. Because we’re only doing multiplication and division, though, we can instead multiply first then divide by |A| later on. Dividing a number by one doesn’t change it, right? So we can pop those divisions in anywhere we find convenient. Applying all these tricks, we get

|A ∩ B| / |B| * |B| / |A| = (|A ∩ B| / |B| * |B|) / |A| = ((|A ∩ B| / |B|) * (|B|/1)) / (|A|/1)

I think you can see where I’m going with this. By substituting in the probability expressions from above, we arrive at:

p( B | A ) = p( A | B ) * p( B ) / p( A )

And from there we just need to state “B = H” and “A = E” to arrive back at Bayes’ Theorem. This makes it obvious why Bayesian statistics was once called “Inverse Probability,” as it permits you to calculate p( H | E ) from p( E | H ) and vice-versa.
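
The relation p( B | A ) = p( A | B ) * p( B ) / p( A ) can also be checked exactly by counting. The following is a small sketch under made-up assumptions: a universe of 20 equally likely outcomes and two arbitrary overlapping subsets, with the helper names `p` and `p_given` invented for the example. Using `Fraction` avoids any floating-point rounding, so the two sides can be compared for strict equality.

```python
from fractions import Fraction

# A made-up universe of 20 equally likely outcomes, with two overlapping
# subsets. Any non-empty A and B inside the universe would do.
universe = set(range(20))
A = {0, 1, 2, 3, 4, 5, 6, 7}
B = {5, 6, 7, 8, 9, 10}


def p(event):
    """Probability as a size ratio: |event| / |universe|."""
    return Fraction(len(event), len(universe))


def p_given(event, condition):
    """Conditional probability: |event ∩ condition| / |condition|."""
    return Fraction(len(event & condition), len(condition))


lhs = p_given(B, A)                # p( B | A )
rhs = p_given(A, B) * p(B) / p(A)  # p( A | B ) * p( B ) / p( A )
print(lhs, rhs)                    # prints: 3/8 3/8
```

Swapping in other subsets changes the numbers but not the equality, which is the "almost tautological" flavour of the Theorem.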

I’ll admit there are much simpler derivations out there, but all the ones I’ve read fail to explain an obvious consequence. Recently, judges have rejected the use of Bayes’ Theorem in the courtroom. Ronald Fisher, who probably did more to bury Bayesian statistics than anyone else, nonetheless did everything he could to rescue the Reverend from his own Theorem.[1]

It has become realized in recent years (Fisher, 1958) that although Bayes considered the special axiom associated with his name for assigning probabilities a priori, and devoted a scholium to its discussion, in his actual mathematics he avoided this axiomatic approach as open to dispute, but showed that its purpose could be served by an auxiliary experiment, so that the probability statements a posteriori at which he arrived were freed from any reliance on the axiom, and shown to be demonstrable on the basis of observations only, such as are the source of new knowledge in the natural sciences.[2]

If Bayes’ Theorem is so trivial to derive, how can people reject it? Why did statistical giants like Fisher twist themselves in knots to avoid it?

One reason is that it has deep epistemological consequences. Think of a person, then ask yourself whether they have read this sentence. Either they have or they have not, and yet Bayes’ Theorem can attach numbers like 25% or 65.37% to it. How can someone 25% read that sentence? These numbers cannot represent actual things in the world, which is weird because when deriving Bayes’ Theorem I invoked counting.

The other reason is that I’m treating hypotheses and data as if they were the same thing. “Person X read that sentence” and “I observed person X reading that sentence” seem to be qualitatively different statements; the former is a potential truth about the world, the latter a sensation derived from direct interaction with the world. Yet according to our derivation they can be smushed together and measured using the same metric. Substituting H for B and E for A was a mere label switch. Under this interpretation, I can also write:

p( E2 | E1 ) , p( H2 | H1 )

“The probability of seeing some evidence, dependent on the probability of seeing other evidence” seems non-controversial, but how can the probability of one hypothesis depend on another? What does “depend” even mean for something abstract and non-causal?

We can resolve these contradictions by denying that hypotheses carry a probability. Stating that any part of the Venn diagram represents a hypothesis becomes a bogus move, though the rest of the logic behind Bayes’ Theorem remains legit. This is a core assumption of frequentist statistics.

For example, if I toss a coin, the probability of heads coming up is the proportion of times it produces heads. But it cannot be the proportion of times it produces heads in any finite number of tosses. If I toss the coin 10 times and it lands heads 7 times, the probability of a head is not therefore 0.7. A fair coin could easily produce 7 heads in 10 tosses. The relative frequency must refer therefore to a hypothetical infinite number of tosses. The hypothetical infinite set of tosses (or events, more generally) is called the reference class or collective. […]

[In another experiment,] Each event is ‘throwing a fair die 25 times and observing the number of threes’. That is one event. Consider a hypothetical collective of an infinite number of such events. We can then determine the proportion of such events in which the number of threes is 5. That is a meaningful probability we can calculate. However, we cannot talk about P(H | D), for example P(‘I have a fair die’ | ‘I obtained 5 threes in 25 rolls’), [or] the probability that the hypothesis that I have a fair die is true, given I obtained 5 threes in 25 rolls. What is the collective? There is none. The hypothesis is simply true or false. [3]
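
The probability that passage does allow, the proportion of 25-roll events containing exactly 5 threes when the die is fair, is a binomial calculation. Here is a minimal Python sketch; the function name and its defaults are chosen purely for illustration.

```python
from math import comb


def prob_exact_threes(k, n=25, p_three=1 / 6):
    """Probability of exactly k threes in n rolls, if each roll shows a
    three with probability p_three (1/6 for a fair die)."""
    return comb(n, k) * p_three**k * (1 - p_three) ** (n - k)


# p( 'I obtained 5 threes in 25 rolls' | 'I have a fair die' )
p_data_given_fair = prob_exact_threes(5)
print(f"p( 5 threes in 25 rolls | fair die ) = {p_data_given_fair:.4f}")
```

Note the direction: this is p( D | H ), computed over a hypothetical collective of 25-roll events. It is the reverse quantity, p( H | D ), that the passage says has no collective.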

We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.[4]

This answer is not perfect. A minor complaint is that most probability systems have values that can represent “absolutely true” and “absolutely false,” so there’s no obvious theoretical barrier to including hypotheses.

More importantly, consider the first example of Eliezer S. Yudkowsky’s famous introduction to Bayes’ Theorem.

1% of women at age forty who participate in routine screening have breast cancer.  80% of women with breast cancer will get positive mammographies.  9.6% of women without breast cancer will also get positive mammographies.  A woman in this age group had a positive mammography in a routine screening.  What is the probability that she actually has breast cancer?

We can plug that question into the above Venn diagram, assigning “B” to women with breast cancer and “A” to women with positive mammographies. We then generate matching hypothetical women and walk through the math:

Out of 10,000 women, 100 have breast cancer; 80 of those 100 have positive mammographies.  From the same 10,000 women, 9,900 will not have breast cancer and of those 9,900 women, 950 will also get positive mammographies.  This makes the total number of women with positive mammographies 950+80 or 1,030.  Of those 1,030 women with positive mammographies, 80 will have cancer.  Expressed as a proportion, this is 80/1,030 or 0.07767 or 7.8%.

A frequentist would have no objection to that process, and readily agree to the answer. Yet if Yudkowsky had written this instead …

p( H ) = 1%

p( E | H ) = 80%

p( E | not H ) = 9.6%

p( E ) = p( E | H ) * p( H ) + p( E | not H ) * p( not H ) = 80% * 1% + 9.6% * 99% = 10.304%

p( H | E ) = p( E | H ) * p( H ) / p( E ) = 80% * 1% / 10.304% = 7.764% (about 7.8%)

… a frequentist would reject it. Why? What great difference is there between “What is the probability that she actually has breast cancer?” and “What is the probability of the hypothesis that she actually has breast cancer?”, that forces us to reject all reasoning that follows from the latter?
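
To see how close the two routes are, here is a short Python sketch that computes the mammography answer both ways. The function name `posterior` and its parameter names are invented for the example; the percentages are the ones quoted above. The counting route uses the walk-through's 950 false positives, which is 9.6% of 9,900 (950.4) rounded down, so the two results differ slightly in the fifth decimal place.

```python
from fractions import Fraction

# Figures quoted in the example above.
p_h = Fraction(1, 100)                 # p( H ): has breast cancer
p_e_given_h = Fraction(80, 100)        # p( E | H ): positive, given cancer
p_e_given_not_h = Fraction(96, 1000)   # p( E | not H ): positive, no cancer


def posterior(prior, likelihood, false_positive_rate):
    """Return p( H | E ), expanding p( E ) over H and not H."""
    p_e = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / p_e


by_formula = posterior(p_h, p_e_given_h, p_e_given_not_h)

# Counting version: 10,000 hypothetical women, with the false positives
# rounded to 950 as in the quoted walk-through.
by_counting = Fraction(80, 80 + 950)

print(f"formula:  {float(by_formula):.5f}")   # formula:  0.07764
print(f"counting: {float(by_counting):.5f}")  # counting: 0.07767
```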

The other way to resolve these contradictions is to change what probability means. Hypotheses may be true or false in the real world, but my experience of that world is mediated by my physical self. My senses can lead me astray, and my finite lifespan prevents me from carrying out an infinite number of tests. So rather than directly deal with the real world, I instead model it with “degrees of belief.” A binary yes/no becomes a fractional value, which should come arbitrarily close to a binary value if fed enough hypotheses and data.

Likewise, the frequency of data is modeled by a fractional degree of belief. If a true hypothesis tells me to be X% confident that the next bit of evidence is Y, then X% of the time I should see Y. If it does not, then the assertion was not true and thus the hypothesis cannot be either. By doing these conversions, I can legitimately mix hypotheses and data and thus invoke Bayes’ Theorem. As a bonus, the hypothesis-hypothesis comparison allows us to justify Ockham’s Razor via prior probabilities. There are some problems with the Bayesian approach, like how to handle p( E ), but in my slightly-biased opinion they’re a lot less unsettling than the frequentist route.
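
As a rough illustration of a degree of belief drifting toward a binary value, the sketch below applies Bayes’ Theorem once per coin toss. Everything about the setup is an assumption made up for the example: H claims the coin lands heads 70% of the time, not-H claims it is fair, the starting belief is 50%, and the simulated tosses really do follow H.

```python
import random


def update(belief_h, p_e_given_h, p_e_given_not_h):
    """One application of Bayes' Theorem: return p( H | E ) from p( H )."""
    p_e = p_e_given_h * belief_h + p_e_given_not_h * (1 - belief_h)
    return p_e_given_h * belief_h / p_e


# Hypothetical setup: H says "heads 70% of the time", not-H says "fair".
P_HEADS_H, P_HEADS_NOT_H = 0.7, 0.5

rng = random.Random(1)
belief = 0.5  # assumed starting degree of belief in H
for toss in range(1, 1001):
    heads = rng.random() < P_HEADS_H  # the data are generated under H
    if heads:
        belief = update(belief, P_HEADS_H, P_HEADS_NOT_H)
    else:
        belief = update(belief, 1 - P_HEADS_H, 1 - P_HEADS_NOT_H)
    if toss in (10, 100, 1000):
        print(f"after {toss:4d} tosses: p( H | data ) = {belief:.4f}")
```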
I could go on, but I think I’ve established “deceptively simple” beyond a reasonable doubt. It’s worth pondering over the holidays, in a spare moment.

[1] Aldrich, John. “RA Fisher on Bayes and Bayes’ theorem.” Bayesian Analysis 3.1 (2008): 161-170.

[2] Fisher, Ronald. “Some examples of Bayes’ method of the experimental determination of probabilities a priori.” Journal of the Royal Statistical Society. Series B (Methodological) (1962): 118-124.

[3] Dienes, Zoltan. Understanding Psychology as a Science: An Introduction to Scientific and Statistical Inference. Palgrave Macmillan, 2008. pp. 58-59.

[4] Neyman, Jerzy, and Egon S. Pearson. On the Problem of the Most Efficient Tests of Statistical Hypotheses. Springer, 1933.