Ugh, Not Again

P-values are back in the news. Nature published an article, signed by 800 scientists, calling for an end to the concept of “statistical significance.” It ruffled my feathers, even though I agreed with its central thesis.

The trouble is human and cognitive more than it is statistical: bucketing results into ‘statistically significant’ and ‘statistically non-significant’ makes people think that the items assigned in that way are categorically different. The same problems are likely to arise under any proposed statistical alternative that involves dichotomization, whether frequentist, Bayesian or otherwise.

Unfortunately, the false belief that crossing the threshold of statistical significance is enough to show that a result is ‘real’ has led scientists and journal editors to privilege such results, thereby distorting the literature. Statistically significant estimates are biased upwards in magnitude and potentially to a large degree, whereas statistically non-significant estimates are biased downwards in magnitude. Consequently, any discussion that focuses on estimates chosen for their significance will be biased. On top of this, the rigid focus on statistical significance encourages researchers to choose data and methods that yield statistical significance for some desired (or simply publishable) result, or that yield statistical non-significance for an undesired result, such as potential side effects of drugs — thereby invalidating conclusions.

Nothing wrong there. I’ve mentioned some Bayesian buckets myself, but my one-sentence counter-argument is tucked away in an aside over here. Any artificial significant/non-significant boundary is going to promote the distortions they mention. What got me writing this post was their recommendations.

What will retiring statistical significance look like? We hope that methods sections and data tabulation will be more detailed and nuanced. Authors will emphasize their estimates and the uncertainty in them — for example, by explicitly discussing the lower and upper limits of their intervals. They will not rely on significance tests. When P values are reported, they will be given with sensible precision (for example, P = 0.021 or P = 0.13) — without adornments such as stars or letters to denote statistical significance and not as binary inequalities (P < 0.05 or P > 0.05). Decisions to interpret or to publish results will not be based on statistical thresholds. People will spend less time with statistical software, and more time thinking.

This basically amounts to nothing. Journal editors still have to decide what to print, and with no strong alternative on offer they’ll swap the arbitrary cutoff of p < 0.05 for ad-hoc cutoffs that are just as arbitrary. In the meantime, they’re leaving flawed statistical procedures in place. P-values exaggerate the strength of the evidence, as I and others have argued. Confidence intervals are not an improvement, either. As I put it:

For one thing, if you’re a frequentist it’s a category error to state the odds of a hypothesis being true, or that some data makes a hypothesis more likely, or even that you’re testing the truth-hood of a hypothesis. […]

How does this intersect with confidence intervals? If it’s an invalid move to hypothesise[sic] “the population mean is Y,” it must also be invalid to say “there’s a 95% chance the population mean is between X and Z.” That’s attaching a probability to a hypothesis, and therefore a no-no! Instead, what a frequentist confidence interval is really telling you is “assuming this data is a representative sample, if I repeat my experimental procedure an infinite number of times then I’ll calculate a sample mean between X and Z 95% of the time.” A confidence interval says nothing about the test statistic, at least not directly.

In frequentism, the parameter is fixed and the data varies. It doesn’t make sense to consider other parameters; that’s a Bayesian move. And yet the authors propose exactly that!

We must learn to embrace uncertainty. One practical way to do so is to rename confidence intervals as ‘compatibility intervals’ and interpret them in a way that avoids overconfidence. Specifically, we recommend that authors describe the practical implications of all values inside the interval, especially the observed effect (or point estimate) and the limits. In doing so, they should remember that all the values between the interval’s limits are reasonably compatible with the data, given the statistical assumptions used to compute the interval. Therefore, singling out one particular value (such as the null value) in the interval as ‘shown’ makes no sense.

Much of what the authors proposed would be fixed by switching to Bayesian statistics. Their own suggestions invoke Bayesian ideas without realizing it. Yet they go out of their way to say nothing’s wrong with p-values or confidence intervals, despite evidence to the contrary. Their proposal is destined to fail, yet it got more support than the arguably-superior p < 0.005 proposal.

Maddening. Maybe it’s time I got out my poison pen and added my two cents to the scientific record.

Happy Emmy Noether Day!

Whenever anyone asks me for my favorite scientist, her name comes first.

At a time when women were considered intellectually inferior to men, Noether (pronounced NUR-ter) won the admiration of her male colleagues. She resolved a nagging puzzle in Albert Einstein’s newfound theory of gravity, the general theory of relativity. And in the process, she proved a revolutionary mathematical theorem that changed the way physicists study the universe.

It’s been a century since the July 23, 1918, unveiling of Noether’s famous theorem. Yet its importance persists today. “That theorem has been a guiding star to 20th and 21st century physics,” says theoretical physicist Frank Wilczek of MIT. […]

Although most people have never heard of Noether, physicists sing her theorem’s praises. The theorem is “pervasive in everything we do,” says theoretical physicist Ruth Gregory of Durham University in England. Gregory, who has lectured on the importance of Noether’s work, studies gravity, a field in which Noether’s legacy looms large.

And as luck would have it, today was the day she was born. So read up on why she’s such a critical figure, and use it as an excuse to remember other important women in science.

Frequentists Don’t Get A Free Ticket

I’ve been digging Crash Course’s series on statistics. They managed to pull off two good episodes about Bayesian statistics, plus one about the downsides of p-values, but overall Adriene Hill has stuck close to the frequentist interpretation of statistics. It was inevitable that one of their episodes would get my goat, enough to make me want to break it down.

And indeed, this episode on t-tests is worth breaking down.

[Read more…]

Gaining Credibility

You might have wondered why I didn’t pair my frequentist analysis in this post with a Bayesian one. Two reasons: length, and quite honestly I needed some time to chew over hypotheses. The default frequentist ones are entirely inadequate, for starters:

  • null: The data follows a Gaussian distribution with a mean of zero.
  • alternative: The data follows a Gaussian distribution with a non-zero mean.

In chart form, their relative likelihoods look like this. [Read more…]

Continued Fractions

If you’ve followed my work for a while, you’ve probably noted my love of low-discrepancy sequences. Any time I want to do a uniform sample, and I’m not sure when I’ll stop, I’ll reach for an additive recurrence: repeatedly sum an irrational number with itself, check if the sum is bigger than one, and if so chop it down. Dirt easy, super-fast, and most of the time it gives great results.
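If that description reads as too terse, here’s a minimal sketch of the recurrence in Go. It’s an illustration rather than the code I actually use, and the irrational I’ve plugged in is just a stand-in; picking a good one is where things get interesting.

package main

import (
    "fmt"
    "math"
)

// additiveRecurrence yields n quasi-uniform samples in [0, 1): keep adding
// an irrational alpha, and whenever the running sum climbs past one, chop
// it back down by subtracting one.
func additiveRecurrence(alpha float64, n int) []float64 {
    samples := make([]float64, n)
    x := 0.0
    for i := range samples {
        x += alpha
        if x >= 1 {
            x -= 1
        }
        samples[i] = x
    }
    return samples
}

func main() {
    alpha := math.Sqrt(2) - 1 // stand-in irrational in (0, 1); see below
    fmt.Println(additiveRecurrence(alpha, 8))
}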

But finding the best irrational numbers to add has been a bit of a juggle. The Wikipedia page recommends primes, but it also claimed this was the best choice of all:

\frac{\sqrt{5} - 1}{2}

I couldn’t see why. I made a half-hearted attempt at digging through the references, but it got too complicated for me and I was more focused on the results, anyway. So I quickly shelved that and returned to just trusting that they worked.

That is, until this Numberphile video explained them with crystal clarity. Not getting the connection? The worst possible number to use in an additive recurrence is a rational number: it’ll start repeating earlier points and you’ll miss at least half the numbers you could have used. This is precisely like having outward spokes on your flower (no seriously, watch the video), and so you’re also looking for any irrational number that’s poorly approximated by any rational number. And, wouldn’t you know it…

\frac{\sqrt{5} - 1}{2} ~=~ \frac{\sqrt{5} + 1}{2} - 1 ~=~ \phi - 1

… I’ve relied on the Golden Ratio without realising it.

Want to play around a bit with continued fractions? I whipped up a bit of Go which allows you to translate any number into the integer sequence behind its fraction. Go ahead, muck with the thing and see what patterns pop out.
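The core of it is only a few lines; here’s a simplified sketch (floating-point precision garbles the terms after a dozen or so iterations, so the real thing needs a bit more care):

package main

import (
    "fmt"
    "math"
)

// continuedFraction returns the first n integer terms [a0; a1, a2, ...] of
// x's continued fraction: take the floor, flip the remainder, repeat.
func continuedFraction(x float64, n int) []int64 {
    terms := make([]int64, 0, n)
    for i := 0; i < n; i++ {
        a := math.Floor(x)
        terms = append(terms, int64(a))
        frac := x - a
        if frac < 1e-12 { // effectively rational: the expansion terminates
            break
        }
        x = 1 / frac
    }
    return terms
}

func main() {
    fmt.Println(continuedFraction((math.Sqrt(5)-1)/2, 10)) // [0 1 1 1 1 1 1 1 1 1]
    fmt.Println(continuedFraction(math.Sqrt(2), 10))       // [1 2 2 2 2 2 2 2 2 2]
}

That all-ones expansion is exactly what makes \phi - 1 so badly approximated by rationals, and hence so good for an additive recurrence.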

Abductive and Inferential Science

I love it when Professor Moriarty wanders back to YouTube, and his latest was pretty good. He got into a spot of trouble at the end, which led me to muse on writing a blog post to help him out. I’ve already covered some of that territory, alas, but in the process I also stumbled on something more interesting to blog about. It also affects Sean Carroll’s paper, which Moriarty relied on.

The fulcrum of my topic is the distinction between inference and abduction. The former goes “I have a hypothesis, what does the data say about it?,” while the latter goes “I have data, can I find a hypothesis which explains it?” Moriarty uses this as a refutation of falsification: if we start from the data instead of the hypothesis, we’re not trying to falsify anything! To rub salt in the wound, Moriarty argues (and I agree) that a majority of scientific activity consists of abduction and not inference; it’s quite common for scientists to jump from one topic to another, essentially engaging in a tonne of abductive activity until someone forces them to write up a hypothesis. Sean Carroll doesn’t dwell on this as much, but his paper does treat abduction and inference as separate things.

They aren’t separate, at least when it comes to the Bayesian interpretation of statistics. Let’s use a toy example to explain how; here’s a black box with a clear cover:

import ("math/rand")

func blackbox() float64 {
    x := rand.Float64()
    return (4111 + x*(4619 + x*(3627 + x*(7392*x - 9206)))) / 1213
}

Each time we turn the crank on this function, we get back a number of some sort. The abductive way to analyse this is pretty straightforward: we grab a tonne of numbers and look for a hypothesis. I’ll go for the mean, median, and standard deviation here, the minimum I’ll need to check for a Gaussian distribution.
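A minimal sketch of that summary pass in Go; it repeats the blackbox() definition so it runs standalone, and the helper names are mine:

package main

import (
    "fmt"
    "math"
    "math/rand"
    "sort"
)

// blackbox is the same polynomial as in the snippet above.
func blackbox() float64 {
    x := rand.Float64()
    return (4111 + x*(4619 + x*(3627 + x*(7392*x - 9206)))) / 1213
}

func main() {
    const n = 1000001
    samples := make([]float64, n)
    sum := 0.0
    for i := range samples {
        samples[i] = blackbox()
        sum += samples[i]
    }
    mean := sum / n

    sumSq := 0.0
    for _, s := range samples {
        sumSq += (s - mean) * (s - mean)
    }
    stdDev := math.Sqrt(sumSq / n)

    sort.Float64s(samples)
    median := samples[n/2]

    fmt.Printf("Samples = %d\nMean    = %.5f\nStd.Dev = %.5f\nMedian  = %.5f\n",
        n, mean, stdDev, median)
}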

Samples = 1000001
Mean    = 5.61148
Std.Dev = 1.40887
Median  = 5.47287

Looks like the median sits a touch below the mean, so there’s a slight skew, but it’s not that bad. So I’ll propose that the output of this black box follows a Gaussian distribution, with mean 5.612 and standard deviation 1.409, until I can think of a better hypothesis which handles the skew.

After we reset for the inferential analysis, we immediately run into a problem: this is a black box. We know it has no input, and outputs a floating-point number, and that’s it. How can we form any hypothesis, let alone a null and alternative? We’ve no choice but to make something up. I’ll set my null to be “the black box outputs a random floating-point number,” and the alternative to “the output follows a Gaussian distribution with a mean of 0 and a standard deviation of 1.” Turn the crank, aaaand…

Samples            = 1000001
log(Bayes Factor)  = 26705438.01142
  (That means the most likely hypothesis is H1 (Gaussian distribution, mean = 0, std.dev = 1))
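For concreteness, here’s roughly how that tally works, as a sketch: treat the “random floating-point number” null as a uniform distribution over the 2^64 possible bit patterns (an assumption on my part), then sum log-likelihoods under each hypothesis and take the difference. It assumes blackbox() and the math package are in scope.

// logBayesFactor accumulates log-likelihoods under H1 (Gaussian, mean 0,
// std.dev 1) and H0 (every 64-bit pattern equally likely, a density of 2^-64)
// and returns log P(data|H1) - log P(data|H0).
func logBayesFactor(n int) float64 {
    logH0, logH1 := 0.0, 0.0
    for i := 0; i < n; i++ {
        x := blackbox()
        logH0 += -64 * math.Ln2
        logH1 += -0.5*math.Log(2*math.Pi) - 0.5*x*x
    }
    return logH1 - logH0
}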

Unsurprisingly, our alternative does a lot better than our null. But our alternative is wrong! We’d get that impression pretty quickly if we watched the numbers streaming in. There’s an incredible temptation to take that data to refine or propose a new hypothesis, but that’s an abductive move. Inference is really letting us down.

Worse, this black box isn’t too far off from the typical science experiment. Researchers rarely query a literal black box, true, but they overwhelmingly generate new data without incorporating other people’s datasets. It’s also rare that you’re replicating someone else’s work; most likely, you’re taking existing ideas and rearranging them into something new, so prior findings may not carry forward. Inferential analysis is more tractable than I painted it, I’ll confess, but the limited information and focus on novelty still favor the abductive approach.

But think a bit about what I did on the inferential side: I picked two hypotheses and pitted them against one another. Do I have to limit myself to two? Certainly not! Let’s rerun the analysis with twenty-two hypotheses: the flat distribution we used as a null before, plus twenty-one alternative hypotheses covering every integral mean from -10 to 10 (though keeping the standard deviation at 1).
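The sweep itself is the same log-likelihood accumulation run over a slice of hypotheses. As a sketch (again treating the flat null as a 2^-64 density and giving all twenty-two hypotheses an equal prior of 1/22, with blackbox() in scope):

// hypothesisSweep returns log(likelihood*prior) for the flat null and for
// Gaussian hypotheses with integral means from -10 to 10, std.dev fixed at 1,
// each hypothesis starting from an equal prior of 1/22.
func hypothesisSweep(n int) (scoreH0 float64, scoreH1 [21]float64) {
    logPrior := math.Log(1.0 / 22.0)
    scoreH0 = logPrior
    for j := range scoreH1 {
        scoreH1[j] = logPrior
    }
    for i := 0; i < n; i++ {
        x := blackbox()
        scoreH0 += -64 * math.Ln2 // flat null
        for j := range scoreH1 {
            d := x - float64(j-10) // hypothesised mean is j - 10
            scoreH1[j] += -0.5*math.Log(2*math.Pi) - 0.5*d*d
        }
    }
    return scoreH0, scoreH1
}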

Samples                                 = 100001
log(likelihood*prior), H0               = -4436161.89971
log(likelihood*prior), H1, mean = -10   = -12378220.82173
log(likelihood*prior), H1, mean =  -9   = -10866965.39358
log(likelihood*prior), H1, mean =  -8   = -9455710.96544
log(likelihood*prior), H1, mean =  -7   = -8144457.53730
log(likelihood*prior), H1, mean =  -6   = -6933205.10915
log(likelihood*prior), H1, mean =  -5   = -5821953.68101
log(likelihood*prior), H1, mean =  -4   = -4810703.25287
log(likelihood*prior), H1, mean =  -3   = -3899453.82472
log(likelihood*prior), H1, mean =  -2   = -3088205.39658
log(likelihood*prior), H1, mean =  -1   = -2376957.96844
log(likelihood*prior), H1, mean =   0   = -1765711.54029
log(likelihood*prior), H1, mean =   1   = -1254466.11215
log(likelihood*prior), H1, mean =   2   = -843221.68401
log(likelihood*prior), H1, mean =   3   = -531978.25586
log(likelihood*prior), H1, mean =   4   = -320735.82772
log(likelihood*prior), H1, mean =   5   = -209494.39958
log(likelihood*prior), H1, mean =   6   = -198253.97143
log(likelihood*prior), H1, mean =   7   = -287014.54329
log(likelihood*prior), H1, mean =   8   = -475776.11515
log(likelihood*prior), H1, mean =   9   = -764538.68700
log(likelihood*prior), H1, mean =  10   = -1153302.25886
  (That means the most likely hypothesis is H1 (Gaussian distribution, mean = 6, std.dev = 1))

Aha, the inferential approach has finally gotten us somewhere! It’s still wrong, but you can see the obvious solution: come up with as many hypotheses as you can to explain the data before looking at it, and run them all as the data rolls in. If you’re worried about being swamped by hypotheses, I’ve got a word for you: marginalization. Bayesian statistics handles hypotheses with parameters by integrating over all of them; you can think of these as composites, a mash of point hypotheses which collectively do a helluva lot better at prediction than any one hypothesis in isolation. In practice, then, Bayesians have always dealt with large numbers of hypotheses simultaneously.

The classic example of this is conjugate priors, where we carefully combine hyperparameters to evaluate a potentially infinite family of probability distributions. In fact, let’s try it right now: the proper conjugate here is the Normal-Inverse-Gamma, as we’re tracking both the mean and standard deviation of Gaussian distributions.
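Here’s a sketch of the update itself, assuming a flat improper prior (μ0, λ0, α0 and β0 all zero), which is what the numbers below boil down to; under that prior the posterior hyperparameters collapse to a few sample summaries.

// nigPosterior updates a Normal-Inverse-Gamma conjugate prior with data,
// starting from a flat improper prior (mu0 = lambda0 = alpha0 = beta0 = 0).
func nigPosterior(samples []float64) (mu, lambda, alpha, beta float64) {
    n := float64(len(samples))
    sum := 0.0
    for _, x := range samples {
        sum += x
    }
    mean := sum / n

    sumSq := 0.0
    for _, x := range samples {
        sumSq += (x - mean) * (x - mean)
    }

    // With lambda0 = 0 the posterior mean is the sample mean, lambda counts
    // the samples, alpha grows by half a count per sample, and beta
    // accumulates half the sum of squared deviations.
    return mean, n, n / 2, sumSq / 2
}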

Samples = 1000001
μ       = 5.61148
λ       = 1000001.00000
α       = 500000.50000
β       = 992457.82655

median  = 5.47287

That’s a good start: μ lines up with the mean we calculated earlier, and λ is obviously the sample count. The shape of the posteriors is still pretty opaque, though; we’ll need to chart this out by evaluating the Normal-Inverse-Gamma PDF a few times.

[Chart: Conjugate posterior for the collection of all Gaussian distributions which could describe the data.]

Excellent, the inferential method has caught up to abduction! In fact, as of now they’re both working identically. Think: what’s the difference between a hypothesis you proposed before collecting the data, and one you proposed after? In frequentism, the stopping problem implies that we could exit early and falsely reject our null, when data coming down the pipe would have pushed it back to “fail to reject.” There, the choice of hypothesis could have an influence on the outcome, so there is a difference between the two cases. This is made worse by frequentism’s obsession over one hypothesis above all others, the null.
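To make that stopping problem concrete, here’s a quick simulation: the null is true by construction, yet peeking at a z-test after every observation and stopping at the first p < 0.05 rejects it far more often than the nominal 5%.

package main

import (
    "fmt"
    "math"
    "math/rand"
)

// pValue is the two-sided p-value of a z-test of mean = 0 with known
// standard deviation 1, given the running sum of n observations.
func pValue(sum float64, n int) float64 {
    z := math.Abs(sum) / math.Sqrt(float64(n))
    return math.Erfc(z / math.Sqrt2)
}

func main() {
    const trials = 100000
    rejections := 0
    for t := 0; t < trials; t++ {
        sum := 0.0
        for n := 1; n <= 100; n++ {
            sum += rand.NormFloat64() // the null really is true: N(0, 1) data
            // Peek after every observation past the tenth, and stop the
            // experiment the first time p dips below 0.05.
            if n >= 10 && pValue(sum, n) < 0.05 {
                rejections++
                break
            }
        }
    }
    // Far higher than the nominal 5%, purely because the data decided
    // when to stop.
    fmt.Printf("false rejection rate: %.3f\n", float64(rejections)/trials)
}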

Bayesian statistics is free of that problem, because every hypothesis is judged on its relative likelihood in reference to a dataset shared by all hypotheses. There is no stopping problem baked into the methodology. Whether I evaluate any given hypothesis before or after I collect the data is irrelevant, because either way it has to cope with all the data. This also frees me up to invent hypotheses whenever I wish.

But this also defeats the main attack against falsification. The whole point of invoking abduction was to save us from asserting any hypotheses in the beginning; if there’s no difference in when we invoke our hypotheses, however, then falsification might still apply.

Here’s where I return to giving Professor Moriarty a hand. He began that video by saying scientists usually don’t engage in falsification, hence it cannot be The Scientific Method, but ended it by approvingly quoting Feynman: “We are trying to prove ourselves wrong as quickly as possible, because only in that way can we find progress.” Isn’t that falsification, right there?

This is yet another area where frequentist and Bayesian statistics diverge. As I pointed out earlier, frequentism is obsessed with falsifying the null hypothesis and trying to prove it wrong. Compare and contrast with what past-me wrote about Bayes Factors:

If data comes up that doesn’t square well with a hypothesis, its certainty takes a hit. But if we’re comparing it to another hypothesis that also doesn’t predict the data, the Bayes Factor will remain close to 1 and our certainties won’t shift much at all. Likewise, if both hypotheses strongly predict the data, the Factor again stays close to 1. If we’re looking to really shift our certainty around, we need a big Bayes Factor, which means we need to find scenarios where one hypothesis strongly predicts the data while the other strongly predicts this data shouldn’t happen.

Or, in other words, we should look for situations where one theory is… false. That sounds an awful lot like falsification!

But it’s not the same thing. Scroll back up to that Normal-Inverse-Gamma PDF, and pick a random point on the graph. The likelihood at that point is less than the likelihood at the maximum point. If you were watching those two points as we updated with new data, your choice would have gradually gone from about equally likely to substantially less likely. Your choice is more likely to be false, all things being equal, but it’s also not false with a capital F. Maybe the first million data points were a fluke, and if we continued sampling to a billion your choice would roar back to the top? This is the flip-side of having no stopping problem: the door is always left open a crack for any crackpot hypotheses to make a comeback.

Now look closely at the scale of the vertical axis. That maximal likelihood is well above 100%! In fact it’s somewhere around 4,023,000% by my calculations. While the vast majority of points are dropping downwards, there’s an ever-shrinking huddle that’s becoming more likely as data is added! Falsification should only make things less likely, however.

Under Bayesian statistics, falsification is treated as a heuristic rather than a core part of the process. We’re best served by trying to find areas where hypotheses differ, yet we never declare one hypothesis to be false. This saves Moriarty: he’s correct both in disclaiming falsification and in endorsing the process of trying to prove yourself wrong. The confusion between the two stems from having to deal with two separate paradigms that appear to have substantial overlap, even though a closer look reveals fundamental differences.

Bayes’ Theorem: Deceptively Simple

Bayes' Theorem, in classic form:

P(A \mid B) ~=~ \frac{P(B \mid A) \, P(A)}{P(B)}

Good ol’ Bayes’ Theorem. Have you ever wondered where it comes from, though? If you don’t know probability, there doesn’t seem to be any obvious logic to it. Once you’ve had it explained to you, though, it seems blindingly obvious and almost tautological. I know of a couple of good explainers, such as E.T. Jaynes’ Probability Theory, but just for fun I challenged myself to re-derive the Theorem from scratch. Took me about twenty minutes; if you’d like to try, stop reading here and try working it out yourself.

[Read more…]

Women’s Work

[I know, I’m a good three months late on this. It’s too good for the trash bin, though, and knowing CompSci it’ll be relevant again within the next year.]

This swells my heart.

LAURIE SEGALL: Computer science, it hasn’t always been dominated by men. It wasn’t until 1984 that the number of women studying computer science started falling. So how does that fit into your argument as to why there aren’t more women in tech?

JAMES DAMORE: So there are several reasons for why it was like that. Partly, women weren’t allowed to work other jobs so there was less freedom for people; and, also… it was simply different kinds of work. It was more like accounting rather than modern-day computer programming. And it wasn’t as lucrative, so part of the reason so many men give go into tech is because it’s high paying. I know of many people at Google that- they weren’t necessarily passionate about it, but it was what would provide for their family, and so they still worked there.

SEGALL: You say those jobs are more like accounting. I mean, look at Grace Hopper who pioneered computer programming; Margaret Hamilton, who created the first ever software which was responsible for landing humans on the moon; Katherine Johnson, Mary Jackson, Dorothy Vaughan, they were responsible for John Glenn accurately making his trajectory to the moon. Those aren’t accounting-type jobs?

DAMORE: Yeah, so, there were select positions that weren’t, and women are definitely capable of being confident programmers.

SEGALL: Do you believe those women were outliers?

DAMORE: … No, I’m just saying that there are confident women programmers. There are many at Google, and the women at Google aren’t any worse than the men at Google.

Segall deserves kudos for getting Damore to reverse himself. Even he admits there’s no evidence women are worse coders than men, in line with the current scientific evidence. I’m also fond of the people who make solid logical arguments against Damore’s views. We even have good research on why computing went from being dominated by women to dominated by men, and how occupations flip between male- and female-dominated as their social standing changes.

But there’s something lacking in Segall’s presentation of the history of women in computing. She isn’t alone; I’ve been reading a tonne of stories about the history of women in computing, and all of them suffer from the same omission: why did women dominate computing, at first? We think of math and logic as being “a guy thing,” so this is terribly strange. [Read more…]