Russian Hacking and Bayes’ Theorem, Part 2

I think I did a good job of laying out the core hypotheses last time, save two: that the Iranian government did it, or that a disgruntled Democrat did. I think I can pick those up on the fly, so let’s skip ahead to step 2.

The Priors

What are the prior odds of the Kremlin hacking into the DNC and associated groups or people?
I’d say they’re pretty high. Going right back to the Bolshevik revolution, Russian spy agencies have taken an interest in running disinformation campaigns. They even have a word for compromising information gathered to blackmail people into doing their bidding: “kompromat.” Putin himself earned a favourable place in Boris Yeltsin’s government via some kompromat on one of Yeltsin’s opponents.
As for hacking elections, European intelligence agencies have also fingered Russia for using kompromat to interfere with elections in Germany, the Netherlands, Hungary, Georgia, and Ukraine.
That’s all well and good, but what about other actors? China also has sophisticated information warfare capabilities, but seems more interested in trade secrets and tends to keep its discoveries under wraps. North Korea is a lot more splashy, but has recently focused on financial crimes. The Iranian government has apparently stepped up its online attack capabilities and holds a grudge against the USA, but appears to focus on infrastructure and disruption.
The Democratic National Convention was rather contentious, with fans of Bernie Sanders bitter at how the primary turned out; for some of them, putting Trump in power was preferable to voting for Clinton. A disgruntled Democrat doesn’t fit the timeline, though: the DNC was suspicious of an attack in April and documents were leaked in June, while Sanders still had a chance of winning the nomination until the end of July.
An independent group is the real wild card: they could have any number of motivations, and given their lack of power they’d be eager to make it look like someone else did the deed.
What about the CIA or NSA? The latter claims to be just a passive listener, and I haven’t heard anyone claim otherwise. The CIA has a long history of interfering in other countries’ elections; in Nicaragua in 1990, it even released documents to the media in order to smear a candidate it didn’t like. It’s one thing to muck around with other countries, however, as it’ll be nearly impossible for them to extradite you for a proper trial. Muck around in your own country’s election, and there’s no shortage of reporters and prosecutors willing to go after you.
Where does all this get us? I’d say to a tier of prior likelihoods:
  • “The Kremlin did it” (A) and “Independent hackers did it” (D) have about the same prior.
  • “China,” (B) “North Korea,” (C) “Iran,” (H) and “the CIA” (E) are less likely than the prior two.
  • “the NSA” (F) and “disgruntled insider” (I) are less likely still.
  • And c’mon, I’m not nearly good enough to pull this off. (G)

The Evidence

I haven’t put numbers on the priors, because the evidence side of things is pretty damning. Let’s take a specific example: the Cyrillic character set found in some of the leaked documents. We can both agree that this can be faked: switch around the keyboard layout, plant a few false names, and you’re done. Do it flawlessly and no-one will know otherwise.
But here’s the kicker: is there another hypothesis which is more likely than “the Kremlin did it,” on this bit of evidence? To focus on a specific case, is it more likely that an independent hacking group would leave Cyrillic characters and error messages in those documents than Russian hackers? This seems silly; an independent group could leave a false trail pointing to anyone, which dilutes the odds of them pointing the finger at a specific someone. Even if the independent group had a bias towards putting the blame on Russia, there’s still a chance they could finger someone else.
Put another way, a die numbered one through six could turn up a one when thrown, but a die with only ones on each face would be more likely to turn up a one; a one is always more likely from the second die. By the same token, even though it’s entirely plausible that an independent hacking group would switch their character sets to fake a trail, that evidence still points more strongly towards Russian hackers.
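To put that in Bayes’ Theorem terms (my notation, using the hypothesis labels from the list above): write A for “the Kremlin did it,” D for “independent hackers did it,” and E for the Cyrillic artifacts. The update rule for comparing the two is

\[ \frac{P(A \mid E)}{P(D \mid E)} \;=\; \frac{P(E \mid A)}{P(E \mid D)} \times \frac{P(A)}{P(D)} . \]

For the dice, the middle term works out to 1 divided by 1/6, a factor of six in favour of the all-ones die, even though a fair die can roll a one too. Likewise, so long as Russian hackers are more likely to leave Cyrillic artifacts behind than an independent group is to fake them against Russia specifically, this evidence shifts the odds towards A no matter how you set the priors.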
What does evidence that points away from the Kremlin look like?

President Vladimir Putin says the Russian state has never been involved in hacking.

Speaking at a meeting with senior editors of leading international news agencies Thursday, Putin said that some individual “patriotic” hackers could mount some attacks amid the current cold spell in Russia’s relations with the West.
But he categorically insisted that “we don’t engage in that at the state level.”

Is this great evidence? Hell no; it’s entirely possible Putin is lying, and given the history of the KGB and FSB it’s probable. But all that does is blunt the magnitude of the likelihoods; it doesn’t change their direction. By the same token, this ….
Intelligence agency leaders repeated their determination Thursday that only “the senior most officials” in Russia could have authorized recent hacks into Democratic National Committee and Clinton officials’ emails during the presidential election.
Director of National Intelligence James Clapper affirmed an Oct. 7 joint statement from 17 intelligence agencies that the Russian government directed the election interference…
….  counts as evidence in favour of the Kremlin being the culprit, even if you think James Clapper is a dirty rotten liar. Again, we can quibble over how much it shifts the balance, but no other hypothesis is more favoured by it.
We can carry on like this through a lot of the other evidence.
I can’t find anyone who’s suggested North Korea or the NSA did it. The consensus seems to point towards the Kremlin, and while there are scattered bits of evidence pointing elsewhere there isn’t a lot of credibility or analysis attached, and some of it is “anyone but Russia” instead of “group X,” which softens the gains made by other hypotheses.
The net result is that the already-strong priors for “the Kremlin did it” combine with the direction the evidence points in, and favour that hypothesis even more. How strongly it favours that hypothesis depends on how you weight the evidence, but you have to do some wild contortions to put another hypothesis ahead of it. A qualitative analysis is all we need.
Now, to some people this isn’t good enough. I’ve got two objections to deal with, one from Sam Biddle over at The Intercept, and another from Marcus Ranum at stderr. Part three, anyone?

A Third One!

I know, I know, these are starting to get passé. But this third event brings a little more information.

For the third time in a year and a half, the Advanced Laser Interferometer Gravitational Wave Observatory (LIGO) has detected gravitational waves. […]

This most recent event, which we detected on Jan. 4, 2017, is the most distant source we’ve observed so far. Because gravitational waves travel at the speed of light, when we look at very distant objects, we also look back in time. This most recent event is also the most ancient gravitational wave source we’ve detected so far, having occurred over two billion years ago. Back then, the universe itself was 20 percent smaller than it is today, and multicellular life had not yet arisen on Earth.

The mass of the final black hole left behind after this most recent collision is 50 times the mass of our sun. Prior to the first detected event, which weighed in at 60 times the mass of the sun, astronomers didn’t think such massive black holes could be formed in this way. While the second event was only 20 solar masses, detecting this additional very massive event suggests that such systems not only exist, but may be relatively common.

Thanks to this third event, astronomers can set a tighter upper limit on the mass of the graviton, the proposed force-carrying particle for gravity. They also have some hints as to how these black holes form: the spin axes of the two black holes appear to be misaligned, which suggests they became a binary well after forming rather than starting off as a pair of stars in orbit around each other. Finally, the absence of another kind of signal tells us something important about intermediate-mass black holes, those thousands of times heavier than the Sun but lighter than the million-solar-mass monsters.

The paper reports a “survey of the universe for midsize-black-hole collisions up to 5 billion light years ago,” says Karan Jani, a former Georgia Tech Ph.D. physics student who participated in the study. That volume of space contains about 100 million galaxies the size of the Milky Way. Nowhere in that space did the study find a collision of midsize black holes.

“Clearly they are much, much rarer than low-mass black holes, three collisions of which LIGO has detected so far,” Jani says. Nevertheless, should a gravitational wave from two Goldilocks black holes colliding ever get detected, Jani adds, “we have all the tools to dissect the signal.”

If you want more info, Veritasium has a quick summary; if you want something meatier, the full paper has been published and the raw data has been released.

Otherwise, just be content that we’ve learned a little more about the world.

Russian Hacking and Bayes’ Theorem, Part 1

I’m a bit of an oddity on this network, as I’m pretty convinced Russia was behind the DNC email hack. I know both Mano Singham and Marcus Ranum suspect someone else is responsible, last I checked, and Myers might lean that way too. Looking around, though, I don’t think anyone’s made the case in favour of Russian hacking. I might as well use it as an excuse to walk everyone through using Bayes’ Theorem in an informal setting.

(Spoiler alert: it’s the exact same method we’d use in a formal setting, but with more approximations and qualitative responses.)

[Read more…]

About Damn Time

Ask me to name the graph that annoys me the most, and I’ll point to this one.
[538’s graph of Trump’s popularity, as of May 25th, 2017.]

Yes, Trump entered his presidency as the least-liked president in modern history, but he’s repeatedly interfered with Russia-related investigations and admitted he did it to save his butt. That’s a Watergate-level scandal, yet his approval numbers have barely changed. He’s also pushed a much-hated healthcare reform bill, been defeated multiple times in court, tried to inch away from his wall pledge, and in general repeatedly angered his base. His approval ratings should have cratered by now, but because the US is so polarized many conservatives are clinging to him anyway.

A widely held tenet of the current conventional wisdom is that while President Trump might not be popular overall, he has a high floor on his support. Trump’s sizable and enthusiastic base — perhaps 35 to 40 percent of the country — won’t abandon him any time soon, the theory goes, and they don’t necessarily care about some of the controversies that the “mainstream media” treats as game-changing developments. […]

But the theory isn’t supported by the evidence. To the contrary, Trump’s base seems to be eroding. There’s been a considerable decline in the number of Americans who strongly approve of Trump, from a peak of around 30 percent in February to just 21 or 22 percent of the electorate now. (The decline in Trump’s strong approval ratings is larger than the overall decline in his approval ratings, in fact.) Far from having unconditional love from his base, Trump has already lost almost a third of his strong support. And voters who strongly disapprove of Trump outnumber those who strongly approve of him by about a 2-to-1 ratio, which could presage an “enthusiasm gap” that works against Trump at the midterms. The data suggests, in particular, that the GOP’s initial attempt (and failure) in March to pass its unpopular health care bill may have cost Trump with his core supporters.

At long last, Donald Trump’s base appears to be shrinking. This raises the chances of impeachment, and will put tremendous pressure on Republicans to abandon Trump to preserve their midterm majority. I’m pissed the cause appears to be health care, and not the shady Russian ties or bad behavior, but doing the right thing for the wrong reason is still doing the right thing. It also fits in nicely with current events.

According to the forecast released Wednesday by the nonpartisan Congressional Budget Office, 14 million fewer people would have health insurance next year under the Republican bill, increasing to a total of 19 million in 2020. By 2026, a total of 51 million people would be uninsured, compared with roughly 28 million under Obamacare. That is roughly equivalent to the loss in coverage under the first version of the bill, which failed to pass the House of Representatives.

Much of the loss in coverage would be due to the Republican plan to shrink the eligibility for Medicaid; for many others—particularly those with preexisting conditions living in certain states—healthcare on the open marketplace would become unaffordable. Some of the loss would be due to individuals choosing not to get coverage.

The Republican bill, dubbed the American Health Care Act, would also raise insurance premiums by an average of 20 percent in 2018 compared with Obamacare, according to the CBO, and an additional 5 percent in 2019, before premiums start to drop.

So keep an eye on Montana’s special election (I’m writing this before results have come in); if the pattern repeats from previous special elections, Republicans will face a huge loss during the 2018 midterms, robbing Trump of much of his power and allowing the various investigations against him to pick up more steam.

Daryl Bem and the Replication Crisis

I’m disappointed I don’t see more recognition of this.

If one had to choose a single moment that set off the “replication crisis” in psychology—an event that nudged the discipline into its present and anarchic state, where even textbook findings have been cast in doubt—this might be it: the publication, in early 2011, of Daryl Bem’s experiments on second sight.

I’ve actually done a long blog post series on the topic, but in brief: Daryl Bem was convinced that precognition existed. To put these beliefs to the test, he had subjects try to predict an image that was randomly generated by a computer. Over eight experiments, he found that they could indeed do better than chance. You might think that Bem is a kook, and you’d be right.

But Bem is also a scientist.

Now he would return to JPSP [the Journal of Personality and Social Psychology] with the most amazing research he’d ever done—that anyone had ever done, perhaps. It would be the capstone to what had already been a historic 50-year career.

Having served for a time as an associate editor of JPSP, Bem knew his methods would be up to snuff. With about 100 subjects in each experiment, his sample sizes were large. He’d used only the most conventional statistical analyses. He’d double- and triple-checked to make sure there were no glitches in the randomization of his stimuli. Even with all that extra care, Bem would not have dared to send in such a controversial finding had he not been able to replicate the results in his lab, and replicate them again, and then replicate them five more times. His finished paper lists nine separate ministudies of ESP. Eight of those returned the same effect.

One way to attack an argument is simply to follow its logic: if it leads to an absurd conclusion, the argument must be flawed, even if you can’t point to the flaw. Bem had inadvertently discovered a “reductio ad absurdum” argument against contemporary scientific practice: if proper scientific procedure can prove ESP exists, proper scientific procedure must be broken.

Meanwhile, at the conference in Berlin, [E.J.] Wagenmakers finally managed to get through Bem’s paper. “I was shocked,” he says. “The paper made it clear that just by doing things the regular way, you could find just about anything.”

On the train back to Amsterdam, Wagenmakers drafted a rebuttal, to be published in JPSP alongside the original research. The problems he saw in Bem’s paper were not particular to paranormal research. “Something is deeply wrong with the way experimental psychologists design their studies and report their statistical results,” Wagenmakers wrote. “We hope the Bem article will become a signpost for change, a writing on the wall: Psychologists must change the way they analyze their data.”

Slate has a long read up on the current replication crisis, and how it links to Bem. It’s aimed at a lay audience and highly readable; I recommend giving it a click.

P-hacking is No Big Deal?

Possibly not. simine vazire argued the case over at “sometimes i’m wrong.”

The basic idea is as follows: if we use shady statistical techniques to indirectly raise the p-value cutoff in Null Hypothesis Significance Testing, or NHST, we’ll increase the rate of false positives we get. Just to put some numbers to this, a p-value cutoff of 0.05 means that when the null hypothesis is true, we’ll get a misleading sample about 5% of the time and wrongly conclude there’s a real effect. If we use p-hacking to get an effective cutoff of 0.1, however, then that number jumps up to 10%.
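That first claim is easy to check with a toy simulation (a z-test on data where the null is true by construction; the numbers are mine, not vazire’s):

    % Under a true null, p-values are uniformly distributed, so the false
    % positive rate is simply whatever threshold you use.
    N = 100000;  n = 25;                  % 100,000 simulated studies, 25 samples each
    z = sqrt(n) * mean(randn(n, N));      % z-statistic for each sample mean
    p = erfc(abs(z) / sqrt(2));           % two-sided p-values
    printf("p <= 0.05: %.1f%%   p <= 0.10: %.1f%%\n", ...
           100 * mean(p <= 0.05), 100 * mean(p <= 0.10));

Both numbers should land within a fraction of a percentage point of the thresholds themselves, which is the 5%-versus-10% jump described above.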

However, p-hacking will also raise the number of true positives we get. How much higher can be tricky to calculate, but this blog post by Erika Salomon gives some great numbers. During one simulation run, a completely honest test of a false null hypothesis would return a true positive 12% of the time; when p-hacking was introduced, that skyrocketed to 74%.

If the increase in false positives is balanced out by the increase in true positives, then p-hacking makes no difference in the long run. The proportion of false positives in the literature would then depend almost entirely on the power of studies, which is abysmally low, and our focus should be on improving that. Or, if we’re really lucky, the true positives increase faster than the false positives and we actually get a better scientific record via cheating!
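To pin down the quantity at stake (my framing, not vazire’s): the “false positive rate” tracked below is the share of published positive results that are false,

\[ \mathrm{FPR} \;=\; \frac{FP}{FP + TP} , \]

which is unchanged only if p-hacking multiplies the true positives by the same factor as the false positives, and falls only if the true positives grow faster.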

We don’t really know which scenario will play out, however, and vazire calls for someone to code up a simulation.

Allow me.

My methodology will be to divide studies up into two categories: null results that are never published, and possibly-true results that are. I’ll be using a one-way ANOVA to check whether the averages of two groups drawn from a Gaussian distribution differ. I debated switching to a Student t test, but comparing two random draws seems more realistic than comparing one random draw to a fixed mean of zero.

I need a model of effect and sample sizes. This one is pretty tricky; just because a study is unpublished doesn’t mean the effect size is zero, and vice-versa. Making inferences about unpublished studies is tough, for obvious reasons. I’ll take the naive route here, and assume unpublished studies have an effect size of zero while published studies have effect sizes on the same order as actual published studies. Both published and unpublished studies will have sample sizes typical of what’s published.

I have a handy cheat for that: the Open Science Collaboration published a giant replication of 100 psychology studies back in 2015, and being Open they shared the raw data online in a spreadsheet. The effect sizes are in correlation coefficients, which are easy to convert to Cohen’s d, and when paired with a standard deviation of one that gives us the mean of the treatment group. The control group’s mean is fixed at zero but shares the same standard deviation. Sample sizes are drawn from said spreadsheet, and represent the total number of samples and not the number of samples per group. In fact, it gives me two datasets in one: the original study effect and sample size, plus the replication’s effect and sample size. Unless I say otherwise, I’ll stick with the originals.
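For reference, the standard conversion from a correlation coefficient r to Cohen’s d, assuming equal group sizes, is

\[ d \;=\; \frac{2r}{\sqrt{1 - r^2}} , \]

and with the control group’s mean fixed at zero and both standard deviations set to one, the treatment group’s mean is simply d.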

P-hacking can be accomplished in a number of ways: varying the number of tests in the analysis and iteratively running significance tests as the data comes in are but two of the more common. To simplify things, I’ll just assume the effective p-value threshold is a fixed number, but explore a range of values to get an idea of how a variable p-hacking effect would behave.

For some initial values, let’s say unpublished studies constitute 70% of all studies, and p-hacking can cause a p-value threshold of 0.05 to act like a threshold of 0.08.

Octave shall be my programming language of choice. Let’s have at it!
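Here’s a minimal Octave sketch of the kind of simulation just described, not the actual script: it uses made-up effect and sample sizes instead of the OSC spreadsheet values, so it won’t reproduce the exact figures below, but the structure is the same (a two-group comparison, an honest threshold versus a p-hacked one, and the false positive rate computed as the share of positives that are false).

    % 70% of studies are true nulls; the rest get a hypothetical effect size.
    % A two-group one-way ANOVA is equivalent to a pooled two-sample t-test,
    % so that's what gets computed here.
    N = 10000;  frac_null = 0.70;  alpha = 0.05;  hacked = 0.08;
    fp = [0 0];  tp = [0 0];              % tallies for [honest, p-hacked]
    for i = 1:N
      is_null = rand() < frac_null;
      if is_null
        d = 0;                            % no real effect
      else
        d = abs(0.2 + 0.4*randn());       % hypothetical effect size (Cohen's d)
      end
      n = 10 + randi(90);                 % hypothetical per-group sample size
      a = randn(n, 1);                    % control group: mean 0, sd 1
      b = d + randn(n, 1);                % treatment group: mean d, sd 1
      sp2 = (var(a) + var(b)) / 2;        % pooled variance
      t   = (mean(b) - mean(a)) / sqrt(sp2 * 2 / n);
      nu  = 2*n - 2;
      p   = betainc(nu / (nu + t^2), nu/2, 0.5);   % two-sided p-value
      hits = [p <= alpha, p <= hacked];
      fp  += is_null  * hits;
      tp  += !is_null * hits;
    end
    rate = 100 * fp ./ (fp + tp);
    printf("straight p <= %.2f: false positive rate %.1f%%\n", alpha,  rate(1));
    printf("p-hacked p <= %.2f: false positive rate %.1f%%\n", hacked, rate(2));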

(Template: OSC 2015 originals)
With a 30.00% success rate and a straight p <= 0.050000, the false positive rate is 12.3654% (333 f.p, 2360 t.p)
Whereas if p-hacking lets slip p <= 0.080000, the false positive rate is 18.2911% (548 f.p, 2448 t.p)

(Template: OSC 2015 replications)
With a 30.00% success rate and a straight p <= 0.050000, the false positive rate is 19.2810% (354 f.p, 1482 t.p)
Whereas if p-hacking lets slip p <= 0.080000, the false positive rate is 26.2273% (577 f.p, 1623 t.p)

Ouch, our false positive rate went up. That seems strange, especially as the true positives (“t.p.”) went up alongside the false positives (“f.p.”). Maybe I got lucky with the parameter values, though; let’s scan a range of unpublished study rates from 0% to 100%, and effective p-values from 0.05 to 0.2. The actual p-value threshold will remain fixed at 0.05. So we can fit it all in one chart, I’ll take the p-hacked false positive rate and subtract the vanilla false positive rate from it, so that areas where the false positive rate goes down after hacking come out negative.
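Sketching the scan itself, assuming the loop above has been wrapped into a hypothetical helper false_positive_rate(frac_null, threshold) that returns the rate for a given mix of null studies and effective cutoff:

    fracs      = 0:0.05:1;                % proportion of unpublished/null studies
    thresholds = 0.05:0.01:0.20;          % effective p-value after hacking
    delta = zeros(numel(fracs), numel(thresholds));
    for i = 1:numel(fracs)
      vanilla = false_positive_rate(fracs(i), 0.05);   % honest threshold stays at 0.05
      for j = 1:numel(thresholds)
        delta(i, j) = false_positive_rate(fracs(i), thresholds(j)) - vanilla;
      end
    end
    % negative entries would mean p-hacking lowered the false positive rate
    contourf(thresholds, fracs, delta);
    xlabel("effective p-value");  ylabel("proportion of unpublished studies");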

[Chart: how varying the proportion of unpublished/false studies and the p-hacking amount changes the false positive rate.]

There are no values less than zero?! How can that be? The math behind these curves is complex, but I think I can give an intuitive explanation.

[Chart: the distribution of p-values when the result is null vs. the results from the OSC originals.]
The diagonal is the distribution of p-values when the effect size is zero; the curve is what you get when it’s greater than zero. As there are more or fewer values in each category, the graphs are stretched or squashed horizontally. The p-value threshold is a horizontal line, and everything below that line is statistically significant. The ratio of false to true positives equals the ratio of the horizontal distances from the origin to where that line crosses the diagonal and to where it crosses the curve.

P-hacking is the equivalent of nudging that threshold line upwards. How the proportions change depends on the slope of the curve: the steeper it is, the less the curve’s crossing point moves, and the more the balance tips towards false positives. It follows that if you want to hold or improve the proportion of true results, you need a pair of threshold lines where the horizontal distance to the curve grows at least as fast, proportionally, as the distance along the diagonal. Putting this geometrically, imagine drawing a line starting at the origin at an arbitrary slope. Your job is to find a slope such that the line pierces the non-zero effect curve twice.

Slight problem: that non-zero effect curve is convex everywhere. Its slope is guaranteed to get steeper as you step up the curve, which means it curves up and away from any straight line that crosses it; a line through the origin can pierce it at most once. Translating that back into math, the non-zero effect curve is guaranteed not to keep pace, proportionally, with the diagonal. The false positive rate will always increase as you raise the effective p-value threshold.
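The same argument in symbols (my notation, and it assumes the p-values from real effects pile up near zero): let p0 be the fraction of null studies, p1 = 1 − p0 the fraction with a real effect, and F(α) the fraction of real-effect studies that come in at or below a threshold α. Then

\[ \mathrm{FPR}(\alpha) \;=\; \frac{p_0\,\alpha}{p_0\,\alpha + p_1\,F(\alpha)} \;=\; \frac{1}{1 + \dfrac{p_1}{p_0}\cdot\dfrac{F(\alpha)}{\alpha}} . \]

F here is the CDF version of the convex quantile curve above, so it is concave with F(0) = 0, which forces F(α)/α to shrink (or at best hold steady) as α grows. The false positive rate can therefore only rise as the effective threshold rises.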

And thus, p-hacking is always a deal.

But Everything Worked Out, Right?

The right person won the recent French election, but the outcome worries me. The polls badly underestimated his win.

The average poll conducted in the final two weeks of the campaign gave Macron a far smaller lead (22 percentage points) than he ended up winning by (32 points), for a 10-point miss. In the eight previous presidential election runoffs, dating back to 1969, the average poll missed the margin between the first- and second-place finishers by only 3.9 points.

That should be a warning flag to the French to put less stock in their polls and give unlikely outcomes more weight. It’s doubtful they will, though, because everything turned out all right. That’s no slam against the French; it’s just human nature. Take the 2012 US election:

Four years ago, an average of survey results the week before the election had Obama winning by 1.2 percentage points. He actually beat Mitt Romney by 3.9 points.

If that 2.7-point error doesn’t sound like very much to you, well, it’s very close to what Donald Trump needs to overtake Hillary Clinton in the popular vote. She leads by 3.3 points in our polls-only forecast.

That was Harry Enten of FiveThirtyEight four days before the 2016 US election, four days before Clinton fell victim to a smaller polling error. Americans should have done back in 2012 what the French should do now, but they didn’t. Even the betting markets figured Clinton would sweep, an eerie mirror of their French counterparts.

Overall, there are a higher number of bets on Ms Le Pen coming out on top, than Brexit or Donald Trump – even though the odds are much lower, according to the betting experts.

The moral of the story: don’t let a win go to your head. You might miss a critical bit of data if you do.

Gimmie that Old-Time Breeding

Full disclosure: I think Evolutionary Psychology is a pseudo-science. This isn’t because the field endorses a flawed methodology (relative to the norm in other sciences), nor because they come to conclusions I’m uncomfortable with. No, the entire field is based on flawed or even false assumptions; it doesn’t matter how good your construction techniques are, if your foundation is a banana cream pie your building won’t be sturdy.

But maybe I’m wrong. Maybe EvoPsych researchers are correct when they say every other branch of social science is founded on falsehoods. So let’s give one of their papers a fair shake.

Ellis, Lee, et al. “The Future of Secularism: a Biologically Informed Theory Supplemented with Cross-Cultural Evidence.” Evolutionary Psychological Science: 1-19. [Read more…]

Everything Is Significant!

Back in 1938, Joseph Berkson made a bold statement.

I believe that an observant statistician who has had any considerable experience with applying the chi-square test repeatedly will agree with my statement that, as a matter of observation, when the numbers in the data are quite large, the P’s tend to come out small. Having observed this, and on reflection, I make the following dogmatic statement, referring for illustration to the normal curve: “If the normal curve is fitted to a body of data representing any real observations whatever of quantities in the physical world, then if the number of observations is extremely large—for instance, on an order of 200,000—the chi-square P will be small beyond any usual limit of significance.”

This dogmatic statement is made on the basis of an extrapolation of the observation referred to and can also be defended as a prediction from a priori considerations. For we may assume that it is practically certain that any series of real observations does not actually follow a normal curve with absolute exactitude in all respects, and no matter how small the discrepancy between the normal curve and the true curve of observations, the chi-square P will be small if the sample has a sufficiently large number of observations in it.

Berkson, Joseph. “Some Difficulties of Interpretation Encountered in the Application of the Chi-Square Test.” Journal of the American Statistical Association 33, no. 203 (1938): 526–536.
His prediction would be vindicated two decades later.
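A sketch of Berkson’s point, not his data: draw a couple hundred thousand observations from something that is almost, but not quite, normal (here a mild two-component mixture, a hypothetical stand-in), fit a normal curve to it, and run the chi-square goodness-of-fit test.

    % 200,000 draws from a 90/10 mixture of two normals: close enough to one
    % normal curve that you'd be hard pressed to spot the misfit by eye.
    n = 200000;
    x = randn(n, 1);
    wide = rand(n, 1) < 0.10;
    x(wide) = 1.5 * x(wide);                % 10% of points get a wider spread

    mu = mean(x);  sigma = std(x);          % fit the normal curve
    k  = 20;                                % bins for the chi-square test
    edges = [-Inf; mu + sigma * linspace(-3, 3, k - 1)'; Inf];
    obs   = histc(x, edges)(1:k);           % observed counts per bin
    Phi   = @(z) 0.5 * erfc(-z / sqrt(2));  % normal CDF via erfc
    expd  = n * diff(Phi((edges - mu) / sigma));
    chi2  = sum((obs - expd).^2 ./ expd);
    df    = k - 1 - 2;                      % two parameters were fitted
    p     = 1 - gammainc(chi2 / 2, df / 2); % chi-square upper tail probability
    printf("chi-square = %.1f on %d degrees of freedom, p = %.3g\n", chi2, df, p);

With a sample this large, even that barely-visible misfit typically drives the p-value far below any conventional cutoff, which is exactly the behaviour Berkson was describing.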

[Read more…]