# P-hacking is No Big Deal?

Possibly not. simine vazire argued the case over at “sometimes i’m wrong.”

The basic idea is as follows: if we use shady statistical techniques to indirectly adjust the p-value cutoff in Null Hypothesis Significance Testing or NHST, we’ll up the rate of false positives. Just to put some numbers to this, a p-value cutoff of 0.05 means that when the null hypothesis is true, we’ll get a bad sample about 5% of the time and wrongly conclude it’s false. If we use p-hacking to get an effective cutoff of 0.1, however, then that number jumps up to 10%.

However, p-hacking will also raise the number of true positives we get. How much higher it gets can be tricky to calculate, but this blog post by Erika Salomon gives out some great numbers. During one simulation run, a completely honest test of a false null hypothesis would return a true positive 12% of the time; when p-hacking was introduced, that skyrocketed to 74%.

If the increase in false positives is balanced out by the increase in true positives, then p-hacking makes no difference in the long run. The number of false positives in the literature would be entirely dependent on the power of studies, which is abysmally low, and our focus should be on improving that. Or, if we’re really lucky, the true positives increase faster than the false positives and we actually get a better scientific record via cheating!

We don’t really know which scenario will play out, however, and vazire calls for someone to code up a simulation.

Allow me.

My methodology will be to divide studies up into two categories: null results that are never published, and possibly-true results that are. I’ll be using a one-way ANOVA to check whether the averages of two groups drawn from Gaussian distributions differ. I debated switching to a Student t test, but comparing two random draws seems more realistic than comparing one random draw to a fixed mean of zero.

I need a model of effect and sample sizes. This one is pretty tricky; just because a study is unpublished doesn’t mean the effect size is zero, and vice-versa. Making inferences about unpublished studies is tough, for obvious reasons. I’ll take the naive route here, and assume unpublished studies have an effect size of zero while published studies have effect sizes on the same order as those of actual published studies. Both published and unpublished will have sample sizes typical of what’s published.

I have a handy cheat for that: the Open Science Collaboration published a giant replication of 100 psychology studies back in 2015, and being Open they shared the raw data online in a spreadsheet. The effect sizes are in correlation coefficients, which are easy to convert to Cohen’s d, and when paired with a standard deviation of one that gives us the mean of the treatment group. The control group’s mean is fixed at zero but shares the same standard deviation. Sample sizes are drawn from said spreadsheet, and represent the total number of samples and not the number of samples per group. In fact, it gives me two datasets in one: the original study effect and sample size, plus the replication’s effect and sample size. Unless I say otherwise, I’ll stick with the originals.
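As a rough sketch of that conversion: the post doesn’t say which formula was used, so this assumes the standard d = 2r / sqrt(1 − r²) for two equal-sized groups. The function names and the example inputs are mine, and it’s in Python rather than the Octave used for the actual simulation.

```python
import math

def r_to_d(r):
    """Convert a correlation coefficient r to Cohen's d (equal group sizes assumed)."""
    return 2.0 * r / math.sqrt(1.0 - r * r)

def study_template(r, total_n):
    """Build simulation parameters for one study.

    Control mean is 0, treatment mean is d, both with standard deviation 1.
    total_n is the whole sample, so each group gets half of it (an assumption
    about how the total is split between groups).
    """
    return {
        "control_mean": 0.0,
        "treatment_mean": r_to_d(r),
        "sd": 1.0,
        "n_per_group": total_n // 2,
    }

# Hypothetical inputs for illustration, not values from the OSC spreadsheet.
print(study_template(0.6, 50))
```

With a standard deviation of one, d is directly the gap between the two group means, which is why it can be dropped in as the treatment mean.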

P-hacking can be accomplished a number of ways: switching between the number of tests in the analysis and iteratively doing significance tests are but two of the more common. To simplify things I’ll just assume the effective p-value is a fixed number, but explore a range of values to get an idea of how a variable p-hacking effect would behave.

For some initial values, let’s say unpublished studies constitute 70% of all studies, and p-hacking can cause a p-value threshold of 0.05 to act like a threshold of 0.08.

Octave shall be my programming language of choice. Let’s have at it!

(Template: OSC 2015 originals)
With a 30.00% success rate and a straight p <= 0.050000, the false positive rate is 12.3654% (333 f.p, 2360 t.p)
Whereas if p-hacking lets slip p <= 0.080000, the false positive rate is 18.2911% (548 f.p, 2448 t.p)

(Template: OSC 2015 replications)
With a 30.00% success rate and a straight p <= 0.050000, the false positive rate is 19.2810% (354 f.p, 1482 t.p)
Whereas if p-hacking lets slip p <= 0.080000, the false positive rate is 26.2273% (577 f.p, 1623 t.p)
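The percentages in that output are consistent with reading “false positive rate” as the share of significant results that are false, f.p. / (f.p. + t.p.), rather than the per-test error rate. A quick check in Python, with the counts copied from the output above (the helper name is mine):

```python
def false_share(fp, tp):
    """Percentage of significant results that are false positives."""
    return 100.0 * fp / (fp + tp)

# (label, false positives, true positives), copied from the output above.
runs = [
    ("originals, p <= 0.05", 333, 2360),
    ("originals, p <= 0.08", 548, 2448),
    ("replications, p <= 0.05", 354, 1482),
    ("replications, p <= 0.08", 577, 1623),
]

for label, fp, tp in runs:
    print(f"{label}: {false_share(fp, tp):.4f}%")
```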

Ouch, our false positive rate went up. That seems strange, especially as the true positives (“t.p.”) and false positives (“f.p.”) went up by about the same amount. Maybe I got lucky with the parameter values, though; let’s scan a range of unpublished study rates from 0% to 100%, and effective p-values from 0.05 to 0.2. The actual p-value cutoff will remain fixed at 0.05. So we can fit it all in one chart, I’ll take the p-hacked false positive rate and subtract the vanilla false positive rate from it, so that areas where the false positive rate goes down after hacking are negative.

There are no values less than zero?! How can that be? The math behind these curves is complex, but I think I can give an intuitive explanation.

The diagonal is the distribution of p-values when the effect size is zero; the curve is what you get when it’s greater than zero. As there are more or fewer values in each category, the graphs are stretched or squashed horizontally. The p-value threshold is a horizontal line, and everything below that line is statistically significant. The proportion of false to true results is equal to the proportion between the lengths of that horizontal line from the origin.

P-hacking is the equivalent of nudging that line upwards. The proportions change according to the slope of the curve. The steeper it is, the less it changes. It follows that if you want to increase the proportion of true results, you need to find a pair of horizontal lines where the horizontal distance increases as fast or faster in proportion to the increase along that diagonal. Putting this geometrically, imagine drawing a line starting at the origin but at an arbitrary slope. Your job is to find a slope such that the line pierces the non-zero effect curve twice.

Slight problem: that non-zero effect curve has negative curvature everywhere. The slope is guaranteed to get steeper as you step up the curve, which means it will curve up and away from where the line crosses it. Translating that back into math, it’s guaranteed that the non-zero effect curve will not increase in proportion with the diagonal. The false positive rate will always increase as you up the effective p-value threshold.
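The same argument can be checked numerically without any random draws. This is only a sketch under assumptions of my own: it swaps the one-way ANOVA for a two-sided z-test with known standard deviation so the power has a closed form, and the effect size, group size and null fraction are illustrative values rather than anything from the OSC data.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power(alpha, delta):
    """Chance a two-sided z-test at cutoff alpha rejects when the test
    statistic is centred on delta instead of zero."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return STD_NORMAL.cdf(delta - z_crit) + STD_NORMAL.cdf(-delta - z_crit)

def expected_false_share(alpha, delta, null_fraction):
    """Expected fraction of significant results that come from true nulls."""
    false_pos = alpha * null_fraction
    true_pos = power(alpha, delta) * (1.0 - null_fraction)
    return false_pos / (false_pos + true_pos)

# Illustrative values only: d = 0.5 with 30 per group, 70% of studies null.
delta = 0.5 * (30 / 2) ** 0.5
cutoffs = [0.05, 0.08, 0.10, 0.15, 0.20]
ratios = [power(a, delta) / a for a in cutoffs]
shares = [expected_false_share(a, delta, 0.7) for a in cutoffs]

for a, ratio, share in zip(cutoffs, ratios, shares):
    print(f"effective cutoff {a:.2f}: power/alpha = {ratio:5.2f}, "
          f"false share = {share:.3f}")
```

The power/alpha column is the slope of the line from the origin to the curve. It shrinks as the cutoff rises, so true positives never keep pace with false positives and the false share only climbs.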

And thus, p-hacking is always a deal.

# But Everything Worked Out, Right?

The right person won in the recent French election, but the outcome worries me. The polls badly underestimated Macron’s win.

The average poll conducted in the final two weeks of the campaign gave Macron a far smaller lead (22 percentage points) than he ended up winning by (32 points), for a 10-point miss. In the eight previous presidential election runoffs, dating back to 1969, the average poll missed the margin between the first- and second-place finishers by only 3.9 points.

That should be a warning flag to the French to put less stock in their polls and weight unlikely outcomes as more likely. It’s doubtful they will, though, because everything turned out all right. That’s no slam against the French, it’s just human nature. Take the 2012 US election:

Four years ago, an average of survey results the week before the election had Obama winning by 1.2 percentage points. He actually beat Mitt Romney by 3.9 points.

If that 2.7-point error doesn’t sound like very much to you, well, it’s very close to what Donald Trump needs to overtake Hillary Clinton in the popular vote. She leads by 3.3 points in our polls-only forecast.

That was Harry Enten of FiveThirtyEight four days before the 2016 US election, four days before Clinton fell victim to a smaller polling error. Americans should have done back in 2012 what the French should do now, but they didn’t. Even the betting markets figured Clinton would sweep, an eerie mirror of their French counterparts.

Overall, there are a higher number of bets on Ms Le Pen coming out on top, than Brexit or Donald Trump – even though the odds are much lower, according to the betting experts.

The moral of the story: don’t let a win go to your head. You might miss a critical bit of data if you do.

# Gimmie that Old-Time Breeding

Full disclosure: I think Evolutionary Psychology is a pseudo-science. This isn’t because the field endorses a flawed methodology (relative to the norm in other sciences), nor because they come to conclusions I’m uncomfortable with. No, the entire field is based on flawed or even false assumptions; it doesn’t matter how good your construction techniques are, if your foundation is a banana cream pie your building won’t be sturdy.

But maybe I’m wrong. Maybe EvoPsych researchers are correct when they say every other branch of social science is founded on falsehoods. So let’s give one of their papers a fair shake.

Ellis, Lee, et al. “The Future of Secularism: a Biologically Informed Theory Supplemented with Cross-Cultural Evidence.” Evolutionary Psychological Science: 1-19.

# Everything Is Significant!

Back in 1938, Joseph Berkson made a bold statement.

I believe that an observant statistician who has had any considerable experience with applying the chi-square test repeatedly will agree with my statement that, as a matter of observation, when the numbers in the data are quite large, the P’s tend to come out small. Having observed this, and on reflection, I make the following dogmatic statement, referring for illustration to the normal curve: “If the normal curve is fitted to a body of data representing any real observations whatever of quantities in the physical world, then if the number of observations is extremely large—for instance, on an order of 200,000—the chi-square P will be small beyond any usual limit of significance.”

This dogmatic statement is made on the basis of an extrapolation of the observation referred to and can also be defended as a prediction from a priori considerations. For we may assume that it is practically certain that any series of real observations does not actually follow a normal curve with absolute exactitude in all respects, and no matter how small the discrepancy between the normal curve and the true curve of observations, the chi-square P will be small if the sample has a sufficiently large number of observations in it.

Berkson, Joseph. “Some Difficulties of Interpretation Encountered in the Application of the Chi-Square Test.” Journal of the American Statistical Association 33, no. 203 (1938): 526–536.

His prediction would be vindicated two decades later.

# Stop Assessing Science

I completely agree with PZ, in part because I’ve heard the same tune before.

The results indicate that the investigators contributing to Volume 61 of the Journal of Abnormal and Social Psychology had, on the average, a relatively (or even absolutely) poor chance of rejecting their major null hypotheses, unless the effect they sought was large. This surprising (and discouraging) finding needs some further consideration to be seen in full perspective.

First, it may be noted that with few exceptions, the 70 studies did have significant results. This may then suggest that perhaps the definitions of size of effect were too severe, or perhaps, accepting the definitions, one might seek to conclude that the investigators were operating under circumstances wherein the effects were actually large, hence their success. Perhaps, then, research in the abnormal-social area is not as “weak” as the above results suggest. But this argument rests on the implicit assumption that the research which is published is representative of the research undertaken in this area. It seems obvious that investigators are less likely to submit for publication unsuccessful than successful research, to say nothing of a similar editorial bias in accepting research for publication.

Statistical power is defined as the probability of correctly rejecting a false null hypothesis. All else being equal, the larger the study size, the greater the statistical power. Thus if your study has a poor chance of answering the question it is tasked with, it is too small.

Suppose we hold fixed the theoretically calculable incidence of Type I errors. … Holding this 5% significance level fixed (which, as a form of scientific strategy, means leaning over backward not to conclude that a relationship exists when there isn’t one, or when there is a relationship in the wrong direction), we can decrease the probability of Type II errors by improving our experiment in certain respects. There are three general ways in which the frequency of Type II errors can be decreased (for fixed Type I error-rate), namely, (a) by improving the logical structure of the experiment, (b) by improving experimental techniques such as the control of extraneous variables which contribute to intragroup variation (and hence appear in the denominator of the significance test), and (c) by increasing the size of the sample. … We select a logical design and choose a sample size such that it can be said in advance that if one is interested in a true difference provided it is at least of a specified magnitude (i.e., if it is smaller than this we are content to miss the opportunity of finding it), the probability is high (say, 80%) that we will successfully refute the null hypothesis.
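The sample-size choice described in that quote can be sketched with the usual normal approximation for comparing two equal-sized groups. This is an illustration rather than anything from Meehl or Cohen; the function name is mine, and an exact t-test calculation would ask for slightly more per group.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, target_power=0.80):
    """Approximate per-group sample size for a two-sided, two-group comparison
    (normal approximation, equal group sizes, common standard deviation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)   # critical value for the cutoff
    z_power = z.inv_cdf(target_power)        # quantile for the desired power
    return math.ceil(2.0 * ((z_alpha + z_power) / d) ** 2)

for d in (0.25, 0.5):
    print(f"d = {d}: about {n_per_group(d)} per group")
```

Under these assumptions an 80% chance of detecting d = 0.5 takes roughly 63 subjects per group, and d = 0.25 takes roughly 252; halving the effect size quadruples the sample you need.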

If low statistical power were just due to a few bad apples, it would be rare. Instead, as the first quote implies, it’s quite common. That study found that for studies with small effect sizes, where Cohen’s d was roughly 0.25, the average statistical power was an abysmal 18%. For medium effect sizes, where d is roughly 0.5, that number is still less than half. Since those two ranges cover the majority of social science effect sizes, the typical study has very low power and thus too small a sample. The problem of low power must be systemic to how science is carried out.

In this fashion a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of “an integrated research program,” without ever once refuting or corroborating so much as a single strand of the network. Some of the more horrible examples of this process would require the combined analytic and reconstructive efforts of Carnap, Hempel, and Popper to unscramble the logical relationships of theories and hypotheses to evidence. Meanwhile our eager-beaver researcher, undismayed by logic-of-science considerations and relying blissfully on the “exactitude” of modern statistical hypothesis-testing, has produced a long publication list and been promoted to a full professorship. In terms of his contribution to the enduring body of psychological knowledge, he has done hardly anything. His true position is that of a potent-but-sterile intellectual rake, who leaves in his merry path a long train of ravished maidens but no viable scientific offspring.

I know, it’s a bit confusing that I haven’t clarified who I’m quoting. That first quote comes from this study:

Cohen, Jacob. “The Statistical Power of Abnormal-Social Psychological Research: A Review.” The Journal of Abnormal and Social Psychology 65, no. 3 (1962): 145.

While the second and third are from this:

Meehl, Paul E. “Theory-Testing in Psychology and Physics: A Methodological Paradox.” Philosophy of Science 34, no. 2 (1967): 103–115.

That’s right, scientists have been complaining about small sample sizes for over 50 years. Fanelli et al. [2017] might provide greater detail and evidence than previous authors did, but the basic conclusion has remained the same. Nor are these two studies lone wolves in the darkness; I wrote about a meta-analysis of 16 different power-level studies between Cohen’s and now, all of which agree with Cohen’s.

If your assessments have been consistently telling you the same thing for decades, maybe it’s time to stop assessing. Maybe it’s time to start acting on those assessments, instead. PZ is already doing that, thankfully…

More data! This is also helpful information for my undergraduate labs, since I’m currently in the process of cracking the whip over my genetics students and telling them to count more flies. Only a thousand? Count more. MORE!

… but this is a chronic, systemic issue within science. We need more.

# Double-Dipping Datasets

I wrote this comment down on a mental Post-It note:

nathanieltagg @10:
… So, here’s the big one: WHY is it wrong to use the same dataset to look for different ideas? (Maybe it’s OK if you don’t throw out many null results along the way?)

It followed this post by Myers.

He described it as a failed study with null results. There’s nothing wrong with that; it happens. What I would think would be appropriate next would be to step back, redesign the experiment to correct flaws (if you thought it had some; if it didn’t, you simply have a negative result and that’s what you ought to report), and repeat the experiment (again, if you thought there was something to your hypothesis).

That’s not what he did.

He gave his student the same old data from the same set of observations and asked her to rework the analyses to get a statistically significant result of some sort. This is deplorable. It is unacceptable. It means this visiting student was not doing something I would call research — she was assigned the job of p-hacking.

And both the comment and the post have been clawing away at me for a few weeks, while I’ve been unable to answer. So let’s fix that: is it always bad to re-analyze a dataset? If not, then when and how?

# BBC’s “Transgender Kids, Who Knows Best?” p1: You got Autism in my Gender Dysphoria!

This series on BBC’s “Transgender Kids: Who Knows Best?” is co-authored by HJ Hornbeck and Siobhan O’Leary. It attempts to fact-check and explore the many claims of the documentary concerning gender variant youth. You can follow the rest of the series here:

1. Part One: You got Autism in my Gender Dysphoria!
2. Part Two: Say it with me now
3. Part Three: My old friend, eighty percent
4. Part Four: Dirty Sexy Brains

Petitions seem as common as pennies, but this one stood out to me (emphasis in original).

The BBC is set to broadcast a documentary on BBC Two on the 12th January 2017 at 9pm called ‘Transgender Kids: Who Knows Best?‘. The documentary is based on the controversial views of Dr. Kenneth Zucker, who believes that Gender Dysphoria in children should be treated as a mental health issue.

In simpler terms, Dr. Zucker thinks that being/querying being Transgender as a child is not valid, and should be classed as a mental health issue. […]

To clarify, this petition is not to stop this program for being broadcast entirely; however no transgender experts in the UK have watched over this program, which potentially may have a transphobic undertone. We simply don’t know what to expect from the program, however from his history and the synopsis available online, we can make an educated guess that it won’t be in support of Transgender Rights for Children.

That last paragraph is striking; who makes a documentary about a group of people without consulting experts, let alone gets it aired on national TV? It helps explain why a petition over something that hadn’t happened yet earned 11,000+ signatures.

Now if you’ve checked your watch, you’ve probably noticed the documentary came and went. I’ve been keeping an eye out for reviews, and they fall into two camps: enthusiastic support

So it’s a good thing BBC didn’t listen to those claiming this documentary shouldn’t have run. As it turns out, it’s an informative, sophisticated, and generally fair treatment of an incredibly complex and fraught subject.

… and enthusiastic opposition

The show seems to have been designed to cause maximum harm to #trans children and their families. I can hardly begin to tackle here the number of areas in which the show was inaccurate, misleading, demonising, damaging and plain false.

… but I have yet to see someone do an in-depth analysis of the claims made in this specific documentary. So Siobhan is doing precisely that, in a series of blog posts.

# The Odds of Elvis Being an Identical Twin

This one demanded to be shared ASAP. Here’s what you need to know:

1. Identical or monozygotic twins occur in roughly four births per 1,000.
2. Fraternal or dizygotic twins occur in roughly eight births per 1,000.
3. Elvis Presley had a twin brother, Jesse Garon Presley, who was stillborn.

For simplicity’s sake, we’ll assume sex is binary and split 50/50, despite the existence of intersex fraternal twins. What are the odds of Elvis being an identical twin? The answer’s below the fold.
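The post’s own answer sits behind the fold and isn’t reproduced here. One way to set the problem up is with Bayes’ theorem, using the rates listed above plus one assumption of mine: that the stillbirth itself says nothing about which kind of twin Jesse was.

```python
from fractions import Fraction

# Rates from the list above, per birth.
p_identical = Fraction(4, 1000)
p_fraternal = Fraction(8, 1000)

# Chance that the twin is a brother, given that Elvis is male.
p_brother_if_identical = Fraction(1)     # identical twins share a sex
p_brother_if_fraternal = Fraction(1, 2)  # the 50/50 assumption from the text

# Total chance of "twin birth with a brother", then Bayes' theorem.
evidence = (p_identical * p_brother_if_identical
            + p_fraternal * p_brother_if_fraternal)
p_identical_given_brother = p_identical * p_brother_if_identical / evidence

print(p_identical_given_brother)
```

Fraternal twins are twice as common, but only half of them are same-sex pairs, so under these assumptions the two possibilities come out even.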

# Replication Isn’t Enough

I bang on about statistical power because low power indirectly raises the odds of a false positive. In brief, it forces you to do more tests to reach a statistical conclusion, stuffing the file drawer and thus making published results appear more certain than they are. In detail, see John Borghi or Ioannidis (2005). In comic, see Maki Naro.

The concept of statistical power has been known since 1928, the wasteful consequences of low power since 1962, and yet there’s no sign that scientists are upping their power levels. This is a representative result:

Our results indicate that the average statistical power of studies in the field of neuroscience is probably no more than between ~8% and ~31%, on the basis of evidence from diverse subfields within neuro-science. If the low average power we observed across these studies is typical of the neuroscience literature as a whole, this has profound implications for the field. A major implication is that the likelihood that any nominally significant finding actually reflects a true effect is small.

Button, Katherine S., et al. “Power failure: why small sample size undermines the reliability of neuroscience.” Nature Reviews Neuroscience 14.5 (2013): 365-376.

The most obvious consequence of low power is a failure to replicate. If you rarely try to replicate studies, you’ll be blissfully unaware of the problem; once you take replications seriously, though, you’ll suddenly find yourself in a “replication crisis.”

You’d think this would result in calls for increased statistical power, with the occasional call for a switch in methodology to a system that automatically incorporates power. But it’s also led to calls for more replications.

As a condition of receiving their PhD from any accredited institution, graduate students in psychology should be required to conduct, write up, and submit for publication a high-quality replication attempt of at least one key finding from the literature, focusing on the area of their doctoral research.

Everett, Jim AC, and Brian D. Earp. “A tragedy of the (academic) commons: interpreting the replication crisis in psychology as a social dilemma for early-career researchers.” Frontiers in psychology 6 (2015).

Much has been made of preregistration, publication of null results, and Bayesian statistics as important changes to how we do business. But my view is that there is relatively little value in appending these modifications to a scientific practice that is still about one-off findings; and applying them mechanistically to a more careful, cumulative practice is likely to be more of a hindrance than a help. So what do we do? …

Cumulative study sets with internal replication.

There’s an intuitive logic to this: currently fewer than one in a hundred papers are replications of prior work, so there’s plenty of room for expansion; many key figures like Ronald Fisher and Jerzy Neyman have emphasized the necessity of replications; it doesn’t require any modification of technique; and the “replication crisis” is primarily about replications. It sounds like an easy, feel-good solution to the problem.

But then I read this paper:

Smaldino, Paul E., and Richard McElreath. “The Natural Selection of Bad Science.” arXiv preprint arXiv:1605.09511 (2016).

It starts off with a meta-analysis of meta-analyses of power, and comes to the same conclusion as above.

We collected all papers that contained reviews of statistical power from published papers in the social, behavioural and biological sciences, and found 19 studies from 16 papers published between 1992 and 2014. … We focus on the statistical power to detect small effects of the order d=0.2, the kind most commonly found in social science research. …. Statistical power is quite low, with a mean of only 0.24, meaning that tests will fail to detect small effects when present three times out of four. More importantly, statistical power shows no sign of increase over six decades …. The data are far from a complete picture of any given field or of the social and behavioural sciences more generally, but they help explain why false discoveries appear to be common. Indeed, our methods may overestimate statistical power because we draw only on published results, which were by necessity sufficiently powered to pass through peer review, usually by detecting a non-null effect.

Rather than leave it at that, though, the researchers decided to simulate the pursuit of science. They set up various “labs” that exerted different levels of effort to maintain methodological rigor, killed off labs that didn’t publish much and replaced them with mutations of labs that published more, and set the simulation spinning.

We ran simulations in which power was held constant but in which effort could evolve (μw=0, μe=0.01). Here selection favoured labs who put in less effort towards ensuring quality work, which increased publication rates at the cost of more false discoveries … . When the focus is on the production of novel results and negative findings are difficult to publish, institutional incentives for publication quantity select for the continued degradation of scientific practices.

That’s not surprising. But then they started tinkering with replication rates. To begin with, replications were done 1% of the time, were guaranteed to be published, and having one of your results fail to replicate would exact a terrible toll.

We found that the mean rate of replication evolved slowly but steadily to around 0.08. Replication was weakly selected for, because although publication of a replication was worth only half as much as publication of a novel result, it was also guaranteed to be published. On the other hand, allowing replication to evolve could not stave off the evolution of low effort, because low effort increased the false-positive rate to such high levels that novel hypotheses became more likely than not to yield positive results … . As such, increasing one’s replication rate became less lucrative than reducing effort and pursuing novel hypotheses.

So it was time for extreme measures: force the replication rate to high levels, to the point that 50% of all studies were replications. All that happened was that it took longer for the overall methodological effort to drop and false positives to bloom.

Replication is not sufficient to curb the natural selection of bad science because the top performing labs will always be those who are able to cut corners. Replication allows those labs with poor methods to be penalized, but unless all published studies are replicated several times (an ideal but implausible scenario), some labs will avoid being caught. In a system such as modern science, with finite career opportunities and high network connectivity, the marginal return for being in the top tier of publications may be orders of magnitude higher than an otherwise respectable publication record.

Replication isn’t enough. The field of science needs to incorporate more radical reforms that encourage high methodological rigor and greater power.

# Steven Pinker and his Portable Goalposts

PZ Myers seems to have pissed off quite a few people, this time for taking Steven Pinker to task. His take is worth reading in full, but I’d like to add another angle. In the original interview, there’s a very telling passage:

Belluz: But as you mentioned, there’s been an uptick in war deaths driven by the staggeringly violent ongoing conflict in Syria. Does that not affect your thesis?

Pinker: No, it doesn’t affect the thesis because the rate of death in war is about 1.4 per 100,000 per year. That’s higher than it was at the low point in 2010. But it’s still a fraction of what it was in earlier years.

See the problem here? Pinker’s hypothesis is that over the span of centuries, violence will decrease. The recent spike in deaths may be the start of a reversal that proves Pinker wrong. But because his hypothesis covers such a wide timespan, we’re going to need fifty or more years’ worth of data to challenge it.