Science, we have a systemic problem


I read with growing horror this account of the research practices of the Wansink lab. They do research in nutrition, or maybe some combination of economics, psychology, and dietary practices — it’s described as “research about how people perceive, consume, and think about food”, and it’s not stuff I’d ever be interested in reading (although that does not imply that it has no value). The PI, Brian Wansink, wrote up a summary of his process on a blog, though, and honestly, my jaw just dropped reading this.

A PhD student from a Turkish university called to interview to be a visiting scholar for 6 months. Her dissertation was on a topic that was only indirectly related to our Lab’s mission, but she really wanted to come and we had the room, so I said “Yes.”

When she arrived, I gave her a data set of a self-funded, failed study which had null results (it was a one month study in an all-you-can-eat Italian restaurant buffet where we had charged some people ½ as much as others). I said, “This cost us a lot of time and our own money to collect. There’s got to be something here we can salvage because it’s a cool (rich & unique) data set.” I had three ideas for potential Plan B, C, & D directions (since Plan A had failed). I told her what the analyses should be and what the tables should look like. I then asked her if she wanted to do them.

He described it as a failed study with null results. There’s nothing wrong with that; it happens. What would be appropriate next is to step back, redesign the experiment to correct any flaws (if you thought it had some; if it didn’t, you simply have a negative result, and that’s what you ought to report), and repeat the experiment (again, if you thought there was something to your hypothesis).

That’s not what he did.

He gave his student the same old data from the same set of observations and asked her to rework the analyses to get a statistically significant result of some sort. This is deplorable. It is unacceptable. It means this visiting student was not doing something I would call research — she was assigned the job of p-hacking.

Further, what’s just as shocking is that Wansink sees so little wrong with this behavior that he would publicly write about it.

He’s not done.

Every day she came back with puzzling new results, and every day we would scratch our heads, ask “Why,” and come up with another way to reanalyze the data with yet another set of plausible hypotheses. Eventually we started discovering solutions that held up regardless of how we pressure-tested them.

Note: no new experiments. This is all just churning over the same failed experiment, the same failed data set. Back in the day, I learned that you design an experiment to test a specific hypothesis, and that you don’t get to keep reanalyzing the same data under different hypotheses until you find a result that you like. But what do I know, I’m old.

Still not done.

I outlined the first paper, and she wrote it up, and every day for a month I told her how to rewrite it and she did. This happened with a second paper, and then a third paper (which was one that was based on her own discovery while digging through the data).

Out of this one failed (I repeat, fucking failed) data set, they ground out FOUR papers. Four. Within a few months. Good god, I’ve been doing everything wrong.

You might be wondering what these papers were that he milked out of this failed data set. Here are the titles:

Lower Buffet Prices Lead to Less Taste Satisfaction
Peak-end pizza: prices delay evaluations of quality
Low prices and high regret: how pricing influences regret at all-you-can-eat buffets
Eating Heavily: Men Eat More in the Company of Women

I am trying hard not to be judgmental, and failing. These sound like superficial, pointless crap churned out to appease a capitalistic marketing machine, with virtually no value and no contribution to human knowledge. But I guess it’s good enough to get you a leadership position in a prestigious lab at Cornell.

It’s also a huge problem that this kind of strategy works. It’s not just Wansink — it’s a science establishment that allows and even encourages this kind of garbage production.

I hear there’s a replication crisis in the sciences. I have no idea how that could be.

Comments

  1. handsomemrtoad says

    In this case, BE judgemental! Be VERY judgemental!

    But some of the very best scientists have problems with reproducibility. My very favorite scientist, who is doing the MOST important (IMHO) research in the world, had to retract a groundbreaking paper which he had published in Science. OUCH! The scientist was the great Peter G. Schultz, the leading pioneer of combinatorial chemistry and screening molecular libraries; also, a leader in expanding the genetic code to include unnatural amino acids synthesized in the lab, winner of numerous awards including the Wolf Prize in Chemistry (one step away from the Nobel), founder of several successful companies including Affymax. The retracted paper was a method for synthesizing proteins with pre-glycosylated amino acids. Dial-a-glycoprotein! Would have opened the door for all kinds of new applications.

  2. JimB says

    Totally off topic!

    Just saw PZ on Rachel Maddow! Right in the middle in the red shirt!

    Cracked me up. Welcome to the big time PZ!

  3. JimB says

    She was talking about Indivisible. My ears perked up when she mentioned Morris Minnesota. She gave the head count of your meeting and the city population. And then she showed the group photo you guys took (I assume!).

    Pretty much just gave you an atta boy!

  4. JimB says

    Umm. Not that she singled you out. I froze the picture and then scanned till I found you in the center. And laughed some more!

  5. says

    I don’t entirely agree with you. As a grad student, I look at old spectroscopy data all the time, often testing hypotheses that the studies were not originally intended for, or coming up with new hypotheses. After all, what matters in spectroscopy is the type of material, temperature, light intensity, etc. and most certainly not the intent of the experimenter. If an experiment has “null results” that just means null results with respect to some particular hypothesis, and doesn’t mean there aren’t other hypotheses that could be tested by the same data set.

    So from my perspective, what makes these studies bad is not that they used an old data set, but that they test too many hypotheses for such a small set of data (n=122??), and engage in some bizarre practices like throwing out data in some publications but not others.

  6. Raryn says

    I think the big problem wasn’t using the old data set, but using it as anything except a hypothesis-generating exercise.

    They had the data, great. It wasn’t publishable as is, but presumably was appropriately gathered. Troll through it to generate a new hypothesis (or four, that’s fine). Then repeat the study *with that as your hypothesis* and see if it’s confirmed.

    But p-hacking for publishable results then salami slicing one dataset into four papers without doing any additional work? Bleh.

  7. nathanieltagg says

    @Siggy #7:
    I had the same thought. Posts like this one from the squishy sciences make me wonder whether we’re doing it right in Physics.

    We suffer from few of the same problems: variability tends to come from well-understood instrumental or statistical sources, for one. Second, we fit to models, not simply find correlations. Third, we select data subsets that tend to be somewhat disjoint (at least in particle physics) and so don’t find ourselves in the state of re-using data.

    But I can’t put my finger on why PZ and the bio/psych guys are wrong, or why we are. It’s clearly born out of different experimental methodologies... but I don’t want to just retreat into “what’s right for them isn’t right for us”.

    So, here’s the big one: WHY is it wrong to use the same dataset to look for different ideas? (Maybe it’s OK if you don’t throw out many null results along the way?)

  8. says

    Related to what others have said: it’s true you can test other hypotheses against the data than the hypothesis the experiment was designed for.

    Yes, the “Texas Sharpshooter Fallacy” is still possible as well.

    It depends.

  9. bognor says

    How about polite but firm letters to the editors of those four journals, imploring them to retract the dodgy papers? Surely at least one of them would agree, and that would help reinforce that p-hacking must not be tolerated and that the culture which permits it must change.

  10. says

    @10, nathanieltagg

    So, here’s the big one: WHY is it wrong to use the same dataset to look for different ideas?

    Well, it’s always OK to merely look for different ideas, and sometimes even to use that data set to test ideas other than the ones it was designed for.

    But the possibility of coincidence/luck needs to be kept in mind.

    Indeed, you have to ask whether you really think your data set should contain zero coincidences of any kind, because if you look hard enough for a pattern you are bound to find a coincidence of some sort!

  11. says

    @nathanieltagg,
    Yeah, I always feel like complaints about p-hacking just don’t address issues I have in physics. I never calculate p-values at all! Calculating p-values would typically be an absurd exercise in large negative exponents. The danger is not statistical noise, but that something is going on that we haven’t accounted for. And taking new data doesn’t necessarily help with that.

  12. unclefrogy says

    To me it sounds like the problem wasn't so much that they re-analyzed the data to figure out what it showed, but that they stopped there and just kept going over the same data over and over again.
    It does not sound like they were curious about what was going on; they just wanted publishable results. That the results might be valuable to the fast food and food services industries was just a lucky break.
    uncle frogy

  13. says

    The problem is not that old data is used. It’s that it is used to FIND and CONFIRM the hypothesis with the same data set. That is what p-hacking is about. You search the data for anomalies, then use the fact that you just found an anomaly as confirmation of itself. That is bad science.

    Using old data is not in itself a big problem. But either you need to already have a hypothesis you want to test, or you look for new ones, which then lead to additional experiments.

    In astronomy it is done all the time, because telescope time is rare and expensive. One famous example is the cosmic microwave background measurements: there are not that many of them, because you need a specialized satellite, but each one has resulted in a lot of papers.

  14. blgmnts says

    One example is that, after finding an exoplanet, they re-analyzed old HST images with current methods and found the exoplanet in the noisy glow around the parent star. That gave them another data point to analyze the planet’s orbit.

  15. Reginald Selkirk says

    This reminds me of J.B. Rhine, the ESP advocate at Duke. After running an apparently successful test of ESP in his lab, he challenged anyone else to process the same data set and see what their results were. He considered this “scientific replication.”

  16. slatham says

    I’m not much of a statistician, but this is what I learned in school: in frequentist statistics (not the best framework for many applications) there are two main kinds of errors, Type I and Type II. Type I is rejecting the null hypothesis when it is true; Type II is failing to reject a null hypothesis when it is false. By convention, many fields treat Type I errors as the worse kind and endeavor to limit them to a rate of 0.05. That is, even if the null hypothesis were true, data collected and analyzed in the same manner as yours would, by chance, yield results as different from the null expectation as you observed some proportion of the time. If that proportion is less than 1 in 20, you get to call your results "statistically significant". If everybody plays fair, a true null hypothesis should be falsely rejected no more than about 1 time in 20. In some fields the chosen 'alpha' (acceptable Type I error rate) is more constrained (like 1/100).

    Acceptable Type II error ('beta') is generally taken to be 0.20 in several fields (so the power of the test should be 0.8). But power analyses are rarely reported, and neither is the beta that informed the statistical design. I think that alone tells you there is a problem with how statistics are applied in science.

    But the problem discussed here is that if you test multiple hypotheses with independent statistical analyses, and none of the null hypotheses is false, you will eventually get a "statistically significant" result anyway. Each additional test increases your chances of finding a p<0.05 result. If you only report the significant results, you are implicitly lying. Lying! The p-value reported is decoupled from the 'alpha' the field has agreed to.

    There are corrections for this! Corrections for multiple comparisons are built into, for example, the calculation of test statistics for ANOVAs, and some post-hoc procedures (e.g. the Tukey test) account for them too. But when the multiple tests are spread across separate analyses (each of which only corrects internally), you can divide your significance threshold by the number of tests performed, or something similar (google the Bonferroni correction or sequential Bonferroni). A quick simulation of the effect is sketched at the end of this comment.

    Because journals prefer to publish statistically significant results, studies that fail to reject the null hypothesis tend not to be reported. It's possible for 19 studies to fail to reject a given null hypothesis while the one study that does reject it is the one that gets published. So right away, due to this bias, you can expect more than 5% of the statistically significant results you see published to be wrong. Then on top of this there is the researchers' own self-censorship (google: file drawer problem), and of course the p-hacking we've been discussing.

    Yeah, it's a mess. But somehow (probably due to replication efforts in competitive fields, or unfortunately later when research programmes fail to progress from previous findings) scientific knowledge does seem to improve over time. Lies (deliberate or happenstance) eventually get weeded out.
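
    To make the multiple-comparisons point above concrete, here is a minimal simulation sketch in Python (assuming numpy and scipy are available; the sample sizes, seed, and thresholds are illustrative, not taken from any of the studies discussed). It runs 20 independent t-tests on pure noise per simulated "study" and counts how often at least one of them comes out "significant", with and without a Bonferroni correction:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_simulations = 2000   # simulated "studies"
    n_tests = 20           # hypotheses tested per study
    n = 30                 # observations per group
    alpha = 0.05

    naive_hits = 0
    bonferroni_hits = 0
    for _ in range(n_simulations):
        # Every null hypothesis is true: both groups are drawn from the same distribution.
        p_values = [
            stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
            for _ in range(n_tests)
        ]
        naive_hits += min(p_values) < alpha                 # any "significant" result at all?
        bonferroni_hits += min(p_values) < alpha / n_tests  # corrected threshold

    print(f"P(at least one false positive), naive:      {naive_hits / n_simulations:.2f}")
    print(f"P(at least one false positive), Bonferroni: {bonferroni_hits / n_simulations:.2f}")
    # Expect roughly 1 - 0.95**20, about 0.64, for the naive case, and about 0.05 with the correction.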

  17. Enkidum says

    Agreed with those saying there’s nothing wrong with analyzing, hell, even publishing, data from an experiment with a null result.

    I’ve recently published a large-scale study (N ≈ 12,000) comparing the prevalence of a trait between two groups of people. We went in with a particular hypothesis (the trait will be more common in Group X than in Group Y), which was not only disconfirmed: we got exactly the opposite of the expected effect (the trait was twice as common in Group Y as in Group X, and this was clearly not just due to noise).

    We were obviously incredibly surprised by this, and went on to analyze the data further, finding some further covariates that might explain the unexpected finding, as well as finding further support for our original higher-level hypothesis that motivated us to do the study in the first place. We reported these results, but were very clear to point out that they were exploratory, and in the discussion went to some lengths to point out that if we went solely by the planned comparison, we would assume that our hypothesis was false, but if we went solely by the subsequent analyses, we would assume that it was true. And we were careful not to report p-values for the exploratory stuff, since they are essentially meaningless (although to be honest, we did use confidence intervals, which are just a way of hiding p-values behind a more acceptable facade).

    I don’t think we did anything wrong, and I’d like to thank our anonymous reviewer for forcing us to be a bit more careful in our wording about the exploratory stuff. But this is clearly not what was done by this lab.

    Finally, @Slatham – Bonferroni is great (and necessary), but it won’t suddenly make p-values meaningful if you’re doing analyses you didn’t plan to do before your experiment (as in the example I give from my own work). Dividing a meaningless number by an integer is still a meaningless number.

  18. jrkrideau says

    # 10 nathanieltagg

    WHY is it wrong to use the same dataset to look for different ideas?

    As someone trained in one of the “SQUISHY” sciences: what they did is wrong because they p-hacked, followed the garden of forking paths, and generally tortured the data nearly to death, as David @ 2 pointed out. And the data set was small enough that they were capitalizing on pure random noise.

    @7 siggy.

    There is nothing wrong with doing different analyses as a hypothesis-generating tool, but you have to realize that social and behavioural data is much nosier, I think, than data in the hard sciences or even many of the biological ones. We usually do not have clean models.

    If you look hard enough you often can find “something”, whether it’s an artifact of the experiment or the equipment, or something real.

    Think of these papers as the equivalent of the Cold Fusion experiment. Or perhaps think of it as CERN publishing on faster-than-light results rather than checking to see what was really happening.

  19. nathanieltagg says

    @Siggy, @Pansky, others,

    I think that must be the crux of it: in physics, we don’t compute P-values.

    Or, to put it more accurately, we don’t have statistical tests that sit on artificial borders like p=0.05 or 0.01; indeed, our usual tests for declaring ‘evidence’ are at 3-sigma or higher (and even there we won’t claim strong evidence). In particle physics, where they are explicitly looking at many thousands of possible ‘coincidence’ hypotheses, the standards are much higher (5 sigma minimum for ‘suggestion’, 7 for ‘evidence’).

    P-hacking only really works if your field is sitting on an evidence boundary where coincidence crosses it about 1 time in 20. If you sit at 1 time in 1000, then you can look at 100 or more correlations without much fear of finding something at random; but if you’re at 1 in 20, you’re suspect if you look at more than 2. (The rough numbers are sketched below.)
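
    A rough back-of-the-envelope version of that claim, sketched in Python (assuming scipy; the specific numbers of "looks" are arbitrary choices for illustration): convert each threshold to a two-sided tail probability and compute the chance that at least one of k independent tests of a true null crosses it.

    from scipy import stats

    def false_alarm_probability(p_threshold, k_tests):
        """Chance that at least one of k independent tests of a true null crosses the threshold."""
        return 1 - (1 - p_threshold) ** k_tests

    thresholds = [("p = 0.05", 0.05),
                  ("3 sigma", 2 * stats.norm.sf(3)),   # about 0.0027, two-sided
                  ("5 sigma", 2 * stats.norm.sf(5))]   # about 6e-7, two-sided

    for label, p in thresholds:
        rates = {k: round(false_alarm_probability(p, k), 5) for k in (1, 2, 20, 100)}
        print(label, rates)
    # At p = 0.05, two looks already put you near a 10% false-alarm rate and 20 looks near 64%;
    # at 5 sigma, even 100 looks leave the rate around 6e-5.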

  20. Rich Woods says

    @jrkrideau #23:

    behavioural data is much nosier

    All hail Tpyos!

    Thank you, that brought a laugh at the end of a long day.

  21. Enkidum says

    nathanieltagg @24 is right about the difference between physics and softer sciences. When you’ve got a 5-sigma standard, you can explore like crazy and still be pretty sure when you’re finding out the truth. When you’ve got a p <.05 standard, things are a lot more worrisome.

    Relaxed standards are important in psychology because we're largely in exploratory territory, and more than anything else we want to find things that are likely to be interesting. Which is why I'm honestly not that concerned about estimates that, e.g., 50% of published psychology studies are wrong. That's a hell of a lot better than noise, and anything interesting will either be replicable or it won't.

    What is concerning, though, is the way in which those results are presented and discussed. We have to be open about data exploration, which is necessary because we don't have infinite money and time, and we still need to communicate with other scientists. There needs to be a better way of getting career credit for saying "here is some stuff that I think is true for reasons X, Y and Z, here is how confident I am about this, treat it all with a grain of salt", without being required to attach precise p-values to each of those claims.

  22. anbheal says

    @17 Turi1337 — Agreed. It’s silly to simply toss all the hard work and expense of data collection and preliminary research out the window. Poring over that data with an eye toward alternative insights is a legitimate endeavor. Even at the level of focus groups, it is not uncommon for a marketing department to look for an answer to one issue, and be surprised that another completely unexpected result could be teased out. In this instance, of course, it appears that Wansink just wanted to milk the Publish-Or-Perish teat, and sent data out in search of a result, rather than positing a relevant alternative hypothesis and seeing if the data could speak to it.

    The secondary culprit is the grantwriting juggernaut. I once did a project for Rand to evaluate about 250 healthcare projects they had done, each a one-off, with an eye toward prioritizing a Top 10 for creating a market in, or commoditizing with 23-year-old B.A.s doing it repetitively rather than 38-year-old Ph.D.s doing it once. There was one that asked: “Has Monica, Rachel, and Phoebe all being pregnant in the latest season of Friends changed the patterns of birth control use among American teenage girls?” Well, first off, the question is phrased pretty back-assward, if you’re concerned about teenage pregnancy. But my immediate thought, which I included in my report, was: “Give that grantwriter a promotion! If she can get funding for THAT, think of what she could do with a genuine research project!”

  23. Holms says

    I don’t entirely agree with you. As a grad student, I look at old spectroscopy data all the time, often testing hypotheses that the studies were not originally intended for, or coming up with new hypotheses.

    Sure, old data sets can provide clues for new avenues of research, even if the original hypothesis for which that particular dataset was obtained failed. But do you simply declare your new hypothesis valid and write up a new paper, or do you design a new experiment to investigate it first?

  24. a_ray_in_dilbert_space says

    As an exercise, let’s imagine that the professor had generated the data he handed to the researcher using a random number generator following some probabilistic model.

    If the student tries enough hypotheses, she will find one that agrees with the data sufficiently well to get a paper out of it. That paper will bear no relation to reality.

    What matters is the protocol that generated the data, the statistical properties of the errors in the data, and the number of hypotheses we try on the data. Trying hypothesis after hypothesis on a single data set is a recipe for irreproducibility. It is fine to use the data to develop hypotheses, but the hypotheses must be validated using independent data.
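
    A minimal sketch of that thought experiment in Python (assuming numpy and scipy; the column names and sample size are made up for illustration, loosely echoing the buffet data set mentioned upthread): generate a completely random "data set", then search every pairwise correlation for something "significant".

    import itertools
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    n_diners = 122  # roughly the sample size mentioned upthread
    columns = ["price_paid", "slices_eaten", "taste_rating", "regret",
               "group_size", "time_at_table", "age", "distance_travelled"]
    data = {name: rng.normal(size=n_diners) for name in columns}  # pure noise by construction

    pairs = list(itertools.combinations(columns, 2))
    findings = []
    for a, b in pairs:
        r, p = stats.pearsonr(data[a], data[b])
        if p < 0.05:
            findings.append((a, b, round(r, 2), round(p, 3)))

    print(f"Tested {len(pairs)} variable pairs, 'discovered' {len(findings)} significant correlations:")
    for finding in findings:
        print("  ", finding)
    # With 28 tests at alpha = 0.05, a spurious "finding" or two is expected even though
    # every column is an independent random draw; none of them would replicate.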

  25. shockna says

    One example is that, after finding an exoplanet, they re-analyzed old HST images with current methods and found the exoplanet in the noisy glow around the parent star.

    You get the same thing with supernova studies, too. For close objects, archival data from before the explosion (often HST) sometimes has a source detected at the position of the (later) supernova, which can give you a good idea of what kind of star it was that exploded.

    Since the number of archival datasets in astronomy (especially in the time domain) is about to explode to levels never before seen, I sometimes wonder if we’ll end up with similar issues.

  26. says

    @Holms,

    Sure, old data sets can provide clues for new avenues of research, even if the original hypothesis for which that particular dataset was obtained failed. But do you simply declare your new hypothesis valid and write up a new paper, or do you design a new experiment to investigate it first?

    The former. Yes, the standards of evidence really are different in physical sciences as compared to social sciences. I meant it when I said it the first time!

    p-values simply aren’t the problem. The problem is when something is going on that you don’t expect. For this kind of problem, taking new data doesn’t really help! Publishing sometimes helps, because you have more people to poke at your assumptions.

    When OPERA found FTL neutrinos, that definitely wasn’t a p-hacking problem, and taking more and more data would never have helped. What did solve the problem? Publishing.

  27. nathanieltagg says

    OPERA? Ooo. Bad example. I’m the author of the MINOS neutrino time-of-flight paper, and that OPERA publication was a benchmark for when not to publish.

    The problem was solved because they went back and looked at their cables… outside expertise only generated noise.

  28. wzrd1 says

    Actually, OPERA is an excellent example, although publishing was an error.
    Going back and seeing what caused the erroneous readings was the correct thing to do and the effects of publishing should be a cautionary tale.

    Many years ago, I was prototyping a quick step-up transformer circuit, a simple oscillator with loop regulation. I ran into an odd problem: instead of getting a 10:1 output, I was measuring 100:1! After much examination, I brought it to an electrical engineer to ascertain what I was missing.
    It turned out that at the frequencies I was using, capacitance that could typically be ignored on the breadboard became a real, measurable influence on the circuit, and the output stage became a voltage multiplier.
    Proof came when I mechanically reconfigured the layout and the effect disappeared; reconfiguring it back reproduced the effect.
    At least that didn’t earn me an entry in The Journal of Irreproducible Results. ;)
    I cannot say the same for Wansink.

  29. says

    @nathanieltagg,
    I didn’t pay attention to the OPERA stuff, and didn’t know they eventually figured out the problem for themselves. I am not sure that means it was wrong for them to publish though, since obviously nobody could have known beforehand whether outside expertise or internal troubleshooting would bear more fruit. Eh, maybe.

    Either way, it’s clear that taking more data was not the solution, since it was already a 6-sigma result.

  30. says

    I’ve actually worked with this guy in a peripheral capacity when I was at Cornell. He struck me as someone much more interested in his media relations and public image than his research. I’d speculate that he picked dieting as a research topic mostly out of his desire to become a household name. Like, he wants to be the next Robert Atkins or something.