Someone needs to start a Journal of Pizza Quality Research, stat

We need somewhere to bury sloppy research on fast food, after all. Brian Wansink gets interviewed on Retraction Watch (y’all remember Wansink, the fellow who ground his data exceedingly fine to extract four papers from a null result), and he does himself no favors.

Well, we weren’t testing a registered hypothesis, so there’d be no way for us to try to massage the data to meet it. From what I understand, that’s one definition of p-hacking. Originally, we were testing a hypothesis – we thought the more expensive the pizza, the more you’d eat. And that was a null result.

But we set up this two-month study so that we could look at a whole bunch of totally unanswered empirical questions that we thought would be interesting for people who like to eat in restaurants. For example, if you’re eating a meal, what part influences how much you like the meal? The first part, the middle part, or the last part? We had no prior hypothesis to think anything would predominate. We didn’t know anybody who had looked at this in a restaurant, so it was a totally empirical question. We asked people to rate the first, middle, and last piece of pizza – for those who ate 3 or more pieces – and asked them to rate the quality of the entire meal. We plotted out the data to find out which piece was most linked to the rating of the overall meal, and saw ‘Oh, it looks like this happens.’ It was total empiricism. This is why we state the purpose of these papers is ‘to explore the answer to x.’ It’s not like testing Prospect Theory or a cognitive dissonance hypothesis. There’s no theoretical precedent, like the Journal of Pizza Quality Research. Not yet.

That last bit sounds like a threat.

Here’s the thing: we all do what he describes. An experiment failed (yes, it’s happened to me a lot). OK, let’s look at the data we’ve got very carefully and see if there’s anything potentially interesting in it, any ideas that might be extractable. The results are a set of observations, after all, and we should use them to try and figure out what’s going on, and in a perfect world, there’d be a public place to store negative results so they aren’t just buried in a file drawer somewhere. There’s nothing wrong with analyzing your data out the wazoo.

The problem is that he then published it all under the guise of papers testing different hypotheses. Most of us don’t do that at all. We see a hint of something interesting buried in the data for a null result, and we say, “Hmm, let’s do an experiment to test this hypothesis”, or “Maybe I should include this suggestive bit of information in a grant proposal to test this hypothesis.” Just churning out low-quality papers to plump up the CV is why I said this is a systemic problem in science — we reward volume rather than quality. It doesn’t make scientists particularly happy to be drowning in drivel, but Elsevier is probably drooling at the idea of a Journal of Pizza Quality Research — another crap specialized journal that earns them an unwarranted amount of money and provides another dumping ground for said drivel being spewed out.

Wansink seems to be dimly aware of this situation.

These sorts of studies are either first steps, or sometimes they’re real-world demonstrations of existing lab findings. They aren’t intended to be the first and last word about a social science issue. Social science isn’t definitive like chemistry. Like Jim Morrison said, “People are strange.” In a good way.

Yes. First steps. Maybe you shouldn’t publish first steps. Maybe you should hold off until you’re a little more certain you’re on solid ground.

No one expects social science to be just like chemistry, but this idea that you don’t need robust observations with solid methodology might be one reason there is a replicability crisis. Rather than repeating and engaging in some healthy self-criticism of your results, you’re haring off to publish the first thing that breaches an arbitrary p-value criterion.

There really are significant problems with the data he did publish, too. Take a look at this criticism of one of his papers. The numbers don’t add up. The stats don’t make sense. His tables don’t even seem to be appropriately labeled. You could not replicate the experiment from the report he published. This stuff is incredibly sloppy, and he doesn’t address these failings in the interview, except inadequately and in ways that don’t solve the problems with the work.

Again, I’m trying to be generous in interpreting the purpose of this research — often, interdisciplinary criticism can completely miss the point of the work (see also how physicists sometimes fail to comprehend biology, and inappropriately apply expectations from one field to another) — but I’m also seeing a lack of explanation of the context and relevance of the work. I mean, when he says, “For example, if you’re eating a meal, what part influences how much you like the meal? The first part, the middle part, or the last part?”, I’m just wondering why. Why would it matter, what are all the variables here (not just the food, but in the consumer), and what do you learn from the fact that Subject X liked dessert, but not the appetizer?

It sounds like something a restaurateur or a food chain might want to know, or that might appeal to an audience at a daytime talk show, but otherwise, I’m not seeing the goal…or how their methods can possibly sort out the multitude of variables that have to be present in this research.


  1. whywhywhy says

    Why didn’t he simply publish the null result? It seems like a contribution to understanding satisfaction in dining.

  2. slithey tove (twas brillig (stevem)) says

    curious to explore my understanding of analysis.
    Would it be valid to see an interesting distribution of data, ie clumps: propose a question the data seems to be implying, then adopt that as the hypothesis to be tested by reanalyzing the original data?
    My bet is that that’s the wrong approach.
    Proper approach is to keep the hypothesis, discard the original data set and start a new collection of data on which to p-analyze. [excuse any mangled jargon]
    TLDR: replication is the basis of the scientific method. Verification of previous results only fails to disprove the hypothesis; it does not prove the hypothesis.

    isn’t it sometimes valid to propose a 2nd hypothesis (H2), analyze it to get p(2), then H3 –> p(3), etc., Hn –> p(n),
    then conclude which H(n) is most valid by its p(n) value?
    I’m getting twisted up

  3. a_ray_in_dilbert_space says

    slithey tove,
    A couple of considerations.

    1) If the data were not gathered specifically to test the hypothesis in question, then there could be systematic errors due to the way the data were gathered that were not important for the original question, but were for the subsequent question(s). This may mean that it is invalid to generalize from the sample to the general population.

    2) If there are no systematic issues with the data, and you have a metric shit-ton of data, you might randomly divide the data into two subsamples of a butt load each. Analyze the first butt load to see if there are interesting correlations/questions. Then see if these hold up in the second butt load.
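A minimal sketch of that split-half idea, using made-up noise data in place of any real survey (all numbers and variable names here are arbitrary assumptions, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 10,000 diners, two ratings each
# (say, last-slice rating and overall-meal rating). Pure noise here.
n = 10_000
data = rng.normal(size=(n, 2))

# Randomly split into an exploratory half and a confirmatory half.
idx = rng.permutation(n)
explore = data[idx[: n // 2]]
confirm = data[idx[n // 2:]]

# Hunt for a correlation in the exploratory half...
r_explore = np.corrcoef(explore[:, 0], explore[:, 1])[0, 1]

# ...then test only that pre-chosen correlation in the confirmatory half.
r_confirm = np.corrcoef(confirm[:, 0], confirm[:, 1])[0, 1]

print(r_explore, r_confirm)  # on pure noise, both stay near zero
```

The point is that whatever pattern you spot in the exploratory half gets one honest, pre-specified test in the confirmatory half, instead of being tested on the same data that suggested it.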

    3) You can Monte Carlo the data to see how likely your “significant result” really is, given how many independent analyses you have done. That is, look at data that are just “noise,” and see how often you get a signal. This is the main reason why you need >5 standard deviations of significance for a valid result in particle physics.
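A minimal Monte Carlo sketch of that point, which also answers the earlier question about testing H2 … Hn and keeping the best p(n): the sample sizes and seed below are arbitrary assumptions, purely for illustration.

```python
import math
import random

random.seed(1)

def two_sided_p(sample):
    """Two-sided p-value for 'mean == 0' with known sd == 1 (z-test)."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))

n_studies = 2000   # simulated "studies"
n_tests = 20       # hypotheses tested per study -- all on pure noise

false_hits = 0
for _ in range(n_studies):
    pvals = [two_sided_p([random.gauss(0, 1) for _ in range(30)])
             for _ in range(n_tests)]
    # Keep only the "best" hypothesis, as in the H2 ... Hn question above.
    if min(pvals) < 0.05:
        false_hits += 1

# Analytically this should land near 1 - 0.95**20, i.e. roughly 0.64,
# even though every single dataset was noise.
print(false_hits / n_studies)
```

Picking whichever hypothesis has the smallest p(n) inflates the false-positive rate from 5% to well over half, which is exactly what a noise-only Monte Carlo reveals.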

  4. jrkrideau says

    People might want to follow up on some of the points that Jordan Anaya makes in the Medium blog, not only on these four papers but on 6 others he’s examined.

    It’s a horror show. I’d say until further notice nothing coming out of that lab can be trusted.

    Not only does it look like blatant p-hacking and garden-of-forking-paths problems in these four papers; in other papers, simple numbers don’t add up, degrees of freedom seem to fluctuate, and even N-sizes seem variable.

    Any given problem that Anaya reports could be a typo, a transcription mistake or even possibly the result of a data entry error in some cases but there are just too many of them across too many papers.

  5. Holms says

    I mean, when he says, “For example, if you’re eating a meal, what part influences how much like the meal? The first part, the middle part, or the last part?”, I’m just wondering why. Why would it matter, what are all the variables here (not just the food, but in the consumer), and what do you learn from the fact that Subject X liked dessert, but not the appetizer?

    Your appraisal of that part of his ‘study’ is actually more generous than he deserves – he wasn’t asking people to rate the different courses of a meal comprised of multiple discrete courses, but rather the individual slices of the same pizza. Even a restaurateur would struggle to find a point to that.

  6. says

    that we thought would be interesting for people who like to eat in restaurants

    Wow, so many assumptions embedded right there:
    – maybe they are measuring a mix of people who hate eating in restaurants but are not able to cook
    – maybe they are measuring people who like pizza
    – maybe they are measuring people who can afford to eat in restaurants
    – maybe they are measuring something to do with the distance people live from certain restaurants
    – they have probably implicitly omitted a population that eat kosher or halal or vegan
    – etc.

    In other words the assumption that they measured anything at all about “people who like to eat in restaurants” is completely unfounded. How do they generalize from “the people we measured” to “people” (in general) and “people eating in this particular restaurant” to “people eating in restaurants” (in general) and as far as “liking” it? Pff.

    We asked people to rate the first, middle, and last piece of pizza

    So, they actually measured something about “people in a particular pizzeria at a particular time who could be arsed to take our survey and who we believe answered it honestly.”

    I didn’t think it was possible for the social “sciences” to get more bogus than when I was an undergrad in the early 80s (they still taught about Freud in those days!) but I see I was wrong.

  7. John Small Berries says

    Not being a scientist, I don’t understand why it’s a “failed experiment”. They had a hypothesis, they tested it, and it turned out that the hypothesis was not supported by evidence. So even though it didn’t answer their question the way they were hoping it would, it did answer the question.

    Or are experiments only considered successful if they produce the desired result?

  8. blf says

    I grew up near and have lived in Chicago. I know pizza. I would be willing to be a researcher.

    I am currently living within — in a Sarah Palin sense — sight of Italy. I would be quite happy to conduct a detailed study of pizza vs other Italian foods, and how they taste with numerous Italian vins, if only I could get the excessive funding, including the necessary & unnecessary flunkies, and private superyacht.

  9. wzrd1 says

    We can offer a household of four and associated offerings.
    Confounder: my wife suffers from extreme, life-threatening capsaicin hypersensitivity.
    I’m also absent a sense of smell, but am a “supertaster”.
    I’m also a former professional chef.

    We’ll judge upon quality of ingredients, quality of spice mixtures, idiocy and stupidity (to judge upon many pizzas we’ve had around the world and in this failing nation (thanks, Trump)).
    Hint, for full disclosure: La puttanesca is extremely spicy for her, causing blisters.

    Me? I could consume battery acid and judge its taste, then be sick for a couple of days.
    But, I could also judge a finest dining experience meal, despite an absence of a sense of smell.
    Hint: *Everyone* has enjoyed my cooking, regardless of region on this planet (save Antarctica).