The problem of false positive results


My post on physics researchers searching for the Higgs particle, who needed to get the chance of a statistical fluke down below the five-sigma level (about 0.000028%), generated some discussion of the problems (mainly an increased likelihood of false positive results) that arise in other areas, such as the social sciences, where the threshold for acceptability is often as high as 5%.
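
To put those two thresholds on the same footing, here is a minimal sketch (in Python with scipy; the code is mine, not from the original post) that treats both as one-sided tail probabilities of a standard normal test statistic:

```python
# Sketch: comparing the five-sigma threshold used in particle physics
# with the p < 0.05 convention common in the social sciences.
# Assumes a one-sided test on a standard normal test statistic.
from scipy.stats import norm

five_sigma_p = norm.sf(5)   # P(Z > 5), roughly 2.9e-7, i.e. about 0.00003%
social_science_p = 0.05     # the conventional 5% significance threshold

print(f"five-sigma tail probability: {five_sigma_p:.2e}")
print(f"the 5% threshold is about {social_science_p / five_sigma_p:,.0f} times looser")
```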

Ed Yong writes that this is becoming a serious problem in the field of psychology, where there seem to be many false positive results, coupled with a reluctance by journals to publish articles that contradict positive results they published earlier. As a result, a worrying number of published results are not reproducible, yet the original studies remain officially unrefuted. Yong published an article on this recently in Nature (vol. 485, pp. 298–300, 17 May 2012).

Yong says that the problem is that journals love surprising and counter-intuitive results, and these are more likely to occur in psychology, where almost everyone has some intuition about what should happen in practically any situation. This is unlike the case in (say) physics, where people are unlikely to have a gut feeling about how Higgs bosons or neutrinos behave. As Yong says:

Psychology is not alone in facing these problems. In a now-famous paper, John Ioannidis, an epidemiologist currently at Stanford School of Medicine in California argued that “most published research findings are false”, according to statistical logic. In a survey of 4,600 studies from across the sciences, Daniele Fanelli, a social scientist at the University of Edinburgh, UK, found that the proportion of positive results rose by more than 22% between 1990 and 2007. Psychology and psychiatry, according to other work by Fanelli, are the worst offenders: they are five times more likely to report a positive result than are the space sciences, which are at the other end of the spectrum (see ‘Accentuate the positive’). The situation is not improving. In 1959, statistician Theodore Sterling found that 97% of the studies in four major psychology journals had reported statistically significant positive results. When he repeated the analysis in 1995, nothing had changed.

One reason for the excess in positive results for psychology is an emphasis on “slightly freak-show-ish” results, says Chris Chambers, an experimental psychologist at Cardiff University, UK. “High-impact journals often regard psychology as a sort of parlour-trick area,” he says. Results need to be exciting, eye-catching, even implausible. Simmons says that the blame lies partly in the review process. “When we review papers, we’re often making authors prove that their findings are novel or interesting,” he says. “We’re not often making them prove that their findings are true.”

Siri Carpenter reports (Science, vol. 335, no. 6076, pp. 1558-1561, 30 March 2012) on an initiative led by Brian Nosek involving about 50 academic psychologists who have started what they call the Open Science Collaboration (OSC), in which they seek to systematically replicate important results. Needless to say, this is causing some trepidation in the community: if many major, highly publicized results are refuted, researchers fear that the field may be tarnished.

But that concern must surely take a back seat to finding the truth. It is never a good thing when false ideas in science are allowed to propagate. I think the OSC effort will do psychology a world of good in the long run.

Another interesting development arises out of a case of apparent misconduct by a researcher in psychology. As Ed Yong reports, this misconduct was discovered not by whistleblowers but by a researcher who developed a statistical tool to check whether published data sets were too good to be true, and hence implausible. “His test looks for an overabundance of positive results given the nature of the experiments – a sign that researchers have deliberately omitted negative results that didn’t support their conclusion, or massaged their data in a way that produces positive results.” This tool reminds me of Benford’s law.
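
Yong’s article does not spell out the mathematics, but one common way such a check can work (a sketch of the general idea, not the specific test Yong describes, with made-up numbers) is to compare the number of significant results a set of studies reports against the number you would expect given each study’s statistical power:

```python
# Sketch of an "excess significance" style check (an illustration of the
# general idea only; the per-study powers below are hypothetical).
# If every study in a set of modestly powered studies comes up significant,
# the collection of results may be too good to be true.
from scipy.stats import binom

estimated_powers = [0.35, 0.40, 0.30, 0.45, 0.38]  # hypothetical per-study powers
observed_significant = 5                            # all five studies "worked"

expected = sum(estimated_powers)                    # expected count of significant results
avg_power = expected / len(estimated_powers)

# Approximate P(at least this many significant results), treating the studies
# as independent with the average power (a deliberate simplification).
p_too_good = binom.sf(observed_significant - 1, len(estimated_powers), avg_power)

print(f"expected significant results: {expected:.1f} of {len(estimated_powers)}")
print(f"chance of seeing {observed_significant} or more by luck: {p_too_good:.3f}")
```

A small value here does not prove misconduct, but it flags a set of results that deserves closer scrutiny, which is roughly what the tool Yong describes is designed to do.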

This new method is important for two reasons. One is that the practice of massaging data, whether consciously or unconsciously, may be more common than we would like, and we need some way of making researchers more vigilant about the danger. The other is that it enables people to check the plausibility of results without having to repeat the entire experiment.

It is tricky to navigate the world of knowledge. We humans tend to accept uncritically those results that seem to confirm what we already believe. Yet academic publishing tends to reward those things that overturn conventional wisdom. If allowed to remain uncontradicted, they can become the new conventional wisdom.

The only gut feeling we should trust is skepticism. It is good to treat any study, and especially one that gives surprising results, as tentative and wait for corroboration before giving too much credence to it.

Comments

  1. slc1 says

    I can remember as a graduate student that the rule in physics was 3 standard deviations and publish.

  2. says

    Isn’t one of psychology’s big problems that a lot of its claims are difficult to test, anyway? I don’t know what the current landscape looks like, but when I was an undergrad (ba psych, jhu 1985) most of the experiments that were objective were absurd* and most of the experiments that were “interesting” were based on self-report or worse.** The experiments that seem to be most memorable are the ones that often wouldn’t get by a human subjects board today, which neatly protects them from replication or closer study.

    It seemed to me that every introductory psych textbook I read started with “psychology is a science” and went downhill rapidly from there. I only took it because it was an easy degree and bullshitting was acceptable input on exams and papers. In that sense, I followed the field’s spiritual leader, Sigmund Freud, and just made stuff up, too.

    mjr.

    (* I refer here to things like my advisor, Dr Olton, who discovered that rats that nursed from mothers with lemon-scent on their nipples preferred to mate with females that were lemon-scented. All very interesting, but you can’t extrapolate rat behavior to humans and … what’s the point?)
    (** thinking here about ‘social science’ experiments like Milgram’s)

  3. Paul Jarc says

    The only gut feeling we should trust is skepticism. It is good to treat any study, and especially one that gives surprising results, as tentative and wait for corroboration before giving too much credence to it.

    This seems like it will lead your beliefs toward average accuracy compared to the people around you, not necessarily to better accuracy. I’ve been reading a lot about rationality and cognitive biases at the excellent blog Less Wrong. Your statement sounds a lot like “motivated skepticism”--holding unfavorable ideas to a higher standard of evidence. Here are a few posts on the topic.

  4. says

    Isn’t one of psychology’s big problems that a lot of its claims are difficult to test, anyway? I don’t know what the current landscape looks like, but when I was an undergrad (ba psych, jhu 1985) most of the experiments that were objective were absurd* and most of the experiments that were “interesting” were based on self-report or worse.**

    I think some fields of psychology rely more on self-report than others. I just did Psych 101 and my lecturers did emphasise the objective measures in the research they presented. For example, my positive psychology lecturer talked about research in which they measured the cortisol levels of people who’d been doing activities designed to increase happiness. When it comes to measuring internal mental states, though, it’s hard (impossible?) to be objective.

    In my essay for Psych 101 I had to write about this study. I mentioned that self-report was unreliable and the researchers should have tried to measure behaviour instead, but in one of the marker’s comments my tutor told me off, saying self-report was widely used. I also said the sample size was too small, which was apparently wrong as well, so I was probably talking complete crap.

    It seemed to me that every introductory psych textbook I read started with “psychology is a science” and went downhill rapidly from there. I only took it because it was an easy degree and bullshitting was acceptable input on exams and papers. In that sense, I followed the field’s spiritual leader, Sigmund Freud, and just made stuff up, too.

    The impression I got is that they’re really trying to drum “psychology is a science” into us and teach us rigorous critical thinking. Although I knew much of it already, I felt that the unit gave me a good, thorough review on statistics, p-values, the scientific method etc. Of course that won’t help with publication bias.

    We still have to learn about Freud in Personality, but we have to critique his methods and contrast them with Skinner’s Behaviorism and other theories. One of my lecturers said outright that Freud was good for his time but completely insane by today’s standards.

    (* I refer here to things like my advisor, Dr Olton, who discovered that rats that nursed from mothers with lemon-scent on their nipples preferred to mate with females that were lemon-scented. All very interesting, but you can’t extrapolate rat behavior to humans and … what’s the point?)
    (** thinking here about ‘social science’ experiments like Milgram’s)

    I think the point of the rat study was to show that when choosing mates, rats (and people) prefer mates who are similar to their relatives. Though as you say, because human behaviour is so complex, it’s difficult to make conclusions based on rat studies.

    I was actually impressed by Milgram’s most famous electric shock experiments, because they seemed to have an objective measure (the machine’s voltage) on how obedient people would be if instructed by an authority figure. Of course there were lots of criticisms, and the ethical consequences were awful. Participants were traumatised for life. On this ABC All in the Mind podcast, Behind the Shock Machine, you can hear a couple of Australian participants talking about how they thought they were evil and comparable to Nazis because they had administered the shocks all the way up to 450 volts.

  5. says

    I see the problem as one of incentives, and not really limited to the social sciences. So long as the funds for research are distributed according to how “interesting” it is, and the fame is allocated according to how entertaining, implausible, or genius-seeming the results are, we’re going to have a serious problem getting high quality research done on a continuous basis.

    We’re also dealing with the problem of diminishing returns. The low-hanging fruit has been picked, and it’s no longer possible for any significant percentage of scientists to be a Newton, Galileo, or Darwin. You can’t be a Rosalind Franklin, Marie Curie, or Grace Hopper anymore either — and those are all more recent.

    The short story of it is that if we want to continue to get impressive results, it’s going to take more people, more money, and more cooperation than it ever did before. I’m not seeing the political will in the United States to make the kind of research and development investments we honestly need right now. Surprisingly, it’s difficult to even get the funds to deploy the last few decades’ new technology, let alone to develop the future’s.
