Here’s how you evaluate the scientific rigor of a field

Warning: it’s boring, tedious, hard work. There’s nothing flashy about it.

First step: define a clear set of tested standards. For clinical trials, there’s something called the Consolidated Standards of Reporting Trials (CONSORT), which was established by an international team of statisticians, clinicians, etc., and defines how you should carry out and publish the results of trials. For example, you are supposed to publish pre-specified expected outcomes: “I am testing whether an infusion of mashed spiders will cure all cancers”. When your results are in, you should clearly state how they address your hypothesis: “Spider mash failed to have any effect at all on the progression of cancer.” You are also expected to fully report all of your results, including secondary outcomes: “88% of subjects abandoned the trial as soon as they found out what it involved, and 12% vomited up the spider milkshake.” And you don’t get to reframe your hypothesis to put a positive spin on your results: “We have discovered that mashed-up spiders are an excellent purgative.”

It’s all very sensible stuff. If everyone did this, it would reduce the frequency of p-hacking and improve the statistical validity of trial results. The catch is that it would also be harder to massage your data to extract a publishable result, and that matters because journals tend not to favor papers that say, “This protocol doesn’t work”.

So Ben Goldacre and others dug into this to see how well journals which had publicly accepted the CONSORT standards were enforcing those standards. Read the methods and you’ll see this was a thankless, dreary task in which a team met to go over published papers with a fine-toothed comb, comparing pre-specified expectations with published results, re-analyzing data, going over a checklist for every paper, and composing a summary of violations of the standard. They then sent off correction letters to the journals that published papers that didn’t meet the CONSORT standard, and measured their response.
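To give a feel for what "comparing pre-specified expectations with published results" means, here is a rough sketch in Python. It is not the team's actual tooling: the function name `audit_outcomes`, the `OutcomeAudit` class, and the outcome labels are all invented for illustration, and the rule that openly declared changes are not counted as violations is my reading of the paper's description of CONSORT item 6b.

```python
from dataclasses import dataclass


@dataclass
class OutcomeAudit:
    correctly_reported: set    # pre-specified and actually reported
    unreported: set            # pre-specified but silently dropped
    undeclared_additions: set  # reported but never pre-specified or declared


def audit_outcomes(prespecified, reported, declared_changes=()):
    """Compare a trial's pre-specified outcomes with what the paper reports.

    `declared_changes` holds outcomes the authors openly said they added
    or dropped after the trial began; those are not counted as violations.
    (Hypothetical helper, for illustration only.)
    """
    prespecified = set(prespecified)
    reported = set(reported)
    declared = set(declared_changes)
    return OutcomeAudit(
        correctly_reported=prespecified & reported,
        unreported=prespecified - reported - declared,
        undeclared_additions=reported - prespecified - declared,
    )


# The spider-milkshake trial, reframed after the fact.
audit = audit_outcomes(
    prespecified={"cancer progression"},
    reported={"purgative effect"},
)
print(audit.unreported)            # {'cancer progression'}
print(audit.undeclared_additions)  # {'purgative effect'}
```

The real work, of course, is in reading every protocol and paper closely enough to fill in those sets by hand.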

I have to mention this here because this is the kind of hard, dirty work that needs to be done to maintain rigor in an important field (these are often tests of medicines you may rely on to save your life), and it isn’t the kind of splashy stuff that will get you noticed in Quillette or Slate. It should be noticed, because the results were disappointing.

Sixty-seven trials were assessed in total. Outcome reporting was poor overall and there was wide variation between journals on pre-specified primary outcomes (mean 76% correctly reported, journal range 25–96%), secondary outcomes (mean 55%, range 31–72%), and number of undeclared additional outcomes per trial (mean 5.4, range 2.9–8.3). Fifty-eight trials had discrepancies requiring a correction letter (87%, journal range 67–100%). Twenty-three letters were published (40%) with extensive variation between journals (range 0–100%). Where letters were published, there were delays (median 99 days, range 0–257 days). Twenty-nine studies had a pre-trial protocol publicly available (43%, range 0–86%). Qualitative analysis demonstrated extensive misunderstandings among journal editors about correct outcome reporting and CONSORT. Some journals did not engage positively when provided correspondence that identified misreporting; we identified possible breaches of ethics and publishing guidelines.

All five journals were listed as endorsing CONSORT, but all exhibited extensive breaches of this guidance, and most rejected correction letters documenting shortcomings. Readers are likely to be misled by this discrepancy. We discuss the advantages of prospective methodology research sharing all data openly and pro-actively in real time as feedback on critiqued studies. This is the first empirical study of major academic journals’ willingness to publish a cohort of comparable and objective correction letters on misreported high-impact studies. Suggested improvements include changes to correspondence processes at journals, alternatives for indexed post-publication peer review, changes to CONSORT’s mechanisms for enforcement, and novel strategies for research on methods and reporting.
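The percentages in that abstract use different denominators, which is easy to miss on a first read. A quick check in Python, using only the counts quoted above (the helper `pct` and the variable names are mine):

```python
def pct(part, whole):
    """Percentage of `whole`, rounded to the nearest whole number."""
    return round(100 * part / whole)


trials_assessed = 67
needed_letter = 58        # trials with discrepancies requiring a letter
letters_published = 23
protocol_available = 29   # trials with a public pre-trial protocol

print(pct(needed_letter, trials_assessed))       # 87
print(pct(letters_published, needed_letter))     # 40
print(pct(protocol_available, trials_assessed))  # 43
```

Note that the 40% is a fraction of the 58 letters sent, not of the 67 trials assessed.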

People. You’ve got a clear set of standards for proper statistical analysis. You’ve got a million dollars from NIH for a trial. You should at least sit down and study the appropriate methodology for analyzing your results and make sure you follow it. This sounds like an important ethical obligation to me.


  1. Curious Digressions says

    And you don’t get to reframe your hypothesis to put a positive spin on your results: “We have discovered that mashed-up spiders are an excellent purgative.”

    If a researcher wanted to publish this article, couldn’t they just set up a new set of tests with “Spider Milkshakes are effective purgatives” as the pre-specified expected outcomes? I’m not a scientist and did poorly overcoming my preconceptions during labs in basic chemistry, so I’m probably missing something here. Is the difference between p-hacking and the above just an extra round of testing? Are they just trying to be cheap and avoid the bother and expense of more clinical trials?

  2. Rich Woods says

    Ben Goldacre never ceases to impress me. Well, except maybe for that time at a science festival when he tried to tell a few jokes while clearly still hungover, although in a way that was quite funny in itself.

  3. charlesanthony says

    I would think that if the grant bodies (NIH, etc) mandated CONSORT (or some other mutually agreed standard) compliance as part of the grant process, it would go a long way towards fixing the problem.

  4. DonDueed says

    The study seemed to be more focused on the role of the journals in enforcing the standards, rather than on the researchers’ compliance with the standards.

    They sent correction letters to the journals. Did they do that for the researchers? Whose responsibility is it to ensure compliance?

  5. chrislawson says


    Many of these studies would have been undertaken at institutions where conforming to CONSORT is mandatory — but mandating is useless if there is no process to ensure compliance.

  6. chrislawson says


    I am willing to relax the standard on not reframing the hypothesis if there is a truly unexpected and important new finding. But as you say, this should then be followed up by experiments designed specifically to test the new hypothesis. And it should be stated upfront in the paper that this was not part of the original hypothesis testing.

  7. chrislawson says


    The study addresses both the authors’ compliance with CONSORT and the journals’. As for correction letters, the standard approach is to send them to the publishing journal; after all, that’s where the paper was published, so it makes sense for any correction to be published there also. Usually any correspondence would be forwarded from the journal to the researchers so that they can respond.

  8. chrislawson says

    PZ, following up on my comment above, it’s worth pointing out that CONSORT does allow publication of findings not in the original design. To quote from the paper in question:

    CONSORT guidance states that trial publications should report “completely defined pre-specified primary and secondary outcome measures, including how and when they were assessed” (6a) and “any changes to trial outcomes after the trial commenced, with reasons” (6b) with further elaboration on these issues in the accompanying CONSORT publication [10]. Therefore, consistent with CONSORT, where outcome switching occurred but was openly declared to have occurred in the trial report, these outcomes were classified as correctly reported, as there are often valid reasons for reported outcomes to differ from those pre-specified.

    So the problem wasn’t that the researchers changed their trial outcomes, it’s that they did not report and/or justify the changes.

    Mind you, the really worrying thing from this paper is the heavy layer of bogosity in correspondence from journal editors in order to wave away their inadequate editorial performance. The absolute worst is the repeated blaming of trial registries for not keeping trial information up to date — it is the researchers’ job to give the registry correct information at the outset, not the registry’s job to hunt down every research team and interrogate them regularly throughout the trial process.

  9. jrkrideau says

    @ 9 chrislawson
    I think one of the problems with reporting unexpected or secondary results without acknowledging one was not looking for them (i.e. was gobsmacked by them) is that a reader may assume that these were predicted, and that this is confirmatory research when it is exploratory, blind luck, or at best can be thought of as an accidental pilot study.

    Essentially the authors are lying by omission.

    I see nothing wrong with reporting the results as “Hey look at what we found! We are designing a new study to investigate this new phenomenon.”

    There is also the “move the goalposts” sort of thing, which is really dubious, especially when the researchers do not make their data publicly available for reanalysis. The PACE trial, “Comparison of adaptive pacing therapy, cognitive behaviour therapy, graded exercise therapy, and specialist medical care for chronic fatigue syndrome (PACE): a randomised trial”, has generated huge controversy over some of the things that CONSORT was supposed to help address.

    It is not at all unusual for a journal to simply refuse to take action when the editors are alerted to problems. Brian Deer, a British investigative journalist, spent something like 6 or 7 years investigating and publishing about the failings of Andrew Wakefield, the British ex-doctor who almost singlehandedly started the anti-vaccine hysteria, before the Lancet retracted the paper that fueled the hysteria. Of course, the retraction and Wakefield being struck off did nothing to slow down the anti-vax nuts. If you have noticed a lot of measles around, thank Andy Wakefield.

  10. jrkrideau says

    Ben Goldacre and team must have done a massive amount of work.

    Here is a link to a paper by three “fact & data checkers” (there must be a better term for them, perhaps “The four horsemen of the new Apocalypse”, since they often have a fourth member in their hit team) that illustrates the amount of work that can be needed, and they were only looking at the output of one lab, and in a more limited way.
    Statistical infarction

  11. says

    This is the inevitable result of the criteria for success in research.
    Those standards are designed to curb the worst excesses, but without actual penalties for failure to adhere to them they exert no effective pressure. So long as there are beneficial effects to manipulating or inventing results, with no associated costs, the problem will persist.
    The reason the situation exists is that there is no authority with a vested interest in sound scientific research. Bizarrely.

  12. chrislawson says

    Ian King@13–

    There are definitely authorities with a vested interest in sound scientific research. You can google for a list of them. The problem is that there are often external pressures that compete with those interests — and in some cases direct political meddling (e.g. the US govt. making it effectively impossible for the NIH to study gun-related violence).

  13. chrislawson says


    There are a number of reasons for this particular CONSORT rule, but yeah, that’s the big one.

    As for the Lancet’s refusal to retract the Wakefield paper for many years, it is my opinion that this falls entirely at the feet of the personality flaws of its chief editor, the same flaws that led to the paper being published in the first place. (Deer’s excellent work was necessary to prove the fraud and ethical violations committed by Wakefield, but even before we knew anything about that the paper should have been rejected on scientific grounds alone…it was essentially a case-control study THAT DECIDED IT DIDN’T NEED A CONTROL GROUP!)

  14. Nerd of Redhead, Dances OM Trolls says

    franko#16, that was imposed on NIH by Congress. Since its inception, NCCIH has repeatedly shown that with proper studies almost all alternative and complementary “medicine” is nothing but the placebo effect.

  15. chrislawson says


    As Nerd of Redhead says, the NIH didn’t choose to support CAM, it was forced to by a group of alternative medicine-loving politicians, particularly Tom Harkin. Given that its hand was forced, the NIH tried to do the best it could, but, well, Wikipedia has a pretty good summary:

    Joseph J. Jacobs was appointed the first director of the OAM in 1992. Initially, Jacobs’ insistence on rigorous scientific methodology caused friction with the office’s patrons, such as U.S. Senator Tom Harkin. Sen. Harkin, who had become convinced his allergies were cured by taking bee pollen pills, criticized the “unbendable rules of randomized clinical trials,” saying “It is not necessary for the scientific community to understand the process before the American public can benefit from these therapies.” Harkin’s office reportedly pressured the OAM to fund studies of specific “pet theories,” including bee pollen and antineoplastons. In the face of increasing resistance to the use of scientific methodology in the study of alternative medicine, one of the OAM board members, Barrie Cassileth, publicly criticized the office, saying: “The degree to which nonsense has trickled down to every aspect of this office is astonishing … It’s the only place where opinions are counted as equal to data.” Finally, in 1994, Harkin appeared on television with cancer patients who blamed Jacobs for blocking their access to antineoplastons, leading Jacobs to resign from the OAM in frustration. In an interview with Science, Jacobs “blasted politicians – especially Senator Tom Harkin… for pressuring his office, promoting certain therapies, and, he says, attempting an end run around objective science.”

    I wouldn’t blame the NIH for this. It’s like blaming NASA for the moon-landing conspiracy hoaxes.

  16. jrkrideau says

    @ 15 chrislawson
    I had forgotten the lack of control group in PACE. Anyway, why would you need a control group? Just more hassle.

    I don’t know enough about the Lancet and its chief editors, but I would not be surprised if this was the reason for the very late retraction.

    Still, it was a retraction. As far as I know, Daryl Bem’s crazy parapsychology paper has never been retracted, and there have been reports that the Journal of Personality and Social Psychology has refused to publish a paper failing to duplicate Bem’s results. I had thought that refusing to publish such papers was more the province of the comic books, that is, Nature and Science.

  17. chrislawson says


    Yep, there are plenty of awful papers out there that should have been retracted but weren’t.