Don’t dump and run

In the December issue of EMBO Reports, Matheus Sanitá Lima and David Roy Smith argue that biologists using next-generation sequencing data should include detailed methods with their submissions to the Sequence Read Archive (the paper is paywalled at the publisher site but available here):

For those who are unfamiliar with it, the SRA is an international public online archive for next-generation sequencing (NGS) data, which was established about a decade ago under the guidance of the International Nucleotide Sequence Database Collaboration (INSDC)…Once there, you will find yourself at a sequencing-read superstore.

Sanitá Lima & Smith

But the value of the sequencing reads you can download from the SRA is often limited by an insufficient description of the methods:

Recently, we were mining data from the SRA to study transcription in mitochondria and chloroplasts…Apart from the price of a computer and a commercial bioinformatics software suite—and significant time investment, of course—the research project cost us nothing. We did, however, encounter some setbacks when trying to determine the protocols used to generate the various RNA-seq data sets employed in our analysis. In short, we were confronted with an SRA annotation issue.

We had used hundreds of RNA-seq experiments generated from different laboratory groups, often using very different protocols. Some of these experiments contained detailed and meticulous information on the growth conditions, RNA isolation and purification techniques, library preparation, and sequencing methods. Other experiments, unfortunately, had little or no accompanying details about how they were generated, leaving us guessing about the underlying experimental procedures.

Usually, the detailed methods are in the publication that results from the sequencing data, but as Sanitá Lima and Smith point out, not everything in the SRA is published.

Moreover, it would have taken a lot of time and energy to look up the individual papers for hundreds of different experiments, many of which were behind a paywall, which goes against the purpose of an open-access data bank like the SRA. In our opinion, it is much more efficient, fair, and useful to have the methods directly linked to the SRA entry. In many ways, the experiments being deposited in the SRA can be as important and impactful as the primary research papers presenting the data.

Being down with OPD (Other People’s Data), I couldn’t agree more. Aside from the other reasons, this is a fairness issue: most of the data in the SRA and similar archives are produced using funds from the National Science Foundation, National Institutes of Health, and similar agencies; in other words, taxpayer money. Since taxpayers have paid for the data, I think it’s fair for all of them, and not just those who have institutional subscriptions, to have access to the data in a useful form.

Sanitá Lima and Smith confess that they haven’t always been paragons of virtue in this respect:

Before we start sounding too self-righteous, we should come clean and admit that the senior author of this article has submitted his fair share of data into the SRA without providing a detailed protocol for those entries. It was not until he started mining large amounts of RNA-seq data from the SRA that he finally saw the proverbial Illumina light at the end of the annotation tunnel and asked forgiveness for all of his sins. Thankfully, he is now a reformed bioinformatician and is looking forward to developing a clean SRA record in the future.

That bit prompted me to check some of my own submissions. Here’s the methods section of the first one I came across, from my one E. coli paper:

Paired-end sequencing was performed on an Illumina* HiSeq 2000 at the University of British Columbia’s Biodiversity Research Centre using standard procedures. The the time point samples were prepared with the NEXTflex DNA Sequencing Kit and DNA Barcodes by Bioo Scientific (Austin, TX).

Not great. How were the bacteria grown? What medium, what temperature, were they shaken and if so at what speed? Someone trying to replicate that experiment would need all that information and more. Thankfully, that paper is open access, but really the methods should have been included in the SRA submission. Like Dr. Smith, I’ll try to do better.


Stable links:

Sanitá Lima, M. and Smith, D.R. 2017. Don’t just dump your data and run. EMBO Reports, e201745118. doi: 10.15252/embr.201745118


  1. jack16 says

    I think the SRA (Sequence Read Archive) should have user instructions that would include these important requirements.

