ENCODE gets a public reaming

I rarely laugh out loud when reading science papers, but sometimes one comes along that triggers the response automatically. Although, in this case, it wasn’t so much a belly laugh as an evil chortle, and an occasional grim snicker. Dan Graur and his colleagues have written a rebuttal to the claims of the ENCODE research consortium — the group that claimed to have identified function in 80% of the genome, but actually discovered that a formula of 80% hype gets you the attention of the world press. It was a sad event: a huge amount of work on analyzing the genome by hundreds of labs got sidetracked by a few clueless statements made up front in the primary paper, making it look like they were led by ignoramuses who had no conception of the biology behind their project.

Now Graur and friends haven’t just poked a hole in the balloon, they’ve set it on fire (the humanity!), pissed on the ashes, and dumped them in a cesspit. At times it feels a bit…excessive, you know, but still, they make some very strong arguments. And look, you can read the whole article, On the immortality of television sets: “function” in the human genome according to the evolution-free gospel of ENCODE, for free: it’s open access. So I’ll just mention a few of the highlights.

I’d originally criticized the ENCODE paper because its argument was patently ridiculous. Their claim to have assigned ‘function’ to 80% (and Ewan Birney even expected it to converge on 100%) of the genome boiled down to this:

The vast majority (80.4%) of the human genome participates in at least one biochemical RNA- and/or chromatin-associated event in at least one cell type.

So if a transcription factor ever, in any cell, bound however briefly to a stretch of DNA, they declared it to be functional. That’s nonsense. The activity of the cell is biochemical: it’s stochastic. Individual proteins will adhere to any isolated stretch of DNA that might have a sequence that matches a binding pocket, but that doesn’t necessarily mean that the constellation of enhancers and promoters is present and that the whole weight of the transcriptional machinery will regularly operate there. This is a noisy system.

The Graur paper rips into the ENCODE interpretations on many other grounds, however. Here’s the abstract to give you a summary of the violations of logic and evidence that ENCODE made, and also to give you a taste of the snark level in the rest of the paper.

A recent slew of ENCODE Consortium publications, specifically the article signed by all Consortium members, put forward the idea that more than 80% of the human genome is functional. This claim flies in the face of current estimates according to which the fraction of the genome that is evolutionarily conserved through purifying selection is under 10%. Thus, according to the ENCODE Consortium, a biological function can be maintained indefinitely without selection, which implies that at least 80 − 10 = 70% of the genome is perfectly invulnerable to deleterious mutations, either because no mutation can ever occur in these “functional” regions, or because no mutation in these regions can ever be deleterious. This absurd conclusion was reached through various means, chiefly (1) by employing the seldom used “causal role” definition of biological function and then applying it inconsistently to different biochemical properties, (2) by committing a logical fallacy known as “affirming the consequent,” (3) by failing to appreciate the crucial difference between “junk DNA” and “garbage DNA,” (4) by using analytical methods that yield biased errors and inflate estimates of functionality, (5) by favoring statistical sensitivity over specificity, and (6) by emphasizing statistical significance rather than the magnitude of the effect. Here, we detail the many logical and methodological transgressions involved in assigning functionality to almost every nucleotide in the human genome. The ENCODE results were predicted by one of its authors to necessitate the rewriting of textbooks. We agree, many textbooks dealing with marketing, mass-media hype, and public relations may well have to be rewritten.

You may be wondering about the curious title of the paper and its reference to immortal televisions. That comes from (1): that function has to be defined in a context, and that the only reasonable context for a gene sequence is to identify its contribution to evolutionary fitness.

The causal role concept of function can lead to bizarre outcomes in the biological sciences. For example, while the selected effect function of the heart can be stated unambiguously to be the pumping of blood, the heart may be assigned many additional causal role functions, such as adding 300 grams to body weight, producing sounds, and preventing the pericardium from deflating onto itself. As a result, most biologists use the selected effect concept of function, following the Dobzhanskyan dictum according to which biological sense can only be derived from evolutionary context.

The ENCODE group could only declare function for a sequence by ignoring all other context than the local and immediate effect of a chemical interaction — it was the work of short-sighted chemists who grind the organism into slime, or worse yet, only see it as a set of bits in a highly reduced form in a computer database.

From an evolutionary viewpoint, a function can be assigned to a DNA sequence if and only if it is possible to destroy it. All functional entities in the universe can be rendered nonfunctional by the ravages of time, entropy, mutation, and what have you. Unless a genomic functionality is actively protected by selection, it will accumulate deleterious mutations and will cease to be functional. The absurd alternative, which unfortunately was adopted by ENCODE, is to assume that no deleterious mutations can ever occur in the regions they have deemed to be functional. Such an assumption is akin to claiming that a television set left on and unattended will still be in working condition after a million years because no natural events, such as rust, erosion, static electricity, and earthquakes can affect it. The convoluted rationale for the decision to discard evolutionary conservation and constraint as the arbiters of functionality put forward by a lead ENCODE author (Stamatoyannopoulos 2012) is groundless and self-serving.
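
The television-set argument is easy to put rough numbers on. Here is a minimal Python sketch, not anything taken from the Graur paper: it assumes every site mutates independently at a fixed rate per generation and that nothing removes mutations. The default rate of 1e-8, the 100-base element, and the function names are all illustrative assumptions of mine.

```python
import math


def expected_mutations(length_bp, generations, mu=1e-8):
    """Expected number of mutations hitting an element of length_bp bases.

    mu is an assumed, order-of-magnitude per-site, per-generation rate.
    """
    return mu * length_bp * generations


def prob_untouched(length_bp, generations, mu=1e-8):
    """Probability that the element has taken no mutation at all.

    Assumes independent sites and no selection removing mutations, so the
    chance of staying intact is (1 - mu) ** (length_bp * generations).
    log1p keeps the arithmetic accurate for a very small mu.
    """
    return math.exp(length_bp * generations * math.log1p(-mu))


if __name__ == "__main__":
    # A hypothetical 100-base "functional" element left unprotected.
    for gens in (10_000, 1_000_000, 100_000_000):
        print(gens, expected_mutations(100, gens), prob_untouched(100, gens))
```

Under these assumed numbers, a 100-base element expects about one mutation after a million generations, and the odds that it has escaped entirely are roughly e^-1, or about 37%. Leave it longer and those odds head toward zero, which is the point of the immortal television set.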

There is a lot of very useful material in the rest of the paper — in particular, if you’re not familiar with this stuff, it’s a very good primer in elementary genomics. The subtext here is that there are some dunces at ENCODE who need to be sat down and taught the basics of their field. I am not by any means a genomics expert, but I know enough to be embarrassed (and cruelly amused) at the dressing down being given.

One thing in particular leapt out at me as fundamental and insightful, though. A common theme in these kinds of studies is the compromise between sensitivity and selectivity, between false negatives and false positives, between Type II and Type I errors. This isn’t just a failure to understand basic biology and biochemistry, but incomprehension about basic statistics.

At this point, we must ask ourselves, what is the aim of ENCODE: Is it to identify every possible functional element at the expense of increasing the number of elements that are falsely identified as functional? Or is it to create a list of functional elements that is as free of false positives as possible. If the former, then sensitivity should be favored over selectivity; if the latter then selectivity should be favored over sensitivity. ENCODE chose to bias its results by excessively favoring sensitivity over specificity. In fact, they could have saved millions of dollars and many thousands of research hours by ignoring selectivity altogether, and proclaiming a priori that 100% of the genome is functional. Not one functional element would have been missed by using this procedure.

This is a huge problem in ENCODE’s work. Reading Birney’s commentary on the process, you get a clear impression that they regarded it as a triumph every time they got even the slightest hint that a stretch of DNA might be bound by some protein — they were terribly uncritical and grasped at the feeblest straws to rationalize ‘function’ everywhere they looked. They wanted everything to be functional, and rather than taking the critical scientific view of trying to disprove their own claims, they went wild and accepted every feeble excuse to justify them.
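
The sensitivity versus specificity trade-off in the quoted passage can be made concrete with a toy example. This is a sketch under stated assumptions, not ENCODE’s or Graur’s method: the 100-position “genome”, the choice that 10 positions are truly functional (borrowed loosely from the under-10% conservation estimate), and all the names are hypothetical.

```python
def sensitivity_specificity(truth, calls):
    """Return (sensitivity, specificity) for boolean truth labels and calls."""
    tp = sum(t and c for t, c in zip(truth, calls))          # true positives
    fn = sum(t and not c for t, c in zip(truth, calls))      # false negatives
    tn = sum(not t and not c for t, c in zip(truth, calls))  # true negatives
    fp = sum(not t and c for t, c in zip(truth, calls))      # false positives
    return tp / (tp + fn), tn / (tn + fp)


# Toy genome: 100 positions, of which the first 10 are assumed functional.
truth = [i < 10 for i in range(100)]

# Strategy A: declare every position functional, a priori.
call_everything = [True] * 100

# Strategy B: declare the first 80 positions functional.
call_eighty = [i < 80 for i in range(100)]

if __name__ == "__main__":
    print("everything:", sensitivity_specificity(truth, call_everything))
    print("eighty:", sensitivity_specificity(truth, call_eighty))
```

Calling everything functional gives perfect sensitivity and zero specificity: no functional element is missed, and every non-functional one is misclassified. That is exactly the a priori shortcut the paper jokes about.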

The Intelligent Design creationists get a shout-out — they’ll be pleased and claim it confirms the validity of their contributions to real science. Unfortunately for the IDiots, it is not a kind mention, but a flat rejection.

We urge biologists not be afraid of junk DNA. The only people that should be afraid are those claiming that natural processes are insufficient to explain life and that evolutionary theory should be supplemented or supplanted by an intelligent designer (e.g., Dembski 1998; Wells 2004). ENCODE’s take-home message that everything has a function implies purpose, and purpose is the only thing that evolution cannot provide. Needless to say, in light of our investigation of the ENCODE publication, it is safe to state that the news concerning the death of “junk DNA” have been greatly exaggerated.

Another interesting point is the contrast between big science and small science. As a microscopically tiny science guy, getting by on a shoestring budget and undergraduate assistance, I like this summary.

The Editor-in-Chief of Science, Bruce Alberts, has recently expressed concern about the future of “small science,” given that ENCODE-style Big Science grabs the headlines that decision makers so dearly love (Alberts 2012). Actually, the main function of Big Science is to generate massive amounts of reliable and easily accessible data. The road from data to wisdom is quite long and convoluted (Royar 1994). Insight, understanding, and scientific progress are generally achieved by “small science.” The Human Genome Project is a marvelous example of “big science,” as are the Sloan Digital Sky Survey (Abazajian et al. 2009) and the Tree of Life Web Project (Maddison et al. 2007).

Probably the most controversial part of the paper, though, is that the authors conclude that ENCODE fails as a provider of Big Science.

Unfortunately, the ENCODE data are neither easily accessible nor very useful—without ENCODE, researchers would have had to examine 3.5 billion nucleotides in search of function, with ENCODE, they would have to sift through 2.7 billion nucleotides. ENCODE’s biggest scientific sin was not being satisfied with its role as data provider; it assumed the small-science role of interpreter of the data, thereby performing a kind of textual hermeneutics on a 3.5-billion-long DNA text. Unfortunately, ENCODE disregarded the rules of scientific interpretation and adopted a position common to many types of theological hermeneutics, whereby every letter in a text is assumed a priori to have a meaning.

Ouch. Did he just compare ENCODE to theology? Yes, he did. Which also explains why the Intelligent Design creationists are so happy with its bogus conclusions.


  1. says

    This reminds me of some vague recollection from some story where space aliens, on reviewing the contents of taxis, concluded there must be some critical function to dashboard ornaments, or, in a more likely to be remembered case, one of the people in Harry Potter, who was obsessed with “Muggle” technology, asking the main character, “What is the rubber ducky used for?” We have precisely the same logic here in the ENCODE project.

  2. martinhafner says

    Great article. I am afraid, though, that it will not be sufficient to eliminate the damage ENCODE has caused.

  3. says

    While I agree that ENCODE got a little ahead of itself interpreting their results, I think the argument made in the abstract goes a little far in the other direction. 10% is the figure for the fraction of the genome under PURIFYING selection, the fraction that is highly CONSERVED. However many mutations represent a trade-off between competing (or at least, less than perfectly aligned) goals and can well be functional without being conserved or purified. The famous sickle cell mutation granting malarial resistance is one example…clearly ‘functional’ but, if I understand those terms correctly (and maybe I don’t), that mutation must not be conserved under purifying selection.
    The current paper writes that “according to the ENCODE Consortium, a biological function can be maintained indefinitely without selection,” but I don’t think this is really ENCODE’s conclusion/assumption: rather, precisely because a mutation has some function(s), it can be maintained as a (functional) mutation by certain kinds of selection (e.g. stabilizing selection).
    Are they really trying to claim that since only 10% of the genome is conserved via purifying selection, the rest can’t be functional?

  4. ibbica says

    So if ever a transcription factor ever, in any cell, bound however briefly to a stretch of DNA, they declared it to be functional. That’s nonsense.

    For more than the reason that it doesn’t demonstrate useful “function”…
    I recall a… er, lively discussion a few months ago in the lab* about whether things like “provides the necessary length of DNA between two binding sites to allow for the appropriate binding of multiple cofactors” should be counted as “functional”. Seemed to boil down to semantics, in our case; if you want to use the word “function” to mean “is transcribed to mRNA” (or worse, “is transcribed to mRNA and then translated to produce a protein product”), then yeah, there’s a LOT of “functionless” (albeit not “useless”) DNA. If you use “function” to mean “serves to enhance the propagation of the host’s genome” (e.g. “useful”), well that’s quite a different story. Never mind if you want to include things like sequence copies that don’t currently “do” anything but serve the “function” of allowing for future adaptations… But none of what we could come up with seemed to work well with ENCODE’s definition of “function”, which seems to be based on something like “what can we use as an operational definition that will let us maximize data generation while using only one technique?” :P
    This sort of discussion is apparently a great way to confuse the hell out of new undergrads, btw ;)
    *A behavioural endocrinology lab, admittedly not a group of practicing geneticists, so YMMV

  5. martinhafner says

    As I already put it on Ewan Birney’s blog a few months ago:

    It’s as if some geographer would define hills with height over base > 100m as mountains. This would transform the majority of the world’s surface into mountains. It is obvious that this wouldn’t make sense even if the given definition is very precise.

  6. rodw says

    The fact that living things are complex objects produced by the tinkering of natural processes creates problems for simplistic ideas of function. The authors distinguish ‘junk DNA’ as useless but harmless DNA and ‘garbage DNA’ (from Sid Brenner) as useless and harmful DNA, and coin the term ‘indifferent DNA’ for spacer DNA where some stretch of DNA is needed for function but the sequence doesn’t matter. It seems to me that there are other interesting possibilities for ‘function’/‘non-function’. Consider a situation in which transcription factor X (TFX) has the function of regulating, say, 30 genes. Because there will be thousands of random functionless sites for TFX scattered through the genome, the promoter of TFX will have to make sufficient protein to regulate the 30 genes, considering that a lot of it will be soaked up by the random sites. If you delete a few random sites it will have no effect, but if you delete enough you’ll elevate the amount of unbound TFX enough to screw up transcription of important genes. So do the random sites then have ‘function’? I would say no, but I’ll bet careful statistical analysis on a very large data set would show some minute amount of conservation… of some of the ‘random’ sites.

  7. chrislawson says


    The sickle-cell mutation would be counted as functional using evolutionary conservation as a measure because this particular mutation is well conserved in some populations (specifically those with a high exposure to malaria). Our broken vitamin C gene would not be counted as functional…except possibly by ENCODE.

    Also, I don’t believe the authors are claiming that evolutionary conservation is the only way of determining functionality (obviously any brand new mutation can’t be conserved until it has been in the human genome for a long time, even if it is an extremely advantageous mutation), but they are pointing out that the ~10% estimated by conservation is waaaay different from ENCODE’s mythical 80+%.

    Plus, we *know* that around 50-60% of the human genome is made up of LINEs, SINEs, and leftover broken viral DNA segments. The only way ENCODE could possibly get around this would be to demonstrate that LINEs, SINEs, etc., had some previously unknown function. But they don’t do that, of course. They merely assert that anything that even touches a transcription enzyme is functional. Very, very poor science.

  8. says

    chrislawson wrote: “Plus, we *know* that around 50-60% of the human genome is made up of LINEs, SINEs, and leftover broken viral DNA segments” and also “They merely assert that anything that even touches a transcription enzyme is functional. Very, very poor science.”

    True, I’m with you on that.
    As far as their 10% figure goes, though, I may be misunderstanding but I don’t agree with your argument. When measuring conservation, they’re looking for genetic locations that don’t vary (much) across species, or across generations or populations of a single species. If a major population sees some otherwise fixed site generate a mutation that grows in frequency a la the sickle cell mutation, that frequency will be different from the frequency in other populations and not show as conserved. Furthermore, even within that population, the sickle cell allele does not go to fixation: TWO copies gives you sickle cell anemia (hence the name), which is probably worse than vulnerability to malaria, so selection causes the frequency to remain intermediate, i.e. a stabilized mutation, which other populations/generations/species don’t have, hence not really conserved.
    For that reason, 10% gives kind of a floor/minimum for functional fraction of the genome, but is not really an ‘estimate’ at all, any more than ENCODE’s 80% is (which is more like a ceiling, and an imaginative one at that…)

  9. says

    I think that the ID people shouldn’t be too happy with ENCODE because if we were to accept ENCODE’s definition of function then it would destroy their (erroneous) assertion that you cannot increase information in a way that should be evident enough even for them to grasp*, as most increase in genome length would very likely be functional (per ENCODE’s definition) and thus be an increase in information.

    So all the LINEs and SINEs would increase function (and thus information) each time they are copied, thus demolishing their bogus objection.

    Unless of course they want to argue that you can have added function without added information, in which case their whole complaint about information becomes irrelevant.

    *well, until they put a lot of effort in order to find a way not to grasp it.

  10. Ichthyic says

    A behavioural endocrinology lab

    interesting. I studied under Howard Bern at UCB.

    Has your lab got a list of publications? love to see.

  11. Calilasseia says

    Once again, this is what happens when, as is all too frequently the case when agenda-riddled ideologues want science to conform to their dogmas, said ideologues start playing apologetics with science, despite the fact that science is about as far removed from apologetics as it’s possible to be.

    Basically, the ENCODE people assumed a conclusion, and then peddled the usual tiresome litany of blind assertions to prop up that conclusion. It’s almost as if the ENCODE staff were being paid directly by the Duplicity Institute. Which is why the comparison of ENCODE’s assorted eructations to theology is entirely apposite.

    By the way, can we have a proper line break tag in here? Only I’m having to fake this using the italics tags.