The genetic load problem


Dan Graur has written a good summary of genetic load. It’s an important concept in population genetics, and everyone should be familiar with it…and this is a nice 2½ page summary with only a little math in it.

I’ll try to summarize the summary in two paragraphs and even less math … but you should read the whole thing.

Genetic load is the cost of natural selection. You all understand natural selection (my usual problem is trying to explain that there’s more to evolution than just selection), and so you know that you can’t have selection without imposing a loss of fitness on individuals that lack the trait in question. As it turns out, when you do the math, the only parameter that matters is the mutation rate, µ, and the mean fitness of a population, w, is (1-µ)^n, where n is the number of loci, or genes, in the genome. What w is, basically, is the cost to the population of carrying suboptimal variants.

Notice that (1-µ) is taken to the nth power — that tells you right away that the number of genes has a significant effect on the cost to a population. As Graur shows by example, using a reasonable estimate of the number of genes and the mutation rate, the human genetic load is easily bearable — if each couple has about 2½ children, losses due to selection overall will be easily compensated for, and the population size will be stable. But if n is significantly greater than 20,000-30,000 genes, because of that exponent, the cost becomes excessive. If the genome were 80% functional, he estimates we’d each have to have 7 × 10^45 children just to maintain our current population.
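The arithmetic is easy to play with yourself. Here’s a minimal sketch in Python, assuming Graur’s illustrative per-locus rate of µ = 10^-5; the exact inputs behind his 7 × 10^45 figure are in his write-up, so treat these numbers as order-of-magnitude only:

```python
# Mean population fitness under mutational load: w = (1 - mu)^n,
# where mu is the deleterious mutation rate per locus per generation
# and n is the number of functional loci in the genome.

MU = 1e-5  # illustrative per-locus rate from Graur's example

def mean_fitness(n, mu=MU):
    """Fraction of offspring that escape new deleterious mutations."""
    return (1.0 - mu) ** n

def births_per_couple(n, mu=MU):
    """Births needed so that, on average, two offspring survive selection."""
    return 2.0 / mean_fitness(n, mu)

# ~25,000 genes: a bearable load, about 2.6 births per couple.
print(births_per_couple(25_000))

# Call 80% of a 3 Gb genome functional (2.4 million 1,000-nt loci)
# and the required fertility becomes absurd: tens of billions of births.
print(births_per_couple(2_400_000))
```

That steep jump is the whole point: because n sits in the exponent, even a modest increase in the claimed functional fraction pushes the required reproductive excess past anything a real population can pay.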

What all this means is that there is an upper bound to the number of genes we can possibly carry, and it happens to be in the neighborhood of the number of genes estimated in the human genome project. We can’t have significantly more, or the likelihood of genes breaking down at our current mutation rate would mean that most of our children would be born dead of lethal genetic errors, or bear the burden of a swarm of small fitness deficits.

What Graur doesn’t mention is that this is old news. The concept was worked out in the 1930s by Haldane; it was dubbed “genetic load” in 1950 by Muller; Dobzhansky and Crow wrote papers on the topic in the 50s and 60s. I learned it as an undergraduate biology student in the 1970s. I expect that more advanced and active researchers in the field will have this concept well in hand and be completely familiar with it. It’s just part of the quantitative foundation of evolutionary biology.

And this is why some of us go all spluttery and cross-eyed at any mention of the ENCODE project. They just blithely postulated orders of magnitude more functioning elements in the genome than could be tolerated by any calculation of the genetic load — it quickly became clear that these people had no understanding of the foundation of modern evolutionary biology.

It was embarrassing. It was like seeing a grown-up reveal that he didn’t know how to use fractions. It’s as if NASA engineers plotted a moon launch while forgetting the exponent “2” in F = Gm₁m₂/r². Oops.

When well-established, 80-year-old scientific principles set an upper bound on the number of genes in your data set, and you go sailing off beyond that, at the very least you don’t get to just ignore the fact that you’re flouting all that science. You’d better be able to explain how you can break that limit.

(via Sandwalk)


Just in case you wanted to replicate the experience of a struggling undergraduate, here’s a paper I wrestled with when I was 21 and at the University of Washington: James Crow, Some Possibilities for Measuring Selection Intensity in Man (Woman, you’re off the hook). It’s got more math than Graur’s explanation. If you don’t get it, or are not already familiar with the topic from other educational sources, then that’s OK — you just shouldn’t be getting million-dollar grants to study functional elements in the human genome.

Comments

  1. Athywren, Social Justice Weretribble says

    James Crow? I know that name. I have a bad feeling about this….
    (I would comment on the rest, but I am rather clueless about your squishy biology malarkey, so all I could add would be “interesting.”)

  2. parasiteboy says

    And this is why some of us go all spluttery and cross-eyed at any mention of the ENCODE project.

    Now I understand why you dislike them so much.
    Do they have a counter argument to the upper-limit of functional genes given the constraints of the genetic load and birth rate? Is it varying mutation rates amongst genes?

  3. Azkyroth Drinked the Grammar Too :) says

    The genetic load problem

    Weren’t we just discussing that in the Lounge? :P

  4. frog says

    ::twists head around a few times, hopes it’s on straight::

    So if I’m understanding the concept correctly…with a lower rate of mutation, a species’ genome could theoretically be larger, because the number of mutations occurring wouldn’t be too high (in absolute terms) for the population to sustain itself.

    Is there any evidence that species in general tend to maximize their genetic load? Or are there species that have significantly smaller genomes than their mutation rate would theoretically allow?

  5. tulse says

    The notion of genetic load makes a lot of sense intuitively, but as someone not deeply familiar with biology I’m confused by the notion of “maximal fitness” as presented in the referenced paper. What does “maximal fitness” actually mean? Does it just mean that all individuals in a population have the same fixed alleles? Or does it actually mean that specific individuals are “the best possible” in some actual sense? The paper seems to talk of “fitness” as applying to individuals, so a population approach to fitness seems counter to that. Perhaps I’m thinking of “fitness” too colloquially, but it seems rather difficult to conceive of maximal fitness in any concrete individual sense (as opposed to purely mathematical sense of allele frequencies).

    (Heck, I figure really maximal fitness for humans would involve lasers for eyes, wings, and the ability to eat all the chili-cheese dogs one wants without feeling ill. But that’s just me.)

  6. chris61 says

    @2 parasiteboy

    Do they have a counter argument to the upper-limit of functional genes given the constraints of the genetic load and birth rate? Is it varying mutation rates amongst genes?

    They don’t need a counter argument because their definition of functional is not an evolutionary one. ENCODE isn’t proposing there are more than 20,000-30,000 genes because ENCODE’s functional units aren’t Dan Graur’s genes. His are theoretical units that can be used to derive calculations of things like genetic loads and other parameters of population genetics.

    ENCODE is looking at the 98-99% of the genome that doesn’t code for protein and looking to annotate it in such a way as to identify specific regulatory elements such as promoters, enhancers etc. A real human gene may be a million nucleotides from its transcriptional start site to its transcriptional termination site. So an RNA of a million nucleotides will be made from it. That RNA may then be processed down to a much smaller RNA of, say, 8,000 nucleotides. The processing is required before the RNA can be exported from the nucleus to the cytoplasm where it will be translated into a protein. The entire RNA will not be translated into protein; it will have untranslated sequences at both ends. Let’s say 1,500 nucleotides are translated.

    So how much of that million nucleotides is functional? ENCODE would say all million, because all million are transcribed. Dan Graur would say it’s a much smaller fraction, because his estimate would be based on an estimate of mutations that create suboptimal variants, where suboptimal refers to reproductive fitness.

    Evolutionary biology will look at that million nucleotides and give you an estimate of how much is functional, but population/evolutionary biology can’t look at that million nucleotides and tell you precisely which parts of it are functional, or whether, for example, a deletion or insertion found in a human patient is or is not likely to contribute to whatever disease they suffer from. That’s what ENCODE’s looking to do.

  7. monad says

    It may be an old formula, but it’s new and interesting to me. The other consequence that leaps out at me is that you can get away with more genes if you reduce the mutation rate. I understand eukaryotes have a lot of copy-checking mechanisms on their DNA, which should help with the problem, but they then shuffle it in sexual reproduction. I guess this then provides some diversity without having the same impact on genetic load? It makes intuitive sense; unlike a mutated gene, an inherited gene is guaranteed not to have killed off at least one previous individual.

  8. neilerickson says

    w, is (1-µ)^n, where n is the number of loci, or genes, in the genome. What w is, basically, is the cost to the population of carrying suboptimal variants.

    Shouldn’t that be (1-w) is the cost to the population, or w is the success rate or something, since (1-µ)^n tends to zero as n grows (assuming µ<1 since it makes no sense otherwise)?

  9. PZ Myers says

    Yes, you can have more genes if you lower the mutation rate. But we actually have a good estimate of the mutation rate (and actually, a good estimate of the number of genes) already.

  10. monad says

    @10 PZ Myers: Sorry, I wasn’t questioning the figures as applied to humans, just thinking tangentially about the different consequences applied to other taxa.

  11. tulse says

    I take it that this analysis presumes that mutations in non-coding sections of DNA are extremely unlikely to produce functional effects?

  12. Crip Dyke, Right Reverend Feminist FuckToy of Death & Her Handmaiden says

    @neilerickson:

    I had your question too.

    I’m not sure how that equation relates to real world things, b/c of what you said about the equation making no sense unless µ<1.

    But I do think I get the "maximal fitness" thing.

    The "maximal fitness" is merely the best combination of alleles available. If you have the best version of a gene currently existing in any member of your species…and then repeat that for every single gene…then you are "maximally fit" for your species.

    The cost for less-optimal is something that is not entirely clear. In some cases it could be that the "advantage" of an allele is merely that you don't die of a hideous genetic condition sometime after birth (thus people with that allele exist in the population) but most likely before you reproduce (thus fitness cost – the hideousness or not of the condition doesn't figure into the "fitness cost" at all). In other cases, one might be made more (likely to be) fecund by the actions of a gene, but this affects the overall availability of resources…and someone is going to lose out in the resource race.

    At least that's how it seems to me. Biology ain't my subject.

  13. Esteleth is Groot says

    Hmm.

    The entire phenomenon of post-translational modification comes to mind. This allows n genes to be translated into as many as n^x distinct protein forms (where x is the number of possible PTMs), which also shoves down the absolute number of necessary genes. Of course, this is somewhat compensated for by the possibility of incorrect PTMs.

    Protein moonlighting (a single protein serving multiple roles – to be distinguished from a single protein having multiple active sites that do different things) also shoves down the number of necessary genes and acts as a failsafe measure in case of mutation destroying the function of a protein.

  14. tulse says

    The “maximal fitness” is merely the best combination of alleles available.

    “Best” defined as…what, exactly? This is what confuses me — I’m not clear how one defines “maximally fit” without resorting to ad hoc criteria. How does one define “maximally fit” in a way that isn’t circular?

  15. platypus says

    There’s a real howler of a mistake in Graur’s analysis.

    Across pages 2-3 he says:

    Assume that there are 10,000 loci in the genome and that the mutation rate is µ = 10^-5 per locus per generation. [Then L = 1.1] Let us now assume that the entire human genome (3 × 10^9 bp) is functional, and consists of 3 million functional loci, each 1,000-nucleotide long. With a mutation rate of 10^-5 per locus per generation[…]

    The number of DNA changes due to mutation is pretty well established to be in the ballpark of one change per billion bases per generation. If you change your definition of a locus so that the genome goes from 10,000 to 3,000,000 (a 300-fold increase), the number of nucleotide changes per generation does not change. Those same mutations get divided into a larger number of loci, which means your per-locus mutation rate will decrease by the same factor of 300 that your number of loci increased.

    And that means your load comes out to… 1.1, exactly the same as before.

  16. monad says

    @18 platypus:
    That’s how it would work if he were partitioning the whole genome into a larger or smaller number of loci that are therefore of smaller or larger size. But he’s not; he’s talking about whether there are fewer or more loci placed within a genome that is otherwise junk. The size of the loci isn’t changing, so neither is the mutation rate per locus.
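For anyone following along, the two readings in #15 and #16 are easy to put side by side. A quick sketch (the 3 Gb genome, the ~10^-9 per-base rate, and the 1,000-nt loci at µ = 10^-5 are the figures already quoted in this thread; everything else is just bookkeeping):

```python
GENOME_BP = 3e9      # human genome size in base pairs
MU_PER_BP = 1e-9     # rough per-nucleotide mutation rate per generation

def hits_if_partitioned(n_loci):
    """platypus's reading: slice the SAME genome into n loci.
    The per-locus rate shrinks as the loci do, so the expected
    number of mutations in functional DNA never changes."""
    locus_bp = GENOME_BP / n_loci
    mu_locus = MU_PER_BP * locus_bp
    return n_loci * mu_locus  # always equals MU_PER_BP * GENOME_BP

def hits_if_loci_added(n_loci, mu_locus=1e-5):
    """Graur's setup (monad's reading): loci stay 1,000 nt long at a
    fixed per-locus rate, and extra loci are carved out of what was
    junk. More functional DNA means proportionally more hits."""
    return n_loci * mu_locus

print(hits_if_partitioned(10_000), hits_if_partitioned(3_000_000))  # identical
print(hits_if_loci_added(10_000), hits_if_loci_added(3_000_000))    # 300-fold jump
```

Under the first reading the expected number of hits (and hence the load) is indeed invariant, but it is the second reading that matches Graur’s text, which is why the load blows up as the assumed functional fraction grows.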

  17. Amphiox says

    “Maximally fit” I think refers to the combination of alleles that results in the greatest likelihood of the highest degree of reproductive success, given the prevailing environmental conditions of that particular generation. You’ll never be able to determine precisely WHICH combination of alleles it is, since the environment is continuously changing, and it could even be a different combination at different times/generations (or it could even change within the course of a single generation). But, theoretically, at any given instant in time, there should be one, or a handful of such possible allele combinations for any population. The degree to which any individual genome with its individual combination of alleles differs from and is inferior (in terms of likelihood for maximal reproductive success) to this maximally fit genome is thus the cost of variation and the cost of natural selection.

  18. David Marjanović says

    So how much of that million nucleotides is functional? ENCODE would say all million because all million are transcribed.

    …which is a really breathtakingly naïve mistake.

    The transcription machinery binds to DNA. It has a higher affinity to some sequences than to others, but it non-negligibly binds to everything.

    Transcription is a very inefficient process. We make lots of useless RNA and promptly destroy it again. Metabolic heat, yaaaaay.

  19. chris61 says

    @21 David

    The transcription machinery binds to DNA. It has a higher affinity to some sequences than to others, but it non-negligibly binds to everything.

    Yet transcriptional start sites have fairly characteristic biochemical signatures. The transcription machinery doesn’t just bind everywhere, especially not on DNA in the form of chromatin.

    We make lots of useless RNA and promptly destroy it again

    How do you define useless? The majority of mRNAs in humans are made as large transcripts which are subsequently spliced. It has been shown experimentally for many genes that spliced transcripts result in higher levels of gene expression than non spliced transcripts. The process itself has a function even if much of the specific sequence isn’t necessary. The same argument could be made for the untranslated sequences found at both ends of most mRNAs: when you remove them, the RNA becomes less stable or is translated less efficiently. There are also frequently antisense transcripts that originate within introns which at least in some cases have also been shown to affect gene expression levels. There are in fact plenty of examples of non coding RNA for which function has been experimentally demonstrated.

  20. David Marjanović says

    Yet transcriptional start sites have fairly characteristic biochemical signatures. The transcription machinery doesn’t just bind everywhere, especially not on DNA in the form of chromatin.

    It does bind everywhere, or perhaps almost everywhere. It binds preferentially – more strongly, therefore for longer times, therefore more often long enough to actually initiate transcription – to canonical transcription start sites; but it binds everywhere, and everything is transcribed at least occasionally.

    How do you define useless?

    Coding neither for a protein nor for rRNA or tRNA nor for regulatory RNA in the widest sense (miRNA, snRNA etc. etc. ad nauseam, and yes, I recognize that this may include kinds that haven’t been discovered yet).

    It has been shown experimentally for many genes that spliced transcripts result in higher levels of gene expression than non spliced transcripts.

    …By “gene expression”, do you mean “translation”?

    But I’m not talking about mRNA containing introns, a cap, a poly-A tail or other untranslated bits. I’m talking about completely useless transcripts that don’t contain anything useful.

    There are also frequently antisense transcripts that originate within introns which at least in some cases have also been shown to affect gene expression levels.

    Of course; double-stranded RNA is destroyed (because its most common source is virus genomes), and this has evolved into the use of antisense transcripts for gene regulation. That's already included in "regulatory RNA".

    Come on, man, half of our genome consists of retrovirus corpses in all stages of decay, and most of the rest is tandem repeats whose amount varies between individuals. Do you seriously believe all this junk isn't junk?

  21. gillt says

    …By “gene expression”, do you mean “translation”?

    By gene expression no one means translation. Over the past few decades we’ve routinely relied on mRNA transcript abundance (relative and absolute) via RNA-seq, qPCR and microarrays (gene expression assays) as an indirect way to understand what proteins are being made. Important caveats notwithstanding, this has generated lots of useful information.

  22. Amphiox says

    The process itself has a function even if much of the specific sequence isn’t necessary.

    Then those specific unnecessary sequences do not have function, even if they are transcribed. They have no more function than the extra paper you cut away when you trace a stencil has function. They are “junk” in the purest sense of the word.

    There are in fact plenty of examples of non coding RNA for which function has been experimentally demonstrated.

    These all fall into the category of regulatory sequences, which has been experimentally demonstrated to consist of something on the order of no more than 5-10% of the genome, as far as I have heard.

    And the concept of genetic load applies to regulatory sequences as well as genes. And in fact when someone says “gene” in the context of genetic load, it should be taken to mean “the gene plus ALL of its relevant regulatory sequences, near and far”.

    Because the idea of genetic load applies equally to EVERYTHING that natural selection can act on, which means EVERYTHING that has a potential phenotypic consequence, and gene regulation of course has phenotypic consequence.

    ANYTHING that can mutate, and have its activity altered by the mutation in a way that changes SOMETHING in the organism that has the potential to have consequences on fitness (regardless of whether or not it has actual impact on fitness *now* in the *current* environment) is subject to the genetic load argument pertaining to how much of it can POSSIBLY be functional.

  23. chris61 says

    @27 Amphiox

    Scientists know some but probably not all of the functions that non coding DNA performs. For the most part it is less highly conserved than coding DNA, but there are examples where conserved functions have been found even where the DNA sequence itself isn’t highly conserved. So how much of that DNA is functional, and, perhaps more importantly as it relates to ENCODE, how do we recognize functional DNA? Population genetics and evolutionary biology alone can’t answer that. ENCODE is contributing to an answer.

  24. David Marjanović says

    By gene expression no one means translation.

    That was the only way I could make sense of “spliced transcripts result in higher levels of gene expression than non spliced transcripts”.

    ENCODE is contributing to an answer.

    How? It starts from the (manifestly wrong) assumption that everything that is transcribed has a function. That’s why it simply looks at what is transcribed, and proclaims that all that stuff has a function.

  25. gillt says

    I don’t understand the spliced transcript comment either. A citation would be helpful.

    But the point about transcription is that we’re already relying on it, and have been for years, as a proxy for function. If the cell is making mRNA then it’s probably making protein, the thinking goes. If you’re going to criticize ENCODE for the same line of reasoning then at least be consistent and also wonder aloud how the decades of studies that extrapolate from transcript to phenotype ever got through peer review.

  26. chris61 says

    @28

    If, as is the case for at least some of these non coding transcripts, the promoter is highly evolutionarily conserved but the actual sequence of the transcript isn’t, would you consider transcription functional or non-functional in those cases? Would you agree that such cases raise the possibility that the process of transcription may be important even if the product itself isn’t?

  27. Crip Dyke, Right Reverend Feminist FuckToy of Death & Her Handmaiden says

    @amphiox, #20:

    yes, I agree with what you said (or are trying to say, if the words don’t parse perfectly) about maximal fitness being the set of genes that on average produces the most living descendants.

    I think you’ve got it, and I hope I’ve got it.