ChatGPT totally fails my one-question exam on Junk DNA


I was chatting with HJ Hornbeck and Dr Sarah yesterday, and we were tinkering with ChatGPT. We asked it for some simple medical advice, of the sort you might consult WebMD for, and Dr Sarah confirmed that it was close to what she would recommend. Then we asked it a simple question of the sort I might throw out in a college classroom.

We asked, “What is junk DNA?” Uh-oh.

Junk DNA is a term used to describe the non-coding regions of the genome, which do not encode proteins. For many years, it was believed that these regions of the genome had no function and were simply “leftover” DNA that accumulated over evolutionary time. This led to the term “junk DNA” being used to describe these regions, as they were thought to be of no consequence to the organism.

However, more recent research has shown that many of these non-coding regions actually play important roles in regulating gene expression and controlling the activity of specific genes. They can act as molecular “switches” that turn genes on or off in response to different environmental signals, and they can also act as spacers that help to organize the genome into distinct functional regions.

In addition, some non-coding regions of the genome contain functional elements such as transposable elements, ribosomal DNA, and small RNA molecules, which play important roles in regulating gene expression and controlling the activity of specific genes.

In light of these findings, the term “junk DNA” is no longer used in a scientifically accurate context, and has been largely replaced by the more accurate terms “non-coding DNA” or “regulatory DNA.” These non-coding regions of the genome are now recognized as playing important roles in shaping the phenotype and evolution of organisms.

Holy crap, that is a terrible answer, from the very first sentence. It conflates junk DNA with non-coding DNA, and builds its entire argument on that premise. It then claims that the discovery of regulatory sequences undermined the concept…but regulatory sequences have never been regarded as junk, and neither have ribosomal DNA or these unspecified “small RNA molecules” (what? like tRNA or siRNA?). And now transposable elements are just assumed to be functional? Some have acquired functional roles, but most have not, which is kind of significant given that they make up around half the mammalian genome.

Then the grand conclusion is that “junk DNA” is no longer used, and has been replaced by the terms “non-coding DNA” or “regulatory DNA.” No, it has not. Those are not the same thing at all. I note that much of the answer seems to have been cribbed from the kind of thing you get in a random Google search, which is polluted with all kinds of pseudoscience, creationist sources, and general denial that our DNA could be less than perfect. For instance, even Scientific American has a trash article that concludes that evolution is too wise to waste this valuable information. Maybe ChatGPT should have exercised a little discrimination, and looked at qualified sources like Graur, and Palazzo, and Moran? Because I’d give this answer a big fat “F” and comment that apparently the author hadn’t listened in lecture or read the textbook.

Speaking of Moran, his long-awaited book, What’s in Your Genome?: 90% of Your Genome Is Junk, is off to the printers, with an expected release date of 16 May. Maybe that will help ChatGPT correct its bad science. That’s a book I’m looking forward to.

Comments

  1. says

    While ChatGPT definitely does make mistakes, it seems to me you’re blowing up what is really a quibble. It wrote “This led to the term “junk DNA” being used to describe these regions, as they were thought to be of no consequence to the organism. However, more recent research has shown that MANY of these non-coding regions actually play important roles in regulating gene expression and controlling the activity of specific genes.” (Emphasis mine.) I believe this is essentially correct: the functionality of some non-coding DNA was not initially recognized. You seem to claim otherwise but I don’t think you are historically correct.

  2. Matt G says

    Also looking forward to the book. It’s disturbing how much some biologists end up sounding like creationists. I’m looking at you, ENCODE!

    Yesterday, Larry had a photo of you and him at Down House.

  3. says

    No, it is ahistorical to claim that junk DNA was ever used to describe non-coding or regulatory regions. The term was popularized by Ohno in the 1970s, well after scientists were familiar with functions of non-coding DNA. We were well aware of Jacob and Monod, for example.

    The “MANY of these noncoding regions play important roles” is a familiar construction used by creationists and other apologists for universal selectionism. OK, how many? What percentage of the genome? You’ll find they are referring to a minuscule proportion as if it is representative of the whole.

  4. acroyear says

    You were saying ‘google search’?

    I just did the google and this was the top article. https://www.news-medical.net/life-sciences/What-is-Junk-DNA.aspx . Several other articles also seem to bear out #1 above – that much DNA originally thought to be “junk” turned out to have a function.

    The ChatGPT answer looks to be mostly a rewrite of that news-medical article.

    But really, all of those articles were written for the layperson. The kind of detail you’re asking for may be right for your specialty classes, but is it too much to ask of the average person, and too much to ask of articles targeting the average person?

    If anything, the term “Junk” is too misleading, and too historically loaded, and to be blunt, is just a crappy word to use in science. If there are better terms to use for all of the variations of function (including no function at all), then let’s get those terms out there so we all stop using ‘Junk’ in the first place…and from there, stop asking the question if it doesn’t really have a solid meaning given the better terms to use.

  5. acroyear says

    AH – now i see the specific anger point (and yeah, you seem really angry about this one for some reason). “Many” vs “a few”. Of course there are times that something thought to be non-functional turned out to have one given more research….and of course creationists will throw away context and go “see, ALL”.

  6. StevoR says

    @ acroyear : “If anything, the term “Junk” is too misleading, and too historically loaded, and to be blunt, is just a crappy word to use in science.”

    A wild, very non-biological tangent, but there is one meaning of Junk where Science, or at least ship / boat building engineering, is apt, in a form of craftship that is sadly fading away. Just 3 grandmasters left. (9 min clip)

    ..But yeah, very much NOT what we are talking about here & something very different.

  7. says

    My husband has been telling me every day about all the questions he’s been testing on ChatGPT. The punchline is the same every time. ChatGPT totally fabricates something that sounds vaguely plausible, over and over. I keep on saying, “ChatGPT is a language machine, not a truth machine.”

  8. says

    As far as I can tell, ChatGPT primarily generates poor-to-mediocre quality Wikipedia articles, so you’d generally get a better answer out of Wikipedia. The “advantage” of ChatGPT is that you can look up something more easily when you aren’t quite sure how to phrase your query.

  9. says

    When it dismisses the idea that most of our genome is Junk DNA, ChatGPT is also reflecting the views of the majority of molecular biologists, genomicists and cell biologists. That doesn’t mean it is right, though. Almost all molecular evolutionists do not dismiss Junk DNA. And they are upset at their colleagues for dismissing it. It’s an extraordinary divide. Larry’s book explains why the molecular evolutionists are right. PZ is correct.

  10. says

    “Why is anyone making excuses for ChatGPT’s wrong answer?”

    Because you seem disproportionately exercised about what you see as a faulty emphasis. It’s actually easy to find worse mistakes by this particular bot but nobody at this point is asking us to treat it as reliable. It’s quite amazing what it can do. The developers aren’t saying it’s a dependable research tool, so why does this so outrage you?

  11. Dunc says

    “Why is anyone making excuses for ChatGPT’s wrong answer?”

    ChatGPT is not attempting to answer the question in any meaningful sense. It does not understand the question. It does not understand its own output. It’s just producing an assemblage of essentially arbitrary words which are statistically associated with the input assemblage of arbitrary words in its corpus of training data.

    It’s not even wrong, because to be wrong would still at least imply some level of understanding that there’s even a question there, which could in theory be answered with reference to independently-existing things called facts. ChatGPT is entirely unaware of any of these things.

  12. wzrd1 says

    Dunc has the entire right of it. ChatGPT is true, full and state of the art AI as I currently define it.
    Artificial Idiocy.
    It doesn’t comprehend anything; it strings together phrases that are statistically meaningful to its algorithms, based upon its computation of the input it was given. Literally, every question is an outside context problem, with the machine having no actual understanding or comprehension of what is being asked, hence, an uncomprehending “answer” that’s little better than randomized returns being hodgepodged together in a grammatically correct, if contextually incomprehensible response – in short, a grammatically correct word and concept salad.
    Indeed, it’s essentially as intelligent as the original ELIZA, with some bells and whistles added in.

  13. Jean says

    PZ, you post the wrong answer and go over some of the reasons why it is wrong but you don’t give the correct answer to your question. It would be good to get a short and correct answer (other than a book reference which most people won’t see and read) instead of just getting the wrong answer that seems to be easier to find all over the internet (even from supposedly reliable sources).

  14. says

    Indeed, it’s essentially as intelligent as the original ELIZA, with some bells and whistles added in.

    Obviously we don’t know what “intelligence” is but it scores better than humans at IQ tests and has passed exams that humans have trouble with. It may be as “intelligent” as ELIZA but, I wonder if most humans are, either.

    The implementation of ELIZA I hacked on in the 80s was a static expert system with about 100 rules. Networks like GPT-3 operate in a completely different manner – for one thing they are not relying on a hand-coded expert knowledgebase. Yes, they are statistical in nature and operate like semantic forest/giant Markov chains, but so do humans. We’ve just got a few of what you might consider “bells and whistles.”

    The next big challenge will not be “intelligence” because by some measures the AIs beat humans. It will be self-awareness. And what is that? Other than ELIZA with some bells and whistles. printf("I am self-aware\n");? It’s another vague concept and I think humanity would be ill-advised to hang its sense of specialness on that.
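To make the contrast in the comment above concrete, here is a minimal Python sketch of the two styles: a hand-coded rule table in the spirit of ELIZA, and a toy word-level Markov chain built from observed text. Everything in it is an assumption made for illustration: the two rules, the function names, and the bigram model are invented here, are not taken from any real ELIZA script, and are nowhere near how GPT-3 is actually built (it is a large neural network, not a literal Markov chain).

```python
import random
import re
from collections import defaultdict

# Hypothetical hand-coded rules in the ELIZA style: (pattern, response template).
RULES = [
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "How long have you been {0}?"),
]
DEFAULT_REPLY = "Please go on."


def eliza_reply(text):
    """Return the response for the first hand-written rule that matches."""
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(match.group(1).rstrip(".!?"))
    return DEFAULT_REPLY


def build_bigram_model(corpus):
    """Map each word to the list of words observed to follow it."""
    model = defaultdict(list)
    words = corpus.split()
    for current, following in zip(words, words[1:]):
        model[current].append(following)
    return model


def markov_generate(model, start, length, rng):
    """Walk the chain: repeatedly sample a successor of the last word."""
    output = [start]
    for _ in range(length - 1):
        successors = model.get(output[-1])
        if not successors:
            break  # dead end: this word never had a successor in the corpus
        output.append(rng.choice(successors))
    return " ".join(output)
```

The rule table can only ever say what someone typed into it, while the chain's behaviour comes entirely from whatever text it was fed. For example, `eliza_reply("I am self-aware")` falls through to the second rule, and `markov_generate(build_bigram_model(some_text), "junk", 10, random.Random(0))` emits a word sequence drawn from `some_text` (a placeholder for any training string).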

  15. says

    Marcus – I was thinking much the same thing. They’ve passed the threshold of being able to BS like a human. I didn’t see that coming, personally. Now for the thing that can turn these programs into “real boys.”

    Might be a derail, but I’d like to know other people’s thoughts on this. What is self-awareness? A practical definition, not anything nebulous. And then, how might you hypothetically give this cognitive power to a doohickey?

  16. says

    I’m also too lazy to look for a good answer to what junk DNA IS.

    What piqued my curiosity – has anyone tried to find an organism with junk DNA, cut it out, and see what happens?

  17. says

    I might argue the Turing test has already been passed – many people read things on the internet that they cannot tell were written by AIs. That is going to continue, more and more rapidly. Those annoying telemarketers will all be AIs and they will be more clever and persistent than humans. Students are also doing Turing tests, handing in AI generated essays. My bet is that the philosophers and other squishy fields will be obsolete soon. Ditto the pundits. Imagine a million Jordan Petersons all locked on “autoglib” – nobody will care what is real. Certain fields where there are details will snag the AIs but all the marketers and presidential speechwriters need to read the writing on the wall. All fields that lack objective truth are toast. Endless plausible generalities inbound. Nick Bostrom, who needs him? The old joke about “I could replace him with a perl script” is wrong only inasmuch as it’s a Python script, not Perl.

  18. Tethys says

    I am uncertain of what the correct answer may be, as I did not realize that non-coding DNA is not included in the term ‘Junk DNA’.

    My understanding is that all the bits that make up our genome have some function, but a large percentage are merely functioning as filler.
    Transcription of the entire thing is necessary to grow an organism. DNA failing to replicate at fertilization doesn’t usually result in viability.

    Are nipples on men analogous? They don’t have any obvious function, but all humans grow nipples because they are basic mammal anatomy. Perhaps it would be clearer if they used terms such as developmental artifacts, rather than junk? Junk is loaded with trash connotations.

  19. Dunc says

    Marcus, @22: As Charlie Stross put it, “ChatGPT isn’t artificial intelligence, it’s artificial Boris Johnson”.

  20. says

    @GAS:
    I did a posting about this back on stderr a few years ago. This is my theory, which is mine and this is it, etc.

    Living things need elaborate feedback and monitoring loops: check the colon level, how’s my fluid balance, am I talking bullshit? Etc. It appears to me that many processes we are and do have somatic and cognitive feedback checking for error, threats, about to poop my drawers, etc. What if that is what “self awareness” is: the ability to reflexively think about our selves and what we are doing? Just it’s complicated to the point where we don’t understand it and think it is magic.

    Consider a dog: it has all the panoply of dog inputs and status checks and also it has a self-reward system for a nebulous feedback target to “be a good dog” – the dog might have an intermittent thread that pops a query “am I being a good dog?” If you think about it, the dog may not even have an objective measure for being a good dog – but a probabilistic one would work damn near as well! So, the AI, excuse me, dog, sets a monitoring loop on its behavior that runs concurrent with monitoring all the other stuff a dog monitors about itself. I argue that being aware that it is a “good dog” is self-awareness and depends on a sense of self that is separate from all the other dogs or objects in the area.

    What I just established there is a very low bar for self-awareness but if it was implemented with a few matching models (after all “colon full” is also a vague concept!) it would behave just as our sense of self-awareness does. And if you can’t tell it’s different, is it different?

    There is another problem: meta-cognition – thinking about how we think. That’s also a low bar, though! “What is a good dog?” Oops I gotta think about that. Well a good dog doesn’t eat the couch and throw it back up. Or a good dog doesn’t do a whole bunch of other simple learned rules. A good dog doesn’t have to engage in meta-cognition to convince Nick Bostrom that it’s not in a simulation – it just growls at the annoying man but doesn’t bite him because good dogs don’t randomly bite philosophers. In other words there is no need for a philosophical framework of goodness – self-awareness can simply ask probabilistic questions as necessary. You can be a good dog without a philosophy, relying just on self-awareness in a monitoring loop.

    I used to joke that self-awareness is hard to distinguish from a while loop. Your while loop can block waiting for input and print the line “I feel fine.” to every stimulus. Make it a lookup table and you’ve implemented some republican congresspeople. It’s a monitoring loop that fires based on measures of other states rather than blocking on input. I guess I am saying that a self-aware being might have a clock interrupt fire and say “I’ve gotta go.” If you’re locked in a blocking input loop you’re stuck waiting for the interlocutor to ask “how are we doing for time?” Once you put a powerful generative AI behind a massively parallel monitoring loop I expect it will behave like a self-aware being. Because self-awareness is not such a big magical deal after all.
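The difference between the two loops in the comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the tick-based clock, and the shape of the monitor tuples are all hypothetical, and a real system would use threads, timers or interrupts rather than a simple counter.

```python
def blocking_agent(stimuli):
    """The 'while loop' joke: speaks only when spoken to, and always says the same thing."""
    return ["I feel fine." for _ in stimuli]


def monitoring_agent(stimuli, monitors, ticks):
    """Check internal state on every tick and speak unprompted when a monitor fires.

    stimuli:  dict mapping a tick number to an incoming remark
    monitors: list of (check, message) pairs, where check(tick) returns True or False
    ticks:    how many clock ticks to simulate
    """
    transcript = []
    for tick in range(ticks):
        # Internal monitors run whether or not anyone has said anything.
        for check, message in monitors:
            if check(tick):
                transcript.append((tick, message))
        # External input is handled too, but the agent is not stuck waiting for it.
        if tick in stimuli:
            transcript.append((tick, "I feel fine."))
    return transcript
```

With a single hypothetical clock monitor, `monitoring_agent({1: "How are you?"}, [(lambda t: t == 3, "I've gotta go.")], 5)` answers the question at tick 1 and then announces "I've gotta go." at tick 3 without being asked, which is the behaviour the blocking version can never show.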

  21. says

    Good writers are quite safe for now.

    Not even close.

    Because there are going to be a million John Ringos writing military SF based on Baron De Marbot’s experiences and memoirs (for one example) which are fascinating and elegantly written (for a farmboy who grew up to be a division’s commanding general) so – just for one example – a robo-Ringo could ask an AI to reimagine Marbot’s memoirs as SF. And it would not suck because it’d be Marbot not Ringo.

  22. Rob Grigjanis says

    Jean @18:

    It would be good to get a short and correct answer (other than a book reference which most people won’t see and read) instead of just getting the wrong answer that seems to be easier to find all over the internet (even from supposedly reliable sources).

    IANAB, but this made me chuckle. Maybe there is an answer which is both short and correct, but sometimes short and correct can be mutually exclusive. I think that’s a problem with a lot of pop-sci.

  23. says

    Massively asynchronous parallel processes could fool people into thinking they were dealing with spontaneity.

    Consider we’re talking and I suddenly start talking faster and louder because I have to pee. You don’t know that. I appear to have changed some behaviors spontaneously but really it’s my bladder monitoring loop that just started throwing interrupts, which are perturbing the whole system that is me.

  24. chrislawson says

    PZ@6–

    Because they want to believe the “all DNA is functional” bs and aren’t interested in revising their opinion over anything as trite as overwhelming evidence.

  25. Jean says

    Rob Grigjanis @28

    …but sometimes short and correct can be mutually exclusive…

    The point I was making was not to give a comprehensive answer which would likely not be short but rather to give the answer he would have liked to see from a student and given an A. I doubt that such an exam question would require much more than a page to answer and is what the initial context was.

    Moreover it would make for a good reference and would be much better for non-biologists like me for remembering what junk DNA actually is than an angry put down of an AI answer.

  26. Rob Grigjanis says

    Jean @32: Speaking only for myself, I can quickly get tired of giving “not much more than a page” answers when I don’t know how many people reading it actually give a shit, and when my impression is that the number who do is quite small.

  27. chrislawson says

    cervantes, please, you keep defending 100% flat-out errors as if they are slightly awkward misphrasings. They are not. Remember earlier this week when Google shares took a nose-dive because its highly touted AI Bard made a flat-out error in the middle of a major presentation?

    Well, in the junk-DNA case here the error is even more egregious than the one that tanked Google’s share price. Bard’s mistake was to misreport the Webb telescope as the first to image extrasolar planets when in fact the first image dates from 2004, several years before the Webb telescope was even launched…this is a bad mistake but at least it is true that there have been images taken of extrasolar planets and the error is one of attribution. The quote PZ provides is the equivalent of Bard saying “astronomers have determined that there are no extrasolar planets because comets.”

    And that’s because ChatGPT/Bard has clearly been trained on text patterns, with evidence-seeking by weight of search engine pings but no training on evidence-weighing. Which makes it great for churning out text that looks authoritative without being actually authoritative. Perfect for marketers and political manipulators. And that’s because the big money behind Bard and ChatGPT is trying to drive up advertising revenue, not improve communication or understanding. It is no coincidence that the humans most obviously at risk of losing their jobs to chatbots are copy writers in ad agencies.

  28. chrislawson says

    Tethys@24–

    I am uncertain of what the correct answer may be, as I did not realize that non-coding DNA is not included in the term ‘Junk DNA’. My understanding is that all the bits that make up our genome have some function, but a large percentage are merely functioning as filler.

    I suspect the problem is that you have had the misfortune to read a few bad articles written with unwarranted confidence, of which there are many even in usually reliable sources (see PZ’s reference to an awful SciAm article).

    I just Googled “junk dna”. Of the 10 first-page Google hits, 5 are badly erroneous. The decent ones are Wikipedia (as PZ says), a not-bad Quanta article, and a couple of scientific papers that most readers are not going to click on. (Here’s a hint, any article that defines “junk DNA” as “non-coding DNA” has already crapped its own bed — “junk” was never meant to be synonymous with “non-coding” and yet it is repeatedly misreported as such even in scientific papers.) This is no doubt why PZ’s ChatGPT prompt returned such a bad answer.

    If the topic interests you, then I recommend Dave Graur’s website (weirdly he uses Tumblr!) especially this particular post as a jumping off point, that Wikipedia article on non-coding DNA (which has a good discussion in the “junk DNA” subsection), or waiting for Larry Moran’s book to come out.

  29. chrislawson says

    acroyear– writing for laypeople makes it reasonable to simplify and summarise, it does not make it reasonable to misrepresent and distort.

  30. barbara4 says

    The idea of junk DNA is pretty simple, really. It’s DNA that has no useful function for the organism. It doesn’t code for proteins, it doesn’t code for useful RNA, it doesn’t regulate genes, it doesn’t form telomeres or centromere attachment points or other useful structures, it isn’t a useful splice site. It isn’t a useful filler. It’s not functional.

    Junk DNA is DNA you could live well without. It’s not conserved in evolution because it has no function that selection can act on. It’s just there.

    How did it get there? Some junk DNA is old, dead viruses. The live viruses are a bit controversial — they’re not useful to the organism, so they seem like junk, but they’re useful to the virus, so are they really junk? (I think so.) Some junk DNA is old, mutated, non-functional genes, like the human Vitamin C gene remnant. Some is duplicated material that has mutated and isn’t functional. Some was mRNA that was read back into the DNA but isn’t functional in that form. (If it is functional, it’s not junk. By definition.) No doubt there are other sources that I don’t know about.

  31. John Morales says

    Jean @32, PZ has written about this multiple times.
    Perhaps try this Google search:
    site:https://freethoughtblogs.com/pharyngula/ junk dna

  32. Alan G. Humphrey says

    @ #26 Marcus Ranum
    Thank you for this. I have thought very much the same for a long time.

    For “being a good dog” a parallel function to self-awareness would be awareness of others watching, like how “being a good spouse” depends on how likely being caught out is. A dog probably doesn’t include grape-vine reporting in its awareness of watchers.

  33. acroyear says

    ok, I now admit myself that I mis-read a section in the article I linked to (since i was mostly looking at the beginning sentences): “In the human genome for example, almost all (98%) of the DNA is noncoding, while in bacteria, only 2% of the genetic material does not code for anything.” – here the article DOES support all your descriptions that the amount of coding dna (in humans) is tiny, and so yes, CGPT did get this totally wrong. Probably because it hit other sites that lied about it (I only did cursory glances at others), or possibly because it really messed up the words it was converting.

    For some reason, I’ve gathered that Wikipedia is apparently ‘off limits’ by some sort of gentlemen’s agreement between Google and Wiki. Perhaps annoying because if GPT had full access to wikipedia and prioritized it as a source, it would probably come up with more accurate results on many topics, unless it happened to hit a page on a bad day when it was being manipulated by a troll. But I guess Wikipedia is large enough to fight back on being sampled, though given how much they beg for money, I’m not sure they could afford a lawsuit.

  34. raven says

    What picked up my curiosity – anyone tried to find any organism with junk DNA, cut it out and see what will happen?

    Yes.

    It’s been done with mice.
    They found a long stretch of noncoding DNA that based on other evidence might actually be important and do something.
    They deleted it.
    Nothing happened to the mice.

    This has been seen a lot with individual genomic sequencing.
    You find a lot of indel mutations, insertion and deletion mutations.
    Some of the deleted stretches can be quite long.
    Nothing noticeable happens a lot of the time.

  35. raven says

    Nature. 2004 Oct 21;431(7011):988-93. doi: 10.1038/nature03022.
    Megabase deletions of gene deserts result in viable mice
    Marcelo A Nóbrega, Yiwen Zhu, Ingrid Plajzer-Frick, Veena Afzal, Edward M Rubin

    Abstract
    The functional importance of the roughly 98% of mammalian genomes not corresponding to protein coding sequences remains largely undetermined. Here we show that some large-scale deletions of the non-coding DNA referred to as gene deserts can be well tolerated by an organism.

    We deleted two large non-coding intervals, 1,511 kilobases and 845 kilobases in length, from the mouse genome.

    Viable mice homozygous for the deletions were generated and were indistinguishable from wild-type littermates with regard to morphology, reproductive fitness, growth, longevity and a variety of parameters assaying general homeostasis.

    Further detailed analysis of the expression of multiple genes bracketing the deletions revealed only minor expression differences in homozygous deletion and wild-type mice. Together, the two deleted segments harbour 1,243 non-coding sequences conserved between humans and rodents (more than 100 base pairs, 70% identity). Some of the deleted sequences might encode for functions unidentified in our screen; nonetheless, these studies further support the existence of potentially ‘disposable DNA’ in the genomes of mammals.

    Here is the paper.
    It is free online and you can read it yourself.

    They deleted megabases of DNA in these mice.
    They were even conserved between mice and humans implying they might be important for something.
    They started with phenotypically normal mice and ended up after a lot of work with…phenotypically normal mice.
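For a sense of scale, the two deletions quoted in the abstract can be added up and compared with the whole genome. The short Python sketch below does that arithmetic; the deletion sizes come from the abstract, but the mouse genome size of roughly 2.7 billion base pairs is an assumption added here, so treat the resulting percentage as a rough figure only.

```python
# Deletion sizes reported in the abstract, in kilobases.
DELETIONS_KB = [1511, 845]

# Assumption (not from the abstract): a mouse genome of roughly 2.7 billion base pairs.
MOUSE_GENOME_KB = 2_700_000

total_kb = sum(DELETIONS_KB)
fraction = total_kb / MOUSE_GENOME_KB

print(f"Total deleted: {total_kb} kb ({total_kb / 1000:.3f} Mb)")
print(f"Share of genome: {fraction:.4%}")
```

The total comes to 2,356 kb, or about 2.4 megabases, which under the assumed genome size is a bit under a tenth of one percent of the genome. That is a large stretch to remove with no visible effect, but it is also a small sample of the whole.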

  36. billmcd says

    ChatGPT is, unambiguously, not artificial intelligence of any kind. It is artificial stupidity. Every time I have seen it asked to do more than recite a short sequence of rote facts that can be quickly and easily looked up from wikipedia, it has failed. The resulting text has 2 consistent properties: it is factually wrong, usually beginning with the first sentence, and it is stylistically bad, even (especially!) when asked to mimic a specific writer’s style—something the big brains at CNN spent a day and a half gushing over its ability to do.

    It’s trash, and like the self-driving features in modern cars, should be destroyed.

  37. hemidactylus says

    Coding DNA is both transcribed into RNA (messenger RNA) AND translated into peptides (often proteins). Noncoding DNA includes some regions that can be transcribed into RNAs that aren’t translated, such as transfer RNA and ribosomal RNA. That doesn’t exhaust the types of noncoding DNA, but the noncoding DNA that is actually transcribed yields noncoding RNA. Noncoding DNA still does stuff useful to the organism. And some isn’t transcribed, but still isn’t junk, like upstream regulatory regions that influence the spatiotemporality of gene expression in development.

    Junk DNA instead has no function. Some may have had a function in the past like remnant yolking genes found in mammals such as humans who have embryos that produce no yolk in development.

    Evolution results in many cases from gene duplication which relaxes selection constraints due to redundancy (having more than one copy of a useful gene). If mutational changes of a duplicated gene shift its function it will not become junk. But if mutations decay the duplicate region into nonsense…that is junk. Large regions or entire genomes can be duplicated sometimes. Those resultant excess genes could augment the protein production of the originals from whence they came. Or mutations might diverge them into producing somewhat different proteins. Or much of the excess duplicate regions might just decay into gibberish (due to the redundancy relaxing selection constraints) much like my futile attempt to give a basis for some of what is called junk.

  38. raven says

    That much of the human genome is junk DNA was known and accepted long ago, around the start of the 21st century.
    For most of us, it isn’t even an interesting question because it is settled and we’ve moved on.

    Notably, we estimate that typical individuals are hemizygous for roughly 30-50 deletions larger than 5 kb, totaling around 550-750 kb of euchromatic sequence across their genomes.
    Anyone reading this has 30-50 deletions of around 1/2 to 3/4 megabases total and…you aren’t dead.

    Nat Genet. 2006 Jan;38(1):75-81. doi: 10.1038/ng1697. Epub 2005 Dec 4.
    A high-resolution survey of deletion polymorphism in the human genome
    Donald F Conrad, T Daniel Andrews, Nigel P Carter, Matthew E Hurles, Jonathan K Pritchard
    Abstract

    Recent work has shown that copy number polymorphism is an important class of genetic variation in human genomes. Here we report a new method that uses SNP genotype data from parent-offspring trios to identify polymorphic deletions. We applied this method to data from the International HapMap Project to produce the first high-resolution population surveys of deletion polymorphism. Approximately 100 of these deletions have been experimentally validated using comparative genome hybridization on tiling-resolution oligonucleotide microarrays. Our analysis identifies a total of 586 distinct regions that harbor deletion polymorphisms in one or more of the families. Notably, we estimate that typical individuals are hemizygous for roughly 30-50 deletions larger than 5 kb, totaling around 550-750 kb of euchromatic sequence across their genomes. The detected deletions span a total of 267 known and predicted genes. Overall, however, the deleted regions are relatively gene-poor, consistent with the action of purifying selection against deletions. Deletion polymorphisms may well have an important role in the genetics of complex traits; however, they are not directly observed in most current gene mapping studies. Our new method will permit the identification of deletion polymorphisms in high-density SNP surveys of trio or other family data.
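The 550-750 kb range in the abstract converts to megabases, and to a share of the genome, with simple division. The Python sketch below shows the conversion; the range is from the abstract, while the human genome size of roughly 3.1 billion base pairs and the helper names are assumptions introduced here for illustration.

```python
# Range of hemizygous deleted sequence per individual, from the abstract (kilobases).
LOW_KB, HIGH_KB = 550, 750

# Assumption (not from the abstract): a haploid human genome of roughly 3.1 billion base pairs.
HUMAN_GENOME_KB = 3_100_000


def as_megabases(kilobases):
    """Convert kilobases to megabases."""
    return kilobases / 1000


def genome_share(kilobases):
    """Fraction of the assumed genome size represented by this many kilobases."""
    return kilobases / HUMAN_GENOME_KB


for kb in (LOW_KB, HIGH_KB):
    print(f"{kb} kb = {as_megabases(kb)} Mb, about {genome_share(kb):.3%} of the genome")
```

So the range is 0.55 to 0.75 megabases per person. Under the assumed genome size that is around two hundredths of a percent, and it counts only deletions larger than 5 kb that this particular method could detect.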

  39. raven says

    Eight percent of our DNA consists of remnants of ancient viruses, and another 40 percent is made up of repetitive strings of genetic letters that is also thought to have a viral origin. (Jan 9, 2020)

    The non-human living inside of you https://www.cshl.edu › the-non-human-living-inside-of-you

    This isn’t that complicated.

    If you look at the human genome, much of it is a literal junkyard.
    8% of our genome is dead retroviruses from past invasions of our genome by viruses.*
    40+% of it is transposon-derived sequences thought to be derived from even older retrovirus invasions.

    46%
    Transposable elements (TEs) occupy almost half, 46%, of the human genome, making the TE content of our genome one of the highest among mammals, second only to the opossum genome with a reported TE content of 52% [1, 2]. (Oct 27, 2009)

    LINE dancing in the human genome: transposable elements … https://genomemedicine.biomedcentral.com

    Why do we need transposable elements or dead retroviruses? These DNA sequences spread because it is to their advantage, not ours.

    *We have resurrected one of these retroviruses by simple genetic engineering to fix the mutations. The Phoenix retrovirus was an extinct fossil for 5 million years.

  40. Jean says

    John Morales @38: I know PZ has talked about it many times. The point is that only having the bad definition in a highlighted section of the post without the good definition (even a simplified overview as per barbara4 @37) puts the emphasis on the wrong definition. The human brain being what it is, as well as sloppy readers and bad faith elements, means that the error gets a boost in more people’s “understanding” of junk DNA rather than correcting the misunderstanding. That seems to be exactly the opposite of what the post was intended to be.

  41. John Morales says

    Jean:

    The point is that only having the bad definition in a highlighted section of the post without the good definition (even a simplified overview as per barbara4 @37) puts the emphasis on the wrong definition.

    Well done! That’s precisely the point. The post title kinda gives it away.

    The entire post, not just the emphasis, is about the wrong definition.
    It goes on to detail just what’s wrong with that definition.

    The human brain being what it is, as well as sloppy readers and bad faith elements, means that the error gets a boost in more people’s “understanding” of junk DNA rather than correcting the misunderstanding.

    Um, the entire post is about how that answer is wrong, and why it is wrong.
    That’s rather explicitly correcting the error (“misunderstanding”, as you put it).

    That seems to be exactly the opposite of what the post was intended to be.

    I can’t dispute that it seems so to you.

    To me, it seems to have been intended to be more like “don’t trust GPT to give accurate scientific answers”.

  42. Jean says

    John Morales: Half of the post is the wrong answer. Saying that the whole post is about how this answer is wrong is itself wrong. That was the intent but I think it misses on that as I explained. And yes, that’s my opinion.

  43. Tethys says

    The confusion stems from using terms imprecisely.

    Non-coding DNA is not the same thing as junk DNA; junk DNA is non-functional DNA.
    Logically speaking, if it isn’t contributing anything because it’s not functional, wouldn’t it also qualify as non-coding?

    However, like junk, sometimes it gets reused?

    PZ- Some have acquired functional roles, but most are not, which is kind of significant given that they make up around half the mammalian genome.

    I don’t see why ‘filler’ isn’t acceptable?
    It’s there because it gets transcribed as part of the genome, much like the AI just transcribes the information that it finds along with relevant key words, despite its available information being full of junk.

    I still understand ‘junk DNA’ to be the large portion of the genome which is a developmental artifact, but since it can also be repurposed it seems like a genome uses it as raw materials that may come in handy someday.

  44. wzrd1 says

    Obviously we don’t know what “intelligence” is but it scores better than humans at IQ tests and has passed exams that humans have trouble with.

    I said back in the 1980s that I could get a computer to answer IQ test questions; that doesn’t make it smarter than an average library. It’s problem solving, knowledge retention and processing, with excess capacity using random association to help problem solving and add creativity.

    I’ll gently remind all, junk DNA was the source of many things, to include the placenta. Without that junk code from a retrovirus, reassortment couldn’t occur to utilize those bits of code to make a placenta and hence, mammalia and us.
    It’s junk until it’s either lost via random deletion or utilized by a future mutation. Much of it gets randomly deleted, mutated or otherwise lost, or utilized if beneficial, assuming the loss or mutation isn’t deleterious. It’s evolution 101.
    Why does everyone wanna get lost in the weeds over the blatantly obvious? A slight upon your heritage?

    I still stand by my definition of AI, Artificial Idiocy. Maybe we’ll eventually get it somewhere, but nowhere near the hot mess we have today that makes an antivaxer Google search “finding” look equal.

  45. John Morales says

    wzrd1:

    I still stand by my definition of AI, Artificial Idiocy.

    You ever played chess against a computer?

    Maybe we’ll eventually get it somewhere, but nowhere near the hot mess we have today that makes an antivaxer Google search “finding” look equal.

    It’s very early days.
    Consider how it was, say, 5 years ago. Nothing like this.

    Jean, fair enough. I do get what you’re saying, I just don’t see it that way.

  46. chrislawson says

    Tethys–

    I think I see the problem. Functional is not synonymous with coding.

    [1] Functional, coding DNA → sequences that are used to make proteins that have biological function (this is about 1-2% of the human genome by base pairs, but about 99.9% of the human genome that gets talked about)
    [2] Non-functional, coding DNA → sequences turned into RNA or proteins, but these products have no biological function and tend to degrade rapidly
    [3] Functional, non-coding DNA → sequences that are never turned into protein but still have important regulatory functions, e.g. sequences that are not part of a gene but which act as anchor points for transcription enzymes
    [4] Non-functional, non-coding DNA → traditionally “junk DNA” (I personally would include [2] as “junk” but that’s not the common meaning), which is not turned into protein and has no biological function or may even be deleterious, as in transposon-induced cancers
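
    The four categories above come from crossing two independent yes/no properties. As a minimal sketch of that two-by-two split (the function name `classify` and the label strings are made up for illustration, and real sequences are of course not labelled this neatly):

```python
def classify(functional: bool, coding: bool) -> str:
    """Map two independent properties onto the four categories listed above."""
    if functional and coding:
        return "[1] functional, coding"
    if coding:
        # Expressed as RNA or protein, but the product does nothing useful.
        return "[2] non-functional, coding"
    if functional:
        # Never turned into protein, yet still doing a job (e.g. regulation).
        return "[3] functional, non-coding"
    return "[4] non-functional, non-coding (traditional junk DNA)"


# Walk through all four combinations of the two properties.
for functional in (True, False):
    for coding in (True, False):
        print(f"functional={functional}, coding={coding}: {classify(functional, coding)}")
```

    The point of the sketch is only that "coding" and "functional" are separate axes, so knowing one does not tell you the other.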

    Obviously we have to be very careful about identifying things as non-functional when it could be that we just haven’t worked out the function yet. We also have to be careful about mistaking dysfunction overall for function that has dysfunctional allele variations (see sickle cell anaemia). But on the other side, we also have to be cautious about labelling any sign of activity as functional, which is the core fault of the ENCODE analysis. They’re all about eliminating Type 2 errors but quite clearly don’t give a damn how many Type 1 errors they make in the process.

    Meanwhile, there are clear examples of non-functional DNA elements: one obvious example is in repeat expansion diseases like Fragile X. Affected people have >200 CGG repeats inside the 5′ untranslated region (i.e. a non-translated part) of the FMR1 gene while unaffected people have 5-44 repeats, and in between is regarded as a “grey zone”. Thus most humans have CGG repeats in the FMR1 gene that are (1) never expressed in protein, and (2) functionally identical whether there are 5 repeats or 44 repeats. That’s up to 39 repeats with no biological function at all, either helpful or deleterious, in a non-coding region. Junk DNA in a nutshell.

    It is also true that evolution as we know it necessitates the existence of junk DNA. Several known processes such as mutation, gene duplication, and transposition must lead to the formation of non-functional, non-coding DNA sequences.

    There can be valid arguments about whether a given sequence is junk, and what fraction of a genome is junk, but there are no current valid arguments for abandoning the concept of junk DNA.

  47. chrislawson says

    Oh, and not all mutated genes have future functionality. Humans have the gene for vitamin C (well, specifically the GLO gene for the enzyme that catalyses the last step in synthesis) but it is broken. We appear to have lost gene function around the time lemurs and the ancestors of monkeys and apes diverged 50-70 Mya. The gene has now lost seven of its twelve exons. It’s not coming back. And there’s no reason to think the degraded remnants of this gene will find a new function with a much greater chance than random code.

    Interestingly, some bats lost vitamin C synthesis and later regained it. These bats showed strong conservation of the “broken” gene before reactivation, which suggests to me that the gene had some secondary function even in its “broken” state that prevented the sequence eroding.

  48. hemidactylus says

    @51- Tethys
    Noncoding DNA doesn’t code for protein. It still can be quite functional. Producing transfer RNA or ribosomal RNA or determining where and when its downstream coding sequence will code for a protein are functions.

    Non-functional (ie- junk) DNA does none of these things. And most likely it will never serve as a reservoir for future function. As raven points out much of it is dead viral regions. It’s all decayed into gibberish whatever the source.

    Actually functional DNA regions, as in genes themselves, may serve as an evolutionary reservoir via duplication and divergence. Regions that are complete gibberish to start with are not very co-optable. But they are not detrimental enough to be gotten rid of, or they are merely neutral.

  49. John Morales says

    chrislawson, that was great.
    Succinct and informative.

    (Not that, unlike PZ, I’m in a position to determine its accuracy, but the vibe is good)

  50. sparc says

    PZ:

    It isn’t a faulty emphasis. It is flat out wrong.

    Not only that but it also completely contradicts evolutionary theory and that’s why creationists of any kind are so happy with ENCODE.

  51. hemidactylus says

    @52- wzrd1
    Could an endogenous retrovirus that was co-opted to be involved in placental formation have ever been classified at any point of its existence in the genome as junk?

    See:
    https://www.pnas.org/doi/full/10.1073/pnas.2132646100

    “An important issue of the present investigation is the discovery that the identified envelope gene has been conserved in primate evolution in a functional state over >40 million years, as the corresponding locus is present from New World monkeys to humans as a full-length coding gene, for which we further demonstrate that the fusogenic function is conserved in all primates tested. Conservation of the fusogenic function is a strong hint for a functional role of the gene in the physiology of the host and suggests a selective process by which the retroviral gene function has been diverted by the host to its own benefit.”

    If it were junk at any point, wouldn’t the diverted gene function instead have been degraded into gibberish?

    Instead of a duplicated gene diverging toward another function, HERVs supply the diverted sequence perhaps…or they instead decay into junk, which is possibly the majority of cases.

    And here: https://journals.asm.org/doi/10.1128/JVI.74.7.3321-3329.2000

    “Here we report the placental expression of an HERV-encoded envelope glycoprotein that exhibits all the features of retroviral envelopes necessary to promote cell-cell fusion. The persistence for more than 25 million years of an env gene encoding a complete retroviral envelope glycoprotein in the genomes of Old World primates as well as its tissue-specific expression in human placenta suggests that evolution has retained a function of this protein that is beneficial for the host.”

    To me functional retention means that it wasn’t at any time non-functional decayed junk.

  52. Dunc says

    Interesting article from the New Yorker: ChatGPT Is a Blurry JPEG of the Web:

    Think of ChatGPT as a blurry JPEG of all the text on the Web. It retains much of the information on the Web, in the same way that a JPEG retains much of the information of a higher-resolution image, but, if you’re looking for an exact sequence of bits, you won’t find it; all you will ever get is an approximation. But, because the approximation is presented in the form of grammatical text, which ChatGPT excels at creating, it’s usually acceptable. You’re still looking at a blurry JPEG, but the blurriness occurs in a way that doesn’t make the picture as a whole look less sharp.

    This analogy to lossy compression is not just a way to understand ChatGPT’s facility at repackaging information found on the Web by using different words. It’s also a way to understand the “hallucinations,” or nonsensical answers to factual questions, to which large language models such as ChatGPT are all too prone. These hallucinations are compression artifacts, but … they are plausible enough that identifying them requires comparing them against the originals, which in this case means either the Web or our own knowledge of the world. When we think about them this way, such hallucinations are anything but surprising; if a compression algorithm is designed to reconstruct text after ninety-nine per cent of the original has been discarded, we should expect that significant portions of what it generates will be entirely fabricated.

  53. isochron says

    Obviously I’m very late to this discussion on ‘junk DNA’ (about which I know nothing much) but would like to respond to #1: “While ChatGPT definitely does make mistakes, it seems to me you’re blowing up what is really a quibble…”

    ChatGPT does not “make mistakes”. It is a con trick. Back in December a thread was posted on Twitter by physicist Teresa Kubacka (link to thread below). She asked ChatGPT a number of questions, including asking it to tell her about a totally fictional physical phenomenon which she knew did not exist. ChatGPT happily churned out a convincingly worded response, based on a number of papers it had been fed, and included citations to works on the topic. The phenomenon was fictional, the cited papers (it turned out) do not exist. ChatGPT is a con trick. Such a good con, in fact, that it seems to have convinced its own creators that it is real.

    https://threadreaderapp.com/thread/1599893718214901760.html

  54. barbara4 says

    It is true that junk DNA occasionally — very, very occasionally — does develop a function that benefits the organism and becomes non-junk DNA. That can lead to the reasoning, “We keep the junk for future use, so it’s not really junk.” No, not really. We keep junk DNA only because we can’t get rid of it. There’s no cellular mechanism to determine what’s useful and what isn’t, then snip out the useless parts.

  55. barbara4 says

    Could an endogenous retrovirus that was co-opted to be involved in placental formation have ever been classified at any point of its existence in the genome as junk?

    Yes. Before it had a useful function for an ancestral mammal, it WAS junk DNA. Later, part of it became useful. But selection doesn’t happen on what will be useful later, but on what is useful now or was in the recent past.

  56. GerrardOfTitanServer says

    What is “Junk DNA”? Junk DNA is the portion of an individual’s genome that doesn’t do anything. More precisely, junk DNA is the DNA that could be removed from a fertilized (human) egg and still have that egg develop into a phenotypically normal (human) individual. A large majority of DNA in the human genome is junk DNA.


    Human DNA is a few percent coding, a few percent non-coding but still functional, a few percent “structural” stuff that could be called functional, 45% confirmed junk DNA, leaving a remainder of about 40% which might have some kind of function but which is almost-certainly mostly-junk.

  57. sparc says

    If one considers how much junk DNA has been removed in the process of making thousands of knockout mice, one would expect that functions of such sequences (e.g. B2 elements in the murine genome) would already have been discovered. However, again and again it is the protein-coding sequences and the regulatory sequences like promoters for which functions and phenotypic consequences have been revealed.