I will be true despite thy scythe and thee.

Our data we once kept in drives and disks,
Protecting it for use some future day;
Predictably, we did not know the risks:
Its storage starts the process of decay
But scan a sonnet; digitize a play
Record a speech—whatever you might choose
And store it in synthetic DNA
Encoded there in zeroes, ones and twos—
Your data, but converted to base three,
Recorded in nucleic acid bases
(We often write them A, C, G and T)
Which guard against the stuff that time erases
Who would have guessed the cutting edge would find
A storage system older than mankind?

So, yeah, take a look at this. I have, in my office, several copies of the works of Shakespeare, in different formats. Facsimile editions of early issues, the Riverside edition, some other things… one on CD-ROM, and (a gift from someone who knows me very well) a wonderful miniature Romeo and Juliet about the size of a matchbox. At my undergrad college, there was a set of Shakespeare in the general collection stacks of the library that was a limited edition printing, with gilt-edged pages and hand-printed illustrations–I wanted to steal it to keep it safe from people who wanted to… erm… steal it. (I didn’t. I hope it is still there.)

Whether you store your Shakespeare in paper form or in ones and zeroes on a flash device, hard drive, or CD-ROM, your choice of medium has a lifespan. It will decay. Electronic storage that continually checks for errors can be, in the long long long term, expensive. At least expensive enough that researchers are willing to look for alternatives. And it turns out, there is a tried and true method of data storage that can handle incredible amounts of data in very little physical space, in a stable medium (given reasonable storage–even in bad conditions, this medium has been known to accurately store data for tens of thousands of years).


In case you missed it, that’s a link to a CNN article on data storage in DNA.

Scientists have developed a technique of storing information in DNA, the molecule found in living creatures including humans that contains genetic instructions. The experiment is discussed in a new study in the journal Nature.

Y’know, it’s kind of funny to hear people talk about how we are going to make this giant leap forward when the singularity comes and we can download our consciousness to some digital form. We practically fetishize digital storage. How does DNA storage compare?

The technique, researchers said, could even encode a zettabyte’s worth of data. That’s enough to encompass the total amount of digital information that currently exists on Earth, which would be “breathtakingly expensive” right now, Birney said.

Researchers used five different kinds of digital information to show that their method would work to preserve a variety of media in DNA. These included a text file with William Shakespeare’s 154 sonnets, a PDF of a scientific paper, a photo in JPEG format of the European Bioinformatics Institute, and an MP3 audio excerpt of Martin Luther King’s “I Have a Dream” speech.

Scientists showed that they could encode these files in DNA and then, by sequencing the DNA, reconstruct them with 100% accuracy.

Damn, that is cool.

It’s not in binary, though, much as I love my ones and zeroes; that’s not the way DNA stores data:

Text on your computer, while it may look like words, is actually encoded in your computer as ones and zeros – this is called binary. For the purposes of DNA synthesis, scientists took that information and converted it to base 3 – that is, zeroes, ones and twos.

From there, the data gets translated into collections of DNA’s nucleic acid bases, represented by the letters A, C, G and T.

That’s how scientists encode the DNA fragments.
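
For the curious, here is a minimal Python sketch of that two-step translation (binary to base 3, then base 3 to A, C, G and T). The article does not spell out the actual mapping, so the details here are assumptions rather than the researchers' exact code: each byte becomes a fixed six base-3 digits, and each digit then picks one of the three bases that differ from the previous base, so the same letter never appears twice in a row. The function names and the imaginary starting base `"A"` are made up for illustration.

```python
BASES = "ACGT"


def bytes_to_trits(data: bytes) -> list[int]:
    """Turn each byte into six base-3 digits, most significant first."""
    trits = []
    for value in data:
        digits = []
        for _ in range(6):  # 3**6 = 729 >= 256, so six trits can hold any byte
            value, remainder = divmod(value, 3)
            digits.append(remainder)
        trits.extend(reversed(digits))
    return trits


def trits_to_dna(trits: list[int], start: str = "A") -> str:
    """Map each trit to one of the three bases that differ from the previous base.

    `start` is an imaginary base sitting before the strand (an assumption).
    """
    previous = start
    strand = []
    for trit in trits:
        choices = [base for base in BASES if base != previous]
        previous = choices[trit]
        strand.append(previous)
    return "".join(strand)


def dna_to_trits(dna: str, start: str = "A") -> list[int]:
    """Invert trits_to_dna by replaying the same choice lists."""
    previous = start
    trits = []
    for base in dna:
        choices = [b for b in BASES if b != previous]
        trits.append(choices.index(base))
        previous = base
    return trits


def trits_to_bytes(trits: list[int]) -> bytes:
    """Invert bytes_to_trits: read six trits at a time as one byte."""
    out = bytearray()
    for i in range(0, len(trits), 6):
        value = 0
        for trit in trits[i:i + 6]:
            value = value * 3 + trit
        out.append(value)
    return bytes(out)


line = b"I will be true despite thy scythe and thee."
strand = trits_to_dna(bytes_to_trits(line))
recovered = trits_to_bytes(dna_to_trits(strand))
print(strand[:24], recovered == line)
```

In this toy scheme every byte costs six bases, and reading the strand back recovers the original text exactly. A real system would also need redundancy and error checking, which this sketch leaves out.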

One last thing… my silly little sonnet, up above there? If that were converted to DNA, what size of storage device are we looking at?

DNA has the advantage of being light and small, researchers said. One of Shakespeare’s sonnets would weigh 0.3 picograms (0.3 × 10^-12 grams), said Nick Goldman, lead study author.

At the risk of repeating myself… Damn, that is cool.

(Blog post title from sonnet 123, if you were wondering.)


  1. Becca Stareyes says

    I hate to ruin a good poem, but DNA would be base 4 — it has four symbols that you can map to 0, 1, 2 and 3. Just like base 2 has two symbols (0 and 1), base 10 has ten symbols, and base 16 has sixteen symbols.

  2. Cuttlefish says

    Read the story–whether or not DNA is base 4, the researchers used base 3 for their code. I had the same thought, but since it wasn’t my methodology, I figured I’d defer to the folks who actually did the work.

  3. says

    For some reason, I was expecting the prose sections of this post to be iambic free verse. So close:

    So, yeah, take a look at this. I have,
    in my office, sev’ral copies of the works of Shakespeare,
    in different formats.
    Facsimile editions of early issues,
    The Riverside, some other things…
    one on CD-ROM, and
    (a gift from one who knows me very well)
    a miniature Romeo
    and Juliet about the size
    of a matchbox. At my undergrad college,
    there was a set of Shakespeare in the general collection
    of the library
    ’Twas a limited edition,
    All gilt-edged pages and illustrations made by hand–
    I would have stolen it to keep it safe from those who’d steal it first.
    (Which I did not. I hope it is still there.)

  4. howardpeirce says

    Well, crap. My comment last used my ID
    Set up by Twitter. I do not use
    My Poe account to comment here.
    Am I in moderation thus?

    Ask not from whom the moderated
    Comment comes. It comes from me.

  5. Robert B. says

    I noticed that, too – I wonder why it isn’t quaternary?

    Reading the article, I wonder if that might have to do with error protection. For example, they mention that they never repeat a base two or more times in a row, because that can cause errors. (I seem to remember hearing about such errors occurring in living DNA, too.) They might use the fourth base as a “spacer.” For example, a binary byte 11001101 (205 in decimal) might be translated to trinary as 021121, which might be written on DNA as AGCCGC. But that would repeat the C base, so they would actually write it as ATGTCTCTGTCT, using the fourth base as a way to make repetition impossible.

    (This is all guesswork, by the way, but it explains the otherwise confusing decision to code in trinary.)

    I think they might actually get more information density by using binary, though. They’d just have to have two bases that meant 0 and two that meant 1, and alternate between them. In this scheme, 11001101 might be CTAGCTAT, taking only eight bases to write instead of twelve. (Converting from binary to trinary is wasteful anyway – a byte of 8 bits translates to 6 trits [trinary digits] but the first trit is mostly wasted, coming out as 0 about 95% of the time, since only the values 243 through 255 need it. Translating every 6 bits into 4 trits would be cleaner, but wouldn’t match the way binary computers store data, which probably has its own error-prone problems.)

  6. howardpeirce says

    I wonder whether the fourth bit might be used as a checksum.

    I’m reminded of all those 7-bit paper tapes NASA has from the Gemini and Apollo missions.

  7. Robert B. says

    The potential fourth digit is not a “bit” in the sense you mean. We’re thinking along the same lines – the trinary thing has something to do with error protection – but a checksum means a specific math trick that wouldn’t apply here. (They might very well be using checksums, which work just as well in any base, but that wouldn’t be the reason to code in trinary rather than quaternary.)
