The January issue of Communications of the ACM has an article, written by a team of researchers at the Pacific Northwest National Laboratory, about storing information within the DNA sequences of... living organisms! The main purpose of the research was to find a way to protect vital information in case of a major nuclear catastrophe. Some bacteria, like the very common Escherichia coli and Deinococcus, can endure something like 1000x more radiation than humans can, and can also survive extreme environmental conditions (ultraviolet, desiccation, partial vacuum). That makes them very good candidates for carrying information that has to be retrieved after a large-scale nuclear accident or war or catastrophe or alien attack or whatever.

Any information, in order to be represented and then saved, must be somehow encoded. Even when we leave a note on the door: "I'll be back in 4 minutes", we use the 'English language' representation (which uses the Latin alphabet). Since DNA is built from four kinds of nucleotides, each carrying one of the bases A (Adenine), C (Cytosine), G (Guanine), and T (Thymine), these must be the "bits" of our information representation. These four bases form pairs, and more specifically, Adenine pairs with Thymine (AT) and Cytosine with Guanine (CG). The researchers developed a simple encoding scheme in order to represent the Latin alphabet plus some other basic symbols using sequences of three bases (triplets). Below is the encoding scheme:

AAA: 0 | AAC: 1 | AAG: 2 | AAT: 3 | ACA: 4 | ACC: 5 | ACG: 6 | ACT: 7
AGA: 8 | AGC: 9 | AGG: A | AGT: B | ATA: C | ATC: D | ATG: E | ATT: F
CAA: G | CAC: H | CAG: I | CAT: J | CCA: K | CCC: L | CCG: M | CCT: N
CGA: O | CGC: P | CGG: Q | CGT: R | CTA: S | CTC: T | CTG: U | CTT: V
GAA: W | GAC: X | GAG: Y | GAT: Z | GCA: SP| GCC: : | GCG: , | GCT: -
GGA: . | GGC: ! | GGG: ( | GGT: ) | GTA: ` | GTC: ‘ | GTG: “ | GTT: "
TAA: ? | TAC: ; | TAG: / | TAT: [ | TCA: ] | TCC: | TCG: | TCT:
TGA: | TGC: | TGG: | TGT: | TTA: | TTC: | TTG: | TTT:
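
The table follows a simple pattern: the triplets are listed in alphabetical order over A, C, G, T, and the symbols are assigned to them one after the other. Here is a minimal Python sketch of an encoder and decoder built on that observation. The names SYMBOLS, encode and decode are my own and do not come from the article, and the quote characters are taken exactly as they appear in the table above, so treat this as an illustration rather than the researchers' actual software.

```python
from itertools import product

# Symbols in the order the table assigns them, starting at triplet AAA.
# The four quote-like characters are copied from the table as printed.
SYMBOLS = (
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    " :,-.!()`\u2018\u201c\"?;/[]"
)

# All 64 triplets in alphabetical order: AAA, AAC, AAG, AAT, ACA, ...
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]

# zip() stops after the 53 assigned symbols; the remaining triplets stay unused.
ENCODE = dict(zip(SYMBOLS, TRIPLETS))
DECODE = {triplet: symbol for symbol, triplet in ENCODE.items()}


def encode(text: str) -> str:
    """Turn text into a base sequence; raises KeyError for unsupported symbols."""
    return "".join(ENCODE[ch] for ch in text.upper())


def decode(dna: str) -> str:
    """Turn a base sequence back into text, three bases at a time."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(DECODE[dna[i:i + 3]] for i in range(0, len(dna), 3))


if __name__ == "__main__":
    print(encode("HELLO"))            # CACATGCCCCCCCGA
    print(decode("CACATGCCCCCCCGA"))  # HELLO
```

Note that lowercase letters are folded to uppercase, since the table only has one case, and that the plain ASCII apostrophe is not in the table at all.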

These triplets can be used in order to encode any English text, in much the same way that computers use the binary digits 0 and 1. Triplets only give 4^3 = 64 combinations, though. Of course, I guess that we could use a slightly more complex encoding scheme based on tetraplets (sequences of four bases), which give 4^4 = 256 different combinations, in order to represent all byte values, with obvious advantages.
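
To make the byte idea concrete: each base can stand for two bits, so four bases cover exactly one byte. The sketch below is my own guess at how such a scheme could look, not something described in the article. The mapping A=00, C=01, G=10, T=11 and the function names are assumptions chosen for illustration.

```python
BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_bytes(dna: str) -> bytes:
    """Decode a base sequence back into bytes, four bases at a time."""
    if len(dna) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    result = bytearray()
    for i in range(0, len(dna), 4):
        value = 0
        for base in dna[i:i + 4]:
            # BASES.index raises ValueError on anything other than A, C, G, T
            value = (value << 2) | BASES.index(base)
        result.append(value)
    return bytes(result)


if __name__ == "__main__":
    print(bytes_to_dna(b"A"))    # CAAC  (0x41 = 01 00 00 01)
    print(dna_to_bytes("CAAC"))  # b'A'
```

With something like this, any file could be stored, not just uppercase English text, at the cost of one extra base per symbol.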

One of the best parts of this new storage technology is that the information inserted in the DNA sequences of hosts like these bacteria remains intact for at least a hundred generations and, possibly, more. The technological background to achieve this was developed by God Laboratories, during the last million years, in a trial-and-error study that produced efficient mechanisms to detect and correct errors caused by random mutations in the DNA of living organisms. The researchers of PNNL (Pacific Northwest blah blah) said:

"With the extremely efficient DNA repair mechanisms associated with Deinococcus, we did not detect any mutations in our experiment in which we retrieved the DNA after the bacteria that carried the message was allowed to propagate for about a hundred generations."

And the storage potential of such technology is awesome: if we consider, the scientists say, that a litre of liquid can contain up to 10^12 bacteria, it is clear that the storage capabilities are enormous. They do not say, though, how much information each bacterium can hold, and not being a biologist I cannot know the details, but even if each bacterium could hold just one single bit, a litre of water could store 1 terabit, which is about 125 gigabytes... Who needs those damn 720 KB floppies anymore!
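
For the curious, here is the back-of-the-envelope arithmetic as a tiny Python sketch. It reads the article's figure as 10^12 bacteria per litre and keeps the deliberately pessimistic assumption of one bit per bacterium. The function name and the choice of decimal gigabytes are mine.

```python
BACTERIA_PER_LITRE = 10**12  # upper figure quoted from the article
BITS_PER_BACTERIUM = 1       # pessimistic assumption, not a measured value


def litre_capacity_bytes(bacteria: int = BACTERIA_PER_LITRE,
                         bits_each: int = BITS_PER_BACTERIUM) -> int:
    """Whole bytes that fit in one litre under the assumptions above."""
    return (bacteria * bits_each) // 8


if __name__ == "__main__":
    capacity = litre_capacity_bytes()
    print(capacity / 10**9, "GB")  # 125.0 GB
```

So one bit per bacterium gives a terabit per litre, not a terabyte. You would need eight bits per bacterium to reach the full terabyte.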

Even better is the fact that information stored within the DNA sequences of living organisms does not need backups! Since the organisms reproduce, they actually create backups themselves and spread them around!

Potential Applications


Bibliography

"Organic Data Memory Using the DNA Approach", Communications of the ACM, Vol. 46, No. 1, January 2003.