Dr. Sriram Kosuri of Harvard University’s Wyss Institute (Pic:http://openwetware.org/wiki/Sriram_Kosuri)
Archival storage of the world’s ever-increasing volume of information is reaching a point where it will require a breakthrough technology to meet the enormous need.
Although flash drives and external hard drives have become remarkably efficient and eminently affordable, the sheer pace at which data is created every day could make them impractical. Physical wear and tear also makes them impractical over the long term.
As the world confronts 2.7 zettabytes of information (a projection by the International Data Corporation; 1 zettabyte is 1 followed by 21 zeroes), it makes perfect sense to look for alternative ways to archive it. A team of three scientists, George M. Church, Yuan Gao and Sriram Kosuri, has done precisely that. They wrote data to deoxyribonucleic acid, or DNA, nature’s astonishingly large data storage system. DNA, as we know, contains the genetic instructions for living organisms.
Hard as I may try, I am still unable to fully picture or comprehend how DNA data storage works. Interviewing one of the three researchers at Harvard University’s Wyss Institute was perhaps one way to cover some ground. So I decided to contact Dr. Kosuri, who responded rather expeditiously to my questions via email. Reproduced below is the written Q and A between Dr. Kosuri and me. I am writing a separate news story for the IANS wire.
Q: Would you mind describing the concept of DNA data storage as opposed to, say, on a flash drive?
Sriram Kosuri (SK): In terms of the paper (that they wrote for Science), on its face DNA storage is quite simple and has been done on small scales previously. DNA is a four-base code, and individual bits (1's and 0's) can be encoded onto the DNA using an arbitrary encoding scheme. We chose a simple code where two bases (A & C) correspond to 1 and the other two (G & T) to 0.
There are two main advantages of DNA storage as compared to other forms. First, it's an extremely dense form of information storage, mostly because you can store the DNA in 3D. So it's a 3D storage technology, and when compared to other technologies (including experimental ones like positioning individual atoms on a surface, or holographic storage), it's much more dense. This is despite major efforts to increase data storage densities in these other technologies over decades. The other main advantage is that DNA is very long-lived compared to existing data storage mechanisms, and sequencing DNA is probably going to be a readable standard for years to come (unlike, for example, magnetic tape).
The major disadvantages are that it is immutable (can't be modified once written), not random-access (have to read the whole thing to get any part), and slow (right now), and thus is more of an archival storage mechanism than directly comparable to flash or hard drives.
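To make the one-bit-per-base idea in this answer concrete, here is a minimal Python sketch. It uses the mapping exactly as quoted in the interview (A or C for 1, G or T for 0). The rule for picking between the two allowed bases, which here simply avoids repeating the previous base, is an assumption added for illustration and is not described in the interview. The function names are hypothetical.

```python
# Mapping as quoted in the interview: A or C encodes 1, G or T encodes 0.
BASES_FOR_BIT = {"1": ("A", "C"), "0": ("G", "T")}
BIT_FOR_BASE = {"A": "1", "C": "1", "G": "0", "T": "0"}


def text_to_bits(text: str) -> str:
    """Turn text into a string of 1's and 0's (8 bits per UTF-8 byte)."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def bits_to_text(bits: str) -> str:
    """Inverse of text_to_bits."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


def encode_bits(bits: str) -> str:
    """Write one base per bit.

    Assumption for illustration: when the first allowed base would repeat
    the previous base, use the second one instead.
    """
    dna = []
    for bit in bits:
        first, second = BASES_FOR_BIT[bit]
        base = second if dna and dna[-1] == first else first
        dna.append(base)
    return "".join(dna)


def decode_bases(dna: str) -> str:
    """Read the bits back: either base of a pair gives the same bit."""
    return "".join(BIT_FOR_BASE[base] for base in dna)


if __name__ == "__main__":
    message = "DNA"
    strand = encode_bits(text_to_bits(message))
    print(strand)
    print(bits_to_text(decode_bases(strand)))
```

Because two bases stand for each bit, the writer has some freedom in choosing the sequence, while the reader does not need to know which of the two was chosen.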
Q: The common perception of DNA is some sort of a life code sitting inside a fuzzy, stringy cell. For those who hold that perception, would you care to explain what it means to write data to DNA?
SK: Sure. DNA is a chemical polymer, essentially a chain of nucleotides: adenine (A), guanine (G), cytosine (C), and thymine (T) in some order. To build it up, you have to string together the sequence you want one base at a time. The process we use was from a company called Agilent Technologies (a spinoff of HP) that uses ink-jet printing to build up such a sequence. Simplified, it's like replacing the inks in your 4 color ink jet printer with the four nucleotides, and then building up sequences on individual spots. The limitations of chemically synthesized DNA are that there are always errors, and you can't make more than small stretches (around 200 bases at a time), but you can do it in a highly parallel manner.
Q: For instance, is there a cell involved?
SK: No cell involved. It's all pretty well established chemistry. The DNA never sees the inside of a cell.
Q: Can you picture for us what a DNA data storage device, if there is any such thing, would look like?
SK: We do this exercise in the supplement, as shown below:
Large-scale Storage Considerations
At some point, storing DNA as a single large mass with extremely large barcodes is both unrealistic and cumbersome no matter the future sequencing and synthesis technologies. To understand where this trade-off lies, we hypothetically imagine such a data store without constraining ourselves on sequencing/synthesis costs.
A larger 48-bit address (2.8e14 unique addresses) block with a 128bit data block would require 216nt length oligo synthesis, which is already available. Such a scheme would give 1.85e-19 g/bit, which would give 1.48 mg of DNA for storage of 1 petabyte of information (at 100x coverage). This DNA, stored in 1536 well plate, would give ~1.5 exabytes with dimensions of 128mm x 86mm x 13mm.
Reading each well (petabyte) would require ~1 exabase of sequencing, or 1.8e6 HiSeq runs (600e9 bases per run). Thus, we would need ~6 orders of magnitude improvement in sequencing technologies for routinely reading petabytes of DNA information. For synthesis, current Agilent arrays top out at 1e6 features (we need 6.25e13 features for a petabyte); so ~7-8 orders of magnitude improvement in synthesis technologies is required. However, reading and writing costs do not dominate in very long-term storage applications (e.g., century or longer archival storage), and could currently be cost competitive due to the expected lower maintenance costs.
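The figures in the supplement excerpt above can be re-derived with a few lines of arithmetic. The Python sketch below is only a check on the quoted numbers, not the authors' own calculation. It assumes a decimal petabyte (10^15 bytes, 8 bits each), and the variable names are made up for illustration.

```python
import math

# Figures quoted in the supplement excerpt.
ADDRESS_BITS = 48
DATA_BITS_PER_OLIGO = 128
GRAMS_PER_BIT = 1.85e-19          # quoted as already including 100x coverage
WELLS_PER_PLATE = 1536            # one petabyte per well
BASES_TO_READ_ONE_WELL = 1e18     # "~1 exabase of sequencing"
BASES_PER_HISEQ_RUN = 600e9
CURRENT_ARRAY_FEATURES = 1e6

# Assumption: a decimal petabyte, 10**15 bytes of 8 bits each.
BITS_PER_PETABYTE = 8 * 10**15

unique_addresses = 2 ** ADDRESS_BITS
mass_mg_per_petabyte = GRAMS_PER_BIT * BITS_PER_PETABYTE * 1000
oligos_per_petabyte = BITS_PER_PETABYTE / DATA_BITS_PER_OLIGO
plate_capacity_exabytes = WELLS_PER_PLATE / 1000
hiseq_runs_per_well = BASES_TO_READ_ONE_WELL / BASES_PER_HISEQ_RUN

# Gaps expressed as orders of magnitude (powers of ten).
synthesis_gap = math.log10(oligos_per_petabyte / CURRENT_ARRAY_FEATURES)
sequencing_gap = math.log10(hiseq_runs_per_well)

print(f"unique addresses:        {unique_addresses:.2e}")
print(f"DNA mass per petabyte:   {mass_mg_per_petabyte:.2f} mg")
print(f"oligos per petabyte:     {oligos_per_petabyte:.2e}")
print(f"plate capacity:          {plate_capacity_exabytes:.2f} exabytes")
print(f"HiSeq runs per well:     {hiseq_runs_per_well:.2e}")
print(f"synthesis gap (log10):   {synthesis_gap:.1f}")
print(f"sequencing gap (log10):  {sequencing_gap:.1f}")
```

Dividing the rounded "~1 exabase" by 600e9 bases per run gives roughly 1.7 million runs, the same order as the 1.8e6 quoted in the excerpt; the small difference presumably comes from the rounding of the exabase figure.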
Q: Extraordinary longevity of stored data is one of the major benefits of DNA data storage. Is that one of the considerations here?
SK: Yes. It's one of the major reasons this is really good for archival information. DNA can last for thousands of years in dried form, and is very stable.
Q: What about reading back the stored data? How is that accomplished?
SK: We use the same instruments that people use to sequence human genomes. The process involves using what people call next-generation sequencers in order to read out all the data cheaply.
Q: Is it theoretically possible to encode the data to a living organism, say for instance, oneself? If yes, what would be the implications? To me that would seem like a strange form of technological singularity.
SK: It is possible, but there are a lot of disadvantages. First, some of the sequences could be detrimental to the cell and could be lost. Second, because the information is extraneous to the cell's functioning, it's likely to be mutated or deleted wholesale in the long term. Finally, there is no real use to keep it inside the cell, as the cell or a body wouldn't really know what to do with it. It's not like we can read our own genomes inside ourselves and learn from it.
Q: How soon would DNA data storage become a market reality?
SK: It's possible that it is doable now for very specialized long-term archival information needs (>100 years). However, for large scale storage, as we estimated in that pasted text, we are 6 and 8 orders of magnitude away for sequencing and synthesis technology scales respectively. To give perspective, we have seen a similar order drop in the last 10 years or so for both of those technologies. There are many reasons why we might not continue to increase these scales at the torrid pace we have, but if we do keep pace, about a decade.
Q: Your work led to the storage of 5.27 megabits of data, which is 600 times more than what had been achieved earlier. How was that accomplished?
SK: The main thing we do is that we are able to leverage the massive cost and scale advantages of next-generation synthesis and sequencing technologies while overcoming their limitations, which is that they can only read and write short pieces. Thus we make many short pieces, and each short piece has an address that lets us know where that data goes in the whole data stream.
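The addressing idea in this answer can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the team's actual pipeline: the block sizes below are arbitrary, the function names are hypothetical, and the pieces are kept as strings of bits instead of DNA bases to keep the example short.

```python
import random


def split_into_addressed_pieces(bits: str, data_bits: int, address_bits: int) -> list[str]:
    """Cut a long bit string into short pieces, each prefixed with its address."""
    piece_count = (len(bits) + data_bits - 1) // data_bits
    if piece_count > 2 ** address_bits:
        raise ValueError("address field too small for this much data")
    pieces = []
    for index in range(piece_count):
        address = format(index, f"0{address_bits}b")   # fixed-width binary address
        payload = bits[index * data_bits:(index + 1) * data_bits]
        pieces.append(address + payload)
    return pieces


def reassemble(pieces: list[str], address_bits: int) -> str:
    """Put the pieces back in order using only the address each one carries."""
    ordered = sorted(pieces, key=lambda piece: int(piece[:address_bits], 2))
    return "".join(piece[address_bits:] for piece in ordered)


if __name__ == "__main__":
    original = "".join(f"{byte:08b}" for byte in b"many short pieces")
    # Illustrative sizes only: 12 data bits and 5 address bits per piece.
    pieces = split_into_addressed_pieces(original, data_bits=12, address_bits=5)
    random.shuffle(pieces)   # pieces come back from the sequencer in no particular order
    print(reassemble(pieces, address_bits=5) == original)
```

The point of the address is that order does not have to be preserved physically: the pieces can sit mixed together, and the data stream is rebuilt from the addresses alone.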
Q: How does this technology potentially change computing in general?
SK: Well, it's mostly a concept paper right now. The real concept that this paper brings up is that we can continue on a path to try to improve data densities on existing technologies to get to the densities we display with DNA, or we can work on getting DNA storage technologies scaled and cheaper as an alternative path.