Stabilizing synthetic data in the DNA of living organisms
Data-encoding synthetic DNA, inserted into the genome of a living organism, is thought to be more robust than the current media. Because the living genome is duplicated and copied into new generations, one of the merits of using DNA material is long-term data storage within heritable media. A disadvantage of this approach is that encoded data can be unexpectedly broken by mutation, deletion, and insertion of DNA, which occurs naturally during evolution and prolongation, or laboratory experiments. For this reason, several information theory-based approaches have been developed as an error check of broken DNA data in order to achieve data durability. These approaches cannot efficiently recover badly damaged data-encoding DNA. We recently developed a DNA data-storage approach based on the multiple sequence alignment method to achieve a high level of data durability. In this paper, we overview this technology and discuss strategies for optimal application of this approach.
Keywords: DNA data storage; Sequence alignment; Polymerase chain reaction (PCR); Error check; Error correction; Genetically modified organism (GMO)
“Increasing performance of CPUs and memories will be squandered if not matched by a similar performance increase in I/O. While the capacity of Single Large Expensive Disks (SLED) has grown rapidly, the performance improvement of SLED has been modest. Redundant Arrays of Inexpensive Disks (RAID), based on the magnetic disk technology developed for personal computers, offers an attractive alternative to SLED, promising improvements of an order of magnitude in performance, reliability, power consumption, and scalability.”
David Patterson, Garth Gibson, and Randy Katz (1988), when they first proposed RAID
DNA consists of stable double-stranded long polymers of only four different nucleotides: adenine (A), cytosine (C), guanine (G) and thymine (T). For living organisms, the primary role of DNA is long-term and inheritable data storage of genetic information, which is the set of blueprints to construct cellular components, such as RNA and protein molecules. The features of a DNA sequence are analogous to a digital data sequence. For this reason, recent studies have focused on the behavior of DNA in the application of artificial data memory. For example, by inserting a synthetic data-encoding DNA molecule into the genome of living organisms, the encoded data can be reproducible and inheritable in the small media vessel of the living cell. The data-storage media broadly used today, including paper, magnetic media, and silicon chips, are easily damaged by human-caused and spontaneously occurring accidents, and require constant time-consuming and laborious maintenance. The properties of DNA provide the potential for realization of long-term data storage, which can be maintained as archived data for hundreds to thousands of years (Bancroft et al. 2001; Cox 2001; Smith et al. 2003; Yachie et al. 2007; Wong et al. 2003). Since Clelland et al. (1999) demonstrated the encryption of hidden messages into DNA, the DNA-data encoding approach has essentially focused on DNA steganography, useful for the identification of genetically modified organisms (GMO) by its encoded signature or trademark in genomic DNA (Arita and Ohashi 2004; Heider and Barnekow 2007a, b; Heider et al. 2008; Wong et al. 2003). However, the artificial DNA sequences can be changed or degraded by laboratory treatment during the preparation of synthetic data-encoded DNA, and by genetic mutations, deletions, and insertions in generations beyond those of the data-stored cells. Furthermore, sequencing errors can be introduced when a DNA sequencer retrieves the encoded data.
For these reasons, sophisticated methodology is required to ensure the robustness and durability of encoded data in order to realize the significant potential for DNA-mediated data storage for large amounts of important information over an extended period of time.
PCR-based readout and error checkable codes
The genomic DNA segment of the data-encoded region is altered, as diagrammatically represented in Fig. 1b, according to mechanisms underlying natural selective pressure in a living system. Moreover, in the data encoding and decoding procedures, data breakage can occur due to human error. Therefore, several approaches based on information theory have been proposed to check for errors in the DNA and broken DNA data (Arita 2004; Smith et al. 2003).
A strategy employing a binary comma-free code for DNA memory has also been proposed; this code utilizes the binary comma-free and error-correctable data sequence in the DNA letters (Arita 2004). Although this comma-free code is robust and the error correction works to correct against small-scale damage such as DNA point mutations, this system does not have the capacity to recover broken data when a large DNA segment is deleted from the data-encoded DNA region.
A major disadvantage of these PCR-based methods is the susceptibility of the template DNA regions to errors. When the forward or reverse primer region is broken, PCR amplification can fail and the entire data sequence can become unreadable. No code can protect the data against this kind of damage to the template regions.
Concept of alignment-based readout
For simpler, more flexible, and more durable data storage using DNA material, we recently proposed the alignment-based method (Yachie et al. 2007), which is independent of the PCR-based readout procedure and relies instead on the sequence alignment method (Altschul et al. 1990). In this data-storage method, multiple oligonucleotide sequences encoding the same data are redundantly inserted into the genome of a living organism. When the encoded data is retrieved, the complete genome sequence, including multiple segments of inserted DNAs, is first read by a DNA sequencer, and then the multiple regions encoding the same data are searched for by sequence alignment. Based on the multiple alignments of sequences of data-encoded regions, errors that may have occurred within the encoded data can be checked and corrected.
Storing multiple-copied data
When multiple segments of the same DNA sequence exist within a single genome, or even within a single cell, there is the possibility for problems to occur with the storage of data. In particular, the existence of two of the same DNA sequences within a bacterial genome is known to induce homologous DNA recombination (Kowalczykowski et al. 1994; Kuzminov 1999). Homologous DNA recombination subsequently can disturb the growth and development of data-storage cells or the inserted DNA sequences may be removed from the genome in early cell division in cell culture. Therefore, in order to safely encode multiple-copied data sequences into genomic DNA, unique DNA oligomers encoding the same data according to different data encryption procedures are required.
We previously demonstrated one of the simplest procedures to define the multiple and reversible transformations from a single data sequence to multiple DNA sequences (Yachie et al. 2007). In this procedure, a set of ‘codons’ is prepared that indicates the relationships between all possible patterns of x letters in a data sequence and their assigned DNA segments of the same size. There are x possible reading frames of ‘codons’ according to a one-by-one frame shifting of data letters in the target data sequence region, thus x different DNA sequences can be designed from the data sequence. This process mimics the DNA codons used for intracellular protein synthesis. There are three possible reading frames of three-letter DNA codons in the DNA sequence to encode the amino acid sequence of the protein.
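The frame-shifting idea can be illustrated with a minimal sketch. It assumes binary data, x = 3, and an arbitrary codon table chosen only for illustration; the table, the zero padding used to shift the frame, and the function names are hypothetical and are not the published design.

```python
from itertools import product

X = 3  # codon length in data letters (assumed for this sketch)

# Hypothetical codon table: every 3-bit pattern gets a 3-nucleotide segment.
PATTERNS = ["".join(p) for p in product("01", repeat=X)]
SEGMENTS = ["ACG", "CGT", "GTA", "TAC", "AGC", "CTG", "GAT", "TCA"]
ENCODE = dict(zip(PATTERNS, SEGMENTS))
DECODE = {segment: pattern for pattern, segment in ENCODE.items()}


def encode_frame(bits, frame):
    """Encode a bit string in reading frame `frame` (0 <= frame < X).

    The frame is shifted by prepending `frame` padding bits; the tail is
    padded with zeros so the length is a multiple of X (an assumption).
    """
    padded = "0" * frame + bits
    padded += "0" * (-len(padded) % X)
    return "".join(ENCODE[padded[i:i + X]] for i in range(0, len(padded), X))


def decode_frame(dna, frame, n_bits):
    """Reverse encode_frame, given the frame and the original data length."""
    bits = "".join(DECODE[dna[i:i + X]] for i in range(0, len(dna), X))
    return bits[frame:frame + n_bits]


data = "1011001110"
for frame in range(X):
    dna = encode_frame(data, frame)
    print(frame, dna, decode_frame(dna, frame, len(data)) == data)
```

Each frame yields a different DNA sequence for the same data, which is the property needed to avoid inserting identical copies into one genome.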
Data retrieval by sequence alignment
In the data retrieval procedure, the complete genomic sequence harboring multiple synthetic DNA oligomers is fully sequenced by using a DNA sequencer, and then, the total sequence of genomic DNA is decompressed to multiple data sequences by using the decoding functions that are paired with the respective encoding functions used for data storage (Fig. 3c). The majority of regions of the respective long sequences decoded at the genomic level are nonsense, and they are mostly different from each other, because the different decoding functions are performed for a single genomic sequence. According to Eq. 8, the data sequence encoded in each synthetic DNA region of the genome accurately appears within the partial region of certain long data sequence transformed from the genomic sequence by the decoding function, which is the reverse of the encoding function used for the design of the respective region. Therefore, if all the data-encoded regions are not broken by DNA errors, every long decoded sequence must include the same unique data sequence in its partial region (Fig. 3c). By progressing through the series of data handling procedures, it is possible to search for and finally read out the same data sequence of encoded data by using the sequence alignment function.
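The search step can be sketched in a simplified form. The example below assumes that the genome has already been decoded by each decoding function into one long string per function, and it uses an exact longest-common-substring search as a stand-in for sequence alignment; a real readout would need an alignment tool that tolerates mismatches and gaps. The toy strings and the function name are hypothetical.

```python
def longest_common_substring(seqs):
    """Return the longest substring shared by every sequence in `seqs`.

    Brute-force search: candidates are taken from the shortest sequence,
    longest first, and the first one found in all sequences is returned.
    """
    shortest = min(seqs, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in seqs):
                return candidate
    return ""


# Toy decoded sequences: the flanking letters stand for the "nonsense"
# obtained when a decoding function is applied outside its own region.
decoded = [
    "QZXHELLOWORLDPK",
    "MMHELLOWORLDTTV",
    "GHELLOWORLDBBN",
]
print(longest_common_substring(decoded))
```

The brute-force search is only practical for short toy inputs; it is meant to show why undamaged copies surface as a shared region among otherwise unrelated decoded sequences.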
Error check and correction by sequence alignment
At the end of the readout procedure, data durability can be further enhanced by taking advantage of the sequence alignment method. Because of the associative rules in the encoding and decoding functions defined in Eqs. 6 and 7, DNA mutations, deletions, and insertions of synthetic DNA are the causal factors of point breakage of data sequence, sectional data deletion, and nonsense data insertion, respectively. The types and positions of DNA errors are directly related to the errors in the decompressed data sequence. Therefore, according to this rule, even if some DNA errors are randomly contained in the multiple synthetic regions of data-stored genomic DNA, we can find the multiple-copied but partially broken data sequences by searching for similar data sequences in the respective long data sequences decoded from the genome, and the mismatches of aligned data sequences can identify the position of broken data-encoded sequences. Accordingly, the multiple-copied data sequences encoded within the different features of DNA sequences can fulfill an error check function.
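As an illustration of the error check, the sketch below assumes that the decoded copies have already been aligned to equal length, with '-' marking a gap, and reports the columns in which the copies disagree. The input strings and the function name are hypothetical.

```python
def mismatch_columns(aligned):
    """Return the 0-based alignment columns in which the copies differ."""
    return [i for i, column in enumerate(zip(*aligned)) if len(set(column)) > 1]


copies = [
    "1011001110",  # intact copy
    "1011101110",  # point breakage at column 4
    "10110-1110",  # deletion (gap) at column 5
]
print(mismatch_columns(copies))
```

Any non-empty result signals that at least one copy is broken, although it does not by itself say which copy is correct.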
When only two data sequences are decompressed from the synthetic DNA regions, the majority decision rule cannot be applied, and only the error check can be performed. At least three synthetic DNAs are necessary for error correction of broken data, and the cost of higher redundancy must be paid to achieve more robust durability of stored data, leading to a trade-off relationship between data durability and the copy number of encoded data. More durable data storage with stronger error correction can be achieved by the insertion of more synthetic DNA into the genome. This trade-off appears to favor data durability. By performing a computational simulation of data retrieval using this alignment-based method, we previously suggested that when the data storage is conducted by four synthetic DNAs and when as much as 15% of the data-stored genome is randomly mutated or deleted, the data recovery rate exceeds 99% (Yachie et al. 2007). According to combinatorial theory-based logic, increasing the number of synthetic DNA copies can improve data durability at an exponential rate.
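The majority decision rule can be sketched as a column-wise vote over aligned copies. This is a minimal illustration that assumes the copies are already aligned to equal length; the '?' marker for unresolved columns and the function name are hypothetical choices.

```python
from collections import Counter


def majority_consensus(aligned):
    """Column-wise strict-majority vote over equal-length aligned copies.

    A column without a strict majority is reported as '?', meaning an
    error was detected but could not be corrected.
    """
    consensus = []
    for column in zip(*aligned):
        symbol, count = Counter(column).most_common(1)[0]
        consensus.append(symbol if 2 * count > len(column) else "?")
    return "".join(consensus)


three_copies = ["1011001110", "1011101110", "10110-1110"]
two_copies = ["1011001110", "1011101110"]
print(majority_consensus(three_copies))  # every column has a majority
print(majority_consensus(two_copies))    # the disagreeing column stays '?'
```

With three copies, each column that is damaged in only one copy is outvoted by the other two; with two copies, a disagreement can only be flagged.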
Cost and practical realization
For data storage on magnetic media, especially hard disks, redundant arrays of inexpensive disks (RAID) are used frequently (Patterson et al. 1988). The concept of RAID is to encode data redundantly within a larger volume of inexpensive memory storage in order to secure data durability. However, when RAID technology was first proposed in 1988, the cost of memory storage was high compared with that of current technology, and the RAID concept of ‘inexpensive disks’ was considered novel.
The DNA data-storage methods with a PCR-based data retrieval process, which correspond to the stage before RAID in the computational technology field, were associated with constraints including the high cost of DNA oligomer synthesis and DNA sequencing and the size limitation of artificial DNA that can be inserted into the genome of a living organism. The PCR-based methods store nonredundant data in one short synthetic DNA fragment along with the template DNA regions at each end. The cost of sequencing the partial data-encoded genomic region amplified by PCR is markedly lower than the cost of complete genome sequencing. By comparison, the alignment-based method requires the redundant encoding of data into multiple synthetic DNA molecules and sequencing at the genomic level. However, DNA sequencing is becoming less time-consuming and the associated cost of DNA synthesis and sequencing is decreasing.
There are several hundred complete sequences of chromosomal DNA from eukaryotes, prokaryotes, and archaea currently available via the public databases (Liolios et al. 2008). This area of research has fuelled a growing demand for higher speeds of DNA sequencing and lower cost and has boosted the emergence of new technologies (Church 2006; Hall 2007). For example, the recent development of sequencing technology utilizing emulsion PCR (Margulies et al. 2005; Shendure et al. 2005) has dramatically increased the speed of sequencing. Other new technologies are in development, and in the near future, new machines with the capacity to read one million bases per second at a low cost will be available (Bonetta 2006).
In the alignment-based method, following the design of multiple different DNA sequences encoding the same data, we can add the template DNA regions to each end of the respective sequences and insert these into the living genome. Complete genome sequencing is unnecessary with this approach, because the partial genomic regions of the respective synthetic DNAs can be amplified by PCR and sequenced, and the multiple alignments of these sequences can be used to check and correct errors. However, as explained above, PCR-based data retrieval has the disadvantage that the template DNA regions can themselves be broken, and the readout procedure depends on appropriate, error-free PCR amplification. Therefore, although data storage durability can be improved to a degree by multiple PCR amplifications and further by alignment of data sequences only from the successfully amplified DNAs, our proposed method fully exploits the combinatorial and complementary error compensation provided by the multiple alignment of all the multiple-copied data, so the data durability achieved by sequencing at the whole-genome level is far greater. One better strategy is to combine the two readouts, performing complete genome sequencing only when the multiple readouts by PCR fail. Optimal data durability can be guaranteed only by complete genome sequencing independent of PCR-based readout.
Although the length of synthetic DNA sequences that can be inserted into the genome of a living organism has been considered to be limited (Cox 2001), our method expends multiple and redundant synthetic DNA oligomers to copy-and-paste the data for storage. The upper limit of the total size of synthetic DNAs that can be inserted into the genome of a living organism has been increased remarkably by the development of megacloning technologies demonstrated in the transportation of whole bacterial genomes (Itaya et al. 2005, 2008). Notably, the entire 16.3-Kb mouse mitochondrial genome (Itaya et al. 2008), the 134.5-Kb rice chloroplast genome (Itaya et al. 2008), and the 3.5-Mb cyanobacterium genome (Itaya et al. 2005) have been inserted into the 4.2-Mb genome of Bacillus subtilis employed as a mother vessel. B. subtilis has the ability to form a tough and protective endospore, which allows this organism to survive extreme environmental conditions. Surprisingly, previous studies isolated strains of Bacillus species from an extinct bee trapped in 25–30 million-year-old amber (Cano and Borucki 1995) and from a brine inclusion within a 250 million-year-old salt crystal of the Permian Period (Vreeland et al. 2000). The development of megacloning technology utilizing B. subtilis has provided the potential for large-size and long-term DNA data storage.
In this paper, we overviewed the application of reversible transformations for data encoding and decoding for DNA data storage based on sequence alignment and complete genome sequencing. Because the data-encoding procedure for DNA requires no specific code and only the application of the associative rule, other codes, including the comma, alternate, and comma-free codes, can be used with our method. For example, the Huffman code is an economical code that is based on the varied frequencies at which letters are used in English. The most frequent letter is ‘e’, with a frequency of 12.7%, and the least frequent letters are ‘q’ and ‘z’, both of which have a use frequency of 0.1% (Smith et al. 2003). According to this information, the Huffman code assigns a shorter ‘codon’ of nucleotide bases to more frequently used characters, and vice versa. Sophisticated codon design is necessary for the Huffman code: once the decoding procedure has commenced from the beginning of data-encoded DNA, there must be only one way to decompress the letters from the sequence of ‘codons’. For example, if the shortest DNA codon T is defined for the English letter ‘e’, no other longer codon can start with T (Smith et al. 2003). Similarly, in the alignment-based method, the encoded data within the synthetic DNAs can be short and durable, because of the versatile ability to be combined with previously proposed codes.
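The prefix constraint on Huffman-style codons can be checked mechanically. The sketch below uses a small hypothetical codon table in which only the assignment of T to ‘e’ comes from the example above; the other codons and the function names are invented for illustration and are not the table of Smith et al. (2003).

```python
# Hypothetical variable-length codon table; only 'e' -> 'T' follows the text.
CODE = {"e": "T", "t": "AC", "a": "AG", "q": "CCA"}


def is_prefix_free(code):
    """True if all codons are distinct and none is a prefix of another."""
    words = list(code.values())
    if len(set(words)) != len(words):
        return False
    return not any(a != b and b.startswith(a) for a in words for b in words)


def huffman_encode(text, code):
    """Concatenate the codon of each letter."""
    return "".join(code[letter] for letter in text)


def huffman_decode(dna, code):
    """Read bases left to right and emit a letter whenever a codon completes.

    For a prefix-free table this reading is unambiguous.
    """
    inverse = {codon: letter for letter, codon in code.items()}
    letters, buffer = [], ""
    for base in dna:
        buffer += base
        if buffer in inverse:
            letters.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bases do not form a complete codon")
    return "".join(letters)


print(is_prefix_free(CODE))
print(huffman_decode(huffman_encode("tea", CODE), CODE))
```

A table such as {"e": "T", "x": "TA"} fails the check, because a decoder that has just read T cannot know whether the letter is complete.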
Similar to the general concept for magnetic disk drives, the alignment-based DNA data-storage method employs the redundant copy-and-paste-and-paste concept for storing data to realize the long-term DNA data storage of large amounts of important information in the small media vessel of the living cell. Although this methodology requires the use of redundant synthetic DNAs encoding the same data and whole-genome sequencing of the data-stored cell at high cost, research demands are promoting the development of new megacloning technology and high-speed DNA sequencing at a lower cost. For this reason, our proposed simple and flexible strategy may offer a practical solution for highly durable data storage in DNA.
We are grateful to Kazuhide Sekiyama and Junichi Sugahara for helpful discussions and Dr. Kazuharu Arakawa for useful suggestions in the preparation of the manuscript. This study was supported by research funds from Yamagata Prefecture and Tsuruoka City to Keio University and a grant from the Japan Society for the Promotion of Science (JSPS) to N. Y.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
- Patterson D, Gibson G, Katz R (1988) A case for redundant arrays of inexpensive disks (RAID). Proc 1988 ACM SIGMOD Conf, vol 1. pp 109–116