Abstract
DNA storage has emerged as an important area of research. The reliability of a DNA storage system depends on designing sets of DNA strings (called DNA codes) that are sufficiently dissimilar. In this work, we introduce DNA codes that satisfy a newly introduced constraint, a generalization of the non-homopolymer constraint. In particular, every codeword of the DNA code has the property that no two consecutive substrings of the codeword are the same. This is in addition to the usual constraints such as the Hamming, reverse, reverse-complement and GC-content constraints. We believe that the new constraint proposed in this paper will help reduce errors when reading data from and writing data to synthetic DNA strings. We also present a construction (based on a variant of the stochastic local search algorithm) to determine the size of DNA codes under the additional constraint that each DNA codeword is free from secondary structures. In some specific cases, this improves the lower bounds from the existing literature. A recursive isometric map between binary vectors and DNA strings is also proposed. By applying this map to well-known binary codes, we obtain classes of DNA codes that satisfy all of the above constraints, including the property that the constructed DNA codewords are free from hairpin-like secondary structures.
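The constraints named in the abstract can be illustrated with a small checker. This is a minimal sketch, not the paper's construction: the function names and the `max_len` parameter (an upper limit on the length of the repeated substring) are assumptions introduced here for illustration, and the recursive isometric map and the stochastic local search are not reproduced.

```python
from typing import Optional

# Watson-Crick complement: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def has_adjacent_repeat(s: str, k: int) -> bool:
    """True if some length-k substring is immediately followed by an identical copy."""
    return any(s[i:i + k] == s[i + k:i + 2 * k] for i in range(len(s) - 2 * k + 1))


def is_conflict_free(s: str, max_len: Optional[int] = None) -> bool:
    """True if no two consecutive substrings of length 1..max_len are equal.

    max_len is a hypothetical parameter; by default every possible length is checked.
    With max_len=1 this reduces to the non-homopolymer condition.
    """
    if max_len is None:
        max_len = len(s) // 2
    return not any(has_adjacent_repeat(s, k) for k in range(1, max_len + 1))


def reverse_complement(s: str) -> str:
    """Complement every nucleotide, then reverse the string."""
    return s.translate(COMPLEMENT)[::-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def gc_content(s: str) -> int:
    """Number of G and C nucleotides in the string."""
    return sum(ch in "GC" for ch in s)
```

For example, `is_conflict_free("ACACGT")` is false because the block `AC` is immediately repeated, whereas `is_conflict_free("ACACGT", max_len=1)` is true because the string contains no homopolymer run.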
Additional information
The preliminary version of the paper is available at https://arxiv.org/abs/1902.04419
About this article
Cite this article
Benerjee, K.G., Deb, S. & Gupta, M.K. On conflict free DNA codes. Cryptogr. Commun. 13, 143–171 (2021). https://doi.org/10.1007/s12095-020-00459-7
DOI: https://doi.org/10.1007/s12095-020-00459-7
Keywords
- DNA codes
- Homopolymers
- Conflict free DNA strings
- Hamming constraint
- Reverse constraint
- Reverse-complement constraint
- GC-content constraint
- Hairpin like secondary structures