A Content-Addressable DNA Database with Learned Sequence Encodings

  • Kendall Stewart
  • Yuan-Jyue Chen
  • David Ward
  • Xiaomeng Liu
  • Georg Seelig
  • Karin Strauss
  • Luis Ceze
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11145)


We present strand and codeword design schemes for a DNA database capable of approximate similarity search over a multidimensional dataset of content-rich media. Our strand designs address cross-talk in associative DNA databases, and we demonstrate a novel method for learning DNA sequence encodings from data, applying it to a dataset of tens of thousands of images. We test our design in the wetlab using one hundred target images and ten query images, and show that our database is capable of performing similarity-based enrichment: on average, visually similar images account for 30% of the sequencing reads for each query, despite making up only 10% of the database.



We would like to thank the anonymous reviewers for their input, which was very helpful in improving the manuscript. We also thank the Molecular Information Systems Lab and Seelig Lab members for their input, especially Max Willsey, who helped frame an early version. We thank Dr. Anne Fischer for suggesting a better way to present some of the data. This work was supported in part by Microsoft and by a grant from DARPA under the Molecular Informatics Program.


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Kendall Stewart (1)
  • Yuan-Jyue Chen (2)
  • David Ward (1)
  • Xiaomeng Liu (1)
  • Georg Seelig (1)
  • Karin Strauss (1, 2)
  • Luis Ceze (1)

  1. Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, USA
  2. Microsoft Research, Redmond, USA
