Compressing Genomic Sequences by Using Deep Learning

  • Conference paper
Artificial Neural Networks and Machine Learning – ICANN 2020 (ICANN 2020)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12396)

Abstract

The development of high-throughput sequencing technologies has generated a huge number of genomic sequences, which poses challenges for data storage, processing, and transmission. Standard compression tools designed for English text cannot compress genomic sequences well, so an effective dedicated method is urgently needed. In this paper, we propose a genomic sequence compression algorithm based on a deep learning model and an arithmetic encoder. The deep learning model consists of a convolutional layer followed by an attention-based bi-directional long short-term memory network, and it predicts the probabilities of the next base in a sequence. The arithmetic encoder uses these probabilities to compress the sequence. We compare the proposed algorithm against various compression approaches, including DeepDNA, a state-of-the-art genomic sequence compression algorithm, on several real-world data sets. The results show that the proposed algorithm converges stably and achieves the best compression performance, up to 3.7 times better than DeepDNA. Furthermore, we conduct ablation experiments to verify the effectiveness and necessity of each part of the model, and we visualize the attention weight matrix to show how much each hidden state contributes to the final prediction. The source code for the model is available on GitHub (https://github.com/viviancui59/Compressing-Genomic-Sequences).
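
To illustrate how predicted next-base probabilities translate into compressed size, here is a minimal Python sketch. It is not the authors' model: an adaptive order-k count model with add-one smoothing stands in for the neural predictor, and the function names `ideal_code_length_bits` and `bits_per_base` are hypothetical. The sketch relies only on the standard result that an arithmetic coder spends about -log2(p) bits on a symbol to which the model assigned probability p, so it computes the ideal code length rather than producing an actual bitstream.

```python
import math
from collections import defaultdict

BASES = "ACGT"  # assumption: input contains only these four symbols


def ideal_code_length_bits(sequence, order=2):
    """Ideal arithmetic-coding cost (bits) of `sequence` under a stand-in model.

    The stand-in is an adaptive order-`order` count model with add-one
    smoothing; the paper uses a neural network in this role instead.
    """
    # Every context starts with a count of 1 per base (uniform prior).
    counts = defaultdict(lambda: {b: 1 for b in BASES})
    total_bits = 0.0
    for i, base in enumerate(sequence):
        context = sequence[max(0, i - order):i]
        ctx_counts = counts[context]
        # Probability the model assigns to the base that actually occurs.
        p = ctx_counts[base] / sum(ctx_counts.values())
        # An arithmetic coder needs about -log2(p) bits for this symbol.
        total_bits += -math.log2(p)
        # Update the model only after coding, so a decoder can mirror it.
        ctx_counts[base] += 1
    return total_bits


def bits_per_base(sequence, order=2):
    """Average ideal cost per base; 2.0 is the cost of a uniform model."""
    if not sequence:
        return 0.0
    return ideal_code_length_bits(sequence, order) / len(sequence)
```

A model that predicts the next base well pushes `bits_per_base` below the 2 bits per base that a fixed two-bit encoding of A, C, G, and T requires, which is why prediction quality determines the compression ratio.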

This work is partially supported by the National Science Foundation of China (61872201, 61702521, U1833114) and the Science and Technology Development Plan of Tianjin (18ZXZNGX00140, 18ZXZNGX00200).

Notes

  1. https://www.ncbi.nlm.nih.gov/

Corresponding author

Correspondence to Xiaoguang Liu.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Cui, W., Yu, Z., Liu, Z., Wang, G., Liu, X. (2020). Compressing Genomic Sequences by Using Deep Learning. In: Farkaš, I., Masulli, P., Wermter, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2020. ICANN 2020. Lecture Notes in Computer Science, vol 12396. Springer, Cham. https://doi.org/10.1007/978-3-030-61609-0_8

  • DOI: https://doi.org/10.1007/978-3-030-61609-0_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-61608-3

  • Online ISBN: 978-3-030-61609-0

  • eBook Packages: Computer Science, Computer Science (R0)
