Compressing Genomic Sequences by Using Deep Learning

Cui, Wenwen; Yu, Zhaoyang; Liu, Zhuangzhuang; Wang, Gang; Liu, Xiaoguang

doi:10.1007/978-3-030-61609-0_8

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12396)

Included in the following conference series:

International Conference on Artificial Neural Networks

Abstract

A huge number of genomic sequences have been generated with the development of high-throughput sequencing technologies, which poses challenges for data storage, processing, and transmission. Standard compression tools designed for English text do not compress genomic sequences well, so an effective dedicated method is urgently needed. In this paper, we propose a genomic sequence compression algorithm based on a deep learning model and an arithmetic encoder. The deep learning model consists of a convolutional layer followed by an attention-based bi-directional long short-term memory network, and it predicts the probabilities of the next base in a sequence. The arithmetic encoder uses these probabilities to compress the sequence. We evaluate the proposed algorithm against various compression approaches, including the state-of-the-art genomic sequence compression algorithm DeepDNA, on several real-world data sets. The results show that the proposed algorithm converges stably and achieves the best compression performance, up to 3.7 times better than DeepDNA. Furthermore, we conduct ablation experiments to verify the effectiveness and necessity of each part of the model, and we visualize the attention weight matrix to show the differing importance of the hidden states for the final prediction. The source code for the model is available on GitHub (https://github.com/viviancui59/Compressing-Genomic-Sequences).
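
The predict-then-encode pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a simple adaptive context-count model stands in for the convolutional and attention-based BiLSTM predictor, and the arithmetic coder uses exact fractions rather than the fixed-precision arithmetic a practical coder would need. The names `CountModel`, `encode`, and `decode` are hypothetical and chosen only for this example. The point it shows is that the encoder and decoder run the same predictor on the same history, so the decoder can rebuild the sequence from the bits and the sequence length alone.

```python
from fractions import Fraction
from math import ceil

BASES = "ACGT"


class CountModel:
    """Placeholder predictor (assumption): adaptive order-k context counts
    with add-one smoothing. The paper uses a neural network in this role."""

    def __init__(self, order=2):
        self.order = order
        self.counts = {}

    def _context(self, history):
        return history[max(0, len(history) - self.order):]

    def predict(self, history):
        # Probabilities of A, C, G, T for the next base; they sum to exactly 1.
        c = self.counts.get(self._context(history), [0, 0, 0, 0])
        total = sum(c) + len(BASES)
        return [Fraction(n + 1, total) for n in c]

    def update(self, history, base):
        row = self.counts.setdefault(self._context(history), [0, 0, 0, 0])
        row[BASES.index(base)] += 1


def encode(seq, order=2):
    """Narrow [low, low + width) by each base's predicted probability,
    then emit a binary fraction that lies inside the final interval."""
    model = CountModel(order)
    low, width = Fraction(0), Fraction(1)
    for i, base in enumerate(seq):
        history = seq[:i]
        probs = model.predict(history)
        k = BASES.index(base)
        low += width * sum(probs[:k], Fraction(0))
        width *= probs[k]
        model.update(history, base)
    # Smallest nbits with 2**-nbits <= width guarantees that a multiple
    # of 2**-nbits falls inside the interval.
    nbits = 0
    while width * (1 << nbits) < 1:
        nbits += 1
    m = ceil(low * (1 << nbits))
    return format(m, "b").zfill(nbits) if nbits else ""


def decode(bits, n_bases, order=2):
    """Replay the same model and pick, at each step, the sub-interval
    that contains the encoded value."""
    model = CountModel(order)
    value = Fraction(int(bits, 2), 1 << len(bits)) if bits else Fraction(0)
    low, width = Fraction(0), Fraction(1)
    out = ""
    for _ in range(n_bases):
        probs = model.predict(out)
        cum = low
        for k, p in enumerate(probs):
            upper = cum + width * p
            if value < upper:
                break
            cum = upper
        base = BASES[k]
        model.update(out, base)
        low, width = cum, width * p
        out += base
    return out


seq = "ACGT" * 16
bits = encode(seq)
print(len(seq), "bases ->", len(bits), "bits")
assert decode(bits, len(seq)) == seq
```

The output length is roughly the sum of the negative base-2 logarithms of the probabilities the model assigned to the bases that actually occurred, so a better predictor gives a shorter code. A fixed 2-bit-per-base encoding is the baseline to beat for the four-letter alphabet.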

This work is partially supported by the National Science Foundation of China (61872201, 61702521, U1833114) and the Science and Technology Development Plan of Tianjin (18ZXZNGX00140, 18ZXZNGX00200).

Notes

1. https://www.ncbi.nlm.nih.gov/

Author information

Authors and Affiliations

College of CS, TJ Key Lab of NDST, Nankai University, Tianjin, China
Wenwen Cui, Zhaoyang Yu, Zhuangzhuang Liu, Gang Wang & Xiaoguang Liu

Corresponding author

Correspondence to Xiaoguang Liu.

Editor information

Editors and Affiliations

Department of Applied Informatics, Comenius University in Bratislava, Bratislava, Slovakia
Igor Farkaš
Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kgs. Lyngby, Denmark
Paolo Masulli
Department of Informatics, University of Hamburg, Hamburg, Germany
Stefan Wermter

Cite this paper

Cui, W., Yu, Z., Liu, Z., Wang, G., Liu, X. (2020). Compressing Genomic Sequences by Using Deep Learning. In: Farkaš, I., Masulli, P., Wermter, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2020. ICANN 2020. Lecture Notes in Computer Science, vol 12396. Springer, Cham. https://doi.org/10.1007/978-3-030-61609-0_8

DOI: https://doi.org/10.1007/978-3-030-61609-0_8
Published: 14 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61608-3
Online ISBN: 978-3-030-61609-0
eBook Packages: Computer Science, Computer Science (R0)
