Abstract
The development of high-throughput sequencing technologies has generated a huge volume of genomic sequences, which poses challenges for data storage, processing, and transmission. Standard compression tools designed for English text do not compress genomic sequences well, so an effective dedicated method is urgently needed. In this paper, we propose a genomic sequence compression algorithm based on a deep learning model and an arithmetic encoder. The deep learning model consists of a convolutional layer followed by an attention-based bi-directional long short-term memory network, and it predicts the probability of each possible next base in a sequence. The arithmetic encoder uses these probabilities to compress the sequence. We compare the proposed algorithm with various compression approaches, including DeepDNA, a state-of-the-art genomic sequence compression algorithm, on several real-world data sets. The results show that the proposed algorithm converges stably and achieves the best compression performance, up to 3.7 times better than DeepDNA. Furthermore, we conduct ablation experiments to verify the effectiveness and necessity of each part of the model, and we visualize the attention weight matrix to show the differing importance of the hidden states for the final prediction. The source code for the model is available on GitHub (https://github.com/viviancui59/Compressing-Genomic-Sequences).
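To illustrate how next-base probabilities translate into compressed size, the following is a minimal sketch, not the model from this paper. It assumes a simple adaptive order-k count model (the hypothetical `OrderKModel`) as a stand-in for the neural predictor, and it reports the ideal code length, the sum of -log2 p over the bases actually observed, which an arithmetic encoder approaches up to a small overhead. All names here are illustrative assumptions.

```python
import math
from collections import defaultdict

BASES = "ACGT"


class OrderKModel:
    """Adaptive order-k count model; a hypothetical stand-in for the neural predictor."""

    def __init__(self, k=2):
        self.k = k
        # Laplace smoothing: each base starts with a count of 1 in every context.
        self.counts = defaultdict(lambda: {b: 1 for b in BASES})

    def _key(self, seq, i):
        # The context is the (up to) k bases preceding position i.
        return seq[max(0, i - self.k):i]

    def predict(self, seq, i):
        """Return a probability for each possible base at position i."""
        c = self.counts[self._key(seq, i)]
        total = sum(c.values())
        return {b: n / total for b, n in c.items()}

    def update(self, seq, i):
        """Record the base actually observed at position i."""
        self.counts[self._key(seq, i)][seq[i]] += 1


def ideal_bits_per_base(seq, model):
    """Average ideal code length in bits per base under the model's predictions."""
    total_bits = 0.0
    for i, base in enumerate(seq):
        p = model.predict(seq, i)[base]
        total_bits += -math.log2(p)  # cost of coding this base
        model.update(seq, i)         # a decoder can repeat the same update
    return total_bits / len(seq)


if __name__ == "__main__":
    seq = "ACGT" * 200
    print(ideal_bits_per_base(seq, OrderKModel(k=2)))
```

A predictor that always assigns probability 0.25 to each base costs exactly 2 bits per base, so any value below 2 reflects structure that the predictor has captured. Because the model is updated only with bases that have already been coded, a decoder can reproduce the same probabilities step by step.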
This work is partially supported by the National Science Foundation of China (61872201, 61702521, U1833114) and the Science and Technology Development Plan of Tianjin (18ZXZNGX00140, 18ZXZNGX00200).
References
Bakr, N.S., Sharawi, A.A., et al.: DNA lossless compression algorithms. Am. J. Bioinf. Res. 3(3), 72–81 (2013)
Behzadi, B., Le Fessant, F.: DNA compression challenge revisited: a dynamic programming approach. In: Apostolico, A., Crochemore, M., Park, K. (eds.) CPM 2005. LNCS, vol. 3537, pp. 190–200. Springer, Heidelberg (2005). https://doi.org/10.1007/11496656_17
Berger, B., Peng, J., Singh, M.: Computational solutions for omics data. Nat. Rev. Genet. 14(5), 333 (2013)
Cao, M.D., Dix, T.I., Allison, L., Mears, C.: A simple statistical algorithm for biological sequence compression. In: 2007 Data Compression Conference (DCC 2007), pp. 43–52. IEEE (2007)
Chen, X., Kwong, S., Li, M.: A compression algorithm for DNA sequences and its applications in genome comparison. Genome Inform. 10, 51–61 (1999)
Chen, X., Li, M., Ma, B., Tromp, J.: DNACompress: fast and effective DNA sequence compression. Bioinformatics 18(12), 1696–1698 (2002)
Deorowicz, S., Grabowski, S.: Robust relative compression of genomes with random access. Bioinformatics 27(21), 2979–2986 (2011)
Goyal, M., Tatwawadi, K., Chandak, S., Ochoa, I.: DeepZip: lossless data compression using recurrent neural networks. arXiv preprint arXiv:1811.08162 (2018)
Grumbach, S., Tahi, F.: Compression of DNA sequences. In: Proceedings of DCC93: Data Compression Conference, pp. 340–350. IEEE (1993)
Grumbach, S., Tahi, F.: A new challenge for compression algorithms: genetic sequences. Inf. Process. Manage. 30(6), 875–886 (1994)
Hughes, L.C., et al.: Comprehensive phylogeny of ray-finned fishes (Actinopterygii) based on transcriptomic and genomic data. Proc. Natl. Acad. Sci. 115(24), 6249–6254 (2018)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456 (2015)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Mahoney, M.V.: Fast text compression with neural networks. In: FLAIRS Conference, pp. 230–234 (2000)
Matsumoto, T., Sadakane, K., Imai, H.: Biological sequence compression algorithms. Genome Inf. 11, 43–52 (2000)
Mishra, K.N., Aaggarwal, A., Abdelhadi, E., Srivastava, D.: An efficient horizontal and vertical method for online DNA sequence compression. Int. J. Comput. Appl. 3(1), 39–46 (2010)
Muir, P., et al.: The real cost of sequencing: scaling computation to keep pace with data generation. Genome Biol. 17(1), 53 (2016)
Pinho, A.J., Pratas, D.: MFCompress: a compression tool for FASTA and multi-FASTA data. Bioinformatics 30(1), 117–118 (2013)
Quang, D., Xie, X.: DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 44(11), e107–e107 (2016)
Sato, H., Yoshioka, T., Konagaya, A., Toyoda, T.: DNA data compression in the post genome era. Genome Inform. 12, 512–514 (2001)
Wang, R., et al.: DeepDNA: a hybrid convolutional and recurrent neural network for compressing human mitochondrial genomes. In: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 270–274. IEEE (2018)
Zhou, J., Troyanskaya, O.G.: Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12(10), 931 (2015)
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Cui, W., Yu, Z., Liu, Z., Wang, G., Liu, X. (2020). Compressing Genomic Sequences by Using Deep Learning. In: Farkaš, I., Masulli, P., Wermter, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2020. ICANN 2020. Lecture Notes in Computer Science, vol 12396. Springer, Cham. https://doi.org/10.1007/978-3-030-61609-0_8
DOI: https://doi.org/10.1007/978-3-030-61609-0_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61608-3
Online ISBN: 978-3-030-61609-0
eBook Packages: Computer Science, Computer Science (R0)