GeCo2: An Optimized Tool for Lossless Compression and Analysis of DNA Sequences

  • Conference paper
  • In: Practical Applications of Computational Biology and Bioinformatics, 13th International Conference (PACBB 2019)

Abstract

Efficient DNA data compression tools are fundamental for reducing storage requirements, given the increasing availability of DNA sequences. They are equally important for analysis, given the demand for new and optimized tools in anthropological and biomedical applications. In this paper, we describe the characteristics and impact of GeCo2, an improved version of the GeCo tool. In GeCo2, we enhanced the mixture of models so that each context model or tolerant context model now has its own decay factor. Additionally, we added specific cache-hash sizes and the ability to run a context model with inverted repeats only. A new command line interface, twelve new pre-computed levels, and several code optimizations were also included. The results show improved compression using fewer computational resources (RAM and processing time). This new version offers more flexibility for compression and analysis, namely a greater ability to address different characteristics of DNA sequences. Decompression uses symmetric computational resources (RAM and time). GeCo2 is freely available, under the GPLv3 license, at https://github.com/pratas/geco2.
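To make the idea of a mixture with one decay factor per model more concrete, the following is a minimal Python sketch, not the GeCo2 implementation. It assumes a common formulation in which each model's weight is raised to that model's decay factor and multiplied by the probability the model gave to the symbol that actually occurred, then renormalized. The class `ContextModel`, the function `estimate_bits`, the smoothing parameter `alpha`, and the decay values used are all hypothetical names and settings chosen for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class ContextModel:
    """Order-k finite-context model with additive smoothing (illustrative)."""

    def __init__(self, order, alpha=1.0):
        self.order = order
        self.alpha = alpha  # hypothetical smoothing parameter
        self.counts = defaultdict(lambda: [0, 0, 0, 0])

    def _context(self, history):
        # history[-0:] would return the whole string, so treat order 0 apart
        return history[-self.order:] if self.order > 0 else ""

    def probs(self, history):
        """Return the estimated distribution over A, C, G, T."""
        counts = self.counts[self._context(history)]
        total = sum(counts) + len(ALPHABET) * self.alpha
        return [(n + self.alpha) / total for n in counts]

    def update(self, history, symbol):
        """Record that `symbol` followed the current context."""
        self.counts[self._context(history)][ALPHABET.index(symbol)] += 1


def estimate_bits(seq, models, gammas):
    """Estimate the bits needed to code `seq` with a weighted mixture.

    Assumed update rule: w_m <- w_m ** gamma_m * P_m(symbol), renormalized,
    where gamma_m is the decay factor of model m.
    """
    weights = [1.0 / len(models)] * len(models)
    max_order = max(m.order for m in models)
    total_bits = 0.0
    for i, symbol in enumerate(seq):
        history = seq[max(0, i - max_order):i]
        s = ALPHABET.index(symbol)
        dists = [m.probs(history) for m in models]
        # Mixture probability of the symbol that actually occurred
        p = sum(w * d[s] for w, d in zip(weights, dists))
        total_bits += -math.log2(p)
        # Per-model decay: each model forgets its past performance at its own rate
        weights = [(w ** g) * d[s] for w, g, d in zip(weights, gammas, dists)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        for m in models:
            m.update(history, symbol)
    return total_bits


# Usage: two models of different orders, each with its own decay factor
sequence = "ACGT" * 50
bits = estimate_bits(sequence, [ContextModel(1), ContextModel(3)], [0.90, 0.95])
print(f"{bits / len(sequence):.3f} bits per base")
```

A decay factor close to 1 lets a model keep the credit it earned over a long stretch of the sequence, while a smaller value makes its weight react faster to local changes. Giving each model its own value is what allows the two behaviours to coexist in one mixture.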

Acknowledgments

This work was partially funded by FEDER (Programa Operacional Factores de Competitividade - COMPETE) and by National Funds through the FCT, in the context of the projects UID/CEC/00127/2019 & PTCD/EEI-SII/6608/2014 and the grant PD/BD/113969/2015 to MH.

Author information

Corresponding author

Correspondence to Armando J. Pinho.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Pratas, D., Hosseini, M., Pinho, A.J. (2020). GeCo2: An Optimized Tool for Lossless Compression and Analysis of DNA Sequences. In: Fdez-Riverola, F., Rocha, M., Mohamad, M., Zaki, N., Castellanos-Garzón, J. (eds) Practical Applications of Computational Biology and Bioinformatics, 13th International Conference. PACBB 2019. Advances in Intelligent Systems and Computing, vol 1005. Springer, Cham. https://doi.org/10.1007/978-3-030-23873-5_17
