
GeCo2: An Optimized Tool for Lossless Compression and Analysis of DNA Sequences

  • Diogo Pratas
  • Morteza Hosseini
  • Armando J. Pinho
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1005)

Abstract

The development of efficient DNA data compression tools is fundamental for reducing storage requirements, given the increasing availability of DNA sequences. Compression is equally important for analysis, given the search for new and optimized tools for anthropological and biomedical applications. In this paper, we describe the characteristics and impact of the GeCo2 tool, an improved version of the GeCo tool. In the proposed tool, we enhanced the mixture of models: each context model or tolerant context model now has its own decay factor. Additionally, specific cache-hash sizes and the ability to run only a context model with inverted repeats were developed. A new command-line interface, twelve new pre-computed levels, and several code optimizations were also included. The results show improved compression using fewer computational resources (RAM and processing time). This new version permits more flexibility for compression and analysis purposes, namely a greater ability to address different characteristics of DNA sequences. Decompression is symmetric with compression in computational resources (RAM and time). GeCo2 is freely available, under the GPLv3 license, at https://github.com/pratas/geco2.
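
To make the idea of a model mixture with per-model decay factors concrete, here is a minimal Python sketch. It is not GeCo2's code. It assumes a weight update of the form w ← w^γ · p (renormalized), with one γ per model, which is one common way to apply a forgetting factor to mixture weights. The names `ContextModel`, `estimate_bits`, `alpha`, and `gammas` are hypothetical. Tolerant context models, cache hashes, and inverted repeats are left out.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class ContextModel:
    """Order-k finite-context model with additive smoothing (illustrative only)."""

    def __init__(self, order, alpha=1.0):
        self.order = order
        self.alpha = alpha  # hypothetical smoothing parameter
        # Maps a context string to per-symbol counts.
        self.counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))

    def probabilities(self, context):
        c = self.counts[context]
        total = sum(c.values()) + self.alpha * len(ALPHABET)
        return {s: (c[s] + self.alpha) / total for s in ALPHABET}

    def update(self, context, symbol):
        self.counts[context][symbol] += 1


def estimate_bits(sequence, orders=(1, 4), gammas=(0.9, 0.95)):
    """Return (estimated bits, final weights) for a mixture of context models.

    Assumed update rule: each weight is raised to its own decay factor,
    multiplied by the probability its model gave the observed symbol,
    and the weights are then renormalized.
    """
    models = [ContextModel(k) for k in orders]
    weights = [1.0 / len(models)] * len(models)
    max_order = max(orders)
    padded = "A" * max_order + sequence  # arbitrary initial context
    total_bits = 0.0
    for i in range(max_order, len(padded)):
        symbol = padded[i]
        contexts = [padded[i - m.order:i] for m in models]
        probs = [m.probabilities(ctx) for m, ctx in zip(models, contexts)]
        # Mixed probability of the symbol that actually occurred.
        mixed = sum(w * p[symbol] for w, p in zip(weights, probs))
        total_bits += -math.log2(mixed)
        # Per-model decay factor applied to the previous weight.
        raised = [w ** g * p[symbol] for w, g, p in zip(weights, gammas, probs)]
        norm = sum(raised)
        weights = [r / norm for r in raised]
        for m, ctx in zip(models, contexts):
            m.update(ctx, symbol)
    return total_bits, weights
```

Calling `estimate_bits("ACGT" * 50)` returns the estimated code length in bits together with the final mixture weights. The code length can be compared against the 2 bits per base of a uniform model. A smaller γ makes a model's weight forget its history faster, so the mixture reacts more quickly when that model's recent predictions change in quality.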

Keywords

Data compression, Genomic sequence compression, GeCo2 tool, DNA sequences, Lossless data compression, Mixture models

Notes

Acknowledgments

This work was partially funded by FEDER (Programa Operacional Factores de Competitividade - COMPETE) and by National Funds through the FCT, in the context of the projects UID/CEC/00127/2019 & PTCD/EEI-SII/6608/2014 and the grant PD/BD/113969/2015 to MH.

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Diogo Pratas (1)
  • Morteza Hosseini (1)
  • Armando J. Pinho (1), email author
  1. IEETA/DETI, University of Aveiro, Aveiro, Portugal
