Compressing Resequencing Data with GReEn

  • Armando J. Pinho
  • Diogo Pratas
  • Sara P. Garcia
Part of the Methods in Molecular Biology book series (MIMB, volume 1038)


Genome sequencing centers are flooding the scientific community with data. A single sequencing machine can now generate more data in one day than any machine existing in 2005 could have produced over that entire year. The pressure for efficient compression algorithms for sequencing data is therefore very high and is felt worldwide. Here, we describe GReEn (Genome Resequencing Encoding), a tool recently proposed for compressing genome resequencing data using a reference genome sequence.

Key words

Data compression; DNA sequences; Probabilistic models; Arithmetic coding; Open source software



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Armando J. Pinho (1)
  • Diogo Pratas (1)
  • Sara P. Garcia (1)
  1. IEETA/DETI, University of Aveiro, Aveiro, Portugal
