
Performance comparison of sequential and parallel compression applications for DNA raw data

The Journal of Supercomputing

Abstract

We present an experimental performance comparison of lossless compression programs for raw DNA data stored in FASTQ files. General-purpose compressors (PBZIP2, P7ZIP and PIGZ) and domain-specific compressors (SCALCE, QUIP, FASTQZ and DSRC) were analyzed in terms of compression ratio, execution speed, parallel scalability and memory consumption. Compared with general-purpose tools, domain-specific tools increased compression ratios by up to 70 % and reduced runtime by up to \(7\times \) during compression and up to \(3\times \) during decompression. Parallel execution improved performance by up to \(13\times \) when using 20 threads. Our analysis indicates that QUIP, DSRC and PBZIP2 are the best tools in their respective categories, with acceptable memory requirements. Nevertheless, end users must consider the features of the available hardware and prioritize their optimization objectives (compression ratio, runtime during compression or decompression, scalability, etc.) to select the best application for each particular scenario.
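As an illustration of how metrics of this kind can be obtained, the following minimal sketch measures compression ratio and runtime for three Python standard-library codecs on a synthetic FASTQ payload. It is not the benchmarking procedure of the paper: the codecs (`zlib`, `bz2`, `lzma`) are only stand-ins for the DEFLATE, bzip2 and LZMA families on which PIGZ, PBZIP2 and P7ZIP build, the data generator and the helper names (`synthetic_fastq`, `measure`, `speedup`, `run`) are hypothetical, and the ratio is assumed here to be original size divided by compressed size.

```python
import bz2
import lzma
import random
import time
import zlib


def synthetic_fastq(n_reads=2000, read_len=100, seed=0):
    """Build a toy FASTQ payload (hypothetical data, not taken from the paper)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        # Phred+33 quality characters in an arbitrary illustrative range.
        qual = "".join(chr(33 + rng.randint(20, 40)) for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def measure(name, compress, decompress, data):
    """Time one compress/decompress round trip and verify it is lossless."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    assert restored == data, "round trip must be lossless"
    return {
        "name": name,
        # Assumed definition: original size / compressed size.
        "ratio": len(data) / len(packed),
        "compress_s": t1 - t0,
        "decompress_s": t2 - t1,
    }


def speedup(t_baseline, t_other):
    """Speedup of a run taking t_other seconds over a baseline of t_baseline."""
    return t_baseline / t_other


# Standard-library stand-ins, not the tools evaluated in the article.
CODECS = {
    "zlib (DEFLATE)": (zlib.compress, zlib.decompress),
    "bz2 (bzip2)": (bz2.compress, bz2.decompress),
    "lzma (LZMA)": (lzma.compress, lzma.decompress),
}


def run(data):
    """Measure every codec in CODECS on the same payload."""
    return [measure(name, c, d, data) for name, (c, d) in CODECS.items()]


if __name__ == "__main__":
    for row in run(synthetic_fastq()):
        print(f"{row['name']:<16} ratio={row['ratio']:.2f} "
              f"compress={row['compress_s']:.4f}s "
              f"decompress={row['decompress_s']:.4f}s")
```

The same `speedup` helper applies to parallel scalability: dividing the single-thread runtime by the runtime with a given number of threads yields figures of the kind reported above, although this sketch itself runs each codec sequentially.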

Notes

  1. http://www.bzip.org/.

  2. http://www.lzop.org/.

  3. http://www.gzip.org/.

  4. http://compression.ca/pbzip2/.

  5. http://p7zip.sourceforge.net/.

  6. http://zlib.net/pigz/.

  7. http://pthreads.org/.

  8. http://www.zlib.net/.

  9. http://mattmahoney.net/dc/zpaq.html.

  10. http://www.boost.org/.

  11. http://valgrind.org/.

Acknowledgments

We thank Felipe Cabarcas and Juan Fernando Alzate from the Centro Nacional de Secuenciación Genómica at the University of Antioquia for giving us access to their computing cluster and test data, and for their help in clarifying many bioinformatics-related issues. This work was supported by the University of Antioquia under project code PRV15-2-02.

Author information

Corresponding author

Correspondence to Aníbal Guerra.

Cite this article

Guerra, A., Lotero, J. & Isaza, S. Performance comparison of sequential and parallel compression applications for DNA raw data. J Supercomput 72, 4696–4717 (2016). https://doi.org/10.1007/s11227-016-1753-4

  • DOI: https://doi.org/10.1007/s11227-016-1753-4
