Abstract
We present an experimental performance comparison of lossless compression programs for DNA raw data in FASTQ format files. General-purpose (PBZIP2, P7ZIP and PIGZ) and domain-specific compressors (SCALCE, QUIP, FASTQZ and DSRC) were analyzed in terms of compression ratio, execution speed, parallel scalability and memory consumption. Results showed that domain-specific tools increased the compression ratios up to 70 %, while reducing the runtime of general-purpose tools up to \(7\times \) during compression and up to \(3\times \) during decompression. Parallelism scaled performance up to \(13\times \) when using 20 threads. Our analysis indicates that QUIP, DSRC and PBZIP2 are the best tools in their respective categories, with acceptable memory requirements. Nevertheless, the end user must consider the features of available hardware and define the priorities among its optimization objectives (compression ratio, runtime during compression or decompression, scalability, etc.) to properly select the best application for each particular scenario.
Similar content being viewed by others
References
Deorowicz S (2013) Grabowski S (2013) Data compression for sequencing data. Algorithms for molecular biology: AMB 8(1):25. doi:10.1186/1748-7188-8-25. http://www.almob.org/content/8/1/25
Loh PR, Baym M, Berger B (2012) Compressive genomics. Nat Biotechnol 30(7):627–30. doi:10.1038/nbt.2241. http://www.ncbi.nlm.nih.gov/pubmed/22781691
RAID Incorporated (2015) Storing and managing petabytes of genome sequencing data. Tech. rep. http://webinfo.raidinc.com/storing-and-managing-petabytes-of-genome-sequencing-data [Online]. Accessed on 23 March 2015
Brandon MC, Wallace DC, Baldi P (2009) Data structures and compression algorithms for genomic sequence data. Bioinformatics (Oxford, England) 25(14):1731–1738. doi:10.1093/bioinformatics/btp319. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2705231&tool=pmcentrez&rendertype=abstract
Baxevanis Andreas D, Ouellette Francis BF (2004) Bioinformatics: a practical guide to the analysis of genes and proteins, 2nd edn. doi:10.1007/s10439-006-9105-9
Format specification (FASTQ) (2014) http://maq.sourceforge.net/fastq.shtml [Online]. Accessed on 23 Sept 2014
Howison M, Zapata F, Dunn CW (2013) Toward a statistically explicit understanding of de novo sequence assembly. Bioinformatics 29(23):2959–2963. doi:10.1093/bioinformatics/btt525
Shendure J, Ji H (2008) Next-generation DNA sequencing. Nat Biotechnol 26(10):1135–1145. doi:10.1038/nbt1486
Bakr NS, Sharawi AA (2013) DNA lossless compression algorithms: review. Am J Bioinform Res 3(3):72–81. doi:10.5923/j.bioinformatics.20130303.04
1000 Genomes (2014) A deep catalog of human genetic variation. http://www.1000genomes.org [Online]. Accessed on 03 Oct 2014
Encyclopedia of DNA Elements (ENCODE) (2014) http://www.encodeproject.org/ [Online]. Accessed on 03 Oct 2014
Genomics England (2014). http://www.genomicsengland.co.uk [Online]. Accessed on 03 Oct 2014
ICGC Cancer Genome Projects (2014) https://icgc.org/ [Online]. Accessed on 03 Oct 2014
Wandelt S, Bux M, Leser U (2013) Trends in genome compression. Curr Bioinform 1–24 .https://edit.rok.informatik.hu-berlin.de/wbi/research/publications/2013/2013-cbio.pdf
Kaipa KK, Lee K, Ahn T, Narayanan R (2010) System for random access dna sequence compression. IEEE international conference on bioinformatics and biomedicine workshops system, pp 853–854. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5703942
Ziv J, Lempel A (1977) A universal algorithm for sequential data compression. IEEE Trans Inf Theory I(3):337–338
Ziv J, Lempel A (1978) Compression of individual sequences via variable-rate coding. IEEE Trans Inf Theory 24(5):530–536
Huffman DA (1952) A method for the construction of minimum-redundancy codes. Proc IRE 27:1098,1101
Jones DC, Ruzzo WL, Peng X, Katze MG (2012) Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucl Acids Res 40(22):1–9. doi:10.1093/nar/gks754
Pinho AJ, Pratas D, Garcia SP (2012) GReEn: a tool for efficient compression of genome resequencing data. Nucl Acids Res 40(4):1–8. doi:10.1093/nar/gkr1124
Christley S, Lu Y, Li C, Xie X (2009) Human genomes as email attachments. Bioinformatics 25(2):274–275. doi:10.1093/bioinformatics/btn582
Chen X, Kwong S, Li M (2001) A compression algorithm for DNA sequences. IEEE engineering in medicine and biology magazine 20:61–66. doi:10.1109/51.940049. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=940049
Chen X, Kwong S, Li M (1999) A compression algorithm for DNA sequences and its applications in genome comparison. Genome informatics. Workshop on genome informatics 10:51–61 http://www.ncbi.nlm.nih.gov/pubmed/11072342
Stephane G, Tahi F (1994) A new challenge for compression algorithms: genetic sequences. Inf Process Manage 30:875–886. https://hal.archives-ouvertes.fr/file/index/docid/180949/filename/grumbach.pdf
Giancarlo R, Rombo SE, Utro F (2014) Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies. Brief Bioinform 15(3):390–406. doi:10.1093/bib/bbt088
Matsumoto T, Sadakane K, Imai H (2000) Biological sequence compression algorithms. Genome Inf 11:43–52. doi:10.11234/gi1990.11.43
Deorowicz S, Grabowski S (2011) Compression of DNA sequence reads in FASTQ format. Bioinformatics 27(6):860–862. doi:10.1093/bioinformatics/btr014. http://www.ncbi.nlm.nih.gov/pubmed/21252073
Bhola V, Bopardikar AS, Narayanan R, Lee K, Ahn T (2011) No-reference compression of genomic data stored in FASTQ format. Proceedings—2011 IEEE international conference on bioinformatics and biomedicine, BIBM 2011, pp 147–150. doi:10.1109/BIBM.2011.110
Yanovsky V (2011) ReCoil—an algorithm for compression of extremely large datasets of dna data. Algorithms Mol Biol 6(1):23. doi:10.1186/1748-7188-6-23. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3219593&tool=pmcentrez&rendertype=abstractwww.almob.org/content/6/1/23
Mohammed MH, Dutta A, Bose T, Chadaram S, Mande SS (2012) DELIMINATE—a fast and efficient method for loss-less compression of genomic sequences: sequence analysis. Bioinformatics (Oxford, England) 28(19):2527–9. doi:10.1093/bioinformatics/bts467. http://www.ncbi.nlm.nih.gov/pubmed/22833526
Grassi E, Gregorio FD, Molineris I (2012) KungFQ: a simple and powerful approach to compress fastq files. IEEE/ACM Trans Comput Biol Bioinform 9(6):1837–1842. doi:10.1109/TCBB.2012.123
Bonfield JK, Mahoney MV (2013) Compression of FASTQ and SAM format sequencing data. PLoS ONE 8(3):1–11. doi:10.1371/journal.pone.0059190
Roguski L, Deorowicz S (2014) DSRC 2—industry oriented compression of FASTQ files. Bioinformatics 30(15):2213–2215. doi:10.1093/bioinformatics/btu208. http://bioinformatics.oxfordjournals.org/content/30/15/2213
Sardaraz M, Tahir M, Ikram AA. Advances in high throughput dna sequence data compression. J Bioinform Comput Biol 0(0):1630,002(0). doi:10.1142/S0219720016300021. http://www.worldscientific.com/doi/abs/10.1142/S0219720016300021. (PMID: 26846812)
Zhu Z, Zhang Y, Ji Z, He S, Yang X (2013) High-throughput DNA sequence data compression. Brief Bioinform 16(1). doi:10.1093/bib/bbt087. http://www.ncbi.nlm.nih.gov/pubmed/24300111
Adler, M.: PIGZ Documentation. http://zlib.net/pigz/pigz.pdf. [Online; accessed: 2014-12-03]
Cox AJ, Bauer MJ, Jakobi T, Rosone G (2012) Large-scale compression of genomic sequence databases with the burrows-wheeler transform. Bioinformatics (Oxford, England) 28(11):1415–1419. doi:10.1093/bioinformatics/bts173. http://www.ncbi.nlm.nih.gov/pubmed/22556365
Dutta A, Haque MM, Bose T, Reddy C, Mande SS (2015) Fqc: a novel approach for efficient compression, archival, and dissemination of fastq datasets. J Bioinform Comput Biol 13(03):1541,003
Hach F, Numanagic I, Alkan C, Sahinalp SC, Numanagić I, Alkan C, Sahinalp SC (2012) SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics 28(23):3051–7. doi:10.1093/bioinformatics/bts593. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3509486&tool=pmcentrez&rendertype=abstract
Howison M (2013) High-throughput compression of FASTQ data with SeqDB. IEEE/ACM Trans Comput Biol Bioinform 10(1):213–218. doi:10.1109/TCBB.2012.160
Janin L, Schulz-Trieglaff O, Cox AJ (2014) BEETL-fastq: a searchable compressed archive for DNA reads. Bioinformatics (Oxford, England) 1–6. doi:10.1093/bioinformatics/btu387. http://www.ncbi.nlm.nih.gov/pubmed/24950811
Linux man page (2014) Pbzip2: parallel bzip2 file compressor. http://linux.die.net/man/1/pbzip2 [Online]. Accessed on 03 Dec 2014
Nicolae M, Pathak S, Rajasekaran S (2015) LFQC: a lossless compression algorithm for FASTQ files. Bioinformatics 31(20):3276–3281. doi:10.1093/bioinformatics/btv384
Oberhumer M (2015) Lzo real-time data compression library. http://www.oberhumer.com/opensource/lzo/ [Online]. Accessed on 03 March 2016
Pavlov I (2016) 7-zip. http://www.7-zip.org [Online]. Accessed on 03 March 2016
WinRAR archiver, a powerful tool to process RAR and ZIP files. http://www.rarlab.com/ [Online]. Accessed on 03 Dec 2014
Zhang Y, Patel K, Endrawis T, Bowers A, Sun Y (2015) A FASTQ compressor based on integer-mapped k-mer indexing for biologist. Gene 579(1):75–81. doi:10.1016/j.gene.2015.12.053
Zhan X, Yao D (2014) A novel method to compress high-throughput dna sequence read archive. In: Software intelligence technologies and applications international conference on frontiers of internet of things 2014, international conference on, pp 58–61. doi:10.1049/cp.2014.1536
Daily K, Rigor P, Christley S, Xie X, Baldi P (2010) Data structures and compression algorithms for high-throughput sequencing technologies. BMC bioinformatics 11(1):514. doi:10.1186/1471-2105-11-514. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2964686&tool=pmcentrez&rendertype=abstract
Guo G, Qiu S, Ye Z, Wang B, Fang L, Lu M, See S, Mao R (2013) GPU-accelerated adaptive compression framework for genomics data. 2013 IEEE international conference on big data GPU-accelerated, pp 181–186
Kozanitis C, Saunders C, Kruglyak S, Bafna V, Varghese G (2011) Compressing genomic sequence fragments using SlimGene. J Comput Biol 18(3):401–13. doi:10.1089/cmb.2010.0253. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3123913&tool=pmcentrez&rendertype=abstract
Sakib MN, Tang J, Zheng WJ, Huang CT (2011) Improving transmission efficiency of large sequence alignment/map (SAM) files. PLoS ONE 6(12):2–5. doi:10.1371/journal.pone.0028251
Vey G (2009) Differential direct coding: a compression algorithm for nucleotide sequence data. Database: the journal of biological databases and curation 2009:bap013. doi:10.1093/database/bap013. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2797453&tool=pmcentrez&rendertype=abstract
Awan F, Mukherjee A (2002) Lossless compression handbook, chap. text compression, pp 227–245. Communications, networking and multimedia. Elsevier Science, London
Carus A, Mesut A (2010) Fast text compression using multiple static dictionaries. Inf Technol J 9:1013–1021
Crochemore M, Lecroq T (2012) The computer science and engineering handbook, chap. pattern matching and text compression algorithms, pp 3–77. CRC Press, Boca Raton
Burrows M, Wheeler DA (1994) A block-sorting lossless data compression algorithm. Tech. rep. digital equipment corporation, California
7-zip soruceforge editor’s review (2016) https://sourceforge.net/projects/sevenzip/editorial/?source=psp [Online]. Accessed on 03 March 2016
Mahooney M (2016) Data compression explained. http://mattmahoney.net/dc/dce.html#Section_523 [Online]. Accessed on 03 March 2016
Selva JJ, Chen X (2013) SRComp: short read sequence compression using burstsort and Elias omega coding. PLoS ONE 8(12):1–7. doi:10.1371/journal.pone.0081414
Tembe W, Lowey J, Suh E, Genomics T, Street N (2010) G-SQZ: compact encoding of genomic sequence and quality data. Bioinformatics 26(17):2192–2194. doi:10.1093/bioinformatics/btq346
Mahoney MV (2005) Adaptive weighing of context models for lossless data compression. Florida Tech., Melbourne CS-2005-16(x):1–6. http://professor.unisinos.br/linds/teoinfo/paq.pdf
Salomon D, Bryant D, Motta G (2010) Handbook of data compression. Springer, London
Batu T, Ergun F, Sahinalp C (2006) Oblivious string embeddings and edit distance approximations. In: Proceedings of the seventeenth annual ACM-SIAM symposium on discrete algorithm, pp 792–801. Society for Industrial and Applied Mathematics
Biocancer Research Journal: Transcriptoma (2014) http://www.biocancer.com/journal/1353/31-transcriptoma [Online]. Accessed on 03 Dec 2014
Pevsner J (2009) Bioinformatics and functional genomics, 2nd edn. Springer, Berlin
Deorowicz S, Grabowski S (2011) Robust relative compression of genomes with random access. Bioinformatics 27(21):2979–2986. doi:10.1093/bioinformatics/btr505
Acknowledgments
We want to thank Felipe Cabarcas and Juan Fernando Alzate from the Centro Nacional de Secuenciación Genómica at the University of Antioquia for giving us access to their computing cluster and test data; and for their help in clarifying many bioinformatics related issues. This work was supported by the University of Antioquia under project code PRV15-2-02.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Guerra, A., Lotero, J. & Isaza, S. Performance comparison of sequential and parallel compression applications for DNA raw data. J Supercomput 72, 4696–4717 (2016). https://doi.org/10.1007/s11227-016-1753-4
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11227-016-1753-4