Performance comparison of sequential and parallel compression applications for DNA raw data

Guerra, Aníbal; Lotero, Jaime; Isaza, Sebastián

doi:10.1007/s11227-016-1753-4

Published: 10 June 2016

The Journal of Supercomputing, volume 72, pages 4696–4717 (2016)

Abstract

We present an experimental performance comparison of lossless compression programs for raw DNA data in FASTQ files. General-purpose compressors (PBZIP2, P7ZIP and PIGZ) and domain-specific compressors (SCALCE, QUIP, FASTQZ and DSRC) were analyzed in terms of compression ratio, execution speed, parallel scalability and memory consumption. Results showed that domain-specific tools increased compression ratios by up to 70% while reducing runtime relative to general-purpose tools by up to 7× during compression and up to 3× during decompression. Parallelism scaled performance by up to 13× when using 20 threads. Our analysis indicates that QUIP, DSRC and PBZIP2 are the best tools in their respective categories, with acceptable memory requirements. Nevertheless, end users must consider the features of the available hardware and prioritize their optimization objectives (compression ratio, runtime during compression or decompression, scalability, etc.) to select the best application for each particular scenario.
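
The kind of measurement summarized above can be illustrated with a minimal sketch. It is not the authors' benchmark harness: it uses Python's single-threaded standard-library codecs (zlib, bz2, lzma) as stand-ins for the algorithm families behind PIGZ, PBZIP2 and P7ZIP, runs on synthetic FASTQ-like data, and assumes that compression ratio means original size divided by compressed size (the article may define it differently). The helper names `make_fastq`, `benchmark`, `run_all`, `speedup` and `parallel_efficiency` are hypothetical.

```python
import bz2
import lzma
import random
import time
import zlib


def make_fastq(n_reads=2000, read_len=100, seed=0):
    """Build synthetic FASTQ-like bytes (illustrative data, not real reads)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        bases = "".join(rng.choice("ACGT") for _ in range(read_len))
        quals = "".join(chr(33 + rng.randint(20, 40)) for _ in range(read_len))
        records.append(f"@read{i}\n{bases}\n+\n{quals}\n")
    return "".join(records).encode("ascii")


def benchmark(name, compress, decompress, data):
    """Time one codec and check that the round trip is lossless."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    assert restored == data, f"{name} round trip was not lossless"
    return {
        "tool": name,
        # Assumed definition: original size / compressed size.
        "ratio": len(data) / len(packed),
        "compress_s": t1 - t0,
        "decompress_s": t2 - t1,
    }


# Standard-library codecs standing in for the general-purpose tool families.
CODECS = [
    ("zlib", zlib.compress, zlib.decompress),
    ("bz2", bz2.compress, bz2.decompress),
    ("lzma", lzma.compress, lzma.decompress),
]


def run_all(data):
    """Benchmark every codec on the same input."""
    return [benchmark(name, c, d, data) for name, c, d in CODECS]


def speedup(serial_seconds, parallel_seconds):
    """Speedup of a parallel run over the single-threaded run."""
    return serial_seconds / parallel_seconds


def parallel_efficiency(speedup_factor, threads):
    """Fraction of ideal linear scaling that was achieved."""
    return speedup_factor / threads


if __name__ == "__main__":
    for row in run_all(make_fastq()):
        print(f"{row['tool']:5s} ratio={row['ratio']:.2f} "
              f"compress={row['compress_s']:.3f}s "
              f"decompress={row['decompress_s']:.3f}s")
    # A 13x speedup on 20 threads corresponds to 65% parallel efficiency.
    print(f"efficiency={parallel_efficiency(13.0, 20):.2f}")
```

Read this way, the reported 13× speedup on 20 threads corresponds to a parallel efficiency of 13/20 = 0.65 under the usual definition. Timings and ratios from the sketch depend on the machine and on the synthetic input, so they are not comparable to the figures in the article.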

Acknowledgments

We thank Felipe Cabarcas and Juan Fernando Alzate from the Centro Nacional de Secuenciación Genómica at the University of Antioquia for giving us access to their computing cluster and test data, and for their help in clarifying many bioinformatics-related issues. This work was supported by the University of Antioquia under project code PRV15-2-02.

Author information

Authors and Affiliations

Facultad de Ciencias y Tecnología FaCyT, Universidad de Carabobo UC, Valencia, Venezuela
Aníbal Guerra
Facultad de Ingeniería, Universidad de Antioquia UdeA, Calle 70 No. 52-21, Medellín, Colombia
Aníbal Guerra, Jaime Lotero & Sebastián Isaza

Corresponding author

Correspondence to Aníbal Guerra.

About this article

Cite this article

Guerra, A., Lotero, J. & Isaza, S. Performance comparison of sequential and parallel compression applications for DNA raw data. J Supercomput 72, 4696–4717 (2016). https://doi.org/10.1007/s11227-016-1753-4

Issue Date: December 2016
