The Journal of Supercomputing, Volume 72, Issue 12, pp 4696–4717

Performance comparison of sequential and parallel compression applications for DNA raw data

  • Aníbal Guerra
  • Jaime Lotero
  • Sebastián Isaza

Abstract

We present an experimental performance comparison of lossless compression programs for raw DNA data stored in FASTQ files. General-purpose compressors (PBZIP2, P7ZIP and PIGZ) and domain-specific compressors (SCALCE, QUIP, FASTQZ and DSRC) were analyzed in terms of compression ratio, execution speed, parallel scalability and memory consumption. Results showed that, relative to general-purpose tools, domain-specific tools increased compression ratios by up to 70 % while reducing runtime by up to \(7\times \) during compression and up to \(3\times \) during decompression. Parallel execution improved performance by up to \(13\times \) when using 20 threads. Our analysis indicates that QUIP, DSRC and PBZIP2 are the best tools in their respective categories, with acceptable memory requirements. Nevertheless, end users must consider the features of the available hardware and prioritize their optimization objectives (compression ratio, runtime during compression or decompression, scalability, etc.) in order to select the best application for each particular scenario.
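To put the scalability figure in context, the sketch below computes speedup and parallel efficiency using their standard textbook definitions (speedup = single-thread runtime / multi-thread runtime; efficiency = speedup / thread count). The function names are illustrative assumptions and are not taken from the paper; only the 13× and 20-thread figures come from the abstract.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Standard speedup: single-thread runtime divided by multi-thread runtime."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("runtimes must be positive")
    return t_serial / t_parallel


def parallel_efficiency(speedup_value: float, threads: int) -> float:
    """Fraction of ideal linear scaling achieved with the given thread count."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return speedup_value / threads


# Figures reported in the abstract: up to 13x speedup with 20 threads.
efficiency = parallel_efficiency(13.0, 20)
print(f"Parallel efficiency: {efficiency:.2f}")  # prints "Parallel efficiency: 0.65"
```

Under these definitions, the best reported case corresponds to roughly 65 % of ideal linear scaling.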

Keywords

DNA raw data compression; Performance evaluation; Parallel scalability; Memory consumption; Bioinformatics

Notes

Acknowledgments

We thank Felipe Cabarcas and Juan Fernando Alzate from the Centro Nacional de Secuenciación Genómica at the University of Antioquia for giving us access to their computing cluster and test data, and for their help in clarifying many bioinformatics-related issues. This work was supported by the University of Antioquia under project code PRV15-2-02.

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Facultad de Ciencias y Tecnología FaCyT, Universidad de Carabobo UC, Valencia, Venezuela
  2. Facultad de Ingeniería, Universidad de Antioquia UdeA, Calle 70 No. 52-21, Medellín, Colombia
