De Novo NGS Data Compression

Abstract

This chapter deals with the compression of genomic data without reference genomes. It presents various techniques which have been specifically developed to compress sequencing data in lossless or lossy modes. The chapter also provides an evaluation of different NGS data compressor tools.

Author information

Corresponding author

Correspondence to Dominique Lavenier.

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Benoit, G., Lemaitre, C., Rizk, G., Drezen, E., Lavenier, D. (2017). De Novo NGS Data Compression. In: Elloumi, M. (ed.) Algorithms for Next-Generation Sequencing Data. Springer, Cham. https://doi.org/10.1007/978-3-319-59826-0_4

  • DOI: https://doi.org/10.1007/978-3-319-59826-0_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-59824-6

  • Online ISBN: 978-3-319-59826-0

  • eBook Packages: Computer Science, Computer Science (R0)