Genomic Data Compression

  • Reference work entry
  • In: Encyclopedia of Big Data Technologies

Abstract

Genomic sequence data obtained through high-throughput sequencing (HTS) technologies are commonly stored either as raw sequencing reads in FASTQ format or as reads mapped to a reference genome in SAM format. Both formats have large storage footprints. The worldwide growth of HTS data has prompted the development of specialized compression methods that aim to reduce HTS data size significantly. This entry gives a comparative overview of the available lossless genomic data compression approaches, including their advantages and pitfalls.

Author information

Corresponding author

Correspondence to S. Cenk Sahinalp.

Copyright information

© 2019 Springer Nature Switzerland AG

About this entry

Cite this entry

Zhu, K., Numanagić, I., Sahinalp, S.C. (2019). Genomic Data Compression. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_55
