Genomic Data Compression

Zhu, Kaiyuan; Numanagić, Ibrahim; Sahinalp, S. Cenk

doi:10.1007/978-3-319-77525-8_55

Abstract

Genomic sequence data obtained through high-throughput sequencing (HTS) technologies are commonly stored either as raw sequencing reads in FASTQ format or as reads mapped to a reference genome in SAM format. Both formats have large storage footprints. The worldwide increase in HTS data has prompted the development of specialized compression methods that aim to significantly reduce HTS data size. Below is a comparative overview of available lossless genomic data compression approaches, including their advantages and pitfalls.
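To make the FASTQ layout concrete, the following is a minimal sketch (not a method taken from this entry) of one idea commonly used by specialized FASTQ compressors: splitting each four-line record into separate streams for read names, bases, and quality scores, so that each stream can be modeled and compressed on its own. The function names, the sample reads, and the use of gzip as the back-end compressor are illustrative assumptions. The sketch also assumes the "+" separator line carries no repeated read name.

```python
import gzip

# Hypothetical two-read sample; each FASTQ record spans four lines:
# "@name", bases, "+" separator, and one quality character per base.
SAMPLE = (
    "@read1\nACGTACGTAC\n+\nIIIIIHHHGG\n"
    "@read2\nACGTACGTTC\n+\nIIIIHHHGGF\n"
)


def split_fastq_streams(text):
    """Split FASTQ text into three parallel lists: names, bases, qualities."""
    lines = text.strip().split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("a FASTQ record must have exactly four lines")
    names, bases, quals = [], [], []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("bases and qualities differ in length")
        names.append(header[1:])  # drop the leading '@'
        bases.append(seq)
        quals.append(qual)
    return names, bases, quals


def join_fastq_streams(names, bases, quals):
    """Rebuild FASTQ text from the three streams (assumes a bare '+' line)."""
    return "".join(
        "@%s\n%s\n+\n%s\n" % (n, s, q) for n, s, q in zip(names, bases, quals)
    )


def compressed_size(strings):
    """Size in bytes of one stream after general-purpose gzip compression."""
    return len(gzip.compress("\n".join(strings).encode("ascii")))


names, bases, quals = split_fastq_streams(SAMPLE)
stream_sizes = {
    "names": compressed_size(names),
    "bases": compressed_size(bases),
    "qualities": compressed_size(quals),
}
```

Calling `join_fastq_streams(names, bases, quals)` on the sample reproduces the input exactly, which is the round-trip property a lossless scheme must keep. On real data the per-stream sizes in `stream_sizes` show which component dominates the compressed output; a two-read sample is far too small for such a comparison to be meaningful.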

Author information

Authors and Affiliations

Department of Computer Science, Indiana University Bloomington, Bloomington, IN, USA
Kaiyuan Zhu & S. Cenk Sahinalp
Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
Ibrahim Numanagić

Corresponding author

Correspondence to S. Cenk Sahinalp.

Editor information

Editors and Affiliations

Institute of Computer Science, University of Tartu, Tartu, Estonia
Sherif Sakr
School of Information Technologies, Sydney University, Sydney, Australia
Albert Y. Zomaya

Cite this entry

Zhu, K., Numanagić, I., Sahinalp, S.C. (2019). Genomic Data Compression. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_55

DOI: https://doi.org/10.1007/978-3-319-77525-8_55
Published: 20 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77524-1
Online ISBN: 978-3-319-77525-8
eBook Packages: Computer Science; Reference Module Computer Science and Engineering
