Abstract
Genomic sequence data obtained through high-throughput sequencing (HTS) technologies are commonly stored either as raw sequencing reads in FASTQ format or as reads mapped to a reference genome in SAM format. Both formats have large memory footprints. The worldwide growth of HTS data has prompted the development of specialized compression methods that aim to reduce the size of these data significantly. This entry presents a comparative overview of the available lossless genomic data compression approaches, including their advantages and pitfalls.
Copyright information
© 2019 Springer Nature Switzerland AG
About this entry
Cite this entry
Zhu, K., Numanagić, I., Sahinalp, S.C. (2019). Genomic Data Compression. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_55
DOI: https://doi.org/10.1007/978-3-319-77525-8_55
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77524-1
Online ISBN: 978-3-319-77525-8
eBook Packages: Computer Science; Reference Module Computer Science and Engineering