Abstract
High-throughput DNA sequence data generated by next generation sequencing (NGS) technologies place tremendous stress on data storage and transmission. Data compression is a candidate solution for mitigating this pressure. In this paper, a lossless reference-based compression framework, FQZip, is proposed for NGS data in FASTQ format. The three components of a FASTQ file (metadata, sequence reads, and quality scores) are compressed independently, each with its own coding scheme. The sequence reads are aligned to a reference genome, and arithmetic coding, Huffman coding, and LZMA are then used to store the alignment results needed to reconstruct them. The metadata and quality scores are stored with other simple yet efficient compression mechanisms. Experimental results on real-world NGS data indicate that FQZip achieves a higher compression ratio than other state-of-the-art NGS data compression methods.
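The idea of compressing the three FASTQ components independently can be illustrated with a minimal Python sketch. This is not the FQZip implementation: the function names (split_fastq, compress_streams, decompress_streams) are hypothetical, every stream is compressed with plain LZMA from the Python standard library rather than with the reference alignment, arithmetic coding, and Huffman coding described above, and the separator line of each record is assumed to be a bare "+".

```python
import lzma


def split_fastq(text):
    """Split FASTQ text into metadata, read, and quality streams.

    Assumes 4 lines per record and a bare "+" separator line.
    """
    lines = text.strip().split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain 4 lines per record")
    meta, reads, quals = [], [], []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or plus != "+":
            raise ValueError(f"unsupported record starting at line {i + 1}")
        meta.append(header[1:])  # drop the leading "@"
        reads.append(seq)
        quals.append(qual)
    return meta, reads, quals


def compress_streams(text):
    """Compress each stream on its own (illustrative: LZMA for all three)."""
    return tuple(
        lzma.compress("\n".join(stream).encode("ascii"))
        for stream in split_fastq(text)
    )


def decompress_streams(blobs):
    """Rebuild the original FASTQ text from the three compressed streams."""
    meta, reads, quals = (
        lzma.decompress(blob).decode("ascii").split("\n") for blob in blobs
    )
    records = []
    for header, seq, qual in zip(meta, reads, quals):
        records.extend(["@" + header, seq, "+", qual])
    return "\n".join(records) + "\n"


if __name__ == "__main__":
    sample = "@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\n!!!!\n"
    blobs = compress_streams(sample)
    # Lossless round trip: the rebuilt text equals the input.
    print(decompress_streams(blobs) == sample)
```

Separating the streams lets each one be handled by a coder suited to its statistics; in a reference-based scheme the read stream would be replaced by alignment positions and mismatches before entropy coding.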
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Zhang, Y., Li, L., Xiao, J., Yang, Y., Zhu, Z. (2015). FQZip: Lossless Reference-Based Compression of Next Generation Sequencing Data in FASTQ Format. In: Handa, H., Ishibuchi, H., Ong, YS., Tan, KC. (eds) Proceedings of the 18th Asia Pacific Symposium on Intelligent and Evolutionary Systems - Volume 2. Proceedings in Adaptation, Learning and Optimization, vol 2. Springer, Cham. https://doi.org/10.1007/978-3-319-13356-0_11
DOI: https://doi.org/10.1007/978-3-319-13356-0_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13355-3
Online ISBN: 978-3-319-13356-0
eBook Packages: Engineering, Engineering (R0)