Part of the book series: Proceedings in Adaptation, Learning and Optimization ((PALO,volume 2))

Abstract

High-throughput DNA sequence data generated by next generation sequencing (NGS) technologies place tremendous stress on data storage and transmission. Data compression is a candidate solution to mitigate this pressure. In this paper, a lossless reference-based compression framework, FQZip, is proposed for NGS data in FASTQ format. The three components of a FASTQ file, namely metadata, sequence reads, and quality scores, are compressed independently, each with its own coding scheme. The sequence reads are aligned to a reference genome, and arithmetic coding, Huffman coding, and LZMA are then used to store the alignment results needed for reconstruction. The metadata and quality scores are stored with other simple yet efficient compression mechanisms. Experimental results on real-world NGS data indicate that FQZip achieves a higher compression ratio than other state-of-the-art NGS data compression methods.
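The idea of separating a FASTQ file into three independently compressed streams can be illustrated with a minimal Python sketch. This is not FQZip's implementation: it skips the reference alignment step entirely and simply applies LZMA from the Python standard library to every stream, where the paper uses stream-specific coders. The function names `split_fastq`, `compress_streams`, and `decompress_streams` are hypothetical, and the sketch assumes four-line records whose separator line is a bare `+`.

```python
import lzma


def split_fastq(text):
    """Split FASTQ text into three streams: metadata, reads, quality scores.

    Assumes four-line records with a bare '+' separator line (an assumption
    of this sketch, not a requirement stated in the paper).
    """
    lines = text.strip().split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ records must have four lines each")
    metadata, reads, quals = [], [], []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or plus != "+":
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"read/quality length mismatch at line {i + 1}")
        metadata.append(header[1:])  # drop the leading '@'
        reads.append(seq)
        quals.append(qual)
    return metadata, reads, quals


def compress_streams(text):
    """Compress each stream independently (LZMA here, purely for illustration)."""
    return tuple(
        lzma.compress("\n".join(stream).encode("ascii"))
        for stream in split_fastq(text)
    )


def decompress_streams(blobs):
    """Rebuild the FASTQ text from the three compressed streams."""
    metadata, reads, quals = (
        lzma.decompress(blob).decode("ascii").split("\n") for blob in blobs
    )
    return "".join(
        f"@{m}\n{r}\n+\n{q}\n" for m, r, q in zip(metadata, reads, quals)
    )


if __name__ == "__main__":
    sample = "@r1 lane=1\nACGTACGT\n+\nIIIIHHHH\n@r2 lane=1\nTTGACCAA\n+\nFFFFGGGG\n"
    blobs = compress_streams(sample)
    # Lossless: the round trip reproduces the input exactly.
    assert decompress_streams(blobs) == sample
    print([len(b) for b in blobs])
```

Keeping the streams separate lets each one be handed to a coder suited to its statistics, which is the design choice the abstract describes; in a reference-based scheme the read stream would be replaced by alignment positions and mismatches rather than raw bases.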

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Zhang, Y., Li, L., Xiao, J., Yang, Y., Zhu, Z. (2015). FQZip: Lossless Reference-Based Compression of Next Generation Sequencing Data in FASTQ Format. In: Handa, H., Ishibuchi, H., Ong, YS., Tan, KC. (eds) Proceedings of the 18th Asia Pacific Symposium on Intelligent and Evolutionary Systems - Volume 2. Proceedings in Adaptation, Learning and Optimization, vol 2. Springer, Cham. https://doi.org/10.1007/978-3-319-13356-0_11

  • DOI: https://doi.org/10.1007/978-3-319-13356-0_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-13355-3

  • Online ISBN: 978-3-319-13356-0

  • eBook Packages: Engineering, Engineering (R0)
