Compressing Genomic Sequence Fragments Using SlimGene

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2010)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 6044)

Abstract

With the advent of next-generation sequencing technologies, the cost of sequencing whole genomes is poised to go below $1000 per human individual in a few years. As more and more genomes are sequenced, analysis methods are undergoing rapid development, making it tempting to store sequencing data for long periods of time so that the data can be re-analyzed with the latest techniques. The challenging open research problems, huge influx of data, and rapidly improving analysis techniques have created the need to store and transfer very large volumes of data.

Compression can be achieved at many levels, including the trace level (compressing image data), the sequence level (compressing a genomic sequence), and the fragment level (compressing a set of short, redundant fragment reads, along with quality values on the base calls). We focus on fragment-level compression, which is the pressing need today.

Our paper makes two contributions, implemented in a tool, SlimGene. First, we introduce a set of domain-specific lossless compression schemes that achieve over 40× compression of fragments, outperforming bzip2 by over 6×. Including quality values, we show a 5× compression using less running time than bzip2. Second, given the discrepancy between the compression factors obtained with and without quality values, we initiate the study of using ‘lossy’ quality values. Specifically, we show that a lossy quality value quantization results in 14× compression but has minimal impact on downstream applications, such as SNP calling, that use the quality values. Discrepancies between the SNP calls made on the lossy and lossless versions of the data are limited to low-coverage areas where even the SNP calls made on the lossless version are marginal.
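To make the idea of lossy quality value quantization concrete, the following is a minimal sketch and not SlimGene's actual scheme, which the abstract does not specify. It assumes Phred-style integer quality values in the range 0 to 40, a uniform bin width of 8, and zlib as a stand-in general-purpose compressor. The function names, parameters, and synthetic data are all hypothetical.

```python
import random
import zlib


def quantize_quality(quals, bin_width=8, max_q=40):
    """Lossy step: replace each score by the midpoint of its bin.

    Uniform binning is an illustrative assumption, not the paper's method.
    """
    out = []
    for q in quals:
        lo = (q // bin_width) * bin_width
        hi = min(lo + bin_width - 1, max_q)
        out.append((lo + hi) // 2)
    return out


def compressed_size(quals):
    """Size in bytes after compressing the scores with zlib."""
    return len(zlib.compress(bytes(quals), 9))


# Synthetic quality values (assumption); real values come from a sequencer.
rng = random.Random(0)
quals = [rng.randint(2, 40) for _ in range(10_000)]
lossy = quantize_quality(quals)

print(len(set(quals)), "distinct values before,", len(set(lossy)), "after")
print(compressed_size(quals), "bytes original vs", compressed_size(lossy), "bytes quantized")
```

Collapsing the alphabet of quality values lowers the entropy of the stream, so the same compressor produces a smaller output, at the cost of a bounded error in each score (at most 4 with these settings). Whether that error matters is an empirical question about the downstream application, which is what the SNP-calling comparison in the abstract addresses.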

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kozanitis, C., Saunders, C., Kruglyak, S., Bafna, V., Varghese, G. (2010). Compressing Genomic Sequence Fragments Using SlimGene. In: Berger, B. (eds) Research in Computational Molecular Biology. RECOMB 2010. Lecture Notes in Computer Science, vol 6044. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12683-3_20

  • DOI: https://doi.org/10.1007/978-3-642-12683-3_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12682-6

  • Online ISBN: 978-3-642-12683-3

  • eBook Packages: Computer Science, Computer Science (R0)
