Optimizing Read Reversals for Sequence Compression

(Extended Abstract)

  • Conference paper
Algorithms in Bioinformatics (WABI 2015)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9289)

Abstract

New generation sequencing technologies produce massive data sets of millions of reads, making the compression of sequence read files an important problem. The sequential order of the reads in these files typically conveys no biologically significant information, providing the freedom to reorder them so as to facilitate compression. Similarly, for many problems the two orientations of a read (original or reverse complement) are indistinguishable from an information-theoretic perspective, providing the freedom to optimize the orientation of each read.
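
To make the orientation freedom concrete, here is a minimal Python sketch of the reverse complement operation. The function name and the restriction to the four-letter alphabet A, C, G, T are assumptions made for illustration; they do not come from the paper or its prototype.

```python
# Illustrative sketch only: names and alphabet handling are assumptions.
# Maps each base to its Watson-Crick partner (A<->T, C<->G).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read):
    """Return the reverse complement of a DNA read over the alphabet ACGT."""
    # Complement every base, then reverse the string.
    return read.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    read = "AACG"
    flipped = reverse_complement(read)
    # Applying the operation twice restores the original read, so each
    # read has exactly two interchangeable orientations.
    print(read, flipped, reverse_complement(flipped))
```

Because the operation is its own inverse, storing either orientation loses nothing for applications that treat the two as equivalent.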

In this paper, we introduce a class of algorithmic problems concerned with optimizing read ordering and orientation for sequence compression. We show that most of the interesting variants are hard, but provide heuristics yielding strong approximation guarantees. In particular, we give a linear-time 2-approximation algorithm for the optimal ordering/orientation under the prefix match criterion. Further, through experiments on a number of data sets, we demonstrate that this heuristic works well in practice. A prototype implementation of this 2-factor approximation is available at https://github.com/LaoZZZZZ/prefixMatching.
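
The following Python sketch illustrates the kind of objective involved. It assumes that the prefix match criterion rewards the length of the common prefix shared by consecutive reads; the abstract does not spell out the definition, so this scoring is an assumption. The baseline shown (keep the lexicographically smaller orientation of each read, then sort) is a naive heuristic chosen for illustration. It is not the paper's 2-approximation algorithm, and all function names are hypothetical.

```python
# Illustrative sketch only: the objective and the baseline are assumptions,
# not the algorithm from the paper.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read):
    """Reverse complement of a read over the alphabet ACGT."""
    return read.translate(_COMPLEMENT)[::-1]


def common_prefix_length(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def shared_prefix_total(reads):
    """Assumed objective: total common-prefix length over consecutive reads."""
    return sum(common_prefix_length(a, b) for a, b in zip(reads, reads[1:]))


def canonical_sort(reads):
    """Naive baseline: pick the smaller orientation of each read, then sort."""
    return sorted(min(r, reverse_complement(r)) for r in reads)


if __name__ == "__main__":
    reads = ["ACGTT", "AACGA", "ACGAA"]
    print(shared_prefix_total(reads))                  # score in input order
    print(canonical_sort(reads))                       # reoriented, reordered
    print(shared_prefix_total(canonical_sort(reads)))  # score after baseline
```

On this toy input the baseline raises the assumed score from 2 to 5, because flipping `ACGTT` to `AACGT` places it next to `AACGA`, with which it shares a four-character prefix.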



Acknowledgments

This research was partially supported by NSF Grants DBI-1355990 and IIS-1017181, and a Google Faculty Research Award to Steven Skiena. E. Arkin and J. Mitchell acknowledge support from the National Science Foundation (CCF-1540890) and the US-Israel Binational Science Foundation (BSF project 2010074). Rezaul Chowdhury was supported by NSF grant CCF-1439084. Rob Patro would like to acknowledge Carl Kingsford and Geet Duggal for useful discussions and for helping to initially pose and explore the prefix matching with reversal problem. Finally, we would like to thank the anonymous reviewers for suggestions and comments which greatly improved the manuscript.

Author information

Corresponding author

Correspondence to Rob Patro.

Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sichen, Z. et al. (2015). Optimizing Read Reversals for Sequence Compression. In: Pop, M., Touzet, H. (eds) Algorithms in Bioinformatics. WABI 2015. Lecture Notes in Computer Science, vol 9289. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48221-6_14

  • DOI: https://doi.org/10.1007/978-3-662-48221-6_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-48220-9

  • Online ISBN: 978-3-662-48221-6

  • eBook Packages: Computer Science (R0)
