Optimizing Read Reversals for Sequence Compression

Sichen, Zhong; Zhao, Lu; Liang, Yan; Zamani, Mohammadzaman; Patro, Rob; Chowdhury, Rezaul; Arkin, Esther M.; Mitchell, Joseph S. B.; Skiena, Steven

doi:10.1007/978-3-662-48221-6_14

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9289)

Included in the following conference series:

International Workshop on Algorithms in Bioinformatics

Abstract

New generation sequencing technologies produce massive data sets of millions of reads, making the compression of sequence read files an important problem. The sequential order of the reads in these files typically conveys no biologically significant information, providing the freedom to reorder them so as to facilitate compression. Similarly, for many problems the two orientations of a read (original or reverse complement) are indistinguishable from an information-theoretic perspective, providing the freedom to optimize the orientation of each read.
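
To make the orientation freedom concrete, the following is a minimal Python sketch of the reverse complement operation on a DNA read. The function names and the `canonical` convention (keeping the lexicographically smaller of the two orientations) are illustrative assumptions and are not taken from the paper.

```python
# Complement table for the four DNA bases; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(read: str) -> str:
    """Return the reverse complement of a DNA read."""
    # Complement every base, then reverse the whole string.
    return read.translate(_COMPLEMENT)[::-1]


def canonical(read: str) -> str:
    """Hypothetical convention: keep the lexicographically smaller orientation."""
    return min(read, reverse_complement(read))


reads = ["TTGCA", "ACGTT", "GGGAC"]
oriented = [canonical(r) for r in reads]
print(oriented)  # ['TGCAA', 'AACGT', 'GGGAC']
```

Applying `reverse_complement` twice returns the original read, so storing either orientation loses nothing as long as downstream analysis treats the two as equivalent.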

In this paper, we introduce a class of algorithmic problems concerned with optimizing read ordering and orientation for sequence compression. We show that most of the interesting variants are hard, but provide heuristics yielding strong approximation guarantees. In particular, we give a linear-time 2-approximation algorithm for the optimal ordering/orientation under the prefix match criterion. Further, through experiments on a number of data sets, we demonstrate that this heuristic works well in practice. A prototype implementation of this 2-factor approximation is available at https://github.com/LaoZZZZZ/prefixMatching.
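
The abstract does not define the prefix match criterion. As a rough illustration only, the sketch below assumes that an ordering is scored by the total length of the longest common prefixes of consecutive reads, and compares the input order with a plain lexicographic sort. Both the scoring function and the sort baseline are assumptions made for this example; this is not the paper's 2-approximation algorithm, and it does not choose orientations.

```python
def lcp_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_match_score(reads: list[str]) -> int:
    """Assumed objective: sum of common-prefix lengths over consecutive reads."""
    return sum(lcp_len(p, q) for p, q in zip(reads, reads[1:]))


reads = ["ACGTA", "TTGCA", "ACGGA", "TTGAC"]
print(prefix_match_score(reads))          # input order: 0
print(prefix_match_score(sorted(reads)))  # lexicographic order: 6
```

Under this assumed score, a higher value means each read shares more leading bases with its predecessor, which is the kind of redundancy a prefix-based encoder could exploit.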

Acknowledgments

This research was partially supported by NSF Grants DBI-1355990 and IIS-1017181, and a Google Faculty Research Award to Steven Skiena. E. Arkin and J. Mitchell acknowledge support from the National Science Foundation (CCF-1540890) and the US-Israel Binational Science Foundation (BSF project 2010074). Rezaul Chowdhury was supported by NSF grant CCF-1439084. Rob Patro would like to acknowledge Carl Kingsford and Geet Duggal for useful discussions and for helping to initially pose and explore the prefix matching with reversal problem. Finally, we would like to thank the anonymous reviewers for suggestions and comments which greatly improved the manuscript.

Author information

Authors and Affiliations

Department of Computer Science, Stony Brook University, Stony Brook, USA
Mohammadzaman Zamani, Rob Patro, Rezaul Chowdhury & Steven Skiena
Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, USA
Zhong Sichen, Lu Zhao, Yan Liang, Esther M. Arkin & Joseph S. B. Mitchell

Corresponding author

Correspondence to Rob Patro.

Editor information

Editors and Affiliations

University of Maryland, College Park, Maryland, USA
Mihai Pop
University of Lille, Lille, France
Hélène Touzet

About this paper

Cite this paper

Sichen, Z. et al. (2015). Optimizing Read Reversals for Sequence Compression. In: Pop, M., Touzet, H. (eds) Algorithms in Bioinformatics. WABI 2015. Lecture Notes in Computer Science, vol 9289. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48221-6_14

DOI: https://doi.org/10.1007/978-3-662-48221-6_14
Published: 28 August 2015
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-48220-9
Online ISBN: 978-3-662-48221-6
eBook Packages: Computer Science, Computer Science (R0)