Abstract
New generation sequencing technologies produce massive data sets of millions of reads, making the compression of sequence read files an important problem. The sequential order of the reads in these files typically conveys no biologically significant information, providing the freedom to reorder them so as to facilitate compression. Similarly, for many problems the two orientations of a read (original or reverse complement) are indistinguishable from an information-theoretic perspective, providing the freedom to optimize the orientation of each read.
In this paper, we introduce a class of algorithmic problems concerned with optimizing read ordering and orientation for sequence compression. We show that most of the interesting variants are hard, but provide heuristics yielding strong approximation guarantees. In particular, we give a linear-time 2-approximation algorithm for the optimal ordering/orientation under the prefix match criterion. Further, through experiments on a number of data sets, we demonstrate that this heuristic works well in practice. A prototype implementation of this 2-approximation algorithm is available at https://github.com/LaoZZZZZ/prefixMatching.
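To make the two freedoms concrete, the following minimal Python sketch reorders and reorients a handful of toy reads. It is not the paper's 2-approximation algorithm, which the abstract does not spell out. It assumes that the prefix match objective can be read as the total length of the longest common prefixes shared by consecutive reads, and it uses a simple baseline (replace each read by the lexicographically smaller of itself and its reverse complement, then sort). The function names `reverse_complement`, `lcp`, `prefix_match_score`, and `canonical_sort` are hypothetical and chosen only for this illustration.

```python
from typing import List

# Complement table for DNA bases; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(read: str) -> str:
    """Return the reverse complement of a DNA read."""
    return read.translate(_COMPLEMENT)[::-1]


def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_match_score(reads: List[str]) -> int:
    """Assumed objective: total prefix shared by consecutive reads."""
    return sum(lcp(a, b) for a, b in zip(reads, reads[1:]))


def canonical_sort(reads: List[str]) -> List[str]:
    """Baseline heuristic (an assumption, not the paper's method):
    orient each read to the smaller of the read and its reverse
    complement, then sort the oriented reads lexicographically."""
    return sorted(min(r, reverse_complement(r)) for r in reads)


if __name__ == "__main__":
    reads = ["ACGTT", "TTGCA", "AACGT", "ACGAA"]
    print("original order score:", prefix_match_score(reads))
    arranged = canonical_sort(reads)
    print("arranged reads:", arranged)
    print("arranged score:", prefix_match_score(arranged))
```

On this toy input the first and third reads are reverse complements of each other, so choosing orientations lets them become identical neighbours, which is exactly the kind of redundancy a downstream compressor can exploit.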
Acknowledgments
This research was partially supported by NSF Grants DBI-1355990 and IIS-1017181, and a Google Faculty Research Award to Steven Skiena. E. Arkin and J. Mitchell acknowledge support from the National Science Foundation (CCF-1540890) and the US-Israel Binational Science Foundation (BSF project 2010074). Rezaul Chowdhury was supported by NSF grant CCF-1439084. Rob Patro would like to acknowledge Carl Kingsford and Geet Duggal for useful discussions and for helping to initially pose and explore the prefix matching with reversal problem. Finally, we would like to thank the anonymous reviewers for suggestions and comments which greatly improved the manuscript.
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sichen, Z. et al. (2015). Optimizing Read Reversals for Sequence Compression. In: Pop, M., Touzet, H. (eds) Algorithms in Bioinformatics. WABI 2015. Lecture Notes in Computer Science, vol 9289. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48221-6_14
DOI: https://doi.org/10.1007/978-3-662-48221-6_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-48220-9
Online ISBN: 978-3-662-48221-6
eBook Packages: Computer Science (R0)