BWT: An Index Structure to Speed-Up Both Exact and Inexact String Matching

Chen, Yangjun; Wu, Yujia

doi:10.1007/978-981-10-8476-8_12

Part of the book series: Studies in Big Data (SBD, volume 44)

Abstract

The Burrows-Wheeler transform (BWT) of a string was originally proposed for string compression, but it can also be used to speed up string matching. In this chapter, we address two issues around this mechanism: (1) how to use the BWT to improve the running time of multiple-pattern string matching; and (2) how to integrate mismatching information into a search of BWT arrays to expedite string matching with k mismatches. For the first problem, we construct the BWT array of a target string s, denoted BWT(s), and then build a trie over a set of pattern strings \( R = \{r_1, \ldots, r_l\} \), denoted T(R). By scanning BWT(s) against T(R), the time spent finding occurrences of the r_i’s can be significantly reduced. For the second problem, given a pattern string r, we precompute its mismatching information over certain substrings of r, denoted M(r), and construct a tree structure, called a mismatching tree, to record the mismatches between r and s during a search of BWT(s) against r. During this search, the mismatching tree is used to derive further mismatching information from M(r), so that redundant work is avoided. Extensive experiments comparing our methods with existing ones show that our methods are promising for both problems.
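As background for the abstract above, the following is a minimal sketch of how a BWT array can serve as a search index. It is not the chapter's algorithm: it implements neither the trie T(R) nor the mismatching tree, only the standard backward search over BWT(s) and a plain backtracking search with a mismatch budget k, which is the kind of baseline such structures aim to improve on. All function names (`build_index`, `exact_positions`, `k_mismatch_positions`) are hypothetical, the sentinel `$` is assumed not to occur in the text, and the suffix array is built by naive sorting, which is suitable for small inputs only.

```python
from collections import Counter


def build_index(text):
    """Return (sa, bwt, C, occ) for text + '$' (assumes '$' is not in text)."""
    s = text + "$"  # '$' sorts before letters and digits
    # Naive suffix array: sort suffix start positions by the suffix itself.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT: the character preceding each sorted suffix (s[-1] == '$' when i == 0).
    bwt = "".join(s[i - 1] for i in sa)
    counts = Counter(s)
    # C[ch] = number of characters in s that are strictly smaller than ch.
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, bwt, C, occ


def exact_positions(pattern, index):
    """Start positions of exact occurrences, found by backward search."""
    sa, bwt, C, occ = index
    lo, hi = 0, len(bwt)  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


def k_mismatch_positions(pattern, index, k):
    """Start positions where pattern matches with at most k mismatches."""
    sa, bwt, C, occ = index
    hits = []

    def extend(i, lo, hi, budget):
        if i < 0:  # whole pattern consumed: every row in [lo, hi) is a hit
            hits.extend(sa[j] for j in range(lo, hi))
            return
        for ch in C:
            if ch == "$":
                continue  # the sentinel never matches a pattern character
            cost = 0 if ch == pattern[i] else 1
            if cost > budget:
                continue
            nlo = C[ch] + occ[ch][lo]
            nhi = C[ch] + occ[ch][hi]
            if nlo < nhi:
                extend(i - 1, nlo, nhi, budget - cost)

    extend(len(pattern) - 1, 0, len(bwt), k)
    return sorted(hits)


index = build_index("acagaca")
print(index[1])                               # BWT of "acagaca$"
print(exact_positions("aca", index))          # exact occurrences
print(k_mismatch_positions("aca", index, 1))  # at most one mismatch
```

A set of patterns can be handled with this sketch by calling `exact_positions` once per pattern. The backtracking search explores every character at every pattern position while the budget allows, so it may repeat work across branches; the abstract's mismatching tree targets that kind of redundancy.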

Author information

Authors and Affiliations

Department of Applied Computer Science, University of Winnipeg, Winnipeg, Canada
Yangjun Chen & Yujia Wu

Corresponding author

Correspondence to Yangjun Chen.

Editor information

Editors and Affiliations

School of Computing Science and Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, India
Sanjiban Sekhar Roy
Department of Civil Engineering, National Institute of Technology Patna, Patna, Bihar, India
Pijush Samui
University of Southern Queensland, Springfield, Queensland, Australia
Ravinesh Deo
Polytechnic University of Milan, Milan, Italy
Stavros Ntalampiras

About this chapter

Cite this chapter

Chen, Y., Wu, Y. (2018). BWT: An Index Structure to Speed-Up Both Exact and Inexact String Matching. In: Roy, S., Samui, P., Deo, R., Ntalampiras, S. (eds) Big Data in Engineering Applications. Studies in Big Data, vol 44. Springer, Singapore. https://doi.org/10.1007/978-981-10-8476-8_12

DOI: https://doi.org/10.1007/978-981-10-8476-8_12
Published: 03 May 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8475-1
Online ISBN: 978-981-10-8476-8
eBook Packages: Engineering, Engineering (R0)