Abstract
The Burrows-Wheeler transform (BWT) of a string was originally proposed for string compression, but it can also be used to speed up string matching. In this chapter, we address two issues around this mechanism: (1) how to use the BWT to improve the running time of multiple-pattern string matching; and (2) how to integrate mismatching information into a search of BWT arrays to expedite string matching with k mismatches. For the first problem, we first construct the BWT array of a target string s, denoted BWT(s), and then build a trie over a set of pattern strings \( R = \{ r_{1}, \ldots, r_{l} \} \), denoted T(R). By scanning BWT(s) against T(R), the time spent finding occurrences of the patterns \( r_{i} \) can be significantly reduced. For the second problem, given a pattern string r, we precompute its mismatching information over certain substrings of r, denoted M(r), and construct a tree structure, called a mismatching tree, that records the mismatches between r and s during a search of BWT(s) against r. During this search, the mismatching tree is used together with M(r) to derive mismatching information instead of recomputing it, which avoids redundant work. Extensive experiments comparing our methods with existing ones show that our methods are promising for both problems.
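To make the two settings concrete, the following is a minimal Python sketch of standard BWT backward search, not the chapter's algorithms. It assumes a naive suffix-array construction, a sentinel `$` that does not occur in s and sorts before every other character, and hypothetical function names (`bwt_index`, `search`, `search_many`). `search` handles k mismatches by plain backtracking, without any mismatching tree or M(r); `search_many` walks a trie of reversed patterns so that patterns with a common suffix share their backward-search steps.

```python
def bwt_index(s):
    """Build the suffix array, BWT, C table and occurrence counts of s + '$'."""
    text = s + "$"  # assumption: '$' is absent from s and smaller than any character in s
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction, for illustration only
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each sorted suffix
    alphabet = sorted(set(text))
    c, total = {}, 0  # c[ch] = number of characters in text strictly smaller than ch
    for ch in alphabet:
        c[ch] = total
        total += text.count(ch)
    # occ[ch][i] = number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return sa, bwt, c, occ


def search(r, index, k=0):
    """Start positions in s where r occurs with at most k mismatches (plain backtracking)."""
    sa, bwt, c, occ = index
    hits = set()

    def extend(i, lo, hi, budget):
        # [lo, hi) is the suffix-array range matching the part of r processed so far
        if lo >= hi:
            return
        if i < 0:
            hits.update(sa[lo:hi])
            return
        for ch in c:
            if ch == "$":
                continue  # never match across the sentinel
            cost = 0 if ch == r[i] else 1
            if cost <= budget:
                extend(i - 1, c[ch] + occ[ch][lo], c[ch] + occ[ch][hi], budget - cost)

    extend(len(r) - 1, 0, len(bwt), k)
    return sorted(hits)


def search_many(patterns, index):
    """Exact search for several patterns via a trie of the reversed patterns."""
    sa, bwt, c, occ = index
    trie = {}
    for p in patterns:
        node = trie
        for ch in reversed(p):
            node = node.setdefault(ch, {})
        node[""] = p  # end marker: the pattern that ends at this trie node
    found = {p: [] for p in patterns}
    stack = [(trie, 0, len(bwt))]
    while stack:
        node, lo, hi = stack.pop()
        for ch, child in node.items():
            if ch == "":
                found[child] = sorted(sa[lo:hi])
            elif ch in c:
                nlo, nhi = c[ch] + occ[ch][lo], c[ch] + occ[ch][hi]
                if nlo < nhi:  # prune: no pattern below this node can occur
                    stack.append((child, nlo, nhi))
    return found
```

For the target string `acagaca`, `search("aca", bwt_index("acagaca"))` returns `[0, 4]`, and allowing one mismatch (`k=1`) additionally reports position 2, where `aga` differs from `aca` in a single character.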
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Chen, Y., Wu, Y. (2018). BWT: An Index Structure to Speed-Up Both Exact and Inexact String Matching. In: Roy, S., Samui, P., Deo, R., Ntalampiras, S. (eds) Big Data in Engineering Applications. Studies in Big Data, vol 44. Springer, Singapore. https://doi.org/10.1007/978-981-10-8476-8_12
DOI: https://doi.org/10.1007/978-981-10-8476-8_12
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8475-1
Online ISBN: 978-981-10-8476-8
eBook Packages: Engineering, Engineering (R0)