BWT: An Index Structure to Speed-Up Both Exact and Inexact String Matching

Big Data in Engineering Applications

Part of the book series: Studies in Big Data (SBD, volume 44)

Abstract

The Burrows-Wheeler transform (BWT) of a string was originally proposed for string compression, but it can also be used to speed up string matching. In this chapter, we address two issues around this mechanism: (1) how to use the BWT to improve the running time of multiple-pattern string matching; and (2) how to integrate mismatching information into a search of BWT arrays to expedite string matching with k mismatches. For the first problem, we first construct the BWT array of a target string s, denoted BWT(s), and then build a trie over a set of pattern strings \( R = \{ r_1, \ldots, r_l \} \), denoted T(R). By scanning BWT(s) against T(R), the time spent finding occurrences of the \( r_i \)'s can be significantly reduced. For the second problem, for a given pattern string r, we precompute its mismatching information over certain substrings of r, denoted M(r), and construct a tree structure, called a mismatching tree, to record the mismatches between r and s during a search of BWT(s) against r. During this search, the mismatching tree is used together with M(r) to derive mismatching information that would otherwise have to be recomputed, which avoids redundant work. Extensive experiments comparing our methods with existing ones show that our methods are promising for both problems.
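To make the idea of searching a BWT array concrete, the following is a minimal Python sketch of the standard textbook technique (FM-index style backward search), not the chapter's own algorithms. It builds BWT(s) with a naive suffix sort, counts exact occurrences of one pattern, and handles k mismatches by plain backtracking over the alphabet. The function names, the "$" terminator, and the dense occurrence table are illustrative assumptions; the trie T(R) and the mismatching tree described in the abstract are not implemented here.

```python
def build_index(text):
    """Build a toy BWT index. Assumes "$" does not occur in text and
    sorts before every character of text (true for ASCII letters)."""
    s = text + "$"
    # Naive suffix array: fine for a demo, far too slow for genomes.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT(s): the character preceding each sorted suffix (wraps to "$").
    bwt = "".join(s[i - 1] for i in sa)
    alphabet = sorted(set(s))
    # C[c] = number of characters in s strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += s.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, bwt, C, occ


def backward_search(pattern, index):
    """Exact matching: return sorted start positions of pattern in text."""
    sa, bwt, C, occ = index
    lo, hi = 0, len(bwt)  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


def search_k_mismatches(pattern, index, k):
    """Inexact matching by brute-force backtracking (Hamming distance <= k).
    This is a simple baseline, not the mismatching-tree method."""
    sa, bwt, C, occ = index
    hits = set()

    def extend(i, lo, hi, budget):
        if i < 0:  # whole pattern consumed: every row in range is a hit
            hits.update(sa[lo:hi])
            return
        for c in C:
            if c == "$":
                continue  # never match across the terminator
            cost = 0 if c == pattern[i] else 1
            if cost > budget:
                continue
            nlo = C[c] + occ[c][lo]
            nhi = C[c] + occ[c][hi]
            if nlo < nhi:
                extend(i - 1, nlo, nhi, budget - cost)

    extend(len(pattern) - 1, 0, len(bwt), k)
    return sorted(hits)


index = build_index("acagaca")
print(index[1])                              # acg$caaa
print(backward_search("aca", index))         # [0, 4]
print(search_k_mismatches("aca", index, 1))  # [0, 2, 4]
```

The backtracking search revisits the same suffix-array ranges many times as k grows, which is the kind of redundancy the abstract says the mismatching tree and M(r) are meant to avoid.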

Author information

Corresponding author

Correspondence to Yangjun Chen.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Chen, Y., Wu, Y. (2018). BWT: An Index Structure to Speed-Up Both Exact and Inexact String Matching. In: Roy, S., Samui, P., Deo, R., Ntalampiras, S. (eds) Big Data in Engineering Applications. Studies in Big Data, vol 44. Springer, Singapore. https://doi.org/10.1007/978-981-10-8476-8_12

  • DOI: https://doi.org/10.1007/978-981-10-8476-8_12

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-8475-1

  • Online ISBN: 978-981-10-8476-8

  • eBook Packages: Engineering, Engineering (R0)
