Fast Indexes for Gapped Pattern Matching

  • Manuel Cáceres
  • Simon J. Puglisi
  • Bella Zhukova
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12011)

Abstract

We describe indexes for searching large data sets for variable-length-gapped (VLG) patterns. VLG patterns are composed of two or more subpatterns, between each adjacent pair of which is a gap-constraint specifying upper and lower bounds on the distance allowed between subpatterns. VLG patterns have numerous applications in computational biology (motif search), information retrieval (e.g., for language models, snippet generation, machine translation) and capture a useful subclass of the regular expressions commonly used in practice for searching source code. Our best approach provides search speeds several times faster than prior art across a broad range of patterns and texts.
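To make the definition of a VLG pattern concrete, the following is a minimal sketch of a naive matcher, not the indexes described in the paper. It assumes a gap is measured as the number of characters between the end of one subpattern and the start of the next, with both bounds inclusive; the function names `occurrences` and `vlg_matches` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left, bisect_right


def occurrences(text, sub):
    """Return the sorted start positions of every occurrence of sub (overlaps allowed)."""
    positions = []
    i = text.find(sub)
    while i != -1:
        positions.append(i)
        i = text.find(sub, i + 1)
    return positions


def vlg_matches(text, subpatterns, gaps):
    """Return tuples of start positions, one per subpattern, that satisfy every gap.

    gaps[k] = (lo, hi) bounds the number of characters between the end of
    subpatterns[k] and the start of subpatterns[k + 1] (assumed convention).
    """
    assert len(gaps) == len(subpatterns) - 1
    occ = [occurrences(text, s) for s in subpatterns]
    results = [(p,) for p in occ[0]]
    for k, (lo, hi) in enumerate(gaps):
        nxt = occ[k + 1]
        extended = []
        for match in results:
            end = match[-1] + len(subpatterns[k])
            # Binary search for the next subpattern's starts in [end + lo, end + hi].
            left = bisect_left(nxt, end + lo)
            right = bisect_right(nxt, end + hi)
            extended.extend(match + (q,) for q in nxt[left:right])
        results = extended
    return results


if __name__ == "__main__":
    # "GA", then 1 to 3 arbitrary characters, then "CA".
    print(vlg_matches("GATTACAGATCA", ["GA", "CA"], [(1, 3)]))
```

Under this convention the example pattern behaves like the regular expression `GA.{1,3}CA`, except that the sketch reports every combination of subpattern positions rather than only non-overlapping matches. It scans the whole text for each subpattern, which is the cost an index is meant to avoid.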

Notes

Acknowledgments

Our thanks go to Tania Starikovskaya for suggesting the problem of indexing for regular-expression matching to us. We also thank Matthias Petri and Simon Gog for prompt answers to questions about their article and code and the anonymous reviewers for helpful comments. This work was funded by the Academy of Finland via grant 319454 and by EU’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie grant agreement No. 690941 (BIRDS).

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Manuel Cáceres (1)
  • Simon J. Puglisi (2)
  • Bella Zhukova (2)

  1. Department of Computer Science, University of Chile, Santiago, Chile
  2. Department of Computer Science, University of Helsinki, Helsinki Institute for Information Technology (HIIT), Helsinki, Finland
