Advertisement

Practical Variable Length Gap Pattern Matching

  • Johannes Bader
  • Simon Gog
  • Matthias Petri
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9685)

Abstract

Solving the problem of reporting all occurrences of patterns containing variable length gaps in an input text T efficiently is important for various applications in a broad range of domains such as Bioinformatics or Natural Language Processing. In this paper we present an efficient solution for static inputs which utilizes the wavelet tree of the suffix array. The algorithm partially traverses the wavelet tree to find matches and can be easily adapted to several variants of the problem. We explore the practical properties of our solution in an experimental study where we compare to online and semi-indexed solutions using standard datasets. The experiments show that our approach is the best choice for searching patterns with many gaps in large texts.

Keywords

Online Algorithm Query Time Space Usage Suffix Tree Pattern Query 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgement

We are grateful to Timo Bingmann for profiling our initial implementation. This work was supported under the Australian Research Council’s Discovery Projects scheme (project DP140103256) and Deutsche Forschungsgemeinschaft.

References

  1. 1.
    Aho, A.V., Corasick, M.J.: Efficient string matching: an aid to bibliographic search. Commun. ACM 18(6), 333–340 (1975)MathSciNetCrossRefzbMATHGoogle Scholar
  2. 2.
    Baeza-Yates, R.: A fast set intersection algorithm for sorted sequences. In: Sahinalp, S.C., Muthukrishnan, S.M., Dogrusoz, U. (eds.) CPM 2004. LNCS, vol. 3109, pp. 400–408. Springer, Heidelberg (2004)CrossRefGoogle Scholar
  3. 3.
    Bille, P., Gørtz, I.L.: Substring range reporting. Algorithmica 69(2), 384–396 (2014)MathSciNetCrossRefzbMATHGoogle Scholar
  4. 4.
    Bille, P., Thorup, M.: Regular expression matching with multi-strings and intervals. In: Proceedings of SODA, pp. 1297–1308 (2010)Google Scholar
  5. 5.
    Bille, P., Gørtz, I.L., Vildhøj, H.W., Wind, D.K.: String matching with variable length gaps. Theor. Comput. Sci. 443, 25–34 (2012)MathSciNetCrossRefzbMATHGoogle Scholar
  6. 6.
    Fredriksson, K., Grabowski, S.: Efficient algorithms for pattern matching with general gaps, character classes, and transposition invariance. Inf. Retrieval 11(4), 335–357 (2008)CrossRefGoogle Scholar
  7. 7.
    Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Heidelberg (2014)Google Scholar
  8. 8.
    Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of SODA, pp. 841–850 (2003)Google Scholar
  9. 9.
    Hulo, N., Bairoch, A., Bulliard, V., Cerutti, L., De Castro, E., Langendijk-Genevaux, P.S., Pagni, M., Sigrist, C.J.A.: The PROSITE database. Nucleic Acids Res. 34(suppl 1), D227–D230 (2006)CrossRefGoogle Scholar
  10. 10.
    Knuth, D.E., Morris Jr., J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM J. Comput. 6(2), 323–350 (1977)MathSciNetCrossRefzbMATHGoogle Scholar
  11. 11.
    Lemire, D., Boytsov, L.: Decoding billions of integers per second through vectorization. Soft. Prac. Exp. 45(1), 1–29 (2015)CrossRefGoogle Scholar
  12. 12.
    Lewenstein, M.: Indexing with gaps. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds.) SPIRE 2011. LNCS, vol. 7024, pp. 135–143. Springer, Heidelberg (2011)CrossRefGoogle Scholar
  13. 13.
    Lopez, A.: Hierarchical phrase-based translation with suffix arrays. In: Proceedings of EMNLP-CoNLL, pp. 976–985 (2007)Google Scholar
  14. 14.
    Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)MathSciNetCrossRefzbMATHGoogle Scholar
  15. 15.
    Metzler, D., Croft, W.B.: A Markov random field model for term dependencies. In: Proceedings of SIGIR, pp. 472–479 (2005)Google Scholar
  16. 16.
    Mihalcea, R., Tarau, P., Figa, E.: Pagerank on semantic networks, with application to word sense disambiguation. In: Proceedings of COLING (2004)Google Scholar
  17. 17.
    Morgante, M., Policriti, A., Vitacolonna, N., Zuccolo, A.: Structured motifs search. J. Comput. Biol. 12(8), 1065–1082 (2005)CrossRefGoogle Scholar
  18. 18.
    Navarro, G., Raffinot, M.: Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching. J. Comput. Biol. 10(6), 903–923 (2003)CrossRefGoogle Scholar
  19. 19.
    Rahman, M.S., Iliopoulos, C.S., Lee, I., Mohamed, M., Smyth, W.F.: Finding patterns with variable length gaps or don’t cares. In: Chen, D.Z., Lee, D.T. (eds.) COCOON 2006. LNCS, vol. 4112, pp. 146–155. Springer, Heidelberg (2006)CrossRefGoogle Scholar
  20. 20.
    Thompson, K.: Regular expression search algorithm. Commun. ACM 11(6), 419–422 (1968)CrossRefzbMATHGoogle Scholar

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. 1.Institute of Theoretical InformaticsKarlsruhe Institute of TechnologyKarlsruheGermany
  2. 2.Department of Computing and Information SystemsThe University of MelbourneVICAustralia

Personalised recommendations