String Indexing for Patterns with Wildcards

Abstract

We consider the problem of indexing a string t of length n to report the occurrences of a query pattern p containing m characters and j wildcards. Let occ be the number of occurrences of p in t, and σ the size of the alphabet. We obtain the following results.

  • A linear space index with query time \(O(m+\sigma^{j} \log\log n+\mathit{occ})\). This significantly improves the previously best known linear space index by Lam et al. (in Proc. 18th ISAAC, pp. 846–857, 2007), which requires query time Θ(jn) in the worst case.

  • An index with query time O(m+j+occ) using space \(O(\sigma^{k^{2}} n \log^{k} \log n)\), where k is the maximum number of wildcards allowed in the pattern. This is the first non-trivial bound with this query time.

  • A time-space trade-off, generalizing the index by Cole et al. (in Proc. 36th STOC, pp. 91–100, 2004).

We also show that these indexes can be generalized to allow variable length gaps in the pattern. Our results are obtained using a novel combination of well-known and new techniques, which could be of independent interest.
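To make the problem statement concrete, the following minimal sketch scans the text directly and reports every occurrence of a pattern with wildcards. It is only a baseline for illustration and is not one of the indexes described above: it preprocesses nothing and spends O(n(m+j)) time per query. The function name `wildcard_occurrences` and the choice of `*` as the wildcard symbol are assumptions made for this example.

```python
from typing import List

WILDCARD = "*"  # assumed wildcard symbol; it matches any single character


def wildcard_occurrences(t: str, p: str, wildcard: str = WILDCARD) -> List[int]:
    """Return every start position i where p matches t[i:i+len(p)].

    Naive baseline: each of the n - len(p) + 1 alignments is checked
    character by character, so no index is built over t.
    """
    length = len(p)
    hits = []
    for i in range(len(t) - length + 1):
        window = t[i:i + length]
        # A pattern position matches if it is a wildcard or equals the text character.
        if all(pc == wildcard or pc == tc for pc, tc in zip(p, window)):
            hits.append(i)
    return hits


if __name__ == "__main__":
    # Pattern with m = 3 characters and j = 1 wildcard.
    print(wildcard_occurrences("abracadabra", "a*ra"))  # prints [0, 7]
```

In this example occ = 2. An index aims to answer the same query in time that depends on m, j and occ rather than on a full scan of t.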

References

  1. Alstrup, S., Husfeldt, T., Rauhe, T.: Marked ancestor problems. In: Proc. 39th FOCS, pp. 534–543 (1998)

  2. Amir, A., Lewenstein, M., Porat, E.: Faster algorithms for string matching with k mismatches. J. Algorithms 50(2), 257–275 (2004)

  3. Belazzougui, D.: Faster and space-optimal edit distance “1” dictionary. In: Proc. 20th CPM, pp. 154–167 (2009)

  4. Bille, P., Gørtz, I.L.: Substring range reporting. In: Proc. 22nd CPM, pp. 299–308 (2011)

  5. Bille, P., Gørtz, I.L., Vildhøj, H., Wind, D.: String matching with variable length gaps. In: Proc. 17th SPIRE, pp. 385–394 (2010)

  6. Bucher, P., Bairoch, A.: A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation. In: Proc. 2nd ISMB, pp. 53–61 (1994)

  7. Chan, H.L., Lam, T.W., Sung, W.K., Tam, S.L., Wong, S.S.: A linear size index for approximate pattern matching. J. Discrete Algorithms 9(4), 358–364 (2011)

  8. Chazelle, B.: Filtering search: a new approach to query-answering. SIAM J. Comput. 15(3), 703–724 (1986)

  9. Chen, G., Wu, X., Zhu, X., Arslan, A., He, Y.: Efficient string matching with wildcards and length constraints. Knowl. Inf. Syst. 10(4), 399–419 (2006)

  10. Clifford, P., Clifford, R.: Simple deterministic wildcard matching. Inf. Process. Lett. 101(2), 53–54 (2007)

  11. Coelho, L., Oliveira, A.: Dotted suffix trees: a structure for approximate text indexing. In: Proc. 13th SPIRE, pp. 329–336 (2006)

  12. Cole, R., Hariharan, R.: Approximate string matching: a simpler faster algorithm. SIAM J. Comput. 31(6), 1761–1782 (2002)

  13. Cole, R., Hariharan, R.: Verifying candidate matches in sparse and wildcard matching. In: Proc. 34th STOC, pp. 592–601 (2002)

  14. Cole, R., Gottlieb, L., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proc. 36th STOC, pp. 91–100 (2004)

  15. Fischer, M.J., Paterson, M.S.: String-matching and other products. In: Complexity of Computation, SIAM-AMS Proceedings, pp. 113–125 (1974)

  16. Fredman, M.L., Komlós, J., Szemerédi, E.: Storing a sparse table with O(1) worst case access time. J. ACM 31, 538–544 (1984)

  17. Fredriksson, K., Grabowski, S.: Efficient algorithms for pattern matching with general gaps, character classes, and transposition invariance. Inf. Retr. 11(4), 335–357 (2008)

  18. Fredriksson, K., Grabowski, S.: Nested counters in bit-parallel string matching. In: Proc. 3rd LATA, pp. 338–349 (2009)

  19. Galil, Z., Giancarlo, R.: Improved string matching with k mismatches. SIGACT News 17(4), 52–54 (1986)

  20. Hagerup, T.: Sorting and searching on the word RAM. In: Proc. 15th STACS, pp. 366–398 (1998)

  21. Harel, D., Tarjan, R.: Fast algorithms for finding nearest common ancestors. SIAM J. Comput. 13(2), 338–355 (1984)

  22. Hofmann, K., Bucher, P., Falquet, L., Bairoch, A.: The PROSITE database, its status in 1999. Nucleic Acids Res. 27(1), 215–219 (1999)

  23. Iliopoulos, C.S., Rahman, M.S.: Pattern matching algorithms with don’t cares. In: Proc. 33rd SOFSEM, pp. 116–126 (2007)

  24. Kalai, A.: Efficient pattern-matching with don’t cares. In: Proc. 13th SODA, pp. 655–656 (2002)

  25. Lam, T.W., Sung, W.K., Tam, S.L., Yiu, S.M.: Space efficient indexes for string matching with don’t cares. In: Proc. 18th ISAAC, pp. 846–857 (2007)

  26. Landau, G., Vishkin, U.: Efficient string matching with k mismatches. Theor. Comput. Sci. 43, 239–249 (1986)

  27. Landau, G., Vishkin, U.: Fast parallel and serial approximate string matching. J. Algorithms 10(2), 157–169 (1989)

  28. Lewenstein, M.: Indexing with gaps. In: Proc. 18th SPIRE, pp. 135–143 (2011)

  29. Maas, M., Nowak, J.: Text indexing with errors. J. Discrete Algorithms 5(4), 662–681 (2007)

  30. Mehldau, G., Myers, G.: A system for pattern matching applications on biosequences. Comput. Appl. Biosci. 9(3), 299–314 (1993)

  31. Morgante, M., Policriti, A., Vitacolonna, N., Zuccolo, A.: Structured motifs search. J. Comput. Biol. 12(8), 1065–1082 (2005)

  32. Myers, E.: Approximate matching of network expressions with spacers. J. Comput. Biol. 3(1), 33–51 (1996)

  33. Navarro, G., Raffinot, M.: Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching. J. Comput. Biol. 10(6), 903–923 (2003)

  34. Navarro, G., Baeza-Yates, R., Sutinen, E., Tarhio, J.: Indexing methods for approximate string matching. IEEE Data Eng. Bull. 24(4), 19–27 (2001)

  35. Rahman, M.S., Iliopoulos, C.S., Lee, I., Mohamed, M., Smyth, W.F.: Finding patterns with variable length gaps or don’t cares. In: Proc. 12th COCOON, pp. 146–155 (2006)

  36. Sahinalp, S., Vishkin, U.: Efficient approximate and dynamic matching of patterns using a labeling paradigm. In: Proc. 37th FOCS, pp. 320–328 (1996)

  37. Tam, A., Wu, E., Lam, T., Yiu, S.: Succinct text indexing with wildcards. In: Proc. 16th SPIRE, pp. 39–50 (2009)

  38. Tsur, D.: Fast index for approximate string matching. J. Discrete Algorithms 8(4), 339–345 (2010)

  39. Weiner, P.: Linear pattern matching algorithms. In: Proc. 14th SWAT, pp. 1–11 (1973)

Acknowledgements

We thank the anonymous reviewers for their valuable comments. Based on their suggestions we could substantially improve the analysis of the query time for patterns with variable length gaps.

Author information

Corresponding author

Correspondence to Hjalte Wedel Vildhøj.

Additional information

Preliminary version appeared in Proceedings of the 13th Scandinavian Symposium and Workshops on Algorithm Theory. Lecture Notes in Computer Science, vol. 7357, pp. 283–294, Springer, Berlin (2012).

Supported by a grant from the Danish Council for Independent Research | Natural Sciences.

About this article

Cite this article

Bille, P., Gørtz, I.L., Vildhøj, H.W. et al. String Indexing for Patterns with Wildcards. Theory Comput Syst 55, 41–60 (2014). https://doi.org/10.1007/s00224-013-9498-4

  • DOI: https://doi.org/10.1007/s00224-013-9498-4

Keywords

  • String indexing
  • Wildcard
  • Variable length gap
  • Suffix tree
  • LCP data structure