Theory of Computing Systems, Volume 55, Issue 1, pp 41–60

String Indexing for Patterns with Wildcards

  • Philip Bille
  • Inge Li Gørtz
  • Hjalte Wedel Vildhøj
  • Søren Vind

Abstract

We consider the problem of indexing a string t of length n to report the occurrences of a query pattern p containing m characters and j wildcards. Let occ be the number of occurrences of p in t, and σ the size of the alphabet. We obtain the following results.
  • A linear space index with query time \(O(m+\sigma^{j}\log\log n+occ)\). This significantly improves the previously best known linear space index by Lam et al. (in Proc. 18th ISAAC, pp. 846–857, 2007), which requires query time Θ(jn) in the worst case.

  • An index with query time O(m+j+occ) using space \(O(\sigma^{k^{2}} n \log^{k} \log n)\), where k is the maximum number of wildcards allowed in the pattern. This is the first non-trivial bound with this query time.

  • A time-space trade-off, generalizing the index by Cole et al. (in Proc. 36th STOC, pp. 91–100, 2004).

We also show that these indexes can be generalized to allow variable length gaps in the pattern. Our results are obtained using a novel combination of well-known and new techniques, which could be of independent interest.
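
To make the problem statement concrete, the following is a minimal brute-force sketch of what it means for a pattern with wildcards to occur in t. It is not the index described above: it scans t directly, so each query costs time proportional to n times the pattern length, which is the kind of cost an index is meant to avoid. The wildcard symbol `*` and the function name `wildcard_occurrences` are assumptions made for illustration only.

```python
WILDCARD = "*"  # assumed wildcard symbol; the abstract does not fix a notation


def wildcard_occurrences(t: str, p: str) -> list[int]:
    """Return every start position i such that p matches t[i:i+len(p)],
    where WILDCARD in p matches any single character of t."""
    n, length = len(t), len(p)
    hits = []
    # Try every alignment of p against t (naive baseline, no preprocessing).
    for i in range(n - length + 1):
        if all(pc == WILDCARD or pc == t[i + k] for k, pc in enumerate(p)):
            hits.append(i)
    return hits


# Pattern "a*a" has m = 2 characters and j = 1 wildcard.
print(wildcard_occurrences("banana", "a*a"))  # [1, 3], so occ = 2
```

Here the length of the returned list corresponds to occ in the bounds above.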

Keywords

  • String indexing
  • Wildcard
  • Variable length gap
  • Suffix tree
  • LCP data structure

References

  1. Alstrup, S., Husfeldt, T., Rauhe, T.: Marked ancestor problems. In: Proc. 39th FOCS, pp. 534–543 (1998)
  2. Amir, A., Lewenstein, M., Porat, E.: Faster algorithms for string matching with k mismatches. J. Algorithms 50(2), 257–275 (2004)
  3. Belazzougui, D.: Faster and space-optimal edit distance “1” dictionary. In: Proc. 20th CPM, pp. 154–167 (2009)
  4. Bille, P., Gørtz, I.L.: Substring range reporting. In: Proc. 22nd CPM, pp. 299–308 (2011)
  5. Bille, P., Gørtz, I.L., Vildhøj, H., Wind, D.: String matching with variable length gaps. In: Proc. 17th SPIRE, pp. 385–394 (2010)
  6. Bucher, P., Bairoch, A.: A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation. In: Proc. 2nd ISMB, pp. 53–61 (1994)
  7. Chan, H.L., Lam, T.W., Sung, W.K., Tam, S.L., Wong, S.S.: A linear size index for approximate pattern matching. J. Discrete Algorithms 9(4), 358–364 (2011)
  8. Chazelle, B.: Filtering search: a new approach to query-answering. SIAM J. Comput. 15(3), 703–724 (1986)
  9. Chen, G., Wu, X., Zhu, X., Arslan, A., He, Y.: Efficient string matching with wildcards and length constraints. Knowl. Inf. Syst. 10(4), 399–419 (2006)
  10. Clifford, P., Clifford, R.: Simple deterministic wildcard matching. Inf. Process. Lett. 101(2), 53–54 (2007)
  11. Coelho, L., Oliveira, A.: Dotted suffix trees: a structure for approximate text indexing. In: Proc. 13th SPIRE, pp. 329–336 (2006)
  12. Cole, R., Hariharan, R.: Approximate string matching: a simpler faster algorithm. SIAM J. Comput. 31(6), 1761–1782 (2002)
  13. Cole, R., Hariharan, R.: Verifying candidate matches in sparse and wildcard matching. In: Proc. 34th STOC, pp. 592–601 (2002)
  14. Cole, R., Gottlieb, L., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proc. 36th STOC, pp. 91–100 (2004)
  15. Fischer, M.J., Paterson, M.S.: String-matching and other products. In: Complexity of Computation, SIAM-AMS Proceedings, pp. 113–125 (1974)
  16. Fredman, M.L., Komlós, J., Szemerédi, E.: Storing a sparse table with O(1) worst case access time. J. ACM 31, 538–544 (1984)
  17. Fredriksson, K., Grabowski, S.: Efficient algorithms for pattern matching with general gaps, character classes, and transposition invariance. Inf. Retr. 11(4), 335–357 (2008)
  18. Fredriksson, K., Grabowski, S.: Nested counters in bit-parallel string matching. In: Proc. 3rd LATA, pp. 338–349 (2009)
  19. Galil, Z., Giancarlo, R.: Improved string matching with k mismatches. SIGACT News 17(4), 52–54 (1986)
  20. Hagerup, T.: Sorting and searching on the word RAM. In: Proc. 15th STACS, pp. 366–398 (1998)
  21. Harel, D., Tarjan, R.: Fast algorithms for finding nearest common ancestors. SIAM J. Comput. 13(2), 338–355 (1984)
  22. Hofmann, K., Bucher, P., Falquet, L., Bairoch, A.: The PROSITE database, its status in 1999. Nucleic Acids Res. 27(1), 215–219 (1999)
  23. Iliopoulos, C.S., Rahman, M.S.: Pattern matching algorithms with don’t cares. In: Proc. 33rd SOFSEM, pp. 116–126 (2007)
  24. Kalai, A.: Efficient pattern-matching with don’t cares. In: Proc. 13th SODA, pp. 655–656 (2002)
  25. Lam, T.W., Sung, W.K., Tam, S.L., Yiu, S.M.: Space efficient indexes for string matching with don’t cares. In: Proc. 18th ISAAC, pp. 846–857 (2007)
  26. Landau, G., Vishkin, U.: Efficient string matching with k mismatches. Theor. Comput. Sci. 43, 239–249 (1986)
  27. Landau, G., Vishkin, U.: Fast parallel and serial approximate string matching. J. Algorithms 10(2), 157–169 (1989)
  28. Lewenstein, M.: Indexing with gaps. In: Proc. 18th SPIRE, pp. 135–143 (2011)
  29. Maas, M., Nowak, J.: Text indexing with errors. J. Discrete Algorithms 5(4), 662–681 (2007)
  30. Mehldau, G., Myers, G.: A system for pattern matching applications on biosequences. Comput. Appl. Biosci. 9(3), 299–314 (1993)
  31. Morgante, M., Policriti, A., Vitacolonna, N., Zuccolo, A.: Structured motifs search. J. Comput. Biol. 12(8), 1065–1082 (2005)
  32. Myers, E.: Approximate matching of network expressions with spacers. J. Comput. Biol. 3(1), 33–51 (1996)
  33. Navarro, G., Raffinot, M.: Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching. J. Comput. Biol. 10(6), 903–923 (2003)
  34. Navarro, G., Baeza-Yates, R., Sutinen, E., Tarhio, J.: Indexing methods for approximate string matching. IEEE Data Eng. Bull. 24(4), 19–27 (2001)
  35. Rahman, M.S., Iliopoulos, C.S., Lee, I., Mohamed, M., Smyth, W.F.: Finding patterns with variable length gaps or don’t cares. In: Proc. 12th COCOON, pp. 146–155 (2006)
  36. Sahinalp, S., Vishkin, U.: Efficient approximate and dynamic matching of patterns using a labeling paradigm. In: Proc. 37th FOCS, pp. 320–328 (1996)
  37. Tam, A., Wu, E., Lam, T., Yiu, S.: Succinct text indexing with wildcards. In: Proc. 16th SPIRE, pp. 39–50 (2009)
  38. Tsur, D.: Fast index for approximate string matching. J. Discrete Algorithms 8(4), 339–345 (2010)
  39. Weiner, P.: Linear pattern matching algorithms. In: Proc. 14th SWAT, pp. 1–11 (1973)

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Philip Bille (1)
  • Inge Li Gørtz (1)
  • Hjalte Wedel Vildhøj (1)
  • Søren Vind (1)
  1. DTU Compute, Technical University of Denmark, Lyngby, Denmark
