Document Retrieval with One Wildcard

  • Moshe Lewenstein
  • J. Ian Munro
  • Yakov Nekrich
  • Sharma V. Thankachan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8635)

Abstract

In this paper we extend several well-known document listing problems to the case where documents contain a substring that approximately matches the query pattern. We study the scenario in which the query string can contain a wildcard symbol that matches any alphabet symbol; all documents that match a query pattern with one wildcard must be enumerated. We describe a linear-space data structure that reports all documents containing a substring that matches P in \(O(|P|+\sigma \sqrt{\log\log \log n} + \mathtt{docc})\) time, where \(\sigma\) is the alphabet size and \(\mathtt{docc}\) is the number of listed documents. We also describe a succinct solution for this problem.

Furthermore, our approach yields an O()-space data structure that enumerates all documents containing both a pattern \(P_1\) and a pattern \(P_2\) in the special case when \(P_1\) and \(P_2\) differ in one symbol.
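
The query semantics described in the abstract can be illustrated with a naive scan. The sketch below is only a brute-force baseline for stating the two problems; it is not the data structure from the paper and comes nowhere near its time bounds. The wildcard character `?`, the function names, and the reading of "differ in one symbol" as equal length with exactly one mismatching position are assumptions made for illustration.

```python
from typing import List

WILDCARD = "?"  # hypothetical wildcard symbol; assumed not to occur in any document


def matches_at(doc: str, start: int, pattern: str) -> bool:
    """True if pattern matches doc[start:start+len(pattern)]; the wildcard matches any symbol."""
    window = doc[start:start + len(pattern)]
    return all(p == WILDCARD or p == c for p, c in zip(pattern, window))


def list_documents(docs: List[str], pattern: str) -> List[int]:
    """Indices of documents containing a substring that matches a pattern with at most one wildcard."""
    if pattern.count(WILDCARD) > 1:
        raise ValueError("pattern may contain at most one wildcard")
    m = len(pattern)
    result = []
    for i, doc in enumerate(docs):
        # Try every alignment of the pattern inside the document.
        if any(matches_at(doc, s, pattern) for s in range(len(doc) - m + 1)):
            result.append(i)
    return result


def list_documents_both(docs: List[str], p1: str, p2: str) -> List[int]:
    """Indices of documents containing both p1 and p2.

    Assumes the special case is: equal length, exactly one mismatching position.
    """
    if len(p1) != len(p2) or sum(a != b for a, b in zip(p1, p2)) != 1:
        raise ValueError("patterns must have equal length and differ in exactly one symbol")
    return [i for i, doc in enumerate(docs) if p1 in doc and p2 in doc]


docs = ["banana", "bandana", "cabana", "band"]
print(list_documents(docs, "ba?a"))
print(list_documents_both(["abcabd", "abc", "abd"], "abc", "abd"))
```

The scan takes time proportional to the total length of the documents multiplied by the pattern length for every query, which is exactly the cost an index is meant to avoid.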

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Moshe Lewenstein (1)
  • J. Ian Munro (2)
  • Yakov Nekrich (2)
  • Sharma V. Thankachan (2)
  1. Bar-Ilan University, Israel
  2. University of Waterloo, Canada
