Ranked Document Retrieval with Forbidden Pattern

  • Sudip Biswas
  • Arnab Ganguly
  • Rahul Shah
  • Sharma V. Thankachan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9133)

Abstract

Let \(\mathcal{D}=\{\mathsf{T}_1,\mathsf{T}_2,\dots,\mathsf{T}_D\}\) be a collection of \(D\) string documents of \(n\) characters in total. The forbidden pattern document listing problem asks to report those documents \(\mathcal{D}' \subseteq \mathcal{D}\) which contain the pattern \(P\), but not the pattern \(Q\). The \(\mathsf{top}\text{-}k\) forbidden pattern query \((P,Q,k)\) asks to report those \(k\) documents in \(\mathcal{D}'\) that are most relevant to \(P\). For typical relevance functions (like document importance, term-frequency, term-proximity), we present a linear space index with worst case query time of \(O(|P|+|Q|+\sqrt{nk})\) for the \(\mathsf{top}\text{-}k\) problem. As a corollary of this result, we obtain a linear space and \(O(|P|+|Q|+\sqrt{nt})\) query time solution for the document listing problem, where \(t\) is the number of documents reported. We conjecture that any significant improvement over the results in this paper is highly unlikely.
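
The query semantics defined in the abstract can be illustrated with a naive baseline. The sketch below is not the index presented in the paper: it scans every document on each query, so it runs in time proportional to the collection size rather than \(O(|P|+|Q|+\sqrt{nk})\). The function names are hypothetical, term frequency is assumed as the relevance function, and ties are broken by document index (an arbitrary choice).

```python
from typing import List


def forbidden_listing(docs: List[str], p: str, q: str) -> List[int]:
    """Indices of documents that contain p but do not contain q (naive scan)."""
    return [i for i, text in enumerate(docs) if p in text and q not in text]


def term_frequency(text: str, p: str) -> int:
    """Number of (possibly overlapping) occurrences of p in text."""
    count = 0
    start = text.find(p)
    while start != -1:
        count += 1
        start = text.find(p, start + 1)
    return count


def top_k_forbidden(docs: List[str], p: str, q: str, k: int) -> List[int]:
    """The k documents containing p but not q, ordered by decreasing
    term frequency of p (assumed relevance), ties broken by index."""
    candidates = forbidden_listing(docs, p, q)
    candidates.sort(key=lambda i: (-term_frequency(docs[i], p), i))
    return candidates[:k]


# Hypothetical toy collection: P = "ab" is required, Q = "c" is forbidden.
docs = ["abab", "abc", "ababab", "xyz", "ab"]
listing = forbidden_listing(docs, "ab", "c")
top2 = top_k_forbidden(docs, "ab", "c", 2)
```

Here document 1 is excluded because it contains the forbidden pattern, and document 3 because it lacks the required one. A baseline like this is mainly useful as a correctness oracle when testing an indexed solution.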

Keywords

Query Time; Inverted Index; Relevance Function; Prime Node; Lower Common Ancestor
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Sudip Biswas (1)
  • Arnab Ganguly (1), email author
  • Rahul Shah (1)
  • Sharma V. Thankachan (2)

  1. School of Electrical Engineering and Computer Science, Louisiana State University, Baton Rouge, USA
  2. School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, USA
