Advertisement

Algorithmica

, Volume 78, Issue 2, pp 379–393 | Cite as

Top-k Term-Proximity in Succinct Space

  • J. Ian Munro
  • Gonzalo Navarro
  • Jesper Sindahl NielsenEmail author
  • Rahul Shah
  • Sharma V. Thankachan
Article
  • 218 Downloads

Abstract

Let \(\mathcal {D} = \{\mathsf {T}_1,\mathsf {T}_2, \ldots ,\mathsf {T}_D\}\) be a collection of D string documents of n characters in total, that are drawn from an alphabet set \(\varSigma =[\sigma ]\). The top-k document retrieval problem is to preprocess \(\mathcal{D}\) into a data structure that, given a query \((P[1\ldots p],k)\), can return the k documents of \(\mathcal{D}\) most relevant to the pattern P. The relevance is captured using a predefined ranking function, which depends on the set of occurrences of P in \(\mathsf {T}_d\). For example, it can be the term frequency (i.e., the number of occurrences of P in \(\mathsf {T}_d\)), or it can be the term proximity (i.e., the distance between the closest pair of occurrences of P in \(\mathsf {T}_d\)), or a pattern-independent importance score of \(\mathsf {T}_d\) such as PageRank. Linear space and optimal query time solutions already exist for the general top-k document retrieval problem. Compressed and compact space solutions are also known, but only for a few ranking functions such as term frequency and importance. However, space efficient data structures for term proximity based retrieval have been evasive. In this paper we present the first sub-linear space data structure for this relevance function, which uses only o(n) bits on top of any compressed suffix array of \(\mathcal{D}\) and solves queries in \(O((p+k) {{\mathrm{polylog}}}\,\,n)\) time. We also show that scores that consist of a weighted combination of term proximity, term frequency, and document importance, can be handled using twice the space required to represent the text collection.

Keywords

Document indexing Top-k document retrieval Ranked document retrieval Succinct data structures Compressed data structures Compact data structures Proximity search 

References

  1. 1.
    Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval, 2nd edn. Addison-Wesley, Reading (2011)Google Scholar
  2. 2.
    Belazzougui, D., Navarro, G.: Alphabet-independent compressed text indexing. In: Proceedings of the 19th ESA, pp. 748–759 (2011)Google Scholar
  3. 3.
    Belazzougui, D., Navarro, G., Valenzuela, D.: Improved compressed indexes for full-text document retrieval. J. Discrete Algorithms 18, 3–13 (2013)MathSciNetCrossRefzbMATHGoogle Scholar
  4. 4.
    Benson, G., Waterman, M.: A fast method for fast database search for all \(k\)-nucleotide repeats. Nucleic Acids Res. 22(22), 4828–4836 (1994)CrossRefGoogle Scholar
  5. 5.
    Broschart, A., Schenkel, R.: Index tuning for efficient proximity-enhanced query processing. In: INEX, pp. 213–217 (2009)Google Scholar
  6. 6.
    Büttcher, S., Clarke, C.L.A., Cormack, G.: Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, Cambridge (2010)zbMATHGoogle Scholar
  7. 7.
    de Berg, M., van Kreveld, M., Overmars, M., Schwarzkopf, O.: Computational Geometry: Algorithms and Applications, 3rd edn. Springer, Berlin (2008)CrossRefzbMATHGoogle Scholar
  8. 8.
    Ferragina, P., Grossi, R.: The string B-tree: a new data structure for string search in external memory and its applications. J. ACM. 46(2), 236–280 (1999)MathSciNetCrossRefzbMATHGoogle Scholar
  9. 9.
    Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms 3(2), Art. No. 20 (2007)Google Scholar
  10. 10.
    Gagie, T., Navarro, G., Puglisi, S.J.: New algorithms on wavelet trees and applications to information retrieval. Theor. Comput. Sci. 426–427, 25–41 (2012)MathSciNetCrossRefzbMATHGoogle Scholar
  11. 11.
    Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)CrossRefzbMATHGoogle Scholar
  12. 12.
    Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: On position restricted substring searching in succinct space. J. Discrete Algorithms 17, 109–114 (2012)MathSciNetCrossRefzbMATHGoogle Scholar
  13. 13.
    Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: Faster compressed top-k document retrieval. In: Proceedings of the 23rd DCC, pp. 341–350 (2013)Google Scholar
  14. 14.
    Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: Space-efficient frameworks for top-k string retrieval. J. ACM. 61(2), 9 (2014)MathSciNetCrossRefzbMATHGoogle Scholar
  15. 15.
    Hon, W.-K., Shah, R., Vitter, J.S.: Space-efficient framework for top-\(k\) string retrieval problems. In: Proceedings of the 50th FOCS, pp. 713–722 (2009)Google Scholar
  16. 16.
    Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)MathSciNetCrossRefzbMATHGoogle Scholar
  17. 17.
    Manzini, G.: An analysis of the Burrows–Wheeler transform. J. ACM. 48(3), 407–430 (2001)MathSciNetCrossRefzbMATHGoogle Scholar
  18. 18.
    Munro, J.I., Navarro, G., Nielsen, J.S., Shah, R., Thankachan, S.V.: Top-k term-proximity in succinct space. In: Proceedings of the 25th ISAAC, pp. 169–180 (2014)Google Scholar
  19. 19.
    Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: Proceedings of the 13th SODA, pp. 657–666 (2002)Google Scholar
  20. 20.
    Navarro, G.: Spaces, trees and colors: the algorithmic landscape of document retrieval on sequences. ACM Comput. Surv. 46(4), Art. No. 52 (2014)CrossRefzbMATHGoogle Scholar
  21. 21.
    Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39(1), Art. No. 2 (2007)CrossRefzbMATHGoogle Scholar
  22. 22.
    Navarro, G., Nekrich, Y.: Top-\(k\) document retrieval in optimal time and linear space. In: Proceedings of the 23rd SODA, pp. 1066–1078 (2012)Google Scholar
  23. 23.
    Navarro, G., Russo, L.: Fast fully-compressed suffix trees. In: Proceedings of the 24th DCC, pp. 283–291 (2014)Google Scholar
  24. 24.
    Navarro, G., Thankachan, S.V.: Faster top-\(k\) document retrieval in optimal space. In: Proceedings of the 20th SPIRE, LNCS 8214, pp. 255–262 (2013)Google Scholar
  25. 25.
    Navarro, G., Thankachan, S.V.: Top-\(k\) document retrieval in compact space and near-optimal time. In: Proceedings of the 24th ISAAC, LNCS 8283, pp. 394–404 (2013)Google Scholar
  26. 26.
    Navarro, G., Thankachan, S.V.: New space/time tradeoffs for top-\(k\) document retrieval on sequences. Theor. Comput. Sci. 542, 83–97 (2014)MathSciNetCrossRefzbMATHGoogle Scholar
  27. 27.
    Nekrich, Y., Navarro, G.: Sorted range reporting. In: Proceedings of the 13th SWAT, LNCS 7357, pp. 271–282 (2012)Google Scholar
  28. 28.
    Pǎtraşcu, M.: Succincter. In: Proceedings of the 49th FOCS, pp. 305–313 (2008)Google Scholar
  29. 29.
    Raman, R., Raman, V., Srinivasa, S.R.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3(4), Art. No. 43 (2007)MathSciNetCrossRefGoogle Scholar
  30. 30.
    Schenkel, R., Broschart, A., Hwang, S.-W., Theobald, M., Weikum, G.: Efficient text proximity search. In: SPIRE, pp. 287–299 (2007)Google Scholar
  31. 31.
    Shah, R., Sheng, C., Thankachan, S.V., Vitter, J.S.: Top-k document retrieval in external memory. In: Proceedings of the 21st ESA, LNCS 8125, pp. 803–814 (2013)Google Scholar
  32. 32.
    Weiner, P.: Linear pattern matching algorithm. In: Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory, pp. 1–11 (1973)Google Scholar
  33. 33.
    Yan, H., Shi, S., Zhang, F., Suel, T., Wen, J.-R.: Efficient term proximity search with term-pair indexes. In: CIKM, pp. 1229–1238 (2010)Google Scholar
  34. 34.
    Zhu, M., Shi, S., Li, M., Wen, J.-R.: Effective top-k computation in retrieving structured documents with term-proximity support. In: CIKM, pp. 771–780 (2007)Google Scholar
  35. 35.
    Zhu, M., Shi, S., Yu, N., Wen, J.-R.: Can phrase indexing help to process non-phrase queries? In: CIKM, pp. 679–688 (2008)Google Scholar

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • J. Ian Munro
    • 1
  • Gonzalo Navarro
    • 2
  • Jesper Sindahl Nielsen
    • 3
    Email author
  • Rahul Shah
    • 4
  • Sharma V. Thankachan
    • 5
  1. 1.Cheriton School of CSUniversity of WaterlooWaterlooCanada
  2. 2.Department of CSUniversity of ChileSantiagoChile
  3. 3.MADALGOAarhus UniversityAarhusDenmark
  4. 4.School of EECSLouisiana State UniversityBaton RougeUSA
  5. 5.School of CSEGeorgia Institute of TechnologyAtlantaUSA

Personalised recommendations