Towards an Optimal Space-and-Query-Time Index for Top-k Document Retrieval

  • Wing-Kai Hon
  • Rahul Shah
  • Sharma V. Thankachan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7354)

Abstract

Let \(\mathcal{D} = \{d_1, d_2, \ldots, d_D\}\) be a given set of \(D\) string documents of total length \(n\). Our task is to index \(\mathcal{D}\) such that the \(k\) most relevant documents for an online query pattern \(P\) of length \(p\) can be retrieved efficiently. We propose an index of size \(|CSA| + n\log D\,(2 + o(1))\) bits with \(O(t_s(p) + k\log\log n + \mathrm{poly}\log\log n)\) query time for the basic relevance metric term-frequency, where \(|CSA|\) is the size (in bits) of a compressed full-text index of \(\mathcal{D}\) that takes \(O(t_s(p))\) time to search for a pattern of length \(p\). We further reduce the space to \(|CSA| + n\log D\,(1 + o(1))\) bits, at the cost of a query time of \(O(t_s(p) + k(\log\sigma\log\log n)^{1+\epsilon} + \mathrm{poly}\log\log n)\), where \(\sigma\) is the alphabet size and \(\epsilon > 0\) is any constant.
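
To make the problem statement concrete, the following is a minimal brute-force sketch of top-k retrieval under the term-frequency metric. It is not the index proposed in the paper: it scans every document at query time and uses no compressed structures, so it only illustrates what a query should return. The function names `count_occurrences` and `top_k_by_term_frequency`, the tie-breaking rule (lower document index first), and the choice to count overlapping occurrences are assumptions made for illustration.

```python
import heapq
from typing import List, Tuple


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_by_term_frequency(
    documents: List[str], pattern: str, k: int
) -> List[Tuple[int, int]]:
    """Return up to k (document index, term frequency) pairs.

    Documents that do not contain the pattern are never reported.
    Ties are broken in favor of the lower document index (an
    assumption of this sketch, not something the abstract specifies).
    """
    hits = []
    for index, doc in enumerate(documents):
        tf = count_occurrences(doc, pattern)
        if tf > 0:
            hits.append((tf, index))
    # Higher frequency first; among equal frequencies, lower index first.
    best = heapq.nlargest(k, hits, key=lambda pair: (pair[0], -pair[1]))
    return [(index, tf) for tf, index in best]


if __name__ == "__main__":
    docs = ["banana", "ananas", "bandana", "cab"]
    print(top_k_by_term_frequency(docs, "ana", 2))
```

This baseline spends time proportional to the total document length on every query, whereas the bounds in the abstract depend only on the pattern search time, on k, and on polylogarithmic terms.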

Keywords

Query Time; Document Retrieval; Alphabet Size; Path Label; Marked Node
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Wing-Kai Hon (1)
  • Rahul Shah (2)
  • Sharma V. Thankachan (2)
  1. Department of CS, National Tsing Hua University, Taiwan
  2. Department of CS, Louisiana State University, USA
