Faster Top-k Document Retrieval in Optimal Space

  • Gonzalo Navarro
  • Sharma V. Thankachan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8214)

Abstract

We consider the problem of retrieving, from a collection of strings (documents), the k documents in which a given pattern P appears most often. We show that, when the collection is represented by a Compressed Suffix Array (CSA), a data structure using the asymptotically optimal |CSA|+o(n) bits can answer queries in the time the CSA needs to find the suffix array interval of the pattern, plus \(O(k\lg^2 k \lg^\epsilon n)\) accesses to suffix array cells, for any constant ε > 0. This is \(\lg n / \lg k\) times faster than the only previous solution using optimal space, \(\lg k\) times slower than the fastest structure that uses twice the space, and \(\lg^2 k \lg^\epsilon n\) times the lower-bound cost of obtaining k document identifiers from the CSA. To obtain the result we introduce a tool called the sampled document array, which may be of independent interest.
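To make the problem statement concrete, here is a minimal sketch of a naive baseline for top-k document retrieval. It is not the structure described in the abstract: it assumes a plain (uncompressed) suffix array and a full document array, so it uses far more than |CSA|+o(n) bits. All function names (`build_index`, `sa_interval`, `top_k`) and the `$` separator are hypothetical choices made for illustration.

```python
from collections import Counter
import heapq


def build_index(docs, sep="$"):
    """Concatenate documents and build a suffix array plus a document array.

    Assumption: `sep` does not occur in any document or pattern.
    """
    text = "".join(d + sep for d in docs)
    # doc_of_pos[i] = identifier of the document that text position i belongs to.
    doc_of_pos = []
    for doc_id, d in enumerate(docs):
        doc_of_pos.extend([doc_id] * (len(d) + 1))
    # Naive suffix array construction; acceptable only for toy inputs.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Document array: the document of each suffix, in suffix array order.
    doc_array = [doc_of_pos[i] for i in sa]
    return text, sa, doc_array


def sa_interval(text, sa, pattern):
    """Return the half-open range [lo, hi) of suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def top_k(index, pattern, k):
    """Return up to k (doc_id, frequency) pairs, most frequent first."""
    text, sa, doc_array = index
    lo, hi = sa_interval(text, sa, pattern)
    # Every cell of the interval is one occurrence; count them per document.
    counts = Counter(doc_array[lo:hi])
    # Ties are broken in favour of the smaller document identifier.
    return heapq.nlargest(k, counts.items(), key=lambda kv: (kv[1], -kv[0]))


index = build_index(["banana", "ananas", "band"])
print(top_k(index, "an", 2))
```

This baseline scans every occurrence of the pattern in the suffix array interval, so its query cost grows with the number of occurrences. The bound quoted in the abstract, by contrast, depends on k and n rather than on the interval length.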

Keywords

Query Time; Inverted Index; Optimal Space; Document Retrieval; Array Cell
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Gonzalo Navarro (1)
  • Sharma V. Thankachan (2)
  1. Department of Computer Science, University of Chile, Chile
  2. Department of Computer Science, Louisiana State University, USA