Space-Efficient Top-k Document Retrieval

  • Gonzalo Navarro
  • Daniel Valenzuela
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7276)

Abstract

Supporting top-k document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage is a serious concern. In this paper we study various reduced-space structures that support top-k retrieval and propose new alternatives. Our experimental results show that our novel structures and algorithms dominate almost the entire space/time tradeoff.
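
To make the query itself concrete, here is a minimal brute-force sketch in Python. It is not one of the structures studied in the paper: it scans every document at query time and uses no index at all, so it only illustrates what a top-k answer is. The function names, the tie-breaking rule (smaller document identifier first), the counting of overlapping occurrences, and the omission of documents with zero occurrences are all assumptions made for this illustration.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are counted too.
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(docs: list[str], pattern: str, k: int) -> list[tuple[int, int]]:
    """Return up to k (doc_id, frequency) pairs, most frequent first.

    Assumptions for this sketch: documents where the pattern never
    occurs are omitted, and ties go to the smaller doc_id.
    """
    hits = []
    for doc_id, doc in enumerate(docs):
        freq = count_occurrences(doc, pattern)
        if freq > 0:
            hits.append((freq, doc_id))
    # Order by descending frequency, then ascending doc_id.
    best = heapq.nsmallest(k, hits, key=lambda h: (-h[0], h[1]))
    return [(doc_id, freq) for freq, doc_id in best]


if __name__ == "__main__":
    collection = ["abracadabra", "banana", "cabana", "xyz"]
    print(top_k_documents(collection, "a", 2))    # [(0, 5), (1, 3)]
    print(top_k_documents(collection, "ana", 3))  # [(1, 2), (2, 1)]
```

Each query here costs time proportional to the total length of the collection, which is the cost that indexed solutions are designed to avoid.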

Keywords

Term Frequency; Inverted Index; Query Pattern; Document Retrieval; Wavelet Tree
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Gonzalo Navarro (1)
  • Daniel Valenzuela (1)
  1. Dept. of Computer Science, Univ. of Chile, Chile
