Efficient Compressed Indexing for Approximate Top-k String Retrieval

  • Héctor Ferrada
  • Gonzalo Navarro
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8799)

Abstract

Given a collection of strings (called documents), the top-k document retrieval problem is that of, given a string pattern p, finding the k documents where p appears most often. This is a basic task in most information retrieval scenarios. The best current implementations require 20–30 bits per character (bpc) and k to 4k microseconds per query, or 12–24 bpc and 1–10 milliseconds per query. We introduce a Lempel-Ziv compressed data structure that occupies 5–10 bpc to answer queries in around k microseconds. The drawback is that the answer is approximate, but we show that its quality improves asymptotically with the size of the collection, reaching over 85% of the accumulated term frequency of the real answer already for patterns of length 4–6 on rather small collections, and improving for larger ones.
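To make the problem statement and the quality measure concrete, the following is a minimal brute-force sketch. It is not the Lempel-Ziv compressed structure introduced in the paper; it only illustrates what an exact top-k answer looks like and one plausible reading of "accumulated term frequency of the real answer". The function names (`term_frequency`, `top_k`, `quality`), the choice to count overlapping occurrences, and the tie-breaking rule are assumptions made for illustration.

```python
import heapq


def term_frequency(doc: str, pattern: str) -> int:
    """Count occurrences of pattern in doc (overlapping ones included)."""
    if not pattern:
        return 0
    count = 0
    start = doc.find(pattern)
    while start != -1:
        count += 1
        start = doc.find(pattern, start + 1)
    return count


def top_k(docs: list[str], pattern: str, k: int) -> list[tuple[int, int]]:
    """Exact top-k by linear scan: (doc_id, frequency), most frequent first.

    Documents where the pattern does not occur are never reported.
    Ties are broken by the smaller document id (an arbitrary choice).
    """
    freqs = [(i, term_frequency(d, pattern)) for i, d in enumerate(docs)]
    freqs = [(i, f) for i, f in freqs if f > 0]
    return heapq.nsmallest(k, freqs, key=lambda t: (-t[1], t[0]))


def quality(docs: list[str], pattern: str, approx_ids: list[int], k: int) -> float:
    """Ratio of the summed frequency of an approximate answer to that of the
    exact top-k answer (assumed reading of the abstract's quality measure).

    approx_ids is expected to hold at most k distinct document ids.
    """
    exact_total = sum(f for _, f in top_k(docs, pattern, k))
    if exact_total == 0:
        return 1.0  # nothing to find, so any answer is as good as the exact one
    approx_total = sum(term_frequency(docs[i], pattern) for i in set(approx_ids))
    return approx_total / exact_total
```

For example, with the documents `["abracadabra", "banana", "aaaa", "xyz"]`, the pattern `"a"` and k = 2, the exact answer is documents 0 and 2 (frequencies 5 and 4). An approximate answer returning documents 0 and 1 would accumulate 5 + 3 = 8 of the 9 occurrences, so its quality under this reading is 8/9.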

Keywords

Compressed Indexing; Pattern Length; Empty Node; Short Pattern; Large Text Collection
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Héctor Ferrada (1)
  • Gonzalo Navarro (1)
  1. Department of Computer Science, University of Chile, Chile
