Efficient Compressed Indexing for Approximate Top-k String Retrieval

Ferrada, Héctor; Navarro, Gonzalo

doi:10.1007/978-3-319-11918-2_3

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 8799)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

Given a collection of strings (called documents), the top-k document retrieval problem asks, for a string pattern p, for the k documents in which p appears most often. This is a basic task in most information retrieval scenarios. The best current implementations require either 20–30 bits per character (bpc) and k to 4k microseconds per query, or 12–24 bpc and 1–10 milliseconds per query. We introduce a Lempel-Ziv compressed data structure that occupies 5–10 bpc and answers queries in around k microseconds. The drawback is that the answer is approximate, but we show that its quality improves asymptotically with the size of the collection: it reaches over 85% of the accumulated term frequency of the real answer already for patterns of length 4–6 on rather small collections, and improves further on larger ones.
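
To make the problem statement and the quality measure concrete, the following is a minimal brute-force sketch. It is not the Lempel-Ziv structure of the paper: it simply scans every document. The function names (`term_frequency`, `exact_top_k`, `quality`) and the choice to count overlapping occurrences are assumptions made for illustration only. The `quality` function reflects one plausible reading of "accumulated term frequency of the real answer": the summed frequency of the documents an approximate method returns, divided by the summed frequency of the true top-k.

```python
import heapq


def term_frequency(doc: str, pattern: str) -> int:
    """Count occurrences of pattern in doc (overlapping ones included)."""
    if not pattern:
        return 0
    count = 0
    pos = doc.find(pattern)
    while pos != -1:
        count += 1
        pos = doc.find(pattern, pos + 1)  # step by 1 to allow overlaps
    return count


def exact_top_k(docs, pattern, k):
    """Return up to k (doc_id, frequency) pairs, most frequent first.

    Documents where the pattern does not occur are never reported.
    Ties are broken by the smaller document id (an arbitrary choice).
    """
    scored = []
    for doc_id, doc in enumerate(docs):
        tf = term_frequency(doc, pattern)
        if tf > 0:
            scored.append((tf, doc_id))
    best = heapq.nsmallest(k, scored, key=lambda t: (-t[0], t[1]))
    return [(doc_id, tf) for tf, doc_id in best]


def quality(docs, pattern, k, approx_ids):
    """Fraction of the exact top-k accumulated frequency that the
    documents in approx_ids (an approximate answer) recover."""
    exact_total = sum(tf for _, tf in exact_top_k(docs, pattern, k))
    if exact_total == 0:
        return 1.0  # nothing to find, so any answer is as good as exact
    approx_total = sum(term_frequency(docs[i], pattern) for i in approx_ids[:k])
    return approx_total / exact_total


docs = ["abracadabra", "banana", "aaaa", "xyz"]
print(exact_top_k(docs, "a", 2))      # [(0, 5), (2, 4)]
print(quality(docs, "a", 2, [0, 1]))  # (5 + 3) / (5 + 4)
```

In this toy run, an approximate answer that returns documents 0 and 1 instead of 0 and 2 recovers 8 of the 9 occurrences covered by the exact answer, a quality of roughly 89%.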

Partially funded by Fondecyt grant 1-140796, Chile.

Author information

Authors and Affiliations

Department of Computer Science, University of Chile, Chile
Héctor Ferrada & Gonzalo Navarro

Editor information

Editors and Affiliations

Instituto de Computação, Universidade Federal do Amazonas, 6200, Manaus, Brazil
Edleno Moura
King’s College London, UK
Maxime Crochemore

About this paper

Cite this paper

Ferrada, H., Navarro, G. (2014). Efficient Compressed Indexing for Approximate Top-k String Retrieval. In: Moura, E., Crochemore, M. (eds) String Processing and Information Retrieval. SPIRE 2014. Lecture Notes in Computer Science, vol 8799. Springer, Cham. https://doi.org/10.1007/978-3-319-11918-2_3

DOI: https://doi.org/10.1007/978-3-319-11918-2_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11917-5
Online ISBN: 978-3-319-11918-2
eBook Packages: Computer Science (R0)
