
Improved Compressed Indexes for Full-Text Document Retrieval

  • Djamal Belazzougui
  • Gonzalo Navarro
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7024)

Abstract

We give new space/time tradeoffs for compressed indexes that answer document retrieval queries on general sequences. On a collection of \(D\) documents of total length \(n\), current approaches require at least \(|\mathsf{CSA}| + O(n\frac{\lg D}{\lg\lg D})\) or \(2|\mathsf{CSA}| + o(n)\) bits of space, where \(\mathsf{CSA}\) is a full-text index. Using monotone minimum perfect hash functions, we give new algorithms for document listing with frequencies and top-\(k\) document retrieval using just \(|\mathsf{CSA}| + O(n\lg\lg\lg D)\) bits. We also improve current solutions that use \(2|\mathsf{CSA}| + o(n)\) bits, and consider other problems such as colored range listing, top-\(k\) most important documents, and computing arbitrary frequencies.
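To make the two query types named in the abstract concrete, the following is a minimal Python sketch of document listing with frequencies and top-k retrieval over a plain, uncompressed suffix array. It is a naive baseline meant only to illustrate the problem statement, not the compressed structures or algorithms of the paper. All names (`build_index`, `list_with_frequencies`, `top_k`) and the choice of `"\x01"` as document separator are assumptions made for this illustration.

```python
from bisect import bisect_left, bisect_right
from collections import Counter

SEP = "\x01"  # assumed separator; must not occur in any document or pattern


def build_index(docs):
    """Concatenate the documents and build a plain suffix array (naively)."""
    text = "".join(d + SEP for d in docs)
    starts, pos = [], 0
    for d in docs:
        starts.append(pos)          # starting offset of each document
        pos += len(d) + 1
    # Quadratic-space sort of suffixes: acceptable only for toy inputs.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa, starts


def list_with_frequencies(index, pattern):
    """Return a Counter mapping document id -> occurrences of pattern."""
    text, sa, starts = index
    m = len(pattern)
    # Truncating sorted suffixes to length m keeps them in sorted order,
    # so the occurrences of the pattern form one contiguous range.
    prefixes = [text[i:i + m] for i in sa]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    # Map each occurrence position to the document that contains it.
    return Counter(bisect_right(starts, sa[j]) - 1 for j in range(lo, hi))


def top_k(index, pattern, k):
    """Return the k documents where pattern occurs most often."""
    return list_with_frequencies(index, pattern).most_common(k)


index = build_index(["abracadabra", "banana", "cabana"])
print(dict(list_with_frequencies(index, "ana")))
print(top_k(index, "a", 1))
```

This sketch spends time proportional to the number of occurrences of the pattern and stores the text and suffix array explicitly. The abstract's bounds concern indexes that answer the same queries within \(|\mathsf{CSA}|\) bits plus a small additive term.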

Keywords

  • Document Retrieval
  • Compressed Index
  • Wavelet Tree
  • Document Listing
  • Range Minimum Query
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Djamal Belazzougui (1)
  • Gonzalo Navarro (2)
  1. LIAFA, Univ. Paris Diderot - Paris 7, France
  2. Department of Computer Science, University of Chile, Chile
