Document Listing on Repetitive Collections

  • Travis Gagie
  • Kalle Karhu
  • Gonzalo Navarro
  • Simon J. Puglisi
  • Jouni Sirén
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7922)

Abstract

Many document collections consist largely of repeated material, and several indexes have been designed to take advantage of this. There has been only preliminary work, however, on document retrieval for repetitive collections. In this paper we show how one of those indexes, the run-length compressed suffix array (RLCSA), can be extended to support document listing. In our experiments, our additional structures on top of the RLCSA can reduce the query time for document listing by an order of magnitude while still using total space that is only a fraction of the raw collection size. As a byproduct, we develop a new document listing technique for general collections that is of independent interest.
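
To make the problem concrete: document listing asks, given a pattern, for the set of distinct documents that contain it at least once. The sketch below is a naive baseline over a plain, uncompressed suffix array. It is not the RLCSA-based structure the paper describes. The function names (`build_index`, `suffix_range`, `list_documents`), the `\x00` separator, and the toy collection are assumptions made for illustration only.

```python
from bisect import bisect_right


def build_index(docs):
    """Concatenate documents and build a plain suffix array (toy construction)."""
    sep = "\x00"  # assumption: the separator occurs in no document or pattern
    text = sep.join(docs) + sep
    # Starting offset of each document inside the concatenation.
    starts, pos = [], 0
    for d in docs:
        starts.append(pos)
        pos += len(d) + 1
    # Naive construction by sorting all suffixes; acceptable only for tiny inputs.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa, starts


def suffix_range(text, sa, pattern):
    """Return the half-open range [lo, hi) of suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return left, lo


def list_documents(docs, pattern):
    """Report the distinct document identifiers containing pattern."""
    text, sa, starts = build_index(docs)
    lo, hi = suffix_range(text, sa, pattern)
    # Map every occurrence to its document, then deduplicate.
    return sorted({bisect_right(starts, sa[i]) - 1 for i in range(lo, hi)})


docs = ["abracadabra", "banana", "cabana"]
print(list_documents(docs, "ana"))
```

This baseline visits every occurrence of the pattern and then deduplicates, so its cost grows with the number of occurrences rather than with the number of documents reported. Dedicated document listing structures aim to avoid that per-occurrence work.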

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Travis Gagie (1)
  • Kalle Karhu (2)
  • Gonzalo Navarro (3)
  • Simon J. Puglisi (1)
  • Jouni Sirén (3)

  1. Helsinki Institute for Information Technology (Aalto), Department of Computer Science, University of Helsinki, Finland
  2. Department of Computer Science and Engineering, Aalto University, Finland
  3. Department of Computer Science, University of Chile, Chile
