Document Retrieval on Repetitive Collections

  • Gonzalo Navarro
  • Simon J. Puglisi
  • Jouni Sirén
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8737)


Document retrieval aims at finding the most important documents where a pattern appears in a collection of strings. Traditional pattern-matching techniques yield brute-force document retrieval solutions, which has motivated the research on tailored indexes that offer near-optimal performance. However, an experimental study establishing which alternatives are actually better than brute force, and which perform best depending on the collection characteristics, has not been carried out. In this paper we address this shortcoming by exploring the relationship between the nature of the underlying collection and the performance of current methods. Via extensive experiments we show that established solutions are often beaten in practice by brute-force alternatives. We also design new methods that offer superior time/space trade-offs, particularly on repetitive collections.
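As an illustration of what a brute-force solution built on traditional pattern matching can look like, the following minimal sketch locates every occurrence of the pattern with a suffix array, maps each occurrence to its document, and ranks documents by occurrence count. It is not the authors' implementation: the function names, the naive suffix array construction, the `"\x00"` separator, and the use of term frequency as the importance measure are all assumptions made for this example.

```python
import bisect
from collections import Counter


def build_suffix_array(text):
    # Naive construction by sorting suffixes; adequate only for small inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def suffix_range(text, sa, pattern):
    # Binary search for the half-open range [start, end) of suffixes
    # that have `pattern` as a prefix.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def brute_force_top_k(docs, pattern, k):
    # Assumption: the separator occurs in no document and no pattern,
    # so a match can never span two documents.
    sep = "\x00"
    text = sep.join(docs) + sep
    starts = []
    pos = 0
    for d in docs:
        starts.append(pos)
        pos += len(d) + 1
    sa = build_suffix_array(text)
    lo, hi = suffix_range(text, sa, pattern)
    # Brute force: visit every occurrence and count it for its document.
    counts = Counter(
        bisect.bisect_right(starts, sa[i]) - 1 for i in range(lo, hi)
    )
    # Rank by descending frequency, breaking ties by document id.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


docs = ["abracadabra", "cadabra", "banana"]
print(brute_force_top_k(docs, "abra", 2))
```

The query cost of this approach grows with the total number of occurrences of the pattern rather than with the number of documents reported, which is the inefficiency that tailored document retrieval indexes aim to avoid.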


Keywords: Query Range, Document Retrieval, Base Document, Wavelet Tree, Document Listing
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Gonzalo Navarro (1)
  • Simon J. Puglisi (2)
  • Jouni Sirén (1)
  1. Center for Biotechnology and Bioengineering, Department of Computer Science, University of Chile, Chile
  2. Department of Computer Science, University of Helsinki, Finland
