Document Listing on Repetitive Collections

  • Travis Gagie
  • Kalle Karhu
  • Gonzalo Navarro
  • Simon J. Puglisi
  • Jouni Sirén
Conference paper

DOI: 10.1007/978-3-642-38905-4_12

Part of the Lecture Notes in Computer Science book series (LNCS, volume 7922)
Cite this paper as:
Gagie T., Karhu K., Navarro G., Puglisi S.J., Sirén J. (2013) Document Listing on Repetitive Collections. In: Fischer J., Sanders P. (eds) Combinatorial Pattern Matching. CPM 2013. Lecture Notes in Computer Science, vol 7922. Springer, Berlin, Heidelberg

Abstract

Many document collections consist largely of repeated material, and several indexes have been designed to take advantage of this. There has been only preliminary work, however, on document retrieval for repetitive collections. In this paper we show how one of those indexes, the run-length compressed suffix array (RLCSA), can be extended to support document listing. In our experiments, our additional structures on top of the RLCSA can reduce the query time for document listing by an order of magnitude while still using total space that is only a fraction of the raw collection size. As a byproduct, we develop a new document listing technique for general collections that is of independent interest.

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Travis Gagie (1)
  • Kalle Karhu (2)
  • Gonzalo Navarro (3)
  • Simon J. Puglisi (1)
  • Jouni Sirén (3)
  1. Helsinki Institute for Information Technology (Aalto), Department of Computer Science, University of Helsinki, Finland
  2. Department of Computer Science and Engineering, Aalto University, Finland
  3. Department of Computer Science, University of Chile, Chile
