Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections

  • Jouni Sirén
  • Niko Välimäki
  • Veli Mäkinen
  • Gonzalo Navarro
Conference paper

DOI: 10.1007/978-3-540-89097-3_17

Part of the Lecture Notes in Computer Science book series (LNCS, volume 5280)
Cite this paper as:
Sirén J., Välimäki N., Mäkinen V., Navarro G. (2008) Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections. In: Amir A., Turpin A., Moffat A. (eds) String Processing and Information Retrieval. SPIRE 2008. Lecture Notes in Computer Science, vol 5280. Springer, Berlin, Heidelberg

Abstract

A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. This paper is devoted to studying ways to store massive sets of highly repetitive sequence collections in a space-efficient manner so that retrieval of the content as well as queries on the content of the sequences can be provided time-efficiently. We show that the state-of-the-art entropy-bound full-text self-indexes do not yet provide satisfactory space bounds for this specific task. We engineer some new structures that use run-length encoding and give empirical evidence that these structures are superior to the current structures.
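The abstract does not describe the structures themselves, so the following is only an illustrative sketch of the general idea behind run-length encoding in this setting, not the authors' method. It assumes a naive Burrows-Wheeler transform built from sorted rotations, and the names `bwt`, `run_length_encode`, `run_length_decode`, and the toy `base` sequence are all hypothetical. On a collection of near-identical sequences, the transformed text tends to contain long runs of equal symbols, which run-length encoding stores as (symbol, count) pairs.

```python
from itertools import groupby
from typing import List, Tuple


def bwt(text: str) -> str:
    """Naive Burrows-Wheeler transform via sorted rotations.

    Quadratic in time and space; meant for illustration only.
    """
    s = text + "\0"  # terminator assumed smaller than every other symbol
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s: str) -> List[Tuple[str, int]]:
    """Collapse each maximal run of equal symbols into (symbol, length)."""
    return [(symbol, sum(1 for _ in run)) for symbol, run in groupby(s)]


def run_length_decode(runs: List[Tuple[str, int]]) -> str:
    """Inverse of run_length_encode."""
    return "".join(symbol * length for symbol, length in runs)


# Hypothetical toy collection: a base sequence plus two copies that each
# differ from it by a single substitution.
base = "GATTACAGATTTACCAGGA"
variants = [base, base[:5] + "T" + base[6:], base[:12] + "G" + base[13:]]
collection = "".join(v + "$" for v in variants)  # "$" separates the sequences

transformed = bwt(collection)
runs = run_length_encode(transformed)
print("collection length N:", len(collection))
print("runs in the transformed text:", len(runs))
```

Comparing the two printed numbers for collections with more or fewer variants gives a rough feel for how the number of runs relates to the total length N, although a practical index would need efficient construction and query support that this sketch omits.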

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Jouni Sirén (1)
  • Niko Välimäki (1)
  • Veli Mäkinen (1)
  • Gonzalo Navarro (2)

  1. Dept. of Computer Science, Univ. of Helsinki, Finland
  2. Dept. of Computer Science, Univ. of Chile
