Advertisement

Algorithmica

, Volume 32, Issue 1, pp 1–35 | Cite as

A Theoretical and Experimental Study on the Construction of Suffix Arrays in External Memory

  • Crauser
  • Ferragina
Article

Abstract

The construction of full-text indexes on very large text collections is nowadays a hot problem. The suffix array [32] is one of the most attractive full-text indexing data structures due to its simplicity, space efficiency and powerful/ fast search operations supported. In this paper we analyze, both theoretically and experimentally, the I/ O complexity and the working space of six algorithms for constructing large suffix arrays. Three of them are state-of-the-art, the other three algorithms are our new proposals. We perform a set of experiments based on three different data sets (English texts, amino-acid sequences and random texts) and give a precise hierarchy of these algorithms according to their working-space versus construction-time tradeoff. Given the current trends in model design [12], [32] and disk technology [29], [30], we pose particular attention to differentiate between ``random'' and ``contiguous'' disk accesses, in order to explain reasonably some practical I/ O phenomena which are related to the experimental behavior of these algorithms and that would otherwise be meaningless in the light of other simpler external-memory models.

We also address two other issues. The former is concerned with the problem of building word indexes; we show that our results can be successfully applied to this case too, without any loss in efficiency and without compromising the simplicity of programming to achieve a uniform, simple and efficient approach to both the two indexing models. The latter issue is related to the intriguing and apparently counterintuitive ``contradiction'' between the effective practical performance of the well-known Baeza-Yates—Gonnet—Snider algorithm [17], verified in our experiments, and its unappealing worst-case behavior. We devise a new external-memory algorithm that follows the basic philosophy underlying that algorithm but in a significantly different manner, thus resulting in a novel approach which combines good worst-case bounds with efficient practical performance.

Key words. Suffix array, Large data sets, External-memory model, Text indexing, Full-text and word-based models. 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

Copyright information

© Springer-Verlag New York Inc. 2001

Authors and Affiliations

  • Crauser
    • 1
  • Ferragina
    • 2
  1. 1.Max-Planck-Institut für Informatik, Saarbrücken, Germany. crauser@mpi-sb.mpg.de.DE
  2. 2.Dipartimento di Informatica, Università di Pisa, Pisa, Italy. ferragin@di.unipi.it.IT

Personalised recommendations