Algorithms and Computation

Volume 2906 of the series Lecture Notes in Computer Science pp 240-249

Constructing Compressed Suffix Arrays with Large Alphabets

  • Wing-Kai Hon, Department of Computer Science and Information Systems, The University of Hong Kong
  • Tak-Wah Lam, Department of Computer Science and Information Systems, The University of Hong Kong
  • Kunihiko Sadakane, Department of Computer Science and Communication Engineering, Kyushu University
  • Wing-Kin Sung, School of Computing, National University of Singapore


Recent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, the compressed suffix array (CSA) [7] and the FM-index [5]. Either of them makes it feasible to store a full-text index in main memory even for text with a few billion characters (such as human DNA). However, constructing such an index with limited working memory (i.e., without first constructing the suffix array) is not a trivial task. This paper addresses this problem. Currently, only the CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet Σ, this algorithm requires O(|Σ| n log n) time and (2H_0 + 1 + ε)n bits of working space, where H_0 is the 0-th order empirical entropy of T and ε is any non-zero constant. This algorithm is good enough when the alphabet size |Σ| is small, but it is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters.
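To make the working-space bound concrete, the following is a minimal sketch of how the 0-th order empirical entropy H_0 could be computed for a given text, and how it plugs into the (2H_0 + 1 + ε)n bound quoted above. The function names and the default value of ε are assumptions chosen for illustration; nothing here is taken from the construction algorithm of [15] itself.

```python
from collections import Counter
from math import log2


def empirical_entropy_h0(text: str) -> float:
    """Return the 0-th order empirical entropy of `text`, in bits per character.

    H_0 = sum over distinct characters c of (n_c / n) * log2(n / n_c),
    where n_c is the number of occurrences of c and n is the text length.
    """
    n = len(text)
    if n == 0:
        return 0.0
    counts = Counter(text)
    return sum((c / n) * log2(n / c) for c in counts.values())


def prior_working_space_bits(text: str, eps: float = 0.1) -> float:
    """Evaluate the bound (2*H_0 + 1 + eps) * n for `text`.

    `eps` stands in for the constant in the bound; 0.1 is an arbitrary
    illustrative choice, not a value from the paper.
    """
    n = len(text)
    return (2 * empirical_entropy_h0(text) + 1 + eps) * n
```

For a text in which four characters occur equally often (roughly the DNA case), H_0 is 2 bits per character, so the bound evaluates to about (5 + ε)n bits.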

The main contribution of this paper is a new algorithm that constructs the CSA in O(n log n) time using (H_0 + 2 + ε)n bits of working space. Note that the running time of our algorithm is independent of the alphabet size, and the space requirement is smaller whenever H_0 > 1, which is likely in practice. This paper also contributes to the space-efficient construction of the FM-index: we show that the FM-index can be constructed from the CSA directly in O(n) time.
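The space comparison can be checked with simple arithmetic: subtracting the new bound from the earlier one gives (2H_0 + 1 + ε)n − (H_0 + 2 + ε)n = (H_0 − 1)n, which is positive exactly when H_0 > 1. The sketch below evaluates that difference; the function name and the sample values are assumptions for illustration only.

```python
def space_saving_bits(h0: float, n: int, eps: float = 0.1) -> float:
    """Return the difference in working space (in bits) between the two bounds.

    old bound: (2*h0 + 1 + eps) * n
    new bound: (h0 + 2 + eps) * n
    The difference simplifies to (h0 - 1) * n, so eps cancels out.
    """
    old_bits = (2 * h0 + 1 + eps) * n
    new_bits = (h0 + 2 + eps) * n
    return old_bits - new_bits


if __name__ == "__main__":
    # Hypothetical inputs: a text of 3 billion characters with H_0 = 2 bits.
    saving = space_saving_bits(h0=2.0, n=3_000_000_000)
    print(f"saving: {saving / 8 / 2**20:.0f} MiB")
```

With H_0 = 1 the two bounds coincide, and for H_0 < 1 the earlier bound would be the smaller of the two.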