Chapter

Algorithms and Computation

Volume 2906 of the series Lecture Notes in Computer Science pp 240-249

Constructing Compressed Suffix Arrays with Large Alphabets

  • Wing-Kai Hon, Department of Computer Science and Information Systems, The University of Hong Kong
  • Tak-Wah Lam, Department of Computer Science and Information Systems, The University of Hong Kong
  • Kunihiko Sadakane, Department of Computer Science and Communication Engineering, Kyushu University
  • Wing-Kin Sung, School of Computing, National University of Singapore


Abstract

Recent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, compressed suffix arrays (CSA) [7] and the FM-index [5]. Either of them makes it feasible to store a full-text index in main memory, even for text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without first constructing the suffix array) is not a trivial task. This paper addresses this problem. Currently, only CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet Σ, this algorithm requires O(|Σ| n log n) time and (2H₀ + 1 + ε)n bits of working space, where H₀ is the 0-th order empirical entropy of T and ε is any non-zero constant. This algorithm is good enough when the alphabet size |Σ| is small. It is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters.
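For readers unfamiliar with the quantity H₀ used in these bounds, the following is a minimal sketch of how the 0-th order empirical entropy of a text can be computed, using the standard definition H₀ = Σ_c (n_c / n) log₂(n / n_c), where n_c is the number of occurrences of character c. The function name `empirical_entropy_h0` is an illustrative assumption and does not come from the paper.

```python
import math
from collections import Counter


def empirical_entropy_h0(text):
    """Return the 0-th order empirical entropy of `text` in bits per character.

    Uses the standard definition: sum over distinct characters c of
    (n_c / n) * log2(n / n_c), where n_c is the count of c and n = len(text).
    The function name is hypothetical, chosen for this illustration only.
    """
    n = len(text)
    if n == 0:
        return 0.0  # convention for the empty text
    counts = Counter(text)
    return sum((n_c / n) * math.log2(n / n_c) for n_c in counts.values())


if __name__ == "__main__":
    # A text using four characters equally often needs 2 bits per character.
    print(empirical_entropy_h0("ACGTACGT"))
    # A text consisting of one repeated character has zero entropy.
    print(empirical_entropy_h0("AAAA"))
```

H₀ grows with the number of distinct, evenly used characters, so texts over large alphabets tend to have larger H₀ than DNA-like texts.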

The main contribution of this paper is a new algorithm that constructs CSA in O(n log n) time using (H₀ + 2 + ε)n bits of working space. Note that the running time of our algorithm is independent of the alphabet size, and the space requirement is smaller whenever H₀ > 1, which is likely to be the case. This paper also contributes to the space-efficient construction of the FM-index: we show that the FM-index can be constructed directly from CSA in O(n) time.
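The space comparison can be checked with simple arithmetic: the earlier bound minus the new bound is (2H₀ + 1 + ε)n − (H₀ + 2 + ε)n = (H₀ − 1)n, which is positive exactly when H₀ > 1. The sketch below evaluates both expressions as stated in the abstract; the function names and the sample values of n, H₀ and ε are illustrative assumptions, not figures from the paper.

```python
def prior_working_space_bits(n, h0, eps):
    """Working space of the earlier construction [15]: (2*H0 + 1 + eps) * n bits."""
    return (2 * h0 + 1 + eps) * n


def new_working_space_bits(n, h0, eps):
    """Working space of the new construction: (H0 + 2 + eps) * n bits."""
    return (h0 + 2 + eps) * n


def space_saving_bits(n, h0, eps):
    """Difference between the two bounds; algebraically equal to (H0 - 1) * n."""
    return prior_working_space_bits(n, h0, eps) - new_working_space_bits(n, h0, eps)


if __name__ == "__main__":
    # Hypothetical inputs: a text of one million characters with H0 = 4 bits.
    n, h0, eps = 1_000_000, 4.0, 0.5
    print(prior_working_space_bits(n, h0, eps))
    print(new_working_space_bits(n, h0, eps))
    print(space_saving_bits(n, h0, eps))
```

When H₀ = 1 the two bounds coincide, and for H₀ < 1 the earlier bound is the smaller of the two, which is why the abstract qualifies the claim with H₀ > 1.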