Compressed Suffix Arrays for Massive Data

  • Jouni Sirén
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5721)


We present a fast, space-efficient algorithm for constructing compressed suffix arrays (CSA). The algorithm requires O(n log n) time in the worst case, and only O(n) bits of extra space in addition to the CSA. As the basic step, we describe an algorithm for merging two CSAs. We show that the construction algorithm can be parallelized in a symmetric multiprocessor system, and discuss the possibility of a distributed implementation. We also describe a parallel implementation of the algorithm, capable of indexing several gigabytes per hour.
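To make the merging step concrete, the following is a minimal sketch of merging two plain (uncompressed) suffix arrays by ranking the suffixes of one text among those of the other and then interleaving. It is not the paper's procedure: it works on ordinary Python lists, materializes every suffix, and ignores the time and space bounds stated above. The function names `suffix_array` and `merge_suffix_arrays`, and the use of distinct terminators `$` and `#` to rule out ties, are assumptions made for this illustration.

```python
from bisect import bisect_left


def suffix_array(text):
    """Naive suffix array: sort start positions by the suffixes they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def merge_suffix_arrays(a, sa_a, b, sa_b):
    """Merge the suffix arrays of texts a and b.

    Returns a list of (text_id, position) pairs in lexicographic order of
    the suffixes, with text_id 0 for a and 1 for b. Assumes a and b end in
    distinct terminator characters, so no two suffixes compare equal.
    """
    # Suffixes of a in sorted order (illustration only: quadratic space).
    sorted_suffixes_a = [a[i:] for i in sa_a]

    # For each suffix of b, in sorted order, count the suffixes of a that
    # are smaller. The counts are nondecreasing because sa_b is sorted.
    ranks = [bisect_left(sorted_suffixes_a, b[j:]) for j in sa_b]

    # Interleave: emit suffixes of a until the rank is reached, then the
    # suffix of b.
    merged = []
    next_a = 0
    for j, rank in zip(sa_b, ranks):
        while next_a < rank:
            merged.append((0, sa_a[next_a]))
            next_a += 1
        merged.append((1, j))
    # Any remaining suffixes of a are larger than every suffix of b.
    merged.extend((0, i) for i in sa_a[next_a:])
    return merged


if __name__ == "__main__":
    a, b = "banana$", "ananas#"
    for text_id, pos in merge_suffix_arrays(a, suffix_array(a), b, suffix_array(b)):
        print(text_id, pos, (a, b)[text_id][pos:])
```

The sketch only shows why ranks plus interleaving suffice to combine two sorted suffix sets; a space-efficient version would have to obtain the ranks without storing the suffixes explicitly.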


Massive Data; Query Time; Construction Algorithm; Secondary Memory; Partial Index
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Jouni Sirén (1)
  1. Department of Computer Science, University of Helsinki, Finland
