Abstract

We present a fast space-efficient algorithm for constructing compressed suffix arrays (CSA). The algorithm requires O(n logn) time in the worst case, and only O(n) bits of extra space in addition to the CSA. As the basic step, we describe an algorithm for merging two CSAs. We show that the construction algorithm can be parallelized in a symmetric multiprocessor system, and discuss the possibility of a distributed implementation. We also describe a parallel implementation of the algorithm, capable of indexing several gigabytes per hour.

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal on Discrete Algorithms 2(1), 53–86 (2004)MathSciNetCrossRefMATHGoogle Scholar
  2. 2.
    Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)Google Scholar
  3. 3.
    Chan, H.-L., Hon, W.-K., Lam, T.-W., Sadakane, K.: Compressed indexes for dynamic text collections. ACM Transactions on Algorithms 3(2), 21 (2007)MathSciNetCrossRefMATHGoogle Scholar
  4. 4.
    Crauser, A., Ferragina, P.: A theoretical and experimental study on the construction of suffix arrays in external memory. Algorithmica 32(1), 1–35 (2002)MathSciNetCrossRefMATHGoogle Scholar
  5. 5.
    Dementiev, R., Kärkkäinen, J., Mehnert, J., Sanders, P.: Better external memory suffix array construction. Journal of Experimental Algorithms 12, article no. 3.4 (2008)Google Scholar
  6. 6.
    Elias, P.: Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory 21(2), 194–203 (1975)MathSciNetCrossRefMATHGoogle Scholar
  7. 7.
    Ferragina, P., González, R., Navarro, G., Venturini, R.: Compressed text indexes: From theory to practice. Journal of Experimental Algorithms 13, article no. 1.12 (2009)Google Scholar
  8. 8.
    Ferragina, P., Manzini, G.: Indexing compressed text. Journal of the ACM 52(4), 552–581 (2005)MathSciNetCrossRefMATHGoogle Scholar
  9. 9.
    Gerlach, W.: Dynamic FM-index for a collection of texts with application to space-efficient construction of the compressed suffix array. Master’s thesis, Bielefeld University (2007)Google Scholar
  10. 10.
    Gonnet, G.H., Baeza-Yates, R.A., Snider, T.: New indices for text: PAT trees and PAT arrays. In: Information retrieval: data structures and algorithms, pp. 66–82. Prentice-Hall, Englewood Cliffs (1992)Google Scholar
  11. 11.
    González, R., Navarro, G.: Improved dynamic rank-select entropy-bound structures. In: Laber, E.S., Bornstein, C., Nogueira, L.T., Faria, L. (eds.) LATIN 2008. LNCS, vol. 4957, pp. 374–386. Springer, Heidelberg (2008)CrossRefGoogle Scholar
  12. 12.
    Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing 35(2), 378–407 (2005)MathSciNetCrossRefMATHGoogle Scholar
  13. 13.
    Hon, W.-K., Lam, T.-W., Sadakane, K., Sung, W.-K., Yiu, S.-M.: A space and time efficient algorithm for constructing compressed suffix arrays. Algorithmica 48(1), 23–36 (2007)MathSciNetCrossRefMATHGoogle Scholar
  14. 14.
    Hon, W.-K., Lam, T.-W., Sung, W.-K., Tse, W.-L., Wong, C.-K., Yiu, S.-M.: Practical aspects of compressed suffix arrays and FM-index in searching DNA sequences. In: ALENEX 2004, pp. 31–38. SIAM, Philadelphia (2004)Google Scholar
  15. 15.
    Hon, W.-K., Sadakane, K., Sung, W.-K.: Breaking a time-and-space barrier in constructing full-text indices. SIAM Journal on Computing 38(6), 2162–2178 (2009)MathSciNetCrossRefMATHGoogle Scholar
  16. 16.
    Kärkkäinen, J.: Fast BWT in small space by blockwise suffix sorting. Theoretical Computer Science 387(3), 249–257 (2007)MathSciNetCrossRefMATHGoogle Scholar
  17. 17.
    Kulla, F., Sanders, P.: Scalable parallel suffix array construction. Parallel Computing 33(9), 605–612 (2007)CrossRefGoogle Scholar
  18. 18.
    Larsson, N.J., Sadakane, K.: Faster suffix sorting. Theoretical Computer Science 387(3), 258–272 (2007)MathSciNetCrossRefMATHGoogle Scholar
  19. 19.
    Lee, S., Park, K.: Dynamic rank-select structures with applications to run-length encoded texts. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 95–106. Springer, Heidelberg (2007)CrossRefGoogle Scholar
  20. 20.
    Mäkinen, V., Navarro, G.: Dynamic entropy-compressed sequences and full-text indexes. ACM Transactions on Algorithms 4(3), 32 (2008)MathSciNetCrossRefMATHGoogle Scholar
  21. 21.
    Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of individual genomes. In: RECOMB 2009. LNCS, vol. 5541, pp. 121–137. Springer, Heidelberg (2009)Google Scholar
  22. 22.
    Na, J.C., Park, K.: Alphabet-independent linear-time construction of compressed suffix arrays using o(nlogn)-bit working space. Theoretical Computer Science 385(1-3), 127–136 (2007)MathSciNetCrossRefMATHGoogle Scholar
  23. 23.
    Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1), 2 (2007)CrossRefMATHGoogle Scholar
  24. 24.
    Puglisi, S.J., Smyth, W.F., Turpin, A.H.: A taxonomy of suffix array construction algorithms. ACM Computing Surveys 39(2), 4 (2007)CrossRefGoogle Scholar
  25. 25.
    Salson, M., Lecroq, T., Léonard, M., Mouchard, L.: Dynamic extended suffix arrays. Accepted to Journal of Discrete AlgorithmsGoogle Scholar
  26. 26.
    Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G.: Run-length compressed indexes are superior for highly repetitive sequence collections. In: Amir, A., Turpin, A., Moffat, A. (eds.) SPIRE 2008. LNCS, vol. 5280, pp. 164–175. Springer, Heidelberg (2008)CrossRefGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Jouni Sirén
    • 1
  1. 1.Department of Computer ScienceUniversity of HelsinkiFinland

Personalised recommendations