Simple Linear Work Suffix Array Construction

  • Juha Kärkkäinen
  • Peter Sanders
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2719)

Abstract

A suffix array represents the suffixes of a string in sorted order. Being a simpler and more compact alternative to suffix trees, it is an important tool for full text indexing and other string processing tasks. We introduce the skew algorithm for suffix array construction over integer alphabets that can be implemented to run in linear time using integer sorting as its only nontrivial subroutine:
  1. recursively sort suffixes beginning at positions i mod 3 ≠ 0.
  2. sort the remaining suffixes using the information obtained in step one.
  3. merge the two sorted sequences obtained in steps one and two.

The algorithm is much simpler than previous linear time algorithms that are all based on the more complicated suffix tree data structure. Since sorting is a well studied problem, we obtain optimal algorithms for several other models of computation, e.g. external memory with parallel disks, cache oblivious, and parallel. The adaptations for BSP and EREW-PRAM are asymptotically faster than the best previously known algorithms.
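
The three steps listed in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `skew_suffix_array` and `suffix_array` are hypothetical, the input symbols are assumed to be integers of at least 1 so that 0 can serve as padding, and Python's built-in comparison sort stands in for the linear-time integer sort, so the sketch runs in O(n log n) rather than linear time.

```python
def skew_suffix_array(s):
    """Suffix array of s, a list of integers that are all >= 1 (assumption)."""
    n = len(s)
    if n < 3:
        # Tiny inputs: sort the suffixes directly.
        return sorted(range(n), key=lambda i: s[i:])
    t = s + [0, 0, 0]  # padding with 0, smaller than every real symbol

    # Step 1: sort the suffixes starting at positions i with i mod 3 != 0.
    # If n mod 3 == 1, a dummy position n is added so that the mod-1 part
    # of the reduced string ends in a unique, smallest triple.
    end = n + 1 if n % 3 == 1 else n
    sample = [i for i in range(end) if i % 3 != 0]

    def triple(i):
        return tuple(t[i:i + 3])

    by_triple = sorted(sample, key=triple)  # stand-in for radix sort
    name, names, prev = {}, 0, None
    for i in by_triple:
        if triple(i) != prev:
            names += 1
            prev = triple(i)
        name[i] = names

    if names < len(sample):
        # Some triples repeat: recurse on the string of triple names,
        # mod-1 positions first, then mod-2 positions.
        order = ([i for i in sample if i % 3 == 1]
                 + [i for i in sample if i % 3 == 2])
        reduced = [name[i] for i in order]
        sorted_sample = [order[k] for k in skew_suffix_array(reduced)]
    else:
        sorted_sample = by_triple

    rank = [0] * (n + 3)  # rank 0 means "past the end of the string"
    for r, i in enumerate(sorted_sample, start=1):
        rank[i] = r

    # Step 2: sort the remaining (mod-0) suffixes by their first symbol
    # and the rank of the sample suffix that follows it.
    mod0 = sorted(range(0, n, 3), key=lambda i: (t[i], rank[i + 1]))

    # Step 3: merge, comparing one or two symbols and then sample ranks.
    def sample_is_smaller(i, j):
        if i % 3 == 1:
            return (t[i], rank[i + 1]) < (t[j], rank[j + 1])
        return (t[i], t[i + 1], rank[i + 2]) < (t[j], t[j + 1], rank[j + 2])

    sample_sorted = [i for i in sorted_sample if i < n]  # drop the dummy
    result, a, b = [], 0, 0
    while a < len(sample_sorted) and b < len(mod0):
        if sample_is_smaller(sample_sorted[a], mod0[b]):
            result.append(sample_sorted[a])
            a += 1
        else:
            result.append(mod0[b])
            b += 1
    return result + sample_sorted[a:] + mod0[b:]


def suffix_array(text):
    """Convenience wrapper: map characters to integers >= 1."""
    return skew_suffix_array([ord(c) + 1 for c in text])
```

For example, `suffix_array("banana")` returns `[5, 3, 1, 0, 4, 2]`, corresponding to the sorted suffixes a, ana, anana, banana, na, nana. Replacing each call to `sorted` with a radix sort over the bounded integer keys is what would bring the running time down to the linear bound claimed in the abstract.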

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Juha Kärkkäinen (1)
  • Peter Sanders (1)
  1. Max-Planck-Institut für Informatik, Saarbrücken, Germany
