Fast Lightweight Suffix Array Construction and Checking

  • Stefan Burkhardt
  • Juha Kärkkäinen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2676)

Abstract

We describe an algorithm that, for any v ∈ [2, n], constructs the suffix array of a string of length n in \( \mathcal{O}(vn + n \log n) \) time using \( \mathcal{O}(v + n/\sqrt{v}) \) space in addition to the input (the string) and the output (the suffix array). By setting v = log n, we obtain an \( \mathcal{O}(n \log n) \) time algorithm using \( \mathcal{O}(n/\sqrt{\log n}) \) extra space. This solves the open problem stated by Manzini and Ferragina [ESA ’02] of whether there exists a lightweight (sublinear extra space) \( \mathcal{O}(n \log n) \) time algorithm. The key idea of the algorithm is to first sort a sample of suffixes chosen using mathematical constructs called difference covers. The algorithm is not only lightweight but also fast in practice, as demonstrated by experiments. Additionally, we describe fast and lightweight suffix array checkers, i.e., algorithms that check the correctness of a suffix array.
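
The abstract does not define difference covers, so the following is only an illustrative sketch under the standard definition: a set D ⊆ [0, v) is a difference cover modulo v if every residue in [0, v) can be written as (a − b) mod v with a, b ∈ D. The function names `is_difference_cover` and `common_shift` and the example cover {0, 1, 3} modulo 7 are assumptions made for illustration and are not taken from the paper. The second function shows the property that makes such a sample useful: for any two positions i and j there is a shift k < v that moves both into sample positions (positions whose residue modulo v lies in D), so two suffixes could be compared by inspecting at most v characters and then consulting the order of the sampled suffixes.

```python
from itertools import product


def is_difference_cover(cover, v):
    """Return True if every residue modulo v is a difference of two elements of cover."""
    differences = {(a - b) % v for a, b in product(cover, repeat=2)}
    return differences == set(range(v))


def common_shift(i, j, cover, v):
    """Return the smallest k in [0, v) such that (i + k) % v and (j + k) % v are both in cover.

    Such a k exists for every pair (i, j) whenever cover is a difference cover modulo v.
    """
    members = set(cover)
    for k in range(v):
        if (i + k) % v in members and (j + k) % v in members:
            return k
    raise ValueError("cover is not a difference cover modulo v")


# Hypothetical example: {0, 1, 3} covers all residues modulo 7.
cover, v = [0, 1, 3], 7
print(is_difference_cover(cover, v))
print(common_shift(2, 5, cover, v))
```

In this sketch the shift is found by linear search over k; an implementation concerned with speed would more plausibly precompute a lookup table indexed by (j − i) mod v, which is again an assumption rather than a description of the authors' code.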

References

  1. M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch. The enhanced suffix array and its applications to genome analysis. In Proc. 2nd Workshop on Algorithms in Bioinformatics, volume 2452 of LNCS, pages 449–463. Springer, 2002.
  2. A. Andersson, N. J. Larsson, and K. Swanson. Suffix trees on words. Algorithmica, 23(3):246–260, 1999.
  3. J. L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In Proc. 8th Annual Symposium on Discrete Algorithms, pages 360–369. ACM, 1997.
  4. M. Blum and S. Kannan. Designing programs that check their work. J. ACM, 42(1):269–291, Jan. 1995.
  5. M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, SRC (digital, Palo Alto), May 1994.
  6. R. Clifford. Distributed and paged suffix trees for large genetic databases. In Proc. 14th Annual Symposium on Combinatorial Pattern Matching. Springer, 2003. This volume.
  7. C. J. Colbourn and A. C. H. Ling. Quorums from difference covers. Inf. Process. Lett., 75(1–2):9–12, July 2000.
  8. M. Farach. Optimal suffix tree construction with large alphabets. In Proc. 38th Annual Symposium on Foundations of Computer Science, pages 137–143. IEEE, 1997.
  9. G. Gonnet, R. Baeza-Yates, and T. Snider. New indices for text: PAT trees and PAT arrays. In W. B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures & Algorithms. Prentice-Hall, 1992.
  10. H. Itoh and H. Tanaka. An efficient method for in memory construction of suffix arrays. In Proc. 6th Symposium on String Processing and Information Retrieval, pages 125–136. IEEE, 1999.
  11. J. Kärkkäinen and P. Sanders. Simple linear work suffix array construction. In Proc. 13th International Conference on Automata, Languages and Programming. Springer, 2003. To appear.
  12. J. Kärkkäinen and E. Ukkonen. Sparse suffix trees. In Proc. 2nd Annual International Conference on Computing and Combinatorics, volume 1090 of LNCS, pages 219–230. Springer, 1996.
  13. R. M. Karp, R. E. Miller, and A. L. Rosenberg. Rapid identification of repeated patterns in strings, trees and arrays. In Proc. 4th Annual Symposium on Theory of Computing, pages 125–136. ACM, 1972.
  14. T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proc. 12th Annual Symposium on Combinatorial Pattern Matching, volume 2089 of LNCS, pages 181–192. Springer, 2001.
  15. J. Kilian, S. Kipnis, and C. E. Leiserson. The organization of permutation architectures with bused interconnections. IEEE Transactions on Computers, 39(11):1346–1358, Nov. 1990.
  16. D. K. Kim, J. S. Sim, H. Park, and K. Park. Linear-time construction of suffix arrays. In Proc. 14th Annual Symposium on Combinatorial Pattern Matching. Springer, 2003. This volume.
  17. P. Ko and S. Aluru. Linear time construction of suffix arrays. In Proc. 14th Annual Symposium on Combinatorial Pattern Matching. Springer, 2003. This volume.
  18. S. Kurtz. Reducing the space requirement of suffix trees. Software: Practice and Experience, 29(13):1149–1171, 1999.
  19. N. J. Larsson and K. Sadakane. Faster suffix sorting. Technical report LU-CSTR: 99-214, Dept. of Computer Science, Lund University, Sweden, 1999.
  20. W.-S. Luk and T.-T. Wong. Two new quorum based algorithms for distributed mutual exclusion. In Proc. 17th International Conference on Distributed Computing Systems, pages 100–106. IEEE, 1997.
  21. U. Manber and G. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935–948, Oct. 1993.
  22. G. Manzini and P. Ferragina. Engineering a lightweight suffix array construction algorithm. In Proc. 10th Annual European Symposium on Algorithms, volume 2461 of LNCS, pages 698–710. Springer, 2002.
  23. E. M. McCreight. A space-economic suffix tree construction algorithm. J. ACM, 23(2):262–272, 1976.
  24. J. Seward. On the performance of BWT sorting algorithms. In Proc. Data Compression Conference, pages 173–182. IEEE, 2000.
  25. J. Seward. The bzip2 and libbzip2 official home page, 2002. http://sources.redhat.com/bzip2/.
  26. E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995.
  27. H. Wasserman and M. Blum. Software reliability via run-time result-checking. J. ACM, 44(6):826–849, Nov. 1997.
  28. P. Weiner. Linear pattern matching algorithm. In Proc. 14th Symposium on Switching and Automata Theory, pages 1–11. IEEE, 1973.

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Stefan Burkhardt (1)
  • Juha Kärkkäinen (1)
  1. Max-Planck-Institut für Informatik, Saarbrücken, Germany
