Fast and Practical Algorithms for Computing All the Runs in a String

  • Gang Chen
  • Simon J. Puglisi
  • W. F. Smyth
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4580)

Abstract

A repetition in a string \(\mathbf{x}\) is a substring \(\mathbf{w} = \mathbf{u}^e\) of \(\mathbf{x}\), with maximum \(e \ge 2\), where \(\mathbf{u}\) is not itself a repetition in \(\mathbf{w}\). A run in \(\mathbf{x}\) is a substring \(\mathbf{w} = \mathbf{u}^e\mathbf{u}^{*}\) of “maximal periodicity”, where \(\mathbf{u}^e\) is a repetition and \(\mathbf{u}^{*}\) is a maximum-length, possibly empty, proper prefix of \(\mathbf{u}\). A run may encode as many as \(|\mathbf{u}|\) repetitions. The maximum number of repetitions in any string \(\mathbf{x} = \mathbf{x}[1..n]\) is well known to be Θ(n log n). In 2000 Kolpakov & Kucherov showed that the maximum number of runs in \(\mathbf{x}\) is O(n); they also described a Θ(n)-time algorithm, based on Farach’s Θ(n)-time suffix tree construction algorithm (STCA), Θ(n)-time Lempel-Ziv factorization, and Main’s Θ(n)-time leftmost runs algorithm, to compute all the runs in \(\mathbf{x}\). Recently Abouelhoda et al. proposed a Θ(n)-time Lempel-Ziv factorization algorithm based on an “enhanced” suffix array (a suffix array together with other supporting data structures). In this paper we introduce a collection of fast space-efficient algorithms for computing all the runs in a string that appear in many circumstances to be superior to those previously proposed.
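
To make the definition of a run concrete, the following is a minimal brute-force sketch in Python. It is not one of the algorithms introduced in the paper and it runs in quadratic time; it only illustrates what "all the runs" means. The function name `naive_runs` and the output convention (0-based start, exclusive end, period) are assumptions made for this illustration, whereas the abstract indexes strings from 1.

```python
def naive_runs(x):
    """Return every run of x as (start, end, period).

    Indices are 0-based and `end` is exclusive, so the run is x[start:end].
    Brute force, O(n^2) time: for illustration only.
    """
    n = len(x)
    seen = set()   # (start, end) pairs already reported with a smaller period
    runs = []
    for p in range(1, n // 2 + 1):          # candidate period |u|
        k = 0
        while k + p < n:
            if x[k] != x[k + p]:
                k += 1
                continue
            start = k
            # Extend while the period-p condition x[k] == x[k+p] holds.
            while k + p < n and x[k] == x[k + p]:
                k += 1
            end = k + p                      # x[start:end] has period p
            # A run needs at least two full copies of u (e >= 2). Periods
            # are tried in increasing order, so an interval seen before
            # already has a smaller period and p is not minimal for it.
            if end - start >= 2 * p and (start, end) not in seen:
                seen.add((start, end))
                runs.append((start, end, p))
    return sorted(runs)


if __name__ == "__main__":
    for start, end, period in naive_runs("mississippi"):
        print(start, end, period, "mississippi"[start:end])
```

On `"mississippi"` this reports the run `ississi` with period 3 (here \(\mathbf{u}\) = `iss`, \(e = 2\), \(\mathbf{u}^{*}\) = `i`) together with the three period-1 runs `ss`, `ss` and `pp`.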

Keywords

Practical Algorithm; Suffix Array; Array Construction; Maximal Periodicity; Large Alphabet
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algs. 2, 53–86 (2004)
  2. Apostolico, A., Preparata, F.P.: Optimal off-line detection of repetitions in a string. Theoret. Comput. Sci. 22, 297–315 (1983)
  3. Crochemore, M.: An optimal algorithm for computing the repetitions in a word. Inform. Process. Lett. 12(5), 244–250 (1981)
  4. Fan, K., Puglisi, S.J., Smyth, W.F., Turpin, A.: A new periodicity lemma. SIAM J. Discrete Math. 20(3), 656–668 (2006)
  5. Farach, M.: Optimal suffix tree construction with large alphabets. In: Proc. 38th FOCS, pp. 137–143 (1997)
  6. Franek, F., Holub, J., Smyth, W.F., Xiao, X.: Computing quasi suffix arrays. J. Automata, Languages & Combinatorics 8(4), 593–606 (2003)
  7. Franek, F., Simpson, R.J., Smyth, W.F.: The maximum number of runs in a string. In: Miller, M., Park, K. (eds.) Proc. 14th AWOCA, pp. 26–35 (2003)
  8. Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing & string matching. SIAM J. Computing 35(2), 378–407 (2005)
  9. Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction. In: Proc. 30th ICALP, pp. 943–955 (2003)
  10. Karlin, S., Ghandour, G., Ost, F., Tavare, S., Korn, L.J.: New approaches for computer analysis of nucleic acid sequences. Proc. Natl. Acad. Sci. USA 80, 5660–5664 (1983)
  11. Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, Springer, Heidelberg (2001)
  12. Ko, P., Aluru, S.: Space efficient linear time construction of suffix arrays. In: Baeza-Yates, R.A., Chávez, E., Crochemore, M. (eds.) CPM 2003. LNCS, vol. 2676, Springer, Heidelberg (2003)
  13. Kolpakov, R., Kucherov, G.: http://bioinfo.lifl.fr/mreps/
  14. Kolpakov, R., Kucherov, G.: On maximal repetitions in words. J. Discrete Algs. 1, 159–186 (2000)
  15. Kurtz, S.: Reducing the space requirement of suffix trees. Software Practice & Experience 29(13), 1149–1171 (1999)
  16. Lempel, A., Ziv, J.: On the complexity of finite sequences. IEEE Trans. Information Theory 22, 75–81 (1976)
  17. Lentin, A., Schützenberger, M.P.: A combinatorial problem in the theory of free monoids, Combinatorial Mathematics & Its Applications. In: Bose, R.C., Dowling, T.A. (eds.) University of North Carolina Press, pp. 128–144 (1969)
  18. Main, M.G.: Detecting leftmost maximal periodicities. Discrete Applied Maths 25, 145–153 (1989)
  19. Main, M.G., Lorentz, R.J.: An O(n log n) Algorithm for Recognizing Repetition, Tech. Rep. CS-79–056, Computer Science Department, Washington State University (1979)
  20. Main, M.G., Lorentz, R.J.: An O(n log n) algorithm for finding all repetitions in a string. J. Algs. 5, 422–432 (1984)
  21. Mäkinen, V., Navarro, G.: Compressed full-text indices. ACM Computing Surveys (to appear)
  22. Maniscalco, M., Puglisi, S.J.: Faster lightweight suffix array construction. In: Ryan, J., Dafik (eds.) Proc. 17th AWOCA, pp. 16–29 (2006)
  23. Manzini, G.: Two space-saving tricks for linear time LCP computation. In: Hagerup, T., Katajainen, J. (eds.) SWAT 2004. LNCS, vol. 3111, Springer, Heidelberg (2004)
  24. Manzini, G., Ferragina, P.: Engineering a lightweight suffix array construction algorithm. Algorithmica 40, 33–50 (2004)
  25. McCreight, E.M.: A space-economical suffix tree construction algorithm. J. Assoc. Comput. Mach. 32(2), 262–272 (1976)
  26. Puglisi, S.J., Smyth, W.F., Turpin, A.: A taxonomy of suffix array construction algorithms. ACM Computing Surveys (to appear)
  27. Rytter, W.: The number of runs in a string: improved analysis of the linear upper bound. In: Durand, B., Thomas, W. (eds.) Proc. 23rd STACS. LNCS, vol. 2884, pp. 184–195. Springer, Heidelberg (2006)
  28. Sadakane, K.: Space-efficient data structures for flexible text retrieval systems. In: Bose, P., Morin, P. (eds.) ISAAC 2002. LNCS, vol. 2518, Springer, Heidelberg (2002)
  29. Smyth, B.: Computing Patterns in Strings, Pearson Addison-Wesley, p. 423 (2003)
  30. Thue, A.: Über unendliche zeichenreihen. Norske Vid. Selsk. Skr. I. Mat. Nat. Kl. Christiana 7, 1–22 (1906)
  31. Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14, 249–260 (1995)
  32. Weiner, P.: Linear pattern matching algorithms. In: Proc. 14th Annual IEEE Symp. Switching & Automata Theory, pp. 1–11 (1973)
  33. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Information Theory 23, 337–343 (1977)

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Gang Chen (1)
  • Simon J. Puglisi (2)
  • W. F. Smyth (1, 2)
  1. Algorithms Research Group, Department of Computing & Software, McMaster University, Hamilton, Ontario, L8S 4K1, Canada
  2. Department of Computing, Curtin University, GPO Box U1987, Perth WA 6845, Australia
