On-Line Linear-Time Construction of Word Suffix Trees

  • Shunsuke Inenaga
  • Masayuki Takeda
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4009)

Abstract

Suffix trees are a key data structure for string matching and are used in a wide range of applications, such as bioinformatics and data compression. Sparse suffix trees are a kind of suffix tree that represents only a subset of the suffixes of the input string. In this paper we study word suffix trees, which are one variation of sparse suffix trees. Let D be a dictionary of words and let w be a string in $D^{+}$, namely, w is a sequence $w_1 \cdots w_k$ of k words in D. The word suffix tree of w with respect to D is a path-compressed trie that represents only the k suffixes of the form $w_i \cdots w_k$. A typical application is word- and phrase-level search on natural language documents. Andersson et al. proposed an algorithm that builds word suffix trees in O(n) expected time with O(k) space. In this paper we present a new word suffix tree construction algorithm with O(n) running time and O(k) space in the worst case. Our algorithm is on-line, which means that it processes the characters of the input sequentially, one by one, from left to right.
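To make the definition concrete, the following is a minimal Python sketch of a naive, uncompressed trie over the k word-aligned suffixes. It is not the linear-time on-line algorithm of the paper: it takes quadratic time and space in the worst case and does no path compression. The names `Node`, `word_suffixes`, `build_word_suffix_trie`, and `find_phrase`, as well as the use of `#` as an end-of-word marker, are assumptions made for illustration only.

```python
from typing import Dict, List


class Node:
    """One trie node; `starts` lists the word indices i whose suffix
    w_i ... w_k passes through this node (hypothetical helper)."""

    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.starts: List[int] = []


def word_suffixes(words: List[str], sep: str = "#") -> List[str]:
    """Return the k suffixes w_i ... w_k, each word terminated by `sep`.

    The separator is an assumption used to mark word boundaries.
    """
    return ["".join(word + sep for word in words[i:]) for i in range(len(words))]


def build_word_suffix_trie(words: List[str], sep: str = "#") -> Node:
    """Naively insert every word-aligned suffix, character by character.

    Illustrates which suffixes are represented; it does not achieve the
    O(n) time or O(k) space bounds discussed in the abstract.
    """
    root = Node()
    for i, suffix in enumerate(word_suffixes(words, sep)):
        node = root
        for ch in suffix:
            node = node.children.setdefault(ch, Node())
            node.starts.append(i)
    return root


def find_phrase(root: Node, phrase: List[str], sep: str = "#") -> List[int]:
    """Return the word indices at which `phrase` occurs as whole words."""
    node = root
    for ch in "".join(word + sep for word in phrase):
        child = node.children.get(ch)
        if child is None:
            return []
        node = child
    return node.starts


words = ["to", "be", "or", "not", "to", "be"]
trie = build_word_suffix_trie(words)
print(find_phrase(trie, ["to", "be"]))  # [0, 4]
print(find_phrase(trie, ["o"]))         # [] (no match inside a word)
```

Because only suffixes that begin at a word boundary are inserted, a query such as `["o"]` finds nothing even though the character `o` occurs inside several words, which is the behaviour that distinguishes a word suffix tree from an ordinary suffix tree.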

Keywords

Data Compression; Construction Algorithm; Suffix Tree; Input String; String Processing
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. Aho, A.V., Corasick, M.: Efficient string matching: An aid to bibliographic search. Comm. ACM 18(6), 333–340 (1975)
  2. Andersson, A., Larsson, N.J., Swanson, K.: Suffix trees on words. Algorithmica 23(3), 246–260 (1999)
  3. Apostolico, A.: The myriad virtues of subword trees. Combinatorial Algorithms on Words F12, 85–96 (1985)
  4. Baeza-Yates, R., Gonnet, G.H.: Efficient text searching of regular expressions. In: Ronchi Della Rocca, S., Ausiello, G., Dezani-Ciancaglini, M. (eds.) ICALP 1989. LNCS, vol. 372, pp. 46–62. Springer, Heidelberg (1989)
  5. Bannai, H., Inenaga, S., Shinohara, A., Takeda, M., Miyano, S.: Efficiently finding regulatory elements using correlation with gene expression. Journal of Bioinformatics and Computational Biology 2(2), 273–288 (2004)
  6. Clifford, R., Sergot, M.: Distributed and paged suffix trees for large genetic databases. In: Baeza-Yates, R., Chávez, E., Crochemore, M. (eds.) CPM 2003. LNCS, vol. 2676, pp. 70–82. Springer, Heidelberg (2003)
  7. Dorohonceanu, B., Nevill-Manning, C.G.: Accelerating protein classification using suffix trees. In: Proc. 8th International Conference on Intelligent Systems for Molecular Biology (ISMB 2000), pp. 128–133. AAAI Press, Menlo Park (2000)
  8. Gusfield, D.: Algorithms on Strings, Trees, and Sequences. Cambridge University Press, Cambridge (1997)
  9. Inenaga, S., Bannai, H., Hyyrö, H., Shinohara, A., Takeda, M., Nakai, K., Miyano, S.: Finding optimal pairs of cooperative and competing patterns with bounded distance. In: Suzuki, E., Arikawa, S. (eds.) DS 2004. LNCS (LNAI), vol. 3245, pp. 32–46. Springer, Heidelberg (2004)
  10. Inenaga, S., Funamoto, T., Takeda, M., Shinohara, A.: Linear-time off-line text compression by longest-first substitution. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds.) SPIRE 2003. LNCS, vol. 2857, pp. 137–152. Springer, Heidelberg (2003)
  11. Inenaga, S., Kivioja, T., Mäkinen, V.: Finding missing patterns. In: Jonassen, I., Kim, J. (eds.) WABI 2004. LNCS (LNBI), vol. 3240, pp. 463–474. Springer, Heidelberg (2004)
  12. Kärkkäinen, J., Ukkonen, E.: Sparse suffix trees. In: Cai, J.-Y., Wong, C.K. (eds.) COCOON 1996. LNCS, vol. 1090, pp. 219–230. Springer, Heidelberg (1996)
  13. Larsson, N.J.: Extended application of suffix trees to data compression. In: Proc. Data Compression Conference 1996 (DCC 1996), pp. 190–199. IEEE Computer Society, Los Alamitos (1996)
  14. Marsan, L., Sagot, M.-F.: Extracting structured motifs using a suffix tree - algorithms and application to promoter consensus identification. In: Proc. 4th Annual International Conference on Computational Molecular Biology (RECOMB 2000), pp. 210–219. ACM, New York (2000)
  15. McCreight, E.M.: A space-economical suffix tree construction algorithm. Journal of ACM 23(2), 262–272 (1976)
  16. Na, J.C., Apostolico, A., Iliopoulos, C.S., Park, K.: Truncated suffix trees and their application to data compression. Theoretical Computer Science 304(1–3), 87–101 (2003)
  17. Takeda, M., Miyamoto, S., Kida, T., Shinohara, A., Fukamachi, S., Shinohara, T., Arikawa, S.: Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts. In: Laender, A.H.F., Oliveira, A.L. (eds.) SPIRE 2002. LNCS, vol. 2476, pp. 170–186. Springer, Heidelberg (2002)
  18. Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)
  19. Weiner, P.: Linear pattern-matching algorithms. In: Proc. of 14th IEEE Ann. Symp. on Switching and Automata Theory, pp. 1–11 (1973)

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Shunsuke Inenaga (1, 2)
  • Masayuki Takeda (2, 3)
  1. Japan Society for the Promotion of Science
  2. Department of Informatics, Kyushu University, Fukuoka, Japan
  3. SORST, Japan Science and Technology Agency (JST)
