Efficient Computation of Substring Equivalence Classes with Suffix Arrays

  • Kazuyuki Narisawa
  • Shunsuke Inenaga
  • Hideo Bannai
  • Masayuki Takeda
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4580)

Abstract

This paper considers the enumeration of the substring equivalence classes introduced by Blumer et al. [1], who used them to define an index structure called the compact directed acyclic word graph (CDAWG). In text analysis, these equivalence classes are useful because they group together redundant substrings that have essentially identical occurrences. We show how to enumerate the equivalence classes using suffix arrays. Our algorithm uses the rank and lcp arrays to traverse the corresponding suffix tree and needs no other auxiliary data structure. It runs in time linear in the length of the input string. We report experimental results comparing the running time and space consumption of our algorithm with those of suffix tree and CDAWG based approaches.
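
To make the ingredients named in the abstract concrete, the following Python sketch builds a suffix array, its rank array, and the lcp array (using the method of Kasai et al. [8]), then walks the lcp array with a stack to visit the internal nodes of the suffix tree without building it. It is an illustration only, not the authors' algorithm: the function names are hypothetical, the suffix array is built by naive sorting rather than in linear time, and the final left-maximality filter (which reports maximal repeats as one possible representative per repeated-substring class) is an assumption of this sketch rather than something stated in the abstract.

```python
def suffix_array(s):
    # Naive construction by sorting suffixes; for demonstration only,
    # not one of the linear-time constructions cited in [12-14].
    return sorted(range(len(s)), key=lambda i: s[i:])


def rank_array(sa):
    # rank[i] is the position of suffix i in the suffix array (inverse of sa).
    rank = [0] * len(sa)
    for pos, start in enumerate(sa):
        rank[start] = pos
    return rank


def lcp_array(s, sa, rank):
    # Kasai et al. [8]: lcp[k] is the length of the longest common prefix
    # of the suffixes sa[k-1] and sa[k]; lcp[0] is defined as 0.
    n = len(s)
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def lcp_intervals(lcp):
    # Stack-based scan of the lcp array. Each reported triple
    # (depth, left, right) means the suffixes sa[left..right] share a
    # prefix of length depth > 0, i.e. one internal suffix tree node.
    n = len(lcp)
    result = []
    stack = [(0, 0)]  # (depth, left boundary); the root stays on the stack
    for i in range(1, n + 1):
        cur = lcp[i] if i < n else 0  # trailing 0 flushes open intervals
        left = i - 1
        while stack[-1][0] > cur:
            depth, left = stack.pop()
            result.append((depth, left, i - 1))
        if stack[-1][0] < cur:
            stack.append((cur, left))
    return result


def maximal_repeats(s):
    # Assumption of this sketch: keep an interval only if its occurrences
    # are preceded by at least two distinct characters (or the text start).
    # This check is quadratic in the worst case, unlike the paper's method.
    sa = suffix_array(s)
    lcp = lcp_array(s, sa, rank_array(sa))
    found = []
    for depth, left, right in lcp_intervals(lcp):
        preceding = {s[sa[k] - 1] if sa[k] > 0 else None
                     for k in range(left, right + 1)}
        if len(preceding) > 1:
            found.append(s[sa[left]:sa[left] + depth])
    return found


if __name__ == "__main__":
    print(sorted(maximal_repeats("banana")))
```

For the input "banana", the scan visits the repeated substrings "ana", "a" and "na"; the filter drops "na" because every occurrence is preceded by "a", so it always extends to "ana".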

Keywords

  • Equivalence Class
  • Equivalence Relation
  • Linear Time
  • Suffix Tree
  • Input String
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. Blumer, A., Blumer, J., Haussler, D., McConnell, R., Ehrenfeucht, A.: Complete inverted files for efficient text retrieval and analysis. J. ACM 34(3), 578–595 (1987)
  2. Narisawa, K., Bannai, H., Hatano, K., Takeda, M.: Unsupervised spam detection based on string alienness measures. Technical report, Department of Informatics, Kyushu University (2007)
  3. Takeda, M., Matsumoto, T., Fukuda, T., Nanri, I.: Discovering characteristic expressions in literary works. Theoretical Computer Science 292(2), 525–546 (2003)
  4. Inenaga, S., Hoshino, H., Shinohara, A., Takeda, M., Arikawa, S., Mauri, G., Pavesi, G.: On-line construction of compact directed acyclic word graphs. Discrete Applied Mathematics 146(2), 156–179 (2005)
  5. Weiner, P.: Linear pattern matching algorithms. In: Proc. 14th IEEE Annual Symp. on Switching and Automata Theory, pp. 1–11 (1973)
  6. Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)
  7. Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Computing 22(5), 935–948 (1993)
  8. Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001)
  9. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2(1), 53–86 (2004)
  10. Blumer, A., Blumer, J., Haussler, D., Ehrenfeucht, A., Chen, M.T., Seiferas, J.: The smallest automaton recognizing the subwords of a text. Theoretical Computer Science 40, 31–55 (1985)
  11. McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)
  12. Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction. In: Baeten, J.C.M., Lenstra, J.K., Parrow, J., Woeginger, G.J. (eds.) ICALP 2003. LNCS, vol. 2719, pp. 943–955. Springer, Heidelberg (2003)
  13. Kim, D.K., Sim, J.S., Park, H., Park, K.: Linear-time construction of suffix arrays. In: Baeza-Yates, R.A., Chávez, E., Crochemore, M. (eds.) CPM 2003. LNCS, vol. 2676, pp. 186–199. Springer, Heidelberg (2003)
  14. Ko, P., Aluru, S.: Space efficient linear time construction of suffix arrays. In: Baeza-Yates, R.A., Chávez, E., Crochemore, M. (eds.) CPM 2003. LNCS, vol. 2676, pp. 200–210. Springer, Heidelberg (2003)
  15. Arnold, R., Bell, T.: A corpus for the evaluation of lossless compression algorithms. In: Proc. DCC ’97, pp. 201–210 (1997), http://corpus.canterbury.ac.nz/
  16. Nevill-Manning, C., Witten, I.: Protein is incompressible. In: Proc. DCC 1999, pp. 257–266 (1999), http://www.data-compression.info/Corpora/ProteinCorpus/index.htm
  17. Larsson, N.J., Sadakane, K.: Faster suffix sorting. Technical Report LU-CS-TR:99-214, LUNDFD6/(NFCS-3140)/1–20/(1999) Department of Computer Science, Lund University, Sweden (1999), http://www.larsson.dogma.net/qsufsort.c

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Kazuyuki Narisawa (1)
  • Shunsuke Inenaga (2)
  • Hideo Bannai (1)
  • Masayuki Takeda (1, 3)

  1. Department of Informatics, Kyushu University, Fukuoka 819-0395, Japan
  2. Department of Computer Science and Communication Engineering, Kyushu University, Fukuoka 819-0395, Japan
  3. SORST, Japan Science and Technology Agency (JST)