N-Gram Analysis Based on Zero-Suppressed BDDs

  • Ryutaro Kurai
  • Shin-ichi Minato
  • Thomas Zeugmann
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4384)

Abstract

In this paper, we propose a new method of n-gram analysis using zero-suppressed BDDs (ZBDDs). ZBDDs are known as a compact representation of combinatorial item sets. Here, we apply ZBDD-based techniques to the efficient handling of sets of sequences. Using the algebraic operations defined over ZBDDs, such as union, intersection, and difference, we can carry out a variety of processing and analysis tasks on large-scale sequence data. We conducted experiments in which n-gram statistics were generated from real document files. The results show the potential of the ZBDD-based method for sequence database analysis.
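
To make the idea of set operations over sequences concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that each n-gram is encoded as a set of (position, character) items, which is one plausible encoding and is not confirmed by the abstract. The names `node`, `single`, `ngrams`, and `strings` are hypothetical, and the sketch tracks only the set of distinct n-grams, not their frequencies.

```python
from functools import lru_cache
from itertools import count

# Terminals: EMPTY is the empty family {}, BASE is the family holding only the empty set.
EMPTY, BASE = 0, 1
_unique = {}      # (var, lo, hi) -> node id, so equal subgraphs are shared
_nodes = {}       # node id -> (var, lo, hi)
_ids = count(2)

def node(var, lo, hi):
    """Return the shared node for (var, lo, hi), applying the zero-suppression rule."""
    if hi == EMPTY:                      # no set contains var: the node is dropped
        return lo
    key = (var, lo, hi)
    if key not in _unique:
        _unique[key] = next(_ids)
        _nodes[_unique[key]] = key
    return _unique[key]

def _split(f, v):
    """Split f into (sets without v, sets with v after removing v)."""
    if f > 1 and _nodes[f][0] == v:
        return _nodes[f][1], _nodes[f][2]
    return f, EMPTY

def _recurse(op, f, g):
    # Branch on the smallest top variable among the non-terminal operands.
    v = min(_nodes[h][0] for h in (f, g) if h > 1)
    (f0, f1), (g0, g1) = _split(f, v), _split(g, v)
    return node(v, op(f0, g0), op(f1, g1))

@lru_cache(maxsize=None)
def union(f, g):
    if f == EMPTY or f == g:
        return g
    if g == EMPTY:
        return f
    return _recurse(union, f, g)

@lru_cache(maxsize=None)
def intersect(f, g):
    if f == EMPTY or g == EMPTY:
        return EMPTY
    if f == g:
        return f
    return _recurse(intersect, f, g)

@lru_cache(maxsize=None)
def diff(f, g):
    if f == EMPTY or f == g:
        return EMPTY
    if g == EMPTY:
        return f
    return _recurse(diff, f, g)

def single(s):
    """ZBDD for the family containing one sequence, encoded as (position, char) items."""
    f = BASE
    for item in sorted(enumerate(s), reverse=True):   # build bottom-up, smallest item on top
        f = node(item, EMPTY, f)
    return f

def ngrams(text, n):
    """ZBDD for the set of distinct n-grams occurring in text."""
    f = EMPTY
    for i in range(len(text) - n + 1):
        f = union(f, single(text[i:i + n]))
    return f

def _members(f):
    if f == EMPTY:
        return
    if f == BASE:
        yield frozenset()
        return
    v, lo, hi = _nodes[f]
    yield from _members(lo)
    for s in _members(hi):
        yield s | {v}

def strings(f):
    """Decode a ZBDD back into a sorted list of sequences."""
    return sorted("".join(ch for _, ch in sorted(s)) for s in _members(f))

a = ngrams("abcab", 2)
b = ngrams("cabd", 2)
print(strings(intersect(a, b)))   # bigrams common to both texts
print(strings(diff(a, b)))        # bigrams occurring only in the first text
```

Because nodes are shared through a unique table, two texts with the same set of n-grams map to the same node id, so equality of n-gram sets reduces to an integer comparison in this sketch.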

Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Ryutaro Kurai (1)
  • Shin-ichi Minato (1)
  • Thomas Zeugmann (1)
  1. Division of Computer Science, Hokkaido University, N-14, W-9, Sapporo 060-0814, Japan
