N-Gram Analysis Based on Zero-Suppressed BDDs

  • Ryutaro Kurai
  • Shin-ichi Minato
  • Thomas Zeugmann
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4384)

Abstract

In this paper, we propose a new method of n-gram analysis using zero-suppressed BDDs (ZBDDs). ZBDDs are known as a compact representation of combinatorial item sets. Here, we apply ZBDD-based techniques to the efficient handling of sets of sequences. Using the algebraic operations defined over ZBDDs, such as union, intersection, and difference, we can carry out a variety of processing and analysis tasks on large-scale sequence data. We conducted experiments generating n-gram statistics for real document files. The results show the potential of the ZBDD-based method for sequence database analysis.
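
To make the idea of treating n-grams as sets concrete, the following is a minimal Python sketch. It is not the authors' implementation and it does not build a ZBDD: ordinary Python sets and a Counter stand in for the ZBDD, so only the set-algebra view carries over and none of the compactness does. The encoding of a sequence as a set of (position, symbol) items is an assumption made for illustration, and the names ngrams, as_item_set, decode, and sequence_family are hypothetical.

```python
from collections import Counter


def ngrams(text, n):
    """Count every length-n substring of text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def as_item_set(seq):
    """Encode a sequence as a set of (position, symbol) items.

    Assumed encoding for illustration: it turns an ordered sequence
    into an unordered combination of items, the kind of object a
    ZBDD is designed to store.
    """
    return frozenset(enumerate(seq))


def decode(items):
    """Recover the sequence from its (position, symbol) items."""
    return "".join(symbol for _, symbol in sorted(items))


def sequence_family(text, n):
    """Set of distinct n-grams of text, each encoded as an item set."""
    return {as_item_set(gram) for gram in ngrams(text, n)}


# Plain Python sets stand in for ZBDDs here (assumption).
a = sequence_family("abcab", 2)
b = sequence_family("cabd", 2)

common = {decode(s) for s in a & b}   # n-grams occurring in both texts
only_a = {decode(s) for s in a - b}   # n-grams occurring only in the first
either = {decode(s) for s in a | b}   # n-grams occurring in at least one
```

In this sketch each family grows linearly with the number of distinct n-grams. A ZBDD would instead share common substructure between the stored item sets, which is the compactness the abstract refers to.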

Keywords

Ruby 

Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Ryutaro Kurai (1)
  • Shin-ichi Minato (1)
  • Thomas Zeugmann (1)
  1. Division of Computer Science, Hokkaido University, N-14, W-9, Sapporo 060-0814, Japan
