N-Gram Analysis Based on Zero-Suppressed BDDs
In this paper, we propose a new method of n-gram analysis using zero-suppressed BDDs (ZBDDs). ZBDDs are known as a compact representation of combinatorial item sets. Here, we apply ZBDD-based techniques to handle sets of sequences efficiently. Using the algebraic operations defined over ZBDDs, such as union, intersection, and difference, we can carry out a variety of processing and analysis tasks on large-scale sequence data. We conducted experiments on generating n-gram statistical data from real document files. The results show the potential of the ZBDD-based method for sequence database analysis.
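As a rough illustration of the idea, the following Python sketch stores the distinct n-grams of a text as a family of item sets in a toy ZBDD and combines two such families with union, intersection, and difference. It rests on assumptions that are not stated in the abstract: each n-gram is encoded as a set of (position, character) items, and all names (`node`, `setop`, `ngram_family`, `members`, `count`) are hypothetical helpers written for this example. It does not track n-gram frequencies and is not meant to reflect the implementation used in the experiments.

```python
from functools import lru_cache

# Terminal nodes: EMPTY is the empty family {}, BASE is the family {{}}.
EMPTY, BASE = 0, 1
_nodes = [None, None]   # node id -> (var, lo, hi)
_unique = {}            # (var, lo, hi) -> node id, so equal subgraphs are shared

def node(var, lo, hi):
    """Get or create a node; the zero-suppression rule drops nodes with hi == EMPTY."""
    if hi == EMPTY:
        return lo
    key = (var, lo, hi)
    if key not in _unique:
        _unique[key] = len(_nodes)
        _nodes.append(key)
    return _unique[key]

def _split(f, v):
    """Cofactors (without v, with v) of f, where v is not below f's top variable."""
    if f > BASE and _nodes[f][0] == v:
        return _nodes[f][1], _nodes[f][2]
    return f, EMPTY

@lru_cache(maxsize=None)
def setop(op, f, g):
    """Apply 'union', 'intersect' or 'diff' to two families of item sets."""
    if op == "union":
        if f == EMPTY:
            return g
        if g == EMPTY or f == g:
            return f
    elif op == "intersect":
        if f == EMPTY or g == EMPTY:
            return EMPTY
        if f == g:
            return f
    else:  # "diff": sets in f that are not in g
        if f == EMPTY or f == g:
            return EMPTY
        if g == EMPTY:
            return f
    # At least one operand is a non-terminal node here.
    v = min(_nodes[x][0] for x in (f, g) if x > BASE)
    f0, f1 = _split(f, v)
    g0, g1 = _split(g, v)
    return node(v, setop(op, f0, g0), setop(op, f1, g1))

def single(seq):
    """Family containing one sequence, encoded as {(position, character)} (assumed encoding)."""
    f = BASE
    for var in sorted(enumerate(seq), reverse=True):
        f = node(var, EMPTY, f)
    return f

def ngram_family(text, n):
    """Union of all length-n substrings of text (distinct n-grams only)."""
    f = EMPTY
    for i in range(len(text) - n + 1):
        f = setop("union", f, single(text[i:i + n]))
    return f

@lru_cache(maxsize=None)
def count(f):
    """Number of sequences stored in the family f."""
    if f <= BASE:
        return f
    _, lo, hi = _nodes[f]
    return count(lo) + count(hi)

def members(f, prefix=()):
    """Enumerate the stored sequences by walking every path to BASE."""
    if f == EMPTY:
        return
    if f == BASE:
        yield "".join(ch for _, ch in prefix)
        return
    var, lo, hi = _nodes[f]
    yield from members(lo, prefix)
    yield from members(hi, prefix + (var,))

# Bigrams of two short texts, then the bigrams they share.
a = ngram_family("abcabd", 2)
b = ngram_family("xabcx", 2)
common = setop("intersect", a, b)
print(sorted(members(common)))  # ['ab', 'bc']
```

Because identical subgraphs are shared through the unique table and results of `setop` are memoised, the cost of each operation depends on the number of ZBDD nodes rather than on the number of stored sequences, which is the property that makes this representation attractive for large n-gram sets.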