Combining Bi-gram of Character and Word to Classify Two-Class Chinese Texts in Two Steps
This paper presents a two-step method of combining two types of features for two-class Chinese text categorization. First, the bi-gram of character is regarded as candidate feature, a Naive Bayesian classifier is used to classify texts. Then, the fuzzy area between two categories is fixed directly according to the outputs of the classifier. Second, the bi-gram of word with parts of speech verb or noun is regarded as candidate feature, a Naive Bayesian classifier same as that in the first step is used to deal with the documents falling into the fuzzy area, which are thought of classifying unreliable in the previous step. Our experiment validated the soundness of the proposed method, which achieved a high performance, with the precision, recall and F1 being 97.65%, 97.00% and 97.31% respectively on a test set consisting of 12,600 Chinese texts.
KeywordsChinese Character Negative Sample Feature Number Bayesian Classifier Chinese Word
Unable to display preview. Download preview PDF.
- 3.Salton, G.: Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading (1989)Google Scholar
- 5.Yang, Y., Liu, X.: A Re-examination of Text Categorization Methods. In: Proceedings of SIGIR 1999, pp. 42–49 (1999)Google Scholar
- 6.Fan, X.: Causality Reasoning and Text Categorization, Postdoctoral Research Report of Tsinghua University, P.R. China (April 2004)Google Scholar
- 10.Yang, Y., Ault, T., Pierce, T.: Combining Multiple Learning Strategies for Effective Cross Validation. In: Proceedings of ICML 2000, pp. 1167–1174 (2000)Google Scholar
- 11.Hull, D.A., Pedersen, J.O., Schutze, H.: Method Combination for Document Filtering. In: Proceedings of SIGIR 1996, pp. 279–287 (1996)Google Scholar
- 12.Larkey, L.S., Croft, W.B.: Combining Classifiers in Text Categorization. In: Proceedings of SIGIR 1996, pp. 289–297 (1996)Google Scholar
- 14.Lam, W., Lai, K.Y.: A Meta-learning Approach for Text Categorization. In: Proceedings of SIGIR 2001, pp. 303–309 (2001)Google Scholar
- 15.Bennett, P.N., Dumais, S.T., Horvitz, E.: Probabilistic Combination of Text Classifiers Using Reliability Indicators: Models and Results. In: Proceedings of SIGIR 2002, pp. 11–15 (2002)Google Scholar