Combining Bi-gram of Character and Word to Classify Two-Class Chinese Texts in Two Steps

  • Xinghua Fan
  • Difei Wan
  • Guoying Wang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4259)


This paper presents a two-step method that combines two types of features for two-class Chinese text categorization. In the first step, character bi-grams are taken as candidate features and a Naive Bayesian classifier is used to classify the texts; the fuzzy area between the two categories is then fixed directly from the classifier's outputs. In the second step, bi-grams of words whose part of speech is verb or noun are taken as candidate features, and a Naive Bayesian classifier of the same kind as in the first step handles the documents that fall into the fuzzy area, that is, those whose classification in the first step is considered unreliable. Experiments support the soundness of the proposed method, which achieved precision, recall and F1 of 97.65%, 97.00% and 97.31%, respectively, on a test set of 12,600 Chinese texts.
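
The two-step flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the fuzzy area is computed, so a symmetric threshold on the classifier's log-odds (`fuzzy_width`) stands in for it; the smoothing scheme, the POS tag names `"n"` and `"v"`, and every function and class name are hypothetical. Word segmentation and POS tagging are assumed to happen upstream and to supply `(word, pos)` pairs.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Adjacent character pairs, e.g. 'abc' -> ['ab', 'bc']."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def word_bigrams(tagged_tokens, keep_pos=("n", "v")):
    """Pairs of consecutive words among those tagged noun or verb.

    tagged_tokens: list of (word, pos) pairs from an assumed upstream tagger.
    The tag names 'n' and 'v' are assumptions, not taken from the paper.
    """
    words = [word for word, pos in tagged_tokens if pos in keep_pos]
    return list(zip(words, words[1:]))


class TwoClassNB:
    """Multinomial Naive Bayes with add-one smoothing for labels 0 and 1."""

    def fit(self, feature_lists, labels):
        # Both labels must occur in the training data (the prior takes logs).
        self.counts = {0: Counter(), 1: Counter()}
        self.docs = Counter(labels)
        for feats, label in zip(feature_lists, labels):
            self.counts[label].update(feats)
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        self.totals = {c: sum(self.counts[c].values()) for c in (0, 1)}
        return self

    def log_odds(self, feats):
        """Log-odds of class 1 over class 0; positive favours class 1."""
        score = math.log(self.docs[1]) - math.log(self.docs[0])
        v = len(self.vocab)
        for f in feats:
            if f not in self.vocab:
                continue  # features unseen in training carry no evidence
            p1 = (self.counts[1][f] + 1) / (self.totals[1] + v)
            p0 = (self.counts[0][f] + 1) / (self.totals[0] + v)
            score += math.log(p1) - math.log(p0)
        return score


def classify_two_step(text, tagged_tokens, step1, step2, fuzzy_width):
    """Step 1 on character bi-grams; step 2 only inside the fuzzy area."""
    s1 = step1.log_odds(char_bigrams(text))
    if abs(s1) > fuzzy_width:  # assumed definition of "outside the fuzzy area"
        return int(s1 > 0)
    s2 = step2.log_odds(word_bigrams(tagged_tokens))
    return int(s2 > 0)  # a tie at exactly 0 falls to class 0
```

In this sketch `step1` would be fitted on character bi-grams of the training texts and `step2` on the noun/verb word bi-grams, and `fuzzy_width` would be tuned on held-out data. A wider fuzzy area sends more documents to the second classifier.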


Keywords: Chinese Character, Negative Sample, Feature Number, Bayesian Classifier, Chinese Word





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Xinghua Fan (1)
  • Difei Wan (1)
  • Guoying Wang (1)
  1. College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, P.R. China
