Classifying Chinese Texts in Two Steps

  • Xinghua Fan
  • Maosong Sun
  • Key-sun Choi
  • Qin Zhang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3651)


This paper proposes a two-step method for Chinese text categorization (TC). In the first step, a Naïve Bayesian classifier is used to identify the fuzzy area between two categories. In the second step, a classifier with more subtle and powerful features handles the documents that fall in the fuzzy area, whose first-step classifications are considered unreliable. A preliminary experiment validated the soundness of this method. The method is then extended from two-class TC to multi-class TC. Within this two-step framework, we further try to improve the classifier by taking the dependences among features into consideration in the second step, resulting in a Causality Naïve Bayesian Classifier.
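The two-step idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the fuzzy area is bounded or which features the second step uses, so the threshold `fuzzy_width`, the use of word bigrams as the "more powerful" features, and all function names are assumptions made for illustration. Documents are assumed to be already segmented into word tokens, and English tokens stand in for Chinese words.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Collect per-class token counts for a two-class (0/1) Naive Bayes model."""
    counts = {0: Counter(), 1: Counter()}
    n_docs = Counter(labels)
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)
    vocab = set(counts[0]) | set(counts[1])
    return counts, n_docs, vocab


def log_odds(model, tokens):
    """Log posterior odds of class 1 over class 0, with add-one smoothing."""
    counts, n_docs, vocab = model
    total = {c: sum(counts[c].values()) for c in (0, 1)}
    score = math.log(n_docs[1] / n_docs[0])
    for tok in tokens:
        if tok not in vocab:
            continue  # tokens never seen in training carry no evidence
        p1 = (counts[1][tok] + 1) / (total[1] + len(vocab))
        p0 = (counts[0][tok] + 1) / (total[0] + len(vocab))
        score += math.log(p1 / p0)
    return score


def bigrams(tokens):
    """Assumed second-step features: adjacent word pairs (hypothetical choice)."""
    return [a + "_" + b for a, b in zip(tokens, tokens[1:])]


def two_step_classify(tokens, first, second, fuzzy_width=1.0):
    """Step 1 decides confident cases; step 2 handles the fuzzy area."""
    s1 = log_odds(first, tokens)
    if abs(s1) >= fuzzy_width:  # far enough from the decision boundary
        return (1 if s1 > 0 else 0), "step1"
    s2 = log_odds(second, bigrams(tokens))  # re-score with richer features
    return (1 if s2 > 0 else 0), "step2"


# Toy training data (invented for illustration only).
TRAIN_DOCS = [
    ["stock", "market", "rise"],
    ["stock", "price", "fall"],
    ["team", "win", "match"],
    ["team", "lose", "match"],
]
TRAIN_LABELS = [1, 1, 0, 0]

first = train_nb(TRAIN_DOCS, TRAIN_LABELS)
second = train_nb([bigrams(d) for d in TRAIN_DOCS], TRAIN_LABELS)
```

Calling `two_step_classify(tokens, first, second)` returns the predicted class together with the step that made the decision. In this sketch a document whose first-step log-odds lie within `fuzzy_width` of zero is treated as falling in the fuzzy area and is passed to the second model.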


Text Categorization, Bayesian Classifier, Chinese Word, Chinese Text, High Computational Efficiency
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Xinghua Fan (1, 2, 3)
  • Maosong Sun (1)
  • Key-sun Choi (3)
  • Qin Zhang (2)
  1. State Key Laboratory of Intelligent Technology and Systems, Tsinghua University, Beijing, China
  2. State Intellectual Property Office of P.R. China, Beijing, China
  3. Computer Science Division, Korterm, KAIST, Daejeon, Korea
