Feature Selection on Chinese Text Classification Using Character N-Grams

  • Zhihua Wei
  • Duoqian Miao
  • Jean-Hugues Chauchat
  • Caiming Zhong
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5009)


In this paper, we perform Chinese text classification using an n-gram text representation on TanCorp, a new large corpus built for Chinese text classification that contains more than 14,000 texts divided into 12 classes. We use different n-gram feature sets (1- and 2-grams, or 1-, 2- and 3-grams) to represent documents. We compare different feature weights (absolute text frequency, relative text frequency, absolute n-gram frequency and relative n-gram frequency) and analyze the sparseness of the “document by feature” matrices in each case. We use the C-SVC classifier, an SVM algorithm designed for multi-class classification, and run our experiments on the TANAGRA platform. We found that the feature selection methods based on n-gram frequency (absolute or relative) always give better results and produce denser matrices.
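As a rough illustration of the representation described above, the following Python sketch extracts overlapping character n-grams, computes the four frequency measures, and reports the sparseness of a small “document by feature” matrix. The function names, the toy sentences and the exact definitions are assumptions made for illustration (text frequency is taken here as the number of texts containing an n-gram, n-gram frequency as its total number of occurrences, and the relative variants divide by the number of texts and by the total n-gram count respectively); the abstract does not give the paper's formulas. Note that when computed over a whole corpus, as here, the absolute and relative variants of a measure rank n-grams identically; they can differ only if they are computed per class.

```python
from collections import Counter


def char_ngrams(text, sizes=(1, 2)):
    """Return all overlapping character n-grams of the given sizes."""
    grams = []
    for n in sizes:
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


def frequency_scores(docs, sizes=(1, 2)):
    """Compute four illustrative frequency measures for every n-gram.

    These definitions are assumptions, not the paper's exact formulas.
    """
    doc_counts = [Counter(char_ngrams(d, sizes)) for d in docs]
    text_freq = Counter()  # number of texts that contain the n-gram
    gram_freq = Counter()  # total occurrences of the n-gram
    for counts in doc_counts:
        text_freq.update(counts.keys())
        gram_freq.update(counts)
    total = sum(gram_freq.values())
    scores = {
        "abs_text": dict(text_freq),
        "rel_text": {g: c / len(docs) for g, c in text_freq.items()},
        "abs_ngram": dict(gram_freq),
        "rel_ngram": {g: c / total for g, c in gram_freq.items()},
    }
    return doc_counts, scores


def top_features(score, k):
    """Keep the k highest-scoring n-grams (ties broken alphabetically)."""
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return [g for g, _ in ranked[:k]]


def sparseness(doc_counts, features):
    """Fraction of zero cells in the document-by-feature matrix."""
    cells = len(doc_counts) * len(features)
    zeros = sum(1 for counts in doc_counts for g in features if counts[g] == 0)
    return zeros / cells


# Toy usage with made-up sentences (not taken from TanCorp).
docs = ["我爱北京", "北京天气好"]
doc_counts, scores = frequency_scores(docs, sizes=(1, 2))
selected = top_features(scores["abs_ngram"], k=5)
matrix_sparseness = sparseness(doc_counts, selected)
```

Working on characters rather than words avoids any word segmentation step, which is the usual motivation for n-grams on Chinese text. The resulting count vectors could then be passed to an SVM implementation of one's choice.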


Keywords: Chinese text classification, N-gram, Feature selection




Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Zhihua Wei (1, 2)
  • Duoqian Miao (1)
  • Jean-Hugues Chauchat (2)
  • Caiming Zhong (1)

  1. Key Laboratory “Embedded System and Service Computing”, Ministry of Education, Tongji University, Shanghai, China
  2. Laboratoire ERIC, Université Lumière Lyon 2, Bron Cedex, France
