A Comparative Study on Representing Units in Chinese Text Clustering

  • Wang Hongjun
  • Yu Shiwen
  • Lv Xueqiang
  • Shi Shuicai
  • Xiao Shibin
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4092)


Words and n-grams are commonly used units for representing Chinese text and have proved to be good features for Chinese text categorization and information retrieval. However, their effectiveness for Chinese text clustering has not yet been established. This paper presents a comparative study of representing units in Chinese text clustering. Using the K-means algorithm, we evaluated several representing units, including Chinese character n-gram features, word features, and their combinations. In our experiments, Chinese word features, Chinese character unigram features, and bi-gram features were the most effective. Combining features did not improve the results. Detailed experimental results on several public Chinese text categorization datasets are provided in the paper.
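To make the idea of character n-gram representing units concrete, the following is a minimal Python sketch, not the authors' implementation. It extracts character unigrams and bi-grams from short Chinese strings and clusters them with a small K-means loop. The abstract does not specify the term weighting, similarity measure, or centroid initialization, so raw counts with L2 normalization, cosine similarity, and "first k documents as initial centroids" are all assumptions made for illustration; the function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return the overlapping character n-grams of a string, skipping whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def normalize(vec):
    """Scale a sparse vector (dict) to unit L2 length."""
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {g: v / norm for g, v in vec.items()}


def vectorize(text, sizes=(1, 2)):
    """Bag of character n-grams (unigrams and bi-grams by default), unit length.

    Raw counts are an assumption; the paper may use a different weighting.
    """
    counts = Counter(g for n in sizes for g in char_ngrams(text, n))
    return normalize(counts)


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(v * b.get(g, 0.0) for g, v in a.items())


def kmeans(vectors, k, iters=10):
    """Tiny K-means on sparse unit vectors using cosine similarity.

    Initial centroids are the first k vectors (an illustrative assumption).
    Returns one cluster label per input vector.
    """
    centroids = [dict(v) for v in vectors[:k]]
    labels = None
    for _ in range(iters):
        # Assignment step: each document goes to its most similar centroid.
        new_labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the normalized sum of its members.
        for j in range(k):
            total = {}
            for v, label in zip(vectors, labels):
                if label == j:
                    for g, w in v.items():
                        total[g] = total.get(g, 0.0) + w
            if total:  # keep the old centroid if a cluster is empty
                centroids[j] = normalize(total)
    return labels


# Hypothetical toy documents: two about text processing, two about football.
docs = [
    "北京大学研究中文信息处理",
    "足球比赛今晚开始",
    "北京大学中文文本聚类研究",
    "今晚足球比赛直播",
]
labels = kmeans([vectorize(d) for d in docs], k=2)
print(labels)  # [0, 1, 0, 1]
```

Passing `sizes=(1,)` or `sizes=(2,)` to `vectorize` restricts the representation to unigrams or bi-grams alone, which is the kind of comparison the study describes. Word features would additionally require a Chinese word segmenter, which this sketch does not include.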


Chinese text Clustering; N-gram feature; Bi-gram feature; Word feature





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Wang Hongjun (1, 2)
  • Yu Shiwen (1)
  • Lv Xueqiang (2)
  • Shi Shuicai (2)
  • Xiao Shibin (2)
  1. Institute Of Computing Linguistics, Peking University, Beijing
  2. Chinese Information Processing Center, Beijing Information Technology Institute, Beijing
