Raising High-Degree Overlapped Character Bigrams into Trigrams for Dimensionality Reduction in Chinese Text Categorization
High dimensionality of the feature space is a major obstacle for automated text categorization. Drawing on the characteristics of Chinese character N-grams, this paper shows that a kind of redundancy arises from feature overlapping. Focusing on Chinese character bigrams, the paper introduces the concept of δ-overlapping between two bigrams and proposes a new dimensionality reduction method, δ-Overlapped Raising (δ-OR), which raises δ-overlapped bigrams into their corresponding trigrams. The paper also designs a two-stage dimensionality reduction strategy for Chinese bigrams that combines a filtering method based on the Chi-CIG score function with the δ-OR method. Experimental results on a large-scale Chinese document collection indicate that, following the first stage of reduction, δ-OR at the second stage significantly reduces the dimension of the feature space without sacrificing categorization effectiveness. We believe this methodology to be language-independent.
Keywords: Dimensionality Reduction, Chinese Character, Text Categorization, Latent Semantic Indexing, Document Vector
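The abstract describes raising overlapped bigrams into trigrams but does not give the formal definition of δ-overlapping. The following minimal Python sketch illustrates the general idea under a loudly stated assumption: two bigrams c1c2 and c2c3 are treated as δ-overlapped when the trigram c1c2c3 accounts for at least a fraction δ of the occurrences of each bigram. This criterion, the function names `char_ngrams` and `raise_overlapped_bigrams`, and the default `delta=0.9` are all hypothetical and are not taken from the paper. The Chi-CIG filtering stage is not modeled.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def raise_overlapped_bigrams(docs, delta=0.9):
    """Replace pairs of heavily overlapping bigrams by their trigram.

    ASSUMPTION (not from the paper): bigrams c1c2 and c2c3 are
    "delta-overlapped" if count(c1c2c3) / count(c1c2) >= delta and
    count(c1c2c3) / count(c2c3) >= delta.
    Returns the reduced feature set and a map trigram -> (left, right).
    """
    bigram_counts = Counter()
    trigram_counts = Counter()
    for doc in docs:
        bigram_counts.update(char_ngrams(doc, 2))
        trigram_counts.update(char_ngrams(doc, 3))

    features = set(bigram_counts)  # start with every bigram as a feature
    raised = {}
    for tri, tri_count in trigram_counts.items():
        left, right = tri[:2], tri[1:]
        if left == right:
            continue  # e.g. "aaa": both halves are the same bigram
        if left not in features or right not in features:
            continue  # one of the bigrams was already raised elsewhere
        left_ratio = tri_count / bigram_counts[left]
        right_ratio = tri_count / bigram_counts[right]
        if left_ratio >= delta and right_ratio >= delta:
            # Two bigram features collapse into one trigram feature.
            features -= {left, right}
            features.add(tri)
            raised[tri] = (left, right)
    return features, raised


if __name__ == "__main__":
    toy_docs = ["计算机科学", "计算机技术", "科学家"]
    reduced, raised = raise_overlapped_bigrams(toy_docs, delta=0.9)
    print(sorted(reduced))
    print(raised)
```

On this toy corpus the seven distinct bigrams shrink to five features: 计算 and 算机 always occur together and are raised into 计算机, whereas 科学 also appears in 科学家 and therefore stays a bigram. Each raise removes two features and adds one, which is where the dimensionality saving in this sketch comes from.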