TEXT CLASSIFICATION FOR CHINESE WEB DOCUMENTS
Although many methods for feature selection and text classification have been applied to English web documents, relatively few studies have addressed Chinese web documents. This paper introduces a term weighting method based on inverse document frequency, HTML tags, and the length of Chinese phrases, and provides an algorithm for web text classification based on an improved lattice machine approach. The experiments show that this method is effective in both feature reduction and text classification.
Keywords: Feature Selection, Category Label, Multiple Category, Inverse Document Frequency, Chinese Phrase
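The abstract names three ingredients of the term weight (inverse document frequency, HTML tags, and Chinese phrase length) but does not give the formula. The following is only a minimal sketch of how such a weight might be combined, not the paper's actual method. The tag weights in `TAG_WEIGHTS`, the smoothed IDF, the logarithmic length factor, the multiplicative combination, and the function names `idf` and `term_weights` are all assumptions made for illustration. Documents are assumed to be already segmented into `(phrase, tag)` pairs.

```python
import math
from collections import Counter

# Hypothetical tag weights: assumed values, not taken from the paper.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "b": 1.5, "body": 1.0}


def idf(term, documents):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    df = sum(1 for doc in documents if any(t == term for t, _ in doc))
    return math.log((1 + len(documents)) / (1 + df)) + 1.0


def term_weights(doc, documents):
    """Combine tag-weighted frequency, IDF, and phrase length (assumed form)."""
    tagged_tf = Counter()
    for term, tag in doc:
        # Each occurrence counts more when it appears under a prominent tag.
        tagged_tf[term] += TAG_WEIGHTS.get(tag, 1.0)
    return {
        # Assumed length factor: longer Chinese phrases get a mild boost.
        term: freq * idf(term, documents) * math.log2(1 + len(term))
        for term, freq in tagged_tf.items()
    }


# Toy corpus: each document is a list of (phrase, HTML tag) pairs.
docs = [
    [("文本分类", "title"), ("方法", "body")],
    [("方法", "body"), ("网页", "body")],
]
weights = term_weights(docs[0], docs)
```

Under these assumptions, a four-character phrase that appears in a title and in only one document ends up weighted well above a two-character phrase that appears in body text across the whole corpus. Feature reduction could then keep only the highest-weighted terms.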