Automatic Document Categorization Based on k-NN and Object-Based Thesauri
The k-nearest neighbor (k-NN) classifier is one of the most popular document categorization methods because of its simplicity and relatively good performance. However, its precision degrades significantly when ambiguity arises, that is, when more than one candidate category exists for a document. To remedy this drawback, we propose a new method that incorporates the relationships of object-based thesauri into document categorization using k-NN. Employing the thesaurus entails structuring categories into taxonomies, since their structure must conform to that of the thesaurus in order to capture the relationships among them. By referencing the relationships in the thesaurus that correspond to the structured categories, k-NN can be drastically improved, removing the ambiguity. In this paper, we first perform document categorization using k-NN and then employ the relationships to reduce the ambiguity. Experimental results show that the proposed approach improves the precision of k-NN by up to 13.86% without compromising its recall.
Keywords: Object Class, Structure Category, Text Categorization, Association Relationship, Document Categorization
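The two-stage idea in the abstract (categorize with k-NN first, then use thesaurus relationships only when several categories remain plausible) can be illustrated with a minimal sketch. The abstract does not specify the paper's actual procedure, so everything here is an assumption made for illustration: the bag-of-words cosine similarity, the `margin` threshold that defines "ambiguous", the hypothetical `related` dictionary standing in for an object-based thesaurus, and the tie-breaking rule itself.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_scores(doc, training, k):
    """Sum the similarities of the k nearest training documents per category."""
    vec = Counter(doc.split())
    neighbors = sorted(
        ((cosine(vec, Counter(text.split())), cat) for text, cat in training),
        reverse=True,
    )[:k]
    scores = Counter()
    for sim, cat in neighbors:
        scores[cat] += sim
    return scores


def categorize(doc, training, related, k=3, margin=0.1):
    """Stage 1: plain k-NN. Stage 2: resolve ambiguity with `related`.

    `related` is a hypothetical stand-in for thesaurus relationships:
    it maps each category to terms associated with it.
    """
    ranked = knn_scores(doc, training, k).most_common()
    best_cat, best_score = ranked[0]
    # Assumed definition of ambiguity: several categories score within `margin`.
    candidates = [(c, s) for c, s in ranked if best_score - s <= margin]
    if len(candidates) == 1:
        return best_cat
    # Assumed tie-break: prefer the candidate whose related terms occur
    # most often in the document, falling back to the k-NN score.
    words = set(doc.split())
    return max(candidates, key=lambda cs: (len(words & related.get(cs[0], set())), cs[1]))[0]


# Toy data, invented for this example.
training = [
    ("java java code", "programming"),
    ("java beach island", "travel"),
    ("code compiler bug", "programming"),
    ("island hotel flight", "travel"),
]
related = {
    "programming": {"code", "compiler", "bug", "software"},
    "travel": {"beach", "surf", "island", "hotel"},
}
doc = "java java beach surf"

print(knn_scores(doc, training, k=2).most_common(1)[0][0])  # programming
print(categorize(doc, training, related, k=2))              # travel
```

In this toy case the two nearest neighbors belong to different categories with nearly equal similarity, so plain k-NN picks "programming" by a small margin, while the related-term lookup favors "travel". How the original method maps categories onto thesaurus relationships may differ substantially from this simplification.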