Abstract
The k-NN classifier (k-NN) is one of the most popular document categorization methods because of its simplicity and relatively good performance. However, its precision degrades significantly when ambiguity arises, that is, when more than one candidate category exists for a document. To remedy this drawback, we propose a new method that incorporates the relationships of object-based thesauri into document categorization with k-NN. Employing the thesaurus entails structuring the categories into taxonomies, since their structure must conform to that of the thesaurus in order to capture the relationships among them. By referencing the thesaurus relationships that correspond to the structured categories, the ambiguity can be removed and the performance of k-NN drastically improved. In this paper, we first perform document categorization using k-NN and then employ the relationships to reduce the ambiguity. Experimental results show that the proposed approach improves the precision of k-NN by up to 13.86% without compromising its recall.
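The two-stage idea described above (categorize with k-NN first, then consult the thesaurus only when several categories remain plausible) can be illustrated with a minimal sketch. The abstract does not specify the ambiguity criterion, the similarity measure, or how thesaurus relationships are scored, so everything below is an assumption made for illustration: cosine similarity over raw term counts, a hypothetical `margin` threshold that decides when categories count as ambiguous, and a flat `thesaurus` dictionary that maps each category to related terms in place of a real object-based thesaurus with typed relationships and a category taxonomy.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_scores(doc, training, k=2):
    """Score each category by summing the similarities of the k nearest
    training documents that belong to it."""
    vec = Counter(doc.split())
    neighbours = sorted(
        ((cosine(vec, Counter(text.split())), cat) for text, cat in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    scores = Counter()
    for sim, cat in neighbours:
        scores[cat] += sim
    return scores


def categorize(doc, training, thesaurus, k=2, margin=0.1):
    """Stage 1: plain k-NN. Stage 2 (assumed rule): if several categories
    score within `margin` of the best one, prefer the candidate whose
    thesaurus terms overlap most with the document."""
    scores = knn_scores(doc, training, k)
    best = max(scores.values())
    candidates = [c for c, s in scores.items() if best - s <= margin]
    if len(candidates) == 1:
        return candidates[0]  # no ambiguity, keep the k-NN decision
    terms = set(doc.split())
    # Tie-break on thesaurus overlap first, then on the k-NN score.
    return max(
        candidates,
        key=lambda c: (len(terms & thesaurus.get(c, set())), scores[c]),
    )


# Toy data, invented for this sketch only.
training = [
    ("river bank flood", "geography"),
    ("bank loan credit rate", "finance"),
]
thesaurus = {
    "finance": {"deposit", "account", "loan", "credit"},
    "geography": {"river", "flood", "stream"},
}

print(categorize("bank deposit account", training, thesaurus))
```

In this toy setting the shared term "bank" makes both categories plausible, and plain k-NN slightly favors the shorter geography document. The thesaurus lookup shifts the decision because "deposit" and "account" are related to the finance category. A document such as "river flood stream" is not ambiguous under the assumed margin, so the k-NN decision is returned unchanged.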
This work was supported by the Korea Science and Engineering Foundation (KOSEF), Grant No. R05-2003-000-11986-0.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bang, S.L., Yang, H.J., Yang, J.D. (2004). Automatic Document Categorization Based on k-NN and Object-Based Thesauri. In: Apostolico, A., Melucci, M. (eds) String Processing and Information Retrieval. SPIRE 2004. Lecture Notes in Computer Science, vol 3246. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30213-1_14
DOI: https://doi.org/10.1007/978-3-540-30213-1_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23210-0
Online ISBN: 978-3-540-30213-1
eBook Packages: Springer Book Archive