Automatic Document Categorization Based on k-NN and Object-Based Thesauri

Bang, Sun Lee; Yang, Hyung Jeong; Yang, Jae Dong

doi:10.1007/978-3-540-30213-1_14

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3246)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

The k-NN classifier (k-NN) is one of the most popular document categorization methods because of its simplicity and relatively good performance. However, its precision degrades significantly when ambiguity arises, that is, when more than one candidate category exists for a document to be assigned to. To remedy this drawback, we propose a new method that incorporates the relationships of object-based thesauri into document categorization using k-NN. Employing the thesaurus entails structuring categories into taxonomies, since their structure must conform to that of the thesaurus in order to capture the relationships among them. By referencing the relationships in the thesaurus that correspond to the structured categories, k-NN can be drastically improved, removing the ambiguity. In this paper, we first perform document categorization using k-NN and then employ the relationships to reduce the ambiguity. Experimental results show that the proposed approach improves the precision of k-NN by up to 13.86% without compromising its recall.
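The two-stage idea in the abstract (categorize with k-NN first, then consult category relationships when several candidates remain) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the bag-of-words cosine similarity, the `margin` threshold used to detect ambiguity, the `related` mapping standing in for thesaurus relationships, and the rule of adding the scores of related categories are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_scores(doc, training, k):
    """Stage 1: sum the similarities of the k nearest training documents
    per category. `training` is a list of (text, category) pairs."""
    vec = Counter(doc.split())
    nearest = sorted(
        ((cosine(vec, Counter(text.split())), cat) for text, cat in training),
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for sim, cat in nearest:
        scores[cat] += sim
    return dict(scores)


def resolve(scores, related, margin=0.1):
    """Stage 2 (hypothetical rule): if the best categories are within
    `margin` of each other, prefer the one whose related categories
    (per the assumed `related` mapping) also scored among the neighbors.
    Assumes `scores` is non-empty."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    best = scores[ranked[0]]
    if len(ranked) < 2 or best - scores[ranked[1]] > margin:
        return ranked[0]  # no ambiguity: keep the plain k-NN decision
    candidates = [c for c in ranked if best - scores[c] <= margin]

    def supported(c):
        # Own score plus the scores of categories related to c.
        return scores[c] + sum(scores.get(r, 0.0) for r in related.get(c, ()))

    return max(candidates, key=supported)


# Hypothetical relationships between categories (not from the paper).
related = {"laptop": {"computer"}, "computer": {"laptop"}}

# "fruit" and "laptop" tie, so the related category "computer" breaks the tie.
print(resolve({"fruit": 0.5, "laptop": 0.5, "computer": 0.3}, related))
```

In this toy run the tie between "fruit" and "laptop" is resolved in favor of "laptop", because the related category "computer" also appears among the neighbors. When one category clearly dominates, `resolve` returns the plain k-NN decision unchanged.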

This work was supported by the Korea Science and Engineering Foundation (KOSEF), Grant No. R05-2003-000-11986-0.

Author information

Authors and Affiliations

Department of Computer Science, Chonbuk National University, Jeonju, 561-756, South Korea
Sun Lee Bang & Jae Dong Yang
Department of Computer Science, Carnegie Mellon University, Pittsburgh, 15213, USA
Hyung Jeong Yang

Editor information

Editors and Affiliations

Georgia Institute of Technology and Università di Padova
Alberto Apostolico
Department of Information Engineering, University of Padova
Massimo Melucci

Cite this paper

Bang, S.L., Yang, H.J., Yang, J.D. (2004). Automatic Document Categorization Based on k-NN and Object-Based Thesauri. In: Apostolico, A., Melucci, M. (eds) String Processing and Information Retrieval. SPIRE 2004. Lecture Notes in Computer Science, vol 3246. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30213-1_14

DOI: https://doi.org/10.1007/978-3-540-30213-1_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23210-0
Online ISBN: 978-3-540-30213-1
eBook Packages: Springer Book Archive