Automatic Document Categorization Based on k-NN and Object-Based Thesauri

  • Sun Lee Bang
  • Hyung Jeong Yang
  • Jae Dong Yang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3246)


The k-NN classifier (k-NN) is one of the most popular document categorization methods because of its simplicity and relatively good performance. However, its precision degrades significantly when ambiguity arises, that is, when there is more than one candidate category to which a document could be assigned. To remedy this drawback, we propose a new method that incorporates the relationships of object-based thesauri into document categorization using k-NN. Employing the thesaurus entails structuring categories into taxonomies, since their structure must conform to that of the thesaurus in order to capture the relationships among them. By referencing the relationships in the thesaurus that correspond to the structured categories, k-NN can be drastically improved because the ambiguity is removed. In this paper, we first perform document categorization using k-NN and then employ the relationships to reduce the ambiguity. Experimental results show that the proposed approach improves the precision of k-NN by up to 13.86% without compromising its recall.
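The abstract gives only the outline of the two-step procedure (categorize with k-NN, then consult thesaurus relationships when the result is ambiguous), so the following Python is a minimal illustrative sketch under stated assumptions, not the authors' algorithm. The cosine similarity, the `margin` threshold that defines "ambiguous", the flat `thesaurus` mapping from a category to its related terms, and all function names are hypothetical choices made for this example.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_scores(doc, training, k):
    """Sum the similarities of the k nearest training documents per category."""
    neighbours = sorted(
        ((cosine(doc, vec), cat) for vec, cat in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for sim, cat in neighbours:
        scores[cat] += sim
    return dict(scores)


def categorize(doc, training, thesaurus, k=3, margin=0.1):
    """Step 1: plain k-NN. Step 2: if the top categories score within
    `margin` of each other (an assumed definition of ambiguity), prefer the
    candidate whose thesaurus-related terms are best supported by the document."""
    scores = knn_scores(doc, training, k)
    ranked = sorted(scores, key=scores.get, reverse=True)
    if not ranked:
        return None
    best = scores[ranked[0]]
    candidates = [c for c in ranked if best - scores[c] <= margin]
    if len(candidates) == 1:
        return candidates[0]  # unambiguous: keep the k-NN decision

    def support(cat):
        # Hypothetical stand-in for thesaurus relationships: total weight of
        # document terms that the thesaurus relates to this category.
        related = thesaurus.get(cat, set())
        return sum(w for term, w in doc.items() if term in related)

    return max(candidates, key=lambda c: (support(c), scores[c]))


# Toy data: both neighbours are equally similar, so k-NN alone is ambiguous.
training = [
    ({"engine": 1.0, "wheel": 1.0}, "car"),
    ({"engine": 1.0, "wing": 1.0}, "plane"),
]
thesaurus = {"car": {"wheel"}, "plane": {"turbine", "wing"}}
doc = {"engine": 1.0, "turbine": 1.0}
print(categorize(doc, training, thesaurus, k=2))
```

In this toy case the two categories tie on k-NN score, and the term "turbine" tips the decision toward the category the assumed thesaurus relates it to. A real object-based thesaurus would carry typed relationships between structured categories rather than a flat term set.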


Object Class, Structure Category, Text Categorization, Association Relationship, Document Categorization
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Sun Lee Bang (1)
  • Hyung Jeong Yang (2)
  • Jae Dong Yang (1)
  1. Department of Computer Science, Chonbuk National University, Jeonju, South Korea
  2. Department of Computer Science, Carnegie Mellon University, Pittsburgh, USA
