Abstract
We investigate two well-known similarity-based learning approaches to text categorization: the k-nearest neighbor (k-NN) classifier and the Rocchio classifier. After identifying the strengths and weaknesses of each technique, we propose a new classifier, the kNN model-based classifier, which unifies the strengths of the k-NN and Rocchio classifiers and is adapted to the characteristics of text categorization problems.
A prototype text categorization system has been implemented and evaluated on two common document corpora: the 20-newsgroup collection and the ModApte version of the Reuters-21578 collection of news stories. The experimental results show that the kNN model-based approach outperforms both the k-NN and Rocchio classifiers.
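For readers unfamiliar with the two baseline classifiers named in the abstract, the following is a minimal sketch of how each one typically works. It is not the kNN model-based classifier proposed in the paper, whose details the abstract does not give. The bag-of-words weighting, the cosine similarity measure, the centroid-only form of Rocchio (no negative-example term), the toy documents, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Turn a text into a sparse term-frequency vector (assumed weighting)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(train, query, k=3):
    """k-NN: majority vote among the k training documents most similar to the query."""
    ranked = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def rocchio_fit(train):
    """Simplified Rocchio: one prototype per class, the mean of its document vectors."""
    sums = defaultdict(Counter)
    counts = Counter()
    for vec, label in train:
        sums[label].update(vec)  # accumulate term weights per class
        counts[label] += 1
    return {
        label: {t: w / counts[label] for t, w in total.items()}
        for label, total in sums.items()
    }


def rocchio_predict(centroids, query):
    """Assign the class whose prototype is most similar to the query."""
    return max(centroids, key=lambda label: cosine(centroids[label], query))


# Hypothetical toy corpus, invented purely for illustration.
train = [
    (vectorize("stock market shares trade"), "finance"),
    (vectorize("bank interest rates market"), "finance"),
    (vectorize("team wins match goal"), "sport"),
    (vectorize("player scores goal season"), "sport"),
]
query = vectorize("market shares fall")

print(knn_predict(train, query, k=3))
print(rocchio_predict(rocchio_fit(train), query))
```

The sketch makes the usual trade-off visible: k-NN keeps every training document and compares the query against all of them at prediction time, while Rocchio compresses each class into a single prototype, which is cheaper but can misrepresent classes whose documents do not cluster around one centre.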
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Guo, G., Wang, H., Bell, D., Bi, Y., Greer, K. (2004). An kNN Model-Based Approach and Its Application in Text Categorization. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2004. Lecture Notes in Computer Science, vol 2945. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24630-5_69
DOI: https://doi.org/10.1007/978-3-540-24630-5_69
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21006-1
Online ISBN: 978-3-540-24630-5
eBook Packages: Springer Book Archive