Abstract
Centroid-based categorization is one of the most popular algorithms in text classification. Normalization is an important factor in improving the performance of a centroid-based classifier when the documents in a text collection differ widely in size. In the past, normalization involved only document-length or class-length normalization. In this paper, we propose a new type of normalization, called term-length normalization, which considers the term distribution in a class. The performance of this normalization is investigated in three environments of a standard centroid-based classifier (TFIDF): (1) without class-length normalization, (2) with cosine class-length normalization, and (3) with summing weight normalization. The results suggest that our term-length normalization is useful for improving classification accuracy in all cases.
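For orientation, the baseline named in the abstract (a TFIDF centroid-based classifier with optional cosine class-length normalization) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `log(N/df)` form of IDF, the averaging of document vectors into a centroid, and the toy data are all hypothetical choices. The proposed term-length normalization and the summing weight normalization are not defined in the abstract, so they are deliberately left out.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per tokenized document, plus the IDF table.

    Assumption: IDF is log(N / df); the paper may use a different variant.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine_normalize(vec):
    """Scale a sparse vector to unit Euclidean length (no-op for a zero vector)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)


def build_centroids(vectors, labels, class_length_norm=True):
    """Average the document vectors of each class into one centroid.

    With class_length_norm=True the centroid is cosine-normalized, which
    corresponds to environment (2) in the abstract; False gives environment (1).
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for vec, label in zip(vectors, labels):
        for term, weight in vec.items():
            sums[label][term] += weight
    centroids = {}
    for label, n_in_class in counts.items():
        centroid = {t: w / n_in_class for t, w in sums[label].items()}
        centroids[label] = cosine_normalize(centroid) if class_length_norm else centroid
    return centroids


def classify(doc, idf, centroids):
    """Assign the class whose centroid has the largest dot product with the document."""
    query = cosine_normalize(
        {t: tf * idf[t] for t, tf in Counter(doc).items() if t in idf}
    )
    scores = {
        label: sum(w * centroid.get(t, 0.0) for t, w in query.items())
        for label, centroid in centroids.items()
    }
    return max(scores, key=scores.get)


# Toy data, invented purely for illustration.
train_docs = [
    ["goal", "match", "team"],
    ["team", "score", "goal"],
    ["stock", "market", "price"],
    ["market", "trade", "stock"],
]
train_labels = ["sports", "sports", "finance", "finance"]

vectors, idf = tfidf_vectors(train_docs)
centroids = build_centroids(vectors, train_labels, class_length_norm=True)
print(classify(["goal", "team"], idf, centroids))
```

Passing `class_length_norm=False` leaves the centroids unnormalized, so classes with more or longer training documents can dominate the dot product; that sensitivity to size is the kind of effect the normalization schemes compared in the paper are meant to address.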
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lertnattee, V., Theeramunkong, T. (2003). Term-length Normalization for Centroid-based Text Categorization. In: Palade, V., Howlett, R.J., Jain, L. (eds) Knowledge-Based Intelligent Information and Engineering Systems. KES 2003. Lecture Notes in Computer Science, vol 2773. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45224-9_113
DOI: https://doi.org/10.1007/978-3-540-45224-9_113
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40803-1
Online ISBN: 978-3-540-45224-9
eBook Packages: Springer Book Archive