Abstract
The number of documents per topic class is typically highly skewed. As a result, classifiers tend to achieve good micro-averaged performance but noticeably weaker macro-averaged performance. Viewing topics as clusters in a high-dimensional space, we assume that a large topic cluster is in general a mixture of several subtopic clusters, and we propose using clustering to identify these subtopic clusters for large topic classes. We used the Reuters news articles and support vector machines to evaluate whether using subtopic clusters can lead to better macro-averaged performance.
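The gap between micro- and macro-averaged scores under class skew can be illustrated with a small sketch. It assumes F1 as the underlying measure, and the topic names and per-topic counts are hypothetical, chosen only to mimic one large class and two small ones; none of these figures come from the paper.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; taken as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def micro_macro_f1(counts):
    """counts maps topic -> (tp, fp, fn); returns (micro_f1, macro_f1)."""
    # Micro-average: pool the counts over all topics, then score once.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)
    # Macro-average: score each topic separately, then take the plain mean.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro


# Hypothetical counts (illustrative only): one large topic classified well,
# two small topics classified poorly.
counts = {
    "large_topic": (950, 50, 50),
    "small_topic_a": (2, 3, 8),
    "small_topic_b": (1, 4, 9),
}

micro, macro = micro_macro_f1(counts)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

With these made-up counts the pooled (micro) score is dominated by the large topic, while the macro score weights every topic equally and is pulled down by the two small ones.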
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chik, F.C.Y., Luk, R.W.P., Chung, K.F.L. (2005). Text Categorization Based on Subtopic Clusters. In: Montoyo, A., Muñoz, R., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2005. Lecture Notes in Computer Science, vol 3513. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11428817_19
DOI: https://doi.org/10.1007/11428817_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26031-8
Online ISBN: 978-3-540-32110-1
eBook Packages: Computer Science, Computer Science (R0)