Text Categorization Based on Subtopic Clusters

Chik, Francis C. Y.; Luk, Robert W. P.; Chung, Korris F. L.

doi:10.1007/11428817_19

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3513)

Included in the following conference series:

International Conference on Application of Natural Language to Information Systems

Abstract

The number of documents per topic class is typically highly skewed. This skew leads to good micro-averaged performance but weaker macro-averaged performance. Viewing topics as clusters in a high-dimensional space, we propose using clustering to determine subtopic clusters for large topic classes, on the assumption that a large topic cluster is in general a mixture of several subtopic clusters. We used Reuters news articles and support vector machines to evaluate whether using subtopic clusters can lead to better macro-averaged performance.
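
The gap between the two averages can be illustrated with a small calculation. The sketch below is not taken from the paper: the class names and the true-positive, false-positive, and false-negative counts are hypothetical, chosen only to mimic one large topic and two small ones, and F1 is assumed as the performance measure. Micro-averaging pools the counts over all classes, so the large class dominates; macro-averaging gives every class equal weight, so poorly classified small classes pull the score down.

```python
from collections import namedtuple

# Per-class contingency counts (all values below are hypothetical).
Counts = namedtuple("Counts", ["tp", "fp", "fn"])


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(per_class):
    """Pool counts over all classes, then compute a single F1."""
    tp = sum(c.tp for c in per_class.values())
    fp = sum(c.fp for c in per_class.values())
    fn = sum(c.fn for c in per_class.values())
    return f1(tp, fp, fn)


def macro_f1(per_class):
    """Compute F1 per class, then take the unweighted mean."""
    scores = [f1(c.tp, c.fp, c.fn) for c in per_class.values()]
    return sum(scores) / len(scores)


# One large, well-classified topic and two small, poorly classified ones.
counts = {
    "large_topic": Counts(tp=950, fp=50, fn=50),
    "small_topic_a": Counts(tp=2, fp=3, fn=8),
    "small_topic_b": Counts(tp=1, fp=4, fn=9),
}

micro = micro_f1(counts)
macro = macro_f1(counts)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

With these made-up counts the micro-averaged F1 stays close to the large class's score, while the macro-averaged F1 is less than half of it, which is the kind of imbalance the abstract describes.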

Author information

Authors and Affiliations

Department of Computing, Hong Kong Polytechnic University
Francis C. Y. Chik, Robert W. P. Luk & Korris F. L. Chung

Editor information

Editors and Affiliations

Department of Software and Computing Systems, University of Alicante, Spain
Andrés Montoyo
Grupo de investigación del Procesamiento del Lenguaje y Sistemas de Información, Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante, Alicante, Spain
Rafael Muñoz
Lab. CEDRIC, CNAM, Paris, France
Elisabeth Métais

About this paper

Cite this paper

Chik, F.C.Y., Luk, R.W.P., Chung, K.F.L. (2005). Text Categorization Based on Subtopic Clusters. In: Montoyo, A., Muñoz, R., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2005. Lecture Notes in Computer Science, vol 3513. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11428817_19

DOI: https://doi.org/10.1007/11428817_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26031-8
Online ISBN: 978-3-540-32110-1
eBook Packages: Computer Science, Computer Science (R0)