Automatic Document Categorization

Stein, Benno; Eissen, Sven Meyer zu

doi:10.1007/978-3-540-39451-8_19


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2821)

Included in the following conference series:

Annual Conference on Artificial Intelligence


Abstract

Clustering a document collection is the current approach to automatically deriving its underlying document categories. The categorization performance of a document clustering algorithm can be captured by the F-Measure, which quantifies how closely the clustering resembles a human-defined categorization.
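As an illustration of how such an F-Measure can be computed, the following minimal sketch uses a formulation that is common in the document clustering literature: for every human-defined class, the best-matching cluster is found via the harmonic mean of precision and recall, and the per-class values are averaged, weighted by class size. The abstract does not spell out the exact variant used in the paper, so this formulation and the function name `clustering_f_measure` are assumptions made for illustration.

```python
from collections import Counter


def clustering_f_measure(classes, clusters):
    """Class-size-weighted F-Measure of a clustering (illustrative sketch).

    classes  -- human-defined category label of each document
    clusters -- cluster label assigned to each document (same order)
    """
    if len(classes) != len(clusters) or not classes:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    # overlap[(c, k)] = number of documents of class c placed in cluster k
    overlap = Counter(zip(classes, clusters))

    total = 0.0
    for c, c_size in class_sizes.items():
        best = 0.0
        for k, k_size in cluster_sizes.items():
            shared = overlap.get((c, k), 0)
            if shared == 0:
                continue
            precision = shared / k_size
            recall = shared / c_size
            f = 2 * precision * recall / (precision + recall)
            best = max(best, f)
        # Each class contributes its best F value, weighted by its size.
        total += (c_size / n) * best
    return total
```

A clustering that reproduces the reference categorization exactly yields a value of 1; merging all documents into a single cluster lowers the value because precision drops for every class.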

However, a bad F-Measure value tells us nothing about the reason why a clustering algorithm performs poorly. Among several possible explanations the most interesting question is the following: Are the implicit assumptions of the clustering algorithm admissible with respect to a document categorization task?

Though the use of clustering algorithms for document categorization is widely accepted, no foundation or rationale has been stated for this admissibility question. The present paper is devoted to this gap. It presents considerations and a measure to quantify the sensitivity of a clustering process with regard to geometric distortions of the data space. Along with the method of multidimensional scaling, this measure provides an instrument for assessing a clustering algorithm’s adequacy.
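To make the notion of a geometric distortion concrete, the sketch below computes a normalized stress value between a point set and a lower-dimensional embedding of it, in the spirit of the stress criterion used in multidimensional scaling. This is not the measure proposed in the paper, which the abstract does not define. The function name `normalized_stress` and the choice to normalize by the original squared distances are assumptions made for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def normalized_stress(original, embedded):
    """Relative deviation of pairwise distances after an embedding.

    original -- points in the original data space
    embedded -- the same points, in the same order, after the embedding
    Returns 0.0 when all pairwise distances are preserved exactly.
    """
    if len(original) != len(embedded):
        raise ValueError("both point lists must have the same length")

    squared_error = 0.0
    squared_original = 0.0
    for i in range(len(original)):
        for j in range(i + 1, len(original)):
            d_orig = euclidean(original[i], original[j])
            d_emb = euclidean(embedded[i], embedded[j])
            squared_error += (d_orig - d_emb) ** 2
            squared_original += d_orig ** 2

    if squared_original == 0.0:
        raise ValueError("original points are all identical")
    return math.sqrt(squared_error / squared_original)
```

For example, embedding the two points (0, 0) and (3, 4) on a line by dropping the second coordinate changes their distance from 5 to 3, which gives a stress of 0.4. One could then compare how a clustering algorithm's output changes as this distortion grows.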



Author information

Authors and Affiliations

Department of Computer Science, Paderborn University, D-33095, Paderborn, Germany
Benno Stein & Sven Meyer zu Eissen


Editor information

Editors and Affiliations

HITeC e. V. Universität Hamburg Fachbereich Informatik, Vogt-Kölln-Str. 30, 22527, Hamburg, Germany
Andreas Günter
Otto-von-Guericke-University of Magdeburg,
Rudolf Kruse
Cognitive Systems Laboratory, Department Informatik, Universität Hamburg, 22527, Hamburg, Germany
Bernd Neumann

About this paper

Cite this paper

Stein, B., Eissen, S.M.z. (2003). Automatic Document Categorization. In: Günter, A., Kruse, R., Neumann, B. (eds) KI 2003: Advances in Artificial Intelligence. KI 2003. Lecture Notes in Computer Science, vol 2821. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39451-8_19

DOI: https://doi.org/10.1007/978-3-540-39451-8_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20059-8
Online ISBN: 978-3-540-39451-8
eBook Packages: Springer Book Archive
