Abstract
Clustering a document collection is the prevailing approach for automatically deriving the underlying document categories. The categorization performance of a document clustering algorithm can be captured by the F-Measure, which quantifies how closely the clustering resembles a human-defined categorization.
However, a poor F-Measure value says nothing about why a clustering algorithm performs poorly. Among several possible explanations, the most interesting question is the following: Are the implicit assumptions of the clustering algorithm admissible with respect to a document categorization task?
Though the use of clustering algorithms for document categorization is widely accepted, no foundation or rationale has been stated for this admissibility question. This paper is devoted to that gap. It presents considerations and a measure to quantify the sensitivity of a clustering process with regard to geometric distortions of the data space. Along with the method of multidimensional scaling, this measure provides an instrument for assessing a clustering algorithm's adequacy.
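The abstract refers to the F-Measure without defining it. As a minimal sketch, assuming the class-weighted variant commonly used in document clustering evaluation (for each human-defined class, take the best F value over all clusters and weight it by class size), the computation might look as follows. The function name `clustering_f_measure` and its argument names are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def clustering_f_measure(classes, clusters):
    """Class-weighted clustering F-Measure (assumed variant, see lead-in).

    classes  -- human-defined category label of each document
    clusters -- cluster label assigned to each document (same order)
    """
    if len(classes) != len(clusters):
        raise ValueError("classes and clusters must have the same length")

    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    # Number of documents shared by each (class, cluster) pair.
    overlap = Counter(zip(classes, clusters))

    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = overlap[(c, k)]
            if n_ck == 0:
                continue
            precision = n_ck / n_k  # share of cluster k that belongs to class c
            recall = n_ck / n_c     # share of class c that is found in cluster k
            f = 2 * precision * recall / (precision + recall)
            best = max(best, f)
        # Each class contributes its best-matching cluster, weighted by size.
        total += (n_c / n) * best
    return total
```

A value of 1.0 means that every class is reproduced exactly by some cluster; lower values indicate a weaker match but, as the abstract points out, do not by themselves explain the cause.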
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Stein, B., Eissen, S.M.z. (2003). Automatic Document Categorization. In: Günter, A., Kruse, R., Neumann, B. (eds) KI 2003: Advances in Artificial Intelligence. KI 2003. Lecture Notes in Computer Science, vol 2821. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39451-8_19
DOI: https://doi.org/10.1007/978-3-540-39451-8_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20059-8
Online ISBN: 978-3-540-39451-8
eBook Packages: Springer Book Archive