
Automatic Document Categorization

Interpreting the Performance of Clustering Algorithms

  • Conference paper
KI 2003: Advances in Artificial Intelligence (KI 2003)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2821)

Abstract

Clustering a document collection is the prevailing approach to deriving underlying document categories automatically. The categorization performance of a document clustering algorithm can be captured by the F-Measure, which quantifies how closely the clustering resembles a human-defined categorization.
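The abstract does not spell out the F-Measure formula. As a minimal sketch, assuming the commonly used class-size-weighted variant (for each human-defined class, take the best harmonic mean of precision and recall over all clusters, then average weighted by class size), it could be computed as follows. The function name `clustering_f_measure` and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def clustering_f_measure(classes, clusters):
    """Class-size-weighted F-Measure of a clustering against reference classes.

    classes[k]  is the human-assigned category of document k,
    clusters[k] is the cluster the algorithm assigned document k to.
    Assumed variant: F = sum_i (|class_i| / n) * max_j F(i, j).
    """
    if len(classes) != len(clusters):
        raise ValueError("need one class label and one cluster label per document")
    n = len(classes)
    if n == 0:
        raise ValueError("empty document collection")

    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    # overlap[(i, j)] = number of documents of class i that landed in cluster j
    overlap = Counter(zip(classes, clusters))

    total = 0.0
    for i, class_size in class_sizes.items():
        best = 0.0
        for j, cluster_size in cluster_sizes.items():
            n_ij = overlap[(i, j)]
            if n_ij == 0:
                continue  # no shared documents, so F(i, j) = 0
            precision = n_ij / cluster_size
            recall = n_ij / class_size
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (class_size / n) * best
    return total


# Toy data (hypothetical): six documents in two human-defined categories.
reference = ["sports", "sports", "sports", "politics", "politics", "politics"]
perfect = [0, 0, 0, 1, 1, 1]  # clustering reproduces the categories exactly
merged = [0, 0, 0, 0, 0, 0]   # clustering lumps everything into one cluster

print(clustering_f_measure(reference, perfect))
print(clustering_f_measure(reference, merged))
```

In this sketch the exact clustering scores 1, while the single merged cluster scores lower because its precision for each class drops to one half. The score alone does not indicate which assumption of the algorithm caused the loss.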

However, a poor F-Measure value says nothing about why a clustering algorithm performs poorly. Among the possible explanations, the most interesting question is the following: Are the implicit assumptions of the clustering algorithm admissible for a document categorization task?

Though the use of clustering algorithms for document categorization is widely accepted, no foundation or rationale has been stated for this admissibility question. This paper addresses that gap. It presents considerations and a measure that quantifies the sensitivity of a clustering process to geometric distortions of the data space. Together with multidimensional scaling, this measure provides an instrument for assessing a clustering algorithm’s adequacy.

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Stein, B., Eissen, S.M.z. (2003). Automatic Document Categorization. In: Günter, A., Kruse, R., Neumann, B. (eds) KI 2003: Advances in Artificial Intelligence. KI 2003. Lecture Notes in Computer Science, vol 2821. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39451-8_19

  • DOI: https://doi.org/10.1007/978-3-540-39451-8_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-20059-8

  • Online ISBN: 978-3-540-39451-8

  • eBook Packages: Springer Book Archive
