Comparison of Selected Methods for Document Clustering

  • Radim Sevcik
  • Hana Rezankova
  • Dusan Husek
Conference paper
Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 86)

Abstract

This paper compares 17 cluster analysis techniques proposed for document clustering in terms of internal and external clustering quality measures and computing time. The techniques combine three basic methods (direct, repeated bisection, and agglomerative) with five clustering criterion functions for solution assessment (two intra-cluster, one inter-cluster, and two complex ones), all implemented in the CLUTO software package. In addition, for the agglomerative method, single linkage and complete linkage were applied as criterion functions. The methods were compared on the 20 Newsgroups collection, with e-mail messages represented as binary vectors. The experiments showed that, in terms of entropy and purity, the direct method gives the best results. In terms of computing time, the repeated bisection (divisive) method was the fastest.
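
The external quality measures named in the abstract, entropy and purity, can be illustrated with a minimal sketch. It assumes the commonly used definitions in which each cluster's value is weighted by its share of the documents and entropy is normalized by the logarithm of the number of classes; the function names and the toy data are illustrative and are not taken from the paper or from CLUTO.

```python
import math
from collections import Counter, defaultdict


def cluster_class_counts(clusters, classes):
    """Count, for every cluster, how many documents of each class it holds."""
    counts = defaultdict(Counter)
    for cluster_id, class_label in zip(clusters, classes):
        counts[cluster_id][class_label] += 1
    return counts


def purity(clusters, classes):
    """Share of documents that belong to the majority class of their cluster."""
    counts = cluster_class_counts(clusters, classes)
    n = len(clusters)
    return sum(max(c.values()) for c in counts.values()) / n


def entropy(clusters, classes):
    """Size-weighted average of normalized per-cluster class entropies."""
    counts = cluster_class_counts(clusters, classes)
    n = len(clusters)
    q = len(set(classes))  # number of distinct classes
    if q < 2:
        return 0.0  # with a single class every cluster is trivially pure
    total = 0.0
    for c in counts.values():
        n_r = sum(c.values())
        # Entropy of the class distribution inside this cluster, scaled to [0, 1]
        e_r = -sum((m / n_r) * math.log(m / n_r) for m in c.values()) / math.log(q)
        total += (n_r / n) * e_r
    return total


# Toy example (assumed data): six documents, two clusters, two classes.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
print(purity(clusters, classes), entropy(clusters, classes))
```

Higher purity and lower entropy both indicate clusters that agree more closely with the known newsgroup labels, which is the sense in which the abstract ranks the direct method best.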

Keywords

Web clustering; Cluster analysis; Textual documents; Web content classification; Newsgroups analysis; Vector model

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Radim Sevcik (1)
  • Hana Rezankova (1)
  • Dusan Husek (2)
  1. University of Economics, Prague, Praha 3, Czech Republic
  2. Institute of Computer Science, Academy of Sciences of the Czech Republic, Praha 8, Czech Republic
