Unsupervised Sparsification of Similarity Graphs

  • Tim Gollub
  • Benno Stein
Conference paper
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)

Abstract

Cluster analysis often has to cope with high-dimensional and noisy data. This paper identifies sparsification as an approach to address this problem. Sparsification improves both the runtime and the quality of cluster algorithms that exploit pairwise object similarities, i.e., that rely on similarity graphs. Sparsification has been addressed before in the field of graph-based cluster algorithms, but the approaches developed so far leave the burden of parameter tuning to the user. Our approach to sparsification relies on the inherent characteristics of the data and is completely unsupervised. It leads to significant improvements in cluster quality and outperforms even the optimal supervised approaches to sparsification that rely on a single global threshold.
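For orientation, the baseline the abstract compares against, sparsification with a single global threshold, can be sketched as follows. This is only a minimal illustration of that baseline and not the authors' unsupervised method, which the abstract does not describe. The function names `sparsify_global` and `edge_count`, the example matrix, and the threshold value 0.5 are all assumptions made for this sketch.

```python
from typing import List

Matrix = List[List[float]]


def sparsify_global(sim: Matrix, threshold: float) -> Matrix:
    """Return a copy of a similarity matrix in which every off-diagonal
    entry below `threshold` is set to 0. Diagonal entries (self-similarity)
    are also set to 0, so the result has no self-loops."""
    n = len(sim)
    return [
        [sim[i][j] if i != j and sim[i][j] >= threshold else 0.0 for j in range(n)]
        for i in range(n)
    ]


def edge_count(sim: Matrix) -> int:
    """Count undirected edges, i.e. positive entries above the diagonal."""
    n = len(sim)
    return sum(1 for i in range(n) for j in range(i + 1, n) if sim[i][j] > 0.0)


# Hypothetical pairwise similarities for four objects (two natural groups).
similarities = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.05],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.05, 0.8, 1.0],
]

# The threshold is the parameter a user would have to tune by hand.
sparse = sparsify_global(similarities, threshold=0.5)
```

In this toy example the threshold removes the four weak edges between the two groups and keeps the two strong ones. Choosing that threshold is exactly the tuning burden the abstract says existing approaches leave to the user.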

Keywords

Similarity Score, Object Space, Object Representation, Virtual Object, Similarity Graph
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  1. Faculty of Media/Media Systems, Bauhaus-Universität Weimar, Weimar, Germany
