
An Accelerated K-Means Algorithm Based on Adaptive Distances

  • Hans-Joachim Mucha
  • Hans-Georg Bartel
Conference paper
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)

Abstract

Widely used cluster analysis methods such as K-means and spectral clustering require a measure of (pairwise) distance on the multivariate space. Unfortunately, distances often depend on the scales of the variables, which can become a crucial point in applications. Here we propose an accelerated K-means technique that consists of two steps. First, an appropriate weighted Euclidean distance is established on the multivariate space, based on univariate assessments of the importance of the variables for the cluster analysis task. As a by-product, this step also yields a crude lower bound on the number of clusters K. Second, a fast K-means step follows, based on random sampling; it is especially suited for the data reduction of massive data sets. From a theoretical point of view, the approach resembles MacQueen’s idea of clustering data over a continuous space. The main difference, however, is that our algorithm examines only a random sample in a single pass. The proposed algorithm is applied to a segmentation problem in ecology.
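The abstract describes the algorithm only at a high level. As an illustration, the following Python sketch implements the two-step idea under stated assumptions: the univariate importance assessment is replaced by a simple inverse-variance weighting (the authors' actual criterion is not given in the abstract), and the second step is a MacQueen-style single pass over a random sample with incremental centroid updates under the weighted Euclidean distance d_w(x, y) = (Σ_j w_j (x_j − y_j)²)^{1/2}. All function names are hypothetical, not the authors' code.

```python
import numpy as np

def adaptive_weights(X):
    """Variable weights for the weighted Euclidean distance.

    Placeholder for the paper's univariate importance assessment:
    here simply inverse variances, so that every variable contributes
    on a comparable scale. The authors' actual criterion is not
    stated in the abstract."""
    var = X.var(axis=0)
    var[var == 0.0] = 1.0                    # guard against constant variables
    return 1.0 / var

def single_pass_kmeans(X, k, weights, sample_size=None, seed=0):
    """MacQueen-style K-means in a single pass over a random sample.

    A sketch of the idea described in the abstract, not the authors'
    implementation: each sampled point is assigned once to its nearest
    centroid (weighted Euclidean distance), and that centroid is
    updated incrementally as a running mean."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = sample_size if sample_size is not None else n
    assert k <= m <= n
    idx = rng.choice(n, size=m, replace=False)
    centroids = X[idx[:k]].astype(float)     # first k sampled points seed the centroids
    counts = np.ones(k)
    for i in idx[k:]:
        x = X[i]
        d = ((centroids - x) ** 2 * weights).sum(axis=1)   # weighted squared distances
        j = int(d.argmin())
        counts[j] += 1
        centroids[j] += (x - centroids[j]) / counts[j]     # running-mean update
    return centroids

# Toy usage: reduce 100,000 points in 4 dimensions to k = 5 representatives,
# examining only a 10,000-point random sample in one pass.
X = np.random.default_rng(1).normal(size=(100_000, 4))
w = adaptive_weights(X)
centers = single_pass_kmeans(X, k=5, weights=w, sample_size=10_000)
print(centers.shape)                         # (5, 4)
```

Because each sampled point is visited exactly once and each update touches a single centroid, the cost of the clustering pass is O(mKp) for a sample of size m with K centroids in p dimensions, which is what makes the approach attractive for the data reduction of massive data sets.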

Keywords

Adaptive Weight, Univariate Assessment, Cluster Analysis Method, Multivariate Space, Weighted Euclidean Distance

References

  1. Bradley P, Fayyad U, Reina C (1998) Scaling clustering algorithms to large databases. Tech. rep., Microsoft Research
  2. Faber V (1994) Clustering and the continuous k-means algorithm. Los Alamos Sci 22:138–144
  3. Faber V, Hochberg JG, Kelly PM, Thomas TR, White JM (1994) Concept extraction. A data-mining technique. Los Alamos Sci 22:123–137, 145–149
  4. Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugenics 7:179–188
  5. Friedman JH, Meulman JJ (2004) Clustering objects on subsets of attributes. J Royal Stat Soc B 66(4):815–849, URL http://www-stat.stanford.edu/~jhf/ftp/cosa.pdf
  6. Gower JC (1971) A general coefficient of similarity and some of its properties. Biometrics 27:857–871
  7. Hartigan JA, Hartigan PM (1985) The dip test of unimodality. Ann Statist 13:70–84
  8. Hennig C (2009) Merging Gaussian mixture components - an overview. In: Mucha HJ, Ritter G (eds) Classification and clustering: Models, software and applications. Report 26, WIAS, Berlin, pp 80–89
  9. Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice Hall, New Jersey
  10. Kaufman L, Rousseeuw PJ (1986) Clustering large data sets. In: Gelsema ES, Kanal LN (eds) Pattern recognition in Practice II (with discussion). Elsevier/North-Holland, pp 425–437
  11. Kaufman L, Rousseeuw PJ (1990) Finding groups in data. Wiley, New York
  12. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Le Cam LM, Neyman J (eds) Proc. 5th Berkeley Symp. Math. Statist. Prob., Univ. California Press, Berkeley, vol 1, pp 281–297
  13. Mucha HJ (1992) Clusteranalyse mit Mikrocomputern. Akademie Verlag, Berlin
  14. Mucha HJ (1995) Clustering in an interactive way. Discussion paper no. 13, Tech. Rep. Sfb 373, Humboldt-Universität, Berlin
  15. Mucha HJ (2009) ClusCorr98 for Excel 2007: clustering, multivariate visualization, and validation. In: Mucha HJ, Ritter G (eds) Classification and clustering: Models, software and applications. Report 26, WIAS, Berlin, p 40
  16. Mucha HJ, Klinke S (1993) Clustering techniques in the interactive statistical computing environment XploRe. Tech. Rep. 9318, Institut de Statistique, Université Catholique de Louvain, Louvain-la-Neuve
  17. Mucha HJ, Sofyan H (2000) Cluster analysis. Discussion paper no. 49, Tech. Rep. Sfb 373, Humboldt-Universität, Berlin
  18. Mucha HJ, Simon U, Brüggemann R (2002) Model-based cluster analysis applied to flow cytometry data of phytoplankton. Tech. Rep. 5, WIAS, Berlin, URL http://www.wias-berlin.de/
  19. Mucha HJ, Bartel HG, Dolata J (2003) Core-based clustering techniques. In: Schader M, Gaul W, Vichi M (eds) Between data science and applied data analysis. Springer, Berlin, pp 74–82
  20. Murtagh F (2009) The remarkable simplicity of very high dimensional data: application of model-based clustering. J Classification 26:249–277
  21. Späth H (1980) Cluster analysis algorithms for data reduction and classification of objects. Ellis Horwood, Chichester
  22. Späth H (1985) Cluster dissection and analysis. Ellis Horwood, Chichester
  23. Steinhaus H (1956) Sur la division des corps matériels en parties. Bull de l’Académie Polonaise des Sci IV(12):801–804

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  1. Weierstrass Institute, Berlin, Germany
  2. Department of Chemistry, Humboldt University Berlin, Berlin, Germany
