Abstract
Widely used cluster analysis methods such as K-means and spectral clustering require a measure of (pairwise) distance on the multivariate space. Unfortunately, distances often depend on the scales of the variables, which can become a crucial point in applications. Here we propose an accelerated K-means technique that consists of two steps. First, an appropriate weighted Euclidean distance is established on the multivariate space, based on univariate assessments of the importance of the variables for the cluster analysis task. As a by-product, this step also yields a crude lower bound on the number of clusters K. Subsequently, a fast K-means step follows, based on random sampling; it is especially suited to the data reduction of massive data sets. From a theoretical point of view, the approach resembles MacQueen's idea of clustering data over a continuous space. The main difference, however, is that our algorithm examines only a random sample in a single pass. The proposed algorithm is used to solve a segmentation problem in an application to ecology.
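The two-step scheme described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: the function name `weighted_kmeans_sample` and the choice of caller-supplied variable weights are assumptions, and the single-pass update follows the MacQueen-style running-mean idea mentioned in the abstract.

```python
import numpy as np

def weighted_kmeans_sample(X, k, weights, sample_size, seed=0):
    """Sketch: K-means over a random sample under a weighted
    Euclidean distance, updating centroids in a single pass.

    weights  -- per-variable importance weights (step 1 of the
                method; how they are derived is not shown here).
    """
    rng = np.random.default_rng(seed)
    # Fold the weights into the coordinates: the ordinary Euclidean
    # distance on X * sqrt(w) equals the weighted distance on X.
    w = np.sqrt(np.asarray(weights, dtype=float))
    Xw = X * w
    # Step 2: draw a random sample once; the full data set is never
    # revisited.
    idx = rng.choice(len(Xw), size=sample_size, replace=False)
    sample = Xw[idx]
    # Seed the k centroids with the first k sample points.
    centroids = sample[:k].copy()
    counts = np.ones(k)
    # Single pass over the remaining sample points (MacQueen-style):
    # assign each point to its nearest centroid and update that
    # centroid as a running mean.
    for x in sample[k:]:
        j = np.argmin(((centroids - x) ** 2).sum(axis=1))
        counts[j] += 1
        centroids[j] += (x - centroids[j]) / counts[j]
    # Map centroids back to the original variable scales.
    return centroids / w
```

A usage note: because only `sample_size` points are touched and each is visited once, the cost is O(sample_size * k * p) for p variables, which is what makes the method attractive for reducing massive data sets.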
References
Bradley P, Fayyad U, Reina C (1998) Scaling clustering algorithms to large databases. Tech. rep., Microsoft Research
Faber V (1994) Clustering and the continuous k-means algorithm. Los Alamos Sci 22:138–144
Faber V, Hochberg JG, Kelly PM, Thomas TR, White JM (1994) Concept extraction. A data-mining technique. Los Alamos Sci 22:123–137, 145–149
Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugenics 7:179–188
Friedman JH, Meulman JJ (2004) Clustering objects on subsets of attributes. J Royal Stat Soc B 66(4):815–849, URL http://www-stat.stanford.edu/~jhf/ftp/cosa.pdf
Gower JC (1971) A general coefficient of similarity and some of its properties. Biometrics 27:857–871
Hartigan JA, Hartigan PM (1985) The dip test of unimodality. Ann Statist 13:70–84
Hennig C (2009) Merging Gaussian mixture components - an overview. In: Mucha HJ, Ritter G (eds) Classification and clustering: Models, software and applications. Report 26, WIAS, Berlin, pp 80–89
Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice Hall, New Jersey
Kaufman L, Rousseeuw PJ (1986) Clustering large data sets. In: Gelsema ES, Kanal LN (eds) Pattern recognition in Practice II (with discussion). Elsevier/North-Holland, pp 425–437
Kaufman L, Rousseeuw PJ (1990) Finding groups in data. Wiley, New York
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Le Cam LM, Neyman J (eds) Proc. 5th Berkeley Symp. Math. Statist. Prob., Univ. California Press, Berkeley, vol 1, pp 281–297
Mucha HJ (1992) Clusteranalyse mit Mikrocomputern. Akademie Verlag, Berlin
Mucha HJ (1995) Clustering in an interactive way. Discussion paper no. 13. Tech. Rep. Sfb 373, Humboldt-Universität, Berlin
Mucha HJ (2009) ClusCorr98 for Excel 2007: clustering, multivariate visualization, and validation. In: Mucha HJ, Ritter G (eds) Classification and clustering: Models, software and applications. WIAS, Berlin, 26, p 40
Mucha HJ, Klinke S (1993) Clustering techniques in the interactive statistical computing environment XploRe. Tech. Rep. 9318, Institut de Statistique, Université Catholique de Louvain, Louvain-la-Neuve
Mucha HJ, Sofyan H (2000) Cluster analysis. Discussion paper no. 49. Tech. Rep. Sfb 373, Humboldt-Universität, Berlin
Mucha HJ, Simon U, Brüggemann R (2002) Model-based cluster analysis applied to flow cytometry data of phytoplankton. Tech. Rep. 5, WIAS, Berlin, URL http://www.wias-berlin.de/
Mucha HJ, Bartel HG, Dolata J (2003) Core-based clustering techniques. In: Schader M, Gaul W, Vichi M (eds) Between data science and applied data analysis. Springer, Berlin, pp 74–82
Murtagh F (2009) The remarkable simplicity of very high dimensional data: application of model-based clustering. J Classification 26:249–277
Späth H (1980) Cluster analysis algorithms for data reduction and classification of objects. Ellis Horwood, Chichester
Späth H (1985) Cluster dissection and analysis. Ellis Horwood, Chichester
Steinhaus H (1956) Sur la division des corps matériels en parties. Bull de l’Académie Polonaise des Sci IV(12):801–804
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Mucha, HJ., Bartel, HG. (2012). An Accelerated K-Means Algorithm Based on Adaptive Distances. In: Gaul, W., Geyer-Schulz, A., Schmidt-Thieme, L., Kunze, J. (eds) Challenges at the Interface of Data Analysis, Computer Science, and Optimization. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24466-7_5
DOI: https://doi.org/10.1007/978-3-642-24466-7_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24465-0
Online ISBN: 978-3-642-24466-7
eBook Packages: Mathematics and Statistics (R0)