An Accelerated K-Means Algorithm Based on Adaptive Distances

Mucha, Hans-Joachim; Bartel, Hans-Georg

doi:10.1007/978-3-642-24466-7_5

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

Widely used cluster analysis methods such as K-means and spectral clustering require a measure of (pairwise) distance on the multivariate space. Unfortunately, distances often depend on the scales of the variables, which can become a crucial point in applications. Here we propose an accelerated K-means technique that consists of two steps. First, an appropriate weighted Euclidean distance is established on the multivariate space. This step is based on univariate assessments of the importance of the variables for the cluster analysis task. It additionally gives a crude idea of the minimum number of clusters K. Second, a fast K-means step based on random sampling follows. It is especially suited to data reduction of massive data sets. From a theoretical point of view, it resembles MacQueen’s idea of clustering data over a continuous space. The main difference, however, is that our algorithm examines only a random sample in a single pass. The proposed algorithm is used to solve a segmentation problem in an application to ecology.
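The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the variable weights are derived or how the centers are updated. The sketch therefore assumes inverse-variance weights as a simple stand-in for the adaptive weights, and a MacQueen-style running-mean update applied once to each observation of a random sample. All function names (`weighted_sq_dist`, `inverse_variance_weights`, `nearest`, `sample_kmeans`) are hypothetical.

```python
import random
from statistics import pvariance


def weighted_sq_dist(x, c, w):
    """Squared weighted Euclidean distance between observation x and center c."""
    return sum(wj * (xj - cj) ** 2 for wj, xj, cj in zip(w, x, c))


def inverse_variance_weights(rows):
    """Assumed stand-in for the adaptive weights: 1 / variance per variable."""
    variances = [pvariance(col) for col in zip(*rows)]
    # Constant variables get weight 1.0 to avoid division by zero.
    return [1.0 / v if v > 0 else 1.0 for v in variances]


def nearest(x, centers, w):
    """Index of the center closest to x under the weighted distance."""
    return min(range(len(centers)),
               key=lambda j: weighted_sq_dist(x, centers[j], w))


def sample_kmeans(rows, k, sample_size, w=None, seed=0):
    """Single pass of online K-means over a random sample of the rows."""
    if w is None:
        w = inverse_variance_weights(rows)
    rng = random.Random(seed)
    m = min(sample_size, len(rows))
    if m < k:
        raise ValueError("sample must contain at least k observations")
    sample = rng.sample(rows, m)          # sampling without replacement
    centers = [list(x) for x in sample[:k]]  # first k sampled points seed the centers
    counts = [1] * k
    for x in sample[k:]:                  # every other sampled point is seen once
        j = nearest(x, centers, w)
        counts[j] += 1
        # Running-mean update of the winning center.
        centers[j] = [c + (xj - c) / counts[j] for c, xj in zip(centers[j], x)]
    return centers, counts, w
```

Under these assumptions, the weights matter because they can change which center is nearest: a variable measured on a large scale would otherwise dominate the unweighted distance. The returned centers summarize the sample and can then be used to assign the full data set in one further pass with `nearest`.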

References

Bradley P, Fayyad U, Reina C (1998) Scaling clustering algorithms to large databases. Tech. rep., Microsoft Research
Faber V (1994) Clustering and the continuous k-means algorithm. Los Alamos Sci 22:138–144
Faber V, Hochberg JG, Kelly PM, Thomas TR, White JM (1994) Concept extraction. A data-mining technique. Los Alamos Sci 22:123–137, 145–149
Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugenics 7:179–188
Friedman JH, Meulman JJ (2004) Clustering objects on subsets of attributes. J Royal Stat Soc B 66(4):815–849, URL http://www-stat.stanford.edu/~jhf/ftp/cosa.pdf
Gower JC (1971) A general coefficient of similarity and some of its properties. Biometrics 27:857–871
Hartigan JA, Hartigan PM (1985) The dip test of unimodality. Ann Statist 13:70–84
Hennig C (2009) Merging Gaussian mixture components - an overview. In: Mucha HJ, Ritter G (eds) Classification and clustering: Models, software and applications. Report 26, WIAS, Berlin, pp 80–89
Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice Hall, New Jersey
Kaufman L, Rousseeuw PJ (1986) Clustering large data sets. In: Gelsema ES, Kanal LN (eds) Pattern recognition in Practice II (with discussion). Elsevier/North-Holland, pp 425–437
Kaufman L, Rousseeuw PJ (1990) Finding groups in data. Wiley, New York
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Le Cam LM, Neyman J (eds) Proc. 5th Berkeley Symp. Math. Statist. Prob., Univ. California Press, Berkeley, vol 1, pp 281–297
Mucha HJ (1992) Clusteranalyse mit Mikrocomputern. Akademie Verlag, Berlin
Mucha HJ (1995) Clustering in an interactive way. Discussion paper no. 13. Tech. Rep. Sfb 373, Humboldt-Universität, Berlin
Mucha HJ (2009) ClusCorr98 for Excel 2007: clustering, multivariate visualization, and validation. In: Mucha HJ, Ritter G (eds) Classification and clustering: Models, software and applications. WIAS, Berlin, 26, pp 40–40
Mucha HJ, Klinke S (1993) Clustering techniques in the interactive statistical computing environment XploRe. Tech. Rep. 9318, Institut de Statistique, Université Catholique de Louvain, Louvain-la-Neuve
Mucha HJ, Sofyan H (2000) Cluster analysis. Discussion paper no. 49. Tech. Rep. Sfb 373, Humboldt-Universität, Berlin
Mucha HJ, Simon U, Brüggemann R (2002) Model-based cluster analysis applied to flow cytometry data of phytoplankton. Tech. Rep. 5, WIAS, Berlin, URL http://www.wias-berlin.de/
Mucha HJ, Bartel HG, Dolata J (2003) Core-based clustering techniques. In: Schader M, Gaul W, Vichi M (eds) Between data science and applied data analysis. Springer, Berlin, pp 74–82
Murtagh F (2009) The remarkable simplicity of very high dimensional data: application of model-based clustering. J Classification 26:249–277
Späth H (1980) Cluster analysis algorithms for data reduction and classification of objects. Ellis Horwood, Chichester
Späth H (1985) Cluster dissection and analysis. Ellis Horwood, Chichester
Steinhaus H (1956) Sur la division des corps matériels en parties. Bull de l’Académie Polonaise des Sci IV(12):801–804

Author information

Authors and Affiliations

Weierstrass Institute, Mohrenstraße 39, 10117, Berlin, Germany
Hans-Joachim Mucha
Department of Chemistry, Humboldt University Berlin, Brook-Taylor-Straße 2, 12489, Berlin, Germany
Hans-Georg Bartel

Corresponding author

Correspondence to Hans-Joachim Mucha.

Editor information

Editors and Affiliations

Fak. Wirtschaftswissenschaften, Inst. Entscheidungstheorieund, Universität Karlsruhe (TH), Kaiserstr. 12, Karlsruhe, 76128, Germany
Wolfgang A. Gaul
Institute for Information Systems and Management (IISM), Karlsruhe Institute of Technology (KIT), Kaiserstr. 12, Karlsruhe, 76131, Baden-Württemberg, Germany
Andreas Geyer-Schulz
Information Systems, University of Hildesheim, Marienburger Platz 22, Hildesheim, 31141, Germany
Lars Schmidt-Thieme
Institute for Information Systems and Management (IISM), Karlsruhe Institute of Technology (KIT), Kaiserstraße 12, Karlsruhe, 76128, Germany
Jonas Kunze

Cite this paper

Mucha, HJ., Bartel, HG. (2012). An Accelerated K-Means Algorithm Based on Adaptive Distances. In: Gaul, W., Geyer-Schulz, A., Schmidt-Thieme, L., Kunze, J. (eds) Challenges at the Interface of Data Analysis, Computer Science, and Optimization. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24466-7_5

DOI: https://doi.org/10.1007/978-3-642-24466-7_5
Published: 05 January 2012
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24465-0
Online ISBN: 978-3-642-24466-7
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
