An Accelerated K-Means Algorithm Based on Adaptive Distances

  • Conference paper
Challenges at the Interface of Data Analysis, Computer Science, and Optimization

Abstract

Widely used cluster analysis methods such as K-means and spectral clustering require some measure of (pairwise) distance on the multivariate space. Unfortunately, distances often depend on the scales of the variables, which can become a crucial point in applications. Here we propose an accelerated K-means technique that consists of two steps. First, an appropriate weighted Euclidean distance is established on the multivariate space. This step is based on univariate assessments of the importance of each variable for the cluster analysis task, and it also gives a crude lower bound on the number of clusters K. Second, a fast K-means step based on random sampling follows, which is especially suited to data reduction of massive data sets. From a theoretical point of view, it resembles MacQueen’s idea of clustering data over a continuous space. The main difference is that our algorithm examines only a random sample, in a single pass. The proposed algorithm is used to solve a segmentation problem in an ecological application.
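
The two steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the variable weights are derived, so inverse-variance weights are used here purely as a stand-in assumption, and all function names (`inverse_variance_weights`, `weighted_sq_dist`, `sampled_kmeans`, `assign`) are hypothetical. The second step uses a MacQueen-style running-mean update applied once to each point of a random sample.

```python
import random


def inverse_variance_weights(data):
    """One weight per variable: 1 / variance.

    ASSUMPTION: a placeholder for the paper's univariate importance
    assessments, which the abstract does not specify.
    """
    n = len(data)
    weights = []
    for j in range(len(data[0])):
        col = [row[j] for row in data]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        weights.append(1.0 / var if var > 0 else 0.0)
    return weights


def weighted_sq_dist(x, y, weights):
    """Squared weighted Euclidean distance: sum_j w_j * (x_j - y_j)^2."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))


def sampled_kmeans(data, k, sample_size, weights, seed=0):
    """Single pass over a random sample with running-mean centroid updates."""
    rng = random.Random(seed)
    sample = rng.sample(data, min(sample_size, len(data)))
    if len(sample) < k:
        raise ValueError("sample must contain at least k points")
    # The first k sampled points serve as initial centroids.
    centroids = [list(p) for p in sample[:k]]
    counts = [1] * k
    for x in sample[k:]:
        # Nearest centroid under the weighted distance.
        j = min(range(k), key=lambda c: weighted_sq_dist(x, centroids[c], weights))
        counts[j] += 1
        step = 1.0 / counts[j]
        # Incremental mean: m <- m + (x - m) / n_j
        centroids[j] = [m + step * (a - m) for m, a in zip(centroids[j], x)]
    return centroids


def assign(data, centroids, weights):
    """Label every observation with the index of its nearest centroid."""
    k = len(centroids)
    return [
        min(range(k), key=lambda c: weighted_sq_dist(x, centroids[c], weights))
        for x in data
    ]


if __name__ == "__main__":
    gen = random.Random(42)
    # Two synthetic groups; the second variable has a much larger scale.
    points = [[gen.gauss(0, 1), gen.gauss(0, 100)] for _ in range(500)]
    points += [[gen.gauss(8, 1), gen.gauss(800, 100)] for _ in range(500)]
    w = inverse_variance_weights(points)
    centers = sampled_kmeans(points, k=2, sample_size=100, weights=w)
    labels = assign(points, centers, w)
    print(centers, labels[:5])
```

Only the sample is used to fit the centroids; the full data set is touched once afterwards to assign labels, which is where the data-reduction benefit for massive data sets would come from. The result depends on which sampled points happen to seed the centroids, so in practice one might repeat the run with several seeds.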

References

  • Bradley P, Fayyad U, Reina C (1998) Scaling clustering algorithms to large databases. Tech. rep., Microsoft Research

  • Faber V (1994) Clustering and the continuous k-means algorithm. Los Alamos Sci 22:138–144

  • Faber V, Hochberg JG, Kelly PM, Thomas TR, White JM (1994) Concept extraction. A data-mining technique. Los Alamos Sci 22:123–137, 145–149

  • Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugenics 7: 179–188

  • Friedman JH, Meulman JJ (2004) Clustering objects on subsets of attributes. J Royal Stat Soc B 66(4):815–849, URL http://www-stat.stanford.edu/~jhf/ftp/cosa.pdf

  • Gower JC (1971) A general coefficient of similarity and some of its properties. Biometrics 27: 857–871

  • Hartigan JA, Hartigan PM (1985) The dip test of unimodality. Ann Statist 13:70–84

  • Hennig C (2009) Merging Gaussian mixture components - an overview. In: Mucha HJ, Ritter G (eds) Classification and clustering: Models, software and applications. Report 26, WIAS, Berlin, pp 80–89

  • Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice Hall, New Jersey

  • Kaufman L, Rousseeuw PJ (1986) Clustering large data sets. In: Gelsema ES, Kanal LN (eds) Pattern recognition in Practice II (with discussion). Elsevier/North-Holland, pp 425–437

  • Kaufman L, Rousseeuw PJ (1990) Finding groups in data. Wiley, New York

  • MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Le Cam LM, Neyman J (eds) Proc. 5th Berkeley Symp. Math. Statist. Prob., Univ. California Press, Berkeley, vol 1, pp 281–297

  • Mucha HJ (1992) Clusteranalyse mit Mikrocomputern. Akademie Verlag, Berlin

  • Mucha HJ (1995) Clustering in an interactive way. discussion paper no. 13. Tech. Rep. Sfb 373, Humboldt-Universität, Berlin

  • Mucha HJ (2009) ClusCorr98 for Excel 2007: clustering, multivariate visualization, and validation. In: Mucha HJ, Ritter G (eds) Classification and clustering: Models, software and applications. WIAS, Berlin, 26, pp 40–40

  • Mucha HJ, Klinke S (1993) Clustering techniques in the interactive statistical computing environment XploRe. Tech. Rep. 9318, Institut de Statistique, Université Catholique de Louvain, Louvain-la-Neuve

  • Mucha HJ, Sofyan H (2000) Cluster analysis. discussion paper no. 49. Tech. Rep. Sfb 373, Humboldt-Universität, Berlin

  • Mucha HJ, Simon U, Brüggemann R (2002) Model-based cluster analysis applied to flow cytometry data of phytoplankton. Tech. Rep. 5, WIAS, Berlin, URL http://www.wias-berlin.de/

  • Mucha HJ, Bartel HG, Dolata J (2003) Core-based clustering techniques. In: Schader M, Gaul W, Vichi M (eds) Between data science and applied data analysis. Springer, Berlin, pp 74–82

  • Murtagh F (2009) The remarkable simplicity of very high dimensional data: application of model-based clustering. J Classification 26:249–277

  • Späth H (1980) Cluster analysis algorithms for data reduction and classification of objects. Ellis Horwood, Chichester

  • Späth H (1985) Cluster dissection and analysis. Ellis Horwood, Chichester

  • Steinhaus H (1956) Sur la division des corps matériels en parties. Bull de l’Académie Polonaise des Sci IV(12):801–804

Author information

Corresponding author

Correspondence to Hans-Joachim Mucha.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Mucha, HJ., Bartel, HG. (2012). An Accelerated K-Means Algorithm Based on Adaptive Distances. In: Gaul, W., Geyer-Schulz, A., Schmidt-Thieme, L., Kunze, J. (eds) Challenges at the Interface of Data Analysis, Computer Science, and Optimization. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24466-7_5
