Advertisement

Computational Statistics

, Volume 34, Issue 1, pp 373–394 | Cite as

An extension of the K-means algorithm to clustering skewed data

  • Volodymyr MelnykovEmail author
  • Xuwen Zhu
Original Paper
  • 99 Downloads

Abstract

Grouping similar objects into common groups, also known as clustering, is an important problem of unsupervised machine learning. Various clustering algorithms have been proposed in literature. In recent years, the need to analyze large amounts of data has led to reconsidering some fundamental clustering procedures. One of them is the celebrated K-means algorithm popular among practitioners due to its speedy performance and appealingly intuitive construction. Unfortunately, the algorithm often shows poor performance unless data groups have spherical shapes and approximately same sizes. In many applications, this restriction is so severe that the use of the K-means algorithm becomes questionable, misleading, or simply incorrect. We propose an extension of K-means that preserves the speed and intuitive interpretation of the original algorithm while providing greater flexibility in modeling clusters. The idea of the proposed generalization relies on the exponential transformation of Manly originally designed to obtain near-normally distributed data. The suggested modification is derived and illustrated on several datasets with good results.

Keywords

Exponential transformation CEM algorithm Cluster analysis Skewness 

Notes

Acknowledgements

The research is partially funded by the University of Louisville EVPRI internal research grant from the Office of the Executive Vice President for Research and Innovation.

Supplementary material

180_2018_821_MOESM1_ESM.pdf (222 kb)
Supplementary material 1 (pdf 221 KB)

References

  1. Altman EI (1968) Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. J Finance 23(4):589–609CrossRefGoogle Scholar
  2. Anderson E (1935) The Irises of the Gaspe peninsula. Bull Am Iris Soc 59:2–5Google Scholar
  3. Andrews DF, Gnanadesikan R, Warner JL (1971) Transformations of multivariate data. Biometrics 27(4):825–840CrossRefGoogle Scholar
  4. Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49:803–821MathSciNetCrossRefzbMATHGoogle Scholar
  5. Box G, Cox DR (1964) An analysis of transformations. J R Stat Soc B 26:211–252zbMATHGoogle Scholar
  6. Brock G, Pihur V, Datta S, Datta S (2008) clValid: an R package for cluster validation. J Stat Softw 25(4):1–22CrossRefGoogle Scholar
  7. Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14:315–332MathSciNetCrossRefzbMATHGoogle Scholar
  8. Dasgupta S (1999) Learning mixtures of Gaussians. In: Proceedings of IEEE symposium on foundations of computer science, New York, pp 633–644Google Scholar
  9. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood for incomplete data via the EM algorithm (with discussion). J R Stat Soc Ser B 39:1–38zbMATHGoogle Scholar
  10. Dheeru D, Karra Taniskidou E (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 01 Jan 2017
  11. Efron B, Tibshirani R, Storey J, Tusher V (2001) Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 96:1151–1160MathSciNetCrossRefzbMATHGoogle Scholar
  12. Fisher RA (1936) The use of multiple measurements in taxonomic poblems. Ann Eugen 7:179–188CrossRefGoogle Scholar
  13. Forgy E (1965) Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics 21:768–780Google Scholar
  14. Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218CrossRefzbMATHGoogle Scholar
  15. Kahraman HT, Sagiroglu S, Colak I (2013) Developing intuitive knowledge classifier and modeling of users’ domain dependent data in web. Knowl Based Syst 37:283–295CrossRefGoogle Scholar
  16. Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New YorkCrossRefzbMATHGoogle Scholar
  17. Lee SX, McLachlan GJ (2013) Model-based clustering and classification with non-normal mixture distributions. Stat Methods Appl 22(4):427–454MathSciNetCrossRefzbMATHGoogle Scholar
  18. Maitra R, Melnykov V (2010) Simulating data to study performance of finite mixture modeling and clustering algorithms. J Comput Graph Stat 19(2):354–376MathSciNetCrossRefGoogle Scholar
  19. Manly BFJ (1976) Exponential data transformations. Biom Unit 25:37–42Google Scholar
  20. McLachlan G, Peel D (2000) Finite mixture models. Wiley, New YorkCrossRefzbMATHGoogle Scholar
  21. Melnykov I, Melnykov V (2014) On k-means algorithm with the use of Mahalanobis distances. Stat Probab Lett 84:88–95MathSciNetCrossRefzbMATHGoogle Scholar
  22. Melnykov V (2013a) Challenges in model-based clustering. WIREs Comput Stat 5:135–148CrossRefGoogle Scholar
  23. Melnykov V (2013b) Finite mixture modelling in mass spectrometry analysis. J R Stat Soc Ser C 62:573–592MathSciNetCrossRefGoogle Scholar
  24. Melnykov V, Melnykov I (2012) Initializing the EM algorithm in Gaussian mixture models with an unknown number of components. Comput Stat Data Anal 56:1381–1395MathSciNetCrossRefzbMATHGoogle Scholar
  25. Melnykov V, Shen G (2013) Clustering through empirical likelihood ratio. Comput Stat Data Anal 62:1–10MathSciNetCrossRefzbMATHGoogle Scholar
  26. Melnykov V, Chen WC, Maitra R (2012) MixSim: R package for simulating datasets with pre-specified clustering complexity. J Stat Softw 51:1–25CrossRefGoogle Scholar
  27. Murphy KP (2012) Machine learning: a probabilistic perspective. MIT Press, CambridgezbMATHGoogle Scholar
  28. Schlattmann P (2009) Medical applications of finite mixture models. Springer, BerlinzbMATHGoogle Scholar
  29. Sokal R, Michener C (1958) A statistical method for evaluating systematic relationships. Univ Kans Sci Bull 38:1409–1438Google Scholar
  30. Tortora C, Franczak BC, Browne RP, McNicholas PD (2014) A mixture of coalesced generalized hyperbolic distributions. arXiv:1403.2332
  31. Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58:236–244MathSciNetCrossRefGoogle Scholar
  32. Zhu X, Melnykov V (2017) ManlyMix: an R package for manly mixture modeling. R J 9(2):176–197Google Scholar

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. 1.University of AlabamaTuscaloosaUSA
  2. 2.University of LouisvilleLouisvilleUSA

Personalised recommendations