
An entropy-based initialization method of K-means clustering on the optimal number of clusters

  • Original Article
Neural Computing and Applications

Abstract

Clustering is an unsupervised learning approach that groups similar data points according to a specific mathematical criterion, known as the objective function; every clustering algorithm optimizes some such objective. K-means is one of the most widely used partitional clustering algorithms, and its performance depends on two parameters: the choice of initial centers and the value of K. In this paper, we address both parameters together. We define an entropy-based objective function for the initialization step that outperforms other existing initialization methods for K-means clustering, and we also design an algorithm that determines the correct number of clusters in a dataset using several cluster validity indexes. The proposed entropy-based initialization algorithm is applied to various 2D and 3D datasets, and a comparison with other existing initialization methods is presented.


Notes

  1. The COUNT() function returns the number of cluster validity indexes that support a particular value of K (the number of clusters). Here, 2, 3, ..., c denote the candidate numbers of clusters, starting from K = 2.
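The footnote describes a voting scheme: each validity index nominates its preferred K, and the K backed by the most indexes wins. A minimal sketch of that tally, assuming hypothetical index names and votes (the actual indexes and values used in the paper are not reproduced here):

```python
from collections import Counter

def count_votes(preferred_k_by_index):
    """COUNT()-style tally: how many validity indexes support each K."""
    return Counter(preferred_k_by_index.values())

# each index maps to the K it prefers (illustrative values only)
votes = {"Dunn": 3, "Davies-Bouldin": 3,
         "Silhouette": 3, "Calinski-Harabasz": 4}
tally = count_votes(votes)
best_k = max(tally, key=tally.get)  # K with maximum index support
print(best_k)  # → 3
```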


Author information

Corresponding author

Correspondence to Kuntal Chowdhury.

Ethics declarations

Conflict of interest

The authors, Kuntal Chowdhury, Debasis Chaudhuri, and Arup Kumar Pal, declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Chowdhury, K., Chaudhuri, D. & Pal, A.K. An entropy-based initialization method of K-means clustering on the optimal number of clusters. Neural Comput & Applic 33, 6965–6982 (2021). https://doi.org/10.1007/s00521-020-05471-9
