Abstract
Metaheuristic optimization (MO) algorithms have been widely studied for clustering probability density functions (CDF) with a given number of clusters, because they are well suited to searching for the global optimum. However, existing approaches cannot be directly extended to the automatic CDF problem, in which the number of clusters k must also be determined. In addition, balance-driven clustering, a research direction recently developed for discrete-element clustering, has not yet been considered in the field of CDF. This paper is the first to apply an MO algorithm to balance-driven automatic CDF. The proposed method automatically determines the number of clusters and approximates a globally optimal solution that accounts for both clustering compactness and the similarity of cluster sizes. Experiments on one-dimensional and multidimensional probability density functions show that the new method yields higher-quality clustering solutions than conventional techniques. The proposed method is also applied to analyzing the difficulty levels of entrance exam questions.
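To make the idea of scoring a partition on both compactness and cluster-size similarity concrete, the following is a minimal illustrative sketch. It is not the objective function defined in the paper: the L1 distance on a grid, the pointwise-mean representative, the squared size-deviation penalty, the `weight` parameter, and all function names are assumptions chosen for illustration only. An MO algorithm would search over label vectors (with varying numbers of distinct labels) to minimize a score of this kind.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def l1_distance(f, g, dx):
    """Riemann-sum approximation of the L1 distance between two gridded pdfs."""
    return sum(abs(a - b) for a, b in zip(f, g)) * dx


def balance_driven_objective(pdfs, labels, dx, weight=1.0):
    """Hypothetical score: compactness plus a penalty on unequal cluster sizes.

    The number of clusters is implied by the number of distinct labels,
    so partitions with different k can be compared with the same function.
    """
    clusters = {}
    for pdf, label in zip(pdfs, labels):
        clusters.setdefault(label, []).append(pdf)

    compactness = 0.0
    for members in clusters.values():
        # Representative pdf of a cluster: pointwise mean of its members (assumption).
        rep = [sum(vals) / len(members) for vals in zip(*members)]
        compactness += sum(l1_distance(f, rep, dx) for f in members)

    # Balance term: squared deviation of each cluster size from the ideal equal size.
    ideal = len(pdfs) / len(clusters)
    imbalance = sum((len(m) - ideal) ** 2 for m in clusters.values())
    return compactness + weight * imbalance


# Six normal pdfs forming two well-separated groups, evaluated on a common grid.
dx = 0.05
grid = [-6 + i * dx for i in range(461)]
means = [0.0, 0.5, 1.0, 10.0, 10.5, 11.0]
pdfs = [[normal_pdf(x, mu, 1.0) for x in grid] for mu in means]

balanced = balance_driven_objective(pdfs, [0, 0, 0, 1, 1, 1], dx)
unbalanced = balance_driven_objective(pdfs, [0, 0, 0, 0, 0, 1], dx)
single = balance_driven_objective(pdfs, [0, 0, 0, 0, 0, 0], dx)
print(balanced < unbalanced, balanced < single)
```

In this toy setting the two-cluster balanced partition scores lower than both the unbalanced two-cluster partition and the single-cluster partition, which is the kind of trade-off a balance-driven automatic method is meant to capture.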
Ethics declarations
Conflict of interest
No potential conflict of interest was reported by the authors.
About this article
Cite this article
Nguyen-Trang, T., Nguyen-Thoi, T., Nguyen-Thi, KN. et al. Balance-driven automatic clustering for probability density functions using metaheuristic optimization. Int. J. Mach. Learn. & Cyber. 14, 1063–1078 (2023). https://doi.org/10.1007/s13042-022-01683-8
DOI: https://doi.org/10.1007/s13042-022-01683-8