Abstract
Clustering is a commonly used method for exploring and analysing data, where the primary objective is to categorise observations into similar clusters. In recent decades, several algorithms and methods have been developed for analysing clustered data. We observe that most of these techniques deterministically define a cluster based on attribute values, distance, and density in homogeneous, single-featured datasets. However, these definitions fail to attach clear semantic meaning to the clusters produced. Evolutionary operators and statistical and multidisciplinary techniques may help to generate meaningful clusters. Based on this premise, we propose a new evolutionary clustering algorithm (ECA*) based on social class ranking and meta-heuristic algorithms for stochastically analysing heterogeneous, multifeatured datasets. ECA* integrates recombinational evolutionary operators, Levy flight optimisation, statistical techniques such as quartiles and percentiles, and the Euclidean distance measure of the K-means algorithm. Experiments are conducted to evaluate ECA* against five conventional approaches: K-means (KM), K-means++ (KM++), expectation maximisation (EM), learning vector quantisation (LVQ), and the genetic algorithm for clustering++ (GENCLUST++). To that end, 32 heterogeneous, multifeatured datasets are used to examine their performance using internal, external, and basic statistical clustering performance measures, and to assess how sensitive their performance is to five features of these datasets (cluster overlap, number of clusters, cluster dimensionality, cluster structure, and cluster shape) within an operational framework. The results indicate that ECA* surpasses its counterpart techniques in its ability to find the right clusters.
Significantly, ECA* is less sensitive than its counterpart techniques to the five dataset properties mentioned above. Accordingly, ordered from best to worst overall performance, the algorithms rank as ECA*, EM, KM++, KM, LVQ, and GENCLUST++. The overall performance rank of ECA* across the 32 datasets and the five dataset features above is 1.1 (where a rank of 1 denotes the best-performing algorithm and a rank of 6 the worst).
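The abstract names ECA*'s building blocks: recombinational evolutionary operators, Levy flight optimisation, percentile-based statistics, and the K-means Euclidean objective. The sketch below is a minimal, hypothetical illustration of how such components can be combined in an evolutionary clustering loop. It is not the authors' ECA*: the social-class-ranking scheme is omitted, and the operator details, selection rule, and all parameter values here are assumptions made for illustration only.

```python
import numpy as np
from math import gamma, sin, pi

def levy_step(shape, beta=1.5, rng=None):
    """Mantegna-style Levy-flight step: a heavy-tailed random perturbation."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = (gamma(1 + beta) * sin(pi * beta / 2)
             / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, shape)
    v = rng.normal(0.0, 1.0, shape)
    return u / np.abs(v) ** (1 / beta)

def sse(X, centroids):
    """K-means objective: total squared Euclidean distance to the nearest centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).sum()

def evolve_clusters(X, k=3, pop_size=20, gens=40, step=0.1, seed=0):
    """Toy evolutionary clustering: percentile selection + crossover + Levy mutation."""
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    # Each individual is a set of k candidate centroids, seeded from the data.
    pop = [X[rng.choice(n, k, replace=False)] for _ in range(pop_size)]
    for _ in range(gens):
        fitness = np.array([sse(X, c) for c in pop])
        # Percentile-based selection: keep individuals at or below the median fitness.
        parents = [c for c, f in zip(pop, fitness) if f <= np.percentile(fitness, 50)]
        children = []
        while len(parents) + len(children) < pop_size:
            i, j = rng.choice(len(parents), size=2, replace=False)
            # Recombination: each centroid is inherited from one of the two parents.
            mask = rng.random(k) < 0.5
            child = np.where(mask[:, None], parents[i], parents[j])
            # Levy-flight mutation: occasional long jumps help escape local optima.
            children.append(child + step * levy_step((k, dim), rng=rng))
        pop = parents + children
    best = min(pop, key=lambda c: sse(X, c))
    labels = ((X[:, None, :] - best[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)
    return best, labels
```

The percentile-based survivor selection echoes the quartile/percentile component the abstract mentions, while the Levy-flight mutation supplies the stochastic exploration that distinguishes this family of methods from deterministic K-means refinement.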
Acknowledgements
The authors would like to thank the referees for their remarkable suggestions, which have significantly improved this paper's technical content. The authors also wish to express sincere thanks to the Kurdistan Institution for Strategic Studies and Scientific Research and the University of Kurdistan Hewler for providing facilities and continuous support in conducting this study.
Funding
No funding was received.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Hassan, B.A., Rashid, T.A. A multidisciplinary ensemble algorithm for clustering heterogeneous datasets. Neural Comput & Applic 33, 10987–11010 (2021). https://doi.org/10.1007/s00521-020-05649-1