Abstract
Clustering is a commonly used method for exploring and analysing data, where the primary objective is to categorise observations into similar clusters. In recent decades, several algorithms and methods have been developed for analysing clustered data. We observe that most of these techniques deterministically define a cluster based on attribute values, distance, and density in homogeneous, single-featured datasets. However, these definitions fail to attach clear semantic meaning to the clusters produced. Evolutionary operators and statistical and multidisciplinary techniques may help to generate meaningful clusters. Based on this premise, we propose a new evolutionary clustering algorithm (ECA*) based on social class ranking and meta-heuristic algorithms for stochastically analysing heterogeneous, multifeatured datasets. ECA* integrates recombinational evolutionary operators, Levy flight optimisation, statistical techniques such as quartiles and percentiles, and the Euclidean distance measure of the K-means algorithm. Experiments are conducted to evaluate ECA* against five conventional approaches: K-means (KM), K-means++ (KM++), expectation maximisation (EM), learning vector quantisation (LVQ), and the genetic algorithm for clustering++ (GENCLUST++). To that end, 32 heterogeneous, multifeatured datasets are used to examine their performance using internal, external, and basic statistical clustering performance measures, and to assess how sensitive their performance is to five features of these datasets (cluster overlap, number of clusters, cluster dimensionality, cluster structure, and cluster shape) within an operational framework. The results indicate that ECA* surpasses its counterpart techniques in its ability to find the right clusters.
Significantly, ECA* is less sensitive than its counterpart techniques to the five dataset properties mentioned above. Accordingly, ordered from best to worst overall performance, the algorithms rank as ECA*, EM, KM++, KM, LVQ, and GENCLUST++. The overall performance rank of ECA* across the 32 datasets and the five dataset features above is 1.1 (where a rank of 1 denotes the best-performing algorithm and a rank of 6 the worst).
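The abstract names ECA*'s building blocks: recombinational evolutionary operators, Levy flight optimisation, percentile-based statistics, and the K-means Euclidean objective. The sketch below is a minimal, hypothetical illustration of how such components can be combined in an evolutionary clustering loop. It is not the authors' ECA*: the social-class-ranking scheme is omitted, and the operator details, selection rule, and all parameter values here are assumptions made for illustration only.

```python
import numpy as np
from math import gamma, sin, pi

def levy_step(shape, beta=1.5, rng=None):
    """Mantegna-style Levy-flight step: a heavy-tailed random perturbation."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = (gamma(1 + beta) * sin(pi * beta / 2)
             / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, shape)
    v = rng.normal(0.0, 1.0, shape)
    return u / np.abs(v) ** (1 / beta)

def sse(X, centroids):
    """K-means objective: total squared Euclidean distance to the nearest centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).sum()

def evolve_clusters(X, k=3, pop_size=20, gens=40, step=0.1, seed=0):
    """Toy evolutionary clustering: percentile selection + crossover + Levy mutation."""
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    # Each individual is a set of k candidate centroids, seeded from the data.
    pop = [X[rng.choice(n, k, replace=False)] for _ in range(pop_size)]
    for _ in range(gens):
        fitness = np.array([sse(X, c) for c in pop])
        # Percentile-based selection: keep individuals at or below the median fitness.
        parents = [c for c, f in zip(pop, fitness) if f <= np.percentile(fitness, 50)]
        children = []
        while len(parents) + len(children) < pop_size:
            i, j = rng.choice(len(parents), size=2, replace=False)
            # Recombination: each centroid is inherited from one of the two parents.
            mask = rng.random(k) < 0.5
            child = np.where(mask[:, None], parents[i], parents[j])
            # Levy-flight mutation: occasional long jumps help escape local optima.
            children.append(child + step * levy_step((k, dim), rng=rng))
        pop = parents + children
    best = min(pop, key=lambda c: sse(X, c))
    labels = ((X[:, None, :] - best[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)
    return best, labels
```

The percentile-based survivor selection echoes the quartile/percentile component the abstract mentions, while the Levy-flight mutation supplies the stochastic exploration that distinguishes this family of methods from deterministic K-means refinement.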
Acknowledgements
The authors would like to thank the referees for their remarkable suggestions, which have significantly improved this paper's technical content. The authors also wish to express sincere thanks to the Kurdistan Institution for Strategic Studies and Scientific Research and the University of Kurdistan Hewler for providing facilities and continuous support in conducting this study.
Funding
No funding was received.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Hassan, B.A., Rashid, T.A. A multidisciplinary ensemble algorithm for clustering heterogeneous datasets. Neural Comput & Applic 33, 10987–11010 (2021). https://doi.org/10.1007/s00521-020-05649-1