
A multidisciplinary ensemble algorithm for clustering heterogeneous datasets

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

Clustering is a commonly used method for exploring and analysing data, where the primary objective is to group observations into similar clusters. In recent decades, several algorithms and methods have been developed for analysing clustered data. We observe that most of these techniques define a cluster deterministically, based on attribute values, distance, and density, for homogeneous, single-featured datasets. However, such definitions fail to add clear semantic meaning to the clusters produced. Evolutionary operators together with statistical and multidisciplinary techniques may help to generate meaningful clusters. Based on this premise, we propose a new evolutionary clustering algorithm (ECA*), grounded in social class ranking and meta-heuristic algorithms, for stochastically analysing heterogeneous, multifeatured datasets. ECA* integrates recombinational evolutionary operators, Levy flight optimisation, and statistical techniques such as quartiles and percentiles, as well as the Euclidean distance used in the K-means algorithm. Experiments are conducted to evaluate ECA* against five conventional approaches: K-means (KM), K-means++ (KM++), expectation maximisation (EM), learning vector quantisation (LVQ), and the genetic algorithm for clustering++ (GENCLUST++). To that end, 32 heterogeneous, multifeatured datasets are used to examine the algorithms' performance using internal, external, and basic statistical clustering performance measures, and to assess how sensitive their performance is to five features of these datasets (cluster overlap, number of clusters, cluster dimensionality, cluster structure, and cluster shape) within an operational framework. The results indicate that ECA* surpasses its counterpart techniques in its ability to find the right clusters.
Significantly, ECA* is also less sensitive than its counterparts to the five dataset properties mentioned above. The overall performance order of these algorithms, from best to worst, is thus ECA*, EM, KM++, KM, LVQ, and GENCLUST++. ECA* achieves an overall performance rank of 1.1 (where a rank of 1 represents the best-performing algorithm and 6 the worst) across the 32 datasets, based on the five dataset features mentioned above.
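The abstract names the ingredients of ECA* (evolutionary operators, Levy flight optimisation, and the K-means Euclidean objective) without giving its pseudocode, which appears only in the full paper. As an illustrative sketch only, and not the authors' ECA*, the code below shows one plausible way these ingredients combine: a population of candidate centroid sets is evaluated with the K-means objective, and the fittest candidates are perturbed with Levy-flight steps (via Mantegna's algorithm). All function names and parameter values here are assumptions for illustration.

```python
import numpy as np
from math import gamma, sin, pi

def levy_steps(beta, shape, rng):
    """Levy-stable step lengths via Mantegna's algorithm (illustrative)."""
    sigma = (gamma(1 + beta) * sin(pi * beta / 2)
             / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, shape)
    v = rng.normal(0.0, 1.0, shape)
    return u / np.abs(v) ** (1 / beta)

def sse(X, centroids):
    """K-means objective: within-cluster sum of squared Euclidean distances."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return float((d.min(axis=1) ** 2).sum())

def evolve_clusters(X, k, pop_size=10, generations=50, beta=1.5, seed=0):
    """Toy evolutionary clustering loop: elitist selection + Levy mutation."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    # Initial population: random subsets of data points as centroid sets.
    pop = [X[rng.choice(n, k, replace=False)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: sse(X, c))          # fittest first
        elite = pop[: pop_size // 2]               # keep the best half
        # Mutate each elite individual with small Levy-flight perturbations,
        # scaled by the per-feature spread of the data.
        children = [parent + 0.01 * levy_steps(beta, parent.shape, rng) * X.std(axis=0)
                    for parent in elite]
        pop = elite + children
    return min(pop, key=lambda c: sse(X, c))
```

The heavy-tailed Levy steps occasionally make large jumps, which helps candidate centroids escape local minima of the K-means objective; elitist selection then discards jumps that worsen the fit.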



Acknowledgements

The authors would like to thank the referees for their remarkable suggestions, which have significantly improved this paper's technical content. The authors also wish to express sincere thanks to the Kurdistan Institution for Strategic Studies and Scientific Research and the University of Kurdistan Hewler for providing facilities and continuous support in conducting this study.

Funding

No funding was received.

Author information


Corresponding author

Correspondence to Bryar A. Hassan.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Hassan, B.A., Rashid, T.A. A multidisciplinary ensemble algorithm for clustering heterogeneous datasets. Neural Comput & Applic 33, 10987–11010 (2021). https://doi.org/10.1007/s00521-020-05649-1

