Abstract
Data plays a fundamental role in daily life, and the rapid growth of data production has led to the big data revolution. Managing and analyzing this data, which is often unlabeled, is a major practical challenge. Clustering is one of the most important branches of data mining; its purpose is to divide the data into meaningful subsets called clusters. Hierarchical clustering is an unsupervised learning approach that groups data points with similar properties by constructing and analyzing dendrograms. Over the decades, many clustering algorithms with different approaches have been developed. In this paper, an efficient ensemble hierarchical clustering algorithm is introduced, based on a MapReduce-based clusters clustering technique and an innovative similarity criterion. The main idea of ensemble clustering is to combine the results of different single clustering methods. Because they draw on multiple learners, ensemble techniques usually produce better results than single methods, so aggregating hierarchical clustering methods can be expected to yield higher clustering quality. MapReduce is a model for implementing big data applications, and we use it to implement the hierarchical clustering methods. The similarity between samples is calculated through the innovative similarity criterion. The proposed approach has three steps. In the first step, the data are clustered by several single hierarchical clustering methods. In the second step, hyper-clusters are generated by applying the clusters clustering technique. In the third step, the final clusters are formed by allocating samples to hyper-clusters.
Simulations on multiple real-world datasets show that the proposed approach performs better than algorithms such as CHC and RCESCC.
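The three steps described above can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the function names `agglomerate` and `ensemble_cluster` are hypothetical, the Jaccard distance between base clusters is only a stand-in for the paper's similarity criterion (which the abstract does not define), the majority-style allocation rule in step 3 is an assumption, and the MapReduce distribution is omitted entirely.

```python
import math
from itertools import combinations


def agglomerate(n_items, k, pair_dist, linkage="average"):
    """Naive agglomerative clustering over item indices 0..n_items-1.

    Repeatedly merges the two closest groups until k groups remain.
    pair_dist(i, j) gives the distance between two items.
    """
    groups = [[i] for i in range(n_items)]

    def group_dist(a, b):
        d = [pair_dist(i, j) for i in a for j in b]
        if linkage == "single":
            return min(d)
        if linkage == "complete":
            return max(d)
        return sum(d) / len(d)  # average linkage

    while len(groups) > k:
        a, b = min(
            combinations(range(len(groups)), 2),
            key=lambda p: group_dist(groups[p[0]], groups[p[1]]),
        )
        groups[a] += groups[b]  # a < b, so deleting b keeps a's position
        del groups[b]
    return groups


def ensemble_cluster(points, k, linkages=("single", "complete", "average")):
    """Return one label per sample, following the three steps of the abstract."""
    n = len(points)

    # Step 1: cluster the data with several single hierarchical methods.
    base = []
    for link in linkages:
        groups = agglomerate(
            n, k, lambda i, j: math.dist(points[i], points[j]), link
        )
        base.extend(frozenset(g) for g in groups)

    # Step 2: cluster the base clusters into k hyper-clusters.
    # ASSUMPTION: Jaccard distance replaces the paper's similarity criterion.
    def jaccard_dist(a, b):
        return 1.0 - len(base[a] & base[b]) / len(base[a] | base[b])

    hyper = agglomerate(len(base), k, jaccard_dist, "average")

    # Step 3: allocate each sample to the hyper-cluster whose member
    # clusters contain it most often (ASSUMED allocation rule).
    labels = []
    for s in range(n):
        scores = [sum(s in base[c] for c in h) / len(h) for h in hyper]
        labels.append(scores.index(max(scores)))
    return labels


# Toy usage: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(ensemble_cluster(points, k=2))
```

The naive merge loop is far too slow for large inputs; it is kept only so that each step of the pipeline stays visible in a few lines.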
Availability of data and material
Data sharing is not applicable to this manuscript, as no datasets were generated or analyzed during the current study.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Author information
Contributions
All authors contributed to the design and implementation of the research, to the analysis of the results and to the writing of the manuscript.
Ethics declarations
Ethics approval and consent to participate
This material is the authors' own original work, which has not been previously published elsewhere.
Consent for publication
Informed consent was obtained from all individual participants included in the study.
Competing interests
We certify that there is no actual or potential conflict of interest in relation to this manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Tian, P., Shen, H. & Abolfathi, A. Towards Efficient Ensemble Hierarchical Clustering with MapReduce-based Clusters Clustering Technique and the Innovative Similarity Criterion. J Grid Computing 20, 34 (2022). https://doi.org/10.1007/s10723-022-09623-0
DOI: https://doi.org/10.1007/s10723-022-09623-0