1 Introduction

Machine learning can be divided into supervised and unsupervised learning, depending on whether the models are trained on labeled or unlabeled examples [1]. Supervised learning focuses on prediction, mainly by considering the model complexity as well as the bias and variance across samples. The main task is to predict the corresponding response variables from observations of the predictor variables. In contrast, unsupervised learning focuses on observation. Unlike supervised learning, no response variables are available in unsupervised learning. Consequently, the chief task is to determine the underlying characteristics of the input variables. Cluster analysis is a representative technique of unsupervised learning.

1.1 Literature review

The concept of a cluster analysis was first introduced by Driver and Kroeber in 1932 [2]. Later, Zubin and Tryon brought it to the field of psychology. Clustering techniques are currently widely used in many fields, such as data mining [3], image segmentation [4], wireless communication [5], outlier detection [6], agricultural production [7], and e-commerce [8]. Unlike classification, the classes into which the data are to be divided are unknown in clustering. The correlation, distribution, and variability among the data must be inferred from the sample data themselves. In other words, the process divides the samples into different groups by weighing the similarity measures between them, where each group is called a cluster [9]. Homogeneity and separability are two important metrics used in cluster analyses. The former indicates the similarity between objects in the same cluster, and the latter indicates the dissimilarity between objects in different clusters. The purpose of clustering is to maximize the homogeneity within each cluster and the heterogeneity between different clusters [10]. Driven by these two concepts, various types of clustering methods have been introduced. Interestingly, different conclusions can be drawn depending on the method used.

Clustering algorithms can be broadly classified into two categories, namely, hierarchical and partitional methods [11]. Hierarchical clustering methods assume a hierarchical structure between clusters and recursively find nested clusters. Their advantage lies in that the entire clustering process can be completed at once without requiring a priori knowledge. However, this approach is computationally intensive. The main methods include DIANA, BIRCH, CURE, and CHAMELEON. Partitional methods, on the other hand, find all clusters simultaneously as a partition of the data instead of imposing a hierarchical structure [12]. Specifically, the dataset is divided into a fixed number of clusters based on a specific criterion. These clusters are disjoint, i.e., each object belongs to only one cluster. This type of method is relatively insensitive to the input dataset and easy to apply. In addition, it is computationally simple. However, the scalability is poor, and most methods fall into local optima when the dimensions of the data objects increase [13].

Due to its ease of implementation, simplicity, and efficiency, k-means clustering has become one of the most widely used clustering methods [14]. It separates all samples into the closest clusters by minimizing the sum of squared errors, finding an approximate solution in a greedy manner. However, owing to the nature of gradient descent, k-means often converges to a local minimum of the objective function. For the same reason, the quality of k-means solutions depends heavily on the initial solution [15]. If the initial solution is not chosen properly, the algorithm can converge slowly and may produce empty clusters. Under this circumstance, the probability of falling into a local optimum increases [16]. With further research, many k-means variants have emerged to overcome this problem. For example, Bortoloti et al. [17] proposed supervised kernel-density-estimation k-means, called SKDEKMeans. Kernel density estimation was used to better represent the data distribution so that a balance between majority and minority clusters was achieved. A k-centroid initialization algorithm (PkCIA) was then proposed by Manochandar et al. [18]. The eigenvector of a newly constructed matrix was adopted as an index for computing the initial cluster centroids. On this basis, the high sensitivity of the original algorithm to the initial solution can be resolved. I-k-means-plus was proposed by Hassan [19]. In this approach, the quality of the solution is improved by removing or splitting clusters in each iteration. It was experimentally demonstrated that the clustering process was accelerated with relatively higher accuracy. Huang et al. [20] developed a robust deep k-means model to learn the hidden attributes. The objective function is derived into a more tractable form to tackle the optimization problem more easily while obtaining robust final results. An entropy-based initialization algorithm was proposed by Chowdhury et al. [21]. In their method, an entropy-based objective function was defined to complete the initialization process. Meanwhile, by using a number of cluster validity indexes, the proper number of clusters for different datasets can be calculated. Therefore, the performance of the proposed algorithm was enhanced. Zhao et al. [22] proposed another novel variant of k-means to perform top-down hierarchical clustering. It exhibited a faster speed while maintaining a lower clustering error.

Data clustering has been widely used in the real world for mining valuable information. It has long been applied in such areas as object detection and segmentation, medical risk assessment, energy exploration and development, IoT applications, and anomaly detection [23].

In the real world, datasets are mostly vague, complex, and large. Meanwhile, their labels and attributes are often difficult to obtain. In particular, it is almost impossible to cluster data with varied shapes, sizes, and densities [24]. In this case, an accurate and efficient estimation of the initial centroids without a priori information is urgently required [25].

Since the aims of clustering are to maximize the similarity within the same cluster and dissimilarity across clusters, it can be considered an optimization problem [26]. In optimization problems, it is often necessary to maximize or minimize some objective function (the function used to evaluate the quality of the solution). In the entire process, various difficulties, such as constraints, multiple objectives, uncertainties, and local optimum traps, need to be solved. Optimization algorithms are one of the most powerful tools to address these problems. These methods treat the problem as a black box and search for the best solution through predefined steps. Traditional optimization methods include the dynamic programming algorithm (DPA), stochastic search, steepest descent, and Newton’s method [27]. The drawback of these methods is that they are usually limited by the size of a particular problem and the given dataset.

Inspired by natural phenomena and biological evolutionary behaviors, many simple and easy-to-implement metaheuristic algorithms have emerged in recent years for solving global optimization problems, such as Monarch Butterfly Optimization (MBO) [28], the Slime Mould Algorithm (SMA) [29], the Moth Search Algorithm (MSA) [30], Hunger Games Search (HGS) [31], Harris Hawks Optimization (HHO) [32], and others. The recent emergence of metaheuristic algorithms has provided researchers with a simple yet powerful tool for data abstraction and analysis [33]. Currently, a popular research trend is to combine clustering algorithms with metaheuristics, thus ensuring a greater probability of achieving optimal clustering [34]. Chen et al. [35] proposed a new algorithm called QALO-K, in which k-means is optimized with a quantum-inspired ant lion optimizer to enhance the clustering performance and reach the global optimum. In addition, three clustering algorithms, GA-PFKM, PSO-PFKM, and SCA-PFKM, were proposed by Kuo et al. [36] to address the problem that fuzzy k-mode algorithms are sensitive to the initial solution. Nayak et al. [37] combined fuzzy c-means (FCM) with chemical reaction optimization (CRO) to achieve the global best solution. Aggarwal and Singh [38] introduced a nature-inspired algorithm for optimizing the k-means++ algorithm, aimed at overcoming its tendency to fall into local optima. Lakshmi et al. [39] combined the crow search algorithm (CSA) with k-means, and the quality of the solutions obtained on benchmark datasets was significantly improved. Because traditional clustering methods usually perform poorly when dealing with high-dimensional optimization problems, Yang and Sutrisno [40] proposed a clustering-based SOS (CSOS) algorithm. The combination of local and global searches was achieved through cross-cluster interactions between elite individuals, thus enhancing the clustering efficiency. Note that FCM tends to fall into local minima when facing complex problems. Verma et al. [41] therefore proposed a hybrid FCM and particle swarm optimization algorithm (Hybrid FCM-PSO), in which the global optimization property of PSO is used to search for cluster centers. In [42], an Automatic Clustering Local Search HMS (ACLSHMS) algorithm was proposed for image segmentation, incorporating a local search operator aimed at optimizing the cluster configuration. In addition, given the effectiveness of unsupervised learning for medical image diagnosis, Mittal et al. [43] proposed a novel k-means-based improved gravitational search algorithm clustering (KIGSA-C) method for diagnosing medical images of coronavirus (COVID-19).

Considering the relevance of clustering methods to most real-world problems, there is a need to modify the current algorithms to improve the clustering performance and expand the range of applications. Cluster analysis is an open field. Researchers [2, 25, 44] encourage the application of new metaheuristic algorithms in combination with traditional clustering methods to efficiently solve complex clustering problems.

Elephant herding optimization (EHO) [45] is a novel metaheuristic algorithm proposed by Wang et al. in 2016. The algorithm has a strong global optimization capability and few control parameters [46]. Consequently, it is simple and efficient for clustering. Unfortunately, EHO still has defects, such as a lack of exploitation ability, slow convergence, and a tendency to fall into local optima. Li et al. [47] proposed an improved EHO algorithm (IMEHO) that introduced a global speed strategy and a novel learning strategy to update the speed and position of the search agents. Experiments showed that the algorithm can find better solutions. Ismaeel et al. [48] proposed three EHO variants, EEHO15, EEHO20, and EEHO25, based on the γ-value. The purpose was to overcome the problem of an unreasonable convergence to the origin. Huseyin [49] proposed a binary version of EHO. Mostafa et al. [50] presented a study of the parameters in EHO. Three versions of EHO with cultural-based, alpha-tuning, and biased initialization were proposed to ameliorate the exploration and exploitation capabilities. However, none of the above variants has been applied to clustering, and their performance in cluster analysis has not been verified.

According to the no free lunch (NFL) theorem, a metaheuristic algorithm that performs well on one specific problem cannot be adapted to all optimization problems [51]. This motivates researchers to add new modules and mechanisms to enhance the performance of metaheuristic algorithms. It has been determined that such hybrid algorithms can obtain a global optimal solution more efficiently than a single metaheuristic algorithm [52], which makes the research in this paper highly relevant. Inspired by this, a gradient-based elephant herding optimization algorithm (GBEHO) is proposed in this paper for cluster analysis. EHO is combined with a gradient-based optimizer (GBO) [53] to further improve the convergence efficiency and exploitation capability. In addition, random wandering and mutation operators are introduced to improve the ability of the algorithm to jump out of local optima and to increase the convergence accuracy.

1.2 Contribution and organization of the paper

Overall, although many researchers have made great contributions to enhancing the performance of clustering algorithms, there are still limitations. The main contributions of this paper are the following six aspects:

  1.

    A novel hybrid metaheuristic algorithm, GBEHO, is proposed for cluster analysis, which can automatically determine the best cluster centers.

  2.

    Certain modifications are made to address the problem of falling into local optima. First, Gaussian chaotic mapping is introduced for initialization to generate high-quality initial populations. Second, a random wandering operator is designed to optimize the update strategy of the matriarch position. Third, a mutation operator is adopted to change the update strategy of the other agents in EHO. This prevents premature convergence and enhances the ability of the algorithm to jump out of local optima.

  3.

    To prevent premature convergence and enhance the balance between exploration and exploitation, EHO is combined with GBO. A framework is developed to fuse the advantages of both algorithms, and the resulting clustering centers are evaluated using a greedy selection strategy.

  4.

    A set of ablation experiments is designed to verify the effect of the variation probability PSR on the performance of the algorithm. The experiments are conducted on 23 recognized benchmark functions and tested statistically. The results show that the newly added operators clearly improve EHO, and that the optimization is most effective when PSR = 0.2.

  5.

    The analysis of the different modules illustrates that the combined strategy is effective. Experiments are carried out on synthetic and real-world datasets. GBEHO is compared with nine other metaheuristics and clustering algorithms, including k-means, particle swarm optimization (PSO), differential evolution (DE), the genetic algorithm (GA), the cuckoo search algorithm (CS), the gravitational search algorithm (GSA), the bat algorithm (BA), a quantum-inspired ant lion optimized hybrid k-means algorithm (QALO-K), and a hybrid grey wolf optimizer with tabu search (GWOTS). The experimental results show that GBEHO has superior clustering accuracy and higher stability.

  6.

    Comparative experiments are conducted with four other state-of-the-art techniques, CSOS, Hybrid FCM-PSO, ACLSHMS, and KIGSA-C, on five datasets. A variety of measures, namely, the accuracy rate, specificity, detection rate, and F-measure, are adopted to evaluate the clustering effect. The experiments prove that GBEHO is an effective algorithm for cluster analysis.

The structure of this paper is shown as follows: Section 2 briefly introduces the principles of cluster analysis, EHO, and GBO. Section 3 provides a specific description of the novel concepts and design process. Section 4 conducts the experiment and analyzes the results. Discussions are given in Section 5. Finally, conclusions are summarized, and future research directions are proposed in Section 6.

2 The basic theory

2.1 Principle of clustering

Clustering is the process of organizing datasets and objects into different clusters based on certain rules [54]. In short, all data points are clustered into different clusters by comparing their similarity. Suppose there exists a set of objects U = {x1,x2,……,xn} in an m-dimensional attribute space, where \(U \subseteq {R^{n \times m}}\). The hard assignment follows the principle of dividing the objects into K clusters C = {C1,C2,……,CK}. No intersection is allowed between any two clusters. This can be expressed as follows:

$$ {C_{i}} \ne \emptyset ,i = 1,2, {\ldots} {\ldots} ,K $$
(1)
$$ {C_{i}} \cap {C_{j}} = \emptyset ,i,j = 1,2, {\ldots} {\ldots} K,i \ne j $$
(2)
$$ \cup_{i = 1}^{K}{C_{i}} = \{ {x_{1}},{x_{2}}, {\ldots} {\ldots} {x_{n}}\} $$
(3)
$$ sim({X_{1}},{X_{2}}) {>} sim({X_{1}},{Y_{1}}),{X_{1}},{X_{2}} \in {C_{i}}\ \ \text{and}\ \ {Y_{1}} \in {C_{j}},i \ne j{\text{ }} $$
(4)

During this process, the similarity between objects in a cluster plays the most significant role in the clustering result [55]. The main way to measure the similarity in clusters is to calculate the distance between data points, such as the Mahalanobis distance [56], cosine distance [57], Pearson correlation measure [58], Jaccard measure [59], or Dice coefficient measure [60]. The most common is the Euclidean distance [61]. For two data points xi = {xi1,xi2,……xim} and xj = {xj1,xj2,……xjm} in m dimensions, the Euclidean distance is shown as follows:

$$ d({x_{i}},{x_{j}}) = \sqrt {\sum\limits_{n = 1}^{m} {{{({x_{in}} - {x_{jn}})}^{2}}} } $$
(5)

Generally, the smaller the intracluster distance or the larger the intercluster distance, the better the clustering performance [62]. In this paper, the sum of squared errors (SSE) is chosen as the objective function. SSE should be minimized in each iteration, which can be expressed as follows:

$$ {\text{min }}\quad SSE = \sum\limits_{i = 1}^{k} {\sum\limits_{x \in {c_{i}}}^{} {d{{(x,{g_{i}})}^{2}}} } \qquad {\text{ where }}\quad{g_{i}} = \frac{{\sum\limits_{x \in {c_{i}}} x }}{{\left| {{c_{i}}} \right|}} $$
(6)

where d(x,gi)2 denotes the squared distance from the sample point x to the center of mass gi of cluster ci.
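To make (5) and (6) concrete, a minimal Python sketch is given below. It scores a candidate set of centroids by assigning each point to its nearest centroid, which is how candidate solutions are evaluated later in this paper; the function name sse, the array layout, and the toy data are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def sse(X, centroids):
    """Objective (6): each point is charged the squared Euclidean
    distance (5) to its nearest candidate centroid."""
    # pairwise squared distances, shape (n_points, n_clusters)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

# toy usage: four 2-D points and two candidate cluster centers
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
g = np.array([[0.05, 0.1], [5.1, 4.95]])
print(sse(X, g))
```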

2.2 EHO

EHO is a population-based algorithm proposed to simulate the nomadic life characteristics of elephants. In EHO, three principles are followed. (i) The population of all agents is divided into a specific number of clans. (ii) Each clan is led by a female individual, called a matriarch, representing the best-positioned agent in each iteration. (iii) The worst agent in each iteration represents a male elephant, which, upon reaching adulthood, leaves its clan to live alone. EHO sets up the clan operator and the separating operator to model the above behavior.

2.2.1 The clan operator

For the search agent j in clan ci, its position must be modified according to the relationship with the clan leader, which can be expressed by:

$$ {x_{new,ci,j}} = {x_{ci,j}} + \alpha \times ({x_{best,ci}} - {x_{ci,j}}) \times rand $$
(7)

where xbest,ci is the position of the best agent in clan ci, xci,j and xnew,ci,j are the current and new positions of the search agent j in clan ci, respectively, and α and rand are both random numbers between [0,1]. Unlike other member position updates, the position of the clan leader is adjusted based on the current position of all agents in the clan. This can be modeled by (8).

$$ {x_{new,ci,j}} = \upbeta \times {x_{center,ci}} $$
(8)

where \({x_{center,{c_{i}}}}\) denotes the central position of all agents in clan ci, which is calculated by:

$$ {x_{center,ci}} = \frac{1}{{{n_{ci}}}} \times \sum\limits_{j = 1}^{{n_{ci}}} {{x_{ci,j}}} $$
(9)

where β affects the extent to which xcenter,ci acts on xnew,ci,j, β ∈ [0,1], and nci is the number of all agents in clan ci.

2.2.2 Separating operator

The separating operator imitates the life characteristics of male elephants. When adults, male elephants leave their current clan, represented by the following equation:

$$ {x_{worst,ci}} = {x_{\min }} + ({x_{\max }} - {x_{\min }} + 1) \times rand $$
(10)

where rand is a random number between [0,1], and \({x_{{\max \limits } }}\) and \({x_{{\min \limits } }}\) are the upper and lower bounds of the individual position, respectively.

2.2.3 Elitism strategy

To prevent the best elephant individuals from being lost, EHO adopts an elitism strategy. First, the best m elephant individuals are saved. After an iteration is completed, the fitness values of the worst m elephants are compared with those of the previously saved best individuals, and the better agents are preserved. In this way, it is ensured that the quality of the later population is not worse than that of the earlier one.

2.2.4 Pseudocode of EHO

Based on the above description, the process of EHO can be summarized, and the pseudocode is shown in Algorithm 1.

Algorithm 1 Pseudocode of EHO
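Since the pseudocode figure is not reproduced here, the following Python sketch outlines one EHO generation as described above. The clan bookkeeping, the parameter values α and β, the bound clipping, and the sphere fitness used in the usage example are illustrative assumptions; the elitism step of Section 2.2.3 is left to the caller.

```python
import numpy as np

def eho_generation(clans, fitness, lb, ub, alpha=0.5, beta=0.1):
    """One EHO generation: clan operator (7)-(9) and separating operator (10),
    applied clan by clan. `clans` is a list of (n_ci, dim) position arrays."""
    new_clans = []
    for clan in clans:
        fit = np.array([fitness(x) for x in clan])
        best_idx, worst_idx = fit.argmin(), fit.argmax()
        center = clan.mean(axis=0)                                           # eq. (9)
        # eq. (7): move every member toward the clan's best agent (matriarch)
        new = clan + alpha * (clan[best_idx] - clan) * np.random.rand(*clan.shape)
        new[best_idx] = beta * center                                        # eq. (8)
        # eq. (10): the worst (male) agent is re-initialized randomly
        new[worst_idx] = lb + (ub - lb + 1) * np.random.rand(clan.shape[1])
        new_clans.append(np.clip(new, lb, ub))
    return new_clans

# toy usage: two clans of five 2-D agents, minimizing the sphere function
lb, ub = np.array([-5.0, -5.0]), np.array([5.0, 5.0])
clans = [np.random.uniform(lb, ub, (5, 2)) for _ in range(2)]
clans = eho_generation(clans, lambda x: (x ** 2).sum(), lb, ub)
```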

2.3 GBO

GBO is a population-based algorithm guided by gradient information. In GBO, the search direction is controlled by Newton’s method. Additionally, two main operators and a set of vectors are adopted to explore the search space.

2.3.1 Gradient search rule (GSR)

The gradient search rule (GSR) is extracted from Newton’s method to control the direction of the vector search. To ensure a balance between exploration and exploitation during the iterations and accelerate the convergence, a series of vectors are introduced as follows:

$$ {\rho_{1}} = 2 \times rand \times \alpha - \alpha $$
(11)
$$ \alpha = \left| {\upbeta \times \sin \left[\frac{{3\pi }}{2} + \sin (\upbeta \times \frac{{3\pi }}{2})\right]} \right| $$
(12)
$$ \upbeta = {{\upbeta}_{\min }} + ({{\upbeta}_{\max }} - {{\upbeta}_{\min }}) \times {\left[1 - {(\frac{m}{M})^{3}}\right]^{2}} $$
(13)

where \({{\upbeta }_{{\max \limits } }}\) and \({{\upbeta }_{{\min \limits } }}\) are taken as 1.2 and 0.2, respectively, m and M represent the current and the maximum number of iterations, respectively, and rand denotes a random number between [0,1]. The value of α varies with the iterations and can be used to control the convergence rate. Early in the iteration, the value of α is large, thus allowing the algorithm to increase the diversity and converge quickly to the region where it hopes to find the optimal solution. Later in the iteration, the value decreases. Therefore, the algorithm can better exploit the explored regions. On this basis, the expression of GSR is as follows:

$$ GSR = rand \times {\rho_{1}} \times \frac{{2{\Delta} x \times {x_{n}}}}{{({x_{worst}} - {x_{best}} + \varepsilon )}} $$
(14)

where xworst and xbest represent positions of the worst and the best agents, and ε is a small number in the range of [0,0.1]. The proposed GSR is capable of a random search, which enhances the exploration ability of GBO and the ability to jump out of the local optimum. Δx is calculated by the following expression:

$$ {\Delta} x = rand(1:N) \times \left| {step} \right| $$
(15)
$$ step = \frac{{({x_{best}} - x_{r1}^{m}) + \delta }}{2} $$
(16)
$$ \delta = 2 \times rand \times \left( \left| {\frac{{x_{r1}^{m} + x_{r2}^{m} + x_{r3}^{m} + x_{r4}^{m}}}{4} - {x_{n}^{m}}} \right|\right) $$
(17)

where rand(1 : N) denotes N random numbers between [0,1] and step is the step size. xbest represents the global optimal agent, and \({x_{n}^{m}}\) denotes the mth dimension of the nth agent. r1,r2,r3,r4 are different integers randomly selected from [1, N].

Moreover, a motion parameter DM is set for a local search to improve the exploitation capabilities. The expression is shown as follows:

$$ DM = rand \times {\rho_{2}} \times ({x_{best}} - {x_{n}}) $$
(18)

rand denotes a random number between [0,1], and ρ2 is the parameter that controls the step size and is represented as follows:

$$ {\rho_{2}} = 2 \times rand \times \alpha - \alpha $$
(19)

Ultimately, the current location of the search agent (\({x_{n}^{m}}\)) can be updated by GSR and DM in the following way:

$$ X{1_{n}^{m}} = {x_{n}^{m}} - GSR + DM $$
(20)

According to (14) and (18), (20) can also be expressed as follows:

$$ X{1_{n}^{m}} = {x_{n}^{m}} - rand \times {\rho_{1}} \times \frac{{2{\Delta} x \times {x_{n}^{m}}}}{{(y{p_{n}^{m}} - y{q_{n}^{m}} + \varepsilon )}} + rand \times {\rho_{2}} \times ({x_{best}} - {x_{n}^{m}}) $$
(21)

where \(y{p_{n}^{m}} {=} {y_{n}^{m}} {+} {\Delta } x\), \(y{q_{n}^{m}} {=} {y_{n}^{m}} {-} {\Delta } x\), and \({y_{n}^{m}}\) is a newly generated variable determined by the average of \({x_{n}^{m}}\) and \(z_{n + 1}^{m}\). According to Newton’s method, \(z_{n + 1}^{m}\) is formulated by:

$$ z_{n + 1}^{m} = {x_{n}^{m}} - randn \times \frac{{2{\Delta} x \times {x_{n}^{m}}}}{{({x_{worst}} - {x_{best}} + \varepsilon )}} $$
(22)

where Δx is specified by (15), and xworst and xbest denote the current worst and best agents, respectively. After replacing the current vector \({x_{n}^{m}}\) in (21) with xbest, a new vector \(X{2_{n}^{m}}\) can be obtained with the following expression.

$$ X{2_{n}^{m}} = {x_{best}} - rand \times {\rho_{1}} \times \frac{{2{\Delta} x \times {x_{n}^{m}}}}{{(y{p_{n}^{m}} - y{q_{n}^{m}} + \varepsilon )}} + rand \times {\rho_{2}} \times (x_{r1}^{m} - x_{r2}^{m}) $$
(23)

Based on (21) and (23), the new solution \(x_{n}^{m + 1}\) can be expressed as:

$$ x_{n}^{m + 1} = {r_{a}} \times [{r_{b}} \times X{1_{n}^{m}} + (1 - {r_{b}}) \times X{2_{n}^{m}}] + (1 - {r_{a}}) \times X{3_{n}^{m}} $$
(24)
$$ X{3_{n}^{m}} = {x_{n}^{m}} - {\rho_{1}} \times (X{2_{n}^{m}} - X{1_{n}^{m}}) $$
(25)

where ra and rb are random numbers between [0,1].
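For illustration, a vectorized per-agent Python sketch of the update chain (11)-(25) is given below. The ε value, the treatment of rand(1:N) as a D-dimensional random vector, the requirement that the population contain at least five agents (so that four distinct indices r1-r4 can be drawn), and the simplification yp − yq = 2Δx are assumptions made for this sketch rather than details fixed by the original formulation.

```python
import numpy as np

def gbo_update(X, n, best, worst, m, M, beta_min=0.2, beta_max=1.2, eps=1e-8):
    """Candidate position for agent n following (11)-(25).
    X: (N, D) population (N >= 5); best/worst: indices of the best and worst
    agents; m/M: current and maximum iteration counters."""
    N, D = X.shape
    beta = beta_min + (beta_max - beta_min) * (1 - (m / M) ** 3) ** 2            # (13)
    alpha = abs(beta * np.sin(3 * np.pi / 2 + np.sin(beta * 3 * np.pi / 2)))     # (12)
    rho1 = 2 * np.random.rand(D) * alpha - alpha                                 # (11)
    rho2 = 2 * np.random.rand(D) * alpha - alpha                                 # (19)
    r1, r2, r3, r4 = np.random.choice([i for i in range(N) if i != n], 4, replace=False)
    delta = 2 * np.random.rand(D) * abs((X[r1] + X[r2] + X[r3] + X[r4]) / 4 - X[n])  # (17)
    step = (X[best] - X[r1] + delta) / 2                                         # (16)
    dx = np.random.rand(D) * abs(step)                                           # (15)
    gsr = np.random.rand(D) * rho1 * 2 * dx * X[n] / (X[worst] - X[best] + eps)  # (14)
    dm = np.random.rand(D) * rho2 * (X[best] - X[n])                             # (18)
    x1 = X[n] - gsr + dm                                                         # (20)
    x2 = (X[best]                                                                # (23)
          - np.random.rand(D) * rho1 * 2 * dx * X[n] / (2 * dx + eps)            # yp - yq = 2*dx
          + np.random.rand(D) * rho2 * (X[r1] - X[r2]))
    x3 = X[n] - rho1 * (x2 - x1)                                                 # (25)
    ra, rb = np.random.rand(), np.random.rand()
    return ra * (rb * x1 + (1 - rb) * x2) + (1 - ra) * x3                        # (24)
```

In a full GBO iteration this update is applied to every agent, after which the local escaping operator of the next subsection may be triggered.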

2.3.2 Local escaping operator (LEO)

The local escaping operator (LEO) is set to retune the resulting solution so that the algorithm can move away from local optima, improving the probability of finding the optimal solution. A solution with superior performance (\(X_{LEO}^{m}\)) is introduced in the LEO, which is represented as:

$$ \begin{array}{l} if \quad rand < pr\\ X_{LEO}^{m} {=} \left\{ \begin{array}{l} X_{n}^{m + 1} {+} {f_{1}} \times ({u_{1}} \times {x_{best}} - {u_{2}} \times {x_{k}^{m}}) + {f_{2}} \times {\rho_{1}} \times [{u_{3}} \times (X{2_{n}^{m}} - X{1_{n}^{m}})\\ + {u_{2}} \times (x_{r1}^{m} - x_{r2}^{m})]/2\qquad \qquad \qquad \qquad \qquad \qquad \quad rand < 0.5\\ {x_{best}} + {f_{1}} \times ({u_{1}} \times {x_{best}} - {u_{2}} \times {x_{k}^{m}}) + {f_{2}} \times {\rho_{1}} \times [{u_{3}} \times (X{2_{n}^{m}} - X{1_{n}^{m}})\\ + {u_{2}} \times (x_{r1}^{m} - x_{r2}^{m})]/2\qquad \qquad \qquad \qquad \qquad \qquad \quad otherwise \end{array} \right.\\ end \end{array} $$
(26)

pr is a predetermined threshold, where pr = 0.5. f1 is a random number between [-1,1], and f2 is a random number that conforms to the standard normal distribution. u1,u2,u3 are respectively represented by:

$$ {u_{1}} = {L_{1}} \times 2 \times rand + (1 - {L_{1}}) $$
(27)
$$ {u_{2}} = {L_{1}} \times rand + (1 - {L_{1}}) $$
(28)
$$ {u_{3}} = {L_{1}} \times rand + (1 - {L_{1}}) $$
(29)

where L1 is a binary parameter of 0 or 1, and μ1 is a random number between [0,1]. When μ1 < 0.5, L1 = 1; otherwise, L1 = 0. The solution \({x_{k}^{m}}\) used in (26) is expressed as follows:

$$ {x_{k}^{m}} = {L_{2}} \times {x_{p}^{m}} + (1 - {L_{2}}) \times {x_{rand}} $$
(30)

where \({x_{p}^{m}}\) is a randomly selected solution from the population, p ∈ [1,2,……N]. L2 is a binary parameter of 0 or 1, and μ2 is a random number between [0,1]. When μ2 < 0.5, L2 = 1; otherwise, L2 = 0. xrand is the newly generated solution in the following manner.

$$ {x_{rand}} = {X_{\min }} + rand \times ({X_{\max }} - {X_{\min }}) $$
(31)
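The following Python sketch mirrors (26)-(31) under the same spirit as the previous sketch; the argument list (passing x1, x2, and the ρ1 vector from the GSR step), the binary L1/L2 draws, and the uniform bound handling are illustrative assumptions.

```python
import numpy as np

def leo(x_new, x1, x2, X, best, rho1, lb, ub, pr=0.5):
    """Local escaping operator, eqs. (26)-(31), applied to the candidate
    solution x_new. X: (N, D) population; best: index of the global best."""
    if np.random.rand() >= pr:          # LEO fires only with probability pr
        return x_new
    N, D = X.shape
    f1 = np.random.uniform(-1, 1, D)    # random in [-1, 1]
    f2 = np.random.randn(D)             # standard normal
    L1 = 1.0 if np.random.rand() < 0.5 else 0.0
    u1 = L1 * 2 * np.random.rand(D) + (1 - L1)                      # (27)
    u2 = L1 * np.random.rand(D) + (1 - L1)                          # (28)
    u3 = L1 * np.random.rand(D) + (1 - L1)                          # (29)
    L2 = 1.0 if np.random.rand() < 0.5 else 0.0
    x_rand = lb + np.random.rand(D) * (ub - lb)                     # (31)
    xk = L2 * X[np.random.randint(N)] + (1 - L2) * x_rand           # (30)
    r1, r2 = np.random.choice(N, 2, replace=False)
    move = (f1 * (u1 * X[best] - u2 * xk)
            + f2 * rho1 * (u3 * (x2 - x1) + u2 * (X[r1] - X[r2])) / 2)
    base = x_new if np.random.rand() < 0.5 else X[best]             # two branches of (26)
    return base + move
```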

3 The proposed algorithm

3.1 Motivation

Traditional clustering algorithms (e.g., k-means), whose effectiveness depends heavily on the initial solution, may fall into local optima when dealing with complex problems. Therefore, in this paper a new method is developed for data clustering. The method applies the concept of metaheuristics to automatically estimate the initial clustering centers and to enhance the ability of the algorithm to escape from local optima.

The ability to balance exploration and exploitation is a central concern of all metaheuristic algorithms [63]. An analysis of EHO reveals that the worst-positioned agents are only randomly reset by (10). This approach lacks a guided variation mechanism, which makes the exploitation capacity insufficient and thus leads to slow convergence. Furthermore, the best-positioned agents are adjusted by (8), which is of little use once the population has fallen into a local optimum and also reduces the diversity of the population. In addition, the exploitation capability of EHO is relatively weak, which increases the probability of falling into a local optimum [64]. By combining EHO with GBO, the search direction during the iterations can be guided to avoid becoming trapped in a local optimum, resulting in better solutions. The local escaping operator (LEO) in GBO can improve the diversity of the population and avoid excessive stagnation. In this case, the proposed algorithm can make full use of the gradient information so that its search efficiency is improved [65].

Based on the above reasons, several modifications are made. First, Gaussian chaotic mapping is introduced to initialize the population, thus increasing the diversity and ergodicity of the initial population. Next, two operators, a random wandering operator and a mutation operator, are adopted to optimize the positions of the agents. The aim is to achieve a better balance between exploration and exploitation. Furthermore, EHO is combined with GBO to enhance the exploitation capability by introducing the GSR and LEO operators. In summary, the authors believe that this combination of modifications is well motivated.

3.2 Methodology

Since the algorithm is based on a metaheuristic, the search agents need to be represented first. The representation of the individuals is adapted to the specific characteristics of the clustering problem. If the input dataset U={x1,x2,……,xn} includes n objects, then each object with m features can be represented as xi={xi1,xi2,…… xim}, i ∈ [1,n]. Since one or more initial clustering centers are generated, the dimensionality \(\dim \) of the algorithm changes based on the number of clusters k, i.e., \(\dim = m \times k\). Therefore, each candidate solution Cj denotes a set of cluster centers, which can be represented by:

$$ {C_{j}} = \left\{ {{c_{11}},{c_{12}}, {\ldots} {\ldots} {c_{1m}},{c_{21}}, {\ldots} {\ldots} {c_{\dim }}} \right\} $$
(32)

The solution for the initial iteration is irrelevant to the clustering problem and is randomly generated based on the available dataset. To complete the initialization process, upper and lower bounds must be determined for each feature. Namely, the lower bound is represented as \({c_{{\min \limits } }} = \{ {c_{l1}},\) cl2,……clm}, where \({c_{lm}} = \min \limits \{ {c_{1m}},{c_{2m}}, {\ldots } {\ldots } {c_{nm}}\}\). Similarly, the upper bound is determined as \({c_{{\max \limits } }} = \{ {c_{u1}},\) cu2,……cum}, where \({c_{um}} = {\max \limits } \{ {c_{1m}},{c_{2m}}, {\ldots } {\ldots } {c_{nm}}\}\).
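As an illustration of this encoding, the sketch below flattens k centroids of m features into a single dim = m × k vector and derives the per-feature bounds from the data; the helper names make_bounds and decode and the toy dataset are hypothetical and used only for demonstration.

```python
import numpy as np

def make_bounds(U, k):
    """Per-feature lower/upper bounds c_min, c_max derived from the data,
    tiled k times so they match the flattened dim = m * k representation (32)."""
    c_min, c_max = U.min(axis=0), U.max(axis=0)
    return np.tile(c_min, k), np.tile(c_max, k)

def decode(candidate, k, m):
    """Reshape a flat candidate C_j back into k cluster centers of m features."""
    return candidate.reshape(k, m)

# toy usage on a 5-object, 2-feature dataset with k = 2 clusters
U = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 8.5], [1.2, 2.1]])
lb, ub = make_bounds(U, k=2)
candidate = lb + np.random.rand(lb.size) * (ub - lb)   # random initial solution
print(decode(candidate, 2, 2))
```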

3.2.1 Initialization

It is noted that a strong connection exists between the quality of the initial population and the efficiency of the metaheuristic algorithm. Under this circumstance, it is necessary to improve the initialization by suitable methods to obtain a higher quality initial population. In the original EHO, the search process starts from a randomly generated initialized population. Based on that, a priori knowledge of the objective function or constraints is not required. However, it lacks ergodicity and diversity. It has been experimentally demonstrated that chaotic maps have similar properties to randomness but possess better statistical and dynamic properties [66]. Therefore, it is advantageous to use chaotic maps for population initialization in GBEHO.

In this paper, a pre-programmed Gaussian sequence [67] is selected to replace the conventional random number generator, which is represented as follows.

$$ \begin{array}{@{}rcl@{}} \eta (t + 1) = \left\{ \begin{array}{l} 0\qquad\qquad\quad\ \eta (t) = 0\\ \frac{1}{{\eta (t)}} - [\frac{1}{{\eta (t)}}]{\quad}otherwise \end{array} \right. \end{array} $$
(33)

η(t) and η(t + 1) denote the numbers of chaotic maps generated in the current and next generations, respectively. The initialized population is generated by the Gaussian chaos mapping function, which can explore the space more extensively to obtain better exploration results.
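A minimal sketch of this initialization is given below, assuming a nonzero starting value η(0) (0.7 is an arbitrary choice here) and a simple linear scaling of the chaotic numbers into the per-dimension bounds; both are assumptions of this sketch.

```python
import numpy as np

def gauss_map_sequence(length, eta0=0.7):
    """Gaussian (Gauss/mouse) chaotic sequence of eq. (33):
    eta_{t+1} = frac(1 / eta_t), with 0 mapped to 0."""
    seq = np.empty(length)
    eta = eta0
    for t in range(length):
        eta = 0.0 if eta == 0 else (1.0 / eta) % 1.0
        seq[t] = eta
    return seq

def chaotic_init(pop_size, lb, ub, eta0=0.7):
    """Initial population: chaotic numbers scaled into [lb, ub] per dimension."""
    dim = lb.size
    chaos = gauss_map_sequence(pop_size * dim, eta0).reshape(pop_size, dim)
    return lb + chaos * (ub - lb)
```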

3.2.2 Random wandering operator

It should be emphasized that in the original EHO, the position of the matriarch is determined by the positions of all members in the same clan. Once the algorithm has fallen into a local optimum, the quality of the best solution is difficult to improve. As a result, the populations generated by the clan operator are prone to wandering in place. This weakens the ability of the algorithm to jump out of local optima. In our view, as the best-positioned agent in each clan, the matriarch should follow a more exploratory update strategy.

One of the most significant rules of metaheuristic algorithms is to maintain a balance between exploration (diversification) and exploitation (intensification). In the early exploration stage, agents need to explore the search space sufficiently to identify promising regions for exploitation. During this phase, individuals should have a strong stochastic search ability; otherwise, premature convergence will result. In the exploitation phase, agents focus on discovering better solutions in the explored regions. Therefore, the accuracy of individuals in finding the best solution should be optimized so that the algorithm converges to a feasible local or global optimal solution in a limited time. Based on this consideration, the update strategy of the matriarch is adapted as follows:

$$ x_{best,ci}^{t + 1} = \left\{ \begin{array}{ll} x_{best,ci}^{t} + C(\sigma ) & it/Maxiter < 0.5\\ x_{a,ci}^{t} + 2(rand - 0.5)(x_{b,ci}^{t} - x_{c,ci}^{t}) & otherwise \end{array} \right. $$
(34)

where \(x_{best,ci}^{t}\) and \(x_{best,ci}^{t + 1}\) denote the current and new positions of the matriarch in clan ci, respectively, xa,ci, xb,ci, and xc,ci denote three individuals randomly selected from clan ci, it and Maxiter denote the current and maximum number of iterations, rand is a random number between [0,1], and C(σ) denotes a Cauchy distributed random number. It has been proven that a Cauchy distribution-based random walk can contribute to global exploration [68]. The Cauchy distribution function is defined as

$$ \begin{array}{@{}rcl@{}} F(\sigma;a,b) = \frac{1}{2} + \frac{1}{\pi }\arctan (\frac{{\sigma - a}}{b}) \end{array} $$
(35)

where a is the location parameter and b is the scale parameter. In the standard Cauchy distribution, a = 0,b = 1. Meanwhile, the Cauchy density function is as follows

$$ {f_{C(a,b)}}(\sigma) = \frac{b}{{\pi ({b^{2}} + {\sigma^{2}})}} $$
(36)

The Cauchy distributed random number C(σ) generated by (35) can be expressed by

$$ \sigma = \tan \left( \pi \left( F(\sigma ;a,b) - \frac{1}{2}\right)\right) $$
(37)

It should be noted that in GBEHO the random wandering operator based on the Cauchy distribution replaces the original strategy of updating based on the mean value. Under this circumstance, it is beneficial for agents to expand the search area, which increases the diversity. For the algorithm to run smoothly, the bounds should be checked to prevent crossing them. Once out of range, the Cauchy mutation is repeated several times until the new solution lies within the specified range. As the iterations proceed, the step size effectively decreases. Later in the run, GBEHO moves to exploitation. At that time, the clan leader is modified based on the positions of three random individuals in the clan, which contributes to improving the accuracy of discovering the globally optimal solution.
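A minimal sketch of this operator follows, assuming a standard Cauchy scale (b = 1), a clan of at least three members, a fixed retry budget for the out-of-bounds case, and a final clipping fallback; the retry budget and the clipping are assumptions of this sketch, since the text only states that the mutation is repeated until the solution is in range.

```python
import numpy as np

def update_matriarch(x_best, clan, it, max_iter, lb, ub, scale=1.0, max_retry=20):
    """Random wandering update of the clan leader, eq. (34):
    Cauchy perturbation in the first half of the run, differential move afterwards."""
    dim = x_best.size
    if it / max_iter < 0.5:
        for _ in range(max_retry):                        # re-sample if out of bounds
            # standard Cauchy variate via inverse CDF, eq. (37) with a = 0, b = scale
            c = scale * np.tan(np.pi * (np.random.rand(dim) - 0.5))
            new = x_best + c
            if np.all((new >= lb) & (new <= ub)):
                return new
        return np.clip(new, lb, ub)                       # fallback (assumption)
    a, b, c = clan[np.random.choice(len(clan), 3, replace=False)]
    return np.clip(a + 2 * (np.random.rand(dim) - 0.5) * (b - c), lb, ub)
```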

3.2.3 Mutation operator

Another deficiency of EHO is the lack of a variation mechanism, which is reflected in the following two points. First, most of the agents in a population, excluding the worst individual, are updated based on their relationship with the clan leader, so their sense of independence is poor. This type of mechanism is not conducive to enhancing the diversity. For instance, once the algorithm is caught in a local optimum, it hardly has the opportunity to continue exploring. Second, during the search process, a few agents break away from the group led by the matriarch. These agents obviously have a more prominent sense of independence and are able to perform a random search in the search space. However, their sense of following the matriarch is still relatively weak with respect to the whole clan. If most of them explore in the wrong search direction, the convergence speed of the algorithm slows down and the search efficiency is affected. In the original algorithm, the position of the worst individual is adjusted purely at random, making it difficult to ensure that the search agent is updated to a better position [69].

Similar to mutations in chromosomes, mutation strategies have been widely used in genetic algorithms [70], the aim of which is to increase the diversity of the population. To ensure that most agents have the opportunity to mutate, a variation probability (PSR) is set. This parameter should take a value between (0,1) to avoid exceeding the population size boundary. If PSR is less than 0.2, fewer individuals undergo mutation, and the mutation has little substantial effect. If PSR is greater than 0.8, then most of the individuals will participate in the mutation, which is contrary to the original intention of the setting. Therefore, for the purpose of maintaining a balance between exploration and exploitation while meeting the diversity enhancement requirements, the value of the variation probability PSR is determined experimentally in order to obtain the optimal clustering effect. It has been experimentally verified that this module has a positive impact on the performance of the algorithm. The ablation experiments are presented in the next section. In GBEHO, the mutation operator is set as shown below:

$$ {x_{worst,ci}} = {x_{worst,ci}} + {\delta_{m}}{r_{1}} + K $$
(38)
$$ K = {u_{1}}{e^{\frac{{ - 2t}}{{Maxiter}}}} $$
(39)
$$ x_{i}^{t + 1} {=} \left\{ \begin{array}{l} {x_{i}^{t}} + {r_{2}}\left( \frac{{x_{pbest}^{t} + {x_{Gbest}}}}{2} - {x_{i}^{t}}\right) + {r_{3}}\left( \frac{{x_{pbest}^{t} - {x_{Gbest}}}}{2} - {x_{i}^{t}}\right)\quad rand < PSR\\ {x_{i}^{t}} + \alpha {r_{4}}(x_{best,ci}^{t} - {x_{i}^{t}})\qquad\qquad\qquad\qquad\qquad\quad otherwise \end{array} \right. $$
(40)

where xworst,ci represents the position of the agent to be modified and δm is the variation factor. In this paper, \({\delta_{m}} = 0.1 \times ({X_{{\max \limits } }} - {X_{{\min \limits } }})\). r1,r2,r3, and r4 are random numbers uniformly distributed from 0 to 1. u1 is a random variable in [− 1,1], and t and Maxiter represent the current and maximum number of iterations, respectively. \(x_{pbest}^{t}\) is the optimal solution at the tth iteration, and xGbest stands for the global optimal solution.
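The sketch below implements (38)-(40) under the same conventions as the earlier sketches; the value α = 0.5 and the separation into two helper functions are assumptions for illustration.

```python
import numpy as np

def mutate_worst(x_worst, lb, ub, t, max_iter):
    """Mutation of the worst agent, eqs. (38)-(39)."""
    delta_m = 0.1 * (ub - lb)
    K = np.random.uniform(-1, 1) * np.exp(-2 * t / max_iter)        # (39)
    return x_worst + delta_m * np.random.rand(x_worst.size) + K     # (38)

def update_member(x, x_clan_best, x_pbest, x_gbest, alpha=0.5, psr=0.2):
    """Position update of an ordinary clan member, eq. (40)."""
    r2, r3, r4 = np.random.rand(3)
    if np.random.rand() < psr:
        return x + r2 * ((x_pbest + x_gbest) / 2 - x) + r3 * ((x_pbest - x_gbest) / 2 - x)
    return x + alpha * r4 * (x_clan_best - x)
```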

3.2.4 Greedy selection strategy

When designing a hybrid framework, there are two critical issues [71]. One is how to combine two or more methods into one framework, and the other is how to evaluate the best solution from the iterations. In this paper, EHO is set as the basic algorithm because of its ease of implementation and certain exploration capability. The obtained solutions are then updated via GBO to enhance the diversity of the population. Compared to EHO, GBO is more advantageous in terms of its exploitation capability due to the GSR and LEO. Finally, the solutions provided by the search agents are evaluated by a greedy selection strategy. If the fitness of a new agent is better than that of the current one, the current agent is replaced, and the new agent participates in the next iteration. The purpose is to ensure the convergence of GBEHO.

$$ GBestX = \left\{ \begin{array}{l} {x_{k}^{i}} \qquad \qquad \ f({x_{k}^{i}}) < f(GBestX)\\ GBestX \qquad else \end{array} \right. $$
(41)

where GBestX represents the global optimal agent, and \({x_{k}^{i}}\) represents the kth agent generated in the ith iteration.
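A minimal sketch of this selection step, assuming minimization of the SSE objective; the function name and tuple-based interface are illustrative.

```python
def greedy_select(x_new, f_new, x_gbest, f_gbest):
    """Greedy selection of eq. (41): the global best agent is replaced
    only if the new agent has a strictly better (lower) fitness."""
    return (x_new, f_new) if f_new < f_gbest else (x_gbest, f_gbest)

# usage: x_gbest, f_gbest = greedy_select(candidate, sse_value, x_gbest, f_gbest)
```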

3.3 Pseudocode of GBEHO

According to the above adjustments, the pseudocode of GBEHO is shown in Algorithm 2. The initialization is performed in line 4 by means of the introduced chaotic mapping. The EHO phase is then completed in lines 7 to 16. In detail, the two proposed operators are applied in lines 11 and 15. In the second stage, the algorithm performs the gradient search rule (GSR) and local escaping operator (LEO) operators, which are shown in lines 17-28. Finally, the clustering process is completed based on the searched clustering centers in lines 33-36. In addition, the flow chart of GBEHO is given in Fig. 1.

Algorithm 2 Pseudocode of GBEHO

Fig. 1 Flowchart of GBEHO

3.4 Time complexity

The time complexity of the algorithm can reflect the magnitude of the running time variation with an increase in the input size [72]. The time complexity of the proposed GBEHO is bounded by the number of search agents N, the dimensions of the problem D, and the maximum number of iterations T.

In general, the time complexity of GBEHO can be divided into the following parts: chaos initialization, random wandering, mutation, and the GBO strategy. First, the time spent initializing the population using Gaussian chaos mapping is O(N). Next, the main loop phase with a maximum number of iterations of T is executed. Random wandering with a Cauchy distribution takes O(TN), and the execution of the mutation operator takes O(TN). In addition, the GBO strategy costs O(TND), so the computational complexity of GBEHO is O(TDN + TN).

4 Experiments and analysis

In this section, experiments are conducted to verify the validity of the GBEHO. All simulations are implemented on a Windows 10 operating system computer with an Intel(R) Core (TM) i5-9300H (2.40 GHz) processor, 16 GB of RAM and the MATLAB R2019b platform.

4.1 Influence of the parameters

In Section 3, the variation probability PSR is introduced into GBEHO. To verify the sensitivity of this control parameter, seven versions of GBEHO were developed to test the performance under different parameter values on a set of 23 recognized functions [73]. The values of PSR vary in the range of [0.2,0.8] with a step size of 0.1. For the sake of convenience, these sub-algorithms are named GB2, GB3, GB4, GB5, GB6, GB7, and GB8, corresponding to PSR values of 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, and 0.8, respectively. The test functions include 7 unimodal benchmark functions, 6 expandable multimodal functions, and 10 multimodal functions with fixed dimensions. The basic information of the benchmark functions is listed in Table 1. In this subsection, PSR is the only parameter that changes across the GBEHO versions. For the purpose of validating the parameter, the final clustering part of the original algorithm was excluded, and only the final fitness values were calculated. Furthermore, the number of clans c in GBEHO is set to 5. The maximum number of iterations \({t_{{\max \limits } }}\) is set to 500, and the size of the population N is set to 10.

Table 1 Details of the 23 benchmark functions

To evaluate the performance of each variant, several measures were used, including the mean and standard deviation (std). Given the randomized nature of heuristic algorithms, it was necessary to compare the experimental results via statistical tests in order to assess the validity of the experimental data [74]. Therefore, the Wilcoxon rank-sum test and the Friedman test with a significance level of 5% were adopted, where the p value is an important indicator of the confidence of the results. When p < 0.05, it was determined that there was a statistically significant difference between the two groups of results. In addition, bold indicates the best candidate solution obtained for each function, and NaN indicates that the algorithm performs best on the current function, so no comparison p value is computed. Moreover, the ranks of the results obtained by the different algorithms on each function were also compared. Each experiment was repeated 30 times, and the results are shown in Table 2.

Table 2 Comparison results on 23 benchmark functions of EHO and GBEHO with PSR varying from 0.2 to 0.8

Box plots represent the distributional information of a set of data; they can reveal outliers and data skewness and can be used to differentiate algorithms in terms of data symmetry and dispersion [75]. The height of each box reflects the stability, with narrower boxes representing less noise, fewer outliers, and more stable results. The aggregation of the solutions is an important factor in assessing the performance of an algorithm: if an algorithm falls into a local optimum, it will converge prematurely, and the quality of the solution will degrade. Six representative functions are selected from the unimodal, multimodal, and fixed-dimension functions, and the box plots of the 7 variants and the original EHO are plotted in Fig. 2. As can be observed from the figure, the boxes of GB2 are relatively lower and narrower. Considering the mean, best value, worst value, and standard deviation, GBEHO performs better on the different functions when PSR = 0.2.

Fig. 2 Comparison of the box plots for different algorithms

The results of the 8 algorithms on 23 benchmark functions are represented in Table 2. As clearly shown in the table, the different GBEHO variants achieved higher ranks than EHO. The results show that the quality of the candidate solutions is significantly strengthened by two operators and the combination with GBO. Indeed, GBEHO obtains a lower mean and standard deviation on most functions, especially GB2. Meanwhile, the performance of the algorithm gradually decreases as the probability of variation increases. Specifically, GB2 surpasses the other algorithms on F1 to F4, F9 to F11, F13 to F15, and F17 to F19 and obtains the highest ranking. GB3 performs best on F10 and F16 and ranks second among all algorithms. Comparatively, the improvement of GBEHO with PSR greater than 0.6 is less pronounced. These results show that the newly added operators further promote the capability of local exploitation. Meanwhile, the balance between exploration and exploitation is well achieved through the combined framework. However, the performance of variants on parts of the functions is relatively insignificant, e.g., F5, F6, F8, F12, F22, and F23. It is obvious that the best solutions on those functions are achieved by the original EHO. This is probably due to the unique characteristics of the different functions that make the modifications of EHO inapplicable. In conclusion, the variation operator can promote the performance of the original algorithm, and the improvement is most pronounced when PSR = 0.2. Therefore, PSR is set to 0.2 in the subsequent experiments.

4.2 Analysis of the modifications

To investigate the impact of the modifications on the performance of the algorithm, a set of comparison experiments is conducted. In this subsection, in addition to GBEHO and EHO, three other methods, namely, Gaussian sequence + EHO (GEHO), mutation operator + EHO (MEHO), and random wandering operator + EHO (RWEHO), are designed. These three strategies are the core modules of the modifications. Six representative functions are selected to verify the performance of the different variants. The size of the population N is set to 30, the maximum number of iterations \({t_{{\max \limits } }}\) is 500, and the number of clans c is 5. All other parameters are kept consistent. To reduce the effects of errors and instabilities, each algorithm is run 30 times, and the final results are the averages over these 30 runs.

Figure 3 shows the convergence curves of 5 algorithms on the 30-dimensional functions. The convergence efficiency of the other 4 variants is significantly better than the original EHO. This indicates that the modifications of Gaussian mapping, random wandering, and mutation operators can indeed all improve the convergence efficiency of the EHO. Specifically, the Gaussian sequence improves the initialization, which leads to an increased efficiency in the early stages of the algorithm. In addition, the global optimal solutions achieved by MEHO and RWEHO are superior to EHO and GEHO. It is therefore proven that random wandering and mutation operators enhance the diversity of the population. Consequently, exploration and exploitation are promoted, leading to a higher convergence accuracy. In comparison, GBEHO has the best convergence performance. The global optimum is attained around the 300th and 10th generations on the F2 and F9 functions, respectively. The convergence rate on the F11, F14, F15, and F20 functions is also the fastest among several algorithms. These results provide strong evidence that the combined effect of modifications has led to further improvements in the search accuracy and breadth of GBEHO.

Fig. 3 Convergence curves of the different variants

In addition, the average, standard deviation, best, and worst values of the different variants on the six benchmark functions are recorded in Table 3. The results in the table are averages obtained after 30 runs of each algorithm, and the best results on each function are shown in bold. It is obvious that all variants perform better than the original EHO algorithm, indicating that the Gaussian sequence, the random wandering operator, and the mutation operator are each effective. It is also worth noting that GBEHO achieves the most desirable overall performance, with the best average and standard deviation results on F2, F9, F11, F14, and F15. This indicates that the combination of the different strategies is effective and can significantly improve exploration and exploitation. In summary, it can be concluded that the modifications of EHO are convincing.

Table 3 Comparison results on 6 benchmark functions of different variants

4.3 Comparison with other metaheuristic algorithms

To further verify the effectiveness of the algorithm, GBEHO was compared with nine other algorithms, namely, k-means, particle swarm optimization (PSO) [76], differential evolution (DE) [77], the genetic algorithm (GA) [70], the cuckoo search algorithm (CS) [78], the gravitational search algorithm (GSA) [79], the bat algorithm (BA) [80], a quantum-inspired ant lion optimized hybrid k-means algorithm (QALO-K) [35], and a hybrid grey wolf optimizer with tabu search (GWOTS) [81].

4.3.1 Parameter settings

For fairness, the parameters of the selected algorithms are preset as shown in Table 4. It should be noted that the parameters in the table are set according to the recommendations in the works cited above. Except for the parameters in the table, the other settings are kept consistent. Furthermore, the maximum number of iterations \({t_{{\max \limits } }}\) is set to 200, and the size of the population N is set to 10. The number of clans c in GBEHO is set to 5.

Table 4 Parameter settings of the different algorithms

4.3.2 Datasets

Adán et al. [34] stated that the evaluation of a complete clustering algorithm should include both synthetic and standard real-world datasets. The datasets chosen for the experiments are from the University of California, Irvine (UCI) machine learning repository [82] and include Iris, Wine, Seeds, Breast, Heart, CMC, and Vowel. The synthetic dataset consists of two artificial datasets: two-moon and aggregation [83]. The basic information of the datasets is shown in Table 5.

Table 5 Basic information of the datasets

4.3.3 Comparison of the experimental results

In this section, the various algorithms are compared based on the experimental values of the SSE. Each algorithm is run 30 times separately, and the obtained results are shown in Table 6. Best, Worst, Mean, and Std. denote the best, worst, mean, and standard deviation of all the results, respectively. The algorithms produce different values depending on the complexity of the dataset. GBEHO provides the lowest solutions on most datasets. Compared to the basic k-means algorithm, GBEHO achieves better mean values in all cases.

Table 6 Comparison results on different datasets

Of the 9 datasets, GBEHO provides the lowest mean SSE results for 7: Wine, Seeds, Breast, Heart, CMC, Vowel, and Aggregation. In particular, GBEHO achieves the lowest best and worst values on these datasets. However, due to its inability to accurately identify the manifold structure, GBEHO performs poorly on the Two-moon dataset. The standard deviations of GBEHO are smaller than those of the other algorithms, indicating that the algorithm is more stable in its operation. In general, GBEHO obtains more satisfactory results than the other 9 algorithms. Consequently, these results provide strong evidence that GBEHO solves the clustering problem effectively.

Figure 4 shows the box plots obtained by the algorithms on the different datasets. It is observed that the box plots of GBEHO are the narrowest on all datasets. Obviously, GBEHO has a more stable clustering ability, and the population diversity is ameliorated by the strategy of mixing EHO and GBO. In addition, GBEHO produces the fewest outlier points, which indicates that GBEHO has strong robustness. These facts indicate that the proposed algorithm can effectively circumvent local minima.

Fig. 4 Comparison of box plots for different algorithms

4.3.4 Convergence analysis

An iteration is one complete pass through the procedures of an algorithm, repeated to approach the best solution; the results of each iteration provide the initial values for the next iteration [84]. The convergence curve reflects the convergence rate and the global search ability of the algorithm over the iterations.

The comparison of the convergence curves on the different datasets is shown in Fig. 5. All curves are generated after 30 independent runs of the different algorithms. GBEHO reaches stability by the 20th generation on the Iris, Wine, Seeds, Breast, Heart, and Aggregation datasets. Although GBEHO converges more slowly on the Vowel dataset, the quality of the solutions found is higher. The results verify that GBEHO has relatively faster convergence and a superior global search capability. Compared with GBEHO, the metaheuristics PSO, DE, GA, CS, GSA, and BA perform slightly worse.

Fig. 5 Convergence curves of different algorithms

4.3.5 Statistical analysis

In the preceding experiments, there are inevitably chance factors that affect the experimental results. To test the variability between the different algorithms, further statistical analysis of the obtained results is needed to obtain more reliable conclusions. Nonparametric tests can be used to check the performance of the algorithms [85]. The Wilcoxon signed-rank test [86] and the Friedman test [87] are two well-known techniques. Both make no assumption about the underlying data distribution and statistically examine whether a difference exists between two groups. The experiments in this paper are performed at the 5% significance level.
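For readers who wish to reproduce such an analysis, a hedged Python sketch using SciPy's stats module is shown below; the random toy samples stand in for per-run SSE values and are not the data reported in this paper.

```python
import numpy as np
from scipy import stats

# 30 independent SSE values per algorithm on one dataset (toy data for illustration)
rng = np.random.default_rng(0)
gbeho = rng.normal(100, 1, 30)
pso = rng.normal(103, 2, 30)
de = rng.normal(105, 2, 30)

# Wilcoxon signed-rank test: paired comparison of GBEHO against one competitor
stat, p = stats.wilcoxon(gbeho, pso)
print(f"GBEHO vs. PSO: p = {p:.4f}", "significant" if p < 0.05 else "not significant")

# Friedman test: joint ranking over three or more algorithms
stat, p = stats.friedmanchisquare(gbeho, pso, de)
print(f"Friedman test: p = {p:.4f}")
```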

Table 7 reports the results for the nine comparison groups: GBEHO vs. PSO, GBEHO vs. DE, GBEHO vs. GA, GBEHO vs. CS, GBEHO vs. GSA, GBEHO vs. BA, GBEHO vs. QALO-K, GBEHO vs. GWOTS, and GBEHO vs. k-means. If the p-value is less than 0.05, then the result is significantly different. The bold values in the table indicate values greater than 0.05. As observed from the table, except for the values obtained for GBEHO vs. CS on Iris and GBEHO vs. PSO on the Heart dataset, which are greater than 0.05, all other values are less than 0.05, which provides valid evidence against the null hypothesis. The results suggest that the excellent performance of GBEHO is statistically significant and not achieved by chance.

Table 7 Results of p-values obtained by the Wilcoxon signed-rank test

The results of the Friedman test are shown in Table 8. The obtained values are the average ranking of all algorithms when conducting the experiments. According to the results, the algorithm with the lower ranking is considered to be the most efficient algorithm. Obviously, a better average ranking of GBEHO proves that the proposed algorithm has a more competitive advantage. At the same time, it makes the series of experiments more convincing.

Table 8 Results of ranks obtained by the Friedman test

4.3.6 Analysis of the clustering process

In this subsection, three datasets, Iris, Seeds and Aggregation, are selected for visualization. The original distributions are shown in Fig. 6. Figures 7, 8 and 9 display the clustering visualization results. GBEHO and PSO are the two best algorithms on the Iris dataset. It can be observed in Fig. 7 that both algorithms accurately divide the dataset into three distinct clusters, and both achieve relatively good solutions. In comparison, the centroids found by GBEHO are significantly closer to the real scenario than those found by PSO. This suggests that GBEHO has better performance. In terms of the iterations, the centroids found by GBEHO are relatively stable by the 20th generation. This indicates that GBEHO has a faster convergence rate and better stability. Figure 8 compares the clustering results on the Seeds dataset, where GBEHO and GA are the two superior algorithms. Apparently, GBEHO achieves better centroid positions in the 20th generation and in the final results. In the 20th generation, GBEHO is able to extract the centroid of the bottom leftmost cluster, while GA is not. It is clear that GBEHO is able to distinguish the blue and green clusters more accurately than GA. The performance on the Aggregation dataset is shown in Fig. 9. For the two clusters on the top left and top, GBEHO obtains more precise clustering centroids. Both GWOTS and GBEHO find the exact centroids of the upper and lower right clusters. However, GBEHO's delineation of the bottommost cluster is clearer. Although the black and magenta clusters in Fig. 6c are not accurately distinguished, this is due to the shortcomings of the traditional Euclidean distance. In terms of the overall convergence rate and clustering accuracy, GBEHO is relatively superior.

Fig. 6 The original distribution

Fig. 7 Comparison of the clustering results on the Iris dataset

Fig. 8 Comparison of clustering results on the Seeds dataset

Fig. 9 Comparison of clustering results on the Aggregation dataset

4.4 Comparison experiments with state-of-the-art techniques

In this subsection, extra experiments are conducted to further validate the performance of the proposed algorithm. Five UCI datasets, namely, Wine, Breast, CMC, Heart, and Vowel, are chosen to compare GBEHO with PSR = 0.2 against the reported results of four other recently proposed algorithms, namely, CSOS, Hybrid FCM-PSO, ACLSHMS, and KIGSA-C. Table 9 shows the parameter values of the different algorithms. The maximum number of iterations Maxit is set to 500, and the population size N is set to 30. To eliminate the influence of uncontrollable factors as far as possible, all algorithms were run 30 times, and the average value was adopted as the final result for comparison.

Table 9 Parameter settings of the different algorithms

When clustering a dataset, attention needs to be given to how well the resulting clusters fit the input data. Therefore, the results must be validated against certain evaluation criteria, which is a fundamental aspect of data clustering. The metrics for evaluating clustering results are broadly classified into three categories, namely, external metrics, internal metrics, and relative validation [25]. Four evaluation metrics are invoked in the experiments to quantitatively compare the clustering performance, namely, the accuracy rate (AR), specificity (SP), detection rate (DR), and F-measure (F1), which are defined in (42)-(45).

$$ AR = \frac{{TP + TN}}{{TP + TN + FN + FP}} $$
(42)
$$ SP = \frac{{TN}}{{TN + FP}} $$
(43)
$$ DR = \frac{{TP}}{{TP + FN}} $$
(44)
$$ {F_{1}} = \frac{{\left( {{b^{2}} + 1} \right) \cdot precision \cdot recall}}{{{b^{2}} \cdot precision + recall}} $$
(45)

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively, \(precision = \frac {{TP}}{{TP + FP}}\), \(recall = \frac {{TP}}{{TP + FN}}\), and b = 1.
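For clarity, the following sketch computes the four metrics in (42)-(45) directly from the confusion counts; it is an illustration rather than the evaluation code used in the experiments, and the counts in the usage example are arbitrary.

```python
# Helper computing AR, SP, DR, and F1 from raw confusion counts, per (42)-(45).
def clustering_metrics(tp: int, tn: int, fp: int, fn: int, b: float = 1.0):
    ar = (tp + tn) / (tp + tn + fn + fp)   # accuracy rate, Eq. (42)
    sp = tn / (tn + fp)                    # specificity, Eq. (43)
    dr = tp / (tp + fn)                    # detection rate, Eq. (44)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = ((b**2 + 1) * precision * recall) / (b**2 * precision + recall)  # Eq. (45)
    return {"AR": ar, "SP": sp, "DR": dr, "F1": f1}

# Usage with arbitrary counts (not taken from Table 10):
print(clustering_metrics(tp=50, tn=40, fp=5, fn=5))
```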

The obtained results are shown in Table 10, and Fig. 10 compares the evaluation metrics of the five algorithms on the different datasets. On the Wine dataset, all five algorithms achieve satisfactory results, which can be attributed to its simpler structure; the different algorithms are therefore able to identify the clusters more accurately. In contrast, on the Breast and Vowel datasets, several algorithms do not perform well due to the more complex structure of the clusters. Specifically, for AR, GBEHO achieves the best performance on Wine and CMC and ranks 2nd, 3rd, and 4th on Breast, Vowel, and Heart, respectively. For SP, GBEHO ranks 1st on Wine and CMC and 3rd, 3rd, and 4th on Breast, Vowel, and Heart, respectively. For DR, GBEHO ranks first on Wine and CMC and 2nd, 2nd, and 3rd on Breast, Vowel, and Heart, respectively. For F1, GBEHO ranks 2nd on Wine, Breast, and CMC and 4th on Heart and Vowel. Overall, GBEHO performs the best on Wine and CMC. On Breast, GBEHO ranks 2nd, behind KIGSA-C. On Heart, CSOS performs the best, Hybrid FCM-PSO is second, and GBEHO ties with KIGSA-C. On Vowel, GBEHO ranks second, behind only KIGSA-C. From the above analysis, it can be concluded that GBEHO is competitive in the comparison with the state-of-the-art techniques. In general, GBEHO provides a good choice for clustering and can be regarded as a powerful and effective clustering algorithm.

Table 10 Results of GBEHO with PSR = 0.2 versus other state-of-the-art techniques
Fig. 10 Comparison results of 5 algorithms on different datasets

5 Discussion

Overall, the experimental results are consistent with the hypothesis. The introduction of the two operators and GBO improves the performance of the original EHO. Experiments on benchmark functions and on different types of datasets show that the improvement is significant. The proposed GBEHO achieves a higher clustering accuracy, as evaluated by four metrics, namely, the accuracy rate, specificity, detection rate, and F-measure. Therefore, it can be concluded that GBEHO is an effective clustering method that can be applied to a cluster analysis of different datasets.

Compared with other metaheuristic-based clustering algorithms, GBEHO shows a more competitive performance and provides more desirable clustering results. GBEHO inherits the advantages of traditional EHO, such as its strong global exploration capability. Meanwhile, because the clan operator and separating operator of the original EHO are improved by the random wandering operator and the mutation operator, GBEHO is equipped with stronger local exploitation than PSO, DE, GA, etc. Compared with BA and CS, it explores more effectively and thus better avoids falling into local optima, which in turn improves the convergence rate. Moreover, GBEHO provides more accurate clustering results than the state-of-the-art algorithms. However, GBEHO is subject to several limitations. First, its time complexity is higher than that of other classical algorithms, which is caused by the newly added mechanisms; the improvement in clustering accuracy comes at the cost of increased complexity. Second, as the dimensionality increases, some metaheuristic algorithms suffer from weakened stability. A scalability test with increasing dimensions was not performed, so the adaptability of GBEHO to high-dimensional data needs to be examined further. Based on the No Free Lunch (NFL) theorem [51], there is no universally best optimization method, so we do not claim that GBEHO is superior in all cases. The well-known k-means algorithm has gained widespread use and attention since its inception, yet it is not without flaws; it remains dependent on the initial solution and prone to falling into local optima. For our proposed method, we are more concerned with clustering accuracy than with running time. As this research is developed further, the authors believe that techniques for improving operational efficiency, such as parallel computing, will provide better technical support for GBEHO.

6 Conclusions and future work

Traditional clustering methods easily fall into local optima, and the initialization of the centroid positions is a prominent problem. In this paper, an improved version of EHO is proposed for cluster analysis. Chaotic mapping based on Gaussian sequences improves the ergodicity and diversity of the initialized population. Two operators, random wandering and mutation, are presented to optimize the position-updating strategy of EHO, thus promoting population diversity and the ability to jump out of local optima. The former improves the diversity of the population as well as the global exploration ability, and the latter promotes local exploitation at a later stage. In addition, the GBO operators contribute to further balancing exploration and exploitation so that the best centroids are determined more accurately. More suitable parameter values are determined through ablation experiments.

Experiments on artificial and real-world datasets indicated that GBEHO has a better clustering performance than the other metaheuristic algorithms and their variants. The obtained intracluster variance was compared with that of the classical k-means, PSO, DE, and GA algorithms to show its superiority. The box plots and convergence curves showed that GBEHO has greater stability and faster convergence. The numerical results were confirmed by statistical analysis: nonparametric tests verified the significant differences between GBEHO and the other algorithms. The visualizations of the clustering process demonstrated that GBEHO finds more accurate centroids within fewer iterations. Compared with other state-of-the-art algorithms, GBEHO achieves competitive results in terms of the accuracy rate, specificity, detection rate, and F-measure on five UCI datasets. Taken together, these results confirm that GBEHO is an effective tool for data clustering.

In future research, we plan to reduce the time complexity of GBEHO through further design and experimentation. GBEHO can also be extended to several application areas, such as intrusion detection, image segmentation, and route planning. In addition, the performance of the hybrid algorithm will continue to be optimized to address sophisticated problems faced in practical engineering. The authors believe that this algorithm has great potential and that its practical applications are promising.