Introduction

Data clustering, a family of unsupervised learning algorithms, plays an irreplaceable role in the field of machine learning [1]. Compared with supervised classification algorithms, data clustering algorithms can reveal the internal structure of unknown data regions and provide a new perspective on data. In recent years, data clustering algorithms have been widely applied in data mining [2], image segmentation [3], biomedicine [4], intelligent transportation [5], and many other fields, which demonstrates the significant value of this research. Since the definition of clustering is not completely unified [1], a classic definition is given here: (1) instances in the same cluster are relatively close and similar; (2) instances in different clusters are relatively far apart and different; (3) measurements of similarity and dissimilarity should be clear and meaningful. The goal of clustering algorithms is to find such a set of clusters. As suggested by Fraley and Raftery in Ref. [6], clustering algorithms can be divided into two categories: hierarchical clustering and partitional clustering. Unlike hierarchical clustering, partitional clustering obtains the final clustering groups by first specifying the initial groups and then repeatedly assigning instances to groups until a convergence criterion is satisfied [7]. The most popular partitional methods are the K-means algorithm and the K-centroids algorithm.

Due to its advantages of simple implementation, easy interpretation, and suitability for sparse data [8], the K-means algorithm has been widely studied and applied in many research areas. However, it has also been noted that the K-means algorithm is extremely sensitive to the setting of the initial centroids and does not perform well on some datasets [9]. This dependence on initialization means that the performance of the K-means algorithm can be improved by optimizing the selection of the initial centroids [10]. Since heuristic algorithms were first proposed, they have attracted researchers’ attention and developed rapidly due to their excellent optimization capabilities. These algorithms search for solutions by following specific rules within a reasonable time, as they are designed to solve various optimization problems, especially NP-hard problems [11]. Considering this superiority, heuristic algorithms have also been introduced to address the problem of optimizing the selection of the initial centroids for the K-means algorithm [12,13,14,15].

The black hole algorithm (BHA), proposed by Hatamlou [16] in 2013, is one of the effective swarm intelligence optimization algorithms. Its design mechanism simulates the attraction between a black hole and its surrounding stars in space. BHA has a simple structure and is easy to implement for application problems; it is also fast, efficient, and little affected by hyperparameters. Therefore, it has shown great performance in many application fields, such as optimization problems [17, 18], feature selection [19], image segmentation [20], gene selection [21], clustering analysis [16, 22, 23], etc. In the past few years, plenty of improved BHAs have been proposed. For instance, Mohammed et al. [24] proposed the gravitational search–black hole algorithm (GSBHA) that combined the gravitational search algorithm (GSA) and BHA, which showed better performance than the original algorithms. In Ref. [25], Yaghoobi et al. modified the formula for searching the area near the black hole and introduced mutation and crossover operators. After that, the local search ability of BHA was examined and redesigned to improve its effectiveness [26]. Ibrahim et al. [27] also introduced a white hole local operator to promote the exploitation ability.

Although these studies effectively improve the performance of the BHA, the resulting algorithms still have limitations [28]. The simple structure may cause the algorithm to become trapped in a poor local optimum with a low probability of escaping in some application problems. In addition, the algorithm may be difficult to control and may rely heavily on randomness because it has almost no hyperparameters. Moreover, the quality of the black hole in the swarm has a great influence on the performance of the algorithm, and it is difficult to explore more of the space once the black hole dominates the update process. At the same time, the trajectories of the stars toward the black hole are not elaborate enough to provide sufficient exploitation of the solution space. The problems mentioned above may lead to a poor ability to balance exploration and exploitation.

The main goal of this article is to address the problem of selecting the initial centroids to improve the performance of the K-means algorithm. Therefore, an improved BHA, namely the self-adaptive logarithmic spiral path black hole algorithm (SLBHA), is proposed here. Figure 1 shows how the algorithm works, and the details of the algorithm are presented in the following sections. The main contribution of this paper is a set of new strategies that overcome the drawbacks of the BHA mentioned above. At the same time, the improved BHA provides an effective way to solve the clustering problem. First, the stars in SLBHA are updated along either a logarithmic spiral path or a random vector path. In this process, a parameter controlling the randomness adjusts the choice between the two paths, which contributes to the local search around the black hole. Second, a greedy retention strategy is proposed to help retain better results. Third, SLBHA introduces a replacement mechanism for stars, which improves the diversity of the population, expands the search space of the population, and provides more possibilities for jumping out of local optima. Finally, an adaptive parameter is added to help control the balance between the global and local search procedures by adjusting the replacement mechanism. Experimental results on standard datasets demonstrate the effectiveness of the proposed methods, and external criteria for the clustering problem are employed for the analysis.

Fig. 1

The working process of the proposed algorithm

The rest of this paper is organized as follows. “Related works” briefly summarizes the heuristic algorithms applied to the clustering problems. “Preliminaries” presents the basic concepts of the classical BHA model and the conventional K-means algorithm. “The proposed work” details the proposed algorithm in this paper. “Results and discussion” evaluates the proposed model through experimental tests and compares it with other selected comparative algorithms. “Conclusions and future work” gives the conclusions and future research directions of this paper.

Related works

The effectiveness of heuristic algorithms in improving the K-means algorithm has been demonstrated in a great number of research works. In this section, we give a brief overview of representative literature from the perspective of the algorithms used.

The genetic algorithm (GA) was applied to clustering problems by Maulik et al. [29]. Later, Xiao et al. [14] proposed a quantum-inspired genetic algorithm for K-means clustering (KMQGA). They introduced the Q-bit representation and concepts from quantum computing into their work and made the length of a Q-bit in KMQGA a variable quantity. This mechanism extended the search space of the algorithm, and its effectiveness was verified on both simulated and real datasets. Building on the advantages of the genetic algorithm, Fatahi et al. [30] proposed a combination of the flower pollination algorithm and the genetic algorithm (FPAGA). The experimental results demonstrate its effectiveness with greater accuracy and better stability. However, these methods must pay attention to the diversity of the datasets as well as the exploration ability.

As one of the classical intelligent optimization algorithms, the particle swarm optimization (PSO) algorithm has also been applied to clustering problems. In Ref. [31], the author proposed two PSO methods for data clustering: the first showed how PSO helps to find the centroids of a specified number of clusters, and the second applied the K-means algorithm to seed the initial swarm. The fitness function of the methods proposed in Ref. [31] was novel at that time, but the design of the experiments could have been more standardized. Hatamlou et al. [15] hybridized PSO with a heuristic search algorithm (PSOHS): PSO was used to search for an initial solution to the clustering problem, and the heuristic search algorithm was then applied to improve the quality of this solution. The superiority of this algorithm over other approaches was shown in its experimental analysis. Li et al. [32] proposed the adaptive learning PSO to prevent the K-means clustering algorithm from depending on the initial cluster centers. The improved KM-ALPSO was then applied to customer segmentation and showed effectiveness and practicability in this task. PSO and its variants showed efficiency and robustness in solving this problem, but their ability to balance exploration and exploitation is questionable.

Ant colony optimization (ACO) was applied to data clustering in Ref. [33]. Niknam et al. [34] not only noticed the effectiveness of the ACO algorithm but were also interested in the simulated annealing (SA) algorithm. They combined the two algorithms and used SA as a local search within ACO. The experimental results showed a better response and quicker convergence than ordinary evolutionary methods. In addition, the same authors [35] also proposed a new hybrid evolutionary algorithm that combined fuzzy adaptive particle swarm optimization (FAPSO), ACO, and the K-means algorithm, called FAPSO–ACO–K. The performance of this algorithm was much better than that of the other algorithms for the partitional clustering problem. The combination of ACO and the K-means algorithm has some distinct characteristics compared with the other heuristic algorithms, which deserve further research and exploration.

The gravitational search algorithm (GSA) is an effective method for searching problem space for the optimal solution and it was combined with the K-means algorithm in the hybrid method proposed by Hatamlou et al. [36]. Their hybrid algorithm, named GSA–KM, helped the K-means algorithm to escape from local optima and increased the convergence speed of GSA. Different from the classical GSA, Dowlatshahi et al. [37] adapted the structure of GSA by a special encoding scheme and presented the grouping GSA (GGSA). The simulation experimental results indicated that this method can effectively be applied to multivariate data clustering. Han et al. [38] introduced a new mechanism that is inspired by the collective response behavior of birds into GSA to add diversity, which was called bird flock GSA (BFGSA). Since the collective response mechanism helped the algorithm explore a wider range of the search space, the performance of BFGSA was much better than the other algorithms. The proposed GSA-based methods mentioned above concentrated on overcoming the drawbacks of the traditional GSA and achieved great results.

In addition, many other heuristic algorithms have been applied to this problem. For example, Senthilnath et al. [39] used the firefly algorithm (FA) for data clustering and compared it with two other algorithms. Xie et al. [9] proposed two variants of FA, namely the inward intensified exploration FA (IIEFA) and the compound intensified exploration FA (CIEFA). The dispersing mechanism introduced in CIEFA ensured sufficient variance between fireflies for the purpose of increasing search efficiency. However, the time complexity of the FA-based methods is not good enough. A modified bee colony optimization (MBCO) was presented in Ref. [40], and the hybrid algorithms performed better than the compared algorithms. To tackle the clustering problem with cuckoo search (CS), Boushaki et al. [11] extended the CS capabilities using a nonhomogeneous update inspired by quantum theory. Zhou et al. [41] used a recently proposed metaheuristic optimization algorithm, called symbiotic organism search (SOS), to solve clustering problems. Tawhid et al. [42] proposed a new hybrid swarm intelligence optimization algorithm combining the monarch butterfly optimization (MBO) algorithm with the CS algorithm and applied it to the clustering problem. An enhanced whale optimization algorithm (EWOA) was introduced in Ref. [43], and the experiments demonstrated the applicability and feasibility of the enhancements. Almotairi et al. [44] proposed a method named HRSA for the clustering problem, which combined the original Reptile Search Algorithm (RSA) and the Remora Optimization Algorithm (ROA).

For all the algorithms mentioned above, it should be emphasized that no single algorithm can obtain satisfactory solutions for every application problem and outperform all other algorithms. A useful algorithm should strike a balance between exploitation and exploration and converge to the optimal solution as required. In this paper, we focus on designing an improved BHA for solving the initialization problem of the K-means algorithm. The logarithmic spiral path and a replacement mechanism for stars are introduced to improve the searching ability of the algorithm. At the same time, we design an adaptive parameter to help control the balance between the global and local search procedures by adjusting the replacement mechanism. The experiments show that the proposed algorithm converges to the optimal solution in most cases and outperforms the compared algorithms.

Preliminaries

This section describes briefly the main concepts utilized in the proposed approach, which are the classic BHA and the K-means algorithm.

The classical BHA

The concept of the black hole was first identified by John Michell and Pierre Laplace in the eighteenth century and named by John Wheeler in 1967 [16]. BHA is a population-based intelligent optimization algorithm inspired by the behaviors of black holes and stars. Black holes form when massive stars collapse gravitationally, creating a gravitational field so powerful that even light cannot escape. The boundary around a black hole, called the event horizon, is the limit that matter can reach, because nothing that enters the event horizon can escape. The radius of the event horizon is called the Schwarzschild radius, which is calculated by the following equation:

$$R=\frac{2GM}{{C}_{l}^{2}},$$
(1)

where \(G\) is the gravitational constant, \(M\) is the mass of the black hole, and \({C}_{l}\) is the speed of light. Inspired by the above concepts, the BHA was proposed as a novel heuristic algorithm. In the process of searching for the optimal solution, the best agent is set as the black hole while the others are regarded as stars. The locations of the stars then change as they move toward the black hole along a specific trajectory. Once a better solution is found, the black hole is replaced with it. In addition, stars that cross the event horizon of the black hole are swallowed by it, and new stars are generated to keep the population size constant.

Suppose the population size is \(N\) and the dimension of the optimization problem is \(D\). At the beginning, the agents are randomly initialized in the solution space, and then the fitness value of each agent is calculated [28]. Taking a minimization problem as an example, the agent with the minimum fitness value is set as the black hole. \({X}_{i}^{t}\) represents the \(i\) th agent at the \(t\) th iteration and \({X}_{BH}^{t}\) is the black hole. The movements of the agents can then be formulated by the following equation:

$${X}_{i}^{t+1}={X}_{i}^{t}+rand\times \left({X}_{BH}^{t}-{X}_{i}^{t}\right),$$
(2)

where \(rand\) represents a random number in the interval [0,1]. The formula indicates that the stars in the population are attracted by the black hole and move toward it, while the distance of each movement is determined by the random number \(rand\). However, there exists an event horizon around the black hole. Once a star comes within the radius of the event horizon, it is absorbed by the black hole, and the algorithm replaces it with a new star in the population. The event horizon radius used here differs from Eq. (1) and is given by the following equation:

$$R=\frac{{f}_{BH}}{{\sum }_{i}^{N}{f}_{i}},$$
(3)

where \({f}_{i}\) and \({f}_{BH}\) represent the fitness value of the \(i\) th agent and the black hole, respectively. At each iteration of the algorithm, the agents in the population are re-evaluated and then compared to the black hole based on the fitness value for checking whether the black hole needs to be replaced. The process of the BHA will not stop until the convergence condition is met, where the optimal solution is found.
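To make the workflow concrete, the following Python sketch reproduces the classical BHA loop described by Eqs. (2) and (3) for a generic minimization problem. The function and parameter names are illustrative, and positions are assumed to lie in [0, 1], matching the normalized setting used later in this paper.

```python
import numpy as np

def bha_minimize(fitness, dim, n_agents=20, n_iter=100, seed=None):
    """Minimal sketch of the classical BHA (Eqs. 2 and 3); names are illustrative."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_agents, dim))                      # random initial stars in [0, 1]^dim
    f = np.array([fitness(x) for x in X])
    for _ in range(n_iter):
        bh = np.argmin(f)                                # best agent acts as the black hole
        # Eq. (2): each star moves toward the black hole by a random fraction of the gap
        X = X + rng.random((n_agents, 1)) * (X[bh] - X)  # the black hole itself stays put
        X = np.clip(X, 0.0, 1.0)
        f = np.array([fitness(x) for x in X])
        bh = np.argmin(f)                                # a better star replaces the black hole
        # Eq. (3): event horizon radius; stars inside it are swallowed and respawned
        radius = f[bh] / f.sum()
        for i in range(n_agents):
            if i != bh and np.linalg.norm(X[i] - X[bh]) < radius:
                X[i] = rng.random(dim)
                f[i] = fitness(X[i])
    best = np.argmin(f)
    return X[best], f[best]
```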

The K-means algorithm

According to the related articles [7], the K-means algorithm starts from a randomly selected set of centroids, assigns instances to clusters by comparing their distances to the centroids, then recalculates the centroids and iterates until the termination condition is satisfied. The main similarity metric used by the K-means clustering algorithm is the Euclidean distance [8]; that is, data points in the same cluster are close to each other, while data points in different clusters are relatively far apart in terms of Euclidean distance. Specifying the number of clusters and the locations of the initial centroids is essential for the K-means algorithm. In the next step, data points are assigned to different clusters by comparing their distances to the initial centroids. The resulting clusters may still have large errors at this point, so the centroids are recalculated and the data points are reallocated. This process is repeated iteratively until the stopping condition of the algorithm is met. The selection of the initial centroids has an important influence on the clustering results, which means that better initial centroids can greatly improve the performance of the algorithm.

Given \(k\) initial centroids, each centroid represents a cluster. There are \(m\) instances in the dataset, and the dimension of each instance is \(d\), then the objective function is defined as:

$$J=\sum_{i=1}^{k}\sum_{j=1}^{\left|{C}_{i}\right|}{\text{distance}}\left({x}_{j}^{i},{\mu }^{i}\right),$$
(4)

where \({C}_{i}\) represents the \(i\) th cluster and \({\mu }^{i}\) is its centroid. \(\left|{C}_{i}\right|\) is the number of instances in cluster \({C}_{i}\), \({x}_{j}^{i}\) represents the \(j\) th instance of the \(i\) th cluster, and \({\text{distance}}\left({x}_{j}^{i},{\mu }^{i}\right)\) denotes the distance between \({x}_{j}^{i}\) and \({\mu }^{i}\), which is defined by the following equation:

$${\text{distance}}\left({x}_{j}^{i},{\mu }^{i}\right)=\sqrt{\sum_{p=1}^{d}{\left[{x}_{j}^{i}\left(p\right)-{\mu }^{i}\left(p\right)\right]}^{2}}.$$
(5)

After the data points have been assigned, each centroid is recalculated as the mean of the instances belonging to its cluster, as formulated by the following equation:

$${\mu }^{i}=\frac{1}{\left|{C}_{i}\right|}\sum_{j=1}^{\left|{C}_{i}\right|}{x}_{j}^{i}.$$
(6)

The stopping condition of the algorithm may be that the objective function value no longer changes with the re-clustering of instances, or that the algorithm reaches its maximum number of iterations. The time complexity of the K-means algorithm is proven to be linear, namely \(O\left(Tkmd\right)\) [8], where \(T\) is the number of iterations of the algorithm. It is this linear time complexity that makes K-means a popular and competitive method. Even if the number of instances in the dataset is relatively large, the K-means algorithm retains certain advantages over other clustering algorithms.
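A minimal sketch of this iterative procedure, following Eqs. (4)–(6), is shown below; the stopping tolerance and the handling of empty clusters are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def kmeans(data, centroids, max_iter=100, tol=1e-6):
    """Minimal K-means sketch; `data` is m x d and `centroids` is the k x d initial guess."""
    centroids = centroids.copy()
    prev_obj = np.inf
    for _ in range(max_iter):
        # assign every instance to its nearest centroid (Euclidean distance, Eq. 5)
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        obj = dists[np.arange(len(data)), labels].sum()   # objective J of Eq. (4)
        # recompute each centroid as the mean of its cluster (Eq. 6)
        for i in range(len(centroids)):
            members = data[labels == i]
            if len(members):                              # leave empty clusters unchanged
                centroids[i] = members.mean(axis=0)
        if abs(prev_obj - obj) < tol:                     # stop when J no longer changes
            break
        prev_obj = obj
    return centroids, labels, obj
```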

The proposed work

The classical BHA demonstrated its feasibility and superiority on the data clustering problem when it was first proposed [16]. However, its tendency to fall into local optima limits its application and development. Therefore, this paper designs an improved black hole algorithm, namely SLBHA. The flow chart of the algorithm is shown in Fig. 2. To overcome the drawback of becoming trapped in local optima, SLBHA introduces a logarithmic spiral path into the BHA to improve its local exploitation ability and adds a greedy retention strategy. Moreover, a new mechanism for global exploration is designed in SLBHA, which greatly expands the search scope of the algorithm and refines the search process. The proposed algorithm is described in the following sections. For clarity, the pseudo-code of the algorithm is shown in Algorithm 1.

Fig. 2

The flow chart of SLBHA

Initialization and representation of agents

Similar to other population-based intelligent optimization algorithms, each individual in the population of the BHA represents a feasible solution that is a set of centroids for the data clustering problem. The agents can be defined as follows:

$${X}_{i}={\left({c}_{i,1},{c}_{i,2},\dots ,{c}_{i,k}\right)}^{T},$$
(7)

where \({c}_{i,j}(j=\mathrm{1,2},\dots ,k)\) represents the \(j\) th centroid of the \(i\) th agent and is a \(d\)-dimensional vector. \(k\) is the number of clusters, which is specified before the algorithm runs. Thus \({X}_{i}\) is a \(k\times d\) matrix, and a swarm of \(N\) agents is a set of \(N\) such matrices. For initialization, \(k\) points are randomly selected from the dataset for each agent as its set of centroids.
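A small sketch of this representation and initialization, assuming the dataset is stored as an \(m\times d\) NumPy array, could look as follows (function and variable names are illustrative):

```python
import numpy as np

def init_agents(data, k, n_agents, seed=None):
    """Eq. (7): each agent is a k x d matrix of centroids sampled from the dataset."""
    rng = np.random.default_rng(seed)
    # pick k distinct instances per agent as its initial centroid set
    return np.stack([data[rng.choice(len(data), size=k, replace=False)]
                     for _ in range(n_agents)])
```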

The proposed improved self-adaptive logarithmic BHA (SLBHA)

In the classical BHA, the stars are attracted by the black hole and move toward it by a certain distance. The rule deciding the trajectory of the stars is thus a single one and depends heavily on randomness. That is, when the black hole is located at a local optimum, the stars may be attracted to it and converge too quickly at the beginning, which may cause the algorithm to fall into that local optimum and be unable to escape. Once the position of the black hole is set, the movements of the stars amount to searching the space around it. Whether the algorithm is in a global exploration or a local exploitation phase, the BHA handles it with the same rule, which means its balance between the two search modes may not be satisfactory.

Logarithmic spiral path with parameter

The logarithmic spiral path was proposed in the whale optimization algorithm [45] to simulate the helix-shaped movement of humpback whales, and it is an effective search method. When used in the firefly algorithm [46], this path improved the local exploitation ability of the firefly algorithm, and its effectiveness was verified experimentally. Sharma et al. [47] introduced a logarithmic spiral-based local search strategy and incorporated it into the ABC algorithm to build a new viable algorithm. To the best of our knowledge, no research has combined this strategy with the BHA until now. In this paper, the logarithmic spiral path is introduced into the BHA, and the improved updating formula is given in the following equation:

$${\widehat{X}}_{i}^{t+1}=\left\{\begin{array}{c}{X}_{i}^{t}+\left({X}_{BH}^{t}-{X}_{i}^{t}\right)\cdot {e}^{bl}\cdot {\text{cos}}\left(2\pi l\right), {r}_{1}<\rho \\ {X}_{i}^{t}+R\cdot \left({X}_{BH}^{t}-{X}_{i}^{t}\right), {r}_{1}\ge \rho \end{array},\right.$$
(8)

where \({\widehat{X}}_{i}^{t+1}\) represents the candidate position for the next generation that may be used for updating. The symbol \(\cdot \) is the element-by-element multiplication operation. \(b\) is a constant related to the shape of the logarithmic spiral path and \(l\) is a random number in [-1,1]. \({r}_{1}\) is another random number in [0,1] and is generated in each updating process. \(\rho \) is the hyperparameter that controls the update paths of the agents, and it lies in the range [0,1]. When the generated random number \({r}_{1}\) is less than \(\rho \), the agent is updated according to the logarithmic spiral path; otherwise, the other path is chosen. \(R\) is a random matrix of size \(k\times 1\). The idea of replacing the random number with a random vector to expand the agents' search space was first proposed by Yaghoobi and Mojallali [25] and was later adopted and improved by Deeb et al. [22]. Inspired by the above literature, this paper also makes use of the random vector as one of the paths. Deeb et al. [22] suggest that the elements of the random vector should take values in the range [0,1.5] and that the dimension of the vector should be \(d\). However, to avoid excessive deviation from the original optimization path, the randomness is reduced in this paper by using a \(k\times 1\) random vector whose elements are generated in [0,1]. In this way, the algorithm not only expands the search space for the agents but also keeps the degree of randomness at a balanced level. The improved star trajectory selection mechanism is shown in Fig. 3, where the path of the traditional BHA is represented by the dotted line, which is only for illustration and is not actually used in the algorithm.
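A sketch of the path selection in Eq. (8) might look as follows; the helper name, the default \(\rho = 0.5\) (taken from the parameter experiments later in this paper), and the spiral constant \(b = 1\) are illustrative assumptions.

```python
import numpy as np

def move_star(x, x_bh, rho=0.5, b=1.0, rng=None):
    """Candidate position from Eq. (8); x and x_bh are k x d centroid matrices."""
    rng = np.random.default_rng(rng)
    if rng.random() < rho:                        # r1 < rho: logarithmic spiral path
        l = rng.uniform(-1.0, 1.0)
        return x + (x_bh - x) * np.exp(b * l) * np.cos(2 * np.pi * l)
    R = rng.random((x.shape[0], 1))               # r1 >= rho: k x 1 random vector path
    return x + R * (x_bh - x)
```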

Fig. 3

The movement of the stars

In addition, the datasets are standardized before being used in the experiments of this paper, so the value of each dimension of an agent should lie between 0 and 1. During star movement, a check on whether a star exceeds this range is therefore necessary. When a position value exceeds 1 or falls below 0, it is clipped to 1 or 0, respectively, for subsequent calculations.

Greedy retention strategy

In this paper, a cautious greedy retention strategy is used to retain each agent's best solution during its movement. If the position a star would move to lowers the quality of its solution, the star remains stationary; the star moves only when the movement brings an improvement. In this way, the update process of the stars always makes progress and the improvement of the solutions is stable. Its mathematical expression is shown in the following equation:

$${X}_{i}^{t+1}=\left\{\begin{array}{ll}{\widehat{X}}_{i}^{t+1}, & \text{if } f\left({\widehat{X}}_{i}^{t+1}\right)<f\left({X}_{i}^{t}\right)\\ {X}_{i}^{t}, & \text{otherwise}\end{array}\right..$$
(9)
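A compact sketch of this acceptance rule, combined with the boundary clipping described above, is given below; the function name and the on-the-fly fitness evaluation are illustrative choices.

```python
import numpy as np

def greedy_step(x, x_hat, fitness):
    """Clip the candidate to the normalized range, then keep it only if it improves (Eq. 9)."""
    x_hat = np.clip(x_hat, 0.0, 1.0)     # standardized data, so positions stay in [0, 1]
    return x_hat if fitness(x_hat) < fitness(x) else x
```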

With the above measures, the optimization ability of the BHA is improved to some extent. However, this is insufficient because such a greedy strategy may cause the algorithm to become trapped in a local optimum. Improving the global search ability of the algorithm, especially in its early stage, helps it escape from local optima. Therefore, the following strategies are designed based on this idea.

Replacement mechanism of stars

In the classical BHA, the stars are attracted to search only around the black hole, and such an approach may limit the global exploration capability of the algorithm. To further improve the global search capability of the BHA and increase the diversity of the population, a replacement mechanism for stars is proposed in this paper. The basic idea of this mechanism is that the locations of good solutions should be preserved, while the worse solutions should be replaced to search more of the solution space. Figure 4 shows the process of the replacement mechanism. The steps of the operator are given as follows:

Fig. 4

Diagram of replacement mechanism

Step 1: Sort all the stars (including the black hole) in descending order of fitness value, so that the top-ranked stars represent the worse solutions. The better half of the agents forms Part 1 and the worse half forms Part 2.

Step 2: Choose the stars of Part 1 and execute the mutation operator one by one. The results of mutation are put into the solution pool which can be seen as a set of the solutions that are waiting for selection.

Step 3: The stars in Part 2 are reinitialized and thrown into the solution pool, too.

Step 4: All stars in the solution pool are sorted in descending order. The latter half (the best part) of the solution pool is utilized to replace Part 2 of the original swarm.

The mutation operator in Step 2 is defined as follows: (1) randomly select a mutation position \(z\) (an integer in [1, k]) of \({X}_{mu}\); (2) select a data point \({x}_{z}\) at random from the dataset; (3) replace the clustering center \({X}_{mu}[z]\) with \({x}_{z}\) to generate a new agent. The mutation operator makes relatively minor changes, which means that mutating the superior agents in the population may yield better results at a low cost. The reinitialization operation is equivalent to completely replacing agents, and generating new individuals in half of the population promotes the diversity of the population and expands its search space.
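The following sketch illustrates Steps 1–4 together with the mutation operator, under the assumptions that fitness is minimized and that the swarm is stored as a list of \(k\times d\) matrices; all names are illustrative.

```python
import numpy as np

def replace_stars(swarm, fitness_vals, data, k, fitness, seed=None):
    """Sketch of the star replacement mechanism (Steps 1-4) with the mutation operator."""
    rng = np.random.default_rng(seed)
    n = len(swarm)
    order = np.argsort(fitness_vals)[::-1]          # Step 1: descending, worst solutions first
    part1 = order[n // 2:]                          # better half of the swarm
    part2 = order[:n // 2]                          # worse half, to be replaced
    pool = []
    for i in part1:                                 # Step 2: mutate each better agent
        mutant = swarm[i].copy()
        z = rng.integers(k)                         # random centroid index (0-based here)
        mutant[z] = data[rng.integers(len(data))]   # swap in a random data point
        pool.append(mutant)
    for _ in part2:                                 # Step 3: reinitialize the worse half
        pool.append(data[rng.choice(len(data), size=k, replace=False)])
    pool.sort(key=fitness)                          # Step 4: best half of the pool survives
    for idx, new_star in zip(part2, pool[:len(part2)]):
        swarm[idx] = new_star
        fitness_vals[idx] = fitness(new_star)
    return swarm, fitness_vals
```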

Self-adaptive parameter

It is not enough to only improve the exploration and exploitation capabilities of the algorithm; it is even more important to maintain the balance between the two search modes. The algorithm needs to update the population at the right time to expand the search space, so as to avoid falling into a local optimal region. In the earlier iterations of the algorithm, more global searching is necessary, while more attention should be paid to local search in the later iterations. Therefore, an adaptive parameter is introduced in this paper to choose the appropriate timing and maintain the balance. The formula of the self-adaptive parameter \(\alpha \) is shown as follows:

$$\alpha ={e}^{\frac{-2t}{T+1}},$$
(10)

where \(t\) is the current iteration number and \(T\) is the total number of iterations. Figure 5 is the functional graph of \(\alpha \). It can be seen from the figure that the value of \(\alpha \) gradually decreases from 1 to close to 0.1 as the iterations increase. In each iteration, a random number \({r}_{2}\) is generated in [0,1]. If \({r}_{2}<\alpha \), the replacement mechanism is performed in this iteration. The probability of performing this operation is therefore higher in the early stage of the algorithm. However, as the number of iterations increases, the algorithm needs more local exploitation, and excessive global searching may affect the convergence speed of the algorithm. Therefore, the probability of performing this operation is gradually reduced.
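A minimal sketch of this gating rule based on Eq. (10) is shown below; the function name is illustrative.

```python
import numpy as np

def should_replace(t, T, rng=None):
    """Trigger the replacement mechanism with probability alpha = exp(-2t / (T + 1))."""
    rng = np.random.default_rng(rng)
    alpha = np.exp(-2.0 * t / (T + 1))
    return rng.random() < alpha          # r2 < alpha: perform the replacement this iteration
```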

Fig. 5

The curve where the value of \(\alpha \) changes with the number of iterations

Algorithm 1

Pseudo-code of SLBHA


Results and discussion

To objectively and comprehensively evaluate and verify the effectiveness of the algorithms proposed in this paper, 13 datasets are selected for the experiments. The comparison algorithms used here are as follows: K-means [7], K-means++ [48], FC-Kmeans [49], PSO [31], ABC [50, 51], FA [38], BHA [16], WOA [52, 53], SOS [40], CIEFA [9], and IBH [22]. The experiments are conducted on an Intel(R) Core(TM) i7-10700 CPU with 16 GB RAM. The content related to the experiments is introduced in this section.

Datasets description

Table 1 gives a detailed introduction to the datasets used in this paper, including the number of features and the number of instances of the datasets. The datasets of Stroke prediction, Early-stage diabetes risk prediction, Mobile price, and Housing Prices are selected from the Kaggle website, and the rest are from the UCI data repository. For more details, please access the website of Kaggle (https://www.kaggle.com/) and the UCI Repository (http://archive.ics.uci.edu/ml/index.php) [54].

Table 1 Datasets used in the experiments

The datasets used in this paper have different dimensions and numbers of instances, which presents various challenges to the data clustering algorithms. Usually, preprocessing including feature coding and data standardization is required before the datasets are used. The main theme of this paper is the data clustering problem, so distance measurement is an important factor [1]. Therefore, non-numeric data need to be converted to numeric data to facilitate calculation. In addition, the data need to be normalized to eliminate the influence of different scales, which also indirectly reduces the influence of noise and outliers. Here, the datasets are preprocessed by the methods mentioned above, including feature coding, data standardization, and normalization, to facilitate the comparison of the algorithms; this step is essential.

Evaluation criteria

Clustering partitions a dataset with undefined classes according to some specific method, so its evaluation methods are defined differently from those of classification algorithms. The literature [55] gives three types of methods to evaluate the validity of clustering: external criteria, internal criteria, and relative criteria. External criteria are mainly evaluated by imposing the results of clustering algorithms on a pre-specified dataset structure to validate the clustering solutions. Internal criteria evaluate the internal structure generated by the clustering algorithm. Relative criteria evaluate a structure by comparing it with other methods. External criteria are based on some prior information about the datasets, while internal criteria do not depend on external information [56]. The evaluation criteria used in the experiments of this paper include external criteria and the quantization error.

External criteria

Suppose \(C=\left\{{C}_{1},{C}_{2},\dots ,{C}_{k}\right\}\) is the set of clusters generated by the data clustering algorithm and \(P=\left\{{P}_{1},{P}_{2},\dots ,{P}_{s}\right\}\) is the predefined structure of the dataset. For a pair of data points \(\left({x}_{a},{x}_{b}\right)\) randomly selected from the dataset, the following cases are counted:

SS: if \({x}_{a}\) and \({x}_{b}\) belong to the same cluster of \(C\) and the same partition of \(P\).

SD: if \({x}_{a}\) and \({x}_{b}\) are in the same cluster of \(C\), but in the different partitions of \(P\).

DS: if \({x}_{a}\) and \({x}_{b}\) are in the different clusters of \(C\), but in the same partition of \(P\).

DD: if \({x}_{a}\) and \({x}_{b}\) belong to the different clusters of \(C\) and the different partitions of \(P\).

Here, \(a\), \(b\), \(c\), and \(d\) represent the numbers of SS, SD, DS, and DD pairs, respectively. The total number of pairs of data points in the dataset is \(M\), which means \(M = a+b+c+d\). It can be deduced that \(M=m(m-1)/2\), where \(m\) is the total number of instances in the dataset, as mentioned before. The indices that measure the similarity between \(C\) and \(P\) can then be defined as follows:

$$J=\frac{a}{\left(a+b+c\right)},$$
(11)
$$FM=\frac{a}{\sqrt{{m}_{1}{m}_{2}}}=\sqrt{\frac{a}{a+b}\cdot \frac{a}{a+c}},$$
(12)

where \(J\) is the Jaccard coefficient and \(FM\) is the Folkes and Mallows index, with \({m}_{1}=a/(a+b)\) and \({m}_{2}=a/(a+c)\). For both indices, a higher value indicates a greater similarity between \(C\) and \(P\). Here, these two indices are used to evaluate the similarity between the obtained clustering results and the original labels of the dataset, so as to compare the effectiveness of the algorithms. To the best of our knowledge, this is the first work to introduce these two criteria into the related studies.
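A direct pair-counting sketch of Eqs. (11) and (12) is shown below. It enumerates all \(m(m-1)/2\) pairs, which is adequate for moderate dataset sizes but is meant only as an illustration.

```python
import numpy as np
from itertools import combinations

def jaccard_fm(pred_labels, true_labels):
    """Jaccard (Eq. 11) and Folkes-Mallows (Eq. 12) indices from SS/SD/DS pair counts."""
    a = b = c = 0
    for i, j in combinations(range(len(pred_labels)), 2):
        same_cluster = pred_labels[i] == pred_labels[j]
        same_partition = true_labels[i] == true_labels[j]
        if same_cluster and same_partition:
            a += 1                               # SS pair
        elif same_cluster:
            b += 1                               # SD pair
        elif same_partition:
            c += 1                               # DS pair
    jaccard = a / (a + b + c)
    fm = a / np.sqrt((a + b) * (a + c))
    return jaccard, fm
```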

Quantization error

The fitness function is essential for intelligent optimization algorithms. A suitable fitness function can improve the efficiency and performance of the algorithm. Inspired by the literature [31], this paper selects quantization error as the fitness function, which can be formulated as follows:

$${\text{fitness}}=f\left({X}_{p}^{t}\right)=f\left[\left({c}_{p,1},{c}_{p,2},\dots ,{c}_{p,k}\right)\right]=\frac{\sum_{i=1}^{k}\left[\sum_{j=1}^{\left|{C}_{i}\right|}\frac{{\text{distance}}\left({x}_{j}^{i},{c}_{p,i}\right)}{\left|{C}_{i}\right|}\right]}{k},$$
(13)

where \({X}_{p}^{t}\) is the \(p\) th agent in the \(t\) th iteration and \({c}_{p,k}\) is its \(k\) th element. The quantization error is also evaluated based on the structure of the clusters to some extent, so it can be regarded as an internal criterion. The smaller the fitness value, the better the solution found by the agent. The indicators used in this paper also include the average fitness value, the best fitness value, the worst fitness value, and the standard deviation.
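Under the representation of Eq. (7), the quantization error of Eq. (13) can be sketched as follows; empty clusters are simply skipped in this illustrative version.

```python
import numpy as np

def quantization_error(data, centroids):
    """Fitness of Eq. (13): per-cluster mean distance to the centroid, averaged over k clusters."""
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)                 # assign each instance to its nearest centroid
    k = len(centroids)
    total = 0.0
    for i in range(k):
        member_dists = dists[labels == i, i]
        if len(member_dists):                     # skip empty clusters in this sketch
            total += member_dists.mean()
    return total / k
```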

Experiment settings

In order for each algorithm to exhibit its best performance, the state-of-the-art algorithms selected for comparison all use the parameters suggested in the original articles, as shown in Table 2. Algorithms not mentioned in the table generally require no parameters. The parameters of SLBHA are tested and discussed in “Parameter experiment”. For the sake of fairness, the experiments are conducted by running each algorithm 30 times. The maximum number of iterations and the number of agents are set to 100 and 20, respectively, and all algorithms use the same fitness function, i.e., the quantization error described above. Since they run on the same standardized datasets, the values of each dimension of the agents in every algorithm lie in the range [0,1].

Table 2 Parameter settings of the experiments

Parameter experiment

In the proposed algorithm, the parameter \(\rho \) controls the selection of the star movement path, and its value may have an important impact on the algorithm's exploration capability. Therefore, an experiment is designed to analyze which value of this parameter is appropriate. In addition to the introduced parameter \(\rho \), the population size \(N\) and the number of iterations \(T\) also exert a considerable influence on the performance of a swarm intelligence algorithm. Therefore, it is necessary to design experiments to explore the values of these three parameters for this paper.

To explore the appropriate value of the parameter \(\rho \) while observing its influence on the algorithm and the improvement effect, the first part of the parameter experiments discusses the settings of \(\rho \) and \(T\). In this experiment, \(\rho \) is varied within [0, 1] in steps of 0.1. Meanwhile, the number of iterations \(T\) is set within [100, 150] with a step of 10. For each parameter combination \(\langle \rho ,T\rangle \), the experiment is run 20 times independently on 10 datasets, and the mean fitness values are summarized and plotted in Fig. 6.

Fig. 6

Mean fitness values of part of SLBHA for different parameter combinations of parameter \(\rho \) and the number of iterations \(T\) on datasets

The abscissa of Fig. 6 represents the value of \(\rho \), with a total of 11 values, while the ordinate represents the total number of iterations. It can be seen from the figure that the impact of the number of iterations on the accuracy of the algorithm is obvious: in almost all datasets used here, the color gradually deepens from top to bottom. Especially in the Ionosphere dataset, the optimal solution can almost always be found when \(t\ge 50\), so its color is deeper and more uniform than the others. In the horizontal comparison, different \(\rho \) values have different effects on the performance of the algorithm. For all the datasets used, most of the deeper positions are located in [0.4,0.6] except for some outliers, which also indicates that the influence of the parameter \(\rho \) on the algorithm is not always stable. Considering the meaning of the parameter \(\rho \), it represents the degree of balance between the two paths in the SLBHA algorithm, and the random number used for control is generated in [0,1]. Therefore, only the logarithmic spiral path is effective when \(\rho =1\), and only the vector update path is effective when \(\rho =0\). Comparing the color depth of the left end (\(\rho =0\)), the right end (\(\rho =1\)), and the middle part (\(\rho \in \)[0.4,0.6]), the middle part is deeper than the left and right ends, indicating that a value of \(\rho \) that is too large or too small is not suitable for the algorithm. From another perspective, this also shows that the combination of the two paths is much better than using a single path, and the proportion of the two paths should be relatively balanced to achieve the optimal performance of the algorithm. According to Fig. 6, when \(\rho =0.5\) and \(T=100\), the algorithm achieves satisfactory performance on all datasets while saving time and space resources.

The second part of the parameter experiments is based on the first part mentioned above, discussing the influence of the population size \(N\) on SLBHA. \(\rho \) and \(T\) are set as 0.5 and 100, respectively. The population size \(N\) is set within [10, 50] and the step size is set as 5. For each value of \(N\), SLBHA runs 20 times independently, and the average fitness values are obtained and plotted as a line graph as shown in Fig. 7.

Fig. 7

Mean fitness values of SLBHA for different population sizes \(N\) on datasets

As shown in Fig. 7, the abscissa represents different population sizes while the ordinate represents the fitness values. Figure 7 shows that the average fitness values of SLBHA decrease as the population size grows, but increase slightly after a certain threshold. Two or more fluctuation trends appear on all the lines in the figure. This phenomenon first indicates that the efficiency of the algorithm increases with the expansion of the population size. However, after a certain critical value, this trend slows down and reverses; that is, an unsuitably large population leads to a slight decline in algorithm performance. In conclusion, the impact of the population size \(N\) on the algorithm is fluctuating and complex, and it should be selected thoughtfully. To prevent the decline in performance caused by an overly large population, the population size is set to 20 in this paper after comprehensive consideration.

To sum up, the parameter experiments, on the one hand, provide a reference for parameter settings in this paper. In a comprehensive view, the parameters are set as follows: \(\rho =0.5\), \(T=100\), \(N=20\). On the other hand, these experiments can also reflect the effectiveness of the improvement measures designed in this paper.

Analysis of the proposed strategies

In this section, the proposed strategies are analyzed and their effectiveness is verified through experiments. According to the original design ideas, the strategies of SLBHA described in “The proposed work” can be divided into two groups. Here, the star replacement mechanism and the self-adaptive parameter strategy can be regarded as one group, while the other two strategies form another group. The improvement strategies within a group are interrelated and indivisible. Therefore, a partially improved variant (retaining only the logarithmic spiral path with its parameter and the greedy retention strategy) and the fully improved SLBHA are compared with the traditional BHA. Table 3 shows the comparison results, where LBHA denotes the partially improved BHA.

Table 3 Comparison between BHA and proposed algorithms

From the previous descriptions of the algorithm, it can be seen that LBHA adds a logarithmic spiral path with a random control parameter for adjustment, as well as the greedy retention strategy, compared to the classical BHA. The complete SLBHA further adds a replacement mechanism of stars regulated by a self-adaptive parameter, which expands the global search space of the agents and increases the population diversity, resulting in a better search capability. The experiments conducted here show the validity of these strategies. In terms of the average fitness values, SLBHA performs better than LBHA on all datasets except the QSAR biodegradation dataset, and both algorithms perform better than BHA. For the minimum and maximum fitness values, SLBHA performs better than LBHA on more than 2/3 of the datasets. A possible reason is that SLBHA may have slightly insufficient local exploration on some datasets, despite its extended search range. In addition, these algorithms are inherently heuristic, and some instability of the results due to randomness is not unusual. As can also be seen in Table 9, the LBHA algorithm has better standard deviation results than SLBHA on more than half of the datasets, which indicates that the stability of these two algorithms is similar. However, the results of BHA are more stable than those of the two algorithms on 10 datasets, which shows that although the proposed algorithm improves the overall performance, its stability still needs to be improved to some extent. The external indicators (including the Jaccard coefficient and FM values) of SLBHA are higher than those of LBHA on all datasets, which means that the distribution of clusters obtained by SLBHA is more similar to that of the labels in the original datasets. In addition, these two algorithms also outperform BHA on these two indicators.

In summary, although BHA has higher stability, SLBHA and LBHA are better than BHA in algorithm performance, that is, the improvement measures added in LBHA are effective. Furthermore, SLBHA outperforms LBHA, showing that the replacement mechanism of stars and the self-adaptive parameter contribute to the convergence of the proposed algorithm.

Results analysis and discussion

As mentioned above, several metrics are used to evaluate the experiments conducted in this paper. In this section, the experimental results will be analyzed and discussed. For comparison purposes, the best, worst, mean, and standard deviation of the 30 experimental fitness values are given in Tables 4, 5, 6 and 7. Tables 8 and 9 give the means of Jaccard coefficient values and FM values. The minimum values for each row in Tables 4, 5, 6 and 7 as well as the maximum values for each row in Tables 8 and 9 are shown in bold for clarity.

Table 4 Comparison in terms of the best fitness value
Table 5 Comparison in terms of the worst fitness value
Table 6 Comparison in terms of the average fitness value
Table 7 Comparison in terms of the standard deviation
Table 8 Comparison in terms of the Jaccard coefficient
Table 9 Comparison in terms of the FM values

Tables 3 and 4 show that the algorithm proposed in this paper performs better on most of the datasets. It can be found that even though SLBHA can converge to the best fitness value within the iterations, it does not obtain these results consistently. The iteration curves of the clustering algorithms are shown in Fig. 8. It can be seen that the proposed SLBHA finds satisfactory solutions more quickly and accurately than the other comparison algorithms on all datasets. Table 6 shows that SLBHA is better than the other algorithms in terms of average fitness values on all the datasets except the Shill Bidding dataset. For comparison, the experimental results from Tables 4, 5, 6 and 7 on the Shill Bidding dataset indicate that although the average fitness values obtained by the SOS algorithm are better than those of the two proposed algorithms, its minimum values are not good enough. In Table 7, we can find that the SOS algorithm is more stable than the proposed methods on several datasets. It may be concluded that the SOS algorithm failed to find the optimal solution and became trapped in a local optimum, and this situation repeated stably over several runs, which produced such a result. The same phenomenon can also be seen in the results of other algorithms. For example, the traditional K-means algorithm performs better than most of the heuristic algorithms in Table 7, which indicates that the K-means algorithm is more stable. The main reason for this phenomenon may be that the convergence of the K-means algorithm is provable and it has a simple structure and simple rules. It should be noted here that we pay more attention to the performance of the algorithm than to its stability. Therefore, although the K-means algorithm is stable enough, it is not satisfactory. It should also be pointed out that the other selected algorithms and the proposed algorithm are all heuristic algorithms, which are influenced by randomness as mentioned above, and no algorithm in Table 7 is consistently more stable than all the others. Nevertheless, the proposed algorithm still needs to pay attention to improving its stability. Considering the Arcene dataset, which has 10,000 dimensions and 900 instances, its high dimensionality brings challenges to clustering algorithms. It can be seen from the experiments that the proposed method performs better than the other algorithms, which shows that SLBHA is also suitable for such high-dimensional datasets and is a feasible solution to this challenge.

Fig. 8

Iterative curves for different datasets

It can be found from Tables 8 and 9 that the proposed algorithm outperforms the other methods in terms of the external metrics except on the Mobile Price dataset, which means that the clusters obtained by SLBHA are closer to the distribution of the original datasets than those of the other algorithms. In summary, it can be concluded that the proposed algorithm can effectively find clustering centroids that are closer to the real distribution on the above datasets.

In summary, it can be found that the algorithm proposed in this article performs better than most comparative algorithms in terms of best, worst, and average fitness values, as well as Jaccard coefficient and FM values, including traditional clustering algorithms and heuristic algorithms. However, the proposed SLBHA performs less prominently in terms of standard deviation. It can be concluded that SLBHA has a good ability to converge to the optimal solution and find the results closest to the original label distribution. However, this ability may pose a risk of instability. The algorithm performs well on multiple datasets, which also verifies the universality of the application of the algorithms in this paper.

Time complexity

The SLBHA proposed in this paper is designed on the framework of the classical BHA. The time complexity of the BHA is mainly related to the total number of iterations \(T\) and the population size \(N\), and its time complexity is lower than that of many other heuristic algorithms [54]. The modifications introduced by the first two strategies add only constant cost, so the time complexity remains comparable to that of the BHA. SLBHA then adds a replacement mechanism and a random control parameter. The cost of the replacement mechanism is linearly related to the population size \(N\). Therefore, although its time complexity is larger than that of the BHA, these extra costs do not reduce the practicality of the algorithm.

To further compare the time cost of the proposed algorithm and the other algorithms, running time experiments are conducted on all the datasets. The parameter settings for these experiments are the same as in the previous sections. The average running times are listed in Table 10. It can be seen from Table 10 that the running times of K-means, K-means++, and FC-Kmeans, three traditional clustering algorithms, are significantly shorter than those of the heuristic algorithms. Compared with the K-means algorithm, K-means++ needs fewer iterations, which gives it the shortest running time among all the above algorithms. The FC-Kmeans algorithm, on the other hand, takes longer than both because it combines the K-means and K-means++ algorithms. Among all the heuristic algorithms, SLBHA and classical PSO have the shortest running times, which means that the proposed algorithm performs better while also achieving a faster convergence speed. It is worth noting that the experimental results show that the running time of the BHA is higher than that of SLBHA. After observing and analyzing the experimental process, we believe that the possible reason for this phenomenon is that the BHA is prone to falling into local optima. In each iteration of the BHA, every star must be checked to see whether it is too close to the black hole. When the algorithm falls into a local optimum, the stars are very similar to and near the black hole, which means that many new stars are generated in this process. Once the BHA falls into a local optimum too early, more stars are regenerated, which increases the time cost. In contrast, although the strategies proposed in this article add some runtime, the stars are more dispersed in the solution space, so fewer agents are regenerated for being too close to the black hole. The time saved by generating fewer new agents exceeds the time required for the additional operations. Therefore, SLBHA not only greatly improves the performance of BHA but also controls the time cost when addressing the clustering problem. Overall, the burden of being forced to generate a large number of new agents due to local optima is a potential flaw of BHA that has not been noted before, and this phenomenon is worth attention.

Table 10 Comparison in terms of the running time

In conclusion, the running time experiments supplement the analysis theory of time complexity mentioned above. The experimental results indicate that SLBHA also outperforms most of the compared heuristic algorithms in terms of time cost. It should be seen as an advantage of SLBHA that it ensures the quality of the obtained solutions while reducing time costs.

Statistical tests

Friedman test

The Friedman test is a non-parametric test that can be used to determine whether there is a statistically significant difference between three or more groups [57]. This test is particularly useful when the sample size is very small. The null hypothesis is that there is no significant difference between the given algorithms, and the alternative hypothesis is that at least two of them differ from each other. The test statistic of the Friedman test is given as follows:

$${F}_{R}=\frac{12}{{N}_{d}{K}_{m}({K}_{m}+1)}\sum {R}_{i}^{2}-3{N}_{d}({K}_{m}+1),$$
(14)

where \({N}_{d}\) is the total number of datasets, \({K}_{m}\) is the total number of algorithms, and \({R}_{i}\) is the sum of the ranks of algorithm \(i\) over all datasets. The average ranks of the algorithms based on the average fitness values are shown in Table 11. After calculating the test statistic, we can determine whether the algorithms are significantly different using the decision rules of the Friedman test. If the statistic \({F}_{R}\) is larger than the critical value found in Friedman's critical values table, the null hypothesis can be rejected. Equivalently, if the p value is less than or equal to the significance level \(\alpha \), the null hypothesis can also be rejected; otherwise, it is accepted. Table 11 also shows the calculated results of the Friedman test. It can be seen in Table 11 that SLBHA obtains the best average rank and the K-means algorithm the worst. The statistic \({F}_{R}\) is 125.0 and the p value is 1.7827e−21. The former is greater than the critical value while the latter is less than \(\alpha \), which means that we can reject the null hypothesis. In conclusion, there is a significant difference between the algorithms tested in this paper.
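As an illustration, such a test can be reproduced with SciPy's friedmanchisquare, which expects one sample of per-dataset results for each algorithm; the numbers below are hypothetical placeholders, not the values behind Table 11.

```python
from scipy import stats

# Hypothetical mean fitness values: one row per dataset, one column per algorithm.
mean_fitness = [
    [0.42, 0.45, 0.40],   # dataset 1
    [0.31, 0.36, 0.30],   # dataset 2
    [0.58, 0.61, 0.55],   # dataset 3
    [0.27, 0.29, 0.25],   # dataset 4
]
per_algorithm = list(zip(*mean_fitness))              # group the results by algorithm
statistic, p_value = stats.friedmanchisquare(*per_algorithm)
print(statistic, p_value)                             # reject H0 when p_value <= alpha
```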

Table 11 The average rank of the algorithms

Wilcoxon rank sum test

The Wilcoxon rank sum test is also a non-parametric test. It can be used to determine whether two groups of samples are drawn from the same distribution [58]. The p values of the Wilcoxon rank sum test for the proposed SLBHA are reported in Table 12, where values below 0.05 are underlined. As shown in Table 12, there is a significant difference between SLBHA and each of the selected algorithms because all of the Wilcoxon rank sum test results are less than 0.05. Combined with the other tables mentioned above, it can be concluded that SLBHA is significantly superior to the other algorithms.
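A small sketch of this pairwise comparison using SciPy's ranksums is given below; the two arrays are assumed to hold the 30 fitness values obtained by SLBHA and by a compared algorithm on the same dataset.

```python
from scipy import stats

def compare_runs(slbha_runs, other_runs, alpha=0.05):
    """Wilcoxon rank sum test between two sets of independent runs."""
    statistic, p_value = stats.ranksums(slbha_runs, other_runs)
    return p_value, p_value < alpha      # significant difference when the p value is below 0.05
```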

Table 12 The Wilcoxon rank sum test results of SLBHA

Conclusions and future work

In this article, an improved self-adaptive logarithmic spiral path black hole algorithm (SLBHA) is proposed and analyzed. The path improvement measures introduced in SLBHA effectively enhance the local search capability of the algorithm, and the replacement mechanism of the stars increases the diversity of the population. Additionally, SLBHA uses a self-adaptive parameter to balance the global and local search phases. Therefore, the algorithm effectively improves the exploration and exploitation capabilities of the BHA and its ability to maintain the balance between them, with high usability and effectiveness. Moreover, the effectiveness of the proposed algorithm is verified experimentally. The quantization error and external criteria (the Jaccard coefficient and FM values) are used to measure the performance of the clustering algorithms. The experimental results show that SLBHA outperforms the other comparative algorithms on most of the datasets and that the generated clusters are closer to the real label distribution. The running time of SLBHA is also lower than that of most of the compared heuristic algorithms. Statistical tests indicate that there is a significant difference between the proposed algorithm and the compared algorithms. However, the experiments also show that a shortcoming of the algorithm is that its stability is strongly affected by randomness. Future work will mainly focus on two aspects. On the one hand, there is still room for improving the stability of the proposed algorithm, which may relate to the control of randomness, a common problem for heuristic algorithms. On the other hand, the proposed algorithm can be applied to other optimization problems, with corresponding changes made according to the application scenarios.