1 Introduction

Clustering, a common unsupervised learning technique [1,2,3,4], groups the samples of an unlabeled dataset according to their features, so that data objects within the same cluster are as similar as possible while objects in different clusters are as dissimilar as possible [5,6,7]. Clustering is widely used in biology [8], medicine [9], psychology [10], statistics [11], mathematics [12] and computer science [13]. Since the early 1950s, many clustering algorithms have been proposed. In this paper, considering the novelty and effectiveness of density-based methods, we focus on density-based spatial clustering of applications with noise (DBSCAN) and explore an adaptive method to tune its hyperparameters instead of setting them empirically.

1.1 Literature review

Among clustering algorithms, K-means [14], the most basic partition-based clustering algorithm, has the advantages of a simple principle, strong practicability, fast convergence and good interpretability. However, it has difficulty converging on non-convex datasets and often stops at a local optimum.

Different from K-means, DBSCAN [15, 16] is a popular density-based clustering algorithm. It forms clusters by finding high-density areas separated by low-density areas. Compared with clustering algorithms based on the distance between objects, DBSCAN is suitable for finding clusters of arbitrary shape in spatial databases and for connecting adjacent regions of comparable density. It can effectively handle abnormal data, especially in the clustering of spatial data [17]. Despite these advantages, DBSCAN still has shortcomings: for each dataset, it needs the most appropriate values of its parameters, MinPts and EPS, to achieve the best clustering effect. To some extent, this parameter-setting process limits the application of DBSCAN [18].

Over the years, to apply DBSCAN effectively, many researchers have improved DBSCAN [19] through meta-heuristic algorithms [20,21,22,23] that automatically search for and determine the EPS and MinPts parameters. For example, Lai et al. [24] proposed a multi-segment optimization algorithm; as a special variable-updating method, it has good optimization performance, obtains good DBSCAN accuracy, and quickly finds an appropriate EPS value. Ji’an et al. [25] proposed an adaptive DBSCAN for clustering problems that treats the target solution and its motion range as noise points, in which the DBSCAN \(\epsilon\)-neighborhood is affected by specific physical factors. Zhu et al. [26] applied the harmony search optimization algorithm to DBSCAN and obtained better clustering parameters and better clustering results. Hu et al. [27] proposed KR-DBSCAN, a density-based clustering algorithm based on reverse nearest neighbors and influence space. Li et al. [28] combined an improved DBSCAN based on bat optimization with the DP algorithm for clustering and obtained good results. However, these methods still suffer from low convergence accuracy, poor universality and slow convergence.

Meta-heuristic algorithms, such as the Grey Wolf Optimizer (GWO), the Dragonfly Algorithm (DA) and the Ant Lion Optimizer (ALO), have become popular in recent years. They offer high convergence accuracy and strong robustness and can be used to select the parameters of DBSCAN. However, common meta-heuristic algorithms easily fall into local optima. Therefore, we choose the Arithmetic Optimization Algorithm (AOA) as the optimizer. AOA is a new population-based meta-heuristic algorithm proposed by Abualigah et al. [29], which uses the four basic arithmetic operators of mathematics. AOA can handle not only low-dimensional problems [30] but also high-dimensional ones [31]. Its distribution mechanism enhances its global search ability, and its population-based design [32] also helps it achieve faster convergence.

However, the ability of the standard AOA to balance global and local search is still insufficient, and its optimization accuracy is limited. To better balance exploitation (local search) and exploration (global search) and to improve optimization accuracy, we propose additional search strategies. In particular, opposition-based learning (OBL) [33,34,35] is one of the most popular strategies for enhancing exploration: it increases the population diversity of the algorithm in the search space. In an optimization problem, evaluating a candidate solution and its opposite solution at the same time accelerates convergence to the global optimum.

In general, the clustering effect of DBSCAN is limited by how well its parameters are optimized. The optimization algorithms currently used for DBSCAN parameter tuning have low convergence accuracy and easily fall into local optima. Although the standard AOA explores the search space better than many other optimizers, it still suffers from insufficient convergence accuracy and global search ability.

1.2 The gap

To sum up, the demand for higher accuracy in the DBSCAN clustering algorithm keeps increasing. Meeting it requires more advanced machine learning methods that automatically optimize the parameters of DBSCAN and thereby improve the clustering accuracy.

1.3 The contribution

To improve the accuracy and convergence speed of automatic DBSCAN parameter selection, this paper proposes a new meta-heuristic improvement strategy, OBLAOA-DBSCAN, which combines the advantages of AOA and OBL with DBSCAN to dynamically adjust its two parameters. According to the experimental results, DBSCAN improved with OBLAOA performs well on a variety of public datasets. The contributions of this article are as follows:

(1) An OBLAOA-DBSCAN clustering algorithm is proposed, which realizes automatic parameter search and improves clustering accuracy and efficiency.

(2) By adding the OBL strategy, an OBLAOA optimizer is established, which effectively improves the exploration performance of AOA.

(3) The proposed OBLAOA-DBSCAN algorithm provides better clustering results than other clustering algorithms, including K-means, Spectral, Optics, DPC, and combinations of DBSCAN with other meta-heuristic optimization algorithms.

1.4 The structure of the paper

The remaining contents are organized as follows. Section 2 outlines the background of DBSCAN and AOA. Section 3 introduces OBLAOA and gives its principle and concrete operation. Section 4 illustrates the proposed OBLAOA-DBSCAN algorithm. Section 5 compares the proposed OBLAOA with the original AOA on the CEC2021 benchmark functions. Section 6 demonstrates the superiority of the proposed algorithm on 10 datasets by comparing it with several clustering algorithms. Section 7 concludes the paper.

2 Related work

2.1 The basic theory of DBSCAN

DBSCAN, an unsupervised learning method proposed in [36], handles the clustering problem efficiently based on density. DBSCAN can identify noise points efficiently and exactly, and it can distinguish clusters with arbitrary shapes.

In this clustering method, two parameters, epsilon (EPS) and MinPts, must be pre-set to appraise the density distribution of points. DBSCAN starts from a randomly chosen unvisited point and counts the points that fall within a radius EPS of that point.

If the number of such points is no less than MinPts, the current point and its nearby points form a cluster, and the starting point is marked as visited. All points in the cluster that are not yet marked as visited are then processed recursively in the same way to expand the cluster. Otherwise, the point is temporarily marked as a noise point. Once the cluster is fully expanded, that is, all points in the cluster are marked as visited, the same procedure is applied to the remaining unvisited points. The clustering process ends when every object is assigned to a cluster or marked as noise. The DBSCAN algorithm flow is presented in Algorithm 1.

Algorithm 1 The DBSCAN algorithm
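As a concrete illustration, the following is a minimal Python sketch of the procedure just described. It is not the authors' implementation; the names `dbscan`, `region_query`, `eps` and `min_pts` are ours, and the O(n²) neighbor search is kept deliberately simple.

```python
import numpy as np

def region_query(X, i, eps):
    """Indices of all points within distance eps of point i."""
    return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)            # -1 marks noise
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_pts:   # not a core point: temporarily noise
            continue
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                   # expand the cluster recursively
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                j_neighbors = region_query(X, j, eps)
                if len(j_neighbors) >= min_pts:
                    seeds.extend(j_neighbors)
            if labels[j] == -1:        # border point or former noise point
                labels[j] = cluster
        cluster += 1
    return labels
```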

DBSCAN suffers from the difficulty of determining these two parameters. Previous studies have shown that they can be found by statistical and classical methods combined with various data mining techniques, but these methods consume excessive time. Therefore, we introduce a meta-heuristic optimizer to find these parameters considerably more accurately and efficiently, achieving faster and more precise clustering.

2.2 The arithmetic optimization algorithm

The Arithmetic Optimization Algorithm (AOA) is a new meta-heuristic optimization algorithm [29] inspired by the four major arithmetic operators: Multiplication (M), Division (D), Subtraction (S) and Addition (A). The mathematical models of the exploration and exploitation phases are detailed as follows. Note that the choice between the exploration and exploitation stages is conditioned by the math optimizer accelerated (MOA) function, calculated by

$$\begin{aligned} \mathrm{MOA}\left( C_\mathrm{Iter }\right) =\delta +C_\mathrm{Iter} \times \left( \frac{\gamma -\delta }{M_\mathrm{Iter}}\right) , \end{aligned}$$
(1)

where \(M_\mathrm{Iter}\) is the maximum number of iterations and \(C_\mathrm{Iter}\) is the current iteration, which lies between 1 and \(M_\mathrm{Iter}\). \(\mathrm{MOA}(C_\mathrm{Iter})\) is the value of MOA at the current iteration. \(\gamma\) and \(\delta\) are set to 1 and 0.2, respectively. The math optimizer probability (MOP) at the current iteration is calculated by

$$\begin{aligned} \mathrm{MOP} (C_\mathrm{Iter} ) = 1 - \frac{{C_\mathrm{Iter}}^{\frac{1}{\alpha }}}{{M_\mathrm{Iter}}^{\frac{1}{\alpha }}}, \end{aligned}$$
(2)

where \(\alpha\) is a sensitive parameter that defines the exploitation accuracy over the iterations; it is set to 0.5.

\(r_1, r_2, r_3\) are random numbers. When \(\mathrm{MOA} < r_1\), the exploration phase is carried out by executing D or M. The position-updating equation in the exploration stage is as follows:

$$\begin{aligned} x_{i,j}(C_\mathrm{Iter}+1) = {\left\{ \begin{array}{ll} x^{\star }(C_\mathrm{Iter}) \div (\mathrm{MOP} + \epsilon ) \times ( (ub_j - lb_j ) \times \mu + lb_j), &{} r_2 < 0.5 \\ x^{\star }(C_\mathrm{Iter}) \times \mathrm{MOP} \times ((ub_j - lb_j) \times \mu + lb_j), &{} \text { otherwise}, \end{array}\right. } \end{aligned}$$
(3)

where \(x_{i,j}(C_{\text {Iter}}+1)\) denotes the jth dimension of the ith solution in the next iteration, and \(x^{\star }(C_{Iter})\) is the best solution obtained so far. \(\epsilon\) is a small number that avoids division by zero, and \(ub_j\) and \(lb_j\) are the upper and lower bounds of the jth position. \(\mu\) is a control parameter, set to 0.5.

When \(\mathrm{MOA} \ge r_1\), the exploitation phase is carried out by executing S or A. If \(r_3 < 0.5\), S is performed (first rule in Eq. 4); otherwise, A is performed in its place (second rule in Eq. 4). The position-updating equation in the exploitation stage is as follows:

$$\begin{aligned} x_{i,j}(C_{\text {Iter}} + 1 ) = {\left\{ \begin{array}{ll} x^{\star }(C_\mathrm{Iter}) - \mathrm{MOP} \times ((ub_j - lb_j) \times \mu + lb_j), &{} r_3 < 0.5 \\ x^{\star }(C_\mathrm{Iter}) + \mathrm{MOP} \times ((ub_j - lb_j) \times \mu + lb_j), &{} \text{ otherwise } . \end{array}\right. } \end{aligned}$$
(4)
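To make the update rules concrete, here is a brief Python sketch of one AOA position update following Eqs. (1)-(4). It is our own illustrative rendering, not a reference implementation; the parameter values (\(\gamma = 1\), \(\delta = 0.2\), \(\alpha = 0.5\), \(\mu = 0.5\)) come from the text, while the function name `aoa_update` and the final clipping to the bounds are our assumptions.

```python
import numpy as np

def aoa_update(x_best, lb, ub, c_iter, m_iter,
               gamma=1.0, delta=0.2, alpha=0.5, mu=0.5, eps=1e-12):
    """One AOA position update around the current best solution."""
    moa = delta + c_iter * (gamma - delta) / m_iter    # Eq. (1)
    mop = 1 - (c_iter / m_iter) ** (1 / alpha)         # Eq. (2)
    x_new = np.empty_like(x_best)
    for j in range(len(x_best)):
        r1, r2, r3 = np.random.rand(3)
        scale = (ub[j] - lb[j]) * mu + lb[j]
        if moa < r1:   # exploration: Division or Multiplication, Eq. (3)
            x_new[j] = (x_best[j] / (mop + eps) if r2 < 0.5
                        else x_best[j] * mop) * scale
        else:          # exploitation: Subtraction or Addition, Eq. (4)
            x_new[j] = (x_best[j] - mop * scale if r3 < 0.5
                        else x_best[j] + mop * scale)
    return np.clip(x_new, lb, ub)  # keeping solutions in bounds is our assumption
```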

2.3 The opposition-based learning

Opposition-based learning (OBL) considers candidate solutions together with their opposites. Depending on whether the estimate or the opposite estimate is closer to the solution, the search interval can be recursively halved until one of them is close enough to the existing solution. OBL decides whether the original solution x is replaced by the opposite solution \(\bar{x}\) by comparing their fitness values. For a solution \({x} \in [lb,ub]\), \(\bar{x}\) is calculated by the following equation:

$$\begin{aligned} \bar{x} = ub + lb - x. \end{aligned}$$
(5)

The above equation can be generalized to n dimensions via:

$$\begin{aligned} \bar{x}_{j}=ub_{j} +lb_{j}-x_{j}, j = 1,2,\cdots ,n. \end{aligned}$$
(6)

According to the result of this comparison, the better of the two solutions is kept.
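A minimal Python sketch of this step, assuming minimization; the helper names `opposite` and `obl_select` are hypothetical labels for Eqs. (5)-(6) and the comparison just described, and the sphere fitness in the example is only a stand-in.

```python
import numpy as np

def opposite(x, lb, ub):
    """Opposite solution, Eq. (6), applied element-wise."""
    return ub + lb - x

def obl_select(x, lb, ub, fitness):
    """Keep whichever of x and its opposite has the better (lower) fitness."""
    x_bar = opposite(x, lb, ub)
    return x if fitness(x) <= fitness(x_bar) else x_bar

# Example with a simple sphere fitness:
lb, ub = np.array([-5.0, -5.0]), np.array([5.0, 5.0])
x = np.array([3.0, -4.0])
best = obl_select(x, lb, ub, fitness=lambda v: float(np.sum(v ** 2)))
```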

3 The proposed OBLAOA

OBL is committed to taking both candidate solutions and their opposite solutions into consideration, which gives a greater opportunity to reach the global optimum and faster convergence than executing S or A alone. It finds the solution opposite to the present one and then decides whether to adopt it by comparing their fitness values. For example, if \(f(x^{\star }(C_{\text {Iter}})) \le f(\bar{x}^{\star }(C_{\text {Iter}}))\), then \(x^{\star }(C_{\text {Iter}})\) is kept; otherwise, \(\bar{x}^{\star }(C_{\text {Iter}})\) is stored. The equation used in OBLAOA to obtain the opposite solution is

$$\begin{aligned} \bar{x}^{\star }(C_{\text {Iter}}) = ub + lb - x^{\star }(C_{\text {Iter}}), \end{aligned}$$
(7)

where \(x^{\star }(C_{\text {Iter}})\) denotes the position of the best solution in the current iteration and \(\bar{x}^{\star }(C_{\text {Iter}})\) denotes its opposite position.

Fig. 1 The flowchart of OBLAOA

The flowchart of the proposed OBLAOA is given in Fig. 1 and the pseudocode is recorded in Algorithm 2.
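Because Algorithm 2 appears only as an image in the original, the following self-contained Python sketch gives our reading of the OBLAOA loop: the AOA update of Sect. 2.2 applied to every solution, followed by an opposition check on the best solution (Eq. 7). The population size of 20 and 500 iterations follow Sect. 5.2; the sphere fitness and the clipping to the bounds are our assumptions.

```python
import numpy as np

def oblaoa(fitness, lb, ub, n_pop=20, m_iter=500,
           gamma=1.0, delta=0.2, alpha=0.5, mu=0.5, eps=1e-12):
    dim = len(lb)
    pop = lb + np.random.rand(n_pop, dim) * (ub - lb)  # random initial population
    best = min(pop, key=fitness).copy()
    for c_iter in range(1, m_iter + 1):
        moa = delta + c_iter * (gamma - delta) / m_iter  # Eq. (1)
        mop = 1 - (c_iter / m_iter) ** (1 / alpha)       # Eq. (2)
        for i in range(n_pop):
            for j in range(dim):
                r1, r2, r3 = np.random.rand(3)
                scale = (ub[j] - lb[j]) * mu + lb[j]
                if moa < r1:   # exploration: D or M, Eq. (3)
                    pop[i, j] = (best[j] / (mop + eps) if r2 < 0.5
                                 else best[j] * mop) * scale
                else:          # exploitation: S or A, Eq. (4)
                    pop[i, j] = best[j] + (-mop if r3 < 0.5 else mop) * scale
            pop[i] = np.clip(pop[i], lb, ub)
            if fitness(pop[i]) < fitness(best):
                best = pop[i].copy()
        best_bar = ub + lb - best                # opposite of the best, Eq. (7)
        if fitness(best_bar) < fitness(best):    # OBL check on the best solution
            best = best_bar
    return best

# Example: minimize the sphere function in 10 dimensions.
lb, ub = np.full(10, -100.0), np.full(10, 100.0)
x_star = oblaoa(lambda x: float(np.sum(x ** 2)), lb, ub)
```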

Algorithm 2 The pseudocode of OBLAOA

Fig. 2 The flowchart of the proposed OBLAOA-DBSCAN

4 The improved DBSCAN with OBLAOA

In this section, we apply OBLAOA to optimize the two parameters of DBSCAN (EPS and MinPts). The resulting method, OBLAOA-DBSCAN, further improves the performance of the clustering algorithm.

In detail, OBLAOA-DBSCAN determines the parameters EPS and MinPts automatically over an extensive search space via a meta-heuristic method. First, the normalized ranges of the two parameters (EPS and MinPts) are set as the upper bounds (\(ub_{j}\)) and lower bounds (\(lb_{j}\)) of the search space. Then, OBLAOA searches for suitable parameters within this effective search space.

To obtain the best clustering results, the fitness function in OBLAOA-DBSCAN is the sum of the average Euclidean distances within each cluster, where the distance between two objects is given by

$$\begin{aligned} \mathrm {D} \left( o_{i}, o_{l}\right) =\left( \sum _{j=1}^{m}\left( o_{i j}-o_{l j}\right) ^{r}\right) ^{\frac{1}{r}} \end{aligned}$$
(8)

where \(D (o_{i}, o_{l})\) is a distance function between object i and object l that produces different metrics depending on r (the Euclidean distance for \(r=2\)), and \(o_{i j}, o_{l j}\ (i, l=1, \ldots , n,\ j=1, \ldots ,m)\) represent the value of the jth attribute of object i and object l, respectively.

As the fitness value is updated continuously, the position of the best solution, which determines the values of the two parameters, changes, and the corresponding MinPts and EPS change with it. Once the fitness value no longer changes, the obtained parameters are applied to the DBSCAN algorithm for clustering.
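The following Python sketch shows one plausible reading of this fitness evaluation: a candidate solution is decoded into (EPS, MinPts) using the rounding rules of Sect. 6.3, DBSCAN is run (here scikit-learn's implementation stands in for the authors' version), and the partition is scored by the sum of the per-cluster average Euclidean distances to the cluster centroid, i.e. one interpretation of Eq. (8) with \(r=2\). The function name `clustering_fitness` and the centroid-based reading are our assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def clustering_fitness(candidate, X):
    """Decode a candidate into (EPS, MinPts) and score the DBSCAN partition."""
    eps = max(round(float(candidate[0]), 1), 0.1)  # EPS kept to one decimal, > 0
    min_pts = max(int(candidate[1]), 1)            # MinPts rounded down, >= 1
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
    clusters = [c for c in np.unique(labels) if c != -1]  # ignore noise (-1)
    if not clusters:
        return float("inf")                        # no cluster found: worst fitness
    total = 0.0
    for c in clusters:
        pts = X[labels == c]
        centroid = pts.mean(axis=0)
        # average Euclidean distance of the cluster's points to its centroid
        total += float(np.linalg.norm(pts - centroid, axis=1).mean())
    return total
```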

When DBSCAN is used alone, problems such as low clustering accuracy and poorly identified noise points often appear because the parameters are set manually. By introducing OBL to enhance the exploration ability of AOA, OBLAOA provides effective parameter solutions for DBSCAN, thereby improving its clustering ability. The flowchart is shown in Fig. 2. The time complexity of OBLAOA-DBSCAN is \(O(N(1 + M \times n\log n + M \times n))\), where N is the number of candidate solutions, M is the number of iterations, and n is the dimension of the problem.

5 Numerical simulation

5.1 The benchmark functions

To evaluate the performance of the proposed OBLAOA optimizer, we conducted numerical simulation experiments with 8 test functions from CEC2021. The benchmark functions are presented in Table 1, where the column Range gives their constraint ranges.

Table 1 The CEC2021 benchmark functions

5.2 The setting of experimental parameters

The results of OBLAOA are saved and compared with five reference methods (i.e., AOA, IAOA, DAOA, ENGWO and WSSA) for each test case. The parameters of each algorithm are set as follows. The maximum number of iterations and the population size of all algorithms are set to 500 and 20, respectively, and the number of function evaluations is 30 [37]. In addition, the initial random populations of all algorithms are the same. All CEC2021 test functions are simulated in 10 and 20 dimensions.

Table 2 Results of the 10-dimensional CEC2021 test functions (\(F_1\)-\(F_8\))
Table 3 Results of the 20-dimensional CEC2021 test functions (\(F_1\)-\(F_8\))
Table 4 Results of three engineering problems
Fig. 3 Comparison of different algorithms on three engineering application problems

Fig. 4 Convergence curves of the 10-dimensional CEC2021 test functions F1-F8

Fig. 5 Convergence curves of the 20-dimensional CEC2021 test functions F1-F8

5.3 Analysis of the results

The results of the numerical simulation are recorded in Tables 2 and 3. To verify the effectiveness of OBLAOA, we compared its results with those of the standard AOA, IAOA, DAOA, ENGWO and WSSA. We select the average value (AVG), standard deviation (STD) and best value (BEST) as performance indicators and report them in all tables, with the better results shown in bold in Tables 2 and 3. In addition, the Wilcoxon rank-sum test was applied to all results, and all of its outcomes (h) were 1. The tables show that OBLAOA performs better than the standard AOA and other currently popular optimization algorithms (i.e., IAOA, DAOA, ENGWO and WSSA). In the high-dimensional tests, the average and best values of OBLAOA are better than those of the standard AOA and the popular algorithms for all functions. In the low-dimensional tests, the average and best values of OBLAOA are better than those of AOA for the F1, F2, F3, F5, F6, F7 and F8 functions. In some experiments, OBLAOA improves on AOA significantly. Taking the 10-dimensional F3 function as an example, the BEST index of OBLAOA is 106.76, which is 46.62\(\%\) lower than the standard AOA, 77.17\(\%\) lower than DAOA, 2.67\(\%\) lower than ENGWO and 80.91\(\%\) lower than WSSA. For F6, the BEST index is 1600, which is 44\(\%\) lower than the standard AOA, 21.95\(\%\) lower than DAOA, 24.88\(\%\) lower than ENGWO and 31.91\(\%\) lower than WSSA. For the F8 function, the BEST index is 2.99e+3, which is 59.15\(\%\) lower than the standard AOA, 59.14\(\%\) lower than DAOA, 54.83\(\%\) lower than IAOA, 63.49\(\%\) lower than ENGWO and 65.67\(\%\) lower than WSSA. To sum up, the proposed OBLAOA outperforms the standard AOA and other currently popular algorithms on complex functions.

To further demonstrate the optimization effect of OBLAOA, we selected three practical engineering problems for verification: welded beam design [38], compression spring design [39] and I-beam design [40]. The results are recorded in Table 4 and shown in Fig. 3. To verify the adequacy of the experimental results, we also carried out the Wilcoxon signed-rank test; its outcomes, expressed as h and recorded in Table 4, are all 1. Figure 3 shows that OBLAOA achieves a better optimization effect than the other algorithms and has the highest convergence accuracy on all problems. Specifically, on the compression spring design (CSD) problem, OBLAOA converges first, and its convergence is greatly improved compared with ENGWO and WSSA. In general, OBLAOA converges better when solving practical engineering problems. As can be seen from Table 4, OBLAOA also has great advantages over the standard AOA and the latest algorithms, obtaining the best value in all three engineering problems. Taking the welded beam design (WBD) problem as an example, our best value is 4.25, which is 34\(\%\) lower than the standard AOA, 40.3\(\%\) lower than DAOA and 56.72\(\%\) lower than IAOA. ENGWO and WSSA do not converge, which sets them far apart from OBLAOA. Figures 4 and 5 also show that OBLAOA converges earlier and faster, and its final fitness value is lower than those of the other algorithms.

6 Experiment and performance evaluation

This section is organized as follows. In Sect. 6.1, we describe the datasets used in the experiment. In Sect. 6.2, we introduce the evaluation indexes. In Sect. 6.3, we describe the parameter-setting process in detail. In Sect. 6.4, we use ten datasets to test different optimization algorithms. In Sect. 6.5, we compare the optimized OBLAOA-DBSCAN with five classical clustering algorithms.

6.1 The datasets

In this part, we use ten datasets to test the performance of OBLAOA-DBSCAN. The numbers of instances in the 10 datasets are 788, 399, 373, 150, 251, 300, 198, 1980, 341 and 846; their dimensions are 3, 3, 3, 5, 3, 2, 34, 3, 3 and 19; and their numbers of clusters are 7, 6, 2, 3, 3, 5, 2, 5, 9 and 4. Table 5 lists the ten datasets used as experimental data. We compare the real labels with the clustering labels and use the comparison result as the evaluation index of the algorithm; therefore, we use datasets with real labels.

Table 5 Datasets used in experiments

6.2 The error index

To measure the clustering results of the improved method, we use Accuracy, the Davies-Bouldin index (DBI), the Silhouette index (Sil), the Rand index (RI) [41, 42], Normalized Mutual Information (NMI), Homogeneity, Completeness, and V-measure [43]. Because the datasets have real labels, we use the accuracy index to show the performance of the proposed method.

Accuracy is the ratio of correctly clustered data to total data; the correctly clustered data is obtained by comparing the cluster labels K with the actual labels C. DBI measures the distances within clusters relative to the distances between clusters. A smaller DBI means smaller within-cluster distances and larger between-cluster distances. It is formulated as:

$$\begin{aligned} \mathrm {DBI}=\frac{1}{N}\sum _{i=1}^{N}\left( \max \limits _{j=1,\ldots ,N, j \ne i} \left( \frac{S_i+S_j}{d_{i j}}\right) \right) , \end{aligned}$$

where N is the number of clusters, \(d_{i j}\) is the distance between the centers of clusters i and j, and \(S_{i}\) and \(S_{j}\) are the mean distances of the points in clusters i and j to their respective centers.

The Silhouette value describes how similar a point is to its own cluster compared with other clusters. The larger this value, the higher the similarity between the target and its cluster and the lower the similarity with other clusters. The formula is as follows:

$$\begin{aligned} \mathrm {SIL}=\frac{1}{N} \sum _{i=1}^{N}\left( \frac{b\left( {i}\right) -a\left( {i}\right) }{\max \left\{ a\left( {i}\right) , b\left( {i}\right) \right\} }\right) , \end{aligned}$$

where a(i) is the average distance between point i and all other data points in the same cluster, and b(i) is the smallest average distance between point i and the points of any other cluster.

The Rand index compares the similarity of the results of two different clustering methods. The larger the value, the more consistent the clustering result is with the real situation. The formula is as follows:

$$\begin{aligned} \mathrm {RI}=\frac{x+y}{C_{n}^{2}}, \end{aligned}$$

where x is the number of sample pairs assigned the same label in both C and K, and y is the number of sample pairs assigned different labels in both C and K. \(C_{n}^{2}\) is the number of sample pairs in the dataset.

NMI measures the degree of agreement between two sets of labels and reflects the correlation between two sets of results. The greater the NMI, the greater the correlation between the categories. The formula is as follows:

$$\begin{aligned} \mathrm{Hl} = -\sum _{i=1}^{N}\left( \frac{\mathrm{Ml}}{N}\log _2\frac{\mathrm{Ml}}{N}\right) ,\quad \mathrm{Hr} = -\sum _{i=1}^{N}\left( \frac{\mathrm{Mr}}{N}\log _2\frac{\mathrm{Mr}}{N}\right) ,\quad \mathrm{Hlr} = -\sum _{i=1}^{N}\left( \frac{\mathrm{Ml}\cdot \mathrm{Mr}}{N}\log _2\frac{\mathrm{Ml}\cdot \mathrm{Mr}}{N}\right) , \end{aligned}$$

and

$$\begin{aligned} \mathrm{NMI} = \sqrt{\frac{\mathrm{Hl}+\mathrm{Hr}-\mathrm{Hlr}}{\mathrm{Hl}}\times \frac{\mathrm{Hl}+\mathrm{Hr}-\mathrm{Hlr}}{\mathrm{Hr}}}, \end{aligned}$$

where Ml represents the cluster distribution of a randomly selected object from the clustering result K, and Mr represents the cluster distribution of a randomly selected object from the actual labels C.

Homogeneity means that each cluster contains only members of a single class, and completeness means that all members of a given class are assigned to the same cluster. V-measure is the harmonic mean of homogeneity and completeness. The formulas are as follows:

$$\begin{aligned} \mathrm{homogeneity} = \frac{\mathrm{Hl}+\mathrm{Hr}-\mathrm{Hlr}}{\mathrm{Hl}},\quad \mathrm{completeness} = \frac{\mathrm{Hl}+\mathrm{Hr}-\mathrm{Hlr}}{\mathrm{Hr}}, \end{aligned}$$

and

$$\begin{aligned} \text {V-measure} = \frac{2 \times \mathrm{homogeneity} \times \mathrm{completeness}}{\mathrm{homogeneity}+\mathrm{completeness}}. \end{aligned}$$

The DBI index is usually less than 1, and the lower it is, the better the performance. The SIL and RI indexes are at most 1; the closer they are to 1, the better the clustering performance. The larger Accuracy, NMI, homogeneity, completeness and V-measure are, the closer the clustering results are to the real labels. Through these evaluation indexes, we can clearly compare the clustering performance of the new algorithm.
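For reference, most of these indexes are available off the shelf; the sketch below computes them with scikit-learn, and the accuracy index, which requires the best one-to-one matching between cluster and class labels, is implemented via the Hungarian algorithm from SciPy. The function names `clustering_accuracy` and `evaluate` are ours, and noise points labeled -1 by DBSCAN are treated here as one extra cluster.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn import metrics

def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one matching of cluster labels."""
    true_ids, pred_ids = np.unique(y_true), np.unique(y_pred)
    cost = np.zeros((len(pred_ids), len(true_ids)))
    for i, p in enumerate(pred_ids):
        for j, t in enumerate(true_ids):
            cost[i, j] = -np.sum((y_pred == p) & (y_true == t))
    rows, cols = linear_sum_assignment(cost)  # Hungarian matching
    return -cost[rows, cols].sum() / len(y_true)

def evaluate(X, y_true, y_pred):
    # DBI and Sil need at least two distinct labels in y_pred.
    return {
        "Accuracy": clustering_accuracy(y_true, y_pred),
        "DBI": metrics.davies_bouldin_score(X, y_pred),
        "Sil": metrics.silhouette_score(X, y_pred),
        "RI": metrics.rand_score(y_true, y_pred),
        "NMI": metrics.normalized_mutual_info_score(y_true, y_pred),
        "Hom/Comp/V": metrics.homogeneity_completeness_v_measure(y_true, y_pred),
    }
```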

6.3 Experiment settings

Table 6 The range of parameters for the investigated datasets

DBSCAN [44] requires two parameters to be selected for clustering, and changing the values of EPS and MinPts yields different clustering results. We first run the algorithm with a large range for the two parameters, setting EPS to 0-20 and MinPts to 0-40, to find appropriate clustering results, and then adjust the ranges manually. These parameter ranges are shown in Table 6. By comparing the results, we obtain a more accurate range for each dataset, which is used in the following experiments. We round EPS to one decimal place and round MinPts down to an integer. We use the OBLAOA-DBSCAN algorithm to optimize these two parameters in the experiment. First, we compare the optimization algorithms: the results of OBLAOA are compared with the Arithmetic Optimization Algorithm (AOA), the Whale Optimization Algorithm (WOA) [45], the Salp Swarm Algorithm (SSA) [46], the Weighted Salp Swarm Algorithm (WSSA) [47], Exponential Neighborhood Grey Wolf Optimization (ENGWO) [48], the developed Arithmetic Optimization Algorithm (dAOA) [49] and the improved Arithmetic Optimization Algorithm (IAOA) [50]. Second, we compare OBLAOA-DBSCAN with five classical clustering algorithms, namely K-means [51], Spectral [52], OPTICS [53], clustering by fast search and find of density peaks (DPC) [54], and the original DBSCAN.

To compare the gaps among the algorithms conveniently and clearly, we set the test parameters as follows. The maximum number of iterations and the population size of all algorithms are set to 100 and 20, respectively. In addition, we run each algorithm 20 times and take the average result to eliminate experimental error. The experiments were run in MATLAB 2017b.

6.4 Experimental results of the optimization algorithm

In this part, we compare the improved optimization algorithm OBLAOA with seven other meta-heuristic optimization algorithms. We take the Euclidean distance as the fitness function and obtain the convergence curves of the fitness function. In Tables 7, 8, 9 and 10, we show the error indexes of the different algorithms, with the better indexes in bold. Figure 6 shows the convergence curves of six datasets; the convergence curves of the other datasets are given in Fig. 9 in the Appendix.

Fig. 6 Convergence curves with DBSCAN optimized by different meta-heuristic algorithms I

Table 7 The evaluation indexes of datasets in DBSCAN optimized by different meta-heuristic algorithms I
Table 8 The evaluation indexes of datasets in DBSCAN optimized by different meta-heuristic algorithms II
Table 9 The evaluation indexes of datasets in DBSCAN optimized by different meta-heuristic algorithms III
Table 10 The evaluation indexes of datasets in DBSCAN optimized by different meta-heuristic algorithms IV

The experiment shows that our OBLAOA algorithm is better than the original AOA, and that it is the best among the eight optimization algorithms when applied to the DBSCAN algorithm, as the convergence curves and error indexes illustrate. Our optimization algorithm achieves a better fitness value and convergence rate, as can be seen from the convergence curves in Fig. 6. On all the datasets, the fitness values of OBLAOA are better than those of the original AOA and the other optimization algorithms. Figure 6 shows that the convergence accuracy and rate of OBLAOA are better than those of AOA. On the Aggregation, Jain and Synthesis datasets, all algorithms converge more slowly as the function gradually converges, and AOA sometimes falls into a local optimum. However, because the OBL strategy strengthens the search capability, OBLAOA can still update the optimal solution.

OBLAOA also performs better than the other optimization algorithms according to the error indexes in Tables 7, 8, 9 and 10. On the Compound, Jain, Iris, Wpbc, Synthesis and Vehicle datasets, OBLAOA-DBSCAN is clearly better in accuracy: its DBI index is smaller than the others', and its SIL, RI, NMI, homogeneity, completeness and V-measure indexes are larger. Its accuracy is the best of the eight algorithms: 0.8538 on Compound, 0.7151 on Jain, 1 on Iris, 0.9346 on Wpbc, 0.9998 on Synthesis and 0.9656 on Vehicle. Although some of the indexes are tied, OBLAOA is better overall. On the four datasets Aggregation, Spiral, Pathbased and R15, the accuracy and evaluation indexes of the different algorithms are similar. Nevertheless, in general, OBLAOA analyzes clustering problems better than the original AOA and the other six meta-heuristic algorithms. Therefore, OBLAOA-DBSCAN has a good effect on the clustering of the datasets.

6.5 Experimental results of clustering algorithm

Table 11 The evaluation indexes of datasets in different clustering algorithms I
Table 12 The evaluation indexes of datasets in different clustering algorithms II
Fig. 7 The results with different clustering algorithms I

Fig. 8 The results with different clustering algorithms II

The specific clustering results on these datasets are shown in Figs. 7 and 8, which present the results of the K-means, Spectral, Optics, DPC and DBSCAN algorithms and of the best clustering optimization algorithm (OBLAOA-DBSCAN). Each colour in the figures represents one cluster. By comparing the graphs of each clustering, we can make a basic judgment about the clustering effect, as follows. OBLAOA-DBSCAN achieves better clustering results in Figs. 7 and 8: it clusters the data into better shapes and finds the actual number of clusters. The graphs of the datasets not shown here are given in Fig. 10 in the Appendix. In Tables 11 and 12, we show the error indexes of the different clustering algorithms, with the better indexes in bold. The entries marked with * are taken from articles [55] and [56].

In Fig. 7, compared with K-means, the result on the Aggregation dataset shows that our algorithm produces more reliable clusters: each cluster in the figure is clearly distinguished, while some K-means clusters are not. In Fig. 8, compared with Spectral, the result on the Synthesis dataset shows that our algorithm clusters the left side of the graph more accurately, and its clustering of a whole block of data is better than the Spectral algorithm's. From the graphs of Jain, Spiral and Pathbased in Figs. 7 and 8, OBLAOA-DBSCAN is more accurate than K-means and Spectral on circular datasets.

In Fig. 8, the graphs of the Pathbased and R15 datasets show that our algorithm clusters dense data more accurately than Optics. When dealing with discrete data points, the Optics algorithm marks them as noise points, whereas our algorithm handles these points more accurately, as the Synthesis dataset shows. From the Aggregation and Jain datasets in Fig. 7, it can be seen that the DPC algorithm marks boundary points as noise points. Therefore, by comparing the cluster graphs, we find that OBLAOA-DBSCAN clusters circular datasets better than Optics and DPC. In addition, OBLAOA-DBSCAN correctly identifies groups of data points in areas of lower local density, as well as edge data points, while the original DBSCAN fails to cluster these points accurately.

From Tables 11 and 12, the Accuracy, RI, Sil, NMI, homogeneity, completeness and V-measure indexes of OBLAOA-DBSCAN are significantly higher than those of the K-means, Spectral, Optics, DPC and DBSCAN algorithms, and its DBI index is lower than those of K-means and Spectral. Therefore, the improved OBLAOA-DBSCAN clusters the datasets more accurately than the original DBSCAN.

Compared with the indexes reported in other articles in Table 11, our algorithm has better NMI indexes than the K-means and original DBSCAN algorithms. On the Compound dataset, OBLAOA-DBSCAN's NMI index is 48.74\(\%\) higher than K-means's and 1.04\(\%\) higher than DBSCAN's. On the Iris dataset, it is 13.25\(\%\) higher than K-means's and 56.25\(\%\) higher than DBSCAN's. Compared with the indexes reported in other articles in Table 12, our algorithm has better NMI indexes than the K-means and DPC algorithms. On Aggregation, OBLAOA-DBSCAN's NMI index is 17.52\(\%\) higher than K-means's and 2.08\(\%\) higher than DPC's. On Jain, it is 72.16\(\%\) higher than K-means's and 2.74\(\%\) higher than DPC's. On Pathbased, it is 28.99\(\%\) higher than K-means's and 16.22\(\%\) higher than DPC's. On R15, it is 0.58\(\%\) higher than K-means's and 0.67\(\%\) higher than DPC's.

In Table 11, the DBI and RI indexes on the Spiral and Pathbased datasets are not the best, but the accuracy compared with the real labels is better. From the figures, we conclude that for circular datasets such as those in Figs. 7 and 8, our DBSCAN algorithm determines the shape of the clusters more accurately and obtains better results. In Table 11, the SIL index takes negative values on the circular Spiral dataset, but the clustering shapes are more consistent with the real labels. Through the above comparative analysis, OBLAOA-DBSCAN not only optimizes better than the other optimization algorithms but also performs better in clustering analysis than some classical clustering algorithms. In general, we conclude that OBLAOA-DBSCAN clusters the datasets very effectively.

7 Conclusion

In this paper, we have proposed a new clustering algorithm named OBLAOA-DBSCAN. We introduce OBL into the AOA algorithm to develop an OBLAOA optimizer that improves the global search ability and convergence accuracy of the standard AOA. We then use the improved OBLAOA to adjust the EPS and MinPts parameters of DBSCAN, improving its clustering effect, and thus obtain the hybrid clustering algorithm OBLAOA-DBSCAN. In our numerical simulations, we have demonstrated that the improved OBLAOA is more effective than the original AOA and other currently popular algorithms. We have also validated the effectiveness of the proposed OBLAOA-DBSCAN on many clustering tasks and found that it achieves accurate and reliable clustering results with lower computational costs.

Although OBLAOA-DBSCAN achieves significant improvement, some insufficiencies remain: the selection of the best parameters of the optimization algorithm, and the global search ability and clustering effect of the optimizer, need further improvement. In the future, we will apply OBLAOA-DBSCAN to clustering problems on more datasets. In addition, OBLAOA can be applied to other application problems that resemble clustering models, such as image classification and recognition, speech signal classification and electrical information classification, which calls for further research.