A hybrid deep kernel incremental extreme learning machine based on improved coyote and beetle swarm optimization methods

The iteration count and learning efficiency of kernel incremental extreme learning machines are degraded by redundant hidden nodes. A hybrid deep kernel incremental extreme learning machine (DKIELM) based on improved coyote and beetle swarm optimization methods is proposed in this paper. A hybrid intelligent optimization algorithm based on the improved coyote optimization algorithm (ICOA) and the improved beetle swarm optimization algorithm (IBSOA) is proposed to optimize the parameters and determine the number of effective hidden-layer neurons for the proposed DKIELM. A Gaussian global best-growing operator is adopted to replace the original growing operator in the intelligent optimization algorithm to improve the searching efficiency and convergence of COA. Meanwhile, IBSOA is designed based on tent mapping inverse learning and dynamic mutation strategies to avoid falling into a local optimum. The experimental results demonstrate the feasibility and effectiveness of the proposed DKIELM, with encouraging performance compared with other ELMs.


Introduction
Since it was proposed by Huang et al., the extreme learning machine (ELM) has shown fast training speed and incomparable classification ability in the fields of machine learning and neural networks over the past decades [1,2]. From the perspective of network structure, ELMs are single-hidden layer feedforward networks (SLFNs). ELMs are an efficient solution for SLFNs, addressing drawbacks such as slow learning rates, many training epochs, and local minima [3], as they do not require any iterative techniques [4,5].
The hidden nodes, input weights, and hidden biases of an ELM are all generated randomly, while the output weights of the neurons are calculated using the conventional Moore-Penrose inverse based on the least-squares method. Therefore, no parameter-tuning process is required to generate the network model before training, and this generation step is entirely independent of the training data. ELM has incomparable benefits over other machine learning methods in terms of training speed, global optimality, and generalization performance.
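As a concrete illustration, this training procedure (random hidden parameters, then a single least-squares solve for the output weights) can be sketched as follows; the helper names `train_elm`/`predict_elm` and the `tanh` activation are illustrative choices, not the paper's implementation:

```python
import numpy as np

def train_elm(X, T, n_hidden, seed=None):
    """Basic ELM training: random input weights and biases, then the
    Moore-Penrose pseudo-inverse gives the least-squares output weights."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, (X.shape[1], n_hidden))  # random input weights
    b = rng.uniform(-1.0, 1.0, n_hidden)                # random hidden biases
    H = np.tanh(X @ W + b)                              # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T                        # output weights, no iteration needed
    return W, b, beta

def predict_elm(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta
```

For example, an ELM with 20 hidden nodes interpolates the four XOR points exactly, since the hidden-layer matrix generically has full row rank.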
Although the effectiveness of ELM has been proved in specific applications, in many scenarios the training data arrive one by one or chunk by chunk. Therefore, a wide range of improved ELMs has been developed. The incremental ELM (I-ELM) proposed by Huang et al. [6] can calculate the output weights for ever-increasing hidden nodes more accurately while remaining a universal approximator. The adaptive growth ELM (AG-ELM) proposed by Zhang et al. [7], which consists of growing hidden nodes, incremental renewal of the network weights, and sequential generation of grouped new networks, can approximate nonlinear functions effectively. ELM can also be transformed into an online version based on online sequence analysis [8]. As for generalization, the kernel ELM (KELM) based on the kernel function was proposed in [9] for classification problems, and a sequential online ELM based on KELM was proposed in [10] for online learning. By optimizing the input weights, Sun et al. [11] proposed a differential evolutionary ELM with a compact network structure to improve the generalization performance. However, the current version of I-ELM still suffers from redundant nodes that continually increase the network structure complexity and reduce the learning efficiency. The convergence rate and prediction ability remain low due to the unfavorable number of hidden layer nodes and the sensitivity to new data.
Training speed and learning accuracy, both crucial for ELM, depend heavily on the parameter optimization process. Therefore, researchers have proposed several intelligent optimization methods to compute ELM parameters more precisely, and improving the training speed and learning accuracy of ELMs with bionics-based methods is becoming a research focus. A differential evolution algorithm is utilized in [12] to optimize the ELM input parameters. An adaptive differential evolution algorithm is given in [13] to optimize the hidden layer node parameters, with the Moore-Penrose generalized inverse used to calculate the output weights. An improved particle swarm optimization algorithm is proposed in [14] to optimize the hidden layer node parameters.
Although the traditional differential evolution optimization method has strong global optimization ability, it cannot avoid premature convergence. The traditional particle swarm optimization algorithm has a low searching speed despite its local optimization merits. Therefore, a hybrid optimization approach called DEPSO, based on the differential evolution algorithm and the particle swarm optimization method, is given in [15] to optimize the hidden layer nodes. A novel hybrid optimization method given in [16] takes advantage of the global search ability of the differential evolution algorithm and the local search capability of the multi-group gray wolf optimization algorithm. However, problems of local optima and poor population diversity in later iterations remain.
As an extension of the research in [16], a Hybrid Intelligent Deep Kernel Incremental Extreme Learning Machine (HI-DKIELM) based on the improved coyote optimization algorithm (ICOA) and the improved beetle swarm optimization algorithm (IBSOA) is proposed in this paper. First, COA is improved with the Gaussian global best-growing operator to enhance the searching efficiency and convergence speed. Second, tent mapping inverse learning and dynamic mutation strategies are adopted to prevent BSOA from falling into local optima and to improve the population diversity in later iterations. Finally, the hybrid intelligent optimization method is adopted to tune the parameters of a deep kernel incremental extreme learning machine (DKIELM) and improve the training speed and classification accuracy. To the best of our knowledge, this paper is the first to propose the hybrid coyote and beetle swarm optimization (HCOBSO) strategy.
The significant contributions of this paper are summarized as follows:
1. An improved Coyote Optimization Algorithm is proposed based on the Gaussian global best-growing operator to improve the searching efficiency and convergence. In this paper, we introduce the Gaussian global best-growing operator to replace the original growing operator in the COA method, enhancing the searching efficiency and convergence speed.
2. An improved Beetle Swarm Optimization method is designed based on tent mapping inverse learning and dynamic mutation strategies to prevent BSOA from falling into a local optimum. Owing to its high solving speed and accuracy, BSOA has been used successfully in signal positioning and data classification. However, the original BSOA has several drawbacks. First, it falls into local optima easily when solving complex optimization problems. Second, its computational complexity is very high when updating the parameters. Third, in multi-dimensional function optimization, relying on a single individual search alone increases the possibility of the algorithm running into a local optimum. Therefore, BSOA is improved in this paper using tent mapping inverse learning and dynamic mutation strategies.
3. The novel HCOBSO method is proposed for optimizing the parameters of DKIELM. The proposed hybrid intelligent optimization method takes advantage of the global search ability of COA and the local search capability of BSOA.
The remainder of this paper is organized as follows: related works are briefly reviewed in section "Preliminary". Section "The proposed HCOBSO method" presents the HCOBSO algorithm. Section "The proposed HI-DKIELM" presents the details of the DKIELM. The experimental results are presented in section "Results and discussion". Section "Conclusion" concludes our work and outlines future work.

Preliminary
In this part, the notation of the kernel incremental extreme learning machine (KI-ELM), the Gaussian global best-growing operator, tent mapping inverse learning, and the dynamic mutation strategies are provided for the convenience of understanding the proposed ELM algorithm.

Kernel incremental extreme learning machine
Suppose the training set {(x_i, t_i)}_{i=1}^N is composed of N training samples, where the input x_i has d dimensions and t_i is the label of the output. Then the output of the ELM is [17,18]

f(x) = Σ_{i=1}^{L} β_i g(w_i · x + b_i),   (1)

where w_i and b_i are the randomly generated input weights and bias of the ith hidden node, g(·) is the activation function, and β_i is the corresponding output weight. The kernel matrix of KI-ELM can be expressed as

Ω = HH^T,   Ω_{i,j} = h(x_i) · h(x_j) = K(x_i, x_j);   (2)

thus, the output function of KI-ELM can be given as

f(x) = [K(x, x_1), …, K(x, x_N)] (I/C + Ω)^{−1} T,   (3)

where C is the penalty factor and T is the target vector. The output value Ŷ_test can be estimated online as

Ŷ_test = K(X_test, X_train) (I/C + Ω)^{−1} T.   (4)
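A minimal sketch of the KI-ELM output computation above, assuming an RBF kernel; the helper names `kelm_fit`/`kelm_predict` and the default parameter values are illustrative, not from the paper:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kelm_fit(X, T, C=1.0, gamma=1.0):
    """Solve (I/C + Omega) alpha = T, with Omega the kernel matrix on X."""
    Omega = rbf_kernel(X, X, gamma)
    return np.linalg.solve(np.eye(len(X)) / C + Omega, T)

def kelm_predict(X_test, X_train, alpha, gamma=1.0):
    """f(x) = [K(x, x_1), ..., K(x, x_N)] alpha."""
    return rbf_kernel(X_test, X_train, gamma) @ alpha
```

Note that no hidden-layer weights are drawn at all: the kernel matrix replaces the explicit random feature map.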

Gaussian global best-growing operator
The Gaussian global best-growing operator is based on the best optimal strategy and the Gaussian distribution, and is inspired by the global-best harmony search algorithm. The best optimal strategy uses the state information of the current globally optimal target to improve the mining ability and enhance inner information sharing. The Gaussian (normal) distribution differs from the uniform random distribution on [0, 1] utilized in COA: it expands the scope of the search and improves the global searching ability to a certain extent.

Tent mapping inverse learning
Research has shown that in swarm intelligence searching, the convergence performance is affected by the initial population: the larger and the more evenly distributed the population, the faster the algorithm converges to the optimal solution. Using chaos mapping to initialize the population is random, ergodic, and bounded, so the search efficiency can be improved. The initial sequence generated by the tent map [19,20] is more uniform than that generated by the logistic map. Therefore, tent mapping is applied to BSOA to initialize the population, and inverse learning strategies are applied to optimize the initial population: through competition, the better individuals are selected for the next generation of learning. In this way, the population searching scope is expanded, invalid searching operations are reduced, and the convergence speed is improved. The mathematical expression of the tent map is as follows [21]:

x_{k+1} = 2x_k if 0 ≤ x_k < 0.5;   x_{k+1} = 2(1 − x_k) if 0.5 ≤ x_k ≤ 1.

For a feasible solution x = (x_1, …, x_D) in D-dimensional space with bounds [lb_j, ub_j], the inverse solution x′ is defined as x′_j = lb_j + ub_j − x_j.

Dynamic mutation strategies
In later iterations of BSOA, the decreasing population diversity reduces the searching capacity. Dynamic mutation strategies are introduced to increase the population diversity in later iterations and improve the convergence accuracy, so that premature convergence can be avoided. Scholars have proposed a variety of mutation operators, such as the Gaussian mutation and the Cauchy mutation [22,23]. Compared with the Gaussian operator, the Cauchy operator has longer tails and can generate a larger range of random numbers, offering a greater chance to jump out of a local optimum. Meanwhile, because its peak value is lower, the Cauchy mutation spends less time searching the immediate neighborhood of the current point.
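The contrast between the two operators can be seen directly from their samples; a small sketch (the function names are ours):

```python
import numpy as np

def gaussian_mutate(x, scale, rng):
    """Gaussian mutation: light tails, mostly small local steps."""
    return x + scale * rng.normal(size=x.shape)

def cauchy_mutate(x, scale, rng):
    """Cauchy mutation: heavy tails give occasional large jumps, helping
    the population escape local optima in later iterations."""
    return x + scale * rng.standard_cauchy(x.shape)
```

With unit scale, roughly 20% of Cauchy steps exceed 3 in magnitude versus about 0.3% of Gaussian steps, which is exactly the "longer tails" property exploited here.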

The proposed HCOBSO method
The output weights are the most important parameters of an ELM, and computing them requires the number of hidden layer nodes L to be estimated in advance. In this section, the HCOBSO method based on ICOA and IBSO is proposed. The motivation of the new idea is stated in section "Motivations". The concepts of ICOA and IBSO are explained in detail in sections "The improved COA method" and "The improved BSO method", and the implementation of the proposed hybrid intelligent optimization method is summed up in section "The proposed HCOBSO method", to facilitate an understanding of the proposed optimization method.

Motivations
Although COA and BSO are both swarm intelligence optimization algorithms, they differ in several respects. In terms of searching, COA simulates the growing and dying processes of coyotes, whereas BSO simulates the antennae-based foraging behavior of a beetle swarm. In terms of guidance, COA uses only the single best coyote to guide the growth of the others. Therefore, the two methods are improved respectively and then fused to compensate for the drawbacks of each, yielding the HCOBSO method. The proposed hybrid intelligent optimization method takes advantage of the global search ability of COA and the local search capability of BSOA.

The improved COA method
The COA method is inspired by the Canis latrans species of North America and is classified as both a swarm intelligence and an evolutionary heuristic method. Unlike other optimization methods, the structure of COA does not focus on social hierarchy and dominance rules.
In this paper, we introduce the Gaussian global best-growing operator to replace the original growing operator in the COA method, enhancing the searching efficiency and convergence speed. In the following, each step of the proposed ICOA framework is explained in detail.

Parameter initialization and random initialization of the coyote pack
1. First, the global population of coyotes, consisting of N_p packs each with N_c coyotes, is initialized. The initial social condition of the cth coyote of the pth pack in the jth dimension of the search space is set randomly as follows [24,25]:

soc_{c,j}^p = lb_j + r_j (ub_j − lb_j),

where lb_j and ub_j represent the lower and upper bounds of the jth decision variable, and r_j is a real random number generated inside the range [0, 1] with uniform probability. Then, the social fitness value of the coyote is evaluated as fit_c^p = f(soc_c^p).

2. Coyote growing inside the pack. In COA, the information of the coyotes is aggregated as the cultural tendency cult^p of the pack, computed as the median of the social conditions of all coyotes in the pack. The coyotes are under the alpha influence δ_1 and the pack influence δ_2: δ_1 is the cultural difference from a random coyote cr_1 of the pack to the alpha coyote, and δ_2 is the cultural difference from a random coyote cr_2 of the pack to the cultural tendency of the pack, written as

δ_1 = alpha^p − soc_{cr1}^p,   δ_2 = cult^p − soc_{cr2}^p.

The alpha is the best coyote inside the pack; at instant t, the alpha of the pth pack is the coyote with the best fitness value. Unlike the original COA method, in this paper the new social condition of a coyote, new_soc_1, is updated using the Gaussian global best-growing operator through the following equation:

new_soc_1 = GP + rn_1 δ_1 + rn_2 δ_2,

where GP is the global best coyote in the current iteration, and rn_1 and rn_2 are real random numbers drawn from a Gaussian normal distribution. Compared with the original COA, the differences between individuals are large at the beginning of the search, which enhances the searching ability in the early stage; in the later stages, the differences and the search space both become small. Therefore, the new social condition improves the exploitation capability.
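One growing step under this reading of the operator can be sketched as follows; the function name and the exact combination of the global best with the two deltas are our assumptions:

```python
import numpy as np

def icoa_grow(pack, fitness, global_best, rng):
    """Hypothetical growing step for one pack (rows are coyotes).
    The deltas follow the original COA; the random factors are Gaussian
    and the update is centered on the global best coyote."""
    alpha = pack[np.argmin(fitness)]            # best coyote in the pack
    cult = np.median(pack, axis=0)              # cultural tendency of the pack
    cr1, cr2 = rng.choice(len(pack), 2, replace=False)
    delta1 = alpha - pack[cr1]                  # alpha influence
    delta2 = cult - pack[cr2]                   # pack influence
    rn1, rn2 = rng.normal(), rng.normal()       # Gaussian instead of uniform [0, 1]
    return global_best + rn1 * delta1 + rn2 * delta2
```

Because rn_1 and rn_2 are unbounded, early candidates scatter widely around the global best; as the pack contracts, the deltas shrink and the step naturally becomes exploitative.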
3. Birth and death of coyotes. The birth of a new coyote pup is written as a combination of the social conditions of two parents and an environmental influence:

pup_j = soc_{r1,j} (if rnd_j < p_s or j = j_1);   soc_{r2,j} (if rnd_j ≥ p_s + p_a or j = j_2);   R_j (otherwise),

where rnd_j is a random number in the range [0, 1] with uniform probability, j_1 and j_2 are two random dimensions of the problem, R_j is a random number inside the decision variable bounds of the jth dimension, and p_s and p_a are the scatter probability and association probability, respectively, written as

p_s = 1/D,   p_a = (1 − p_s)/2,

where D is the problem dimension. To keep the population size static, COA synchronizes the birth and death of coyotes, as shown in Alg. 1.
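The birth rule above can be sketched as a per-dimension crossover; this is a hedged sketch, and the branch ordering for the two guaranteed dimensions is our choice:

```python
import numpy as np

def coa_pup(parent1, parent2, lb, ub, rng):
    """COA birth: each dimension of the pup comes from parent 1, parent 2,
    or a random point in the bounds, with p_s = 1/D and p_a = (1 - p_s)/2."""
    D = len(parent1)
    ps = 1.0 / D
    pa = (1.0 - ps) / 2.0
    j1, j2 = rng.choice(D, 2, replace=False)    # dimensions taken from each parent
    pup = np.empty(D)
    for j in range(D):
        r = rng.random()
        if r < ps or j == j1:
            pup[j] = parent1[j]                 # scatter: inherit from parent 1
        elif r >= ps + pa or j == j2:
            pup[j] = parent2[j]                 # association: inherit from parent 2
        else:
            pup[j] = lb[j] + rng.random() * (ub[j] - lb[j])  # environmental influence
    return pup
```

Dimensions j_1 and j_2 guarantee that each parent contributes at least one gene, so a pup never degenerates into a purely random point.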

Coyote expulsion and admission
Sometimes a coyote will leave its pack and become solitary, or instead join another pack; this occurs with probability p_e. The pseudocode of the proposed ICOA is described in Alg. 2. To evaluate the robustness of the proposed ICOA algorithm, the Schaffer function is adopted to test its optimization performance, as shown in Fig. 1. Figure 1 shows that the proposed ICOA method can find the global optimal solution effectively; the optimization results are close to the global extreme reference value, indicating that the method has good optimization ability.

The improved BSO method
The solving speed and accuracy of BSO are higher than those of other optimization methods, and BSO has been used successfully in signal positioning and data classification. However, the original BSO has several drawbacks. First, it falls into local optima easily when solving complex optimization problems. Second, its computational complexity is very high when updating the parameters. Third, in multi-dimensional function optimization, the beetle relies only on a single individual search, which increases the possibility of the algorithm falling into a local optimum. Therefore, the BSO method is improved using tent mapping inverse learning and dynamic mutation.
For population initialization, tent mapping inverse learning is adopted as follows [26]:
1. In the search space, the tent map is used to generate N positions x_ij of the beetle population as the initial population OB;
2. According to the definition of the inverse solution, the inverse position x′_ij of each beetle position x_ij in the initial population OB is generated, forming the inverse population FB;
3. The fitness values of the 2N individuals in the combined populations OB and FB are sorted in ascending order, and the N individuals with the top N fitness values are selected as the initial population.
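The three initialization steps can be sketched as follows; the tent seed 0.37 and the reseeding guard against floating-point collapse of the tent iterates are our additions:

```python
import numpy as np

def tent_sequence(n, x0=0.37):
    """Chaotic tent-map sequence: x <- 2x if x < 0.5 else 2(1 - x)."""
    xs = np.empty(n)
    x = x0
    for k in range(n):
        x = 2 * x if x < 0.5 else 2 * (1 - x)
        if x <= 0.0 or x >= 1.0:
            x = x0          # reseed: floating-point tent iterates can collapse to 0
        xs[k] = x
    return xs

def tent_opposition_init(n, lb, ub, fitness, x0=0.37):
    """Steps 1-3: tent-map population OB, inverse population FB = lb + ub - OB,
    then keep the n fittest of the combined 2n individuals."""
    D = len(lb)
    chaos = tent_sequence(n * D, x0).reshape(n, D)
    OB = lb + chaos * (ub - lb)                 # step 1: tent-map population
    FB = lb + ub - OB                           # step 2: inverse (opposition) population
    both = np.vstack([OB, FB])
    f = np.apply_along_axis(fitness, 1, both)   # step 3: competitive selection
    return both[np.argsort(f)[:n]]
```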
In the later iterations of the beetle method, the diversity of the population becomes worse. Therefore, a secondary optimization operation is performed on the beetle population using the dynamic mutation

x′_i = x_i + η · C(0, 1),   (24)

where η is the mutation weight and C(0, 1) is a random value generated by the Cauchy operator with scale parameter 1. The pseudocode of the improved BSO is shown in Alg. 3.
Fig. 1 The test performance of the Schaffer function

To demonstrate the effectiveness of the proposed IBSO method, we test its convergence using the unimodal function min f(x) = Σ_{i=1}^n x_i² and compare it with the GA, PSO, DE, and original BSO methods. The test results are shown in Fig. 2. From the experimental results, we find that the convergence speed and convergence accuracy are slightly improved compared with the other four optimization approaches: with tent mapping inverse learning and dynamic mutation, the global and local searching abilities are balanced, the convergence speed is accelerated, and the algorithm is helped to jump out of local minima.

The proposed HCOBSO method
A new hybrid optimization approach, the hybrid Coyote Optimization and Beetle Swarm Optimization (HCOBSO) algorithm, is proposed. It combines the advantages of the improved COA and improved BSO methods and borrows the population-partitioning idea of the Frog Leaping Algorithm (FLA) to improve generalization performance. The detailed pseudocode of the proposed optimization method is given in Table 1.
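Since Table 1 carries the exact pseudocode, the following is only a structural sketch of such a hybrid, with FLA-style shuffled packs, a COA-like global move, and a BSO/Cauchy-like local refinement; all update rules here are simplified stand-ins, not the paper's:

```python
import numpy as np

def hcobso_minimize(f, lb, ub, n=40, n_packs=4, iters=200, seed=None):
    """Hypothetical HCOBSO skeleton: partition the sorted population into
    packs (frog-leaping style), pull members toward the pack alpha and the
    global best (COA-like), then apply a small Cauchy step (BSO-like)."""
    rng = np.random.default_rng(seed)
    D = len(lb)
    pop = lb + rng.random((n, D)) * (ub - lb)
    fit = np.apply_along_axis(f, 1, pop)
    for _ in range(iters):
        order = np.argsort(fit)                     # shuffle best-to-worst into packs
        packs = [order[i::n_packs] for i in range(n_packs)]
        gbest = pop[order[0]]
        for idx in packs:
            alpha = pop[idx[np.argmin(fit[idx])]]   # pack best
            for i in idx:
                cand = pop[i] + rng.normal() * (alpha - pop[i]) \
                              + rng.normal() * (gbest - pop[i])
                cand = cand + 0.001 * (ub - lb) * rng.standard_cauchy(D)
                cand = np.clip(cand, lb, ub)
                fc = f(cand)
                if fc < fit[i]:                     # greedy replacement
                    pop[i], fit[i] = cand, fc
    return pop[np.argmin(fit)], float(fit.min())
```

The greedy replacement keeps the best fitness monotonically non-increasing, while the Gaussian pulls provide global contraction and the Cauchy term supplies the local escape moves discussed earlier.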

The proposed HI-DKIELM
In this section, the detailed structure of the proposed HI-DKIELM is given. The design of HI-DKIELM is based on KI-ELM and deep learning networks: HI-DKIELM consists of three parts, an input layer, an output layer, and a cascade of hidden layers, as shown in Fig. 3. During the training process, the HCOBSO optimization method is utilized to optimize the output weights and enhance robustness. For the kth hidden layer, the input feature X_k is obtained by subtracting the initial input data and then mapping the result using the kernel function. The detailed implementation process of the proposed HI-DKIELM is given in Table 2.
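The cascade can be sketched as a feature map; this is only one possible reading of the layer construction above, rendered as a per-layer kernel mapping against layer-specific anchor points with a residual link to the raw input (the anchors, the RBF choice, and the concatenation are our assumptions, not the paper's exact rule):

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def dkielm_features(X, anchors_per_layer, gamma=0.5):
    """Cascade of hidden layers: each layer kernel-maps the current features
    against its anchor set and keeps a residual link to the original input."""
    F = X
    for A in anchors_per_layer:                 # A has one row per anchor point
        K = rbf(F, A, gamma)                    # kernel mapping of current features
        F = np.hstack([K, X])                   # residual link back to the raw input
    return F
```

The output layer would then be trained on the final features F with the HCOBSO-optimized weights, as in the single-layer case.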

Experimental settings
Several experiments are conducted in diverse settings to demonstrate the effectiveness of the proposed HI-DKIELM approach.
The experiments are run on a PC with an Intel Core i7-8700 at 3.40 GHz and 16 GB RAM. The proposed method is implemented in Matlab 2013a, and the code for the other models comes directly from the code published by the respective authors. To verify the effectiveness and robustness of the HI-DKIELM algorithm, the experiments are divided into six parts:
1. In section "Evaluation of the performance of the HCOBSO optimization algorithm using the CEC2017 and CEC2019 database", the CEC2017 and CEC2019 databases are adopted to test the performance and robustness of the HCOBSO optimization algorithm proposed in section "The proposed HCOBSO method"; it is compared with several ablated optimization methods to test the contribution of each improvement.
2. In section "Performance evaluation of the HCOBSO optimization algorithm on typical functions", the proposed HCOBSO optimization method is tested on five typical optimization functions to check its optimization ability.
3. Section "Selection of hyperparameters" focuses on selecting the hyperparameters of HI-DKIELM, including the penalty factor C and the number of hidden nodes L.
4. Section "Regression analysis" focuses on the regression problems. The generalization performance of the HI-DKIELM algorithm is tested on ten real UCI data sets and compared with the traditional ELM, I-ELM, EM-ELM, EI-ELM, and B-ELM.
5. Section "Classification analysis" focuses on the classification problems. The generalization performance of the proposed HI-DKIELM algorithm is tested on ten real UCI data sets and compared with the traditional ELM, I-ELM, EM-ELM, EI-ELM, and B-ELM.
6. In section "Real-world application learning tasks", the performance of HI-DKIELM on practical application learning tasks is verified and compared with other baseline methods.

Data and parameter settings
Several databases are used in the experiments to verify the performance of the novel HI-DKIELM method.
For the experiments shown in sections "Evaluation of the performance of the HCOBSO optimization algorithm using the CEC2017 and CEC2019 database" and "Performance evaluation of the HCOBSO optimization algorithm on typical functions", the CEC2017 and CEC2019 databases and five typical optimization functions are used, respectively, to demonstrate the optimization effectiveness of the proposed HCOBSO method.
The CEC2017 database consists of unimodal, multimodal, hybrid, and composition functions. The CEC2019 database, known as "The 100-Digit Challenge", is intended for use in an annual optimization competition and is used to evaluate algorithms on large-scale optimization problems. The first three functions, CEC01-CEC03, have various dimensions, while the CEC04-CEC10 functions are set as ten-dimensional minimization problems in the range [−100, 100] and are shifted and rotated. All the CEC functions are scalable, and the global optimum of each function is unified toward point 1. The five typical functions are shown in Table 3. In section "Evaluation of the performance of the HCOBSO optimization algorithm using the CEC2017 and CEC2019 database", the best parameters for the COA optimization method are N = 100, N_c = 5, and N_p = 20. The parameters of HCOBSO are set to N = 100, N_c = 5, and N_p = 20 in the earlier search phase, and to N = 100, N_c = 10, and N_p = 10 in the later search phase.
In sections "Selection of hyperparameters" and "Regression analysis", the UCI machine learning data sets are adopted in the experiments, with 13 regression problems and 14 classification problems. The data specifications are shown in Table 4, where the notations #train and #test represent the sizes of the training set and the test set, respectively.

Evaluation of the performance of the HCOBSO optimization algorithm using the CEC2017 and CEC2019 database
To test the contribution of each improvement to the HCOBSO method, the HCOBSO method is compared on the CEC2017 and CEC2019 databases with several ablated or baseline optimization algorithms, namely HCOBSO5 (N_c = 5, N_p = 20), COA, BSO, ICOA, LPB [27], and FDO [28], as shown in Table 5. In the experiment, three functions are selected in each group. The rank method is used to compare the means: a smaller mean yields a better rank, and when the means are equal, the standard deviation is compared. Table 5 shows that HCOBSO5 achieves the best mean and variance most often on the unimodal functions, indicating that the global searching ability of the BSO component and the mining ability of the COA component improve with the growth of N_p: a larger N_p implies a stronger mining ability and a faster convergence rate. However, HCOBSO5 falls into local minima easily, and its optimization performance on the multimodal functions is poor. Among the five optimization methods, the average rank of HCOBSO is 1.58, while the average ranks of COA, BSO, ICOA, and HCOBSO5 are 4.16, 4.58, 2.75, and 2, respectively. The HCOBSO optimization method proposed in this paper obtains the best total rank among the five methods, while the total ranks of COA, BSO, ICOA, and HCOBSO5 are 4, 5, 3, and 2, respectively. These results show that HCOBSO obtains the best optimization performance and prove the effectiveness of the method.
From the results shown in Table 6, we find that HCOBSO achieves the best mean and variance most often on the CEC2019 database, which means that with the growth of N_p, the global searching ability of the BSO component and the mining ability of the COA component improve correspondingly: the larger the value of N_p, the stronger the mining ability and the faster the convergence rate. Among the six optimization methods, the average rank of HCOBSO is 1.7, while the average ranks of LPB, FDO, COA, BSO, and HCOBSO5 are 3.3, 3, 4.5, 5.2, and 3.3, respectively. Meanwhile, we observe that the HCOBSO optimization method proposed in this paper obtains the best total rank among the six methods.

Performance evaluation of the HCOBSO optimization algorithm on typical functions
In this part, we compare the HCOBSO algorithm with four baseline optimization methods, COA, BSO, ICOA, and DE-MPGWO [16], on typical functions.
In this study, the number of iterations of each algorithm is set to 2000, and the overall population scale of the baseline optimization methods is the same, i.e., N_p = 40. To sum up, the experimental results show that the HCOBSO optimization method proposed in this study strikes a good balance between search accuracy and convergence speed and improves the search ability.

Selection of hyperparameters
In this part, we discuss the hyperparameter selection of HI-DKIELM and the original ELM. Compared with traditional neural network training algorithms, ELM and HI-DKIELM need fewer parameters. According to the introduction of HI-DKIELM in section "The proposed HI-DKIELM", two parameters are crucial: the penalty factor C and the number of hidden nodes L. The results are shown in Fig. 4.

The learning accuracies of ELM and HI-DKIELM in the L subspace are shown in Fig. 4a and b, where the parameter C is fixed; the learning accuracies in the C subspace are shown in Fig. 4c and d, where the parameter L is fixed. Figure 4a and b shows that the proposed HI-DKIELM follows a convergence behavior similar to that of ELM but with higher testing accuracy; meanwhile, its performance is more stable over a wide range of L. According to the results shown in Fig. 4c and d, the accuracy of HI-DKIELM changes only slightly as C increases at first and then increases rapidly, converging to a better performance than the original ELM.
The value of L needs to be large enough; since the accuracy curve is insensitive to L beyond that point, C can be selected with relatively few hidden nodes.
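A selection loop of this kind can be sketched as a plain grid search over C and L on a validation split; the function name and the `fit`/`score` callback signatures are ours, not from the paper:

```python
import numpy as np

def select_C_L(train, val, C_grid, L_grid, fit, score):
    """Grid search over the penalty factor C and hidden-node count L; `fit`
    trains a model and `score` returns validation accuracy (higher is better)."""
    best = (None, None, -np.inf)
    for C in C_grid:
        for L in L_grid:
            model = fit(train, C=C, L=L)
            s = score(model, val)
            if s > best[2]:
                best = (C, L, s)
    return best  # (best C, best L, best validation score)
```

Because the accuracy is stable over a wide range of L, a coarse L grid usually suffices, which keeps the search cheap.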

Regression analysis
In this part, we test the performance of the HI-DKIELM method proposed in section "The proposed HI-DKIELM" on regression problems, using the real UCI data sets in Table 4 to evaluate its generalization and robustness. It is compared with four baseline ELM methods: enhanced random search-based ELM (EI-ELM), error minimization ELM (EM-ELM), incremental ELM (I-ELM), and bidirectional ELM (B-ELM). In the experiment, the initial number of hidden-layer neurons and the number of hidden neurons added per step are both set to one, so the number of neurons in the hidden layer equals the number of iterations. In Table 9, we compare the testing root-mean-square error (RMSE) and the training time on the regression problems.
The proposed method shows a generalization performance similar to that of the other ELMs with a much higher learning speed. For instance, on the Auto MPG dataset, the training time of the proposed HI-DKIELM method is only 0.0001 s: 156 times as fast as ELM, 2025 times as fast as I-ELM, 13004 times as fast as EI-ELM, 75 times as fast as EM-ELM, and 3732 times as fast as B-ELM. On the Delta Elevators dataset, the training time of the proposed HI-DKIELM method is about 0.0001 s: 616 times as fast as ELM, 4680 times as fast as I-ELM, 35478 times as fast as EI-ELM, 431 times as fast as EM-ELM, and 453 times as fast as B-ELM. Furthermore, in this section, the performance of ELM and the proposed method has been tested on the XOR problem, which has one training sample in each class. The aim of this simulation is to verify whether the proposed method can handle rare cases, such as those with extremely few training samples. The results show that the testing RMSE obtained by ELM is close to 0.0010 when 2000 hidden nodes are used, while the testing RMSE obtained by the proposed method is close to 0.0010 with only 87 hidden nodes.
In addition, the comparison results of the average test error are shown in Fig. 5, where the x-axis represents the number of hidden nodes and the y-axis represents the average testing root-mean-square error (RMSE). Compared with the other ELM methods, the HI-DKIELM method has the best test performance. For instance, the test error obtained by the HI-DKIELM method on the Ca house dataset is half of that of ELM. In addition, the test RMSE of the ELM method is close to 0.0010 with 2000 hidden nodes, while that of the HI-DKIELM method is close to 0.0010 with only 89 hidden nodes.

Classification analysis
In this part, for the classification problems, we test the performance of the HI-DKIELM method proposed in section "The proposed HI-DKIELM" using the real UCI data sets given in Table 4 to evaluate its generalization and robustness, comparing it with the four baseline ELM methods.
In the experiment, the initial number of hidden-layer neurons and the number of hidden neurons added per step are both set to one, so the number of neurons in the hidden layer equals the number of iterations. We compare the testing accuracy and the training time on the classification problems. The comparison of the testing accuracy is displayed in Fig. 6.
The results in Fig. 6 show that the proposed HI-DKIELM has significant advantages in training speed. On the Connect-4 dataset, which has a large number of training samples with a medium input dimension, the running speed of the proposed method is 5600 times, 300 times, 45 times, and 1000 times that of I-ELM, EI-ELM, EM-ELM, and PC-ELM, respectively. The Hill Valley dataset has a moderate number of training samples with a moderate input dimension; on it, the proposed method runs 3 times, 20 times, 25 times, and 1200 times as fast as I-ELM, EI-ELM, EM-ELM, and PC-ELM, respectively, with better performance.

Real-world application learning tasks
In this section, we will evaluate the performance of the proposed HI-DKIELM on the real-world applications: image classification task.
The original data are color images from the Corel data set. Each image is segmented, using the Blobworld system, into fragments that represent instances. Fragments containing specific visual content (e.g., an elephant) are labeled positive, while the remaining fragments are labeled negative. Therefore, the fragments (i.e., instances) from images of the same kind (e.g., elephant) form a binary learning problem. We use five image data sets: Tiger, Elephant, Fox, Bikes, and Cars, with 1096, 1259, 1474, 5215, and 5600 instances, respectively. The instances in the Tiger, Elephant, and Fox data sets are described by a 230-dimensional feature vector representing the color, texture, and shape of the region, while the instances in the Bikes and Cars data sets are represented by a 90-dimensional feature vector. In image classification, visual content-based image retrieval is an important application, for example, finding pictures containing an elephant in a data set. Sample images from the benchmark data sets are shown in Fig. 7. The detailed experimental results for the image classification tasks are given in Tables 10 and 11 in terms of classification accuracy (ACC) and area under the curve (AUC), respectively. As Tables 10 and 11 show, the proposed HI-DKIELM achieves the best ACC and AUC on all image data sets, indicating that HI-DKIELM is superior to the other methods in content-based image retrieval tasks. This extraordinary performance is due to the many local approximations created by the proposed HI-DKIELM. The results also show that the naive Bayes (NB) method has the worst ACC and AUC on the Tiger, Elephant, Bikes, and Cars data sets, whereas the SVM method performs worst on the Fox data set. For the other baselines, more detailed experimental results can be found in Tables 10 and 11.

Conclusion
In this paper, a novel ELM method named Hybrid Intelligent Deep Kernel Incremental Extreme Learning Machine (HI-DKIELM) based on the hybrid coyote optimization Beetle Swarm Optimization (HCOBSO) method is proposed. The proposed method has several features different from existing ELM-based methods.
1. An improved Coyote Optimization Algorithm has been proposed to improve the searching efficiency and convergence, in which the Gaussian global best-growing operator replaces the original growing operator.
2. An improved Beetle Swarm Optimization method has been designed based on tent mapping inverse learning and dynamic mutation strategies to prevent the BSO algorithm from falling into a local optimum.
3. A novel hybrid intelligent optimization algorithm, the hybrid coyote optimization Beetle Swarm Optimization (HCOBSO) method, has been presented for optimizing the parameters of the DKIELM. The proposed hybrid intelligent optimization method combines the global search ability of the coyote optimization algorithm with the local search capability of the Beetle Swarm Optimization algorithm.
In future work, there is room to develop the proposed methods further. First, it remains challenging to gain more insight into the deep learning capability of ELM. Second, from an optimization point of view, other novel heuristic algorithms, such as the five-element cycle optimization method, can be explored.