1 Introduction

Conventional single hidden layer feedforward neural networks (SLFNs) rely on slow training algorithms with weak learning abilities, because all parameters of such networks must be updated during learning. In recent years, Huang et al. [1] presented the extreme learning machine (ELM). Unlike in conventional neural networks, in the ELM the weight and bias values of each hidden node are initialized randomly and then determined without labor-intensive iterative tuning. While ensuring good learning and generalization performance of the network structure, the ELM offers considerably faster sample training, fewer local minima, and faster convergence [2,3,4,5].

In the conventional ELM, higher-dimensional network structures are usually required to gain stronger learning ability; however, finding the ideal number of hidden layer nodes and the scale of the resulting models is difficult. To address this, Huang et al. [6,7,8] established the incremental extreme learning machine (I-ELM), which adopts an incremental algorithm that adaptively selects the number of hidden layer nodes and updates the output weights in real time. By optimizing the I-ELM algorithm, Huang developed the enhanced incremental extreme learning machine (EI-ELM) [9], which effectively selects the hidden layer nodes used to build the network model and thus reduces network complexity. However, when the network scale is very large, the number of EI-ELM cycles increases considerably, which degrades its generalization capability. Yang et al. [10] introduced the Barron-optimized convex incremental extreme learning machine (CI-ELM), which recalculates the output weights of the hidden layer as more nodes are added in order to boost the convergence rate. Furthermore, in [11], a hybrid incremental extreme learning machine was put forward that adjusts the weight and bias values of the hidden layer nodes via a chaos optimization algorithm to minimize the error. Nonetheless, the existing I-ELM has limitations that urgently need to be overcome. Redundant nodes may increase the complexity of the network architecture and reduce learning efficiency; the low convergence rate can cause the number of hidden layer nodes to exceed the number of training samples; and the model updating effect is poor and highly sensitive to new data. Handling these factors jointly is essential, because they affect both the accuracy and the convergence rate of the ELM. Hence, several studies have focused on optimizing the combination of these parameters. Intelligent optimization algorithms based on bionics have been applied to improve the ELM parameters and thereby improve learning speed and accuracy. A new hybrid approach was proposed to find the ideal parameters [12]. The settings of the hidden layer nodes were optimized using an adaptive differential evolution (DE) algorithm, after which the output weights were determined using the Moore-Penrose (MP) generalized inverse [13]. An improved particle swarm optimization (PSO) technique was employed to adjust the settings of the hidden layer nodes [14]. A hybrid intelligent extreme learning machine was also put forth, which combines DE and PSO to optimize the network parameters [15]. However, two issues arise with this class of intelligent optimization algorithms: PSO can perform a local search but moves slowly, while DE has a strong ability to find the global optimum but suffers from premature convergence.

To further improve the parameter performance of the ELM, several studies have combined deep learning with the ELM, owing to the superior feature extraction ability of deep learning. A multilayer ELM was established that effectively improves the parameter performance and training ability of the ELM [16]. An ELM with deep kernels [17] was proposed to enhance accuracy and has been successfully applied to detect flaws in aero engine components.

Based on the aforementioned studies, a hybrid intelligent deep kernel incremental extreme learning machine (HI-DKIELM) was established in this study. First, we associated the courtship search behavior of longicorn beetles with the objective function to be optimized and designed a new swarm intelligence optimization algorithm, namely the artificial transgender longicorn algorithm (ATLA). Then, this algorithm was combined with multi-population gray wolf optimization to form a new hybrid intelligent optimization algorithm, the artificial transgender longicorn multi-population gray wolf optimization (ATL-MPGWO) method, which is used to optimize the output weights. The proposed hybrid intelligence algorithm combines the global search capability of the MPGWO algorithm with the local search capability of the ATLA to obtain the ideal output weights and thereby increase the training efficiency and classification precision of the I-ELM.

This study’s significant contributions are summarized below:

(1) The ATLA was combined with the MPGWO algorithm to establish a novel hybrid intelligence optimization algorithm, which optimizes the parameters of the hidden nodes to obtain optimal output weights.

(2) A deep network structure was incorporated into the KI-ELM to realize a high-dimensional spatial mapping classification of the data through deep kernel incremental extreme learning, thereby improving the classification accuracy and generalization performance of the algorithm.

2 ATL-MPGWO Method

2.1 ATLA

A longicorn herd algorithm, which simulates longicorn foraging, was previously proposed to address nonlinear optimization problems such as pressure vessel design [18, 19].

The proposed ATLA is illustrated in Fig. 1. The population is assumed to be composed entirely of male longicorn beetles. Guided by the current optimal search results, some male longicorn beetles transform into female longicorn beetles. During the transformation, these beetles release sex pheromones and attract male longicorn beetles within a certain concentration range. The pheromone source lies at the center of the circle, where the transgender longicorn beetles are also located. Male longicorn beetles inside the circle approach the transgender beetles, whereas beetles outside the circle search freely.

Fig. 1 Schematic of the ATLA

The steps involved in the ATLA are as follows:

(1) Attraction of the opposite sex: after a male longicorn beetle transforms into a female, it attracts the male longicorn beetles within its pheromone concentration range. The moving direction of the male longicorn beetles is thus as follows:

$$ \overrightarrow {d}_{i} = \frac{{x_{{{\text{best}}}}^{t} - x_{i}^{t} }}{{\left\| {x_{{{\text{best}}}}^{t} - x_{i}^{t} } \right\|}}. $$
(1)

In formula (1), \(x_{i}^{t}\) denotes the position of the \(i\)th male longicorn beetle in the \(t\)th iteration, and \(x_{{{\text{best}}}}^{t}\) denotes the optimal value of the current iteration. When the male longicorn beetles identify the moving direction, they change their position as follows [20]:

$$ \left\{ \begin{array}{l} x_{i}^{t + 1} = x_{i}^{t} + s^{t} \times \overrightarrow{d}_{i} \\ s^{t} = 0.95\,s^{t - 1} /r_{i} \\ \end{array} \right. , $$
(2)

where \(r_{i}\) denotes the fitness of the \(i\)th male longicorn beetle in the current iteration; as this fitness increases, the gap between the current individual and the current optimal value decreases, and the moving step \(s^{t}\) therefore decreases.

(2) Levy flight random search: the male longicorn beetles outside the concentration range perform a random search, expressed by the following Levy flight formula:

$$ x_{i}^{t + 1} = x_{i}^{t} + \alpha \otimes levy(\theta ). $$
(3)

The advantage of the ATLA is that a male longicorn beetle approaches the female longicorn beetle when its fitness exceeds the set threshold; otherwise, it performs a Levy flight random search. In addition to ensuring that the male longicorn beetles can find the optimal mate within the sex pheromone concentration range, a small number of male longicorn beetles are allowed to move randomly outside the range. As a result, the ATLA is prevented from becoming trapped in a local optimum. Based on the above description, the ATLA is summarized in Table 1.

Table 1 ATLA pseudocode
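The pseudocode of Table 1 can be realized, for example, by the following minimal Python sketch. It assumes a minimization objective, treats \(r_{i}\) as a positive fitness that grows as a beetle approaches the current best solution (the exact fitness definition is not specified above, so this is an assumption), and uses Mantegna's algorithm for the Levy steps; all parameter values are illustrative.

```python
import numpy as np
from math import gamma, pi, sin

def levy_step(dim, beta=1.5):
    # Mantegna's algorithm for Levy-distributed random steps
    sigma = (gamma(1 + beta) * sin(pi * beta / 2)
             / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = np.random.normal(0.0, sigma, dim)
    v = np.random.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / beta)

def atla(objective, dim, n_beetles=30, n_iter=200,
         threshold=0.5, alpha=0.01, s0=1.0, bounds=(-10.0, 10.0)):
    lo, hi = bounds
    x = np.random.uniform(lo, hi, (n_beetles, dim))
    f = np.apply_along_axis(objective, 1, x)
    best, f_best = x[np.argmin(f)].copy(), f.min()
    s = s0
    for t in range(n_iter):
        for i in range(n_beetles):
            # assumed fitness: close to 1 when the beetle is near the current best
            r_i = 1.0 / (1.0 + abs(f[i] - f_best))
            if r_i > threshold:
                # Eqs. (1)-(2): move toward the transgender (best) beetle,
                # with a step that shrinks as the fitness grows
                d = best - x[i]
                norm = np.linalg.norm(d)
                if norm > 0.0:
                    x[i] += (s / r_i) * d / norm
            else:
                # Eq. (3): Levy flight random search outside the pheromone range
                x[i] += alpha * levy_step(dim)
            x[i] = np.clip(x[i], lo, hi)
            f[i] = objective(x[i])
            if f[i] < f_best:
                f_best, best = f[i], x[i].copy()
        s *= 0.95  # step decay from Eq. (2)
    return best, f_best
```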

2.2 MPGWO Algorithm

MPGWO is an intelligent optimization algorithm based on GWO [21, 22]. The multi-population scheme simultaneously diversifies the populations and their search spaces using the gray wolf algorithm: multiple populations search independently in different search spaces, and an elite mechanism provides information sharing and communication among them, which improves population diversity and helps obtain a better solution. The specific implementation steps of the algorithm are described in Table 2.

Table 2 MPGWO pseudocode
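Since the MPGWO pseudocode of Table 2 is not reproduced here, the following hedged Python sketch shows one possible realization: several grey wolf subpopulations search independently, and an elite-sharing step (the exact form of the elite mechanism is an assumption) periodically injects the best elite found across subpopulations into every subpopulation.

```python
import numpy as np

def mpgwo(objective, dim, n_pops=4, pop_size=20, n_iter=200,
          share_every=10, bounds=(-10.0, 10.0)):
    lo, hi = bounds
    pops = [np.random.uniform(lo, hi, (pop_size, dim)) for _ in range(n_pops)]
    for t in range(n_iter):
        a = 2.0 * (1.0 - t / n_iter)  # GWO control parameter, decreases from 2 to 0
        elites = []
        for wolves in pops:
            fit = np.apply_along_axis(objective, 1, wolves)
            alpha, beta, delta = wolves[np.argsort(fit)[:3]]
            for i in range(pop_size):
                x_new = np.zeros(dim)
                for leader in (alpha, beta, delta):
                    r1, r2 = np.random.rand(dim), np.random.rand(dim)
                    A, C = 2.0 * a * r1 - a, 2.0 * r2
                    x_new += leader - A * np.abs(C * leader - wolves[i])
                wolves[i] = np.clip(x_new / 3.0, lo, hi)
            elites.append(alpha.copy())
        if (t + 1) % share_every == 0:
            # assumed elite mechanism: the globally best elite replaces the
            # worst wolf of every subpopulation (information sharing)
            best_elite = min(elites, key=objective)
            for wolves in pops:
                fit = np.apply_along_axis(objective, 1, wolves)
                wolves[np.argmax(fit)] = best_elite.copy()
    return min(np.vstack(pops), key=objective)
```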

2.3 ATL-MPGWO Methods

In this study, an improved hybrid intelligent optimization algorithm, the ATL-MPGWO algorithm, was established on the basis of the ATLA and the MPGWO algorithm; its combination scheme is inspired by the memetic evolution mechanism of the shuffled frog-leaping algorithm. To exploit the complementary advantages of the two algorithms and improve the performance of the hybrid, the specific implementation steps are described in Table 3.

Table 3 ATL-MPGWO method
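As a rough illustration of Table 3, the sketch below alternates MPGWO global exploration with ATLA local refinement, reusing the atla() and mpgwo() sketches above. The exact combination scheme of the ATL-MPGWO method is not restated in this section, so this alternation (and all parameter values) should be read as an assumption, not as the authors' precise procedure.

```python
def atl_mpgwo(objective, dim, n_rounds=10, bounds=(-10.0, 10.0)):
    """Hedged sketch: alternate MPGWO global search with ATLA local search."""
    best = mpgwo(objective, dim, n_iter=50, bounds=bounds)
    f_best = objective(best)
    for _ in range(n_rounds):
        # local refinement with a small, short-stepped ATLA swarm
        cand, f_cand = atla(objective, dim, n_beetles=10, n_iter=50,
                            s0=0.1, bounds=bounds)
        if f_cand < f_best:
            best, f_best = cand, f_cand
        # renewed global exploration with MPGWO
        cand = mpgwo(objective, dim, n_iter=50, bounds=bounds)
        if objective(cand) < f_best:
            best, f_best = cand, objective(cand)
    return best, f_best
```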

3 Improved Hybrid Intelligent Extreme Learning Machine

3.1 KI-ELM

Unlike the original incremental neural network, which can employ only a particular class of excitation functions, the I-ELM can adopt not only continuous but also piecewise continuous functions. Under the same performance index, the operating efficiency of the I-ELM is higher than that of the SVM and BP algorithms. Several improved incremental extreme learning machines, such as the EI-ELM, PC-ELM, and OP-ELM, have been proposed successively; compared with the I-ELM, they mainly improve the optimization of the hidden layer nodes [23]. The kernel matrix of the model is defined as follows:

$$ K_{ELM} = HH^{T} = h(x_{i} ) \cdot h(x_{j} ) = K(x_{i} ,x_{j} ). $$
(4)

The output function of the K-ELM can then be expressed as follows:

$$ \begin{aligned} f(x) &= h(x)\beta \\ &= h(x)H^{T} \left( \frac{I}{C} + HH^{T} \right)^{ - 1} T \\ &= \left[ {\begin{array}{*{20}c} {K(x,x_{1} )} \\ {K(x,x_{2} )} \\ \vdots \\ {K(x,x_{N} )} \\ \end{array} } \right]^{T} \left( \frac{I}{C} + K_{ELM} \right)^{ - 1} T. \end{aligned} $$
(5)

In formula (5), let \(A = \frac{I}{C} + K_{ELM}\). Then, at time \(t\),

$$ A_{t} = \left[ {\begin{array}{*{20}c} {\frac{1}{c} + K(x_{1} ,x_{1} )} & \cdots & {K(x_{1} ,x_{N} )} \\ \vdots & \cdots & \vdots \\ {K(x_{N} ,x_{1} )} & \cdots & {\frac{1}{c} + K(x_{N} ,x_{N} )} \\ \end{array} } \right]. $$
(6)

At time \(t + 1\),

$$ A_{t + 1} = \left[ {\begin{array}{*{20}c} {\frac{1}{c} + K(x_{1} ,x_{1} )} & \cdots & {K(x_{1} ,x_{N + k} )} \\ \vdots & \cdots & \vdots \\ {K(x_{N + k} ,x_{1} )} & \cdots & {\frac{1}{c} + K(x_{N + k} ,x_{N + k} )} \\ \end{array} } \right]. $$
(7)

For brevity, the blocks of formula (7) are defined as follows:

$$ U_{t} = \left[ {\begin{array}{*{20}c} {K(x_{1} ,x_{N + 1} )} & \cdots & {K(x_{1} ,x_{N + k} )} \\ \vdots & \cdots & \vdots \\ {K(x_{N} ,x_{N + 1} )} & \cdots & {K(x_{N} ,x_{N + k} )} \\ \end{array} } \right] $$
(8)
$$ {\text{D}}_{t} = \left[ {\begin{array}{*{20}c} {\frac{1}{c} + K(x_{N + 1} ,x_{N + 1} )} & \cdots & {K(x_{N + 1} ,x_{N + K} )} \\ \vdots & \cdots & \vdots \\ {K(x_{N + k} ,x_{N + 1} )} & \cdots & {\frac{1}{c} + K(x_{N + k} ,x_{N + k} )} \\ \end{array} } \right]. $$
(9)

By block partitioning, formula (7) can therefore be written as formula (10):

$$ A_{t + 1} = \left[ {\begin{array}{*{20}c} {A_{t} } & {U_{t} } \\ {U_{t}^{T} } & {D_{t} } \\ \end{array} } \right]. $$
(10)

When new data arrive, the inverse is updated as

$$ A_{t + 1}^{ - 1} = \left[ {\begin{array}{*{20}c} {A_{t}^{ - 1} + A_{t}^{ - 1} U_{t} C_{t}^{ - 1} U_{t}^{T} A_{t}^{ - 1} } & { - A_{t}^{ - 1} U_{t} C_{t}^{ - 1} } \\ { - C_{t}^{ - 1} U_{t}^{T} A_{t}^{ - 1} } & {C_{t}^{ - 1} } \\ \end{array} } \right], $$
(11)

where \(C_{t} = D_{t} - U_{t}^{T} A_{t}^{ - 1} U_{t}\). For test data \(X_{{{\text{test}}}} = \left[ {x_{{{\text{test}}1}} ,x_{{{\text{test}}2}} , \cdots ,x_{{{\text{test}}M}} } \right]\), the output value \(\hat{Y}_{{{\text{test}}}}\) can be estimated online as follows:

$$ \hat{Y}_{{{\text{test}}}} = \left[ {\begin{array}{*{20}c} {K(x_{{{\text{test}}1}} ,x_{1} )} & \cdots & {K(x_{{{\text{test}}1}} ,x_{N} )} \\ \vdots & \ddots & \vdots \\ {K(x_{{{\text{test}}M}} ,x_{1} )} & \cdots & {K(x_{{{\text{test}}M}} ,x_{N} )} \\ \end{array} } \right] A^{ - 1} \left[ {\begin{array}{*{20}c} {y_{1} } \\ \vdots \\ {y_{N} } \\ \end{array} } \right]. $$
(12)
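The derivation in Eqs. (4)-(12) can be summarized in code. The sketch below is a minimal illustration, not the full KI-ELM training procedure: it assumes an RBF kernel, solves the regularized system of Eq. (5), and updates the stored inverse with the block formula of Eq. (11) when new samples arrive.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), assumed kernel choice
    d2 = (np.sum(X ** 2, 1)[:, None] + np.sum(Z ** 2, 1)[None, :]
          - 2.0 * X @ Z.T)
    return np.exp(-gamma * d2)

def kelm_fit(X, T, C=1e4, gamma=1.0):
    # Eq. (5): A = I/C + K_ELM; keep A^{-1} so it can be updated incrementally
    A = np.eye(len(X)) / C + rbf_kernel(X, X, gamma)
    A_inv = np.linalg.inv(A)
    return A_inv, A_inv @ T

def block_update_inverse(A_inv, U, D):
    # Eq. (11), with the Schur complement C_t = D_t - U_t^T A_t^{-1} U_t
    C_inv = np.linalg.inv(D - U.T @ A_inv @ U)
    AU = A_inv @ U
    return np.block([[A_inv + AU @ C_inv @ AU.T, -AU @ C_inv],
                     [-C_inv @ AU.T,             C_inv]])

def kelm_predict(X_train, A_inv, Y_train, X_test, gamma=1.0):
    # Eq. (12): Y_test_hat = K(test, train) A^{-1} Y_train
    return rbf_kernel(X_test, X_train, gamma) @ (A_inv @ Y_train)

# When k new samples X_new arrive, U and D follow Eqs. (8)-(9):
#   U = rbf_kernel(X_old, X_new, gamma)
#   D = np.eye(len(X_new)) / C + rbf_kernel(X_new, X_new, gamma)
#   A_inv = block_update_inverse(A_inv, U, D)
```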

3.2 Proposed Algorithm

Based on the KI-ELM, this paper proposes the HI-DKIELM. During the training phase, the ATL-MPGWO algorithm is applied as the parameter optimization algorithm to enhance the robustness of the model.

The key components of the proposed HI-DKIELM are an input layer, multiple hidden layers, and an output layer, as illustrated in Fig. 2. The steps involved in the algorithm are listed in Table 4:

Fig. 2 Structure of the HI-DKIELM

Table 4 HI-DKIELM algorithm

The HI-DKIELM processes the input data through layer-by-layer extraction and finds effective features, thereby facilitating the differentiation of classes. These abstract features are condensed summaries of the original data. Kernel function calculations are used in place of explicit inner-product operations in the high-dimensional space, which improves classification accuracy. After the original data are abstracted by the \(k\) hidden layers, the input feature \(X^{k}\) is obtained, and the kernel function is then used to map this feature.
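A hedged sketch of this layer-by-layer mapping followed by a kernel stage is given below. For simplicity the hidden-layer weights are drawn at random rather than returned by ATL-MPGWO, and the sigmoid activation and RBF kernel are assumptions; the sketch only illustrates how the abstracted feature \(X^{k}\) feeds the kernel solution of Eq. (5).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbf(X, Z, gamma):
    d2 = (X ** 2).sum(1)[:, None] + (Z ** 2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * d2)

def deep_features(X, layers):
    # propagate the data through the k hidden layers to obtain X^k
    H = X
    for W, b in layers:
        H = sigmoid(H @ W + b)
    return H

def hi_dkielm_sketch(X_train, T_train, X_test, hidden_sizes=(20, 20, 20),
                     C=1e4, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    layers, d = [], X_train.shape[1]
    for n in hidden_sizes:
        # placeholder random parameters; in the HI-DKIELM these are the
        # hidden-node parameters optimized by ATL-MPGWO
        layers.append((rng.standard_normal((d, n)), rng.standard_normal(n)))
        d = n
    Xk_tr = deep_features(X_train, layers)
    Xk_te = deep_features(X_test, layers)
    alpha = np.linalg.solve(np.eye(len(Xk_tr)) / C + rbf(Xk_tr, Xk_tr, gamma),
                            T_train)                       # Eq. (5)
    return rbf(Xk_te, Xk_tr, gamma) @ alpha
```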

4 Experiment Analysis

To test the effectiveness and robustness of the proposed method, the optimization performance of the hybrid intelligent algorithm was first evaluated. The specifications of the selected data sets are listed in Table 5. The experiments were run on a PC with Windows 7, an Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40 GHz, and 16 GB of memory.

Table 5 Real UCI data sets

4.1 Testing the Performance of the ATLA

The performance optimization of the ATLA was evaluated using the Schaffer, Michalewicz, and Step functions. The experimental results are summarized in Figs. 3, 4, and 5.

Fig. 3 Performance test results with the Schaffer function: a 3D view, b optimization results, and c optimization trajectory

Fig. 4 Performance test results with the Michalewicz function: a 3D view, b optimization results, and c optimization trajectory

Fig. 5 Performance test results with the Step function: a 3D view, b optimization results, and c optimization trajectory

The experimental results indicate that, in addition to the global extremum, the three test functions contain a large number of local extremum points, which makes them suitable for evaluating the algorithm's optimization ability. The proposed ATLA effectively determined the global optimum in each case: the obtained results were close to the reference positions of the global extrema, demonstrating good optimization ability.
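For reference, the commonly used definitions of these three benchmarks are given below; the exact variants used in the experiments are not restated in this section, so these standard forms (with the usual choice \(m = 10\) for the Michalewicz function) are an assumption.

$$ \begin{aligned} f_{\text{Schaffer}}(x,y) &= 0.5 + \frac{\sin^{2}\left(\sqrt{x^{2}+y^{2}}\right)-0.5}{\left[1+0.001\left(x^{2}+y^{2}\right)\right]^{2}}, \\ f_{\text{Michalewicz}}(\mathbf{x}) &= -\sum_{i=1}^{d}\sin (x_{i})\left[\sin \left(\frac{i\,x_{i}^{2}}{\pi }\right)\right]^{2m}, \\ f_{\text{Step}}(\mathbf{x}) &= \sum_{i=1}^{d}\left\lfloor x_{i}+0.5\right\rfloor^{2}. \end{aligned} $$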

4.2 Performance Test of the Proposed Hybrid Optimization Algorithm

First, the proposed hybrid intelligent optimization algorithm, the ATL-MPGWO algorithm, was compared with the BSO, MPGWO, and DEPSO algorithms [24]. Ten typical optimization functions were selected, as listed in Table 6. In the parameter settings, the dimensions were all set to 30; the value ranges of the solutions were \([-100,100]\) for \(F_{n6}\) and \([-500,500]\) for \(F_{n9}\), and \([-30,30]\) for the remaining eight functions.

Table 6 Common typical optimization functions

According to the optimization results in Table 7, all four optimization methods compared in this paper achieve good results on the \(F_{n2}\) optimization function. For the other nine optimization functions, the proposed ATL-MPGWO hybrid optimization algorithm has higher optimization accuracy than the other three algorithms. The MPGWO algorithm may fall into a local minimum without jumping out of the extremum, whereas the BSO algorithm and the algorithm in [24] have a certain degree of optimization ability and can converge to a fairly accurate solution. In contrast, the ATL-MPGWO hybrid optimization algorithm can repeatedly jump out of local minima during its iterations and has good searching ability.

Table 7 Optimization results of common typical optimization functions

4.3 HI-DKIELM Parameter Settings

In the HI-DKIELM, the number of hidden layers greatly influences the performance of the network. For the Abalone data set, the number of hidden layers was varied from 1 to 6, with 20 nodes in each layer. Each network structure was tested ten times; Fig. 6 shows the relation between the number of hidden layers and the testing accuracy. The testing accuracy does not increase monotonically with the number of layers: the HI-DKIELM algorithm performs best with three hidden layers, and the accuracy declines as further layers are added. Therefore, the number of hidden layers is set to 3 in the following regression and classification tests.

Fig. 6 Comparison of test accuracy with different numbers of hidden layers

In the HI-DKIELM, the kernel function parameter \(\gamma\) and the regularization parameter \(C\) strongly affect the performance of the algorithm and are usually chosen through cross-validation. Their values were varied within \(10^{0} \sim 10^{10}\), and the corresponding test accuracy was evaluated. The test accuracy as a function of \(\gamma\) and \(C\) is plotted as a surface in Fig. 7. When the regularization parameter \(C\) is small, the algorithm performs poorly and its accuracy fluctuates with the kernel parameter. As the regularization parameter increases, the performance fluctuates little and the algorithm reaches its highest accuracy.

Fig. 7 Comparison of test accuracy for different parameters
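A minimal sketch of this cross-validation selection of \(\gamma\) and \(C\) is shown below; the fold count and the scoring callback score_fn (which would train an HI-DKIELM on one split and return its test accuracy) are illustrative assumptions.

```python
import numpy as np
from itertools import product

def grid_search_cv(score_fn, X, y, gammas, Cs, n_folds=5, seed=0):
    """score_fn(X_tr, y_tr, X_te, y_te, gamma, C) -> test accuracy (hypothetical)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    best = (None, None, -np.inf)
    for g, c in product(gammas, Cs):
        scores = []
        for k in range(n_folds):
            te = folds[k]
            tr = np.hstack([folds[j] for j in range(n_folds) if j != k])
            scores.append(score_fn(X[tr], y[tr], X[te], y[te], g, c))
        mean_acc = float(np.mean(scores))
        if mean_acc > best[2]:
            best = (g, c, mean_acc)
    return best  # (best gamma, best C, mean accuracy)

# gamma and C swept over 10^0 .. 10^10 as in the text:
# gammas = Cs = [10.0 ** p for p in range(11)]
```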

4.4 HI-DKIELM Regression Test

In this paper, a regression test was conducted on the HI-DKIELM and the commonly used CI-ELM, EI-ELM, ECI-ELM, and DCI-KELM to evaluate how well the HI-DKIELM generalizes. For the comparison, the number of hidden layer neurons started at 1 and increased by 1 with each training iteration; the number of iterations and hidden layer neurons were identical across all five ELMs. Table 8 compares the test and training errors in the regression test, and Table 9 compares the network complexity and training time.

Table 8 Contrast in test error and training error during the regression test
Table 9 Contrast in network complexity and training time during the regression test

As shown by the experimental data in Table 8, the HI-DKIELM algorithm markedly improves regression accuracy: its training and testing errors are lower than those of the other algorithms. Taking the CCPP database as an example, with RMSE = 0.052 as the error termination condition and the maximum number of hidden layer nodes set to 100, the training error of the HI-DKIELM algorithm was 0.0417 and the testing error was 0.0435, whereas the training error of the DCI-KELM algorithm was 0.0535 and the testing error was 0.0604. In terms of training error and test error, the proposed HI-DKIELM algorithm is clearly better than the others.

Table 9 analyzes the algorithm performance on the regression problem. The experimental data show that the number of nodes required by the HI-DKIELM decreased significantly compared with the other four ELM algorithms. Again taking the CCPP database as an example, when the stop condition was 0.052, the HI-DKIELM algorithm required 27.06 nodes with a training time of 1.0277 s; by contrast, the DCI-KELM algorithm required 45.35 nodes with a training time of 2.0743 s, and the ECI-KELM algorithm required 11.92 nodes with a training time of 3.0178 s. In terms of network complexity and training time, the proposed HI-DKIELM algorithm is therefore superior to the other algorithms.

4.5 HI-DKIELM Classification Test

In this paper, a classification test was conducted on the HI-DKIELM and the commonly used CI-ELM, EI-ELM, ECI-ELM, and DCI-KELM to evaluate how well the HI-DKIELM generalizes. For the comparison, the number of hidden layer neurons started at 1 and increased by 1 with each training iteration; the number of iterations and hidden layer neurons were identical across all five ELMs. Table 10 compares the average values and standard deviations in the classification test, and Table 11 compares the network complexity and training time. The value of the error termination condition RMSE is given in parentheses.

Table 10 Contrast in average value and standard deviation during the classification test
Table 11 Contrast in network complexity and training time during the classification test

As can be seen from the experimental data in Table 10, the classification accuracy of the HI-DKIELM algorithm was significantly improved relative to the other four ELMs. For example, on the Boston Housing data set with RMSE = 0.1 as the error termination condition, the average accuracy and standard deviation of the proposed HI-DKIELM algorithm were 98.21 and 0.0038, whereas those of the DCI-KELM algorithm were 93.01 and 0.0041, and those of the ECI-KELM algorithm were 84.82 and 0.0072, respectively. Based on this analysis, the proposed HI-DKIELM algorithm outperforms the others by a large margin.

Table 11 compares the performance of the different algorithms on the classification problem. The data suggest that the HI-DKIELM performed best among all five algorithms, with a considerably reduced number of network nodes. Again taking the Boston Housing data set as an example, when the RMSE is 0.1, the HI-DKIELM algorithm required only 19.42 nodes and a training time of 0.0772 s, while the DCI-KELM algorithm required 22.06 nodes and 0.0942 s, and the ECI-KELM algorithm required 39.17 nodes and 0.1168 s, respectively. The proposed HI-DKIELM algorithm is thus superior to the other algorithms in terms of both network complexity and training time.

4.6 Summary

The above regression and classification results can be summarized as follows. In the regression test, the training and testing errors of the HI-DKIELM algorithm are significantly better than those of the other algorithms (Table 8), and its number of nodes and training time are much smaller (Table 9). In the classification test, compared with the other four ELM algorithms, the classification accuracy of the HI-DKIELM algorithm is significantly improved (Table 10), and the number of nodes and training time of the network are significantly reduced (Table 11).

Compared with the other algorithms, the HI-DKIELM algorithm uses significantly fewer network nodes, which reduces the network complexity of the ELM and greatly improves its learning efficiency; the resulting network structure is more compact and efficient and has better prediction and generalization capabilities. However, adding deep networks to the algorithm may lead to overfitting.

5 Conclusions

This paper first reviewed the extreme learning machine, then introduced the ATL-MPGWO method and the improved hybrid intelligent extreme learning machine, and finally presented the simulation experiments and conclusions. This study addressed the problems of ineffective iteration and low learning efficiency in the K-ELM caused by redundant nodes in the neural network. A deep kernel incremental extreme learning machine with hybrid intelligence and a deep network structure was established. First, a novel hybrid intelligent optimization algorithm was designed by combining the ATLA and the MPGWO algorithm. The designed algorithm was applied to optimize the hidden layer node parameters, which not only improved network stability but also reduced the network complexity of the ELM and improved the learning efficiency of the model parameters. Second, a deep network structure was applied to the KI-ELM, which extracts the input data layer by layer and improves the classification accuracy and generalization ability. In terms of network complexity and training time, the results of the regression and classification tests show that the proposed HI-DKIELM algorithm has a more compact and effective network structure and better prediction and generalization ability.

Adding a deep network to the hybrid intelligent deep kernel extreme learning machine may, however, lead to overfitting. Determining the appropriate number of network layers will be a direction of future research.