Introduction

As the main method of modern ironmaking, blast furnace ironmaking is the most critical process in the iron and steel industry. As shown in Fig. 1, a blast furnace can be divided into five zones from top to bottom: throat, shaft, belly, bosh, and hearth. In practical production, the thermal state of the blast furnace is a crucial indicator of its output, energy consumption, and service life. However, the blast furnace is a typical black-box system, and its internal thermal state is difficult to measure directly due to the limitations of measurement and sensor technology [1, 2]. Many studies have demonstrated that the silicon content of molten iron can indirectly characterize the blast furnace temperature without loss of information, and its trend can reflect the quality of the molten iron and the operating condition of the blast furnace [3]. High silicon content indicates that the blast furnace is overheating, resulting in unnecessary energy loss [4]. On the contrary, low silicon content indicates a low blast furnace temperature, which severely affects the iron output and often takes weeks to recover [5]. Therefore, it is important to establish a prediction model for the silicon content of molten iron to provide guidance for the actual production of the blast furnace ironmaking process.

Fig. 1
figure 1

Blast furnace ironmaking process

Due to their powerful data mining capability and lower reliance on expert knowledge, data-driven algorithms for silicon content prediction have gradually become the focus of research [4, 6,7,8]. However, many existing works only applied a single model to predict the silicon content [9,10,11], which may not fully satisfy the accuracy requirement due to the complex and high-dimensional nonlinear relationship between silicon content and process variables. Therefore, in recent years, some sophisticated models have been proposed to address this problem. Liu et al. [12] proposed a novel silicon content prediction method called T-HyperGAT, which combined the hypergraph attention network (HyperGAT) and the gated recurrent unit (GRU) network to capture the high-order correlations and time-series characteristics from complex industrial data. Li et al. [13] proposed a context-aware enhanced GRU network with feature-temporal attention to predict silicon content, which allowed for enhanced local awareness and soft alignment of variables. Zhao et al. [14] proposed an ameliorated Moth-Flame optimization (AMFO) algorithm to optimize the parameters of the fast learning network (FLN) to build the prediction model of silicon content. These algorithms aim to achieve better silicon content prediction performance by building more complex models, but with a corresponding increase in computational cost.

Support vector regression (SVR) is a promising machine learning algorithm that can deal with high-dimensional nonlinear prediction problems. However, the hyperparameters of SVR need to be tuned carefully to achieve promising performance. Due to their powerful global search ability, evolutionary algorithms (EAs) have been widely used to search for the optimal hyperparameters of SVR [15,16,17,18] and to solve other industrial parameter optimization tasks [19,20,21] (e.g., Stojanovic et al. proposed the optimal tuning of cascade load force controllers for a parallel robot platform in [20]). However, most existing works [22,23,24,25] only considered the accuracy of SVR and ignored the complexity of the model, which might not be adequate for online prediction tasks in actual industrial processes.

Multiobjective evolutionary ensemble learning (MOEEL) algorithms have gradually become a research focus in various prediction problems [26,27,28,29,30]. MOEEL is suitable for dealing with real industrial data such as silicon content for two main reasons: (1) MOEEL aims to construct an ensemble model based on a series of high-quality base learners in order to better handle the complex and high-dimensional nonlinear correlations between silicon content and process variables; and (2) MOEEL usually takes the complexity and accuracy of base learners as two conflicting optimization objectives to achieve a balance between model accuracy and generalization, which is crucial for online prediction of silicon content in the actual blast furnace ironmaking process. Wang et al. [31] proposed a multiobjective sparse nonlinear ensemble learning with evolutionary feature selection (MOSNE-EFS) to achieve precise prediction of strip hardness. Singh and Singh [32] developed a stacking-based evolutionary ensemble learning system called NSGA-II-Stacking to predict the onset of Type-2 diabetes mellitus (T2DM). Hu et al. [4] proposed a multiobjective evolutionary nonlinear ensemble learning model with an evolutionary feature selection mechanism (MOENE-EFS) for silicon content prediction, which can efficiently select input features and construct the ensemble model in a nonlinear way. However, these works usually adopted traditional multiobjective evolutionary algorithms in training base learners and ignored the balance between local search and global search, so they may fall into local optima with low diversity [31, 33,34,35].

To address the above issues, this paper proposes a two-stage multiobjective evolutionary ensemble learning algorithm for silicon content prediction of blast furnace, in which support vector regression (SVR) is adopted as the base learner. The proposed algorithm can obtain a series of non-dominated solutions with high accuracy and diversity, based on which an ensemble model with good prediction performance and low model complexity can be constructed. The main contributions in this paper are summarized as follows:

  1. (1)

    Unlike traditional methods that only take the prediction accuracy of SVR as the optimization objective, the proposed method optimizes the accuracy and complexity of SVR at the same time, and a prediction model with high accuracy and low model complexity can be obtained.

  2. (2)

    In the first stage, a non-dominated sorting differential evolution with dynamic resource allocation (DRA-NSDE) is proposed to achieve the balance between local and global search through dynamic resource allocation, which results in a series of base learners with high accuracy and diversity.

  3. (3)

    In the second stage, a stacking ensemble method based on clustering and differential evolution (CDE-Stacking) is proposed to select and ensemble the base learners, in which the K-Means clustering is utilized to ensure the diversity of base learners, and the single-objective DE is used to optimize the ensemble weights of selected base learners.

The remainder of this paper is organized as follows: “Related works” section briefly introduces the related works. “The proposed algorithm” section elaborates the framework and details of the proposed two-stage MOEEL algorithm. “Experiments” section consists of the analysis of the proposed DRA-NSDE and CDE-Stacking strategies and the performance analysis of the proposed two-stage MOEEL algorithm on both the UCI benchmark data set and the actual blast furnace data set. Finally, “Conclusion” section concludes this paper and introduces our future work.

Related works

Support vector regression

Unlike traditional regression methods, SVR constructs a fault-tolerant band and makes as many points as possible fall into it. When \(\vert f(x)-y \vert \leqslant \varepsilon \), i.e., the bias between the predicted value and the actual value is within the tolerance range, the prediction is considered correct.

By maximizing the interval and minimizing the empirical risk, SVR can obtain the optimal regression model. The interval-maximization problem can be transformed into a minimization problem by inverting the margin term (i.e., minimizing \(\frac{1}{2}\left\Vert \omega \right\Vert ^2\)) and adding the empirical risk loss term to the objective function, as shown in Eq. (1).

$$\begin{aligned} \min \limits _{\omega ,b} \frac{1}{2}\left\Vert \omega \right\Vert ^2 + C \sum \limits _{i=1}^m l_{\varepsilon }(f(x_i)-y_i) \end{aligned}$$
(1)

where C is the regularization parameter, and \(l_{\varepsilon }\) is the \(\varepsilon \)-insensitive loss function.

$$\begin{aligned} l_{\varepsilon }(z) = {\left\{ \begin{array}{ll} 0, &{} \vert z \vert \leqslant \varepsilon \\ \vert z \vert - \varepsilon , &{} \vert z \vert > \varepsilon \end{array}\right. } \end{aligned}$$

However, setting \(\varepsilon \) by experience cannot ensure that most of the data are located within the interval band. Therefore, slack variables are added for each sample so that some samples are allowed to lie outside the interval band. After introducing the slack variables \(\xi , {\hat{\xi }}\), the optimization objective shown in Eq. (1) is changed to the following form [36]:

$$\begin{aligned}&\min \limits _{\omega ,b,\xi _i,{\hat{\xi }}_i} \frac{1}{2}\left\Vert \omega \right\Vert ^2 + C \sum \limits _{i=1}^m (\xi _i + {\hat{\xi }}_i) \\&\quad \text {s.t.} ~ f(x_i) - y_i \leqslant \varepsilon + \xi _i \\&\quad y_i - f(x_i) \leqslant \varepsilon + \hat{\xi _i} \\&\quad \xi _i \geqslant 0, \hat{\xi _i} \geqslant 0, i=1,2,\ldots ,m \end{aligned}$$

The following Lagrangian function [37] can be obtained by introducing the Lagrangian multipliers \(u_i \geqslant 0, \hat{u_i} \geqslant 0, \alpha _i \geqslant 0, \hat{\alpha _i} \geqslant 0\) to remove the constraints:

$$\begin{aligned} \begin{aligned}&L(\omega , b, \alpha , {\hat{\alpha }}, \xi , {\hat{\xi }}, u, {\hat{u}}) \\&\quad = ~\frac{1}{2} \left\Vert \omega \right\Vert ^2 + C \sum \limits _{i=1}^m (\xi _i + {\hat{\xi }}_i) \\&\qquad - \sum \limits _{i=1}^m u_i \xi _i - \sum \limits _{i=1}^m \hat{u_i} \hat{\xi _i} \\&\qquad + \sum \limits _{i=1}^m \alpha _i (f(x_i) - y_i - \varepsilon - \xi _i) \\&\qquad + \sum \limits _{i=1}^m \hat{\alpha _i} (y_i - f(x_i) - \varepsilon - \hat{\xi _i}) \end{aligned} \end{aligned}$$

Substituting \(f(x) = \omega ^T x + b\) into the Lagrangian function and making the partial derivative of \(\omega ,b,\xi ,{\hat{\xi }}\) equal to zero, the following equations can be obtained:

$$\begin{aligned} {\left\{ \begin{array}{ll} \sum \limits _{i=1}^m (\hat{\alpha _i}-\alpha _i) x_i = \omega \\ \sum \limits _{i=1}^m (\hat{\alpha _i}-\alpha _i) = 0 \\ u_i + \alpha _i = C \\ \hat{u_i} + \hat{\alpha _i} = C \end{array}\right. } \end{aligned}$$

According to KKT Conditions [37]:

$$\begin{aligned} {\left\{ \begin{array}{ll} \alpha _i (f(x_i) - y_i - \varepsilon - \xi _i) = 0 \\ \hat{\alpha _i} (y_i - f(x_i) - \varepsilon - \hat{\xi _i}) = 0 \\ \alpha _i \hat{\alpha _i} = 0 \\ \xi _i \hat{\xi _i} = 0 \\ (C - \alpha _i) \xi _i = 0 \\ (C - \hat{\alpha _i}) \hat{\xi _i} = 0 \\ \end{array}\right. } \end{aligned}$$

the dual problem of SVR can be obtained as follows:

$$\begin{aligned}&\begin{aligned}&\max \limits _{\alpha ,{\hat{\alpha }}} \sum \limits _{i=1}^m \left[ y_i (\hat{\alpha _i} - \alpha _i) - \varepsilon (\hat{\alpha _i} + \alpha _i) \right] \\&\quad - \frac{1}{2} \sum \limits _{i=1}^m \sum \limits _{j=1}^m (\hat{\alpha _i} - \alpha _i) (\hat{\alpha _j} - \alpha _j) x_i^T x_j \end{aligned}\\&\quad \text {s.t.} ~ \sum \limits _{i=1}^m (\hat{\alpha _i} - \alpha _i) = 0 \\&\qquad 0 \leqslant \alpha _i, \hat{\alpha _i} \leqslant C \end{aligned}$$

By solving the above optimization problem, the regression function of SVR can be obtained as Eq. (2).

$$\begin{aligned} f(x) = \sum \limits _{i=1}^m (\hat{\alpha _i} - \alpha _i) x_i^T x + b \end{aligned}$$
(2)

When dealing with the nonlinear regression problem, SVR generally maps x to a high-dimensional space by \(\phi (x)\), so that the nonlinear problem is transformed into an approximately linear regression problem. However, as the number of variables grows, the dimensionality of the feature space grows rapidly, which significantly increases the computational cost. This problem can be solved by introducing a kernel function into SVR to avoid the tremendous inner product computation in the high-dimensional feature space. Commonly used kernel functions include the Gaussian kernel, polynomial kernel, Laplace kernel, and Sigmoid kernel. After introducing a kernel function into SVR, Eq. (2) can be rewritten as:

$$\begin{aligned} \begin{aligned} f(x)&= \omega ^T \phi (x) + b \\&= \sum \limits _{i=1}^m (\hat{\alpha _i} - \alpha _i) \phi (x_i)^T \phi (x) + b \\&= \sum \limits _{i=1}^m (\hat{\alpha _i} - \alpha _i) k(x, x_i) + b \end{aligned} \end{aligned}$$
(3)
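To make the kernel formulation in Eq. (3) concrete, the following minimal sketch (our own illustration, assuming scikit-learn and synthetic data; none of it is from the paper) fits an SVR with the Gaussian (RBF) kernel, where C, gamma and epsilon correspond to the regularization parameter, the kernel width coefficient and the insensitive-loss coefficient, respectively.

```python
# Minimal sketch (not from the paper): fitting a kernel SVR as in Eq. (3) with
# scikit-learn. The data here are synthetic and purely illustrative.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))              # 200 samples, 2 features
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)   # nonlinear target with noise

# C: regularization parameter, epsilon: width of the insensitive band,
# gamma: width coefficient of the Gaussian (RBF) kernel k(x, x_i)
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.5)
model.fit(X, y)

print("number of support vectors:", len(model.support_))
print("prediction for the first sample:", model.predict(X[:1]))
```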

Multiobjective evolutionary ensemble learning

In ensemble learning, the construction and integration of base learners are two important processes. According to the "error-ambiguity decomposition" theory [38], the ensemble performance depends heavily on whether the prediction accuracy and diversity of the base learners are well balanced. Even if base learners with high accuracy and strong diversity are obtained, the ensemble performance will not be satisfactory if an inappropriate ensemble strategy is chosen. Therefore, various ensemble strategies have been proposed, such as AdaBoost [39] and Random Forest [40], which have specific steps to build the ensemble model. In multiobjective evolutionary algorithms (MOEAs), multiple Pareto optimal solutions can be obtained in a single run, and the corresponding non-dominated solution set can meet the accuracy and diversity requirements of ensemble learning. Therefore, multiobjective evolutionary ensemble learning (MOEEL) is well suited to the silicon content prediction problem of the blast furnace [4].

Algorithm 1
figure d

Framework of MOEEL

The framework of MOEEL is shown in Algorithm 1, which includes two stages. In the first stage (lines 1–3), the base learners are encoded as solutions in the population, and then a suitable multiobjective evolutionary algorithm is used to search for base learners with high accuracy and strong diversity. To achieve this purpose, we need to design the encoding scheme, define objective functions for base learners, and choose appropriate genetic operators as well as the environment selection method. The second stage (lines 4–5) is to select appropriate base learners and then integrate them in some way to construct the final ensemble model.

Fig. 2
figure 2

Schematic of the solution encoding and decoding

The proposed algorithm

Algorithm overview

Algorithm 2 outlines the overall framework of the proposed two-stage MOEEL algorithm, where the contributions are highlighted in italic. In the first stage (lines 1–6), the initial population is generated and evaluated according to the proposed solution encoding scheme and two objective functions. Then the non-dominated solution set consisting of base learners with high accuracy and strong diversity is obtained based on the proposed DRA-NSDE algorithm. In the second stage (line 7), the obtained base learners are integrated through the proposed CDE-Stacking method to construct the ensemble model with high accuracy and generalization ability for silicon content prediction.

Algorithm 2
figure e

Framework of the Proposed Algorithm

Encoding and decoding

As shown in Fig. 2, a solution in the population is encoded as a vector with two parts: the feature selection part and the hyperparameter part. The feature selection part is a binary vector whose length equals the total number of features. In this binary vector, 1 indicates that the feature is selected to construct the base learner, while 0 indicates that the feature is not selected. The hyperparameter part is a real-valued vector containing three genes, which represent three hyperparameters of SVR, i.e., the regularization parameter C, the width coefficient \(\gamma \) of the Gaussian kernel function, and the insensitive loss coefficient \(\varepsilon \). Please note that the upper and lower bounds of the hyperparameter part are determined by pre-experiments, i.e., multiple experiments on a single SVR model to roughly determine the range of values of C, \(\gamma \) and \(\varepsilon \).

Table 1 Upper bounds and lower bounds of parameters in the hyperparameter part

The solution needs to be decoded to get its objective function values. First, the features used to build the base learner are determined from the feature selection part, and the corresponding training data can be extracted from the blast furnace dataset. Then the three SVR hyperparameters are obtained from the hyperparameter part, and the SVR model can be constructed and trained. After the training is completed, the SVR model is evaluated on the validation set, and the evaluation results are taken as the objective function values of this solution.

As noted above, the upper and lower bounds of the hyperparameter part are determined by pre-experiments, in which several runs on a single SVR model roughly determine the value ranges of C, \(\gamma \) and \(\varepsilon \). The resulting upper and lower bounds are shown in Table 1.
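As an illustration of the encoding and decoding procedure above, the sketch below (our own reading, not the authors' code; the feature count and SVR settings are assumptions) decodes a mixed binary/real solution vector into a feature subset and a trained SVR base learner.

```python
# Illustrative decoding of one solution vector (not the authors' code): the
# first N_FEATURES genes are the binary feature-selection part, the last three
# genes are the SVR hyperparameters C, gamma and epsilon (ranges as in Table 1).
import numpy as np
from sklearn.svm import SVR

N_FEATURES = 40   # number of candidate process variables (assumed here)

def decode(solution, X_train, y_train):
    mask = solution[:N_FEATURES].astype(bool)   # 1 = feature selected, 0 = discarded
    C, gamma, eps = solution[N_FEATURES:]       # real-encoded hyperparameter part
    model = SVR(kernel="rbf", C=C, gamma=gamma, epsilon=eps)
    model.fit(X_train[:, mask], y_train)        # train only on the selected features
    return model, mask
```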

Objective functions

In the silicon content prediction problem of blast furnace, the prediction accuracy and the model complexity are two conflicting optimization objectives to guide the evolutionary search direction of the algorithm.

Accuracy

The first objective is to maximize the prediction accuracy of the model. In this paper, the root mean square error (RMSE) is used as the accuracy metric, which is defined as Eq. (4).

$$\begin{aligned} \text {RMSE} = \sqrt{\frac{1}{L} \sum _{i=1}^{L}({\hat{y}}_i - y_i)^2} \end{aligned}$$
(4)

where L is the number of samples, \({\hat{y}}_i\) is the predicted value for the ith sample, and \(y_i\) is the true value of the ith sample. It should be noted that a smaller RMSE indicates a higher accuracy.

Complexity

The second objective is to minimize the complexity of the model. In this paper, the number of support vectors in the SVR model is used as the complexity metric, which is defined as Eq. (5).

$$\begin{aligned} \text {CMPLX} = N_{support} \end{aligned}$$
(5)

where CMPLX stands for model complexity and \(N_{support}\) is the number of support vectors in the SVR model.
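Given a trained base learner and its feature mask (e.g., from the decoding sketch above), the two objective values can be computed as in the minimal illustration below, assuming a scikit-learn SVR.

```python
# Illustrative evaluation of the two objectives for a decoded solution:
# RMSE on the validation set (Eq. (4)) and the number of support vectors (Eq. (5)).
import numpy as np

def objectives(model, mask, X_val, y_val):
    y_hat = model.predict(X_val[:, mask])
    rmse = np.sqrt(np.mean((y_hat - y_val) ** 2))   # Eq. (4), to be minimized
    cmplx = len(model.support_)                     # Eq. (5), to be minimized
    return rmse, cmplx
```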

Mutation and crossover

The mutation operator used in the proposed algorithm is DE/rand/1/bin [41]. In DE/rand/1/bin, for each target vector \(x_i\), its associated mutant vector \(v_i\) can be generated by Eq. (6).

$$\begin{aligned} v_i = x_{r_1} + F(x_{r_2} - x_{r_3}) \end{aligned}$$
(6)

where \(x_{r_1}, x_{r_2}, x_{r_3}\) are three randomly selected solutions from the population, and \(r_1, r_2, r_3\) are mutually exclusive integers randomly generated within the range [1, N] (N is the population size). The scaling factor F is a positive control parameter for scaling the difference vector.

Since a solution in the proposed algorithm is encoded as a vector that contains both a binary part and a real part, some necessary modifications have to be made to the original DE mutation operator. The mutation of the hyperparameter part (real-encoded) between two solutions follows Eq. (6), while the mutation of the feature selection part (binary-encoded) is performed as follows [4]:

  1. Step 1:

    Apply an Exclusive-OR operation on the feature selection part of \(x_{r_2}\) and \(x_{r_3}\) to obtain a binary vector \(x_{temp}\).

  2. Step 2:

    Generate a random number R between [0, 1) for each gene in the feature selection part of \(x_{temp}\) if its value is 1. Compare this random number with the difference weight F. If \(R<F\), the binary number of this gene is changed from 1 to 0, otherwise no operation is performed.

  3. Step 3:

    Perform OR operation on the feature selection part of the obtained \(x_{temp}\) and \(x_{r_1}\) to obtain the mutant vector \(v_i\).

According to DE/rand/1/bin, the trial vector \(u_i\) can be obtained by the crossover operation on the mutant vector \(v_i\) and the target vector \(x_i\), as shown in Eq. (7).

$$\begin{aligned} u_{i,j} = {\left\{ \begin{array}{ll} v_{i,j}, &{} \text {if} ~ r_j < CR ~ \text {or} ~ j=j_{rand} \\ x_{i,j}, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(7)

where \(u_{i,j}\) denotes the jth gene of the ith trial vector, \(r_j\) is a random number in [0, 1), CR is the crossover probability, and \(j_{rand}\) is a randomly selected gene index that guarantees at least one gene of the trial vector is inherited from the mutant vector. Note that, compared with the original DE algorithm, there is no selection operation after the mutation and crossover operations. Instead, the trial vector \(u_i\) is regarded as an offspring solution and added to the offspring population directly.
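The sketch below illustrates how one trial vector can be generated under the mixed encoding, following Steps 1-3 and Eqs. (6)-(7); it is our own illustration rather than the authors' implementation, and the values of F and CR are placeholders.

```python
# Illustrative DE/rand/1/bin variation for the mixed encoding (binary feature
# part + real hyperparameter part), following Steps 1-3 and Eqs. (6)-(7).
import numpy as np

rng = np.random.default_rng()

def mutate_and_crossover(x_i, x_r1, x_r2, x_r3, n_bin, F=0.5, CR=0.9):
    # mutation of the binary feature-selection part (Steps 1-3)
    x_temp = np.logical_xor(x_r2[:n_bin], x_r3[:n_bin]).astype(int)  # Step 1: XOR
    flip = (x_temp == 1) & (rng.random(n_bin) < F)                   # Step 2: R < F
    x_temp[flip] = 0
    v_bin = np.logical_or(x_temp, x_r1[:n_bin]).astype(float)        # Step 3: OR

    # mutation of the real hyperparameter part, Eq. (6)
    v_real = x_r1[n_bin:] + F * (x_r2[n_bin:] - x_r3[n_bin:])

    # binomial crossover with the target vector, Eq. (7)
    v = np.concatenate([v_bin, v_real])
    j_rand = rng.integers(len(v))        # guarantees at least one gene from v
    take_v = rng.random(len(v)) < CR
    take_v[j_rand] = True
    return np.where(take_v, v, x_i)      # trial vector u_i
```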

DRA-NSDE

A variety of crossover operators have been designed for the differential evolution (DE) algorithm, each with a different search behavior. For example, the DE/rand/1/bin operator focuses on maintaining good diversity and therefore converges relatively slowly, while the DE/best/1/bin operator converges faster but tends to lose good diversity. To achieve a good balance between exploration and exploitation, the dynamic allocation strategy of computational resources is proposed in the DRA-NSDE. The algorithm details are presented in Algorithm 3.

Algorithm 3
figure f

DRA-NSDE

Fig. 3
figure 3

Local search and global search (\(K=3\))

The schematic of local search and global search in Algorithm 3 is shown in Fig. 3. First, the K-Means clustering algorithm is performed on the current population P in the objective space, and the population P will be divided into K sub-populations, i.e., \({\mathbb {P}}=\{P_1,P_2,\ldots ,P_K\}\). Then we get the local search ratio \(\rho _L\) and the global search ratio \(\rho _G\) in the current evolution process, where \(\rho _L\) is used to calculate the number of offspring solutions \(N_{L}^k\) generated only from sub-population \(P_k (k=1,\ldots ,K)\), and \(\rho _G\) is used to calculate the number of offspring solutions \(N_{G}\) generated from the whole population P (see Eqs. (8) and (9)). In the following local search process, for each sub-population \(P_k\) in set \({\mathbb {P}}\), we first select three different solutions to perform the mutation and crossover operation described in “Mutation and crossover” section to generate the offspring solution \(u_i\), and then add it to the offspring population Q. This process will be repeated until \(N_{L}^k\) offspring solutions are generated and added to Q. After the local search is completed on all sub-populations, the global search begins subsequently. Different from the local search, the three random solutions are selected from the whole population P. The subsequent mutation and crossover operations are consistent with the local search. This process will also be repeated until \(N_{G}\) offspring solutions are generated and added to Q. After the offspring population Q is obtained, Q will be combined with the current population P to get the joint population. Based on the non-dominated ranking and crowding distance [42], the environment selection operation is performed on the joint population to obtain a new population for the next generation.

$$\begin{aligned} N_{L}^k = \rho _L * \vert P_k \vert \end{aligned}$$
(8)
$$\begin{aligned} N_{G} = \rho _G * \vert P \vert \end{aligned}$$
(9)

To achieve the dynamic resource allocation, the algorithm needs to maintain two variables, i.e., the local search ratio \(\rho _L\) and the global search ratio \(\rho _G\), during the evolution process. At the beginning of the algorithm, these two variables are set to be equal, i.e., \(\rho _L = \rho _G = 0.5\), so that each of the two search strategies generates half of the solutions in the offspring population Q. After the environment selection, a new population that will enter the next generation is obtained, and the following four numbers can be counted: \(ns_L, nf_L, ns_G, nf_G\). Among all the offspring solutions generated by the local search, the number of solutions that successfully enter the next generation is counted as \(ns_L\), while the number of solutions that fail to enter the next generation is counted as \(nf_L\). Similarly, among all the offspring solutions generated by the global search, the number of solutions that successfully enter the next generation is counted as \(ns_G\) and the number of solutions that fail to enter the next generation is counted as \(nf_G\). These four numbers are accumulated over LP (learning period) generations, after which the two ratios \(\rho _L\) and \(\rho _G\) are updated according to Eqs. (10) and (11) [43].

$$\begin{aligned} \rho _L = \frac{ns_L}{ns_L + nf_L} \end{aligned}$$
(10)
$$\begin{aligned} \rho _G = \frac{ns_G}{ns_G + nf_G} \end{aligned}$$
(11)

The above equations denote the success rates of the two search strategies in a given learning period. After the initial LP generations, the local search ratio \(\rho _L\) and the global search ratio \(\rho _G\) will be dynamically updated in each generation, so that the computational resources can be allocated dynamically at different stages of the evolution. In this way, the algorithm can avoid falling into a local optimum and maintain good diversity. Note that after each update of the two search strategy ratios, the success number ns and failure number nf recorded in the earliest generation need to be cleared from the archive, so that the update of the two ratios relies only on the most recent LP generations, thus eliminating the potential negative impact of the early stage of the evolution.
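A compact sketch of the sliding-window ratio update (Eqs. (10)-(11)) is given below; the deque-based archive is our own way of keeping only the most recent LP generations and is not necessarily the authors' exact bookkeeping.

```python
# Illustrative sliding-window update of the search ratios (Eqs. (10)-(11)).
from collections import deque

class SearchRatioUpdater:
    def __init__(self, LP=10):
        self.history = deque(maxlen=LP)   # keeps only the most recent LP generations

    def update(self, ns_L, nf_L, ns_G, nf_G):
        """Record this generation's counts and return the updated (rho_L, rho_G)."""
        self.history.append((ns_L, nf_L, ns_G, nf_G))
        if len(self.history) < self.history.maxlen:
            return 0.5, 0.5               # equal ratios before the first LP generations
        sL, fL, sG, fG = (sum(col) for col in zip(*self.history))
        rho_L = sL / (sL + fL) if (sL + fL) > 0 else 0.5   # Eq. (10)
        rho_G = sG / (sG + fG) if (sG + fG) > 0 else 0.5   # Eq. (11)
        return rho_L, rho_G
```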

After the evolution process is completed, a fast non-dominated sorting [42] is performed on the population in the last generation. Then all solutions on the first non-dominated layer are extracted to obtain the final non-dominated solution set NDSet.

In the proposed DRA-NSDE algorithm, we divide the population into different local areas based on K-Means clustering, and then dynamically allocate offspring between the local and global search strategies based on their success rates to achieve dynamic resource allocation. This alleviates the difficulty of selecting target vectors in the traditional NSDE algorithm, improves the diversity of the algorithm, and enables the algorithm to escape local optimal regions.

CDE-stacking

After the non-dominated solution set is obtained in the first stage, we need to further ensemble these solutions to construct the ensemble model for silicon content prediction. In ensemble learning, the key to a successful ensemble lies in the accuracy and diversity of the selected base learners, two properties that can conflict with each other [4]. In stacking-based ensemble learning, the ensemble weights also have a significant influence on the ensemble effectiveness. The simple averaging method assigns the same ensemble weight to each base learner, which ignores the differences between base learners. Therefore, a clustering and differential evolution-based stacking ensemble (CDE-Stacking) method is proposed in this paper. Based on K-Means clustering, the proposed CDE-Stacking first selects a series of "good and different" base learners, i.e., a series of SVRs with high accuracy and strong diversity, from the non-dominated solution set obtained in the first stage. Then a single-objective differential evolution algorithm is used to optimize the ensemble weights of the selected base learners, based on which the selected base learners are integrated to construct the ensemble model for silicon content prediction. The details of the proposed CDE-Stacking algorithm are shown in Algorithm 4.

Algorithm 4
figure g

CDE-Stacking

Fig. 4
figure 4

Schematic of CDE-Stacking (\(K_e=5, n=4\))

In Algorithm 4, there are two key hyperparameters, i.e., the number of clusters \(K_e\) and the number of non-dominated solutions n selected from each cluster. Figure 4 shows the schematic of the proposed CDE-Stacking when \(K_e=5, n=4\). Once the base learners are randomly selected from each cluster, the output of the ensemble model can be obtained by computing the weighted sum of all base learners' outputs. These weights are optimized by a single-objective DE algorithm, where each solution W in the population \(\{W_1, W_2, \ldots , W_{N_e}\}\) is a real-encoded vector and each gene represents a base learner's ensemble weight. The DE algorithm aims to minimize the prediction error of the ensemble model, which is measured by the RMSE metric in Eq. (4). The evolution process is based on the standard DE/rand/1/bin. After the evolution, the global best solution \(W^*\) is obtained and decoded to integrate all the selected base learners in \({\mathbb {L}}\) into the final ensemble model for silicon content prediction. As shown in Fig. 4, the proposed CDE-Stacking method selects solutions at different locations on the Pareto front by clustering the population into sub-populations, ensuring variability in the combinations of accuracy and complexity among the selected base learners. In general, reducing the complexity of the SVR base learners makes it easier to improve their generalization ability. Therefore, using the proposed CDE-Stacking method to construct the ensemble model can achieve a balance between accuracy and generalization ability.
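The following sketch outlines the second stage under our own simplifying assumptions: scikit-learn's K-Means performs the clustering in the objective space, SciPy's differential_evolution stands in for the single-objective DE/rand/1/bin used in the paper, and the weight normalization is our own illustrative choice.

```python
# Illustrative second stage: K-Means selection of base learners in the objective
# space, then DE optimization of their ensemble weights (SciPy's DE stands in
# for the paper's single-objective DE/rand/1/bin).
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import differential_evolution

def cde_stacking(objs, preds_val, y_val, K_e=5, n=4, seed=0):
    """objs: (N, 2) objective values of the non-dominated solutions;
    preds_val: (N, L) validation predictions of each base learner."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=K_e, n_init=10, random_state=seed).fit_predict(objs)

    # randomly pick n base learners from each cluster (assumes each cluster has >= n members)
    selected = np.concatenate([
        rng.choice(np.flatnonzero(labels == k), size=n, replace=False)
        for k in range(K_e)
    ])

    def ensemble_rmse(w):
        w = w / (w.sum() + 1e-12)                        # normalize weights (illustrative)
        y_hat = w @ preds_val[selected]                   # weighted sum of base outputs
        return np.sqrt(np.mean((y_hat - y_val) ** 2))     # RMSE, Eq. (4)

    result = differential_evolution(ensemble_rmse, bounds=[(0.0, 1.0)] * len(selected))
    return selected, result.x / (result.x.sum() + 1e-12)  # chosen learners and weights
```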

Experiments

In this section, the proposed two-stage MOEEL algorithm and its two main contributions, namely DRA-NSDE and CDE-Stacking, were tested through a series of experiments. The effectiveness of DRA-NSDE and CDE-Stacking was first verified on 14 multi-objective benchmark problems and a UCI benchmark dataset, respectively. Subsequently, the proposed two-stage MOEEL algorithm was evaluated on both the UCI benchmark dataset and the actual blast furnace dataset, and compared with several other powerful machine learning algorithms. All experiments were conducted independently multiple times to eliminate the effects of randomness. To determine the statistical significance, the Wilcoxon rank sum test with a significance level of 0.05 was used to compare our proposed algorithm with the other rivals. The symbols "(+)" and "(-)" indicate whether our proposed algorithm is significantly better or worse than a rival, while "(=)" indicates that there is no statistically significant difference between them.

Validation of DRA-NSDE

To verify the effectiveness of the proposed DRA-NSDE algorithm in the first stage, 14 multiobjective benchmark test problems were used, including ZDT1-ZDT4, ZDT6 [44], UF1, UF2 [45] and DTLZ1-DTLZ7 [46]. The diversity and convergence of the approximate PFs obtained by multi-objective evolutionary algorithms are two important factors for measuring algorithm performance [21]. Due to their ability to measure both the convergence and the diversity of the algorithm, the inverted generational distance (IGD) [47] and the hypervolume (HV) [48] are used as performance evaluation indicators. The IGD metric is defined as follows:

$$\begin{aligned} IGD = \frac{\sum _{j=1}^{\vert P^* \vert } d_{min}(x_j^*, P)}{\vert P^* \vert } \end{aligned}$$
(12)
$$\begin{aligned} d_{min}(x_j^*, P) = \min \left\{ \sqrt{\sum _{k=1}^{m} \left( f_k(x_i) - f_k(x_j^*)\right) ^2},~i = 1, 2, \ldots , \vert P \vert \right\} \end{aligned}$$
(13)

where P is the non-dominated solution set obtained by the algorithm and \(P^*\) is the true Pareto optimal set. \(\vert P \vert \) and \(\vert P^* \vert \) represent the numbers of solutions in the two sets, respectively. \(x_i\) is the ith solution in P, and \(x_j^*\) denotes the jth solution in \(P^*\). m is the number of objectives, and \(f_k(x)\) represents the kth objective value of solution x. Given a reference point \(r^* = (r_1^*, r_2^*, \ldots , r_m^*)\), the HV metric is defined as

$$\begin{aligned} HV(P) = VOL \left( \bigcup _{x \in P} [f_1(x), r_1^*] \times \cdots \times [f_m(x), r_m^*] \right) \end{aligned}$$
(14)

where \(VOL(\cdot )\) is the Lebesgue measure. A larger HV value indicates better algorithm performance.
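For reference, the IGD metric of Eqs. (12)-(13) can be computed with plain NumPy as sketched below; dedicated libraries also provide IGD and HV implementations, and the HV computation is omitted here.

```python
# Plain NumPy computation of the IGD metric (Eqs. (12)-(13)).
import numpy as np

def igd(P, P_star):
    """P: (n, m) objective vectors obtained by the algorithm;
    P_star: (n_star, m) sampled points of the true Pareto front."""
    # distance from each reference point to its nearest obtained solution
    d = np.linalg.norm(P_star[:, None, :] - P[None, :, :], axis=2)  # (n_star, n)
    return d.min(axis=1).mean()
```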

Sensitivity test of parameter K

First, the sensitivity of the core parameter, i.e., parameter K of the K-Means clustering, is analyzed. Four different values of K, i.e., {2, 3, 4, 5}, are tested on the 14 benchmark problems. The population size is set to 100, and the maximum generation is set to 300 on the ZDT problems and 500 on the UF and DTLZ problems. For each problem, 30 independent experiments are conducted and the statistical results (mean and standard deviation) of IGD are recorded. This experimental setting is also adopted in the following experiments. Note that the standard deviation results are shown in round brackets, and the best result for each problem is shown in italics. The experimental results are shown in Table 2.

Table 2 Sensitivity test on parameter K in the proposed DRA-NSDE
Fig. 5
figure 5

Sensitivity test on parameter K in the proposed DRA-NSDE

According to Table 2, it is clear that the proposed DRA-NSDE algorithm with \(K=3\) outperforms the other settings for most test problems. The mean value of IGD over the 30 independent experiments is visualized in Fig. 5. Intuitively, DRA-NSDE with \(K=3\) achieves lower IGD values for most problems. It should be noted that although DRA-NSDE with \(K=3\) does not achieve the best IGD values on some test problems such as ZDT1 and ZDT6, the difference between this setting and the best-performing setting is not significant. A more in-depth analysis suggests that an excessively large cluster number (i.e., \(K=4\) or \(K=5\)) leads to insufficient non-dominated solutions being assigned to each sub-population. As a result, the quality of the offspring solutions generated by the local search strategy may deteriorate, which slows down the convergence within the sub-population and ultimately affects the overall performance of the algorithm. Conversely, if the cluster number is too small (i.e., \(K=2\)), the clustering-based dynamic resource allocation mechanism cannot be fully utilized. Therefore, based on these results, we use \(K=3\) as the default hyperparameter setting for the following experiments.

Comparison with other DE strategies

In this section, we will analyze the performance of the DRA-NSDE algorithm with different mutation strategies, namely DE/rand/1/bin, DE/current-to-rand/bin, DE/best/1/bin, and DE/current-to-best/bin. The first two strategies focus on improving the algorithm diversity, while the last two strategies prioritize the algorithm convergence. The comparison results are presented in Tables 3 and 4.

Table 3 Comparison of IGD and HV between different DE strategies
Table 4 Wilcoxon’s rank sum test analysis of different DE strategies in terms of IGD and HV metrics on 14 benchmark test problems

According to Table 3, it is evident that DE/rand/1/bin outperforms the other three DE strategies in both IGD and HV metrics for most test problems. Specifically, when compared to DE/best/1/bin and DE/current-to-best/bin, DE/rand/1/bin achieves the best IGD and HV values for all ZDT and UF problems. The main reason can be analyzed as follows. The clustering and dynamic resource allocation help to ensure good convergence of the NSDE algorithm. Since both DE/best/1/bin and DE/current-to-best/bin make full use of the best solution in the population, which further accelerates convergence, their interaction may lead to premature convergence and a lack of diversity. On the contrary, DE/rand/1/bin employs three randomly selected solutions rather than the target vector to generate the mutant vector, which improves diversity. Therefore, the cooperation between the good convergence of the clustering and dynamic resource allocation and the good diversity of DE/rand/1/bin can achieve a better balance between exploration and exploitation. However, DE/rand/1/bin is not the most effective strategy for DTLZ2, DTLZ4, and DTLZ5, indicating that no single strategy can perform optimally on all problems. Table 4 shows the statistical analysis results of DE/rand/1/bin and the other DE strategies under the Wilcoxon rank sum test on the IGD and HV metrics. The results indicate that DE/rand/1/bin significantly outperforms the other three DE strategies on most test problems. Therefore, we choose DE/rand/1/bin as the DE strategy for our proposed DRA-NSDE.

Comparison with other MOEAs

In this section, we compare the performance of the proposed DRA-NSDE with four other powerful MOEAs: NSGA-II [42], NSDE [49], MO-ES [50] and MO-CMA-ES [51]. NSGA-II is a popular multiobjective optimization algorithm proposed by Kalyanmoy Deb in 2002. NSDE is a DE-based algorithm that uses non-dominated sorting to generate the next-generation population. MO-ES is an evolution strategy designed for solving multiobjective optimization problems. MO-CMA-ES is the multiobjective version of CMA-ES, a highly effective stochastic optimization algorithm that adaptively estimates and updates the covariance matrix of the search distribution to efficiently find a global optimum in continuous and high-dimensional problem spaces. All algorithms were tested on the 14 benchmark problems over 30 independent experiments, with the parameter settings consistent with the previous section.

Table 5 Comparison of IGD and HV between DRA-NSDE and rivals
Table 6 Wilcoxon’s rank sum test analysis of DRA-NSDE with rivals in terms of IGD and HV metrics on 14 benchmark test problems

The comparison results are presented in Tables 5 and 6. According to the results, it can be found that the proposed DRA-NSDE outperforms the other four rival algorithms in terms of both IGD and HV metrics on 11 out of 14 test problems, which demonstrates the superiority of DRA-NSDE. It is worth noting that DRA-NSDE is unable to achieve the best results on DTLZ3 and DTLZ4, whereas traditional algorithms such as NSGA-II and NSDE perform better on these two problems due to their powerful global search abilities. In addition, MO-CMA-ES gives a better result than our DRA-NSDE on ZDT6, which suggests that MO-CMA-ES may be more suitable for non-convex problems.

Figure 6 illustrates the Pareto fronts obtained by the five algorithms on three typical problems: ZDT4, UF1, and DTLZ2. The results indicate that the proposed DRA-NSDE algorithm not only obtains a Pareto front closer to the true Pareto front, but also maintains good diversity. In the case of ZDT4, NSDE, MO-ES, and MO-CMA-ES failed to converge to the true Pareto front, and NSGA-II lost diversity during the evolution process. On the contrary, the Pareto front obtained by the proposed DRA-NSDE algorithm almost entirely covered the true Pareto front. The main reason is that the proposed dynamic resource allocation mechanism adaptively adjusts the ratio of local and global search in each generation based on their success rates, so that a good balance between exploration and exploitation can be achieved. To give insight into the proposed dynamic resource allocation mechanism, we further illustrate the variation curves of the local search ratio and the global search ratio when solving the above three typical problems. As shown in Fig. 7, in the early stage of the evolution, the local search ratio is higher than the global search ratio because the population is relatively dispersed at this stage and the algorithm therefore focuses more on convergence speed (more local search accelerates the convergence of the algorithm). As the evolution proceeds, the global search ratio gradually exceeds the local search ratio because the population may lose diversity during evolution, and consequently the algorithm shifts its focus to maintaining good diversity by increasing the global search ratio. The dynamic allocation mechanism allows the algorithm to flexibly adopt different search strategies at different stages of the evolution, thus enabling the algorithm to strike a balance between exploration and exploitation.

Fig. 6
figure 6

Comparison of Pareto front obtained by different algorithms

Fig. 7
figure 7

The variation curves of local search ratio and global search ratio

From the above experimental results and analysis, we can conclude that the proposed DRA-NSDE method achieves superior or competitive results on most of the test problems. This ensures the stability of DRA-NSDE in generating a range of high-quality solutions to participate in the next stage and construct an ensemble model that is expected to achieve promising prediction performance.

Validation of CDE-stacking

In order to verify the effectiveness of the proposed CDE-Stacking method in selecting non-dominated solutions and constructing the ensemble model in the second stage, CDE-Stacking is compared with its DE-free version (denoted as C-Stacking) and its clustering-free version (denoted as DE-Stacking) after the same evolutionary process in the first stage. Please note that all three versions start from the same non-dominated solution set obtained by the proposed DRA-NSDE algorithm, so that a fair and convincing comparison can be made. In C-Stacking, instead of optimizing the ensemble weights by DE, the ensemble weight of each solution is set to \(\frac{1}{n*K_e}\), where \(n*K_e\) is the total number of selected solutions. In DE-Stacking, the solution selection process based on K-Means clustering is removed; all solutions in the non-dominated set are selected, and their ensemble weights are optimized by DE to construct the ensemble model.

Table 7 Comparison between different solution ensemble methods
Fig. 8
figure 8

Box plots of different solution ensemble methods

Fig. 9
figure 9

Sensitivity test on parameter \(K_e\) in the proposed CDE-Stacking

The data set used in this subsection, i.e., the SkillCraft1 Master Table Dataset [52], is obtained from the UCI Machine Learning Repository. This data set contains 3395 samples, and each sample has 20 original attributes. After discarding the first attribute GameID and handling the missing values, a set of 3338 samples with 18 input features and 1 target feature (LeagueIndex) is obtained. Then, the obtained data set is normalized according to the following method:

$$\begin{aligned} {\textbf{x}}_{norm} = \frac{{\textbf{x}} - {\textbf{x}}_{min}}{{\textbf{x}}_{max} - {\textbf{x}}_{min}} \end{aligned}$$
(15)

In prediction problems, RMSE and R\(^2\) are two commonly used metrics for evaluating algorithm performance. RMSE is the root mean square error, which evaluates the model prediction accuracy as shown in Eq. (4), while R\(^2\) is the goodness of fit of the model, which is defined as follows:

$$\begin{aligned} R^2(y, {\hat{y}}) = 1 - \frac{\sum _{i=1}^{L}(y_i-{\hat{y}}_i)^2}{\sum _{i=1}^{L}(y_i-{\bar{y}})^2} \end{aligned}$$
(16)

where \({\hat{y}}_i\) is the predicted value for the ith sample, \(y_i\) is the corresponding true value, and \({\bar{y}}\) is the mean of all true values \(y_i\).
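Both metrics are standard; a minimal sketch of how they can be computed with scikit-learn (the sample values are purely illustrative) is given below.

```python
# Computing RMSE (Eq. (4)) and R^2 (Eq. (16)) with scikit-learn.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 4.5, 2.0, 5.0])   # illustrative target values
y_pred = np.array([3.2, 4.1, 2.3, 4.8])

rmse = mean_squared_error(y_true, y_pred) ** 0.5   # Eq. (4)
r2 = r2_score(y_true, y_pred)                      # Eq. (16)
print(rmse, r2)
```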

In the following experiments, three versions of ensemble method, i.e., C-Stacking, DE-Stacking, and CDE-Stacking, are tested on the preprocessed SkillCraft1 data set. The population size in DE-Stacking and CDE-Stacking is set to 100, and their maximum number of generations is set to 500. The number of clusters \(K_e\) is set to 5, and the number of solutions selected from each cluster n is set to 4. All results are obtained through 15 independent experiments.

The computational results are shown in Table 7 and Fig. 8. It can be seen that CDE-Stacking achieves the best and most stable results in both RMSE and R\(^2\) among the three methods, which means that CDE-Stacking is significantly better and more robust than C-Stacking and DE-Stacking. The clustering-based solution selection process further improves the diversity of the ensemble model by randomly selecting solutions in sub-populations, while the DE-based optimization of ensemble weights can utilize solutions at different locations of the Pareto front, thus improving the accuracy of the ensemble model. Therefore, the above results verify the effectiveness of the proposed CDE-Stacking method.

Table 8 Parameter setting of the proposed algorithm
Table 9 Comparison between different prediction models in SkillCraft1 data set

We further analyze the sensitivity of the parameter \(K_e\) in the proposed CDE-Stacking method. The values of \(K_e\) are set as \(K_e \in \{2, 4, 5, 10\}\). To keep the total number of selected solutions the same, i.e., 20, the parameter n (the number of solutions selected from each cluster) is set as \(n \in \{10, 5, 4, 2\}\), respectively. The other parameters are the same as in the above settings. Figure 9 gives the R\(^2\) and RMSE results obtained with different values of \(K_e\). It shows that the performance of the proposed algorithm first improves and then deteriorates as the number of clusters increases. This indicates that when \(K_e\) increases, more sub-populations at different locations of the Pareto front are generated, so that the diversity of the selected solution set improves. But when \(K_e\) becomes too large, i.e., 10, fewer high-accuracy solutions will be included in the selected solution set, which in turn deteriorates the performance of the algorithm. Therefore, the setting of \(K_e\) should balance the diversity and the accuracy of the selected solution set. The sensitivity results indicate that the proposed CDE-Stacking method works well when the number of clusters \(K_e\) is set to 5.

UCI benchmark test

To verify the effectiveness of the proposed two-stage MOEEL algorithm, a series of numerical experiments are conducted on the same UCI benchmark data set described in “Validation of CDE-stacking” section, i.e., the SkillCraft1 data set. For the purpose of model comparison, we introduce 8 rival prediction models including 2 single machine learning models, 3 tree-based non-linear ensemble learning models, and 3 SVR-based non-linear ensemble learning models.

  1. (1)

    MLP: Multi-layer perceptron regressor;

  2. (2)

    SVR: Support vector regression model;

  3. (3)

    Bagging: Bagging regressor with decision tree as its base-learner;

  4. (4)

    AdaBoost: AdaBoost regressor with decision tree as its base-learner;

  5. (5)

    RF: Random forest regressor with decision tree as its base-learner;

  6. (6)

    Bagging-SVR: Bagging regressor with SVR as its base-learner;

  7. (7)

    AdaBoost-SVR: AdaBoost regressor with SVR as its base-learner;

  8. (8)

    Stacking-SVR: Stacking of SVRs which uses an MLP model as its final regressor. Consistent with the proposed algorithm, the number of SVRs in Stacking-SVR is set to 20. The MLP model used to combine the SVRs' outputs is a three-layer neural network with 20 hidden nodes in its hidden layer. A sketch of how the SVR-based ensemble baselines (6)-(8) can be instantiated is given after this list.

  9. (9)

    Proposed: The proposed two-stage MOEEL algorithm with DRA-NSDE and CDE-Stacking.
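As referenced in the list above, the SVR-based ensemble baselines (6)-(8) can be instantiated with scikit-learn roughly as follows; the hyperparameter values shown are illustrative placeholders rather than the paper's exact settings.

```python
# Illustrative construction of the SVR-based ensemble baselines (6)-(8) with
# scikit-learn; the hyperparameters shown are placeholders, not the paper's settings.
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import BaggingRegressor, AdaBoostRegressor, StackingRegressor

bagging_svr = BaggingRegressor(SVR(), n_estimators=20)
adaboost_svr = AdaBoostRegressor(SVR(), n_estimators=20)
stacking_svr = StackingRegressor(
    estimators=[(f"svr_{i}", SVR()) for i in range(20)],
    final_estimator=MLPRegressor(hidden_layer_sizes=(20,)),  # three-layer MLP combiner
)
# each baseline is then fitted and evaluated in the same way, e.g.:
# bagging_svr.fit(X_train, y_train); y_hat = bagging_svr.predict(X_test)
```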

Fig. 10
figure 10

Box plots of different prediction models in SkillCraft1 data set

The main parameters of the proposed algorithm are listed in Table 8. All the 9 models were tested on SkillCraft1 data set, and the comparison results obtained through 15 independent experiments are shown in Table 9 and Fig. 10. Based on these results, the following analysis can be drawn:

  1. (1)

    The proposed prediction model significantly outperforms the other 8 models in both RMSE and R\(^2\), indicating that our model is more advantageous.

  2. (2)

    MLP and the three tree-based ensemble learning models (Bagging, AdaBoost, RF) achieve better performance than SVR, indicating that neural networks and ensemble learning are more suitable than a single SVR for complex prediction problems because of their stronger ability to capture non-linear relationships.

  3. (3)

    While ensembling clearly benefits the tree-based models, the three SVR-based ensemble models (Bagging-SVR, AdaBoost-SVR, Stacking-SVR) fail to show a significant improvement over a single SVR. However, based on the proposed two-stage MOEEL algorithm, the SVR-based ensemble model (Proposed) achieves significantly better performance than all three tree-based ensemble models. This indicates that the proper selection of base learners and a proper ensemble method are critical when constructing the ensemble model. Our proposed method can obtain a better non-dominated solution set with high accuracy and strong diversity through the dynamic resource allocation mechanism, and it can efficiently select high-diversity solutions and construct the ensemble model with appropriate ensemble weights.

Blast furnace data test

To verify the effectiveness of the proposed two-stage MOEEL algorithm for silicon content prediction, the blast furnace data set is introduced in this subsection. The data come from a major iron and steel enterprise in China and were collected from May 1, 2021 to March 14, 2022. After handling the outliers and missing values, the silicon content prediction data set is obtained. It has 843 samples, each of which contains 40 input features (blast furnace process variables) and 1 output variable (silicon content).

In the actual blast furnace production process, the experts are more interested in the hit rate (HR) of the silicon content prediction, which is defined as follows:

$$\begin{aligned} \text {HR} = \frac{1}{L} \left( \sum _{i=1}^{L} H_i\right) \times 100\% , \quad H_i = {\left\{ \begin{array}{ll} 1, &{} \vert {\hat{y}}_i - y_i \vert < 0.1 \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(17)

where \({\hat{y}}_i\) is the predicted value of silicon content for the ith sample, \(y_i\) is the true value, and L is the total number of samples. The hit rate indicates the prediction accuracy of the silicon content. Together with RMSE (Eq. (4)) and R\(^2\) (Eq. (16)), these three indicators are used to evaluate the silicon content prediction performance of the different algorithms.
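A minimal sketch of the hit rate computation in Eq. (17) (our own NumPy illustration) is given below.

```python
# Hit rate (Eq. (17)): percentage of samples whose absolute prediction error is below 0.1.
import numpy as np

def hit_rate(y_true, y_pred, tol=0.1):
    hits = np.abs(np.asarray(y_pred) - np.asarray(y_true)) < tol
    return hits.mean() * 100.0

# example: hit_rate([0.45, 0.52], [0.48, 0.70]) -> 50.0
```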

Table 10 Comparison between different Silicon prediction models

To verify the effectiveness of the proposed algorithm, a new rival model called Proposed/RF is introduced. Proposed/RF is a variant of the proposed two-stage MOEEL algorithm in which the random forest regressor is taken as the base learner instead of SVR. Due to the random feature selection inherent in the random forest algorithm, all features collected in the blast furnace data set are considered when training Proposed/RF, which means the feature selection part in Fig. 2 is removed and only the hyperparameter part is used to optimize the parameters of RF. The other settings of Proposed/RF are the same as those of Proposed. Together with the other models described in "UCI benchmark test" section, all 10 models are tested on the blast furnace data set through 15 independent experiments, and the comparison results are presented in Table 10 and Fig. 11.

Fig. 11
figure 11

Box plots of different Silicon prediction models

Fig. 12
figure 12

Comparison of Pareto fronts between Proposed/RF and Proposed

Fig. 13
figure 13

Prediction curves and error distribution plots of different silicon prediction models

Fig. 14
figure 14

Comparison of model complexity

Based on the results, it can be clearly seen that our proposed model outperforms the other rivals, because it achieves the best R\(^2\) and RMSE results and the performance difference is significant. Although AdaBoost obtains the highest HR, the difference in HR between our model and AdaBoost is not significant. We can also see that Proposed/RF achieves better prediction performance than RF, which indicates that the proposed two-stage MOEEL framework is effective in finding the optimal parameters of the base learner in ensemble learning. Compared with Proposed/RF, the performance of Proposed is further improved, which indicates that SVR is a better choice than random forest as the base learner of the ensemble model under our proposed two-stage MOEEL framework. The key reason is that the feature selection part in our designed encoding scheme is more effective than the original feature selection mechanism in random forest, leading to better diversity in our proposed two-stage MOEEL model and thus improving the prediction accuracy. For a more specific illustration, Fig. 12 compares the Pareto fronts of Proposed/RF and Proposed obtained from their best results in the 15 independent experiments. It can be clearly seen that the front diversity of Proposed is much better than that of Proposed/RF. Please note that the total number of non-dominated solutions obtained by Proposed is also larger than that obtained by Proposed/RF, which is helpful in stage two when selecting solutions from different sub-populations. More specifically, if a sub-population contains a sufficient number of different solutions, the solutions selected from it will have better diversity. Finally, it is worth noting that the proposed method not only improves the prediction accuracy but also significantly reduces the model training cost, since the model complexity of SVR is much lower than that of random forest. Therefore, the above analysis verifies the superiority of the proposed model for silicon content prediction.

The prediction curve and error distribution of each model on the test set are visualized in Fig. 13, from which we can see that the predictions of the proposed model are closer to the true values. In addition, the error distribution curve of the proposed model is narrower and its errors are centered around zero, indicating that the proposed model has better prediction accuracy.

Further discussion

In this subsection, we further perform a statistical analysis of the model complexity. In the proposed DRA-NSDE, the number of support vectors of an SVR base learner is used to evaluate the complexity of a solution (Eq. (5)). Consequently, in the proposed ensemble model, the average number of support vectors over all SVR base learners is adopted as the model complexity measure. The same measure is used to evaluate the complexity of the other SVR-based ensemble models (Bagging-SVR, AdaBoost-SVR, Stacking-SVR). The statistical results over 15 independent experiments on both the UCI benchmark data set and the blast furnace data set are shown in Fig. 14.

From the results, we can see that the proposed model has the lowest complexity on both the UCI benchmark data set and the blast furnace data set. This indicates that the second objective (CMPLX) of solutions in the proposed DRA-NSDE is effective, as it enables our proposed model to improve the prediction accuracy while maintaining relatively low model complexity. This property of the proposed method saves a large amount of computational resources during the model training process and improves the online prediction efficiency for blast furnace silicon content. In the actual blast furnace ironmaking process, the silicon content is measured about every 40 min [53], while our model can produce online predictions in real time once it is trained offline. Therefore, the proposed method can fully satisfy the requirements of online silicon content prediction in the actual blast furnace ironmaking process.

Conclusion

As an important indicator of the operating condition of a blast furnace, accurate prediction of the silicon content in molten iron is very important for ensuring stable production of the blast furnace in the iron and steel industry. To achieve a prediction model with high accuracy and good generalization, this paper proposed a two-stage multiobjective evolutionary ensemble learning algorithm. The task of the first stage is to construct a set of near Pareto optimal base learners by means of DRA-NSDE with clustering-based dynamic resource allocation, while the task of the second stage is to build the ensemble learning model from the candidate base learners using a single-objective differential evolution algorithm. The proposed silicon content prediction method has three main characteristics: (1) The proposed method chooses SVR as the base learner. Together with the designed solution encoding scheme, the input features of each base learner can be selected automatically, so that the algorithm can not only extract meaningful combinations of features to improve the accuracy of the model, but also maintain better diversity of the ensemble model. (2) The proposed DRA-NSDE algorithm can dynamically allocate resources at different evolution stages based on the success rates of the two search strategies (local search and global search), which alleviates the difficulty of selecting target vectors in traditional NSDE. DRA-NSDE also improves the quality of the non-dominated solutions, which helps the algorithm to obtain faster convergence and better diversity. The diversity of the solutions to be integrated in stage two is further improved by the proposed CDE-Stacking method, which also optimizes the ensemble weights through a single-objective differential evolution algorithm. (3) The ensemble model has relatively low model complexity, which is important in the actual blast furnace ironmaking process. Experimental results on both the UCI benchmark data set and the actual blast furnace data set demonstrate the effectiveness of the improvement strategies used in the proposed two-stage MOEEL algorithm, and further comparison results show that the proposed algorithm can achieve prediction models with higher accuracy and lower complexity in comparison with other powerful single machine learning methods and ensemble learning methods.

In future work, it is worthwhile to further investigate how to select potential base learners from the clustered sub-populations in a more targeted manner rather than by random selection. In addition, the application of the proposed algorithm to the operation optimization of the blast furnace ironmaking process is also one of our future research directions.