Introduction

Despite tremendous advances in computer processing power, computationally expensive problems arise frequently in scientific and engineering disciplines where complex computer simulations are used. In these cases, obtaining more data means running additional experiments, which incurs a highly non-trivial computational expense (Forrester and Keane [1]). As a result, surrogate models have been widely used to replace complex simulation models (Namura et al. [2]). Their application fields include multidisciplinary design optimization (Yao et al. [3]), uncertainty analysis (Yao et al. [4]), and so on.

At present, the commonly used surrogate models are Polynomial Response Surface (PRS) (Goel et al. [5]), Multivariate Adaptive Regression Splines (MARS) (Gu and Wahba [6]), Kriging (Clark et al. [7]), Radial Basis Function (RBF) (Yao et al. [8]), and Support Vector Regression (SVR) (Clarke et al. [9]). There are also hybrid surrogate modeling paradigms in which different surrogate models are combined to offer effective solutions (Zhang et al. [10]; Yin et al. [11]). Among these surrogates, PRS is popular because of its simplicity, good interpretability, and computational efficiency (Bhosekar and Ierapetritou [12]). However, PRS is unsuitable for non-linear, multi-modal, multi-dimensional design landscapes (Forrester and Keane [1]). Because of these limitations, enhanced versions that remove or suppress unnecessary variables have been proposed, such as subset selection (Furnival and Wilson [13]) and regularization (Tibshirani [14]). Subset selection methods address the trade-off between prediction error and regression model complexity by selecting a subset of variables. Stepwise regression is an effective way to solve the subset selection problem (Hosseinpour et al. [15]). Borrowing the idea of stepwise regression, Giustolisi and Doglioni [16] propose Evolutionary Polynomial Regression (EPR), which uses an evolutionary process rather than following the hill-climbing scheme of stepwise regression. Among regularization methods, the least absolute shrinkage and selection operator (Lasso) is widely used (Tibshirani [14]); it shrinks some coefficients and sets others to zero, retaining the good features of both subset selection and ridge regression. Zou and Hastie [17] propose the elastic net (EN) technique to fix the Lasso's failure to select correlated grouped variables.

For both subset selection and regularization methods, performance largely depends on the basis functions included in the model. By choosing the significant terms, the prediction ability of the model can be enhanced. Commonly, basis function selection is guided by a specific error estimate of the model. An error estimate that accords with the true error of the model can guide the selection of basis functions more effectively. Currently, mean squared error estimation based on cross validation (CV), such as the prediction error sum of squares (PRESS) (Goel et al. [18]), is a popular way to estimate the global accuracy of the model. However, with limited samples available, CV-based error estimation is influenced by the data partition scheme and may not estimate the true global average error well. In this situation, it is therefore difficult to select significant basis functions effectively for PRS. To avoid this problem, Gu and Wei [19] propose a robust model structure selection method, which selects the significant model terms according to the overall mean absolute error of resampled subsets. Although the method achieves good results in some numerical examples, it incurs large computational costs. Besides, the aforementioned modeling methods are based on the error estimation over all the CV subsets, which may result in incorrect basis function selection because certain subsets with large estimation error can exert a dominating influence (as discussed in Sect. 2.3).

In this paper, a novel approach based on multitask optimization and ensemble modeling (PRS-MOEM) is proposed to effectively mitigate the negative influence of subsets with large estimation error caused by the random partition, and to enhance the accuracy and robustness of basis function selection. First, multiple subsets are partitioned from the training set based on cross validation. Instead of the traditional method, which directly builds a single model guided by CV-based error estimation, multiple sub-models are constructed by building a surrogate for each subset. These sub-model constructions are solved in parallel as multiple optimization tasks. To improve the optimization performance, multitask optimization is adopted (Ong and Gupta [20]; Naik and Rangwala [21]), and an improved evolutionary algorithm with transfer migration is developed to solve the multitask optimization problem, which significantly enhances optimization efficiency and robustness through useful information exchange between similar optimization tasks. Second, a novel ensemble method is proposed to integrate the multiple sub-models into the final optimal model. There is existing research on ensemble modeling that integrates multiple models into one; however, previous studies usually use a weighted sum approach (Fang et al. [22]; Zhou and Jiang [23]). Since some sub-models may deviate greatly from the true model due to the features of their specific training subsets (the subsets are randomly partitioned), the weighted sum approach may select basis functions incorrectly, reducing both the interpretability and the accuracy of the PRS model. To obtain a well-performing ensemble, this paper proposes a scoring method that measures the significance of each basis function according to the error estimates of the sub-models and the occurrence frequency of the basis function across all the sub-models.
The basis functions are ranked by their significance scores in descending order. Basis functions are then added to the ensemble one at a time, and the model accuracy is measured by the bias-corrected Akaike information criterion (AICc). The ensemble with the lowest AICc is chosen as the final model. In this way, the negative influence of sub-models with large prediction error is mitigated, and the uncertain impact of the random subset partition is alleviated. Thus both the basis function selection accuracy and the algorithm robustness can be effectively enhanced. The main contributions of this paper are summarized as follows:

  • The PRS modeling problem is decomposed into multiple subproblems, each building a sub-model for one subset separately. Therefore, the potential of each subset can be fully explored and wrong dominance by specific subsets can be mitigated.

  • Multitask optimization is introduced to simultaneously solve the subproblems, and an improved evolutionary algorithm with transfer migration is developed to enhance the optimization efficiency and robustness.

  • An ensemble modeling method for integrating the multiple sub-models is proposed. Using a novel scoring strategy, the method ranks the basis functions of the sub-models, adds them into the ensemble sequentially, and selects the final optimal model.

The rest of the paper is organized as follows. In Sect. 2, a brief review of the Polynomial Response Surface (PRS) is given and the disadvantage of the traditional CV-based modeling method is analyzed. In Sect. 3, the PRS method based on basis function selection by multitask optimization and ensemble modeling (PRS-MOEM) is developed in detail. In Sect. 4, the proposed PRS-MOEM is verified with seven numerical examples and one practical engineering problem, followed by conclusions in the final section.

Preliminary

Polynomial response surface

The PRS model is derived from the linear regression model; its matrix form can be written as

$$\begin{aligned} {\hat{y}}({\varvec{x}}) = [{f_1}({\varvec{x}}), {f_2}({\varvec{x}}),\ldots ,{f_{nvars}}({\varvec{x}})] \cdot {\varvec{\beta }} \end{aligned}$$
(1)

where \({\varvec{x}} = {[{x_1}, {x_2}, ... , {x_m}]^T}\) is a sampled point. m is the number of variables. \({\varvec{\beta }} = {[{\beta _1},{\beta _2},...,{\beta _{nvars}}]^T}\) is the regression coefficients vector. \([{f_1}({\varvec{x}}),{f_2}({\varvec{x}}),...,{f_{nvars}}({\varvec{x}})]\) is the basis function vector, and \({f_i}\) is a basis function. nvars is the number of basis functions. Given a set of training points \({{\varvec{x}}^{(l)}} \in {{\mathbb {R}}^m},l = 1,2,...,n\), and the corresponding actual response vector \({\varvec{y}} = {[{y^{(1)}},{y^{(2)}},...,{y^{(n)}}]^T}\), the design matrix is defined as

$$\begin{aligned} {\varvec{\mathrm{F}}} = \left[ {\begin{array}{*{20}{c}} {{f_1}({{\varvec{x}}^{(1)}})}&{}{\cdot \cdot \cdot }&{}{{f_{nvars}}({{\varvec{x}}^{(1)}})}\\ {\cdot \cdot \cdot }&{}{\cdot \cdot \cdot }&{}{\cdot \cdot \cdot }\\ {{f_1}({{\varvec{x}}^{(n)}})}&{}{\cdot \cdot \cdot }&{}{{f_{nvars}}({{\varvec{x}}^{(n)}})} \end{array}} \right] \end{aligned}$$
(2)

The least squares method is commonly used to solve for the regression coefficients:

$$\begin{aligned} {\varvec{\beta }} = {{\varvec{\mathrm{F}}}^ + } \cdot {\varvec{y}} \end{aligned}$$
(3)

where \({{\varvec{\mathrm{F}}}^ + } = {({{\varvec{\mathrm{F}}}^T}{\varvec{\mathrm{F}}})^{ - 1}}{{\varvec{\mathrm{F}}}^T}\) is the Moore-Penrose pseudo-inverse of \({\varvec{\mathrm{F}}}\).
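As a concrete illustration, Eqs. (1)–(3) can be sketched in a few lines of NumPy (the function names here are illustrative, not from the paper):

```python
import numpy as np

def fit_prs(F, y):
    """Solve Eq. (3): beta = F^+ y, with F the design matrix of Eq. (2).

    F : (n, nvars) array, F[l, i] = f_i(x^(l))
    y : (n,) array of observed responses
    """
    # np.linalg.pinv gives the Moore-Penrose pseudo-inverse, which equals
    # (F^T F)^{-1} F^T when F has full column rank.
    return np.linalg.pinv(F) @ y

def predict_prs(F_new, beta):
    """Evaluate Eq. (1): y_hat(x) = [f_1(x), ..., f_nvars(x)] . beta."""
    return F_new @ beta
```

For a quadratic response with exact data, the recovered coefficients coincide with the generating ones, since least squares interpolates when the model class contains the true function.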

To lower the mutual coherence of the design matrix, the multivariate Legendre orthogonal polynomials (Fan et al. [24]) are applied to form the basis functions in this paper:

$$\begin{aligned} {f_i}({\varvec{x}}) = \prod \limits _{j = 1}^m {{L_{{\eta _j}}}\left( {{x_j}} \right) } \end{aligned}$$
(4)

where \({\eta _j}\) is the order of the jth univariate Legendre polynomial \({L_{{\eta _j}}}\left( {{x_j}} \right) \), and \(\sum _{j = 1}^m {{\eta _j}} \le P\), where P is the user-defined highest order of the polynomials. \({L_{{\eta _j}}}\left( {{x_j}} \right) \) is determined by the following recursion: supposing \({L_0}\left( {{x_j}} \right) = 1\) and \({L_1}\left( {{x_j}} \right) = {x_j}\), then

$$\begin{aligned} \left( {{\eta _j} + 1} \right) {L_{{\eta _j} + 1}}\left( {{x_j}} \right) = \left( {2{\eta _j} + 1} \right) {x_j}{L_{{\eta _j}}}\left( {{x_j}} \right) - {\eta _j}{L_{{\eta _j} - 1}}\left( {{x_j}} \right) \end{aligned}$$
(5)
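The recursion of Eq. (5) and the product form of Eq. (4) can be sketched as follows (a minimal illustration; the function names are ours):

```python
import numpy as np

def legendre(eta, x):
    """Univariate Legendre polynomial L_eta(x) via the recursion of Eq. (5)."""
    x = np.asarray(x, dtype=float)
    if eta == 0:
        return np.ones_like(x)
    L_prev, L_curr = np.ones_like(x), x
    for k in range(1, eta):
        # (k + 1) L_{k+1}(x) = (2k + 1) x L_k(x) - k L_{k-1}(x)
        L_prev, L_curr = L_curr, ((2 * k + 1) * x * L_curr - k * L_prev) / (k + 1)
    return L_curr

def multivariate_basis(x, eta):
    """One multivariate basis function of Eq. (4): f(x) = prod_j L_{eta_j}(x_j)."""
    return np.prod([legendre(e_j, x_j) for e_j, x_j in zip(eta, x)], axis=0)
```

For example, the recursion reproduces \(L_2(x) = (3x^2 - 1)/2\) and \(L_3(x) = (5x^3 - 3x)/2\).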

Traditional PRS modeling method based on cross validation

PRS modeling first requires defining the basis function vector, which needs a separate sample set for model validation and basis function selection. However, with limited samples available, it is difficult to reserve a separate sample set for surrogate validation. To address this problem, the cross-validation method is widely used, as it provides good error estimation when the sample size is small (Bischl et al. [25]). For K-fold cross validation, the training set \(D = \{ ({{\varvec{x}}^{(l)}},{{\varvec{y}}^{(l)}}),l = 1,2,...,n\}\) is randomly partitioned into K disjoint sets of approximately equal size, denoted as \({D_1},{D_2},...,{D_K}\). Note that for a small data size, 10-fold cross validation provides an almost unbiased estimate of the prediction error, whereas with a relatively large number of samples a small K, such as 5, is preferred to avoid high computational cost (Simon [26]). For \(k=1,2,...,K\), \({D_k}\) is the validation subset with the corresponding training subset \({D^{( - k)}} = D - {D_k}\). The candidate basis function set is defined as \({\varvec{{\Phi }}} = {\{ {f_i}\} _{i = 1,...,nvars}}\). The traditional modeling method tries to find an active basis function set from \({\varvec{{\Phi }}}\) by minimizing the error estimate of the final model. Here ‘active’ means that the basis function is selected into the final model, while ‘inactive’ means the opposite. First, the selected active basis function vector with \({N_{active}}\) items is defined as

$$\begin{aligned} \begin{array}{l} {\varvec{\mathrm{S}}} = {[{f_{a1}},{f_{a2}},..., {f_{a{N_{active}}}}]^T}\\ {f_{ai}} \in {\varvec{{\Phi }}} , 1 \le i \le {N_{active}} \end{array} \end{aligned}$$
(6)

The design matrix \({{\varvec{\mathrm{F}}}^{( - k)}}\) is constructed with the active basis function vector \({\varvec{\mathrm{S}}}\) for \({D^{( - k)}}\) as

$$\begin{aligned} \begin{array}{l} {{\varvec{\mathrm{F}}}^{( - k)}} = {\left[ {{\varvec{\mathrm{S}}}({{\varvec{x}}^{k1}}),{\varvec{\mathrm{S}}}({{\varvec{x}}^{k2}}),..., {\varvec{\mathrm{S}}}({{\varvec{x}}^{k{n_k}}})} \right] ^T}\\ \\ {{\varvec{x}}^{kj}} \in {D^{( - k)}},\mathrm{{ 1}} \le j \le {n_k} \end{array} \end{aligned}$$
(7)

where the superscript T denotes the transpose of the matrix, and \({n_k}\) is the number of points in the subset \({D^{( - k)}}\). Then the regression coefficient vector \({{\varvec{\beta }} ^{( - k)}}\) is calculated by

$$\begin{aligned} {{\varvec{\beta }} ^{( - k)}} = {({\mathbf{{F}}^{( - k)}})^ + } \cdot {{\varvec{y}}^{( - k)}} \end{aligned}$$
(8)

where \({{\varvec{y}}^{( - k)}}\) is the actual response vector of the training samples in \({D^{( - k)}}\), and \({({\mathbf{{F}}^{( - k)}})^ + }\) is the Moore-Penrose pseudo-inverse of \({{\varvec{\mathrm{F}}}^{( - k)}}\). Based on \(\mathbf {S}\) and \({{\varvec{\beta }} ^{( - k)}}\), the sub-model corresponding to the training subset \({D^{( - k)}}\) can be obtained as

$$\begin{aligned} {{\hat{y}}_k}({\varvec{x}}) = \mathbf{{S}}({\varvec{x}}) \cdot {{\varvec{\beta }} ^{( - k)}} \end{aligned}$$
(9)
Fig. 1

The diagram of PRS modeling framework

Based on \({D_k}\), the prediction error of the model can be estimated. As the commonly used ordinary cross-validation error estimation criterion (CV) would lead to large bias when the sample size is small (Yanagihara et al. [27]), the bias-corrected cross-validation criterion (CCV) is chosen in this paper for error estimation, which is

$$\begin{aligned} \begin{array}{l} {\varvec{\Theta }} ({{\varvec{\beta }} ^{( - k)}}) = \frac{1}{{n - {n_k}}} \cdot \sum \limits _{i = 1}^{n - {n_k}} {(|{y^{(i)}} - {{{\hat{y}}}_k}({{\varvec{x}}^{(i)}}){|_{{{\varvec{x}}^{(i)}} \in {D_k}}})} \\ \qquad \qquad + \frac{1}{{{n_k}}} \cdot \sum \limits _{j = 1}^{{n_k}} {(\frac{\mathrm{{1}}}{{2n}}|{y^{(j)}} - {{{\hat{y}}}_k}({{\varvec{x}}^{(j)}}){|_{{{\varvec{x}}^{(j)}} \in {D^{( - k)}}}})} \end{array} \end{aligned}$$
(10)

CCV calculates the prediction error of the model in the first term and simultaneously accounts for the fitting error in the second term, which avoids large bias with limited samples. For \(k=1,2,...,K\), the aforementioned steps are repeated to obtain the prediction errors of all the sub-models based on the active basis function vector \(\mathbf {S}\). By minimizing the error sum \(\sum _{k = 1}^K {{\varvec{\Theta }} ({{\varvec{\beta }} ^{( - k)}})}\), the optimal active basis function vector can be obtained. The diagram of this method is shown in Fig. 1a, and the optimization task can be formulated as

$$\begin{aligned} \begin{array}{l} \mathrm{{Find}} \quad \mathbf{{S}} = {[{f_{a1}},{f_{a2}},..., {f_{a{N_{active}}}}]^T},{f_{ai}} \in {\varvec{\Phi }} \\ \mathrm{{Min}} \quad \sum \limits _{k = 1}^K {{\varvec{\Theta }} ({{\varvec{\beta }} ^{( - k)}})} \\ \mathrm{{s}}\mathrm{{.t}}\mathrm{{.}} \quad \ \ {{\varvec{\beta }} ^{( - k)}} = {({\mathbf{{F}}^{( - k)}})^ + } \cdot {{\varvec{y}}^{( - k)}}\\ \qquad \quad {\mathbf{{F}}^{( - k)}} = {\left[ {\mathbf{{S}}({{\varvec{x}}^{k1}}),\mathbf{{S}}({{\varvec{x}}^{k2}}),...,\mathbf{{S}}({{\varvec{x}}^{k{n_k}}})} \right] ^T}\\ \qquad \quad {{\varvec{x}}^{kj}} \in {D^{( - k)}},\mathrm{{ 1}} \le j \le {n_k}\mathrm{{, }}1 \le k \le K \\ \\ \qquad \quad {N_{active}} \le \mathop {\min }\limits _{1 \le k \le K} \left( {{n_k}} \right) \mathrm{{ - 1}} \end{array} \end{aligned}$$
(11)

The number of active basis functions in the final model, \({N_{active}}\), has to satisfy the constraint \({N_{active}} \le \mathop {\min }\limits _{1 \le k \le K} \left( {{n_k}} \right) - 1\) (Lee et al. [28]).
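For illustration, the CCV criterion of Eq. (10) can be computed as follows (a minimal sketch; the function name and argument layout are our assumptions):

```python
import numpy as np

def ccv_error(y_val, yhat_val, y_train, yhat_train, n):
    """Bias-corrected cross-validation criterion (CCV) of Eq. (10).

    y_val, yhat_val     : actual/predicted responses on the validation subset D_k
    y_train, yhat_train : actual/predicted responses on the training subset D^(-k)
    n                   : total number of training points in D
    """
    # first term: mean absolute prediction error on the held-out subset D_k
    pred_term = np.mean(np.abs(np.asarray(y_val) - np.asarray(yhat_val)))
    # second term: fitting error on D^(-k), down-weighted by the factor 1/(2n)
    fit_term = np.mean(np.abs(np.asarray(y_train) - np.asarray(yhat_train))) / (2 * n)
    return pred_term + fit_term
```

The down-weighting by \(1/(2n)\) keeps the fitting error from dominating the held-out prediction error while still penalizing models that fit their own training subset poorly.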

Disadvantage of the traditional PRS modeling based on cross validation

In solving Eq. (11), the optimal active basis function vector \(\mathbf{{S}}\) is obtained by minimizing the total sub-model error sum \(\sum _{k = 1}^K {{\varvec{\Theta }} ({{\varvec{\beta }} ^{( - k)}})}\), which may easily lead to incorrect selection due to the dominant influence of certain sub-models with large error estimates. A one-dimensional problem is used for illustration in Fig. 2. The sample set is divided into two subsets, labeled subset A (blue dots) and subset B (red dots). To demonstrate the influence of the subsets on the final model construction, the optimal sub-model is first built for each training subset. With training subset A and validation subset B, sub-model 1 (brown dashed line) is obtained by minimizing the subset estimation error of Eq. (10). Similarly, sub-model 2 (red dashed line) is obtained with training subset B and validation subset A. Obviously, the nonlinear sub-model 1 is more in line with the true model (green dashed line), while sub-model 2 is merely a linear model (only linear basis functions are selected) that deviates greatly from the true response. This indicates that with an improper subset partition (which may happen with large probability under the random CV partition), the optimal model obtained by minimizing the estimation error can be far from the true model; that is, basis function identification fails with training subset B.

Then, by solving Eq. (11) to minimize the total error sum over all the subsets, the final model is obtained (black solid line). It can be observed that the final model is also linear and very close to sub-model 2. This is because the large error estimate of sub-model 2 dominates the total error sum, which guides the optimization search towards minimizing the error estimate of sub-model 2; the basis function selection of the final model therefore approaches that of sub-model 2. The underlying reason for this problem is that the active basis function vector is optimized with respect to its performance on all the subsets (the total error sum), so the optimization process can easily be dominated by specific subsets with large error estimates. To address this problem, an intuitive idea is to optimize the active basis function vector for each subset separately, so that the potential of each subset can be fully explored and wrong dominance by specific subsets can be mitigated. Then, based on all the potential active basis function vectors, the final model can be assembled according to a certain criterion. Based on this idea, a novel PRS modeling approach is proposed in this paper. It optimizes the active basis function vector and constructs the corresponding sub-model for each subset separately and in parallel based on multitask optimization (MO); the active basis functions are then scored and selected by ensemble modeling (EM). This PRS modeling framework is developed in Sect. 3.

Multitask optimization

MO (Jin et al. [29]; Liao et al. [30]) can simultaneously solve multiple tasks in a single run and achieve better performance through positive knowledge transfer, and it has shown high efficiency on expensive optimization problems (Ding et al. [31]; Wang et al. [32]). As discussed above, the PRS modeling optimization problem can be decomposed into K subproblems, each optimizing the active basis function vector for one subset separately. Due to the similarity of these subproblems and the independence of their optimization processes, MO is introduced to construct the sub-model for each subset. Suppose that \(\mathbf{{S}}^{( - k)}\) is the feasible active basis function vector of the kth optimization task \({{\varvec{\Theta }} _k}\); then the subproblems to be simultaneously addressed can be formulated as

$$\begin{aligned} \begin{array}{l} \left\{ {\mathbf{{S}}^{( - 1 )}},{\mathbf{{S}}^{(-2)}}, ... ,{\mathbf{{S}}^{(-K)}} \right\} \\ \qquad \qquad = \left\{ \arg \min {{\varvec{\Theta }}_1},\arg \min {{\varvec{\Theta }}_2},...,\arg \min {{\varvec{\Theta }}_K} \right\} \end{array} \end{aligned}$$
(12)

The details of the optimization problem are clarified in the next section.

PRS-MOEM method

The PRS-MOEM framework is shown in Fig. 1b. It mainly includes two key parts: the multitask optimization, which obtains the potential active basis function vectors and constructs the optimal sub-models for all the subsets based on K-fold cross validation; and the ensemble modeling, which integrates all the sub-models into the final model. Furthermore, to effectively solve the multitask optimization problem, an improved evolutionary algorithm with transfer migration is developed.

Fig. 2

The illustration of PRS modeling being greatly influenced by specific subsets

Multitask optimization for sub-model construction

For the subsets \({D_k}\) and \({D^{( - k)}}\), the active basis function vector can be obtained by solving the following optimization problem:

$$\begin{aligned} \begin{array}{l} \mathrm{{Find}} \quad {\mathbf{{S}}^{( - k)}} = {[{f_{a1}},{f_{a2}},...,{f_{aN_{active}^{\left( { - k} \right) }}}]^T},{f_{ai}} \in {\varvec{\Phi }} \\ \\ \mathrm{{Min}} \quad {{\varvec{\Theta }} ({{\varvec{\beta }} ^{( - k)}})}\\ \\ \mathrm{{s}}\mathrm{{.t}}\mathrm{{.}} \quad \ \ {{\varvec{\beta }} ^{( - k)}} = {(\mathbf{{F}}^{( - k)})^ + } \cdot {{\varvec{y}}^{( - k)}}\\ \\ \qquad \quad {\mathbf{{F}}^{( - k)}} = {\left[ {\mathbf{{S}}({{\varvec{x}}^{k1}}),\mathbf{{S}}({{\varvec{x}}^{k2}}),...,\mathbf{{S}}({{\varvec{x}}^{k{n_k}}})} \right] ^T}\\ \\ \qquad \quad {{\varvec{x}}^{kj}} \in {D^{( - k)}},\mathrm{{ 1}} \le j \le {n_k} \\ \\ \qquad \quad N_{active}^{\left( { - k} \right) } \le {n_k}\mathrm{{ - 1}} \end{array} \end{aligned}$$
(13)

where \({\mathbf{{S}}^{( - k)}}\) is the active basis function vector of the kth sub-model trained by the subset \({D^{( - k)}}\), and \(N_{active}^{\left( { - k} \right) }\) is the number of items in this vector.

For the K subsets, K optimization tasks have to be conducted to construct all the sub-models. Since the optimization processes of the K sub-models are independent of each other, multitask optimization can be adopted to solve the K tasks in parallel. Meanwhile, all the optimization tasks share the same purpose: obtaining the optimal surrogate from training sample subsets drawn from the same true model. Thus these optimization tasks have an inherent similarity. Previous research has shown that positive transfer can outweigh deleterious negative transfer, especially when there is some prior understanding of the similarity between the black-box optimization tasks (Cheng et al. [33]; Feng et al. [34]). Thus, a transfer migration method is proposed herein to exchange useful information between the different optimization tasks and enhance optimization efficiency; the details are presented in Sect. 3.3.

Another important issue in the optimization problem of Eq. (13) is the design space, i.e., the candidate basis function set \({\varvec{\Phi }}\). In this paper, the multivariate Legendre orthogonal polynomials are used to form the basis function set, whose items are defined by the highest order (denoted as P) of the polynomials. Theoretically, a higher order is preferable because the larger design space covers more possible situations. However, as \({\varvec{\Phi }}\) quickly enlarges with increasing P, the optimization difficulty grows dramatically due to the fast increase in the number of optimization variables. Despite the long history of optimization algorithm research, effectively solving large-scale optimization problems with global convergence capability remains very challenging. Thus with a larger P, the optimization effectiveness may not be guaranteed, and the result may be a local optimum that greatly degrades surrogate accuracy. With a smaller P, the optimization algorithm more robustly obtains the global optimum, but it may fail to capture the high-order polynomial terms of the true model. Properly defining P is therefore a difficult problem. This paper proposes a pre-training method that conducts the multitask optimization under different highest-order settings, i.e., \(P = 1,2,...,P_{max}\), where \(P_{max}\) is a user-defined maximum, and obtains the preliminary objective value of each subset optimization task under each P. Then for the kth subset, the proper highest order is defined as

$$\begin{aligned} {P_k} = \mathop {\arg \min }\limits _{P \in \left\{ {1,2,...,{P_{max}}} \right\} } {\varvec{\Theta }} \left( {{{\varvec{\beta }} ^{( - k)}}\left( P \right) } \right) \end{aligned}$$
(14)

With \({P_k}\), the kth sub-model is trained, and the corresponding optimal active basis function vector \(\mathbf{{S}}^{{{( - k)}^ * }}\) can be obtained. The necessity of the highest-order setting is illustrated in Sect. 4.3.1.
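The pre-training order selection can be sketched as picking, for each subset task, the order whose preliminary run achieved the smallest objective value (an illustrative helper; the function name and dict-based interface are our assumptions):

```python
def select_highest_order(objective_by_order):
    """Pre-training order selection for one subset task: return the highest
    order P whose preliminary multitask-optimization run achieved the
    smallest objective value, i.e. the argmin of Eq. (14).

    objective_by_order : dict mapping P in {1, ..., P_max} to the
    preliminary objective value obtained under that order.
    """
    return min(objective_by_order, key=objective_by_order.get)
```

For instance, if orders 1–3 yield preliminary objectives 0.90, 0.30, and 0.35 for some subset, the proper order \(P_k\) is 2.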

Ensemble modeling

How to combine the sub-models into a final ensemble with good performance is an important and challenging issue. Previous studies generally use the weighted sum approach, but it may lead to incorrect selection of basis functions for reasons similar to those affecting PRS modeling based on the total error sum of all the subsets, as analyzed in Sect. 2.3. Thus both the interpretability and the accuracy of the ensemble are reduced. To address this problem, instead of directly summing all the sub-models into an ensemble, this paper proposes to construct the final model by quantitatively scoring and rationally selecting the basis functions.

First, a novel scoring method is proposed to measure the significance of each basis function. Two important issues should be taken into consideration during scoring: the sub-model accuracy and the basis function occurrence frequency. On the one hand, if a sub-model has high accuracy as well as low complexity (few active basis functions), then it is more likely to match the true model; thus the active basis functions of this sub-model are more significant for surrogate modeling and should have a higher probability of being selected. On the other hand, if a basis function is active in many sub-models, it is more likely to be included in the true model because of its high occurrence frequency. According to these considerations, the significance metric to score the basis functions is defined as

$$\begin{aligned} \begin{array}{l} scor{e_{ik}} = \left\{ {\begin{array}{*{20}{l}} {\frac{1}{{{\varvec{\Theta }} _k^ * \cdot N_{active}^{{{( - k)}^ * }}}},}&{}{{f_i} \in {\mathbf{{S}}^{{{( - k)}^ * }}}}\\ {0,}&{}{{f_i} \notin {\mathbf{{S}}^{{{( - k)}^ * }}}} \end{array}} \right. \\ scor{e_i} = \sum \limits _{k = 1}^K {scor{e_{ik}}} ,\quad i = 1,2,...,{N_s} \end{array} \end{aligned}$$
(15)

where \({\varvec{\Theta }} _k^ *\) is the optimal objective value of the kth sub-model obtained from Eq. (13), and \(N_{active}^{{{( - k)}^ * }}\) is the number of active basis functions in the kth sub-model. \({f_i}\) denotes the ith component of the basis function set \({{\varvec{\Phi }} _s}\), which is the union of the optimal active basis function vectors \(\mathbf{{S}}^{{{( - k)}^ * }}\) of all the sub-models, and \({N_s}\) is the total number of basis functions in \({{\varvec{\Phi }} _s}\). Each basis function \({f_i}(i = 1,2,...,{N_s})\) is first scored in each sub-model, yielding \(scor{e_{ik}}\); the final score \(scor{e_i}\) of each basis function is then obtained by summing \(scor{e_{ik}}\) over all the sub-models. From Eq. (15), it can be observed that the following rules apply to the basis function selection:

  • First, if the ith basis function is inactive in the kth sub-model, then its sub-score \({scor{e_{ik}}}\) in this sub-model is zero. If it is active, its sub-score is calculated according to the accuracy and complexity of the sub-model.

  • Second, if a sub-model has a smaller \({\varvec{\Theta }} _k^ *\), which means the sub-model performs well with high accuracy, then the active basis functions of this sub-model tend to have higher scores, as the sub-model is more likely to match the true model.

  • Third, if a sub-model has a larger \(N_{active}^{{{( - k)}^ *}}\), which means the sub-model is more complex with a large number of basis functions, then the active basis functions of this sub-model tend to have lower scores, so as to prevent over-fitting.

  • Fourth, by adding the sub-scores of each basis function across all the sub-models, an overall significance assessment of each basis function is obtained that takes all the subsets into consideration.
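A minimal sketch of the scoring procedure is given below. Note that the reciprocal weight \(1/({\varvec{\Theta }}_k^* \cdot N_{active}^{(-k)^*})\) is our assumption, chosen only because it satisfies the four rules above; the paper's exact metric is the one defined in Eq. (15):

```python
def score_basis_functions(sub_models, all_basis):
    """Score each candidate basis function across the K sub-models.

    sub_models : list of (active_set, theta_star, n_active) tuples, where
                 active_set is the set of basis indices selected by a
                 sub-model, theta_star its optimal objective value from
                 Eq. (13), and n_active its number of active terms.
    all_basis  : iterable of basis indices forming the union set Phi_s.

    The reciprocal weight 1 / (theta_star * n_active) is an assumed form:
    it is zero for inactive terms, grows as theta_star shrinks (accurate
    sub-models), and shrinks as n_active grows (complex sub-models).
    """
    return {
        f: sum(
            1.0 / (theta_star * n_active)
            for active_set, theta_star, n_active in sub_models
            if f in active_set
        )
        for f in all_basis
    }
```

For example, with two sub-models, one accurate two-term model and one coarser one-term model, a basis function active in both accumulates contributions from both sub-models and outranks one active only in the first.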

After the scoring procedure, the basis functions are ranked by their significance scores in descending order, and the top \({N_0} = \min \left( {{N_s},n - 2} \right) \) elements are selected to compose the candidate significant basis function set \({\{ {f_{(j)}}\} _{j = 1,...,{N_0}}}\). One basis function from \({\{ {f_{(j)}}\} _{j = 1,...,{N_0}}}\) is added to the ensemble at a time according to the ranking sequence, and the accuracy of the resulting ensemble is quantified. After the jth basis function \({f_{(j)}}\) is added, there are j active basis functions in the current ensemble, from which the PRS model can be built as described in Sect. 2.1. Denote the prediction model of this ensemble as \({{\hat{y}}_{(j)}}({\varvec{x}})\). Then the bias-corrected Akaike information criterion (AICc) (Hurvich and Tsai [35]) is used to quantify the model accuracy:

$$\begin{aligned} \begin{array}{l} \mathop {\mathrm{AICc}}\nolimits _{(j)} = n \times \ln (\frac{1}{n}\sum \limits _{l = 1}^n {{{({y^{(l)}} - {{{\hat{y}}}_{(j)}}({{\varvec{x}}^{(l)}}))}^2}} )\\ \qquad \qquad + n \times (\ln (2\pi ) + 1) + 2j + 2j \times \frac{{j + 1}}{{n - j - 1}} \end{array} \end{aligned}$$
(16)

Once all the basis functions have been added to the ensemble, there are \({N_0}\) candidate ensemble models in total. By comparing \(\mathop {\mathrm{AICc}}\nolimits _{(j)}\left( {1 \le j \le {N_0}} \right) \), the ensemble with the minimum \(\mathrm{AICc}\) is selected as the final model with the highest accuracy.
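The forward selection by AICc can be sketched as follows. Eq. (16) is implemented directly; `build_and_predict` is a hypothetical helper standing in for the PRS fit of Sect. 2.1:

```python
import numpy as np

def aicc(y, yhat, j):
    """Bias-corrected Akaike information criterion of Eq. (16) for an
    ensemble containing j basis functions."""
    n = len(y)
    mse = np.mean((np.asarray(y) - np.asarray(yhat)) ** 2)
    return (n * np.log(mse) + n * (np.log(2 * np.pi) + 1)
            + 2 * j + 2 * j * (j + 1) / (n - j - 1))

def select_ensemble(ranked_basis, build_and_predict, y):
    """Add the ranked basis functions one at a time and keep the ensemble
    with the minimum AICc.

    build_and_predict(subset) is a user-supplied callable assumed to fit a
    PRS on the given basis subset (as in Sect. 2.1) and return its
    predictions at the training points.
    """
    best_j, best_aicc = 1, np.inf
    for j in range(1, len(ranked_basis) + 1):
        a = aicc(y, build_and_predict(ranked_basis[:j]), j)
        if a < best_aicc:
            best_j, best_aicc = j, a
    return ranked_basis[:best_j], best_aicc
```

The penalty terms in Eq. (16) grow with j, so the selection stops rewarding extra basis functions once the residual reduction no longer compensates for the added complexity.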

Evolutionary algorithm with transfer migration

In the PRS-MOEM framework, the efficacy of solving the multitask optimization is very important. To enhance the optimization effectiveness, the genetic algorithm (GA) is applied and improved in two aspects. First, the chromosomes of the GA are coded to indicate the active states of the basis functions. Second, a transfer migration method is proposed to exchange useful information between the different optimization tasks. The diagram of chromosome coding is shown in Fig. 3. For the optimization tasks, there are \(nvars = (P + m)!/(P!\,m!)\) basis functions in the candidate set \(\varvec{\Phi }\), where m is the dimension of the input variable vector \({\varvec{x}}\) and \(P \in \{1,2,...,P_{max}\}\) is the highest order. The chromosome is therefore composed of nvars bits, each corresponding to the state of one basis function: a bit value of 0 denotes that the corresponding basis function is inactive, while 1 denotes that it is active. Searching for the optimal chromosome is thus essentially finding the most appropriate active basis function vector in the basis function set \(\varvec{\Phi }\).
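The chromosome coding can be sketched as follows, with one bit per candidate Legendre multi-index (an illustrative sketch; the function names are ours):

```python
from itertools import product
from math import comb

def candidate_multi_indices(m, P):
    """Enumerate all multi-indices (eta_1, ..., eta_m) with sum <= P: one
    per basis function in Phi, so nvars = (P + m)!/(P! m!) = C(P + m, m)."""
    return [eta for eta in product(range(P + 1), repeat=m) if sum(eta) <= P]

def decode_chromosome(bits, indices):
    """A chromosome is one bit per basis function; a 1-bit marks the
    corresponding basis function as active."""
    return [eta for bit, eta in zip(bits, indices) if bit == 1]
```

For m = 2 and P = 2, the candidate set contains \(\binom{4}{2} = 6\) basis functions, so a chromosome is a 6-bit string.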

Fig. 3

The diagram of chromosome coding.

To improve the optimization performance, a one-way circular transfer migration strategy is proposed to bridge the parallel searches with positive information exchange. Define the population of the kth optimization task in the ith generation as \(\mathop {\mathrm{population}}\nolimits _k^{(i)}\), which consists of three subpopulations \(A_k^{(i)},B_k^{(i)},C_k^{(i)}\). The population is ranked according to the objective value in ascending order for a minimization problem (or descending order for a maximization problem). \(A_k^{(i)}\) is the first \(W\%\) elite of the ranked population, \(C_k^{(i)}\) is the last \(W\%\), and \(B_k^{(i)}\) is the rest of \(\mathop {\mathrm{population}}\nolimits _k^{(i)}\). For \(k=1,2,...,K\), the populations are denoted as

$$\begin{aligned} \begin{array}{l} \mathop {\mathrm{population}}\nolimits _1^{(i)} = \{ A_1^{(i)},B_1^{(i)},C_1^{(i)}\} \\ ...\\ \mathop {\mathrm{population}}\nolimits _k^{(i)} = \{ A_k^{(i)},B_k^{(i)},C_k^{(i)}\} \\ ...\\ \mathop {\mathrm{population}}\nolimits _K^{(i)} = \{ A_K^{(i)},B_K^{(i)},C_K^{(i)}\} . \end{array} \end{aligned}$$
(17)
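The subpopulation partition in Eq. (17) can be sketched as follows (a hypothetical helper, assuming a minimization problem):

```python
def partition_population(population, objectives, W=20):
    """Rank a population by objective value (ascending, i.e. minimization)
    and split it into the elite A (first W%), the middle B, and the bottom
    C (last W%) subpopulations."""
    ranked = [ind for _, ind in
              sorted(zip(objectives, population), key=lambda t: t[0])]
    n_w = max(1, round(len(ranked) * W / 100))
    A = ranked[:n_w]        # top W% elite
    C = ranked[-n_w:]       # bottom W%
    B = ranked[n_w:-n_w]    # the rest
    return A, B, C
```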
Fig. 4 The flowchart of PRS-MOEM

Based on the idea of positive transfer migration, the multiple optimization tasks can mutually boost and accelerate each other by transferring elites from task to task. The subpopulations of different optimization tasks migrate every G generations. For simplicity, the transfer migration only occurs between adjacent optimization tasks. The diagram of the transfer migration between populations in different optimization tasks is shown in Fig. 5. For optimization task \(k=1,2,...,K-1\), the top \(W\%\) elite subpopulation \(A_k^{(i)}\) is transferred to task \(k+1\) and replaces the bottom-ranked subpopulation \(C_{k + 1}^{(i)}\); the elite subpopulation \(A_K^{(i)}\) of the last task is transferred back to task 1 and replaces \(C_1^{(i)}\). The transfer migration fraction \(W\%\) is often set to \(20\%\), and the generation interval G for the transfer migration operation is set to 5 in this paper. The GA search stops when the number of generations reaches a user-defined maximal value. To demonstrate the advantage of the proposed transfer migration, the performance with and without it is compared in Sect. 4.3.2.
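The circular elite transfer described above can be sketched as follows (a simplified Python sketch, assuming minimization; real GA individuals would be chromosomes rather than plain numbers):

```python
import copy

def migrate(populations, objectives, W=20):
    """One-way circular transfer migration between K tasks: the top-W%
    elite of task k replaces the bottom-W% of task k+1, and the elite of
    task K wraps around to task 1 (minimization assumed)."""
    K = len(populations)
    # Rank each population by objective value in ascending order.
    ranked = []
    for pop, obj in zip(populations, objectives):
        order = sorted(range(len(pop)), key=lambda l: obj[l])
        ranked.append([pop[l] for l in order])
    n_w = max(1, round(len(populations[0]) * W / 100))
    new_pops = []
    for k in range(K):
        donor = (k - 1) % K                        # elite comes from task k-1
        elite = copy.deepcopy(ranked[donor][:n_w])
        new_pops.append(ranked[k][:-n_w] + elite)  # drop C_k, append A_donor
    return new_pops
```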

Fig. 5 The diagram of the cyclical one-way transfer migration method

Algorithm of PRS-MOEM

To sum up, based on the preceding multitask optimization and ensemble modeling framework, the flowchart of PRS-MOEM is shown in Fig. 4, and the detailed steps are explained as follows:

  • Step 1: Design of experiments (DOE). Obtain the training point set \(D = \{ ({{\varvec{x}}^{(l)}},{y^{(l)}}),l = 1,2,...,n\}\) with DOE methods, e.g. Latin hypercube design (LHD) (Dette and Pepelyshev [36]).

  • Step 2: Partition D into subsets by K-fold cross-validation. The training set \(D = \{ ({{\varvec{x}}^{(l)}},{y^{(l)}}),l = 1,2,...,n\}\) is randomly partitioned into K disjoint sets of approximately equal size, denoted as \({D_1},{D_2},...,{D_K}\). For \(k = 1,2,...,K\), denote the training subset as \({D^{( - k)}} = D - {D_k}\) and the validation subset as \({D_k}\).

  • Step 3: Multitask optimization. For the different highest order settings \(P = 1,2,...,P_{max}\), compose the candidate basis function sets and conduct the multitask optimization. GA is used as the optimization solver, and the top \(W\%\) elite subpopulations migrate between the different optimization tasks every G generations. In this paper, \(W = 20\) and \(G = 5\). GA stops when the number of generations i reaches a user-defined preliminary value \(i_{pre}\). In this paper, to save computational time while still determining the proper highest order, \(i_{pre} = 10\).

  • Step 4: Determine the optimal active basis function vector for each subset. For the kth optimization (\(k=1,2,...,K\)), determine the proper highest order \({P_k}\) by Eq.(14) as the setting with the minimum objective value. Then, with \({P_k}\), the kth sub-model is trained by continuing the multitask optimization of Step 3 from generation \(i_{pre}\) to a user-defined maximal value \(i_{\max }\), which yields the optimal active basis function vector \(\mathbf{{S}}^{{{( - k)}^ * }}\).

  • Step 5: Ensemble modeling. Based on the optimal active basis function vectors \(\mathbf{{S}}^{{{( - k)}^ * }}\), compose the candidate active basis function set \({{\varvec{\Phi }} _s} = \bigcup \limits _{1 \le k \le K} {{\mathbf{{S}}^{{{( - k)}^ * }}}}\) and denote its number of elements as \({N_s}\). Score the significance of each basis function in this set by Eq.(15), and rank the set in descending order of the scores. Select the top \({N_0} = \min \left( {{N_s},n - 2} \right) \) elements to compose the candidate significant basis function set \({\{ {f_{(j)}}\} _{j = 1,...,{N_0}}}\). Following the ranking sequence, add one basis function at a time into the ensemble and quantify the accuracy of the resulting PRS model with \({\mathrm{AICc}}\): for \(j = 1,...,{N_0}\), add \({f_{(j)}}\) into the ensemble and calculate the corresponding accuracy \(\mathop {\mathrm{AICc}}\nolimits _{(j)}\). Select the ensemble with the minimum \({\mathrm{AICc}}\) as the optimal model with the highest accuracy.
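Step 5 amounts to a forward selection over the ranked basis functions scored by AICc. A minimal sketch is given below, assuming the common least-squares form of AICc (the exact expression used in the paper is not reproduced here) and a precomputed design matrix `Phi` whose columns are already in descending score order:

```python
import numpy as np

def aicc(rss, n, k):
    """Corrected AIC for a least-squares model with k coefficients and n
    points (one common form; an assumption here)."""
    return n * np.log(rss / n) + 2 * k + 2 * k * (k + 1) / (n - k - 1)

def forward_select(Phi, y):
    """Add the ranked basis-function columns one at a time and keep the
    prefix whose least-squares fit minimizes AICc."""
    n, N0 = Phi.shape
    best_j, best_score = None, np.inf
    for j in range(1, N0 + 1):
        X = Phi[:, :j]                        # first j ranked basis functions
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = float(np.sum((y - X @ beta) ** 2))
        score = aicc(max(rss, 1e-12), n, j)   # guard against a perfect fit
        if score < best_score:
            best_j, best_score = j, score
    return best_j, best_score
```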

Table 1 The description of the design variables for the sandwich panel design problem

Numerical examples

Test problem

To illustrate the effectiveness of the proposed method, two types of problems are chosen: (a) benchmark numerical test functions (Jamil and Yang [37]) and (b) a practical engineering problem of sandwich panel design. The details are given below:

  1. Branin function:

    $$\begin{aligned} \begin{array}{l} f({\varvec{x}}) = {({x_2} - \frac{{5.1x_1^2}}{{4{\pi ^2}}} + \frac{{5{x_1}}}{\pi } - 6)^2} \\ \\ \qquad \quad + 10(1 - \frac{1}{{8\pi }})\cos ({x_1}) + 10\\ \\ {x_1} \in [ - 5,10],{x_2} \in [0,15] \end{array} \end{aligned}$$
    (18)
  2. Three-Hump function:

    $$\begin{aligned} \begin{array}{l} f({\varvec{x}}) = 2x_1^2 - 1.05x_1^4 + \frac{{x_1^6}}{6} + {x_1}{x_2} + x_2^2\\ \\ {x_1},{x_2} \in [ - 2,2] \end{array} \end{aligned}$$
    (19)
  3. Giunta function:

    $$\begin{aligned} \begin{array}{l} f({\varvec{x}}) = 0.6 + \sum \nolimits _{i = 1}^2 {\left[ \begin{array}{l} \sin (\frac{{16{x_i}}}{{15}} - 1) + \sin ^2(\frac{{16{x_i}}}{{15}} - 1)\\ \\ + \sin (4(\frac{{16{x_i}}}{{15}} - 1))/50 \end{array} \right] } \\ \\ {x_1},{x_2} \in [ - 1,1] \end{array} \end{aligned}$$
    (20)
  4. Schaffer function:

    $$\begin{aligned} \begin{array}{l} f({\varvec{x}}) = 0.5 + \frac{{{{(\sin ({{(x_1^2 + x_2^2)}^{\frac{1}{2}}}))}^2} - 0.5}}{{1 + 0.001 \times (x_1^2 + x_2^2)}}\\ \\ {x_1} \in [ - 3,3],{x_2} \in [ - 3,3] \end{array} \end{aligned}$$
    (21)
  5. Biggs function:

    $$\begin{aligned} \begin{array}{l} f({\varvec{x}}) = \sum \limits _{i = 1}^{10} {{{({e^{ - {t_i}{x_1}}} - {x_3}{e^{ - {t_i}{x_2}}} - {y_i})}^2}} \\ \\ {t_i} = 0.1i,{y_i} = {e^{ - {t_i}}} - 5{e^{ - 10{t_i}}},i = 1,...,10\\ \\ {x_j} \in [0,20],j = 1,2,3 \end{array} \end{aligned}$$
    (22)
  6. Dette and Pepelyshev curved (DP3) function:

    $$\begin{aligned} \begin{array}{l} f({\varvec{x}}) = 4{({x_1} - 2 + 8{x_2} - 8x_2^2)^2} + {(3 - 4{x_2})^2}\\ \\ \qquad \quad + 16\sqrt{{x_3} + 1} {(2{x_3} - 1)^2}\\ \\ {x_i} \in [0,1],i = 1,2,3 \end{array} \end{aligned}$$
    (23)
  7. Colville function:

    $$\begin{aligned} \begin{array}{l} f({\varvec{x}}) = 100{(x_1^2 - {x_2})^2} + {({x_1} - 1)^2} + {({x_3} - 1)^2}\\ \\ + 90{(x_3^2 - {x_4})^2} + 10.1[{({x_2} - 1)^2} + {({x_4} - 1)^2}]\\ \\ + 19.8({x_2} - 1)({x_4} - 1)\\ \\ {x_i} \in [ - 10,10],i = 1,2,3,4 \end{array} \end{aligned}$$
    (24)
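The analytic test functions are straightforward to implement; for instance, Eqs. (18) and (21) can be transcribed directly (useful for reproducing the sampling experiments):

```python
import math

def branin(x1, x2):
    """Branin function, Eq. (18); x1 in [-5, 10], x2 in [0, 15]."""
    return ((x2 - 5.1 * x1**2 / (4 * math.pi**2) + 5 * x1 / math.pi - 6)**2
            + 10 * (1 - 1 / (8 * math.pi)) * math.cos(x1) + 10)

def schaffer(x1, x2):
    """Schaffer function, Eq. (21); x1, x2 in [-3, 3]."""
    r2 = x1**2 + x2**2
    return 0.5 + (math.sin(math.sqrt(r2))**2 - 0.5) / (1 + 0.001 * r2)
```

At the known Branin minimizer \((\pi, 2.275)\) the first squared term vanishes and the value reduces to \(10/(8\pi) \approx 0.3979\).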

The structural design of all-metal sandwich panels has seen great success in many engineering applications (Stickel and Nagarajan [38]), and the global deflection of the sandwich panel is an essential structural response in the optimal design. However, with finite element methods and physical tests, the calculation of the global deflection is time-consuming and complex. To save computation cost, surrogate modeling methods, which provide simple but reliable metamodels, are often applied. In this paper, a square-core sandwich panel design problem (Kalnins et al. [39]), shown in Fig. 6, is selected to investigate the performance of the proposed method in an engineering setting. The global deflection of the panel is determined by six design variables, which are described in Table 1. The response values of the global deflection are calculated by the commercial FEM software ANSYS using the SHELL181 4-node shell element; more details can be found in the literature (Kalnins et al. [39]). The simulation data are downloaded from http://www.cs.rtu.lv/jekabsons/datasets.html and used in this paper for surrogate modeling.

Table 2 Parameters for different surrogate models
Fig. 6 The square-core type sandwich panel

Experimental settings

To verify the proposed PRS-MOEM method, it is compared with the following popular surrogates: (1) Polynomial Response Surface (PRS), (2) Least absolute shrinkage and selection (Lasso), (3) Ordinary Kriging (OK), (4) Radial Basis Function (RBF), (5) Elastic Net (EN) (Li et al. [41]), and (6) the traditional PRS modeling method based on cross validation (PRS-CCV) introduced in Sect. 2.2. With reference to (Zhang et al. [10]; Yin et al. [11]; Fan et al. [24]; Gribonval et al. [40]; Li et al. [41]), the parameter settings are presented in Table 2.

Root mean squared error (RMSE) and maximum absolute error (MAE) (Goel et al. [18]) are used to evaluate the predictive capabilities of the surrogate models in this paper, defined as

$$\begin{aligned} \mathrm{{RMSE}} = \sqrt{\frac{1}{{{n_t}}}\sum \limits _{l = 1}^{{n_t}} {{{({y^{(l)}} - {\hat{y}}({{\varvec{x}}^{(l)}}))}^2}} } \end{aligned}$$
(25)
$$\begin{aligned} \mathrm{{MAE}} = \max |{y^{(l)}} - {\hat{y}}({{\varvec{x}}^{(l)}})|,l = 1,2,...,{n_t} \end{aligned}$$
(26)

where \({y^{(l)}}\) and \({\hat{y}}({{\varvec{x}}^{(l)}})\) denote the actual response and the predicted response at the lth test point, respectively, and \({n_t}\) is the number of test points. For a high-quality surrogate model, both RMSE and MAE should be low.
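Both metrics of Eqs. (25) and (26) are one-liners in practice; a minimal sketch:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, Eq. (25)."""
    e = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(e ** 2)))

def mae(y_true, y_pred):
    """Maximum absolute error, Eq. (26)."""
    e = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.max(np.abs(e)))
```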

Table 3 Numerical setup for test problems
Table 4 Optimal objective values of different highest order settings in Schaffer test
Table 5 Optimal objective values of different highest order settings in Biggs test
Table 6 Comparison of the average optimal objective values for Schaffer and Biggs tests

For all the benchmark numerical examples in Eqs.(18)–(24), Latin hypercube design (LHD) is used to sample the training data set. For the sandwich panel design problem, a sample set of 500 points analyzed by finite element simulation is downloaded from the online web source and randomly divided into 60 training points and 440 test points. To account for the effect of random sampling, 100 training sets are generated randomly for each test problem, as shown in Table 3, and the surrogate modeling is conducted repeatedly and independently on each. The statistical performance (the mean and variation of RMSE and MAE) of the different surrogate models can then be investigated. The variation of each prediction metric is depicted with box plots; a short tail and a small box signify a robust approximation. In addition, the computational time (CPU time) of the different surrogate modeling methods is also reported; the tests are performed on a personal computer with a 2.3 GHz CPU and 8 GB RAM.
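The sampling plan can be reproduced with a basic random Latin hypercube design on the unit hypercube (a minimal sketch; the paper may use an optimized LHD variant, and each problem's bounds must then be rescaled):

```python
import numpy as np

def lhd(n, m, seed=None):
    """Basic random Latin hypercube design: n points in [0, 1)^m with
    exactly one point in each of the n equal-width slices per dimension."""
    rng = np.random.default_rng(seed)
    cells = np.column_stack([rng.permutation(n) for _ in range(m)])
    return (cells + rng.random((n, m))) / n
```

Scaling a unit-cube design to, e.g., the Branin box is `lo + X * (hi - lo)` per dimension.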

Fig. 7 The average convergence process for Schaffer and Biggs tests

Result and discussion

Determination of the highest order

To illustrate the necessity of determining the highest order, the results of the multitask optimization under different highest order settings \(P = 1,2,...,10\) for the Schaffer and Biggs functions are shown in Tables 4 and 5, respectively. The lowest value in each column is shown in bold for ease of comparison; the kth column reports the optimal objective value for the kth sub-model. For the Schaffer test function, \(P = 8\) is the best for the first, fourth, sixth, seventh, eighth, and tenth subsets, and \(P = 9\) is the best for the second, third, fifth, and ninth subsets. Similarly, the most proper highest order for the Biggs test function is \(P = 5\) or 6, depending on the subset. The results show that the optimum obtained by GA is greatly influenced by the highest order setting. Take the Biggs test for example. When P is increased from one to five, the optimal objective values of all subsets are reduced by two orders of magnitude. This clearly indicates that with small P values, the high-order nonlinearity of the test function cannot be captured by the PRS model, which leads to a large prediction error. When \(P = 5\) or 6, the optimal objective values of the different subsets reach their corresponding lowest points. The variation of the best P values among the different optimization tasks mainly results from two issues: the randomness of the subset partition, which leads to different features being observed from different training and validation subsets, and the inherent stochastic fluctuation of GA. For the same subset, the differences between the optimal objective values for \(P = 5\) and 6 are very small, which indicates the stability of GA in solving the training optimization problems for the Biggs function at this dimension. As P continues to increase, the optimal objective values increase dramatically again because GA fails to obtain the global optimum as the design space dimension grows rapidly. Thus the most proper highest order for the Biggs test is determined as \(P = 5\) or 6 for the different subsets, according to which the best active basis function vector can be selected.

Fig. 8 The comparison between the RMSE boxplots obtained with and without transfer migration for Schaffer and Biggs tests

Effect of the transfer migration method

To demonstrate the advantage of the transfer migration, the multitask optimization is conducted with and without the transfer migration step for the Schaffer test with the highest order setting \(P = 8\) and for the Biggs test with \(P = 5\). The average convergence graphs of the first optimization task for subset one, obtained over 100 independent runs, are shown in Fig. 7. In addition, the average optimal objective values of the ten optimization tasks over the 100 independent runs are shown in Table 6, where each column reports the average optimal objective value for the kth optimization task. For both the Schaffer and Biggs tests, the GA with transfer migration converges faster, and its average optimal objective values are universally better than those without transfer migration. This confirms that the proposed method can enhance the optimization efficiency by means of positive transfer learning. On the other hand, the results show that the randomness of the subset partition can influence the optimization results considerably. Take the Biggs test for example: the maximal optimum is 3.144 for the tenth optimization task, and the minimal optimum is 1.824 for the sixth. To further investigate the effect of the proposed method on the prediction accuracy, PRS-MOEM is constructed with and without the transfer migration. Box plots of the RMSE and the mean RMSE results over the 100 independent runs for these two test functions are shown in Fig. 8 and Table 7. The model with the transfer migration performs much better than that without it for both tests, which verifies the effectiveness of the proposed multitask optimization algorithm in enhancing PRS modeling.

Table 7 Comparison between the average RMSE obtained with and without transfer migration
Table 8 Comparison of the average RMSE in Branin and Schaffer tests

The above experiments clearly show the significance of the transfer migration method. It is noteworthy, however, that without prior information it remains challenging to determine whether the transfer between different optimization tasks is positive or deleterious in practical engineering. In this paper, the similarity of the multiple optimization tasks is ensured because the training sample subsets are obtained from the same true model. Thus, the prediction accuracy of the model can be improved through useful information exchange between these related optimization tasks.

Fig. 9 Values of AICc and AIC varying with the number of active basis functions in Branin and Schaffer tests

Fig. 10 Boxplots of RMSE for the model using AICc and the model using AIC in Branin and Schaffer tests

Effect of the model selection criterion

To investigate the rationality of using AICc instead of AIC to quantify the model accuracy, both AICc and AIC are applied to select the final model for the Branin and Schaffer test functions. Box plots of the RMSE and the mean RMSE results over 100 independent runs are shown in Fig. 10 and Table 8, respectively. When AICc is chosen as the model selection criterion, the box plot has a smaller size and a shorter tail, and the mean RMSE of the models using AICc is much lower than that of the models using AIC for both test functions. This suggests that AICc selects final models with better performance. To further study the effect of AIC and AICc on basis function selection, the curves of the two criteria with respect to the number of selected active basis functions are shown in Fig. 9. The value of AIC decreases gradually as the number of active basis functions increases, whereas the value of AICc decreases first and then increases significantly, with an obvious turning point. This suggests that AIC tends to choose more basis functions, which can lead to over-fitting and lower prediction accuracy. In contrast, AICc effectively prevents this risk.
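The qualitative difference between the two criteria comes from the small-sample correction term. Under the common least-squares forms (an assumption here; the paper's exact expressions may differ), \(\mathrm{AICc} = \mathrm{AIC} + 2k(k+1)/(n-k-1)\), which is mild for small k but diverges as k approaches n:

```python
import math

def aic(rss, n, k):
    """AIC for a least-squares fit with k coefficients (one common form)."""
    return n * math.log(rss / n) + 2 * k

def aicc(rss, n, k):
    """AICc adds a correction that diverges as k approaches n - 1."""
    return aic(rss, n, k) + 2 * k * (k + 1) / (n - k - 1)

# With n = 20 points, the extra penalty over AIC is small for few basis
# functions but dominates as k nears n, producing the upturn seen in Fig. 9.
n = 20
extra = {k: aicc(1.0, n, k) - aic(1.0, n, k) for k in (2, 10, 18)}
```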

Accuracy and robustness

The accuracy of the different models is evaluated by the average prediction metrics over the 100 independent runs. Tables 9 and 10 show the mean RMSE and MAE for the different surrogate models, respectively. The lowest value in each column is shown in bold for ease of comparison. Moreover, to further evaluate the robustness of the different models, the variation of RMSE and MAE is shown with box plots in Figs. 11 and 12, respectively. The RMSE and MAE results show that PRS-MOEM performs best among the surrogates on all the test problems except the Schaffer test function. For the Three-Hump, DP3, and Colville problems, PRS-CCV also performs well, with accuracy results very close to PRS-MOEM. However, for the Branin, Giunta, Biggs, and sandwich panel design problems, PRS-MOEM performs much better than PRS-CCV. To investigate the differences in the selected active basis functions between these two models, histograms of the frequency of the active basis functions obtained over the 100 training data sets of the Branin and Giunta functions are shown in Fig. 13. For PRS-CCV, except for some specific basis functions, all the others are selected with similar frequencies, which means PRS-CCV fails to identify the significant basis functions accurately and stably under the randomly generated training sets. In contrast, for PRS-MOEM there is a clear distinction between the frequencies of the significant basis functions and the others, which further verifies the robustness of the proposed method.

Table 9 Comparison of the average RMSE for different surrogate models
Table 10 Comparison of the average MAE for different surrogate models
Table 11 Comparison of the average computational time (in seconds) for different surrogate models
Fig. 11 Box plots of RMSE for different surrogate models in (a) Branin, (b) Three-Hump, (c) Giunta, (d) Schaffer, (e) Biggs, (f) DP3, (g) Colville, and (h) the sandwich panel design problem

Fig. 12 Box plots of MAE for different surrogate models in (a) Branin, (b) Three-Hump, (c) Giunta, (d) Schaffer, (e) Biggs, (f) DP3, (g) Colville, and (h) the sandwich panel design problem

Fig. 13 Frequency of active basis functions for PRS-MOEM and PRS-CCV in Branin and Giunta tests

Computational time

To show the computational cost of the different models, the seven surrogates are repeatedly constructed for the different test functions, and Table 11 presents their average CPU time in seconds over 100 independent runs. The average calculation times of PRS, OK, and RBF are universally small for all test functions, while Lasso and EN require slightly more time. PRS-MOEM and PRS-CCV spend more time building a model, around 2 s, which is much larger than the other surrogate modeling methods. In practical engineering, however, this burden is negligible compared with the high-fidelity analysis of the training samples; for example, a single finite element analysis of a complex flight vehicle structure can take hundreds of hours. Thus, for PRS modeling it is worthwhile to spend modestly more computational cost on selecting a proper active basis function vector for better surrogate accuracy. In the future, more effective evolutionary algorithms should be studied to enhance the global optimization capability for large-scale optimization problems.

Conclusion

In this paper, an improved PRS modeling approach, PRS-MOEM, is proposed to enhance the prediction accuracy through more robust basis function selection based on a multitask optimization and ensemble modeling framework. Unlike traditional PRS modeling based on cross validation, which directly builds a single model guided by the total error estimation over all the training subsets, the proposed method constructs a sub-model for each training subset. By properly integrating all the sub-models, the information in the subsets can be fully exploited. To construct all the sub-models effectively by solving the multitask optimization problem, an improved evolutionary algorithm with transfer migration is developed, which significantly enhances the optimization efficiency and robustness through useful information exchange between similar optimization tasks. To obtain a well-performing ensemble from the sub-models, a scoring method is proposed that measures the significance of each basis function according to the error estimation of the sub-models and the occurrence frequency of the basis function across the sub-models. Then, based on the AICc criterion, the significant basis functions are selected and the optimal ensemble is defined. PRS-MOEM can effectively mitigate the negative influence of sub-models with large prediction errors and alleviate the uncertainty resulting from the randomness of the training subsets. Thus, the accuracy and robustness of the basis function selection are effectively enhanced, as verified by seven numerical examples and one practical engineering application in the test section. In future work, the applicability of PRS-MOEM will be investigated on a more diverse set of test problems as well as on complex practical engineering problems. The extension of PRS-MOEM to effectively solve high-dimensional problems is also a promising research direction.