Polynomial Response Surface based on basis function selection by multitask optimization and ensemble modeling

Polynomial Response Surface (PRS) is a commonly used surrogate model owing to its simplicity, good interpretability, and computational efficiency. The performance of PRS largely depends on its basis functions, and with limited samples, correctly selecting basis functions remains a challenging problem. To improve prediction accuracy, a PRS modeling approach based on multitask optimization and ensemble modeling (PRS-MOEM) is proposed for rational and robust basis function selection. First, the training set is partitioned into multiple subsets by the cross-validation method, and for each subset a sub-model is independently constructed by optimization. To effectively solve these multiple optimization tasks, an improved evolutionary algorithm with transfer migration is developed, which enhances the optimization efficiency and robustness through useful information exchange between these similar optimization tasks. Second, a novel ensemble method is proposed to integrate the multiple sub-models into the final model. The significance of each basis function is scored according to the error estimation of the sub-models and the occurrence frequency of the basis function in all the sub-models. The basis functions are then ranked and selected based on the bias-corrected Akaike's information criterion. PRS-MOEM can effectively mitigate the negative influence of sub-models with large prediction error, and alleviate the uncertainty resulting from the randomness of the training subsets. Thus the basis function selection accuracy and robustness can be enhanced. Seven numerical examples and an engineering problem are utilized to test and verify the effectiveness of PRS-MOEM.


Introduction
Despite the tremendous growth in computer processing power, computationally expensive problems occur frequently in many scientific and engineering disciplines where complex computer simulations are used. In these cases, obtaining more data requires additional experiments, which results in highly non-trivial computational expense (Forrester and Keane [1]). As a result, surrogate models have been widely used to replace complex simulation models (Namura et al. [2]). Their application fields include multidisciplinary design optimization (Yao et al. [3]), uncertainty analysis (Yao et al. [4]), and so on.
At present, the commonly used surrogate models are Polynomial Response Surface (PRS) (Goel et al. [5]), Multivariate Adaptive Regression Splines (MARS) (Gu and Wahba [6]), Kriging (Clark et al. [7]), Radial Basis Function (RBF) (Yao et al. [8]), and Support Vector Regression (SVR) (Clarke et al. [9]). There are also hybrid surrogate modeling paradigms in which different surrogate models are combined to offer effective solutions (Zhang et al. [10]; Yin et al. [11]). Among these surrogates, PRS is popular because of its simplicity, good interpretability, and computational efficiency (Bhosekar and Ierapetritou [12]). However, PRS is unsuitable for non-linear, multimodal, multi-dimensional design landscapes (Forrester and Keane [1]). Because of these limitations, enhanced versions that remove or suppress unnecessary variables have been proposed, such as subset selection (Furnival and Wilson [13]) and regularization (Tibshirani [14]). Subset selection methods address the trade-off between prediction error and regression model complexity by selecting a subset of variables. Stepwise regression is an effective way to solve the subset selection problem (Hosseinpour et al. [15]). Borrowing the idea of stepwise regression, Giustolisi et al. (O. Giustolisi and Doglioni [16]) propose Evolutionary Polynomial Regression (EPR), which uses an evolutionary process rather than following the hill-climbing strategy of stepwise regression. Among regularization methods, the Least absolute shrinkage and selection operator (Lasso) is widely used (Tibshirani [14]); it shrinks some coefficients and sets others to zero, retaining the good features of both subset selection and ridge regression. Zou et al. (Zou and Hastie [17]) propose the elastic net (EN) technique to fix the Lasso's failure to select correlated grouped variables.
For both subset selection and regularization methods, performance depends largely on the basis functions included in the model. Choosing the significant terms enhances the prediction ability of the model. Commonly, basis function selection is guided by a specific error estimation of the model, and an error estimate that accords with the true error of the model can guide the selection more effectively. Currently, mean squared error estimation based on cross validation (CV), such as the prediction error sum of squares (PRESS) (Goel et al. [18]), is a popular way to estimate the global accuracy of the model. However, with limited samples available, such CV-based error estimation is influenced by the data partition scheme and may not estimate the true global average error well. In this situation, it is difficult to select significant basis functions effectively for PRS. To avoid this problem, Gu et al. (Gu and Wei [19]) propose a robust model structure selection method, which selects the significant model terms according to the overall mean absolute error of the resampled subsets. Although the method achieves good results on some numerical examples, it incurs large computational costs. Moreover, the aforementioned modeling methods rely on the error estimation of all the CV subsets, which may result in incorrect basis function selection because certain subsets with large estimation error can exert a dominant influence (as discussed in Sect. 2.3).
In this paper, a novel approach based on multitask optimization and ensemble modeling (PRS-MOEM) is proposed to effectively mitigate the negative influence of the subsets with large estimation error due to the random partition, and to enhance the basis function selection accuracy and robustness. First, multiple subsets are partitioned from the training set based on cross validation. Instead of the traditional method, which directly builds a single model guided by CV-based error estimation, multiple sub-models are constructed by building a surrogate for each subset. These sub-model construction processes are solved in parallel as multiple optimization tasks. To improve the optimization performance, multitask optimization can be adopted (Ong and Gupta [20]; Naik and Rangwala [21]), and an improved evolutionary algorithm with transfer migration is developed to solve the multitask optimization problem, which can significantly enhance the optimization efficiency and robustness through useful information exchange between the similar optimization tasks. Second, a novel ensemble method is proposed to integrate the multiple sub-models into the final model. There is relevant research on ensemble modeling, which integrates multiple models into one model; however, previous studies usually employ the weighted sum approach (Fang et al. [22]; Zhou and Jiang [23]). Since some sub-models may deviate greatly from the true model due to the specific features of their training subsets (the subsets are randomly partitioned), this weighted sum approach may result in wrongly selected basis functions, reducing the interpretability of the PRS model as well as its accuracy. To obtain a well-performing ensemble, this paper proposes a scoring method that measures the significance of each basis function according to the error estimation of the sub-models and the occurrence frequency of the basis function in all the sub-models.
The basis functions are ranked according to the significance scores in descending order. Basis functions are then added into the ensemble one at a time, and the model accuracy is measured by the bias-corrected Akaike's information criterion (AICc). The ensemble with the lowest AICc is chosen as the final model. In this way, the negative influence of sub-models with large prediction error can be mitigated, and the uncertainty resulting from the random partition of subsets can be alleviated. Thus both the basis function selection accuracy and the algorithm robustness can be effectively enhanced. The main contributions of this paper can be summarized as follows:
- A PRS modeling problem is decomposed into multiple subproblems, each building the sub-model for one subset separately. Therefore, the potential of each subset can be fully explored and the phenomenon of wrong dominance by specific subsets can be mitigated.
- Multitask optimization is introduced to simultaneously solve the subproblems, and an improved evolutionary algorithm with transfer migration is developed to enhance the optimization efficiency and robustness.
- An ensemble modeling method for integrating multiple sub-models is proposed. Using a novel scoring strategy, the method sorts the basis functions of the sub-models, combines them serially, and selects a final optimum model.
The rest of the paper is organized as follows. In Sect. 2, Polynomial Response Surface (PRS) is briefly reviewed and the disadvantage of the traditional CV-based modeling method is analyzed. In Sect. 3, the PRS method based on basis function selection by multitask optimization and ensemble modeling (PRS-MOEM) is developed in detail. In Sect. 4, the proposed PRS-MOEM is tested on seven numerical examples and one practical engineering problem, followed by conclusions in the final section.

Polynomial regression surface
The PRS model is derived from the linear regression model; in matrix form it can be written as

$$\hat{y}(\mathbf{x}) = \mathbf{f}(\mathbf{x})^{T}\boldsymbol{\beta} \tag{1}$$

where $\boldsymbol{\beta} = [\beta_1, \beta_2, \ldots, \beta_{nvars}]^T$ is the regression coefficient vector, $\mathbf{f}(\mathbf{x}) = [f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_{nvars}(\mathbf{x})]^T$ is the basis function vector, $f_i$ is a basis function, and $nvars$ is the number of basis functions. Given a set of training points $\mathbf{x}^{(l)} \in \mathbb{R}^m$, $l = 1, 2, \ldots, n$, and the corresponding actual response vector $\mathbf{y} = [y^{(1)}, y^{(2)}, \ldots, y^{(n)}]^T$, the design matrix is defined as

$$\mathbf{F} = \begin{bmatrix} f_1(\mathbf{x}^{(1)}) & \cdots & f_{nvars}(\mathbf{x}^{(1)}) \\ \vdots & \ddots & \vdots \\ f_1(\mathbf{x}^{(n)}) & \cdots & f_{nvars}(\mathbf{x}^{(n)}) \end{bmatrix} \tag{2}$$

The least squares method is often used to solve the regression coefficients as

$$\boldsymbol{\beta} = (\mathbf{F}^{T}\mathbf{F})^{-1}\mathbf{F}^{T}\mathbf{y} \tag{3}$$

To lower the mutual coherence of the design matrix, the multivariable Legendre orthogonal polynomial (Fan et al. [24]) is applied to form the basis functions in this paper, which is

$$f(\mathbf{x}) = \prod_{j=1}^{m} L_{\eta_j}(x_j), \quad \sum_{j=1}^{m}\eta_j \le P \tag{4}$$

where $\eta_j$ is the order of the $j$th univariate Legendre polynomial $L_{\eta_j}(x_j)$, and $P$ is a user-defined highest order of polynomials. $L_{\eta_j}(x_j)$ is determined by the recursive definition: with $L_0(x_j) = 1$ and $L_1(x_j) = x_j$,

$$(\eta + 1)L_{\eta+1}(x_j) = (2\eta + 1)x_j L_{\eta}(x_j) - \eta L_{\eta-1}(x_j) \tag{5}$$
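As a concrete illustration, the Legendre basis construction and least-squares fit of Eqs. (1)-(5) can be sketched in Python. This is a minimal sketch; the helper names `legendre_1d` and `design_matrix` are illustrative, not from the paper.

```python
import numpy as np

def legendre_1d(eta, x):
    """Evaluate the univariate Legendre polynomial L_eta(x) via the
    three-term recurrence: L_0 = 1, L_1 = x,
    (eta+1) L_{eta+1}(x) = (2*eta+1) x L_eta(x) - eta L_{eta-1}(x)."""
    if eta == 0:
        return np.ones_like(x)
    prev, cur = np.ones_like(x), x
    for k in range(1, eta):
        prev, cur = cur, ((2 * k + 1) * x * cur - k * prev) / (k + 1)
    return cur

def design_matrix(X, orders):
    """Build F[l, i] = prod_j L_{eta_ij}(x_j^(l)); each row of `orders`
    is the multi-index of one multivariate Legendre basis function."""
    n, m = X.shape
    F = np.ones((n, len(orders)))
    for i, eta in enumerate(orders):
        for j in range(m):
            F[:, i] *= legendre_1d(eta[j], X[:, j])
    return F

# Fit a noise-free 1-D quadratic target with basis orders 0, 1, 2 on [-1, 1].
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 1))
y = 3.0 * X[:, 0] ** 2 + 0.5 * X[:, 0] + 1.0
orders = [(0,), (1,), (2,)]
F = design_matrix(X, orders)
beta, *_ = np.linalg.lstsq(F, y, rcond=None)  # least-squares coefficients, Eq. (3)
# beta recovers the exact Legendre expansion: [2.0, 0.5, 2.0]
```

Since $3x^2 + 0.5x + 1 = 2L_0(x) + 0.5L_1(x) + 2L_2(x)$, the fitted coefficients match the analytic expansion exactly in the noise-free case.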

Traditional PRS modeling method based on cross validation
PRS modeling first requires defining the basis function vector, which in turn needs a separate sample set for model validation and basis function selection. However, with limited samples available, it is difficult to set aside such a separate sample set. To address this problem, the cross-validation method is widely used, as it provides good error estimation when the sample size is small (Bischl et al. [25]). For $K$-fold cross-validation, the training set $D = \{(\mathbf{x}^{(l)}, y^{(l)}),\ l = 1, 2, \ldots, n\}$ is randomly partitioned into $K$ disjoint sets of approximately equal size, denoted as $D_1, D_2, \ldots, D_K$. Note that for small data sizes, 10-fold cross-validation provides an almost unbiased estimate of the prediction error; with a relatively large number of samples, a small $K$ (such as 5) is preferred to avoid high computational cost (Simon [26]). For $k = 1, 2, \ldots, K$, $D_k$ is the validation subset with the corresponding training subset $D^{(-k)} = D - D_k$. The candidate basis function set is defined as $\Phi = \{f_i\}_{i=1,\ldots,nvars}$. The traditional modeling method tries to find an active basis function set from $\Phi$ by minimizing the error estimation of the final model. Here 'active' means that the basis function is selected into the final model, while 'inactive' means the opposite. First, the selected active basis function vector with $N_{active}$ items is defined as

$$\mathbf{S}(\mathbf{x}) = [f_{(1)}(\mathbf{x}), f_{(2)}(\mathbf{x}), \ldots, f_{(N_{active})}(\mathbf{x})]^{T} \tag{6}$$

The design matrix $\mathbf{F}^{(-k)}$ is constructed with the active basis function vector $\mathbf{S}$ for $D^{(-k)}$ as

$$\mathbf{F}^{(-k)} = [\mathbf{S}(\mathbf{x}^{(1)}), \mathbf{S}(\mathbf{x}^{(2)}), \ldots, \mathbf{S}(\mathbf{x}^{(n_k)})]^{T}, \quad \mathbf{x}^{(l)} \in D^{(-k)} \tag{7}$$

where the superscript $T$ denotes the transpose of the matrix and $n_k$ is the number of points in the subset $D^{(-k)}$. Then the regression coefficient vector $\boldsymbol{\beta}^{(-k)}$ is calculated by

$$\boldsymbol{\beta}^{(-k)} = (\mathbf{F}^{(-k)})^{+}\mathbf{y}^{(-k)} \tag{8}$$

where $\mathbf{y}^{(-k)}$ is the actual response vector of the training samples in $D^{(-k)}$, and $(\mathbf{F}^{(-k)})^{+}$ is the Moore-Penrose pseudo-inverse of $\mathbf{F}^{(-k)}$. Based on $\mathbf{S}$ and $\boldsymbol{\beta}^{(-k)}$, the sub-model corresponding to the training subset $D^{(-k)}$ can be obtained as

$$\hat{y}^{(-k)}(\mathbf{x}) = \mathbf{S}(\mathbf{x})^{T}\boldsymbol{\beta}^{(-k)} \tag{9}$$

Based on $D_k$, the prediction error of the model can be estimated.
As the commonly used ordinary cross-validation error estimation criterion (CV) leads to large bias when the sample size is small (Yanagihara et al. [27]), the bias-corrected cross-validation criterion (CCV) is chosen in this paper for error estimation; its expression (Eq. (10)) follows Yanagihara et al. [27] and is denoted here as $\mathrm{CCV}(\boldsymbol{\beta}^{(-k)})$. CCV calculates the prediction error of the model in its first term and additionally accounts for the fitting error in its second term, which avoids large bias with limited samples. For $k = 1, 2, \ldots, K$, the aforementioned steps are repeated to obtain the prediction errors of all the sub-models based on the active basis function vector $\mathbf{S}$. By minimizing the error sum $\sum_{k=1}^{K}\mathrm{CCV}(\boldsymbol{\beta}^{(-k)})$, the optimal active basis function vector can be obtained. The diagram of the above method is shown in Fig. 1a, and the optimization task can be formulated as

$$\min_{\mathbf{S} \subseteq \Phi}\ \sum_{k=1}^{K}\mathrm{CCV}(\boldsymbol{\beta}^{(-k)}) \tag{11}$$

The number of active basis functions in the final model, $N_{active}$, has to meet the constraint $N_{active} \le \min_{1 \le k \le K}(n_k) - 1$ (Lee et al. [28]).
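The traditional CV-based error sum of Eq. (11), evaluated for one fixed active basis set, can be sketched as follows. For brevity, a plain validation mean squared error stands in for the bias-corrected CCV, and the function names are illustrative:

```python
import numpy as np

def kfold_indices(n, K, rng):
    """Randomly partition indices 0..n-1 into K disjoint folds."""
    idx = rng.permutation(n)
    return np.array_split(idx, K)

def cv_error_sum(F, y, K=10, seed=0):
    """Sum over the K folds of the validation MSE of each sub-model,
    each fitted by least squares on its training subset D^(-k).
    (A plain MSE stands in for the paper's bias-corrected CCV.)"""
    rng = np.random.default_rng(seed)
    folds = kfold_indices(len(y), K, rng)
    total = 0.0
    for k in range(K):
        val = folds[k]
        trn = np.setdiff1d(np.arange(len(y)), val)
        beta = np.linalg.pinv(F[trn]) @ y[trn]          # Moore-Penrose fit, Eq. (8)
        total += np.mean((y[val] - F[val] @ beta) ** 2)  # validation error on D_k
    return total
```

The traditional method of Eq. (11) would wrap this evaluation inside a search over candidate active sets, keeping the set with the smallest total.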

Disadvantage of the traditional PRS modeling based on cross validation
In solving Eq. (11), the optimal active basis function vector $\mathbf{S}$ is obtained by minimizing the total sub-model error sum $\sum_{k=1}^{K}\mathrm{CCV}(\boldsymbol{\beta}^{(-k)})$, which may easily lead to incorrect selection due to the dominant influence of certain sub-models with large error estimation. A one-dimensional problem is used for illustration in Fig. 2. The sample set is divided into two subsets, labeled subset A (blue dots) and subset B (red dots). To demonstrate the influence of the subsets on the final model construction, the optimal sub-model for each training subset is built first. With training subset A and validation subset B, sub-model 1 (brown dashed line) is obtained by minimizing the subset estimation error stated in Eq. (10). Similarly, sub-model 2 (red dashed line) is obtained with training subset B and validation subset A. Obviously, the nonlinear sub-model 1 is more in line with the true model (green dashed line), while sub-model 2 is just a linear model (only linear basis functions are selected) that deviates greatly from the true response. This indicates that with an improper subset partition (which may happen with high probability due to the random CV partition method), the optimal model obtained by minimizing the estimation error is far from the true model; that is, it fails to correctly identify the basis functions with training subset B.
Then, by solving Eq. (11) to minimize the total error sum over all the subsets, the final model is obtained (black solid line). It can be observed that the final model is also a linear model, very close to sub-model 2. This is because the large error estimation of sub-model 2 dominates the total error sum, which accordingly guides the optimization search towards minimizing the error estimation of sub-model 2. Therefore, the basis function selection of the final model is close to that of sub-model 2. The underlying reason for this problem is that the active basis function vector is optimized by considering its performance on all the subsets (the total error sum), so the optimization process can easily be dominated by specific subsets with large estimation error. To address this problem, an intuitive idea is to optimize the active basis function vector for each subset separately, so that the potential of each subset can be fully explored and the phenomenon of wrong dominance by specific subsets can be mitigated. Then, based on all the potential active basis function vectors, the final model can be assembled according to a certain criterion. Based on this idea, a novel PRS modeling approach is proposed in this paper. It optimizes the active basis function vector and constructs the corresponding sub-model for each subset separately and in parallel based on multitask optimization (MO). Then the active basis functions are scored and selected by ensemble modeling (EM). This PRS modeling framework is developed in Sect. 3.

Multitask optimization
MO (Jin et al. [29]; Liao et al. [30]) can simultaneously solve multiple tasks in a single run and achieve better performance through positive knowledge transfer, and it has shown high efficiency on expensive optimization problems (Ding et al. [31]; Wang et al. [32]). As discussed in the above section, a PRS modeling optimization problem can be decomposed into $K$ subproblems, each optimizing the active basis function vector for one subset separately. Due to the similarity of these subproblems and the independence of their optimization processes, MO is introduced in this paper to construct the sub-model for each subset. Suppose that $\mathbf{S}^{(-k)}$ is the feasible active basis function vector of the $k$th optimization task; then the subproblems to be simultaneously addressed can be formulated as

$$\min_{\mathbf{S}^{(-k)}}\ \mathrm{CCV}(\boldsymbol{\beta}^{(-k)}), \quad k = 1, 2, \ldots, K \tag{12}$$

The details of the optimization problem are clarified in the next section.

PRS-MOEM method
The PRS-MOEM framework is shown in Fig. 1b. It mainly includes two key parts: multitask optimization, which obtains the potential active basis function vectors and constructs the optimal sub-models for all the subsets based on $K$-fold cross validation; and ensemble modeling, which integrates all the sub-models into the final model. Furthermore, to effectively solve the multitask optimization problem, an improved evolutionary algorithm with transfer migration is developed.

Multitask optimization for sub-model construction
For the subsets $D_k$ and $D^{(-k)}$, the active basis function vector can be obtained by solving the following optimization:

$$\min_{\mathbf{S}^{(-k)}}\ \mathrm{CCV}(\boldsymbol{\beta}^{(-k)}) \tag{13}$$

where $\mathbf{S}^{(-k)}$ is the active basis function vector of the $k$th sub-model trained on the subset $D^{(-k)}$, and $N_{active}^{(-k)}$ is the number of items in this vector.
For the $K$ subsets, $K$ optimization tasks have to be conducted to construct all the sub-models. Considering that the optimization processes of the $K$ sub-models are independent of each other, multitask optimization can be adopted to solve the $K$ optimization tasks in parallel. Meanwhile, the optimization tasks all serve the same purpose, namely obtaining the optimal surrogate based on training sample subsets drawn from the same true model, so these optimization tasks have an inherent similarity. Previous research has shown that positive transfer can outweigh deleterious negative transfer, especially when there is some prior understanding of the similarity between the black-box optimization tasks (Cheng et al. [33]; Feng et al. [34]). Thus, a transfer migration method is proposed herein to exchange useful information between the different optimization tasks and enhance optimization efficiency; the details are presented in Sect. 3.3.
Another important issue in the optimization problem of Eq. (13) is the design space, i.e., the optional basis function set $\Phi$. In this paper, the multivariable Legendre orthogonal polynomials are used to form the basis function set, the items of which are defined by the highest order (denoted as $P$) of the polynomials. Theoretically, a higher order is preferred, as the larger design space covers more possible situations. However, since $\Phi$ quickly enlarges with increasing $P$, the optimization difficulty increases dramatically due to the fast growth of the number of optimization variables. Despite the long history of optimization algorithm research, effectively solving large-scale optimization problems with global convergence capability remains very challenging. Thus, with larger $P$ the optimization effectiveness may not be guaranteed, and the result may be a local optimum that greatly degrades the surrogate accuracy. With smaller $P$ the optimization algorithm is more robust in obtaining the global optimum, but it may fail to capture the high-order polynomials of the true model. Properly defining $P$ is therefore a difficult problem. In this paper, a pre-training method is proposed which conducts the multitask optimization under different highest order settings, i.e., $P = 1, 2, \ldots, P_{max}$, where $P_{max}$ is a user-defined maximum, and obtains the preliminary objective values, denoted here as $\varepsilon_k^{*}(P)$, $k = 1, \ldots, K$, for each subset optimization task under the different $P$ values. Then, for the $k$th subset, the proper highest order is defined as

$$P_k = \arg\min_{P \in \{1, \ldots, P_{max}\}} \varepsilon_k^{*}(P) \tag{14}$$

With $P_k$, the $k$th sub-model is trained, and the corresponding optimal active basis function vector $\mathbf{S}^{(-k)*}$ can be obtained. The necessity of the highest order setting is illustrated in Sect. 4.3.1.
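The pre-training selection of the highest order (Eq. (14)) amounts to a column-wise argmin over the preliminary objective values. A toy sketch with made-up objective values (the function name and data are illustrative):

```python
import numpy as np

def select_highest_order(objective, P_max):
    """Pre-training step: `objective[P][k]` holds the preliminary
    objective value of the kth subset task under highest order P
    (P = 1..P_max).  The proper order P_k for each subset is the one
    with the minimum preliminary objective value (Eq. (14))."""
    table = np.array([objective[P] for P in range(1, P_max + 1)])  # shape (P_max, K)
    return 1 + np.argmin(table, axis=0)  # best P per subset task

# Toy pre-training results for K = 3 subsets and P_max = 4:
objective = {1: [9.0, 8.0, 7.5],
             2: [4.0, 4.2, 3.9],
             3: [1.1, 1.3, 2.5],
             4: [1.2, 1.0, 2.6]}
P_k = select_highest_order(objective, 4)  # → array([3, 4, 3])
```

Each subset task may thus end up with its own highest order, exactly as observed for the Schaffer and Biggs tests in Sect. 4.3.1.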

Ensemble modeling
How to combine the sub-models into a well-performing final ensemble is an important and challenging issue. Previous studies generally use the weighted sum approach, but it may lead to incorrect selection of basis functions for reasons similar to those affecting PRS modeling based on the total error sum of all the subsets, as analyzed in Sect. 2.3. Thus, both the interpretability and the accuracy of the ensemble are reduced.
To address this problem, instead of directly summing all the sub-models into an ensemble, this paper proposes constructing the final model by quantitatively scoring and rationally selecting the basis functions. First, a novel scoring method is proposed to measure the significance of each basis function. Two important issues should be taken into consideration during scoring, namely the sub-model accuracy and the basis function occurrence frequency. On the one hand, if a sub-model has high accuracy as well as low complexity (fewer active basis functions), it is more likely to match the true model; thus the active basis functions of this sub-model are more significant for surrogate modeling and should have a higher probability of being selected. On the other hand, if a basis function is active in many sub-models, it is more likely to be included in the true model because of its high occurrence frequency among the sub-models. According to these considerations, the significance metric for scoring the $i$th basis function in the $k$th sub-model is defined as

$$score_{ik} = \begin{cases} \dfrac{1}{\varepsilon_k^{*}\, N_{active}^{(-k)*}}, & \text{if } f_i \text{ is active in the } k\text{th sub-model} \\ 0, & \text{otherwise} \end{cases} \tag{15}$$

where $\varepsilon_k^{*}$ is the optimal objective value of the $k$th sub-model and $N_{active}^{(-k)*}$ is the number of active basis functions in it. Each basis function $f_i$ ($i = 1, 2, \ldots, N_s$, where $N_s$ is the number of candidate basis functions) is first scored in each sub-model as $score_{ik}$; then the final score $score_i$ of each basis function is obtained by summing $score_{ik}$ across all the sub-models. From Eq. (15), it can be observed that the following rules apply for basis function selection:
- First, if the $i$th basis function is inactive in the $k$th sub-model, then its sub-score $score_{ik}$ in this sub-model is zero. If it is active, its sub-score is calculated according to the accuracy and complexity of this sub-model.
- Second, if a sub-model has a smaller $\varepsilon_k^{*}$, meaning the sub-model performs well with high accuracy, then the active basis functions in this sub-model tend to have higher scores, as this sub-model is more likely to match the true model.
- Third, if a sub-model has a larger $N_{active}^{(-k)*}$, meaning the sub-model is more complex with a large number of basis functions, then the active basis functions in this sub-model tend to have lower scores, so as to prevent over-fitting.
- Fourth, by summing the sub-scores of each basis function across all the sub-models, an overall significance assessment of each basis function is obtained that takes all the subsets into consideration.
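The scoring rules above can be sketched as follows. The sub-score form $1/(\varepsilon_k^{*} N_{active})$ is one plausible reading of Eq. (15) that satisfies all four rules; the function name and the toy data are illustrative:

```python
import numpy as np

def score_basis_functions(active_sets, eps_star, n_basis):
    """Score each candidate basis function across K sub-models.
    `active_sets[k]` lists the indices of the basis functions active in
    sub-model k, `eps_star[k]` is its optimal objective value.  Active
    terms in accurate (small eps_star) and simple (small N_active)
    sub-models score higher; inactive terms contribute 0."""
    scores = np.zeros(n_basis)
    for k, active in enumerate(active_sets):
        sub = 1.0 / (eps_star[k] * len(active))  # sub-score of this sub-model
        for i in active:
            scores[i] += sub                     # sum across sub-models
    return scores

# Three sub-models over 4 candidate basis functions:
active_sets = [[0, 1], [0, 2], [0, 1, 3]]
eps_star = [0.5, 1.0, 2.0]
scores = score_basis_functions(active_sets, eps_star, 4)
# basis 0 (active everywhere) scores highest; basis 3 (active only in the
# least accurate, most complex sub-model) scores lowest
```

In this toy run, basis function 0 accumulates sub-scores from all three sub-models and dominates, while basis function 3 receives only the small contribution of the worst sub-model.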
After the scoring procedure, the basis functions are sequenced according to their significance scores in descending order, and the top $N_0 = \min(N_s, n - 2)$ elements are selected to compose the candidate significant basis function set $\{f_{(j)}\}_{j=1,\ldots,N_0}$. In this paper, it is proposed to add one basis function of $\{f_{(j)}\}_{j=1,\ldots,N_0}$ into the ensemble at a time according to the ranking sequence and to quantify the accuracy of this ensemble. After the $j$th basis function $f_{(j)}$ is added, there are $j$ active basis functions in the current ensemble, based on which the PRS model can be built as in Sect. 2.1. Denote the prediction model of this ensemble as $\hat{y}^{(j)}(\mathbf{x})$. Then the bias-corrected Akaike's information criterion (AICc) (Hurvich and Tsai [35]) is used to quantify the model accuracy:

$$AICc^{(j)} = n\ln\!\left(\frac{SSE_j}{n}\right) + 2j + \frac{2j(j+1)}{n - j - 1}, \quad SSE_j = \sum_{l=1}^{n}\left(y^{(l)} - \hat{y}^{(j)}(\mathbf{x}^{(l)})\right)^2 \tag{16}$$

After all the basis functions have been added, there are in total $N_0$ candidate ensemble models. By comparing $AICc^{(j)}$ ($1 \le j \le N_0$), the ensemble with the minimum AICc is selected as the final model with the highest accuracy.
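The rank-and-select loop can be sketched as follows, assuming the standard Hurvich-Tsai AICc form for least-squares models; the function names are illustrative:

```python
import numpy as np

def aicc(n, sse, j):
    """Bias-corrected AIC for a least-squares model with j active basis
    functions (the standard Hurvich-Tsai form is assumed here)."""
    return n * np.log(sse / n) + 2 * j + 2 * j * (j + 1) / (n - j - 1)

def select_by_aicc(F_ranked, y):
    """Columns of F_ranked are the candidate basis functions evaluated
    at the training points, already sorted by significance score.  Add
    them one at a time and keep the ensemble size with minimum AICc."""
    n, N0 = F_ranked.shape
    best_j, best_val = 1, np.inf
    for j in range(1, N0 + 1):
        Fj = F_ranked[:, :j]
        beta = np.linalg.pinv(Fj) @ y             # least-squares fit of ensemble j
        sse = np.sum((y - Fj @ beta) ** 2)
        val = aicc(n, max(sse, 1e-300), j)        # guard against log(0)
        if val < best_val:
            best_j, best_val = j, val
    return best_j
```

For a noisy linear target with ranked columns $[1, x, x^3]$, the constant-only model ($j = 1$) is rejected, and the criterion trades the small residual gain of the third column against its complexity penalty.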

Evolutionary algorithm with transfer migration
In the PRS-MOEM framework, the efficacy of solving the multitask optimization is very important. To enhance the optimization effectiveness, the evolutionary genetic algorithm (GA) is applied and improved in the following two aspects. First, the chromosomes of the GA are coded to indicate the active states of the basis functions. Second, a transfer migration strategy is introduced to exchange useful information between the optimization tasks. Note that searching for the optimal chromosome is essentially finding the most proper active basis function vector in the basis function set $\Phi$.
To improve the optimization performance, a one-way circular transfer migration strategy is proposed to bridge the parallel searches with positive information exchange. Define the population of the $k$th optimization task in the $i$th generation as $population_k^{(i)}$, which consists of three subpopulations. The population is ranked according to the objective value in ascending order for a minimization problem (or descending order for a maximization problem). Based on the idea of positive transfer migration to enhance optimization efficiency, the multiple optimization tasks can be mutually boosted and accelerated by transferring the elite individuals from task to task. The subpopulations between different optimization tasks migrate every $G$ generations.
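A minimal sketch of a GA with one-way circular elite migration follows. It is deliberately simplified relative to the paper's algorithm (binary chromosomes, elitist selection with bit-flip mutation only, a plain validation MSE standing in for CCV, and no three-subpopulation structure), and all names are illustrative:

```python
import numpy as np

def fitness(chrom, F, y, folds):
    """Cross-validation error of the sub-model whose active basis
    functions are the set bits of `chrom` (plain validation MSE here,
    standing in for the paper's bias-corrected CCV)."""
    cols = np.flatnonzero(chrom)
    if cols.size == 0:
        return np.inf
    err = 0.0
    for val in folds:
        trn = np.setdiff1d(np.arange(len(y)), val)
        beta = np.linalg.pinv(F[np.ix_(trn, cols)]) @ y[trn]
        err += np.mean((y[val] - F[np.ix_(val, cols)] @ beta) ** 2)
    return err

def evolve_multitask(tasks, n_basis, pop=20, gens=40, G=5, n_elite=2, seed=0):
    """Minimal multitask GA: one population per task (binary chromosomes
    mark active basis functions), elitist selection with bit-flip
    mutation, and one-way circular migration of each task's elite into
    the next task's population every G generations."""
    rng = np.random.default_rng(seed)
    pops = [rng.integers(0, 2, (pop, n_basis)) for _ in tasks]
    for gen in range(gens):
        for t, (F, y, folds) in enumerate(tasks):
            fit = np.array([fitness(c, F, y, folds) for c in pops[t]])
            elite = pops[t][np.argsort(fit)[:n_elite]]       # best first (minimization)
            children = elite[rng.integers(0, n_elite, pop - n_elite)].copy()
            children[rng.random(children.shape) < 0.1] ^= 1  # bit-flip mutation
            pops[t] = np.vstack([elite, children])
        if (gen + 1) % G == 0:                               # circular elite transfer
            elites = [p[:n_elite].copy() for p in pops]
            for t in range(len(tasks)):
                pops[t][-n_elite:] = elites[t - 1]
    return pops
```

Because the tasks are fitted to subsets of the same true model, an elite chromosome (i.e., active basis set) found by one task is likely to be useful to its neighbor, which is the rationale for the circular transfer.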

Algorithm of PRS-MOEM
To sum up, based on the preceding multitask optimization and ensemble modeling framework, the flowchart of PRS-MOEM is shown in Fig. 4. In the pre-training step, the proper highest order $P_k$ for each subset is determined as the one with the minimum preliminary objective value. Then, with $P_k$, the $k$th sub-model is trained using multitask optimization in Step 3, from generation $i_{pre}$ up to a user-defined maximum $i_{max}$, and the optimal active basis function vector $\mathbf{S}^{(-k)*}$ is accordingly obtained.

Test problem
To illustrate the effectiveness of the proposed method, two types of problems are chosen: (a) benchmark numerical test functions (Jamil and Yang [37]) and (b) a practical engineering problem of sandwich panel design. The seven numerical test functions are:
1. Branin function
2. Three-Hump function
3. Giunta function
4. Schaffer function
5. Biggs function
6. Dette and Pepelyshev curved (DP3) function
7. Colville function
The structural design of all-metal sandwich panels has achieved great success in many engineering applications (Stickel and Nagarajan [38]), and the global deflection of the sandwich panel is an essential structural response in the optimal design. However, with finite element methods and physical tests, the calculation of the global deflection is time-consuming and quite complex. To save computation cost, surrogate modeling methods are often applied, which can provide simple but reliable metamodels. In this paper, a square-core sandwich panel design problem is selected to investigate the performance of the proposed method.

Experimental settings
To verify the proposed PRS-MOEM method, it is compared with the following popular surrogates: Lasso, ordinary Kriging (OK), RBF, the elastic net (EN), and PRS-CCV; the parameter settings are presented in Table 2. Root mean squared error (RMSE) and maximum absolute error (MAE) (Goel et al. [18]) are used to evaluate the predictive capabilities of the surrogate models:

$$RMSE = \sqrt{\frac{1}{n_t}\sum_{l=1}^{n_t}\left(y^{(l)} - \hat{y}(\mathbf{x}^{(l)})\right)^2} \tag{17}$$

$$MAE = \max_{1 \le l \le n_t}\left|y^{(l)} - \hat{y}(\mathbf{x}^{(l)})\right| \tag{18}$$

where $y^{(l)}$ and $\hat{y}(\mathbf{x}^{(l)})$ denote the actual and predicted responses at the $l$th test point, respectively, and $n_t$ is the number of test points. The sample settings are listed in Table 3, based on which the surrogate modeling is conducted repeatedly and independently. Then the statistical performance (the mean and variation of RMSE and MAE) of the different surrogate models can be investigated. The variation of each prediction metric is depicted with box plots; a short tail and a small box signify a robust approximation. In addition, the computational time (CPU time) of the different surrogate modeling methods is also given; the tests are performed on a personal computer with a 2.3 GHz CPU and 8 GB RAM.
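The two metrics of Eqs. (17)-(18) can be computed directly (note that MAE here is the *maximum* absolute error over the test set, not the mean):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over the n_t test points, Eq. (17)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae_max(y_true, y_pred):
    """Maximum absolute error over the test set, Eq. (18)."""
    return np.max(np.abs(y_true - y_pred))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.0, 4.4])
# rmse(y_true, y_pred) ≈ 0.2121; mae_max(y_true, y_pred) ≈ 0.4
```

RMSE measures the global average accuracy, while the maximum absolute error exposes the worst local deviation, which is why both are reported together.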

Determination of the highest order
To illustrate the necessity of determining the highest order, the optimization results of the multitask optimization under different highest order settings $P = 1, 2, \ldots, 10$ for the Schaffer and Biggs functions are shown in Tables 4 and 5, respectively. The lowest value in each column is shown in bold for ease of comparison. $\varepsilon_k^{*}$, $k = 1, \ldots, 10$, denotes the optimal objective value for the $k$th sub-model. For the Schaffer test function, $P = 8$ is the best for the first, fourth, sixth, seventh, eighth, and tenth subsets, and $P = 9$ is the best for the second, third, fifth, and ninth subsets. Similarly, the most proper highest order for the Biggs test function is $P = 5$ or $6$, depending on the subset. The results show that the optimum obtained by the GA is greatly influenced by the highest order setting.

Lasso
The number of basis functions is set to 6n (where n is the number of sample points) (Gribonval et al. [40]). The 10-fold cross-validation method is used to choose the regularization parameter λ.

OK
The constant regression function and Gaussian correlation model are employed. In all cases, θ 0 = 1 m×1 , and 0.1 ≤ θ i ≤ 20 (Gribonval et al. [10]), for i = 1, 2, ..., m, where m is the number of variables and 1 m×1 is the vector whose entries are all equal to 1.

RBF
The basis function is multiquadric with c = 0.9.

EN
The number of basis functions is set to 6n. The ridge regularization parameter is $\lambda_2 = 10^{-3}$ (Li et al. [41]), and the 10-fold cross-validation method is used to determine the lasso regularization parameter.
PRS-MOEM

10-fold cross-validation is used to partition the training set, and $P_{max}$ is set to 10.

PRS-CCV
The settings are the same as for PRS-MOEM. Take the Biggs test for example. When $P$ is increased from one to five, the optimal objective values of all subsets are dramatically reduced by two orders of magnitude. This clearly indicates that with small $P$ values, the high-order nonlinearity of the test function cannot be captured by the PRS model, which leads to a large prediction error. When $P = 5$ or $6$, the optimal objective values of the different subsets reach their corresponding lowest points. The variation of the best $P$ values among the different optimization tasks mainly results from two issues. One is the randomness of the subset partition, which leads to different features being observed from different training and validation subsets. The other is the inherent fluctuation associated with the GA. For the same subset, the differences between the optimal objective values for $P = 5$ and $6$ are very small, which indicates the stability of the GA in solving the training optimization problems for the Biggs function at this dimension. Then, with the continued increase of $P$, the optimal objective values increase dramatically again because the GA fails to obtain the global optimum as the design space dimension grows quickly. Thus the most proper highest order for the Biggs test is determined as $P = 5$ or $6$ for different subsets, according to which the best active basis function vector can be selected.

Effect of the transfer migration method
To demonstrate the advantage of the transfer migration, multitask optimization is conducted with and without the transfer migration step, for the Schaffer test function with the highest order setting P = 8 and for the Biggs test function with P = 5, respectively. The average convergence graphs of the first optimization task for subset one, obtained by 100 independent runs, are shown in Fig. 7. In addition, the average optimal objective values for the ten optimization tasks obtained by the 100 independent runs are shown in Table 6.
In Table 6, k = 1, 2, ..., 10 indexes the optimization tasks, and each entry is the average optimal objective value for the kth task. It can be observed that for both the Schaffer and Biggs tests, the GA with the transfer migration converges faster, and its average optimal objective values are universally better than those without the transfer migration. These results confirm that the proposed method can enhance the optimization efficiency by means of positive transfer learning. On the other hand, the results show that the randomness of the subset partition can influence the optimization results greatly. Take the Biggs test for example: the maximal optimum is 3.144 for the tenth optimization task, and the minimal optimum is 1.824 for the sixth optimization task. To further investigate the effect of the proposed method on the prediction accuracy of the model, PRS-MOEM is constructed with and without the transfer migration. Box plots of the RMSE and the mean RMSE results obtained by the 100 independent runs for these two test functions are shown in Fig. 8 and Table 7. The model with the transfer migration performs much better than that without it for both tests, which verifies the effectiveness of the proposed multitask optimization algorithm in enhancing PRS modeling.
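The mechanism can be sketched as a toy multitask GA on bitstrings (1 = basis function kept). This is a freely simplified stand-in for the paper's transfer migration operator, not its exact formulation; the tasks, function names, and parameters below are all ours:

```python
import numpy as np

def toy_multitask_ga(fitness_fns, n_bits=30, pop_size=30, n_gen=60,
                     p_mut=0.02, migrate=True, seed=0):
    """Each task evolves its own bitstring population; with `migrate` on, the
    best individual found so far for every task is injected into all other
    tasks, replacing their currently worst individuals (the migration step)."""
    rng = np.random.default_rng(seed)
    k = len(fitness_fns)
    pops = [rng.integers(0, 2, (pop_size, n_bits)) for _ in range(k)]
    best = [pops[t][0].copy() for t in range(k)]
    for _ in range(n_gen):
        for t in range(k):
            pop, f = pops[t], fitness_fns[t]
            fit = np.array([f(ind) for ind in pop])
            if fit.max() > f(best[t]):
                best[t] = pop[int(fit.argmax())].copy()
            new = []
            for _ in range(pop_size):
                a, b = rng.integers(0, pop_size, 2)       # binary tournament
                p1 = pop[a] if fit[a] >= fit[b] else pop[b]
                a, b = rng.integers(0, pop_size, 2)
                p2 = pop[a] if fit[a] >= fit[b] else pop[b]
                mask = rng.integers(0, 2, n_bits).astype(bool)   # uniform crossover
                child = np.where(mask, p1, p2)
                flip = rng.random(n_bits) < p_mut                # bit-flip mutation
                new.append(np.where(flip, 1 - child, child))
            pops[t] = np.array(new)
        if migrate:
            for t in range(k):
                fit = np.array([fitness_fns[t](ind) for ind in pops[t]])
                worst = np.argsort(fit)                   # replace worst individuals
                j = 0
                for s in range(k):
                    if s != t:
                        pops[t][int(worst[j])] = best[s].copy()
                        j += 1
    return best

# Related tasks: maximize agreement with targets that differ in a single bit,
# mimicking sub-models built from overlapping training subsets.
base = np.r_[np.ones(25, dtype=int), np.zeros(5, dtype=int)]
targets = []
for i in range(3):
    tgt = base.copy()
    tgt[i] = 1 - tgt[i]
    targets.append(tgt)
fns = [lambda ind, tgt=tgt: int((ind == tgt).sum()) for tgt in targets]
bests = toy_multitask_ga(fns, migrate=True)
```

Because the tasks share most of their optimal bits, a migrant that is good for one task is usually near-optimal for the others, which is the source of the positive transfer.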
The above experiments clearly show the significance of the transfer migration method. However, it is noteworthy that, without prior information, whether the transfer between different optimization tasks is positive or negative remains a challenging question in practical engineering. In this paper, the similarity of the multiple optimization tasks is guaranteed because the training sample subsets are drawn from the same true model. Thus, the prediction accuracy of the model can be improved through useful information exchange between these related optimization tasks.

Effect of the model selection criterion
To investigate the rationality of using AICc instead of AIC to quantify the model accuracy in this paper, AICc and AIC are applied to select the final model for the Branin and Schaffer test functions. Box plots of the RMSE and the mean RMSE results obtained by 100 independent runs are shown in Fig. 10 and Table 8 respectively. When AICc is chosen as the model selection criterion, the box plot is smaller with a shorter tail, and the mean RMSE of the models using AICc is much lower than that using AIC for both test functions. This suggests that AICc selects the final model with better performance. To further study the effect of using AIC and AICc on basis function selection, the curves of the two criteria with respect to the number of selected active basis functions are shown in Fig. 9. The value of AIC decreases gradually as the number of active basis functions increases, whereas the value of AICc decreases first and then increases significantly, with an obvious turning point. This suggests that AIC tends to choose more basis functions, which can lead to over-fitting and lower prediction accuracy. In contrast, AICc effectively prevents this risk.
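The turning-point behaviour follows directly from the standard Gaussian-likelihood forms of the two criteria; the toy numbers below are ours and only illustrate the shape of the correction:

```python
import numpy as np

def aic_aicc(rss, n, k):
    """AIC and bias-corrected AICc under a Gaussian likelihood, for a model
    with k estimated parameters fitted to n samples (textbook forms)."""
    aic = n * np.log(rss / n) + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)
    return aic, aicc

# With few samples the correction term 2k(k+1)/(n-k-1) blows up as k -> n-1,
# so AICc turns upward long before AIC does, penalizing over-large models.
n = 20
curves = {k: aic_aicc(10.0 / k, n, k) for k in (2, 5, 10, 15)}
```

As k approaches n − 1 the correction term diverges, which is exactly what produces the sharp turning point seen for AICc in Fig. 9.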

Accuracy and robustness
The accuracy of the different models is evaluated by the average prediction metrics of the 100 independent runs. Tables 9 and 10 show the mean RMSE and MAE for the different surrogate models respectively. The lowest value in each column is shown in bold for ease of comparison. Moreover, to further evaluate the robustness of the different models, the variation of RMSE and MAE is expressed with box plots in Figs. 11 and 12 respectively. The RMSE and MAE results show that PRS-MOEM performs the best among the surrogates in all the test problems except for the Schaffer test function. For the Three-Hump, DP3, and Colville problems, PRS-CCV also performs well and its accuracy results are very close to those of PRS-MOEM. However, for Branin, Giunta, Biggs, and the sandwich panel design problem, PRS-MOEM performs much better than PRS-CCV. To investigate the differences in the selected active basis functions between these two models, histograms of the frequency of active basis functions obtained over the 100 training data sets of the Branin and Giunta functions are shown in Fig. 13. For PRS-CCV, except for a few specific basis functions, all the others are selected with similar frequencies, which means PRS-CCV fails to identify the significant basis functions accurately and stably under randomly generated training sets. In contrast, for PRS-MOEM there is a clear distinction between the frequencies of the significant basis functions and the others, which further verifies the robustness of the proposed method.
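For reference, the two comparison metrics are computed in the usual way; a minimal sketch (function names ours):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: sqrt(mean of squared residuals)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Maximum absolute error is also common, but MAE here is the mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))
```

RMSE weights large residuals more heavily than MAE, which is why the two tables can rank the surrogates slightly differently.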

Computational time
To show the computational cost of the different models, the seven surrogates are repeatedly constructed for the different test functions; Table 11 presents their average CPU time in seconds over 100 independent runs. The average calculation times of PRS, OK, and RBF are universally small for all test functions, while Lasso and EN take slightly longer. PRS-MOEM and PRS-CCV spend more time building a model, around 2 s, which is much larger than the other surrogate modeling methods. In practical engineering, however, this computational burden is negligible compared to the high-fidelity analysis of the training samples; e.g., it takes hundreds of hours to run a single finite element analysis for a complex flight vehicle structure. Thus for PRS modeling it is worthwhile to spend modestly more computational cost on selecting a proper active basis function vector for better surrogate accuracy. In the future, more effective evolutionary algorithms should be studied to enhance the global optimization capability in solving large-scale optimization problems.

Conclusion
In this paper, an improved PRS modeling approach, PRS-MOEM, is proposed to enhance the prediction accuracy by better selecting basis functions with robustness, based on the multitask optimization and ensemble modeling framework. Unlike traditional PRS modeling based on cross validation, which directly builds a single model guided by the total error estimation over all the training subsets, the proposed method constructs a sub-model for each training subset. By properly integrating all the sub-models, the information in the subsets can be fully exploited. To construct all the sub-models effectively by solving the multitask optimization problem, an improved evolutionary algorithm with transfer migration is developed, which can significantly enhance the optimization efficiency and robustness through useful information exchange between similar optimization tasks. To obtain a well-performing ensemble based on all the sub-models, a scoring method is proposed to measure the significance of each basis function according to the error estimation of the sub-models and the occurrence frequency of the basis functions in all the sub-models. Then, based on the AICc criterion, the significant basis functions can be selected and the optimal ensemble can be defined. PRS-MOEM can effectively mitigate the negative influence from the sub-models with large prediction error, and alleviate the uncertain impact resulting from the randomness of the training subsets.
For future work, the applicability of PRS-MOEM will be further investigated on a more diverse set of test problems as well as complex practical engineering problems. In addition, the extension of PRS-MOEM to effectively solve high-dimensional problems is also a promising research direction.