Ensemble of metamodels: the augmented least squares approach

Ferreira, Wallace G.; Serpa, Alberto L.

doi:10.1007/s00158-015-1366-1


RESEARCH PAPER
Published: 15 December 2015

Volume 53, pages 1019–1046, (2016)
Structural and Multidisciplinary Optimization



Abstract

In this work we present an approach to create ensembles of metamodels (or weighted averaged surrogates) based on least squares (LS) approximation. The LS approach is appealing since it makes it possible to estimate the ensemble weights without using any explicit error metrics, as most existing ensemble methods do. As an additional feature, the LS based ensemble of metamodels has a prediction variance function that enables the extension to efficient global optimization. The proposed LS approach is a variation of standard LS regression that augments the matrices in such a way as to minimize the effects of multicollinearity inherent to the calculation of the ensemble weights. We tested and compared the augmented LS approach with different LS variants and also with existing ensemble methods, by means of analytical and real-world functions with two to forty-four variables. The augmented least squares approach performed with good accuracy and stability for prediction purposes, at the same level as other ensemble methods, and has a computational cost comparable to the faster ones.


Notes

Most of the publications are focused on linear ensembles, but a growth of interest in nonlinear ensemble methods can be observed, in which any type of approximation can be used to combine the models, e.g., neural networks, support vector regression, etc. See, for instance, Yu et al. (2005), Lai et al. (2006) and Meng and Wu (2012).
Matlab is a well known and widely used numerical programming platform, developed and distributed by The Mathworks Inc., see www.mathworks.com.
For further details and recent updates of the SURROGATES Toolbox, refer to the website: https://sites.google.com/site/srgtstoolbox/.
A boxplot is a common statistical graph used for visual comparison of the distribution of different variables in the same plane. The box is defined by lines at the lower quartile (25 %), median (50 %) and upper quartile (75 %) of the data. Lines extending above and below each box (whiskers) indicate the spread of the rest of the data outside the quartiles. If present, outliers are represented by plus signs “+”, above/below the whiskers. We used the Matlab function boxplot (with default parameters) to create the plots.

References

Acar E (2010) Various approaches for constructing an ensemble of metamodels using local error measures. Struct Multidiscip Optim 42(6):879–896
Acar E, Rais-Rohani M (2009) Ensemble of metamodels with optimized weight factors. Struct Multidiscip Optim 37(3):279–294
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19:716–723
Amemiya T (1985) Advanced Econometrics. Harvard University Press, Cambridge
Bishop CM (1995) Neural Networks for Pattern Recognition. Oxford University Press Inc., New York
Björk A (1996) Numerical Methods for Least Squares Problems. SIAM: Society for Industrial and Applied Mathematics
Breiman L (1996) Stacked regressions. Mach Learn 24:49–64
Efroymson MA (1960) Multiple regression analysis. In: Mathematical Methods for Digital Computers, Wiley, New York, USA, pp 191–203
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360
Fang KT, Li R, Sudjianto A (2006) Design and Modeling for Computer Experiments. Computer Science and Data Analysis Series. Chapman & Hall/CRC, USA
Ferreira WG, Serpa AL (2015) Ensemble of metamodels: Extensions of the least squares approach to efficient global optimization. Struct Multidiscip Optim. (submitted - ID SMO-15-0339)
Ferreira WG, Alves P, Slave R, Attrot W, Magalhaes M (2012) Optimization of a CLU truck frame. In: Ford Global Noise & Vibration Conference, Ford Motor Company, PUB-NVH108-02
Fierro RD, Bunch JR (1997) Regularization by truncated total least squares. SIAM J Sci Comput 18(4):1223–1241
Forrester A, Keane A (2009) Recent advances in surrogate-based optimization. Prog Aerosp Sci 45:50–79
Forrester A, Sóbester A, Keane A (2008) Engineering Design Via Surrogate Modelling - A Practical Guide. Wiley, United Kingdom
Foster DP, George EI (1994) The risk inflation criterion for multiple regression. Ann Stat 22:1947–1975
Giunta AA, Watson LT (1998) Comparison of approximation modeling techniques: polynomial versus interpolating models. In: 7th AIAA/USAF/NASA/ISSMO Symposium on Multidisciplinary Analysis and Optimization, AIAA-98-4758, pp 392–404
Goel T, Haftka RT, Shyy W, Queipo NV (2007) Ensemble of surrogates. Struct Multidiscip Optim 33:199–216
Golub GH, Heath M, Wahba G (1979) Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21(2):215–223
Gunn SR (1997) Support vector machines for classification and regression. Technical Report. Image, Speech and Intelligent Systems Research Group. University of Southampton, UK
Hannan EJ, Quinn BG (1979) The determination of the order of autoregression. J R Stat Soc Ser B 41:190–195
Hashem S (1993) Optimal linear combinations of neural networks. PhD thesis, School of Industrial Engineering. Purdue University, West Lafayette, USA
Hoerl AE, Kennard RW (1970a) Ridge regression: Applications to nonorthogonal problems. Technometrics 12(1):69–82
Hoerl AE, Kennard RW (1970b) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1):55–67
Huber PJ, Ronchetti EM (2009) Robust Statistics. Wiley Series in Probability and Statistics. Wiley, New Jersey
van Huffel S, Vandewalle J (1991) The Total Least Squares Problem: Computational Aspects and Analysis. SIAM: Philadelphia, USA
Jekabsons G (2009) RBF: Radial basis function interpolation for matlab/octave. Riga Technical University, Latvia. version 1.1 ed
Jolliffe IT (2002) Principal Component Analysis. Springer Series in Statistics. Springer, New York
Jones DR, Schonlau M, Welch WJ (1998) Efficient global optimization of expensive black-box functions. J Glob Optim 13:455–492
Koziel S, Leifsson L (2013) Surrogate-Based Modeling and Optimization - Applications in Engineering. Springer, New York
Lai KK, Yu L, Wang SY, Wei H (2006) A novel nonlinear neural network ensemble forecasting model for financial time series forecasting. In: Lecture Notes in Computer Science, vol 3991, pp 790–793
Lophaven SN, Nielsen HB, Sondergaard J (2002) DACE - a matlab kriging toolbox. Tech. Rep. IMM-TR-2002-12. Technical University of Denmark
Markovsky I, van Huffel S (2007) Overview of total least-squares methods. Signal Process 87:2283–2302
Meng C, Wu J (2012) A novel nonlinear neural network ensemble model using k-plsr for rainfall forecasting. In: Bio-Inspired Computing Applications. Lecture Notes in Computer Science, vol 6840, pp 41–48
Miller A (2002) Subset Selection in Regression. Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, USA
Montgomery DC, Peck EA, Vining GG (2006) Introduction to Linear Regression Analysis. Wiley Series in Probability and Statistics. Wiley, New Jersey
Ng S (2012) Variable selection in predictive regressions. In: Handbook of Economical Forecasting, Elsevier, pp 752–789
Perrone MP, Cooper LN (1993) When networks disagree: Ensemble methods for hybrid neural networks. Artificial Neural Networks for Speech and Vision. Chapman & Hall, London
Queipo NV, et al. (2005) Surrogate-based analysis and optimization. Prog Aerosp Sci 41:1–28
Ramu M, Prabhu RV (2013) Metamodel based analysis and its applications: A review. Acta Technica Corviniensis - Bulletin of Engineering 4(2):25–34
Rasmussen CE, Williams CK (2006) Gaussian Processes for Machine Learning. The MIT Press
Rousseeuw PJ, Leroy AM (2003) Robust Regression and Outlier Detection. Wiley Series in Probability and Statistics. Wiley, New Jersey
Sanchez E, Pintos S, Queipo NV (2008) Toward an optimal ensemble of kernel-based approximations with engineering applications. Struct Multidiscip Optim 36:247–261
Scheipl F, Kneib T, Fahrmeir L (2013) Penalized likelihood and bayesian function selection in regression models. Adv Stat Anal 97(4):349–385
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
Seni G, Elder J (2010) Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Synthesis Lectures on Data Mining and Knowledge Discovery. Morgan & Claypool Publishers, Chicago
Shibata R (1984) Approximation efficiency of a selection procedure for a number of regression variables. Biometrika 71:43–49
Simpson TW, Toropov V, Balabanov V, Viana FAC (2008) Design and analysis of computer experiments in multidisciplinary design optimization: A review of how far we have come - or not. In: 12th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, Victoria, British Columbia
Thacker WI, Zhang J, Watson LT, Birch JB, Iyer MA, Berry MW (2010) Algorithm 905: SHEPPACK: modified shepard algorithm for interpolation of scattered multivariate data. ACM Trans Math Softw 37(3):1–20
Tibshirani R (1996) Regression shrinkage and selection via lasso. J R Stat Soc 58(1):267–288
Viana FAC (2009) SURROGATES toolbox user’s guide version 2.0 (release 3). Available at website: http://fchegury.googlepages.com
Viana FAC (2011) Multiple surrogates for prediction and optimization. PhD thesis, University of Florida, USA
Viana FAC, Haftka RT, Steffen V (2009) Multiple surrogates: how cross-validation error can help us to obtain the best predictor. Struct Multidiscip Optim 39(4):439–457
Viana FAC, Gogu C, Haftka RT (2010) Making the most out of surrogate models: tricks of the trade. In: ASME 2010 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Volume 1: 36th Design Automation Conference, Parts A and B, Montreal, Quebec, Canada, August 15-18
Viana FAC, Haftka RT, Watson LT (2013) Efficient global optimization algorithm assisted by multiple surrogates techniques. J Glob Optim 56:669–689
Weisberg S (1985) Applied Linear Regression. Wiley Series in Probability and Statistics. Wiley, New Jersey
Wolpert D (1992) Stacked generalizations. Neural Netw 5:241–259
Yang XS, Koziel S, Leifsson L (2013) Computational optimization, modeling and simulation: Recent trends and challenges. Procedia Computer Science 18:855–860
Yu L, Wang SY, Lai KK (2005) A novel nonlinear ensemble forecasting model incorporating glar and ann for foreign exchange rates. Comput Oper Res 32:2523–2541
Zerpa LE, Queipo NV, Pintos S, Salager JL (2005) An optimization methodology of alkaline-surfactant-polymer flooding processes using field scale numerical simulation and multiple surrogates. J Pet Sci Eng 47:197–208
Zhang C, Ma Y (2012) Ensemble Machine Learning. Methods and Applications. Springer, New York
Zhou ZH (2012) Ensemble Methods. Foundations and Algorithms. Machine Learning & Pattern Recognition Series. Chapman & Hall/CRC, USA


Acknowledgments

The authors would like to thank Dr. F.A.C. Viana for the prompt help with SURROGATES Toolbox and also for the useful comments and discussions about the preliminary results of this work.

W.G. Ferreira would like to thank Ford Motor Company and also the support of his colleagues at MDO group and Product Development department that helped on the development of this work, which is part of his doctoral research underway at UNICAMP.

Finally, the authors are grateful for the comments and questions from the journal editor and reviewers. Undoubtedly their valuable suggestions helped to improve the clarity and consistency of the present text.

Author information

Authors and Affiliations

Structural and Optimization Engineering (CAE), Ford Motor Company Brazil, Av. Taboão, 899. PD Bldg. 6. S. B., 09655-900, Campo, SP, Brazil
Wallace G. Ferreira
FEM - School of Mechanical Engineering, DMC - Dept. of Computational Mechanics, University of Campinas (UNICAMP), 13083-970, Campinas, SP, Brazil
Alberto L. Serpa


Corresponding author

Correspondence to Wallace G. Ferreira.

Appendices

Appendix A: Test problems

A.1 Analytical benchmarks

These functions were chosen because they are widely used to validate both metamodeling and optimization methods, for example in Goel et al. (2007), Acar and Rais-Rohani (2009) and Viana et al. (2009).

Branin-Hoo

$$ \begin{array}{lcl} y\left( \mathbf{x}\right) & = &\left( x_2 - \frac{5.1x_1^2}{4\pi^2} + \frac{5x_1}{\pi} -6\right)^2\\ & &\\ & &+ 10\left( 1-\frac{1}{8\pi}\right)\cos\left( x_1\right) + 10\text{,} \end{array} $$

(18)

for the region $-5 \le x_1 \le 10$ and $0 \le x_2 \le 15$.
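As a quick check of Eq. (18), the following minimal Python sketch evaluates the Branin-Hoo function. It assumes the standard form of the benchmark, in which the quadratic term $5.1x_1^2/(4\pi^2)$ enters with a minus sign; the function name `branin_hoo` is only illustrative.

```python
import math

def branin_hoo(x1, x2):
    """Branin-Hoo test function in its standard form (minus sign on the x1**2 term)."""
    inner = x2 - 5.1 * x1**2 / (4.0 * math.pi**2) + 5.0 * x1 / math.pi - 6.0
    return inner**2 + 10.0 * (1.0 - 1.0 / (8.0 * math.pi)) * math.cos(x1) + 10.0

# At (pi, 2.275) the squared term vanishes, leaving 10 / (8 * pi),
# which is the known global minimum value of the standard form.
y_min = branin_hoo(math.pi, 2.275)
```

The point $(\pi, 2.275)$ lies inside the region stated above, so it is a convenient sanity check before using the function to generate training data.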

Hartman

$$ y(\mathbf{x})= -\sum\limits_{i=1}^{4}{c_i\exp\left[ -\sum\limits_{j=1}^{n_v}{a_{ij}\left( x_j - p_{ij} \right)^2} \right]} \text{,} $$

(19)

where $\mathbf{x} \in \left[0, 1\right]^{n_v}$, with constants $c_i$, $a_{ij}$ and $p_{ij}$ given in Table 6 for the case $n_v = 3$ (Hartman-3), and in Tables 7 and 8 for the case $n_v = 6$ (Hartman-6).
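Eq. (19) can be sketched in Python as a function that receives the constants as arguments. The name `hartman` and the argument layout are assumptions made for illustration; the constants of Tables 6 to 8 are not reproduced here and must be supplied by the user.

```python
import math

def hartman(x, c, a, p):
    """Hartman function, Eq. (19), for user-supplied constants.

    x: sequence of n_v coordinates in [0, 1]
    c: sequence of weights c_i
    a, p: nested sequences with one row of n_v entries per weight
    """
    total = 0.0
    for c_i, a_i, p_i in zip(c, a, p):
        exponent = sum(a_ij * (x_j - p_ij) ** 2
                       for a_ij, x_j, p_ij in zip(a_i, x, p_i))
        total += c_i * math.exp(-exponent)
    return -total

# Hypothetical single-term constants, used only to exercise the formula.
# They are NOT the values of Tables 6 to 8.
y_demo = hartman([0.5, 0.5, 0.5], c=[2.0], a=[[1.0, 1.0, 1.0]], p=[[0.5, 0.5, 0.5]])
```

With the real constants, `c` has four entries and `a` and `p` have four rows each, matching the outer sum in Eq. (19).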

Table 6 Data for Hartman-3 function

Table 7 Data for Hartman-6 function, $c_i$ and $a_{ij}$

Table 8 Data for Hartman-6 function, $p_{ij}$

Extended Rosenbrock

$$ y\left( \mathbf{x}\right) = \sum\limits_{i=1}^{n_v-1}{ \left[\left( 1- x_i\right)^2 + 100\left( x_{i+1} - x_i^2 \right)^2 \right] } \text{,} $$

(20)

where $\mathbf{x} \in \left[-5, 10\right]^{n_v}$.

Dixon-Price

$$ y\left( \mathbf{x}\right) = \left( x_1 - 1 \right)^2 + \sum\limits_{i=2}^{n_v}{i \left( 2x_i^2 - x_{i-1} \right)^2} \text{ ,} $$

(21)

where $\mathbf{x} \in \left[-10, 10\right]^{n_v}$.

Giunta-Watson

$$ f(\mathbf{x})=\sum\limits_{i=1}^{n_v}\left[ \frac{3}{10}+\sin \left( \frac{16}{15}x_{i}-1\right) +\sin^{2}\left( \frac{16}{15}x_{i}-1\right) \right] \text{,} $$

(22)

where $\mathbf{x} \in \left[-2, 4\right]^{n_v}$. This function is the “noise-free” version of the function used in Giunta and Watson (1998).
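The three benchmarks with arbitrary dimension, Eqs. (20) to (22), translate directly into code. The sketch below is a plain Python transcription for illustration only; the function names are assumptions and the inputs are taken to be simple sequences of floats.

```python
import math

def extended_rosenbrock(x):
    """Eq. (20); x is a sequence with n_v >= 2 entries in [-5, 10]."""
    return sum((1.0 - x[i]) ** 2 + 100.0 * (x[i + 1] - x[i] ** 2) ** 2
               for i in range(len(x) - 1))

def dixon_price(x):
    """Eq. (21); x is a sequence with n_v entries in [-10, 10]."""
    # The index i of Eq. (21) runs from 2 to n_v; the Python index is k = i - 1.
    return (x[0] - 1.0) ** 2 + sum((k + 1) * (2.0 * x[k] ** 2 - x[k - 1]) ** 2
                                   for k in range(1, len(x)))

def giunta_watson(x):
    """Eq. (22), noise-free version; x is a sequence with n_v entries in [-2, 4]."""
    total = 0.0
    for x_i in x:
        s = math.sin(16.0 / 15.0 * x_i - 1.0)
        total += 0.3 + s + s ** 2
    return total
```

Simple spot checks follow from the formulas: `extended_rosenbrock` is zero at $\mathbf{x} = (1, \ldots, 1)$, `dixon_price` is zero at $(1, 2^{-1/2})$ for $n_v = 2$, and `giunta_watson` equals $0.3\,n_v$ when every $x_i = 15/16$, since the sine argument vanishes there.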

A.2 Engineering applications

Figs. 15 and 16 present simulation models currently applied in the automotive industry. These models are typical examples of the ones used in the Multidisciplinary Optimization (MDO) department at Ford Motor Company, where the first author of the present research works as a structural optimization engineer. The examples described in this section are taken only as illustrations; they were part of an MDO project presented at a restricted conference and summarized in the report by Ferreira et al. (2012).

A regular MDO study at early design phases can comprise several models with hundreds of design variables and response functions to be monitored. After the design sensitivity analysis stage, the most significant variables and functions are selected in each model for metamodeling and multidisciplinary optimization.

In this work we use the data available for the following variables and responses of these models to compare the performance of the ensemble methods discussed here by means of real-world applications.

Truck models and responses

a) Truck Durability: Fig. 15a presents a finite element (FEM) model built in NASTRAN for truck frame durability evaluation. The durability responses (i.e., stress and/or fatigue/endurance metrics) are described as functions of $n_v = 12$ geometry variables;
b) Truck Dynamics: Fig. 15b presents a multibody model built in ADAMS for vehicle dynamics evaluation. The dynamics responses (i.e., displacements, velocities or accelerations for ride and handling performance) are defined based on the same $n_v = 12$ geometry variables used in the durability responses.

Car models and responses

a) Car NVH: Fig. 16a presents a FEM model built in NASTRAN for passenger car NVH (noise, vibration and harshness) evaluation. The NVH response is described as a function of $n_v = 30$ geometry variables;
b) Car Crash: Fig. 16b presents a FEM model in RADIOSS for passenger car frontal crash evaluation. The crash responses (i.e., displacements, velocities or accelerations for safety performance) are described with $n_v = 44$ variables, that is, the same 30 geometry variables used for NVH and 14 additional material parameters.

Appendix B: Preliminary numerical study

Our preliminary numerical experiments with LS ensembles were recorded in an internal research report at DMC-FEM-UNICAMP. Since these results were not published before, we summarize the main findings here in this appendix for convenience.

B.1 Numerical experiments setup

We compared the performance of the least squares (LS) ensemble with PRESS based methods, i.e., variations of OWS, (11), implemented in the SURROGATES Toolbox (Viana 2009).

At first we investigated the accuracy as the number of sampling points increases for the case of two variables ($n_v = 2$) of the Giunta-Watson function, (22), in the design space $\chi = \mathbf{x} \in \left[-2, 4\right]^{n_v}$.

In addition, we compared the methods for an increasing number of variables, i.e., $n_v = 1, 2, 5$ and $10$, also for the Giunta-Watson function, in the same design space. In this case, the number of sampling points was chosen based on the rule $N = 20 n_v$, in order to keep the same point density as the dimension increases.

In all the cases investigated, we repeated the experiments with 100 different sets of sampling points (DOE), to average out the influence of random data points on the quality of fit. The DOE were created by using the MATLAB Latin hypercube function lhsdesign, optimized with the maximin criterion with 1000 iterations.
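The sampling step can be approximated outside MATLAB. The sketch below is a rough NumPy analogue, not the lhsdesign implementation: it draws random Latin hypercubes and keeps the one with the largest minimum pairwise distance (a simple maximin selection). All function names are hypothetical, and the example uses fewer iterations than the 1000 quoted above only to keep it fast.

```python
import numpy as np

def lhs(n_points, n_vars, rng):
    """Random Latin hypercube in [0, 1]^n_vars: one point per stratum in each dimension."""
    strata = np.column_stack([rng.permutation(n_points) for _ in range(n_vars)])
    return (strata + rng.random((n_points, n_vars))) / n_points

def min_pairwise_distance(x):
    """Smallest Euclidean distance between any two rows of x (needs at least 2 rows)."""
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return dist[np.triu_indices(len(x), k=1)].min()

def maximin_lhs(n_points, n_vars, lower, upper, n_iter=1000, seed=0):
    """Keep the candidate design with the largest minimum distance, then rescale it."""
    rng = np.random.default_rng(seed)
    best, best_score = None, -np.inf
    for _ in range(n_iter):
        candidate = lhs(n_points, n_vars, rng)
        score = min_pairwise_distance(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return lower + (upper - lower) * best

# Giunta-Watson design space [-2, 4]^n_v with the rule N = 20 * n_v
n_v = 2
doe = maximin_lhs(n_points=20 * n_v, n_vars=n_v, lower=-2.0, upper=4.0, n_iter=200)
```

Repeating the call with 100 different values of `seed` would mimic the 100 DOE repetitions described above.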

The ensembles of metamodels were composed of four distinct models, namely PRS, KRG, RBNN and SVR, with the same setup presented in Table 3.

In all the examples, a total of $N_{test} = 2000$ test points were considered to calculate the RMSE, as defined in (1). The cross-validation procedure, (2), was applied with $k = 10$, to balance accuracy and computational cost for the PRESS calculation in the OWS method and its variations.
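For readers who want to reproduce this kind of error assessment, the following Python sketch shows the two ingredients under explicit assumptions: that (1) is the usual root mean squared error, and that (2) is a standard $k$-fold cross-validation in which each fold is predicted by a model fitted to the remaining folds. The helper names and the cubic polynomial stand-in for a surrogate are hypothetical and are not taken from the SURROGATES Toolbox.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def kfold_cv_rmse(x, y, fit, predict, k=10, seed=0):
    """RMSE of the k-fold cross-validation errors for a generic fit/predict pair."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = np.empty(len(x))
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        model = fit(x[train], y[train])
        errors[held_out] = y[held_out] - predict(model, x[held_out])
    return float(np.sqrt(np.mean(errors ** 2)))

# Stand-in surrogate: a cubic polynomial fitted to the one-variable Giunta-Watson function
x = np.linspace(-2.0, 4.0, 40)
s = np.sin(16.0 / 15.0 * x - 1.0)
y = 0.3 + s + s ** 2
cv_error = kfold_cv_rmse(x, y,
                         fit=lambda xs, ys: np.polyfit(xs, ys, 3),
                         predict=lambda coef, xs: np.polyval(coef, xs),
                         k=10)
```

The cross-validation loop refits the model $k$ times, which is the cost that PRESS based weighting has to pay and that a direct least squares fit of the weights avoids.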

B.2 Summary of results

The main results of this preliminary study are compiled in Fig. 17.

In summary, we observed that:

(i) The accuracy of the LS ensemble method is at the same level as that of the PRESS based ensemble methods for a moderate number of sampling points ($N < 50$), and LS was superior only for very dense design spaces, see Fig. 17a. On the other hand, even for a very large number of sampling points, the computational cost of LS was always lower than that of the OWS variants, by more than one order of magnitude, see Fig. 17b;
(ii) The accuracy of the LS ensemble method is at the same level as that of the OWS ensemble methods for an increasing number of variables, see Fig. 17c. Likewise, the LS method performed much faster than the OWS methods: at least one order of magnitude faster for low dimension problems (up to 5 variables) and more than two orders of magnitude faster for high dimension problems (10 variables), see Fig. 17d;
(iii) On the other hand, the LS method presented an undesired instability (measured by the standard deviation of RMSE in 100 runs) as the number of variables increases, see Fig. 17e;
(iv) The variation of accuracy of the LS method was reduced by around 30 % for 10 variables by applying a stepwise selection procedure to the standard least squares solution, in order to reduce the effect of multicollinearity among the metamodels, see Fig. 17f.

Based on these preliminary results, we concluded that the LS method can be viewed as an alternative to the PRESS based methods, since it is comparable in terms of accuracy and it performs much faster than the other ensemble methods available. In addition, the results with stepwise regression motivated a deeper investigation of the available methods for combating multicollinearity effects in the least squares solution, in order to verify their feasibility for application in ensemble methods.

Appendix C: Multicollinearity in least squares

C.1 The sources of multicollinearity

The issue of multicollinearity in least squares regression is well known in statistics and related areas, and research on this front dates back at least to the 1950s. See for example Björk (1996) and Montgomery et al. (2006) and the references therein for a broader perspective on this subject.

By definition, the least squares problem is based on the assumption that the $k$ regressors $x_i$, or predictor variables, in the simple linear case of (12), are mutually orthogonal. In other words, it is assumed in advance that there is no linear relationship among the predictor variables.

In matrix form, the least squares problem can be stated as

$$ \mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\varepsilon} \text{ ,} $$

(23)

where $\mathbf{y}$ is an $(N \times 1)$ vector of responses; $\mathbf{X}$ is an $(N \times p)$ matrix of the regressor variables; $\beta$ is a $(p \times 1)$ vector of unknown coefficients; and $\varepsilon$ is an $(N \times 1)$ vector of random errors that are assumed to be normally and independently distributed, with zero mean and finite variance, i.e. $\varepsilon_i \sim \text{NID}(0, \sigma^2)$. In this form, $N$ represents the number of observations (or samples), and $p = k$ when the intercept term $\beta_0$ is taken as zero and $p = k + 1$ otherwise.

One possible solution for (23) is the standard least squares estimator, i.e.,

$$ {\hat{\beta}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} \text{ ,} $$

(24)

that has the following properties:

(a) Unbiasedness:
$$ \text{Bias}\left( {\hat{\beta}}\right) \equiv E\left[{\hat{\beta}}\right] - {\beta} = 0 \text{ \ \ \ } \Rightarrow \text{ \ \ \ } E\left[{\hat{\beta}}\right] = \mathbf{\beta}\text{ ;} $$
(25)
(b) Variance:
$$ \begin{array}{rcl} \text{Var}\left( {\hat{\beta}}\right) & \equiv &E\left[{\hat{\beta}}{\hat{\beta}}^T\right] - E\left[{\hat{\beta}}\right]\left( E\left[{\hat{\beta}}\right]\right)^T\\ & & \\ & = & \sigma^2(\mathbf{X}^{T}\mathbf{X})^{-1} \text{ , \ \ \ \ where}\\ & & \\ \sigma^2 & \approx & \hat{\sigma}^2 = \frac{\mathbf{y}^T\mathbf{y} -{\hat{\beta}}^T\mathbf{X}^T\mathbf{y}}{N-p} \text{ \ \ ;} \end{array} $$
(26)
(c) Mean Squared Error:
$$ \begin{array}{lcl} \text{MSE}\left( {\hat{\beta}}\right) & \equiv & E\left[ \left\|{\hat{\beta}} - \mathbf{\beta}\right\|^2 \right]\\ & & \\ & = & \text{tr} \left\{ \text{Var}\left( {\hat{\beta}}\right) \right\} + \left\| \text{Bias}\left( {\hat{\beta}}\right) \right\|^2 \text{;} \end{array} $$
(27)
(d) Gauss-Markov Theorem: The least squares estimator ${\hat {\beta }}$ is the best linear unbiased estimator (BLUE) of $\beta$.

For proofs and details on these properties see Montgomery et al. (2006) or Björk (1996).
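The estimator (24) and the variance estimate (26) can be illustrated numerically. The NumPy sketch below uses synthetic data with hypothetical true coefficients and noise level, chosen only for the example; it solves the normal equations directly, which is adequate for a well-conditioned $\mathbf{X}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = rng.normal(size=(N, p))              # synthetic regressors (no intercept, p = k)
beta_true = np.array([1.0, -2.0, 0.5])   # hypothetical true coefficients
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Eq. (24): beta_hat = (X^T X)^{-1} X^T y, computed by solving the normal equations
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)

# Eq. (26): estimate of sigma^2 and covariance matrix of beta_hat
sigma2_hat = (y @ y - beta_hat @ (X.T @ y)) / (N - p)
var_beta = sigma2_hat * np.linalg.inv(XtX)

# Standard errors of the individual coefficients
std_err = np.sqrt(np.diag(var_beta))
```

In practice a QR or SVD based routine such as `numpy.linalg.lstsq` is usually preferred over forming $\mathbf{X}^T\mathbf{X}$ explicitly, since it is less sensitive to ill-conditioning.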

Unfortunately, in most applications the assumption of mutual orthogonality does not hold and the final regression model can be misleading or erroneous. When there are linear or near-linear dependencies among the regressors, the problem of multicollinearity arises. With an exact linear dependence, the so-called correlation matrix $\mathbf{X}^T\mathbf{X}$ has rank lower than $p$ and, as a consequence, its inverse no longer exists; with a near-linear dependence, the matrix is nearly singular. In both cases, the least squares estimate ${\hat {\beta }}$ becomes numerically unstable.

This instability of the coefficients can be explained by the variance definition, (26). If the matrix $\mathbf{X}^T\mathbf{X}$ has linear dependence among its columns (i.e., multicollinearity), then the variance of the coefficients can increase rapidly or become infinite and, as a consequence, the prediction will be poor.
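The growth of the variance can be made concrete with a small experiment. The sketch below is an illustration on synthetic data, not a result from this work: it builds two regressors whose second column approaches the first as `eps` shrinks, and reports $\text{tr}\{(\mathbf{X}^T\mathbf{X})^{-1}\}$, the factor that multiplies $\sigma^2$ in (26), together with the condition number of $\mathbf{X}^T\mathbf{X}$.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x1 = rng.normal(size=N)
noise = rng.normal(size=N)

def variance_factor(eps):
    """Return tr{(X^T X)^{-1}} and cond(X^T X) for X = [x1, x1 + eps * noise]."""
    X = np.column_stack([x1, x1 + eps * noise])
    XtX = X.T @ X
    return np.trace(np.linalg.inv(XtX)), np.linalg.cond(XtX)

# As eps decreases the two columns become nearly collinear
for eps in (1.0, 1e-1, 1e-2, 1e-3):
    tr_inv, cond = variance_factor(eps)
    print(f"eps={eps:7.0e}  tr(inv(XtX))={tr_inv:10.3e}  cond(XtX)={cond:10.3e}")
```

Since the determinant of $\mathbf{X}^T\mathbf{X}$ scales with `eps**2` in this construction, both quantities grow roughly as `1 / eps**2`, which is the variance inflation described above.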

Among the several sources of multicollinearity, the primary ones are: (i) the data collection method (size and distribution of sampling points); (ii) an overdefined model or a model with redundant variables. Over the decades, several methods have been devised for dealing with multicollinearity in least squares problems (see Montgomery et al. (2006), Chap. 11). In general, the techniques include gathering additional data and some kind of modification in the way that the coefficients ${\hat {\beta }}$ are estimated, in order to reduce the prediction errors induced by multicollinearity.

C.2 Prior model selection

In our context, the main source of multicollinearity is the fact that the models $\hat {y}_{i}(\mathbf {x})$ tend to be very similar, since all of them are trying to match the true response $y(\mathbf{x})$ as well as possible; therefore the problem is overdefined by its nature. In addition, this situation can be worsened if the sampling points are not well distributed in the design space.

Thus, by assuming that we have a fairly good design space distribution, the first option is to conduct a prior selection and remove the most redundant models in the set $[\hat {y}_{1}\left (\mathbf {x}\right ), \hat {y}_{2}\left (\mathbf {x}\right ), \ldots, \hat {y}_{M}\left (\mathbf {x}\right )]$, by means of some heuristic method.

One quick way, for instance, is to use the concept of correlation, i.e. to define the pairwise correlation matrix $\mathbf{R} = \left[ r^2\left(\hat{y}_{i}, \hat{y}_{j}\right) \right]$, where $r(X_i, Y_i)$ is the sample linear correlation coefficient, that is,

$$ r\left( X_{i},Y_{i}\right) = \frac{\sum\limits^{N}_{i=1}{(X_{i} - \bar{X})(Y_{i} - \bar{Y})}}{\left[\sum\limits^{N}_{i=1}{(X_{i} - \bar{X})^2}\sum\limits^{N}_{i=1}{(Y_{i} - \bar{Y})^2}\right]^{\frac{1}{2}}} $$

(28)

with

$$ \bar{X} = \frac{1}{N}\sum\limits^{N}_{i=1}{X_{i}} \text{,} $$

and $\bar{Y}$ defined analogously, for any two random vectors $X_i$ and $Y_i$ of size $N$.

In this way, $R_{ii} = 1$ and $0 \le R_{ij} \le 1$ for $i \neq j$. Therefore, we can easily identify the most correlated models based on a threshold, say for example $R_{ij} \ge 0.8$, and verify whether it is possible to eliminate the less significant model in the pair $ij$ from the set before creating the ensemble.

This heuristic approach can be useful when we have a large set of models, since we can rapidly identify the most correlated pairs and remove the poorest ones in terms of accuracy in advance. It is worth noting that, especially for small sets, this criterion must be used carefully, since interpolating or highly accurate models can lead to $R_{ij} \to 1.0$ as well, and of course they cannot be discarded blindly.
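The screening heuristic can be sketched in a few lines of NumPy. The function name `correlated_pairs`, the layout of the prediction matrix and the three toy "models" below are assumptions made only for illustration; the threshold of 0.8 is the example value quoted above.

```python
import numpy as np

def correlated_pairs(Y_hat, threshold=0.8):
    """Squared pairwise correlation matrix R and the model pairs with R_ij >= threshold.

    Y_hat: (N, M) array whose column i holds the predictions of model i at N points.
    """
    R = np.corrcoef(Y_hat, rowvar=False) ** 2   # Eq. (28) applied to every pair, then squared
    M = R.shape[0]
    pairs = [(i, j) for i in range(M) for j in range(i + 1, M) if R[i, j] >= threshold]
    return R, pairs

# Toy predictions: models 0 and 1 are nearly identical, model 2 behaves differently
x = np.linspace(0.0, 2.0 * np.pi, 200)
Y_hat = np.column_stack([np.sin(x),
                         np.sin(x) + 0.01 * np.cos(5.0 * x),
                         np.cos(3.0 * x)])
R, pairs = correlated_pairs(Y_hat, threshold=0.8)
```

Here only the pair (0, 1) is flagged. Deciding which member of a flagged pair to drop still requires an accuracy measure, in line with the caution about highly accurate models.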

Other diagnostics for multicollinearity exist in the least squares literature. Most of them are based on the examination of the correlation matrix X ^T X, namely: correlation coefficients, determinant, eigensystem analysis, VIF (variance inflation factors), etc. We will not present them here, since at the end of the day all these diagnostic measures are useful to estimate pairwise correlation and not more than that. For example, it is possible to identify pairs of highly correlated models, but if the collinearity is among more than two models it cannot be identified by a simple inspection. In addition, as we already discussed, pure collinearity does not mean directly that models are not significant in terms of accuracy or predictability for the ensemble. Refer to Chap. 11 of Montgomery et al. (2006) for a detailed discussion on this subject.

C.3 Gathering additional data

As reported in Montgomery et al. (2006), one of the best methods to reduce the sources of multicollinearity in least squares is collecting additional data. The idea is first to understand the distribution of points in the design space and then to add more sampling points in the non-populated areas, in order to avoid concentrations along lines and therefore break up multicollinearity.

In many cases, unfortunately, collecting additional data is costly or even impossible. In other cases, collecting additional data is not a viable solution to the multicollinearity problem because its source is due to constraints on the model or in the population. This is the case for ensembles of metamodels. Since all the models $\hat {y}_{i}$ are trying to approximate the true response $y$ as well as possible, or exactly in the case of interpolation, the true response $y$ acts as a constraint in the problem, and multicollinearity comes up naturally.

C.4 Variable selection methods in regression

A central problem in least squares regression is the definition of the best set of variables or predictors to build the model. It is desirable to define a parsimonious model, i.e. the simplest regression model that represents the problem at hand, whenever possible. A parsimonious model is easier to interpret and to collect data for and, in addition, it is less prone to redundancies that induce linear dependencies and multicollinearity and reduce accuracy and predictability.

It is well known that the key driving question in all the variable selection methods is:

“How to include or exclude variables in a least squares model in order to achieve the desired accuracy and be parsimonious at the same time?”

Over the decades, several methods have been devised, implemented and tested in an attempt to answer this question. In this sense, let us briefly explain the concept of balancing bias and variance in least squares approximation.

C.4.1 The bias and variance dilemma

As remarked by Montgomery et al. (2006), the Gauss-Markov property assures that the estimator ${\hat {\beta }}$ has minimum error, in the least squares sense, among all unbiased linear estimators, but there is no guarantee that its variance will be small.

As we discussed previously, when the method of least squares is applied to nonorthogonal data, very inaccurate estimates of β can be obtained, due to the inflation of the variance. This implies that the absolute values of the coefficients are very unstable and may dramatically change in sign and magnitude by small variations in the design matrix X.

One way to mitigate this issue is to relax the requirement that the estimator of β be unbiased. Let us assume that we can find a ${\hat {\beta }}^{\ast }$ in such a way that

$$ {\hat{\beta}}^{\ast} = {\hat{\beta}} - \mathbf{\delta} \text{, \ \ \ \ \ \ with \ \ \ \ } \left\| \mathbf{\delta} \right\| < \left\| {\hat{\beta}} \right\| \text{, \ \ \ \ and \ \ \ \ } \text{E}\left[\mathbf{\delta}\right] = \delta \text{, } $$

(29)

then the bias will be

$$ \text{Bias}\left( {\hat{\beta}}^{\ast} \right) = \text{E}\left[{\hat{\beta}} - \mathbf{\delta}\right] - \mathbf{\beta} \text{ \ \ \ } \Rightarrow \text{ \ \ \ } \text{Bias}\left( {\hat{\beta}}^{\ast} \right) = -\delta $$

(30)

and, by assuming that ${\hat {\beta }}$ and δ are independent, then

$$ \text{Var}\left( {\hat{\beta}}^{\ast} \right) = \text{Var}\left( {\hat{\beta}} - \mathbf{\delta}\right) = \text{Var}\left( {\hat{\beta}}\right) - \text{Var}\left( \mathbf{\delta}\right) \text{.} $$

(31)

In addition, the MSE will become

$$ \begin{array}{lcl} \text{MSE}\left( {\hat{\beta}}^{\ast} \right) & = & \text{tr} \left\{ \text{Var}\left( {\hat{\beta}}\right) \right\} - \text{tr} \left\{ \text{Var}\left( \mathbf{\delta}\right) \right\} + \left\| \mathbf{\delta} \right\|^2 \\ & & \\ & = & \text{E}\left[ \left\| {\hat{\beta}} - \mathbf{\beta} - \mathbf{\delta} \right\|^2 \right] \\ & & \\ & = & \text{E}\left[ 2\left\| {\hat{\beta}} - \mathbf{\beta} \right\|^2 + 2\left\| \mathbf{\delta} \right\|^2 -\left\|{\hat{\beta}} - \mathbf{\beta} + \mathbf{\delta} \right\|^2 \right] \\ & & \\ & = & 2\times\text{MSE}\left( {\hat{\beta}}\right) + f\left( \left\| \mathbf{\delta} \right\|^2 \right) \end{array} $$

(32)

by using the parallelogram law for vector norms and the linearity of the expectation operator.
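For reference, the parallelogram law invoked here can be written out explicitly. With the shorthand $\mathbf{a} = {\hat{\beta}} - \mathbf{\beta}$ and $\mathbf{b} = \mathbf{\delta}$ (labels introduced only for this note), the identity reads

$$ \left\| \mathbf{a} - \mathbf{b} \right\|^2 + \left\| \mathbf{a} + \mathbf{b} \right\|^2 = 2\left\| \mathbf{a} \right\|^2 + 2\left\| \mathbf{b} \right\|^2 \text{ \ \ \ } \Rightarrow \text{ \ \ \ } \left\| {\hat{\beta}} - \mathbf{\beta} - \mathbf{\delta} \right\|^2 = 2\left\| {\hat{\beta}} - \mathbf{\beta} \right\|^2 + 2\left\| \mathbf{\delta} \right\|^2 - \left\| {\hat{\beta}} - \mathbf{\beta} + \mathbf{\delta} \right\|^2 \text{,} $$

which is the step from the second to the third line of (32). Taking the expectation term by term then relies only on the linearity of $\text{E}\left[\cdot\right]$.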

In summary, by allowing a small amount of bias δ in ${\hat {\beta }}^{\ast }$, the variance of ${\hat {\beta }}^{\ast }$ becomes smaller than that of ${\hat {\beta }}$. On the other hand, the mean squared error at the data may increase rapidly as a function of the level of bias induced. If the effect of increasing bias is smaller than the effect of reducing variance, then the overall error can be reduced. Therefore, by controlling the size of ${\hat {\beta }}$, it is possible to control the stability and error level of the solution by balancing bias and variance.
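The trade-off summarized above can be checked numerically. The sketch below is an illustration under stated assumptions and not the construction of Eqs. (29) to (32): it uses a ridge-type shrinkage (a penalty `lam` added to the diagonal of the normal equations) as one convenient stand-in for the biased estimator, for which the shrinkage depends on the data, so only the qualitative behavior carries over. The toy design, the true coefficients `beta_true` and the helper names are hypothetical.

```python
import random
import statistics


def fit_ridge_2var(x1, x2, y, lam):
    """Solve (X^T X + lam*I) b = X^T y for a two-column X without intercept."""
    a11 = sum(u * u for u in x1) + lam
    a22 = sum(v * v for v in x2) + lam
    a12 = sum(u * v for u, v in zip(x1, x2))
    g1 = sum(u * t for u, t in zip(x1, y))
    g2 = sum(v * t for v, t in zip(x2, y))
    det = a11 * a22 - a12 * a12
    return (a22 * g1 - a12 * g2) / det, (a11 * g2 - a12 * g1) / det


def bias_variance(x1, x2, beta, lam, sigma=0.2, n_trials=1000, seed=1):
    """Monte Carlo estimate of (squared bias, total variance, MSE) of the coefficients."""
    rng = random.Random(seed)
    est = []
    for _ in range(n_trials):
        y = [beta[0] * u + beta[1] * v + rng.gauss(0.0, sigma) for u, v in zip(x1, x2)]
        est.append(fit_ridge_2var(x1, x2, y, lam))
    b1 = [e[0] for e in est]
    b2 = [e[1] for e in est]
    bias2 = (statistics.fmean(b1) - beta[0]) ** 2 + (statistics.fmean(b2) - beta[1]) ** 2
    var = statistics.pvariance(b1) + statistics.pvariance(b2)  # trace of the covariance
    mse = statistics.fmean([(e[0] - beta[0]) ** 2 + (e[1] - beta[1]) ** 2 for e in est])
    return bias2, var, mse


# Assumed near-collinear design with 30 sampling points.
rng = random.Random(3)
x1 = [rng.uniform(-1.0, 1.0) for _ in range(30)]
x2 = [u + 0.01 * rng.gauss(0.0, 1.0) for u in x1]
beta_true = (2.0, 0.0)

# lam = 0 is ordinary least squares; larger lam means stronger shrinkage.
results = {lam: bias_variance(x1, x2, beta_true, lam) for lam in (0.0, 0.01, 0.1)}
for lam, (bias2, var, mse) in results.items():
    print(f"lam={lam:<5} bias^2={bias2:.4f}  variance={var:.4f}  MSE={mse:.4f}")
```

In this setting the shrunken estimator trades a visible bias for a much smaller variance, and the coefficient MSE, which equals squared bias plus variance, decreases. With a well-conditioned design or a poorly chosen `lam` the balance can go the other way.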

Figure 18 presents a geometrical interpretation of this behavior of the solution ${\hat {\beta }}$ in terms of bias and variance, for a generic problem with two variables, i.e. $\beta = [\beta_{1}, \beta_{2}]^{T}$. The idea is that choosing a smaller estimator ${\hat {\beta }}^{\ast }$ yields a smaller variance. As a consequence, the price for reducing variance is always the addition of bias to the solution, i.e. the mean squared error at the sampling points is no longer the minimum.

In practical terms, if one chooses a biased estimator ${\hat {\beta }}^{\ast }$, thereby increasing the mean squared error at the sampling points, the variance is reduced and so is the sensitivity of the regression coefficients to changes in the data (i.e. less sensitivity to noise or perturbations in the components of matrix X). Therefore, with more stable coefficients, overfitting can be reduced and the accuracy of predictions for future data can increase as well.

There are mainly two ways to reduce the magnitude of the vector ${\hat {\beta }}$: (i) removing or combining variables in the model, i.e. forcing some of the $\hat {\beta }_{i} = 0$; and (ii) reducing (shrinking) the size of the vector ${\hat {\beta }}$.
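The two routes can be placed side by side on a toy problem. The following is a minimal sketch under assumed data (a near-collinear two-variable design with true coefficients equal to one); the helpers `fit_2var` and `fit_1var` are hypothetical, and ridge-type shrinkage is used only as one example of route (ii).

```python
import math
import random


def fit_2var(x1, x2, y, lam=0.0):
    """Solve (X^T X + lam*I) b = X^T y for two columns; lam = 0 gives ordinary least squares."""
    a11 = sum(u * u for u in x1) + lam
    a22 = sum(v * v for v in x2) + lam
    a12 = sum(u * v for u, v in zip(x1, x2))
    g1 = sum(u * t for u, t in zip(x1, y))
    g2 = sum(v * t for v, t in zip(x2, y))
    det = a11 * a22 - a12 * a12
    return (a22 * g1 - a12 * g2) / det, (a11 * g2 - a12 * g1) / det


def fit_1var(x1, y):
    """Least squares fit of y ~ b1*x1 alone, i.e. the second variable is removed."""
    return sum(u * t for u, t in zip(x1, y)) / sum(u * u for u in x1)


def norm(b):
    return math.hypot(b[0], b[1])


# Assumed data: x2 is nearly a copy of x1, responses carry Gaussian noise.
rng = random.Random(7)
x1 = [rng.uniform(-1.0, 1.0) for _ in range(30)]
x2 = [u + 0.01 * rng.gauss(0.0, 1.0) for u in x1]
y = [u + v + rng.gauss(0.0, 0.1) for u, v in zip(x1, x2)]

b_full = fit_2var(x1, x2, y)              # unbiased least squares estimate
b_subset = (fit_1var(x1, y), 0.0)         # route (i): force the second coefficient to zero
b_shrunk = fit_2var(x1, x2, y, lam=0.1)   # route (ii): shrink the whole vector

for name, b in (("full LS", b_full), ("subset", b_subset), ("shrunk", b_shrunk)):
    print(f"{name:8s} b = ({b[0]: .3f}, {b[1]: .3f})   ||b|| = {norm(b):.3f}")
```

For a positive penalty the shrunken vector is never longer than the least squares one. The subset solution is usually, though not necessarily, shorter as well when the columns are strongly collinear.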

Based on these two central ideas, most of the methods summarized in Section 4.3 were devised to trade off bias and variance, and thereby improve accuracy and predictability for a given set of variables in a least squares problem.

Finally, variable selection is a large research front in the field of least squares approximation. Miller (2002) presented an extensive review of variable selection in regression problems, and this is still a subject of active research, as can be seen in recent publications: for instance, Ng (2012) states that “The variable selection is by no means solved.”, and Scheipl et al. (2013) reinforce that there is still a wide and open field for future research in variable and function selection in multivariate regression.

About this article

Ferreira, W.G., Serpa, A.L. Ensemble of metamodels: the augmented least squares approach. Struct Multidisc Optim 53, 1019–1046 (2016). https://doi.org/10.1007/s00158-015-1366-1

Received: 06 April 2015
Revised: 27 September 2015
Accepted: 07 November 2015
Published: 15 December 2015
Issue Date: May 2016
DOI: https://doi.org/10.1007/s00158-015-1366-1
