Structural and Multidisciplinary Optimization, Volume 59, Issue 5, pp 1439–1454

Variable selection using Gaussian process regression-based metrics for high-dimensional model approximation with limited data

  • Kyungeun Lee
  • Hyunkyoo Cho
  • Ikjin Lee
Research Paper

Abstract

In recent years, the importance of computationally efficient surrogate models has grown as the use of high-fidelity simulation models increases. However, high-dimensional models require many samples for surrogate modeling. To reduce this computational burden, we propose an integrated algorithm that combines accurate variable selection with surrogate modeling. One of its main strengths is that it requires fewer samples than conventional surrogate modeling methods, because it excludes dispensable variables while maintaining model accuracy. In the proposed method, the importance of a set of selected variables is evaluated by the quality of the model approximated with those variables only. Nonparametric probabilistic regression is adopted as the modeling method to handle the inaccuracy introduced by modeling with only the selected variables. In particular, Gaussian process regression (GPR) is used because its model performance indices are well suited to the variable selection criterion. Variables whose inclusion yields distinctly superior model performance are finally selected as essential variables. The proposed algorithm uses a conservative selection criterion and appropriate sequential sampling to prevent incorrect variable selection and sample overuse. Its performance is verified on two test problems with challenging properties such as high dimension, nonlinearity, and interaction terms. A numerical study shows that the proposed algorithm becomes increasingly effective as the fraction of dispensable variables grows.

Keywords

Surrogate model · Variable selection · High-dimensional problem · Gaussian process regression · Limited data

Abbreviations

n: Dimension of input

X: Training input

X∗: New input

f∗: Posterior output with zero mean function

\( {\overline{\mathbf{f}}}_{\ast } \): Best estimate of f∗

g∗: Posterior output with explicit basis function

cov(g∗): Covariance of g∗

y: Training output (noisy response)

mi(xi): Mean function of GPR in the xi–y plane

c(x | X): Posterior variance at a specified point x

m: Number of observations

h(x): Basis function of GPR

k(x): Covariance function of GPR

cov(f∗): Covariance of f∗

\( {\overline{\mathbf{g}}}_{\ast } \): Best estimate of g∗

\( \boldsymbol{\upbeta}, \widehat{\boldsymbol{\upbeta}} \): Coefficients of basis function and their estimates

\( \boldsymbol{\uptheta}, \widehat{\boldsymbol{\uptheta}} \): Hyperparameters of covariance function and their estimates

\( {\sigma}^2,{\widehat{\sigma}}^2 \): Noise variance and its estimate

ki(xi, xi′; θ): Covariance function of GPR in the xi–y plane with hyperparameter θ

ε: Gaussian noise
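For reference, these symbols appear in the standard zero-mean GPR posterior (Rasmussen and Williams 2006); writing K(·,·) for the covariance matrix built from k, the best estimate and its covariance at the new inputs X∗ are

\( {\overline{\mathbf{f}}}_{\ast } = K(X_{\ast }, X)\left[K(X, X) + \sigma^2 I\right]^{-1}\mathbf{y} \)

\( \operatorname{cov}(\mathbf{f}_{\ast }) = K(X_{\ast }, X_{\ast }) - K(X_{\ast }, X)\left[K(X, X) + \sigma^2 I\right]^{-1} K(X, X_{\ast }) \)

The explicit-basis case g∗ adds the term \( \mathbf{h}(x)^{\top}\widehat{\boldsymbol{\upbeta}} \) to the mean, with a corresponding correction to the covariance.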

Notes

Funding information

This research was supported by the development of thermoelectric power generation system and business model utilizing non-use heat of industry funded by the Korea Institute of Energy Technology Evaluation and Planning (KETEP) and the Ministry of Trade, Industry and Energy (MOTIE) of the Republic of Korea (No.20172010000830).


Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Mechanical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, South Korea
  2. Department of Mechanical Engineering, Mokpo National University, Muan-gun, South Korea
