Abstract
Using surrogate models outside the boundaries of their training data is risky and can lead to significant prediction errors. This paper presents a computationally efficient approach for estimating the boundaries of training data inputs in surrogate modeling using the Mahalanobis distance (MD). A distance value can then serve as a threshold for deciding whether a particular prediction site lies within the boundaries of the training data inputs, and it has the potential for a likelihood or probabilistic interpretation. The approach is evaluated using two- and four-dimensional analytical restricted input spaces and a complex six-dimensional biomechanical problem. The proposed approach: i) gives good approximations of the boundaries of the restricted input spaces, ii) exhibits reasonable error rates when classifying prediction sites as inside or outside known restricted input spaces, and iii) reflects expected error trends for increasing values of the MD, similar to those obtained using a computationally expensive convex hull approach.
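As an illustration of the idea summarized above, the following is a minimal sketch rather than the authors' implementation. It assumes the usual definition of the Mahalanobis distance, \(MD(x) = \sqrt{(x-\mu)^{T} C^{-1} (x-\mu)}\), with \(\mu\) and \(C\) the sample mean and covariance of the training inputs, and it derives two candidate thresholds from the training data: the largest training MD and the median of the top 20% largest MDs. The function names, the dictionary keys, and the synthetic Gaussian training set are hypothetical and chosen only for this example.

```python
import numpy as np


def mahalanobis(X, mu, C_inv):
    """Mahalanobis distance of each row of X from mu."""
    diff = np.atleast_2d(np.asarray(X, dtype=float)) - mu
    # Row-wise quadratic form diff^T C^{-1} diff, then square root.
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, C_inv, diff))


def fit_md_boundary(X, top_fraction=0.2):
    """Estimate mean, inverse covariance, and two MD thresholds (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    # Sample covariance of the inputs; rows are training points.
    C = np.atleast_2d(np.cov(X, rowvar=False))
    C_inv = np.linalg.inv(C)
    d_sorted = np.sort(mahalanobis(X, mu, C_inv))
    k = max(1, int(np.ceil(top_fraction * len(d_sorted))))
    return {
        "mu": mu,
        "C_inv": C_inv,
        "largest": d_sorted[-1],                 # largest training MD
        "top_median": np.median(d_sorted[-k:]),  # median of the top 20% largest MDs
    }


def is_inside(x, model, threshold="largest"):
    """True where the MD of a prediction site does not exceed the chosen threshold."""
    return mahalanobis(x, model["mu"], model["C_inv"]) <= model[threshold]


# Synthetic two-dimensional training inputs (an assumption for illustration only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
model = fit_md_boundary(X_train)
print(is_inside([[0.0, 0.0], [10.0, 10.0]], model))
```

In this sketch the "largest" threshold accepts every training point by construction, while the "top_median" threshold is tighter and excludes some of the most extreme training points. Which threshold is appropriate depends on how conservative the boundary estimate should be.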
Notes
For example, the Matlab implementation of the Quickhull algorithm (convhulln) was unable to compute the convex hull of 200 training data points in a ten-dimensional hyper-spherical restricted input space because it ran out of memory on a computer with a 2.5 GHz Pentium IV processor, 2 GB of RAM, and 5 GB of virtual memory.
Abbreviations
- BER: Balanced error rate
- C: Covariance matrix
- KS: Kolmogorov-Smirnov
- LHS: Latin hypercube sampling
- m: Number of training data
- MD: Mahalanobis distance
- n: Number of input variables
- p: Probability of a prediction site being within the training data boundaries
- \(R^{p}\): Set of real numbers of dimension p
- S: Surrogate model
- T: Training data
- x: Input variables
- y: Response variables
- α: Statistical significance level
- \(\chi_p^{2}\): Chi-square distribution with p degrees of freedom
- Δ: Difference
- ε: Relative error
- μ: Mean
- b: boundary estimation
- b20: median of top 20% largest Mahalanobis distances
- bl: largest Mahalanobis distance
- T: training data
References
Barber CB, Dobkin DP, Huhdanpaa HT (1996) The Quickhull algorithm for convex hulls. ACM Trans Math Softw 22(4):469–483
Bei Y, Fregly BJ (2004) Multibody dynamic simulation of knee contact mechanics. Med Eng Phys 26:777–789
Cressie NAC (1993) Statistics for spatial data. Wiley, New York
Forrester AIJ, Keane AJ (2009) Recent advances in surrogate based optimization. Prog Aerosp Sci 45:50–79
Hammersley JM (1960) Related problems. 3. Monte-Carlo methods for solving multivariable problems. Ann N Y Acad Sci 86(3):844–874
Jacques J, Lavergne C, Devictor N (2006) Sensitivity analysis in presence of model uncertainty and correlated inputs. Reliab Eng Syst Saf 91:1126–1134
Lin YC, Haftka RT, Queipo NV, Fregly BJ (2008) Dynamic simulation of knee motion using three dimensional surrogate contact modeling. In: Proceedings of the ASME 2008 summer bioengineering conference. SBC, Marco Island
Lophaven SN, Nielsen HB, Sondergaard J (2002) DACE—A Matlab kriging toolbox, version 2.0. Report IMM-TR-2002-12. Informatics and Mathematical Modeling. Technical University of Denmark
Mardia KV, Kent JT, Bibby JM (1979) Multivariate analysis. Academic Press
McKay MD, Beckman RJ, Conover WJ (1979) A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics (American Statistical Association) 21(2):239–245
McLachlan GJ (1992) Discriminant analysis and statistical pattern recognition. Wiley Interscience
Missoum S, Ramu P, Haftka RT (2007) A convex hull approach for the reliability-based design optimization of nonlinear transient dynamic problems. Comput Methods Appl Mech Eng 196:2895–2906
Mount DM (2002) CMSC 754 Computational geometry. Lecture Notes, University of Maryland
Queipo NV, Haftka RT, Shyy W, Goel T, Vaidyanathan R, Tucker PK (2005) Surrogate based analysis and optimization. Prog Aerosp Sci 41:1–28
Shioda R, Tunçel L (2007) Clustering via minimum volume ellipsoid. Comput Optim Appl 37:247–295
Stephens MA (1974) EDF statistics for goodness of fit and some comparisons. J Am Stat Assoc 69:730–737
Sun P, Freund RM (2004) Computation of minimum-volume covering ellipsoids. Oper Res 52(5):690–706
Acknowledgements
This work was supported in part by the National Science Foundation CBET Division under Grant No. 0602996 to B. J. Fregly and R. T. Haftka.
Additional information
Part of this work was presented at the 8th World Congress on Structural and Multidisciplinary Optimization, June 1–5, 2009, Lisbon, Portugal.
About this article
Cite this article
Pineda, L.E., Fregly, B.J., Haftka, R.T. et al. Estimating training data boundaries in surrogate-based modeling. Struct Multidisc Optim 42, 811–821 (2010). https://doi.org/10.1007/s00158-010-0541-7
DOI: https://doi.org/10.1007/s00158-010-0541-7