Abstract
When building a Kriging model, the general intuition is that using more data will always result in a better model. However, we show that for a large non-uniform dataset, using a uniform subset can have several advantages: it can reduce the time needed to fit the model, avoid numerical inaccuracies, and improve robustness with respect to errors in the output data. We furthermore describe several new and existing methods for selecting a uniform subset. These methods are tested and compared on several artificial datasets and one real-life dataset. The comparison shows how the selected subsets affect different aspects of the resulting Kriging model. As none of the subset selection methods performs best on all criteria, the best method to choose depends on how the different aspects are valued. The comparison made in this paper can help the user make a good choice.
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Rennen, G. Subset selection from large datasets for Kriging modeling. Struct Multidisc Optim 38, 545–569 (2009). https://doi.org/10.1007/s00158-008-0306-8
DOI: https://doi.org/10.1007/s00158-008-0306-8