## Abstract

We introduce Bayesian optimization, a technique developed for optimizing time-consuming engineering simulations and for fitting machine learning models on large datasets. Bayesian optimization guides the choice of experiments during materials design and discovery to find good material designs in as few experiments as possible. We focus on the case when materials designs are parameterized by a low-dimensional vector. Bayesian optimization is built on a statistical technique called Gaussian process regression, which allows predicting the performance of a new design based on previously tested designs. After providing a detailed introduction to Gaussian process regression, we describe two Bayesian optimization methods: expected improvement, for design problems with noise-free evaluations; and the knowledge-gradient method, which generalizes expected improvement and may be used in design problems with noisy evaluations. Both methods are derived using a value-of-information analysis, and enjoy one-step Bayes-optimality.

### Keywords

- Posterior Distribution
- Gaussian Process
- Credible Interval
- Material Design
- Expected Improvement

*These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.*

## Notes

1. Being “mean-square differentiable” at *x* in the direction given by the unit vector \(e_i\) means that the limit \(\lim_{\delta \rightarrow 0} (f(x+\delta e_i) - f(x))/\delta\) exists in mean square. Being “*k*-times mean-square differentiable” is defined analogously.

## References

1. H.J. Kushner, A new method of locating the maximum of an arbitrary multi-peak curve in the presence of noise. J. Basic Eng. **86**, 97–106 (1964)
2. J. Mockus, *Bayesian Approach to Global Optimization: Theory and Applications* (Kluwer Academic, Dordrecht, 1989)
3. J. Mockus, V. Tiesis, A. Zilinskas, The application of Bayesian methods for seeking the extremum, in *Towards Global Optimisation*, ed. by L.C.W. Dixon, G.P. Szego, vol. 2 (Elsevier Science Ltd., North Holland, Amsterdam, 1978), pp. 117–129
4. D.R. Jones, M. Schonlau, W.J. Welch, Efficient Global Optimization of Expensive Black-Box Functions. J. Global Optim. **13**(4), 455–492 (1998)
5. A. Booker, J. Dennis, P. Frank, D. Serafini, V. Torczon, M.W. Trosset, Optimization using surrogate objectives on a helicopter test example. Prog. Syst. Control Theor. **24**, 49–58 (1998)
6. J. Snoek, H. Larochelle, R.P. Adams, Practical Bayesian optimization of machine learning algorithms, in *Advances in Neural Information Processing Systems*, pp. 2951–2959 (2012)
7. E. Brochu, M. Cora, N. de Freitas, *A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning*. Technical Report TR-2009-023, Department of Computer Science, University of British Columbia, November 2009
8. A. Forrester, A. Sobester, A. Keane, *Engineering Design Via Surrogate Modelling: A Practical Guide* (Wiley, West Sussex, UK, 2008)
9. T.J. Santner, B.W. Williams, W. Notz, *The Design and Analysis of Computer Experiments* (Springer, New York, 2003)
10. M.J. Sasena, *Flexibility and Efficiency Enhancements for Constrained Global Design Optimization with Kriging Approximations*. Ph.D. thesis, University of Michigan (2002)
11. D.G. Krige, A statistical approach to some basic mine valuation problems on the Witwatersrand. J. Chem. Metall. Min. Soc. S. Afr. (1951)
12. G. Matheron, *The theory of regionalized variables and its applications*, vol. 5. École nationale supérieure des mines (1971)
13. N. Cressie, The origins of kriging. Math. Geol. **22**(3), 239–252 (1990)
14. C.E. Rasmussen, C.K.I. Williams, *Gaussian Processes for Machine Learning* (MIT Press, Cambridge, MA, 2006)
15. C.E. Rasmussen (2011), http://www.gaussianprocess.org/code, Accessed 15 July 2015
16. A.B. Gelman, J.B. Carlin, H.S. Stern, D.B. Rubin, *Bayesian Data Analysis*, 2nd edn. (CRC Press, Boca Raton, FL, 2004)
17. J.O. Berger, *Statistical Decision Theory and Bayesian Analysis*, 2nd edn. (Springer-Verlag, New York, 1985)
18. B. Ankenman, B.L. Nelson, J. Staum, Stochastic kriging for simulation metamodeling. Oper. Res. **58**(2), 371–382 (2010)
19. P.W. Goldberg, C.K.I. Williams, C.M. Bishop, Regression with input-dependent noise: a Gaussian process treatment, in *Advances in Neural Information Processing Systems*, pp. 493–499 (1998)
20. K. Kersting, C. Plagemann, P. Pfaff, W. Burgard, Most likely heteroscedastic Gaussian process regression, in *Proceedings of the 24th International Conference on Machine Learning*, ACM, pp. 393–400 (2007)
21. C. Wang, *Gaussian Process Regression with Heteroscedastic Residuals and Fast MCMC Methods*. Ph.D. thesis, University of Toronto (2014)
22. P.I. Frazier, J. Xie, S.E. Chick, Value of information methods for pairwise sampling with correlations, in *Proceedings of the 2011 Winter Simulation Conference*, ed. by S. Jain, R.R. Creasey, J. Himmelspach, K.P. White, M. Fu (Institute of Electrical and Electronics Engineers Inc, Piscataway, New Jersey, 2011), pp. 3979–3991
23. S. Sankaran, A.L. Marsden, The impact of uncertainty on shape optimization of idealized bypass graft models in unsteady flow. Physics of Fluids **22**(12), 121902 (2010)
24. P.I. Frazier, W.B. Powell, S. Dayanik, The knowledge gradient policy for correlated normal beliefs. INFORMS J. Comput. **21**(4), 599–613 (2009)
25. W. Scott, P.I. Frazier, W.B. Powell, The correlated knowledge gradient for simulation optimization of continuous parameters using Gaussian process regression. SIAM J. Optim. **21**(3), 996–1026 (2011)
26. L.P. Kaelbling, *Learning in Embedded Systems* (MIT Press, Cambridge, MA, 1993)
27. R.S. Sutton, A.G. Barto, *Reinforcement Learning* (The MIT Press, Cambridge, Massachusetts, 1998)
28. J. Gittins, K. Glazebrook, R. Weber, *Multi-armed Bandit Allocation Indices*, 2nd edn. (Wiley, 2011)
29. A. Mahajan, D. Teneketzis, Multi-armed bandit problems, in *Foundations and Applications of Sensor Management*, ed. by D. Cochran, A.O. Hero III, D.A. Castanon, K. Kastella (Springer-Verlag, 2007)
30. D. Huang, T.T. Allen, W.I. Notz, N. Zeng, Global Optimization of Stochastic Black-Box Systems via Sequential Kriging Meta-Models. J. Global Optim. **34**(3), 441–466 (2006)
31. O. Roustant, D. Ginsbourger, Y. Deville, DiceKriging, DiceOptim: two R packages for the analysis of computer experiments by kriging-based metamodelling and optimization. J. Stat. Softw. **51**(1), p. 54 (2012)
32. P.I. Frazier, *Learning with Dynamic Programming*. John Wiley and Sons (2011)
33. D. Ginsbourger, R. Riche, Towards Gaussian process-based optimization with finite time horizon, in *mODa 9–Advances in Model-Oriented Design and Analysis*, pp. 89–96 (2010)
34. R. Waeber, P.I. Frazier, S.G. Henderson, Bisection search with noisy responses. SIAM J. Control Optim. **51**(3), 2261–2279 (2013)
35. J. Xie, P.I. Frazier, Sequential Bayes-optimal policies for multiple comparisons with a known standard. Oper. Res. **61**(5), 1174–1189 (2013)
36. P.I. Frazier, Tutorial: Optimization via simulation with Bayesian statistics and dynamic programming, in *Proceedings of the 2012 Winter Simulation Conference Proceedings*, ed. by C. Laroque, J. Himmelspach, R. Pasupathy, O. Rose, A.M. Uhrmacher (Institute of Electrical and Electronics Engineers Inc., Piscataway, New Jersey, 2012), pp. 79–94
37. R.A. Howard, Information Value Theory. Syst. Sci. Cybern. IEEE Trans. **2**(1), 22–26 (1966)
38. C.D. Perttunen, A computational geometric approach to feasible region division in constrained global optimization, in *Proceedings of 1991 IEEE International Conference on Systems, Man, and Cybernetics, 1991. “Decision Aiding for Complex Systems”*, pp. 585–590 (1991)
39. B.E. Stuckman, A global search method for optimizing nonlinear systems. Syst. Man Cybern. IEEE Trans. **18**(6), 965–977 (1988)
40. J. Villemonteix, E. Vazquez, E. Walter, An informational approach to the global optimization of expensive-to-evaluate functions. J. Global Optim. **44**(4), 509–534 (2009)
41. D.C.T. Bautista, *A Sequential Design for Approximating the Pareto Front using the Expected Pareto Improvement Function*. Ph.D. thesis, The Ohio State University (2009)
42. P.I. Frazier, A.M. Kazachkov, Guessing preferences: a new approach to multi-attribute ranking and selection, in *Proceedings of the 2011 Winter Simulation Conference*, ed. by S. Jain, R.R. Creasey, J. Himmelspach, K.P. White, M. Fu (Institute of Electrical and Electronics Engineers Inc, Piscataway, New Jersey, 2011), pp. 4324–4336
43. J. Knowles, ParEGO: A hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems. Evol. Comput. IEEE Trans. **10**(1), 50–66 (2006)
44. S.C. Clark, J. Wang, E. Liu, P.I. Frazier, Parallel global optimization using an improved multi-points expected improvement criterion (working paper, 2014)
45. D. Ginsbourger, R. Le Riche, L. Carraro, A multi-points criterion for deterministic parallel global optimization based on kriging, in *International Conference on Nonconvex Programming, NCP07*, Rouen, France, December 2007
46. D. Ginsbourger, R. Le Riche, L. Carraro, Kriging is well-suited to parallelize optimization, in *Computational Intelligence in Expensive Optimization Problems*, Springer, vol. 2, pp. 131–162 (2010)
47. A.I.J. Forrester, A. Sóbester, A.J. Keane, Multi-fidelity optimization via surrogate modelling. Proc. R. Soc. A: Math. Phys. Eng. Sci. **463**(2088), 3251–3269 (2007)
48. P.I. Frazier, W.B. Powell, H.P. Simão, Simulation model calibration with correlated knowledge-gradients, in *Proceedings of the 2009 Winter Simulation Conference Proceedings*, ed. by M.D. Rossetti, R.R. Hill, B. Johansson, A. Dunkin, R.G. Ingalls (Institute of Electrical and Electronics Engineers Inc, Piscataway, New Jersey, 2009), pp. 339–351
49. D. Huang, T.T. Allen, W.I. Notz, R.A. Miller, Sequential kriging optimization using multiple-fidelity evaluations. Struct. Multi. Optim. **32**(5), 369–382 (2006)
50. J. Bect, D. Ginsbourger, L. Li, V. Picheny, E. Vazquez, Sequential design of computer experiments for the estimation of a probability of failure. Stat. Comput. **22**(3), 773–793 (2012)
51. J.R. Gardner, M.J. Kusner, Z. Xu, K. Weinberger, J.P. Cunningham, Bayesian optimization with inequality constraints, in *Proceedings of The 31st International Conference on Machine Learning*, pp. 937–945 (2014)
52. D.M. Negoescu, P.I. Frazier, W.B. Powell, The knowledge gradient algorithm for sequencing experiments in drug discovery. INFORMS J. Comput. **23**(1) (2011)
53. P.I. Frazier (2009–2010), http://people.orie.cornell.edu/pfrazier/src.html

## Acknowledgments

Peter I. Frazier was supported by AFOSR FA9550-12-1-0200, AFOSR FA9550-15-1-0038, NSF CAREER CMMI-1254298, NSF IIS-1247696, and the ACSF’s AVF. Jialei Wang was supported by AFOSR FA9550-12-1-0200.

## Derivations and Proofs

This section contains derivations and proofs of equations and theoretical results found in the main text.

### 1.1 Proof of Proposition 1

*Proof*

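Using Bayes’ rule, the conditional probability density of \(\theta_{[2]}\) at a point \(u_{[2]}\) given that \(\theta_{[1]} = u_{[1]}\) is

\[
p(\theta_{[2]} = u_{[2]} \mid \theta_{[1]} = u_{[1]}) = \frac{p(\theta_{[1]} = u_{[1]},\, \theta_{[2]} = u_{[2]})}{p(\theta_{[1]} = u_{[1]})} \propto \exp\left( -\frac{1}{2} \begin{bmatrix} u_{[1]} - \mu_{[1]} \\ u_{[2]} - \mu_{[2]} \end{bmatrix}^{T} \begin{bmatrix} \varSigma_{[1,1]} & \varSigma_{[1,2]} \\ \varSigma_{[2,1]} & \varSigma_{[2,2]} \end{bmatrix}^{-1} \begin{bmatrix} u_{[1]} - \mu_{[1]} \\ u_{[2]} - \mu_{[2]} \end{bmatrix} \right), \tag{3.19}
\]

where \(\propto\) denotes proportionality as a function of \(u_{[2]}\).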

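To deal with the inverse matrix in this expression, we use the following identity for inverting a block matrix: the inverse of the block matrix \(\begin{bmatrix} A&B \\ C&D\end{bmatrix}\), where both *A* and *D* are invertible square matrices, is

\[
\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} (A - BD^{-1}C)^{-1} & -A^{-1}B(D - CA^{-1}B)^{-1} \\ -D^{-1}C(A - BD^{-1}C)^{-1} & (D - CA^{-1}B)^{-1} \end{bmatrix}. \tag{3.20}
\]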

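Applying (3.20) to (3.19), and using a bit of algebraic manipulation to get rid of constants, we have

\[
p(\theta_{[2]} = u_{[2]} \mid \theta_{[1]} = u_{[1]}) \propto \exp\left( -\frac{1}{2} (u_{[2]} - \mu^{\text{new}})^{T} (\varSigma^{\text{new}})^{-1} (u_{[2]} - \mu^{\text{new}}) \right), \tag{3.21}
\]

where \(\mu^{\text{new}} = \mu_{[2]} + \varSigma_{[2,1]}\varSigma_{[1,1]}^{-1}(u_{[1]} - \mu_{[1]})\) and \(\varSigma^{\text{new}} = \varSigma_{[2,2]}-\varSigma_{[2,1]}\varSigma_{[1,1]}^{-1}\varSigma_{[1,2]}\).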

We see that (3.21) is simply the unnormalized probability density function of a normal distribution. Thus the conditional distribution of \(\theta _{[2]}\) given \(\theta _{[1]} = u_{[1]}\) is multivariate normal, with mean \(\mu ^{\text {new}}\) and covariance matrix \(\varSigma ^{\text {new}}\).
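A small numerical illustration can make the roles of the blocks concrete. The sketch below is a special case under stated assumptions, not code from the chapter: it takes \(\theta_{[1]}\) to be the first coordinate only, so that \(\varSigma_{[1,1]}\) is a scalar and its inverse is a division, and it uses the standard conditional-mean formula \(\mu_{[2]} + \varSigma_{[2,1]}\varSigma_{[1,1]}^{-1}(u_{[1]} - \mu_{[1]})\). The function name `condition_on_first` is hypothetical.

```python
def condition_on_first(mu, Sigma, u1):
    """Condition theta ~ Normal(mu, Sigma) on its first coordinate being u1.

    Assumption: the observed block theta_[1] is one-dimensional, so
    Sigma_[1,1] is the scalar Sigma[0][0]. Returns (mu_new, Sigma_new)
    for the remaining coordinates theta_[2].
    """
    s11 = Sigma[0][0]
    rest = range(1, len(mu))
    # mu_new = mu_[2] + Sigma_[2,1] Sigma_[1,1]^{-1} (u_[1] - mu_[1])
    mu_new = [mu[i] + Sigma[i][0] / s11 * (u1 - mu[0]) for i in rest]
    # Sigma_new = Sigma_[2,2] - Sigma_[2,1] Sigma_[1,1]^{-1} Sigma_[1,2]
    Sigma_new = [
        [Sigma[i][j] - Sigma[i][0] * Sigma[0][j] / s11 for j in rest]
        for i in rest
    ]
    return mu_new, Sigma_new


# Bivariate example: positive covariance, observation above its prior mean.
mu_new, Sigma_new = condition_on_first([0.0, 1.0], [[4.0, 1.2], [1.2, 1.0]], 2.0)
print(mu_new, Sigma_new)
```

In the bivariate example the two coordinates are positively correlated and the observed value lies above its prior mean, so the conditional mean of the second coordinate moves up and its variance shrinks, as expected from the formulas for \(\mu^{\text{new}}\) and \(\varSigma^{\text{new}}\).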

### 1.2 Derivation of Equation (3.16)

Since \(f(x) \sim \text{Normal}(\mu_n(x), \sigma_n^2(x))\), the probability density of *f*(*x*) is \(p(f(x)=z) = \frac{1}{\sigma_n(x)\sqrt{2\pi}} \exp\left( -(z - \mu_n(x))^2 / (2\sigma_n^2(x)) \right)\). We use this to calculate \(\mathrm{EI}(x)\):
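A calculation of this kind can be checked numerically. The sketch below assumes that expected improvement is the expectation of \(\max(f(x) - f^*, 0)\), where \(f^*\) (written `f_star`, a name introduced here and not taken from the text above) is the best value observed so far. Under that assumption, integrating against the normal density gives the standard closed form \((\mu - f^*)\Phi(d) + \sigma\varphi(d)\) with \(d = (\mu - f^*)/\sigma\), where \(\Phi\) and \(\varphi\) are the standard normal cdf and pdf. The code compares this closed form with a direct midpoint-rule integration; all function names are hypothetical.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, f_star):
    """Closed-form E[max(f - f_star, 0)] for f ~ Normal(mu, sigma^2)."""
    if sigma == 0.0:
        return max(mu - f_star, 0.0)
    d = (mu - f_star) / sigma
    return (mu - f_star) * norm_cdf(d) + sigma * norm_pdf(d)


def expected_improvement_numeric(mu, sigma, f_star, n=20000, width=10.0):
    """Midpoint-rule integral of (z - f_star) * density(z) over z > f_star."""
    lo = f_star
    hi = max(f_star, mu) + width * sigma  # density is negligible beyond this
    dz = (hi - lo) / n
    total = 0.0
    for k in range(n):
        z = lo + (k + 0.5) * dz
        total += (z - f_star) * norm_pdf((z - mu) / sigma) / sigma * dz
    return total


closed = expected_improvement(1.0, 0.5, 1.2)
numeric = expected_improvement_numeric(1.0, 0.5, 1.2)
print(closed, numeric)
```

The two printed values should agree to several decimal places, which is a quick way to catch a sign or normalization slip in the density.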

### 1.3 Calculation of the KG factor

The KG factor (3.18) is calculated by first considering how the quantity \(\mu ^*_{n+1} - \mu ^*_n\) depends on the information that we have at time *n*, and the additional datapoint that we will obtain, \(y_{n+1}\).

First observe that \(\mu^*_{n+1} - \mu^*_n\) is a deterministic function of the vector \([\mu_{n+1}(x) : x\in A_{n+1}]\) and other quantities that are known at time *n*. Then, by applying the analysis in Sect. 3.3.5, but letting the posterior given \(x_{1:n},y_{1:n}\) play the role of the prior, we obtain the following version of (3.10), which applies to any given *x*,

\[
\mu_{n+1}(x) = \mu_n(x) + \frac{\varSigma_n(x, x_{n+1})}{\varSigma_n(x_{n+1}, x_{n+1}) + \lambda^2} \left( y_{n+1} - \mu_n(x_{n+1}) \right). \tag{3.22}
\]

In this expression, \(\mu _n(\cdot )\) and \(\varSigma _n(\cdot ,\cdot )\) are given by (3.13) and (3.14).

We see from this expression that \(\mu _{n+1}(x)\) is a linear function of \(y_{n+1}\), with an intercept and a slope that can be computed based on what we know after the *n*th measurement.

We will calculate the distribution of \(y_{n+1}\), given what we have observed at time *n*. First, \(f(x_{n+1}) \mid x_{1:n},y_{1:n} \sim \text{Normal}\left( \mu_n(x_{n+1}), \varSigma_n(x_{n+1}, x_{n+1})\right)\). Since \(y_{n+1} = f(x_{n+1}) + \varepsilon_{n+1}\), where \(\varepsilon_{n+1}\) is independent noise with distribution \(\varepsilon_{n+1} \sim \text{Normal}(0, \lambda^2)\), we have

\[
y_{n+1} \mid x_{1:n}, y_{1:n} \sim \text{Normal}\left( \mu_n(x_{n+1}),\ \varSigma_n(x_{n+1}, x_{n+1}) + \lambda^2 \right).
\]

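Plugging the distribution of \(y_{n+1}\) into (3.22) and doing some algebra, we have

\[
\mu_{n+1}(x) \mid x_{1:n}, y_{1:n} \sim \text{Normal}\left( \mu_n(x),\ \widetilde{\sigma}^2(x, x_{n+1}) \right),
\]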

where \(\widetilde{\sigma}(x, x_{n+1}) = \frac{\varSigma_n(x, x_{n+1})}{\sqrt{\varSigma_n(x_{n+1}, x_{n+1}) + \lambda^2}}\). Moreover, we can write \(\mu_{n+1}(x)\) as

\[
\mu_{n+1}(x) = \mu_n(x) + \widetilde{\sigma}(x, x_{n+1})\, Z,
\]

where \(Z=(y_{n+1}-\mu _n(x_{n+1}))/ \sqrt{\varSigma _n(x_{n+1}, x_{n+1}) + \lambda ^2}\) is a standard normal random variable, given \(x_{1:n}\) and \(y_{1:n}\).
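The equivalence between the linear-in-\(y_{n+1}\) update and the standardized form in terms of \(Z\) can be checked with a few scalars. The sketch below is illustrative only: the function names and the numeric inputs are hypothetical, `noise_var` stands for \(\lambda^2\), and the linear update uses the slope \(\varSigma_n(x, x_{n+1}) / (\varSigma_n(x_{n+1}, x_{n+1}) + \lambda^2)\), which is the slope implied by the definitions of \(\widetilde{\sigma}\) and \(Z\) above.

```python
import math


def sigma_tilde(cov_x_xnew, var_xnew, noise_var):
    """sigma~(x, x_new) = Sigma_n(x, x_new) / sqrt(Sigma_n(x_new, x_new) + lambda^2)."""
    return cov_x_xnew / math.sqrt(var_xnew + noise_var)


def updated_mean(mu_x, mu_xnew, cov_x_xnew, var_xnew, noise_var, y_new):
    """Posterior mean at x after observing y_new at x_new (linear in y_new)."""
    slope = cov_x_xnew / (var_xnew + noise_var)
    return mu_x + slope * (y_new - mu_xnew)


# Hypothetical posterior quantities at time n.
mu_x, mu_xnew = 1.0, 0.5
cov_x_xnew, var_xnew, noise_var = 0.3, 0.7, 0.04
y_new = 2.0

# Standardized observation Z, as defined in the text.
z = (y_new - mu_xnew) / math.sqrt(var_xnew + noise_var)

direct = updated_mean(mu_x, mu_xnew, cov_x_xnew, var_xnew, noise_var, y_new)
via_z = mu_x + sigma_tilde(cov_x_xnew, var_xnew, noise_var) * z
print(direct, via_z)
```

Both expressions give the same updated mean, which is why the distribution of \(\mu_{n+1}(x)\) can be read off from a single standard normal variable.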

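Now (3.18) becomes

\[
\mathrm{KG}_n(x) = \mathbb{E}\left[ \max_{x' \in A_{n+1}} \left( \mu_n(x') + \widetilde{\sigma}(x', x)\, Z \right) \right] - \mu^*_n,
\]

where \(x\) is the candidate for the next measurement point \(x_{n+1}\) and the expectation is taken over \(Z\).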

Thus, computing the KG factor comes down to being able to compute the expectation of the maximum of a collection of linear functions of a scalar normal random variable. Algorithm 2 of [24], with software provided as part of the matlabKG library [53], computes the quantity

\[
h(a, b) = \mathbb{E}\left[ \max_i \left( a_i + b_i Z \right) \right] - \max_i a_i
\]

for arbitrary equal-length vectors *a* and *b*. Using this ability, and letting \(\mu_n(A_{n+1})\) be the vector \([\mu_n(x'): x'\in A_{n+1}]\) and \(\widetilde{\sigma}(A_{n+1},x)\) be the vector \([\widetilde{\sigma}(x', x): x'\in A_{n+1}]\), we can write the KG factor as

\[
\mathrm{KG}_n(x) = h\left( \mu_n(A_{n+1}),\ \widetilde{\sigma}(A_{n+1}, x) \right) + \max(\mu_n(A_{n+1})) - \mu^*_n.
\]

If \(A_{n+1}=A_n\), as it is in the versions of the knowledge-gradient method proposed in [24, 25], then the last term \(\max (\mu _n(A_{n+1})) - \mu ^*_n\) is equal to 0 and vanishes.

## Copyright information

© 2016 Springer International Publishing Switzerland

## About this chapter

### Cite this chapter

Frazier, P.I., Wang, J. (2016). Bayesian Optimization for Materials Design. In: Lookman, T., Alexander, F., Rajan, K. (eds) Information Science for Materials Discovery and Design. Springer Series in Materials Science, vol 225. Springer, Cham. https://doi.org/10.1007/978-3-319-23871-5_3

DOI: https://doi.org/10.1007/978-3-319-23871-5_3

Publisher Name: Springer, Cham

Print ISBN: 978-3-319-23870-8

Online ISBN: 978-3-319-23871-5

eBook Packages: Chemistry and Materials Science, Chemistry and Material Science (R0)