Differentiating the Multipoint Expected Improvement for Optimal Batch Design

  • Conference paper
Machine Learning, Optimization, and Big Data (MOD 2015)

Abstract

This work deals with the parallel optimization of expensive objective functions modelled as sample realizations of Gaussian processes. The study is formalized as a Bayesian optimization problem, or continuous multi-armed bandit problem, where a batch of \(q > 0\) arms is pulled in parallel at each iteration. Several algorithms have been developed for choosing batches by trading off exploitation and exploration. As of today, the maximum Expected Improvement (EI) and Upper Confidence Bound (UCB) selection rules appear as the most prominent approaches for batch selection. Here, we build upon recent work on the multipoint Expected Improvement criterion, for which an analytic expansion relying on Tallis’ formula was recently established. Since the computational burden of this selection rule remains an issue in applications, we derive a closed-form expression for the gradient of the multipoint Expected Improvement, which facilitates its maximization using gradient-based ascent algorithms. Substantial computational savings are demonstrated in applications. In addition, our algorithms are tested numerically and compared to state-of-the-art UCB-based batch-sequential algorithms. Combining UCB-based starting designs with gradient-based local optimization of EI finally appears as a sound option for batch design in distributed Gaussian process optimization.
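
As an illustration of the workflow described in the abstract, the following minimal R sketch selects a batch by multi-start gradient-based ascent of the multipoint EI. It assumes the qEI and qEI.grad functions distributed with the DiceOptim package [9] and a kriging model fitted with DiceKriging::km [21]; the toy objective, design size, number of restarts, and optimizer settings are illustrative assumptions, not the paper's experimental setup.

```r
# Minimal sketch (illustrative only, not the paper's experiments) of batch
# selection by gradient-based maximization of the multipoint EI (q-EI).
# Assumes the qEI and qEI.grad functions shipped with DiceOptim [9] and a
# kriging model fitted with DiceKriging::km; objective and settings are toys.
library(DiceKriging)
library(DiceOptim)

set.seed(42)
d <- 2   # input dimension
q <- 3   # batch size (number of points evaluated in parallel)

# Toy objective (to be minimized) and small initial design on [0, 1]^d
f  <- function(x) sum((x - 0.5)^2) + 0.1 * sin(10 * sum(x))
X0 <- data.frame(matrix(runif(10 * d), ncol = d))
y0 <- apply(X0, 1, f)
model <- km(design = X0, response = y0, covtype = "matern5_2")

# View q-EI and its gradient as functions of the stacked batch vector (length q*d)
to_batch <- function(x) {
  X <- matrix(x, ncol = d)
  colnames(X) <- colnames(X0)   # keep column names consistent with the design
  X
}
neg_qEI      <- function(x) -qEI(to_batch(x), model)
neg_qEI_grad <- function(x) -as.vector(qEI.grad(to_batch(x), model))

# Multi-start gradient-based ascent of q-EI (written as minimization of -q-EI)
best <- NULL
for (s in 1:5) {
  res <- optim(runif(q * d), neg_qEI, neg_qEI_grad,
               method = "L-BFGS-B", lower = 0, upper = 1)
  if (is.null(best) || res$value < best$value) best <- res
}
batch <- to_batch(best$par)   # the q points to evaluate in parallel
print(batch)
```

In practice, the random restarts above would be replaced by a dedicated initialization, for instance the UCB-based starting batches advocated in the abstract.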

References

  1. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47(2–3), 235–256 (2002)

  2. Azzalini, A., Genz, A.: The R package mnormt: the multivariate normal and \(t\) distributions (version 1.5-1) (2014)

  3. Bect, J., Ginsbourger, D., Li, L., Picheny, V., Vazquez, E.: Sequential design of computer experiments for the estimation of a probability of failure. Stat. Comput. 22(3), 773–793 (2011)

  4. Berman, S.M.: An extension of Plackett’s differential equation for the multivariate normal density. SIAM J. Algebr. Discrete Methods 8(2), 196–197 (1987)

  5. Brochu, E., Cora, V.M., de Freitas, N.: A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning, December 2010. eprint arXiv:1012.2599

  6. Bull, A.: Convergence rates of efficient global optimization algorithms. J. Mach. Learn. Res. 12, 2879–2904 (2011)

  7. Chevalier, C.: Fast uncertainty reduction strategies relying on Gaussian process models. Ph.D. thesis, University of Bern (2013)

  8. Chevalier, C., Ginsbourger, D.: Fast computation of the multipoint expected improvement with applications in batch selection. In: Nicosia, G., Pardalos, P. (eds.) Learning and Intelligent Optimization. Springer, Heidelberg (2014)

  9. Ginsbourger, D., Picheny, V., Roustant, O., with contributions by Chevalier, C., Marmin, S., Wagner, T.: DiceOptim: Kriging-based optimization for computer experiments. R package version 1.5 (2015)

  10. Desautels, T., Krause, A., Burdick, J.: Parallelizing exploration-exploitation tradeoffs with Gaussian process bandit optimization. In: ICML (2012)

  11. Frazier, P.I.: Parallel global optimization using an improved multi-points expected improvement criterion. In: INFORMS Optimization Society Conference, Miami FL (2012)

  12. Frazier, P.I., Powell, W.B., Dayanik, S.: A knowledge-gradient policy for sequential information collection. SIAM J. Control Optim. 47(5), 2410–2439 (2008)

  13. Genz, A.: Numerical computation of multivariate normal probabilities. J. Comput. Graph. Stat. 1, 141–149 (1992)

  14. Ginsbourger, D., Le Riche, R.: Towards Gaussian process-based optimization with finite time horizon. In: Giovagnoli, A., Atkinson, A.C., Torsney, B., May, C. (eds.) mODa 9 Advances in Model-Oriented Design and Analysis, Contributions to Statistics, pp. 89–96. Physica-Verlag, Heidelberg (2010)

  15. Ginsbourger, D., Le Riche, R., Carraro, L.: Kriging is well-suited to parallelize optimization. In: Tenne, Y., Goh, C.-K. (eds.) Computational Intelligence in Expensive Optimization Problems. ALO, vol. 2, pp. 131–162. Springer, Heidelberg (2010)

  16. Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. J. Global Optim. 13(4), 455–492 (1998)

  17. Ye, K.Q., Li, W., Sudjianto, A.: Algorithmic construction of optimal symmetric Latin hypercube designs. J. Stat. Plann. Inf. 90(1), 145–159 (2000)

  18. Mebane, W., Sekhon, J.: Genetic optimization using derivatives: the rgenoud package for R. J. Stat. Softw. 42(11), 1–26 (2011)

  19. Mockus, J., Tiesis, V., Zilinskas, A.: The application of Bayesian methods for seeking the extremum. In: Dixon, L., Szego, G. (eds.) Towards Global Optimization, vol. 2, pp. 117–129. Elsevier, Amsterdam (1978)

  20. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2006)

  21. Roustant, O., Ginsbourger, D., Deville, Y.: DiceKriging, DiceOptim: two R packages for the analysis of computer experiments by Kriging-based metamodelling and optimization. J. Stat. Softw. 51(1), 1–55 (2012)

  22. Schonlau, M.: Computer experiments and global optimization. Ph.D. thesis, University of Waterloo (1997)

  23. Srinivas, N., Krause, A., Kakade, S., Seeger, M.: Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Trans. Inf. Theory 58(5), 3250–3265 (2012)

  24. Vazquez, E., Bect, J.: Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. J. Stat. Plan. Infer. 140(11), 3088–3095 (2010)

  25. Villemonteix, J., Vazquez, E., Walter, E.: An informational approach to the global optimization of expensive-to-evaluate functions. J. Global Optim. 44(4), 509–534 (2009)

Acknowledgement

Part of this work has been conducted within the framework of the ReDice Consortium, gathering industrial (CEA, EDF, IFPEN, IRSN, Renault) and academic (École des Mines de Saint-Étienne, INRIA, and the University of Bern) partners around advanced methods for Computer Experiments.

Author information

Corresponding author

Correspondence to Sébastien Marmin.

6 Appendix: Differential Calculus

  • \(g_1\) and \(g_2\) are the functions giving respectively the mean of \(\varvec{Y}(\varvec{X})\) and its covariance. Each component of these functions is either a linear or a quadratic combination of the trend function \(\varvec{\mu }\) or the covariance function C evaluated at different points of \(\varvec{X}\). The results are obtained by matrix differentiation. See Appendix B of [21] for a similar calculation.

  • \(g_3\) (resp. \(g_4\)) is the affine (resp. linear) transformation of the mean vector \(\varvec{m}\) into \(\varvec{m}^{(k)}\) (resp. of the covariance matrix \(\varSigma \) into \(\varSigma ^{(k)}\)). The differentials are then expressed in terms of the same linear transformation:

    $$\begin{aligned} d_{\varvec{m}}\left[ g_3\right] (\varvec{h})=L^{(k)}\varvec{h}\nonumber ~~\text { and }~~d_{\varSigma }\left[ g_4\right] (H)=L^{(k)}HL^{(k)\top }. \end{aligned}$$
  • \(g_5\) is defined by \(g_5\left( \varvec{m}^{(k)},\varSigma ^{(k)}\right) =\varphi _{\varSigma _{ii}^{(k)}}\left( m_i^{(k)}\right) \). Then the result is obtained by differentiating the univariate Gaussian probability density function with respect to its mean and variance parameters. Indeed we have:

    $$\begin{aligned} d_{\left( \varvec{m}^{(k)},\varSigma ^{(k)}\right) }\left[ g_5\right] (h,H)=&~d_{\varvec{m}^{(k)}}\left[ g_5(\cdot ,\varSigma ^{(k)})\right] (h)+d_{\varSigma ^{(k)}}\left[ g_5(\varvec{m}^{(k)},\cdot )\right] (H) \end{aligned}$$
  • \(g_{6}\) gives the mean and the covariance of \(\varvec{Z}^{(k)}_{-i}|Z_i=0\). We have:

    $$\begin{aligned} \left( \varvec{m}^{(k)}_{|i},\varSigma ^{(k)}_{|i} \right) =g_{6}\left( \varvec{m}^{(k)},\varSigma ^{(k)}\right) =\left( \varvec{m}^{(k)}_{-i}-\frac{m^{(k)}_i}{\varSigma _{ii}^{(k)}}\varvec{\varSigma }_{-i,i}^{(k)}, \varSigma _{-i,-i}^{(k)}-\frac{1}{\varSigma ^{(k)}_{ii}}\varvec{\varSigma }_{-i,i}^{(k)}\varvec{\varSigma }_{-i,i}^{(k)\top }\right) \end{aligned}$$
    $$\begin{aligned} d_{\left( \varvec{m}^{(k)},\varSigma ^{(k)}\right) }\left[ g_{6}\right] (\varvec{h},H) = d_{\varvec{m}^{(k)}}\left[ g_{6}\left( \cdot ,\varSigma ^{(k)}\right) \right] (\varvec{h})+d_{\varSigma ^{(k)}}\left[ g_{6}\left( \varvec{m}^{(k)},\cdot \right) \right] (H), \end{aligned}$$
    $$\begin{aligned}&\text {with : }~~d_{\varvec{m}^{(k)}}\left[ g_{6}\left( \cdot ,\varSigma ^{(k)}\right) \right] (\varvec{h})= \left( \varvec{h}_{-i} -\frac{\varvec{h}_i}{\varSigma _{ii}^{(k)}}\varvec{\varSigma }_{-i,i}^{(k)}, ~0~\right) \nonumber \\&\text {and : }~~d_{\varSigma ^{(k)}}\left[ g_{6}\left( \varvec{m}^{(k)},\cdot \right) \right] (H)= \left( \frac{m^{(k)}_i H_{ii}}{\varSigma _{ii}^{(k)2}}\varvec{\varSigma }_{-i,i}^{(k)}-\frac{m^{(k)}_i}{\varSigma _{ii}^{(k)}}H_{-i,i},\right. \nonumber \\&\left. ~H_{-i,-i}+\frac{H_{ii}}{\varSigma _{ii}^{(k)2}}\varvec{\varSigma }_{-i,i}^{(k)}\varvec{\varSigma }_{-i,i}^{(k)\top }-\frac{1}{\varSigma _{ii}^{(k)}}H_{-i,i}\varvec{\varSigma }_{-i,i}^{(k)\top }-\frac{1}{\varSigma _{ii}^{(k)}}\varvec{\varSigma }_{-i,i}^{(k)}H_{-i,i}^\top \right) \end{aligned}$$
  • \(g_7\) and \(g_8\): these two functions take a mean vector and a covariance matrix as arguments and return a probability: \(\varPhi _{q,\varSigma ^{(k)}}\left( -\varvec{m}^{(k)}\right) =g_7\left( \varvec{m}^{(k)},\varSigma ^{(k)}\right) \) and \(\varPhi _{q-1,\varSigma ^{(k)}_{|i}}\left( -\varvec{m}^{(k)}_{|i}\right) =g_8\left( \varvec{m}^{(k)}_{|i},\varSigma ^{(k)}_{|i}\right) \). So, for \(\{p,\varGamma ,\varvec{a}\} = \{q,\varSigma ^{(k)},-\varvec{m}^{(k)}\}\) or \(\{q-1,\varSigma ^{(k)}_{|i},-\varvec{m}^{(k)}_{|i}\}\), we face the problem of differentiating a function \(\varPhi : (\varvec{a},\varGamma )\rightarrow \varPhi _{p,\varGamma }(\varvec{a})\) with respect to \((\varvec{a},\varGamma )\in \mathrm {I}\!\mathrm {R}^p\times \mathcal {S}_{++}^p\):

    $$\begin{aligned} d_{(\varvec{a},\varGamma )}\left[ \varPhi \right] (\varvec{h},H)=d_{\varvec{a}}\left[ \varPhi (\cdot ,\varGamma )\right] (\varvec{h})+d_{\varGamma }\left[ \varPhi (\varvec{a},\cdot )\right] (H). \end{aligned}$$

    The first term of this sum can be written:

    $$\begin{aligned} d_{\varvec{a}}\left[ \varPhi (\cdot ,\varGamma )\right] (\varvec{h}) = \left\langle \left( \frac{\partial }{\partial a_i} \varPhi (\varvec{a},\varGamma )\right) _{1\le i\le p},\varvec{h}\right\rangle , \end{aligned}$$

    with \(\frac{\partial }{\partial a_i} \varPhi (\varvec{a},\varGamma ) = \int \limits _{-\infty }^{a_1} \!\!\!\ldots \!\!\!\int \limits _{-\infty }^{a_{i-1}} \!\int \limits _{-\infty }^{a_{i+1}}\!\!\!\ldots \!\!\! \int \limits _{-\infty }^{a_p} \varphi _{p,\varGamma }(\varvec{u}_{-i},a_i) \,\mathrm {d}\varvec{u}_{-i}=\varphi _{1,\varGamma _{ii}}(a_i)\, \varPhi _{p-1,\varGamma _{|i}}\left( \varvec{a}_{|i}\right) . \) The last equality is obtained with the identity \(\forall \varvec{u}\in \mathrm {I}\!\mathrm {R}^p,~ \varphi _{p,\varGamma }(\varvec{u})=\varphi _{1,\varGamma _{ii}}(u_i)\, \varphi _{p-1,\varGamma _{|i}}(\varvec{u}_{|i}),\) with \(\varvec{u}_{|i}=\varvec{u}_{-i}-\frac{u_i}{\varGamma _{ii}}\varvec{\varGamma }_{-i,i}\) and \(\varGamma _{|i}=\varGamma _{-i,-i}-\frac{1}{\varGamma _{ii}}\varvec{\varGamma }_{-i,i}\varvec{\varGamma }_{-i,i}^\top \). The second differential is:

    $$\begin{aligned}d_{\varGamma }\left[ \varPhi (\varvec{a},\cdot )\right] (H) := \frac{1}{2}\mathrm {tr}\left( H . \left( \frac{\partial \varPhi }{\partial \varGamma _{ij}} (\varvec{a},\varGamma )\right) _{i,j\le p}\right) \nonumber = \frac{1}{2}\mathrm {tr}\left( H . \left( \frac{\partial ^2\varPhi }{\partial a_i\partial a_j}(\varvec{a},\varGamma )\right) _{i,j\le p}\right) \end{aligned}$$

    where \(\frac{\partial ^2\varPhi }{\partial a_i\partial a_j} (\varvec{a},\varGamma )= \left\{ \begin{array}{ll} \varphi _{2,\varGamma _{\{i,j\},\{i,j\}}}(a_i,a_j)\, \varPhi _{p-2,\varGamma _{|ij}}\left( \varvec{a}_{|ij}\right) & \text {if }i\ne j,\\ -\frac{a_i}{\varGamma _{ii}}\, \frac{\partial }{\partial a_i}\varPhi (\varvec{a},\varGamma ) - \sum \limits _{\begin{array}{c} k=1\\ k\ne i \end{array}}^p \frac{\varGamma _{ik}}{\varGamma _{ii}}\,\frac{\partial ^2}{\partial a_i\partial a_k} \varPhi (\varvec{a},\varGamma ) & \text {if }i = j. \end{array}\right. \) (A numerical check of the first-order partial derivative above is sketched right after this list.)
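
As a hedged illustration (not part of the paper), the following R sketch checks the first-order formula \(\frac{\partial }{\partial a_i} \varPhi _{p,\varGamma }(\varvec{a})=\varphi _{1,\varGamma _{ii}}(a_i)\,\varPhi _{p-1,\varGamma _{|i}}(\varvec{a}_{|i})\) against a central finite difference, using the mnormt package [2]; the covariance matrix, evaluation point, index \(i\), and step size are arbitrary illustrative choices.

```r
# Illustrative numerical check (not the paper's code) of the identity
#   d/da_i Phi_{p,Gamma}(a) = phi_{1,Gamma_ii}(a_i) * Phi_{p-1,Gamma_|i}(a_|i),
# with a_|i and Gamma_|i as defined above. Uses mnormt::pmnorm [2] for the
# multivariate normal CDF and compares with a central finite difference.
library(mnormt)

set.seed(1)
p <- 3
A <- matrix(rnorm(p * p), p, p)
Gamma <- crossprod(A) + diag(p)   # an arbitrary symmetric positive-definite covariance
a <- rnorm(p)
i <- 2

# Closed-form partial derivative with respect to a_i
a_cond <- a[-i] - (a[i] / Gamma[i, i]) * Gamma[-i, i]              # a_{|i}
G_cond <- Gamma[-i, -i] - tcrossprod(Gamma[-i, i]) / Gamma[i, i]   # Gamma_{|i}
closed <- dnorm(a[i], sd = sqrt(Gamma[i, i])) *
  pmnorm(a_cond, mean = rep(0, p - 1), varcov = G_cond)

# Central finite difference (agreement is limited by pmnorm's integration error)
eps <- 1e-3
a_plus  <- a; a_plus[i]  <- a[i] + eps
a_minus <- a; a_minus[i] <- a[i] - eps
fd <- (pmnorm(a_plus,  mean = rep(0, p), varcov = Gamma) -
       pmnorm(a_minus, mean = rep(0, p), varcov = Gamma)) / (2 * eps)

print(c(closed_form = closed, finite_difference = fd))
```

The code also computes \(\varvec{a}_{|i}\) and \(\varGamma _{|i}\) explicitly, i.e. the conditional mean shift and Schur complement used by \(g_6\) above.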

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Marmin, S., Chevalier, C., Ginsbourger, D. (2015). Differentiating the Multipoint Expected Improvement for Optimal Batch Design. In: Pardalos, P., Pavone, M., Farinella, G., Cutello, V. (eds) Machine Learning, Optimization, and Big Data. MOD 2015. Lecture Notes in Computer Science, vol 9432. Springer, Cham. https://doi.org/10.1007/978-3-319-27926-8_4

  • DOI: https://doi.org/10.1007/978-3-319-27926-8_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27925-1

  • Online ISBN: 978-3-319-27926-8
