
Optimal learning with a local parametric belief model

Journal of Global Optimization

Abstract

We are interested in maximizing smooth functions where observations are noisy and expensive to compute, as might arise in computer simulations or laboratory experiments. We derive a knowledge gradient policy, which chooses measurements that maximize the expected value of information, while using a locally parametric belief model built from linear approximations with radial basis functions. The method uses a compact representation of the function that avoids storing the entire history, as is typically required by nonparametric methods. Our technique values a measurement by its ability to improve our estimate of the optimum, capturing correlations in our beliefs about neighboring regions of the function without imposing any a priori assumptions on its global shape. Experimental work suggests that the method adapts to a range of arbitrary, continuous functions and appears to reliably find the optimal solution. Moreover, the policy is shown to be asymptotically optimal.
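For readers who want a concrete picture of the decision rule, the following is a minimal Monte Carlo sketch of the generic knowledge-gradient criterion under a correlated multivariate normal belief. It is not the paper's RBF-based belief model; the names kg_values and noise_var are illustrative assumptions.

```python
import numpy as np

def kg_values(mu, Sigma, noise_var, n_samples=2000, rng=None):
    """Monte Carlo estimate of the knowledge-gradient value of measuring each
    alternative x: E[max_j mu_j^{n+1} | measure x] - max_j mu_j^n, under a
    correlated multivariate normal belief (mu, Sigma) with noise variance noise_var."""
    rng = np.random.default_rng() if rng is None else rng
    best_now = mu.max()
    kg = np.zeros(len(mu))
    for x in range(len(mu)):
        # predictive effect of one noisy measurement at x on all posterior means
        sigma_tilde = Sigma[:, x] / np.sqrt(noise_var + Sigma[x, x])
        z = rng.standard_normal(n_samples)
        kg[x] = np.mean(np.max(mu[:, None] + np.outer(sigma_tilde, z), axis=0)) - best_now
    return kg

# measure the alternative with the largest expected value of information
mu = np.array([0.0, 0.2, 0.1])
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.5],
                  [0.2, 0.5, 1.0]])
print(np.argmax(kg_values(mu, Sigma, noise_var=0.5)))
```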




Appendix: Proofs

Assumption 1

For any local parametric model \(\theta \), let \(\theta (i)\) denote the \(i\)-th element of \(\theta \). For any \(i \ne j\), we have \(\limsup _{n \rightarrow \infty } | \text{ Corr }^n [\theta (i), \theta (j)] | \le 1\) almost surely. Essentially, we assume the correlation between distinct parameters is bounded by 1.

Assumption 2

For any level \(g \in {\mathcal {G}}\setminus \{ 0 \}\), we have \(\liminf _n |v_x^{g,n}| > 0\). In other words, the estimates at every level other than \(g = 0\) remain biased.

Lemma 1

If we have a prior on each parameter, then for any \(x, x'\in {\mathcal {X}}\), we have \(\sup _n |x^T \theta _j^{g,n}| < \infty \) and \(\sup _n |x^T \varSigma ^{\theta , g,n}_j x'| < \infty \) almost surely.

Proof

We can show that for any cloud \(j\), \((\theta _j^n, \varSigma ^{\theta , n}_j)\) is a uniformly integrable martingale; therefore, it converges to some integrable random variable \((\theta _j^\infty , \varSigma ^{\theta , \infty }_j)\) almost surely.

Fix any \(g\) and \(j\); we see that

$$\begin{aligned} \varSigma ^{\theta , n+1} - \varSigma ^{\theta , n} &= - \frac{\varSigma ^{\theta , n} x^n (x^n)^T \varSigma ^{\theta , n}}{\lambda + \varSigma _{x^n, x^n}}, \\ \text{ diag }\left( \varSigma ^{\theta , n+1} - \varSigma ^{\theta , n}\right) &= - \frac{1}{\lambda + \varSigma _{x^n, x^n}} \left[ \begin{array}{ccc} \zeta ^2_{x^n} (1) & \cdots & 0 \\ & \ddots & \\ 0 & \cdots & \zeta ^2_{x^n} (d) \end{array}\right] , \end{aligned}$$

where \(\zeta _{x^n} (i)\) is the \(i\)-th element of the product \(\varSigma ^{\theta ,n} x^n\). By definition, we have \(\lambda + \varSigma _{x^n, x^n} > 0\). We also note that \(\text{ diag }(\varSigma ^{\theta ,n})\) contains the variances of the parameters, namely \(\text{ Var }^n[\theta ]\). It is apparent that \(\text{ Var }^n [\theta (i)]\) is a non-increasing sequence bounded below by zero. The off-diagonal elements \(\varSigma ^{\theta ,n}(i,j)\), \(i \ne j\), are the covariances between parameters \(i\) and \(j\). By the definition of correlation, we have

$$\begin{aligned} \varSigma ^{\theta , n} (i,j) &= \text{ Corr }^n[\theta (i), \theta (j)] \sqrt{\text{ Var }^n [\theta (i)]\, \text{ Var }^n [\theta (j)]}, \\ |\varSigma ^{\theta , n} (i,j) | &\le \sqrt{\text{ Var }^n [\theta (i)]\, \text{ Var }^n [\theta (j)]} < \infty . \end{aligned}$$

It follows that \(\sup _n |x^T \varSigma ^{\theta , g,n}_j x'| < \infty \), since all the elements of \(\varSigma ^{\theta , g,n}_j\) are finite and the alternatives \(x, x' \in {\mathcal {X}}\) have finite components. \(\square \)
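As a numerical illustration of the recursion above, the following sketch applies the covariance update repeatedly, assuming \(\varSigma _{x^n, x^n} = (x^n)^T \varSigma ^{\theta ,n} x^n\) and treating \(\lambda \) as the measurement-noise variance, and checks that the parameter variances on the diagonal never increase.

```python
import numpy as np

def update_covariance(Sigma, x, lam):
    """One recursive update of the parameter covariance after measuring at x,
    assuming Sigma_{x,x} = x^T Sigma x and lam > 0 is the measurement-noise variance."""
    Sigma_xx = x @ Sigma @ x
    return Sigma - np.outer(Sigma @ x, Sigma @ x) / (lam + Sigma_xx)

# the diagonal (the parameter variances Var^n[theta(i)]) never increases
rng = np.random.default_rng(0)
Sigma = np.eye(3)
for _ in range(5):
    x = rng.normal(size=3)
    Sigma_new = update_covariance(Sigma, x, lam=0.5)
    assert np.all(np.diag(Sigma_new) <= np.diag(Sigma) + 1e-12)
    Sigma = Sigma_new
```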

Lemma 2

If we have a prior on each alternative, then for any \(x, x' \in {\mathcal {X}}\), the following are finite almost surely: \(\sup _n |\theta _x^{g,n}|\), \(\sup _n |a_{x'}^{g,n} (x)|\), and \(\sup _n |b_{x'}^{g,n} (x)|\).

Proof

That \(\sup _n |\theta _x^{g,n}|\) is finite follows from the fact that \(\theta _x^{g,n}\) is a convex combination of the \(x^T \theta _j^{g,n}\); by Lemma 1, a convex combination of finite quantities is also finite.

To show that \(\sup _n |a_{x'}^{g,n} (x)|\) is finite, we bound both parts of the sum as follows,

$$\begin{aligned} |a_{x'}^{n} (x)| \le \left| \frac{\sum _{j = 1}^{N_c} \varphi _j^{n+1} (x) x^T \theta _j^n }{K^{n+1} (x)} \right| + \left| \frac{ \varphi _I^{n+1}(x) (f(x^n | \varTheta ^{n}) - x^T \theta _I^n) x^T \varSigma ^{\theta , n}_I x' }{K^{n+1} (x) \gamma _I^n }\right| . \end{aligned}$$

For the first part, we have a convex combination of all the local parametric models; therefore,

$$\begin{aligned} \left| \frac{\sum _{j = 1}^{N_c} \varphi _j^{n+1} (x) x^T \theta _j^n }{K^{n+1} (x)} \right| \le \frac{\sum _{j = 1}^{N_c} \varphi _j^{n+1} (x) \left| x^T \theta _j^n \right| }{K^{n+1} (x)} < \infty . \end{aligned}$$

By Lemma 1, the convex combination of finite values is also finite. For the second part, we see that

$$\begin{aligned} \left| \frac{ \varphi _I^{n+1}(x) (f(x^n | \varTheta ^{n}) - x^T \theta _I^n) x^T \varSigma ^{\theta , n}_I x' }{K^{n+1} (x) \gamma _I^n }\right| &= \left| \frac{\varphi _I^{n+1}(x)}{K^{n+1} (x) } \right| \left| \frac{ (f(x^n | \varTheta ^{n}) - x^T \theta _I^n) x^T \varSigma ^{\theta , n}_I x' }{\gamma _I^n }\right| \\ &\le \left| \frac{ (f(x^n | \varTheta ^{n}) - x^T \theta _I^n) x^T \varSigma ^{\theta , n}_I x' }{\gamma _I^n }\right| . \end{aligned}$$

Using Lemma 1 again, this expression is also finite since \(\gamma _I^{n}\) is bounded below by \(\lambda _x > 0\). We now show that \(\sup _n |b_{x'}^{g,n} (x)|\) is finite. We temporarily drop the index \(g\) for clarity; then for any cloud \(I\),

$$\begin{aligned} |b_{x'}^n (x) | &= \left| \frac{ \varphi _I^{n+1} \left( \sqrt{\lambda + \varSigma ^n_{x' x'} }\right) x^T \varSigma ^{\theta , n}_I x' }{K^{n+1} \gamma _I^n } \right| \\ &\le \left| \frac{ \varphi _I^{n+1} }{K^{n+1}} \right| \left| \frac{\left( \sqrt{\lambda + \varSigma ^n_{x' x'} }\right) x^T \varSigma ^{\theta , n}_I x' }{\gamma _I^n } \right| \\ &\le \left| \frac{\left( \sqrt{\lambda + \varSigma ^n_{x' x'} }\right) x^T \varSigma ^{\theta , n}_I x' }{\gamma _I^n } \right| . \end{aligned}$$

We know that \(\sup _n |\varSigma ^n_{x' x'} | < \infty \). It follows from Lemma 1 that \(\sup _n |b_{x'}^{g,n} (x) |\) is finite almost surely. \(\square \)

Lemma 3

For any \(\omega \in \varOmega \), let \({\mathcal {X}}' (\omega )\) be the random set of alternatives measured infinitely often by the KG-RBF policy. For all \(x, x' \in {\mathcal {X}}\), the following statements hold almost surely,

  • if \(x \in {\mathcal {X}}'\), then \(\lim _n b_{x'}^n (x) = 0\) and \(\lim _n b_{x}^n (x') = 0\),

  • if \(x \notin {\mathcal {X}}'\), then \(\liminf _n b_{x}^n (x) > 0\).

Proof

We first consider the case where \(x \in {\mathcal {X}}'\). We know that \((\sigma _x^{g,n})^2 \rightarrow 0\) as \(n \rightarrow \infty \) for all \(g \in {\mathcal {G}}\). We focus on the asymptotic behavior of the weights \(w_x^{g,n}\). By definition, the bias of the base level satisfies \(\lim _n v_x^{0,n} = 0\). For \(g \ne 0\), we have

$$\begin{aligned} w_x^{g,n} \le \frac{\left( (\sigma _x^{g,n})^2 + (v_x^{g,n})^2\right) ^{-1}}{(\sigma _x^{0,n})^{-2} + \left( (\sigma _x^{g,n})^2 + (v_x^{g,n})^2\right) ^{-1}}. \end{aligned}$$

By Assumption 2, all hierarchical levels \(g \ne 0\) have bias, i.e. \(\lim _n v_x^{g,n} \ne 0\), so the numerator above remains bounded while \((\sigma _x^{0,n})^{-2} \rightarrow \infty \). This means that \(\lim _n w_x^{g,n} = 0\) for all \(g \ne 0\), implying \(\lim _n w_x^{0,n} = 1\). In other words, we have

$$\begin{aligned} \lim _n b_{x'}^n (x) = \lim _n \sum _{g \in {\mathcal {G}}} w_{x'}^{g,n} b_{x'}^{g,n} (x) = \lim _n b_{x'}^{0,n} (x). \end{aligned}$$
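To see this limit numerically, the following sketch assumes precision-type weights \(w_x^{g,n} \propto \left( (\sigma _x^{g,n})^2 + (v_x^{g,n})^2\right) ^{-1}\) with zero bias at \(g = 0\); it is only an illustration of the argument, not the paper's exact weighting scheme.

```python
import numpy as np

def level_weights(sigma2, bias2):
    """Precision-type weights w^g proportional to 1 / (sigma_g^2 + bias_g^2),
    normalized to sum to one; a sketch of the hierarchical weighting, not the
    paper's exact estimator."""
    prec = 1.0 / (np.asarray(sigma2) + np.asarray(bias2))
    return prec / prec.sum()

# level g = 0 is unbiased; the other levels keep a fixed bias, so their
# weights vanish as all variances shrink with more measurements
for n in [1, 10, 100, 1000]:
    sigma2 = np.array([1.0 / n, 1.0 / n, 1.0 / n])
    bias2 = np.array([0.0, 0.25, 1.0])
    print(n, level_weights(sigma2, bias2))  # weight on level 0 tends to 1
```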

The DC-RBF model for the base level is equivalent to a lookup table model with independent alternatives, since we let \(D_T\) be infinitesimally small (or, equivalently, we let \(N_c = M\) for the base level). Note that the independence of the alternatives on the base level comes from the fact that all of the RBFs are independent of each other by definition. Following Eq. (4), we have

$$\begin{aligned} b_{x'}^{0,n} (x) = \frac{e_x \varSigma ^{0,n} e_{x'}}{\sqrt{\lambda _{x'} + \varSigma _{x'x'}^n} }, \end{aligned}$$

where \(\varSigma ^{0,n}\) is the covariance matrix of the alternatives and is diagonal. It is apparent that \(b_{x'}^{0,n} (x) = 0\) when \(x' \ne x\), since the alternatives are independent. For \(x' = x\), the numerator is \(e_x \varSigma ^{0,n} e_x = (\sigma _x^n)^2\), the posterior variance of alternative \(x\). Since \(x \in {\mathcal {X}}'\) is measured infinitely many times, we have \((\sigma _x^n)^2 \rightarrow 0\). Therefore, \(\lim _n b_{x'}^n (x) = 0\) in this case as well.
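A small sketch of the two facts just used, with a diagonal base-level covariance: the off-diagonal coefficients \(b_{x'}^{0,n}(x)\) are exactly zero, and the diagonal coefficient vanishes as the posterior variance of a frequently measured alternative shrinks. The noise variance \(\lambda _x\) below is an assumed placeholder.

```python
import numpy as np

def b0(Sigma_diag, lam, x, xp):
    """Base-level coefficient b^{0,n}_{x'}(x) = e_x^T Sigma^{0,n} e_{x'} / sqrt(lam + Sigma^n_{x'x'})
    for a diagonal covariance over the alternatives; lam plays the role of the noise variance."""
    Sigma = np.diag(Sigma_diag)
    return Sigma[x, xp] / np.sqrt(lam + Sigma[xp, xp])

Sigma_diag = np.array([1.0, 0.5, 2.0])
print(b0(Sigma_diag, lam=0.3, x=0, xp=1))      # exactly 0 when x != x'
for var in [1.0, 0.1, 0.01, 0.001]:            # b^{0,n}_x(x) -> 0 as (sigma_x^n)^2 -> 0
    print(b0(np.array([var, 0.5, 2.0]), lam=0.3, x=0, xp=0))
```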

If \(x \notin {\mathcal {X}}'\), it suffices to show that for any cloud \(I\) and level \(g \in {\mathcal {G}}\), we have

$$\begin{aligned} \liminf _n \tilde{\sigma } (x,x, I) = \frac{ \varphi _I^{n+1} \left( \sqrt{\lambda + \varSigma ^n_{xx} }\right) x^T \varSigma ^{\theta , n}_I x }{K^{n+1} \gamma _I^n } > 0. \end{aligned}$$

This becomes evident when we rearrange the expression as

$$\begin{aligned} \liminf _n \tilde{\sigma } (x,x, I) &= \left( \frac{ \varphi _I^{n+1} }{K^{n+1}} \right) \left( \frac{ \sqrt{\lambda + \varSigma ^n_{xx} } }{\gamma _I^n } \right) \left( x^T \varSigma ^{\theta , n}_I x \right) \\ &> \left( \frac{ \varphi _I^{n+1} }{K^{n+1}} \right) \left( \frac{ \sqrt{\lambda } }{\gamma _I^n } \right) \left( x^T \varSigma ^{\theta , n}_I x \right) \\ &> 0. \end{aligned}$$

The first part of the product is strictly positive by our use of the Gaussian kernel, for which \(\varphi (x) > 0\) for all \(x \in {\mathcal {X}}\). By definition, \(\varSigma ^n_{xx} = \sigma ^2_x > 0\) for a finitely sampled alternative; therefore, the second part of the product is also strictly positive. The last part of the product is bounded below by zero, since \(\varSigma ^{\theta , n}_I\) is a positive semi-definite matrix by definition. Since \(b_x^n(x)\) is a convex combination of the above expression in Eq. (22), we have \(\liminf _n b_x^n(x) > 0\) for finitely sampled alternatives \(x \notin {\mathcal {X}}'\). \(\square \)

Cite this article

Cheng, B., Jamshidi, A. & Powell, W.B. Optimal learning with a local parametric belief model. J Glob Optim 63, 401–425 (2015). https://doi.org/10.1007/s10898-015-0299-y
