Abstract
We are interested in maximizing smooth functions where observations are noisy and expensive to compute, as might arise in computer simulations or laboratory experiments. We derive a knowledge gradient policy, which chooses measurements that maximize the expected value of information, while using a locally parametric belief model based on linear approximations with radial basis functions. The method uses a compact representation of the function that avoids storing the entire measurement history, as is typically required by nonparametric methods. Our technique evaluates the expected value of a measurement in terms of its ability to improve our estimate of the optimum, capturing correlations in our beliefs about neighboring regions of the function, without imposing any a priori assumptions on the global shape of the underlying function. Experimental work suggests that the method adapts to a range of arbitrary, continuous functions and appears to reliably find the optimal solution. Moreover, the policy is shown to be asymptotically optimal.
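The policy described above builds on the knowledge-gradient idea of measuring the alternative whose observation most improves the estimate of the optimum. As a point of reference only, a minimal sketch of the classic knowledge-gradient computation for *independent* normal beliefs (Frazier et al. 2008) is given below; the paper's actual KG-RBF policy uses correlated, locally parametric beliefs, which this sketch does not capture.

```python
import math

def kg_factor(z):
    # f(z) = z * Phi(z) + phi(z): standard normal loss function
    phi = math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return z * Phi + phi

def knowledge_gradient(mu, sigma2, lam):
    """KG value of measuring each alternative once, under independent
    normal beliefs (means mu, variances sigma2) and noise variance lam."""
    values = []
    for x in range(len(mu)):
        # std. dev. of the change in the posterior mean of x after one measurement
        sigma_tilde = sigma2[x] / math.sqrt(sigma2[x] + lam)
        best_other = max(mu[i] for i in range(len(mu)) if i != x)
        z = -abs(mu[x] - best_other) / sigma_tilde
        values.append(sigma_tilde * kg_factor(z))
    return values
```

The policy then measures `argmax` of these values; correlated beliefs change only the computation of the predictive change in the means, not the value-of-information principle.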
Appendix: Proofs
Assumption 1
For any local parametric model \(\theta \), let \(\theta (i)\) denote the \(i\)-th element of \(\theta \). For any \(i \ne j\), we have \(\limsup _{n \rightarrow \infty } | \text{ Corr }^n [\theta (i), \theta (j)] | \le 1\) almost surely. Essentially, we assume that the correlation between any two parameters remains bounded by 1 in the limit.
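Assumption 1 is the familiar Cauchy–Schwarz bound on the correlations implied by a covariance matrix. A minimal numerical check, using a made-up covariance matrix purely for illustration:

```python
import math

def correlation(Sigma, i, j):
    """Correlation of parameters i and j implied by covariance matrix Sigma."""
    return Sigma[i][j] / math.sqrt(Sigma[i][i] * Sigma[j][j])

# Example: Cov(theta(0), theta(1)) = 2, Var = 4 and 9, so Corr = 2/6 = 1/3.
Sigma = [[4.0, 2.0],
         [2.0, 9.0]]
```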
Assumption 2
For any level \(g \in {\mathcal {G}}\setminus \{ 0 \}\), we have \(\liminf _n |v_x^{g,n}| > 0\). In other words, the estimates at every level except \(g = 0\) are biased.
Lemma 1
If we have a prior on each parameter, then for any \(x, x'\in {\mathcal {X}}\), we have \(\sup _n |x^T \theta _j^{g,n}| < \infty \) and \(\sup _n |x^T \varSigma ^{\theta , g,n}_j x'| < \infty \) almost surely.
Proof
We can show that for any cloud \(j\), \((\theta _j^n, \varSigma ^{\theta , n}_j)\) is a uniformly integrable martingale; therefore, it converges almost surely to some integrable random variable \((\theta _j^\infty , \varSigma ^{\theta , \infty }_j)\).
Fix any \(g\) and \(j\). We see that
\[ \text{ Var }^{n+1} [\theta (i)] = \text{ Var }^{n} [\theta (i)] - \frac{\left( \zeta _x^n (i) \right) ^2}{\lambda + \varSigma _{x^n, x^n}}, \]
where \(\zeta _x^n (i)\) is the \(i\)-th element of the product \(\varSigma ^{\theta ,n} x^n\). By definition, we have \(\lambda + \varSigma _{x^n, x^n} > 0\). We also note that \(\text{ diag }(\varSigma ^{\theta ,n})\) is in fact the variance of each parameter, namely \(\text{ Var }^n[\theta ]\). It is apparent that \(\text{ Var }^n [\theta (i)]\) is a non-increasing sequence bounded below by zero. The off-diagonal element \(\varSigma ^{\theta ,n}(i,j)\), where \(i \ne j\), is defined as the covariance of parameters \(i\) and \(j\). By the definition of correlation, we have
\[ |\varSigma ^{\theta ,n}(i,j)| = |\text{ Corr }^n [\theta (i), \theta (j)]| \, \sqrt{\text{ Var }^n [\theta (i)] \, \text{ Var }^n [\theta (j)]} < \infty . \]
It follows that \(\sup _n |x^T \varSigma ^{\theta , g,n}_j x'| < \infty \), since all the elements of \(\varSigma ^{\theta , g,n}_j\) are finite, as are the alternatives \(x, x' \in {\mathcal {X}}\). \(\square \)
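The monotone, bounded behavior of \(\text{ Var }^n [\theta (i)]\) used in this proof can be illustrated numerically. The sketch below assumes the standard recursive least-squares form of the Bayesian covariance update, \(\varSigma ' = \varSigma - \varSigma x x^T \varSigma / (\lambda + x^T \varSigma x)\); the diagonal entries never increase, and the implied correlation stays bounded by 1:

```python
def update_covariance(Sigma, x, lam):
    """One recursive Bayesian update of the parameter covariance:
    Sigma' = Sigma - (Sigma x x^T Sigma) / (lam + x^T Sigma x).
    Assumes the standard recursive least-squares form."""
    d = len(Sigma)
    Sx = [sum(Sigma[i][k] * x[k] for k in range(d)) for i in range(d)]
    denom = lam + sum(x[i] * Sx[i] for i in range(d))
    return [[Sigma[i][j] - Sx[i] * Sx[j] / denom for j in range(d)]
            for i in range(d)]

# Illustrative prior covariance and measurement direction.
Sigma0 = [[1.0, 0.3],
          [0.3, 2.0]]
Sigma1 = update_covariance(Sigma0, [1.0, 0.5], 0.5)
```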
Lemma 2
If we have a prior on each alternative, then for any \(x, x' \in {\mathcal {X}}\), the following are finite almost surely: \(\sup _n |\theta _x^{g,n}|\), \(\sup _n |a_{x'}^{g,n} (x)|\), and \(\sup _n |b_{x'}^{g,n} (x)|\).
Proof
That \(\sup _n |\theta _x^{g,n}|\) is finite follows from the fact that \(\theta _x^{g,n}\) is a convex combination of \(x^T \theta _j^{g,n}\); by Lemma 1, a convex combination of finite variables is also finite.
To show that \(\sup _n |a_{x'}^{g,n} (x)|\) is finite, we bound both parts of the sum as follows,
For the first part, we have a convex combination of all the local parametric models; therefore
By Lemma 1, the convex combination of finite values is also finite. For the second part, we see that
Using Lemma 1 again, this expression is also finite since \(\gamma _I^{n}\) is bounded below by \(\lambda _x > 0\). Now we show that \(\sup _n |b_{x'}^{g,n} (x)|\) is finite. We temporarily drop the index \(g\) for clarity; then for any cloud \(I\),
We know that \(\sup _n |\varSigma ^n_{x' x'} | < \infty \). It follows from Lemma 1 that \(\sup _n |b_{x'}^{g,n} (x) |\) is finite almost surely. \(\square \)
Lemma 3
For any \(\omega \in \varOmega \), let \({\mathcal {X}}' (\omega )\) be the random set of alternatives measured infinitely often by the KG-RBF policy. For all \(x, x' \in {\mathcal {X}}\), the following statements hold almost surely,
-
if \(x \in {\mathcal {X}}'\), then \(\lim _n b_{x'}^n (x) = 0\) and \(\lim _n b_{x}^n (x') = 0\),
-
if \(x \notin {\mathcal {X}}'\), then \(\liminf _n b_{x}^n (x) > 0\).
Proof
We first consider the case where \(x \in {\mathcal {X}}'\). We know that \((\sigma _x^{g,n})^2 \rightarrow 0\) as \(n \rightarrow \infty \) for all \(g \in {\mathcal {G}}\). We first focus on the asymptotic behavior of the weights \(w_x^{g,n}\). By definition, the bias of the base level vanishes: \(\lim _n v_x^{0,n} = 0\). For \(g \ne 0\), we have
By Assumption 2, all hierarchical levels \(g \ne 0\) have bias, i.e., \(\lim _n v_x^{g,n} \ne 0\). This means that \(\lim _n w_x^{g,n} = 0\) for all \(g \ne 0\), implying \(\lim _n w_x^{0,n} = 1\). In other words, we have
The DC-RBF model for the base level is equivalent to a lookup table model with independent alternatives, since we let \(D_T\) be infinitesimally small (or equivalently we let \(N_c = M\) for the base level). Note that the independence of the alternatives on the base level comes from the fact that all of the RBFs are independent of each other by definition. Following Eq. (4), we have
where \(\varSigma ^{0,n}\) is the covariance matrix of the alternatives and is diagonal. It is apparent that \(b_{x'}^{0,n} (x) = 0\) when \(x' \ne x\), since the alternatives are independent. For \(x' = x\), the numerator is \(e_x^T \varSigma ^n e_x = (\sigma _x^n)^2\), the posterior variance of alternative \(x\). Since \(x \in {\mathcal {X}}'\) is measured infinitely many times, we have \((\sigma _x^n)^2 \rightarrow 0\). Therefore, \(\lim _n b_{x'}^n (x) = 0\) in this case as well.
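The diagonal-covariance argument above can be checked numerically. The sketch below assumes a correlated-KG-style form \(b_{x'}(x) = \varSigma _{x' x} / \sqrt{\lambda + \varSigma _{xx}}\) for the predictive change coefficient (an assumed form for illustration; the paper's Eq. (4) is not reproduced here):

```python
import math

def b_coeff(Sigma, x, x_prime, lam):
    """b_{x'}(x) = Sigma[x'][x] / sqrt(lam + Sigma[x][x]) -- an assumed
    correlated-KG-style form for the predictive change coefficient."""
    return Sigma[x_prime][x] / math.sqrt(lam + Sigma[x][x])

# Diagonal covariance: independent alternatives at the base level.
Sigma = [[2.0, 0.0],
         [0.0, 3.0]]
```

With a diagonal \(\varSigma \), measuring \(x\) moves no belief other than its own, exactly as the proof claims for the base level.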
If \(x \notin {\mathcal {X}}'\), it suffices to show that for any cloud \(I\) and level \(g \in {\mathcal {G}}\), we have
This becomes evident when we rearrange the expression as
The first part of the product is strictly positive by our use of the Gaussian kernel, since \(\varphi (x) > 0\) for all \(x \in {\mathcal {X}}\). By definition, \(\varSigma ^n_{xx} = \sigma ^2_x> 0\) for a finitely sampled alternative; therefore, the second part of the product is also strictly positive. The last part of the sum is nonnegative since \(\varSigma ^{\theta , n}_I\) is a positive semi-definite matrix by definition. Since \(b_x^n(x)\) is a convex combination of the above expression in Eq. (22), we have \(\liminf _n b_x^n(x) > 0\) for finitely sampled alternatives \(x \notin {\mathcal {X}}'\). \(\square \)
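The two positivity facts the proof leans on, strict positivity of the Gaussian kernel and nonnegativity of a positive semi-definite quadratic form, can be illustrated directly (the matrix below is a made-up PSD example):

```python
import math

def gaussian_rbf(x, center, width):
    """Gaussian radial basis function: strictly positive for every input x."""
    return math.exp(-((x - center) ** 2) / (2.0 * width ** 2))

def quad_form(Sigma, x):
    """Quadratic form x^T Sigma x; nonnegative when Sigma is PSD."""
    d = len(x)
    return sum(x[i] * Sigma[i][j] * x[j] for i in range(d) for j in range(d))

# A positive semi-definite matrix (eigenvalues 1 and 3).
Sigma = [[2.0, 1.0],
         [1.0, 2.0]]
```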
Cheng, B., Jamshidi, A. & Powell, W.B. Optimal learning with a local parametric belief model. J Glob Optim 63, 401–425 (2015). https://doi.org/10.1007/s10898-015-0299-y