Lower Bounds on the Noiseless Worst-Case Complexity of Efficient Global Optimization

Efficient global optimization is a widely used method for optimizing expensive black-box functions. In this paper, we study the worst-case oracle complexity of the efficient global optimization problem. In contrast to existing kernel-specific results, we derive a unified lower bound for the oracle complexity of efficient global optimization in terms of the metric entropy of a ball in its corresponding reproducing kernel Hilbert space. Moreover, we show that this lower bound nearly matches the upper bound attained by non-adaptive search algorithms, for the commonly used squared exponential kernel and the Matérn kernel with a large smoothness parameter ν. This matching is up to a replacement of d/2 by d and a logarithmic term log(R/ε), where d is the dimension of the input space, R is the upper bound for the norm of the unknown black-box function, and ε is the desired accuracy. That is to say, our lower bound is nearly optimal for these kernels.


Introduction
Black-box optimization by sequentially evaluating different candidate solutions without access to gradient information is a pervasive problem. For example, tuning the hyperparameters of machine learning models [3,30], optimizing control system performance [2,40], and discovering drugs or designing materials [10,21] can all be formulated as black-box optimization problems without explicit gradient information. Therefore, efficient global optimization [13,29], as a sample-efficient method for solving expensive black-box optimization problems without explicit gradient information, has recently been receiving much attention. Efficient global optimization is based on the idea of constructing a surrogate function using Gaussian process regression or kernel ridge regression to guide the search for the optimal solution [13].
In many applications, e.g., tuning the hyperparameters of a deep neural network (where the objective function in discrete variables, such as the number of layers, can be regarded as a restriction of a continuous function), each sample can take significant resources such as time and computation. For such problems, understanding the sample complexity of efficient global optimization is of great theoretical interest and practical relevance.
There is a large body of literature on the convergence rates of particular efficient global optimization algorithms [7,26,31,33,34,37]. Two typical analysis setups are the Bayesian and non-Bayesian settings. In the Bayesian setting, the black-box function is assumed to be sampled from a Gaussian process, whereas in the non-Bayesian setting, the black-box function is assumed to be regular in the sense of having a bounded norm in the corresponding reproducing kernel Hilbert space.
As a complement to the convergence analysis of different algorithms, complexity analysis tries to understand the inherent hardness of a problem. Specifically, we are interested in answering the question: for a class of optimization problems, how many queries to an oracle, which returns some information about the function, are necessary to guarantee the identification of a solution with objective value at most ε worse than the optimal value [22]? Without a complexity analysis, we cannot tell whether existing algorithms can be improved further in terms of convergence rate. This problem is well studied for convex optimization (e.g., in [22]), but less well understood for efficient global optimization.
Intuitively, the complexity of efficient global optimization largely depends on the richness, or complexity, of the functions inside the corresponding reproducing kernel Hilbert space (RKHS). Indeed, selecting the proper RKHS, or the kernel function k, is an important research question in the literature [14,15]. Intuitively, the choice of the kernel function captures the prior knowledge about the black-box function to optimize. As an extreme example, if we know the ground-truth black-box function is linear, we can adopt the linear kernel. Then, after a finite number of noiseless function evaluations, we can uniquely determine the ground-truth function and hence the optimal solution. However, agnostically selecting simple kernels may lead to a surrogate function that is not expressive enough. For example, when the black-box function is nonlinear, an RKHS with a linear kernel cannot learn the ground-truth function well. For such a function, it is more reasonable to select a more expressive kernel, such as the squared exponential kernel. To measure the complexity of a set of functions, metric entropy [16] is widely used in learning theory. However, as far as we know, the explicit connection between a complexity measure such as metric entropy for a function set and the problem complexity of efficient global optimization has not been established. This paper focuses on the complexity analysis of efficient global optimization with general kernel functions in the non-Bayesian and noiseless setting. Although the noisy setting is more realistic from the practical point of view, it is also critical to consider the noiseless setting from the complexity-theoretic point of view. The rationale is that noise may introduce additional statistical complexity to the problem and corrupt the analysis of the inherent complexity of efficient global optimization. In addition, the noiseless setting is not a simple extension of the noisy setting. Existing analysis under the noisy setting (e.g., [5,25,27,28])
typically relies on strictly positive noise variance. Simply setting the noise variance to zero makes the analysis and results degenerate. For example, the noisy lower bound for the squared exponential (SE) kernel in [28] is dominated by a σ²/ε² term, where σ² is the noise variance, ε is the desired accuracy, and R is the function norm upper bound. Simply setting σ = 0 gives a meaningless Ω(0) bound. Without an analysis under the noiseless setting, it is unclear whether this dominant σ²/ε² term is due to noise or to the inherent complexity of the RKHS. To highlight our originality and contribution, a comparison of our results with the state-of-the-art complexity analysis is given in Table 1. As far as we know, our work is the first to give a unified general lower bound in terms of metric entropy. Interestingly, we also notice that the Θ(1/ε²) term commonly seen in the noisy setting disappears in the noiseless setting, which matches the intuition that estimating a point value under Gaussian noise typically takes Θ(1/ε²) samples. Specifically, our contributions include:
- We introduce a new set of analysis techniques and derive a general unified lower bound for the deterministic oracle complexity of efficient global optimization in terms of the metric entropy of the function space ball in the corresponding reproducing kernel Hilbert space, providing a unified and intuitive understanding of the complexity of efficient global optimization.
- Our general lower bound allows us to leverage existing estimates of the covering number of the function space ball in the RKHS to derive kernel-specific lower bounds for the commonly used squared exponential kernel and the Matérn kernel with a large smoothness parameter ν, interestingly without the 1/ε² term common in the noisy setting. Furthermore, to the best of our knowledge, the lower bound for the squared exponential kernel under the noiseless setting is derived for the first time.
- We further show that these kernel-specific lower bounds nearly match the upper bounds attained by some non-adaptive search algorithms, where the upper bound for the squared exponential kernel is newly derived in this paper. Hence, our general lower bound is close to optimal for these specific kernels.

Related Work
There has been a large body of literature on analyzing the complexity and the convergence properties of efficient global optimization. We first summarize the relevant literature area by area and then highlight the position and the original contribution of our paper.
Algorithm-dependent Convergence Analysis. One line of research analyzes the properties of particular types of algorithms. For example, some papers [9,17] analyze the consistency of efficient global optimization algorithms. Vazquez and Bect [34] and Wang and de Freitas [37] analyze the convergence of the expected improvement algorithm. Vakili et al. [33] propose a maximum variance reduction algorithm that achieves optimal-order simple regret for particular kernel functions. Under the assumption of Hölder continuity of the covariance function, lower and upper bounds are derived for the Bayesian setting in [12]. Among this literature, the works on information-theoretic upper bounds are the most relevant to our metric entropy lower bound. Srinivas et al. [31] derive an information-theoretic upper bound for the cumulative regret of the upper confidence bound algorithm. Russo and Van Roy [26] give an information-theoretic analysis of Thompson sampling. However, no existing work provides a complementary information-theoretic lower bound.
Kernel-specific Lower Bound Analysis. As for lower bounds or complexity analysis, Bull [4] derives a lower bound on simple regret for the Matérn kernel in a noise-free setting. Scarlett et al. [28] provide lower bounds on both simple regret and cumulative regret for the squared exponential and Matérn kernels. With the Matérn kernel, a tight regret bound has been provided for Bayesian optimization in one dimension in [27]. With heavy-tailed noise in the non-Bayesian setting, a cumulative regret lower bound has been provided for the Matérn and squared exponential kernels in [25]. More recently, Cai and Scarlett [5] provide lower bounds for both standard and robust Gaussian process bandit optimization. However, unlike the information-theoretic upper bound shown in [31], the existing lower bound results are mostly (if not all) restricted to specific kernel functions (mostly squared exponential and Matérn). The explicit connection between the optimization lower bound and the complexity of the RKHS has not been established in the existing literature. In this paper, we establish such a connection by constructing a lower bound in terms of metric entropy.
Covering Number Estimates in RKHS. Another area of research relevant to this paper is the estimation of covering numbers, or metric entropy, in function spaces. Some classical results are used in this paper. In [8, Sect. 3.3], the covering number of the function space ball in a Besov space is estimated. A technique to derive a lower estimate of the covering number for a stationary kernel is developed in [42], and as an application, a lower bound on the covering number of a function space ball for the squared exponential kernel is derived.
General Information-based Complexity Analysis. Our focus in this paper is efficient global optimization, due to its increasing popularity and the lack of a unified and intuitive understanding of its complexity. Nevertheless, there have also been many classical works in the general area of information-based complexity analysis. For example, it is shown that the optimal convergence rates of global optimization are equivalent to those of approximation in the sup-norm [23]. However, approximation in the sup-norm is itself another hard problem whose complexity remains to be understood. Another set of results tries to connect finite-rank approximation, which is more general than sample-based interpolation, with metric entropy [8,18,32]. However, these results cannot be directly applied to our efficient global optimization problem, because the general finite-rank approximation definitions are inconsistent with our sample-based setting.
Minimax Rates for Kernel Regression. In learning theory, there are well-established covering number bounds on learning errors. Many existing works [6,24] derive covering number bounds for the generalization error of learning problems with RKHSs or more general hypothesis spaces. However, in a typical learning setting, the sample points and the corresponding observations are assumed to be independently and identically distributed, with observations corrupted by noise. In contrast, the setting we consider in this paper is an essentially different global optimization problem. Specifically, our goal is to identify a solution with the desired level of optimality, and the sample points can be adaptively selected.
Position and Originality of Our Work. Despite the rich literature summarized above, we notice two major limitations of the state-of-the-art complexity bounds. Firstly, existing analysis (see, e.g., [4,5]) is typically restricted to a specific group of kernels (most commonly, the squared exponential kernel and the Matérn kernel). A unified understanding of the optimization complexity is lacking. Our work addresses this limitation by providing a unified general lower bound in terms of metric entropy, which recovers (close-to) state-of-the-art lower bounds when restricted to specific kernels. Secondly, the lower bounds with noise can be dominated by a Θ(1/ε²) term (e.g., in [28] for the squared exponential kernel), which may obscure the understanding of the complexity of efficient global optimization. Our work addresses this limitation by proving bounds in the noiseless regime.

Problem Statement
We consider efficient global optimization in a non-Bayesian setting [31]. Specifically, we optimize a deterministic function f from a reproducing kernel Hilbert space (RKHS) H with input space ℝ^d, where d is the dimension. H is equipped with the reproducing kernel k(·, ·) : ℝ^d × ℝ^d → ℝ. Let X ⊂ ℝ^d be the known feasible set (e.g., a hyperbox) of the optimization problem. In the following, we will use [n] to denote the set {1, 2, ..., n}. We assume that
Assumption 3.1 X is compact and nonempty.
Assumption 3.1 is reasonable because, in many applications (e.g., continuous hyperparameter tuning) of efficient global optimization, we are able to restrict the optimization to certain ranges based on domain knowledge. Regarding the black-box function f ∈ H that we aim to optimize, we assume that
Assumption 3.2 ‖f‖_H ≤ R, where R is a positive real number and ‖·‖_H is the norm induced by the inner product associated with H.
Assumption 3.2 requires that the function to be optimized is regular in the sense of having a bounded norm in the RKHS, which is a common assumption (e.g., [4,28]) for complexity and convergence analysis.
Assumption 3.3, which uniformly bounds the kernel on the feasible set, is a common assumption for analyzing the convergence and complexity of efficient global optimization. It holds for a large class of commonly used kernel functions (e.g., the Matérn kernel and the squared exponential kernel) after normalization.
Our problem is formulated as min_{x ∈ X} f(x) (1). By the reproducing property, function values are controlled by the RKHS norm and the kernel. Hence, it can be shown under Assumptions 3.2 and 3.3 that f is continuous, and thus (1) has an optimal solution on the compact set X. As in standard efficient global optimization, we restrict ourselves to the zero-order oracle case. That is, in each step our algorithm can only query the function value f(x), but not higher-order information, at a point x. Based on the function evaluations before the current step, the algorithm sequentially decides the next point to sample. In this paper, we only consider oracle query (namely, function evaluation) complexity, without considering the complexity of solving the auxiliary optimization problems in typical efficient global optimization algorithms (e.g., maximizing the expected improvement).
In this paper, we focus on the performance metric of simple regret r(t) = min_{τ∈[t]} f(x_τ) − min_{x∈X} f(x). Note that in some of the literature, simple regret is instead defined as f(x̃_t) − min_{x∈X} f(x), where x̃_t is one additional point reported after t steps. Since we can always pay one more function evaluation for the reported point, this definitional difference does not impact our convergence or complexity analysis.
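As a concrete illustration of this metric, the following sketch (our own toy example; `simple_regret` and the test function are not from the paper) computes r(t) for a short query sequence:

```python
def simple_regret(evaluated_values, global_min):
    """Simple regret r(t): best observed value so far minus the global optimum."""
    return min(evaluated_values) - global_min

# Toy check on f(x) = (x - 0.3)^2 over X = [0, 1], whose global minimum is 0.
f = lambda x: (x - 0.3) ** 2
queries = [0.0, 0.5, 0.25]                       # x_1, x_2, x_3
r3 = simple_regret([f(x) for x in queries], 0.0)  # best query is x = 0.25
```

Note that r(t) is non-increasing in t: adding evaluations can only improve the best observed value.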

Preliminary
To analyze the problem complexity of efficient global optimization, we need a metric that measures the complexity of the RKHS. As an extreme example, if we choose a linear kernel, the underlying function to be optimized is a linear function. Hence, we can reconstruct it after a finite number of steps and compute the optimum without any error. The covering number is such a widely used metric for the complexity of an RKHS [41]. To facilitate our discussion, we introduce some concepts about the complexity of function sets. Given a normed vector space (V, ‖·‖) and a subset G ⊂ V, for ε > 0, we make the following complexity-related definitions [39]. It can be verified that
Proposition 4.1 (Thm. IV, [16])
To facilitate the subsequent complexity analysis, we use x_1, x_2, ..., x_t to denote the sequence of evaluated points up to step t. We now formalize the concept of a deterministic algorithm for solving the efficient global optimization problem.
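To make the covering number concrete, here is a minimal sketch, not part of the paper's analysis, that greedily constructs an ε-covering of the interval [0, 1] under the absolute-value metric; for this set the covering number scales like Θ(1/ε), so the metric entropy is Θ(log(1/ε)):

```python
def greedy_cover(points, eps):
    """Greedily pick centers so that every point lies within eps of some center.

    The construction is a maximal eps-packing, which is automatically an
    eps-covering; its size upper-bounds the covering number N([0,1], eps)
    when `points` is a fine grid of [0, 1].
    """
    centers = []
    for p in points:
        if all(abs(p - c) > eps for c in centers):
            centers.append(p)
    return centers

grid = [i / 1000 for i in range(1001)]
centers = greedy_cover(grid, 0.1)  # about 1/eps centers suffice for [0, 1]
```

The same greedy idea works in any metric space, but the count grows exponentially with the dimension, which is the source of the d-dependence in the bounds below.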

Definition 4.6 (Deterministic algorithm)
A deterministic algorithm A for solving the optimization problem in (1) is a sequence of mappings (π_t)_{t=1}^∞, where π_t : (X × ℝ)^{t−1} → X for t ≥ 2 and π_1 : {∅} → X. When running the algorithm A, the sample at step t is x_t = π_t((x_1, f(x_1)), ..., (x_{t−1}, f(x_{t−1}))). Note that deterministic algorithms include most of the popular acquisition-function-based efficient global optimization algorithms (e.g., lower/upper confidence bound [31] and expected improvement [13]).
We assume that the first sample point x_1 is deterministic, either given before running the algorithm or chosen by the algorithm. Now, if we suppose that f is such that the algorithm observes 0 for every function evaluation f(x_τ), it will generate a deterministic sample trajectory. We will see in our main result that this trajectory can be used to construct adversarial functions to derive the lower bound. We formally define it below.
Definition 4.7 (Zero sequence) Given a deterministic algorithm A = (π_t)_{t=1}^∞, by setting all observations to 0 (that is, x⁰_1 = π_1(∅) and x⁰_t = π_t((x⁰_1, 0), ..., (x⁰_{t−1}, 0)) for t ≥ 2), we get a deterministic sequence x⁰_1, x⁰_2, ..., x⁰_t, ..., which only depends on the algorithm A. We call this sequence the zero sequence of the algorithm A.
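The zero sequence can be generated mechanically by feeding the algorithm all-zero observations. The sketch below illustrates this; `midpoint_policy` is a toy deterministic policy of our own, not an algorithm from the paper:

```python
def zero_sequence(policy, t):
    """Run a deterministic algorithm on all-zero observations.

    `policy(history)` maps a list of (x, f(x)) pairs to the next query point;
    the returned sequence x^0_1, ..., x^0_t depends only on the policy.
    """
    history, seq = [], []
    for _ in range(t):
        x = policy(history)
        seq.append(x)
        history.append((x, 0.0))  # the oracle always answers 0
    return seq

# A toy deterministic policy on [0, 1]: start at 0.5, then halve the last query.
def midpoint_policy(history):
    if not history:
        return 0.5
    return history[-1][0] / 2  # purely illustrative rule

zs = zero_sequence(midpoint_policy, 4)
```

Running the helper twice gives the same sequence, which is exactly the determinism the lower-bound argument exploits.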

Main Results
Our strategy for deriving the lower bound is to decompose the RKHS into two orthogonal subspaces, one of which expands as more samples are obtained, as shown in Fig. 1. Then, we can project the function space ball onto these two subspaces. We will show that as the number of sampled points grows, the covering number of the ball's projection onto one subspace increases while the other decreases. We derive the lower bound on the number of optimization steps by bounding the increase/decrease rate. All the proofs of the lemmas and theorems are given in the Appendix, except those of Lemma 5.4 and Theorem 5.1. Before proceeding, we introduce some notation.
Fig. 1 The function space view of our proof strategy
We first decompose the RKHS into two orthogonal subspaces.
Notice that H_t expands when we have more and more function evaluation data. In parallel, H_t^⊥ shrinks. We then consider the intersection of the function space ball S with H_t and H_t^⊥.
With these definitions, we can show that any function in S can be decomposed into two functions in S_t and S_t^⊥, respectively.
The function m t (x) is exactly the posterior mean function in Gaussian process regression.
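As a sketch of this posterior mean (equivalently, the minimum-norm interpolant), the following pure-Python code computes m_t(x) = k(x, X) K⁻¹ y for a one-dimensional SE kernel; the helper names and the tiny Gaussian-elimination solver are our own illustration, not the paper's implementation:

```python
import math

def se_kernel(x, y, lengthscale=1.0):
    """Squared exponential kernel in one dimension (illustrative choice)."""
    return math.exp(-((x - y) ** 2) / (2 * lengthscale ** 2))

def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination (A small and well conditioned)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))  # partial pivoting
        M[i], M[p] = M[p], M[i]
        for r in range(n):
            if r != i:
                factor = M[r][i] / M[i][i]
                M[r] = [a - factor * c for a, c in zip(M[r], M[i])]
    return [M[i][n] / M[i][i] for i in range(n)]

def posterior_mean(X, y, x, kernel=se_kernel):
    """Noiseless GP posterior mean m_t(x) = k(x, X) K^{-1} y,
    i.e. the minimum-norm RKHS interpolant of the data (X, y)."""
    K = [[kernel(a, b) for b in X] for a in X]
    alpha = solve(K, y)
    return sum(a * kernel(x, xi) for a, xi in zip(alpha, X))

X_data, y_data = [0.0, 1.0, 2.0], [1.0, 0.0, -1.0]
m_at_1 = posterior_mean(X_data, y_data, 1.0)  # interpolation: equals y_data[1]
```

In the noiseless setting the posterior mean interpolates the data exactly, which is the property the decomposition above relies on.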
Intuitively, we can add any function from S_t^⊥ to f without changing the historical evaluations at x_1, ..., x_t. If we have some way of lower bounding the complexity of S_t^⊥, we may be able to find a perturbing function from S_t^⊥ that leads to suboptimality. We will lower bound the complexity of S_t^⊥ through Lemmas 5.2 and 5.3. Since S_t and S_t^⊥ are orthogonal to each other in the RKHS, it is intuitive that the complexity of S can be decomposed into the complexities of S_t and S_t^⊥. Formally, we have Lemma 5.2.
Lemma 5.2 For any ε_t > 0, ε_t^⊥ > 0, we have the stated packing/covering inequality.
Lemma 5.2 is proved based on Lemma 5.1. With Lemma 5.2, we can lower bound M(S_t^⊥(X), ε_t^⊥, ‖·‖_∞) if we are able to upper bound N(S_t(X), ε_t, ‖·‖_∞). Since S_t lies inside the finite-dimensional space H_t, we can show (Lemma 5.3) such an upper bound. Lemma 5.4 then states that, for ε < ε_0 and any sample sequence, the packing number of S_t^⊥(X) remains large as long as t ≤ ½ log N(S(X), 4ε, ‖·‖_∞) − log 2.
Proof By the assumption on t, and since ε < ε_0, by the definition of ε_0 we have ½ log N(S(X), 4ε, ‖·‖_∞) − log 2 > 0. We also notice that log N(S(X), R, ‖·‖_∞) = 0 < 2 log 2 and thus ε_0 ≤ R/4. We can then apply Lemma 5.3 to derive the claim, where the first inequality follows by Lemma 5.3 and the second by the assumption on t.
We are now ready to give our main result in Theorem 5.1.

Theorem 5.1 If there exists a deterministic algorithm that achieves simple regret r(T) ≤ ε for any function f ∈ S in T function evaluations for our problem (1), then it is necessary that T ≥ ½ log N(S(X), 4ε, ‖·‖_∞) − log 2.
Before proving Theorem 5.1, we give a sketch of the proof. For any deterministic algorithm and any number of optimization steps t, we consider the corresponding deterministic zero sequence x⁰_1, x⁰_2, ..., x⁰_t as defined in Definition 4.7. We try to construct an adversarial function inside the corresponding S_t^⊥ with function value 0 at the points x⁰_i, i ∈ [t], and a low function value at some point that is not sampled. The possible minimal value of such an adversarial function links to the covering number of the set S_t^⊥(X), which can be lower bounded by combining Lemmas 5.2 and 5.3.

Proof of Theorem 5.1 Given a deterministic algorithm A = (π_t)_{t=1}^{+∞}, if it always observes the evaluation 0, then the sample trajectory is exactly the zero sequence of the algorithm. Note that the zero sequence x⁰_t only depends on the deterministic algorithm A: once we fix the algorithm, the zero sequence is fixed. We want to check the feasibility of problem (3). Any feasible solution of (3) has an 'adversarial' property against the algorithm A. In fact, suppose that (s, x) is a feasible solution of problem (3); when we run the algorithm A on s, the sample sequence up to step t is exactly the zero sequence truncated at step t, and r(t) = min_{τ∈[t]} s(x⁰_τ) − min_{x∈X} s(x) > ε. Now the question is under what condition problem (3) is feasible. Since we are analyzing the asymptotic rate, we restrict ourselves to the case ε < ε_0, where ε_0 is given in Lemma 5.4. By Lemmas 5.4 and 5.2, if t ≤ ½ log N(S(X), 4ε, ‖·‖_∞) − log 2 for the given algorithm, then there exist functions f_1 and f_2 in the packing construction, and at least one of them has L_∞ norm over the set X of at least 3ε/2. Without loss of generality, we take this function as the adversarial instance. When applying the given algorithm to it, if t ≤ ½ log N(S(X), 4ε, ‖·‖_∞) − log 2, the suboptimality gap, i.e., the simple regret r(t), is at least 3ε/2 > ε. Therefore, to achieve simple regret r(T) ≤ ε for all the functions in S within T steps, it is necessary that T exceed this threshold. To verify the effectiveness of Theorem 5.1, we apply it to a simple case in Ex. 5.1.

Example 5.1
For the quadratic kernel k(x, y) = (xᵀy)², the corresponding RKHS is finite-dimensional and consists of the quadratic forms x ↦ xᵀAx with A ∈ S^{d×d} [20], where S^{d×d} is the set of symmetric matrices of size d × d and ⟨·, ·⟩_F is the Frobenius inner product. Since S^{d×d} can be embedded into ℝ^{d(d+1)/2} and the metric entropy of a compact set in Euclidean space is Θ(log(1/ε)), as discussed in [39], the lower bound in Theorem 5.1 reduces to a constant. By applying a grid search algorithm for the quadratic kernel, we can identify the ground-truth function after a finite number of steps and determine the optimal solution without any error. Therefore, the lower bound is tight in ε for the quadratic kernel.
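The finite-dimensionality argument can be made concrete: for d = 2, three noiseless evaluations already pin down the symmetric matrix A exactly. A sketch (function names and query points are our own choices, not the paper's grid search):

```python
def recover_quadratic_2d(f):
    """Identify f(x) = x^T A x (A symmetric 2x2) from 3 noiseless evaluations.

    With the quadratic kernel the RKHS is finite dimensional, so finitely
    many queries determine the function, and hence the optimum, exactly.
    """
    a11 = f((1.0, 0.0))                       # picks out A[0][0]
    a22 = f((0.0, 1.0))                       # picks out A[1][1]
    a12 = (f((1.0, 1.0)) - a11 - a22) / 2     # cross term 2*A[0][1]
    return [[a11, a12], [a12, a22]]

# Ground truth: A = [[2, -1], [-1, 3]]
f = lambda x: 2 * x[0] ** 2 - 2 * x[0] * x[1] + 3 * x[1] ** 2
A = recover_quadratic_2d(f)
```

In general d dimensions, d(d+1)/2 well-chosen queries suffice, matching the constant lower bound above.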

Comparison with Upper Bounds for Commonly Used Kernels
Ex. 5.1 demonstrates the validity of Theorem 5.1 for simple quadratic kernel functions.
In this section, we derive kernel-specific lower bounds for the squared exponential kernel and the Matérn kernel by using Theorem 5.1 and existing estimates of the covering numbers for their RKHSs. We compare our lower bounds with derived/existing upper bounds and show that they nearly match.

Squared Exponential Kernel
One widely used kernel in efficient global optimization is the squared exponential (SE) kernel, k(x, y) = exp(−‖x − y‖²/(2l²)), where l is the length scale. In this case, we restrict ourselves to X = [0, 1]^d. By applying Theorem 5.1, we have
Theorem 5.2 With X = [0, 1]^d and the squared exponential kernel, if there exists a deterministic algorithm that achieves simple regret r(T) ≤ ε for any function f ∈ S in T function evaluations for our problem (1), it is necessary that T = Ω((log(R/ε))^{d/2}). Furthermore, there exists a deterministic algorithm and T satisfying T = O((log(R/ε))^d log(R/ε)) such that the algorithm achieves r(T) ≤ ε in T function evaluations for any f ∈ S.
The upper bound is obtained by sampling non-adaptively so as to reduce the posterior variance to a uniformly low level on X. In this theorem, we focus on the asymptotic analysis of efficient global optimization and hide the coefficients that may depend on the dimension. We notice that the upper bound and the lower bound are both polynomial in log(1/ε) and nearly match, up to a replacement of d/2 by d in the order and one additional logarithmic term log(R/ε).
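The non-adaptive strategy behind such upper bounds can be caricatured in one dimension: fix a uniform grid in advance, evaluate the function on it, and report the best point. This sketch is a simplification of the actual proof construction (which controls the posterior variance); the names are ours:

```python
def nonadaptive_grid_search(f, lo, hi, n):
    """Evaluate f on a uniform grid of n points and report the best one.

    Non-adaptive: the query locations are fixed before any evaluation,
    mirroring the non-adaptive constructions behind the upper bounds.
    """
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    vals = [f(x) for x in xs]
    j = min(range(n), key=vals.__getitem__)
    return xs[j], vals[j]

x_best, v_best = nonadaptive_grid_search(lambda x: (x - 0.3) ** 2, 0.0, 1.0, 101)
```

The grid resolution needed to guarantee accuracy ε is what produces the (log(R/ε))-type factors for the SE kernel.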

Matérn Kernel
In this section, we consider the Matérn kernel k(x, y) = (2^{1−ν}/Γ(ν)) (√(2ν)‖x − y‖/ρ)^ν K_ν(√(2ν)‖x − y‖/ρ), where ρ and ν are positive parameters of the kernel function, Γ is the gamma function, and K_ν is the modified Bessel function of the second kind.
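For ν = 5/2 (the smoothness used in the experiments later), the Bessel-function form reduces to a well-known closed form; the sketch below implements this special case rather than the general K_ν expression, with unit variance:

```python
import math

def matern52(x, y, rho=1.0):
    """Matérn kernel with nu = 5/2 and lengthscale rho in one dimension.

    Closed form: (1 + s + s^2/3) * exp(-s) with s = sqrt(5) * |x - y| / rho.
    Normalized so that k(x, x) = 1, consistent with a bounded-kernel assumption.
    """
    s = math.sqrt(5) * abs(x - y) / rho
    return (1 + s + s * s / 3) * math.exp(-s)
```

Half-integer ν (1/2, 3/2, 5/2, ...) all admit such closed forms, which is why these values are popular in practice.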
Theorem 5.3 With X = [0, 1]^d and the Matérn kernel, if there exists a deterministic algorithm that achieves simple regret r(T) ≤ ε for any function f ∈ S in T function evaluations for our problem (1), it is necessary that T = Ω((R/ε)^{2d/(2ν+d)} (log(R/ε))^{−1}). Furthermore, there exists a deterministic algorithm and T satisfying T = O((R/ε)^{d/ν}) such that the algorithm achieves r(T) ≤ ε in T function evaluations for any f ∈ S.

Remark 5.2
The upper bound part of Theorem 5.3 is proved by Theorem 1 of [4].We also notice that [4] provides a lower bound of the same order as the upper bound in Eq. (10), which means that the upper bound order is also the optimal lower bound order.
Remark 5.3 When ν ≥ d/2, our lower bound further implies a lower bound of Ω((R/ε)^{d/(2ν)} (log(R/ε))^{−1}), which nearly matches the upper bound, up to a replacement of d/2 by d and a log(R/ε) term. However, when ν/d is small, there is still a significant gap between the lower bound implied by our general lower bound and the optimal lower bound.
Remark 5.4 There are two possible reasons why the bound is not tight. One potential reason is that we apply a conservative lower estimate of the metric entropy corresponding to the Matérn kernel. The other is that our metric entropy approach is limited in the regime of a small smoothness parameter ν. Filling this gap is left as future work.

Experiments
In this section, we first give a demonstration of adversarial functions on which two common algorithms, lower confidence bound (LCB) [31] and expected improvement (EI) [13], perform poorly and attain the optimization lower bound. Both algorithms model the unknown black-box function as a sample from a Gaussian process. In each step, the LCB algorithm obtains the next sample point by minimizing the lower confidence bound, defined as the posterior mean minus a coefficient times the posterior standard deviation. The EI algorithm obtains the next sample point by maximizing the expected improvement with respect to the best observed value so far. We then run the two algorithms on a set of randomly sampled functions and compare the average and adversarial performance in terms of simple regret. The algorithms are implemented based on GPy [11] and CasADi [1]. All the auxiliary optimization problems in the algorithms are solved using the solver IPOPT [35] with multiple different starting points. Our experiments take about 15 h on a device with an AMD Ryzen Threadripper 3990X 64-core processor and 251 GB RAM.
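A minimal sketch of one LCB step: pick the candidate minimizing m_t(x) − β·σ_t(x), here with the Matérn-5/2 closed form, a candidate grid instead of IPOPT, β = 1, and helper names of our own:

```python
import math

def matern52(x, y, rho=1.0):
    """Matérn nu = 5/2 kernel, closed form, k(x, x) = 1."""
    s = math.sqrt(5) * abs(x - y) / rho
    return (1 + s + s * s / 3) * math.exp(-s)

def solve(A, b):
    """Gauss-Jordan solve of A z = b for small, well-conditioned A."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(n):
            if r != i:
                factor = M[r][i] / M[i][i]
                M[r] = [a - factor * c for a, c in zip(M[r], M[i])]
    return [M[i][n] / M[i][i] for i in range(n)]

def lcb_next_point(X, y, grid, beta=1.0, kernel=matern52):
    """One LCB step: argmin over `grid` of posterior mean - beta * posterior std."""
    K = [[kernel(a, b) for b in X] for a in X]
    alpha = solve(K, y)
    def mean(x):
        return sum(a * kernel(x, xi) for a, xi in zip(alpha, X))
    def std(x):
        kx = [kernel(x, xi) for xi in X]
        v = solve(K, kx)
        return math.sqrt(max(0.0, kernel(x, x) - sum(a * b for a, b in zip(kx, v))))
    return min(grid, key=lambda x: mean(x) - beta * std(x))

# With a single zero observation at 0, LCB prefers the far, uncertain candidate.
next_x = lcb_next_point([0.0], [0.0], grid=[0.0, 5.0], beta=1.0)
```

The grid stands in for the continuous acquisition maximization handled by IPOPT in the actual experiments.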

Demonstration of Adversarial Functions
In our proof of Theorem 5.1, we use a particular set of adversarial functions, which reveal the value 0 to the algorithm and have low values elsewhere. In this section, we demonstrate such adversarial functions for the two popular algorithms, expected improvement and lower confidence bound.
We use the Matérn kernel in one dimension with ν = 5/2, ρ = 1, σ² = 1. We set the compact set to X = [−10, 10] and assume that the RKHS norm upper bound is R = 1. We apply both the lower confidence bound algorithm, with constant weight 1 on the posterior standard deviation, and the expected improvement algorithm. We manually assign x_1 = 0 as the first sampled point and derive the adversarial function by solving Prob. (11). Thanks to the optimal recovery property [38, Thm. 13.2], the optimal value of the inner problem of (11) can be derived analytically. Figure 2 shows the adversarial functions inside the corresponding RKHS ball of norm 1, which have value 0 at all the sampled points but attain a low global optimal value elsewhere. We notice that the envelope formed by the functions inside the ball consistent with the evaluation data shrinks as more data become available. Intuitively, any algorithm needs to sample sufficiently densely over the whole domain in the adversarial case in order to find a close-to-optimal solution.
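The shrinking envelope in Fig. 2 can be reproduced from the posterior standard deviation: by optimal recovery, the functions of norm at most R that interpolate all-zero data lie pointwise in ±R·σ_t(x). A pure-Python sketch (all names are our own; the solver is a toy Gauss-Jordan routine):

```python
import math

def matern52(x, y, rho=1.0):
    """Matérn nu = 5/2 kernel, closed form, k(x, x) = 1."""
    s = math.sqrt(5) * abs(x - y) / rho
    return (1 + s + s * s / 3) * math.exp(-s)

def solve(A, b):
    """Gauss-Jordan solve of A z = b for small, well-conditioned A."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(n):
            if r != i:
                factor = M[r][i] / M[i][i]
                M[r] = [a - factor * c for a, c in zip(M[r], M[i])]
    return [M[i][n] / M[i][i] for i in range(n)]

def posterior_std(X, x, kernel=matern52):
    """Noiseless posterior std: sigma_t(x)^2 = k(x,x) - k(x,X) K^{-1} k(X,x)."""
    kx = [kernel(x, xi) for xi in X]
    K = [[kernel(a, b) for b in X] for a in X]
    v = solve(K, kx)
    return math.sqrt(max(0.0, kernel(x, x) - sum(a * b for a, b in zip(kx, v))))

R = 1.0  # RKHS norm bound, as in the demonstration
envelope_2pts = R * posterior_std([0.0, 2.0], 1.0)
envelope_3pts = R * posterior_std([0.0, 1.5, 2.0], 1.0)  # more data, tighter envelope
```

The envelope vanishes at sampled points and tightens everywhere as knots are added, matching the qualitative behavior in Fig. 2.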

Average vs. Adversarial Performance
The proofs of Theorems 5.2 and 5.3 indicate that a non-adaptive sampling algorithm can achieve a close-to-optimal worst-case convergence rate. However, in practice, adaptive algorithms (e.g., lower confidence bound and expected improvement) are usually adopted and perform better. There could potentially be a gap between average-case and worst-case convergence. To perform such a comparison, we randomly sample a set of functions from the RKHS to run the algorithms on. Specifically, we first uniformly sample a finite set of knots in X and then sample the function values on the knots from the marginal distribution of the Gaussian process, which is a finite-dimensional Gaussian distribution. We then construct the minimal-norm interpolant of the knots as the sampled function. To be consistent with the bounded norm assumption, we reject the functions with a norm larger than R.
Fig. 3 Comparison of average performance (± standard deviation shown as a shaded area, over 100 instances) and adversarial performance. Adversarial simple regret is defined as the negative of the optimal value of Prob. (11), namely the simple regret of the adversarial function at different optimization steps. Since the simple regret is defined as the best sampled function value minus the global optimal value (see Definition 3.1), this plot can also be seen as a convergence rate plot if the algorithm reports the best sampled point.
We use the simple regret, defined as min_{τ∈[t]} f(x_τ) − min_{x∈X} f(x), to measure the performance of the different algorithms. We set X = [0, 1]³ ⊂ ℝ³ and set the length scales and variances of both the Matérn kernel (with ν = 5/2) and the squared exponential kernel. Figure 3 shows the comparison of average simple regret and adversarial simple regret. We observe that the average performance is much better than the performance on adversarial functions in terms of simple regret. Intuitively, as t becomes large, the adversarial functions form a subset of needle-in-a-haystack functions, which are flat over most regions and very small somewhere. For adversarial functions such as those shown in Fig. 2, it can be difficult for efficient global optimization algorithms to "see" the trend of the function. For common functions inside the function space ball, however, the algorithms are still able to detect the trend of the function values and find a near-optimal solution quickly.
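The rejection step can be sketched as follows: the squared RKHS norm of the minimal-norm interpolant of values f_X at a set of knots is f_Xᵀ K⁻¹ f_X, so a sampled candidate is kept only if this quantity is at most R². The kernel, length scale, and helper names below are our own illustration:

```python
import math

def se_kernel(x, y, ell=0.5):
    """Squared exponential kernel in one dimension (ell is the length scale)."""
    return math.exp(-((x - y) ** 2) / (2 * ell ** 2))

def solve(A, b):
    """Gauss-Jordan solve of A z = b for small, well-conditioned A."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(n):
            if r != i:
                factor = M[r][i] / M[i][i]
                M[r] = [a - factor * c for a, c in zip(M[r], M[i])]
    return [M[i][n] / M[i][i] for i in range(n)]

def interpolant_norm(X, fX, kernel=se_kernel):
    """RKHS norm of the minimal-norm interpolant of values fX at knots X:
    sqrt(fX^T K^{-1} fX), where K is the kernel matrix of the knots."""
    K = [[kernel(a, b) for b in X] for a in X]
    alpha = solve(K, fX)
    return math.sqrt(sum(a * v for a, v in zip(alpha, fX)))

def accept(X, fX, R):
    """Rejection step: keep a sampled candidate only if its norm is at most R."""
    return interpolant_norm(X, fX) <= R
```

In the experiments, the values f_X would come from the Gaussian process marginal at the knots; the rejection step keeps the sampled functions consistent with Assumption 3.2.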

Conclusions
In this paper, we provide a general lower bound on the worst-case suboptimality (simple regret) of noiseless efficient global optimization in a non-Bayesian setting, in terms of the metric entropy of a ball in the corresponding reproducing kernel Hilbert space (RKHS). We apply the general lower bound to commonly used kernel functions, including the squared exponential kernel and the Matérn kernel. We further derive upper bounds, compare them to the lower bounds, and find that they nearly match, except for the Matérn kernel when $\nu/d$ is small. Two interesting future research directions are deriving an upper bound on the worst-case convergence rate in terms of metric entropy and characterizing the average-case convergence rate. We also conjecture that introducing randomness into the existing algorithms can improve the worst-case performance; an expected challenge is that our current analysis is sensitive to randomness. We also leave the extension of our analysis to the noisy case as future work.
We take $\alpha^*$ as the solution to the problem in (13), whose feasibility is guaranteed by the representer theorem [36] and the non-emptiness of the feasible set ($f$ is feasible for (12)). Therefore, $m_t(x) = (\alpha^*)^T K_{Xx}$ is the optimal solution to (12). Since $f$ is a feasible solution to problem (12), the resulting set is an $\epsilon_t$-covering of $S(\mathcal{X})$, and its cardinality can be bounded accordingly.
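The representer-theorem characterization above can be checked numerically: $m_t(x) = (\alpha^*)^T K_{Xx}$ interpolates the observed values and has the smallest RKHS norm among all interpolants. The sketch below uses a squared exponential kernel with arbitrary length scale and data; all specific values are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def k(A, B, ls=0.2):
    """Squared exponential kernel matrix."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

X = np.linspace(0.0, 1.0, 8).reshape(-1, 1)   # evaluated points
y = rng.standard_normal(8)                    # observed values f(X)
K = k(X, X) + 1e-10 * np.eye(8)
alpha = np.linalg.solve(K, y)                 # alpha* solving the linear system
m = lambda x: k(np.atleast_2d(x), X) @ alpha  # m_t(x) = (alpha*)^T K_{Xx}

norm_sq_mt = alpha @ K @ alpha                # ||m_t||_H^2

# Any other interpolant, e.g. one forced through an extra point (z, w) with
# w != m_t(z), must have a strictly larger RKHS norm:
z = np.array([[0.5]])
w = m(z).item() + 1.0
C = np.vstack([X, z])
v = np.append(y, w)
Kc = k(C, C) + 1e-10 * np.eye(9)
alpha2 = np.linalg.solve(Kc, v)
norm_sq_other = alpha2 @ Kc @ alpha2
print(norm_sq_other > norm_sq_mt)  # True: m_t is the minimal-norm interpolant
```

Forcing any additional constraint not already satisfied by $m_t$ inflates the norm, which is exactly the minimality that the representer theorem guarantees.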

C Proof of Lemma 5.3
We first introduce the set $E_t$. Without loss of generality, we assume that $K_t$ has full rank in the following analysis. If this condition does not hold, we only need to restrict to the subspace spanned by the eigenvectors of $K_t$ with strictly positive eigenvalues and consider the intersection of $E_t$ with this subspace; since the restriction only reduces the essential dimension, the upper bound still holds. We introduce the norm $\|\cdot\|_{K_t}$. Therefore, we have $N(S_t(\mathcal{X}), \epsilon, \|\cdot\|_\infty) \le N(E_t, \epsilon, \|\cdot\|_{K_t})$. We further have the chain of inequalities in (15). The second inequality in (15) follows because, if $\alpha_1, \alpha_2, \ldots, \alpha_M$ is an $\epsilon$-packing of the set $E_t$, then $B(\alpha_i, \epsilon/2) \cap B(\alpha_j, \epsilon/2) = \emptyset$ for all $i \ne j$, by the definition of packing. The third inequality in (15) follows from the stated assumption. Therefore, Theorem 5.1 implies the claimed lower bound.

We now focus on proving the upper bound part. To facilitate the following proof, we define the surrogate quantities used below. Note that, with the squared exponential kernel and the sampled point set $X$ used in this proof, the invertibility of the matrix $K$ is guaranteed. As implied by [19, Prop. 1], we obtain (20). We consider the algorithm that evaluates the grid points without adaptation. Let $x^*$ denote the ground-truth optimal solution. We can bound the suboptimality as in (22), where the inequalities (22a) and (22b) follow from (20) and the inequality (22c) follows from the definition of $\hat{x}_t$ in (21). We now upper bound $\sigma_t(\hat{x}_t)$. We first introduce a set of Lagrangian interpolation functions. Let $k_0(x)$ denote the function $k(0, x)$ and $\hat{k}_0$ its Fourier transform. By the inverse Fourier transform, we obtain (24). To proceed, we use Lemma D.1 of [41].
We apply the bounds in Lemma D.1 to Eq. (24) and obtain (25). The Fourier transform $\hat{k}_0$ is known in closed form for the squared exponential kernel. Similar to the analysis in the proof of Example 4 of [41], we first bound the first term of the upper bound derived in Eq. (25), where the first inequality follows by combining (24), (27) and (28), the second inequality follows from $1 + \frac{1}{2^N} \le 2$ and $1 + (N 2^N)^d \le 2 (N 2^N)^d$, and the last inequality follows from $\log N \le N$. Let $N \ge \max\{32(\log 2 + 2), \ldots\}$. Combining (23), (30) and the fact that $N^d = t$, we obtain (31). Combining the bound $f(\hat{x}_t) - \min_{x \in \mathcal{X}} f(x) \le 2R\sigma_t(\hat{x}_t)$ from (22) with (31) yields the final bound. Setting the right-hand side to be smaller than $\epsilon$, we observe that the number of steps $t$ only needs to be $\mathcal{O}\big(\big(\log \frac{R}{\epsilon}\big)^{d}\big)$. This completes the proof.
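The mechanism behind this upper bound, namely that the noiseless posterior standard deviation $\sigma_t$ decays rapidly on a refined grid for the squared exponential kernel, can be observed numerically. The sketch below is a minimal one-dimensional illustration; the length scale 0.3 and the grid sizes are arbitrary choices, not the constants of the proof.

```python
import numpy as np

def se(A, B, ls=0.3):
    """Squared exponential kernel matrix."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def max_posterior_std(N, ls=0.3):
    """Largest noiseless posterior std over [0,1] after evaluating f on the
    uniform grid {0, 1/N, ..., (N-1)/N} without adaptation."""
    grid = np.linspace(0.0, 1.0, N, endpoint=False).reshape(-1, 1)
    K = se(grid, grid, ls) + 1e-10 * np.eye(N)            # jitter for stability
    Xtest = np.linspace(0.0, 1.0, 401).reshape(-1, 1)
    Kxg = se(Xtest, grid, ls)
    # sigma_t(x)^2 = k(x,x) - k(x,X) K^{-1} k(X,x), with k(x,x) = 1 here
    var = 1.0 - np.sum(Kxg * np.linalg.solve(K, Kxg.T).T, axis=1)
    return float(np.sqrt(np.clip(var, 0.0, None)).max())

stds = [max_posterior_std(N) for N in (2, 4, 8, 16)]
print(stds)  # the maximum uncertainty shrinks quickly as the grid refines
```

Since the suboptimality is bounded by $2R\sigma_t(\hat{x}_t)$, this fast decay of the worst-case posterior standard deviation is what drives the polylogarithmic sample complexity.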

E Proof of
We then apply Theorem 5.1 to obtain the lower bound. The upper bound is implied by Theorem 1 in [4].

Fig. 2 Demonstrations of adversarial functions in dimension one.

We consider the algorithm that evaluates the grid points $\{0, \frac{1}{N}, \ldots, \frac{N-1}{N}\}^d$ without adaptation, and evaluates the point $\hat{x}_t$ before termination after $t = N^d$ function evaluations on the grid points, where $\hat{x}_t$ is given as $\hat{x}_t = \arg\min_{x \in \mathcal{X}} \bar{f}_t(x)$.
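The non-adaptive scheme just described can be sketched as follows: evaluate the function on the uniform grid, fit the surrogate mean, and report its minimizer over a dense candidate set. The test function, length scale, and grid size below are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def se(A, B, ls=0.2):
    """Squared exponential kernel matrix."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def grid_search_report(f, N, ls=0.2):
    """Non-adaptive scheme: evaluate f on the grid {0, 1/N, ..., (N-1)/N},
    build the surrogate mean, and report its minimizer over a dense test set."""
    X = np.linspace(0.0, 1.0, N, endpoint=False).reshape(-1, 1)
    y = f(X).ravel()
    K = se(X, X, ls) + 1e-10 * np.eye(N)   # jitter for numerical stability
    alpha = np.linalg.solve(K, y)
    Xtest = np.linspace(0.0, 1.0, 1000).reshape(-1, 1)
    mean = se(Xtest, X, ls) @ alpha        # surrogate mean \bar{f}_t
    return Xtest[np.argmin(mean)].item()   # \hat{x}_t = argmin of the surrogate

f = lambda x: np.sin(6 * x) + 0.5 * x      # a smooth test function on [0,1]
xhat = grid_search_report(f, N=16)
```

For this smooth test function the reported point lands near the true minimizer (around 0.77), even though no evaluation was chosen adaptively.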

Table 1 A summary of the state-of-the-art complexity results for efficient global optimization. $\sigma$ is the noise variance, $R$ is the function norm upper bound, $d$ is the dimension of the input space, and $\nu$ is the smoothness parameter of the Matérn kernel. N/A means 'not applicable'. $S(\mathcal{X})$ is the ball in the corresponding reproducing kernel Hilbert space with input set $\mathcal{X}$. $N(\cdot, \cdot, \cdot)$ is the standard covering number, formally defined in Sect. 4.

So $m_t \in S_t$ and $f - m_t \in S_t^\perp$. Let $(p_1, p_2, \ldots, p_m)$ be an $\epsilon_t$-covering of $S_t(\mathcal{X})$ and $(q_1, q_2, \ldots, q_n)$ an $\epsilon_t^\perp$-covering of $S_t^\perp(\mathcal{X})$. Then, for all $f \in S$, by Lemma 5.1, $f = m_t + (f - m_t)$, where $m_t \in S_t$ and $f - m_t \in S_t^\perp$. By the definition of covering, there exists $p_i$ such that $\|m_t|_X - p_i\|_\infty \le \epsilon_t$, and there exists $q_j$ such that