Improving exploration strategies in large dimensions and rate of convergence of global random search algorithms

We consider global optimization problems, where the feasible region $\X$ is a compact subset of $\mathbb{R}^d$ with $d \geq 10$. For these problems, we demonstrate the following. First, the actual convergence of global random search algorithms is much slower than that given by the classical estimates, based on the asymptotic properties of random points. Second, the usually recommended space exploration schemes are inefficient in the non-asymptotic regime. Specifically, (a) uniform sampling on the entire~$\X$ is much less efficient than uniform sampling on a suitable subset of $\X$, and (b) the effect of replacement of random points by low-discrepancy sequences is negligible.


Introduction
Consider the general problem of continuous global minimization f(x) → min_{x∈X} with objective function f(·) and feasible region X, which is assumed to be a compact subset of R^d with vol(X) > 0. In order to avoid unnecessary technical difficulties, we assume that X is convex. In all numerical examples, we use X = [0, 1]^d.
Any global optimization algorithm combines two key strategies: exploration and exploitation. Performing exploration is equivalent to what we call "space-filling"; that is, choosing points which are well spread in X. Exploitation strategies use local information about f (and perhaps derivatives of f) and differ greatly between different types of global optimization algorithms. In this paper, we are concerned only with the exploration stage. Although many of our findings can be generalized to other space-filling schemes (where space-filling is not random and the space-filling strategy changes in the course of receiving more information about the objective function), in this paper we concentrate on simple exploration schemes like pure random search, where space-filling is performed by covering X with balls of given radius centered at the chosen points. Moreover, we assume that the points chosen at the exploration stage are independent. That is, we associate the exploration stage with a global random search (GRS) algorithm producing a sequence of random points x_1, x_2, ..., x_n, where each point x_j ∈ X has some probability distribution P_j (we write this x_j ~ P_j) and x_1, x_2, ..., x_n are independent. The value n is determined by a stopping rule. We assume that 1 ≤ n_min ≤ n ≤ n_max < ∞, where n_min and n_max are two given numbers. The number n_max determines the maximum number of function evaluations at the exploration stage, and the fact that n_max < ∞ determines what we call "the non-asymptotic regime". In the numerical study of Section 4, we also use Sobol's sequence, the most widely used low-discrepancy sequence.
We distinguish between 'small', 'medium' and 'high' dimensional problems depending on the following relations between d and n_max: (S) small dimensions: n_min ≥ 2^d and n_max ≫ 2^d (hence, log_2 n_max ≫ d); (M) medium dimensions: n_max is comparable to 2^d, in the sense that c_1 d ≤ log_2 n_max ≤ c_2 d with suitable constants 0 < c_1 < c_2 < ∞; (H) high dimensions: n_max ≪ 2^d (hence, log_2 n_max ≪ d). Of course, there are in-between situations, and the classification above depends on the cost of function evaluation. In this study, we leave out the situation (S) of small dimensions and concentrate on situations (M) and (H). The reasons why we are not interested in the situation (S) of small dimensions are: (a) there are too many exploration schemes available in the literature in the case of small dimensions, and (b) we are interested in the situations when the asymptotic regime is out of reach, and these are the situations (M) and (H).
In all considerations below, we assume that the aim of the exploration stage is to reach a neighbourhood of an unknown point x* ∈ X with high probability ≥ 1 − γ (for some γ > 0). We assume that x* is uniformly distributed on X, and by a neighbourhood of x* we mean the ball B = B(x*, ε) with suitable ε > 0. In other words, we will be interested in the problem of construction of weak coverings, defined as follows.
Let x_1, ..., x_n be some points in R^d. Denote X_n = {x_1, ..., x_n} and

B(X_n, r) = ∪_{i=1}^n B̄(x_i, r),   (1.1)

where r > 0 is the radius of the balls B(x_i, r) and B̄(x_i, r) = X ∩ B(x_i, r). We will call the set B(X_n, r) a weak covering of X (of level 1 − γ) if vol(B(X_n, r)) ≥ (1 − γ) vol(X); in the case γ = 0, the set (1.1) would make a full (strong) covering of X. As demonstrated in [7,8,9], for any n and any given γ > 0, one can construct weak coverings of X with significantly smaller radii r than for the case γ = 0 (assuming that d is not too small). This is the main reason why we are not interested in strong coverings. The second reason is that numerically checking whether the set (1.1) makes a full covering (for a generic X_n) is extremely hard in situations (M) and (H), whereas simple Monte Carlo gives very accurate estimates of γ for weak coverings, even in very high dimensions. For a short discussion concerning full covering and its role in optimization, see Section 2.1.
The main technique for the construction of weak coverings will be the generation of independent random points x_1, ..., x_n in X with x_j ~ P, where P is a distribution concentrated either on the whole of X or on a subset of X. It follows from Proposition 3.2.3 in [1] that using points outside X for the construction of coverings is not beneficial when X is convex, and hence we will always assume that x_j ∈ X for all j.
The following are the main messages of the paper.
1. Classical results on the convergence rate of GRS algorithms are based on the asymptotic properties of random points uniformly distributed in X; see Section 2. In the non-asymptotic regime, however, these results give estimates of the convergence rate which are far too optimistic. We show in Section 3 that for medium and high dimensions, the actual convergence rate of GRS algorithms is much slower.
2. The usually recommended sampling schemes (these schemes are based on the asymptotic properties of random points) are inefficient in the non-asymptotic regime. In particular, as shown in Section 4, uniform sampling on the entire X is much less efficient than uniform sampling on a suitable subset of X (we will refer to this phenomenon as the 'δ-effect').
3. In situations (M) and (H), the effect of replacing random points by low-discrepancy sequences is negligible; see Section 4.2.
We also make certain practical recommendations concerning the best exploration schemes in the situations (M) and (H) in the case X = [0, 1]^d. Our main recommendations concern the situation (M) of medium dimensions, which we consider the hardest for analysis. The situation (H) is simpler than (M) in the sense that the optimization problems in case (H) are so complicated that the very simple space-filling schemes outlined in Section 6 already provide relatively effective sampling.
The structure of the paper is as follows. In Section 2, which contains no new results, we discuss the importance of covering and review classical results on the convergence and rate of convergence of general GRS algorithms. The purpose of Section 3 is to demonstrate that for medium and high dimensions the asymptotic regime is unachievable, and hence the actual convergence rate of GRS algorithms is much slower than the classical estimates of the rate of convergence indicate. In Section 4 we compare several exploration strategies and show that standard recommendations (such as "use a low-discrepancy sequence") are inaccurate for medium and high dimensions. In Section 5, we develop accurate approximations for the volume of the intersection of a cube and a ball (with arbitrary centre and any radius). The approximations of Section 5 are used throughout the numerical studies of Sections 3 and 4. In Section 6 we summarize our findings and give recommendations on how to perform exploration of X in medium and large dimensions.
2 Importance of covering and classical results on convergence and rate of convergence of GRS algorithms

For a point set X_n = {x_1, ..., x_n} ⊂ X, the covering radius of X_n is CR(X_n) = max_{x∈X} ρ(x, X_n), where

ρ(x, X_n) = min_{1≤j≤n} ρ(x, x_j)   (2.1)

is the distance between a point x ∈ X and the point set X_n. The covering radius is also the smallest r ≥ 0 such that the union of the balls with centers at x_j ∈ X_n and radius r fully covers X; that is, CR(X_n) = min{r > 0 : X ⊆ B(X_n, r)}, where B(X_n, r) = ∪_{j=1}^n B(x_j, r) and B(x, r) = {z ∈ R^d : ρ(x, z) ≤ r} is the ball of radius r and centre x ∈ R^d. The optimal n-point covering is the point set X*_n such that CR(X*_n) = min_{X_n} CR(X_n). Most of the general considerations in the paper are valid for a general distance ρ, but all numerical studies are conducted for the Euclidean distance only; we will thus assume that the distance ρ is Euclidean.
Point sets with small covering radius are very desirable in the theory and practice of global optimization and in many branches of numerical mathematics. In particular, the celebrated results of A. G. Sukharev imply that any n-point optimal covering design X*_n provides: (a) the min-max n-point global optimization method in the set of all adaptive n-point optimization strategies, see [20] and [21, Ch. 4, Th. 2.1]; (b) the worst-case n-point multi-objective global optimization method in the set of all adaptive n-point algorithms, see [27]; and (c) the n-point min-max optimal quadrature, see [21, Ch. 3, Th. 1.1]. In all three cases, the class of (objective) functions is the class of Lipschitz functions, and the optimality of the design is independent of the value of the Lipschitz constant. Sukharev's results on n-point min-max optimal quadrature formulas have been generalized in [10] to functional classes different from the class of Lipschitz functions; see also formula (2.3) in [2].

Convergence of a general GRS algorithm
Consider the general problem of continuous global minimization f(x) → min_{x∈X}. Assume that f* = inf_{x∈X} f(x) > −∞ and f(·) is continuous at all points x ∈ W(δ) for some δ > 0, where W(δ) = {x ∈ X : f(x) − f* ≤ δ}. That is, we assume that f(·) is continuous in a neighbourhood of the set X* = {x ∈ X : f(x) = f*} of global minimizers of f(·), which is non-empty but may contain more than one point x*. To avoid technical difficulties, we assume that there are only a finite number of global minimizers of f(·); that is, the set X* is finite.
Consider a general GRS algorithm producing a sequence of random points x_1, x_2, ..., where each point x_j ∈ X has some probability distribution P_j (we write this x_j ~ P_j) and, for j > 1, the distributions P_j may depend on the previous points x_1, ..., x_{j−1} and on the results of the objective function evaluations at these points (the function evaluations may not be noise-free). We say that this algorithm converges if, for any δ > 0, the sequence of points x_j arrives at the set W(δ) = {x ∈ X : f(x) − f* ≤ δ} with probability one. If the objective function is evaluated without error, then this obviously implies convergence (as n → ∞) of the record values f_{o,j} = min_{i=1,...,j} f(x_i) to f* with probability 1.
In view of the continuity of f(·) in the neighbourhood of X*, the event of arrival of the sequence of points x_j at the set W(δ), with given δ > 0, is equivalent to the arrival of this sequence at the set B*(ε) = ∪_{x*∈X*} B(x*, ε) for some ε > 0 depending on δ.
Conditions on the distributions P_j (j = 1, 2, ...) ensuring convergence of GRS algorithms are well understood; see, for example, [15,19] and [25, Sect. 3.2]. Such results are consequences of the classical 'zero-one law' of probability theory, or of the Borel-Cantelli lemmas (see e.g. [4, Section 7.3]), and provide sufficient conditions for convergence. We follow [26, Theorem 2.1] to provide the most general sufficient conditions for convergence of GRS algorithms.
Theorem 1. Consider a GRS algorithm with x_j ~ P_j and let B ⊂ X be a Borel subset of X. Assume that

∑_{j=1}^∞ q_j(B) = ∞,   (2.2)

where q_j(B) = inf P_j(B) and the infimum is taken over all possible locations of the previous points x_i (i = 1, ..., j−1) and the corresponding results of evaluations of f(·). Then the sequence of points {x_1, x_2, ...} falls infinitely often into the set B, with probability 1.
Note that Theorem 1 does not make any assumptions about the observations of f(·) and hence is valid in the very general case where evaluations of the objective function f(·) are noisy and the noise is not necessarily random.
Consider the following three particular cases.
(a) If in (2.2) we use B = B*(ε) or B = W(ε) with some ε > 0, then Theorem 1 gives a sufficient condition that the corresponding GRS algorithm converges; that is, there exists a subsequence {x_{i_j}} of the sequence {x_j} which converges (with probability 1) to the set X* in the sense that the distance between x_{i_j} and X* tends to 0 as j → ∞. For this subsequence {x_{i_j}}, we have f(x_{i_j}) → f* as j → ∞. If the evaluations of f(·) are noise-free, then we can use the sequence of record points (that is, the points where the records f_{o,j} = min_{ℓ≤j} f(x_ℓ) are attained) as {x_{i_j}}; in this case, f(x_{i_j}) = f_{o,j} is the sequence of records converging to f* with probability 1. By the dominated convergence theorem (see e.g. [4, Section 7.2]), convergence of the sequence of records f_{o,j} to f* with probability 1 implies other important types of convergence of f_{o,j} to f*: in mean and in mean square.

(b) If (2.2) holds for B = B(x, ε) with any x ∈ X and any ε > 0, then Theorem 1 gives a sufficient condition that the sequence of points {x_1, x_2, ...} is dense in X with probability 1. As this is a stronger sufficient condition than in (a), all conclusions of (a) remain valid.
(c) If we use pure random search (PRS) with P = P_U, the uniform distribution on X (that is, P_j = P_U for all j and the points x_1, x_2, ... are independent), then the assumption that X is convex implies P_U(B(x, ε)) ≥ c_ε > 0 for all x ∈ X and any ε > 0, and therefore the condition (2.2) trivially holds for any B = B(x, ε), as in (b) above. In practice, the usual choice of the distribution P_j is

P_j = α_j P_U + (1 − α_j) Q_j,   (2.3)

where 0 ≤ α_j ≤ 1 and Q_j is a specific probability measure on X which may depend on previous evaluations of the objective function. Sampling from the distribution (2.3) corresponds to taking a uniformly distributed random point in X with probability α_j and sampling from Q_j with probability 1 − α_j. In the case of distributions (2.3), the condition ∑_{j=1}^∞ α_j = ∞ yields the fulfilment of (2.2) for all B = B(x, ε), and therefore the GRS algorithm with such P_j converges in theory.
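As an illustration, a single draw from a mixture distribution of the form (2.3) can be sketched as follows (a minimal sketch: the function names and the two sampler arguments are ours, not from the paper; the local sampler below is a hypothetical exploitation measure Q_j):

```python
import random

def sample_mixture(alpha_j, sample_uniform, sample_Q):
    """One draw from P_j = alpha_j * P_U + (1 - alpha_j) * Q_j, as in (2.3).

    With probability alpha_j we take a uniformly distributed point in X,
    otherwise we sample from the exploitation measure Q_j.
    """
    if random.random() < alpha_j:
        return sample_uniform()
    return sample_Q()

# Example: X = [0,1]^2, Q_j a hypothetical local sampler near a current best
# point; alpha_j = 1/j keeps sum(alpha_j) divergent, so condition (2.2) holds.
d = 2
best = [0.3, 0.7]
uniform = lambda: [random.random() for _ in range(d)]
local = lambda: [min(1.0, max(0.0, b + 0.05 * random.gauss(0, 1))) for b in best]
points = [sample_mixture(1.0 / j, uniform, local) for j in range(1, 101)]
```

The choice α_j = 1/j makes ∑ α_j diverge, so the uniform component is used infinitely often and theoretical convergence is retained, while most of the budget goes to exploitation as j grows.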

Rate of convergence
Consider first a PRS algorithm, where the x_j are i.i.d. with distribution P. Let ε, δ > 0 be fixed and let B be the target set we want to hit by the points x_1, x_2, .... For example, we set B = W(δ) = {x ∈ X : f(x) − f* ≤ δ} in the case when the accuracy is expressed in terms of closeness with respect to the function value, B = B(x*, ε) if we are studying convergence towards a particular global minimizer x*, and B = B*(ε) if the aim is to approach a neighbourhood of X*.
Assume that P is such that P(B) > 0. In particular, if P = P_U is the uniform probability measure on X, then, as X has a Lipschitz boundary, we have P(B) = vol(B)/vol(X) > 0. Note that in all interesting instances the value p = P(B) is positive but small, and this will be assumed below.
Define the Bernoulli trials in which success in trial j means x_j ∈ B. PRS generates a sequence of independent Bernoulli trials with the same success probability Pr{x_j ∈ B} = P(B). In view of the independence of x_1, x_2, ..., we have

Pr{x_j ∈ B for at least one j, 1 ≤ j ≤ n} = 1 − (1 − P(B))^n,

and therefore this probability tends to one as n → ∞.
Let n_γ be the number of points required for PRS to reach the set B with probability at least 1 − γ, where γ ∈ (0, 1); that is,

n_γ = ⌈ln γ / ln(1 − P(B))⌉ ≅ ⌈− ln γ / P(B)⌉.   (2.4)

The numerator − ln γ in the approximate expression (2.4) for n_γ depends on γ but is not large; for example, − ln γ ≅ 4.605 for γ = 0.01. However, the denominator P(B) (which depends on ε, d and the shape of X) can be very small.
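The exact form of (2.4) is trivial to evaluate; a small sketch (function name ours):

```python
import math

def n_gamma(p, gamma):
    # Smallest n with 1 - (1 - p)^n >= 1 - gamma, i.e. (1 - p)^n <= gamma;
    # this is the exact version of (2.4), with p = P(B).
    return math.ceil(math.log(gamma) / math.log(1.0 - p))
```

For small p the approximation −ln(γ)/p is accurate: for instance, p = 10^-4 and γ = 0.01 give n_γ = 46050, while −ln(0.01)/10^-4 ≈ 46052.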
Assuming that B = B(x*, ε), where the norm is standard Euclidean, and that B lies fully inside X, we have

vol(B(x*, ε)) = ε^d V_d,   (2.5)

where V_d = π^{d/2} / Γ(d/2 + 1) is the volume of the unit Euclidean ball B(0, 1) and Γ(·) is the gamma function. The resulting version of the expression (2.4) for n_γ in the case B = B(x*, ε) and vol(X) = 1 becomes

n_γ^as = ⌈− ln γ / (ε^d V_d)⌉.   (2.6)

As ε → 0, the ball B = B(x*, ε) lies fully inside X for P_U-almost all x*, and hence the equality (2.5) is valid asymptotically for almost all x*. This is the reason for the superscript 'as' in (2.6). As shown below in Section 3, in the non-asymptotic regime in situations (M) and (H), the volume vol(B(x*, ε) ∩ X) is necessarily smaller than given by (2.5), and therefore the true n_γ is (much) larger than n_γ^as in (2.6).

Consider now general GRS algorithms where the probabilities P_j are chosen in the form (2.3) with coefficients α_j satisfying the condition (2.2). Instead of the equality Pr{x_j ∈ B} = P(B) for all j ≥ 1, we now have the inequality Pr{x_j ∈ B} ≥ α_j P_U(B), where the equality holds in the worst-case scenario. We define n(γ) as the smallest integer such that the inequality ∑_{j=1}^{n(γ)} α_j ≥ − ln γ / P_U(B) is satisfied. For the choice α_j = 1/j, which is a common recommendation, we can use the approximation ∑_{j=1}^n α_j ≅ ln n, and we therefore obtain n(γ) ≅ exp{− ln γ / P_U(B)}. For the case of B = B(x*, ε) lying fully inside X with vol(X) = 1, we have P_U(B) = ε^d V_d with V_d the volume of the unit ball. Note also that if the distance between x* and the boundary of X is smaller than ε, then P_U(B) is smaller and hence n(γ) is even larger. For example, for γ = 0.1, d = 10 and ε = 0.1, n(γ) is larger than 10^1000000000. Even for optimization problems in a small dimension d = 3, and for γ = 0.1 and ε = 0.1, the number n(γ) of points required for the GRS algorithm to hit the set B in the worst-case scenario is huge: n(γ) ≅ 10^238.
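For concreteness, the asymptotic count (2.6) can be computed directly (a sketch; function names ours):

```python
import math

def unit_ball_volume(d):
    # V_d = pi^(d/2) / Gamma(d/2 + 1), the volume of the unit Euclidean ball
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def n_gamma_asymptotic(d, eps, gamma):
    # n_gamma^as from (2.6), assuming vol(X) = 1 and B(x*, eps) inside X
    return math.ceil(-math.log(gamma) / (eps ** d * unit_ball_volume(d)))
```

For d = 10, ε = 0.1 and γ = 0.1 this already gives n_γ^as of the order 9·10^9; the worst-case count n(γ) ≅ exp{− ln γ/(ε^d V_d)} is then astronomically larger, in line with the numbers quoted above.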
3 Points uniformly distributed on X

Asymptotic case
In this section, the point set X_n = {x_1, ..., x_n} consists of the first n points of a sequence X_∞ = {x_1, x_2, ...} of independent uniformly distributed random vectors in X. Assume, without loss of generality, that vol(X) = 1.
Consider the random variable ρ(U, X_n), the distance between U (a uniform random point in X) and X_n; see (2.1) for the definition of ρ. The cdf (cumulative distribution function) of ρ(U, X_n) gives the average proportion of X covered by the balls centered at the points of X_n with radius r. That is,

F_d(r, X_n) = Pr{ρ(U, X_n) ≤ r} = E vol(B(X_n, r)),   (3.1)

where the set B(X_n, r) is defined in (1.1). In asymptotic considerations, we need to suitably normalize the radius (which tends to zero as n → ∞) in (3.1). We thus consider the following sequence of cdf's:

F_n(t) = F_d(t (n V_d)^{−1/d}, X_n), t ≥ 0.   (3.2)

Lemma 1. As n → ∞,

F_n(t) → F(t) = 1 − exp{−t^d} for all t ≥ 0,   (3.3)

where the convergence is uniform in t and the cdf's F_n are defined in (3.2).
The statement of Lemma 1 follows from Zador's arguments in his fundamental paper [24]; see the beginning of page 142. The key observation of Zador is that asymptotically, as n → ∞, the covering radius computed for uniformly distributed random points x_j tends to 0, and hence the equality (2.5) is valid asymptotically for almost all U; this is formula (19) in [24]. The statement of Lemma 1 is in fact a particular case of Theorem 9.1 in [3], with Q chosen as the uniform distribution on X.
In what follows, we will need the (1 − γ)-quantile (0 < γ < 1) of the cdf F in the rhs of (3.3). This (1 − γ)-quantile is t_{1−γ} = [− log γ]^{1/d}, for which we have F(t_{1−γ}) = 1 − γ. The quantity t_{1−γ} can be interpreted as the normalised asymptotic radius required for covering a subset of X of volume 1 − γ (the weak covering introduced in Section 1). For very small ε, in order to cover a subset of X of volume approximately 1 − γ with balls of radius ε and random centers x_j ∈ X_n, the number n = n_γ of balls should satisfy

n_γ ≅ − ln γ / (ε^d V_d),   (3.4)

which coincides with (2.6). The above result can be reformulated in terms of the asymptotic radius r as follows: for very large n, the union of n balls with random centers x_j ∈ X_n and radius

r = [− ln γ / (n V_d)]^{1/d}   (3.5)

covers a subset of X of volume approximately 1 − γ.
In the non-asymptotic (finite n) regime, the distribution function F_d(r, X_n) of (3.1) can be obtained in the following way (below, for X_n = {x_1, ..., x_n}, the points x_1, x_2, ..., x_n are not necessarily uniformly distributed but are i.i.d.).
Conditionally on U, we have, for fixed U ∈ X,

Pr{ρ(U, X_n) ≤ r | U} = 1 − (1 − P_X{‖U − X‖ ≤ r})^n,   (3.6)

where X has the same distribution as x_1. From (3.6), the distribution function F_d(r, X_n) can be obtained by averaging over the distribution of U:

F_d(r, X_n) = 1 − E_U (1 − P_X{‖U − X‖ ≤ r})^n.   (3.7)

For large n and small r, we can use the approximate equality P_X{‖U − X‖ ≤ r} ≅ r^d V_d in (3.6). By doing so, averaging with respect to U becomes redundant and we arrive at the results of Section 2.3. If n is not so large, the quantity P_X{‖U − X‖ ≤ r} has to be approximated by other means; this will be discussed in Section 5.
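The representation (3.7) also suggests a direct Monte Carlo check of how far the finite-n coverage is from its asymptotic value: draw one point set X_n, then estimate Pr{ρ(U, X_n) ≤ r} over random test points U, with r taken from the asymptotic relation (3.5). The sketch below (function names ours) reproduces the qualitative behaviour of Figure 1.

```python
import math
import random

def asymptotic_radius(n, d, gamma):
    # r from (3.5): n balls of this radius asymptotically cover a fraction
    # 1 - gamma of X = [0, 1]^d
    v_d = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return (-math.log(gamma) / (n * v_d)) ** (1.0 / d)

def empirical_coverage(d, n, r, trials=1000, rng=random):
    # Monte Carlo estimate of F_d(r, X_n) for X_n uniform i.i.d. in [0,1]^d:
    # the fraction of uniform test points U lying within distance r of X_n
    x_n = [[rng.random() for _ in range(d)] for _ in range(n)]
    hits = 0
    for _ in range(trials):
        u = [rng.random() for _ in range(d)]
        if any(sum((ui - xi) ** 2 for ui, xi in zip(u, x)) <= r * r
               for x in x_n):
            hits += 1
    return hits / trials
```

With γ = 0.1 and n = 1000, the estimate comes out close to the nominal 0.9 for d = 2 but falls dramatically short of it for d = 20, in agreement with Figure 1.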

Bounds for F_d(r, X_n)
Evaluating the expectation in (3.7) is difficult, but simple bounds can be obtained by applying Jensen's inequality. Here we focus on the case X = [0, 1]^d and X_n = {x_1, ..., x_n}, where x_1, x_2, ... is a sequence of uniformly distributed random vectors on X. From (3.7), an immediate use of Jensen's inequality yields the bound

F_d(r, X_n) ≤ 1 − (1 − P_X{‖1/2 − X‖ ≤ r})^n.   (3.8)

Here and below, a = (a, a, ..., a) ∈ R^d for any scalar a. However, noticing that P_X{‖U − X‖ ≤ r} has the same distribution as P_X{‖Z − X‖ ≤ r}, where Z is a uniform random vector on [1/2, 1]^d, we can apply Jensen's inequality to obtain the lower bound

F_d(r, X_n) ≥ 1 − (1 − P_X{‖3/4 − X‖ ≤ r})^n.   (3.9)

The forms of the bounds (3.8) and (3.9) suggest that an approximation of the following form may be useful:

F_d(r, X_n) ≅ 1 − (1 − P_{U,X}{‖U − X‖ ≤ r})^n.   (3.10)

Here, instead of being fixed at 1/2 or 3/4, U is a uniform random vector on [0, 1]^d. The probability P_{U,X}{‖U − X‖ ≤ r} has the interpretation of the average volume of the intersection of the cube [0, 1]^d with a ball of radius r and random centre U. For different d and r, the distribution of P_X{‖U − X‖ ≤ r} normalised by the volume r^d V_d of the ball is shown in Figures 9-10.
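A quick numerical sanity check of (3.8)-(3.9) only requires the probability P_X{‖a − X‖ ≤ r} at the two fixed centres a = 1/2 and a = 3/4; a sketch (function names ours), with Monte Carlo used for the intersection probability:

```python
import random

def psi(center, r, d, samples=20000, rng=random):
    # Monte Carlo estimate of P_X{ ||c - X|| <= r } for X uniform on [0,1]^d,
    # where c = (center, center, ..., center)
    hits = 0
    for _ in range(samples):
        if sum((center - rng.random()) ** 2 for _ in range(d)) <= r * r:
            hits += 1
    return hits / samples

def jensen_bounds(n, r, d):
    # Upper bound (3.8), centre 1/2, and lower bound (3.9), centre 3/4
    upper = 1.0 - (1.0 - psi(0.5, r, d)) ** n
    lower = 1.0 - (1.0 - psi(0.75, r, d)) ** n
    return lower, upper
```

Replacing the fixed centre by a random U and averaging gives the approximation (3.10) in the same way.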

Numerical studies
In this section, we demonstrate one of the key messages of the paper: in high dimensions, the asymptotic results are unattainable for reasonable values of n and consequently produce poor approximations unless n is astronomically large.
In Figure 1, we plot F_d(r, X_n) as a function of d for n = 1000 (blue plusses) and n = 10000 (black circles). For each value of d, the radius r is chosen based on the asymptotic result given in (3.5) with 1 − γ = 0.9; this level is shown by the solid red line at 0.9. We see that, very quickly and for values of n that would be deemed large, F_d(r, X_n) becomes significantly smaller than 0.9 and quickly tends to zero in d.
The big difference between the asymptotic and the finite-n regime is further illustrated in Figures 2-8. In these figures, the solid black line depicts F_d(r, X_n) as a function of r for the values of d and n provided in the caption of each figure. The dashed red line is the approximation obtained from the asymptotic result (3.3); that is, the approximation F_d(r, X_n) ≅ 1 − exp{−n r^d V_d}. In Figures 3-8, we also include the two Jensen bounds (3.8) (dot-dashed orange) and (3.9) (dotted blue), as well as the approximation (3.10) (long-dashed green). From these figures, we can make the following observations.
First, for the values of d and n considered in Figures 3-6, the asymptotic approximation is far too optimistic: at the radii r of interest, the actual values of F_d(r, X_n) are much smaller than the asymptotic approximation suggests (the respective triples (r; asymptotic value; actual value) for Figures 4-6 are (0.9; 0.935; 0.08) and (0.8; 0.95; 0.13)). For the case of d = 50 shown in Figures 7-8, the asymptotic properties are so far from being achieved with n = 1000 and n = 10000 that such a comparison does not even make sense.

In Figures 9 and 10, we depict the distribution of the volume of the intersection a ball with random centre has with the cube, normalised by the volume r^d V_d of the ball; that is, we plot the density of the r.v. κ_U = P_X{‖U − X‖ ≤ r}/(r^d V_d), where both U and X have the uniform distribution on [0, 1]^d. The importance of these two figures is that they provide another illustration of the inadequacy of the key assumption behind (2.6): namely, the assumption that the density of the r.v. κ_U is very close to the delta-measure concentrated at one. This assumption is indeed reasonably adequate if r can be chosen small enough. However, as Figure 9 and especially Figure 10 illustrate, even for relatively large values of n the required values of r are not small enough for this to hold even approximately. Note that in the derivation of the asymptotic value n_γ = n_γ^as in (2.6) we use the value 1 in place of the random variable κ_U, whose densities are shown in Figures 9-10.

4 Comparison of exploration strategies

In this section we demonstrate the δ-effect: in high dimensions, sampling in a δ-cube C_δ = [1/2 − δ/2, 1/2 + δ/2]^d with suitable 0 < δ < 1 leads to a much more efficient covering scheme than sampling within the whole cube [0, 1]^d. Note that the δ-effect is not obvious; it appears to be unknown in the literature on stochastic global optimization, and perhaps in the literature on global optimization in general.
All existing literature recommends space-filling in the whole set X and not in a subset of it. Moreover, there are recommendations in the literature (see, for example, [5,22]) to choose more points close to the boundary of the cube, rather than purely uniformly, in order to improve the space-filling properties of random points.
In Figures 11-13, for different values of d and n, we plot F_d(r, X_n) as a function of δ. For each d and n, the value of r has been chosen such that max_{0≤δ≤1} F_d(r, X_n) = 0.9; these values of r (along with the optimal values of δ, in brackets) can be obtained from Table 1. In these figures, the values of F_d(r, X_n) for n = 1000, 10000, 100000 are shown with a solid black line, a dashed blue line and a dotted green line, respectively. These figures demonstrate the 'δ-effect' formulated as the second main message in the Introduction. They also clearly demonstrate that, even though sampling uniformly in the cube [0, 1]^d is asymptotically optimal, for large d it is always a poor strategy, which can be substantially improved. The discussion of Jensen's bounds given in Section 3.2 still applies to the case of X_n sampled uniformly within the δ-cube C_δ; the only adjustment that needs to be made to the results of Section 3.2 is to let X be a uniform random vector in C_δ rather than in [0, 1]^d. In Figure 14, we depict the Jensen lower bound (3.9) for X uniform in [0, 1]^d and for X uniform in the δ-cube [1/4, 3/4]^d (so that δ = 0.5). We see that the lower bound for X_n sampled within the δ-cube is larger than that for X_n sampled from the whole cube. This further supports the conclusion that, for n not astronomically large, the 'δ-effect' should always be taken into account.
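The δ-effect itself is easy to reproduce numerically: re-run the coverage experiment with the n points drawn uniformly from C_δ instead of [0, 1]^d and scan over δ. A minimal sketch (function names ours):

```python
import random

def coverage_delta(d, n, r, delta=1.0, trials=1000, rng=random):
    # Estimate F_d(r, X_n) when X_n is uniform i.i.d. in the delta-cube
    # C_delta = [1/2 - delta/2, 1/2 + delta/2]^d, while the test point U
    # (playing the role of the unknown minimizer x*) is uniform on [0,1]^d
    lo = 0.5 - delta / 2.0
    x_n = [[lo + delta * rng.random() for _ in range(d)] for _ in range(n)]
    hits = 0
    for _ in range(trials):
        u = [rng.random() for _ in range(d)]
        if any(sum((ui - xi) ** 2 for ui, xi in zip(u, x)) <= r * r
               for x in x_n):
            hits += 1
    return hits / trials

# Scanning delta over, say, {0.5, 0.6, ..., 1.0} for fixed d, n, r traces out
# one of the curves of Figures 11-13; in large d the maximizer is at delta < 1.
```

delta = 1.0 recovers sampling in the whole cube, so the same routine serves both schemes.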
In Table 1, for X_n chosen uniformly in the cube [0, 1]^d and for X_n chosen uniformly in the δ-cube, we tabulate the values of r_{n,1−γ} with γ = 0.1 for different d and n. In the columns labeled 'δ-cube', the values in brackets correspond to the approximately optimal values of δ. We can see that for small d the δ-effect is very small (since n is relatively large in these dimensions). For larger dimensions, the δ-effect is very prominent.
In Tables 2-3, we consider an equivalent reformulation of the results of Table 1. In these tables, for a given r we specify the value of n_γ, with γ = 0.1, for X_n chosen uniformly in the cube [0, 1]^d and for X_n chosen uniformly in the δ-cube. We also include the approximation based on the asymptotic formula (2.6).

In Figures 15 and 16, we plot F_d(r, X_n) as a function of d, where the radius is fixed from (3.5) with γ = 0.1 (the level 1 − γ is depicted by a red solid line). For X_n chosen uniformly in the cube [0, 1]^d, we depict F_d(r, X_n) with blue plusses. For X_n chosen from a Sobol sequence in the whole cube [0, 1]^d, we use orange triangles. When the points in X_n are uniform i.i.d. within the δ-cube with optimal δ, we use green crosses. Finally, when the points in X_n are chosen from a Sobol sequence within the same δ-cube, we use purple diamonds.

Figures 15 and 16 illustrate two new key messages along with the message discussed in Figure 1. Firstly, the use of low-discrepancy sequences seems to produce slightly better results than a random choice of points in small dimensions, but in higher dimensions low-discrepancy sequences (in our case, Sobol sequences) produce results that are almost equivalent to random sampling uniformly either in [0, 1]^d or in the optimally chosen δ-cube. Secondly, in large dimensions, sampling from a suitable δ-cube greatly outperforms the other schemes considered here, while still being far from the asymptotic results. These messages are further supported in Figures 17 and 18. Here we plot the asymptotic approximation from Lemma 1 (dashed red) and F_d(r, X_n) as a function of r for the following choices of X_n: random in the cube [0, 1]^d (blue line with plusses), chosen from a Sobol sequence in [0, 1]^d (orange line with triangles), random in the δ-cube with optimal δ (green line with crosses), and chosen from a Sobol sequence within the same δ-cube (purple line with diamonds). We see in Figure 17 that for d = 10 the Sobol sequence is slightly advantageous to random uniform sampling on the whole cube and on the δ-cube for most interesting values of γ. Choosing X_n uniform within the δ-cube produces better coverings than Sobol's points in [0, 1]^d for most values of γ, but slightly worse for small γ. This slight advantage of the Sobol sequence in [0, 1]^d and in the δ-cube diminishes in the case d = 20 shown in Figure 18.
To further study the similarities in performance between X_n chosen randomly in the δ-cube with optimal δ and X_n chosen from a Sobol sequence within the same δ-cube, in Figures 19-20 we plot the ratio of the cdf's F_d(r, X_{n,U})/F_d(r, X_{n,S}) across different d. The subscripts U and S respectively correspond to X_n chosen randomly in the δ-cube with optimal δ and to X_n chosen from a Sobol sequence within the same δ-cube. For each value of d, r is chosen so that max_{0≤δ≤1} F_d(r, X_{n,U}) = 0.9.
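To experiment with a deterministic alternative without any dependencies, one can substitute a low-discrepancy sequence for the random points in the coverage experiments above. The sketch below generates Halton points (we use Halton rather than Sobol purely because it fits in a few lines of pure Python; the paper's experiments use Sobol's sequence):

```python
def halton_point(i, d, primes=(2, 3, 5, 7, 11, 13, 17, 19, 23, 29)):
    # i-th point (i = 0, 1, 2, ...) of the d-dimensional Halton sequence:
    # coordinate k is the radical-inverse of i + 1 in base primes[k]
    point = []
    for b in primes[:d]:
        f, x, k = 1.0, 0.0, i + 1
        while k > 0:
            f /= b
            x += f * (k % b)
            k //= b
        point.append(x)
    return point

# X_n = [halton_point(i, d) for i in range(n)] can replace the random X_n in
# the coverage estimates; rescaling by delta gives points inside C_delta.
```

In medium and high dimensions, the coverage obtained this way is close to that of i.i.d. uniform points, echoing message (ii-b).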

Non-uniform prior distribution for the target
In this section, we explore the effect a non-uniform prior distribution for x* ∈ X has on the conclusions formulated above for the case of the uniform distribution.

Fig. 19: Efficiency of Sobol's points, n = 2^10. Fig. 20: Efficiency of Sobol's points, n = 2^13.

We will assume that x* has independent components distributed according to the following symmetric beta distribution with density

p_α(t) = t^{α−1}(1 − t)^{α−1} / B(α, α), 0 ≤ t ≤ 1, for some α > 0,

where B(·, ·) is the beta function.
If α = 1, the density p_α(t) is uniform on [0, 1], while for 0 < α < 1 this density is U-shaped. In most cases below we choose the arcsine density p_{0.5}(t).
We then select the distribution of the random points x_j ∈ X_n to have a similar shape but constrained to the δ-cube C_δ. More precisely, we assume that the x_j have independent components with the density

p_{α,δ}(t) = (1/δ) p_α((t − (1 − δ)/2)/δ), (1 − δ)/2 ≤ t ≤ (1 + δ)/2,

for some α > 0 and 0 ≤ δ ≤ 1.
In the case α = 1, the points x_j have the uniform distribution on the cube C_δ. Figures 21-22 are similar to Figures 11-13, but with the key difference of assuming a non-uniform prior distribution for x*. For different values of d and n and for α = 0.5, we plot F_d(r, X_n) as a function of δ. For each d and n, the value of r has been chosen so that max_{0≤δ≤1} F_d(r, X_n) = 0.9. In these figures, the values of F_d(r, X_n) for n = 1000, 10000, 100000 are shown with a solid black line, a dashed blue line and a dotted green line, respectively. Figures 23-24 are similar to Figures 21-22, but with varying values of α and fixed n = 10000. In these figures, we selected α = 0.25 (dashed dark green), α = 0.5 (dotted purple), α = 0.75 (dot-dashed grey) and α = 1, which gives the uniform distribution (solid black). Figures 21-22 clearly demonstrate that the 'δ-effect' is still significant in this non-uniform setting.
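Points with Beta(α, α)-distributed components rescaled to C_δ are easy to generate, since Python's standard library already contains a Beta sampler (a sketch; function name ours):

```python
import random

def sample_beta_point(d, alpha, delta, rng=random):
    # One point x_j in C_delta = [1/2 - delta/2, 1/2 + delta/2]^d whose
    # components are i.i.d. Beta(alpha, alpha), rescaled to C_delta;
    # alpha = 1 recovers the uniform distribution on C_delta, and
    # alpha = 0.5 gives the rescaled arcsine density
    lo = 0.5 - delta / 2.0
    return [lo + delta * rng.betavariate(alpha, alpha) for _ in range(d)]
```

Feeding such point sets into the coverage estimators above reproduces the experiments of Figures 21-24.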

5 Intersection of one ball with the cube
As a result of (3.6), the main quantity of interest in this section is the probability

P_{U,δ,r} = P_X{‖U − X‖ ≤ r},   (5.1)

in the case when X has the uniform distribution on the δ-cube [1/2 − δ/2, 1/2 + δ/2]^d and U = (u_1, ..., u_d) ∈ R^d is fixed. The case δ = 1 is directly applicable to Section 3.2. Because of the results of Section 3.2, we will bear in mind the two typical choices U = 1/2 and U = 3/4, but will formulate results for general U.

For fixed u ∈ R, consider the r.v. η_{u,δ} = (z − u)², where z has density

p_δ(t) = 1/δ, (1 − δ)/2 < t < (1 + δ)/2, for some 0 ≤ δ ≤ 1.   (5.2)

The first central moments of η_{u,δ} are:

E η_{u,δ} = (u − 1/2)² + δ²/12,   (5.3)
var(η_{u,δ}) = (u − 1/2)² δ²/3 + δ⁴/180,   (5.4)
E[η_{u,δ} − E η_{u,δ}]³ = (u − 1/2)² δ⁴/15 + δ⁶/3780.   (5.5)

Then, for given U = (u_1, ..., u_d) ∈ R^d, consider the random variable ‖U − X‖² = ∑_{j=1}^d (u_j − x_j)², where we assume that X = (x_1, ..., x_d) is a random vector with i.i.d. components x_i having density (5.2). From (5.3), its mean is

μ_{d,δ,U} = E‖U − X‖² = ‖U − 1/2‖² + d δ²/12.

Using the independence of x_1, ..., x_d and (5.4), we obtain

σ²_{d,δ,U} = var(‖U − X‖²) = ‖U − 1/2‖² δ²/3 + d δ⁴/180,   (5.6)

and from the independence of x_1, ..., x_d and (5.5) we get

E[‖U − X‖² − μ_{d,δ,U}]³ = ‖U − 1/2‖² δ⁴/15 + d δ⁶/3780.

If d is large enough, the conditions of the CLT for ‖U − X‖² are approximately met and the distribution of ‖U − X‖² is approximately normal with mean μ_{d,δ,U} and variance σ²_{d,δ,U}. That is, we can approximate the probability P_{U,δ,r} = P_X{‖U − X‖ ≤ r} by

P_{U,δ,r} ≅ Φ((r² − μ_{d,δ,U}) / σ_{d,δ,U}),   (5.7)

where Φ(·) is the cdf of the standard normal distribution. The approximation (5.7) has acceptable accuracy if the probability P_{U,δ,r} is not very small; for example, if r² falls inside a 2σ-confidence interval generated by the normal distribution with mean μ_{d,δ,U} and variance σ²_{d,δ,U}.
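The normal approximation (5.7) with the moments (5.3)-(5.4) is straightforward to implement and to check against direct simulation (a sketch; function names ours, and U = (u, ..., u) taken constant across coordinates for simplicity):

```python
import math
import random

def clt_prob(r, d, delta, u):
    # Approximation (5.7): P{ ||U - X|| <= r } with X uniform on the
    # delta-cube and U = (u, ..., u); mean and variance from (5.3)-(5.4)
    a2 = d * (u - 0.5) ** 2                        # ||U - 1/2||^2
    mean = a2 + d * delta ** 2 / 12.0
    var = a2 * delta ** 2 / 3.0 + d * delta ** 4 / 180.0
    z = (r * r - mean) / math.sqrt(var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # Phi(z)

def mc_prob(r, d, delta, u, samples=20000, rng=random):
    # Direct Monte Carlo estimate of the same probability
    lo = 0.5 - delta / 2.0
    hits = 0
    for _ in range(samples):
        if sum((u - (lo + delta * rng.random())) ** 2 for _ in range(d)) <= r * r:
            hits += 1
    return hits / samples
```

For d = 10, δ = 1 and r² near the mean μ_{d,δ,U}, the two agree to within a percent or two; in the far tails the Edgeworth correction discussed below becomes necessary.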
To improve on the usual CLT approximation, we use an Edgeworth-type expansion in the CLT for sums of independent non-identically distributed r.v.'s due to V. Petrov, see [13]; we refer to this expansion as (5.8). In this expansion, γ_{ν,j} is the cumulant of order ν of (u_j − x_j)^2, H_m is the Chebyshev-Hermite polynomial of degree m, and the summation is carried out over all non-negative integer solutions (k_1, ..., k_ν) of the equation k_1 + 2k_2 + ... + ν k_ν = ν. The partition function p(ν) provides the number of partitions of a non-negative integer ν and therefore, for each value of ν, the number of terms in the summation.
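The effect of a one-term correction can be sketched numerically. The snippet below (our own illustration, not the paper's code) uses the standard first Edgeworth term Φ(z) − (λ_3/6)(z^2 − 1)φ(z), where λ_3 is the skewness of ||U − X||^2; the per-coordinate third central moment (u − 1/2)^2 δ^4/15 + δ^6/3780 is the one for the uniform density on the δ-interval, and the parameter values are illustrative assumptions:

```python
import numpy as np
from math import erf, exp, pi, sqrt

def phi(x):   # standard normal density
    return exp(-x * x / 2) / sqrt(2 * pi)

def Phi(x):   # standard normal c.d.f.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def edgeworth_approx(U, delta, r):
    """Return (one-term Edgeworth approximation, plain CLT approximation)
    of P{||U - X|| <= r} for X uniform on the delta-cube."""
    a2 = (np.asarray(U, dtype=float) - 0.5) ** 2
    d = len(U)
    mu = a2.sum() + d * delta**2 / 12
    var = a2.sum() * delta**2 / 3 + d * delta**4 / 180
    mu3 = (a2 * delta**4 / 15 + delta**6 / 3780).sum()  # 3rd cumulant of the sum
    sigma = sqrt(var)
    lam3 = mu3 / sigma**3                               # skewness of ||U - X||^2
    z = (r**2 - mu) / sigma
    return Phi(z) - (lam3 / 6) * (z * z - 1) * phi(z), Phi(z)

# lower-tail comparison for d = 10, delta = 1, U = (3/4, ..., 3/4)
rng = np.random.default_rng(1)
d, delta = 10, 1.0
U = np.full(d, 0.75)
mu = d * 0.0625 + d / 12
r = sqrt(mu - 1.5 * sqrt(d * 0.0625 / 3 + d / 180))     # z = -1.5 in the lower tail
X = 0.5 + delta * (rng.random((400_000, d)) - 0.5)
mc = float(np.mean(np.linalg.norm(X - U, axis=1) <= r))
edge, clt = edgeworth_approx(U, delta, r)
print(edge, clt, mc)
```

In this lower-tail example the corrected value lands noticeably closer to the Monte Carlo estimate than the plain CLT value, mirroring the improvement reported for Petrov's expansion below.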

Conclusions
We have considered continuous global optimization problems, where the feasible region X is a compact subset of R^d. As a strategy for exploration, we have mostly considered sampling of i.i.d. random points either in X or in a suitable subset of X.
We have distinguished between 'small' (S), 'medium' (M) and 'high' (H) dimensional problems, depending on the relation between d and n_max (the maximum possible number of points available for space exploration). We only considered the situations (M) and (H), where we have demonstrated the following effects: (i) the actual convergence of randomized exploration schemes is much slower than that given by the classical estimates, which are based on the asymptotic properties of random points; (ii) the usually recommended space exploration schemes are practically inefficient as the asymptotic regime is unreachable. In particular, we have shown: (ii-a) uniform sampling on the entire X is much less efficient than uniform sampling on a suitable subset of X, and (ii-b) the effect of replacing random points by a low-discrepancy sequence is very small, so that using low-discrepancy sequences and other deterministic constructions does not lead to significant improvements (unless the number of evaluation points n = n_max is fixed to some particular value like 2^d or 2^{d−1}, see [9]). We believe that the effects (i) and (ii) have not been stated in the literature, at least in this generality. The effect (ii-a) has been numerically demonstrated in our previous papers [7,8]. The effect (ii-b) reinforces one of the main messages of the paper [12].
It was not the purpose of this paper to propose the most effective exploration schemes. However, the results of this paper, together with the studies reported in [7,8] and [9], allow us to give several general recommendations on the efficient organization of exploration strategies in the situations (M) and (H), at least when X is a cube.
In a high-dimensional cube X = [0, 1]^d with d > 20 and 1 < n_max < 2^d, we propose the following strategy for constructing nested exploration designs X_n: x_1 = (1/2, ..., 1/2) (the centre of X) and the other points x_j are taken randomly among the vertices of the cube [1/4, 3/4]^d. Sampling from the vertices can be done without replacement (see [8]) and, moreover, we can keep the points x_j so that the Hamming distance between them is at least d − log_2(n_max − 1) + 1.
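A minimal sketch of this design is given below; it places the centre first and then draws distinct random vertices of [1/4, 3/4]^d (sampling without replacement via a hash set). The Hamming-distance refinement mentioned above is omitted here, and the function name and parameters are our own:

```python
import numpy as np

def nested_exploration_design(d, n, seed=0):
    """Sketch of the (H)-regime design: x_1 is the centre of [0,1]^d, the
    remaining n-1 points are distinct random vertices of [1/4, 3/4]^d.
    Assumes n - 1 < 2**d, so enough distinct vertices exist."""
    rng = np.random.default_rng(seed)
    points = [np.full(d, 0.5)]              # x_1 = centre of the cube
    seen = set()
    while len(points) < n:
        v = rng.integers(0, 2, size=d)      # random vertex, coded in {0,1}^d
        key = v.tobytes()
        if key not in seen:                 # sampling without replacement
            seen.add(key)
            points.append(0.25 + 0.5 * v)   # map {0,1}^d -> vertices of [1/4,3/4]^d
    return np.array(points)

X = nested_exploration_design(d=20, n=100)
```

Because the designs are nested, stopping after any n ≤ n_max simply truncates the same point sequence.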
It is more difficult to be as specific in the situation (M), as there may be different relations between 2^d, n_min and n_max. A good strategy would be to use the product of arcsine distributions on a suitable δ-cube (see [8]); this distribution is slightly superior to the uniform distribution on the δ-cube of Section 4 (with different values of δ optimized for the respective distribution). An even more natural strategy would be sampling in the 2^d small cubes (of side length ε) surrounding the vertices of a δ-cube C_δ = [1/2 − δ/2, 1/2 + δ/2]^d (after placing x_1 = (1/2, ..., 1/2)). The choice of δ and ε depends on 2^d, n_min and n_max and requires a separate study. As usual, reduction of randomness in sampling makes any of these schemes marginally more efficient.
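Sampling from the product of arcsine distributions on a δ-cube can be sketched as follows; the arcsine (Beta(1/2, 1/2)) marginal is drawn by inverse-CDF sampling and rescaled, and the function name and parameter values are illustrative assumptions rather than the paper's code:

```python
import numpy as np

def arcsine_delta_cube_sample(n, d, delta, seed=0):
    """Sketch: n i.i.d. points whose d coordinates independently follow
    the arcsine (Beta(1/2,1/2)) distribution rescaled to the delta-cube
    [1/2 - delta/2, 1/2 + delta/2]^d."""
    rng = np.random.default_rng(seed)
    u = rng.random((n, d))
    b = np.sin(0.5 * np.pi * u) ** 2        # inverse CDF of the arcsine law on [0,1]
    return (0.5 - delta / 2) + delta * b    # rescale each coordinate to the delta-cube

pts = arcsine_delta_cube_sample(100_000, 5, 0.5)
```

The arcsine marginal pushes mass towards the faces of the δ-cube, which is what makes it slightly superior to the uniform distribution for covering purposes.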
0.91, but the true value of F_d(0.5, X_n) is closer to 0.41. As n increases from 1000 to 10000, as shown in Figure 4, for r = 0.4 we have F(n^{1/d} V_d^{1/d} r) = 0.925 while F_d(0.4, X_n) is closer to 0.6. (Recall that in view of (3.3), we should have

Fig. 2: F_d(r, X_n) and F(n^{1/d} V_d^{1/d} r) as functions of r; d = 5 and n = 1000

Table 3 :
Figures 15 and 16 are extended versions of Fig. 1. Here we plot F_d(r, X_n) as a function of d, where the radius is fixed from (3.5) with γ = 0.1 (the level 1 − γ is depicted by a red solid line). For X_n chosen uniformly in the cube [0, 1]^d, we depict F_d(r, X_n) with blue plusses. For X_n chosen from a Sobol sequence in the whole cube [0, 1]^d, we use orange triangles. When the points in X_n are uniform i.i.d. within the δ-cube with optimal δ, we use green crosses. Finally, when the points in X_n are chosen from a Sobol sequence within the same δ-cube, we use purple diamonds. Figures 15 and 16 illustrate two new key messages along with the message discussed in Figure 1. Firstly, the use of low-discrepancy sequences seems to produce slightly better results than a random choice of points in small dimensions, but in higher dimensions the use of low-discrepancy sequences (in our case, Sobol sequences) produces results that are almost equivalent to random sampling uniformly either in [0, 1]^d or in the optimally chosen δ-cube. Secondly, in large dimensions sampling from a suitable δ-cube greatly outperforms the other schemes considered here, while still being far from the asymptotic results. These messages are further supported in Figures 17 and 18. Here we plot the asymptotic approximation F_n(r) from Lemma 1 (dashed red) and F_d(r, X_n) as a function of r for the following choices of X_n: random in the cube [0, 1]^d (blue line with plusses), chosen from a Sobol sequence in [0, 1]^d (orange line with triangles), random

The partition function has the generating function Σ_{ν≥0} p(ν) x^ν = Π_{k≥1} (1 − x^k)^{−1}; its first few values are 1, 1, 2, 3, 5, 7, 11, 15, 22, 30, 42, 56, 77, 101. The first few terms in the summation (including the Hermite polynomials) are provided in [14, p. 139]. In the case U = (1/2, ..., 1/2) or U = (3/4, ..., 3/4), the random variables (u_j − x_j)^2 are i.i.d. For this case the cumulants γ_{ν,j} do not depend on j and thus we have the slight simplification λ_{ν,d} = λ_ν = γ_ν/σ^ν_{d,δ,U}. In Figures 25-30, we plot P_{U,1,r} for U = (1/2, ..., 1/2) and U = (3/4, ..., 3/4) as a function of r with a solid black line. In these figures, we demonstrate the accuracy of approximation (5.7) with a dashed blue line. With a dot-dashed red line, we plot the accuracy of the approximation obtained by taking one additional term in the expansion (5.8); this requires the third central moment given in (5.6). We can see that, overall, for d = 10 and d = 20 the approximations are fairly accurate. However, when considering covering by n balls it is more important to focus on the lower tail. Figures 26, 28, 29 and 30 demonstrate that taking one additional term in Petrov's expansion (5.8) produces a significant improvement in accuracy.
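The partition numbers quoted above are easy to reproduce; the short dynamic-programming routine below computes the coefficients of Π_{k≥1} (1 − x^k)^{−1} (the function name is our own):

```python
def partition_numbers(N):
    """Return [p(0), p(1), ..., p(N)]: the number of partitions of each
    integer, i.e. the coefficients of prod_{k>=1} (1 - x^k)^{-1}."""
    p = [1] + [0] * N
    for k in range(1, N + 1):       # multiply the series by 1/(1 - x^k)
        for n in range(k, N + 1):
            p[n] += p[n - k]
    return p

print(partition_numbers(13))
# -> [1, 1, 2, 3, 5, 7, 11, 15, 22, 30, 42, 56, 77, 101]
```

The output matches the sequence of values listed in the text, so at ν = 13 the Petrov summation already contains 101 terms, which is why only one or two correction terms are used in practice.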

Table 1: Values of r_{n,1−γ} with γ = 0.1.

arguments leading to (2.6). We see that in high dimensions, the requirement that r be small enough for (2.6) to provide sensible approximations forces n to be extremely large. Such large values of n are impractical.