1 Introduction

Consider the general problem of continuous global minimization \(f(x)\!\rightarrow \! {\text {min}}_{x\in {\mathcal {X}}}\) with objective function \(f(\cdot )\) and feasible region \({\mathcal {X}}\), which is assumed to be a compact subset of \(\mathbb {R}^d\) with \(\textrm{vol}({\mathcal {X}})>0\). In order to avoid unnecessary technical difficulties, we assume that \({\mathcal {X}}\) is convex. In all numerical examples, we use \({\mathcal {X}}=[0,1]^d\).

Any global optimization algorithm combines two key strategies: exploration and exploitation. Performing exploration is equivalent to what we call “space-filling”; that is, choosing points which are well-spread in \({\mathcal {X}}\). Exploitation strategies use local information about f (and perhaps derivatives of f) and differ greatly for different types of global optimization algorithms. In this paper, we are only concerned with the exploration stage. Although many of our findings can be generalized to other space-filling schemes (where space-filling is not random and the space-filling strategy changes in the course of receiving more information about the objective function), in this paper we concentrate on simple exploration schemes like pure random search, where space-filling is performed by covering \({\mathcal {X}}\) with balls of given radius centered at the chosen points. Moreover, we assume that the points chosen at the exploration stage are independent. That is, we associate the exploration stage with a global random search (GRS) algorithm producing a sequence of random points \(x_1,x_2,\ldots , x_n\), where each point \(x_j \in {\mathcal {X}}\) has some probability distribution \(P_j\) (we write this \(x_j \sim P_j\)) and \(x_1,x_2,\ldots , x_n\) are independent. The value n is determined by a stopping rule. We assume that \(1 \le n_{\min } \le n \le n_{\max } < \infty \), where \(n_{\min } \) and \( n_{\max }\) are two given numbers. The number \( n_{\max }\) determines the maximum number of function evaluations at the exploration stage and the fact that \( n_{\max }< \infty \) determines what we call “the non-asymptotic regime”. In the numerical study of Sect. 4, we also use Sobol’s sequence, the most widely used low-discrepancy sequence.

We distinguish between ‘small’, ‘medium’ and ‘high’ dimensional problems depending on the following relations between d and \(n_{\max }\):

  1. (S)

    small dimensions: \(n_{\min } \ge 2^d\), \(n_{\max } \gg 2^d\) (hence, \(\log _2 n_{\max } \gg d\));

  2. (M)

    medium dimensions: \(n_{\max } \) is comparable to \( 2^d\): \(c_1 d \le \log _2 n_{\max } \le c_2 d\) with suitable constants \(c_1\) and \(c_2\): \(0 \ll c_1 \le 1 \le c_2 \ll \infty \);

  3. (H)

    high dimensions: \(n_{\max } \ll 2^d\).

Of course, there are in-between situations, and the classification above depends on the cost of function evaluations. In the case of inexpensive evaluations and \(10^3 \le n_{\max } \le 10^6 \), typical values of d in the three cases are: (S): \(d \le 10\); (M): \(10 \le d \le 20\); (H): \(d>20\). Values of \(d \approx 10 \) are borderline cases between (S) and (M), whereas values of \(d \approx 20 \) are borderline cases between (M) and (H).

In this study, we leave out the situation (S) of small dimensions and concentrate on situations (M) and (H). The reasons why we are not interested in the situation (S) of small dimensions are: (a) there are too many exploration schemes available in the literature for small dimensions, and (b) we are interested in situations where the asymptotic regime is out of reach, and these are the situations (M) and (H).

In all considerations below we assume that the aim of the exploration stage is to reach a neighbourhood of an unknown point \(x_* \in {\mathcal {X}}\) with high probability \(\ge 1-\gamma \) (with some \(\gamma >0\)). We assume that \(x_*\) is uniformly distributed in \({\mathcal {X}}\) and by a neighbourhood of \(x_*\) we mean the ball \(B={{\mathcal {B}}}(x_*,\varepsilon )\) with suitable \(\varepsilon >0\). In other words, we will be interested in the problem of construction of partial coverings defined as follows.

Let \(x_1, \ldots , x_n\) be some points in \(\mathbb {R}^d\). Denote \(X_n= \{x_1, \ldots , x_n\}\) and

$$\begin{aligned} B(X_n,r)= \bigcup _{i=1}^n B(x_i,r)\,, \end{aligned}$$
(1.1)

where \(r>0\) is the radius of the balls \({{\mathcal {B}}}(x_i,r)\) and \(B(x_i,r) = {\mathcal {X}}\cap {{\mathcal {B}}}(x_i,r)\). We will call \(B(X_n,r)\) a partial (or weak) covering of \({\mathcal {X}}\) of level \(1-\gamma \) if \(\textrm{vol}(B(X_n,r))/\textrm{vol}({\mathcal {X}})\ge 1-\gamma \).

If \(\gamma =0\) then \(B(X_n,r)\) would form a full (strong) covering of \({\mathcal {X}}\). As demonstrated in [7,8,9], for any n and any given \(\gamma >0\), one can construct partial coverings of \({\mathcal {X}}\) with significantly smaller radii r than for the case \(\gamma =0\) (assuming that d is not too small). This is the main reason why we are not interested in strong coverings. The second reason is that numerically checking whether the set (1.1) forms a full covering (for a generic \(X_n\)) is extremely hard in situations (M) and (H), whereas simple Monte Carlo gives very accurate estimates of \(\gamma \) for partial coverings, even for very high dimensions. For a short discussion concerning full covering and its role in optimization, see Sect. 2.1.
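
As noted above, the level \(1-\gamma \) of a partial covering is easy to estimate by Monte Carlo even in high dimensions. The following is a minimal sketch (in Python, assuming numpy; the point set and parameters are illustrative) of such an estimate of \(\textrm{vol}(B(X_n,r))\) for \({\mathcal {X}}=[0,1]^d\).

```python
# Monte Carlo estimate of the covering level vol(B(X_n, r)) for X = [0,1]^d.
# A sketch assuming numpy; the point set X_n is stored as an array of shape (n, d).
import numpy as np

def covering_level(points, r, n_test=100_000, batch=2_000, seed=0):
    """Fraction of [0,1]^d lying within distance r of the nearest point of X_n."""
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    covered = 0
    for start in range(0, n_test, batch):
        m = min(batch, n_test - start)
        u = rng.random((m, d))                                  # uniform test points
        dist = np.min(np.linalg.norm(u[:, None, :] - points[None, :, :], axis=2), axis=1)
        covered += np.count_nonzero(dist <= r)
    return covered / n_test

rng = np.random.default_rng(1)
X_n = rng.random((1000, 10))                     # 1000 uniform points, d = 10
print(covering_level(X_n, r=0.5))                # estimate of 1 - gamma, roughly 0.4 here
```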

The main technique for the construction of partial coverings will be the generation of independent random points \(x_1, \ldots , x_n\) in \({\mathcal {X}}\) with \(x_j \sim P\), where P is a distribution concentrated either on the whole of \({\mathcal {X}}\) or on a subset of \({\mathcal {X}}\). It follows from Proposition 3.2.3 in [1] that using points outside \({\mathcal {X}}\) for the construction of coverings is not beneficial when \({\mathcal {X}}\) is convex, and hence we will always assume that \(x_j \in {\mathcal {X}}\) for all j.

The following are the main messages of the paper.

  1. 1.

    Classical results on convergence rates of GRS algorithms are based on the asymptotic properties of random points uniformly distributed in \({\mathcal {X}}\); see Sect. 2. In the non-asymptotic regime, however, these results give estimates on the convergence rates which are far too optimistic. We show in Sect. 3 that for medium and high dimensions, the actual convergence rate of GRS algorithms is much slower.

  2. 2.

    The usually recommended sampling schemes (these schemes are based on the asymptotic properties of random points) are inefficient in the non-asymptotic regime. In particular, as shown in Sect. 4, uniform sampling on the entire \({\mathcal {X}}\) is much less efficient than uniform sampling on a suitable subset of \({\mathcal {X}}\) (we will refer to this phenomenon as the ‘\(\delta \)-effect’).

  3. 3.

    In situations (M) and (H), the effect of replacement of random points by low-discrepancy sequences is negligible; see Sect. 4.2.

We also make certain practical recommendations concerning the best exploration schemes in the situations (M) and (H) in the case \({\mathcal {X}}=[0,1]^d\). Our main recommendations concern the situation (M) of medium dimensions, which we consider the hardest for analysis. The situation (H) is simpler than (M) in the sense that the optimization problems in case (H) are so hard that the very simple space-filling schemes outlined in Sect. 6 already provide relatively effective exploration.

The structure of the paper is as follows. In Sect. 2, which contains no new results, we discuss the importance of covering and review classical results on convergence and rate of convergence of general GRS algorithms. The purpose of Sect. 3 is to demonstrate that for medium and high dimensions the asymptotic regime is unachievable, and hence the actual convergence rate of GRS algorithms is much slower than the classical estimates of the rate of convergence indicate. In Sect. 4 we compare several exploration strategies and show that standard recommendations (such as “use a low-discrepancy sequence”) are inaccurate for medium and high dimensions. In Sect. 5, we develop accurate approximations for the volume of the intersection of a cube and a ball (with arbitrary centre and any radius). The approximations of Sect. 5 are used throughout the numerical studies of Sects. 3 and 4. In Sect. 6 we summarize our findings and give recommendations on how to perform exploration of \({\mathcal {X}}\) in medium and high dimensions.

2 Importance of covering and classical results on convergence and rate of convergence of GRS algorithms

2.1 Covering radius

Consider \(X_n=\{x_1, \ldots , x_n\} \), a set of n points in \({\mathcal {X}}\). The covering radius of \({\mathcal {X}}\) for \(X_n\) is \(\textrm{CR} (X_n) = \max _{x\in {\mathcal {X}}} \rho (x,X_n), \) where

$$\begin{aligned} \rho (x,X_n)= \min _{x_j\in X_n} \rho (x,x_j)\, \end{aligned}$$
(2.1)

is the distance between a point \(x \in {\mathcal {X}}\) and the point set \(X_n\). The covering radius is also the smallest \(r \ge 0\) such that the union of the balls with centres at the \(x_j \in X_n\) and radius r fully covers \({\mathcal {X}}\); that is, \(\textrm{CR} (X_n)= \min \{ r>0:\; {\mathcal {X}}\subseteq {{\mathcal {B}}} (X_n,r) \}\,, \) where \({{\mathcal {B}}} (X_n,r)= \bigcup _{j=1}^n {{\mathcal {B}}} (x_j,r)\) and \({{\mathcal {B}}} (x,{ r })= \{ z \in \mathbb {R}^d:\; \rho (x,z) \le { r } \}\) is the ball of radius r and centre \(x\in \mathbb {R}^d\). An optimal n-point covering is a point set \(X_n^*\) such that \( \textrm{CR}(X_n^*)=\min _{X_n} \textrm{CR}(X_n). \) Most of the general considerations in the paper are valid for a general distance \(\rho \), but all numerical studies are conducted for the Euclidean distance only. We will thus assume that the distance \(\rho \) is Euclidean.
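
For a given \(X_n\), the covering radius can be approximated numerically by maximising \(\rho (x,X_n)\) over a large set of test points; the sketch below (assuming numpy, with illustrative parameters) gives a simple Monte Carlo lower estimate of \(\textrm{CR}(X_n)\) for \({\mathcal {X}}=[0,1]^d\).

```python
# Monte Carlo lower estimate of the covering radius CR(X_n) for X = [0,1]^d:
# maximise the distance rho(x, X_n) over many uniform test points.
# A sketch assuming numpy.
import numpy as np

def covering_radius_mc(points, n_test=200_000, batch=2_000, seed=0):
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    cr = 0.0
    for start in range(0, n_test, batch):
        m = min(batch, n_test - start)
        u = rng.random((m, d))
        dist = np.min(np.linalg.norm(u[:, None, :] - points[None, :, :], axis=2), axis=1)
        cr = max(cr, float(dist.max()))
    return cr

rng = np.random.default_rng(0)
print(covering_radius_mc(rng.random((1000, 10))))   # CR of 1000 random points in [0,1]^10
```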

Other common names for the covering radius are: fill distance (in approximation theory; see [18, 23]), dispersion (in Quasi Monte Carlo; see [6, Ch. 6]), minimax-distance criterion (in computer experiments; see [16, 17]) and coverage threshold (in probability theory; see [11]).

Point sets with small covering radius are very desirable in the theory and practice of global optimization and in many branches of numerical mathematics. In particular, the celebrated results of A. G. Sukharev imply that any n-point optimal covering design \(X_n^*\) provides: (a) the min-max n-point global optimization method in the set of all adaptive n-point optimization strategies, see [20] and [21, Ch.4,Th.2.1], (b) the worst-case n-point multi-objective global optimization method in the set of all adaptive n-point algorithms, see [29], and (c) the n-point min-max optimal quadrature, see [21, Ch.3,Th.1.1]. In all three cases, the class of (objective) functions is the class of Lipschitz functions, and the optimality of the design is independent of the value of the Lipschitz constant. Sukharev’s results on n-point min-max optimal quadrature formulas have been generalized in [10] to functional classes different from the class of Lipschitz functions; see also formula (2.3) in [2].

2.2 Convergence of a general GRS algorithm

Consider the general problem of continuous global minimization \(f(x)\!\rightarrow \! {\text {min}}_{x\in {\mathcal {X}}}\). Assume that \(f_*=\inf _{x \in {\mathcal {X}}} f(x)>-\infty \) and \(f(\cdot )\) is continuous at all points \(x \in W (\delta )\) for some \(\delta >0\), where \(W (\delta )\!=\!\left\{ x\in {\mathcal {X}}:f(x)\!-\!f_* \! \leqslant \!\delta \right\} \). That is, we assume that \(f(\cdot )\) is continuous in the neighbourhood of the set \({\mathcal {X}}_*=\left\{ x_* \in {\mathcal {X}}:f(x_* )=f_* \right\} \) of global minimizers of \(f(\cdot )\), which is non-empty but may contain more than one point \(x_*\). To avoid technical difficulties, we assume that there are only a finite number of global minimizers of \(f(\cdot )\); that is, the set \({\mathcal {X}}_*\) is finite.

Consider a general GRS algorithm producing a sequence of random points \(x_1,x_2,\ldots \), where each point \(x_j \in {\mathcal {X}}\) has some probability distribution \(P_j\) (we write this \(x_j \sim P_j\)), where for \(j>1\) the distributions \(P_j\) may depend on the previous points \(x_1,\ldots ,x_{j-1}\) and on the results of the objective function evaluations at these points (the function evaluations may not be noise-free). We say that this algorithm converges if for any \(\delta \!>\!0\), the sequence of points \(x_j\) arrives at the set \(W (\delta )\!=\!\left\{ x\in {\mathcal {X}}:f(x)\!-\!f_* \! \leqslant \!\delta \right\} \) with probability one. If the objective function is evaluated without error then this obviously implies convergence (as \(n \rightarrow \infty \)) of record values \(f_{\textrm{o},j}=\min _{i=1\ldots j} f(x_i)\) to \(f_*\) with probability 1.

In view of the continuity of \(f(\cdot )\) in the neighbourhood of \({\mathcal {X}}_*\), the event that the sequence of points \(x_j\) arrives at the set \(W (\delta )\) with given \(\delta >0\) is equivalent to the arrival of this sequence at the set \(B_*(\varepsilon )= \cup _{x_* \in {\mathcal {X}}_*} B(x_*,\varepsilon ) \) for some \(\varepsilon >0\) depending on \(\delta \).

Conditions on the distributions \(P_j\) (\(j=1,2,\ldots \)) ensuring convergence of GRS algorithms are well understood; see, for example, [15, 19, 28] and [25, Sect. 3.2]. Such results are consequences of the classical ‘zero–one law’ of probability theory and of the Borel–Cantelli lemmas (see e.g. [4, Section 7.3]) and provide sufficient conditions for convergence. We follow [27, Theorem 2.1] to provide the most general sufficient conditions for convergence of GRS algorithms.

Theorem 1

Consider a GRS algorithm with \(x_j\sim P_j\) and let \(B \subset {\mathcal {X}}\) be a Borel subset of \({\mathcal {X}}\). Assume that

$$\begin{aligned} \sum _{j=1}^\infty q_j(B)=\infty \,, \end{aligned}$$
(2.2)

where \(q_j(B)= \inf P_j(B)\) and the infimum is taken over all locations of previous points \(x_i\) (\(i=1, \ldots , j-1\)) and corresponding results of evaluations of \(f(\cdot )\). Then the sequence of points \(\{x_1, x_2, \ldots \}\) falls infinitely often into the set B, with probability 1.

Note that Theorem 1 does not make any assumptions about observations of \(f(\cdot )\) and hence is valid for the very general case where evaluations of the objective function \(f(\cdot )\) are noisy and the noise is not necessarily random.

Consider the following three particular cases.

  1. (a)

    If in (2.2) we use \(B=B_*(\varepsilon )\) or \(B=W (\varepsilon )\) with some \(\varepsilon >0\), then Theorem 1 gives a sufficient condition for the corresponding GRS algorithm to converge; that is, there exists a subsequence \(\{x_{i_j}\}\) of the sequence \(\{x_{j}\}\) which converges (with probability 1) to the set \({\mathcal {X}}_*\) in the sense that the distance between \(x_{i_j}\) and \({\mathcal {X}}_*\) tends to 0 as \(j \rightarrow \infty \). For this subsequence \(\{x_{i_j}\}\), we have \(f(x_{i_j}) \rightarrow f_*\) as \(j \rightarrow \infty \). If the evaluations of \(f(\cdot )\) are noise-free, then we can use the sequence of record points (that is, the points where the records \(f_{\textrm{o},j}= \min _{\ell \le j} f(x_\ell ) \) are attained) as \(\{x_{i_j}\}\); in this case, \(f(x_{i_j})=f_{\textrm{o},j}\) is the sequence of records converging to \(f_*\) with probability 1. By the dominated convergence theorem (see e.g. [4, Section 7.2]), convergence of the sequence of records \(f_{\textrm{o},j}\) to \(f_*\) with probability 1 implies other important types of convergence of \(f_{\textrm{o},j}\) to \(f_*\), namely in mean and in mean square: \( E f_{\textrm{o},j} \rightarrow f_* \) and \( E (f_{\textrm{o},j}-f_*)^2 \rightarrow 0 \) as \( j \rightarrow \infty .\)

  2. (b)

    If (2.2) holds for \(B=B(x,\varepsilon ) \) with any \(x\in {\mathcal {X}}\) and any \(\varepsilon >0\), then Theorem 1 gives a sufficient condition that the sequence of points \(\{x_1, x_2, \ldots \}\) is dense with probability 1. As this is a stronger sufficient condition than in (a), all conclusions of (a) are valid.

  3. (c)

    If we use pure random search (PRS) with \(P=P_U\), the uniform distribution on \({\mathcal {X}}\) (that is, \(P_j=P_U\) for all j and the points \(x_1, x_2, \ldots \) are independent), then the assumption that \({\mathcal {X}}\) is convex implies \(P_U(B(x,\varepsilon ))\ge \textrm{const}_\varepsilon >0\) for all \(x\in {\mathcal {X}}\) and any \(\varepsilon >0\), and therefore the condition (2.2) trivially holds for any \(B=B(x,\varepsilon ) \), as in (b) above. In practice, the usual choice of the distribution \(P_j\) is

    $$\begin{aligned} P_{j}=\alpha _{j}P_U+(1-\alpha _{j})Q_{j}\,, \end{aligned}$$
    (2.3)

    where \(0\leqslant \alpha _{j}\leqslant 1\) and \(Q_j\) is a specific probability measure on \({\mathcal {X}}\) which may depend on previous evaluations of the objective function. Sampling from the distribution (2.3) corresponds to taking a uniformly distributed random point in \({\mathcal {X}}\) with probability \(\alpha _{j}\) and sampling from \(Q_j\) with probability \(1-\alpha _{j}\). For distributions of the form (2.3), the condition \(\sum _{j=1}^\infty \alpha _j=\infty \) guarantees (2.2) for all \(B=B(x,\varepsilon ) \), and therefore the GRS algorithm with such \(P_j\) is theoretically convergent; a minimal sampling sketch is given below.
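
The following sketch (in Python, assuming numpy) shows one step of sampling from the mixture (2.3) with \(\alpha _j=1/j\); the exploitation measure \(Q_j\) used here, a normal perturbation of the current record point clipped to \([0,1]^d\), is purely illustrative.

```python
# One step of sampling from the mixture (2.3) with alpha_j = 1/j.
# The measure Q_j used here (a normal perturbation of the current record point,
# clipped to [0,1]^d) is only an illustration; any Q_j may be substituted.
import numpy as np

def sample_step(j, x_record, sigma=0.05, rng=np.random.default_rng()):
    d = x_record.size
    alpha_j = 1.0 / j                     # sum_j alpha_j = infinity, so (2.2) holds
    if rng.random() < alpha_j:
        return rng.random(d)              # exploration: uniform point in [0,1]^d
    return np.clip(x_record + sigma * rng.standard_normal(d), 0.0, 1.0)   # draw from Q_j
```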

2.3 Rate of convergence

Consider first a PRS algorithm, where \(x_j\) are i.i.d. with distribution P. Let \(\varepsilon , \delta >0\) be fixed and B be the target set we want to hit by points \(x_1,x_2, \ldots \). For example, we set \(B=W (\delta )=\{x\in {\mathcal {X}}:f(x)-f_*\leqslant \delta \}\) in the case when the accuracy is expressed in terms of closeness with respect to the function value, \(B=B(x_*,\varepsilon )\) if we are studying convergence towards a particular global minimizer \(x_*\), and \(B=B_*(\varepsilon )\) if the aim is to approach a neighbourhood of \({\mathcal {X}}_*\).

Assume that P is such that \(P(B)\!>\!0\). In particular, if \(P=P_U\) is the uniform probability measure on \({\mathcal {X}}\), then, as \({\mathcal {X}}\) has Lipschitz boundary, we have \(P(B)= \textrm{vol}(B) / \textrm{vol}({\mathcal {X}}) \!>\!0\). Note that in all interesting instances the value P(B) is positive but small, and this will be assumed below.

Define Bernoulli trials in which success in trial j means \(x_j \in B\). PRS generates a sequence of independent Bernoulli trials with common success probability \( \textrm{Pr}\{x_j\in B\}=P(B)\). In view of the independence of \(x_1,x_2, \ldots \), we have

$$\begin{aligned} \text {Pr}\{x_1\notin B,\ldots ,x_n\notin B\}= \left( 1-P(B)\right) ^n \end{aligned}$$

and therefore the probability

$$\begin{aligned} \text {Pr}\{x_j\in B\text { for at least one }j,\; 1\leqslant j\leqslant n \}= 1-\left( 1-P(B)\right) ^n \end{aligned}$$

tends to one as \(n\rightarrow \infty \).

Let \(n_\gamma \) be the number of points which are required for PRS to reach the set B with probability at least \( 1-\gamma \), where \(\gamma \in (0,1)\); that is,

$$\begin{aligned} n_\gamma = \min \{ n: \; 1-\left( 1-P(B)\right) ^n\geqslant 1-\gamma \}\,. \end{aligned}$$

Solving the inequality \(1-\left( 1-P(B)\right) ^n\geqslant 1-\gamma \) for n, we obtain

$$\begin{aligned} n_\gamma = \bigg \lceil {\ln \gamma }/{\ln \left( 1-P(B)\right) } \bigg \rceil \, \cong {(- \ln \gamma })/{P(B)} \end{aligned}$$
(2.4)

as P(B) is small and \(\ln \left( 1-P(B)\right) \cong -P(B)\) for small P(B).

The numerator \(- \ln \gamma \) in the expression (2.4) for \(n_\gamma \) depends on \(\gamma \) but it is not large; for example, \(-\ln \gamma \simeq 4.605\) for \(\gamma =0.01\). However, the denominator P(B) (depending on \(\varepsilon \), d and the shape of \({\mathcal {X}}\)) can be very small.

Assuming that \(B=B(x_*,\varepsilon )\), where the norm is standard Euclidean, and B is fully inside \({\mathcal {X}}\), we have

$$\begin{aligned} \textrm{vol} (B(x_*,\varepsilon ) )=\textrm{vol} ({{\mathcal {B}}}(x_*,\varepsilon ))= V_d \, \varepsilon ^d , \end{aligned}$$
(2.5)

where \(V_d= {\pi }^{d/2} / \left[ \Gamma (d/2\!+\!1)\right] \, \) is the volume of the unit Euclidean ball \({{\mathcal {B}}}(0,1)\) and \(\Gamma (\cdot )\) is the gamma function. The resulting version of the expression (2.4) for \(n_\gamma \) in the case \(B=B(x_*,\varepsilon )\) and \(\textrm{vol}({\mathcal {X}})=1\) becomes

$$\begin{aligned} n_{\gamma }^{\textrm{as}} = {- \ln \gamma }/\left( \varepsilon ^d V_d\right) . \end{aligned}$$
(2.6)

As \(\varepsilon \rightarrow 0\), the ball \(B=B(x_*,\varepsilon )\) lies fully inside \({\mathcal {X}}\) for \(P_U\)-almost all \(x_*\). Indeed, asymptotically, as \(n \rightarrow \infty \), the covering radius computed for uniformly distributed random points \(x_j\) tends to 0, and hence the equality (2.5) is valid asymptotically for almost all \(x_*\). This is the reason for the superscript ‘as’ in (2.6). As shown below in Sect. 3, in the non-asymptotic regime in situations (M) and (H), the volume \(\textrm{vol} (B(x_*,\varepsilon ))\) is necessarily smaller than given by (2.5) and therefore the true \(n_\gamma \) is (much) larger than \(n_{\gamma }^{\textrm{as}}\) in (2.6).

Consider now general GRS algorithms where the probabilities \(P_j\) are chosen in the form (2.3), with the coefficients \(\alpha _j\) satisfying \(\sum _{j=1}^\infty \alpha _j=\infty \) so that the condition (2.2) holds. Instead of the equality \(\text {Pr}\{x_j\in B\}=P(B)\) for all \(j\geqslant 1\), we now have the inequality \(\text {Pr}\{x_j\in B\}\geqslant \alpha _j P_U(B),\) where the equality holds in the worst-case scenario. We define \({n(\gamma )}\) as the smallest integer such that the inequality \(\sum _{j=1}^{n(\gamma )} \alpha _j \geqslant - {\ln \gamma }/{P_U(B)}\, \) is satisfied. For the choice \(\alpha _j =1/j\), which is a common recommendation, we can use the approximation \(\sum _{j=1}^{n} \alpha _j \simeq \ln n \). Therefore we obtain \(n(\gamma ) \simeq \exp \{ - {\ln \gamma }/{P_U(B)} \}\). For the case of \({\mathcal {X}}=[0,1]^d\) and \(B=B(x_*,\varepsilon )\), we obtain \(n(\gamma ) \simeq \exp \{ c \cdot \varepsilon ^{-d} \}\), where \(c = ( - {\ln \gamma })/V_d\).

Note also that if the distance between \(x_*\) and the boundary of \({\mathcal {X}}\) is smaller than \(\varepsilon \), then the constant c and hence \(n(\gamma ) \) are even larger. For example, for \(\gamma =0.1\), \(d=10\) and \(\varepsilon =0.1\), \(n(\gamma )\) is larger than \(10^{ 1000000000}\). Even for optimization problems in a small dimension \(d=3\), and for \(\gamma =0.1\) and \(\varepsilon =0.1\), the number \(n(\gamma )\) of points required for the GRS algorithm to hit the set B in the worst-case scenario is huge: \(n(\gamma ) \simeq 10^{238}\).
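
These orders of magnitude are straightforward to reproduce; the short sketch below (plain Python, standard library only) evaluates \(n_{\gamma }^{\textrm{as}}\) from (2.6) and the worst-case count \(n(\gamma )\simeq \exp \{c\,\varepsilon ^{-d}\}\) on a \(\log _{10}\) scale.

```python
# log10 of the asymptotic count (2.6) and of the worst-case count
# n(gamma) ~ exp{ -ln(gamma) * eps^{-d} / V_d } for alpha_j = 1/j.
# A sketch using only the Python standard library.
import math

def vol_unit_ball(d):
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def log10_n_asymptotic(gamma, eps, d):
    return math.log10(-math.log(gamma)) - d * math.log10(eps) - math.log10(vol_unit_ball(d))

def log10_n_worst_case(gamma, eps, d):
    c = -math.log(gamma) / vol_unit_ball(d)
    return c * eps ** (-d) / math.log(10)

print(log10_n_asymptotic(0.1, 0.1, 10))   # about 10: n_gamma^as is of order 10^10
print(log10_n_worst_case(0.1, 0.1, 3))    # about 239: n(gamma) ~ 10^238
print(log10_n_worst_case(0.1, 0.1, 10))   # about 3.9e9: n(gamma) exceeds 10^{10^9}
```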

3 Points uniformly distributed on \({\mathcal {X}}\)

3.1 Asymptotic case

In this section, the point set \(X_n=\{x_1, \ldots , x_n\}\) consists of the first n points of a sequence \(X_\infty =\{x_1, x_2, \ldots \}\) of independent uniformly distributed random vectors in \( {\mathcal {X}}\). Assume, without loss of generality, that \(\textrm{vol}({\mathcal {X}})=1\).

Consider the random variable \(\rho (U,X_n)\), the distance between U (a uniform random point in \({\mathcal {X}}\)) and \(X_n\); see (2.1) for the definition of \(\rho \). The cdf (cumulative distribution function) of \(\rho (U,X_n)\) gives the average proportion of \({\mathcal {X}}\) covered by the balls of radius r centred at the points of \(X_n\). That is,

$$\begin{aligned} F_d(r,X_n):= \textrm{Pr}(\rho (U,X_n)\le r ) ={\mathbb {E}_{X_n}\textrm{vol}(B(X_n,r))} , \end{aligned}$$
(3.1)

where the set \( B(X_n,r)\) is defined in (1.1). In asymptotic considerations, we need to suitably normalize the radius (which tends to zero as \(n \rightarrow \infty \)) in (3.1). We thus consider the following sequence of cdf’s:

$$\begin{aligned} F_n(t):=\textrm{Pr}( n^{1/d} V_d^{1/d} \rho (U,X_n) \le t )= F_d\left( [nV_d]^{-1/d}\, t,X_n \right) . \end{aligned}$$
(3.2)

Lemma 1

$$\begin{aligned} F_n(t) \rightarrow F(t):= 1-\exp (-t^{d}) \;\; \hbox { as } n\rightarrow \infty , \end{aligned}$$
(3.3)

where the convergence is uniform in t and cdf’s \(F_n\) are defined in (3.2).

The statement of Lemma 1 follows from Zador’s arguments in his fundamental paper [24]; see the beginning of page 142. The key observation of Zador is that asymptotically, as \(n \rightarrow \infty \), the covering radius computed for uniformly distributed random points \(x_j\) tends to 0, and hence the equality (2.5) is valid asymptotically for almost all U; this is formula (19) in [24]. The statement of Lemma 1 is in fact a particular case of Theorem 9.1 in [3], if Q is chosen as the uniform distribution on \({\mathcal {X}}\).

In what follows, we will need the \((1-\gamma )\)-quantile (\(0< \gamma <1\)) of the cdf F in the rhs of (3.3). This \((1-\gamma )\)-quantile is determined as \(t_{1-\gamma }=[-\log (\gamma )]^{1/d}\), for which we have \(F(t_{1-\gamma })=1-\gamma \). The quantity \(t_{1-\gamma }\) can be interpreted as the normalised asymptotic radius required for covering a subset of \({\mathcal {X}}\) of volume \(1-\gamma \) (the partial covering introduced in Sect. 1). For very small \(\varepsilon \), to cover a subset of \({\mathcal {X}}\) of volume approximately \(1-\gamma \) by balls of radius \(\varepsilon \) with random centres \(x_j \in X_n\), the number of points \(n=n_{\gamma }\) should satisfy

$$\begin{aligned} n_{\gamma } = \frac{t_{1-\gamma }^d}{\varepsilon ^{d}V_d}= \frac{-\ln (\gamma )}{\varepsilon ^dV_d} , \end{aligned}$$
(3.4)

which coincides with (2.6). The above result can be reformulated in terms of the asymptotic radius r as follows: for very large n the union of n balls with random centers \(x_j \in X_n\) and radius

$$\begin{aligned} r_{n,1-\gamma } = n^{-1/d}{V}_d^{-1/d}t_{1-\gamma } = n^{-1/d}{V}_d^{-1/d}[-\log (\gamma )]^{1/d} \, \end{aligned}$$
(3.5)

covers a subset of \({\mathcal {X}}\) of volume which is approximately \(1-\gamma \).
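
The asymptotic radius (3.5) is easy to evaluate; a short sketch (plain Python, standard library only) follows.

```python
# The normalised asymptotic radius r_{n,1-gamma} of (3.5) for the cube [0,1]^d.
# A sketch using only the Python standard library.
import math

def r_asymptotic(n, d, gamma):
    V_d = math.pi ** (d / 2) / math.gamma(d / 2 + 1)     # volume of the unit ball
    return (-math.log(gamma) / (n * V_d)) ** (1.0 / d)

print(r_asymptotic(1000, 10, 0.1))   # about 0.50 (cf. the cross in Fig. 3)
print(r_asymptotic(1000, 20, 0.1))   # about 0.89 (cf. the cross in Fig. 5)
```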

In the non-asymptotic (finite n) regime, the distribution function \(F_d(r,X_n)\) of (3.1) can be obtained in the following way (below, the points \(x_1, x_2, \ldots , x_n\) of \(X_n=\{x_1, \ldots , x_n\}\) are not necessarily uniform but are i.i.d.).

Conditionally on U, we have for fixed \(U \in {{\mathcal {X}}}\):

$$\begin{aligned} \mathbb {P} \left\{ U \in {{\mathcal {B}}}_d(X_n,r) \right\}= & {} 1-\prod _{j=1}^n \mathbb {P} \left\{ U \notin {{\mathcal {B}}}_d({x}_j,r) \right\} \nonumber \\= & {} 1-\prod _{j=1}^n\left( 1-\mathbb {P} \left\{ U \in {{\mathcal {B}}}_d({x}_j,r) \right\} \right) \nonumber \\= & {} 1-\bigg (1-\mathbb {P}_{X} \left\{ \Vert U - {X} \Vert \le r \right\} \bigg )^n\,, \end{aligned}$$
(3.6)

where X has the same distribution as \(x_1\). From (3.6), the distribution function \(F_d(r,X_n)\) can be obtained by averaging over the distribution of U:

$$\begin{aligned} F_d(r,X_n)= \mathbb {E}_{_U} \mathbb {P} \left\{ U \in {{\mathcal {B}}}_d(X_n,r) \right\} \,. \end{aligned}$$
(3.7)

For large n and small r we use the approximate equality \(\mathbb {P}_{X} \left\{ \Vert U - {X} \Vert \le r \right\} \simeq r^{d}V_d \) in (3.6). With this approximation, the averaging with respect to U becomes redundant and we arrive at the results of Sect. 2.3. If n is not so large, the quantity \(\mathbb {P}_{X} \left\{ \Vert U - {X} \Vert \le r \right\} \) has to be approximated by other means. This will be discussed in Sect. 5.
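
A direct Monte Carlo evaluation of (3.6)–(3.7) is straightforward; the sketch below (assuming numpy, with illustrative sample sizes) estimates \(\mathbb {P}_{X} \{\Vert U-X\Vert \le r\}\) from an inner sample for each sampled U and then averages \(1-(1-p(U))^n\).

```python
# Monte Carlo estimate of F_d(r, X_n) via (3.6)-(3.7): for each sampled target U,
# estimate p(U) = P_X{||U - X|| <= r} from an inner sample of X, then average
# 1 - (1 - p(U))^n over U. A sketch assuming numpy.
import numpy as np

def F_d(r, n, d, n_outer=1_000, n_inner=20_000, seed=0):
    rng = np.random.default_rng(seed)
    vals = np.empty(n_outer)
    for i in range(n_outer):
        u = rng.random(d)                                # U ~ uniform on [0,1]^d
        x = rng.random((n_inner, d))                     # inner sample of X
        p = np.mean(np.linalg.norm(x - u, axis=1) <= r)
        vals[i] = 1.0 - (1.0 - p) ** n
    return float(vals.mean())

print(F_d(r=0.5, n=1000, d=10))   # well below the asymptotic value 0.9 (cf. Fig. 3)
```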

3.2 Bounds for \(F_d(r,X_n)\)

Evaluating the expectation in (3.7) is difficult, but simple bounds can be obtained by applying Jensen’s inequality. Here we will focus attention on the case \({\mathcal {X}}=[0,1]^d\) and \(X_n=\{x_1, \ldots , x_n\}\), where \(x_1, x_2, \ldots \) is a sequence of uniformly distributed random vectors on \({\mathcal {X}}\). From (3.7), we have

$$\begin{aligned} \mathbb {E}_{_U} \mathbb {P} \left\{ U \in {{\mathcal {B}}}_d(X_n,r) \right\} = 1- \mathbb {E}_{_U} \left[ \left( 1-\mathbb {P}_{X} \left\{ \Vert U - {X} \Vert \le r \right\} \right) ^n \right] \,. \end{aligned}$$

An immediate use of Jensen’s inequality yields the bound:

$$\begin{aligned} \mathbb {E}_{_U} \mathbb {P} \left\{ U \in {{\mathcal {B}}}_d(X_n,r) \right\} \le 1- \left( 1-\mathbb {P}_{X} \left\{ \Vert \varvec{1/2} - {X} \Vert \le r \right\} \right) ^n \,. \end{aligned}$$
(3.8)

Here and below \({ \textbf{a}}=(a,a,\ldots , a) \in \mathbb {R}^d\) for any a. However, using the fact that \(\mathbb {P}_{X} \left\{ \Vert U - {X} \Vert \le r \right\} = \mathbb {P}_{Z} \left\{ \Vert Z - {X} \Vert \le r \right\} \), where Z is a uniform random vector on \([1/2,1]^d\), we can apply Jensen’s inequality again to obtain:

$$\begin{aligned} \mathbb {E}_{_U} \mathbb {P} \left\{ U \in {{\mathcal {B}}}_d(X_n,r) \right\} \le 1- \left( 1-\mathbb {P}_{X} \left\{ \Vert \varvec{3/4}-{X} \Vert \le r \right\} \right) ^n \,. \end{aligned}$$
(3.9)

The forms of the bounds (3.8) and (3.9) suggest that an approximation of the following form may be useful:

$$\begin{aligned} \mathbb {E}_{_U} \mathbb {P} \left\{ U \in {{\mathcal {B}}}_d(X_n,r) \right\} \simeq 1- \left( 1-\mathbb {P}_{U,X} \left\{ \Vert U-{X} \Vert \le r \right\} \right) ^n \,. \end{aligned}$$
(3.10)

Here, instead of fixing U at \(\varvec{1/2}\) or \(\varvec{3/4}\), U is a uniform random vector on \([0,1]^d\). The probability \(\mathbb {P}_{U,X} \left\{ \Vert U-{X} \Vert \le r \right\} \) can be interpreted as the average volume of the intersection of the cube \([0,1]^d\) with a ball of radius r whose centre U is random. For different d and r, the distribution of \(\mathbb {P}_{X} \left\{ \Vert U-{X} \Vert \le r \right\} \), normalised by the volume \(r^dV_d\) of the ball, is shown in Figs. 9 and 10.
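
The bounds (3.8), (3.9) and the approximation (3.10) only require single-ball probabilities, which are easy to estimate; a sketch (assuming numpy, with illustrative sample sizes) follows.

```python
# Monte Carlo evaluation of the bounds (3.8), (3.9) and the approximation (3.10)
# for X uniform on [0,1]^d. A sketch assuming numpy.
import numpy as np

def p_ball(center, r, n_mc=200_000, seed=0):
    """Estimate P_X{||center - X|| <= r} for X uniform on [0,1]^d."""
    rng = np.random.default_rng(seed)
    x = rng.random((n_mc, center.size))
    return np.mean(np.linalg.norm(x - center, axis=1) <= r)

def bounds_and_approx(r, n, d, n_mc=200_000, seed=1):
    rng = np.random.default_rng(seed)
    p_half = p_ball(np.full(d, 0.5), r, n_mc)            # U fixed at 1/2: bound (3.8)
    p_34 = p_ball(np.full(d, 0.75), r, n_mc)             # U fixed at 3/4: bound (3.9)
    u, x = rng.random((n_mc, d)), rng.random((n_mc, d))
    p_ux = np.mean(np.linalg.norm(u - x, axis=1) <= r)   # both random: approximation (3.10)
    to_cdf = lambda p: 1.0 - (1.0 - p) ** n
    return to_cdf(p_half), to_cdf(p_34), to_cdf(p_ux)

print(bounds_and_approx(r=0.5, n=1000, d=10))
```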

3.3 Numerical studies

In this section, we demonstrate one of the key messages of the paper: in high dimensions, the asymptotic regime is not attainable for reasonable values of n, and consequently the asymptotic results produce poor approximations unless n is astronomically large.

In Fig. 1, we plot \(F_d(r,X_n)\) as a function of d for \(n=1000\) (blue plusses) and \(n=10000\) (black circles). For each value of d, the radius r is chosen according to the asymptotic result (3.5) with \(1-\gamma =0.9\); this level is shown by the solid red line at 0.9. We see that, even for values of n that would normally be deemed large, \(F_d(r,X_n)\) is significantly smaller than 0.9 and quickly tends to zero as d grows.

The big difference between the asymptotic and the finite-n regime is further illustrated in Figs. 2–8. In these figures, the solid black line depicts \(F_d(r,X_n)\) as a function of r for the values of d and n provided in the caption of each figure. The dashed red line is the approximation obtained from the asymptotic result (3.3), that is, \(F_d(r,X_n) \approx F(n^{1/d}V_d^{1/d}r)\). In Figs. 3–8, we also include the two Jensen-type bounds given in (3.8) (dot-dashed orange) and (3.9) (dotted blue), as well as the approximation (3.10) (long-dashed green). From these figures, we can make the following observations.

  1. 1.

    Unless d is small, the asymptotic results produce poor approximations even if n is reasonably large.

  2. 2.

    The approximation in (3.10) is rather accurate but worsens for smaller \(\gamma \).

  3. 3.

    For \(r\le 1/2\), the asymptotic approximation and the bound (3.8) coincide; this follows from the equality \({\mathbb {P}_{X} \left\{ \Vert \varvec{1/2} - {X} \Vert \le r \right\} =r^dV_d}\) for \(r\le 1/2\).

  4. 4.

    The refined Jensen-type bound (3.9) is superior to (3.8) and especially to the asymptotic approximation. This becomes particularly evident in higher dimensions; see Figs. 7 and 8.

In Figs. 3–6, the crosses on the dashed red line and the solid black line mark points of interest. In Fig. 3, for \(r=0.5\) we obtain \(F(n^{1/d}V_d^{1/d}r)=0.91\), but the true value of \(F_d(0.5,X_n)\) is closer to 0.41. As n increases from 1000 to 10000, as shown in Fig. 4, for \(r=0.4\) we have \(F(n^{1/d}V_d^{1/d}r)=0.925\) while \(F_d(0.4,X_n)\) is closer to 0.6. (Recall that, in view of (3.3), we should have \(F(n^{1/d}V_d^{1/d}r) \simeq F_d(r,X_n)\) for all r when n is large enough.) The respective triples \((r; F(n^{1/d}V_d^{1/d}r), F_d(r,X_n))\) for Figs. 5 and 6 are (0.9; 0.935, 0.08) and (0.8; 0.95, 0.13). For the case \(d=50\), shown in Figs. 7 and 8, the asymptotic properties are so far from being achieved with \(n=1000\) and \(n=10000\) that such a comparison does not even make sense.

Fig. 1 Covering proportions using the asymptotic radius; \(n=1000,\;10000\)

Fig. 2 \( F_d(r,X_n) \) and \(F(n^{1/d}V_d^{1/d}r)\) as functions of r; \( d=5\) and \(n=1000\)

Fig. 3 \(d=10,\;n=1000 \)

Fig. 4 \(d=10,\;n=10000 \)

Fig. 5 \(d=20,\;n=1000 \)

Fig. 6 \(d=20\), \(n=10000 \)

Fig. 7 \(n=1000, d=50\)

Fig. 8 \(n=10{,}000\), \(d=50\)

In Figs. 9 and 10 we use \(d=10\), \(d=20\) and the values of r corresponding to the crosses in Figs. 3 and 5. In these figures, we depict the distribution of the volume of the intersection of the cube with a ball of radius r centred at a random point, normalised by the volume of the ball \(r^dV_d\); that is, we plot the density of the r.v. \(\kappa _U=\mathbb {P}_{X} \left\{ \Vert U-{X} \Vert \le r \right\} /(r^dV_d)\), where both U and X have the uniform distribution on \([0,1]^d\). These two figures provide another illustration of the inadequacy of the key assumption behind (2.6), namely that the distribution of the r.v. \(\kappa _U \) is very close to the delta-measure concentrated at one. This assumption is indeed reasonably adequate if r can be chosen small enough. However, as Fig. 9 and especially Fig. 10 illustrate, even for relatively large values of n the required values of r are not small enough for this to hold even approximately. Note that in the derivation of the asymptotic value \(n_\gamma =n_{\gamma }^{\textrm{as}}\) in (2.6) we use the value 1 in place of the random variable \(\kappa _U\) with the densities shown in Figs. 9 and 10.
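
A sketch of how samples of \(\kappa _U\) can be generated (assuming numpy; sample sizes are illustrative):

```python
# Samples of the r.v. kappa_U = P_X{||U - X|| <= r} / (r^d V_d) for U, X uniform
# on [0,1]^d; histogramming these samples reproduces densities like those of
# Figs. 9 and 10. A sketch assuming numpy.
import math
import numpy as np

def kappa_samples(r, d, n_u=2_000, n_x=20_000, seed=0):
    rng = np.random.default_rng(seed)
    V_d = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    out = np.empty(n_u)
    for i in range(n_u):
        u = rng.random(d)
        x = rng.random((n_x, d))
        out[i] = np.mean(np.linalg.norm(x - u, axis=1) <= r) / (r ** d * V_d)
    return out

samples = kappa_samples(r=0.5, d=10)
print(samples.mean(), samples.std())    # clearly not concentrated at 1
```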

Fig. 9 Density of r.v. \(\kappa _U\); \(d=10,r=0.5\)

Fig. 10 Density of r.v. \(\kappa _U\); \(d=20,r=0.9\)

4 Modification of sampling schemes and non-uniform distribution of the target

In Sect. 3, we have used the principal sampling scheme where the points \(x_j\) in \(X_n=\{x_1, \ldots , x_n\}\) are i.i.d. uniform on \({\mathcal {X}}=[0,1]^d\). In Sect. 4.1 we study a modification of this scheme where the \(x_j \in X_n\) are i.i.d. uniform random points in a smaller \(\delta \)-cube \(C_\delta =[1/2-\delta /2, 1/2+\delta /2]^d\) with \(0<\delta <1\). In Sect. 4.2 we investigate the effect of replacing random points by points from a low-discrepancy sequence. The choice of a specific low-discrepancy sequence has very little impact on the results and we present the results for the Sobol sequence only. In Sect. 4.3 we investigate the effect of replacing the uniform distribution of the target \(x_* \in [0,1]^d\) with a bowl-shaped distribution such as the product of arcsine distributions on [0, 1].

4.1 Points \(x_j\) are i.i.d. uniformly distributed on \(C_\delta \)

In this section we demonstrate the \(\delta \)-effect: in high dimensions, sampling in a cube \(C_\delta \) with suitable \(0<\delta <1\) leads to a much more efficient covering scheme than sampling in the whole cube \([0,1]^d\). Note that the \(\delta \)-effect is not obvious; it is completely unknown in the literature on stochastic global optimization, and perhaps in the literature on global optimization in general. The existing literature recommends space-filling in the whole set \({\mathcal {X}}\) rather than in a subset of it. Moreover, there are recommendations in the literature (see, for example, [5, 22]) to place more points closer to the boundary of the cube, rather than purely uniformly, in order to improve the space-filling properties of random points.

In Figs. 11–13, for different values of d and n we plot \(F_d(r,X_n)\) as a function of \(\delta \). For each d and n, the value of r has been chosen such that \(\max _{0\le \delta \le 1}F_d(r,X_n)=0.9\); these values of r (along with the optimal values of \(\delta \), in brackets) can be obtained from Table 1. In these figures, the values of \(F_d(r,X_n)\) for \(n=1000, 10000, 100000\) are shown with a solid black line, a dashed blue line and a dotted green line, respectively. These figures demonstrate the ‘\(\delta \)-effect’ formulated as the second main message in the Introduction. They also clearly demonstrate that, although sampling uniformly in the cube \([0,1]^d\) is asymptotically optimal, for large d it is a poor strategy which can be substantially improved. A computational sketch is given below.
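
The sketch below (assuming numpy; the sample sizes and the values of r and \(\delta \) are illustrative) estimates \(F_d(r,X_n)\) for points sampled uniformly in \(C_\delta \) and scans over \(\delta \).

```python
# The delta-effect: estimate F_d(r, X_n) when the points x_j are i.i.d. uniform
# on the delta-cube C_delta, and scan over delta. A sketch assuming numpy.
import numpy as np

def F_d_delta(r, n, d, delta, n_outer=1_000, n_inner=10_000, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = 0.5 - delta / 2, 0.5 + delta / 2
    vals = np.empty(n_outer)
    for i in range(n_outer):
        u = rng.random(d)                               # target U ~ uniform on [0,1]^d
        x = rng.uniform(lo, hi, size=(n_inner, d))      # X ~ uniform on C_delta
        p = np.mean(np.linalg.norm(x - u, axis=1) <= r)
        vals[i] = 1.0 - (1.0 - p) ** n
    return float(vals.mean())

for delta in (1.0, 0.8, 0.6, 0.4):                      # compare covering levels over delta
    print(delta, F_d_delta(r=0.5, n=1000, d=10, delta=delta))
```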

Fig. 11 \(d=10;\; n=1000, 10,000, 100,000\)

Fig. 12 \(d=20;\; n=1000, 10,000, 100,000\)

Fig. 13 \(d=50;\; n=1000, 10,000, 100,000\)

Fig. 14 \(d=10, n=1000\): Jensen’s bound with \(\delta =1\) and \(\delta =0.5\)

The discussion of the Jensen-type bounds given in Sect. 3.2 still applies to the case of \(X_n\) sampled uniformly within the \(\delta \)-cube \(C_\delta \). The only adjustment that needs to be made to the results of Sect. 3.2 is to let X be a uniform random vector in \(C_\delta \) rather than in \([0,1]^d\). In Fig. 14, we depict the bound given in (3.9) for X uniform in \([0,1]^d\) and for X uniform in the \(\delta \)-cube \([1/4,3/4]^d\) (so that \(\delta =0.5\)). We see that the bound for \(X_n\) sampled within the \(\delta \)-cube is larger than that for \(X_n\) sampled from the whole cube. This further supports the conclusion that, for n not astronomically large, the ‘\(\delta \)-effect’ should always be taken into account.

In Table 1, for \(X_n\) chosen uniformly in the cube \([0,1]^d\) and \(X_n\) chosen uniformly in the \(\delta \)-cube, we tabulate the values of \(r_{n,1-\gamma }\) with \(\gamma =0.1\) for different d and n. In the columns labeled \(\delta \)-cube, the values in the brackets correspond to the approximately optimal values of \(\delta \). We can see that for small d, the \(\delta \)-effect is very small (since n is relatively large in these dimensions). For larger dimensions, the \(\delta \)-effect is very prominent.

Table 1 Values for \(r_{n,1-\gamma }\) with \(\gamma =0.1\)

In Tables 2 and 3, we consider an equivalent reformulation of the results of Table 1. In these tables, for a given r we specify the value of \(n_\gamma \), with \(\gamma =0.1\), for \(X_n\) chosen uniformly in the cube \([0,1]^d\) and \(X_n\) chosen uniformly in the \(\delta \)-cube. We also include the approximation based on the asymptotic arguments leading to (2.6). We see that in high dimensions, making r small enough for (2.6) to provide sensible approximations requires n to be extremely large; such values of n are impractical.

Table 2 Values of \(n_{\gamma }\): \(d=20, \gamma =0.1\)
Table 3 Values of \(n_{\gamma }\): \(d=50, \gamma =0.1\)

4.2 Points \(x_j\) are taken from a low-discrepancy sequence

Figures 15 and 16 are extended versions of Fig. 1. Here we plot \(F_d(r,X_n)\) as a function of d, where the radius is fixed from (3.5) with \(\gamma =0.1\) (the level \(1-\gamma \) is depicted by a red solid line). For \(X_n\) chosen uniformly in the cube \([0,1]^d\), we depict \(F_d(r,X_n)\) with blue plusses. For \(X_n\) chosen from a Sobol sequence in the whole cube \([0,1]^d\), we use orange triangles. When the points in \(X_n\) are i.i.d. uniform within the \(\delta \)-cube with optimal \(\delta \), we use green crosses. Finally, when the points in \(X_n\) are chosen from a Sobol sequence within the same \(\delta \)-cube, we use purple diamonds. Figures 15 and 16 illustrate two new key messages along with the message discussed in Fig. 1. Firstly, the use of low-discrepancy sequences seems to produce slightly better results than a random choice of points in small dimensions, but in higher dimensions the use of low-discrepancy sequences (in our case, Sobol sequences) produces results that are almost equivalent to random sampling uniformly either in \([0,1]^d\) or in the optimally chosen \(\delta \)-cube. Secondly, in large dimensions sampling from a suitable \(\delta \)-cube greatly outperforms the other schemes considered here, while still being far from the asymptotic results. These messages are further supported in Figs. 17 and 18. Here we plot the asymptotic approximation from Lemma 1 (dashed red) and \(F_d(r,X_n)\) as a function of r for the following choices of \(X_n\): random in the cube \([0,1]^d\) (blue line with plusses), chosen from a Sobol sequence in \([0,1]^d\) (orange line with triangles), random in the \(\delta \)-cube with optimal \(\delta \) (green line with crosses), and chosen from a Sobol sequence within the same \(\delta \)-cube (purple line with diamonds). We see that in Fig. 17, for \(d=10\), the Sobol sequence is slightly superior to random uniform sampling on both the whole cube and the \(\delta \)-cube for most interesting values of \(\gamma \). Choosing \(X_n\) uniform within the \(\delta \)-cube produces better coverings than Sobol’s points in \([0,1]^d\) for most values of \(\gamma \), but slightly worse for small \(\gamma \). This slight advantage of the Sobol sequence in \([0,1]^d\) and in the \(\delta \)-cube diminishes in the case \(d=20\) shown in Fig. 18.

To further study the similarities in performance between \(X_n\) chosen randomly in the \(\delta \)-cube with optimal \(\delta \) and \(X_n\) chosen from a Sobol sequence within the same \(\delta \)-cube, in Figs. 19 and 20 we plot the ratio of the c.d.f.’s \(F_d(r,X_{n,U})/F_d(r,X_{n,S})\) for different d. The subscripts U and S distinguish between \(X_n\) chosen randomly in the \(\delta \)-cube with optimal \(\delta \) and \(X_n\) chosen from a Sobol sequence within the same \(\delta \)-cube, respectively. For each value of d, r is chosen so that \(\max _{0\le \delta \le 1}F_d(r,X_{n,U})=0.9\).
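
A sketch of such a comparison (assuming numpy and scipy \(\ge \) 1.7 for scipy.stats.qmc; the values of \(\delta \) and r below are illustrative rather than the optimal ones used in the figures):

```python
# Compare i.i.d. uniform points with a scrambled Sobol sequence, in [0,1]^d and
# rescaled into the delta-cube. Assumes numpy and scipy >= 1.7 (scipy.stats.qmc).
import numpy as np
from scipy.stats import qmc

def coverage(points, r, n_test=50_000, batch=1_000, seed=0):
    """Monte Carlo estimate of vol(B(X_n, r)) for X = [0,1]^d."""
    rng = np.random.default_rng(seed)
    d, covered = points.shape[1], 0
    for start in range(0, n_test, batch):
        m = min(batch, n_test - start)
        u = rng.random((m, d))
        dist = np.min(np.linalg.norm(u[:, None, :] - points[None, :, :], axis=2), axis=1)
        covered += np.count_nonzero(dist <= r)
    return covered / n_test

d, m, delta, r = 10, 10, 0.7, 0.5                        # n = 2^m = 1024 points
rng = np.random.default_rng(0)
sobol = qmc.Sobol(d=d, scramble=True, seed=0).random_base2(m)
unif = rng.random((2 ** m, d))
shrink = lambda pts: 0.5 + delta * (pts - 0.5)           # rescale [0,1]^d into C_delta
for name, pts in [("uniform", unif), ("Sobol", sobol),
                  ("uniform, delta-cube", shrink(unif)), ("Sobol, delta-cube", shrink(sobol))]:
    print(name, coverage(pts, r))
```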

Fig. 15 Covering using the asymptotic radius with Sobol and \(\delta \)-cube points: \(n=2^{10}\)

Fig. 16 Covering using the asymptotic radius with Sobol and \(\delta \)-cube points: \(n=2^{13}\)

Fig. 17 Sobol and \(\delta \)-cube points versus the asymptotic covering: \(d=10, n=1024\)

Fig. 18 Sobol and \(\delta \)-cube points versus the asymptotic covering: \(d=20, n=1024\)

Fig. 19 Efficiency of Sobol’s points, \( n=2^{10}\)

Fig. 20 Efficiency of Sobol’s points, \( n=2^{13}\)

4.3 Non-uniform prior distribution for the target

In this section, we explore the effect that a non-uniform prior distribution for \(x_* \in {\mathcal {X}}\) has on the conclusions formulated above for the uniform case. We will assume that \(x_*\) has independent components, each distributed according to the symmetric beta distribution with density:

$$\begin{aligned} p_{\alpha }(t)= \frac{t^{\alpha -1}[1-t]^{\alpha -1}}{\hbox {Beta}(\alpha ,\alpha )} \,, \text { for some }\alpha > 0. \end{aligned}$$

If \(\alpha =1\), the density \(p_{\alpha }(t)\) is uniform on [0, 1] while for \(0<\alpha <1\) this density is U-shaped. In most cases below we choose the arcsine density \(p_{0.5}(t)\).

We then select the distribution of the random points \(x_j \in X_n\) to have a similar shape but constrained to the \(\delta \)-cube. More precisely, we assume that the \(x_j\) have independent components with the density

$$\begin{aligned} p_{\alpha ,\delta }(t)= \frac{2\, (2\delta )^{1-2\alpha }}{\hbox {Beta}(\alpha ,\alpha )} \left[ \delta ^2-(2t-1)^2\right] ^{\alpha -1}\,, \;\;\frac{1-\delta }{2}<t<\frac{1+\delta }{2}\,, \end{aligned}$$

for some \(\alpha > 0\) and \(0\le \delta \le 1\).

In the case \(\alpha =1\), the points \(x_j\) have uniform distribution on the cube \(C_\delta \).
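
Since \(p_{\alpha ,\delta }\) is a symmetric beta density rescaled to the interval \([(1-\delta )/2,(1+\delta )/2]\), sampling is immediate; a sketch assuming numpy (the parameter values are illustrative):

```python
# Sample X_n with independent components having density p_{alpha,delta}:
# a Beta(alpha, alpha) variable on [0,1] is rescaled linearly onto the interval
# [(1-delta)/2, (1+delta)/2]. A sketch assuming numpy.
import numpy as np

def sample_points(n, d, alpha, delta, seed=0):
    rng = np.random.default_rng(seed)
    y = rng.beta(alpha, alpha, size=(n, d))        # symmetric beta on [0,1]
    return (1.0 - delta) / 2.0 + delta * y         # components now follow p_{alpha,delta}

X_n = sample_points(n=10_000, d=20, alpha=0.5, delta=0.7)   # illustrative delta
print(X_n.min(), X_n.max())                                 # all components lie in C_delta
```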

Figs. 21 and 22 are similar to Figs. 11–13, but with the key difference of assuming a non-uniform prior distribution for \(x_*\). For different values of d and n and for \(\alpha =0.5\), we plot \(F_d(r,X_n)\) as a function of \(\delta \). For each d and n, the value of r has been chosen so that \(\max _{0\le \delta \le 1}F_d(r,X_n)=0.9\). In these figures, the values of \(F_d(r,X_n)\) for \(n=1000, 10000, 100000\) are shown with a solid black line, a dashed blue line and a dotted green line, respectively. Figures 23 and 24 are similar to Figs. 21 and 22, but with varying values of \(\alpha \) and fixed \(n=10000\). In these figures, we selected \(\alpha =0.25\) (dashed dark green), \(\alpha =0.5\) (dotted purple), \(\alpha =0.75\) (dot-dashed grey) and \(\alpha =1\), which gives the uniform distribution (solid black). Figures 21 and 22 clearly demonstrate that the ‘\(\delta \)-effect’ is still significant in this non-uniform setting.

Fig. 21 \(d=20\) and \(\alpha =0.5\) with \(n=1000,10,000,100,000\)

Fig. 22 \(d=50\) and \(\alpha =0.5\) with \(n=1000,10,000,100,000\)

Fig. 23 \(d=20, n=10000\) and \(\alpha =0.25,0.5,0.75,1\)

Fig. 24 \(d=50, n=10000\) and \(\alpha =0.25,0.5,0.75,1\)

5 Intersection of one ball with the cube

As a result of (3.6), our main quantity of interest in this section will be the probability

$$\begin{aligned} P_{U,\delta ,r}:=\mathbb {P}_{_X} \left\{ \Vert U\!-\!X \Vert \! \le \! { r } \right\} \!= \! \mathbb {P}_{_X} \left\{ \Vert U\!-\!X \Vert ^2 \le { r^2 } \right\} \!= \! \mathbb {P} \left\{ \sum _{j=1}^d (u_j\!-\!x_j)^2 \le { r }^2 \right\} \;\; \end{aligned}$$
(5.1)

in the case when X has the uniform distribution on the \(\delta \)-cube \([1/2-\delta /2, 1/2+\delta /2]^d\) and \({U=(u_1, \ldots , u_d) \in \mathbb {R}^d}\) is fixed. The case \(\delta =1\) is directly applicable to Sect. 3.2. In view of the results of Sect. 3.2, we will bear in mind the two typical choices \(U=\varvec{1/2}\) and \(U=\varvec{3/4}\), but we formulate the results for general U.

For fixed \(u\in \mathbb {R}^d\), consider the r.v. \(\eta _{u,\delta } = (z-u)^2\), where z has density

$$\begin{aligned} p_{\delta }(t)= 1/\delta \,, \;\;(1-\delta )/2<t<(1+\delta )/2\,,\text { for some } 0\le \delta \le 1. \end{aligned}$$
(5.2)

The mean, variance and third central moment of \(\eta _{u,\delta }\) are:

$$\begin{aligned} \mu _{u}^{(1)}= & {} \mathbb {E}\eta _{u,\delta } =\left( u-\frac{1}{2} \right) ^2+ \frac{{{\delta }}^{2}}{12} \,, \end{aligned}$$
(5.3)
$$\begin{aligned} \mu _{u}^{(2)}= & {} \textrm{var} (\eta _{u,\delta }) = {\frac{\delta ^{2}}{3}} \left[ \left( u-\frac{1}{2} \right) ^2+ {\frac{{{\delta }}^{2}}{60}} \right] \,, \end{aligned}$$
(5.4)
$$\begin{aligned} \mu _{u}^{(3)}= & {} \mathbb {E} \left[ \eta _{u,\delta } - \mu _{u}^{(1)}\right] ^3 = {\frac{ {{\delta }}^{4}}{ 15 }} \left[ \left( u-\frac{1}{2} \right) ^2+ {\frac{{{\delta }}^{2} }{252}} \right] \,. \end{aligned}$$
(5.5)

Then for given \(U=(u_1, \ldots , u_d) \in \mathbb {R}^d\), consider the random variable

$$\begin{aligned} \Vert U-X \Vert ^2 =\sum _{j=1}^d \eta _{u_j,\delta } =\sum _{j=1}^d (u_j-x_j)^2\,, \end{aligned}$$

where we assume that \(X=(x_1, \ldots , x_d) \) is a random vector with i.i.d. components \(x_i\) with density (5.2). From (5.3), its mean is

$$\begin{aligned} \mu =\mu _{d,\delta ,U}:=\mathbb {E}\Vert U-X \Vert ^2 =\Vert U-\varvec{1/2}\Vert ^2 +\frac{{d{\delta }}^{2}}{12}\,. \end{aligned}$$

Using independence of \(x_1, \ldots , x_d\) and (5.4), we obtain

$$\begin{aligned} {\sigma }_{d,\delta ,U}^2:=\textrm{var}(\Vert U-X \Vert ^2 ) = {\frac{\delta ^{2}}{3}} \left[ \Vert U- \varvec{1/2}\Vert ^2+ {\frac{{d{\delta }}^{2}}{ 60}} \right] \,, \end{aligned}$$

and from independence of \(x_1, \ldots , x_d\) and (5.5) we get

$$\begin{aligned} {\mu }_{d,\delta ,U}^{(3)}:= \mathbb {E}\left[ \Vert U-X \Vert ^2- \mu \right] ^3 = \sum _{j=1}^d \mu _{u_j}^{(3)} = {\frac{\,{{\delta }}^{4}}{ 15 }} \left[ \Vert U-\varvec{1/2}\Vert ^2+ {\frac{{d{\delta }}^{2} }{ 252 }} \right] \,.\;\;\;\;\;\; \end{aligned}$$
(5.6)

If d is large enough then the conditions of the CLT for \(\Vert U-X \Vert ^2\) are approximately met and the distribution of \(\Vert U-X \Vert ^2 \) is approximately normal with mean \(\mu _{d,\delta ,U}\) and variance \({\sigma }_{d,\delta ,U}^2\). That is, we can approximate the probability \(P_{U,\delta ,r}= \mathbb {P}_{_X} \left\{ \Vert U\!-\!X \Vert \! \le \! { r } \right\} \) by

$$\begin{aligned} P_{U,\delta ,r}\!\cong \Phi \left( \frac{{ r }^2-\mu _{d,\delta ,U}}{{\sigma }_{d,\delta ,U}} \right) \,, \end{aligned}$$
(5.7)

where \(\Phi (\cdot )\) is the c.d.f. of the standard normal distribution:

$$\begin{aligned} \Phi (t) = \int _{-\infty }^t \varphi (v)dv\;\;\textrm{with}\;\; \varphi (v)=\frac{1}{\sqrt{2\pi }} e^{-v^2/2}\,. \end{aligned}$$

The approximation (5.7) has acceptable accuracy if the probability \(P_{U,\delta ,r}\) is not very small; for example, if \(r^2\) falls inside the \(2\sigma \)-interval \(\mu _{d,\delta ,U}\pm 2{\sigma }_{d,\delta ,U}\) of the approximating normal distribution.
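
A sketch of the approximation (5.7) against a direct Monte Carlo estimate of \(P_{U,\delta ,r}\) (assuming numpy and scipy; the parameter values are illustrative):

```python
# The normal approximation (5.7) for P_{U,delta,r} versus a direct Monte Carlo
# estimate. Assumes numpy and scipy.
import numpy as np
from scipy.stats import norm

def p_normal(U, delta, r):
    d = U.size
    a2 = np.sum((U - 0.5) ** 2)                                       # ||U - 1/2||^2
    mu = a2 + d * delta ** 2 / 12.0                                   # mean, from (5.3)
    sigma = np.sqrt(delta ** 2 / 3.0 * (a2 + d * delta ** 2 / 60.0))  # std, from (5.4)
    return norm.cdf((r ** 2 - mu) / sigma)

def p_monte_carlo(U, delta, r, n_mc=500_000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.5 - delta / 2, 0.5 + delta / 2, size=(n_mc, U.size))
    return np.mean(np.sum((x - U) ** 2, axis=1) <= r ** 2)

U = np.full(10, 0.5)                                                  # d = 10, U = (1/2,...,1/2)
for r in (0.7, 0.9, 1.1):
    print(r, p_normal(U, 1.0, r), p_monte_carlo(U, 1.0, r))
```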

To improve on the usual CLT approximation, we use an Edgeworth-type expansion in the CLT for sums of independent non-identically distributed r.v.’s due to V. Petrov; see [13]:

$$\begin{aligned} P\left( \frac{\Vert U-X \Vert ^2-\mu _{d,\delta ,U}}{\sigma _{d,\delta ,U}} \le t \right) = \Phi (t) +\sum _{\nu =1}^{\infty }\frac{Q_{\nu ,d}(t)}{d^{\nu /2}} \,, \end{aligned}$$
(5.8)

where

$$\begin{aligned} Q_{\nu ,d}(t)= & {} -\varphi (t)\sum H_{\nu +2s-1}(t)\prod _{m=1}^{\nu }\frac{1}{k_m!}\left( \frac{\lambda _{m+2,d}}{(m+2)!} \right) ^{k_m} ,\\ \lambda _{\nu ,d}= & {} \frac{d^{(\nu -2)/2}}{\sigma _{d,\delta ,U}^\nu }\sum _{j=1}^{d}\gamma _{\nu ,j}, \end{aligned}$$

\(\gamma _{\nu ,j}\) is the cumulant of order \(\nu \) of \((u_j-x_j)^2-\mu _{u_j}^{(1)} \), \(H_m\) is the Chebyshev–Hermite polynomial of degree m and the summation is carried out over all non-negative integer solutions \((k_1,\ldots ,k_\nu )\) of the equation

$$\begin{aligned} k_1+2k_2+\cdots +\nu k_\nu =\nu \,, \quad \hbox {with } s=k_1+k_2+\cdots +k_\nu \,. \end{aligned}$$

The partition function \(p(\nu )\) gives the number of partitions of a non-negative integer \(\nu \) and therefore, for each value of \(\nu \), the number of terms in the summation. This sequence has the generating function

$$\begin{aligned} \sum _{\nu =0}^{\infty } p(\nu )x^{\nu } = \prod _{k=1}^{\infty } \left( \frac{1}{1-x^k} \right) ; \end{aligned}$$

the first few values are: 1, 1, 2, 3, 5, 7, 11, 15, 22, 30, 42, 56, 77, 101. The first few terms in the summation (including Hermite polynomials) are provided in [14, p. 139].

In the case \(U=\varvec{1/2}\) or \(U=\varvec{3/4}\), the random variables \((u_j-x_j)^2\) are i.i.d. In this case \(\lambda _{\nu ,d}\) does not depend on d, and we have the slight simplification \(\lambda _{\nu ,d}=\lambda _{\nu } = \gamma _\nu /\sigma _{d,\delta ,U}^\nu \), where \(\gamma _\nu \) denotes the common value of the cumulants \(\gamma _{\nu ,j}\).

In Figs. 25–30, we plot \(P_{U,1,r}\) for \(U=\varvec{1/2}\) and \(U=\varvec{3/4}\) as a function of r with a solid black line. In these figures, we demonstrate the accuracy of the approximation (5.7) with a dashed blue line. With a dot-dashed red line, we plot an approximation obtained by taking one additional term in the expansion (5.8); this requires the third central moment given in (5.6). We can see that overall, for \(d=10\) and \(d=20\), the approximations are fairly accurate. However, when considering covering by n balls it is more important to focus on the lower tail. Figures 26, 28, 29 and 30 demonstrate that taking one additional term in Petrov’s expansion (5.8) produces a significant improvement in accuracy.
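
For reference, here is a sketch of such a one-additional-term approximation, written with the standard leading Edgeworth correction \(-\varphi (t)(t^2-1)\lambda _3/6\) and the skewness computed from (5.6) (assuming numpy and scipy; this is our reading of the first term of (5.8), not a verbatim transcription of [13]).

```python
# Approximation of P_{U,delta,r} with one additional term of (5.8): the standard
# first Edgeworth correction -phi(t)(t^2 - 1)*skewness/6, where the skewness is
# computed from the third central moment (5.6). Assumes numpy and scipy.
import numpy as np
from scipy.stats import norm

def p_edgeworth(U, delta, r):
    d = U.size
    a2 = np.sum((U - 0.5) ** 2)                                   # ||U - 1/2||^2
    mu = a2 + d * delta ** 2 / 12.0                               # from (5.3)
    var = delta ** 2 / 3.0 * (a2 + d * delta ** 2 / 60.0)         # from (5.4)
    mu3 = delta ** 4 / 15.0 * (a2 + d * delta ** 2 / 252.0)       # from (5.6)
    sigma = np.sqrt(var)
    t = (r ** 2 - mu) / sigma
    return norm.cdf(t) - norm.pdf(t) * (t ** 2 - 1.0) * (mu3 / sigma ** 3) / 6.0

U = np.full(20, 0.75)                  # d = 20, U = (3/4,...,3/4); values illustrative
print(p_edgeworth(U, 1.0, 1.3))
```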

Fig. 25 \(d=10, U=\varvec{1/2}\)

Fig. 26 \(d=10, U=\varvec{1/2}\)

Fig. 27 \(d=20, U=\varvec{1/2}\)

Fig. 28 \(d=20, U=\varvec{1/2}\)

Fig. 29 \(d=10, U=\varvec{3/4}\)

Fig. 30 \(d=20, U=\varvec{3/4}\)

6 Conclusions

We have considered continuous global optimization problems, where the feasible region \({\mathcal {X}}\) is a compact subset of \(\mathbb {R}^d\). As a strategy for exploration, we have mostly considered sampling of i.i.d. random points either in \({\mathcal {X}}\) or a suitable subset of \({\mathcal {X}}\).

We have distinguished between ‘small’, ‘medium’ and ‘high’ dimensional problems depending on the following relations between d and \(n_{\max }\) (which is the maximum possible number of points available for space exploration):

  1. (S)

    small dimensions: \(n_{\max } \gg 2^d\) (roughly, \(d < 10\));

  2. (M)

    medium dimensions: \(n_{\max } \) is comparable to \( 2^d\) (roughly, \(10 \le d \le 20\));

  3. (H)

    high dimensions: \(n_{\max } \ll 2^d\) (roughly, \(d > 20\)).

We only considered the situations (M) and (H), where we have demonstrated the following effects: (i) the actual convergence of randomized exploration schemes is much slower than suggested by the classical estimates, which are based on the asymptotic properties of random points; (ii) the usually recommended space exploration schemes are practically inefficient as the asymptotic regime is unreachable. In particular, we have shown: (ii-a) uniform sampling on the entire \({\mathcal {X}}\) is much less efficient than uniform sampling on a suitable subset of \({\mathcal {X}}\), and (ii-b) the effect of replacing random points by a low-discrepancy sequence is very small, so that using low-discrepancy sequences and other deterministic constructions does not lead to significant improvements (unless the number of evaluation points \(n=n_{\max }\) is fixed to some particular value like \(2^{d} \) or \(2^{d-1}\); see [9]). We believe that the effects (i) and (ii) have not been stated in the literature, at least in this generality. The effect (ii-a) has been numerically demonstrated in our previous papers [7, 8]. The effect (ii-b) reinforces one of the main messages of the paper [12].

It was not the purpose of this paper to give the most effective exploration schemes. However, the results of this paper, along with the studies reported in [7,8,9, 26], allow us to give several general recommendations on the efficient organization of exploration strategies in the situations (M) and (H), at least when \({\mathcal {X}}\) is a cube.

In a high-dimensional cube \({\mathcal {X}}=[0,1]^d\) with \(d>20\) and \(1<n_{\max }<2^{d}\), we propose the following strategy for constructing nested exploration designs \(X_n\): \(x_1=\varvec{1/2}\) (the centre of \({\mathcal {X}}\)), and the other points \(x_j\) are taken randomly among the vertices of the cube \([1/4,3/4]^d\). Sampling from the vertices can be done without replacement (see [8]) and, moreover, we can choose the points \(x_j\) so that the Hamming distance between any two of them is at least \(\lfloor d-\log _2 (n_{\max }-1) \rfloor +1\).
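
A minimal sketch of this construction (assuming numpy): the points are distinct vertices of \([1/4,3/4]^d\) sampled without replacement; the additional Hamming-distance filter mentioned above can be added on top of this simple scheme, see [8].

```python
# Nested exploration design for case (H): x_1 is the centre of [0,1]^d, the other
# points are distinct vertices of [1/4, 3/4]^d sampled without replacement.
# A sketch assuming numpy; the Hamming-distance filter of the text is not imposed here.
import numpy as np

def vertex_design(d, n_max, seed=0):
    rng = np.random.default_rng(seed)
    design = [np.full(d, 0.5)]                       # x_1: centre of the cube
    seen = set()
    while len(design) < n_max:                       # requires n_max - 1 <= 2^d
        code = tuple(rng.integers(0, 2, size=d))     # random vertex code in {0,1}^d
        if code not in seen:                         # sampling without replacement
            seen.add(code)
            design.append(0.25 + 0.5 * np.array(code))   # map 0/1 to 1/4 or 3/4
    return np.array(design)

X = vertex_design(d=30, n_max=1000)
print(X.shape)
```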

It is more difficult to be so specific in the situation (M), as there may be different relations between \(2^d\), \(n_{\min }\) and \(n_{\max }\). A good strategy would be to use the product of arcsine distributions on a suitable \(\delta \)-cube (see [8]); this distribution is slightly superior to the uniform distribution on the \(\delta \)-cube of Sect. 4 (with the value of \(\delta \) optimized separately for each distribution). An even more natural strategy would be sampling in the \(2^d\) small cubes (of side-length \(\varepsilon \)) surrounding the vertices of a \(\delta \)-cube \(C_\delta =[{ 1/2-{\delta }/2, 1/2+{\delta }/2}]^d \) (after placing \(x_1=\varvec{1/2}\)). The choice of \(\delta \) and \(\varepsilon \) depends on \(2^d\), \(n_{\min }\) and \(n_{\max }\) and requires a separate study. As usual, reducing the randomness in sampling makes any of these schemes marginally more efficient.