Global optimization using random embeddings

We propose a random-subspace algorithmic framework for global optimization of Lipschitz-continuous objectives, and analyse its convergence using novel tools from conic integral geometry. X-REGO randomly projects, in a sequential or simultaneous manner, the high-dimensional original problem into low-dimensional subproblems that can then be solved with any global, or even local, optimization solver. We estimate the probability that the randomly-embedded subproblem shares (approximately) the same global optimum as the original problem. This success probability is then used to show convergence of X-REGO to an approximate global solution of the original problem, under weak assumptions on the problem (having a strictly feasible global solution) and on the solver (guaranteed to find an approximate global solution of the reduced problem with sufficiently high probability). In the particular case of unconstrained objectives with low effective dimension, that only vary over a low-dimensional subspace, we propose an X-REGO variant that explores random subspaces of increasing dimension until finding the effective dimension of the problem, leading to X-REGO globally converging after a finite number of embeddings, proportional to the effective dimension. We show numerically that this variant efficiently finds both the effective dimension and an approximate global minimizer of the original problem.


Introduction
We address the global optimization problem

min_{x ∈ X} f(x), (P)

where f : X → R is Lipschitz continuous and possibly non-convex, and where X ⊆ R^D is a set with nonempty interior, possibly unbounded, which thus includes the unconstrained case X = R^D. We propose a generic algorithmic framework, named X-REGO (X-Random Embeddings for Global Optimization), that (approximately) solves a sequence of realizations of the following randomized reduced problem,

min_y f(Ay + p) subject to Ay + p ∈ X, (RPX)

where A is a D × d Gaussian random matrix with d < D and p ∈ X. We use crucial tools from conic integral geometry to estimate the probability that (RPX) is ε-successful, i.e., that it shares an (approximate) global minimizer with (P). Applications of these bounds to functions with low effective dimensionality are also provided.
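To make the reduced problem concrete, the following minimal sketch (not part of the paper; the toy objective, dimensions and seed are illustrative assumptions) draws one Gaussian embedding A and searches the d-dimensional subproblem min_y f(Ay + p) by coarse random sampling, standing in for any global solver:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 100, 2                       # ambient and subspace dimensions
x_star = rng.standard_normal(D)     # hypothetical global minimizer (illustration only)

def f(x):
    # Toy smooth objective with global minimum 0 at x_star.
    return float(np.sum((x - x_star) ** 2))

A = rng.standard_normal((D, d))     # Gaussian embedding matrix
p = np.zeros(D)                     # point at which the subspace is drawn

# Reduced problem: min_y f(A y + p), searched here by coarse random sampling
# (a stand-in for any global solver applied to the d-dimensional problem).
best_val = f(p)
for _ in range(20000):
    y = rng.uniform(-2.0, 2.0, size=d)
    best_val = min(best_val, f(A @ y + p))

assert best_val <= f(p)   # searching the subspace never does worse than p
```

Since y = 0 recovers the point p itself, optimizing over the subspace can only improve on f(p); how much it improves is exactly what the success-probability analysis below quantifies.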

Related work.
Dimensionality reduction is essential to the efficient solution of high-dimensional optimization problems. Sketching techniques reduce the ambient dimension of a given subspace by projecting it randomly into a lower dimensional one while preserving lengths [67]; such techniques have been used successfully for improving the efficiency of linear and nonlinear least squares (local) solvers and of those for more general sums of functions; see for example, [53,56,8,19] and the references therein. Here, we sketch the problem variables/search space in order to reduce its dimension for the specific aim of global optimization; furthermore, our results are not derived using sketching techniques but conic integral geometry ones. In a huge-scale setting, where full-dimensional vector operations are computationally expensive, Nesterov [49] advocates the use of coordinate descent, a local optimization method that updates successively one of the coordinates of a candidate solution using a coordinate-wise variant of a first-order method, while keeping other coordinates fixed. Coordinate descent methods and their block counterparts have become a method of choice for many large-scale applications, see, e.g., [4,55,68] and have been extended to random subspace descent [46,44] that operates over a succession of random low-dimensional subspaces, not necessarily aligned with coordinate axes. See also [38] for a random proximal subspace descent algorithm, and [35,40] for higher-order random subspace methods for local nonlinear optimization.
In local derivative-free optimization, several algorithms explore successively one-dimensional [59,50,9] and low-dimensional [16] random subspaces. Gratton et al. [36,37] propose and explore a randomized version of direct search where at each iteration the function is explored along a collection of directions, i.e., one-dimensional half-spaces. Golovin et al. [34] develop convergence rates to a ball of ε-minimizers for a variant of randomized direct search for a special class of quasi-convex objectives. Their convergence analysis heavily relies on high-dimensional geometric arguments: they show that sublevel sets contain a sufficiently large ball tangent to the level set, so that at each iteration, with a given probability, sampling the next iterate from a suitable distribution centred at the current iterate decreases the cost.
Unlike the above-mentioned works, our focus here is on the global optimization of generic Lipschitz-continuous objectives. Stochastic global optimization methods abound, such as simulated annealing [32], random search [58], multistart methods [32], and genetic algorithms [41]. Our proposal here is connected to random search methods, namely, it can be viewed as a multi-dimensional random search, where a deterministic or stochastic method is applied to the subspace minimization. Recently, random subspace methods have been developed/applied for the global optimization of objectives with special structure, assuming typically low effective dimensionality of the objective [66,10,11,43,15,18,54]. These functions only vary over a low-dimensional subspace, and are also called multi-ridge functions [29,62], functions with active subspaces [21], or functions with functional sparsity when the subspace of variation is aligned with coordinate axes [65]. Assuming the random subspace dimension d (in (RPX)) to be an overestimate of the objective's effective dimension d_e (the dimension of the subspace of variation), these works have proven that one random embedding is sufficient with probability one to solve the original problem (P) in the unconstrained case (X = R^D) [66,15], while several random embeddings are required in the constrained case [18]. In particular, in [18], we propose an X-REGO variant that is designed specifically for the bound-constrained optimization of functions with low effective dimensionality. As such, it keeps the random subspace dimension d in (RPX) fixed and greater than the effective dimension, which is assumed to be known. Here, X-REGO is designed and analysed for a generic objective and a possibly unbounded/unconstrained and nonconvex domain X, and the random subspace dimension d is arbitrary and allowed to vary during the optimization.
Recently, random projections have been successfully applied to highly overparametrized settings, such as in deep neural network training [47,42] and adversarial attacks in deep learning [14,63]. Though there is no theoretical guarantee at present that a precise low-dimensional subspace exists in these problems, it is a reasonable assumption to make given the high dimensionality of the search space and the supporting numerical evidence. Our approach here investigates the validity of random subspace methods when low effective dimensionality is absent or unknown to the user; we find, both theoretically and numerically, that for large-scale problems such techniques are still beneficial, and furthermore, at least in the unconstrained case, they can naturally adapt to and capture such special structures efficiently. We hope that this provides a general theoretical justification for a broader application of such techniques.
The second part of the paper applies the generic X-REGO convergence results and the (RPX)-related probabilistic bounds to the case when the objective is unconstrained and has low effective dimension, but the effective dimension d_e is unknown. Related results have been proposed that aim to learn the effective subspace before [29,24,62,27] or during the optimization process [30,69,20,22]; additional costs/evaluations are needed in these approaches. Some apply a principal component analysis (PCA) to the gradient evaluated at a collection of random points [21,27,22]. Alternatively, [29,24,62] recast the problem into a low-rank matrix recovery problem, and [30] proposes a Bayesian optimization algorithm that sequentially updates a posterior distribution over effective subspaces, and over the objective, using new function evaluations. Still in the context of Bayesian optimization, Zhang et al. [69] estimate the effective subspace using Sliced Inverse Regression, a supervised dimensionality reduction technique in contrast with the above-mentioned PCA, while Chen et al. [20] extend Sliced Inverse Regression to learn the effective subspace in a semi-supervised way. Instead, our proposed algorithm explores a sequence of random subspaces of increasing dimension until it discovers the effective dimension of the problem. Independently, a similar idea has been recently used in sketching methods for regularized least-squares optimization [45].
Our contributions. We explore the use of random embeddings for the generic global optimization problem (P). Our proposed algorithmic framework, X-REGO, replaces (P) by a sequence of reduced random subproblems (RPX ), that are solved (possibly approximately and probabilistically) using any global optimization solver. As such, X-REGO extends block coordinate descent and local random subspace methods to the global setting.
Our convergence analysis for X-REGO crucially relies on a lower bound on the probability of ε-success of (RPX), whose computation, exploiting connections between (RPX) and the field of conic integral geometry, is a key contribution of this paper¹. Using asymptotic expansions of integrals, we derive interpretable lower bounds in the setting where the random subspace dimension d is fixed and the original dimension D grows to infinity. In the box-constrained case X = [−1, 1]^D, we also compare these bounds with the probability of success of the simplest random search strategy, where a point is sampled in the domain uniformly at random at each iteration. We show that when the point p at which the random subspace is drawn is close enough to a global solution x* of (P), the random subspace is more likely to intersect a ball of ε-minimizers than random search is to find an ε-minimizer. Provided that the reduced problem can be solved at a reasonable cost, random subspace methods are thus provably better than random search in some cases; and even more so, numerically.
In the second part of the paper, we address global optimization of functions with low effective dimension, and propose an X-REGO variant that progressively increases the random subspace dimension. Instead of requiring a priori knowledge of the effective dimension of the objective, we show numerically that this variant is able to learn the effective dimension of the problem. We also provide convergence results for this variant after a finite number of embeddings, using again our conic integral geometry bounds. Notably, these convergence results have no dependency on D. We compare numerically several instances of X-REGO when the reduced problem is solved using the (global and local) KNITRO solver [13]. We also discuss several strategies to choose the parameter p in (RPX).
Paper outline. Section 2 presents the geometry of the problem, and motivates the use of conic integral geometry to estimate the probability of (RPX) being ε-successful. Section 3 summarizes key results from conic integral geometry that are used later in the paper. In Section 4, we derive lower bounds on the probability that (RPX) is ε-successful, obtain asymptotic expansions of this probability, and compare the search within random embeddings with random search. Section 5 presents the X-REGO algorithmic framework, and Section 6 the corresponding convergence analysis. Finally, Section 7 proposes a specific instance of X-REGO for global optimization of functions with low effective dimension, with associated convergence results, and Section 8 contains numerical illustrations.
Notation. We use bold capital letters for matrices (A) and bold lowercase letters (a) for vectors. In particular, I_D is the D × D identity matrix and 0_D, 1_D (or simply 0, 1) are the D-dimensional vectors of zeros and ones, respectively. We write a_i to denote the ith entry of a and write a_{i:j}, i < j, for the vector (a_i, a_{i+1}, ..., a_j)^T. We let range(A) denote the linear subspace spanned in R^D by the columns of A ∈ R^{D×d}. We write ⟨·, ·⟩ for the usual Euclidean inner product and ‖·‖ (or equivalently ‖·‖_2) for the Euclidean norm.
Given two random variables x and y (or random vectors x and y), we write x =_law y to mean that x and y have the same distribution. We reserve the letter A for a D × d Gaussian random matrix (see Definition A.1).
Given a point a ∈ R^D and a set S of points in R^D, we write a + S to denote the set {a + s : s ∈ S}. Given functions f(x) : R → R and g(x) : R → R_+, we write f(x) = Θ(g(x)) as x → ∞ to denote the fact that there exist positive reals M_1, M_2 and a real number x_0 such that M_1 g(x) ≤ |f(x)| ≤ M_2 g(x) for all x ≥ x_0.

Geometric description of the problem
Let ε > 0 denote the accuracy to which problem (P) is to be solved, and let G_ε be the set of ε-minimizers of (P),

G_ε := {x ∈ X : f(x) ≤ f(x*) + ε}, (2.1)

where x* is a global minimizer of (P). Note that, by Definition 1.1, the reduced problem (RPX) is ε-successful if and only if the intersection of the (affine) subspace p + range(A) and G_ε is non-empty:

p + range(A) ∩ G_ε ≠ ∅. (2.2)

To further characterize this probability, let us now introduce the following assumptions.
Assumption LipC (Lipschitz continuity of f). The objective function f : X → R is Lipschitz continuous with Lipschitz constant L, namely, |f(x) − f(y)| ≤ L‖x − y‖ for all x, y ∈ X.

Assumption FeasBall (Existence of a ball of ε-minimizers). There exists a global minimizer x* of (P) that satisfies B_{ε/L}(x*) ⊂ X, where B_{ε/L}(x*) is the D-dimensional Euclidean ball of radius ε/L centered at x*, L is the Lipschitz constant of f and ε > 0 is the desired accuracy tolerance.
We then have the following result.
Proposition 2.1. Let Assumption LipC hold. Let A be a D × d Gaussian matrix, ε a positive accuracy tolerance and x* any global minimizer of (P) satisfying Assumption FeasBall. Let p ∈ X be a given vector. Then,

P[(RPX) is ε-successful] ≥ P[p + range(A) ∩ B_{ε/L}(x*) ≠ ∅]. (2.3)

Proof. Let x* be a global minimizer of f in X satisfying Assumption FeasBall, and let x ∈ B_{ε/L}(x*). Then x ∈ G_ε due to the Lipschitz continuity of f, namely f(x) ≤ f(x*) + L‖x − x*‖ ≤ f(x*) + ε. The result then follows simply from (2.2).
In the case of non-unique solutions, each global minimizer x* of (P) satisfying Assumption FeasBall provides a different lower bound in Proposition 2.1. If all the balls B_{ε/L}(x*) associated with different global minimizers are disjoint, the probability of ε-success of (RPX) is lower bounded by the sum, over each x* satisfying Assumption FeasBall, of the probabilities P[p + range(A) ∩ B_{ε/L}(x*) ≠ ∅]. In this paper, we estimate the latter probability for an arbitrary x*; this is a worst-case bound in the sense that it clearly underestimates the chance of subproblem success (for a(ny) x*) in the presence of multiple global minimizers of (P). Given x* satisfying Assumption FeasBall, let us assume that p ∉ B_{ε/L}(x*) (otherwise, the reduced problem (RPX) is always ε-successful, which can be seen by simply taking y = 0). To estimate the right-hand side of (2.3), we first construct the set C_p(x*) containing the rays connecting p with points in B_{ε/L}(x*),

C_p(x*) := {p + θ(x − p) : x ∈ B_{ε/L}(x*), θ ≥ 0}. (2.5)

Note that C_p(x*) is a convex cone that has been translated by p (see Figure 1). We can easily verify this fact by recalling the definition of a convex cone.
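The probability on the right-hand side of (2.3) can also be checked numerically: the affine subspace p + range(A) meets B_{ε/L}(x*) exactly when the orthogonal distance from x* to the subspace is at most the ball radius. The following Monte Carlo sketch (our own illustration; the dimensions, radius and seed are arbitrary assumptions) estimates this intersection probability:

```python
import numpy as np

rng = np.random.default_rng(1)
D, d = 20, 3
x_star = np.ones(D)     # hypothetical global minimizer (illustrative)
p = np.zeros(D)         # the point at which the subspace is drawn
radius = 4.0            # stands in for eps/L, the radius of the ball of minimizers

def subspace_hits_ball(A):
    # p + range(A) meets B_radius(x_star) iff the distance from x_star
    # to the affine subspace is at most the radius.
    Q, _ = np.linalg.qr(A)                 # orthonormal basis of range(A)
    z = x_star - p
    dist = np.linalg.norm(z - Q @ (Q.T @ z))
    return dist <= radius

trials = 5000
hits = sum(subspace_hits_ball(rng.standard_normal((D, d))) for _ in range(trials))
phat = hits / trials    # Monte Carlo estimate of P[p + range(A) meets the ball]
assert 0.0 < phat < 1.0
```

The conic construction below replaces this brute-force geometric test with exact formulas from conic integral geometry.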

Definition 2.2.
A convex set C is called a convex cone if, for every c ∈ C and any non-negative scalar ρ, we have ρc ∈ C.

Figure 1 (caption): The red line represents points of p + range(A) that are contained in X and the blue dot represents a global minimizer x* of (P); (RPX) is ε-successful when the red line intersects B_{ε/L}(x*). The cone C_p(x*) is constructed in such a way that the following condition holds: p + range(A) intersects B_{ε/L}(x*) if and only if p + range(A) and C_p(x*) share a ray.
The next result indicates that, based on (2.3) and the definition of C_p(x*), we can rewrite the right-hand side of (2.3) as

P[p + range(A) ∩ C_p(x*) ≠ {p}], (2.6)

the probability of the event that the translated cones p + range(A) and C_p(x*) share a ray. It turns out that this probability has a quantifiable expression based on conic integral geometry, a broad concern of which is the quantification/estimation of the probability that a random cone (e.g., p + range(A)) and a fixed cone (e.g., C_p(x*)) share a ray. We then present in Section 3 key tools from conic integral geometry to help us estimate the probability of ε-success of (RPX).
Theorem 2.4. Let Assumption LipC hold. Let A be a D × d Gaussian matrix, ε a positive accuracy tolerance and x* any global minimizer of (P) satisfying Assumption FeasBall. Let p ∈ X \ G_ε be a given vector and let C_p(x*) be defined in (2.5). Then,

P[(RPX) is ε-successful] ≥ P[p + range(A) ∩ C_p(x*) ≠ {p}]. (2.7)

Proof. From Proposition 2.1, we have P[(RPX) is ε-successful] ≥ P[p + range(A) ∩ B_{ε/L}(x*) ≠ ∅]. The result follows from the fact that the event {p + range(A) ∩ C_p(x*) ≠ {p}} is a subset of the event {p + range(A) ∩ B_{ε/L}(x*) ≠ ∅}. We prove this fact below. Suppose that the event {p + range(A) ∩ C_p(x*) ≠ {p}} occurs. Then, there exists a point x ∈ p + range(A) ∩ C_p(x*) with x ≠ p. Define the ray R := {p + θ(x − p) : θ ≥ 0} and note that R ⊂ p + range(A). Now, since x ∈ C_p(x*), by definition of C_p(x*) there exist x̄ ∈ B_{ε/L}(x*) and θ̄ > 0 such that x = p + θ̄(x̄ − p). We express x̄ in terms of x: x̄ = p + θ′(x − p), where θ′ = 1/θ̄ > 0. By definition of R, x̄ ∈ R and, thus, x̄ also lies in p + range(A). This proves that the set p + range(A) ∩ B_{ε/L}(x*) is non-empty.

A snapshot of conic integral geometry
A central question posed in conic integral geometry is the following: What is the probability that a randomly rotated convex cone shares a ray with a fixed convex cone?
The answer to this question is given by the conic kinematic formula [57].
Theorem 3.1 (Conic kinematic formula). Let C and F be closed convex cones in R^D such that at most one of them is a linear subspace. Let Q be a D × D random orthogonal matrix drawn uniformly from the set of all D × D real orthogonal matrices. Then,

P[C ∩ QF ≠ {0}] = Σ_{i=1}^{D} (1 + (−1)^{i+1}) Σ_{j=i}^{D} v_j(C) v_{D+i−j}(F), (3.1)

where v_k(C) denotes the kth intrinsic volume of the cone C.

Proof.
A proof can be found in [57, p. 261].
We plan to use the conic kinematic formula to estimate (2.6). This formula expresses the probability of the intersection of the two cones in terms of quantities known as conic intrinsic volumes. It is thus important to understand the conic intrinsic volumes and ways to compute them.

Conic intrinsic volumes
Conic intrinsic volumes are commonly defined through the spherical Steiner formula (see [2]), which we do not state here as it is not needed for this work. Instead, we will familiarise ourselves with the conic intrinsic volumes through their properties and specific examples. This is a short introductory review of conic intrinsic volumes; for more details, the interested reader is directed to [2,3,1,48,57] and the references therein.
For a closed convex cone C in R^D, there are exactly D + 1 conic intrinsic volumes: v_0(C), v_1(C), ..., v_D(C). Conic intrinsic volumes have useful properties, some of which are summarized below. Given a closed convex cone C ⊆ R^D, we have (see [3, Fact 5]):

(1) Probability distribution. The intrinsic volumes are non-negative and sum to one,

v_k(C) ≥ 0 for k = 0, 1, ..., D, and Σ_{k=0}^{D} v_k(C) = 1. (3.2)

In other words, they form a discrete probability distribution on {0, 1, ..., D}.

(2) Invariance under rotations. Given any orthogonal matrix Q ∈ R^{D×D}, the intrinsic volumes of the rotated cone QC and the original cone C are equal: v_k(QC) = v_k(C).

(3) Gauss–Bonnet formula. If C is not a linear subspace, the even-indexed and the odd-indexed intrinsic volumes each sum to one half. The Gauss–Bonnet formula thus implies that v_k(C) ≤ 1/2 for any k.

For a polyhedral cone C, the intrinsic volumes admit the following projection characterization:

v_k(C) = P[Π_C(a) lies in the relative interior² of a k-dimensional face of C]. (3.5)

Here, a denotes a standard Gaussian vector³ in R^D and Π_Y(x) := argmin_y {‖x − y‖ : y ∈ Y} denotes the Euclidean/orthogonal projection of x onto the set Y, namely the vector in Y that is the closest to x.

Figure 2 (caption): A depiction of the two-dimensional polyhedral cone C_{π/3} in Example 3.4. The projection Π_{C_{π/3}}(a) of a onto C_{π/3} falls onto the one-dimensional face of the cone.
Example 3.4. Let us consider a simple two-dimensional polyhedral cone C_{π/3}, illustrated in Figure 2, and let us calculate v_0(C_{π/3}), v_1(C_{π/3}) and v_2(C_{π/3}) using (3.5). The cone C_{π/3} has a single two-dimensional face (filled with blue), which is the interior of C_{π/3}. If a random vector a falls inside this face then Π_{C_{π/3}}(a) = a and, therefore, v_2(C_{π/3}) = P[a ∈ int C_{π/3}]. Let us now calculate v_0(C_{π/3}). Note that C_{π/3} has only one zero-dimensional face, which is the origin. Note also that Π_{C_{π/3}}(a) = 0 if and only if a ∈ C°_{π/3}, the polar cone of C_{π/3}. Hence, v_0(C_{π/3}) = P[a ∈ C°_{π/3}]. To calculate v_1(C_{π/3}), we simply use (3.2) to obtain v_1(C_{π/3}) = 1 − v_0(C_{π/3}) − v_2(C_{π/3}).

²The formal definition of the relative interior of a set S is as follows: relint(S) := {x ∈ S : ∃δ > 0, B_δ(x) ∩ aff(S) ⊆ S}, where the affine hull aff(S) is the smallest affine set containing S. For example, the relative interior of a line segment [A, B] living in R² is (A, B); the relative interior of a two-dimensional square living in R³ is the square minus its boundary.

³A random vector for which each entry is an independent standard normal variable.
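The projection characterization (3.5) is easy to verify numerically for a planar cone. The sketch below (our own illustration, not from the paper; the aperture β = π/3 is an assumption about the cone in Example 3.4) classifies standard Gaussian samples by the face of the cone {angle ∈ [0, β]} onto which they project, and compares the resulting frequencies with the exact planar values v_2 = β/(2π), v_1 = 1/2 and v_0 = (π − β)/(2π):

```python
import numpy as np

rng = np.random.default_rng(2)
beta = np.pi / 3          # aperture of the planar cone (assumed for illustration)
n = 200000

a = rng.standard_normal((n, 2))
theta = np.mod(np.arctan2(a[:, 1], a[:, 0]), 2 * np.pi)

# Classify the projection of a onto the cone {angle in [0, beta]} by which face
# it lands on: the interior (dim 2), an edge (dim 1), or the origin (dim 0).
in_interior = theta <= beta                                      # projection is a itself
in_polar = (theta >= beta + np.pi / 2) & (theta <= 3 * np.pi / 2)  # projection is 0
v2, v0 = in_interior.mean(), in_polar.mean()
v1 = 1.0 - v2 - v0        # remaining mass projects onto the two edges

assert abs(v2 - beta / (2 * np.pi)) < 0.01           # v2 = beta/(2*pi)
assert abs(v1 - 0.5) < 0.01                          # v1 = 1/2
assert abs(v0 - (np.pi - beta) / (2 * np.pi)) < 0.01 # v0 = (pi - beta)/(2*pi)
```

The three estimates sum to one by construction, matching property (3.2).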
We already mentioned in Remark 2.3 that a d-dimensional linear subspace L_d is a cone. In fact, L_d is a polyhedral cone which has only one (d-dimensional) face. Therefore, the projection of any vector in R^D onto L_d always lies on its (only) d-dimensional face. Hence,

v_k(L_d) = 1 if k = d, and v_k(L_d) = 0 otherwise; (3.6)

(3.6) follows from (3.5).
Example 3.6 (Circular cone). A circular cone is another important example; circular cones have a number of applications in convex optimization (see, e.g., [7, Section 3] and [12, Section 4]). The circular cone of angle α in R^D is denoted by Circ_D(α) and is defined as

Circ_D(α) := {x ∈ R^D : x_1 ≥ ‖x‖ cos(α)}. (3.7)

The circular cone can be viewed as a collection of rays connecting the origin and some D-dimensional ball which does not contain the origin in its interior. The intrinsic volumes of Circ_D(α) are given by the formulae (3.8) for k = 1, 2, ..., D − 1, where the general binomial coefficient appearing there is the extension of the usual binomial coefficient to non-integer arguments i and j through the gamma function,

binom(i, j) := Γ(i + 1) / (Γ(j + 1) Γ(i − j + 1)). (3.9)

The 0th and Dth intrinsic volumes of the circular cone are given by (3.10) and (3.11), respectively (see [1, Ex. 4.4.8]). The following property of circular cones will be needed later.

Lemma 3.7. Let 0 < α ≤ α′ < π/2. Then Circ_D(α) ⊆ Circ_D(α′).
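The definition (3.7) can be sanity-checked numerically. For D = 3, the top intrinsic volume v_D equals the Gaussian measure of the cone, which is its solid-angle fraction (1 − cos α)/2; the sketch below (our own illustration; the angle and seed are arbitrary) tests cone membership on Gaussian samples:

```python
import numpy as np

rng = np.random.default_rng(6)
alpha = np.pi / 4
n = 200000

# Circ_D(alpha) = {x in R^D : x_1 >= ||x|| cos(alpha)}, here with D = 3.
a = rng.standard_normal((n, 3))
inside = a[:, 0] >= np.linalg.norm(a, axis=1) * np.cos(alpha)

# The fraction of Gaussian samples inside the cone is its solid-angle
# fraction, which for D = 3 equals (1 - cos(alpha)) / 2.
assert abs(inside.mean() - (1 - np.cos(alpha)) / 2) < 0.01
```

The same membership test also makes the monotonicity of Lemma 3.7 evident: increasing α weakens the inequality x_1 ≥ ‖x‖ cos(α), so the cone only grows.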

The Crofton formula
We now present a useful corollary of the conic kinematic formula. If one of the cones in Theorem 3.1 is given by a linear subspace then the conic kinematic formula reduces to the Crofton formula.
Corollary 3.8 (Crofton formula). Let C be a closed convex cone in R^D and L_d be a d-dimensional linear subspace. Let Q be a D × D random orthogonal matrix drawn uniformly from the set of all D × D real orthogonal matrices. We have

P[C ∩ QL_d ≠ {0}] = 2 (v_{D−d+1}(C) + v_{D−d+3}(C) + v_{D−d+5}(C) + ...). (3.12)

The Crofton formula is easily derived from (3.1) using the fact that the kth intrinsic volume of a linear subspace L_d is 1 if d = k and 0 otherwise. The Crofton formula will be essential in estimating the probability of ε-success of (RPX).
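The Crofton formula can be verified directly in the plane: for D = 2 and d = 1 it predicts that a uniformly random line through the origin meets a planar cone of aperture β with probability 2 v_2(C) = β/π. The following Monte Carlo sketch (our own illustration; β and the seed are arbitrary assumptions) checks this:

```python
import numpy as np

rng = np.random.default_rng(3)
beta = np.pi / 3       # planar cone aperture; its top intrinsic volume is v2 = beta/(2*pi)
n = 200000

# A uniformly random 1-dimensional subspace of R^2 is a line through the
# origin with direction angle phi uniform on [0, pi).
phi = rng.uniform(0.0, np.pi, size=n)
hits = (phi <= beta).mean()   # the line meets the cone iff phi lies in [0, beta]

# Crofton formula (D = 2, d = 1): P[C meets Q L_1] = 2 * v2(C) = beta / pi
assert abs(hits - beta / np.pi) < 0.01
```

Only the alternating tail of intrinsic volumes starting at index D − d + 1 enters (3.12), which is exactly why the bounds in Section 4 are expressed through v_{D−d+1}.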
Bounding the probability of ε-success of the reduced problem (RPX)

Building on the tools developed in the last section, we can estimate the right-hand side of (2.7) in Theorem 2.4, and thereby obtain bounds on the probability of ε-success of (RPX).
Note that if p ∉ B_{ε/L}(x*), then C_p(x*) defined in (2.5) is a circular cone Circ_D(α*_p) with α*_p = arcsin(ε/(L‖x* − p‖)) that has been rotated and then translated by p; see (3.7). Therefore, the intersection p + range(A) ∩ C_p(x*) in (2.7) is that of a random d-dimensional linear subspace and the rotated circular cone, both translated by p. We can translate these 'cones' back to the origin and then, using the Crofton formula, evaluate the right-hand side of (2.7) exactly, since the expressions for the conic intrinsic volumes of the circular cone C_p(x*) are known (see (3.8), (3.10) and (3.11)). The Crofton formula and the right-hand side of (2.7) only differ in the formulation of a random linear subspace: in the former, a random linear subspace is given as QL_d, whereas in (2.7) it is represented by range(A). The following theorem states that these two representations are equivalent.

Theorem 4.1. Let A be a D × d Gaussian matrix, L_d a d-dimensional linear subspace in R^D, and Q a D × D random orthogonal matrix drawn uniformly from the set of all D × D real orthogonal matrices. Then range(A) =_law QL_d.

The transformation of (2.7) into a form suitable for the application of the Crofton formula is given in the following corollary.
Corollary 4.2. Let Assumption LipC hold. Let A be a D × d Gaussian matrix, Q be a D × D random orthogonal matrix drawn uniformly from the set of all D × D real orthogonal matrices and L_d be a d-dimensional linear subspace in R^D. Let ε > 0 be an accuracy tolerance and let p ∈ X \ G_ε be a given vector. Let Circ_D(α*_p) be the circular cone with α*_p = arcsin(ε/(L‖x* − p‖)), where x* is any global minimizer of (P) satisfying Assumption FeasBall. Then,

P[(RPX) is ε-successful] ≥ P[QL_d ∩ Circ_D(α*_p) ≠ {0}]. (4.2)

Proof. As mentioned earlier, C_p(x*) is, by definition, the circular cone Circ_D(α*_p) rotated and then translated by p; write C_p(x*) = p + R Circ_D(α*_p) for some orthogonal matrix R. From Theorem 2.4,

P[(RPX) is ε-successful] ≥ P[p + range(A) ∩ C_p(x*) ≠ {p}] = P[range(A) ∩ R Circ_D(α*_p) ≠ {0}] = P[range(A) ∩ Circ_D(α*_p) ≠ {0}] = P[QL_d ∩ Circ_D(α*_p) ≠ {0}],

where the penultimate equality follows from the orthogonal invariance of Gaussian matrices and the last equality follows from Theorem 4.1. Corollary 4.2 now allows us to use the Crofton formula to quantify the lower bound in (4.2). In the next theorem, we derive our first lower bound, which depends on the location of p in X. In particular, note that p is assumed to be at a distance at least ε/L from x*.

Theorem 4.3 (A lower bound on the success probability). Let Assumption LipC hold, let A be a D × d Gaussian matrix and ε > 0 an accuracy tolerance. Let p ∈ X \ G_ε be a given vector and let r_p := ε/(L‖x* − p‖), where x* is any global minimizer of (P) that satisfies Assumption FeasBall. Then,

P[(RPX) is ε-successful] ≥ τ(r_p, d, D), (4.4)

where the function τ(r, d, D), for 0 < r < 1 and 1 ≤ d < D, is defined as

τ(r, d, D) := 2 v_{D−d+1}(Circ_D(arcsin r)), (4.5)

whose explicit expression in terms of the general binomial coefficient (3.9) follows from the intrinsic volume formulae (3.8).

Proof. Let α*_p = arcsin(r_p) and let C denote Circ_D(α*_p) for notational convenience. First, note that by (3.8) and (3.11), 2 v_{D−d+1}(C) = τ(r_p, d, D). By (4.2) and the Crofton formula (3.12), we have

P[(RPX) is ε-successful] ≥ 2 (v_{D−d+1}(C) + v_{D−d+3}(C) + ...) ≥ 2 v_{D−d+1}(C) = τ(r_p, d, D), (4.6)

where the inequality follows from the fact that the v_k(C) are all nonnegative (see (3.2)).
Let us explain why we choose to bound the ε-success of (RPX) in (4.6) by a multiple of v_{D−d+1}(C) in particular, whereas we could have chosen any other intrinsic volume or the entire sum of these volumes. Our reason for such a choice for the lower bound is underpinned by the following observation: using the formulae (3.8) and (3.11) for the intrinsic volumes, one can show that v_{D−d+1}(C) dominates the subsequent terms v_{D−d+3}(C), v_{D−d+5}(C), ... as D grows. Therefore, approximating the sum by its leading term v_{D−d+1}(C) is reasonable for large values of D.
Given a global minimizer x* of (P) that satisfies Assumption FeasBall and a positive constant R_max, the following result provides a lower bound on the probability of ε-success of (RPX) that holds for all p ∈ X satisfying ‖x* − p‖ ≤ R_max < ∞. Note that, in contrast with the last theorem, this result holds for p arbitrarily close to x*; as such, it will be crucial to the convergence of our algorithmic proposals in Section 6. Note that there are natural ways to choose R_max in some cases:

• If a sequence of reduced problems (RPX) is being considered such that the random subspaces are drawn at the same p ∈ X, one can simply take R_max = ‖x* − p‖.

• If the sequence of reduced problems (RPX) corresponds to a bounded sequence of points p at which the subspaces are drawn, one can take any R_max that upper bounds the distances from x* to these points.

• If X is bounded, since p ∈ X and x* ∈ X, one can simply choose R_max to be the diameter of X.

Note that when X is not bounded, it is in general difficult to derive a uniform lower bound on the probability of ε-success of (RPX) that is valid for all p ∈ X (taking ‖p‖ → ∞ makes the lower bound go to zero). The first two items above provide examples of rules for selecting p that guarantee that the result below holds even when X is unbounded. Other examples are given in Section 5.
Theorem 4.4 (A uniform lower bound on the success probability). Suppose that Assumption LipC holds. Let A be a D × d Gaussian matrix, ε a positive accuracy tolerance and x* a global minimizer of (P) that satisfies Assumption FeasBall. For all p ∈ X satisfying ‖p − x*‖ ≤ R_max for some suitably chosen constant R_max, we have

P[(RPX) is ε-successful] ≥ τ(r_min, d, D), (4.7)

where τ(·, ·, ·) is defined in (4.5) and r_min := ε/(L R_max).
Proof. Let x* be a global minimizer that satisfies Assumption FeasBall, let r_p be as defined in Theorem 4.3 and let α*_p = arcsin(r_p). We consider the two cases p ∈ X \ G_ε and p ∈ G_ε separately.
First, let p be any point in X \ G_ε. Then, since ‖x* − p‖ ≤ R_max implies r_p ≥ r_min,

α*_p ≥ α*_min := arcsin(r_min). (4.8)

Now, define C_min := Circ_D(α*_min). By (4.8) and Lemma 3.7, it follows that C_min ⊆ Circ_D(α*_p). Using Corollary 4.2, we then obtain

P[(RPX) is ε-successful] ≥ P[QL_d ∩ C_min ≠ {0}] ≥ 2 v_{D−d+1}(C_min),

where the last inequality follows from the same line of argument as in (4.6). Using (3.8) and (3.11), it is easy to verify that 2 v_{D−d+1}(C_min) = τ(r_min, d, D). We have shown (4.7) for p ∈ X \ G_ε. If instead p ∈ G_ε, then (RPX) is ε-successful with probability one (take y = 0), and (4.7) holds trivially.
Unfortunately, the formula defining τ (r, d, D) is not easy to interpret. To better understand the dependence of the lower bounds (4.4) and (4.7) on the parameters of the problem, we now analyse the behaviour of τ (r, d, D) in the asymptotic regime.

Asymptotic expansions
We establish the asymptotic behaviour of τ(r, d, D) for large D. The other parameters are kept fixed, except for r, which we allow to decrease with D. Note indeed that r_p in Theorem 4.3 is inversely proportional to ‖x* − p‖, which typically increases with D. Before we begin, we first need to establish the following lemma.

Lemma 4.5. Let 0 < α < π/2 be either a fixed angle or a function of D that tends to 0 as D → ∞. Then, as D → ∞,

∫_0^α sin^D(x) dx = Θ( sin^{D+1}(α) / (D cos(α)) ).

Proof. Integration by parts with u = sin(x)/(D cos(x)) and dv = D cos(x) sin^{D−1}(x) dx gives

∫_0^α sin^D(x) dx = sin^{D+1}(α)/(D cos(α)) − I/D, where I := ∫_0^α sin^D(x)/cos^2(x) dx.

We integrate I by parts with u = sin(x)/(D cos^3(x)) and dv = D cos(x) sin^{D−1}(x) dx to obtain

I = sin^{D+1}(α)/(D cos^3(α)) − (1/D) ∫_0^α sin^D(x) (cos^2(x) + 3 sin^2(x))/cos^4(x) dx.

Since the latter integral is positive, we have

I ≤ sin^{D+1}(α)/(D cos^3(α)). (4.15)

Since I is positive for any 0 < α < π/2, (4.15) implies that I = O(sin^{D+1}(α)/D), where we used that cos(α) is bounded away from zero. Substituting this into the first display yields the claimed Θ-estimate.
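The leading-order behaviour of this integral is easy to confirm numerically. The sketch below (our own check, not from the paper) compares a trapezoidal-rule evaluation of the integral of sin(x)^D over [0, α] against the closed-form leading term sin^{D+1}(α)/((D+1) cos(α)) for growing D; the ratio approaches 1 at rate roughly tan²(α)/D:

```python
import numpy as np

def sin_power_integral(D, alpha, n=200001):
    # Trapezoidal rule for the integral of sin(x)^D over [0, alpha].
    x = np.linspace(0.0, alpha, n)
    y = np.sin(x) ** D
    return float(((y[:-1] + y[1:]) / 2.0 * np.diff(x)).sum())

alpha = 1.0   # a fixed angle in (0, pi/2)
for D in (100, 200, 400):
    leading = np.sin(alpha) ** (D + 1) / ((D + 1) * np.cos(alpha))
    ratio = sin_power_integral(D, alpha) / leading
    assert abs(ratio - 1.0) < 0.05   # the leading-order term dominates
```

This matches the lemma: the mass of sin^D concentrates near the upper endpoint α, so a single boundary term captures the integral up to a factor 1 + O(1/D).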
We establish the asymptotic behaviours of τ(r_p, d, D) and τ(r_min, d, D) by analysing the asymptotics of τ(r, d, D) defined in (4.5), and later substituting r_p and r_min for r in τ(r, d, D).

Theorem 4.6. Let τ(r, d, D) be defined in (4.5). Let d be fixed and let r be either fixed or tend to zero as D → ∞. Then τ(r, d, D) satisfies the asymptotic estimate (4.16), where the constants in Θ(·) are independent of D.
The remaining factor in this expression is bounded above and bounded away from zero by constants independent of D; thus, it can be absorbed into the constants of Θ(·).
Let us now prove (4.16) for d = 1. We decompose τ(r, 1, D) into two factors, where, by (4.18), the first factor is controlled, and, by Lemma 4.5, so is the second. By substituting (4.20) and (4.21) into this decomposition, we obtain (4.16) for d = 1. Now, to obtain the asymptotics for τ(r_p, d, D) and τ(r_min, d, D), we simply apply Theorem 4.6 with r = r_p = ε/(L‖x* − p‖) and r = r_min = ε/(L R_max), respectively.

Corollary 4.7. Asymptotically as D → ∞, keeping d, ε and L fixed and letting ‖x* − p‖ be either fixed or tend to infinity as D → ∞, the lower bounds (4.4) and (4.7) satisfy the asymptotic estimate obtained from (4.16) with r = r_p and r = r_min, respectively, where r_min = ε/(L R_max).
Proof. Note that r_p = ε/(L‖x* − p‖) is either fixed or tends to zero as D → ∞. The result then follows from Theorem 4.6.
Corollary 4.7 shows that for any p not in G_ε, the lower bounds in Theorem 4.3 and Theorem 4.4 decrease exponentially with D, which is as expected since problem (P) is generally NP-hard. Note that this decrease is slower for larger values of d or for p closer to x*, which is reassuring.

Comparing (RPX ) to simple random search
Using the above lower bounds on the probability of ε-success of the reduced problem (RPX), we now compare (RPX) to a simple random search method, to understand the relative performance of (RPX) and when it is beneficial to use it for general functions. As a baseline for comparison, we use Uniform Sampling (US) and we restrict ourselves, in this section, to the specific case X = [−1, 1]^D (as this will allow us to estimate the probability of success of US). We start off with the derivation of a lower bound for the probability of ε-success of US and the computation of its asymptotics.
Note that if a uniformly sampled point falls inside B_{ε/L}(x*) then US is ε-successful. This implies that

P[US is ε-successful] ≥ Vol(B_{ε/L}(x*)) / Vol(X) =: τ_us,

where we have used the fact that Vol(B_{ε/L}(x*)) = π^{D/2} (ε/L)^D / Γ(D/2 + 1) and that Vol(X) = 2^D. Using Stirling's approximation, it is straightforward to establish the asymptotic behaviour of the lower bound τ_us.
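The volume-ratio bound τ_us is straightforward to verify by direct sampling. The following sketch (our own check; the dimension, radius and placement of the minimizer at the centre of the box are illustrative assumptions) compares the analytic ratio with a Monte Carlo frequency in a moderate dimension:

```python
import math
import numpy as np

rng = np.random.default_rng(4)
D, r = 5, 0.9          # moderate dimension; r stands in for eps/L (illustration)
x_star = np.zeros(D)   # minimizer at the centre, so B_r(x_star) lies in [-1,1]^D

# Analytic lower bound for uniform sampling: tau_us = Vol(B_r) / Vol([-1,1]^D).
vol_ball = math.pi ** (D / 2) * r ** D / math.gamma(D / 2 + 1)
tau_us = vol_ball / 2 ** D

n = 200000
pts = rng.uniform(-1.0, 1.0, size=(n, D))
phat = (np.linalg.norm(pts - x_star, axis=1) <= r).mean()

assert abs(phat - tau_us) < 0.005   # Monte Carlo agrees with the volume ratio
```

Already at D = 5 the success probability is below 10%, and the Γ(D/2 + 1) term in the denominator makes it vanish super-exponentially as D grows, which is the behaviour the Stirling analysis quantifies.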
Let us now compare the lower bound τ_us of US to the lower bound τ(r_p, d, D) for (RPX). It is clear from the analysis of τ(r_p, d, D) in Section 4.1 that the probability of ε-success of (RPX) is higher if p is closer to the set of global minimizers. In the next theorem, we determine a threshold distance ∆_0 between p and a global minimizer x* such that τ(r_p, d, D) and τ_us are approximately equal to each other. This tells us how close p should be to x* for (RPX) to have a larger lower bound for the probability of success than that of US. The analysis is done in the asymptotic regime.
Proof. From (4.23) and (4.25), we have that if ∆_0/‖x* − p‖ → ψ > 1, then both lower and upper bounds in (4.28) tend to infinity, implying that τ(r_p, d, D)/τ_us → ∞. On the other hand, if ∆_0/‖x* − p‖ → ψ < 1, then both lower and upper bounds in (4.28) tend to zero, implying that τ(r_p, d, D)/τ_us → 0.

Theorem 4.9 tells us that the distance between p and x* (in the asymptotic setting) must be no greater than ∆_0 ≈ 0.48√D for τ(r_p, d, D) to be larger than τ_us in the case X = [−1, 1]^D. Note that, since the distance between the origin and a corner of X is equal to √D, there is no point p such that the ball of radius ∆_0 centred at p covers all points in X. In other words, in the specific case X = [−1, 1]^D, for any p in X, there always exists x* for which τ(r_p, d, D) is smaller than τ_us; on the other hand, if p = 0 and x* is close to the origin then τ(r_p, d, D) > τ_us. Note also that ∆_0 has no dependence on the embedding subspace dimension d. This is due to the asymptotic nature of the analysis: in (4.28), both inequalities depend on d, but the dependence diminishes as D → ∞ since d is kept fixed. Although the asymptotic analysis shows no significant dependence on the subspace dimension, numerical experiments show that the value of d has a notable effect on the success of (RPX). In Figure 3, we plot τ(r_p, d, D) as a function of ‖x* − p‖ for different values of d, with D fixed at 200. The lower bound τ_us of US is represented by a black horizontal line. We see that, for larger d, τ(r_p, d, D) decreases at a slower rate and has a greater threshold distance before becoming smaller than τ_us.

Remark 4.10. An important distinction must be made between the implications of the ε-success of (RPX) and the ε-success of US in solving the original problem (P). Note that the ε-success of US means that US has sampled a point that lies in G_ε, which in turn implies that US has successfully (approximately) solved (P). This is not the case for (RPX).
Recall that ε-success of (RPX) by definition means that there is an approximate solution x* to (P) that lies in the embedded d-dimensional subspace. One still needs to perform an additional global search over the subspace to locate x*. Therefore, for an entirely fair comparison between the two approaches, this additional computational cost should be taken into account.

X-REGO: an algorithmic framework for global optimization using random embeddings
This section presents the proposed algorithmic framework for global optimization using random embeddings, named X-REGO by analogy with [18] (see the Introduction for distinctions between these variants). X-REGO is a generic algorithmic framework that replaces the high-dimensional original problem (P) by a sequence of low-dimensional random problems of the form (RPX); these reduced random problems can then be solved using any global, and in practice even local, optimization solver.
Note that the kth embedding in X-REGO is determined by a realization Ã_k of a random matrix A_k. For generality of our analysis, we also assume that the parameter p in (RPX) is a random variable. The kth embedding is drawn at the point p̃_{k−1} = p_{k−1}(ω_{k−1}), a realization of the random variable p_{k−1}, assumed to have support included in X. Note that this definition includes deterministic choices of p_{k−1}, by writing it as a random variable whose support is a singleton (deterministic and stochastic selection rules for the p_k are given below).
Algorithm 1 X-Random Embeddings for Global Optimization (X-REGO) applied to (P)

Calculate ỹ_k by solving, approximately and possibly probabilistically,

    min_y f(Ã_k y + p̃_{k−1}) subject to Ã_k y + p̃_{k−1} ∈ X,    (RPX_k)

6: Choose (deterministically or randomly) p̃_k ∈ X.
As such, X-REGO can be seen as a stochastic process: in addition to p̃_k and Ã_k, each algorithm realization provides sequences x̃_k = x_k(ω_k), ỹ_k = y_k(ω_k) and f̃^k_min = f^k_min(ω_k), for k ≥ 1, that are realizations of the random variables x_k, y_k and f^k_min, respectively. To calculate ỹ_k, (RPX_k) may be solved to some required accuracy using a deterministic global optimization algorithm that is allowed to fail with a certain probability, or using a stochastic algorithm, so that ỹ_k is only guaranteed to be an approximate global minimizer of (RPX_k) (at least) with a certain probability. This allows us to account for solvers having some stochastic component (multistart methods, genetic algorithms, ...), or for deterministic solvers that may fail in some cases due, e.g., to a shortage of computational budget.

Note also that the choice of the random variable p_k and of the subspace dimension d_k provides some flexibility in the algorithm. For p_k, possibilities include:

• p_k = p: all the random embeddings explored are drawn at the same point (in case p is a fixed vector in X), or according to the same distribution (if p is a random variable);

• the sequence p_0, p_1, . . . is constructed dynamically during the optimization, e.g., based on the information gathered so far about the objective. For example, one may choose p_k = x^k_opt, where x^k_opt is the best point found up to the kth embedding.

Note that (RPX_k) is feasible for all choices of p_k (y = 0 is feasible since p̃_k ∈ X). However, it may happen that this is the only feasible point of (RPX_k); to avoid this situation, we may assume that p_k lies in the interior of X. This latter assumption is not needed for our convergence results to hold, but it is desirable from a numerical point of view.
Regarding the subspace dimension d_k, one can for example choose a fixed value based on the computational budget available for the reduced problem, or d_k can be progressively increased, using a warm start for each new embedding. We refer the reader to Section 8 for a numerical comparison of some of these strategies.
The termination in Line 2 could be set to a given maximum number of embeddings, or could check that no significant progress in decreasing the objective function has been achieved over the last few embeddings, compared to the value f (x k opt ). For generality, we leave it unspecified here.
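To make the framework concrete, here is a minimal Python sketch of the X-REGO loop in the unconstrained case X = R^D. The inner solver (a local scipy routine), the adaptive rule p_k = x^k_opt and the fixed embedding budget are illustrative placeholder choices, not the settings analysed in this paper; the function name `x_rego` and its signature are ours.

```python
import numpy as np
from scipy.optimize import minimize

def x_rego(f, D, d, n_embeddings=30, rng=None):
    """Sketch of the X-REGO loop, unconstrained case (X = R^D): repeatedly
    minimize f over the random affine subspace p + range(A) and keep the
    best point found. The inner solver here is local, so this is only a
    heuristic illustration of the framework, not the analysed variant."""
    rng = np.random.default_rng(rng)
    p = np.zeros(D)                               # p_0 (a deterministic choice)
    x_opt, f_opt = p.copy(), f(p)
    for _ in range(n_embeddings):
        A = rng.standard_normal((D, d))           # Gaussian embedding A_k
        res = minimize(lambda y, A=A, p=p: f(A @ y + p), np.zeros(d))
        x = A @ res.x + p                         # x_k = A_k y_k + p_{k-1}
        if f(x) < f_opt:
            x_opt, f_opt = x, f(x)                # track the best point x^k_opt
        p = x_opt                                 # adaptive rule p_k = x^k_opt
    return x_opt, f_opt
```

For instance, on the convex quadratic f(x) = ‖x − 1‖² in D = 20 with d = 5, a few dozen embeddings already bring the best value close to the minimum, even though each reduced problem only explores a 5-dimensional subspace.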

Global convergence of X-REGO to a set of global -minimizers
The convergence results presented in this paper extend those given in [18], in which X-REGO (with fixed subspace dimension d_k = d ≥ d_e for all k) is proven to converge for functions with low effective dimension d_e. Section 6.1 is devoted to a generic convergence analysis of X-REGO, under generic assumptions on the probability of ε-success of (RPX_k) and on the probability that the solver finds an approximate minimizer of its realisation (RPX_k), while Section 6.2 applies these results to arbitrary Lipschitz-continuous objectives, building on the results presented in the previous sections to show the validity of the ε-success assumption.

A general convergence framework for X-REGO
This section recalls results from [18] that are needed for our main convergence results in the next section. We show that x^k_opt defined in (5.2) converges to the set of ε-minimizers G_ε almost surely as k → ∞ (see Theorem 6.3). Intuitively, our proof relies on the fact that any vector x̃_k defined in (5.1) belongs to G_ε if the following two conditions hold simultaneously:

(a) the reduced problem (RPX_k) is (ε − λ)-successful in the sense of Definition 1.1, namely,

    f^k_min ≤ f* + ε − λ;    (6.1)

(b) the reduced problem (RPX_k) is solved (by a deterministic/stochastic algorithm) to an accuracy λ ∈ (0, ε) in the objective function value, namely, (6.2) holds (at least) with a certain probability.
Note that, in order to prove convergence of X-REGO to (global) ε-minimizers, the value of ε in the success probability of the reduced problem (RPX) needs to be replaced by ε − λ. This change is motivated by the fact that we allow inexact solutions (up to accuracy λ) of the reduced problem (RPX_k). We also emphasize that, following the discussion in Section 5 and for the sake of generality, the parameter p_k in (RPX_k) is now a random variable (in contrast with Section 4, where it was assumed to be deterministic). Let us introduce two additional random variables that capture the conditions in (a) and (b) above:

    R_k = 1{(RPX_k) is (ε − λ)-successful},    (6.3)
    S_k = 1{(RPX_k) is solved to accuracy λ in the sense of (6.2)},    (6.4)

where 1 is the usual indicator function for an event.
Assumption Success-Solv. For all k ≥ 1, there exists ρ_k ∈ [ρ_lb, 1], with ρ_lb > 0, such that

    E[S_k | F_{k−1/2}] = P[S_k = 1 | F_{k−1/2}] ≥ ρ_k,⁹

i.e., with (conditional) probability at least ρ_k ≥ ρ_lb, the solution y_k of (RPX_k) satisfies (6.2).¹⁰

⁷ A similar setup regarding random iterates of probabilistic models can be found in [5, 17] in the context of local optimization.
⁸ It would be possible to restrict the definition of the σ-algebra F_k so that it contains strictly the randomness of the embeddings A_i and p_i for i ≤ k; then we would need to assume that y_k is F_k-measurable, which would imply that R_k, S_k and x_k are also F_k-measurable. Similar comments apply to the definition of F_{k−1/2}.
⁹ The equality in the displayed equation follows from E[S_k | F_{k−1/2}] = 1 · P[S_k = 1 | F_{k−1/2}] + 0 · P[S_k = 0 | F_{k−1/2}].
¹⁰ In general, ρ_k will depend on the dimension d_k of the kth random embedding.
Remark 6.2. If a deterministic (global optimization) algorithm is used to solve (RPX_k), then S_k is always F_{k−1/2}-measurable and Assumption Success-Solv is equivalent to S_k ≥ ρ_k > 0. Since S_k is an indicator function, this further implies that S_k ≡ 1.
The next assumption states that the drawn subspaces are (ε − λ)-successful with positive probability.
Note that Assumption Success-Solv and Assumption Success-Emb have been slightly modified compared to [18]: here, the dimension of the reduced problem varies, so in general the probabilities of success of the solver and of the embedding depend on k as well. Under Assumption Success-Solv and Assumption Success-Emb, the following result shows the convergence of X-REGO to the set of ε-minimizers, where x^k_opt and G_ε are defined in (5.2) and (2.1), respectively; furthermore, it provides an explicit rate for any ξ ∈ (0, 1).

Proof. The proof is a straightforward extension of the one given in [18], and for completeness we include it in Appendix B.1.
Remark 6.4. If the original problem (P) is convex (and known a priori to be so), then clearly a local (deterministic or stochastic) optimization algorithm may be used to solve (RPX_k) and achieve (6.2). Apart from this important speed-up and simplification, it seems difficult at present to see how else problem convexity could be exploited in order to improve the success bounds and convergence of X-REGO.

Global convergence of X-REGO for general objectives
The previous section provides a convergence result, with an associated convergence rate, that depends on the parameters ρ_lb and τ_lb defined in Assumptions Success-Solv and Success-Emb. The former intrinsically depends on the solver used for the reduced subproblems and will not be discussed further here. The latter parameter τ_lb, however, can be estimated for general Lipschitz-continuous objectives using the results derived in Section 4.
Corollary 6.5. Suppose that Assumption LipC holds, that there exists a global minimizer x* of (P) that satisfies Assumption FeasBall (replacing ε by ε − λ in Assumption FeasBall, i.e., slightly relaxing the assumption), and that p̃_k satisfies ‖p̃_k − x*‖ ≤ R_max for all k and for some suitably chosen R_max. Suppose also that d_k ≥ d_lb for some d_lb > 0. Then Assumption Success-Emb holds with τ_lb = τ(r_min, d_lb, D), where r_min = (ε − λ)/(L R_max).
Proof. Let us first recall that, by Corollary 4.2, for all k the probability of (ε − λ)-success of (RPX_k) is bounded below by the probability that a uniformly random subspace Q L_{d_k} intersects the circular cone Circ_D(α*_{p_{k−1}}), where Q is a D × D random orthogonal matrix drawn uniformly from the set of all D × D real orthogonal matrices, L_{d_k} is a d_k-dimensional linear subspace, and α*_{p_{k−1}} := arcsin((ε − λ)/‖x* − p_{k−1}‖). Let α*_min := arcsin((ε − λ)/(L R_max)), and note that α*_min ≤ α*_{p_{k−1}} for all k. By Lemma 3.7, for any α*_min ≤ α ≤ π/2 there holds Circ_D(α*_min) ⊆ Circ_D(α), so that the bound with α*_min holds uniformly in k. By the Crofton formula, the resulting probability can be expressed in terms of the intrinsic volumes of Circ_D(α*_min). By [3, Prop. 5.9], h_k ≥ h_{k+1} for all k = 0, . . . , D − 1. Using the fact that the intrinsic volumes are all non-negative, and the definition of h_k, we deduce the lower bound τ(r_min, d_lb, D). Finally, rewriting this bound in terms of conditional expectation shows that (6.5) in Assumption Success-Emb holds.
We now estimate the rate of convergence of X-REGO for Lipschitz continuous functions using the estimates for τ provided in Corollary 4.7.
Theorem 6.6. Suppose that Assumptions LipC and Success-Solv hold, that there exists a global minimizer x* of (P) that satisfies Assumption FeasBall (replacing ε by ε − λ in Assumption FeasBall), and that p̃_k satisfies ‖p̃_k − x*‖ ≤ R_max for all k and for some suitably chosen R_max. Suppose also that d_k ≥ d_lb for some d_lb > 0. Then x^k_opt defined in (5.2) converges to the set of ε-minimizers G_ε almost surely as k → ∞, at the rate obtained by combining Theorem 6.3 with the estimates of Corollary 4.7.

Proof. The result follows from Theorem 6.3, Corollary 6.5 and Corollary 4.7.

Ensuring boundedness of p̃_k
So far, our convergence results rely on the assumption that, for each k, ‖p̃_k − x*‖ ≤ R_max for some suitably chosen R_max and for some global minimizer x* surrounded by a ball of radius (ε − λ) of feasible solutions, see Assumption FeasBall. We show in this section that the following strategies for choosing the random variable p_k guarantee that x^k_opt defined in (5.2) converges to the set of ε-minimizers G_ε almost surely as k → ∞.
1. p_k is deterministic and does not vary with k (e.g., p_k = 0 for all k).
3. p_k is any random variable with support contained in X, and X is bounded.
4. p_k is a random variable defined as p_k = x^k_opt, where x^k_opt is the best point found over the first k embeddings, see (5.2), and the objective is coercive.
Note that, for strategies 1, 2 and 3, the validity of Theorem 6.6 follows simply from the triangle inequality ‖p_k − x*‖ ≤ ‖p_k‖ + ‖x*‖ and the fact that p_k is bounded. We prove next that x^k_opt defined in (5.2) converges to the set of ε-minimizers G_ε almost surely as k → ∞ for strategy 4, provided the objective is coercive.

Assumption 6.7 (Coerciveness, see [6]). When X is unbounded, the (continuous) function f : X → R in (P) satisfies lim_{‖x‖→∞} f(x) = ∞. (6.9)

Corollary 6.8. Let Assumption 6.7 hold, and let x* be a global minimizer of (P). Let p_k = x^k_opt for k ≥ 1, with x^k_opt defined in (5.2), and let p_0 ∈ X be such that f(p_0) < ∞. Then there exists R_max such that ‖p_k − x*‖ ≤ R_max for all k.

Proof. Note that the sequence (f(p_k))_{k=0,1,2,...} is non-increasing by definition of the random variable x^k_opt, so that f(p_k) ≤ f(p_0) for all k. By coerciveness of f, there exists R < ∞ such that, for any deterministic vector y ∈ X, ‖y‖ > R implies f(y) > f(p_0). We deduce that ‖p_k‖ ≤ R for all k, so that ‖p_k − x*‖ ≤ ‖p_k‖ + ‖x*‖ ≤ R + ‖x*‖. The result follows by setting R_max = R + ‖x*‖.

Corollary 6.9. Suppose that Assumptions LipC, Success-Solv and 6.7 hold, that there exists a global minimizer x* of (P) that satisfies Assumption FeasBall (replacing ε by ε − λ in Assumption FeasBall), and that d_k ≥ d_lb for some d_lb > 0. Let p_k = x^k_opt for k ≥ 1, with x^k_opt defined in (5.2), and let p_0 ∈ X be such that f(p_0) < ∞. Then x^k_opt converges to the set of ε-minimizers G_ε almost surely as k → ∞, and there exists R_max such that ‖p_k − x*‖ ≤ R_max for all k.

Proof. The result follows from Theorem 6.6 and Corollary 6.8.

Applying X-REGO to functions with low effective dimensionality
The recent works [15, 18] explore random-embedding algorithms for functions with low effective dimension, which only vary over a subspace of dimension d_e < D, and address respectively the cases X = R^D and X = [−1, 1]^D. Both papers assume that the dimension d of the random subspace in (RPX) equals or exceeds the effective dimension d_e, and derive bounds on the probability that (RPX) is ε-successful in that setting; these bounds are then used to prove convergence of the respective random-subspace methods. For the remainder of this paper, we explore the use of X-REGO for unconstrained global optimization of functions with low effective dimension, for any random subspace dimension d, thus removing the assumption d ≥ d_e. To prove convergence of X-REGO in that setting, we rely on the results derived in Section 4.

Definitions and existing results
Definition 7.1 (Functions with low effective dimensionality, see [66]). A function f : R^D → R has effective dimension d_e < D if there exists a linear subspace T of dimension d_e such that, for all vectors x_⊤ in T and x_⊥ in T^⊥ (the orthogonal complement of T), we have

    f(x_⊤ + x_⊥) = f(x_⊤),    (7.1)

and d_e is the smallest integer satisfying (7.1).
The linear subspaces T and T^⊥ are named the effective and constant subspaces of f, respectively. In this section, we make the following assumption on the function f.

Assumption LowED. The function f : R^D → R has effective dimension d_e, with effective subspace T and constant subspace T^⊥ spanned by the columns of the orthonormal matrices U ∈ R^{D×d_e} and V ∈ R^{D×(D−d_e)}, respectively. We write x_⊤ = UU^T x and x_⊥ = VV^T x for the unique Euclidean projections of any vector x ∈ R^D onto T and T^⊥, respectively.
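For intuition, a function satisfying Assumption LowED can be generated numerically as f(x) = g(U^T x) for an arbitrary low-dimensional function g. The sketch below (dimensions and the choice of g are arbitrary choices of ours) also verifies the invariance (7.1) along the constant subspace:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d_e = 10, 3

# Orthonormal bases of the effective subspace T (columns of U) and of the
# constant subspace T^perp (columns of V), via QR of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((D, D)))
U, V = Q[:, :d_e], Q[:, d_e:]

g = lambda z: float(np.sum(z ** 2) + np.sum(np.sin(z)))  # arbitrary d_e-dim function
f = lambda x: g(U.T @ x)                                 # f varies only over T

# Invariance (7.1): perturbing x along T^perp leaves f unchanged.
x = rng.standard_normal(D)
h = rng.standard_normal(D - d_e)
assert np.isclose(f(x), f(x + V @ h))
```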
As discussed in [18], functions with low effective dimension have the convenient property that their global minimizers are not isolated: to any global minimizer x* of (P), with Euclidean projection x*_⊤ on the effective subspace T, one can associate a subspace G* on which the objective reaches its minimal value; indeed, writing G* := {x*_⊤ + V h : h ∈ R^{D−d_e}}, every point of G* has objective value f* by Assumption LowED. In the case d ≥ d_e, the following result, derived in [66], says that the reduced problem (RPX) is successful with probability one.
Theorem 7.2 (see [66, Theorem 2] and [52, Rem. 2.22]). Let X = R^D and let Assumption LowED hold. Let A be a D × d Gaussian matrix with d ≥ d_e, and let p ∈ R^D. Then, for any fixed x ∈ R^D, with probability one there exists y ∈ R^d such that f(x) = f(Ay + p). In particular, for any global minimizer x* of (P), with probability one there exists y* ∈ R^d such that f(Ay* + p) = f(x*) = f*.
Thus, in the unconstrained case X = R D , the solution of a single reduced problem (RPX ) with subspace dimension d ≥ d e provides an exact global minimizer of the original problem (P) with probability one. In the next section, we address the case d < d e .
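The mechanism behind Theorem 7.2 can be checked on a toy quadratic with known minimum f* = 0: for d = d_e, the matrix B = U^T A is square and almost surely invertible, so the reduced problem can be solved in closed form (a linear solve) and attains f*. All names and dimensions below are illustrative choices of ours, assuming exact solution of the reduced problem.

```python
import numpy as np

rng = np.random.default_rng(1)
D, d_e = 50, 4

Q, _ = np.linalg.qr(rng.standard_normal((D, D)))
U = Q[:, :d_e]                                   # basis of the effective subspace
c = rng.standard_normal(d_e)
f = lambda x: float(np.sum((U.T @ x - c) ** 2))  # f* = 0, low effective dimension

d = d_e                                          # embedding dimension d >= d_e
A = rng.standard_normal((D, d))                  # Gaussian embedding
p = rng.standard_normal(D)
B = U.T @ A                                      # d_e x d Gaussian, a.s. invertible
y = np.linalg.solve(B, c - U.T @ p)              # exact reduced-problem solution
assert f(A @ y + p) < 1e-8                       # the embedded problem attains f*
```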

Probability of success of the reduced problem for lower dimensional embeddings
Unfortunately, Theorem 7.2 crucially depends on the assumption d ≥ d_e. When d < d_e, we quantify instead the probability that the random embedding contains a (global) ε-minimizer. Similarly to the definition of G* above, one may associate to any global minimizer x* of (P) a connected set G*_ε of ε-minimizers. Denoting the Euclidean projection of x* on the effective subspace by x*_⊤, under Assumption LipC (Lipschitz continuity of f), G*_ε is the Cartesian product of a d_e-dimensional ball (contained in the effective subspace) with the constant subspace T^⊥ (see Assumption LowED):

    G*_ε := {x*_⊤ + U g + V h : g ∈ R^{d_e}, ‖g‖ ≤ ε/L, h ∈ R^{D−d_e}},    (7.3)

where L is the Lipschitz constant of f. Indeed, let x := x*_⊤ + U g + V h ∈ G*_ε for some g ∈ R^{d_e} satisfying ‖g‖ ≤ ε/L and some h ∈ R^{D−d_e}. Then f(x) = f(x*_⊤ + U g) by Assumption LowED, since V h ∈ T^⊥. By Lipschitz continuity of f, we get f(x) ≤ f(x*_⊤) + L‖U g‖ = f* + L‖g‖ ≤ f* + ε. As already discussed in Section 2, the reduced problem (RPX) is ε-successful if the random subspace p + range(A) intersects the set of approximate global minimizers, which contains the connected component G*_ε defined in (7.3) for any global minimizer x* of (P). Figure 4 shows an abstract representation of the situation where the random subspace p + range(A_1) intersects the connected component G*_ε; the corresponding embedding is therefore ε-successful. Conversely, the random subspace p + range(A_2) does not intersect G*_ε; if G*_ε = G_ε defined in (2.1), this implies that the corresponding embedding is not ε-successful.

Figure 4: Abstract illustration of embeddings (successful and unsuccessful) for functions with low effective dimension. The reduced problem is ε-successful if the random subspace intersects the connected component G*_ε.
The following result further characterizes the probability of success of (RPX).

Theorem 7.3. Let X = R^D, and let Assumptions LipC and LowED hold. Let A be a D × d Gaussian matrix, p ∈ R^D a fixed vector, ε > 0 an accuracy tolerance, and x* any global minimizer of (P) with associated connected component G*_ε as in (7.3). Then

    P[(RPX) is ε-successful] ≥ P[{p + range(A)} ∩ G*_ε ≠ ∅] = P[{U^T p + range(B)} ∩ B_{ε/L}(U^T x*) ≠ ∅],

where U is an orthonormal matrix whose columns span the effective subspace T (see Assumption LowED), B := U^T A is a d_e × d Gaussian matrix, and B_{ε/L}(U^T x*) is the d_e-dimensional ball of radius ε/L centered at U^T x*.
Proof. The first inequality simply follows from (2.2) and from the fact that G*_ε ⊆ G_ε, see (7.4). For the second relationship, note that the matrix Q := [U V] (with V defined in Assumption LowED) is orthogonal, so that applying Q^T to Ay + p leaves norms unchanged for all y ∈ R^d. Writing B := U^T A ∈ R^{d_e×d} and C := V^T A ∈ R^{(D−d_e)×d}, we get, for any global minimizer x* of (P) (with associated Euclidean projection x*_⊤ on the effective subspace), that, by definition of G*_ε, Ay + p ∈ G*_ε if and only if By + U^T p ∈ B_{ε/L}(U^T x*). By Theorem A.2, B is a d_e × d Gaussian matrix, which completes the proof.

The probability of ε-success of (RPX) can thus be bounded below by the probability that the d-dimensional random subspace range(B) + U^T p intersects the ball B_{ε/L}(U^T x*) in R^{d_e}. We now estimate the latter probability using the conic integral geometry results presented in Section 3 and Section 4; the next two results can be seen as immediate counterparts of the results derived there, with the function τ(r, d, d_e), 0 < r < 1, defined in (4.5).
Proof. The result is a direct extension of the analysis made in Section 4.

Similarly to Theorem 4.4, the next result provides a uniform lower bound on the probability of ε-success of (RPX).
Corollary 7.5. Let X = R^D, and suppose that Assumptions LipC and LowED hold, with effective dimension d_e > d. Let ε > 0 be an accuracy tolerance, A a D × d Gaussian matrix, and x* any global minimizer of (P). Let p ∈ R^D be a given vector that satisfies ‖U^T p − U^T x*‖ ≤ R_max for some suitably chosen R_max, and let r^eff_min := ε/(L R_max). Then

    P[(RPX) is ε-successful] ≥ τ(r^eff_min, d, d_e),    (7.8)

where the function τ(r, d, d_e), 0 < r < 1, is defined in (4.5).
Proof. The result is a direct adaptation of Theorem 4.4, replacing A by B, x* by U^T x*, p by U^T p, and D by d_e.
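The intersection probability bounded below by τ(r, d, d_e) lends itself to a simple Monte Carlo sanity check: in R^{d_e}, draw a random d-dimensional subspace through U^T p (taken here as the origin) and test whether it passes within distance r = ε/(L‖U^T x* − U^T p‖) of the normalized target point. The sketch below only illustrates the qualitative behaviour (success probability growing with d); it does not implement the formula (4.5), and all names and parameters are our own choices.

```python
import numpy as np

def success_prob(d, d_e, r, trials=10000, seed=0):
    """Monte Carlo estimate of the probability that a uniformly random
    d-dimensional linear subspace of R^{d_e} (through the origin, standing
    in for the point U^T p) passes within distance r of a target point at
    unit distance (standing in for the ball centre U^T x*)."""
    rng = np.random.default_rng(seed)
    q = np.zeros(d_e)
    q[0] = 1.0                                   # target at unit distance
    hits = 0
    for _ in range(trials):
        B = rng.standard_normal((d_e, d))        # range(B): random subspace
        Qb, _ = np.linalg.qr(B)                  # orthonormal basis of range(B)
        dist = np.linalg.norm(q - Qb @ (Qb.T @ q))
        hits += dist <= r
    return hits / trials

p1 = success_prob(d=1, d_e=5, r=0.3)             # very low embedding dimension
p4 = success_prob(d=4, d_e=5, r=0.3)             # close to the effective dimension
```

With these (arbitrary) parameters, the estimate for d = 4 is far larger than for d = 1, consistent with the monotone effect of the subspace dimension observed in Section 4.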
Note that adding constraints (setting X ⊂ R^D) makes the analysis much more complicated: even if a random subspace {p + range(A)} intersects G*_ε, the intersection may lie outside the feasible domain. We therefore restrict ourselves to the unconstrained case in this paper.

X-REGO for functions with low effective dimension
We present an X-REGO variant dedicated to the optimization of functions with low effective dimension. The algorithm starts by exploring an embedding of low dimension d_lb, assuming d_lb ≤ d_e, and the dimension is progressively increased until it captures the effective dimension of the problem, see Algorithm 2. Note that Lines 3 to 6 are exactly the same as in Algorithm 1. Recall that Theorem 7.2 guarantees that the algorithm finds the global minimum of (P) with probability one if the reduced problem is solved exactly and if d_k ≥ d_e, so in this ideal case we can terminate the algorithm after d_e − d_lb + 1 random embeddings; thus, Algorithm 2 terminates after finitely many random embeddings. Since the effective dimension is unknown in practice, we typically terminate the algorithm when no progress is observed in the objective value; see Section 8 for numerical illustrations.
Algorithm 2 X-REGO for (P) when f has low effective dimension
1: Initialize d_1 = d_lb for some d_lb ≥ 1 and p̃_0 ∈ X
2: for k ≥ 1 until termination do
3:     Run lines 3 to 6 in Algorithm 1.
4:     Set d_{k+1} = d_k + 1.
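A minimal Python sketch of Algorithm 2 in the unconstrained case, with d_lb = 1, the dimension increased by one per embedding, and a stagnation test of the kind used in Section 8 for termination. The local inner solver, the threshold gamma and the adaptive choice of p_k are illustrative assumptions of ours, not the paper's exact settings.

```python
import numpy as np
from scipy.optimize import minimize

def x_rego_low_eff(f, D, gamma=1e-6, rng=None):
    """Sketch of Algorithm 2 with d_lb = 1: the embedding dimension grows by
    one per embedding; stop when the optimal value of the reduced problem
    stagnates (difference at most gamma), and estimate the effective
    dimension as the previous embedding dimension."""
    rng = np.random.default_rng(rng)
    p = np.zeros(D)                               # p_0 (a deterministic choice)
    history = []
    for d in range(1, D + 1):                     # d_k = k since d_lb = 1
        A = rng.standard_normal((D, d))           # Gaussian embedding of dim d_k
        res = minimize(lambda y, A=A, p=p: f(A @ y + p), np.zeros(d))
        x = A @ res.x + p
        history.append(f(x))
        if d >= 2 and abs(history[-1] - history[-2]) <= gamma:
            return x, history[-1], d - 1          # d_est = k_f - 1
        if f(x) < f(p):
            p = x                                 # adaptive rule p_k = x^k_opt
    return p, f(p), D
```

On a toy objective with effective dimension 3, the loop typically stops shortly after the embedding dimension passes d_e, returning both a near-optimal point and an estimate of the effective dimension.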

Convergence of X-REGO for functions with low effective dimension
Similarly to Section 5 and Section 6, for each k, p_k is a random variable; the particular case of a deterministic p_k is represented by a random variable whose support is a singleton. To prove convergence of Algorithm 2 to an ε-minimizer while allowing the reduced problems to be solved approximately, we again require the reduced problems to be (ε − λ)-successful, see Assumption Success-Solv and Assumption Success-Emb. Note that, unlike Section 6.2, the results below are finite-termination results, as we know that with an ideal solver Algorithm 2 finds an ε-minimizer after at most d_e − d_lb + 1 embeddings. Let us first show that Assumption Success-Emb holds, and derive the value of τ_lb.

Corollary 7.6. Suppose that X = R^D, that Assumptions LipC and LowED hold, that p̃_k satisfies ‖U^T p̃_k − U^T x*‖ ≤ R_max for all k and for some suitably chosen R_max, and that d_lb < d_e. Then Assumption Success-Emb holds with τ_lb = τ(r^eff_min, d_lb, d_e), where r^eff_min = (ε − λ)/(L R_max) and τ(·, ·, ·) is defined in (4.5).

Proof. For all embeddings such that d_k < d_e, the proof is the same as for Corollary 6.5, replacing D by d_e and r_min by r^eff_min. If d_k ≥ d_e, (RPX) is successful with probability one according to Theorem 7.2. The result then follows simply from the fact that 1 ≥ τ(r^eff_min, d_lb, d_e) (see the Gauss-Bonnet formula (3.4), and the fact that the intrinsic volumes are non-negative).
The following result proves convergence of Algorithm 2 to the set of ε-minimizers almost surely after at most d_e − d_lb + 1 random embeddings. Note in particular that this convergence result has no dependence on D.
Corollary 7.7 (Global convergence of X-REGO for functions with low effective dimension). Suppose that X = R^D, that Assumptions LipC, Success-Solv and LowED hold, and that p̃_k satisfies ‖U^T p̃_k − U^T x*‖ ≤ R_max for all k and for some suitably chosen R_max. Suppose also that d_lb ≤ d_e, let ε > 0 be an accuracy tolerance, and let k_max = d_e − d_lb + 1 be the index of the first embedding with dimension d_e. Then

    P[f(x^{k_max}_opt) ≤ f* + ε] ≥ ρ_{k_max},

where x^k_opt is defined in (5.2) and ρ_k is the probability of success of the solver for (RPX_k) (see Assumption Success-Solv). In particular, if the reduced problem is solved exactly, then f(x^{k_max}_opt) ≤ f* + ε with probability one. For 1 ≤ k < k_max, we have

    P[f(x^k_opt) ≤ f* + ε] ≥ 1 − (1 − τ_lb ρ_lb)^k,

where τ_lb = τ(r^eff_min, d_lb, d_e), with τ(·, ·, ·) defined in (4.5) and r^eff_min = (ε − λ)/(L R_max).
Proof. By Corollary 7.6, Theorem 6.3 applies. However, since we are interested in finite-termination results, we do not use Theorem 6.3 directly; instead, we extract the following claim from its convergence proof, see (B.6): for all K ≥ 1, the probability that x^K_opt fails to be an ε-minimizer is at most the product over k ≤ K of the failure probabilities of the individual embeddings. It follows that, for 1 ≤ K < k_max, this product is at most (1 − τ_lb ρ_lb)^K, where τ_lb and ρ_lb are computed/defined in Corollary 7.6 and Assumption Success-Solv, respectively. Finally, if K ≥ k_max, then d_K ≥ d_e, so that the probability that (RPX_K) is (ε − λ)-successful is equal to one according to Theorem 7.2; the corresponding factor thus reduces to 1 − ρ_K, which yields the first claim and concludes the proof.

Numerical experiments
Let us illustrate the behavior of X-REGO on a set of benchmark global optimization problems whose objectives have low effective dimension. We show empirically that Algorithm 2 simultaneously manages to estimate the effective dimension of the problem accurately and to outperform significantly (especially in the high-dimensional regime) the 'no-embedding' framework, in which the original problem (P) is solved directly, with no exploitation of the special structure.

Setup
Test set. Our synthetic test set is very similar to the one we used in [15,18], and contains a set of benchmark global optimization problems adapted to have low effective dimensionality in the objective as explained in Appendix C. Our test set is made of 18 D-dimensional functions with low effective dimension, with D = 10, 100 and 1000. These D-dimensional functions are constructed from 18 low-dimensional global optimization problems with known global minima (some of which are in the Dixon-Szego test set [23]), by artificially adding coordinates and then applying a rotation so that the effective subspace is not aligned with the coordinate axes.
Solver. The reduced problems are solved using the KNITRO solver [13]. Note that, by default, KNITRO is a local solver, but it switches to a global solver when its multistart feature is activated. We therefore consider three KNITRO-type solvers: local KNITRO (no multistart, referred to as KNITRO), and multistart KNITRO with a low/high number of starting points (cheap and expensive versions of multistart KNITRO, referred to here as ch-mKNITRO and exp-mKNITRO, respectively). The higher the number of starting points, the more likely the solver is to find a global minimizer of the reduced problem. See Table 1 for a detailed description of the settings of the different solvers.
Algorithms using a global solver (ch-mKNITRO and exp-mKNITRO). We test two different instances of the algorithmic framework presented in Algorithm 2 against the no-embedding framework, in which (P) is solved directly, without any random embedding and with no explicit exploitation of its special structure. For each instance, we let d_lb = 1.

[Table 1: solver settings. KNITRO: default options. ch-mKNITRO: default options plus ms_enable = 1, ms_bndrange = 2. exp-mKNITRO: default options plus ms_enable = 1, ms_bndrange = 2, and ms_maxsolves = min(100, 2d_k) for (RPX_k), ms_maxsolves = min(100, 2D) for (P).]

Since the effective dimension of the problem is assumed to be unknown, termination in Algorithm 2 is defined as the first embedding at which stagnation is observed in the computed optimal cost of the reduced problem (RPX_k), or, failing that, when d_k = D. Objective stagnation is measured as follows: stop after k_f embeddings, where k_f is the smallest k ≥ 2 that satisfies (8.1). If k_f ≤ D, we let d^est_e := k_f − 1 be our estimate of the effective dimension of the problem. Indeed, by Theorem 7.2, two random reduced problems of dimension d and d + 1 with d ≥ d_e have the same optimal cost with probability one, so that the left-hand side of (8.1) would be zero if the reduced problems were solved exactly (i.e., under the assumption of an ideal solver). We argue that, on the other hand, it is very unlikely that two random reduced problems of dimension d and d + 1 with d < d_e have the same optimal cost. We therefore terminate the algorithm after either k = k_f random embeddings (if there exists k_f ≤ D satisfying (8.1)), or else after k = D random embeddings. Regarding the choice of p_k, we consider two possibilities: either p_k is a vector that does not depend on k, or p_k is the best point found over the first k embeddings (i.e., p_k = x^k_opt).
Algorithms relying on a local solver (KNITRO) and a resampling strategy. We also compare several instances of Algorithm 2 with the no-embedding framework when the reduced problem is solved using a local solver. Note that, due to the possible nonconvexity of the problem, running a local solver on the original problem is not expected to find the global minimizer; results combining the no-embedding framework with a local solver are thus reported only for comparison. Recall also that our convergence analysis requires the solver to find an approximate global minimizer of the subproblem with sufficiently high probability. We show numerically that local solvers can be used when the points p_k are suitably chosen to globalize the search; we typically let p_k, for some indices k, be a random variable with a support large enough to contain a global minimizer of (P). As with global solvers, let k_f be the smallest k ≥ 2 that satisfies (8.1) and, if k_f ≤ D, let d^est_e := k_f − 1 be our estimate of the effective dimension of the problem. However, since the solver is local, we cannot assume that (8.1) implies that we have found an approximate global minimizer of the original problem (P). We therefore continue the optimization with fixed subspace dimension, d_k = d^est_e for all k > k_f, choosing p_k so that the next random subspace may leave the basin of attraction of the current local minimizer. To safeguard against local solutions, we use a stronger stopping criterion: the algorithm is stopped either after D embeddings, or earlier, when k > k_f and the computed optimal cost of the reduced problem has not changed significantly over the last n_stop random embeddings. In our experiments, we considered two possibilities: n_stop = 3 or n_stop = 5.
Here again, we consider two main strategies for choosing p k : either p k does not depend on k (e.g., p k is an identically distributed random variable, for all k), or p k is the best point found over the past embeddings (p k = x k opt ), resampling p k at random in a sufficiently large domain for some values of k, see below.
Summary of the algorithms: In total, we compare four instances of Algorithm 2, corresponding to specific choices of p_k, k ≥ 0, and to the use of a local or global solver.
- Adaptive X-REGO (A-REGO). In Algorithm 2, the reduced problem is solved using a global solver, and the point p_k is chosen as the best point found up to the kth embedding: p_k := x^k_opt, see (5.2).

- Local adaptive X-REGO (LA-REGO). In Algorithm 2, the reduced problem (RPX_k) is solved using a local solver (instead of a global one as in A-REGO). Until we find the effective dimension (i.e., for k < k_f), we use the same update rule for p_k as in A-REGO: p_k := Ã_k ỹ_k + p̃_{k−1}. For the remaining embeddings, the point p_k is chosen as follows: p_k = Ã_k ỹ_k + p̃_{k−1} if |f(Ã_k ỹ_k + p̃_{k−1}) − f(p̃_{k−1})| > γ = 10^{−5}, and p_k is drawn uniformly in [−1, 1]^D otherwise, to compensate for the local behavior of the solver.
- Nonadaptive X-REGO (N-REGO). In Algorithm 2, the reduced problem is solved globally, and all the random subspaces are drawn at the same fixed point: p_k = a. The fixed value a is defined as a realization of a random variable uniformly distributed in [−1, 1]^D.

- Local nonadaptive X-REGO (LN-REGO). In Algorithm 2, the reduced problem (RPX_k) is solved using a local solver. Until we find the effective dimension (i.e., for k < k_f), we set p_k = a, with a as in N-REGO. For k ≥ k_f, p_k is a random variable distributed uniformly in [−1, 1]^D (resampled at each embedding), to compensate for the local behavior of the solver.
Note that, regarding the choice of p_k when using a local solver, we typically have two phases. In the first phase, we apply the same selection rules for p_k, k < k_f, as when using a global solver. For k ≥ k_f, we allow resampling to prevent the algorithm from being trapped at a local minimizer. We do not introduce resampling in the first phase because the added stochasticity would affect the criterion (8.1) and hence our estimate of the effective dimension of the problem.
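The two-phase update of p_k under a local solver (the LA-REGO rule above) can be sketched as follows (our own illustration; the helper name next_p and its signature are hypothetical, and γ = 10^{−5} is taken from the text):

```python
import numpy as np

def next_p(p_prev, A_k, y_k, f, k, k_f, D, gamma=1e-5, rng=None):
    """Sketch of the LA-REGO update for p_k described above.

    Phase 1 (k < k_f): adaptive rule p_k = A_k y_k + p_{k-1}, as in A-REGO.
    Phase 2 (k >= k_f): keep the step only if it changed the objective by
    more than gamma; otherwise resample p_k uniformly in [-1, 1]^D to
    compensate for the local behavior of the solver.
    """
    rng = np.random.default_rng() if rng is None else rng
    candidate = A_k @ y_k + p_prev  # best point found by the kth subproblem
    if k < k_f or abs(f(candidate) - f(p_prev)) > gamma:
        return candidate
    return rng.uniform(-1.0, 1.0, size=D)  # resample to globalize the search
```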
Experimental setup. For each algorithm described above, we solve the entire test set 3 times to estimate the average performance of the algorithms, and record the computational cost, which we measure in terms of function evaluations (the termination criterion is described above). Note that from the four algorithms described above we obtain six different algorithms, since A-REGO and N-REGO are each endowed with two different global solvers, exp-mKNITRO and ch-mKNITRO, corresponding respectively to a low and a large number of starting points. To compare with 'no-embedding', we solve the full-dimensional problem (P) directly with the corresponding solver, with no use of random embeddings. The budget and termination criteria used to solve (RPX_k) within X-REGO, or to solve (P) in the 'no-embedding' framework, are the default ones, summarized in Table 1. If A fails to converge to an ε-minimizer of f, with ε = 10^{−3}, within the maximum computational budget, we set N_f(A) = ∞. We further define N*_f as the minimal computational cost required by any algorithm to optimize f. We normalize all the computational costs by N*_f and, for each A, we plot a function π_A(α) that computes the proportion of f's in the test set S for which the normalized computational effort spent by A was at most α. Mathematically speaking, π_A(α) = |{f ∈ S : N_f(A) ≤ α N*_f}| / |S|, where | · | denotes the cardinality of a set. The algorithm A is considered to have achieved better performance if it produces higher values of π_A(α) for lower values of α, i.e., on the figures, the curve π_A(α) is higher and further to the left.
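The performance profiles π_A(α) described above can be computed as in the following sketch (our own illustration; the helper name performance_profile is hypothetical, and failures are encoded as infinite cost as in the text):

```python
import numpy as np

def performance_profile(costs, alphas):
    """Sketch of the performance profile described above.

    costs[a] holds the computational costs N_f(A) of algorithm a over the
    test set (np.inf for failures); pi_A(alpha) is the fraction of problems
    whose cost, normalized by the best cost N*_f over all algorithms, is at
    most alpha.
    """
    cost_matrix = np.array(list(costs.values()), dtype=float)  # algorithms x problems
    best = cost_matrix.min(axis=0)                             # N*_f, per problem
    profiles = {}
    for name, c in costs.items():
        ratios = np.array(c, dtype=float) / best               # normalized effort
        profiles[name] = [float(np.mean(ratios <= a)) for a in alphas]
    return profiles
```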

Numerical results
Comparison of X-REGO with the no-embedding framework. The comparison between the above-mentioned instances of X-REGO and the no-embedding framework is given in Figure 5. A-REGO and N-REGO clearly outperform the no-embedding framework in terms of accuracy versus computational cost, especially for large D. Reducing the number of starting points in the multistart strategy (i.e., replacing exp-mKNITRO by ch-mKNITRO) further improves performance significantly, though the total proportion of problems ultimately solved is slightly decreased compared to exp-mKNITRO. Note also that the instances using a local solver (LA-REGO and LN-REGO) outperform both global X-REGO instances and the no-embedding framework, especially for large D. They find the global minimizer in a significantly higher proportion of subproblems than when directly addressing the original high-dimensional problem with the local solver: the resampling strategy for p_k described above helps to globalize the search. Table 2 contains the average, over the test problems, of the number of embeddings used per algorithm; note that for (approximately) global solvers, and especially using p_k = x_opt^k, the average number of embeddings is very close to the ideal k_f. Indeed, the average effective dimension on our problem set is 3.7, so the ideal average number of embeddings should be 4.7, as we need one additional embedding for the stopping criterion (8.1) to be satisfied. For local solvers, the average number of embeddings is slightly higher, due to the need to resample candidate solutions to globalize the search and to the stronger stopping criterion.

Estimation of the effective dimension. As described earlier, instances of X-REGO naturally provide an estimate d_e^est of the effective dimension of the problem: d_e^est = k_f − 1, where k_f is the smallest integer that satisfies (8.1). If no k_f ≤ D satisfies (8.1), we set d_e^est = D.
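The estimate d_e^est = k_f − 1 can be sketched as follows (a minimal sketch: we assume criterion (8.1) amounts to the best objective value no longer improving significantly between consecutive embeddings of increasing dimension, which is our reading of the text; the tolerance is hypothetical):

```python
def estimate_effective_dimension(best_values, D, tol=1e-5):
    """Sketch of the effective-dimension estimate described above.

    best_values[k-1] is the best objective value found after embedding k,
    whose subspace dimension grows with k. Returns d_est = k_f - 1 for the
    smallest k_f >= 2 satisfying our assumed form of (8.1), or D if no such
    k_f <= D exists.
    """
    for k in range(2, min(D, len(best_values)) + 1):
        if abs(best_values[k - 1] - best_values[k - 2]) <= tol:  # assumed (8.1)
            return k - 1
    return D
```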
For several instances of Algorithm 2, Table 3 reports the number of problems of the data set on which d_e^est ∈ [d_e, d_e + 2], where d_e is the exact effective dimension of the problem, for D = 10, D = 100 and D = 1000. Typically, adaptive choices of p_k result in a slightly larger estimate of the effective dimension; we also note that, when p_k is chosen adaptively, a local solver is comparable to a global one in its ability to estimate the effective dimension on this problem set, and significantly worse otherwise. The values given in Table 3 have been averaged over three independent runs of our experiment, on the whole dataset, to account for randomness in the algorithms.

What if we know the effective dimension of the problem? In the favorable situation where the effective dimension d_e of each problem is known, we can set d_lb = d_e in Algorithm 2, and theoretically, for an ideal global solver, Algorithm 2 is guaranteed to solve the original problem exactly using one embedding. Figure 6 explores numerically the validity of this claim. We compare several instances of X-REGO with corresponding counterparts in which the effective dimension is known. When using an (approximately) global solver (ch-mKNITRO or exp-mKNITRO), we stop Algorithm 2 after one embedding of dimension d_e. When the solver is local (KNITRO), we let Algorithm 2 explore several embeddings of dimension d_e, and stop the algorithm when (8.2) is satisfied with n_stop = 3, or otherwise after 50 embeddings. Figure 6 shows the corresponding performance profiles, comparing these strategies with the ones presented in Figure 4 and with the corresponding no-embedding algorithms. In general, and except when using local solvers, knowing d_e allows a significant proportion of the problems to be solved considerably faster.

Admittedly, these conclusions strongly depend on the probability of the solver being successful, i.e., on the number of starting points of the multistart procedure. Note also that in our test set the effective dimension is typically low (the average value is 3.7), which might also decrease the benefit of knowing the effective dimension and thus avoiding the exploration of lower-dimensional subspaces; we expect the gap between Algorithm 2 and algorithms where d_e is known to increase with the effective dimension of the problem.

Conclusions to numerical experiments.
We have compared several instances of Algorithm 2 with the no-embedding framework, where the original problem is addressed directly, with no use of random embeddings nor exploitation of the special structure. Overall, Algorithm 2 outperforms the no-embedding framework, and this observation becomes more apparent as the dimension of the original problem increases. We have also combined Algorithm 2 with a local solver; though our convergence theory does not cover this situation, we have shown that the resulting algorithm can outperform both the no-embedding framework and instances of Algorithm 2 relying on a global solver, provided the parameters p_k are sampled at random in a sufficiently large domain to "globalize" the search.
Regarding the estimation of the effective dimension, we noticed that instances of Algorithm 2 relying on adaptive rules for selecting p_k (A-REGO and LA-REGO) significantly outperform their fixed-p_k counterparts. Finally, we have shown that, in the favourable case when the effective dimension is known, letting d_lb ≥ d_e in Algorithm 2 leads to a substantial improvement in performance.

Conclusions and future work
We explored a generic algorithmic framework, X-REGO, for global optimization of Lipschitz-continuous functions. X-REGO is based on successively generating reduced problems (RPX), where the parameter p is flexibly chosen. Flexibility in choosing p allows the user to calibrate the level of exploration in X. Our central result is the proof of global convergence of X-REGO, which heavily relies on an estimate of the probability that the reduced problem (RPX) is ε-successful. By looking at the reduced problem through the prism of conic geometry, we have developed a new type of analysis to bound the probability of ε-success of (RPX). The bounds are expressed in terms of the so-called conic intrinsic volumes of circular cones, which have exact formulae and are thus quantifiable. Using these formulae, we analysed the asymptotic behaviour of the bounds for large D. The analysis suggests that the success rate of (RPX), as expected, decreases exponentially with growing D. Confirming our intuition, the analysis also shows that (RPX) has a high success rate for larger d and for smaller distances between the location at which subspaces are embedded (i.e., the point p) and the location of a global minimizer x*. This latter property of (RPX) for general Lipschitz-continuous functions is reminiscent of the dependence of the success rates of (RPX) for functions with low effective dimensionality on the distance between p and x*, see [18]. Furthermore, to understand the relative performance of (RPX), we compared it with a uniform sampling technique. We looked at lower bounds for the probability of ε-success of the two techniques and found that the lower bound τ(r_p, d, D) for (RPX) is greater than the lower bound τ_us for uniform sampling if the distance ‖x* − p‖ is smaller than 0.48√D in the asymptotic regime (D → ∞). In this asymptotic analysis, the embedding subspace dimension d was kept fixed. The analysis showed that, in this regime, d has no significant effect on the relative performance of (RPX). Future research may involve comparison of the performances of (RPX) and uniform sampling in different asymptotic settings, for example when d = βD for some fixed constant β.
Our derivations are conceptual in nature, exploring new connections of global optimization to other areas such as conic integral geometry. As an illustration, in the second part of the paper, we used our analysis to obtain lower bounds, independent of D, for the probability of ε-success of (RPX) for functions with low effective dimensionality in the case d < d_e. This analysis is exploited algorithmically and allows lifting the restriction of needing to know d_e in random-embedding algorithms for functions with low effective dimensionality. We tested the effectiveness of X-REGO numerically, using global and local KNITRO for solving the reduced problem, on a set of benchmark global optimization problems modified to have low effective dimensionality. We proposed different variants of X-REGO, each corresponding to a specific rule for choosing the p's, and contrasted them against each other and against the 'no-embedding' framework, in which the solvers were applied to (P) directly, with no use of subspace embeddings. The results of the experiments showed that the difference in performance between X-REGO and 'no-embedding' becomes more prominent for larger D, in favour of X-REGO. The results further suggest that the effectiveness of X-REGO, just like that of REGO in [15], is solver-dependent. In our experiments, the best results were achieved by the local solver. In the future, we plan to investigate the performance of X-REGO when applied to general objectives and compare it with popular global optimization solvers.
Theorem A.2 (see [39, Theorem 2.3.10]). Let A be a D × d Gaussian random matrix. If U ∈ R^{D×p}, D ≥ p, and V ∈ R^{d×q}, d ≥ q, are orthonormal, then U^T A V is a Gaussian random matrix.
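Theorem A.2 can be checked empirically: sampling many Gaussian matrices A and projecting them through fixed orthonormal U and V should give entries with standard normal marginals. The following is our own Monte Carlo sanity check, with arbitrary small dimensions; it only tests the N(0,1) mean and variance, not the full distribution:

```python
import numpy as np

# Sanity check of Theorem A.2: if A is a D x d Gaussian random matrix and
# U (D x p), V (d x q) have orthonormal columns, then U^T A V is a p x q
# Gaussian random matrix. We verify the N(0,1) marginals empirically.
rng = np.random.default_rng(0)
D, d, p, q, trials = 8, 5, 3, 2, 20000

U, _ = np.linalg.qr(rng.standard_normal((D, p)))  # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((d, q)))  # orthonormal columns

samples = np.array([U.T @ rng.standard_normal((D, d)) @ V for _ in range(trials)])
entries = samples.reshape(trials, -1)  # all p*q entries, across trials

assert abs(entries.mean()) < 0.02       # sample mean close to 0
assert abs(entries.var() - 1.0) < 0.05  # sample variance close to 1
```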

B Global convergence proof
This section contains material already presented in [18], with minor changes to capture the fact that the lower bounds ρ_k and τ_k now vary with k (in other words, the probability that the reduced problem (RPX_k) is ε-successful, as well as the probability that the solver finds a sufficiently accurate solution of the reduced problem, changes with the dimension d_k of the reduced problem in Algorithm 1). The following three lemmas are needed in our convergence proof.

Proof. Assumption Success-Solv implies
where the equality follows from the fact that R_k is F_{k−1/2}-measurable and can thus be pulled out of the expectation (see [26, Theorem 4.1.14]).
A useful property is given next.
We repeatedly expand the expectation of the product for K − 1, . . ., 1, in exactly the same manner as above, to obtain the desired result.
In the next lemma, we show that if (RPX_k) is (ε − λ)-successful and is solved to accuracy λ in objective value, then the solution x_k must lie inside G_ε.
Proof. By Definition 1.1, if (RPX_k) is (ε − λ)-successful, then there exists y^k_int ∈ R^{d_k} such that A_k y^k_int + p_{k−1} ∈ X and

f(A_k y^k_int + p_{k−1}) ≤ f* + ε − λ. (B.2)
Since y^k_int is in the feasible set of (RPX_k) and f^k_min is the global minimum of (RPX_k), we have

f^k_min ≤ f(A_k y^k_int + p_{k−1}) ≤ f* + ε − λ.

Then, for x_k, (6.2) gives the first inequality below:

f(x_k) ≤ f^k_min + λ ≤ f* + ε,

so that x_k ∈ G_ε. Note that the sequence {f(x^1_opt), f(x^2_opt), ..., f(x^K_opt)} is monotonically decreasing. Therefore, if x^k_opt ∈ G_ε for some k ≤ K, then x^i_opt ∈ G_ε for all i = k, ..., K; and so the sequence ({x^k_opt ∈ G_ε})_{k=1}^K is an increasing sequence of events. Hence,

P[x^K_opt ∈ G_ε] ≥ 1 − P[∩_{k=1}^K {x^k_opt ∉ G_ε}] ≥ 1 − ∏_{k=1}^K (1 − τ_k ρ_k) ≥ 1 − (1 − τ_lb ρ_lb)^K, (B.6)

where the second inequality follows from Lemma B.2. Finally, passing to the limit with K in (B.6), we deduce that P[x^K_opt ∈ G_ε] → 1 as K → ∞, with τ_lb and ρ_lb defined in Assumption Success-Emb and Assumption Success-Solv, respectively. Since τ_lb ρ_lb > 0 by Assumption Success-Solv and Assumption Success-Emb, we get the required result.

Note that if

1 − (1 − τ_lb ρ_lb)^k ≥ ξ, (B.7)

then (B.6) implies P[x^k_opt ∈ G_ε] ≥ ξ. Since (B.7) is equivalent to

k ≥ log(1 − ξ) / log(1 − τ_lb ρ_lb),

(B.7) holds for all k ≥ K_ξ, since K_ξ ≥ log(1 − ξ) / log(1 − τ_lb ρ_lb).

C Problem set

Table 4 contains the name, domain and global minimum of the functions used to generate the high-dimensional test set. As in [15,18], the problem set contains 18 problems taken from [31,28,60]. To generate this problem set, we transformed each of the 18 functions in Table 4 into a high-dimensional function with low effective dimension, by adapting the method proposed by Wang et al. [66]. Let ḡ(x) be any function from Table 4, with dimension d_e, and let the given domain be scaled to [−1, 1]^{d_e}. We create a D-dimensional function g(x) by adding D − d_e fake dimensions to ḡ(x),

g(x) = ḡ(x) + 0 · x_{d_e+1} + 0 · x_{d_e+2} + · · · + 0 · x_D.

We further rotate the function by applying a random orthogonal matrix Q to x to obtain a nontrivial constant subspace. The final form of the function we test is

f(x) = g(Qx). (C.1)

Note that the first d_e rows of Q now span the effective subspace T of f(x). For each problem in the test set, we generate three functions f according to (C.1), one for each of D = 10, 100, 1000. Note that the range of effective dimensions covered by our test set is slightly larger than in [15,18], to better assess the ability of the algorithm to learn d_e.
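The construction (C.1) can be sketched as follows (a minimal sketch: the helper name make_test_function is hypothetical, and drawing Q via a QR factorization of a Gaussian matrix is our choice of a standard way to obtain a random orthogonal matrix):

```python
import numpy as np

def make_test_function(g_bar, d_e, D, rng=None):
    """Sketch of the test-set construction (C.1).

    Pads a d_e-dimensional function g_bar (domain scaled to [-1, 1]^d_e)
    with D - d_e fake coordinates that do not affect its value, then rotates
    by a random orthogonal matrix Q, so that f(x) = g(Qx) has a nontrivial
    d_e-dimensional effective subspace spanned by the first d_e rows of Q.
    """
    rng = np.random.default_rng() if rng is None else rng
    Q, _ = np.linalg.qr(rng.standard_normal((D, D)))  # random orthogonal matrix

    def f(x):
        y = Q @ np.asarray(x, dtype=float)
        return g_bar(y[:d_e])  # fake coordinates y[d_e:] contribute nothing

    return f, Q
```

Since Q is orthogonal, evaluating f at Q^T z recovers ḡ(z[:d_e]), which gives a quick consistency check of the construction.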