Simulated Annealing for Convex Optimization: Rigorous Complexity Analysis and Practical Perspectives

We give a rigorous complexity analysis of the simulated annealing algorithm by Kalai and Vempala (Math Oper Res 31(2):253–266, 2006) using the type of temperature update suggested by Abernethy and Hazan (International Conference on Machine Learning, 2016). The algorithm only assumes a membership oracle of the feasible set, and we prove that it returns a solution in polynomial time which is near-optimal with high probability. Moreover, we propose a number of modifications to improve the practical performance of this method, and present some numerical results for test problems from copositive programming.


Introduction
Let K ⊆ R^n be a convex body, and suppose that only a membership oracle of K is available. Let ⟨·, ·⟩ be an inner product on R^n, and fix a unit vector c ∈ R^n. We are interested in the problem

min { ⟨c, x⟩ : x ∈ K }.  (1)

One approach to solving problems of this type is simulated annealing, a paradigm of randomized algorithms for general optimization introduced by Kirkpatrick et al. [14]. It features a so-called temperature parameter that decreases during the run of the simulated annealing algorithm. At high temperatures, the method explores the feasible set relatively freely, also moving to solutions that have worse objective values than the current point. As the temperature decreases, so does the probability that a solution with a worse objective value is accepted.

Kalai and Vempala [12] showed that, for convex optimization, a polynomial-time simulated annealing algorithm exists. Specifically, their algorithm returns a feasible solution that is near-optimal with high probability in polynomial time. (A recent algorithm by Lee et al. [15] has an asymptotically better complexity.) Abernethy and Hazan [1] recently clarified that Kalai and Vempala's algorithm is closely related to a specific interior point method. In general, interior point methods for convex bodies require a so-called self-concordant barrier of the feasible set. It was shown by Nesterov and Nemirovskii [23] that every open convex set that does not contain an affine subspace is the domain of a self-concordant barrier, known as the universal barrier. However, it is not known how to compute the gradients and Hessians of this barrier in general.
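The temperature mechanism described above can be illustrated with a minimal Metropolis-style sketch. This is our own toy example: the objective, the proposal, and the geometric cooling rate below are illustrative choices, not taken from [12] or [14].

```python
import math
import random

def anneal_step(x, proposal, objective, temperature):
    """One Metropolis-style acceptance step: always accept an improvement;
    accept a worse proposal with probability exp(-increase / temperature)."""
    y = proposal(x)
    delta = objective(y) - objective(x)
    if delta <= 0 or random.random() < math.exp(-delta / temperature):
        return y
    return x

# Toy run: minimize x^2 on [-1, 1] with a geometric cooling schedule.
random.seed(0)
x, T = 1.0, 1.0
for _ in range(2000):
    x = anneal_step(
        x,
        lambda z: min(1.0, max(-1.0, z + random.uniform(-0.1, 0.1))),
        lambda z: z * z,
        T,
    )
    T *= 0.995
```

At temperature near 1 the walk accepts most worsening moves; by the end of the schedule it behaves almost greedily, which is exactly the qualitative behavior described in the text.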
The interior point method to which Kalai and Vempala's algorithm corresponds uses the so-called entropic barrier over K , to be defined later. This barrier was introduced by Bubeck and Eldan [7], who also established the self-concordance of the barrier and analyzed its complexity parameter ϑ.
Drawing on the connection to interior point methods, Abernethy and Hazan [1] proposed a new temperature schedule for Kalai and Vempala's algorithm. This schedule does not depend on the dimension n of the problem, but on the complexity parameter ϑ of the entropic barrier. Our goal is to prove in detail that simulated annealing with this new temperature schedule returns a solution in polynomial time which is near-optimal with high probability. Moreover, we aim to investigate the practical applicability of this method. Our experiments suggest that it is very difficult to implement a practical simulated annealing algorithm for copositive programming using hit-and-run sampling. Although these are negative results, we find it important to communicate them, in order to stimulate research into algorithms for copositive programming that are not sampling-based, e.g., cutting plane methods.

Algorithm Statement
Central to simulated annealing is a family of exponential-type probability distributions known as Boltzmann distributions.

Definition 1.1 Let K ⊆ R^n be a convex body, let θ ∈ R^n, and let ⟨·, ·⟩ be an inner product. Then, the Boltzmann distribution with parameter θ is the probability distribution supported on K whose density with respect to the Lebesgue measure is proportional to x ↦ e^⟨θ, x⟩.
Note that if ·, · is the Euclidean inner product, the covariance operator Σ in Algorithm 1 can be represented as a matrix in R n×n .
In each iteration k of Kalai and Vempala's algorithm [12], the temperature T_k is lowered. Then, hit-and-run samples are generated whose target distribution is the Boltzmann distribution with parameter θ_k = −c/T_k. These random walks use an approximation Σ̂(θ_{k−1}) of the covariance Σ(θ_{k−1}) to generate search directions. With these samples, an approximation Σ̂(θ_k) of Σ(θ_k) is formed, which is then used in the next iteration. As k grows sufficiently large, so does the norm of θ_k. The Boltzmann distributions with parameter θ_k then concentrate more and more probability mass close to the set of optimal solutions to (1). For sufficiently large k, any sample from such a Boltzmann distribution is near-optimal with high probability.
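The hit-and-run step used in these iterations can be sketched as follows, assuming only a membership oracle and a bounding radius (all function names are ours): the chord through the current point is located by bisection against the oracle, and the step along the chord is an inverse-CDF sample of the one-dimensional restriction of the Boltzmann density, which is exponential along the chord.

```python
import numpy as np

def chord(x, d, membership, radius, tol=1e-8):
    """Locate t_min < 0 < t_max with x + t*d near the boundary of K, by
    bisection against the membership oracle (K lies in a ball of radius
    `radius` around the origin, and x must be a point of K)."""
    def boundary(sign):
        lo, hi = 0.0, (2.0 * radius + 1.0) / np.linalg.norm(d)
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if membership(x + sign * mid * d):
                lo = mid
            else:
                hi = mid
        return sign * lo
    return boundary(-1.0), boundary(1.0)

def hit_and_run(x, theta, membership, radius, cov, steps, rng):
    """Hit-and-run targeting the Boltzmann density proportional to
    exp(<theta, x>), with directions drawn from N(0, cov)."""
    L = np.linalg.cholesky(cov)
    for _ in range(steps):
        d = L @ rng.standard_normal(x.size)
        t_min, t_max = chord(x, d, membership, radius)
        s = theta @ d  # along the chord the density is proportional to exp(s*t)
        u = rng.random()
        if abs(s) * (t_max - t_min) < 1e-12:
            t = t_min + u * (t_max - t_min)  # essentially uniform
        else:
            # inverse-CDF sample of exp(s*t) restricted to [t_min, t_max]
            t = t_min + np.log1p(u * np.expm1(s * (t_max - t_min))) / s
        x = x + t * d
    return x

# Example: uniform sampling (theta = 0) over the Euclidean unit ball.
rng = np.random.default_rng(1)
ball = lambda z: float(z @ z) <= 1.0
y = hit_and_run(np.zeros(2), np.zeros(2), ball, 1.0, np.eye(2), 50, rng)
```

This sketch ignores numerical safeguards (e.g., overflow for very long chords combined with large ‖θ‖) that a serious implementation would need.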
One thing that needs further clarification is how to decrease the temperature in each iteration. In their original paper, Kalai and Vempala [12] show that the algorithm returns a near-optimal solution with high probability for the temperature update (2) (cf. line 5 in Algorithm 2), where O* suppresses polylogarithmic terms in the problem parameters. As mentioned above, Abernethy and Hazan's alternative temperature schedule depends on the complexity parameter of the entropic barrier. This function is defined as follows.

Definition 1.2 Let K ⊆ R^n be a convex body. Define the function f : R^n → R as f(θ) = ln ∫_K e^⟨θ, x⟩ dx. Then, the convex conjugate f* of f is called the entropic barrier for K.
Bubeck and Eldan [7] showed that f* is a self-concordant barrier for K with complexity parameter ϑ ≤ n + o(n). The complexity parameter of f* is given by (3); we refer the reader to [3,7] for more details. Abernethy and Hazan [1] propose the temperature update (4), which leads to m = O*(√ϑ) iterations. As noted above, we have ϑ ≤ n + o(n) in general, but it is not currently known whether ϑ < n for any convex body. In particular, the temperature update (4) only improves on (2) if ϑ < n/16, which is not known to hold for any convex body. We show in Appendix A that, for the Euclidean unit ball in R^n, numerical evidence suggests that ϑ = (n + 1)/2. We therefore consider a variation (5) on the temperature schedule (4) suggested by Abernethy and Hazan, which corresponds to (4) when α = 4, but gives larger temperature reductions when α < 4. We will refer to (5) as Abernethy-Hazan-type temperature updates. If ϑ < n, this may result in a larger temperature decrease than the Kalai and Vempala scheme (2), for a suitable choice of the parameter α. The algorithm by Kalai and Vempala [12] that uses a temperature schedule of the type introduced by Abernethy and Hazan [1] is given in Algorithm 2.

Contributions and Outline of this Paper
Abernethy and Hazan [1] do not give a rigorous analysis of the temperature schedule (4) in their paper, only a proof sketch. In this paper, we provide the full details for the more general schedule (5). In doing so, we also provide some details that were omitted in the original work by Kalai and Vempala [12] concerning the application of a theorem by Rudelson [25]. Moreover, we discuss the perspectives for practical computation with Algorithm 2. Finally, we propose some heuristic improvements to Algorithm 2 to obtain speed-ups. Many of these results also appear in the PhD thesis [2] of the first author.

We start with a review of useful facts on probability distributions in Sect. 2. Then, Sect. 4 proves that Algorithm 2 returns a solution that is near-optimal with high probability. In Sect. 5, we discuss the complexity of Algorithm 2. In Sect. 6, we look at the behavior of hit-and-run sampling for optimization over the doubly nonnegative cone and suggest some heuristic improvements. We then evaluate the resulting algorithm on problems from copositive programming (due to Berman et al. [5]) in Sect. 7.

Algorithm 2 Algorithm by Kalai and Vempala [12], using a temperature schedule of the type introduced by Abernethy and Hazan [1]
Input: normalized (with respect to ‖·‖) vector c ∈ R^n; membership oracle of a convex body K ⊆ R^n; radius R ≥ 1 of a ball containing K; complexity parameter ϑ ≤ n + o(n) of the entropic barrier over K; x_0 ∈ K drawn randomly from the uniform distribution over K; update parameter α > 1 + 1/√ϑ; number of phases m ∈ N; number of hit-and-run steps ℓ ∈ N; number of covariance update samples N ∈ N; approximation Σ̂(0) of Σ(0) satisfying the starting condition.
Generate X_k by applying hit-and-run sampling to the Boltzmann distribution with parameter θ_k, starting the walk from X_{k−1}, taking ℓ steps, drawing directions from a N(0, Σ̂(θ_{k−1}))-distribution.
Generate Y_{jk} by applying hit-and-run sampling to the Boltzmann distribution with parameter θ_k, starting the walk from X_{k−1}, taking ℓ steps, drawing directions from a N(0, Σ̂(θ_{k−1}))-distribution.
10: end for

Lemma 2.2 (e.g., [17, Proposition 4.7]) Let X be a random variable on K with distribution μ, and let ϕ be a different probability distribution on K. If ‖μ − ϕ‖_TV = p, we can construct another random variable Y on K distributed according to ϕ such that P{X = Y} = 1 − p. Similarly, given two distributions μ and ϕ on K such that ‖μ − ϕ‖_TV = p, there exists a joint distribution ν on K × K, with marginals μ and ϕ, respectively, such that if (X, Y) ∼ (K × K, ν), one has X ∼ (K, μ), Y ∼ (K, ϕ), and P{X = Y} = 1 − p.

We can now state the following mixing time result. It gives the number of hit-and-run steps one has to take before the distribution of the hit-and-run sample is sufficiently close to the target distribution. The result given here is a corollary of a result by Lovász and Vempala [19, Theorem 1.1].

Theorem 2.3
Let K ⊂ R^n be a convex body. Suppose θ_0, θ_1 ∈ R^n satisfy Δθ := ‖θ_0 − θ_1‖/‖θ_0‖ < 1. Pick p > 0, and suppose we have an invertible matrix Σ̂(θ_0) satisfying (6). Consider a hit-and-run random walk as in Algorithm 1 applied to the Boltzmann distribution μ_1 with parameter θ_1, from a random starting point drawn from the Boltzmann distribution μ_0 with parameter θ_0. Let μ^ℓ be the distribution of the hit-and-run point after ℓ steps of hit-and-run sampling applied to μ_1, where the directions are drawn from a N(0, Σ̂(θ_0))-distribution. Then, there exists an absolute constant C > 0 such that, after ℓ hit-and-run steps as in (7), we have ‖μ^ℓ − μ_1‖_TV ≤ p.
We omit the proof, since it is a straightforward consequence of Lovász and Vempala [19, Theorem 1.1], applied to the Boltzmann distribution. The interested reader may find a detailed proof in [2,Theorem 4.14]. Note that the theorem above requires an approximation of the covariance of a distribution that is in some sense 'close' to the target distribution. This is why Algorithm 2 approximates the covariance of every distribution that it encounters: This covariance is then used in the next iteration to generate hit-and-run directions.
One may reformulate Assumption (6) in the theorem by using the following result from Horn and Johnson [11]. Consequently, we obtain equivalent conditions for k = 0, 1, . . .. A key point of the analysis is to show that these conditions hold in each iteration k.

Approximation of the Covariance Matrix
To guarantee the required approximation of the covariance matrix, Kalai and Vempala [12] use the following corollary to a theorem by Rudelson [25].
Theorem 3.1 Let ϕ be a log-concave probability distribution over R^n with mean 0 and identity covariance (i.e., isotropic), and let ρ ∈ (0, 1). Let X_1, ..., X_N be independent samples from ϕ. Then, there exist absolute constants C_1 and C_2 such that, for N as in (8), the empirical covariance matrix of the samples approximates the identity with probability 1 − ρ.

This theorem cannot be directly applied in the setting of Algorithm 2, for three reasons: the hit-and-run samples do not follow the target distribution ϕ; the samples are not independent; and ϕ does not have mean 0 and identity covariance, i.e., ϕ is not isotropic. While Kalai and Vempala (correctly) state that Theorem 3.1 can be extended to the hit-and-run setting without significantly changing the number of samples N, a formal proof is not given. Since this extension is not straightforward, we provide the missing details in this section.
We first show how to extend Theorem 3.1 to the non-isotropic case. To this end, we will need two results on log-concave random variables. The first is that the sum of independent log-concave random variables is again log-concave. Lemma 3.2 Let X , Y be independent random variables in R n with log-concave density functions f and g, respectively. Then, X + Y is a log-concave random variable.
Proof Theorem 7 in [24] shows that x ↦ ∫_{R^n} f(x − y) g(y) dy is log-concave on R^n. This function is precisely the density function of X + Y.
The second result is a concentration result for log-concave distributions.

Lemma 3.3 (Lemma 3.3 from [18] (adapted from [20])) Let X be a random variable with a log-concave distribution. Denote
We now extend Theorem 3.1 to the non-isotropic case.

Corollary 3.4
Let ϕ be a log-concave probability distribution over R^n with mean μ_ϕ and covariance Σ_ϕ, and let ρ ∈ (0, 1). Let X_1, ..., X_N be independent samples from ϕ, where N is as in (8) and C_1 and C_2 are the absolute constants from Theorem 3.1, and let Σ̂ be the empirical covariance matrix of these samples. Then, (9) holds with probability 1 − ρ.
Proof One may assume w.l.o.g. that Σ_ϕ = I, since the statement of the theorem is invariant under invertible linear transformations. Indeed, if a random variable X with distribution ϕ is replaced by AX for some invertible linear transformation A, then the covariance matrix and its approximation transform in the same way; choosing A = Σ_ϕ^{−1/2} therefore gives the identity covariance. Assuming thus w.l.o.g. that Σ_ϕ = I, the random variable X − μ_ϕ has an isotropic log-concave distribution, to which Theorem 3.1 applies. It therefore suffices to show that the empirical covariance Σ̂ and its counterpart for the centered samples are 'sufficiently close.' To this end, a direct calculation expresses their difference in terms of the empirical mean, which is a log-concave random vector by Lemma 3.2. We may therefore bound the right-hand side of the last equality via the concentration inequality of Lemma 3.3, using that the variance of each component of X_i is 1 for all i. Choosing the constants appropriately, this implies (9).
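As a quick numerical illustration of the kind of guarantee in Corollary 3.4 (our own sketch, with an arbitrary non-isotropic log-concave distribution, namely a product of uniforms):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 5, 20000
scales = np.arange(1.0, n + 1.0)        # coordinate i is uniform on [0, scales_i]
true_cov = np.diag(scales**2 / 12.0)    # exact covariance of this product measure

X = rng.random((N, n)) * scales         # N independent log-concave samples
mean = X.mean(axis=0)
emp_cov = (X - mean).T @ (X - mean) / N

# Relative spectral error, the kind of quantity bounded in Corollary 3.4.
err = np.linalg.norm(np.linalg.inv(true_cov) @ emp_cov - np.eye(n), 2)
```

With these (hypothetical) sample sizes the relative spectral error is of the order of a few percent, consistent with a 1/√N rate.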
We proceed to show that we may replace independent sampling by hit-and-run sampling in Corollary 3.4 in a well-defined sense. To this end, fix a starting vector X_0 and a tolerance q > 0, and consider the following two sequences of random variables: independent samples X_1, ..., X_N from ϕ, and hit-and-run samples Y_1, ..., Y_N generated as in Step 9 of Algorithm 2. Here, we assume that ℓ steps are sufficient to guarantee that the total variation distance between the distribution of Y_i and ϕ is at most a given p ∈ (0, 1) (with reference to Theorem 2.3).
A key observation is that we may assume, without loss of generality, that X_i = Y_i for all i ∈ {1, . . . , N} with high probability.
Lemma 3.5 Let X_i and Y_i (i ∈ {1, . . . , N}) be as above. Without loss of generality, one may assume that their joint distribution satisfies P{X_i = Y_i} ≥ 1 − p for all i.

Proof The X_i all have distribution ϕ on K, and the Y_i all have the same distribution, say μ on K (which depends on X_0). Moreover, by assumption, ‖μ − ϕ‖_TV ≤ p. By the second part of Lemma 2.2, there exists a joint distribution, say ν on K × K, with marginals ϕ and μ, under which the two coordinates agree with probability 1 − p. We now replace (i.e., couple) the random variables Y_i with new random variables Y_i′ drawn jointly with the X_i according to ν. The random variables Y_i′ are then indistinguishable from the hit-and-run random variables Y_i, in the sense that both Y_i ∼ (K, μ) and Y_i′ ∼ (K, μ) for all i.
As a result, Corollary 3.4 still holds if we replace the X i by the Y i for all i, but now with probability (1 − ρ) − N p, by the union bound.

Proof of Convergence
We continue by proving that Algorithm 2 converges to the optimum in polynomial time.
The following result was established by Kalai and Vempala [12] for linear functions, and extended from linear to convex functions h : R n → R by De Klerk and Laurent [10].
The main step in the analysis of Algorithm 2 is thus to show that we maintain a good approximation Σ(θ k ) of Σ(θ k ) for all k, to guarantee that the hit-and-run sampling continues to work in all iterations, as discussed in the previous section.

Theorem 4.2 Consider the setting of Algorithm 2, with N as in (8) and ℓ as in (7). (Note that ℓ depends on n, p, and Δθ.) With these inputs, Algorithm 2 returns a solution X_m that is near-optimal for (1) with high probability, in the sense of (15).

Proof First, let us show that the conditions of Theorem 2.3 are satisfied. Note that Σ(0) is the covariance matrix of the uniform distribution over K. Since ‖c‖ = 1 and K is contained in a ball with radius R, the condition Δθ < 1 holds for k = 1. For all k > 1, our choice of θ_k and (3) yield the same condition. Throughout all iterations k ≤ m of Algorithm 2, the approximation guarantee (6) is maintained with probability 1 − m(ρ + N p) = 1 − q, by the analysis in the previous section. Thus, the conditions of Theorem 2.3 are satisfied with high probability. By the first part of Lemma 2.2, X_m is equal to a random variable drawn from a Boltzmann distribution with parameter θ_m with probability at least 1 − p. Thus, Markov's inequality and Lemma 4.1 yield the claimed near-optimality (15), where the final inequality uses the chosen value of m.

Complexity Analysis and Discussion
We saw that, for a suitable combination of inputs, Algorithm 2 returns a solution which is near-optimal with high probability.
The number of samples N from (8) is polynomial in the problem parameters. Next, we bound the number of hit-and-run steps: recall that (7) determines ℓ. For the Δθ from Theorem 4.2 and the value of p from (13), (16) shows that ℓ is polynomial as well. Summarizing, the total number of oracle calls by Algorithm 2 is O*(n^{4.5}), where we used Bubeck and Eldan's result [7] that ϑ ≤ n + o(n). Thus, we have given a rigorous complexity analysis of the algorithm by Kalai and Vempala [12] using the temperature update of Abernethy and Hazan [1]. Our final bound on the number of oracle calls coincides with the O*(n^{4.5}) bound claimed in [12].

Initialization
One might still wonder how to generate a good estimate Σ̂(0) of the uniform covariance matrix Σ(0) to start Algorithm 2. The 'rounding the body' procedure from Lovász and Vempala [18] is suitable for this purpose: it returns a Σ̂(0) satisfying the starting condition for Algorithm 2. The number of calls to the membership oracle for this procedure is O*(n^4), which is overshadowed by the complexity of Algorithm 2 itself.

Numerical Examples on the Doubly Nonnegative Cone
Having established the theoretical complexity of Algorithm 2, we now move to the second goal of this work: investigating the practical perspectives of this method. We will test Algorithm 2 on the problem of determining if a matrix is copositive, which is known to be a co-NP-complete problem [22].
To define this problem, let S^{m×m} denote the space of real symmetric m × m matrices. A matrix C ∈ S^{m×m} is called copositive if xᵀC x ≥ 0 for all x ∈ R^m_+ (see Bomze [6] for a survey of copositive programming and its applications).
The standard SDP relaxation of the problem of checking copositivity of C is (17), where ⟨·, ·⟩ is the trace inner product. If the value of (17) is nonnegative, the matrix C is copositive. However, since we place a nonnegativity constraint on every element of the matrix X, the Newton system in every interior point iteration is of size O(m² × m²), which quickly leads to impractical computation times (see, e.g., Burer [8]). Before we can apply Algorithm 2, we need to translate (17) to a problem over R^{m(m+1)/2}. The approach is standard: for A ∈ S^{m×m}, define

svec(A) := (A_11, √2 A_12, A_22, √2 A_13, √2 A_23, A_33, . . . , A_mm)ᵀ,

such that svec(A) ∈ R^{m(m+1)/2}. If R^{m(m+1)/2} is endowed with the Euclidean inner product, the adjoint of svec is the inverse map, defined for every a ∈ R^{m(m+1)/2} as smat(a) ∈ S^{m×m}. Moreover, smat(svec(A)) = A and svec(smat(a)) = a for all A and a. Now let c = svec(C). Problem (17) is equivalent to a problem (18) over R^{m(m+1)/2}. Note that membership of the feasible set of (18) can be determined in O(m³) operations. Let n = m(m + 1)/2 be the number of variables in problem (18).
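The maps svec and smat can be sketched as follows, assuming the usual convention that off-diagonal entries are scaled by √2 so that the trace inner product carries over to the Euclidean one:

```python
import numpy as np

def svec(A):
    """Stack the upper triangle of a symmetric matrix column by column,
    scaling off-diagonal entries by sqrt(2) so that the trace inner product
    <A, B> equals the Euclidean product svec(A) . svec(B)."""
    m = A.shape[0]
    out = []
    for j in range(m):
        for i in range(j + 1):
            out.append(A[i, j] if i == j else np.sqrt(2.0) * A[i, j])
    return np.array(out)

def smat(a):
    """Inverse of svec: rebuild the symmetric matrix from its vectorization."""
    m = int(round((np.sqrt(8 * a.size + 1) - 1) / 2))
    A = np.zeros((m, m))
    k = 0
    for j in range(m):
        for i in range(j + 1):
            A[i, j] = a[k] if i == j else a[k] / np.sqrt(2.0)
            A[j, i] = A[i, j]
            k += 1
    return A
```

The √2 scaling is what makes smat the adjoint of svec under the stated inner products; dropping it would break the equivalence of (17) and (18).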

Covariance Approximation
First, we investigate how the quality of the covariance approximation depends on the walk length ℓ for problem (18). We take 20,000 hit-and-run samples from the uniform distribution over the feasible set of (18) with walk length 50,000. (Directions are drawn from N(0, I), and the starting point is svec(m I + J)/(2 eᵀ svec(m I + J)), where J is the all-ones matrix.) These samples are used to create the estimate Σ̂_0. Then, the experiment is repeated for walk lengths ℓ ≤ 50,000 and sample sizes N ≤ 20,000. We refer to these new estimates as Σ̂_{ℓ,N}. We would like

−ε yᵀ Σ̂_{ℓ,N} y ≤ yᵀ(Σ̂_0 − Σ̂_{ℓ,N}) y ≤ ε yᵀ Σ̂_{ℓ,N} y for all y ∈ R^n,

for some small ε > 0; i.e., we would like the spectral radius ρ(Σ̂_{ℓ,N}^{−1} Σ̂_0 − I) to be small. The results are shown in Fig. 1, where m refers to the size of the matrices in (17); hence, the covariance matrices in question are of size m(m+1)/2 × m(m+1)/2. One major conclusion from Fig. 1 is that the convergence toward zero is relatively slow. To show that simply adding more samples with higher walk lengths will not be feasible in practice, we present the running times required to estimate a covariance matrix at the desired accuracy in Fig. 2. Specifically, this figure shows the running times of the 'efficient' combinations of N and ℓ: those combinations plotted in Fig. 1 for which no combination with smaller N and ℓ achieves a smaller value of ρ(Σ̂_{ℓ,N}^{−1} Σ̂_0 − I). Figure 2 shows that, even in low dimensions, approximating the covariance matrix to high accuracy takes an impractical amount of time.
To show that the slow convergence in Fig. 1 reflects the fundamental difficulty of covariance estimation, we consider a simpler problem: approximating the covariance matrix of the uniform distribution over the hypercube [0, 1]^n in R^n. The true covariance matrix of this distribution is known to be Σ := (1/12) I. Again, we use hit-and-run with varying walk lengths and sample sizes to generate samples from the uniform distribution over [0, 1]^n, and compare the resulting covariance estimates Σ̂_{ℓ,N} with Σ = (1/12) I. (Comparing against a covariance estimate based on hit-and-run samples as in Fig. 1 yields roughly the same image.) The result is shown in Fig. 3, which exhibits a pattern similar to that of Fig. 1: as the problem size increases, the walk length should increase with the sample size to ensure the estimate is as good as the sample size can guarantee. While this progression toward zero may appear slow, we do not have to know every eigenvalue and eigenvector of the covariance to high accuracy. Recall that we only use this covariance estimate in Algorithm 2 to generate hit-and-run directions. As such, it may suffice to have an estimate that roughly shows which directions are 'long' and which are 'short.'
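A miniature version of this hypercube experiment can be sketched as follows (much smaller n, N, and walk length than in Fig. 3; the chord through the cube is computed in closed form rather than via a membership oracle):

```python
import numpy as np

def hit_and_run_cube(x, steps, rng):
    """Hit-and-run for the uniform distribution on [0, 1]^n: draw a random
    direction, intersect the line with the cube, move to a uniform point on
    the resulting chord."""
    for _ in range(steps):
        d = rng.standard_normal(x.size)   # nonzero with probability one
        a, b = -x / d, (1.0 - x) / d      # per-coordinate chord limits
        t_min, t_max = np.minimum(a, b).max(), np.maximum(a, b).min()
        x = x + (t_min + rng.random() * (t_max - t_min)) * d
    return x

# Estimate the covariance (exact answer: I/12) from correlated walk samples.
rng = np.random.default_rng(0)
n, N, walk = 5, 2000, 20
samples = np.empty((N, n))
x = np.full(n, 0.5)
for i in range(N):
    x = hit_and_run_cube(x, walk, rng)
    samples[i] = x
err = np.abs(12.0 * np.cov(samples, rowvar=False) - np.eye(n)).max()
```

Even in this easy setting the entrywise error decays only slowly with N and the walk length, matching the qualitative picture in Fig. 3.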

Mean Approximation
Next, we consider the problem of approximating the mean. Although Algorithm 2 does not require approximating the mean of a Boltzmann distribution, such a mean does lie on the central path of the interior point method proposed by Abernethy and Hazan [1, Appendix D]. We again take 20,000 hit-and-run samples from the uniform distribution over the feasible set of (18) with walk length 50,000. (Directions are drawn from N(0, I), and the starting point is svec(m I + J)/(2 eᵀ svec(m I + J)), where J is the all-ones matrix.) These samples are used to create the mean estimate x̂_0. Then, the experiment is repeated for walk lengths ℓ ≤ 50,000 and sample sizes N ≤ 20,000. We refer to these new estimates as x̂_{ℓ,N}. Using the approximation Σ̂_0 of the uniform covariance matrix from the previous section, we compute ‖x̂_0 − x̂_{ℓ,N}‖_{Σ̂_0^{−1}} and plot the results in Fig. 4.
The results are comparable to those in Figs. 1 and 2. It will take an impractical amount of time before the mean estimate approximates the true mean well enough for practical purposes.

Kalai-Vempala Algorithm
The results from the previous two sections show that we should not hope to approximate the covariance matrix and sample mean with high accuracy in high dimensions. However, it is still insightful to verify if this is really required for Algorithm 2 to work in practice.
We therefore generated a random vector c ∈ R^{m(m+1)/2} as follows: if C ∈ R^{m×m} is a matrix whose elements are drawn independently from a standard normal distribution, then C + Cᵀ + (√2 − 2) Diag(C) is a symmetric matrix whose elements all have variance 2. We then let the svec of such a matrix (after normalization) serve as the objective of our optimization problem (18). We find the optimal solution x* with MOSEK 8.0 [21] and then run Algorithm 2 with ε = 10^{−3} and p = 10^{−1}. The final gap ⟨c, x_final − x*⟩ is shown in Fig. 5. One can see that, for practical sample sizes and walk lengths, the method does not converge to the optimal solution.
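The variance claim for this construction is easy to check empirically (a sketch with hypothetical sample sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
m, K = 6, 20000
Cs = rng.standard_normal((K, m, m))     # K random matrices C
# S = C + C^T + (sqrt(2) - 2) Diag(C): off-diagonal entries are C_ij + C_ji
# (variance 2), and diagonal entries reduce to sqrt(2) * C_ii (also variance 2).
Ss = Cs + np.transpose(Cs, (0, 2, 1)) + (np.sqrt(2.0) - 2.0) * Cs * np.eye(m)
var_diag = Ss[:, 0, 0].var()
var_off = Ss[:, 1, 2].var()
```

The (√2 − 2) Diag(C) correction is exactly what equalizes the variance of the diagonal with that of the off-diagonal entries.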

Kalai-Vempala Algorithm with Acceleration Heuristic
Keeping our findings above in mind, we propose the heuristic adaptation of Algorithm 2 presented in Algorithm 3. The main modifications we suggest to Algorithm 2 are:

1. Use the (centered) samples generated in the previous iteration as directions for hit-and-run in the current iteration. This eliminates the need to estimate the covariance matrix of a distribution, only to then draw samples from that same distribution. Instead, we draw directions directly from the centered samples (cf. line 10 in Algorithm 3); thus, each sample is used to generate a hit-and-run direction with uniform probability.

2. As a starting point for the first random walk in iteration k, use the sample mean from iteration k − 1. While this does significantly change the distribution of the starting point, it concentrates more probability mass around the mean of the Boltzmann distribution with parameter θ_{k−1}, so that the starting point of the random walk is likely already close to the mean of the Boltzmann distribution with parameter θ_k. In a similar vein, we return the mean of the samples in the final iteration, not just one sample. This does not change the expected objective value of the final result, and therefore does not affect the probabilistic guarantee derived in (15) by Markov's inequality; however, using the mean does reduce the variance of the final solution.

3. Start all except the first random walk in iteration k from the end point of the previous random walk, rather than from a common starting point. The idea is that the random samples as a whole will then exhibit less dependence, thus improving the approximation quality of the empirical distribution.
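Modification 1 can be sketched in a few lines (our own illustration; `direction_from_samples` is a hypothetical helper name): the covariance information contained in the samples is reused directly, without ever forming a covariance matrix.

```python
import numpy as np

def direction_from_samples(samples, rng):
    """Sketch of modification 1: instead of fitting a covariance matrix and
    drawing N(0, cov) directions, pick one of the previous samples uniformly
    at random and center it at their sample mean."""
    return samples[rng.integers(len(samples))] - samples.mean(axis=0)
```

If the previous samples have covariance roughly Σ, the drawn directions also have covariance roughly Σ, which is all that the hit-and-run walk needs. (Recomputing the mean on every call is wasteful; a real implementation would cache it.)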
With these modifications implemented, we can no longer study the quality of the covariance matrix. Therefore, we will simply consider if the resulting optimization algorithm leads to a small error in the objective value. We solve the problem from Sect. 6.3 with Algorithm 3. The results are shown in Fig. 6.
For low dimensions in particular, the proposed changes seem to have a positive effect.
It can be seen from Fig. 6 that, roughly speaking, the final gap ⟨c, x_final − x*⟩ takes values between two extremes. At one end, the method does not converge, and the final gap is still of the order 10^{−1}. At the other end, the method does converge to the optimum, such that the gap is of the order 10^{−4} = εp. Note that εp is exactly the size we would like the expected gap to have, to guarantee that the gap is smaller than ε with probability 1 − p by Markov's inequality. Whether we are at one end or the other depends on N and ℓ being large enough compared to m. As a heuristic, we propose the choices (19) for N and ℓ, where n = m(m+1)/2 is the number of variables; these are generally sufficient to ensure that the final gap is of the order εp.

Algorithm 3 Heuristic adaptation of Algorithm 2
Input: unit vector c ∈ R^n; membership oracle of a convex body K ⊆ R^n; radius R of a Euclidean ball containing K; complexity parameter ϑ ≤ n + o(n) of the entropic barrier over K; update parameter α > 1 + 1/√ϑ; error tolerance ε > 0; failure probability p ∈ (0, 1); number of hit-and-run steps ℓ ∈ N; number of samples N ∈ N; y_1, ..., y_N ∈ K drawn randomly from the uniform distribution over K.
10: Generate Y_{jk} by applying hit-and-run sampling to the Boltzmann distribution with parameter θ_k, starting the walk from Y_{j−1,k}, taking ℓ steps, drawing directions uniformly from the centered samples of the previous iteration.
11: end for
15: end while
16: return X_{k−1}
A matrix A is copositive if and only if the optimal value of (20) is nonnegative, where e is the all-ones vector. Xia et al. [28] show that solving (20) is equivalent to solving the mixed-integer linear program (21), where M = 2m max_{i,j∈{1,...,m}} |A_{ij}|.

(To be precise, every optimal solution (a, ν, η) to (21) gives an optimal solution a to (20), and these two problems have the same optimal value.) Note that we are generally not interested in solving (21) to optimality: it suffices to find a feasible solution of (21) with a negative objective value, or to confirm that no such solution exists. For the majority of the matrices encountered by Algorithm 3 applied to our test sets described below, this could be checked quickly.

Separating from the Completely Positive Cone
Recall that a matrix A ∈ S^{m×m} is completely positive if A = BBᵀ for some B ≥ 0. It is easily seen that optimization problems over the completely positive cone can be relaxed to optimization problems over the doubly nonnegative cone. To strengthen this relaxation, one could add a cutting plane separating the optimal solution Y of the doubly nonnegative relaxation from the completely positive cone. This is listed as an open (computational) problem by Berman et al. [5, Section 5], who note that the problem of generating such a cut has only been answered for specific structures of Y, including 5 × 5 matrices [9]. In general, such a cut could be generated for a doubly nonnegative matrix Y by the copositive program (22). Below, we solve this problem for 6 × 6 matrices, by way of example.
To generate test instances, we are interested in matrices on the boundary of the 6 × 6 doubly nonnegative cone. The extreme rays of this cone are described by Ycart [29, Proposition 6.1]. We generate random instances from the class of matrices described under case 3, graph 4 in Proposition 6.1 in [29]. These matrices are (up to a permutation of the indices) doubly nonnegative matrices Y = [Y_ij] of rank 3 satisfying Y_{i,i+1} = 0 for i = 1, ..., 5. To generate such a matrix, we draw the elements of two vectors v_1, v_2 ∈ R^6 and the first element (v_3)_1 ∈ R of a vector v_3 ∈ R^6 from a Poisson distribution with rate 1, and multiply each of these elements by −1 with probability 1/2. The remaining elements of v_3 are then chosen such that Y = Σ_{k=1}^{3} v_k v_kᵀ satisfies Y_{i,i+1} = 0 for i = 1, ..., 5. This procedure is repeated if the matrix Y is not doubly nonnegative, or if BARON 15 [27] could find a nonnegative matrix B ∈ R^{6×9} such that Y = BBᵀ in less than 30 s. (For the cases where such a decomposition could be found, BARON terminated in less than a second.) Thus, we are left with doubly nonnegative matrices for which it cannot quickly be shown that they are completely positive.
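The instance generator can be sketched as follows (our own reconstruction of the procedure just described; the doubly-nonnegative filter is only indicated, and the BARON test is not reproduced):

```python
import numpy as np

def extreme_dnn_candidate(rng, m=6):
    """Rank-3 candidate Y = sum_k v_k v_k^T with Y[i, i+1] = 0, where the
    entries of v_1, v_2 and the first entry of v_3 are signed Poisson(1)
    draws. Returns None if the construction breaks down (division by zero);
    in the paper's procedure, candidates are additionally discarded when Y is
    not doubly nonnegative or when BARON finds a factorization quickly."""
    sgn = lambda k: rng.choice([-1.0, 1.0], size=k)
    v1 = rng.poisson(1.0, m) * sgn(m)
    v2 = rng.poisson(1.0, m) * sgn(m)
    v3 = np.zeros(m)
    v3[0] = rng.poisson(1.0) * sgn(1)[0]
    for i in range(m - 1):
        if v3[i] == 0.0:
            return None
        # enforce Y[i, i+1] = v1_i v1_{i+1} + v2_i v2_{i+1} + v3_i v3_{i+1} = 0
        v3[i + 1] = -(v1[i] * v1[i + 1] + v2[i] * v2[i + 1]) / v3[i]
    return np.outer(v1, v1) + np.outer(v2, v2) + np.outer(v3, v3)

def is_doubly_nonnegative(Y, tol=1e-9):
    """Entrywise nonnegative and positive semidefinite."""
    return bool((Y >= -tol).all() and np.linalg.eigvalsh(Y).min() >= -tol)
```

By construction every returned Y is positive semidefinite with zeros on the first superdiagonal; entrywise nonnegativity must then be checked, and failing candidates are redrawn.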
For ten such randomly generated matrices (see Appendix B), the optimal value of Algorithm 3 applied to (22) is given in Table 1. This table shows the normalized objective value ⟨Y/‖Y‖, X*⟩, where Y is a doubly nonnegative matrix as described above, and X* is the final solution returned by Algorithm 3. Note that in all cases, Algorithm 3 succeeds in finding a copositive matrix X* such that ⟨Y, X*⟩ < 0, which means a cut separating Y from the completely positive matrices was found.
Note that solving the MILP (21) for a matrix A that is not copositive yields a hyperplane separating A from the copositive cone. Thus, we can also solve problem (22) with, for example, the Ellipsoid method of Yudin and Nemirovski [30]. For the sake of comparison, the results of the Ellipsoid method are also included in Table 1. Note, in particular, that the number of oracle calls in Table 1 is several orders of magnitude smaller for the Ellipsoid method. (Algorithm 3 was run with p = 0.1, and N and ℓ as in (19); the Ellipsoid method was run with error tolerance 10^{−4}.)

Conclusion
We have shown that Kalai and Vempala's algorithm [12] returns a solution which is near-optimal for (1) with high probability in polynomial time, when the temperature update (5) is used. The main drawback in using the algorithm in practice is that a large number of samples (i.e., membership oracle calls) are required. As a result, in our tests the Ellipsoid method outperformed Algorithm 3 by a large margin. Thus, based on our experiments, one would favor polynomial-time cutting plane methods like the Ellipsoid method, or more sophisticated alternatives as described, for example, in [16]. In order to obtain a practically viable variant of the Kalai-Vempala algorithm, one would have to improve the sampling process greatly, or utilize massive parallelism to speed up the hit-and-run sampling.

Appendix A: The Complexity Parameter of the Entropic Barrier for the Euclidean Ball
Let B_n := {x ∈ R^n : ‖x‖_2 ≤ 1} be the unit ball in R^n with respect to the Euclidean inner product ⟨·, ·⟩. Let X_θ be a random variable following a Boltzmann distribution over B_n with parameter θ ∈ R^n, and denote the expectation operator by E. We are interested in the complexity parameter ϑ of the entropic barrier on B_n. Recall from (3) that ϑ = sup_{θ ∈ R^n} { E[⟨θ, X_θ⟩²] − ⟨θ, E[X_θ]⟩² }. For every θ ∈ R^n, there exists a rotation matrix Q with |det Q| = 1 such that ⟨θ, Qy⟩ = ‖θ‖ y_1 for all y ∈ R^n. Using the fact that the volume of an (n − 1)-dimensional ball with radius r is r^{n−1} times some factor depending only on n, the quantity above reduces to a one-dimensional integral in y_1. Numerical approximation of E[⟨θ, X_θ⟩²] − ⟨θ, E[X_θ]⟩² for different values of n and ‖θ‖ yields Fig. 7. This figure suggests that ϑ = (n + 1)/2.
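The numerical evidence can also be reproduced with a small Monte Carlo sketch (our own construction, not the paper's quadrature): rejection sampling from the Boltzmann distribution over the ball, followed by an empirical variance of ⟨θ, X_θ⟩. For moderately large ‖θ‖ the estimate approaches (n + 1)/2 from below.

```python
import numpy as np

def boltzmann_ball_samples(theta, n, num, rng, batch=50000):
    """Rejection sampler for the Boltzmann distribution over the unit ball:
    propose uniformly from the ball, accept with probability
    exp(<theta, x> - ||theta||), i.e., the density divided by its maximum."""
    t = np.linalg.norm(theta)
    chunks, total = [], 0
    while total < num:
        z = rng.standard_normal((batch, n))
        x = z / np.linalg.norm(z, axis=1, keepdims=True)
        x *= rng.random((batch, 1)) ** (1.0 / n)   # uniform in the ball
        keep = rng.random(batch) < np.exp(x @ theta - t)
        chunks.append(x[keep])
        total += keep.sum()
    return np.concatenate(chunks)[:num]

rng = np.random.default_rng(0)
n = 3
theta = np.array([20.0, 0.0, 0.0])       # large ||theta||, pointing at a pole
X = boltzmann_ball_samples(theta, n, 3000, rng)
var_est = (X @ theta).var()              # estimate of E<θ,X_θ>² − <θ,E X_θ>²
```

For n = 3 and ‖θ‖ = 20 the estimate is close to (n + 1)/2 = 2, in line with Fig. 7; the residual gap is the O(1/‖θ‖) correction of the finite-temperature distribution.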

Appendix B: Extremal Doubly Nonnegative Matrix Examples
Below are the ten randomly generated extreme points of the 6 × 6 doubly nonnegative cone that are used in Sect. 7.1. These matrices can be strictly separated from the completely positive cone.