1 Introduction

Let \(K \subseteq \mathbb {R}^n\) be a convex body, and suppose that only a membership oracle of K is available. Let \(\langle \cdot , \cdot \rangle \) be an inner product on \(\mathbb {R}^n\), and fix a unit vector \(c \in \mathbb {R}^n\). We are interested in the problem

$$\begin{aligned} \min _{x \in K} \langle c, x \rangle . \end{aligned}$$
(1)

One approach to solving problems of this type is simulated annealing, a paradigm of randomized algorithms for general optimization introduced by Kirkpatrick et al. [14]. It features a so-called temperature parameter that decreases during the run of the simulated annealing algorithm. At high temperatures, the method explores the feasible set relatively freely, also moving to solutions that have worse objective values than the current point. As the temperature decreases, so does the probability that a solution with a worse objective value is accepted. Kalai and Vempala [12] showed that a polynomial-time simulated annealing algorithm exists for convex optimization. Specifically, their algorithm returns a feasible solution that is near-optimal with high probability in polynomial time. (A recent algorithm by Lee et al. [15] has an asymptotically better complexity.)

Abernethy and Hazan [1] recently clarified that Kalai and Vempala’s algorithm is closely related to a specific interior point method. In general, interior point methods for convex bodies require a so-called self-concordant barrier of the feasible set. It was shown by Nesterov and Nemirovskii [23] that every open convex set that does not contain an affine subspace is the domain of a self-concordant barrier, known as the universal barrier. However, it is not known how to compute the gradients and Hessians of this barrier in general.

The interior point method to which Kalai and Vempala’s algorithm corresponds uses the so-called entropic barrier over K, to be defined later. This barrier was introduced by Bubeck and Eldan [7], who also established the self-concordance of the barrier and analyzed its complexity parameter \(\vartheta \).

Drawing on the connection to interior point methods, Abernethy and Hazan [1] proposed a new temperature schedule for Kalai and Vempala’s algorithm. This schedule does not depend on the dimension n of the problem, but on the complexity parameter \(\vartheta \) of the entropic barrier. Our goal is to prove in detail that simulated annealing with this new temperature schedule returns a near-optimal solution with high probability in polynomial time. Moreover, we aim to investigate the practical applicability of this method. Our experiments suggest that it is very difficult to implement a practical simulated annealing algorithm for copositive programming using hit-and-run sampling. Although these are negative results, we find it important to communicate them, in order to stimulate research into algorithms for copositive programming that are not sampling-based, e.g., cutting plane methods.

1.1 Algorithm Statement

Central to simulated annealing is a family of exponential-type probability distributions known as Boltzmann distributions.

Definition 1.1

Let \(K \subseteq \mathbb {R}^n\) be a convex body, and let \(\theta \in \mathbb {R}^n\). Let \(\langle \cdot , \cdot \rangle \) be an inner product. Then, the Boltzmann distribution with parameter \(\theta \) is the probability distribution supported on K having density with respect to the Lebesgue measure proportional to \(x \mapsto e^{\langle \theta , x \rangle }\).

Throughout this work, we will use \(\varSigma (\theta )\) to refer to the covariance matrix of the Boltzmann distribution with parameter \(\theta \in \mathbb {R}^n\). If \(\langle \cdot , \cdot \rangle \) is some reference inner product, then \(\langle x, y \rangle _\theta := \langle x, \varSigma (\theta ) y \rangle \) for any \(\theta \in \mathbb {R}^n\). Moreover, let \(\Vert \cdot \Vert _\theta \) denote the norm induced by the inner product \(\langle \cdot , \cdot \rangle _\theta \).
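To make these definitions concrete, the following minimal sketch (not part of the analysis; a hypothetical one-dimensional illustration with \(K = [0,1]\)) computes the mean and variance \(\varSigma (\theta )\) of a Boltzmann distribution by quadrature:

```python
import numpy as np

def boltzmann_moments(theta, a=0.0, b=1.0, num=200_001):
    """Mean and variance of the Boltzmann distribution on K = [a, b],
    whose density is proportional to exp(theta * x), via midpoint quadrature."""
    dx = (b - a) / num
    x = a + (np.arange(num) + 0.5) * dx
    w = np.exp(theta * x - np.max(theta * x))   # shift exponent for stability
    w /= w.sum()                                # normalized weights
    mean = (w * x).sum()
    var = (w * (x - mean) ** 2).sum()
    return mean, var
```

At \(\theta = 0\) this recovers the uniform distribution on [0, 1] (mean 1/2, variance 1/12), and as \(\theta \) grows the mass concentrates where \(\langle \theta , x \rangle \) is large, mirroring the role of \(\theta _k = -c/T_k\) below.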

The procedure Kalai and Vempala [12] use to generate samples on K is called hit-and-run sampling. This Markov chain Monte Carlo method was introduced by Smith [26] to sample from the uniform distribution over a bounded convex set. Later, it was generalized to absolutely continuous distributions (see for example Bélisle et al. [4]). The details of hit-and-run sampling are given in Algorithm 1.

[Algorithm 1: Hit-and-run sampling]

Note that if \(\langle \cdot , \cdot \rangle \) is the Euclidean inner product, the covariance operator \(\varSigma \) in Algorithm 1 can be represented as a matrix in \(\mathbb {R}^{n \times n}\).
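A single hit-and-run step can be sketched as follows. This is a hypothetical minimal implementation, not the paper’s code: it assumes a membership oracle `member` for K, a point `x` in the interior of K, and directions drawn from \(\mathcal {N}(0,\widehat{\varSigma })\); the chord endpoints are located by doubling and bisection, and the one-dimensional Boltzmann restriction along the chord is sampled by inverse-CDF.

```python
import numpy as np

def chord_endpoint(x, d, member, t_hi=1.0, iters=60):
    """Largest t >= 0 with x + t*d still in K, via doubling plus bisection.
    `member` is the membership oracle of the convex body K."""
    while member(x + t_hi * d):
        t_hi *= 2.0
        if t_hi > 1e12:   # safeguard; assumes K is bounded
            break
    t_lo = 0.0
    for _ in range(iters):
        t_mid = 0.5 * (t_lo + t_hi)
        if member(x + t_mid * d):
            t_lo = t_mid
        else:
            t_hi = t_mid
    return t_lo

def hit_and_run_step(x, theta, sigma_hat, member, rng):
    """One hit-and-run step targeting the Boltzmann distribution with
    parameter theta, directions drawn from N(0, sigma_hat)."""
    L = np.linalg.cholesky(sigma_hat)
    d = L @ rng.standard_normal(x.size)
    t_plus = chord_endpoint(x, d, member)
    t_minus = -chord_endpoint(x, -d, member)
    s = float(theta @ d)   # 1-D density along the chord is prop. to exp(s*t)
    u = rng.uniform()
    if abs(s * (t_plus - t_minus)) < 1e-12:
        t = t_minus + u * (t_plus - t_minus)   # essentially uniform
    else:
        # Inverse CDF of exp(s*t) restricted to [t_minus, t_plus].
        # (For very large s*(t_plus - t_minus) a more stable form is needed.)
        t = t_minus + np.log1p(u * np.expm1(s * (t_plus - t_minus))) / s
    return x + t * d
```

With \(\theta = 0\) this walk targets the uniform distribution on K, as in Smith’s original setting.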

In each iteration k of Kalai and Vempala’s algorithm [12], the temperature \(T_k\) is lowered. Then, hit-and-run samples are generated whose target distribution is the Boltzmann distribution with parameter \(\theta _k = -c / T_k\). These random walks use an approximation \(\widehat{\varSigma }(\theta _{k-1})\) of \(\varSigma (\theta _{k-1})\) to generate search directions. With these samples, an approximation \(\widehat{\varSigma }(\theta _k)\) of \(\varSigma (\theta _k)\) is formed, which will then be used in the next iteration. As k grows sufficiently large, so does the norm of \(\theta _k\). The Boltzmann distributions with parameter \(\theta _k\) will then concentrate more and more probability mass close to the set of optimal solutions to (1). For sufficiently large k, any sample from such a Boltzmann distribution is near-optimal with high probability.

One thing that needs further clarification is how to decrease the temperature in each iteration. In their original paper, Kalai and Vempala [12] show that the algorithm returns a near-optimal solution with high probability, for the temperature update (cf. line 5 in Algorithm 2)

$$\begin{aligned} T_{k} = \left( 1-\frac{1}{\sqrt{n}}\right) T_{k-1}, \end{aligned}$$
(2)

in \(m = O^*(\sqrt{n})\) iterations, where \(O^*\) suppresses polylogarithmic terms in the problem parameters. As mentioned above, Abernethy and Hazan’s alternative temperature schedule depends on the complexity parameter of the entropic barrier. This function is defined as follows.

Definition 1.2

Let \(K \subseteq \mathbb {R}^n\) be a convex body. Define the function \(f: \mathbb {R}^n \rightarrow \mathbb {R}\) as \(f(\theta ) = \ln \int _K e^{\langle \theta , x \rangle } {{\,\mathrm{d\!}\,}}x\). Then, the convex conjugate \(f^*\) of f,

$$\begin{aligned} f^*(x) = \sup _{\theta \in \mathbb {R}^n} \left\{ \langle \theta , x \rangle - f(\theta ) \right\} , \end{aligned}$$

is called the entropic barrier for K.

Bubeck and Eldan [7] showed that \(f^*\) is a self-concordant barrier for K with complexity parameter \(\vartheta \le n +o(n)\). The complexity parameter of \(f^*\) is

$$\begin{aligned} \vartheta := \sup _{x \in K} \langle Df^*(x), [D^2 f^*(x)]^{-1} Df^*(x) \rangle = \sup _{\theta \in \mathbb {R}^n} \langle \theta , \varSigma (\theta ) \theta \rangle = \sup _{\theta \in \mathbb {R}^n} \Vert \theta \Vert _\theta ^2, \end{aligned}$$
(3)

where we refer the reader to [3, 7] for more details.
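As a sanity check on (3), the quantity \(\Vert \theta \Vert _\theta ^2 = \theta ^2 {{\,\mathrm{Var}\,}}(X)\) can be evaluated numerically for the one-dimensional body \(K = [-1,1]\), where \(f(\theta ) = \ln (2\sinh (\theta )/\theta )\) gives the closed form \(1 - \theta ^2/\sinh ^2(\theta )\), increasing towards 1. The quadrature sketch below is purely illustrative and not part of the analysis:

```python
import numpy as np

def theta_norm_sq(theta, num=400_001):
    """||theta||_theta^2 = theta^2 * Var(X) for X Boltzmann-distributed on
    K = [-1, 1] with parameter theta > 0, by midpoint quadrature."""
    dx = 2.0 / num
    x = -1.0 + (np.arange(num) + 0.5) * dx
    w = np.exp(theta * (x - 1.0))      # shifted exponent for stability
    w /= w.sum()
    mean = (w * x).sum()
    return theta ** 2 * (w * (x - mean) ** 2).sum()

vals = [theta_norm_sq(t) for t in (1.0, 5.0, 20.0)]
# closed form: 1 - theta^2 / sinh(theta)^2, which increases towards 1
```

For n = 1 the supremum is therefore 1, consistent with the value \((n+1)/2\) for the Euclidean unit ball discussed below.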

Abernethy and Hazan [1] propose the temperature update

$$\begin{aligned} T_{k} = \left( 1 - \frac{1}{4 \sqrt{\vartheta }}\right) T_{k-1}, \end{aligned}$$
(4)

which will lead to \(m = O^*(\sqrt{\vartheta })\) iterations. As noted above, we have \(\vartheta \le n + o(n)\) in general, but it is not currently known whether \(\vartheta < n\) for any convex body. In particular, the temperature update (4) only improves on (2) if \(\vartheta < n/16\), which is not known to hold for any convex body. In Appendix 1 to this paper, we present numerical evidence suggesting that \(\vartheta = (n+1)/2\) for the Euclidean unit ball in \(\mathbb {R}^n\). We therefore consider a variation on the temperature schedule (4) suggested by Abernethy and Hazan, namely

$$\begin{aligned} T_{k} = \left( 1 - \frac{1}{\alpha \sqrt{\vartheta }}\right) T_{k-1} \text{ for } \text{ some } \alpha > 1+ 1/\sqrt{\vartheta }, \end{aligned}$$
(5)

which corresponds to (4) when \(\alpha = 4\), but gives larger temperature reductions when \(\alpha < 4\). We will refer to (5) as Abernethy–Hazan-type temperature updates. If \(\vartheta < n\), this may result in a larger temperature decrease than the Kalai and Vempala scheme (2), for a suitable choice of the parameter \(\alpha \).
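The comparison between (2) and (5) can be illustrated with a short computation; the parameter values below are hypothetical and only serve to show the effect of \(\alpha \sqrt{\vartheta }\) versus \(\sqrt{n}\):

```python
import math

def iterations_to_reach(ratio, T0, T_target):
    """Number of multiplicative updates T_k = ratio * T_{k-1} until T_m <= T_target."""
    return math.ceil(math.log(T_target / T0) / math.log(ratio))

n = 100
T0, T_target = 1.0, 1e-6
m_kv = iterations_to_reach(1 - 1 / math.sqrt(n), T0, T_target)   # update (2)

vartheta = (n + 1) / 2   # value suggested for the unit ball in Appendix 1
alpha = 1.3              # satisfies alpha > 1 + 1/sqrt(vartheta)
m_ah = iterations_to_reach(1 - 1 / (alpha * math.sqrt(vartheta)), T0, T_target)  # update (5)
```

Here \(\alpha \sqrt{\vartheta } \approx 9.24 < \sqrt{n} = 10\), so the Abernethy–Hazan-type update reaches the target temperature in fewer iterations, in line with the remark above.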

The algorithm by Kalai and Vempala [12] that uses a temperature schedule of the type introduced by Abernethy and Hazan [1] is now given in Algorithm 2.

[Algorithm 2: Simulated annealing with Abernethy–Hazan-type temperature updates]

1.2 Contributions and Outline of this Paper

Abernethy and Hazan [1] do not give a rigorous analysis of the temperature schedule (4) in their paper, only a proof sketch. In this paper, we provide the full details for the more general schedule (5). In doing so, we also fill in some details that were omitted in the original work by Kalai and Vempala [12] concerning the application of a theorem by Rudelson [25]. Moreover, we discuss the prospects for practical computation with Algorithm 2. Finally, we propose some heuristic improvements to Algorithm 2 to speed it up. Many of these results also appear in the PhD thesis [2] of the first author.

We start with a review of useful facts on probability distributions in Sect. 2. In Sect. 3, we extend a covariance approximation result by Rudelson [25] to the hit-and-run setting. Then, Sect. 4 proves that Algorithm 2 returns a solution that is near-optimal with high probability. In Sect. 5, we discuss the complexity of Algorithm 2. In Sect. 6, we look at the behavior of hit-and-run sampling for optimization over the doubly nonnegative cone and suggest some heuristic improvements. We then evaluate the resulting algorithm on problems from copositive programming (due to Berman et al. [5]) in Sect. 7.

2 Preliminaries

We will use the total variation distance to measure if the probability distribution of a hit-and-run sample is close to the target distribution.

Definition 2.1

Let \((K, \mathcal {F})\) be a measurable space. For two probability distributions \(\mu \) and \(\varphi \) over this space, their total variation distance is

$$\begin{aligned} \Vert \mu - \varphi \Vert _{\text {TV}} :=\sup _{A \in \mathcal {F}} | \mu (A) - \varphi (A)|. \end{aligned}$$

A useful property of the total variation distance is that it allows coupling of random variables, as the following lemma asserts.

Lemma 2.2

(e.g., [17,  Proposition 4.7]) Let X be a random variable on K with distribution \(\mu \), and let \(\varphi \) be a different probability distribution on K. If \(\Vert \mu - \varphi \Vert _{\text {TV}} = p\), we can construct another random variable Y on K distributed according to \(\varphi \) such that \(\mathbb {P}\{X = Y\} = 1-p\). Similarly, given two distributions \(\mu \) and \(\varphi \) on K such that \(\Vert \mu - \varphi \Vert _{\text {TV}} = p\), there exists a joint distribution \(\nu \) on \(K\times K\), with marginals \(\mu \) and \(\varphi \), respectively, such that if \((X,Y) \sim (K\times K,\nu )\), one has \(X \sim (K,\mu )\), \(Y \sim (K,\varphi )\), and \(\mathbb {P}_\nu \{X = Y\} \ge 1-p\).
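The coupling in Lemma 2.2 can be made explicit in a discrete toy example (hypothetical distributions, assuming \(\mu \ne \varphi \) so the total variation distance is positive): place the overlap mass \(\min (\mu , \varphi )\) on the diagonal and distribute the residual mass independently.

```python
import numpy as np

mu = np.array([0.5, 0.3, 0.2])
phi = np.array([0.4, 0.2, 0.4])
tv = 0.5 * np.abs(mu - phi).sum()          # total variation distance

# Maximal coupling: overlap on the diagonal, residuals spread as a product.
overlap = np.minimum(mu, phi)
res_mu, res_phi = mu - overlap, phi - overlap
nu = np.diag(overlap) + np.outer(res_mu, res_phi) / tv

row_marginal = nu.sum(axis=1)              # should equal mu
col_marginal = nu.sum(axis=0)              # should equal phi
p_equal = np.trace(nu)                     # P{X = Y} = 1 - tv
```

The joint distribution \(\nu \) has the prescribed marginals and achieves \(\mathbb {P}_\nu \{X = Y\} = 1 - p\) exactly, as in the lemma.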

We can now state the following mixing time result. It gives the number of hit-and-run steps one has to take before the distribution of the hit-and-run sample is sufficiently close to the target distribution. The result given here is a corollary of a result by Lovász and Vempala [19,  Theorem 1.1].

Theorem 2.3

Let \(K \subset \mathbb {R}^n\) be a convex body. Suppose \(\theta _0, \theta _1 \in \mathbb {R}^n\) satisfy \(\varDelta \theta :=\Vert \theta _0 - \theta _1 \Vert _{\theta _0} < 1\). Pick \(p > 0\), and suppose we have an invertible matrix \(\widehat{\varSigma }(\theta _0)\) such that

$$\begin{aligned} \frac{1}{2} \widehat{\varSigma }(\theta _0)^{-1} \preceq \varSigma (\theta _0)^{-1} \preceq 2 \widehat{\varSigma }(\theta _0)^{-1}. \end{aligned}$$
(6)

Consider a hit-and-run random walk as in Algorithm 1 applied to the Boltzmann distribution \(\mu _1\) with parameter \(\theta _1\) from a random starting point drawn from a Boltzmann distribution \(\mu _0\) with parameter \(\theta _0\). Let \(\mu ^\ell \) be the distribution of the hit-and-run point after \(\ell \) steps of hit-and-run sampling applied to \(\mu _1\), where the directions are drawn from a \(\mathcal {N}(0,\widehat{\varSigma }(\theta _0))\)-distribution. Then, there exists an absolute constant \(C>0\), such that, after

$$\begin{aligned} \ell = C \frac{n^3 }{(1-\varDelta \theta )^4} \log ^5 \left( \frac{n }{p^2 (1-\varDelta \theta )^4 } \right) \end{aligned}$$
(7)

hit-and-run steps, we have \(\Vert \mu ^\ell - \mu _1 \Vert _{\text {TV}} \le p\).

We omit the proof, since it is a straightforward consequence of Lovász and Vempala [19,  Theorem 1.1], applied to the Boltzmann distribution. The interested reader may find a detailed proof in [2,  Theorem 4.14].

Note that the theorem above requires an approximation of the covariance of a distribution that is in some sense ‘close’ to the target distribution. This is why Algorithm 2 approximates the covariance of every distribution that it encounters: This covariance is then used in the next iteration to generate hit-and-run directions.

One may reformulate assumption (6) in the theorem by using the following result from Horn and Johnson [11].

Lemma 2.4

([11,  Corollary 7.7.4(a)], see also [2,  Lemma 2.4]) Let AB be positive definite, self-adjoint linear operators from \(\mathbb {R}^n\) to \(\mathbb {R}^n\). Then, \(A \succeq B\) if and only if \(B^{-1} \succeq A^{-1}\).

Consequently, we have, for \(k = 0,1,\ldots \),

$$\begin{aligned} \frac{1}{2} \widehat{\varSigma }(\theta _k) \preceq \varSigma (\theta _k) \preceq 2 \widehat{\varSigma }(\theta _k) \iff \frac{1}{2} \widehat{\varSigma }(\theta _k)^{-1} \preceq \varSigma (\theta _k)^{-1} \preceq 2\widehat{\varSigma }(\theta _k)^{-1}. \end{aligned}$$

A key point of the analysis is to show that these conditions hold for each iteration k.
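The equivalence above is easy to check numerically on a small example (the matrices below are arbitrary stand-ins for \(\varSigma (\theta _k)\) and \(\widehat{\varSigma }(\theta _k)\), not produced by the algorithm):

```python
import numpy as np

def loewner_leq(X, Y, tol=1e-10):
    """X <= Y in the Loewner (positive semidefinite) order."""
    return np.linalg.eigvalsh(Y - X).min() >= -tol

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
sigma = A @ A.T + 5 * np.eye(5)          # stand-in for Sigma(theta_k)
sigma_hat = sigma + 0.1 * np.eye(5)      # stand-in for its approximation

inv = np.linalg.inv
direct = loewner_leq(0.5 * sigma_hat, sigma) and loewner_leq(sigma, 2 * sigma_hat)
inverse = loewner_leq(0.5 * inv(sigma_hat), inv(sigma)) and loewner_leq(inv(sigma), 2 * inv(sigma_hat))
# Lemma 2.4 predicts that `direct` and `inverse` agree
```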

3 Approximation of the Covariance Matrix

To guarantee the required approximation of the covariance matrix, Kalai and Vempala [12] use the following corollary to a theorem by Rudelson [25].

Theorem 3.1

([12,  Theorem A.1]) Let \(\varphi \) be a log-concave probability distribution over \(\mathbb {R}^n\) with mean 0 and identity covariance (i.e., isotropic), and let \(\rho \in (0,1)\). Let \(X_1, ..., X_N\) be independent samples from \(\varphi \). Then, there exist absolute constants \(C_1>0\) and \(C_2 > 0\) such that, if

$$\begin{aligned} N \ge C_1n \log ^{C_2}(n / \rho ), \end{aligned}$$

we have

$$\begin{aligned} \left\| \frac{1}{N} \sum _{j=1}^N X_j X_j^\top - I \right\| \le \frac{1}{4}, \end{aligned}$$

with probability \(1-\rho \), where the norm is the spectral (operator) norm.
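Theorem 3.1 can be illustrated empirically with a standard Gaussian, which is log-concave and isotropic (this is only a demonstration of the concentration phenomenon; the algorithm’s Boltzmann distributions are neither Gaussian nor isotropic):

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 20, 5000
X = rng.standard_normal((N, n))            # isotropic log-concave samples
emp_cov = X.T @ X / N                      # (1/N) * sum_j X_j X_j^T
dev = np.linalg.norm(emp_cov - np.eye(n), ord=2)   # spectral norm deviation
```

For these sizes, `dev` is typically around \(2\sqrt{n/N} \approx 0.13\), comfortably below the 1/4 threshold of the theorem.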

This theorem cannot be directly applied in the setting of Algorithm 2 for three reasons: The hit-and-run samples do not follow the target distribution \(\varphi \), the samples are not independent, and \(\varphi \) does not have mean 0 and identity covariance, i.e., \(\varphi \) is not isotropic. While Kalai and Vempala (correctly) state that Theorem 3.1 can be extended to the hit-and-run setting without significantly changing the number of samples N, a formal proof is not given, and we will provide the missing details in this section, since this is not straightforward.

We first show how to extend Theorem 3.1 to the non-isotropic case. To this end, we will need two results on log-concave random variables. The first is that the sum of independent log-concave random variables is again log-concave.

Lemma 3.2

Let XY be independent random variables in \(\mathbb {R}^n\) with log-concave density functions f and g, respectively. Then, \(X+Y\) is a log-concave random variable.

Proof

Theorem 7 in [24] shows that \(x \mapsto \int _{\mathbb {R}^n} f(x-y) g(y) {{\,\mathrm{d\!}\,}}y\) is log-concave on \(\mathbb {R}^n\). This function is precisely the density function of \(X+Y\). \(\square \)

The second result is a concentration result for log-concave distributions.

Lemma 3.3

(Lemma 3.3 from [18] (adapted from [20])) Let X be a random variable with a log-concave distribution. Denote \(\mathbb {E}(\Vert X - \mathbb {E}(X)\Vert _2^2) =: \sigma ^2\). Then, for all \(t > 1\),

$$\begin{aligned} \mathbb {P}\{ \Vert X - \mathbb {E}(X)\Vert _2 \le t \sigma \} \ge 1- e^{1-t}. \end{aligned}$$

We now extend Theorem 3.1 to the non-isotropic case.

Corollary 3.4

Let \(\varphi \) be a log-concave probability distribution over \(\mathbb {R}^n\) with mean \(\mu _\varphi \) and covariance \(\varSigma _\varphi \), and let \(\rho \in (0,1)\). Let \(X_1, ..., X_N\) be independent samples from \(\varphi \), where

$$\begin{aligned} N = \left\lceil \max \left\{ C _1 n \log ^{C_2}(2n / \rho ), 4n\left( 1+\ln \left( \frac{2}{\rho }\right) \right) ^2\right\} \right\rceil , \end{aligned}$$
(8)

and \(C_1\) and \(C_2\) are the absolute constants from Theorem 3.1, and let

$$\begin{aligned} {\hat{\mu }} = \frac{1}{N} \sum _{i=1}^N X_i, \;\;\; {\hat{\varSigma }} = \frac{1}{N} \sum _{i=1}^N (X_i - {\hat{\mu }})(X_i - {\hat{\mu }})^\top . \end{aligned}$$

Then, we have

$$\begin{aligned} \frac{1}{2} {\hat{\varSigma }} \preceq \varSigma _\varphi \preceq \frac{3}{2} {\hat{\varSigma }}, \end{aligned}$$
(9)

with probability \(1-\rho \).

Proof

One may assume w.l.o.g. that \(\varSigma _\varphi = I\), since the statement of the theorem is invariant under invertible linear transformations. Indeed, if a random variable X with distribution \(\varphi \) is replaced by AX for some linear transformation A, then the new covariance matrix and its approximation satisfy

$$\begin{aligned} \varSigma _\varphi ^{(\mathrm new)} = A \varSigma _\varphi A^* \text{ and } \hat{\varSigma }^{(\mathrm new)} = A {\hat{\varSigma }} A^*. \end{aligned}$$

By choosing \(A = \varSigma _{\varphi }^{-1/2}\), we therefore get the identity covariance. Moreover,

$$\begin{aligned} \frac{1}{2} {\hat{\varSigma }} \preceq \varSigma _\varphi \preceq \frac{3}{2} {\hat{\varSigma }} \Longleftrightarrow \frac{1}{2} A{\hat{\varSigma }} A^*\preceq A\varSigma _\varphi A^* \preceq \frac{3}{2} A{\hat{\varSigma }} A^*. \end{aligned}$$

Assuming therefore w.l.o.g. that \(\varSigma _\varphi = I\), the random variable \(X - \mu _\varphi \) has an isotropic log-concave distribution, and after setting

$$\begin{aligned} \varSigma ' = \frac{1}{N} \sum _{i=1}^N (X_i - \mu _\varphi )(X_i - \mu _\varphi )^\top , \end{aligned}$$

Theorem 3.1 yields \( \left\| \varSigma ' - I \right\| \le \frac{1}{4}\), with probability \(1- \rho /2\). It therefore suffices to show that \(\varSigma '\) and \({\hat{\varSigma }}\) are ‘sufficiently close.’ To this end, direct calculation yields

$$\begin{aligned} {\hat{\varSigma }} - \varSigma ' = -(\mu _\varphi - {\hat{\mu }})(\mu _\varphi - \hat{\mu })^\top , \end{aligned}$$

so that

$$\begin{aligned} \left\| {\hat{\varSigma }} - \varSigma ' \right\| = \Vert {\hat{\mu }} - \mu _\varphi \Vert _2^2. \end{aligned}$$

Note that \({\hat{\mu }}= \frac{1}{N} \sum _{i=1}^N X_i\) is a log-concave random vector, by Lemma 3.2. We may therefore bound the right-hand side of the last equality via the concentration inequality of Lemma 3.3 to obtain

$$\begin{aligned} \mathbb {P}\left\{ \Vert {\hat{\mu }} - \mu _\varphi \Vert _2^2 \le \frac{n}{N}\left( 1+\ln \left( \frac{2}{\rho }\right) \right) ^2 \right\} \ge 1 - \rho /2, \end{aligned}$$

where we used that the variance of each component of \(X_i\) is 1 for all i. Thus, if \(N \ge 4n\left( 1+\ln \left( \frac{2}{\rho }\right) \right) ^2\), one has

$$\begin{aligned} \Vert {\hat{\varSigma }} - I \Vert \le \Vert \varSigma ' - I\Vert + \left\| {\hat{\varSigma }} - \varSigma ' \right\| \le \frac{1}{4} + \frac{1}{4} = \frac{1}{2}, \end{aligned}$$

with probability at least \((1-\rho /2)^2 > 1- \rho \). This implies (9). \(\square \)
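The rank-one identity \({\hat{\varSigma }} - \varSigma ' = -(\mu _\varphi - {\hat{\mu }})(\mu _\varphi - \hat{\mu })^\top \) used in the proof is a standard recentering computation; it can be verified numerically on arbitrary data (the Gaussian samples below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 4, 500
mu = np.array([1.0, -2.0, 0.5, 3.0])       # the true mean mu_phi
X = rng.standard_normal((N, n)) + mu       # samples with mean mu

mu_hat = X.mean(axis=0)
sigma_hat = (X - mu_hat).T @ (X - mu_hat) / N    # centered at mu_hat
sigma_prime = (X - mu).T @ (X - mu) / N          # centered at mu_phi

diff = sigma_hat - sigma_prime
rank_one = -np.outer(mu - mu_hat, mu - mu_hat)
# diff equals rank_one, and its spectral norm is ||mu_hat - mu||_2^2
```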

We proceed to show that we may replace independent sampling by hit-and-run sampling in Corollary 3.4 in a well-defined sense. To this end, fix a starting vector \(X_0\) and a tolerance \(q >0\), and consider the following two sequences of random variables:

  • I.i.d. \(X_i\) \((i \in \{1,\ldots ,N\})\) drawn independently from \(\varphi \).

  • \(Y_i\) \((i \in \{1,\ldots ,N\})\) obtained via \(\ell \) steps of hit-and-run sampling from starting point \(X_{0}\), as in Step 9 of Algorithm 2. Here, we assume that the \(\ell \) steps are sufficient to guarantee that the total variation distance between the distribution of \(Y_i\) and \(\varphi \) is at most a given \(p \in (0,1)\) (with reference to Theorem 2.3).

A key observation is that we may assume, without loss of generality, that \(X_i = Y_i\) for each \(i \in \{1,\ldots ,N\}\) with high probability.

Proposition 3.1

Let \(X_i\), \(Y_i\) \((i \in \{1,\ldots ,N\})\) be as above. Without loss of generality, one may assume that the joint distribution of \(X_i\) and \(Y_i\) satisfies \(\mathbb {P}\{X_i = Y_i\} \ge 1-p\) for each \(i \in \{1,\ldots ,N\}\).

Proof

The \(X_i\) all have distribution \(\varphi \) on K and the \(Y_i\) all have the same distribution, say \(\mu \) on K (that depends on \(X_0)\).

Moreover, by assumption \(\Vert \mu - \varphi \Vert _{TV} \le p\). By the second part of Lemma 2.2, there exists a joint distribution, say \(\nu \) on \(K \times K\) with marginals \(\varphi \) and \(\mu \), so that \((X,Y) \sim (\nu ,K\times K)\) implies \(\mathbb {P}_\nu \{X=Y\} \ge 1-p\), as well as \(X \sim (\varphi ,K)\) and \(Y \sim (\mu ,K)\).

We now replace (i.e., couple) the random variables \(Y_i\) with new random variables \(Y'_i\) so that \((X_i,Y_i') \sim (\nu ,K\times K)\) for each \(i \in \{1,\ldots ,N\}\).

The important point is that we now have \(\mathbb {P}[X_i = Y'_i] \ge 1-p\) for each \(i \in \{1,\ldots ,N\}\). Moreover, the random variables \(Y_i'\) will be indistinguishable from the hit-and-run random variables \(Y_i\), in the sense that both \(Y_i \sim (\mu ,K)\) and \(Y_i' \sim (\mu ,K)\) for all i. \(\square \)

As a result, Corollary 3.4 still holds if we replace the \(X_i\) by the \(Y_i\) for all i, but now with probability \((1 - \rho ) - Np\), by the union bound.

4 Proof of Convergence

We continue by proving that Algorithm 2 converges to the optimum in polynomial time. The following result was established by Kalai and Vempala [12] for linear functions, and extended from linear to convex functions \(h: \mathbb {R}^n \rightarrow \mathbb {R}\) by De Klerk and Laurent [10].

Lemma 4.1

([10,  Corollary 1]) Let \(K \subseteq \mathbb {R}^n\) be a convex body. For any convex function \(h: \mathbb {R}^n \rightarrow \mathbb {R}\) and temperature \(T > 0\), we have

$$\begin{aligned} \frac{\int _K h(x) e^{-h(x)/T} {{\,\mathrm{d\!}\,}}x}{\int _K e^{-h(x)/T} {{\,\mathrm{d\!}\,}}x} \le nT + \min _{x \in K} h(x). \end{aligned}$$
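Lemma 4.1 can be checked numerically in the simplest setting \(h(x) = x\), \(K = [0,1]\), \(n = 1\), \(\min _K h = 0\) (a quadrature sketch for illustration only):

```python
import numpy as np

def expected_objective(T, num=200_001):
    """E[h(X)] for h(x) = x and X with density proportional to exp(-x/T)
    on K = [0, 1], by midpoint quadrature."""
    dx = 1.0 / num
    x = (np.arange(num) + 0.5) * dx
    w = np.exp(-x / T)
    return float((w * x).sum() / w.sum())

# Lemma 4.1 with n = 1 and min h = 0 predicts each gap is negative.
gaps = [expected_objective(T) - 1 * T for T in (0.1, 0.5, 1.0)]
```

As \(T \downarrow 0\) the expected objective value tends to the minimum, which is the mechanism behind the annealing schedule.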

The main step in the analysis of Algorithm 2 is thus to show that we maintain a good approximation \(\widehat{\varSigma }(\theta _k)\) of \(\varSigma (\theta _k)\) for all k, to guarantee that the hit-and-run sampling continues to work in all iterations, as discussed in the previous section.

Theorem 4.2

Consider the setting of Algorithm 2. Let \(\alpha > 1\) be such that \(\varDelta \theta = \sqrt{\vartheta } / (\alpha \sqrt{\vartheta } - 1) < 1\), let \(q \in (0, 1]\),

$$\begin{aligned} m&= \left\lceil \frac{\log \left( q \varepsilon / (4 \alpha n R) \right) }{\log \left( 1 - 1/ (\alpha \sqrt{\vartheta }) \right) } \right\rceil , \end{aligned}$$
(10)
$$\begin{aligned} \rho&= \frac{q}{2m}, \end{aligned}$$
(11)
$$\begin{aligned} N&\text{ as } \text{ in } (8), \end{aligned}$$
(12)
$$\begin{aligned} p&= \frac{q}{2Nm}, \end{aligned}$$
(13)

and let \(\ell \) be as in (7). (Note that \(\ell \) depends on n, p, and \(\varDelta \theta \).) With these inputs, Algorithm 2 returns a solution \(X_m\) with

$$\begin{aligned} \mathbb {P} \left\{ \langle c, X_m \rangle - \min _{x \in K} \langle c, x \rangle \le \varepsilon \right\} \ge 1 - q. \end{aligned}$$

Proof

First, let us show that the conditions of Theorem 2.3 are satisfied. Note that

$$\begin{aligned} \Vert c\Vert _0^2 = \langle c, \varSigma (0) c \rangle = \frac{\int _K \langle c, x - \mu _{\theta , K} \rangle ^2 {{\,\mathrm{d\!}\,}}x}{\int _K {{\,\mathrm{d\!}\,}}x}, \end{aligned}$$

because \(\varSigma (0)\) is the covariance matrix of the uniform distribution. Since \(\Vert c \Vert = 1\) and K is contained in a ball with radius R,

$$\begin{aligned} \Vert c\Vert _0^2 \le \frac{\int _K \Vert c \Vert ^2 \Vert x - \mu _{\theta , K} \Vert ^2 {{\,\mathrm{d\!}\,}}x}{\int _K {{\,\mathrm{d\!}\,}}x} \le \frac{\int _K 1^2 (2R)^2 {{\,\mathrm{d\!}\,}}x}{\int _K {{\,\mathrm{d\!}\,}}x} = (2R)^2. \end{aligned}$$

Hence, for \(k = 1\),

$$\begin{aligned} \Vert \theta _1 - \theta _0 \Vert _{\theta _0} = \frac{\Vert c \Vert _0}{T_1} \le \frac{2 R}{T_1} = \frac{2 R}{2 \alpha R \left( 1 -1 / (\alpha \sqrt{\vartheta }) \right) } = \varDelta \theta . \end{aligned}$$

For all \(k > 1\), our choice of \(\theta _k\) and (3) yield

$$\begin{aligned} \Vert \theta _k - \theta _{k-1} \Vert _{\theta _{k-1}} = \left( \frac{T_{k-1}}{T_k} - 1 \right) \Vert \theta _{k-1} \Vert _{\theta _{k-1}} \le \left( \frac{1}{1 - 1 / (\alpha \sqrt{\vartheta })} - 1 \right) \sqrt{\vartheta } = \varDelta \theta . \end{aligned}$$

Throughout all iterations \(k \le m\) of Algorithm 2, we maintain

$$\begin{aligned} \tfrac{1}{2} \widehat{\varSigma }(\theta _k)^{-1} \preceq \varSigma (\theta _k)^{-1} \preceq 2 \widehat{\varSigma }(\theta _k)^{-1}, \end{aligned}$$
(14)

with probability \(1 - m(\rho +Np) = 1-q\), by the analysis in the previous section. Thus, the conditions of Theorem 2.3 are satisfied with high probability.

By the first part of Lemma 2.2, \(X_m\) is equal to a random variable drawn from a Boltzmann distribution with parameter \(\theta _m\) with probability at least \(1-p\). Thus, Markov’s inequality and Lemma 4.1 show that

$$\begin{aligned} \mathbb {P} \left\{ \langle c, X_m \rangle - \min _{x \in K} \langle c, x \rangle > \varepsilon \right\} \le \frac{\mathbb {E}\left[ \langle c, X_m \rangle - \min _{x \in K} \langle c, x \rangle \right] }{\varepsilon } \le \frac{n T_m}{\varepsilon } \le \tfrac{1}{2} q,\nonumber \\ \end{aligned}$$
(15)

where the final inequality uses the chosen value of m as follows:

$$\begin{aligned} T_m = 2 \alpha R \left( 1 - \frac{1}{\alpha \sqrt{\vartheta }} \right) ^m \le 2 \alpha R \frac{q \varepsilon }{4 \alpha n R} = \frac{q \varepsilon }{2 n}. \end{aligned}$$

\(\square \)
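The final step of the proof, that the choice of m in (10) forces \(T_m \le q \varepsilon / (2n)\), can be verified numerically for sample inputs (the parameter values below are hypothetical):

```python
import math

def final_temperature(alpha, vartheta, R, n, eps, q):
    """m from (10) and the resulting T_m, with T_0 = 2 * alpha * R."""
    ratio = 1 - 1 / (alpha * math.sqrt(vartheta))
    m = math.ceil(math.log(q * eps / (4 * alpha * n * R)) / math.log(ratio))
    return m, 2 * alpha * R * ratio ** m

m, Tm = final_temperature(alpha=2.0, vartheta=50.5, R=1.0, n=100, eps=1e-3, q=0.05)
# the proof needs T_m <= q * eps / (2 * n), i.e. n * Tm / eps <= q / 2
```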

5 Complexity Analysis and Discussion

We saw that, for the inputs specified in Theorem 4.2, Algorithm 2 returns a solution which is near-optimal with high probability. Let us now consider the number of membership oracle calls required for this configuration. It was noted in, for example, [2,  Section 4.1] that the number of oracle calls for one hit-and-run walk is \(O^*(\ell )\). Hence, Algorithm 2 uses \(O^*(m N \ell )\) oracle calls.

First, let us look at the number of iterations from (10). Since

$$\begin{aligned} \frac{-1}{\log \left( 1 - 1/ (\alpha \sqrt{\vartheta }) \right) } = O(\alpha \sqrt{\vartheta }), \end{aligned}$$

we have \(m = O(\sqrt{\vartheta })\) for fixed \(\alpha \). Next, the number of samples from (8) satisfies

$$\begin{aligned} N = O(n \ln (mn / q)) = O(n \ln (\sqrt{\vartheta }n / q)) = O^*(n). \end{aligned}$$

Next, we bound the number of hit-and-run steps. Recall (7) shows that

$$\begin{aligned} \ell = O\left( \frac{n^3 }{(1-\varDelta \theta )^4} \log ^5 \left( \frac{n }{p^2 (1-\varDelta \theta )^4 } \right) \right) . \end{aligned}$$
(16)

For the \(\varDelta \theta \) from Theorem 4.2 and the value of p from (13), (16) shows

$$\begin{aligned} \ell = O\left( n^3\log ^5 \left( \frac{n\vartheta }{q^2} \right) \right) = O^*\left( n^3\right) . \end{aligned}$$

Summarizing, the total number of oracle calls by Algorithm 2 is

$$\begin{aligned} \begin{aligned} O^*(m N \ell )&= O^* \left( n^{4.5}\right) , \end{aligned} \end{aligned}$$

where we used Bubeck and Eldan’s result [7] that \(\vartheta \le n + o(n)\).

Thus, we have given a rigorous complexity analysis of the algorithm by Kalai and Vempala [12] using the temperature update of Abernethy and Hazan [1]. Our final bound on the number of oracle calls coincides with the \(O^*(n^{4.5})\) bound claimed in [12].

5.1 Initialization

One might still wonder how to generate a good estimate \(\widehat{\varSigma }(0)\) of the uniform covariance matrix \(\varSigma (0)\) to start Algorithm 2. The ‘rounding the body’ procedure from Lovász and Vempala [18] is suitable for this purpose. This procedure returns a \(\widehat{\varSigma }(0)\) for which

$$\begin{aligned} \mathbb {P}\left\{ \tfrac{1}{2} \widehat{\varSigma }(0) \preceq \varSigma (0) \preceq 2 \widehat{\varSigma }(0) \right\} \ge 1-\frac{1}{n}. \end{aligned}$$

By Lemma 2.4, \(\tfrac{1}{2} \widehat{\varSigma }(0) \preceq \varSigma (0) \preceq 2 \widehat{\varSigma }(0)\) if and only if \(\tfrac{1}{2} \widehat{\varSigma }(0)^{-1} \preceq \varSigma (0)^{-1} \preceq 2 \widehat{\varSigma }(0)^{-1}\), so the starting condition for Algorithm 2 can be satisfied by the ‘rounding the body’ procedure. The number of calls to the membership oracle for this procedure is \(O^*(n^4)\), which is overshadowed by the complexity of Algorithm 2 itself.

6 Numerical Examples on the Doubly Nonnegative Cone

Having established the theoretical complexity of Algorithm 2, we now move to the second goal of this work: investigating the practical perspectives of this method. We will test Algorithm 2 on the problem of determining if a matrix is copositive, which is known to be a co-NP-complete problem [22].

To define this problem, let \(\mathbb {S}^{m \times m}\) denote the space of real symmetric \(m \times m\) matrices. A matrix \(C \in \mathbb {S}^{m \times m}\) is called copositive if \(x^\top C x \ge 0\) for all \(x \in \mathbb {R}^{m}_+\) (see Bomze [6] for a survey on copositive programming and its applications).

The standard SDP relaxation of the problem of checking for copositivity of C is the following:

$$\begin{aligned} \inf \left\{ \langle C, X \rangle : \sum _{i=1}^{m} \sum _{j=1}^{m} X_{ij} \le 1, X \ge 0, X \succeq 0 \right\} , \end{aligned}$$
(17)

where \(\langle \cdot , \cdot \rangle \) is the trace inner product. If the value of (17) is nonnegative, the matrix C is copositive. However, since we place a nonnegativity constraint on every element of the matrix X, the Newton system in every interior point iteration is of size \(O(m^2 \times m^2)\), which quickly leads to impractical computation times (see, e.g., Burer [8]).

Before we can apply Algorithm 2, we need to translate (17) to a problem over \(\mathbb {R}^{m(m+1)/2}\). The approach is standard: for any \(A = [A_{ij}]_{ij} \in \mathbb {S}^{m \times m}\), define

$$\begin{aligned} {{\,\mathrm{svec}\,}}(A) := (A_{11}, \sqrt{2} A_{12}, ..., \sqrt{2} A_{1m}, A_{22}, \sqrt{2} A_{23}, ..., \sqrt{2} A_{2m}, ..., A_{mm})^\top , \end{aligned}$$

such that \({{\,\mathrm{svec}\,}}(A) \in \mathbb {R}^{m(m+1)/2}\). If \(\mathbb {R}^{m(m+1)/2}\) is endowed with the Euclidean inner product, the adjoint of \({{\,\mathrm{svec}\,}}\) is defined for every \(a \in \mathbb {R}^{m(m+1)/2}\) as

$$\begin{aligned} {{\,\mathrm{smat}\,}}(a) = \begin{bmatrix} a_1 &{} a_2 /\sqrt{2} &{} \dots &{} a_m / \sqrt{2}\\ a_2 /\sqrt{2} &{} a_{m+1} &{} \dots &{} a_{2m-1} / \sqrt{2}\\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ a_m / \sqrt{2} &{} a_{2m-1} / \sqrt{2} &{} \dots &{} a_{m(m+1)/2} \end{bmatrix}, \end{aligned}$$

such that \({{\,\mathrm{smat}\,}}(a) \in \mathbb {S}^{m \times m}\). Moreover, \({{\,\mathrm{smat}\,}}({{\,\mathrm{svec}\,}}(A)) = A\) and \({{\,\mathrm{svec}\,}}({{\,\mathrm{smat}\,}}(a)) = a\) for all A and a. Now let \(c = {{\,\mathrm{svec}\,}}(C)\). Problem (17) is equivalent to the following problem over \(\mathbb {R}^{m(m+1)/2}\):

$$\begin{aligned} \inf \left\{ \langle c, x \rangle : \sum _{i=1}^{m} \sum _{j=1}^{m} ({{\,\mathrm{smat}\,}}(x))_{ij} \le 1, x \ge 0, {{\,\mathrm{smat}\,}}(x) \succeq 0 \right\} . \end{aligned}$$
(18)

Note that membership of the feasible set of (18) can be determined in \(O(m^3)\) operations. Let \(n = \frac{1}{2}m(m+1)\) be the number of variables in problem (18).
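A sketch of these operators, together with the resulting membership oracle for (18), might read as follows (function names are ours; the eigenvalue test dominates the \(O(m^3)\) cost):

```python
import numpy as np

def svec(A):
    """Stack the upper triangle of a symmetric matrix row by row,
    scaling off-diagonal entries by sqrt(2) so that the trace inner
    product <A, B> equals svec(A) @ svec(B)."""
    m = A.shape[0]
    out = []
    for i in range(m):
        out.append(A[i, i])
        out.extend(np.sqrt(2) * A[i, i + 1:m])
    return np.array(out)

def smat(a):
    """Inverse (and adjoint) of svec."""
    # recover m from n = m(m+1)/2
    m = int((np.sqrt(8 * a.size + 1) - 1) / 2)
    A = np.zeros((m, m))
    k = 0
    for i in range(m):
        A[i, i] = a[k]; k += 1
        for j in range(i + 1, m):
            A[i, j] = A[j, i] = a[k] / np.sqrt(2); k += 1
    return A

def in_feasible_set(x, tol=1e-12):
    """Membership oracle for (18): x >= 0, total entry sum of smat(x)
    at most 1, and smat(x) positive semidefinite."""
    X = smat(x)
    if np.any(x < -tol) or X.sum() > 1 + tol:
        return False
    return np.linalg.eigvalsh(X).min() >= -tol
```

For instance, the starting point \({{\,\mathrm{svec}\,}}(mI+J)/(2e^\top {{\,\mathrm{svec}\,}}(mI+J))\) used in the experiments below passes this oracle.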

6.1 Covariance Approximation

First, we investigate how the quality of the covariance approximation depends on the walk length for problem (18). We take \(20,\! 000\) hit-and-run samples from the uniform distribution over the feasible set of (18) with walk length \(50,\! 000\) (directions are drawn from \(\mathcal {N}(0,I)\) and the starting point is \({{\,\mathrm{svec}\,}}(mI+J) / (2e^\top {{\,\mathrm{svec}\,}}(mI+J))\), where J is the all-ones matrix). These samples are used to create the estimate \(\widehat{\varSigma }_0\). Then, the experiment is repeated for walk lengths \(\ell \le 50,\! 000\) and sample sizes \(N \le 20,\! 000\). We refer to these new estimates as \(\widehat{\varSigma }_{\ell , N} \). We would like

$$\begin{aligned} -\epsilon y^\top \widehat{\varSigma }_{\ell ,N} y \le y^\top (\widehat{\varSigma }_0 - \widehat{\varSigma }_{\ell ,N}) y \le \epsilon y^\top \widehat{\varSigma }_{\ell ,N} y \qquad \forall y \in \mathbb {R}^n, \end{aligned}$$

for some small \(\epsilon > 0\). This is equivalent to

$$\begin{aligned} -\epsilon x^\top x \le x^\top (\widehat{\varSigma }_{\ell ,N}^{-1/2} \widehat{\varSigma }_0 \widehat{\varSigma }_{\ell ,N}^{-1/2} - I) x \le \epsilon x^\top x \qquad \forall x \in \mathbb {R}^n, \end{aligned}$$

i.e., we would like the spectral radius of \(\widehat{\varSigma }_{\ell ,N}^{-1/2} \widehat{\varSigma }_0 \widehat{\varSigma }_{\ell ,N}^{-1/2} - I\) to be at most \(\epsilon \). Because the spectra of \(\widehat{\varSigma }_{\ell ,N}^{-1/2} \widehat{\varSigma }_0 \widehat{\varSigma }_{\ell ,N}^{-1/2} - I\) and \(\widehat{\varSigma }_{\ell ,N}^{-1} \widehat{\varSigma }_0 - I\) are the same, we focus on the spectral radius \(\rho (\widehat{\varSigma }_{\ell ,N}^{-1} \widehat{\varSigma }_0 - I)\).
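This equivalence of spectra is easy to check numerically; the sketch below uses synthetic positive definite matrices in place of the covariance estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
# two synthetic positive definite "covariance estimates"
B0, B1 = rng.standard_normal((2, n, n))
sigma0 = B0 @ B0.T + n * np.eye(n)
sigma = B1 @ B1.T + n * np.eye(n)

# symmetric form: rho(Sigma^{-1/2} Sigma0 Sigma^{-1/2} - I)
w, V = np.linalg.eigh(sigma)
inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
rho_sym = np.abs(np.linalg.eigvalsh(inv_sqrt @ sigma0 @ inv_sqrt - np.eye(n))).max()

# similar (generally nonsymmetric) form: rho(Sigma^{-1} Sigma0 - I)
rho_gen = np.abs(np.linalg.eigvals(np.linalg.solve(sigma, sigma0) - np.eye(n))).max()
```

The two matrices are similar via conjugation with \(\widehat{\varSigma }_{\ell ,N}^{1/2}\), so `rho_sym` and `rho_gen` agree up to rounding.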

The result is shown in Fig. 1, where m refers to the size of the matrices in (17). Hence, the covariance matrices in question are of size \(\frac{1}{2}m(m+1) \times \frac{1}{2}m(m+1)\).

Fig. 1 Effect of sample size N and walk length \(\ell \) on quality of uniform covariance approximation

One major conclusion from Fig. 1 is that the trajectory toward zero is relatively slow. To show that simply adding more samples with higher walk lengths will not be feasible in practice, we present in Fig. 2 the running times required to estimate a covariance matrix at the desired accuracy. Specifically, this figure shows the running times of the ‘efficient’ combinations of N and \(\ell \): the combinations of N and \(\ell \) plotted in Fig. 1 for which there are no \(N'\) and \(\ell '\) such that \(N' \ell ' \le N \ell \) and \(\rho (\widehat{\varSigma }_{\ell ',N'}^{-1} \widehat{\varSigma }_0 - I) < \rho (\widehat{\varSigma }_{\ell ,N}^{-1} \widehat{\varSigma }_0 - I)\). (The computer used has an Intel i7-6700 CPU with 16 GB RAM, and the code used six threads.) Figure 2 shows that, even in low dimensions, approximating the covariance matrix to high accuracy takes an impractically long time.

Fig. 2 Running times required to find an approximation \(\widehat{\varSigma }_{\ell ,N}\) of the desired quality

To show that the slow trajectory toward zero in Fig. 1 stems from the fundamental difficulty of covariance estimation, we consider a simpler problem: approximating the covariance matrix of the uniform distribution over the hypercube \([0,1]^n\) in \(\mathbb {R}^n\). The true covariance matrix of this distribution is known to be \(\varSigma := \frac{1}{12} I\).

Again, we will use hit-and-run with varying walk lengths and sample sizes to generate samples from the uniform distribution over \([0,1]^n\), and compare the resulting covariance matrices \(\widehat{\varSigma }_{\ell , N}\) with \(\varSigma = \frac{1}{12} I\). (Comparing against a covariance estimate based on hit-and-run samples as in Fig. 1 yields roughly the same image.) The result is shown in Fig. 3.
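A minimal hit-and-run sampler for this hypercube experiment might look as follows (our own sketch: the sample size and walk length are kept small for illustration, and the quality measure is the spectral radius \(\rho (\varSigma ^{-1} \widehat{\varSigma }_{\ell ,N} - I) = \rho (12\,\widehat{\varSigma }_{\ell ,N} - I)\)):

```python
import numpy as np

def hit_and_run_cube(n, num_samples, walk_len, rng):
    """Hit-and-run over [0,1]^n with N(0, I) directions: from the
    current point, draw a direction, intersect the resulting chord
    with the cube, and move to a uniform point on that chord."""
    x = np.full(n, 0.5)  # interior starting point
    samples = np.empty((num_samples, n))
    for s in range(num_samples):
        for _ in range(walk_len):
            d = rng.standard_normal(n)
            # chord {x + t d} in [0,1]^n: per-coordinate bounds on t
            with np.errstate(divide="ignore"):
                lo = (0.0 - x) / d
                hi = (1.0 - x) / d
            t_min = np.minimum(lo, hi).max()
            t_max = np.maximum(lo, hi).min()
            x = x + rng.uniform(t_min, t_max) * d
        samples[s] = x  # keep every walk_len-th state of the chain
    return samples

rng = np.random.default_rng(1)
S = hit_and_run_cube(n=5, num_samples=4000, walk_len=20, rng=rng)
cov = np.cov(S, rowvar=False)
rho = np.abs(np.linalg.eigvalsh(12 * cov - np.eye(5))).max()
```

Already in this toy setting, driving `rho` below a few percent requires substantially more samples and longer walks.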

Figure 3 shows a pattern similar to that of Fig. 1: as the problem size increases, the walk length has to grow with the sample size to ensure the estimate is as good as the sample size can guarantee. While this progression toward zero may appear slow, we do not need to know every eigenvalue and eigenvector of the covariance matrix to high accuracy. Recall that we only use this covariance estimate in Algorithm 2 to generate hit-and-run directions. As such, it may suffice to have an estimate that roughly shows which directions are ‘long,’ and which ones are ‘short.’

6.2 Mean Approximation

Fig. 3 Effect of sample size N and walk length \(\ell \) on quality of uniform covariance matrix approximation over the hypercube in \(\mathbb {R}^n\)

Next, we consider the problem of approximating the mean. Although Algorithm 2 does not require approximating the mean of a Boltzmann distribution, this mean does lie on the central path of the interior point method proposed by Abernethy and Hazan [1, Appendix D].

We again take \(20,\! 000\) hit-and-run samples from the uniform distribution over the feasible set of (18) with walk length \(50,\! 000\). (Directions are drawn from \(\mathcal {N}(0,I)\) and the starting point is \({{\,\mathrm{svec}\,}}(mI+J) / (2e^\top {{\,\mathrm{svec}\,}}(mI+J))\), where J is the all-ones matrix.) These samples are used to create the mean estimate \(\widehat{x}_0\). Then, the experiment is repeated for walk lengths \(\ell \le 50,\! 000\) and sample sizes \(N \le 20,\! 000\). We refer to these new estimates as \(\widehat{x}_{\ell , N}\). Using the approximation \(\widehat{\varSigma }_0\) of the uniform covariance matrix from the previous section, we compute \(\Vert \widehat{x}_0 - \widehat{x}_{\ell ,N} \Vert _{\widehat{\varSigma }_0^{-1}}\) and plot the results in Fig. 4.

Fig. 4 Effect of sample size N and walk length \(\ell \) on quality of uniform mean approximation

The results are comparable to those in Figs. 1 and 2. It will take an impractical amount of time before the mean estimate approximates the true mean well enough for practical purposes.

6.3 Kalai–Vempala Algorithm

The results from the previous two sections show that we should not hope to approximate the covariance matrix and sample mean with high accuracy in high dimensions. However, it is still instructive to verify whether such accuracy is really required for Algorithm 2 to work in practice.

We therefore generated a random vector \(c \in \mathbb {R}^{m(m+1)/2}\) as follows: if \(C \in \mathbb {R}^{m \times m}\) is a matrix whose elements are drawn independently from a standard normal distribution, then \(C+C^\top +(\sqrt{2} - 2) {{\,\mathrm{Diag}\,}}(C)\) is a symmetric matrix whose elements all have variance 2. We then let

$$\begin{aligned} c = \frac{{{\,\mathrm{svec}\,}}(C+C^\top +(\sqrt{2} - 2) {{\,\mathrm{Diag}\,}}(C))}{\Vert {{\,\mathrm{svec}\,}}(C+C^\top +(\sqrt{2} - 2) {{\,\mathrm{Diag}\,}}(C))\Vert }, \end{aligned}$$

serve as the objective of our optimization problem (18). We find the optimal solution \(x_*\) with MOSEK 8.0 [21] and then run Algorithm 2 with \(\varepsilon = 10^{-3}\) and \(p = 10^{-1}\). The final gap \(\langle c, x_{\text {final}} - x_* \rangle \) is shown in Fig. 5. One can see that, for practical sample sizes and walk lengths, the method does not converge to the optimal solution.
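The construction of c can be sketched as follows (function names are ours; the variance claim is easy to check empirically):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 6

def random_symmetric(rng, m):
    """C + C^T + (sqrt(2) - 2) Diag(C): each off-diagonal entry is the
    sum of two independent standard normals (variance 2), and each
    diagonal entry becomes sqrt(2) * C_ii (also variance 2)."""
    C = rng.standard_normal((m, m))
    return C + C.T + (np.sqrt(2) - 2) * np.diag(np.diag(C))

def svec(A):
    """Row-major upper triangle with off-diagonals scaled by sqrt(2)."""
    iu = np.triu_indices(A.shape[0])
    return np.where(iu[0] == iu[1], 1.0, np.sqrt(2)) * A[iu]

A = random_symmetric(rng, m)
c = svec(A)
c = c / np.linalg.norm(c)  # normalized objective for (18)
```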

Fig. 5 Effect of sample size N and walk length \(\ell \) on the final gap of Algorithm 2

6.4 Kalai–Vempala Algorithm with Acceleration Heuristic

Keeping our findings above in mind, we propose the heuristic adaptation of Algorithm 2 presented in Algorithm 3. The main modifications we suggest are:

  1. Use the (centered) samples generated in the previous iteration as directions for hit-and-run in the current iteration. This eliminates the need to estimate the covariance matrix of a distribution, only to then draw samples from that same distribution: we instead draw directions directly from the centered samples (cf. line 10 in Algorithm 3). Thus, each sample is used to generate a hit-and-run direction with uniform probability.

  2. As a starting point for the first random walk in some iteration k, use the sample mean from iteration \(k-1\). While this does significantly change the distribution of the starting point, it concentrates more probability mass around the mean of the Boltzmann distribution with parameter \(\theta _{k-1}\), such that the starting point of the random walk is likely already close to the mean of the Boltzmann distribution with parameter \(\theta _k\). In a similar vein, we return the mean of the samples in the final iteration, not just one sample. This does not change the expected objective value of the final result, and therefore does not affect the probabilistic guarantee that we derived in (15) by Markov's inequality. However, using the mean does reduce the variance of the final solution.

  3. Start all except the first random walk in some iteration k from the end point of the previous random walk, rather than from a common starting point. The idea here is that the random samples as a whole will then exhibit less dependence, thus improving the approximation quality of the empirical distribution.

Algorithm 3
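The three modifications can be sketched against a generic membership oracle as follows (a minimal sketch of ours, not the listing of Algorithm 3 itself: the Metropolis filter along the chord is our simplification of sampling the one-dimensional Boltzmann restriction exactly, and all names are assumptions):

```python
import numpy as np

def chord_endpoint(oracle, x, d, tol=1e-7):
    """Largest t with oracle(x + t d) true, by doubling then bisection
    (assumes oracle(x) is true and the feasible set is bounded)."""
    t_hi = 1.0
    while oracle(x + t_hi * d):
        t_hi *= 2.0
    t_lo = 0.0
    while t_hi - t_lo > tol:
        t_mid = 0.5 * (t_lo + t_hi)
        if oracle(x + t_mid * d):
            t_lo = t_mid
        else:
            t_hi = t_mid
    return t_lo

def heuristic_iteration(oracle, c, theta, start, prev_samples, N, ell, rng):
    """One temperature stage in the spirit of Algorithm 3: directions
    come from the centered previous samples (modification 1), the first
    walk starts at the previous sample mean (modification 2), and each
    walk continues from the previous end point (modification 3)."""
    centered = prev_samples - prev_samples.mean(axis=0)
    x = start
    samples = np.empty((N, start.size))
    for s in range(N):
        for _ in range(ell):
            d = centered[rng.integers(len(centered))]
            t_plus = chord_endpoint(oracle, x, d)
            t_minus = -chord_endpoint(oracle, x, -d)
            y = x + rng.uniform(t_minus, t_plus) * d
            # accept with the Metropolis ratio for exp(-<c, x> / theta)
            if rng.random() < min(1.0, np.exp((c @ (x - y)) / theta)):
                x = y
        samples[s] = x
    return samples.mean(axis=0), samples
```

On a toy instance (the hypercube with a linear objective), one such stage already pulls the sample mean toward the minimizer.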

With these modifications implemented, we can no longer study the quality of the covariance matrix. Therefore, we will simply consider if the resulting optimization algorithm leads to a small error in the objective value. We solve the problem from Sect. 6.3 with Algorithm 3. The results are shown in Fig. 6.

Fig. 6 Effect of sample size N and walk length \(\ell \) on the final gap of Algorithm 3

For low dimensions in particular, the proposed changes seem to have a positive effect.

It can be seen from Fig. 6 that—roughly speaking—the final gap \(\langle c, x_{\text {final}} - x_* \rangle \) takes values between two extremes. At one end, the method does not converge and the final gap is still of the order \(10^{-1}\). At the other end, the method does converge to the optimum, such that the gap is of the order \(10^{-4} = \varepsilon p\). Note that \(\varepsilon p\) is exactly the size we would like the expected gap to have to guarantee that the gap is smaller than \(\varepsilon \) with probability \(1-p\) by Markov’s inequality. Whether we are at one end or the other depends on N and \(\ell \) being large enough compared to m. As a heuristic, we propose that

$$\begin{aligned} N = \left\lceil n \sqrt{n} \right\rceil , \qquad \ell = \left\lceil n \sqrt{n} \right\rceil , \end{aligned}$$
(19)

where \(n = m(m+1)/2\) is the number of variables, are generally sufficient to ensure that the final gap is of the order \(\varepsilon p\).

7 Numerical Examples on the Copositive Cone

We now turn our attention away from the doubly nonnegative cone, and toward the copositive cone. Although—as mentioned earlier—deciding if a matrix is copositive is a co-NP-complete problem [22], there are a number of procedures to test for copositivity. Clearly, \(A = [A_{ij}]_{ij} \in \mathbb {S}^{m \times m}\) is copositive if and only if

$$\begin{aligned} \min \left\{ a^\top A a : e^\top a = 1, a \ge 0 \right\} , \end{aligned}$$
(20)

is nonnegative, where e is the all-ones vector. Xia et al. [28] show that solving (20) is equivalent to solving

$$\begin{aligned} \begin{aligned} \min \,&- \nu \\ \text {s.t.}\,&A a + \nu e - \eta = 0\\&e^\top a = 1\\&0 \le a \le b\\&0 \le \eta \le M(e - b)\\&b \in \{0,1\}^m, \end{aligned} \end{aligned}$$
(21)

where \(M = 2 m \max _{i,j \in \{1, ..., m\}} |A_{ij}|\). (To be precise, every optimal solution \((a, \nu , \eta )\) to (21) gives an optimal solution a to (20), and these two problems have the same optimal values.) Note that we are generally not interested in solving (21) to optimality: it suffices to find a feasible solution of (21) with a negative objective value, or confirm that no such solution exists. For the majority of the matrices encountered by Algorithm 3 applied to our test sets described below, this could be checked quickly.
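As an illustration of one direction of this test, the following sketch (ours, not the MILP (21)) searches for a certificate of non-copositivity, i.e., a point a in the simplex with \(a^\top A a < 0\); failing to find one proves nothing:

```python
import numpy as np

def noncopositivity_certificate(A, trials=2000, rng=None):
    """Random search over the simplex for a >= 0, e^T a = 1 with
    a^T A a < 0.  Finding such a point proves A is not copositive."""
    rng = rng or np.random.default_rng()
    m = A.shape[0]
    for _ in range(trials):
        a = rng.dirichlet(np.ones(m))  # uniform on the simplex
        if a @ A @ a < 0:
            return a
    return None

# the Horn matrix is a standard example of a copositive matrix;
# its negation is not copositive
horn = np.array([[ 1, -1,  1,  1, -1],
                 [-1,  1, -1,  1,  1],
                 [ 1, -1,  1, -1,  1],
                 [ 1,  1, -1,  1, -1],
                 [-1,  1,  1, -1,  1]], float)
```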

7.1 Separating from the Completely Positive Cone

Recall that a matrix \(A \in \mathbb {S}^{m \times m}\) is completely positive if \(A = B B^\top \) for some \(B \ge 0\). It is easily seen that optimization problems over the completely positive cone can be relaxed as optimization problems over the doubly nonnegative cone. To strengthen this relaxation, one could add a cutting plane separating the optimal solution Y of the doubly nonnegative relaxation from the completely positive cone. This is listed as an open (computational) problem by Berman et al. [5,  Section 5], who note that the problem of generating such a cut has only been answered for specific structures of Y, including \(5 \times 5\) matrices [9]. In general, such a cut could be generated for a doubly nonnegative matrix Y by the copositive program

$$\begin{aligned} \inf \left\{ \langle Y, X \rangle : \langle X, X \rangle \le 1, X \text { copositive} \right\} . \end{aligned}$$
(22)

Below, we solve this problem for \(6 \times 6\) matrices, by way of example.

To generate test instances, we are interested in matrices on the boundary of the \(6 \times 6\) doubly nonnegative cone. The extreme rays of this cone are described by Ycart [29,  Proposition 6.1]. We generate random instances from the class of matrices described under case 3, graph 4 in Proposition 6.1 in [29]. These matrices are (up to permutation of the indices) doubly nonnegative matrices \(Y = [Y_{ij}]_{ij}\) with rank 3 satisfying \(Y_{i,i+1} = 0\) for \(i = 1, ..., 5\). To generate such a matrix, we draw the elements of two vectors \(v_1, v_2 \in \mathbb {R}^6\) and the first element \((v_3)_1 \in \mathbb {R}\) of a vector \(v_3 \in \mathbb {R}^6\) from a Poisson distribution with rate 1, and multiply each of these elements with \(-1\) with probability \(\frac{1}{2}\).

The remaining elements of \(v_3\) are then chosen such that \(Y = \sum _{k=1}^3 v_k v_k^\top \) satisfies \(Y_{i,i+1} = 0\) for \(i = 1, ..., 5\). This procedure is repeated if the matrix Y is not doubly nonnegative, or if BARON 15 [27] could find a nonnegative matrix \(B \in \mathbb {R}^{6\times 9}\) such that \(Y = BB^\top \) in less than 30 s. (For the cases where such a decomposition could be found, BARON terminated in less than a second.) Thus, we are left with doubly nonnegative matrices for which it cannot quickly be shown that they are completely positive.
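The instance generator can be sketched as follows (our reconstruction: `complete_v3` solves the zero-superdiagonal conditions \(\sum _{k=1}^3 (v_k)_i (v_k)_{i+1} = 0\) by a simple recurrence, and the BARON-based complete-positivity filter is not reproduced):

```python
import numpy as np

def signed_poisson(rng, size=None):
    """Poisson(1) magnitudes with independent random signs."""
    return rng.poisson(1.0, size) * rng.choice([-1.0, 1.0], size)

def complete_v3(v1, v2, v3_first):
    """Given v1, v2 and the first entry of v3, solve Y_{i,i+1} = 0
    for the remaining entries of v3; fails (returns None) if an
    intermediate entry of v3 is zero."""
    m = len(v1)
    v3 = np.zeros(m)
    v3[0] = v3_first
    for i in range(m - 1):
        if v3[i] == 0:
            return None
        v3[i + 1] = -(v1[i] * v1[i + 1] + v2[i] * v2[i + 1]) / v3[i]
    return v3

def random_boundary_instance(rng, m=6, max_tries=5000):
    """Retry until Y = sum_k v_k v_k^T has nonnegative entries;
    Y is then doubly nonnegative (PSD by construction) of rank at
    most 3 with zero superdiagonal."""
    for _ in range(max_tries):
        v1, v2 = signed_poisson(rng, (2, m))
        v3 = complete_v3(v1, v2, signed_poisson(rng))
        if v3 is None:
            continue
        Y = np.outer(v1, v1) + np.outer(v2, v2) + np.outer(v3, v3)
        if np.all(Y >= 0):
            return Y
    return None
```

For example, taking \(v_1 = v_2 = e\) and \((v_3)_1 = 1\) yields \(v_3 = (1,-2,1,-2,1,-2)^\top \) and a doubly nonnegative Y with zero superdiagonal.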

For ten such randomly generated matrices (see Appendix 1), the optimal value of Algorithm 3 applied to (22) is given in Table 1. This table shows the normalized objective value \(\langle Y / \Vert Y\Vert , X^* \rangle \), where Y is a doubly nonnegative matrix as described above, and \(X^*\) is the final solution returned by Algorithm 3.

Table 1 Objective values returned by Algorithm 3 and by the Ellipsoid method, applied to (22)

Note that in all cases, Algorithm 3 succeeds in finding a copositive matrix \(X^*\) such that \(\langle Y, X^* \rangle < 0\), which means a cut separating Y from the completely positive matrices was found.

Note that solving the MILP (21) for a matrix A that is not copositive yields a hyperplane separating A from the copositive cone. Thus, we can also solve problem (22) with the Ellipsoid method of Yudin and Nemirovski [30], for example. For the sake of comparison, the results of the Ellipsoid method are also included in Table 1. Note, in particular, that the number of oracle calls in Table 1 is several orders of magnitude smaller for the Ellipsoid method.

8 Conclusion

We have shown that Kalai and Vempala's algorithm [12] returns a solution that is near-optimal for (1) with high probability in polynomial time when the temperature update (5) is used. The main drawback of using the algorithm in practice is that a large number of samples (i.e., membership oracle calls) is required. As a result, the Ellipsoid method outperformed Algorithm 3 by a large margin in our tests. Based on our experiments, one would thus favor polynomial-time cutting plane methods such as the Ellipsoid method, or more sophisticated alternatives as described, for example, in [16]. To obtain a practically viable variant of the Kalai–Vempala algorithm, one would have to improve the sampling process greatly, or exploit massive parallelism to speed up the hit-and-run sampling.