Exploiting ideal-sparsity in the generalized moment problem with application to matrix factorization ranks

We explore a new type of sparsity for the generalized moment problem (GMP) that we call ideal-sparsity. This sparsity exploits the presence of equality constraints requiring the measure to be supported on the variety of an ideal generated by bilinear monomials, modeled by an associated graph. We show that this enables an equivalent sparse reformulation of the GMP, where the single (high-dimensional) measure variable is replaced by several (lower-dimensional) measure variables supported on the maximal cliques of the graph. We explore the resulting hierarchies of moment-based relaxations for the original dense formulation of the GMP and for this new, equivalent ideal-sparse reformulation, when applied to the problem of bounding nonnegative and completely positive matrix factorization ranks. We show that the ideal-sparse hierarchies provide bounds that are at least as good as (and often tighter than) those obtained from the dense hierarchy. This is in sharp contrast to the situation when exploiting correlative sparsity, as is most common in the literature, where the resulting bounds are weaker than the dense bounds. Moreover, while correlative sparsity requires the underlying graph to be chordal, no such assumption is needed for ideal-sparsity. Numerical results show that the ideal-sparse bounds are often tighter and much faster to compute than their dense analogs.


Introduction
We consider the generalized moment problem (abbreviated as GMP), of the form

val = inf { ∫ f_0 dµ : µ ∈ M(R^n), Supp(µ) ⊆ K, ∫ f_i dµ = a_i (i ∈ [N]) },   (1)

where f_0, f_i ∈ R[x] are multivariate polynomials in the variables x = (x_1, ..., x_n), a_i ∈ R, K ⊆ R^n (taken to be Borel measurable), and the optimization is over the set M(R^n) of (finite positive) Borel measures on R^n. In problem (1) we restrict to measures µ ∈ M(R^n) whose support Supp(µ) is contained in K, which is equivalent to requiring ∫ f dµ = ∫_K f dµ for any Borel measurable function f : R^n → R. Throughout, we assume that K is a basic closed semialgebraic set of the form

K = { x ∈ R^n : g_j(x) ≥ 0 (j ∈ [m]), x_i x_j = 0 ({i,j} ∈ Ē) },   (2)

where g_j ∈ R[x] are polynomials, E is a given set of pairs of distinct elements of V = [n] := {1, ..., n}, and Ē is the following set of pairs:

Ē = { {i,j} : i ≠ j ∈ [n], {i,j} ∉ E }.   (3)

Hence, the set K is contained in the variety of the ideal generated by the monomials x_i x_j for the pairs {i,j} ∈ Ē. It will be convenient to consider the graph G = (V, E), so that the conditions x_i x_j = 0 appearing in the definition of K correspond to the nonedges of G. This notation may seem at first sight cumbersome. However, the motivation for it is that the graph G encodes the supports of solutions to problem (1); this will be especially useful for applications to matrix factorization ranks like the completely positive rank (cp-rank) or the nonnegative rank of a matrix, where G will correspond to the support graph of the matrix.
The generalized moment problem (with K semialgebraic) has been much studied in recent years. It makes it possible to model a wide variety of problems, including polynomial optimization (minimization of a polynomial or rational function over K), volume computation, control theory, option pricing in finance, and much more. See, e.g., [45,46,47,35] and further references therein.
The focus of this paper is to exploit the presence of explicit ideal constraints (of a special form) in the description of the semialgebraic set K for solving problem (1). This indeed naturally induces some sparsity structure on problem (1), to which we will refer as ideal-sparsity structure. Our objective is to explore how one can best exploit this ideal-sparsity structure in order to define more efficient semidefinite hierarchies for problem (1) and apply them to sparse matrix factorization ranks. A remarkable feature is that the ideal-sparse hierarchies provide bounds that are at least as good as (and often better than) the bounds provided by the original dense hierarchy. Moreover, the underlying sparsity graph is not required to be chordal. Both these features are in stark contrast to the existing sparse hierarchies based on correlative sparsity, whose bounds are always dominated by the dense bounds and which require the underlying sparsity graph to be chordal in order to guarantee convergence. We refer to Section 3.2 for an in-depth discussion of correlative sparsity and ideal-sparsity.
We focus here on the application to the completely positive and nonnegative factorization ranks, which ask for a factorization by nonnegative vectors. However, as we will mention in the final discussion section, this ideal-sparsity framework could also be applied in more general settings. Indeed, it could be applied to other matrix factorization ranks, such as the (completely) positive semidefinite rank, where one asks for a factorization by positive semidefinite matrices, in which case one would have to apply tools from polynomial optimization in noncommutative variables. Also, instead of an ideal generated by quadratic monomials, one could have an ideal generated by higher degree monomials. In addition, up to a change of variables, one could consider an ideal generated by more general products of linear terms, such as (a^T x + b)(c^T x + d). This type of constraint, often known as a complementarity constraint, occurs in various applications, including ReLU neural networks or optimization when considering KKT optimality conditions.
Next, we mention the overall organization of the paper and give some general notation used throughout.After that, we will give a broad overview of the contents and main results obtained in the paper.

Organization of the paper
The paper is organized as follows. In the rest of the Introduction we outline the main results of the paper. Then, in Section 2 we recall some preliminaries about linear functionals on polynomials and moment matrices. In Section 3 we consider the GMP (1): we show its sparse reformulation (11), we present the corresponding sparse hierarchies, and we discuss how ideal-sparsity relates to the more classic correlative sparsity. Section 4 is devoted to the application to the cp-rank and Section 5 to the application to the nonnegative rank. We conclude with some final remarks and discussions in Section 6.

Notation
We gather here some notation that is used throughout the paper. For n, t ∈ N set N^n_t = {α ∈ N^n : |α| ≤ t}, where |α| = Σ_{i=1}^n α_i denotes the degree of the monomial x^α = x_1^{α_1} ⋯ x_n^{α_n}. We let [x]_t = (x^α)_{α ∈ N^n_t} denote the vector of monomials with degree at most t (listed in some given fixed order). Moreover, R[x] (resp., R[x]_t) denotes the set of n-variate polynomials in the variables x = (x_1, ..., x_n) (with degree at most t). Let Σ denote the set of sum-of-squares polynomials, of the form Σ_i q_i^2 for some q_i ∈ R[x], and set Σ_t = Σ ∩ R[x]_t. Consider a set U ⊆ [n]. Given a vector y ∈ R^{|U|}, we let (y, 0_{V\U}) ∈ R^n denote the vector obtained by padding y with zeros at the entries indexed by [n] \ U. For an n-variate function f : R^{|V|} → R, we let f|_U : R^{|U|} → R denote the function in the variables x(U) = {x_i : i ∈ U}, which is obtained from f by setting to zero all the variables x_i indexed by i ∈ V \ U. That is, f|_U(y) = f(y, 0_{V\U}) for y ∈ R^{|U|}. So, if f is an n-variate polynomial, then f|_U is a |U|-variate polynomial in the variables x(U).
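As a small illustration of the restriction operation f|_U on polynomials, the following Python sketch (with a hypothetical dictionary representation of polynomials, mapping exponent tuples to coefficients) keeps exactly the monomials whose variables are all indexed by U; every other monomial is set to zero:

```python
def restrict(poly, U, n):
    """Compute f|_U for a polynomial stored as {exponent tuple: coefficient}.
    A monomial survives iff every variable it uses is indexed by U."""
    return {alpha: c for alpha, c in poly.items()
            if all(i in U for i in range(n) if alpha[i] > 0)}

# f = 2*x0*x1 + 3*x2 + 5 in n = 3 variables
f = {(1, 1, 0): 2.0, (0, 0, 1): 3.0, (0, 0, 0): 5.0}
print(restrict(f, {0, 1}, 3))  # x2 set to 0: {(1, 1, 0): 2.0, (0, 0, 0): 5.0}
```

Constant monomials survive every restriction, consistent with f|_U(y) = f(y, 0_{V\U}).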
For a symmetric matrix M ∈ S^n, the notation M ⪰ 0 means that M is positive semidefinite, i.e., v^T M v ≥ 0 for all v ∈ R^n. Throughout, we let I_n and J_n denote the identity matrix and the all-ones matrix of size n, which we sometimes also denote as I and J when the dimension is clear from the context. The support of a vector x ∈ R^n is the set Supp(x) = {i ∈ [n] : x_i ≠ 0}.

Roadmap through the paper
In the rest of this section, we offer a quick roadmap through the main contents of the paper. We begin by recalling how to define the dense hierarchy of bounds for problem (1). We then discuss their main drawback (quick growth of the matrices involved in the semidefinite programs) and several options for addressing this difficulty that have been offered in the literature. After that, we introduce the new ideal-sparse reformulation of problem (1) and the corresponding ideal-sparse hierarchy, which we then specialize to the applications of bounding the completely positive and nonnegative ranks.

Classical (dense) moment relaxations
We begin by recalling the classical moment approach that makes it possible to build hierarchies of semidefinite approximations for problem (1). For details, see, e.g., the monograph by Lasserre [46], or the survey [40]. For t ∈ N ∪ {∞}, the set

M(g)_{2t} = { Σ_{j=0}^m σ_j g_j : σ_j ∈ Σ, deg(σ_j g_j) ≤ 2t (j = 0, 1, ..., m) }   (4)

is the quadratic module generated by g = (g_1, ..., g_m) and truncated at degree 2t, setting g_0 = 1. We also set M(g) := M(g)_∞. Here, Σ denotes the set of sums of squares of polynomials in R[x]. Similarly,

I_Ē,2t = { Σ_{{i,j} ∈ Ē} p_ij x_i x_j : p_ij ∈ R[x], deg(p_ij x_i x_j) ≤ 2t }   (5)

denotes the truncation of the ideal I_Ē at degree 2t. We can now define the moment relaxation of level t for problem (1):

ξ_t = inf { L(f_0) : L ∈ R[x]*_{2t}, L(f_i) = a_i (i ∈ [N]), L ≥ 0 on M(g)_{2t}, L = 0 on I_Ē,2t },   (6)

where R[x]*_{2t} denotes the set of linear functionals L : R[x]_{2t} → R. The motivation for the above parameter is as follows. Assume µ ∈ M(R^n) is a measure that is feasible for problem (1), and consider the associated linear functional L that acts on R[x]_{2t} via L(f) = ∫ f dµ. Then L(f_i) = a_i for i ∈ [N], L ≥ 0 on M(g)_{2t} (since any polynomial in M(g)_{2t} is nonnegative on the set K), and L = 0 on I_Ē,2t (since any polynomial in I_Ē,2t vanishes on K). This shows that the parameter ξ_t lower bounds the optimum value val of problem (1).
We refer to the above hierarchy of parameters ξ_t as the dense moment hierarchy. Clearly, they satisfy ξ_t ≤ ξ_{t+1} ≤ ξ_∞ ≤ val. Moreover, under some mild assumptions, these bounds converge asymptotically to the optimum value val of (1). This fundamental property follows from the general theory of the GMP (see, e.g., [46], [40]) and is summarized in the following theorem, which will be used repeatedly throughout this work.
Theorem 1. Assume problem (1) is feasible and the following Slater-type condition holds:

f_0 − Σ_{i=1}^N z_i f_i > 0 on K for some scalars z_1, ..., z_N.   (7)

Then, problem (1) has an optimal solution µ, which can be chosen to be finite atomic. If, in addition, M(g) is Archimedean, i.e., R − Σ_{i=1}^n x_i^2 ∈ M(g) for some scalar R > 0, then we have lim_{t→∞} ξ_t = ξ_∞ = val.

As will be recalled in Section 2.1, program (6) can be reformulated as a semidefinite program, and thus the bound ξ_t can be computed using semidefinite optimization algorithms. However, a common drawback of the dense hierarchy (6) is that it involves matrices whose size grows very quickly with the level t and with the degree and number of variables of the polynomials f_0, f_1, ..., f_N, g_1, ..., g_m. Hence, even though these relaxations are convex, they might be challenging to solve already for GMP instances of modest size.

Existing schemes to improve scalability of the dense moment relaxations
Several schemes have been developed to overcome the scalability issue of the dense hierarchy (6) just mentioned. They aim to reduce the size of the involved matrices by exploiting the specific structure of the input polynomials without compromising the convergence guarantees of the structure-induced moment relaxations. One workaround consists of exploiting symmetries [56], but this requires that each input polynomial is invariant under the action of a subgroup of the general linear group.
Another approach is to exploit different kinds of sparsity structures. The first kind is called correlative sparsity, which occurs when there are few correlations between the variables of the input polynomials [63,44]. Correlative sparsity has been extended to derive moment relaxations of polynomial problems in complex variables [38], noncommutative variables [39], and polynomial matrix inequalities [69]. The second kind is called term sparsity, which occurs when there are few (by comparison with all possible) monomial terms involved in the input polynomials, and for which correlative sparsity is not exploitable. For unconstrained polynomial optimization, one well-known solution is to eliminate the monomial terms which never appear among the supports of sums of squares decompositions [55]. Alternatively, one can decompose the input polynomial as a sum of nonnegative circuits, by solving a geometric programming relaxation [37] or a second-order cone programming relaxation [5,64], or as a sum of arithmetic-geometric-mean-exponentials [15] with relative entropy programming relaxations. Term sparsity has recently been the focus of active research, with extensions to constrained polynomial optimization [65,66]. Note that both kinds of sparsity can be combined [67]. For a general exposition of sparse polynomial optimization, we refer to the recent surveys [51,70].
We will return to the correlative sparsity approach for the GMP later in Section 3.2 and discuss how it relates to the new ideal-sparsity structure considered in this paper. By contrast with classical polynomial optimization problems, it is not completely clear which initial set of monomials should be chosen to initialize the term sparsity hierarchy when facing a given GMP instance. Therefore, we do not explore the combination of term sparsity and ideal-sparsity, as such an investigation would warrant a separate publication.

New ideal-sparse moment relaxations
As we now see, one can exploit the fact that the set K in (2) is contained in the variety of the ideal I_Ē from (3). The basic idea is that, instead of optimizing over a single measure µ supported on K ⊆ R^n, one may optimize over several measures that are supported on smaller dimensional spaces.
A set W ⊆ V is a clique of the graph G = (V, E) if {u, v} ∈ E for any two distinct vertices u, v ∈ W. A clique is maximal (w.r.t. inclusion) if it is not strictly contained in any other clique of G. Let V_1, ..., V_p denote the maximal cliques of the graph G = (V, E) and, for k ∈ [p], define the following subset of K:

K_k = { x ∈ K : Supp(x) ⊆ V_k }.   (8)

Recall Supp(x) = {i ∈ [n] : x_i ≠ 0}. As any x ∈ K satisfies x_i x_j = 0 for all {i,j} ∈ Ē, its support is a clique of G and is thus contained in some maximal clique V_k. Therefore, the sets K_1, ..., K_p cover the set K:

K = K_1 ∪ ⋯ ∪ K_p.   (9)

Next, define the projection

K̂_k = { y ∈ R^{|V_k|} : (y, 0_{V\V_k}) ∈ K_k } ⊆ R^{|V_k|}   (10)

of K_k onto the coordinates indexed by V_k. Recall that (y, 0_{V\V_k}) denotes the vector of R^n obtained from y ∈ R^{|V_k|} by padding it with zeros at all entries indexed by V \ V_k. Moreover, given a function f : R^{|V|} → R, recall that f|_{V_k} denotes its restriction to the variables x(V_k). We may now define the following sparse analog of problem (1):

val^isp = inf { Σ_{k=1}^p ∫ f_0|_{V_k} dµ_k : µ_k ∈ M(K̂_k) (k ∈ [p]), Σ_{k=1}^p ∫ f_i|_{V_k} dµ_k = a_i (i ∈ [N]) }.   (11)

Hence, while problem (1) has a single measure variable µ on the space R^{|V|}, problem (11) involves p measure variables, where µ_k lives on the smaller dimensional space R^{|V_k|}. As we will show in Proposition 6 below, both formulations (1) and (11) are in fact equivalent, i.e., we have equality val = val^isp. Here, we use the superscript 'isp' as a reminder that the formulation exploits ideal-sparsity; we will follow this same notation below for the corresponding moment hierarchy and also later for the parameters attached to matrix factorization ranks.
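The sparse reformulation needs the list of maximal cliques V_1, ..., V_p of G. As a minimal illustration (a pure-Python Bron–Kerbosch sketch, not the code used for the experiments in the paper), the following enumerates them for a small graph:

```python
def maximal_cliques(n, edges):
    """Enumerate the maximal cliques of G = ([n], edges) via Bron-Kerbosch."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    cliques = []
    def bk(R, P, X):
        # R: current clique, P: candidates, X: already-processed vertices
        if not P and not X:
            cliques.append(sorted(R))
            return
        for v in list(P):
            bk(R | {v}, P & adj[v], X & adj[v])
            P.remove(v)
            X.add(v)
    bk(set(), set(range(n)), set())
    return sorted(cliques)

# 5-cycle 0-1-2-3-4-0 plus the chord {0,2}
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]
print(maximal_cliques(5, edges))  # [[0, 1, 2], [0, 4], [2, 3], [3, 4]]
```

Each maximal clique V_k then indexes one measure variable µ_k in (11); note that nothing here requires the graph to be chordal.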
Based on its reformulation via (11), we can now define another hierarchy of moment approximations for problem (1), to which we refer as the ideal-sparse moment hierarchy:

ξ_t^isp = inf { Σ_{k=1}^p L_k(f_0|_{V_k}) : L_k ∈ R[x(V_k)]*_{2t} (k ∈ [p]), Σ_{k=1}^p L_k(f_i|_{V_k}) = a_i (i ∈ [N]), L_k ≥ 0 on M(g|_{V_k})_{2t} (k ∈ [p]) },   (12)

where g|_{V_k} := (g_1|_{V_k}, ..., g_m|_{V_k}). This hierarchy provides bounds for val that are at least as good as the bounds (6). Namely, ξ_t ≤ ξ_t^isp ≤ val holds for any t ≥ 1 (see Theorem 7 below).
Hence, the ideal-sparse bounds ξ_t^isp present a double advantage compared to the dense bounds ξ_t. First, they are at least as good and sometimes strictly better, as we will see later in concrete examples. For the application to the completely positive and nonnegative ranks, we will see classes of matrices showing a large separation between the dense bound and the ideal-sparse bound of level t = 1; see Examples 14 and 16. Second, their computation is potentially faster, since the sets V_k can be much smaller than the full set V. We will also see in later examples that the computation of the ideal-sparse bounds can indeed be much faster. On the other hand, the number of maximal cliques in the graph G could be large, so there is a trade-off. We refer to discussions later in the paper around specific applications.
Interestingly, no structural chordality property needs to be assumed on the cliques V_1, ..., V_p of the graph G. We will comment in Section 3.2 on the link between the ideal-sparsity approach presented here and the more classical correlative sparsity approach that can be followed when considering a chordal extension G̃ of the graph G.
The idea of optimizing over multiple measures has already appeared in several contexts, similarly to what is routinely done in many computational methods, e.g., finite elements. In the context of analyzing dynamical systems involving polynomial data, a similar trick has been used to perform optimal control of piecewise-affine systems in [1], and later on to characterize invariant measures for piecewise-polynomial systems (see [50, § 3.5]). In the context of set estimation, one can also rely on a multi-measure approach to approximate the moments of Lebesgue measures supported on unions of basic semialgebraic sets [48]. The common idea consists in using the piecewise structure of the dynamics and/or the state-space partition to decompose the measure of interest into a sum of local measures supported on each partition cell. The advantage in our current setting is that these measures are supported on smaller dimensional spaces, which leads to a potentially strong computational benefit when considering the associated semidefinite programming relaxations.
We next present instances of GMP to which the above ideal-sparsity framework naturally applies, namely to derive bounds on matrix factorization ranks such as the completely positive rank and the nonnegative rank.

Bounds on the completely positive rank via GMP
Let A ∈ S^n be a symmetric matrix with nonnegative entries. Assume A is a completely positive matrix (abbreviated as cp-matrix), i.e., A can be written as A = Σ_{ℓ=1}^r a_ℓ a_ℓ^T for some nonnegative vectors a_1, ..., a_r ∈ R^n_+.
Then, the smallest integer r ∈ N for which such a decomposition exists is the cp-rank of A, denoted rank cp (A).
Checking whether a given matrix A is completely positive is a computationally hard problem (see [23]). The moment approach has been applied to the question of testing whether A is a cp-matrix and finding a cp-factorization, in particular, by Nie [53], who formulates it as testing the existence of a representing measure (over the standard simplex) for the sequence of entries of A.
Hierarchies of moment-based relaxations have also been employed to obtain sequences of bounds for the rank of tensors [61], as well as for the symmetric nuclear norm of tensors [54]. Here, we focus on the question of bounding the cp-rank. No efficient algorithms are known for computing the cp-rank. This motivates the search for efficient methods giving lower bounds on the cp-rank, as, e.g., in [28,32,33]. The following parameter τ_cp(A) was introduced in [28] as a natural "convexification" of the cp-rank, providing a lower bound for it: τ_cp(A) ≤ rank_cp(A). As observed below, the parameter τ_cp(A) can be reformulated as an instance of problem (1), with an ideal-sparsity structure inherited from the matrix A.
To avoid trivialities we assume A_ii > 0 for all i ∈ [n]. (Indeed, if A is a cp-matrix with A_ii = 0, then its i-th row/column is identically zero and thus it can be removed without changing the cp-rank.) Note that the constraints A − xx^T ≥ 0 (entrywise) and x ≥ 0 are equivalent to √(A_ii) x_i − x_i^2 ≥ 0 (i ∈ [n])^1 and A_ij − x_i x_j ≥ 0 (1 ≤ i < j ≤ n). Moreover, they imply x_i x_j = 0 whenever A_ij = 0. Let us define the graph G_A = (V, E_A) as the support graph of A, with

E_A = { {i,j} : i ≠ j ∈ [n], A_ij ≠ 0 },   (14)

and define the semialgebraic set

K_A = { x ∈ R^n : √(A_ii) x_i − x_i^2 ≥ 0 (i ∈ [n]), A_ij − x_i x_j ≥ 0 ({i,j} ∈ E_A), x_i x_j = 0 ({i,j} ∈ Ē_A) }.   (15)

As we now observe, the parameter τ_cp(A) can be reformulated as an instance of the GMP.
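To make the ideal-sparsity structure of a concrete matrix A explicit, one can read off the support graph G_A and its nonedges (whose monomials x_i x_j generate the ideal) directly from the zero pattern of A. A minimal sketch (the helper name is ours, purely for illustration):

```python
def support_graph(A):
    """Return the edges E_A = {{i,j} : i != j, A[i][j] != 0} and the nonedges
    (pairs with A[i][j] == 0), whose monomials x_i*x_j generate the ideal."""
    n = len(A)
    edges, nonedges = [], []
    for i in range(n):
        for j in range(i + 1, n):
            (edges if A[i][j] != 0 else nonedges).append((i, j))
    return edges, nonedges

# A tridiagonal (hence sparse) doubly nonnegative pattern
A = [[2, 1, 0],
     [1, 2, 1],
     [0, 1, 2]]
E, nonE = support_graph(A)
print(E)     # [(0, 1), (1, 2)]
print(nonE)  # [(0, 2)]
```

Here the single nonedge {0,2} yields the ideal constraint x_0 x_2 = 0 in K_A, and the maximal cliques {0,1} and {1,2} of G_A index the measure variables of the sparse reformulation.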
Lemma 2. The parameter τ_cp(A) is equal to the optimal value of the generalized moment problem:

val_cp = inf { ∫ 1 dµ : µ ∈ M(R^n), Supp(µ) ⊆ K_A, ∫ x_i x_j dµ = A_ij (1 ≤ i ≤ j ≤ n) }.

^1 The reason for using the constraint √(A_ii) x_i − x_i^2 ≥ 0 is that it leads to a larger quadratic module and thus a possibly better bound (see [32]).
Proof. The (easy) key observation is that any feasible solution to τ_cp(A), i.e., any decomposition of the form A = λ Σ_{ℓ=1}^s λ_ℓ a_ℓ a_ℓ^T with λ_ℓ ≥ 0, Σ_{ℓ=1}^s λ_ℓ = 1 and a_ℓ ∈ K_A, corresponds to a measure µ := λ Σ_{ℓ=1}^s λ_ℓ δ_{a_ℓ} that is feasible for val_cp and finite atomic (and vice versa). Observe also that the Slater-type condition (7) holds (since f_0 = 1 > 0 on K_A). The result now follows using (the first part of) Theorem 1: if A is completely positive, then val_cp is feasible and thus has a finite atomic optimal solution, which implies τ_cp(A) = val_cp; otherwise, both programs are infeasible and both parameters are equal to ∞.
Based on the formulation of the parameter τ_cp(A) in Lemma 2 as a GMP instance, we can define the corresponding bounds ξ_t^cp(A), obtained as a special instance of the bounds (6) (see relations (29)-(34) below). Then, the convergence of the bounds ξ_t^cp(A) to τ_cp(A) follows as a direct application of Theorem 1. As in the general case of the GMP, one may exploit the presence of the ideal constraints x_i x_j = 0 (for {i,j} ∈ Ē_A) in the definition of K_A and define a hierarchy of ideal-sparse bounds ξ_t^cp,isp(A). These bounds satisfy ξ_t^cp(A) ≤ ξ_t^cp,isp(A) ≤ τ_cp(A), also with asymptotic convergence to τ_cp(A). We refer to Section 4 for details about these parameters and links to earlier bounds in the literature.

Bounds on the nonnegative rank via GMP
The above approach for the cp-rank naturally extends to the asymmetric setting of the nonnegative rank. For a nonnegative matrix M ∈ R^{m×n}, its nonnegative rank, denoted rank_+(M), is defined as the smallest integer r for which there exist nonnegative vectors a_1, ..., a_r ∈ R^m_+ and b_1, ..., b_r ∈ R^n_+ such that

M = Σ_{ℓ=1}^r a_ℓ b_ℓ^T.   (16)

In other words, rank_+(M) can be seen as the smallest cp-rank of a cp-matrix A ∈ S^{m+n} of the form

A = ( X    M
      M^T  Y )  for some X ∈ S^m, Y ∈ S^n.

Computing the nonnegative rank is an NP-hard problem [62]. In analogy to the parameter τ_cp in (13), the following "convexification" τ_+(M) of the nonnegative rank was introduced in [28], providing a lower bound: τ_+(M) ≤ rank_+(M). Note that, compared to the parameter τ_cp(A) in (13), where we had an additional constraint A − xx^T ⪰ 0, we now cannot impose such a constraint.
One can define analogs of the bounds ξ_t^cp and ξ_t^cp,isp for the nonnegative rank, which now involve a linear functional acting on polynomials in m + n variables. For convenience, we set V = [m + n] = U ∪ W, where U = [m] = {1, ..., m} (corresponding to the row indices of M) and W = {m + 1, ..., m + n} (corresponding to the column indices of M, up to a shift by m). We also set

E^M = { {i, m + j} : i ∈ [m], j ∈ [n], M_ij ≠ 0 },

so that E^M corresponds to the (bipartite) support graph of the matrix M. Note that, in comparison to (14), we now only consider bipartite pairs {i, j} (with i ∈ U and j ∈ W). To emphasize the difference between the two situations we now put M as a superscript, while we placed A as a subscript in the notation E_A. Let M_max = max_{i,j} M_ij denote the largest entry of M. As observed in [32], one may assume without loss of generality that the vectors in (16) satisfy ‖a_ℓ‖_∞, ‖b_ℓ‖_∞ ≤ √(M_max) (after rescaling). This motivates defining the following semialgebraic set

K^M = { x ∈ R^{m+n} : √(M_max) x_i − x_i^2 ≥ 0 (i ∈ V), M_ij − x_i x_{m+j} ≥ 0 ({i, m + j} ∈ E^M), x_i x_{m+j} = 0 ({i, m + j} ∈ Ē^M) },

where Ē^M = { {i, m + j} : i ∈ [m], j ∈ [n], M_ij = 0 } denotes the set of bipartite nonedges. The analog of Lemma 2 holds, which provides a GMP reformulation for τ_+(M).
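For the nonnegative rank, the analogous bookkeeping builds the bipartite edge set E^M with column indices shifted by m. A small illustrative sketch (the helper name is hypothetical):

```python
def bipartite_support_edges(M):
    """E^M = {{i, m+j} : M[i][j] != 0}, with column indices shifted by m = #rows,
    so rows are vertices 0..m-1 and columns are vertices m..m+n-1."""
    m = len(M)
    return [(i, m + j) for i in range(m) for j in range(len(M[0]))
            if M[i][j] != 0]

M = [[1, 0, 2],
     [0, 3, 0]]
print(bipartite_support_edges(M))  # [(0, 2), (0, 4), (1, 3)]
```

Every pair inside U or inside W, as well as every bipartite pair with M_ij = 0, is a nonedge, so the maximal cliques of this bipartite graph are small, which is what the ideal-sparse hierarchy exploits.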
Lemma 3. The parameter τ_+(M) is equal to the optimal value of the generalized moment problem:

val_+ = inf { ∫ 1 dµ : µ ∈ M(R^{m+n}), Supp(µ) ⊆ K^M, ∫ x_i x_{m+j} dµ = M_ij (i ∈ [m], j ∈ [n]) }.

Based on this formulation of the parameter τ_+(M), we may consider the corresponding bounds ξ_t^+(M), as a special instance of the bounds in (6). Their asymptotic convergence to τ_+(M) follows as a direct application of Theorem 1. One may also exploit the presence of the ideal constraints x_i x_j = 0 (for {i,j} ∈ Ē^M) in the definition of K^M and define a hierarchy of sparse bounds ξ_t^+,isp(M). These parameters satisfy ξ_t^+(M) ≤ ξ_t^+,isp(M) ≤ τ_+(M), with asymptotic convergence of all parameters to τ_+(M). We refer to Section 5 for details about these parameters.

Preliminaries about sums of squares and moments
In this section, we recall some preliminaries about sums of squares and linear functionals on polynomials that we will use throughout. These results are well known in the polynomial optimization community; we refer, e.g., to the sources [40,43,45,46,47,35,49] and further references therein for background and broad overviews.

Nonnegative linear functionals and moment matrices
The program (6) defining the parameter ξ_t involves a linear functional L ∈ R[x]*_{2t}, which is assumed to be nonnegative on the truncated quadratic module M(g)_{2t} (in (4)) and to vanish on the truncated ideal I_Ē,2t (in (5)). We now recall how these conditions can be expressed more concretely in terms of positive semidefiniteness conditions on associated (moment) matrices and thus used to reformulate the program (6) as a semidefinite program.
For this, given L ∈ R[x]*_{2t}, define the matrix

M_t(L) := L([x]_t [x]_t^T) = (L(x^α x^β))_{α,β ∈ N^n_t},

often called a (pseudo)moment matrix in the literature. So, in the notation L([x]_t [x]_t^T), it is understood that L acts entry-wise on the entries of the polynomial matrix [x]_t [x]_t^T. Then, it is well known (and easy to see) that L(σ) ≥ 0 for all σ ∈ Σ ∩ R[x]_{2t} if and only if the matrix M_t(L) is positive semidefinite. Consider now a polynomial g with degree k = deg(g). Then L(σg) ≥ 0 for all σ ∈ Σ with deg(σg) ≤ 2t if and only if the matrix M_{t−⌈k/2⌉}(gL) := L(g [x]_{t−⌈k/2⌉} [x]_{t−⌈k/2⌉}^T) (often called a localizing moment matrix) is positive semidefinite. Hence, the condition L ≥ 0 on M(g)_{2t} can be equivalently reformulated via the positive semidefiniteness constraints

M_t(L) ⪰ 0 and M_{t−⌈deg(g_j)/2⌉}(g_j L) ⪰ 0 (j ∈ [m]).

In the same way, the ideal condition L = 0 on I_Ē,2t is equivalent to the linear constraints

L(x_i x_j x^α) = 0 for {i,j} ∈ Ē and α ∈ N^n_{2t−2}.

Hence, the parameter ξ_t can be expressed as the optimum value of a semidefinite program. Recall that there exist efficient algorithms for solving semidefinite programs up to any precision (under some mild assumptions; see, e.g., [22] and further references therein).
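To make the moment-matrix condition concrete, the following numerical sketch (assuming numpy; the helper names are ours) assembles M_t(L) for the linear functional given by integration against a small atomic measure on R^2 and checks that it is positive semidefinite, as the discussion above predicts for functionals coming from measures:

```python
import numpy as np
from itertools import combinations_with_replacement

def monomials(n, t):
    """Exponent vectors of all monomials in n variables of degree <= t."""
    mons = []
    for d in range(t + 1):
        for combo in combinations_with_replacement(range(n), d):
            alpha = [0] * n
            for i in combo:
                alpha[i] += 1
            mons.append(tuple(alpha))
    return mons

def moment_matrix(atoms, weights, t):
    """M_t(L) for L(f) = sum_k w_k f(a_k), i.e. integration against an
    atomic measure; entry (alpha, beta) is L(x^(alpha+beta))."""
    n = len(atoms[0])
    mons = monomials(n, t)
    def L(alpha):
        return sum(w * np.prod(np.array(a, float) ** alpha)
                   for a, w in zip(atoms, weights))
    M = np.empty((len(mons), len(mons)))
    for r, a in enumerate(mons):
        for c, b in enumerate(mons):
            M[r, c] = L(np.array(a) + np.array(b))
    return M

M = moment_matrix([(1.0, 0.0), (0.5, 2.0)], [0.3, 0.7], t=2)
print(np.linalg.eigvalsh(M).min() >= -1e-9)  # True: moment matrices are PSD
```

Indeed, here M_2(L) = 0.3 [a]_2 [a]_2^T + 0.7 [b]_2 [b]_2^T is a nonnegative combination of rank-one PSD matrices, which is the measure-side direction of the equivalence recalled above.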

Flatness and extraction of optimal solutions
As recalled in Theorem 1, if the quadratic module M(g) is Archimedean (i.e., R − Σ_i x_i^2 ∈ M(g) for some R > 0), then the bounds ξ_t converge asymptotically to ξ_∞. In addition, if the Slater-type condition (7) holds, then ξ_∞ = val and problem (1) has a finite atomic optimal solution µ, i.e., supported on finitely many points in K.
A remarkable property of the bounds ξ_t is that they often exhibit finite convergence. Indeed, there is an (easy to check) criterion, known as the flatness condition, which makes it possible to conclude that the level t bound is exact, i.e., ξ_t = val, and to extract a finite atomic optimal solution of the GMP. This flatness condition, see (20) below, goes back to work of Curto and Fialkow [19,20]. We also refer, e.g., to [46,49] for a detailed exposition of the following result. For details on how to extract an atomic optimal solution under the flatness condition (20), we refer to [36,49].
Theorem 4. [19,20] Consider the set K from (2) and set d_K := max{1, ⌈deg(g_j)/2⌉ (j ∈ [m])}. Assume L ∈ R[x]*_{2t} is an optimal solution to the program (6) defining the parameter ξ_t and it satisfies the following flatness condition: there exists an integer s with d_K ≤ s ≤ t such that

rank M_s(L) = rank M_{s−d_K}(L) =: r.   (20)

Then, equality ξ_t = val holds, and problem (1) has an optimal solution µ that is finite atomic and supported on r points in K.
The above result naturally applies also to the sparse reformulation (11) of GMP and to the sparse hierarchy ξ isp t in (12).Indeed, it suffices to apply Theorem 4 to each of the linear functionals L k and to check whether L k satisfies the corresponding flatness criterion.We adapt the result to this setting for concreteness.
Corollary 5. Consider the sets K in (2) and K̂_k in (10). Assume (L_1, ..., L_p) is an optimal solution to the program (12) defining ξ_t^isp and it satisfies the flatness condition: for each k ∈ [p] there exists an integer s_k such that d_{K̂_k} ≤ s_k ≤ t and the following holds:

rank M_{s_k}(L_k) = rank M_{s_k−d_{K̂_k}}(L_k) =: r_k.

Then, equality ξ_t^isp = val^isp (= val) holds, and problem (11) has an optimal solution (µ_1, ..., µ_p), where each µ_k is finite atomic and supported on r_k atoms in K̂_k.

Note that, for the application to the completely positive rank and the nonnegative rank, all polynomials involved in the corresponding instances of the GMP are quadratic, so that d_K = d_{K̂_k} = 1 and the smallest relaxation level that can be considered is t = 1. For the application to the cp-rank, if the flatness condition holds for an optimal solution for the parameter ξ_t^cp(A) (or for the parameter ξ_t^cp,isp(A)), then the parameter is equal to τ_cp(A) and one can extract a cp-factorization of A. In this way one finds an explicit factorization of A and thus an upper bound on its cp-rank. In this case, if the computed value of τ_cp(A) is equal to the number of recovered atoms, this certifies that τ_cp(A) is equal to the cp-rank and the recovered cp-decomposition of A is an optimal one. We will illustrate this on some examples in Section 4.3.2. In the same way, for the application to the nonnegative rank, if the flatness condition holds for an optimal solution for the parameter ξ_t^+(M) (or for the parameter ξ_t^+,isp(M)), then the parameter is equal to τ_+(M) and one can extract a nonnegative factorization of M.
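The flatness condition is easy to test numerically: one compares the ranks of nested truncated moment matrices. The following univariate sketch (assuming numpy; in one variable M_t(L) is the Hankel matrix (m_{i+j})_{i,j ≤ t}) shows the rank stabilizing at the number of atoms of the underlying measure, which is exactly the situation where flatness holds with d_K = 1:

```python
import numpy as np

def hankel_moment_matrix(moments, t):
    """Univariate M_t(L): the Hankel matrix (m_{i+j})_{0 <= i,j <= t}."""
    return np.array([[moments[i + j] for j in range(t + 1)]
                     for i in range(t + 1)])

# Moments of the 2-atomic measure 0.5*delta_1 + 0.5*delta_3: m_d = 0.5*(1 + 3**d)
moments = [0.5 * (1 + 3.0 ** d) for d in range(7)]
ranks = [np.linalg.matrix_rank(hankel_moment_matrix(moments, t), tol=1e-8)
         for t in (1, 2, 3)]
print(ranks)  # [2, 2, 2]: rank M_2 = rank M_1, so flatness holds with r = 2
```

In an actual run of the hierarchy the moments come from the SDP solver rather than from a known measure, and the rank comparison (with a numerical tolerance) is what certifies exactness and triggers atom extraction.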

Ideal-sparsity for GMP
In this section we investigate how ideal-sparsity can be exploited for the GMP (1). First, we consider in Section 3.1 the ideal-sparse reformulation (11) and the corresponding ideal-sparse bounds, and, after that, we discuss in Section 3.2 how this relates to the more classic approach based on exploiting correlative sparsity.

Ideal-sparse moment relaxations
Consider the GMP (1), where the set K is defined as in (2). As in Section 1, we consider the graph G = (V, E), whose maximal cliques are denoted V_1, ..., V_p, and we define the sets K_k ⊆ K ⊆ R^n (as in (8)) and their projections K̂_k ⊆ R^{|V_k|} (as in (10)). Recall from (9) that K = K_1 ∪ ⋯ ∪ K_p. Then, one can define the sparse version (11) of the GMP. As observed above, while problem (1) has a single measure variable µ whose support is contained in K ⊆ R^n, problem (11) involves p measure variables µ_1, ..., µ_p, where µ_k is supported on the set K̂_k ⊆ R^{|V_k|}, thus a smaller dimensional space. We now show that both formulations (1) and (11) are equivalent.

Proposition 6. Problems (1) and (11) are equivalent, i.e., their optimum values are equal: val = val^isp.
Proof. The inequality val ≤ val^isp follows easily: if (µ_1, ..., µ_p) is feasible for (11), then the measure µ obtained by summing the images of the µ_k under the embeddings y ∈ R^{|V_k|} ↦ (y, 0_{V\V_k}) ∈ R^n is supported on K and satisfies ∫ f dµ = Σ_k ∫ f|_{V_k} dµ_k, so it is feasible for (1) with the same objective value.

We now show the reverse inequality val^isp ≤ val. For this, assume µ is feasible for (1). We now define a feasible solution (µ_1, ..., µ_p) to (11), with the same objective value as µ. For k ∈ [p], define the set

Λ_k := { x ∈ K : Supp(x) ⊆ V_k } \ (Λ_1 ∪ ⋯ ∪ Λ_{k−1}).

As each x ∈ K has its support contained in some V_k, it follows that the sets Λ_1, ..., Λ_p disjointly partition the set K. For k ∈ [p], let µ_k be the image of the restriction of µ to Λ_k under the projection x ∈ R^n ↦ x(V_k) := (x_i)_{i ∈ V_k}, so that µ_k is supported on K̂_k. Next, we show that ∫ p dµ = Σ_k ∫ p|_{V_k} dµ_k for any measurable function p : R^{|V|} → R. Indeed, as the sets Λ_1, ..., Λ_p disjointly partition the set K, we have

∫ p dµ = Σ_{k=1}^p ∫_{Λ_k} p dµ = Σ_{k=1}^p ∫ p|_{V_k} dµ_k,

where the last equality holds since any x ∈ Λ_k has x_i = 0 for all i ∉ V_k, so that p(x) = p|_{V_k}(x(V_k)). Therefore, (µ_1, ..., µ_p) is a feasible solution to (11) with the same value as µ, which shows val^isp ≤ val.
Based on the reformulation (11), we can define the ideal-sparse moment relaxation (12) for problem (1), which we repeat here for convenience: for any integer t ≥ 1,

ξ_t^isp = inf { Σ_{k=1}^p L_k(f_0|_{V_k}) : L_k ∈ R[x(V_k)]*_{2t} (k ∈ [p]), Σ_{k=1}^p L_k(f_i|_{V_k}) = a_i (i ∈ [N]), L_k ≥ 0 on M(g|_{V_k})_{2t} (k ∈ [p]) }.   (22)

This hierarchy provides bounds for val that are at least as good as the bounds ξ_t from (6).

Theorem 7. For any integer t ≥ 1, we have ξ_t ≤ ξ_t^isp ≤ val. Moreover, under the assumptions of Theorem 1 (feasibility of (1), the Slater-type condition (7), and M(g) Archimedean), we have lim_{t→∞} ξ_t^isp = val.
Proof. Clearly, ξ_t^isp ≤ val^isp, which, combined with Proposition 6, gives ξ_t^isp ≤ val. We now show ξ_t ≤ ξ_t^isp. For this, assume (L_1, ..., L_p) is feasible for (22), and define L ∈ R[x]*_{2t} by L(f) := Σ_{k=1}^p L_k(f|_{V_k}) for f ∈ R[x]_{2t}. Then L(f_i) = Σ_k L_k(f_i|_{V_k}) = a_i for i ∈ [N]. Moreover, L ≥ 0 on M(g)_{2t}: indeed, for σ ∈ Σ with deg(σg_j) ≤ 2t, we have (σg_j)|_{V_k} = σ|_{V_k} g_j|_{V_k} with σ|_{V_k} ∈ Σ, so that L_k((σg_j)|_{V_k}) ≥ 0 for each k. Finally, L = 0 on I_Ē,2t: for {i,j} ∈ Ē and any monomial x^α, no clique V_k contains both i and j, so (x_i x_j x^α)|_{V_k} = 0 and thus L(x_i x_j x^α) = 0. Hence, L is feasible for (6) with the same objective value as (L_1, ..., L_p), which shows ξ_t ≤ ξ_t^isp. Convergence of ξ_t^isp to val follows from the just proven fact that ξ_t ≤ ξ_t^isp and from Theorem 1, which implies lim_{t→∞} ξ_t = val under the stated assumptions.
Observe that in Theorem 7 no structural chordality property needs to be assumed on the cliques V_1, ..., V_p of the graph G. In other words, the cliques V_1, ..., V_p need not satisfy the running intersection property (see (24) below), which is a characterizing property of chordal graphs that is often used in sparsity-exploiting techniques like correlative sparsity. In Section 3.2 below, we will comment on the link between the ideal-sparsity approach presented here and the more classical correlative sparsity approach that can be followed when considering a chordal extension G̃ of the graph G.
As mentioned earlier in the introduction, the sparse bounds ξ_t^isp present a double advantage compared to the dense bounds ξ_t: they are at least as good (and often strictly better), and their computation is potentially faster since the sets V_k can be much smaller than the full set V. We will see later examples illustrating this. On the other hand, a possible drawback is that the number of maximal cliques of G could be large. Indeed, it is well known that the number of maximal cliques can be exponential in the number of nodes (this is the case, e.g., when G is a complete graph on 2n nodes with a deleted perfect matching). A possible remedy is to consider a graph G̃ = (V, Ẽ) containing G as a subgraph, i.e., such that E ⊆ Ẽ. Then, let Ṽ_1, ..., Ṽ_p̃ denote the maximal cliques of G̃, whose number p̃ satisfies p̃ ≤ p, since each maximal clique of G is contained in a maximal clique of G̃. One can define the corresponding ideal-sparse moment hierarchy of bounds, denoted ξ̃_t^isp, which involves p̃ measure variables supported on the sets Ṽ_1, ..., Ṽ_p̃ (instead of the sets V_1, ..., V_p). However, as Ṽ_h may contain some nonedge of G, one now needs to still impose an ideal condition on each linear functional L̃_h acting on R[x(Ṽ_h)] (h ∈ [p̃]). Namely, the parameter ξ̃_t^isp is defined as in (22), now with the additional ideal constraints L̃_h(x_i x_j x^α) = 0 whenever {i,j} ∈ Ē and {i,j} ∪ Supp(α) ⊆ Ṽ_h (h ∈ [p̃]). Note that this parameter interpolates between the dense and sparse parameters: indeed, ξ_t ≤ ξ̃_t^isp ≤ ξ_t^isp holds for any t ≥ 1. To see the second inequality, assume (L_1, ..., L_p) is feasible for (22); each maximal clique V_k of G is contained in some maximal clique of G̃, so we may fix an assignment k ↦ h(k) with V_k ⊆ Ṽ_{h(k)} and, for h ∈ [p̃], define L̃_h := Σ_{k : h(k) = h} L_k((·)|_{V_k}). Then, one can easily verify that (L̃_1, ..., L̃_p̃) provides a feasible solution for ξ̃_t^isp, with the same objective value as (L_1, ..., L_p). Let us only check the ideal constraint. For this assume {i,j} ∪ Supp(α) ⊆ Ṽ_h and {i,j} ∈ Ē. Then, {i,j} is not contained in any clique V_k of G and thus L_k((x_i x_j x^α)|_{V_k}) = 0 for all k ∈ [p], which directly implies L̃_h(x_i x_j x^α) = 0.
Remark 9. There may be a trade-off between the parameter ξ^isp_t, which fully exploits the sparsity of G (and provides a possibly better bound), and the parameter ξ̂^isp_t, which only partially exploits the sparsity, depending on the choice of the extension Ĝ of G. Namely, the parameter ξ^isp_t may involve many cliques of smaller sizes, while the parameter ξ̂^isp_t involves fewer cliques, but of larger sizes. If one cares to have a small number of cliques, then one can (but is not required to) take for Ĝ a chordal extension of G, in which case the number of maximal cliques is at most the number of nodes.
In our numerical experiments for matrix factorization ranks we will consider only the two extreme cases of the dense and ideal-sparse parameters ξ_t and ξ^isp_t. For most of the matrices considered, the number of maximal cliques indeed does not seem to play a significant role. However, when this number becomes too large, one may have to consider alternative intermediate parameters (see Section 6 for a brief discussion).

Bounds based on correlative sparsity
In this section we compare the ideal-sparse approach with the more classical one based on exploiting correlative sparsity. The setting of correlative sparsity usually applies to a polynomial optimization problem where each of the constraint polynomials involves only a subset of the variables (indexed, say, by one of the subsets V_1, ..., V_p) and the objective polynomial is a sum of such polynomials. Then, one can define more economical relaxations that respect this sparsity pattern. In the case when the sets V_1, ..., V_p satisfy the so-called RIP property (see (24) below) (and under some Archimedean condition), these hierarchies enjoy asymptotic convergence properties analogous to those of the dense hierarchies; see [44, 34] for details and also [51] for general background on correlative sparsity. We now explain how correlative sparsity applies to the instance of GMP considered in this paper.
As before, we assume K is contained in the variety of the ideal I_Ē, generated by the monomials x_i x_j corresponding to the non-edges of the graph G = (V, E). In the ideal-sparsity approach we considered a measure variable for each maximal clique of G. However, the number of maximal cliques of G can be large, which could represent a drawback for this approach.
An alternative is to consider a chordal extension Ĝ = (V, Ê) of G, that is, a chordal graph Ĝ containing G as a subgraph, i.e., such that E ⊆ Ê. Then, as a well-known property of chordal graphs, Ĝ has at most n distinct maximal cliques. Let V̂_1, ..., V̂_p̂ denote the maximal cliques of Ĝ, so p̂ ≤ n. As one of the many equivalent characterizations of chordal graphs, it is known that the maximal cliques V̂_1, ..., V̂_p̂ satisfy (possibly after reordering) the so-called running intersection property (RIP):

for each k ∈ {2, ..., p̂} there exists j < k such that V̂_k ∩ (V̂_1 ∪ · · · ∪ V̂_{k−1}) ⊆ V̂_j.   (24)

See, e.g., [24] for details. As we explain below, it turns out that one can 'transport' the chordal sparsity structure of the graph Ĝ to the moment matrices involved in the definition of the dense bound ξ_t in (6).
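The RIP condition (24) is straightforward to test for a given ordering of the cliques. The sketch below (our own helper, not code from the paper) confirms that the edge cliques of a path admit a RIP ordering, while those of the chordless 4-cycle, which is not chordal, admit none:

```python
from itertools import permutations

def has_rip(cliques):
    """Check the running intersection property (24) for cliques in the given order."""
    for k in range(1, len(cliques)):
        earlier = set().union(*cliques[:k])
        inter = cliques[k] & earlier
        if not any(inter <= cliques[j] for j in range(k)):
            return False
    return True

# maximal cliques of the path 1-2-3-4 (a chordal graph): RIP holds in this order
path_cliques = [{1, 2}, {2, 3}, {3, 4}]
print(has_rip(path_cliques))  # True

# maximal cliques (the edges) of the 4-cycle 1-2-3-4-1, which is not chordal:
# no ordering of them satisfies RIP
c4_cliques = [{1, 2}, {2, 3}, {3, 4}, {1, 4}]
print(any(has_rip(list(p)) for p in permutations(c4_cliques)))  # False
```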
To see this, let us first rewrite the parameter ξ_t more concretely as a semidefinite program. For convenience, set d_j := ⌈deg(g_j)/2⌉ for j ∈ [m]. Then, following the discussion in Section 2.1, the parameter ξ_t can be expressed as in (25). For fixed t ∈ N, define the sets I_{k,t} := {α ∈ N^n_t : Supp(α) ⊆ V̂_k} for k ∈ [p̂], and set I_t := I_{1,t} ∪ · · · ∪ I_{p̂,t}.

Lemma 10. Let L be feasible for the program (25). Then, L(x^α x^β) = 0 for any α, β ∈ N^n_t such that {α, β} is not contained in any of the sets I_{1,t}, ..., I_{p̂,t}.
Proof. Assume there is no index k ∈ [p̂] such that {α, β} ⊆ I_{k,t}. Then, Supp(α + β) is not a clique in Ĝ, for otherwise it would be contained in some V̂_k, implying Supp(α), Supp(β) ⊆ V̂_k and thus α, β ∈ I_{k,t}, yielding a contradiction. As Supp(α + β) is not a clique in Ĝ, it contains a pair {i, j} that is a non-edge of Ĝ, and thus also a non-edge of G, which implies x^α x^β ∈ I_{Ē,2t} and thus L(x^α x^β) = 0.
In view of Lemma 10, in the definition of ξ_t in (25), one may restrict the matrix L([x]_t [x]_t^T) to its principal submatrix indexed by I_t, since any row/column indexed by α ∈ N^n_t \ I_t is identically zero. Moreover, L(x^α x^β) ≠ 0 implies {α, β} ⊆ I_{k,t} for some k ∈ [p̂]. In other words, the support graph of this matrix is contained in the graph with vertex set I_t whose maximal cliques are the sets I_{1,t}, ..., I_{p̂,t}. The next lemma shows that the RIP property also holds for the sets I_{1,t}, ..., I_{p̂,t}. Therefore, the moment matrix M_t(L) = L([x]_t [x]_t^T) has a correlative sparsity pattern, which it inherits from the chordal extension Ĝ of G.
The above extends to the localizing matrices L(g_j [x]_{t−d_j} [x]_{t−d_j}^T). In the same way, one may restrict the matrix L(g_j [x]_{t−d_j} [x]_{t−d_j}^T) to its principal submatrix indexed by I_{t−d_j}, and its support graph is contained in the graph with vertex set I_{t−d_j} whose maximal cliques are the sets I_{1,t−d_j}, ..., I_{p̂,t−d_j}. Moreover, there is a correlative sparsity pattern on the matrix L(g_j [x]_{t−d_j} [x]_{t−d_j}^T). Therefore, one may apply Theorem 12 below to get a more economical reformulation of ξ_t. Indeed, by Theorem 12, one may write L(g_j [x]_{t−d_j} [x]_{t−d_j}^T) = Z_{j,1} + · · · + Z_{j,p̂}, where Z_{j,k} is obtained from a positive semidefinite matrix indexed by the set I_{k,t−d_j} by padding it with zero entries, and thus replace the condition L(g_j [x]_{t−d_j} [x]_{t−d_j}^T) ⪰ 0 by the conditions Z_{j,1} ⪰ 0, ..., Z_{j,p̂} ⪰ 0. The advantage is that requiring Z_{j,k} ⪰ 0 boils down to checking positive semidefiniteness of a potentially much smaller matrix, indexed by I_{k,t−d_j}. Hence, this allows one to replace one (large) positive semidefinite matrix by several smaller positive semidefinite matrices. While this method offers a more economical way of computing the dense parameter ξ_t, it is nevertheless inferior to the ideal-sparse approach described in the previous section. Recall in particular Remark 9, where we indicated how to construct a sparse parameter ξ̂^isp_t, which can also be based on a chordal extension Ĝ of G, but is superior in quality, since ξ_t ≤ ξ̂^isp_t.

Theorem 12 ([2]). Consider a positive semidefinite matrix X ∈ S^n_+ whose support graph is contained in a chordal graph G, with maximal cliques V_1, ..., V_p. Then, there exist positive semidefinite matrices Y_1, ..., Y_p, with Y_k indexed by V_k, such that X = Z_1 + · · · + Z_p, where Z_k ∈ S^n_+ is obtained by padding Y_k with zeros.
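For banded matrices, the decomposition in Theorem 12 can be read off a Cholesky factorization: if X ⪰ 0 is tridiagonal, its support graph is a path with maximal cliques {i, i+1}, the Cholesky factor is lower bidiagonal, and each of its columns yields a rank-one PSD summand supported within one clique. A small pure-Python illustration (our own sketch, assuming a positive definite X so the factorization exists):

```python
import math

def cholesky(X):
    """Dense Cholesky factorization X = L L^T for a positive definite matrix."""
    n = len(X)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = X[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

# tridiagonal positive definite matrix; support graph = path with cliques {0,1},{1,2},{2,3}
X = [[2.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 2.0]]
n = len(X)
L = cholesky(X)
# Z_k = (k-th column of L) times its transpose: rank-one PSD, supported on clique {k, k+1}
Z = []
for k in range(n):
    col = [L[i][k] for i in range(n)]
    Z.append([[col[i] * col[j] for j in range(n)] for i in range(n)])
# the Z_k sum back to X, as in Theorem 12
recon = [[sum(Z[k][i][j] for k in range(n)) for j in range(n)] for i in range(n)]
err = max(abs(recon[i][j] - X[i][j]) for i in range(n) for j in range(n))
print(err < 1e-12)  # True
```

The general proof of Theorem 12 uses a clique-tree elimination ordering; for a path the usual ordering already produces no fill-in, which is why the plain Cholesky factor suffices here.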
As a final observation, another possibility to exploit the above correlative sparsity structure would be simply to replace, in the definition of ξ_t in program (6), the global positivity condition on L by its restrictions to the cliques of Ĝ. In other words, if L_{|V̂_k} denotes the restriction of L to the polynomials in the variables indexed by V̂_k, then we replace the condition L ≥ 0 on M(g)_{2t} by the conditions L_{|V̂_k} ≥ 0 on the corresponding truncated quadratic modules in the variables x(V̂_k), for k ∈ [p̂]. In this way we obtain another parameter, denoted ξ^csp_t, that is weaker than ξ_t and thus satisfies ξ^csp_t ≤ ξ_t ≤ ξ̂^isp_t ≤ ξ^isp_t. Recall that ξ̂^isp_t is the parameter from (23), obtained when selecting an extension Ĝ of G, including, for instance, a chordal extension of G.

Application to the completely positive rank
In this section we investigate how ideal-sparsity can be exploited to design bounds on the completely positive rank.We define the corresponding hierarchies of lower bounds on the cp-rank and indicate their relations to other known bounds in the literature.

Ideal-sparse lower bounds on the cp-rank
Consider a symmetric nonnegative matrix A ∈ S^n and assume A_ii ≠ 0 for all i ∈ V (to avoid trivialities). Then, its cp-rank, denoted rank_cp(A), is the smallest integer r ∈ N for which A admits a decomposition of the form A = Σ_{ℓ=1}^r a_ℓ a_ℓ^T with a_ℓ ≥ 0 (setting r = ∞ if no such decomposition exists, i.e., when A is not completely positive). Fawzi and Parrilo [28] introduced the parameter τ_cp(A) from (13) as a convexification of the cp-rank. Clearly, we have τ_cp(A) ≤ rank_cp(A). As was already indicated in Section 1, the parameter τ_cp(A) can be reformulated as an instance of problem (1) with an ideal-sparsity structure inherited from the matrix A. For this, recall that G_A = (V = [n], E_A) denotes the support graph of A, where E_A consists of all pairs {i, j} with i ≠ j ∈ V and A_ij ≠ 0 (as in (14)), and recall the definition of the semialgebraic set K_A from (15). As shown in Lemma 2, τ_cp(A) can be reformulated as an instance of GMP.

Dense hierarchies for cp-rank. Based on the above reformulation of τ_cp(A), for any integer t ≥ 1, let us define the parameter ξ^cp_t(A) (as a special instance of (6)). We first indicate how this parameter relates to other similar moment-based bounds considered in the literature, in particular in [32] and [33]. Note that, due to the presence of the (ideal) constraints (33), the constraint (32) trivially holds for any pair {i, j} ∉ E_A. If we omit the ideal constraint (33) and impose the constraint (32) for all pairs {i, j} with i ≠ j ∈ V, then we obtain a parameter investigated in [33], denoted here as ξ^cp_{t,(2022)}(A). The parameter ξ^cp_{t,(2022)}(A) strengthens an earlier parameter ξ^cp_{t,(2019)}(A) introduced in [32], whose definition follows by replacing, in the definition of ξ^cp_{t,(2022)}(A), the constraint (34) by the weaker constraint (35). So, for any t ≥ 1, we have ξ^cp_{t,(2019)}(A) ≤ ξ^cp_{t,(2022)}(A) ≤ ξ^cp_t(A). Since the bounds ξ^cp_{t,(2019)}(A) were shown to converge asymptotically to
τ_cp(A) in [32], the same holds for the bounds ξ^cp_t(A). Note that the convergence of the latter bounds also follows directly from Theorem 1. As mentioned in [32], there are more constraints that can be added to the above program while still leading to a lower bound on the cp-rank (in fact, on τ_cp(A)). In particular, exploiting the fact that the variables x_i should be nonnegative, one may add the constraints (36) and (37). One may also add other localizing constraints, such as (38) and (39). Note that the constraints (39) are redundant at the smallest level t = 1. Note also that one could add a similar constraint with x_i x_j replaced by any monomial. We use the notation ξ^cp_{t,†}(A) to denote the parameter obtained by adding (38) to the program defining ξ^cp_t(A). Define analogously ξ^cp_{t,(2019),†}(A) by adding (38) to ξ^cp_{t,(2019)}(A), so that we have ξ^cp_{t,(2019),†}(A) ≤ ξ^cp_{t,†}(A). As we will see in relation (52) below, the bound ξ^cp_{2,(2019),†}(A) is at least as good as rank(A), an obvious lower bound on rank_cp(A). Let ξ^cp_{t,‡}(A) denote the strengthening of ξ^cp_{t,†}(A) obtained by adding the constraints (36), (37), and (39), so that we have ξ^cp_{t,†}(A) ≤ ξ^cp_{t,‡}(A).

Ideal-sparse hierarchies for cp-rank. We now consider the ideal-sparse bounds for the cp-rank, which further exploit the ideal-sparsity pattern of A. For this, let V_1, ..., V_p denote the maximal cliques of the graph G_A and, for t ≥ 1, define the parameter ξ^{cp,isp}_t(A) (as a special instance of (12)). Recall that the edge clique-cover number c(G) of a graph G is the smallest number of cliques whose union covers every edge of G (here, a clique is any set of pairwise adjacent vertices, not only one of maximum cardinality). As observed in [28], the edge clique-cover parameter gives a lower bound on the cp-rank: c(G_A) ≤ rank_cp(A). Indeed, if A = Σ_{ℓ=1}^r a_ℓ a_ℓ^T with a_ℓ ≥ 0 and r = rank_cp(A), then the supports of a_1, ..., a_r are (not necessarily distinct) cliques that provide an edge clique-cover of G_A by at most r cliques.
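For small graphs, the edge clique-cover number c(G) can be computed by brute force: enumerate all cliques and search for the smallest subfamily covering every edge. A sketch with our own helper names (not from the paper), tested on the "bowtie" graph made of two triangles glued at a vertex:

```python
from itertools import combinations

def edge_clique_cover_number(n, edges):
    """Smallest number of cliques of G covering all edges (brute force, small n only)."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    # each clique with >= 2 vertices covers all edges internal to it
    cliques = []
    for r in range(2, n + 1):
        for S in combinations(range(n), r):
            if all(v in adj[u] for u, v in combinations(S, 2)):
                cliques.append({frozenset(e) for e in combinations(S, 2)})
    all_edges = {frozenset(e) for e in edges}
    for size in range(1, len(all_edges) + 1):
        for fam in combinations(cliques, size):
            if set().union(*fam) >= all_edges:
                return size

# bowtie: triangles {0,1,2} and {2,3,4} sharing vertex 2
bowtie = [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)]
print(edge_clique_cover_number(5, bowtie))  # 2
```

The two triangles cover all six edges, so c = 2, while no single clique does.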
In [28] a semidefinite parameter τ^sos_cp(A) is introduced, which is shown to be at least as good as rank(A) and as c_frac(G_A), the fractional edge clique-cover number, i.e., the natural linear relaxation of c(G_A) defined in (51). So, we have c(G_A) ≥ c_frac(G_A) and τ^sos_cp(A) ≥ max{rank(A), c_frac(G_A)}. In [32] it is shown that the bound ξ^cp_{2,(2019),†}(A) is at least as strong as τ^sos_cp(A). Indeed, the proof of the relevant result (Proposition 7 in [32]) only uses the relation L((A_ij − x_i x_j) x_i x_j) ≥ 0 from (38) and the relation L((xx^T)^{⊗2}) ⪯ A^{⊗2} in (35). Hence, we have the chain of inequalities

rank(A), c_frac(G_A) ≤ τ^sos_cp(A) ≤ ξ^cp_{2,(2019),†}(A) ≤ ξ^cp_{2,†}(A).   (52)

As we now observe, the (weak) ideal-sparse bound ξ^{cp,wisp}_1(A) at the first level t = 1 is at least as good as the parameter c_frac(G_A).

Lemma 13. For a symmetric nonnegative matrix A as above, we have ξ^{cp,wisp}_1(A) ≥ c_frac(G_A).

Proof. Let (L_1, ..., L_p) be an optimal solution for the parameter ξ^{cp,wisp}_1(A). Using (44) and (41), we have, for every edge {i, j} ∈ E_A,

A_ij = Σ_{k: {i,j} ⊆ V_k} L_k(x_i x_j) ≤ A_ij Σ_{k: {i,j} ⊆ V_k} L_k(1).

As A_ij > 0, this gives Σ_{k: {i,j} ⊆ V_k} L_k(1) ≥ 1 for every edge {i, j} ∈ E_A. Hence, the vector x = (L_k(1))_{k=1}^p ∈ R^p_+ is feasible for program (51), which implies Σ_{k=1}^p L_k(1) ≥ c_frac(G_A), as desired.

We now give a class of cp-matrices that exhibits a large separation between the dense and ideal-sparse bounds at level t = 1: these matrices have size n = 2m and satisfy rank_cp(A) = ξ^{cp,wisp}_1(A) = m², while ξ^cp_1(A) < m + 1.

Example 14. Consider the matrix

A = ( (m+1)I_m   J_m ; J_m   (m+1)I_m ) ∈ S^{2m},

where I_m is the identity matrix and J_m the all-ones matrix. Then, A is a cp-matrix (because it is nonnegative and diagonally dominant). Its cp-rank is equal to m², since its support graph G_A is the complete bipartite graph K_{m,m} (thus connected, triangle-free and not a tree), using a result of [25] also mentioned below. Clearly, we have c(G_A) = c_frac(G_A) = m². Hence, using Lemma 13, we obtain ξ^{cp,wisp}_1(A) = m² = rank_cp(A), while, as we show next, ξ^cp_1(A) < m + 1, which shows a large separation between the dense and ideal-sparse bounds of level t = 1.
For this, observe that ξ^cp_1(A) can be reformulated as an explicit semidefinite program. Consider the linear functional L on R[x]_2 defined by L(1) = 2m(m+1)/(2m+1), L(x_i) = √(m+1) and L(x_i x_j) = A_ij for i, j ∈ [2m]. We show that L is feasible for this program, which implies ξ^cp_1(A) ≤ L(1) < m + 1. For this, it suffices to show that L([x]_1 [x]_1^T) ⪰ 0. By taking the Schur complement with respect to the upper left corner, this boils down to checking that L(1)A − (m+1)J_{2m} ⪰ 0. As the all-ones vector is an eigenvector of A (with eigenvalue 2m + 1), it is also an eigenvector of L(1)A − (m+1)J_{2m}, with corresponding eigenvalue L(1)(2m + 1) − 2m(m + 1) = 0. The eigenvalues of the matrix L(1)A − (m+1)J_{2m} on the eigenvectors orthogonal to the all-ones vector are, up to the positive factor L(1), eigenvalues of A, and thus they are nonnegative, since A is positive semidefinite. This shows that L(1)A − (m+1)J_{2m} ⪰ 0, and the proof is complete.
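The key step above, the positive semidefiniteness of L(1)A − (m+1)J_{2m} with L(1) = 2m(m+1)/(2m+1), can also be seen from diagonal dominance: this matrix has nonpositive off-diagonal entries and zero row sums. A quick numerical sanity check (our own sketch, assuming the block form A = ((m+1)I_m, J_m; J_m, (m+1)I_m) used in this example):

```python
def separation_matrix(m):
    """D = c*A - (m+1)*J for A = ((m+1)I_m, J_m; J_m, (m+1)I_m) and c = 2m(m+1)/(2m+1)."""
    n = 2 * m
    c = 2 * m * (m + 1) / (2 * m + 1)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = m + 1.0
        for j in range(n):
            if (i < m) != (j < m):  # entries between the two blocks
                A[i][j] = 1.0
    return [[c * A[i][j] - (m + 1) for j in range(n)] for i in range(n)]

for m in (2, 3, 5):
    D = separation_matrix(m)
    rows_ok = all(abs(sum(row)) < 1e-9 for row in D)          # zero row sums
    offdiag_ok = all(D[i][j] <= 1e-12 for i in range(len(D))
                     for j in range(len(D)) if i != j)        # nonpositive off-diagonal
    print(m, rows_ok and offdiag_ok)  # True: D is diagonally dominant, hence PSD
```

A symmetric matrix with nonpositive off-diagonal entries and zero row sums is weakly diagonally dominant with nonnegative diagonal, hence positive semidefinite.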
We conclude with some observations about known upper bounds on the cp-rank (see also [27]). It has been a long-standing conjecture by Drew et al. [25] that the cp-rank of an n × n completely positive matrix is at most n²/4. This conjecture, however, was disproved in [11, 12] for any n ≥ 7. In particular, it is shown in [12] that the maximum cp-rank of an n × n cp-matrix is of the order n²/2 + O(n^{3/2}).

General upper bounds on the cp-rank are known in terms of the rank and the support graph of A. If the support graph G_A is triangle-free, then |E_A| ≤ rank_cp(A) ≤ max{n, |E_A|}; moreover, if G_A is connected, triangle-free and not a tree, then rank_cp(A) = |E_A|. Hence, the bound ξ^{cp,wisp}_1(A) gives the exact value of the cp-rank when G_A is connected, triangle-free and not a tree. On the other hand, if G_A is a tree and A is nonsingular, then the bound ξ^cp_{2,†}(A) gives the exact value (equal to n) of the cp-rank (since it is at least τ^sos_cp(A) ≥ rank(A) by relation (52)).

Numerical results for the completely positive rank
In this section, we explore the behaviour of the various bounds for the completely positive rank on three classes of examples.Our objective is to illustrate the superiority of the ideal-sparse hierarchies compared to the dense ones.We examine both the quality of the bounds as well as computation times.
The first class we consider consists of randomly generated sparse cp-matrices; the exact construction is given below. In all numerical examples we considered for these matrices, the bounds obtained for ξ^cp_t(A) and ξ^{cp,isp}_t(A) were always at most rank(A) + 2, so we do not list the numerical bounds for these examples, as there does not seem to be much insight gained from them. However, random examples give us a way to compare computation times among different hierarchies and across various matrix sizes, non-zero densities, and levels. In what follows, the non-zero density of A ∈ S^n, denoted nzd(A), is defined as the proportion of non-zero entries above the main diagonal, i.e., nzd(A) = |E_A|/(n(n−1)/2). Hence, a diagonal matrix has nzd = 0 and a fully dense matrix has nzd = 1.
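In code, the non-zero density is a one-liner (the helper name is ours):

```python
def nzd(A):
    """Proportion of non-zero entries strictly above the diagonal of a symmetric matrix."""
    n = len(A)
    nonzeros = sum(1 for i in range(n) for j in range(i + 1, n) if A[i][j] != 0)
    return nonzeros / (n * (n - 1) / 2)

print(nzd([[1, 0, 0], [0, 1, 0], [0, 0, 1]]))  # 0.0 for a diagonal matrix
print(nzd([[1, 2, 3], [2, 1, 4], [3, 4, 1]]))  # 1.0 for a fully dense matrix
```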
The second class contains examples from the literature whose cp-rank is known from theory. However, recall that the moment hierarchies provide lower bounds on τ_cp, whose value is often unknown and could be strictly smaller than the cp-rank. Regardless, these examples provide an interesting testbed for evaluating the quality of the new bounds.
The third class of examples consists of doubly nonnegative matrices that are known not to be completely positive. In running these examples, the hope is to obtain an infeasibility certificate from the solver, which then numerically certifies that the matrix is not completely positive. In this context, one hierarchy is said to perform better than another if it returns the infeasibility certificate at a lower level or in less run time.
The size of the matrices involved in the semidefinite programs grows quickly with the level t of the hierarchy (roughly, as the binomial coefficient (n+t choose t)), so these problems quickly become too big for the solver (in particular, due to memory storage). We consider matrices up to size n = 12 for the dense and sparse hierarchies at level t = 2. At level t = 3 and for matrices of size n = 12, we can only compute bounds for the weak sparse hierarchy.
All computations shown were run on a personal computer running Windows 11 Home 64-bit with an 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz processor and 16GB of RAM. The software was custom coded in Julia [9], using the JuMP [26] package for problem formulation and MOSEK [3] as the semidefinite programming solver.

Figure 1: Computation times vs. matrix size and non-zero density at level t = 2, for 850 random matrices generated using the procedure described below. The matrices are arranged in ascending size (n = 5, 6, 7, 8, 9) and then ascending non-zero density, ranging from the minimal density needed to have a connected support graph up to a fully dense matrix (nzd = 1).

Figure 2: Computation times vs. matrix size and non-zero density at level t = 3 of each of the hierarchies; a plot similar to Figure 1. Omitted markers indicate that the corresponding computations either exceeded memory constraints or took longer than 10^3 seconds.

Randomly generated sparse cp-matrices
We first describe how we construct random sparse cp-matrices. Given integers n ∈ N and n − 1 ≤ m ≤ n(n−1)/2, we create a symmetric n × n binary matrix M with exactly m ones above the diagonal, whose positions are selected uniformly at random. Let G be the graph with adjacency matrix M. We only keep the instances where G is a connected graph. We enumerate the maximal cliques V_1, ..., V_p of G (using, e.g., the Bron-Kerbosch algorithm [14]). Then, we select a subset of maximal cliques V_{q_1}, ..., V_{q_l} whose union covers every edge of G (e.g., using a greedy algorithm). For each k ∈ [l], we sample m_k vectors a^{(k,i)} ∈ R^n_+ (i ∈ [m_k]) with uniformly random entries following U[0, 1] and supported on V_{q_k}; we choose m_k = 2 by default. Then, consider the matrix Σ_{k∈[l]} Σ_{i∈[m_k]} a^{(k,i)} (a^{(k,i)})^T, scale it so that all diagonal entries are equal to 1, and call A the resulting matrix. By construction, A is completely positive with connected support graph G_A = G and non-zero density nzd = m/(n(n−1)/2). We generate such random examples for varying matrix size (n = 5, 6, 7, 8, 9) and increasing values of the non-zero density nzd. In order to exclude examples with disconnected graphs we need nzd ≥ (n − 1)/(n(n−1)/2). To account for different graph configurations with the same non-zero density, we generate 10 examples per matrix size and nzd value. For all of them we compute the dense and (weak) sparse bounds of levels t = 2 and t = 3. Here, we are not so much interested in the numerical bounds themselves, but rather in their computation times. This numerical experiment indeed shows the differences in computation time between the ideal-sparse and dense hierarchies. It turns out that the computation times for the parameters ξ^cp_t, ξ^cp_{t,†}, and ξ^cp_{t,‡} are all comparable at levels t = 2, 3, and likewise for their ideal-sparse analogs. For this reason, we only plot the results for the "†" variants. The results are shown in Figure 1 (for t = 2) and Figure 2 (for t = 3).
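The construction just described can be sketched as follows (pure Python: a basic Bron-Kerbosch enumeration of maximal cliques, a greedy edge cover, and U[0,1] samples; all helper names are ours, and a seed is fixed for reproducibility):

```python
import random
from itertools import combinations

def bron_kerbosch(R, P, X, adj, out):
    """Collect all maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    if not P and not X:
        out.append(frozenset(R))
    for v in list(P):
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, out)
        P.remove(v); X.add(v)

def random_sparse_cp_matrix(n, m, mk=2, seed=0):
    rng = random.Random(seed)
    while True:  # resample until the support graph is connected
        edges = rng.sample(list(combinations(range(n), 2)), m)
        adj = {v: set() for v in range(n)}
        for u, v in edges:
            adj[u].add(v); adj[v].add(u)
        seen, stack = {0}, [0]
        while stack:  # DFS connectivity check
            u = stack.pop()
            for w in adj[u] - seen:
                seen.add(w); stack.append(w)
        if len(seen) == n:
            break
    cliques = []
    bron_kerbosch(set(), set(range(n)), set(), adj, cliques)
    # greedy edge clique-cover by maximal cliques
    uncovered = {frozenset(e) for e in edges}
    cover = []
    while uncovered:
        best = max(cliques, key=lambda C: sum(1 for e in uncovered if e <= C))
        cover.append(best)
        uncovered -= {e for e in uncovered if e <= best}
    # sum of mk random rank-one terms per selected clique, then scale the diagonal to 1
    A = [[0.0] * n for _ in range(n)]
    for C in cover:
        for _ in range(mk):
            a = [rng.random() if i in C else 0.0 for i in range(n)]
            for i in range(n):
                for j in range(n):
                    A[i][j] += a[i] * a[j]
    d = [A[i][i] ** -0.5 for i in range(n)]
    return [[d[i] * A[i][j] * d[j] for j in range(n)] for i in range(n)], edges

A, edges = random_sparse_cp_matrix(6, 8)
print(all(abs(A[i][i] - 1.0) < 1e-12 for i in range(6)))  # True: unit diagonal
```

By construction the returned matrix is completely positive, has unit diagonal, and its support graph is exactly the sampled graph G.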
We can make the following observations about the results in Figure 1. As expected, the ideal-sparse hierarchy is faster to compute than the dense hierarchy for matrices with non-zero density nzd ≤ 0.8. The computation of the weak ideal-sparse hierarchy is even faster. Moreover, the speed-up increases with the matrix size and the level of the hierarchy, as can be seen across Figures 1 and 2. At level t = 3, some hierarchies can no longer be computed for certain matrix sizes and non-zero densities. This is particularly evident for the dense hierarchy on matrices of size 7 and larger. The ideal-sparse hierarchies can be computed up to size 9, depending on the non-zero density. We show only the examples that we could compute in less than 10^3 seconds. The parameters that either took longer than 10^3 seconds or exceeded memory constraints can be inferred from the omission of their respective markers in Figure 2.
We also make an observation regarding how the values of the dense and weak ideal-sparse bounds compare for these random matrices. As observed earlier, the weak ideal-sparse hierarchy ξ^{cp,wisp}_t(A) is no longer guaranteed to be at least as strong as the dense hierarchy ξ^cp_t(A). Indeed, in our numerical experiments we frequently observe the strict inequality ξ^{cp,wisp}_t(A) < ξ^cp_t(A) for randomly generated matrices A. For example, the matrix (with entries rounded for presentation) has the following parameters at order t = 2:

Selected sparse cp-matrices
Here, we compute the dense and (weak) ideal-sparse parameters for a few selected cp-matrices taken from the literature. We first briefly discuss the four example matrices we consider, denoted ex1, ex2, ex3, ex4, and shown in relations (54) and (55) below.
The numerical results for these four examples are presented in Table 1, where we also show other parameters for the matrix (size n, rank r, cp-rank r cp ) and its support graph (number p of maximal cliques, edge clique-cover number c).Here are some comments about Table 1.
The results confirm Lemma 13: the ideal-sparse bound of level t = 1 is equal to the number of edges for ex1 and ex2 (and matches the cp-rank); moreover, it gives a strong improvement over the dense bound of level 1. The bounds of level t = 2 all exceed the rank of the matrix (as expected in view of (52)). At level t = 3, only the weak ideal-sparse bound can be computed for the matrices ex3 and ex4.
In Table 1, the values of the bounds at level t = 3 are close to those at level t = 2 for the matrices ex3 and ex4. However, the tests for the flatness condition (21) fail, so one cannot claim that the bounds are equal to τ_cp at this stage.
We also tested whether the flatness conditions (20) and (21) hold for the matrices ex1 and ex2 at level t = 2, and whether one can extract atoms and construct a cp-factorization.
The results are summarized in Table 2, where we indicate the number of atoms (corresponding to a cp-factorization with that many factors) when the extraction procedure is successful. We indicate that the extraction procedure fails by reporting "# atoms = 0". As mentioned in [36], one may indeed try to apply the extraction procedure even if flatness does not hold.
For the dense bounds of level t = 2, flatness does not hold for the matrices ex1 and ex2. However, while one does not succeed in extracting atoms for matrix ex1, the extraction is successful for matrix ex2 and returns 6 atoms. Interestingly, flatness holds for the ideal-sparse bounds and the atom extraction is successful. However, the number of extracted atoms is 10 for matrix ex1, thus twice the cp-rank. To verify that the extracted atoms are (approximately) correct, we use them to construct a cp-matrix A_rec, which we then compare to the original matrix A. In all cases we obtain ‖A_rec − A‖_1 ≤ 10^{−8}, which shows that a correct factorization has been constructed. Note that, for the ideal-sparse parameter, since one splits the problem over the maximal cliques and has a distinct linear functional L_k for each clique V_k, it may be more difficult to satisfy the flatness condition (21) (since each L_k must satisfy it), as happens for the matrices ex3 and ex4.

Doubly nonnegative matrices that are not completely positive

In this section we consider the following three matrices, which are known to be doubly nonnegative but not completely positive (taken from [57, 53, 6]). The objective is to see whether the hierarchies are able to detect that the matrix is not cp. This can be achieved in two ways: when the solver returns an infeasibility certificate, or when it returns a bound that exceeds a known upper bound on the cp-rank. We test this for the bounds at levels t = 1 and t = 2. At level t = 2 we try different variants, obtained by adding the constraints (36), (37), (38), and (39) and their sparse analogs. The results are presented in Tables 3 and 4.
There we indicate one of three possible outcomes. The first outcome, indicated with a question mark "?", means that the solver could not reach a decision within the default MOSEK solver parameters. The second possible outcome is that the solver returns an infeasibility certificate (indicated with *), or returns a value that exceeds a known upper bound for the cp-rank (in which case the bound is marked with *). The last column in both tables, labeled "r_cp ≤", provides such an upper bound on the cp-rank of a cp-matrix with the given support graph. The third possible outcome is that the solver returns a value that does not violate the upper bound, in which case no conclusion can be reached. All computations took less than a second, hence times are not shown. We make three observations about Tables 3-4. First, the ideal-sparse hierarchies show infeasibility already at level t = 1 for the examples ex5 and ex6, while the dense hierarchy shows the same only at level t = 2 with all additional constraints imposed. Second, the ideal-sparse hierarchy correctly identifies ex7 as not cp at level t = 2, while the dense hierarchy does not succeed even at level t = 3. Third, adding additional constraints helps prevent the solver from returning an "unknown result status", but this seems less needed for the ideal-sparse hierarchies. It should be noted that increasing the level of the hierarchy creates more opportunity for numerical errors in the computations, as seen in Table 4.

Application to the nonnegative rank
In this section we indicate how the treatment in the previous section for the cp-rank extends naturally to the asymmetric setting of the nonnegative rank.
Computing the nonnegative rank is an NP-hard problem [62]. Fawzi and Parrilo [28] introduced the natural "convexification" τ_+(M) of the nonnegative rank, which can be seen as an asymmetric analog of τ_cp. We consider the analogs of the parameters ξ^cp_t and ξ^{cp,isp}_t, which now involve linear functionals acting on polynomials in m + n variables. As in the introduction, set V = [m + n] = U ∪ W, where U = [m] = {1, ..., m} (corresponding to the row indices of M) and W = {m + 1, ..., m + n} (corresponding to the column indices of M, up to a shift by m). Set M_max := max_{i,j} M_ij. As is well known (see, e.g., [32]), the vectors in (56) may be assumed to satisfy ‖a_ℓ‖_∞, ‖b_ℓ‖_∞ ≤ √(M_max) (after rescaling). This motivates the definition of the semialgebraic set K_M from (19) and, for any integer t ≥ 1, of the parameter ξ^+_t(M). Fawzi and Parrilo [28] introduced a semidefinite bound τ^sos_+(M) and showed that it satisfies τ^sos_+(M) ≤ τ_+(M). In [32] it is shown that the parameters ξ^+_{2,†}(M) strengthen this bound: τ^sos_+(M) ≤ ξ^+_{2,†}(M). There is a well-known combinatorial lower bound on the nonnegative rank, which can be seen as an asymmetric analog of the lower bound on the cp-rank of A given by the edge clique-cover number c(G_A).
Recall that G_M = (U ∪ W, E_M) is the bipartite graph defined as the support graph of M ∈ R^{m×n}_+. Define the edge biclique-cover number of G_M, denoted bc(G_M), as the smallest number of bicliques whose union covers every edge in E_M. Then, we have rank_+(M) ≥ bc(G_M). As a biclique in G_M corresponds to a pair (A, B) ⊆ U × W for which the rectangle A × B is fully contained in the support of M, the parameter bc(G_M) is also known as the rectangle covering number of M (see, e.g., [28, 31]). Define its fractional analog bc_frac(G_M) as the natural linear relaxation of bc(G_M), given in (71). Yet another well-known combinatorial interpretation of bicliques is as follows. Define the rectangular graph RG(M), with vertex set E_M, where two distinct pairs {i, j}, {k, ℓ} ∈ E_M form an edge of RG(M) if M_{iℓ} M_{kj} = 0. In other words, {i, j}, {k, ℓ} ∈ E_M do not form an edge in RG(M) precisely if ({i, k}, {j, ℓ}) corresponds to a biclique in G_M. Then, the parameter bc(G_M) coincides with the coloring number χ(RG(M)) of RG(M), and bc_frac(G_M) coincides with χ_f(RG(M)), the fractional coloring number of RG(M). So, rank_+(M) ≥ bc(G_M) = χ(RG(M)).
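The rectangle covering interpretation is easy to verify by brute force on tiny matrices; for instance, for M = I_3 every rectangle contained in the support is a single diagonal cell, so bc(G_M) = 3. A sketch (our own helper names, small matrices only):

```python
from itertools import combinations

def rectangle_cover_number(M):
    """Brute-force rectangle covering number of the support of a small matrix M."""
    m, n = len(M), len(M[0])
    support = {(i, j) for i in range(m) for j in range(n) if M[i][j] != 0}
    # all rectangles A x B fully contained in the support
    rects = []
    for ra in range(1, m + 1):
        for A in combinations(range(m), ra):
            for rb in range(1, n + 1):
                for B in combinations(range(n), rb):
                    cells = {(i, j) for i in A for j in B}
                    if cells <= support:
                        rects.append(cells)
    for size in range(1, len(support) + 1):
        for fam in combinations(rects, size):
            if set().union(*fam) >= support:
                return size

I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(rectangle_cover_number(I3))  # 3
```

Any rectangle with two rows or two columns would contain an off-diagonal zero of I_3, so only the three diagonal cells remain, matching bc(G_{I_3}) = 3.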
The following relationships are shown in [28], where ϑ(RG(M)) is the theta number of the complement of RG(M). As we now observe, the ideal-sparse parameter ξ^{+,isp}_1(M) is at least as good as bc_frac(G_M); this is the analog of Lemma 13.
Lemma 15. For M ∈ R^{m×n}_+, we have ξ^{+,isp}_1(M) ≥ bc_frac(G_M).

Proof. Let (L_1, ..., L_p) be an optimal solution for ξ^{+,isp}_1(M). As in the proof of Lemma 13, the vector (L_k(1))_{k=1}^p provides a feasible solution to program (71), which implies Σ_{k=1}^p L_k(1) ≥ bc_frac(G_M).

As for the cp-rank, we now give a class of matrices showing a large separation between the ideal-sparse and dense bounds of level t = 1.
Example 16. Consider the identity matrix M = I_n ∈ S^n. Clearly, we have rank_+(I_n) = rank(I_n) = n. As the support graph G_M is the disjoint union of n edges, its fractional edge biclique-cover number is equal to n, and thus, in view of Lemma 15, we have ξ^{+,isp}_1(I_n) = n = rank_+(I_n). We now show that, for the dense bound, we have ξ^+_1(I_n) < 8 for any n ≥ 4. For this, recall that ξ^+_1(I_n) is given by an explicit semidefinite program; one then exhibits a linear functional L with L(1) < 8 that is feasible for the program defining ξ^+_1(I_n), which implies ξ^+_1(I_n) ≤ L(1) < 8.

Numerical results for the nonnegative rank
In this section we test the ideal-sparse and dense hierarchies on two classes of nonnegative matrices. The first class consists of 4 × 4 matrices that depend continuously on a single parameter. The second class consists of the Euclidean distance matrices (EDMs).

Matrices related to the nested rectangles problem
The nonnegative matrices we consider here exhibit an interesting link between their nonnegative rank and the geometric nested rectangles problem (see [13]). Bounds on their nonnegative rank were investigated by Fawzi and Parrilo [28] and Gribling et al. [32]. Consider the matrices S(a, b). If a, b < 1, then S(a, b) is fully dense and no improvement can be expected from our new bounds. Thus, we consider the case b = 1 and a ∈ [0, 1]. We computed the bounds ξ^+_{t,‡}(M) and ξ^{+,isp}_{t,‡}(M) at levels t = 1, 2, 3 for M = S(a, 1), with a ranging from 0 to 1 in increments of 0.01. The results are displayed in Figure 3 below. We can make the following two observations about Figure 3. First, the ideal-sparse hierarchy is much stronger at level t = 1, but at level t = 2 the dense and ideal-sparse hierarchies give comparable bounds. Second, for a = 1, all bounds except the dense bound of level 1 are equal to 4 = rank_+(S(1, 1)) (as expected for the ideal-sparse hierarchy, in view of Lemma 15).

Euclidean distance matrices
The second class of examples we consider are the Euclidean distance matrices M_n = ((i − j)²)^n_{i,j=1} ∈ R^{n×n}_+, known to have a large separation between their rank and their nonnegative rank. Indeed, rank(M_n) = 3 [8], and their bipartite support graph G_{M_n} is K_{n,n} with a perfect matching deleted (known as a crown graph), whose edge biclique-cover number satisfies bc(G_{M_n}) = Θ(log n) [21]. So we have rank(M_n) = 3 and rank_+(M_n) ≥ bc(G_{M_n}) = Θ(log n).

Figure 3: Bounds ξ^+_{t,‡}(M) and ξ^{+,isp}_{t,‡}(M) for M = S(a, 1) and t = 1, 2, 3, with a ranging from 0 to 1 in increments of 0.01. The colour indicates a lower bound on the obtained numerical value: yellow, red and purple show that the bound is at least 2, 3, and 4, respectively. So a red square at a = 0.35 and "sp t=2" means ξ^{+,isp}_{2,‡}(M) ≥ 3.
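The rank-3 claim for M_n follows from the identity (i − j)² = i² − 2ij + j², which writes M_n as a sum of three rank-one matrices; a quick check with a small row-reduction helper of our own:

```python
def matrix_rank(M, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting (small dense matrices)."""
    A = [row[:] for row in M]
    rows, cols = len(A), len(A[0])
    r = 0
    for c in range(cols):
        pivot = max(range(r, rows), key=lambda i: abs(A[i][c]))
        if abs(A[pivot][c]) < tol:
            continue
        A[r], A[pivot] = A[pivot], A[r]
        for i in range(r + 1, rows):
            f = A[i][c] / A[r][c]
            for j in range(c, cols):
                A[i][j] -= f * A[r][j]
        r += 1
        if r == rows:
            break
    return r

for n in (4, 6, 9):
    M = [[float((i - j) ** 2) for j in range(n)] for i in range(n)]
    print(n, matrix_rank(M))  # rank 3 for every n >= 3
```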
In addition, it is known that rank_+(M_n) ≤ 2 + ⌈n/2⌉; see [31, Theorem 9]. The numerical results are shown in Table 5. In these examples, the ideal-sparse bound of level t = 2 is more difficult to compute, since the support graph G_{M_n} has 2^{n−1} maximal bicliques, each with n vertices. For this reason, we could compute ξ^{+,isp}_{2,†} only up to n = 7 before running out of memory. So this example illustrates the limitations of the ideal-sparsity approach when the number of maximal cliques is too large. Note that this difficulty (a large number of maximal bicliques) remains even if we replace the support graph G_{M_n} by a supergraph Ĝ, obtained by adding to G_{M_n} (say) s edges from the missing perfect matching. Indeed, such Ĝ still has 2^{n−s−1} maximal bicliques, each with n + s vertices.

6 Concluding remarks

In this paper we have introduced a new sparsity approach for the GMP, which arises when the formulation of the GMP contains explicit ideal-type constraints requiring the support of the measure to be contained in the variety of an ideal generated by monomials x_i x_j corresponding to (the non-edges of) a graph G.
We compared it to the more classical correlative sparsity structure, which requires a chordal structure on the graph G, whereas the new ideal-sparse hierarchy needs no such assumption. We explored its application to the problem of bounding the nonnegative rank and the cp-rank of a matrix and illustrated the new approach on several classes of examples. This work leaves open several natural extensions and further research directions; we now sketch some of them.
How to deal with many cliques. In the new ideal-sparse approach, instead of a single measure on the full space R^n, one has several measures on smaller spaces indexed by the maximal cliques of the graph G.
At any given level t ≥ 1, the corresponding ideal-sparse bounds are at least as good as their dense analogs and, depending on the number of maximal cliques, their computation can be much faster: computing the ideal-sparse parameters involves several semidefinite matrices of smaller sizes, one for each maximal clique. The first research direction is to investigate the trade-off between having many cliques (in the ideal-sparse setting) and large matrix constraints (in the dense setting). As seen in Section 5.3.2, the sparse hierarchy behaves particularly badly on examples where the underlying graph has exponentially many cliques. We suggest a possible solution in Remark 9, where we consider merging some of the cliques by considering a (possibly chordal) extension of the support graph G. Clique merging has been explored before in the context of power flow networks, see [59] and [30]. These methods exploit correlative sparsity and thus require the underlying support graph to be chordal. Finding the minimal chordal extension of a graph is NP-complete [4], but heuristics exist for certain cases (see, e.g., [10]). Supposing one has chosen a method for finding chordal extensions, it is still unclear which among the possible chordal extensions will result in better SDPs. One can try to merge small cliques based on how much this would reduce the estimated computational burden; such estimates can be based, for example, on the number of constraints, see [52], or on the cost of an interior-point method iteration, see [60]. As it stands, we know of no systematic way to find a "computationally optimal" trade-off between the dense and ideal-sparse hierarchies.
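The clique-merging idea via a chordal extension can be sketched with networkx (a hypothetical toy example, not the paper's implementation): starting from a sparse, non-chordal support graph (here a 9-cycle), a chordal completion replaces many tiny maximal cliques by fewer, larger ones.

```python
import networkx as nx

G = nx.cycle_graph(9)                    # non-chordal support graph
H, _ = nx.complete_to_chordal_graph(G)   # add fill edges until chordal

cliques_G = list(nx.find_cliques(G))     # maximal cliques of G: the 9 edges
cliques_H = list(nx.find_cliques(H))     # fewer, larger cliques after fill

print(len(cliques_G), max(map(len, cliques_G)))  # 9 cliques, size 2
print(len(cliques_H), max(map(len, cliques_H)))
assert nx.is_chordal(H)
```

Which completion to use, and how aggressively to merge cliques, is precisely the open trade-off discussed above.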
Application to other matrix factorization ranks. We have explored the application to nonnegative and completely positive matrix factorization ranks. We have not considered their non-commutative analogs, the positive semidefinite (psd) rank and the completely positive semidefinite (cpsd) rank, where, respectively, given M ∈ R^{m×n}_+ one wants psd matrices X_1, …, X_m, Y_1, …, Y_n ∈ S^r_+ such that M = (⟨X_i, Y_j⟩)_{i∈[m], j∈[n]}, and given A ∈ S^n one wants psd matrices X_1, …, X_n ∈ S^r_+ such that A = (⟨X_i, X_j⟩)_{i,j∈[n]}, with r smallest possible. One recovers the nonnegative rank and the cp-rank when restricting the factors X_i, Y_j to be diagonal matrices. We refer the reader to [32], where a common polynomial optimization framework is offered to treat all four of these matrix factorization ranks. In the noncommutative setting of the psd- and cpsd-ranks, zero entries of M (or A) also imply ideal-type constraints, of the form X_i Y_j = 0 (or X_i X_j = 0). Thus the techniques in the present paper may extend to this general setting; we leave this extension to future work.
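The reduction from the psd rank to the nonnegative rank under diagonal factors can be made concrete in a small numpy sketch (illustrative random data, not from the paper): with X_i = diag(a_i) and Y_j = diag(b_j), the inner products ⟨X_i, Y_j⟩ collapse to the dot products a_i · b_j, so a size-r psd factorization with diagonal factors is exactly a rank-r nonnegative factorization.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 3, 4, 2

# Nonnegative vectors a_i, b_j in R^r_+ (illustrative data)
a = rng.random((m, r))
b = rng.random((n, r))

# Diagonal psd factors: X_i = diag(a_i), Y_j = diag(b_j)
X = [np.diag(a[i]) for i in range(m)]
Y = [np.diag(b[j]) for j in range(n)]

# M_ij = <X_i, Y_j> = trace(X_i Y_j) reduces to the dot product a_i . b_j,
# so this psd factorization of size r is a nonnegative factorization of size r.
M_psd = np.array([[np.trace(X[i] @ Y[j]) for j in range(n)] for i in range(m)])
assert np.allclose(M_psd, a @ b.T)
```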
More general ideal-sparsity and applications. We have considered an ideal-sparsity structure where the ideal in (3) is generated by quadratic monomials. Besides their use for bounding matrix factorization ranks, constraints of the form x_i x_j = 0 naturally arise in a number of other applications. First we note that, up to a change of variables, one can consider more general constraints of the form (aᵀx + b)(cᵀx + d) = 0 (with a, c ∈ R^n and b, d ∈ R). This type of constraint is commonly referred to as a complementarity constraint: either the term aᵀx + b or the term cᵀx + d is required to be zero. We mention two areas where such complementarity constraints naturally arise: the analysis of neural networks and optimality conditions in optimization.
Complementarity constraints arise naturally when modeling neural networks with the rectified linear activation function (ReLU). The semialgebraic representation of the graph of the ReLU function involves a constraint of the form y(y − x) = 0, which is exactly a complementarity constraint. The fact that the graph of the ReLU function admits a semialgebraic representation has been exploited computationally, using the moment-sum-of-squares framework, for analyzing the Lipschitz constant of a neural network as well as stability and performance properties of dynamical systems controlled by ReLU neural networks; see, e.g., [16,17,41]. Ideal-sparsity is therefore a natural candidate to render these methods more computationally efficient and deserves further study.
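The semialgebraic description of the ReLU graph can be checked numerically. The following sketch (our own illustration) uses the standard description y ≥ 0, y ≥ x, y(y − x) = 0, of which the last equation is the complementarity constraint mentioned above:

```python
import numpy as np

# Verify the semialgebraic description of the ReLU graph on a grid:
#   y = max(x, 0)  <=>  y >= 0,  y >= x,  y * (y - x) = 0.
xs = np.linspace(-2.0, 2.0, 401)
ys = np.maximum(xs, 0.0)                 # points on the ReLU graph

assert np.all(ys >= 0) and np.all(ys >= xs)
assert np.allclose(ys * (ys - xs), 0.0)  # complementarity holds exactly

# Conversely, y*(y - x) = 0 forces y = 0 (then y >= x gives x <= 0)
# or y = x (then y >= 0 gives x >= 0); in both cases y = max(x, 0).
```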
Complementarity systems also arise in optimization, within the Karush-Kuhn-Tucker (KKT) conditions. The complementarity slackness condition of the KKT system reads λ_i f_i(x) = 0, where λ_i is the Lagrange multiplier associated with the i-th constraint f_i(x) ≤ 0. If f_i is affine, this is a constraint of the ideal type above. The fact that the KKT conditions define a basic semialgebraic set when the optimization problem has polynomial data was exploited in [42] to analyze dynamical systems controlled by optimization algorithms, albeit without exploiting ideal-sparsity. More generally, ideal-sparsity could be used to analyze linear complementarity problems (LCPs), which have applications in, e.g., economics, engineering, and game theory; see [18] for an extensive treatment of the subject.
Finally, instead of considering an ideal generated by quadratic monomials, one may consider an ideal generated by a set of monomials x_S = ∏_{i∈S} x_i (S ∈ S), where S is a given collection of subsets of V = [n]. The treatment extends naturally to this more general setting, where in the definition (2) of the set K we replace the constraints x_i x_j = 0 ({i, j} ∈ E) by ∏_{i∈S} x_i = 0 (S ∈ S). Indeed, let V_1, …, V_p denote the maximal subsets of V that do not contain any set S ∈ S. Then, for the dense formulation (1) of the GMP, one can again show an equivalent sparse reformulation as in (11), which involves p measures supported on the subspaces R^{|V_1|}, …, R^{|V_p|} instead of a single measure on R^{|V|}. We leave it for further research to explore applications of this more general ideal-sparsity setting and possible further extensions to other types of varieties.
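For small instances, the maximal subsets V_1, …, V_p can be enumerated by brute force. The sketch below (a hypothetical helper `maximal_supports`, exponential in n and for illustration only) recovers, in the quadratic case where S consists of the non-edges of a graph G, the maximal cliques of G:

```python
from itertools import combinations

def maximal_supports(n, S):
    """Brute-force the maximal subsets of {0, ..., n-1} containing no set in S.

    When S is the set of non-edges of a graph G, these are exactly the
    maximal cliques of G. Exponential enumeration: small n only."""
    S = [frozenset(s) for s in S]
    valid = [set(c) for k in range(n + 1)
             for c in combinations(range(n), k)
             if not any(s <= set(c) for s in S)]
    # keep only the inclusion-maximal valid subsets
    return [v for v in valid if not any(v < w for w in valid)]

# Example: n = 4 with forbidden monomials x_0*x_1 and x_2*x_3
print(maximal_supports(4, [{0, 1}, {2, 3}]))  # -> [{0, 2}, {0, 3}, {1, 2}, {1, 3}]
```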

Lemma 13. If A ∈ S^n is nonnegative with support graph G_A, then ξ^{cp,wisp}_1(A) ≥ c_frac(G_A).

[Table legend: n = size of A, p = number of maximal cliques of G_A, c = edge clique-cover number of G_A, r = rank(A), r_cp = rank_cp(A); "-" indicates a computation that failed due to memory constraints.]

… the complete graph. Accordingly, we have the following inequalities among the parameters.

Lemma 8. Assume Ĝ contains G as a subgraph. For any integer t ≥ 1 we have ξ_t ≤ ξ^{isp,Ĝ}_t ≤ ξ^{isp,G}_t.

Proof. The proof of the inequality ξ_t ≤ ξ^{isp,Ĝ}_t is analogous to that of ξ_t ≤ ξ^{isp}_t in Theorem 7. We now show ξ^{isp,Ĝ}_t ≤ ξ^{isp,G}_t. For this, assume (L_1, …, L_p) is feasible for the parameter

Table 1: Dense and ideal-sparse bounds for selected sparse cp-matrices.

Table 2: Testing flatness and atom extraction.

Table 4: Detecting non-cp-matrices for t = 2, 3.

Given a nonnegative matrix M ∈ R^{m×n}_+, its nonnegative rank, denoted rank_+(M), is the smallest integer r for which there exist nonnegative vectors a_1, …, a_r ∈ R^m_+ and b_1, …, b_r ∈ R^n_+ such that