Moment-SoS Methods for Optimal Transport Problems

Most common Optimal Transport (OT) solvers are currently based on an approximation of the underlying measures by discrete measures. However, it is sometimes relevant to work only with moments of measures instead of the measures themselves, and many common OT problems can be formulated as moment problems (the most relevant examples being $L^p$-Wasserstein distances, barycenters, and Gromov-Wasserstein discrepancies on Euclidean spaces). We leverage this fact to develop a generalized moment formulation that covers these classes of OT problems. The transport plan is represented through its moments on a given basis, and the marginal constraints are expressed in terms of moment constraints. A practical computation then consists in truncating the involved moment sequences up to a certain order, and using the polynomial sums-of-squares hierarchy for measures supported on semi-algebraic sets. We prove that the strategy converges to the solution of the OT problem as the order increases. We also show how to approximate linear quantities of interest, and how to estimate the support of the optimal transport map from the computed moments using Christoffel-Darboux kernels. Numerical experiments illustrate the good behavior of the approach.


Introduction
Optimal Transport provides a principled and versatile approach to work with probability distributions. In recent years, an increasing number of theoretical results have been leveraged to build numerical solvers, which now play a fundamental role in numerous applications ranging from economics [7,15] and quantum chemistry [11,4] to gradient flow modeling [8] and machine learning [28].
The prototypical example is the two-marginal Monge-Kantorovich problem: given two Borel sets $X_1 \subset \mathbb{R}^{n_1}$ and $X_2 \subset \mathbb{R}^{n_2}$ and two probability measures $\mu$ and $\nu$, solve
$$\inf_{\pi \in \Pi(X_1 \times X_2; \mu, \nu)} \int_{X_1 \times X_2} c \, d\pi, \qquad (1)$$
where $c : X_1 \times X_2 \to \mathbb{R}_+$ is a lower semi-continuous cost function. The infimum runs over the set $\Pi(X_1 \times X_2; \mu, \nu)$ of coupling measures (usually called transport plans) on $X_1 \times X_2$ with marginal distributions equal to $\mu$ and $\nu$ respectively. More generally, the problem can be posed on general Polish spaces. It can also be multi-marginal, as we introduce later on in Section 3. Numerous methods have been introduced for solving such problems in practice. Many algorithms rely on an approximation of measures by discrete measures (by sampling or discretization on discrete grids), whose numerical cost quickly becomes prohibitive when the number of discretization points increases. Among the existing strategies to mitigate this effect, probably the most popular one is based on adding an entropic regularization to the loss function, which is then solved with a Sinkhorn algorithm ([12,10,28]). The algorithm is however still posed on a grid. Other approaches involving discrete grids are the auction algorithm [6], numerical methods based on Laguerre cells [16], multiscale algorithms [23,30], and methods based on dynamic formulations [5]. Recently, an approach that dynamically discovers sampling points where the support of the solution measure lies has been introduced in [14]. It provides promising results to address the curse of dimensionality when the optimal transport plans have sparse support.
arXiv:2211.10742v2 [math.NA] 2 Dec 2022
The present paper considers an entirely different avenue, where the spatial discretization is replaced by a spectral discretization: the transport plan is represented through its moments on a given basis, and the marginal constraints are expressed in terms of moment constraints. A practical computation then consists in considering a truncation of the involved moment sequences up to a certain order. The procedure needs to be well posed in the sense that one needs to guarantee convergence to the original problem as the truncation order increases. This type of approach is not entirely novel, but it has only been explored in a few prior works, despite the fact that it is often relevant to work only with moments of a measure instead of the measure itself. The idea was originally mentioned in [19] for the case of a polynomial basis and polynomial cost function. The moment approach was also recently explored in [9] for applications related to image processing, involving trigonometric polynomial bases. It has also been recently used for solving the Monge problem [17], where an approximation of the transport map is constructed by solving a moment matrix completion problem together with an approximation method based on the Christoffel-Darboux kernel [22]. A relatively different contribution can be found in [2], where the authors relax marginal constraints into a set of moment constraints, but there the moments do not come from a prescribed basis. Instead, they are selected from a given dictionary involving potentially general test functions. Last but not least, the idea of leveraging moment formulations and sums-of-squares has also been used for the dual formulation of problem (1), in order to derive statistical estimation bounds for high-dimensional OT problems (see [32,24]).
In view of the present state of the art, the main contribution of this paper is to provide a general moment problem formulation of most common OT problems with polynomial or piecewise polynomial costs. In particular, the problems of estimating $L^p$-Wasserstein distances for $p \ge 1$, barycenters, and Gromov-Wasserstein discrepancies on Euclidean spaces are covered by our framework. We prove that the resulting sequence of optimal solutions converges to the whole moment sequence of the original OT measure as the polynomial order increases. The case of piecewise polynomial costs is addressed by a reformulation in terms of conditional measures.
For practical computations, we can directly apply the SoS-moment hierarchy similarly as in [19,9], which eventually boils down to solving semidefinite programming optimization problems. It is worth emphasizing that, by switching the point of view to a moment problem, we do not recover the measure itself. Instead, the resulting outputs are moments of the optimal transport plan. Depending on the application, this may of course be a limitation. However, we show that it is possible to recover linear quantities of interest, and also the support of the measure, by a post-processing algorithm based on Christoffel-Darboux kernels. Our numerical examples show that even the support of concentrated measures can be efficiently estimated with relatively low polynomial order. This feature seems particularly appealing. It could for instance be leveraged to recover optimal transport plans from high-dimensional OT problems: the support estimation could provide well-chosen sampling points to grid-based approaches. A full development of these ideas will be presented in a forthcoming work.
The paper is organized as follows. After introducing some basic notation in Section 2, we define optimal transport problems in Section 3. We prove that when the involved cost function is a polynomial or a piecewise polynomial, the problem can be interpreted as a generalized moment problem. This section also allows us to introduce the basic principles of our approach to OT problems and important results from real algebraic geometry. In Sections 4 and 5, we consider optimal transport problems that play a crucial role in numerous application areas, and prove that they can be expressed as generalized moment problems. More precisely, we consider in Section 4 the problems of computing $L^p$-Wasserstein distances and barycenters for $p \ge 1$, and in Section 5 the problems of computing Gromov-Wasserstein discrepancies and corresponding barycenters. In Section 6, we formulate a generalized moment problem that includes all previous OT problems. We derive a solution strategy based on the SoS-moment (or Lasserre's) hierarchy, and we prove its convergence to the solution of the OT problem. Section 7 explains how to post-process the moments to estimate linear quantities of interest and the support of the measure. Section 8 illustrates the potential of the approach by giving numerical results on estimating the $L^1$ and $L^2$ Wasserstein distances, barycenters, and the $L^2$ Gromov-Wasserstein discrepancy.

Some elements of notation
In the following, $\mathbb{N}$ should be understood as the set of non-negative integers (including zero). Vectors $x$ from the Euclidean space $\mathbb{R}^n$ are denoted with bold notation, and their coordinates $x = (x_1, \dots, x_n)^T$ are written in plain text. The canonical vectors are denoted $e_i = (0, \dots, 0, 1, 0, \dots, 0)^T$ for $i \in \{1, \dots, n\}$. For any index $p \in \mathbb{N}^*$, $x^p := (x_1^p, \dots, x_n^p)^T$. For a multi-index $\alpha \in \mathbb{N}^n$, we write $x^\alpha := x_1^{\alpha_1} \cdots x_n^{\alpha_n}$ and $|\alpha| := \alpha_1 + \dots + \alpha_n$, and we let $\mathbb{N}^n_r := \{\alpha \in \mathbb{N}^n : |\alpha| \le r\}$. We let $\mathbb{R}[x]_r$ be the space of real polynomials of degree at most $r$, that can be written $\sum_{\alpha \in \mathbb{N}^n_r} c_\alpha x^\alpha$ for some real coefficients $c_\alpha$. For any Borel set $X$ in $\mathbb{R}^n$, we denote by $\mathcal{M}(X)$ the space of finite signed Borel measures supported on $X$, by $\mathcal{M}(X)_+ := \{\mu \in \mathcal{M}(X) : \mu \ge 0\}$ its positive cone of finite Borel measures supported on $X$, and by $\mathcal{P}(X) := \{\mu \in \mathcal{M}(X)_+ : \mu(X) = 1\}$ the set of probability measures supported on $X$. The indicator function of a subset $A \subset \mathbb{R}^n$ is denoted $1_A(x)$. For any Borel set $X$ in $\mathbb{R}^n$ and any measure $\mu \in \mathcal{M}(X)$,
$$m_\alpha(\mu) := \int_X x^\alpha \, d\mu(x)$$
is the moment of $\mu$ associated to the multi-index $\alpha \in \mathbb{N}^n$, and $m(\mu) := (m_\alpha(\mu))_{\alpha \in \mathbb{N}^n}$ is the sequence of moments of $\mu$. The mass of $\mu$ is denoted $\mathrm{mass}(\mu) = m_0(\mu)$. A measure $\mu$ is said to be determinate if it is uniquely determined by its moment sequence $m(\mu)$.
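To make the notation concrete, the moment sequence of a discrete measure, and the evaluation of $\int g \, d\mu$ from moments alone, can be computed in a few lines. The following Python sketch is purely illustrative (the function names are ours, not from the paper):

```python
import numpy as np

def moments_1d(points, weights, max_order):
    """Moments m_a(mu) = sum_j w_j x_j^a of a discrete measure mu = sum_j w_j delta_{x_j}."""
    return np.array([np.sum(weights * points**a) for a in range(max_order + 1)])

def integrate_poly(coeffs, y):
    """int g dmu = sum_a c_a m_a(mu) for g(x) = sum_a c_a x^a, using moments only."""
    return float(np.dot(coeffs, y[:len(coeffs)]))

# Uniform measure on the three points {0, 1/2, 1}
x = np.array([0.0, 0.5, 1.0])
w = np.array([1/3, 1/3, 1/3])
y = moments_1d(x, w, 4)          # m_0, ..., m_4
g = np.array([1.0, 0.0, -1.0])   # g(x) = 1 - x^2
# int (1 - x^2) dmu = 1 - (0 + 1/4 + 1)/3 = 7/12
assert abs(integrate_poly(g, y) - 7/12) < 1e-12
```

The last assertion is exactly the pairing of polynomial coefficients with moments, i.e. the Riesz functional introduced next.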
Finally, for a given sequence $y = (y_\alpha)_{\alpha \in \mathbb{N}^n}$, we introduce the Riesz functional $\ell_y : \mathbb{R}[x] \to \mathbb{R}$ which associates to a real polynomial $g(x) = \sum_{\alpha \in \mathbb{N}^n} a_\alpha x^\alpha$ the value $\ell_y(g) = \sum_{\alpha \in \mathbb{N}^n} a_\alpha y_\alpha$. For any measure $\mu$, we thus have $\ell_{m(\mu)}(g) = \int g \, d\mu$.

Optimal transport problems with polynomial costs

Formulation
To guide the subsequent discussion, we consider the multi-marginal version of problem (1) as a prototypical example of an OT problem. This problem consists in considering $K$ probability measures $\mu_i \in \mathcal{P}(X_i)$ defined on Borel sets $X_i \subset \mathbb{R}^{n_i}$ for all $i \in \{1, \dots, K\}$, and solving
$$\rho = \inf_{\pi \in \Pi} L(\pi). \qquad (3)$$
The loss function is of the form $L(\pi) = \int_X c \, d\pi$, and the set $X$ is defined as the product set $X = X_1 \times \dots \times X_K$. Note that $X$ can be identified with a subset of $\mathbb{R}^n$, with $n = n_1 + \dots + n_K$. The function $c : X \to \mathbb{R}$ is a given cost function, and the constraint set $\Pi$ is a shorthand notation for the set $\Pi(X; \mu_1, \dots, \mu_K)$ of coupling measures having the $\mu_i$ as marginals, namely
$$\Pi = \{\pi \in \mathcal{P}(X) : \mathrm{proj}_i \# \pi = \mu_i, \ \forall i \in \{1, \dots, K\}\},$$
where $\mathrm{proj}_i : X \to X_i$ denotes the canonical projection, and the push-forward measure $\mathrm{proj}_i \# \pi$ is the $i$-th marginal of $\pi$.
The existence of a minimizer for (3) is standard in OT theory. Indeed, $\Pi$ is trivially nonempty since the coupling $\otimes_{i=1}^K \mu_i$ belongs to this set. The set $\Pi$ is convex and compact for the weak-* topology thanks to the imposed marginals, and if the cost function $c$ is lower semi-continuous (l.s.c.), then the loss function $L : \pi \mapsto \int c \, d\pi$ is l.s.c. with respect to the weak-* topology. Hence the existence of a minimizer is guaranteed under a very weak hypothesis on the cost function $c$, such as lower semi-continuity.
As a result of (5) and (6), instead of considering the OT problem (3), where we search for an unknown measure $\pi \in \Pi$, one can alternatively consider the moment problem of searching for the optimal sequence $y = (y_\alpha)_{\alpha \in \mathbb{N}^n}$ solving
$$\rho_{mom} = \inf_{y \in \Pi_{mom}} L(y), \qquad (7)$$
where $L(y) := \ell_y(c)$ and $\Pi_{mom} := \Pi_{mom}(X; m(\mu_1), \dots, m(\mu_K))$ is the set of sequences in $\mathbb{R}^{\mathbb{N}^n}$ satisfying the following constraints:
(i) Marginal conditions: the sequence $y$ should satisfy
$$y_{(0, \dots, 0, \beta, 0, \dots, 0)} = m_\beta(\mu_i), \quad \forall \beta \in \mathbb{N}^{n_i}, \ \forall i \in \{1, \dots, K\},$$
where $\beta$ occupies the $i$-th block of indices.
(ii) Moment sequence condition: the sequence $y$ must have a representing measure supported on $X$, that is, there must exist a measure $\pi \in \mathcal{M}(X)_+$ such that $y = m(\pi)$. We write this condition as $y \in MS(X)$.
The equivalence between problems (3) and (7) is closely related to the determinacy of the measures $\mu_i$, for which a sufficient condition is that the sets $X_i$ are compact. We summarize these facts in the theorem below.
Theorem 3.1 (Polynomial cost). If the sets $\{X_i\}_{i=1}^K$ are compact, then the OT problem (3) with polynomial cost is equivalent to the generalized moment problem (7), in the sense that $\rho = \rho_{mom}$ and:
• If $\pi^*$ is a minimizer of problem (3), then $m(\pi^*)$ is a minimizer of problem (7).
• A minimizer $y^*$ of problem (7) has a representing measure which is a solution of (3).
In addition, if the solution π * of the OT problem (3) is unique, then the solution y * of (7) is unique and y * = m(π * ).
Proof. Suppose $\pi^*$ is a solution of problem (3). Since it is a Borel measure supported on the compact set $X$, it is determinate, so there exists a unique sequence $y = m(\pi^*)$. This sequence $y$ is clearly in the set $\Pi_{mom}$. Therefore, as a feasible solution for (7), it satisfies $\rho_{mom} \le L(y) = L(\pi^*) = \rho$. Conversely, let $y^*$ be a solution to problem (7). Since $y^* \in MS(X)$, there is a representing measure $\pi$ such that $y^* = m(\pi)$. For $\pi$ to belong to the feasible set $\Pi$ of problem (3), the marginal conditions on $m(\pi)$ should imply that the marginals of $\pi$ are the $\mu_i$. This is satisfied given that the marginal measures $\mu_i$ are determinate, because the $X_i$ are compact. Therefore $\pi \in \Pi$ and $\rho \le L(\pi) = L(m(\pi)) = L(y^*) = \rho_{mom}$. This proves that $\rho = \rho_{mom}$, and that $m(\pi^*)$ is a solution of problem (7) if and only if $\pi^*$ is a solution of (3).

Remark 3.2. Since the $\mu_i$ are probability measures, $y \in \Pi_{mom}$ is such that $y_{(0, \dots, 0)} = 1$, and a representing measure $\pi \in \mathcal{M}(X)_+$ such that $y = m(\pi)$ has mass 1, i.e. $\pi \in \mathcal{P}(X)$.

The moment sequence condition
In this section, we discuss how the moment sequence condition $y \in MS(X)$ is translated into mathematical terms. This question is in fact directly related to the so-called moment problem, which studies the following question: given a Borel subset $X \subseteq \mathbb{R}^n$ and a sequence of real numbers $y = (y_\alpha)_{\alpha \in \mathbb{N}^n}$, what are the conditions on $y$ under which we can guarantee that $y = m(\pi)$ for some positive measure $\pi \in \mathcal{M}(X)_+$? For the one-dimensional case ($n = 1$), this classical problem is well understood and dates back to contributions by Markov, Stieltjes, Hausdorff, and Hamburger. Explicit conditions on $y$ exist, and they are all stated in terms of positive semidefiniteness of certain Hankel matrices. Much less is known for the multidimensional case ($n > 1$). A general result is given by the Riesz-Haviland theorem, which states that a sequence $y$ has an associated Borel measure $\pi$ such that $y = m(\pi)$ if and only if $\ell_y(f) \ge 0$ for all polynomials $f \in \mathbb{R}[x]$ nonnegative on $X$. This theorem is not really useful if we do not have an explicit characterization of polynomials that are nonnegative on $X$ (a so-called Positivstellensatz). Such a characterization has been provided by Schmüdgen in [31] when the ambient space $X$ is a compact basic semi-algebraic set of the form
$$X = \{x \in \mathbb{R}^n : g_j(x) \ge 0, \ j = 1, \dots, J\} \qquad (8)$$
for some polynomials $g_1, \dots, g_J \in \mathbb{R}[x]$. It is proven in [31] that a sequence $y$ has a representing Borel measure supported on $X$ (i.e. satisfies the moment sequence condition) if and only if it satisfies
$$\ell_y(g_I f^2) \ge 0, \quad \forall f \in \mathbb{R}[x], \ \forall I \subset \{1, \dots, J\}, \qquad (9)$$
where $g_I = \prod_{j \in I} g_j$ and where we have used the convention $g_\emptyset = 1$. For a polynomial $g(x) = \sum_{\gamma \in \mathbb{N}^n} g_\gamma x^\gamma$ and $r \in \mathbb{N}$, let $M_r(g y)$ be the matrix with entries
$$[M_r(g y)]_{\alpha, \beta} = \sum_{\gamma \in \mathbb{N}^n} g_\gamma y_{\alpha + \beta + \gamma}, \quad \alpha, \beta \in \mathbb{N}^n_r,$$
which is such that for any polynomial $f \in \mathbb{R}[x]_r$ of degree at most $r$ with coefficient vector $\mathbf{f} = (f_\alpha)_{\alpha \in \mathbb{N}^n_r}$, we have $\mathbf{f}^T M_r(g y) \mathbf{f} = \ell_y(g f^2)$. Therefore, the moment sequence condition (9) is equivalent to
$$M_r(g_I y) \succeq 0, \quad \forall I \subset \{1, \dots, J\}, \ \forall r \in \mathbb{N}. \qquad (10)$$
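In the univariate case these matrices are Hankel, and assembling them from a moment sequence is straightforward. The following Python sketch (our own illustration; the function names are hypothetical) builds the moment matrix and a localizing matrix for the uniform probability measure on $X = [-1, 1]$, described by $g(x) = 1 - x^2$, and checks the positive semidefiniteness required for $y$ to have a representing measure on $X$:

```python
import numpy as np

def moment_matrix(y, r):
    """M_r(y)[i,j] = y_{i+j}: the (Hankel) moment matrix for n = 1."""
    return np.array([[y[i + j] for j in range(r + 1)] for i in range(r + 1)])

def localizing_matrix(y, g, r):
    """M_r(g y)[i,j] = sum_k g_k y_{i+j+k}, so that f^T M_r(g y) f = ell_y(g f^2)."""
    return np.array([[sum(gk * y[i + j + k] for k, gk in enumerate(g))
                      for j in range(r + 1)] for i in range(r + 1)])

# Moments of the uniform probability measure on [-1, 1]:
# y_a = (1/2) int_{-1}^1 x^a dx = 1/(a+1) if a is even, else 0.
y = [1.0 / (a + 1) if a % 2 == 0 else 0.0 for a in range(10)]
g = [1.0, 0.0, -1.0]  # g(x) = 1 - x^2, so X = {x : g(x) >= 0}

r = 3
M = moment_matrix(y, r)
L = localizing_matrix(y, g, r)
# Both matrices must be positive semidefinite for a representing measure on X.
assert np.min(np.linalg.eigvalsh(M)) >= -1e-12
assert np.min(np.linalg.eigvalsh(L)) >= -1e-12
```

Note that $M$ is a Gram matrix of monomials with respect to $\mu$, and $L$ a Gram matrix with respect to the measure $g \, d\mu \ge 0$, which is why both are positive semidefinite here.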
A simpler characterization has been given by Putinar in [29] under the following additional assumption.
Assumption 3.3. There exists a polynomial $u$ of the form $u = u_0 + \sum_{j=1}^J u_j g_j$, where the $u_j$ are sums-of-squares (SoS) polynomials, and such that $\{x \in \mathbb{R}^n : u(x) \ge 0\}$ is compact.
Under Assumption 3.3, it is proven in [29] that $y$ has a representing Borel measure supported on $X$ if and only if
$$M_r(g_j y) \succeq 0, \quad \forall j \in \{0, \dots, J\}, \ \forall r \in \mathbb{N}, \qquad (11)$$
where we have used the convention $g_0 = 1$. The linear positive semidefinite constraints (10) or (11) are exactly the moment sequence condition. We summarize the above results in the following theorem.

Theorem 3.4 (Th. 3.8 in [20]). Let $X$ be a basic semi-algebraic set as in (8). A sequence $y \in \mathbb{R}^{\mathbb{N}^n}$ satisfies $y \in MS(X)$ (i.e. satisfies the moment sequence condition on $X$) if and only if it satisfies the positive semidefinite constraints (10), or the positive semidefinite constraints (11) under the additional Assumption 3.3.
In our context, since $X$ is the product set $X_1 \times \dots \times X_K$, we assume that each $X_i$ is a compact basic semi-algebraic set defined as
$$X_i = \{x_i \in \mathbb{R}^{n_i} : g_{i,j}(x_i) \ge 0, \ j = 1, \dots, J_i\}. \qquad (12)$$
Then $X$ is a basic semi-algebraic set defined as in (8), with $J = \sum_{i=1}^K J_i$ and functions $\{g_j\}_{j=1}^J$ obtained by collecting the functions $g_{i,j}$, $1 \le j \le J_i$, $1 \le i \le K$. Since for any compact semi-algebraic set $X$ there exists a sufficiently large $R > 0$ such that $X \subset \{x : \|x\|_2^2 < R\}$, the condition $R - \|x\|_2^2 \ge 0$ is redundant and can be systematically added to the definition of $X$. A stronger condition for Assumption 3.3 to hold for a product set $X$ is that the description of each set $X_i$ contains a function of the form $g_{i,j}(x_i) = R_i - \|x_i\|_2^2$. If this is not the case, we should prefer to add the single function $R - \|x\|_2^2$ to the description of $X$, in order to reduce the number of positive semidefinite constraints.

Piecewise polynomial costs
Some OT problems (such as, e.g., the $L^1$-Wasserstein distance) involve a continuous or l.s.c. piecewise polynomial cost. These problems can also be formulated as generalized moment problems, up to the introduction of new unknown measures.
Theorem 3.6 (Piecewise polynomial cost). Let $\{A_i\}_{1 \le i \le m}$ be a partition of $X$ such that the cost $c$ coincides with a polynomial on each $A_i$, and consider the associated generalized moment problem (13) over sequences $(y_i)_{1 \le i \le m}$ with $y_i \in MS(\bar A_i)$. If $\pi^*$ is a solution of problem (3), then $(m(1_{A_1} \pi^*), \dots, m(1_{A_m} \pi^*))$ is a minimizer of problem (13), and conversely, a minimizer $(y_i^*)_{1 \le i \le m}$ of problem (13) is such that each $y_i^*$ has a representing measure $\pi_i$ supported on $\bar A_i$, and the sum $\pi = \pi_1 + \dots + \pi_m$ is a solution of (3). In addition, if the solution $\pi^*$ of the OT problem (3) is unique, then (13) has infinitely many solutions $(y_1^*, \dots, y_m^*)$, but the sum $y_1^* + \dots + y_m^* =: y^*$ is unique and such that $y^* = m(\pi^*)$.

Remark 3.7. For having a practical characterization of the set $MS(\bar A_i)$ of sequences that satisfy the moment sequence condition on $\bar A_i$, the partition should be such that the $\bar A_i$ are basic semi-algebraic compact sets. If $X$ is a basic semi-algebraic compact set, this means that $A_i$ should be defined as the set of points in $X$ satisfying a finite set of additional polynomial inequalities.
Sum of piecewise polynomial costs. Consider the case where the cost is a sum
$$c = \sum_{k=1}^s c_k, \qquad (14)$$
where each $c_k$ is a l.s.c. piecewise polynomial associated with a particular partition $\{A_{k,i}\}_{1 \le i \le m_k}$ of $X$. We could introduce a finer partition of $X$ composed of the sets $A_i = A_{1,i_1} \cap \dots \cap A_{s,i_s}$ for $i = (i_1, \dots, i_s)$. The function $c$ being polynomial on each set $A_i$, the problem can be reformulated as a generalized moment problem involving $m_1 \cdots m_s$ measures $\pi_i$ supported on the sets $\bar A_i$, for $i \in \{1, \dots, m_1\} \times \dots \times \{1, \dots, m_s\}$. However, the resulting number of unknown measures is exponential in $s$.
An alternative approach, that will be used later in this paper, is to introduce for each $1 \le k \le s$ a collection of measures $(\pi_{k,i})_{1 \le i \le m_k}$, and to consider the problem over measures $\pi \in \Pi$ and $\pi_{k,i} \in \mathcal{M}(\bar A_{k,i})_+$ satisfying $\sum_{i=1}^{m_k} \pi_{k,i} = \pi$ for all $1 \le k \le s$, with loss function
$$\sum_{k=1}^s \sum_{i=1}^{m_k} \int c_{k,i} \, d\pi_{k,i}, \qquad (15)$$
where $c_{k,i}$ denotes the polynomial that coincides with $c_k$ on $A_{k,i}$. This results in a problem with $m_1 + \dots + m_s + 1$ unknown measures, which can be equivalently written as the problem
$$\inf_{y \in \Pi_{mom}, (y_{k,i})} \sum_{k=1}^s \sum_{i=1}^{m_k} \ell_{y_{k,i}}(c_{k,i}) \qquad (16)$$
over sequences $y \in \Pi_{mom}$ and $y_{k,i} \in MS(\bar A_{k,i})$ satisfying $\sum_{i=1}^{m_k} y_{k,i} = y$ for all $1 \le k \le s$. Note that the measure $\pi$ (resp. the sequence $y$) can be eliminated from problem (15) (resp. (16)) by imposing that the sums $\sum_{i=1}^{m_k} \pi_{k,i}$ (resp. $\sum_{i=1}^{m_k} y_{k,i}$) coincide for all $k$. We summarize the above results in the next theorem.
Theorem 3.8 (Sum of piecewise polynomial costs). Assume $X_1 \times \dots \times X_K$ is compact, and consider a l.s.c. piecewise polynomial cost of the form (14), where each $c_k$ is a l.s.c. piecewise polynomial over a partition $\{A_{k,i}\}_{1 \le i \le m_k}$ of $X$ with compact basic semi-algebraic closures $\bar A_{k,i}$. Then the OT problem (3) is equivalent to the generalized moment problem (16).

Wasserstein distances and barycenters
In this section, we consider the problems of computing distances and barycenters in Wasserstein spaces, and show that they can be expressed as generalized moment problems. Throughout the section, $X$ denotes a compact basic semi-algebraic set in the normed vector space $(\mathbb{R}^d, \|\cdot\|_p)$, with $p \in \mathbb{N}^*$.

Wasserstein distances
The Wasserstein space $\mathcal{P}_p(X)$ is defined as the set of probability measures $\mu \in \mathcal{P}(X)$ with finite moments up to order $p$, namely
$$\mathcal{P}_p(X) = \Big\{\mu \in \mathcal{P}(X) : \int_X \|x\|_p^p \, d\mu(x) < \infty\Big\}.$$
Let $\mu$ and $\nu$ be two probability measures in $\mathcal{P}_p(X)$. For any $p \in \mathbb{N}^*$, the $L^p$-Wasserstein distance $W_p(\mu, \nu)$ between $\mu$ and $\nu$ is defined by
$$W_p(\mu, \nu)^p = \inf_{\pi \in \Pi(X \times X; \mu, \nu)} \int_{X \times X} \|x - y\|_p^p \, d\pi(x, y). \qquad (17)$$
The space $\mathcal{P}_p(X)$ endowed with the distance $W_p$ is a metric space, usually called the $L^p$-Wasserstein space (see [33] for more details). The $W_p$ distance defined through problem (17) is an optimal transport problem of the form (1) with $K = 2$ marginals, $X_1 = X_2 = X$, and the continuous cost function
$$c(x, y) = \|x - y\|_p^p = \sum_{i=1}^d |x_i - y_i|^p.$$
We claim that for any $p \in \mathbb{N}^*$, this problem can be seen as a generalized moment problem. We distinguish the cases where $p$ is even and odd.
Case p even. When $p$ is an even number, the cost $c$ is a polynomial, and we simply use the binomial theorem to derive that the loss function in (17) can be expressed as
$$L(\pi) = \sum_{i=1}^d \sum_{k=0}^p \binom{p}{k} (-1)^k \int_{X \times X} x_i^{p-k} y_i^k \, d\pi(x, y),$$
or, in terms of the moments $m(\pi)$ of $\pi$,
$$L(\pi) = \sum_{i=1}^d \sum_{k=0}^p \binom{p}{k} (-1)^k m_{((p-k) e_i, k e_i)}(\pi),$$
where we recall that $e_i$ is the $i$-th canonical vector in $\mathbb{N}^d$. The marginal constraints $\pi \in \Pi(X \times X; \mu, \nu)$ of problem (17) can also be expressed in terms of moments. We derived their general form in equation (6). In the present context, they read
$$y_{(\beta, 0)} = m_\beta(\mu), \quad y_{(0, \beta)} = m_\beta(\nu), \quad \forall \beta \in \mathbb{N}^d.$$
Problem (17) can then be expressed as the generalized moment problem
$$\inf_{y \in \Pi_{mom}} L(y), \qquad (18)$$
where $\Pi_{mom} := \Pi_{mom}(X \times X; m(\mu), m(\nu))$ is the set of sequences $y \in \mathbb{R}^{\mathbb{N}^{2d}}$ that satisfy the moment sequence condition and the marginal constraints above, and where $L(y)$ is obtained by replacing $m(\pi)$ with $y$. Here, Theorem 3.1 applies and proves the equivalence between problems (18) and (17).
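As a sanity check of the binomial-expansion identity above (in dimension $d = 1$), the loss $\int |x - y|^p \, d\pi$ of a discrete coupling can be evaluated either directly or purely from the joint moments $m_{(a,b)}(\pi)$. This Python sketch is illustrative (the function names are ours):

```python
import numpy as np
from math import comb

def joint_moment(xs, ys, w, a, b):
    """Moment m_{(a,b)}(pi) = sum_j w_j x_j^a y_j^b of a discrete coupling pi."""
    return float(np.sum(w * xs**a * ys**b))

def loss_from_moments(xs, ys, w, p):
    """Binomial expansion of int |x - y|^p dpi for even p (case d = 1)."""
    return sum(comb(p, k) * (-1)**k * joint_moment(xs, ys, w, p - k, k)
               for k in range(p + 1))

# A discrete coupling of two 2-point marginals (independent coupling for simplicity)
xs = np.array([0.0, 0.0, 1.0, 1.0])
ys = np.array([0.0, 1.0, 0.0, 1.0])
w  = np.array([0.25, 0.25, 0.25, 0.25])
direct = float(np.sum(w * (xs - ys)**2))
assert abs(loss_from_moments(xs, ys, w, 2) - direct) < 1e-12
```

For $p = 2$ the expansion is just $m_{(2,0)} - 2 m_{(1,1)} + m_{(0,2)}$, which is what the assertion verifies.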
Case p odd. When $p$ is odd, the presence of the absolute value in the cost function $c$ prevents us from having a polynomial expression. We can nevertheless derive a moment formulation by exploiting the fact that the cost is piecewise polynomial on $X \times X$. We first introduce for all $i \in \{1, \dots, d\}$ the subsets
$$A_i^+ = \{(x, y) \in X \times X : x_i \ge y_i\}, \quad A_i^- = \{(x, y) \in X \times X : x_i < y_i\},$$
that form a partition of $X \times X$ for each $i$, i.e. $A_i^+ \cup A_i^- = X \times X$ and $A_i^+ \cap A_i^- = \emptyset$.
If $X$ is compact semi-algebraic, then the closures $\bar A_i^+$ and $\bar A_i^-$ are also compact semi-algebraic. For any $\pi \in \mathcal{P}(X \times X)$, we can define measures $\pi_i^+, \pi_i^-$ by
$$\pi_i^+ = 1_{A_i^+} \pi, \quad \pi_i^- = 1_{A_i^-} \pi,$$
which are such that $\pi_i^+ + \pi_i^- = \pi$ for all $i \in \{1, \dots, d\}$. When $p$ is odd, since
$$|x_i - y_i|^p = (x_i - y_i)^p \, 1_{A_i^+}(x, y) + (y_i - x_i)^p \, 1_{A_i^-}(x, y),$$
we can write the Wasserstein loss function as
$$L(\pi) = \sum_{i=1}^d \Big( \int (x_i - y_i)^p \, d\pi_i^+ + \int (y_i - x_i)^p \, d\pi_i^- \Big).$$
From Theorem 3.8, we know that problem (17) is equivalent to the following problem with $2d + 1$ measures $\pi \in \Pi$ and $\pi_i^\pm \in \mathcal{M}(\bar A_i^\pm)_+$ satisfying $\pi_i^+ + \pi_i^- = \pi$ for all $i$, which can be equivalently reformulated as a generalized moment problem over sequences $y_i^\pm \in MS(\bar A_i^\pm)$ with objective
$$\sum_{i=1}^d \Big( \ell_{y_i^+}((x_i - y_i)^p) + \ell_{y_i^-}((y_i - x_i)^p) \Big),$$
and the constraints $y \in \Pi_{mom}(X \times X; m(\mu), m(\nu))$ and $y_i^+ + y_i^- = y$ for all $i \in \{1, \dots, d\}$. Note that the variable $y$ can be eliminated.
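The decomposition $\pi = \pi_i^+ + \pi_i^-$ can be checked numerically on a discrete coupling: splitting the atoms according to the sign of $x - y$ and integrating the polynomial pieces reproduces $\int |x - y|^p \, d\pi$. A small Python illustration (ours), in dimension $d = 1$:

```python
import numpy as np

# Discrete coupling pi on [0,1]^2: atoms (x_j, y_j) with weights w_j.
rng = np.random.default_rng(0)
xs, ys = rng.random(50), rng.random(50)
w = np.full(50, 1.0 / 50)

p = 3  # odd
plus = xs >= ys          # atoms in A^+ = {x >= y}
minus = ~plus            # atoms in A^- = {x < y}
# Loss computed from the two conditional measures pi^+ = 1_{A+} pi and
# pi^- = 1_{A-} pi: the cost is the polynomial (x-y)^p on A^+ and (y-x)^p on A^-.
split = (np.sum(w[plus] * (xs[plus] - ys[plus])**p)
         + np.sum(w[minus] * (ys[minus] - xs[minus])**p))
direct = np.sum(w * np.abs(xs - ys)**p)
assert abs(split - direct) < 1e-12
# The two pieces recombine into pi: their masses sum to 1.
assert abs(w[plus].sum() + w[minus].sum() - 1.0) < 1e-12
```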

Wasserstein barycenters
A notion that is widely used to approximate measures in Wasserstein spaces is that of barycenters. To define it, let $N \in \mathbb{N}^*$ and let
$$\Sigma_N = \Big\{\lambda \in \mathbb{R}^N : \lambda_i \ge 0, \ \sum_{i=1}^N \lambda_i = 1\Big\}$$
be the simplex in $\mathbb{R}^N$. We say that $\mathrm{Bar}(Y_N, \Lambda_N) \in \mathcal{P}_p(X)$ is a barycenter associated to a given set $Y_N = (\mu_i)_{1 \le i \le N}$ of $N$ probability measures from $\mathcal{P}_p(X)$ and to a given set of weights $\Lambda_N = (\lambda_i)_{1 \le i \le N} \in \Sigma_N$ if
$$\mathrm{Bar}(Y_N, \Lambda_N) \in \arg\min_{\nu \in \mathcal{P}_p(X)} \sum_{i=1}^N \lambda_i W_p(\mu_i, \nu)^p. \qquad (19)$$
Existence and uniqueness of minimizers of (19) has been studied in depth in [1] for the case $p = 2$. It is shown, in particular, that if one of the $\mu_i$ has a density, the barycenter is unique. In the following we assume existence of minimizers. Problem (19) can be written as an optimization problem over measures $\nu \in \mathcal{P}_p(X)$ and coupling measures $\pi_i \in \Pi(X \times X; \nu, \mu_i)$, $1 \le i \le N$. When $p$ is even, this can be equivalently written as a generalized moment problem
$$\inf_{y, y_1, \dots, y_N} \sum_{i=1}^N \lambda_i L(y_i)$$
over sequences that satisfy the constraints $y \in MS(X)$ and
$$y_i \in \Pi_{mom}(X \times X; y, m(\mu_i)), \quad 1 \le i \le N.$$
Note that the unknown $y$ can be eliminated by imposing that all $y_i$ have the same left marginal sequence. When $p$ is odd, the problem (19) is equivalent to a generalized moment problem
$$\inf_{y, (y_{i,j}^+, y_{i,j}^-)} \sum_{i=1}^N \lambda_i \sum_{j=1}^d \Big( \ell_{y_{i,j}^+}((x_j - y_j)^p) + \ell_{y_{i,j}^-}((y_j - x_j)^p) \Big)$$
over sequences satisfying the moment sequence conditions $y \in MS(X)$ and $y_{i,j}^\pm \in MS(\bar A_j^\pm)$, $1 \le i \le N$, $1 \le j \le d$, and the additional constraints $y_{i,j}^+ + y_{i,j}^- = y_{i,k}^+ + y_{i,k}^-$ for all $i \in \{1, \dots, N\}$ and $1 \le j < k \le d$, and $y_{i,j}^+ + y_{i,j}^- \in \Pi_{mom}(X \times X; y, m(\mu_i))$ for all $i \in \{1, \dots, N\}$ and $1 \le j \le d$. Again, the unknown $y$ could be eliminated by imposing that all the sums $y_{i,j}^+ + y_{i,j}^-$ have the same left marginal sequences.

Gromov-Wasserstein discrepancies and barycenters
For some applications such as shape matching or word embedding, an important limitation of classical Wasserstein metrics lies in the fact that they are not invariant to rotations and translations, and more generally to isometries. They are moreover only defined for measures on the same ambient space $X$. To overcome these limitations, several extensions have been proposed (see, e.g., [3]). We focus here on the so-called Gromov-Wasserstein discrepancies on Euclidean spaces, originally introduced in [25], which have recently attracted a lot of attention from practitioners.

Gromov-Wasserstein discrepancies
Given two compact semi-algebraic Borel sets $X \subset \mathbb{R}^{d_X}$ and $Y \subset \mathbb{R}^{d_Y}$, two probability measures $\mu \in \mathcal{P}(X)$ and $\nu \in \mathcal{P}(Y)$, and two cost functions $c_X : X \times X \to \mathbb{R}$ and $c_Y : Y \times Y \to \mathbb{R}$, we define for $p \in \mathbb{N}^*$ a Gromov-Wasserstein discrepancy $GW_p$ between the measures $\mu$ and $\nu$ as
$$GW_p(\mu, \nu)^p = \inf_{\pi \in \Pi(X \times Y; \mu, \nu)} L_{GW_p}(\pi), \qquad (20)$$
where the loss function is
$$L_{GW_p}(\pi) = \int_{X \times Y} \int_{X \times Y} |c_X(x, x') - c_Y(y, y')|^p \, d\pi(x, y) \, d\pi(x', y').$$
Note that this problem is quadratic in $\pi$. It can alternatively be expressed as a linear problem with a rank-one tensor constraint in the augmented space
$$Z = (X \times Y) \times (X \times Y),$$
which can be identified with a basic semi-algebraic set of $\mathbb{R}^{2n}$ with $n = d_X + d_Y$. Using the space $Z$, we can write
$$L_{GW_p}(\pi) = \int_Z |c_X(x, x') - c_Y(y, y')|^p \, d(\pi \otimes \pi)(x, y, x', y'),$$
and we have $m(\pi \otimes \pi) = m(\pi) \otimes m(\pi)$. In the particular case where the cost functions $c_X = \|\cdot - \cdot\|_q^q$ and $c_Y = \|\cdot - \cdot\|_q^q$ are related to $\ell_q$ norms, for some $q \in \mathbb{N}^*$, we denote by $GW_{p,q}(\mu, \nu)$ the corresponding Gromov-Wasserstein discrepancy and by $L_{GW_{p,q}}$ the corresponding loss. Note that the case $GW_{2,2}$ is of particular practical interest. We now distinguish different cases depending on whether the costs $c_X$ and $c_Y$ are polynomials or not.
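For a discrete transport plan $\pi = \sum_j w_j \delta_{(x_j, y_j)}$, the quadratic loss $L_{GW_p}$ reduces to a double sum over atoms, which makes the invariance to isometries easy to verify. The following Python sketch is our own illustration (not the paper's solver):

```python
import numpy as np

def gw_loss(X, Y, w, p=2, q=2):
    """L_GW(pi) = sum_{j,k} w_j w_k |c_X(x_j, x_k) - c_Y(y_j, y_k)|^p for a
    discrete plan pi = sum_j w_j delta_{(x_j, y_j)}, with the costs
    c_X = ||. - .||_q^q and c_Y = ||. - .||_q^q (note the loss is quadratic in pi)."""
    cX = np.sum(np.abs(X[:, None, :] - X[None, :, :])**q, axis=-1)
    cY = np.sum(np.abs(Y[:, None, :] - Y[None, :, :])**q, axis=-1)
    return float(w @ (np.abs(cX - cY)**p) @ w)

# An isometry (here a translation) gives zero GW_{2,2} loss under the
# coupling matching corresponding points, while a dilation does not.
X = np.array([[0.0], [1.0], [3.0]])
w = np.full(3, 1.0 / 3)
assert gw_loss(X, X + 5.0, w) < 1e-12
assert gw_loss(X, 2 * X, w) > 0
```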

Polynomial costs $c_X$ and $c_Y$
Here we consider polynomial costs $c_X$ and $c_Y$. Two cases will again be distinguished.
Case p even. When $p$ is even, the cost $|c_X(x, x') - c_Y(y, y')|^p$ is a polynomial on $Z$. Given polynomial expansions of $c_X$ and $c_Y$, we can deduce a polynomial expansion of $(c_X(x, x') - c_Y(y, y'))^p$. This yields the following expression of the Gromov-Wasserstein loss function in terms of moments:
$$L_{GW_p}(\pi) = L_{GW_p}^{aug}(m(\pi \otimes \pi)) = L_{GW_p}^{aug}(m(\pi) \otimes m(\pi)), \qquad (21)$$
with $L_{GW_p}^{aug} : \mathbb{R}^{\mathbb{N}^{2n}} \to \mathbb{R}$ a linear functional, or $L_{GW_p}(\pi) = L_{GW_p}(m(\pi))$ with $L_{GW_p} : \mathbb{R}^{\mathbb{N}^n} \to \mathbb{R}$ a quadratic functional. When $p$ is even and the costs are polynomials, the Gromov-Wasserstein problem (20) can therefore be expressed as a generalized moment problem with quadratic objective function
$$\inf_{y \in \Pi_{mom}} L_{GW_p}(y),$$
with $\Pi_{mom} = \Pi_{mom}(X \times Y; m(\mu), m(\nu))$ the set of sequences satisfying the moment sequence condition $y \in MS(X \times Y)$ and the marginal conditions, and we easily prove the equivalence between the two problems, following the proof of Theorem 3.1.
Case p odd. When $p$ is odd, we can use a similar strategy as for Wasserstein distances. We introduce two subsets $A^+$ and $A^-$ of $Z$ defined by
$$A^+ = \{z \in Z : g(z) \ge 0\}, \quad A^- = \{z \in Z : g(z) < 0\},$$
with $g(z) = c_X(x, x') - c_Y(y, y')$ for $z = (x, y, x', y')$. The sets are such that $(A^+, A^-)$ forms a partition of $Z$. If $X$ and $Y$ are basic semi-algebraic sets, then the closures $\bar A^+, \bar A^-$ are also basic semi-algebraic sets. For any $\pi \in \mathcal{P}(X \times Y)$, we define two measures
$$\gamma^+ = 1_{A^+} (\pi \otimes \pi), \quad \gamma^- = 1_{A^-} (\pi \otimes \pi). \qquad (22)$$
Since $1_{A^-} + 1_{A^+} = 1_Z$, we can write the Gromov-Wasserstein loss function as
$$L_{GW_p}(\pi) = \int_Z g(z)^p \, d\gamma^+(z) + \int_Z (-g(z))^p \, d\gamma^-(z).$$
Therefore, from Theorem 3.6, we know that problem (20) is equivalent to the following problem over three measures $\pi$, $\gamma^+$, $\gamma^-$ satisfying the constraint (22), or equivalently to the generalized moment problem with objective
$$\ell_{y^+}(g^p) - \ell_{y^-}(g^p) = L_{GW_p}^{aug}(y^+ - y^-),$$
with $L_{GW_p}^{aug}$ defined by (21), and where $y \in \Pi_{mom}(m(\mu), m(\nu))$ satisfies the marginal conditions and the moment sequence condition on $X \times Y$, the sequences $y^+ \in MS(\bar A^+)$ and $y^- \in MS(\bar A^-)$ satisfy the moment sequence condition on $\bar A^+$ and $\bar A^-$ respectively, and the three sequences satisfy the additional quadratic constraint $y^+ + y^- = y \otimes y$, or equivalently $(y^+ + y^-)_{(\alpha, \beta)} = y_\alpha y_\beta$ for all $\alpha, \beta \in \mathbb{N}^n$.

Piecewise polynomial costs $c_X$ and $c_Y$
The case where $c_X$ and $c_Y$ are piecewise polynomial functions can be treated by following the general strategy presented in Section 3.3. Let us briefly discuss the case of $GW_{p,q}$ with $q$ odd, where the cost is $|g(z)|^p$ with
$$g(z) = \|x - x'\|_q^q - \|y - y'\|_q^q.$$
For $p$ even and $q$ odd, a first strategy is to introduce a partition $\{A_\alpha : \alpha \in \{-1, 1\}^{2d}\}$ with $2^{2d}$ elements (taking $d = d_X = d_Y$), where $A_\alpha$ is defined by the sign conditions $\alpha_j (x_j - x'_j) \ge 0$ and $\alpha_{d+j} (y_j - y'_j) \ge 0$ for $j = 1, \dots, d$ (up to the strictness of some inequalities, so that the $A_\alpha$ are disjoint). On each element $A_\alpha$, the cost $g(z)^p$ is a polynomial. Therefore, the problem on a single measure $\pi$ can be reformulated as a problem on $4^d$ measures $\pi_\alpha = 1_{A_\alpha} \pi$. For $p$ odd and $q$ odd, we can introduce a partition $\{A_\alpha^\pm : \alpha \in \{-1, 1\}^{2d}\}$ with $2^{2d+1}$ elements, where $A_\alpha^\pm$ further splits $A_\alpha$ according to the sign of $g(z)$. The initial problem on a measure $\pi$ is then reformulated as a problem on $2^{2d+1}$ measures. With the approach above, the number of measures is exponential in $d$. For $p$ even, in order to reduce the number of measures, an alternative approach is to write the cost as
$$g(z)^p = \Big( \sum_{j=1}^d |x_j - x'_j|^q - \sum_{j=1}^d |y_j - y'_j|^q \Big)^p,$$
which (for $p = 2$) can be expanded and reduced to a sum of $2d^2 + d$ piecewise polynomials, each of these piecewise polynomials being associated with a partition composed of 2 or 4 elements. This yields a reformulation in $O(d^2)$ measures.

Gromov-Wasserstein barycenters
Using the same notations as in Section 4.2, we say that $\mathrm{Bar}(Y_N, \Lambda_N) \in \mathcal{P}(X)$ is a Gromov-Wasserstein barycenter associated to a given set $Y_N = (\mu_i)_{1 \le i \le N}$ of $N$ probability measures in $\mathcal{P}(Y)$ and to a given set of weights $\Lambda_N = (\lambda_i)_{1 \le i \le N}$ in the simplex $\Sigma_N$, if and only if
$$\mathrm{Bar}(Y_N, \Lambda_N) \in \arg\min_{\nu \in \mathcal{P}(X)} \sum_{i=1}^N \lambda_i \, GW_p(\nu, \mu_i)^p. \qquad (23)$$
Existence and uniqueness of minimizers of (23) has been studied in depth in [1] for $GW_{2,2}$. It is shown, in particular, that if one of the $\mu_i$ has a density, the barycenter is unique. In the following we assume existence of minimizers. Problem (23) can be written as a quadratic optimization problem
$$\inf_{\nu, \pi_1, \dots, \pi_N} \sum_{i=1}^N \lambda_i L_{GW_p}(\pi_i)$$
over $\nu \in \mathcal{P}(X)$ and couplings $\pi_i \in \Pi(X \times Y; \nu, \mu_i)$. When $p$ is even and the costs $c_X$ and $c_Y$ are polynomials, this can be equivalently written as a generalized moment problem
$$\inf_{y, y_1, \dots, y_N} \sum_{i=1}^N \lambda_i L_{GW_p}(y_i) \qquad (24)$$
over sequences that satisfy the moment sequence conditions $y \in MS(X)$ and $y_i \in MS(X \times Y)$, $1 \le i \le N$, and the additional constraints $y_i \in \Pi_{mom}(X \times Y; y, m(\mu_i))$, $1 \le i \le N$. When $p$ is odd and the costs $c_X$ and $c_Y$ are polynomials, using the notations of Section 5.1.1, we can introduce additional measures $\gamma_i^+$ and $\gamma_i^-$ supported on $\bar A^+$ and $\bar A^-$ respectively, and the problem is reformulated as
$$\inf_{y, y_1, \dots, y_N, y_1^+, \dots, y_N^+, y_1^-, \dots, y_N^-} \sum_{i=1}^N \lambda_i \big( \ell_{y_i^+}(g^p) - \ell_{y_i^-}(g^p) \big),$$
with the same constraints as before for $y, y_1, \dots, y_N$, and the additional constraints $y_i^+ \in MS(\bar A^+)$, $y_i^- \in MS(\bar A^-)$, and $y_i^+ + y_i^- = y_i \otimes y_i$, $1 \le i \le N$. When the costs $c_X$ and $c_Y$ are piecewise polynomials, e.g. for $GW_{p,q}$ with odd $q$, the problem can still be reformulated as a generalized moment problem, up to the introduction of new measures, following Section 5.1.2. The derivation is rather technical but straightforward.

SoS-moment hierarchy
All OT problems considered in this paper are of the form
$$\inf_{\pi_1 \in \mathcal{M}(X_1)_+, \dots, \pi_M \in \mathcal{M}(X_M)_+} G(\pi_1, \dots, \pi_M) \quad \text{subject to} \quad H_j(\pi_1, \dots, \pi_M) = 0, \ j \in \Gamma, \qquad (25)$$
where $G$ and the $H_j$, $j \in \Gamma$, are linear or quadratic functions of a finite set of moments of the measures $\pi_1, \dots, \pi_M$, and $\Gamma$ is a countable set. The constraints include that $\mathrm{mass}(\pi_i) = m_0(\pi_i) \le 1$. Problem (25) can be equivalently formulated as a generalized moment problem
$$\rho = \inf_{(y_1, \dots, y_M) \in K} G(y_1, \dots, y_M), \qquad (26)$$
where $K$ is the set of sequences $y_1 \in MS(X_1), \dots, y_M \in MS(X_M)$ that satisfy the constraints
$$H_j(y_1, \dots, y_M) = 0, \quad j \in \Gamma,$$
and where the functions $G$ and $H_j$ are linear or quadratic functions involving only finitely many entries of the input sequences $y_1, \dots, y_M$. The constraints include the conditions $(y_i)_0 \le 1$ for all $1 \le i \le M$. The $X_i$ are assumed to be compact semi-algebraic sets defined by
$$X_i = \{x_i \in \mathbb{R}^{n_i} : g_{i,j}(x_i) \ge 0, \ j = 1, \dots, J_i\}$$
for some polynomials $g_{i,j}$ over $\mathbb{R}^{n_i}$, where $g_{i,0}(x_i) = 1$ and $g_{i,1}(x_i) = R - \|x_i\|_2^2$ with $R > 0$. From Theorem 3.4, the moment sequence condition $y_i \in MS(X_i)$ is equivalent to the following set of positive semidefinite constraints:
$$M_r(g_{i,j} y_i) \succeq 0, \quad \forall j \in \{0, \dots, J_i\}, \ i \in \{1, \dots, M\}, \ r \in \mathbb{N}.$$
The matrix $M_r(g_{i,j} y_i)$ depends linearly on the entries $(y_i)_\alpha$ of order $|\alpha| \le r_{i,j} + 2r$, with $r_{i,j} = \lceil \deg(g_{i,j})/2 \rceil$. We assume that $G$ only involves moments of order up to $r_G$, and that the function $H_j$ only involves moments of order up to $r_{H_j}$. Lasserre's (or SoS-moment) approach for solving (25) consists in considering a hierarchy of problems
$$\rho_r = \inf_{(y_1, \dots, y_M) \in K_r} G(y_1, \dots, y_M), \qquad (27)$$
where $K_r$ is the set of sequences $y_1 \in MS_r(X_1), \dots, y_M \in MS_r(X_M)$ that satisfy the constraints
$$H_j(y_1, \dots, y_M) = 0, \quad j \in \Gamma_r,$$
with $\Gamma_r = \{j \in \Gamma : r_{H_j} \le 2r\}$, and where $MS_r(X_i)$ is the set of sequences $y_i$ that satisfy
$$M_{r - r_{i,j}}(g_{i,j} y_i) \succeq 0, \quad \forall j \in \{0, \dots, J_i\}. \qquad (28)$$
Problem (27) is called the relaxation of order $r$ of problem (26). These problems are considered for $r \ge r^* := \max\{\lceil r_G/2 \rceil, \max_{i,j} r_{i,j}\}$. They only involve the entries of $y_1, \dots, y_M$ of order at most $2r$, and can be formulated over $M$ finite-dimensional vectors $y_i^r \in \mathbb{R}^{\mathbb{N}^{n_i}_{2r}}$, $1 \le i \le M$. Then $y_i^r$ can again be considered as an infinite sequence over $\mathbb{N}^{n_i}$ by completion with zeros.

Theorem 6.1. Problem (27) admits a solution for all $r \ge r^*$. The sequence $(\rho_r)_{r \ge r^*}$ is increasing and $\rho_r \to \rho$ as $r \to \infty$. Moreover, from a sequence of solutions $(y_1^r, \dots, y_M^r)$ of problems (27), we can extract a subsequence $(y_1^{r_k}, \dots, y_M^{r_k})$ such that for each $1 \le i \le M$ and each $\alpha \in \mathbb{N}^{n_i}$, $(y_i^{r_k})_\alpha \to (y_i)_\alpha$ as $k \to \infty$, where the set of sequences $(y_1, \dots, y_M)$ is a solution of problem (26) and admits a representing measure $(\pi_1, \dots, \pi_M)$ solution of (25). If (26) (or equivalently (25)) admits a unique solution, then we have the convergence of the whole sequence $(y_i^r)_\alpha$ to $(y_i)_\alpha$ as $r \to \infty$, for all $\alpha \in \mathbb{N}^{n_i}$, where $(y_1, \dots, y_M)$ is the solution of (26).

Lemma 6.2. Let $r \in \mathbb{N}$ and consider a sequence $y \in \mathbb{R}^{\mathbb{N}^n}$. If $M_r(y) \succeq 0$ and $M_{r-1}(g y) \succeq 0$ with $g(x) = R^2 - \|x\|_2^2$, then for all $0 \le k \le r$,
$$\max_{|\alpha| \le 2k} |y_\alpha| \le \max\{1, R^{2k}\} \, y_0.$$
Proof. Since $M_r(y) \succeq 0$ implies $M_k(y) \succeq 0$ for all $0 \le k \le r$, we deduce from [20, Prop. 3.6] that
$$\max_{|\alpha| \le 2k} |y_\alpha| \le \max\Big\{ y_0, \max_{1 \le i \le n} \ell_y(x_i^{2k}) \Big\}. \qquad (29)$$
Moreover, $M_{r-1}(g y) \succeq 0$ implies $\ell_y(x_i^{2(k-1)} g) \ge 0$ for $1 \le k \le r$, hence $\ell_y(x_i^{2k}) \le R^2 \ell_y(x_i^{2(k-1)}) \le \dots \le R^{2k} y_0$. Then $\max_{1 \le i \le n} \ell_y(x_i^{2k}) \le y_0 R^{2k}$, and we conclude by using (29).
Proof of Theorem 6.1. The proof is adapted from the proof of [20, Theorem 4.3]. We detail it for the sake of completeness. Clearly, $K_r \supset K_{r+1} \supset \dots \supset K$ for all $r \ge r^*$, so that $\rho_r$ is increasing with $r$ and $\rho_r \le \rho$. For all $1 \le i \le M$, we have $(y_i^r)_0 \le 1$ and $g_{i,0} = 1$. Then from the constraints (28) and Lemma 6.2, we deduce that $|(y_i^r)_\alpha| \le \tau_k$ for all $|\alpha| \le 2k$, with $\tau_k = \max\{1, R^{2k}\}$. We deduce that $K_r$ is a compact set of a finite dimensional space, and from the continuity of $G$ and $H_j$, $j \in \Gamma_r$, that (27) admits a solution $y^r = (y_1^r, \dots, y_M^r)$. Now we identify each $y_i^r$ with a sequence over $\mathbb{N}^{n_i}$ with components $(y_i^r)_\alpha = 0$ for $|\alpha| > 2r$. We introduce sequences $\hat y_i^r$ defined by $(\hat y_i^r)_\alpha := (y_i^r)_\alpha / \tau_{\omega(\alpha)}$, which are such that $\|\hat y_i^r\|_\infty \le 1$. Since $\ell_\infty$ is the topological dual of $\ell_1$, we have by the Banach-Alaoglu theorem that the unit ball $B_1(\ell_\infty)$ of $\ell_\infty$ is compact in the weak-* topology $\sigma(\ell_\infty, \ell_1)$. Therefore, we can extract a subsequence $(\hat y_i^{r_k})_{k \ge 1}$ of $(\hat y_i^r)_{r \ge r^*}$ which converges to some $\hat y_i \in B_1(\ell_\infty)$ in the weak-* topology. In particular, this implies that for all fixed $\alpha \in \mathbb{N}^{n_i}$, $(\hat y_i^{r_k})_\alpha \to (\hat y_i)_\alpha$, and hence $(y_i^{r_k})_\alpha \to (y_i)_\alpha := \tau_{\omega(\alpha)} (\hat y_i)_\alpha$, as $k \to \infty$. Since the function $G(y_1, \dots, y_M)$ only depends continuously on a finite set of variables, $G(y_1^{r_k}, \dots, y_M^{r_k}) \to G(y_1, \dots, y_M)$. Also, for a fixed $j \in \Gamma$, since $H_j$ only depends continuously on a finite set of variables, the corresponding constraints pass to the limit, and from the closedness of the cone of symmetric positive semidefinite matrices, we deduce that $M_m(g_{i,j}\, y_i) = \lim_{k \to \infty} M_m(g_{i,j}\, y_i^{r_k}) \succeq 0$ for all $m$. Hence $(y_1, \dots, y_M) \in K$ and $G(y_1, \dots, y_M) = \lim_{k \to \infty} \rho_{r_k} \le \rho$, which proves that $(y_1, \dots, y_M)$ is a solution of (26). Since $\rho_r$ is increasing, this implies that the whole sequence $\rho_r$ converges to $\rho$ as $r \to \infty$. If the solution of (26) is unique, then from every subsequence of $((y_i^r)_\alpha)_{r \ge r^*}$ we can extract a further subsequence converging to the same limit $(y_i)_\alpha$, which implies the convergence of the whole sequence.

Post-processing
Here we consider the post-processing of the solution of the SoS-moment approach. From the solution of problem (27) of order $r$, we obtain an approximation $y^r$ of the moments $y = m(\mu)$ (up to order $2r$) of some probability measure of interest $\mu$ over a basic semi-algebraic set $X \subset \mathbb{R}^n$, which is the target solution of the initial OT problem. We assume here that $\mu$ is the unique solution of the initial OT problem. By Theorem 6.1, $y_\alpha^r$ converges to $m_\alpha(\mu)$ as $r \to \infty$, for each $\alpha \in \mathbb{N}^n$.

Approximation of linear quantities of interest
From the approximate moments, we directly obtain an estimation of the first statistics of $\mu$ and its marginals (mean, variance, covariance, ...) or, more generally, of any quantity of interest $I(g) = \int_X g \, d\mu$ with $g$ a polynomial of degree at most $2r$, estimated by $I^r = y^r(g)$, and we have that $I^r \to I(g)$ as $r \to \infty$. For a function $g$ which is not a polynomial, the quantity $I$ can be approximated by $I^{r,p} = y^r(g_p)$, where $g_p = \sum_{|\alpha| \le p} c_\alpha x^\alpha \in \mathbb{R}[x]_p$ is a polynomial approximation of $g$; the error then depends both on the error on the moments and on the error of approximation of $g$ by $g_p$. We have that $I^{r,p}$ converges to $I$ as $r, p \to \infty$. Studying the rate of convergence of $I^{r,p}$ to $I$ requires additional information on the convergence of $g_p$ and on the convergence of the approximate moments.
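The estimate $I^{r,p} = y^r(g_p) = \sum_{|\alpha| \le p} c_\alpha\, y^r_\alpha$ is a plain inner product between the coefficients of $g_p$ and the (approximate) moments. A minimal Python sketch, assuming exact moments of the uniform measure on $[0,1]$ and a truncated Taylor expansion of $g(x) = e^x$:

```python
import math
import numpy as np

def estimate_qoi(moments, coeffs):
    """I^{r,p} = y^r(g_p) = sum_k c_k y_k for g_p(x) = sum_k c_k x^k."""
    return float(np.dot(coeffs, moments[:len(coeffs)]))

# Moments of the uniform measure on [0, 1]: y_k = 1/(k+1)
p = 10
y = np.array([1.0 / (k + 1) for k in range(p + 1)])
# Degree-p Taylor polynomial of g(x) = exp(x): c_k = 1/k!
c = np.array([1.0 / math.factorial(k) for k in range(p + 1)])
I_rp = estimate_qoi(y, c)
print(I_rp, math.e - 1)  # close to the exact value: ∫_0^1 exp(x) dx = e - 1
```

With approximate moments $y^r$ in place of the exact ones, the same one-liner applies; the accuracy then combines the moment error and the truncation error of $g_p$.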

Approximation of the support of µ
Here, we show how to estimate the support $S(\mu)$ of $\mu$ from an approximation of its moments, using the Christoffel function. Note that $S(\mu)$ is contained in the basic semi-algebraic set $X$. This methodology was originally proposed in [22]. It is presented and analysed in [27,34] in a statistical setting. For $r \in \mathbb{N}$, we denote by $\Pi_r^n = \mathbb{R}[x]_r$ the space of polynomials over $\mathbb{R}^n$ of degree at most $r$. We let $\phi_r(x) = (x^\alpha)_{\alpha \in \mathbb{N}_r^n} \in \mathbb{R}^{s(r)}$ be the vector of monomials of degree at most $r$, with $s(r) := \binom{n+r}{r} = \#\mathbb{N}_r^n = \dim \Pi_r^n$. For any $r \in \mathbb{N}$, the moment matrix $M_r(\mu) \in \mathbb{R}^{s(r) \times s(r)}$ of $\mu$, with moments up to order $2r$, is given by $[M_r(\mu)]_{\alpha,\beta} = \int_X x^{\alpha+\beta} \, d\mu(x)$, which is the Gram matrix in $L^2_\mu(X)$ of the canonical basis of $\Pi_r^n$: for two polynomials $g(x) = \phi_r(x)^T a$ and $h(x) = \phi_r(x)^T b$ in $\mathbb{R}[x]_r$ with coefficients $a, b \in \mathbb{R}^{s(r)}$, we have $a^T M_r(\mu) b = \int_X g(x) h(x) \, d\mu(x)$, which is the inner product in $L^2_\mu(X)$. In practice, an approximation of this moment matrix can be obtained from the solution $y^r$ of a relaxation of order $r$, or from a solution $y^{r'}$ of higher order $r' \ge r$ in order to get a better estimation.
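In one dimension the moment matrix is a Hankel matrix, $[M_r(\mu)]_{i,j} = y_{i+j}$; for the uniform measure on $[0,1]$ it is the Hilbert matrix. A short Python sketch assembling it and checking the Gram property (function names are ours):

```python
import numpy as np

def moment_matrix_1d(y, r):
    """M_r(mu) in 1D: [M_r]_{i,j} = y_{i+j} = ∫ x^{i+j} dmu(x) (Hankel matrix)."""
    return np.array([[y[i + j] for j in range(r + 1)] for i in range(r + 1)])

r = 3
# Moments of the uniform measure on [0, 1]: y_k = 1/(k+1) -> Hilbert matrix
y = [1.0 / (k + 1) for k in range(2 * r + 1)]
M = moment_matrix_1d(y, r)

# Gram property: for g = phi^T a and h = phi^T b, a^T M b = ∫ g h dmu
a = np.array([1.0, -1.0, 0.0, 0.0])   # g(x) = 1 - x
b = np.array([0.0, 1.0, 0.0, 0.0])    # h(x) = x
print(a @ M @ b)  # ∫_0^1 (1 - x) x dx = 1/6
```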
Non-degenerate case: Let us first consider the case where $S(\mu)$ is not contained in a proper real algebraic subset of $X$; in other words, for any polynomial $p \in \mathbb{R}[x]$, $\int_X p(x)^2 \, d\mu(x) = 0$ if and only if $p = 0$. This is the case when $S(\mu)$ has nonzero Lebesgue measure. Then $M_r(\mu)$ is invertible, and the finite-dimensional space $\Pi_r^n$ of polynomials of degree at most $r$ is a reproducing kernel Hilbert space in $L^2_\mu$, whose kernel, called the Christoffel-Darboux kernel, is given for $x, y \in \mathbb{R}^n$ by (see [13]) $\kappa_{\mu,r}(x,y) = \sum_{i=1}^{s(r)} \varphi_i(x) \varphi_i(y)$, where $(\varphi_1, \dots, \varphi_{s(r)})$ is some orthonormal basis of $\Pi_r^n$. It can also be written $\kappa_{\mu,r}(x,y) = \phi_r(x)^T M_r(\mu)^{-1} \phi_r(y)$. The Christoffel function $\Lambda_{\mu,r}$ is defined for $x \in \mathbb{R}^n$ by $\Lambda_{\mu,r}(x) = \inf\{\int_X p(y)^2 \, d\mu(y) : p \in \Pi_r^n, \ p(x) = 1\}$.
In the present regular case, we have $\Lambda_{\mu,r}(x) = \kappa_{\mu,r}(x,x)^{-1}$ for all $x$. The support is then approximated by the set $S_r(\mu) = \{x \in X : \Lambda_{\mu,r}(x) \ge \gamma_r\}$ for some suitably chosen $\gamma_r$. Since $\Lambda_{\mu,r}(x) \ge \gamma_r$ is equivalent to the polynomial inequality $\kappa_{\mu,r}(x,x) \le \gamma_r^{-1}$, $S_r(\mu)$ is a polynomial sublevel set in $X$. From the Markov inequality, we have that $\mu(\{x : \kappa_{\mu,r}(x,x) > \gamma_r^{-1}\}) \le \gamma_r \int_X \kappa_{\mu,r}(x,x)\, d\mu(x) = \gamma_r\, s(r)$. Therefore, by choosing $\gamma_r = \eta / s(r)$, we guarantee that $\mu(S_r(\mu)) \ge 1 - \eta$, that is, $S_r(\mu)$ contains a fraction $1 - \eta$ of the mass of $\mu$.
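The construction above is short to implement. The following Python sketch (our own toy example, not the paper's Matlab code) takes $\mu$ uniform on $[0, 1/2] \subset X = [0,1]$, evaluates $\Lambda_{\mu,r}(x) = (\phi_r(x)^T M_r(\mu)^{-1} \phi_r(x))^{-1}$, and thresholds it at $\gamma_r = \eta/s(r)$:

```python
import numpy as np

def christoffel(y, r):
    """Return x -> Lambda_{mu,r}(x) = 1 / (phi_r(x)^T M_r(mu)^{-1} phi_r(x))."""
    M = np.array([[y[i + j] for j in range(r + 1)] for i in range(r + 1)])
    Minv = np.linalg.inv(M)
    def Lam(x):
        phi = np.array([x**k for k in range(r + 1)])
        return 1.0 / float(phi @ Minv @ phi)
    return Lam

# mu uniform on [0, 1/2] inside X = [0, 1]: y_k = (1/2)^k / (k+1)
r = 4
y = [0.5**k / (k + 1) for k in range(2 * r + 1)]
Lam = christoffel(y, r)

# Lambda is large on the support and much smaller outside it
print(Lam(0.25), Lam(0.9))

# Threshold gamma_r = eta / s(r), with s(r) = r + 1 in dimension 1
eta = 0.1
gamma_r = eta / (r + 1)
print(Lam(0.25) >= gamma_r, Lam(0.9) >= gamma_r)
```

Here the interior point $x = 0.25$ is kept by the threshold while $x = 0.9$, outside the support, is rejected. For larger $r$ and in higher dimension, the canonical monomial basis makes $M_r(\mu)$ ill-conditioned, so an orthogonal basis is preferable in practice.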
When the measure is absolutely continuous with respect to the Lebesgue measure $\lambda$, it is proven in [21] that $S_r(\mu)$, with a suitable choice of the sequence $\gamma_r$, converges to $S(\mu)$ in the Hausdorff distance. Also, for a point $x \notin S(\mu)$, $\Lambda_{\mu,r}(x)^{-1}$ grows exponentially with $r$, while for a point $x \in S(\mu)$ it only grows polynomially. A heuristic approach then consists in estimating the growth rate from several values of $r$ in order to decide whether $x$ is in the support or not.

Singular case:
We now consider the case where the support of $\mu$ is contained in a proper algebraic set, which results in a singular moment matrix $M_r(\mu)$, and we follow [27] for the definition of an approximate support.
We let $V$ be the Zariski closure of $S(\mu)$, which is the smallest algebraic set containing $S(\mu)$. We denote by $I_r$ the ideal of polynomials in $\Pi_r^n$ that vanish on $V$, which is the set of polynomials $p \in \Pi_r^n$ satisfying $\int p(x)^2 \, d\mu = 0$. The quotient space $\Pi_r^n / I_r$ is a reproducing kernel Hilbert space in $L^2_\mu(X)$ with dimension $r' = \operatorname{rank}(M_r(\mu))$, with kernel $\kappa_{\mu,r}(x,y) = \sum_{i=1}^{r'} \varphi_i(x) \varphi_i(y)$, where $\varphi_1, \dots, \varphi_{r'}$ is an orthonormal basis of $\Pi_r^n / I_r$ in $L^2_\mu$. This kernel can be obtained as $\kappa_{\mu,r}(x,y) = \phi_r(x)^T M_r(\mu)^\dagger \phi_r(y)$, with $M_r(\mu)^\dagger$ the Moore-Penrose pseudo-inverse of $M_r(\mu)$ (of rank $r'$), which can be expressed as $M_r(\mu)^\dagger = \sum_{i=1}^{r'} \lambda_i^{-1} v_i v_i^T$ with orthonormal eigenvectors $v_i$ and corresponding nonzero eigenvalues $\lambda_i$ of $M_r(\mu)$.
We still have $\Lambda_{\mu,r}(x) = \kappa_{\mu,r}(x,x)^{-1}$ for all $x \in V$, but for $x \notin V$ the functions $\Lambda_{\mu,r}(x)$ and $\kappa_{\mu,r}(x,x)^{-1}$ differ, which yields two possible definitions of an approximate support $S_r(\mu)$, using either $\Lambda_{\mu,r}(x)$ or $\kappa_{\mu,r}(x,x)^{-1}$, that is, either (30) or its counterpart with $\kappa_{\mu,r}(x,x)^{-1}$ in place of $\Lambda_{\mu,r}(x)$. Practical aspects: The functions $\kappa_{\mu,r}$ and $\Lambda_{\mu,r}$ are functions of the moment matrix $M_r(\mu)$ of the true measure $\mu$. In practice, the measure $\mu$ is replaced by the approximation $\mu^r$ obtained from a relaxation of order $r$, which yields approximate functions $\kappa_{\mu^r,r}$ and $\Lambda_{\mu^r,r}$ and corresponding approximate supports $S_r := S_r(\mu^r)$. Note that for fixed $r$, $\mu$ could also be replaced by the solution $\mu^{r'}$ of a relaxation of higher order $r' \ge r$. A quantitative analysis of the resulting approximation is still missing.
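The singular case can be illustrated with a purely atomic measure, whose Zariski closure is a finite point set. The Python sketch below (a toy example of ours) builds the rank-deficient moment matrix of $\mu = \tfrac12(\delta_a + \delta_b)$ and evaluates the kernel through the Moore-Penrose pseudo-inverse; on the support, $\kappa_{\mu,r}(x,x)^{-1} = \Lambda_{\mu,r}(x) = \mu(\{x\}) = 1/2$:

```python
import numpy as np

# Two-point measure mu = (delta_a + delta_b)/2: the moment matrix is singular
a, b, r = 0.2, 0.8, 2
phi = lambda x: np.array([x**k for k in range(r + 1)])
M = 0.5 * (np.outer(phi(a), phi(a)) + np.outer(phi(b), phi(b)))

print(np.linalg.matrix_rank(M))  # 2 < s(r) = 3: singular moment matrix

# Christoffel-Darboux kernel via the Moore-Penrose pseudo-inverse
Mdag = np.linalg.pinv(M)
kappa = lambda x, z: float(phi(x) @ Mdag @ phi(z))

# On the support V = {a, b}: kappa(x, x)^{-1} = Lambda(x) = mu({x}) = 1/2
print(1.0 / kappa(a, a), 1.0 / kappa(b, b))
```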

Approximation of the density
If the measure $\mu$ admits a density $f$ with respect to a known measure $\nu$ on $X$, i.e., $d\mu(x) = f(x) \, d\nu(x)$, then the Christoffel function can also be used to estimate the density on the support $S(\mu)$ (or its estimation), as suggested in [21]. Also, the values $y_\alpha^r$ provide approximations of the moments $m_\alpha(\mu) = \int_X x^\alpha f(x) \, d\nu(x)$, that is, the inner product of $f$ and $x^\alpha$ in $L^2_\nu(X)$. Different types of approximations of $f$ can be obtained from this information. In particular, a polynomial approximation $f_{r,p} = \sum_{|\beta| \le p} a_\beta x^\beta$ of $f$, $p \le 2r$, can be obtained by solving a weighted least-squares problem whose normal equations involve the Gram matrix $G_\nu$ in $L^2_\nu$ with entries $(G_\nu)_{\alpha,\beta} = m_{\alpha+\beta}(\nu) = \int_X x^\alpha x^\beta \, d\nu(x)$. From a computational point of view, the use of the canonical polynomial basis may lead to numerical instabilities and high round-off errors, so other polynomial bases should be preferred. Some reformulations of the initial OT problem yield approximations of the moments of the measures $\mathbf{1}_{A_k}\, \mu$, where the $A_k$ form a partition of $X$. In this case, a local polynomial approximation of the density on each $A_k$ can be computed, which results in a global piecewise polynomial approximation of $f$ over $X$.
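The least-squares recovery reduces to solving the Gram system $G_\nu a = (y_\alpha)_{|\alpha| \le p}$. A minimal Python sketch, assuming $\nu$ is the Lebesgue measure on $[0,1]$ (so that $G_\nu$ is the Hilbert matrix) and exact moments of the density $f(x) = 2x$:

```python
import numpy as np

# nu = Lebesgue measure on [0, 1]; canonical Gram matrix is the Hilbert matrix
p = 2
G = np.array([[1.0 / (i + j + 1) for j in range(p + 1)] for i in range(p + 1)])

# Moments of mu with density f(x) = 2x w.r.t. nu: y_k = ∫_0^1 x^k (2x) dx = 2/(k+2)
y = np.array([2.0 / (k + 2) for k in range(p + 1)])

# Normal equations G a = y give the coefficients of the polynomial approximation
coeffs = np.linalg.solve(G, y)
print(coeffs)  # close to [0, 2, 0]: recovers f(x) = 2x exactly
```

With noisy moments $y^r$ and larger $p$, the canonical Gram matrix becomes severely ill-conditioned (it is a Hilbert matrix), which is precisely why the text recommends other polynomial bases.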

Wasserstein distances and barycenters
The code used to generate the examples shown here is available at https://gitlab.tue.nl/data-driven/sos-ot. For our numerical tests, we consider cartoon images as displayed in Figure 1. The images are 400 × 400 pixels, but we see each of them as a uniform measure $\mu$ on a subset $S \subset [0,1]^2$, i.e., $d\mu = \lambda(S)^{-1} \mathbf{1}_S \, dx$. The shape and location of the support $S$ varies for each image. We consider three different types of shapes: smileys, stars, and pacmen. We start by estimating the $W_1$ and $W_2$ distances of two translated smileys $\mu_1, \mu_2$ for which we know the exact translation vector $T = (t_1, t_2) \in \mathbb{R}^2$ (see Figure 2a). In this simple case, we know that the exact distance is given by the Euclidean norm $\|T\|_2$ of the translation vector, so we can validate the accuracy of our moment approach. Figure 2b shows the relative errors in the estimation as a function of the relaxation order $r$. We obtain an extremely high accuracy for all relaxation orders; the high accuracy obtained with only the first order is particularly remarkable. Figure 2c shows an exponential increase in the runtime as a function of $r$, which is to be expected given that the number of unknown moments to estimate grows exponentially with $r$. Repeating the same experiment with translated stars and translated pacmen yields similar results, with very high accuracy from the first relaxation order $r = 1$ onward.

W 2 -Wasserstein barycenters
We now turn to the computation of barycenters. We present the following test for validation purposes: consider the four translated smileys of Figure 3a. We know that the $W_2$ barycenter of these measures with uniform weights $(0.25, 0.25, 0.25, 0.25)$ is the smiley located at the center of the other images. Our goal is to study how accurately we can recover that target barycenter with our moment approach.
Since we know the exact barycenter, we also know its exact moments, so we start by examining how accurately they are estimated. Figure 3b shows the relative error in the computation of the first moments as a function of $r$. Figure 3c reports the maximum absolute error in the moment estimation for each order $r$. We observe that the absolute error decays relatively quickly (we gain about half an order of magnitude per relaxation order). Similar observations hold for relative errors. It would be interesting to examine the trend for larger orders, but this has not been possible with the current implementation due to conditioning issues, and also due to the use of Mosek as a black-box optimization solver (which prevented us from sparsifying certain variables and operations, which are critical to prevent memory overflows when the complexity grows). We leave this implementation point for a future contribution in which we will also explore strategies to solve optimal transport problems in high dimension.
From the moments that our approach provides, we can reconstruct the support of the barycenter by computing the Christoffel function and applying the thresholding techniques discussed in Section 7.2. Figure 4a shows the Christoffel function for increasing relaxation orders $r$. The figure also depicts the support obtained by thresholding this function with parameter $\gamma_r = 0.3$ for all relaxation orders $r$. Note how the estimation of the support improves as $r$ grows: for $r = 4$ and $r = 5$, it is possible to "discover" that the measure has several non-connected components, such as the mouth and the eyes.
We can repeat the same experiment replacing the smileys with stars or pacmen. In this case, we obtain results very similar to those of Figure 3, so we do not include them for the sake of brevity. We do, however, plot the Christoffel function and the support obtained after thresholding (see Figures 4b and 4c). We observe that order $r = 3$ is already enough to discover that the star has five corners. In the case of the pacman, the method finds a fairly good estimation of the support for $r = 4$ and $r = 5$, but only a coarse approximation of the mouth (approximating this part better would require higher relaxation orders).

Gromov-Wasserstein discrepancy and barycenters
Here we illustrate the computation of Gromov-Wasserstein discrepancies and barycenters. For our numerical tests, we consider empirical measures $\mu_1$ to $\mu_4$ associated with happy and sad smileys, see Figure 5. Each measure corresponds to 1000 independent samples from a mixture of three uniform measures with equal weights $1/3$, the first two measures being supported on the eyes, the third having the mouth as support. The mouth is here an algebraic set with zero Lebesgue measure. Measure $\mu_2$ (resp. $\mu_4$) is the push-forward of $\mu_1$ (resp. $\mu_3$) by an isometry, so that $GW_{2,2}^2(\mu_1, \mu_2) = GW_{2,2}^2(\mu_3, \mu_4) = 0$. In this section, for the formulation of moment problems, we relied on the Matlab libraries tensap [26] and GloptiPoly [18].

Gromov-Wasserstein discrepancy GW 2,2
We here illustrate the estimation of the discrepancies $GW_{2,2}(\mu_i, \mu_j)$. For a given relaxation order $r$, we initialize the truncated moment sequence $y^{(0)}$ with the truncated moments $m(\mu_i \otimes \mu_j)$ of the product measure $\mu_i \otimes \mu_j$. Then we construct a sequence of truncated moments $y^{(k)}$, $k \ge 1$, by a fixed-point algorithm, where $y^{(k)}$ minimizes $y \mapsto L^{\mathrm{aug}}_{GW_{2,2}}(y \otimes y^{(k-1)})$ over the truncated moment sequences $y$ satisfying the moment sequence condition and marginal constraints. As shown in Figure 6, for a given relaxation order the fixed-point algorithm converges rapidly, after roughly 4 to 5 iterations. Note that for $(i,j)$ equal to $(1,2)$ and $(3,4)$, the objective function converges to a plateau of order $10^{-13}$, very close to zero in double precision.
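The structure of this fixed-point scheme can be illustrated on a toy discrete analogue. The Python sketch below is *not* the paper's moment-SDP algorithm (where each step is a semidefinite program solved with Mosek); instead it minimizes a quadratic objective $y^T Q y$ over probability vectors by repeatedly minimizing the partially frozen bilinear form $y \mapsto y^T Q y^{(k-1)}$, which over the simplex is solved in closed form by a vertex. With symmetric $Q$, the bilinear values are nonincreasing, mimicking the monotone decrease observed in Figure 6. All names are illustrative.

```python
import numpy as np

def fixed_point_bilinear(Q, y0, iters=20):
    """Toy analogue of the fixed-point scheme: at each step, minimize the
    linearized objective y -> y^T Q y_prev over the probability simplex,
    whose minimizer is the vertex with the smallest entry of Q y_prev."""
    y_prev, values = y0, []
    for _ in range(iters):
        grad = Q @ y_prev
        y = np.zeros_like(y_prev)
        y[int(np.argmin(grad))] = 1.0   # vertex minimizer of a linear function
        values.append(float(y @ Q @ y_prev))
        y_prev = y
    return y_prev, values

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
Q = A @ A.T                       # symmetric, so the bilinear values are monotone
y0 = np.full(6, 1.0 / 6)
y, values = fixed_point_bilinear(Q, y0)
print(values[:5])  # nonincreasing sequence of bilinear objective values
```

In the moment formulation, the inner step keeps the same "freeze one factor, minimize over the other" structure, but the feasible set is defined by moment-matrix PSD constraints and marginal constraints rather than a simplex.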

Gromov-Wasserstein barycenters GW 2,2
We now turn to the computation of Gromov-Wasserstein barycenters, using the discrepancy $GW_{2,2}$. We consider the computation of the barycenters of the empirical measures $\mu_1$ and $\mu_2$ illustrated in Figure 5. The experiments here are for illustrative purposes. For a given relaxation order $r$, we have to solve the optimization problem (24) over truncated sequences $y$, $y_1$ and $y_2$, where $y_1$ has as marginals $y$ and the truncated moments of $\mu_1$, and $y_2$ has as marginals $y$ and the truncated moments of $\mu_2$. The objective functional is a weighted combination (with weights $\lambda$ and $1 - \lambda$) of the two discrepancy terms, and it is again minimized by a fixed-point algorithm over truncated sequences satisfying marginal constraints and moment sequence conditions. For the initialization of $y^{(0)}$, we take the truncated moments of either $\mu_1$ or $\mu_2$ (depending on the value of $\lambda$), and for $y_1^{(0)}$ (resp. $y_2^{(0)}$), we take the tensor product of $y^{(0)}$ and the truncated moments of $\mu_1$ (resp. $\mu_2$). This algorithm converges rather slowly and should clearly be improved; however, it allows us to illustrate the potential of the proposed approach. The results are given at iteration 100.
For each $k \in \mathbb{N}^{2d}$ with $|k| = p$, we introduce a partition adapted to the piecewise polynomial $p_k(z) := \prod_{i=1}^d |x_i - x_i'|^{q k_i} \prod_{i=1}^d |y_i - y_i'|^{q k_{i+d}}$, and as many measures as the number of elements in the partition. To a polynomial $p_k(z)$ is associated a partition composed of at most $2^{m_k}$ elements, with $m_k \le p$ the number of odd entries in $k$. This yields a reformulation with a number of measures bounded by $2^p \binom{2d+p}{2d} = O(d^p)$. As an example, for $p = 2$, the cost expands as $\sum_{i=1}^d \sum_{j=1}^d |x_i - x_i'| |x_j - x_j'| - 2 \sum_{i=1}^d \sum_{j=1}^d |x_i - x_i'| |y_j - y_j'| + \sum_{i=1}^d \sum_{j=1}^d |y_i - y_i'| |y_j - y_j'|$.

Figure 1: Sample images considered for the tests.
Figure 3: (a) The barycenter of the four smileys in the corners with uniform weights is equal to the smiley in the center. (b) Relative error in the estimation of the first moments (absolute errors give a similar plot). (c) Maximum absolute error in the moment estimation (relative errors give a similar plot).

Figure 4: Estimation of the support of the barycenter. For each subfigure: Christoffel function (top) and support after thresholding (bottom).