1 Introduction

Optimal transport provides a principled and versatile approach to working with probability distributions. In recent years, a growing body of theoretical results has been leveraged to build numerical solvers, which by now play a fundamental role in numerous applications ranging from economics [1, 2] and quantum chemistry [3, 4] to gradient flow modeling [5] and machine learning [6].

The prototypical example is the two-marginal Monge-Kantorovich problem: given two Borel sets \(\mathcal {X}_1\subset \mathbb {R}^{n_1}\) and \(\mathcal {X}_2\subset \mathbb {R}^{n_2}\) and two probability measures \(\mu \) on \(\mathcal {X}_1\) and \(\nu \) on \(\mathcal {X}_2\), solve

$$\begin{aligned} \inf \biggl \{ \int _{\mathcal {X}_1\times \mathcal {X}_2} c(\textbf{x}_1,\textbf{x}_2) \mathrm d\pi (\textbf{x}_1,\textbf{x}_2) \; :\;\pi \in \Pi (\mathcal {X}_1 \times \mathcal {X}_2; \mu , \nu ) \biggr \} \end{aligned}$$
(1.1)

where \(c:\mathcal {X}_1\times \mathcal {X}_2\rightarrow \mathbb {R}^+\) is a lower semi-continuous cost function. The infimum runs over the set \( \Pi (\mathcal {X}_1 \times \mathcal {X}_2; \mu , \nu ) \) of coupling measures (usually called transport plans) on \(\mathcal {X}_1\times \mathcal {X}_2\) with marginal distributions equal to \(\mu \) and \(\nu \) respectively. More generally, the problem can be posed on Polish spaces, and it admits a multi-marginal version, which we introduce in Sect. 3.
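Many of the numerical approaches discussed below address the fully discrete counterpart of (1.1), which is a finite linear program. The following minimal sketch (with illustrative grids and weights, not taken from the paper) sets it up with scipy:

```python
# Discrete two-marginal OT as a linear program (illustrative sketch).
import numpy as np
from scipy.optimize import linprog

x1 = np.linspace(0.0, 1.0, 5)             # support points of mu on X1 = [0, 1]
x2 = np.linspace(0.0, 1.0, 4)             # support points of nu on X2 = [0, 1]
mu = np.full(5, 1 / 5)                    # uniform marginal weights
nu = np.full(4, 1 / 4)
C = (x1[:, None] - x2[None, :]) ** 2      # cost c(x1, x2) = |x1 - x2|^2

# Equality constraints: row sums of the plan equal mu, column sums equal nu.
A_eq, b_eq = [], []
for i in range(5):
    row = np.zeros((5, 4)); row[i, :] = 1.0
    A_eq.append(row.ravel()); b_eq.append(mu[i])
for j in range(4):
    col = np.zeros((5, 4)); col[:, j] = 1.0
    A_eq.append(col.ravel()); b_eq.append(nu[j])

res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, None))
print(res.fun)    # optimal discrete transport cost; plan is res.x.reshape(5, 4)
```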

Numerous methods have been introduced for solving such problems in practice. Many algorithms rely on an approximation of measures by discrete measures (by sampling or discretization on discrete grids), whose numerical cost quickly becomes prohibitive as the number of discretization points increases. Among the existing strategies to mitigate this effect, probably the most popular one adds an entropic regularization to the loss function; the regularized problem is then solved with the Sinkhorn algorithm [6,7,8]. The algorithm is, however, still posed on a grid. Other approaches involving discrete grids are the auction algorithm [9], numerical methods based on Laguerre cells [10], multiscale algorithms [11, 12] and methods based on dynamic formulations [13]. Recently, an approach that dynamically discovers sampling points where the support of the solution measure lies has been introduced in [14]. It provides promising results to address the curse of dimensionality when the optimal transport plans have sparse support.

The present paper considers an entirely different avenue, where the spatial discretization is replaced by a spectral one: the transport plan is represented through its moments on a given basis, and the marginal constraints are expressed as moment constraints. A practical computation then consists in truncating the involved moment sequences up to a certain order. For the procedure to be well posed, one needs to guarantee convergence to the original problem as the truncation order increases.

Among the applications in which working with moments of measures is particularly relevant, we can first mention uncertainty quantification and sensitivity analysis [15, 16], where the first moments of probability distributions are of central interest. First moments of measures can also be used for the optimal design of experiments for polynomial regression [17]. Moreover, relevant geometrical and topological information on the support of a measure can be efficiently captured by its first moments, which opens the way to many applications in data analysis [18, 19]. We may also mention problems connected with Partial Differential Equations: PDEs arising in quantum chemistry, as well as Fokker-Planck and kinetic equations, involve probability distributions as their unknowns, and to reduce computational costs one often restricts attention to characterizing some of their moments (among the many works following this approach, we may mention [20,21,22]). Solving hyperbolic PDEs with moment approaches is also currently an active field of research (see [23, 24]). The field of stochastic homogenization also involves the estimation of moments of the PDE solution instead of the solution itself (see [25, 26]). For all these applications, one may need to compare or combine the underlying distributions by using only information on their moments, and hence to solve optimal transport problems at the level of moments.

The moment approach to optimal transport problems is not entirely novel, but it has only been explored in a few prior works. The idea was originally mentioned in [27] for the case of a polynomial basis and polynomial cost function. The moment approach was also recently explored in [28] for applications related to image processing involving trigonometric polynomial bases. It has also been recently used for solving the Monge problem [29], where an approximation of the transport map is constructed by solving a moment matrix completion problem and by using an approximation method based on the Christoffel–Darboux kernel [30]. A relatively different contribution can be found in [31], where the authors relax the marginal constraints into a set of moment constraints; here, however, the moments do not come from a prescribed basis but are selected from a given dictionary of potentially general test functions. Last but not least, the idea of leveraging moment formulations and sums-of-squares (SoS) has also been used for the dual formulation of problem (1.1) in order to derive statistical estimation bounds for high-dimensional OT problems (see [32, 33]).

In view of the present state of the art, the main contribution of this paper is to provide a general moment problem formulation of the most common OT problems with polynomial or piecewise polynomial costs. In particular, the problems of estimating \(L^p\)-Wasserstein distances for \(p\ge 1\), barycenters, and Gromov–Wasserstein discrepancies on Euclidean spaces are covered by our framework. We prove that the resulting sequence of optimal solutions converges to the whole moment sequence of the original OT measure as the polynomial order increases. The case of piecewise polynomial costs is addressed by a reformulation in terms of conditional measures.

For practical computations, we can directly apply the moment-SoS hierarchy similarly as in [27, 28], which eventually boils down to solving semidefinite programming optimization problems. It is worth emphasizing that by switching the point of view to a moment problem, we do not recover the measure itself. Instead, the resulting outputs will be moments of the optimal transport plan. Depending on the application, this may of course be a limitation. However, we show that it is possible to recover linear quantities of interest, and also the support of the measure by a post-processing algorithm based on Christoffel–Darboux kernels. Our numerical examples show that even the support of concentrated measures can efficiently be estimated with relatively low polynomial order. This feature seems particularly appealing. It could for instance be leveraged to recover optimal transport plans from high-dimensional OT problems: we could use the support estimation to provide well-chosen sampling points to grid-based approaches. A full development of these ideas will be presented in a forthcoming work.

The paper is organized as follows. After introducing some basic notation in Sect. 2, we define optimal transport problems in Sect. 3. We prove that when the involved cost function is a polynomial or a piecewise polynomial, the problem can be interpreted as a generalized moment problem. This section also allows us to introduce the basic principles of our approach to OT problems and important results from real algebraic geometry. In Sects. 4 and 5, we consider optimal transport problems that play a crucial role in numerous application areas, and prove that they can be expressed as generalized moment problems. More precisely, we consider in Sect. 4 the problems of computing \(L^p\)-Wasserstein distances and barycenters for \(p\ge 1\), and in Sect. 5 the problems of computing Gromov–Wasserstein discrepancies and corresponding barycenters. In Sect. 6, we formulate a generalized moment problem that encompasses all the previous OT problems. We derive a solution strategy based on the moment-SoS (or Lasserre’s) hierarchy, and we prove its convergence to the solution of the OT problem. Section 7 explains how to postprocess the moments to estimate linear quantities of interest and the support of the measure. Section 8 illustrates the potential of the approach by giving numerical results on estimating the \(L^1\) and \(L^2\) Wasserstein distances, barycenters, and the \(L^2\) Gromov–Wasserstein discrepancy.

2 Some elements of notation

In the following, \(\mathbb {N}\) should be understood as the set of non-negative integers (including zero). Vectors \(\textbf{x}\) from the Euclidean space \(\mathbb {R}^n\) will be denoted with bold notation. The coordinates \(\textbf{x}=(x_1,\dots , x_n)^T\) will be written in plain text. The canonical vectors will be denoted by \(\textbf{e}_i = (0,\dots , 0, 1,0,\dots , 0)^T\) for \(i\in \{1,\dots ,n\}\). For any index \(p\in \mathbb {N}^*:=\mathbb {N}{\setminus }\{0\}\), \( \Vert \textbf{x}\Vert _p :=\left( \sum _{i=1}^n \vert x_i\vert ^p \right) ^{1/p} \) denotes the \(\ell ^p(\mathbb {R}^n)\) norm of \(\textbf{x}\). We let \(\mathbb {R}[\textbf{x}]\) be the space of real polynomials over \(\mathbb {R}^n\). For any multi-index \(\varvec{\alpha }=(\alpha _1,\dots , \alpha _n)^T\in \mathbb {N}^n\) with length \(\vert {\varvec{\alpha }}\vert = \sum _{i=1}^n \alpha _i\), we define the associated monomial \( \textbf{x}^{{\varvec{\alpha }}} = \prod _{i=1}^n x_i^{\alpha _i} \) of degree \(\vert {\varvec{\alpha }}\vert \). We let \(\mathbb {N}_{r}^n = \{{\varvec{\alpha }}\in \mathbb {N}^n: \vert {\varvec{\alpha }}\vert \le r\},\) and \(\mathbb {R}[\textbf{x}]_r\) be the space of real polynomials of degree at most r, which can be written \(\sum _{{\varvec{\alpha }}\in \mathbb {N}_r^n} c_{\varvec{\alpha }}\textbf{x}^{\varvec{\alpha }}\) for some real coefficients \(c_{\varvec{\alpha }}\).
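All subsequent computations are indexed by such multi-indices. As a small illustrative sketch (plain Python with numpy, not part of the paper), the set \(\mathbb {N}^n_r\) and the evaluation of monomials can be implemented as follows:

```python
import itertools
import numpy as np

def multi_indices(n, r):
    """All multi-indices alpha in N^n with |alpha| = alpha_1 + ... + alpha_n <= r."""
    return [a for a in itertools.product(range(r + 1), repeat=n) if sum(a) <= r]

def monomial(x, alpha):
    """The monomial x^alpha = prod_i x_i^{alpha_i} for x in R^n."""
    return float(np.prod(np.asarray(x, dtype=float) ** np.asarray(alpha)))

# dim R[x]_r = card N^n_r = C(n + r, n); for n = 2, r = 3 this is 10.
assert len(multi_indices(2, 3)) == 10
assert monomial([2.0, 3.0], (1, 2)) == 18.0   # 2^1 * 3^2
```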

For any Borel set \(\mathcal {X}\) in \(\mathbb {R}^n\), we denote by \(\mathcal {M}(\mathcal {X})\) the space of finite signed Borel measures supported on \(\mathcal {X}\),

$$\begin{aligned} \mathcal {M}(\mathcal {X})_+ :=\{ \mu \in \mathcal {M}(\mathcal {X}) \; :\;\mu \ge 0 \} \end{aligned}$$

its positive cone of finite Borel measures supported on \(\mathcal {X}\), and

$$\begin{aligned} \mathcal {P}(\mathcal {X}) :=\{ \mu \in \mathcal {M}(\mathcal {X})_+ \; :\;\mu (\mathcal {X})=1 \} \end{aligned}$$

the set of probability measures supported on \(\mathcal {X}\). The indicator function of a subset \(A\subset \mathbb {R}^n\) is denoted as \(\mathbbm {1}_A\). For any Borel set \(\mathcal {X}\) in \(\mathbb {R}^n\) and any measure \(\mu \in \mathcal {M}(\mathcal {X})\),

$$\begin{aligned} m_{{\varvec{\alpha }}}(\mu ) :=\int _{\mathcal {X}} \textbf{x}^{{\varvec{\alpha }}} \mathrm d\mu (\textbf{x}) \end{aligned}$$

is the moment of \(\mu \) associated to the multi-index \({\varvec{\alpha }}\in \mathbb {N}^n\), and

$$\begin{aligned} m(\mu ) :=\left( m_{{\varvec{\alpha }}}(\mu )\right) _{{\varvec{\alpha }}\in \mathbb {N}^n} \end{aligned}$$

is the sequence of moments of \(\mu \). The mass of \(\mu \) is denoted \(\textrm{mass}(\mu ) = m_0(\mu ).\) A measure \(\mu \) is said to be determinate if it is uniquely determined by its moment sequence \(m(\mu )\).

Finally, for a given sequence \(y = (y_{\varvec{\alpha }})_{{\varvec{\alpha }}\in \mathbb {N}^n}\), we introduce the Riesz functional \(\ell _y: \mathbb {R}[\textbf{x}] \rightarrow \mathbb {R}\) which associates to a real polynomial \(g(\textbf{x}) = \sum _{{\varvec{\alpha }}\in \mathbb {N}^n} a_{\varvec{\alpha }}\textbf{x}^{\varvec{\alpha }}\) the value \(\ell _y(g) = \sum _{{\varvec{\alpha }}\in \mathbb {N}^n} a_{\varvec{\alpha }}y_{\varvec{\alpha }}\). For any measure \(\mu \), we thus have

$$\begin{aligned} \ell _{m(\mu )}(g) = \sum _{{\varvec{\alpha }}\in \mathbb {N}^n} a_{\varvec{\alpha }}m_{\varvec{\alpha }}(\mu ) = \int _{\mathbb {R}^n} g(\textbf{x}) d\mu (\textbf{x}),\quad \forall g \in \mathbb {R}[\textbf{x}]. \end{aligned}$$
(2.1)
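Since g is a polynomial, only finitely many coefficients \(a_{\varvec{\alpha }}\) are nonzero, so \(\ell _y(g)\) is a finite sum. A minimal sketch (with a hypothetical dictionary representation of g and y, not from the paper):

```python
def riesz(y, g):
    """Riesz functional l_y(g) for g = {alpha: a_alpha} with finitely many keys."""
    return sum(a * y[alpha] for alpha, a in g.items())

# Example with n = 1: y = moments of Lebesgue measure on [0, 1], y_k = 1/(k+1),
# and g(x) = 1 + 2 x^2, so that l_y(g) = int_0^1 (1 + 2 x^2) dx = 5/3.
y = {(k,): 1.0 / (k + 1) for k in range(5)}
g = {(0,): 1.0, (2,): 2.0}
assert abs(riesz(y, g) - 5.0 / 3.0) < 1e-12
```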

3 Optimal transport problems with polynomial costs

3.1 Formulation

To guide the subsequent discussion, we consider the multi-marginal version of problem (1.1) as a prototypical example of an OT problem. This problem consists in considering K probability measures \(\mu _i\in \mathcal {P}(\mathcal {X}_i)\) defined on Borel sets \(\mathcal {X}_i\subset \mathbb {R}^{n_i}\) for all \(i\in \{1, \dots , K\}\), and solving

$$\begin{aligned} \rho :=\inf _{\pi \in \Pi } \mathcal {L}(\pi ). \end{aligned}$$
(3.1)

The loss function is of the form

$$\begin{aligned} \mathcal {L}(\pi ) :=\int _\mathcal {X}c(\textbf{x}) \mathrm d\pi (\textbf{x}),\quad \forall \pi \in \mathcal {M}(\mathcal {X})_+, \end{aligned}$$
(3.2)

and the set \(\mathcal {X}\) is defined as the product set

$$\begin{aligned} \mathcal {X}:=\mathcal {X}_1\times \dots \times \mathcal {X}_K. \end{aligned}$$

Note that \(\mathcal {X}\) can be identified with a subset of \(\mathbb {R}^n\), with

$$\begin{aligned} n:=n_1 + \cdots + n_K. \end{aligned}$$

The function \(c: \mathcal {X}\rightarrow \mathbb {R}\) is a given cost function, and \(\Pi \) is a shorthand notation for the set of coupling measures having the \(\mu _i\) as marginals, namely

$$\begin{aligned} \Pi (\mathcal {X};\mu _1,\dots , \mu _K) :=\{ \pi \in \mathcal {P}(\mathcal {X}) \; :\;\textrm{proj}_i \# \pi = \mu _i,\quad \forall i \in \{1,\dots , K\} \}, \end{aligned}$$

where \(\textrm{proj}_i: \mathcal {X}\rightarrow \mathcal {X}_i\) denotes the canonical projection, and the push-forward measure \(\textrm{proj}_i \# \pi \) is the i-th marginal of \(\pi \).

The existence of a minimizer for (3.1) is standard in OT theory. Indeed, \(\Pi \) is trivially nonempty since the product coupling \(\otimes _{i=1}^K \mu _i\) belongs to it. The set \(\Pi \) is convex and, thanks to the imposed marginals, compact for the weak-\(*\) topology; moreover, if the cost function c is lower semi-continuous (l.s.c.), then the loss function \(\mathcal {L}:\pi \mapsto \int c \, \mathrm d\pi \) is l.s.c. with respect to the weak-\(*\) topology. Hence the existence of a minimizer is guaranteed under a very weak hypothesis on the cost function c, such as lower semi-continuity.

We next show that, under some conditions, when the loss function is of polynomial nature, problem (3.1) is equivalent to a moment problem, as initially observed in [27] (see also [34, Section 7.3]). To see this, note first that if c is a polynomial in \(\mathbb {R}[\textbf{x}]\), then it follows from (2.1) that the loss function (3.2) satisfies

$$\begin{aligned} \mathcal {L}(\pi ) = \ell _{m(\pi )}(c) =:\textrm{L}(m(\pi )). \end{aligned}$$
(3.3)

In addition, the marginal constraints on \(\pi \in \Pi \) in problem (3.1) imply constraints on the moments of \(\pi \),

$$\begin{aligned} m_{(0,\dots , 0, {\varvec{\beta }}, 0,\dots ,0)}(\pi ) = m_{{\varvec{\beta }}}(\mu _i),\quad \forall {\varvec{\beta }}\in \mathbb {N}^{n_i}, \text { and } \forall i\in \{1,\dots , K\}. \end{aligned}$$
(3.4)

As a result of (3.3) and (3.4), instead of considering the OT problem (3.1) where we search for an unknown measure \(\pi \in \Pi \), one can alternatively consider the moment problem of searching for the optimal sequence \(y = (y_{\varvec{\alpha }})_{{\varvec{\alpha }}\in \mathbb {N}^n}\) solving

$$\begin{aligned} \rho _{mom} :=\inf _{y \in \Pi _{mom}} L(y), \end{aligned}$$
(3.5)

where \(\Pi _{mom}:= \Pi _{mom}(\mathcal {X}; m(\mu _1), \ldots , m(\mu _K))\) is the set of sequences in \(\mathbb {R}^{\mathbb {N}^{n}}\) satisfying the following constraints:

  1. (i)

    Marginal conditions: the sequence y should satisfy

    $$\begin{aligned} y_{(0,\dots , 0, {\varvec{\beta }}, 0,\dots ,0)} = m_{{\varvec{\beta }}}(\mu _i),\quad \forall {\varvec{\beta }}\in \mathbb {N}^{n_i}, \text { and } \forall i\in \{1,\dots , K\} \end{aligned}$$
  2. (ii)

    Moment sequence condition: the sequence y must have a representing measure supported on \(\mathcal {X}\), that is, there must exist a measure \(\pi \in \mathcal {M}(\mathcal {X})_+\) such that \(y = m(\pi )\). We write this condition as \(y\in MS(\mathcal {X})\).

The equivalence between problems (3.1) and (3.5) is closely related to the determinacy of measures \(\mu _i\), for which a sufficient condition is that the sets \(\mathcal {X}_i\) are compact. We summarize these facts in the theorem below.

Theorem 3.1

(Polynomial cost) If the sets \(\{\mathcal {X}_i\}_{i=1}^K\) are compact, then the OT problem (3.1) with polynomial cost is equivalent to the generalized moment problem (3.5):

  • A minimizer \(\pi ^*\) of problem (3.1) is such that \(m(\pi ^*)\) is a minimizer of problem (3.5).

  • A minimizer \(y^*\) of problem (3.5) has a representing measure which is solution of (3.1).

In addition, if the solution \(\pi ^*\) of the OT problem (3.1) is unique, then the solution \(y^*\) of (3.5) is unique and \(y^* = m(\pi ^*).\)

Proof

Suppose \(\pi ^*\) is a solution of problem (3.1). Since it is a Borel measure supported on the compact set \(\mathcal {X}\), it is determinate (every measure with compact support is determinate), and we can consider its moment sequence \(y = m(\pi ^*)\). This sequence y clearly belongs to the set \( \Pi _{mom}\). Therefore, as a feasible point of (3.5), it satisfies \( \rho _{mom} \le L(y) = \mathcal {L}(\pi ^*) = \rho \). Conversely, let \(y^*\) be a solution of problem (3.5). Since \(y^*\in \text {MS}(\mathcal {X})\), there is a representing measure \(\pi \) such that \(y^* = m(\pi )\). For \(\pi \) to belong to the feasible set \(\Pi \) of problem (3.1), the marginal conditions on \(m(\pi )\) must imply that the marginals of \(\pi \) are the \(\mu _i\). This is indeed the case because the marginal measures \(\mu _i\), being supported on the compact sets \(\mathcal {X}_i\), are determinate. Therefore \(\pi \in \Pi \) and \( \rho \le \mathcal {L}(\pi ) = L(m(\pi )) = L(y^*) = \rho _{mom}\). This proves that \(\rho = \rho _{mom}\), and that \(m(\pi ^*)\) is a solution of problem (3.5) if and only if \(\pi ^*\) is a solution of (3.1). \(\square \)

Remark 3.2

Since the \(\mu _i\) are probability measures, \(y \in \Pi _{mom}\) is such that \(y_{(0,\ldots ,0)} = 1\) and a representing measure \(\pi \in \mathcal {M}(\mathcal {X})_+\) such that \(y = m(\pi )\) has mass 1, i.e. \(\pi \in \mathcal {P}(\mathcal {X}).\)

3.2 The moment sequence condition

In this section, we discuss how the moment sequence condition \(y \in MS(\mathcal {X})\) is translated into mathematical terms. This question is directly related to the so-called moment problem, which studies the following question: given a Borel subset \(\mathcal {X}\subseteq \mathbb {R}^n\) and a sequence of real numbers \(y=(y_{{\varvec{\alpha }}})_{{\varvec{\alpha }}\in \mathbb {N}^n}\), under which conditions on y can we guarantee that \(y = m(\pi )\) for some positive measure \(\pi \in \mathcal {M}(\mathcal {X})_+\)? In the one-dimensional case (\(n=1\)), this classical problem is well understood and dates back to contributions by Markov, Stieltjes, Hausdorff, and Hamburger. Explicit conditions on y exist, and they are all stated in terms of positive semi-definiteness of certain Hankel matrices. Much less is known in the multidimensional case (\(n>1\)). A general result is given by the Riesz–Haviland theorem, which states that a sequence y has an associated Borel measure \(\pi \) such that \(y=m(\pi )\) if and only if \(\ell _y(f)\ge 0\) for all polynomials \(f \in \mathbb {R}[\textbf{x}] \) nonnegative on \(\mathcal {X}\). This theorem is of limited practical use without an explicit characterization of the polynomials that are nonnegative on \(\mathcal {X}\) (a so-called Positivstellensatz). Such a characterization has been provided by Schmüdgen in [35] when the ambient space \(\mathcal {X}\) is a compact basic semi-algebraic set of the form

$$\begin{aligned} \mathcal {X}= \{ \textbf{x}\in \mathbb {R}^n \; :\;g_j(\textbf{x})\ge 0,\; j=1,\dots , J \} \end{aligned}$$
(3.6)

for some polynomials \(g_j \in \mathbb {R}[\textbf{x}]\). In reference [35], it is proven that a sequence y has a representing Borel measure supported on \(\mathcal {X}\) (i.e. satisfies the moment sequence condition) if and only if it satisfies

$$\begin{aligned} \ell _{y}(g_I f^2) \ge 0, \quad \forall I \subset \{1,\ldots ,J\}, \quad \forall f\in \mathbb {R}[\textbf{x}], \end{aligned}$$
(3.7)

where \(g_I = \prod _{j\in I} g_j\) and where we have used the convention \(g_\emptyset =1\).

For a polynomial \(g(\textbf{x}) = \sum _{{\varvec{\gamma }}\in \mathbb {N}^n} c_{\varvec{\gamma }}\textbf{x}^{\varvec{\gamma }}\in \mathbb {R}[\textbf{x}]\) and \(r\in \mathbb {N}\), we let \(\textbf{M}_r(g y) \in \mathbb {R}^{\mathbb {N}^n_r\times \mathbb {N}^n_r}\) be the matrix with entries

$$\begin{aligned} \textbf{M}_r(g y)_{{\varvec{\alpha }}, {\varvec{\beta }}} = \ell _y(g(\textbf{x}) \textbf{x}^{{\varvec{\alpha }}} \textbf{x}^{{\varvec{\beta }}} ) = \sum _{{\varvec{\gamma }}\in \mathbb {N}^n} c_{\varvec{\gamma }}y_{{\varvec{\alpha }}+ {\varvec{\beta }}+ {\varvec{\gamma }}}, \quad {\varvec{\alpha }},{\varvec{\beta }}\in \mathbb {N}^n_r, \end{aligned}$$

which is such that for any polynomial \(f \in \mathbb {R}[\textbf{x}]_r\) of degree at most r, we have

$$\begin{aligned} \ell _y(g f^2) = \sum _{{\varvec{\alpha }},{\varvec{\beta }}\in \mathbb {N}^n_r}\textbf{M}_r( g y)_{{\varvec{\alpha }},{\varvec{\beta }}} a_{\varvec{\alpha }}a_{\varvec{\beta }}, \quad \text {for } f(\textbf{x}) = \sum _{{\varvec{\gamma }}\in \mathbb {N}^n_r} a_{\varvec{\gamma }}\textbf{x}^{\varvec{\gamma }}. \end{aligned}$$

Therefore, the moment sequence condition (3.7) is equivalent to

$$\begin{aligned} \textbf{M}_r(g_I y) \succcurlyeq 0 , \quad \forall I \subset \{1,\ldots ,J\}, \quad \forall r\in \mathbb {N}, \end{aligned}$$
(3.8)

where for a symmetric matrix \(\textbf{M}\), \(\textbf{M}\succcurlyeq 0\) means that \(\textbf{M}\) is positive semi-definite. A simpler characterization has been given by Putinar in [36] under the following additional assumption.

Assumption 3.3

There exists a polynomial u of the form \(u = u_0 + \sum _{j=1}^J u_j g_j\), where the \(u_j\) are sums of squares (SoS) polynomials, and such that \(\{\textbf{x}\in \mathbb {R}^n: u(\textbf{x}) \ge 0\}\) is compact.

Under Assumption 3.3, it is proven in [36] that y has a representing Borel measure supported on \(\mathcal {X}\) if and only if

$$\begin{aligned} \textbf{M}_r(g_j y) \succcurlyeq 0 \quad \forall j\in \{0,\ldots ,J\}, \quad \forall r\in \mathbb {N}, \end{aligned}$$
(3.9)

where we have used the convention \(g_0 = 1\).

The linear positive semidefinite constraints (3.8) or (3.9) thus characterize the moment sequence condition exactly. We summarize the above results in the following theorem.

Theorem 3.4

(Th. 3.8 in [34]) Let \(\mathcal {X}\) be a basic semi-algebraic set as in (3.6). A sequence \(y \in \mathbb {R}^{\mathbb {N}^n}\) satisfies \(y\in MS(\mathcal {X})\) (i.e. satisfies the moment sequence condition on \(\mathcal {X}\)) if and only if it satisfies the positive semidefinite constraints (3.8), or the positive semidefinite constraints (3.9) under the additional Assumption 3.3.
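To make these matrices concrete, the following sketch (numpy only, for \(n=1\) and \(\mathcal {X}= [-1,1]\) described by \(g_1(x) = 1 - x^2\); not the authors' code) assembles \(\textbf{M}_r(y)\) and the localizing matrix \(\textbf{M}_r(g_1 y)\) and checks the positive semidefinite conditions (3.9) on the moments of a genuine measure:

```python
import numpy as np

def moment_matrix(y, r, g=None):
    """M_r(g y)[a, b] = sum_k c_k y_{a+b+k} for g = {k: c_k} (g = 1 by default)."""
    g = g if g is not None else {0: 1.0}
    M = np.empty((r + 1, r + 1))
    for a in range(r + 1):
        for b in range(r + 1):
            M[a, b] = sum(c * y[a + b + k] for k, c in g.items())
    return M

# Moments of the uniform (normalized Lebesgue) measure on [-1, 1]: y_k = 1/(k+1)
# for even k and 0 for odd k; we need them up to order 2r + 2 for the shift by 2.
r = 2
y = np.array([0.0 if k % 2 else 1.0 / (k + 1) for k in range(2 * r + 3)])
M = moment_matrix(y, r)                        # Hankel matrix, M[a, b] = y_{a+b}
L = moment_matrix(y, r, g={0: 1.0, 2: -1.0})   # localizing matrix for g1 = 1 - x^2
assert min(np.linalg.eigvalsh(M)) >= -1e-12    # (3.9): both matrices are PSD
assert min(np.linalg.eigvalsh(L)) >= -1e-12
```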

In our context, since \(\mathcal {X}\) is the product set \(\mathcal {X}_1\times \dots \times \mathcal {X}_K\), we assume that each \(\mathcal {X}_i\) is a compact basic semi-algebraic set defined as

$$\begin{aligned} \mathcal {X}_i = \{ \textbf{x}_i\in \mathbb {R}^{n_i} \; :\;g^{(i)}_j(\textbf{x}_i)\ge 0,\; j=1,\dots , J_i \},\quad \forall i \in \{1,\dots ,K\}. \end{aligned}$$

Then \(\mathcal {X}\) is a basic semi-algebraic set defined as in (3.6) with \(J=\sum _{i=1}^K J_i\) and functions \(\{g_j\}_{j=1}^J\) defined by

$$\begin{aligned} g_j(\textbf{x}) = g_{l}^{(i)}(\textbf{x}_i) \quad \text {for } j = \sum _{k=1}^{i-1} J_k + l, \quad 1\le l \le J_i, \quad 1\le i \le K. \end{aligned}$$

Remark 3.5

(About Assumption 3.3) Assumption 3.3 is trivially satisfied if \(g_1(\textbf{x}) = R - \Vert \textbf{x} \Vert _2^2 \) for some positive R. Since for any compact semi-algebraic set \(\mathcal {X}\) there exists a sufficiently large R such that \(\mathcal {X}\subset \{\textbf{x}: \Vert \textbf{x}\Vert _2 < R\}\), the condition \(R - \Vert \textbf{x}\Vert _2^2 \ge 0\) is redundant and can be systematically added to the definition of \(\mathcal {X}\). A stronger (sufficient) condition for Assumption 3.3 to hold for a product set \(\mathcal {X}\) is that the description of each set \(\mathcal {X}_i\) contains a function \(g^{(i)}_1(\textbf{x}_i) = R_i - \Vert \textbf{x}_i \Vert _2^2\). If this is not the case, one should rather add a single function \(g_1(\textbf{x}) = R - \Vert \textbf{x}\Vert _2^2 \) to the description of \(\mathcal {X}\) in order to reduce the number of positive semidefinite constraints.
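As an illustration, for \(\mathcal {X}= [0,1]^n\) described by the 2n inequalities \(x_j \ge 0\) and \(1-x_j\ge 0\), \(1\le j\le n\), one may append the redundant constraint \(g(\textbf{x}) = n - \Vert \textbf{x}\Vert _2^2 \ge 0\), which holds on \(\mathcal {X}\) since \(\Vert \textbf{x}\Vert _2^2 \le n\) there; Assumption 3.3 is then satisfied with \(u = g\), whose nonnegativity set is the closed ball of radius \(\sqrt{n}\).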

3.3 Piecewise polynomial costs

Some OT problems, such as the \(L^1\)-Wasserstein distance, involve a continuous or l.s.c. piecewise polynomial cost. These problems can also be formulated as generalized moment problems, at the price of introducing new unknown measures.

Piecewise polynomial costs. Let us assume that \(\mathcal {X}= \mathcal {A}_1 \cup \cdots \cup \mathcal {A}_m\), where the \(\mathcal {A}_i\) are pairwise disjoint Borel sets and

$$\begin{aligned} c_{\mid \mathcal {A}_i} = c_i \in \mathbb {R}[\textbf{x}], \quad 1\le i \le m. \end{aligned}$$

For a measure \(\pi \in \mathcal {P}(\mathcal {X})\), we introduce the measures

$$\begin{aligned} \pi _i :=\mathbbm {1}_{\mathcal {A}_i} \pi \in \mathcal {M}({\bar{\mathcal {A}}}_i)_+, \end{aligned}$$

where \({\bar{\mathcal {A}}}_i\) is the closure of \(\mathcal {A}_i\). Since \(\mathbbm {1}_{\mathcal {A}_1} + \cdots + \mathbbm {1}_{\mathcal {A}_m} = \mathbbm {1}_\mathcal {X}\), we have \(\pi = \pi _1 + \cdots + \pi _m,\) and

$$\begin{aligned} \mathcal {L}(\pi ) = \int _\mathcal {X}c(\textbf{x}) d\pi (\textbf{x}) = \sum _{i=1}^m \int _\mathcal {X}c_i(\textbf{x}) d\pi _i(\textbf{x}) =\sum _{i=1}^m \ell _{m(\pi _i)}(c_i) := {\tilde{\mathcal {L}}}(\pi _1 , \ldots , \pi _m). \end{aligned}$$

We claim that the OT problem (3.1) is equivalent to

$$\begin{aligned} \inf _{ \begin{array}{c} \pi _1 \in \mathcal {M}({\bar{\mathcal {A}}}_1)_+, \ldots , \pi _m \in \mathcal {M}({\bar{\mathcal {A}}}_m)_+ , \\ \pi _1 + \cdots + \pi _m \in \Pi \end{array}} {\tilde{\mathcal {L}}}(\pi _1, \ldots , \pi _m) := {\tilde{\rho }}. \end{aligned}$$
(3.10)

Indeed, if \(\pi \) is a solution of problem (3.1), then the measures \(\pi _i = \mathbbm {1}_{\mathcal {A}_i} \pi \), \(1\le i\le m\), satisfy the constraints of (3.10) and \({\tilde{\rho }} \le {\tilde{\mathcal {L}}}(\pi _1,\ldots ,\pi _m) = \mathcal {L}(\pi ) = \rho \). Conversely, the set of measures \((\pi _1,\ldots ,\pi _m)\) satisfying the constraints of problem (3.10) is compact in the weak-\(*\) topology (since \(\pi _i \in \mathcal {M}({\bar{\mathcal {A}}}_i)_+\) and \(\textrm{mass}(\pi _i)\le 1\)), which implies the existence of solutions. Moreover, if \((\pi _1,\ldots ,\pi _m)\) is a solution of (3.10), then \(\pi = \pi _1 + \cdots + \pi _m \in \Pi \) and, since the lower semi-continuity of c implies \(c_i \ge c\) on \({\bar{\mathcal {A}}}_i\), \({\tilde{\rho }} = {\tilde{\mathcal {L}}}(\pi _1, \ldots , \pi _m) = \sum _{i=1}^m \int _{{\bar{\mathcal {A}}}_i} c_i(\textbf{x}) \,\mathrm d\pi _i(\textbf{x}) \ge \int _\mathcal {X}c(\textbf{x}) \,\mathrm d\pi (\textbf{x}) = \mathcal {L}(\pi ) \ge \rho \). Finally, denoting \({\tilde{\mathcal {L}}}(\pi _1, \ldots , \pi _m) = {\tilde{L}}(m(\pi _1), \ldots ,m(\pi _m))\), we conclude that the initial OT problem (3.1) is equivalent to the optimization problem

$$\begin{aligned} \inf _{ \begin{array}{c} y_1 \in MS({\bar{\mathcal {A}}}_1), \ldots , y_m \in MS({\bar{\mathcal {A}}}_m) \\ y_1 + \cdots + y_m \in \Pi _{mom} \end{array}} {\tilde{L}}(y_1, \ldots , y_m) \end{aligned}$$
(3.11)

over m sequences \((y_i)_{1\le i\le m}\) that satisfy moment sequence conditions and whose sum satisfies marginal conditions. We summarize these facts in the theorem below.

Theorem 3.6

(Piecewise polynomial cost) If \( \mathcal {X}_1\times \dots \times \mathcal {X}_K\) is compact, then the OT problem (3.1) with l.s.c. piecewise polynomial cost over a partition \((\mathcal {A}_i)_{1\le i \le m}\) of \(\mathcal {X}\) is equivalent to the generalized moment problem (3.11): a minimizer \(\pi ^*\) of problem (3.1) is such that \((m(\pi _i^*))_{1\le i\le m}\), with \(\pi ^*_i = \mathbbm {1}_{\mathcal {A}_i} \pi ^*\), is a minimizer of problem (3.11), and conversely, a minimizer \((y^*_i)_{1\le i \le m}\) of problem (3.11) is such that each \(y^*_i\) has a representing measure \(\pi _i\) supported on \({\bar{\mathcal {A}}}_i\), and the sum \(\pi = \pi _1 + \cdots + \pi _m\) is a solution of (3.1). In addition, if the solution \(\pi ^*\) of the OT problem (3.1) is unique, then even though (3.11) may have infinitely many solutions \((y^*_1,\ldots ,y_m^*)\), the sum \(y^*_1 + \cdots + y_m^* =: y^*\) is unique and satisfies \(y^* = m(\pi ^*).\)

Remark 3.7

To obtain a practical characterization of the set \(MS({\bar{\mathcal {A}}}_i)\) of sequences that satisfy the moment sequence condition on \({\bar{\mathcal {A}}}_i\), the partition should be such that the \({\bar{\mathcal {A}}}_i\) are compact basic semi-algebraic sets. If \(\mathcal {X}\) is a compact basic semi-algebraic set, this means that each \(\mathcal {A}_i\) should be defined as the set of points in \(\mathcal {X}\) satisfying a finite number of additional polynomial inequalities.
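As a simple illustration, take \(\mathcal {X}= [0,1]^2\) and the cost \(c(x_1,x_2) = \vert x_1 - x_2\vert \), which is continuous and piecewise polynomial. The partition \(\mathcal {A}_1 = \{(x_1,x_2)\in \mathcal {X}: x_1 \ge x_2\}\), \(\mathcal {A}_2 = \{(x_1,x_2)\in \mathcal {X}: x_1 < x_2\}\) gives \(c_1 = x_1 - x_2\) and \(c_2 = x_2 - x_1\), and the closures \({\bar{\mathcal {A}}}_1\) and \({\bar{\mathcal {A}}}_2\) are compact basic semi-algebraic, obtained by appending the inequality \(x_1 - x_2\ge 0\) (resp. \(x_2 - x_1\ge 0\)) to the description of \(\mathcal {X}\).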

Sum of piecewise polynomial costs. In the case where

$$\begin{aligned} c(\textbf{x}) = \sum _{k=1}^s c_k(\textbf{x}), \end{aligned}$$
(3.12)

where each \(c_k\) is a l.s.c. piecewise polynomial associated with a particular partition \((\mathcal {A}_{k,i})_{1\le i \le m_k }\), i.e. \(c_{k \mid \mathcal {A}_{k,i}}:= c_{k,i} \in \mathbb {R}[\textbf{x}]\), we could introduce a finer partition of \(\mathcal {X}\) composed of the sets \(\mathcal {A}_{\textbf{i}} = \mathcal {A}_{1,i_1}\cap \cdots \cap \mathcal {A}_{s,i_s}\), with \(1 \le i_k \le m_k\). The function c being polynomial on each set \( \mathcal {A}_{\textbf{i}}\), the problem can be reformulated as a generalized moment problem involving \(m_1 \cdots m_s\) measures \(\pi _{\textbf{i}}\) supported on the sets \({\bar{\mathcal {A}}}_{\textbf{i}}\), for \(\textbf{i} \in \{1,\ldots ,m_1\} \times \cdots \times \{1,\ldots ,m_s\}\). However, the resulting number of unknown measures is exponential in s.

An alternative approach, which will be used later in this paper, is to introduce for each \(1 \le k \le s\) a collection of measures \((\pi _{k,i})_{1\le i \le m_k}\) and consider the problem

$$\begin{aligned} \inf _{\pi \in \Pi ,(\pi _{k,i})}\sum _{k=1}^s \sum _{i=1}^{m_k} \int _{\mathcal {X}} c_{k,i}(\textbf{x}) d\pi _{k,i} \end{aligned}$$
(3.13)

over measures \(\pi \in \Pi \) and \(\pi _{k,i} \in \mathcal {M}({\bar{\mathcal {A}}}_{k,i})_+\), \(1\le i \le m_k, 1\le k\le s\), satisfying

$$\begin{aligned} \sum _{i=1}^{m_k} \pi _{k,i} = \pi , \quad \text {for all } 1\le k\le s. \end{aligned}$$

This results in a problem with \(m_1 + \cdots + m_s +1\) unknown measures, that can be equivalently written as the problem

$$\begin{aligned} \inf _{y\in \Pi _{mom},(y_{k,i})}\sum _{k=1}^s \sum _{i=1}^{m_k} \ell _{y_{k,i}}(c_{k,i}) \end{aligned}$$
(3.14)

over sequences \(y\in \Pi _{mom}\) and \(y_{k,i} \in MS({\bar{\mathcal {A}}}_{k,i})\), \(1\le i \le m_k\), \(1\le k\le s\), satisfying the additional constraints

$$\begin{aligned} \sum _{i=1}^{m_k} y_{k,i} = y, \quad \text {for all } 1\le k\le s. \end{aligned}$$

Note that the measure \(\pi \) (resp. the sequence y) can be eliminated from problem (3.13) (resp. (3.14)). We summarize the above results in the next theorem.

Theorem 3.8

(Sum of piecewise polynomial costs) Assume \( \mathcal {X}_1\times \dots \times \mathcal {X}_K\) is compact, and consider a l.s.c. piecewise polynomial cost of the form (3.12), where each \(c_k\) is a l.s.c. piecewise polynomial over a partition \((\mathcal {A}_{k,i})_{1\le i \le m_k}\) of \(\mathcal {X}\), \(1\le k\le s\). Then the OT problem (3.1) is equivalent to the problem (3.14): a minimizer \(\pi ^*\) of problem (3.1) is such that \((m(\pi _{k,i}^*))\), with \(\pi ^*_{k,i} = \mathbbm {1}_{\mathcal {A}_{k,i}} \pi ^*\), is a minimizer of problem (3.14), and conversely, a minimizer \((y^*_{k,i})\) of problem (3.14) is such that each \(y^*_{k,i}\) has a representing measure \(\pi _{k,i}\) supported on \({\bar{\mathcal {A}}}_{k,i}\), and for each k, the sum \( \pi _{k,1} + \cdots + \pi _{k,m_k} = \pi \) is a solution of (3.1). In addition, if the solution \(\pi ^*\) of the OT problem (3.1) is unique, then (3.14) may have infinitely many solutions, but for each k the sum \(y^*_{k,1} + \cdots + y_{k,m_k}^* =: y^*\) is unique and satisfies \(y^* = m(\pi ^*).\)
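For instance, the \(\ell ^1\) cost \(c(\textbf{x},\textbf{y}) = \sum _{i=1}^d \vert x_i - y_i\vert \) is of the form (3.12) with \(s=d\) and \(m_k = 2\) pieces for each k, so that the reformulation (3.14) involves \(2d+1\) unknown sequences (before elimination of y); this is precisely the route followed for the \(L^1\)-Wasserstein distance in Sect. 4.1.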

4 Wasserstein distances and barycenters

In this section, we consider the problems of computing distances and barycenters in Wasserstein spaces and show that they can be expressed as generalized moment problems. Throughout the section, \(\mathcal {X}\) denotes a compact basic semi-algebraic set in the normed vector space \((\mathbb {R}^d,\Vert \cdot \Vert _p)\), with \(p \in \mathbb {N}^*\).

4.1 Wasserstein distances

The Wasserstein space \(\mathcal {P}_p(\mathcal {X})\) is defined as the set of probability measures \(\mu \in \mathcal {P}(\mathcal {X})\) with finite moments up to order p, namely

$$\begin{aligned} \mathcal {P}_p(\mathcal {X}) :=\biggl \{ \mu \in \mathcal {P}(\mathcal {X}) \; :\;\int _\mathcal {X}\Vert \textbf{x}\Vert ^p_p \,\mathrm d\mu (\textbf{x}) \;< +\infty \biggr \}. \end{aligned}$$

Let \(\mu \) and \(\nu \) be two probability measures in \(\mathcal {P}_p(\mathcal {X})\). For any \(p\in \mathbb {N}^*\), the \(L^p\)-Wasserstein distance \(W_p(\mu ,\nu )\) between \(\mu \) and \(\nu \) is defined by

$$\begin{aligned} W_p^p(\mu ,\nu ) :=\mathop {\inf }_{\pi \in \Pi (\mathcal {X}\times \mathcal {X};\mu ,\nu )} \int _{\mathcal {X}\times \mathcal {X}} \Vert \textbf{x}-\textbf{y}\Vert ^p_p \,\mathrm d\pi (\textbf{x}, \textbf{y}). \end{aligned}$$
(4.1)

The space \(\mathcal {P}_p(\mathcal {X})\) endowed with the distance \(W_p\) is a metric space, usually called the \(L^p\)-Wasserstein space (see [37] for more details). Problem (4.1) defining \(W_p\) is an optimal transport problem of the form (1.1) with \(K=2\) marginals, \(\mathcal {X}_1 = \mathcal {X}_2 = \mathcal {X}\) and the continuous cost function

$$\begin{aligned} c(\textbf{x},\textbf{y})= \Vert \textbf{x}-\textbf{y}\Vert ^p_p = \sum _{i=1}^d \vert x_i-y_i \vert ^p. \end{aligned}$$

We claim that for any \(p\in \mathbb {N}^*\), this problem can be seen as a generalized moment problem. We distinguish the cases of even and odd p.

Case p even. When p is an even number, the cost c is a polynomial and we simply use the binomial theorem to derive that the loss function in (4.1) can be expressed as

$$\begin{aligned} \mathcal {L}^{W_p}(\pi ) := \int _{\mathcal {X}\times \mathcal {X}} \Vert \textbf{x}-\textbf{y}\Vert ^p_p \,\mathrm d\pi (\textbf{x}, \textbf{y})&= \sum _{i=1}^d \sum _{k=0}^p {p\atopwithdelims ()k} \int _{\mathcal {X}\times \mathcal {X}} (-1)^{p-k} x_i^k y_i^{p-k} \mathrm d\pi (\textbf{x},\textbf{y}) \end{aligned}$$

or in terms of the moments \(m(\pi )\) of \(\pi \),

$$\begin{aligned} \mathcal {L}^{W_p}(\pi ) = \sum _{i=1}^d \sum _{k=0}^p {p\atopwithdelims ()k} (-1)^{p-k} m_{k\textbf{e}_i, (p-k)\textbf{e}_i}(\pi )&=: \sum _{i=1}^d L_i^{W_p}(m(\pi )) =: L^{W_p}(m(\pi )), \end{aligned}$$

where we recall that \(\textbf{e}_i\) is the i-th canonical vector in \(\mathbb {N}^d\). The marginal constraints \(\pi \in \Pi (\mathcal {X}\times \mathcal {X}; \mu , \nu )\) of problem (4.1) can also be expressed in terms of moments. We derived their general form in equation (3.4). In the present context, they read

$$\begin{aligned} m_{({\varvec{\alpha }}, 0)}(\pi ) = m_{{\varvec{\alpha }}}(\mu ),\quad m_{(0, {\varvec{\beta }})}(\pi ) = m_{{\varvec{\beta }}}(\nu ), \quad \forall {\varvec{\alpha }},{\varvec{\beta }}\in \mathbb {N}^{d}. \end{aligned}$$

The problem (4.1) can then be expressed as the generalized moment problem

$$\begin{aligned} \inf _{y\in \Pi _{mom}} L^{W_p}(y), \end{aligned}$$
(4.2)

where \(\Pi _{mom}:= \Pi _{mom}(\mathcal {X}\times \mathcal {X};m(\mu ), m(\nu ))\) is the set of sequences \(y \in \mathbb {R}^{\mathbb {N}^{2d}}\) that satisfy the moment sequence condition and the marginal constraints

$$\begin{aligned} y_{({\varvec{\alpha }}, 0)} = m_{{\varvec{\alpha }}}(\mu )\quad \text {and} \quad y_{(0, {\varvec{\beta }})} = m_{{\varvec{\beta }}}(\nu ), \quad \forall {\varvec{\alpha }},{\varvec{\beta }}\in \mathbb {N}^{d}. \end{aligned}$$

Here, Theorem 3.1 applies and proves the equivalence between problems (4.2) and (4.1).
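Anticipating the truncated relaxations of Sect. 6, the following sketch (with illustrative marginals; cvxpy and an SDP solver are assumed available, and this is not the authors' implementation) solves the order-r relaxation of (4.2) for \(p=2\), \(d=1\) and \(\mathcal {X}= [-1,1]\), enforcing the marginal constraints up to degree 2r and the positive semidefinite conditions (3.9):

```python
import itertools
import numpy as np
import cvxpy as cp

r = 3                                               # relaxation order
flat = [ab for ab in itertools.product(range(2 * r + 1), repeat=2)
        if sum(ab) <= 2 * r]                        # multi-indices of degree <= 2r
y = cp.Variable(len(flat))
yk = {ab: y[k] for k, ab in enumerate(flat)}        # y_{(a,b)} ~ int x^a y^b dpi

def loc_mat(basis, g):
    """Symmetric-by-construction matrix with entries sum_s c_s y_{alpha+beta+s}."""
    return cp.bmat([[sum(c * yk[(a1 + a2 + s1, b1 + b2 + s2)]
                         for (s1, s2), c in g.items())
                     for (a2, b2) in basis]
                    for (a1, b1) in basis])

idx_r = [ab for ab in flat if sum(ab) <= r]         # rows/columns of M_r(y)
idx_l = [ab for ab in flat if sum(ab) <= r - 1]     # rows/columns of M_{r-1}(g y)
cons = [loc_mat(idx_r, {(0, 0): 1.0}) >> 0,                 # moment matrix
        loc_mat(idx_l, {(0, 0): 1.0, (2, 0): -1.0}) >> 0,   # g = 1 - x^2
        loc_mat(idx_l, {(0, 0): 1.0, (0, 2): -1.0}) >> 0]   # g = 1 - y^2

# Marginals (illustrative): mu uniform on [-1, 1], nu = (delta_{-1} + delta_1)/2.
m_mu = [0.0 if a % 2 else 1.0 / (a + 1) for a in range(2 * r + 1)]
m_nu = [0.0 if a % 2 else 1.0 for a in range(2 * r + 1)]
cons += [yk[(a, 0)] == m_mu[a] for a in range(2 * r + 1)]
cons += [yk[(0, b)] == m_nu[b] for b in range(2 * r + 1)]

obj = yk[(2, 0)] - 2 * yk[(1, 1)] + yk[(0, 2)]      # L^{W_2}(y) = int (x - y)^2 dpi
prob = cp.Problem(cp.Minimize(obj), cons)
prob.solve()
print(prob.value)                                   # lower bound on W_2(mu, nu)^2
```

For this illustrative pair, the optimal value of (4.1) is \(W_2^2(\mu ,\nu ) = 1/3\), attained by the map \(x\mapsto \textrm{sign}(x)\), and the value returned by the relaxation is a lower bound that can only increase with the order r.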

Case p odd. When p is odd, the presence of the absolute value in the cost function prevents it from being a polynomial. We can nevertheless derive a moment formulation by exploiting the fact that the cost is piecewise polynomial on \(\mathcal {X}\times \mathcal {X}\). We first introduce, for all \(i\in \{1,\dots ,d\}\), the subsets

$$\begin{aligned} \mathcal {A}_i^+ :=\{ (\textbf{x},\textbf{y})\in \mathcal {X}\times \mathcal {X}\; :\;x_i-y_i \ge 0\}, \quad \mathcal {A}_i^- :=\{ (\textbf{x},\textbf{y})\in \mathcal {X}\times \mathcal {X}\; :\;x_i-y_i < 0\}, \end{aligned}$$

that form a partition of \(\mathcal {X}\times \mathcal {X}\), i.e.

$$\begin{aligned} \mathcal {A}_i^+ \cup \mathcal {A}_i^-= \mathcal {X}\times \mathcal {X}, \quad \mathcal {A}_i^+ \cap \mathcal {A}_i^- = \emptyset . \end{aligned}$$

If \(\mathcal {X}\) is compact semi-algebraic, then \(\mathcal {A}_i^+\) and \( \overline{ \mathcal {A}_i ^-}\) are also compact semi-algebraic. For any \(\pi \in \mathcal {P}(\mathcal {X}\times \mathcal {X})\), we can define measures \(\pi ^+_i, \pi ^-_i\) by

$$\begin{aligned} \pi _i^+&= \mathbbm {1}_{\mathcal {A}_i^+} \pi , \quad \pi _i^- = \mathbbm {1}_{\mathcal {A}_i^-} \pi , \end{aligned}$$

which are such that

$$\begin{aligned} \pi = \pi _i^+ + \pi _i^-,\quad \forall i \in \{1,\dots , d\}. \end{aligned}$$

When p is odd, since \(\mathbbm {1}_{\mathcal {A}_i^- } + \mathbbm {1}_{\mathcal {A}_i^+} = \mathbbm {1}_{\mathcal {X}\times \mathcal {X}}\), we can write the Wasserstein loss function as

$$\begin{aligned} \mathcal {L}^{W_p}(\pi )&= \sum _{i=1}^d \int _{\mathcal {X}\times \mathcal {X}} \vert x_i-y_i \vert ^p (\mathrm d\pi _i^+(\textbf{x},\textbf{y}) + \mathrm d\pi _i^-(\textbf{x},\textbf{y}))\\&= \sum _{i=1}^d \int _{\mathcal {X}\times \mathcal {X}} (x_i-y_i)^p (\mathrm d\pi _i^+(\textbf{x},\textbf{y})-\mathrm d\pi _i^-(\textbf{x},\textbf{y}))\\&= \sum _{i=1}^d \sum _{k=0}^p {p\atopwithdelims ()k} (-1)^{p-k} m_{k\textbf{e}_i, (p-k)\textbf{e}_i}(\pi _i^+ - \pi _i^-)\\&= \sum _{i=1}^d L^{W_p}_i(m(\pi _i^+) - m(\pi _i^-)). \end{aligned}$$

From Theorem 3.8, we know that problem (4.1) is equivalent to the following problem with \(2d+1\) measures,

$$\begin{aligned} W_p^p(\mu ,\nu ) = \inf _{\begin{array}{c} \pi \in \Pi (\mu ,\nu ), \\ \pi _i^+ \in \mathcal {M}(\mathcal {A}_i^+)_+ ,\; \pi _i^- \in \mathcal {M}( \overline{\mathcal {A}_i^-})_+, \\ \pi _i^+ + \pi _i^- = \pi ,\; 1\le i\le d \end{array}} \sum _{i=1}^d L_i^{W_p}(m(\pi ^+_i) - m(\pi _i^-)) \end{aligned}$$

which can be equivalently reformulated as a generalized moment problem

$$\begin{aligned} W_p^p(\mu ,\nu ) = \inf _{y , y_1^+ , \ldots , y_d^- } \sum _{i=1}^d L^{W_p}_i \left( y_i^+ - y_i^- \right) \end{aligned}$$

over a set of \(2d+1\) sequences satisfying moment sequence conditions \(y_ i^+ \in MS(\mathcal {A}_i^+)\) and \(y_ i^- \in MS(\overline{\mathcal {A}_i^-})\), \(1\le i \le d\), and the constraints \(y\in \Pi _{mom}(\mathcal {X}\times \mathcal {X}; m(\mu ),m(\nu ))\) and

$$\begin{aligned} y= y_i^+ + y_i^-, \quad \forall i \in \{1,\ldots ,d\}. \end{aligned}$$

Note that the variable y can be eliminated.
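For instance, for \(p = d = 1\), since \(\vert x - y\vert = x-y\) on \(\mathcal {A}_1^+\) and \(y-x\) on \(\mathcal {A}_1^-\), the program reads

$$\begin{aligned} W_1(\mu ,\nu ) = \inf \bigl \{ \bigl (y^+_{(1,0)} - y^+_{(0,1)}\bigr ) - \bigl (y^-_{(1,0)} - y^-_{(0,1)}\bigr ) \; :\;y^+ \in MS(\mathcal {A}_1^+),\; y^- \in MS(\overline{\mathcal {A}_1^-}),\; y^+ + y^- \in \Pi _{mom} \bigr \}. \end{aligned}$$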

4.2 Wasserstein barycenters

A notion widely used to approximate measures in Wasserstein spaces is that of barycenters. To define it, let \(N\in \mathbb {N}^*\) and let

$$\begin{aligned} \Sigma _N :=\Big \{ \lambda \in \mathbb {R}^N\; :\;\lambda _i\ge 0,\, \sum _{i=1}^N \lambda _i = 1 \Big \} \end{aligned}$$

be the simplex in \(\mathbb {R}^N\). We say that \(\textrm{Bar}(\textrm{Y}_N, \Lambda _N) \in \mathcal {P}_p(\mathcal {X})\) is a barycenter associated to a given set \(\textrm{Y}_N = (\mu _i)_{1\le i\le N}\) of N probability measures from \(\mathcal {P}_p(\mathcal {X})\) and to a given set of weights \(\Lambda _N = (\lambda _i)_{1\le i\le N} \in \Sigma _N\), if and only if \(\textrm{Bar}(\textrm{Y}_N, \Lambda _N)\) is a solution to

$$\begin{aligned} \inf _{\nu \in \mathcal {P}_p(\mathcal {X})} \sum _{i=1}^N \lambda _i W_p^p(\nu ,\mu _i). \end{aligned}$$
(4.3)

Existence and uniqueness of minimizers of (4.3) have been studied in depth in [38] for the case \(p=2\). It is shown, in particular, that if one of the \(\mu _i\) has a density, then the barycenter is unique. In the following, we assume existence of minimizers. Problem (4.3) can be written as an optimization problem

$$\begin{aligned} \inf _{\nu , \pi _1,\ldots , \pi _N} \sum _{i=1}^N \lambda _i \mathcal {L}^{W_p}(\pi _i) \end{aligned}$$

over measures \(\nu \in \mathcal {P}_p(\mathcal {X})\), and \(\pi _i \in \mathcal {M}(\mathcal {X}\times \mathcal {X})_+\), \(1\le i \le N,\) satisfying the constraints \(\pi _i\in \Pi (\mathcal {X}\times \mathcal {X}; \nu ,\mu _i)\).

When p is even, this can be equivalently written as a generalized moment problem

$$\begin{aligned} \inf _{y, y_1,\ldots , y_N} \sum _{i=1}^N \lambda _i L^{W_p}(y_i) \end{aligned}$$

over sequences that satisfy the constraints \(y_i \in \Pi _{mom}(\mathcal {X}\times \mathcal {X}; y, m(\mu _i))\), \(1\le i \le N.\) Note that the unknown y can be eliminated by imposing that all \(y_i\) have the same left marginal sequence.

When p is odd, the problem (4.3) is equivalent to a generalized moment problem

$$\begin{aligned} \inf _{y, (y_{i,j}^+, y_{i,j}^-)}\sum _{i=1}^N \sum _{j=1}^d \lambda _i L^{W_p}_j(y_{i,j}^+ - y_{i,j}^-) \end{aligned}$$

over sequences satisfying the moment sequence conditions \(y \in MS(\mathcal {X})\), \(y_{i,j}^+ \in MS(\mathcal {A}_j^+)\) and \(y_{i,j}^- \in MS(\overline{\mathcal {A}_j^-})\), \(1\le i \le N, 1\le j \le d\), and the additional constraints \(y_{i,j}^+ + y_{i,j}^- = y_{i,k}^+ + y_{i,k}^-\) for all \(i \in \{1,\ldots ,N\}\) and \( 1\le j<k\le d\), and \(y_{i,j}^+ + y_{i,j}^- \in \Pi _{mom}(\mathcal {X}\times \mathcal {X}; y, m(\mu _i))\) for all \(i \in \{1,\ldots ,N\}\) and \( 1\le j\le d\). Again, the unknown y could be eliminated by imposing that the sums \(y_{i,j}^+ + y_{i,j}^-\) all have the same left marginal sequence.

5 Gromov–Wasserstein discrepancies and barycenters

For some applications such as shape matching or word embedding, an important limitation of classical Wasserstein metrics lies in the fact that they are not invariant to rotations and translations, and more generally to isometries. Moreover, they are only defined for measures on the same ambient space \(\mathcal {X}\). To overcome these limitations, several extensions have been proposed (see, e.g., [39]). We focus here on the so-called Gromov–Wasserstein discrepancies in Euclidean spaces, originally introduced in [40], which have recently attracted a lot of attention from practitioners.

5.1 Gromov–Wasserstein discrepancies

Given two compact semi-algebraic Borel sets \(\mathcal {X}\subset \mathbb {R}^{d_\mathcal {X}}\) and \(\mathcal {Y}\subset \mathbb {R}^{d_\mathcal {Y}}\), two probability measures \(\mu \in \mathcal {P}(\mathcal {X})\) and \(\nu \in \mathcal {P}(\mathcal {Y})\), and two cost functions \(c_\mathcal {X}: \mathcal {X}\times \mathcal {X}\rightarrow \mathbb {R}\) and \(c_\mathcal {Y}: \mathcal {Y}\times \mathcal {Y}\rightarrow \mathbb {R}\), we define for \(p\in \mathbb {N}^*\) a Gromov–Wasserstein discrepancy \(GW_{p}\) between measures \(\mu \) and \(\nu \) as

$$\begin{aligned} GW_p^p(c_\mathcal {X},c_{\mathcal {Y}} , \mu ,\nu ) = \inf _{\pi \in \Pi (\mathcal {X}\times \mathcal {Y}; \mu ,\nu )} \mathcal {L}^{GW_p}(\pi ) \end{aligned}$$
(5.1)

where the loss function \(\mathcal {L}^{GW_p}: \mathcal {M}(\mathcal {X}\times \mathcal {Y})_+ \rightarrow \mathbb {R}\) is such that

$$\begin{aligned} \mathcal {L}^{GW_p}(\pi ) = \int _{\mathcal {X}\times \mathcal {Y}} \int _{\mathcal {X}\times \mathcal {Y}} \vert c_\mathcal {X}( \textbf{x}, \textbf{x}' ) -c_\mathcal {Y}( \textbf{y}, \textbf{y}' ) \vert ^p d\pi (\textbf{x},\textbf{y}) d\pi (\textbf{x}',\textbf{y}'). \end{aligned}$$

Note that this problem is quadratic in \(\pi \); existence of minimizers of (5.1) is guaranteed under mild assumptions, using weak lower semi-continuity and compactness, similarly to the classical Wasserstein problem (see [41, Prop. 3.1]). It can alternatively be expressed as a linear problem with a rank-one tensor constraint in the augmented space

$$\begin{aligned} \mathcal {Z}:= \mathcal {X}\times \mathcal {Y}\times \mathcal {X}\times \mathcal {Y}\end{aligned}$$

which can be identified with a basic semi-algebraic set of \(\mathbb {R}^{2n}\) with \(n= d_\mathcal {X}+ d_\mathcal {Y}\). Using the space \(\mathcal {Z}\), we can write

$$\begin{aligned} GW_p^p (c_\mathcal {X},c_{\mathcal {Y}} , \mu ,\nu ) = \inf _{\begin{array}{c} \gamma = \pi \otimes \pi \in \mathcal {M}_+(\mathcal {Z}) \\ \pi \in \Pi (\mathcal {X}\times \mathcal {Y}; \mu , \nu ) \end{array}} \mathcal {L}_{\text {aug}}^{GW_p}(\gamma ) \end{aligned}$$

with

$$\begin{aligned} \mathcal {L}_{\text {aug}}^{GW_p}(\gamma ) :=\int _\mathcal {Z}\vert c_\mathcal {X}( \textbf{x}, \textbf{x}' ) -c_\mathcal {Y}( \textbf{y}, \textbf{y}' ) \vert ^p d\gamma (\textbf{x}, \textbf{y}, \textbf{x}',\textbf{y}'), \end{aligned}$$

and we have

$$\begin{aligned} \mathcal {L}_{\text {aug}}^{GW_p}(\pi \otimes \pi ) = \mathcal {L}^{GW_p}(\pi ), \quad \forall \pi \in \mathcal {M}_+(\mathcal {X}\times \mathcal {Y}). \end{aligned}$$

In the particular case where the cost functions \(c_\mathcal {X}= \Vert \cdot - \cdot \Vert _q^q\) and \(c_\mathcal {Y}= \Vert \cdot - \cdot \Vert _q^q\) are related to \(\ell ^q\) norms, for some \(q \in \mathbb {N}^*\), we denote by \(GW_{p,q}(\mu ,\nu )\) the corresponding Gromov–Wasserstein discrepancy and by \(\mathcal {L}^{GW_{p,q}}\) the corresponding loss. Note that the case \( GW_{2,2}\) is of particular practical interest. We now distinguish different cases depending on whether the costs \(c_\mathcal {X}\) and \(c_{\mathcal {Y}}\) are polynomials or not.

5.1.1 Polynomial costs \(c_\mathcal {X}\) and \(c_\mathcal {Y}\)

Here we consider polynomial costs \(c_\mathcal {X}\) and \(c_\mathcal {Y}\). Two cases will be again distinguished.

Case p even. When p is even, the cost \( \vert c_\mathcal {X}( \textbf{x}, \textbf{x}' ) -c_\mathcal {Y}( \textbf{y}, \textbf{y}' ) \vert ^p \) is a polynomial on \(\mathcal {Z}\). Given polynomial expansions of \(c_\mathcal {X}\) and \(c_\mathcal {Y}\), we can deduce a polynomial expansion of their difference

$$\begin{aligned} g(\textbf{z}) = c_\mathcal {X}( \textbf{x}, \textbf{x}' ) -c_\mathcal {Y}( \textbf{y}, \textbf{y}' ) = \sum _{i=1}^N c_i \textbf{z}^{{\varvec{\gamma }}_i}, \quad \forall \textbf{z}=(\textbf{x},\textbf{y},\textbf{x}',\textbf{y}') \in \mathcal {Z}, \end{aligned}$$

with \({\varvec{\gamma }}_i \in \mathbb {N}^{2n}\) and \(c_i\in \mathbb {R}\). Using the multinomial theorem,

$$\begin{aligned} |g(\textbf{z})|^p&= ( c_\mathcal {X}( \textbf{x}, \textbf{x}' ) -c_\mathcal {Y}( \textbf{y}, \textbf{y}' ) )^p \\&= \sum _{\textbf{k} = (k_1,\ldots ,k_N) \in \mathbb {N}^N , \vert \textbf{k}\vert = p } {p \atopwithdelims (){\textbf{k}}} \prod _{i=1}^N c_i^{k_i} \textbf{z}^{k_i{\varvec{\gamma }}_i} \\&:= \sum _{\textbf{k} \in \mathbb {N}^N , \vert \textbf{k} \vert = p } a_{\textbf{k}} \textbf{z}^{{\varvec{\gamma }}_{\textbf{k}}}, \end{aligned}$$

with \({\varvec{\gamma }}_{\textbf{k}}= \sum _{i=1}^N k_i {\varvec{\gamma }}_i \in \mathbb {N}^{2n}\) and \(a_{\textbf{k}}= {p \atopwithdelims (){\textbf{k}}} \prod _{i=1}^N c_i^{k_i} \). For \({\varvec{\gamma }}\in \mathbb {N}^{2n}\), we denote by \({\varvec{\gamma }}^L,{\varvec{\gamma }}^R \in \mathbb {N}^{n}\) the multi-indices such that \({\varvec{\gamma }}= ({\varvec{\gamma }}^L, {\varvec{\gamma }}^R)\). This yields the following expression of the Gromov–Wasserstein loss function in terms of moments

$$\begin{aligned} \mathcal {L}^{GW_p}_{{aug}}(\pi \otimes \pi )&= \sum _{\textbf{k} \in \mathbb {N}^N , \vert \textbf{k} \vert = p } a_{\textbf{k}} m_{{\varvec{\gamma }}_{\textbf{k}}}(\pi \otimes \pi ):= L^{GW_p}_{aug}(m(\pi \otimes \pi )), \end{aligned}$$
(5.2)

with \(L^{GW_p}_{aug}:\mathbb {R}^{\mathbb {N}^{2n}} \rightarrow \mathbb {R}\) a linear functional, or

$$\begin{aligned} \mathcal {L}^{GW_p}(\pi )&= \sum _{\textbf{k} \in \mathbb {N}^N , \vert \textbf{k} \vert = p } a_{\textbf{k}} m_{{\varvec{\gamma }}^L_{\textbf{k}}}(\pi )m_{{\varvec{\gamma }}^R_{\textbf{k}}}(\pi ) := L^{GW_p}(m(\pi )), \end{aligned}$$

with \(L^{GW_p}:\mathbb {R}^{\mathbb {N}^{n}} \rightarrow \mathbb {R}\) a quadratic functional.
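For instance, with \(d_\mathcal {X}= d_\mathcal {Y}= 1\) (so that \(n=2\)), the monomial \(\textbf{z}^{\varvec{\gamma }} = x x'\) corresponds to \({\varvec{\gamma }}= (1,0,1,0)\), i.e. \({\varvec{\gamma }}^L = {\varvec{\gamma }}^R = (1,0)\), and \(m_{\varvec{\gamma }}(\pi \otimes \pi ) = m_{(1,0)}(\pi )^2\): this factorization of the moments of \(\pi \otimes \pi \) into products of moments of \(\pi \) is precisely the source of the quadratic dependence of \(L^{GW_p}\) on \(m(\pi )\).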

When p is even and the costs are polynomials, the Gromov–Wasserstein problem (5.1) can therefore be expressed as a generalized moment problem with quadratic objective function

$$\begin{aligned} GW_p^p(c_\mathcal {X},c_\mathcal {Y}; \mu ,\nu ) = \inf _{y \in \Pi _{mom}} L^{GW_p}(y) \end{aligned}$$

with \(\Pi _{mom}=\Pi _{mom}( \mathcal {X}\times \mathcal {Y}; m(\mu ),m(\nu ))\) the set of sequences satisfying the moment sequence condition \(y\in MS(\mathcal {X}\times \mathcal {Y})\) and marginal conditions, and we easily prove the equivalence between the two problems, following the proof of Theorem 3.1.

Case p odd. When p is odd, we can use a similar strategy as for Wasserstein distances. We introduce two subsets \(\mathcal {A}^+\) and \(\mathcal {A}^-\) of \(\mathcal {Z}\) defined by

$$\begin{aligned} \mathcal {A}^+ = \{\textbf{z}\in \mathcal {Z}: g(\textbf{z}) \ge 0 \}, \quad \mathcal {A}^- = \{\textbf{z}\in \mathcal {Z}: g(\textbf{z}) < 0 \}, \end{aligned}$$

with \(g(\textbf{z}) = c_\mathcal {X}( \textbf{x}, \textbf{x}' ) -c_\mathcal {Y}( \textbf{y}, \textbf{y}' )\) for \(\textbf{z}= (\textbf{x},\textbf{y},\textbf{x}',\textbf{y}').\) The sets are such that \((\mathcal {A}^+, \mathcal {A}^-)\) form a partition of \(\mathcal {Z}\). If \(\mathcal {X}\) and \(\mathcal {Y}\) are basic semi-algebraic sets, then the sets \(\mathcal {A}^+\), \(\overline{\mathcal {A}^-}\) are also basic semi-algebraic sets. For any \(\pi \in \mathcal {P}(\mathcal {X}\times \mathcal {Y})\), we define two measures \(\gamma ^+ = \mathbbm {1}_{\mathcal {A}^+}\pi \otimes \pi \) and \( \gamma ^- = \mathbbm {1}_{\mathcal {A}^- } \pi \otimes \pi ,\) which are such that

$$\begin{aligned} \pi \otimes \pi = \gamma ^+ + \gamma ^-. \end{aligned}$$
(5.3)

Since \(\mathbbm {1}_{\mathcal {A}^-} + \mathbbm {1}_{\mathcal {A}^+} = \mathbbm {1}_\mathcal {Z}\), we can write the Gromov–Wasserstein loss function as

$$\begin{aligned} \mathcal {L}_{\text {aug}}^{GW_p}(\pi \otimes \pi )&= \mathcal {L}_{\text {aug}}^{GW_p}(\gamma ^+ + \gamma ^-) = \int _{\mathcal {Z}} g(\textbf{z})^p (d\gamma ^+(\textbf{z}) - d\gamma ^-(\textbf{z})). \end{aligned}$$

Therefore, following Sect. 3.3, we know that the problem (5.1) is equivalent to the following problem

$$\begin{aligned} \inf _{\pi \in \Pi (\mu ,\nu ), \gamma ^+ \in \mathcal {M}(\mathcal {A}^+) , \gamma ^- \in \mathcal {M}(\overline{\mathcal {A}^-}) } \mathcal {L}^{GW_p}_{aug}(\gamma ^+ + \gamma ^-) \end{aligned}$$

over three measures satisfying the constraint (5.3), or equivalently

$$\begin{aligned} GW_p^p(c_\mathcal {X},c_\mathcal {Y}; \mu ,\nu ) = \inf _{y \in \Pi _{mom}, y^+ \in MS(\mathcal {A}^+ ), y^- \in MS(\overline{\mathcal {A}^-} ) } L^{GW_p}_{aug}(y^+) - L^{GW_p}_{aug}(y^-), \end{aligned}$$

with \(L^{GW_p}_{aug}\) defined by (5.2), and where \(y \in \Pi _{mom}(m(\mu ),m(\nu ))\) satisfies marginal conditions and the moment sequence condition on \(\mathcal {X}\times \mathcal {Y}\), the sequences \(y^+ \in MS(\mathcal {A}^+ )\) and \( y^- \in MS(\overline{\mathcal {A}^-} ) \) satisfy the moment sequence condition on \(\mathcal {A}^+\) and \(\overline{\mathcal {A}^-}\) respectively, and the three sequences satisfy the additional quadratic constraint \(y^+ + y^- = y \otimes y \), or equivalently

$$\begin{aligned} y^+_{{\varvec{\alpha }},{\varvec{\beta }}} + y^-_{{\varvec{\alpha }},{\varvec{\beta }}} = y_{{\varvec{\alpha }}} y_{{\varvec{\beta }}}, \quad \forall {\varvec{\alpha }},{\varvec{\beta }}\in \mathbb {N}^{n}. \end{aligned}$$

5.1.2 Piecewise polynomial costs \(c_\mathcal {X}\) and \(c_\mathcal {Y}\)

The case where \(c_\mathcal {X}\) and \(c_\mathcal {Y}\) are piecewise polynomial functions can be treated by following the general strategy presented in Sect. 3.3. Let us briefly discuss the case of \(GW_{p,q}\) with q odd, where the cost is

$$\begin{aligned} \left| \Vert \textbf{x}- \textbf{x}' \Vert _q^q - \Vert \textbf{y}- \textbf{y}' \Vert _q^q \right| ^p = \left| \sum _{i=1}^d \vert x_i - x_i' \vert ^q - \vert y_i - y_i' \vert ^q \right| ^p:= \vert g(\textbf{z}) \vert ^p. \end{aligned}$$

For p even and q odd, a first strategy is to introduce a partition \(\{\mathcal {A}_{\varvec{\alpha } }: \varvec{\alpha } \in \{-1,1\}^{2d}\}\) with \(2^{2d}\) elements, where

$$\begin{aligned} \mathcal {A}_{\varvec{\alpha }} = \{&\textbf{z}= (\textbf{x},\textbf{y},\textbf{x}',\textbf{y}') \in \mathcal {Z}: \text { for all }\, 1\le i\le d, \\&x_i - x_i' \ge 0 \; \text {if} \; \alpha _i = 1 \; \text {or} \; x_i - x_i'< 0 \; \text {if} \; \alpha _i = -1, \\&y_i - y_i' \ge 0 \; \text {if} \; \alpha _{i+d} = 1 \; \text {or} \; y_i - y_i' < 0 \; \text {if} \; \alpha _{i+d} = -1 \}. \end{aligned}$$

On each element \(\mathcal {A}_{\varvec{\alpha }}\), the cost \( g(\textbf{z}) ^p\) is a polynomial. Therefore, the problem on a single measure \(\pi \) can be reformulated as a problem on \(4^d\) measures \(\pi _{\varvec{\alpha }} = \mathbbm {1}_{\mathcal {A}_{\varvec{\alpha }}} \pi \). For p odd and q odd, we can introduce a partition \(\{\mathcal {A}_{\varvec{\alpha }}^\pm : \varvec{\alpha } \in \{-1,1\}^{2d}\}\) with \(2^{2d+1}\) elements, where \( \mathcal {A}_{\varvec{\alpha }}^+ = \mathcal {A}_{\varvec{\alpha }} \cap \mathcal {B}_{\varvec{\alpha }}^{+} \) and \( \mathcal {A}_{\varvec{\alpha }}^- = \mathcal {A}_{\varvec{\alpha }} \cap \mathcal {B}_{\varvec{\alpha }}^{-} \), with

$$\begin{aligned}&\mathcal {B}_{\varvec{\alpha }}^{+} = \{ \textbf{z}= (\textbf{x},\textbf{y},\textbf{x}',\textbf{y}') \in \mathcal {Z}: \sum _{i=1}^d \alpha _i (x_i - x_i')^q - \alpha _{i+d} (y_i - y_i' )^q \ge 0 \},\\&\mathcal {B}_{\varvec{\alpha }}^{-} = \{ \textbf{z}= (\textbf{x},\textbf{y},\textbf{x}',\textbf{y}') \in \mathcal {Z}: \sum _{i=1}^d \alpha _i (x_i - x_i')^q - \alpha _{i+d} (y_i - y_i' )^q < 0 \}. \end{aligned}$$

The initial problem on a measure \(\pi \) is then reformulated as a problem on \(2^{2d+1}\) measures \(\pi _{\varvec{\alpha }}^\pm \), \(\varvec{\alpha } \in \{-1,1\}^{2d}\).

With the approach above, the number of measures is exponential in d. For p even, in order to reduce the number of measures, an alternative approach is to write the cost as

$$\begin{aligned} g(\textbf{z})^p = \sum _{\textbf{k} \in \mathbb {N}^{2d}, \vert \textbf{k}\vert = p } {p \atopwithdelims (){\textbf{k}}} \prod _{i=1}^d \vert x_i - x_i' \vert ^{qk_i} \prod _{i=1}^{d} (-1)^{k_{i+d}}\vert y_i - y_i' \vert ^{qk_{i+d}}, \end{aligned}$$

and, for each \(\textbf{k} \in \mathbb {N}^{2d} \) with \(\vert \textbf{k}\vert = p\), introduce a partition adapted to the piecewise polynomial \(p_{ \textbf{k}}(\textbf{z}):= \prod _{i=1}^d \vert x_i - x_i' \vert ^{qk_i} \prod _{i=1}^{d} \vert y_i - y_i' \vert ^{qk_{i+d}}\), together with as many measures as the number of elements in the partition. To the piecewise polynomial \(p_{ \textbf{k}}\) is associated a partition composed of at most \(2^{m_{\textbf{k}}}\) elements, with \(m_{\textbf{k}} \le p\) the number of odd entries in \(\textbf{k}\). This yields a reformulation with a number of measures bounded by \(2^p {2d + p \atopwithdelims (){2d}} = O(d^p)\). As an example, for \(p=2\),

$$\begin{aligned} g(\textbf{z})^2 = \sum _{i=1}^d\sum _{j=1}^d \vert x_i - x_i'\vert \vert x_j - x_j'\vert - 2 \sum _{i=1}^d\sum _{j=1}^d \vert x_i - x_i'\vert \vert y_j - y_j'\vert + \sum _{i=1}^d\sum _{j=1}^d \vert y_i - y_i'\vert \vert y_j - y_j'\vert , \end{aligned}$$

which can be reduced to a sum of \(2 d^2 + d\) piecewise polynomials, each of them being associated with a partition composed of 2 or 4 elements. This yields a reformulation in \(O(d^2)\) measures.

5.2 Gromov–Wasserstein barycenters

Using the same notation as in Sect. 4.2, we say that \(\textrm{Bar}(\textrm{Y}_N, \Lambda _N) \in \mathcal {P}(\mathcal {X})\) is a Gromov–Wasserstein barycenter associated to a given set \(\textrm{Y}_N = (\mu _i)_{1\le i\le N}\) of N probability measures in \(\mathcal {P}(\mathcal {Y})\) and to a given set of weights \(\Lambda _N = (\lambda _i)_{1\le i\le N} \) in the simplex \(\Sigma _N\), if and only if \(\textrm{Bar}(\textrm{Y}_N, \Lambda _N)\) is a solution to

$$\begin{aligned} \inf _{\nu \in \mathcal {P}(\mathcal {X})} \sum _{i=1}^N \lambda _i GW_{p}^p(c_\mathcal {X}, c_\mathcal {Y}; \nu ,\mu _i). \end{aligned}$$
(5.4)

Existence of minimizers is guaranteed by [41, Thm. 5.1, 5.2], but uniqueness is not. In practice this is not a real issue: when several global minimizers exist, it is sufficient to compute one of them. The fixed-point algorithm provided below seems to converge to a local minimum, but a proof of convergence remains an open problem. We refer to [42, 43] for further references with theoretical background on Gromov–Wasserstein barycenters.

Problem (5.4) can be written as a quadratic optimization problem

$$\begin{aligned} \inf _{\nu , \pi _1,\ldots , \pi _N} \sum _{i=1}^N \lambda _i \mathcal {L}^{GW_p}(\pi _i) \end{aligned}$$

over measures \(\nu \in \mathcal {P}(\mathcal {X})\) and \(\pi _i \in \mathcal {M}(\mathcal {X}\times \mathcal {Y})_+\), \(1\le i \le N,\) satisfying the constraints \(\pi _i\in \Pi (\mathcal {X}\times \mathcal {Y}; \nu ,\mu _i)\).

When p is even and the costs \(c_\mathcal {X}\) and \(c_\mathcal {Y}\) are polynomials, this can be equivalently written as a generalized moment problem

$$\begin{aligned} \inf _{y , y_1,\ldots , y_N} \sum _{i=1}^N \lambda _i L^{GW_p}(y_i) \end{aligned}$$
(5.5)

over sequences that satisfy moment sequence conditions \(y\in MS(\mathcal {X})\) and \(y_i \in MS(\mathcal {X}\times \mathcal {Y})\), \(1\le i \le N,\) and the additional constraints \(y_i \in \Pi _{mom}(y, m(\mu _i))\) for \(1\le i \le N\).

When p is odd and the costs \(c_\mathcal {X}\) and \(c_\mathcal {Y}\) are polynomials, using the notation of Sect. 5.1.1, we can introduce additional measures \(\gamma _i^+\) and \(\gamma _i^-\), supported on \(\mathcal {A}^+\) and \(\mathcal {A}^-\) respectively, and the problem is reformulated as

$$\begin{aligned} \inf _{y, y_1,\ldots , y_N, y_1^+, \ldots , y_N^-} \sum _{i=1}^N \lambda _i L^{GW_p}_{aug}(y_i^+ - y_i^- ), \end{aligned}$$

with the same constraints as before for \(y, y_1,\ldots ,y_N\), and the additional constraints \(y_i^\pm \in MS(\mathcal {A}^\pm )\) and \(y_i^+ + y_i^- = y_i \otimes y_i\), \(1\le i\le N\), where \(y_i^\pm \) denotes the moment sequence of \(\gamma _i^\pm \).

When the costs \(c_\mathcal {X}\) and \(c_\mathcal {Y}\) are piecewise polynomials, e.g. for \(GW_{p,q}\) with odd q, the problem can still be reformulated as a generalized moment problem up to the introduction of new measures, following Sect. 5.1.2. The derivation is rather technical but straightforward.

6 The moment-SoS hierarchy

All OT problems considered in this paper are of the form

$$\begin{aligned} \rho := \inf _{\pi _1 \in \mathcal {M}(\mathcal {X}_1)_+, \ldots , \pi _M \in \mathcal {M}(\mathcal {X}_M)_+} \mathcal {G}(\pi _1,\ldots ,\pi _M) \end{aligned}$$
(6.1)

under additional constraints

$$\begin{aligned} \mathcal {H}_j(\pi _1,\ldots ,\pi _M) = b_j, \quad j\in \Gamma , \end{aligned}$$

where \(\mathcal {G}\) and \(\mathcal {H}_j\), \(j\in \Gamma ,\) are linear or quadratic functions of a finite set of moments of the measures \(\pi _1, \ldots , \pi _M\), and \(\Gamma \) is a countable set. The constraints include that \(\textrm{mass}({\pi _i}) = m_{0}(\pi _i) \le 1.\) Problem (6.1) can be equivalently formulated as a generalized moment problem

$$\begin{aligned} \rho := \inf _{(y_1,\ldots ,y_M) \in K} G(y_1,\ldots ,y_M) \end{aligned}$$
(6.2)

where K is the set of sequences \(y_1 \in MS(\mathcal {X}_1), \ldots , y_M \in MS(\mathcal {X}_M)\) that satisfy the constraints

$$\begin{aligned} H_j(y_1, \ldots ,y_M) = b_j, \quad j\in \Gamma , \end{aligned}$$

and where the functions \(G: \mathbb {R}^{\mathbb {N}^{n_1}} \times \cdots \times \mathbb {R}^{\mathbb {N}^{n_M}} \rightarrow \mathbb {R}\) and \(H_j: \mathbb {R}^{\mathbb {N}^{n_1}} \times \cdots \times \mathbb {R}^{\mathbb {N}^{n_M}} \rightarrow \mathbb {R}\) are linear or quadratic functions involving only finitely many entries of the input sequences \(y_1, \ldots ,y_M\). The constraints include the conditions \((y_i)_{0} \le 1\) for all \(1\le i \le M\).

The \(\mathcal {X}_i\) are assumed to be compact semi-algebraic sets defined by

$$\begin{aligned} \mathcal {X}_i= \{\textbf{x}_i \in \mathbb {R}^{n_i}: g_{i,j}(\textbf{x}_i) \ge 0, \; 0 \le j \le J_i \}, \end{aligned}$$

for some polynomials \(g_{i,j} \) over \(\mathbb {R}^{n_i}\), where \(g_{i,0}(\textbf{x}_i) = 1\) and \(g_{i,1}(\textbf{x}_i) = R^2 - \Vert \textbf{x}_i\Vert _2^2 \) for \(\textbf{x}_i\in \mathbb {R}^{n_i}\), \(1\le i\le M\), where \(R>0.\) From Theorem 3.4, the moment sequence condition \(y_i \in MS(\mathcal {X}_i)\) is equivalent to the following set of positive semidefinite constraints

$$\begin{aligned} \textbf{M}_{r}(g_{i,j} y_i) \succcurlyeq 0 , \quad \forall j \in \{0,\ldots ,J_i\}, \quad i \in \{1,\ldots ,M\}, \quad r\in \mathbb {N}. \end{aligned}$$

The matrix \(\textbf{M}_{r}(g_{i,j} y_i) \) depends linearly on the entries \((y_i)_{{\varvec{\alpha }}}\) of order \(\vert {\varvec{\alpha }}\vert \le r_{i,j} + 2r\), with \(r_{i,j} = \lceil \deg (g_{i,j})/2 \rceil \). We assume that G only involves moments of order up to \(r_G\), and that each function \(H_j\) only involves moments of order up to \(r_{H_j}\).

Lasserre's (or moment-SoS) approach for solving (6.1) consists in considering a hierarchy of problems

$$\begin{aligned} \rho _r := \inf _{(y_1, \ldots ,y_M) \in K_r} G(y_1,\ldots ,y_M) \end{aligned}$$
(6.3)

where \(K_r\) is the set of sequences \(y_1 \in MS_r(\mathcal {X}_1), \ldots , y_M \in MS_r(\mathcal {X}_M)\) that satisfy the constraints

$$\begin{aligned} H_j(y_1 , \ldots ,y_M) = b_j, \quad j\in \Gamma _r, \end{aligned}$$

with \(\Gamma _r = \{j\in \Gamma : r_{H_j }\le 2 r\}\), and where \(MS_r(\mathcal {X}_i)\) is the set of sequences \(y_i\) that satisfy

$$\begin{aligned} \textbf{M}_{r-r_{i,j}}(g_{i,j} y_i) \succcurlyeq 0 , \quad \forall j \in \{0,\ldots ,J_i\}. \end{aligned}$$
(6.4)

Problem (6.3) is called a relaxation of order r of problem (6.2). These problems are considered for \(r \ge r^*:= \max \{ \lceil r_G/2\rceil ,\max _{i,j} r_{i,j} \}\). They only involve the entries of \(y_1, \ldots ,y_M\) of order at most 2r, and can be formulated over M finite dimensional vectors \(y_i^r \in \mathbb {R}^{\mathbb {N}^{n_i}_{2r}}\), \(1\le i\le M\). Each \(y_i^r\) can then be considered again as an infinite sequence indexed by \(\mathbb {N}^{n_i}\) by completion with zeros.
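To fix ideas, here is a minimal self-contained sketch of a relaxation of order r for the two-marginal \(W_2^2\) problem on \([0,1]^2\), written with the cvxpy Python library; it is not the implementation used in Sect. 8. The marginals are uniform measures given through their exact moments, and the ball constraint \(g_{i,1}\) is replaced by the box constraints \(x(1-x)\ge 0\) and \(y(1-y)\ge 0\); all identifiers are ours.

```python
import cvxpy as cp

r = 3                       # relaxation order
deg = 2 * r                 # moments y_{(i,j)} with i + j <= 2r are kept

def uniform_moment(a, b, k):    # k-th moment of the uniform measure on [a, b]
    return (b**(k + 1) - a**(k + 1)) / ((k + 1) * (b - a))

# one scalar variable per moment y_{(i,j)} of the plan pi
y = {(i, j): cp.Variable() for i in range(deg + 1)
     for j in range(deg + 1 - i)}

def ell(poly):              # linear functional ell_y on a dict {(i,j): coeff}
    return sum(c * y[m] for m, c in poly.items())

def basis(d):               # monomials (i,j) with i + j <= d
    return [(i, j) for i in range(d + 1) for j in range(d + 1 - i)]

def loc_matrix(g, d):       # localizing matrix M_d(g y)
    rows = []
    for b1 in basis(d):
        row = [ell({(m[0] + b1[0] + b2[0], m[1] + b1[1] + b2[1]): c
                    for m, c in g.items()}) for b2 in basis(d)]
        rows.append(cp.hstack(row))
    return cp.vstack(rows)

one = {(0, 0): 1.0}
gx = {(1, 0): 1.0, (2, 0): -1.0}    # x(1 - x) >= 0 defines [0, 1]
gy = {(0, 1): 1.0, (0, 2): -1.0}    # y(1 - y) >= 0

constraints = [loc_matrix(one, r) >> 0,        # M_r(y) psd
               loc_matrix(gx, r - 1) >> 0,     # localizing matrices
               loc_matrix(gy, r - 1) >> 0,
               y[(0, 0)] == 1.0]               # pi is a probability measure
for k in range(1, deg + 1):                    # marginal moments up to order 2r
    constraints += [y[(k, 0)] == uniform_moment(0.0, 0.5, k),
                    y[(0, k)] == uniform_moment(0.5, 1.0, k)]

cost = ell({(2, 0): 1.0, (1, 1): -2.0, (0, 2): 1.0})   # (x - y)^2
rho_r = cp.Problem(cp.Minimize(cost), constraints).solve()
print(rho_r)   # lower bound on W_2^2(mu, nu) = 0.25, tightening with r
```

Consistently with Theorem 6.1 below, each such problem yields a lower bound \(\rho _r \le \rho \) that increases with the relaxation order.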

Theorem 6.1

Problem (6.3) admits a solution for all \(r \ge r^*\). The sequence \((\rho _r)_{r\ge r^*}\) is increasing and \(\rho _r \rightarrow \rho \) as \(r\rightarrow \infty .\) Moreover, from a sequence of solutions \((y_1^r,\ldots ,y_M^r)\) of problems (6.3), we can extract a subsequence \((y_1^{r_k},\ldots ,y_M^{r_k})\) such that for each \(1\le i\le M\) and \({\varvec{\alpha }}\in \mathbb {N}^{n_i}\),

$$\begin{aligned} (y_{i}^{r_k})_{{\varvec{\alpha }}} \rightarrow (y_i)_{{\varvec{\alpha }}} \quad \hbox { as}\ k\rightarrow \infty , \end{aligned}$$

where \((y_1, \ldots ,y_M)\) is a solution of problem (6.2) and admits a representing measure \((\pi _1,\ldots ,\pi _M)\) which solves (6.1). If (6.2) (or equivalently (6.1)) admits a unique solution, then the whole sequence \((y_{i}^{r})_{{\varvec{\alpha }}}\) converges to \((y_i)_{{\varvec{\alpha }}}\) as \(r\rightarrow \infty \), for all \({\varvec{\alpha }}\in \mathbb {N}^{n_i}\), where \((y_1,\ldots ,y_M)\) is the solution of (6.2).

Lemma 6.2

Let \(r\in \mathbb {N}\) and consider a sequence \(y \in \mathbb {R}^{\mathbb {N}^{n}}\). If \(\textbf{M}_r(y) \succcurlyeq 0\) and \(\textbf{M}_{r-1}(g y) \succcurlyeq 0\) with \(g(\textbf{x}) = R^2 - \Vert \textbf{x}\Vert _2^2\), then for all \(0\le k \le r\),

$$\begin{aligned} \vert y_{\varvec{\alpha }}\vert \le y_0 \max \{1, R^{2k}\}, \quad \forall {\varvec{\alpha }}\in \mathbb {N}^n_{2k}. \end{aligned}$$

Proof

Since \(\textbf{M}_r(y) \succcurlyeq 0\) implies \(\textbf{M}_k(y) \succcurlyeq 0\) for all \(0\le k \le r\), we deduce from [34, Prop. 3.6] that

$$\begin{aligned} \vert y_{{\varvec{\alpha }}} \vert \le \max \{y_{0} , \max _{1\le i \le n} \ell _y(x_i^{2k})\}, \quad \forall {\varvec{\alpha }}\in \mathbb {N}^n_{2k} \end{aligned}$$
(6.5)

for all \(0 \le k\le r.\) Moreover, \(\textbf{M}_{r-1}( (R^2 - \Vert \textbf{x}\Vert _2^2)y) \succcurlyeq 0\) is equivalent to \(\ell _y(R^2 f^2) - \sum _{i=1}^n \ell _y(x_i^2 f^2) \ge 0\) for all \(f \in \mathbb {R}[\textbf{x}]_{r-1}\). Taking \(f=1\), and using that \(\ell _y(x_j^2) \ge 0\) for all j since \(\textbf{M}_r(y) \succcurlyeq 0\), we obtain \(\ell _y(x_i^2) \le y_{0} R^2\). Then taking \(f(\textbf{x}) = x_i^{k-1}\) with \(1\le k \le r\), we obtain \(\ell _y(R^2 x_i^{2k-2}) - \sum _{j=1}^n \ell _y(x_j^2 x_i^{2k-2}) \ge 0\), which implies \(\ell _y(x_i^{2k}) \le R^2 \ell _y(x_i^{2k-2})\) and, by induction on k, \(\ell _y(x_i^{2k}) \le R^{2 k} y_0\). Then \(\max _{1\le i \le n} \ell _y(x_i^{2k}) \le y_0 R^{2k}\) and we conclude by using (6.5). \(\square \)

Proof of Theorem 6.1

The proof is adapted from the proof of [34, Theorem 4.3]. We detail it for the sake of completeness. Clearly, \(K_r \supset K_{r+1} \supset \cdots \supset K\) for all \(r\ge r^*\), so that \(\rho _r\) is increasing with r and \(\rho _r \le \rho \). For all \(1\le i\le M\), we have \((y_i^r)_0 \le 1\) and \(g_{i,0}=1\) and \(g_{i,1}=R^2 - \Vert \cdot \Vert ^2_2\). Then from the constraints (6.4) and Lemma 6.2, we deduce

$$\begin{aligned} \vert (y^r_{i})_{{\varvec{\alpha }}} \vert \le \tau _{\omega ({\varvec{\alpha }})}, \quad \omega ({\varvec{\alpha }}) = \lceil \vert {\varvec{\alpha }}\vert /2 \rceil \end{aligned}$$

with \(\tau _k = \max \{1,R^{2k}\}.\) We deduce that \(K_r\) is a compact subset of a finite dimensional space, and from the continuity of G and \(H_j\), \(j\in \Gamma _r\), we deduce that (6.3) admits a solution \(y^r = (y^r_1,\ldots ,y^r_M)\). Now we identify each \(y^r_i\) with a sequence in \(\mathbb {R}^{\mathbb {N}^{n_i}}\) with components \((y^r_i)_{{\varvec{\alpha }}} = 0\) for \(\vert {\varvec{\alpha }}\vert >2r.\) We introduce sequences \({\hat{y}}^r_i \in \mathbb {R}^{\mathbb {N}^{n_i}}\) defined by

$$\begin{aligned} ({\hat{y}}^r_i)_{{\varvec{\alpha }}}:= (y^r_i)_{{\varvec{\alpha }}} / \tau _{\omega ({\varvec{\alpha }})}, \end{aligned}$$

which are such that \(\Vert {\hat{y}}^r_i \Vert _{\ell ^\infty } \le 1\), with \({\hat{y}}^r_i \in c_0:= \{ y \in \mathbb {R}^{\mathbb {N}^{n_i}}: \lim _{\vert {\varvec{\alpha }}\vert \rightarrow \infty } y_{\varvec{\alpha }}= 0\} \subset \ell ^\infty \). Since \(\ell ^\infty \) is the topological dual of \(\ell ^1\), we have by the Banach-Alaoglu theorem that the unit ball \(B_1(\ell ^\infty )\) of \(\ell ^\infty \) is compact in the weak-\(*\) topology \(\sigma (\ell ^\infty ,\ell ^1)\). Therefore, we can extract a subsequence \(({\hat{y}}_i^{r_k})_{k\ge 1}\) of \(({\hat{y}}_i^{r})_{r\ge r^*}\) which converges to some \({\hat{y}}_i \in B_1(\ell ^\infty )\) in the weak-\(*\) topology. In particular, this implies that for all fixed \({\varvec{\alpha }}\in \mathbb {N}^{n_i}\), \(({\hat{y}}^{r_k}_i)_{{\varvec{\alpha }}} \rightarrow ({\hat{y}}_i)_{\varvec{\alpha }}\) as \(k \rightarrow \infty \) and therefore \((y^{r_k}_i)_{{\varvec{\alpha }}} \rightarrow (y_i)_{{\varvec{\alpha }}}\) as \(k\rightarrow \infty \), where \(y_i \in \mathbb {R}^{\mathbb {N}^{n_i}}\) is defined by \((y_i)_{{\varvec{\alpha }}} = ({\hat{y}}_i)_{\varvec{\alpha }}\tau _{\omega ({\varvec{\alpha }})}\).

Since the function \(G(y_1,\ldots ,y_M)\) depends continuously on the finite set of variables \(\{ (y_i)_{{\varvec{\alpha }}}: \vert {\varvec{\alpha }}\vert \le r_G, 1\le i \le M \}\), we deduce that \(\rho _{r_k} = G(y^{r_k}_1,\ldots ,y^{r_k}_M) \rightarrow G(y_1,\ldots ,y_M) \) as \(k\rightarrow \infty \). Also, for a fixed \(j \in \Gamma \), since \(H_j\) depends continuously on the finite set of variables \(\{ (y_i)_{{\varvec{\alpha }}}: \vert {\varvec{\alpha }}\vert \le r_{H_j}, 1\le i \le M \}\), we have that \(H_j(y_1, \ldots , y_M) = \lim _{k \rightarrow \infty } H_j(y_1^{r_k}, \ldots , y_M^{r_k}) = b_j. \) Also, for any \(m \in \mathbb {N}\), since \( \textbf{M}_{m}(g_{i,j} y_i)\) depends continuously on a finite set of variables \(\{ (y_i)_{{\varvec{\alpha }}}: \vert {\varvec{\alpha }}\vert \le r_{i,j} + 2m, 1\le i \le M \}\), and from the closedness of the cone of symmetric positive semidefinite matrices, we deduce that \( \textbf{M}_{m}(g_{i,j} y_i) = \lim _{k\rightarrow \infty } \textbf{M}_m(g_{i,j} y_i^{r_k}) \succcurlyeq 0\). Hence \((y_1,\ldots ,y_M) \in K\) and

$$\begin{aligned} \rho \le G(y_1,\ldots , y_M) = \lim _{k\rightarrow \infty } \rho _{r_k} \le \rho , \end{aligned}$$

which proves that \((y_1,\ldots , y_M)\) is a solution of (6.2). Since \(\rho _r\) is increasing, this implies that the whole sequence \(\rho _r \) converges to \(\rho \) as \(r\rightarrow \infty \). If the solution of (6.2) is unique, then from any subsequence of \(((y_i^r)_{{\varvec{\alpha }}})_{r\ge r^*}\) we can extract a further subsequence that converges to the same limit \((y_i)_{{\varvec{\alpha }}}\), which implies the convergence of the whole sequence. \(\square \)

7 Post-processing

Here we consider the post-processing of the solution of the moment-SoS approach. From the solution of the problem (6.3) of order r, we obtain an approximation \(y^r\) of the moments \(y = m(\mu )\) (up to order 2r) of some probability measure of interest \(\mu \) over a basic semi-algebraic set \(\mathcal {X}\subset \mathbb {R}^n\), which is the target solution of the initial OT problem. We here assume that \(\mu \) is the unique solution of the initial OT problem. By Theorem 6.1, we have that \(y^r_{\varvec{\alpha }}\) converges to \(m_{\varvec{\alpha }}(\mu )\) as \(r\rightarrow \infty \), for each \({\varvec{\alpha }}\in \mathbb {N}^n\).

7.1 Approximation of linear quantities of interest

From approximate moments, we directly obtain estimates of the first statistics of \(\mu \) and its marginals (mean, variance, covariance, ...) or, more generally, of any quantity of interest

$$\begin{aligned} I(g) = \int _\mathcal {X}g(\textbf{x})\, d\mu (\textbf{x}). \end{aligned}$$

For a polynomial \(g = \sum _{\vert {\varvec{\alpha }}\vert \le p} c_{\varvec{\alpha }}\textbf{x}^{\varvec{\alpha }}\in \mathbb {R}[\textbf{x}]_p\), \(I(g) = \ell _{m(\mu )}(g)\) is estimated by

$$\begin{aligned} I_r = \ell _{y^r}(g) = \sum _{\vert {\varvec{\alpha }}\vert \le p} c_{\varvec{\alpha }}y^r_{\varvec{\alpha }}\end{aligned}$$

and we have that \(I_r \rightarrow I(g)\) as \(r \rightarrow \infty \). For a function g which is not a polynomial, the quantity \(I(g)\) can be approximated by \(I_{r,p} = \ell _{y^r}(g_{p})\), where \(g_p = \sum _{\vert {\varvec{\alpha }}\vert \le p} c_{\varvec{\alpha }}\textbf{x}^{\varvec{\alpha }}\in \mathbb {R}[\textbf{x}]_p\) is a polynomial approximation of g, and

$$\begin{aligned} \vert I - I_{r,p} \vert&= \vert \int _\mathcal {X}(g - g_p) d\mu (\textbf{x}) + \ell _{m(\mu ) - y^r}(g_p) \vert \\&\le \Vert g - g_p \Vert _{L^\infty (\mathcal {X})} + \sum _{\vert {\varvec{\alpha }}\vert \le p} \vert c_{\varvec{\alpha }}\vert \vert y^r_{\varvec{\alpha }}- m_{\varvec{\alpha }}(\mu ) \vert , \end{aligned}$$

with \(\Vert g - g_p \Vert _{L^\infty (\mathcal {X})}\) the error of approximation of g by \(g_p\). We have that \(I_{r,p}\) converges to \(I(g)\) as \(r,p \rightarrow \infty \). Studying the rate of convergence of \(I_{r,p}\) to \(I(g)\) requires additional information on the convergence of \(g_p\) and of the approximate moments.
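As an elementary illustration, the following 1-D sketch (notation and toy data ours: the exact moments of the uniform measure on [0, 1] stand in for \(y^r\)) estimates \(I(g)\) for the non-polynomial \(g = \exp \) through a monomial-basis approximation \(g_p\) obtained by least-squares fitting.

```python
import numpy as np

two_r, p = 10, 8
# stand-in for y^r: exact moments of mu = uniform on [0, 1]
y_r = np.array([1.0 / (k + 1) for k in range(two_r + 1)])

xs = np.linspace(0.0, 1.0, 200)
c = np.polynomial.polynomial.polyfit(xs, np.exp(xs), p)  # g_p ~ exp, degree p
I_rp = sum(c_a * y_r[a] for a, c_a in enumerate(c))      # ell_{y^r}(g_p)
print(I_rp, np.e - 1.0)   # compare with the exact I(exp) = e - 1
```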

7.2 Approximation of the support of \(\mu \)

Here, we show how to estimate the support \(S(\mu )\) of \(\mu \) from an approximation of its moments, using the Christoffel function. Note that \(S(\mu )\) is contained in the basic semi-algebraic set \(\mathcal {X}\). This methodology was originally proposed in [30]; it is presented and analysed in a statistical setting in [44, 45].

For \(r\in \mathbb {N}\), we denote by \(\Pi _r^n = \mathbb {R}[\textbf{x}]_r\) the space of polynomials over \(\mathbb {R}^n\) of degree at most r. We let \(\varvec{\phi }_r(\textbf{x}) = (\textbf{x}^{\varvec{\alpha }})_{{\varvec{\alpha }}\in \mathbb {N}^n_r} \in \mathbb {R}^{s(r)} \) be the vector of monomials of degree at most r, with \(s(r):= \binom{n + r}{r} = \# \mathbb {N}^n_r = \dim \Pi _r^n.\) For any \(r \in \mathbb {N}\), the moment matrix \(\textbf{M}_r(\mu ) \in \mathbb {R}^{s(r) \times s(r)}\) of \(\mu \), with moments up to order 2r, is given by

$$\begin{aligned} \textbf{M}_r(\mu ) = \int _\mathcal {X}\varvec{\phi }_r(\textbf{x}) \varvec{\phi }_r(\textbf{x})^T d\mu (\textbf{x}), \end{aligned}$$

which is the Gram matrix in \(L^2_\mu (\mathcal {X})\) of the canonical basis of \(\Pi _r^n.\) For two polynomials \(g(\textbf{x}) = \varvec{\phi }_r(\textbf{x})^T \textbf{a} \) and \(h(\textbf{x}) = \varvec{\phi }_r(\textbf{x})^T \textbf{b} \) in \( \mathbb {R}[\textbf{x}]_r\) with coefficients \(\textbf{a}, \textbf{b} \in \mathbb {R}^{s(r)}\), we have \(\textbf{a}^T \textbf{M}_r(\mu ) \textbf{b} = \int _\mathcal {X}h(\textbf{x}) g(\textbf{x}) d\mu (\textbf{x}),\) which is the inner product of g and h in \(L^2_\mu (\mathcal {X})\). In practice, an approximation of this moment matrix can be obtained from the solution \(y^r\) of a relaxation of order r, or from a solution \(y^{{\tilde{r}}}\) of higher order \({\tilde{r}} \ge r\) in order to get a better estimation.

Non-degenerate case: Let us first consider the case where \(S(\mu )\) is not contained in a proper real algebraic subset of \(\mathcal {X}\). In other words, for any polynomial \(p\in \mathbb {R}[\textbf{x}]\),

$$\begin{aligned} \int _\mathcal {X}p(\textbf{x})^2 d\mu (\textbf{x}) = 0 \quad \text {if and only if} \quad p=0. \end{aligned}$$

This is the case when \(S(\mu )\) has nonzero Lebesgue measure. Hence, \(\textbf{M}_r(\mu )\) is invertible and the finite-dimensional space \(\Pi ^n_r\) of polynomials of degree at most r is a reproducing kernel Hilbert space in \(L^2_\mu \), whose kernel, called the Christoffel–Darboux kernel, is given for \(\textbf{x},\textbf{y}\in \mathbb {R}^n\) by (see [46])

$$\begin{aligned} \kappa _{\mu ,r}(\textbf{x},\textbf{y}) = \sum _{i=1}^{s(r)} \varphi _i(\textbf{x}) \varphi _i(\textbf{y}) \end{aligned}$$

where \((\varphi _{1}, \ldots , \varphi _{s(r)})\) is any orthonormal basis of \(\Pi ^n_r\). It can also be written

$$\begin{aligned} \kappa _{\mu ,r}(\textbf{x},\textbf{y}) = \varvec{\phi }_r(\textbf{x})^T \textbf{M}_r(\mu )^{-1} \varvec{\phi }_r(\textbf{y}). \end{aligned}$$

The Christoffel function \(\Lambda _{\mu ,r}\) is defined for \(\textbf{x}\in \mathbb {R}^n\) by

$$\begin{aligned} \Lambda _{\mu ,r}(\textbf{x})= \inf \left\{ \int _\mathcal {X}p(\textbf{y})^2 d\mu (\textbf{y}): p \in \Pi _r^n, \; p(\textbf{x})=1\right\} . \end{aligned}$$

In the present non-degenerate case, we have, for all \(\textbf{x}\in \mathbb {R}^n\),

$$\begin{aligned} \Lambda _{\mu ,r}(\textbf{x}) = \kappa _{\mu ,r}(\textbf{x},\textbf{x})^{-1}. \end{aligned}$$

The support is then approximated by the set

$$\begin{aligned} S_r(\mu ) = \{ \textbf{x}\in \mathcal {X}: \Lambda _{\mu ,r}(\textbf{x}) \ge \gamma _r\}, \end{aligned}$$
(7.1)

for some suitably chosen \(\gamma _r\). Since \(\Lambda _{\mu ,r}(\textbf{x})\ge \gamma _r\) is equivalent to the polynomial inequality \(\kappa _{\mu ,r}(\textbf{x},\textbf{x})\le \gamma _r^{-1}\), \(S_r(\mu )\) is a polynomial sublevel set in \(\mathcal {X}\).

From the Markov inequality, we have that

$$\begin{aligned} \mu (\mathcal {X}\setminus S_r(\mu )) = \mu (\{\textbf{x}: \kappa _{\mu ,r}(\textbf{x},\textbf{x}) > \gamma _r^{-1}\}) \le \gamma _r \int _\mathcal {X}\kappa _{\mu ,r}(\textbf{x},\textbf{x}) d\mu (\textbf{x}) = \gamma _r {s(r)}. \end{aligned}$$

Therefore, by choosing \(\gamma _r = \eta /s(r)\), we guarantee that \(\mu (S_r(\mu )) \ge 1-\eta \), that is, \(S_r(\mu )\) contains at least a fraction \(1-\eta \) of the mass of \(\mu .\)

When the measure is absolutely continuous with respect to the Lebesgue measure \(\lambda \), it is proven in [47] that \(S_r(\mu )\), with a suitable choice of the sequence \(\gamma _r\), converges to \(S(\mu )\) in the Hausdorff distance. Also, for a point \(\textbf{x}\notin S(\mu )\), \(\Lambda _{\mu ,r}(\textbf{x})^{-1}\) grows exponentially with r, while for a point \(\textbf{x}\in S(\mu )\) it only grows polynomially. A heuristic approach then consists in estimating the growth rate from several values of r in order to decide whether \(\textbf{x}\) is in the support or not.
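The following numpy sketch illustrates the procedure on a 1-D toy example (the choices of \(\mu \), r and \(\eta \) are ours): it builds the moment matrix of the uniform measure on [0.2, 0.8], evaluates \(\Lambda _{\mu ,r}\) through \(\kappa _{\mu ,r}\), and thresholds at \(\gamma _r = \eta /s(r)\).

```python
import numpy as np
from math import comb

n, r, eta = 1, 6, 0.1
a, b = 0.2, 0.8            # mu = uniform on [a, b]; exact moments below
m = [(b**(k+1) - a**(k+1)) / ((k+1) * (b - a)) for k in range(2*r + 1)]

M = np.array([[m[i + j] for j in range(r + 1)] for i in range(r + 1)])
M_inv = np.linalg.inv(M)   # M_r(mu) is invertible in this non-degenerate case

def christoffel(x):        # Lambda_{mu,r}(x) = kappa_{mu,r}(x, x)^{-1}
    phi = np.array([x**k for k in range(r + 1)])
    return 1.0 / (phi @ M_inv @ phi)

s_r = comb(n + r, r)       # s(r) = dim Pi_r^n
gamma_r = eta / s_r        # threshold guaranteeing mu(S_r(mu)) >= 1 - eta
xs = np.linspace(0.0, 1.0, 501)
S_r = xs[np.array([christoffel(x) for x in xs]) >= gamma_r]
print(S_r.min(), S_r.max())   # roughly brackets the true support [0.2, 0.8]
```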

Singular case: We now consider the case where the support of \(\mu \) is contained in a proper algebraic set, which results in a singular moment matrix \(\textbf{M}_r(\mu )\); we follow [44] for the definition of an approximate support.

We let V be the Zariski closure of \(S(\mu )\), which is the smallest algebraic set containing \(S(\mu ).\) We denote by \(\mathcal {I}_r\) the subspace of polynomials in \(\Pi _r^n\) that vanish on V, which is the set of polynomials \(p \in \Pi ^n_r\) satisfying \(\int p(\textbf{x})^2 d\mu = 0\). The quotient space \(\Pi ^n_r / \mathcal {I}_r\) is a reproducing kernel Hilbert space in \(L^2_\mu (\mathcal {X})\) with dimension \(r' = \textrm{rank}(\textbf{M}_r(\mu )) \), with kernel

$$\begin{aligned} \kappa _{\mu ,r} (\textbf{x},\textbf{y})= \sum _{i=1}^{r'} \varphi _i(\textbf{x}) \varphi _i(\textbf{y}) \end{aligned}$$

where \(\varphi _1,\ldots ,\varphi _{r'}\) is an orthonormal basis of \(\Pi ^n_r / \mathcal {I}_r\) in \(L^2_\mu \). This kernel can be obtained by

$$\begin{aligned} \kappa _{\mu ,r} (\textbf{x},\textbf{y}) = \varvec{\phi }_r(\textbf{x})^T \textbf{M}_r(\mu )^{\dagger } \varvec{\phi }_r(\textbf{y}), \end{aligned}$$

with \(\textbf{M}_r(\mu )^{\dagger } \) the Moore-Penrose pseudo-inverse of \(\textbf{M}_r(\mu )\) (of rank \(r'\)), which can be expressed as \(\textbf{M}_r(\mu )^{\dagger } = \sum _{i=1}^{r'} \lambda _i^{-1} \textbf{v}_i \textbf{v}_i^T\) given a spectral decomposition \(\textbf{M}_r(\mu ) = \sum _{i=1}^{r'} \lambda _i \textbf{v}_i \textbf{v}_i^T\) with orthonormal eigenvectors \(\textbf{v}_i\) and corresponding nonzero eigenvalues \(\lambda _i\).
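In floating-point arithmetic, the pseudo-inverse is computed from an eigenvalue decomposition with a rank cut-off; a small numpy sketch (the tolerance is our choice):

```python
import numpy as np

def kappa(M, phi, tol=1e-10):
    """kappa_{mu,r}(x, x) from a (possibly singular) moment matrix M and
    the monomial vector phi = phi_r(x); `tol` is our rank cut-off."""
    lam, V = np.linalg.eigh(M)                 # spectral decomposition of M_r
    keep = lam > tol * lam.max()               # nonzero eigenvalues (rank r')
    M_dag = (V[:, keep] / lam[keep]) @ V[:, keep].T   # Moore-Penrose inverse
    return phi @ M_dag @ phi
```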

A Christoffel function \(\Lambda _{\mu ,r}\) can still be defined through a variational formulation

$$\begin{aligned} \Lambda _{\mu ,r}(\textbf{x})= \inf \left\{ \int _\mathcal {X}p(\textbf{y})^2 d\mu (\textbf{y}): p \in \Pi _r^n / \mathcal {I}_r, \; p(\textbf{x})=1\right\} . \end{aligned}$$

We still have \(\Lambda _{\mu ,r}(\textbf{x}) = \kappa _{\mu ,r}(\textbf{x},\textbf{x})^{-1}\) for all \(\textbf{x}\in V\), but for \(\textbf{x}\notin V\) the values \(\Lambda _{\mu ,r}(\textbf{x})\) and \(\kappa _{\mu ,r}(\textbf{x},\textbf{x})^{-1}\) differ, which yields two possible definitions of an approximate support \(S_r(\mu )\), using either \(\Lambda _{\mu ,r}\) or \(\kappa _{\mu ,r}(\cdot ,\cdot )^{-1}\), that is, either (7.1) or

$$\begin{aligned} S_r(\mu ) = \{\textbf{x}\in \mathcal {X}: \kappa _{\mu ,r}(\textbf{x},\textbf{x})^{-1} \ge \gamma _r\}. \end{aligned}$$

Practical aspects: The functions \(\kappa _{\mu ,r}\) and \(\Lambda _{\mu ,r}\) are functions of the moment matrix \(\textbf{M}_r(\mu )\) of the true measure \(\mu \). In practice, the measure \(\mu \) is replaced by the approximation \(\mu _{ r}\) obtained from a relaxation of order r, which yields approximate functions \(\kappa _{\mu _r,r}\) and \(\Lambda _{\mu _r,r}\) and corresponding approximate supports \(S_r:= S_r(\mu _r).\) Note that for fixed r, \(\mu \) could also be replaced by the solution \(\mu _{{\tilde{r}}}\) of a relaxation of higher order \({\tilde{r}}\). A quantitative analysis of these approximations is still missing.

7.3 Approximation of the density

If the measure \(\mu \) admits a density f with respect to a known measure \(\nu \) on \(\mathcal {X}\), i.e. \(d\mu (\textbf{x}) = f(\textbf{x}) d\nu (\textbf{x})\), then the Christoffel function can also be used to estimate the density on the support \(S(\mu )\) (or its estimation), as suggested in [47].

Also, the values \(y^r_{\varvec{\alpha }}\) provide approximations of the moments

$$\begin{aligned} m_{\varvec{\alpha }}(\mu ) = \int _\mathcal {X}\textbf{x}^{\varvec{\alpha }}d\mu (\textbf{x}) = \int _{\mathcal {X}} \textbf{x}^{\varvec{\alpha }}f(\textbf{x}) d\nu (\textbf{x}), \end{aligned}$$

that is the inner product of \(f(\textbf{x})\) and \(\textbf{x}^{\varvec{\alpha }}\) in \(L^2_\nu (\mathcal {X})\). Different types of approximations of f can be obtained from this information. In particular, a polynomial approximation \(f_{r,p} = \sum _{\vert {\varvec{\beta }}\vert \le p} a_{\varvec{\beta }}\textbf{x}^{\varvec{\beta }}\) of f, \(p\le 2r\), can then be obtained by solving a weighted least-squares problem

$$\begin{aligned} \min _{(a_{\varvec{\beta }})_{\vert {\varvec{\beta }}\vert \le p}} \sum _{\vert {\varvec{\alpha }}\vert \le 2r} w_{\varvec{\alpha }}\Big \vert y^r_{{\varvec{\alpha }}} - \sum _{\vert {\varvec{\beta }}\vert \le p} (G_\nu )_{{\varvec{\alpha }},{\varvec{\beta }}} a_{\varvec{\beta }}\Big \vert ^2, \end{aligned}$$

where \(G_\nu \) is the Gram matrix in \(L^2_\nu \) with entries \((G_\nu )_{{\varvec{\alpha }},{\varvec{\beta }}} = \ell _{m(\nu )}(\textbf{x}^{\varvec{\alpha }}\textbf{x}^{\varvec{\beta }}) = \int _\mathcal {X}\textbf{x}^{\varvec{\alpha }}\textbf{x}^{\varvec{\beta }}d\nu (\textbf{x}) \). From a computational point of view, the use of the canonical polynomial basis may lead to numerical instabilities and large round-off errors; other polynomial bases (e.g. orthogonal polynomials) should therefore be preferred.
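As a sanity check of the least-squares formulation, the following 1-D sketch (toy data ours: \(\nu \) is the Lebesgue measure on [0, 1], and the exact moments of the density \(f(x)=2x\) stand in for \(y^r\)) recovers the density coefficients; the Hilbert-type matrix \(G_\nu \) it builds illustrates the conditioning issue just mentioned.

```python
import numpy as np

p, two_r = 4, 10
# stand-in for y^r: exact moments of the density f(x) = 2x w.r.t. nu = U[0,1]
y_r = np.array([2.0 / (k + 2) for k in range(two_r + 1)])

# (G_nu)_{a,b} = int_0^1 x^{a+b} dx = 1/(a+b+1): a badly conditioned
# Hilbert-type matrix, a consequence of using the canonical basis
G = np.array([[1.0 / (a + b + 1) for b in range(p + 1)]
              for a in range(two_r + 1)])
w = np.ones(two_r + 1)                          # weights w_alpha
sw = np.sqrt(w)
coef, *_ = np.linalg.lstsq(sw[:, None] * G, sw * y_r, rcond=None)
print(coef)   # close to [0, 2, 0, 0, 0]: recovers f(x) = 2x
```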

Some reformulations of the initial OT problem yield approximations of the moments of the measures \(1_{A_k} \mu \) where the \(A_k\) form a partition of \(\mathcal {X}\). In this case, a local polynomial approximation of the density on \(A_k\) can be computed, which results in a global piecewise polynomial approximation of f over \(\mathcal {X}\).

8 Numerical illustrations

The aim of this section is to illustrate how the method behaves for the computation of Wasserstein distances, barycenters, and Gromov–Wasserstein discrepancies. We also discuss some choices we make in our implementation.

8.1 Wasserstein distances and barycenters

The code used to generate the examples shown here is available at https://gitlab.tue.nl/data-driven/sos-ot.

For our numerical tests, we consider cartoon images as displayed in Fig. 1. The images are \(400\times 400\) pixels, but we view each of them as a uniform measure on a subset \(S\subset [0, 1]^2\), defined by

$$\begin{aligned} d\mu (\textbf{x}) = \frac{1}{|S|} \mathbbm {1}_{S}(\textbf{x})\, \mathrm d\textbf{x}. \end{aligned}$$

The shape and location of the support S vary for each image. We consider three different types of shapes: smileys, stars, and pacmen.

Fig. 1: Sample images considered for the tests

8.1.1 \(W_1\) and \(W_2\) distances

We start by estimating the \(W_1\) and \(W_2\) distances between two translated smileys \(\mu _1,\,\mu _2\), for which we know the exact translation vector \(T=(t_1, t_2)\in \mathbb {R}^2\) (see Fig. 2a). In this simple case, we know that the exact distance is given by

$$\begin{aligned} W_p(\mu _1, \mu _2) = \Vert T \Vert _{\ell _p(\mathbb {R}^2)}, \quad p \in \{1, 2\}, \end{aligned}$$

so we can validate the accuracy of our moment approach. Figure 2b shows the relative errors of the estimates as a function of the relaxation order r. We obtain extremely high accuracy for all relaxation orders; the accuracy obtained with only the first order is particularly remarkable. Figure 2c shows an exponential increase of the runtime as a function of r, which is to be expected given that the number of unknown moments and the size of the semidefinite constraints grow quickly with r. Repeating the same experiment with translated stars and translated pacmen yields similar results, with very high accuracy from the first relaxation order \(r=1\) on.

To confirm that distances are well estimated with low relaxation orders in more general cases, we consider a more complicated example involving the classical Lena image (\(512\times 512\) pixels) and an image of Portland (\(3456\times 5184\) pixels). Both images are depicted in Fig. 3a. Although the images have different resolutions and aspect ratios, we view them as piecewise constant functions on \([0,1]^2\). We estimate their \(W_2\) distance with our moment-SoS approach, and we compare the values to the distances estimated by the geomloss library (see [48]), which estimates \(W_2\) by entropic regularization and the Sinkhorn algorithm. To run the algorithm, we approximate the 2D images with a sum of \(10^{10}\) Dirac masses (taken on a Cartesian grid of \([0, 1]^2\)). We report the results obtained with a regularization parameter of \(10^{-3}\). As Fig. 3b illustrates, both approaches give values that are in very good agreement; their relative error is of order \(10^{-4}\), as Fig. 3c shows. Interestingly, the moment-SoS approach gives a slightly lower distance value than the one obtained with Sinkhorn for \(r=1\), and a slightly larger one for larger relaxation orders. This illustrates the ability of the method to estimate \(W_2\) with the same quality as state-of-the-art algorithms.
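For reference, a comparison of this kind can be set up with a few lines of geomloss; the sketch below is not the script used for Fig. 3 (random stand-ins replace the actual images), and geomloss' normalization conventions for the cost should be checked before comparing the returned value with \(W_2^2\).

```python
import torch
from geomloss import SamplesLoss

# blur plays the role of the regularization parameter used above
loss = SamplesLoss("sinkhorn", p=2, blur=1e-3)

def cloud(img):                       # grayscale image -> weighted point cloud
    h, w = img.shape
    gx, gy = torch.meshgrid(torch.linspace(0, 1, h),
                            torch.linspace(0, 1, w), indexing="ij")
    pts = torch.stack([gx.flatten(), gy.flatten()], dim=1)
    wts = img.flatten().clamp(min=0)
    return wts / wts.sum(), pts

a, x = cloud(torch.rand(64, 64))      # random stand-ins for the two images
b, y = cloud(torch.rand(64, 64))
print(loss(a, x, b, y))               # entropic estimate related to W_2^2
```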

Fig. 2: \(W_1\) and \(W_2\) distances between translated smileys

Fig. 3: \(W_2\) distance between grey-scale images

8.1.2 \(W_2\)-Wasserstein barycenters

We now turn to the computation of barycenters. We present the following test for validation purposes: consider the four translated smileys of Fig. 4a. We know that the \(W_2\) barycenter of these measures with uniform weights (0.25, 0.25, 0.25, 0.25) is equal to the smiley which is located at the center of the other images. Our goal is to study how accurately we can recover that target barycenter with our moment approach.

Since we know the exact barycenter, we also know its exact moments, so we start by examining how accurately they are estimated. Figure 4b shows the relative error in the computation of the first moments as a function of r, and Fig. 4c reports the maximum absolute error in the moment estimation for each order r. We observe that the absolute error decays relatively quickly (we gain about half an order of magnitude per relaxation order); similar observations hold for the relative errors. It would be interesting to examine the trend for larger orders, but this has not been possible with the current implementation, due to conditioning issues and to the use of Mosek as a black-box optimization solver (which prevented us from sparsifying certain variables and operations, a point that is critical to prevent memory overflows when the complexity grows). We leave this implementation point for a future contribution, in which we will also explore strategies to solve optimal transport problems in high dimension.

Fig. 4: Barycenter of four smileys

Fig. 5: Estimation of the support of the barycenter. For each subfigure: Christoffel function (top), and support after thresholding (bottom)

From the moments that our approach provides, we can reconstruct the support of the barycenter by computing the Christoffel function and applying the thresholding techniques discussed in Sect. 7.2. Figure 5a shows the Christoffel function for increasing relaxation orders r, together with the support obtained by thresholding this function with parameter \(\gamma _r=0.3\) for all relaxation orders r. We may note how the estimation of the support improves as r grows. For \(r=4\) and \(r=5\), it is possible to “discover” that the support has several connected components, such as the mouth and the eyes.

We can repeat the same experiment replacing the smileys with stars or pacmen. In this case, we obtain results very similar to those of Fig. 4, so we do not include them for the sake of brevity. We do, however, plot the Christoffel function and the support obtained after thresholding (see Fig. 5b, c). We observe that order \(r=3\) is already enough to discover that the star has five corners. In the case of the pacman, the method gives a fairly accurate estimation of the support for \(r=4\) and \(r=5\), but only a coarse approximation of the mouth (approximating this part better would have required higher relaxation orders).

8.2 Gromov–Wasserstein discrepancy and barycenters

Here we illustrate the computation of Gromov–Wasserstein discrepancies and barycenters. For our numerical tests, we consider empirical measures \(\mu _1\) to \(\mu _4\) associated with happy and sad smileys, see Fig. 6. Each measure corresponds to 1000 independent samples from a mixture of three uniform measures with equal weights 1/3, the first two being supported on the eyes and the third having the mouth as support. The mouth is here an algebraic set with zero Lebesgue measure. Measure \(\mu _2\) (resp. \(\mu _4\)) is the push-forward of \(\mu _1\) (resp. \(\mu _3\)) by an isometry, so that \(GW_{2,2}^2(\mu _1,\mu _2) = GW_{2,2}^2(\mu _3,\mu _4) = 0\). In this section, for the formulation of the moment problems, we relied on the Matlab libraries tensap [49] and GloptiPoly [50].

Fig. 6: Empirical measures \(\mu _1\) to \(\mu _4\) for the examples with Gromov–Wasserstein discrepancies

8.2.1 Gromov–Wasserstein discrepancy \(GW_{2,2}\)

We here illustrate the estimation of the discrepancies \(GW_{2,2}(\mu _i,\mu _j)\). For a given relaxation order r, we initialize the truncated moment sequence \(y^{(0)}\) with the truncated moments \(m(\mu _i \otimes \mu _j)\) of the product measure \(\mu _i \otimes \mu _j\). Then we construct a sequence of truncated moment sequences \(y^{(k)}\), \(k\ge 1\), by a fixed-point algorithm, where \(y^{(k)}\) minimizes \( y\mapsto L_{aug}^{GW_{2,2}}(y \otimes y^{(k-1)})\) over the truncated moment sequences y satisfying the moment sequence condition and the marginal constraints. As shown in Fig. 7, for a given relaxation order, the fixed-point algorithm converges rapidly, after roughly 4 to 5 iterations. Note that for (i, j) equal to (1, 2) and (3, 4), the objective function converges to a plateau of order \(10^{-13}\), very close to zero in double precision.
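The structure of this fixed-point scheme is that of a successive linearization of a quadratic objective. The following schematic cvxpy sketch (not our actual implementation: the moment-sequence and marginal constraints are replaced by a stand-in simplex constraint, and Q is a generic symmetric matrix) shows one way to organize the iteration \(y^{(k)} \in \arg \min _y \, y^\top Q\, y^{(k-1)}\).

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n = 8
A = rng.standard_normal((n, n))
Q = A + A.T                      # generic symmetric Q: y -> y^T Q y quadratic

y_prev = np.ones(n) / n          # analogue of the initialization m(mu_i x mu_j)
for k in range(50):
    y = cp.Variable(n, nonneg=True)
    # linearized problem: minimize y^T Q y_prev over the feasible set
    cp.Problem(cp.Minimize((Q @ y_prev) @ y), [cp.sum(y) == 1]).solve()
    if np.linalg.norm(y.value - y_prev) < 1e-9:
        break                    # fixed point reached (possibly a local minimum)
    y_prev = y.value
print(k, y_prev @ Q @ y_prev)    # value of the quadratic objective
```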

Fig. 7: Convergence of \(L_{aug}^{GW_{2,2}}(y^{(k)} \otimes y^{(k)}) = L^{GW_{2,2}}( y^{(k)}) \) for the estimation of \(GW_{2,2}^2(\mu _1,\mu _2)\), \(GW_{2,2}^2(\mu _3,\mu _4)\), \(GW_{2,2}^2(\mu _1,\mu _4)\), for relaxation order \(r=5\)

Next, we report the estimates of the discrepancies \(GW_{2,2}(\mu _i,\mu _j)\) obtained at convergence of the fixed-point algorithm. Table 1 shows the estimates obtained for different relaxation orders. The estimates of \(GW_{2,2}(\mu _1,\mu _2)\) and \(GW_{2,2}(\mu _3,\mu _4)\) converge slowly with the relaxation order, but are already very small for low relaxation orders. The estimate of \(GW_{2,2}(\mu _1,\mu _4)\) converges rapidly with the relaxation order.

Table 1: Estimations of \(GW_{2,2}(\mu _1,\mu _2)\), \(GW_{2,2}(\mu _3,\mu _4)\) and \(GW_{2,2}(\mu _1,\mu _4)\) for relaxation orders \(r=2\) to 6

8.2.2 Gromov–Wasserstein barycenters \(GW_{2,2}\)

We now turn to the computation of Gromov–Wasserstein barycenters, using the discrepancy \(GW_{2,2}\). We consider the computation of barycenters of the empirical measures \(\mu _1\) and \(\mu _4\) illustrated in Fig. 6. The experiments here are for illustrative purposes.

Fig. 8: Gromov–Wasserstein barycenters: estimated support of the barycenter \(\textrm{Bar}((\mu _1,{\mu _4}),\lambda )\) for \(\lambda \in \{0,0.25,0.5,0.75,1\}\) and different relaxation orders \(r\in \{2,3,4,5\}\)

Fig. 9: Gromov–Wasserstein barycenters: estimated Christoffel function of the barycenter \(\textrm{Bar}((\mu _1,{\mu _4}),\lambda )\) for \(\lambda \in \{0,0.25,0.5,0.75,1\}\) and different relaxation orders \(r\in \{2,3,4,5\}\)

For a given relaxation order r, we have to solve the optimization problem (5.5) over truncated sequences y, \(y_1\) and \(y_2\), where \(y_1\) has as marginals y and the truncated moments of \(\mu _1\), and \(y_2\) has as marginals y and the truncated moments of \(\mu _4\). The objective function can be rewritten as \(\lambda L_{aug}^{GW_{2,2}}(y_1 \otimes y_1) + (1-\lambda ) L_{aug}^{GW_{2,2}}(y_2 \otimes y_2)\), with \(\lambda \in [0,1]\). To solve this optimization problem, we rely on a fixed-point algorithm which constructs sequences of truncated moment sequences \(y^{(k)}\), \(y^{(k)}_1\) and \(y^{(k)}_2\), \(k\ge 1\), such that \((y^{(k)},y^{(k)}_1,y^{(k)}_2)\) minimizes \((y,y_1,y_2) \mapsto \lambda L_{aug}^{GW_{2,2}}(y_1 \otimes y_1^{(k-1)}) + (1-\lambda ) L_{aug}^{GW_{2,2}}(y_2 \otimes y_2^{(k-1)})\) over truncated sequences satisfying the marginal constraints and moment sequence conditions. For the initialization of \(y^{(0)}\), we take the truncated moments of either \(\mu _1\) or \(\mu _4\) (depending on the value of \(\lambda \)), and for \(y^{(0)}_1\) (resp. \(y^{(0)}_2\)) we take the tensor product of \(y^{(0)}\) with the truncated moments of \(\mu _1\) (resp. \(\mu _4\)). This algorithm converges rather slowly and should clearly be improved; however, it allows us to illustrate the potential of the proposed approach. The results are reported at iteration 100.

Figures 8 and 9 illustrate respectively the estimated supports and Christoffel functions of the barycenters \(\textrm{Bar}((\mu _1,{\mu _4}),\lambda )\) for different values of \(\lambda \) and relaxation orders r. We observe a rather fast convergence with r, at least for \(\lambda \notin \{0,1\}\). In Fig. 9, we may note that the interpolations of the smileys appear with different rotations as \(\lambda \) varies.

9 Conclusions

We have laid the theoretical foundations for solving the most common optimal transport problems with a moment-SoS approach. The numerical results show that the method estimates the values of the loss functions with very good accuracy at very low relaxation orders. The support of concentrated measures can also be estimated efficiently with relatively low polynomial degrees. This feature seems particularly appealing, as it could be leveraged to cleverly allocate degrees of freedom in optimal transport solvers that approximate the optimal transport plan.