Optimal Transport
Abstract
In this preliminary chapter, we introduce the problem of optimal transport, which is the main concept behind Wasserstein spaces. General references on this topic are the books by Rachev and Rüschendorf [107], Villani [124, 125], Ambrosio et al. [12], Ambrosio and Gigli [10], and Santambrogio [119]. This chapter includes only a few proofs, given when they are simple, informative, or not easily found in one of the cited references.
1.1 The Monge and the Kantorovich Problems
Perhaps most importantly, a minimiser for the Kantorovich problem exists under weak conditions. In order to show this, we first recall some definitions. Let \(C_b(\mathcal X)\) be the space of real-valued, continuous bounded functions on \(\mathcal X\). A sequence of probability measures \(\{\mu _n\}\subseteq M(\mathcal X)\) is said to converge weakly to \(\mu \in M(\mathcal X)\) if for all \(f\in C_b(\mathcal X)\), \({{\int \! f \, \mathrm {d}\mu _n}} \to {{\int \! f \, \mathrm {d}\mu }} \). To avoid confusion with other types of convergence, we will usually write μ_{n} → μ weakly; in the rare cases where a symbol is needed we shall use the notation \(\mu _n\stackrel {w}\to \mu \). Of course, if μ_{n} → μ weakly and \(\mu _n\in P(\mathcal X)\), then μ must be in \(P(\mathcal X)\) too (this is seen by taking f ≡ 1 and by observing that \({{\int \! f \, \mathrm {d}\mu }} \ge 0\) if f ≥ 0).
A collection of probability measures \(\mathcal K\) is tight if for all 𝜖 > 0 there exists a compact set K such that \(\inf _{\mu \in \mathcal K}\mu (K)>1-\epsilon \). If \(\mathcal K\) is represented by a sequence {μ_{n}}, then Prokhorov’s theorem (Billingsley [24, Theorem 5.1]) states that a subsequence of {μ_{n}} must converge weakly to some probability measure μ.
A remark about terminology is in order. Many authors talk about the Monge–Kantorovich problem or the optimal transport(ation) problem. More often than not, they refer to what we call here the Kantorovich problem. When one of the scenarios presented in Sects. 1.3 and 1.6.1 is considered, this does not result in ambiguity.
1.2 Probabilistic Interpretation
The preceding section was an analytic presentation of the Monge and the Kantorovich problems. It is illuminating, however, to also recast things in probabilistic terms, and this is the topic of this section.
A random element on a complete separable metric space (or any topological space) \(\mathcal X\) is simply a measurable function X from some (generic) probability space \((\varOmega ,\mathcal F,\mathbb {P})\) to \(\mathcal X\) (with its Borel σ-algebra). The probability law (or probability distribution) of X is the probability measure \(\mu _X=X\#\mathbb {P}\) defined on the space \(\mathcal X\); this is the Borel measure satisfying \(\mu _X(A)=\mathbb {P}(X\in A)\) for all Borel sets A.
1.3 The Discrete Uniform Case
There is a special case in which the Monge–Kantorovich problem reduces to a finite combinatorial problem. Although it may seem at first glance to be an oversimplification of the original problem, it is of importance in practice because arbitrary measures can be approximated by discrete measures by means of the strong law of large numbers. Moreover, the discrete case is important in theory as well, as a motivating example for the Kantorovich duality (Sect. 1.4) and the property of cyclical monotonicity (Sect. 1.7).
The Kantorovich problem is a linear program with n^{2} variables and 2n constraints. It must have a solution because B_{n} (hence B_{n}∕n) is a compact (nonempty) set in \(\mathbb {R}^{n^2}\) and the objective function is linear in the matrix elements, hence continuous. (This property is independent of the possibly infinite-dimensional spaces \(\mathcal X\) and \(\mathcal Y\) in which the points lie.) The Monge problem also admits a solution because S_{n} is a finite set. To see that the two problems are essentially the same, we need to introduce the following notion. If B is a convex set, then x ∈ B is an extremal point of B if it cannot be written as a convex combination tz + (1 − t)y for some t ∈ (0, 1) and distinct points y, z ∈ B. It is well known (Luenberger and Ye [89, Section 2.5]) that there exists an optimal solution that is extremal, so that it becomes relevant to identify the extremal points of B_{n}. It is fairly clear that each permutation matrix is extremal in B_{n}; the less obvious converse is known as Birkhoff’s theorem, a proof of which can be found, for instance, at the end of the introduction in Villani [124] or (in a different terminology) in Luenberger and Ye [89, Section 6.5]. Thus, we have:
Proposition 1.3.1 (Solution of Discrete Problem)
There exists σ ∈ S_{n} such that M(σ) minimises C(M) over B_{n}∕n. Furthermore, if {σ_{1}, …, σ_{k}} is the set of optimal permutations, then the set of optimal matrices is the convex hull of {M(σ_{1}), …, M(σ_{k})}. In particular, if σ is the unique optimal permutation, then M(σ) is the unique optimal matrix.
Thus, in the discrete case, the Monge and the Kantorovich problems coincide. One can of course use the simplex method [89, Chapter 3] to solve the linear program, but there are n! vertices, and there is in principle no guarantee that the simplex method solves the problem efficiently. However, the constraint matrix has a very specific form (it contains only zeroes and ones, and is totally unimodular), so specialised algorithms for this problem exist. One of them is the Hungarian algorithm of Kuhn [85] or its variant of Munkres [96] that has a worst-case computational complexity of at most O(n^{4}). Another alternative is the class of network flow algorithms described in [89, Chapter 6]. In particular, the algorithm of Edmonds and Karp [50] has a complexity of at most \(O(n^3\log n)\). This monograph does not focus on computational aspects of optimal transport. This is a fascinating and very active area of contemporary research, and readers are directed to Peyré and Cuturi [103].
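As a small concrete illustration (a brute-force sketch with arbitrary data, not one of the specialised algorithms mentioned above, which are needed as soon as n grows):

```python
from itertools import permutations

def discrete_monge(xs, ys, cost):
    """Solve min over sigma in S_n of (1/n) sum_i cost(x_i, y_{sigma(i)})
    by enumerating all n! permutations; feasible only for small n."""
    n = len(xs)
    best_sigma, best_cost = None, float("inf")
    for sigma in permutations(range(n)):
        total = sum(cost(xs[i], ys[sigma[i]]) for i in range(n)) / n
        if total < best_cost:
            best_sigma, best_cost = sigma, total
    return best_sigma, best_cost

# quadratic cost on the line: matching sorted points to sorted points is optimal
xs = [0.0, 1.0, 2.0]
ys = [0.2, 0.9, 2.4]
sigma, c = discrete_monge(xs, ys, lambda x, y: (x - y) ** 2)
print(sigma, c)  # here the identity permutation wins
```

Since both point sets are listed in increasing order and the cost is convex in x − y, the identity permutation is the minimiser, in line with the one-dimensional theory of Sect. 1.5.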
Remark 1.3.2
The special case described here could have been more precisely called “the discrete uniform case on the same number of points”, as “the discrete case” could refer to any two finitely supported measures μ and ν. In the Monge context, the setup discussed here is the most interesting case, see page 8 in the supplement for more details.
1.4 Kantorovich Duality
The discrete case of Sect. 1.3 is an example of a linear program and thus enjoys a rich duality theory (Luenberger and Ye [89, Chapter 4]). The general Kantorovich problem is an infinitedimensional linear program, and under mild assumptions admits similar duality.
1.4.1 Duality in the Discrete Uniform Case
1.4.2 Duality in the General Case
Lemma 1.4.1 (Functional Constraints)
The proof follows from the fact that (1.2) yields the above equality when φ and ψ are indicator functions. One then uses linearity and approximations to deduce the result.
Theorem 1.4.2 (Kantorovich Duality)
See the Bibliographical Notes for other versions of the duality.
When the cost function is continuous, or more generally, a countable supremum of continuous functions, the infimum is attained (see (1.3)). The existence of maximisers (φ, ψ) is more delicate and requires a finiteness condition, as formulated in Proposition 1.8.1 below.
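To make the duality concrete in the discrete uniform setting of Sect. 1.3, here is a small numerical sketch of weak duality (the data are arbitrary; the dual pair is produced by c-transforms, so it is feasible by construction):

```python
from itertools import permutations

xs = [0.0, 0.7, 1.5, 3.0]
ys = [0.2, 1.0, 1.1, 2.6]
n = len(xs)
c = [[(x - y) ** 2 for y in ys] for x in xs]

# primal optimum: by Birkhoff's theorem it suffices to scan permutation matrices
primal = min(sum(c[i][s[i]] for i in range(n)) / n
             for s in permutations(range(n)))

# a dual-feasible pair: start from an arbitrary phi0 and take c-transforms,
# so that phi[i] + psi[j] <= c[i][j] holds for all i, j by construction
phi0 = [0.3, -0.1, 0.5, 0.0]
psi = [min(c[i][j] - phi0[i] for i in range(n)) for j in range(n)]
phi = [min(c[i][j] - psi[j] for j in range(n)) for i in range(n)]
dual = (sum(phi) + sum(psi)) / n

# weak duality: every feasible dual value bounds the primal optimum from below
print(dual, primal)
```

Strong duality asserts that the supremum over all feasible pairs closes this gap; the sketch only exhibits the easy inequality.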
The next sections are dedicated to more concrete examples that will be used through the rest of the book.
1.5 The OneDimensional Case
Since any distribution can be approximated by continuous distributions, in view of the above discussion, the following result from Villani [124, Theorem 2.18] should not be too surprising.
Theorem 1.5.1 (Optimal Transport in \(\mathbb {R}\))
If the infimum is finite and h is strictly convex, then the optimal transference plan is unique. Furthermore, if F is continuous, then the infimum is attained by the transport map T = G^{−1} ∘ F.
The prototypical choice for h is h(z) = z^{p} with p > 1. This result allows in particular a direct evaluation of the Wasserstein distances for measures on the real line (see Chap. 2).
Note that no regularity is needed in order that the optimal transference plan be unique, unlike in higher dimensions (compare Theorem 1.8.2). The structure of solutions in the concave case (0 < p < 1) is more complicated, see McCann [94].
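In the empirical case with equal sample sizes, T = G^{−1} ∘ F reduces to matching sorted samples. A quick sketch (arbitrary data) compares the quantile coupling with brute force over all permutations:

```python
from itertools import permutations

def quantile_cost(xs, ys, p):
    """Transport cost of the quantile coupling T = G^{-1} o F between two
    empirical measures with the same number of atoms: sorted samples are
    matched in order."""
    n = len(xs)
    return sum(abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys))) / n

xs = [0.5, -1.0, 2.0, 0.0]
ys = [1.5, 0.3, -0.7, 0.9]

qc = quantile_cost(xs, ys, 2)
bf = min(sum(abs(xs[i] - ys[s[i]]) ** 2 for i in range(4)) / 4
         for s in permutations(range(4)))
print(qc, bf)  # they coincide: the monotone coupling is optimal for p > 1
```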
When p = 1, the cost function is convex but not strictly so, and solutions will not be unique. However, the total cost in Theorem 1.5.1 admits another representation that is often more convenient.
Proposition 1.5.2 (Quantiles and Distribution Functions)
The proof is a simple application of Fubini’s theorem; see page 13 in the supplement.
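The distribution-function representation is easy to check numerically for empirical measures (a sketch with arbitrary data; F and G are step functions, so the integral of |F − G| is a finite sum):

```python
def l1_cdf_distance(xs, ys):
    """Exact value of the integral of |F - G| for empirical CDFs F, G:
    both are step functions, constant between consecutive data points."""
    pts = sorted(set(xs) | set(ys))
    total = 0.0
    for a, b in zip(pts, pts[1:]):
        F = sum(x <= a for x in xs) / len(xs)
        G = sum(y <= a for y in ys) / len(ys)
        total += abs(F - G) * (b - a)
    return total

def quantile_cost_p1(xs, ys):
    """(1/n) sum |x_(i) - y_(i)|: the p = 1 cost of the quantile coupling."""
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

xs = [-1.0, 0.0, 0.5, 2.0]
ys = [-0.7, 0.3, 0.9, 1.5]
a, b = l1_cdf_distance(xs, ys), quantile_cost_p1(xs, ys)
print(a, b)  # equal: both representations give 0.375 here
```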
Corollary 1.5.3
1.6 Quadratic Cost
1.6.1 The Absolutely Continuous Case
We begin with the Euclidean case, where \(\mathcal X=\mathcal Y=\mathbb {R}^d\) is endowed with the Euclidean norm \(\|\cdot \|\), and use the Kantorovich duality to obtain characterisations of optimal maps.
Theorem 1.6.1 (Differentiability of Convex Functions)
Let \(f:\mathbb {R}^d\to \mathbb {R}\cup \{\infty \}\) be a convex function with domain \(\mathrm {dom} f=\{x\in \mathbb {R}^d:f(x)<\infty \}\) and let \(\mathcal N\) be the set of points at which f is not differentiable. Then \(\mathcal N\cap \overline {\mathrm {dom} f}\) has Lebesgue measure 0.
Theorem 1.6.1 is usually stated for the interior of domf, denoted int(domf), rather than the closure. But, since A = domf is convex, its boundary has Lebesgue measure zero. To see this, assume first that A is bounded. If intA is empty, then A lies in a lower-dimensional subspace [113, Theorem 2.4]. Otherwise, without loss of generality 0 ∈intA, and then by convexity of A, ∂A ⊆ (1 + 𝜖)A ∖ (1 − 𝜖)A for all 𝜖 > 0, so that Leb(∂A) ≤ [(1 + 𝜖)^{d} − (1 − 𝜖)^{d}]Leb(A) → 0 as 𝜖 → 0. When A is unbounded, write it as ∪_{n}A ∩ [−n, n]^{d}.
Theorem 1.6.2 (Quadratic Cost in Euclidean Spaces)
Let μ and ν be probability measures on \(\mathbb {R}^d\) with finite second moments, and suppose that μ is absolutely continuous with respect to Lebesgue measure. Then the solution to the Kantorovich problem is unique, and is induced from a transport map T that equals μ-almost surely the gradient of a convex function ϕ. Furthermore, the pair (∥x∥^{2}∕2 − ϕ, ∥y∥^{2}∕2 − ϕ^{∗}) is optimal for the dual problem.
Proof
To alleviate the notation we write ϕ instead of \(\widetilde \varphi \). By Proposition 1.8.1, there exists an optimal dual pair (φ, ψ) such that ϕ(x) = ∥x∥^{2}∕2 − φ(x) is convex and lower semicontinuous, and by the discussion in Sect. 1.1, there exists an optimal π. Since ϕ is μ-integrable, it must be finite almost everywhere, i.e., μ(domϕ) = 1. By Theorem 1.6.1, if we define \(\mathcal N\) as the set of non-differentiability points of ϕ, then \(\mathrm {Leb}(\mathcal N\cap \mathrm {dom}\phi )=0\); as μ is absolutely continuous, \(\mu (\mathcal N\cap \mathrm {dom}\phi )=0\) as well. (Here Leb denotes Lebesgue measure.)
We conclude that \(\mu (\mathrm {int}(\mathrm {dom}\phi )\setminus \mathcal N)=1\). In other words, ϕ is differentiable μ-almost everywhere, and so for μ-almost any x, there exists a unique y such that \(\phi (x)+\phi ^*(y)={\left \langle {x},{y}\right \rangle } \), and y = ∇ϕ(x). This shows that π is unique and induced from the transport map ∇ϕ(x). The gradient ∇ϕ is Borel measurable, since each of its coordinates can be written as \(\limsup _{q\to 0,q\in \mathbb {Q}}q^{-1}(\phi (x+qv)-\phi (x))\) for some vector v (in the canonical basis of \(\mathbb {R}^d\)), which is Borel measurable because the limit superior is taken over countably many functions (and ϕ is measurable because it is lower semicontinuous).
1.6.2 Separable Hilbert Spaces
The finite-dimensionality of \(\mathbb {R}^d\) in the previous subsection was only used in order to apply Theorem 1.6.1, so one could hope to extend the results to infinite-dimensional separable Hilbert spaces.
Although there is no obvious parallel of Lebesgue measure (i.e., a translation-invariant measure) on infinite-dimensional Banach spaces, one can still define absolute continuity via Gaussian measures. Indeed, \(\mu \in P(\mathbb {R}^d)\) is absolutely continuous with respect to Lebesgue measure if and only if the following holds: if \(\mathcal N\subset \mathbb {R}^d\) is such that \(\nu (\mathcal N)=0\) for any nondegenerate Gaussian measure ν, then \(\mu (\mathcal N)=0\). This definition can be extended to any separable Banach space \(\mathcal X\) via projections, as follows. Let \(\mathcal X^*\) be the (topological) dual of \(\mathcal X\), consisting of all real-valued, continuous linear functionals on \(\mathcal X\).
Definition 1.6.3 (Gaussian Measures)
A probability measure \(\mu \in P(\mathcal X)\) is a nondegenerate Gaussian measure if for any \(\ell \in \mathcal X^*\setminus \{0\}\) , \(\ell \#\mu \in P(\mathbb {R})\) is a Gaussian measure with positive variance.
Definition 1.6.4 (Gaussian Null Sets and Absolutely Continuous Measures)
A subset \(\mathcal N\subset \mathcal X\) is a Gaussian null set if whenever ν is a nondegenerate Gaussian measure, \(\nu (\mathcal N)=0\) . A probability measure \(\mu \in P(\mathcal X)\) is absolutely continuous if μ vanishes on all Gaussian null sets.
Clearly, if ν is a nondegenerate Gaussian measure, then it is absolutely continuous.
As explained in Ambrosio et al. [12, Section 6.2], a version of Rademacher’s theorem holds in separable Hilbert spaces: a locally Lipschitz function is Gâteaux differentiable except on a Gaussian null set of \(\mathcal X\). Theorem 1.6.2 (and more generally, Theorem 1.8.2) extend to infinite dimensions; see [12, Theorem 6.2.10].
1.6.3 The Gaussian Case
Apart from the one-dimensional case of Sect. 1.5, there is another special case in which there is a unique and explicit solution to the Monge–Kantorovich problem.
Suppose that μ and ν are Gaussian measures on \(\mathbb {R}^d\) with zero means and nonsingular covariance matrices A and B. By Theorem 1.6.2, we know that there exists a unique optimal map T such that T#μ = ν. Since linear pushforwards of Gaussians are Gaussian, it seems natural to guess that T should be linear, and this is indeed the case.
Since T is a linear map that should be the gradient of a convex function ϕ, it must be that ϕ is quadratic, i.e., \(\phi (x)-\phi (0)={\left \langle {x},{Qx}\right \rangle } \) for \(x\in \mathbb {R}^d\) and some matrix Q. The gradient of ϕ at x is (Q + Q^{t})x and the Hessian matrix is Q + Q^{t}. Thus, T = Q + Q^{t} and since ϕ is convex, T must be positive semidefinite.
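The symmetric positive-definite solution can in fact be written in the well-known closed form T = A^{−1∕2}(A^{1∕2}BA^{1∕2})^{1∕2}A^{−1∕2}; this formula is not derived here, and the sketch below simply verifies, for arbitrary 2 × 2 covariance matrices, that the resulting T is a symmetric map with TAT^{t} = B, i.e., T#N(0, A) = N(0, B). The matrix square root uses the 2 × 2 Cayley–Hamilton identity √M = (M + √(det M) I)∕√(tr M + 2√(det M)).

```python
def mul(X, Y):
    """2x2 matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def sqrt_spd(M):
    """Square root of a 2x2 symmetric positive-definite matrix:
    sqrt(M) = (M + sqrt(det M) I) / sqrt(tr M + 2 sqrt(det M))."""
    d = (M[0][0] * M[1][1] - M[0][1] * M[1][0]) ** 0.5
    t = (M[0][0] + M[1][1] + 2 * d) ** 0.5
    return [[(M[0][0] + d) / t, M[0][1] / t],
            [M[1][0] / t, (M[1][1] + d) / t]]

def inv(M):
    """2x2 matrix inverse."""
    d = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]

A = [[2.0, 0.5], [0.5, 1.0]]     # covariance of mu (arbitrary SPD matrix)
B = [[1.0, -0.3], [-0.3, 2.0]]   # covariance of nu (arbitrary SPD matrix)

Ah = sqrt_spd(A)
Ahi = inv(Ah)
T = mul(mul(Ahi, sqrt_spd(mul(mul(Ah, B), Ah))), Ahi)

# T is symmetric positive definite and pushes N(0, A) to N(0, B): T A T^t = B
TAT = mul(mul(T, A), T)          # T is symmetric, so T^t = T
print(T, TAT)
```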
Proposition 1.6.5 (Lower Bound for Quadratic Cost)
Proof
1.6.4 Regularity of the Transport Maps
The optimal transport map T between Gaussian measures on \(\mathbb {R}^d\) is linear, so it is of course very smooth (analytic). The densities of Gaussian measures are analytic too, so that T inherits the regularity of μ and ν. Using the formula for T, one can show that a similar phenomenon takes place in the one-dimensional case. Though we do not have a formula for T at our disposal when μ and ν are general absolutely continuous measures on \(\mathbb {R}^d\), d ≥ 2, it turns out that even in that case, T inherits the regularity of μ and ν if some convexity conditions are satisfied.
To guess what kind of results can be hoped for, let us first examine the case d = 1. Let F and G denote the distribution functions of μ and ν, respectively. Suppose that G is continuously differentiable and that G′ > 0 on some open interval (finite or not) I such that ν(I) = 1. Then the inverse function theorem says that G^{−1} is also continuously differentiable. Recall that the support of a (Borel) probability measure μ (denoted suppμ) is the smallest closed set K such that μ(K) = 1. A simple application of the chain rule (see page 19 in the supplement) gives:
Theorem 1.6.6 (Regularity in \(\mathbb {R}\))
Let \(\mu ,\nu \in P(\mathbb {R})\) possess distribution functions F and G of class C^{k}, k ≥ 1. Suppose further that suppν is an interval I (possibly unbounded) and that G′ > 0 on the interior of I. Then the optimal map is of class C^{k} as well. If F, G ∈ C^{0} are merely continuous, then so is the optimal map.
The assumption on the support of ν is important: if μ is Lebesgue measure on [0, 1] and the support of ν is disconnected, then T cannot even be continuous, no matter how smooth ν is.
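A concrete instance of this obstruction (a sketch: μ is uniform on [0, 1] and ν is uniform on [0, 1∕2] ∪ [3∕2, 2], with the obvious densities): the optimal map T = G^{−1} ∘ F must jump across the gap in suppν.

```python
def F(x):
    """CDF of mu = Uniform[0, 1]."""
    return min(max(x, 0.0), 1.0)

def G_inv(t):
    """Quantile function of nu = uniform on [0, 1/2] union [3/2, 2]."""
    return t if t <= 0.5 else t + 1.0

def T(x):
    """The optimal map T = G^{-1} o F."""
    return G_inv(F(x))

print(T(0.499), T(0.501))  # T jumps by about 1 at x = 1/2
```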
Theorem 1.6.7 (Regularity of Transport Maps)
 1.
If Ω_{1} and Ω_{2} are bounded and f, g are bounded below, then ϕ is strictly convex and of class C^{1, α}(Ω_{1}) for some α > 0.
 2.
If \(\varOmega _1=\varOmega _2=\mathbb {R}^d\) and f, g ∈ C^{0, α}, then ϕ ∈ C^{2, α}(Ω_{1}).
If in addition f, g ∈ C^{k, α}, then ϕ ∈ C^{k+2, α}(Ω_{1}).
In other words, the optimal map T = ∇ϕ ∈ C^{k+1, α}(Ω_{1}) is one derivative smoother than the densities, so has the same smoothness as the measures μ, ν.
Theorem 1.6.7 will be used in two ways in this book. Firstly, it is used to derive criteria for a Karcher mean of a collection of measures to be the Fréchet mean of that collection (Theorem 3.1.15). Secondly, it allows one to obtain very smooth estimates for the transport maps. Indeed, any two measures μ and ν can be approximated by measures satisfying the second condition: one can approximate them by discrete measures using the law of large numbers and then employ a convolution with, e.g., a Gaussian measure (see, for instance, Theorem 2.2.7). It is not obvious that the transport maps between the approximations converge to the transport maps between the original measures, but we will see this to be true in the next section.
1.7 Stability of Solutions Under Weak Convergence
In this section, we discuss the behaviour of the solution to the Monge–Kantorovich problem when the measures μ and ν are replaced by approximations μ_{n} and ν_{n}. Since any measure can be approximated by discrete measures or by smooth measures, this allows us to benefit from both worlds. On the one hand, approximating μ and ν with discrete measures leads to the finite discrete problem of Sect. 1.3 that can be solved exactly. On the other hand, approximating μ and ν with Gaussian convolutions thereof leads to very smooth measures (at least on \(\mathbb {R}^d\)), and so the regularity results of the previous section imply that the respective optimal maps will also be smooth. Finally, in applications, one would almost always observe the measures of interest μ and ν with a certain amount of noise, and it is therefore of interest to control the error introduced by the noise. In image analysis, μ can represent an image that has undergone blurring, or some other perturbation (Amit et al. [13]). In other applications, the noise could be due to sampling variation, where instead of μ one observes a discrete measure μ_{N} obtained from realisations X_{1}, …, X_{N} of random elements with distribution μ as \(\mu _N=N^{-1}\sum _{i=1}^N\delta \{X_i\}\) (see Chap. 4).
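For instance (a sketch with μ = Uniform[0, 1], so that everything is computable in closed form), the sampling error can be quantified by the p = 1 cost of Sect. 1.5, the integral of |F_{N} − F|, which is an exact finite sum for an empirical measure:

```python
import random

def seg(c, a, b):
    """Integral of |c - t| over [a, b] for constants a <= b."""
    if c <= a:
        return ((b - c) ** 2 - (a - c) ** 2) / 2
    if c >= b:
        return ((c - a) ** 2 - (c - b) ** 2) / 2
    return ((c - a) ** 2 + (b - c) ** 2) / 2

def w1_to_uniform(sample):
    """Integral over [0, 1] of |F_N - F|, where F_N is the empirical CDF
    of the sample and F(t) = t: F_N equals k/N between order statistics."""
    s = [0.0] + sorted(sample) + [1.0]
    N = len(sample)
    return sum(seg(k / N, s[k], s[k + 1]) for k in range(N + 1))

random.seed(0)
err_small = w1_to_uniform([random.random() for _ in range(20)])
err_large = w1_to_uniform([random.random() for _ in range(2000)])
print(err_small, err_large)  # the error is typically of order N^(-1/2)
```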
In Sect. 1.7.1, we will see that the optimal transference plan π depends continuously on μ and ν. With this result under one’s belt, one can then deduce an analogous property for the optimal map T from μ to ν given some regularity of μ, as will be seen in Sect. 1.7.2.
We shall assume throughout this section that μ_{n} → μ and ν_{n} → ν weakly, which, we recall, means that \({{\int _{\mathcal X} \! {f} \, \mathrm {d}{\mu _n}}}\to {{\int _{\mathcal X} \! f \, \mathrm {d}\mu }}\) for all continuous bounded \(f:\mathcal X\to \mathbb {R}\). The following equivalent definitions for weak convergence will be used not only in this section, but elsewhere as well.
Lemma 1.7.1 (Portmanteau)
The following statements are equivalent:
 1.
μ_{n} → μ weakly;
 2.
F_{n}(x) → F(x) for any continuity point x of F. Here \(\mathcal X=\mathbb {R}^d\), F_{n} is the distribution function of μ_{n} and F is that of μ;
 3.
for any open \(G\subseteq \mathcal X\) , \(\liminf \mu _n(G)\ge \mu (G)\) ;
 4.
for any closed \(F\subseteq \mathcal X\) , \(\limsup \mu _n(F)\le \mu (F)\) ;
 5.
\({{\int \! h \, \mathrm {d}{\mu _n}}}\to {{\int \! h \, \mathrm {d}\mu }}\) for any bounded measurable h whose set of discontinuity points is a μ-null set.
For a proof, see, for instance, Billingsley [24, Theorem 2.1]. The equivalence with the last condition can be found in Pollard [104, Section III.2].
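A minimal numerical illustration (μ_{n} = δ_{1∕n} → δ_{0} weakly) shows that the inequalities for open and closed sets can be strict:

```python
# mu_n = point mass at 1/n; the weak limit mu is the point mass at 0
ns = range(1, 101)

# bounded continuous f: the integrals f(1/n) converge to f(0) = 0
f = lambda x: min(abs(x), 1.0)
tail = [f(1.0 / n) for n in ns][-1]            # integral of f against mu_100

# open G = (0, inf): liminf mu_n(G) = 1 > 0 = mu(G) (strict inequality)
open_vals = [1.0 if 1.0 / n > 0 else 0.0 for n in ns]

# closed F = {0}: limsup mu_n(F) = 0 < 1 = mu(F) (strict inequality)
closed_vals = [1.0 if 1.0 / n == 0 else 0.0 for n in ns]

print(tail, min(open_vals), max(closed_vals))
```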
1.7.1 Stability of Transference Plans and Cyclical Monotonicity
In this subsection, we state and sketch the proof of the fact that if μ_{n} → μ and ν_{n} → ν weakly, then the optimal transference plans π_{n} ∈ Π(μ_{n}, ν_{n}) converge to an optimal π ∈ Π(μ, ν). The result, as stated in Villani [125, Theorem 5.20], is valid on complete separable metric spaces with general cost functions, and reads as follows.
Theorem 1.7.2 (Weak Convergence and Optimal Plans)
then (π_{n}) is a tight sequence and each of its weak limits π ∈ Π(μ, ν) is optimal.
One can even let c vary with n under some conditions.
Definition 1.7.3 (Cyclically Monotone Sets and Measures)
A probability measure π on \(\mathcal X^2\) is cyclically monotone if there exists a cyclically monotone Borel set Γ such that π(Γ) = 1.
Proposition 1.7.4 (Optimal Plans Are Cyclically Monotone)
Let \(\mu ,\nu \in P(\mathcal X)\) and suppose that the cost function c is nonnegative and continuous. Assume that the optimal π ∈ Π(μ, ν) has a finite total cost. Then suppπ is cyclically monotone. In particular, π is cyclically monotone.
Thus, optimal transference plans π solve infinitely many discrete Monge–Kantorovich problems emanating from their support. More precisely, for any finite collection (x_{i}, y_{i}) ∈suppπ, i = 1, …, N and any permutation σ ∈ S_{N}, (1.7) is satisfied. Therefore, the identity permutation is optimal between the measures \((1/N)\sum \delta \{x_i\}\) and \((1/N)\sum \delta \{y_i\}\).
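This discrete characterisation is directly checkable by brute force over sub-collections (a sketch with arbitrary data):

```python
from itertools import combinations, permutations

def is_cyclically_monotone(pairs, cost):
    """Check that for every finite sub-collection of the support and every
    permutation sigma, the identity matching is at least as cheap (cf. (1.7))."""
    for k in range(2, len(pairs) + 1):
        for sub in combinations(pairs, k):
            base = sum(cost(x, y) for x, y in sub)
            for s in permutations(range(k)):
                if sum(cost(sub[i][0], sub[s[i]][1]) for i in range(k)) < base - 1e-12:
                    return False
    return True

c = lambda x, y: (x - y) ** 2
good = [(0.0, 0.1), (1.0, 1.2), (2.0, 1.9)]   # monotone matching: optimal
bad = [(0.0, 1.2), (1.0, 0.1), (2.0, 1.9)]    # crossed pairs: not optimal
print(is_cyclically_monotone(good, c), is_cyclically_monotone(bad, c))
```

The monotone support passes, while the crossed pairs fail already on a two-point sub-collection.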
In the same spirit as Γ defined above for the discrete case, one can strengthen Proposition 1.7.4 and prove existence of a cyclically monotone set Γ that includes the support of any optimal transference plan π: take Γ = ∪supp(π) for π optimal.
The converse of Proposition 1.7.4 also holds.
Proposition 1.7.5 (Cyclically Monotone Plans Are Optimal)
Let \(\mu ,\nu \in P(\mathcal X)\), \(c:\mathcal X^2\to \mathbb {R}_+\) continuous and π ∈ Π(μ, ν) a cyclically monotone measure with C(π) finite. Then π is optimal in Π(μ, ν).
The argument for more general costs follows similar lines and is sketched at the end of this subsection.
Given these intermediary results, it is now instructive to prove Theorem 1.7.2.
Proof (Proof of Theorem 1.7.2)
Theorem 1.7.6 (Rockafellar [112])
A nonempty \(\varGamma \subseteq \mathcal X^{2}\) is cyclically monotone with respect to the quadratic cost if and only if it is included in the graph of the subdifferential of a lower semicontinuous convex function that is not identically infinite.
Secondly, we have not used the Kantorovich duality at all, merely its weak form. The machinery of cyclical monotonicity can be used in order to prove the duality Theorem 1.4.2. This is indeed the strategy of Villani [125, Chapter 5], who explains its advantage with respect to Hahn–Banach-type duality proofs.
1.7.2 Stability of Transport Maps
We now extend the weak convergence of π_{n} to π of the previous subsection to convergence of optimal maps. Because of the applications we have in mind, we shall work exclusively in the Euclidean space \(\mathcal X=\mathbb {R}^d\) with the quadratic cost function; our results can most likely be extended to more general situations.
In this setting, we know that optimal plans are supported on graphs of subdifferentials of convex functions. Suppose that π_{n} is induced by T_{n} and π is induced by T. Then in some sense, the weak convergence of π_{n} to π yields convergence of the graphs of T_{n} to the graph of T. Our goal is to strengthen this to uniform convergence of T_{n} to T. Roughly speaking, we show the following: there exists a set A with μ(A) = 1 and such that T_{n} converge uniformly to T on every compact subset of A. For the reader’s convenience, we give a userfriendly version here; a more general statement is given in Proposition 1.7.11 below.
Theorem 1.7.7 (Uniform Convergence of Optimal Maps)
Let μ_{n}, μ be absolutely continuous measures with finite second moments on an open convex set \(U\subseteq \mathbb {R}^d\) such that μ_{n} → μ weakly, and let ν_{n} → ν weakly with \(\nu _n,\nu \in P(\mathbb {R}^d)\) with finite second moments. If T_{n} and T are continuous on U and C(T_{n}) is bounded uniformly in n, then
for any compact Ω ⊆ U.
Since T_{n} and T are only defined up to Lebesgue null sets, it will be more convenient to work directly with the subgradients. That is, we view T_{n} and T as setvalued functions that to each \(x\in \mathbb {R}^d\) assign a (possibly empty) subset of \(\mathbb {R}^d\). In other words, T_{n} and T take values in the power set of \(\mathbb {R}^d\), denoted by \(2^{\mathbb {R}^d}\).
Of course, if \(\phi :\mathbb {R}^d\to \mathbb {R}\cup \{\infty \}\) is convex, then u = ∂ϕ is monotone. It follows from Theorem 1.7.6 that u is maximally cyclically monotone (no points can be added to its graph while preserving cyclical monotonicity). It can actually be shown that u is maximally monotone [6, Section 7]. In what follows, we will always work with subdifferentials of convex functions, so unless stated otherwise, u will always be assumed maximally monotone.
Maximally monotone functions enjoy the following very useful continuity property. It is proven in [6, Corollary 1.3] and will be used extensively below.
Proposition 1.7.8 (Continuity at Singletons)
Let \(x\in \mathbb {R}^d\) such that u(x) = {y} is a singleton. Then u is nonempty on some neighbourhood of x and it is continuous at x: if x_{n} → x and y_{n} ∈ u(x_{n}), then y_{n} → y.
Notice that this result implies that if a convex function ϕ is differentiable on some open set \(E\subseteq \mathbb {R}^d\), then it is continuously differentiable there (Rockafellar [113, Corollary 25.5.1]).
Let B_{r}(x_{0}) = {x : ∥x − x_{0}∥ < r} for r ≥ 0 and \(x_0\in \mathbb {R}^d\). The interior of a set \(G\subseteq \mathbb {R}^d\) is denoted by intG and the closure by \(\overline G\). If G is measurable, then LebG denotes the Lebesgue measure of G. Finally, convG denotes the convex hull of G.
We denote the set of points of Lebesgue density of G by G^{den}, that is, the points x for which Leb(G ∩ B_{t}(x))∕Leb(B_{t}(x)) → 1 as t → 0. Here are some facts about G^{den}: clearly, \(\mathrm {int} G\subseteq G^{\mathrm{{den}}}\subseteq \overline G\). Stein and Shakarchi [121, Chapter 3, Corollary 1.5] show that Leb(G ∖ G^{den}) = 0 (and Leb(G^{den} ∖ G) = 0, so G^{den} is very close to G). By the Hahn–Banach theorem, G^{den} ⊆int(conv(G)): indeed, if x is not in int(convG), then there is a separating hyperplane between x and convG ⊇ G, so the density fraction is at most 1∕2 for all t > 0.
The “denseness” of Lebesgue points is materialised in the following result. It is given as an exercise in [121] when d = 1, and the proof can be found on page 27 in the supplement.
Lemma 1.7.9 (Density Points and Distance)
Of course, this result holds for any \(x_0\in \overline G\) if the little o is replaced by big O, since δ is Lipschitz. When x_{0} ∈intG, this is trivial because δ vanishes on intG.
The important part here is the following corollary: for almost all x ∈ G, δ(z) = o(∥z − x∥) as z → x. This can be seen in other ways: since δ is Lipschitz, it is differentiable almost everywhere. If \(x\in \overline G\) and δ is differentiable at x, then ∇δ(x) must be 0 (because δ is minimised there), and then δ(z) = o(∥z − x∥). We just showed that δ is differentiable with vanishing derivative at all Lebesgue points of G. The converse is not true: \(G=\{\pm 1/n\}_{n=1}^\infty \) has no Lebesgue points, but δ(y) ≤ 4y^{2} as y → 0.
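The final claim is easy to probe numerically (a sketch; δ is the distance to G = {±1∕n}, truncated at n ≤ 2000, which is harmless for |y| well above 1∕2000):

```python
def delta(y, N=2000):
    """Distance from y to the truncated set G = {±1/n : n <= N}."""
    return min(abs(abs(y) - 1.0 / n) for n in range(1, N + 1))

for y in [0.0107, 0.0031, 0.00077]:
    print(y, delta(y), 4 * y * y)   # delta(y) <= 4 y^2 near 0
```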
The locality of monotone functions can now be stated as follows. It is proven on page 27 of the supplement.
Lemma 1.7.10 (Local Monotonicity)
Then y^{∗} = y_{0}. In particular, the result is true if the inequality holds on \(G=O\setminus \mathcal N\) with ∅≠O open and \(\mathcal N\) Lebesgue negligible.
With this background on monotone functions at our disposal, we are now ready to state the stability result for the optimal maps. We assume the following.
Assumptions 1
(convergence) μ_{n} → μ and ν_{n} → ν weakly;
(finiteness) the optimal couplings π_{n} ∈ Π(μ_{n}, ν_{n}) satisfy$$\displaystyle \begin{aligned} \limsup_{n\to\infty} \int_{\mathcal X^2} \! \frac 12\|x-y\|^2 \, \mathrm{d}\pi_n(x,y) <\infty; \end{aligned}$$
(unique limit) the optimal π ∈ Π(μ, ν) is unique.
We further denote the subgradients ∂ϕ_{n} and ∂ϕ by u_{n} and u, respectively.
These assumptions imply that π has a finite total cost. This can be shown by the \(\liminf \) argument in the proof of Theorem 1.7.2 but also from the uniqueness of π. As a corollary of the uniqueness of π, it follows that π_{n} → π weakly; notice that this holds even if π_{n} is not unique for any n. We will now translate this weak convergence to convergence of the maximal monotone maps u_{n} to u, in the following form.
Proposition 1.7.11 (Uniform Convergence of Optimal Maps)
In particular, if u is univalued throughout int(E) (so that ϕ ∈ C^{1} there), then uniform convergence holds for any compact Ω ⊂int(E).
if a sequence in the graph of u_{n} converges, then the limit is in the graph of u;
sequences in the graph of u_{n} are bounded if the domain is bounded.
Corollary 1.7.12 (Pointwise Convergence μAlmost Surely)
If in addition μ is absolutely continuous, then u_{n}(x) → u(x) μalmost surely.
Proof
We first claim that \(E\subseteq \overline {\mathrm {dom} u}\). Indeed, for any x ∈ E and any 𝜖 > 0, the ball B = B_{𝜖}(x) has positive measure. Consequently, u cannot be empty on the entire ball, because otherwise \(\mu (B)=\pi (B\times \mathbb {R}^d)\) would be 0. Since domu is almost convex (see the discussion before Assumptions 1), this implies that actually int(convE) ⊆domu.
1.8 Complementary Slackness and More General Cost Functions
It is well-known (Luenberger and Ye [89, Section 4.4]) that the solutions to the primal and dual problems are related to each other via complementary slackness. In other words, the solution of one problem provides a lot of information about the solution of the other. Here, we show that this idea remains true for the Kantorovich primal and dual problems, extending the discussion in Sect. 1.6.1 to more general cost functions.
Let \(\mathcal X\) and \(\mathcal Y\) be complete separable metric spaces, \(\mu \in P(\mathcal X)\), \(\nu \in P(\mathcal Y)\), and \(c:\mathcal X\times \mathcal Y\to \mathbb {R}_+\) be a measurable cost function.
1.8.1 Unconstrained Dual Kantorovich Problem
Proposition 1.8.1 (Existence of an Optimal Pair)
Let μ and ν be probability measures on \(\mathcal X\) and \(\mathcal Y\) such that the independent coupling with respect to the nonnegative and lower semicontinuous cost function is finite: \(\int _{\mathcal X\times \mathcal Y} \! c(x,y) \, \mathrm {d}\mu (x)\mathrm {d}\nu (y)<\infty \). Then there exists an optimal pair (φ, ψ) for the dual Kantorovich problem. Furthermore, the pair can be chosen in a way that μalmost surely, φ = ψ^{c} and νalmost surely, ψ = φ^{c}.
We remark that at the level of generality of Proposition 1.8.1, the function φ^{c} may fail to be Borel measurable; Ambrosio and Pratelli show that this pair can be modified up to null sets in order to be Borel measurable. If c is continuous, however, then φ^{c} is an infimum of a collection of continuous functions (in y). Hence − φ^{c} is lower semicontinuous, which yields that φ^{c} is measurable. When c is uniformly continuous, measurability of φ^{c} is established in a more lucid way, as exemplified in the next subsection.
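The relations φ = ψ^{c} and ψ = φ^{c} of Proposition 1.8.1 have a discrete analogue that is easy to verify numerically (a sketch with arbitrary data): φ^{cc} ≥ φ pointwise, while a third transform gives nothing new, φ^{ccc} = φ^{c}.

```python
xs = [0.0, 0.5, 1.3, 2.0]
ys = [0.1, 0.9, 1.0, 2.5]
c = [[abs(x - y) ** 2 for y in ys] for x in xs]

def to_psi(phi):
    """The c-transform phi^c(y) = min_x [c(x, y) - phi(x)]."""
    return [min(c[i][j] - phi[i] for i in range(len(xs))) for j in range(len(ys))]

def to_phi(psi):
    """The c-transform psi^c(x) = min_y [c(x, y) - psi(y)]."""
    return [min(c[i][j] - psi[j] for j in range(len(ys))) for i in range(len(xs))]

phi = [0.4, -0.2, 1.0, 0.3]    # arbitrary starting potential
psi = to_psi(phi)              # phi^c
phi_cc = to_phi(psi)           # phi^{cc} >= phi pointwise
psi_2 = to_psi(phi_cc)         # phi^{ccc} = phi^c
print(phi_cc, psi_2)
```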
1.8.2 The Kantorovich–Rubinstein Theorem
1.8.3 Strictly Convex Cost Functions on Euclidean Spaces
The following result generalises Theorem 1.6.2 to other powers p > 1 of the Euclidean norm. These cost functions define the Wasserstein distances of the next chapter.
Theorem 1.8.2 (Strictly Convex Costs on \(\mathbb {R}^d\))
Proof (Assuming ν has Compact Support)
The existence of the optimal pair (φ, φ^{c}) with the desired properties follows from Proposition 1.8.1 (they are Borel measurable because c is continuous). We shall now show that φ has a unique c-supergradient μ-almost surely.
Step 1: φ is c-superdifferentiable. Let π^{∗} be an optimal coupling. By duality arguments, π^{∗} is concentrated on the set of (x, y) such that y ∈ ∂^{c}φ(x). Consequently, for μ-almost every x, the c-superdifferential of φ at x is nonempty.
Step 2: φ is differentiable. Here, we impose the additional condition that ν is compactly supported. Then φ can be taken to be a c-transform on the compact support of ν. Since h is locally Lipschitz (indeed C^{1}, because p > 1), this implies that φ is locally Lipschitz. Hence, it is differentiable Lebesgue-almost surely, and consequently μ-almost surely.
The general result, without assuming compact support for ν, can be found in Gangbo and McCann [59]. It holds for a larger class of functions h: those that are strictly convex on \(\mathbb {R}^d\) (which yields that ∇h is invertible), have superlinear growth (h(v)∕∥v∥→∞ as ∥v∥→∞), and satisfy a technical geometric condition (which ∥v∥^{p}∕p does when p > 1). Furthermore, if h is sufficiently smooth, namely locally C^{1, 1} (as it is when p ≥ 2, but not when p ∈ (1, 2)), then μ need not be absolutely continuous; it suffices that it assign no positive measure to any set of Hausdorff dimension at most d − 1. When d = 1, this means that Theorem 1.8.2 remains valid as long as μ has no atoms (μ({x}) = 0 for all \(x\in \mathbb {R}\)), which is a weaker condition than absolute continuity.
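In one dimension, the optimal coupling for a strictly convex cost |x − y|^{p}, p > 1, is the monotone rearrangement (cf. Theorem 1.5.1). This is easy to check numerically on empirical measures by solving the corresponding assignment problem; the data below are illustrative, and NumPy and SciPy are assumed available.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Empirical measures on the line with strictly convex cost |x - y|^p, p > 1:
# the optimal assignment should be the monotone (sorted) matching.
rng = np.random.default_rng(1)
x = np.sort(rng.normal(0.0, 1.0, 20))        # atoms of mu, sorted
y = np.sort(rng.normal(2.0, 1.5, 20))        # atoms of nu, sorted

p = 3.0
cost = np.abs(x[:, None] - y[None, :]) ** p
rows, cols = linear_sum_assignment(cost)     # solve the assignment problem

# The optimiser pairs the i-th smallest x with the i-th smallest y.
assert np.array_equal(cols, np.arange(20))
# Its cost equals that of the monotone coupling.
assert np.isclose(cost[rows, cols].sum(), np.sum(np.abs(x - y) ** p))
```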
It is also noteworthy that for strictly concave cost functions (e.g., p ∈ (0, 1)), the situation is similar when the supports of μ and ν are disjoint. The reason is that h may fail to be differentiable at 0, but it only needs to be differentiated at x − y with x ∈ supp μ and y ∈ supp ν. If the supports are not disjoint, then one needs to leave all common mass in place until the supports become disjoint (Villani [124, Chapter 2]), after which the result of [59] applies. As a simple example, let μ be uniform on [0, 1] and ν be uniform on [0, 2]. After leaving common mass in place, we are left with uniform measures on [0, 1] and [1, 2] (each with total mass 1∕2) with essentially disjoint supports, for which the optimal transport map is the decreasing map T(x) = 2 − x. Thus, the unique optimal π is not induced from a map, but rather from an equal-weight mixture of T and the identity. Informally, each point x in the support of μ needs to be split: half stays at x and the other half is transported to 2 − x. The optimal coupling from ν to μ, in contrast, is unique and induced from the map S(x) = x if x ≤ 1 and S(x) = 2 − x if x ≥ 1, which is neither increasing nor decreasing.
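The uniform example above can be checked numerically by discretising both measures into equal-mass atoms and solving the assignment problem for the concave cost |x − y|^{1∕2}. The sketch below (assuming NumPy and SciPy) confirms that the monotone map, which is optimal for convex costs, is strictly suboptimal here, and that roughly half the atoms of μ stay essentially in place.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Discretise mu = U[0, 1] and nu = U[0, 2] by n equal-mass atoms and solve
# the assignment problem for the concave cost c(x, y) = |x - y|^{1/2}.
n = 40
x = (np.arange(n) + 0.5) / n                 # atoms of mu on [0, 1]
y = 2.0 * (np.arange(n) + 0.5) / n           # atoms of nu on [0, 2]

cost = np.sqrt(np.abs(x[:, None] - y[None, :]))
rows, cols = linear_sum_assignment(cost)
optimal = cost[rows, cols].sum() / n

# The monotone map T(x) = 2x (optimal for convex costs) is now beaten.
monotone = np.sqrt(np.abs(x - 2.0 * x)).sum() / n
assert optimal < monotone

# A substantial fraction of the atoms of mu are matched to a nearby atom
# of nu, mirroring the "leave common mass in place" rule.
stays = np.sum(np.abs(x[rows] - y[cols]) < 1.5 / n)
assert stays >= n // 4
```

As n grows, the monotone cost approaches \(\int_0^1 \sqrt{x}\,\mathrm{d}x = 2/3\), while the split coupling costs about \(\tfrac12 \int_0^1 \sqrt{2-2x}\,\mathrm{d}x = \sqrt{2}/3\), so the gap is not a discretisation artefact.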
1.9 Bibliographical Notes
Many authors, including Villani [124, Theorem 1.3]; [125, Theorem 5.10], give the duality Theorem 1.4.2 for lower semicontinuous cost functions. The version given here is a simplification of Beiglböck and Schachermayer [17, Theorem 1]. The duality holds for functions that take values in [−∞, ∞] provided that they are finite on a sufficiently large subset of \(\mathcal X\times \mathcal Y\), but there are simple counterexamples if c is infinite too often [17, Example 4.1]. For results outside the Polish space setup, see Kellerer [80] and Rachev and Rüschendorf [107, Chapter 4].
Theorem 1.5.1 for the onedimensional case is taken from [124], where it is proven using the general duality theorem. For direct proofs and the history of this result, one may consult Rachev [106] or Rachev and Rüschendorf [107, Section 3.1]. The concave case is carefully treated by McCann [94].
The results in the Gaussian case were obtained independently by Olkin and Pukelsheim [98] and Givens and Shortt [65]. The proof given here is from Bhatia [20, Exercise 1.2.13]. An extension to separable Hilbert spaces can be found in Gelbrich [62] or CuestaAlbertos et al. [39].
The regularity theory of Sect. 1.6.4 is very delicate. Caffarelli [32] showed the first part of Theorem 1.6.7; the proof can also be found in Figalli's book [52, Theorem 4.23]. Villani [124, Theorem 4.14] states the result without proof and refers to Alesker et al. [7] for a sketch of the second part of Theorem 1.6.7. Other regularity results exist; see Villani [125, Chapter 12], Santambrogio [119, Section 1.7.6], and Figalli [52].
CuestaAlbertos et al. [40, Theorem 3.2] prove Theorem 1.7.2 for the quadratic case; the form given here is from Schachermayer and Teichmann [120, Theorem 3].
The definition of cyclical monotonicity depends on the cost function. It is typically referred to as c-cyclical monotonicity, with “cyclical monotonicity” reserved for the special case of quadratic cost. Since we focus on the quadratic case, and for readability, we slightly deviate from the standard jargon. That cyclical monotonicity implies optimality (Proposition 1.7.5) was shown independently by Pratelli [105] (finite lower semicontinuous cost) and Schachermayer and Teichmann [120] (possibly infinite continuous cost). A joint generalisation is given by Beiglböck et al. [18].
Section 1.7.2 is taken from Zemel and Panaretos [134, Section 7.5]; a slightly weaker version was shown independently by Chernozhukov et al. [35]. Heinich and Lootgieter [68] establish almost sure pointwise convergence. If μ_{n} = μ, then the optimal maps converge in μmeasure [125, Corollary 5.23] in a very general setup, but there are simple examples where this fails if μ_{n}≠μ [125, Remark 5.25]. In the quadratic case, further stability results of a weaker flavour (focussing on the convex potential ϕ, rather than its derivative, which is the optimal map) can be found in del Barrio and Loubes [42].
The idea of using the c-transform (Sect. 1.8) is from Rüschendorf [116].
Footnotes
1. But see the bibliographical notes for some literature on more general spaces.
2. Weak convergence is sometimes called narrow convergence, weak* convergence, or convergence in distribution.
3. In general, ψ^{c} inherits the modulus of continuity of c; see Santambrogio [119, page 11].
4. Gradients of Borel functions are measurable, as the limit can be taken on a countable set. The inverse (∇h)^{−1} equals the gradient of the Legendre transform h^{∗} and is therefore Borel measurable.
References
6. G. Alberti, L. Ambrosio, A geometrical approach to monotone functions in \(\mathbb R^n\). Math. Z. 230(2), 259–316 (1999)
7. S. Alesker, S. Dar, V. Milman, A remarkable measure preserving diffeomorphism between two convex bodies in \(\mathbb R^n\). Geometriae Dedicata 74(2), 201–212 (1999)
10. L. Ambrosio, N. Gigli, A user's guide to optimal transport, in Modelling and Optimisation of Flows on Networks (Springer, Berlin, 2013), pp. 1–155
11. L. Ambrosio, A. Pratelli, Existence and stability results in the L^{1}-theory of optimal transportation, in Optimal Transportation and Applications (Springer, Berlin, 2003), pp. 123–160
12. L. Ambrosio, N. Gigli, G. Savaré, Gradient Flows in Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics ETH Zürich, 2nd edn. (Springer, Berlin, 2008)
13. Y. Amit, U. Grenander, M. Piccioni, Structural image restoration through deformable templates. J. Amer. Stat. Assoc. 86(414), 376–387 (1991)
17. M. Beiglböck, W. Schachermayer, Duality for Borel measurable cost functions. Trans. Amer. Math. Soc. 363(8), 4203–4224 (2011)
18. M. Beiglböck, M. Goldstern, G. Maresch, W. Schachermayer, Optimal and better transport plans. J. Funct. Anal. 256(6), 1907–1927 (2009)
20. R. Bhatia, Positive Definite Matrices (Princeton University Press, Princeton, 2009)
24. P. Billingsley, Convergence of Probability Measures, 2nd edn. (Wiley, New York, 1999)
31. Y. Brenier, Polar factorization and monotone rearrangement of vector-valued functions. Commun. Pure Appl. Math. 44(4), 375–417 (1991)
32. L.A. Caffarelli, The regularity of mappings with a convex potential. J. Amer. Math. Soc. 5(1), 99–104 (1992)
35. V. Chernozhukov, A. Galichon, M. Hallin, M. Henry, Monge–Kantorovich depth, quantiles, ranks and signs. Ann. Stat. 45(1), 223–256 (2017)
37. J.A. Cuesta-Albertos, C. Matrán, Notes on the Wasserstein metric in Hilbert spaces. Ann. Probab. 17(3), 1264–1276 (1989)
39. J.A. Cuesta-Albertos, C. Matrán-Bea, A. Tuero-Diaz, On lower bounds for the L_{2}-Wasserstein metric in a Hilbert space. J. Theor. Probab. 9(2), 263–283 (1996)
40. J.A. Cuesta-Albertos, C. Matrán, A. Tuero-Diaz, Optimal transportation plans and convergence in distribution. J. Multivar. Anal. 60(1), 72–83 (1997)
42. E. del Barrio, J.-M. Loubes, Central limit theorems for empirical transportation cost in general dimension. Ann. Probab. 47(2), 926–951 (2019). https://projecteuclid.org/euclid.aop/1551171641
47. R.M. Dudley, Real Analysis and Probability, vol. 74 (Cambridge University Press, Cambridge, 2002)
50. J. Edmonds, R.M. Karp, Theoretical improvements in algorithmic efficiency for network flow problems. J. ACM 19(2), 248–264 (1972)
52. A. Figalli, The Monge–Ampère Equation and Its Applications (European Mathematical Society, Zürich, 2017)
59. W. Gangbo, R.J. McCann, The geometry of optimal transportation. Acta Math. 177(2), 113–161 (1996)
62. M. Gelbrich, On a formula for the L_{2}-Wasserstein metric between measures on Euclidean and Hilbert spaces. Math. Nachr. 147(1), 185–203 (1990)
65. C.R. Givens, R.M. Shortt, A class of Wasserstein metrics for probability distributions. Mich. Math. J. 31(2), 231–240 (1984)
68. H. Heinich, J.-C. Lootgieter, Convergence des fonctions monotones. C. R. Acad. Sci. Sér. 1 Math. 322(9), 869–874 (1996)
76. O. Kallenberg, Foundations of Modern Probability (Springer, Berlin, 1997)
77. L.V. Kantorovich, On the translocation of masses. C. R. (Dokl.) Acad. Sci. URSS 37(3), 199–201 (1942)
80. H.G. Kellerer, Duality theorems for marginal problems. Z. Wahrscheinlichkeitstheorie verw. Gebiete 67(4), 399–432 (1984)
83. M. Knott, C.S. Smith, On the optimal mapping of distributions. J. Optim. Theory Appl. 43(1), 39–49 (1984)
85. H.W. Kuhn, The Hungarian method for the assignment problem. Nav. Res. Log. Quart. 2, 83–97 (1955)
89. D.G. Luenberger, Y. Ye, Linear and Nonlinear Programming (Springer, Berlin, 2008)
94. R.J. McCann, Exact solutions to the transportation problem on the line. Proc. R. Soc. London Ser. A: Math. Phys. Eng. Sci. 455(1984), 1341–1380 (1999)
95. G. Monge, Mémoire sur la théorie des déblais et des remblais. Histoire de l'Académie R. des Sci. de Paris 177, 666–704 (1781)
96. J. Munkres, Algorithms for the assignment and transportation problems. J. Soc. Indust. Appl. Math. 5, 32–38 (1957)
98. I. Olkin, F. Pukelsheim, The distance between two random vectors with given dispersion matrices. Linear Algebra Appl. 48, 257–263 (1982)
103. G. Peyré, M. Cuturi, Computational optimal transport. Found. Trends Mach. Learn. 11(5–6), 355–607 (2019). https://www.nowpublishers.com/article/Details/MAL-073
104. D. Pollard, Convergence of Stochastic Processes (Springer, Berlin, 2012)
105. A. Pratelli, On the sufficiency of c-cyclical monotonicity for optimality of transport plans. Math. Z. 258(3), 677–690 (2008)
106. S.T. Rachev, The Monge–Kantorovich mass transference problem and its stochastic applications. Theory Probab. Appl. 29(4), 647–676 (1985)
107. S.T. Rachev, L. Rüschendorf, Mass Transportation Problems. Volume I: Theory, Volume II: Applications (Springer, Berlin, 1998)
112. R.T. Rockafellar, Characterization of the subdifferentials of convex functions. Pac. J. Math. 17(3), 497–510 (1966)
113. R.T. Rockafellar, Convex Analysis (Princeton University Press, Princeton, 1970)
116. L. Rüschendorf, On c-optimal random variables. Stat. Probab. Lett. 27(3), 267–270 (1996)
117. L. Rüschendorf, S.T. Rachev, A characterization of random variables with minimum L^{2}-distance. J. Multivar. Anal. 32(1), 48–54 (1990)
119. F. Santambrogio, Optimal Transport for Applied Mathematicians, vol. 87 (Springer, Berlin, 2015)
120. W. Schachermayer, J. Teichmann, Characterization of optimal transport plans for the Monge–Kantorovich problem. Proc. Amer. Math. Soc. 137(2), 519–529 (2009)
121. E.M. Stein, R. Shakarchi, Real Analysis: Measure Theory, Integration & Hilbert Spaces (Princeton University Press, Princeton, 2005)
124. C. Villani, Topics in Optimal Transportation (American Mathematical Society, Providence, 2003)
125. C. Villani, Optimal Transport: Old and New (Springer, Berlin, 2008)
134. Y. Zemel, V.M. Panaretos, Fréchet means and Procrustes analysis in Wasserstein space. Bernoulli 25(2), 932–976 (2019)
Copyright information
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.