Existence, Duality, and Cyclical monotonicity for weak transport costs

The optimal weak transport problem has recently been introduced by Gozlan et al.\ \cite{GoRoSaTe17}. We provide general existence and duality results for these problems on arbitrary Polish spaces, as well as a necessary and sufficient optimality criterion in the spirit of cyclical monotonicity. As an application we extend the Brenier-Strassen Theorem of Gozlan-Juillet \cite{GoJu18} to general probability measures on $\mathbb{R}^d$ under minimal assumptions.

1. Introduction

1.1. Notation. This article is concerned with the optimal transport problem for weak costs, as initiated by Gozlan et al. [19]. To state it (see (1.1) below) we introduce some basic notation. On a Polish space $Z$ the set of probability measures is denoted by $\mathcal{P}(Z)$. Denoting by $C_b(Z)$ the space of real-valued continuous bounded functions on $Z$, we use the probabilists' terminology of 'weak convergence' for the weak topology that $C_b(Z)$ induces on $\mathcal{P}(Z)$. For Polish spaces $X$, $Y$ and probability measures $\mu \in \mathcal{P}(X)$, $\nu \in \mathcal{P}(Y)$ we write $\Pi(\mu,\nu)$ for the set of all couplings on $X \times Y$ with marginals $\mu$ and $\nu$. Given a coupling $\pi$ on $X \times Y$ we denote a regular disintegration with respect to the first marginal by $(\pi_x)_{x \in X}$. We consider cost functionals of the form
$$C : X \times \mathcal{P}(Y) \to \mathbb{R} \cup \{+\infty\};$$
usually it is assumed that $C$ is bounded from below, lower semicontinuous in an appropriate sense, and that $C(x,\cdot)$ is convex. With these ingredients, the weak transport problem is defined as
$$V(\mu,\nu) := \inf_{\pi \in \Pi(\mu,\nu)} \int_X C(x,\pi_x)\,\mu(dx). \tag{1.1}$$

1.2. Literature. The initial works of Gozlan et al. [19,18] are mainly motivated by applications to geometric inequalities. Indeed, particular costs of the form (1.1) were already considered by Marton [23,22] and Talagrand [33,34]. Further papers directly related to [19] include [30,29,31,15,17]. Notably the weak transport problem (1.1) also yields a natural framework to investigate a number of related problems: it appears in the recursive formulation of the causal transport problem [3]; in [1,2,11] it is used to provide a new perspective on (discrete time) martingale optimal transport; and in [5] it is employed as a tool to study a martingale transport problem in continuous time.
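To make the definition (1.1) concrete, the following minimal Python sketch evaluates the weak transport cost of a finitely supported coupling. It is an illustration only: the helper names (`disintegrate`, `barycentric_cost`, `weak_transport_cost`) and the barycentric cost $C(x,p) = (x - \int y\,p(dy))^2$ are assumptions chosen for the example, and the two couplings are hypothetical test data sharing the same marginals.

```python
from fractions import Fraction

def disintegrate(pi):
    """Split a discrete coupling {(x, y): mass} into its first marginal and kernels."""
    mu = {}
    for (x, _y), w in pi.items():
        mu[x] = mu.get(x, 0) + w
    kernels = {x: {} for x in mu}
    for (x, y), w in pi.items():
        if mu[x] > 0:
            kernels[x][y] = kernels[x].get(y, 0) + w / mu[x]
    return mu, kernels

def barycentric_cost(x, p):
    """Assumed example cost C(x, p) = (x - mean of p)^2, convex in p."""
    mean = sum(y * w for y, w in p.items())
    return (x - mean) ** 2

def weak_transport_cost(pi, cost):
    """Evaluate the sum over x of mu(x) * C(x, pi_x)."""
    mu, kernels = disintegrate(pi)
    return sum(w * cost(x, kernels[x]) for x, w in mu.items() if w > 0)

q = Fraction(1, 4)
# Coupling whose kernels are centred at x (hypothetical test data).
pi_martingale = {(-1, -2): q, (-1, 0): q, (1, 0): q, (1, 2): q}
# Product coupling with the same marginals.
mu = {-1: Fraction(1, 2), 1: Fraction(1, 2)}
nu = {-2: q, 0: Fraction(1, 2), 2: q}
pi_product = {(x, y): a * b for x, a in mu.items() for y, b in nu.items()}

print(weak_transport_cost(pi_martingale, barycentric_cost))
print(weak_transport_cost(pi_product, barycentric_cost))
```

Note that the cost depends on the coupling only through its kernels $\pi_x$, which is what distinguishes (1.1) from a classical transport cost of the form $\int c(x,y)\,\pi(dx,dy)$.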
We will establish analogues of three fundamental facts in optimal transport theory: existence of optimizers, duality, and characterization of optimizers through c-cyclical monotonicity. We stress that these concepts (in particular existence and duality) have been previously studied for the weak transport problem. However, the results available so far may be too restrictive for certain applications.
Our goal is to establish these results at a level of generality that mimics the framework usually considered in the optimal transport literature (i.e. lower bounded, lower semicontinuous cost function). We emphasize that this extension is in fact required to treat specific examples of interest, cf. Section 1.3.4 below.
We briefly hint at the novel viewpoint which makes this extension possible: In a nutshell, the technicalities of the weak transport problem appear intricate and tedious since kernels $(\pi_x)_x$ are notoriously ill behaved with respect to weak convergence of measures in $\mathcal{P}(X \times Y)$. In the present paper we circumvent this difficulty by embedding $\mathcal{P}(X \times Y)$ into the bigger space $\mathcal{P}(X \times \mathcal{P}(Y))$. This idea is borrowed from the investigation of process distances (cf. [26,4]) and will allow us to carry out proofs that closely resemble familiar arguments from classical optimal transport.

1.3.1. Primal Existence. As a first contribution we will establish in Section 2 the following basic existence results.

Theorem 1.1 (Existence I). Assume that $C : X \times \mathcal{P}(Y) \to \mathbb{R} \cup \{+\infty\}$ is jointly lower semicontinuous, bounded from below, and convex in the second argument. Then the problem
$$\inf_{\pi \in \Pi(\mu,\nu)} \int_X C(x,\pi_x)\,\mu(dx)$$
admits a minimizer.
Notably, Gozlan et al. establish existence of a minimizer under the assumption that $\pi \mapsto \int_X C(x,\pi_x)\,\mu(dx)$ is continuous on the set of all transport plans with first marginal $\mu$, whereas our aim is to establish existence based on properties of the function $C$. We also note that Theorem 1.1 was first established in [2] in the case where $X$, $Y$ are compact spaces.
In fact the assumptions of Theorem 1.1 may be more restrictive than they initially appear. Indeed, as the cost function defined in (1.5) below is not lower semicontinuous with respect to weak convergence, we will need to employ a refined version of Theorem 1.1 to carry out our application in Theorem 1.4 below.
Given a compatible metric $d_Y$ on the Polish space $Y$, we write $\mathcal{P}^t_{d_Y}(Y)$ for the set of probability measures $\nu \in \mathcal{P}(Y)$ such that $\int_Y d_Y(y,y_0)^t\,\nu(dy) < \infty$ for some (and then any) $y_0 \in Y$, and denote the $t$-Wasserstein metric on $\mathcal{P}^t_{d_Y}(Y)$ by $W_t$ (see e.g. [35, Chapter 7]). In the sequel we make the important convention that, whenever we refer to $\mathcal{P}^t_{d_Y}(Y)$, this set is equipped with the topology generated by $W_t$. On the other hand, regarding the Polish space $X$, we fix from now on a compatible bounded metric $d_X$.
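Theorem 1.2. Assume that $\nu \in \mathcal{P}^t_{d_Y}(Y)$. Let $C : X \times \mathcal{P}^t_{d_Y}(Y) \to \mathbb{R} \cup \{+\infty\}$ be jointly lower semicontinuous with respect to the product topology on $X \times \mathcal{P}^t_{d_Y}(Y)$, bounded from below, and convex in the second argument. Then the problem
$$\inf_{\pi \in \Pi(\mu,\nu)} \int_X C(x,\pi_x)\,\mu(dx)$$
admits a minimizer.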
We emphasize that Theorem 1.1 is a special case of Theorem 1.2. To see this, just take d Y to be a compatible bounded metric. We also note that if C is strictly convex in the second argument and V(µ, ν) < ∞, then the minimizer π * ∈ Π(µ, ν) is unique. We report our proofs in Section 2.
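1.3.2. Duality. We fix a compatible metric $d_Y$ on $Y$ and introduce the space $\Phi_{b,t}$ of continuous functions on $Y$ which are bounded from below and satisfy a growth constraint of order $t$ (see Section 3). To each $\psi \in \Phi_{b,t}$ we associate the function
$$R_C\psi(x) := \inf_{p \in \mathcal{P}^t_{d_Y}(Y)} \big\{ p(\psi) + C(x,p) \big\}.$$
We remark that $R_C\psi(\cdot)$ is universally measurable if $C$ is measurable ([12, Proposition 7.47]), and so the integral $\mu(R_C\psi)$ is well defined for all $\mu \in \mathcal{P}(X)$ if $C$ is bounded from below. In Section 3, where the proof is provided, we compare the following duality result with those in [19, Theorem 9.6] and [2, Theorem 4.2].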
Theorem 1.3. Let $C : X \times \mathcal{P}^t_{d_Y}(Y) \to \mathbb{R} \cup \{+\infty\}$ be jointly lower semicontinuous with respect to the product topology on $X \times \mathcal{P}^t_{d_Y}(Y)$, bounded from below, and convex in the second argument. Then for all $\mu \in \mathcal{P}(X)$ and $\nu \in \mathcal{P}^t_{d_Y}(Y)$ we have
$$V(\mu,\nu) = \sup_{\psi \in \Phi_{b,t}} \big\{ \mu(R_C\psi) - \nu(\psi) \big\}. \tag{1.4}$$

1.3.3. C-monotonicity. Besides primal existence and duality, another fundamental result in classical optimal transport is the characterization of optimality through the notion of cyclical monotonicity; see [27,16] as well as the monographs [28,35,36]. More recently, variants of this 'monotonicity principle' have been applied in transport problems for finitely or infinitely many marginals [25,13,20,7,37], the martingale version of the optimal transport problem [8,24,9], the Skorokhod embedding problem [6] and the distribution constrained optimal stopping problem [10]. We provide in Definition 5.1 below a concept analogous to cyclical monotonicity (which we call C-monotonicity) for weak transport costs $C$. We show that every optimal transport plan is C-monotone in a very general setup. Conversely, we show that every C-monotone transport plan is optimal under certain regularity assumptions. See Theorems 5.2 and 5.5 respectively.
We note that related concepts already appeared in [5,Proposition 4.1] (where necessity of a 2-step optimality condition is established) and in [17] (necessity in the case of compactly supported measures and a quadratic cost criterion). To the best of our knowledge, our sufficient criterion is the first of its kind for weak transport costs.

1.3.4. A general Brenier-Strassen theorem. As an application of our abstract results we extend the Brenier-Strassen theorem [17, Theorem 1.2] of Gozlan and Juillet to the case of general probabilities on $X = Y = \mathbb{R}^d$ under the assumption that $\mu$ has finite second moment and $\nu$ has finite first moment. We thus drop the condition in [17] that the marginals have compact support. For this part we set (1.5) and write $\leq_c$ for the convex order of probability measures.
Existence of $\mu^*$ and the expression (1.6) were first proved by Gozlan et al. [18] for $d = 1$ and by Alfonsi, Corbetta, Jourdain [1] for arbitrary $d \in \mathbb{N}$. Indeed a general version of (1.6), appealing to $W_p$ and probabilities $\mu, \nu \in \mathcal{P}_p(\mathbb{R}^d)$, is provided in [1]. All other statements in the above theorem were originally established by Gozlan and Juillet [17] under the assumption of compactly supported measures $\mu$, $\nu$. The proof of Theorem 1.4 is given in Section 6.

Existence of minimizers
The principal idea in this section is to make use of the natural embedding of $\mathcal{P}(X \times Y)$ into $\mathcal{P}(X \times \mathcal{P}(Y))$, which we explain in (2.1) below. It turns out that on this 'extended' space the minimization problems of Theorem 1.1 and Theorem 1.2 can be handled more efficiently.
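We need to introduce additional notation: for a probability measure $\pi \in \mathcal{P}(X \times Y)$ with not further specified marginals, we write $\pi(dx \times Y)$ and $\pi(X \times dy)$ for its $X$-marginal and $Y$-marginal respectively. At several instances we use the projection from a product space onto one of its components. This map is usually denoted by $\mathrm{proj}_\bullet$, where the subscript describes the component, e.g. $\mathrm{proj}_X : X \times Y \to X$ stands for the projection onto the $X$-component. Denoting by $(\pi_x)_{x \in X}$ a regular disintegration of $\pi$ with respect to $\pi(dx \times Y)$, we can consider the measurable map
$$\kappa_\pi : X \to X \times \mathcal{P}(Y), \qquad x \mapsto (x, \pi_x).$$
We define the embedding $J : \mathcal{P}(X \times Y) \to \mathcal{P}(X \times \mathcal{P}(Y))$ as the push-forward of the $X$-marginal under $\kappa_\pi$, that is,
$$J(\pi)(dx, dp) := \pi(dx \times Y)\,\delta_{\pi_x}(dp). \tag{2.1}$$
The map $J$ is well-defined since $\kappa_\pi$ is $\pi(dx \times Y)$-almost surely unique. Note that elements in $\mathcal{P}(X \times Y)$ precisely correspond to those elements of $\mathcal{P}(X \times \mathcal{P}(Y))$ which are concentrated on the graph of a measurable function from $X$ to $\mathcal{P}(Y)$.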
We now describe the relation between minimization problems on $\Pi(\mu,\nu)$ and $\Lambda(\mu,\nu)$:

Proof. For any $\pi \in \Pi(\mu,\nu)$ we have $J(\pi) \in \Lambda(\mu,\nu)$ and Now, letting $P \in \Lambda(\mu,\nu)$, we easily derive from (2.5) that $\hat{I}(P) \in \Pi(\mu,\nu)$ and $\hat{I}(P)_x = \int_{\mathcal{P}(Y)} p\,P_x(dp)$ for $\mu$-a.e. $x$. Using convexity we conclude

2.1. Existence of minimizers. The purpose of this subsection is to establish Theorem 1.2, or more precisely, a strengthened version of it; see Theorem 2.9 below. To this end we need a number of auxiliary results. We start by stressing that, in general, the embedding $J$ is not continuous. In fact:

Example 2.2. The map $J$ is continuous if and only if $X$ is discrete or $|Y| = 1$. Indeed, given $X$ discrete and a sequence $(\pi^k)_{k \in \mathbb{N}} \in \mathcal{P}(X \times Y)^{\mathbb{N}}$ which weakly converges to $\pi$, we have that Therefore $(J(\pi^k))_{k \in \mathbb{N}}$ converges weakly to $J(\pi)$. On the other hand, suppose there is a sequence $(x_k)_{k \in \mathbb{N}} \in X^{\mathbb{N}}$ of distinct points converging to some $x \in X$, as well as $p, q \in \mathcal{P}(Y)$ with $p \neq q$. For $k \in \mathbb{N}$ define a probability measure in $\mathcal{P}(X \times Y)$ by which shows that $J$ is discontinuous.
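The failure of continuity of $J$ can be checked numerically on a small instance. The sketch below is a hedged illustration, not necessarily the sequence used in Example 2.2: it assumes $X = [0,1]$, $Y = \{0,1\}$, $x_k = 1/k \to 0$, $p = \delta_0$, $q = \delta_1$ and $\pi^k = \tfrac12 \delta_{x_k} \otimes p + \tfrac12 \delta_0 \otimes q$, which converges weakly to $\pi = \delta_0 \otimes \tfrac12(p+q)$. The function names `embed`, `integrate` and `squared_mean` are hypothetical; `squared_mean` plays the role of a bounded continuous test function on $X \times \mathcal{P}(Y)$.

```python
from fractions import Fraction

def embed(pi):
    """J(pi) for a discrete coupling {(x, y): mass}: returns {(x, kernel): mass},
    where kernel is the conditional law pi_x stored as a sorted tuple of (y, weight)."""
    mu, rows = {}, {}
    for (x, y), w in pi.items():
        mu[x] = mu.get(x, 0) + w
        row = rows.setdefault(x, {})
        row[y] = row.get(y, 0) + w
    return {
        (x, tuple(sorted((y, w / mu[x]) for y, w in rows[x].items()))): mu[x]
        for x in mu if mu[x] > 0
    }

def integrate(P, F):
    """Integral of F(x, p) against a discrete measure P on X x P(Y)."""
    return sum(mass * F(x, dict(kernel)) for (x, kernel), mass in P.items())

def squared_mean(x, p):
    """Test function F(x, p) = (integral of y p(dy))^2, bounded and continuous on P({0, 1})."""
    return sum(y * w for y, w in p.items()) ** 2

half = Fraction(1, 2)

def pi_k(k):
    # Assumed sequence: mass 1/2 at (1/k, 0) and mass 1/2 at (0, 1).
    return {(Fraction(1, k), 0): half, (Fraction(0), 1): half}

pi_limit = {(Fraction(0), 0): half, (Fraction(0), 1): half}  # weak limit of pi_k

values = [integrate(embed(pi_k(k)), squared_mean) for k in (1, 10, 100)]
limit_value = integrate(embed(pi_limit), squared_mean)
print(values, limit_value)
```

Along the sequence the integral of the test function against $J(\pi^k)$ stays constant, while its value at $J(\pi)$ is different, so $J(\pi^k)$ cannot converge weakly to $J(\pi)$ in this instance.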
On the bright side, J possesses a crucial feature: it maps relatively compact sets to relatively compact sets. We prove this in Lemma 2.6 below. But first we need to digress into the characterization of tightness on P(P(Y)) and subspaces thereof. The following can be found in [32, p. 178, Ch. II].

Lemma 2.3. A set A ⊆ P(P(Y)) is tight if and only if the set of its intensities I(A) is tight in P(Y).
We need to refine Lemma 2.3 for our purposes, since we equip $\mathcal{P}^t_{d_Y}(Y)$ with the $W_t$-topology instead of the weak topology.

relatively compact if and only if the set of its intensities I(A) is relatively compact in
The proof of Lemma 2.4 heavily relies on the following lemma, for which we include a proof for the sake of completeness.

relatively compact if and only if it is tight and
Proof of Lemma 2.4. The first implication follows by continuity of I and Lemma 2.3 provides tightness. Given I(A) is relatively compact in P t d Y (Y), it remains to show for fixed y ∈ Y that Putting (2.9) and (2.10) together completes the proof.
Proof of Lemma 2.5. '⇒': Since the topology induced by W t on P t d Y (Y) is finer than the weak topology on P t d Y (Y), relative compactness in W t implies relative compactness with respect to the weak topology. Therefore, Prokhorov's theorem yields tightness. Note that (2.8) follows immediately from the definition of convergence in W t . '⇐': Let A be tight such that (2.8) holds. Then, any sequence (µ k ) k∈N ∈ A N has an accumulation point µ ∈ P(Y) with respect to the weak topology. Without loss of generality assume that µ k → µ for k → ∞. By monotone convergence and the Portmanteau theorem, Hence, by (2.8) we can choose (for ε = 1, say) By weak convergence we know that Hence we may pick k 0 such that for all k ≥ k 0 Since ε was arbitrary, we obtain that the t-moments are converging, which implies convergence in W t .
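To see why tightness alone does not give relative compactness with respect to $W_t$, one can look at a standard escaping-mass sequence. The sketch below is an illustration under stated assumptions: $Y = \mathbb{R}$, $y_0 = 0$, $t = 1$ and $\mu_k = (1 - 1/k)\,\delta_0 + (1/k)\,\delta_k$; the helper names `mu_k` and `tail_moment` are hypothetical. Since the only coupling of $\mu_k$ with $\delta_0$ is the product measure, $W_1(\mu_k, \delta_0)$ equals the first moment of $\mu_k$.

```python
from fractions import Fraction

def mu_k(k):
    """mu_k = (1 - 1/k) * delta_0 + (1/k) * delta_k on the real line."""
    return {0: 1 - Fraction(1, k), k: Fraction(1, k)}

def tail_moment(mu, t=1, radius=0):
    """Integral of |y|^t over {|y| >= radius}, distances measured from y_0 = 0."""
    return sum(w * abs(y) ** t for y, w in mu.items() if abs(y) >= radius)

ks = (1, 10, 100, 1000)
mass_away_from_zero = [1 - mu_k(k)[0] for k in ks]   # tends to 0: weak convergence to delta_0
first_moments = [tail_moment(mu_k(k)) for k in ks]   # equals W_1(mu_k, delta_0)
tails_beyond_50 = [tail_moment(mu_k(k), radius=50) for k in ks]
print(mass_away_from_zero, first_moments, tails_beyond_50)
```

The sequence is tight and converges weakly to $\delta_0$, yet its first moments do not converge to the first moment of the limit, and the tail moments beyond any fixed radius do not become uniformly small. This is the behaviour that a uniform moment condition of the kind used in Lemma 2.5 is designed to exclude.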
We recall that on Y we are usually given a compatible complete metric d Y , whereas on X we fix a compatible bounded metric d X . We thus endow the product spaces X × Y and We can now state and prove the crucial property of J: Proof. By continuous mapping (see [14,Theorem A.3.10]) the sets Π X ⊆ P(X) and Π Y ⊆ P t d Y (Y), consisting respectively of the X-and Y-marginals of the elements in Π, are tight. Then, relative compactness of Π X and Π Y can be readily derived by courtesy of Lemma 2.5 and the structure of the product metric d, cf. (2.11).
Denote now respectively by Π X J ⊆ P(X) and Since the marginals of J(Π) are relatively compact, we conclude that J(Π) itself is relatively compact.
It is convenient to introduce the following assumptions, which we will often require: Definition 2.7 (A). Given Polish spaces X, Y, we say that a function • C is lower semicontinuous with respect to the product topology of , (2.13) then we say that C satisfies Condition (A+).
We now show that under Condition (A+) the cost functional defining the weak transport problem is lower semicontinuous: is lower semicontinuous. If C satisfies condition (A+) then the map is lower semicontinuous.
Proof. Let P k → P in P td (X × P t d Y (Y)). Similar to [14,Theorem A.3.12], we can approximate C from below by d-Lipschitz functions and obtain lower semicontinuity of (2.14), i.e., To show lower semicontinuity of (2.15), let π k → π in P t d (X × Y) and denote P k = J(π k ). We may assume that lim inf k X C(x, π k x )π k (dx×Y) = lim k X C(x, π k x )π k (dx×Y) by selecting a subsequence. By Lemma 2.6 we know that {P k } k is relatively compact in P td (X × P t d Y (Y)). Denote by P an accumulation point of {P k } k . From now on we work along a subsequence converging to P. Observe that Hence, we find by the first part that Observe that the X-marginal of P equals the X-marginal of π, so by convexity of C(x, ·) we then have is easily seen to be continuous and bounded in X × P(Y). Hence FdP k → FdP and by the structure of F we deduce This shows for the disintegration (π x ) x∈X of π that π x (dy) = P(Y) p(dy) P x (d p) for π(dx×Y)almost every x. So we conclude We are finally ready to provide our main existence result: (2.7)). Assume now that C fulfils Condition (A+) and Π ⊆ P t d (X × Y) is compact. Then there exists a minimizer π * ∈ Π of inf π∈Π X C(x, π x )π(dx × Y).
Proof. The existence of minimizers in Λ and Π are direct consequences of their compactness and the lower semicontinuity of the objective functionals (Proposition 2.8).
We move to the study ofV. Let (µ k , ν k ) → (µ, ν) in P(X) × (P t d Y , W t ). For any k ∈ N we find an optimizer P * k ofV(µ k , ν k ). Note that the set {P * k : k ∈ N} is relatively compact in P td (X × P t d Y (Y)). Therefore, we can find again a converging subsequence with limit point in Π(µ, ν). Without loss of generality we assume Using lower semicontinuity of the objective functional shows the assertion forV. By Lemma 2.1 the lower semicontinuity of V is immediate.
Of course Theorems 1.1 and 1.2 are particular cases of the second half of Theorem 2.9. More generally: if A is compact in P(X) and B is compact in and Theorem 2.9 applies.

Duality
We denote by $\Phi_t$ the set of continuous functions on $Y$ which satisfy the growth constraint and by $\Phi_{b,t}$ the subset of functions in $\Phi_t$ which are bounded from below. Further, we recall the notion of C-conjugate: The C-conjugate of a measurable function $\psi : Y \to \mathbb{R}$, denoted $R_C\psi$, is given by
$$R_C\psi(x) := \inf_{p \in \mathcal{P}^t_{d_Y}(Y)} \big\{ p(\psi) + C(x,p) \big\}. \tag{3.1}$$
We obtain Theorem 1.3 as a particular case of the following: If moreover $C$ satisfies Condition (A+), then

Proof of Theorem 3.1. Fix $y_0 \in Y$. Define the auxiliary cost function C : Since the integrand C is bounded from below and lower semicontinuous we can apply Proposition 2.8 and find that $F$ is lower semicontinuous on $\mathcal{P}^t_{d_Y}(Y)$. Note that for any $\alpha \in [0,1]$ and m 1 , and, particularly, it follows that $F$ is convex. We can extend $F$ to the set $\mathcal{M}^t_{d_Y}(Y)$ of bounded signed measures with finite $t$-moment (i.e. $m \in \mathcal{M}^t_{d_Y}(Y)$ implies $\int_Y d_Y(y,y_0)^t\,|m|(dy) < \infty$ for some $y_0$) by setting $F(m) = +\infty$ if $m \notin \mathcal{P}^t_{d_Y}(Y)$. We equip the space $\mathcal{M}^t_{d_Y}(Y)$ with the topology induced by $\Phi_t$. It follows that the extension of $F$ is still convex and lower semicontinuous. Now, the spaces $\Phi_t$ and $\mathcal{M}^t_{d_Y}(Y)$ are in separating duality. Define the convex conjugate F * : (3.5) Observe that $F^*(\psi) = \lim_{k \to +\infty} F^*(\psi \wedge k)$, by monotone convergence. We may apply the Fenchel duality theorem [38, Theorem 2.3.3], and then replace $\Phi_t$ by $\Phi_{b,t}$, obtaining:

Now we show that
To show the converse inequality, we assume without loss of generality that $\int_X R_C\psi(x)\,\mu(dx) < +\infty$. For all $x \in X$ the value of $R_C\psi(x)$ is finite, because $\psi$ is bounded from below. Fix $\varepsilon > 0$. The map $R_C\psi(\cdot)$ is lower semianalytic by [12, Proposition 7.47] and by [12, Proposition 7.50] there exists an analytically measurable probability kernel $(p_x)_{x \in X} \in (\mathcal{P}^t_{d_Y}(Y))^X$ such that for all $x \in X$
$$p_x(\psi) + C(x, p_x) \leq R_C\psi(x) + \varepsilon.$$
Then, we immediately obtain The term $\delta_{p_x}(dp)\,\mu(dx)$ uniquely defines a probability measure P on $X \times \mathcal{P}(Y)$. Since $C$ and $\psi$ are bounded from below, we infer that and in particular $\mathrm{proj}_Y \hat{I}(P) \in \mathcal{P}^t_{d_Y}(Y)$. Clearly $P \in \Lambda(\mu, \mathrm{proj}_Y \hat{I}(P))$, so and since $\varepsilon$ was arbitrary, we have shown (3.6). So far, we know that Define $f(y) := d_Y(y,y_0)^t$ and note that $R_C(\psi + f)(x) = R_C\psi(x)$ for all $x \in X$, as well as which shows (3.2). If for all $x \in X$ the map $C(x,\cdot)$ is convex, then (3.3) follows by Lemma 2.1 and (3.2).
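The duality can be sanity-checked by hand on a two-point example. The sketch below is a hedged illustration: it understands the duality as $V(\mu,\nu) = \sup_\psi \{\mu(R_C\psi) - \nu(\psi)\}$ with $R_C\psi(x) = \inf_p \{p(\psi) + C(x,p)\}$, and it assumes $X = Y = \mathbb{R}$, the barycentric cost $C(x,p) = (x - \int y\,p(dy))^2$, $\mu = \tfrac12(\delta_{-3} + \delta_3)$ and $\nu = \tfrac12(\delta_{-2} + \delta_2)$, with the infimum in $R_C$ taken over measures supported on $\{-2, 2\}$. All function names are hypothetical. Because measures on $\{-2,2\}$ are parametrized by one number $s$, both the C-conjugate and the primal problem reduce to one-dimensional convex minimizations.

```python
from fractions import Fraction

def r_conjugate(x, psi):
    """R_C psi(x) = inf over p = (1 - s) delta_{-2} + s delta_2 of p(psi) + (x - mean(p))^2."""
    b = psi[2] - psi[-2]
    # g(s) = psi[-2] + b*s + (x + 2 - 4*s)^2 is convex in s, so the unconstrained
    # stationary point (8*(x + 2) - b) / 32 clipped to [0, 1] is the minimizer.
    s = min(max(Fraction(8 * (x + 2) - b, 32), 0), 1)
    return psi[-2] + b * s + (x + 2 - 4 * s) ** 2

def dual_value(mu, nu, psi):
    """mu(R_C psi) - nu(psi)."""
    lhs = sum(w * r_conjugate(x, psi) for x, w in mu.items())
    return lhs - sum(w * psi[y] for y, w in nu.items())

def primal_cost(s):
    """Cost of the coupling whose kernel at x = -3 puts mass s on y = 2
    (the kernel at x = 3 then puts mass 1 - s on y = 2)."""
    mean_left, mean_right = 4 * s - 2, 2 - 4 * s
    return Fraction(1, 2) * (-3 - mean_left) ** 2 + Fraction(1, 2) * (3 - mean_right) ** 2

half = Fraction(1, 2)
mu = {-3: half, 3: half}
nu = {-2: half, 2: half}

primal = min(primal_cost(Fraction(i, 100)) for i in range(101))
dual_flat = dual_value(mu, nu, {-2: 0, 2: 0})
dual_steep = dual_value(mu, nu, {-2: 0, 2: 16})
print(primal, dual_flat, dual_steep)
```

In this instance the constant potential already attains the primal value, while a steeper potential gives a strictly smaller dual value, in line with the inequality $\mu(R_C\psi) - \nu(\psi) \leq V(\mu,\nu)$ that holds for every admissible $\psi$.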

On the restriction property
The restriction property of optimal transport roughly states that if a coupling is optimal, then the conditioning of the coupling to a subset is also optimal given its marginals. This property fails for weak optimal transport, as we illustrate with an example: We consider the weak transport problem with these ingredients, and observe that an optimal coupling is given by since it produces a cost equal to zero. Consider the set $K = \{(x,y) : y \neq 0\}$ and $\hat\pi(dx,dy) = \pi(dx,dy \mid K)$, the conditioning of $\pi$ to the set $K$, i.e. $\hat\pi(S) := \pi(S \cap K)/\pi(K)$. It follows that and denoting by $\hat\mu$ and $\hat\nu$ the first and second marginals of $\hat\pi$, we have $\hat\mu = \mu$ and $\hat\nu = \tfrac12 \delta_2 + \tfrac12 \delta_{-2}$. With $\hat\mu$ and $\hat\nu$ and again the cost $C$ as ingredients, an optimizer for the weak transport problem is given by since this time this coupling produces a cost equal to zero. On the other hand the cost of $\hat\pi$ is equal to 1, and so $\hat\pi$ is not optimal between its marginals.
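The phenomenon can be verified numerically on one concrete instance. The data below are assumptions chosen to match the values quoted in the text ($\hat\mu = \mu$, $\hat\nu = \tfrac12\delta_2 + \tfrac12\delta_{-2}$, costs $0$, $1$ and $0$), not necessarily the authors' exact example: $\mu = \tfrac12(\delta_{-1} + \delta_1)$, the barycentric cost $C(x,p) = (x - \int y\,p(dy))^2$, and a coupling $\pi$ whose kernels are centred at $x$. The helper names `marginals`, `weak_cost` and `condition` are hypothetical.

```python
from fractions import Fraction

def marginals(pi):
    """First and second marginals of a discrete coupling {(x, y): mass}."""
    mu, nu = {}, {}
    for (x, y), w in pi.items():
        mu[x] = mu.get(x, 0) + w
        nu[y] = nu.get(y, 0) + w
    return mu, nu

def weak_cost(pi):
    """Sum over x of mu(x) * (x - mean(pi_x))^2 (assumed barycentric cost)."""
    mu, _ = marginals(pi)
    first_moment = {}
    for (x, y), w in pi.items():
        first_moment[x] = first_moment.get(x, 0) + w * y
    return sum(w * (x - first_moment[x] / w) ** 2 for x, w in mu.items() if w > 0)

def condition(pi, keep):
    """Conditioning pi(. | K) for K = {(x, y): keep(x, y)}."""
    mass = sum(w for (x, y), w in pi.items() if keep(x, y))
    return {(x, y): w / mass for (x, y), w in pi.items() if keep(x, y)}

q, e = Fraction(1, 4), Fraction(1, 8)
pi = {(-1, -2): q, (-1, 0): q, (1, 0): q, (1, 2): q}       # cost zero, hence optimal
pi_hat = condition(pi, lambda x, y: y != 0)                 # conditioning to K = {y != 0}
competitor = {(-1, -2): 3 * e, (-1, 2): e, (1, -2): e, (1, 2): 3 * e}

print(weak_cost(pi), weak_cost(pi_hat), weak_cost(competitor))
print(marginals(pi_hat) == marginals(competitor))
```

The conditioned coupling and the competitor share the same marginals, but only the competitor has kernels centred at $x$, which is why the conditioned coupling fails to be optimal in this instance.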

C-Monotonicity for weak transport costs
Cyclical monotonicity plays a crucial role in classical optimal transport [27,16]. This has inspired similar developments for weak transport costs in [5,17]:

Definition 5.1 (C-monotonicity). We say that a coupling $\pi \in \Pi(\mu,\nu)$ is C-monotone if there exists a measurable set $\Gamma \subseteq X$ with $\mu(\Gamma) = 1$, such that for any finite number of points $x_1, \dots, x_N$ in $\Gamma$ and measures $m_1, \dots, m_N$ in $\mathcal{P}(Y)$ with $\sum_{i=1}^N m_i = \sum_{i=1}^N \pi_{x_i}$, the following inequality holds:
$$\sum_{i=1}^N C(x_i, \pi_{x_i}) \leq \sum_{i=1}^N C(x_i, m_i).$$
We first show that C-monotonicity is necessary for optimality under minimal assumptions. We then provide strengthened assumptions under which C-monotonicity is sufficient.

5.1. C-monotonicity: necessity. We denote by $S_N$ the set of permutations of the set $\{1, \dots, N\}$. If $z := (z_1, \dots, z_N)$ is any $N$-vector, and $\sigma \in S_N$, we naturally overload the notation by defining $\sigma(z) := (z_{\sigma(1)}, \dots, z_{\sigma(N)})$.
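For finitely supported kernels the inequality in the definition of C-monotonicity can be tested directly. The following sketch is an illustration under assumptions: it uses the barycentric cost $C(x,p) = (x - \int y\,p(dy))^2$, two points $x_1 = -1$, $x_2 = 1$, and hypothetical helper names (`cost`, `total_mass`, `violates_c_monotonicity`). It checks that the competitors $m_i$ redistribute exactly the mass $\sum_i \pi_{x_i}$ and then compares the two sums of costs.

```python
from fractions import Fraction

def cost(x, p):
    """Assumed example cost C(x, p) = (x - mean of p)^2."""
    return (x - sum(y * w for y, w in p.items())) ** 2

def total_mass(measures):
    """Sum of finitely supported measures, each given as {y: weight}."""
    out = {}
    for m in measures:
        for y, w in m.items():
            out[y] = out.get(y, 0) + w
    return {y: w for y, w in out.items() if w != 0}

def violates_c_monotonicity(xs, kernels, competitors, cost):
    """True if sum_i C(x_i, pi_{x_i}) > sum_i C(x_i, m_i) for the given competitors m_i."""
    if total_mass(kernels) != total_mass(competitors):
        raise ValueError("competitors must redistribute exactly the mass of the kernels")
    lhs = sum(cost(x, p) for x, p in zip(xs, kernels))
    rhs = sum(cost(x, m) for x, m in zip(xs, competitors))
    return lhs > rhs

h, a, b = Fraction(1, 2), Fraction(3, 4), Fraction(1, 4)
xs = [-1, 1]

# Kernels delta_{-2}, delta_2: mixing them recentres the barycentres at x_i.
bad = violates_c_monotonicity(xs, [{-2: 1}, {2: 1}], [{-2: a, 2: b}, {-2: b, 2: a}], cost)

# Kernels already centred at x_i: this particular rearrangement does not improve the cost.
good = violates_c_monotonicity(xs, [{-2: h, 0: h}, {0: h, 2: h}], [{0: 1}, {-2: h, 2: h}], cost)
print(bad, good)
```

A single violating rearrangement is enough to rule out C-monotonicity on the chosen points, whereas confirming C-monotonicity requires the inequality for all admissible competitors, which this sketch does not attempt.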
Recall the notation (1.1) for the weak transport problem. Our main result, concerning the necessity of C-monotonicity is the following: Theorem 5.2. Let C be jointly measurable and C(x, ·) be convex and lower semicontinuous for all x. Assume that π * is optimal for V(µ, ν) and |V(µ, ν)| < ∞. Then π * is C-monotone.
Proof. Let N ∈ N. Then is an analytic set. Write D N := proj X N (D N ).
By Jankov-von Neumann uniformization [21,Theorem 18.1] there is an analytically measurable function f N : D N → P(Y) N such that graph( f N ) ⊆ D N . We can extend f N to X N by defining it on X N \ D N as the Borel-measurable map x → (π * x 1 , . . . , π * x N ). Observe that for all σ ∈ S N , we have (σ, σ)(D N ) = D N . Thanks to this, and Lemma 5.3 below, we can assume without loss of generality that f N satisfies We write f i N ( x) for the i-th element of the vector f N ( x) ∈ P(Y) N . Assume that there exists a coupling Q ∈ Π(µ N ) = Π(µ, . . . , µ) such that Q(D N ) > 0. We now show that this is in conflict with optimality of π * . We clearly may assume that Q is symmetric, i.e. such that for all σ ∈ S N we have Q(B) = Q(σ(B)) for all B ∈ B(X N ) (in other words σ(Q) = Q). We define the possible contenderπ of π * bỹ π(dx 1 , dy) := µ(dx 1 ) which is legitimate owing to all measurability precautions we have taken. We will prove (1)π ∈ Π(µ, ν), Ad (1): Evidently the first marginal ofπ is µ. Write σ i ∈ S N for the permutation that merely interchanges the first and i-th component of a vector. By the symmetric properties of Q and f N we find Ad (2): On D N holds by construction the strict inequality Using convexity of C(x, ·) and the symmetry properties of Q and f N , we find yielding a contradiction to the optimality of π * . We conclude that no measure Q with the stated properties exists. By "Kellerer's lemma" [8, Proposition 2.1], which is also true for analytic sets, we obtain that D N is contained in a set of the form N k=1 proj −1 k (M N ) where µ(M N ) = 0 and proj k denotes the projection from X N to its k-th component. Since N ∈ N was arbitrary, we can define the set Γ := ( N∈N M N ) C with µ(Γ) = 1, which has the desired property.
The missing bit in the above proof is Lemma 5.3. By [21,Theorem 7.9] there exists for every Polish space X a closed subset F of the Baire space N := N N and a continuous bijection h X : F → X. On the Baire space the lexicographic order naturally provides a total order. Hence, X inherits the total order of F ⊆ N by virtue of h X and its Borel-measurable inverse h −1 X := g X , namely: is Borel-measurable. Given f : A ⊆ X N → Y N an analytically measurable function, there exists an analytically measurable extensionf : X N → Y N such that for any σ ∈ S N f • σ = σ •f.
With these precautions g( a) = σ is indeed well defined. For each σ ∈ S N we define also B σ ⊆ N N by where the order ≤ i σ is defined depending on σ by It follows from this representation that B σ is Borel-measurable. We introduce We can apply Lemma 5.4, proving the continuity 1 of We define the candidate for the desired extension of f bŷ As a composition of analytically measurable function,f inherits this property. It is also clear thatf ( Finally, for any σ ∈ S N and x ∈ X N , we easily find where the metric d N on N is given by Proof. We show the assertion by induction. For N = 1 (5.2) holds trivially. Now assume that (5.2) holds for N = k. Given σ ∈ S k+1 and a, b ∈ N k+1 increasing, we know that anỹ σ ∈ S k yields max i∈{1,...,k} If σ(k + 1) = k + 1 the assertion follows by the inductive hypothesis. So let σ(k + 1) k + 1 and write k 1 = σ(k + 1) and k 2 = σ −1 (k + 1). Define a permutationσ ∈ S k bŷ Since that a k 2 ≤ a k+1 and b k 1 ≤ b k+1 , then 1 In fact one obtains max i∈{1,...,N} d N (g( a)( a) i , g( b)( b) i ) ≤ max i∈{1,...,N} d N (a i , b i ), for d N the metric on N that we recall in Lemma 5.4.
5.2. C-monotonicity: sufficiency. The conditions under which Theorem 5.2 holds are rather mild. If we assume further continuity properties of $C$, the next theorem establishes that C-monotonicity is also a sufficient criterion for optimality, resembling the classical case. For weak transport costs, we do not know of any comparable result in the literature. We recall that, for the given compatible complete metric $d_Y$ on $Y$, we denote by $W_1$ the 1-Wasserstein distance [35, Chapter 7].
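Theorem 5.5. Assume that $C : X \times \mathcal{P}^1_{d_Y}(Y) \to \mathbb{R}$ satisfies Condition (A+) and is $W_1$-Lipschitz in the second argument in the sense that, for some constant $L \geq 0$,
$$|C(x,p) - C(x,q)| \leq L\,W_1(p,q) \quad \text{for all } x \in X \text{ and } p, q \in \mathcal{P}^1_{d_Y}(Y). \tag{5.4}$$
If $\pi$ is C-monotone then $\pi$ is an optimizer of $V(\mu,\nu)$.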
In the proof we will use the following auxiliary result, which we will establish subsequently: . Assume that C : X × P 1 d Y (Y) → R satisfies condition (A+) and is W 1 -Lipschitz in the sense of (5.4). Then inf π∈Π(µ,ν) where R C ϕ is defined as in (3.1).
Proof of Theorem 5.5. Let π be C-monotone. There is an increasing sequence (K n ) n∈N of compact sets on Y such that ν(K n ) 1. From this we can refine the µ-full measurable set Γ in the definition of C-monotonicity, see Definition 5.1, so that for each x ∈ Γ we have lim n π x (K n ) = 1 and π x ∈ P 1 d Y (Y). Our goal is to construct a dual optimizer ϕ ∈ Φ 1 to π such that π x (ϕ) When this is achieved, Theorem 1.3 and the following arguments show that π is optimal as desired: where we used that lim inf Let us prove the existence of a dual optimizer in Φ 1 . Let G ⊆ Γ be a finite subset. By definition of C-monotonicity, we conclude that the coupling 1 |G| x i ∈G δ x i (dx)π x i (dy) is optimal for the weak transport problem determined by the cost C and its first and second marginals. We can apply Lemma 5.6 in this context and obtain We fix y 0 ∈ K 1 and, without loss of generality, find a maximizing sequence (ϕ k ) k∈N of (5.6) such that for all k ∈ N the function ϕ k is L-Lipschitz and ϕ k (y 0 ) = 0. Note that for all By the Arzelà-Ascoli theorem we find for any n ∈ N a subsequence of (ϕ k ) k∈N and a L-Lipschitz continuous function ψ n on K n such that lim j ϕ k j (y) = ψ n (y) ∀y ∈ K n .
Thus by a diagonalization argument we can assume without loss of generality that the maximizing sequence converges uniformly for every K n to a given L-Lipschitz functionψ defined on A := n K n .
We can extendψ from A to all of Y, obtaining an everywhere L-Lipschitz function, via By dominated convergence, and the fact that π x (A) = 1, we have lim k π x (ϕ k ) = π x (ψ), (5.9) which yields For G ⊆ Y define Ψ G as the set of all L-Lipschitz continuous functions on A, vanishing at the point y , and satisfying The previous arguments show that, for each finite G ⊆ Γ, the set Ψ G is nonempty. We now check that Ψ G is closed in the topology of pointwise convergence: Let (ψ α ) α∈I be a net in Ψ G which converges pointwise to a function ϕ on A. Since A is the countable union of compact sets, it is possible to extract a sequence (ψ α k ) k∈N of the net such that ψ α k → ϕ pointwise on A and uniformly on each K n , from which ϕ is L-Lipschitz on A and can be extended to an L-Lipschitz continuous function ψ on Y, see (5.7). By repeating previous arguments (see (5.8), (5.9) and (5.10)) we obtain that ϕ ∈ Ψ G .
Note that Ψ G is a closed subset of y∈A [−Ld(y, y ), Ld(y, y )] which is compact in the topology of pointwise convergence by Tychonoff's theorem. Further, the collection {Ψ G : G ⊆ Γ, |G| < ∞} satisfies the finite intersection property, since if G 1 , . . . , G n are finite then Therefore it is possible to find ϕ ∈ G⊆Γ, |G|<∞ Ψ G . Again extend ϕ, from A to Y, by a L-Lipschitz function as usual. Thus, we have found the desired dual optimizer.
Proof of Lemma 5.6. By Theorem 1.3 we have (5.11) By Theorem 1.2 we find a minimizer π * ∈ Π(µ, ν) of V(µ, ν). Now we proceed by taking a maximizing sequence (ϕ k ) k∈N for the right-hand side of (5.11). Note that we can choose each ϕ k , in addition to being below-bounded and continuous, in a way such that it attains its infimum, i.e., there exists y k ∈ Y such that Indeed, this can be done by using e.g.
and the following computation shows that $(\varphi_k \vee (b_k + \tfrac{1}{k}))_{k \in \mathbb{N}}$ is another maximizing sequence: So let $\varphi_k$ attain its infimum as in (5.12). We want to show that we can choose the sequence to be Lipschitz with constant $L$. For this purpose we infer additional properties of potential minimizers of $R_C\varphi_k$. Define for each function $\varphi_k$ the Borel-measurable sets That $A_k \neq \emptyset$ follows since the minimizers of $\varphi_k$ form a subset. We also stress that To see the converse, assume $y \in A_k^c \cap \mathrm{proj}_1(Y_k)^c$. Define $Z(z') := \{z \in Y : \varphi_k(z') - \varphi_k(z) > L\,d_Y(z,z')\}$. If there exists $\bar z \in Z(y) \cap A_k$, we obtain a contradiction to $y \in \mathrm{proj}_1(Y_k)^c$. Let $z_0 := y$ and inductively set $z_l \in Z(z_{l-1})$ such that inf z∈Z(z l−1 ) We have for any natural numbers $0 \leq i < n$ (5.14) The r.h.s. is bounded from below by $L\,d_Y(z_i, z_n)$ and so as before we see that $z_n \in A_k$ provides a contradiction. We therefore assume for all $l$ that $z_l \notin A_k$. The above inequality yields by lower-boundedness of $\varphi_k$ that $(z_l)_{l \in \mathbb{N}}$ is a Cauchy sequence in $Y$. Writing $\bar z$ for its limit point, we conclude from (5.14) that $\varphi_k(z_i) - \varphi_k(\bar z) > L\,d_Y(z_i, \bar z)$ and consequently $Z(\bar z) \subseteq Z(z_i)$. Since then $\inf\{\varphi_k(z) : z \in Z(z_i)\} \leq \inf\{\varphi_k(z) : z \in Z(\bar z)\}$ and from (5.13), we deduce $\inf\{\varphi_k(z) : z \in Z(\bar z)\} \geq \varphi_k(\bar z)$. Thus $Z(\bar z) = \emptyset$, implying $\bar z \in A_k$ and yielding a contradiction to $y \in \mathrm{proj}_1(Y_k)^c$. All in all, we have proven that $A_k^c = \mathrm{proj}_1(Y_k)$. By Jankov-von Neumann uniformization [21, Theorem 18.1] there is an analytically measurable selection $T_k : \mathrm{proj}_1(Y_k) \to A_k$. We set $T_k$ on $A_k = \mathrm{proj}_1(Y_k)^c$ as the identity.
Therefore, we can assume that potential minimizers of R C ϕ k are concentrated on A k : p(ϕ k ) + C(x, p). (5.15) Thanks to proj 1 (Y k ) = A c k , we introduce a family of L-Lipschitz continuous functions by ψ k (y) := inf has x-marginal µ and y-marginal ν, and furthermore yπ x (dx) = zπ x (dx) (µ-a.s.), by the martingale property of m. Thus, by Jensen's inequality: Taking ε → 0 we conclude.
Let now $(K_n)_n$ be an increasing sequence of compact sets such that $\mu(K_n) \nearrow 1$. Denote by $\mu_n$ the conditioning of $\mu$ to $K_n$, and let $\nu_n := T(\mu_n)$. Then both $\mu_n$, $\nu_n$ have compact support, since $T$ is Lipschitz. By the restriction result Proposition 4.2 we deduce that the coupling $p_n(dx,dy) := \mu_n(dx)\,\delta_{T(x)}(dy)$ is optimal for $V(\mu_n,\nu_n)$. We may apply [17, Theorem 1.2(b)] and obtain the existence of a convex continuously differentiable function $\varphi_n$ whose gradient is 1-Lipschitz, and by optimality of $p_n$ and [17, Theorem 1.2(c)], we have
$$T(x) = \nabla\varphi_n(x) \quad \text{for } \mu\text{-a.e. } x \in K_n. \tag{6.4}$$
This implies in particular $\nabla\varphi_n(x) = T(x)$ for $\mu$-a.e. $x \in K_1$, so there exists $\bar x \in K_1 \subseteq K_n$ such that $\nabla\varphi_n(\bar x) = T(\bar x)$ for all $n$. Thanks to this, and equicontinuity of $(\nabla\varphi_n)_n$, we may apply Arzelà-Ascoli locally, proving via a diagonalization argument that (modulo selection of a subsequence) there exists a 1-Lipschitz function $\bar T$ such that $\nabla\varphi_n \to \bar T$ locally uniformly. Without loss of generality we may assume $\varphi_n(0) = 0$. Some elementary calculus then shows that $\varphi_n$ converges pointwise to a function $\varphi$. We deduce that $\varphi$ is convex and differentiable, with $\nabla\varphi = \bar T$. It is therefore continuously differentiable. From (6.4) we derive that $T = \nabla\varphi$ $\mu$-a.s. in $K_n$ and so $T = \nabla\varphi$ $\mu$-a.s. In particular $\mu^* = \nabla\varphi(\mu)$.
The arguments for the final sentence of Theorem 1.4 are the same as in the proof of [17, Theorem 1.2(c)].