Optimal Transport

In the context of, e.g., the Wasserstein GAN, it can be helpful to think of the discrete Wasserstein distance (and, more generally, the optimal transport cost) between two finite distributions as a minibatch approximation of the Wasserstein distance between continuous distributions. If p, q are continuous distributions on R^d, x_1, ..., x_n ∼ p, and y_1, ..., y_m ∼ q, denote the empirical distributions over the samples by p̃ and q̃ respectively:

\[ \tilde{p} = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}, \qquad \tilde{q} = \frac{1}{m} \sum_{j=1}^{m} \delta_{y_j}. \]
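As a minimal sketch of the minibatch idea, the snippet below computes the Wasserstein distance between two empirical distributions in a special case chosen for illustration, not taken from the slides: samples on the real line, equal batch sizes n = m, uniform weights, and cost |x − y|^p. Under these assumptions the sorted (monotone) matching is optimal, so no general optimal transport solver is needed. The function name `empirical_wasserstein_1d` and the sample values are hypothetical.

```python
def empirical_wasserstein_1d(xs, ys, p=1):
    """W_p between two equal-size empirical measures on the real line.

    Assumes len(xs) == len(ys) == n with uniform weights 1/n. In that case
    matching the i-th smallest x to the i-th smallest y is optimal for the
    cost |x - y|**p with p >= 1.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles minibatches of equal size")
    n = len(xs)
    total = sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys)))
    return (total / n) ** (1.0 / p)


# Two hypothetical minibatches standing in for samples from p and q.
x_batch = [0.0, 1.0, 3.0]
y_batch = [5.0, 2.0, 4.0]

w1 = empirical_wasserstein_1d(x_batch, y_batch, p=1)
w2 = empirical_wasserstein_1d(x_batch, y_batch, p=2)
```

For samples in R^d with d > 1, or for unequal batch sizes, the sorting shortcut no longer applies and the discrete problem has to be solved as a linear program.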


Figure: Illustration of Monge's Problem
The following explanations largely follow Villani (2009).

Definition (Deterministic Coupling)
A coupling (X, Y) is said to be deterministic if there exists a measurable function T : X → Y such that Y = T(X).
Unlike couplings, deterministic couplings do not always exist. To say that (X, Y) is a deterministic coupling of µ and ν is strictly equivalent to any one of the four statements below:
(i) (X, Y) is a coupling of µ and ν whose law π is concentrated on the graph of a measurable function T : X → Y;
(ii) X has law µ and Y = T(X), where T#µ = ν;
(iii) X has law µ and Y = T(X), where T is a change of variables from µ to ν: for all ν-integrable (resp. nonnegative measurable) functions ϕ,
\[ \int_{Y} \varphi(y)\, d\nu(y) = \int_{X} \varphi(T(x))\, d\mu(x); \tag{1} \]
(iv) π = (Id, T)#µ.
It is common to call T the transport map: informally, one can say that T transports the mass represented by the measure µ to the mass represented by the measure ν.
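The push-forward condition T#µ = ν and the change-of-variables formula ∫ ϕ dν = ∫ ϕ(T(x)) dµ(x) can be checked by hand on finitely supported measures. The sketch below is only an illustration: the helper names `pushforward` and `integrate`, the measure `mu`, the map `T`, and the test function `phi` are all hypothetical choices, and exact rational arithmetic is used so that the two sides can be compared exactly.

```python
from collections import defaultdict
from fractions import Fraction


def pushforward(mu, T):
    """Return T#mu for a finitely supported measure mu given as {point: mass}."""
    nu = defaultdict(Fraction)  # missing keys start at mass 0
    for x, mass in mu.items():
        nu[T(x)] += mass  # all mass sitting at x is moved to T(x)
    return dict(nu)


def integrate(phi, measure):
    """Integral of phi against a finitely supported measure."""
    return sum(phi(z) * mass for z, mass in measure.items())


# Hypothetical example: mu on {-1, 0, 1}, transported by T(x) = x^2.
mu = {-1: Fraction(1, 4), 0: Fraction(1, 2), 1: Fraction(1, 4)}
T = lambda x: x * x
nu = pushforward(mu, T)

# Change of variables: integrating phi against nu equals
# integrating phi(T(x)) against mu.
phi = lambda y: 3 * y + 2
lhs = integrate(phi, nu)
rhs = integrate(lambda x: phi(T(x)), mu)
```

The same picture shows why deterministic couplings need not exist: the push-forward of a Dirac mass is again a Dirac mass, so no map T transports µ = δ_0 to ν = ½δ_0 + ½δ_1.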

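Existence of optimal couplings - Proof
Lemma (Lower semicontinuity of the cost functional)
Let X and Y be two Polish spaces, and c : X × Y → R ∪ {+∞} a lower semicontinuous cost function. Let h : X × Y → R ∪ {−∞} be an upper semicontinuous function such that c ≥ h. Let (π_k)_{k∈N} be a sequence of probability measures on X × Y, converging weakly to some π ∈ P(X × Y), in such a way that h ∈ L^1(π_k), h ∈ L^1(π), and
\[ \int_{X \times Y} h \, d\pi_k \xrightarrow[k \to \infty]{} \int_{X \times Y} h \, d\pi. \]
Then
\[ \int_{X \times Y} c \, d\pi \le \liminf_{k \to \infty} \int_{X \times Y} c \, d\pi_k. \]
In particular, if c is nonnegative, then F : π ↦ ∫ c dπ is lower semicontinuous on P(X × Y), equipped with the topology of weak convergence.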

Lemma (Tightness of transference plans)
Let X and Y be two Polish spaces. Let P ⊂ P(X) and Q ⊂ P(Y) be tight subsets of P(X) and P(Y) respectively. Then the set Π(P, Q) of all transference plans whose marginals lie in P and Q respectively is itself tight in P(X × Y).
Proof (existence of an optimal coupling): Since X is Polish, {µ} is tight in P(X); similarly, {ν} is tight in P(Y). By Lemma (Tightness of transference plans), Π(µ, ν) is tight in P(X × Y), and by Prokhorov's theorem this set has a compact closure. By passing to the limit in the equation for marginals, we see that Π(µ, ν) is closed, so it is in fact compact.
Let (π_k)_{k∈N} be a sequence of probability measures in Π(µ, ν) such that ∫ c dπ_k converges to the infimum transport cost. Extracting a subsequence if necessary, we may assume that π_k converges weakly to some π ∈ Π(µ, ν). The function h : (x, y) ↦ a(x) + b(y) lies in L^1(π_k) and in L^1(π), and c ≥ h by assumption; moreover,
\[ \int h \, d\pi_k = \int h \, d\pi = \int a \, d\mu + \int b \, d\nu. \]
So Lemma (Lower semicontinuity of the cost functional) implies
\[ \int c \, d\pi \le \liminf_{k \to \infty} \int c \, d\pi_k. \]
Thus π is minimizing.

Lower semicontinuity of the cost functional - Proof
Lemma (Lower semicontinuity of the cost functional)
Let X and Y be two Polish spaces, and c : X × Y → R ∪ {+∞} a lower semicontinuous cost function. Let h : X × Y → R ∪ {−∞} be an upper semicontinuous function such that c ≥ h. Let (π_k)_{k∈N} be a sequence of probability measures on X × Y, converging weakly to some π ∈ P(X × Y), in such a way that h ∈ L^1(π_k), h ∈ L^1(π), and
\[ \int_{X \times Y} h \, d\pi_k \xrightarrow[k \to \infty]{} \int_{X \times Y} h \, d\pi. \]
Then
\[ \int_{X \times Y} c \, d\pi \le \liminf_{k \to \infty} \int_{X \times Y} c \, d\pi_k. \]
In particular, if c is nonnegative, then F : π ↦ ∫ c dπ is lower semicontinuous on P(X × Y), equipped with the topology of weak convergence.
Proof: Replacing c by c − h, we may assume that c is a nonnegative lower semicontinuous function. Then c can be written as the pointwise limit of a nondecreasing family (c_ℓ)_{ℓ∈N} of continuous real-valued functions. By monotone convergence,
\[ \int c \, d\pi = \lim_{\ell \to \infty} \int c_\ell \, d\pi = \lim_{\ell \to \infty} \lim_{k \to \infty} \int c_\ell \, d\pi_k \le \liminf_{k \to \infty} \int c \, d\pi_k. \]
Here the second equality uses the weak convergence of π_k to π (the c_ℓ may be taken bounded), and the inequality uses c_ℓ ≤ c.
Theorem of Baire: Assume X is a metric space. Every lower semicontinuous function f : X → [−∞, +∞] is the limit of a monotone increasing sequence of extended real-valued continuous functions on X; if f does not take the value −∞, the continuous functions can be taken to be real-valued.

Tightness of transference plans - Proof
Lemma (Tightness of transference plans)
Let X and Y be two Polish spaces. Let P ⊂ P(X) and Q ⊂ P(Y) be tight subsets of P(X) and P(Y) respectively. Then the set Π(P, Q) of all transference plans whose marginals lie in P and Q respectively is itself tight in P(X × Y).
Proof: Let µ ∈ P, ν ∈ Q, and π ∈ Π(µ, ν). By assumption, for any ε > 0 there is a compact set K_ε ⊂ X, independent of the choice of µ in P, such that µ[X \ K_ε] ≤ ε; and similarly, there is a compact set L_ε ⊂ Y, independent of the choice of ν in Q, such that ν[Y \ L_ε] ≤ ε. Then for any coupling (X, Y) of (µ, ν),
\[ \mathbb{P}\big[ (X, Y) \notin K_\varepsilon \times L_\varepsilon \big] \le \mathbb{P}[X \notin K_\varepsilon] + \mathbb{P}[Y \notin L_\varepsilon] \le 2\varepsilon. \]
The desired result follows since this bound is independent of the coupling, and K_ε × L_ε is compact in X × Y.

Remarks on the Theorem
The lower bound c(x, y) ≥ a(x) + b(y) ensures that the expected costs E[c(X, Y)] are well-defined in R ∪ {+∞}. Often, c is nonnegative, so one can choose a = 0 and b = 0.
This existence theorem does not imply that the optimal cost is finite. It might be that all transport plans lead to an infinite total cost, i.e., ∫ c dπ = +∞ for all π ∈ Π(µ, ν).
A simple condition to rule out this annoying possibility is
\[ \int_{X \times Y} c(x, y) \, d\mu(x) \, d\nu(y) < +\infty, \]
which guarantees that at least the independent coupling has finite total cost. A stronger assumption is
\[ c(x, y) \le c_X(x) + c_Y(y), \qquad (c_X, c_Y) \in L^1(\mu) \times L^1(\nu), \]
which implies that any coupling has finite total cost.
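To see why a bound of this stronger type controls every coupling, here is a short worked estimate. It assumes the condition takes the form c(x, y) ≤ c_X(x) + c_Y(y) with c_X ∈ L^1(µ) and c_Y ∈ L^1(ν); the names c_X and c_Y are notation introduced for this sketch. For any π ∈ Π(µ, ν), the marginal constraints give

\[ \int_{X \times Y} c \, d\pi \le \int_{X \times Y} c_X(x) \, d\pi(x, y) + \int_{X \times Y} c_Y(y) \, d\pi(x, y) = \int_{X} c_X \, d\mu + \int_{Y} c_Y \, d\nu < +\infty. \]

The right-hand side does not depend on π, so the bound holds uniformly over all transference plans, not just for the independent coupling π = µ ⊗ ν.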

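Kantorovich-Rubinstein-Duality
Theorem (Kantorovich-Rubinstein-Duality)
\[ \min_{\pi \in \Pi(\mu, \nu)} \int_{X \times Y} c \, d\pi = \sup_{(\varphi, \psi)} \left( \int_{X} \varphi \, d\mu + \int_{Y} \psi \, d\nu \right), \]
where the supremum is taken over pairs (ϕ, ψ) of integrable functions satisfying ϕ(x) + ψ(y) ≤ c(x, y) for all (x, y) ∈ X × Y.
Economic Interpretation: Let X be a set of bakeries and Y be a set of cafes. The problem in the Kantorovich formulation corresponds to minimizing the total transport cost paid by a consortium of bakeries and cafes. Now assume that there is a transportation company that buys a unit from the bakery x ∈ X at the price ϕ(x) and sells it to the cafe y ∈ Y at the price ψ(y). To be competitive with the direct agreement between bakeries and cafes, it must hold that ψ(y) − ϕ(x) ≤ c(x, y). Then the profit is
\[ \int_{Y} \psi \, d\nu - \int_{X} \varphi \, d\mu, \]
which corresponds to the dual formulation (except for the sign change of ϕ).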
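The bakery and cafe story can be played through on a tiny discrete instance. The sketch below is an illustration under stated assumptions, not the general theorem: three bakeries and three cafes on the real line, uniform masses 1/3, and cost c(x, y) = |x − y|. With equal sizes and uniform weights an optimal plan can be taken to be a permutation, so the primal problem is solved by brute force. For the dual side, given buying prices ϕ the company charges the highest competitive selling prices ψ(y) = min_x [ϕ(x) + c(x, y)], which satisfy ψ(y) − ϕ(x) ≤ c(x, y) by construction. All names and numbers are hypothetical.

```python
from itertools import permutations

bakeries = [0.0, 1.0, 4.0]  # positions x, each carrying mass 1/3
cafes = [2.0, 3.0, 5.0]     # positions y, each carrying mass 1/3
n = len(bakeries)


def cost(x, y):
    return abs(x - y)


# Primal: cheapest way for the consortium to ship bread directly.
primal = min(
    sum(cost(bakeries[i], cafes[s[i]]) for i in range(n)) / n
    for s in permutations(range(n))
)


def dual_profit(phi):
    """Profit of the transport company for buying prices phi.

    Selling prices are set to psi(y) = min_x [phi(x) + c(x, y)], the largest
    values that keep psi(y) - phi(x) <= c(x, y) for every pair (x, y).
    """
    psi = [min(phi[i] + cost(bakeries[i], y) for i in range(n)) for y in cafes]
    return (sum(psi) - sum(phi)) / n


# Arbitrary prices give a profit that cannot exceed the primal cost ...
some_profit = dual_profit([0.0, 0.5, 2.0])
# ... while the choice phi(x) = x closes the gap in this example.
best_profit = dual_profit(bakeries)
```

In this instance the arbitrary prices yield a profit of 1.5, strictly below the primal cost of 5/3, while ϕ(x) = x attains it, which is what the duality statement predicts for the optimal pair of prices.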

The Wasserstein distances
Definition (Wasserstein distances)
Let (X, d) be a Polish metric space, and let p ∈ [1, ∞). For any two probability measures µ, ν on X, the Wasserstein distance of order p between µ and ν is defined by the formula
\[ W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{X \times X} d(x, y)^p \, d\pi(x, y) \right)^{1/p}. \]
Example: W_p(δ_x, δ_y) = d(x, y). In this example, the distance does not depend on p; but this is not the rule.
At the present level of generality, W_p is still not a distance in the strict sense, because it might take the value +∞; but otherwise it does satisfy the axioms of a distance.
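A small worked example, chosen here for illustration, shows how the Wasserstein distance of order p can depend on p. Take X = R with d(x, y) = |x − y|, µ = δ_0 and ν = ½δ_0 + ½δ_1. Since µ is a Dirac mass, the only coupling of µ and ν is the product measure δ_0 ⊗ ν, so

\[ W_p(\mu, \nu)^p = \tfrac{1}{2} \, |0 - 0|^p + \tfrac{1}{2} \, |0 - 1|^p = \tfrac{1}{2}, \qquad \text{hence} \qquad W_p(\mu, \nu) = 2^{-1/p}. \]

In particular W_1(µ, ν) = 1/2 while W_2(µ, ν) = 1/√2, and the value increases towards 1 as p → ∞.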

Definition (Wasserstein space)
The Wasserstein space of order p is defined as
\[ P_p(X) := \left\{ \mu \in P(X) : \int_{X} d(x_0, x)^p \, d\mu(x) < +\infty \right\}, \]
where x_0 ∈ X is arbitrary. This space does not depend on the choice of the point x_0. Then W_p defines a (finite) distance on P_p(X).
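The independence from the base point can be checked in one line. As a sketch, let x_1 ∈ X be any other point; the triangle inequality and the convexity of t ↦ t^p for p ≥ 1 give

\[ d(x_1, x)^p \le \big( d(x_1, x_0) + d(x_0, x) \big)^p \le 2^{p-1} \big( d(x_1, x_0)^p + d(x_0, x)^p \big), \]

and integrating against µ yields

\[ \int_{X} d(x_1, x)^p \, d\mu(x) \le 2^{p-1} \left( d(x_1, x_0)^p + \int_{X} d(x_0, x)^p \, d\mu(x) \right). \]

So a finite p-th moment around x_0 forces a finite p-th moment around x_1, and by symmetry the converse holds as well.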

Convergence in Wasserstein sense
The notation \( \mu_k \xrightarrow{w} \mu \) means that µ_k converges weakly to µ, i.e.
\[ \int \varphi \, d\mu_k \xrightarrow[k \to \infty]{} \int \varphi \, d\mu \qquad \text{for every bounded continuous function } \varphi. \]
Definition (Weak convergence in P_p)
Let (X, d) be a Polish space, and p ∈ [1, ∞). Let (µ_k)_{k∈N} be a sequence of probability measures in P_p(X) and let µ be another element of P_p(X). Then (µ_k) is said to converge weakly in P_p(X) if any one of the following equivalent properties is satisfied for some (and then any) x_0 ∈ X: