Optimal Entropy-Transport problems and a new Hellinger-Kantorovich distance between positive measures

We develop a full theory for the new class of Optimal Entropy-Transport problems between nonnegative and finite Radon measures in general topological spaces. These problems arise quite naturally by relaxing the marginal constraints typical of Optimal Transport problems: given a pair of finite measures (with possibly different total mass), one looks for minimizers of the sum of a linear transport functional and two convex entropy functionals, which quantify the deviation of the marginals of the transport plan from the assigned measures. As a powerful application of this theory, we study the particular case of Logarithmic Entropy-Transport problems and introduce the new Hellinger-Kantorovich distance between measures in metric spaces. The striking connection between these two seemingly distant topics allows for a deep analysis of the geometric properties of the new geodesic distance, which lies somehow between the well-known Hellinger-Kakutani and Kantorovich-Wasserstein distances.


Introduction
The aim of the present paper is twofold: in Part I we develop a full theory of the new class of Optimal Entropy-Transport problems between nonnegative and finite Radon measures in general topological spaces. As a powerful application of this theory, in Part II we study the particular case of Logarithmic Entropy-Transport problems and introduce the new Hellinger-Kantorovich (HK) distance between measures in metric spaces. The striking connection between these two seemingly distant topics is our main focus, and it paves the way for a deep analysis of the geometric properties of the geodesic HK distance, which (as our proposed name suggests) can be understood as an inf-convolution of the well-known Hellinger-Kakutani and Kantorovich-Wasserstein distances. In fact, our approach to the theory was the opposite: in trying to characterize HK, we were first led to the Logarithmic Entropy-Transport problem, see Appendix A.
From Transport to Entropy-Transport problems. In the classical Kantorovich formulation, Optimal Transport problems [37,46,2,47] deal with the minimization of a linear cost functional

C(γ) = ∫_{X_1×X_2} c(x_1, x_2) dγ(x_1, x_2), c : X_1 × X_2 → R, (1.1)

among all the transport plans, i.e. probability measures γ ∈ P(X_1 × X_2) whose marginals µ_i = π_i♯γ ∈ P(X_i) are prescribed. Typically, X_1, X_2 are Polish spaces, µ_i are given Borel measures (but the case of Radon measures in Hausdorff topological spaces has also been considered, see [23,37]), the cost function c is a lower semicontinuous (or even Borel) function, possibly assuming the value +∞, and π_i(x_1, x_2) = x_i are the projections on the i-th coordinate.

Starting from the pioneering work of Kantorovich, an impressive theory has been developed in the last two decades: on one side, typical intrinsic questions of linear programming problems concerning duality, optimality, uniqueness and structural properties of optimal transport plans have been addressed and fully analyzed. In a parallel way, this rich general theory has been applied to many challenging problems in a variety of fields (probability and statistics, functional analysis, PDEs, Riemannian geometry, nonsmooth analysis in metric spaces, just to mention a few of them: since it is impossible here to give even a partial account of the main contributions, we refer to the books [47,39] for a more detailed overview and a complete list of references).

The class of Entropy-Transport problems we are going to study arises quite naturally if one tries to relax the marginal constraints π_i♯γ = µ_i by introducing suitable penalizing functionals F_i, which quantify the deviation from µ_i of the marginals γ_i := π_i♯γ of γ.
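The linear-programming structure of the classical Kantorovich problem can be made concrete on a toy discrete example. The following sketch (our own illustration, not part of the paper; it assumes SciPy is available) solves a 2×2 instance of (1.1) with prescribed marginals, using an off-the-shelf LP solver.

```python
import numpy as np
from scipy.optimize import linprog

# Discrete Kantorovich problem: minimize <c, gamma> over couplings gamma
# with prescribed marginals mu1, mu2 (both probability vectors).
mu1 = np.array([0.5, 0.5])
mu2 = np.array([0.25, 0.75])
c = np.array([[0.0, 1.0],
              [1.0, 0.0]])  # cost c(x_i, y_j)

n, m = c.shape
# Equality constraints on the flattened plan: row sums = mu1, column sums = mu2.
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0  # sum_j gamma[i, j] = mu1[i]
for j in range(m):
    A_eq[n + j, j::m] = 1.0           # sum_i gamma[i, j] = mu2[j]
b_eq = np.concatenate([mu1, mu2])

res = linprog(c.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.fun)  # optimal transport cost, here 0.25
```

Here the optimal plan keeps 0.25 units of mass in place at the first point and moves the excess 0.25 at unit cost.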
In this paper we consider the general case of integral functionals (also called Csiszár f-divergences [15]) of the form

F_i(γ_i|µ_i) := ∫_{X_i} F_i(σ_i) dµ_i + (F_i)_∞ γ_i^⊥(X_i), γ_i = σ_i µ_i + γ_i^⊥, (1.3)

induced by admissible entropy functions F_i, as for the power-like entropies

U_p(s) := (s^p − p(s − 1) − 1)/(p(p − 1)), p ≠ 0, 1, (1.4)

or for the total variation functional corresponding to the nonsmooth entropy V(s) := |s − 1|, considered in [35]. Notice that the presence of the singular part γ_i^⊥ in the Lebesgue decomposition of γ_i in (1.3) does not force F_i to be superlinear as s ↑ +∞ and allows for all the exponents p in (1.4).
Once a specific choice of entropies F_i and of finite nonnegative Radon measures µ_i ∈ M(X_i) is given, the Entropy-Transport problem can be formulated as

ET(µ_1, µ_2) := inf { E(γ|µ_1, µ_2) : γ ∈ M(X_1 × X_2) }, (1.5)

where E is the convex functional

E(γ|µ_1, µ_2) := F_1(γ_1|µ_1) + F_2(γ_2|µ_2) + ∫_{X_1×X_2} c(x_1, x_2) dγ. (1.6)

Notice that the entropic formulation allows for measures µ_1, µ_2 and γ with possibly different total masses. The flexibility in the choice of the entropy functions F_i (which may also take the value +∞) covers a wide spectrum of situations (see Section 3.3 for various examples) and in particular guarantees that (1.5) is a real generalization of the classical optimal transport problem, which can be recovered as a particular case of (1.6) when F_i is the indicator function of {1} (i.e. F_i(s) takes the value +∞ except at s = 1, where it vanishes).
Since we think that the structure (1.6) of Entropy-Transport problems will lead to new and interesting models and applications, we have tried to establish their basic theory in the greatest generality, by pursuing the same line of development of Transport problems: in particular we will obtain general results concerning existence, duality and optimality conditions.
Considering e.g. the Logarithmic Entropy case, where F_i(s) = s log s − (s − 1), the dual formulation of (1.5) is given by

sup { D(ϕ_1, ϕ_2|µ_1, µ_2) : ϕ_1 ⊕ ϕ_2 ≤ c }, (1.7)

where

D(ϕ_1, ϕ_2|µ_1, µ_2) := ∫_{X_1} (1 − e^{−ϕ_1}) dµ_1 + ∫_{X_2} (1 − e^{−ϕ_2}) dµ_2, (1.8)

and where one can immediately recognize the same convex constraint of Transport problems: the couple of dual potentials ϕ_i should satisfy ϕ_1 ⊕ ϕ_2 ≤ c on X_1 × X_2. The main difference is due to the concavity of the objective functional (ϕ_1, ϕ_2) → ∫_{X_1} (1 − e^{−ϕ_1}) dµ_1 + ∫_{X_2} (1 − e^{−ϕ_2}) dµ_2, whose form can be explicitly calculated in terms of the Legendre conjugates F*_i of the entropy functions. The change of variables ψ_i := 1 − e^{−ϕ_i} transforms (1.7) into the equivalent problem of maximizing the linear functional (ψ_1, ψ_2) → ∫_{X_1} ψ_1 dµ_1 + ∫_{X_2} ψ_2 dµ_2 on the more complicated convex set

{(ψ_1, ψ_2) : ψ_i < 1, (1 − ψ_1(x_1))(1 − ψ_2(x_2)) ≥ e^{−c(x_1,x_2)} on X_1 × X_2}. (1.9)

We will calculate the dual problem for every choice of F_i and show that its value always coincides with ET(µ_1, µ_2). The dual problem also provides optimality conditions, involving the couple of potentials (ϕ_1, ϕ_2), the support of the optimal plan γ and the densities σ_i of its marginals γ_i w.r.t. µ_i. For the Logarithmic Entropy-Transport problem above, they read as

σ_i > 0 and ϕ_i = −log σ_i µ_i-a.e. in X_i, ϕ_1 ⊕ ϕ_2 ≤ c in X_1 × X_2, ϕ_1 ⊕ ϕ_2 = c γ-a.e. in X_1 × X_2, (1.10)

and they are necessary and sufficient for optimality. The study of optimality conditions reveals a different behavior between pure transport problems and the other entropic ones. In particular, the c-cyclical monotonicity of the optimal plan γ (which is still satisfied in the entropic case) does not play a crucial role in the construction of the potentials ϕ_i. When the F_i(0) are finite (as in the logarithmic case) it is possible to obtain a general existence result for (generalized) optimal potentials even when c takes the value +∞.
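The Legendre conjugate underlying the dual objective can be checked numerically. The following sketch (our own sanity check, not from the paper) verifies that for the logarithmic entropy F(s) = s log s − (s − 1) one has F*(φ) = sup_{s>0} (sφ − F(s)) = e^φ − 1, with maximizer s = e^φ.

```python
import numpy as np

# Numeric sanity check: for F(s) = s*log(s) - (s - 1), the Legendre
# conjugate is F*(phi) = exp(phi) - 1, attained at s = exp(phi).

def F(s):
    return s * np.log(s) - (s - 1.0)

s = np.linspace(1e-6, 50.0, 500_000)   # grid approximating (0, infinity)
for phi in (-1.0, 0.0, 0.5, 1.0):
    approx = np.max(s * phi - F(s))    # crude evaluation of the supremum
    exact = np.exp(phi) - 1.0
    assert abs(approx - exact) < 1e-4, (phi, approx, exact)
```

The change of variables ψ = 1 − e^{−φ} then maps the concave integrand 1 − e^{−φ} onto the linear one appearing in (1.9).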
However, because of our original motivation (see Appendix A), Part II will focus on the case of the logarithmic entropy F_i = U_1. We will exploit its relevant geometric applications, reserving the other examples for future investigations.
From the Kantorovich-Wasserstein distance to the Hellinger-Kantorovich distance. From the analytic-geometric point of view, one of the most interesting cases of transport problems occurs when X_1 = X_2 = X and the cost functional C is induced by a distance d on X: in the quadratic case, the minimum value of (1.1) for measures µ_1, µ_2 in the space P_2(X) of probability measures with finite quadratic moment defines the so-called L²-Kantorovich-Wasserstein distance

W_d(µ_1, µ_2)² := min { ∫_{X×X} d²(x_1, x_2) dγ : γ ∈ P(X × X), π_i♯γ = µ_i }, (1.18)

which metrizes the weak convergence (with quadratic moments) of probability measures. The metric space (P_2(X), W_d) inherits many geometric features from the underlying (X, d) (such as separability, completeness, length and geodesic properties, positive curvature in the Alexandrov sense, see [2]). Its dynamic characterization in terms of the continuity equation [7] and its dual formulation in terms of the Hopf-Lax formula and the corresponding (sub-)solutions of the Hamilton-Jacobi equation [34] lie at the core of the applications to gradient flows and partial differential equations of diffusion type [2]. Finally, the behavior of entropy functionals such as (1.3) along geodesics in (P_2(X), W_d) [32,34,14] encodes valuable geometric information, with relevant applications to Riemannian geometry and to the recent theory of metric-measure spaces with Ricci curvature bounded from below [44,45,31,3,4,5,19]. It has been a challenging question to find a corresponding distance (enjoying analogous deep geometric properties) between finite positive Borel measures with arbitrary mass in M(X).
In the present paper we will show that by choosing the particular cost function

c(x_1, x_2) := ℓ(d(x_1, x_2)), where ℓ(d) := −log(cos²(d)) if d < π/2, +∞ otherwise, (1.19)

the corresponding Logarithmic Entropy-Transport problem

LET(µ_1, µ_2) := min { E(γ|µ_1, µ_2) : γ ∈ M(X × X) }

coincides with a (squared) distance on M(X) (which we will call Hellinger-Kantorovich distance and denote by HK) that can play the same fundamental role for M(X) as the Kantorovich-Wasserstein distance does for P_2(X).
HK(µ_1, µ_2) = min { W_{d_C}(α_1, α_2) : α_i ∈ P_2(C), h²α_i = µ_i }. (1.24)

It turns out that (the square of) (1.24) yields an equivalent variational representation of the LET functional. In particular, (1.24) shows that in the case of concentrated measures

LET(a_1 δ_{x_1}, a_2 δ_{x_2}) = HK²(a_1 δ_{x_1}, a_2 δ_{x_2}) = d²_C(x_1, a_1; x_2, a_2). (1.25)

Notice that (1.24) resembles the very definition (1.18) of the Kantorovich-Wasserstein distance, where now the role of the marginals π_i♯ is replaced by the homogeneous marginals h². It is a nontrivial part of the equivalence statement to check that the difference between the cut-off thresholds (π/2 in (1.21) and π in (1.22)) does not affect the identity LET = HK².
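The two-Dirac case (1.25) can be explored with a small numeric sketch. This is our own illustration, not the paper's proof: it assumes the standard cone metric d_C([x_1, r_1], [x_2, r_2])² = r_1² + r_2² − 2 r_1 r_2 cos(min(d(x_1, x_2), π)), lifts masses via r_i = sqrt(a_i), and compares coupling the two Diracs directly on the cone with routing the mass through the apex (cost a_1 + a_2); the minimum of the two is the closed form with cosine truncated at π/2.

```python
import math

# Cone distance with the pi cut-off (assumed standard cone metric).
def cone_dist_sq(r1, r2, d):
    return r1 * r1 + r2 * r2 - 2.0 * r1 * r2 * math.cos(min(d, math.pi))

# Candidate closed form for HK^2 between a1*delta_{x1} and a2*delta_{x2}:
# cosine truncated at pi/2, beyond which creation/annihilation via the
# apex is cheaper than direct transport.
def hk_sq_diracs(a1, a2, d):
    return a1 + a2 - 2.0 * math.sqrt(a1 * a2) * math.cos(min(d, math.pi / 2))

a1, a2 = 4.0, 1.0
for d in (0.0, 0.3, 1.0, 2.0, 3.5):
    direct = cone_dist_sq(math.sqrt(a1), math.sqrt(a2), d)
    via_apex = a1 + a2   # annihilate one mass, create the other at the apex
    assert abs(hk_sq_diracs(a1, a2, d) - min(direct, via_apex)) < 1e-12

print(hk_sq_diracs(4.0, 1.0, 0.0))  # -> 1.0, the Hellinger value (2 - 1)^2
```

For d = 0 the formula reduces to the squared Hellinger distance (sqrt(a_1) − sqrt(a_2))², while for d ≥ π/2 the masses no longer interact and the value saturates at a_1 + a_2.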
(iii) By refining the representation formula (1.24) through a suitable rescaling and gluing technique, we can prove that (M(X), HK) is a geodesic metric space, a property that is not at all obvious from the LET representation and depends on a subtle interplay between the entropy functions F_i(σ) = σ log σ − σ + 1 and the cost function c from (1.19). We show that the metric induces the weak convergence of measures in duality with bounded and continuous functions, so it is topologically equivalent to the flat or Bounded Lipschitz distance [17, Sec. 11.3], see also [24, Thm. 3]. It also inherits separability, completeness, length and geodesic properties from the corresponding ones of the underlying space (X, d). On top of that, we will prove a precise superposition principle (in the same spirit as the Kantorovich-Wasserstein one [2, Sect. 8], [30]) for general absolutely continuous curves in (M(X), HK) in terms of dynamic plans in C: as a byproduct, we can give a precise characterization of absolutely continuous curves and geodesics as homogeneous marginals of corresponding curves in (P_2(C), W_{d_C}). An interesting consequence of these results concerns the lower curvature bound of (M(X), HK) in the sense of Alexandrov: it is a positively curved space if and only if (X, d) is a geodesic space with curvature ≥ 1.
(iv) The dual formulation of the LET problem provides a dual characterization of HK; see the duality formula of Theorem 7.21.
It is not superfluous to recall that the HK variational problem is just one example in the realm of Entropy-Transport problems, and we think that other interesting applications can arise from different choices of entropies and cost. One of the simplest variations is to choose the (seemingly more natural) quadratic cost function c(x_1, x_2) := d²(x_1, x_2) instead of the more "exotic" (1.19). The resulting functional is still associated with a distance, expressed by

GHK²(µ_1, µ_2) := min ∫_{C×C} (r_1² + r_2² − 2 r_1 r_2 exp(−d²(x_1, x_2)/2)) dα, (1.30)

where the minimum runs over all the plans α ∈ M(C × C) such that h²π_i♯α = µ_i (we propose the name "Gaussian Hellinger-Kantorovich distance"). If (X, d) is a complete, separable and length metric space, (M(X), GHK) is a complete and separable metric space inducing the weak topology, as HK does. However, it is not a length space in general, and we will show that the length distance generated by GHK is precisely HK.
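As a small sketch (our own illustration, with hypothetical helper names), one can evaluate the GHK cost in (1.30) on the single-pair plan that couples the cone points (x_1, sqrt(a_1)) and (x_2, sqrt(a_2)); this gives an upper bound for GHK² between two Dirac masses a_i δ_{x_i}.

```python
import math

# Upper bound for GHK^2(a1*delta_{x1}, a2*delta_{x2}) obtained by plugging
# the one-pair plan into the cost of (1.30), with r_i = sqrt(a_i).
def ghk_sq_upper(a1, a2, d):
    r1, r2 = math.sqrt(a1), math.sqrt(a2)
    return r1 * r1 + r2 * r2 - 2.0 * r1 * r2 * math.exp(-d * d / 2.0)

# Identical measures: the bound vanishes.
assert ghk_sq_upper(1.0, 1.0, 0.0) == 0.0
# As d grows the bound increases towards the Hellinger-type value a1 + a2.
assert ghk_sq_upper(1.0, 1.0, 1.0) < ghk_sq_upper(1.0, 1.0, 2.0) < 2.0
```

The Gaussian factor exp(−d²/2) replaces the truncated cosine of the HK case, which is consistent with GHK failing to be a length distance: intermediate points along a geodesic can do strictly better than the direct Gaussian coupling.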
The plan of the paper is as follows.
Part I develops the general theory of Optimal Entropy-Transport problems. Section 2 collects some preliminary material, in particular concerning the measure-theoretic setting in arbitrary Hausdorff topological spaces (here we follow [41]) and entropy functionals. We devote some effort to dealing with general functionals (allowing a singular part in the definition (1.3)) in order to include entropies which may have only linear growth. The extension to this general framework of the duality Theorem 2.7 (well known in Polish topologies) requires some care and the use of lower semicontinuous test functions instead of continuous ones. Section 3 introduces the class of Entropy-Transport problems, discussing some examples and proving a general existence result for optimal plans. The "reverse" formulation of Theorem 3.11, though simple, justifies the importance of dealing with the largest class of entropies and will play a crucial role in Section 5.
Section 4 is devoted to find the dual formulation, to prove its equivalence with the primal problem (cf. Theorem 4.11), to derive sharp optimality conditions (cf. Theorem 4.6) and to prove the existence of optimal potentials in a suitable generalized sense (cf. Theorem 4.15). The particular class of "regular" problems (where the results are richer) is also studied with some details.
Section 5 introduces the third formulation (1.12) based on the marginal perspective function (1.11) and its "homogeneous" version (Section 5.2). The proof of the equivalence with the previous formulations is presented in Theorem 5.5 and Theorem 5.8. This part provides the crucial link for the further development in the cone setting.
Part II is devoted to Logarithmic Entropy-Transport (LET) problems (Section 6) and to their applications to the Hellinger-Kantorovich distance HK on M(X).
The Hellinger-Kantorovich distance is introduced by the lifting technique in the cone space in Section 7, where we try to follow a presentation modeled on the standard one for the Kantorovich-Wasserstein distance, independently of the results on the LET problems. After a brief recap on the cone geometry (Section 7.1), we discuss in some detail the crucial notion of homogeneous marginals in Section 7.2 and the useful tightness conditions (Lemma 7.3) for plans with prescribed homogeneous marginals. Section 7.3 introduces the definition of the HK distance and its basic properties. The crucial rescaling and gluing techniques are discussed in Section 7.4: they lie at the core of the main metric properties of HK, leading to the proof of the triangle inequality and to the characterization of various metric and topological properties in Section 7.5. The equivalence with the LET formulation is the main achievement of Section 7.6 (Theorem 7.20), with applications to the duality formula (Theorem 7.21), to the comparison with the classical Hellinger and Kantorovich distances (Section 7.7) and with the Gaussian Hellinger-Kantorovich distance (Section 7.8).
The last section of the paper collects various important properties of HK that share a common "dynamic" flavor. After a preliminary discussion of absolutely continuous curves and geodesics in the cone space C in Section 8.1, we derive the basic superposition principle in Theorem 8.4. This is the cornerstone for obtaining a precise characterization of geodesics (Theorem 8.6), a sharp lower curvature bound in the Alexandrov sense (Theorem 8.8), and the dynamic characterization à la Benamou-Brenier of Section 8.5. The other powerful tool is provided by the duality with subsolutions to the Hamilton-Jacobi equation (Theorem 8.12), which we derive after a preliminary characterization of metric slopes for a suitable class of test functions in C. One of the most striking results of Section 8.4 is the explicit representation formula for solutions to the Hamilton-Jacobi equation in X, which we obtain by a careful reduction technique from the Hopf-Lax formula in C. In this respect, we think that Theorem 8.11 is interesting in itself and could find important applications in different contexts. From the point of view of Entropy-Transport problems, Theorem 8.11 is particularly relevant since it provides a dynamic interpretation of the dual characterization of the LET functional. In Section 8.6 we show that in the Euclidean case X = R^d all geodesic curves are characterized by the system (1.29). The last Section 8.7 provides various contraction results: in particular we extend the well-known contraction property of the Heat flow in spaces with nonnegative Riemannian Ricci curvature to HK.
Note during final preparation. The earliest parts of the work developed here were first presented at the ERC Workshop on Optimal Transportation and Applications in Pisa in 2012. Since then the authors have developed the theory continuously further and presented results at different workshops and seminars; see Appendix A for some remarks concerning the chronological development of our theory. In June 2015 they became aware of the parallel work [24], which mainly concerns the dynamical approach to the Hellinger-Kantorovich distance discussed in Section 8.5 and the metric-topological properties of Section 7.5 in the Euclidean case. Moreover, in mid August 2015 we became aware of the work [11,12], which starts from the dynamical formulation of the Hellinger-Kantorovich distance in the Euclidean case, proves existence of geodesics and sufficient optimality and uniqueness conditions (which we state in a stronger form in Section 8.6) with a precise characterization in the case of a couple of Dirac masses, provides a detailed discussion of curvature properties following Otto's formalism [33], and studies more general dynamic costs on the cone space with their equivalent primal and dual static formulations (leading to characterizations analogous to (7.1) and (6.14) in the Hellinger-Kantorovich case).
Apart from the few remarks above, these independent works did not influence the first (cf. arXiv:1508.07941v1) and the present version of this manuscript, which is essentially a minor modification and correction of the first one. In the final Appendix A we give a brief account of the chronological development of our theory.

M(X): finite positive Radon measures on a Hausdorff topological space X
P(X), P_2(X): Radon probability measures on X (with finite quadratic moment)
C_b(X): continuous and bounded real functions on X
Lip_b(X), Lip_bs(X): bounded (with bounded support) Lipschitz real functions on X
LSC_b(X), LSC_s(X): lower semicontinuous and bounded (or simple) real functions on X
USC_b(X), USC_s(X): upper semicontinuous and bounded (or simple) real functions on X
B(X), B_b(X): Borel (resp. bounded Borel) real functions
L^p(X, µ), L^p(X, µ; R^d): Borel µ-integrable real (or R^d-valued) functions
Γ(R_+): set of admissible entropy functions, see (2.13), (2.14)
F(s), F_i(s): admissible entropy functions

C[r]: ball of radius r centered at o in C
h²_i, dil_{θ,2}(·): homogeneous marginals and dilations, see (7.15), (7.16)

In particular, when τ is metrizable, the narrow and weak topologies coincide [41, p. 371]. Therefore when (X, τ) is a Polish space we recover the usual setting of Borel measures endowed with the weak topology.
Compactness with respect to narrow topology is guaranteed by an extended version of Prokhorov's Theorem [41,Thm. 3,p. 379]. Tightness of weakly convergent sequences in metrizable spaces is due to Le Cam [26].
Theorem 2.2. If a subset K ⊂ M(X) is bounded and equally tight then it is relatively compact with respect to the narrow topology. The converse is also true in the following cases: (i) (X, τ ) is a locally compact or a Polish space; (ii) (X, τ ) is metrizable and K = {µ n : n ∈ N} for a given weakly convergent sequence (µ n ).
If µ ∈ M(X) and Y is another Hausdorff topological space, a map T : X → Y is Lusin µ-measurable [41, Ch. I, Sec. 5] if for every ε > 0 there exists a compact set K_ε ⊂ X such that µ(X \ K_ε) ≤ ε and the restriction of T to K_ε is continuous. For µ ∈ M(X) and a Lusin µ-measurable map T : X → Y, we denote by T♯µ ∈ M(Y) the push-forward measure, defined by T♯µ(B) := µ(T^{-1}(B)) for every Borel set B ⊂ Y. The linear space B(X) (resp. B_b(X)) denotes the space of real Borel (resp. bounded Borel) functions. If µ ∈ M(X) and p ∈ [1, ∞], we will denote by L^p(X, µ) the subspace of Borel p-integrable functions w.r.t. µ, without identifying µ-almost equal functions.

Min-max and duality
We recall now a powerful form of von Neumann's Theorem, concerning minimax properties of convex-concave functions in convex subsets of vector spaces and refer to [18, Prop. 1.2+3.2, Chap. VI] for a general exposition.
Let A, B be nonempty convex sets of some vector spaces and let us suppose that A is endowed with a Hausdorff topology. Let L : A × B → R be a function such that

a → L(a, b) is convex and lower semicontinuous in A for every b ∈ B, (2.9a)
b → L(a, b) is concave in B for every a ∈ A. (2.9b)

Notice that for arbitrary functions L one always has

sup_{b∈B} inf_{a∈A} L(a, b) ≤ inf_{a∈A} sup_{b∈B} L(a, b).

The recession constant F_∞, the right derivative F'_0 at 0, and the asymptotic affine coefficient aff F_∞ are defined by

F_∞ := lim_{s→+∞} F(s)/s, F'_0 := lim_{s↓0} (F(s) − F(0))/s, (2.15)
aff F_∞ := lim_{s→+∞} (F_∞ s − F(s)). (2.16)

To avoid trivial cases, we assumed in (2.13) that the proper domain Dom(F) contains at least a strictly positive real number. By convexity, Dom(F) is a subinterval of [0, ∞), and we will mainly focus on the case when Dom(F) has nonempty interior and F has superlinear growth, i.e. F_∞ = +∞, but it will be useful to deal with the general class defined by (2.13).
Concerning the behavior of F* at the boundary of its proper domain, we can distinguish a few cases depending on the behavior of F at s_F^- and s_F^+:
• The degenerate case F_∞ = F'_0 occurs only when F is linear.
If F is not linear, we always have that

F* is an increasing homeomorphism between (F'_0, F_∞) and (−F(0), aff F_∞), (2.20)

with the obvious extensions to the boundaries of the intervals when F'_0 or aff F_∞ are finite. By introducing the closed convex subset F of R² via

F := {(φ, ψ) ∈ R² : F*(φ) + ψ ≤ 0}, (2.21)

the function F can be recovered from F* and from F through the dual Fenchel-Moreau formula

F(s) = sup { sφ + ψ : (φ, ψ) ∈ F } = sup_φ (sφ − F*(φ)).

Notice that F satisfies the obvious monotonicity property: if (φ, ψ) ∈ F, φ' ≤ φ and ψ' ≤ ψ, then (φ', ψ') ∈ F. If F is finite in a neighborhood of +∞, then F* is superlinear as φ ↑ ∞. More precisely, its asymptotic behavior as φ → ±∞ is related to the proper domain of F by

lim_{φ→+∞} F*(φ)/φ = s_F^+, lim_{φ→−∞} F*(φ)/φ = s_F^-.

The functions F and F* are also related to the subdifferential ∂F: φ ∈ ∂F(s) if and only if F(s) + F*(φ) = sφ. Using the dual exponent q = p/(p − 1), the Legendre conjugates of the power-like entropies U_p can be computed explicitly.

Reverse entropies. Let us now introduce the reverse density function R,

R(s) := s F(1/s) for s > 0, R(0) := F_∞.

It is not difficult to check that R is a proper, convex and lower semicontinuous function, so that R ∈ Γ(R_+) and the map F → R is an involution on Γ(R_+). A further remarkable involution property is enjoyed by the dual convex set R := {(ψ, φ) ∈ R² : R*(ψ) + φ ≤ 0}, defined as in (2.21): it is easy to check that

(φ, ψ) ∈ F ⟺ (ψ, φ) ∈ R.

It follows that the Legendre transforms of R and F are related by

R*(ψ) = −(F*)^{-1}(−ψ),

and, as in (2.20),

R* is an increasing homeomorphism between (−aff F_∞, F(0)) and (−F_∞, −F'_0). (2.32)

A last useful identity involves the subdifferentials of F and R: for every s, r > 0 with sr = 1, and φ, ψ ∈ R, we have

φ ∈ ∂F(r) and ψ = −F*(φ) ⟺ ψ ∈ ∂R(s) and φ = −R*(ψ). (2.33)
It is not difficult to check that the reverse entropy associated with U_p is U_{1−p}.
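This involution can be verified numerically. The following sketch (our own check, with hypothetical helper names, assuming the reverse entropy is given by R(s) = s F(1/s)) confirms that the map sends U_p to U_{1−p} for several exponents.

```python
import math

# Power-like entropies U_p of (1.4), with the limit cases p = 0 and p = 1.
def U(p, s):
    if p == 0:
        return s - 1.0 - math.log(s)
    if p == 1:
        return s * math.log(s) - (s - 1.0)
    return (s**p - p * (s - 1.0) - 1.0) / (p * (p - 1.0))

# Reverse entropy R(s) = s * F(1/s) for s > 0 (assumed definition).
def reverse(F_fn, s):
    return s * F_fn(1.0 / s)

# Check reverse(U_p) == U_{1-p} on a small grid of exponents and points.
for p in (0.0, 0.5, 1.0, 2.0, -1.0):
    for s in (0.2, 0.7, 1.0, 1.5, 4.0):
        assert abs(reverse(lambda r: U(p, r), s) - U(1.0 - p, s)) < 1e-12
```

For instance p = 1 gives the pair U_1(s) = s log s − s + 1 and U_0(s) = s − 1 − log s, while p = 1/2 (the Hellinger entropy 2(sqrt(s) − 1)²) is its own reverse.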

Relative entropy integral functionals
For F ∈ Γ(R_+) we consider the functional F defined by

F(γ|µ) := ∫_X F(σ) dµ + F_∞ γ^⊥(X), where γ = σµ + γ^⊥ is the Lebesgue decomposition of γ w.r.t. µ, (2.34)

and, whenever η_0 is the null measure, we have F(γ|η_0) = F_∞ γ(X), where, as usual in measure theory, we adopted the convention 0 · ∞ = 0. Because of our applications in Section 3, our next lemma deals with Borel functions ϕ ∈ B(X; R̄) taking values in the extended real line R̄ := R ∪ {±∞}. By F̄ we denote the closure of F in R̄ × R̄ (see (2.37)); symmetrically, (2.29) and (2.30) yield the analogous description (2.38) of R̄. We continue to use the notation φ⁻ and φ⁺ to denote the negative and the positive part of a function φ, where φ⁻(x) := min{φ(x), 0} and φ⁺(x) := max{φ(x), 0}.
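For purely atomic measures the functional (2.34) reduces to a finite sum, which the following sketch (our own illustration, with hypothetical helper names) computes for the logarithmic entropy U_1, for which F_∞ = +∞, so that finiteness forces γ to be absolutely continuous w.r.t. µ.

```python
import math

# Discrete sketch of (2.34): measures are dicts atom -> mass, and
# gamma = sigma * mu + gamma_perp is the Lebesgue decomposition.

def U1(s):
    # U1(s) = s*log(s) - (s - 1); the limit value at s = 0 is 1.
    return 1.0 if s == 0.0 else s * math.log(s) - (s - 1.0)

def relative_entropy(gamma, mu):
    total = 0.0
    for x, g in gamma.items():
        m = mu.get(x, 0.0)
        if m == 0.0:
            if g > 0.0:
                return float("inf")   # singular part, weighted by F_inf = +inf
        else:
            total += U1(g / m) * m    # absolutely continuous part
    # atoms of mu not charged by gamma contribute U1(0) * mu[x]
    total += sum(U1(0.0) * m for x, m in mu.items() if x not in gamma)
    return total

mu = {"a": 1.0, "b": 2.0}
print(relative_entropy(mu, mu))           # -> 0.0 (densities equal to 1)
print(relative_entropy({"c": 1.0}, mu))   # -> inf (purely singular)
```

The two printed values illustrate the two regimes: vanishing entropy at equal measures, and the +∞ penalty on the singular part when F is superlinear.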
The next theorem gives a characterization of the relative entropy F, which is the main result of this section. Its proof is a careful adaptation of [2, Lemma 9.4.4] to the present more general setting, which includes the sublinear case when F_∞ < ∞ and the lack of complete regularity of the space. This suggests dealing with lower semicontinuous functions instead of continuous ones. We denote by LSC_s(X) the class of lower semicontinuous and simple functions (i.e. taking a finite number of real values only) and introduce the notation ϕ = −φ together with the corresponding concave conjugate function. The space LSC_s(X) in the supremum of (2.44), (2.45) and (2.46) can also be replaced by the space LSC_b(X) (resp. B_b(X)) of bounded l.s.c. (resp. Borel) functions.
In fact, considering first (2.44), by complete regularity it is possible to express every couple φ, ψ of bounded lower semicontinuous functions with values in F as the supremum of a directed family of continuous and bounded functions (φ_α, ψ_α)_{α∈A} which still satisfy the constraint imposed by F, thanks to (2.23). We can then apply the continuity (2.2) of the integrals with respect to the Radon measures µ and γ.
In order to replace l.s.c. functions with continuous ones in (2.45), we can approximate ψ by an increasing directed family of continuous functions (ψ_α)_{α∈A}. By truncation, one can always assume that max ψ ≥ sup ψ_α ≥ inf ψ_α ≥ min ψ. Since R*(ψ) is bounded, it is easy to check that R*(ψ_α) is also bounded and forms an increasing directed family converging to R*(ψ). An analogous argument works for (2.47).
When one replaces LSC_s(X) with LSC_b(X) or B_b(X) in (2.44), the supremum is taken over a larger set, so that the right-hand side of (2.44) cannot decrease; on the other hand, Lemma 2.6 shows that F(γ|µ) still provides an upper bound even if φ, ψ are in B_b(X), thus duality also holds in this case. The same argument applies to (2.45) or (2.46).
The following result provides lower semicontinuity of the relative entropy, also along an increasing sequence of relative entropies.

Corollary 2.9. The functional F is jointly convex and lower semicontinuous in M(X) × M(X). More generally, if F_n ∈ Γ(R_+), n ∈ N, is an increasing sequence pointwise converging to F and (µ, γ) ∈ M(X) × M(X) is the narrow limit of a sequence (µ_n, γ_n) ∈ M(X) × M(X), then the corresponding entropy functionals F_n, F satisfy

lim inf_{n→∞} F_n(γ_n|µ_n) ≥ F(γ|µ). (2.51)

Proof. The lower semicontinuity of F follows by (2.44), which provides a representation of F as the supremum of a family of functionals that are lower semicontinuous with respect to the narrow topology. Using F_n ≥ F_m for n ≥ m fixed, we have lim inf_{n→∞} F_n(γ_n|µ_n) ≥ lim inf_{n→∞} F_m(γ_n|µ_n) ≥ F_m(γ|µ) by the above lower semicontinuity. Hence, it suffices to check that

lim_{n→∞} F_n(γ|µ) = F(γ|µ) for every γ, µ ∈ M(X). (2.52)

This formula follows easily by the monotonicity F_n ⊂ F_{n+1} of the convex sets F_n (associated to F_n by (2.21)) and by the fact that F = ∪_n F_n, since F*_n is pointwise decreasing to F*. Thus for every couple of simple and lower semicontinuous functions (φ, ψ) taking values in F, we have (ψ(x), φ(x)) ∈ F_N for every x ∈ X and a sufficiently large N. Since φ, ψ are arbitrary, we conclude by applying the duality formula (2.44).
Next, we provide a compactness result for the sublevels of the relative entropy, which will be useful in Section 3.4 (see Theorem 3.3 and Lemma 3.9).

Proposition 2.10 (Boundedness and tightness). If K ⊂ M(X) is bounded and F_∞ > 0, then for every C ≥ 0 the sublevels Ξ_C := {γ ∈ M(X) : F(γ|µ) ≤ C for some µ ∈ K} are bounded. If moreover K is equally tight and F_∞ = ∞, then the sets Ξ_C are equally tight.
We conclude this section with a useful representation of F in terms of the reverse entropy R (2.28) and the corresponding functional R. We will use the result in Section 3.5 for the reverse formulation of the primal entropy-transport problem.

F(γ|µ) = R(µ|γ), (2.55)

where µ = θγ + µ^⊥ is the reverse Lebesgue decomposition given by (2.8). In particular the roles of γ and µ can be interchanged, at the price of replacing F with its reverse entropy R.

Proof. It is an immediate consequence of the dual characterization in (2.44) and the equivalence in (2.30).

Optimal Entropy-Transport problems
The main object of Part I is the Entropy-Transport functional: two measures µ_1 ∈ M(X_1) and µ_2 ∈ M(X_2) are given, and one has to find a transport plan γ ∈ M(X_1 × X_2) that minimizes it.

The basic setting
Let us fix the basic set of data for Entropy-Transport problems. We are given:
- two Hausdorff topological spaces (X_i, τ_i), i = 1, 2, which define the Cartesian product X := X_1 × X_2 and the canonical projections π_i : X → X_i;
- two entropy functions F_i ∈ Γ(R_+), thus satisfying (2.13);
- a proper lower semicontinuous cost function c : X → [0, +∞];
- a couple of nonnegative Radon measures µ_i ∈ M(X_i) with finite mass m_i := µ_i(X_i) satisfying the compatibility condition

J := m_1 Dom(F_1) ∩ m_2 Dom(F_2) ≠ ∅. (3.1)

We will often assume that the above basic setting is also coercive: this means that at least one of the two coercivity conditions of Theorem 3.3 below holds. For every transport plan γ ∈ M(X) we define the marginals γ_i := π_i♯γ and, as in (2.34), the relative entropies

F_i(γ_i|µ_i) := ∫_{X_i} F_i(σ_i) dµ_i + (F_i)_∞ γ_i^⊥(X_i), γ_i = σ_i µ_i + γ_i^⊥. (3.3)

With this, we introduce the Entropy-Transport functional

E(γ|µ_1, µ_2) := F_1(γ_1|µ_1) + F_2(γ_2|µ_2) + ∫_X c dγ, (3.4)

possibly taking the value +∞. Our basic setting is feasible if the functional E is not identically +∞, i.e. there exists at least one plan γ with E(γ|µ_1, µ_2) < ∞.

The primal formulation of the Optimal Entropy-Transport problem
In the basic setting described in the previous Section 3.1, we want to investigate the following problem.

Problem 3.1 (Optimal Entropy-Transport problem). Given µ_1 ∈ M(X_1) and µ_2 ∈ M(X_2), find a plan γ ∈ M(X_1 × X_2) attaining the infimum

ET(µ_1, µ_2) := inf { E(γ|µ_1, µ_2) : γ ∈ M(X_1 × X_2) }.
Remark 3.2 (Feasibility conditions). Problem 3.1 is feasible if there exists at least one plan γ with E(γ|µ_1, µ_2) < ∞. Notice that this is always the case when F_1(0) and F_2(0) are finite, since among the competitors one can choose the null plan η_0, so that ET(µ_1, µ_2) ≤ E(η_0|µ_1, µ_2) = F_1(0) m_1 + F_2(0) m_2. More generally, thanks to (3.1), a sufficient condition for feasibility in the nondegenerate case m_1 m_2 > 0 is the existence of functions B_1 and B_2 as in (3.10), which make the corresponding plans admissible with finite energy. Notice that (3.1) is also necessary for feasibility: in fact, setting m_{i,n} := m_i + γ_i^⊥(X_i)/n, the convexity of F_i, the definition (2.15) of (F_i)_∞, and Jensen's inequality provide a lower bound for the entropy terms. Thus, whenever E(γ|µ_1, µ_2) < ∞, the common mass θ := γ_1(X_1) = γ_2(X_2) belongs to J, and therefore (3.1) holds. We will often strengthen (3.1) by assuming that at least one of the domains of the entropies F_i has nonempty interior, containing a point of the other domain. This condition is surely satisfied if J has nonempty interior. We also observe that whenever m_1 = m_2 = 0 the null plan γ = η_0 provides the trivial solution to Problem 3.1. Another trivial case occurs when F_i(0) < ∞ and F_i are nondecreasing in Dom(F_i) (in particular when F_i(0) = 0): then it is clear that the null plan is a minimizer and ET(µ_1, µ_2) = F_1(0) m_1 + F_2(0) m_2.

Examples
Let us consider a few particular cases:

E.1 Costless transport: Consider the case c ≡ 0. Since the F_i are convex, the minimum is attained at plans whose marginals γ_i have constant densities σ_i w.r.t. µ_i. Setting σ_i ≡ θ/m_i in order to have m_1 σ_1 = m_2 σ_2 = θ, we thus have

ET(µ_1, µ_2) = min_{θ≥0} ( m_1 F_1(θ/m_1) + m_2 F_2(θ/m_2) ).

E.2 Optimal transport: When F_i is the indicator function of {1}, any feasible plan γ should have µ_1 and µ_2 as marginals and the functional just reduces to the pure transport part ∫_{X_1×X_2} c dγ. As a necessary condition for feasibility we get µ_1(X_1) = µ_2(X_2).
A situation equivalent to the optimal transport case occurs when (3.14) does not hold. In this case, the set J defined by (3.1) contains only one point θ, which separates m_1 Dom(F_1) and m_2 Dom(F_2), and it is not difficult to check that the problem reduces to a pure transport one between suitably rescaled measures.

E.4 Optimal transport with density constraints: We realize density constraints by choosing for the entropies F_i the characteristic (indicator) functions of intervals, so that the marginals of γ are forced to lie between prescribed bounds; e.g., fixing the first marginal and imposing γ_2 ≤ µ_2 leads to

ET(µ_1, µ_2) = min { ∫ c dγ : π_1♯γ = µ_1, γ_2 ≤ µ_2 },

whose feasibility requires µ_2(X_2) ≥ µ_1(X_1).
E.8 Marginal Entropy-Transport problems: here one of the two marginals of γ is fixed, say γ1 = µ1, by choosing F1(r) := I1(r). The functional then minimizes the sum of the transport cost and the entropy F2(γ2|µ2) of the second marginal with respect to the reference measure µ2. This is the typical situation one has to solve at each iteration step of the Minimizing Movement scheme [2], when the transport part is (a power of) a transport distance induced by c, as in the Jordan-Kinderlehrer-Otto approach [21].
E.9 The Piccoli-Rossi "generalized Wasserstein distance" [35,36]: for a metric space (X, d), set X1 = X2 = X and let the cost be induced by d. In the discrete setting, the Entropy-Transport problem consists in finding coefficients γi,j ≥ 0 which minimize the sum of the linear transport cost and the entropies of the two discrete marginals.
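As a numerical illustration (not an algorithm from the paper), the discrete Entropy-Transport problem with KL entropies F1 = F2 = KL can be solved approximately by the entropic-regularization scaling iterations of Chizat et al. ("unbalanced Sinkhorn"); the function names, the regularization `eps` and the iteration count below are our own choices:

```python
import numpy as np

# Numerical sketch (not from the paper): the discrete Entropy-Transport
# problem with KL entropies,
#   minimize  sum_ij C_ij*g_ij + KL(row sums | mu1) + KL(col sums | mu2)
# over g_ij >= 0, solved approximately via entropic-regularization
# scaling iterations ("unbalanced Sinkhorn", Chizat et al.).

def kl(g, m):
    """KL divergence KL(g|m) = sum g*log(g/m) - g + m for nonnegative vectors."""
    g, m = np.asarray(g, float), np.asarray(m, float)
    logs = np.where(g > 0, g * np.log(np.where(g > 0, g, 1.0) / m), 0.0)
    return float(np.sum(logs - g + m))

def entropy_transport_kl(C, mu1, mu2, eps=1e-2, iters=5000):
    """Approximate minimizer and value of the discrete KL Entropy-Transport problem."""
    K = np.exp(-C / eps)                 # Gibbs kernel of the regularization
    b = np.ones_like(mu2)
    p = 1.0 / (1.0 + eps)                # scaling exponent for a KL penalty of weight 1
    for _ in range(iters):
        a = (mu1 / (K @ b)) ** p
        b = (mu2 / (K.T @ a)) ** p
    g = a[:, None] * K * b[None, :]
    cost = float(np.sum(C * g)) + kl(g.sum(axis=1), mu1) + kl(g.sum(axis=0), mu2)
    return g, cost
```

For two single atoms with cost c the value can be compared with the closed form m1 + m2 − 2√(m1 m2) e^{−c/2} of the logarithmic case (cf. Section 5): for small eps the two agree to several digits.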

Existence of solutions to the primal problem
The next result provides a first general existence result for Problem 3.1 in the basic coercive setting of Section 3.1, provided at least one of the following conditions holds: (i) the entropies Fi are superlinear (cf. (3.2a)), or (ii) c has compact sublevels in X and (3.2b) holds. Then Problem 3.1 admits at least one optimal solution, and Opt_ET(µ1, µ2) is a compact convex subset of M(X).
Proof. We can apply the Direct Method of the Calculus of Variations: since the map γ → E(γ|µ1, µ2) is lower semicontinuous in M(X1×X2) by Theorem 2.7, it is sufficient to show that its sublevels are relatively compact, i.e. bounded and equally tight by the Prokhorov Theorem 2.2. In both cases boundedness follows from the coercivity assumptions and the estimate (3.12), via the definition (2.15) of (Fi)∞. In case (ii), equal tightness is a consequence of the Markov inequality, the nonnegativity of the Fi, and the compact sublevels of c. In case (i), since c ≥ 0, Proposition 2.10 shows that the marginals of plans in a sublevel of the energy are equally tight: we thus conclude by [2, Lemma 5.2.2].
Remark 3.4. The assumptions (i) and (ii) in the previous theorem are almost optimal, and it is possible to find counterexamples when they are not satisfied. In one such example, since min_s (F(s) + s) = log 2 is attained at s = 1/2, we immediately see that 2 log 2 is the infimum of the energy: it is approached by choosing α = 0 and ν1 = ν2 = (1/2)δx and letting x → ∞. On the other hand, since n1 + n2 + α > 0 for every competitor, the infimum can never be attained.
Let us briefly discuss the question of uniqueness; the first result only addresses the marginals γi = (πi)♯γ. Lemma 3.5 (Uniqueness of the marginals in the superlinear strictly convex case). Let us suppose that the Fi are strictly convex functions. Then the µi-absolutely continuous parts σiµi of the marginals γi = (πi)♯γ of any optimal plan are uniquely determined. In particular, if the Fi are also superlinear, then the marginals themselves are unique. Proof. Given two optimal plans γ, γ' with marginal densities σi, σ'i, the minimality and the convexity of each addendum Fi in the functional, applied to the competitor (γ + γ')/2, force equality in the corresponding convexity inequalities. Since Fi is strictly convex, this identity implies σi = σ'i µi-a.e. in Xi.
The next corollary reduces the uniqueness question of optimal couplings in Opt ET (µ 1 , µ 2 ) to corresponding results for the Kantorovich problem associated to the cost c.
Corollary 3.6. Let us suppose that the Fi are superlinear strictly convex functions and that for every couple of probability measures νi ∈ P(Xi) with νi ≪ µi the optimal transport problem associated to the cost c (see Example E.3 of Section 3.3) admits a unique solution. Then Opt_ET(µ1, µ2) contains at most one plan.
Proof. We can assume mi = µi(Xi) > 0 for i = 1, 2. It is clear that any γ ∈ Opt_ET(µ1, µ2) is a solution of the optimal transport problem for the cost c with (possibly normalized) marginals γi. Since γi ≪ µi and γ1, γ2 are unique by Lemma 3.5, we conclude.
Uniqueness may fail without the assumption on the transport problem: e.g. for µ1 concentrated on the two points (−1, 0) and (1, 0) of R², and µ2 with support in {0} × R containing at least two points, there is an infinite number of optimal plans. In fact, we shall see that the first marginal γ1 of any optimal plan γ has full support in {(−1, 0), (1, 0)}, i.e. it is of the form aδ(−1,0) + bδ(1,0) with strictly positive a, b, while the second marginal γ2 is concentrated in {0} × R with support containing at least two points. Any plan σ with marginals γ1, γ2 is then optimal, since it can be written as a disintegration with arbitrary nonnegative densities α, β with α + β = 1, ∫α dγ2 = a and ∫β dγ2 = b: by symmetry, the cost contribution of σ to the total energy is independent of the choice of α and β.
We conclude this section by proving a simple lower semicontinuity property of the Entropy-Transport functional ET. Note that in metrizable spaces any weakly convergent sequence of Radon measures is tight. Lemma 3.9. Let L be a directed set, let (Fλi)λ∈L and (cλ)λ∈L be monotone nets of superlinear entropies and costs pointwise converging to Fi and c respectively, and let (µλi)λ∈L be equally tight nets of measures narrowly converging to µi in M(Xi). Denoting by ETλ (resp. ET) the corresponding Entropy-Transport functionals induced by Fλi and cλ (resp. Fi and c), let (γλ)λ∈L be a corresponding net of optimal plans. The statement follows if, assuming E(γλ|µλ1, µλ2) = ETλ(µλ1, µλ2) ≤ C < ∞, we can prove that ET(µ1, µ2) ≤ C. By applying Proposition 2.10 we obtain that the nets of marginals (πi)♯γλ are tight in M(Xi), so that the net γλ is also tight. Extracting a suitable subnet (not relabeled) narrowly converging to γ in M(X), we can still apply Proposition 2.10 and the lower semicontinuity of the entropy part of the functional. A completely analogous argument shows that lim inf λ∈L ∫ cλ dγλ ≥ ∫ c dγ.
As a simple application we prove the extremality of the class of Optimal Transport problems (see Example E.3 in Section 3.3) in the set of entropy-transport problems.
Corollary 3.10. Let Fi be entropies with Fi(r) = 0 if and only if r = 1, and let ETn be the Optimal Entropy-Transport value (3.5) associated to (nF1, nF2). Then for every couple of equally tight sequences (µ1,n), (µ2,n) narrowly converging to µ1, µ2, the values ETn(µ1,n, µ2,n) converge to the Optimal Transport cost, cf. (3.27).

The reverse formulation of the primal problem
Let us introduce the reverse entropy functions Ri (see (2.28)) via Ri(r) := r Fi(1/r) for r > 0, and let Ri be the corresponding integral functionals as in (2.55).
Keeping the notation of Lemma 2.3 we can thus define the reverse functional R(µ1, µ2|γ) in (3.30). By Lemma 2.11 we easily get the reverse formulation of the optimal Entropy-Transport Problem 3.1.
Theorem 3.11. For every γ ∈ M(X) and µi ∈ M(Xi) we have R(µ1, µ2|γ) = E(γ|µ1, µ2). In particular, ET(µ1, µ2) coincides with the minimum of R(µ1, µ2|·) over γ ∈ M(X). The functional R(µ1, µ2|·) is still a convex functional and it will be useful in Section 5.

The dual problem
In this section we want to compute and study the dual problem and the corresponding optimality conditions for the Entropy-Transport Problem 3.1 in the basic coercive setting of Section 3.1.

The "inf-sup" derivation of the dual problem in the basic coercive setting
In order to write the first formulation of the dual problem we introduce the reverse entropy functions Ri defined as in (2.28) of Section 3.5 and their conjugates R*i : R → (−∞, +∞]. The equivalences (2.31) yield corresponding identities for all (φ, ψ) ∈ R². As a first step we use the dual formulation of the entropy functionals given by Theorem 2.7 (cf. (2.45)). It is then natural to introduce the saddle function L(γ, ψ) depending on γ ∈ M(X) and ψ = (ψ1, ψ2) (we omit here the dependence on the fixed measures µi). In order to guarantee that L takes real values, we consider a suitable convex set of admissible couples ψ. We thus have E(γ|µ1, µ2) = sup_ψ L(γ, ψ), and the Entropy-Transport problem can be written as an inf-sup. We can then obtain the dual problem by interchanging the order of inf and sup as in Section 2.2; (4.6) thus provides the dual formulation, which we will study in the next section.

Dual problem and optimality conditions
Problem 4.1 (ψ-formulation of the dual problem). Let R*i be the convex functions defined by (4.1) and let Ψ be the convex set of admissible couples ψ. The dual Entropy-Transport problem consists in finding a maximizer ψ ∈ Ψ for the dual functional. As usual, by operating a change of variables we can obtain an equivalent formulation of the dual functional D as the supremum of concave functionals on a simpler convex set. Problem 4.2 (ϕ-formulation of the dual problem). Let F•i be the concave functions defined by (4.9) and let Φ be the convex set (4.11). The ϕ-formulation of the dual Entropy-Transport problem consists in finding a maximizer ϕ ∈ Φ. The two formulations are equivalent. Proof. Since R*i is nondecreasing, for every ψ ∈ Ψ the functions ϕi := R*i(ψi) belong to LSCs(Xi) and satisfy the constraint defining Φ, with a greater or equal value of the dual functional; an analogous argument shows the converse inequality.
Since "inf sup ≥ sup inf" (cf. (2.10)), our derivation via (4.5) yields (4.13) Using Theorem 2.4 we will show in Section 4.3 that (4.13) is in fact an equality. Before this, we first discuss for which class of functions ψ i , ϕ i the dual formulations are still meaningful. Moreover, we analyze the optimality conditions associated to the equality case in (4.13).
Proof. Let us consider (4.18); the calculations in the other cases, including (4.19), are completely analogous and rely on Lemma 2.6.
Optimality conditions. If there exists a couple ϕ as in Proposition 4.4 such that E(γ|µ1, µ2) = D(ϕ|µ1, µ2), then all the above inequalities (4.20) must be identities, and the second part of Lemma 2.6 applies on the Borel partition related to the Lebesgue decomposition of the couple (γi, µi) as in Lemma 2.3. We will now show that the existence of a couple ϕ satisfying (4.22) and the joint optimality conditions (4.21) is also sufficient for a feasible γ ∈ M(X) to be optimal. We emphasize that we do not need any integrability assumption on ϕ.
Proof. We want to repeat the calculations in (4.20) of Proposition 4.4, now taking care of the integrability issues. We use a clever truncation argument of [40], based on the truncation maps Tn combined with corresponding approximations Fi,n of the entropies Fi given by (4.24). The boundedness of Tn(ϕi) and Proposition 4.4 yield, for every γ̃ ∈ M(X), the corresponding inequalities for the absolutely continuous parts; applying (ii) of the next Lemma 4.7 we obtain the analogous estimate for the singular parts, and the same relation also holds when (Fi)∞ = +∞, since in this case γ̃⊥i = 0. Summing up the two contributions, applying Lemma 4.7(i) and the fact that ϕ1 ⊕o ϕ2 = c ≥ 0 γ-a.e. by (4.21a), we can pass to the limit as n ↑ ∞ by monotone convergence in the right-hand side, obtaining the desired optimality E(γ̃|µ1, µ2) ≥ E(γ|µ1, µ2), cf. (4.26).
Proof of Lemma 4.7. Property (i): by (2.22) and the definition in (4.24) we get the stated bounds. Moreover, (4.24) defines Fi,n as the maximum of a family of n-Lipschitz functions, so that Fi,n is n-Lipschitz.
Therefore the set In := {s ≥ 0 : Fi(s) = Fi,n(s)} is a nonempty closed interval (possibly reduced to a single point) and it is easy to see that, denoting s+n := max In, s−n := min In, Tn(s) := s−n ∨ s ∧ s+n, we have Fi,n(s) = Fi(Tn(s)) + n(s − Tn(s)). In particular, whenever s ≥ s+n we have n ∈ ∂Fi,n(s), and similarly −n ∈ ∂Fi,n(s) if s ≤ s−n. If s belongs to the interior of In, then ∂Fi,n(s) = ∂Fi(s). On the other hand, if ∂Fi(s) contains some φi > n, then s cannot belong to the interior of In, so that by monotonicity s ≥ s+n and n ∈ ∂Fi,n(s).

A general duality result
The aim of this section is to show in complete generality the duality result ET = D, by using the ϕ-formulation of the dual problem (4.12), which is equivalent to (4.7) by Proposition 4.3.
We start with a simple lemma depending on a specific feature of the entropy functions (which fails exactly in the case of pure transport problems, see Example E.3 of Section 3.3), using the strengthened feasibility condition in (3.14). First note that the couple ϕi ≡ 0 belongs to Φ (since c ≥ 0) and provides an obvious lower bound for D(µ1, µ2), viz. D(µ1, µ2) ≥ Σi F•i(0) µi(Xi).
We derive an upper and lower bound for the potential ϕ 1 under the assumption that c is bounded.
We improve the previous result by showing that in the case of bounded cost functions it is sufficient to consider bounded potentials ϕ i . The second lemma is well known in the case of Optimal Transport problems and will provide a useful a priori estimate in the case of bounded cost functions used in the proof of Theorem 4.11.
If moreover (3.14) holds, then there exists a constant ϕmax ≥ 0, only depending on the entropies and on sup c, bounding the potentials as in (4.31). Proof. Since c ≥ 0, possibly replacing ϕ1 with ϕ̃1 := ϕ1 ∨ (− sup ϕ2) we obtain a new couple (ϕ̃1, ϕ2) with a greater or equal dual value. It is then not restrictive to assume that inf ϕ1 ≥ − sup ϕ2; a similar argument shows that we can assume inf ϕ2 ≥ − sup ϕ1. If moreover sup ϕ1 + sup ϕ2 = −δ < 0, we could always add the constant δ to, e.g., ϕ1, thus increasing the value of D while preserving the constraint Φ. Thus, (4.30) is established. When (3.14) holds (e.g. in the case considered by (4.28)), the previous Lemma 4.8 and (4.30), applied once again, yield (4.31) with a suitable ϕmax. Before stating the last lemma we recall the useful notion of c-transform of functions ϕi : Xi → R̄ for a real-valued cost c, see (4.34). Moreover, ϕ1 = ϕ1^cc if and only if ϕ1 = ϕ2^c for some function ϕ2; in this case ϕ1 is called c-concave. The next lemma concerns the lower semicontinuity of ϕi^c in the case when c is simple (cf. [23]), i.e. it takes only finitely many values, as in (4.37). Lemma 4.10. Let us assume that c has the form (4.37) and that ϕ1 is bounded. Then ϕ1^c and ϕ1^cc are simple and lower semicontinuous. Proof. It is easy to check that ϕ1^cc and ϕ1^c are simple, since the infima in (4.34) are taken over a finite number of possible values. By (4.35) it is thus sufficient to check that they are lower semicontinuous functions.
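For the reader's convenience, the c-transforms in (4.34) are assumed here to be the usual optimal-transport ones, namely

```latex
\varphi_1^{c}(x_2) \;:=\; \inf_{x_1\in X_1}\big(\, c(x_1,x_2) - \varphi_1(x_1) \,\big),
\qquad
\varphi_1^{cc} \;:=\; \big(\varphi_1^{c}\big)^{c},
```

so that ϕ1 ⊕ ϕ2 ≤ c if and only if ϕ2 ≤ ϕ1^c, and replacing (ϕ1, ϕ2) by (ϕ1^{cc}, ϕ1^c) does not decrease the dual functional.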
We do this for ϕ1^c; the argument for ϕ1^cc = (ϕ1^c)^c is completely analogous. For this, consider the sets Yz on which the partition underlying (4.37) is constant: clearly, (Yz)z∈Z defines a Borel partition of X1; we define ϕz := sup{ϕ1(y) : y ∈ Yz}. By construction, for every z ∈ Z and y ∈ Yz, the map ϕ1^c is the minimum of a finite collection of lower semicontinuous functions.
We thus obtain ϕ1^c ∈ LSC(X1). With all these auxiliary results at hand, we are now ready to prove our main result concerning the dual representation, using Theorem 2.4. Theorem 4.11. In the basic coercive setting of Section 3.1 (i.e. (3.2a) or (3.2b) hold), the Entropy-Transport functional (3.4) and the dual functional (4.10) satisfy ET(µ1, µ2) = D(µ1, µ2), cf. (4.39). Proof. Since ET ≥ D is obvious, it suffices to show ET ≤ D; in particular, it is not restrictive to assume that D(µ1, µ2) is finite. We proceed in various steps, considering first the case when c has compact sublevels and (Fi)∞ = +∞ (so that the F•i are continuous and increasing on R); these assumptions will be removed in the following steps. Step 1: the cost c has compact sublevels. We can directly apply Theorem 2.4 to the saddle functional L of (4.3), choosing A = M given by (4.4), endowed with the narrow topology, and B = Φ. Conditions (2.9a) and (2.9b) are clearly satisfied and, arguing as in the proof of Theorem 3.3(ii), we immediately see that the coercivity condition (2.11) holds: in fact, for C sufficiently big, the sublevels {γ ∈ M : L(γ, ψ*) ≤ C} are closed, bounded and equally tight (by the compactness of the sublevels of c), thus narrowly compact. Thus, (4.39), i.e. ET = D, follows from Theorem 2.4.
Step 2: the case when the µi have compact support, (3.14) holds and the cost c is simple, i.e. (4.37) holds. Let us set X̃i := supp(µi). Since (Fi)∞ = +∞, the support of every γ with E(γ|µ1, µ2) < ∞ is contained in X̃1 × X̃2, so that the minimum of the functional E(γ|µ1, µ2) does not change by restricting the spaces to X̃i. By applying the previous step to the problem stated in X̃1 × X̃2, we obtain a couple of simple Borel functions with ϕ1 ⊕ ϕ2 ≤ c in X. We can eventually apply Lemma 4.10 and replace (ϕ1, ϕ2) by the lower semicontinuous couple (ϕ1^cc, ϕ1^c); since the approximation was arbitrary, we conclude that (4.39) holds in this case as well.
Step 3: We remove the assumption on the compactness of supp(µ i ).
Since the µi are Radon, we can find two sequences of compact sets Ki,n ⊂ Xi with µi(Xi \ Ki,n) → 0, and set µi,n := µi restricted to Ki,n. Let En := ET(µ1,n, µ2,n) and let E′n < En with lim n→∞ E′n = lim inf n→∞ En. Since the µi,n have compact support, by the previous step and Lemma 4.9 we can find a sequence ϕn ∈ Φ and a constant ϕmax independent of n such that D(ϕn|µ1,n, µ2,n) ≥ E′n and sup |ϕi,n| ≤ ϕmax.

This yields
Using the lower semicontinuity of ET from Lemma 3.9 we obtain ET(µ1, µ2) ≤ lim inf n→∞ En ≤ D(µ1, µ2). Thus, (4.39) is established.
Step 4: we remove the assumption (3.14) on the Fi. It is sufficient to approximate Fi by an increasing and pointwise converging sequence Fni ∈ Γ(R+). The corresponding sequence (Fni)• : ϕi → inf s≥0 (Fni(s) + sϕi) of concave conjugates is also nondecreasing and pointwise converging to F•i. Passing to the limit n → ∞ we conclude ET(µ1, µ2) ≤ D(µ1, µ2) as desired.
Step 5: the case of a general cost c.
Let c : X → [0, ∞] be an arbitrary l.s.c. cost and let us denote by (c α ) α∈A the class of costs characterized by (4.37) and majorized by c. Then, A is a directed set with the pointwise order ≤, since maxima of a finite number of cost functions in A can still be expressed as in (4.37). It is not difficult to check that c = sup α∈A c α = lim α∈A c α so that by Lemma 3.9 ET(µ 1 , µ 2 ) = lim α∈A ET α (µ 1 , µ 2 ) = sup α∈A ET α (µ 1 , µ 2 ), where ET α denotes the Entropy-Transport functional associated to c α .
Thus for every E < ET(µ1, µ2) we can find α ∈ A such that ETα(µ1, µ2) > E and therefore, by the previous step, a couple ϕα ∈ Φ of simple lower semicontinuous potentials with D(ϕα|µ1, µ2) > E; since E was arbitrary, this concludes the proof.
Arguing as in Remark 2.8 we can change the spaces of test potentials ϕ = (ϕ1, ϕ2) ∈ Φ, see (4.11). If the (Xi, τi) are completely regular spaces, then we can equivalently replace lower semicontinuous functions by continuous ones.
Corollary 4.13 (Subadditivity of ET). The functional ET is convex and positively 1-homogeneous (in particular it is subadditive), i.e. for every µi, µ′i ∈ M(X) and λ ≥ 0 we have ET(λµ1, λµ2) = λ ET(µ1, µ2) and ET(µ1 + µ′1, µ2 + µ′2) ≤ ET(µ1, µ2) + ET(µ′1, µ′2). Proof. By Theorem 4.11 it is sufficient to prove the corresponding property of D, which follows immediately from its representation formula (4.8) as a supremum of linear functionals.

Existence of optimal Entropy-Kantorovich potentials
In this section we consider two cases in which the dual problem admits a couple of optimal Entropy-Kantorovich potentials ϕ = (ϕ1, ϕ2). The first case is completely analogous to the transport setting.
Theorem 4.14. Consider complete metric spaces (Xi, di), i = 1, 2, assume that (3.14) holds, and let c be bounded and uniformly continuous with respect to the product distance. Then the dual problem admits an optimal couple of potentials. Proof. By the boundedness and uniform continuity of c we can find a continuous and concave modulus of continuity ω for c with respect to the product distance. Possibly replacing the distances di with di + ω(di), we may assume that c is 1-Lipschitz with respect to each variable. In particular, every c-transform (4.34) of a bounded function is 1-Lipschitz (and in particular Borel). Let ϕn be a maximizing sequence in Φ. By Lemma 4.9 we can assume that ϕn is uniformly bounded; by (4.35) and (4.36) we can also assume that the ϕn are c-concave and thus 1-Lipschitz. If Ki,n is a family of compact sets whose union Ai has full µi-measure in Xi, we can thus extract a subsequence (still denoted by ϕn) pointwise convergent to a couple ϕ = (ϕ1, ϕ2) in A1 × A2. Setting ϕ1 := lim n→∞ ϕ1,n and ϕ2 := lim inf n→∞ ϕ2,n, we obtain an admissible couple whose optimality follows thanks to Fatou's Lemma and the fact that the F•i(ϕi,n) are uniformly bounded from above.
The next result is of different type, since it does not require any boundedness nor regularity of c (which can also assume the value +∞ in the case F i (0) < ∞).
Theorem 4.15. Let us suppose that at least one of the following two conditions holds: a) c is everywhere finite and (3.14) holds, or b) Fi(0) < +∞. Then a plan γ ∈ M(X) with finite energy E(γ|µ1, µ2) < ∞ is optimal if and only if there exists a couple ϕ as in (4.22) satisfying the optimality conditions (4.21).
Proof. We already proved (Theorem 4.6) that the existence of a couple ϕ as in (4.22) satisfying (4.21) yields the optimality of γ.
Let us now assume that γ ∈ M(X) has finite energy and is optimal. If µ i ≡ η 0 then also γ = 0 and (4.21) are always satisfied, since we can choose ϕ i ≡ 0.
Using the Borel partitions (Ai, Aµi, Aγi) for the couples of measures γi, µi provided by Lemma 2.3 and arguing as in Proposition 4.4, we obtain a maximizing sequence of potentials (ϕ1,n, ϕ2,n). Since all the integrands are nonnegative, up to selecting a suitable subsequence (not relabeled) we can assume that the integrands converge pointwise a.e. to 0, and we can find Borel sets of full measure where this convergence holds. For every xi ∈ Xi we define the Borel functions ϕ1(x1) := lim sup n→∞ ϕ1,n(x1) and similarly ϕ2. It is clear that the couple ϕ = (ϕ1, ϕ2) complies with (4.22), (4.21d) and (4.21c). If γ(X) = 0 then (4.21a) and (4.21b) are trivially satisfied, so that it is not restrictive to assume γ(X) > 0. We can thus assume that µi(Xi) > 0 and γ(X) > 0. In order to check (4.21a) and (4.21b) we distinguish two cases.
Case a: c is everywhere finite and (3.14) holds. Let us first prove that ϕ 1 < +∞ everywhere.
Then γ is an optimal plan if and only if γi ≪ µi for every i = 1, 2 and the conditions above hold. In fact, if the strengthened feasibility condition (3.14) does not hold, it is not difficult to construct an example of an optimal plan γ for which conditions (4.22), (4.21a), (4.21b) cannot be satisfied.
We conclude this section by proving the simple property on subdifferentials we used in the proof of Theorem 4.15.
In particular, if s ∈ int(Dom(F )) then φ is finite.
Proof. Up to extracting a suitable subsequence, it is not restrictive to assume that φ is the limit of φn as n → ∞. For every w ∈ Dom(F) the Young inequality wφn ≤ F(w) + F*(φn) yields (4.50) in the limit. If Dom(F) = {s} then ∂F(s) = R and there is nothing to prove; let us thus assume that Dom(F) has nonempty interior. If φ ∈ R then (w − s)φ ≤ F(w) − F(s) for every w ∈ Dom(F), so that φ ∈ ∂F(s). Since the right-hand side of (4.50) is finite for every w ∈ Dom(F), if φ = +∞ then w ≤ s for every w ∈ Dom(F), so that s = max Dom(F). An analogous argument holds when φ = −∞.

"Homogeneous" formulations of optimal Entropy-Transport problems
Starting from the reverse formulation of the Entropy-Transport problem of Section 3.5 via the functional R, see (3.30), in this section we derive further equivalent representations of the ET functional, which will also reveal new interesting properties, in particular when we apply these results to the logarithmic Hellinger-Kantorovich functional. The advantage of the reverse formulation is that it always admits a "1-homogeneous" representation, associated to a modified cost functional that can be explicitly computed in terms of the Ri and c.
We will always tacitly assume the basic coercive setting of Section 3.1, see (3.2).

The homogeneous marginal perspective functional.
First of all we introduce the marginal perspective function Hc, depending on the parameter c ≥ inf c, as the infimum over θ > 0 of θ(R1(r1/θ) + R2(r2/θ) + c), cf. (5.1). For c = ∞ we set H∞(r1, r2) := F1(0) r1 + F2(0) r2, cf. (5.2). The induced marginal perspective cost is H(x1, r1; x2, r2) := H_{c(x1,x2)}(r1, r2), see (5.3). The last formula (5.2) is justified by the property Fi(0) = (Ri)∞ and the fact that Hc(r1, r2) ↑ H∞(r1, r2) as c ↑ ∞ for every r1, r2 ∈ [0, ∞), see also Lemma 5.3 below.
Example 5.2. Let us consider the symmetric cases associated to the entropies U p and V : E.1 In the "logarithmic entropy case", which we will extensively study in Part II, we have F i (s) := U 1 (s) = s log s − (s − 1) and R i (r) = U 0 (r) = r − 1 − log r.

A direct computation shows Hc(r1, r2) = r1 + r2 − 2 √(r1 r2) e^{−c/2}.
In the power-like case with p ∈ R \ {0, 1} we start from Fi = Up and obtain, for r1, r2 > 0, an analogous explicit formula involving the conjugate exponent q = p/(p − 1). The following dual characterization of Hc nicely explains its crucial role.
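In the logarithmic case the closed form can be obtained by an elementary minimization (a sketch, assuming the representation Hc(r1, r2) = inf over θ > 0 of θ(R(r1/θ) + R(r2/θ) + c) with R(r) = r − 1 − log r):

```latex
\theta\Big( R\big(\tfrac{r_1}{\theta}\big) + R\big(\tfrac{r_2}{\theta}\big) + c \Big)
  \;=\; r_1 + r_2 \;-\; 2\theta \;-\; \theta\log\tfrac{r_1 r_2}{\theta^2} \;+\; \theta c .
% Stationarity in \theta:  -\log(r_1 r_2) + 2\log\theta + c = 0,
% i.e. \theta_* = \sqrt{r_1 r_2}\, e^{-c/2}; the logarithmic terms cancel, so
H_c(r_1,r_2) \;=\; r_1 + r_2 - 2\theta_*
             \;=\; r_1 + r_2 - 2\sqrt{r_1 r_2}\, e^{-c/2}.
```

For c = 0 this gives (√r1 − √r2)², a Hellinger-type expression, while e^{−c/2} → 0 recovers H∞(r1, r2) = r1 + r2.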
In particular Hc is lower semicontinuous, convex and positively 1-homogeneous (thus sublinear) with respect to (r1, r2), nondecreasing and concave w.r.t. c. Moreover, a) the function Hc coincides with H̃c in the interior of its domain. Proof. Since sup Dom(R*i) = Fi(0) by (2.32), one immediately gets (5.9) in the case c = +∞; we can thus assume c < +∞.
It is not difficult to check that the function (r1, r2, θ) → θ(R1(r1/θ) + R2(r2/θ) + c) is jointly convex in [0, ∞)×[0, ∞)×(0, ∞), so that H̃c is a convex and positively 1-homogeneous function. It is also proper (i.e. not identically +∞) thanks to (3.1). By Legendre duality [38, Thm. 12.2], its lower semicontinuous envelope is given by the supremum of its affine minorants. In order to prove point a) it is sufficient to recall that convex functions are always continuous in the interior of their domain [38, Thm. 10.1]. Concerning b), the case r1 = r2 = 0 is obvious; when r1 > r2 = 0 one uses the fact that sup Dom(R*2) = F2(0), and an analogous formula holds when 0 = r1 < r2. A simple consequence of Lemma 5.3 and (2.31) is the lower bound (5.14). We now introduce the integral functional H(µ1, µ2|γ) associated with the marginal perspective cost (5.3), based on the decomposition µi = ϱi γi + µ⊥i, where we adopt the same notation as in (3.29), see (5.15). Let us first show that H is always greater than D.
Indeed, using (5.14) with rj = ρj, and noting that (2.19) and (2.43) imply F•i(ϕi) ≤ Fi(0), one obtains the desired inequality. An immediate consequence of the previous lemma is the following important result concerning the marginal perspective cost functional H defined by (5.15). It can be nicely compared with the reverse Entropy-Transport functional R, for which Theorem 3.11 stated R(µ1, µ2|γ) = E(γ|µ1, µ2).
To establish the identity (5.21) we note that the difference with (5.18) only lies in dropping the additional restriction µ⊥i = 0. When F1(0) = F2(0) = +∞ the equivalence is obvious, since the finiteness of the functional γ → H(µ1, µ2|γ) yields µ⊥1 = µ⊥2 = 0. In the general case, one immediately sees that the right-hand side E of (5.21) (with "inf" instead of "min") is larger than ET(µ1, µ2), since the infimum of H(µ1, µ2|·) is constrained to the smaller set of plans γ satisfying µi ≪ γi. On the other hand, if γ̄ ∈ Opt_ET(µ1, µ2), a suitable modification of γ̄ is an admissible competitor with the same energy, so that we have E ≤ ET(µ1, µ2). The case when only one (say µ⊥2) of the measures µ⊥i vanishes can be treated in the same way: since in this case m̄1 = µ⊥1(X1) > 0 and therefore F1(0) < ∞, by applying (5.20) we can choose γ as γ̄ plus a suitable correction term charging the support of µ⊥1. Remark 5.6. Notice that (5.20) is always satisfied if the spaces Xi are uncountable. If Xi is countable, one can always add an isolated point x̄i (sometimes called "cemetery") to Xi and consider the augmented space X̄i := Xi ⊔ {x̄i}, obtained as the disjoint union of Xi and x̄i, with the augmented cost c̄ extending c by +∞ on X̄1 × X̄2 \ (X1 × X2). We can then recover (5.21) by allowing γ in M(X̄1 × X̄2).

Entropy-transport problems with "homogeneous" marginal constraints
In this section we will exploit the 1-homogeneity of the marginal perspective function H in order to derive a last representation of the functional ET, related to the new notion of homogeneous marginals. We confine our presentation to the basic, still relevant, facts, and devote the second part of the paper to developing a full theory for the specific case of the Logarithmic Entropy-Transport problem.
In particular, the following construction (typical in the Young measure approach to variational problems) allows us to consider the entropy-transport problems in a setting of greater generality. We replace a couple (γ, ϱ), where γ is a measure on X and ϱ a nonnegative Borel function, by a measure α ∈ M(Y) on the extended space Y = X × [0, ∞). The original couple (γ, ϱ) corresponds to the measure α = (x, ϱ(x))♯γ, concentrated on the graph of ϱ in Y and whose first marginal is γ.
Homogeneous marginals. In the usual setting of Section 3.1, we consider the product spaces Y i := X i ×[0, ∞) endowed with the product topology and denote the generic points in Y i with y i = (x i , r i ), x i ∈ X i and r i ∈ [0, ∞) for i = 1, 2. Projections from Y := Y 1 × Y 2 onto the various coordinates will be denoted by π y i , π x i , π r i with obvious meaning.
Proof. The calculations are quite similar to those in the proof of Lemma 5.4. As a consequence, we can characterize the entropy-transport minimum via measures α ∈ M(Y).
Remark 5.10 (Rescaling invariance). By recalling (5.26a,b) and exploiting the 1-homogeneity of H, it is not restrictive to solve the minimum problem (5.31) in the smaller class of probability plans concentrated in Yr,p := {(x1, r1; x2, r2) ∈ Y : r1^p + r2^p ≤ r^p}, where r^p := Σi µi(Xi).
Part II. The Logarithmic Entropy-Transport problem and the Hellinger-Kantorovich distance

The Logarithmic Entropy-Transport (LET) problem

Starting from this section we will study a particular Entropy-Transport problem, whose structure reveals surprising properties.

The metric setting for Logarithmic Entropy-Transport problems.
Let (X, τ ) be a Hausdorff topological space endowed with an extended distance function d : X × X → [0, ∞] which is lower semicontinuous w.r.t. τ ; we refer to (X, τ, d) as an extended metric-topological space. In the most common situations, d will take finite values, (X, d) will be separable and complete and τ will be the topology induced by d; nevertheless, there are interesting applications where nonseparable extended distances play an important role, so that it will be useful to deal with an auxiliary topology, see e.g. [3,1]. From now on we suppose that X 1 = X 2 = X, we choose the logarithmic entropies so that Let us collect a few key properties that will be relevant in the sequel.

LE.5 The function ℓ can be characterized as the unique solution of a simple differential equation; in particular ℓ is strictly increasing and uniformly 2-convex. It is not difficult to check that √ℓ is also convex: this property is equivalent to 2 ℓ ℓ'' ≥ (ℓ')², and a direct calculation confirms it, where we set da(x1, x2) := d(x1, x2) ∧ a for xi ∈ X, a ≥ 0 (6.8). Since the function H(x1, r1²; x2, r2²) = r1² + r2² − 2 r1 r2 cos(d_{π/2}(x1, x2)) (6.9) will have an important geometric interpretation (see Section 7.1), in the following we will choose the exponent p = 2 in the setting of Section 5.2.
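Formula (6.9) can be checked numerically by brute-force minimization over θ (a sketch assuming Hc(r1, r2) = inf over θ > 0 of θ(R(r1/θ) + R(r2/θ) + c) with R(r) = r − 1 − log r and c = ℓ(d) = −log cos²(d ∧ π/2); the grid bounds are our own choice):

```python
import numpy as np

# Sketch: numerical check of (6.9) in the logarithmic case.  Assumes
# H_c(r1, r2) = inf_{theta>0} theta*(R(r1/theta) + R(r2/theta) + c)
# with R(r) = r - 1 - log r and c = ell(d) = -log(cos^2(min(d, pi/2)));
# the theta-grid below only covers minimizers inside [1e-3, 10].

def H_log(r1, r2, d, thetas=np.linspace(1e-3, 10.0, 200_001)):
    R = lambda r: r - 1.0 - np.log(r)            # reverse entropy of U_1
    c = -np.log(np.cos(min(d, np.pi / 2)) ** 2)  # cost ell(d), finite for d < pi/2
    return float(np.min(thetas * (R(r1 / thetas) + R(r2 / thetas) + c)))

def H_closed(r1, r2, d):
    # closed form (6.9): H(x1, r1^2; x2, r2^2) = r1^2 + r2^2 - 2 r1 r2 cos(d_{pi/2})
    return r1 ** 2 + r2 ** 2 - 2.0 * r1 * r2 * np.cos(min(d, np.pi / 2))
```

For d < π/2 and radii of order one the brute-force minimum matches the closed form up to the grid resolution; the 1-homogeneity of Hc in (r1, r2) can be checked in the same way.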
We keep the usual notation X = X × X, identifying X1 and X2 with X and letting the index i run between 1 and 2; e.g. for γ ∈ M(X) the marginals are denoted by γi = (πi)♯γ.
Problem 6.1 (The Logarithmic Entropy-Transport problem). Let (X, τ, d) be an extended metric-topological space, and let c be as in (6.2). Given µi ∈ M(X), find γ ∈ M(X) minimizing the corresponding Entropy-Transport functional (6.10).

The Logarithmic Entropy-Transport problem: main results
In the next theorem we collect the main properties of the Logarithmic Entropy-Transport (LET) problem relying on the reverse function R from Section 3.5, cf. (3.30), and H from Section 5.1, cf. (5.15).
Proof. The variational problem (6.10) fits into the class considered in Problem 3.1, in the basic coercive setting of Section 3.1, since the logarithmic entropy (6.1) is superlinear with domain [0, ∞). The problem is always feasible since U1(0) = 1, so that (3.6) holds. a) follows by Theorem 3.3(i); the upper bound of LET is a particular case of (3.7), and its convexity and 1-homogeneity follow by Corollary 4.13. b) is a consequence of Theorem 3.11. c) is an application of Theorem 5.5 and (6.7).
We consider now the dual representation of LET; recall that LSCs(X) denotes the space of simple (i.e. taking a finite number of values) lower semicontinuous functions and that for a couple φi : X → R the symbol φ1 ⊕ φ2 denotes the function (x1, x2) → φ1(x1) + φ2(x2) defined in X. Part a) relates to Section 4.2, whereas b)-d) discuss the optimality conditions from Section 4.4.

Theorem 6.3 (Dual formulation and optimality conditions).
a) The dual problem (LET = D_LE). For all µ1, µ2 ∈ M(X) we have the duality identities (6.13) and (6.14), where D_LE(ϕ|µ1, µ2) := Σi ∫X (1 − e^{−ϕi}) dµi. The same identities hold if the space LSCs(X) is replaced by LSCb(X) or Bb(X) in (6.13) and (6.14). When the topology τ is completely regular (in particular when d is a distance and τ is induced by d) the space LSCs(X) can be replaced by Cb(X) as well. b) Optimality conditions. Let us assume that d is continuous. A plan γ ∈ M(X) is optimal if and only if its marginals γi are absolutely continuous w.r.t. µi and their densities satisfy suitable pointwise optimality conditions.

Point c) is an obvious consequence of the optimality of γ. Point d) can be easily deduced from b) or by applying Theorem 4.15.
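The explicit dual integrand 1 − e^{−ϕ} in part a) can be recovered from the concave conjugate of the logarithmic entropy (a short computation, assuming F•(ϕ) = inf over s ≥ 0 of (F(s) + sϕ), as in (4.9)):

```latex
F^{\bullet}(\varphi)
  \;=\; \inf_{s\ge 0}\big(\, s\log s - (s-1) + s\varphi \,\big)
  % stationarity: \log s + \varphi = 0, i.e. s_* = e^{-\varphi}
  \;=\; -\varphi\, e^{-\varphi} - \big(e^{-\varphi}-1\big) + \varphi\, e^{-\varphi}
  \;=\; 1 - e^{-\varphi},
```

which yields D_LE(ϕ|µ1, µ2) = Σi ∫X (1 − e^{−ϕi}) dµi.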
In the one-dimensional case, the cyclical monotonicity of part c) of the previous theorem reduces to classical monotonicity.

Corollary 6.4 (Monotonicity of optimal plans in R). When X = R with the usual distance, the support of every optimal plan γ is a monotone set, i.e.

(x1, x2), (x1′, x2′) ∈ supp(γ), x1 < x1′ ⟹ x2 ≤ x2′. (6.21)

Proof. As the cost function is uniformly convex, (6.18) is equivalent to monotonicity.
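The underlying reason is elementary convexity: ℓ(d) = −2 log cos(d ∧ π/2) is convex and increasing on [0, π/2), so s ↦ ℓ(|s|) is a convex function of the displacement and the cost is submodular. The sketch below (our own check, with hypothetical names) verifies on random configurations that un-crossing two pairs never increases the transport cost.

```python
import math, random

def ell(s):
    """ell(|s|) = -2 log cos|s| for |s| < π/2: a convex function of s."""
    d = abs(s)
    if d >= math.pi / 2:
        return math.inf
    return -2.0 * math.log(math.cos(d))

def monotone_not_worse(x1, x1p, x2, x2p):
    """For x1 <= x1p and x2 <= x2p, compare the monotone pairing
    (x1,x2), (x1p,x2p) with the crossing pairing (x1,x2p), (x1p,x2)."""
    mono = ell(x1 - x2) + ell(x1p - x2p)
    cross = ell(x1 - x2p) + ell(x1p - x2)
    return mono <= cross + 1e-9
```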
The next result provides a variant of the reverse formulation in Theorem 6.2.
(i) The marginals γ i = π i γ are uniquely determined.
(ii) If X = R with the usual distance then γ is the unique element of Opt LET (µ 1 , µ 2 ).
(iii) If X = R^d with the usual distance, µ1 is absolutely continuous w.r.t. L^d, and A_i ⊂ R^d and σ_i : A_i → (0, ∞) are as in Theorem 6.3 b), then σ1 is approximately differentiable at γ1-a.e. point of A1 and γ is the unique element of Opt_LET(µ1, µ2); it is concentrated on the graph of a map t : R^d → R^d. Proof. (i) follows directly from Lemma 3.5.
(ii) follows by Theorem 6.3(c), since whenever the marginals γ i are fixed there is only one plan with monotone support in R.
In order to prove (iii) we adapt the argument of [2, Thm. 6.2.4] to our singular setting, where the cost c can take the value +∞.
It follows that γ1-a.e. x1 ∈ A1 is a point of L^d-density 1 of the set {x ∈ A1 : σ1(x) = s_n(x)} for some n ∈ N such that s_n is differentiable at x1. Let us denote by Ã1 the set of all such points, so that σ1 is approximately differentiable at every x1 ∈ Ã1 with approximate differential D̃σ1(x1) equal to Ds_n(x1) for n sufficiently big.
We conclude this section with a last representation formula for LET(µ1, µ2), given in terms of transport plans α in 𝒴 := Y × Y with Y := X × [0, ∞), with constraints on the homogeneous marginals, keeping the notation of Section 5.2. Even if it seems the most complicated one, it will provide the natural point of view for studying the metric properties of the LET functional. Theorem 6.7. For every µ_i ∈ M(X) we have Moreover, for every plan γ̄ ∈ Opt_LET(µ1, µ2) and every couple of Borel densities ϱ_i as in (6.11), the plan ᾱ := (x1, ϱ1(x1); x2, ϱ2(x2))_♯ γ̄ is optimal for (6.30) and (6.29).

The metric side of the LET-functional: the Hellinger-Kantorovich distance
In this section we want to show that the functional (µ1, µ2) ↦ (LET(µ1, µ2))^{1/2} (7.1) defines a distance on M(X), which is then called the Hellinger-Kantorovich distance and denoted by HK. This distance property is strongly related to the fact that the function (x1, r1; x2, r2) ↦ d_C([x1, r1], [x2, r2]) is a distance on the cone over X. In the next section we will briefly study this function and the induced metric space, the so-called cone C over X [9, Sec. 3.6], obtained by taking the quotient w.r.t. the equivalence classes of points at distance 0.

The cone construction
In the extended metric-topological space (X, τ, d) of Section 6.1, we will denote by d_a := d ∧ a the truncated distance and by y = (x, r), x ∈ X, r ∈ [0, ∞), the generic points of Y := X × [0, ∞). It is not difficult to show that the function

d_C(y1, y2) := ( r1² + r2² − 2 r1 r2 cos(d_π(x1, x2)) )^{1/2}

is nonnegative, symmetric, and satisfies the triangle inequality (see e.g. [9, Prop. 3.6.13]). We also notice that |r1 − r2| ≤ d_C(y1, y2) ≤ r1 + r2, which yields useful estimates. From this it follows that d_C induces a true distance in the quotient space C = Y/∼, where y1 ∼ y2 ⇔ (r1 = r2 = 0) or (r1 = r2 and x1 = x2). It is easy to check that (C, τ_C) is a Hausdorff topological space and that d_C is τ_C-lower semicontinuous. If τ is induced by d, then τ_C is induced by d_C. If (X, d) is complete (resp. separable), then (C, d_C) is also complete (resp. separable).
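The metric properties of d_C are easy to probe numerically. The following sketch (ours, for X = R with the Euclidean distance; names are hypothetical) implements the cone distance and checks the triangle inequality and the two-sided bound by |r1 − r2| and r1 + r2 on random triples.

```python
import math, random

def d_cone(y1, y2, dist):
    """d_C((x1,r1),(x2,r2)) = sqrt(r1² + r2² - 2 r1 r2 cos(d(x1,x2) ∧ π))."""
    (x1, r1), (x2, r2) = y1, y2
    d = min(dist(x1, x2), math.pi)
    s = r1 * r1 + r2 * r2 - 2.0 * r1 * r2 * math.cos(d)
    return math.sqrt(max(s, 0.0))  # clamp tiny negative rounding errors
```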
Perhaps the simplest example is provided by the unit sphere X = S d−1 = {x ∈ R d : |x| = 1} in R d endowed with the intrinsic Riemannian distance: the corresponding cone C is precisely R d .
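This identification can be checked numerically: for unit vectors u, v whose intrinsic distance is the angle between them, the law of cosines turns the cone distance into the Euclidean distance of R^d. A small sketch of ours:

```python
import math, random

def sphere_dist(u, v):
    """Intrinsic (angular) distance on the unit sphere."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    return math.acos(dot)

def d_cone_sphere(y1, y2):
    """Cone distance over (S^{d-1}, angular distance)."""
    (u, r1), (v, r2) = y1, y2
    d = min(sphere_dist(u, v), math.pi)
    s = r1 * r1 + r2 * r2 - 2.0 * r1 * r2 * math.cos(d)
    return math.sqrt(max(s, 0.0))

def random_unit(dim):
    """Uniform random point on S^{dim-1} via normalized Gaussians."""
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]
```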
We denote by p : Y → C, p(x, r) := [x, r], the canonical projection. Clearly p is continuous and is a homeomorphism between Y \ (X × {0}) and C_o := C \ {o}, where o denotes the vertex of the cone. A right inverse y : C → Y of the map p can be obtained by fixing a point x̄ ∈ X and defining y([x, r]) := (x̄, 0) if r = 0 and y([x, r]) := (x, r) otherwise. (7.8) Notice that r is continuous and x is continuous when restricted to C_o. A continuous rescaling product from C × [0, ∞) to C can be defined by [x, r] · λ := [x, rλ]. We conclude this introductory section with a characterization of compact sets in (C, τ_C). Proof. It is easy to check that the condition is necessary. In order to show sufficiency, let ρ := inf_K r. If ρ > 0 then K is compact, since it is a closed subset of the compact set p(K(ρ) × [ρ, r0]).
If ρ = 0 then o is an accumulation point of K by (7.6), and therefore o ∈ K since K is closed. If U is an open covering of K, we can pick U0 ∈ U with o ∈ U0. By (7.6) there exists ε > 0 such that {y ∈ K : r(y) < ε} ⊂ U0; the remaining part {y ∈ K : r(y) ≥ ε} is compact by the previous step, so it is covered by finitely many elements of U, which together with U0 give a finite subcover of K.
On the other hand, the choice of d_{π/2} is crucial for its link with the function H of (6.9), with Entropy-Transport problems, and with a representation property for the Hopf-Lax formula that we will see in the next sections. Notice that the 1-homogeneous formula (6.7) would not be convex in (r1, r2) if one used d_π instead of d_{π/2}. Nevertheless, we will prove in Section 7.3 the remarkable fact that both d_π and d_{π/2} lead to the same distance between positive measures.

Radon measures in the cone C and homogeneous marginals
It is clear that any measure ν ∈ M(C) can be lifted to a measure ν̄ ∈ M(Y) such that p_♯ ν̄ = ν: it is sufficient to take ν̄ = y_♯ ν, where y is a right inverse of p defined as in (7.8).
We call M2(C) (resp. P2(C)) the space of measures ν ∈ M(C) (resp. ν ∈ P(C)) such that ∫_C r²(y) dν(y) < ∞. Measures in M2(C) thus correspond to images p_♯ ν̄ of measures ν̄ ∈ M2(Y) and have finite second moment w.r.t. the distance d_C, which justifies the index 2 in M2(C). Notice moreover that the measure r² ν̄ does not charge X × {0} and is independent of the choice of the point x̄ in (7.8).
The above considerations can easily be extended to plans in the product spaces C^⊗N (where typically N = 2, but the general case will also turn out to be useful later on). To clarify the notation, we will denote by y = (y1, . . . , yN) a point in C^⊗N and set r_i(y) := r(y_i) = r_i and x_i(y) := x(y_i) ∈ X. Projections on the i-th coordinate from C^⊗N to C are usually denoted by π^i or π^{y_i}; p = p^⊗N : Y^⊗N → C^⊗N and y = y^⊗N : C^⊗N → Y^⊗N are the Cartesian products of the projections and of the lifts.
Recall that the L²-Kantorovich-Wasserstein distance W_{d_C} on M(C) is defined by

W²_{d_C}(ν1, ν2) := min { ∫_{C×C} d²_C(y1, y2) dν(y1, y2) : ν ∈ M(C × C), π^i_♯ ν = ν_i }, (7.13)

with the convention that W_{d_C}(ν1, ν2) = +∞ if ν1(C) ≠ ν2(C), in which case the minimum in (7.13) is taken over an empty set. We want to mimic the above definition, replacing the usual marginal conditions in (7.13) with the homogeneous marginals h²_i, which we are going to define.
Let us now consider a plan α ∈ M(C^⊗N) with ᾱ = y_♯ α ∈ M(Y^⊗N): we say that α lies in M2(C^⊗N) if (7.14) holds. Its "canonical" marginals in M(C) are α_i := π^{y_i}_♯ α, whereas the "homogeneous" marginals correspond to (5.23) with p = 2:

h²_i α := (x_i)_♯ (r_i² α). (7.15)

We will omit the index i when N = 1. Notice that r_i² α does not charge (π^i)^{−1}(o); similarly, (7.15) is independent of the choice of the point x̄ in (7.8).
As in (5.25), the homogeneous marginals on the cone are invariant with respect to dilations: if ϑ : C^⊗N → (0, ∞) is a Borel map in L²(C^⊗N, α), we set

prd_ϑ(y)_i := y_i · ϑ(y)^{−1} and dil_{ϑ,2}(α) := (prd_ϑ)_♯ (ϑ² α), (7.16)

so that

h²_i(dil_{ϑ,2}(α)) = h²_i(α) for every α ∈ M2(C^⊗N). (7.17)

As with the canonical marginals, a uniform control of the homogeneous marginals is sufficient to obtain equal tightness, cf. (2.4) for the definition. We state this result for an arbitrary number of components, and we emphasize that we are not claiming any closedness of the involved sets.
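For discrete plans these definitions are easy to implement, and the invariance (7.17) can be verified directly; the representation below (weighted atoms (w, y1, y2), names of our choosing) is only a sketch.

```python
from collections import defaultdict

def homogeneous_marginal(plan, i):
    """h²_i of a discrete plan on C^⊗2: push w·r_i² onto the base point x_i.
    plan = list of (w, (x1, r1), (x2, r2))."""
    m = defaultdict(float)
    for w, *ys in plan:
        x, r = ys[i]
        m[x] += w * r * r
    return dict(m)

def dilate(plan, theta):
    """dil_{ϑ,2}: rescale each radius by 1/ϑ(y) and each weight by ϑ(y)², cf. (7.16)."""
    out = []
    for w, (x1, r1), (x2, r2) in plan:
        t = theta(((x1, r1), (x2, r2)))
        out.append((w * t * t, (x1, r1 / t), (x2, r2 / t)))
    return out
```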

Lemma 7.3 (Homogeneous marginals and tightness).
Let K_i, i = 1, . . . , N, be a finite collection of bounded and equally tight sets in M(X). Then the set

{ α ∈ M2(C^⊗N) : h²_i α ∈ K_i for i = 1, . . . , N }

is equally tight in M(C^⊗N).
Proof. By applying [2, Lem. 5.2.2], it is sufficient to consider the case N = 1: given a bounded and equally tight set K ⊂ M(X), we prove that H := { α ∈ M2(C) : h² α ∈ K } is equally tight. For A ⊂ X and R ⊂ (0, ∞) we will use the shorthand A ×_C R for p(A × R) ⊂ C. If A and R are compact, then A ×_C R is compact in C.
Let M := sup_{µ∈K} µ(X) < ∞; since K is tight, we can find an increasing sequence of compact sets K_n ⊂ X such that µ(X \ K_n) ≤ 8^{−n} for every µ ∈ K. For an integer m ∈ N we then consider the compact sets K_m ⊂ C defined by Since for every α ∈ H with h² α = µ and every A ∈ B(X) we have for every α ∈ H. Since all the K_m are compact, we obtain the desired equal tightness.
Remark 7.5 (Lifting of plans in Y). Since any plan α ∈ M(C × C) can be lifted to a plan ᾱ = y_♯ α ∈ M(Y × Y) such that p_♯ ᾱ = α, the previous Problem 7.4 is also equivalent to the minimum problem (6.31) over lifted plans. The advantage of working in the quotient space C is to gain compactness, as the next Theorem 7.6 will show.
An important feature of the cone distance and of the homogeneous marginals is their invariance under rescaling, which can be realized through the dilations from (7.16). Let us set C[R] := { y ∈ C : r(y) ≤ R }. It is not restrictive to solve the previous Problem 7.4 by also assuming that α is a probability plan in P(C × C) concentrated on C[R] × C[R] with R² = Σ_i µ_i(X), i.e.
In fact, the functional d²_C and the constraints have a natural scaling invariance induced by the dilation maps defined by (7.16). In order to show that Problem 7.4 has a solution, we can then use the formulation (7.25) and prove that the set 𝒞 in which the minimum is sought is narrowly compact in P(C × C). Notice that the analogous property would not be true in P(Y × Y) (unless X is compact), since measures concentrated in (X × {0}) × (X × {0}) would be out of control. Also, the constraints h²_i α = µ_i would not be preserved by narrow convergence if one allowed arbitrary plans in P(C × C) as in (7.22). Theorem 7.6 (Existence of optimal plans for the HK problem). For every µ1, µ2 ∈ M(X) the Hellinger-Kantorovich Problem 7.4 always admits a solution α ∈ P(C × C) concentrated as in (7.25). Proof. By the rescaling (7.26) it is not restrictive to look for minimizers α of (7.25). Since C[R] is closed in C and the maps r²_i are continuous and bounded on C[R] × C[R], the set 𝒞 is clearly narrowly closed. By Lemma 7.3, 𝒞 is also equally tight, and thus narrowly compact by Theorem 2.2. Since d²_C is lower semicontinuous, the existence of a minimizer of (7.25) then follows by the direct method of the calculus of variations.
We can also prove an interesting characterization of HK in terms of the L 2 -Kantorovich-Wasserstein distance on P 2 (C) given by (7.13). An even deeper connection will be discussed in the next section, see Corollary 7.13.
In particular, the map h² : P2(C) → M(X) is a contraction, i.e.

HK(h² α1, h² α2) ≤ W_{d_C}(α1, α2) for every α1, α2 ∈ P2(C).
We conclude this section with two simple properties of the HK functional. We denote by η 0 the null measure.
Subsequently we will use the symbol ⌞ for the restriction of measures. Lemma 7.9 (A formulation with relaxed constraints). For every µ1, µ2 ∈ M(X) we have the identities (7.31a)-(7.31b). Moreover, (i) equations (7.31a)-(7.31b) share the same class of optimal plans.
Proof. The formulas (7.31a) and (7.31b) are just two different ways of writing the same functional, since for every α ∈ H²_≤(µ1, µ2) we have Thus, to prove (i) it is sufficient to show (7.31a). The inequality ≥ is obvious, since On the other hand, whenever α ∈ H²_≤(µ1, µ2), setting µ̃_i := h²_i α ∈ M(X), µ̂_i := µ_i − µ̃_i and observing that α ∈ H²_=(µ̃1, µ̃2), we get The same calculations also prove point (iii). In order to check (ii), it is sufficient to observe that the integrand in (7.31b) vanishes. Finally, if α ∈ Opt_HK(µ1, µ2) is optimal for (7.22), then by the considerations above it is optimal for (7.31b), and therefore (ii) shows that α_o is optimal as well. The converse implication follows from (iii).

Gluing lemma and triangle inequality
In this section we will prove that HK satisfies the triangle inequality and is therefore a distance on M(X). The main technical step is provided by the following useful property of plans in M(C^⊗N) with given homogeneous marginals, which is a simple application of the rescaling invariance in (7.26). Lemma 7.10. Let α ∈ M2(C^⊗N) satisfy (7.33) and let j ∈ {1, . . . , N} be fixed. Then it is possible to find a new plan ᾱ ∈ M2(C^⊗N) which still satisfies (7.33) and additionally the normalization of the j-th lift,

π^j_♯(ᾱ) = δ_o + p_♯(µ_j ⊗ δ_1). (7.34)

Proof. By possibly adding δ_o^⊗N to α (which does not modify (7.33)), we may suppose that ω_j := α({ y ∈ C^⊗N : π^j(y) = o }) ≥ 1, where j is fixed as in the lemma. In order to find ᾱ it is sufficient to rescale α by the function With the notation of (7.16) we set ᾱ := dil_{ϑ,2}(α) and decompose ᾱ into the sum which yields (7.34).
We can now prove a general form of the so-called "gluing lemma", the natural extension of the well-known result for transport problems (see e.g. [2, Lemma 5.3.4]). Here its formulation is strongly related to the rescaling invariance of optimal plans given by Lemma 7.10. Lemma 7.11 (Gluing). Let µ_i ∈ M(X), i = 1, . . . , N. Then there exist plans α1, α2 ∈ P2(C^⊗N) such that h²_i α_k = µ_i for i = 1, . . . , N and

(7.37)
Moreover, the plans α_k satisfy the following additional conditions: Proof. We first construct a plan α satisfying (7.37); then suitable rescalings will provide α_k satisfying (7.38) or (7.39). In order to clarify the argument, we consider N copies X1, X2, . . . , XN of X (and similarly for C), so that X^⊗N = X1 × · · · × XN. We argue by induction; the starting case N = 2 is covered by Theorem 7.6 and Lemma 7.10. Let us now discuss the induction step, assuming that the thesis holds for N and proving it for N + 1. We can thus find an optimal plan α_N such that (7.37) holds, and another optimal plan α ∈ Opt_HK(µ_N, µ_{N+1}) for the couple µ_N, µ_{N+1}. Applying the normalization Lemma 7.10 to α_N (with j = N) and to α (with j = 1), we can assume that

π^N_♯(α_N) = δ_o + p_♯(µ_N ⊗ δ_1) = π^1_♯(α).
A further application of the rescaling (7.26) with ϑ as in (5.26a) yields a plan α 1 satisfying also (7.38).
In order to obtain α2, we can assume α({ |y| = 0 }) = 0 and set α2 := dil_{ϑ,2}(α), where we use the rescaling function To obtain (7.39) it remains to estimate r. We consider arbitrary coefficients θ_i > 0 and use, for n = 2, . . . , N, the inequality which yields Optimizing with respect to θ_i > 0, we obtain the value of Θ given by (7.36).
The next remark gives a similar rescaling result for probability couplings β ∈ P 2 (C ⊗N ). Arguing as in the proof of Corollary 7.7 one immediately obtains the following result, which will be needed for the proof of Theorem 8.8 and for the subsequent corollary.
Corollary 7.13. For every finite collection of measures µ_i ∈ M(X), i = 1, . . . , N, there exist α_i, β_i ∈ P2(C), with α_i concentrated in C[r], where r = min(M, Θ) is given as in (7.36), and β_i concentrated in C[Ξ] given by (7.40), such that h² α_i = h² β_i = µ_i and HK(µ_i, µ_j) = W_{d_C}(α_i, α_j) for every i, j ∈ {1, . . . , N}. We are now in a position to show that the functional HK is a true distance on M(X), where we deduce the triangle inequality from the one for W_{d_C} by using normalized lifts.
Corollary 7.14 (HK is a distance). HK is a distance on M(X); in particular, for every µ1, µ2, µ3 ∈ M(X) we have the triangle inequality

HK(µ1, µ3) ≤ HK(µ1, µ2) + HK(µ2, µ3). (7.42)

Proof. It is immediate to check that HK is symmetric and that HK(µ1, µ2) = 0 if and only if µ1 = µ2. In order to check (7.42), it is sufficient to apply the previous Corollary 7.13 to find measures α_i ∈ P2(C), i = 1, 2, 3, such that h² α_i = µ_i, HK(µ1, µ2) = W_{d_C}(α1, α2), and HK(µ2, µ3) = W_{d_C}(α2, α3). Applying the triangle inequality for W_{d_C} we obtain

HK(µ1, µ3) ≤ W_{d_C}(α1, α3) ≤ W_{d_C}(α1, α2) + W_{d_C}(α2, α3) = HK(µ1, µ2) + HK(µ2, µ3).

As a consequence of the previous two results, the map h² : P2(C) → M(X) is a metric submersion.
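On pairs of Dirac measures the distance can be written in closed form: HK(a δ_x, b δ_y) coincides with the cone distance between [x, √a] and [y, √b] computed with the threshold π/2 (this is the single-atom value of the HK functional), so the triangle inequality (7.42) can be tested directly; the code below is our own sketch.

```python
import math, random

def hk_dirac(a, x, b, y):
    """HK(a·δ_x, b·δ_y) on (R, |·|): the cone distance between [x, √a]
    and [y, √b] with the truncation threshold π/2."""
    d = min(abs(x - y), math.pi / 2)
    s = a + b - 2.0 * math.sqrt(a * b) * math.cos(d)
    return math.sqrt(max(s, 0.0))
```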

Metric and topological properties
In this section we will assume that the topology τ on X is induced by d and that (X, d) is separable, so that (C, d_C) is separable as well. Notice that in this case there is no difference between the weak and the narrow topology on M(X). Moreover, since X is separable, M(X) equipped with the weak topology is metrizable, so that convergent sequences suffice to characterize the weak-narrow topology.
It turns out [2, Chap. 7] that (P2(C), W_{d_C}) is a separable metric space: convergence of a sequence (α_n)_{n∈N} to a limit measure α in (P2(C), W_{d_C}) corresponds to weak-narrow convergence in P(C) together with convergence of the quadratic moments or, equivalently, to convergence of the integrals of continuous functions with quadratic growth, i.e.

lim_{n→∞} ∫_C φ dα_n = ∫_C φ dα for every φ ∈ C(C) with |φ(y)| ≤ A + B r²(y), (7.43)

for some constants A, B ≥ 0 depending on φ. Recall that r²(y) = d²_C(y, o). Proof. Let us first suppose that lim_{n→∞} HK(µ_n, µ) = 0. We argue by contradiction, assuming that there exist a function ζ ∈ C_b(X) and a subsequence (still denoted by µ_n) such that

inf_n |∫_X ζ dµ_n − ∫_X ζ dµ| > 0. (7.44)

The first estimate of (7.29) and the triangle inequality show that lim sup_{n→∞} µ_n(X) ≤ lim sup_{n→∞} (HK(µ_n, µ) + HK(µ, η0))² = µ(X), so that sup_n µ_n(X) =: M² < ∞. By Corollary 7.7 we can find measures α_n, α̃_n ∈ P2(C) concentrated on C[2M] such that By Lemma 7.3 the sequence (α_n)_{n∈N} is equally tight in P2(C); since it is also uniformly bounded, there exists a subsequence k ↦ n_k such that α_{n_k} converges weakly to a limit α ∈ P2(C). Since each α_n is concentrated on C[2M], we also have h² α = µ and lim_{k→∞} W_{d_C}(α_{n_k}, α) = 0. We thus have which contradicts (7.44).
In order to prove the converse implication, let us suppose that µ n is converging weakly to µ in M(X). If µ is the null measure η 0 = 0, then lim n→∞ µ n (X) = 0 so that lim n→∞ HK(µ n , µ) = 0 by (7.29).
Since h² α_n = µ_n and h² α = µ, by (7.28) we have HK(µ_n, µ) ≤ W_{d_C}(α_n, α). Since m_n^{−1} µ_n converges weakly to m^{−1} µ in P(X) and m_n → m, it is easy to check that m_n^{−1} µ_n ⊗ δ_{√m_n} converges weakly to m^{−1} µ ⊗ δ_{√m} in P(Y), and therefore α_n converges weakly to α in P(C) by the continuity of the projection p. Hence, in order to conclude that W_{d_C}(α_n, α) → 0, it is sufficient to prove the convergence of the quadratic moments with respect to the vertex o. This is immediate, since h² α_n = µ_n yields ∫_C r² dα_n = µ_n(X) → µ(X) = ∫_C r² dα. The following completeness result for (M(X), HK) is obtained by suitable liftings of the measures µ_i to probability measures α_i ∈ P2(C), supported in some C[Θ]; then the completeness of the Wasserstein space (P2(C), W_{d_C}) is exploited.
Proof. We have to prove that every Cauchy sequence (µ_n)_{n∈N} in (M(X), HK) admits a convergent subsequence. By exploiting the Cauchy property, we can find an increasing sequence of integers k ↦ n(k) such that HK(µ_m, µ_{m′}) ≤ 2^{−k} whenever m, m′ ≥ n(k), and we consider the subsequence µ_i := µ_{n(i)}, so that HK(µ_i, µ_{i−1}) ≤ 2^{1−i}. By applying the Gluing Lemma 7.11, for every N > 0 we can find measures α^N_i ∈ P2(C), i = 1, . . . , N, concentrated on C[Θ] with Θ := µ1(X) + 1, such that For every i the sequence N ↦ α^N_i ∈ P2(C) is tight by Lemma 7.3 and concentrated on the bounded set C[Θ], so that by Prokhorov's Theorem it is relatively compact in (P2(C), W_{d_C}).
By a standard diagonal argument, we can find a further increasing subsequence m ↦ N(m) and limit measures α_i ∈ P2(C) such that lim_{m→∞} W_{d_C}(α^{N(m)}_i, α_i) = 0. The convergence with respect to W_{d_C} yields h² α_i = µ_i. It follows that i ↦ α_i is a Cauchy sequence in (P2(C), W_{d_C}), which is a complete metric space [2, Prop. 7.1.5]; therefore there exists α ∈ P2(C) such that lim_{i→∞} W_{d_C}(α_i, α) = 0. Setting µ := h² α ∈ M(X), we thus obtain lim_{i→∞} HK(µ_i, µ) = 0.
We conclude this section by proving a simple comparison estimate between HK and the Bounded Lipschitz metric (cf. [17, Sec. 11.3]); see also [24, Thm. 3]. The Bounded Lipschitz metric is defined via

‖µ1 − µ2‖_BL := sup { ∫_X ξ d(µ1 − µ2) : ξ ∈ Lip_b(X), sup_X |ξ| + Lip(ξ, X) ≤ 1 }.

We do not claim that the constant C* below is optimal.

Hellinger-Kantorovich distance and Entropy-Transport functionals
In this section we will establish our main result connecting HK with LET.
It is clear that the definition of HK does not change if we replace the distance d on X by its truncation d_π = d ∧ π. It is less obvious that we can even replace the threshold π with π/2 and use the distance d_{π/2,C} of Remark 7.2 in the formulation of the Hellinger-Kantorovich Problem 7.4. This property is related to the particular structure of the homogeneous marginals (which are not affected by masses concentrated in the vertex o of the cone C); in [27, Sect. 3.2] it is called the presence of a sufficiently large reservoir: transport over distances larger than π/2 is never optimal, since it is cheaper to transport into or out of the reservoir in o. This will provide an essential piece of information connecting the HK and the LET functionals.
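The reservoir mechanism is already visible at the level of the cone distance itself: replacing the threshold π by π/2 amounts to allowing the transport to pass through the vertex o, since d²_{π/2,C}(y1, y2) = min( d²_C(y1, y2), r1² + r2² ), and r1² + r2² = d²_C(y1, o) + d²_C(o, y2) is the squared cost of routing through the reservoir. The following sketch of ours checks this identity numerically.

```python
import math, random

def d_cone_sq(r1, r2, d, thresh):
    """Squared cone distance with the base distance truncated at `thresh`."""
    return r1 * r1 + r2 * r2 - 2.0 * r1 * r2 * math.cos(min(d, thresh))
```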
In order to prove that transport only occurs over distances ≤ π/2, we define the subset C′ := { (y1, y2) ∈ C × C : d(x1, x2) ≤ π/2 } and consider the partition (C′, C″) of C := C × C, where C″ := C \ C′; on C′ we have d_{π/2,C} = d_C.
(ii) ᾱ ∈ M(Y × Y) is an optimal plan for (6.31) if and only if α := p_♯ ᾱ is an optimal plan for the Hellinger-Kantorovich Problem 7.4.
Since none of the optimal plans for HK charges C″, statements (i), (ii), and (iii) follow easily by combining Lemma 7.9, Remark 7.5 and Theorems 6.3 and 6.7.
Concerning (iv), the optimality of α̃ is obvious from the formulation (7.31b), and the optimality of β = dil_{ϑ,2}(α) follows from the invariance of (7.31b) with respect to dilations. We notice that β-almost everywhere in G we have so that by (7.31a) we arrive at Let us now set γ := (x1, x2)_♯ β ∈ M(X × X) and β_i := π^i_♯ β ∈ M(C), which yield the marginals γ_i. Denoting by (β_{i,x_i})_{x_i ∈ X} the disintegration of β_i with respect to γ_i (here we need the metrizability and separability of (X, τ), see [2, Section 5.3]), we find Applying Jensen's inequality we obtain Now ∫ c(x1, x2) dβ = ∫ c(x1, x2) dγ and (7.54) imply with ν_i := µ_i − µ̃_i ∈ M(X). Hence µ_i = ϱ̃_i γ_i + ν_i, and the standard decomposition Hence U0(s) = s − 1 − log s and the monotonicity of the logarithm yield where the last estimate follows from Theorem 6.2(b). Above, the first inequality is strict if ν_i ≠ µ⊥_i, so that ϱ_i > ϱ̃_i on some set with positive γ_i-measure. By the first statement of the theorem it follows that γ ∈ Opt_LET(µ1, µ2). Hence all the inequalities are in fact identities, and we conclude that ϱ̃_i ≡ ϱ_i. Since U0 is strictly convex, the disintegration measure β_{i,x_i} is a Dirac measure concentrated on We observe that the system (γ, ϱ1, ϱ2) provided by the previous theorem enjoys a few remarkable properties that are not obvious from the original Hellinger-Kantorovich formulation.
a) First of all, the annihilated part µ⊥_i of the measures µ_i is concentrated on the set When µ_i(M_{i,j}) = 0, then µ_i ≪ γ_i. b) As a second property, an optimal plan γ ∈ Opt_LET(µ1, µ2) provides an optimal plan α := (p ∘ (x1, ϱ1); p ∘ (x2, ϱ2))_♯ γ, which is concentrated on the graph of the map (x1, x2) ↦ ([x1, ϱ1(x1)], [x2, ϱ2(x2)]), where the densities ϱ_i are independent, in the sense that ϱ_i only depends on x_i. c) A third important application of Theorem 7.20 is the duality formula for the HK functional, which directly follows from (6.14) of Theorem 6.3. We will state it in a slightly different form in the next theorem, whose interpretation will be clearer in the light of Section 8.4. It is based on the inf-convolution formula where ξ ∈ B(X) with ξ > −1/2. (i) If ξ ∈ B_b(X) with inf_X ξ > −1/2, then the function P1ξ defined by (7.55) belongs to Lip_b(X), satisfies sup_X P1ξ < 1/2, and admits the equivalent representation In particular, if ξ has bounded support then P1ξ ∈ Lip_bs(X), the space of Lipschitz functions with bounded support.
(ii) Let us suppose that (X, d) is a separable metric space and that τ is induced by d. For every µ0, µ1 ∈ M(X) we have Proof. Let us first observe that if where the upper bound follows by choosing x′ = x, while the lower bound is easily seen from the first form of P1ξ in (7.55) and sin² ≥ 0. Since 1/(1 + 2ξ(x′)) ≤ 1/(1 + 2a) for every x′ ∈ X, the function P1ξ is also Lipschitz, because it is the infimum of a family of uniformly Lipschitz functions. Moreover, for d(x, x′) ≥ π/2 we have the estimate which immediately gives (7.56). In particular, we have Let us now prove statement (ii). We denote by E the right-hand side of (7.57) and by E′ the analogous expression where ξ runs in C_b(X): we know that sup_X ψ2 < 1 and ψ2 ∈ Lip_b(X). Thus ψ1 and ψ2 are continuous and satisfy Hence the couple (ψ1, ψ2) is admissible for (6.14) (with C_b(X) instead of LSC_s(X); note that τ is metrizable and thus completely regular), so that HK²(µ0, µ1) = LET(µ0, µ1) ≥ E′.
To show that E = E′ in the general case, we approximate ψ ∈ C_b(X) with inf_X ψ > −1 by a decreasing sequence of bounded Lipschitz functions (e.g. by taking ψ_n(x) := sup_y (ψ(y) − n d_π(x, y))) and use the fact that the supremum in (7.61) does not change if we restrict it to Lip_b(X).
Let now ξ be Lipschitz and valued in [a, b] with −1/2 < a ≤ 0 ≤ b. Taking the increasing sequence of nonnegative cut-off functions ζ_n(x) := (0 ∨ (n − d(x, x̄))) ∧ 1, which are uniformly 1-Lipschitz, have bounded support, and satisfy ζ_n ↑ 1 as n → ∞, it is easy to check that ξ_n := ζ_n ξ belongs to Lip_bs(X) and takes values in the interval [a, b], so that a/(1+2a) ≤ P1ξ_n ≤ b/(1+2b) for every n ∈ N. Since ξ_n(x) = 0 if d(x, x̄) ≥ n and ξ_n(x) = ξ(x) if d(x, x̄) ≤ n − 1, by (7.56) we get Thus P1ξ_n ∈ Lip_bs(X), and the Lebesgue Dominated Convergence theorem shows that

Limiting cases: recovering the Hellinger-Kakutani distance and the Kantorovich-Wasserstein distance
In this section we will show that we can recover the Hellinger-Kakutani and the Kantorovich-Wasserstein distance by suitably rescaling the HK functional.
The Hellinger-Kakutani distance. As we have seen in Example E.5 of Section 3.3, the Hellinger-Kakutani distance between two measures µ1, µ2 ∈ M(X) can be obtained as the limiting case when the space X is endowed with the discrete distance d(x1, x2) := a for x1 ≠ x2 (and 0 otherwise), with a ∈ [π, +∞]. (7.63) The induced cone distance in this case is and the induced cost function for the Entropy-Transport formalism is given by Recalling (3.23)-(3.24) we obtain Since c_Hell ≥ c = ℓ ∘ d for every distance function on X, we always have the upper bound HK(µ1, µ2) ≤ Hell(µ1, µ2) for every µ1, µ2 ∈ M(X). The Kantorovich-Wasserstein distance. Let us first observe that whenever µ1, µ2 ∈ M(X) have the same mass, their HK-distance is always bounded from above by the Kantorovich-Wasserstein distance W_d (the upper bound is trivial when µ1(X) ≠ µ2(X), since in that case W_d(µ1, µ2) = +∞).
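For purely atomic measures both limiting distances are elementary, and the bound HK ≤ Hell can be observed directly on pairs of Dirac measures (a sketch of ours, using the single-atom value of HK on the real line): for Diracs at the same point both squared distances equal (√a − √b)², and at distance ≥ π/2 both equal a + b.

```python
import math, random

def hell_sq_diracs(a, x, b, y):
    """Squared Hellinger-Kakutani distance between a·δ_x and b·δ_y:
    ∫ (√ρ1 - √ρ2)² dλ for a common dominating measure λ."""
    return (math.sqrt(a) - math.sqrt(b)) ** 2 if x == y else a + b

def hk_sq_diracs(a, x, b, y):
    """Squared HK distance between a·δ_x and b·δ_y on (R, |·|)."""
    d = min(abs(x - y), math.pi / 2)
    return a + b - 2.0 * math.sqrt(a * b) * math.cos(d)
```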
Proposition 7.23. For every couple µ1, µ2 ∈ M(X) we have Proof. It is not restrictive to assume that W²_{d_{π/2}}(µ1, µ2) = ∫ d²_{π/2} dγ < ∞ for an optimal plan γ with marginals µ_i. We then define the plan α := s_♯ γ ∈ M(C × C), where s(x1, x2) := ([x1, 1], [x2, 1]), so that h²_i α = µ_i. By using (7.52) and (7.3) we obtain In order to recover the Kantorovich-Wasserstein distance, we perform a simultaneous scaling, taking the limit of n HK_{d/n}, where HK_{d/n} is induced by the distance d/n. Theorem 7.24 (Convergence of HK to W). Let (X, τ, d) be an extended metric-topological space and let HK_{d/λ} be the Hellinger-Kantorovich distance on M(X) induced by the distance λ^{−1} d, λ > 0. Then for all µ1, µ2 ∈ M(X) we have Proof. Let us denote by LET_λ = HK²_{d/λ} the optimal value of the LET problem associated with the distance d/λ. Since the Kantorovich-Wasserstein distance is invariant under the rescaling λ W_{d/λ} = W_d, the estimate (7.69) shows that λ HK_{d/λ} ≤ W_d.
Let γ_λ be an optimal plan for HK_{d/λ}(µ1, µ2) with marginals γ_{λ,i} = π^i_♯ γ_λ. We denote by F the entropy functional associated with the logarithmic entropy F(s) = U1(s) and by G the entropy functional associated with G(s) := I1(s), as in Example E.3 of Section 3.3. Since the transport part of the LET functional is associated with the costs we obtain the estimate (7.71). Proposition 2.10 shows that the family of plans (γ_λ)_{λ≥1} is relatively compact with respect to narrow convergence in M(X × X). Since λ² F(s) ↑ I1(s), passing to the limit along a suitable subnet (λ(α))_{α∈A} parametrized by a directed set A, and applying Corollary 2.9, we get a limit plan γ ∈ M(X × X) with marginals γ_i such that In particular, we conclude that µ1(X) = γ(X × X) = µ2(X). Since d is lower semicontinuous, the narrow convergence of γ_{λ(α)} and (7.71) also yield
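The scaling limit is transparent already for two unit Dirac masses at distance d0: using the single-atom value of HK, λ² HK²_{d/λ}(δ_x, δ_y) = λ² (2 − 2 cos((d0/λ) ∧ π/2)) increases to d0² = W²_d(δ_x, δ_y) as λ → ∞. A numerical sketch of ours:

```python
import math

def scaled_hk_sq(d0, lam):
    """λ²·HK²_{d/λ}(δ_x, δ_y) for two unit Diracs at distance d0 = d(x, y)."""
    return lam ** 2 * (2.0 - 2.0 * math.cos(min(d0 / lam, math.pi / 2)))
```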
It follows that g̃ := g ∘ d is a distance on X, inducing the same topology as d. We can now introduce the distance HK_g associated with g̃. The corresponding distance on the cone is given by

g_C(y1, y2) := ( r1² + r2² − 2 r1 r2 exp(−d²(x1, x2)/2) )^{1/2}. (7.73)

From g(z) ≤ z we have g_C ≤ d_C.

Dynamic interpretation of the Hellinger-Kantorovich distance
As in Section 7.5, throughout this chapter we will suppose that (X, d) is a complete and separable (possibly extended) metric space and that τ coincides with the topology induced by d. All the results admit a natural generalization to the framework of extended metric-topological spaces [1, Sec. 4].

Absolutely continuous curves and geodesics in the cone C
Absolutely continuous curves and metric derivative. If (Z, d_Z) is a (possibly extended) metric space and I is an interval of R, a curve z : I → Z is absolutely continuous if there exists m ∈ L¹(I) such that Its metric derivative |z′|_{d_Z} (we will omit the index d_Z when the choice of the metric is clear from the context) is the Borel function defined by and it is possible to show (see [2]) that the lim sup above is in fact a limit at L¹-a.e. point of I and provides the minimal (up to modifications on L¹-negligible sets) function m for which (8.1) holds. We will denote by AC^p(I; Z) the class of all absolutely continuous curves z : I → Z with |z′| ∈ L^p(I); when I is an open set of R, we will also consider the local space AC^p_loc(I; Z). If Z is complete and separable, then AC^p([0, 1]; Z) is a Borel set in the space C([0, 1]; Z) endowed with the topology of uniform convergence. (This property can be extended to the framework of extended metric-topological spaces, see [3].) A curve z : [0, 1] → Z is a (minimal, constant speed) geodesic if

d_Z(z(t0), z(t1)) = |t1 − t0| d_Z(z(0), z(1)) for every t0, t1 ∈ [0, 1]. (8.3)

In particular, z is Lipschitz and |z′| ≡ d_Z(z(0), z(1)) on [0, 1]. We denote by Geo(Z) ⊂ C([0, 1]; Z) the closed subset of all geodesics. A metric space (Z, d_Z) is called a length (or intrinsic) space if the distance between arbitrary couples of points can be obtained as the infimum of the lengths of the absolutely continuous curves connecting them. It is called a geodesic (or strictly intrinsic) space if every couple of points z0, z1 at finite distance can be joined by a geodesic.
Geodesics in C. If (X, d) is a geodesic (resp. length) space, then C is also a geodesic (resp. length) space, cf. [9]. If x1, x2 ∈ X with d(x1, x2) ≥ π, then a geodesic between y_i = [x_i, r_i] can easily be obtained by joining two geodesics connecting y_i to o as before; observe that in this case d_C(y1, y2) = r1 + r2.
In the case d(x1, x2) < π and r1, r2 > 0, every geodesic y : I → C connecting y1 to y2 is associated with a geodesic x in X joining x1 to x2, parametrized with unit speed on the interval [0, d(x1, x2)]. To find the radius r(t), we use the complex plane C: we write the segment connecting z1 = r1 ∈ C to z2 = r2 exp(i d(x1, x2)) ∈ C in polar coordinates, namely and then the geodesic curve in C takes the form For absolutely continuous curves the following characterization holds: Proof. By (7.4) one immediately sees that if y = p ∘ ȳ ∈ AC^p([0, 1]; C), then r belongs to AC^p([0, 1]; R) and x ∈ AC^p_loc(O_r; X). Since y is absolutely continuous, we can evaluate the metric derivative at a.e. t ∈ O_r where also r′ and |x′| exist: starting from (7.3) leads to the limit (8.10).
Moreover, the same calculations show that if the lifting ȳ belongs to AC^p([0, 1]; Y), then the restriction of y to each connected component of O_r is absolutely continuous with metric velocity given by (8.10) in L^p(0, 1). Since y is globally continuous and constant on [0, 1] \ O_r, we conclude that y ∈ AC^p([0, 1]; C).
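The complex-plane description of geodesics can be turned into code: interpolate linearly between z1 = r1 and z2 = r2 e^{i d(x1,x2)} and read off the radius and the angle. The sketch below (ours, for X = R and d(x1, x2) < π; names are hypothetical) verifies the constant-speed property (8.3) along the resulting curve.

```python
import cmath, math

def d_cone(y1, y2):
    """Cone distance over (R, |·|)."""
    (x1, r1), (x2, r2) = y1, y2
    d = min(abs(x1 - x2), math.pi)
    s = r1 * r1 + r2 * r2 - 2.0 * r1 * r2 * math.cos(d)
    return math.sqrt(max(s, 0.0))

def cone_geodesic(y1, y2, t):
    """Point at time t of the geodesic in C between y_i = (x_i, r_i), r_i > 0,
    with d = |x1 - x2| < π: linear interpolation z(t) = (1-t)·z1 + t·z2 in the
    complex plane with z1 = r1 and z2 = r2·e^{i d}; the radius is |z(t)| and
    the base point moves the fraction arg(z(t))/d of the way from x1 to x2."""
    (x1, r1), (x2, r2) = y1, y2
    d = abs(x2 - x1)
    z = (1.0 - t) * r1 + t * r2 * cmath.exp(1j * d)
    r, phi = abs(z), cmath.phase(z)
    x = x1 + (phi / d) * (x2 - x1) if d > 0 else x1
    return (x, r)
```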
As a consequence, in a length space we get the variational representation formula (8.13). We then see that the map t ↦ ζ(y(t), t) is absolutely continuous and satisfies

(d/dt) ζ(y(t), t) = ∂_t ζ(y(t), t) + ⟨D_C ζ(y(t), t), y′_C(t)⟩_{R^{d+1}}  L¹-a.e. in (0, 1). (8.14)

Note that the first component of D_C ζ contains the factor r rather than r², since y′_C in (8.12) already has one factor r in its first component.

Lifting of absolutely continuous curves and geodesics
Dynamic plans and time-dependent marginals. Let (Z, d_Z) be a complete and separable metric space. A dynamic plan π in Z is a probability measure in P(C(I; Z)), and we say that π has finite 2-energy if it is concentrated on AC²(I; Z) and
∫ ∫_I |z'|²(t) dt dπ(z) < ∞.
We denote by e_t the evaluation map in C(I; Z) given by e_t(z) := z(t). If π is a dynamic plan, α_t = (e_t)_# π ∈ M(Z) is its marginal at time t ∈ I and the curve t ↦ α_t belongs to C(I; (M(Z), W_{d_Z})). If moreover π is a dynamic plan with finite 2-energy, then α ∈ AC²(I; (M(Z), W_{d_Z})). We say that π is an optimal geodesic plan between α_0, α_1 ∈ P(Z) if (e_i)_# π = α_i for i = 0, 1, if it is a dynamic plan concentrated on Geo(Z), and if
∫ d_Z²(z(0), z(1)) dπ(z) = W²_{d_Z}(α_0, α_1).
When Z = C we will denote by h²_t = h² ∘ (e_t)_# the homogeneous marginal at time t ∈ I. Since h² : P₂(C) → M(X) is 1-Lipschitz (cf. Corollary 7.13), it follows that the curve µ_t := h² α_t = h²_t π belongs to AC²(I; (M(X), HK)) and moreover
|µ'_t|²_HK ≤ ∫ |y'|²_{d_C}(t) dπ(y)  for a.e. t ∈ (0, 1). (8.17)
A simple consequence of this property is that (M(X), HK) inherits the length (or geodesic) property of (X, d).
Proof. Let us first suppose that (X, d) is a length space (the argument in the geodesic case is completely analogous) and let µ_i ∈ M(X). By Corollary 7.7 we find α_i ∈ P₂(C) such that h² α_i = µ_i and HK(µ_1, µ_2) = W_{d_C}(α_1, α_2). Since C is a length space, it is well known that P₂(C) is a length space as well (see [44]), so that for every κ > 1 there exists α ∈ Lip([0, 1]; (P₂(C), W_{d_C})) connecting α_1 to α_2 such that |α'|_{W_{d_C}} ≤ κ W_{d_C}(α_1, α_2). Setting µ_t := h² α_t we obtain a Lipschitz curve connecting µ_1 to µ_2 with length ≤ κ HK(µ_1, µ_2). The converse property is a consequence of the next representation result, Theorem 8.4, and of the fact that if (P₂(C), W_{d_C}) is a length (resp. geodesic) space, then C and thus X are length (resp. geodesic) spaces.
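As an elementary illustration of homogeneous marginals, consider two Dirac measures sitting at the same point: lifting µ_i to Dirac masses on the cone and computing the cone distance of the lifts produces the Hellinger-type bound HK(a δ_x, b δ_x) ≤ |√a − √b|, which is in fact the exact value in this degenerate case. A minimal Python sketch (cone distance formula from Section 7 assumed; names are illustrative):

```python
import math

def d_cone(p, q, d):
    # p = (x, r); d_C(p, q)^2 = r_p^2 + r_q^2 - 2 r_p r_q cos(d(x_p, x_q) ∧ π)
    phi = min(d(p[0], q[0]), math.pi)
    return math.sqrt(p[1] ** 2 + q[1] ** 2 - 2 * p[1] * q[1] * math.cos(phi))

d = lambda u, v: abs(u - v)

a, b, x = 4.0, 9.0, 0.0
# lifts α_0 = δ_{[x,√a]}, α_1 = δ_{[x,√b]} have homogeneous marginals
# h²α_0 = a δ_x and h²α_1 = b δ_x, and W_{d_C}(α_0, α_1) reduces to the
# cone distance between the two lifted points
bound = d_cone((x, math.sqrt(a)), (x, math.sqrt(b)), d)   # = |√a − √b| = 1
```

Since HK is defined as an infimum over lifts, any particular lift only yields an upper bound; for two Diracs at the same point this bound is already optimal.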
We now want to prove the converse representation result: every absolutely continuous curve µ : [0, 1] → (M(X), HK) can be written as µ_t = h²_t π for a suitable dynamic plan π. The argument only depends on the metric properties of the Lipschitz submersion h².
Proof. The statement (i) is an immediate consequence of Theorem 8.4. Statement (ii) is a well-known property of the Kantorovich-Wasserstein space (P₂(C), W_{d_C}) in the case when C is geodesic. Theorem 8.4 also clarifies the relation between HK and GHK introduced in Section 7.8: in particular, if (X, d) is a length metric space, then HK is the length distance generated by GHK.
In order to prove the opposite inclusion and (8.23) it is sufficient to notice that the classes of absolutely continuous curves in C w.r.t. d_C and g_C coincide, with equal metric derivatives |y'|_{d_C} = |y'|_{g_C}. Since GHK = HK_g is the Hellinger-Kantorovich distance induced by g, the assertion follows from (8.20) of Theorem 8.4.

Lower curvature bound in the sense of Alexandrov
Let us first recall two possible definitions of Positively Curved (PC) spaces in the sense of Alexandrov, referring to [9] and to [10] for other equivalent definitions and for the more general case of spaces with curvature ≥ k.
According to Sturm [43], a metric space (Z, d_Z) is a Positively Curved (PC) metric space in the large if for every choice of points z_0, z_1, …, z_N ∈ Z and coefficients λ_1, …, λ_N ∈ (0, +∞) we have
Σ_{i,j=1}^N λ_i λ_j ( d_Z²(z_i, z_0) + d_Z²(z_j, z_0) − d_Z²(z_i, z_j) ) ≥ 0.
If every point of Z has a neighborhood that is PC, then we say that Z is locally positively curved. When the space Z is geodesic, the above (local and global) definitions coincide with the corresponding ones given by Alexandrov, which are based on triangle comparison: for every choice of z_0, z_1, z_2 ∈ Z, every t ∈ [0, 1], and every point z_t such that d_Z(z_t, z_k) = |k − t| d_Z(z_0, z_1) for k = 0, 1, we have
d_Z²(z_t, z_2) ≥ (1 − t) d_Z²(z_0, z_2) + t d_Z²(z_1, z_2) − t(1 − t) d_Z²(z_0, z_1).
When Z is also complete, the local and the global definitions are equivalent. Next we provide conditions on (X, d) or (C, d_C) that guarantee that (M(X), HK) is a PC space. Before we go into the proof of this result, we highlight that for a compact convex subset Ω ⊂ R^d with d ≥ 2, equipped with the Euclidean distance, the space (M(Ω), HK) is not PC; see [27, Sect. 5.6] for an explicit construction showing that the semiconcavity of the squared distance fails.
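Sturm's condition can be explored numerically: the Euclidean plane has curvature 0 ≥ 0 and is therefore PC, so the quadratic form below is nonnegative for every choice of points and positive weights (indeed, in the Euclidean case it equals 2 |Σ_i λ_i (z_i − z_0)|²). An illustrative Python check:

```python
import random

def sturm_form(z0, zs, lam, d2):
    # Σ_{i,j} λ_i λ_j ( d²(z_i, z_0) + d²(z_j, z_0) − d²(z_i, z_j) )
    total = 0.0
    for i, zi in enumerate(zs):
        for j, zj in enumerate(zs):
            total += lam[i] * lam[j] * (d2(zi, z0) + d2(zj, z0) - d2(zi, zj))
    return total

d2 = lambda p, q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2   # squared distance

random.seed(0)
z0 = (0.0, 0.0)
zs = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(5)]
lam = [random.uniform(0.1, 2.0) for _ in range(5)]
val = sturm_form(z0, zs, lam, d2)   # nonnegative, since R² is PC
```

Replacing `d2` by the squared distance of a space with some negative curvature (e.g. a tree metric) can produce negative values, which is exactly how the PC property fails.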
Proof. Let us first prove statement (ii). If (C, d_C) is a PC space, then (P₂(C), W_{d_C}) is a PC space as well [44]. Applying Corollary 7.13, for every choice of µ_i ∈ M(X), i = 0, …, N, we can then find measures β_i ∈ P₂(C) such that h² β_i = µ_i and W_{d_C}(β_0, β_i) = HK(µ_0, µ_i), where it is crucial that β_0 is the same for every i. Since h² is 1-Lipschitz, the PC inequality for (P₂(C), W_{d_C}) then yields the corresponding inequality for (M(X), HK). Let us now consider (iii) "⇒": if (M(X), HK) is PC, we have to prove that (X, d) has locally curvature ≥ 1. By [9, Thm. 4.7.1] it is sufficient to prove that C \ {o} is locally PC to conclude that (X, d) has locally curvature ≥ 1. We thus select points y_i = [x_i, r_i], i = 0, 1, 2, in a sufficiently small neighborhood of y = [x, r] with r > 0, so that d(x_i, x_j) < π/2 for every i, j and r_i, r_j > 0. We also consider a geodesic y_t = [x_t, s_t], t ∈ [0, 1], connecting y_0 to y_1, thus satisfying d_C(y_t, y_i) = |i − t| d_C(y_0, y_1) for i = 0, 1.
As simple applications of the theorem above we obtain that M(R) and M(S^{d−1}) endowed with HK are Positively Curved spaces.

Duality and Hamilton-Jacobi equation
In this section we will show the intimate connections of the duality formula of Theorem 7.21 with Lipschitz subsolutions of the Hamilton-Jacobi equation in X × (0, 1), given by
∂_t ξ_t + ½ |D_X ξ_t|² + 2 ξ_t² = 0, (8.30)
and its counterpart in the cone space,
∂_t ζ_t + ½ |D_C ζ_t|² = 0. (8.31)
Since the Kantorovich-Wasserstein distance on P₂(C) can be defined in duality with subsolutions of (8.31) via the Hopf-Lax formula, and 2-homogeneous marginals are modeled on test functions as in (8.32), we can expect to obtain a dual representation of the Hellinger-Kantorovich distance on M(X) by studying the Hopf-Lax formula for initial data of the form ζ_0(x, r) = ξ_0(x) r².
Slope and asymptotic Lipschitz constant. In order to give a metric interpretation to (8.30) and (8.31), let us first recall that for a locally Lipschitz function f : Z → R defined in a metric space (Z, d_Z) the metric slope |D_Z f| and the asymptotic Lipschitz constant |D_Z f|_a are defined by
|D_Z f|(z) := lim sup_{z'→z} |f(z') − f(z)| / d_Z(z', z),  |D_Z f|_a(z) := lim_{δ↓0} Lip(f, B_δ(z)),
with the convention that |D_Z f|(z) = |D_Z f|_a(z) = 0 whenever z is an isolated point. Equivalently, |D_Z f|_a(z) is the minimal constant L ≥ 0 such that for every ε > 0 the function f is (L + ε)-Lipschitz in a suitable neighborhood of z. Note that |D_Z f|_a is always an upper semicontinuous function. When Z is a length space, |D_Z f|_a is the upper semicontinuous envelope of the metric slope |D_Z f|. We will often write |Df|, |Df|_a whenever the space Z is clear from the context. Since the asymptotic Lipschitz constant only depends on the local behavior of the distance, the truncated distances d_Z ∧ κ with κ > 0, the distances a sin((d_Z ∧ κ)/a) with a > 0 and κ ∈ (0, aπ/2], and the distance g = g(d) given by (7.72) all yield the same asymptotic Lipschitz constant.
In the case of the cone space C it is not difficult to see that the distances d_C and d_{π/2,C} coincide in suitably small neighborhoods of every point y ∈ C \ {o}, so that they induce the same asymptotic Lipschitz constants in C \ {o}. The same property holds for g_C. In the case of the vertex o, relation (7.11) yields the corresponding identification. The next result shows that the asymptotic Lipschitz constant satisfies formula (8.33) for ζ([x, r]) = ξ(x) r².
(iii) In the cases (i) or (ii) we have, for every x ∈ X and r ≥ 0, the relation
|D_C ζ|_a([x, r]) = r ( |D_X ξ|²_a(x) + 4 ξ²(x) )^{1/2}. (8.40)
The analogous formula holds for the metric slope |D_C ζ|([x, r]). Moreover, equation (8.40) remains true if d_C is replaced by the distance d_{π/2,C}.
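Formula (8.40) is precisely what reduces the Hamilton-Jacobi equation on the cone to a generalized equation on X: for ζ_t([x, r]) = ξ_t(x) r², a short formal computation (in the notation above) gives

```latex
\partial_t \zeta_t([x,r]) + \tfrac12\,\big|\mathrm D_{C}\zeta_t\big|_a^2([x,r])
  = r^2\,\partial_t \xi_t(x)
    + \tfrac12\, r^2\Big(\big|\mathrm D_X\xi_t\big|_a^2(x) + 4\,\xi_t^2(x)\Big)
  = r^2\Big(\partial_t \xi_t(x) + \tfrac12\big|\mathrm D_X\xi_t\big|_a^2(x) + 2\,\xi_t^2(x)\Big),
```

so that, away from the apex, ζ is a subsolution of the Hamilton-Jacobi equation on C if and only if ξ is a subsolution of the generalized Hamilton-Jacobi equation on X; this is the mechanism exploited in the next subsections.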
Proof. As usual we set y_i = [x_i, r_i] and y = [x, r]. Let us first check statement (i). If ζ is Lipschitz with constant L in C[R], then |ξ(x)| R² = |ζ([x, R]) − ζ(o)| ≤ L R for every R sufficiently small, so that ξ is uniformly bounded. Moreover, using (7.3), for every R > 0 we can estimate the increments |ξ(x_1) − ξ(x_2)| in terms of the Lipschitz constant of ζ in C[R], so that ξ is uniformly Lipschitz and (8.38) holds.
Passing to the limit y_1, y_2 → y and using the fact that x_1, x_2 → x (due to r > 0), we obtain one inequality in (8.40). In order to prove the converse inequality, we observe that for every L < L_X there exist two sequences of points (x_{i,n})_{n∈N}, i = 1, 2, converging to x w.r.t. d such that ξ(x_{1,n}) − ξ(x_{2,n}) ≥ L δ_n, where 0 < δ_n := d(x_{1,n}, x_{2,n}) → 0. Choosing r_{1,n} := r and r_{2,n} := r(1 + λ δ_n) for an arbitrary constant λ ∈ R with the same sign as ξ(x), we can apply (8.41) and optimize with respect to λ. This proves (8.40) for the asymptotic Lipschitz constant |D_C ζ|_a. The arguments for proving (8.40) for the metric slope |D_C ζ| are completely analogous.
Hopf-Lax formula and subsolutions to the metric Hamilton-Jacobi equation in the cone C. Whenever f ∈ Lip_b(C), the Hopf-Lax formula
Q_t f(y) := inf_{y'∈C} ( f(y') + d_C²(y, y') / (2t) )  for y ∈ C and t > 0 (8.42)
provides a function t ↦ Q_t f which is Lipschitz from [0, ∞) to C_b(C), satisfies the a-priori bounds inf_C f ≤ Q_t f ≤ sup_C f, and solves
∂⁺_t Q_t f(z) + ½ |D_C Q_t f|²_a(z) ≤ 0  for every z ∈ C, t > 0, (8.44)
where ∂⁺_t denotes the partial right derivative w.r.t. t. It is also possible to prove that for every y ∈ C the time derivative of Q_t f(y) exists with at most countable exceptions, and that (8.44) is in fact an equality if (C, d_C) is a length space, a property that always holds if (X, d) is a length metric space. This is stated in our main result, which involves, for ξ ∈ Lip_b(X) with P := inf_{t∈[0,1], x∈X} (1 + 2tξ(x)) > 0, the generalized Hopf-Lax formula
P_t ξ(x) := inf_{x'∈X} (1/(2t)) ( 1 − cos²(d_{π/2}(x, x')) / (1 + 2tξ(x')) ). (8.45)
Moreover, for every R > 0 the Hopf-Lax evolution in C[R] of ζ([x, r]) = ξ(x) r² satisfies
Q_t ζ([x, r]) = P_t ξ(x) r²  for all x ∈ X, r ≤ P R. (8.46)
The map t ↦ ξ_t := P_t ξ is Lipschitz from [0, 1] to C_b(X) with ξ_t ∈ Lip_b(X) for every t ∈ [0, 1]. Moreover, ξ_t is a subsolution to the generalized Hamilton-Jacobi equation
∂_t ξ_t + ½ |D_X ξ_t|²_a + 2 ξ_t² ≤ 0  in X × (0, 1). (8.47a)
For every x ∈ X the map t ↦ ξ_t(x) is time differentiable with at most countable exceptions.
If (X, d) is a length space, (8.47a) holds with equality and |D_X ξ_t|_a(x) = |D_X ξ_t|(x) for every x ∈ X and t ∈ [0, 1]. Notice that when ξ(x) ≡ ξ is constant, (8.45) reduces to P_t ξ = ξ/(1 + 2tξ), which is the solution of the elementary differential equation (d/dt) ξ + 2ξ² = 0.
Proof. Let us observe that inf_{t∈[0,1], z∈X} (1 + 2tξ(z)) = P > 0. A simple calculation shows that for fixed x' ∈ X the infimum of ξ(x') r'² + d_C²([x, r], [x', r'])/(2t) with respect to r' ≥ 0 is attained at r' = r cos(d_{π/2}(x, x'))/(1 + 2tξ(x')). Hence, with this choice we find (notice the truncation at π/2 instead of π)
Q_t ζ([x, r]) = inf_{x'∈X} (r²/(2t)) · (2tξ(x') + sin²(d_{π/2}(x, x'))) / (1 + 2tξ(x')), (8.48)
which yields (8.45) and (8.46). Equation (8.46) also shows that the function ζ_t = ξ_t(x) r² coincides on C[P R] with the solution ζ^R_t given by the Hopf-Lax formula in the metric space C[R]. Since the initial datum ζ is bounded and Lipschitz on C[R], we deduce that ζ^R_t is bounded and Lipschitz, so that t ↦ ξ_t is bounded and Lipschitz in X by Lemma 8.10.
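The representation (8.48) makes P_t easy to evaluate on a finite set of points; the following illustrative Python sketch (discrete metric space, hypothetical helper names) checks the constant-datum identity P_t ξ = ξ/(1 + 2tξ) stated above:

```python
import math

def P(t, xi, points, d):
    # P_t ξ(x) = inf_{x'} (2 t ξ(x') + sin²(d_{π/2}(x, x'))) / (2 t (1 + 2 t ξ(x')))
    # (the form obtained from (8.48) after dividing by r²)
    out = {}
    for x in points:
        vals = []
        for xp in points:
            dp = min(d(x, xp), math.pi / 2)        # truncated distance d_{π/2}
            vals.append((2 * t * xi[xp] + math.sin(dp) ** 2)
                        / (2 * t * (1 + 2 * t * xi[xp])))
        out[x] = min(vals)
    return out

points = [0.0, 0.5, 1.0, 2.5]
d = lambda u, v: abs(u - v)
c, t = 0.8, 0.4
xi0 = {x: c for x in points}            # spatially constant initial datum
xit = P(t, xi0, points, d)
expected = c / (1 + 2 * t * c)          # P_t ξ = ξ/(1 + 2tξ) for constant ξ
err = max(abs(xit[x] - expected) for x in points)
```

For a constant datum the infimum is attained at x' = x, since the sin² term only increases the value; on a genuine metric space the scan over a finite point set is of course only a discretization of the infimum in (8.45).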
Equation (8.47a) and the other regularity properties then follow from (8.40) and the general properties of the Hopf-Lax formula in C[R].
Duality between the Hellinger-Kantorovich distance and subsolutions to the generalized Hamilton-Jacobi equation. We conclude this section with the main application of the above results to the Hellinger-Kantorovich distance.
Theorem 8.12. Let us suppose that (X, d) is a complete and separable metric space. Then for every µ_0, µ_1 ∈ M(X) the Hellinger-Kantorovich distance admits the dual representation
HK²(µ_0, µ_1) = sup { 2 ∫_X ξ_1 dµ_1 − 2 ∫_X ξ_0 dµ_0 : ξ ∈ C¹([0, 1]; Lip_b(X)), ∂_t ξ_t + ½ |D_X ξ_t|²_a + 2 ξ_t² ≤ 0 in X × (0, 1) }. (8.50)
Moreover, in the above formula we can also take the supremum over functions ξ ∈ C^k([0, 1]; Lip_b(X)) with bounded support.
Let us now prove (ii). As a first step, denoting by S the right-hand side of (8.50), we prove that HK²(µ_0, µ_1) ≥ S. If ξ ∈ C¹([0, 1]; Lip_b(X)) satisfies the pointwise inequality
∂_t ξ_t(x) + ½ |D_X ξ_t|²(x) + 2 ξ_t²(x) ≤ 0  for every (x, t) ∈ X × (0, 1), (8.51)
then it also satisfies (8.47a), because (8.51) provides the relation
½ |D_X ξ_t|²(x) ≤ −( ∂_t ξ_t(x) + 2 ξ_t²(x) )  for every (x, t) ∈ X × (0, 1), (8.52)
where the right-hand side is bounded and continuous in X. Equation (8.52) thus yields the same inequality for the upper semicontinuous envelope of |D_X ξ_t|, and this function coincides with |D_X ξ_t|_a since X is a length space. We can therefore apply the previous point (i), choosing λ > 1 and a Lipschitz curve µ : [0, 1] → M(X) joining µ_0 to µ_1 with metric velocity |µ'_t|_HK ≤ λ HK(µ_0, µ_1), whose existence is guaranteed by the length property of X and a standard rescaling technique. Relation (8.49) then yields
2 ∫_X ξ_1 dµ_1 − 2 ∫_X ξ_0 dµ_0 ≤ ∫_0^1 |µ'_t|²_HK dt ≤ λ² HK²(µ_0, µ_1).
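The normalization in (8.50) can be sanity-checked in the simplest case of two Dirac masses at the same point, where HK reduces to the Hellinger distance, HK²(a δ_x, b δ_x) = (√a − √b)². Spatially constant subsolutions ξ_t = s/(1 + 2ts) already saturate the supremum; an illustrative Python check (grid search over the admissible constants s, with 1 + 2s > 0):

```python
import math

def dual_functional(a, b, s):
    # spatially constant solution of the generalized Hamilton-Jacobi equation:
    # ξ_t = s / (1 + 2 t s) satisfies dξ/dt + 2ξ² = 0, hence is admissible in (8.50)
    xi0, xi1 = s, s / (1 + 2 * s)
    return 2 * (xi1 * b - xi0 * a)

a, b = 4.0, 1.0
best = max(dual_functional(a, b, -0.499 + i / 1000.0) for i in range(3000))
hellinger_sq = (math.sqrt(a) - math.sqrt(b)) ** 2   # = 1 = HK²(a δ_x, b δ_x)
```

A short calculus exercise confirms the maximizer: the optimal constant is s* = (√(b/a) − 1)/2, where the functional equals (√a − √b)² exactly.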
A further limit as r ↓ 0 and an application of the Lebesgue dominated convergence theorem yields the first inequality of (8.55). The argument for the second inequality is completely analogous.
When X = R^d the characterization (8.50) of HK holds for an even smoother class of subsolutions ξ of the generalized Hamilton-Jacobi equation.
Corollary 8.14. Let X = R^d be endowed with the Euclidean distance. Then
HK²(µ_0, µ_1) = sup { 2 ∫_{R^d} ξ_1 dµ_1 − 2 ∫_{R^d} ξ_0 dµ_0 : ξ ∈ C^∞_c(R^d × [0, 1]), ∂_t ξ_t + ½ |D_x ξ_t|² + 2 ξ_t² ≤ 0 in R^d × (0, 1) }. (8.56)
Proof. We just have to check that the supremum in (8.50) does not change if we replace C^∞([0, 1]; Lip_bs(R^d)) with C^∞_c(R^d × [0, 1]). This can be achieved by approximating any subsolution ξ ∈ C^∞([0, 1]; Lip_bs(R^d)) via convolution in space with a smooth kernel with compact support, which still provides a subsolution thanks to Lemma 8.13.
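The reason the spatial convolution preserves subsolutions (the content of Lemma 8.13, whose statement is not reproduced here) is the convexity of the Hamiltonian: if κ_ε ≥ 0 is a smooth kernel with unit mass, Jensen's inequality gives

```latex
\big|\mathrm D_x(\kappa_\varepsilon * \xi_t)\big|^2 \;\le\; \kappa_\varepsilon * \big|\mathrm D_x\xi_t\big|^2,
\qquad
(\kappa_\varepsilon * \xi_t)^2 \;\le\; \kappa_\varepsilon * \xi_t^2,
```

so that convolving a subsolution of ∂_t ξ_t + ½ |D_x ξ_t|² + 2 ξ_t² ≤ 0 with κ_ε again yields a subsolution of the same inequality.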
The next result provides the opposite inequality, which will be deduced from the duality between the solutions of the generalized Hamilton-Jacobi equation and HK developed in Theorem 8.12.
Combining Theorems 8.16 and 8.17 with Theorem 8.4 and the geodesic property of (M(R^d), HK), we immediately obtain the desired dynamic representation. The Borel vector field (v, w) realizing the minimum in (8.67) is uniquely determined µ_I-a.e. in R^d × (0, 1).
The discussion in [27] reveals, however, that there may be many geodesic curves, so in general µ_I is not unique. Indeed, the set of all geodesics connecting µ_0 = a_0 δ_{x_0} and µ_1 = a_1 δ_{x_1} with a_0, a_1 > 0 and |x_1 − x_0| = π/2 is infinite dimensional, see [27, Sect. 5.2].

Geodesics in M(R^d)
As in the case of the Kantorovich-Wasserstein distance, one may expect that geodesics (µ_t)_{t∈[0,1]} in (M(R^d), HK) can be characterized by the system (cf. [27, Sect. 5])
∂_t µ_t + ∇ · (µ_t D_x ξ_t) = 4 ξ_t µ_t,  ∂_t ξ_t + ½ |D_x ξ_t|² + 2 ξ_t² = 0. (8.68)
In order to give a precise meaning to (8.68) we first have to select an appropriate regularity for ξ_t. On the one hand, we cannot expect C¹ smoothness for solutions of the Hamilton-Jacobi equation in (8.68) (in contrast with subsolutions, which can be regularized as in Corollary 8.14); on the other hand, the L^d-a.e. differentiability of Lipschitz functions guaranteed by Rademacher's theorem is not sufficient if we want to consider arbitrary measures µ_t that could be singular with respect to L^d. A convenient choice for our aims is provided by locally Lipschitz functions which are strictly differentiable at µ_I-a.e. point, where µ_I has been defined by (8.57); recall that ξ is strictly differentiable at a point x̄ if there exists a linear map D ξ(x̄) such that ξ(y) − ξ(z) − ⟨D ξ(x̄), y − z⟩ = o(|y − z|) as y, z → x̄. In the proofs we will also need to deal with pointwise representatives of the time derivative of a locally Lipschitz function ξ : R^d × (0, 1) → R: if D(∂_t ξ) denotes the set (of full L^{d+1}-measure) where ξ is differentiable w.r.t. time, and ∂̃_t ξ is the extension of ∂_t ξ by 0 outside D(∂_t ξ), we consider the representatives obtained by spatial convolution of ∂̃_t ξ with a smooth kernel κ_ε. It is not difficult to check that such functions are Borel; even if they depend on the specific choice of κ_ε, they will still be sufficient for our aims (a more robust definition would require the use of approximate limits).
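In the spatially homogeneous regime (ξ_t constant in x, so no transport term) the system (8.68) reduces to the pair of ODEs da/dt = 4ξa, dξ/dt = −2ξ², whose solution interpolates the masses along the Hellinger geodesic √a_t = (1 − t)√a_0 + t√a_1. An illustrative Python integration (explicit Euler, hypothetical step count):

```python
import math

def homogeneous_geodesic_mass(a0, a1, steps=100000):
    # spatially homogeneous reduction of (8.68):
    #   da/dt = 4 ξ a     (continuity equation with source 4ξμ, no transport term)
    #   dξ/dt = -2 ξ²     (Hamilton-Jacobi equation with D_x ξ = 0)
    xi = (math.sqrt(a1 / a0) - 1) / 2     # initial datum chosen so that a(1) = a1
    a, h = a0, 1.0 / steps
    for _ in range(steps):
        a, xi = a + h * 4 * xi * a, xi - h * 2 * xi * xi
    return a

a_end = homogeneous_geodesic_mass(1.0, 4.0)   # should approach a1 = 4
```

The closed-form solution of this reduced system is ξ_t = ξ_0/(1 + 2tξ_0) and √a_t = √a_0 (1 + 2tξ_0), i.e. √a_t is affine in t, which is exactly the Hellinger geodesic between the two masses.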
Notice that (8.72) seems to be the weakest natural formulation of the Hamilton-Jacobi equation, in view of Rademacher's theorem. The assumption of strict differentiability of ξ at µ_I-a.e. point provides an admissible vector field D_x ξ for (8.73).
Proof. The proof splits into a sufficiency and a necessity part, the latter consisting of several steps. Sufficiency. Let us suppose that µ, ξ satisfy conditions (i)–(iv).
The following lemma provides the "integration by parts" formulas that were used in the sufficiency and necessity parts of the previous proof of Theorem 8.19. It is established by a suitable temporal and spatial smoothing, involving a smooth kernel κ_ε as in (8.54).
We conclude by taking the supremum with respect to all the subsolutions of (8.95) in C¹([0, 1]; Lip_b(X)) and applying (8.50).

The minimum was consistent with the distance between two Dirac masses, which could easily be calculated via the dynamic formulation. So, in parallel, we tried to develop the dynamic approach, which was not too successful at the early stages. Only after realizing and exploiting the connection to the cone distance in the summer and autumn of 2013 were we able to connect LET systematically with the dynamic approach. The crucial and surprising observation was that optimal plans for E and lifts of measures µ ∈ M(X) to measures λ on the cone C could be identified by exploiting the optimality conditions systematically. Corresponding results were presented in workshops on Optimal Transport in Banff (June 2014) and Pisa (November 2014).
Already at the Banff workshop, the general structure of the primal and dual Entropy-Transport problem as well as the homogeneous perspective formulation were presented. Several examples and refinements were developed afterwards. The most recent part, from the summer of 2015, concerns our Hamilton-Jacobi equation in general metric spaces (X, d) and the induced cone C (cf. Section 8.4), and the derivation of the geodesic equations in R^d (cf. Section 8.6). This last achievement now closes the circle, by showing that all the initial steps, which were done on a formal level in 2012 and the first half of 2013, indeed have a rigorous interpretation.