Kantorovich–Rubinstein Distance and Barycenter for Finitely Supported Measures: Foundations and Algorithms

The purpose of this paper is to provide a systematic discussion of a generalized barycenter based on a variant of unbalanced optimal transport (UOT) that defines a distance between general non-negative, finitely supported measures by allowing for mass creation and destruction modeled by some cost parameter. They are denoted as Kantorovich–Rubinstein (KR) barycenter and distance. In particular, we detail the influence of the cost parameter to structural properties of the KR barycenter and the KR distance. For the latter we highlight a closed form solution on ultra-metric trees. The support of such KR barycenters of finitely supported measures turns out to be finite in general and its structure to be explicitly specified by the support of the input measures. Additionally, we prove the existence of sparse KR barycenters and discuss potential computational approaches. The performance of the KR barycenter is compared to the OT barycenter on a multitude of synthetic datasets. We also consider barycenters based on the recently introduced Gaussian Hellinger–Kantorovich and Wasserstein–Fisher–Rao distances.


Introduction
Over the past decade, optimal transport (OT) based concepts for data analysis [for a thorough treatment of the mathematical foundations of optimal transport see e.g.Rachev and Rüschendorf, 1998, Villani, 2008, Santambrogio, 2015] have seen increasing popularity.This is mainly due to the fact that OT based methods respect important features of the data's geometric structure.Furthermore, noteworthy advances have been achieved in various areas, such as optimisation [Bertsimas and Tsitsiklis, 1997, Wolsey and Nemhauser, 1999, Grötschel et al., 2012], machine learning [Frogner et al., 2015, Peyré et al., 2019, Xie et al., 2020], computer vision [Gangbo and McCann, 2000, Su et al., 2015, Solomon et al., 2015] and statistical inference [Sommerfeld and Munk, 2018, Panaretos and Zemel, 2020, Hallin et al., 2021], among others.This methodological and computational progress recently also paved the way to novel areas of applications including genetics [Evans andMatsen, 2012, Schiebinger et al., 2019] and cell biology [Gellert et al., 2019, Klatt et al., 2020, Tameling et al., 2021, Wang and Yuan, 2021], to cite but a few.Of particular importance from a data analysis point of view are extensions to compare more than two measures, a prominent proposal being the Fréchet mean [Fréchet, 1948], in the present context known as Wasserstein barycenter [Agueh and Carlier, 2011].Wasserstein barycenters allow for a notion of average on the space of probability measures, which is well-adapted Figure 1: (Unbalanced) OT between two measures (support in blue and brown, respectively) with weights equal to one at each support point.Top-Left: OT plan (red) between normalised versions of the two measures.Rest: UOT plans (red/purple) between nonnormalised measures.From top-left to bottom-right C is decreasing.The edges, which have been removed most recently due to the reduction of C, are shown in green.Edges which have been added to the UOT graph due to the most recent reduction of C are marked in purple.
to the geometry of the data [ Álvarez-Esteban et al., 2016, Anderes et al., 2016].With recent progress on their computation [Cuturi and Doucet, 2014, Carlier et al., 2015, Bonneel et al., 2015, Kroshnin et al., 2019, Ge et al., 2019, Heinemann et al., 2022] they establish themselves even further as a promising tool in many fields of data analysis, such as texture mixing [Rabin et al., 2011], distributional clustering [Ye et al., 2017], histogram regression [Bonneel et al., 2016], domain adaptation [Montesuma and Mboula, 2021] and unsupervised learning [Schmitz et al., 2018], among others.However, a well known drawback of the Wasserstein distance and its barycenters in various applications is their limitation to measures with equal total mass.In fact, in many real world instances the difference in total mass intensity is of crucial importance.Employing vanilla Wasserstein based tools on general positive measures necessitates the usage of a normalisation procedure to enforce mass equality between the measures.This approach is, by design, oblivious to the mass differences between the original measures and can limit its use in applications.Exemplary, we mention that normalisation destroys stoichiometric features in the analysis of protein interaction and pathways as pointed out in Tameling et al. [2021].Overall, this might lead to incorrect conclusions on specific applications.An illustrative example is given in Figure 1.

Prior Work
The limitation of OT based concepts dealing only with measures of equal total mass has opened a wealth of approaches to account for more general measures.As an early proposal of this idea, the partial OT formulation [Caffarelli andMcCann, 2010, Figalli, 2010] suggests to fix the total mass of the OT plan in advance, while relaxing the marginal constraints.Comparably more recent are entropy transport formulations 1 .This general framework removes the marginal constraints and instead uses a divergence functional to measure the deviation between the transport marginals and the input measures.The entropy transport framework encompasses the Hellinger-Kantorovich distance [Liero et al., 2018, Chizat et al., 2018b], also known as Wasserstein-Fisher-Rao distance [Chizat et al., 2018a] and the Gaussian Hellinger-Kantorovich distance [Liero et al., 2018].Inherent to all of these models is their dependency on parameters whose exact influence on the models' properties is generally not well understood.An alternative idea is based on extending the well-studied dynamic formulation of OT [Benamou and Brenier, 2000] to measures with different total masses.With a focus on its geodesic properties, this approach has been studied in several works [Chizat et al., 2018a,c, Gangbo et al., 2019].In this paper, we rely on a simple and intuitive idea based on the seminal work of Kantorovich and Rubinstein [1958].This accounts for mass construction and deletion at a cost modeled by some prespecified parameter [for details see also Hanin, 1992, Guittet, 2002].It leads to the Kantorovich-Rubinstein distance (KRD) which curiously has been revisited several times under different names by various authors.For p = 1, it has been referred to as Earth Mover's Distance [Pele and Werman, 2008], and generalized Wasserstein distance [Piccoli and Rossi, 2014], while for general p ≥ 1 common terminology includes Kantorovich distance [Gramfort et al., 2015], generalized KRD [Sato et al., 2020], transport-transform metric [Müller et al., 2020] and robust optimal transport distance [Mukherjee et al., 2021].

Contributions
In this work, we define barycenters with respect to the KRD and investigate their fundamental properties from a data analysis point of view.This extends the popular notion of Wasserstein barycenters to unbalanced barycenters (UBCs), i.e., barycenters of measures of different total masses.Similary, UBCs have been considered explicitly for the Hellinger-Kantorovich distance [Chung andPhung, 2020, Friesecke et al., 2021] and for the partial OT distance for absolutely continuous measures [Kitagawa and Pass, 2015].Notably, the well-known approach of matrix scaling algorithms has been shown to provide a general framework to approximate any UBC based on entropy optimal transport [Chizat et al., 2018b] of finitely supported measures.Closely related to our approach is the work by Müller et al. [2020] approximating the KR barycenter in the special case of point patterns.The KR distance: Let (X , d) be a finite metric space, where X = {x 1 , . . ., x N } and is the set of non-negative measures2 on X .For a measure µ ∈ M + (X ) its total mass is defined as M(µ) := x∈X µ(x) and the subset of non-negative measures with total mass equal to one is the set of probability measures P(X ).If π ∈ M + (X × X ) is a measure on the product space X × X its marginals are defined as π(x, X ) := x ∈X π(x, x ) and π(X , x ) := x∈X π(x, x ), respectively.For two measures µ, ν ∈ M + (X ) we define the set of non-negative sub-couplings as Π ≤ (µ, ν) := {π ∈ M + (X × X ) | π(x, X ) ≤ µ(x), π(X , x ) ≤ ν(x ) ∀ x, x ∈ X }. (1) computations [Cuturi, 2013, Benamou et al., 2015, Carlier et al., 2017] Figure 2: Upper two rows: An excerpt of eight instances of a dataset of N = 100 nested ellipses at up to 5 different clusters in [0, 1] 2 .The number of ellipses in each cluster follows a Poisson distribution.For the cluster in the center the intensity is 2 and for the four outer clusters the intensity is 1.Each ellipse is discretized into 50 points with mass 1 at each location.For details on the computational methods refer to Section 4. Bottom-Left: The Wasserstein barycenter of the normalized versions of these measures (runtime about 15 hours).Bottom-Right: The (2, 0.2)-barycenter of these measures (runtime about 30 minutes).The (2, C)-barycenter for different values of C can be seen in Figure 8.
Similarly, we denote the set of couplings between µ and ν as Π = (µ, ν), where the inequality constraints in (1) are replaced by equalities.For p ≥ 1 and a parameter C > 0, unbalanced optimal transport (UOT) between two measures µ, ν ∈ M + (X ) is defined as (2) Notably, UOT p,C (µ, ν) is finite for all measures µ, ν ∈ M + (X ) with possibly different total masses and a solution of (2) always exists.Here, the parameter C penalizes deviation of mass from the marginals of π with respect to the input measures µ, ν ∈ M + (X ).In particular and unlike the (balanced) OT problem defined only for measures µ, ν ∈ M + (X ) with equal total mass M(µ) = M(ν), UOT in (2) relaxes the marginal constraint and allows optimal solutions to have more flexible marginals.Based upon UOT we define the p-th order Kantorovich-Rubinstein distance between two measures µ, ν ∈ M + (X ) as For any p ≥ 1, it defines a distance on the space of non-negative measures M + (X ) and it is an extension of the well-known p-Wasserstein distance W p (µ, ν) := (OT p (µ, ν)) 1 /p defined only for measures of equal total mass.Indeed, the KRD is shown to interpolate in-between OT on small scales and point-wise comparisons on large scales (Theorem 2.2) relative to the parameter C.This allows for an intuitive interpretation of the KRD.More precisely, in Lemma 2.1, we detail a clear geometrical connection between the value of C and the structure of the UOT.In particular, this contrasts the closely related partial OT problem [Figalli, 2010] mentioned above.Employing Lagrange multipliers one can see that for any choice of C, there exists a fixed mass m of the partial OT problem, such that these two problems are equivalent.However, finding this value of m requires to solve the UOT problem.We stress that the influence of m on the resulting transport is in general hard to determine, while the impact of C is intuitively clear.Thus, this perspective seems better suited to many applications.For the specific case of measures supported on ultrametric trees (Section 2.1.1)we prove (Theorem 2.3) an analogue of the well-known closed formula for the p-Wasserstein distance [Kloeckner, 2015].Additionally, the computation of the KRD is known to be equivalent to solving a related balanced OT problem [Guittet, 2002], allowing to apply any state-of-the-art solver with minimal modifications to compute the KRD and plan.
The KR barycenter: The KRD also lends itself to define a notion of a barycenter for a collection of measures as a generalization of the p-Wasserstein barycenter defined for probability measures µ 1 , . . ., µ J ∈ P(X ) as μ ∈ arg min Here, (X , d) is assumed to be embedded in some ambient space (Y, d), e.g., an Euclidean space with X ⊂ Y.The distance d on X is understood to be the distance on Y restricted to X .For µ 1 , . . ., µ J ∈ M + (X ), any measure is said to be a (p, C)-Kantorovich-Rubinstein barycenter or (p, C)-barycenter for short3 .We refer to the objective functional F p,C as (unbalanced) (p, C)-Fréchet functional.Notably, (p, C)-barycenters' support is not restricted to the finite space X which raises fundamental questions on its structural properties.In the following, we establish that there exists a finite set containing the support of any (p, C)-barycenter (Section 2.2).Indeed, this set can be explicitly constructed from the support of the individual µ i 's, but its size grows exponentially in the number of individual measures.However, we prove that there always exists a sparse (p, C)-barycenter whose support size is at most linear in the number of measures (Theorem 2.5).We note that these properties are analogs of well-known properties of Wasserstein barycenters [Anderes et al., 2016], that we re-establish for the unbalanced setting.
Comparably, employing more general entropy transport distances, we are not aware of any similar structural description of their barycenters in terms of the input measures and the parameter.Notably, the entropy optimal transport barycenter of dirac measures is not necessarily finitely supported itself [for an example see Friesecke et al., 2021].In contrast, our explicit structural description of the support of KR barycenters provides an immediate understanding of its properties for a given choice of C.This clear link between C and the (p, C)-barycenter also allows to incorporate previous knowledge of the measures or the ground space into the choice C. The (p, C)-barycenter can be tuned to be more flexible and provide superior performance compared to its p-Wasserstein counterpart by avoiding to normalise each measure.An illustrative example is included in Figure 2, where the (p, C)-barycenter detects all clusters correctly, while the Wasserstein barycenter does not provide any structural information on the underlying measures.This showcases potentially superior robustness and flexibility of the (p, C)-barycenter compared to the Wasserstein barycenter.We study this comparison in more detail on multiple synthetic data sets in Section 4. Here, the computational results4 are based on the fact that, due to our structural analysis of the support of the (p, C)-barycenter, it is straightforward to modify any given state-of-the-art solver for the Wasserstein barycenter problem to solve the (p, C)-barycenter problem (Section 4.1).

Kantorovich-Rubinstein Distance and (p, C)-Barycenter
In this section, we provide some theoretical analysis of the structural properties inherent in the UOT in (2) and as a consequence to the KRD in (3).We also focus on the variational formulation defining the (p, C)-barycenter in (5).

KR Distance
In this subsection, we focus on structural properties of minimizers for UOT in (2) and their consequences for the KRD.Notably, one can equivalently restate the penalization of total mass in (2) as While in (2) the parameter C > 0 controls the deviation of the total mass of π, the alternative representation (6) demonstrates its marginal characterization.Indeed, the parameter C specifies the maximal distance (scale) for which transportation is cheaper than creation or destruction of mass.More precisely, each optimal solution π C for (2) induces a directed transportation graph G(π C ) between the support points of µ (source points) and the support points of ν (sink points).By definition, the graph G(π C ) contains a directed edge (x, x ) if and only if π C (x, x ) > 0. For a directed path P = (x i 1 , . . ., x i k ) in G(π C ) its path length is defined as L(P ) = k−1 j=1 d p (x i j , x i j−1 ).The parameter C > 0 determines the maximal path length for any path in G(π C ) as the following statement demonstrates.
Lemma 2.1.For p ≥ 1, parameter C > 0 and measures µ, ν ∈ M + (X ) consider the UOT (2) with an optimal solution π C .The length of any directed path P from the corresponding transport graph G(π C ) is bounded by In particular, if d(x, x ) > C then for any optimal solution of 2 it holds π C (x, x ) = 0.
A proof is included in Appendix A.2. Lemma 2.1 shows that the underlying transportation graph has maximal path length C p which limits the interaction between source and sink points.It will be of crucial importance for closed formulas on ultra-metric trees in the following subsection.As an immediate consequence we obtain some important statements on the KRD in (3) along with its metric property.
Theorem 2.2.For any p ≥ 1 and parameter C > 0 the following statements hold: (i) The p-th order KRD in (3) defines a metric on the space of non-negative measures M + (X ).
where T V (µ, ν) := 1 /2 x∈X |µ(x) − ν(x)| is the total variation distance.The same equality holds for all We stress that the metric property of the KRD in Theorem 2.2 (i) has already been established in specific instances, e.g., for p = 1 [Piccoli and Rossi, 2014].Our proof follows that of Theorem 2 in Müller et al. [2020] for uniform measures on point patterns with minor modifications.Theorem 2.2 demonstrates how two measures µ, ν ∈ M + (X ) are compared with respect to KRD.Depending on the parameter C > 0 the optimal value interpolates between p-th order Wasserstein distance on small scales and total variation on larger scales with respect to C. Equivalently, these properties can be shown by considerations of the dual program for UOT in (2) given by where the equality holds due to strong duality.For p = 1 this can be further specified to which reveals its relation to the flat metric [Bogachev, 2007] as observed in Lellmann et al. [2014], Schmitzer and Wirth [2019].As in general M(µ) = M(ν), the bound f, g ≤ C p /2 on dual feasible solutions f, g is necessary for the dual to be finite.However, if the measures µ, ν ∈ M + (X ) have equal total mass M(µ) = M(ν) and C ≥ max x,x d(x, x ), then the bound on dual feasible solutions is redundant and we obtain the dual of the usual OT problem

KR Distance on Ultrametric Trees
For OT, the approximations of the underlying distance by a tree metric are common tools for theoretical and practical purposes.The former is usually employed for rates of convergence for the expectation of empirical OT costs [Sommerfeld et al., 2019] while in the latter tree approximations serve to reduce the computational complexity inherent in OT [Le et al., 2019].OT on ultramatric trees is also applied for the analysis of phylogenetic trees [Gavryushkin and Drummond, 2016].For an efficient computational implementation of UOT on tree metrics we refer to Sato et al. [2020].Notably, while OT with tree metric costs has a closed form solution, this fails to hold for its UOT counterpart.An exception is given in terms of ultrametric trees for which not only OT [Kloeckner, 2015] but also UOT admits a closed form solution, which we establish in this subsection.
To this end, consider a tree T with nodes V , edges E attached with (non-negative) weights w(e) for e ∈ E and a designated root r.Two nodes v, w ∈ V are connected by a unique path denoted P(v, w) either represented by a sequence of nodes or as a sequence of edges.The distance d T (v, w) is equal to the sum of the weights of those edges contained in P(v, w).
A leaf of T is any node such that its degree (number of edges attached to the node) is equal to one and the set of all leaf nodes is denoted as L ⊂ V .A node v is termed parent of node v denoted by par(v) = v if both are connected by a single edge but v is closer to the root than v.The parent of the root node is set to par(r) = r.For a node v its children are the elements of the set C(v) = {w ∈ V | v ∈ P(w, r)}.Notice that with this definition v is a child of itself (Figure 3 (a) for an illustration).A tree T is termed ultrametric tree if all its leaf nodes are at the same distance to the root.Equivalently, there exists a height function h : V → R + that is monotonically decreasing meaning that h(par(v)) ≥ h(v) and such that h(v) = 0 for v ∈ L. The distance is set to d T (v, par(v)) = |h(v) − h(par(v))| and extended on the full tree (Figure 3 (b) for an illustration).
Consider an ultrametric tree T with height function h and measures µ L , ν L supported on the leaf nodes L ⊂ V .We prove that the p-th order KRD admits a closed formula for such a setting.Intuitively, the parameter C restricts transportation of mass up to a certain threshold allowing to decompose T into subtrees.Mass transportation is restricted solely within each subtree whereas mass abundance or deficiency is penalized with parameter C for each particular subtree (Figure 4 for an illustration).We define the set   Theorem 2.3 (KR on ultrametric trees).Consider an ultrametric tree T with leaf nodes L and height function h : V → R + inducing the tree metric d T .For any p ≥ 1 and two measures µ L , ν L ∈ M + (L) supported on the leaf nodes of T it holds that The closed formula in Theorem 2.3 decomposes the underlying UOT into two tasks.While summing over subtrees carried out by the outer sum, the inner sum consists of two terms.
The first considers OT within each subtree whereas the second accounts for mass deviation on that particular subtree.The proof of this formula is given in Appendix A.2.1.

(p, C)-Barycenters
In the finite setting considered in this work a (p, C)-barycenter as defined in (5) always exists, but is not necessarily unique.Moreover, the location and structure of the support of the (p, C)-barycenter are not fixed and hence unknown.For the Wasserstein barycenter there exists a finitely supported, sparse barycenter in this context [Anderes et al., 2016, Le Gouic andLoubes, 2017].We establish analog properties of the (p, C)-barycenter.
Definition 2.4.Let (Y, d) be a metric space, p ≥ 1 and J ∈ N. A Borel barycenter application T J,p associates to any points (y 1 , . . ., y J ) ∈ Y J a minimum y ∈ Y of J i=1 d p (y i , y), i.e., T J,p (y 1 , . . ., y J ) ∈ arg min A Borel barycenter application is in general not a function since the minimum does not need to be unique.In particular, y = T J,p (y 1 , . . ., y J ) only means that y is one of the minima of the average distance function.As the measures µ 1 , . . ., µ J are defined on X we usually restrict the Borel barycenter application to inputs from the space X ⊂ Y.We define the full centroid set of the measures µ 1 , . . ., µ J ∈ M + (X ) as and the restricted centroid set We stress that for each L-tupel (x 1 , . . ., x L ) one fixed representative of T L,p (x 1 , . . ., x L ) is chosen for the construction of the centroid set C KR (J, p, C).To streamline the presentation any statement concerning C KR (J, p, C) in the following theorem is to be understood in the sense that there exists a choice of C KR (J, p, C) such that the statement holds true.
Theorem 2.5.Let µ 1 , . . ., µ J ∈ M + (X ) be a collection of non-negative measures on the finite discrete space X ⊂ Y.For any C > 0 it holds that Moreover, any (p, C)-barycenter µ satisfies supp(µ ) ⊆ C KR (J, p, C) and its total mass is bounded by (ii) For any (p, C)-barycenter µ * and any point y ∈ supp(µ * ), there exist UOT plans π i between µ * and µ i for i = 1, . . ., J, respectively, such that if π i (y, x) > 0, then there exists (v) Furthermore, set Z := J i=1 supp(µ i ) ∪ C KR (J, p) and define (vi) Let C > J 1/p diam(Z) and let µ 1 , . . ., µ J be ordered such that M(µ i ) ≤ M(µ j ) for i ≤ j.Suppose that J is odd or there there exists no point y ∈ Y contained in at least J/2 different support sets.Then, for any (p, C)-barycenter µ it holds that M(µ ) = M µ J/2 .Else, there exists at least one (p, C)-barycenter with this total mass.
The proof is based on the fact that finding a (p, C)-barycenter can be proven to be equivalent to solving a multi-marginal optimal transport problem (Section 3.2).Statement (i) provides insights into the structure of the support of any (p, C)-barycenter and its dependency with respect to the magnitude of C. The definition of C KR (J, p, C) can be understood as a joint restriction on L i=1 d p (x i , y) combined with an individual restriction on each d p (x i , y) of the original centroid points of C KR (J, p).The joint restriction ensures that simply deleting any mass at a given centroid point (and thus reducing the total mass of the measure) does not improve the objective value.This is a minimal feasibility assumption on the considered centroid point, as otherwise no measure containing this point can be optimal.The second restriction concerns each point individually.If a point x i has a distance larger than C p from a point y, then, by Lemma 2.1, there is no transport between y and x i .Thus, centroids which have have a larger distance to one of the points x 1 , . . ., x L they are constructed from can not be in the support of any (p, C)barycenter.This also gives rise to some helpful intuition for the support structure of any (p, C)-barycenter.Considering all C p -neighbourhoods around any of the support points of the µ i , then a (p, C)-barycenter can only have support in regions where at least balls from J/2 different measures intersect.A visual representation of this is given in the center of bottom row of Figure 5.By definition, the sets C KR (J, p, C) are equipped with a natural ordering in the sense that if We illustrate these sets in the top row Figure 5.We observe that the cardinality of the restricted centroid set in (9) decreases with decreasing C. In the extremes for large C the restricted centroid sets coincides with the full centroid sets in (8) that is independent of C. For small C, if there is no point which is contained in the support of at least J/2 measures, the restricted centroid set is empty.For an illustration we refer to the top row of Figure 5. Property (ii) is an analogue to a well-known characterization [Anderes et al., 2016] of the p-Wasserstein barycenter on R d with Euclidean distance d 2 , where the transport from the barycenter to the underlying measures is characterized by a transport map.The corresponding statement for the (p, C)-barycenter holds true as well in this context.Indeed, on (R d , d 2 ) condition ( 10), which can be understood as an injectivity-type assumption on the barycentric application, is satisfied due to the fact that T Thus, there only exist mass-splitting UOT plans between µ * and µ 1 , µ 2 and the transport is not characterized by a transport map.On more general spaces such as a tree T rooted at r, three leaves x 1 , x 2 , x 3 and positive edge weights e 1 , . . ., e 3 ∈ (0, 1) the barycenter on T of any two leafs x i = x j , is the root r.In particular, in this example, or in fact in any tree T = (V, E) which has a vertex y with degree of at least three 5 condition (10) fails.The unique (2, 2)-barycenter of two measures µ 1 = δ x 1 + δ x 2 and µ 2 = δ x 2 + δ x 3 is given by µ * = 2δ r .Thus, there are again only mass-splitting UOT plans between µ * and µ 1 and µ 2 .However, for the unit circle S 1 equipped with its natural arc-length distance property (10) does hold.Assume a 0 = T L (x 1 , . . ., x L ) = T L,p (y 1 , . . ., x L ), a 1 = T L−1,p (x 2 , . . ., x L ) and for each x ∈ S 1 denote H r (x) and H l (x) as the halfcircle right and left of x, respectively.It is straightforward to see by contraposition that if it holds a 1 ∈ H r (a 0 ), then this implies x 1 , y 1 ∈ H l (a 1 ) and x 1 , y 1 ∈ H l (a 0 ).However, it also holds d(x 1 , a 0 ) = d(y 1 , a 0 ), and thus x 1 − y 1 , a 0 = 0.In particular, this implies that either x 1 ∈ H l (a 0 ) and y 1 ∈ H r (a 0 ) or vice versa and hence x 1 = y 1 .The case a 1 ∈ H l (a 0 ) is analog and the case a 0 = a 1 clear.Property (iii) guarantees the existence of sparse (p, C)-barycenters.For large C the size C KR (J, p, C) scales as J i=1 M i , growing essentially exponentially in J.However, here we see that there always exists a (p, C)-barycenter supported on a sparse subset of C KR (J, p, C) which has cardinality growing only linearly in J. Part (iv) simply extends the montonicity of the (p, C)-KRD to the (p, C)-Fréchet functional.Statement (v) yields a critical point after which decreasing C does no longer change the resulting (p, C)barycenter and provides a closed form characterisation of the (p, C)-barycenter in this context.Finally, statement (vi) enables control on the total mass of the (p, C)-barycenter for large values of C. In particular, since the total mass is close to the median of the total masses of the µ i , we point out that the total mass of the (p, C)-barycenter in this setting is robust against outliers.A small amount of measures with unreasonably high mass has no impact on the total mass of the (p, C)-barycenter.Naturally, we compare the (p, C)-barycenter to its popular Wasserstein analogue in (4).As proven in Le Gouic and Loubes [2017] [and initially for p = 2 for R d by Anderes et al., 2016] the support of any p-Wasserstein barycenter is contained in Compared to the p-Wasserstein barycenter of the probability measures µ 1 , . . ., µ J the restricted centroid set C KR (J, p, C) allows more flexibility for specific cases and can provide a more reasonable representation of the data.We illustrate this in Figure 5 (bottomleft/right) where the (2, C)-barycenter clearly represents all clusters while the 2-Wasserstein barycenter fails to capture them.Nevertheless, if C is large enough and all measures have equal total mass both barycenters coincide.
, then any p-Wasserstein barycenter is also a (p, C)-barycenter and vice versa.
While this shows that the (p, C)-barycenter is a strict generalisation of the usual p-Wasserstein barycenter as the solutions coincide for large C, for smaller values of C there 5 The degree of a vertex in a graph is the number of vertices which are adjacent to it.can be significant differences.One such striking difference between the p-Wasserstein barycenter and the (p, C)-barycenter comes in the form of a localization property.Let Here, the (p, C)−barycenter tends to place mass between the clusters B 1 , . . ., B R .However, a (p, C)-barycenter is obtained by combining R barycenters of the measures restricted to the B 1 , . . ., B R , respectively.Lemma 2.7.Let µ 1 , . . ., µ J ∈ M + (X ) such that for all i = 1, . . ., J it holds supp(µ where conv(B r ) is the convex hull of B r for r = 1, . . ., R.Then, the measure In particular, Lemma 2.7 implies that the (p, C)-barycenter respects the cluster structure within the supports of the measures if the clustered are sufficiently separated and C is adapted according to the cluster size.Examples of this setting can be seen in Figure 2 and Figure 5.

A Lift to Optimal Transport, Wasserstein Barycenters and Multi-Marginal Optimal Transport
In this section, we provide the necessary tools and framework to establish our results in the previous section.Following the ideas of Guittet [2002] we state UOT in (2) as an equivalent balanced OT problem.We extend this idea to the (p, C)-barycenter, showing it to be equivalent to a specific Wasserstein barycenter problem as well as a balanced multi-marginal optimal transport problem.

A Lift to Optimal Transport
We fix a parameter C > 0, introduce an additional dummy point d and define the augmented space X := X ∪ {d} with metric cost Notably, dC : X × X → R + defines a metric on X [Müller et al., 2020, Lemma A1].
+ (X ) defines an augmented measure μ on X such that M(μ) = B. Hence, for two measures µ, ν ∈ M B + (X ) we can define the OT problem on X between their augmented measures ÕT dp C (μ, ν).In fact, it holds that where the first equality follows by Lemma 2.1 as for any optimal solution π C it holds π C (x, x ) = 0 if d p (x, x ) > C p and the second follows by [Guittet, 2002, Lemma 3.1].The same equalities remain valid replacing B by an arbitrarily large constant as summarized by the following lemma.
Lemma 3.1.Consider µ, ν ∈ M B + (X ) with extended versions μ, ν.Then for any a > 0 it holds that Proof.For p = 1, the result is trivial since by duality ÕT dC (μ, ν) only depends on the difference of the measures.For p > 1 we invoke dC -cyclical monotonicity [Villani, 2008, Thm. 5.10] of any OT plan π and use the property that dp C (x, d) = C p /2.This yields that (d, d) ∈ supp(π) which leads to the desired conclusion.

A Lift to Wasserstein Barycenters
We can also lift the optimization problem defining a (p, C)-barycenter to an equivalent p-Wasserstein barycenter formulation (4).Augmentation of the underlying measures, however, is not straightforward as the total mass of the (p, C)-barycenter is unknown.A first crude upper bound on its total mass leads to a feasible approach.
Lemma 3.2.Consider µ 1 , . . ., µ J ∈ M + (X ) and let F p,C be their associated unbalanced Fréchet functional.Then it holds that More precisely, any (p, C)-barycenter µ of µ 1 , . . ., µ J satisfies M(µ ) ≤ J i=1 M(µ i ).Proof.Assume first that there exists a measure µ ∈ M + (Y) such that µ = ν 1 + ν 2 where no transport between ν 2 and any µ i occurs in the optimal solution of UOT p,C (µ, µ i ) for 1 ≤ i ≤ J and it holds M(ν 2 ) > 0. Thus it holds and we improve the objective value of µ by removing ν 2 .Hence, let µ ∈ M + (Y) be any measure such that ν 2 ≡ 0. Consider π i the optimal solution for UOT p,C (µ, µ i ) for each 1 ≤ i ≤ J. Decompose the measure µ = J i=1 τ i , where τ i is the mass of µ transported to µ i according to π i and which is not yet included in any τ j for j < i.Clearly, M(µ) = J i=1 M(τ i ) ≤ J i=1 M(µ i ) and we conclude that min By our first considerations the claim follows.
Given the upper bound on the total mass of any (p, C)-barycenter at our disposal we can formulate a lift of the (p, C)-barycenter problem to a related p-Wasserstein barycenter problem.For this, let Ỹ := Y ∪ {d} endowed with the metric dC in (12) (replace X by Y and recall that X ⊂ Y) and augment the measures µ 1 , . . ., µ J to μ1 , . . ., μJ where μi = µ i + j =i M(µ j )δ d for 1 ≤ i ≤ J.In particular, M(μ i ) = J j=1 M(µ j ) and we can define the augmented p-Fréchet functional where by definition Fp,C is restricted to measures µ with mass M(µ) = J i=1 M(µ i ).
The proof of this Lemma is given in Appendix A.1.

A Lift to Multi-Marginal Optimal Transport
On the augmented space Ỹ := Y ∪ {d} equipped with metric dC in (12), we define for p ≥ 1 and J ∈ N a Borel barycenter application T J,p C : ỸJ → Ỹ that takes as input (y 1 , . . ., y J ) ∈ Ỹ and outputs any minimizer y ∈ Ỹ of the function Of particular interest to us is the barycentric application restricted to inputs from X .However, we collect some of its key properties for general input (y 1 , . . ., y J ) ∈ ỸJ .For this, we define the index set If clear from the context, then the dependence on y 1 , . . ., y J is suppressed and the set is simply denoted as B.
Lemma 3.5.Fix some parameter C > 0 and consider the space Ỹ with metric dC as defined in (12).For points (y 1 , . . ., y J ) ∈ ỸJ it holds that In particular, if strict inequality holds then T J,p C (y 1 , . . ., y J ) = d is unique.
(iii) If T J,p (y 1 , . . ., y J ) = d then it holds , then for any points y 1 , . . ., y J ∈ Y with |B| = 0 it holds that T J,p C (y 1 , . . ., y J ) = T J,p (y 1 , . . ., y J ) where the latter one is defined with respect to the usual metric d p on Y.
A proof of this result is provided in Appendix A.1.Lemma 3.5 allows to characterize the centroid sets of the augmented measures μ1 , . . ., μJ defined as Remark 3.6.We point out that computing T J,p C is in general a difficult optimisation problem.While for squared euclidean distance, computing the barycentric application simply amounts to taking the mean of the x i , even on the non-augmented space, there are no closed form solutions available for most choices of distances and values of p.This problem is exacerbated by the truncation of the distance d at C p [as also pointed out in Müller et al., 2020], since it implies that disregarding a certain subset of points and just computing the barycenter with respect to the remaining x i might in fact be optimal.However, initially it is not clear which x i to choose, turning this into a difficult combinatorial problem.
Recall that for any measure µ its support is contained in X a subset of Y.The augmented measure μ is extended by an additional support point at {d}.In particular, while the centroid set is a subset of Ỹ it only depends on the support of the measures μi contained in X := X ∪ {d}.
Corollary 3.7.For the centroid sets of the augmented measures μi Proof.The first inclusion follows by statements (i) and (iii) in Lemma 3.5 and the observation that |B| = J − L. The second by applying C KR (J, p, C) ⊂ C KR (J, p).
Remark 3.8.One could define C KR (J, p, C) in terms of dC instead of d to obtain equality in the first inclusion.Replacing T L,p by T L,p C in the definition of the centroid set would not alter any of the related proofs and yield slightly sharper control on the support of (p, C)barycenter.However, as we consider the given definition to be more intuitive, we omit this improvement in the statement of the theorem.

Computational Issues and Numerical Experiments
We present approaches to compute the (p, C)-barycenter problem by solving related OT problems.Based on this, we investigate the performance of the Wasserstein and (p, C)barycenters on multiple synthetic datasets.For reference, we also report on results for two related concepts of unbalanced barycenters (UBCs), namely the Gaussian-Hellinger-Kantorovich and Wasserstein-Fisher-Rao barycenter.

Algorithms
Theorem 2.5 and Proposition 3.9 both allow to pose the augmented problem (recall Section 3) as a linear program and using Lemma 3.3 one can obtain a solution to the original problem by solving the augmented one.Using any linear program solver this enables the direct computation of an exact solution of this problem.However, the number of variables in this approach scales as the size of C KR (J, p, C) and hence it turns out to be infeasible already for relatively small instance sizes.To compute (p, C)-barycenters at larger scales we revisit iterative methods to solve the (balanced) Wasserstein barycenter problem and give instructions how to use modifications of them to compute (p, C)-barycenters.In particular, we detail a multi-scale method which solves successive fixed-support (p, C)-barycenter LPs on increasingly refined support sets.This provides a meta-framework to adjust stateof-the-art solvers for the Wasserstein barycenter for (p, C)-barycenter computations.
To construct the augmented problem we add the dummy point d to the support of the µ i 's, while setting its distance to all other locations to be C p /2. Note, that by Lemma 2.1 and Lemma 3.1 the truncation of d at C p can be omitted if M(μ i ) > 3 max i=1,...,J M(µ i ).If this is not the case, we can enforce it by adding additional mass at d in all augmented measures without changing the optimal value.

LP-Formulation for the (p, C)-Barycenter
Using property (i) from Theorem 2.5, we can rewrite the augmented (p, C)-barycenter problem as a linear program similarly to the usual p-Wasserstein barycenter problem (4).However, compared to the latter one, we replace the standard centroid set C W (J, p) from ( 11), by the centroid set CKR (J, p, C) of the augmented measures from (13).This yields min π (1) ,...,π (J) ,a where M i = | Xi | is the cardinality of the support of the augmented measure μi .Here, c i jk denotes the distance between the j-th point of | CKR (J, p, C)| and the k-th point in the support of mu i , while b i is the vector of masses corresponding to μi .For practical purposes it may be advantageous to solve the multi-marginal problem instead of the (p, C)-barycenter problem.This changes the number of variables from | CKR (J, p, C)|(1 Depending on the value of C, and hence the cardinality of CKR (J, p, C), it is possible to pick the problem with the smaller complexity.While this formulation is appealing for proving theoretical statements as provided in Theorem 2.5, it quickly becomes computationally infeasible even for small scale problems as the number of variables in the LP grows potentially as M i .However, it still enables exact computations of (p, C)-barycenters for small scale examples, which is currently impossible for general UBCs.Though, while there has been some recent advancement for the 2-Wasserstein barycenter in special cases [Altschuler and Boix-Adsera, 2021] these LP-based algorithms ultimately do not scale to large instance sizes.

Iterative Algorithms and the Multi-Scale Approach
For the Wasserstein barycenter, iterative methods computing approximate barycenters, with a per iterations complexity only linear in the number of measures, enjoy great popularity.Most well known is the fixed-support Wasserstein barycenter [Ge et al., 2019, Lin et al., 2020, Xie et al., 2020] approach, aiming to find the best approximation of the barycenter on a pre-specified support set, for which a variety of methods is available.We utilise this fixed-support approach for the augmented (p, C)-barycenter problem by adding the dummy point d to the given support and constructing the cost as described above.This yields a meta-framework which allows to employ fixed-support Wasserstein barycenter algorithms for fixed-support (p, C)-barycenter computation.One can also modify more general free support methods [Cuturi and Doucet, 2014, Ge et al., 2019, Luise et al., 2019], which usually alternate between updating the support set of the barycenter and its weights on this set, to provide approximate (p, C)-barycenters.However, the necessary position updates usually explicitly or implicitly rely on being able to compute the barycentric application T J,p efficiently.Recalling Remark 3.6, this is in general not tractable for the augmented problem, which severely hinders the use of these approaches.Thus, it is tempting to avoid these issues by approximating Y with a large finite space, i.e., by taking a grid of high-resolution, and solving the fixed support (p, C)-barycenter problem on this set.However, solving the fixed-support problem on this large space requires significant computational effort.We advovate an alternative by adapting the ideas of multi-scale methods for the Wasserstein distance/barycenter [Mérigot, 2011, Gerber and Maggioni, 2017, Schmitzer, 2019] to the (p, C)-barycenter setting.The idea of this approach is to start with a coarse version of the problem and then successively solve refined problems, while using the knowledge of the coarse solution to reduce the complexity of the finer ones.Thus, we initialise the support set of the barycenter as a fixed grid of size In the j-th step of the algorithm, after solving the fixed-support problem, we remove the grid points which have zero mass and replace the remaining ones with its 2 d closest points in a refined version of the original grid of size 2 j K 1 × • • • × 2 j K d .This can be understood as solving the fixed-support problem on successively finer grids, while incorporating information provided by having already solved a coarser solution of the problem.We terminate the method once a pre-specified resolution has been reached.This allows to obtain fixed-support approximation of the (p, C)-barycenter on fine grids without having to optimise over the full support set.We point out that this approach, while inspired by multi-scale approaches is more closely related to the formerly mentioned free-support methods.As such it does in general not yield a globally optimal fixed-support (p, C)-barycenter at the finest resolution.Instead it converges to a local minimum of the unbalanced Fréchet functional depending on the resolution of the initial grid.This is a common problem among alternating procedures for the free-support barycenter problem and can be attributed to the fact that the Fréchet functional is non-convex in the support locations of the measures.However, we stress that with this approach we observe reasonable approximations of the (p, C)-barycenter while avoiding the inherent problems of generalising usual position update procedures discussed above.In particular, we do not have to solve the T J,p C barycenter problem at any point.Additionally, we note that the initial grid size should be chosen at least fine enough that the distance between two adjacent grid points is smaller than C. Otherwise it is possible that support points lying between two grid points, having distance larger C to both, are not accounted for.For a visual illustration of the algorithm we refer to Figure 6.

Synthetic Data Simulations
We test the performance of the (p, C)-barycenter as a data analytic tool compared to the usual p-Wasserstein barycenter on a multitude of datasets.We base our computations on the MAAIPM method [Ge et al., 2019], which allows for high-precision approximations of barycenters up to moderate data sizes.The algorithm has been deployed to solve the fixedsupport (p, C)-barycenter problems arising in the multi-scale method detailed above.For all experiments, the initial grid size as been set to 16 × 16 and the refinement is terminated at a gridsize of 128×128.Values below 10 −5 have been considered as zero for the purposes of grid refinement.All experiments have been carried out on a single core of an Intel Core i7 12700K.Implementations of our used method and some alternatives can be found as part of the R-package WSGeometry (on CRAN).

Mismatched Shapes
This first set of examples mainly serves as starting point to illustrate improved performance of the (p, C)-barycenter compared to the p-Wasserstein barycenter.A prototypical benchmark for the p-Wasserstein barycenter are two nested ellipses as popularized in Cuturi and Doucet [2014].For our example of nested ellipses, we assume that the support of each measure consists of nested ellipses, but the number of ellipses varies between the individual underlying measures.Specifically, we assume that for each µ i the number of ellipses is uniformly random in {1, 2, 3} and that each ellipse is discretised onto M support points with unit mass, respectively.This can be seen in Figure 7.We observe that while the p-Wasserstein barycenter recovers the elliptic shape of the underlying measures, it fails to produce distinct ellipses and instead produces something akin to a ring.In contrast, the (p, C)-barycenter yields two distinct ellipses, which coincides with the expected number of ellipses in one of the measures.This aligns well with intuition that the (p, C)-barycenter will simply disregard any additional structures which are not present in a sufficient amount of underlying measures.In contrast, the p-Wasserstein barycenter does not allow for this flexibility which enforces additional support points.

Local Scale Cluster Detection
Recall the setting of Figure 2. In the following class of examples, we are interested in datasets which possesses a natural cluster structure.Let B 1 , . . ., B R ⊂ R D be convex, disjoint sets and assume that supp(µ i ) ⊂ ∪ R r=1 B r for all i = 1, . . ., J. If the diameter of all B r is bounded from above by C and that the distance between each two B r , B s is at least 2 1/p C, then Lemma 2.7 guarentees that the (p, C)-barycenter detects all of the R clusters in which at least J/2 measures have positive mass.In particular, by Theorem 2.5 (v) the (p, C)-barycenter will have mass in all of those clusters.Intuitively, this setting is reasonable if, for instance, it is already known that any interactions between support points of different measures are limited to scales below a certain threshold, which should then be chosen as C. The lower bound on the inter-cluster distance ensures that any pair of two clusters is well-separated, ensuring that it is always possible to distinguish between two different clusters, as they can not be arbitrarily close to each other.In Figure 2 the p-Wasserstein barycenter completely fails to capture the geometric data structure.Most of its mass is between the clusters and the outer clusters have nearly no mass.Moreover, the elliptic structure within each cluster is clearly not captured.In contrast, the (p, C)-barycenter not only captures all clusters, it also distinguishes between the difference in intensity (expected number of ellipses) in the clusters, matching the theoretical guarantees of Lemma 2.7.We stress that for this example the choice of C is of particular importance.If we choose C too large, the (p, C)-barycenter will fail to recover the data's support structure (for an illustration of the (p, C)-barycenter in this example over a range of values of C see Figure 8).Consequently, it is crucial to choose C appropriately.In this example, the barycenter appears to be stable and detect all clusters for C ∈ [0.1, 0.275].Notably, if the locations of the clusters are already known, this setting also allows for parallel computations of the (p, C)-barycenter, where the problems are solved separately on each cluster and recombined at the end (Lemma 2.7).

Randomly distorted Measures
In a statistical context it is important to investigate the stability of the (p, C)-barycenter under random distortions.We fix a reference measure µ 0 on R d and generate a set of measures by random modifications of µ 0 .We then attempt to recover µ 0 by computing the p-Wasserstein and (p, C)-barycenter of these measures, respectively.In the following, let B(p) denote a Bernoulli random variable with mean p, P oi(λ) a Poisson distribution with mean λ and U [a, b] a uniform distribution on [a, b].We generate µ 1 , . . ., µ J as follows: For i = 1, . . ., J initialise µ i = µ 0 , then succesively modify µ i based on the four following steps.
(i) Point Deletion: Fix p del ∈ [0, 1] and λ del ∈ R + .We draw a Ber(p del ) random variable.If it takes the value 1, then we draw D ∼ P oi(λ del ) and select min(D, |supp(µ 0 )|) points in the support of µ 0 uniformly by drawing without replacement.These points (and their mass) are not contained in µ i , since they have been deleted.
(ii) Point Addition: We fix parameters Draw a Ber(p add ) random variable.If it takes the value 1, draw a Poi(λ add ) random variable α.Then, generate α random variables following a normal distribution with mean m add and covariance matrix σ add .Add these support points to µ i , where the weight of each of these points is determined by independent U [u 0 , u 1 ] random variables.
(iii) Position Change: ) random variable and shift the position of x 0 by it.
(iv) Weight Change: Fix parameters l, u ∈ R with l ≤ u.For each support point x 0 of µ 0 with weight w 0 , we draw a U [l, u] random variable U and change the weight of x 0 in µ i to be w 0 + U .An example of this setting can be seen in Figure 9. Comparing the two barycenters displayed there to the original measure reveals that, while the rough shape of the 2-Wasserstein barycenter is correct, its mass is spread out over a larger area and it has a significantly larger number of support points.Since all measures have been normalised, we have also lost all information on the mass of µ 0 .Contrary to that, the (p, C)-barycenter retrieves the original measures recovering the location and number of the of support points closely.Additionally, it also has a mass which only deviates from the original mass by about 0.23%.If one is only interested in recovering the general shape of the data, both approaches provide comparable performance.However, if the measures total mass and more detailed support structure are of importance the (p, C)-barycenter appears to be preferable.

Total Mass Intensity
While the p-Wasserstein barycenter of J probability measures has mass one, the mass of the (p, C)-barycenter depends on C as well as the geometry of the measures µ 1 , . . ., µ J ∈ M + (X ).Exact values for the mass of a (p, C)-barycenter without detailed computations, are only available in the limiting scenarios where C is extremely small or large relative to the other distances in X .For the former, we know by Theorem 2.5 (v) that the barycenter has mass zero for disjoint measures and for the latter, Theorem 2.5 (vi) yields that there exists a (p, C)-barycenter with total mass intensity equal to the median of M(µ 1 ), . . ., M(µ J ).For intermediate values of C, Theorem 2.5 (i) yields the upper bound by 2J −1 J i=1 M(µ i ).To highlight some possible behaviours of the total mass intensity of (p, C)-barycenter we consider three specific examples in Figure 10.We note that in all three cases at about C = 0.6 the mass of the barycenters is at the median of Figure 10: The mass of a (p, C)-barycenter for three sets of measures relative to the median of the total mass intensities of these measures.The green line corresponds to J = 25 measures from the same class as considered in Figure 2. The red line corresponds to the same measures where the four outer clusters have been moved closer to the central one, such that their distance has been halved.The blue line corresponds to J = 5 measures with the same cluster structure as in Figure 2, where the total number of ellipses in all clusters is fixed to be equal to four for all J measures.their respective µ 1 , . . ., µ J and does no longer change with increasing C.This is significantly smaller than the requirement in Theorem 2.5 (vi), which underlines the fact that while in the worst case, this lower bound is sharp, in many examples the total mass of the (p, C)-barycenter stabilises significantly earlier.Moreover, none of the three curves is monotone.Instead the total mass of the barycenter is increasing up to a certain point, after which it decreases until it reaches the median of the masses.This makes intuitive sense, as the measures are disjoint, thus for small C the barycenter is empty and starts to grow in mass quickly as the points within the clusters can be matched.In particular, the differences in intensity between clusters might lead to a total mass over the median M(µ 1 ), . . ., M(µ J ), as by Lemma 2.7 the total mass intensity of the (p, C)-barycenter is R r=1 med(M(µ 1 |Br ), . . ., M(µ J |Br ), where B 1 , . . ., B 5 denote the respective cluster locations.For larger C these clusters start to merge and support points between the clusters reduce the total mass.In particular, these points can be seen clearly in the plot.Up until about C = 0.1, which is the cluster size, the mass of the barycenters rises sharply, before stabilising until the intercluster distance is reached.This is about 0.3 for the green and blue lines and about 0.15 for the red line (since the measures in this example are generated by halving the intercluster distance from the green one).This behaviour highlights the sensitivity of the mass of the (p, C)-barycenter to the geometry of the measures.It is therefore impossible to infer the total mass of the (p, C)-barycenter from the magnitude of C alone without accounting for the specific measures.However, analysing the structural properties of the support sets of the measures might provide a good indication at what values of C changes in drastic behaviour of the total mass are to be expected.

Comparison with Related Unbalanced Barycenter Concepts
We compare the (p, C)-barycenter with two alternative UBC approaches.The Gaussian-Hellinger-Kantorovich Barycenter: This example falls in the general framework of optimal entropy transport problems.Measuring deviation between a feasible solution and the input marginals is carried out via the Kullback-Leibler divergence defined for µ ν6 as If µ ν the value of KL is set to be +∞.For a parameter λ > 0, the Gaussian-Hellinger-Kantorovich Distance [Liero et al., 2018] is defined as where π 1 and π 2 denote the respective marginals of π.The GHK λ barycenter is defined as The Hellinger-Kantorovich Barycenter: The Hellinger-Kantorovich distance, also known as Wasserstein-Fisher-Rao distance [Liero et al., 2018, Chizat et al., 2018a], is closely related to the Gaussian-Hellinger-Kantorovich distance.For fixed parameter σ ∈ (0, π/2], referred to as the cut-locus, it is defined as where cos σ : z → cos(min(z, σ)).For a fixed cut-off locus σ, the HK σ barycenter is defined as HK σ (µ i , µ).
Comparing the barycenters: As the resulting barycenters vary significantly in all three cases, depending on the parameters C, λ, σ, we compare their behaviour upon change of parameter.As a simple example, we consider four measures supported on subsets of a grid on [0, 1] 2 , displayed in Figure 11.To ensure fair comparison, we deploy the same method based on the general scaling method [Chizat et al., 2018b] to approximate the UBC in all three cases.However, we point out that this implies disregarding the ambient space and instead taking the minimum over all positive measures supported on a prespecified grid in [0, 1] 2 .
For high parameter values all three approaches yield similar results.This is, of course, to be expected, since these distances interpolate between p-Wasserstein distance and total variation/Kullback-Leibler distance and large parameters correspond to a setting being close to the Wasserstein distance.The KR barycenter has mass zero for small choice of C by Theorem 2.5 (iv), since the four measures have disjoint support.After reaching a threshold of C ≈ 0.1, the mass in the (2, C)-barycenter starts to increase as mass is added in the center of the unit square until at C ≈ 0.3 the mass of an individual data measure is reached.
For small λ the GHK λ barycenter has small mass and its support is close to that of a linear mean of the four measures, though the total mass intensity is significantly lower than for the original measures.With increasing λ the mass starts to increase and to smear  into the middle of the unit square, until a large square, encompassing all four data supports, is formed.After this point increasing λ causes the square to contract while its mass increases.Finally, we approach a single square at roughly the same size as the squares in the underlying measures for large λ.
The HK σ barycenter is close to a linear mean of the four measures for small cut-off.Increasing σ initially reduces the mass at each of the square locations.At a threshold of σ ≈ 0.34, we observe a change, where part of the mass is moved vertically or horizontally to the mid points between the squares in a rectangular shape.Until σ ≈ 0.43 all mass is shifted to these "middle-rectangles", at which point a second shift occurs, where the mass from these rectangles starts to move towards a square in the center.At σ ≈ 0.6, all mass has been shifted towards a square in the center and there is no further change in the HK barycenter, when increasing σ.
Additionally, we consider Figure 13, where the three unbalanced barycenter models are compared on three exemplary classes based on the MNIST dataset.Here, the original 28 × 28 images have been rescaled to sizes between 14 × 14 and 42 × 42 and embedded in a random subgrid of a 50 × 50 image.In this setting, there is a notable distinction between the GHK barycenter and the KR and HK barycenters.While for the former, the overall shape is recovered even for small parameter values, the latter two barycenters produce unstructured results for small parameters.The GHK distance is not constructed to have a maximal transport distance comparable to the impact of C or σ in the other two cases, which allows to transport across larger distance and recover the correct shape for smaller values of λ.However, the mass of the GHK barycenter is significantly smaller than that of the original measures for small values of λ and only increases to the correct magnitude for larger penalty values.The HK and KR barycenters consist of fragments of the final shape which move towards a joint location for increasing parameters.For large penalties all three models are nearly identical and display the corresponding number correctly.This makes sense, as in this setting the minimisation in any individual term of the (p, C)-Fréchet functional is driven by minimising an OT term.We point out that for the (p, C)-barycenter this regime is guaranteed to be reached by choosing C larger than the diameter of the space, while for the other two models the suitable parameter choice for this example is ambiguous without actually computing the result for specific values.
Overall, for large parameter values all considered UBCs perform similarly.In small parameter regimes we observe significant differences.This difference in behavior is to be expected as the dependence of the UOT models on their parameters varies significantly.
One key advantage of the KR barycenter is that its connection between the choice of C and the properties of the resulting barycenter is immediate and intuitive.While the cut-off locus σ for the HK barycenter fulfils a similar role, imposing control at the maximum scale at which transport does occur, the consequences of changing σ from one value to another are far less immediate due to the involved structure of the cost functional in this setting.
Similarly to the KR barycenter, it is worth noticing that the HK barycenter does allow for mass at locations given by centroids of support points of L < N measures.Though, while for the KRD a feature of the underlying measures is only contained in the barycenter if it is present in more than L = N/2 measures, the HK barycenter also allows for mass at locations constructed from less support points.Thus, the HK barycenter is prone to being more susceptible to errors due to noise within the data.Compared to the other two choices, the parameter λ of the GHK barycenter does appear to have less interpretation, with the only clear connection being that increasing λ increases the mass of the GHK barycenter.There does also not appear to be any well-founded method how to approach the choice of λ for a given dataset.

A Proofs
A.1 Proofs of Section 3 where (i) follows from the lift to an OT problem (Section 3.1) and (ii) follows from Lemma 3.1 by adding mass j =i M(µ j ) − M(µ) at d.We then have that Fp,C (µ) and min µ∈M + ( Ỹ) Combining both inequalities and using Lemma 3.2 then finishes the proof.
Proof of Lemma 3.5.(i) By definition, the objective value for T J,p C (y 1 , . . ., y J ) at d is equal to (J − |B|)C p /2.Thus, T J,p C outputs d if and only if for any y ∈ Y it holds In particular, if all inequalities are strict d is the unique output for T J,p C (y 1 , . . ., y J ). Statement (ii) is a direct consequence of (i).Proving (iv), let C > 2 1 /p diam(Y), pick points y 1 , . . ., y J ∈ Y and observe that for any y ∈ Y it holds that Thus, T J,p C (x 1 , . . ., x J ) = d and since |B| = 0, the claim follows from (iii).

A.2 Proofs of Section 2
Proof for Lemma 2.1.Suppose that π C is optimal but its induced graph G(π C ) contains a path P We define a new transport plan with augmented transport along the path P .For this, define := min 1≤j≤k−1 π C (x i j , x i j+1 ) and construct the new plan πC Compared to π C the transportation cost for πC is reduced by L(P ) while the marginal deviation is increased by C p .In particular, it holds that x,x As > 0 and L(P ) > C p this contradicts the optimality for π C .Consequently, any path P in the induced graph G(π C ) necessarily has path length at most C p .If d(x, x ) > C this implies that d p (x, x ) > C p and hence by the statement on induced graphs that π C (x, x ) = 0.
Proof for Theorem 2.2.We first establish the metric properties (i).It is straightforward to show KR p,C (µ, ν) = 0 if and only if µ = ν and that KR p,C is symmetric.For the triangle inequality let µ, ν, τ ∈ M + (X ) and choose B ≥ max{M(µ), M(ν), M(τ )}.Then by augmenting the measures accordingly (Section 3.1) we find that where the inequality follows by the triangle inequality for the Wasserstein distance [Villani, 2003, Theorem 7.3].Statement (ii) follows from Lemma 2.1 by noting that there exists at least one optimal solution π C equal to zero except on the diagonal for which π C (x, x) = µ(x) ∧ ν(x).Plugging into the objective of (2) yields the claim.Additionally, suppose that w.l.o.g.µ(x) ≥ ν(x) for all x ∈ X .Then independent to the choice of C > 0 and p ≥ 1 the unique optimal solution is to remain all shared mass at its common place and to delete surplus material which is exactly the solution π C (x, x) = µ(x) ∧ ν(x) described before.Statement (iii) follows by noting that for C ≥ max x,x d(x, x ) the dual formulation in (DUOT p,C ) and in (DOT p ) coincide.Finally, for statement (iv) we note that by construction it holds dp C 1 (x, y) ≤ dp C 2 (x, y) for all x, y ∈ Ỹ.Hence, for any coupling π of the augmented measures μ, ν it holds Taking the minimum over all couplings of μ and ν on both sides completes the proof.

A.2.1 Proof for Theorem 2.3
Using the lift to the OT problem, we can now start to prove the closed formula on ultrametric trees.For this, consider an ultrametric tree T with height function h : V → R + and define its p-height transformed tree denoted T p := T as the same tree but with height function h p (v) = 2 p−1 h(v) p .An illustration is given in Figure 4. Notice that by monotonicity T p is again an ultrametric tree.
Lemma A.1.Let T be an ultrametric tree with height function h : V → R + and consider its p-height transformed tree T p .Then it holds that for all leaf nodes v, w ∈ L ⊂ V .
Proof.Let v, w ∈ L be two leaf nodes in the ultrametric tree T with height function h and let a be their common ancestor7 .Since paths between any two vertices are unique and all leaf nodes have the same distance to the root, it holds that where we use that h(v) = 0. Repeating the argument for the ultrametric tree T p we conclude that d Tp (v, w) = 2d Tp (v, a) = 2 p h(a) p .
Equipped with this result we are now able to prove the closed formula from Theorem 2.3.
Proof for Theorem 2.3.Let KR p p,C (µ, ν) = UOT p,C (µ, ν) refer to UOT w.r.t. the distance on T , which only depends on the distance between individual leaf nodes.Considering the p-th height transformed tree T p and applying Lemma A.1 we conclude that everywhere else for i = 1, . . ., J. By construction, πi defines an OT plan between μ and μi for i = 1, . . ., J. It holds where the first equality follows from Proposition 3.9 and the third and fourth by construction.Since πi is an OT plan between μ and μi it holds for all i = 1, . . ., J that Thus, it follows together with the previous equations that i.e. πi is optimal.Lemma 3.5 now yields the first part of the statement.
(i) By Proposition 3.9 the objective value of the balanced multi-marginal and (p, C)barycenter problem coincide and a (p, C)-barycenter is obtained as the push-forward of an optimal balanced multi-coupling under the map T J,p C restricted to Y.By construction and Corollary 3.7 any such measure is supported in C KR (J, p, C).Thus, there always exists a (p, C)-barycenter whose support is restricted to C KR (J, p, C) and the minimum over Y and C KR (J, p, C) coincide.The second part is similar and we let μ be any p-Wasserstein barycenter.Then by Proposition 3.9, there exists a multi-coupling of μ1 , . . ., μJ , such that μ = T J,p C #π.Since any such push-forward measure can only have support in C KR (J, p, C)∪{d}, it holds for µ = μ|Y that supp(µ) ⊂ C KR (J, p, C).It remains to show the upper bound on the total mass.By the equivalence to the multi-marginal problem and by Lemma 3.5 (ii) any (p, C)-barycenter µ cannot have mass on a point which is constructed from a set of points (x 1 , . . ., x J ) for which 2|B(x 1 , . . ., x J )| ≥ J. Additionally, by part (ii) we know that there exists UOT plans, such that the mass of each (p, C)-barycenter support point is fully transported to points it is constructed from.Let (a 1 , . . ., a K ) be the weight vector of the support points of the (p, C)-barycenter, then it holds that M(µ i ).
(iii) The multi-marginal problem between μ1 , . . ., μJ is a balanced problem, thus we can pose this as a linear program with a total of J i=1 M i variables and N i=1 M i +J constraints.As all measures have the same total mass, we can drop one arbitrary marginal constraint for each measure besides the first.Thus, the rank of the constraint matrix in the corresponding constraint is bounded by N i=1 M i + 1.Hence, each basic feasible solution of the linear program has at most N i=1 M i +1 non-zero entries (see [Luenberger et al., 1984] for details).Let π be such a solution.By Proposition 3.9 the measure μ = T J,p C #π is a p-Wasserstein barycenter and by construction it has at most N i=1 M i + 1 support points.Due to the upper bound on the total mass of the (p, C)-barycenter in property (i), we can guarantee that there is non-zero mass at d for J > 2, hence in this case, restricting the measure to Y reduces the support size by one.For J = 2, we note that the multi-marginal problem is just the augmented UOT problem.By construction we either have a point x in the support of one of the two measures, such that there is transport between x and d or both measures have equal mass at d and it is optimal to leave this mass in place.In the first case, we have mass at T J,p C (x, d) = d, thus the support size can be reduced by one and in the second the problem is equivalent to the OT problem and thus the barycenter has at most M 1 + M 2 − 1 support points.Finally, by property (i) the support of any (p, C)barycenter is contained in C KR (J, p, C), thus the cardinality of this set also provides a trivial upper bound on the support size of any (p, C)-barycenter.Taking the minimum over both quantities, we conclude (iv) For any µ ∈ M + (Y), it holds where the inequality follows from Theorem 2.2 (iv).Taking the infimum over all measures in M + (Y) on both sides completes the proof.Now, the objective cost of not having mass a at x 0 is aC p (J −k)/2, while the cost of adding aδ x 0 to µ is equal to a(kC p /2 + J i=k dp C (x i , x 0 )).Hence, adding the point improves the value of the Fréchet functional, if For 2k > J, the right hand side will always be negative, so we can not improve.Thus, we assume 2k < J.By assumption it holds C ≥ J 1 p diam(Z).Hence, Therefore, for 2k < J the objective value of µ can always be improved by increasing its mass by a, as long as k < J/2 .Thus, since µ is a barycenter it holds M(µ) ≥ Mµ J/2 .An analog, converse argument yields that if k > J/2, we can always improve the objective value of µ, since removing and then re-adding any mass to µ increases the objective value by the previous argument.Hence, it holds M(µ) = Mµ J/2 .Now, assume J is even.For 2k = J nothing in the previous argument changes.However, for 2k = J (note that this can only hold now that J is even), the right hand side is zero, however, if all the x i for i = k, . . ., J, are identical to x 0 (in particular, there exists a point contained in the support of at least half of the measures), then the left hand side will also be zero.In this case, the presence of this point does not change the objective value and there are (p, C)-barycenters of different total masses.However, we can still always choose to not place mass in such cases, to obtain a (p, C)-barycenter of the desired total mass.
Proof for Lemma 2.7.It suffices to show that there is no centroid point, which is constructed from points from two or more different sets B r .Assume there is a point y 0 ∈ C KR (J, p, C), such that y 0 is constructed, among others, from x 1 ∈ B r and x 2 ∈ B s for r = s.We distinguish two cases.Assume y 0 ∈ B r , then it holds d p (x 1 , y 0 ) > 2 p−1 C p ≥ C p and y 0 would not be in the restricted centroid set.The analogue argument holds for y 0 ∈ B s .Now, assume y 0 is neither in B r nor B S .Since d(B r , B S ) > 2 1/p C, it holds either d p (B r , y 0 ) > C p or d p (B s , y 0 ) > C p .Thus, we obtain another contradiction to y 0 ∈ C KR (J, p, C).Hence, C KR (J, p, C) only contains centroids constructed from points within one B r and by convexity of the B r , any centroid point constructed from points within B r is again in B r .Theorem 2.5 (ii) yields that there will always be an optimal solution which only transports within each B r , thus the R problems are in fact independent and we can separate them without changing the objective value.

Figure 3 :
Figure 3: General Tree Structures: (a) A tree graph T with root r (orange), internal nodes (black) and leaf nodes L (green).By definition par(v 5 ) = par(v 6 ) = v 2 and the children of v 1 are equal C(v 1 ) = {v 1 , v 3 , v 4 , v 7 , v 8 }.The distance from each leaf node to the root may vary.(b) An ultrametric tree T with height function h (red) such that 0 = h 4 3 < h 3 < h 2 < h 1 < h 0 .Edge weights are defined by the the difference of consecutive height values, e.g.w(e 1 ) = h 0 − h 1 .Each leaf node (green) is at the same distance to the root r (orange).

Figure 4 :
Figure 4: Closed formula for the KRD on ultrametric trees: (a) Depending on the regularization C > 0 and the underlying height function h the ultrametric tree T introduced in Figure 3 (b) is decomposed into two subtrees.Each node in the set R(C) = {v 1 , v 2 } (orange) serves as a new root and corresponding subtrees T (v 1 ) := C(v 1 ) and T (v 2 ) := C(v 2 ) are equal their respective set of children with corresponding edges.(b)The p-th height transformation T p (v 1 ) and T p (v 2 ) of the induced subtrees T (v 1 ) and T (v 2 ), respectively.Each subtree is extended by a new root (blue) with an edge (lightblue) whose distance is equal the difference of regularization C p /2 and the p-th height transformed value of the former root.

Figure 5 :
Figure 5: Centroid sets and barycenters: The support points of three (J = 3) measures (yellow, brown and blue dots) with unit mass at each position.Top: Different centroid sets C KR (3, 2, C) (red squares) with increasing value of C from left to right.Bottom-Left: The centroid set C W (3, 2) (dark green squares) corresponding to the 2-Wasserstein barycenter.Bottom-Center: Circles corresponding to C p (for two different choices of C) balls around the support points.The grey colouring indicates that there is no overlap of at least two circles in this area and thus no (2, C)-barycenter can have mass in this area.Conversely, the green colouring indicates overlap and thus the potential support area of the barycenter.Bottom-Right: The (2, C)-barycenter (red squares) for a specific choice of C and the 2-Wasserstein barycenter (dark green squares).

Figure 6 :
Figure 6: An illustration of the multi-scale approach on two different datasets.The fixedsupport solutions are shown on grids of the sizes 8 × 8, 16 × 16, 32 × 32, 64 × 64 and 128 × 128 increasing from left to right.The corresponding run-times on a single core of an Intel Core i7 12700K in the first/second row were 2.5/5 seconds, 14/16 seconds, 145/42 seconds, 13/3 minutes and 143/22 minutes.Top: The dataset of nested ellipses from Figure 7. Bottom: The dataset of ellipses with clustered support structure from Figure 2.

Figure 7 :
Figure 7: An excerpt of a dataset of N = 100 discretized ellipses.Each measure contains between 1 and 3 ellipses with equal probability.Each ellipse consists of 50 points with mass 1 in [0, 1] 2 .Left: In darkgreen the 2-Wasserstein barycenter, where all measures are normalized to be probability measures (runtime about 8 hours).Right: In red, the (2, 1.5)-barycenter (runtime about 30 minutes).

Figure 12 :
Figure 12: Images displaying the underlying measures used for barycenter computation in Figure 13.Each row corresponds to a dataset of ten elements of the classical MNIST dataset which have been randomly rescaled and shifted within a 50 × 50 grid in [0, 1] 2 .Their total mass intensities have not been normalised.
For statement (iii) we again use that by definition dp C (y, d) = C p /2 for any y ∈ Y and hence min previous argument and Lemma 3.5, any (p, C)-barycenter support point x k reduces the maximum available mass by at least J/2 a k and by Lemma 3.2, the total mass of the (p, C)-barycenter is bounded by the sum of the total masses of the µ i .Therefore it holds that |supp(µ)| ≤ min |C KR (J, p, C)|,

( v )
Let C ≤ d min , then by Theorem 2.2 (ii) in the arg min in the second line follows from the fact that the total variation can only increase if we place mass outside of the support of the measures.Thus it suffices to consider measures supported on the union of the supports.Now, we note that the K summands are independent to each other, thus we can minimise them separately.Hence, for the k-th entry of a it holds thata k ∈ arg min a∈R + N i=1 |a k − a i k | = med(a 1 k , . . ., a J k )which yields the claim.(vi)Let M i = M(µ i ) for i = 1, . . ., J and set M 0 = 0. Assume that J is odd.Let µ be a (p, C)-barycenter of µ 1 , . . ., µ J with µ(Y) ∈ [M k−1 , M k ].In particular, µ fulfills the non-mass-splitting property in (ii).Let a ∈ (0, M k − M(µ)] and μ the augmented measure for µ.By construction, we can find support points x k , . . ., x J = d of the augmented measures μk , . . ., μJ from which w.l.o.g.mass a is transported to d in µ.If one of the points has mass smaller a, we can just replace a with the minimum of the masses of the points and repeat the argument until we have considered a total mass of a. Set x 0 = T J,p C (d, . . ., d, x k , . . ., x J ) and notice that if x 0 = d, we do not change the objective function in the augmented problem (Lemma 3.1) by adding this point which means w.l.o.g.x 0 = d.In this case, we have x 0 = arg min x∈Y J i=k dp C (x i , x).