1 Introduction

In algebraic geometry, the geometry of zero sets of systems of polynomials, known as algebraic varieties, is studied using commutative algebra. Tropical geometry is a variant of this field where the polynomials are defined by the tropical algebra: the tropical sum of two elements is their maximum and the tropical product is their usual sum. Mathematical objects such as functions and curves evaluated under the tropical algebra are piecewise linear structures, and tropical varieties are polyhedral complexes. Tropical geometry is an important tool for the study of classical algebraic varieties due to many theoretical coincidences between the two settings. In addition, tropical geometry possesses the advantage of computational tractability and efficiency, as well as connections to other applied sciences. For example, it has been used in optimization theory [41], dynamic programming in computer science [30], and economics and game theory [26]. An application of tropical geometry that has gained much interest is the tropical geometric representation of the space of phylogenetic trees. In particular, there has been active recent work on using tropical geometry as a data analytic tool for sets of phylogenetic trees [33, 37, 45, 50]. In this paper, we study the tropical projective torus, which is the ambient space of phylogenetic trees, and build upon it to provide a set of tools for statistical, probabilistic, and geometric studies using optimal transport theory.

Optimal transport theory arises from a question posed in economics, specifically in the allocation of resources: how to optimize the transport of resources that must be geographically displaced. Its mathematical formulation was established in the 18th century and has been well studied since, resulting in strong connections and mutual implications between the domains of dynamical systems and geometry. It has also provided important results in applications and computational fields, such as computer science. An important concept arising from optimal transport is that of the Wasserstein distances, which are metrics on probability distributions. Intuitively, they measure the effort required to recover the probability mass of one distribution in terms of an efficient reconfiguration of the other. As such, Wasserstein distances broaden the scope of optimal transport theory to probability theory. Additionally, they have been exploited to move further beyond these realms to solve concrete problems in inferential statistics, such as in Panaretos and Zemel [39]. Establishing Wasserstein distances in tropical geometric settings provides a setting for the study of probability measures and distributions, so that a vast body of existing results in these related fields becomes applicable to statistical inference and data analysis in applied tropical geometry. Additionally, it provides an alternative mechanism to study geometric aspects of tropical objects and spaces.

Connecting algebraic theory to optimal transport theory is a new direction of research with very recent contributions involving algebraic geometry and algebraic topology. In Çelik et al. [7], the Wasserstein distance between a probability distribution and an algebraic variety is minimized via transportation polytopes. In topological data analysis, where algebraic topology is leveraged to reduce the dimensionality of complex data spaces and extract shape features within the data, optimal transport theory has improved computational efficiency [20] and has also been used to study geometric aspects of algebraic topological invariants [10]. A transportation problem (distinct from the optimal transport setting) has previously been considered in tropical geometry by Richter-Gebert et al. [41]. Our work in this paper presents the first connection between tropical geometry and optimal transport theory. Specifically, we consider an infinite metric measure space in a continuous tropical geometric setting endowed with a combinatorial ground metric. Numerical computations of optimal transport with various ground metrics have recently been studied in the continuous setting and shown to be efficient [5, 25]. Additionally, studying the optimal transport problem provides a computational framework for the probability density space, which also encodes the geometry of the sample space [21, 35, 36, 48]. In solving the optimal transport problem, we thus define tropical Wasserstein distances and provide algorithms to compute them. Collectively, these results offer tools for probabilistic, statistical, and geometric inference in a tropical geometric setting, which then may be translated to other applications where tropical geometry plays an important computational and interpretive role.

The remainder of this paper is organized as follows. Section 2 gives an overview of tropical geometry and the tropical projective torus as our ground space of interest. We present and review properties of the tropical metric, which endows this space with a metric structure; we also give some variational forms for the tropical metric. Section 3 overviews the problem of optimal transport and the role of the Wasserstein distances in this framework. We then define the tropical Wasserstein-p distance, with the tropical metric as the ground metric and the tropical projective torus as the ground space; we also give variational forms of the tropical Wasserstein distance. We study the specific cases of \(p=1\) and 2: the \(p=1\) case gives a method for computing all infinitely many tropical geodesics, while in the case of \(p=2\), the Wasserstein metric is amenable to statistical analysis by providing an inner product structure on probability measures on the tropical projective torus. Section 4 gives algorithms to explicitly compute the tropical Wasserstein-p distances, while Sect. 5 presents the results of several numerical experiments implementing our proposed algorithms. We close the paper with a discussion in Sect. 6 on future research stemming from the work presented in this paper.

2 Tropical geometry, the tropical projective torus, and the tropical metric

In this section, we give the basics of tropical geometry that are relevant for our work. We then present the tropical projective torus as our ground space of interest, and the tropical metric as the ground metric on this space. We also give alternative versions of the metric in terms of variational forms. This is the metric with respect to which we will define the tropical optimal transport problem and the tropical Wasserstein-p distances.

2.1 Essentials of tropical geometry: tropical algebra

Tropical geometry may be seen as a subdiscipline of algebraic geometry. In the latter, the zero sets of systems of polynomial equations are studied using algebraic methods; in the former, these polynomials are defined via the tropical semiring, \(({{\mathbb {R}}}\cup \{-\infty \}, \boxplus , \odot )\) where addition between two elements is given by their max and multiplication is given by their sum:

$$\begin{aligned} a \boxplus b&:= \max (a,b),\\ a \odot b&:= a + b. \end{aligned}$$

Notice that tropical subtraction is not defined, so the result is a semiring rather than a ring. Both operations of the semiring are commutative and associative; multiplication distributes over addition. Tropicalization refers to replacing classical arithmetic operations with their tropical counterparts. Using these operations, lines, polynomials, and other more general mathematical constructions can be built, which result in “skeletal” piecewise linear structures.
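The two operations can be illustrated with a minimal Python sketch. The function names `trop_add`, `trop_mul`, and `trop_poly` are hypothetical helpers introduced here for illustration only; they are not part of any library or of the algorithms in this paper.

```python
import math

NEG_INF = -math.inf  # identity element for tropical addition


def trop_add(a, b):
    """Tropical sum: the maximum of the two arguments."""
    return max(a, b)


def trop_mul(a, b):
    """Tropical product: the classical sum of the two arguments."""
    return a + b


def trop_poly(coeffs, x):
    """Evaluate the tropical polynomial with coefficients c_0, c_1, ... at x.

    The tropical power x^k is k * x, so the value is max_k (c_k + k * x),
    which is a piecewise linear function of x.
    """
    value = NEG_INF
    for k, c in enumerate(coeffs):
        value = trop_add(value, trop_mul(c, k * x))
    return value


a, b, c = 3.0, -1.0, 2.5
# Distributivity: both sides of a * (b + c) = (a * b) + (a * c), read tropically.
print(trop_mul(a, trop_add(b, c)), trop_add(trop_mul(a, b), trop_mul(a, c)))
# The tropical polynomial with coefficients (1, 0) is the function max(1, x).
print([trop_poly([1.0, 0.0], x) for x in (-2.0, 0.5, 2.0)])
```

The last line shows the piecewise linear behavior mentioned above: the function is constant until x exceeds 1 and linear afterwards.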

2.2 The tropical projective torus

Tropical geometry naturally gives rise to polyhedral structures. The interplay between algebraic geometry and polyhedral geometry results in new interpretations of important concepts which form the building blocks for the study of tropical geometry.

An important example is the reinterpretation of a fundamental object in computational algebraic geometry: the Gröbner basis. A Gröbner basis is a particular generating set of an ideal in a polynomial ring over a field; computing Gröbner bases is one of the main approaches in solving systems of polynomials, which is a central problem in algebraic geometry. Reinterpreting Gröbner bases using valuations (functions on a field that give a notion of the size of its elements) gives rise to Gröbner complexes. Gröbner complexes lead to universal Gröbner bases, which are analogs of tropical bases; see Maclagan and Sturmfels [30] for full details of this construction. The Gröbner complex is thus a fundamental object in tropical geometry; it is a polyhedral complex constructed for a homogeneous ideal in the polynomial ring \(K[x_0, x_1, \ldots , x_n]\) over a field K. The ambient space of a Gröbner complex is the tropical projective torus, denoted by \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\). In this paper, we consider the tropical projective torus as our ground space of interest.

The tropical projective torus is the quotient space that identifies vectors differing from each other by tropical scalar multiplication (or classical addition). It is generated by the following equivalence relation \(\sim \) on \({{\mathbb {R}}}^{n+1}\):

$$\begin{aligned} x\sim y \Leftrightarrow x_{1} - y_{1} = x_{2} - y_{2} = \cdots = x_{n+1} - y_{n+1}. \end{aligned}$$

Mathematically, \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\) is constructed in the same manner as the complex torus: take a lattice \(\varLambda \subset {\mathbb {C}}^{n+1}\), with \({\mathbb {C}}^{n+1}\) regarded as a real vector space; then the complex torus is \({\mathbb {C}}^{n+1}/\varLambda \). For \(x\in {{\mathbb {R}}}^{n+1}\), let \({\bar{x}}\) be its image in \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\). The tropical projective torus identifies with \({{\mathbb {R}}}^n\) by taking representatives of the equivalence classes whose last coordinate is zero:

$$\begin{aligned} {\bar{x}} \mapsto (x_{1} - x_{n+1}, \, x_{2} - x_{n+1}, \ldots ,\, x_{n} - x_{n+1}). \end{aligned}$$
(1)

We denote an element in \({{\mathbb {R}}}^{n+1}\) by x, an element in \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\) by \({{\bar{x}}}\), and an element in \({{\mathbb {R}}}^n\) by \(\mathbf{{x}}= (x_1 - x_{n+1},\, \ldots ,\, x_n - x_{n+1})\)—which is the image of \({\bar{x}}\) in \({{\mathbb {R}}}^n\).
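The identification (1) can be sketched in a few lines of Python. The helper name `to_representative` is an assumption made for this illustration: it subtracts the last coordinate and drops it, so two vectors that differ by a multiple of the all-ones vector are sent to the same point of \({{\mathbb {R}}}^n\).

```python
def to_representative(x):
    """Map x in R^{n+1} to the point of R^n given by (1):
    subtract the last coordinate from the others and drop it."""
    last = x[-1]
    return [xi - last for xi in x[:-1]]


x = [4.0, 1.0, 3.0]
# Adding the same constant to every coordinate stays in the same class.
shifted = [xi + 7.0 for xi in x]

print(to_representative(x))
print(to_representative(shifted))
```

Both calls print the same list, reflecting that `x` and `shifted` represent the same element of the tropical projective torus.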

The Space of Phylogenetic Trees. One important practical example that arises in the tropical projective torus is the space of phylogenetic trees, \({\mathcal {T}}_N\) (where N is the fixed number of leaves in a tree). Speyer and Sturmfels [44] identify an equivalence between the space of all phylogenetic trees and a tropical geometric space via a homeomorphism [28, 30, 33]. The space of phylogenetic trees is contained within the tropical projective torus. In other words, the tropical projective torus is also the ambient space of phylogenetic trees. Although the space of phylogenetic trees is a proper subset of the tropical projective torus, it possesses a very complex structure that is not yet well understood. In particular, it is connected and possesses a polyhedral structure, but is not convex [28, 33]. Additionally, trees are defined by a specific combinatorial condition, which makes it difficult to characterize the space of phylogenetic trees precisely and to establish its boundary within the tropical projective torus. The dimension of tree space is also lower than that of the tropical projective torus: its dimension grows linearly in the number of leaves in a tree, while for the tropical projective torus, the dimension grows quadratically.

2.3 The tropical metric

The tropical projective torus \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\) becomes a metric space when endowed with a generalized Hilbert projective metric function [1, 9], which is a combinatorial metric that is tropical in nature. It has been referred to as the tropical metric in recent literature [28, 33]. Our work here is based on the ambient tree space given by the tropical projective torus endowed with the tropical metric.

Definition 1

For a point \(x \in {{\mathbb {R}}}^{n+1}\), denote its coordinates by \(x_1, x_2, \ldots , x_{n+1}\) and its representation in the tropical projective torus \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\) by \({\bar{x}}\). The tropical metric on \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\) is given by

$$\begin{aligned} d_{\mathrm {tr}}({\bar{x}}, {\bar{y}})&:= \max _{1\le i\le n+1}(x_{i} - y_{i}) - \min _{1 \le i\le n+1}(x_{i} - y_{i}). \end{aligned}$$

When considering the representatives of the equivalence classes as in (1), the tropical metric translates to the following between \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\) and \({{\mathbb {R}}}^n\): for \({\bar{x}}, {\bar{y}} \in {{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\) and \(\mathbf{{x}}, \mathbf{{y}}\in {{\mathbb {R}}}^n\),

$$\begin{aligned} d_{\mathrm {tr}}({\bar{x}}, {\bar{y}}) = \max \Big \{\max _{1\le i < j \le n}\big |(\mathbf{{x}}_i - \mathbf{{y}}_i)-(\mathbf{{x}}_j - \mathbf{{y}}_j)\big |,\, \max _{1\le i \le n}|\mathbf{{x}}_i - \mathbf{{y}}_i|\Big \} =: d_{\mathrm {tr}}(\mathbf{{x}}, \mathbf{{y}}). \end{aligned}$$

Figure 1 illustrates the relationship where \({{\mathbb {R}}}^{n+1}\) identifies with \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\) by the equivalence relation \(\sim \); \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\) then embeds into \({{\mathbb {R}}}^{n}\). The metric \(d_{\mathrm {tr}}\) is defined on \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\) and has a representation in \({{\mathbb {R}}}^{n}\); it is an isometry from \({{\mathbb {R}}}^{n+1}\) to \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\) to \({{\mathbb {R}}}^{n}\). Again, recall the notation that an element in \({{\mathbb {R}}}^{n+1}\) is denoted by x, an element in \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\) is denoted by \({{\bar{x}}}\), and an element in \({{\mathbb {R}}}^n\) is denoted by \(\mathbf{{x}}= (x_1 - x_{n+1}, \ldots , x_n - x_{n+1})\).

Lemma 1

On \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\), we have the following alternate expression for the tropical metric:

$$\begin{aligned} d_{\mathrm {tr}}({\bar{x}}, {\bar{y}}) = \max _{1\le i\le j\le n+1}{\left| (x_{i} - y_{i}) - (x_{j} - y_{j}) \right| }. \end{aligned}$$

Proposition 1

[33, Proposition 17] \(d_{\mathrm {tr}}(\cdot ,\cdot )\) is a well-defined metric function on \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\).
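As a numerical sanity check, the three expressions for the tropical metric (Definition 1, the expression on \({{\mathbb {R}}}^n\), and Lemma 1) can be compared on a small example. The following is a minimal sketch; the function names are hypothetical and the sample points are chosen only for illustration.

```python
from itertools import combinations


def d_tr(x, y):
    """Definition 1: max minus min of the coordinatewise differences in R^{n+1}."""
    diffs = [xi - yi for xi, yi in zip(x, y)]
    return max(diffs) - min(diffs)


def d_tr_pairwise(x, y):
    """Lemma 1: largest absolute gap between two coordinatewise differences."""
    diffs = [xi - yi for xi, yi in zip(x, y)]
    return max((abs(a - b) for a, b in combinations(diffs, 2)), default=0.0)


def d_tr_affine(xr, yr):
    """Expression on R^n, for representatives obtained through the map (1)."""
    diffs = [a - b for a, b in zip(xr, yr)]
    pair = max((abs(a - b) for a, b in combinations(diffs, 2)), default=0.0)
    return max(pair, max(abs(d) for d in diffs))


def to_representative(x):
    """The map (1): subtract the last coordinate and drop it."""
    return [xi - x[-1] for xi in x[:-1]]


x, y, z = [4.0, 1.0, 3.0], [0.0, 2.0, -1.0], [1.0, 1.0, 1.0]
print(d_tr(x, y), d_tr_pairwise(x, y),
      d_tr_affine(to_representative(x), to_representative(y)))
# Well-definedness on classes: shifting x by a constant does not change the distance.
print(d_tr([xi + 7.0 for xi in x], y))
# Triangle inequality on this sample.
print(d_tr(x, y) <= d_tr(x, z) + d_tr(z, y))
```

A check on a handful of points is of course no substitute for the proof cited in Proposition 1; it only illustrates that the formulas agree.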

Fig. 1

Diagram illustrating embedding of and relationships between Euclidean spaces and the tropical projective torus. The dashed arrow represents the isometry of the tropical metric between all three spaces

2.4 Variational forms of the tropical metric

It turns out that the tropical metric may be considered in terms of unknown functions and corresponding differential equations, which provides an alternative formulation for the tropical metric in terms of a variational form. Variational forms are useful in computational studies, since numerically, it is often easier to find solutions to variational problems rather than differential equations. As we will see further on, this turns out to be an important advantage in explicit computations of the tropical Wasserstein distances and associated results.

Notation. We use the \(+\) and \(-\) superscript notation as follows:

$$\begin{aligned} (\cdot )^+&:= \max (\cdot , 0),\\ (\cdot )^-&:= \min (\cdot , 0). \end{aligned}$$

Proposition 2

For \({\bar{x}}, {\bar{y}}\in {{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\), we have

$$\begin{aligned} d_{\mathrm {tr}}({\bar{x}}, {\bar{y}}) = \left( \begin{aligned} \text {minimize } \quad&\int _{0}^{1} L_{\mathrm {tr}}\big ({\mathbf {v}}(t) \big ) dt, \\ \text {subject to:} \quad&\frac{\text {d}\mathbf{z}}{\text {d}t}=\mathbf{{v}}(t),\,\, \mathbf{z}(0) = \mathbf{{x}},\,\, \mathbf{z}(1) = \mathbf{{y}}\end{aligned} \right) , \end{aligned}$$
(2)

where \(\mathbf{v,z}: [0,1] \rightarrow {{\mathbb {R}}}^{n}\) and we define the tropical Lagrangian \(L_{\mathrm {tr}}(\cdot )\) as the tropical norm for \({\mathbf {a}} \in {{\mathbb {R}}}^{n}\) as follows:

$$\begin{aligned} \begin{aligned} L_{\mathrm {tr}}(\mathbf{a})&= \Vert \mathbf{a} \Vert _{\mathrm {tr}} = \max \Big (\max _{1\le i\le n}({\mathbf {a}}_{i}),0\Big ) - \min \Big (\min _{1\le i\le n}({\mathbf {a}}_{i}),0\Big )\\&= {\max _{1\le i\le n}({\mathbf {a}}_{i})}^+ - {\min _{1\le i\le n}({\mathbf {a}}_{i})}^- \quad \forall ~{\mathbf {a}} \in {{\mathbb {R}}}^{n}. \end{aligned} \end{aligned}$$
(3)

Proof

Let \(D = \{\mathbf{{x}}_{i}-\mathbf{{y}}_{i}\mid 1\le i\le n\} \cup \{0\}\). By Definition 1, \(d_{\mathrm {tr}}({\bar{x}}, {\bar{y}}) = \max (D) - \min (D)\). Hence

$$\begin{aligned} d_{\mathrm {tr}}({\bar{x}}, {\bar{y}}) = d_{\mathrm {tr}}({\bar{y}}, {\bar{x}}) = {\max _{1\le i\le n}(\mathbf{{x}}_{i}-\mathbf{{y}}_{i})}^+ - {\min _{1\le i\le n}(\mathbf{{x}}_{i}-\mathbf{{y}}_{i})}^-. \end{aligned}$$

First, let \(\mathbf{z}(t) = t\cdot \mathbf{{y}}+ (1-t) \cdot \mathbf{{x}}\), then \(\mathbf{{v}}(t)\) is the constant vector \(\mathbf{{y}}- \mathbf{{x}}\), and the integral \(\int _{0}^{1}{\Vert {\mathbf {v}}(t) \Vert _{\mathrm {tr}} dt}\) becomes \(L_{\mathrm {tr}}(\mathbf{{y}}- \mathbf{{x}}) = d_{\mathrm {tr}}({\bar{x}},{\bar{y}})\). Second, in order to show that

$$\begin{aligned} \int _{0}^{1}{\Vert {\mathbf {v}}(t) \Vert _{\mathrm {tr}} dt} \ge d_{\mathrm {tr}}({\bar{x}},{\bar{y}}), \end{aligned}$$

it suffices to show that the integral is always no less than any of \(|\mathbf{{y}}_{i} - \mathbf{{x}}_{i}|\) and \(\left| \left( \mathbf{{y}}_{i} - \mathbf{{x}}_{i}\right) - \left( \mathbf{{y}}_{j} - \mathbf{{x}}_{j}\right) \right| \) where \(1\le i,j\le n\).

For \(1\le i\le n\), by definition of \(L_{\mathrm {tr}}\) we have

$$\begin{aligned} \Vert {\mathbf {v}}(t)\Vert _{\mathrm {tr}} \ge |\mathbf{{v}}(t)_{i}-0| = |\mathbf{{v}}(t)_{i}|. \end{aligned}$$

Now consider the function \(f_{i}: [0,1] \rightarrow {{\mathbb {R}}}\) given by \(f_{i}(t) = {\mathbf {z}}(t)_{i}\). Then \(\displaystyle \mathbf{{v}}(t)_{i} = \frac{df_{i}}{dt}(t)\), which gives

$$\begin{aligned} \int _{0}^{1}{\mathbf{{v}}(t)_{i} dt} = f_{i}(1) - f_{i}(0) = \mathbf{{y}}_{i} - \mathbf{{x}}_{i} \end{aligned}$$
(4)

and

$$\begin{aligned} \int _{0}^{1}{\Vert {\mathbf {v}}(t) \Vert _{\mathrm {tr}} dt} \ge \int _{0}^{1}{|\mathbf{{v}}(t)_{i}| dt} \ge \left| \int _{0}^{1}{\mathbf{{v}}(t)_{i} dt} \right| = \left| \mathbf{{y}}_{i} - \mathbf{{x}}_{i}\right| . \end{aligned}$$

Similarly, for any \(1\le i,j\le n\), by definition of \(L_{\mathrm {tr}}\), we have

$$\begin{aligned} \Vert {\mathbf {v}}(t) \Vert _{\mathrm {tr}} \ge |\mathbf{{v}}(t)_{i} - \mathbf{{v}}(t)_{j}|. \end{aligned}$$

By (4), we get

$$\begin{aligned} \int _{0}^{1}{\Vert {\mathbf {v}}(t) \Vert _{\mathrm {tr}} dt}&\ge \int _{0}^{1}{|\mathbf{{v}}(t)_{i} - \mathbf{{v}}(t)_{j}| dt} \ge \left| \int _{0}^{1}{\left( \mathbf{{v}}(t)_{i} - \mathbf{{v}}(t)_{j}\right) dt}\right| \\&= \left| \left( \mathbf{{y}}_{i} - \mathbf{{x}}_{i}\right) - \left( \mathbf{{y}}_{j} - \mathbf{{x}}_{j}\right) \right| . \end{aligned}$$

\(\square \)
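The statement of Proposition 2 can be explored numerically for piecewise linear paths. On each linear segment the velocity is constant, and since \(L_{\mathrm {tr}}\) is positively homogeneous, the segment contributes the tropical norm of its increment to the integral in (2). The sketch below relies on this observation; the helper names and the sample points are assumptions made for illustration.

```python
def trop_norm(a):
    """Tropical Lagrangian (3): max(max_i a_i, 0) - min(min_i a_i, 0)."""
    return max(max(a), 0.0) - min(min(a), 0.0)


def polyline_cost(points):
    """Value of the integral in (2) for a path that is linear between
    consecutive points: the sum of the tropical norms of the increments."""
    total = 0.0
    for p, q in zip(points, points[1:]):
        total += trop_norm([qi - pi for pi, qi in zip(p, q)])
    return total


x, y = [0.0, 0.0], [2.0, 1.0]

straight = polyline_cost([x, y])                # straight segment from x to y
bent = polyline_cost([x, [1.0, 1.0], y])        # a different path with the same cost
detour = polyline_cost([x, [0.0, 3.0], y])      # a strictly longer path

print(straight, bent, detour)
```

In this example the straight segment attains the minimum, a second path through the point (1, 1) attains it as well, and the detour is strictly more costly. The second path shows that minimizers of (2) need not be unique.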

Example 1

When \(n=2\),

$$\begin{aligned} L_{\mathrm {tr}}(\mathbf{a}) = {\left\{ \begin{array}{ll} a_{1}, &{} \text { if } a_{1}\ge a_{2}\ge 0; \\ a_{2}, &{} \text { if } a_{2}\ge a_{1}\ge 0; \\ -a_{1}, &{} \text { if } 0 \ge a_{2}\ge a_{1}; \\ -a_{2}, &{} \text { if } 0 \ge a_{1}\ge a_{2}; \\ a_{1}-a_{2}, &{} \text { if } a_{1}\ge 0 \ge a_{2}; \\ a_{2}-a_{1}, &{} \text { if } a_{2}\ge 0 \ge a_{1}. \end{array}\right. } \end{aligned}$$
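The case distinction in Example 1 can be compared with the general formula (3) on a grid of points. This is only a sketch: `trop_norm_cases` is a hypothetical name for a direct transcription of the six cases, and the grid is an arbitrary choice.

```python
def trop_norm(a):
    """General formula (3) for the tropical Lagrangian."""
    return max(max(a), 0.0) - min(min(a), 0.0)


def trop_norm_cases(a1, a2):
    """Piecewise formula of Example 1 for n = 2."""
    if a1 >= a2 >= 0:
        return a1
    if a2 >= a1 >= 0:
        return a2
    if 0 >= a2 >= a1:
        return -a1
    if 0 >= a1 >= a2:
        return -a2
    if a1 >= 0 >= a2:
        return a1 - a2
    return a2 - a1  # remaining ordering: a2 >= 0 >= a1


# Half-integer grid on [-3, 3]^2; these values are exact in floating point.
grid = [k / 2 for k in range(-6, 7)]
mismatches = [(a1, a2) for a1 in grid for a2 in grid
              if trop_norm([a1, a2]) != trop_norm_cases(a1, a2)]
print(len(mismatches))
```

The six cases correspond to the six possible orderings of \(a_1\), \(a_2\), and 0, which is why the final branch needs no explicit condition.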

The above variational form (2) of \(d_{\mathrm {tr}}(\cdot ,\cdot )\) may be further generalized as follows.

Corollary 1

For \({\bar{x}}, {\bar{y}}\in {{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\), let \(L_{\mathrm {tr}}\) be the same as in Proposition 2. For \(p>1\), we have

$$\begin{aligned} d_{\mathrm {tr}}({\bar{x}}, {\bar{y}}) = \left( \begin{aligned} \text {minimize } \quad&\left( \int _{0}^{1}{L_{\mathrm {tr}}\big ( {\mathbf {v}}(t) \big )^{p}}dt\right) ^{\frac{1}{p}} \, \\ \text {subject to:} \quad&\frac{\text {d}\mathbf{z}}{\text {d}t}=\mathbf{{v}}(t),\,\, \mathbf{z}(0) = \mathbf{{x}},\,\, \mathbf{z}(1) = \mathbf{{y}}\end{aligned} \right) . \end{aligned}$$
(5)

Proof

When \(\mathbf{z}(t) = t\cdot \mathbf{{y}}+(1-t)\cdot \mathbf{{x}}\), \(\mathbf{{v}}(t)\) is still the constant \(\mathbf{{y}}- \mathbf{{x}}\) and the equality still holds. In addition, by the Hölder inequality,

$$\begin{aligned} \left( \int _{0}^{1}{\big \Vert {\mathbf {v}}(t) \big \Vert _{\mathrm {tr}}^{p} dt}\right) ^{\frac{1}{p}} \ge \int _{0}^{1}{\big | \Vert {\mathbf {v}}(t) \Vert _{\mathrm {tr}} \big | dt}. \end{aligned}$$

Hence for any \(\mathbf{z}:[0,1]\rightarrow {{\mathbb {R}}}^{n}\) and \(\mathbf{v}(t) = \frac{d\mathbf{z}}{dt}\),

$$\begin{aligned} \left( \int _{0}^{1}{\big \Vert {\mathbf {v}}(t) \big \Vert _{\mathrm {tr}}^{p} dt}\right) ^{\frac{1}{p}} \ge d_{\mathrm {tr}}(\mathbf{{x}},\mathbf{{y}}), \end{aligned}$$

as in Definition 1. \(\square \)

3 Optimal transport and the tropical Wasserstein-p distances

We now give a brief background on and a description of the problem of optimal transport; we also formally present the setting of the optimal transport problem specific to our work.

The question underlying the theory of optimal transport can be posed in a very basic and intuitive manner as follows: What is the most efficient way to move a given pile of dirt from one location to another? The total volume of the dirt must remain intact, but the shape and form of the pile may change during transportation and arrive at its location in a differently shaped pile. This problem has been recast mathematically in various formulations with various assumptions. There is a vast literature of historical as well as technical aspects and perspectives on the optimal transport problem; see for example Ambrosio and Gigli [2], Villani [47, 48] for detailed discussions.

3.1 Optimal transport and probability

Adapting the intuitive description of the optimal transport problem above to a more mathematically formal setting, we may view the pile of dirt as a probability measure to be transported over a space—or alternatively, one probability distribution to be transformed into another—which gives us a probabilistic and statistical perspective on the problem.

A key factor in solving the optimal transport problem is the cost function, which gives the cost of moving the pile of dirt, or of transporting the probability measure. Mathematically, this is generally a function of two variables (an origin or “start” location and a destination or “end” location) that takes values in the nonnegative real line and may take into account any number of factors. In the simplest case, however, when the cost of moving the pile of dirt from its origin to destination is nothing more than the distance between the origin and destination, the solution to the optimal transport problem yields the Wasserstein distance (for a fixed dimension). Intuitively, the Wasserstein distance gives the minimum cost of transforming one probability distribution into another. This minimum cost is simply the “amount of dirt” to be transported, multiplied by the mean distance it must be moved. In the case of probability distributions that contain a total mass of 1, the minimum cost is therefore simply the mean distance it must be moved. More precisely, the Wasserstein distance is a distance function for probability distributions defined on a given metric space, referred to as the ground space; the associated metric is referred to as the ground metric. These concepts are formalized further on in Definition 3. The Wasserstein distance is thus a useful tool for comparing distributions.

Specific Setting. In our work, the ground space is the tropical projective torus and the ground metric is the tropical metric. We consider the set of all probability measures on the tropical projective torus, which exist and are well-defined [33], as a space. This work defines and constructs Wasserstein distances as a metric on these probability measures associated with the tropical projective torus. Figure 2 provides a conceptual illustration of the relationship between the ground space, equipped with a ground metric, and the Wasserstein space of probability measures over the ground space, equipped with the Wasserstein distance.

Fig. 2

Illustrative figure of the relationship between the ground space and the Wasserstein space of probability measures. Here, the plane below depicts the tropical projective torus \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\), which is the ground space; it is equipped with the tropical metric. This space admits well-defined probability measures [33]. Collecting these probability measures as a separate space yields the space of probability measures on \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\); in this figure, it is depicted in the manifold above. This space can be equipped with a particular metric, the Wasserstein distance. The Wasserstein distance is therefore defined on the space of probability measures on \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\); it measures distances between probability measures on the tropical projective torus. In this illustrative figure, we also show the space of phylogenetic trees with N leaves, \({\mathcal {T}}_N\), as a figurative proper non-convex subset of the tropical projective torus \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\). The probability measures associated with this specific subset of \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\) are depicted in the Wasserstein space of probability measures above, which is also non-convex (see Remark 3)

Wasserstein Distances as Metrics Between Probability Distributions. Although other metrics for probability distributions exist in the literature on mathematical statistics, the Wasserstein distance possesses desirable computational and intuitive properties. To illustrate a few such properties, let us consider random variables X and Y defined on \({{\mathbb {R}}}^d\) distributed as \(X \sim P\) and \(Y \sim Q\) with densities p and q, respectively. Three commonly used measures for distances between P and Q are total variation, \(\frac{1}{2}\int |p-q|\); Hellinger, \(\sqrt{\int (\sqrt{p} - \sqrt{q})^2}\); and \(L_2\), \(\int (p-q)^2\).

When comparing a discrete distribution with a continuous one, these distances yield results that are not very informative. Let P be uniform on [0, 1], and let Q be uniform on \(\{0,\, 1/n,\, 2/n,\, \ldots ,\, 1\}\). The total variation distance between these distributions is 1, which equals the total mass of each distribution and is the largest value the total variation distance can take, while the Wasserstein distance is at most 1/n and therefore vanishes as n grows.
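The comparison can be made concrete for the Wasserstein-1 distance, which in one dimension equals the integral over \(u \in (0,1)\) of the absolute difference of the two quantile functions. The sketch below evaluates this integral exactly for the uniform distribution on [0, 1] and the uniform distribution on the grid \(\{0, 1/n, \ldots , 1\}\); the function names are hypothetical, and the choice \(p = 1\) is an assumption made for the illustration.

```python
def abs_integral(c, a, b):
    """Exact integral of |u - c| over the interval [a, b], with a <= b."""
    if c <= a:
        return ((b - c) ** 2 - (a - c) ** 2) / 2
    if c >= b:
        return ((c - a) ** 2 - (c - b) ** 2) / 2
    return ((c - a) ** 2 + (b - c) ** 2) / 2


def w1_uniform_vs_grid(n):
    """Wasserstein-1 distance between Uniform[0, 1] and the uniform
    distribution on {0, 1/n, ..., 1}, via the quantile-function formula."""
    m = n + 1  # number of atoms, each carrying mass 1/m
    total = 0.0
    for k in range(m):
        # On (k/m, (k+1)/m] the discrete quantile function equals k/n,
        # while the quantile function of Uniform[0, 1] is u itself.
        total += abs_integral(k / n, k / m, (k + 1) / m)
    return total


for n in (1, 2, 5, 10, 100):
    print(n, w1_uniform_vs_grid(n), 1 / n)
```

The total variation distance is not computed here: it equals 1 for every n, because Q is supported on a finite set to which P assigns probability zero. The printed Wasserstein-1 values, in contrast, stay below 1/n and shrink as n grows.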

These distances also do not take into account the underlying geometry of the space on which the distributions are defined. Consider the three densities \(p_1\), \(p_2\) and \(p_3\) shown in Fig. 3. We have

$$\begin{aligned} \int |p_1 - p_2| = \int |p_1 - p_3| = \int |p_2 - p_3|, \end{aligned}$$

and similar results hold for the Hellinger and \(L_2\) distances. Intuitively, however, we would like to think of \(p_1\) and \(p_2\) as being more similar, and hence closer to each other, than either is to \(p_3\). The Wasserstein distance is able to make this distinction.
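This phenomenon can be reproduced with three simple histograms on a common grid. The densities below are hypothetical stand-ins with pairwise disjoint supports and are not the densities shown in Fig. 3. For histograms on a grid with unit spacing, the Wasserstein-1 distance is the sum of the absolute differences of the cumulative distribution functions.

```python
from itertools import accumulate


def total_variation(p, q):
    """Total variation distance between two histograms on the same grid."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def wasserstein1(p, q):
    """Wasserstein-1 distance between histograms on a grid with unit spacing:
    the sum of the absolute differences of the cumulative sums."""
    cdf_gap = accumulate(a - b for a, b in zip(p, q))
    return sum(abs(g) for g in cdf_gap)


def bump(start, size=12):
    """Hypothetical density with mass 1/2 on bins start and start + 1."""
    return [0.5 if i in (start, start + 1) else 0.0 for i in range(size)]


p1, p2, p3 = bump(0), bump(2), bump(10)

for name, (p, q) in {"p1,p2": (p1, p2), "p1,p3": (p1, p3), "p2,p3": (p2, p3)}.items():
    print(name, total_variation(p, q), wasserstein1(p, q))
```

All three total variation distances coincide because the supports are disjoint, whereas the Wasserstein-1 distance between `p1` and `p2` is much smaller than the distance from either of them to `p3`.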

In computing a distance between distributions, we arrive at some measure of their similarity or dissimilarity, but the total variation, Hellinger, \(L_2\), and other distances do not provide any information on how or why the distributions are qualitatively different. Perhaps the most helpful property of the Wasserstein distance is that, in addition to a measure of distance between the distributions, we also obtain a map that describes how P morphs into Q. This map is known as a transport plan.

In addition to the illustrative examples discussed above, there are other desirable computational and statistical properties of the Wasserstein distance, such as stability to small perturbations and a well-behaved and intuitive Wasserstein Fréchet mean. Further details and more complete discussions on statistical aspects of the Wasserstein distance can be found in Panaretos and Zemel [38] and Wasserman [49].

Aside from statistical aspects, there also exist other analytic advantages of the Wasserstein distances, depending on the context. For instance, the intimate connection of the Wasserstein distances to optimal transport problems inherently makes them natural tools in these and other settings with foundations in partial differential equations.

Fig. 3

Three example densities \(p_1\), \(p_2\), \(p_3\). This figure appears in Wasserman [49]. The total variation, Hellinger, and \(L_2\) distances between these three densities are the same, while the Wasserstein distance between \(p_1\) and \(p_2\) is smaller than that between either \(p_1\) or \(p_2\) and \(p_3\)

Wasserstein Distances and Phylogenetic Trees. Wasserstein distances have been previously studied in the context of phylogenetic trees. A single tree itself may be treated as a metric space, for instance, by considering genetic distances which measure distances between pairs of sequences on a single tree; the metric here is defined within the tree itself. When considering a single tree, the context is a finite metric space. Wasserstein distances have been defined and studied in these contexts, such as in Evans and Matsen [12], where probability distributions giving rise to individual trees are compared. Kloeckner [18] studies geometric properties of measures on equidistant trees (i.e., rooted trees with equal branch lengths from the root to all leaves) using Wasserstein distances. For finite spaces, Sommerfeld and Munk [43] conduct statistical inference studies for empirical Wasserstein metrics computed from datasets. Very recently, Le et al. [22] studied the sliced formulation of optimal transport (developed to alleviate computational and statistical drawbacks of optimal transport theory) on tree metrics. Sato et al. [42] furthermore propose an extremely fast algorithm that solves the optimal transport problem to compute Wasserstein distances on a tree with one million nodes in less than one second. The settings of these works all differ from the study of Wasserstein distances on the space of phylogenetic trees.

In the context of tree spaces, other probability-based distances between trees have also been proposed [13, 14]. These are related, but are nevertheless strictly different from the notion of distances between probability measures over tree space. The contributions of these works are classical measures between probability distributions on genetic sequences that make up trees, which then induce probabilistic distances between trees, including Hellinger distances and Kullback–Leibler divergences. Kullback–Leibler divergences measure the difference in terms of information gain between models of statistical inference [19]. Outside the scope of interest of this paper, other tree spaces have also been proposed that are not probability-based; an example of a combinatorial construction based on posets that turns out to be related to tree-reconstruction using Markov processes is the edge-product space [15, 34].

3.2 Formalizing the optimal transport problem and defining the Wasserstein-p distances

Monge [32] is largely recognized to have provided the first mathematical formalization of the optimal transport problem described above, while the subsequent probabilistic reinterpretation by Kantorovich [17] led to a fundamental computational breakthrough that seeded the development of linear optimization. As such, the statement of the mathematical optimal transport problem is often referred to as the Monge–Kantorovich transport problem and presented in the setting of measure theory. We now give an overview of this presentation.

Definition 2

Let \(\varOmega \) and \(\varOmega '\) be separable metric spaces that are Radon spaces (that is, any probability measure on each space is a Radon measure). Let \(c: \varOmega \times \varOmega ' \rightarrow [0, \infty ]\) be a Borel-measurable cost function. For \(\rho ^0 \in {\mathscr {P}}(\varOmega )\) and \(\rho ^1 \in {\mathscr {P}}(\varOmega ')\) where \({\mathscr {P}}(\cdot )\) denotes the collection of probability measures on the respective spaces, the Monge–Kantorovich transport problem is to find a probability measure \(\pi \) on \(\varOmega \times \varOmega '\) such that

$$\begin{aligned} \inf \Bigg \{ \int _{\varOmega \times \varOmega '} c(x, y) \mathrm {d}\pi (x,y) \,\, \bigg | \,\, \pi \in \varPi (\rho ^0, \rho ^1) \Bigg \} \end{aligned}$$

is achieved. Here, \(\varPi (\rho ^0, \rho ^1)\) denotes the collection of all probability measures on \(\varOmega \times \varOmega '\) with marginal measures \(\rho ^0\) on \(\varOmega \) and \(\rho ^1\) on \(\varOmega '\).

When the cost function is lower semi-continuous, and given that \(\varOmega \) and \(\varOmega '\) are Radon spaces, \(\varPi (\rho ^0, \rho ^1)\) is tight, and therefore a solution to the Monge–Kantorovich transport problem always exists under these conditions (e.g., [3]). From this formulation, the Wasserstein-p distance may be defined as follows.

Definition 3

Let \((\varOmega ,d)\) be a separable metric Radon space. Let \(p \ge 1\) and \({\mathscr {P}}_p(\varOmega )\) be the collection of all probability measures \(\mu \) on \(\varOmega \) such that \(\mu \) has finite pth moment for some \(\mathbf{{x}}_0 \in \varOmega \); i.e., \(\displaystyle \int _{\varOmega } d(\mathbf{{x}}, \mathbf{{x}}_0)^p \mathrm {d}\mu (\mathbf{{x}}) < +\infty \). The Wasserstein-p distance between probability measures \(\rho ^0, \rho ^1 \in {\mathscr {P}}_p(\varOmega )\) is given by

$$\begin{aligned} W_p&: {\mathscr {P}}_p(\varOmega ) \times {\mathscr {P}}_p(\varOmega ) \rightarrow [0, +\infty )\\ W_p(\rho ^0, \rho ^1)&:= \Bigg ( \inf _{\pi \in \varPi (\rho ^0, \rho ^1)} \int _{\varOmega \times \varOmega } d(\mathbf{{x}},\mathbf{{y}})^p \mathrm {d}\pi (\mathbf{{x}},\mathbf{{y}}) \Bigg )^{1/p}, \end{aligned}$$

where, as before, \(\varPi (\rho ^0, \rho ^1)\) is the collection of all probability measures on \(\varOmega \times \varOmega \) with marginal measures \(\rho ^0\) and \(\rho ^1\) on the respective copies of \(\varOmega \). Equivalently, we have

$$\begin{aligned} W_p(\rho ^0, \rho ^1)^p = \inf \Big \{ {\mathbb {E}}\big [d(X, Y)^p \big ] \Big \}, \end{aligned}$$

where \({\mathbb {E}}[\cdot ]\) denotes the expectation, and the infimum is taken over all joint distributions of random variables X and Y with respective marginals \(\rho ^0\) and \(\rho ^1\). The metric d is referred to as the ground metric; the function \(\pi \) is known as the transport plan.

The transport plan \(\pi (\mathbf{{x}},\mathbf{{y}})\) is a function that describes a way to move the measure \(\rho ^0\) into \(\rho ^1\), and between locations \(\mathbf{{x}}\) and \(\mathbf{{y}}\); transport plans are not unique. Since the total mass moved out of a region around \(\mathbf{{x}}\) must be equal to \(\rho ^0(\mathbf{{x}})\mathrm {d}\mathbf{{x}}\) and the total mass moved into a region around \(\mathbf{{x}}\) must be \(\rho ^1(\mathbf{{x}})\mathrm {d}\mathbf{{x}}\), we have the following restrictions on a transport plan:

$$\begin{aligned} \int _{{\mathbb {R}}^n} \pi (\mathbf{{x}},\mathbf{{x}}')d\mathbf{{x}}'&= \rho ^0(\mathbf{{x}});\\ \int _{{\mathbb {R}}^n} \pi (\mathbf{{x}},\mathbf{{x}}')d\mathbf{{x}}&= \rho ^1(\mathbf{{x}}'). \end{aligned}$$

In other words, \(\pi \) is a joint probability distribution with marginals \(\rho ^0\) and \(\rho ^1\). The total infinitesimal mass which moves from \(\mathbf{{x}}\) to \(\mathbf{{y}}\), therefore, is \(\pi (\mathbf{{x}},\mathbf{{y}}) \mathrm {d}\mathbf{{x}}\mathrm {d}\mathbf{{y}}\) and the cost of moving this amount of mass from \(\mathbf{{x}}\) to \(\mathbf{{y}}\) is \(c(\mathbf{{x}},\mathbf{{y}})\pi (\mathbf{{x}},\mathbf{{y}})\mathrm {d}\mathbf{{x}}\mathrm {d}\mathbf{{y}}\). The total cost is then

$$\begin{aligned} C = \iint c(\mathbf{{x}},\mathbf{{y}})\pi (\mathbf{{x}},\mathbf{{y}})d\mathbf{{x}}d\mathbf{{y}}= \int c(\mathbf{{x}},\mathbf{{y}})\mathrm {d}\pi (\mathbf{{x}},\mathbf{{y}}). \end{aligned}$$

The optimal transport plan is the \(\pi \) which achieves the minimal value of C:

$$\begin{aligned} C^* = \inf _{\pi \in \varPi (\rho ^0, \rho ^1)} \int c(\mathbf{{x}},\mathbf{{y}})\mathrm {d}\pi (\mathbf{{x}},\mathbf{{y}}). \end{aligned}$$

If the cost of a move \(c(\mathbf{{x}},\mathbf{{y}})\) is simply the distance between the two points \(d(\mathbf{{x}},\mathbf{{y}})\), then the optimal cost value \(C^*\) is exactly the Wasserstein-1 distance, \(W_1\).
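To make the infimum over transport plans concrete, the following is a minimal sketch for a very restricted discrete case: two uniform empirical measures with the same number of atoms. In that case an optimal plan can be taken to be a permutation of the atoms, so a brute-force search is enough for tiny inputs. The function names are illustrative, and the ground metric is assumed to be the usual tropical metric \(\max _i(x_i-y_i)-\min _i(x_i-y_i)\) on \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\); this is not the algorithm developed later in the paper.

```python
from itertools import permutations


def tropical_distance(x, y):
    """Assumed ground metric: max_i(x_i - y_i) - min_i(x_i - y_i)."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    return max(diff) - min(diff)


def empirical_wasserstein(xs, ys, p=1, ground=tropical_distance):
    """W_p between uniform empirical measures on xs and ys (equal sizes).

    With uniform weights on equally many atoms, an optimal transport plan
    can be chosen as a permutation, so we enumerate all permutations.
    Only practical for a handful of atoms.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally many atoms")
    n = len(xs)
    best = min(
        sum(ground(xs[i], ys[s[i]]) ** p for i in range(n)) / n
        for s in permutations(range(n))
    )
    return best ** (1.0 / p)


# Two atoms each; coordinates are representatives in R^3 / R(1,1,1).
xs = [(0, 0, 0), (0, 1, 3)]
ys = [(0, 2, 2), (0, 0, 1)]
w1 = empirical_wasserstein(xs, ys, p=1)
w2 = empirical_wasserstein(xs, ys, p=2)
```

For general weights or larger supports, the same objective is a linear program over the coupling matrix and would be handed to an LP solver instead of enumerated.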

Remark 1

In the particular case where \(p=1\), the Wasserstein-1 distance is also referred to as the Kantorovich–Rubinstein distance, and the earth mover’s distance (EMD) in the computer science literature.

Remark 2

The Wasserstein distances satisfy all conditions for a formal definition of a metric (e.g., [48]). If the condition of finite pth moment is relaxed, the Wasserstein distances may technically be infinite, and therefore not a metric in the strict sense.

Remark 3

For any \(p \ge 1\), if \((\varOmega , d)\) is a complete and separable metric space, then so too is \(({\mathscr {P}}_p(\varOmega ), W_p)\) (e.g., [48]). Other geometric properties between the ground space and its associated Wasserstein distance also hold, including compactness, convexity, as well as non-convexity. An adaptation of the Brunn–Minkowski theorem [6, 31] relating volumes of compact and convex sets, as well as its generalization to non-convex sets by Lyusternik [29], for comparative relations between ground and Wasserstein spaces also exists [48]. The geometric implication of these results is that compact, non-convex subsets of the ground space with respect to the ground metric correspond to non-convex subsets in the Wasserstein space of probability measures (with generalized Ricci curvature bounds) over the ground space with respect to the Wasserstein distance.

In the applicative setting of our work concerning the space of phylogenetic trees as a non-convex subset of the tropical projective torus, the implication is that the corresponding space of probability measures associated with the space of phylogenetic trees is also non-convex with respect to the Wasserstein distances. (Compactness of tree space can be established by fixing an upper bound on the height of trees.) This provides a geometric compatibility between the space of phylogenetic trees equipped with the tropical metric and its associated space of probability measures equipped with Wasserstein distances. See Fig. 2 for an illustrative description of this relationship.

3.3 A time-dependent cost function: formulating a Hamiltonian

In formulating the above variational forms of the tropical metric (2) and (5), the notation with respect to t is not by coincidence and purposely alludes to a dependence upon time. Within the setting of Wasserstein distances and their relation to the optimal transport problem where the ground metric is itself the cost function, intuitively, a time-dependent ground metric corresponds to a cost function where time is a cost factor.

Considering time dependence allows for a rich and alternate formulation of the optimal transport problem, which extends to the continuous displacement of measures—precisely the setting of the tropical metric on the tropical projective torus as a continuous metric measure space. However, there are certain instances where continuous displacement problems turn out to be equivalent to steady-state, time-independent problems with an alternate formulation that favors computational efficiency: this occurs when the Lagrangian L is homogeneous of degree 1 and convex.

Lemma 2

The tropical Lagrangian \(L_{\mathrm {tr}}\) defined in (3) is convex on \({{\mathbb {R}}}^n\). More specifically, for \({\mathbf {a}},{\mathbf {b}}\in {{\mathbb {R}}}^{n}\) and \(0\le w\le 1\), we have

$$\begin{aligned} (1-w)\Vert {\mathbf {a}}\Vert _{\mathrm {tr}} + w\Vert {\mathbf {b}}\Vert _{\mathrm {tr}} \ge \Vert (1-w){\mathbf {a}}+w{\mathbf {b}}\Vert _{\mathrm {tr}}. \end{aligned}$$
(6)

Proof

By definition,

$$\begin{aligned} \Vert (1-w){\mathbf {a}}+w{\mathbf {b}}\Vert _{\mathrm {tr}}&= \max \Big (\max _{1\le i\le n}{\big ((1-w)a_{i}+wb_{i}\big )},0\Big )\\&\quad - \min \Big (\min _{1\le i\le n}{\big ((1-w)a_{i}+wb_{i}\big )},0\Big )\\&= {\max _{1\le i\le n}{\big ((1-w)a_{i}+wb_{i}\big )}}^+ - {\min _{1\le i\le n}{\big ((1-w)a_{i}+wb_{i}\big )}}^-. \end{aligned}$$

So either there exist \(1\le j,k\le n\) such that

$$\begin{aligned} \Vert (1-w){\mathbf {a}}+w{\mathbf {b}}\Vert _{\mathrm {tr}} = \big ((1-w)a_{j}+wb_{j}\big ) - \big ((1-w)a_{k}+wb_{k}\big ), \end{aligned}$$

or there exists \(1\le j\le n\) such that

$$\begin{aligned} \Vert (1-w){\mathbf {a}}+w{\mathbf {b}}\Vert _{\mathrm {tr}} = (1-w)a_{j}+wb_{j}. \end{aligned}$$

Note that

$$\begin{aligned}&\big ((1-w)a_{j}+wb_{j}\big ) - \big ((1-w)a_{k}+wb_{k}\big ) \\&\quad = (1-w)\left( a_{j} - a_{k}\right) + w\left( b_{j} - b_{k}\right) \\&\quad \le (1-w)\Big ({\max _{1\le i\le n}{(a_{i})}}^+ - {\min _{1\le i\le n}{(a_{i})}}^-\Big ) + w\Big ({\max _{1\le i\le n}{(b_{i})}}^+ - {\min _{1\le i\le n}{(b_{i})}}^-\Big )\\&\quad =(1-w)\Vert {\mathbf {a}} \Vert _{\mathrm {tr}} + w\Vert {\mathbf {b}}\Vert _{\mathrm {tr}}. \end{aligned}$$

We also have

$$\begin{aligned} (1-w)a_{j} +wb_{j}&\le (1-w)|a_{j}| + w|b_{j}| \\&\le (1\!-\!w)\Big ({\max _{1\le i\le n}{(a_{i})}}^+ \!-\! {\min _{1\le i\le n}{(a_{i})}}^-\Big ) \!+\! w\Big ({\max _{1\le i\le n}{(b_{i})}}^+ \!-\! {\min _{1\le i\le n}{(b_{i})}}^-\Big ) \\&=(1-w)\Vert {\mathbf {a}}\Vert _{\mathrm {tr}} + w\Vert {\mathbf {b}}\Vert _{\mathrm {tr}}. \end{aligned}$$

Hence Lemma 2 holds in either case. \(\square \)
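As a numerical sanity check of inequality (6), and not a substitute for the proof, the sketch below samples random vectors and weights and records the gap between the two sides. The helper names `tropical_norm` and `convexity_gap` are illustrative; the norm follows the expression \(\max (\max _i a_i,0)-\min (\min _i a_i,0)\) used in the proof.

```python
import random


def tropical_norm(a):
    """||a||_tr = max(max_i a_i, 0) - min(min_i a_i, 0)."""
    return max(max(a), 0.0) - min(min(a), 0.0)


def convexity_gap(a, b, w):
    """Left side minus right side of (6); nonnegative when (6) holds."""
    mix = [(1 - w) * ai + w * bi for ai, bi in zip(a, b)]
    return (1 - w) * tropical_norm(a) + w * tropical_norm(b) - tropical_norm(mix)


rng = random.Random(0)  # fixed seed so the check is reproducible
gaps = []
for _ in range(10_000):
    n = rng.randint(1, 6)
    a = [rng.uniform(-5, 5) for _ in range(n)]
    b = [rng.uniform(-5, 5) for _ in range(n)]
    gaps.append(convexity_gap(a, b, rng.random()))
min_gap = min(gaps)  # should not be negative beyond rounding error
```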

Remark 4

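Note that convexity of \(L_{\mathrm {tr}}\) also implies convexity of \(\frac{1}{p}L_{\mathrm {tr}}^{p}\) for \(p\ge 1\), since \(L_{\mathrm {tr}}\ge 0\) and the map \(s\mapsto \frac{1}{p}s^{p}\) is convex and nondecreasing on \([0,\infty )\).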

The convexity of the tropical Lagrangian \(L_{\mathrm {tr}}\) then allows for the formulation of the Hamiltonian [48, Example 7.5] for \({\mathbf {b}} \in {{\mathbb {R}}}^{n}\) as follows:

$$\begin{aligned} \begin{aligned} H({\mathbf {b}})=&\sup _{{\mathbf {a}}\in {{\mathbb {R}}}^{n}} \left\{ {\mathbf {a}}^{\intercal } {\mathbf {b}}- \frac{1}{p} \Vert {\mathbf {a}} \Vert _{\mathrm {tr}}^p \right\} \\ =&\sup _{{\mathbf {a}}\in {{\mathbb {R}}}^{n}} \left\{ \sum _{i=1}^{n}{b_{i} a_{i}}-\frac{1}{p}\Big ({\max _{1\le i\le n}(a_{i})}^+ - {\min _{1\le i\le n}(a_{i})}^-\Big )^p\right\} . \end{aligned} \end{aligned}$$
(7)

We now explicitly compute the value of the Hamiltonian (7), which will provide concise formulations with regard to the tropical Wasserstein-p distances. For convenience, and identifying \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\) with \({{\mathbb {R}}}^n\), for \({\mathbf {b}} \in {{\mathbb {R}}}^{n}\) we define

$$\begin{aligned} \zeta ({\mathbf {b}}) := \max _{S\subset \{1,2,\ldots ,n\}}{\left| \sum _{i\in S}{b_{i}}\right| }. \end{aligned}$$
(8)

In other words, \(\zeta ({\mathbf {b}})\) is the larger of the sum of all positive \(b_{i}\) and the absolute value of the sum of all negative \(b_{i}\). In particular, \(\zeta ({\mathbf {b}})=0\) if and only if \({\mathbf {b}}={\mathbf {0}}\).

Example 2

When \(n=2\), we have \({\mathbf {b}} = (b_1, b_2)\) and

$$\begin{aligned} \zeta ({\mathbf {b}}) = {\left\{ \begin{array}{ll} b_{1}+b_{2}, &{} \text { if } b_{1}\ge 0, b_{2}\ge 0; \\ -b_{1}-b_{2}, &{} \text { if } b_{1}\le 0, b_{2}\le 0; \\ b_{1}, &{} \text { if } b_{1}\ge -b_{2}\ge 0; \\ b_{2}, &{} \text { if } b_{2}\ge -b_{1}\ge 0; \\ -b_{1}, &{} \text { if } -b_{1}\ge b_{2}\ge 0; \\ -b_{2}, &{} \text { if } -b_{2}\ge b_{1}\ge 0. \end{array}\right. } \end{aligned}$$
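The definition (8) and its description in terms of positive and negative parts can be compared directly in a few lines. This is a small sketch with illustrative function names: `zeta_bruteforce` enumerates subsets as in (8), while `zeta` uses the closed form.

```python
from itertools import combinations


def zeta_bruteforce(b):
    """zeta(b) from (8): maximum of |sum_{i in S} b_i| over subsets S."""
    best = 0.0  # the empty subset contributes 0
    for r in range(1, len(b) + 1):
        for subset in combinations(b, r):
            best = max(best, abs(sum(subset)))
    return best


def zeta(b):
    """Closed form: larger of the positive-part sum and |negative-part sum|."""
    pos = sum(x for x in b if x > 0)
    neg = -sum(x for x in b if x < 0)
    return max(pos, neg)


# The n = 2 cases of Example 2, one vector per case.
examples = [(1, 2), (-1, -2), (3, -1), (-1, 3), (-2, 1), (1, -2)]
values = [zeta(b) for b in examples]
```

The closed form runs in linear time, whereas the subset enumeration is exponential in n and is only useful as a cross-check.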

Proposition 3

The value of \(H({\mathbf {b}})\) is:

  1. (i)

    0 when \({\mathbf {b}}={\mathbf {0}}\), or \(\zeta ({\mathbf {b}})\le 1\) and \(p=1\);

  2. (ii)

    \(\infty \) when \({\mathbf {b}}\ne {\mathbf {0}}\) and \(p<1\), or \(\zeta ({\mathbf {b}})>1\) and \(p=1\);

  3. (iii)

    \(\displaystyle \frac{p-1}{p}\zeta ({\mathbf {b}})^{\frac{p}{p-1}}\) when \({\mathbf {b}}\ne {\mathbf {0}}\) and \(p>1\).

Proof

  1. (i)

    When \({\mathbf {b}}={\mathbf {0}}\), \(\sum _{i=1}^{n}{b_{i} a_{i}}\) is always zero, and \(L_{\mathrm {tr}}({\mathbf {a}})\ge 0\), so \(H({\mathbf {b}})\le 0\). However, when \({\mathbf {a}}={\mathbf {0}}\), the right-hand side of (7) is zero, so \(H({\mathbf {0}})=0\). When \(\zeta ({\mathbf {b}})\le 1\) and \(p=1\), let

    $$\begin{aligned} u := \max _{1\le i\le n}(a_{i},0) \ge 0 ~~~\text{ and }~~~ v := \min _{1\le i\le n}(a_{i},0) \le 0. \end{aligned}$$

    Then we have

    $$\begin{aligned} \sum _{i=1}^{n}{b_{i} a_{i}}&= \sum _{b_{i}>0}{b_{i} a_{i}} + \sum _{b_{i}<0}{b_{i} a_{i}} \\&\le \sum _{b_{i}>0}{b_{i} u} + \sum _{b_{i}<0}{b_{i} v} \\&\le \zeta ({\mathbf {b}})u + \zeta ({\mathbf {b}})(-v) \\&= \zeta ({\mathbf {b}})(u-v)\le u-v. \end{aligned}$$

    Hence \(H({\mathbf {b}})\le 0\), and equality holds when \({\mathbf {a}}={\mathbf {0}}\). So \(H({\mathbf {b}})=0\).

  2. (ii)

    Now we may assume that \({\mathbf {b}}\ne {\mathbf {0}}\) and thus \(\zeta ({\mathbf {b}})>0\). We may choose nonempty \(S\subset \{1,2,\ldots ,n\}\) such that

    $$\begin{aligned} \zeta ({\mathbf {b}}) = \left| \sum _{j\in S}{b_{j}} \right| . \end{aligned}$$

    For any \(N>0\) and each \(1\le i\le n\), we let

    $$\begin{aligned} a_{i} = {\left\{ \begin{array}{ll} \displaystyle \frac{b_{i}}{|b_{i}|}\cdot N, &{}\text { if } i\in S; \\ 0, &{}\text { if } i\notin S. \end{array}\right. } \end{aligned}$$

    Then \(\sum _{i=1}^{n}{b_{i} a_{i}} = \zeta ({\mathbf {b}})\cdot N\) and the set \(\{a_{i}\mid 1\le i\le n\}\cup \{0\}\) is either \(\{0, N\}\) or \(\{0, -N\}\), so \(L_{\mathrm {tr}}({\mathbf {a}})\) is \(N - 0\) or \(0 - (-N)\), which is N. Since \(\zeta ({\mathbf {b}})>0\), when \(p<1\), or \(\zeta ({\mathbf {b}})>1\) and \(p=1\), we have

    $$\begin{aligned} \lim \limits _{N\rightarrow \infty }{\left( \zeta ({\mathbf {b}})N - \frac{1}{p}N^{p}\right) } = \infty . \end{aligned}$$

    So \(H({\mathbf {b}})=\infty \).

  3. (iii)

    We denote u and v as in (i) above. Then

    $$\begin{aligned} H({\mathbf {b}}) \le \zeta ({\mathbf {b}})(u-v) - \frac{1}{p}(u-v)^{p}. \end{aligned}$$

    Let \(s:=u-v\ge 0\). We need to find the maximum of \(\zeta ({\mathbf {b}})s-\frac{1}{p}s^{p}\) when \(s\ge 0\). The derivative of this function of s is

    $$\begin{aligned} \zeta ({\mathbf {b}}) - s^{p-1}. \end{aligned}$$

    Hence the function is increasing when \(0\le s\le \zeta ({\mathbf {b}})^{\frac{1}{p-1}}\), and it is decreasing when \(s\ge \zeta ({\mathbf {b}})^{\frac{1}{p-1}}\). So the maximum is attained when \(s=\zeta ({\mathbf {b}})^{\frac{1}{p-1}}\), thus

    $$\begin{aligned} H({\mathbf {b}})\le \zeta ({\mathbf {b}})\cdot \zeta ({\mathbf {b}})^{\frac{1}{p-1}} - \frac{1}{p}\zeta ({\mathbf {b}})^{\frac{p}{p-1}} = \frac{p-1}{p}\zeta ({\mathbf {b}})^{\frac{p}{p-1}}. \end{aligned}$$

    Finally, as in (ii), we may choose nonempty \(S\subset \{1,2,\ldots ,n\}\) such that

    $$\begin{aligned} \zeta ({\mathbf {b}}) = \left| \sum _{j\in S}{b_{j}} \right| , \end{aligned}$$

    and the equality holds when

    $$\begin{aligned} a_{i} = {\left\{ \begin{array}{ll} \displaystyle \frac{b_{i}}{|b_{i}|}\cdot \zeta ({\mathbf {b}})^{\frac{1}{p-1}}, &{}\text { if } i\in S; \\ 0, &{}\text { if } i\notin S. \end{array}\right. } \end{aligned}$$
    (9)

\(\square \)
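The closed form in Proposition 3 (iii) can be checked numerically. The sketch below, with illustrative helper names, evaluates the expression inside the supremum in (7) at the candidate maximizer (9) and compares it with \(\frac{p-1}{p}\zeta ({\mathbf {b}})^{\frac{p}{p-1}}\); it also samples random \({\mathbf {a}}\) to confirm that none exceeds this value. It assumes \({\mathbf {b}}\ne {\mathbf {0}}\) and \(p>1\), and it takes S to be the indices of the positive entries or of the negative entries, whichever realizes \(\zeta ({\mathbf {b}})\).

```python
import random


def tropical_norm(a):
    return max(max(a), 0.0) - min(min(a), 0.0)


def zeta(b):
    pos = sum(x for x in b if x > 0)
    neg = -sum(x for x in b if x < 0)
    return max(pos, neg)


def objective(a, b, p):
    """Expression inside the supremum in (7)."""
    return sum(ai * bi for ai, bi in zip(a, b)) - tropical_norm(a) ** p / p


def hamiltonian(b, p):
    """Closed form of Proposition 3 (iii); assumes b != 0 and p > 1."""
    return (p - 1) / p * zeta(b) ** (p / (p - 1))


def maximizer(b, p):
    """Candidate a from (9), with S chosen by the dominant sign of b."""
    pos = sum(x for x in b if x > 0)
    neg = -sum(x for x in b if x < 0)
    sign = 1.0 if pos >= neg else -1.0
    scale = zeta(b) ** (1.0 / (p - 1))
    return [sign * scale if sign * x > 0 else 0.0 for x in b]


b = [2.0, -1.0, 1.0]
rng = random.Random(1)
worst_excess = max(
    objective([rng.uniform(-10, 10) for _ in b], b, 2) - hamiltonian(b, 2)
    for _ in range(20_000)
)  # expected to be <= 0: no sampled a beats the closed form
```

For \(p=2\) the formula reduces to \(H({\mathbf {b}})=\frac{1}{2}\zeta ({\mathbf {b}})^2\), which is the form that appears in the system (14).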

For notational convenience, we also define \(\eta :{\mathbb {R}}^n\rightarrow {\mathbb {R}}^n\), where \(\eta ({\mathbf {b}})=(\eta ({\mathbf {b}})_i)_{i=1}^n\), with

$$\begin{aligned} \eta ({\mathbf {b}})_i:=a_i= {\left\{ \begin{array}{ll} \displaystyle \frac{b_{i}}{|b_{i}|}\cdot \zeta ({\mathbf {b}})^{\frac{1}{p-1}}, &{}\text { if } i\in S; \\ 0, &{}\text { if } i\notin S. \end{array}\right. } \end{aligned}$$

That is, \(\eta ({\mathbf {b}})_i\) is defined by (9).

The Tropical Wasserstein-p Distances. We consider the tropical projective torus as a probability space [33] with finite pth moment as follows:

$$\begin{aligned} {\mathscr {P}}_p({{\mathbb {R}}}^{n}) = \Big \{\rho \in L^1({\mathbb {R}}^{n})~:\int _{{\mathbb {R}}^{n}}\rho (\mathbf{{x}})^p d\mathbf{{x}}=1,~\rho \ge 0\Big \}. \end{aligned}$$

Within the optimal transport framework discussed above and as in Definition 3, the tropical Wasserstein-p distance is given as follows:

$$\begin{aligned} {\tilde{W}}^{\mathrm {tr}}_p&: {\mathscr {P}}_p({\mathbb {R}}^n) \times {\mathscr {P}}_p({\mathbb {R}}^n) \rightarrow [0, +\infty ) \nonumber \\ {\tilde{W}}^{\mathrm {tr}}_{p}(\rho ^0, \rho ^1)^{p}&:= \inf _{\pi \in \varPi (\rho ^0, \rho ^1)} \int _{{\mathbb {R}}^n\, \times \, {\mathbb {R}}^{n}} d_{\mathrm {tr}}(\mathbf{{x}},\mathbf{{y}})^p\mathrm {d}\pi (\mathbf{{x}},\mathbf{{y}}), \end{aligned}$$
(10)

where the infimum is taken over the set \(\varPi (\rho ^0, \rho ^1)\) of all possible joint distributions (transport plans) \(\pi \) with marginals \(\rho ^0\) and \(\rho ^1\). Here, the distance \({\tilde{W}}^{\mathrm {tr}}_p\) depends on the choice of p in the linear programming formulation (10). The following alternative gives an equivalent definition of the tropical Wasserstein-p distances.

Definition 4

(Tropical Wasserstein-p distance) The tropical Wasserstein-p distance is given by

$$\begin{aligned} W^{\mathrm {tr}}_p(\rho ^0,\rho ^1)^p=\inf _{{\mathbf {v}},\rho }\int _{0}^1 \int _{{\mathbb {R}}^n} \big \Vert {\mathbf {v}}(t,\mathbf{{x}}) \big \Vert _{\mathrm {tr}}^p\,\rho (t,\mathbf{{x}})d\mathbf{{x}}dt \end{aligned}$$
(11a)

such that the following dynamical constraint or continuity equations hold:

$$\begin{aligned} \begin{aligned} \partial _t\rho (t,\mathbf{{x}})+\nabla \cdot \big (\rho (t,\mathbf{{x}}){\mathbf {v}}(t,\mathbf{{x}}) \big )&=0,\\ \rho (0,\mathbf{{x}})&=\rho ^0(\mathbf{{x}}),\\ \rho (1,\mathbf{{x}})&=\rho ^1(\mathbf{{x}}). \end{aligned} \end{aligned}$$
(11b)

Here \(\Vert \cdot \Vert _{\mathrm {tr}}\) is the tropical norm, \(\rho ^0\), \(\rho ^1\in {\mathscr {P}}_p({\mathbb {R}}^n)\), \(\nabla \), \(\nabla \cdot \) are gradient and divergence operators in \({\mathbb {R}}^n\), and the infimum is taken over all continuous density functions \(\rho :[0,1]\times {\mathbb {R}}^n\rightarrow {\mathbb {R}}\), and Borel vector fields \({\mathbf {v}}:[0,1]\times {\mathbb {R}}^n \rightarrow {\mathbb {R}}^n\).

Here, the formulation (4) given by the pairs (11a) and (11b) is known as the Benamou–Brenier formula, given by Benamou and Brenier [4]. As discussed in Chapter 8 of Villani [47], when c satisfies suitable conditions, the linear programming formulation \({\tilde{W}}_p^{\text {tr}}\) is equivalent to the dynamical formulation \(W_p^{\text {tr}}\). In this work, we focus on the dynamical formulation (4) with \(p=1,2\) for their concrete implications on computations of the tropical projective torus.

3.4 The tropical Wasserstein-1 distance

We first study the case \(p=1\). In this case, it turns out that the tropical Wasserstein-1 distance \(W_1^{\mathrm {tr}}\) may be recast as the following minimization problem.

Proposition 4

(Minimal Flux Formulation) By identifying \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\) with \({{\mathbb {R}}}^n\) as discussed in Sect. 2.3, the tropical Wasserstein-1 distance satisfies

$$\begin{aligned} \begin{aligned} W^{\mathrm {tr}}_1(\rho ^0,\rho ^1)=\inf _{{\mathbf {m}}} \bigg \{\int _{{\mathbb {R}}^n} \big \Vert {\mathbf {m}}(\mathbf{{x}}) \big \Vert _{\mathrm {tr}} d\mathbf{{x}}\, :\rho ^1(\mathbf{{x}})-\rho ^0(\mathbf{{x}})+\nabla \cdot {\mathbf {m}}(\mathbf{{x}})=0\bigg \}, \end{aligned} \end{aligned}$$
(12)

where the infimum is taken over all Borel flux functions \({\mathbf {m}} :{\mathbb {R}}^n\rightarrow {\mathbb {R}}^n\).

Proof

This minimal flux formulation follows from a standard result in optimal transport theory. By Jensen’s inequality, the minimizer of (4) is obtained by a time-independent solution. Denote

$$\begin{aligned} {\mathbf {m}}(\mathbf{{x}}) := \int _0^1 {\mathbf {v}}(t,\mathbf{{x}})\rho (t,\mathbf{{x}})dt. \end{aligned}$$

Then

$$\begin{aligned} \int _{0}^1\int _{{\mathbb {R}}^n} \big \Vert {\mathbf {v}}(t,\mathbf{{x}}) \big \Vert _{\mathrm {tr}}\,\rho (t,\mathbf{{x}})d\mathbf{{x}}dt \ge \int _{{\mathbb {R}}^n} \big \Vert {\mathbf {m}}(\mathbf{{x}}) \big \Vert _{\mathrm {tr}}d\mathbf{{x}}\end{aligned}$$

By choosing \(\rho (t,\mathbf{{x}})=(1-t)\rho ^0(\mathbf{{x}})+t\rho ^1(\mathbf{{x}})\), i.e., \(\rho ^1(\mathbf{{x}})-\rho ^0(\mathbf{{x}})+\nabla \cdot {\mathbf {m}}(\mathbf{{x}})=0\), we derive the minimizer of the above minimization problem. \(\square \)

Concretely, \({\mathbf {m}}(\mathbf{{x}})\) is the flux vector field that assigns a vector to each point in the measure and determines how much of the mass (measure) should be moved, and in which direction.
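In one dimension the minimal flux problem (12) can be solved by inspection, which gives a useful mental model. For \(n=1\) the tropical norm reduces to the absolute value, and with zero flux at the boundary the constraint determines the flux uniquely as the difference of the cumulative sums of \(\rho ^0\) and \(\rho ^1\). The sketch below works with histograms of cell masses on a uniform grid; the function name and the scaling convention (flux measured in units of mass, cost weighted by the spacing `dx`) are assumptions of this illustration and differ from the discretization of Sect. 4.1.

```python
def wasserstein1_1d(q0, q1, dx=1.0):
    """W_1 between histograms q0 and q1 on a uniform 1-D grid.

    The flux through the face to the right of cell i is
    m_i = sum_{j <= i} (q0_j - q1_j), so that
    (m_i - m_{i-1}) + q1_i - q0_i = 0 with m_{-1} = 0.
    Each unit of mass crossing a face travels a distance dx.
    """
    if len(q0) != len(q1):
        raise ValueError("histograms must live on the same grid")
    flux, running = [], 0.0
    for a, b in zip(q0, q1):
        running += a - b
        flux.append(running)
    return dx * sum(abs(m) for m in flux), flux


# All mass moves two cells to the right.
dist, flux = wasserstein1_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
```

In higher dimensions the constraint no longer pins down the flux, which is why (12) is a genuine optimization problem there.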

The reformulation of the tropical Wasserstein-1 distance given in Proposition 4 has enormous computational benefits, compared to that given in Definition 3 [25]. Notably, the size of the optimization variable is much smaller in solving a discrete approximation; additionally, the structure of the formulation given in Proposition 4 borrows from \(L_1\)-type minimization problems, which are well-studied and for which there exist fast and simple algorithms (see references in Li et al [25]). We will reap these benefits in formulating explicit algorithms to compute the tropical Wasserstein-p distances for \(p=1,2\), as discussed further on in Sect. 4.

Geodesics on the Tropical Projective Torus. Geodesics on the tropical projective torus are not unique [28, 33]. In particular, between any two given points in \({{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\), there are infinitely many geodesics. The following result gives the explicit connection between geodesics on the tropical projective torus and the minimizer of the tropical Wasserstein-1 distance.

Proposition 5

(Minimizer of the Tropical Wasserstein-1 distance) The minimizer of the tropical Wasserstein-1 distance is given by the following pair:

$$\begin{aligned} \left\{ \begin{aligned}&\nabla _{{\mathbf {m}}} \big \Vert {\mathbf {m}}(\mathbf{{x}}) \big \Vert _{\mathrm {tr}}=\nabla \varPhi (\mathbf{{x}}) \quad \text {if }{\mathbf {m}}(\mathbf{{x}})>0,\\&\rho ^1(\mathbf{{x}})-\rho ^0(\mathbf{{x}})+\nabla \cdot {\mathbf {m}}(\mathbf{{x}})=0. \end{aligned}\right. \end{aligned}$$
(13)

Proof

The minimizer of the tropical Wasserstein-1 distance may be derived as follows. Define a Lagrange multiplier \(\varPhi :{{\mathbb {R}}}^n \rightarrow {\mathbb {R}}\) for the equality constraint of (12), and consider the saddle point problem

$$\begin{aligned} L({\mathbf {m}},\varPhi )=\int _{{\mathbb {R}}^n}\Vert {\mathbf {m}}(\mathbf{{x}})\Vert _{\text {tr}} d\mathbf{{x}}+\int _{{\mathbb {R}}^n} \varPhi (\mathbf{{x}})\big (\nabla \cdot {\mathbf {m}}(\mathbf{{x}})+\rho ^1(\mathbf{{x}})-\rho ^0(\mathbf{{x}}) \big )d\mathbf{{x}}. \end{aligned}$$

Notice that L is convex in \({\mathbf {m}}\) and concave in \(\varPhi \). Thus, the saddle point \(({\mathbf {m}}, \varPhi )\) satisfies \(\delta _{{\mathbf {m}}}L({\mathbf {m}},\varPhi )=0\), \(\delta _\varPhi L({\mathbf {m}},\varPhi )=0\). This corresponds to the equation pair (13). \(\square \)

Remark 5

We notice that the first equation in (13) represents the tropical Eikonal equation

$$\begin{aligned} \zeta \big (\nabla \varPhi (\mathbf{{x}}) \big )=1. \end{aligned}$$

The tropical Eikonal equation describes the movement of each particle according to the infinitely many geodesics under the tropical metric between \(\rho ^0\) and \(\rho ^1\). This behavior will be explored and demonstrated numerically in experiments further on in Sect. 5.

Proposition 6

The set of all infinitely many tropical geodesics is contained in a classical convex polytope.

Proof

For any point \({\bar{c}}\) on a tropical geodesic connecting \({\bar{a}}, {\bar{b}} \in {{\mathbb {R}}}^{n+1}/{{\mathbb {R}}}{\mathbf{1}}\), by the definition of geodesics, we have

$$\begin{aligned} d_{\mathrm {tr}}({\bar{c}},{\bar{a}})+d_{\mathrm {tr}}({\bar{c}},{\bar{b}}) = d_{\mathrm {tr}}({\bar{a}},{\bar{b}}). \end{aligned}$$

So \({\bar{c}}\) belongs to a tropical ellipse with foci \({\bar{a}},{\bar{b}}\). By Proposition 26 of Lin and Yoshida [27], the set of all points on tropical geodesics is a classical convex polytope. \(\square \)
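The equality used in this proof is easy to test for concrete points, which also illustrates why tropical geodesics are not unique. In the sketch below the tropical metric is assumed to be \(d_{\mathrm {tr}}(\mathbf{{x}},\mathbf{{y}})=\max _i(x_i-y_i)-\min _i(x_i-y_i)\) on representatives in \({{\mathbb {R}}}^{n+1}\), and the function names are illustrative.

```python
def tropical_distance(x, y):
    """Assumed tropical metric: max_i(x_i - y_i) - min_i(x_i - y_i)."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    return max(diff) - min(diff)


def on_tropical_geodesic(c, a, b, tol=1e-12):
    """True when d(c, a) + d(c, b) = d(a, b) up to the tolerance."""
    lhs = tropical_distance(c, a) + tropical_distance(c, b)
    return abs(lhs - tropical_distance(a, b)) <= tol


a, b = (0, 0, 0), (0, 2, 4)
midpoint = (0, 1, 2)   # on the straight segment from a to b
detour = (0, 2, 2)     # off that segment, yet the equality still holds
outside = (0, 3, 0)    # violates the equality
checks = [on_tropical_geodesic(c, a, b) for c in (midpoint, detour, outside)]
```

Both `midpoint` and `detour` satisfy the equality, so at least two distinct geodesics join `a` and `b`.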

3.5 The tropical Wasserstein-2 distance

We now consider the case where \(p=2\). Here we refer to (9) using the notation \(\eta ({\mathbf {b}})\).

Proposition 7

(Minimizer of the Tropical Wasserstein-2 Distance) The minimizer of the tropical Wasserstein-2 distance \(({\mathbf {v}}(t,\mathbf{{x}}), \rho (t,\mathbf{{x}}))\) satisfies

$$\begin{aligned} {\mathbf {v}}(t,\mathbf{{x}})=\eta \big (\nabla \varPhi (t,\mathbf{{x}}) \big ), \end{aligned}$$

where \(\eta :{\mathbb {R}}^n\rightarrow {\mathbb {R}}^n\) is given by

$$\begin{aligned} \eta \big (\nabla \varPhi (t,\mathbf{{x}})\big )_i = {\left\{ \begin{array}{ll} \displaystyle \frac{\nabla _{x_i}\varPhi (t,\mathbf{{x}})}{|\nabla _{x_i}\varPhi (t,\mathbf{{x}})|}\cdot \zeta \big (\nabla \varPhi (t,\mathbf{{x}}) \big ) &{}\text { for } i\in S; \\ 0 &{}\text { for } i\notin S, \end{array}\right. } \end{aligned}$$

where S is as in (8). Also,

$$\begin{aligned} \left\{ \begin{aligned}&\partial _t\rho (t,\mathbf{{x}})+\nabla \cdot \big ( \rho (t,\mathbf{{x}}) \eta \big (\nabla \varPhi (t,\mathbf{{x}}) \big ) \big )=0,\\&\partial _t\varPhi (t,\mathbf{{x}})+\frac{1}{2}{\zeta }\big (\nabla \varPhi (t,\mathbf{{x}}) \big )^2\le 0,\\&\rho (0,\mathbf{{x}})=\rho ^0(\mathbf{{x}}),\quad \rho (1,\mathbf{{x}})=\rho ^1(\mathbf{{x}}). \end{aligned} \right. \end{aligned}$$
(14)

In particular, if \(\rho (t,\mathbf{{x}})>0\), then

$$\begin{aligned} \partial _t\varPhi (t,\mathbf{{x}})+\frac{1}{2}{\zeta }\big (\nabla \varPhi (t,\mathbf{{x}}) \big )^2=0. \end{aligned}$$

Proof

The minimizer path for the tropical Wasserstein-2 distance is derived as follows. For \(p=2\), denote \({\mathbf {m}}(t,\mathbf{{x}}):=\rho (t,\mathbf{{x}}) {\mathbf {v}}(t,\mathbf{{x}})\) and define

$$\begin{aligned} F({\mathbf {m}},\rho )={\left\{ \begin{array}{ll} \displaystyle \frac{\Vert {\mathbf {m}}\Vert _{\text {tr}}^2}{2\rho }&{} \text {if }\rho >0;\\ 0 &{} \text {if }\rho =0, {\mathbf {m}}=0;\\ +\infty &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

Then the variational problem (4) can be reformulated as

$$\begin{aligned} \begin{aligned} \frac{1}{2}W_2^{\text {tr}}(\rho ^0,\rho ^1)^2=&\inf _{{\mathbf {m}},\rho } \Big \{\int _{0}^1\int _{{\mathbb {R}}^n} F\big ({\mathbf {m}}(t,\mathbf{{x}}), \rho (t,\mathbf{{x}}) \big )d\mathbf{{x}}dt:\\&\partial _t\rho (t,\mathbf{{x}})+\nabla \cdot \big ( {\mathbf {m}}(t,\mathbf{{x}}) \big )= 0,\\&\rho (0,\mathbf{{x}})=\rho ^0(\mathbf{{x}}),~\rho (1,\mathbf{{x}})=\rho ^1(\mathbf{{x}})\Big \}. \end{aligned} \end{aligned}$$
(15)

Notice that the variational problem (15) is convex in \(({\mathbf {m}},\rho )\). Again, we denote the Lagrange multiplier by \(\varPhi :[0,1]\times {{\mathbb {R}}}^n \rightarrow {\mathbb {R}}\); we can then reformulate (15) into a saddle point problem:

$$\begin{aligned} L({\mathbf {m}},\rho , \varPhi )=\int F({\mathbf {m}},\rho )+ \varPhi (t,\mathbf{{x}}) \Big (\partial _t\rho (t,\mathbf{{x}})+\nabla \cdot {\mathbf {m}}(t,\mathbf{{x}})\Big ) d\mathbf{{x}}. \end{aligned}$$

Thus the saddle point \(({\mathbf {m}},\rho , \varPhi )\) satisfies the system \(\delta _{{\mathbf {m}}} L =0\), \(\delta _\rho L \ge 0\), \(\delta _\varPhi L =0\), i.e.,

$$\begin{aligned} {\left\{ \begin{array}{ll} &{} \displaystyle \frac{\nabla _{{\mathbf {m}}}\Vert {\mathbf {m}}\Vert _{\mathrm {tr}}^2}{\rho }=\nabla \varPhi \\ &{} \displaystyle -\frac{\Vert {\mathbf {m}}\Vert _{\mathrm {tr}}^2}{2\rho }-\partial _t\varPhi \ge 0. \end{array}\right. } \end{aligned}$$

Following Proposition 3, we obtain the minimizer of the system (14). \(\square \)

4 Algorithms: solving the optimal transport problem

In this section, we design algorithms for solving the optimal transport problems that give rise to the tropical Wasserstein-p distances and geodesics. Our approach is mainly based on the G-Prox primal-dual hybrid gradient (G-Prox PDHG) algorithm [16], which is a modified version of Chambolle–Pock primal-dual algorithms [8, 40].

We now provide a brief overview of the algorithm; see Chambolle and Pock [8], Jacobs et al [16], Pock and Chambolle [40] for further details. The classical primal-dual hybrid gradient algorithms convert the following minimization problem

$$\begin{aligned} \min _{X} f(KX) + g(X) \end{aligned}$$

into the following saddle point problem

$$\begin{aligned} \min _X \max _Y \Big \{L(X,Y)=\left\langle KX, Y\right\rangle + g(X) - f^*(Y)\Big \}, \end{aligned}$$

where f and g are convex functions with respect to a variable X, \(f^*\) is the convex dual function of f, and K is a continuous linear operator. For each iteration, the algorithm performs gradient descent on the primal variable X and gradient ascent on the dual variable Y as follows:

$$\begin{aligned} {\left\{ \begin{array}{ll} X^{k+1}=&{}\arg \min _{X} L(X,Y^k)+\frac{1}{2h}\Vert X-X^k\Vert ^2 ;\\ Y^{k+1}=&{}\arg \max _{Y} L(2X^{k+1}-X^k,Y)-\frac{1}{2\tau }\Vert Y-Y^k\Vert ^2, \end{array}\right. } \end{aligned}$$
(16)

where suitable norms need to be considered in the update.
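To see the two updates in (16) at work, here is a toy scalar instance in which both steps have closed forms. It is a sketch under simplifying assumptions and not the G-Prox variant used in this paper: it takes \(g(x)=\frac{1}{2}(x-c)^2\), \(f(y)=|y|\) (so \(f^*\) is the indicator of \([-1,1]\)), a scalar operator \(K=k\), and plain Euclidean norms in both updates. The function name, the step sizes and the iteration count are illustrative; the step sizes are chosen so that \(h\tau k^2<1\).

```python
def pdhg_toy(c, k, h=0.5, tau=0.5, iters=200):
    """Minimize 0.5*(x - c)**2 + |k*x| with the scalar form of (16)."""
    x, y = 0.0, 0.0
    for _ in range(iters):
        # Primal step: argmin_x  k*x*y + 0.5*(x - c)^2 + (x - x_old)^2 / (2h).
        x_new = (x + h * (c - k * y)) / (1.0 + h)
        # Dual step at the extrapolated point 2*x_new - x_old:
        # argmax_y  k*x_bar*y - f*(y) - (y - y_old)^2 / (2*tau),
        # i.e. a gradient ascent step followed by projection onto [-1, 1].
        x_bar = 2.0 * x_new - x
        y = min(1.0, max(-1.0, y + tau * k * x_bar))
        x = x_new
    return x, y


# The exact minimizer is the soft-threshold of c at level |k|.
x_star, y_star = pdhg_toy(c=3.0, k=1.0)
```

For this problem the answer is known in closed form (soft-thresholding), which makes it a convenient check that the extrapolation step and the projection are wired correctly before moving to the functional setting of (12) and (15).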

For the tropical Wasserstein-1 and Wasserstein-2 distances, we apply the algorithm in (16) to (12) and (15) by setting \(Y = \varPhi \) and specifying

$$\begin{aligned} W^{\mathrm {tr}}_1: \qquad&\begin{aligned} X&= {\mathbf {m}},\\ KX&= \nabla \cdot {\mathbf {m}},\\ g(X)&= \Vert {\mathbf {m}}\Vert _{\mathrm {tr}},\\ f(X)&= {\left\{ \begin{array}{ll} 0 &{} \text { if } X + \rho ^1 - \rho ^0 = 0,\\ \infty &{} \text { otherwise}; \end{array}\right. } \end{aligned} \\ W^{\mathrm {tr}}_2: \qquad&\begin{aligned} X&= ({\mathbf {m}},\rho ),\\ KX&= \partial _t \rho + \nabla \cdot {\mathbf {m}},\\ g(X)&= F({\mathbf {m}},\rho ),\\ f(X)&= {\left\{ \begin{array}{ll} 0 &{} \text { if } X = 0,\\ \infty &{} \text { otherwise.} \end{array}\right. } \end{aligned} \end{aligned}$$

In this paper, we use a version of the G-Prox PDHG algorithm that applies the \(H^1\) norm in the dual variable Y update and uses the \(L^2\) norm in the primal variable X update. This choice of norms gives us more stable and faster convergence of the algorithm than the standard PDHG algorithm [8].

4.1 Computing the tropical Wasserstein-1 distances

We consider here \(p=1\). We first present the spatial discretization to compute the general Wasserstein-1 distance.

Consider a uniform lattice graph \(G=(V, E)\) with spacing \(\varDelta \mathbf{{x}}\) to discretize the spatial domain, where V is the vertex set \(V=\{1,2,\ldots , N\},\) and E is the edge set. Here \({\mathbf {i}}=(i_1, \ldots , i_d)\in V\) represents a point in \({\mathbb {R}}^d\). Consider a discrete probability set supported on all vertices:

$$\begin{aligned} {\mathcal {P}}(G)=\left\{ (q_{\mathbf {i}})_{{\mathbf {i}}=1}^N\in {\mathbb {R}}^{N} \ \Big | \ \sum _{{\mathbf {i}}=1}^N q_{\mathbf {i}}=1,~q_{\mathbf {i}}\ge 0,~{\mathbf {i}} \in V \right\} , \end{aligned}$$

where \(q_{\mathbf {i}}\) here represents a probability at node \({\mathbf {i}}\), i.e., \(q_{\mathbf {i}}=\int _{C_{\mathbf {i}}} \rho (\mathbf{{x}})d\mathbf{{x}}\), and \(C_{\mathbf {i}}\) is a cube centered at \({\mathbf {i}}\) with side length \(\varDelta \mathbf{{x}}\). Thus, \(\rho ^0(\mathbf{{x}})\) and \(\rho ^1(\mathbf{{x}})\) are approximated by \(q^0=(q^0_{\mathbf {i}} )_{{\mathbf {i}}=1}^N\) and \(q^1= (q^1_{\mathbf {i}} )_{{\mathbf {i}}=1}^N\), respectively.

We use two steps to compute the Wasserstein-1 distance on \({\mathcal {P}}(G)\). We first define a flux on a lattice. Denote the flux matrix as \({\mathbf {m}}=({\mathbf {m}}_{{\mathbf {i}}+\frac{1}{2}})_{{\mathbf {i}}=1}^N\in {\mathbb {R}}^{N\times d}\), where each component \({\mathbf {m}}_{{\mathbf {i}}+\frac{1}{2}}\) is a row vector in \({\mathbb {R}}^d\), i.e.,

$$\begin{aligned} {\mathbf {m}}_{{\mathbf {i}}+\frac{1}{2}}=\Big ({\mathbf {m}}_{{\mathbf {i}}+\frac{1}{2}e_v} \Big )_{v=1}^d=\Bigg (\int _{C_{{\mathbf {i}}+\frac{1}{2}e_v}}m^v(\mathbf{{x}})d\mathbf{{x}}\Bigg )_{v=1}^d, \end{aligned}$$

where \(e_v=(0,\ldots , \varDelta \mathbf{{x}},\ldots , 0)^\intercal \), with \(\varDelta \mathbf{{x}}\) in the vth entry. In other words, if we denote \({\mathbf {i}}=(i_1, \ldots , i_d)\in {\mathbb {R}}^d\) and \({\mathbf {m}}(\mathbf{{x}})=({\mathbf {m}}^1(\mathbf{{x}}), \ldots , {\mathbf {m}}^d(\mathbf{{x}}))\), then

$$\begin{aligned} {\mathbf {m}}_{{\mathbf {i}}+\frac{1}{2}e_v}\approx {\mathbf {m}}^v\Big (i_1,\ldots ,\, i_{v-1},\, i_v+\frac{1}{2}\varDelta \mathbf{{x}},\, i_{v+1},\ldots , i_d\Big )\varDelta \mathbf{{x}}^d. \end{aligned}$$

We consider a zero flux condition: if a point \({\mathbf {i}}+\frac{1}{2}e_v\) is outside the domain of interest \(\varOmega \), we let \({\mathbf {m}}_{{\mathbf {i}}+\frac{1}{2}e_v}=0\). Based on such a flux \({\mathbf {m}}\), we define a discrete divergence operator \(\text {div}_G({\mathbf {m}}):=(\text {div}_G \big ({\mathbf {m}}_{\mathbf {i}}))_{{\mathbf {i}}=1}^N\), where

$$\begin{aligned} \text {div}_G({\mathbf {m}}_{\mathbf {i}}):=\frac{1}{\varDelta \mathbf{{x}}}\sum _{v=1}^d ({\mathbf {m}}_{{\mathbf {i}}+\frac{1}{2}e_v} - {\mathbf {m}}_{{\mathbf {i}}-\frac{1}{2}e_v}). \end{aligned}$$
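As an illustration, the discrete divergence with the zero flux condition can be sketched for \(d=2\) as follows. This is a minimal sketch and not the implementation used in the paper: the function name `div_G`, the storage of the flux in two separate edge arrays `mx` and `my`, and the zero padding used to enforce the boundary condition are all assumptions made for the example.

```python
import numpy as np

def div_G(mx, my, dx):
    """Discrete divergence on an n1 x n2 lattice (hypothetical layout).

    mx[i, j] is the flux through the edge from node (i, j) to (i+1, j),
    so mx has shape (n1-1, n2); my[i, j] is the flux through the edge
    from (i, j) to (i, j+1), so my has shape (n1, n2-1).
    """
    n1, n2 = mx.shape[0] + 1, mx.shape[1]
    # Zero flux condition: edges leaving the domain carry no flux,
    # which we model by padding the edge arrays with zeros.
    px = np.zeros((n1 + 1, n2))
    px[1:-1, :] = mx
    py = np.zeros((n1, n2 + 1))
    py[:, 1:-1] = my
    # (m_{i+e_v/2} - m_{i-e_v/2}) summed over v = 1, 2, divided by dx.
    return ((px[1:, :] - px[:-1, :]) + (py[:, 1:] - py[:, :-1])) / dx
```

Because every interior edge contributes to exactly two nodes with opposite signs, the entries of the returned array sum to zero, which reflects conservation of mass under the zero flux condition.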

We next introduce the discrete cost functional

$$\begin{aligned} \Vert {\mathbf {m}}\Vert :=\sum _{{\mathbf {i}}=1}^N\Vert {\mathbf {m}}_{{\mathbf {i}} +\frac{1}{2}}\Vert _{2}=\sum _{{\mathbf {i}}=1}^N \sqrt{\sum _{v=1}^d |{\mathbf {m}}_{{\mathbf {i}}+\frac{e_v}{2}}|^2}. \end{aligned}$$

This gives rise to the following optimization problem in the tropical setting

$$\begin{aligned} \begin{aligned}&\underset{{\mathbf {m}}}{\text {minimize}}&\Vert {\mathbf {m}}\Vert _{\mathrm {tr}}=\sum _{{\mathbf {i}}=1}^N \sqrt{\sum _{v=1}^d \Vert {\mathbf {m}}_{{\mathbf {i}}+\frac{e_v}{2}}\Vert _{\mathrm {tr}}^2} \\&\text {subject to}&\frac{1}{\varDelta \mathbf{{x}}}\sum _{v=1}^d ({\mathbf {m}}_{{\mathbf {i}}+\frac{1}{2}e_v} - {\mathbf {m}}_{{\mathbf {i}}-\frac{1}{2}e_v})+q_{{\mathbf {i}}}^1-q_{{\mathbf {i}}}^0=0, \end{aligned} \end{aligned}$$
(17)

for \({\mathbf {i}} =1,\ldots , N; v =1,\ldots , d.\)

We solve (17) by studying its saddle point structure. Denoting the Lagrange multiplier of (17) as \(\varPhi =(\varPhi _{\mathbf {i}})_{{\mathbf {i}}=1}^N\), we obtain

$$\begin{aligned} \min _{{\mathbf {m}}}\max _{\varPhi } \quad L({\mathbf {m}}, \varPhi ):=\min _{{\mathbf {m}}}\max _{\varPhi } \quad \Vert {\mathbf {m}}\Vert _{\mathrm {tr}}+\varPhi ^\intercal (\text {div}_G({\mathbf {m}})+q^1-q^0). \end{aligned}$$
(18)

Saddle point problems such as (18) are commonly solved with the first-order primal-dual hybrid gradient (PDHG) algorithm. Implementing the G-Prox PDHG algorithm gives the following iteration steps:

$$\begin{aligned} {\left\{ \begin{array}{ll} {\mathbf {m}}^{k+1}=&{}\arg \min _{{\mathbf {m}}} L({\mathbf {m}},\varPhi ^k)+\frac{1}{2h}\Vert {\mathbf {m}} - {\mathbf {m}}^k\Vert ^2_{L^2},\\ \varPhi ^{k+1}=&{}\arg \max _{\varPhi } L(2{\mathbf {m}}^{k+1}-{\mathbf {m}}^k,\varPhi )-\frac{1}{2\tau }\Vert \varPhi -\varPhi ^k\Vert ^2_{H^1}, \end{array}\right. } \end{aligned}$$
(19)

where the quantities h, \(\tau \) are two small step sizes, and

$$\begin{aligned} \Vert {\mathbf {m}}-{\mathbf {m}}^k\Vert ^2_{L^2}&=\sum _{{\mathbf {i}}=1}^N\sum _{v=1}^d\Big ({\mathbf {m}}_{{\mathbf {i}} +\frac{1}{2}e_v}-{\mathbf {m}}_{{\mathbf {i}}+\frac{1}{2}e_v}^k \Big )^2 \varDelta \mathbf{{x}},\\ \Vert \varPhi -\varPhi ^k\Vert ^2_{H^1}&=\sum _{{\mathbf {i}}=1}^N\Big (\nabla _G\varPhi _{{\mathbf {i}}} -\nabla _G\varPhi _{{\mathbf {i}}}^k\Big )^2 \varDelta \mathbf{{x}}. \end{aligned}$$

These steps alternate a gradient ascent in the dual variable \(\varPhi \), and a gradient descent in the primal variable \({\mathbf {m}}\).

It turns out that iteration (19) can be solved by simple explicit formulae. Since the unknown variables \({\mathbf {m}}\), \(\varPhi \) are component-wise separable in this problem, each of its components \({\mathbf {m}}_{{\mathbf {i}}+\frac{1}{2}}\), \(\varPhi _{{\mathbf {i}}}\) can be independently obtained by solving (19). First, notice that

$$\begin{aligned}&\arg \min _{{\mathbf {m}}}~L({\mathbf {m}},\varPhi ^k)+\frac{1}{2h}\Vert {\mathbf {m}} -{\mathbf {m}}^k\Vert ^2_{L^2} \\&\quad = \arg \min _{{\mathbf {m}}_{{\mathbf {i}}+\frac{1}{2}}}\sum _{{\mathbf {i}}=1}^N \Bigg (\Vert {\mathbf {m}}_{{\mathbf {i}}+\frac{1}{2}}\Vert _{\mathrm {tr}}- \Big (\nabla _G \varPhi _{{\mathbf {i}}+\frac{1}{2}}^k\Big )^\intercal {\mathbf {m}}_{{\mathbf {i}}+\frac{1}{2}}+\frac{1}{2h}\Vert {\mathbf {m}}_{{\mathbf {i}} +\frac{1}{2}}-{\mathbf {m}}_{{\mathbf {i}}+\frac{1}{2}}^k\Vert ^2_{L^2}\Bigg ), \end{aligned}$$

where \(\nabla _G\varPhi ^k_{{\mathbf {i}}+\frac{1}{2}}:=\frac{1}{\varDelta \mathbf{{x}}}(\varPhi ^k_{{\mathbf {i}}+e_v}-\varPhi _{{\mathbf {i}}}^k)_{v=1}^d\). The first iteration in (19) has an explicit solution, which is:

$$\begin{aligned} {\mathbf {m}}^{k+1}_{{\mathbf {i}}+\frac{1}{2}}=\text {shrink}_{\mathrm {tr}} ({\mathbf {m}}_{{\mathbf {i}}+\frac{1}{2}}^k+h\nabla _G \varPhi ^k_{{\mathbf {i}}+\frac{1}{2}}, h), \end{aligned}$$

where the shrink operator is a projection operation to the unit ball with norm \(\Vert \cdot \Vert _{\mathrm {tr}}\). Its exact formulation is given further on in Proposition 8.

Second, consider

$$\begin{aligned}&\arg \max _{\varPhi } L(2{\mathbf {m}}^{k+1}-{\mathbf {m}}^k,\varPhi )-\frac{1}{2\tau }\Vert \varPhi -\varPhi ^k\Vert ^2_{H^1} \\&\quad = \arg \max _{\varPhi } \sum _{{\mathbf {i}}=1}^N \max _{\varPhi _{\mathbf {i}}}\Big ( \varPhi _{{\mathbf {i}}} \big (\text {div}_G(2{\mathbf {m}}^{k+1}_{{\mathbf {i}}}-{\mathbf {m}}^k_{{\mathbf {i}}}) +q_{{\mathbf {i}}}^1-q_{{\mathbf {i}}}^0 \big )-\frac{1}{2\tau }\Vert \varPhi _{{\mathbf {i}}}-\varPhi ^k_{{\mathbf {i}}}\Vert ^2_{H^1}\Big ). \end{aligned}$$

Thus the second iteration in (19) becomes

$$\begin{aligned} \varPhi _{{\mathbf {i}}}^{k+1}=\varPhi _{{\mathbf {i}}}^k+\tau (-\varDelta _G)^{-1} \bigl ( \text {div}_G(2{\mathbf {m}}_{{\mathbf {i}}}^{k+1}-{\mathbf {m}}^k_{{\mathbf {i}}}) +q_{\mathbf {i}}^1-q_{{\mathbf {i}}}^0\bigr ), \end{aligned}$$

where \(\varDelta _G=\text {div}_G \cdot \nabla _G\) is the discrete Laplacian operator.

We are now ready to state our algorithm.

[Algorithm (figure a in the original): G-Prox PDHG iteration (19) for the tropical Wasserstein-1 distance]

Remark 6

The relative error at iteration k is given by \(\displaystyle \frac{|\Vert {\mathbf {m}}^k\Vert _\text {tr}-\Vert {\mathbf {m}}^{k-1}\Vert _\text {tr}|}{\Vert {\mathbf {m}}^{k-1}\Vert _\text {tr}}\).

In the algorithm, we require the shrink operator with respect to the tropical metric, \(\text {shrink}_{\text {tr}}\), which is given in the following result.

Proposition 8

Let \(h>0\) and \(b_{1}\ge b_{2}\ge \cdots \ge b_{k}\ge 0 > b_{k+1} \ge \cdots \ge b_{n}\). We denote

$$\begin{aligned} u_{i} = b_{i} \, \forall \ 1\le i\le k,~~~u_{k+1} = 0 \end{aligned}$$

and

$$\begin{aligned} v_{i} = -b_{n+1-i} \, \forall \ 1\le i\le n-k,~~~v_{n-k+1} = 0. \end{aligned}$$

Suppose

$$\begin{aligned} j_{1}&={\left\{ \begin{array}{ll} \displaystyle \max \bigg (1\le j\le k+1 \ \Big | \ \sum _{i=1}^{j}{(u_{i} - u_{j})} < 1 \bigg ), &{} \text { if } k\ge 1, \\ 0, &{} \text { if } k=0,\\ \end{array}\right. } \\ \ell _{1}&= \min (j_{1},k); \end{aligned}$$

and

$$\begin{aligned} j_{2}&= {\left\{ \begin{array}{ll} \displaystyle \max \bigg (1\le j\le n-k+1 \ \Big | \ \sum _{i=1}^{j}{(v_{i} - v_{j})} < 1 \bigg ), &{} \text { if } k\le n-1, \\ 0, &{} \text { if } k=n, \end{array}\right. } \\ \ell _{2}&= \min (j_{2},n-k). \end{aligned}$$

We let

$$\begin{aligned} t_{1} = {\left\{ \begin{array}{ll} \displaystyle \frac{\left( \sum _{i=1}^{j_{1}}{u_{i}}\right) - 1}{j_{1}} &{}\text { if } 1\le j_{1}\le k; \\ 0 &{}\text { otherwise}. \end{array}\right. } \end{aligned}$$

and

$$\begin{aligned} t_{2} = {\left\{ \begin{array}{ll} \displaystyle \frac{\left( \sum _{i=1}^{j_{2}}{v_{i}}\right) - 1}{j_{2}} &{}\text { if } 1\le j_{2}\le n-k; \\ 0 &{}\text { otherwise}. \end{array}\right. } \end{aligned}$$

Then

$$\begin{aligned} \mathrm {shrink}_{tr}({\mathbf {b}}, h):=\text {argmin}_{{\mathbf {a}}\in {{\mathbb {R}}}^{n}}{\left\{ \frac{\sum _{i=1}^{n}{a_{i}^{2}}}{2h} + \Vert {\mathbf {a}}\Vert _{\mathrm {tr}} - \sum _{i=1}^{n}{b_{i}\cdot a_{i}}\right\} } \end{aligned}$$
(20)

is the following unique point \(\mathbf{{x}}\in {{\mathbb {R}}}^{n}\), where

$$\begin{aligned} x_{i} = {\left\{ \begin{array}{ll} h\cdot t_{1}, &{}\text { if }i\le \ell _{1}; \\ h\cdot b_{i}, &{}\text { if } \ell _{1}< i < n + 1 - \ell _{2}; \\ -h\cdot t_{2}, &{}\text { if } i\ge n + 1 - \ell _{2}. \end{array}\right. } \end{aligned}$$

Proof

Note that by definition of \(t_{1}, t_{2}\), they are bounded by all of \(u_{i}\) with \(i\le j_{1}\) and all of \(v_{i}\) with \(i\le j_{2}\), respectively. In addition, we have

$$\begin{aligned} \sum _{i=1}^{\ell _{1}}{\left( u_{i} - t_{1} \right) } \le 1 \end{aligned}$$
(21)

and

$$\begin{aligned} \sum _{i=1}^{\ell _{2}}{\left( v_{i} - t_{2} \right) } \le 1. \end{aligned}$$
(22)

Now we claim that

$$\begin{aligned} \Vert {\mathbf {a}}\Vert _{\mathrm {tr}} \ge \sum _{i=1}^{\ell _{1}}{\left( u_{i} - t_{1} \right) \cdot a_{i}} - \sum _{i=1}^{\ell _{2}}{\left( v_{i} - t_{2} \right) \cdot a_{n+1-i}}. \end{aligned}$$
(23)

Notice that (21) implies that

$$\begin{aligned} \sum _{i=1}^{\ell _{1}}{\left( u_{i} - t_{1} \right) \cdot a_{i}} \le \max _{1\le i\le j_{1}}{a_{i}}. \end{aligned}$$

We also have that (22) implies that

$$\begin{aligned} \sum _{i=1}^{\ell _{2}}{\left( v_{i} - t_{2} \right) \cdot a_{n+1-i}} \ge \bigg (\sum _{i=1}^{\ell _{2}}{\left( v_{i} - t_{2} \right) }\bigg )\cdot \min _{1\le i\le j_{2}}{a_{n+1-i}} \ge \min \Big (0, \min _{1\le i\le j_{2}}{a_{n+1-i}} \Big ). \end{aligned}$$

Hence, the right-hand side of (23)

$$\begin{aligned} \sum _{i=1}^{\ell _{1}}{\left( u_{i} - t_{1} \right) \cdot a_{i}} - \sum _{i=1}^{\ell _{2}}{\left( v_{i} - t_{2} \right) \cdot a_{n+1-i}} \le&\max _{1\le i\le j_{1}}{a_{i}} - \min \Big (0, \min _{1\le i\le j_{2}}{a_{n+1-i}} \Big )\\&= \max _{1\le i_{1} \le j_{1}, 1\le i_{2} \le j_{2}}{\big (a_{i_{1}},\, a_{i_{1}} - a_{i_{2}} \big )}\\&\le \Vert {\mathbf {a}} \Vert _{\mathrm {tr}}. \end{aligned}$$

So our claim is proved.

Since \(h>0\) is a constant, we can multiply the objective function in (20) by 2h. Now, this new function is greater than or equal to

$$\begin{aligned}&\sum _{i=1}^{n}{a_{i}^{2}} + 2h\left( \sum _{i=1}^{\ell _{1}}{\left( u_{i} - t_{1} \right) \cdot a_{i}} - \sum _{i=1}^{\ell _{2}}{\left( v_{i} - t_{2} \right) \cdot a_{n+1-i}}\right) - 2h\sum _{i=1}^{n}{b_{i}\cdot a_{i}} \\&\quad = \sum _{i=1}^{n}{a_{i}^{2}} - \sum _{i=1}^{\ell _{1}}{2ht_{1}\cdot a_{i}} + \sum _{i=1}^{\ell _{2}}{2ht_{2}\cdot a_{n+1-i}} - 2h\sum _{i=\ell _{1}+1}^{n-\ell _{2}}{b_{i}\cdot a_{i}}. \end{aligned}$$

The global minimum of the last quadratic polynomial is attained exactly at the point \(\mathbf{{x}}\) in Proposition 8, so its value at \({\mathbf {a}}=\mathbf{{x}}\) is a lower bound for the new objective function. Finally, we note that equality in (23) holds at \(\mathbf{{x}}\), so this lower bound is attained by the objective function at \({\mathbf {a}}=\mathbf{{x}}\). \(\square \)

Example 3

When \(n=2\), given \((b_{1},b_{2})\in {{\mathbb {R}}}^{2}\), write \(x_{1} = f_{1}(b_{1},b_{2})\) and \(x_{2} = f_{2}(b_{1},b_{2})\); the shrink operator is then given in Table 1.

Table 1 The operator \(\text {shrink}_{\mathrm {tr}}\) when \(n=2\)

Remark 7

Proposition 8 provides an algorithm to compute the shrink. Suppose we have \(h>0\) and \({\mathbf {a}}_0, {\mathbf {b}}\in {{\mathbb {R}}}^{n}\) and we would like to find

$$\begin{aligned} \text {shrink}_{\mathrm {tr}}({\mathbf {a}}_0+h {\mathbf {b}}, h)=\text {argmin}_{{\mathbf {a}}\in {{\mathbb {R}}}^{n}} \left\{ \frac{|{\mathbf {a}}-{\mathbf {a}}_0|_{2}^{2}}{2h} + \Vert {\mathbf {a}}\Vert _{\mathrm {tr}} - \sum _{i=1}^{n}{b_{i}\cdot a_{i}} \right\} . \end{aligned}$$

Note that

$$\begin{aligned} |{\mathbf {a}}-{\mathbf {a}}_0|_{2}^{2} = \sum _{i=1}^{n}{\left( a_{i} - a_{0i} \right) ^{2}} = \sum _{i=1}^{n}{a_{i}^{2}} - \sum _{i=1}^{n}{2a_{0i}\cdot a_{i}} + \text { constant.} \end{aligned}$$

Letting \({\mathbf {b}}' = {\mathbf {b}} + \frac{{\mathbf {a}}_0}{h}\), the optimization problem becomes the one in Proposition 8 for \({\mathbf {b}}'\) and h after sorting the coordinates of \({\mathbf {b}}'\).
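The procedure of Proposition 8 can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' code, and it rests on two assumptions that are stated here explicitly. First, the tropical norm on \({\mathbb {R}}^n\) is taken to be \(\Vert {\mathbf {a}}\Vert _{\mathrm {tr}}=\max (0,\max _i a_i)-\min (0,\min _i a_i)\), which is our reading of the norm used in the proof above. Second, the block sizes are capped as \(\ell _1=\min (j_1,k)\) and \(\ell _2=\min (j_2,n-k)\), which is the reading under which inequalities (21) and (22) hold. The names `tropical_norm`, `shrink_tr` and `_cap` are hypothetical.

```python
import numpy as np

def tropical_norm(a):
    # Assumed form of the tropical norm on R^n (first coordinate fixed to 0).
    a = np.asarray(a, dtype=float)
    return max(a.max(), 0.0) - min(a.min(), 0.0)

def _cap(u):
    """u is nonnegative and sorted in decreasing order.
    Returns (ell, t) as in Proposition 8 for one sign block."""
    m = u.size
    ext = np.append(u, 0.0)                  # u_{m+1} = 0
    csum = np.cumsum(ext)
    j_idx = np.arange(1, m + 2)
    ok = csum - j_idx * ext < 1.0            # sum_{i<=j} (u_i - u_j) < 1
    j = int(j_idx[ok].max())                 # j = 1 always qualifies
    if j <= m:
        return j, (csum[j - 1] - 1.0) / j
    return m, 0.0                            # whole block shrinks to zero

def shrink_tr(b, h):
    """Minimizer of |a|^2 / (2h) + ||a||_tr - <b, a>, as in (20)."""
    b = np.asarray(b, dtype=float)
    order = np.argsort(-b)                   # sort coordinates decreasingly
    bs = b[order]
    n = bs.size
    k = int(np.sum(bs >= 0))                 # number of nonnegative entries
    xs = h * bs                              # middle case: x_i = h * b_i
    if k >= 1:
        l1, t1 = _cap(bs[:k])
        xs[:l1] = h * t1
    if k <= n - 1:
        l2, t2 = _cap(-bs[k:][::-1])         # v_i = -b_{n+1-i}
        xs[n - l2:] = -h * t2
    x = np.empty(n)
    x[order] = xs                            # undo the sorting
    return x
```

For instance, with \({\mathbf {b}}=(3,-2)\) and \(h=0.5\), the first-order conditions \(2a_1+1-3=0\) and \(2a_2-1+2=0\) give \((1,-0.5)\), which is what the sketch returns. To evaluate \(\text {shrink}_{\mathrm {tr}}({\mathbf {a}}_0+h{\mathbf {b}},h)\) in the sense of Remark 7, one would call `shrink_tr(b + a0 / h, h)`.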

4.2 Computing the tropical Wasserstein-2 distances

We now present an algorithm to compute the tropical Wasserstein-2 distance in the tropical projective torus \({{\mathbb {R}}}^3/{{\mathbb {R}}}{\mathbf{1}}\) identified with \({\mathbb {R}}^2\). Consider the same uniform lattice graph on a domain \(\varOmega \subset {\mathbb {R}}^2\) as in the case for the tropical Wasserstein-1 distance. Define the following matrices

$$\begin{aligned} \varvec{\rho }&= \big (\varvec{\rho }^n_{{\mathbf {i}}} \big )^{N_x, N_t}_{{\mathbf {i}},n=1}\\ {\mathbf {m}}&= \Big ({\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}e_v} \Big )^{d, N_x, N_t}_{v,{\mathbf {i}},n=1} \end{aligned}$$

where the time interval is discretized uniformly with \(N_t\) points, and \(N_x\) is the number of vertices of the uniform lattice graph. Here we assume Neumann boundary conditions for \(\varvec{\rho }\): \(\displaystyle \frac{\partial \rho }{\partial \hat{{\mathbf {n}}}} = 0\) on \(\partial \varOmega \), where \(\hat{{\mathbf {n}}}\) is the outward normal vector. Given initial densities \(\rho _0\) and \(\rho _1\), the boundary conditions for \(\rho \) at \(t=0\) and \(t=1\) are

$$\begin{aligned} \big (\varvec{\rho }^1_{{\mathbf {i}}} \big )^{N_x}_{{{\mathbf {i}}}=1} = \rho _0 ~~~\text{ and }~~~ \big (\varvec{\rho }^{N_t}_{{\mathbf {i}}} \big )^{N_x}_{{\mathbf {i}}=1} = \rho _1. \end{aligned}$$

Define \(\varDelta t := \frac{1}{N_t}\). We can reformulate the minimization problem (15) into a discretization as follows:

$$\begin{aligned} \begin{aligned} \underset{{\mathbf {m}}}{\text {minimize}}&\quad \sum ^{N_t}_{n=1} \sum ^{N_x}_{{\mathbf {i}}=1} \frac{\Vert {\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}}\Vert ^2_{\mathrm {tr}}}{2 \varvec{\rho }^n_{{\mathbf {i}}}}\varDelta \mathbf{{x}}\varDelta t \\ \text {subject to}&\quad \partial _t \varvec{\rho }^n_{\mathbf {i}} + \text {div}_G({\mathbf {m}}^n_{{\mathbf {i}}})=0, \quad {\mathbf {i}}=1,\ldots , N_x;~ n=1,\ldots , N_t\\&\quad \big (\varvec{\rho }^1_{{\mathbf {i}}}\big )^{N_x}_{{{\mathbf {i}}}=1}=\rho _0,\\&\quad \big (\varvec{\rho }^{N_t}_{{\mathbf {i}}} \big )^{N_x}_{{{\mathbf {i}}}=1}=\rho _1, \end{aligned} \end{aligned}$$
(24)

where

$$\begin{aligned} \partial _t \varvec{\rho }^n_{{\mathbf {i}}} = {\left\{ \begin{array}{ll} \frac{1}{\varDelta t} (\varvec{\rho }^{n+1}_{{\mathbf {i}}} - \varvec{\rho }^{n}_{{\mathbf {i}}}) &{} \text { for } n = 1\\ \frac{1}{2\varDelta t} (\varvec{\rho }^{n+1}_{{\mathbf {i}}} - \varvec{\rho }^{n-1}_{{\mathbf {i}}}) &{} \text { for } n = 2,\ldots , N_t-1\\ \frac{1}{\varDelta t} (\varvec{\rho }^{n}_{{\mathbf {i}}} - \varvec{\rho }^{n-1}_{{\mathbf {i}}}) &{} \text { for } n = N_t \end{array}\right. } \end{aligned}$$

and

$$\begin{aligned} \text {div}_G({\mathbf {m}}^n_{{\mathbf {i}}}) = \frac{1}{\varDelta \mathbf{{x}}} \sum ^2_{v=1} \Big ({\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}e_v} - {\mathbf {m}}^n_{{\mathbf {i}}-\frac{1}{2}e_v} \Big ) \text { for } n=1,\ldots ,N_t. \end{aligned}$$
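The time difference \(\partial _t \varvec{\rho }^n_{{\mathbf {i}}}\) above uses one-sided differences at the two end points and central differences in between. A minimal sketch, assuming the density is stored as an array whose first axis indexes the time steps \(n=1,\ldots ,N_t\) (the function name `time_derivative` and this storage layout are assumptions):

```python
import numpy as np

def time_derivative(rho, dt):
    """Finite-difference time derivative along axis 0 of rho."""
    rho = np.asarray(rho, dtype=float)
    d = np.empty_like(rho)
    d[0] = (rho[1] - rho[0]) / dt                # forward difference, n = 1
    d[1:-1] = (rho[2:] - rho[:-2]) / (2.0 * dt)  # central, n = 2, ..., N_t - 1
    d[-1] = (rho[-1] - rho[-2]) / dt             # backward difference, n = N_t
    return d
```

All three stencils are exact for densities that are linear in time, which gives a quick sanity check for an implementation.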

In \({\mathbb {R}}^2\), using (3), we can calculate the tropical norm of the flux function \({\mathbf {m}}\) by considering the six different cases based on \(\{{\mathbf {m}}_{{\mathbf {i}}+\frac{1}{2}e_v}\}^2_{v=1}\). The tropical norm of \({\mathbf {m}}\) is given as follows:

Table 2 Tropical norm when \(n=2\)

Let \(\varPhi =(\varPhi ^n_{{\mathbf {i}}})_{{\mathbf {i}}=1}^{N_x}{}_{n=1}^{N_t}\) here be the Lagrange multiplier which satisfies the Neumann boundary condition on the boundary of the domain. The minimization problem (24) can be reformulated as a saddle point problem:

$$\begin{aligned} \min _{{\mathbf {m}},\varvec{\rho }}\max _{\varPhi } \quad L({\mathbf {m}},\varvec{\rho }, \varPhi ):=\min _{{\mathbf {m}},\varvec{\rho }}\max _{\varPhi } \quad \sum ^{N_t}_{n=1} \sum ^{N_x}_{{\mathbf {i}}=1} \frac{\Vert {\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}}\Vert ^2_{\mathrm {tr}}}{2\varvec{\rho }^n_{{\mathbf {i}}}}+\varPhi ^n_{{\mathbf {i}}} \Big (\partial _t\varvec{\rho }^n_{{\mathbf {i}}} + \text {div}_G\big ({\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}} \big ) \Big ). \end{aligned}$$
(25)

Again, we implement G-Prox PDHG to solve the problem as follows:

$$\begin{aligned} \left\{ \begin{array}{ll} \varvec{\rho }^{k+1} = \text {arg}\min _{\varvec{\rho }} &{} \quad L({\mathbf {m}}^k,\varvec{\rho },\varPhi ^k) + \frac{1}{2\tau } \Vert \varvec{\rho }-\varvec{\rho }^k\Vert ^2_{L^2(\varOmega \times [0,1])},\\ {\mathbf {m}}^{k+1} = \text {arg}\min _{{\mathbf {m}}} &{} \quad L({\mathbf {m}},\varvec{\rho }^{k+1},\varPhi ^k) + \frac{1}{2\tau } \Vert {\mathbf {m}}-{\mathbf {m}}^k\Vert ^2_{L^2(\varOmega \times [0,1])},\\ \varPhi ^{k+1} = \text {arg}\max _\varPhi &{} \quad L(2{\mathbf {m}}^{k+1} - {\mathbf {m}}^k,2\varvec{\rho }^{k+1}-\varvec{\rho }^k,\varPhi ) - \frac{1}{2h} \Vert \varPhi -\varPhi ^k\Vert ^2_{H^1(\varOmega \times [0,1])}, \end{array} \right. \end{aligned}$$
(26)

where h, \(\tau \) are two small step sizes and

$$\begin{aligned} \Vert \varvec{\rho }-\varvec{\rho }^k\Vert ^2_{L^2}&=\sum ^{N_t}_{n=1} \sum ^{N_x}_{{\mathbf {i}}=1} \big (\varvec{\rho }^n_{{\mathbf {i}}}-(\varvec{\rho }^n_{{\mathbf {i}}})^k\big )^2 \varDelta \mathbf{{x}}\varDelta t\\ \Vert \varPhi -\varPhi ^k\Vert ^2_{H^1}&=\sum ^{N_t}_{n=1} \sum ^{N_x}_{{\mathbf {i}}=1} \left( (\partial _t \varPhi ^n_{{\mathbf {i}}} - \partial _t (\varPhi ^n_{{\mathbf {i}}})^k)^2 + \Vert \nabla _G\varPhi ^n_{{\mathbf {i}}} - \nabla _G(\varPhi ^n_{{\mathbf {i}}})^k\Vert ^2 \right) \varDelta \mathbf{{x}}\varDelta t. \end{aligned}$$

From (26), each component \({\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}}\), \(\varvec{\rho }^n_{{\mathbf {i}}}\), and \(\varPhi ^n_{{\mathbf {i}}}\) can be obtained. From the first iteration,

$$\begin{aligned} \begin{aligned} \varvec{\rho }^{k+1}=&\text {arg}\min _{\varvec{\rho }} \quad L({\mathbf {m}}^k,\varvec{\rho },\varPhi ^k) + \frac{1}{2\tau } \Vert \varvec{\rho }-\varvec{\rho }^k\Vert ^2_{L^2} \\ =&\text {arg}\min _{\varvec{\rho }} \quad \sum ^{N_t}_{n=1}\sum ^{N_x}_{{\mathbf {i}}=1} \frac{\Vert ({\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}})^k\Vert ^2_{\mathrm {tr}}}{2\varvec{\rho }^n_{{\mathbf {i}}}}+(\varPhi ^n_{{\mathbf {i}}})^k \partial _t\varvec{\rho }^n_{{\mathbf {i}}} + \frac{1}{2\tau } \Vert \varvec{\rho }-\varvec{\rho }^k\Vert ^2_{L^2} \end{aligned} \end{aligned}$$

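We calculate the minimizer by differentiating the objective with respect to \(\varvec{\rho }^n_{{\mathbf {i}}}\) and setting the derivative to zero. Each entry \((\varvec{\rho }^n_{{\mathbf {i}}})^{k+1}\) of the minimizer \(\varvec{\rho }^{k+1}\) is then a positive root of the following equation, which becomes a cubic polynomial after multiplying through by \(\tau ((\varvec{\rho }^n_{{\mathbf {i}}})^{k+1})^2\):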

$$\begin{aligned} - \frac{\Vert ({\mathbf {m}}^n_{{\mathbf {i}}})^{k}\Vert ^2_{\text {tr}}}{2 ((\varvec{\rho }^n_{{\mathbf {i}}})^{k+1})^2} - \partial _t (\varPhi ^n_{{\mathbf {i}}})^k + \frac{1}{\tau } \big ((\varvec{\rho }^n_{{\mathbf {i}}})^{k+1} - (\varvec{\rho }^n_{{\mathbf {i}}})^k \big ) = 0. \end{aligned}$$

Thus, we can calculate the root by using a cubic solver:

$$\begin{aligned} (\varvec{\rho }^n_{{\mathbf {i}}})^{k+1} = \text {root}^+\biggl (-(\varvec{\rho }^n_{{\mathbf {i}}})^k - \tau \partial _t (\varPhi ^n_{{\mathbf {i}}})^k, 0, -\frac{\tau }{2} \Vert ({\mathbf {m}}^n_{{\mathbf {i}}})^k\Vert ^2_{\text {tr}} \biggl ), \end{aligned}$$

where \(\text {root}^+(a,b,c)\) denotes a positive root of the cubic polynomial \(x^3 + a x^2 + b x + c = 0\).
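A cubic solver of this kind can be sketched with NumPy's polynomial root finder. This is an illustration under stated assumptions, not the solver used in the paper: the helper names `root_plus` and `rho_update` are hypothetical, and the sketch selects the largest real root. When \(c<0\) (that is, when the flux is nonzero) the cubic \(x^3+ax^2+c\) has exactly one positive root, so the largest real root is that root.

```python
import numpy as np

def root_plus(a, b, c):
    """Largest real root of x^3 + a x^2 + b x + c = 0 (assumed convention)."""
    r = np.roots([1.0, a, b, c])
    # Keep roots whose imaginary part is negligible.
    real = r.real[np.abs(r.imag) < 1e-9 * (1.0 + np.abs(r.real))]
    return float(real.max())

def rho_update(rho_k, dt_phi_k, m_tr_sq, tau):
    """Entrywise density update for one node and one time step.

    rho_k    : current density value
    dt_phi_k : time difference of the multiplier at the same node
    m_tr_sq  : squared tropical norm of the current flux at the node
    tau      : step size
    """
    return root_plus(-rho_k - tau * dt_phi_k, 0.0, -0.5 * tau * m_tr_sq)
```

A cubic with real coefficients always has at least one real root, so `root_plus` always returns a value. If the flux vanishes, the returned root may be zero or negative, and an implementation would need to handle that case separately.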

We can reformulate the second iteration as follows:

$$\begin{aligned} \begin{aligned} {\mathbf {m}}^{k+1}&= \text {arg}\min _{{\mathbf {m}}} \quad L({\mathbf {m}},\varvec{\rho }^{k+1},\varPhi ^k) + \frac{1}{2\tau } \Vert {\mathbf {m}}-{\mathbf {m}}^k\Vert ^2_{L^2} \\&= \text {arg}\min _{{\mathbf {m}}} \sum ^{N_t}_{n=1} \sum ^{N_x}_{{\mathbf {i}}=1} \frac{\Vert {\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}}\Vert ^2_{\mathrm {tr}}}{2(\varvec{\rho }^n_{{\mathbf {i}}})^{k+1}} + \varPhi ^n_{{\mathbf {i}}} \text {div}_G({\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}})+ \frac{1}{2\tau } \Vert {\mathbf {m}}-{\mathbf {m}}^k\Vert ^2_{L^2}\\&= \text {arg}\min _{{\mathbf {m}}} \sum ^{N_t}_{n=1} \sum ^{N_x}_{{\mathbf {i}}=1} \frac{\Vert {\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}}\Vert ^2_{\mathrm {tr}}}{2(\varvec{\rho }^n_{{\mathbf {i}}})^{k+1}} - {\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}} \nabla _G \varPhi ^n_{{\mathbf {i}}} + \frac{1}{2\tau } \Vert {\mathbf {m}}-{\mathbf {m}}^k\Vert ^2_{L^2}\\&= \text {arg}\min _{{\mathbf {m}}} \sum ^{N_t}_{n=1} \sum ^{N_x}_{{\mathbf {i}}=1} \frac{\Vert {\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}}\Vert ^2_{\mathrm {tr}}}{2(\varvec{\rho }^n_{{\mathbf {i}}})^{k+1}} + \frac{1}{2\tau } \Vert {\mathbf {m}}-{\mathbf {m}}^k-\tau \nabla _G \varPhi \Vert ^2_{L^2}. \end{aligned} \end{aligned}$$

Differentiating the objective with respect to \({\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}}\) and setting the derivative to zero, we obtain the following expression:

$$\begin{aligned} \begin{aligned} \frac{\Vert {\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}}\Vert _\text {tr} \nabla _G \Vert {\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}}\Vert _\text {tr}}{(\varvec{\rho }^n_{{\mathbf {i}}})^{k+1}} + \frac{1}{\tau }\big ({\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}} - ({\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}})^k - \tau \nabla _G \varPhi ^n_{{\mathbf {i}}} \big ) = 0. \end{aligned} \end{aligned}$$

Solving this expression gives an explicit solution for \(({\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}})^{k+1}\):

$$\begin{aligned} ({\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}})^{k+1}={\varvec{F}}\Big ( \big ({\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}} \big )^k+\tau \nabla _G (\varPhi ^n_{{\mathbf {i}}})^k,\, \tau /(\varvec{\rho }^n_{{\mathbf {i}}})^{k+1} \Big ). \end{aligned}$$
(27)

Let \(\mu =\tau /(\varvec{\rho }^n_{{\mathbf {i}}})^{k+1}\) and \(c=(c_1,c_2)\) be

$$\begin{aligned} c_1&=({\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}e_1})^k+\tau \nabla _{x_1} (\varPhi ^n_{{\mathbf {i}}+\frac{1}{2}e_1})^k,\\ c_2&=({\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}e_2})^k+\tau \nabla _{x_2} (\varPhi ^n_{{\mathbf {i}}+\frac{1}{2}e_2})^k. \end{aligned}$$

The function \({\varvec{F}}(c,\mu )\) is then given as follows:

Table 3 The definition of \({\varvec{F}}(c,\mu )\)

Similarly, we obtain an explicit formula for \(\varPhi ^{k+1}\) from the third iteration:

$$\begin{aligned} (\varPhi ^n_{{\mathbf {i}}})^{k+1}&= (\varPhi ^n_{{\mathbf {i}}})^k + h (-\varDelta _{t,G})^{-1}\Big ( \partial _t \big (2(\varvec{\rho }^n_{{\mathbf {i}}})^{k+1}-(\varvec{\rho }^n_{{\mathbf {i}}})^k\big )\\&\quad + \text {div}_{t,G} \Big (2\big ({\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}} \big )^{k+1}- \big ({\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}} \big )^k \Big ) \Big ) \end{aligned}$$

for \({\mathbf {i}}=1,\ldots , N_x\) and \(n=1,\ldots ,N_t\). Here, \(\varDelta _{t,G} = \partial _{tt} + \varDelta _G\) is the discrete Laplacian operator over time and space.

Now, define

$$\begin{aligned} E^k :=\sum ^{N_t}_{n=1} \sum ^{N_x}_{{\mathbf {i}}=1} \frac{\big \Vert \big ({\mathbf {m}}^n_{{\mathbf {i}}+\frac{1}{2}} \big )^k \big \Vert ^2_{\mathrm {tr}}}{2(\varvec{\rho }^n_{{\mathbf {i}}})^k}. \end{aligned}$$

Then the relative error at iteration k is calculated as \(\displaystyle \frac{|E^k-E^{k-1}|}{|E^{k-1}|}\).

We are now ready to present our algorithm to compute the tropical Wasserstein-2 metric.

[Algorithm (figure b in the original): G-Prox PDHG iteration (26) for the tropical Wasserstein-2 distance]

4.3 Convergence

Our proposed primal-dual algorithms for the tropical Wasserstein-1 and tropical Wasserstein-2 distances converge to their respective minimizers as given by Propositions 5 and 7.

Theorem 1

(i) Consider the G-Prox PDHG algorithm to compute the tropical Wasserstein-1 distance. Let

$$\begin{aligned} \sqrt{\tau \mu }\Vert (-\varDelta _G)^{-\frac{1}{2}}\mathrm {div}_G\Vert _2<1. \end{aligned}$$

Then \(({\mathbf {m}}^{k}, \varPhi ^{k})\) defined by (19) converges weakly to \(({\mathbf {m}}^*,\varPhi ^*)\).

(ii) Consider the G-Prox PDHG algorithm to compute the tropical Wasserstein-2 distance. Let

$$\begin{aligned} \sqrt{\tau \mu }\Vert (-\varDelta _{t,G})^{-\frac{1}{2}}\mathrm {div}_{t,G}\Vert _2<1. \end{aligned}$$

Then \(({\mathbf {m}}^{k}, \varvec{\rho }^k, \varPhi ^{k})\) defined by (26) converges weakly to \(({\mathbf {m}}^*,\varvec{\rho }^*,\varPhi ^*)\).

Proof

The proof follows that of Theorem 1 in Pock and Chambolle [40], whose conditions we now verify. In the case of (i), we write the Lagrangian L as

$$\begin{aligned} L({\mathbf {m}}, \varPhi )=g({\mathbf {m}})+\varPhi ^\intercal K {\mathbf {m}}-f^*(\varPhi ), \end{aligned}$$

where \(g({\mathbf {m}})=\Vert {\mathbf {m}}\Vert _{\mathrm {tr}}\), \(K=\text {div}_G\), and \(f^*(\varPhi )=\sum _{{\mathbf {i}}}\varPhi _{{\mathbf {i}}} (q_{{\mathbf {i}}}^0-q_{{\mathbf {i}}}^1)\). Observe that g, \(f^*\) are convex functions and K is a linear operator. Then there exists a saddle point \(({\mathbf {m}}^*,\varPhi ^*)\). Notice that the preconditioning norm for \(\varPhi \) is \(\varSigma :=\mu (-\varDelta _G)^{-1}\) and the preconditioning norm for \({\mathbf {m}}\) is \(T:=\tau \cdot \mathrm {Id}\) where \(\mathrm {Id}\) is an identity operator. Thus, the algorithm converges when \(\Vert \varSigma ^{\frac{1}{2}}KT^{\frac{1}{2}}\Vert _2^2<1\). This is our condition \(\sqrt{\tau \mu }\Vert (-\varDelta _G)^{-\frac{1}{2}}\mathrm {div}_G\Vert _2<1\), which finishes the proof. A similar argument holds for (ii). \(\square \)
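To make the step-size condition concrete, the operator norm \(\Vert (-\varDelta _G)^{-\frac{1}{2}}\mathrm {div}_G\Vert _2\) can be evaluated numerically on a small lattice. The sketch below does this in one dimension and is only an illustration under assumptions: the matrix layout in `div_matrix_1d` is hypothetical, and because \(-\varDelta _G\) is singular under the zero flux condition (constants lie in its kernel), the inverse square root is taken as a pseudo-inverse on the complement of the kernel.

```python
import numpy as np

def div_matrix_1d(n, dx):
    """Matrix of div_G on a 1D lattice with n nodes and n-1 interior edges.
    Column e is the edge between nodes e and e+1; boundary flux is zero."""
    D = np.zeros((n, n - 1))
    for e in range(n - 1):
        D[e, e] = 1.0 / dx        # edge e is the "+1/2" edge of node e
        D[e + 1, e] = -1.0 / dx   # and the "-1/2" edge of node e+1
    return D

def preconditioned_norm(n, dx):
    """Spectral norm of (-Laplacian)^(-1/2) div, with a pseudo-inverse."""
    D = div_matrix_1d(n, dx)
    neg_lap = D @ D.T             # -div grad, since grad = -div^T here
    w, V = np.linalg.eigh(neg_lap)
    keep = w > 1e-10 * w.max()    # drop the zero eigenvalue (constants)
    inv_sqrt = (V[:, keep] / np.sqrt(w[keep])) @ V[:, keep].T
    return np.linalg.norm(inv_sqrt @ D, 2)
```

In this one-dimensional setting the computed norm equals 1 up to rounding, independently of the lattice spacing, so the condition of Theorem 1 (i) reduces to \(\sqrt{\tau \mu }<1\). This independence from the mesh size is consistent with the stability advantage of the \(H^1\) preconditioning mentioned at the start of this section.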

5 Numerical experiments

In this section, we present the results of numerical experiments solving the tropical optimal transport problem for three different sets of initial densities using our proposed G-Prox primal-dual methods for \(L^1\) and \(L^2\). In particular, we give the minimizers of \(L^1\) and \(L^2\) tropical optimal transport problems from each experiment.

Experiment 1. We consider a two-dimensional problem on \(\varOmega = [0,1]\times [0,1]\). The initial densities \(\rho _0\) and \(\rho _1\) are squares of the same size centered at \((\frac{1}{3},\frac{1}{3})\) and \((\frac{2}{3},\frac{2}{3})\), respectively. In this experiment, the parameters are

$$\begin{aligned} N_x&=128\times 128,\\ N_t&=15. \end{aligned}$$

Figure 4 shows the minimizer m(x) of the tropical Wasserstein-1 distance and Fig. 5 shows the minimizer \(\rho (t,x)\) of the tropical Wasserstein-2 distance.

Fig. 4 Experiment 1: \(L^1\) tropical optimal transport. a, b show the initial densities \(\rho _0\) and \(\rho _1\), while c shows the geodesics of the \(L^1\) tropical optimal transport between \(\rho _0\) and \(\rho _1\)

Fig. 5 Experiment 1: \(L^2\) tropical optimal transport. The six figures show the geodesics of \(L^2\) tropical optimal transport from \(t=0\) to \(t=1\). The initial densities are the same as in Fig. 4

Experiment 2. As in Experiment 1, we consider a two-dimensional problem on \(\varOmega = [0,1]\times [0,1]\). The initial densities \(\rho _0\) and \(\rho _1\) are squares of the same size centered at \((\frac{1}{3},\frac{2}{3})\) and \((\frac{2}{3},\frac{1}{3})\), respectively. The parameters are the same as in Experiment 1. Together with Experiment 1, Experiment 2 shows that the geodesics of the minimizers of tropical optimal transport depend on the relative positions of the initial densities. See Fig. 6 for the \(L^1\) result and Fig. 7 for the \(L^2\) result.

Fig. 6 Experiment 2: \(L^1\) tropical optimal transport. a, b show the initial densities \(\rho _0\) and \(\rho _1\), while c shows the geodesics of the \(L^1\) tropical optimal transport between \(\rho _0\) and \(\rho _1\)

Fig. 7 Experiment 2: \(L^2\) tropical optimal transport. The figures show the geodesics of \(L^2\) tropical optimal transport between two initial densities from \(t=0\) to \(t=1\). The initial densities are the same as in Fig. 6

Experiment 3. We again consider a two-dimensional problem on \(\varOmega = [0,1]\times [0,1]\). The initial density \(\rho _0\) at time 0 is a square centered at (0.5, 0.5) with width 0.2. The density \(\rho _1\) at time 1 consists of four squares of the same size, each with width 0.1, centered at (0.2, 0.2), (0.2, 0.8), (0.8, 0.2) and (0.8, 0.8). The parameters are the same as in Experiment 1. See Fig. 8 for the \(L^1\) result and Fig. 9 for the \(L^2\) result; notice that the geodesics of the minimizers in both results depend on the direction in which the densities travel. The results of Experiment 3 are consistent with those of Experiments 1 and 2.

Fig. 8 Experiment 3: \(L^1\) tropical optimal transport. a, b show the initial densities \(\rho _0\) and \(\rho _1\), while c shows the geodesics of the \(L^1\) tropical optimal transport between \(\rho _0\) and \(\rho _1\). This experiment shows patterns of geodesics similar to those in Experiment 1 and Experiment 2

Fig. 9 Experiment 3: \(L^2\) tropical optimal transport. The six figures show the geodesics of \(L^2\) tropical optimal transport from \(t=0\) to \(t=1\). The initial densities are the same as in Fig. 8

Software. Software to implement the numerical experiments presented in this paper is publicly available and located on the TropicalOT GitHub repository at https://github.com/antheamonod/TropicalOT.

6 Discussion

In this paper, we connected optimal transport theory—specifically, dynamic optimal transport—with tropical geometry. In particular, we explicitly formulated geodesics for the tropical Wasserstein-p distances over the tropical projective torus. The tropical projective torus is the ambient space of the polyhedral Gröbner complex of a homogeneous ideal in a polynomial ring \(K[x_0, x_1, \ldots , x_n]\) over a field K—a foundational object in tropical geometry. It is also the ambient space of the space of phylogenetic trees.

We constructed and implemented primal-dual algorithms to compute tropical Wasserstein-1 and Wasserstein-2 geodesics on the tropical projective torus. These results provide a framework for identifying all of the infinitely many geodesic paths between points in this space, which leads to a better understanding of paths on the ambient space containing structures that are important in tropical geometry, both in theory and in applications. In addition, the Wasserstein-2 distance possesses an important structure for statistical inference, since it provides the form for Fréchet means on the tropical projective torus, as well as a general inner product structure.

Our research lays the foundation for further connections between optimal transport and tropical geometry. Our work provides tools to study the geometry of, and statistics on, the tropical projective torus. A current work in progress is to characterize and solve the optimal transport problem on the subset of the tropical projective torus corresponding to phylogenetic tree space with 5 leaves, \({\mathcal {T}}_5\). This space is made up of a union of 5!! = 15 polyhedral cones in the tropical projective torus, each with dimension 2. In this study, the main challenge involves the polyhedral structure of the tree space (as discussed in Sect. 2.2), and in particular, how to handle the intersections of the cones; weaker forms of the divergence and gradient operators are required to traverse the cones. The present work solves the problem within a single cone, where the shrink operator already has six cases (see Table 1); we expect the characterization of the shrink operator to be combinatorially more complicated on all 15 cones of \({\mathcal {T}}_5\).

From the perspective of optimal transport, the combinatorial structure of the tropical metric poses several interesting challenges. For example, the partial differential equations derived in Sect. 3 are defined in a piecewise manner: in two-dimensional sample space, there are six corresponding equations characterizing geodesics in optimal transport. In the general case, there are interesting regularity issues to be further studied. The theory of optimal transport and the study of associated density manifolds provide a natural base to construct heat equations with respect to the tropical metric. This offers a route to defining non-uniform probability distributions on the tropical projective torus: classically, the solution to the heat equation gives rise to the Gaussian distribution; thus, a solution to the tropical heat equation is a candidate for a tropical Gaussian distribution on the tropical projective torus [11, 46]. The dynamic setting of optimal transport with the tropical ground metric introduced in this paper also provides a foundation for studying displacement convexity and the Ricci curvature tensor on the tropical projective torus. In forthcoming work, we further study such questions by applying the relevant work of Li [23, 24], which also studies geometric and probabilistic questions in the context of optimal transport theory.