1 Introduction

The Wasserstein distance, also known as the Monge–Kantorovich distance, is used in optimal transport theory to describe and characterize optimal transitions between probability measures. These transitions are characterized by the lowest (or cheapest) average cost to fully transfer one probability measure into another. The costs are most typically proportional to the distance between the locations to be connected. Rachev and Rüschendorf (1998) provide a comprehensive discussion of the Wasserstein distance and Villani (2009) summarizes optimal transport theory.

The nested distance generalizes and extends the theory from probability measures to stochastic processes. It is based on the Wasserstein distance and has been introduced by Pflug (2009), cf. also Pflug and Pichler (2012). The nested distance quantifies the distance of stochastic processes and multistage stochastic programs are continuous with respect to the nested distance. Multistage stochastic programming has applications in many sectors, e.g., the financial sector (Edirisinghe 2005; Brodt 1983), in management science or in energy economics (Analui and Pflug 2014; Beltrán et al. 2017; Carpentier et al. 2012, 2015). The prices, demands, etc., are often modeled as a stochastic process \(\xi =(\xi _0,\dots ,\xi _T)\) and the optimal values are rarely obtained analytically. For the numerical approach the stochastic process is replaced by a finite valued stochastic scenario process \({{\tilde{\xi }}}=({{\tilde{\xi }}}_0,\dots ,{{\tilde{\xi }}}_T)\), which is a finite tree. Naturally, the approximation error should be minimized without unnecessarily increasing the complexity of the computational effort. Kirui et al. (2020) provide a Julia package for generating scenario trees and scenario lattices for multistage stochastic programming. Maggioni and Pflug (2019) provide guaranteed bounds and Horejšová et al. (2020) investigate corresponding reduction techniques.

This paper addresses the Sinkhorn divergence in place of the Wasserstein distance. This pseudo-distance is also called Sinkhorn distance or Sinkhorn loss. In contrast to exact implementations such as that of Bertsekas and Castanon (1989), the Sinkhorn divergence corresponds to a regularization of the Wasserstein distance, which is strictly convex and which allows improving the efficiency of the computation by applying Sinkhorn’s (fixed-point) iteration procedure. The relaxation itself is similar to the modified objective of interior-point methods in numerical optimization. A cornerstone is the theorem by Sinkhorn (1967), which both establishes a unique decomposition for non-negative matrices and ensures convergence of the associated iterative scheme. Cuturi (2013) has shown the potential of the Sinkhorn divergence and made it known to a wider audience. Nowadays, the Sinkhorn divergence is used in statistical applications, cf. Bigot et al. (2019) and Luise et al. (2018), for image recognition and machine learning, cf. Kolouri et al. (2017) and Genevay et al. (2018), among many other applications.

Extending Sinkhorn’s algorithm to multistage stochastic programming has been proposed recently in Tran (2020, Section 5.2.3, pp. 97–99), where a numerical example indicating computational advantages is also given. This paper takes up this idea and assesses the entropy-relaxed nested distance from a theoretical perspective. We address its approximating properties and derive its convex conjugate, the dual. Moreover, the numerical tests included confirm the computational advantage regarding the simplicity of the implementation as well as significant gains in speed.

Outline of the paper. Section 2 introduces the notation and provides the definitions needed to discuss the nested distance. Additionally, the importance of the filtration and the complexity of the computation are shown. Section 3 introduces the Sinkhorn divergence and derives its dual. In Sect. 4 we regularize the nested distance and show the equality between two different approaches. Results and comparisons are visualized and discussed in Sect. 5. Section 6 summarizes and concludes the paper.

2 Preliminaries

This section recalls the definition of distances generally, of the Wasserstein distance and nested distance and provides an example to highlight the impact of information available, which is increasing gradually over time and stages. Throughout, we shall base our exposition on a probability space \((\Xi , {\mathcal {F}}, P)\).

2.1 Wasserstein distance

The Wasserstein distance is a distance for probability measures. It is the building block for the process distance, which additionally takes information into account, and for its regularized version addressed here, the Sinkhorn divergence. The Sinkhorn divergence is not a distance in itself. To point out the differences we highlight the defining elements.

Definition 2.1

(Distance of measures) Let \({\mathcal {P}}\) be a set of probability measures on \(\Xi \). A function \(d:{\mathcal {P}}\times {\mathcal {P}}\rightarrow [0,\infty )\) is called distance, if it satisfies the following conditions:

  (i) Nonnegativity: for all \(P_1\), \(P_2\in {\mathcal {P}}\),

    $$\begin{aligned} d(P_1,P_2)\ge 0; \end{aligned}$$

  (ii) Symmetry: for all \(P_1\), \(P_2\in {\mathcal {P}}\),

    $$\begin{aligned} d(P_1,P_2)= d(P_2,P_1);\end{aligned}$$

  (iii) Triangle inequality: for all \(P_1\), \(P_2\) and \(P_{3}\in {\mathcal {P}}\),

    $$\begin{aligned} d(P_1,P_2)\le d(P_1,P_3)+d(P_3,P_2);\end{aligned}$$

  (iv) Definiteness: if \(d(P_1,P_2)=0\), then \(P_1=P_2\).

Rachev (1991) presents a huge variety of probability metrics. Here, we focus on the Wasserstein distance, which allows a generalization for stochastic processes. For this we assume that the sample space \(\Xi ={\mathbb {R}}^d\) is equipped with a metric d.

Definition 2.2

(Wasserstein distance) Let P and \({{\tilde{P}}}\) be two probability measures with finite moment of order r on \(\Xi \), which is endowed with a distance \(d:\Xi \times \Xi \rightarrow {\mathbb {R}}\). The Wasserstein distance of order \(r\ge 1\) is

$$\begin{aligned} d^r(P,{{\tilde{P}}}){:}{=}\inf _\pi \iint _{\Xi \times \Xi }d(\xi ,{{\tilde{\xi }}})^r\,\pi (\mathrm d\xi ,\mathrm d{{\tilde{\xi }}}), \end{aligned}$$

where the infimum is over all probability measures \(\pi \) on \(\Xi \times \Xi \) with marginals P and \({{\tilde{P}}}\), respectively.

Remark 2.3

(Distance versus cost functions) The definition of the Wasserstein distance presented here starts with a distance d on \(\Xi \) and the Wasserstein distance is a distance on \({\mathcal {P}}\) in the sense of Definition 2.1 above. We mention that the literature occasionally develops the theory for cost functions \(c:\mathcal X\times {\mathcal {X}}\rightarrow {\mathbb {R}}\) instead of the distance d. Also, the results presented below extend to cost functions in place of the distance on the underlying space.

In a discrete framework, probability measures are of the form \(P=\sum \nolimits _{i=1}^np_i\,\delta _{\xi _i}\) with \(p_i\ge 0\) and \(\sum \nolimits _{i=1}^np_i=1\) and the support of P (\(\{\xi _i:i= 1,2,\dots ,n\}\subset \Xi \)) is finite. By Definition 2.2, the Wasserstein distance \(d^r\) of two discrete measures \(P=\sum _{i=1}^np_i\,\delta _{\xi _i}\) and \({{\tilde{P}}}= \sum _{j=1}^{{{\tilde{n}}}}{{\tilde{p}}}_j\,\delta _{{{\tilde{\xi }}}_j}\) is the r-th root of the optimal value of

$$\begin{aligned} \text {minimize }_{\text {in }\pi \ }&\sum _{i=1}^n\sum _{j=1}^{{{\tilde{n}}}} \pi _{ij}\,d_{ij}^r \nonumber \\ \text {subject to }&\sum _{j=1}^{{{\tilde{n}}}}\pi _{ij}=p_i,&i=1,\dots ,n, \nonumber \\&\sum _{i=1}^n\pi _{ij}={{\tilde{p}}}_j,&j=1,\dots ,{{\tilde{n}}} \text { and} \nonumber \\&\pi _{ij}\ge 0, \end{aligned}$$
(2.1)

where

$$\begin{aligned} d_{ij}{:}{=}d(\xi _i,{{\tilde{\xi }}}_j)\end{aligned}$$
(2.2)

is an \(n\times {{\tilde{n}}}\)-matrix collecting all distances. The optimal measure in (2.1) is denoted \(\pi ^W\) and called an optimal transport plan. The convex, linear dual of (2.1) is

$$\begin{aligned} \text {maximize}_{\text { in }\lambda \text { and } \mu }&\sum _{i=1}^n p_i\,\lambda _i+\sum _{j=1}^{{{\tilde{n}}}} {{\tilde{p}}}_j\,\mu _j \end{aligned}$$
(2.3a)
$$\begin{aligned} \text {subject to }&\lambda _i+\mu _j\le d_{ij}^r \ \text { for all }i=1,\dots ,n \text { and } j=1,\dots ,{{\tilde{n}}}. \end{aligned}$$
(2.3b)

Remark 2.4

The problem (2.1) can be written as linear optimization problem

$$\begin{aligned} \text {minimize}_{\text { in }x\ }&c^{\top }x\\ \text {subject to }&Ax=b,\\&x\ge 0, \end{aligned}$$

where \(x=(\pi _{11},\pi _{21},\dots ,\pi _{n{{\tilde{n}}}})^\top \), \(c=(d_{11}^r, d_{21}^r,\dots ,d_{n{{\tilde{n}}}}^r)^\top \), \(b=(p_1,\dots ,p_n,{{\tilde{p}}}_1,\dots ,{{\tilde{p}}}_{{{\tilde{n}}}})^\top \) and A is the matrix

$$\begin{aligned} A=\begin{pmatrix}{\mathbb {1}}_{{{\tilde{n}}}}\otimes I_n\\ I_{{{\tilde{n}}}}\otimes {\mathbb {1}}_n \end{pmatrix}\end{aligned}$$

with \({\mathbb {1}}=(1,\dots ,1)\); here, \(\otimes \) denotes the Kronecker product.
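The block structure of A can be checked numerically. The following is a minimal sketch in plain Python (lists instead of a matrix library, to keep it dependency-free); the helper names `kron`, `constraint_matrix`, `vec` and `matvec` as well as the marginals are hypothetical and chosen for illustration only. It builds A from the two Kronecker products and verifies that the product measure, stacked column-wise as x, satisfies Ax = b.

```python
def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def identity(k):
    """k x k identity matrix."""
    return [[1 if r == c else 0 for c in range(k)] for r in range(k)]


def constraint_matrix(n, n_tilde):
    """Stack the row-sum block (1 kron I_n) on top of the column-sum block (I kron 1)."""
    row_sums = kron([[1] * n_tilde], identity(n))   # n rows: sum_j pi_ij = p_i
    col_sums = kron(identity(n_tilde), [[1] * n])   # n_tilde rows: sum_i pi_ij = p~_j
    return row_sums + col_sums


def vec(pi):
    """Column-wise stacking: (pi_11, pi_21, ..., pi_{n n_tilde})."""
    return [pi[i][j] for j in range(len(pi[0])) for i in range(len(pi))]


def matvec(a, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(u * v for u, v in zip(row, x)) for row in a]


# Hypothetical marginals; the product measure is always a feasible plan.
p = [0.2, 0.3, 0.5]
p_tilde = [0.6, 0.4]
pi = [[a * c for c in p_tilde] for a in p]
b = matvec(constraint_matrix(len(p), len(p_tilde)), vec(pi))
```

In this sketch `b` reproduces the concatenated marginals up to rounding. Note that the rows of A are linearly dependent, since both blocks sum to the same total mass.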

2.2 The distance of stochastic processes

Let \((\Xi ,{\mathcal {F}},P)\) and \(({{\tilde{\Xi }}},\tilde{{\mathcal {F}}},{{\tilde{P}}})\) be two probability spaces. We now consider two stochastic processes with realizations \(\xi \), \({{\tilde{\xi }}}\in \Xi \) and \(\Xi {:}{=}\Xi _0 \times \Xi _1\times \dots \times \Xi _T\). There are many metrics d such that \((\Xi ,d)\) is a metric space. Without loss of generality we may set \(\Xi _t={\mathbb {R}}\) for all \(t\in \{0,1,\dots ,T\}\) and employ the \(\ell ^1\)-distance, i.e., \(d(\xi ,{{\tilde{\xi }}})= \sum _{t=0}^T|\xi _t-{{\tilde{\xi }}}_t|\). As in (2.1) above, the distance matrix \(d_{ij}\) collects the distances of scenarios for discrete measures, cf. (2.2).

A stochastic process with finitely many states (i.e., outcomes) for \(t\in \{0,1,\dots ,T\}\) is a scenario tree. Scenario trees are frequently employed in optimization under uncertainty to model the random outcome in the evolution of a process which describes the random price, say, of an underlying asset. They are convenient, because they can be implemented in computers to assess each individual trajectory as possible realization of the stochastic process.

Figures 1 and 3 depict such scenario trees; they additionally indicate the transition probabilities.

Remark 2.5

Figure 1 illustrates that the Wasserstein distance does not capture the different information (knowledge) available at the intermediate stage. Indeed, with \(\epsilon >0\), the matrix collecting the distances of the trajectories taken from both trees is

$$\begin{aligned} d=\begin{pmatrix}\epsilon &{} 2+\epsilon \\ 2 &{} 0 \end{pmatrix} \end{aligned}$$

and the optimal transport plan for the Wasserstein distance is

$$\begin{aligned} \pi =\frac{1}{2}\begin{pmatrix}1 &{} 0\\ 0 &{} 1 \end{pmatrix}. \end{aligned}$$

The Wasserstein distance, according to (2.1), is \(\frac{1}{2}\cdot \epsilon +\frac{1}{2}\cdot 0=\frac{\epsilon }{2}\), where a small value for \(\epsilon \) indicates that the processes are similar.

However, the information available at stage 1 is very distinct in both trees in Fig. 1. When observing \(2+\epsilon \) at stage 1 in the first tree it is certain that the next observation is 3, and it will be 1 when observing 2. In contrast, the second process does not provide any certainty whether the result will be 1 or 3 after observing 2 at the first stage.
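The numbers of this remark can be reproduced with a few lines of Python. The sketch below rests on assumptions that are not stated explicitly in the text: both trees start at a common root value (the hypothetical parameter `root`, which cancels in the distances), both leaves of each tree carry probability 1/2, and the trajectories are those described above. With two equally weighted points on each side, every feasible plan is a convex combination of the diagonal and the anti-diagonal plan, so the linear objective attains its minimum at one of the two.

```python
def l1_distance(path_a, path_b):
    """l1 distance of two trajectories of equal length."""
    return sum(abs(a - b) for a, b in zip(path_a, path_b))


def wasserstein_two_point(eps, root=2.0):
    """Distance matrix and optimal value for the two assumed trees."""
    first = [(root, 2 + eps, 3), (root, 2, 1)]   # trajectories of the first tree
    second = [(root, 2, 3), (root, 2, 1)]        # trajectories of the second tree
    d = [[l1_distance(a, b) for b in second] for a in first]
    diagonal = 0.5 * d[0][0] + 0.5 * d[1][1]       # plan (1/2) * identity
    anti_diagonal = 0.5 * d[0][1] + 0.5 * d[1][0]  # plan (1/2) * swap
    return d, min(diagonal, anti_diagonal)


d, value = wasserstein_two_point(eps=0.1)
```

For these assumed trees the diagonal plan is the cheaper one and `value` equals half of `eps` up to rounding.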

Fig. 1 Two processes illustrating two different flows of information, cf. Heitsch et al. (2006), Kovacevic and Pichler (2015). The arcs of the stochastic tree display the transition probabilities

We conclude from the preceding remark that the Wasserstein distance is not suitable to distinguish stochastic processes with different flows of information. The reason is that this approach does not involve conditional probabilities at stages \(t=0,1,\dots ,T-1\), but only probabilities at the final stage \(t=T\), where all the information available at intermediate stages is ignored.

We follow the usual convention and express information, which is accessible, by corresponding sets. The information available at every stage t includes information from preceding stages, which have been revealed, but excludes information from later, future stages. For this reason the sets

$$\begin{aligned}A_1\times \dots \times A_t\times \Xi _{t+1}\times \dots \times \Xi _T,\quad A_{t^\prime }\subset \Xi _{t^\prime } \text { measurable},\end{aligned}$$

encode the information available at stage t, they constitute the \(\sigma \)-algebra \({\mathcal {F}}_t\) (\(\tilde{\mathcal F}_t\), resp.). The following generalization of the Wasserstein distance takes this flow of increasing information explicitly into account. We state the definition involving general probability measures here, although we work with discrete measures only in what follows.

Definition 2.6

(The nested distance) The nested distance of order \(r\ge 1\) of two filtered probability spaces \({\mathbb {P}}=(\Xi ,({\mathcal {F}}_t),P)\) and \(\tilde{{\mathbb {P}}}=({{\tilde{\Xi }}}, (\tilde{{\mathcal {F}}}_t), {{\tilde{P}}})\) with finite moment of order r with respect to the distance \(d:\Xi \times {{\tilde{\Xi }}}\rightarrow {\mathbb {R}}\) is the optimal value of the optimization problem

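$$\begin{aligned} \text {minimize }_{\text {in }\pi \ }&\left( \iint _{\Xi \times {{\tilde{\Xi }}}}d(\xi ,{{\tilde{\xi }}})^r\,\pi (\mathrm d\xi ,\mathrm d{{\tilde{\xi }}})\right) ^{1/r} \end{aligned}$$
(2.4)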
$$\begin{aligned} \text {subject to }&\pi (A\times {{\tilde{\Xi }}}\mid {\mathcal {F}}_t\otimes \tilde{{\mathcal {F}}}_t)=P(A\mid {\mathcal {F}}_t) \text { a.s. for every } A\in {\mathcal {F}}_t,\ t=1,\dots ,T, \end{aligned}$$
(2.5)
$$\begin{aligned}&\pi (\Xi \times B\mid {\mathcal {F}}_t\otimes \tilde{\mathcal F}_t)={{\tilde{P}}}(B\mid \tilde{{\mathcal {F}}}_t) \text { a.s. for every } B\in \tilde{{\mathcal {F}}}_t,\ t=1,\dots ,T, \end{aligned}$$
(2.6)

where the infimum is among all bivariate probability measures \(\pi \in {\mathcal {P}}(\Xi \times {{\tilde{\Xi }}})\). The optimal value of (2.4), the nested distance of order r, is denoted by \({\varvec{d}}^r({\mathbb {P}},\tilde{{\mathbb {P}}})\).

For discrete stochastic processes we use trees to model the whole space and filtration. In the stochastic tree, \({\mathcal {N}}_t\) (\(\tilde{{\mathcal {N}}}_t\), resp.) denotes the set of all nodes at the stage t. Furthermore, a predecessor m of the node i, not necessarily the immediate predecessor, is indicated by \(m\prec i\). Here, we may replace the conditional probabilities in (2.5) and (2.6) by the genuine transition probabilities. The arcs of the tree in Fig. 1 exemplarily display these transition probabilities.

The nested distance for stochastic trees is the r-th root of the optimal value of

$$\begin{aligned} \text {minimize }_{\text {in } \pi \ }&\sum _{i,j}\pi _{ij}\cdot d_{ij}^r\nonumber \\ \text {subject to }&\sum _{j\succ j_t}\pi (i,j\mid i_t,j_t)=P(i\mid i_t),&i_t\prec i,\,j_t,\nonumber \\&\sum _{i\succ i_t}\pi (i,j\mid i_t,j_t)={{\tilde{P}}}(j\mid j_t),&j_t\prec j,\,i_t,\nonumber \\&\pi _{ij}\ge 0 \text { and } \sum _{i,j}\pi _{ij}=1, \end{aligned}$$
(2.7)

where \(i\in {\mathcal {N}}_T\) and \(j\in \tilde{{\mathcal {N}}}_T\) are the leaf nodes and \(i_t\in {\mathcal {N}}_t\) as well as \(j_t\in \tilde{\mathcal N}_t\) are nodes at the same stage t and \(P(i\mid i_t){:}{=}\frac{P(i)}{P(i_t)}\) is the conditional probability. Analogously, the conditional probabilities \(\pi (i,j\mid i_t,j_t)\) are

$$\begin{aligned} \pi (i,j\mid i_t,j_t){:}{=}\frac{\pi _{ij}}{\sum _{i'\succ i_t,j'\succ j_t}\pi _{i^\prime j^\prime }}. \end{aligned}$$
(2.8)

Remark 2.7

Employing the definition (2.8) for \(\pi (i,j\mid i_t,j_t)\) reveals that the problem (2.7) is indeed a linear program in \(\pi \) (cf. (2.1)).


2.3 Rapid, nested computation of the process distance

This subsection addresses an advanced approach for solving the linear program (2.7). We first recall the tower property, which allows an important simplification of the constraints in (2.4).

Lemma 2.8

To compute the nested distance it is enough to condition on the immediately following \(\sigma \)-algebra: the conditions

$$\begin{aligned} \pi \big (A\times \Xi \mid {\mathcal {F}}_t\otimes \tilde{{\mathcal {F}}}_t\big )\ \text { for all }\ A\in {\mathcal {F}}_T\end{aligned}$$

in (2.4) may be replaced by

$$\begin{aligned} \pi \big (A\times \Xi \mid {\mathcal {F}}_t\otimes \tilde{{\mathcal {F}}}_t\big )\ \text { for all }\ A\in {\mathcal {F}}_{t+1}.\end{aligned}$$

Proof

The proof is based on the tower property of the expectation and can be found in Pflug and Pichler (2014, Lemma 2.43). \(\square \)

As a result of the tower property the full problem (2.7) can be calculated faster in a recursive way and the matrix for the constraints does not have to be stored. We employ this result in an algorithm below. For further details we refer to Pflug and Pichler (2014, Chapter 2.10.3). The collection of all direct successors of node \(i_t\) (\(j_t\), resp.) is denoted by \(i_t+\) (\(j_t+\), resp.).

3 Sinkhorn divergence

In what follows we consider the entropy-regularization of the Wasserstein distance (2.1) and characterize its dual. Moreover, we recall Sinkhorn’s algorithm, which allows a considerably faster implementation. These results are then combined to accelerate the computation of the nested distance.

3.1 Entropy-regularized Wasserstein distance

Interior point methods add a logarithmic penalty to the objective to force the optimal solution of the modified problem into the strict interior. The Sinkhorn distance proceeds similarly. The regularizing term \(H(x){:}{=}-\sum _{i,j}x_{ij}\log x_{ij}\) is added to the cost function in problem (2.1). This has proven beneficial in other problem settings as well.

Remark 3.1

The mapping \(\varphi (y){:}{=}y\log y\) is convex and negative for \(y\in (0,1)\) with continuous extensions \(\varphi (0)=\varphi (1)=0\) so that \(H(x)\ge 0\), provided that all \(x_{ij}\in [0,1]\).

Definition 3.2

(Sinkhorn divergence) The Sinkhorn divergence is the objective of the optimization problem

$$\begin{aligned} \text {minimize}_{\text { in } \pi \ }&{\sum _{i=1}^n\sum _{j=1}^{{{\tilde{n}}}} \pi _{ij}\, d_{ij}^r-\frac{1}{\lambda }H(\pi )} \end{aligned}$$
(3.1a)
$$\begin{aligned} \text {subject to }&\sum _{j=1}^{{{\tilde{n}}}}\pi _{ij}=p_i,&i=1,\dots ,n,\nonumber \\&\sum _{i=1}^n\pi _{ij}={{\tilde{p}}}_j,&j=1,\dots ,{{\tilde{n}}},\nonumber \\&\pi _{ij}>0&\text { for all } i, j, \end{aligned}$$
(3.1b)

where d is a distance or a cost matrix and \(\lambda >0\) is a regularization parameter. With \(\pi ^S\) being the optimal transport plan in (3.1a)–(3.1b) we denote the Sinkhorn divergence by

$$\begin{aligned} d_S^r{:}{=}\sum _{i=1}^n\sum _{j=1}^{{{\tilde{n}}}}\pi _{ij}^S\, d_{ij}^r \end{aligned}$$

and the Sinkhorn divergence including the entropy by

$$\begin{aligned} de_S^r{:}{=}\sum _{i=1}^n\sum _{j=1}^{\tilde{n}}\pi _{ij}^S\,d_{ij}^r-\frac{1}{\lambda }H\big (\pi ^S\big ). \end{aligned}$$

We may mention here that we avoid the term Sinkhorn distance since for all \(\lambda >0\), the Sinkhorn divergence \(d_S^r\) is strictly positive and \(de_S^r\) can be negative for small \(\lambda \) which violates the axioms of a distance given in Definition 2.1 above (particularly (i), (iii) and (iv)). Strict positivity of \(d_S^r\) can be forced by a correction term, the so-called Sinkhorn loss [see Bigot et al. (2019, Definition 2.3)] or by employing the cost matrix \(d\cdot {\mathbb {1}}_{p\ne {{\tilde{p}}}}\) instead.

Remark 3.3

The strict inequality constraint (3.1b) is not a restriction. Indeed, the mapping \(\varphi (\cdot )\) defined in Remark 3.1 has derivative \(\varphi '(0)=-\infty \) and thus it follows that every optimal measure satisfies the strict inequality \(\pi _{ij}>0\) for \(\lambda >0\).

We have the following inequalities.

Proposition 3.4

(Comparison of Sinkhorn and Wasserstein) It holds that

$$\begin{aligned} de_S^r\le d^r\le d_S^r. \end{aligned}$$
(3.2)

Proof

Recall that \(\pi \log \pi \le 0\) for all \(\pi \in (0,1)\) and thus it holds that \(\sum _{i=1}^n\sum _{j=1}^{{{\tilde{n}}}} \pi _{ij}\,d_{ij}^r+\frac{1}{\lambda }\sum _{i=1}^n\sum _{j=1}^{\tilde{n}}\pi _{ij}\log \pi _{ij} \le \sum _{i=1}^n\sum _{j=1}^{{{\tilde{n}}}} \pi _{ij}\,d_{ij}^r\) for all \(\pi \in (0,1]^{n\times {{\tilde{n}}}}\). It follows that

$$\begin{aligned} \min _{\pi }\ \sum _{i=1}^n\sum _{j=1}^{{{\tilde{n}}}}\pi _{ij}\,d_{ij}^r +\frac{1}{\lambda }\sum _{i=1}^n\sum _{j=1}^{{{\tilde{n}}}}\pi _{ij}\log \pi _{ij} \le \min _\pi \ \sum _{i=1}^n\sum _{j=1}^{{{\tilde{n}}}}\pi _{ij}\,d_{ij}^r\end{aligned}$$

and thus the first inequality. The remaining inequality is clear by the definition of the Wasserstein distance. \(\square \)

Both Sinkhorn divergences \(d_S^r\) and \(de_S^r\) approximate the Wasserstein distance \(d^r\), and we have convergence for \(\lambda \rightarrow \infty \) to \(d^r\). The following proposition provides precise bounds.

Proposition 3.5

For every \(\lambda >0\) we have

$$\begin{aligned} 0\le d_S^r-d^r\le \frac{1}{\lambda }\left( H(\pi ^S)-H(\pi ^W)\right) \end{aligned}$$
(3.3)

and

$$\begin{aligned} 0\le d^r - de_S^r \le \frac{1}{\lambda }H(\pi ^S)\le \frac{1}{\lambda }H(p\cdot {{\tilde{p}}}^{\top }) \end{aligned}$$
(3.4)

with \(p=(p_1,\dots ,p_n)\) and \({{\tilde{p}}}=({{\tilde{p}}}_1,\dots ,\tilde{p}_{{{\tilde{n}}}})\), respectively.

Proof

The first inequalities follow from (3.2) and from optimality of \(\pi ^S\) in the inequality

$$\begin{aligned} d_S^r-\frac{1}{\lambda }H(\pi ^S)\le d^r-\frac{1}{\lambda }H(\pi ^W). \end{aligned}$$

The latter follows again from (3.2) and \(d_S^r-de_S^r= \frac{1}{\lambda }H\big (\pi ^S\big )\). Finally, by the log sum inequality, \(H(\pi )\le H\big (p\cdot {{\tilde{p}}}^\top \big )\) for every measure \(\pi \) with marginals p and \({{\tilde{p}}}\). \(\square \)

Remark 3.6

As a consequence of the log sum inequality we obtain as well that \(H(\pi ^S)\le \log n+\log {{\tilde{n}}}\). The inequalities (3.3) and  (3.4) thus give strict upper bounds in comparing the Wasserstein distance and the Sinkhorn divergence.
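The two entropy bounds used above can be made explicit. The following short derivation is a sketch that uses only the marginal constraints of \(\pi \) and the log sum inequality; the entropy of a probability vector, \(H(p){:}{=}-\sum _{i}p_i\log p_i\), is written out here for convenience and satisfies \(H(p)\le \log n\).

$$\begin{aligned} H\big (p\cdot {{\tilde{p}}}^\top \big )&=-\sum _{i=1}^n\sum _{j=1}^{{{\tilde{n}}}}p_i\,{{\tilde{p}}}_j\,\big (\log p_i+\log {{\tilde{p}}}_j\big )=H(p)+H({{\tilde{p}}})\le \log n+\log {{\tilde{n}}},\\ H\big (p\cdot {{\tilde{p}}}^\top \big )-H(\pi )&=\sum _{i=1}^n\sum _{j=1}^{{{\tilde{n}}}}\pi _{ij}\log \frac{\pi _{ij}}{p_i\,{{\tilde{p}}}_j}\ge \Big (\sum _{i,j}\pi _{ij}\Big )\log \frac{\sum _{i,j}\pi _{ij}}{\sum _{i,j}p_i\,{{\tilde{p}}}_j}=0. \end{aligned}$$

The first equality in the second line holds because \(\sum _{j}\pi _{ij}=p_i\) and \(\sum _{i}\pi _{ij}={{\tilde{p}}}_j\), so that \(-\sum _{i,j}\pi _{ij}\log p_i=H(p)\) and \(-\sum _{i,j}\pi _{ij}\log {{\tilde{p}}}_j=H({{\tilde{p}}})\).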

Alternative definitions There exist alternative concepts of the Sinkhorn divergence which we want to mention here. The first alternative definition involves the Kullback–Leibler divergence \(D_\textit{KL}(\pi \mid P\otimes {{\tilde{P}}})\), which is defined as

$$\begin{aligned} D_\textit{KL}(\pi \mid P\otimes {{\tilde{P}}}){:}{=}\sum _{i=1}^n\sum _{j=1}^{{{\tilde{n}}}}\pi _{ij}\log \frac{\pi _{ij}}{p_i\,{{\tilde{p}}}_j}=H(P)+H({{\tilde{P}}})-H(\pi ),\end{aligned}$$

where the latter equality is justified provided that \(\pi \) has marginal measures P and \({{\tilde{P}}}\). The Sinkhorn divergence (in the alternative definition) is the r-th root of the optimal value of

$$\begin{aligned} \text {minimize}_{\text { in }\pi }&\sum _{i=1}^n\sum _{j=1}^{{{\tilde{n}}}}\pi _{ij}\,d_{ij}^r \end{aligned}$$
(3.5a)
$$\begin{aligned} \text {subject to }&\sum _{j=1}^{{{\tilde{n}}}}\pi _{ij}=p_i,&i=1,\dots ,n,\nonumber \\&\sum _{i=1}^n\pi _{ij}={{\tilde{p}}}_j,&j=1,\dots ,{{\tilde{n}}},\nonumber \\&\pi _{ij}>0&\text {for all }i,j\text { and}\nonumber \\&D_\textit{KL}(\pi \mid P\otimes {{\tilde{P}}})\le \alpha , \end{aligned}$$
(3.5b)

where \(\alpha \ge 0\) is the regularization parameter. For each \(\alpha \) in (3.5b) we have by the duality theory a corresponding \(\lambda \) in (3.1a) such that the optimal values coincide. Let \(\alpha >0\) and \(\pi ^\textit{KL}\) be the solution to problem (3.5a)–(3.5b) with Lagrange multipliers \(\beta \) and \(\gamma \). Then the optimal value of problem (3.5a) equals \(d_S^r\) from (3.1a) with

$$\begin{aligned} \lambda =-\frac{\log (\pi ^\textit{KL}_{ij})+1}{d_{ij}+\beta _i+\gamma _j} \end{aligned}$$

for any \(i\in \{1,\dots ,n\}\) and \(j\in \{1,\dots ,{{\tilde{n}}}\}\). For further information and illustration we refer to Cuturi (2013, Section 3).
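The relation between the Kullback–Leibler divergence and the entropies can be checked numerically. The sketch below uses the usual convention \(\sum _{i,j}\pi _{ij}\log \frac{\pi _{ij}}{p_i{{\tilde{p}}}_j}\) for the divergence with respect to the product measure; the function names and the example coupling are hypothetical and serve only as an illustration.

```python
import math


def entropy(weights):
    """H(w) = -sum w log w, with the convention 0 log 0 = 0."""
    return -sum(w * math.log(w) for w in weights if w > 0)


def kl_to_product(pi, p, p_tilde):
    """Kullback-Leibler divergence of the coupling pi from the product of its marginals."""
    return sum(pi[i][j] * math.log(pi[i][j] / (p[i] * p_tilde[j]))
               for i in range(len(p))
               for j in range(len(p_tilde))
               if pi[i][j] > 0)


# Hypothetical coupling; the marginals are computed from it.
pi = [[0.3, 0.1], [0.2, 0.4]]
p = [sum(row) for row in pi]
p_tilde = [sum(col) for col in zip(*pi)]

lhs = kl_to_product(pi, p, p_tilde)
rhs = entropy(p) + entropy(p_tilde) - entropy([x for row in pi for x in row])
```

Both sides agree up to rounding and are non-negative; the constraint in the alternative definition therefore bounds how far the entropy of the plan may fall below that of the product measure.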

A further, possible definition employs a different entropy regularization given by

$$\begin{aligned}{{\tilde{H}}}(\pi )=-\sum _{i=1}^n\sum _{j=1}^{{{\tilde{n}}}}\pi _{ij}\cdot (\log \pi _{ij}-1).\end{aligned}$$

Luise et al. (2018) use this definition for Sinkhorn approximation for learning with Wasserstein distance and prove an exponential convergence. This definition leads to a similar matrix decomposition and iterative algorithm as described in the following sections.

3.2 Dual representation of Sinkhorn

We shall derive Sinkhorn’s algorithm and its extension to the nested distance via duality. To this end consider the Lagrangian function

$$\begin{aligned} L(\pi ;\beta ,\gamma )&{:}{=}&\sum _{i=1}^n\sum _{j=1}^{\tilde{n}}\pi _{ij}\,d_{ij}+\frac{1}{\lambda }\sum _{i=1}^n\sum _{j=1}^{\tilde{n}}\pi _{ij}\log \pi _{ij} +\beta ^\top (p-\pi \cdot {\mathbb {1}})\nonumber \\&+(\tilde{p}-{\mathbb {1}}^{\top }\cdot \pi )^\top \gamma \end{aligned}$$
(3.6)

of the problem (3.1a)–(3.1b). The partial derivatives are

$$\begin{aligned} \frac{\partial L}{\partial \pi _{ij}}=\frac{1}{\lambda }\left( \log \pi _{ij}+1\right) +d_{ij}-\beta _i-\gamma _j=0, \end{aligned}$$
(3.7)

and it follows from (3.7) that the optimal measure has entries

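$$\begin{aligned} \pi _{ij}^*=\exp \big (-\lambda (d_{ij}-\beta _i-\gamma _j)-1\big ). \end{aligned}$$
(3.8)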

By inserting \(\pi _{ij}^*\) in the Lagrangian function L we get the convex dual function

$$\begin{aligned}&d(\beta ,\gamma ) {:}{=}\inf _{\pi }L(\pi ;\beta ,\gamma )=L(\pi ^{*};\beta ,\gamma )\\&\quad =\sum _{i=1}^n\sum _{j=1}^{{{\tilde{n}}}} d_{ij}\cdot e^{-\lambda (d_{ij}-\beta _i-\gamma _j)-1}-\frac{1}{\lambda }\sum _{i=1}^n\sum _{j=1}^{{{\tilde{n}}}} e^{-\lambda (d_{ij}-\beta _i-\gamma _j)-1}\cdot \big (\lambda (d_{ij}-\beta _i-\gamma _j)+1\big )\\&\qquad +\sum _{i=1}^n\beta _i\left( p_i-\sum _{j=1}^{{{\tilde{n}}}}e^{-\lambda (d_{ij}-\beta _i-\gamma _j)-1}\right) +\sum _{j=1}^{{{\tilde{n}}}}\gamma _j\left( {{\tilde{p}}}_j-\sum _{i=1}^n e^{-\lambda (d_{ij}-\beta _i-\gamma _j)-1}\right) \\&\quad =\sum _{i=1}^n\sum _{j=1}^{{{\tilde{n}}}} \left( \beta _i+\gamma _j-\frac{1}{\lambda }\right) e^{-\lambda (d_{ij}-\beta _i-\gamma _j)-1}+\sum _{i=1}^n\beta _ip_i+\sum _{j=1}^{{{\tilde{n}}}}\gamma _j{{\tilde{p}}}_j\\&\qquad -\sum _{i=1}^n \beta _i\left( \sum _{j=1}^{{{\tilde{n}}}}e^{-\lambda (d_{ij}-\beta _i-\gamma _j)-1}\right) -\sum _{j=1}^{{{\tilde{n}}}}\gamma _j\left( \sum _{i=1}^n e^{-\lambda (d_{ij}-\beta _i-\gamma _j)-1}\right) \\&\quad =\sum _{i=1}^n\beta _i\,p_i+\sum _{j=1}^{{{\tilde{n}}}}\gamma _j\,\tilde{p}_j-\frac{1}{\lambda }\sum _{i=1}^n\sum _{j=1}^{\tilde{n}}e^{-\lambda (d_{ij}-\beta _i-\gamma _j)-1}. \end{aligned}$$

The dual problem thus is

$$\begin{aligned} \text {maximize}_{\text { in }\beta , \gamma }&\sum _{i=1}^n\beta _i\,p_i+\sum _{j=1}^{{{\tilde{n}}}}\gamma _j\,{{\tilde{p}}}_j-\frac{1}{\lambda }\sum _{i=1}^n\sum _{j=1}^{{{\tilde{n}}}} e^{-\lambda (d_{ij}-\beta _i-\gamma _j)-1}\\ \text {subject to }&\beta \in {\mathbb {R}}^n,\ \gamma \in \mathbb R^{{{\tilde{n}}}}. \end{aligned}$$

Due to \(\sum _{i,j}e^{-\lambda (d_{ij}-\beta _i-\gamma _j)-1}=1\) we may write the latter problem as

$$\begin{aligned} \text {maximize}_{\text { in }\beta ,\gamma \ }&\sum _{i=1}^n\beta _i\,p_i+\sum _{j=1}^{{{\tilde{n}}}}\gamma _j\,{{\tilde{p}}}_j \end{aligned}$$
(3.9a)
$$\begin{aligned} \text {subject to }&\sum _{i=1}^n\sum _{j=1}^{\tilde{n}}e^{-\lambda (d_{ij}-\beta _i-\gamma _j)-1}=1\text { and } \beta \in {\mathbb {R}}^n,\ \gamma \in {\mathbb {R}}^{{{\tilde{n}}}}. \end{aligned}$$
(3.9b)

Remark 3.7

We deduce from (3.9b) that \(-\lambda \left( d_{ij}- \beta _i- \gamma _j\right) - 1\le 0\), or

$$\begin{aligned} \beta _i+\gamma _j\le d_{ij}+\frac{1}{\lambda }\quad \text { for all }i, j \end{aligned}$$
(3.10)

provided that \(\lambda >0\). It is thus apparent that (3.9a)–(3.9b) is a relaxation of problem (2.3a)–(2.3b) together with the constraint (3.10). As well, observe that both problems coincide for \(\lambda \rightarrow \infty \) in (3.9b).

3.3 Sinkhorn’s algorithm

To derive Sinkhorn’s algorithm we consider the Lagrangian function (3.6) again, but now for the remaining variables. Similar to \(\pi ^*\) in (3.8), the gradients are

$$\begin{aligned} \frac{\partial L}{\partial \beta _i}=p_i- \sum _{j=1}^{\tilde{n}}\pi _{ij}= p_i-\sum _{j=1}^{\tilde{n}}e^{-\lambda (d_{ij}-\beta _i-\gamma _j)-1}=0 \end{aligned}$$

and

$$\begin{aligned} \frac{\partial L}{\partial \gamma _j}= {{\tilde{p}}}_j-\sum _{i=1}^n\pi _{ij} ={{\tilde{p}}}_j-\sum _{i=1}^ne^{-\lambda (d_{ij}-\beta _i-\gamma _j)-1}=0 \end{aligned}$$

so that the equations

$$\begin{aligned} \beta _i=\frac{1}{\lambda }\log \left( \frac{p_i}{\sum _{j=1}^{\tilde{n}}e^{-\lambda (d_{ij}-\gamma _j)-1}}\right) \quad \text {and}\quad \gamma _j=\frac{1}{\lambda }\log \left( \frac{\tilde{p}_j}{\sum _{i=1}^ne^{-\lambda (d_{ij}-\beta _i)-1}}\right) \end{aligned}$$

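follow. To avoid the logarithm introduce \({{\tilde{\beta }}}_i{:}{=}e^{\lambda \beta _i-1/2}\) and \({{\tilde{\gamma }}}_j{:}{=}e^{\lambda \gamma _j-1/2}\) and rewrite the latter equations as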

$$\begin{aligned} {{\tilde{\beta }}}_i= \frac{p_i}{\sum _{j=1}^{\tilde{n}}e^{-\lambda \,d_{ij}}\,{{\tilde{\gamma }}}_j} \quad \text {and}\quad {{\tilde{\gamma }}}_j= \frac{{{\tilde{p}}}_j}{\sum _{i=1}^n \tilde{\beta }_i\,e^{-\lambda \,d_{ij}}}, \end{aligned}$$
(3.11)

while the optimal transition plan (3.8) is

$$\begin{aligned}\pi _{ij}^*= {{\tilde{\beta }}}_i\cdot e^{-\lambda \,d_{ij}} \cdot {{\tilde{\gamma }}}_j.\end{aligned}$$

The simple starting point of Sinkhorn’s iteration is that (3.11) can be used to determine \({{\tilde{\beta }}}\) and \({{\tilde{\gamma }}}\) alternately. Indeed, Sinkhorn’s theorem (cf. Sinkhorn 1967; Sinkhorn and Knopp 1967) for the matrix decomposition ensures that iterating (3.11) converges and the vectors \({{\tilde{\beta }}}\) and \({{\tilde{\gamma }}}\) are unique up to a scalar. Algorithm 2 summarizes the individual steps again.
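The alternating updates (3.11) translate almost literally into code. The following is a minimal sketch in plain Python, not a reproduction of the algorithm listing of this paper: the function names, the stopping rule based on the row sums, and the use of \(d_{ij}^r\) in the kernel are assumptions made for illustration. For large \(\lambda \) the kernel entries may underflow, and a stabilized variant as discussed in the literature cited in this section would be needed.

```python
import math


def sinkhorn(d, p, p_tilde, lam, r=1, tol=1e-12, max_iter=10_000):
    """Return the plan beta_i * exp(-lam * d_ij^r) * gamma_j with marginals p and p_tilde."""
    n, n_tilde = len(p), len(p_tilde)
    k = [[math.exp(-lam * d[i][j] ** r) for j in range(n_tilde)] for i in range(n)]
    beta = [1.0] * n
    gamma = [1.0] * n_tilde
    for _ in range(max_iter):
        # Alternate the two updates of (3.11).
        beta = [p[i] / sum(k[i][j] * gamma[j] for j in range(n_tilde))
                for i in range(n)]
        gamma = [p_tilde[j] / sum(beta[i] * k[i][j] for i in range(n))
                 for j in range(n_tilde)]
        # Column sums are matched by the gamma update; measure the row-sum error.
        err = max(abs(beta[i] * sum(k[i][j] * gamma[j] for j in range(n_tilde)) - p[i])
                  for i in range(n))
        if err < tol:
            break
    return [[beta[i] * k[i][j] * gamma[j] for j in range(n_tilde)] for i in range(n)]


def sinkhorn_divergences(d, p, p_tilde, lam, r=1):
    """Return the transport cost of the Sinkhorn plan, without and with the entropy term."""
    pi = sinkhorn(d, p, p_tilde, lam, r)
    cost = sum(pi[i][j] * d[i][j] ** r
               for i in range(len(p)) for j in range(len(p_tilde)))
    neg_entropy = sum(x * math.log(x) for row in pi for x in row if x > 0)
    return cost, cost + neg_entropy / lam


# Hypothetical data: a 2 x 2 distance matrix with uniform marginals.
d = [[0.1, 2.1], [2.0, 0.0]]
d_s, de_s = sinkhorn_divergences(d, [0.5, 0.5], [0.5, 0.5], lam=5.0)
```

In this example the unregularized optimal value is 0.05 (attained by the diagonal plan), and the two returned values enclose it, in line with Proposition 3.4.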

Remark 3.8

(Central path) We want to emphasize that for changing the regularization parameter \(\lambda \) it is not necessary to recompute all powers in (3.12). Indeed, increasing \(\lambda \) to \(2\cdot \lambda \), for example, corresponds to raising all entries in the matrix (3.12) to the power 2, etc.
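The observation of this remark is the identity \(e^{-2\lambda d_{ij}}=\big (e^{-\lambda d_{ij}}\big )^2\), applied entry by entry. A short Python check, with a hypothetical distance matrix and an arbitrary value of \(\lambda \), may look as follows.

```python
import math

lam = 1.5
d = [[0.1, 2.1], [2.0, 0.0]]  # hypothetical distance matrix

# Kernel for lam, computed once.
k_lam = [[math.exp(-lam * x) for x in row] for row in d]

# Kernel for 2 * lam: recomputed directly, and obtained by squaring the entries.
k_direct = [[math.exp(-2 * lam * x) for x in row] for row in d]
k_squared = [[x * x for x in row] for row in k_lam]
```

Both matrices coincide up to rounding, so a sequence of doubled regularization parameters only requires entry-wise squaring of the kernel computed once.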

Remark 3.9

(Softmax) The expression (3.11) resembles what is known as the Gibbs measure and the softmax in data science.

Remark 3.10

(Historical remark) In the literature, this approach is also known as matrix scaling (cf. Rote and Zachariasen 2007), RAS (cf. Bachem and Korte 1979) as well as Iterative Proportional Fitting (cf. Rüschendorf 1995). Kruithof (1937) used the method for the first time in telephone forecasting. The importance of this iteration scheme for data science was probably observed in Cuturi (2013, Algorithm 1) for the first time.

Remark 3.11

We may refer to Altschuler et al. (2017) for a performance analysis including speed and convergence of the Sinkhorn algorithm. For a discussion of numerical stability of the algorithm we refer to Peyré and Cuturi (2019, Section 4.4).

4 Entropic transitions

This section extends the preceding sections and combines the Sinkhorn divergence and the nested distance by incorporating the regularized entropy \(\frac{1}{\lambda }H(\pi )\) to the recursive nested distance Algorithm 1 and investigates its properties and consequences. We characterize the nested Sinkhorn divergence first. The main result is used to exploit duality.

4.1 Nested Sinkhorn divergence

Let \({{\varvec{d}}}{{\varvec{e}}}^{(t)}\) be the matrix of incremental divergences of sub-trees at stage t. Analogously to (2.9) we consider the conditional version of the problem (3.1a) and denote by \(\beta _{i_tj_t}\) and \(\gamma _{j_ti_t}\) the pair of optimal Lagrange parameters associated with the problem

$$\begin{aligned} \text {minimize}_{\text { in} \pi }&\sum _{i'\in i_t+,j'\in j_t+}\pi (i',j'\mid i_t,j_t)\cdot {{\varvec{d}}}{{\varvec{e}}}^{(t+1)}(i',j')\nonumber \\&\quad +\frac{1}{\lambda }\pi (i',j'\mid i_t,j_t)\cdot \log \pi (i',j'\mid i_t,j_t)\nonumber \\ \text {subject to }&\sum _{j'\in j_t+}\pi (i',j'\mid i_t,j_t)=P(i'\mid i_t),i'\in i_t+,\nonumber \\&\sum _{i'\in i_t+}\pi (i',j'\mid i_t,j_t)={{\tilde{P}}}(j'\mid j_t),j'\in j_t+,\nonumber \\&\pi (i',j'\mid i_t,j_t)>0, \end{aligned}$$
(4.1)

where \(\pi (i',j'|i_t,j_t)=\exp \left( -\lambda \big ({{\varvec{d}}}{{\varvec{e}}}_{i_tj_t}^{(t+1)} -\beta _{i_tj_t}- \gamma _{j_ti_t}\big )-1\right) \). The optimal value is the new divergence \({{\varvec{d}}}{{\varvec{e}}}^{(t)}(i_t,j_t)\). Computing the nested distance recursively from \(t=T-1\) down to 0 we get

$$\begin{aligned} \pi _{ij}&=\pi _1(i_1,j_1\mid i_0,j_0)\cdot \ldots \cdot \pi _{T-1}(i,j\mid i_{T-1},j_{T-1})\nonumber \\&=e^{-\lambda ({{\varvec{d}}}{{\varvec{e}}}_{i_0j_0}^{(1)}-\beta _{i_0j_0}-\gamma _{j_0i_0})-1}\cdot \ldots \cdot e^{-\lambda ({{\varvec{d}}}{{\varvec{e}}}_{i_{T-1}j_{T-1}}^{(T)}-\beta _{i_{T-1}j_{T-1}}-\gamma _{j_{T-1}i_{T-1}})-1}\nonumber \\&=\exp \left( -T-\lambda \sum _{t=0}^{T-1}\left( {{\varvec{d}}}{{\varvec{e}}}_{i_tj_t}^{(t+1)}-\beta _{i_tj_t}-\gamma _{j_ti_t}\right) \right) , \end{aligned}$$
(4.2)

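where \(i\in {\mathcal {N}}_T\) and \(j\in \tilde{{\mathcal {N}}}_T\) are the leaf nodes with predecessors \((i_0,i_1,\dots ,i_{T-1},i)\) and \((j_0,j_1,\dots ,j_{T-1},j)\). As above introduce

$$\begin{aligned} {{\tilde{\beta }}}_{i_tj_t}{:}{=}\exp \left( -\frac{1}{2}+\lambda \,\beta _{i_tj_t}\right) \quad \text {and}\quad {{\tilde{\gamma }}}_{j_ti_t}{:}{=}\exp \left( -\frac{1}{2}+\lambda \,\gamma _{j_ti_t}\right) . \end{aligned}$$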

Combining the components it follows that

$$\begin{aligned} \pi _{ij}&=\exp \left( -T-\lambda \sum _{t=0}^{T-1}\left( {{\varvec{d}}}{{\varvec{e}}}_{i_tj_t}^{(t+1)}-\beta _{i_tj_t}-\gamma _{j_ti_t}\right) \right) \\&=\prod _{t=0}^{T-1}{{\tilde{\beta }}}_{i_tj_t}\exp \left( -\lambda \, {{\varvec{d}}}{{\varvec{e}}}_{i_tj_t}^{(t+1)}\right) \, {{\tilde{\gamma }}}_{j_ti_t}, \end{aligned}$$

where the product is the entry-wise product (Hadamard product).
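The conditional problem (4.1) at a single pair of nodes can be approximated by alternately rescaling the rows and columns of the kernel \(\exp (-\lambda \,{{\varvec{d}}}{{\varvec{e}}}^{(t+1)})\). The following is a minimal sketch in plain Python, not the authors' implementation; the function name `sinkhorn_plan`, the fixed number of iterations and the example data are assumptions made for illustration only.

```python
import math

def sinkhorn_plan(cost, p, q, lam, iterations=2000):
    """Approximate the entropic transport plan with marginals p and q.

    cost[i][j] plays the role of de^(t+1)(i', j') in (4.1). The returned
    plan has the form beta[i] * exp(-lam * cost[i][j]) * gamma[j].
    """
    n, m = len(p), len(q)
    kernel = [[math.exp(-lam * cost[i][j]) for j in range(m)] for i in range(n)]
    beta = [1.0] * n
    gamma = [1.0] * m
    for _ in range(iterations):
        # Rescale the rows so that the row sums match p.
        for i in range(n):
            beta[i] = p[i] / sum(kernel[i][j] * gamma[j] for j in range(m))
        # Rescale the columns so that the column sums match q.
        for j in range(m):
            gamma[j] = q[j] / sum(beta[i] * kernel[i][j] for i in range(n))
    return [[beta[i] * kernel[i][j] * gamma[j] for j in range(m)] for i in range(n)]

# Hypothetical successor states and conditional probabilities, cost |x - y|.
states, p = [0.0, 1.0, 2.0], [0.2, 0.5, 0.3]
states_tilde, q = [0.5, 1.5], [0.6, 0.4]
cost = [[abs(x - y) for y in states_tilde] for x in states]
plan = sinkhorn_plan(cost, p, q, lam=5.0)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(len(p))) for j in range(len(q))]
```

The scaling vectors `beta` and `gamma` correspond to \({{\tilde{\beta }}}\) and \({{\tilde{\gamma }}}\) above; for very large \(\lambda \) the kernel entries underflow and this plain version is no longer reliable.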

The following theorem summarizes the relation of the nested distance with the Sinkhorn divergence.

Theorem 4.1

(Entropic relaxation of the nested distance) The recursive solution (4.1) ((4.2), resp.) coincides with the optimal transport plan given by

$$\begin{aligned} \text {minimize}_{\text { in }\pi }&\sum _{i=1}^n\sum _{j=1}^{{{\tilde{n}}}}\pi _{ij}\cdot d_{ij}^r+\frac{1}{\lambda }\pi _{ij}\cdot \log \big (\pi _{ij}\big )\nonumber \\ \text {subject to }&\sum _{j\succ j_t+}\pi (i,j\mid i_t,j_t)=P(i\mid i_t),&i_t\prec i,j_t,\nonumber \\&\sum _{i\succ i_t+}\pi (i,j\mid i_t,j_t)={{\tilde{P}}}(j\mid j_t),&j_t\prec j,i_t,\nonumber \\&\pi _{ij}>0\,\,\text {and}\,\,\sum _{i,j}\pi _{ij}=1. \end{aligned}$$
(4.3)

Proof

First define \(\pi {:}{=}\prod _{t=1}^T\pi _t\), where \(\pi _t\) is the conditional transition probability, i.e., the solution at stage t and the matrices are multiplied element-wise (the Hadamard product) as in equation (4.2) above. It follows that

$$\begin{aligned} d^r\cdot \pi +\frac{1}{\lambda }\pi \log \pi&=d^r\cdot \prod _{t=1}^T\pi _t+\frac{1}{\lambda }\cdot \prod _{t=1}^T\pi _t\log \left( \prod _{t=1}^T\pi _{t}\right) \nonumber \\&=d^r\cdot \prod _{t=1}^T\pi _t+\frac{1}{\lambda }\cdot \prod _{t=1}^T\pi _t \cdot \sum _{t=1}^T\log \pi _t. \end{aligned}$$
(4.4)

Observe that \(\pi _t(A)={\mathbb {E}}(1_A\mid \mathcal F_t\otimes \tilde{{\mathcal {F}}}_t)\) (cf. Lemma (2.8)). Denote the r-distance of subtrees at stage t by \({{\varvec{d}}}{{\varvec{e}}}_t^r\). By linearity of the conditional expectation we have with (4.4) at the last stage

and, calculating backwards recursively,

where we have used the tower property of the conditional expectation in (4.5). By induction and the definition of \({{\varvec{d}}}{{\varvec{e}}}_t^r\) at stage t, it follows finally that

(4.5)
(4.6)

where we have used the tower property of the conditional expectation again in (4.5). The assertion (4.3) of the theorem thus follows. \(\square \)

Remark 4.2

The optimization problem in Theorem 4.1 has the same constraints as the full nested problem (2.7); only the objective differs. For this reason the optimal solution of (4.3) is feasible for the problem (2.7) and vice versa.

Notice as well that the tower property can be used in a forward calculation.
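The backward recursion behind Theorem 4.1 can be sketched for small trees as follows. This is an illustrative sketch under stated assumptions, not the authors' code: trees are stored as nested dictionaries built by the hypothetical helper `node`, both trees have the same height, the distance of two paths is taken to be the sum of the stagewise absolute differences, and the value passed upwards is the optimal value of (4.1) including the entropy term.

```python
import math

def entropic_value(cost, p, q, lam, iterations=500):
    """Approximate optimal value of the entropic problem (4.1) for one node pair."""
    n, m = len(p), len(q)
    kernel = [[math.exp(-lam * cost[i][j]) for j in range(m)] for i in range(n)]
    beta, gamma = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        for i in range(n):
            beta[i] = p[i] / sum(kernel[i][j] * gamma[j] for j in range(m))
        for j in range(m):
            gamma[j] = q[j] / sum(beta[i] * kernel[i][j] for i in range(n))
    value = 0.0
    for i in range(n):
        for j in range(m):
            pi = beta[i] * kernel[i][j] * gamma[j]
            # Transport cost plus (1/lam) * pi * log(pi), as in (4.1).
            value += pi * cost[i][j] + pi * math.log(pi) / lam
    return value

def node(value, children=()):
    """A tree node; children is a sequence of (conditional probability, node)."""
    return {"value": value, "children": list(children)}

def nested_sinkhorn(a, b, lam, r=1.0, path_distance=0.0):
    """Backward recursion: leaf pairs carry d^r, inner pairs the value of (4.1)."""
    distance = path_distance + abs(a["value"] - b["value"])
    if not a["children"] and not b["children"]:
        return distance ** r
    cost = [[nested_sinkhorn(ca, cb, lam, r, distance) for _, cb in b["children"]]
            for _, ca in a["children"]]
    p = [prob for prob, _ in a["children"]]
    q = [prob for prob, _ in b["children"]]
    return entropic_value(cost, p, q, lam)

# Two hypothetical trees of height 2.
tree_a = node(0.0, [(0.5, node(1.0, [(0.5, node(2.0)), (0.5, node(0.5))])),
                    (0.5, node(-1.0, [(1.0, node(-1.5))]))])
tree_b = node(0.0, [(0.4, node(0.8, [(1.0, node(1.2))])),
                    (0.6, node(-0.7, [(0.3, node(0.0)), (0.7, node(-2.0))]))])
value = nested_sinkhorn(tree_a, tree_b, lam=2.0)
```

The recursion visits every pair of nodes at the same stage exactly once, so the work is dominated by the small Sinkhorn problems at the last stage.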

Similarly to Proposition 3.5 we have the following extension to the nested Sinkhorn divergence.

Corollary 4.3

For the nested distance and the nested Sinkhorn divergence, the same inequalities as in Proposition 3.5 apply, i.e.,

$$\begin{aligned} 0\le {\varvec{d}}_S^r -{\varvec{d}} ^r\le \frac{1}{\lambda }\left( H(\pi ^S)-H(\pi ^W)\right) \quad \text {and}\quad 0\le {\varvec{d}}^r-{{\varvec{d}}}{{\varvec{e}}}_S^r\le \frac{1}{\lambda }H(\pi ^S)\le \frac{1}{\lambda }H(p\cdot {{\tilde{p}}}^\top ), \end{aligned}$$

where \(\pi ^S\) (\(\pi ^W\), resp.) is the optimal transport plan from (4.3) ((2.7), resp.) with discrete, unconditional probabilities p and \({{\tilde{p}}}\) at the final stage T.

Proof

The proof follows the lines of the proof of the Propositions 3.4 and 3.5. \(\square \)

Moreover, we have the following general inequality that allows an error bound depending on the total number T of stages.

Corollary 4.4

Let m (\({{\tilde{m}}}\), resp.) be the maximum number of immediate successors in the process \({\mathbb {P}}\) (\(\tilde{{\mathbb {P}}}\), resp.), i.e., \(m=\max \left\{ |i+|:i\in {\mathcal {N}}_t,\ t=1,\dots ,T-1\right\} \). It holds that

$$\begin{aligned} {{\varvec{d}}}{{\varvec{e}}}_S^r-{\varvec{d}}^r\le \frac{\log m+\log \tilde{m}}{\lambda }\cdot T, \end{aligned}$$
(4.7)

where T is the total number of stages.

Proof

Recall from Remark 3.6 that \(H(\pi ^S)\le \log (n\,{{\tilde{n}}})= \log n+\log {{\tilde{n}}}\) for every conditional probability measure, where n and \({{\tilde{n}}}\) are the numbers of immediate successors in both trees. The result follows with \(n\le m^T\) (\({{\tilde{n}}}\le \tilde{m}^T\), resp.) and \(\log n\le T\log m\) and the nested program (4.1). \(\square \)
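The two ingredients of this proof can be checked numerically. The sketch below is illustrative only: the helper names `entropy` and `stage_bound` and all numerical values are assumptions, and the entropy is taken as \(H(\pi )=-\sum \pi \log \pi \).

```python
import math

def entropy(plan):
    """Entropy H(pi) = -sum pi * log(pi) of a matrix of probabilities."""
    return -sum(x * math.log(x) for row in plan for x in row if x > 0.0)

def stage_bound(m, m_tilde, lam, stages):
    """Right-hand side of (4.7): (log m + log m_tilde) / lam * T."""
    return (math.log(m) + math.log(m_tilde)) / lam * stages

# Hypothetical conditional probabilities at one pair of nodes.
p, q = [0.2, 0.5, 0.3], [0.6, 0.4]
independent = [[pi * qj for qj in q] for pi in p]

h_independent = entropy(independent)           # entropy of the product coupling
h_max = math.log(len(p)) + math.log(len(q))    # log(n * n_tilde)

# Illustrative values: at most 4 and 3 successors, lambda = 20, T = 5.
bound = stage_bound(m=4, m_tilde=3, lam=20.0, stages=5)
```

The bound grows linearly in T and decays like \(1/\lambda \), which is why a moderate \(\lambda \) already suffices for trees with few stages.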

4.2 Nested Sinkhorn duality

The nested distance is of importance in stochastic optimization because of its dual, which is characterized by the Kantorovich–Rubinstein theorem, cf. (2.3a)–(2.3b) above. The nested distance allows for a characterization by duality as well. Here we develop the duality for the nested Sinkhorn divergence. In line with Theorem 4.1 we need to consider the problem

$$\begin{aligned} \text {minimize}_{\text { in }\pi }&\left( \iint \left( d(\xi ,{{\tilde{\xi }}})^r+\frac{1}{\lambda }\log \pi (\xi , {{\tilde{\xi }}})\right) \pi (\mathrm d\xi ,\mathrm d{{\tilde{\xi }}})\right) ^{\nicefrac 1r}\nonumber \\ \text {subject to }&\pi (A\times {{\tilde{\Xi }}}\mid {\mathcal {F}}_t\otimes \tilde{{\mathcal {F}}}_t)=P(A\mid {\mathcal {F}}_t), \qquad A\in {\mathcal {F}}_t,\ t=1,\dots ,T, \end{aligned}$$
(4.8a)
$$\begin{aligned}&\pi (\Xi \times B\mid {\mathcal {F}}_t\otimes \tilde{\mathcal F}_t)={{\tilde{P}}}(B\mid \tilde{{\mathcal {F}}}_t), \qquad B\in \tilde{{\mathcal {F}}}_t,\ t=1,\dots ,T. \end{aligned}$$
(4.8b)

However, we first reformulate the problem (3.9a)–(3.9b). By translating the dual variables, \({\hat{\beta }}{:}{=}-\beta +{{\,\mathrm{{\mathbb {E}}}\,}}\beta \) and \({\hat{\gamma }}{:}{=}-\gamma +{\tilde{{{\,\mathrm{{\mathbb {E}}}\,}}}}\gamma \), and defining \(M_0{:}{=}-{{\,\mathrm{{\mathbb {E}}}\,}}\beta -{{\tilde{{{\,\mathrm{{\mathbb {E}}}\,}}}}}\gamma \) we have the alternative representation

$$\begin{aligned} \text {maximize}_{\text { in }M_0\ }&M_0\\ \text {subject to }&{{\,\mathrm{{\mathbb {E}}}\,}}{{\hat{\beta }}}=0,\ {{\tilde{{{\,\mathrm{{\mathbb {E}}}\,}}}}}{{\hat{\gamma }}}=0,\\&\sum _{\xi ,{{\tilde{\xi }}}}\exp \left( -\lambda \left( d(\xi ,{{\tilde{\xi }}})^r-{{\hat{\beta }}}(\xi )-{\hat{\gamma }}({{\tilde{\xi }}})-M_0\right) -1\right) =1,\\&{\hat{\beta }}\in {\mathbb {R}}^n,\ {\hat{\gamma }}\in {\mathbb {R}}^{{{\tilde{n}}}}. \end{aligned}$$

To establish the dual representation of the nested distance we introduce the projections

$$\begin{aligned} {{\,\mathrm{{\mathrm{proj}}}\,}}_t:L^1({\mathcal {F}}_T\otimes \tilde{{\mathcal {F}}}_T)&\rightarrow L^1({\mathcal {F}}_t\otimes \tilde{{\mathcal {F}}}_T)\\ {\hat{\beta }}\otimes {\hat{\gamma }}&\mapsto {{\,\mathrm{{\mathbb {E}}}\,}}({\hat{\beta }}\mid \mathcal F_t)\otimes {\hat{\gamma }} \end{aligned}$$

and

$$\begin{aligned} {\tilde{{{\,\mathrm{{\mathrm{proj}}}\,}}}}_t:L^1({\mathcal {F}}_T\otimes \tilde{{\mathcal {F}}}_T)&\rightarrow L^1({\mathcal {F}}_T\otimes \tilde{{\mathcal {F}}}_t)\\ {\hat{\beta }}\otimes {\hat{\gamma }}&\mapsto {\hat{\beta }}\otimes {{\,\mathrm{{\mathbb {E}}}\,}}({\hat{\gamma }}\mid \tilde{{\mathcal {F}}}_t), \end{aligned}$$

where \(\beta \otimes \gamma \) is the function defined by \(\big (\beta \otimes \gamma \big )(\xi ,\eta ){:}{=}\beta (\xi )\cdot \gamma (\eta )\) and where we note that the conditional expectation is a random variable itself.

We recall the following characterization of the measurability constraints (4.8a)–(4.8b) and refer to Pflug and Pichler (2014, Proposition 2.48) for its proof.

Proposition 4.5

The measure \(\pi \) satisfies the marginal condition

$$\begin{aligned} \pi (A\times {{\tilde{\Xi }}}\mid {\mathcal {F}}_t\otimes \tilde{\mathcal F}_t)=P(A\mid {\mathcal {F}}_t)\quad \text {a.s. for all }A\in \Xi \end{aligned}$$

if and only if

$$\begin{aligned} {{\,\mathrm{{\mathbb {E}}}\,}}_{\pi }\beta ={{\,\mathrm{{\mathbb {E}}}\,}}_{\pi }{{\,\mathrm{{\mathrm{proj}}}\,}}_t\beta \quad \text {for all }\beta \text { measurable with respect to }{\mathcal {F}}_T\otimes \tilde{{\mathcal {F}}}_T.\end{aligned}$$

Moreover, \({{\,\mathrm{{\mathrm{proj}}}\,}}_t(\beta )={{\,\mathrm{{\mathbb {E}}}\,}}_{\pi }(\beta \mid \mathcal F_t\otimes \tilde{{\mathcal {F}}}_T)\) if \(\pi \) has marginal P.

Theorem 4.6

The infimum of problem (4.3), i.e., the nested distance including the entropy \({{\varvec{d}}}{{\varvec{e}}}^r({\mathbb {P}},\tilde{{\mathbb {P}}})\), equals the supremum of all numbers \(M_0\) such that

$$\begin{aligned} e^{-\lambda (d(\xi ,{{\tilde{\xi }}})^r-M_T(\xi ,{{\tilde{\xi }}}))-1}\in \mathcal P(\Xi \times {{\tilde{\Xi }}}), \qquad (\xi , {{\tilde{\xi }}})\in \Xi \times {{\tilde{\Xi }}}, \end{aligned}$$

where \({\mathcal {P}}(\Xi \times {{\tilde{\Xi }}})\) is the set of probability measures on \(\Xi \times {{\tilde{\Xi }}}\) and \(M_t\) is an \(\mathbb R\)-valued process on \(\Xi \times {{\tilde{\Xi }}}\) of the form

$$\begin{aligned} M_t=M_0+\sum _{s=1}^t\left( {\hat{\beta }}_s+{\hat{\gamma }}_s\right) \end{aligned}$$
(4.9)

and the functions \({\hat{\beta }}_t\), measurable with respect to \({\mathcal {F}}_t\otimes \tilde{{\mathcal {F}}}_{t-1}\), and \({\hat{\gamma }}_t\), measurable with respect to \({\mathcal {F}}_{t-1}\otimes \tilde{{\mathcal {F}}}_t\), satisfy \({{\,\mathrm{{\mathrm{proj}}}\,}}_{t-1}({\hat{\beta }}_t)=0\) and \({\tilde{{{\,\mathrm{{\mathrm{proj}}}\,}}}}_{t-1}({\hat{\gamma }}_t)=0\).

Proof

With Proposition 4.5 rewrite the dual problem as

$$\begin{aligned}&\inf _{\pi >0}\sup _{M_0,f_t,g_t}{{\,\mathrm{{\mathbb {E}}}\,}}_{\pi }\left[ d^r+\frac{1}{\lambda }\log \pi \right] +M_0\cdot (1-{{\,\mathrm{{\mathbb {E}}}\,}}_{\pi }{\mathbb {1}})+\\&\quad -\sum _{s=0}^{T-1}\big ({{\,\mathrm{{\mathbb {E}}}\,}}_{\pi }f_{s+1}-{{\,\mathrm{{\mathbb {E}}}\,}}_{\pi }{{\,\mathrm{{\mathrm{proj}}}\,}}_s(f_{s+1})\big ) -\sum _{s=0}^{T-1}\big ({{\,\mathrm{{\mathbb {E}}}\,}}_{\pi }g_{s+1}-{{\,\mathrm{{\mathbb {E}}}\,}}_{\pi }{\tilde{{{\,\mathrm{{\mathrm{proj}}}\,}}}}_s(g_{s+1})\big ), \end{aligned}$$

where the second line encodes the measurability constraints. By the minmax theorem (cf. Sion 1958) this is equivalent to

$$\begin{aligned} \sup _{M_0,f_t,g_t}M_0+\inf _{\pi >0}{{\,\mathrm{{\mathbb {E}}}\,}}_{\pi }\Big [&d^r+\frac{1}{\lambda }\log \pi -M_0\cdot {\mathbb {1}}\\&-\sum _{s=0}^{T-1}(f_{s+1}-{{\,\mathrm{{\mathrm{proj}}}\,}}_s(f_{s+1})) -\sum _{s=0}^{T-1}(g_{s+1}-{\tilde{{{\,\mathrm{{\mathrm{proj}}}\,}}}}_s(g_{s+1}))\Big ]. \end{aligned}$$

The integral exists and the minimum is obtained by a probability measure

$$\begin{aligned} \pi =\exp \left( -\lambda \left( d^r-\sum _{s=0}^{T-1}\big (f_{s+1}-{{\,\mathrm{{\mathrm{proj}}}\,}}_s(f_{s+1})\big )-\sum _{s=0}^{T-1}\big (g_{s+1}-{\tilde{{{\,\mathrm{{\mathrm{proj}}}\,}}}}_s(g_{s+1})\big )-M_0\right) -1\right) . \end{aligned}$$
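The exponential form can be verified by minimizing the integrand pointwise in the finite setting. The following is a sketch only: the auxiliary symbol \(c\), which collects the bracketed terms, and the variable \(x=\pi (\xi ,{{\tilde{\xi }}})>0\) for the value of the plan at a fixed pair are introduced here for illustration and are not part of the original notation.

$$\begin{aligned} c&{:}{=}d^r-M_0-\sum _{s=0}^{T-1}\big (f_{s+1}-{{\,\mathrm{{\mathrm{proj}}}\,}}_s(f_{s+1})\big )-\sum _{s=0}^{T-1}\big (g_{s+1}-{\tilde{{{\,\mathrm{{\mathrm{proj}}}\,}}}}_s(g_{s+1})\big ),\\ 0&=\frac{\partial }{\partial x}\left( x\,c+\frac{1}{\lambda }\,x\log x\right) =c+\frac{1}{\lambda }\big (\log x+1\big )\quad \Longleftrightarrow \quad x=\exp \left( -\lambda \,c-1\right) . \end{aligned}$$

The second derivative \(\frac{1}{\lambda x}\) is positive for \(x>0\), so the stationary point is indeed the pointwise minimizer.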

Set \({\hat{\beta }}_s{:}{=}f_s-{{\,\mathrm{{\mathrm{proj}}}\,}}_{s-1}(f_s)\) and \({\hat{\gamma }}_s{:}{=}g_s-{\tilde{{{\,\mathrm{{\mathrm{proj}}}\,}}}}_{s-1}(g_s)\). Consequently, the problem reads

$$\begin{aligned} \text {maximize}_{\text { in }M_0\ }&M_0\\ \text {subject to }&\exp \left[ -\lambda \left( d^r-\sum _{s=1}^T{\hat{\beta }}_s-\sum _{s=1}^T{\hat{\gamma }}_s-M_0\right) -1\right] \in {\mathcal {P}}(\Xi \times {{\tilde{\Xi }}}),\\&{{\,\mathrm{{\mathrm{proj}}}\,}}_{t-1}({\hat{\beta }}_t)=0,\ {\tilde{{{\,\mathrm{{\mathrm{proj}}}\,}}}}_{t-1}({\hat{\gamma }}_t)=0, \end{aligned}$$

and thus the assertion. \(\square \)

The following corollary links the optimal probability measure and the stochastic process (4.9) for the optimal components \({{\hat{\beta }}}\) and \({{\hat{\gamma }}}\).

Corollary 4.7

The process \(M_t\) in (4.9), for which the supremum is attained, is a martingale with respect to the optimal measure \(\pi \).

Proof

The proof of Pflug and Pichler (2014, Theorem 2.49) applies with minor adaptations only. \(\square \)

5 Numerical results

The nested Sinkhorn divergence \({\varvec{d}}_S^r\) as well as \({{\varvec{d}}}{{\varvec{e}}}_S^r\) depend on the regularization parameter \(\lambda \). We discuss this dependency, the error, speed of convergence and numerical issues in comparison to the non-regularized nested distance \({\varvec{d}}^r\).

Fig. 2 Results of the computation for the arbitrarily chosen processes given in Fig. 3 with \(d(\xi _i,{{\tilde{\xi }}}_j)=|\xi _i-{{\tilde{\xi }}}_j|\) and \(r=1\)

We compare Algorithms 1 and 2 with respect to the nested distance \({\varvec{d}}^r\) and the nested Sinkhorn divergence with and without the entropy \(\frac{1}{\lambda }H(\pi ^S)\) as well as the required computational time for two finite valued stochastic scenario processes visualized in Fig. 3.

Fig. 3 Two arbitrarily chosen processes with height \(T=3\)

Figure 2 displays the results. We see that the regularized nested distance \({\varvec{d}}_S^r\) (green) and \({{\varvec{d}}}{{\varvec{e}}}_S^r\) (red) converge to the nested distance \({\varvec{d}}^r\) for increasing \(\lambda \). In contrast to \({\varvec{d}}_S^r\), the regularized nested distance including the entropy converges more slowly to \({\varvec{d}}^r\). The reason is that for larger \(\lambda \) the weight of the entropy in the cost function in (3.1a) decreases and the entropies of \(\pi ^S\) and \(\pi ^W\) coincide (cf. (4.7)). Computing the distances recursively with Sinkhorn’s algorithm is about six times faster than solving the linear problem for the Wasserstein distance. In addition, the time required for the regularized nested distance with and without the entropy varies much less than the computational time for the nested distance. Furthermore, the differences between \({\varvec{d}}^r\) and \({\varvec{d}}_S^r\) and \({{\varvec{d}}}{{\varvec{e}}}_S^r\), respectively, decrease rapidly and are insignificant for \(\lambda >20\). Moreover, the time displayed in Fig. 2b does not depend on the regularization parameter \(\lambda \).

The following two examples illustrate the computational accelerations.

Example 5.1

We now fix \(\lambda =20\) and vary the stages \(T\in \{1,2,3,4,5\}\). The first finite tree has the branching structure \([1\ 2\ 3\ 2\ 3\ 4]\) and the second tree has a simpler structure \([1\ 2\ 2\ 1\ 3\ 2]\) (i.e., the first tree has 144 leaf nodes and the second tree 24). All states and probabilities in the trees are generated randomly.
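The leaf counts stated in this example follow from the branching vectors by cumulative multiplication. The short sketch below reproduces them; the helper name `nodes_per_stage` is hypothetical, and the reading of the vector (leading 1 for the root, then the number of successors of each node per stage) is an assumption that matches the stated counts.

```python
from itertools import accumulate
from operator import mul

def nodes_per_stage(branching):
    """Cumulative products of the branching vector: nodes at stages 0, 1, ..., T."""
    return list(accumulate(branching, mul))

first = [1, 2, 3, 2, 3, 4]
second = [1, 2, 2, 1, 3, 2]

first_nodes = nodes_per_stage(first)     # [1, 2, 6, 12, 36, 144]
second_nodes = nodes_per_stage(second)   # [1, 2, 4, 4, 12, 24]

# Number of leaf pairs, i.e., entries of the cost matrix at the final stage.
leaf_pairs = first_nodes[-1] * second_nodes[-1]
print(first_nodes[-1], second_nodes[-1], leaf_pairs)  # prints 144 24 3456
```

The number of leaf pairs indicates how many entries the distance matrix at the final stage has before the recursion starts.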

Table 1 Average distance and divergence with corresponding computational time in seconds on i5-3210M CPU

Table 1 summarizes the results collected. We notice that the Sinkhorn algorithm is up to 10 times faster compared with the usual Wasserstein distance, although the speed advantage decreases for larger trees. The Sinkhorn algorithm also leads to small errors which increase marginally for trees with more stages.

Example 5.2

To provide an additional performance comparison we fix \(\lambda =20\) and vary \(T\in \{1,2,3,4,5,6\}\). The first finite tree has the structure \([1\,\,4\,\,5\,\,3\,\,4\,\,4\,\,6]\) and the second tree \([1\,\,2\,\,2\,\,1\,\,3\,\,2\,\,3]\). This means that the first tree has 5760 leaf nodes while the second tree has only 72. All states and probabilities in the trees are generated randomly. Table 2 summarizes the results.

Table 2 Average distance and divergence (cf. Table 1)
Table 3 The nested distance and divergence for two trees with 3 stages and leaves and branching 2. The regularization parameter is \(\lambda =10\)

Additionally, we tried to improve the speed by modifying the recursive algorithm. Instead of computing once from \(T-1\) down to 0 we computed from \(T-1\) down to 0 several times to achieve convergence of the optimal transport plan \(\pi ^S\). This approach showed no advantage.

Remark 5.3

The entries \(k_{ij}\) of the matrix (3.12) are small for \(d_{ij}\not =0\), particularly for \(\lambda \gg 1\) (i.e., \(\lambda \) large) and \(r\gg 1\). In this case, the entries of the vectors \({{\tilde{\beta }}}\) and \({{\tilde{\gamma }}}\) in Algorithm 2 can grow extraordinarily large. For this reason, rescaling the vectors is necessary. Further, an adequate balance between \(\lambda \), which governs the approximation quality, and the desired acceleration is crucial in real applications. See also Remark 3.11 for the same issue.
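The remark only states that rescaling is necessary and does not specify how. One common way to avoid the underflow is to iterate on the logarithms of the scaling vectors; the sketch below swaps in this log-domain variant for illustration and is not necessarily the rescaling used by the authors. The names `logsumexp` and `sinkhorn_log_domain` and the example data are assumptions.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def sinkhorn_log_domain(cost, p, q, lam, iterations=500):
    """Sinkhorn iteration on u = log(beta) and v = log(gamma)."""
    n, m = len(p), len(q)
    u = [0.0] * n
    v = [0.0] * m
    for _ in range(iterations):
        # Row update: log(p_i) - log(sum_j exp(v_j - lam * cost_ij)).
        for i in range(n):
            u[i] = math.log(p[i]) - logsumexp([v[j] - lam * cost[i][j] for j in range(m)])
        # Column update: log(q_j) - log(sum_i exp(u_i - lam * cost_ij)).
        for j in range(m):
            v[j] = math.log(q[j]) - logsumexp([u[i] - lam * cost[i][j] for i in range(n)])
    return [[math.exp(u[i] + v[j] - lam * cost[i][j]) for j in range(m)]
            for i in range(n)]

# Hypothetical data with a large regularization parameter.
cost = [[1.0, 2.0], [2.0, 1.0]]
p, q = [0.5, 0.5], [0.5, 0.5]
lam = 1000.0

# Every entry of the plain kernel exp(-lam * cost) underflows to zero here.
kernel_underflows = all(math.exp(-lam * c) == 0.0 for row in cost for c in row)

plan = sinkhorn_log_domain(cost, p, q, lam)
row_sums = [sum(row) for row in plan]
```

The plan itself stays representable even though every kernel entry underflows, because only differences of large exponents are exponentiated.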

Example 5.4

Table 3 investigates the approximation quality for varying orders r. The trees compared have 3 stages and each node branches into two directions. For large \(r\gg 1\) it is important to recall Remark 5.3 here, but on the other hand the approximation quality improves for increasing order r.

6 Summary

Stochastic processes with information evolving in finitely many stages and finitely many states are encoded in stochastic trees. The nested distance, which builds on the Wasserstein distance, allows distinguishing stochastic processes and stochastic trees.

In this paper we regularize the Wasserstein distance by employing the Sinkhorn divergence. This approach extends to multiple stages and allows introducing a nested Sinkhorn divergence for stochastic processes. We elaborate its properties and describe the accelerations, which can be achieved in this way.

In conclusion, the Sinkhorn divergence offers a good trade-off between the regularization error and the speed advantage. Further work should focus on defining a (nested) distance for neural networks and on extending the implementation of the Sinkhorn divergence in the Julia package for faster tree generation and computation.