The nested Sinkhorn divergence to learn the nested distance

The nested distance builds on the Wasserstein distance to quantify the difference of stochastic processes, taking into account the evolution of information modelled by filtrations. The Sinkhorn divergence is a relaxation of the Wasserstein distance that can be computed considerably faster. For this reason we employ the Sinkhorn divergence and take advantage of the related (fixed-point) iteration algorithm. Furthermore, we investigate the transition of the entropy throughout the stages of the stochastic process and provide an entropy-regularized nested distance formulation, including a characterization of its dual. Numerical experiments affirm the computational advantage and superiority.


Introduction
The Wasserstein distance, also known as the Monge-Kantorovich distance, is used in optimal transport theory to describe and characterize optimal transitions between probability measures. These transitions are characterized by the lowest (or cheapest) average cost to fully transfer one probability measure into another. The costs are most typically proportional to the distance of the locations to be connected. Rachev and Rüschendorf (1998) provide a comprehensive discussion of the Wasserstein distance, and Villani (2009) summarizes optimal transport theory.
The nested distance generalizes and extends the theory from probability measures to stochastic processes. It is based on the Wasserstein distance and has been introduced by Pflug (2009), cf. also Pflug and Pichler (2012). The nested distance quantifies the distance of stochastic processes, and multistage stochastic programs are continuous with respect to the nested distance. Multistage stochastic programming has applications in many sectors, e.g., the financial sector (Edirisinghe 2005; Brodt 1983), in management science and in energy economics (Analui and Pflug 2014; Beltrán et al. 2017; Carpentier et al. 2012, 2015). The prices, demands, etc., are often modeled as a stochastic process ξ = (ξ_0, . . . , ξ_T), and the optimal values are rarely obtained analytically. For the numerical approach the stochastic process is replaced by a finite valued stochastic scenario process ξ̃ = (ξ̃_0, . . . , ξ̃_T), which is a finite tree. Naturally, the approximation error should be minimized without unnecessarily increasing the computational effort. Kirui et al. (2020) provide a Julia package for generating scenario trees and scenario lattices for multistage stochastic programming. Maggioni and Pflug (2019) provide guaranteed bounds, and Horejšová et al. (2020) investigate corresponding reduction techniques.
This paper addresses the Sinkhorn divergence in place of the Wasserstein distance. This pseudo-distance is also called Sinkhorn distance or Sinkhorn loss. In contrast to exact implementations such as Bertsekas and Castanon (1989), the Sinkhorn divergence corresponds to a regularization of the Wasserstein distance, which is strictly convex and which improves the efficiency of the computation by applying Sinkhorn's (fixed-point) iteration procedure. The relaxation itself is similar to the modified objective of interior-point methods in numerical optimization. A cornerstone is the theorem by Sinkhorn (1967), which establishes a unique decomposition for non-negative matrices and ensures convergence of the associated iterative scheme. Cuturi (2013) has shown the potential of the Sinkhorn divergence and made it known to a wider audience. Nowadays, the Sinkhorn divergence is used in statistical applications, cf. Bigot et al. (2019) and Luise et al. (2018), and for image recognition and machine learning, cf. Kolouri et al. (2017) and Genevay et al. (2018), among many other applications.
Extending Sinkhorn's algorithm to multistage stochastic programming has been proposed recently in Tran (2020, Section 5.2.3, pp. 97-99), where a numerical example indicating computational advantages is also given. This paper resumes this idea and assesses the entropy-relaxed nested distance from a theoretical perspective. We address its approximation properties and derive its convex conjugate, the dual. Moreover, the numerical tests included confirm the computational advantage regarding the simplicity of the implementation as well as significant gains in speed.

Outline of the paper
The following Sect. 2 introduces the notation and provides the definitions needed to discuss the nested distance. Additionally, the importance of the filtration and the complexity of the computation are shown. Section 3 introduces the Sinkhorn divergence and derives its dual. In Sect. 4 we regularize the nested distance and show the equality of two different approaches. Results and comparisons are visualized and discussed in Sect. 5. Section 6 summarizes and concludes the paper.

Preliminaries
This section recalls the definition of distances in general, of the Wasserstein distance and the nested distance, and provides an example to highlight the impact of the information available, which increases gradually over time and stages. Throughout, we base our exposition on a probability space (Ω, F, P).

Wasserstein distance
The Wasserstein distance is a distance for probability measures. It is the building block for the process distance, which involves information in addition, and for its regularized version addressed here, the Sinkhorn divergence. The Sinkhorn divergence is not a distance itself. To point out the differences we highlight the defining elements.
Definition 2.1 (Distance of measures) Let P be a set of probability measures on Ω. A function d: P × P → [0, ∞) is called a distance if it satisfies the following conditions:

(i) Nonnegativity: d(P_1, P_2) ≥ 0 for all P_1, P_2 ∈ P;
(ii) Symmetry: d(P_1, P_2) = d(P_2, P_1) for all P_1, P_2 ∈ P;
(iii) Triangle inequality: d(P_1, P_2) ≤ d(P_1, P_3) + d(P_3, P_2) for all P_1, P_2, P_3 ∈ P;
(iv) Definiteness: if d(P_1, P_2) = 0, then P_1 = P_2.

Rachev (1991) presents a huge variety of probability metrics. Here, we focus on the Wasserstein distance, which allows a generalization to stochastic processes. For this we assume that the sample space Ω = R^d is equipped with a metric d.

Definition 2.2 (Wasserstein distance) Let P and P̃ be two probability measures on Ω endowed with a distance d: Ω × Ω → R and with finite moment of order r. The Wasserstein distance of order r ≥ 1 is

    d_r(P, P̃) := ( inf_π ∫_{Ω×Ω} d(ξ, ξ̃)^r π(dξ, dξ̃) )^{1/r},

where the infimum is over all probability measures π on Ω × Ω with marginals P and P̃, respectively.

Remark 2.3 (Distance versus cost functions)
The definition of the Wasserstein distance presented here starts with a distance d on and the Wasserstein distance is a distance on P in the sense of Definition 2.1 above.We mention that the literature occasionally develops the theory for cost functions c : X × X → R instead of the distance d.Also, the results presented below extend to cost functions in place of the distance on the underlying space.
In a discrete framework, probability measures are of the form P = Σ_{i=1}^n p_i δ_{ξ_i} with p_i ≥ 0 and Σ_{i=1}^n p_i = 1, and the support {ξ_i : i = 1, 2, . . . , n} ⊂ Ω of P is finite. By Definition 2.2, the Wasserstein distance d_r of two discrete measures P = Σ_{i=1}^n p_i δ_{ξ_i} and P̃ = Σ_{j=1}^{ñ} p̃_j δ_{ξ̃_j} is the r-th root of the optimal value of the linear program

    minimize (in π)   Σ_{i,j} d^r_{ij} π_{ij}        (2.1)
    subject to        Σ_j π_{ij} = p_i,   i = 1, . . . , n,
                      Σ_i π_{ij} = p̃_j,   j = 1, . . . , ñ,
                      π_{ij} ≥ 0,

where d = (d_{ij})_{i,j} = (d(ξ_i, ξ̃_j))_{i,j} is an n × ñ matrix collecting all distances. The optimal measure in (2.1) is denoted π^W and called an optimal transport plan.
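The linear program above can be solved with any off-the-shelf LP solver. The following sketch (helper name `wasserstein_discrete` is our own; SciPy's `linprog` serves as the solver) assembles the two marginal constraints and returns the r-th root of the optimal value together with the transport plan:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_discrete(p, q, D, r=1):
    """Discrete Wasserstein distance of order r: solve the transport LP (2.1)
    with row marginals p, column marginals q and distance matrix D."""
    n, m = D.shape
    c = (D ** r).ravel()                       # flatten pi row-major
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):                         # row sums equal p_i
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):                         # column sums equal q_j
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([p, q])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun ** (1.0 / r), res.x.reshape(n, m)
```

The returned matrix is the optimal transport plan π^W; the constraint system is redundant by one equation (both marginals sum to one), which the HiGHS solver handles without difficulty.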

The distance of stochastic processes
Let (Ω, F, P) and (Ω̃, F̃, P̃) be two probability spaces. We now consider two stochastic processes with realizations ξ ∈ Ω and ξ̃ ∈ Ω̃. A stochastic process with finitely many states (i.e., outcomes) for t ∈ {0, 1, . . . , T} is a scenario tree. Scenario trees are frequently employed in optimization under uncertainty to model the random outcome in the evolution of a process describing, say, the random price of an underlying asset. They are convenient because they can be implemented on computers to assess each individual trajectory as a possible realization of the stochastic process.
Figures 1 and 3 depict such scenario trees; in addition, they indicate the transition probabilities.
Remark 2.5 Figure 1 illustrates that the Wasserstein distance does not capture the different information (knowledge) available at the intermediate stage. Indeed, with ε > 0, the matrix d collects the distances of the trajectories taken from both trees, and the optimal transport plan π for the Wasserstein distance follows. The Wasserstein distance, according to (2.1), is d = Σ_{i,j} d_{ij} π_{ij} = ε/2, where a small value of ε indicates that the processes are similar. However, the information available at stage 1 is very distinct in both trees in Fig. 1. When observing 2 + ε at stage 1 in the first tree, it is certain that the next observation is 3, and it will be 1 when observing 2. In contrast, the second process does not provide any certainty whether the result will be 1 or 3 after observing 2 at the first stage.
We conclude from the preceding remark that the Wasserstein distance is not suitable to distinguish stochastic processes with different flows of information. The reason is that this approach does not involve conditional probabilities at the stages t = 0, 1, . . . , T − 1, but only probabilities at the final stage t = T, so all information available at intermediate stages is ignored.
We follow the usual convention and express the information that is accessible by corresponding sets. The information available at every stage t includes information from preceding stages, which has been revealed, but excludes information from later, future stages. For this reason the sets encoding the information available at stage t constitute the σ-algebra F_t (F̃_t, resp.). The following generalization of the Wasserstein distance takes this flow of increasing information explicitly into account. We state the definition for general probability measures here, although we work with discrete measures only in what follows.
Definition 2.6 (The nested distance) The nested distance of order r ≥ 1 of two filtered probability spaces P = (Ω, (F_t), P) and P̃ = (Ω̃, (F̃_t), P̃) with finite moment of order r with respect to the distance d: Ω × Ω̃ → R is the optimal value of the optimization problem

    minimize (in π)   ( ∫_{Ω×Ω̃} d(ξ, ξ̃)^r π(dξ, dξ̃) )^{1/r}        (2.4)
    subject to        π(A × Ω̃ | F_t ⊗ F̃_t) = P(A | F_t)     for all A ∈ F_T and t = 0, . . . , T,
                      π(Ω × B | F_t ⊗ F̃_t) = P̃(B | F̃_t)    for all B ∈ F̃_T and t = 0, . . . , T,

where the infimum is among all bivariate probability measures π ∈ P(Ω × Ω̃) satisfying these marginal conditions. The optimal value of this problem, the nested distance of order r, is denoted by d_r(P, P̃).
For discrete stochastic processes we use trees to model the whole space and the filtration. In the stochastic tree, N_t (Ñ_t, resp.) denotes the set of all nodes at stage t. Furthermore, a predecessor m of the node i, not necessarily the immediate predecessor, is indicated by m ≺ i. Here, we may replace the conditional probabilities in (2.5) and (2.6) by the genuine transition probabilities. The arcs of the tree in Fig. 1 exemplarily display these transition probabilities.
Algorithm 1: Nested computation of the nested distance d_r(P, P̃) of two tree processes P and P̃.

Input: the distances d^r_T(i, j) := d(ξ_i, ξ̃_j)^r for all combinations of leaf nodes i ∈ N_T and j ∈ Ñ_T with their predecessors.
for t = T − 1 down to 0 and every combination of inner nodes i_t ∈ N_t and j_t ∈ Ñ_t do
  solve the linear program
    minimize (in π)   Σ_{i ∈ i_t+, j ∈ j_t+} π(i, j | i_t, j_t) d^r_{t+1}(i, j)        (2.9)
    subject to        Σ_{j ∈ j_t+} π(i, j | i_t, j_t) = P(i | i_t)   for i ∈ i_t+,
                      Σ_{i ∈ i_t+} π(i, j | i_t, j_t) = P̃( j | j_t)  for j ∈ j_t+,
                      π(i, j | i_t, j_t) ≥ 0,
  and denote its optimal value by d^r_t(i_t, j_t).
Output: the optimal transport plans π(· , · | i_t, j_t) at every pair of nodes.
Result: The nested distance is d_r(P, P̃) := d^r_0(0, 0)^{1/r}.
The nested distance for stochastic trees is the r-th root of the optimal value of

    minimize (in π)   Σ_{i,j} π_{ij} d^r_{ij}        (2.7)
    subject to        Σ_{j ≻ j_t} π(i, j | i_t, j_t) = P(i | i_t),        (2.5)
                      Σ_{i ≻ i_t} π(i, j | i_t, j_t) = P̃( j | j_t),        (2.6)
                      π_{ij} ≥ 0  and  Σ_{i,j} π_{ij} = 1,

where i ∈ N_T and j ∈ Ñ_T are the leaf nodes, i_t ∈ N_t as well as j_t ∈ Ñ_t are nodes at the same stage t, and P(i | i_t) := P(i)/P(i_t) is the conditional probability. Analogously, the conditional probabilities π(i, j | i_t, j_t) are obtained by normalizing π over the successors of the pair (i_t, j_t).

Rapid, nested computation of the process distance
This subsection addresses an advanced approach for solving the linear program (2.7).
We first recall the tower property, which allows an important simplification of the constraints in (2.4).
Lemma 2.8 To compute the nested distance it is enough to condition on the immediately following σ-algebra: the conditions (2.5) and (2.6), which condition on all stages, may be replaced by conditions involving the immediate successors only.

Proof The proof is based on the tower property of the expectation and can be found in Pflug and Pichler (2014, Lemma 2.43).
As a result of the tower property, the full problem (2.7) can be computed faster in a recursive way, and the constraint matrix does not have to be stored. We employ this result in the algorithm below. For further details we refer to Pflug and Pichler (2014, Chapter 2.10.3). The collection of all direct successors of node i_t ( j_t, resp.) is denoted by i_t+ ( j_t+, resp.).
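A minimal sketch of this backward recursion for two scenario trees, assuming order r = 1 and the stagewise-additive ℓ1 path distance; the tree encoding (a dict mapping each node to its value and its children with transition probabilities) and the helper names are our own, not from the paper:

```python
import numpy as np
from scipy.optimize import linprog

def conditional_ot(p, q, C):
    """Solve the conditional transport LP (2.9): min <C, pi> over plans
    with marginals p and q (the transition probabilities)."""
    n, m = C.shape
    A = np.zeros((n + m, n * m))
    for i in range(n):
        A[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A[n + j, j::m] = 1.0
    res = linprog(C.ravel(), A_eq=A, b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return res.fun

def nested_distance_r1(tree1, tree2, root1=0, root2=0):
    """Backward recursion of Algorithm 1 for r = 1 and the l1 path distance.
    tree = {node: (value, [(child, transition_prob), ...])}."""
    def rec(i, j):
        v1, ch1 = tree1[i]
        v2, ch2 = tree2[j]
        step = abs(v1 - v2)            # stagewise cost contribution
        if not ch1 or not ch2:         # leaf nodes end the recursion
            return step
        # distances of the sub-trees rooted at every successor pair
        C = np.array([[rec(a, b) for b, _ in ch2] for a, _ in ch1])
        p = np.array([pr for _, pr in ch1])
        q = np.array([pr for _, pr in ch2])
        return step + conditional_ot(p, q, C)
    return rec(root1, root2)
```

Each inner node pair gives rise to one small LP over its immediate successors only, which is the simplification Lemma 2.8 justifies.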

Sinkhorn divergence
In what follows we consider the entropy regularization of the Wasserstein distance (2.1) and characterize its dual. Moreover, we recall Sinkhorn's algorithm, which provides a considerably faster implementation. These results are then combined to accelerate the computation of the nested distance.

Entropy-regularized Wasserstein distance
Interior-point methods add a logarithmic penalty to the objective to force the optimal solution of the modified problem into the strict interior. The Sinkhorn distance proceeds similarly. The regularizing term H(π) := −Σ_{i,j} π_{ij} log π_{ij}, the entropy of π, is added to the cost function in problem (2.1). This has proven beneficial in other problem settings as well. The entropy-regularized problem is

    minimize (in π)   Σ_{i,j} d^r_{ij} π_{ij} − (1/λ) H(π)        (3.1a)
    subject to        Σ_j π_{ij} = p_i,   Σ_i π_{ij} = p̃_j,   π_{ij} > 0,        (3.1b)

where d is a distance or a cost matrix and λ > 0 is a regularization parameter. With π^S being the optimal transport plan in (3.1a)-(3.1b) we denote the Sinkhorn divergence by

    d^r_S := Σ_{i,j} d^r_{ij} π^S_{ij},

and the Sinkhorn divergence including the entropy by

    de^r_S := Σ_{i,j} d^r_{ij} π^S_{ij} − (1/λ) H(π^S).

We may mention here that we avoid the term Sinkhorn distance since, for all λ > 0, the Sinkhorn divergence d^r_S is strictly positive, and de^r_S can be negative for small λ, which violates the axioms of a distance given in Definition 2.1 above (particularly (i), (iii) and (iv)). Strict positivity of d^r_S can be enforced by a correction term, the so-called Sinkhorn loss [see Bigot et al. (2019, Definition 2.3)], or by employing the cost matrix d • 1 p = p instead.

Remark 3.3
The strict inequality constraint in (3.1b) is not a restriction. Indeed, the mapping ϕ(·) defined in Remark 3.1 has derivative ϕ'(0) = −∞, and thus every optimal measure satisfies the strict inequality π_{ij} > 0 for λ > 0.
Proposition 3.4 We have the inequalities de^r_S ≤ d^r_S and d^r_W ≤ d^r_S.

Proof Recall that π log π ≤ 0 for all π ∈ (0, 1), and thus H(π) = −Σ_{i,j} π_{ij} log π_{ij} ≥ 0, from which the first inequality follows. The remaining inequality is clear by the definition of the Wasserstein distance, as π^S is feasible for (2.1).
Both Sinkhorn divergences d^r_S and de^r_S approximate the Wasserstein distance d^r, and we have convergence to d^r for λ → ∞. The following proposition provides precise bounds.

Proposition 3.5 For λ > 0 it holds that

    0 ≤ d^r_S − d^r_W ≤ (1/λ) H(π^S)        (3.3)

and

    0 ≤ d^r_W − de^r_S ≤ (1/λ) H(π^S) ≤ (1/λ) H(p ⊗ p̃).        (3.4)

Proof The first inequalities follow from (3.2) and from the optimality of π^S in the inequality

    Σ_{i,j} d^r_{ij} π^S_{ij} − (1/λ) H(π^S) ≤ Σ_{i,j} d^r_{ij} π^W_{ij} − (1/λ) H(π^W).

The latter inequalities follow again with (3.2) and d^r_S − de^r_S = (1/λ) H(π^S). Finally, by the log sum inequality, H(π) ≤ H(p ⊗ p̃) for every measure π with marginals p and p̃.

Remark 3.6
As a consequence of the log sum inequality we also obtain H(π^S) ≤ log n + log ñ. The inequalities (3.3) and (3.4) thus give explicit upper bounds comparing the Wasserstein distance and the Sinkhorn divergence.
Alternative definitions There exist alternative concepts of the Sinkhorn divergence which we want to mention here. The first alternative definition involves the Kullback-Leibler divergence

    D_KL(π | P ⊗ P̃) := Σ_{i,j} π_{ij} log( π_{ij} / (p_i p̃_j) ) = −H(π) + H(p) + H(p̃),

where the latter equality is justified provided that π has marginal measures P and P̃.
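The equality used in this alternative definition can be verified term by term; a short derivation, assuming π has marginals p and p̃:

```latex
\begin{align*}
D_{\mathrm{KL}}(\pi \mid P \otimes \tilde P)
  &= \sum_{i,j} \pi_{ij} \log \frac{\pi_{ij}}{p_i\,\tilde p_j}
   = \sum_{i,j} \pi_{ij} \log \pi_{ij}
     - \sum_{i,j} \pi_{ij} \log p_i
     - \sum_{i,j} \pi_{ij} \log \tilde p_j \\
  &= -H(\pi) - \sum_{i} p_i \log p_i - \sum_{j} \tilde p_j \log \tilde p_j
   = -H(\pi) + H(p) + H(\tilde p),
\end{align*}
```

since Σ_j π_{ij} = p_i and Σ_i π_{ij} = p̃_j. Hence, for fixed marginals, regularizing with D_KL differs from regularizing with −H(π) only by the constant H(p) + H(p̃), so both variants yield the same optimal transport plan.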
A further possible definition employs a different entropy regularization. Luise et al. (2018) use this definition for Sinkhorn approximations when learning with the Wasserstein distance and prove exponential convergence. This definition leads to a similar matrix decomposition and iterative algorithm as described in the following sections.

Dual representation of Sinkhorn
We shall derive Sinkhorn's algorithm and its extension to the nested distance via duality. To this end consider the Lagrangian function of the problem (3.2),

    L(π; β, γ) := Σ_{i,j} d^r_{ij} π_{ij} + (1/λ) Σ_{i,j} π_{ij} log π_{ij} + Σ_i β_i (p_i − Σ_j π_{ij}) + Σ_j γ_j (p̃_j − Σ_i π_{ij}).        (3.6)

The partial derivatives are

    ∂L/∂π_{ij} = d^r_{ij} + (1/λ)(1 + log π_{ij}) − β_i − γ_j,        (3.7)

and it follows from (3.7) that the optimal measure has entries

    π*_{ij} = e^{−λ(d^r_{ij} − β_i − γ_j) − 1}.        (3.8)

By inserting π*_{ij} into the Lagrangian function L we get the convex dual function. The dual problem thus is

    maximize (in β, γ)   Σ_{i=1}^n β_i p_i + Σ_{j=1}^{ñ} γ_j p̃_j − (1/λ) Σ_{i,j} e^{−λ(d^r_{ij} − β_i − γ_j) − 1}.        (3.9a)

Due to Σ_{i,j} e^{−λ(d^r_{ij} − β_i − γ_j) − 1} = 1 at the optimum we may write the latter problem as

    maximize (in β, γ)   Σ_{i=1}^n β_i p_i + Σ_{j=1}^{ñ} γ_j p̃_j        (3.9b)
    subject to           Σ_{i,j} e^{−λ(d^r_{ij} − β_i − γ_j) − 1} = 1,        (3.10)

provided that λ > 0. It is thus apparent that (3.9a)-(3.9b) is a relaxation of problem (2.3a)-(2.3b) together with the constraint (3.10). As well, observe that both problems coincide for λ → ∞ in (3.9b).

Sinkhorn's algorithm
To derive Sinkhorn's algorithm we consider the Lagrangian function (3.6) again, but now for the remaining variables. Similarly to π* in (3.8), the gradients with respect to β and γ give the equations

    p_i = Σ_j e^{−λ(d^r_{ij} − β_i − γ_j) − 1}   and   p̃_j = Σ_i e^{−λ(d^r_{ij} − β_i − γ_j) − 1}.

To avoid the logarithm, introduce β̂_i := e^{λ β_i − 1/2} and γ̂_j := e^{λ γ_j − 1/2} and rewrite the latter equations as

    β̂_i = p_i / Σ_j e^{−λ d^r_{ij}} γ̂_j   and   γ̂_j = p̃_j / Σ_i e^{−λ d^r_{ij}} β̂_i,        (3.11)

while the optimal transport plan (3.8) is

    π*_{ij} = β̂_i e^{−λ d^r_{ij}} γ̂_j.

The simple starting point of Sinkhorn's iteration is that (3.11) can be used to determine β̂ and γ̂ alternately. Indeed, Sinkhorn's theorem (cf. Sinkhorn 1967; Sinkhorn and Knopp 1967) on the matrix decomposition ensures that iterating (3.11) converges and that the vectors β̂ and γ̂ are unique up to a scalar factor. Algorithm 2 summarizes the individual steps.

Algorithm 2: Sinkhorn's iteration
Input: the matrix (e^{−λ d^r_{ij}})_{i,j}, a regularization parameter λ > 0, a stopping criterion and a starting value γ̂ = (γ̂_1, . . . , γ̂_ñ).        (3.12)
while the stopping criterion is not satisfied do
  for i = 1 to n do β̂_i ← p_i / Σ_j e^{−λ d^r_{ij}} γ̂_j;
  for j = 1 to ñ do γ̂_j ← p̃_j / Σ_i e^{−λ d^r_{ij}} β̂_i.
Result: the optimal transport plan π^S with entries π^S_{ij} = β̂_i e^{−λ d^r_{ij}} γ̂_j.

Remark 3.8 (Central path) We want to emphasize that when changing the regularization parameter λ it is not necessary to recompute all powers in (3.12). Indeed, increasing λ to 2λ, for example, corresponds to raising all entries of the matrix (3.12) to the power 2, etc.

Remark 3.9 (Softmax) The expression (3.11) resembles what is known as the Gibbs measure in physics and the softmax in data science.
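A compact sketch of Algorithm 2 (variable names are our own; the two update loops are written in matrix-vector form, with the kernel K = (e^{−λ d^r_{ij}})_{ij}):

```python
import numpy as np

def sinkhorn(p, q, D, lam=10.0, r=1, tol=1e-12, max_iter=20_000):
    """Sinkhorn's fixed-point iteration (3.11) for the entropy-regularized
    transport problem. Returns the Sinkhorn divergence d_S and the plan."""
    K = np.exp(-lam * D ** r)           # Gibbs kernel, the matrix in (3.12)
    v = np.ones_like(q, dtype=float)    # starting value gamma-hat
    for _ in range(max_iter):
        u = p / (K @ v)                 # update beta-hat
        v_new = q / (K.T @ u)           # update gamma-hat
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    pi = u[:, None] * K * v[None, :]    # plan pi^S = diag(u) K diag(v)
    d_S = float(np.sum(pi * D ** r)) ** (1.0 / r)
    return d_S, pi
```

The marginals of the returned plan converge to p and p̃. For large λ the entries of the scaling vectors can over- or underflow, so log-domain rescaling is advisable in practice (cf. Remark 3.11).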

Remark 3.10 (Historical remark)
In the literature, this approach is also known as matrix scaling (cf. Rote and Zachariasen 2007), RAS (cf. Bachem and Korte 1979) as well as iterative proportional fitting (cf. Rüschendorf 1995). Kruithof (1937) used the method for the first time in telephone forecasting. The importance of this iteration scheme for data science was probably observed for the first time in Cuturi (2013, Algorithm 1).

Remark 3.11
We refer to Altschuler et al. (2017) for a performance analysis including speed and convergence of the Sinkhorn algorithm. For a discussion of the numerical stability of the algorithm we refer to Peyré and Cuturi (2019, Section 4.4).

Entropic transitions
This section extends the preceding sections and combines the Sinkhorn divergence and the nested distance by incorporating the regularized entropy (1/λ) H(π) into the recursive nested distance Algorithm 1, and it investigates the properties and consequences. We characterize the nested Sinkhorn divergence first. The main result is then used to exploit duality.

Nested Sinkhorn divergence
Let de^{(t)} be the matrix of incremental divergences of the sub-trees at stage t. Analogously to (2.9) we consider the conditional version of the problem (3.1a) and denote by β_{i_t j_t} and γ_{j_t i_t} the pair of optimal Lagrange parameters associated with the problem

    minimize (in π)   Σ_{i ∈ i_t+, j ∈ j_t+} π(i, j | i_t, j_t) de^{(t+1)}(i, j) − (1/λ) H(π(· , · | i_t, j_t))        (4.1)
    subject to        Σ_{j ∈ j_t+} π(i, j | i_t, j_t) = P(i | i_t),
                      Σ_{i ∈ i_t+} π(i, j | i_t, j_t) = P̃( j | j_t),
                      π(i, j | i_t, j_t) > 0.

The optimal value is the new divergence de^{(t)}(i_t, j_t). Computing the nested distance recursively from t = T − 1 down to 0 we obtain de^{(0)}(0, 0), where i ∈ N_T and j ∈ Ñ_T are the leaf nodes with predecessors (i_0, i_1, . . . , i_{T−1}, i) and ( j_0, j_1, . . . , j_{T−1}, j). As above, introduce β̂_{i_t j_t} := exp(λ β_{i_t j_t} − 1/2) and γ̂_{j_t i_t} := exp(λ γ_{j_t i_t} − 1/2).
Combining the components it follows that

    π = π_1 ⊙ π_2 ⊙ · · · ⊙ π_T,        (4.2)

where ⊙ is the entry-wise product (Hadamard product) of the conditional transition plans π_t.
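Putting the pieces together, the conditional problems (4.1) can be solved with Sinkhorn's iteration inside the backward recursion of Algorithm 1. A sketch for r = 1 with the stagewise-additive ℓ1 cost, using the same hypothetical tree encoding as before (the entropy term is omitted from the reported value, i.e., the d_S variant):

```python
import numpy as np

def sinkhorn_value(p, q, C, lam=50.0, iters=2000):
    """Approximate the conditional transport value by Sinkhorn's iteration."""
    K = np.exp(-lam * C)
    v = np.ones_like(q, dtype=float)
    for _ in range(iters):
        u = p / (K @ v)
        v = q / (K.T @ u)
    pi = u[:, None] * K * v[None, :]
    return float(np.sum(pi * C))

def nested_sinkhorn_r1(tree1, tree2, lam=50.0, root1=0, root2=0):
    """Backward recursion (4.1): nested Sinkhorn divergence for r = 1.
    tree = {node: (value, [(child, transition_prob), ...])}."""
    def rec(i, j):
        v1, ch1 = tree1[i]
        v2, ch2 = tree2[j]
        step = abs(v1 - v2)
        if not ch1 or not ch2:
            return step
        C = np.array([[rec(a, b) for b, _ in ch2] for a, _ in ch1])
        p = np.array([pr for _, pr in ch1])
        q = np.array([pr for _, pr in ch2])
        return step + sinkhorn_value(p, q, C, lam=lam)
    return rec(root1, root2)
```

Compared with the LP-based recursion, each node pair is handled by matrix-vector products only, which is the source of the speed-up reported in Sect. 5; the price is a regularization bias controlled by λ.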
The following theorem summarizes the relation of the nested distance with the Sinkhorn divergence.
Theorem 4.1 (Relaxation of the nested distance) The recursive solution (4.1) ((4.2), resp.) coincides with the optimal transport plan of the full, entropy-regularized nested distance problem (4.3).

Proof First define π := Π_{t=1}^T π_t, where π_t is the conditional transition probability, i.e., the solution at stage t, and the matrices are multiplied element-wise (the Hadamard product) as in equation (4.2) above. Denote the r-distance of the sub-trees at stage t by de^r_t. By linearity of the conditional expectation we have (4.4) at the last stage, and calculating in a backward recursive way gives (4.5), where we have used the tower property of the conditional expectation. By induction and the definition of de^r_t at stage t, it finally follows that the recursion reproduces the objective of (4.3), where we have used the tower property of the conditional expectation again.
The assertion (4.3) of the theorem thus follows.

Remark 4.2
The optimization problem in Theorem 4.1 involves the same constraints as the full nested problem (2.7); only the objective differs. For this reason the optimal solution of (4.3) is feasible for problem (2.7), and vice versa. Notice as well that the tower property can be used in a forward calculation.
Similarly to Proposition 3.5 we have the following extension to the nested Sinkhorn divergence.
Corollary 4.3 For the nested distance and the nested Sinkhorn divergence, the same inequalities as in Proposition 3.5 apply, where π^S (π^W, resp.) is the optimal transport plan from (4.3) ((2.7), resp.) with discrete, unconditional probabilities p and p̃ at the final stage T.

Proof
The proof follows the lines of the proofs of Propositions 3.4 and 3.5.
Moreover, we have the following general inequality, which provides an error bound depending on the total number T of stages.
Corollary 4.4 Let m ( m̃, resp.) be the maximal number of immediate successors in the process P (P̃, resp.), i.e., m = max {|i+| : i ∈ N_t, t = 1, . . . , T − 1}. Then the gap between the nested distance and the nested Sinkhorn divergence is at most (T/λ)(log m + log m̃), where T is the total number of stages.
Proof Recall from Remark 3.6 that H(π^S) ≤ log(n ñ) = log n + log ñ for all conditional probability measures, where n and ñ are the numbers of immediate successors in both trees. The result follows with n ≤ m^T (ñ ≤ m̃^T, resp.), hence log n ≤ T log m, and the nested program (4.1).

Nested Sinkhorn duality
The nested distance is of importance in stochastic optimization because of its dual, which is characterized by the Kantorovich-Rubinstein theorem, cf. (2.3a)-(2.3b) above. The nested distance allows a characterization by duality as well. Here we develop the duality for the nested Sinkhorn divergence. In line with Theorem 4.1 we need to consider the entropy-regularized nested problem. To establish the dual representation of the nested distance we introduce the projections (4.8a)-(4.8b) onto the σ-algebras of the earlier stages, where β ⊗ γ is the function defined by β ⊗ γ(ξ, η) := β(ξ) · γ(η), and where we note that the conditional expectation is a random variable itself. We recall the following characterization of the measurability constraints (4.8a)-(4.8b) and refer to Pflug and Pichler (2014, Proposition 2.48) for its proof.
Proof With Proposition 4.5, rewrite the dual problem as a saddle point problem, where the second line encodes the measurability constraints. By the minmax theorem (cf. Sion 1958) this is equivalent to the corresponding sup-inf problem. The integral exists, and the minimum is attained by a probability measure. Set β_s := f_s − proj_{s−1}( f_s) and γ_s := g_s − proj_{s−1}(g_s). Consequently, the problem reads as asserted.
The following corollary links the optimal probability measure and the stochastic process (4.9) for the optimal components β and γ .
Corollary 4.7 The process M t in (4.9), for which the supremum is attained, is a martingale with respect to the optimal measure π .
Proof The proof of Pflug and Pichler (2014, Theorem 2.49) applies with minor adaptions only.

Numerical results
The nested Sinkhorn divergences d^r_S and de^r_S depend on the regularization parameter λ. We discuss this dependency, the error, the speed of convergence and numerical issues in comparison with the non-regularized nested distance d_r.
We compare Algorithms 1 and 2 with respect to the nested distance d_r and the nested Sinkhorn divergence with and without the entropy term (1/λ) H(π^S), as well as the required computational time, for the two finite valued stochastic scenario processes visualized in Fig. 3. The entries of the scaling vectors in Algorithm 2 can grow extraordinarily large. For this reason, rescaling the vectors is necessary. Further, an adequate balance between λ, controlling the approximation quality, and the desired acceleration is crucial in real applications. See also Remark 3.11 for the same issue.
Example 5.4 Table 3 investigates the approximation quality for varying orders r. The trees compared have 3 stages, and each node branches into two directions. For large r ≫ 1 it is important to recall Remark 5.3 here; on the other hand, the approximation quality improves with increasing order r.

Summary
Stochastic processes with information evolving in finitely many stages and finitely many states are encoded in stochastic trees.The nested distance, which builds on the Wasserstein distance, allows distinguishing stochastic processes and stochastic trees.
In this paper we regularize the Wasserstein distance by employing the Sinkhorn divergence.This approach extends to multiple stages and allows introducing a nested Sinkhorn divergence for stochastic processes.We elaborate its properties and describe the accelerations, which can be achieved in this way.
In conclusion, we can summarize that the Sinkhorn divergence offers a good tradeoff between the regularization error and the gain in speed. Further work should focus on defining a (nested) distance for neural networks and on extending the implementation of the Sinkhorn divergence in the Julia package for faster tree generation and computation.

Fig. 3 Two arbitrarily chosen processes with height T = 3.

There are many metrics d such that (Ω, d) is a metric space. Without loss of generality we may set Ω_t = R for all t ∈ {0, 1, . . . , T} and employ the ℓ1-distance, i.e., d(ξ, ξ̃) = Σ_t |ξ_t − ξ̃_t|.