1 Introduction

1.1 Outline

If some type of natural phenomenon is modelled through a stochastic process, one might expect that the model does not describe reality in an entirely accurate way. To be able to study the impact of such inaccuracies on the problems one is trying to solve, it makes sense to equip the set of laws of stochastic processes with a suitable notion of distance or topology.

Denoting by \(\Omega := \mathcal {X}^N\) the path space (where \(\mathcal {X}\) is some Polish space and \(N \in \mathbb {N}\)), the set of laws of stochastic processes is \({\mathcal {P}}(\Omega )\), i.e. the set of probability measures on \(\Omega \).

Clearly, \({\mathcal {P}}(\Omega )\) carries the usual weak topology. However, this topology does not respect the time evolution of stochastic processes, which has a number of potentially inconvenient consequences: e.g., problems of optimal stopping/utility maximization/stochastic programming are not continuous, arbitrary processes can be approximated by processes which are deterministic after the first period, etc. In the following we describe a number of approaches which have been developed by different authors to deal with these (and related) problems. Our main result (Theorem 1.2) is that all of these approaches actually define the same topology in the present discrete time setup. Moreover, this topology is the weakest topology which allows for continuity of optimal stopping problems.

1.2 Adapted Wasserstein distances, nested distance

A number of authors have independently introduced variants of the Wasserstein distance which take the temporal structure of processes into account: the definition of ‘iterated Kantorovich distance’ by Vershik [60, 61] might be seen as a first construction in this direction. The topic is also considered by Rüschendorf [58]. Independently, Pflug and Pflug–Pichler [30, 52,53,54,55,56] introduce the nested distance and describe the concept’s rich potential for the approximation of stochastic multi-period optimization problems. Lassalle [46] considers the ‘causal transport problem’ that leads to a corresponding notion of distance. Once again independently of these developments, Bion–Nadal and Talay [16] define an adapted version of the Wasserstein distance between laws of solutions to SDEs. Gigli [28, Chapter 4] introduces a similar distance for measures whose first marginals agree, see also [4, Section 12.4].

To set the stage for describing these ‘adapted’ variants let us fix \(p \ge 1\) and recall the definition of the usual p-Wasserstein distance.

\((\mathcal {X},\rho _\mathcal {X})\) is now a Polish metric space. On \(\Omega = \mathcal {X}^N\) we use the Polish metric \(\rho _\Omega ((x_t)_t, (y_t)_t) := (\sum _t \rho _\mathcal {X}(x_t,y_t)^p)^{1/p}\). Typically, when clear from the context we will omit the subscript for the metric. We use \((X_t)_t\) to denote the canonical process on \(\Omega \), i.e. \(X_t\) is the projection onto the tth factor of \(\Omega = \mathcal {X}^N\). On \(\Omega \times \Omega \) we call \(X = (X_t)_t\) the projection onto the first factor and \(Y = (Y_t)_t\) the projection onto the second factor. For \(\mu , \nu \in {\mathcal {P}}(\Omega )\) we denote by \({{\,\mathrm{Cpl}\,}}(\mu , \nu )\) the set of probability measures \(\pi \) on \(\Omega \times \Omega \) for which \(X \sim \mu \) and \(Y \sim \nu \) under \(\pi \), i.e. for which the distribution of X under \(\pi \) is \(\mu \) and that of Y under \(\pi \) is \(\nu \). In applications, a particular role is played by Monge couplings. A Monge coupling from \(\mu \) to \(\nu \) is a coupling \(\pi \) for which \(Y = T(X)\) \(\pi \)-a.s. for some Borel mapping \(T:\Omega \rightarrow \Omega \) that transports \(\mu \) to \(\nu \), i.e. satisfies \(T_{\#} (\mu )=\nu \).

For \(\mu , \nu \in {\mathcal {P}}_p(\Omega )\), i.e. for probability measures on \(\Omega \) with finite pth moment, their p-Wasserstein distance is

$$\begin{aligned} {\mathcal {W}}_p(\mu ,\nu ) := \inf _{\pi \in {{\,\mathrm{Cpl}\,}}(\mu ,\nu )} \left( {\textstyle \int }\rho _\Omega (x,y)^p \,\mathrm d\pi (x,y) \right) ^{1/p}. \end{aligned}$$
(1)

Following [57], in many situations the infimum in (1) remains unchanged if one minimizes only over Monge couplings.
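For finitely supported measures the infimum in (1) is a finite-dimensional linear program and can be computed directly. The following is a minimal sketch (our own illustration, not part of the text; the function name, the dictionary representation of measures and the use of scipy are assumptions):

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_p(mu, nu, rho, p=1):
    """W_p(mu, nu) for finitely supported measures, i.e. the linear program (1)
    over all couplings of mu and nu.

    mu, nu : dicts mapping support points to probabilities (each summing to 1).
    rho    : the metric, e.g. rho = lambda x, y: abs(x - y).
    """
    xs, ys = list(mu), list(nu)
    m, n = len(xs), len(ys)
    cost = np.array([rho(x, y) ** p for x in xs for y in ys])
    # marginal constraints: row sums equal mu, column sums equal nu
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        A_eq[m + j, j::n] = 1.0
    b_eq = np.array([mu[x] for x in xs] + [nu[y] for y in ys])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun ** (1 / p)

# two laws on the real line: W_1 between 0.5*delta_0 + 0.5*delta_2 and delta_1
print(wasserstein_p({0: 0.5, 2: 0.5}, {1: 1.0}, lambda x, y: abs(x - y)))  # 1.0
```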

To motivate the formal definition of the adapted cousins in (5) and (6) below, we start with an informal discussion in terms of Monge mappings: In probabilistic terms, the preservation of mass assumption \(T_{\#} (\mu )= \nu \) asserts

$$\begin{aligned} \big (T_1(X_1, \ldots , X_N), \ldots , T_N(X_1, \ldots , X_N)\big ) \sim \nu , \end{aligned}$$
(2)

which ignores the evolution of \(\mu \) and \(\nu \) (resp.) in time. Rather it would appear more natural to restrict to mappings \((T_k)_{k=1}^N\) which are adapted in the sense that \(T_k\) depends only on \(X_1, \ldots , X_k\). Adapted Wasserstein distances can be defined following precisely this intuition, relying on a suitable version of adaptedness on the level of couplings:

The set \({{\,\mathrm{Cpl}\,}}_c(\mu , \nu )\) of causal couplings consists of all \(\pi \in {{\,\mathrm{Cpl}\,}}(\mu , \nu )\) such that

$$\begin{aligned} \pi \big ( (Y_1,\dots ,Y_t)\in A | X\big )&= \pi \big ( (Y_1,\dots ,Y_t)\in A | X_1,\dots ,X_t \big ) \end{aligned}$$
(3)

for all \(t\le N\) and \(A \subseteq \mathcal {X}^t\) measurable, cf. [46]. The set of all bi-causal couplings \({{\,\mathrm{Cpl}\,}}_{bc}(\mu , \nu )\) consists of all \(\pi \in {{\,\mathrm{Cpl}\,}}_c(\mu , \nu )\) such that the distribution of (Y, X) under \(\pi \) is also in \({{\,\mathrm{Cpl}\,}}_c(\nu ,\mu )\), i.e. such that (3) also holds with the roles of X and Y reversed.
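Condition (3) can be checked mechanically for finitely supported couplings by comparing the two conditional probabilities atom by atom. The following sketch is our own illustration (the function name and the path-dictionary representation are assumptions); a coupling is bi-causal iff it passes this check and the analogous check with the roles of the x- and y-paths exchanged.

```python
from collections import defaultdict

def is_causal(pi, N, tol=1e-12):
    """Check the causality condition (3) for a finitely supported coupling.

    pi : dict mapping ((x_1,...,x_N), (y_1,...,y_N)) -> probability mass.
    For each t, the conditional law of (Y_1,...,Y_t) given the whole path X
    must agree with the conditional law given only (X_1,...,X_t).
    """
    for t in range(1, N + 1):
        joint_full = defaultdict(float)   # mass of (x, (y_1,...,y_t))
        joint_head = defaultdict(float)   # mass of ((x_1,...,x_t), (y_1,...,y_t))
        mass_full = defaultdict(float)    # mass of x
        mass_head = defaultdict(float)    # mass of (x_1,...,x_t)
        for (x, y), w in pi.items():
            joint_full[(x, y[:t])] += w
            joint_head[(x[:t], y[:t])] += w
            mass_full[x] += w
            mass_head[x[:t]] += w
        for (x, yt), w in joint_full.items():
            lhs = w / mass_full[x]
            rhs = joint_head[(x[:t], yt)] / mass_head[x[:t]]
            if abs(lhs - rhs) > tol:
                return False
    return True
```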

The term causal was introduced by Lassalle [46], who considers a causal transport problem in which the usual set of couplings is replaced by the set of causal couplings. The resulting concept is not actually a metric as it lacks symmetry, but, as suggested by Soumik Pal, this is easily mended and we formally define the causal and the symmetrized-causal p-Wasserstein distances, respectively, as follows:

For \(\mu , \nu \in {\mathcal {P}}_p(\Omega )\) set

$$\begin{aligned} {\mathcal {C}}{\mathcal {W}}_p(\mu ,\nu )&:= \inf _{\pi \in {{\,\mathrm{Cpl}\,}}_c(\mu ,\nu )} \left( {\textstyle \int }\rho _\Omega (x,y)^p \,\mathrm d\pi (x,y) \right) ^{1/p}, \end{aligned}$$
(4)
$$\begin{aligned} {\mathcal {S}}{\mathcal {C}}{\mathcal {W}}_p(\mu ,\nu )&:= \max \big ( {\mathcal {C}}{\mathcal {W}}_p(\mu ,\nu ),\, {\mathcal {C}}{\mathcal {W}}_p(\nu ,\mu ) \big ). \end{aligned}$$
(5)

We use the term adapted Wasserstein distance for

$$\begin{aligned} {{\mathcal {A}}}{{\mathcal {W}}}_p(\mu ,\nu ) := \inf _{\pi \in {{\,\mathrm{Cpl}\,}}_{bc}(\mu ,\nu )} \left( {\textstyle \int }\rho _\Omega (x,y)^p \,\mathrm d\pi (x,y) \right) ^{1/p}. \end{aligned}$$
(6)

Rüschendorf [58] refers to \({{\mathcal {A}}}{{\mathcal {W}}}_p\) as ‘modified Wasserstein distance’. Pflug–Pichler [52, Definition 1] use the names multi-stage distance of order p and nested distance. It can also be considered as a discrete time version of the ‘Wasserstein-type distance’ of Bion–Nadal and Talay [16]. In [5] we use a slightly modified definition of \({{\mathcal {A}}}{{\mathcal {W}}}_p\) which scales better with the number of time-periods N but leads to an equivalent metric (for fixed p and N). We shall discuss further properties of \({{\mathcal {A}}}{{\mathcal {W}}}_p\) (and in particular the connection with Vershik’s iterated Kantorovich distance) in Sect. 1.8 below.

1.3 Hellwig’s information topology

The information topology introduced by Hellwig in [31] (as well as Aldous’ extended weak topology which we discuss next) is based on the idea that an essential part of the structure of a process is the information that we may deduce about the future behaviour of the process given its behaviour up to the current time t. For a process whose law is \(\mu \), this information is captured by the conditional law of the future path \((X_{t+1},\dots ,X_N)\) given \(X_1=x_1, \dots , X_t=x_t\) under \(\mu \).

This conditional law is precisely the disintegration \(\mu _{x_1,\dots ,x_t}\) of \(\mu \) w.r.t. the first t coordinates.

Hellwig’s information topology is the initial topology w.r.t. a family of maps \((\mathcal {I}_t)_{t=1}^{N-1}\), \(\mathcal {I}_t :{\mathcal {P}}(\Omega )\rightarrow {\mathcal {P}}\big (\mathcal {X}^t \times {\mathcal {P}}(\mathcal {X}^{N-t})\big )\), which are built from these disintegrations: \(\mathcal {I}_t(\mu )\) is the joint law of

$$\begin{aligned} \big ( X_1,\dots ,X_t,\ \mu _{X_1,\dots ,X_t} \big ) \end{aligned}$$

under \(\mu \). Hellwig’s information topology is therefore the coarsest topology which, for every t, makes continuous the map sending a probability \(\mu \) to the joint law describing the evolution of the coordinate process up to time t and the prediction about the future behaviour of the coordinate process after time t.

Remark 1.1

All the topologies we consider in this paper are second countable. As such they can be characterized by saying which sequences converge. Restated in the language of sequences, the above definition says that a sequence \((\mu _n)_n\) in \({\mathcal {P}}(\Omega )\) converges in Hellwig’s information topology to \(\mu \in {\mathcal {P}}(\Omega )\) if and only if, for every t, the sequence \((\mathcal {I}_t(\mu _n))_n\) converges to \(\mathcal {I}_t(\mu )\) in the usual weak topology on \({\mathcal {P}}\big (\mathcal {X}^t \times {\mathcal {P}}(\mathcal {X}^{N-t})\big )\).

The work of Hellwig [31] was motivated by questions of stability in dynamic economic models/games; see the related articles [11, 32, 40, 59].

1.4 Aldous’ extended weak topology

Aldous [3] introduces a type of convergence for pairs of filtrations and continuous time stochastic processes on them that he calls extended weak convergence [3, Definition 15.2]. Restricted to our current setting, his definition can be paraphrased in a similar manner as that of the information topology. Aldous’ idea is to represent a stochastic process with law \(\mu \) through the associated prediction process, that is, the process given by

$$\begin{aligned} Z^\mu _t := {\text {Law}}_\mu \big ( (X_1,\dots ,X_N) \mid X_1,\dots ,X_t \big ), \qquad t = 0,\dots ,N. \end{aligned}$$

That is, \((Z^\mu _t)_{t=0}^N\) is a measure-valued martingale that makes increasingly accurate predictions about the full trajectory of the process X.
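For a finitely supported law the prediction process can be written down explicitly by conditioning on path prefixes. The following sketch is our own illustration (the function name and the representation of \(Z^\mu _t\) as a dictionary of conditional laws indexed by prefixes are assumptions):

```python
from collections import defaultdict

def prediction_process(mu, N):
    """Prediction process (Z^mu_t)_{t=0,...,N} of a finitely supported law mu.

    mu : dict mapping a path (x_1,...,x_N) to its probability.
    Returns a list Z where Z[t] maps each prefix (x_1,...,x_t) of positive
    probability to the conditional law of the full path given that prefix.
    Z[0] has the single key () with value mu itself; Z[N] consists of Diracs.
    """
    Z = []
    for t in range(N + 1):
        mass = defaultdict(float)
        for path, w in mu.items():
            mass[path[:t]] += w
        cond = {pre: {} for pre in mass}
        for path, w in mu.items():
            cond[path[:t]][path] = w / mass[path[:t]]
        Z.append(cond)
    return Z
```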

Rather than comparing the laws of processes directly, the extended weak topology is derived from the weak topology on the corresponding prediction processes (plus the original processes). Formally, the extended weak topology on \({\mathcal {P}}(\Omega )\) is the initial topology w.r.t. the map \({\mathcal {E}}\) which sends \(\mu \) to the joint distribution of

$$\begin{aligned} \big ( (X_t)_{t=1}^N ,\ (Z^\mu _t)_{t=0}^N \big ) \end{aligned}$$

under \(\mu \).

Note that, to stay faithful to Aldous’ original definition, we defined \({\mathcal {E}}\) to map \(\mu \) not just to the law of the prediction process but to the joint law of the original process and its prediction process. One easily checks that the original process may be omitted in our setting without changing the resulting topology.

1.5 The optimal stopping topology

The usual weak topology on \({\mathcal {P}}(\Omega )\) is the coarsest topology which makes continuous all the functions

$$\begin{aligned} \mu \mapsto {\textstyle \int }f \,\mathrm d\mu \end{aligned}$$

for \(f: \Omega \rightarrow {{\mathbb {R}}}\) continuous and bounded.

One may follow a similar pattern and look at the coarsest topology which makes continuous the outcomes of all sequential decision procedures. Perhaps the easiest way to formalize this is to look at optimal stopping problems. In detail, write \(AC(\Omega )\) for the set of all processes \((L_t)_{t=1}^N\) which are adapted, bounded and such that \(x\mapsto L_t(x) \) is continuous for each \(t\le N\). Write \(v^L(\mu )\) for the corresponding value function, given that the process X follows the law \(\mu \), i.e.

$$\begin{aligned} v^L(\mu ) := \sup \big \{ {\mathbb {E}}_\mu [ L_\tau ] : \tau \text { a stopping time of the canonical filtration} \big \}. \end{aligned}$$

The optimal stopping topology on \({\mathcal {P}}(\Omega )\) is the coarsest topology which makes the functions

$$\begin{aligned} \mu \mapsto v^L(\mu ) \end{aligned}$$

continuous for all \((L_t)_{t=1}^N \in AC(\Omega )\).
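For finitely supported laws the value \(v^L(\mu )\) is computed by the usual backward induction over path prefixes (a Snell-envelope computation). The sketch below is our own illustration; the names and the prefix-dictionary representation are assumptions, not notation from the text.

```python
from collections import defaultdict

def stopping_value(mu, L, N):
    """sup over stopping times tau of E[L_tau(X_1,...,X_tau)] under mu.

    mu : dict mapping a path (x_1,...,x_N) to its probability.
    L  : list where L[t-1] is a function of the first t coordinates (adaptedness).
    """
    mass = [defaultdict(float) for _ in range(N + 1)]
    for path, w in mu.items():
        for t in range(N + 1):
            mass[t][path[:t]] += w

    V = {pre: L[N - 1](pre) for pre in mass[N]}          # at time N one must stop
    for t in range(N - 1, 0, -1):
        V_new = {}
        for pre in mass[t]:
            # conditional expectation of tomorrow's value given today's prefix
            cont = sum(mass[t + 1][nxt] / mass[t][pre] * V[nxt]
                       for nxt in mass[t + 1] if nxt[:t] == pre)
            V_new[pre] = max(L[t - 1](pre), cont)        # stop now or continue
        V = V_new
    return sum(mass[1][pre] * V[pre] for pre in mass[1])

# two-period example: X_1 = 0, X_2 = +1 or -1 with equal probability,
# L_1 = 0.2 (constant), L_2(x_1, x_2) = x_2
mu = {(0, 1): 0.5, (0, -1): 0.5}
L = [lambda pre: 0.2, lambda pre: pre[1]]
print(stopping_value(mu, L, 2))   # 0.2: stopping at time 1 is optimal
```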

1.6 Main result

We can now state our main result:

Theorem 1.2

Let \((\mathcal {X},\rho _\mathcal {X})\) be a Polish metric space, where \(\rho _\mathcal {X}\) is a bounded metric, and set \(\Omega := \mathcal {X}^N\). Then the following topologies on \({\mathcal {P}}(\Omega )\) are equal:

(1) the topology induced by \({\mathcal {A}} {\mathcal {W}}_p\),

(2) the topology induced by \({\mathcal {S}}{\mathcal {C}}{\mathcal {W}}_p\),

(3) Hellwig’s information topology,

(4) Aldous’ extended weak topology,

(5) the optimal stopping topology.

The assumption that \(\rho _\mathcal {X}\) is bounded serves only to simplify the statement of the theorem, because in this case the topology induced by \({\mathcal {W}}_p\) coincides with the weak topology. For every Polish space there is a bounded complete metric which induces the topology (given any complete metric \(\rho _\mathcal {X}\), replace it by e.g. \(\min (1,\rho _\mathcal {X})\)).

1.6.1 p-Wasserstein and unbounded metrics

There is an analogous statement, Theorem 1.3 below, which drops the assumption that \(\rho _\mathcal {X}\) is bounded. To be able to state it, we introduce slight variations of Hellwig’s information topology, of Aldous’ extended weak topology and of the optimal stopping topology:

Hellwig [31] equips the target spaces of \(\mathcal {I}_t\) with the weak topology – or more precisely he equips \({\mathcal {P}}(\mathcal {X}^{N-t})\) with the weak topology, \(\mathcal {X}^t \times {\mathcal {P}}(\mathcal {X}^{N-t})\) with the product topology and finally \({\mathcal {P}}\big (\mathcal {X}^t \times {\mathcal {P}}(\mathcal {X}^{N-t})\big )\) with the weak topology based on this topology. One may easily define a p-Wasserstein version of Hellwig's information topology by using the recipe ‘replace the weak topology by the p-Wasserstein metric everywhere’. Concretely, if we restrict \(\mathcal {I}_t\) to \({\mathcal {P}}_p(\Omega )\), we may view it as a map into \({\mathcal {P}}_p\big (\mathcal {X}^t \times {\mathcal {P}}_p(\mathcal {X}^{N-t})\big )\), where the last space carries the p-Wasserstein metric built from the metric \(\big ((x,m),(x',m')\big ) \mapsto \big (\rho _{\mathcal {X}^t}(x,x')^p + {\mathcal {W}}_p(m,m')^p\big )^{1/p}\) on \(\mathcal {X}^t \times {\mathcal {P}}_p(\mathcal {X}^{N-t})\).

We will call the resulting variant of Hellwig's information topology on \({\mathcal {P}}_p(\Omega )\) the \({\mathcal {W}}_p\)-information topology.

Similarly, one may systematically replace every occurrence of the weak topology in the definition of the extended weak topology by the p-Wasserstein metric. We call the resulting topology on \({\mathcal {P}}_p(\Omega )\) the extended \({\mathcal {W}}_p\)-topology.

Just like the weak topology is the coarsest topology which makes integration of continuous bounded functions continuous, the p-Wasserstein topology is the coarsest topology which makes integration of continuous functions bounded by \(c \cdot (1 + \rho (x_0,x)^p)\) continuous. Following this analogy, we define \(AC_p(\Omega )\) as the set of all processes \((L_t)_{t=1}^N\) which are adapted, bounded by \(x \mapsto c \cdot (1 + \rho (x_0,x)^p)\) for some \(c \in {{\mathbb {R}}}_+\) and satisfy that \(x\mapsto L_t(x) \) is continuous for each \(t\le N\).

The \({\mathcal {W}}_p\)-optimal stopping topology on \({\mathcal {P}}_p(\Omega )\) is the coarsest topology which makes the functions

$$\begin{aligned} \mu \mapsto v^L(\mu ) \end{aligned}$$

continuous for all \((L_t)_{t=1}^N \in AC_p(\Omega )\).

With these we may state the following generalization of Theorem 1.2:

Theorem 1.3

Let \((\mathcal {X},\rho _\mathcal {X})\) be a Polish metric space and set \(\Omega := \mathcal {X}^N\). Then the following topologies on \({\mathcal {P}}_p(\Omega )\) are equal:

(1) the topology induced by \({\mathcal {A}} {\mathcal {W}}_p\),

(2) the topology induced by \({\mathcal {S}}{\mathcal {C}}{\mathcal {W}}_p\),

(3) the \({\mathcal {W}}_p\)-information topology,

(4) the extended \({\mathcal {W}}_p\)-topology,

(5) the \({\mathcal {W}}_p\)-optimal stopping topology.

Clearly, one recovers Theorem 1.2 from Theorem 1.3 by choosing a bounded metric on \(\mathcal {X}\), because the \({\mathcal {W}}_p\)-information topology for bounded \(\rho _\mathcal {X}\) is just the information topology, the extended \({\mathcal {W}}_p\)-topology for bounded \(\rho _\mathcal {X}\) is just the extended weak topology and the \({\mathcal {W}}_p\)-optimal stopping topology for bounded \(\rho _\mathcal {X}\) is just the optimal stopping topology.

The relationship between the topologies listed in Theorem 1.2 and those listed in Theorem 1.3 is similar to the non-adapted case where we know that usual p-Wasserstein convergence is equivalent to usual weak convergence plus convergence of the pth moments.

Lemma 1.4

Convergence in any of the topologies of Theorem 1.3 is equivalent to convergence in any of the topologies of Theorem 1.2 (where for building \({\mathcal {S}}{\mathcal {C}}{\mathcal {W}}_p\) and \({\mathcal {A}} {\mathcal {W}}_p\), \(\rho _\mathcal {X}\) is replaced by a bounded compatible complete metric e.g. \(\min (1,\rho _\mathcal {X})\)) plus convergence of pth moments on \(\Omega \) w.r.t. (the original) \(\rho _\Omega \).

We prove Lemma 1.4 in Sect. 6, making use of (parts of) Theorems 1.2 and 1.3.

1.7 Further remarks on related work

1.7.1 Some further articles by successors of Aldous

One of the original applications of Aldous’ extended weak topology concerned the stability of optimal stopping [3]. This corresponds to one half of \((4)=(5)\) in Theorem 1.2, but in a much more general setting. This line of work has been continued by Lamberton and Pagès [45], Coquet and Toldo [20], among others.

Aldous’ extended weak topology was also inspiring and instrumental for the development of the theory of convergence of filtrations, and the associated questions of stability of the martingale representation property and Doob–Meyer decompositions. In this regard, see the works by Hoover et al. [35, 37] and by Mémin et al. [19, 48]. The related question of stability of stochastic differential equations (as well as their backward versions) with respect to the driving noise has seen a particular burst of activity in the last two decades. For brevity’s sake we only refer to the recent article by Papapantoleon, Possamaï, and Saplaouras [50] for an overview of the many available works in this direction.

1.7.2 Previous applications of adapted Wasserstein distances

Pflug, Pichler and co-authors [30, 52,53,54,55,56] have extensively developed and applied the notion of nested distances for the purpose of scenario generation, stability, sensitivity bounds, and distributionally robust stochastic optimization, in the context of operations research.

Acciaio, Zalashko, and one of the present authors consider in [2] the adapted Wasserstein distance in continuous time in connection with utility maximization, enlargement of filtrations and optimal stopping.

Causal couplings have appeared in the work by Yamada and Watanabe [62], Jacod and Mémin [38] as well as Kurtz [42, 43], concerning weak solutions of stochastic differential equations, and by Rüschendorf [58] concerning approximation theorems in probability theory. The term ‘causal’ is first used by Lassalle [46], who uses it as an additional constraint on the transport problem and gives an alternative derivation of the Talagrand inequality for the Wiener measure. Causal couplings are also present in the numerical scheme suggested in [1] for (extended mean-field) stochastic control.

The article [7] connects adapted Wasserstein distance (in continuous time) to martingale optimal transport (cf. [12,13,14, 17, 18, 23, 27, 33, 34] among many others). Several familiar objects appear as solutions to variational problems in this context. E.g. geometric Brownian motion is the martingale which is closest in \({{\mathcal {A}}}{{\mathcal {W}}}_2\) to usual Brownian motion subject to having a log-normal distribution at the terminal time point, and the local volatility model is closest to Brownian motion subject to matching one-dimensional marginals.

Bion–Nadal and Talay [16] introduce an adapted Wasserstein-type distance on the set of diffusion SDEs and show that this distance corresponds to the computation of a tractable stochastic control problem. They also apply their results to the problem of fitting diffusion models to given marginals.

In [5] the present authors consider adapted Wasserstein distances in relation to stability in finance: Lipschitz continuity of utility maximization/hedging is established w.r.t. the underlying models in discrete and continuous time.

1.8 Another formulation of the adapted Wasserstein distance and of Hellwig’s information topology

Here we give an alternative formulation of the adapted Wasserstein distance/nested distance due to Pflug and Pichler.

Again, \(\mathcal {X}\) is a Polish space and \(\rho = \rho _\mathcal {X}\) is a compatible metric on \(\mathcal {X}\). Starting with \(V_N^p:=0\) we define

$$\begin{aligned}&V_t^p(x_1,\dots ,x_{t},y_1,\dots ,y_{t}) \\&\quad :=\inf _{\gamma ^{t+1} \in {{\,\mathrm{Cpl}\,}}(\mu _{x_1,\dots ,x_{t}}, \nu _{y_1,\dots ,y_{t}} )} \iint \left( \begin{array}{cc} V^p_{t+1}(x_1,\dots ,x_{t+1},y_1,\dots ,y_{t+1})\\ + \rho (x_{t+1},y_{t+1})^p\end{array} \right) \,\mathrm d\gamma ^{t+1}(x_{t+1}, y_{t+1}).\nonumber \end{aligned}$$
(7)

The nested distance is finally obtained in a backwards recursive way by

$$\begin{aligned} {{\mathcal {N}}}{{\mathcal {D}}}_p(\mu ,\nu )^p =\inf _{ \gamma ^1 \in {{\,\mathrm{Cpl}\,}}({{{\,\mathrm{proj}\,}}_1}_{\#} (\mu ),{{{\,\mathrm{proj}\,}}_1}_{\#} (\nu )) } \iint \left( V^p_1(x_1,y_1)+\ \rho (x_1,y_1)^p \right) \,\mathrm d\gamma ^1(x_1,y_1).\nonumber \\ \end{aligned}$$
(8)

Then \({\mathcal {A}} {\mathcal {W}}_p= {{\mathcal {N}}}{{\mathcal {D}}}_p\). We refer to [8] for the (straightforward) justification.

For \(N>1\) the adapted Wasserstein distance is not complete. As was established in [6], a natural complete space into which \(({\mathcal {P}}_p(\Omega ), {\mathcal {A}} {\mathcal {W}}_p)\) embeds is given by the space of nested distributions:

Consider the sequence of metric spaces

where at each stage t, the space is endowed with the p-Wasserstein distance with respect to the metric \(\rho _{t:N}\) on \(\mathcal {X}_{t:N}\), which we denote by \({\mathcal {W}}_{ \rho _{t:N},p}\). The space of nested distributions (of depth N) is defined as . We endow it with the complete metric \({\mathcal {W}}_{\rho _{1:N},p}\).

The space of nested distributions was defined by Pflug [51]. Notably the idea to iterate the formation of Wasserstein spaces and metrics goes back to Vershik [60, 61] who uses the name ‘iterated Kantorovich distance’. The main interest of Vershik (and his successors) lies in the classification of filtrations (in the language of ergodic theory). We refer to the work of Emery and Schachermayer [25] for a survey from a probabilistic perspective and to Janvresse, Laurent and de la Rue [39] for a contemporary article (again from a probabilistic viewpoint).

is naturally embedded in the set of nested distributions of depth N through the map \(\mathcal {N}\) given by

(9)

where \((X_1,\dots ,X_N)\) is a vector with law \(\mu \), \({\mathcal {L}}(\,\cdot \,)\) again denotes (conditional) law and we use \({{\bar{X}}}_{1}^{t}\) as a shorthand for the vector \(X_1, \dots , X_t\).

Following [6], we have:

Theorem 1.5

The map \(\mathcal {N}\) defined in (9) embeds the metric space isometrically into the complete separable metric space .

Remark 1.6

When \(\mathcal {X}\) has no isolated points, the space of nested distributions is actually the completion of \({\mathcal {P}}_p(\Omega )\) w.r.t. \({\mathcal {A}} {\mathcal {W}}_p\), i.e. \({\mathcal {P}}_p(\Omega )\), considered as a subset of the space of nested distributions, is dense.

1.8.1 Hellwig’s information topology in terms of adapted Wasserstein distances

We note that Hellwig’s definition of the information topology can also be rephrased using the concept of adapted Wasserstein distance: Assume that \(\rho _\mathcal {X}\) is a bounded metric and for \(t \le N\), set

$$\begin{aligned} \Omega = {{\mathcal {X}}}^N=\underbrace{{{\mathcal {X}}}^t}_{=:X_1^{(t)}}\times \underbrace{{{\mathcal {X}}}^{N-t}}_{=:X_2^{(t)}}=X_1^{(t)} \times X_2^{(t)}. \end{aligned}$$

I.e. for each t, we consider \(\Omega \) as the product of two Polish spaces (which one might consider as ‘history’ and ‘future’). Extending the definition of \({\mathcal {A}} {\mathcal {W}}_p\) in the obvious way to products of not necessarily equal Polish spaces, we can then equip \({\mathcal {P}}(\Omega )\) with a one-period adapted Wasserstein distance \({{\mathcal {A}}}{{\mathcal {W}}}_{p}^{(t)}, p\ge 1\). Setting for \(\mu , \nu \in {\mathcal {P}}(\Omega )\)

$$\begin{aligned} {{\mathcal {I}}}{{\mathcal {W}}}_p(\mu , \nu ):= \sum _{t=1}^N {{\mathcal {A}}}{{\mathcal {W}}}_{p}^{(t)}(\mu , \nu ),\quad p\ge 1, \end{aligned}$$
(10)

we obtain a compatible metric for the information topology. This is relatively straightforward (whereas the full version of Theorem 1.2 is not straightforward as far as we are concerned).

1.9 Preservation of compactness

We close this section with a result about the preservation of relative compactness which we shall use in Sects. 4 and 6, but which also might be of independent interest. Specifically, in [9, 10] the two-step version of Lemma 1.7 is used as a crucial tool in the investigation of the weak transport problem.

A more detailed investigation of compactness in \({\mathcal {P}}(\Omega )\) with the weak adapted topology is the topic of the companion paper to this one, [24].

Assume for simplicity that \(\rho _\mathcal {X}\) is a bounded metric. Then we have

Lemma 1.7

(Compactness lemma) A set \(A \subseteq {\mathcal {P}}(\Omega )\) is relatively compact w.r.t. the usual weak topology iff \(\mathcal {N}[A]\) is relatively compact.

We note that Lemma 1.7 is essentially a consequence of the characterization of compact subsets in spaces of probability measures; in a somewhat different framework it was first proved in [36]. The version stated here follows by repeated application of [24, Lemma 3.3]/[9, Lemma 2.6].

The implication that \(\mathcal {N}[A]\) relatively compact implies A relatively compact is rather easy to see, but the other direction, that A relatively compact implies \(\mathcal {N}[A]\) relatively compact, is nontrivial since the mapping \(\mathcal {N}\) is not continuous when \({\mathcal {P}}(\Omega )\) is endowed with the usual weak topology (except for trivial cases). Lemma 1.7 would not be true if we were to replace relative compactness by compactness.

The assumption that \(\rho _\mathcal {X}\) is bounded is inessential. A version of Lemma 1.7 holds if we replace \({\mathcal {P}}(\Omega )\) by \({\mathcal {P}}_p(\Omega )\) and the weak topology by the one induced by the p-Wasserstein metric.

A similar result based on Hellwig’s information topology, relating relative compactness in to relative compactness in , is also true.

2 Preparations

The rest of the paper will essentially be devoted to proving Theorem 1.2, or really its generalization Theorem 1.3.

In Sect. 3 we prove that Hellwig’s information topology equals the topology induced by \({\mathcal {A}} {\mathcal {W}}_p\), i.e. \((3) =(1)\) in Theorem 1.3. In a sense, of all the topologies listed in Theorem 1.3, Hellwig’s information topology ‘looks’ the coarsest – or at least like one of the coarser ones – while the topology induced by \({\mathcal {A}} {\mathcal {W}}_p\) ‘looks’ the finest.

In Sect. 4 we sandwich the topology induced by \({\mathcal {S}}{\mathcal {C}}{\mathcal {W}}_p\) between Hellwig’s information topology and the topology induced by \({\mathcal {A}} {\mathcal {W}}_p\), i.e. we show \((3) \le ~(2) \le ~(1)\) in Theorem 1.3.

In Sect. 5 we show that Aldous’ extended weak topology is equal to Hellwig’s information topology, i.e. \((4) =(3)\) in Theorem 1.3.

In Sect. 6 we prove Lemma 1.4.

In Sect. 7 we prove that the optimal stopping topology is coarser than the topology induced by \({\mathcal {A}} {\mathcal {W}}_p\) and finer than Hellwig’s (\({\mathcal {W}}_p\)-)information topology, i.e. \((3) \le (5) \le (1)\) in Theorem 1.3.

2.1 Notation

The nested structure of spaces like, for example, the nested distribution spaces introduced in Sect. 1.8 is (at least for the authors) not so easy to gain an intuition for. It seems rather challenging to picture probability measures on probability measures on probability measures... etc.

Therefore, much of the work in the proofs in the following two sections will consist of bookkeeping and of not getting lost in these nested structures. In most other contexts we would regard such bookkeeping as abstract nonsense better swept under the rug, but in the context of the present paper we believe that it really constitutes an important and nontrivial ingredient in successfully carrying out the proofs.

To aid in this endeavour we make some notational preparations and introduce a few conventions.

2.1.1 Operations on spaces

In the introduction we described the topologies listed in Theorems 1.2 and 1.3 as initial topologies w.r.t. maps into more complex spaces. These spaces are built up from just a few basic operations, and in most cases the maps can also be constructed using a few relatively simple ingredients.

For spaces, the operations in question are

  • product formation, i.e. for spaces \(\mathcal {X}\) and \(\mathcal {Y}\) we may form their product space \(\mathcal {X}\times \mathcal {Y}\),

  • and passing from a space \(\mathcal {X}\) to the space of probability measures on \(\mathcal {X}\).

Here we run into some tension between the various existing definitions in the literature. While Hellwig and Aldous originally defined their topologies based on equipping the space of probability measures on some space \(\mathcal {X}\) with the weak topology, without any mention of metrics, \({\mathcal {A}} {\mathcal {W}}_p\) is a metric built on the p-Wasserstein metric, and Theorem 1.5 exhibits this metric as the ‘initial metric’ w.r.t. an embedding of \({\mathcal {P}}_p(\Omega )\) (not \({\mathcal {P}}(\Omega )\)) into the space of nested distributions.

Luckily, when the base metric \(\rho _\mathcal {X}\) on \(\mathcal {X}\) is bounded and we decide that we only care about topologies and not the metrics that induce them, all of these distinctions vanish, and one may hope for these fine distinctions to not be so important in the end.

To give as uniform and as streamlined a treatment as possible of all the various ways in which these metric and topological spaces can be related to each other, we employ the following strategy: a lot of our arguments are agnostic to the distinction between \({\mathcal {P}}\) and \({\mathcal {P}}_p\), and to whether we are talking about metric or topological spaces etc. They only rely on properties of the operations of product formation and formation of spaces of probability measures and on properties of maps between various spaces built using these operations which hold in either case. For the rest of the paper we will therefore drop the p in \({\mathcal {P}}_p\) and other explicit mentions of these distinctions. The reader may decide to read the paper using either of the following two sets of conventions, which are to be applied recursively:

Convention 1

(Weak topologies)

  • \(\mathcal {X}\), \(\mathcal {Y}\), \(\mathcal {Z}\), \(\mathcal {A}\), \(\mathcal {B}\), \(\mathcal {C}\), etc. are Polish spaces.

  • \(\mathcal {X}\times \mathcal {Y}\) is a topological space with the product topology (again Polish).

  • \({\mathcal {P}}(\mathcal {X})\) is a topological space with the weak topology (also Polish).

  • ‘space’ will mean Polish space.

Convention 2

(\({\mathcal {W}}_p\))

  • \(p \ge 1\) is fixed throughout the paper

  • \(\mathcal {X}\), \(\mathcal {Y}\), \(\mathcal {Z}\), \(\mathcal {A}\), \(\mathcal {B}\), \(\mathcal {C}\), etc. are Polish (i.e. complete separable) metric spaces with metrics \(\rho _\mathcal {X}\), \(\rho _\mathcal {Y}\), \(\rho _\mathcal {Z}\), \(\rho _\mathcal {A}\), \(\rho _\mathcal {B}\), \(\rho _\mathcal {C}\), etc. respectively.

  • \(\mathcal {X}\times \mathcal {Y}\) is a Polish metric space with the metric \(\rho _{\mathcal {X}\times \mathcal {Y}}\big ((x,y),(x',y')\big ) := \big (\rho _\mathcal {X}(x,x')^p + \rho _\mathcal {Y}(y,y')^p\big )^{1/p}\).

  • \({\mathcal {P}}(\mathcal {X})\) is a Polish metric space with the p-Wasserstein metric \({\mathcal {W}}_p\).

  • The subscript on the metric \(\rho \) may be dropped when clear from the context.

  • ‘space’ will mean Polish metric space.

Unless specified otherwise everything said from here on will be true for either way of reading. Convention 1 will lead to a direct proof of Theorem 1.2, while Convention 2 will give a proof of the more general version, Theorem 1.3. Occasionally an argument will require us to talk directly about metrics to establish continuity of some map. When one only cares about Theorem 1.2 and not Theorem 1.3 these sections can be read while assuming that \(p=1\) and that all metrics mentioned are bounded.

Another space we will need is

Definition 2.1

is the space of probability measures on \(\mathcal {A}\times \mathcal {B}\) which are concentrated on the graph of a measurable function, i.e.:

The space carries the subspace topology/the restriction of the metric on \({\mathcal {P}}(\mathcal {A}\times \mathcal {B})\).

2.1.2 Maps between spaces

Assuming Convention 1, when \(f: \mathcal {X}\rightarrow \mathcal {Y}\) is a continuous map, the pushforward under f, i.e. the map which sends \(\mu \in {\mathcal {P}}(\mathcal {X})\) to the measure \(\nu \in {\mathcal {P}}(\mathcal {Y})\) with \(\nu (A) = \mu (f^{-1}[A])\), is also continuous.

Similarly, assuming Convention 2, when \(f: \mathcal {X}\rightarrow \mathcal {Y}\) is a Lipschitz-continuous map between metric spaces the pushforward under f is also Lipschitz-continuous from \({\mathcal {P}}(\mathcal {X})\) to \({\mathcal {P}}(\mathcal {Y})\).

We will use \({\mathcal {P}}(f)\) to denote the pushforward under f, to emphasize the fact that \({\mathcal {P}}\) is a functor, i.e. that it sends a diagram with a ‘nice’ (read continuous/Lipschitz) map

$$\begin{aligned} \mathcal {X}\overset{f}{\longrightarrow }\mathcal {Y}\end{aligned}$$

to a similar diagram

where the map \({\mathcal {P}}(f) :{\mathcal {P}}(\mathcal {X}) \rightarrow {\mathcal {P}}(\mathcal {Y})\) is also ‘nice’, and that \({\mathcal {P}}(g \circ f) = {\mathcal {P}}(g) \circ {\mathcal {P}}(f)\) and \({\mathcal {P}}(1_{\mathcal {X}}) = 1_{{\mathcal {P}}(\mathcal {X})}\) (where \(1_{\mathcal {X}}\) is the identity function on \(\mathcal {X}\)).

For a product of spaces \(\mathcal {X}\times \mathcal {Y}\), the projection onto \(\mathcal {X}\) will alternatively be denoted by either \({{\,\mathrm{proj}\,}}_\mathcal {X}\) or by the same letter that is used for the space, but in a non-calligrapic font, i.e. \(X : \mathcal {X}\times \mathcal {Y}\rightarrow \mathcal {X}\).

If \(\mu \) is defined on some product \(\prod _i \mathcal {X}_i\) of spaces, we also introduce a shorthand notation for marginals of \(\mu \), i.e. for the pushforward of \(\mu \) under projection onto the product of some subset of the original factors: for example, \(\mu _{\restriction \mathcal {X}_1 \times \mathcal {X}_3}\) denotes the joint law of the first and third coordinates under \(\mu \).

If \(f: \mathcal {A}\rightarrow \mathcal {B}\) and \(g: \mathcal {A}\rightarrow \mathcal {C}\) are functions we write \((f {{\varvec{,}}}g)\) for the function

$$\begin{aligned} (f {{\varvec{,}}}g)&: \mathcal {A}\rightarrow \mathcal {B}\times \mathcal {C}\\ (f {{\varvec{,}}}g) (a)&:= (f(a),g(a)) . \end{aligned}$$

If we want to specify a map from, say \(\mathcal {A}\times \mathcal {B}\times \mathcal {C}\) to \(\mathcal {X}\) but we only really care about one of the variables we will use an underscore ‘\(\_\)’ instead of naming the unused variables, as in \((a,\_,\_) \mapsto f(a)\). Similarly, when integrating we may also use \(\_\) to denote unused variables, i.e. for we might write \({\textstyle \int }f(y) \,\mathrm d\mu (\_,y)\).

Two important maps will be the disintegration map \({{\,\mathrm{dis}\,}}_{\mathcal {A}}^{\mathcal {B}}\) and its left inverse \({{\,\mathrm{int}\,}}_{\mathcal {A}}^{\mathcal {B}}\).

The disintegration map

sends a probability \(\mu \) on \(\mathcal {A}\times \mathcal {B}\) to the measure \({{\bar{\mu }}} \in {\mathcal {P}}\big (\mathcal {A}\times {\mathcal {P}}(\mathcal {B})\big )\), given as the image of \(\mu _{\restriction \mathcal {A}}\) under the map \(a \mapsto (a, \mu _a)\),

where \(a \mapsto \mu _a\) is a classical disintegration of \(\mu \), i.e. if \({{\bar{\mu }}} = {{\,\mathrm{dis}\,}}_{\mathcal {A}}^{\mathcal {B}}(\mu )\) then

$$\begin{aligned} \int f(a,b) \,\mathrm d\nu (b) \,\mathrm d{{\bar{\mu }}}(a,\nu ) = \int f(a, b) \,\mathrm d\mu _a(b) \,\mathrm d\mu (a,\_) = \int f(a,b) \,\mathrm d\mu (a,b) . \end{aligned}$$

The disintegration map is measurable (see for example [15, Proposition 7.27]) and injective. It is not continuous w.r.t. the weak topologies or the Wasserstein metrics.

When writing \({{\,\mathrm{dis}\,}}_{\mathcal A}^{\mathcal B}\) we will not insist that \(\mathcal A\) has to be the first factor in the domain of \({{\,\mathrm{dis}\,}}_{\mathcal A}^{\mathcal B}\) – \(\mathcal A\) and \(\mathcal B\) may even be products themselves, whose factors are intermingled in the product that makes up the domain of \({{\,\mathrm{dis}\,}}_{\mathcal A}^{\mathcal B}\). Also, we may sometimes omit \(\mathcal B\), only specifying the variable(s) w.r.t. which we are disintegrating, not the ones which are left over, as in \({{\,\mathrm{dis}\,}}_{\mathcal A}\).

The map \({{\,\mathrm{int}\,}}_{\mathcal {A}}^{\mathcal {B}} :{\mathcal {P}}\big (\mathcal {A}\times {\mathcal {P}}(\mathcal {B})\big ) \rightarrow {\mathcal {P}}(\mathcal {A}\times \mathcal {B})\), \({{\,\mathrm{int}\,}}_{\mathcal {A}}^{\mathcal {B}}({{\bar{\mu }}})(f) := \int \int f(a,b) \,\mathrm d\nu (b) \,\mathrm d{{\bar{\mu }}}(a,\nu )\),

is (Lipschitz-)continuous.

The pair \({{\,\mathrm{dis}\,}}_{\mathcal {A}}^{\mathcal {B}}\), \({{\,\mathrm{int}\,}}_{\mathcal {A}}^{\mathcal {B}}\) enjoy the following properties:

(1) \({{\,\mathrm{int}\,}}_{\mathcal {A}}^{\mathcal {B}}\) is the left inverse of the disintegration map, i.e. \({{\,\mathrm{int}\,}}_{\mathcal {A}}^{\mathcal {B}} \circ {{\,\mathrm{dis}\,}}_{\mathcal {A}}^{\mathcal {B}}\) is the identity on \({\mathcal {P}}(\mathcal {A}\times \mathcal {B})\). This is a direct consequence of the definition of the disintegration.

(2) is injective. Therefore,

(3) , i.e. \({{\,\mathrm{dis}\,}}_{\mathcal {A}}^{\mathcal {B}}\) and \({{\,\mathrm{int}\,}}_{\mathcal {A}}^{\mathcal {B}}\) are inverse bijections between and .

The last two properties are just a reformulation of the known fact that the disintegration of a measure is almost-surely uniquely defined.
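For finitely supported measures the pair \({{\,\mathrm{dis}\,}}\), \({{\,\mathrm{int}\,}}\) can be written out explicitly, which may help with the bookkeeping below. The sketch is our own illustration (all names are assumptions); conditional laws are represented as frozensets of (point, weight) pairs so that they can serve as dictionary keys.

```python
from collections import defaultdict

def dis(mu):
    """Disintegration map for a finitely supported mu on A x B: returns a law
    on A x P(B), i.e. a dict (a, kernel) -> P(A = a), where kernel is the
    conditional law mu_a encoded as a frozenset of (b, probability) pairs."""
    marg = defaultdict(float)
    cond = defaultdict(lambda: defaultdict(float))
    for (a, b), w in mu.items():
        marg[a] += w
        cond[a][b] += w
    return {(a, frozenset((b, w / marg[a]) for b, w in cond[a].items())): marg[a]
            for a in marg}

def integ(mu_bar):
    """The map int: a law on A x P(B) -> a law on A x B."""
    out = defaultdict(float)
    for (a, kernel), w in mu_bar.items():
        for b, q in kernel:
            out[(a, b)] += w * q
    return dict(out)

mu = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.5}
assert integ(dis(mu)) == mu     # property (1): int is a left inverse of dis
```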

2.1.3 Processes which take values in different spaces at different times

Already in the introduction, in Sect. 1.8.1, we found it convenient to extend the definition of \({\mathcal {A}} {\mathcal {W}}_p\) to products of not necessarily equal Polish spaces ‘in the obvious way’. To accommodate for reapplication of concepts in a similar style as seen there we make the minor generalization of letting all the processes we talk about take values in different spaces at different times—typically at time t they will take values in a space \(\mathcal {X}_t\).

Denote by and define , , .

3 Hellwig’s \({\mathcal {W}}_p\)-information topology is equal to the topology induced by the adapted Wasserstein distance \({\mathcal {A}} {\mathcal {W}}_p\)

In this section we show \((3)=(1)\) in Theorem 1.3. We will do so by identifying both topologies as initial topologies w.r.t. a single map each, i.e. finding a space which is homeomorphic to \({\mathcal {P}}(\Omega )\) with Hellwig’s (\({\mathcal {W}}_p\)-)information topology and one which is homeomorphic to \({\mathcal {P}}(\Omega )\) with the topology induced by \({\mathcal {A}} {\mathcal {W}}_p\), and then showing that these spaces are homeomorphic in the right way. As an auxiliary tool we will introduce another topology on \({\mathcal {P}}(\Omega )\) which was not mentioned in the introduction, but which is very similar to Hellwig’s. The proof strategy can be summarized by saying that we want to show that the following diagram is commutative.

(11)

Here \(\mathcal {N}\) is the map which induces the same topology as \({\mathcal {A}} {\mathcal {W}}_p\), \(\mathcal {I}\) induces Hellwig’s topology and \(\mathcal {I}'\) induces what we will call the reduced information topology. We shortly restate their definitions below.

Since these mappings are injective, by the definition of the initial topology all of these mappings are homeomorphisms onto their images. To be precise, \(\mathcal {N}\) is a homeomorphism from \({\mathcal {P}}(\Omega )\) with the topology induced by \({\mathcal {A}} {\mathcal {W}}_p\) onto its image (cf. Theorem 1.5), \(\mathcal {I}\) is a homeomorphism from \({\mathcal {P}}(\Omega )\) with the information topology onto its image, and \(\mathcal {I}'\) is a homeomorphism from \({\mathcal {P}}(\Omega )\) with the reduced information topology onto its image.

The maps \(\mathcal {K}\), \(\mathcal {M}\), \(\mathcal {H}\) are still to be found.

As introduced in Sect. 1.3 Hellwig’s (\({\mathcal {W}}_p\)-)information topology is induced by a family of maps \(\mathcal {I}_t\), given by:

Equivalently, the information topology is the initial topology w.r.t. the map

We saw in Sect. 1.8 that \({\mathcal {A}} {\mathcal {W}}_p\) is induced by an embedding \(\mathcal {N}\). Rephrasing the definition there, \(\mathcal {N}\) is obtained by defining maps \(\mathcal {N}^t\) recursively from \(t=N-1\) to \(t=1\):

and setting

$$\begin{aligned} \mathcal {N}&:= \mathcal {N}^1 . \end{aligned}$$

In fact, because \({{\,\mathrm{dis}\,}}\) maps into the space of measures concentrated on the graph of a function, \(\mathcal {N}\) also maps into a smaller space, which we call \(\mathcal {F}_{1}\), and which is again defined by recursion down from \(N-1\) to 1:

I.e. \(\mathcal {F}_{1}\) is with all occurrences of replaced by . Remember that we had

For convenience, let us also define

The fact that

and that therefore \(\mathcal {N}\) maps into \(\mathcal {F}_{1}\) is a consequence of Lemma 3.1 below.

Finally, \(\mathcal {I}'\) is defined as follows

I.e. the reduced information topology, like the information topology, makes continuous the prediction about the behaviour of the process after time t given information about its behaviour up to time t; only now we are just predicting what the process will do in the next step, not for the rest of time.

\(\mathcal {I}\), \(\mathcal {I}'\) and \(\mathcal {N}\) are injective and therefore bijections onto their codomains. This means that the values of the maps \(\mathcal {K}\), \(\mathcal {M}\), \(\mathcal {H}\) in diagram (11) as functions between sets are really already prescribed. The task consists in finding a representation for them which makes it clear that they are continuous.

Lemma 3.1

\({{\,\mathrm{dis}\,}}_{\mathcal {A}}^{\mathcal {B}\times \mathcal {Y}}\) restricted to maps onto .

Proof

We first show that it maps into . Let and let \(g : \mathcal {A}\times \mathcal {B}\rightarrow \mathcal {Y}\) be a function witnessing this fact, i.e. \(\nu (f) = \int f(a,b,g(a,b)) \,\mathrm d\nu (a,b,\_)\).

Let \(\alpha := {{\,\mathrm{dis}\,}}_{\mathcal {A}}^{\mathcal {B}\times \mathcal {Y}} (\nu )\). Then

$$\begin{aligned} \int \int 1_{g(a,b) \ne y} \,\mathrm d\beta (b, y) \,\mathrm d\alpha (a, \beta ) = \int 1_{g(a,b) \ne y} \,\mathrm d\nu (a,b,y) = 0 . \end{aligned}$$

This means that for \(\alpha \)-a.a. \((a,\beta )\) we have \( \int 1_{g(a,b) \ne y} \,\mathrm d\beta (b, y) = 0 \), i.e. \(\beta \) is concentrated on the graph of the function \(b \mapsto g(a,b)\).

To see that any can be obtained as the image of some under \({{\,\mathrm{dis}\,}}_{\mathcal {A}}^{\mathcal {B}\times \mathcal {Y}}\), note that for such \(\alpha \), by the existence of measurably dependent (classical) disintegrations (see for example [15, Proposition 7.27]), , and \({{\,\mathrm{dis}\,}}_{\mathcal {A}}^{\mathcal {B}\times \mathcal {Y}} (\nu ) = \alpha \). \(\square \)

3.1 Homeomorphisms

We give a plain language description of what follows in this section:

The continuity of \(\mathcal {M}\) will be quite trivial, because we are just discarding information.

The components of the map \(\mathcal {K}\) are obtained by ‘folding’ both the ‘head’ and the ‘tail’ of \(\mathcal {F}_{1}\) using iterated application of the map \({{\,\mathrm{int}\,}}\).

By continuity of \({{\,\mathrm{int}\,}}\), it’s easy to see that \(\mathcal {K}_k\) is continuous. To show that the map \(\mathcal {K}\) with the components \(\mathcal {K}_k\) is the map we are looking for, we basically show that

$$\begin{aligned} \mathcal {I}^{-1} \circ \mathcal {K}_k = \mathcal {N}^{-1} . \end{aligned}$$
(12)

\(\mathcal {N}^{-1}\) is again another way of ‘folding’ all of \(\mathcal {F}_{1}\) using \({{\,\mathrm{int}\,}}\) to arrive at . As \(\mathcal {I}^{-1}\) is also \({{\,\mathrm{int}\,}}\), showing (12) amounts to showing that these two different ways of ‘folding’ – first the head and tail and then in a last step the junction between k and \(k+1\) on the one hand, and from front to back on the other hand – do the same thing. This may be intuitively clear to the reader. The proof works by repeated application of Lemma 3.5, which represents one step of ‘folding order doesn’t matter’. Using Lemma 3.5 the proof is completely analogous to the proof that for an operation \(\star \) satisfying \( (a \star b) \star c = a \star ( b \star c )\), i.e. for an associative operation, one has

As we know, for such an operation any way of parenthesizing the multiplication of N elements gives the same result. An analogous statement holds for \({{\,\mathrm{int}\,}}\), though we do not formally state or prove this.

Finally, in Lemma 3.9, using Lemma 3.8 as the main ingredient, we prove the ‘hard direction’, i.e. that \(\mathcal {H}\) is continuous. If the continuity of \(\mathcal {M}\) and \(\mathcal {K}\) as informally described here seems obvious to the reader, they may wish to skip ahead to Lemmas 3.8 and 3.9.

Remark 3.2

The reader interested in working out the details and analogies between ‘folding’ using \({{\,\mathrm{int}\,}}\) and associative binary operations might be interested in reading about monads in the context of Category Theory first. (See for example Chapter VI in [47].) In fact, forms a monad, where

sends an element x of \(\mathcal {X}\) to the Dirac measure at x and

This monad is studied in a little more detail in [29]. \({{\,\mathrm{int}\,}}\) can be obtained from \(\varvec{\mu }\) and a tensorial strength in the sense described for example in [49].

To show that \(\mathcal {M}\) is continuous we will need the following lemma.

Lemma 3.3

\({{\,\mathrm{dis}\,}}_{\mathcal {A}}^{\mathcal {B}}\) is natural in \(\mathcal {B}\), i.e. for \(f: \mathcal {B}\rightarrow \mathcal {B}'\) the following diagram commutes.

Proof

This is just straightforward calculation using the definitions. \(\square \)

Applying Lemma 3.3 with , , \(\mathcal {B}' = \mathcal {X}_{k+1}\) and we get that

Setting we get \( \mathcal {I}'_k = \mathcal {M}_k \circ \mathcal {I}_k\) and then setting \(\mathcal {M}((\mu _k)_k) := (\mathcal {M}_k(\mu _k))_k\) gives \(\mathcal {I}' = \mathcal {M}\circ \mathcal {I}\).

There is an analogue of Lemma 3.3 which we list here for completeness.

Lemma 3.4

is natural in \(\mathcal {B}\), i.e. for \(f: \mathcal {B}\rightarrow \mathcal {B}'\) the following diagram commutes:

In particular, if \(\mathcal {B}\subseteq \mathcal {B}'\) then

if we regard as a subset of by recursively using the recipe: ‘if \(\mathcal {B}\) is a subset of \(\mathcal {B}'\), then we can view as the subset of those which are concentrated on \(\mathcal {B}\)’.

Proof

Again this is just calculation. \(\square \)

We already implicitly used the ‘in particular’-part of Lemma 3.4 when we said that \(\mathcal {N}\) can be regarded both as a map into and into \(\mathcal {F}_{1}\), but the use there seemed too trivial to warrant much mention. There will be more such tacit uses.

Now we show that \(\mathcal {K}\) is continuous. We claim that it can be written as

where

or without the dots, letting denote concatenation of functions, e.g. :

To prove this we will repeatedly apply the following lemma.

Lemma 3.5

(\({{\,\mathrm{int}\,}}\) is ‘associative’) \({{\,\mathrm{int}\,}}\) satisfies the following relation:

These maps can be seen in the following commutative diagram.

Proof

This is just expanding the definition. Both maps send a measure to the measure \(\mu \) with

$$\begin{aligned} {\textstyle \int }f \,\mathrm d\mu = {\textstyle \int }f(a,b,c) \,\mathrm d\gamma (c) \,\mathrm d\beta (b, \gamma ) \,\mathrm d\alpha (a,\beta ) . \end{aligned}$$

\(\square \)

Lemma 3.6

The following relation holds.

(13)

Proof

Again, this is just repeated application of Lemma 3.5. Below we define \({\mathcal {T}}_l\) for \(N \ge l \ge k\) and show that

(14)

for all \(N \ge l \ge k\) by showing \({\mathcal {T}}_l = {\mathcal {T}}_{l-1}\) for all \(N \ge l > k\). The left hand side of (14) is the left hand side of (13) with the common tail of the left and right side in (13) dropped. \({\mathcal {T}}_k\) will be the right hand side of (13) with the common part dropped.

Here we regard with \(r < s\) (an empty product in our context) as the identity function. For \(l = N\) the first factor is an empty product and therefore clearly (14) is true for \(l = N\). To get from \({\mathcal {T}}_l\) to \({\mathcal {T}}_{l-1}\) we leave the first factor alone and apply Lemma 3.5 with , and \(\mathcal {C}= \mathcal {X}_{l:N}\). This transforms

into

and therefore \({\mathcal {T}}_l\) into \({\mathcal {T}}_{l-1}\). \(\square \)

Lemma 3.7

The right hand triangle in (11) commutes, i.e.

$$\begin{aligned} \mathcal {K}_k \circ \mathcal {N}= \mathcal {I}_k . \end{aligned}$$

Proof

Prepending \(\mathcal {N}\) to (13) gives

and appending \(\mathcal {I}_k\) gives

$$\begin{aligned} \mathcal {K}_k \circ \mathcal {N}= \mathcal {I}_k . \end{aligned}$$

\(\square \)

Now we will show that \(\mathcal {H}\) is continuous. We will postpone the proof of Lemma 3.8 below, which is the crucial non-bookkeeping ingredient in the proof of Lemma 3.9 below, until the end of this section. The methods used in the proof of Lemma 3.8 differ significantly from the rest in this section and make use of the concept of the modulus of continuity for measures, and results relating to it, introduced in the companion paper [24] to this one.

Lemma 3.8

Let

be the set of all \((\mu ', \mu )\) s.t.

$$\begin{aligned} {{\,\mathrm{int}\,}}_{\mathcal {A}}^{\mathcal {B}}(\mu ') = \mu _{\restriction \mathcal {A}\times \mathcal {B}} . \end{aligned}$$
(15)

The function

is continuous.

Clearly, as a function between sets, \(\mathcal {J}_{\mathcal {A},\mathcal {B}}^{\mathcal {Y}}(\mu ',\mu )\) only depends on \(\mu \). But, as we know, \({{\,\mathrm{dis}\,}}_{\mathcal {A}}^{\mathcal {B}\times \mathcal {Y}}\) is not continuous. Only when we refine the topology on the source space, which we encode by regarding \(\mathcal {J}_{\mathcal {A},\mathcal {B}}^{\mathcal {Y}}\) as a map from the above subset of a product space, does it become continuous.

Lemma 3.9

\(\mathcal {H}\) is continuous.

Proof

We will inductively define

(again down from \(N-1\) to 1) so that they will be continuous by construction (and by virtue of Lemma 3.8). Also by construction, we will have \(\mathcal {H}^k \circ \mathcal {I}' = \mathcal {N}^k\). \(\mathcal {H}\) will be \(\mathcal {H}^1\) so that \(\mathcal {H}\circ \mathcal {I}' = \mathcal {N}\).

Set \(\mathcal {H}^{N-1} := {{\,\mathrm{proj}\,}}_{N-1}\), the projection from onto the last factor, so that \(\mathcal {H}^{N-1} \circ \mathcal {I}' = \mathcal {N}^{N-1}\) by definition. Given \(\mathcal {H}^{k+1}\) define

where \({{\,\mathrm{proj}\,}}_k\) is the projection from onto the kth factor.

For this to be well-defined we need to check that for we have

I.e. for we want

The composite of the maps on the left-hand side is equal to

On the right-hand side we get by induction hypothesis

(16)

Using that we see for \(l \ge k+1\)

i.e. by induction (16) is also equal to .

As a composite of continuous maps \(\mathcal {H}^k\) is clearly continuous. (This is where we use Lemma 3.8.) As a map between sets \(\mathcal {H}^k\) is just

by induction hypothesis and definition of \(\mathcal {N}^k\). \(\square \)

3.2 Proof of Lemma 3.8

In this part we prove Lemma 3.8. Here we use several of the ideas developed in the companion paper [24]. In particular we will need [24, Lemma 4.2] which we reproduce below.

Lemma 3.10

[24, Lemma 4.2] Let . For any \(\varepsilon > 0\) there is a \(\delta > 0\) s.t. if

then

$$\begin{aligned} {\textstyle \int }\rho (y_1,y_2)^p \,\mathrm d\gamma (x_1,y_1,x_2,y_2) < \varepsilon ^p . \end{aligned}$$

For easy reference we also restate Lemma 3.8.

Lemma 3.8

Let

be the set of all \((\mu ', \mu )\) s.t.

$$\begin{aligned} {{\,\mathrm{int}\,}}_{\mathcal {A}}^{\mathcal {B}}(\mu ') = \mu _{\restriction \mathcal {A}\times \mathcal {B}} . \end{aligned}$$
(15)

The function

is continuous.

Proof of Lemma 3.8

Let . Let \(\varepsilon > 0\).

Choose \(\delta > 0\) according to Lemma 3.10 with \(\mathcal {X}= \mathcal {A}\times \mathcal {B}\), i.e. s.t. for any with and any with \({\textstyle \int }\rho (a_1,a_2)^p + \rho (b_1,b_2)^p \,\mathrm d\gamma (a_1,b_1,\_,a_2,b_2,\_) < \delta ^p\) we have \({\textstyle \int }\rho (y_1,y_2)^p \,\mathrm d\gamma (\_,y_1,\_,y_2) < \varepsilon ^p\).

Let with \(\max (\rho (\mu ,\nu ) , \rho (\mu ',\nu ')) < \min (\delta , \varepsilon )\).

This means we can find with

(17)

Let \((a,b) \mapsto f_a(b)\) and \((a,b) \mapsto g_a(b): \mathcal {A}\times \mathcal {B}\rightarrow \mathcal {Y}\) be measurable functions on whose graph \(\mu \) and \(\nu \), respectively, are concentrated. Let \({{\bar{\mu }}}:= \mathcal {J}_{\mathcal {A},\mathcal {B}}^{\mathcal {Y}}(\mu ',\mu )\), \({{\bar{\nu }}}:= \mathcal {J}_{\mathcal {A},\mathcal {B}}^{\mathcal {Y}}(\nu ', \nu )\).

As noted in the proof of Lemma 3.1 we know that for \({{\bar{\mu }}}\)-a.a. \((a,{\dot{\mu }})\) the measure \({\dot{\mu }}\) is concentrated on the graph of the function \(f_a\) (and similarly for \({{\bar{\nu }}}\)). This together with (which is a consequence of (15)) implies that

(again similarly for \({{\bar{\nu }}}\)).

From this we see that the measure defined as

is in .

We may measurably select almost-witnesses for the distances s.t. building on (17) we get

$$\begin{aligned} {\textstyle \int }\rho (a_1,a_2)^p + {\textstyle \int }\rho (b_1,b_2)^p \,\mathrm d{{\hat{\gamma }}}_{{{\hat{b}}}_1,{{\hat{b}}}_2}(b_1,b_2) \,\mathrm d\gamma '(a_1,{{\hat{b}}}_1, a_2, {{\hat{b}}}_2) < \min (\delta ^p , \varepsilon ^p) . \end{aligned}$$
(18)

Now

(19)

where is defined as

The integral over the first two summands in (19) is less than \(\min (\delta ^p , \varepsilon ^p)\) by (18). By our choice of \(\delta \) in the beginning this implies that the integral over the last summand is also less than \(\varepsilon ^p\), so that overall

$$\begin{aligned} \rho ({{\bar{\mu }}},{{\bar{\nu }}})^p < 2 \varepsilon ^p . \end{aligned}$$

As \(\varepsilon \) was arbitrary this concludes the proof. \(\square \)

4 The symmetrized causal Wasserstein distance \({\mathcal {S}}{\mathcal {C}}{\mathcal {W}}_p\)

In this section we prove that the topology induced by \({\mathcal {S}}{\mathcal {C}}{\mathcal {W}}_p\) is sandwiched between Hellwig’s \({\mathcal {W}}_p\)-information topology and the topology induced by \({\mathcal {A}} {\mathcal {W}}_p\), and is therefore, by what we have already seen in the previous section, equal to both of them. Our arguments in this section make explicit use of metrics. The reader who is only interested in the simpler version of our main theorem, Theorem 1.2, may assume that \(p = 1\) and that all metrics are bounded.

Remember that for we have

(20)
(21)
(22)

In proving this we will take a slightly roundabout route. First we will focus on the case where \(\Omega \) is the product of just two spaces, i.e. where we have only two time points. Moreover, for expositional purposes, let us for the moment assume that \(\mathcal {X}_1\) and \(\mathcal {X}_2\) are both compact. Generalizing from this setting will not be very hard.

In the compact, two-time-point case we will show equality of the two topologies in question by extending both to a larger (compact) space and showing equality of the topologies on that larger space.

In more detail:

When there are only two time points, Hellwig’s \({\mathcal {W}}_p\)-information topology and the topology induced by \({\mathcal {A}} {\mathcal {W}}_p\) trivially coincide. Both are induced by embedding \({\mathcal {P}}(\mathcal {X}_1 \times \mathcal {X}_2)\) into \({\mathcal {P}}\big (\mathcal {X}_1 \times {\mathcal {P}}(\mathcal {X}_2)\big )\) via \({{\,\mathrm{dis}\,}}_{\mathcal {X}_1}^{\mathcal {X}_2}\). The latter space carries its standard metric , which – as was already established in Theorem 1.5 in Sect. 1.8 of the introduction – is an extension of \({\mathcal {A}} {\mathcal {W}}_p\). To highlight this connection, in this section we will also refer to that metric as . As a reminder,

where is the normal Wasserstein distance (on in this case). We will find an extension of \({\mathcal {C}} {\mathcal {W}}_p\) to , which still satisfies all properties of a metric except for symmetry and which is dominated by . Symmetrizing this extension gives a metric (which we will call ). The identity function from topologized with to topologized with will then be a continuous bijection from a compact space (this is where we use compactness of \(\mathcal {X}_1\), \(\mathcal {X}_2\)) to a Hausdorff space, i.e. a homeomorphism.

The next subsection will be devoted to finding an expression for the extension of \({\mathcal {C}} {\mathcal {W}}_p\) to and proving that it satisfies all the properties mentioned above.

Remark 4.1

When \(\mathcal {X}_1\) contains no isolated points, because is the metric completion of w.r.t. \({\mathcal {A}} {\mathcal {W}}_p\) and because the above properties imply that \({\mathcal {C}} {\mathcal {W}}_p\) is (uniformly) continuous w.r.t. \({\mathcal {A}} {\mathcal {W}}_p\), we have already uniquely identified . Still, we want to find an expression that allows us to work with and in particular that allows us to prove that is a metric and not just a pseudometric, i.e. that the induced topology is in fact Hausdorff. This is exactly what we gain from assuming compact base spaces and passing to the completion: instead of having to find a lower bound for in terms of (and possibly \(\mu \)) we now just have to prove that if \(\mu \ne \nu \) then .

For definiteness we note that we do not assume compactness of any space in the following.

4.1 Extending the causal ‘distance’

So now we are working with two Polish metric spaces \(\mathcal {X}_1\), \(\mathcal {X}_2\). Remember that we denote the ‘canonical process’ on \(\Omega = \mathcal {X}_1 \times \mathcal {X}_2\) by \((X_i)_{i=1,2}\), i.e. \(X_i\) is the projection onto the ith coordinate.

To differentiate between the different roles that \(\Omega \) may play – i.e. is it the space for the left measure \(\mu \) or the right measure \(\nu \) when measuring the ‘distance’ – we will also refer to , \(\mathcal {X}_i\) by the aliases , \(\mathcal {Y}_i\) respectively. (And later , \(\mathcal {Z}_i\) as well.) Analogously, we have . (And .)

In this section we will repeatedly make use of the following construction:

Definition 4.2

Let \(\mathcal {A}\), \(\mathcal {B}\), \(\mathcal {C}\) be Polish metric spaces. Let \(\mu \in {\mathcal {P}}(\mathcal {A}\times \mathcal {B})\) and \(\nu \in {\mathcal {P}}(\mathcal {B}\times \mathcal {C})\) with \(\mu _{\restriction \mathcal {B}}= \nu _{\restriction \mathcal {B}}\). We define

as the measure given by

(23)

where \(b \mapsto \nu _b\) is a disintegration of \(\nu \) w.r.t. \(\mathcal {B}\) and similarly for \(\mu \).

We further define

Remark 4.3

If \(\mu \) is a probability on \(\mathcal A \times \mathcal B\) and \(\nu \) is a probability on \(\mathcal B \times \mathcal C\), another way of saying what is, is to state that it is a probability on \(\mathcal A \times \mathcal B \times \mathcal C\) s.t. the law of \((A, B)\) is equal to \(\mu \), the law of \((B, C)\) is equal to \(\nu \) (where per our convention \(A\) is the projection onto \(\mathcal A\), etc.), and \(A\) is conditionally independent from \(C\) given \(B\). (For the notion of conditional independence see for example [22, Definition II.43].)

Another helpful intuition comes from looking at the case where is concentrated on the graph of some measurable function \(f: \mathcal {A}\rightarrow \mathcal {B}\) and is concentrated on the graph of a measurable function \(g: \mathcal {B}\rightarrow \mathcal {C}\). is then concentrated on the graph of \(g \circ f : \mathcal {A}\rightarrow \mathcal {C}\). In some contexts \(g \circ f\) is also written as , which is where we borrowed the symbol from.
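For finitely supported measures the joining operation of Definition 4.2 is a simple conditional-independence gluing along \(\mathcal {B}\). The sketch below is our own illustration (the name and the dictionary representation are assumptions); it constructs exactly the measure characterized in Remark 4.3.

```python
from collections import defaultdict

def glue(mu, nu):
    """Join a finitely supported mu on A x B with nu on B x C along B
    (equal B-marginals assumed): the result is the law on A x B x C under
    which (A, B) ~ mu, (B, C) ~ nu and A, C are conditionally independent
    given B."""
    margB = defaultdict(float)
    for (b, c), w in nu.items():
        margB[b] += w
    out = defaultdict(float)
    for (a, b), w in mu.items():
        for (b2, c), v in nu.items():
            if b2 == b:
                out[(a, b, c)] += w * v / margB[b]   # mu(a, b) * nu(c | b)
    return dict(out)
```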

Remark 4.4

We will often encounter the situation that one of the factors \(\mathcal {A}\), \(\mathcal {B}\) or \(\mathcal {C}\) in Definition 4.2 is itself a product of spaces and the individual factors may not always be so nicely sorted. We will rely on naming in the subscript the space(s) along which to join the measures \(\mu \) and \(\nu \). For example if and we might write

to refer to the measure that we get when in (23) we use \((b_1,b_2) \in \mathcal {B}_1 \times \mathcal {B}_2\) as the middle variable b. We will not be systematic about the order of the factors in the resulting product space on which e.g. is a measure, again relying on naming our spaces for disambiguation.

For future reference we paraphrase the definition of a causal transport plan given in (3) in the introduction.

Lemma 4.5

Let \(\mu \) be a measure on and \(\nu \) be a measure on . is a causal transference plan from \(\mu \) to \(\nu \) iff under \(\gamma \)

$$\begin{aligned} X_2\text { and }Y_1\text { are conditionally independent given } X_1. \end{aligned}$$

Proof

One way of formulating conditional independence is as in (3), see for example [22, Definition II.43, Theorem II.45]. \(\square \)

In other words, is a causal transference plan iff .
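For finitely supported couplings, the criterion of Lemma 4.5 can be checked directly. The following Python sketch (with our own encoding of a coupling on \(\mathcal {X}_1 \times \mathcal {X}_2 \times \mathcal {Y}_1 \times \mathcal {Y}_2\) as a dictionary over quadruples) tests whether \(X_2\) and \(Y_1\) are conditionally independent given \(X_1\).

from collections import defaultdict

def is_causal(gamma, tol=1e-9):
    """Check the criterion of Lemma 4.5 for a finitely supported coupling gamma,
    given as a dict {(x1, x2, y1, y2): mass}: gamma is causal iff X2 and Y1 are
    conditionally independent given X1."""
    p_x1, p_x1x2, p_x1y1, p_all = (defaultdict(float) for _ in range(4))
    for (x1, x2, y1, _y2), m in gamma.items():
        p_x1[x1] += m
        p_x1x2[(x1, x2)] += m
        p_x1y1[(x1, y1)] += m
        p_all[(x1, x2, y1)] += m
    for (x1, x2) in p_x1x2:
        for (z1, y1) in p_x1y1:
            if z1 != x1:
                continue
            # conditional independence: P(x2, y1 | x1) = P(x2 | x1) * P(y1 | x1)
            if abs(p_all[(x1, x2, y1)] * p_x1[x1]
                   - p_x1x2[(x1, x2)] * p_x1y1[(x1, y1)]) > tol:
                return False
    return True

# A product-type coupling is causal:
print(is_causal({(0, 0, 0, 0): 0.25, (0, 0, 1, 1): 0.25,
                 (1, 1, 0, 0): 0.25, (1, 1, 1, 1): 0.25}))  # True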

We start by reexpressing \({\mathcal {C}} {\mathcal {W}}_p\) in different ways until we find one which also makes sense in .

Let and . Then

where

This is true because, on the one hand, a \(\gamma \in C_{1}\) is clearly causal by Lemma 4.5 and the alternative characterization of . On the other hand, given any causal , again by Lemma 4.5, , and we may define . Now \(\gamma _{\restriction \mathcal {X}_1,\mathcal {X}_2,\mathcal {Y}_1} = \gamma '_{\restriction \mathcal {X}_1,\mathcal {X}_2,\mathcal {Y}_1}\) and \(\gamma _{\restriction \mathcal {X}_2,\mathcal {Y}_1,\mathcal {Y}_2} = \gamma '_{\restriction \mathcal {X}_2,\mathcal {Y}_1,\mathcal {Y}_2}\), so in particular

$$\begin{aligned}&\int \rho (x_1,y_1)^p + \rho (x_2,y_2)^p \,\mathrm d\gamma (x_1,x_2,y_1,y_2) \\&\quad =\int \rho (x_1,y_1)^p + \rho (x_2,y_2)^p \,\mathrm d\gamma '(x_1,x_2,y_1,y_2) . \end{aligned}$$

We may name the different building blocks of \(\gamma \in C_{1}\) to get

with

i.e. there is a bijection between \(C_{1}\) and \(C_{2}\) given by sending \(\gamma ' \in C_{1}\) to \((\gamma , \beta ) \in C_{2}\) where \(\gamma := \gamma '_{\restriction \mathcal {X}_1,\mathcal {Y}_1}\), \(\beta := \gamma '_{\restriction \mathcal {X}_2,\mathcal {Y}_1,\mathcal {Y}_2}\), and, in the other direction, by sending \((\gamma , \beta ) \in C_{2}\) to .

We can apply the bijection to \(\beta \). Translating the conditions on \((\gamma ,\beta ) \in C_{2}\) to conditions on \((\gamma , {{\,\mathrm{dis}\,}}_{\mathcal {Y}_1} (\beta ))\) we arrive at

where

Let \((\gamma , \beta ) \in C_{3}\) and let \((y_1, \beta ') \mapsto {{\tilde{\beta }}} ' _{y_1, \beta '}\) be a measurable mapping with for \(\beta \)-a.a. \((y_1,\beta ')\). Then we have that also \((\gamma ,{{\tilde{\beta }}}) \in C_{3}\), where is defined by

$$\begin{aligned} {{\tilde{\beta }}} := f \mapsto \int f(y_1,{{\tilde{\beta }}}'_{y_1,\beta '}) \,\mathrm d\beta (y_1,\beta '). \end{aligned}$$

By employing a \(\beta \)-a.e. measurable selector this implies that

We need

Lemma 4.6

If and then the only measure with \(\eta _{\restriction \mathcal A \times \mathcal B} = \kappa \) and \(\eta _{\restriction \mathcal B \times \mathcal C} = \lambda \) is .

Proof

If \(\eta \) satisfies the properties above and \(b \mapsto \kappa _b\), \(b \mapsto \lambda _b\) are (classical) disintegrations of \(\kappa \), \(\lambda \) w.r.t. \(\mathcal {B}\), then a (classical) disintegration \(b \mapsto \eta _b\) of \(\eta \) w.r.t. \(\mathcal {B}\) has to satisfy \({\eta _b}_{\restriction \mathcal {A}}= \kappa _b\) and \({\eta _b}_{\restriction \mathcal {C}}= \lambda _b\) a.s. As \(\lambda _b\) is a Dirac measure a.s. this forces \(\eta _b\) to be \(\kappa _b \otimes \lambda _b\) almost surely. \(\square \)

This implies that for \((\gamma , \beta ) \in C_{3}\) the distribution of

$$\begin{aligned} (y_1,\beta ') \mapsto (y_1, {\beta '_{\restriction \mathcal {X}_2}}, {\beta '_{\restriction \mathcal {Y}_2}}) \end{aligned}$$
(24)

under \(\beta \) is already determined by \(\gamma \), i.e. because the distribution of \((y_1,\beta ') \mapsto (y_1, {\beta '_{\restriction \mathcal {X}_2}})\) is and the distribution of \((y_1,\beta ') \mapsto (y_1, {\beta '_{\restriction \mathcal {Y}_2}})\) is \({{\,\mathrm{dis}\,}}_{\mathcal {Y}_1} (\nu )\), the distribution of (24) under \(\beta \) must be equal to

This means that we may get rid of \(\beta \):

For the final step we need another lemma:

Lemma 4.7

Let and . Let \(\smash {{{\hat{C}}}}\) denote the projection onto . Then

is equal to the distribution of

Proof

Let \(a \mapsto \lambda _a\) be a version of the (classical) disintegration of \(\lambda \) w.r.t. \(\mathcal {A}\) and let \(b \mapsto \beta _b\) be a disintegration of \(\beta \) w.r.t. \(\mathcal {B}\).

As one easily checks, a version of the (classical) disintegration of w.r.t. \(\mathcal {A}\) is given by \(a \mapsto \int \beta _b \,\mathrm d\lambda _a(b)\), so that is equal to

By the same argument a version of the disintegration of w.r.t. \(\mathcal {A}\) is given by \(h := a \mapsto \int {{\,\mathrm{dis}\,}}_{\mathcal {B}}(\beta )_b \,\mathrm d\lambda _a(b)\), where \(b \mapsto {{\,\mathrm{dis}\,}}_{\mathcal {B}}(\beta )_b\) is a disintegration of \({{\,\mathrm{dis}\,}}_{\mathcal {B}}(\beta )\) w.r.t. \(\mathcal {B}\). But such a disintegration is given by \(b \mapsto \delta _{\beta _b}\) (where \(\delta _{\beta _b}\) is the Dirac measure at \(\beta _b\)). So \(h = a \mapsto \int \delta _{\beta _b} \,\mathrm d\lambda _a(b)\). This means (a version of) is given by

so that the distribution of under \(\eta \) is also given by

\(\square \)

Using this lemma with \(\mathcal {A}= \mathcal {Y}_1\), \(\mathcal {B}= \mathcal {X}_1\), \(\mathcal {C}= \mathcal {X}_2\), \(\lambda = \gamma \), \(\beta = \mu \) and writing \({\smash {{{\hat{X}}}_2}}\), \({\smash {{{\hat{Y}}}_2}}\) for the projections onto , respectively, we find:

where .

By Lemma 4.6 the function is a bijection, so we may as well write

Finally, under any we know that \({\smash {{{\hat{Y}}}_2}}\) is almost surely equal to a function of \(Y_1\), so that the completions of the sigma-algebras generated by \(Y_1\) and \(\smash {\vec Y}:= (Y_1,{\smash {{{\hat{Y}}}_2}})\) respectively are equal. This means that a.s. and we arrive at our final expression for :

Now this expression is trivial to generalize to and , i.e. for such \(\mu \), \(\nu \) we set

(25)

To summarize our discussion up to this point:

Lemma 4.8

The function

as defined in (25) is indeed an extension of

as defined in (20) (when is embedded into via \({{\,\mathrm{dis}\,}}_{\mathcal {X}_1}\)).

Next, as promised, we show

Lemma 4.9

is bounded by , i.e.

Proof

By the conditional version of Jensen’s inequality applied to the convex function we have

\(\square \)

Remark 4.10

For the reader who may be sceptical of whether Jensen’s inequality holds in this rather unusual setting, where we have a convex function

and conditional expectations on spaces of measures, we remark that for the Wasserstein distance in particular this is very easy to check. The proof simply integrates transport plans between \({\smash {{{\hat{X}}}_2}}\) and \({\smash {{{\hat{Y}}}_2}}\) w.r.t. the distribution of these conditioned on \(\smash {\vec Y}\) (in this case) to obtain transport plans between and .
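For concreteness, the unconditional version of this averaging argument reads as follows (a sketch; conditioning on \(\smash {\vec Y}\) replaces the fixed mixing measure \(\kappa \) used below by a conditional distribution). If \((a,b)\mapsto \pi _{a,b}\) is a measurable selection of optimal transport plans, then \(\int \pi _{a,b}\,\mathrm d\kappa (a,b)\) is itself a coupling of the two mixtures, so that

$$\begin{aligned} {\mathcal {W}}_p^p\Big (\int a \,\mathrm d\kappa (a,b), \int b \,\mathrm d\kappa (a,b)\Big ) \le \int \int \rho (x_2,y_2)^p \,\mathrm d\pi _{a,b}(x_2,y_2) \,\mathrm d\kappa (a,b) = \int {\mathcal {W}}_p^p(a,b) \,\mathrm d\kappa (a,b). \end{aligned}$$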

Lemma 4.11

Let . Then

Proof

Using our naming convention we have

We denote the projections onto , , by \({\smash {{{\hat{X}}}_2}}\), \({\smash {{{\hat{Y}}}_2}}\), \({\smash {{{\hat{Z}}}_2}}\) respectively. \(\smash {\vec Y}= (Y_1,{\smash {{{\hat{Y}}}_2}})\), \(\smash {\vec Z}:= (Z_1, {\smash {{{\hat{Z}}}_2}})\).

Let and . In the following let \({\mathbb {E}}\) refer to (conditional) expectation w.r.t. , and let refer to the \(L_p\)-norm w.r.t. \(\kappa \).

Combining the triangle inequalities for \(\rho \), and the we get

(26)
(27)

By the conditional Jensen inequality

and therefore

By construction, \(({\smash {{{\hat{X}}}_2}},{\smash {{{\hat{Y}}}_2}})\) is conditionally independent from \(\smash {\vec Z}\) given \(\smash {\vec Y}\), so that (this basic fact about conditional independence can be found for example as Theorem 45 in [22]). Combining this with (27) gives

(28)

Putting together (26) and (28) with the triangle inequality for \(\ell _p\) we get

\(\square \)

Lemma 4.12

is uniformly continuous w.r.t. on .

Proof

Let . We repeatedly use Lemma 4.11:

therefore

Switching the roles of \((\mu ,\nu )\) and \((\mu ',\nu ')\) implies

\(\square \)

Lemma 4.13

The infimum in (25) is attained.

Proof

This is an application of [9, Theorem 1.2].

To keep the paper self-contained, and because it is a nice application of the nested distance, we also sketch the argument here. We know that is compact. The problem is that is not (lower semi-)continuous. But we may switch to a topology which is better adapted to the problem at hand, namely the two-timepoint \({\mathcal {A}} {\mathcal {W}}_p\)-topology. In this case the space for the first timepoint is and that for the second is . In effect this means that instead of we are now looking at . The function that we are optimizing over can be written as

where

C is a continuous function and so is \({{\hat{C}}}\). Now is not compact, but

is. So we can find a minimizer \(\gamma '\) of \({{\hat{C}}}\) in this set. To return to , or more precisely , we can send \(\gamma '\) to the distribution \(\gamma ''\) of . Because C is continuous and convex in its last argument and by (the conditional version of) Jensen’s inequality (which could again be proved ‘by hand’ here) \({{\hat{C}}}(\gamma '') \le {{\hat{C}}}(\gamma ')\). is the sought-after minimizer of (25). \(\square \)

Lemma 4.14

Let . Then implies \(\mu = \nu \).

Proof

Call

To have labels for our spaces, see \(\mu , \nu \) as

Let s.t. .

Let s.t. .

All the following considerations happen under . Clearly, \(Z_1 = Y_1 = X_1\) a.s.

Moreover, because , the random variables \({\smash {{{\hat{Z}}}_2}}, {\smash {{{\hat{Y}}}_2}}, {\smash {{{\hat{X}}}_2}}\) form a martingale w.r.t. the filtration generated by \(\smash {\vec Z}, \smash {\vec Y}, \vec X\). The distribution of \({\smash {{{\hat{Z}}}_2}}\) is equal to the distribution of \({\smash {{{\hat{X}}}_2}}\). Both of these statements are also true if we integrate some bounded measurable function w.r.t. our random variables, i.e. for any bounded measurable \(f : \mathcal {X}_2 \rightarrow {{\mathbb {R}}}\) we have that \(\int f \,\mathrm d{\smash {{{\hat{Z}}}_2}}, \int f \,\mathrm d{\smash {{{\hat{Y}}}_2}}, \int f \,\mathrm d{\smash {{{\hat{X}}}_2}}\) is a martingale and that the distribution of \(\int f \,\mathrm d{\smash {{{\hat{Z}}}_2}}\) is equal to the distribution of \(\int f \,\mathrm d{\smash {{{\hat{X}}}_2}}\). But this means that we must have \(\int f \,\mathrm d{\smash {{{\hat{Z}}}_2}}= \int f \,\mathrm d{\smash {{{\hat{Y}}}_2}}= \int f \,\mathrm d{\smash {{{\hat{X}}}_2}}\) a.s. (Lemma 4.15 below). As this is true for all f from a countable generator of the sigma-algebra on \(\mathcal {X}_2\), we have \({\smash {{{\hat{Z}}}_2}}= {\smash {{{\hat{Y}}}_2}}= {\smash {{{\hat{X}}}_2}}\) a.s. \(\square \)

Lemma 4.15

Let \(X_1, X_2, X_3\) be a bounded martingale over \({{\mathbb {R}}}\). If the distribution of \(X_1\) is equal to the distribution of \(X_3\) then \(X_1 = X_2 = X_3\) a.s.

Proof

This is a consequence of the strict version of Jensen’s inequality applied to any everywhere strictly convex function (take for example \(x \mapsto x^2\)). \(\square \)
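For completeness, the strict-Jensen argument can be spelled out as follows (writing \(({\mathcal {F}}_t)_{t=1,2,3}\) for the filtration of the martingale): by the martingale property and the conditional Jensen inequality for \(x \mapsto x^2\),

$$\begin{aligned} {\mathbb {E}}[X_1^2] = {\mathbb {E}}\big [{\mathbb {E}}[X_2\mid {\mathcal {F}}_1]^2\big ] \le {\mathbb {E}}[X_2^2] = {\mathbb {E}}\big [{\mathbb {E}}[X_3\mid {\mathcal {F}}_2]^2\big ] \le {\mathbb {E}}[X_3^2] = {\mathbb {E}}[X_1^2], \end{aligned}$$

so both inequalities are in fact equalities. Equality in the conditional Jensen inequality for the strictly convex function \(x\mapsto x^2\) forces \(X_3 = {\mathbb {E}}[X_3\mid {\mathcal {F}}_2] = X_2\) a.s., and likewise \(X_2 = X_1\) a.s.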

Remark 4.16

The reason we took the detour of turning our probability-measure-valued martingale into a family of martingales on \({{\mathbb {R}}}\) and arguing on these is that this way we avoid having to exhibit a continuous, everywhere strictly convex function on .

As a reminder:

Definition 4.17

For ,

Theorem 4.18

is a metric on satisfying

Proof

This follows from Lemmas 4.11, 4.14 and 4.9. \(\square \)

Remark 4.19

As outlined at the beginning of this section, and thanks to Theorem 4.18, we now know enough to conclude that the topology induced by \({\mathcal {S}} {\mathcal {C}} {\mathcal {W}}_p\) is equal to the topology induced by \({\mathcal {A}} {\mathcal {W}}_p\), provided both \(\mathcal {X}_1\) and \(\mathcal {X}_2\) are compact. The non-compact case is not much harder, and we now proceed to settle it. For this we need the following lemma.

Lemma 4.20

The map

is a contraction when we equip the source space with and the target space with . More specifically for

(29)

Proof

We prove the second statement. Let , . Given and \(\varepsilon > 0\) the task is to find s.t.

(30)

We take inspiration from the discussion at the beginning of this section. Let be a measurable selector satisfying

The obvious choice for \(\gamma '\), namely will not work because in general it gets the relationship between \(X_1\) and \(X_2\) wrong, i.e. its first marginal may not be \({{\,\mathrm{int}\,}}_{\mathcal {X}_1} (\mu )\). Instead we again define and and set .

Clearly, if we can actually define \(\gamma '\) as announced, then (30) will hold, because then

It remains to check that \(\gamma _L\) and \(\gamma _R\) can actually be composed, i.e. that \((X_2,Y_1)\) has the same distribution under \(\gamma _L\) and \(\gamma _R\).

The step in the middle has its own Lemma 4.21 below. \(\square \)

Lemma 4.21

Let \({{\mathbb {P}}}\) be a probability on , for Polish spaces \(\mathcal {X}, \mathcal {Y}\). Let \(h : \mathcal {X}\times \mathcal {Y}\rightarrow {{\mathbb {R}}}\) be a measurable function. Then

where \({\mathbb {E}}\) without superscript is the (conditional) expectation w.r.t. \({{\mathbb {P}}}\) and \({{\hat{X}}}\) is the projection onto .

Note that on both sides X is introduced by the expectation operator which carries a superscript, while on both sides Y may be interpreted as coming from the outermost context. On the right-hand side Y may also be seen as having been introduced by the outermost conditional expectation operator. (As this operator conditions on Y, this is the same thing.)

Proof

Both sides are clearly Y-measurable. By a monotone class argument it suffices to consider \(h(x,y) = f(x) g_1(y)\); for such h we prove that multiplying both sides by \(g_2(Y)\) and taking expectations gives the same result. By definition of the conditional expectation

Applying the continuous linear function this gives

Again by the definition of the conditional expectation:

where for the third equality we plugged in the previous equation. \(\square \)

Alternative proof of Lemma 4.20 when \(\mathcal {X}_1\) has no isolated points. When the space \(\mathcal {X}_1\) has no isolated points one can show that the space is dense in . This allows for a shorter proof of Lemma 4.20:

By the original definition (20) of \({\mathcal {C}} {\mathcal {W}}_p\) on the space the inequality (29) holds on . Both \({\mathcal {C}} {\mathcal {W}}_p\) and are uniformly continuous on w.r.t. some product metric of with itself. is dense in , and therefore is dense in . This implies that (29) holds on all of . \(\square \)

Theorem 4.22

The topology induced by on is equal to the topology induced by on that space.

Proof

As both topologies are metric and therefore first-countable we may argue on sequences. Let \((\mu _n)_n\) be a sequence in . As , if \((\mu _n)_n\) converges to \(\mu \) w.r.t. it also converges to \(\mu \) w.r.t. .

Now assume that a sequence \((\mu _n)_n\) in converges to \(\mu \) w.r.t. . We will show that every subsequence of \((\mu _n)_n\) has a subsequence which converges to \(\mu \) w.r.t. . Note that convergence of \((\mu _n)_n\) w.r.t. implies that the set is relatively compact w.r.t. the topology induced by . As \({{\,\mathrm{int}\,}}_{\mathcal {X}_1}\) is continuous as a map from with the topology induced by to with the topology induced by (Lemma 4.20), we have that is also relatively compact. By Lemma 1.7/[24, Lemma 3.3] this implies that K is relatively compact in with the topology induced by . Now let \((\mu _{n_k})_k\) be some subsequence of \((\mu _n)_n\). As K is relatively compact we can find a subsequence \((\mu _{n_{k_j}})_j\) of \((\mu _{n_k})_k\) which converges w.r.t. to some . As this sequence also converges to \(\mu '\) w.r.t. . But \((\mu _{n_{k_j}})_j\) also converges to \(\mu \) w.r.t. . Because the topology induced by is Hausdorff (Lemma 4.14), we must have \(\mu ' = \mu \), i.e. \((\mu _{n_{k_j}})_j\) converges to \(\mu \) w.r.t. . \(\square \)

Now we return to the general case of N time-points.

Theorem 4.23

The topology induced by \({\mathcal {S}} {\mathcal {C}} {\mathcal {W}}_p\) on is equal to Hellwig’s \({\mathcal {W}}_p\)-information topology and to the topology induced by \({\mathcal {A}} {\mathcal {W}}_p\).

Proof

As every bicausal transport plan between \(\mu \) and \(\nu \) can be interpreted as a causal transport plan from \(\mu \) to \(\nu \) and also as a causal transport plan from \(\nu \) to \(\mu \) we have that . This means that the identity from with the topology induced by \({\mathcal {A}} {\mathcal {W}}_p\) to with the topology induced by \({\mathcal {S}} {\mathcal {C}} {\mathcal {W}}_p\) is continuous. For the other direction we show that the identity from with the topology induced by \({\mathcal {S}} {\mathcal {C}} {\mathcal {W}}_p\) to with the \({\mathcal {W}}_p\)-information topology is continuous, i.e. we show that each of the maps

is continuous when gets the topology induced by \({\mathcal {S}} {\mathcal {C}} {\mathcal {W}}_p\).

If and is causal, then, in particular, \(\gamma \) is ‘causal at the timestep from t to \(t+1\)’, i.e. \(\gamma \) is causal when regarded as a coupling between . This means that if we define \({\mathcal {S}} {\mathcal {C}} {\mathcal {W}}_p'\) like \({\mathcal {S}} {\mathcal {C}} {\mathcal {W}}_p\), but only require causality based on the decomposition of as , then , i.e. the identity from with the topology induced by \({\mathcal {S}} {\mathcal {C}} {\mathcal {W}}_p\) to with the topology induced by \({\mathcal {S}} {\mathcal {C}} {\mathcal {W}}_p'\) is continuous. By Theorem 4.22 the map

is continuous when we equip with the topology induced by \({\mathcal {S}} {\mathcal {C}} {\mathcal {W}}_p'\). Now \(\mathcal {I}_t\) is continuous as a composite of continuous maps. \(\square \)

5 Aldous’ extended weak convergence

In this section we show that Aldous’ extended \({\mathcal {W}}_p\)-/weak topology is equal to Hellwig’s (\({\mathcal {W}}_p\)-)information topology.

We recall and paraphrase here the definition, already given in the introduction, of Aldous’ topology.

Definition 5.1

Given let \(\mu _{(x_{i})_{i=1}^{j}}\) be the value of a (classical) disintegration of \(\mu \) w.r.t. the first j coordinates at \((x_{i})_{i=1}^{j}\). (By convention \(\mu _{(x_{i})_{i=1}^{0}} = \mu \)). Define

The extended \({\mathcal {W}}_p\)-/weak topology on is the initial topology w.r.t. \({\mathcal {E}}\).

Remark 5.2

Reasonable people may disagree about whether the most faithful/useful transcription of Aldous’ definition should include the factors \(j=0\) and \(j=N\) in the above product of spaces. When including \(j=N\), as we did, one has to interpret \(\delta _{(x_{i})_{i=1}^{N}} \otimes \mu _{(x_{i})_{i=1}^{N}}\) simply as \(\delta _{(x_{i})_{i=1}^{N}}\). We leave it as an exercise to the reader to check that either or both may be dropped in the definition of \({\mathcal {E}}\) without affecting the resulting topology on .
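To illustrate Definition 5.1 in the finitely supported case, the following Python sketch computes, for a path in the support of \(\mu \), the conditional laws \(\mu _{(x_{i})_{i=1}^{j}}\) of the remaining coordinates given the first j coordinates (the Dirac component \(\delta _{(x_{i})_{i=1}^{j}}\) is determined by the path itself and is dropped); the dictionary encoding is our own.

def prediction_process(mu, path):
    """For a finitely supported mu on X^N (a dict {path_tuple: mass}) and a
    path in its support, return for j = 0, ..., N the conditional law of the
    remaining coordinates given the first j coordinates of `path`."""
    N = len(path)
    laws = []
    for j in range(N + 1):
        prefix = path[:j]
        cond = {p[j:]: m for p, m in mu.items() if p[:j] == prefix}
        total = sum(cond.values())
        laws.append({rest: m / total for rest, m in cond.items()})
    return laws

mu = {(0, 0): 0.25, (0, 1): 0.25, (1, 1): 0.5}
print(prediction_process(mu, (0, 1)))
# j = 0: mu itself; j = 1: law of the second coordinate given X_1 = 0;
# j = 2: the trivial law on the empty remainder.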

Theorem 5.3

The (\({\mathcal {W}}_p\)-)extended weak topology is equal to the (\({\mathcal {W}}_p\)-)information topology.

Proof

We construct continuous maps

such that

$$\begin{aligned} {\mathcal {A}}'_{k} \circ {\mathcal {E}}&= \mathcal {I}_k \\ {\mathcal {A}}\circ \mathcal {I}&= {\mathcal {E}}. \end{aligned}$$

The first equality above implies that the identity on is continuous from the extended weak topology to the information topology, the second implies that it is continuous in the other direction.

\({\mathcal {A}}'_{k}\) is very simple. We just need to select the right factors and then discard the unnecessary \(\delta _{(x_{i})_{i=1}^{k}}\) part of the measure component. Formally

which is clearly continuous.

We construct \({\mathcal {A}}\) recursively, by constructing as a composite of continuous maps

satisfying

(31)

. We need the helper functions

Given \({\mathcal {A}}^m\) satisfying the induction hypothesis we set

where \(s_{m+1}\) is the obvious permutation of the coordinates to get the factors into the right order. \({\mathcal {A}}^{m+1}\) is continuous because by [24, Lemma 4.1] is continuous when one of the arguments is an element of some . That (31) still holds for \(m+1\) is a straightforward calculation. This way we get to \({\mathcal {A}}^{N-1}\). Finally, set

where

\(\square \)

6 Bounded vs unbounded metrics

Because we will need it in the next section we interject here a proof of Lemma 1.4, which we restate below.

Lemma 1.4

Convergence in any of the topologies of Theorem 1.3 is equivalent to convergence in any of the topologies of Theorem 1.2 (where for building and , \(\rho _\mathcal {X}\) is replaced by a bounded compatible complete metric e.g. \(\min (1,\rho _\mathcal {X})\)) plus convergence of pth moments on \(\Omega \) w.r.t. (the original) \(\rho _\Omega \).

Proof of Lemma 1.4

We provide the proof only for Hellwig’s topology, i.e. (3) of Theorems 1.3 and 1.2, respectively. As we have already seen in the previous sections, the topologies (2)–(4) are equivalent, and the result therefore carries over to them. The (\({\mathcal {W}}_p\)-)optimal stopping topology, (5), is treated below. It is clear that convergence w.r.t. the \({\mathcal {W}}_p\)-information topology implies convergence in Hellwig’s information topology plus convergence of pth moments. For the reverse implication, let \(1\le t\le N-1\), and denote by \({\mathcal {A}}:=\overline{{\mathcal {X}}}^t\) the first t and by \({\mathcal {B}}:=\overline{{\mathcal {X}}}_{t+1}\) the last \(N-t\) coordinates. Now assume that \((\mu _n)_n\) converges to \(\mu \) in Hellwig’s information topology and that the pth moments converge. The classical (not adapted) version of the very lemma we prove here implies that \(\mu _n\rightarrow \mu \) in \({\mathcal {W}}_p\); in particular is relatively compact. Lemma 1.7 (or really [24, Lemma 3.3]/[9, Lemma 2.6]) therefore guarantees that is relatively compact.

Every subsequence of \(({{\,\mathrm{dis}\,}}_{\mathcal {A}}^{\mathcal {B}}(\mu _n))_n\) therefore has a subsequence \(({{\,\mathrm{dis}\,}}_{\mathcal {A}}^{\mathcal {B}}(\mu _{n_k}))_k\) which converges w.r.t. the topology on (i.e. the one coming from nested Wasserstein metrics) to some . Because convergence in is stronger than convergence in (i.e. in the nested weak sense) we must also have \({{\,\mathrm{dis}\,}}_{\mathcal {A}}^{\mathcal {B}}(\mu _{n_k}) \overset{k}{\rightarrow } \mu '\) in . But also, by assumption, \({{\,\mathrm{dis}\,}}_{\mathcal {A}}^{\mathcal {B}}(\mu _{n_k}) \overset{k}{\rightarrow } {{\,\mathrm{dis}\,}}_{\mathcal {A}}^{\mathcal {B}}(\mu )\) in and therefore \(\mu ' = {{\,\mathrm{dis}\,}}_{\mathcal {A}}^{\mathcal {B}}(\mu )\). \(\square \)

7 Optimal stopping

In this section we investigate the relation between the (\({\mathcal {W}}_p\)-)optimal stopping topology and the adapted Wasserstein topology. Lemma 7.1 states that the topology induced by \({\mathcal {A}} {\mathcal {W}}_p\) ((1) of Theorem 1.3) is finer than the \({\mathcal {W}}_p\)-optimal stopping topology. Lemma 7.5 states that the \({\mathcal {W}}_p\)-optimal stopping topology is finer than the \({\mathcal {W}}_p\)-information topology ((3) of Theorem 1.3). This will finish the proof of Theorem 1.3.

Recall that

for \(L=(L_t)_{t=0}^N\in AC_p(\Omega )\).
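For a finitely supported law, \(v^L(\mu )\) (the supremum of \({\mathbb {E}}_\mu [L_\tau ]\) over stopping times \(\tau \), as recalled above) can be computed by backward induction over the tree of path prefixes (the Snell envelope). The following Python sketch assumes \(\mu \) is encoded as a dictionary over paths and \(L\) as a function of \((t, \text {prefix})\) depending only on the first t coordinates; this encoding is ours, for illustration only.

def stopping_value(mu, L):
    """Compute v^L(mu) = sup over stopping times of E_mu[L_tau] by backward
    induction (Snell envelope), for mu a dict {path_tuple: mass} on X^N and
    L(t, prefix) the reward for stopping at time t."""
    N = len(next(iter(mu)))

    def mass(prefix):
        return sum(m for path, m in mu.items() if path[:len(prefix)] == prefix)

    def snell(prefix):
        t = len(prefix)
        if t == N:
            return L(N, prefix)
        successors = {path[t] for path in mu if path[:t] == prefix}
        # conditional expectation of the continuation value given the prefix
        cont = sum(mass(prefix + (x,)) * snell(prefix + (x,))
                   for x in successors) / mass(prefix)
        return max(L(t, prefix), cont)

    return snell(())

# Two time steps; reward 1 for stopping while the current coordinate equals 1.
mu = {(0, 1): 0.5, (1, 0): 0.5}
L = lambda t, prefix: float(t >= 1 and prefix[-1] == 1)
print(stopping_value(mu, L))  # 1.0: stop at time 1 if X_1 = 1, otherwise at time 2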

Lemma 7.1

Let \(L\in AC_p(\Omega )\). Then \(\mu \mapsto v^L(\mu )\) is continuous w.r.t. \({\mathcal {A}} {\mathcal {W}}_p\). In fact, one has

(32)

for every .

Proof

Let and assume that \(v^L(\mu )\le v^L(\nu )\). Moreover, let \(\pi \in {{\,\mathrm{Cpl}\,}}_{bc}(\mu ,\nu )\) and \(\varepsilon >0\) be arbitrary, and fix a stopping time \(\tau \) satisfying . For \(u\in [0,1]\) define

$$\begin{aligned} \sigma (X,u)&:=\inf \{ t\in \{0,\cdots ,T\}: \pi (\tau (Y)\le t| X)\ge u\} \\&= \inf \{ t\in \{0,\cdots ,T\} : \pi (\tau (Y)\le t| X_1,\dots ,X_t)\ge u\}, \end{aligned}$$

where the equality holds by the properties of stopping times and since \(\pi \) is causal. We then have that

As further \(\sigma (\cdot ,u)\) is a stopping time for every fixed \(u\in [0,1]\) one has and therefore

Changing the roles of \(\mu \) and \(\nu \) and using that \(\varepsilon >0\) and \(\pi \in {{\,\mathrm{Cpl}\,}}_{bc}(\mu ,\nu )\) were arbitrary yields (32).

Now assume that \({\mathcal {A}} {\mathcal {W}}_p(\mu _n,\mu ) \rightarrow 0\) and that is less than 1/n away from attaining the infimum \({\mathcal {A}} {\mathcal {W}}_p(\mu _n,\mu )\). Then \({\mathcal {W}}_p(\pi _n, \pi ) \rightarrow 0\), where is the identity coupling of \(\mu \). (A coupling between \(\pi _n\) and \(\pi \) is given by .) Because \((x,y) \mapsto \max _{0\le t\le N} |L_t(x)-L_t(y)|\) is a continuous function with growth of order at most p, we get that

Together with (32) this implies that \(v^L\) is continuous w.r.t. \({\mathcal {A}} {\mathcal {W}}_p\). \(\square \)
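The randomized stopping time \(\sigma (X,u)\) in the proof above is just the generalized inverse (quantile transform) of the conditional distribution function \(t \mapsto \pi (\tau (Y)\le t\mid X)\). Here is a small Python sketch of this quantile construction, with our own list encoding of the conditional distribution function.

def sigma(cdf_given_x, u):
    """Generalized inverse as in the proof of Lemma 7.1:
    cdf_given_x[t] = pi(tau(Y) <= t | X = x) for t = 0, ..., T (nondecreasing,
    ending at 1), and sigma(x, u) = inf{t : cdf_given_x[t] >= u}."""
    for t, p in enumerate(cdf_given_x):
        if p >= u:
            return t
    return len(cdf_given_x) - 1  # guard against rounding when u is close to 1

# If U is uniform on [0, 1], then sigma(x, U) = t with probability
# pi(tau(Y) = t | X = x), so the randomized time mimics tau on the Y-side.
print([sigma([0.2, 0.5, 1.0], u) for u in (0.1, 0.3, 0.9)])  # [0, 1, 2]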

Remark 7.2

The above proof reveals that if \(L_t\) is Lipschitz with constant \(c>0\) for every t, then \(|v^L(\mu )-v^L(\nu )|\le c\, \mathcal {SCW}_1(\mu ,\nu )\).

In order to show that the optimal stopping topology is finer than the \({\mathcal {W}}_p\)-information topology, we need to make a few preparations.

Lemma 7.3

Let \({\mathcal {A}}\) be a Polish space. Then the family

(33)

is convergence determining for the weak topology on , that is, a sequence of probability measures \((\mu _n)_n\) in converges weakly to a probability measure if and only if \(\int F\,\mathrm d\mu _n\rightarrow \int F\,\mathrm d\mu \) for all F in (33).

This follows from the Stone-Weierstrass theorem in the case of compact \({\mathcal {A}}\) and readily extends to general Polish spaces, e.g. via Stone-Čech compactification.

Lemma 7.4

Let \({\mathcal {A}}\) be a Polish space. The family of functions

(34)

is convergence determining for the weak topology on .

Proof

Let L, G, and \((h_i)_{i\le L}\) be as in (33). Moreover, let \(m\in {\mathbb {R}}\) be such that \(|h_i|\le m\) for all \(1\le i\le L\) and define \(I:=[-m,m]^L\). Then \(I\subset {\mathbb {R}}^L\) is compact and satisfies

Let \(\sigma :{\mathbb {R}}\rightarrow {\mathbb {R}}\) be some fixed bounded continuous sigmoid function such as \(\sigma (r)=(1+e^{-r})^{-1}\) or \(\sigma (r)=\max (0,\min (r,1))\).

By the universal approximation result of Cybenko [21, Theorem 2], the set

is dense in \(C(I,{\mathbb {R}})\) w.r.t. the supremum norm. As a result, it is enough to replace G in (33) by functions of the form \(x\mapsto \sum _{i=1}^m u_i \sigma (v_i\cdot x+w_i)\). Evaluating the latter function on the vector \(x=(\int h_1\,\mathrm d\mu ,\dots ,\int h_L\,\mathrm d\mu )\) yields

$$\begin{aligned} \sum _{i=1}^m u_i \sigma \left( \sum _{k=1}^L v_i^k\int h_k \,\mathrm d\mu + w_i\right)&=\sum _{i=1}^m u_i \sigma \left( \,\int \left( \sum _{k=1}^{L+1} v_i^k h_k \right) \,\mathrm d\mu \,\right) \\&=\sum _{i=1}^m u_i\sigma \left( \int {{\bar{h}}}_i \,\mathrm d\mu \right) , \end{aligned}$$

upon defining \(h_{L+1}:=1\), \(v_i^{L+1}:=w_i\), and finally \({{\bar{h}}}_i:= \sum _{k=1}^{L+1} v_i^k h_k\) for every i. The result follows from Lemma 7.3. \(\square \)
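As a purely numerical illustration of the density statement invoked above (and not of Cybenko’s constructive proof), one can fit an expression of the form \(x\mapsto \sum _{i=1}^m u_i \sigma (v_i\cdot x+w_i)\) to a continuous target on a compact set by least squares. The Python sketch below draws the \(v_i, w_i\) at random and solves for the \(u_i\); it is our own illustration.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda r: 1.0 / (1.0 + np.exp(-r))

x = np.linspace(-1.0, 1.0, 200)     # compact set I = [-1, 1] (here L = 1)
target = np.abs(x)                  # a continuous function on I

m = 50
v = rng.normal(size=m) * 5.0        # random slopes v_i
w = rng.normal(size=m) * 5.0        # random offsets w_i
features = sigmoid(np.outer(x, v) + w)                  # column i is sigma(v_i x + w_i)
u, *_ = np.linalg.lstsq(features, target, rcond=None)   # least-squares weights u_i

print("sup-norm error:", np.max(np.abs(features @ u - target)))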

Lemma 7.5

The \({\mathcal {W}}_p\)-optimal stopping topology is finer than the \({\mathcal {W}}_p\)-information topology.

Proof

The choice \(L_T:=-\rho (x,x_0)^p-1\) and \(L_t:=0\) for \(t\ne T\) shows that convergence in the \({\mathcal {W}}_p\)-optimal stopping topology implies convergence of the pth moments. Thus, we are left to show that convergence in the optimal stopping topology implies convergence in Hellwig’s information topology. Then, by the part of Lemma 1.4 which has already been established, we obtain convergence in the \({\mathcal {W}}_p\)-information topology.

Fix \(1\le t\le N-1\) and denote by \({\mathcal {A}}:=\overline{{\mathcal {X}}}^t\) the first t and by \({\mathcal {B}}:=\overline{{\mathcal {X}}}_{t+1}\) the last \(N-t\) coordinates. As \(C_b({\mathcal {A}})\) is convergence determining for , and \(\{\nu \mapsto G (\int _{{\mathcal {B}}} h\,\mathrm d\nu ):h\in C_b({\mathcal {B}}), G \in C_b({{\mathbb {R}}})\}\) is, by Lemma 7.4, convergence determining for , it follows e.g. from [26, Proposition 4.6 (p.115)] that

(35)

is convergence determining for the weak topology on . Since h in (35) is bounded, one can actually take g in (35) to be compactly supported. But a continuous compactly supported function can be approximated uniformly by piecewise linear functions. The latter are linear combinations of functions of the form \(z\mapsto \min (c,dz)\) where \(c,d\in {\mathbb {R}}\). It therefore follows that

(36)

is also convergence determining for the weak topology on . Let F be a function in (36), defined via \(f\in C_b({\mathcal {A}})\) and \(h\in C_b({\mathcal {B}})\), and let \(m\in {\mathbb {R}}\) be a bound for |f| and |h|. Define \(L\in AC_p(\Omega )\) via

$$\begin{aligned} L_t:=f\circ {\overline{X}}^t, \quad L_T:=(f\circ {\overline{X}}^t) \cdot (h\circ {\overline{X}}_{t+1}), \quad \text {and } L_s:=m+1\text { for }s\ne t,T. \end{aligned}$$

(Where \({\overline{X}}^t\) is the projection onto the first t coordinates and \({\overline{X}}_{t+1}\) is the projection onto the remaining \(N-t\) coordinates.)

By dynamic programming (the Snell-envelope theorem) one has

for every . This implies that the optimal stopping topology is finer than the initial topology of \(\mu \mapsto \int F \,\mathrm d({{\,\mathrm{dis}\,}}_{{\mathcal {A}}}^{{\mathcal {B}}}(\mu ))\) over F in (36). As (36) is convergence determining for the weak topology on , the optimal stopping topology is indeed finer than the information topology, and as observed at the beginning of this proof therefore the \({\mathcal {W}}_p\)-optimal stopping topology is finer than the \({\mathcal {W}}_p\)-information topology. \(\square \)