All Adapted Topologies are Equal

A number of researchers have introduced topological structures on the set of laws of stochastic processes. A unifying goal of these authors is to strengthen the usual weak topology in order to adequately capture the temporal structure of stochastic processes. Aldous defines an extended weak topology based on the weak convergence of prediction processes. In the economic literature, Hellwig introduced the information topology to study the stability of equilibrium problems. Bion-Nadal and Talay introduce a version of the Wasserstein distance between the laws of diffusion processes. Pflug and Pichler consider the nested distance (and the weak nested topology) to obtain continuity of stochastic multistage programming problems. These distances can be seen as a symmetrization of Lassalle's causal transport problem, but there are also further natural ways to derive a topology from causal transport. Our main result is that all of these seemingly independent approaches define the same topology in finite discrete time. Moreover, we show that this 'weak adapted topology' is characterized as the coarsest topology that guarantees continuity of optimal stopping problems for continuous bounded reward functions.

1. Introduction

1.1. Outline. If some type of natural phenomenon is modelled through a stochastic process, one might expect that the model does not describe reality in an entirely accurate way. To be able to study the impact of such inaccuracies on the problems one is trying to solve, it makes sense to equip the set of laws of stochastic processes with a suitable notion of distance or topology.
Denoting by Ω := X^N the path space (where X is some Polish space and N ∈ N), the set of laws of stochastic processes is P(Ω), i.e. the set of probability measures on Ω.
Clearly, P(Ω) carries the usual weak topology. However, this topology does not respect the time evolution of stochastic processes which has a number of potentially inconvenient consequences: e.g., problems of optimal stopping / utility maximization / stochastic programming are not continuous, arbitrary processes can be approximated by processes which are deterministic after the first period, etc. In the following we describe a number of approaches which have been developed by different authors to deal with these (and related) problems. Our main result (Theorem 1.1) is that all of these approaches actually define the same topology in the present discrete time setup. Moreover, this topology is the weakest topology which allows for continuity of optimal stopping problems.
1.2. Adapted Wasserstein distances, nested distance. A number of authors have independently introduced variants of the Wasserstein distance which take the temporal structure of processes into account: the definition of 'iterated Kantorovich distance' by Vershik [58,59] might be seen as a first construction in this direction. The topic is also considered by Rüschendorf [56]. Independently, Pflug and Pflug-Pichler [50,54,51,52,53,28] introduce the nested distance and describe the concept's rich potential for the approximation of stochastic multi-period optimization problems. Lassalle [44] considers the 'causal transport problem' that leads to a corresponding notion of distance. Once again independently of these developments, Bion-Nadal and Talay [15] define an adapted version of the Wasserstein distance between laws of solutions to SDEs.
To set the stage for describing these 'adapted' variants let us fix p ≥ 1 and recall the definition of the usual p-Wasserstein distance.
(X, ρ_X) is now a Polish metric space. On Ω = X^N we use the Polish metric ρ_Ω((x_t)_t, (y_t)_t) := (Σ_t ρ_X(x_t, y_t)^p)^{1/p}. Typically, when clear from the context, we will omit the subscript for the metric. We use (X_t)_t to denote the canonical process on Ω, i.e. X_t is the projection onto the t-th factor of Ω = X^N. On Ω × Ω we call X = (X_t)_t the projection onto the first factor and Y = (Y_t)_t the projection onto the second factor. For µ, ν ∈ P(Ω) we denote by Cpl(µ, ν) the set of probability measures π on Ω × Ω for which X ∼ µ and Y ∼ ν under π, i.e. for which the distribution of X under π is µ and that of Y under π is ν. In applications, a particular role is played by Monge couplings: a Monge coupling from µ to ν is a coupling π for which Y = T(X) π-a.s. for some Borel mapping T : Ω → Ω that transports µ to ν, i.e. satisfies T_#(µ) = ν.
For µ, ν ∈ P_p(Ω), i.e. for probability measures on Ω with finite p-th moment, their p-Wasserstein distance is

W_p(µ, ν) := inf { (∫ ρ_Ω(x, y)^p dπ(x, y))^{1/p} : π ∈ Cpl(µ, ν) }. (1)

Following [55], the infimum in (1) remains unchanged in many situations if one minimizes only over Monge couplings.
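Before turning to the adapted variants, the plain distance in (1) can be illustrated on finitely many atoms. The following sketch is our own illustration (all names are hypothetical, not from the paper); it uses the classical fact that on the real line the monotone (sorted) coupling attains the infimum for the cost |x − y|^p, so W_p between two uniform empirical measures with equally many atoms reduces to matching order statistics.

```python
def wasserstein_p_1d(xs, ys, p=1):
    """p-Wasserstein distance between two uniform empirical measures on
    the real line with equally many atoms.  In dimension one the monotone
    (sorted) coupling is optimal for the cost |x - y|^p, so it suffices
    to match order statistics."""
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    return (sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)) ** (1 / p)

# three equally likely atoms each; only the largest atoms differ
print(wasserstein_p_1d([0.0, 1.0, 2.0], [0.0, 1.0, 3.0], p=1))  # 1/3
```

For general Polish spaces the infimum in (1) requires solving a genuine transport problem; the one-dimensional shortcut above is only meant to make the definition concrete.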
To motivate the formal definition of the adapted cousins in (5) and (6) below, we start with an informal discussion in terms of Monge mappings. In probabilistic terms, the preservation-of-mass assumption T_#(µ) = ν asserts that

(T_1(X_1, . . . , X_N), . . . , T_N(X_1, . . . , X_N)) ∼ ν, (2)

which ignores the evolution in time of µ and ν. Rather, it would appear more natural to restrict to mappings (T_k)_{k=1}^N which are adapted in the sense that T_k depends only on X_1, . . . , X_k. Adapted Wasserstein distances can be defined following precisely this intuition, relying on a suitable version of adaptedness on the level of couplings: the set Cpl_c(µ, ν) of causal couplings¹ consists of all π ∈ Cpl(µ, ν) such that for all t ≤ N and measurable A ⊆ X^t

π((Y_1, . . . , Y_t) ∈ A | X_1, . . . , X_N) = π((Y_1, . . . , Y_t) ∈ A | X_1, . . . , X_t), (3)

cf. [44]. The set Cpl_bc(µ, ν) of bi-causal couplings consists of all π ∈ Cpl_c(µ, ν) such that the distribution of (Y, X) under π is also in Cpl_c(ν, µ), i.e. such that (3) also holds with the roles of X and Y reversed. The term causal was introduced by Lassalle [44], who considers a causal transport problem in which the usual set of couplings is replaced by the set of causal couplings. The resulting concept is not actually a metric as it lacks symmetry, but, as suggested by Soumik Pal, this is easily mended; we formally define the causal and the symmetrized-causal p-Wasserstein distance, resp., by

CW_p(µ, ν) := inf { (∫ ρ_Ω(x, y)^p dπ(x, y))^{1/p} : π ∈ Cpl_c(µ, ν) }, (4)

SCW_p(µ, ν) := max { CW_p(µ, ν), CW_p(ν, µ) }, (5)

and the adapted (bi-causal) p-Wasserstein distance by

AW_p(µ, ν) := inf { (∫ ρ_Ω(x, y)^p dπ(x, y))^{1/p} : π ∈ Cpl_bc(µ, ν) }. (6)

¹ Intuitively, at time t, given the past (X_1, . . . , X_t) of X, the distribution of Y_t does not depend on the future (X_{t+1}, . . . , X_N) of X. For absolutely continuous measures µ, the weak closure of the set of adapted Monge couplings, i.e. of those π ∈ Cpl(µ, ν) for which Y = T(X) π-a.s. for T adapted, is precisely the set of all causal couplings, see [42].
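For finitely supported laws the causality condition can be checked mechanically. The following sketch is our own illustration for the special case N = 2 (all names hypothetical): there causality reduces to the requirement that the conditional law of Y_1 given (X_1, X_2) depends on X_1 only, while at the terminal time the condition is vacuous.

```python
from collections import defaultdict

def is_causal_two_periods(pi, tol=1e-9):
    """Check causality for a finitely supported coupling of two
    two-period processes.  pi: dict ((x1, x2), (y1, y2)) -> weight.
    For N = 2 causality says: the conditional law of Y1 given (X1, X2)
    must depend on X1 alone."""
    mass_xy1 = defaultdict(float)   # weight of (x1, x2, y1)
    mass_x = defaultdict(float)     # weight of (x1, x2)
    for ((x1, x2), (y1, _y2)), w in pi.items():
        mass_xy1[(x1, x2, y1)] += w
        mass_x[(x1, x2)] += w
    # conditional laws of Y1 given (x1, x2), grouped by x1
    cond = defaultdict(dict)
    for (x1, x2, y1), w in mass_xy1.items():
        cond[(x1, x2)][y1] = w / mass_x[(x1, x2)]
    groups = defaultdict(list)
    for (x1, _x2), c in cond.items():
        groups[x1].append(c)
    for laws in groups.values():
        ref = laws[0]
        for c in laws[1:]:
            if any(abs(ref.get(y, 0.0) - c.get(y, 0.0)) > tol
                   for y in set(ref) | set(c)):
                return False
    return True

# Y1 copies X2: information leaks from the future, hence not causal.
leaky = {((0, 0), (0, 0)): 0.5, ((0, 1), (1, 1)): 0.5}
# Y simply copies X: causal (in fact bi-causal).
copy = {((0, 0), (0, 0)): 0.5, ((0, 1), (0, 1)): 0.5}
print(is_causal_two_periods(leaky), is_causal_two_periods(copy))  # False True
```

The 'leaky' coupling transports mass correctly in the sense of (2), yet its first component anticipates the second step of X; exactly this behaviour is ruled out by causality.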
Rüschendorf [56] refers to AW p as 'modified Wasserstein distance'. Pflug-Pichler [50,Definition 1] use the names multi-stage distance of order p and nested distance. It can also be considered as a discrete time version of the 'Wasserstein-type distance' of Bion-Nadal and Talay [15]. In [4] we use a slightly modified definition of AW p which scales better with the number of time-periods N but leads to an equivalent metric (for fixed p and N ). We shall discuss further properties of AW p (and in particular the connection with Vershik's iterated Kantorovich distance) in Section 1.8 below.

1.3. Hellwig's information topology. The information topology introduced by Hellwig in [29] (as well as Aldous' extended weak topology, which we discuss next) is based on the idea that an essential part of the structure of a process is the information that we may deduce about the future behaviour of the process given its behaviour up to the current time t. For a process whose law is µ, this information is captured by the conditional law L_µ(X_{t+1}, . . . , X_N | X_1 = x_1, . . . , X_t = x_t) of X_{t+1}, . . . , X_N given X_1 = x_1, . . . , X_t = x_t under µ. This conditional law is also the disintegration µ_{x_1,...,x_t} of µ ∈ P(Ω) w.r.t. the first t coordinates.
Hellwig's information topology is the initial topology w.r.t. a family of maps (I_t)_{t=1}^{N−1} which are defined based on these disintegrations:

I_t : P(Ω) → P(X^t × P(X^{N−t})), I_t(µ) := (k_t)_#(µ), where k_t(x_1, . . . , x_N) := (x_1, . . . , x_t, µ_{x_1,...,x_t}).

Equivalently, I_t(µ) is the joint law of (X_1, . . . , X_t, µ_{X_1,...,X_t}) under µ, and Hellwig's information topology is therefore the coarsest topology which makes continuous, for all t, the maps which send a probability µ to the joint law describing the evolution of the coordinate process up to time t and the prediction about the future behaviour of the coordinate process after t.
The work of Hellwig [29] was motivated by questions of stability in dynamic economic models/games; see the related articles [38,57,30,10].
1.4. Aldous' extended weak topology. Aldous [3] introduces a type of convergence for pairs of filtrations and continuous-time stochastic processes on them that he calls extended weak convergence [3, Definition 15.2]. Restricted to our current setting, his definition can be paraphrased in a similar manner as that of the information topology. Aldous' idea is to represent a stochastic process with law µ through the associated prediction process², that is, the process given by

Z_t := L_µ(X_1, . . . , X_N | X_1, . . . , X_t), t = 1, . . . , N,

which is a measure-valued martingale that makes increasingly accurate predictions about the full trajectory of the process X.
Rather than comparing the laws of processes directly, the extended weak topology is derived from the weak topology on the corresponding prediction processes (plus the original processes). Formally, the extended weak topology on P(Ω) is the initial topology w.r.t. the map E which sends µ to the joint distribution of X together with its prediction process (L_µ(X_1, . . . , X_N | X_1, . . . , X_t))_{t=1}^N under µ. Note that, to stay faithful to Aldous' original definition, we defined E to map µ not just to the law of the prediction process but to the joint law of the original process and its prediction process. One easily checks that the original process may be omitted in our setting without changing the resulting topology.
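On finite supports the prediction process and its martingale property are easy to exhibit. In the following sketch (our own illustration; all names hypothetical) a law µ on a two-step path space is disintegrated along histories, and averaging the time-1 predictions against the law of the first coordinate recovers the time-0 prediction, which is µ itself.

```python
from collections import defaultdict
from fractions import Fraction

def prediction_process(mu, t):
    """Time-t value of the prediction process of a finitely supported
    law mu (dict: full path tuple -> weight): for each history h of
    length t, the probability of h together with the conditional law of
    the whole path given h."""
    hist = defaultdict(Fraction)
    cond = defaultdict(lambda: defaultdict(Fraction))
    for path, w in mu.items():
        hist[path[:t]] += w
        cond[path[:t]][path] += w
    return {h: (hist[h], {q: w / hist[h] for q, w in cond[h].items()})
            for h in hist}

mu = {(0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4), (1, 1): Fraction(1, 2)}

# Martingale property at the first step: averaging the time-1
# predictions over the law of X1 returns the time-0 prediction, i.e. mu.
m0 = prediction_process(mu, 0)[()][1]
avg = defaultdict(Fraction)
for h, (wh, law) in prediction_process(mu, 1).items():
    for path, w in law.items():
        avg[path] += wh * w
print(dict(avg) == m0 == mu)  # True
```

Exact rational arithmetic (`Fraction`) is used so that the martingale identity holds exactly rather than up to floating-point error.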
1.5. The optimal stopping topology. The usual weak topology on P(Ω) is the coarsest topology which makes continuous all the functions µ ↦ ∫ f dµ for f : Ω → R continuous and bounded.
One may follow a similar pattern and look at the coarsest topology which makes continuous the outcomes of all sequential decision procedures. Perhaps the easiest way to formalize this is to look at optimal stopping problems. In detail, write AC(Ω) for the set of all processes (L_t)_{t=1}^N which are adapted (i.e. L_t depends only on x_1, . . . , x_t), bounded, and satisfy that x ↦ L_t(x) is continuous for each t ≤ N. Write v_L(µ) for the corresponding value function, given that the process X follows the law µ, i.e.

v_L(µ) := inf { E_µ(L_τ) : τ ≤ N is a stopping time }.
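For a finitely supported law µ the value v_L(µ) can be computed by backward induction on the Snell envelope (here for the minimization problem, matching the infimum above). The sketch below is our own illustration, not code from the paper; `mu` maps full paths to probabilities and `L(t, h)` is the adapted reward evaluated on the history h.

```python
from collections import defaultdict

def optimal_stopping_value(mu, L):
    """v_L(mu) = inf over stopping times tau <= N of E[L_tau], for a
    finitely supported law mu (dict: full path tuple -> probability)
    and an adapted reward L(t, history).  Backward induction on the
    Snell envelope: at each history keep min(stop now, continuation)."""
    N = len(next(iter(mu)))
    mass = defaultdict(float)          # weight of every partial history
    for path, w in mu.items():
        for t in range(1, N + 1):
            mass[path[:t]] += w
    V = {h: L(N, h) for h in mass if len(h) == N}
    for t in range(N - 1, 0, -1):
        newV = {}
        for h in mass:
            if len(h) != t:
                continue
            # expected continuation value given the history h
            cont = sum(mass[g] / mass[h] * V[g] for g in V if g[:t] == h)
            newV[h] = min(L(t, h), cont)
        V = newV
    return sum(mass[h] * V[h] for h in V)

# Symmetric +/-1 coin tosses, reward = (running sum)^2: stopping at
# time 1 (value 1) beats waiting until time 2 (expected value 2).
coin = {(1, 1): 0.25, (1, -1): 0.25, (-1, 1): 0.25, (-1, -1): 0.25}
print(optimal_stopping_value(coin, lambda t, h: sum(h) ** 2))  # 1.0
```

Small perturbations of `coin` change this value continuously only if the perturbation respects the information structure; this is precisely the phenomenon the optimal stopping topology is designed to capture.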
The optimal stopping topology on P(Ω) is the coarsest topology which makes the functions µ ↦ v_L(µ) continuous for all (L_t)_{t=1}^N ∈ AC(Ω).

1.6. Main result.

Theorem 1.1. Let (X, ρ_X) be a Polish space with a bounded metric and set Ω := X^N. Then the following topologies on P(Ω) are equal:
(1) the topology induced by AW_p,
(2) the topology induced by SCW_p,
(3) Hellwig's information topology,
(4) Aldous' extended weak topology,
(5) the optimal stopping topology.

The assumption that ρ_X is bounded serves only to simplify the statement of the theorem, because in this case the topology induced by W_p coincides with the weak topology. For every Polish space there is a bounded complete metric which induces the topology (given any complete metric ρ_X, replace it by, e.g., min(1, ρ_X)).

1.6.1. p-Wasserstein and unbounded metrics. There is an analogous statement, Theorem 1.2 below, which drops the assumption that ρ_X is bounded. To be able to state it, we introduce slight variations of Hellwig's information topology, of Aldous' extended weak topology and of the optimal stopping topology. In [29] Hellwig equips the target spaces of I_t with the weak topology; more precisely, he equips P(X^{N−t}) with the weak topology, X^t × P(X^{N−t}) with the product topology and finally P(X^t × P(X^{N−t})) with the weak topology based on this topology. One may easily define a p-Wasserstein version of Hellwig's information topology by using the recipe 'replace the weak topology by the p-Wasserstein metric everywhere'. Concretely, if we restrict I_t to P_p(Ω), we may view it as a map into P_p(X^t × P_p(X^{N−t})), where the space X^t × P_p(X^{N−t}) carries the metric

((x, m), (y, n)) ↦ ( Σ_{s≤t} ρ_X(x_s, y_s)^p + W_p(m, n)^p )^{1/p}.

We will call the resulting variant of Hellwig's information topology on P_p(Ω) the W_p-information topology.
Similarly, one may systematically replace every occurrence of the weak topology in the definition of the extended weak topology by the p-Wasserstein metric. We call the resulting topology on P p (Ω) the extended W p -topology.
Just like the weak topology is the coarsest topology which makes integration of continuous bounded functions continuous, the p-Wasserstein topology is the coarsest topology which makes continuous the integration of continuous functions bounded by x ↦ c · (1 + ρ(x_0, x)^p) for a fixed base point x_0. Following this analogy, we define AC_p(Ω) as the set of all processes (L_t)_{t=1}^N which are adapted, bounded by x ↦ c · (1 + ρ(x_0, x)^p) for some c ∈ R_+, and satisfy that x ↦ L_t(x) is continuous for each t ≤ N.
The W_p-optimal stopping topology on P_p(Ω) is the coarsest topology which makes the functions µ ↦ v_L(µ) continuous for all (L_t)_{t=1}^N ∈ AC_p(Ω). With these we may state the following generalization of Theorem 1.1:

Theorem 1.2. Let (X, ρ_X) be a Polish metric space and set Ω := X^N. Then the following topologies on P_p(Ω) are equal:
(1) the topology induced by AW_p,
(2) the topology induced by SCW_p,
(3) the W_p-information topology,
(4) the extended W_p-topology,
(5) the W_p-optimal stopping topology.
Clearly, one recovers Theorem 1.1 from Theorem 1.2 by choosing a bounded metric on X , because the W p -information topology for bounded ρ X is just the information topology, the extended W p -topology for bounded ρ X is just the extended weak topology and the W p -optimal stopping topology for bounded ρ X is just the optimal stopping topology.
The relationship between the topologies listed in Theorem 1.1 and those listed in Theorem 1.2 is similar to the non-adapted case, where we know that usual p-Wasserstein convergence is equivalent to usual weak convergence plus convergence of the p-th moments.

Already Aldous established that optimal stopping problems behave continuously with respect to extended weak convergence [3]. This corresponds to one half of (4) = (5) in Theorem 1.1, but in a much more general setting. This line of work has been continued by Lamberton and Pagès [43], Coquet and Toldo [19], among others.
Aldous' extended weak topology was also inspiring and instrumental for the development of the theory of convergence of filtrations, and the associated questions of stability of the martingale representation property and of Doob-Meyer decompositions. In this regard, see the works by Hoover et al. [35,33] and by Mémin et al. [18,46]. The related question of stability of stochastic differential equations (as well as their backward versions) with respect to the driving noise has particularly seen a burst of activity in the last two decades. For brevity's sake we only refer to the recent article by Papapantoleon, Possamaï, and Saplaouras [48] for an overview of the many available works in this direction.

Previous applications of adapted Wasserstein distances.
Pflug, Pichler and co-authors [50,54,51,52,53,28] have extensively developed and applied the notion of nested distances for the purpose of scenario generation, stability, sensitivity bounds, and distributionally robust stochastic optimization, in the context of operations research.
Acciaio, Zalashko, and one of the present authors consider in [2] the adapted Wasserstein distance in continuous time in connection with utility maximization, enlargement of filtrations and optimal stopping.
Causal couplings have appeared in the work by Yamada and Watanabe [60], Jacod and Mémin [36] as well as Kurtz [40,41], concerning weak solutions of stochastic differential equations, and by Rüschendorf [56] concerning approximation theorems in probability theory. The term 'causal' is first used by Lassalle [44], who uses it in an additional constraint for the transport problem and gives an alternative derivation of the Talagrand inequality for the Wiener measure. Causal couplings are also present in the numerical scheme suggested in [1] for (extended mean-field) stochastic control.
The article [6] connects adapted Wasserstein distance (in continuous time) to martingale optimal transport (cf. [32,12,26,22,16,31,17,11,13] among many others). Several familiar objects appear as solutions to variational problems in this context: e.g., geometric Brownian motion is the martingale which is closest in AW_2 to usual Brownian motion subject to having a log-normal distribution at the terminal time point, and the local volatility model is the martingale closest to Brownian motion subject to matching one-dimensional marginals.
Bion-Nadal and Talay [15] introduce an adapted Wasserstein-type distance on the set of diffusion SDEs and show that this distance corresponds to the computation of a tractable stochastic control problem. They also apply their results to the problem of fitting diffusion models to given marginals.
In [4] the present authors consider adapted Wasserstein distances in relation to stability in finance: Lipschitz continuity of utility maximization/hedging is established w.r.t. the underlying models in discrete and continuous time.

1.8. Another formulation of the adapted Wasserstein distance and of Hellwig's information topology.
Here we give an alternative formulation of the adapted Wasserstein distance / nested distance due to Pflug and Pichler. Again, X is a Polish space and ρ = ρ_X is a compatible metric on X. For a history (x_1, . . . , x_t), write µ_{x_1,...,x_t} for the conditional law of X_{t+1} given X_1 = x_1, . . . , X_t = x_t under µ, and similarly for ν. Starting with V_N^p := 0, we define recursively for t = N − 1, . . . , 1

V_t^p(x_1, . . . , x_t, y_1, . . . , y_t) := inf_{γ_{t+1} ∈ Cpl(µ_{x_1,...,x_t}, ν_{y_1,...,y_t})} ∫ [ ρ(x_{t+1}, y_{t+1})^p + V_{t+1}^p(x_1, . . . , x_{t+1}, y_1, . . . , y_{t+1}) ] dγ_{t+1}(x_{t+1}, y_{t+1}). (7)

The nested distance is finally obtained in a backwards recursive way by

ND_p(µ, ν)^p := inf_{γ_1 ∈ Cpl(µ_1, ν_1)} ∫ [ ρ(x_1, y_1)^p + V_1^p(x_1, y_1) ] dγ_1(x_1, y_1), (8)

where µ_1, ν_1 denote the first marginals of µ and ν. Then AW_p = ND_p. We refer to [7] for the (straightforward) justification.
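For two-period processes with finitely many equally weighted atoms the backward recursion above and the final formula (8) can be carried out by hand. The following sketch is our own illustration (not Pflug-Pichler's code; all names hypothetical): the inner stage-2 transport problems are solved by the one-dimensional monotone coupling, and the stage-1 problem, having uniform marginals with equally many atoms, reduces to an assignment problem. The example is the standard one in which ν announces the final value already at time 1 while µ reveals nothing, so that the nested distance stays near 1 although the plain Wasserstein distance is of order ε.

```python
import itertools

def w_pp_1d(xs, ys, p):
    """W_p^p between uniform empirical measures on the line with equally
    many atoms (the monotone coupling is optimal in dimension one)."""
    return sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def nested_distance_two_periods(mu, nu, p=1):
    """Nested distance for two-period real-valued processes.  mu, nu:
    lists of (x1, successors) pairs, each pair of probability 1/len,
    successors equally likely given x1.  Stage 2 is solved in closed
    form, stage 1 as an assignment problem (exact for uniform marginals
    with equally many atoms)."""
    # stage-1 cost: first-period transport cost plus the value V_1
    cost = [[abs(a - b) ** p + w_pp_1d(sa, sb, p) for (b, sb) in nu]
            for (a, sa) in mu]
    n = len(mu)
    best = min(sum(cost[i][perm[i]] for i in range(n)) / n
               for perm in itertools.permutations(range(n)))
    return best ** (1 / p)

eps = 0.01
mu = [(0.0, [1.0, -1.0]), (0.0, [1.0, -1.0])]   # reveals nothing at time 1
nu = [(eps, [1.0, 1.0]), (-eps, [-1.0, -1.0])]  # announces X2 up front
print(nested_distance_two_periods(mu, nu, p=1))  # about 1.01; plain W_1 is 0.01
```

Letting ε → 0 the plain Wasserstein distance vanishes while the nested distance stays above 1: exactly the failure of the weak topology to respect the flow of information discussed in the introduction.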
For N > 1 the adapted Wasserstein distance is not complete. As was established in [5], a natural complete space into which (P_p(Ω), AW_p) embeds is given by the space of nested distributions. Consider the sequence of metric spaces defined recursively by

X_{N:N} := X, X_{t:N} := X × P_p(X_{t+1:N}) for t < N,

where at each stage t the space P_p(X_{t:N}) is endowed with the p-Wasserstein distance with respect to the metric ρ_{t:N} on X_{t:N} (the p-product of ρ_X and W_{ρ_{t+1:N},p}), which we denote by W_{ρ_{t:N},p}. The space of nested distributions (of depth N) is defined as P_p(X_{1:N}). We endow P_p(X_{1:N}) with the complete metric W_{ρ_{1:N},p}. The space of nested distributions was defined by Pflug [49]. Notably, the idea to iterate the formation of Wasserstein spaces and metrics goes back to Vershik [58,59], who uses the name 'iterated Kantorovich distance'. The main interest of Vershik (and his successors) lies in the classification of filtrations (in the language of ergodic theory). We refer to the work of Émery and Schachermayer [24] for a survey from a probabilistic perspective and to Janvresse, Laurent and de la Rue [37] for a contemporary article (again from a probabilistic viewpoint).
P_p(Ω) is naturally embedded in the set of nested distributions of depth N through the map N given by

N(µ) := L( X_1, L( X_2, L( · · · L(X_N | X_1^{N−1}) · · · | X_1^2 ) | X_1^1 ) ), (9)

where (X_1, . . . , X_N) is a vector with law µ, L again denotes the (conditional) law and we use X_1^t as a shorthand for the vector (X_1, . . . , X_t). Following [5], we have:

Theorem 1.4. The map N defined in (9) embeds the metric space (P_p(Ω), AW_p) isometrically into the complete separable metric space (P_p(X_{1:N}), W_{ρ_{1:N},p}).

Remark 1.5. When X has no isolated points, P_p(X_{1:N}) is actually the completion of P_p(Ω), i.e. P_p(Ω) considered as a subset of P_p(X_{1:N}) is dense.

1.8.1. Hellwig's information topology in terms of adapted Wasserstein distances.
We note that Hellwig's definition of the information topology can also be rephrased using the concept of adapted Wasserstein distance. Assume that ρ_X is a bounded metric and, for t ≤ N, regard

Ω = X^t × X^{N−t},

i.e. for each t we consider Ω as the product of two Polish spaces (which one might consider as 'history' and 'future'). Extending the definition of AW_p in the obvious way to products of not necessarily equal Polish spaces, we can then equip P(Ω) with a one-period adapted Wasserstein distance AW_p^{(t)}, p ≥ 1, for each such decomposition. Setting for µ, ν ∈ P(Ω)

d_I(µ, ν) := Σ_{t=1}^{N−1} AW_p^{(t)}(µ, ν),

we obtain a compatible metric for the information topology. This is relatively straightforward (whereas the full version of Theorem 1.1 is not straightforward as far as we are concerned).
1.9. Preservation of Compactness. We close this section with a result about the preservation of relative compactness which we shall use in Sections 4 and 6, but which also might be of independent interest. Specifically, in [8,9] the two-step version of Lemma 1.6 is used as a crucial tool in the investigation of the weak transport problem. A more detailed investigation of compactness in P(Ω) with the weak adapted topology is the topic of the companion paper to this one, [23].
Lemma 1.6. Assume for simplicity that ρ_X is a bounded metric. Then a set A ⊆ P(Ω) is relatively compact w.r.t. the weak topology if and only if N[A] is relatively compact in P(X_{1:N}).

We note that Lemma 1.6 is essentially a consequence of the characterization of compact subsets of P(P(X)); in a somewhat different framework it was first proved in [34]. The version stated here follows by repeated application of the corresponding result in [23]. The implication that N[A] relatively compact implies A relatively compact is rather easy to see, but the other direction, that A relatively compact implies N[A] relatively compact, is nontrivial, since the mapping N : P(Ω) → P(X_{1:N}) is not continuous when P(Ω) is endowed with the usual weak topology (except in trivial cases). Lemma 1.6 would not be true if we were to replace relative compactness by compactness.
The assumption that ρ X is bounded is inessential. A version of Lemma 1.6 holds if we replace P(Ω) by P p (Ω) and the weak topology by the one induced by the p-Wasserstein metric.
A similar result based on Hellwig's information topology, relating relative compactness in P(Ω) to relative compactness in the target spaces P(X^t × P(X^{N−t})) of the maps I_t, is also true.

2. Preparations
The rest of the paper will essentially be devoted to proving Theorem 1.1, or really its generalization Theorem 1.2.
In Section 3 we prove that Hellwig's information topology equals the topology induced by AW_p, i.e. (3) = (1) in Theorem 1.2. In a sense, of all the topologies listed in Theorem 1.2, Hellwig's information topology 'looks' the coarsest, or at least like one of the coarser ones, while the topology induced by AW_p 'looks' the finest.
In Section 4 we sandwich the topology induced by SCW_p between Hellwig's information topology and the topology induced by AW_p, i.e. we show (3) ≤ (2) ≤ (1) in Theorem 1.2.
In Section 6 we prove Lemma 1.3. In Section 7 we prove that the optimal stopping topology is coarser than the topology induced by AW_p and finer than Hellwig's (W_p-)information topology, i.e. (3) ≤ (5) ≤ (1) in Theorem 1.2.
2.1. Notation. The nested structure of spaces like for example P p (X 1:N ) introduced in Section 1.8 is (at least for the authors) not so easy to gain an intuition for. It seems rather challenging to picture probability measures on probability measures on probability measures. . . etc.
Therefore, much of the work in the proofs in the following two sections will be about bookkeeping and about not getting lost in these nested structures. In most other contexts we would regard such bookkeeping as abstract nonsense better swept under the rug, but in the context of the present paper we believe that it really constitutes an important and nontrivial ingredient in successfully carrying out the proofs.
To aid in this endeavour we make some notational preparations and introduce a few conventions.

2.2. Operations on Spaces.
In the introduction we described the topologies listed in Theorems 1.1 and 1.2 as initial topologies w.r.t. maps into more complex spaces. These spaces are built up from just a few basic operations, and in most cases the maps can also be constructed using a few relatively simple ingredients.
For spaces, the operations in question are • product formation, i.e. for spaces X and Y we may form their product space X × Y, • and passing from a space X to the space P(X ) of probability measures on X . Here we run into some tension between the various existing definitions in the literature. While Hellwig and Aldous originally defined their topologies based on equipping the space P(X ) of probability measures on some space X with the weak topology, without any mention of metrics, AW p is a metric built on the p-Wasserstein metric, and Theorem 1.4 exhibits this metric as the 'initial metric' w.r.t. an embedding of P p (Ω) (not P(Ω)) into (P p (X 1:N ) , W ρ 1:N ,p ).
Luckily, when the base metric ρ X on X is bounded and we decide that we only care about topologies and not the metrics that induce them, all of these distinctions vanish, and one may hope for these fine distinctions to not be so important in the end.
To give as uniform and as streamlined a treatment as possible of all the various ways in which these metric and topological spaces can be related to each other, we employ the following strategy: a lot of our arguments are agnostic to the distinction between P and P_p, and to whether we are talking about metric or topological spaces, etc. They only rely on properties of the operations of product formation and of formation of spaces of probability measures, and on properties of maps between various spaces built using these operations, which hold in either case. For the rest of the paper we will therefore drop the p in P_p and other explicit mentions of these distinctions. The reader may decide to read the paper using either of the following two sets of conventions, which are to be applied recursively.

Convention 1:
• 'space' will mean Polish topological space.
• X × Y is a topological space with the product topology (again Polish).
• P(X) is a topological space with the weak topology (also Polish).

Convention 2:
• 'space' will mean Polish metric space.
• X × Y is a metric space carrying the p-product metric ((x, y), (x', y')) ↦ (ρ_X(x, x')^p + ρ_Y(y, y')^p)^{1/p} (again Polish).
• P(X) stands for P_p(X), a metric space with the p-Wasserstein metric (also Polish).
• The subscript on the metric ρ may be dropped when clear from the context.
Unless specified otherwise everything said from here on will be true for either way of reading. Convention 1 will lead to a direct proof of Theorem 1.1, while Convention 2 will give a proof of the more general version, Theorem 1.2. Occasionally an argument will require us to talk directly about metrics to establish continuity of some map. When one only cares about Theorem 1.1 and not Theorem 1.2 these sections can be read while assuming that p = 1 and that all metrics mentioned are bounded.
Another space we will need is the space of probability measures on A × B which are concentrated on the graph of a measurable function, i.e.

F(A → B) := { π ∈ P(A × B) : π(b = f(a)) = 1 for some measurable f : A → B }.

The space F(A → B) carries the subspace topology / the restriction of the metric on P(A × B).

2.3. Maps between spaces.
Assuming Convention 1, when f : X → Y is a continuous map, the pushforward under f, i.e. the map which sends µ ∈ P(X) to the measure f_#(µ) = µ ∘ f^{−1} ∈ P(Y), is again continuous. Similarly, assuming Convention 2, when f : X → Y is a Lipschitz-continuous map between metric spaces, the pushforward under f is also Lipschitz-continuous from P(X) to P(Y).
We will use P(f) : P(X) → P(Y) to denote the pushforward under f, to emphasize the fact that P is a functor, i.e. that it sends a 'nice' (read: continuous/Lipschitz) map to a map which is again 'nice', and that P(f ∘ g) = P(f) ∘ P(g) and P(1_X) = 1_{P(X)} (where 1_X is the identity function on X).
For a product of spaces X × Y, the projection onto X will alternatively be denoted by either proj_X or by the same letter that is used for the space, but in a non-calligraphic font, i.e. X : X × Y → X.
If µ is defined on some product ∏_i X_i of spaces, we also introduce a shorthand notation for marginals of µ, i.e. for the pushforward of µ under the projection onto the product of some subset of the original factors. If f : A → B and g : A → C are functions, we write (f, g) for the function a ↦ (f(a), g(a)).
If we want to specify a map from, say, A × B × C to X but we only really care about one of the variables, we will use an underscore '_' instead of naming the unused variables, as in (a, _, _) ↦ f(a). Similarly, when integrating we may also use _ to denote unused variables, i.e. for µ ∈ P(X × Y) we might write ∫ f(y) dµ(_, y).
Two important maps will be the disintegration map dis_A^B and its left inverse int_A^B. The disintegration map dis_A^B : P(A × B) → P(A × P(B)) sends a probability µ on A × B to the measure

dis_A^B(µ) := the law of (a, µ_a), where a ∼ proj_A(µ) and (µ_a)_{a∈A} is a disintegration of µ w.r.t. A.

The disintegration map is measurable (see for example [14, Proposition 7.27]) and injective. It is not continuous w.r.t. the weak topologies or the Wasserstein metrics. When writing dis_A^B we will not insist that A has to be the first factor in the domain of dis_A^B; A and B may even be products themselves, whose factors are intermingled in the product that makes up the domain of dis_A^B. Also, we may sometimes omit B, only specifying the variable(s) w.r.t. which we are disintegrating, not the ones which are left over, as in dis_A.
The map int_A^B : P(A × P(B)) → P(A × B) goes in the opposite direction: it sends α ∈ P(A × P(B)) to the measure int_A^B(α) determined by

∫ f d(int_A^B(α)) = ∫∫ f(a, b) dβ(b) dα(a, β)

for bounded measurable f; in contrast to dis_A^B, it is continuous. The pair dis_A^B, int_A^B enjoy the following properties:

• int_A^B ∘ dis_A^B = 1_{P(A×B)}. This is a direct consequence of the definition of the disintegration.
• dis_A^B maps P(A × B) into F(A → P(B)).
• dis_A^B and int_A^B are inverse bijections between P(A × B) and F(A → P(B)).

The last two properties are just a reformulation of the known fact that the disintegration of a measure is almost-surely uniquely defined.
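On finite supports the pair dis, int is just bookkeeping with conditional weights. The sketch below is our own illustration (the functions are named `dis_` and `int_` since `int` is a Python builtin); it verifies, with exact rational arithmetic, that int after dis is the identity.

```python
from collections import defaultdict
from fractions import Fraction

def dis_(mu):
    """dis: a finitely supported measure on A x B (dict (a, b) -> weight)
    becomes its A-marginal together with the disintegration kernel
    a -> mu_a, i.e. a measure on A x P(B) sitting on the graph of a map."""
    marg = defaultdict(Fraction)
    cond = defaultdict(lambda: defaultdict(Fraction))
    for (a, b), w in mu.items():
        marg[a] += w
        cond[a][b] += w
    return {a: (marg[a], {b: w / marg[a] for b, w in cond[a].items()})
            for a in marg}

def int_(graph):
    """int: glue marginal and kernel back into a measure on A x B."""
    mu = defaultdict(Fraction)
    for a, (wa, kernel) in graph.items():
        for b, wb in kernel.items():
            mu[(a, b)] += wa * wb
    return dict(mu)

mu = {(0, 'u'): Fraction(1, 6), (0, 'v'): Fraction(1, 3), (1, 'u'): Fraction(1, 2)}
print(int_(dis_(mu)) == mu)  # True: int after dis is the identity
```

The discontinuity of dis, by contrast, is an infinite-support phenomenon (conditional laws can jump under weak perturbations of µ) and is not visible in this finite sketch.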

2.4. Processes which take values in different spaces at different times.
Already in the introduction, in Section 1.8.1, we found it convenient to extend the definition of AW_p to products of not necessarily equal Polish spaces 'in the obvious way'. To accommodate reapplication of concepts in a similar style as seen there, we make the minor generalization of letting all the processes we talk about take values in different spaces at different times; typically at time t they will take values in a space X_t. Denote by X_j^k := ∏_{i=j}^k X_i and define X := X_1^N.

3. Hellwig's W_p-information topology is equal to the topology induced by AW_p

In this section we show (3) = (1) in Theorem 1.2. We will do so by identifying both topologies as initial topologies w.r.t. a single map each, i.e. finding a space which is homeomorphic to P(X) with Hellwig's (W_p-)information topology and one which is homeomorphic to P(X) with the topology induced by AW_p, and then showing that these spaces are homeomorphic in the right way. As an auxiliary tool we will introduce another topology on P(X), which was not mentioned in the introduction but which is very similar to Hellwig's. The proof strategy can be summarized by saying that we want to show that the following diagram is commutative.
Here N is the map which induces the same topology as AW_p, I induces Hellwig's topology and Ĩ induces what we will call the reduced information topology. We shortly restate their definitions below. The maps K, M, H are still to be found.
As introduced in Section 1.3, Hellwig's (W_p-)information topology is induced by a family of maps I_t : P(X) → P(X_1^t × P(X_{t+1}^N)), given by I_t(µ) := (k_t)_#(µ) with k_t(x_1, . . . , x_N) := (x_1, . . . , x_t, µ_{x_1,...,x_t}). Equivalently, the information topology is the initial topology w.r.t. the single map

I := (I_1, . . . , I_{N−1}) : P(X) → ∏_{t=1}^{N−1} P(X_1^t × P(X_{t+1}^N)).

We saw in Section 1.8 that AW_p is induced by an embedding N : P(X) → P(X_{1:N}).
Rephrasing the definition there, N is obtained by defining recursively, from t = N − 1 down to t = 1, maps

N_t : P(X_t^N) → P(X_{t:N}), N_N := 1_{P(X_N)}, N_t := P(1_{X_t} × N_{t+1}) ∘ dis_{X_t},

and setting N := N_1. In fact, because dis maps into the space of measures concentrated on the graph of a function, N also maps into a smaller space, which we call F_1, and which is again defined by recursion down from N − 1 to 1:

F_t := F(X_t → F_{t+1}),

i.e. F_1 is P(X_{1:N}) with all occurrences of P(· × ·) replaced by F(· → ·). Remember that we had X_{t:N} = X_t × P(X_{t+1:N}). For convenience, let us also define F_N := P(X_{N:N}) = P(X_N). The fact that dis maps into F(· → ·), and that therefore N maps into F_1, is a consequence of Lemma 3.1 below. Finally, Ĩ is defined as follows:

Ĩ_t := dis_{X_1^t} ∘ P(proj_{X_1^{t+1}}).

I.e. the reduced information topology, like the information topology, makes continuous predictions about the behaviour of the process after time t given information about its behaviour up to time t; only now we are just predicting what the process will do in the next step, not for the rest of time.
I, Ĩ and N are injective and therefore bijections onto their respective images. This means that the values of the maps K, M, H in diagram (11), as functions between sets, are really already prescribed. The task consists in finding a representation for them which makes it clear that they are continuous.
This means that for α-a.a. (a, β) we have ∫ 1_{g(a,b) ≠ y} dβ(b, y) = 0, i.e. β is concentrated on the graph of the function b ↦ g(a, b).
To see that any α ∈ F(A → P(B)) arises in this way, one argues along the same lines.

3.1. Homeomorphisms. We give a plain-language description of what follows in this section. The continuity of M will be quite trivial, because we are just discarding information.
The components K_k : F_1 → F(X_1^k → P(X_{k+1}^N)) of the map K are obtained by 'folding' both the 'head' and the 'tail' of F_1 using iterated application of the map int.
By continuity of int, it is easy to see that K_k is continuous. To show that the map K with the components K_k is the map we are looking for, we basically show that N^{−1} is yet another way of 'folding' all of F_1 using int to arrive at P(X). As I^{−1} is also built from int, showing (12) amounts to showing that these two different ways of 'folding' (first the head and the tail and then, in a last step, the junction between k and k + 1 on the one hand; from front to back on the other hand) do the same thing. This may be intuitively clear to the reader. The proof works by repeated application of Lemma 3.5, which represents one step of 'the folding order doesn't matter'. Using Lemma 3.5 the proof is completely analogous to the proof that for an operation ⋆ satisfying (a ⋆ b) ⋆ c = a ⋆ (b ⋆ c), i.e. for an associative operation, one has for instance

a ⋆ (b ⋆ (c ⋆ d)) = ((a ⋆ b) ⋆ c) ⋆ d.

As we know, for such an operation any way of parenthesizing the multiplication of N elements gives the same result. An analogous statement holds for int, though we do not formally state or prove this. Finally, in Lemma 3.9, using Lemma 3.8 as the main ingredient, we prove the 'hard direction', i.e. that H is continuous. If the continuity of M and K as informally described here seems obvious to the reader, they may wish to skip ahead to Lemma 3.8 and Lemma 3.9.
Remark 3.2. The reader interested in working out the details and analogies between 'folding' using int and associative binary operations might be interested in reading about monads in the context of category theory first. (See for example Chapter VI in [45].) In fact, (P, η, µ) forms a monad, where η_X : X → P(X) sends an element x of X to the Dirac measure at x and µ_X : P(P(X)) → P(X) averages a measure on measures, i.e. µ_X(Λ)(A) = ∫ λ(A) dΛ(λ). This monad is studied in a little more detail in [27]. int can be obtained from µ and a tensorial strength t_{A,B} : A × P(B) → P(A × B) in the sense described for example in [47].
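To make the monad structure concrete, here is a small sketch (our own illustration, not part of the text) of η and µ on finitely supported measures represented as Python dictionaries; the names eta, mu and freeze are ours:

```python
# eta sends a point to the Dirac measure at it; mu averages a measure on
# measures.  Finitely supported measures are {outcome: probability} dicts.
from collections import defaultdict

def eta(x):
    return {x: 1.0}

def mu(Lambda):
    out = defaultdict(float)
    for lam, p in Lambda.items():          # lam is a frozen inner measure
        for x, q in dict(lam).items():
            out[x] += p * q
    return dict(out)

def freeze(m):
    """Make a measure hashable so it can be an outcome of another measure."""
    return tuple(sorted(m.items()))

# Monad law mu . P(eta) = id: lift every point to its Dirac measure,
# flatten, and recover the original measure.
m = {'a': 0.25, 'b': 0.75}
assert mu({freeze(eta(x)): p for x, p in m.items()}) == m
```

The other monad laws (unit on the outside, and associativity of µ, which is the abstract version of Lemma 3.5 below) can be checked in the same way.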
To show that M is continuous we will need the following lemma.

Lemma 3.3. dis^B_A is natural in B, i.e. for f : B → B′ the following diagram commutes.
Proof. This is just a straightforward calculation using the definitions.
Applying Lemma 3.3 with A = X^k, B = X_{k+1:N}, B′ = X_{k+1} and f = proj_{X_{k+1}} : X_{k+1:N} → X_{k+1} we get the required identity. There is an analogue of Lemma 3.3, Lemma 3.4 below, which we list here for completeness. Proof. Again this is just calculation.
We already implicitly used the 'in particular' part of Lemma 3.4 when we said that N can be regarded both as a map into P(X_{1:N}) and into F_1, but the use there seemed too trivial to warrant much mention. There will be more such tacit uses. Now we show that K is continuous. We claim that it can be written as displayed below, letting ∘ denote composition of functions.
To prove this we will repeatedly apply the following lemma. Lemma 3.5 (int is 'associative'). int satisfies the following relation: These maps can be seen in the following commutative diagram.
Proof. This is just expanding the definition. Both maps send a measure α ∈ P(A × P(B × P(C))) to the measure µ given below.

Lemma 3.6. The following relation holds.
Proof. Again, this is just repeated application of Lemma 3.5. Below we define T_l for N ≥ l ≥ k and show (14) for all N ≥ l ≥ k by showing T_l = T_{l−1} for all N ≥ l > k. The left hand side of (14) is the left hand side of (13) with the common tail of the left and right side in (13) dropped. T_k will be the right hand side of (13) with the common part dropped.
Here we regard an empty composite (i.e. one with r < s) as the identity function. For l = N the first factor is an empty product and therefore (14) clearly holds for l = N. To get from T_l to T_{l−1} we leave the first factor alone and apply Lemma 3.5 with A = X^k, B = X_{k+1:l−1} and C = X_{l:N}. This transforms the corresponding factor, and therefore T_l into T_{l−1}.
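The 'folding order doesn't matter' step of Lemma 3.5 can be checked numerically on finitely supported measures. The following sketch is our own illustration (int_ is a finite-support stand-in for int): it folds a doubly nested measure inner-first and outer-first and compares the results.

```python
from collections import defaultdict

def freeze(m):
    """Make a finitely supported measure hashable."""
    return tuple(sorted(m.items()))

def int_(alpha):
    """int : P(A x P(B)) -> P(A x B) on finite supports: the atoms of
    alpha are pairs (a, beta); the measure component beta is averaged out."""
    out = defaultdict(float)
    for (a, beta), p in alpha.items():
        for b, q in dict(beta).items():
            out[(a, b)] += p * q
    return dict(out)

# alpha in P(A x P(B x P(C))): one outer atom, nested twice.
gamma1 = freeze({'c1': 1.0})
gamma2 = freeze({'c1': 0.5, 'c2': 0.5})
beta = freeze({('b1', gamma1): 0.5, ('b2', gamma2): 0.5})
alpha = {('a1', beta): 1.0}

# Fold the inner nesting first ...
inner = int_({(a, freeze(int_(dict(b)))): p for (a, b), p in alpha.items()})
inner = {(a, b, c): p for (a, (b, c)), p in inner.items()}

# ... or fold the outer nesting first: the results agree (Lemma 3.5).
outer = {((a, b), g): p for (a, (b, g)), p in int_(alpha).items()}
outer = {(a, b, c): p for ((a, b), c), p in int_(outer).items()}

assert inner == outer
```

Up to the reassociation of coordinates, this is exactly the associativity used repeatedly in the proof above.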
Proof. Prepending N to (13) gives the claim. Now we will show that H is continuous. We will postpone the proof of Lemma 3.8 below, which is the crucial non-bookkeeping ingredient in the proof of Lemma 3.9, until the end of this section. The methods used in the proof of Lemma 3.8 differ significantly from the rest of this section and make use of the concept of the modulus of continuity for measures, and results relating to it, introduced in the companion paper [23] to this one.
The function is continuous.
Clearly, as a function between sets, J^Y_{A,B}(µ′, µ) only depends on µ. But, as we know, dis^{B×Y}_A is not continuous. Only when we refine the topology on the source space, which we encode by regarding J^Y_{A,B} as a map from the above subset of a product space, does it become continuous.

Lemma 3.9. H is continuous.
Proof. We will inductively define H k : I P X → P X k × P k+1 (again down from N − 1 to 1) so that they will be continuous by construction (and by virtue of Lemma 3.8). Also by construction, we will have H k • I = N k . H will be H 1 so that H • I = N .
Set H_{N−1} := proj_{N−1}, where proj_k is the projection from ∏_{k=1}^{N−1} F(X^k, P(X_{k+1})) onto the k-th factor. For this to be well-defined we need to check that for µ ∈ I[P(X)] we have int^{X_{k+1}}_{X^k}(proj_k(µ)) = P(proj_{X_{k+1}})(H_{k+1}(µ)), i.e. for ν ∈ P(X) we want int^{X_{k+1}}_{X^k}(proj_k(I(ν))) = P(proj_{X_{k+1}})(H_{k+1}(I(ν))). The composite of the maps on the left-hand side is equal to int^{X_{k+1}}_{X^k} applied to the corresponding composite. On the right-hand side we get, by induction hypothesis, the analogous expression. Using that P(proj_A) ∘ dis^B_A = P(proj_A) we see, for l ≥ k + 1, that by induction (16) is also equal to P(proj_{X_{k+1}}) applied to that expression. As a composite of continuous maps H_k is clearly continuous. (This is where we use Lemma 3.8.) As a map between sets H_k is the required one, by induction hypothesis and the definition of N_k.

Proof of Lemma 3.8.
In this part we prove Lemma 3.8. Here we use several of the ideas developed in the companion paper [23]. In particular we will need [23, Lemma 4.2] which we reproduce below.

Lemma 3.8. Let
The function is continuous.
As noted in the proof of Lemma 3.1 we know that for μ̄-a.a. (a, μ̂) the measure μ̂ is concentrated on the graph of the function f_a (and similarly for ν̄). This together with P(1_A × P(proj_B))(μ̄) = µ (which is a consequence of (15)) implies that ∫ h dμ̄ = ∫ h(a, P(1_B, f_a)(b)) dµ(a, b) (again similarly for ν̄).
From this we see that the measure γ̄ is an element of Cpl(μ̄, ν̄).
We may measurably select almost-witnesses γ_b. Now, with γ ∈ Cpl(µ, ν) defined accordingly, the integral over the first two summands in (19) is less than min(δ^p, ε^p) by (18). By our choice of δ at the beginning this implies that the integral over the last summand is also less than ε^p, so that overall ρ(μ̄, ν̄)^p < 2ε^p.
As ε was arbitrary this concludes the proof.

The symmetrized causal Wasserstein distance SCW p
In this section we prove that the topology induced by SCW_p is sandwiched between Hellwig's W_p-information topology and the topology induced by AW_p, and is therefore, by what we have already seen in the previous sections, equal to both of them. Our arguments in this section make explicit use of metrics. The reader who is only interested in the simpler version of our main theorem, Theorem 1.1, may assume that p = 1 and that all metrics are bounded.
Remember the definition (20) of CW_p for µ, ν ∈ P(X). In proving this we will take a slightly roundabout route. First we will focus on the case where X = X_1 × X_2 is the product of just two spaces, i.e. where we have only two time points. Moreover, for expositional purposes, let us for the moment assume that X_1 and X_2 are both compact. Generalizing from this setting will not be very hard.
In the compact, two-time-point case we will show equality of the two topologies in question by extending both to a larger (compact) space and showing equality of the topologies on that larger space.
In more detail: when there are only two time points, Hellwig's W_p-information topology and the topology induced by AW_p trivially coincide. Both are induced by embedding P(X_1 × X_2) into P(X_1 × P(X_2)) via dis^{X_2}_{X_1}. The latter space carries its standard metric ρ_{P(X_1 × P(X_2))}, which, as was already established in Theorem 1.4 in Section 1.8 of the introduction, is an extension of AW_p. To highlight this connection, in this section we will also refer to that metric as AW_p. As a reminder, W_p denotes the usual Wasserstein distance (on P(X_2) in this case). We will find an extension of CW_p to P(X_1 × P(X_2)), which still satisfies all properties of a metric except for symmetry and which is dominated by AW_p. Symmetrizing this extension gives a metric (which we will call SCW_p). The identity function from P(X_1 × P(X_2)) topologized with AW_p to P(X_1 × P(X_2)) topologized with SCW_p will then be a continuous bijection from a compact space (this is where we use compactness of X_1, X_2) to a Hausdorff space, i.e. a homeomorphism. The next subsection will be devoted to finding an expression for the extension of CW_p to P(X_1 × P(X_2)) and proving that it satisfies all the properties mentioned above.
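Before carrying out this program it may help to see numerically why, already with two time points, AW_p is strictly stronger than W_p. The following sketch is our own illustration (equal-weight atoms only; couplings are brute-forced over permutations, which suffices for equal weights): a sequence of two-step processes converges weakly but stays at nested distance at least 1.

```python
import itertools
from math import lcm

def w1(xs, ys):
    """W_1 between equal-weight empirical measures on the real line:
    replicate to a common size and match sorted samples (exact in 1d)."""
    n = lcm(len(xs), len(ys))
    X = sorted(xs * (n // len(xs)))
    Y = sorted(ys * (n // len(ys)))
    return sum(abs(a - b) for a, b in zip(X, Y)) / n

def flat_w1(pm, pn):
    """Plain W_1 between equal-weight path measures on R^2 (l1 path cost)."""
    return min(sum(abs(pm[i][0] - pn[j][0]) + abs(pm[i][1] - pn[j][1])
                   for i, j in enumerate(perm)) / len(pm)
               for perm in itertools.permutations(range(len(pn))))

def aw1(pm, pn):
    """Nested distance AW_1: couple the first coordinates, paying in
    addition the W_1 distance between the conditional laws of the
    second coordinate."""
    def kernel(paths):
        k = {}
        for x1, x2 in paths:
            k.setdefault(x1, []).append(x2)
        return k
    km, kn = kernel(pm), kernel(pn)
    return min(sum(abs(pm[i][0] - pn[j][0]) +
                   w1(km[pm[i][0]], kn[pn[j][0]])
                   for i, j in enumerate(perm)) / len(pm)
               for perm in itertools.permutations(range(len(pn))))

# mu_n: the tiny first step +-1/n reveals the second step; mu: it does not.
mu = [(0.0, 1.0), (0.0, -1.0)]
for n in (1, 10, 100):
    mu_n = [(1.0 / n, 1.0), (-1.0 / n, -1.0)]
    print(n, flat_w1(mu_n, mu), aw1(mu_n, mu))  # W_1 -> 0, AW_1 stays >= 1
```

This is the standard kind of example showing that information carried by the first coordinate is invisible to the weak topology.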
Remark 4.1. When X_1 contains no isolated points, because P(X_1 × P(X_2)) is the metric completion of P(X_1 × X_2) w.r.t. AW_p and because the above properties imply that the extension of CW_p is (uniformly) continuous w.r.t. AW_p, we have already uniquely identified this extension. Still, we want to find an expression that allows us to work with it and in particular that allows us to prove that SCW_p is a metric and not just a pseudometric, i.e. that the induced topology is in fact Hausdorff. This is exactly what we gain from assuming compact base spaces and passing to the completion: instead of having to find a lower bound for SCW_p(µ, ν) in terms of AW_p(µ, ν) (and possibly µ) we now just have to prove that µ ≠ ν implies SCW_p(µ, ν) > 0.

4.1. Extending the causal 'distance'. So now we are working with two Polish metric spaces X_1, X_2. Remember that we denote the 'canonical process' on X := X_1 × X_2 by (X_i)_{i=1,2}, i.e. X_i : X → X_i is the projection onto the i-th coordinate.
To differentiate between the different roles that X may play, i.e. whether it is the space for the left measure µ or the right measure ν when measuring the 'distance' CW_p(µ, ν), we will also refer to X, X_i by the aliases Y, Y_i respectively. (And later Z, Z_i as well.) Analogously, Y_i : Y → Y_i and Z_i : Z → Z_i denote the coordinate projections. In this section we will repeatedly make use of the following construction: ∫ h d(µ ⊗_B ν) = ∫∫ h(a, b, c) dν_b(c) dµ(a, b), where b ↦ ν_b is a disintegration of ν w.r.t. B and similarly for µ.
We further define µ ⨟_B ν := P(proj_{A×C})(µ ⊗_B ν). If µ is a probability on A × B and ν is a probability on B × C, another way of saying what µ ⊗_B ν is, is to state that it is a probability on A × B × C such that the law of (A, B) is equal to µ, the law of (B, C) is equal to ν (where per our convention A is the projection onto A, etc.), and A is conditionally independent of C given B.
(For the notion of conditional independence see for example [21, Definition II.43].) Another helpful intuition comes from looking at the case where µ ∈ F(A, B) is concentrated on the graph of some measurable function f : A → B and ν ∈ F(B, C) is concentrated on the graph of a measurable function g : B → C. µ ⨟_B ν is then concentrated on the graph of g ∘ f : A → C. In some contexts g ∘ f is also written as f ⨟ g, which is where we borrowed the symbol from. Remark 4.4. We will often encounter the situation that one of the factors A, B or C in Definition 4.2 is itself a product of spaces and the individual factors may not always be so nicely sorted. We will rely on naming in the subscript the space(s) along which to join the measures µ and ν. For example if µ ∈ P(A × B_1 × B_2) we write µ ⊗_{B_1,B_2} ν to refer to the measure that we get when in (23) we use (b_1, b_2) ∈ B_1 × B_2 as the middle variable b. We will not be systematic about the order of the factors in the resulting product space on which e.g. µ ⊗_{B_1,B_2} ν is a measure, again relying on naming our spaces for disambiguation.
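For finitely supported measures the composition µ ⊗_B ν and the graph intuition above can be checked directly. A minimal sketch (our own; the function compose and the toy measures are illustrative):

```python
from collections import defaultdict

def compose(mu, nu):
    """mu (x)_B nu for finitely supported mu on A x B and nu on B x C:
    the law of (A, B) is mu, the law of (B, C) is nu, and A and C are
    conditionally independent given B.  Requires the B-marginals of mu
    and nu to agree."""
    mu_B = defaultdict(float)
    nu_B = defaultdict(float)
    for (a, b), p in mu.items():
        mu_B[b] += p
    for (b, c), p in nu.items():
        nu_B[b] += p
    assert all(abs(mu_B[b] - nu_B[b]) < 1e-12 for b in set(mu_B) | set(nu_B))
    out = defaultdict(float)
    for (a, b), p in mu.items():
        for (b2, c), q in nu.items():
            if b2 == b:
                out[(a, b, c)] += p * q / nu_B[b]   # mu(a,b) * nu(c | b)
    return dict(out)

# Graph intuition: if mu is the graph of f : A -> B and nu the graph of
# g : B -> C, the composition is carried by the graph of g . f.
f = {(1, 'x'): 0.5, (2, 'y'): 0.5}          # f(1) = 'x', f(2) = 'y'
g = {('x', 'u'): 0.5, ('y', 'v'): 0.5}      # g('x') = 'u', g('y') = 'v'
assert compose(f, g) == {(1, 'x', 'u'): 0.5, (2, 'y', 'v'): 0.5}
```

Projecting out the middle coordinate then gives the measure concentrated on the graph of g ∘ f.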
For future reference we paraphrase the definition of a causal transport plan given in (3) in the introduction.

Lemma 4.5. Let µ be a measure on X = X_1 × X_2 and ν be a measure on Y = Y_1 × Y_2. Then γ ∈ Cpl(µ, ν) is a causal transport plan from µ to ν iff, under γ, X_2 and Y_1 are conditionally independent given X_1.
Proof. By the conditional version of Jensen's inequality applied to the convex function (x̂, ŷ) ↦ W_p(x̂, ŷ)^p we have

Remark 4.10.
For the reader who may be sceptical of whether Jensen's inequality holds in this rather unusual setting, where we have a convex function and conditional expectations on spaces of measures, we remark that for the Wasserstein distance in particular this is very easy to check. The proof consists of integrating transport plans between X̂_2 and Ŷ_2 w.r.t. the distribution of these conditioned on Y (in this case) to get transport plans between E_γ(X̂_2 | Y) and E_γ(Ŷ_2 | Y).

Lemma 4.13. The infimum in (25) is attained.
Proof. This is an application of [8, Theorem 1.2]. For self-containedness, and because it is a nice application of the nested distance, we also sketch the argument here. We know that Cpl(µ, ν) is compact. The problem is not (lower semi-)continuous. But we may switch to a topology which is better adapted to the problem at hand, namely the two-time-point AW_p-topology. In this case the space for the first time point is Y_1 × P(Y_2) and that for the second is X_1 × P(X_2). In effect that means that instead of γ ∈ Cpl(µ, ν) we are now looking at γ′ ∈ F(Y_1 × P(Y_2), P(X_1 × P(X_2))). The function that we are optimizing over can be written as Ĉ; C is a continuous function and so is Ĉ. Now dis_{Y_1 × P(Y_2)}(Cpl(µ, ν)) is not compact, but its closure is. So we can find a minimizer γ′ of Ĉ in this set. To return to Cpl(µ, ν), or more precisely dis_{Y_1 × P(Y_2)}(Cpl(µ, ν)), we can send γ′ to the distribution γ″ of (Y_1, Ŷ_2, E_{γ′}(X̂ | Y)). Because C is continuous and convex in its last argument, and by (the conditional version of) Jensen's inequality (which could again be proved 'by hand' here), Ĉ(γ″) ≤ Ĉ(γ′). int_{Y_1 × P(Y_2)}(γ″) is the sought-after minimizer of (25).

Lemma 4.14. Let µ, ν ∈ P(X_1 × P(X_2)). Then CW_p(µ, ν) = CW_p(ν, µ) = 0 implies µ = ν.
The random variables Ẑ_2, Ŷ_2, X̂_2 then form a martingale w.r.t. the filtration generated by Z, Y, X. The distribution of Ẑ_2 is equal to the distribution of X̂_2. Both of these statements remain true if we integrate some bounded measurable function w.r.t. our random variables, i.e. for any bounded measurable f : X_2 → R we have that (∫f dẐ_2, ∫f dŶ_2, ∫f dX̂_2) is a martingale and that the distribution of ∫f dẐ_2 is equal to the distribution of ∫f dX̂_2. But this means that we must have ∫f dẐ_2 = ∫f dŶ_2 = ∫f dX̂_2 a.s. (Lemma 4.15 below). As this is true for all f from a countable generator of the sigma-algebra on X_2, we have Ẑ_2 = Ŷ_2 = X̂_2 a.s.

Lemma 4.15. Let (X_1, X_2, X_3) be a bounded martingale in R. If the distribution of X_1 is equal to the distribution of X_3 then X_1 = X_2 = X_3 a.s.
Proof. This is a consequence of the strict version of Jensen's inequality applied to any everywhere strictly convex function. (Take for example x ↦ x².) Remark 4.16. The reason we took the detour of turning our probability-measure-valued martingale into a family of martingales on R and arguing on these is that this way we avoid having to exhibit a continuous, everywhere strictly convex function on P(X_2).
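For the choice x ↦ x² the strict-Jensen argument can be made completely explicit. The following display is our own elaboration of the proof just given:

```latex
% phi(x) = x^2; martingale increments are orthogonal:
\mathbb{E}[X_3^2]
  = \mathbb{E}\!\left[\bigl(X_1 + (X_2 - X_1) + (X_3 - X_2)\bigr)^2\right]
  = \mathbb{E}[X_1^2] + \mathbb{E}[(X_2 - X_1)^2] + \mathbb{E}[(X_3 - X_2)^2],
% since e.g. \mathbb{E}[(X_2 - X_1)X_1]
%   = \mathbb{E}\bigl[X_1\,\mathbb{E}[X_2 - X_1 \mid \mathcal{F}_1]\bigr] = 0.
% Equality of the laws of X_1 and X_3 forces
% \mathbb{E}[X_1^2] = \mathbb{E}[X_3^2], hence
\mathbb{E}[(X_2 - X_1)^2] = \mathbb{E}[(X_3 - X_2)^2] = 0,
% i.e. X_1 = X_2 = X_3 almost surely.
```

Boundedness of the martingale guarantees that all expectations are finite; the same computation underlies the argument for Ẑ_2, Ŷ_2, X̂_2 above.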
As a reminder:

Definition 4.17. For µ, ν ∈ P(X_1 × P(X_2)), SCW_p(µ, ν) := max(CW_p(µ, ν), CW_p(ν, µ)).

Theorem 4.18. SCW_p is a metric on P(X_1 × P(X_2)) satisfying the domination by AW_p stated above.

Proof. This follows from Lemma 4.11, Lemma 4.14 and Lemma 4.9.

Remark 4.19. As outlined at the beginning of this section we now know enough to conclude that the topology induced by SCW_p is equal to the topology induced by AW_p in the case where X_1 and X_2 are both compact. The non-compact case is not much harder. We need the following lemma.
Lemma 4.20. int_{X_1} is a contraction when we equip the source space with SCW_p and the target space with W_p. More specifically, for µ, ν ∈ P(X_1 × P(X_2)) inequality (29) holds.

Proof. We prove the second statement. Let µ ∈ P(X), ν ∈ P(Y). Given γ ∈ Cpl(µ, ν) and ε > 0, the task is to find a suitable competitor. We take inspiration from the discussion at the beginning of this section. Let Ξ : X × Y → P(X_2 × Y_2) be a measurable selector satisfying Ξ ∈ Cpl(E_γ(X̂_2 | Y), Ŷ_2) γ-a.s. together with an ε-optimality condition. The obvious choice for γ′, namely f ↦ E_γ E_Ξ(f(X_1, X_2, Y_1, Y_2)), will not work, because in general it gets the relationship between X_1 and X_2 wrong, i.e. its first marginal may not be int_{X_1}(µ). Instead we again define γ_L ∈ P(X_1 × X_2 × Y_1) and γ_R ∈ P(X_2 × Y_1 × Y_2) and set γ′ := γ_L ⊗_{X_2,Y_1} γ_R.
Clearly, if we can actually define γ′ as announced, then (30) will hold. It remains to check that γ_L and γ_R can actually be composed, i.e. that (X_2, Y_1) has the same distribution under γ_L and γ_R.
The step in the middle has its own Lemma 4.21 below.

Lemma 4.21. Let P be a probability on P(X) × Y, for Polish spaces X, Y. Let h : X × Y → R be a measurable function. Then E[E^{X̂}(h(X, Y)) | Y] = E^{E(X̂|Y)}(h(X, Y)), where E without superscript is the (conditional) expectation w.r.t. P and X̂ is the projection onto P(X).
Note that X is on both sides introduced by the expectation operator which carries a superscript, while Y may on both sides be interpreted as coming from the outermost context. On the right hand side Y may also be seen as having been introduced by the outermost conditional expectation operator. (As this operator conditions on Y this is the same thing.) Proof. Both sides are clearly Y-measurable. We prove that for h(x, y) = f(x)g_1(y), multiplying by g_2(Y) and taking expectations gives the same result. By the definition of the conditional expectation we obtain a first identity; applying the continuous linear function γ ↦ E_γ(f(X)) this gives a second. Again by the definition of the conditional expectation we conclude, where for the third equality we plugged in the previous equation.
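Lemma 4.21 can be sanity-checked on finite supports, where both sides reduce to weighted sums. A small sketch (our own; P, h and the helper functions are illustrative):

```python
from collections import defaultdict

def freeze(m):
    return tuple(sorted(m.items()))

# P: weights on pairs (m, y), where m is a frozen finite measure on X.
P = {(freeze({'x1': 0.5, 'x2': 0.5}), 'y1'): 0.3,
     (freeze({'x1': 1.0}),            'y1'): 0.2,
     (freeze({'x2': 1.0}),            'y2'): 0.5}
h = {('x1', 'y1'): 1.0, ('x2', 'y1'): 4.0,
     ('x1', 'y2'): 2.0, ('x2', 'y2'): 3.0}

def lhs(y):
    """E[ E^{Xhat}(h(X, Y)) | Y = y ]: first integrate h against the random
    measure Xhat, then take the conditional expectation given Y."""
    num = den = 0.0
    for (m, yy), p in P.items():
        if yy == y:
            num += p * sum(q * h[(x, y)] for x, q in dict(m).items())
            den += p
    return num / den

def rhs(y):
    """E^{E(Xhat | Y = y)}(h(X, y)): first average the measure component,
    then integrate h against the mean measure."""
    mean = defaultdict(float)
    den = 0.0
    for (m, yy), p in P.items():
        if yy == y:
            den += p
            for x, q in dict(m).items():
                mean[x] += p * q
    return sum(q / den * h[(x, y)] for x, q in mean.items())

for y in ('y1', 'y2'):
    assert abs(lhs(y) - rhs(y)) < 1e-12
```

Both sides agree by linearity, which is the content of the lemma in this finite setting.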
Alternative proof of Lemma 4.20 when X 1 has no isolated points. When the space X 1 has no isolated points one can show that the space F (X 1 P(X 2 )) is dense in P(X 1 × P(X 2 )). This allows for a shorter proof of Lemma 4.20: By the original definition (20) of CW p on the space P(X 1 × X 2 ) the inequality (29) holds on F (X 1 P(X 2 )) × F (X 1 P(X 2 )). Both CW p and (µ, ν) → W p (int X1 (µ), int X1 (ν)) are uniformly continuous on P( X )×P( X ) w.r.t. some product metric of AW p with itself. F (X 1 P(X 2 )) is dense in P( X ), and therefore F (X 1 P(X 2 )) × F (X 1 P(X 2 )) is dense in P( X ) × P( X ). This implies that (29) holds on all of P( X ) × P( X ).

Theorem 4.22.
The topology induced by SCW_p on P(X_1 × P(X_2)) is equal to the topology induced by AW_p on that space.
Proof. As both topologies are metric and therefore first-countable, we may argue with sequences. Let (µ_n)_n be a sequence in P(X_1 × P(X_2)). As SCW_p(µ_n, µ) ≤ AW_p(µ_n, µ), if (µ_n)_n converges to µ w.r.t. AW_p it also converges to µ w.r.t. SCW_p. Now assume that a sequence (µ_n)_n in P(X_1 × P(X_2)) converges to µ w.r.t. SCW_p. We will show that every subsequence of (µ_n)_n has a subsequence which converges to µ w.r.t. AW_p. Our assumption implies that the set K := {µ_n | n ∈ N} is relatively compact. As int_{X_1} is continuous as a map from P(X_1 × P(X_2)) with the topology induced by SCW_p to P(X_1 × X_2) with the topology induced by W_p (Lemma 4.20), we have that int_{X_1}[K] = {int_{X_1}(µ_n) | n ∈ N} is also relatively compact. By Lemma 1.6/[23, Lemma 3.3] this implies that K is relatively compact in P(X_1 × P(X_2)) with the topology induced by AW_p. Now let (µ_{n_k})_k be some subsequence of (µ_n)_n. As K is relatively compact we can find a subsequence (µ_{n_{k_j}})_j of (µ_{n_k})_k which converges w.r.t. AW_p to some µ′ ∈ P(X_1 × P(X_2)). As SCW_p(µ_{n_{k_j}}, µ′) ≤ AW_p(µ_{n_{k_j}}, µ′), this sequence also converges to µ′ w.r.t. SCW_p.
But (µ_{n_{k_j}})_j also converges to µ w.r.t. SCW_p. Because the topology induced by SCW_p is Hausdorff (Lemma 4.14), we must have µ′ = µ, i.e. (µ_{n_{k_j}})_j converges to µ w.r.t. AW_p. Now we return to the general case of N time points.
Theorem 4.23. The topology induced by SCW p on P X is equal to Hellwig's W p -information topology and to the topology induced by AW p .
Proof. As every bicausal transport plan between µ and ν can be interpreted as a causal transport plan from µ to ν and also as a causal transport plan from ν to µ, we have that SCW_p(µ, ν) ≤ AW_p(µ, ν). This means that the identity from P(X) with the topology induced by AW_p to P(X) with the topology induced by SCW_p is continuous. For the other direction we show that the identity from P(X) with the topology induced by SCW_p to P(X) with the W_p-information topology is continuous, i.e. we show that each of the maps I_t is continuous when P(X) gets the topology induced by SCW_p. If µ, ν ∈ P(X) and γ ∈ Cpl(µ, ν) is causal, then, in particular, γ is 'causal at the time step from t to t + 1', i.e. γ is causal when regarded as a coupling between µ and ν on P(X^t × X_{t+1:N}). This means that if we define SCW′_p like SCW_p, but only require causality based on the decomposition of X as X^t × X_{t+1:N}, then SCW′_p(µ, ν) ≤ SCW_p(µ, ν), i.e. the identity from P(X) with the topology induced by SCW_p to P(X) with the topology induced by SCW′_p is continuous. By Theorem 4.22 the map dis^{X_{t+1:N}}_{X^t} is continuous when we equip P(X^t × X_{t+1:N}) with the topology induced by SCW′_p. Now I_t is continuous as a composite of continuous maps.

Aldous' extended weak convergence
In this section we show that Aldous' extended W_p-/weak topology is equal to Hellwig's (W_p-)information topology.
We recall and paraphrase here the definition, already given in the introduction, of Aldous' topology.
be the value of a (classical) disintegration of µ w.r.t. the first j coordinates at (x_1, . . . , x_j). The extended W_p-/weak topology on P(X) is the initial topology w.r.t. E.

Remark 5.2.
Reasonable people may disagree about whether the most faithful / useful transcription of Aldous' definition should include the factors j = 0 and j = N in the above product of spaces. When including j = N, as we did, the last component is simply the Dirac measure at the path itself. We leave it as an exercise to the reader to check that either or both factors may be dropped in the definition of E without affecting the resulting topology on P(X).

Proof. We construct continuous maps A and A_k. The first equality above implies that the identity on P(X) is continuous from the extended weak topology to the information topology; the second implies that it is continuous in the other direction. A_k is very simple. We just need to select the right factors and then discard the unnecessary δ_{(x_i)_{i=1}^k} part of the measure component. Formally, this is clearly continuous. We construct A recursively, by constructing A_{m+1} as a composite of continuous maps. We need some helper functions. Given A_m satisfying the induction hypothesis, we set A_{m+1} accordingly, where s_{m+1} is the obvious permutation of the coordinates to get the factors into the right order. A_{m+1} is continuous because by [23, Theorem 4.1] ⊗_{X^m} is continuous when one of the arguments is an element of some F(B, C). That (31) still holds for m + 1 is a straightforward calculation. This way we get to A_{N−1}. Finally, we set A accordingly.
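For finitely supported path laws the components of E are classical disintegrations and can be computed directly. A minimal sketch (our own; prediction_process is an illustrative name):

```python
from collections import defaultdict

def prediction_process(paths, j):
    """For a finitely supported path law (dict: path -> probability) return,
    for each path, the conditional law of the whole path given the first j
    coordinates: a classical disintegration, evaluated along the path."""
    cond = defaultdict(dict)
    weight = defaultdict(float)
    for path, p in paths.items():
        weight[path[:j]] += p
    for path, p in paths.items():
        cond[path[:j]][path] = p / weight[path[:j]]
    return {path: cond[path[:j]] for path in paths}

mu = {(0, 1, 1): 0.25, (0, 1, -1): 0.25, (0, -1, -1): 0.5}
pp1 = prediction_process(mu, 1)   # given X_1 = 0: the full law of mu
pp2 = prediction_process(mu, 2)   # given (X_1, X_2) = (0, 1): a 50/50 law
assert pp2[(0, 1, 1)] == {(0, 1, 1): 0.5, (0, 1, -1): 0.5}
assert pp2[(0, -1, -1)] == {(0, -1, -1): 1.0}
```

The extended weak topology compares the joint laws of the path together with these prediction measures, rather than the path laws alone.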

Bounded vs unbounded metrics
Because we will need it in the next section we interject here a proof of Lemma 1.3, which we restate below. Proof of Lemma 1.3. We provide the proof only for Hellwig's topology, i.e. (3) of Theorem 1.2 and Theorem 1.1, respectively. As we have already seen in the previous sections, the topologies (2)-(4) are equivalent, and the result therefore carries over to them. The (W_p-)optimal stopping topology, (5), is treated below. It is clear that convergence w.r.t. the W_p-information topology implies convergence in Hellwig's information topology plus convergence of p-th moments. For the reverse implication, let 1 ≤ t ≤ N − 1, and denote by A := X^t the first t and by B := X_{t+1:N} the last N − t coordinates. Now assume that (µ_n)_n converges to µ in Hellwig's information topology and that the p-th moments converge. The classical (not adapted) version of the very lemma we prove here implies that µ_n → µ in W_p. Every subsequence of (dis^B_A(µ_n))_n therefore has a subsequence (dis^B_A(µ_{n_k}))_k which converges w.r.t. the topology on P_p(A × P_p(B)) (i.e. the one coming from nested Wasserstein metrics) to some µ′ ∈ P_p(A × P_p(B)). Because convergence in P_p(A × P_p(B)) is stronger than convergence in P(A × P(B)) (i.e. in the nested weak sense), we must also have (dis^B_A(µ_{n_k}))_k → µ′ in P(A × P(B)). But also, by assumption, (dis^B_A(µ_{n_k}))_k → dis^B_A(µ) in P(A × P(B)), and therefore µ′ = dis^B_A(µ).

Optimal Stopping
In this section we investigate the relation between the (W p -)optimal stopping topology and the adapted Wasserstein topology. Lemma 7.1 states that the topology induced by AW p ((1) of Theorem 1.2) is finer than the W p -optimal stopping topology. Lemma 7.5 states that the W p -optimal stopping topology is finer than the W p -information topology ((3) of Theorem 1.2). This will finish the proof of Theorem 1.2.
In order to show that the optimal stopping topology is finer than the W_p-information topology, we need to make a few preparations.

Lemma 7.3. Let A be a Polish space. Then the family
is convergence determining for the weak topology on P(P(A)), that is, a sequence of probability measures (µ n ) n in P(P(A)) converges weakly to a probability measure µ ∈ P(P(A)) if and only if F dµ n → F dµ for all F in (33).
This follows from the Stone-Weierstrass theorem in case of compact A and readily extends to general Polish spaces e.g. via Stone-Čech compactification.
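Lemma 7.4 below replaces G by one-hidden-layer expressions x ↦ Σ_i u_i σ(v_i · x + w_i); that such sums approximate continuous functions uniformly (Cybenko's theorem, used in its proof) is easy to see numerically. A sketch (our own illustration; the sampling ranges, feature count and target are arbitrary choices):

```python
import numpy as np

# Fit the outer weights u_i in  x -> sum_i u_i * sigma(v_i * x + w_i)
# by least squares, with random inner weights v_i, w_i; sigma = tanh.
# This illustrates the density statement, not the proof's construction.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 400)
target = np.cos(np.pi * x)

m = 200
v = rng.uniform(-5.0, 5.0, size=m)
w = rng.uniform(-5.0, 5.0, size=m)
features = np.tanh(np.outer(x, v) + w)            # shape (400, m)
u = np.linalg.lstsq(features, target, rcond=None)[0]
err = np.max(np.abs(features @ u - target))
print(err)   # small: a one-hidden-layer sum approximates cos uniformly
assert err < 0.05
```

With enough features the uniform error can be driven arbitrarily small, which is exactly the density in C(I, R) invoked below.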
Lemma 7.4. The family (34) is convergence determining for the weak topology on P(P(A)).
Proof. Let L, G, and (h_i)_{i≤L} be as in (33). Moreover, let m ∈ R be such that |h_i| ≤ m for all 1 ≤ i ≤ L and define I := [−m, m]^L. Then I ⊂ R^L is compact and (∫h_1 dµ, . . . , ∫h_L dµ) ∈ I for all µ ∈ P(A).
By the universal approximation result of Cybenko [20, Theorem 2], the set of functions of the form x ↦ Σ_{i=1}^m u_i σ(v_i · x + w_i) is dense in C(I, R) w.r.t. the supremum norm. As a result, it is enough to replace G in (33) by functions of this form. Evaluating such a function on the vector x = (∫h_1 dµ, . . . , ∫h_L dµ) yields an element of the family (34).

Lemma 7.5. The W_p-optimal stopping topology is finer than the W_p-information topology.
Proof. The choice L_T := −ρ(x, x_0)^p − 1 and L_t := 0 for t ≠ T shows that convergence in the W_p-optimal stopping topology implies convergence of the p-th moments. Thus, we are left to show that convergence in the optimal stopping topology implies convergence in Hellwig's information topology. Then, by the part of Lemma 1.3 which has already been established, we obtain convergence in the W_p-information topology.
Fix 1 ≤ t ≤ N − 1 and denote by A := X^t the first t and by B := X_{t+1:N} the last N − t coordinates. As C_b(A) is convergence determining for P(A), and {ν ↦ G(∫_B h dν) : h ∈ C_b(B), G ∈ C_b(R)} is, by Lemma 7.4, convergence determining for P(P(B)), it follows e.g. from [25, Proposition 4.6 (p. 115)] that the family (35) is convergence determining for the weak topology on P(A × P(B)). Since h in (35) is bounded, one can actually take G in (35) to be compactly supported. But a continuous compactly supported function can be approximated uniformly by piecewise linear functions. The latter are linear combinations of functions of the form z ↦ min(c, dz) where c, d ∈ R. It therefore follows that the family (36) is also convergence determining for the weak topology on P(A × P(B)). Let F be a function in (36), defined via f ∈ C_b(A) and h ∈ C_b(B), and let m ∈ R be a bound for |f| and |h|. Define L ∈ AC_p(Ω) via L_t := f ∘ X^t, L_T := (f ∘ X^t) · (h ∘ X_{t+1:N}) and L_s := m + 1 for s ≠ t, T.
(Here X^t is the projection onto the first t coordinates and X_{t+1:N} the projection onto the remaining N − t coordinates.) By dynamic programming (the Snell-envelope theorem) one has, for every µ ∈ P(A × B), an identity expressing the value of this stopping problem through ∫F d(dis^B_A(µ)). This implies that the optimal stopping topology is finer than the initial topology of the maps µ ↦ ∫F d(dis^B_A(µ)) over F in (36). As (36) is convergence determining for the weak topology on P(A × P(B)), the optimal stopping topology is indeed finer than the information topology, and, as observed at the beginning of this proof, the W_p-optimal stopping topology is therefore finer than the W_p-information topology.
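The role of the filtration in such stopping problems, and the failure of weak continuity that motivates the adapted topologies, can be seen in a small backward-induction (Snell envelope) computation. A sketch (our own; the reward function and the two-step measures are illustrative, echoing the example from the introduction):

```python
from collections import defaultdict

def snell_value(paths, reward):
    """Value of sup_tau E[reward(X_1..X_tau)] for a finitely supported
    path law, by backward induction on the tree of path prefixes."""
    T = len(next(iter(paths)))
    prob = defaultdict(float)                  # probability of each prefix
    for path, p in paths.items():
        for t in range(T + 1):
            prob[path[:t]] += p
    value = {}
    for t in range(T, -1, -1):
        for prefix in {path[:t] for path in paths}:
            stop = reward(prefix)
            if t == T:
                value[prefix] = stop
            else:
                children = {path[:t + 1] for path in paths
                            if path[:t] == prefix}
                cont = sum(prob[c] / prob[prefix] * value[c]
                           for c in children)
                value[prefix] = max(stop, cont)
    return value[()]

# Reward: stopping before the end pays 0, stopping at the end pays X_2.
reward = lambda prefix: prefix[-1] if len(prefix) == 2 else 0.0

# mu_n: the first step reveals the second; mu: it does not.
mu_n = {(0.01, 1.0): 0.5, (-0.01, -1.0): 0.5}
mu = {(0.0, 1.0): 0.5, (0.0, -1.0): 0.5}
print(snell_value(mu_n, reward), snell_value(mu, reward))  # 0.5 0.0
```

Although mu_n is weakly close to mu, the stopping values stay 0.5 apart: the extra information carried by the first coordinate is exactly what the adapted topologies are designed to detect.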