1 Introduction

In this paper we discuss the approximation of transport maps on infinite-dimensional domains. Our main motivation is inference problems, in which the unknown belongs to a Banach space Y. Two examples are the following:

  • Groundwater flow Consider a porous medium in a domain \(\mathrm {D}\subseteq {{\mathbb {R}}}^3\). Given observations of the subsurface flow, we are interested in the permeability (hydraulic conductivity) of the medium in \(\mathrm {D}\). The physical system is described by an elliptic partial differential equation, and the unknown quantity describing the permeability can be modeled as a function \(\psi \in L^\infty (\mathrm {D})=Y\) [25].

  • Inverse scattering Suppose that \(\mathrm {D}_{\mathrm{scat}}\subseteq {{\mathbb {R}}}^3\) is filled by a perfect conductor and illuminated by an electromagnetic wave. Given measurements of the scattered wave, we are interested in the shape of the scatterer \(\mathrm {D}_{\mathrm{scat}}\). Assume that this domain can be described as the image of some bounded reference domain \(\mathrm {D}\subseteq {{\mathbb {R}}}^3\) under a bi-Lipschitz transformation \(\psi :\mathrm {D}\rightarrow {{\mathbb {R}}}^3\), i.e., \(\mathrm {D}_\mathrm{scat}=\psi (\mathrm {D})\). The unknown is then the function \(\psi \in W^{1,\infty }(\mathrm {D})=Y\). We describe the forward model in [17].

The Bayesian approach to these problems is to model \(\psi \) as a Y-valued random variable and to determine the distribution of \(\psi \) conditioned on a (typically noisy) observation of the system. Bayes’ theorem can be used to specify this “posterior” distribution via the prior and the likelihood. The prior is a measure on Y that represents our information on \(\psi \in Y\) before making an observation. Mathematically speaking, assuming that the observation and the unknown follow some joint distribution, the prior is the marginal distribution of the unknown \(\psi \). The goal is to explore the posterior and in this way to make inferences about \(\psi \). We refer to [9] for more details on the general methodology of Bayesian inversion in Banach spaces.

For the analysis and implementation of such methods, instead of working with (prior and posterior) measures on the Banach space Y, it can be convenient to parameterize the problem and work with measures on \({{\mathbb {R}}}^{{\mathbb {N}}}\) instead. To demonstrate this, choose a sequence \((\psi _j)_{j\in {{\mathbb {N}}}}\) in Y and a measure \(\mu \) on \({{\mathbb {R}}}^{{\mathbb {N}}}\). With \({{\varvec{y}}}{:=}(y_j)_{j\in {{\mathbb {N}}}}\in {{\mathbb {R}}}^{{\mathbb {N}}}\) and

$$\begin{aligned} \Phi ({{\varvec{y}}}){:=}\sum _{j\in {{\mathbb {N}}}}y_j\psi _j \end{aligned}$$
(1.1)

we can formally define a prior measure on Y as the pushforward \(\Phi _\sharp \mu \). Instead of inferring \(\psi \in Y\) directly, we may instead infer the coefficient sequence \({{\varvec{y}}}=(y_j)_{j\in {{\mathbb {N}}}}\in {{\mathbb {R}}}^{{\mathbb {N}}}\), in which case \(\mu \) holds the prior information on the unknown coefficients. These viewpoints are equivalent in the sense that the conditional distribution of \(\psi \) given an observation is the pushforward, under \(\Phi \), of the conditional distribution of \({{\varvec{y}}}\) given the observation. Under certain assumptions on the prior and the space Y, construction (1.1) arises naturally through the Karhunen–Loève expansion; see, e.g., [1, 22]. In this case the \(y_j\in {{\mathbb {R}}}\) are uncorrelated random variables with unit variance, and the \(\psi _j\) are eigenvectors of the prior covariance operator, with their norms equal to the square root of the corresponding eigenvalues.
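As a concrete illustration, the following Python sketch draws one prior sample via a truncated version of (1.1). The basis functions \(\psi_j(x)=j^{-2}\sin(j\pi x)\) and the truncation level are illustrative assumptions, not choices prescribed by the paper.

```python
import numpy as np

# Minimal sketch of the parametrization (1.1): a truncated expansion
# Phi(y) = sum_j y_j psi_j. The basis psi_j(x) = j^{-2} sin(j pi x) is an
# illustrative choice; its norms decay algebraically like j^{-2}.

def phi(y, x):
    """Evaluate the truncated expansion sum_{j=1}^{len(y)} y_j psi_j at points x."""
    j = np.arange(1, len(y) + 1)
    psi = np.sin(np.pi * np.outer(x, j)) / j**2   # shape (len(x), len(y))
    return psi @ y

rng = np.random.default_rng(0)
y = rng.uniform(-1.0, 1.0, size=50)    # coefficients y_j in [-1,1]
x = np.linspace(0.0, 1.0, 200)
draw = phi(y, x)                       # one draw from the prior Phi_sharp mu
print(draw[:3])
```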

In this paper we concentrate on the special case where the coefficients \(y_j\) are known to belong to a bounded interval. Up to a shift and a scaling this is equivalent to \(y_j\in [-1,1]\), which will be assumed throughout. We refer to [9, Sect. 2] for the construction and further discussion of such (bounded) priors. The goal then becomes to determine and explore the posterior measure on \(U{:=}[-1,1]^{{\mathbb {N}}}\). Denote this measure by \({\pi }\) and let \(\mu \) be the prior measure on U such that \({\pi }\ll \mu \). Then the Radon-Nikodym derivative \(f_{\pi }{:=}\frac{\mathrm {d}{\pi }}{\mathrm {d}\mu }:U\rightarrow [0,\infty )\) exists. Since the forward model (and thus the likelihood) only depends on \(\Phi ({{\varvec{y}}})\) in the Banach space Y, \(f_{\pi }\) must be of the specific type

$$\begin{aligned} f_{\pi }({{\varvec{y}}}) = {\mathfrak {f}}_{{\pi }}(\Phi ({{\varvec{y}}}))={\mathfrak {f}}_{{\pi }}\Big (\sum _{j\in {{\mathbb {N}}}}y_j\psi _j\Big ) \end{aligned}$$
(1.2)

for some \({\mathfrak {f}}_{\pi }:Y\rightarrow [0,\infty )\). We give a concrete example in Example 2.6 where this relation holds.

“Exploring” the posterior refers to computing expectations and variances w.r.t. \({\pi }\), or detecting areas of high probability w.r.t. \({\pi }\). A standard technique to do so in high dimensions is Monte Carlo—or in this context Markov chain Monte Carlo—sampling, e.g., [31]. Another approach is via transport maps [23]. Let \({\rho }\) be another measure on U from which it is easy to sample. Then, a map \(T:U\rightarrow U\) satisfying \(T_\sharp {\rho }={\pi }\) (i.e., \({\pi }(A)={\rho }(\{{{\varvec{y}}}\,:\,T({{\varvec{y}}})\in A\})\) for all measurable A) is called a transport map that pushes forward \({\rho }\) to \({\pi }\). Such a T has the property that if \({{\varvec{y}}}\sim {\rho }\) then \(T({{\varvec{y}}})\sim {\pi }\), and thus samples from \({\pi }\) can easily be generated once T has been computed. Observe that \(\Phi \circ T:U\rightarrow Y\) will then transform a sample from \({\rho }\) to a sample from \((\Phi \circ T)_\sharp {\rho }=\Phi _\sharp {\pi }\), which is the posterior in the Banach space Y. Thus, given T, we can perform inference on the quantity in the Banach space.
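In one dimension such a transport can be written explicitly as \(T=F_\pi^{-1}\circ F_\rho\) in terms of the distribution functions (cf. Sect. 3). The following sketch transforms uniform samples into samples from an illustrative target; the target density \(1+x/2\) w.r.t. \(\mu\) and all numerical choices are for illustration only.

```python
import numpy as np

# One-dimensional illustration of T_sharp rho = pi: the map
# T = F_pi^{-1} o F_rho pushes rho forward to pi (cf. Sect. 3).
# Illustrative target density: 1 + x/2 w.r.t. mu = dx/2 on [-1,1].

F_rho = lambda x: (x + 1.0) / 2.0                      # rho uniform on [-1,1]
F_pi = lambda x: (x + 1.0) / 2.0 + (x**2 - 1.0) / 8.0  # CDF of the target

def F_pi_inv(u, tol=1e-12):
    """Invert the strictly increasing F_pi on [-1,1] by bisection."""
    lo, hi = -1.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if F_pi(mid) < u else (lo, mid)
    return 0.5 * (lo + hi)

T = lambda x: F_pi_inv(F_rho(x))

rng = np.random.default_rng(1)
xs = rng.uniform(-1.0, 1.0, size=20000)     # samples from rho
ys = np.array([T(x) for x in xs])           # samples from pi
print(ys.mean())                            # approx. mean of pi, here 1/6
```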

This motivates the setting investigated in this paper: given two measures \({\rho }\) and \({\pi }\) on U whose densities are of type (1.2) for a smooth (see Sect. 2) function \({\mathfrak {f}}_{\pi }\), we are interested in the approximation of \(T:U\rightarrow U\) such that \(T_\sharp {\rho }={\pi }\). More precisely, we will discuss the approximation of the so-called Knothe–Rosenblatt (KR) transport by rational functions. The reason for using rational functions (rather than polynomials) is to guarantee that the resulting approximate transport is a bijection from U to U. The rate of convergence will in particular depend on the decay rate of the functions \(\psi _j\). If (1.1) is a Karhunen–Loève expansion, this is the decay rate of the square root of the eigenvalues of the covariance operator of the prior. The faster this decay, the larger the convergence rate will be. The reason for analyzing the triangular KR transport is its wide use in practical algorithms [13, 16, 35, 38], and the fact that its concrete construction makes it amenable to a rigorous analysis.

Sampling from high-dimensional distributions by transforming a (usually lower-dimensional) “latent” variable into a sample from the desired distribution is a standard problem in machine learning. It is tackled by methods such as generative adversarial networks [15] and variational autoencoders [12]. In the setting above, the high-dimensional distribution is the posterior on Y. We will show that under the assumptions of this paper, it is possible to approximately sample from this distribution by transforming a low-dimensional latent variable and without suffering from the curse of dimensionality. While Bayesian inference is our motivation, for the rest of the manuscript the presentation remains in an abstract setting, and our results therefore have ramifications for the broader task of transforming high-dimensional distributions.

1.1 Contributions and Outline

In this manuscript we generalize the analysis of [41] to the infinite-dimensional case. Part of the proofs are based on the results in [41], which we recall in the appendix where appropriate to improve readability.

In Sect. 2 we provide a short description of our main result. Section 3 discusses the KR map in infinite dimensions. Its well-definedness in infinite dimensions has been established in [4]. In Theorem 3.3 we additionally give a formula for the pushforward density assuming continuity of the densities w.r.t. the product topology. In Sect. 4 we analyze the regularity of the KR transport. The fact that a transport inherits the smoothness of the densities is known for certain function classes: for example, in the case of \(C^k\) densities, [11] shows that the optimal transport also belongs to \(C^k\), and a similar statement holds for the KR transport; see for example [33, Remark 2.19]. In Proposition 4.2, assuming analytic densities we show analyticity of the KR transport. Furthermore, and more importantly, we carefully examine the domain of holomorphic extension to the complex numbers. These results are exploited in Sect. 5 to show convergence of rational function approximations to T in Theorem 5.2. This result proves a dimension-independent higher-order convergence rate for the transport of measures supported on infinite-dimensional spaces (which need not be supported on finite-dimensional subspaces). In this result, all occurring constants (not just the convergence rate) are controlled independently of the dimension. In Sect. 6 we show that this implies convergence of the pushforward measures (on U and on the Banach space Y) in the Hellinger distance, the total variation distance, the KL divergence, and the Wasserstein distance. These results are formulated in Theorems 6.1 and 6.4. To prove the latter, in Proposition 6.2 we slightly extend a statement from [32] to compact Polish spaces, to show that the Wasserstein distance between two pushforward measures can be bounded by the maximal distance of the two maps pushing forward the initial measure. Finally, we show that it is possible to compute approximate samples from the pushforward measure in the Banach space Y, by mapping a low-dimensional reference sample to the Banach space; see Corollary 6.5. All proofs can be found in the appendix.

2 Main Result

For \(k\in {{\mathbb {N}}}\) let

$$\begin{aligned} U_{k}{:=}[-1,1]^k\qquad \text {and}\qquad U{:=}[-1,1]^{{\mathbb {N}}}\end{aligned}$$
(2.1)

where these sets are equipped with the product topology and the Borel \(\sigma \)-algebra, which coincides with the product \(\sigma \)-algebra [3, Lemma 6.4.2 (ii)]. Additionally, let \(U_0{:=}\emptyset \). Denote by \(\lambda \) the Lebesgue measure on \([-1,1]\) and by

$$\begin{aligned} \mu =\bigotimes _{j\in {{\mathbb {N}}}}\frac{\lambda }{2} \end{aligned}$$
(2.2)

the infinite product measure. Then \(\mu \) is a (uniform) probability measure on U. By abuse of notation, for \(k\in {{\mathbb {N}}}\) we also write \(\mu =\otimes _{j=1}^k\frac{\lambda }{2}\) for the product measure on \(U_k\), where k will always be clear from context.

For a reference \({\rho }\ll \mu \) and a target measure \({\pi }\ll \mu \) on U, we investigate the smoothness and approximability of the KR transport \(T:U\rightarrow U\) satisfying \(T_\sharp {\rho }={\pi }\); the notation \(T_\sharp {\rho }\) refers to the pushforward measure defined by \(T_\sharp {\rho }(A){:=}{\rho }(\{{{\varvec{y}}}\in U\,:\,T({{\varvec{y}}})\in A\})\) for all measurable \(A\subseteq U\). While in general there exist multiple maps \(T:U\rightarrow U\) pushing forward \({\rho }\) to \({\pi }\), the KR transport is the unique such map satisfying triangularity and monotonicity. Triangularity refers to the kth component \(T_k\) of \(T=(T_k)_{k\in {{\mathbb {N}}}}\) being a function of the variables \(x_1,\dots ,x_k\) only, i.e., \(T_k:U_{k}\rightarrow U_{1}\) for all \(k\in {{\mathbb {N}}}\). Monotonicity means that \(x_k\mapsto T_k(x_1,\dots ,x_{k-1},x_k)\) is monotonically increasing on \(U_{1}\) for every \(k\in {{\mathbb {N}}}\) and every fixed \((x_1,\dots ,x_{k-1})\in U_{k-1}\).

Absolute continuity of \({\rho }\) and \({\pi }\) w.r.t. \(\mu \) implies existence of the Radon-Nikodym derivatives

$$\begin{aligned} f_{\rho }{:=}\frac{\mathrm {d}{\rho }}{\mathrm {d}\mu }\qquad \text {and}\qquad f_{\pi }{:=}\frac{\mathrm {d}{\pi }}{\mathrm {d}\mu } \end{aligned}$$

which will also be referred to as the densities of these measures. Assuming for the moment existence of the KR transport T, approximating T requires approximating the infinitely many functions \(T_k:U_{k}\rightarrow U_{1}\), \(k\in {{\mathbb {N}}}\). This, and the fact that the domain \(U_{k}\) of \(T_k\) becomes increasingly high dimensional as \(k\rightarrow \infty \), makes the problem quite challenging.

For these reasons, further assumptions on \({\rho }\) and \({\pi }\) are necessary. Typical requirements imposed on the measures guarantee some form of intrinsic low dimensionality. Examples include densities belonging to certain reproducing kernel Hilbert spaces, or to other function classes of sufficient regularity. In this paper we concentrate on the latter. As is well-known, if \(T_k:U_{k}\rightarrow U_{1}\) belongs to \(C^k\), then it can be uniformly approximated with the k-independent convergence rate of 1, for instance with multivariate polynomials. The convergence rate to approximate \(T_k\) then does not deteriorate with increasing k, but the constants in such error bounds usually still depend exponentially on k. Moreover, as \(k\rightarrow \infty \), this line of argument requires the components of the map to become arbitrarily regular. For this reason, in the present work, where \(T=(T_k)_{k\in {{\mathbb {N}}}}\), it is not unnatural to restrict ourselves to transports that are \(C^\infty \). More precisely, we in particular assume analyticity of the densities \(f_{\rho }\) and \(f_{\pi }\), which in turn implies analyticity of T as we shall see. This will allow us to control all occurring constants independently of the dimension, and approximate the whole map \(T:U\rightarrow U\) using only finitely many degrees of freedom in our approximation.

Assume in the following that Z is a Banach space with complexification \(Z_{\mathbb {C}}\); see, e.g., [18, 27] for the complexification of Banach spaces. We may think of Z and \(Z_{\mathbb {C}}\) as real- and complex-valued function spaces, e.g., \(Z=L^2([0,1],{{\mathbb {R}}})\) and \(Z_{\mathbb {C}}=L^2([0,1],{\mathbb {C}})\). To guarantee analyticity and the structure in (1.1) we consider densities f of the following type:

Assumption 2.1

For constants \(p\in (0,1)\), \(0<M\le L<\infty \), a sequence \((\psi _j)_{j\in {{\mathbb {N}}}}\subseteq Z\), and a complex differentiable function \({\mathfrak {f}}:O_Z\rightarrow {\mathbb {C}}\) with \(O_Z\subseteq Z_{\mathbb {C}}\) open, the following hold:

  1. (a)

    \(\sum _{j\in {{\mathbb {N}}}}\Vert \psi _{j} \Vert _{Z}^p<\infty \),

  2. (b)

    \(\sum _{j\in {{\mathbb {N}}}}y_j\psi _{j}\in O_Z\) for all \({{\varvec{y}}}\in U\),

  3. (c)

    \({\mathfrak {f}}(\sum _{j\in {{\mathbb {N}}}}y_j\psi _{j})\in {{\mathbb {R}}}\) for all \({{\varvec{y}}}\in U\),

  4. (d)

    \(M\le \inf _{\psi \in O_Z}|{\mathfrak {f}}(\psi )|\le \sup _{\psi \in O_Z}|{\mathfrak {f}}(\psi )| \le L\).

The function \(f:U\rightarrow {{\mathbb {R}}}\) given by

$$\begin{aligned} f({{\varvec{y}}}){:=}{\mathfrak {f}}\bigg (\sum _{j\in {{\mathbb {N}}}}\psi _{j}y_j\bigg ) \end{aligned}$$
(2.3)

satisfies \(\int _U f({{\varvec{y}}})\;\mathrm {d}\mu ({{\varvec{y}}})=1\).

Assumption 2.2

For two sequences \((\psi _{*,j})_{j\in {{\mathbb {N}}}}\subseteq Z\) with \((*,Z)\in \{({\rho },X),({\pi },Y)\}\), the functions

$$\begin{aligned} f_{\rho }({{\varvec{y}}})={\mathfrak {f}}_{\rho }\bigg (\sum _{j\in {{\mathbb {N}}}}y_j\psi _{{\rho },j}\bigg ),\qquad f_{\pi }({{\varvec{y}}})={\mathfrak {f}}_{\pi }\bigg (\sum _{j\in {{\mathbb {N}}}}y_j\psi _{{\pi },j}\bigg ) \end{aligned}$$

both satisfy Assumption 2.1 for some fixed constants \(p\in (0,1)\) and \(0<M\le L<\infty \).

The summability parameter p determines the decay rate of the functions \(\psi _j\)—the smaller p the stronger the decay of the \(\psi _j\). Because \(p<1\), the argument of \({\mathfrak {f}}\) in (2.3) is well-defined for \({{\varvec{y}}}\in U\) since \(\sum _{j\in {{\mathbb {N}}}}|y_j|\Vert \psi _{j} \Vert _{Z}<\infty \).

Our main result is about the existence and approximation of the KR transport \(T:U\rightarrow U\) satisfying \(T_\sharp {\rho }={\pi }\). We state the result here in a simplified form; more details will be given in Theorems 5.2, 6.1, and 6.4. We only mention that in the following theorem the trivial approximation \(T_k(x_1,\dots ,x_k)\simeq x_k\) is interpreted as requiring no degrees of freedom.

Theorem 2.3

Let \(f_{\rho }:U\rightarrow (0,\infty )\) and \(f_{\pi }:U\rightarrow (0,\infty )\) be two probability densities as in Assumption 2.2 for some \(p\in (0,1)\). Then there exists a unique triangular, monotone, and bijective map \(T:U\rightarrow U\) satisfying \(T_\sharp {\rho }={\pi }\).

Moreover, for every \(N\in {{\mathbb {N}}}\) there exists a space of rational functions employing N degrees of freedom, and a bijective, monotone, and triangular \({{\tilde{T}}}:U\rightarrow U\) in this space such that

$$\begin{aligned} \mathrm{dist}({{\tilde{T}}}_\sharp {\rho },{\pi })\le C N^{-\frac{1}{p}+1}. \end{aligned}$$
(2.4)

Here C is a constant independent of N and “\(\mathrm{dist}\)” may refer to the total variation distance, the Hellinger distance, the KL divergence, or the Wasserstein distance.

Equation (2.4) shows a dimension-independent convergence rate (indeed our transport is defined on the infinite-dimensional domain \(U=[-1,1]^{{\mathbb {N}}}\)), so that the curse of dimensionality is overcome. The rate of algebraic convergence becomes arbitrarily large as \(p\in (0,1)\) in Assumption 2.1 becomes small. The convergence rate \(\frac{1}{p}-1\) in Theorem 2.3 is well-known for the approximation of functions as in (2.3) by sparse polynomials, e.g., [6,7,8]; also see Remark 2.7. There is a key difference from earlier results dealing with the approximation of such functions: we do not approximate the function \(f:U\rightarrow {{\mathbb {R}}}\) in (2.3), but instead we approximate the transport \(T:U\rightarrow U\), i.e., an infinite number of functions. Our main observation in this paper is that the sparsity of the densities \(f_{\rho }\) and \(f_{\pi }\) carries over to the transport. Even though it has infinitely many components, T can still be approximated very efficiently if the ansatz space is carefully chosen and tailored to the specific densities. In addition to showing error convergence (2.4), in Theorem 5.2 we give concrete ansatz spaces achieving this convergence rate. These ansatz spaces can be computed in linear complexity and may be used in applications.

The main application for our result is to provide a method to sample from the target \({\pi }\) or the pushforward \(\Phi _\sharp {\pi }\) in the Banach space Y, where \(\Phi ({{\varvec{y}}})=\sum _{j\in {{\mathbb {N}}}}y_j\psi _{{\pi },j}\). Given an approximation \({{\tilde{T}}}=({{\tilde{T}}}_j)_{j\in {{\mathbb {N}}}}\) to T, this is achieved via \(\Phi ({{\tilde{T}}}({{\varvec{y}}}))\) for \({{\varvec{y}}}\sim {\rho }\). It is natural to truncate this expansion, which yields

$$\begin{aligned} \sum _{j=1}^s {{\tilde{T}}}_j(y_1,\dots ,y_j)\psi _{{\pi },j} \end{aligned}$$

for some truncation parameter \(s\in {{\mathbb {N}}}\) and \((y_1,\dots ,y_s)\in U_{s}\). This map transforms a sample from a distribution on the s-dimensional space \(U_{s}\) to a sample from an infinite-dimensional distribution on Y. In Corollary 6.5 we show that the error of this truncated representation in the Wasserstein distance converges with the same rate as given in Theorem 2.3.

Remark 2.4

The reference \({\rho }\) is a “simple” measure whose main purpose is to allow for easy sampling. One possible choice for \({\rho }\) (that we have in mind throughout this paper) is the uniform measure \(\mu \). It trivially satisfies Assumption 2.1 with \({\mathfrak {f}}_{\rho }:{\mathbb {C}}\rightarrow {\mathbb {C}}\) being the constant 1 function (and, e.g., \(\psi _{{\rho },j}=0\in {\mathbb {C}}\)).

Remark 2.5

Even though we can think of \({\rho }\) as being \(\mu \), we formulated Theorem 2.3 in more generality, mainly for the following reason: since the assumptions on \({\rho }\) and \({\pi }\) are the same, we may switch their roles. Thus Theorem 2.3 can be turned into a statement about the inverse transport \(S{:=}T^{-1}:U\rightarrow U\), which can also be approximated at the rate \(\frac{1}{p}-1\).

Example 2.6

(Bayesian inference) For a Banach space Y (“parameter space”) and a Banach space \({\mathcal {X}}\) (“solution space”), let \({\mathfrak {u}}:O_Y\rightarrow {\mathcal {X}}_{\mathbb {C}}\) be a complex differentiable forward operator. Here \(O_Y\subseteq Y_{\mathbb {C}}\) is some nonempty open set. Let \(G:{\mathcal {X}}\rightarrow {{\mathbb {R}}}^m\) be a bounded linear observation operator. For some unknown \(\psi \in Y\) we are given a noisy observation of the system in the form

$$\begin{aligned} \varsigma = G({\mathfrak {u}}(\psi ))+\eta \in {{\mathbb {R}}}^m, \end{aligned}$$

where \(\eta \sim {\mathcal {N}}(0,\Gamma )\) is a centered Gaussian random variable with symmetric positive definite covariance \(\Gamma \in {{\mathbb {R}}}^{m\times m}\). The goal is to recover \(\psi \) given the measurement \(\varsigma \).

To formulate the Bayesian inverse problem, we first fix a prior: let \((\psi _{j})_{j\in {{\mathbb {N}}}}\) be a summable sequence of linearly independent elements in Y. With

$$\begin{aligned} \Phi ({{\varvec{y}}}){:=}\sum _{j\in {{\mathbb {N}}}}y_j\psi _j \end{aligned}$$

and the uniform measure \(\mu \) on U, we choose the prior \(\Phi _\sharp \mu \) on Y. Determining \(\psi \) within the set \(\{\Phi ({{\varvec{y}}})\,:\,{{\varvec{y}}}\in U\}\subseteq Y\) is equivalent to determining the coefficient sequence \({{\varvec{y}}}\in U\). Assuming independence of \({{\varvec{y}}}\sim \mu \) and \(\eta \sim {{\mathcal {N}}}(0,\Gamma )\), the distribution of \({{\varvec{y}}}\) given \(\varsigma \) (the posterior) can then be characterized by its density w.r.t. \(\mu \), which, up to a normalization constant, equals

$$\begin{aligned} \exp \left( -\frac{1}{2}\Bigg (\varsigma -G\Big ({\mathfrak {u}}\Big (\sum _{j\in {{\mathbb {N}}}}y_j\psi _j\Big )\Big )\Bigg )^\top \Gamma ^{-1}\Bigg (\varsigma -G\Big ({\mathfrak {u}}\Big (\sum _{j\in {{\mathbb {N}}}}y_j\psi _j\Big )\Big )\Bigg )\right) . \end{aligned}$$
(2.5)

This posterior density is of form (2.3) and the corresponding measure \({\pi }\) can be chosen as a target in Theorem 2.3. Given T satisfying \(T_\sharp {\rho }={\pi }\), we may then explore \({\pi }\) to perform inference on the unknown \({{\varvec{y}}}\); see for instance [41, Sect. 7.4]. For more details on the rigorous derivation of (2.5) we refer to [34] and in particular [9, Sect. 3].
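For concreteness, a toy numerical version of (2.5) might look as follows. The linear map A standing in for the composition \({{\varvec{y}}}\mapsto G({\mathfrak {u}}(\Phi ({{\varvec{y}}})))\) on s truncated coefficients, as well as s, m, \(\Gamma \), and the synthetic data, are all illustrative assumptions.

```python
import numpy as np

# Hedged toy version of the posterior density (2.5). The matrix A is an
# illustrative stand-in for y -> G(u(Phi(y))); it is not the PDE-based
# forward operator of the groundwater or scattering examples.

rng = np.random.default_rng(3)
s, m = 20, 5                          # truncation dimension, observation count
A = rng.standard_normal((m, s))
Gamma = 0.01 * np.eye(m)              # noise covariance
Gamma_inv = np.linalg.inv(Gamma)

y_true = rng.uniform(-1.0, 1.0, size=s)
varsigma = A @ y_true + rng.multivariate_normal(np.zeros(m), Gamma)

def posterior_density(y):
    """Unnormalized density of y given varsigma, w.r.t. the prior mu."""
    r = varsigma - A @ y
    return np.exp(-0.5 * r @ Gamma_inv @ r)

print(posterior_density(y_true), posterior_density(np.zeros(s)))
```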

Remark 2.7

Functions as in Assumption 2.1 belong to the set of so-called \(({{\varvec{b}}},p,\varepsilon )\)-holomorphic functions; see, e.g., [6]. This class contains infinite parametric functions that are holomorphic in each argument \(y_j\), and exhibit some growth in the domain of holomorphic extension as \(j\rightarrow \infty \). The results of the present paper and the key arguments remain valid if we replace Assumption 2.1 with the \(({{\varvec{b}}},p,\varepsilon )\)-holomorphy assumption. Since most relevant examples of such functions are of specific type (2.3), we restrict the discussion to this case in order to avoid technicalities.

3 The Knothe–Rosenblatt Transport in Infinite Dimensions

Recall that we consider the product topology on \(U=[-1,1]^{{\mathbb {N}}}\). Assume that \(f_{\rho }\in C^0(U,{{\mathbb {R}}}_+)\) and \(f_{\pi }\in C^0(U,{{\mathbb {R}}}_+)\) are two positive probability densities. Here \({{\mathbb {R}}}_+{:=}(0,\infty )\), and \(C^0(U,{{\mathbb {R}}}_+)\) denotes the space of continuous functions from U to \({{\mathbb {R}}}_+\). We now recall the construction of the KR map.

For \({{\varvec{y}}}=(y_j)_{j\in {{\mathbb {N}}}}\in {\mathbb {C}}^{{\mathbb {N}}}\) and \(1\le k\le n<\infty \) let

$$\begin{aligned} {{\varvec{y}}}_{[k]}{:=}(y_j)_{j=1}^k,\qquad {{\varvec{y}}}_{[k:n]}{:=}(y_j)_{j=k}^n,\qquad {{\varvec{y}}}_{[n:]}{:=}(y_j)_{j\ge n}. \end{aligned}$$
(3.1)

For \(*\in \{{\rho },{\pi }\}\) and \({{\varvec{y}}}\in U\) define

$$\begin{aligned} {{\hat{f}}}_{*,0}({{\varvec{y}}}){:=}1 \end{aligned}$$
(3.2a)

and for \(k\in {{\mathbb {N}}}\)

$$\begin{aligned} {{\hat{f}}}_{*,k}({{\varvec{y}}}_{[k]}){:=}\int _{U}f_*({{\varvec{y}}}_{[k]},{{\varvec{t}}}) \;\mathrm {d}\mu ({{\varvec{t}}})>0,\qquad f_{*,k}({{\varvec{y}}}_{[k]}){:=}\frac{{{\hat{f}}}_{*,k}({{\varvec{y}}}_{[k]})}{{{\hat{f}}}_{*,k-1}({{\varvec{y}}}_{[k-1]})}>0. \end{aligned}$$
(3.2b)

Then, \({{\varvec{y}}}_{[k]}\mapsto {{\hat{f}}}_{{\rho },k}({{\varvec{y}}}_{[k]})\) is the marginal density of \({\rho }\) in the first k variables \({{\varvec{y}}}_{[k]}\in U_{k}\), and we denote the corresponding measure on \(U_{k}\) by \({\rho }_k\). Similarly, \(y_k\mapsto f_{{\rho },k}({{\varvec{y}}}_{[k-1]},y_k)\) is the conditional density of \(y_k\) given \({{\varvec{y}}}_{[k-1]}\), and the corresponding measure on \(U_{1}\) is denoted by \({\rho }_k^{{{\varvec{y}}}_{[k-1]}}\). The same holds for the densities of \({\pi }\), and we use the analogous notation \({\pi }_k\) and \({\pi }_k^{{{\varvec{y}}}_{[k-1]}}\) for the marginal and conditional measures.

Recall that for two atomless measures \(\eta \) and \(\nu \) on \(U_{1}\) with distribution functions \(F_\eta :U_{1}\rightarrow [0,1]\) and \(F_\nu :U_{1}\rightarrow [0,1]\), \(F_\eta ^{-1}\circ F_\nu :U_{1}\rightarrow U_{1}\) pushes forward \(\nu \) to \(\eta \), as is easily checked, e.g., [33, Theorem 2.5]. In case \(\eta \) and \(\nu \) have positive densities on \(U_{1}\), this map is the unique strictly monotonically increasing such function. With this in mind, the KR-transport can be constructed as follows: let \(T_1:U_{1}\rightarrow U_{1}\) be the (unique) monotonically increasing transport satisfying

$$\begin{aligned} (T_1)_\sharp {\rho }_1 = {\pi }_1. \end{aligned}$$
(3.3a)

Analogous to (3.1) denote \(T_{[k]}{:=}(T_j)_{j=1}^k:U_{k}\rightarrow U_{k}\). Let inductively for any \({{\varvec{y}}}\in U\), \(T_{k+1}({{\varvec{y}}}_{[k]},\cdot ):U_{1}\rightarrow U_{1}\) be the (unique) monotonically increasing transport such that

$$\begin{aligned} (T_{k+1}({{\varvec{y}}}_{[k]},\cdot ))_\sharp {\rho }_{k+1}^{{{\varvec{y}}}_{[k]}} ={\pi }_{k+1}^{T_{[k]}({{\varvec{y}}}_{[k]})}. \end{aligned}$$
(3.3b)

Note that \(T_{k+1}:U_{{k+1}}\rightarrow U_{1}\) and thus \(T_{[k+1]}=(T_j)_{j=1}^{k+1}:U_{{k+1}}\rightarrow U_{{k+1}}\). It can then be shown that for any \(k\in {{\mathbb {N}}}\) [33, Proposition 2.18]

$$\begin{aligned} (T_{[k]})_\sharp {\rho }_k = {\pi }_k. \end{aligned}$$
(3.4)

By induction this construction yields a map \(T{:=}(T_k)_{k\in {{\mathbb {N}}}}\) where each \(T_k:U_{k}\rightarrow U_{1}\) satisfies that \(T_k({{\varvec{y}}}_{[k-1]},\cdot ):U_{1}\rightarrow U_{1}\) is strictly monotonically increasing and bijective. This implies that \(T:U\rightarrow U\) is bijective, as follows. First, to show injectivity: let \({{\varvec{x}}}\ne {{\varvec{y}}}\in U\) and \(j={{\,\mathrm{argmin}\,}}\{i\,:\,x_i\ne y_i\}\). Since \(t\mapsto T_j(x_1,\dots ,x_{j-1},t)\) is bijective, \(T_j(x_1,\dots ,x_{j-1},x_j)\ne T_j(x_1,\dots ,x_{j-1},y_j)\) and thus \(T({{\varvec{x}}})\ne T({{\varvec{y}}})\). Next, to show surjectivity: fix \({{\varvec{y}}}\in U\). Bijectivity of \(T_1:U_{1}\rightarrow U_{1}\) implies existence of \(x_1\in U_{1}\) such that \(T_1(x_1)=y_1\). Inductively choose \(x_j\) such that \(T_j(x_1,\dots ,x_j)=y_j\). Then \(T({{\varvec{x}}})={{\varvec{y}}}\). Thus:

Lemma 3.1

Let \(T=(T_k)_{k\in {{\mathbb {N}}}}:U\rightarrow U\) be triangular. If \(t\mapsto T_k({{\varvec{y}}}_{[k-1]},t)\) is bijective from \(U_{1}\rightarrow U_{1}\) for every \({{\varvec{y}}}\in U\) and \(k\in {{\mathbb {N}}}\), then \(T:U\rightarrow U\) is bijective.
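For intuition, the following is a minimal numerical sketch of the construction (3.3) for two variables; the target density is an illustrative toy example, and the reference is the uniform measure \(\mu \).

```python
import numpy as np

# Minimal numerical sketch of the construction (3.3) for two variables.
# Reference rho = mu (uniform on [-1,1]^2); illustrative target density
# f_pi(x1, x2) = (2 + x1*x2)/2 w.r.t. mu.

f_pi = lambda x1, x2: (2.0 + x1 * x2) / 2.0
t = np.linspace(-1.0, 1.0, 2001)

# The x1-marginal of f_pi is constant 1, so T_1 is the identity here.
T1 = lambda x1: x1

def T2(x1, x2):
    """Monotone map (3.3b): conditional inverse CDF of pi given T1(x1),
    composed with the conditional CDF of rho (uniform) given x1."""
    pdf = f_pi(T1(x1), t)
    cdf = np.concatenate(([0.0], np.cumsum(0.5 * (pdf[1:] + pdf[:-1]) * np.diff(t))))
    cdf /= cdf[-1]
    return np.interp((x2 + 1.0) / 2.0, cdf, t)

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=(5000, 2))          # samples from rho = mu
y = np.column_stack([T1(x[:, 0]), [T2(a, b) for a, b in x]])
print(np.mean(y[:, 0] * y[:, 1]))                   # approx. E_pi[x1*x2] = 1/18
```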

The continuity assumption on the densities guarantees that the marginal densities on \(U_{k}\) converge uniformly to the full density, as we show next. This indicates that in principle it is possible to approximate the infinite-dimensional transport map by restricting to finitely many dimensions.

Lemma 3.2

Let \(f\in C^0(U,{{\mathbb {R}}}_+)\), and let \({{\hat{f}}}_k\) and \(f_k\) be as in (3.2). Then

  1. (i)

    f is measurable and \(f\in L^2(U,\mu )\),

  2. (ii)

    \({{\hat{f}}}_{k}\in C^0(U_{k},{{\mathbb {R}}}_+)\) and \(f_{k}\in C^0(U_{k},{{\mathbb {R}}}_+)\) for every \(k\in {{\mathbb {N}}}\),

  3. (iii)

    it holds

    $$\begin{aligned} \lim _{k\rightarrow \infty }\sup _{{{\varvec{y}}}\in U}|{{\hat{f}}}_{k}({{\varvec{y}}}_{[k]})-f({{\varvec{y}}})|=0. \end{aligned}$$
    (3.5)

Throughout what follows T always stands for the KR transport defined in (3.3). Next we show that T indeed pushes forward \({\rho }\) to \({\pi }\), and additionally we provide a formula for the transformation of densities. In the following \(\partial _jg({{\varvec{x}}}){:=}\frac{\partial }{\partial x_j}g({{\varvec{x}}})\). Furthermore, we call \(f:U\rightarrow {{\mathbb {R}}}\) a positive probability density if \(f({{\varvec{y}}})>0\) for all \({{\varvec{y}}}\in U\) and \(\int _U f({{\varvec{y}}})\;\mathrm {d}\mu ({{\varvec{y}}})=1\).

Theorem 3.3

Let \(f_{\pi }\), \(f_{\rho }\in C^0(U,{{\mathbb {R}}}_+)\) be two positive probability densities. Then

  1. (i)

    \(T=(T_k)_{k\in {{\mathbb {N}}}}:U\rightarrow U\) is measurable, bijective and satisfies \(T_\sharp {\rho }={\pi }\),

  2. (ii)

    for each \(k\in {{\mathbb {N}}}\), \(\partial _kT_k\in C^0(U_{k},{{\mathbb {R}}}_+)\) and

    $$\begin{aligned} \det dT({{\varvec{y}}}){:=}\lim _{n\rightarrow \infty }\prod _{j=1}^n\partial _jT_j({{\varvec{y}}}_{[j]})\in C^0(U,{{\mathbb {R}}}_+) \end{aligned}$$
    (3.6)

    is well-defined (i.e., converges in \(C^0(U,{{\mathbb {R}}}_+)\)). Moreover

    $$\begin{aligned} f_{{\pi }}(T({{\varvec{y}}}))\det dT({{\varvec{y}}})=f_{\rho }({{\varvec{y}}}) \qquad \forall {{\varvec{y}}}\in U. \end{aligned}$$

Remark 3.4

Switching the roles of \(f_{\rho }\) and \(f_{\pi }\), for \(S=T^{-1}\) it holds \(f_{\rho }(S({{\varvec{y}}}))\det dS({{\varvec{y}}})=f_{\pi }({{\varvec{y}}})\) for all \({{\varvec{y}}}\in U\), where \(\det dS({{\varvec{y}}}){:=}\lim _{n\rightarrow \infty }\prod _{j=1}^n\partial _jS_j({{\varvec{y}}}_{[j]})\) is well-defined.

4 Analyticity of T

In this section we investigate the domain of analytic extension of T. To state our results, for \(\delta >0\) and \(D\subseteq {\mathbb {C}}\) we introduce the complex sets

$$\begin{aligned} {{\mathcal {B}}}_\delta {:=}\{z\in {\mathbb {C}}\,:\,|z|<\delta \}\qquad \text {and}\qquad {{\mathcal {B}}}_\delta (D){:=}\{z+y\,:\,z\in {{\mathcal {B}}}_\delta ,~y\in D\}, \end{aligned}$$

and for \(k\in {{\mathbb {N}}}\) and \({\varvec{\delta }}\in (0,\infty )^k\)

$$\begin{aligned} {{\mathcal {B}}}_{\varvec{\delta }}{:=}\times _{j=1}^k{{\mathcal {B}}}_{\delta _j}\qquad \text {and}\qquad {{\mathcal {B}}}_{\varvec{\delta }}(D) {:=}\times _{j=1}^k {{\mathcal {B}}}_{\delta _j}(D), \end{aligned}$$

which are subsets of \({\mathbb {C}}^k\). Their closures will be denoted by \({{\bar{{{\mathcal {B}}}}}}_\delta \), etc. If we write \({{\mathcal {B}}}_{\varvec{\delta }}(U_{1})\times U\) we mean elements \({{\varvec{y}}}\in {\mathbb {C}}^{{\mathbb {N}}}\) with \(y_j\in {{\mathcal {B}}}_{\delta _j}(U_{1})\) for \(j\le k\) and \(y_j\in U_{1}\) otherwise. Subsets of \({\mathbb {C}}^{{\mathbb {N}}}\) are always equipped with the product topology.

In this section we analyze the domain of holomorphic extension of each component \(T_k:U_{k}\rightarrow U_{1}\) of T to subsets of \({\mathbb {C}}^k\). The reason why we are interested in such statements is that they allow us to upper bound the expansion coefficients w.r.t. certain polynomial bases: for a multiindex \({\varvec{\nu }}\in {{\mathbb {N}}}_0^k\) (where \({{\mathbb {N}}}_0=\{0,1,2,\dots \}\)) let \(L_{\varvec{\nu }}({{\varvec{y}}})=\prod _{j=1}^kL_{\nu _j}(y_j)\) be the product of the one-dimensional Legendre polynomials normalized in \(L^2(U_{1},\mu )\). Then \((L_{\varvec{\nu }})_{{\varvec{\nu }}\in {{\mathbb {N}}}_0^k}\) forms an orthonormal basis of \(L^2(U_{k},\mu )\). Hence we can expand \(\partial _kT_k({{\varvec{y}}}_{[k]})=\sum _{{\varvec{\nu }}\in {{\mathbb {N}}}_0^k}l_{k,{\varvec{\nu }}} L_{\varvec{\nu }}({{\varvec{y}}}_{[k]})\) for \({{\varvec{y}}}\in U\), with the Legendre coefficients

$$\begin{aligned} l_{k,{\varvec{\nu }}}=\int _{U_{k}}\partial _kT_k({{\varvec{y}}}_{[k]})L_{\varvec{\nu }}({{\varvec{y}}}_{[k]})\;\mathrm {d}\mu ({{\varvec{y}}}_{[k]})\in {{\mathbb {R}}}. \end{aligned}$$
(4.1)

Analyticity of \(T_k\) (and thus of \(\partial _kT_k\)) on the set \({{\mathcal {B}}}_{\varvec{\delta }}(U_{1})\) implies bounds of the type (see Lemma C.3)

$$\begin{aligned} |l_{k,{\varvec{\nu }}}|\le C\Vert \partial _kT_k \Vert _{L^\infty ({{\mathcal {B}}}_{\varvec{\delta }}(U_{1}))}\prod _{j=1}^k (1+\delta _j)^{-\nu _j}. \end{aligned}$$
(4.2)

Here C in particular depends on \(\min _j \delta _j>0\). The exponential decay in each \(\nu _j\) leads to exponential convergence of truncated sparse Legendre expansions. Once we have approximated \(\partial _kT_k\), we integrate this term in \(y_k\) to obtain an approximation to \(T_k\). The reason for not approximating \(T_k\) directly is explained after Proposition 4.2 below; see (4.5). The size of the holomorphy domain (the size of \({\varvec{\delta }}\)) determines the constants in these estimates—the larger the entries of \({\varvec{\delta }}\), the smaller the upper bound (4.2) and the faster the convergence.
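The following one-dimensional sketch illustrates the decay (4.2): the function below is an illustrative choice, analytic for \(|z|<2\), so its normalized Legendre coefficients decay exponentially.

```python
import numpy as np
from numpy.polynomial import legendre as leg

# Hedged 1D illustration of (4.2): coefficients of an analytic function
# w.r.t. L^2([-1,1], dx/2)-normalized Legendre polynomials decay
# exponentially. f is an illustrative choice, analytic for |z| < 2.

f = lambda x: 1.0 / (2.0 + x)

xq, wq = leg.leggauss(200)                 # Gauss-Legendre nodes and weights
for nu in range(0, 25, 4):
    P = leg.Legendre.basis(nu) * np.sqrt(2 * nu + 1)   # normalized in L^2(mu)
    l_nu = 0.5 * np.sum(wq * f(xq) * P(xq))            # dmu = dx/2
    print(nu, abs(l_nu))                   # decays roughly like (1+delta)^{-nu}
```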

We are now in position to present our main technical tool to find suitable holomorphy domains of each \(T_k\) (or equivalently \(\partial _kT_k\)). We will work under the following assumption on the two densities \(f_{\rho }:U\rightarrow (0,\infty )\) and \(f_{\pi }:U\rightarrow (0,\infty )\). The assumption is a modification of [41, Assumption 3.5].

Assumption 4.1

For constants \(C_1\), \(M>0\), \(L<\infty \), \(k\in {{\mathbb {N}}}\), and \({\varvec{\delta }}\in (0,\infty )^k\), the following hold:

  1. (a)

    \(f\in C^0({{\mathcal {B}}}_{{\varvec{\delta }}}(U_{1})\times U,{\mathbb {C}})\) and \(f:U\rightarrow {{\mathbb {R}}}_+\) is a probability density,

  2. (b)

    \({{\varvec{x}}}\mapsto f({{\varvec{x}}},{{\varvec{y}}})\in C^1({{\mathcal {B}}}_{{\varvec{\delta }}}(U_{1}),{\mathbb {C}})\) for all \({{\varvec{y}}}\in U\),

  3. (c)

    \( M\le |f({{\varvec{y}}})|\le L\) for all \({{\varvec{y}}}\in {{\mathcal {B}}}_{{\varvec{\delta }}}(U_{1})\times U\),

  4. (d)

    \(\sup _{{{\varvec{y}}}\in {{\mathcal {B}}}_{{\varvec{\delta }}}\times \{0\}^{{\mathbb {N}}}}|f({{\varvec{x}}}+{{\varvec{y}}})-f({{\varvec{x}}})| \le C_1\) for all \({{\varvec{x}}}\in U\),

  5. (e)

    \(\sup _{{{\varvec{y}}}\in {{\mathcal {B}}}_{{\varvec{\delta }}_{[j]}}\times \{0\}^{{{\mathbb {N}}}}}|f({{\varvec{x}}}+{{\varvec{y}}})-f({{\varvec{x}}})|\le C_1 \delta _{j+1}\) for all \({{\varvec{x}}}\in U\) and \(j\in \{1,\dots ,k-1\}\).

Such densities yield certain holomorphy domains for \(T_k\) as we show in the next proposition, which is an infinite-dimensional version of [41, Theorem 3.6].

Proposition 4.2

Let \(k\in {{\mathbb {N}}}\), \({\varvec{\delta }}\in (0,\infty )^k\) and \(0< M\le L<\infty \). There exist \(C_1>0\), \(C_2\in (0,1]\) and \(C_3>0\) solely depending on M and L (but not on k or \({\varvec{\delta }}\)) such that if \(f_{\rho }\) and \(f_{\pi }\) satisfy Assumption 4.1 with \(C_1\), M, L, and \({\varvec{\delta }}\), then:

With \({\varvec{\zeta }}=(\zeta _j)_{j=1}^k\) defined by

$$\begin{aligned} \zeta _{j}{:=}C_2 \delta _{j} \qquad \forall j\in \{1,\dots ,k\}, \end{aligned}$$
(4.3)

it holds for all \(j\in \{1,\dots ,k\}\) with \(R_j{:=}\partial _jT_j\) (with T as in (3.3)) that

  1. (i)

    \(R_j\in C^1({{\mathcal {B}}}_{{\varvec{\zeta }}_{[j]}}(U_{1}),{{\mathcal {B}}}_{ C_3}(1))\) and \(\Re (R_j({{\varvec{x}}}))\ge \frac{1}{C_3}\) for all \({{\varvec{x}}}\in {{\mathcal {B}}}_{{\varvec{\zeta }}_{[j]}}(U_{1})\),

  2. (ii)

    if \(j\ge 2\), \(R_j:{{\mathcal {B}}}_{{\varvec{\zeta }}_{[j-1]}}(U_{1})\times U_{1}\rightarrow {{\mathcal {B}}}_{\frac{C_3}{\delta _j}}(1)\).

Let us sketch how this result can be used to show that \(T_k\) can be approximated by polynomial expansions. In Appendix B.2 we will verify Assumption 4.1 for densities as in (2.3). Proposition 4.2 (i) then provides a holomorphy domain for \(\partial _kT_k\), and together with (4.2) we can bound the expansion coefficients \(l_{k,{\varvec{\nu }}}\) of \(\partial _kT_k=\sum _{{\varvec{\nu }}\in {{\mathbb {N}}}_0^k}l_{k,{\varvec{\nu }}} L_{\varvec{\nu }}({{\varvec{y}}})\). However, there is a catch: in general one can find different \({\varvec{\delta }}\) such that Assumption 4.1 holds. The difficulty is to choose \({\varvec{\delta }}\) in a way that depends on \({\varvec{\nu }}\) to obtain as sharp a bound in (4.2) as possible. To do so we will use ideas from, e.g., [6] where similar calculations were made.

The outlined argument based on Proposition 4.2 (i) suffices to prove convergence of sparse polynomial expansions in the finite-dimensional case; see [41, Theorem 4.6]. In the infinite-dimensional case where we want to approximate \(T=(T_k)_{k\in {{\mathbb {N}}}}\) with only finitely many degrees of freedom we additionally need to employ Proposition 4.2 (ii): for \({\varvec{\nu }}\in {{\mathbb {N}}}_0^k\) such that \({\varvec{\nu }}\ne {\varvec{0}}{:=}(0)_{j=1}^k\) but \(\nu _k=0\), Proposition 4.2 (ii) together with (4.2) implies a bound of the type

$$\begin{aligned} |l_{k,{\varvec{\nu }}}|=\left| \int _{U_{k}} (\partial _k T_k({{\varvec{y}}}_{[k]})-1)L_{\varvec{\nu }}({{\varvec{y}}}_{[k]})\;\mathrm {d}\mu ({{\varvec{y}}}_{[k]}) \right| \le C \frac{1}{\delta _k} \prod _{j=1}^k (1+\delta _j)^{-\nu _j}, \end{aligned}$$
(4.4)

where the additional \(\frac{1}{\delta _k}\) stems from \(\Vert \partial _kT_k-1 \Vert _{L^\infty ({{\mathcal {B}}}_{{\varvec{\zeta }}_{[k-1]}}(U_{1})\times U_{1})}\le \frac{C_3}{\delta _k}\). Here we used the fact that \(\int _{U_{k}} L_{\varvec{\nu }}({{\varvec{y}}}_{[k]})\;\mathrm {d}\mu ({{\varvec{y}}}_{[k]})=0\) for all \({\varvec{\nu }}\ne {\varvec{0}}\), by orthogonality of the \((L_{\varvec{\nu }})_{{\varvec{\nu }}\in {{\mathbb {N}}}_0^k}\) and because \(L_{\varvec{0}}\equiv 1\). If \(\nu _k\ne 0\), the factor \(\frac{1}{1+\delta _k}\) occurs on the right-hand side of (4.2). Hence, all coefficients \(l_{k,{\varvec{\nu }}}\) with \({\varvec{\nu }}\ne {\varvec{0}}\) are of size \(O(\frac{1}{\delta _k})\). In fact one can show that even \(\sum _{{\varvec{\nu }}\ne {\varvec{0}}}|l_{k,{\varvec{\nu }}}||L_{\varvec{\nu }}({{\varvec{y}}}_{[k]})|\) is of size \(O(\frac{1}{\delta _k})\). Thus

$$\begin{aligned} \partial _kT_k({{\varvec{y}}}_{[k]})=\sum _{{\varvec{\nu }}\in {{\mathbb {N}}}_0^k}l_{k,{\varvec{\nu }}} L_{\varvec{\nu }}({{\varvec{y}}}_{[k]})=l_{k,{\varvec{0}}} L_{\varvec{0}}({{\varvec{y}}}_{[k]})+O\left( \frac{1}{\delta _k}\right) . \end{aligned}$$

Using \(L_{\varvec{0}}\equiv 1\) and \(\mathrm {d}\mu (t)=\frac{\mathrm {d}t}{2}\),

$$\begin{aligned} l_{k,{\varvec{0}}} = \int _{U_{k}} \partial _kT_k({{\varvec{y}}}_{[k]}) L_{\varvec{0}}({{\varvec{y}}}_{[k]})\;\mathrm {d}\mu ({{\varvec{y}}}_{[k]}) = \frac{1}{2}\int _{U_{{k-1}}} \big (T_k({{\varvec{y}}}_{[k-1]},1)-T_k({{\varvec{y}}}_{[k-1]},-1)\big )\;\mathrm {d}\mu ({{\varvec{y}}}_{[k-1]}) =1, \end{aligned}$$

and therefore, if \(\delta _k\) is very large, since \(L_{\varvec{0}}\equiv 1\),

$$\begin{aligned} T_k({{\varvec{y}}}_{[k]}) = -1+\int _{-1}^{y_k}\partial _kT_k({{\varvec{y}}}_{[k-1]},t)\;\mathrm {d}t \simeq -1+\int _{-1}^{y_k}l_{k,{\varvec{0}}}L_{\varvec{0}}({{\varvec{y}}}_{[k-1]},t)\;\mathrm {d}t=y_k. \end{aligned}$$
(4.5)

Hence, for large \(\delta _k\) we can use the trivial approximation \(T_k({{\varvec{y}}}_{[k]})\simeq y_k\). To address this special role played by the kth variable for the kth component we introduce

$$\begin{aligned} \gamma ({\varvec{\varrho }},{\varvec{\nu }}){:=}\varrho _k^{-\max \{1,\nu _k\}}\prod _{j=1}^{k-1}\varrho _j^{-\nu _j} \qquad \qquad \forall {\varvec{\varrho }}\in (1,\infty )^{{\mathbb {N}}},~{\varvec{\nu }}\in {{\mathbb {N}}}_0^k, \end{aligned}$$
(4.6)

which, up to constants, corresponds to the minimum of (4.2) and (4.4). This quantity can be interpreted as measuring the importance of the monomial \({{\varvec{y}}}^{\varvec{\nu }}\) in the ansatz space used for the approximation of \(T_k\), and we will use it to construct such ansatz spaces.
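A simple enumeration of the resulting index sets \(\Lambda _{\varepsilon ,k}=\{{\varvec{\nu }}\,:\,\gamma ({\varvec{\varrho }},{\varvec{\nu }})\ge \varepsilon \}\) (anticipating Theorem 5.2) might look as follows; the sequence \(b_j\sim j^{-2}\) and the constant \(\alpha =0.5\) in (5.2) are illustrative assumptions.

```python
import numpy as np
from itertools import product

# Hedged sketch of the a priori ansatz-space construction: enumerate
# Lambda_{eps,k} = {nu in N_0^k : gamma(rho, nu) >= eps}, gamma from (4.6).
# The sequence b_j ~ j^{-2} and alpha = 0.5 (cf. (5.2)) are assumptions.

def gamma(rho, nu):
    k = len(nu)
    val = rho[k - 1] ** (-max(1, nu[k - 1]))
    return val * np.prod([rho[j] ** (-nu[j]) for j in range(k - 1)])

def Lambda(rho, k, eps):
    # gamma decreases in every nu_j, so each coordinate degree is bounded:
    nmax = [int(np.log(1.0 / eps) / np.log(rho[j])) for j in range(k)]
    return [nu for nu in product(*(range(m + 1) for m in nmax))
            if gamma(rho, nu) >= eps]

b = np.array([j ** -2.0 for j in range(1, 50)])
rho = 1.0 + 0.5 / b
print([len(Lambda(rho, k, 1e-2)) for k in range(1, 8)])   # |Lambda_{eps,k}|
```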

Remark 4.3

To explain the key ideas, in this section we presented the approximation of \(T_k\) via a Legendre expansion of \(\partial _k T_k\). For the proofs of our approximation results in Sect. 5 we instead approximate \(\sqrt{\partial _k T_k}-1\) with truncated Legendre expansions. This will guarantee the approximate transport to satisfy the monotonicity property as explained in Sect. 5.

5 Convergence of the Transport

We are now in position to state an algebraic convergence result for approximations of infinite-dimensional transport maps \(T:U\rightarrow U\) associated to densities of type (2.3).

For a triangular approximation \({{\tilde{T}}}=({{\tilde{T}}}_k)_{k\in {{\mathbb {N}}}}\) to T it is desirable that it retains the monotonicity and bijectivity properties, i.e., \(\partial _k{{\tilde{T}}}_k>0\) and \({{\tilde{T}}}:U\rightarrow U\) is bijective. The first guarantees that \({{\tilde{T}}}\) is injective and easy to invert (by subsequently solving the one-dimensional equations \(x_k={{\tilde{T}}}_k(y_1,\dots ,y_k)\) for \(y_k\) starting with \(k=1\)), and for the purpose of generating samples, the second property ensures that for \({{\varvec{y}}}\sim {\rho }\), the transformed sample \({{\tilde{T}}}({{\varvec{y}}})\sim {{\tilde{T}}}_\sharp {\rho }\) also belongs to U. These constraints are hard to enforce for polynomial approximations. For this reason, we use the same rational parametrization we introduced in [41] for the finite-dimensional case: for a set of k-dimensional multiindices \(\Lambda \subseteq {{\mathbb {N}}}_0^k\), define

$$\begin{aligned} {{\mathbb {P}}}_\Lambda {:=}\mathrm{span}\{{{\varvec{y}}}^{\varvec{\nu }}\,:\,{\varvec{\nu }}\in \Lambda \}. \end{aligned}$$

The dimension of this space is equal to the cardinality of \(\Lambda \), which we denote by \(|\Lambda |\). Let \(p_k\in {{\mathbb {P}}}_\Lambda \) (where \(\Lambda \) remains to be chosen) be a polynomial approximation to \(\sqrt{\partial _kT_k}-1\). Set for \({{\varvec{y}}}\in U_{k}\)

$$\begin{aligned} {{\tilde{T}}}_k({{\varvec{y}}}) {:=}-1 + 2 \frac{\int _{-1}^{y_k}(p_k({{\varvec{y}}}_{[k-1]},t)+1)^2\;\mathrm {d}\mu (t)}{\int _{-1}^{1}(p_k({{\varvec{y}}}_{[k-1]},t)+1)^2\;\mathrm {d}\mu (t)}. \end{aligned}$$
(5.1)

It is easily checked that \({{\tilde{T}}}_k\) satisfies both monotonicity and bijectivity as long as \(p_k\ne -1\). Thus we end up with a rational function \({{\tilde{T}}}_k\), but we emphasize that the use of rational functions instead of polynomials is not due to better approximation capabilities, but solely to guarantee bijectivity of \({{\tilde{T}}}:U\rightarrow U\).
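For \(k=1\), (5.1) reduces to a one-dimensional rational map. A minimal sketch, with an arbitrary illustrative polynomial p, follows; since the factor \(\frac{1}{2}\) in \(\mathrm {d}\mu (t)=\frac{\mathrm {d}t}{2}\) cancels in the ratio, plain antiderivatives suffice.

```python
import numpy as np

# 1D sketch of (5.1): for a polynomial p with p != -1, T_tilde below is a
# strictly monotone bijection of [-1,1]. Coefficients of p are illustrative.

p = np.polynomial.Polynomial([0.3, -0.5, 0.2])
num = ((p + 1.0) ** 2).integ()          # antiderivative of (p+1)^2

def T_tilde(y):
    return -1.0 + 2.0 * (num(y) - num(-1.0)) / (num(1.0) - num(-1.0))

print(T_tilde(-1.0), T_tilde(1.0))                  # endpoints map to -1, 1
grid = np.linspace(-1.0, 1.0, 101)
print(bool(np.all(np.diff(T_tilde(grid)) > 0)))     # strictly increasing
```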

Remark 5.1

Observe that \(\Lambda =\emptyset \) gives the trivial approximation \(p_k{:=}0\in {{\mathbb {P}}}_\emptyset \) and \({{\tilde{T}}}_k({{\varvec{y}}})=y_k\).

The following theorem yields an algebraic convergence rate independent of the dimension (since the dimension is infinity) in terms of the total number of degrees of freedom for the approximation of T. Therefore the curse of dimensionality is overcome for densities as in Assumption 2.1.

Theorem 5.2

Let \(f_{\rho }\), \(f_{\pi }:U\rightarrow (0,\infty )\) be two probability densities satisfying Assumption 2.2 for some \(p\in (0,1)\). Set \(b_j{:=}\max \{\Vert \psi _{{\rho },j} \Vert _{Z},\Vert \psi _{{\pi },j} \Vert _{Z}\}\), \(j\in {{\mathbb {N}}}\).

There exist \(\alpha >0\) and \(C>0\) such that the following holds: for \(j\in {{\mathbb {N}}}\) set

$$\begin{aligned} \varrho _j{:=}1+ \frac{\alpha }{b_j}, \end{aligned}$$
(5.2)

and with \(\gamma ({\varvec{\varrho }},{\varvec{\nu }})\) as in (4.6) define

$$\begin{aligned} \Lambda _{\varepsilon ,k}{:=}\{{\varvec{\nu }}\in {{\mathbb {N}}}_0^k\,:\,\gamma ({\varvec{\varrho }},{\varvec{\nu }})\ge \varepsilon \}\qquad \forall k\in {{\mathbb {N}}}. \end{aligned}$$

For each \(k\in {{\mathbb {N}}}\) there exists a polynomial \(p_k\in {{\mathbb {P}}}_{\Lambda _{\varepsilon ,k}}\) such that with the components \({{\tilde{T}}}_{\varepsilon ,k}\) as in (5.1), \({{\tilde{T}}}_\varepsilon =(\tilde{T}_{\varepsilon ,k})_{k\in {{\mathbb {N}}}}:U\rightarrow U\) is a monotone triangular bijection. For all \(\varepsilon >0\), it holds that \(N_\varepsilon {:=}\sum _{k\in {{\mathbb {N}}}} |\Lambda _{\varepsilon ,k}|<\infty \) and

$$\begin{aligned} \sum _{k\in {{\mathbb {N}}}}\Vert T_k-{{\tilde{T}}}_{\varepsilon ,k} \Vert _{L^{\infty }(U_{k})}\le C N_\varepsilon ^{-\frac{1}{p}+1} \end{aligned}$$
(5.3a)

and

$$\begin{aligned} \sum _{k\in {{\mathbb {N}}}}\Vert \partial _kT_k- \partial _k{{\tilde{T}}}_{\varepsilon ,k} \Vert _{L^{\infty }(U_{k})}\le C N_\varepsilon ^{-\frac{1}{p}+1}. \end{aligned}$$
(5.3b)

Remark 5.3

Fix \(\varepsilon >0\). Since \(N_\varepsilon <\infty \), there exists \(k_0\in {{\mathbb {N}}}\) such that \(\Lambda _{\varepsilon ,k}=\emptyset \) for all \(k\ge k_0\), and thus \({{\tilde{T}}}_{\varepsilon ,k}({{\varvec{y}}}_{[k]})=y_k\), cp. Remark 5.1.

Switching the roles of \({\rho }\) and \({\pi }\), Theorem 5.2 also yields an approximation result for the inverse transport \(S=T^{-1}\) by some rational functions \({{\tilde{S}}}_k\) as in (5.1). Moreover, if \({{\tilde{T}}}\) is the rational approximation from Theorem 5.2, then its inverse \(\tilde{T}^{-1}:U\rightarrow U\) (whose components are not necessarily rational functions) also satisfies an error bound of type (5.3) as we show next.

Corollary 5.4

Consider the setting of Theorem 5.2. Denote \(S{:=}T^{-1}:U\rightarrow U\) and \({{\tilde{S}}}_\varepsilon {:=}{{\tilde{T}}}_\varepsilon ^{-1}:U\rightarrow U\). Then there exists a constant C such that for all \(\varepsilon >0\)

$$\begin{aligned} \sum _{k\in {{\mathbb {N}}}}\Vert S_k-{{\tilde{S}}}_{\varepsilon ,k} \Vert _{L^{\infty }(U_{k})}\le C N_\varepsilon ^{-\frac{1}{p}+1} \end{aligned}$$
(5.4a)

and

$$\begin{aligned} \sum _{k\in {{\mathbb {N}}}}\Vert \partial _kS_k- \partial _k{{\tilde{S}}}_{\varepsilon ,k} \Vert _{L^{\infty }(U_{k})}\le C N_\varepsilon ^{-\frac{1}{p}+1}. \end{aligned}$$
(5.4b)

Note that both S and \({{\tilde{S}}}_\varepsilon \) in Corollary 5.4 are monotonic, triangular bijections as they are the inverses of such maps.
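As noted at the beginning of this section, a monotone triangular map is easy to invert by solving the one-dimensional equations \(x_k={{\tilde{T}}}_k(y_1,\dots ,y_k)\) for \(y_k\), one k at a time. A hedged sketch with two illustrative monotone components:

```python
import numpy as np

# Sketch of inverting a monotone triangular map T = (T_1, T_2): solve
# x_k = T_k(y_1,...,y_k) for y_k by bisection, for k = 1, 2. The two
# components below are illustrative monotone bijections of [-1,1].

T1 = lambda y1: y1 + 0.1 * (y1**3 - y1)
T2 = lambda y1, y2: y2 + 0.2 * y1 * (y2**2 - 1.0)

def bisect(g, target, tol=1e-12):
    """Solve g(y) = target for y in [-1,1], g strictly increasing."""
    lo, hi = -1.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < target else (lo, mid)
    return 0.5 * (lo + hi)

def S(x1, x2):
    """Inverse transport S = T^{-1}, computed one component at a time."""
    y1 = bisect(T1, x1)
    y2 = bisect(lambda s: T2(y1, s), x2)
    return y1, y2

y1, y2 = S(0.25, -0.7)
print((T1(y1), T2(y1, y2)))   # reproduces (0.25, -0.7) up to tolerance
```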

6 Convergence of the Pushforward Measures

Theorem 5.2 established smallness of \(\sum _{k\in {{\mathbb {N}}}}|\partial _k(T_k-{{\tilde{T}}}_k)|\). The relevance of this term stems from the formal calculation (cp. (3.6))

$$\begin{aligned} |\det dT-\det d{{\tilde{T}}}|=\left| \prod _{k\in {{\mathbb {N}}}}\partial _kT_k- \prod _{k\in {{\mathbb {N}}}}\partial _k{{\tilde{T}}}_k\right| \le \sum _{k\in {{\mathbb {N}}}}|\partial _kT_k-\partial _k{{\tilde{T}}}_k| \prod _{j<k}|\partial _j T_j|\prod _{i>k}|\partial _i{{\tilde{T}}}_i|. \end{aligned}$$

Assuming that we can bound the last two products, the determinant \(\det d{{\tilde{T}}}\) converges to \(\det d T\) at the rate given in Theorem 5.2. This will allow us to bound the Hellinger distance (H), the total variation distance (TV), and the Kullback-Leibler divergence (KL) between \({{\tilde{T}}}_\sharp {\rho }\) and \({\pi }\), as we show in the following theorem. Recall that for two probability measures \(\nu \ll \mu \), \(\eta \ll \mu \) on U with densities \(f_\nu =\frac{\mathrm {d}\nu }{\mathrm {d}\mu }\), \(f_\eta =\frac{\mathrm {d}\eta }{\mathrm {d}\mu }\),

$$\begin{aligned} \mathrm{H}(\nu ,\eta )=\frac{1}{\sqrt{2}}\Vert \sqrt{f_\nu }-\sqrt{f_\eta } \Vert _{L^2(U,\mu )},\qquad \mathrm{TV}(\nu ,\eta )=\frac{1}{2}\Vert f_\nu -f_\eta \Vert _{L^1(U,\mu )},\qquad \mathrm{KL}(\nu ,\eta )=\int _U \log \left( \frac{f_\nu }{f_\eta }\right) \;\mathrm {d}\nu . \end{aligned}$$
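For two explicit one-dimensional densities these quantities can be evaluated by quadrature; a hedged sketch with illustrative densities:

```python
import numpy as np

# Hedged sketch: the three distances above for two illustrative positive
# densities w.r.t. mu = dx/2 on [-1,1], evaluated by Gauss quadrature.

f_nu = lambda x: 1.0 + 0.5 * x          # integrates to 1 w.r.t. dx/2
f_eta = lambda x: 1.0 - 0.3 * x

xq, wq = np.polynomial.legendre.leggauss(100)
wq = 0.5 * wq                           # quadrature weights for dmu = dx/2

H = np.sqrt(0.5 * np.sum(wq * (np.sqrt(f_nu(xq)) - np.sqrt(f_eta(xq))) ** 2))
TV = 0.5 * np.sum(wq * np.abs(f_nu(xq) - f_eta(xq)))
KL = np.sum(wq * f_nu(xq) * np.log(f_nu(xq) / f_eta(xq)))
print(H, TV, KL)
```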

Theorem 6.1

Let \(f_{\rho }\), \(f_{\pi }\) satisfy Assumption 2.2 for some \(p\in (0,1)\), and let \({{\tilde{T}}}_\varepsilon :U\rightarrow U\) be the approximate transport from Theorem 5.2.

Then there exists \(C>0\) such that for \(\mathrm{dist}\in \{\mathrm{H},\mathrm{TV},\mathrm{KL}\}\) and every \(\varepsilon >0\)

$$\begin{aligned} \mathrm{dist}(({{\tilde{T}}}_\varepsilon )_\sharp {\rho },{\pi }) \le C N_\varepsilon ^{-\frac{1}{p}+1}. \end{aligned}$$
(6.1)

Next we treat the Wasserstein distance. Recall that for a Polish space (M, d) (i.e., M is separable and complete with the metric d on M) and for \(q\in [1,\infty )\), the q-Wasserstein distance between two probability measures \(\nu \) and \(\eta \) on M (equipped with the Borel \(\sigma \)-algebra) is defined as [37, Def. 6.1]

$$\begin{aligned} W_q(\nu ,\eta ){:=}\inf _{\gamma \in \Gamma } \left( \int _{M\times M} d(x,y)^q\;\mathrm {d}\gamma (x,y)\right) ^{1/q}, \end{aligned}$$
(6.2)

where \(\Gamma \) stands for the couplings between \(\eta \) and \(\nu \), i.e., the set of probability measures on \(M\times M\) with marginals \(\nu \) and \(\eta \), cp. [37, Def. 1.1].

To bound the Wasserstein distance, we employ the following proposition. It has been similarly stated in [32, Theorem 2], but for measures on \({{\mathbb {R}}}^d\). To fit our setting, we extend the result to compact metric spaces; the proof closely follows that of [32, Theorem 2]. As pointed out in [32], the bound in the proposition is sharp.

Proposition 6.2

Let \((M_1,d_1)\) be a compact metric space, and \((M_2,d_2)\) a Polish space, both equipped with the Borel \(\sigma \)-algebra. Let \(T:M_1\rightarrow M_2\) and \({{\tilde{T}}}:M_1\rightarrow M_2\) be two continuous functions and let \(\nu \) be a probability measure on \(M_1\). Then for every \(q\in [1,\infty )\)

$$\begin{aligned} W_q(T_\sharp \nu ,{{\tilde{T}}}_\sharp \nu )\le \sup _{x \in M_1}d_2(T(x),{{\tilde{T}}}(x))<\infty . \end{aligned}$$

To apply Proposition 6.2 we first have to equip U with a metric. For a sequence \((c_j)_{j\in {{\mathbb {N}}}}\in \ell ^1({{\mathbb {N}}})\) of positive numbers set

$$\begin{aligned} d({{\varvec{x}}},{{\varvec{y}}}){:=}\sum _{j\in {{\mathbb {N}}}}c_j|x_j-y_j|\qquad \forall \;{{\varvec{x}}},{{\varvec{y}}}\in U. \end{aligned}$$
(6.3)

By Lemma A.1, d defines a metric that induces the product topology on U. Since U with the product topology is a compact space by Tychonoff’s theorem [26, Theorem 37.3], (U, d) is a compact Polish space. Moreover:

Lemma 6.3

Let \(f_{\rho }\), \(f_{\pi }\) satisfy Assumption 2.2 and consider the metric (6.3) on U. Then \(T:U\rightarrow U\) and the approximation \({{\tilde{T}}}_\varepsilon :U\rightarrow U\) from Theorem 5.2 are continuous with respect to d. Moreover, if there exists \(C>0\) such that with

$$\begin{aligned} b_j{:=}\max \{\Vert \psi _{{\rho },j} \Vert _{X},\Vert \psi _{{\pi },j} \Vert _{Y}\} \end{aligned}$$
(6.4)

it holds that \(b_j\le Cc_j\) for all \(j\in {{\mathbb {N}}}\) (cp. Assumption 2.2), then T and \({{\tilde{T}}}_\varepsilon \) are Lipschitz continuous.

With \(d:U\times U\rightarrow {{\mathbb {R}}}\) as in (6.3), (U, d) is a compact Polish space and T and \({{\tilde{T}}}_\varepsilon \) are continuous, so that we can apply Proposition 6.2. Using Theorem 5.2 and \(\sup _jc_j\in (0,\infty )\),

$$\begin{aligned} W_q(T_\sharp {\rho },({{\tilde{T}}}_\varepsilon )_\sharp {\rho })\le \sup _{{{\varvec{y}}}\in U} d(T({{\varvec{y}}}),{{\tilde{T}}}_\varepsilon ({{\varvec{y}}}))\le \sum _{k\in {{\mathbb {N}}}}\Vert T_k-{{\tilde{T}}}_{\varepsilon ,k} \Vert _{L^\infty (U_{k})}c_k \le CN_\varepsilon ^{-\frac{1}{p}+1}. \end{aligned}$$
(6.5)
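Proposition 6.2 can also be checked empirically in one dimension, e.g., with SciPy’s 1-Wasserstein implementation; the maps T and \({{\tilde{T}}}\) below are illustrative, and the SciPy dependency is an assumption.

```python
import numpy as np
from scipy.stats import wasserstein_distance   # assumes SciPy is available

# Empirical illustration of Proposition 6.2 in 1D: the 1-Wasserstein
# distance of two pushforwards is bounded by the sup-distance of the maps.

T = lambda x: np.tanh(2.0 * x)
T_tilde = lambda x: np.tanh(2.0 * x) + 0.05 * np.cos(np.pi * x)

rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, size=100000)        # samples from nu

W1 = wasserstein_distance(T(x), T_tilde(x))
sup = np.max(np.abs(T(x) - T_tilde(x)))        # sampled estimate of the sup
print(W1, sup)                                 # W1 <= sup, up to sampling error
```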

Next let us discuss why \(c_j{:=}b_j\) as in (6.4) is a natural choice in our setting. Let \(\Phi :U\rightarrow Y\) be the map

$$\begin{aligned} \Phi ({{\varvec{y}}}){:=}\sum _{j\in {{\mathbb {N}}}}y_j\psi _{{\pi },j}\in Y. \end{aligned}$$

In the inverse problem discussed in Example 2.6, we try to recover an element \(\Phi ({{\varvec{y}}})\in Y\). For computational purposes, the problem is set up to recover instead the expansion coefficients \({{\varvec{y}}}\in U\). Now suppose that \({\pi }\) is the posterior measure on U. Then \(\Phi _\sharp {\pi }=(\Phi \circ T)_\sharp {\rho }\) is the corresponding posterior measure on Y (the space we are actually interested in). The map \(\Phi :U\rightarrow Y\) is Lipschitz continuous w.r.t. the metric d on U, since for \({{\varvec{x}}}\), \({{\varvec{y}}}\in U\) due to \(\Vert \psi _{{\pi },j} \Vert _{Y}\le b_j\),

$$\begin{aligned} \Vert \Phi ({{\varvec{x}}})-\Phi ({{\varvec{y}}}) \Vert _{Y} =\left\| \sum _{j\in {{\mathbb {N}}}}(x_j-y_j)\psi _{{\pi },j} \right\| _{Y} \le \sum _{j\in {{\mathbb {N}}}}|x_j-y_j|b_j=d({{\varvec{x}}},{{\varvec{y}}}). \end{aligned}$$
(6.6)

Therefore, \(\Phi \circ T:U\rightarrow Y\) and \(\Phi \circ {{\tilde{T}}}_\varepsilon :U\rightarrow Y\) are Lipschitz continuous by Lemma 6.3. Moreover, compactness of U and continuity of \(\Phi :U\rightarrow Y\) imply that \(\Phi (T(U))=\Phi ({{\tilde{T}}}_\varepsilon (U))=\Phi (U)\subseteq Y\) is compact and thus separable. Hence we may apply Proposition 6.2 also to the maps \(\Phi \circ T:U\rightarrow \Phi (U)\) and \(\Phi \circ \tilde{T}_\varepsilon :U\rightarrow \Phi (U)\). This gives a bound on the distance between the pushforward measures on the Banach space Y. Specifically, since \(\Vert \Phi (T({{\varvec{y}}}))-\Phi ({{\tilde{T}}}_\varepsilon ({{\varvec{y}}})) \Vert _{Y}\le d(T({{\varvec{y}}}),{{\tilde{T}}}_\varepsilon ({{\varvec{y}}}))\), which can be bounded as in (6.5), we have shown:

Theorem 6.4

Let \(f_{\rho }\), \(f_{\pi }\) satisfy Assumption 2.2 for some \(p\in (0,1)\), let \({{\tilde{T}}}_\varepsilon :U\rightarrow U\) be the approximate transport and let \(N_\varepsilon \in {{\mathbb {N}}}\) be the number of degrees of freedom as in Theorem 5.2.

Then there exists \(C>0\) such that for every \(q\in [1,\infty )\) and every \(\varepsilon >0\)

$$\begin{aligned} W_q(({{\tilde{T}}}_\varepsilon )_\sharp {\rho },{\pi }) \le C N_\varepsilon ^{-\frac{1}{p}+1}, \end{aligned}$$

and for the pushforward measures on the Banach space Y

$$\begin{aligned} W_q((\Phi \circ {{\tilde{T}}}_\varepsilon )_\sharp {\rho },\Phi _\sharp {\pi }) \le C N_\varepsilon ^{-\frac{1}{p}+1}. \end{aligned}$$
(6.7)

Finally let us discuss how to efficiently sample from the measure \(\Phi _\sharp {\pi }\) on the Banach space Y. As explained in the introduction, for a sample \({{\varvec{y}}}\sim {\rho }\) we have \(T({{\varvec{y}}})\sim {\pi }\) and \(\Phi (T({{\varvec{y}}}))=\sum _{j\in {{\mathbb {N}}}}T_j({{\varvec{y}}}_{[j]})\psi _{{\pi },j}\sim \Phi _\sharp {\pi }\). To truncate this series, introduce \(\Phi _s({{\varvec{y}}}_{[s]}){:=}\sum _{j=1}^s y_j\psi _{{\pi },j}\). As earlier, denote by \({\rho }_s\) the marginal measure of \({\rho }\) on \(U_{s}\). For \({{\varvec{y}}}_{[s]}\sim {\rho }_s\), the sample

$$\begin{aligned} \Phi _{s}({{\tilde{T}}}_{\varepsilon ,[s]}({{\varvec{y}}}_{[s]})) =\sum _{j=1}^{s}{{\tilde{T}}}_{\varepsilon ,j}({{\varvec{y}}}_{[j]})\psi _{{\pi },j} \end{aligned}$$

follows the distribution of \((\Phi _s\circ \tilde{T}_{\varepsilon ,[s]})_\sharp {\rho }_s\), where \({{\tilde{T}}}_{\varepsilon ,[s]}{:=}({{\tilde{T}}}_{\varepsilon ,k})_{k=1}^s:U_{s}\rightarrow U_{s}\). In the next corollary we bound the Wasserstein distance between \((\Phi _s\circ \tilde{T}_{\varepsilon ,[s]})_\sharp {\rho }_s\) and \(\Phi _\sharp {\pi }\). Note that the former is a measure on Y which, in contrast to the latter, is supported on an s-dimensional subspace. Thus in general neither of these two measures need be absolutely continuous w.r.t. the other. Consequently the KL divergence, the total variation distance, and the Hellinger distance, in contrast with the Wasserstein distance, need not tend to 0 as \(\varepsilon \rightarrow 0\) and \(s\rightarrow \infty \).
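Putting the pieces together, a hedged sketch of the truncated sampler follows; the basis \(\psi _{{\pi },j}\) and the first two component maps are illustrative stand-ins, and for \(j\ge 3\) we use the trivial components \({{\tilde{T}}}_{\varepsilon ,j}({{\varvec{y}}}_{[j]})=y_j\) (cf. Remark 5.3).

```python
import numpy as np

# Hedged sketch of the truncated sampler Phi_s(T_tilde_{eps,[s]}(y_{[s]})).
# Basis psi_j and the first two component maps are illustrative stand-ins;
# for j >= 3 the trivial components T_tilde_j(y) = y_j are used.

s = 5
x_grid = np.linspace(0.0, 1.0, 200)
psi = [np.sin(j * np.pi * x_grid) / j**2 for j in range(1, s + 1)]

T_tilde = [lambda y: y[0] + 0.1 * (y[0]**3 - y[0]),
           lambda y: y[1] + 0.2 * y[0] * (y[1]**2 - 1.0)]
T_tilde += [(lambda j: (lambda y: y[j]))(j) for j in range(2, s)]

rng = np.random.default_rng(5)
y = rng.uniform(-1.0, 1.0, size=s)            # y_{[s]} ~ rho_s (uniform)
z = [T_tilde[j](y) for j in range(s)]         # z = T_tilde_{eps,[s]}(y_{[s]})
draw = sum(z[j] * psi[j] for j in range(s))   # a function on [0,1]:
print(draw[:3])                               # approx. sample from Phi_sharp pi
```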

The corollary shows that the convergence rate in (6.7) can be retained by choosing the truncation parameter s as \(N_\varepsilon \) (the number of degrees of freedom in Theorem 5.2); in fact, it even suffices to truncate after the maximal k such that \(\Lambda _{\varepsilon ,k}\ne \emptyset \), as explained in Remark 6.7.

Corollary 6.5

Consider the setting of Theorem 6.4 and assume that \((b_j)_{j\in {{\mathbb {N}}}}\) in (6.4) is monotonically decreasing. Then there exists \(C>0\) such that for every \(q\in [1,\infty )\) and \(\varepsilon >0\)

$$\begin{aligned} W_q((\Phi _{N_\varepsilon }\circ {{\tilde{T}}}_{\varepsilon ,[N_\varepsilon ]})_\sharp {\rho }_{N_\varepsilon },\Phi _\sharp {\pi }) \le C N_\varepsilon ^{-\frac{1}{p}+1}. \end{aligned}$$

Remark 6.6

Convergence in \(W_q\) implies weak convergence [37, Theorem 6.9].

Remark 6.7

Checking the proof of Theorem 5.2, we have \(N_\varepsilon \le C\varepsilon ^{-p}\), cp. (C.21). Thus the maximal activated dimension (represented by the truncation parameter \(s=N_\varepsilon \)) increases only algebraically as \(\varepsilon \rightarrow 0\). The approximation error also decreases algebraically like \(\varepsilon ^{1-p}\) as \(\varepsilon \rightarrow 0\), cp. (C.22). Moreover, the function \(\Phi _{s_\varepsilon }\circ {{\tilde{T}}}_{\varepsilon ,[s_\varepsilon ]}\) with \(s_\varepsilon {:=}\max \{k\in {{\mathbb {N}}}\,:\,\Lambda _{\varepsilon ,k}\ne \emptyset \}\) leads to the same convergence rate in Corollary 6.5. In other words, we only need to use the components \({{\tilde{T}}}_{\varepsilon ,k}\) for which \(\Lambda _{\varepsilon ,k}\ne \emptyset \).

7 Conclusions

The use of transportation methods to sample from high-dimensional distributions is becoming increasingly popular to solve inference problems and perform other machine learning tasks. Therefore, questions of when and how these methods can be successful are of great importance, but are thus far not well understood. In the present paper we analyze the approximation of the KR transport in the high- (or infinite-)dimensional regime and on the bounded domain \([-1,1]^{{\mathbb {N}}}\). In the setting presented in Sect. 2, it is shown that the transport can be approximated without suffering from the curse of dimensionality. Our approximation is based on polynomial and rational functions, and we provide an explicit a priori construction of the ansatz space. Moreover, we show how these results imply that it is possible to efficiently sample from certain high-dimensional distributions by transforming a lower-dimensional latent variable.

As we have discussed in the finite-dimensional case [41, Sect. 5], from an approximation viewpoint there is also a link to neural networks, which can be established via [36, 39] where it is proven that ReLU neural networks are efficient at emulating polynomials and rational functions. While we have not developed this aspect further in the present manuscript, we mention that neural networks are used in the form of normalizing flows [29, 30] to couple distributions in spaces of equal dimension, and for example in the form of generative adversarial networks [2, 14] and, more recently, injective flows [19, 21], to map lower-dimensional latent variables to samples from a high-dimensional distribution. In Sect. 6 we provided some insight (for the present setting, motivated by inverse problems in science and engineering) into how low-dimensional the latent variable can be, and how expressive the transport should be, to achieve a certain accuracy in the Wasserstein distance (see Corollary 6.5). Further examining this connection and generalizing our results to distributions on unbounded domains (such as \({{\mathbb {R}}}^{{\mathbb {N}}}\) instead of \([-1,1]^{{\mathbb {N}}}\)) will be the topic of future research.