Abstract
For two probability measures \({\rho }\) and \({\pi }\) with analytic densities on the d-dimensional cube \([-1,1]^d\), we investigate the approximation of the unique triangular monotone Knothe–Rosenblatt transport \(T:[-1,1]^d\rightarrow [-1,1]^d\), such that the pushforward \(T_\sharp {\rho }\) equals \({\pi }\). It is shown that for \(d\in {{\mathbb {N}}}\) there exist approximations \({\tilde{T}}\) of T, based on either sparse polynomial expansions or deep ReLU neural networks, such that the distance between \({\tilde{T}}_\sharp {\rho }\) and \({\pi }\) decreases exponentially. More precisely, we prove error bounds of the type \(\exp (-\beta N^{1/d})\) (or \(\exp (-\beta N^{1/(d+1)})\) for neural networks), where N refers to the dimension of the ansatz space (or the size of the network) containing \({\tilde{T}}\); the notion of distance comprises the Hellinger distance, the total variation distance, the Wasserstein distance and the Kullback–Leibler divergence. Our construction guarantees \({\tilde{T}}\) to be a monotone triangular bijective transport on the hypercube \([-1,1]^d\). Analogous results hold for the inverse transport \(S=T^{-1}\). The proofs are constructive, and we give an explicit a priori description of the ansatz space, which can be used for numerical implementations.
1 Introduction
A long-standing challenge in applied mathematics and statistics is to approximate integrals w.r.t. a probability measure \({\pi }\), given only through its (possibly unnormalized) Lebesgue density \(f_{{\pi }}\), on a high-dimensional integration domain. Here we consider probability measures on the bounded domain \([-1,1]^d\). One of the main applications is in Bayesian inference, where parameters \({{\varvec{y}}}\in [-1,1]^d\) are inferred from noisy and/or indirect data. In this case, \({\pi }\) is interpreted as the so-called posterior measure. It is obtained by Bayes’ rule and encompasses all information about the parameters given the data. A typical goal is to compute the expectation \(\int _{[-1,1]^d}g({{\varvec{y}}})\;\mathrm {d}{\pi }( {{\varvec{y}}})\) of some quantity of interest \(g:[-1,1]^d\rightarrow {{\mathbb {R}}}\) w.r.t. the posterior.
Various approaches to high-dimensional integration have been proposed in the literature. One of the most common strategies consists of Monte Carlo-type sampling, e.g., with Markov chain Monte Carlo (MCMC) methods [55]. Metropolis–Hastings MCMC algorithms, for instance, are versatile and simple to implement. Yet the mixing times of standard Metropolis algorithms can scale somewhat poorly with the dimension d. (Here function-space MCMC algorithms [15, 16, 57] and others [47, 70] represent notable exceptions, with dimension-independent mixing for certain classes of target measures.) In general, MCMC algorithms may suffer from slow convergence and possibly long burn-in phases. Furthermore, MCMC is intrinsically serial, which can make sampling infeasible when each evaluation of \(f_{\pi }\) is costly. Variational inference, e.g., [5], can improve on some of these drawbacks. It replaces the task of sampling from \({\pi }\) by an optimization problem. The idea is to minimize (for instance) the KL divergence between \({\pi }\) and a second measure \({\tilde{{\pi }}}\) in some given class of tractable measures, where ‘tractable’ means that independent and identically distributed (iid) samples from \({\tilde{{\pi }}}\) can easily be produced.
Transport-based methods are one instance of this category: given an easy-to-sample-from “reference” measure \({\rho }\), we look for an approximation \({\tilde{T}}\) to the transport T such that \(T_\sharp {\rho }={\pi }\). Here \(T_\sharp {\rho }\) denotes the pushforward measure, i.e., \(T_\sharp {\rho }(B)={\rho }(T^{-1}(B))\) for all measurable sets B. Then \({\tilde{{\pi }}}{:}{=}{\tilde{T}}_\sharp {\rho }\) is an approximation to the “target” \({\pi }=T_\sharp {\rho }\). Unlike in optimal transportation theory, e.g., [59, 71], T is not required to minimize some cost. This allows imposing further structure on T in order to simplify numerical computations. In this paper, we concentrate on the triangular Knothe–Rosenblatt (KR) rearrangement [56], which has been found to be particularly useful in this context. The reason for concentrating on the KR transport (rather than, for instance, optimal transport) is that it is widely used in practical algorithms [25, 34, 66, 72], with the advantages of allowing easy inversion, efficient computation of the Jacobian determinant, and direct extraction of certain conditionals [65]; and from a mathematical standpoint, its explicit construction makes it amenable to a rigorous analysis. Given an approximation \({\tilde{T}}\) to (some) transport T, for a random variable \(X\sim {\rho }\) it holds that \({\tilde{T}}(X)\sim {\tilde{{\pi }}}\). Thus, iid samples from \({\tilde{{\pi }}}\) are obtained via \(({\tilde{T}}(X_i))_{i=1}^n\), where \((X_i)_{i=1}^n\) are \({\rho }\)-distributed iid. This strategy, and various refinements and variants, has been investigated theoretically and empirically, and successfully applied in Bayesian inference; see, e.g., [2, 22, 25, 45, 52, 66].
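As a minimal illustration of this sampling mechanism (not of the construction analyzed in this paper), the following Python sketch pushes uniform reference samples through a placeholder monotone map \({\tilde{T}}\) and evaluates a quantity of interest by a plain Monte Carlo average over the resulting pushforward samples; the map and the integrand are chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)

def T_tilde(x):
    # placeholder for an approximate monotone transport on [-1, 1]:
    # any increasing bijection of [-1, 1] onto itself will do for this illustration
    return np.sin(0.5 * np.pi * x)

x = rng.uniform(-1.0, 1.0, size=10_000)     # iid samples from the reference rho (uniform)
y = T_tilde(x)                              # iid samples from the pushforward T_tilde_# rho

g = lambda t: np.cos(np.pi * t)             # a quantity of interest
print("Monte Carlo estimate of E_{T_tilde_# rho}[g]:", g(y).mean())
```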
Normalizing flows, now widely used in the machine learning literature [36, 50, 54] for variational inference, generative modeling, and density estimation, are another instance of this transport framework. In particular, many so-called autoregressive flows, e.g., [34, 51], employ specific neural network parametrizations of triangular maps. A complete mathematical convergence analysis of these methods is not yet available in the literature.
Sampling methods can be contrasted with deterministic approaches, where a quadrature rule is constructed that converges at a guaranteed (deterministic) rate for all integrands in some function class. Unlike sampling methods, deterministic quadratures can achieve higher-order convergence. They even overcome the curse of dimensionality, presuming certain smoothness properties of the integrand. We refer to sparse-grid quadratures [10, 13, 27, 30, 61, 77] and quasi-Monte Carlo quadrature [9, 20, 60] as examples. It is difficult to construct deterministic quadrature rules for an arbitrary measure \({\pi }\), however, so typically they are only available in specific cases such as for the Lebesgue or Gaussian measure. Interpreting \(\int _{[-1,1]^d}g(t)\;\mathrm {d}{\pi }( t)\) as the integral \(\int _{[-1,1]^d}g(t)f_{\pi }(t)\;\mathrm {d}t\) w.r.t. the Lebesgue measure (here \(f_{\pi }\) is again the Lebesgue density of \({\pi }\)), such methods are still applicable. In Bayesian inference, however, \({\pi }\) can be strongly concentrated, corresponding to either small noise in the observations or a large data set. Then this viewpoint may become problematic. For example, the error of Monte Carlo quadrature depends on the variance of the integrand. The variance of \(gf_{\pi }\) (w.r.t. the Lebesgue measure) can be much larger than the variance of g (w.r.t. \({\pi }\)) when \({\pi }\) is strongly concentrated, i.e., when \(f_{\pi }\) is very “peaky.” This problem was addressed in [62] by combining an adaptive sparse-grid quadrature with a linear transport map. This approach combines the advantage of high-order convergence with quadrature points that are mostly within the area where \({\pi }\) is concentrated. Yet if \({\pi }\) is multimodal (i.e., concentrated in several separated areas) or unimodal but strongly non-Gaussian, then the linearity of the transport precludes such a strategy from being successful. A similar statement can be made about the method analyzed in [63], where the (Gaussian) Laplace approximation is used in place of a strongly concentrated posterior measure. For such \({\pi }\), the combination of nonlinear transport maps with deterministic quadrature rules may lead to significantly improved algorithms.
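The variance inflation caused by a concentrated target can be made concrete with a small numerical experiment (an illustration only, with an arbitrarily chosen peaked density): estimating \(\int g\,\mathrm {d}{\pi }\) by uniform sampling of \(g f_{\pi }\) is compared with sampling from \({\pi }\) directly via a tabulated inverse CDF, which is precisely the one-dimensional transport discussed in Sect. 1.2.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, x0 = 0.01, 0.3                                   # strongly concentrated target
g = lambda x: x ** 2                                    # quantity of interest

grid = np.linspace(-1.0, 1.0, 20_001)
f_unnorm = lambda x: np.exp(-0.5 * ((x - x0) / sigma) ** 2)
Z = np.trapz(f_unnorm(grid), grid)
f_pi = lambda x: f_unnorm(x) / Z                        # Lebesgue density of pi on [-1, 1]

n = 10_000
# (i) view the integral as a Lebesgue integral of g * f_pi and sample uniformly
u = rng.uniform(-1.0, 1.0, n)
vals_lebesgue = 2.0 * g(u) * f_pi(u)                    # factor 2 = volume of [-1, 1]

# (ii) sample from pi itself via a (crudely tabulated) inverse CDF, i.e. the d = 1 transport
cdf = np.cumsum(f_pi(grid)); cdf /= cdf[-1]
x_pi = np.interp(rng.uniform(0.0, 1.0, n), cdf, grid)
vals_transport = g(x_pi)

print("reference value      :", np.trapz(g(grid) * f_pi(grid), grid))
print("estimate / std  (i)  :", vals_lebesgue.mean(), np.std(vals_lebesgue))
print("estimate / std  (ii) :", vals_transport.mean(), np.std(vals_transport))
```

The standard deviation of the integrand in variant (i) grows as the target concentrates, while in variant (ii) it stays of the order of the variance of g under \({\pi }\).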
In a related spirit, we also mention interacting particle systems such as kernel-based Stein variational gradient descent (SVGD) and its variants, which have recently emerged as a promising research direction [19, 41]. Put simply, these methods try to find n points \((x_i)_{i=1}^n\) minimizing an approximation of the KL divergence between the target \({\pi }\) and the measure represented by the sample \((x_i)_{i=1}^n\). A discrete convergence analysis is not yet available, but a connection between the mean field limit and gradient flow has been established [23, 40, 42].
In this paper, we analyze the approximation of the transport T satisfying \(T_\sharp {\rho }={\pi }\) under the assumption that the reference and target densities \(f_{\rho }\) and \(f_{\pi }\) are analytic. This assumption is quite strong, but satisfied in many applications, including the main application we have in mind, which is Bayesian inference in partial differential equations (PDEs). The reference \({\rho }\) can be chosen at will, e.g., as a uniform measure so that \(f_{\rho }\) is constant (thus analytic) on \([-1,1]^d\). And here the target density \(f_{\pi }\) is a posterior density, stemming from a PDE-driven likelihood function. For certain linear and nonlinear PDEs (for example, the Navier–Stokes equations), it can be shown under suitable conditions that the corresponding posterior density is indeed an analytic function of the parameters; we refer, for instance, to [13, 64].
As outlined above, T can be employed in the construction of either sampling-based or deterministic quadrature methods. Understanding the approximation properties of T is the first step toward a rigorous convergence analysis of such algorithms. In practice, once a suitable ansatz space has been identified, a (usually non-convex) optimization problem must be solved to find a suitable \({\tilde{T}}\) within the ansatz space. While this optimization is beyond the scope of the current paper, we intend to empirically analyze a numerical algorithm based on the present results in a forthcoming publication.
Throughout we consider transports on \([-1,1]^d\) with \(d\in {{\mathbb {N}}}\). It is straightforward to generalize all the presented results to arbitrary Cartesian products \(\times _{j=1}^d[a_j,b_j]\) with \(-\infty<a_j<b_j<\infty \) for all j, via an affine transformation of all occurring functions. Most (theoretical and empirical) earlier works on this topic have, however, assumed measures supported on all of \({{\mathbb {R}}}^d\). A similar analysis in the unbounded case, as well as numerical experiments and the development and improvement of algorithms in this case, will be the topics of future research.
1.1 Contributions
For \(d\in {{\mathbb {N}}}\) and under the assumption that the reference and target densities \(f_{\rho }\), \(f_{\pi }:[-1,1]^d\rightarrow (0,\infty )\) are analytic, we prove that there exist sparse polynomial spaces of dimension \(N\in {{\mathbb {N}}}\), in which the best approximation of the KR transport T converges to T at the exponential rate \(\exp (-\beta N^{1/d})\) for some \(\beta >0\) as \(N\rightarrow \infty \); see Theorem 4.3. To guarantee that the approximation \(\tilde{T}:[-1,1]^d\rightarrow [-1,1]^d\) is bijective, we propose to use an ansatz of rational functions, which ensures this property and retains the same convergence rate. In this case, N refers to the dimension of the polynomial space used in the denominator and numerator, i.e., again to the number of degrees of freedom; see Theorem 4.5. The argument is based on a result quantifying the regularity of T in terms of its complex domain of analyticity, which is given in Theorem 3.6.
In Sect. 6, we investigate the implications of approximating the transport map for the corresponding pushforward measures. We show that closeness of the transports in \(W^{1,\infty }\) implies closeness of the pushforward measures in the Hellinger distance, the total variation distance and the KL divergence. A similar statement is true for the Wasserstein distance if the transports are close in \(L^\infty \). Specifically, Proposition 6.4 states that the same \(\exp (-\beta N^{1/d})\) error convergence as for the approximation of the transport is obtained for the distance between the pushforward \(\tilde{T}_\sharp {\rho }\) and the target \({\pi }\).
We provide lower bounds on \(\beta >0\), based on properties of \(f_{\rho }\) and \(f_{\pi }\). Furthermore, given \(\varepsilon >0\), we provide a priori ansatz spaces guaranteeing that the best approximation in such a space is \(O(\varepsilon )\)-close to T; see Theorem 4.5 and Sect. 7. This makes it possible to improve upon existing numerical algorithms: Previous approaches were based either on heuristics or on adaptive (greedy) enrichment of the ansatz space [25], neither of which can guarantee asymptotic convergence or convergence rates in general. Moreover, greedy methods are inherently sequential (in contrast to a priori approaches), which can slow down computations.
Using known approximation properties of polynomials by rectified linear unit (ReLU) neural networks (NNs), we also show that ReLU NNs can approximate T at the exponential rate \(\exp (-\beta N^{1/(1+d)})\). In this case, N refers to the number of trainable parameters (“weights and biases”) in the network; see Theorem 5.1 for the convergence of the transport map and Proposition 6.5 for the convergence of the pushforward measure. We point out that normalizing flows in machine learning also try to approximate T with a neural network; see, for example, [26, 29, 33, 54]. Recent theoretical works on the expressivity of neural network representations of transport maps include [43, 68, 69], which provide universal approximation results; moreover, Reference [37] provides estimates on the required network depth. In the present work we do not merely show universality, i.e., approximability of T by neural networks, but we even prove an exponential convergence rate. Similar results have not yet been established to the best of our knowledge.
1.2 Main Ideas
Consider the case \(d=1\). Let \({\pi }\) and \({\rho }\) be two probability measures on \([-1,1]\) with strictly positive Lebesgue densities \(f_{\rho }\), \(f_{\pi }:[-1,1]\rightarrow \{x\in {{\mathbb {R}}}\,:\,x>0\}\). Let
be the corresponding cumulative distribution functions (CDFs), which are strictly monotonically increasing and bijective from \([-1,1]\rightarrow [0,1]\). For any interval \([a,b]\subseteq [-1,1]\), it holds that
Hence, \(T{:}{=}F_{{\pi }}^{-1}\circ F_{\rho }\) is the unique monotone transport satisfying \({\rho }\circ T^{-1}={\pi }\), i.e., \(T_\sharp {\rho }={\pi }\). The formula \(T=F_{{\pi }}^{-1}\circ F_{\rho }\) implies that T inherits the smoothness of \(F_{{\pi }}^{-1}\) and \(F_{\rho }\). Thus, it is at least as smooth as \(f_{\rho }\) and \(f_{\pi }\) (more precisely, \(f_{\rho }\), \(f_{\pi }\in C^k\) imply \(T\in C^{k+1}\)). We will see in Proposition 3.4 that if \(f_{\rho }\) and \(f_{\pi }\) are analytic, the domain of analyticity of T is (under further conditions and in a suitable sense) proportional to the minimum of the domain of analyticity of \(f_{\rho }\) and \(f_{\pi }\). By the “domain of analyticity,” we mean the domain of holomorphic extension to the complex numbers.
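Numerically, the composition \(T=F_{{\pi }}^{-1}\circ F_{\rho }\) can be tabulated in a few lines. The following sketch (illustration only; the densities are chosen arbitrarily) computes both CDFs on a grid by the trapezoidal rule, evaluates T by monotone interpolation, and checks that pushing \({\rho }\)-samples through T reproduces a statistic of \({\pi }\).

```python
import numpy as np

grid = np.linspace(-1.0, 1.0, 4_001)

def normalize(vals):
    """Normalize to a probability density w.r.t. the Lebesgue measure on [-1, 1]."""
    return vals / np.trapz(vals, grid)

f_rho = normalize(np.ones_like(grid))            # uniform reference density
f_pi  = normalize(np.exp(np.sin(3.0 * grid)))    # some strictly positive analytic target

def cdf(vals):
    incr = 0.5 * (vals[1:] + vals[:-1]) * np.diff(grid)   # trapezoidal rule
    F = np.concatenate(([0.0], np.cumsum(incr)))
    return F / F[-1]                                       # CDF: [-1, 1] -> [0, 1]

F_rho, F_pi = cdf(f_rho), cdf(f_pi)

def T(x):
    """One-dimensional Knothe-Rosenblatt map T = F_pi^{-1} o F_rho (tabulated)."""
    return np.interp(np.interp(x, grid, F_rho), F_pi, grid)

# sanity check: pushing rho-samples through T reproduces the mean of pi
rng = np.random.default_rng(2)
y = T(rng.uniform(-1.0, 1.0, 100_000))
print("empirical mean under T_# rho:", y.mean())
print("mean of pi by quadrature    :", np.trapz(grid * f_pi, grid))
```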
Knowledge of the analyticity domain of T makes it possible to prove exponential convergence of polynomial approximations: Assume for the moment that \(T:[-1,1]\rightarrow [-1,1]\) admits an analytic extension to the complex disk with radius \(r>1\) and center \(0\in {{\mathbb {C}}}\). Then \(T(x)=\sum _{k\in {{\mathbb {N}}}_0}\frac{d^k}{dy^k} T(y)|_{y=0} \frac{x^k}{k!}\) for \(x\in [-1,1]\), and the kth Taylor coefficient can be bounded with Cauchy’s integral formula by \(C r^{-k}\). This implies that the nth Taylor polynomial uniformly approximates T on \([-1,1]\) with error \(O(r^{-n})=O(\exp (-\log (r)n))\). Thus, r determines the rate of exponential convergence.
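For completeness, the tail estimate behind the last step reads as follows, with \(c_k\) denoting the Taylor coefficients satisfying \(|c_k|\le Cr^{-k}\) and \(p_n\) the nth Taylor polynomial:

$$\begin{aligned} |T(x)-p_n(x)| \le \sum _{k>n}|c_k|\,|x|^k \le C\sum _{k>n}r^{-k} = \frac{C}{r-1}\,r^{-n} = O\big (\exp (-\log (r)\,n)\big ),\qquad x\in [-1,1]. \end{aligned}$$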
The above construction of the transport can be generalized to the KR transport \(T:[-1,1]^d\rightarrow [-1,1]^d\) for \(d\in {{\mathbb {N}}}\). We will determine an analyticity domain for each component \(T_k\) of \(T=(T_k)_{k=1}^d\): not in the shape of a polydisc, but rather as a pill-shaped set containing \([-1,1]^k\). The reason is that analyticity of \(f_{\rho }\) and \(f_{\pi }\) does not imply the existence of a polydisc, but does imply the existence of such pill-shaped domains. Instead of Taylor expansions, one can then prove exponential convergence of Legendre expansions. Rather than approximating T with Legendre polynomials, we introduce a correction guaranteeing our approximation \(\tilde{T}:[-1,1]^d\rightarrow [-1,1]^d\) to be bijective. This results in a rational function \({\tilde{T}}\). Using existing theory for ReLU networks, we also deduce a ReLU approximation result.
1.3 Outline
In Sect. 1.4, we introduce notation. Section 2 recalls the construction of the triangular KR transport T. In Sect. 3, we investigate the domain of analyticity of T. Section 4 applies the results of Sect. 3 to prove exponential convergence rates for the approximation of the transport through sparse polynomial expansions. Subsequently, Sect. 5 discusses a deep neural network approximation result for the transport. We then use these results in Sect. 6 to establish convergence rates for the associated measures (rather than the transport maps themselves). Finally, in Sect. 7 we present a standard example in uncertainty quantification and demonstrate how our results may be used in inference algorithms.
For the convenience of the reader, in Sect. 3.1 we discuss analyticity of the transport map in the one-dimensional case separately from the general case \(d\in {{\mathbb {N}}}\) (which builds on similar ideas but is significantly more technical) and provide most parts of the proof in the main text. In the remaining sections, all proofs and technical arguments are deferred to the appendix.
1.4 Notation
1.4.1 Sequences, Multi-indices, and Polynomials
Boldface characters denote vectors, e.g., \({{\varvec{x}}}=(x_i)_{i=1}^d\), \(d\in {{\mathbb {N}}}\). For \(j\le k\le d\), we denote slices by \({{\varvec{x}}}_{[k]}{:}{=}(x_i)_{i=1}^k\) and \({{\varvec{x}}}_{[j:k]}{:}{=}(x_i)_{i=j}^k\).
For a multi-index \({\varvec{\nu }}\in {{\mathbb {N}}}_0^d\), \({{\,\mathrm{supp}\,}}{\varvec{\nu }}{:}{=}\{j\,:\,\nu _j\ne 0\}\), and \(|{\varvec{\nu }}|{:}{=}\sum _{j\in {{\,\mathrm{supp}\,}}{\varvec{\nu }}}\nu _j\), where empty sums equal 0 by convention. Additionally, empty products equal 1 by convention and \({{\varvec{x}}}^{\varvec{\nu }}{:}{=}\prod _{j\in {{\,\mathrm{supp}\,}}{\varvec{\nu }}} x_j^{\nu _j}\). We write \({\varvec{\eta }}\le {\varvec{\nu }}\) if \(\eta _j\le \nu _j\) for all j, and \({\varvec{\eta }}<{\varvec{\nu }}\) if \({\varvec{\eta }}\le {\varvec{\nu }}\) and there exists j such that \(\eta _j<\nu _j\). A subset \(\Lambda \subseteq {{\mathbb {N}}}_0^d\) is called downward closed if it is finite and satisfies \(\{{\varvec{\eta }}\in {{\mathbb {N}}}_0^d\,:\,{\varvec{\eta }}\le {\varvec{\nu }}\}\subseteq \Lambda \) whenever \({\varvec{\nu }}\in \Lambda \).
For \(n\in {{\mathbb {N}}}\), \({{\mathbb {P}}}_n{:}{=}\mathrm{span}\{x^i\,:\,i\in \{0,\dots ,n\}\}\), where the span is understood over the field \({{\mathbb {R}}}\) (rather than \({{\mathbb {C}}}\)). Moreover, for \(\Lambda \subseteq {{\mathbb {N}}}_0^d\)
and a function \(p\in {{\mathbb {P}}}_\Lambda \) maps from \({{\mathbb {C}}}^d\rightarrow {{\mathbb {C}}}\). If \(\Lambda =\emptyset \), \({{\mathbb {P}}}_\emptyset {:}{=}\{0\}\), i.e., \({{\mathbb {P}}}_\emptyset \) only contains the constant 0 function.
1.4.2 Real and Complex Numbers
Throughout, \({{\mathbb {R}}}^d\) is equipped with the Euclidean norm and \({{\mathbb {R}}}^{d\times d}\) with the spectral norm. We write \({{\mathbb {R}}}_+{:}{=}\{x\in {{\mathbb {R}}}\,:\,x>0\}\) and denote the real and imaginary part of \(z\in {{\mathbb {C}}}\) by \(\Re (z)\), \(\Im (z)\), respectively. For any \(\delta \in {{\mathbb {R}}}_+\) and \(S\subseteq {{\mathbb {C}}}\)
and thus, \({{\mathcal {B}}}_\delta (S)=\bigcup _{x\in S} {{\mathcal {B}}}_\delta (x)\). For \({\varvec{\delta }}=(\delta _i)_{i=1}^d\subset {{\mathbb {R}}}_+\), we set \({{\mathcal {B}}}_{{\varvec{\delta }}}(S){:}{=}\times _{i=1}^d{{\mathcal {B}}}_{\delta _i}(S)\). If we omit the argument S, then \(S{:}{=}\{0\}\), i.e., \({{\mathcal {B}}}_\delta {:}{=}{{\mathcal {B}}}_\delta (0)\).
1.4.3 Measures and Densities
Throughout \([-1,1]^d\) is equipped with the Borel \(\sigma \)-algebra. With \(\lambda \) denoting the Lebesgue measure on \([-1,1]\), \(\mu {:}{=}\frac{\lambda }{2}\). By abuse of notation also \(\mu {:}{=}\otimes _{j=1}^k\frac{\lambda }{2}\), where \(k\in {{\mathbb {N}}}\) will always be clear from context.
If we write “\(f:[-1,1]^d\rightarrow {{\mathbb {R}}}_+\) is a probability density,” we mean that f is measurable, \(f({{\varvec{x}}})>0\) for all \({{\varvec{x}}}\in [-1,1]^d\) and \(\int _{[-1,1]^d}f({{\varvec{x}}})\;\mathrm {d}\mu ({{\varvec{x}}})=1\), i.e., f is a probability density w.r.t. the measure \(\mu \) on \([-1,1]^d\).
1.4.4 Derivatives and Function Spaces
For \(f:[-1,1]^d\rightarrow {{\mathbb {R}}}\) (or \({{\mathbb {C}}}\)), we denote by \(\partial _kf({{\varvec{x}}}){:}{=}\frac{\partial }{\partial x_k}f({{\varvec{x}}})\) the (weak) partial derivative. For \({\varvec{\nu }}\in {{\mathbb {N}}}_0^d\), we write instead \(\partial ^{\varvec{\nu }}_{{\varvec{x}}}f({{\varvec{x}}}){:}{=}\frac{\partial ^{|{\varvec{\nu }}|}}{\partial _{x_1}^{\nu _1}\cdots \partial _{x_d}^{\nu _d}} f({{\varvec{x}}})\). For \(m\in {{\mathbb {N}}}_0\), the space \(W^{m,\infty }([-1,1]^d)\) consists of all \(f:[-1,1]^d\rightarrow {{\mathbb {R}}}\) with finite \(\Vert f \Vert _{W^{m,\infty }({[-1,1]^d})}{:}{=}\sum _{j=0}^m {{\,\mathrm{ess\,sup}\,}}_{{{\varvec{x}}}\in [-1,1]^d} \Vert d^j f({{\varvec{x}}}) \Vert _{}\). Here \(d^0f({{\varvec{x}}})=f({{\varvec{x}}})\) and for \(j\ge 1\), \(d^j f({{\varvec{x}}})\in {{\mathbb {R}}}^{d\times \cdots \times d}\simeq {{\mathbb {R}}}^{d^j}\) denotes the weak jth derivative, and \(\Vert d^jf({{\varvec{x}}}) \Vert _{}\) denotes the norm on \({{\mathbb {R}}}^{d\times \cdots \times d}\) induced by the Euclidean norm. For \(j=1\) we simply write \(df{:}{=}d^1f\). By abuse of notation, e.g., for \(T=(T_k)_{k=1}^d:{{\mathbb {R}}}^d\rightarrow {{\mathbb {R}}}^d\) we also write \(T\in W^{m,\infty }([-1,1]^d)\) meaning that \(T_k\in W^{m,\infty }([-1,1]^d)\) for all \(k\in \{1,\dots ,d\}\), and in this case, \(\Vert T \Vert _{W^{m,\infty }([-1,1]^d)}{:}{=}\sum _{k=1}^d\Vert T_k \Vert _{W^{m,\infty }([-1,1]^d)}\). Similarly, for a measure \(\nu \) on \([-1,1]^d\) and \(p\in [1,\infty )\) we denote by \(L^p([-1,1]^d,\nu )\) the usual \(L^p\) space with norm \(\Vert f \Vert _{L^p([-1,1]^d,\nu )}{:}{=}(\int _{[-1,1]^d}\Vert f({{\varvec{x}}}) \Vert _{}\;\mathrm {d}\nu ({{\varvec{x}}}))^{1/p}\), where either \(f:[-1,1]^d\rightarrow {{\mathbb {R}}}\) or \(f:[-1,1]^d\rightarrow {{\mathbb {R}}}^d\).
1.4.5 Transport Maps
Let \(d\in {{\mathbb {N}}}\). A map \(T:[-1,1]^d\rightarrow [-1,1]^d\) is called triangular if \(T=(T_j)_{j=1}^d\) and each \(T_j:[-1,1]^j\rightarrow [-1,1]\) is a function of \({{\varvec{x}}}_{[j]}=(x_i)_{i=1}^j\). We say that T is monotone if \(x_j\mapsto T_j({{\varvec{x}}}_{[j-1]},x_j)\) is monotonically increasing for every \({{\varvec{x}}}_{[j-1]}\in [-1,1]^{j-1}\), \(j\in \{1,\dots ,d\}\). Note that \(x_j\mapsto T_j({{\varvec{x}}}_{[j-1]},x_j):[-1,1]\rightarrow [-1,1]\) being bijective for every \({{\varvec{x}}}_{[j-1]}\in [-1,1]^{j-1}\), \(j\in \{1,\dots ,d\}\), implies \(T:[-1,1]^d\rightarrow [-1,1]^d\) to be bijective. Similar to our notation for vectors, for the vector valued function \(T=(T_j)_{j=1}^d\) we write \(T_{[k]}{:}{=}(T_j)_{j=1}^k\). Note that for a triangular transport, it holds that \(T_{[k]}:[-1,1]^k\rightarrow [-1,1]^k\).
For a measurable bijection \(T:[-1,1]^d\rightarrow [-1,1]^d\) and a measure \({\rho }\) on \([-1,1]^d\), the pushforward \(T_\sharp {\rho }\) and the pullback \(T^\sharp {\rho }\) measures are defined as
for all measurable \(A\subseteq [-1,1]^d\).
The inverse transport \(T^{-1}:[-1,1]^d\rightarrow [-1,1]^d\) is denoted by S. If \(T:[-1,1]^d\rightarrow [-1,1]^d\) is a triangular monotone bijection, then the same is true for \(S:[-1,1]^d\rightarrow [-1,1]^d\): It holds that \(S_1(y_1)=T_1^{-1}(y_1)\) and
Also note that \(T_\sharp {\rho }={\pi }\) is equivalent to \(S^\sharp {\rho }={\pi }\).
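In particular, S can be evaluated one coordinate at a time: given \({{\varvec{y}}}\), first solve \(T_1(x_1)=y_1\), then \(T_2(x_1,x_2)=y_2\), and so on, each step being a one-dimensional monotone root-finding problem. A minimal sketch (with a toy triangular map chosen purely for illustration):

```python
import numpy as np

def T1(x1):
    return 2.0 * ((x1 + 1.0) / 2.0) ** 2 - 1.0          # monotone bijection of [-1, 1]

def T2(x1, x2):
    a = 1.0 + 0.5 * x1                                   # exponent > 0 for x1 in [-1, 1]
    return 2.0 * ((x2 + 1.0) / 2.0) ** a - 1.0           # monotone in x2, fixes the endpoints

def invert_1d(phi, y, lo=-1.0, hi=1.0, iters=60):
    """Bisection for a monotonically increasing phi on [lo, hi] with phi(lo) <= y <= phi(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if phi(mid) < y else (lo, mid)
    return 0.5 * (lo + hi)

def S(y):
    """Evaluate S = T^{-1} coordinate-wise: first x1 from y1, then x2 from y2."""
    x1 = invert_1d(T1, y[0])
    x2 = invert_1d(lambda s: T2(x1, s), y[1])
    return np.array([x1, x2])

y = np.array([0.25, -0.4])
x = S(y)
print("S(y)    =", x)
print("T(S(y)) =", np.array([T1(x[0]), T2(x[0], x[1])]))   # reproduces y up to tolerance
```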
2 Knothe–Rosenblatt Transport
Let \(d\in {{\mathbb {N}}}\). Given a reference probability measure \({\rho }\) and a target probability measure \({\pi }\) on \([-1,1]^d\), under certain conditions (e.g., as detailed below) the KR transport is the (unique) triangular monotone transport \(T:[-1,1]^d\rightarrow [-1,1]^d\) such that \(T_\sharp {\rho }={\pi }\). We now recall the explicit construction of T, as, for instance, presented in [59]. Throughout it is assumed that \({\pi }\ll \mu \) and \({\rho }\ll \mu \) have continuous and positive densities, i.e.,
For a continuous probability density \(f:[-1,1]^d \rightarrow {{\mathbb {C}}}\), we denote \({\hat{f}}_0{:}{=}1\) and for \({{\varvec{x}}}\in [-1,1]^d\)
Thus, \({\hat{f}}_{k}( \cdot )\) is the marginal density of \({{\varvec{x}}}_{[k]}\) and \(f_k({{\varvec{x}}}_{[k-1]},\cdot )\) is the marginal density of \(x_k\) conditioned on \({{\varvec{x}}}_{[k-1]}\). The corresponding marginal conditional CDFs
are well defined for \({{\varvec{x}}}\in [-1,1]^d\) and \(k\in \{1,\dots ,d\}\). They are interpreted as functions of \(x_k\) with \({{\varvec{x}}}_{[k-1]}\) fixed; in particular, \(F_{{\pi };k}({{\varvec{x}}}_{[k-1]},\cdot )^{-1}\) denotes the inverse of \(x_k\mapsto F_{{\pi };k}({{\varvec{x}}}_{[k]})\).
For \({{\varvec{x}}}\in [-1,1]^d\), let
and inductively for \(k\in \{2,\dots ,d\}\) with \(T_{[k-1]}{:}{=}(T_j)_{j=1}^{k-1}:[-1,1]^{k-1}\rightarrow [-1,1]^{k-1}\), let
Then
yields the triangular KR transport \(T:[-1,1]^d\rightarrow [-1,1]^d\). In the following we denote by \(dT:[-1,1]^d\rightarrow {{\mathbb {R}}}^{d\times d}\) the Jacobian matrix of T. The following theorem holds; see, e.g., [59, Prop. 2.18] for a proof.
Theorem 2.1
Assume (2.1). The KR transport T in (2.5) satisfies \(T_\sharp {\rho }={\pi }\) and
The regularity assumption (2.1) on the densities can be relaxed in Theorem 2.1; see, e.g., [6].
In general, T satisfying \(T_\sharp {\rho }={\pi }\) is not unique. To keep the presentation succinct, henceforth we will simply refer to “the transport T,” by which we always mean the unique triangular KR transport in (2.5).
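For concreteness, the construction above can be carried out numerically on a grid when \(d=2\). The following sketch (illustration only: uniform reference, an arbitrary analytic target, and densities taken w.r.t. the Lebesgue measure, which only changes constants that cancel in the CDFs) tabulates the first marginal and the conditional CDFs of \({\pi }\) and evaluates \(T=(T_1,T_2)\) pointwise.

```python
import numpy as np

# grid on [-1, 1] and a positive target density f_pi; the reference rho is uniform
t = np.linspace(-1.0, 1.0, 801)
X1, X2 = np.meshgrid(t, t, indexing="ij")
f_pi = np.exp(0.5 * np.sin(2.0 * X1) + 0.3 * X1 * X2)
f_pi /= np.trapz(np.trapz(f_pi, t, axis=1), t)    # normalize on [-1, 1]^2

def cdf_1d(vals):
    incr = 0.5 * (vals[1:] + vals[:-1]) * np.diff(t)
    F = np.concatenate(([0.0], np.cumsum(incr)))
    return F / F[-1]

F_pi_1 = cdf_1d(np.trapz(f_pi, t, axis=1))        # CDF of the first marginal of pi

def T1(x1):
    # rho uniform => F_{rho;1}(x1) = (x1 + 1) / 2, and T_1 = F_{pi;1}^{-1} o F_{rho;1}
    return np.interp((x1 + 1.0) / 2.0, F_pi_1, t)

def T2(x1, x2):
    # conditional CDF F_{pi;2}(T_1(x1), .), obtained by freezing the first argument
    y1 = T1(x1)
    cond = np.array([np.interp(y1, t, f_pi[:, j]) for j in range(len(t))])
    F_cond = cdf_1d(cond)
    # invert it at F_{rho;2}(x1, x2) = (x2 + 1) / 2  (rho uniform)
    return np.interp((x2 + 1.0) / 2.0, F_cond, t)

print("T(0.2, -0.5) =", (float(T1(0.2)), float(T2(0.2, -0.5))))
```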
3 Analyticity
The explicit formulas for T given in Sect. 2 imply that positive analytic densities yield an analytic transport. Analyzing the convergence of polynomial approximations to T requires knowledge of the domain of analyticity of T. This is investigated in the present section.
3.1 One-Dimensional Case
Let \(d=1\). By (2.4a), \(T:[-1,1]\rightarrow [-1,1]\) can be expressed through composition of the CDF of \({\rho }\) and the inverse CDF of \({\pi }\). As the inverse function theorem is usually stated without giving details on the precise domain of extension of the inverse function, we give a proof, based on classical arguments, in Appendix A. This leads to the result in Lemma 3.2. Before stating it, we provide another short lemma that will be used multiple times.
Lemma 3.1
Let \(\delta >0\) and let \(K\subseteq {{\mathbb {C}}}\) be convex. Assume that \(f\in C^1({{\mathcal {B}}}_\delta (K);{{\mathbb {C}}})\) satisfies \(\sup _{x\in {{\mathcal {B}}}_\delta (K)}|f(x)|\le L\). Then \(\sup _{x\in K}|f'(x)|\le \frac{L}{\delta }\) and \(f:K\rightarrow {{\mathbb {C}}}\) is Lipschitz continuous with Lipschitz constant \(\frac{L}{\delta }\).
Proof
For any \(x\in K\) and any \(\varepsilon \in (0,\delta )\), by Cauchy’s integral formula
Letting \(\varepsilon \rightarrow 0\) implies the claim. \(\square \)
Lemma 3.2
Let \(\delta >0\), \(x_0\in {{\mathbb {C}}}\) and let \(f\in C^1({{\mathcal {B}}}_\delta (x_0);{{\mathbb {C}}})\). Suppose that
Let \(F:{{\mathcal {B}}}_\delta (x_0)\rightarrow {{\mathbb {C}}}\) be an antiderivative of \(f:{{\mathcal {B}}}_\delta (x_0)\rightarrow {{\mathbb {C}}}\).
With
there then exists a unique function \(G:{{\mathcal {B}}}_{\alpha \delta }(F(x_0))\rightarrow {{\mathcal {B}}}_{\beta \delta }(x_0)\) such that \(F(G(y))=y\) for all \(y\in {{\mathcal {B}}}_{\alpha \delta }(F(x_0))\). Moreover, \(G\in C^1({{\mathcal {B}}}_{\alpha \delta }(F(x_0));{{\mathbb {C}}})\) with Lipschitz constant \(\frac{1}{M}\).
Proof
We verify the conditions of Proposition A.2 with \({\tilde{\delta }}{:}{=}\delta /(1+\frac{2L}{M})<\delta \). To obtain a bound on the Lipschitz constant of \(F'=f\) on \({{\mathcal {B}}}_{{\tilde{\delta }}}(x_0)\), it suffices to bound \(F''=f'\) there. Due to \({\tilde{\delta }}+\frac{{\tilde{\delta }} 2L}{M} =\delta \), for all \(x\in {{\mathcal {B}}}_{{\tilde{\delta }}}(x_0)\) we have by Lemma 3.1
Since \(F'(x_0)=f(x_0)\ne 0\), the conditions of Proposition A.2 are satisfied, and G is well defined and exists on \({{\mathcal {B}}}_{\alpha \delta }(F(x_0))\), where \(\alpha \delta =\frac{\delta M^2}{2M+4L}=\frac{{\tilde{\delta }} M}{2}\le \frac{{\tilde{\delta }} |F'(x_0)|}{2}\). Finally, due to \(1=F(G(y))'=F'(G(y))G'(y)\), it holds \(G'(y)=\frac{1}{f(G(y))}\) for all \(y\in {{\mathcal {B}}}_{\alpha \delta }(F(x_0))\), which shows that \(G:{{\mathcal {B}}}_{\alpha \delta }(F(x_0))\rightarrow {{\mathbb {C}}}\) has Lipschitz constant \(\frac{1}{M}\). Hence, \(G:{{\mathcal {B}}}_{\alpha \delta }(F(x_0))\rightarrow {{\mathcal {B}}}_{\alpha \delta /M}(G(F(x_0)))={{\mathcal {B}}}_{\beta \delta }(x_0)\). Uniqueness of \(G:{{\mathcal {B}}}_{\alpha \delta }(F(x_0))\rightarrow {{\mathcal {B}}}_{\beta \delta }(x_0)\) satisfying \(F\circ G=\mathrm{Id}\) on \({{\mathcal {B}}}_{\alpha \delta }(F(x_0))\) follows by Proposition A.2 and the fact that \(\beta \delta =\frac{{\tilde{\delta }}}{2}\le {\tilde{\delta }}\). \(\square \)
For \(x\in [-1,1]\) and a density \(f:[-1,1]\rightarrow {{\mathbb {R}}}\), the CDF equals \(F(x)=\int _{-1}^xf(t)\;\mathrm {d}\mu ( t)\). By definition of \(\mu \),
In case f allows an extension \(f:{{\mathcal {B}}}_\delta ([-1,1])\rightarrow {{\mathbb {C}}}\), then \(F:{{\mathcal {B}}}_\delta ([-1,1])\rightarrow {{\mathbb {C}}}\) is also well defined via (3.2). Without explicitly mentioning it, we always consider F to be naturally extended to complex values in this sense.
The next result generalizes Lemma 3.2 from complex balls \({{\mathcal {B}}}_\delta \) to the pill-shaped domains \({{\mathcal {B}}}_\delta ([-1,1])\) defined in (1.3). The proof is given in Appendix B.1.
Lemma 3.3
Let \(\delta >0\) and let
-
(a)
\(f:[-1,1]\rightarrow {{\mathbb {R}}}_+\) be a probability density such that \(f\in C^1({{\mathcal {B}}}_\delta ([-1,1]);{{\mathbb {C}}})\),
-
(b)
\(M\le |f(x)|\le L\) for some \(0<M\le L<\infty \) and all \(x\in {{\mathcal {B}}}_\delta ([-1,1])\).
Set \(F(x){:}{=}\int _{-1}^x f(t)\;\mathrm {d}\mu ( t)\) and let \(\alpha =\alpha (M,L)\), \(\beta =\beta (M,L)\) be as in (3.1).
Then
-
(i)
\(F:[-1,1]\rightarrow [0,1]\) is a \(C^1\)-diffeomorphism, and \(F\in C^1({{\mathcal {B}}}_\delta ([-1,1]);{{\mathbb {C}}})\) with Lipschitz constant L,
-
(ii)
there exists a unique \(G:{{\mathcal {B}}}_{\alpha \delta }([0,1])\rightarrow {{\mathcal {B}}}_{\beta \delta }([-1,1])\) such that \(F(G(y))=y\) for all \(y\in {{\mathcal {B}}}_{\alpha \delta }([0,1])\) and
$$\begin{aligned} G:{{\mathcal {B}}}_{\alpha \delta }(F(x_0))\rightarrow {{\mathcal {B}}}_{\beta \delta }(x_0) \end{aligned}$$(3.3) for all \(x_0\in [-1,1]\). Moreover, \(G\in C^1({{\mathcal {B}}}_{\alpha \delta }([0,1]);{{\mathbb {C}}})\) with Lipschitz constant \(\frac{1}{M}\).
We arrive at a statement about the domain of analytic extension of the one-dimensional monotone transport \(T{:}{=}F_{\pi }^{-1}\circ F_{\rho }:[-1,1]\rightarrow [-1,1]\) as in (2.4a).
Proposition 3.4
Let \(\delta _{\rho }\), \(\delta _{\pi }>0\), \(L_{\rho }<\infty \), \(0<M_{\pi }\le L_{\pi }<\infty \) and
-
(a)
for \(*\in \{{\rho },{\pi }\}\) let \(f_*:[-1,1]\rightarrow {{\mathbb {R}}}_+\) be a probability density and \(f_*\in C^1({{\mathcal {B}}}_{\delta _*}([-1,1]);{{\mathbb {C}}})\),
-
(b)
for \(x\in {{\mathcal {B}}}_{\delta _{\rho }}([-1,1])\), \(y\in {{\mathcal {B}}}_{\delta _{\pi }}([-1,1])\)
$$\begin{aligned} |f_{\rho }(x)|\le L_{\rho },\qquad 0< M_{\pi }\le |f_{\pi }(y)|\le L_{\pi }. \end{aligned}$$
Then with \(r{:}{=}\min \{\delta _{\rho }, \frac{\delta _{\pi }M_{\pi }^2}{L_{\rho }(2M_{\pi }+4L_{\pi })}\}\) and \(q{:}{=}\frac{r L_{\rho }}{M_{\pi }}\) it holds \(T\in C^1({{\mathcal {B}}}_r([-1,1]); {{\mathcal {B}}}_{q}([-1,1]))\).
Proof
First, according to Lemma 3.3 (i), \(F_{\rho }:[-1,1]\rightarrow [0,1]\) admits an extension
where we used that \(F_{\rho }\) is Lipschitz continuous with Lipschitz constant \(L_{\rho }\). Furthermore, Lemma 3.3 (ii) implies with \(\varepsilon {:}{=}\frac{\delta _{\pi }M_{\pi }^2}{2M_{\pi }+4L_{\pi }}\) that \(F_{\pi }^{-1}:[0,1]\rightarrow [-1,1]\) admits an extension
where we used that \(F_{\pi }^{-1}\) is Lipschitz continuous with Lipschitz constant \(\frac{1}{M_{\pi }}\).
Assume first \(r=\delta _{\rho }\), which implies \(L_{\rho }\delta _{\rho }\le \varepsilon \). Then \(F_{\pi }^{-1}\circ F_{\rho }\in C^1({{\mathcal {B}}}_{\delta _{\rho }}([-1,1]);{{\mathbb {C}}})\) is well defined by (3.4). In the second case where \(r=\frac{\delta _{\pi }M_{\pi }^2}{L_{\rho }(2M_{\pi }+4L_{\pi })}\), we have \(\varepsilon =L_{\rho }r\) and \(r\le \delta _{\rho }\). Hence, \(F_{\rho }:{{\mathcal {B}}}_r([-1,1])\rightarrow {{\mathcal {B}}}_{L_{\rho }r}([0,1])={{\mathcal {B}}}_{\varepsilon }([0,1])\) is well defined. Thus, \(F_{\pi }^{-1}\circ F_{\rho }\in C^1({{\mathcal {B}}}_{r}([-1,1]);{{\mathbb {C}}})\) is well defined. In both cases, since \(T=F_{\pi }^{-1}\circ F_{\rho }\) is Lipschitz continuous with Lipschitz constant \(\frac{L_{\rho }}{M_{\pi }}\) (cp. Lemma 3.3), T maps to \({{\mathcal {B}}}_{\frac{rL_{\rho }}{M_{\pi }}}([-1,1])\). \(\square \)
The radius r in Proposition 3.4 describing the analyticity domain of the transport behaves like \(O(\min \{\delta _{\rho },\delta _{\pi }\})\) as \(\min \{\delta _{\rho },\delta _{\pi }\}\rightarrow \infty \) (considering the \(M_*\), \(L_*\) constants fixed). In this sense, the analyticity domain of T is proportional to the minimum of the analyticity domains of the reference and target densities.
3.2 General Case
We now come to the main result of Sect. 3, which is a multidimensional version of Proposition 3.4. More precisely, we give a statement about the analyticity domain of \((\partial _{k}T_k)_{k=1}^d\). The reason is that, from both a theoretical and a practical viewpoint, it is convenient first to approximate \(\partial _{k}T_k:[-1,1]^k\rightarrow [0,1]\) and then to obtain an approximation to \(T_k\) by integrating over \(x_k\). We explain this in more detail in Sect. 4, see (4.9).
The following technical assumption gathers our requirements on the reference \({\rho }\) and the target \({\pi }\).
Assumption 3.5
Let \(0< M \le L <\infty \), \(C_6>0\), \(d\in {{\mathbb {N}}}\) and \({\varvec{\delta }}\in (0,\infty )^d\). For \(*\in \{{\rho },{\pi }\}\):
-
(a)
\(f_*:[-1,1]^d\rightarrow {{\mathbb {R}}}_+\) is a probability density and \(f_{*}\in C^1({{\mathcal {B}}}_{{\varvec{\delta }}}([-1,1]);{{\mathbb {C}}})\),
-
(b)
\(M\le |f_{*}({{\varvec{x}}})|\le L\) for \({{\varvec{x}}}\in {{\mathcal {B}}}_{{\varvec{\delta }}}([-1,1])\),
-
(c)
\(\sup _{{{\varvec{y}}}\in {{\mathcal {B}}}_{{\varvec{\delta }}}}|f_{*}({{\varvec{x}}}+{{\varvec{y}}})-f_{*}({{\varvec{x}}})| \le C_6\) for \({{\varvec{x}}}\in [-1,1]^d\),
-
(d)
\( \sup _{{{\varvec{y}}}\in {{\mathcal {B}}}_{{\varvec{\delta }}_{[k]}}\times \{0\}^{d-k}}|f_{*}({{\varvec{x}}}+{{\varvec{y}}})-f_{*}({{\varvec{x}}})|\le C_6 \delta _{k+1} \) for \({{\varvec{x}}}\in [-1,1]^d\) and \(k\in \{1,\dots ,d-1\}\).
Assumptions (a) and (b) state that \(f_*\) is a positive analytic probability density on \([-1,1]^d\) that allows a complex differentiable extension to the set \({{\mathcal {B}}}_{\varvec{\delta }}([-1,1])\subseteq {{\mathbb {C}}}^d\), cp. (1.3). Equation (2.4) shows that \(T_{k+1}\) is obtained by a composition of \(F_{{\pi };k+1}(T_{1},\dots ,T_{k},\cdot )^{-1}\) (the inverse in the last variable) and \(F_{{\rho };k+1}\). The smallness conditions (c) and (d) can be interpreted as follows: They will guarantee \(F_{{\rho };k+1}({{\varvec{y}}})\) (for certain complex \({{\varvec{y}}}\)) to belong to the domain where the complex extension of \(F_{{\pi };k+1}(T_1,\dots ,T_{k},\cdot )^{-1}\) is well defined.
Theorem 3.6
Let \(0<M\le L<\infty \), \(d\in {{\mathbb {N}}}\) and \({\varvec{\delta }}\in (0,\infty )^d\). There exist \(C_6\), \(C_7\), and \(C_8>0\) depending on M and L (but not on d or \({\varvec{\delta }}\)) such that if Assumption 3.5 holds with \(C_6\), then:
Let \(T=(T_k)_{k=1}^d\) be as in (2.5) and \(R_k{:}{=}\partial _{k}T_k\). With \({\varvec{\zeta }}=(\zeta _k)_{k=1}^d\) where
it holds for all \(k\in \{1,\dots ,d\}\) that
-
(i)
\(R_k\in C^1({{\mathcal {B}}}_{{\varvec{\zeta }}_{[k]}}([-1,1]);{{\mathcal {B}}}_{C_8}(1))\) and \(\Re (R_k({{\varvec{x}}}))\ge \frac{1}{C_8}\) for all \({{\varvec{x}}}\in {{\mathcal {B}}}_{{\varvec{\zeta }}_{[k]}}([-1,1])\),
-
(ii)
if \(k\ge 2\), \(R_k:{{\mathcal {B}}}_{{\varvec{\zeta }}_{[k-1]}}([-1,1])\times [-1,1]\rightarrow {{\mathcal {B}}}_{\frac{C_8}{\max \{1,\delta _k\}}}(1)\).
Put simply, the first item of the theorem can be interpreted as follows: The function \(\partial _{k}T_k\) allows in the jth variable an analytic extension to the set \({{\mathcal {B}}}_{\zeta _j}([-1,1])\), where \(\zeta _j\) is proportional to \(\delta _j\). The constant \(\delta _j\) describes the domain of analytic extension of the densities \(f_{\rho }\), \(f_{\pi }\) in the jth variable. Thus, the analyticity domain of each \(\partial _{k} T_k\) is proportional to the (intersection of the) domains of analyticity of the densities. Additionally, the real part of \(\partial _{k} T_k\) remains strictly positive on this extension to the complex domain. Note that \(\partial _{k} T_k({{\varvec{x}}})\) is necessarily positive for real \({{\varvec{x}}}\in [-1,1]^k\), since the transport is monotone.
The second item of the theorem states that the kth variable \(x_k\) plays a special role for \(T_k\): if we merely extend \(\partial _{k}T_k\) in the first \(k-1\) variables to the complex domain and let the kth argument \(x_k\) belong to the real interval \([-1,1]\), then the values of this extension behave like \(1+O(\frac{1}{\delta _k})\), and thus, the extension is very close to the constant 1 function for large \(\delta _k\). In other words, if the densities \(f_{\rho }\), \(f_{\pi }\) allow a (uniformly bounded from above and below) analytic extension to a very large subset of the complex domain in the kth variable, then the kth component of the transport \(T_k({{\varvec{x}}}_{[k]})\) will be close to \(-1+\int _{-1}^{x_k}1\;\mathrm {d}t=x_k\), i.e., to the identity in the kth variable.
We also emphasize that we state the analyticity results here for \(\partial _{k}T_k\) (in the form they will be needed below), but this immediately implies that \(T_k\) allows an analytic extension to the same domain.
Remark 3.7
Crucially, for any \(k<d\) the left-hand side of the inequality in Assumption 3.5 (d) depends on \((\delta _j)_{j=1}^k\), while the right-hand side depends only on \(\delta _{k+1}\) but not on \((\delta _j)_{j=1}^k\). This will allow us to suitably choose \({\varvec{\delta }}\) when verifying this assumption (see the proof of Lemma 3.9).
Remark 3.8
The proof of Theorem 3.6 shows that there exists \(C\in (0,1)\) independent of M and L such that we can choose \(C_6=C\frac{\min \{M,1\}^5}{\max \{L,1\}^4}\), \(C_7=C\frac{\min \{M,1\}^3}{\max \{L,1\}^3}\) and \(C_8=C^{-1}(\frac{\max \{L,1\}^4}{\min \{M,1\}^4})\); see (B.43), (B.41) and (B.24a), (B.40).
To give an example for \({\rho }\), \({\pi }\) fitting our setting, we show that Assumption 3.5 holds (for some sequence \({\varvec{\delta }}\)) whenever the densities \(f_{\rho }\), \(f_{\pi }\) are analytic.
Lemma 3.9
For \(*\in \{{\rho },{\pi }\}\), let \(f_*:[-1,1]^d\rightarrow {{\mathbb {R}}}_+\) be a probability density, and assume that \(f_*\) is analytic on an open set in \({{\mathbb {R}}}^d\) containing \([-1,1]^d\). Then there exist \(0<M\le L <\infty \) and \({\varvec{\delta }}\in (0,\infty )^d\) such that Assumption 3.5 holds with \(C_6(M,L)\) as in Theorem 3.6.
4 Polynomial-Based Approximation
Analytic functions from \([-1,1]^d\) to \({{\mathbb {R}}}\) can be approximated by multivariate polynomial expansions with exponentially converging error. We recall this in Sect. 4.1 for truncated Legendre expansions. These results are then applied to the KR transport in Sect. 4.2.
4.1 Exponential Convergence of Legendre Expansions
For \(n\in {{\mathbb {N}}}_0\), let \(L_n\in {{\mathbb {P}}}_n\) be the nth Legendre polynomial normalized in \(L^2([-1,1],\mu )\). For \({\varvec{\nu }}\in {{\mathbb {N}}}_0^d\) set \(L_{\varvec{\nu }}({{\varvec{y}}}){:}{=}\prod _{j=1}^dL_{\nu _j}(y_j)\). Then \((L_{\varvec{\nu }})_{{\varvec{\nu }}\in {{\mathbb {N}}}_0^d}\) is an orthonormal basis of \(L^2([-1,1]^d,\mu )\). Thus, every \(f\in L^2([-1,1]^d,\mu )\) admits the multivariate Legendre expansion \(f({{\varvec{y}}}) = \sum _{{\varvec{\nu }}\in {{\mathbb {N}}}_0^d}l_{\varvec{\nu }}L_{\varvec{\nu }}({{\varvec{y}}})\) with coefficients \(l_{\varvec{\nu }}= \int _{[-1,1]^d}f({{\varvec{y}}})L_{\varvec{\nu }}({{\varvec{y}}})\;\mathrm {d}\mu ({{\varvec{y}}})\). For finite \(\Lambda \subseteq {{\mathbb {N}}}_0^d\), \(\sum _{{\varvec{\nu }}\in \Lambda }l_{\varvec{\nu }}L_{\varvec{\nu }}({{\varvec{y}}})\) is the orthogonal projection of f in the Hilbert space \(L^2([-1,1]^d,\mu )\) onto its subspace
As is well known, functions that are holomorphic on sets of the type \({{\mathcal {B}}}_{\varvec{\delta }}([-1,1])\) have exponentially decaying Legendre coefficients. We recall this in the next lemma, which is adapted to the regularity we showed for the transport in Theorem 3.6.
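This decay is easy to observe numerically. In the following one-dimensional sketch (illustration only), the Legendre coefficients of \(f(x)=1/(x-\tfrac{3}{2})\), which is analytic in a complex neighborhood of \([-1,1]\), are computed by Gauss–Legendre quadrature; the normalization \(L_n=\sqrt{2n+1}\,P_n\) (with \(P_n\) the classical Legendre polynomial) makes \(\Vert L_n\Vert _{L^2([-1,1],\mu )}=1\) as in Sect. 4.1.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss, Legendre

f = lambda x: 1.0 / (x - 1.5)           # analytic in a neighborhood of [-1, 1]

# Gauss-Legendre nodes/weights for integration w.r.t. the Lebesgue measure on [-1, 1]
nodes, weights = leggauss(200)

coeffs = []
for n in range(30):
    # L_n normalized in L^2([-1,1], mu) with mu = lambda/2:  L_n = sqrt(2n+1) * P_n
    Ln = np.sqrt(2 * n + 1) * Legendre.basis(n)(nodes)
    # l_n = int f L_n dmu = (1/2) * int f L_n dlambda
    coeffs.append(0.5 * np.sum(weights * f(nodes) * Ln))

for n, c in enumerate(coeffs[:10]):
    print(f"|l_{n}| = {abs(c):.3e}")
# the ratios |l_{n+1}| / |l_n| settle near a constant < 1, i.e. the decay is exponential
```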
Lemma 4.1
Let \(k\in {{\mathbb {N}}}\), \({\varvec{\delta }}\in (0,\infty )^k\) and \(f\in C^1({{\mathcal {B}}}_{\varvec{\delta }}([-1,1]);{{\mathbb {C}}})\). Set \(w_{\varvec{\nu }}{:}{=}\prod _{j=1}^k(1+2\nu _j)^{3/2}\), \({\varvec{\varrho }}=(1+\delta _j)_{j=1}^k\) and \(l_{\varvec{\nu }}{:}{=}\int _{[-1,1]^k} f({{\varvec{y}}}) L_{\varvec{\nu }}({{\varvec{y}}})\;\mathrm {d}\mu ({{\varvec{y}}})\) for \({\varvec{\nu }}\in {{\mathbb {N}}}_0^k\). Then
-
(i)
for all \({\varvec{\nu }}\in {{\mathbb {N}}}_0^k\)
$$\begin{aligned} |l_{\varvec{\nu }}| \le {\varvec{\varrho }}^{-{\varvec{\nu }}} w_{\varvec{\nu }}\Vert f \Vert _{L^\infty ({{\mathcal {B}}}_{\varvec{\delta }}({[-1,1]}))} \prod _{j\in {{\,\mathrm{supp}\,}}{\varvec{\nu }}}\frac{2\varrho _j}{\varrho _j-1}, \end{aligned}$$(4.1) -
(ii)
for all \({\varvec{\nu }}\in {{\mathbb {N}}}_0^{k-1}\times \{0\}\)
$$\begin{aligned} |l_{\varvec{\nu }}| \le {\varvec{\varrho }}^{-{\varvec{\nu }}} w_{\varvec{\nu }}\Vert f \Vert _{L^\infty ({{\mathcal {B}}}_{{\varvec{\delta }}_{[k-1]}}([-1,1])\times [-1,1])} \prod _{j\in {{\,\mathrm{supp}\,}}{\varvec{\nu }}}\frac{2\varrho _j}{\varrho _j-1}. \end{aligned}$$(4.2)
The previous lemma in combination with Theorem 3.6 yields a bound on the Legendre coefficients of the partial derivatives \(\partial _k T_k-1\) of the kth component of the transport. Specifically, for \({\varvec{\nu }}\in {{\mathbb {N}}}_0^k\) Theorem 3.6 (i) together with Lemma 4.1 (i) implies with \(t_1{:}{=}{\varvec{\varrho }}^{-{\varvec{\nu }}}\Vert \partial _kT_k-1 \Vert _{L^\infty ({{\mathcal {B}}}_{\varvec{\delta }}({[-1,1]}))}\) and \(t_2{:}{=}w_{\varvec{\nu }}\prod _{j\in {{\,\mathrm{supp}\,}}{\varvec{\nu }}}\frac{\varrho _j}{\varrho _j-1}\) the bound \(t_1t_2\) for the corresponding Legendre coefficient. For a multi-index \({\varvec{\nu }}\in {{\mathbb {N}}}_0^{k-1}\times \{0\}\), applying instead Theorem 3.6 (ii) together with Lemma 4.1 (ii) yields the bound \({\tilde{t}}_1 t_2\) where \({\tilde{t}}_1{:}{=}{\varvec{\varrho }}^{-{\varvec{\nu }}}\Vert \partial _kT_k-1 \Vert _{L^\infty ({{\mathcal {B}}}_{{\varvec{\delta }}_{[k-1]}}({[-1,1]})\times [-1,1])}\). By Theorem 3.6, the last norm is bounded by \(\frac{C_8}{\delta _k}\). Hence, compared to the first estimate \(t_1\), we gain the factor \(\frac{1}{\delta _k}\) by using the second estimate \({\tilde{t}}_1\) instead.
Taking the minimum of these estimates leads us to introduce for \(k\in {{\mathbb {N}}}\), \({\varvec{\nu }}\in {{\mathbb {N}}}_0^k\) and \({\varvec{\varrho }}\in (1,\infty )^k\) the quantity
and the set
corresponding to the largest values of \(\gamma ({\varvec{\varrho }},{\varvec{\nu }})\).
The structure of \(\Lambda _{k,\varepsilon }\) is as follows: The larger \(\varrho _j\), the smaller \(\varrho _j^{-1}\). Thus, the larger \(\varrho _j\), the fewer multi-indices \({\varvec{\nu }}\) with \(j\in {{\,\mathrm{supp}\,}}{\varvec{\nu }}\) belong to \(\Lambda _{k,\varepsilon }\). In this sense, \(\varrho _j\) measures the importance of the jth variable in the Legendre expansion of \(\partial _k T_k\). The kth variable plays a special role, as it is always among the most important variables, since for all \({\varvec{\nu }}\in {{\mathbb {N}}}_0^k\) it holds that \(\gamma ({\varvec{\varrho }}, {\varvec{\nu }}) \le \gamma ({\varvec{\varrho }}, {\varvec{e}}_k)\), where \({\varvec{e}}_k=(0,\dots ,0,1)\in {{\mathbb {N}}}_0^k\). In other words, whenever \(\varepsilon >0\) is so small that \(\Lambda _{k,\varepsilon }\ne \emptyset \), at least one \({\varvec{\nu }}\) with \(\nu _k\ne 0\) belongs to \(\Lambda _{k,\varepsilon }\).
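The sets \(\Lambda _{k,\varepsilon }\) can be enumerated directly once \({\varvec{\varrho }}\) and \(\varepsilon \) are fixed. The following sketch illustrates the mechanism with the simplified weight \({\varvec{\varrho }}^{-{\varvec{\nu }}}\) in place of \(\gamma ({\varvec{\varrho }},{\varvec{\nu }})\) (so it ignores the special treatment of the kth variable described above); it collects all \({\varvec{\nu }}\) whose weight is at least \(\varepsilon \), which yields a downward closed set whose cardinality shrinks as the \(\varrho _j\) grow.

```python
import itertools
import numpy as np

def index_set(rho, eps):
    """All nu in N_0^k with prod_j rho_j^(-nu_j) >= eps (a downward-closed set).

    rho : sequence of weights rho_j > 1, one per variable.
    """
    # nu_j can be at most log(1/eps) / log(rho_j), which bounds the search box
    nu_max = [int(np.floor(np.log(1.0 / eps) / np.log(r))) for r in rho]
    return [nu for nu in itertools.product(*(range(m + 1) for m in nu_max))
            if np.prod([r ** (-n) for r, n in zip(rho, nu)]) >= eps]

rho = [4.0, 2.0, 1.5]       # larger rho_j = variable j is "less important"
Lam = index_set(rho, eps=1e-3)
print("cardinality:", len(Lam))
print("largest degree per variable:", [max(nu[j] for nu in Lam) for j in range(len(rho))])
```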
Having determined a set of multi-indices corresponding to the largest upper bounds obtained for the Legendre coefficients, we arrive at the next proposition. The assumptions on the function f correspond to the regularity of \(\partial _{k}T_k\) shown in Theorem 3.6. The proposition states that such f can be approximated with the error decreasing as \(O(\exp (-\beta N^{1/k}))\) in terms of the dimension N of the polynomial space.
Proposition 4.2
Let \(k\in {{\mathbb {N}}}\), \({\varvec{\delta }}\in (0,\infty )^k\) and \(r>0\), such that \(f\in C^1({{\mathcal {B}}}_{{\varvec{\delta }}}([-1,1]);{{\mathcal {B}}}_r)\) and \(f:{{\mathcal {B}}}_{{\varvec{\delta }}_{[k-1]}}([-1,1])\times [-1,1]\rightarrow {{\mathcal {B}}}_{\frac{r}{1+\delta _k}}\). With \(\varrho _j{:}{=}1+\delta _j\) set
For \({\varvec{\nu }}\in {{\mathbb {N}}}_0^k\), set \(l_{\varvec{\nu }}{:}{=}\int _{[-1,1]^k}f({{\varvec{y}}})L_{\varvec{\nu }}({{\varvec{y}}})\;\mathrm {d}\mu ({{\varvec{y}}})\).
Then for every \({\tilde{\beta }}<\beta \), there exists \(C=C(k,m,{\tilde{\beta }},{\varvec{\delta }},r,\Vert f \Vert _{L^\infty ({{{\mathcal {B}}}_{\varvec{\delta }}([-1,1])})})\) s.t. for every \(\varepsilon \in (0,\varrho _k^{-1})\) the following holds with \(\Lambda _{k,\varepsilon }\) as in (4.4):
In Proposition 4.2, \({\tilde{\beta }}\in (0,\beta )\) can be chosen arbitrarily close to \(\beta \). However, as \({\tilde{\beta }}\) approaches \(\beta \), the constant C in (4.8) will tend to infinity, cp. (C.11).
4.2 Polynomial and Rational Approximation
Combining Proposition 4.2 with Theorem 3.6 we obtain the following approximation result for the transport. It states that \(T:[-1,1]^d\rightarrow [-1,1]^d\) can be approximated by multivariate polynomials, converging in \(W^{m,\infty }([-1,1]^d)\) with the error decreasing as \(\exp (-\beta N_\varepsilon ^{1/d})\). Here \(N_\varepsilon \) is the dimension of the (ansatz) space in which T is approximated.
Theorem 4.3
Let \(m\in {{\mathbb {N}}}_0\). Let \(f_{\rho }\), \(f_{\pi }\) satisfy Assumption 3.5 for some constants \(0<M\le L<\infty \), \({\varvec{\delta }}\in (0,\infty )^d\) and with \(C_6=C_6(M,L)\) as in Theorem 3.6. Let \(C_7=C_7(M,L)\) be as in Theorem 3.6. For \(j\in \{1,\dots ,d\}\), set
For \(k\in \{1,\dots ,d\}\), let \(\Lambda _{k,\varepsilon }\) be as in (4.4) and define
For every \({\tilde{\beta }}<\beta \), there exists \(C=C({\varvec{\varrho }},m,d,{\tilde{\beta }},f_{\rho },f_{\pi })>0\) such that for every \(\varepsilon \in (0,1)\) with
\({\tilde{T}}{:}{=}({\tilde{T}}_k)_{k=1}^d\) and \(N_\varepsilon {:}{=}\sum _{k=1}^d|\Lambda _{k,\varepsilon }|\), it holds
Remark 4.4
We set \(\varrho _j=1+C_7\delta _j>1\) in Theorem 4.3, where \(\delta _j\) as in Assumption 3.5 encodes the size of the analyticity domain of the densities \(f_{\rho }\) and \(f_{\pi }\) (in the jth variable). The constant \(\beta \) in (4.7b) is an increasing function of each \(\varrho _j\). Loosely speaking, Theorem 4.3 states that the larger the analyticity domain of the densities, the faster the convergence when approximating the corresponding transport T with polynomials.
We skip the proof of the above theorem and instead proceed with a variation of this result. It states a convergence rate for an approximation \({\tilde{T}}_{k}\) to \(T_k\), which enjoys the property that \({\tilde{T}}_{k}({{\varvec{x}}}_{[k-1]},\cdot ):[-1,1]\rightarrow [-1,1]\) is monotonically increasing and bijective for every \({{\varvec{x}}}_{[k-1]}\in [-1,1]^{k-1}\). Thus, contrary to \({\tilde{T}}\) in Theorem 4.3, the \({\tilde{T}}\) in the next proposition is a bijection from \([-1,1]^d\rightarrow [-1,1]^d\) by construction.
This is achieved as follows: Let \(g:{{\mathbb {R}}}\rightarrow \{x\in {{\mathbb {R}}}\,:\,x\ge 0\}\) be analytic, such that \(g(0)=1\) and \(h{:}{=}g^{-1}:(0,\infty )\rightarrow {{\mathbb {R}}}\) is also analytic. We first approximate \(h(\partial _{k}T_k)\) by some function \(p_k\) and then obtain \(-1+\int _{-1}^{x_k}g(p_k({{\varvec{x}}}_{[k-1]},t))\;\mathrm {d}t\) as an approximation \({\tilde{T}}_k\) to \(T_k\). This approach, similar to what is proposed in [53], and in the present context in [45], guarantees \(\partial _{k}{\tilde{T}}_k=g(p_k({{\varvec{x}}}_{[k]}))\ge 0\) and \({\tilde{T}}_k({{\varvec{x}}}_{[k-1]},-1)=-1\) so that \({\tilde{T}}_k({{\varvec{x}}})\) is monotonically increasing in \(x_k\). In order to force \(\tilde{T}_k({{\varvec{x}}}_{[k-1]},\cdot ):[-1,1]\rightarrow [-1,1]\) to be bijective we introduce a normalization which leads to
The meaning of \(g(0)=1\) is that the trivial approximation \(p_k\equiv 0\) then yields \({\tilde{T}}_k({{\varvec{x}}})=x_k\).
To avoid further technicalities, henceforth we choose \(g(x)=(x+1)^2\) (and thus \(h(x)=\sqrt{x}-1\)), but emphasize that our analysis works just as well with any other positive analytic function such that \(g(0)=1\), e.g., \(g(x)=\exp (x)\) and \(h(x)=\log (x)\). The choice \(g(x)=(x+1)^2\) has the advantage that \(g(p_k)\) is polynomial if \(p_k\) is polynomial. This allows exact evaluation of the integrals in (4.9) without resorting to numerical quadrature, and results in a rational approximation \(\tilde{T}_k\):
Theorem 4.5
Let \(m\in {{\mathbb {N}}}_0\). Let \(f_{\rho }\), \(f_{\pi }\) satisfy Assumption 3.5 for some constants \(0<M\le L<\infty \), \({\varvec{\delta }}\in (0,\infty )^d\) and with \(C_6=C_6(M,L)\) as in Theorem 3.6. Let \(\varrho _j\), \(\beta \) and \(\Lambda _{k,\varepsilon }\) be as in Theorem 4.3.
For every \({\tilde{\beta }}<\beta \) there exists \(C=C({\varvec{\xi }},m,d,{\tilde{\beta }},f_{\rho },f_{\pi })>0\) and for every \(\varepsilon \in (0,1)\) there exist polynomials \(p_{k,\varepsilon }\in {{\mathbb {P}}}_{\Lambda _{k,\varepsilon }}\), \(k\in \{1,\dots ,d\}\), such that with
the map \({\tilde{T}}_\varepsilon {:}{=}({\tilde{T}}_{k,\varepsilon })_{k=1}^d:[-1,1]^d\rightarrow [-1,1]^d\) is a monotone triangular bijection, and with
it holds
We emphasize that our reason for using rational functions rather than polynomials in Theorem 4.5 is merely to guarantee that the resulting approximation \({\tilde{T}}:[-1,1]^d\rightarrow [-1,1]^d\) is a bijective and monotone map. We do not employ specific properties of rational functions (as done for Padé approximations) in order to improve the convergence order.
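As a concrete one-dimensional illustration of the construction preceding Theorem 4.5: with \(g(x)=(x+1)^2\) and an arbitrary polynomial p, the normalized map below is by construction monotonically increasing and bijective from \([-1,1]\) to \([-1,1]\), whatever p is; in the theorem, \(p_{k,\varepsilon }\in {{\mathbb {P}}}_{\Lambda _{k,\varepsilon }}\) is instead chosen to approximate \(h(\partial _kT_k)=\sqrt{\partial _kT_k}-1\). The sketch uses numerical quadrature for the integrals; with this polynomial choice of g they could also be evaluated exactly.

```python
import numpy as np

g = lambda s: (s + 1.0) ** 2                       # positive, analytic, g(0) = 1
p = np.polynomial.Polynomial([0.1, 0.4, -0.3])     # an arbitrary polynomial "p_1"

def T_tilde(x, n_quad=200):
    """Monotone bijection of [-1, 1] built from p by normalizing int g(p)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    nodes, weights = np.polynomial.legendre.leggauss(n_quad)
    def partial_integral(t):
        # int_{-1}^{t} g(p(s)) ds via Gauss-Legendre, rescaled from [-1, 1] to [-1, t]
        s = 0.5 * (t + 1.0) * (nodes + 1.0) - 1.0
        return 0.5 * (t + 1.0) * np.sum(weights * g(p(s)))
    num = np.array([partial_integral(t) for t in x])
    den = partial_integral(1.0)                    # normalization over [-1, 1]
    return -1.0 + 2.0 * num / den

xs = np.linspace(-1.0, 1.0, 5)
print(T_tilde(xs))        # monotone values with T_tilde(-1) = -1 and T_tilde(1) = 1
```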
Remark 4.6
If \(\Lambda _{k,\varepsilon }=\emptyset \) then by convention \({{\mathbb {P}}}_{\Lambda _{k,\varepsilon }}=\{0\}\); thus, \(p_{k,\varepsilon }= 0\) and \(\tilde{T}_{k,\varepsilon }({{\varvec{x}}})=x_k\).
Remark 4.7
Let \(S=T^{-1}\) so that \(T_\sharp {\rho }={\pi }\) is equivalent to \(S^\sharp {\rho }={\pi }\). It is often simpler to first approximate S, and then compute T by inverting S, see [45]. Since the assumptions of Theorem 4.5 (and Theorem 3.6) on the measures \({\rho }\) and \({\pi }\) are identical, Theorem 4.5 also yields an approximation result for the inverse transport map S: For all \(\varepsilon >0\) and with \(\Lambda _{k,\varepsilon }\) as in Theorem 4.5, there exist multivariate polynomials \(p_k\in {{\mathbb {P}}}_{\Lambda _{k,\varepsilon }}\) such that with
it holds
5 Deep Neural Network Approximation
Based on the seminal paper [73], it has recently been observed that ReLU neural networks (NNs) are capable of approximating analytic functions at an exponential convergence rate [24, 49], and slight improvements can be shown for certain smoother activation functions, e.g., [39]. We also refer to [46] for much earlier results of this type for different activation functions. As a consequence, our analysis in Sect. 3 yields approximation results of the transport by deep neural networks (DNNs). Below we present the statement, which is based on [49, Thm. 3.7].
To formulate the result, we recall the definition of a feedforward ReLU NN. The (nonlinear) ReLU activation function is defined as \(\varphi (x){:}{=}\max \{0,x\}\). We call a function \(f:{{\mathbb {R}}}^d\rightarrow {{\mathbb {R}}}^d\) a ReLU NN, if it can be written as
for certain weight matrices \({{\varvec{W}}}_j\in {{\mathbb {R}}}^{n_{j+1}\times n_j}\) and bias vectors \({{\varvec{b}}}_j\in {{\mathbb {R}}}^{n_{j+1}}\) where \(n_0=n_{L+1}=d\). For simplicity, we do not distinguish between the network (described by \(({{\varvec{W}}}_j,{{\varvec{b}}}_j)_{j=0}^L\)) and the function it expresses (different networks can have the same output). We then write \(\mathrm{depth}(f){:}{=}L\), \(\mathrm{width}(f){:}{=}\max _j n_j\) and \(\mathrm{size}(f){:}{=}\sum _{j=0}^{L}(|{{\varvec{W}}}_j|_0+|{{\varvec{b}}}_j|_0)\), where \(|{{\varvec{W}}}_j|_0=|\{(k,l)\,:\,({{\varvec{W}}}_j)_{kl}\ne 0\}|\) and \(|{{\varvec{b}}}_j|_0=|\{k\,:\,({{\varvec{b}}}_j)_{k}\ne 0\}|\). In other words, the depth corresponds to the number of applications of the activation function (the number of hidden layers) and the size equals the number of nonzero weights and biases, i.e., the number of trainable parameters in the network.
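To fix the bookkeeping, here is a minimal sketch (illustrative only; the weights are random and carry no approximation meaning) of a ReLU NN in the above sense together with its depth, width and size as just defined.

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

def make_network(layer_widths, rng):
    """Random weights/biases for layer widths n_0, ..., n_{L+1}."""
    return [(rng.standard_normal((m, n)), rng.standard_normal(m))
            for n, m in zip(layer_widths[:-1], layer_widths[1:])]

def apply_network(params, x):
    """f(x) = W_L phi( ... phi(W_0 x + b_0) ... ) + b_L, phi applied in the hidden layers."""
    for W, b in params[:-1]:
        x = relu(W @ x + b)
    W, b = params[-1]
    return W @ x + b

def size(params):
    # number of nonzero weights and biases, i.e. trainable parameters
    return sum(np.count_nonzero(W) + np.count_nonzero(b) for W, b in params)

rng = np.random.default_rng(3)
d = 2
widths = [d, 16, 16, d]                     # two hidden layers of width 16
params = make_network(widths, rng)
print("depth:", len(widths) - 2)            # number of hidden layers L
print("width:", max(widths))
print("size :", size(params))
print("f([0.1, -0.2]) =", apply_network(params, np.array([0.1, -0.2])))
```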
Theorem 5.1
Let \(f_{\rho }\), \(f_{\pi }\) be two positive and analytic probability densities on \([-1,1]^d\). Then there exists \(\beta >0\), and for every \(N\in {{\mathbb {N}}}\), there exists a ReLU NN \(\Phi _N=(\Phi _{N,j})_{j=1}^d:{{\mathbb {R}}}^d\rightarrow {{\mathbb {R}}}^d\), such that \(\Phi _N:[-1,1]^d\rightarrow [-1,1]^d\) is bijective, triangular and monotone,
\(\mathrm{size}(\Phi _N) \le C N\) and \(\mathrm{depth}(\Phi _N)\le C\log (N)N^{1/2}\). Here C is a constant depending on d, \(f_{\rho }\) and \(f_{\pi }\) but independent of N.
Remark 5.2
Compared to Theorems 4.3 and 4.5, for ReLU networks we obtain the slightly worse convergence rate \(\exp (-\beta N^{1/(d+1)})\) instead of \(\exp (-\beta N^{1/d})\). This stems from the fact that, for ReLU networks, the best known approximation results for analytic functions in d dimensions show convergence with the rate \(\exp (-\beta N^{1/(d+1)})\); see [49, Thm. 3.5].
The proof of Theorem 5.1 proceeds as follows: First, we apply results from [49] to obtain a neural network approximation \({\tilde{\Phi }}_k\) to \(T_k\). The constructed \((\tilde{\Phi }_k)_{k=1}^d:[-1,1]^d\rightarrow {{\mathbb {R}}}^d\) is a triangular map that is close to T in the norm of \(W^{1,\infty }([-1,1]^d)\). However, it is not necessarily a monotone bijective self-mapping of \([-1,1]^d\). To correct the construction, we use the following lemma:
Lemma 5.3
Let \(f:[-1,1]^{k-1}\rightarrow {{\mathbb {R}}}\) be a ReLU NN. Then there exists a ReLU NN \(g_f:[-1,1]^{k}\rightarrow {{\mathbb {R}}}\) such that \(|g_f({{\varvec{y}}},t)|\le |f({{\varvec{y}}})|\) for all \(({{\varvec{y}}},t)\in [-1,1]^{k-1}\times [-1,1]\),
and \(\mathrm{depth}(g_f)\le 1+\mathrm{depth}(f)\) and \(\mathrm{size}(g_f)\le C (1+\mathrm{size}(f))\) with C independent of f and \(g_f\). Moreover, in the sense of weak derivatives \(|\nabla _{{\varvec{y}}}g_f({{\varvec{y}}},t)|\le |\nabla f({{\varvec{y}}})|\) and \(|\frac{d}{dt} g_f({{\varvec{y}}},t)|\le \sup _{{{\varvec{y}}}\in [-1,1]^{k-1}}|f({{\varvec{y}}})|\), i.e., these inequalities hold a.e. in \([-1,1]^{k-1}\times [-1,1]\).
With \({\tilde{\Phi }}_k:[-1,1]^k\rightarrow {{\mathbb {R}}}\) approximating the kth component \(T_k:[-1,1]^k\rightarrow [-1,1]\), it is then easy to check that with \(f_1({{\varvec{x}}}_{[k-1]}){:}{=}1-{\tilde{\Phi }}_k({{\varvec{x}}}_{[k-1]},1)\) and \(f_{-1}({{\varvec{x}}}_{[k-1]}){:}{=}-1-{\tilde{\Phi }}_k({{\varvec{x}}}_{[k-1]},-1)\) for \({{\varvec{x}}}_{[k-1]}\in [-1,1]^{k-1}\), the NN
satisfies \(\Phi _k({{\varvec{x}}}_{[k-1]},1)=1\) and \(\Phi _k({{\varvec{x}}}_{[k-1]},-1)=-1\). Since the introduced correction terms \(g_{f_1}\) and \(g_{f_{-1}}\) have size and depth bounds of the same order as \({\tilde{\Phi }}_k\), they will not worsen the resulting convergence rates. The details are provided in Appendix D.1.
In the previous theorem, we consider a “sparsely connected” network \(\Phi \), meaning that certain weights and biases are, by choice of the network architecture, set a priori to zero. This reduces the overall size of \(\Phi \). We note that this also yields a convergence rate for fully connected networks: Consider all networks of width O(N) and depth \(O(\log (N) N^{1/2})\). The size of a network within this architecture is bounded by \(O(N^2\log (N) N^{1/2})=O(N^{5/2}\log (N))\), since the number of elements of the weight matrix \({{\varvec{W}}}_j\) between two consecutive layers is \(n_jn_{j+1}\le N^2\). The network \(\Phi \) from Theorem 5.1 belongs to this class, and thus, the best approximation among networks with this architecture achieves at least the exponential convergence \(\exp (-\beta N^{1/(d+1)})\). In terms of the number of trainable parameters \(M=O(N^{5/2}\log (N))\), this convergence rate is, up to logarithmic terms, \(\exp (-\beta M^{2/(5d+5)})\).
Remark 5.4
The constant C in (4.12) and the (possibly different) constant C in (5.2) typically depend exponentially on the dimension d: such a dependence is typical for polynomial approximation results for analytic functions in d dimensions, which is why it holds for C in (4.12) in general. Since the proof in [49], upon which our analysis is based, uses polynomial approximations, the same can be expected for the constant in (5.2). In [76] we discuss the approximation of T by rational functions in the high-dimensional case. There we give sufficient conditions on the reference and target measures that guarantee algebraic convergence of the error, with all constants controlled independently of the dimension.
Remark 5.5
Normalizing flows approximate a transport map T using a variety of neural network constructions, typically by composing a series of nonlinear bijective transformations; each individual transformation employs neural networks in its parametrization, embedded into a specific functional form (possibly augmented with constraints) that ensures bijectivity [36, 50]. “Residual” normalizing flows [1, 54] compose maps that are not in general triangular, but “autoregressive” flows [33, 34, 72] use monotone triangular maps as their essential building block. Many practical implementations of autoregressive flows, however, limit the class of triangular maps that can be expressed. Thus, they cannot seek to directly approximate the KR transport in a single step; rather, they compose multiple such triangular maps, interleaved with permutations [68]. In principle, however, a direct approximation of the KR map is sufficient, and our results could be a starting point for constructive and quantitative guidance on the parametrization and expressivity of autoregressive flows in this setting. Our result is also close in style to [43], which shows low-order convergence rates for neural network approximations of transport maps for certain classes of target densities, by writing the map as a gradient of a potential function given by a neural network. This construction, which employs semi-discrete optimal transport, is not in general triangular and does not necessarily coincide with common normalizing flow architectures.
6 Convergence of Pushforward Measures
Let again \(T_\sharp {\rho }={\pi }\). In Sect. 4 we have shown that the approximation \({\tilde{T}}\) to T obtained in Theorems 4.5 and 5.1 converges to T in the \(W^{m,\infty }([-1,1]^d)\) norm for suitable \(m\in {{\mathbb {N}}}\). In the present section, we show that these results imply corresponding error bounds for the approximate pushforward measure, i.e., bounds for
$$\begin{aligned} \mathrm{dist}\big ({\tilde{T}}_\sharp {\rho },{\pi }\big ) \end{aligned}$$(6.1)
with “\(\mathrm{dist}\)” referring to the Hellinger distance, the total variation distance, the Wasserstein distance or the KL divergence. Specifically, we will show that smallness of \(\Vert T-{\tilde{T}} \Vert _{W^{1,\infty }}\) (or \(\Vert T-\tilde{T} \Vert _{L^{\infty }}\) in case of the Wasserstein distance) implies smallness of (6.1).
As mentioned before, when casting the approximation of the transport as an optimization problem, it is often more convenient to first approximate the inverse transport \(S=T^{-1}\) by some \({\tilde{S}}\), and then to invert \({\tilde{S}}\) to obtain an approximation \({\tilde{T}} = {\tilde{S}}^{-1}\) of T; see [45] and also, e.g., the method in [22]. In this case, we usually have an upper bound on \(\Vert S-{\tilde{S}} \Vert _{}\) rather than \(\Vert T-{\tilde{T}} \Vert _{}\) in a suitable norm; cp. Remark 4.7. However, a bound of the type \(\Vert T-{\tilde{T}} \Vert _{W^{m,\infty }}<\varepsilon \) implies \(\Vert S-{\tilde{S}} \Vert _{W^{m,\infty }}=O(\varepsilon )\) for \(m\in \{0,1\}\) as the next lemma shows. Since closeness in \(L^\infty \) or \(W^{1,\infty }\) is all we require for the results of this section, the following analysis covers either situation.
Lemma 6.1
Let \(T:[-1,1]^d\rightarrow [-1,1]^d\) and \({\tilde{T}}:[-1,1]^d\rightarrow [-1,1]^d\) be bijective. Denote \(S=T^{-1}\) and \({\tilde{S}}={\tilde{T}}^{-1}\). Suppose that S has Lipschitz constant \(L_S\). Then
-
(i)
it holds
$$\begin{aligned} \Vert S-{\tilde{S}} \Vert _{L^\infty ([-1,1]^d)}\le L_S \Vert T-{\tilde{T}} \Vert _{L^\infty ([-1,1]^d)}, \end{aligned}$$ -
(ii)
if S, T, \({\tilde{S}}\), \({\tilde{T}}\in W^{1,\infty }([-1,1]^d)\) and \(dT:[-1,1]^d\rightarrow {{\mathbb {R}}}^{d\times d}\) has Lipschitz constant \(L_{dT}\), then
$$\begin{aligned} \Vert dS-d{\tilde{S}} \Vert _{L^\infty ([-1,1]^d)}\le (1+L_SL_{dT}) \Vert dS \Vert _{L^\infty ([-1,1]^d)} \Vert d\tilde{S} \Vert _{L^\infty ([-1,1]^d)} \Vert T-{\tilde{T}} \Vert _{W^{1,\infty }([-1,1]^d)}. \end{aligned}$$
6.1 Distances
For two probability measures \({\rho }\ll \mu \) and \({\pi }\ll \mu \) on \([-1,1]^d\) equipped with the Borel sigma-algebra, we consider the following distances:
-
Hellinger distance
$$\begin{aligned} \mathrm{H}({\rho },{\pi }) {:}{=}\left( \frac{1}{2}\int _{[-1,1]^d} \left( \sqrt{\frac{\mathrm {d}{\rho }}{\mathrm {d}\mu }({{\varvec{x}}})}-\sqrt{\frac{\mathrm {d}{\pi }}{\mathrm {d}\mu }({{\varvec{x}}})}\right) ^2 \;\mathrm {d}\mu ({{\varvec{x}}})\right) ^{1/2}, \end{aligned}$$(6.2a) -
total variation distance
$$\begin{aligned} \mathrm{TV}({\rho },{\pi }) {:}{=}\sup _{A\in {{\mathcal {A}}}}|{\rho }(A)-{\pi }(A)|=\frac{1}{2}\int _{[-1,1]^d} \Big |\frac{\mathrm {d}{\rho }}{\mathrm {d}\mu }({{\varvec{x}}})-\frac{\mathrm {d}{\pi }}{\mathrm {d}\mu }({{\varvec{x}}})\Big |\;\mathrm {d}\mu ({{\varvec{x}}}), \end{aligned}$$(6.2b) -
Kullback–Leibler (KL) divergence
$$\begin{aligned} \mathrm{KL}({\rho }\Vert {\pi }) {:}{=}{\left\{ \begin{array}{ll}\int _{[-1,1]^d} \log \left( \frac{\mathrm {d}{\rho }}{\mathrm {d}{\pi }}({{\varvec{x}}})\right) \;\mathrm {d}{\rho }({{\varvec{x}}}) &{}\text {if }{\rho }\ll {\pi }\\ \infty &{}\text {otherwise}, \end{array}\right. } \end{aligned}$$(6.2c) -
for \(p\in [1,\infty )\), the p-Wasserstein distance
$$\begin{aligned} W_p({\rho },{\pi }){:}{=}\inf _{\nu \in \Gamma ({\rho },{\pi })}\left( \int _{[-1,1]^d\times [-1,1]^d} \Vert {{\varvec{x}}}-{{\varvec{y}}} \Vert _{}^p\;\mathrm {d}\nu ({{\varvec{x}}},{{\varvec{y}}})\right) ^{1/p}, \end{aligned}$$(6.2d)where \(\Gamma ({\rho },{\pi })\) denotes the set of all measures on \([-1,1]^d\times [-1,1]^d\) with marginals \({\rho }\), \({\pi }\).
Contrary to the Hellinger, total variation, and Wasserstein distances, the KL divergence is not symmetric; however, \(\mathrm{KL}({\rho }\Vert {\pi })>0\) iff \({\rho }\ne {\pi }\).
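As a numerical illustration of (6.2a)–(6.2c) (purely illustrative, with hypothetical example densities), the following sketch approximates the Hellinger distance, the total variation distance and the KL divergence of two densities on \([-1,1]\), given w.r.t. the uniform probability measure, by a midpoint rule; the Wasserstein distance (6.2d) would instead require computing an optimal coupling (e.g., by sorting in one dimension or by linear programming).

```python
import numpy as np

# Two positive densities w.r.t. the uniform probability measure mu on [-1,1]
# (illustrative choices; both integrate to 1 against mu).
f_rho = lambda x: 1.0 + 0.5 * x
f_pi  = lambda x: 1.0 - x / 3.0

# Midpoint rule for mu: n cells of mu-measure 1/n each.
n = 10_000
x = np.linspace(-1.0, 1.0, n, endpoint=False) + 1.0 / n   # cell midpoints
w = np.full(n, 1.0 / n)                                    # quadrature weights (sum to 1)

r, p = f_rho(x), f_pi(x)
hellinger = np.sqrt(0.5 * np.sum(w * (np.sqrt(r) - np.sqrt(p)) ** 2))   # (6.2a)
tv        = 0.5 * np.sum(w * np.abs(r - p))                              # (6.2b)
kl        = np.sum(w * r * np.log(r / p))                                # (6.2c)
print(hellinger, tv, kl)
```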
Remark 6.2
As is well known, the Hellinger distance provides an upper bound on the difference of integrals w.r.t. two different measures. Assume that \(g\in L^2([-1,1]^d,{\rho }) \cap L^2([-1,1]^d,{\pi })\). Then
$$\begin{aligned} \Big |\int _{[-1,1]^d}g({{\varvec{x}}})\;\mathrm {d}{\rho }({{\varvec{x}}})-\int _{[-1,1]^d}g({{\varvec{x}}})\;\mathrm {d}{\pi }({{\varvec{x}}})\Big | \le \sqrt{2}\,\big (\Vert g \Vert _{L^2([-1,1]^d,{\rho })}+\Vert g \Vert _{L^2([-1,1]^d,{\pi })}\big )\,\mathrm{H}({\rho },{\pi }). \end{aligned}$$
6.2 Error Bounds
Throughout this subsection, \(p\in [1,\infty )\) is arbitrary but fixed. Under suitable assumptions a result of the following type was shown in [58, Theorem 2] (see the extended result [76, Prop. 6.2] for the following variant)
$$\begin{aligned} W_p({\tilde{T}}_\sharp {\rho },T_\sharp {\rho })\le \Vert T-{\tilde{T}} \Vert _{L^p([-1,1]^d,{\rho })}\le \Vert T-{\tilde{T}} \Vert _{L^\infty ([-1,1]^d)}. \end{aligned}$$
Thus, Theorems 4.5 and 5.1 readily yield bounds on \(W_p({\tilde{T}}_\sharp {\rho },{\pi })\) for the approximate polynomial, rational, and NN transport maps from Sects. 4.2 and 5.
For the other three distances/divergences in (6.2), to obtain a bound on \(\mathrm{dist}({\tilde{T}}_\sharp {\rho },T_\sharp {\rho })\), we will upper bound the difference between the densities of those measures. Since the density of \({\tilde{T}}_\sharp {\rho }\) is given by \(f_{\rho }({\tilde{S}}({{\varvec{x}}}))\det d{\tilde{S}}({{\varvec{x}}})\), where \({\tilde{S}}={\tilde{T}}^{-1}\), we need to upper bound \(|f_{\rho }(S({{\varvec{x}}}))\det dS({{\varvec{x}}}) -f_{\rho }({\tilde{S}}({{\varvec{x}}}))\det d{\tilde{S}}({{\varvec{x}}})|\), where \(S=T^{-1}\). This will be done in the proof of the following theorem. To state the result, for a triangular map \(S\in C^1([-1,1]^d;[-1,1]^d)\) we first define
$$\begin{aligned} S_{\mathrm{min}}{:}{=}\min _{k\in \{1,\dots ,d\}}\,\min _{{{\varvec{x}}}\in [-1,1]^d}\partial _{x_k}S_k({{\varvec{x}}}_{[k]}). \end{aligned}$$
Theorem 6.3
Let T, \({\tilde{T}}:[-1,1]^d\rightarrow [-1,1]^d\) be bijective, monotone, and triangular. Define \(S{:}{=}T^{-1}\) and \({\tilde{S}}{:}{=}{\tilde{T}}^{-1}\) and assume that T, \(S\in W^{2,\infty }([-1,1]^d)\) and \({\tilde{T}}\), \({\tilde{S}}\in W^{1,\infty }([-1,1]^d)\). Moreover, let \({\rho }\) be a probability measure on \([-1,1]^d\) such that \(f_{\rho }{:}{=}\frac{\mathrm {d}{\rho }}{\mathrm {d}\mu }:[-1,1]^d\rightarrow {{\mathbb {R}}}_+\) is strictly positive and Lipschitz continuous. Suppose that \(\tau _0>0\) is such that \(\Vert d{\tilde{S}} \Vert _{L^\infty ([-1,1]^d)}<\frac{1}{\tau _0}\) and \({\tilde{S}}_{\mathrm{min}}\ge \tau _0\).
Then there exists C depending on \(\tau _0\) but not on \({\tilde{T}}\) such that for \(\mathrm{dist}\in \{\mathrm{H},\mathrm{TV}\}\)
and
Together with Theorem 4.5, we can now show exponential convergence of the pushforward measure in the case of analytic densities. For \(\varepsilon >0\), denote by \({\tilde{T}}_\varepsilon = (\tilde{T}_{\varepsilon ,k})_{k=1}^d\) the approximation to T from Theorem 4.5. Moreover, recall that \(N_\varepsilon \) in (4.11) denotes the number of degrees of freedom of this approximation (the number of coefficients of this rational function). As shown in Lemma 3.9, the exponential convergence shown in the next proposition holds in particular for positive and analytic reference and target densities \(f_{\rho }\), \(f_{\pi }\).
Proposition 6.4
Consider the setting of Theorem 4.5; in particular, let \(f_{\rho }\), \(f_{\pi }\) satisfy Assumption 3.5, and let \(\beta >0\) be as in (4.7b).
Then for every \({\tilde{\beta }}<\beta \) there exists \(C>0\) such that for every \(\varepsilon \in (0,1)\) and \(\mathrm{dist}\in \{\mathrm{H},\mathrm{TV},\mathrm{KL},W_p\}\) with \({\tilde{T}}_\varepsilon \) as in (4.10) and \(N_\varepsilon \) as in (4.11) it holds
$$\begin{aligned} \mathrm{dist}\big (({\tilde{T}}_\varepsilon )_\sharp {\rho },{\pi }\big )\le C\exp \big (-{\tilde{\beta }} N_\varepsilon ^{1/d}\big ). \end{aligned}$$
Similarly, we get a bound for the pushforward under the NN transport from Theorem 5.1.
Proposition 6.5
Let \(f_{\rho }\) and \(f_{\pi }\) be two positive and analytic probability densities on \([-1,1]^d\). Then for every \(\mathrm{dist}\in \{\mathrm{H},\mathrm{TV},\mathrm{KL},W_p\}\), there exist constants \(\beta >0\) and \(C>0\), and for every \(N\in {{\mathbb {N}}}\) there exists a ReLU neural network \(\Phi _N:[-1,1]^d\rightarrow [-1,1]^d\) such that
$$\begin{aligned} \mathrm{dist}\big ((\Phi _N)_\sharp {\rho },{\pi }\big )\le C\exp \big (-\beta N^{1/(d+1)}\big ) \end{aligned}$$
and \(\mathrm{size}(\Phi _N)\le CN\) and \(\mathrm{depth}(\Phi _N)\le C\log (N)N^{1/2}\).
The proof is completely analogous to that of Proposition 6.4 (but using Theorem 5.1 instead of Theorem 4.5 to approximate the transport T with the NN \(\Phi _N\)), which is why we do not give it in the appendix.
7 Application to Inverse Problems in UQ
To give an application and explain in more detail the practical value of our results, we briefly discuss a standard inverse problem in uncertainty quantification.
7.1 Setting
Let \(n\in {{\mathbb {N}}}\) and let \(\mathrm {D}\subseteq {{\mathbb {R}}}^n\) be a bounded Lipschitz domain. For a diffusion coefficient \(a\in L^\infty (\mathrm {D};{{\mathbb {R}}})\) such that \({{\,\mathrm{ess\,inf}\,}}_{x\in \mathrm {D}}a(x)>0\), and a forcing term \(f\in H^{-1}(\mathrm {D})\), the PDE
$$\begin{aligned} -\mathrm{div}(a\nabla u)=f\quad \text {in }\mathrm {D},\qquad u=0\quad \text {on }\partial \mathrm {D}, \end{aligned}$$(7.1)
has a unique weak solution in \(H_0^1(\mathrm {D})\). We denote it by \({\mathfrak {u}}(a)\), and call \({\mathfrak {u}}:a\mapsto {\mathfrak {u}}(a)\) the forward operator.
Let \(A:H_0^1(\mathrm {D})\rightarrow {{\mathbb {R}}}^m\) be a bounded linear observation operator for some \(m\in {{\mathbb {N}}}\). The inverse problem consists of recovering the diffusion coefficient \(a\in L^\infty (\mathrm {D})\), given the noisy observation
$$\begin{aligned} {\varvec{\varsigma }}{:}{=}A{\mathfrak {u}}(a)+{\varvec{\eta }}\in {{\mathbb {R}}}^m \end{aligned}$$(7.2)
with the additive observation noise \({\varvec{\eta }}\sim {{\mathcal {N}}}(0,\Sigma )\), for a symmetric positive definite covariance matrix \(\Sigma \in {{\mathbb {R}}}^{m\times m}\).
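A minimal one-dimensional sketch of such a forward operator and observation operator (purely illustrative and hypothetical: \(\mathrm {D}=(0,1)\), a standard finite-difference discretization, and A taken as three point evaluations):

```python
import numpy as np

def forward(a_half, f_interior, h):
    """Solve -(a u')' = f on (0,1) with u(0) = u(1) = 0 by finite differences.

    a_half: diffusion coefficient at the n cell interfaces (i + 1/2) * h,
    f_interior: forcing at the n - 1 interior nodes, h = 1/n.
    """
    main = (a_half[:-1] + a_half[1:]) / h**2     # diagonal of the stiffness matrix
    off = -a_half[1:-1] / h**2                   # sub-/super-diagonal
    K = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    return np.linalg.solve(K, f_interior)        # values of u at the interior nodes

n, h = 200, 1.0 / 200
x_half = (np.arange(n) + 0.5) * h
a = 1.0 + 0.3 * np.sin(2 * np.pi * x_half)       # an example diffusion coefficient
u = forward(a, np.ones(n - 1), h)
observations = u[[n // 4, n // 2, 3 * n // 4]]   # three point evaluations of the solution
print(observations)
```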
7.2 Prior and Posterior
In uncertainty quantification and statistical inverse problems, the diffusion coefficient \(a\in L^\infty (\mathrm {D})\) is modeled as a random variable (independent of the observation noise \({\varvec{\eta }}\)) distributed according to some known prior distribution; see, e.g., [35]. Bayes’ theorem provides a formula for the distribution of the diffusion coefficient conditioned on the observations. This conditional is called the posterior and interpreted as the solution to the inverse problem.
To construct a prior, let \((\psi _j)_{j=1}^d\subset L^\infty (\mathrm {D})\) and set
$$\begin{aligned} a({{\varvec{y}}}){:}{=}1+\sum _{j=1}^d y_j\psi _j, \end{aligned}$$(7.3)
where \(y_j\in [-1,1]\). We consider the uniform measure on \([-1,1]^d\) as the prior: Every realization \((y_j)_{j=1}^d\in [-1,1]^d\) corresponds to a diffusion coefficient \(a({{\varvec{y}}})\in L^\infty (\mathrm {D})\), and equivalently the pushforward of the uniform measure on \([-1,1]^d\) under \(a:[-1,1]^d\rightarrow L^\infty (\mathrm {D})\) can be interpreted as a prior on \(L^\infty (\mathrm {D})\). Throughout we assume \({{\,\mathrm{ess\,inf}\,}}_{x\in \mathrm {D}}a({{\varvec{y}}},x)>0\) for all \({{\varvec{y}}}\in [-1,1]^d\) and write \(u({{\varvec{y}}}){:}{=}{\mathfrak {u}}(a({{\varvec{y}}}))\) for the solution of (7.1).
Given m measurements \((\varsigma _i)_{i=1}^m\) as in (7.2), the posterior measure \({\pi }\) on \([-1,1]^d\) is the distribution of \({{\varvec{y}}}|{\varvec{\varsigma }}\). Since \({\varvec{\eta }}\sim {{\mathcal {N}}}(0,\Sigma )\), the likelihood (the density of \({\varvec{\varsigma }}|{{\varvec{y}}}\)) equals
$$\begin{aligned} {{\varvec{y}}}\mapsto \frac{1}{\sqrt{(2\pi )^m\det \Sigma }}\exp \Big (-\frac{1}{2}\big ({\varvec{\varsigma }}-Au({{\varvec{y}}})\big )^\top \Sigma ^{-1}\big ({\varvec{\varsigma }}-Au({{\varvec{y}}})\big )\Big ). \end{aligned}$$
By Bayes’ theorem, the posterior density \(f_{\pi }\), corresponding to the distribution of \({{\varvec{y}}}|{\varvec{\varsigma }}\), is proportional to the density of \({{\varvec{y}}}\) times the density of \({\varvec{\varsigma }}|{{\varvec{y}}}\). Since the (uniform) prior has constant density 1,
$$\begin{aligned} f_{\pi }({{\varvec{y}}})=\frac{1}{Z}\exp \Big (-\frac{1}{2}\big ({\varvec{\varsigma }}-Au({{\varvec{y}}})\big )^\top \Sigma ^{-1}\big ({\varvec{\varsigma }}-Au({{\varvec{y}}})\big )\Big ). \end{aligned}$$(7.4)
The normalizing constant Z is in practice unknown. For more details see, e.g., [17].
In order to compute expectations w.r.t. the posterior \({\pi }\), we want to determine a transport map \(T:[-1,1]^d \rightarrow [-1,1]^d\) such that \(T_\sharp \mu ={\pi }\): Then if \(X_i\in [-1,1]^d\), \(i=1,\dots ,N\), are iid uniformly distributed on \([-1,1]^d\), \(T(X_i)\), \(i=1,\dots ,N\), are iid with distribution \(\pi \). This allows us to approximate integrals \(\int _{[-1,1]^d}g({{\varvec{y}}})\;\mathrm {d}{\pi }({{\varvec{y}}})\) via Monte Carlo sampling as \(\frac{1}{N}\sum _{i=1}^N g(T(X_i))\).
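In code, this sampling step is a few lines (a hypothetical sketch: `T_tilde` stands in for whatever approximate transport has been computed, here simply the identity as a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 4, 10_000

def T_tilde(x):
    # Placeholder for the approximate triangular transport on [-1,1]^d.
    return x

def g(y):
    # Quantity of interest; the identity yields the conditional mean.
    return y

X = rng.uniform(-1.0, 1.0, size=(n_samples, d))    # iid samples from the uniform reference
Y = np.apply_along_axis(T_tilde, 1, X)              # pushforward samples
estimate = np.mean(np.apply_along_axis(g, 1, Y), axis=0)
print(estimate)                                     # Monte Carlo estimate of the posterior expectation
```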
7.3 Determining \(\Lambda _{k,\varepsilon }\)
Choose as the reference measure the uniform measure \({\rho }=\mu \) on \([-1,1]^d\) and let the target measure \({\pi }\) be the posterior with density \(f_{\pi }\) as in (7.4). To apply Theorem 4.5, we first need to determine \({\varvec{\delta }}\in (0,\infty )^d\) such that \(f_{\pi }\in C^1({{\mathcal {B}}}_{{\varvec{\delta }}}([-1,1]);{{\mathbb {C}}})\). Since \(\exp :{{\mathbb {C}}}\rightarrow {{\mathbb {C}}}\) is an entire function, by (7.4), in case \(u\in C^1({{\mathcal {B}}}_{{\varvec{\delta }}}([-1,1]);{{\mathbb {C}}})\) we have \(f_{\pi }\in C^1({{\mathcal {B}}}_{{\varvec{\delta }}}([-1,1]);{{\mathbb {C}}})\). One can show that the forward operator \({\mathfrak {u}}\) is complex differentiable from \(\{a\in L^\infty (\mathrm {D};{{\mathbb {C}}})\,:\,{{\,\mathrm{ess\,inf}\,}}_{x\in \mathrm {D}}\Re (a(x))>0\}\) to \(H_0^1(\mathrm {D};{{\mathbb {C}}})\); see [75, Example 1.2.39]. Hence, \(u({{\varvec{y}}})={\mathfrak {u}}(1+\sum _{j=1}^dy_j\psi _j)\) indeed is complex differentiable, e.g., for all \({{\varvec{y}}}\in {{\mathbb {C}}}^d\) such that
$$\begin{aligned} \sum _{j=1}^d|y_j|\,\Vert \psi _j \Vert _{L^\infty (\mathrm {D})}<1, \end{aligned}$$
since then \({{\,\mathrm{ess\,inf}\,}}_{x\in \mathrm {D}}\Re \big (1+\sum _{j=1}^dy_j\psi _j(x)\big )>0\).
Complex differentiability implies analyticity, and therefore, \(u({{\varvec{y}}})\) is analytic on \({{\mathcal {B}}}_{\varvec{\delta }}([-1,1];{{\mathbb {C}}})\) with \(\delta _j\) proportional to \(\Vert \psi _j \Vert _{L^\infty (\mathrm {D})}^{-1}\):
Lemma 7.1
There exists \(\tau =\tau ({\mathfrak {u}},\Sigma ,d)>0\) and an increasing sequence \((\kappa _j)_{j=1}^d\subset (0,1)\) such that with \(\delta _j{:}{=}\kappa _j+\frac{\tau }{\Vert \psi _j \Vert _{L^\infty (\mathrm {D})}}\), \(f_{\pi }\) in (7.4) satisfies Assumption 3.5.
Let \(\varrho _j=1+C_7\delta _j\) be as in Theorem 4.5 (i.e., as in (4.7a)), where \(C_7\) is as in Theorem 3.6. With \(\kappa _j\in (0,1]\), \(\tau >0\) as in Lemma 7.1
$$\begin{aligned} \varrho _j=1+C_7\delta _j=1+C_7\kappa _j+\frac{C_7\tau }{\Vert \psi _j \Vert _{L^\infty (\mathrm {D})}}. \end{aligned}$$
In particular, \(\varrho _j\ge 1+C_7\tau \Vert \psi _j \Vert _{L^\infty (\mathrm {D})}^{-1}\). In practice, we do not know \(\tau \) and \(C_7\) (although pessimistic estimates could be obtained from the proofs), and we simply set \(\varrho _j{:}{=}1+\Vert \psi _j \Vert _{L^\infty (\mathrm {D})}^{-1}\). Theorem 4.3 (and Theorem 4.5) then suggest the choice (cp. (4.3), (4.4))
to construct a sparse polynomial ansatz space \({{\mathbb {P}}}_{\Lambda _{k,\varepsilon }}\) in which to approximate \(T_k\) (or \(\sqrt{\partial _{k}T_k}-1\)), \(k\in \{1,\dots ,d\}\). Here \(\varepsilon >0\) is a thresholding parameter, and as \(\varepsilon \rightarrow 0\) the ansatz spaces become arbitrarily large. We interpret (7.5) as follows: The smaller the \(\Vert \psi _j \Vert _{L^\infty (\mathrm {D})}\), the less important the variable j is in the approximation of \(T_k\) if \(j<k\). The kth variable plays a special role for \(T_k\), however, and is always among the most important in the approximation of \(T_k\).
7.4 Performing Inference
Given data \({\varvec{\varsigma }}\in {{\mathbb {R}}}^m\) as in (7.2), we can now describe a high-level algorithm to perform inference on the model problem in Sects. 7.1–7.2:
-
(i)
Determine ansatz space: Fix \(\varepsilon >0\) and determine \(\Lambda _{k,\varepsilon }\) in (7.5) for \(k=1,\dots ,d\).
-
(ii)
Find transport map: Use as a target \(\pi \) the posterior with density in (7.4), and solve the optimization problem
$$\begin{aligned} {{\,\mathrm{arg\,min}\,}}_{{\tilde{T}}\text { as in (4.10) with }p_k\in {{\mathbb {P}}}_{\Lambda _{k,\varepsilon }}}\mathrm{dist}(\tilde{T}_\sharp {\rho },{\pi }). \end{aligned}$$(7.6) -
(iii)
Estimate parameter: Estimate the unknown parameter \({{\varvec{y}}}\in [-1,1]^d\) via its conditional mean (CM), i.e., compute the expectation under the posterior \(\pi \simeq {\tilde{T}}_\sharp {\rho }\)
$$\begin{aligned} \int _{[-1,1]^d} {{\varvec{y}}}\;\mathrm {d}{\pi }({{\varvec{y}}}) \simeq \int _{[-1,1]^d} \tilde{T}({{\varvec{x}}})\;\mathrm {d}{\rho }({{\varvec{x}}}){=}{:}{\tilde{{{\varvec{y}}}}}. \end{aligned}$$(7.7)An estimate of the unknown diffusion coefficient \(a\in L^\infty (\mathrm {D})\) in (7.3) is obtained via \(1+\sum _{j=1}^d {\tilde{y}}_j\psi _j\).
We next provide more details for each of those steps.
7.4.1 Determining the Ansatz Space
An efficient algorithm (of linear complexity) to determine multi-index sets \(\Lambda _{k,\varepsilon }\) of the type (7.5) is given in [3] or [75, Sec. 3.1.3]. We emphasize that, in general, it is not an easy task to come up with suitable ansatz spaces. In the current setting, our explicit knowledge of the prior measure and its possibly anisotropic structure, and the analyticity of the forward operator, allow us—using the analysis of the previous sections—to determine a priori ansatz spaces yielding proven exponential convergence for the best approximating transport map within this space. By “anisotropic,” we mean that certain variables \(y_j\) may contribute less to the posterior than others due to \(\Vert \psi _j \Vert _{}\) being small in (7.3); in this case, our construction (7.5) results in fewer degrees of freedom spent on such variables, thus increasing the efficiency of the algorithm. This is to be contrasted with the use of generic ansatz spaces, which do not leverage such knowledge, as, for example, proposed in [45]. The explicit construction of these spaces is one of the main contributions of this work.
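To illustrate what such a sparse anisotropic index set looks like, the following brute-force sketch (not the linear-complexity algorithm of [3]) builds a set of the thresholding type \(\{{\varvec{\nu }}\,:\,\prod _j\varrho _j^{-\nu _j}\ge \varepsilon \}\); this particular form is an assumption made for illustration rather than a restatement of (7.5), but it exhibits the described behavior: variables j with small \(\Vert \psi _j \Vert _{L^\infty (\mathrm {D})}\), i.e., large \(\varrho _j\), receive low polynomial degrees.

```python
import itertools
import numpy as np

def index_set(rho, eps):
    """All nu in N_0^k with prod_j rho[j] ** (-nu[j]) >= eps, for rho[j] > 1 and 0 < eps < 1."""
    rho = np.asarray(rho, dtype=float)
    log_rho, log_eps = np.log(rho), np.log(eps)
    max_deg = np.floor(-log_eps / log_rho).astype(int)     # per-variable degree cap
    Lam = [nu for nu in itertools.product(*(range(m + 1) for m in max_deg))
           if -np.dot(nu, log_rho) >= log_eps]              # log of prod rho_j^{-nu_j}
    return Lam

# A small ||psi_j|| corresponds to a large rho_j and hence to few indices in that variable.
Lam = index_set([1.5, 4.0, 16.0], eps=1e-3)
print(len(Lam), max(nu[0] for nu in Lam), max(nu[2] for nu in Lam))
```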
7.4.2 Finding the Transport Map
Again let \({\pi }\) with density \(f_{\pi }\) be the posterior in (7.4). The “\(\mathrm{dist}\)” function in (7.6) is often chosen to be the KL divergence. In this case, the optimization problem (7.6) can equivalently be written as
$$\begin{aligned} {{\,\mathrm{arg\,min}\,}}_{{\tilde{T}}\text { as in (4.10) with }p_k\in {{\mathbb {P}}}_{\Lambda _{k,\varepsilon }}} \int _{[-1,1]^d}\Big (-\log f_{\pi }\big ({\tilde{T}}({{\varvec{x}}})\big )-\log \det d{\tilde{T}}({{\varvec{x}}})\Big )\;\mathrm {d}{\rho }({{\varvec{x}}}), \end{aligned}$$(7.8)
since \({\rho }=\mu \) has density \(f_{\rho }\equiv 1\), so that this integral equals \(\mathrm{KL}({\tilde{T}}_\sharp {\rho }\Vert {\pi })\).
To minimize this term, the normalizing constant Z in (7.4), which is in general unavailable, need not be known; see, e.g., [45, Sec. 3.2] for more details. Moreover, since the reference \({\rho }\) is a tractable measure, the integral in (7.8) can be approximated. The simplest way is by Monte Carlo sampling (from \({\rho }\)), but also higher-order methods like quasi-Monte Carlo [21] or sparse-grid quadrature, e.g., [27, 77], could be used, though they have not yet been rigorously investigated in this context. Solving the optimization problem (7.8) is in general hard, as the objective is non-convex and defined on a high-dimensional space. Practical implementations have employed quasi-Newton or Newton-CG methods [4, 8, 25], with continuation heuristics to address non-convexity. Coming up with good optimization algorithms for this objective is out of the scope of the present paper but would be an interesting topic for future research.
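A minimal sketch of the resulting sample-average objective (hypothetical throughout: the map below is a simple stand-in for the parametrization (4.10), and `log_f_pi_unnorm` is a placeholder for the logarithm of the unnormalized posterior density; the unknown constant \(\log Z\) only shifts the objective and does not affect the minimizer):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_mc = 2, 2_000
X = rng.uniform(-1.0, 1.0, size=(n_mc, d))       # Monte Carlo nodes drawn from rho = mu

def log_f_pi_unnorm(y):
    # Placeholder for log f_pi(y) + log Z, cf. (7.4).
    return -0.5 * np.sum(y ** 2, axis=-1)

def T_tilde(x, c):
    # Stand-in parametrization: y_k = x_k + c_k (1 - x_k^2) is a monotone bijection
    # of [-1,1] in each coordinate provided |c_k| < 1/2 (illustrative only).
    return x + c * (1.0 - x ** 2)

def log_det_dT(x, c):
    return np.sum(np.log(1.0 - 2.0 * c * x), axis=-1)

def kl_objective(c):
    # Sample average of -log f_pi(T(x)) - log det dT(x); this equals the KL objective
    # up to a constant independent of the parameters c.
    Y = T_tilde(X, c)
    return np.mean(-log_f_pi_unnorm(Y) - log_det_dT(X, c))

print(kl_objective(np.zeros(d)), kl_objective(np.full(d, 0.2)))
```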
Instead of minimizing over rational transports from (4.10) as in (7.6) and (7.8), alternatively we could minimize over the NN transports from Theorem 5.1. Since the proof of Theorem 5.1 is constructive, we could in principle give an explicit architecture achieving the rate in (5.2). We refrain from doing so, since an implementation of this (sparse) architecture may be cumbersome in practice, and not necessarily yield significantly better global minima than a fully connected network of similar size (which also allows for exponential convergence; see the discussion after Theorem 5.1).
Other objectives and approaches are possible as well. For instance, if the reference \({\rho }\) is chosen as the uniform measure, then \({\tilde{T}}_\sharp \rho \) has density \(\det d{\tilde{S}}\) with \(\tilde{S}={\tilde{T}}^{-1}\). In this case, we may minimize the Hellinger distance
$$\begin{aligned} \mathrm{H}({\tilde{T}}_\sharp {\rho },{\pi }) =\left( \frac{1}{2}\int _{[-1,1]^d}\Big (\sqrt{\det d{\tilde{S}}({{\varvec{y}}})}-\sqrt{f_{\pi }({{\varvec{y}}})}\Big )^2\;\mathrm {d}\mu ({{\varvec{y}}})\right) ^{1/2}. \end{aligned}$$
To make this optimization problem independent of the normalizing constant—and, moreover, convex—one can proceed as follows: Substitute \({\tilde{g}} {:}{=}\sqrt{\det d {\tilde{S}}}\), and find
Observe that by using, for example, a polynomial ansatz \(\tilde{g}({{\varvec{y}}})=\sum _{{\varvec{\nu }}\in \Lambda }c_{\varvec{\nu }}{{\varvec{y}}}^{\varvec{\nu }}\), this yields a (convex) linear least-squares problem for the expansion coefficients \((c_{\varvec{\nu }})_{{\varvec{\nu }}\in \Lambda }\). Moreover, after normalizing
$$\begin{aligned} g{:}{=}\frac{{\tilde{g}}}{\big (\int _{[-1,1]^d}{\tilde{g}}({{\varvec{y}}})^2\;\mathrm {d}\mu ({{\varvec{y}}})\big )^{1/2}}, \end{aligned}$$
which guarantees \(g^2\) to be a probability density, this method does not depend on the unknown normalizing constant of the target \(f_{\pi }\). Using the explicit formulas for the (inverse) KR transport in Sect. 2 we get
If \({\tilde{g}}\) is a multivariate polynomial, these integrals can be evaluated analytically (without the use of numerical quadrature), which makes the method computationally feasible. A similar algorithm based on tensor-train approximations of the density has recently been proposed in [22].
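A one-dimensional sketch of this convex least-squares step (hypothetical: the unnormalized target below is a placeholder, a plain monomial basis is used, and quadrature replaces the closed-form integrals mentioned above):

```python
import numpy as np

# Quadrature nodes/weights for the uniform probability measure mu on [-1,1].
n = 2_000
x = np.linspace(-1.0, 1.0, n, endpoint=False) + 1.0 / n
w = np.full(n, 1.0 / n)

def f_pi_unnormalized(t):
    # Placeholder unnormalized target density; the constant Z is never needed.
    return np.exp(-2.0 * (t - 0.3) ** 2)

deg = 8
V = np.vander(x, deg + 1, increasing=True)        # monomial ansatz for g_tilde
target = np.sqrt(f_pi_unnormalized(x))

# Weighted linear least squares for the coefficients of g_tilde.
coef, *_ = np.linalg.lstsq(V * np.sqrt(w)[:, None], target * np.sqrt(w), rcond=None)

g = V @ coef
g = g / np.sqrt(np.sum(w * g ** 2))               # normalize so that g^2 is a probability density
print(np.sum(w * g ** 2))                          # equals 1 up to round-off
```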
7.4.3 Estimating the Parameter
In the Bayesian setting, a natural point estimate of the parameter \({{\varvec{y}}}\) is the CM estimator in (7.7). An alternative is the MAP estimator (a point maximizing the posterior density), but the CM has the advantage of satisfying several useful stability properties. For example, in contrast to the MAP, under suitable assumptions the CM depends continuously on the data \({\varvec{\varsigma }}\), e.g., [32, 38, 67]. Moreover, minimizing objectives like the KL divergence or the Hellinger distance will guarantee convergence of the CM as the number of degrees of freedom in the approximation of \({\tilde{T}}\) tends to infinity. It will not guarantee convergence of a MAP point, however, since \({\tilde{T}}_\sharp {\rho }\) need not converge pointwise to \({\pi }\) if \({\tilde{T}}\) minimizes the KL divergence as in (7.8). To see convergence of the CM, recall that the Hellinger distance is bounded according to \(\mathrm{H}(\tilde{T}_\sharp {\rho },{\pi })^2\le \frac{1}{2}\mathrm{KL}(\tilde{T}_\sharp {\rho },{\pi })\); see [28]. Moreover, by Remark 6.2
By Proposition 6.4, for \({\tilde{T}}\) solving the optimization problem (7.8) (i.e., minimizing \(\mathrm{KL}({\tilde{T}}_\sharp {\rho },{\pi })\)), we thus have
with \({\tilde{\beta }}<\beta \) as in (4.7), i.e., our approximation to the actual CM \(\int _{[-1,1]^d}{{\varvec{y}}}\;\mathrm {d}{\pi }({{\varvec{y}}})\in {{\mathbb {R}}}^d\) converges exponentially in terms of the number of degrees of freedom N used in (4.10) to approximate T. If instead of (4.10) we use a NN, by Proposition 6.5 we obtain a similar bound in terms of the trainable parameters N of the network, but for a different constant \({\tilde{\beta }}\), and with \(N^{1/d}\) on the right-hand side of (7.9) replaced by \(N^{1/(d+1)}\).
Given an approximation \({\tilde{T}}\) to the transport T, any other posterior expectation
$$\begin{aligned} \int _{[-1,1]^d}g({{\varvec{y}}})\;\mathrm {d}{\pi }({{\varvec{y}}})=\int _{[-1,1]^d}g(T({{\varvec{x}}}))\;\mathrm {d}{\rho }({{\varvec{x}}}) \end{aligned}$$
can also be approximated by substituting \(\tilde{T}\) for T above; the CM simply corresponds to setting g to be the identity function. Choosing a polynomial g, for example, enables computation of the covariance or higher moments of the posterior, as a way of characterizing uncertainty in the parameter.
Remark 7.2
That the forward operator is defined by the diffusion equation (7.1) is not essential to the discussion in Sect. 7. Other models such as the Navier–Stokes equations allow for similar arguments [14]. More generally, the proof of Lemma 7.1 merely requires the existence of a complex differentiable function \({\mathfrak {u}}\) (between two complex Banach spaces) such that \(u({{\varvec{y}}})={\mathfrak {u}}(1+\sum _{j=1}^dy_j\psi _j)\); \({\mathfrak {u}}\) does not even need to stem from a PDE.
8 Conclusions
In this paper, we proved several results for the approximation of the KR transport \(T:[-1,1]^d\rightarrow [-1,1]^d\). The central requirement was that the reference and target densities are both analytic. Based on this, we first conducted a careful analysis of the regularity of T by investigating its domain of analytic extension. This implied exponential convergence for sparse polynomial and ReLU neural network approximations. We gave an ansatz for the computation of the approximate transport which guarantees that it is a bijective self-map of the hypercube \([-1,1]^d\). This led to a statement on rational approximations of T. Moreover, we discussed how our results can be used in the development of inference algorithms.
Most of these results are generalized and extended in [76], where we establish dimension-robust convergence in the high- or infinite-dimensional case \(d\gg 1\) or \(d=\infty \). For future research, we intend to use our proposed ansatz, including the a priori-determined sparse polynomial spaces, to construct and analyze in more detail concrete inference algorithms as outlined in Sect. 7. The present regularity and approximation results for T provide crucial tools to rigorously prove convergence and convergence rates for such methods. Additionally, we will investigate similar results on unbounded domains, e.g., \({{\mathbb {R}}}^d\).
Notes
It is desirable to have sharp upper bounds on the Legendre coefficients in order to construct the most efficient ansatz spaces. However, solely using the first bound \(t_1\) in the analysis would not alter the type of exponential convergence shown in Proposition 4.2. Using the improved bound \({\tilde{t}}_1\) becomes crucial for the analysis of the high-dimensional case \(d\gg 1\), which we discuss in [76].
References
Berg, R.V.d., Hasenclever, L., Tomczak, J.M., Welling, M.: Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649 (2018)
Beskos, A., Jasra, A., Law, K., Marzouk, Y., Zhou, Y.: Multilevel sequential Monte Carlo with dimension-independent likelihood-informed proposals. SIAM/ASA J. Uncertain. Quantif. 6(2), 762–786 (2018)
Bieri, M., Andreev, R., Schwab, C.: Sparse tensor discretization of elliptic SPDEs. SIAM J. Sci. Comput. 31(6), 4281–4304 (2009/2010)
Bigoni, D.: TransportMaps library, 2016–2020. http://transportmaps.mit.edu
Blei, D.M., Kucukelbir, A., McAuliffe, J.D.: Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112(518), 859–877 (2017)
Bogachev, V.I., Kolesnikov, A.V., Medvedev, K.V.: Triangular transformations of measures. Mat. Sb. 196(3), 3–30 (2005)
Bonito, A., DeVore, R., Guignard, D., Jantsch, P., Petrova, G.: Polynomial approximation of anisotropic analytic functions of several variables. arXiv:1904.12105 (2019)
Brennan, M., Bigoni, D., Zahm, O., Spantini, A., Marzouk, Y.: Greedy inference with structure-exploiting lazy maps. Adv. Neural Inform. Process. Syst. 33, 8330–8342 (2020)
Buchholz, A., Chopin, N.: Improving approximate Bayesian computation via quasi-Monte Carlo. J. Comput. Graph. Statist. 28(1), 205–219 (2019)
Chen, P., Schwab, C.: Adaptive sparse grid model order reduction for fast Bayesian estimation and inversion. In: Sparse Grids and Applications—Stuttgart 2014, vol. 109 Lecture Notes Computer Science Engineering, pp. 1–27. Springer, Cham (2016)
Cheney, E.: Introduction to Approximation Theory. International Series in Pure and Applied Mathematics. McGraw-Hill Book Co. (1966)
Chkifa, A.: Sparse polynomial methods in high dimension: application to parametric PDE. Ph.D. thesis, UPMC, Université Paris 06, Paris, France (2014)
Chkifa, A., Cohen, A., Schwab, C.: High-dimensional adaptive sparse polynomial interpolation and applications to parametric PDEs. J. Found. Comput. Math. 14(4), 601–633 (2013)
Cohen, A., Schwab, Ch., Zech, J.: Shape holomorphy of the stationary Navier–Stokes equations. SIAM J. Math. Anal. 50(2), 1720–1752 (2018)
Cotter, S.L., Roberts, G.O., Stuart, A.M., White, D.: MCMC methods for functions: modifying old algorithms to make them faster. Stat. Sci. 28, 424–446 (2013)
Cui, T., Law, K.J.H., Marzouk, Y.M.: Dimension-independent likelihood-informed MCMC. J. Comput. Phys. 304, 109–137 (2016)
Dashti, M., Stuart, A.M.: The Bayesian approach to inverse problems. In: Handbook of Uncertainty Quantification, vol. 1, 2, 3, pp. 311–428. Springer, Cham (2017)
Davis, P.: Interpolation and Approximation. Dover Books on Mathematics. Dover Publications, New York (1975)
Detommaso, G., Cui, T., Spantini, A., Marzouk, Y., Scheichl, R.: A Stein variational Newton method. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS18, pp. 9187–9197, Red Hook, NY, USA. Curran Associates Inc (2018)
Dick, J., Gantner, R.N., Le Gia, Q.T., Schwab, C.: Higher order quasi-Monte Carlo integration for Bayesian PDE inversion. Comput. Math. Appl. 77(1), 144–172 (2019)
Dick, J., LeGia, Q.T., Schwab, C.: Higher order quasi Monte Carlo integration for holomorphic, parametric operator equations. SIAM J. Uncert. Quantif. 4(1), 48–79 (2016)
Dolgov, S., Anaya-Izquierdo, K., Fox, C., Scheichl, R.: Approximation and sampling of multivariate probability distributions in the tensor train decomposition. Stat. Comput. 30(3), 603–625 (2020)
Duncan, A., Nuesken, N., Szpruch, L.: On the geometry of Stein variational gradient descent. arXiv preprint arXiv:1912.00894 (2019)
E, W., Wang, Q.: Exponential convergence of the deep neural network approximation for analytic functions. Sci. China Math. 61(10), 1733–1740 (2018)
El Moselhy, T.A., Marzouk, Y.M.: Bayesian inference with optimal maps. J. Comput. Phys. 231(23), 7815–7850 (2012)
Finlay, C., Jacobsen, J.-H., Nurbekyan, L., Oberman, A.M.: How to train your neural ODE. arXiv preprint arXiv:2002.02798 (2020)
Gerstner, T., Griebel, M.: Numerical integration using sparse grids. Numer. Algorithms 18(3–4), 209–232 (1998)
Gibbs, A.L., Su, F.E.: On choosing and bounding probability metrics. Int. Stat. Rev. 70(3), 419–435 (2002)
Grathwohl, W., Chen, R.T.Q., Bettencourt, J., Sutskever, I., Duvenaud, D.: FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367 (2018)
Griebel, M., Oettershagen, J.: On tensor product approximation of analytic functions. J. Approx. Theory 207, 348–379 (2016)
Hervé, M.: Analyticity in infinite-dimensional spaces. de Gruyter Studies in Mathematics, vol. 10. Walter de Gruyter & Co., Berlin (1989)
Hosseini, B., Nigam, N.: Well-posed Bayesian inverse problems: priors with exponential tails. SIAM/ASA J. Uncertain. Quantif. 5(1), 436–465 (2017)
Huang, C.-W., Krueger, D., Lacoste, A., Courville, A.: Neural autoregressive flows. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, vol. 80 of Proceedings of Machine Learning Research, pp. 2078–2087. PMLR, 10–15 Jul (2018)
Jaini, P., Selby, K.A., Yu, Y.: Sum-of-squares polynomial flow. In: ICML (2019)
Kaipio, J., Somersalo, E.: Statistical and computational inverse problems. In: Applied Mathematical Science, vol. 160. Springer, New York (2005)
Kobyzev, I., Prince, S.J., Brubaker, M.A.: Normalizing flows: an introduction and review of current methods. IEEE Trans. Pattern Anal. Mach. Intell. 43(11), 3964–79 (2020)
Kong, Z., Chaudhuri, K.: The expressive power of a class of normalizing flow models. In: Chiappa, S., Calandra, R. (eds). Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, vol. 108 of Proceedings of Machine Learning Research, pp. 3599–3609. PMLR, 26–28 Aug (2020)
Latz, J.: On the well-posedness of Bayesian inverse problems. SIAM/ASA J. Uncertain. Quantif. 8(1), 451–482 (2020)
Li, B., Tang, S., Yu, H.: Better approximations of high dimensional smooth functions by deep neural networks with rectified power units. Commun. Comput. Phys. 27(2), 379–411 (2019)
Liu, Q.: Stein variational gradient descent as gradient flow. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 3115–3123. Curran Associates, Inc. (2017)
Liu, Q., Wang, D.: Stein variational gradient descent: a general purpose Bayesian inference algorithm. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29, pp. 2378–2386. Curran Associates, Inc. (2016)
Lu, J., Lu, Y., Nolen, J.: Scaling limit of the Stein variational gradient descent: the mean field regime. SIAM J. Math. Anal. 51(2), 648–671 (2019)
Lu, Y., Lu, J.: A universal approximation theorem of deep neural networks for expressing probability distributions. Adv. Neural Inform. Process. Syst. 33, 3094–105 (2020)
Markoff, W., Grossmann, J.: Über Polynome, die in einem gegebenen Intervalle möglichst wenig von Null abweichen. Math. Ann. 77(2), 213–258 (1916)
Marzouk, Y., Moselhy, T., Parno, M., Spantini, A.: Sampling via measure transport: an introduction. In: Handbook of Uncertainty Quantification, vol. 1, 2, 3, pp. 785–825. Springer, Cham (2017)
Mhaskar, H.N.: Approximation properties of a multilayered feedforward artificial neural network. Adv. Comput. Math. 1(1), 61–80 (1993)
Morzfeld, M., Tong, X.T., Marzouk, Y.M.: Localization for MCMC: sampling high-dimensional posterior distributions with local structure. J. Comput. Phys. 380, 1–28 (2019)
Olver, F.W.J., Lozier, D.W., Boisvert, R.F., Clark, C.W. (eds.): NIST handbook of mathematical functions. U.S. Department of Commerce, National Institute of Standards and Technology, Washington, DC; Cambridge University Press, Cambridge (2010)
Opschoor, J.A.A., Schwab, C., Zech, J.: Exponential ReLU DNN expression of holomorphic maps in high dimension. Technical Report 2019-35, Seminar for Applied Mathematics, ETH Zürich, Switzerland (2019)
Papamakarios, G., Nalisnick, E., Rezende, D.J., Mohamed, S., Lakshminarayanan, B.: Normalizing flows for probabilistic modeling and inference. J. Mach. Learn. Res. 22, 1–64 (2021)
Papamakarios, G., Pavlakou, T., Murray, I.: Masked autoregressive flow for density estimation. arXiv preprint arXiv:1705.07057 (2017)
Parno, M.D., Marzouk, Y.M.: Transport map accelerated Markov chain Monte Carlo. SIAM/ASA J. Uncertain. Quantif. 6(2), 645–682 (2018)
Ramsay, J.O.: Estimating smooth monotone functions. J. R. Stat. Soc. Ser. B Stat. Methodol. 60(2), 365–375 (1998)
Rezende, D., Mohamed, S.: Variational inference with normalizing flows. In: Bach, F., Blei, D. (eds.) Proceedings of the 32nd International Conference on Machine Learning, vol. 37 of Proceedings of Machine Learning Research, pp. 1530–1538, Lille, France, 07–09 Jul (2015)
Robert, C.P., Casella, G.: Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer, Berlin (2005)
Rosenblatt, M.: Remarks on a multivariate transformation. Ann. Math. Statist. 23, 470–472 (1952)
Rudolf, D., Sprungk, B.: On a generalization of the preconditioned Crank–Nicolson metropolis algorithm. Found. Comput. Math. 18(2), 309–343 (2018)
Sagiv, A.: The Wasserstein distances between pushed-forward measures with applications to uncertainty quantification. Commun. Math. Sci. 18(3), 707–724 (2020)
Santambrogio, F.: Optimal transport for applied mathematicians, vol. 87 of Progress in Nonlinear Differential Equations and their Applications. Birkhäuser/Springer, Cham (2015)
Scheichl, R., Stuart, A.M., Teckentrup, A.L.: Quasi-Monte Carlo and multilevel Monte Carlo methods for computing posterior expectations in elliptic inverse problems. SIAM/ASA J. Uncertain. Quantif. 5(1), 493–518 (2017)
Schillings, C., Schwab, C.: Sparse, adaptive Smolyak quadratures for Bayesian inverse problems. Inverse Probl. 29(6), 065011 (2013)
Schillings, C., Schwab, C.: Scaling limits in computational Bayesian inversion. ESAIM Math. Model. Numer. Anal. 50(6), 1825–1856 (2016)
Schillings, C., Sprungk, B., Wacker, P.: On the convergence of the Laplace approximation and noise-level-robustness of Laplace-based Monte Carlo methods for Bayesian inverse problems. Numer. Math. 145(4), 915–971 (2020)
Schwab, C., Stuart, A.M.: Sparse deterministic approximation of Bayesian inverse problems. Inverse Probl. 28(4), 045003 (2012)
Spantini, A., Baptista, R., Marzouk, Y.: Coupling techniques for nonlinear ensemble filtering. arXiv preprint arXiv:1907.00389 (2019)
Spantini, A., Bigoni, D., Marzouk, Y.: Inference via low-dimensional couplings. J. Mach. Learn. Res. 19(1), 2639–2709 (2018)
Stuart, A.M.: Inverse problems: a Bayesian perspective. Acta Numer. 19, 451–559 (2010)
Teshima, T., Ishikawa, I., Tojo, K., Oono, K., Ikeda, M., Sugiyama, M.: Coupling-based invertible neural networks are universal diffeomorphism approximators. Adv. Neural Inform. Process. Syst. 33, 3362–73 (2020)
Teshima, T., Tojo, K., Ikeda, M., Ishikawa, I., Oono, K.: Universal approximation property of neural ordinary differential equations. arXiv preprint arXiv:2012.02414 (2020)
Tong, X.T., Morzfeld, M., Marzouk, Y.M.: MALA-within-Gibbs samplers for high-dimensional distributions with sparse conditional structure. SIAM J. Sci. Comput. 42(3), A1765–A1788 (2020)
Villani, C.: Optimal transport, vol. 338 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer, Berlin (2009)
Wehenkel, A., Louppe, G.: Unconstrained monotonic neural networks. arXiv preprint arXiv:1908.05164 (2019)
Yarotsky, D.: Error bounds for approximations with deep ReLU networks. Neural Netw. 94, 103–114 (2017)
Yau, S.T., Zhang, L.: An upper estimate of integral points in real simplices with an application to singularity theory. Math. Res. Lett. 13(5–6), 911–921 (2006)
Zech, J.: Sparse-grid approximation of high-dimensional parametric PDEs. Dissertation 25683, ETH Zürich. https://doi.org/10.3929/ethz-b-000340651 (2018)
Zech, J., Marzouk, Y.: Sparse approximation of triangular transports. Part II: the infinite dimensional case. Constr. Approx. https://doi.org/10.1007/s00365-022-09570-9 (2022)
Zech, J., Schwab, C.: Convergence rates of high dimensional Smolyak quadrature. ESAIM Math. Model. Numer. Anal. 54(4), 1259–1307 (2020)
Funding
Open Access funding enabled and organized by Projekt DEAL.
Additional information
Communicated by Albert Cohen.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This paper was written during the postdoctoral stay of JZ at MIT. JZ acknowledges support by the Swiss National Science Foundation under Early Postdoc Mobility Fellowship 184530. YM and JZ acknowledge support from the United States Department of Energy, Office of Advanced Scientific Computing Research, AEOLUS Mathematical Multifaceted Integrated Capability Center.
Appendices
A Inverse Function Theorem
In the following, if \(O\subseteq {{\mathbb {C}}}^n\) is a set, by \(f\in C^1(O;{{\mathbb {C}}})\) we mean that \(f\in C^0(O;{{\mathbb {C}}})\) and for every open \(S\subseteq O\) it holds \(f\in C^1(S;{{\mathbb {C}}})\).
Lemma A.1
Let \(n\in {{\mathbb {N}}}\), \(\delta >0\), \(\kappa \in (0,1)\) and let \(O\subseteq {{\mathbb {C}}}^n\). Assume that \(x_0\in O\), \(t_0\in {{\mathbb {C}}}\) and \(f\in C^1(O\times {{\mathcal {B}}}_\delta (t_0);{{\mathbb {C}}})\) satisfy
-
(a)
\(f(x_0,t_0)=0\) and \(f_t(x_0,t_0)\ne 0\),
-
(b)
\(|1-\frac{f_t(x,t)}{f_t(x_0,t_0)}|\le \kappa \) for all \((x,t)\in O\times {{\mathcal {B}}}_\delta (t_0)\),
-
(c)
\(|\frac{f(x,t_0)}{f_t(x_0,t_0)}|\le \delta (1-\kappa )\) for all \(x\in O\).
Then there exists a unique function \(t:O\rightarrow {{\mathcal {B}}}_\delta (t_0)\) such that \(f(x,t(x))=0\) for all \(x\in O\). Moreover, \(t\in C^1(O;{{\mathbb {C}}})\).
Proof
Define \(S(x,t){:}{=}t-\frac{f(x,t)}{f_t(x_0,t_0)}\), i.e., \(S:O\times {{\mathcal {B}}}_\delta (t_0)\rightarrow {{\mathbb {C}}}\). Then \(S_t(x,t) = 1 - \frac{f_t(x,t)}{f_t(x_0,t_0)}\), and by (b), we have for all \((x,t)\in O\times {{\mathcal {B}}}_\delta (t_0)\)
$$\begin{aligned} |S_t(x,t)|=\Big |1-\frac{f_t(x,t)}{f_t(x_0,t_0)}\Big |\le \kappa <1. \end{aligned}$$(A.1)
Moreover, (c) and (A.1) imply for all \((x,t)\in O\times {{\mathcal {B}}}_\delta (t_0)\)
$$\begin{aligned} |S(x,t)-t_0|\le |S(x,t)-S(x,t_0)|+\Big |\frac{f(x,t_0)}{f_t(x_0,t_0)}\Big |\le \kappa \delta +(1-\kappa )\delta =\delta . \end{aligned}$$(A.2)
Now define the Banach space \(X{:}{=}C^0(O;{{\mathbb {C}}})\) with norm \(\Vert f \Vert _{X}{:}{=}\sup _{t\in O}|f(t)|\), and consider the closed subset \(A{:}{=}\{f\in X\,:\,\Vert f-t_0 \Vert _{X}\le \delta \}\subset X\) (here, by abuse of notation, \(t_0:O\rightarrow {{\mathbb {C}}}\) is interpreted as the constant function in X). By (A.2), \(t(\cdot )\mapsto S(\cdot ,t(\cdot ))\) maps A to itself, and by (A.1) the map is a contraction there, so that it has a unique fixed point by the Banach fixed point theorem. We have shown the existence of \(t\in C^0(O;{{\mathcal {B}}}_\delta (t_0))\) satisfying \(S(x,t(x))\equiv t(x)\), which is equivalent to \(f(x,t(x))\equiv 0\). It remains to show that \(t\in C^1(O;{{\mathcal {B}}}_{\delta }(t_0))\). Letting \(t_0:O\rightarrow {{\mathbb {C}}}\) again be the constant function and \(t_k(x){:}{=}S(x,t_{k-1}(x))\) for every \(k\ge 1\), it holds \(t_k\rightarrow t\) in X, i.e., \((t_k)_{k\in {{\mathbb {N}}}}\) converges uniformly. Since \(t_0\in C^1(O;{{\mathbb {C}}})\), we inductively obtain \(t_k\in C^1(O;{{\mathbb {C}}})\) for all \(k\in {{\mathbb {N}}}\). Since X is a complex Banach space, as a uniform limit of differentiable (analytic) functions it holds \(\lim _{k\rightarrow \infty }t_k=t\in C^1(O;{{\mathbb {C}}})\), see, for instance, [31, Sec. 3.1].
Finally, to see that for each \(x\in O\) there exists only one \(s\in {{\mathcal {B}}}_\delta (t_0)\) such that \(f(x,s)=0\) (namely \(s=t(x)\)), one can argue as above and apply the Banach fixed point theorem to the map \(s\mapsto S(x,s)\) for \(x\in O\) fixed and \(s\in {{\mathcal {B}}}_\delta (t_0)\). \(\square \)
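Numerically, the contraction used in this proof also yields a simple procedure for inverting a function locally: iterate \(t\mapsto S(x,t)=t-(F(t)-x)/F'(t_0)\). A small illustrative sketch (with an arbitrary example function):

```python
import numpy as np

def F(t):
    return t + 0.2 * np.sin(t)       # example function to be inverted near t0

t0 = 0.5
dF_t0 = 1.0 + 0.2 * np.cos(t0)       # F'(t0)
x = F(t0) + 0.05                     # target value close to F(t0)

t = t0
for _ in range(50):
    t = t - (F(t) - x) / dF_t0       # fixed-point iteration for S(x, t)
print(t, F(t) - x)                    # the residual F(t) - x is (numerically) zero
```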
From the previous lemma, we deduce two types of inverse function theorems in Propositions A.2 and A.4.
Proposition A.2
Let \(t_0\in {{\mathbb {C}}}\) and \(\delta >0\) be such that
-
(a)
\(F\in C^1({{\mathcal {B}}}_\delta (t_0);{{\mathbb {C}}})\) and \(F'(t_0)\ne 0\),
-
(b)
\(F':{{\mathcal {B}}}_\delta (t_0)\rightarrow {{\mathbb {C}}}\) is Lipschitz continuous with Lipschitz constant \(L{:}{=}\frac{|F'(t_0)|}{2\delta }\).
Then with \(r{:}{=}\delta \frac{|F'(t_0)|}{2}\) there exists a unique \(G:{{\mathcal {B}}}_r(F(t_0))\rightarrow {{\mathcal {B}}}_\delta (t_0)\) such that \(F(G(x))=x\) for all \(x\in {{\mathcal {B}}}_r(F(t_0))\). Moreover, \(G\in C^1({{\mathcal {B}}}_r(F(t_0));{{\mathbb {C}}})\).
Proof
Let \(O{:}{=}{{\mathcal {B}}}_{r}(F(t_0))\). Define \(x_0{:}{=}F(t_0)\) as well as \(f(x,t){:}{=}F(t)-x\). Then \(f(x_0,t_0)=0\), \(f_t(x_0,t_0)=F'(t_0)\ne 0\), showing Lemma A.1 (a). Furthermore, for \(t\in {{\mathcal {B}}}_{\delta }(t_0)\) and \(x\in O={{\mathcal {B}}}_r(x_0)\), due to the Lipschitz continuity of \(F'\) and \(L=\frac{|F'(t_0)|}{2\delta }\)
$$\begin{aligned} \Big |1-\frac{f_t(x,t)}{f_t(x_0,t_0)}\Big |=\frac{|F'(t_0)-F'(t)|}{|F'(t_0)|}\le \frac{L|t-t_0|}{|F'(t_0)|}\le \frac{L\delta }{|F'(t_0)|}=\frac{1}{2}, \end{aligned}$$
which shows Lemma A.1 (b) with \(\kappa =\frac{1}{2}\). Finally, Lemma A.1 (c) (with \(\kappa =\frac{1}{2}\)) follows from the fact that for \(x\in O={{\mathcal {B}}}_r(F(t_0))\)
$$\begin{aligned} \Big |\frac{f(x,t_0)}{f_t(x_0,t_0)}\Big |=\frac{|F(t_0)-x|}{|F'(t_0)|}\le \frac{r}{|F'(t_0)|}=\frac{\delta }{2}=\delta (1-\kappa ). \end{aligned}$$
Hence, the statement follows by Lemma A.1. \(\square \)
The next lemma shows that G in Proposition A.2 depends continuously on F.
Lemma A.3
Let \(t_0\in {{\mathbb {C}}}\), \(\delta >0\) be such that both \(F\in C^1({{\mathcal {B}}}_\delta (t_0))\) and \({\tilde{F}}\in C^1({{\mathcal {B}}}_\delta (t_0))\) satisfy Proposition A.2 (a), (b). Denote the functions from Proposition A.2 by G, \({\tilde{G}}\), respectively. With \(r=\delta \frac{|F'(t_0)|}{2}\), \(\tilde{r}=\delta \frac{|{\tilde{F}}'(t_0)|}{2}\)
$$\begin{aligned} \sup _{x\in {{\mathcal {B}}}_r(F(t_0))\cap {{\mathcal {B}}}_{{\tilde{r}}}({\tilde{F}}(t_0))}|G(x)-{\tilde{G}}(x)| \le \frac{2}{|F'(t_0)|}\Vert F-{\tilde{F}} \Vert _{L^\infty ({{\mathcal {B}}}_\delta (t_0))}. \end{aligned}$$
Proof
Let s, \(t\in {{\mathcal {B}}}_\delta (t_0)\). Then
For any \(\zeta \in [0,1]\) it holds \((1-\zeta )s+\zeta t\in {{\mathcal {B}}}_\delta (t_0)\), and therefore, \(|(1-\zeta )s+\zeta t-t_0|\le \delta \). Thus, using that \(\frac{|F'(t_0)|}{2\delta }\) is a Lipschitz constant of \(F'\),
We get for \(x\in {{\mathcal {B}}}_r(F(t_0))\cap {{\mathcal {B}}}_{{\tilde{r}}}({\tilde{F}}(t_0))\) (applying this inequality with \(s=G(x)\) and \(t={\tilde{G}}(x)\))
\(\square \)
Proposition A.4
Let \(k\in {{\mathbb {N}}}\), let \(S\subseteq {{\mathbb {C}}}^{k+1}\) be open and \(F\in C^1(S;{{\mathbb {C}}})\). Assume that \(x_0\in {{\mathbb {C}}}^k\) and \(t_0\in {{\mathbb {C}}}\) are such that \(F_t(x_0,t_0)\ne 0\).
Then there exists \(\delta >0\), a neighborhood \(O\subseteq {{\mathbb {C}}}^{k+1}\) of \((x_0,F(x_0,t_0))\) and a unique function \(G:O \rightarrow {{\mathcal {B}}}_\delta (t_0)\) such that \(F(x,G(x,y))=y\) for all \((x,y)\in O\). Moreover, \(G\in C^1(O;{{\mathbb {C}}})\).
Proof
Let \(\kappa {:}{=}\frac{1}{2}\). Choose an open set \(O\subseteq {{\mathbb {C}}}^{k+1}\) and \(\delta >0\) such that with \(y_0{:}{=}F(x_0,t_0)\in {{\mathbb {C}}}\) it holds \((x_0,y_0)\in O\) and \(|1-\frac{F_t(x,t)}{F_t(x_0,t_0)}|\le \kappa \) for all \(((x,y),t)\in O\times {{\mathcal {B}}}_\delta (t_0)\), and \(|\frac{F(x,t_0)-y}{F_t(x_0,t_0)}|\le \delta (1-\kappa )\) for all \((x,y)\in O\). This is possible because F and \(F_t\) are continuous in a neighborhood of \((x_0,t_0)\). Set \(f((x,y),t){:}{=}F(x,t)-y\). Applying Lemma A.1 to \(f:O\times {{\mathcal {B}}}_\delta (t_0)\rightarrow {{\mathbb {C}}}\) gives the result. \(\square \)
B Proofs of Sect. 3
B.1 Lemma 3.3
Proof of Lemma 3.3
We start with (i). By definition \(F\in C^1([-1,1];[0,1])\) is strictly monotonically increasing with derivative \(F'=f\). This implies that F is bijective and by the inverse function theorem, its inverse belongs to \(C^1([0,1])\).
Holomorphy of \(f:{{\mathcal {B}}}_\delta ([-1,1])\rightarrow {{\mathbb {C}}}\) and the simply connectedness of \({{\mathcal {B}}}_\delta ([-1,1])\) imply the existence and well-definedness of a unique holomorphic antiderivative \(F:{{\mathcal {B}}}_\delta ([-1,1])\rightarrow {{\mathbb {C}}}\) of f satisfying \(F(-1)=0\). Since \(F'=f\), \(F:{{\mathcal {B}}}_\delta ([-1,1])\rightarrow {{\mathbb {C}}}\) has Lipschitz constant \(\sup _{x\in {{\mathcal {B}}}_{\delta }([-1,1])}|f(x)|\le L\).
We show (ii). For every \(x_0\in [-1,1]\), it holds \(F'(x_0)=f(x_0)\ne 0\). By Lemma 3.2, there exists a unique \(G_{x_0}:{{\mathcal {B}}}_{\alpha \delta }(F(x_0))\rightarrow {{\mathcal {B}}}_{\beta \delta }(x_0)\) satisfying \(F(G_{x_0}(y))=y\) for all \(y\in {{\mathcal {B}}}_{\alpha \delta }(F(x_0))\). It holds \(\beta \le 1\) by definition, and thus, \({{\mathcal {B}}}_{\beta \delta }(x_0)\subseteq {{\mathcal {B}}}_{\delta }(x_0)\). Since \(F:[-1,1]\rightarrow [0,1]\) is bijective, we can define \(G:{{\mathcal {B}}}_{\alpha \delta }([0,1])\rightarrow {{\mathcal {B}}}_{\beta \delta }([-1,1])\) locally via \(G(y){:}{=}G_{x_0}(y)\) whenever \(y\in {{\mathcal {B}}}_{\alpha \delta }(F(x_0))\). It remains to show well-definedness of G, i.e., that these local functions coincide wherever their domain overlaps. We then have \(F(G(y))=y\) for all \(y\in {{\mathcal {B}}}_{\alpha \delta }([0,1])\) by definition.
Let \(x_0\ne x_1\) be arbitrary in \([-1,1]\) and denote the corresponding local inverse functions of F by \(G_{x_i}:{{\mathcal {B}}}_{\alpha \delta }(F(x_i))\rightarrow {{\mathcal {B}}}_{\beta \delta }(x_i)\), \(i\in \{0,1\}\). The uniqueness of \(G_{x_0}\) and \(G_{x_1}\) (as stated in Lemma 3.2) and the continuity of \(F^{-1}:[0,1]\rightarrow [-1,1]\) imply that \(G_{x_j}\equiv F^{-1}\) on \([0,1]\cap {{\mathcal {B}}}_{\alpha \delta }(F(x_j))\ne \emptyset \), \(j\in \{0,1\}\). Now assume that there exists \(y\in {{\mathcal {B}}}_{\alpha \delta }(F(x_0))\cap {{\mathcal {B}}}_{\alpha \delta }(F(x_1))\cap [0,1]\). Then \(G_{x_0}(y)=F^{-1}(y)=G_{x_1}(y)\) and again by the local uniqueness of \(G_{x_0}\), \(G_{x_1}\) as the inverse of F those two functions coincide on a complex open subset of \({{\mathcal {B}}}_{\alpha \delta }(F(x_0))\cap {{\mathcal {B}}}_{\alpha \delta }(F(x_1))\). Since they are holomorphic, by the identity theorem they coincide on all of \({{\mathcal {B}}}_{\alpha \delta }(F(x_0))\cap {{\mathcal {B}}}_{\alpha \delta }(F(x_1))\). Thus, G is well defined on \({{\mathcal {B}}}_{\alpha \delta }([-1,1])\). \(\square \)
We will also use the following consequence of Lemma 3.3.
Lemma B.1
Let f, F, M, L and \(\alpha \) be as in Lemma 3.3. If \({\tilde{f}}\in C^1({{\mathcal {B}}}_\delta ([-1,1]);{{\mathbb {C}}})\) satisfies \(M\le |\tilde{f}(x)|\le L\) for all \(x\in {{\mathcal {B}}}_\delta ([-1,1])\) and
then for \({\tilde{F}}(x){:}{=}\int _{-1}^x{\tilde{f}}(t)\;\mathrm {d}\mu ( t)\) it holds \({{\mathcal {B}}}_{\frac{\alpha \delta }{2}}([0,1])\subseteq \{{\tilde{F}}(x)\,:\,x\in {{\mathcal {B}}}_\delta ([-1,1])\}\). Furthermore, there exists a unique \(\tilde{G}:{{\mathcal {B}}}_{\frac{\alpha \delta }{2}}([0,1])\rightarrow {{\mathcal {B}}}_{\beta \delta }([-1,1])\) such that \({\tilde{F}}({\tilde{G}}(y))=y\) for all \(y\in {{\mathcal {B}}}_{\frac{\alpha \delta }{2}}([0,1])\) and
Moreover, \({\tilde{G}}\in C^1({{\mathcal {B}}}_{\frac{\alpha \delta }{2}}([0,1]);{{\mathbb {C}}})\) with Lipschitz constant \(\frac{1}{M}\).
Proof
Fix \(x_0\in {{\mathcal {B}}}_\delta ([-1,1])\). It holds \({\tilde{F}}'(x_0)=\tilde{f}(x_0)\). By Lemma 3.2, there exists a unique \(\tilde{G}_{x_0}:{{\mathcal {B}}}_{\alpha \delta }({\tilde{F}}(x_0))\rightarrow {{\mathcal {B}}}_{\beta \delta }(x_0)\subseteq {{\mathcal {B}}}_\delta (x_0)\) satisfying \(\tilde{F}({\tilde{G}}_{x_0}(y))=y\) for all \(y\in {{\mathcal {B}}}_{\alpha \delta }(\tilde{F}(x_0))\). In particular, \({{\mathcal {B}}}_{\alpha \delta }({\tilde{F}}(x_0))\subseteq \{{\tilde{F}}(x)\,:\,x\in {{\mathcal {B}}}_\delta (x_0)\}\). By (B.1) and because \(\mu \) is a probability measure on \([-1,1]\), for all \(x\in [-1,1]\)
Therefore,
Restricting \({\tilde{G}}_{x_0}\) to \({{\mathcal {B}}}_{\frac{\alpha \delta }{2}}(F(x_0))\subset {{\mathcal {B}}}_{\alpha \delta }({\tilde{F}}(x_0))\) locally defines \(\tilde{G}:{{\mathcal {B}}}_{\frac{\alpha \delta }{2}}([0,1])\rightarrow {{\mathcal {B}}}_{\beta \delta }([-1,1])\), since \(F:[-1,1]\rightarrow [0,1]\) is bijective by Lemma 3.2. We next show well-definedness of \({\tilde{G}}\), i.e., these local functions coincide whenever their domain of definition overlaps.
Let \(y_1\), \(y_2\in [0,1]\) be arbitrary with
where the equality holds by definition of \(\beta =\frac{\alpha }{M}\) in (3.1). There exist unique \(x_i\in [-1,1]\) with \(y_i=F(x_i)\), \(i\in \{1,2\}\). Let \(\tilde{G}_{x_i}:{{\mathcal {B}}}_{\alpha \delta }({\tilde{F}}(x_i))\rightarrow {{\mathcal {B}}}_{\alpha \beta }(x_i)\) be the unique local inverse of \({\tilde{F}}\). We need to show that \({\tilde{G}}_{x_1}\equiv {\tilde{G}}_{x_2}\) on
First, by Lemma 3.2, \(F^{-1}\) has Lipschitz constant \(\frac{1}{M}\), and we recall that since \(|{\tilde{F}}'|=|{\tilde{f}}|\le L\) and \(|F'|=|f|\le L\), both F and \({\tilde{F}}\) have Lipschitz constant L. Thus, by (B.3)
This implies \({\tilde{F}}(x_1)\in {{\mathcal {B}}}_{\alpha \delta }({\tilde{F}}(x_2))\), i.e., \({\tilde{F}}(x_1)\) is in the domain of \({\tilde{G}}_{x_2}\). Again by Lemma 3.2, \({\tilde{G}}_{x_2}\) has Lipschitz constant \(\frac{1}{M}\) and \({\tilde{F}}\) has Lipschitz constant L. Thus,
and we obtain
Using again Lipschitz continuity of \(F^{-1}:[0,1]\rightarrow [-1,1]\) with Lipschitz constant \(\frac{1}{M}\), by (B.3)
Hence, \({\tilde{G}}_{x_2}({\tilde{F}}(x_1))\in {{\mathcal {B}}}_{\beta \delta }(x_1)\). Uniqueness of \({\tilde{G}}_{x_1}:{{\mathcal {B}}}_{\alpha \delta }({\tilde{F}}(x_1))\rightarrow {{\mathcal {B}}}_{\beta \delta }(x_1)\) (with the property \({\tilde{F}}(\tilde{G}_{x_1}(y))=y\) for all \(y\in {{\mathcal {B}}}_{\alpha \delta }({\tilde{F}}(x_1))\)) implies \({\tilde{G}}_{x_2}({\tilde{F}}(x_1))={\tilde{G}}_{x_1}(\tilde{F}(x_1))\). By continuity of \({\tilde{G}}_{x_1}\) and \({\tilde{G}}_{x_2}\) and the uniqueness property, \({\tilde{G}}_{x_1}\) and \({\tilde{G}}_{x_2}\) coincide on a complex neighborhood of \({\tilde{F}}(x_1)\). By the identity theorem of complex analysis, \({\tilde{G}}_{x_1}\) and \(\tilde{G}_{x_2}\) coincide on the whole intersection \({{\mathcal {B}}}_{\alpha \delta }({\tilde{F}}(x_1))\cap {{\mathcal {B}}}_{\alpha \delta }(\tilde{F}(x_2))\supseteq {{\mathcal {B}}}_{\frac{\alpha \delta }{2}}(y_1)\cap {{\mathcal {B}}}_{\frac{\alpha \delta }{2}}(y_2)\). \(\square \)
B.2 Lemma 3.9
Proof of Lemma 3.9
Fix \(*\in \{{\rho },{\pi }\}\). By analytic continuation, there is an open complex set \(O\subseteq {{\mathbb {C}}}^d\) containing \([-1,1]^d\) on which both functions are holomorphic. Moreover, \(0<\inf _{{{\varvec{x}}}\in [-1,1]^d} |f_*({{\varvec{x}}})|\) and \(\sup _{{{\varvec{x}}}\in [-1,1]^d}|f_*({{\varvec{x}}})|<\infty \) for \(*\in \{{\rho },{\pi }\}\) by compactness of \([-1,1]^d\). Hence, we can find \({\tilde{{\varvec{\delta }}}}\in (0,\infty )^d\) such that Assumption 3.5 (a) and (b) are satisfied for some \(0<M\le L<\infty \). Fix \(C_6=C_6(M,L)>0\) as in Theorem 3.6. Again by compactness (and the fact that \(f_*\) is continuous), decreasing the components of \({\tilde{{\varvec{\delta }}}}\) if necessary, also Assumption 3.5 (c) holds. Before verifying (d), we point out that Assumption 3.5 (a)–(c) is valid for any \({\varvec{\delta }}\in (0,\infty )^d\) with \(\delta _j\le {\tilde{\delta }}_j\), \(j\in \{1,\dots ,d\}\).
Let \(\delta _{\mathrm{min}}{:}{=}\min _{j=1}^d{\tilde{\delta }}_j\). Continuity of \(f_*:{{\mathcal {B}}}_{{\tilde{{\varvec{\delta }}}}}([-1,1]^d)\rightarrow {{\mathbb {C}}}\) and compactness of \([-1,1]^d\) imply with the notation \(\varepsilon {\varvec{1}}_k=(\varepsilon )_{j=1}^k\in {{\mathbb {R}}}^k\) for any \(k\in \{1,\dots ,d\}\)
Let \(\delta _{d}\in (0,{\tilde{\delta }}_d)\) be so small that \(\sup _{{{\varvec{x}}}\in [-1,1]^d}\sup _{{{\varvec{y}}}\in {{\mathcal {B}}}_{\delta _d{\varvec{1}}_d}} |f_*({{\varvec{x}}}+{{\varvec{y}}})-f_*({{\varvec{x}}})|\le C_6\). By (B.5), we can inductively (starting with \(k=d-1\) and ending with \(k=1\)) choose \(\delta _k\in (0,\min \{\delta _{\mathrm{min}},\delta _{k+1}\})\) so small that \(\sup _{{{\varvec{x}}}\in [-1,1]^d}\sup _{{{\varvec{y}}}\in {{\mathcal {B}}}_{\delta _k{\varvec{1}}_k}\times \{0\}^{d-k}} |f_*({{\varvec{x}}}+{{\varvec{y}}})-f_*({{\varvec{x}}})|\le C_6\delta _{k+1}\). \(\square \)
B.3 Theorem 3.6
To prove Theorem 3.6, we start with some preliminary results investigating the functions \(f_k\) in (2.2). First, we will analyze the domain of analytic extension of \(T:[-1,1]^d\rightarrow [-1,1]^d\) for the general d-dimensional Knothe transport in (2.5). The following variation of Assumption 3.5 will be our working assumption on the densities. In particular, item (c) stipulates that the analytic extensions of the densities do not deviate too much. This will guarantee that the inverse CDFs can be suitably analytically extended.
Assumption B.2
For \({\varvec{\delta }}\in (0,\infty )^d\), some constants \(0<M\le L<\infty \) and
it holds
-
(a)
\(f:[-1,1]^d\rightarrow {{\mathbb {R}}}_+\) is a probability density and \(f\in C^1({{\mathcal {B}}}_{{\varvec{\delta }}}([-1,1]);{{\mathbb {C}}})\),
-
(b)
\(M\le |f({{\varvec{x}}})|\le L\) for all \({{\varvec{x}}}\in {{\mathcal {B}}}_{{\varvec{\delta }}}([-1,1])\),
-
(c)
\(\sup _{{{\varvec{x}}}\in [-1,1]^d}\sup _{{{\varvec{y}}}\in {{\mathcal {B}}}_{{\varvec{\delta }}_{[k]}} \times \{0\}^{d-k}}|f({{\varvec{x}}}+{{\varvec{y}}})-f({{\varvec{x}}})| \le \varepsilon _k\) for all \(k\in \{1,\dots ,d\}\).
Remark B.3
We could have equivalently written \(\min _{l\in \{k,\dots ,d\}}\varepsilon _l\) on the right-hand side of the inequality in (c). In particular, \(|f({{\varvec{x}}}+{{\varvec{y}}})-f({{\varvec{x}}})|\le \varepsilon _d=\frac{M^3}{64L^2}\) for all \({{\varvec{x}}}\in [-1,1]^d\), \({{\varvec{y}}}\in {{\mathcal {B}}}_{\varvec{\delta }}\subseteq {{\mathbb {C}}}^d\).
Item (i) of the following lemma states that \(x_k\mapsto f_k({{\varvec{x}}}_{[k]})\) is a probability density on \([-1,1]\), and \({{\varvec{x}}}_{[k]}\mapsto f_k({{\varvec{x}}}_{[k]})\) has the same domain of analyticity as \({{\varvec{x}}}_{[k]}\mapsto f({{\varvec{x}}})\). Items (iii) and (iv) are statements about how much \(f_k\) varies in its variables: (iii) is mainly a technical requirement used in later proofs, and (iv) will be relevant for large values of \(\delta _k>0\). It states that the maximum deviation of the probability density \(x_k\mapsto f_k({{\varvec{x}}}_{[k]})\) from the constant 1 function is inversely proportional to \(\delta _k\), i.e., it is small for large \(\delta _k\).
Lemma B.4
Let \(f:[-1,1]^d\rightarrow {{\mathbb {R}}}_+\) satisfy Assumption B.2, and let \(f_k:[-1,1]^k\rightarrow {{\mathbb {R}}}_+\) be as in (2.2). Then for every \(k\in \{1,\dots ,d\}\)
-
(i)
\(f_k\in C^1({{\mathcal {B}}}_{{\varvec{\delta }}_{[k]}}([-1,1]);{{\mathbb {C}}})\), \(f_k:[-1,1]^k\rightarrow {{\mathbb {R}}}_+\) and \(\int _{-1}^{1} f_k({{\varvec{x}}},t)\;\mathrm {d}\mu ( t)=1\) if \({{\varvec{x}}}\in [-1,1]^{k-1}\),
-
(ii)
with \({\tilde{M}}{:}{=}\frac{M}{2L}\) and \({\tilde{L}}{:}{=}\frac{2L}{M}\) it holds
$$\begin{aligned} {\tilde{M}}\le |f_k({{\varvec{x}}})|\le {\tilde{L}}\qquad \forall {{\varvec{x}}}\in {{\mathcal {B}}}_{{\varvec{\delta }}_{[k]}}([-1,1]), \end{aligned}$$(B.7)and
$$\begin{aligned} \Re (f_k({{\varvec{x}}}))\ge \frac{M}{4L},\quad |\Im (f_k({{\varvec{x}}}))|\le \frac{M}{8L}, \qquad \forall {{\varvec{x}}}\in {{\mathcal {B}}}_{{\varvec{\delta }}_{[k]}}([-1,1]), \end{aligned}$$(B.8) -
(iii)
if \(k\ge 2\)
$$\begin{aligned}&\sup _{{{\varvec{y}}}\in {{\mathcal {B}}}_{{\varvec{\delta }}_{[k-1]}}([-1,1])}\inf _{{{\varvec{x}}}\in [-1,1]^{k-1}}\Vert f_k({{\varvec{y}}},\cdot )-f_k({{\varvec{x}}},\cdot ) \Vert _{L^\infty ([-1,1])}\nonumber \\&\qquad \qquad \le \varepsilon _{k-1} \frac{8 L}{M} \le \min \{1,\delta _k\} C_1(M,L) \end{aligned}$$(B.9)where \(C_1(M,L){:}{=}\frac{{\tilde{M}}^2}{4{\tilde{M}}+8{\tilde{L}}}\),
-
(iv)
if \(k\ge 2\), for any \(\gamma \in (0,\frac{\delta _k}{2}]\)
$$\begin{aligned} \sup _{{{\varvec{y}}}\in {{\mathcal {B}}}_{{\varvec{\delta }}_{[k-1]}}([-1,1])} \sup _{t\in {{\mathcal {B}}}_\gamma ([-1,1])} |f_k({{\varvec{y}}},t)-1| \le \frac{2+\gamma }{\delta _k} C_2(M,L) \end{aligned}$$where \(C_2(M,L){:}{=}\frac{4L(L+M)}{M^2}\).
Proof
Step 1. We establish some preliminary inequalities and show (i). Analyticity of \(f:{{\mathcal {B}}}_{{\varvec{\delta }}}([-1,1]) \rightarrow {{\mathbb {C}}}\) implies that \({\hat{f}}_k:{{\mathcal {B}}}_{{\varvec{\delta }}_{[k]}}([-1,1])\rightarrow {{\mathbb {C}}}\) and \({\hat{f}}_{k-1}:{{\mathcal {B}}}_{{\varvec{\delta }}_{[k-1]}}([-1,1])\rightarrow {{\mathbb {C}}}\) in (2.2) are holomorphic for all \(k\in \{1,\dots ,d\}\) (if \(k=1\), \({\hat{f}}_{k-1}\equiv 1\) by convention, so that \({\hat{f}}_{k-1}\) is an entire function). By Assumption B.2 (c) and Remark B.3, for every \(k\in \{1,\dots ,d\}\) and all \({{\varvec{x}}}\in [-1,1]^k\), \({{\varvec{y}}}\in {{\mathcal {B}}}_{{\varvec{\delta }}_{[k]}}\) (cp. (2.2))
and similarly, for every \(k\in \{2,\dots ,d\}\)
Since \(f:[-1,1]^d\rightarrow {{\mathbb {R}}}_+\) is a probability density and \(|f({{\varvec{x}}})|\ge M\) for all \({{\varvec{x}}}\in [-1,1]^d\), it holds \({\hat{f}}_k({{\varvec{x}}})\ge M\) for all \({{\varvec{x}}}\in [-1,1]^k\) and all \(k\in \{1,\dots ,d\}\). With (B.10) and because \(\frac{M^3}{64 L^2}\le \frac{M}{2}\) we conclude that
Moreover, we note that \(|f({{\varvec{x}}})|\le L\) for all \({{\varvec{x}}}\in {{\mathcal {B}}}_{\varvec{\delta }}([-1,1])\) implies (cp. (2.2))
By definition \(f_k({{\varvec{x}}})=\frac{{\hat{f}}_k({{\varvec{x}}})}{{\hat{f}}_{k-1}({{\varvec{x}}}_{[k-1]})}\). The modulus of the denominator is uniformly positive on \({{\mathcal {B}}}_{{\varvec{\delta }}_{[k]}}([-1,1])\) according to (B.12), and hence, \(f_k\in C^1({{\mathcal {B}}}_{{\varvec{\delta }}_{[k]}}([-1,1]);{{\mathbb {C}}})\). Now (i) is a consequence of Assumption B.2 (a) and the definition of \({\hat{f}}_k\), \({\hat{f}}_{k-1}\) in (2.2).
Step 2. We show (ii) and let at first \(k\ge 2\). By (B.12) and (B.13),
which shows (B.7) for \(k\ge 2\). For \(k=1\), we have \(f_1(x_1)={\hat{f}}_1(x_1)\) (since \({\hat{f}}_0\equiv 1\)). From Assumption B.2 (a) and (b), it follows \(M\le 1\le L\) because \(\mu \) is a probability measure. Hence, the definition of \({\hat{f}}_1\) and Assumption B.2 (b) imply \(\frac{M}{2L}\le M\le |f_1(x_1)|\le L \le \frac{2L}{M}\) for all \(x_1\in {{\mathcal {B}}}_{\delta _1}([-1,1])\).
To show (B.8) note that \(\frac{M}{2L}\le f_k({{\varvec{x}}})\in {{\mathbb {R}}}\) whenever \({{\varvec{x}}}\in [-1,1]^k\) by (B.14) and because \(f_k:[-1,1]^k\rightarrow {{\mathbb {R}}}_+\). If \(k\ge 2\), for \({{\varvec{x}}}\in [-1,1]^k\) and \({{\varvec{y}}}\in {{\mathcal {B}}}_{{\varvec{\delta }}_{[k]}}\), by (B.10), (B.12) and (B.13)
Hence, \(\Re (f_k({{\varvec{x}}}+{{\varvec{y}}}))\ge \Re (f_k({{\varvec{x}}}))-\frac{M}{8L}\ge \frac{M}{4L}\) and \(|\Im (f_k({{\varvec{x}}}+{{\varvec{y}}}))|=|\Im (f_k({{\varvec{x}}}+{{\varvec{y}}})-f_k({{\varvec{x}}}))|\le \frac{M}{8L}\).
For \(k=1\) we use again that \(f_1(x_1)={\hat{f}}_1(x_1)\) due to \({\hat{f}}_0\equiv 1\), and thus, by (B.10) \(|f_1(x_1+y_1)-f_1(x_1)|\le \frac{M^3}{64L^2}\le \frac{M}{8L}\) for all \(x_1\in [-1,1]\) and \(y_1\in {{\mathcal {B}}}_{\delta _1}\). We conclude, similarly to the case \(k\ge 2\), that (B.8) holds.
Step 3. We show (iii). Let \(k\in \{2,\dots ,d\}\), \({{\varvec{x}}}\in [-1,1]^{k-1}\), \({{\varvec{y}}}\in {{\mathcal {B}}}_{{\varvec{\delta }}_{[k-1]}}\) and \(t\in [-1,1]\). By (B.10)–(B.13),
The condition on \(\varepsilon _{k-1}\) in (B.6) is chosen exactly such that the last term is bounded by \(\min \{1,\delta _k\} \frac{{\tilde{M}}^2}{4{\tilde{M}}+8{\tilde{L}}}\).
Step 4. We show (iv). Fix \(k\in \{2,\dots ,d\}\) and \(\gamma \in (0,\frac{\delta _k}{2}]\).
By (B.13) and Lemma 3.1 (with \(K={{\mathcal {B}}}_\gamma ([-1,1])\)), for any \({{\varvec{y}}}\in {{\mathcal {B}}}_{{\varvec{\delta }}_{[k-1]}}([-1,1])\)
By (B.7) and Lemma 3.1, for any \({{\varvec{x}}}\in [-1,1]^{k-1}\)
Now, \(\int _{-1}^1f_k({{\varvec{x}}},t)\;\mathrm {d}\mu ( t)=1\) (see (i)) and the mean value theorem imply that for every \({{\varvec{x}}}_{[k-1]} \in [-1,1]^{k-1}\) there exists \(x_k\in [-1,1]\) (depending on \({{\varvec{x}}}_{[k-1]}\)) such that \(f_k({{\varvec{x}}}_{[k-1]},x_k)=1\). Since any two points in \([-1,1]\) have distance at most 2, this together with (B.17) yields
Next let \({{\varvec{x}}}\in [-1,1]^{k-1}\) and \({{\varvec{y}}}\in {{\mathcal {B}}}_{{\varvec{\delta }}_{[k-1]}}\) be arbitrary and fix \(t\in {{\mathcal {B}}}_{\gamma }([-1,1])\). Then, using (B.12), (B.13), (B.16), the fact that \({\hat{f}}_{k-1}({{\varvec{x}}}_{[k-1]})=\int _{-1}^1{\hat{f}}_k({{\varvec{x}}}_{[k-1]},s)\;\mathrm {d}\mu ( s)\) (see (2.2)) and the fact that \(|s-t|\le 2+\gamma \) for any \(s\in [-1,1]\),
where we used that \(\mu \) is a probability measure. Now additionally fix \(s\in [-1,1]\) such that \(|t-s|<\gamma \). Then by (B.19), (B.17) and (B.18),
This gives the bound in (iv). \(\square \)
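To make the objects in Lemma B.4 concrete, the following Python sketch (a toy example of our own, not part of the argument) computes \({\hat{f}}_k\) and \(f_k={\hat{f}}_k/{\hat{f}}_{k-1}\) for \(d=2\) by Gauss–Legendre quadrature w.r.t. the uniform probability measure \(\mu \), and checks item (i) and the real-variable part of item (ii) numerically.

```python
import numpy as np

# Toy analytic density on [-1,1]^2 w.r.t. the uniform probability measure mu (d mu = dx/2 per
# coordinate); the example is ours, Lemma B.4 concerns general f satisfying Assumption B.2.
f = lambda x1, x2: 1.0 + 0.5 * x1 * x2           # integrates to 1 against mu, values in [1/2, 3/2]
M, L = 0.5, 1.5

t, w = np.polynomial.legendre.leggauss(30)       # Gauss-Legendre nodes/weights on [-1,1]
w = w / 2.0                                      # rescale so the weights integrate against mu

def hat_f1(x1):                                  # \hat f_1(x_1) = int f(x_1, t) d mu(t)
    return np.sum(w * f(x1, t))

def f2(x1, x2):                                  # f_2 = \hat f_2 / \hat f_1 = f / \hat f_1
    return f(x1, x2) / hat_f1(x1)

for x1 in [-1.0, -0.3, 0.7]:
    mass = np.sum(w * f2(x1, t))                 # item (i): the conditional density has mass 1
    vals = f2(x1, t)
    assert abs(mass - 1.0) < 1e-12
    assert np.all(vals >= M / (2 * L)) and np.all(vals <= 2 * L / M)   # real part of item (ii)
    print(f"x1 = {x1:+.1f}:  int f_2(x1, .) d mu = {mass:.12f}")
```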
The next lemma facilitates determining the analyticity domains of \({{\varvec{x}}}\mapsto F_{{\rho };k}({{\varvec{x}}}):[-1,1]^k\rightarrow {{\mathbb {R}}}\) and \(({{\varvec{x}}},y)\mapsto (F_{{\pi };k}({{\varvec{x}}},\cdot ))^{-1}(y):[-1,1]^{k-1}\times [0,1]\rightarrow [-1,1]\) in (2.3), (2.4).
Lemma B.5
Let \(2\le k\in {{\mathbb {N}}}\), \(O\subseteq {{\mathbb {C}}}^{k-1}\) be open and convex, \([-1,1]^{k-1}\subseteq O\), \(\delta >0\), \(0<{\tilde{M}}\le \tilde{L}<\infty \),
and assume that
-
(a)
\(f\in C^1(O\times {{\mathcal {B}}}_\delta ([-1,1]);{{\mathbb {C}}})\), \(f:[-1,1]^k\rightarrow {{\mathbb {R}}}_+\) and \(\int _{-1}^{1} f({{\varvec{x}}},t)\;\mathrm {d}\mu ( t) =1\) for \({{\varvec{x}}}\in [-1,1]^{k-1}\),
-
(b)
\({\tilde{M}} \le |f({{\varvec{x}}},t)| \le {\tilde{L}}\) for all \(({{\varvec{x}}},t)\in O\times {{\mathcal {B}}}_\delta ([-1,1])\),
-
(c)
\(\sup _{{{\varvec{y}}}\in O}\inf _{{{\varvec{x}}}\in [-1,1]^{k-1}}\Vert f({{\varvec{y}}},\cdot )-f({{\varvec{x}}},\cdot ) \Vert _{L^\infty ([-1,1])} <\epsilon \).
For \({{\varvec{x}}}=(x_i)_{i=1}^k \in O\times {{\mathcal {B}}}_\delta ([-1,1])\) let \(F({{\varvec{x}}}){:}{=}\int _{-1}^{x_k} f({{\varvec{x}}}_{[k-1]},t)\;\mathrm {d}\mu ( t)\).
Then with \(\alpha =\alpha ({\tilde{M}},{\tilde{L}})\), \(\beta =\beta (\tilde{M},{\tilde{L}})\) as in (3.1)
-
(i)
for every \(\xi \in (0,\delta ]\) holds \(F\in C^1(O\times {{\mathcal {B}}}_\xi ([-1,1]); {{\mathcal {B}}}_{\epsilon +\tilde{L}\xi }([0,1]))\),
-
(ii)
there exists \(G\in C^1(O\times {{\mathcal {B}}}_{\frac{\alpha \delta }{2}}([0,1]); {{\mathcal {B}}}_{\beta \delta }([-1,1]))\) such that \(F({{\varvec{x}}},G({{\varvec{x}}},y))=y\) for all \(({{\varvec{x}}},y)\in O\times {{\mathcal {B}}}_{\frac{\alpha \delta }{2}}([0,1])\),
-
(iii)
\(G:O\times {{\mathcal {B}}}_{\frac{\alpha \min \{1,\delta \}}{2}}([0,1])\rightarrow {{\mathcal {B}}}_{\beta \min \{1,\delta \}}([-1,1])\).
Proof
We start with (i). Analyticity of \(F:O\times {{\mathcal {B}}}_\delta ([-1,1])\rightarrow {{\mathbb {C}}}\) is an immediate consequence of the analyticity of f. By (a), we have \(F:[-1,1]^{k}\rightarrow [0,1]\).
Let \({{\varvec{y}}}\in O\) and \(\xi \in (0,\delta ]\). By (c) we can find \({{\varvec{x}}}\in [-1,1]^{k-1}\) such that \(\sup _{\zeta \in [-1,1]}|f({{\varvec{y}}},\zeta )-f({{\varvec{x}}},\zeta )|\le \epsilon \). Fix \(t\in {{\mathcal {B}}}_\xi ([-1,1])\). There exists \(s\in [-1,1]\) such that \(|s-t|<\xi \). Then
where we used that \(|f({{\varvec{y}}},\zeta )|\le {\tilde{L}}\) for all \(\zeta \in {{\mathcal {B}}}_\delta ([-1,1])\). Here \(\frac{1}{2}\int _{[s,t]}\cdot \;\mathrm {d}\zeta \) (for complex t) is interpreted as a path integral over the path \(\gamma (p)=s+p(t-s)\), \(p\in [0,1]\), and the factor \(\frac{1}{2}\) stems from the fact that \(F({{\varvec{x}}},\zeta )=\int _{-1}^{\zeta }f({{\varvec{x}}},z)\;\mathrm {d}\mu ( z)= \frac{1}{2}\int _{-1}^{\zeta }f({{\varvec{x}}},z)\;\mathrm {d}z= \frac{1}{2}\int _{[-1,\zeta ]} f({{\varvec{x}}},z)\;\mathrm {d}z\) for \(\zeta \in [-1,1]\). This shows (i).
To show (ii), first let \({{\varvec{x}}}\in [-1,1]^{k-1}\). According to Lemma 3.3 (ii) there exists \(G_{{{\varvec{x}}}}:{{\mathcal {B}}}_{\alpha \delta }([0,1])\rightarrow {{\mathcal {B}}}_{\beta \delta }([-1,1])\) satisfying \(F({{\varvec{x}}},G_{{{\varvec{x}}}}(z))=z\) for all \(z\in {{\mathcal {B}}}_{\alpha \delta }([0,1])\). Now let \({{\varvec{y}}}\in O\backslash [-1,1]^{k-1}\). By assumption (c), we can find \({{\varvec{x}}}\in [-1,1]^{k-1}\) such that
by (B.20) (cp. (3.1)). Therefore, Lemma B.1 (with “f”\(=f({{\varvec{x}}},\cdot )\) and “\(\tilde{f}\)”\(=f({{\varvec{y}}},\cdot )\)) yields the existence of \(G_{{{\varvec{y}}}}:{{\mathcal {B}}}_{\frac{\alpha \delta }{2}}([0,1])\rightarrow {{\mathcal {B}}}_{\beta \delta }([-1,1])\) satisfying \(F({{\varvec{y}}},G_{{{\varvec{y}}}}(z))=z\) for all \(z\in {{\mathcal {B}}}_{\frac{\alpha \delta }{2}}([0,1])\).
Set \(G({{\varvec{x}}},z){:}{=}G_{{{\varvec{x}}}}(z)\) for all \(({{\varvec{x}}},z)\in O\times {{\mathcal {B}}}_{\frac{\alpha \delta }{2}}([0,1])\). We have found
By Lemma A.3, the local inverse functions in Proposition A.2 depend continuously on “F” (in the there stated sense). Since the proofs of Lemma 3.3 and Lemma B.1 stitch together the local inverse functions from Proposition A.2, one can show that \(G({{\varvec{x}}},z)\) (obtained via Lemmas 3.3 and B.1) is in fact a continuous function of \({{\varvec{x}}}\), since \(F({{\varvec{x}}},z)\) and \(\partial _zF({{\varvec{x}}},z)\) depend continuously on \({{\varvec{x}}}\). Moreover, Lemma 3.3 and Lemma B.1 state that \(z\mapsto G_{{\varvec{x}}}(z)=G({{\varvec{x}}},z)\) has Lipschitz constant \(\frac{1}{{\tilde{M}}}\) independent of \({{\varvec{x}}}\in O\). In all this implies that \(G({{\varvec{x}}},z)\) is a (jointly) continuous function of \(({{\varvec{x}}},z)\in O\times {{\mathcal {B}}}_{\frac{\alpha \delta }{2}}([0,1])\).
By Proposition A.4, if G is continuous and satisfies \(F({{\varvec{x}}},G({{\varvec{x}}},y))=y\), then G is locally unique and analytic in both arguments (here we use that \(\partial _z F({{\varvec{x}}},z)=f({{\varvec{x}}},z)\ne 0\)). This shows that \(G\) is analytic on \(O\times {{\mathcal {B}}}_{\frac{\alpha \delta }{2}}([0,1])\).
Finally, we show (iii). If \(\delta \le 1\), then (iii) follows from (ii). If \(\delta >1\), f also satisfies assumptions (a)–(c) with \(\delta \) replaced by \({\tilde{\delta }}{:}{=}1\) (because \({{\mathcal {B}}}_{\delta }([-1,1])\supset {{\mathcal {B}}}_{{\tilde{\delta }}}([-1,1])\) and \(\min \{1,\delta \}=1=\min \{1,{\tilde{\delta }}\}\) in (B.20)). Hence, (iii) follows again from (ii). \(\square \)
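The real-variable content of Lemma B.5 (ii), namely the existence of \(G\) with \(F({{\varvec{x}}},G({{\varvec{x}}},y))=y\), can be illustrated numerically. The sketch below uses a toy conditional density of our own choosing and bisection for the inversion; the analytic continuation to complex arguments, which is the actual point of the lemma, is not reproduced.

```python
import numpy as np

# Minimal real-variable sketch of Lemma B.5 (ii): for a toy conditional density f(x, t) on
# [-1,1]^2 (our choice), F(x, y) = int_{-1}^{y} f(x, t) d mu(t) and G(x, .) = F(x, .)^{-1},
# so that F(x, G(x, y)) = y for all y in [0,1].

def f(x, t):                      # toy density in t (w.r.t. mu), parametrized by x in [-1,1]
    return 1.0 + 0.5 * x * t      # positive and integrates to 1 against d mu = dt/2

def F(x, y):                      # conditional CDF, closed form of int_{-1}^{y} f(x,t) dt/2
    return 0.5 * (y + 1.0) + 0.125 * x * (y**2 - 1.0)

def G(x, y, tol=1e-13):           # inverse CDF by bisection (F(x, .) is strictly increasing)
    lo, hi = -1.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if F(x, mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for x in [-0.8, 0.0, 0.9]:
    for y in [0.0, 0.25, 0.9, 1.0]:
        assert abs(F(x, G(x, y)) - y) < 1e-10    # F(x, G(x, y)) = y as in Lemma B.5 (ii)
print("F(x, G(x, y)) = y verified on sample points")
```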
Theorem B.6
Let
-
(a)
\(f_{\pi }\) satisfy Assumption B.2 with \({\varvec{\delta }}_{\pi }=(\delta _{{\pi };j})_{j=1}^d\subset (0,\infty )\), \(0<M_{\pi }\le L_{\pi }<\infty \) and \((\varepsilon _{{\pi };k})_{k=1}^d\subset [0,\infty )\), and define \(\tilde{M}_{\pi }{:}{=}\frac{M_{\pi }}{2L_{\pi }}\le 1\), \(\tilde{L}_{\pi }{:}{=}\frac{2L_{\pi }}{M_{\pi }}\ge 1\),
-
(b)
\(f_{\rho }\) satisfy Assumption B.2 with \({\varvec{\delta }}_{\rho }=(\delta _{{\rho };j})_{j=1}^d\subset (0,\infty )\), \(0<M_{\rho }\le L_{\rho }<\infty \) and \((\varepsilon _{{\rho };k})_{k=1}^d\subset [0,\infty )\) such that (in addition to (B.6)) with \(\alpha =\alpha ({\tilde{M}}_{\pi },\tilde{L}_{\pi })\) as in (3.1)
$$\begin{aligned} 0\le \varepsilon _{{\rho };k} \le \frac{\alpha M_{\rho }}{32L_{\rho }} \min \left\{ 1, \delta _{{\pi };k+1} \right\} \qquad \forall k\in \{1,\dots ,d-1\} \end{aligned}$$(B.21)and set \({\tilde{M}}_{\rho }{:}{=}\frac{M_{\rho }}{2 L_{\rho }}\) and \({\tilde{L}}_{\rho }{:}{=}\frac{2L_{\rho }}{M_{\rho }}\).
Define \({\varvec{\zeta }}=(\zeta _k)_{k=1}^d\) via
Then, for every \(k\in \{1,\dots ,d\}\)
-
(i)
\(T_k:[-1,1]^k\rightarrow [-1,1]\) in (2.4) allows an extension
$$\begin{aligned} T_k\in C^1({{\mathcal {B}}}_{{\varvec{\zeta }}_{[k]}}([-1,1]); {{\mathcal {B}}}_{\delta _{{\pi };k}}([-1,1])), \end{aligned}$$(B.23) -
(ii)
\(R_k({{\varvec{x}}}){:}{=}\partial _{k} T_k({{\varvec{x}}})\) satisfies with \(C_3{:}{=}\frac{4L_{\pi }L_{\rho }}{ M_{\pi }M_{\rho }}\) and \(C_4{:}{=}\frac{3M_{\pi }^3 M_{\rho }}{256 L_{\pi }^3 L_{\rho }}\)
$$\begin{aligned} R_k\in C^1({{\mathcal {B}}}_{{\varvec{\zeta }}_{[k]}}([-1,1]); {{\mathcal {B}}}_{C_3}(1))\qquad \text {s.t.}\qquad \inf _{{{\varvec{x}}}\in {{\mathcal {B}}}_{{\varvec{\zeta }}_{[k]}}([-1,1])}\Re (R_k({{\varvec{x}}})) \ge C_4 \end{aligned}$$(B.24a)and there exists \(C_5=C_5(M_{\rho },M_{\pi },L_{\rho },L_{\pi })\) such that if \(k\ge 2\)
$$\begin{aligned} R_k:{{\mathcal {B}}}_{{\varvec{\zeta }}_{[k-1]}}([-1,1])\times [-1,1]\rightarrow {{\mathcal {B}}}_{\frac{C_5}{\min \{\delta _{{\pi };k},\delta _{{\rho };k}\}}}(1). \end{aligned}$$(B.24b)
Proof
Step 1. We establish notation and preliminary results. Throughout this proof (as in (3.1)),
In the following, for \(k\in \{1,\dots ,d\}\) and \(*\in \{{\rho },{\pi }\}\)

For \(k\in \{1,\dots ,d\}\), let
be as in (2.2). Lemma B.4 (i)–(iii) states that for \(k\in \{2,\dots ,d\}\), the functions \(f_{{\pi };k}\in C^1({{\mathcal {B}}}_{{\varvec{\delta }}_{{\pi };[k-1]}}\times {{\mathcal {B}}}_{\delta _{{\pi };k}};{{\mathbb {C}}})\) and \(f_{{\rho };k}\in C^1({{\mathcal {B}}}_{{\varvec{\delta }}_{{\rho };[k-1]}}\times {{\mathcal {B}}}_{\delta _{{\rho };k}};{{\mathbb {C}}})\) satisfy the assumptions of Lemma B.5, with the constants \(0<{\tilde{M}}_{\rho }\le {\tilde{L}}_{\rho }<\infty \), \(0<\tilde{M}_{\pi }\le {\tilde{L}}_{\pi }<\infty \), and
in (B.20) and assumption (c) of Lemma B.5 (cp. (B.6) and (B.9)) with \(O={{\mathcal {B}}}_{{\varvec{\delta }}_{{\rho };[k-1]}}([-1,1])\), \(O={{\mathcal {B}}}_{{\varvec{\delta }}_{{\pi };[k-1]}}([-1,1])\), respectively.
By Lemma B.5 (i), for \(k\in \{2,\dots ,d\}\) the functions \(F_{{\rho };k}\), \(F_{{\pi };k}\) as in (2.3) are well defined, and in particular, \(F_{{\rho };k}\) is holomorphic from
For \(k\in \{2,\dots ,d\}\), let \(G_{{\pi };k}\) be as in Lemma B.5 (ii) (w.r.t. \(F_{{\pi };k}\)). By Lemma B.5 (ii) for every \(k\in \{2,\dots ,d\}\), this map is holomorphic between
and by Lemma B.5 (iii)
Step 2. We show (i). By definition of \(T_k\) in (2.4), for \(k\in \{2,\dots ,d\}\) and \({{\varvec{x}}}\in [-1,1]^k\)
where we recall the notation \(T_{[k-1]}=(T_j)_{j=1}^{k-1}\). We show by induction (B.23), i.e.,
Let \(k=1\). By Lemma B.4 (i),
are holomorphic functions satisfying (B.7) with the corresponding constants \({\tilde{M}}_{\rho }\le {\tilde{L}}_{\rho }\), \(\tilde{M}_{\pi }\le {\tilde{L}}_{\pi }\). Thus, by Proposition 3.4 with \(r=\min \{\delta _{{\rho };1},\frac{\delta _{{\pi };1}\tilde{M}_{\pi }^2}{{\tilde{L}}_{\rho }(2{\tilde{M}}_{\pi }+4\tilde{L}_{\pi })}\}\)
is holomorphic. Since \(r\ge \zeta _1\) and \(\frac{r\tilde{L}_{\rho }}{{\tilde{M}}_{\pi }}\le \delta _{{\pi };1}\), this shows (B.23) for \(k=1\).
For the induction step, we first note that there hold the following inequalities for \(k\ge 2\):
-
(i1)
\(\zeta _k\le \min \{\delta _{{\pi };k},\delta _{{\rho };k}\}\): This is immediate from the definition of \(\zeta _k\) in (B.22).
-
(i2)
\(\epsilon _{{\rho };k} + {\tilde{L}}_{\rho }\zeta _k\le \frac{\alpha \delta _{{\pi };k}}{2}\): (B.26) and (B.21) give
$$\begin{aligned} \epsilon _{{\rho };k}\le \varepsilon _{{\rho };k-1} \frac{8L_{\rho }}{M_{\rho }} \le \frac{\alpha M_{\rho }}{32 L_{\rho }}\frac{8L_{\rho }}{M_{\rho }} \min \{1,\delta _{{\pi };k}\} = \frac{\alpha \min \{1,\delta _{{\pi };k}\}}{4}, \end{aligned}$$(B.31)and by (B.22), it holds \({\tilde{L}}_{\rho }\zeta _k\le \frac{\alpha \delta _{{\pi };k}}{4}\).
Let \(k\in \{2,\dots ,d\}\). Assume that
which is the induction hypothesis. We show that it also holds for k. By (B.27), (i1) and (i2),
Due to (B.28), (B.30) and the induction hypothesis (B.32), we get
Since \(\beta \le 1\) by definition, this shows (B.23) for \(T_k\).
Step 3. We show (B.24a) and first verify that for all \(k\in \{1,\dots ,d\}\)
First let \({{\varvec{x}}}\in [-1,1]^k\). By (B.30),
Now
and thus,
Using that \(F_{{\pi };k}({{\varvec{x}}}_{[k-1]},\cdot ):[-1,1]\rightarrow [0,1]\) is bijective, the substitution \(y_k=F_{{\pi };k}({{\varvec{x}}})\) and (B.36) give for all \(({{\varvec{x}}}_{[k-1]},y_k)\in [-1,1]^{k-1}\times [0,1]\)
Hence, since \(\partial _{k}F_{{\rho };k}=f_{{\rho };k}\) and \(\partial _{k}F_{{\pi };k}=f_{{\pi };k}\), we obtain by (B.35)
which by (B.30) shows (B.34) for \({{\varvec{x}}}\in [-1,1]^k\). The identity theorem for holomorphic functions implies that (B.34) holds for all \({{\varvec{x}}}\in {{\mathcal {B}}}_{{\varvec{\zeta }}_{[k]}}([-1,1])\).
Now we show (B.24a). By Lemma B.4 (ii) for \(*\in \{{\rho },{\pi }\}\)
Moreover, \(|f_{{\pi };k}({{\varvec{x}}})|\le {\tilde{L}}_{\pi }=\frac{2 L_{\pi }}{M_{\pi }}\). Thus, for arbitrary \({{\varvec{x}}}\in {{\mathcal {B}}}_{{\varvec{\zeta }}_{[k]}}([-1,1])\subseteq {{\mathcal {B}}}_{{\varvec{\delta }}_{{\pi };[k]}}([-1,1])\), writing \(f_{{\rho };k}({{\varvec{x}}})=a+\mathrm {i}b\) and \(f_{{\pi };k}(T_{[k]}({{\varvec{x}}}))=c+\mathrm {i}d\)
Moreover, by Lemma B.4 (ii) and (B.34) we have \(|R_k({{\varvec{x}}})|\le (\frac{2L_{\rho }}{ M_{\rho }})/(\frac{M_{\pi }}{2L_{\pi }})\) for all \({{\varvec{x}}}\in {{\mathcal {B}}}_{{\varvec{\zeta }}_{[k]}}([-1,1])\), which gives (B.24a).
Step 4. We show (B.24b). Fix \(k\in \{2,\dots ,d\}\) and let \(\xi \in (0,\zeta _k)\) be so small that \(\epsilon _{{\rho };k}+L_{\rho }\xi < \frac{\alpha \min \{1,\delta _{{\pi };k}\}}{2}\), which is possible by (B.31). Let
arbitrary. By (B.23),
By (B.27),
and since \(\beta \le \frac{1}{2}\) (cp. (B.25)), \(T_k({{\varvec{x}}})\in {{\mathcal {B}}}_{\frac{\min \{1,\delta _{{\pi };k}\}}{2}}([-1,1])\).
For the rest of the proof, fix \({{\varvec{x}}}\in {{\mathcal {B}}}_{{\varvec{\zeta }}_{[k-1]}}([-1,1])\times [-1,1]\). Lemma B.4 (iv) implies
with \(C_2=C_2(M_{\pi },L_{\pi })\) as in Lemma B.4 (iv). Lemma B.4 (iv) (and \(\zeta _j\le \delta _{{\rho };j}\), \(j=1,\dots ,k-1\)) also gives
with \(C_2=C_2(M_{\rho },L_{\rho })\) as in Lemma B.4 (iv).
Write \(f_{{\pi };k}(T_{[k]}({{\varvec{x}}}))=1+z_1\) and \(f_{{\rho };k}({{\varvec{x}}})=1+z_2\) for \(z_1\), \(z_2\in {{\mathbb {C}}}\) with
We distinguish between two cases and assume first that \(|z_1|\le \frac{1}{2}\). Then by (B.34) (cp. (B.22) and (B.37))
In the second case, we have \(|z_1|>\frac{1}{2}\). By (B.24a), \(|R_k({{\varvec{x}}})|\le 1+C_3\), and thus, with (B.37)
With
(B.38) and (B.39) show (B.24b). \(\square \)
We are now in a position to prove Theorem 3.6.
Proof of Theorem 3.6
Let \(0<M\le L<\infty \) and \({\varvec{\delta }}\in (0,\infty )^d\) be such that \(f_{\rho }\in C^1({{\mathcal {B}}}_{\varvec{\delta }}([-1,1]);{{\mathbb {C}}})\) and \( f_{\pi }\in C^1({{\mathcal {B}}}_{\varvec{\delta }}([-1,1]);{{\mathbb {C}}})\) both satisfy Assumption 3.5 with these constants. Upon choosing \(C_6=C_6(M,L)\) in Assumption 3.5 small enough, we show that \(f_{\rho }\), \(f_{\pi }\) satisfy Assumption B.2 with the additional constraint (B.21). This means that we can apply Theorem B.6, which immediately implies Theorem 3.6 with (cp. (3.1))
and where \({\tilde{M}}{:}{=}\frac{M}{2L}\), \({\tilde{L}}{:}{=}\frac{2L}{M}\) and \(C_3\), \(C_4\) and \(C_5\) are as in Theorem B.6.
Assumption B.2 (a), (b) holds by Assumption 3.5 (a), (b) with \(M_{\rho }=M_{\pi }=M\) and \(L_{\rho }=L_{\pi }=L\).
With \(\alpha =\frac{M^2}{2M+4L}\) holds \(\frac{\alpha M}{32L}=\frac{M^3}{64(ML+2L^2)}\). Therefore, spelling out Assumption B.2 (c) with the additional constraint (B.21), for \(k\in \{1,\dots ,d-1\}\) we require
and
Define
Then
and
implies (B.42). This concludes the proof, as items (c) and (d) of Assumption 3.5 correspond to (B.44) (note that if \(\min \{1,\delta _{k+1}\}=1\), then (B.42a) follows by (B.44b)). \(\square \)
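Before turning to the proofs of Sect. 4, the following Python sketch illustrates the construction analyzed in Theorems B.6 and 3.6 in the simplest nontrivial case \(d=2\): we take the uniform reference \(f_{\rho }\equiv 1\) and a toy analytic target density (both our choices), build \(T_1\) and \(T_2\) by inverting the conditional CDFs, as we read (2.4) from Step 2 of the proof of Theorem B.6, and check the identity \(f_{\pi }(T({{\varvec{x}}}))\det dT({{\varvec{x}}})=f_{\rho }({{\varvec{x}}})\) used in Step 3 by finite differences.

```python
import numpy as np

# Sketch of the 2-d Knothe-Rosenblatt map (toy example of ours): reference rho = uniform on
# [-1,1]^2 (f_rho = 1), target density f_pi(x1,x2) = 1 + x1*x2/2 (a probability density w.r.t. mu).
# Reading (2.4) from Step 2: T_1 = F_{pi;1}^{-1} o F_{rho;1},
#                            T_2(x1, .) = F_{pi;2}(T_1(x1), .)^{-1} o F_{rho;2}(x1, .).
f_pi = lambda x1, x2: 1.0 + 0.5 * x1 * x2

F_rho = lambda t: 0.5 * (t + 1.0)                            # CDF of the uniform reference
F_pi1 = lambda t: 0.5 * (t + 1.0)                            # first marginal of pi is uniform here
F_pi2 = lambda x1, y: 0.5 * (y + 1.0) + 0.125 * x1 * (y**2 - 1.0)   # conditional CDF of pi

def invert(F, y, tol=1e-13):                                  # bisection, F increasing on [-1,1]
    lo, hi = -1.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if F(mid) < y else (lo, mid)
    return 0.5 * (lo + hi)

def T(x1, x2):
    t1 = invert(F_pi1, F_rho(x1))                             # here simply t1 = x1
    t2 = invert(lambda y: F_pi2(t1, y), F_rho(x2))
    return t1, t2

# check f_pi(T(x)) * dT1/dx1 * dT2/dx2 = f_rho(x) = 1 (cp. Step 3 of the proof of Theorem B.6)
h = 1e-6
for x1, x2 in [(-0.7, 0.2), (0.1, -0.5), (0.6, 0.9)]:
    t1, t2 = T(x1, x2)
    d1 = (T(x1 + h, x2)[0] - T(x1 - h, x2)[0]) / (2 * h)
    d2 = (T(x1, x2 + h)[1] - T(x1, x2 - h)[1]) / (2 * h)
    print(f"x = ({x1:+.1f},{x2:+.1f}):  f_pi(T(x)) * det dT = {f_pi(t1, t2) * d1 * d2:.6f}")
```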
C Proofs of Sect. 4
C.1 Lemma 4.1
Proof of Lemma 4.1
We start with (4.1) for \(k=1\), and denote the Legendre coefficients of a function g by \(l_{g,j}=\int _{[-1,1]}g(x)L_j(x)\;\mathrm {d}\mu (x)\), \(j\in {{\mathbb {N}}}_0\).
For \(\xi >1\), introduce the Bernstein ellipse
If \(g\in C^1({{\mathcal {E}}}_\xi )\), then \(|l_{g,j}|\le \xi ^{-j}(1+2j)^{3/2}\Vert g \Vert _{L^\infty ({{\mathcal {E}}}_\xi )}\frac{2\xi }{\xi -1}\) for all \(j\in {{\mathbb {N}}}\). The proof of this classical estimate can be found, for example, in [18] (see equations (12.4.24)–(12.4.26); also note that our normalization \(\Vert L_j \Vert _{L^2([-1,1],\mu )}=1\) gives another factor \((1+2j)^{1/2}\) compared to the discussion in [18], cp. Remark C.1).
The Bernstein ellipse \({{\mathcal {E}}}_\xi \) has semiaxes \(\frac{\xi +\xi ^{-1}}{2}\) and \(\frac{\xi -\xi ^{-1}}{2}\). Solving \(\frac{\xi -\xi ^{-1}}{2}=\delta \) and \(\frac{\xi +\xi ^{-1}}{2}-1=\delta \) for \(\xi \), we find that the largest ellipse \({{\mathcal {E}}}_\xi \) contained in \({{\mathcal {B}}}_\delta ([-1,1])\) is obtained for \(\xi =\delta +\sqrt{\delta ^2+1}\ge 1+\delta \). In particular, \({{\mathcal {E}}}_{1+\delta }\subseteq {{\mathcal {B}}}_\delta ([-1,1])\) for all \(\delta >0\). Hence, if \(f\in C^1({{\mathcal {B}}}_\delta ([-1,1]);{{\mathbb {C}}})\), then with \(\varrho {:}{=}1+\delta \), \(|l_{f,j}|\le \varrho ^{-j}(1+2j)^{3/2}\Vert f \Vert _{L^\infty ({{\mathcal {B}}}_\delta ([-1,1]))}\frac{2\varrho }{\varrho -1}\). This shows (4.1) for \(k=1\). The bound (4.1) for general k follows by applying the same argument componentwise; see, for example, the appendix of [12] or [75, Cor. B.2.7].
It remains to show (4.2) for \({\varvec{\nu }}\in {{\mathbb {N}}}_0^{k-1}\times \{0\}\). Since \(L_{\nu _k}(y_k)=L_0(y_k)=1\), we have
Thus, (4.2) follows by (4.1). \(\square \)
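As a numerical illustration of the univariate case of (4.1) (with a test function of our own), the sketch below computes the Legendre coefficients \(l_{f,j}\) of \(f(x)=1/(x-2)\), which extends holomorphically to \({{\mathcal {B}}}_{\delta }([-1,1])\) for \(\delta =0.9\), and compares them with the bound \(\varrho ^{-j}(1+2j)^{3/2}\Vert f \Vert _{L^\infty ({{\mathcal {B}}}_\delta ([-1,1]))}\frac{2\varrho }{\varrho -1}\), \(\varrho =1+\delta \), derived above.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss, Legendre

f = lambda x: 1.0 / (x - 2.0)                    # analytic on B_delta([-1,1]) for delta = 0.9
delta = 0.9
rho = 1.0 + delta
sup_f = 1.0 / (1.0 - delta)                      # sup of |f| on B_delta([-1,1]) (pole at x = 2)
C = sup_f * 2.0 * rho / (rho - 1.0)

x, w = leggauss(200)
w = w / 2.0                                      # quadrature w.r.t. mu (weights sum to 1)

for j in range(1, 21, 4):
    L_j = np.sqrt(1.0 + 2.0 * j) * Legendre.basis(j)(x)   # normalized so ||L_j||_{L^2(mu)} = 1
    l_j = np.sum(w * f(x) * L_j)                           # l_{f,j} = int f L_j d mu
    bound = C * (1.0 + 2.0 * j) ** 1.5 * rho ** (-j)
    print(f"j = {j:2d}:  |l_j| = {abs(l_j):.3e}   bound = {bound:.3e}")
    assert abs(l_j) <= bound
```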
C.2 Proposition 4.2
The exponential convergence of Proposition 4.2 is verified by bounding the approximation error of the truncated Legendre expansion \(f({{\varvec{y}}})\simeq \sum _{{\varvec{\nu }}\in \Lambda _{k,\varepsilon }}l_{\varvec{\nu }}L_{\varvec{\nu }}({{\varvec{y}}})\) by \(\sum _{{\varvec{\nu }}\in {{\mathbb {N}}}_0^k\backslash \Lambda _{k,\varepsilon }} |l_{\varvec{\nu }}| \Vert L_{\varvec{\nu }} \Vert _{W^{m,\infty }([-1,1]^k)}\), which is an upper bound on the remainder. As we will see, this classical line of arguments requires the following ingredients: (a) an upper bound on the Legendre coefficients, (b) an upper bound on the norm of the Legendre polynomials \(L_{\varvec{\nu }}\), and (c) an upper bound on the cardinality of \(\Lambda _{k,\varepsilon }\). The first is obtained in Lemma 4.1 and the second in Remark C.1 ahead. Before coming to the proof of Proposition 4.2, it then remains to discuss the third.
Remark C.1
For all \({\varvec{\nu }}\in {{\mathbb {N}}}_0^k\) holds \(\Vert L_{\varvec{\nu }} \Vert _{L^{\infty }([-1,1]^k;{{\mathbb {R}}})}\le \prod _{j=1}^k (1+2\nu _j)^{1/2}\); see [48, §18.2(iii) and §18.3]. By the Markov brothers’ inequality (see [44] and, e.g., the references in [11, p. 228]), this generalizes to \(W^{m,\infty }\) via
A first simple bound on \(|\Lambda _{k,\varepsilon }|\) in (4.4) is obtained as follows: With \(\varrho _{\mathrm{min}}{:}{=}\min _j\varrho _j>1\) it holds \(\gamma ({\varvec{\varrho }},{\varvec{\nu }})\le \varrho _{\mathrm{min}}^{-|{\varvec{\nu }}|}\), and thus, \(\{{\varvec{\nu }}\in {{\mathbb {N}}}_0^k\,:\,\varrho _\mathrm{min}^{-|{\varvec{\nu }}|}\ge \varepsilon \}\supseteq \Lambda _{k,\varepsilon }\). Due to \(|\{{\varvec{\nu }}\in {{\mathbb {N}}}_0^k\,:\,|{\varvec{\nu }}|\le a\}|\le (1+a)^k\) for all \(a\ge 0\), for \(\varepsilon \in (0,1)\) we conclude with \(a=-\frac{\log (\varepsilon )}{\log (\varrho _{\mathrm{min}})}\)
The following slightly more involved bound holds by [49, Lemma 3.3] and because \(\gamma ({\varvec{\varrho }},{\varvec{\nu }})\le {\varvec{\varrho }}^{-{\varvec{\nu }}}\) for all \({\varvec{\nu }}\in {{\mathbb {N}}}_0^k\) (also see [74] and [7, Lemma 4.2]).
Lemma C.2
Let \(k\in {{\mathbb {N}}}\) and \({\varvec{\varrho }}\in (1,\infty )^k\). It holds for \(\varepsilon \in (0,1)\)
The next lemma will be required in the proof of Proposition 4.2. In the following for \({\varvec{\varrho }}\in (1,\infty )^k\) and \(\theta >0\), set
For any \(\theta >0\) holds \(S({\varvec{\varrho }},\theta )<\infty \); see, e.g., [77, Lemma 3.10].
Lemma C.3
Let k, \(m\in {{\mathbb {N}}}\), \(\theta >0\), \(C_0>0\) and \({\varvec{\varrho }}\in (1,\infty )^k\). Assume that \(f({{\varvec{y}}})=\sum _{{\varvec{\nu }}\in {{\mathbb {N}}}_0^k}l_{\varvec{\nu }}L_{\varvec{\nu }}({{\varvec{y}}})\in L^2([-1,1]^k,\mu )\) for certain coefficients \(l_{\varvec{\nu }}\in {{\mathbb {R}}}\) satisfying \(|l_{\varvec{\nu }}|\le C_0 \gamma ({\varvec{\varrho }},{\varvec{\nu }}) \prod _{j=1}^k(1+2\nu _j)^\theta \) for all \({\varvec{\nu }}\in {{\mathbb {N}}}_0^k\). Then \(f\in W^{m,\infty }([-1,1]^k)\) and with
\(\varrho _{\mathrm{min}}{:}{=}\min _j\varrho _j>1\), and \(w_{\varvec{\nu }}{:}{=}\prod _{j=1}^k(1+2\nu _j)^{1/2+2m+\theta }\) it holds for all \(\varepsilon \in (0,1)\)
where \(S({\varvec{\varrho }},\frac{1}{2}+\theta +2m)\) is defined in (C.4).
Proof
For \(\varepsilon \in (0,\varrho _k^{-1})\), set
so that
since \(\gamma ({\varvec{\varrho }},{\varvec{\nu }})\) is monotonically decreasing in each \(\nu _j\). Set
and
We want to show
If \(\gamma ({\varvec{\varrho }},{\varvec{\eta }})<\varepsilon \) and \(\eta _k=0\), then \({\varvec{\eta }}\in S_1\). Now let \(\gamma ({\varvec{\varrho }},{\varvec{\eta }})<\varepsilon \) and \(\eta _k>0\). If there exists \({\varvec{\nu }}\in A_{\varepsilon ,2}\) (i.e., \(\nu _k>0\)) such that \({\varvec{\nu }}\le {\varvec{\eta }}\) (i.e., \(\nu _j\le \eta _j\) for all j), then \({\varvec{\eta }}={\varvec{\nu }}+({\varvec{\eta }}-{\varvec{\nu }})\in S_2\). Otherwise there exists \({\varvec{\nu }}\in A_{\varepsilon ,1}\) (i.e., \(\nu _k=0\)) with \({\varvec{\nu }}\le {\varvec{\eta }}\). By definition, \(\gamma ({\varvec{\varrho }},{\varvec{\nu }})=\gamma ({\varvec{\varrho }},{\varvec{\nu }}+{\varvec{e}}_k)\) (since \(\nu _k=0\)), and therefore, \({\tilde{{\varvec{\nu }}}}{:}{=}{\varvec{\nu }}+{\varvec{e}}_k\in A_{\varepsilon ,2}\) satisfies \({\tilde{{\varvec{\nu }}}}\le {\varvec{\eta }}\). Hence, \({\varvec{\eta }}={\tilde{{\varvec{\nu }}}}+({\varvec{\eta }}-{\tilde{{\varvec{\nu }}}})\in S_2\), showing (C.8).
Note that
for all \({\varvec{\nu }}+{\varvec{\mu }}\in S_1\) or \({\varvec{\nu }}+{\varvec{\mu }}\in S_2\) as in (C.7). As mentioned before, \(\gamma ({\varvec{\varrho }},{\varvec{\nu }})\ge \varepsilon \) implies with \(\varrho _{\min }{:}{=}\min _j\varrho _j\) that \(\varrho _\mathrm{min}^{-|{\varvec{\nu }}|}\ge \gamma ({\varvec{\varrho }},{\varvec{\nu }})\ge \varepsilon \). Thus, if \({\varvec{\nu }}\in A_\varepsilon \) and \(\gamma ({\varvec{\varrho }},{\varvec{\nu }}-{\varvec{e}}_j)\ge \varepsilon \) then \(|{\varvec{\nu }}|\le -\frac{\log (\varepsilon )}{\log (\varrho _{\mathrm{min}})}+1\). Hence, \(w_{\varvec{\nu }}\le (3-\frac{2\log (\varepsilon )}{\log (\varrho _\mathrm{min})})^{k(\frac{1}{2}+\theta +2m)}\) for all \({\varvec{\nu }}\in A_\varepsilon \). Using
which is true because \(1+2(\nu _j+\mu _j)\le (1+2\nu _j)(1+2\mu _j)\), as well as (C.1), (C.8) and (C.9)
Now \(|A_\varepsilon |\le k|\Lambda _{k,\varepsilon }|\), due to the fact that for every \({\varvec{\nu }}\in A_\varepsilon \) there exists \(j\in \{1,\dots ,k\}\) such that \({\varvec{\nu }}-{\varvec{e}}_j\in \Lambda _{k,\varepsilon }\). Finally, by (C.2), \(|A_\varepsilon |\le k|\Lambda _{k,\varepsilon }|\le k( 1 - \frac{\log (\varepsilon )}{\log (\varrho _{\mathrm{min}})})^k\), and together with (C.4) we obtain (C.6). \(\square \)
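The counting step just used, \(|A_\varepsilon |\le k|\Lambda _{k,\varepsilon }|\le k( 1 - \frac{\log (\varepsilon )}{\log (\varrho _{\mathrm{min}})})^k\), rests on the simple bound from the discussion preceding Lemma C.2. The sketch below (our own illustration) enumerates the superset \(\{{\varvec{\nu }}\,:\,\varrho _{\mathrm{min}}^{-|{\varvec{\nu }}|}\ge \varepsilon \}\supseteq \Lambda _{k,\varepsilon }\) and checks its cardinality against \((1-\frac{\log (\varepsilon )}{\log (\varrho _{\mathrm{min}})})^k\).

```python
from itertools import product
from math import log, floor

# Superset of Lambda_{k,eps}: {nu in N_0^k : rho_min^{-|nu|} >= eps} = {nu : |nu| <= a},
# a = -log(eps)/log(rho_min); its cardinality is at most (1 + a)^k.
k = 3
rho_min = 1.5
for eps in [1e-1, 1e-2, 1e-4]:
    a = -log(eps) / log(rho_min)
    n = floor(a)
    count = sum(1 for nu in product(range(n + 1), repeat=k) if sum(nu) <= n)
    print(f"eps = {eps:.0e}:  #{{|nu| <= a}} = {count:5d}   (1+a)^k = {(1 + a) ** k:9.1f}")
    assert count <= (1 + a) ** k
```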
Proof of Proposition 4.2
The function \(x\mapsto \frac{2x}{x-1}\) is monotonically decreasing for \(x>1\). Hence, with \(\varrho _{\min }{:}{=}\min _{j\in \{1,\dots ,k\}}\varrho _j\) we can replace the term \(\prod _{j\in {{\,\mathrm{supp}\,}}{\varvec{\nu }}}\frac{2\varrho _j}{\varrho _j-1}\) occurring in Lemma 4.1 by \(\prod _{j\in {{\,\mathrm{supp}\,}}{\varvec{\nu }}} 2^{\theta -1}\), where \(\theta {:}{=}1+\log _2(\frac{2\varrho _{\mathrm{min}}}{\varrho _\mathrm{min}-1})\), i.e., \(\frac{2\varrho _j}{\varrho _j-1} \le \frac{2\varrho _{\mathrm{min}}}{\varrho _{\mathrm{min}}-1}=2^{\theta -1}\) for each j. Using \(1+\nu _j\ge 2\) whenever \(j\in {{\,\mathrm{supp}\,}}{\varvec{\nu }}\), (4.1) thus implies
Using the bounds on the Legendre coefficients \(l_{\varvec{\nu }}\) of the function f from Lemma 4.1 we then get
Thus, by Lemma C.3
Let \(\tau \in (0,1)\) such that \({\tilde{\beta }}=\tau \beta <\beta \). Absorbing the logarithmic \(\varepsilon \) term, we get
for some C depending on \(\varrho _{\mathrm{min}}\), k, \(\tau \) and m.
Next, (C.3) gives \((k!|\Lambda _{k,\varepsilon }|\prod _{j=1}^k\log (\varrho _j))^{1/k}-\sum _{j=1}^k\log (\varrho _j)\le -\log (\varepsilon )\), and thus,
Plugging this into (C.11) yields (4.6). \(\square \)
C.3 Theorem 4.5
In the proof, we will need the following elementary lemma.
Lemma C.4
Let \(m\in {{\mathbb {N}}}_0\), \(k\in {{\mathbb {N}}}\). There exists \(C=C(k,m)\) such that
-
(i)
\(\Vert fg \Vert _{W^{m,\infty }([-1,1]^k)}\le C \Vert f \Vert _{W^{m,\infty }([-1,1]^k)} \Vert g \Vert _{W^{m,\infty }([-1,1]^k)}\), for all f, \(g\in {W^{m,\infty }([-1,1]^k)}\),
-
(ii)
for all f, \(g\in {W^{m,\infty }([-1,1]^k)}\)
$$\begin{aligned} \Vert f^2-g^2 \Vert _{W^{m,\infty }([-1,1]^k)} \le C (\Vert f \Vert _{W^{m,\infty }([-1,1]^k)}+\Vert g \Vert _{W^{m,\infty }([-1,1]^k)})\Vert f-g \Vert _{W^{m,\infty }([-1,1]^k)} \end{aligned}$$ -
(iii)
if \(\Vert 1-f \Vert _{L^{\infty }([-1,1]^k)}\le \frac{1}{2}\) then
$$\begin{aligned} \left\| 1-\frac{1}{f} \right\| _{W^{m,\infty }([-1,1]^k)}\le C \Vert 1-f \Vert _{W^{m,\infty }([-1,1]^k)} \left( 1+\Vert f \Vert _{W^{m,\infty }([-1,1]^k)}\right) ^{\max \{0,m-1\}}. \end{aligned}$$
In case \(m=0\), the constant \(C=C(k,0)\) is independent of k.
Proof
Item (i) is a consequence of the (multivariate) Leibniz rule for weak derivatives, i.e., \(\partial _{{{\varvec{x}}}}^{{\varvec{\nu }}} (fg) = \sum _{{\varvec{\mu }}\le {\varvec{\nu }}} \left( {\begin{array}{c}{\varvec{\nu }}\\ {\varvec{\mu }}\end{array}}\right) \partial _{{{\varvec{x}}}}^{{\varvec{\mu }}}f \partial _{{{\varvec{x}}}}^{{\varvec{\nu }}-{\varvec{\mu }}}g\) for all multi-indices \({\varvec{\nu }}\in {{\mathbb {N}}}_0^k\). Item (ii) follows by \(f^2-g^2=(f-g)(f+g)\) and (i).
Next let us show (iii). Due to \(\Vert 1-f \Vert _{L^\infty ([-1,1]^k)}\le \frac{1}{2}\) it holds \({{\,\mathrm{ess\,inf}\,}}_{{{\varvec{x}}}\in [-1,1]^k}f({{\varvec{x}}})\ge \frac{1}{2}\). Thus, \(\Vert 1-\frac{1}{f} \Vert _{L^\infty ([-1,1]^k)} =\Vert \frac{1-f}{f} \Vert _{L^\infty ([-1,1]^k)} \le 2 \Vert 1-f \Vert _{L^\infty ([-1,1]^k)}\), which proves the statement in case \(m=0\).
For general \(m\in {{\mathbb {N}}}\), we claim that for any \({\varvec{\nu }}\in {{\mathbb {N}}}_0^k\) such that \(1\le |{\varvec{\nu }}|\le m\) it holds \(\partial _{{{\varvec{x}}}}^{\varvec{\nu }}(\frac{1}{f})=\frac{p_{\varvec{\nu }}}{f^{|{\varvec{\nu }}|+1}}\) for some \(p_{\varvec{\nu }}\) satisfying
for a constant C depending on \({\varvec{\nu }}\) but independent of f. For \(|{\varvec{\nu }}|=1\), i.e., \({\varvec{\nu }}={\varvec{e}}_j=(\delta _{ij})_{i=1}^k\) for some j, this holds by \(\partial _{{\varvec{x}}}^{{\varvec{e}}_j}\frac{1}{f}= \partial _{j}\frac{1}{f}=\frac{-\partial _{j} f}{f^2}\) with \(p_{{\varvec{e}}_j}=-\partial _{j} f\) where \(\Vert p_{{\varvec{e}}_j} \Vert _{W^{m-1,\infty }([-1,1]^k)} =\Vert \partial _{j} f \Vert _{W^{m-1,\infty }([-1,1]^k)} \le \Vert 1-f \Vert _{W^{m,\infty }([-1,1]^k)}\). For the induction step,
Then
by the induction hypothesis (C.12).
Hence, for \(1\le |{\varvec{\nu }}|\le m\), due to \(\Vert 1-f \Vert _{L^\infty ({[-1,1]^k})}\le \frac{1}{2}\) which implies \({{\,\mathrm{ess\,inf}\,}}_{{{\varvec{x}}}\in [-1,1]^k} f({{\varvec{x}}})^{|{\varvec{\nu }}|+1}\ge 2^{-|{\varvec{\nu }}|-1}\),
which shows (iii). \(\square \)
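The \(m=0\) case of Lemma C.4 (iii) is elementary and can be spot-checked directly; the following snippet (our illustration, random samples) verifies \(\Vert 1-\frac{1}{f} \Vert _{L^\infty }\le 2\Vert 1-f \Vert _{L^\infty }\) whenever \(\Vert 1-f \Vert _{L^\infty }\le \frac{1}{2}\).

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    f = 1.0 + 0.5 * (2.0 * rng.random(1000) - 1.0)   # sampled values with ||1 - f||_inf <= 1/2
    lhs = np.max(np.abs(1.0 - 1.0 / f))
    rhs = 2.0 * np.max(np.abs(1.0 - f))
    assert lhs <= rhs + 1e-14
    print(f"||1 - 1/f||_inf = {lhs:.4f}   <=   2 ||1 - f||_inf = {rhs:.4f}")
```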
Lemma C.5
Let \(m\in {{\mathbb {N}}}_0\), \(k\in {{\mathbb {N}}}\), \(k\ge 2\) and \(T:[-1,1]^k\rightarrow [-1,1]\) such that \(T\in W^{m,\infty }([-1,1]^k)\), \(\sqrt{\partial _{k}T}\in C^0\cap W^{m,\infty }([-1,1]^k)\), and \(T({{\varvec{x}}}_{[k-1]},\cdot ):[-1,1]\rightarrow [-1,1]\) is an increasing bijection for every \({{\varvec{x}}}_{[k-1]}\in [-1,1]^{k-1}\). Set \(Q{:}{=}\sqrt{\partial _{k}T}-1\in C^0\cap W^{m,\infty }([-1,1]^k)\).
There exist constants \(K\in (0,1]\) (independent of m and k) and \(C=C(k,m)>0\), both independent of Q, with the following property: If \(p\in W^{m,\infty }([-1,1]^k;{{\mathbb {R}}})\) satisfies
then with
it holds
and
In case \(m=0\), C(k, 0) is independent of k.
Proof
Throughout this proof, the constant \(C>0\) (which may change its value even within the same equation) will depend on m, k but be independent of Q. Moreover, it will only depend on k through the constants from Lemma C.4, and thus be independent of k in case \(m=0\). In the following, we use several times the fact that
Set \(\varepsilon {:}{=}\Vert {Q}-p \Vert _{W^{m,\infty }({[-1,1]^k})}\le 1\). Using Lemma C.4 (ii), we get
where we used the triangle inequality and \(\varepsilon \le 1\), which holds by assumption.
In addition to \({\tilde{T}}\) in (C.14), let
Since \(T({{\varvec{x}}})=-1+\int _{-1}^{x_k}\partial _{k} T({{\varvec{x}}}_{[k-1]},t)\;\mathrm {d}t\), we get with (C.17) and (C.18)
Fix \({{\varvec{x}}}_{[k-1]}\in [-1,1]^{k-1}\). Then
Moreover, since \(T({{\varvec{x}}}_{[k-1]},\cdot ):[-1,1]\rightarrow [-1,1]\) is a monotonically increasing bijection,
Thus, using that T, \({\hat{T}}\) are Lipschitz continuous, and restrictions to \([-1,1]^{k-1}\times \{y\}\), \(y\in \{-1,1\}\), are well defined,
for some \({\tilde{C}}={\tilde{C}}(k,m)>0\), which is independent of k in case \(m=0\). Additionally by (C.20)
Define \(K{:}{=}\min \{1,\frac{1}{{\tilde{C}}(k,0)}\}\) with \({\tilde{C}}(k,0)\) from (C.20) (i.e., the constant for \(m=0\), so that K does not depend on k). By assumption \(\varepsilon \le \frac{K}{1+\Vert {Q} \Vert _{L^{\infty }({[-1,1]^{k}})}}\), which implies \(\Vert 1-\frac{c_k}{2} \Vert _{L^\infty ({[-1,1]^k})}\le \frac{1}{2}\) by (C.20) (for \(m=0\)). This allows us to apply Lemma C.4 (iii), which together with (C.20) and (C.21) gives
Since \({\tilde{T}}=-1+\frac{2}{c_k}({\hat{T}}+1)\), (C.19) and (C.22) yield
for some constant C depending on m and k. Due to \(T({{\varvec{x}}})=-1+\int _{-1}^{x_k}({Q}({{\varvec{x}}}_{[k-1]},t)+1)^2\;\mathrm {d}t\), by (C.17)
by Lemma C.4 (i). In all, this shows (C.15).
To show (C.16), we proceed similarly and obtain via (C.18)
where we used Lemma C.4 (i) to bound \(\Vert (1+p)^2 \Vert _{W^{m,\infty }({[-1,1]^k})}\). Since \(\Vert p-{Q} \Vert _{W^{m,\infty }({[-1,1]^k})}\le 1\)
and thus, using (C.22)
\(\square \)
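The correction mechanism of Lemma C.5 can be pictured in a single variable. Reading (C.14) from the relations used in the proof — \({\hat{T}}=-1+\int _{-1}^{x_k}(1+p)^2\,\mathrm {d}t\), \({\tilde{T}}=-1+\frac{2}{c_k}({\hat{T}}+1)\), where the normalization \(c_k={\hat{T}}(\cdot ,1)+1\) is our reading of (C.20) — the sketch below (toy data of our own) shows that \({\tilde{T}}\) is always an increasing bijection of \([-1,1]\) onto itself and stays close to T when p is close to Q.

```python
import numpy as np

# Sketch of the rescaling behind (C.14), last variable only (our reading of the proof):
# given p approximating Q = sqrt(T') - 1, set hatT(x) = -1 + int_{-1}^{x} (1 + p(t))^2 dt,
# c = hatT(1) + 1, tildeT = -1 + 2*(hatT + 1)/c.  tildeT maps [-1,1] onto [-1,1] monotonically.
T = lambda x: x + 0.25 * (x**2 - 1.0)              # toy increasing bijection of [-1,1] (our choice)
Q = lambda x: np.sqrt(1.0 + 0.5 * x) - 1.0         # Q = sqrt(T') - 1
p = lambda x: Q(x) + 0.05 * np.cos(3.0 * x)        # perturbed approximation of Q

x = np.linspace(-1.0, 1.0, 2001)
integrand = (1.0 + p(x)) ** 2
hatT = -1.0 + np.concatenate(
    ([0.0], np.cumsum((integrand[1:] + integrand[:-1]) / 2.0 * np.diff(x)))   # trapezoid rule
)
c = hatT[-1] + 1.0
tildeT = -1.0 + 2.0 * (hatT + 1.0) / c

print("tildeT(-1), tildeT(1) =", tildeT[0], tildeT[-1])        # exactly -1 and 1
print("monotone:", bool(np.all(np.diff(tildeT) > 0)))
print("sup |T - tildeT| =", np.max(np.abs(T(x) - tildeT)))     # small, controlled by ||Q - p||
```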
Proof of Theorem 4.5
Without loss of generality, we assume in the following that \(m\ge 1\), since the statement for \(m\ge 1\) trivially implies the statement for \(m=0\). Throughout we omit the index \(\varepsilon >0\) and write \({\tilde{T}}_{k}{:}{=}\tilde{T}_{k,\varepsilon }\) etc.
Step 1. Fix \(k\in \{1,\dots ,d\}\) and define \(Q_k({{\varvec{x}}}){:}{=}\sqrt{\partial _{k}T_k}-1\). We now construct \(p_k\) such that \(\Vert Q_k-p_k \Vert _{W^{m,\infty }([-1,1]^k)}\) is small.
According to Theorem 3.6 with \({\varvec{\zeta }}=(C_7\delta _j)_{j=1}^d\), it holds
and if \(k\ge 2\)
We have \(\Re (\partial _{k}T_k({{\varvec{x}}})) > 0\) for all \({{\varvec{x}}}\in {{\mathcal {B}}}_{{\varvec{\zeta }}_{[k]}}\) by Theorem 3.6 (i), so that (C.23) implies
If \(\delta _k<1\), then \(\frac{1+C_7}{1+C_7\delta _k}\ge 1\) and by (C.25)
Next, we consider the case \(\delta _k\ge 1\), so that \(\max \{1,\delta _k\}=\delta _k\). For any \(z\in {{\mathbb {C}}}\) with \(|z-1|\le \frac{1}{2}\) one checks that \(|\sqrt{z}+1|\ge \frac{1}{2}\). Thus, \(|\sqrt{z}-1|\le 2|\sqrt{z}-1||\sqrt{z}+1|=2|z-1|\). Hence, if \(\frac{C_8}{\max \{1,\delta _k\}}\le \frac{1}{2}\), then by (C.24)
and if \(\frac{C_8}{\max \{1,\delta _k\}}>\frac{1}{2}\), then \(2C_8>\delta _k\) and by (C.25)
In all with
holds
and if \(k\ge 2\)
Therefore, for \(m\in {{\mathbb {N}}}_0\) fixed, Proposition 4.2 gives for \(\varepsilon \in (0,1)\) and with
that with \(\tau :=\frac{{\tilde{\beta }}}{\beta }\)
for a constant \({\tilde{C}}\) depending on r, m, \(\tau \), \({\varvec{\delta }}\) and T but independent of \(\varepsilon \in (0,1)\).
Step 2. Fix \(k\in \{1,\dots ,d\}\). Let the constant \(K>0\) be as in Lemma C.5. We distinguish between two cases, first assuming \({\tilde{C}} \varepsilon ^\tau <\min \{1,\frac{K}{1+\Vert {Q_k} \Vert _{L^\infty ({[-1,1]^k})}}\}\). Then (C.26) and Lemma C.5 imply
for a constant C independent of \(\varepsilon \).
In the second case, where \({\tilde{C}} \varepsilon ^\tau \ge \min \{1,\frac{K}{1+\Vert {Q_k} \Vert _{L^\infty ({[-1,1]^k})}}\}\), we simply redefine \(p_k{:}{=}0\), so that \({\tilde{T}}_k({{\varvec{x}}})=x_k\) (cp. (4.10)). Then
with \(C>0\) depending on K, \({\tilde{C}}\), \(\Vert T_k \Vert _{W^{m,\infty }([-1,1]^k)}\) and \(\Vert {Q_k} \Vert _{W^{m,\infty }([-1,1]^k)}\) but not on \(\varepsilon \).
Step 3. We estimate the error in terms of \(N_\varepsilon \). By Lemma C.2
which implies
Together with (C.27) this shows
\(\square \)
D Proofs of Sect. 5
D.1 Theorem 5.1
Proof of Theorem 5.1
Fix \(k\in \{1,\dots ,d\}\). By [49, Thm. 3.6], there exists \(\beta >0\) such that for any \(N\in {{\mathbb {N}}}\) there exists a ReLU NN \({\tilde{\Phi }}_{N,k}:[-1,1]^k\rightarrow {{\mathbb {R}}}\) such that \(\Vert T_k-{\tilde{\Phi }}_{N,k} \Vert _{W^{1,\infty }([-1,1]^k)}\le C \exp (-\beta N^{1/(k+1)})\) and \(\mathrm{size}({\tilde{\Phi }}_{N,k})\le N\), \(\mathrm{depth}({\tilde{\Phi }}_{N,k})\le C\log (N) N^{1/(k+1)}\).
We use Lemma 5.3 to correct \({\tilde{\Phi }}_{N,k}\) and guarantee that the resulting triangular map is a bijection of \([-1,1]^d\) onto itself. For \(k\ge 2\) define \(f_{-1}({{\varvec{x}}}){:}{=}-1-{\tilde{\Phi }}_{N,k}({{\varvec{x}}},-1)\) and \(f_{1}({{\varvec{x}}}){:}{=}1-{\tilde{\Phi }}_{N,k}({{\varvec{x}}},1)\) for \({{\varvec{x}}}\in [-1,1]^{k-1}\). In case \(k=1\) set \(f_{-1}{:}{=}-1-{\tilde{\Phi }}_{N,1}(-1)\in {{\mathbb {R}}}\) and \(f_{1}{:}{=}1-{\tilde{\Phi }}_{N,1}(1)\in {{\mathbb {R}}}\). With the notation from Lemma 5.3 for \({{\varvec{x}}}\in [-1,1]^{k-1}\) and \(x_k\in [-1,1]\) set
Then by Lemma 5.3 for any \({{\varvec{x}}}\in [-1,1]^{k-1}\)
Similarly \(\Phi _{N,k}({{\varvec{x}}},-1)=-1\). Clearly \(\Phi _{N,k}\) is a ReLU NN, and with Lemma 5.3
for some suitable constant C independent of \(N\in {{\mathbb {N}}}\). Similarly \(\mathrm{depth}(\Phi _{N,k})\le C N^{1/(1+k)}\log (N)\).
By Lemma 5.3, \(\Vert g_{f_1} \Vert _{W^{1,\infty }([-1,1]^k)}\le C \Vert f_1 \Vert _{W^{1,\infty }([-1,1]^k)}\) and the same inequality holds for \(f_{-1}\). Hence,
The last term is bounded by \(C_0\exp (-\beta N^{1/(k+1)})\) by definition of \({\tilde{\Phi }}_{N,k}\), where \(C_0\) and \(\beta \) are independent of \(N\in {{\mathbb {N}}}\) (but depend on d). Choosing N large enough, it holds
Then for every \({{\varvec{x}}}\in [-1,1]^{k-1}\), \(\Phi _{N,k}({{\varvec{x}}},\cdot ):[-1,1]\rightarrow [-1,1]\) is continuous, monotonically increasing and satisfies \(\Phi _{N,k}({{\varvec{x}}},-1)=-1\) and \(\Phi _{N,k}({{\varvec{x}}},1)=1\), as shown above.
In all, \(\Phi _N=(\Phi _{N,k})_{k=1}^d:[-1,1]^d\rightarrow [-1,1]^d\) is monotone, triangular and satisfies the claimed error bound \(\Vert \Phi _N-T \Vert _{W^{1,\infty }([-1,1]^d)}\le C\exp (-\beta N^{1/(d+1)})\) for some \(C>0\), \(\beta >0\) and all \(N\in {{\mathbb {N}}}\). Moreover, \(\mathrm{size}(\Phi _N)\le \sum _{k=1}^d \mathrm{size}(\Phi _{N,k}) \le C N\) for a constant C depending on d. Finally \(\mathrm{depth}(\Phi _N)\le \max _{k\in \{1,\dots ,d\}}\mathrm{depth}(\Phi _{N,k}) \le C N^{1/2}\log (N)\). To guarantee that all \(\Phi _{N,k}\), \(k=1,\dots ,d\) have the same depth \(CN^{1/2}\log (N)\), we can concatenate \(\Phi _{N,k}\) (at most \(CN^{1/2}\log (N)\) times) with the identity network \(x=\varphi (x)-\varphi (-x)\). This does not change the size and error bounds. \(\square \)
D.2 Lemma 5.3
Proof of Lemma 5.3
Set \(a{:}{=}\min _{{{\varvec{x}}}\in [-1,1]^d}f({{\varvec{x}}})\) and \(b{:}{=}\max _{{{\varvec{x}}}\in [-1,1]^d}f({{\varvec{x}}})-a\). Define
Using the identity network \(t=\varphi (t)-\varphi (-t)\), we may carry the value of t from layer 0 to layer \(L=\mathrm{depth}(f)\) with a network of size \(C\mathrm{depth}(f)\le C\mathrm{size}(f)\). This implies that \(g_f\) can be written as a ReLU NN satisfying the claimed size and depth bounds.
The definition of a and b implies \(0\le \frac{f({{\varvec{x}}})-a}{b}\le 1\) for all \({{\varvec{x}}}\in [-1,1]^d\). Hence, \(g_f({{\varvec{x}}},1)=f({{\varvec{x}}})\) and \(g_f({{\varvec{x}}},s)=0\) for all \(s\in [-1,0]\). Thus, \(\frac{d}{dt}g_f({{\varvec{x}}},t)=0\) for \(t\le 0\) and either \(\frac{d}{dt}g_f({{\varvec{x}}},t)=a\) or \(\frac{d}{dt}g_f({{\varvec{x}}},t)=b+a\) for \(t\in (0,1]\). Now \(|a|\le \max _{{{\varvec{y}}}\in [-1,1]^d}|f({{\varvec{y}}})|\) and \(|b+a|\le \max _{{{\varvec{y}}}\in [-1,1]^d}|f({{\varvec{y}}})|\) imply the bound on \(|\frac{d}{dt}g_f({{\varvec{x}}},t)|\). Next fix \(t\in [-1,1]\). At all points where \(\nabla _{{\varvec{x}}}g_f({{\varvec{x}}},t)\) is well defined, it either holds \(\nabla _{{\varvec{x}}}g_f({{\varvec{x}}},t)=\nabla f({{\varvec{x}}})\) or \(\nabla _{{\varvec{x}}}g_f({{\varvec{x}}},t)=0\), which concludes the proof. \(\square \)
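The identity trick \(x=\varphi (x)-\varphi (-x)\) invoked in the two proofs above (to carry values through layers and to pad networks to a common depth) is simply a two-neuron ReLU block realizing the identity; composing it any number of times leaves the realized function unchanged. A minimal sketch:

```python
import numpy as np

# One hidden layer with weights (1, -1)^T and output weights (1, -1) realizes the identity
# x = relu(x) - relu(-x).  Stacking this block L times pads a network to depth L without
# changing the function it computes.
relu = lambda z: np.maximum(z, 0.0)

def identity_block(x):
    h = relu(np.array([x, -x]))        # hidden layer: phi(x), phi(-x)
    return h[0] - h[1]                 # output layer: phi(x) - phi(-x) = x

x = -0.73
y = x
for _ in range(50):                    # depth padding: 50 identity blocks
    y = identity_block(y)
print(x, y)                            # identical values
```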
E Proofs of Sect. 6
E.1 Lemma 6.1
Proof of Lemma 6.1
First,
which is bounded by the Lipschitz constant of S times \(\Vert T-{\tilde{T}} \Vert _{L^\infty ([-1,1]^d)}\).
For the second bound, first let A, \(B\in {{\mathbb {R}}}^{d\times d}\). Then
Since \(S\circ T\) is the identity we have \(dS({{\varvec{x}}})=dT(S({{\varvec{x}}}))^{-1}\) and similarly \(d{\tilde{S}}({\tilde{T}}({{\varvec{x}}}))=d{\tilde{T}}({{\varvec{x}}})^{-1}\). Thus,
The essential supremum is bounded by
Together with the first statement this concludes the proof. \(\square \)
E.2 Theorem 6.3
We start by bounding \(|\det dS-\det d{\tilde{S}}|\). To this end, we need the following lemma.
Lemma E.1
Let \((a_j)_{j=1}^d\), \((b_j)_{j=1}^d\subseteq (0,\infty )\). Then with \(a_{\mathrm{min}}=\min _{j}a_j\), \(b_{\mathrm{min}}=\min _{j}b_j\) and
holds
Proof
Since \(\log (1+x)\le x\) for all \(x\ge 0\),
For every \(x\in {{\mathbb {R}}}\), \(\exp :(-\infty ,x]\rightarrow {{\mathbb {R}}}\) has Lipschitz constant \(\exp (x)\). Thus, since \(\max \{a_j,b_j\}\le a_j+|a_j-b_j|\),
Set \(a_{\mathrm{min}}=\min _{j\le d}a_j>0\). Using again \(\log (1+x)\le x\) so that
we get
\(\square \)
Let
Then Lemma E.1 implies for any triangular \({\tilde{S}}\in C^1([-1,1]^d;[-1,1]^d)\) (i.e., \(\det d\tilde{S}=\prod _{j=1}^d\partial _{j}{\tilde{S}}_j\))
Thus, we can show the following:
Lemma E.2
Let \(f_{\rho }\), \(f_{\pi }\in C^1([-1,1]^d;{{\mathbb {R}}})\) be two positive probability densities on \([-1,1]^d\) (w.r.t. \(\mu \)), such that \(f_{\rho }\) has Lipschitz constant L. Let \(S:[-1,1]^d\rightarrow [-1,1]^d\) satisfy \(S^\sharp {\rho }={\pi }\), and let \({\tilde{S}}:[-1,1]^d\rightarrow [-1,1]^d\) be triangular and monotone with \(\Vert S-{\tilde{S}} \Vert _{W^{1,\infty }([-1,1]^d)}\le 1\).
Then there exists \(C>0\) solely depending on d, and the positive constants \({\tilde{S}}_{\mathrm{min}}\), \(S_{\mathrm{min}}\), \(S_{\mathrm{max}}\) in (E.4), such that
and
Proof
For \({{\varvec{x}}}\in [-1,1]^d\)
Next, (E.5) yields
Since \(S^\sharp {\rho }={\pi }\), it holds \(f_{\rho }\circ S \det dS=f_{\pi }\), and hence,
Together with (E.8) and (E.9), we obtain (E.6).
For any x, \(y\ge 0\) it holds \(|x-y|=|x^{1/2}-y^{1/2}||x^{1/2}+y^{1/2}|\). Thus, for \({{\varvec{x}}}\in [-1,1]^d\)
which shows (E.7). \(\square \)
Lemma E.3
For all x, \(y>0\) holds \(x|\log (x)-\log (y)|\le (1+\frac{|x-y|}{y})|x-y|\).
Proof
For any \(a>0\) the map \(\log :[a,\infty )\rightarrow {{\mathbb {R}}}\) has Lipschitz constant \(\frac{1}{a}\) so that
If \(x>y\)
If \(x\le y\) then \(\frac{x}{\min \{x,y\}}=1\le 1+\frac{|x-y|}{y}\) so that the claimed bound also holds. \(\square \)
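Lemma E.3 is a purely scalar inequality and can be spot-checked numerically; the following snippet (our illustration) evaluates both sides on random positive pairs.

```python
import numpy as np

# Spot check of Lemma E.3: x * |log(x) - log(y)| <= (1 + |x - y| / y) * |x - y| for x, y > 0.
rng = np.random.default_rng(1)
x = rng.uniform(1e-3, 10.0, 100_000)
y = rng.uniform(1e-3, 10.0, 100_000)
lhs = x * np.abs(np.log(x) - np.log(y))
rhs = (1.0 + np.abs(x - y) / y) * np.abs(x - y)
print("max of lhs - rhs:", np.max(lhs - rhs))      # nonpositive up to rounding
assert np.all(lhs <= rhs * (1.0 + 1e-12) + 1e-12)
```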
Proof of Theorem 6.3
The assumptions on S and T imply \(\det dT({{\varvec{x}}}) = \frac{1}{\det dS(T({{\varvec{x}}}))}<\infty \) for all \({{\varvec{x}}}\in [-1,1]^d\) so that \(S_\mathrm{min}>0\). Denote \({\pi }{:}{=}T_\sharp {\rho }\) with pushforward density \(f_{\pi }({{\varvec{x}}}){:}{=}f_{\rho }(S({{\varvec{x}}}))\det dS({{\varvec{x}}})\). We have
By Lemma 6.1
Since \(f_{\rho }:[-1,1]^d\rightarrow {{\mathbb {R}}}\) is Lipschitz continuous, we may apply Lemma E.2, and obtain together with (E.11) for \(\alpha \in \{\frac{1}{2},1\}\)
for a suitable constant C depending on \(S_{\mathrm{min}}\), \(S_\mathrm{max}\), \({\tilde{S}}_{\mathrm{min}}\), \(\Vert d\tilde{S} \Vert _{L^\infty ([-1,1]^d)}\), \(\Vert f_{\rho } \Vert _{L^\infty ([-1,1]^d)}\), \({{\,\mathrm{ess\,inf}\,}}_{{{\varvec{x}}}\in [-1,1]^d}f_{\rho }({{\varvec{x}}})\), \({{\,\mathrm{ess\,inf}\,}}_{{{\varvec{x}}}\in [-1,1]^d}f_{\pi }({{\varvec{x}}})\), \(\mathrm{Lip}(f_{\rho })\), \(\mathrm{Lip}(S)\) and \(\mathrm{Lip}(dT)\). Hence, C depends on \({\tilde{T}}\) only through \({\tilde{S}}_{\mathrm{min}}\ge \tau _0\) and \(\Vert d{\tilde{S}} \Vert _{L^\infty ([-1,1]^d)}\le \frac{1}{\tau _0}\). This implies (6.4) for the Hellinger and total variation distance.
Next we show (6.5). It holds \(\inf _{{{\varvec{x}}}\in [-1,1]^d}f_{\pi }({{\varvec{x}}})\in (0,1]\) since \(f_{\pi }\) is a positive probability density. We obtain using Lemma E.3
Finally (E.12) with \(\alpha =1\) implies (6.4) for the KL divergence, which concludes the proof. \(\square \)
E.3 Proposition 6.4
Proof of Proposition 6.4
By Theorem 4.5
For the Wasserstein distance, [76, Prop. 6.2] (which is an extension of [58, Theorem 2]; also cp. (6.3)) implies together with (E.13)
As shown in Sect. 3, T and \(S=T^{-1}\) are analytic, and in particular, T, \(S\in W^{2,\infty }([-1,1]^d)\). Moreover, T and \({\tilde{T}}_\varepsilon :[-1,1]^d\rightarrow [-1,1]^d\) are bijective by Theorem 4.5.
We wish to apply Theorem 6.3, which together with (E.13) concludes the proof. To do so, it remains to show that we can find \(\tau _0>0\) such that for all \(\varepsilon >0\) holds (cp. (E.4))
Since \(f_{\rho }\) and \(f_{\pi }\) are uniformly positive and \(f_{\pi }\circ T\det dT=f_{\rho }\), \({{\,\mathrm{ess\,inf}\,}}_{{{\varvec{x}}}\in [-1,1]^d}\det dT({{\varvec{x}}})>0\). By (E.13) we have \(\lim _{\varepsilon \rightarrow 0}{{\,\mathrm{ess\,inf}\,}}_{{{\varvec{x}}}\in [-1,1]^d}\det d{\tilde{T}}_\varepsilon >0\). Together with \(\lim _{\varepsilon \rightarrow 0}\Vert d\tilde{T}_\varepsilon \Vert _{L^\infty ([-1,1]^d)}<\infty \), this implies with \({\tilde{S}}_\varepsilon =\tilde{T}_\varepsilon ^{-1}\) that \(\lim _{\varepsilon \rightarrow 0}\Vert d{\tilde{S}}_\varepsilon \Vert _{L^\infty ([-1,1]^d)}<\infty \). This can be seen by using \(A^{-1}=\frac{1}{\det A} A^\mathrm{adj}\) for a matrix \(A\in {{\mathbb {R}}}^{d\times d}\), where \(A^\mathrm{adj}\) denotes the adjugate matrix, which is equal to the transpose of the cofactor matrix, each entry of which can be bounded in terms of \(\Vert A \Vert \). By a similar argument \(\lim _{\varepsilon \rightarrow 0}{\tilde{S}}_{\varepsilon ,\mathrm{min}}>0\). Hence, (E.14) holds for some \(\tau _0>0\) and all \(\varepsilon \le \varepsilon _0\) for some \(\varepsilon _0>0\). Finally note that \(\varepsilon \in [\varepsilon _0,\infty )\) corresponds to only finitely many choices of sets \((\Lambda _{k,\varepsilon })_{k=1}^d\) in Theorem 4.5. Hence, by decreasing \(\tau _0>0\) if necessary, (E.14) holds for all \(\varepsilon >0\). \(\square \)
F Proofs of Sect. 7
F.1 Lemma 7.1
Lemma F.1
Let \({{\varvec{b}}}=(b_j)_{j=1}^d\subset (0,\infty )\), \(C_0{:}{=}\max \{1,\Vert {{\varvec{b}}} \Vert _{\ell ^\infty }\}\) and \(\gamma >0\). There exist \((\kappa _j)_{j=1}^d\subset (0,1]\) monotonically increasing and \(\tau >0\) (both depending on \(\gamma \), d, \(C_0\) but otherwise independent of \({{\varvec{b}}}\)), such that with \(\delta _j{:}{=}\kappa _j+\frac{\tau }{b_j}\) holds \(\sum _{j=1}^kb_j\delta _j\le \gamma \delta _{k+1}\) for all \(k\in \{1,\dots ,d-1\}\), and \(\sum _{j=1}^d b_j\delta _j\le \gamma \).
Proof
Set \({\tilde{\gamma }}{:}{=}\frac{\min \{\gamma ,1\}}{C_0}\le 1\), \(\kappa _j{:}{=}(\frac{{\tilde{\gamma }}}{4})^{d+1-j}\) and \(\tau {:}{=}(\frac{{\tilde{\gamma }}}{4})^{d+1}\). For \(k\in \{1,\dots ,d\}\)
For \(k<d\) we show \(2\kappa _k\le \frac{{\tilde{\gamma }}}{2}(\kappa _{k+1}+\frac{\tau }{b_{k+1}})=\frac{{\tilde{\gamma }}}{2}\delta _{k+1}\) and \(k\tau \le \frac{{\tilde{\gamma }}}{2}(\kappa _{k+1}+\frac{\tau }{b_{k+1}})=\frac{{\tilde{\gamma }}}{2}\delta _{k+1}\), which then implies \(\sum _{j=1}^k b_j\delta _j\le C_0{\tilde{\gamma }}\delta _{k+1}\le \gamma \delta _{k+1}\). First
Furthermore,
Here we used \(k\alpha ^{d+1}\le \alpha ^{d+1-k}\), which is equivalent to \(k\alpha ^k\le 1\): due to \((x\alpha ^x)'=\alpha ^x(\log (\alpha )x+1)\), this holds in particular for any \(k=1,\dots ,d\) and \(\alpha \in (0,\exp (-1)]\) since then \(\log (\alpha )x+1\le 0\) for all \(x\ge 1\) and \(1\cdot \alpha ^1\le 1\). The claim follows with \(\alpha =\frac{{\tilde{\gamma }}}{4}\le \frac{1}{4}\le \exp (-1)\).
Finally, \(\sum _{j=1}^db_j\delta _j\le \gamma \) follows by (F.1) with \(k=d\): \(2\kappa _d=2\frac{{\tilde{\gamma }}}{4}=\frac{{\tilde{\gamma }}}{2}\) and \(d\tau = d(\frac{{\tilde{\gamma }}}{4})^{d+1}\le \frac{{\tilde{\gamma }}}{4}\) (by the same argument as above), and thus, \(C_02\kappa _d+d\tau \le C_0\frac{{\tilde{\gamma }}}{2}+\frac{{\tilde{\gamma }}}{4}\le C_0\frac{3{\tilde{\gamma }}}{4}\le \gamma \). \(\square \)
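The explicit choice \(\kappa _j=(\frac{{\tilde{\gamma }}}{4})^{d+1-j}\), \(\tau =(\frac{{\tilde{\gamma }}}{4})^{d+1}\) in the proof can be checked numerically for concrete \({{\varvec{b}}}\); the sketch below (a random \({{\varvec{b}}}\) of our choosing) verifies both conclusions of Lemma F.1.

```python
import numpy as np

# Numerical check of the construction in the proof of Lemma F.1:
# gtil = min{gamma,1}/C_0, kappa_j = (gtil/4)^(d+1-j), tau = (gtil/4)^(d+1), delta_j = kappa_j + tau/b_j;
# then sum_{j<=k} b_j*delta_j <= gamma*delta_{k+1} for k < d and sum_{j<=d} b_j*delta_j <= gamma.
rng = np.random.default_rng(2)
d, gamma = 6, 0.3
b = rng.uniform(0.1, 5.0, d)
C0 = max(1.0, np.max(b))
gtil = min(gamma, 1.0) / C0
kappa = (gtil / 4.0) ** (d + 1 - np.arange(1, d + 1))
tau = (gtil / 4.0) ** (d + 1)
delta = kappa + tau / b

for k in range(1, d):
    assert np.sum(b[:k] * delta[:k]) <= gamma * delta[k]   # delta[k] is delta_{k+1} (0-based index)
assert np.sum(b * delta) <= gamma
print("all inequalities of Lemma F.1 satisfied for this b")
```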
Proof of Lemma 7.1
The solution operator \({\mathfrak {u}}\) associated to (7.1) is complex Fréchet differentiable as a mapping from an open subset of \(L^\infty (\mathrm {D};{{\mathbb {C}}})\) containing \(\{a\in L^\infty (\mathrm {D};{{\mathbb {R}}})\,:\,{{\,\mathrm{ess\,inf}\,}}_{x\in \mathrm {D}}a(x)>0\}\) to the complex Banach space \(H_0^1(\mathrm {D};{{\mathbb {C}}})\); see, e.g., [75, Example 1.2.39]. By assumption \(\inf _{{{\varvec{y}}}\in [-1,1]^d}{{\,\mathrm{ess\,inf}\,}}_{x\in \mathrm {D}}a({{\varvec{y}}},x)>0\). Thus, there exists \(r>0\) such that with
the map \({\mathfrak {u}}:S\rightarrow H_0^1(\mathrm {D};{{\mathbb {C}}})\) is complex (Fréchet) differentiable. Then also
(see Sect. 7.2 for the notation) is complex differentiable from \(S\rightarrow {{\mathbb {C}}}\). Here we used that the bounded linear operator \(A:H_0^1(\mathrm {D};{{\mathbb {R}}})\rightarrow {{\mathbb {R}}}\) allows a natural (bounded) linear extension \(A:H_0^1(\mathrm {D};{{\mathbb {C}}})\rightarrow {{\mathbb {C}}}\). Using compactness of \([-1,1]^d\), by choosing \(r>0\) small enough it holds
and \({\mathfrak {f}}_{\pi }:S\rightarrow {{\mathbb {C}}}\) is Lipschitz continuous with some Lipschitz constant \(m>0\).
With \(C_6=C_6(M,L)\) as in Theorem 3.6, set
Let \(b_j{:}{=}\Vert \psi _j \Vert _{L^\infty (\mathrm {D})}\) and \(\delta _j=\kappa _j+\frac{\tau }{b_j}\) as in Lemma F.1.
Fix \({{\varvec{y}}}\in [-1,1]^d\) and \({{\varvec{z}}}\in {{\mathcal {B}}}_{{\varvec{\delta }}}\). Then by Lemma F.1
and similarly for \(k=1,\dots ,d-1\)
We verify Assumption 3.5 for \({{\varvec{y}}}\mapsto f_{\pi }({{\varvec{y}}})={\mathfrak {f}}_{\pi }(\sum _{j=1}^d y_j\psi _j)\):
-
(a)
By definition \(f_{\pi }\) is a positive probability density on \([-1,1]^d\). As mentioned \({\mathfrak {f}}_{\pi }:S\rightarrow {{\mathbb {C}}}\) is differentiable, and (F.4) shows \(\Vert \sum _{j=1}^dz_j\psi _j \Vert _{L^\infty (\mathrm {D})}\le r\) for all \({{\varvec{z}}}\in {{\mathcal {B}}}_{\varvec{\delta }}\), so that \(\sum _{j=1}^d(y_j+z_j)\psi _j\in S\) for all \({{\varvec{y}}}\in [-1,1]^d\). Hence, \(f_{\pi }({{\varvec{x}}})={\mathfrak {f}}_{\pi }(\sum _{j=1}^dx_j\psi _j)\) is differentiable for \({{\varvec{x}}}\in {{\mathcal {B}}}_{\varvec{\delta }}([-1,1])\).
-
(b)
\(M\le |f_{\pi }({{\varvec{y}}}+{{\varvec{z}}})|\le L\) for all \({{\varvec{y}}}\in [-1,1]^d\) and \({{\varvec{z}}}\in {{\mathcal {B}}}_{\varvec{\delta }}\) according to (F.3), (F.2) and (F.4).
-
(c)
\({\mathfrak {f}}_{\pi }:S\rightarrow {{\mathbb {C}}}\) has Lipschitz constant m. By (F.4) for any \({{\varvec{y}}}\in [-1,1]^d\)
$$\begin{aligned} \sup _{{{\varvec{z}}}\in {{\mathcal {B}}}_{\varvec{\delta }}}|f_{\pi }({{\varvec{y}}}+{{\varvec{z}}})-f_{\pi }({{\varvec{y}}})|= \sup _{{{\varvec{z}}}\in {{\mathcal {B}}}_{\varvec{\delta }}}\left| {\mathfrak {f}}_{\pi }\left( \sum _{j=1}^d(y_j+z_j)\psi _j\right) -{\mathfrak {f}}_{\pi }\left( \sum _{j=1}^dy_j\psi _j\right) \right| \le m\frac{C_6}{m}=C_6. \end{aligned}$$ -
(d)
Similarly, by (F.5) for any \({{\varvec{y}}}\in [-1,1]^d\) and any \(k\in \{1,\dots ,d-1\}\)
$$\begin{aligned} \sup _{{{\varvec{z}}}\in {{\mathcal {B}}}_{{\varvec{\delta }}_{[k]}}\times \{0\}^{d-k}}|f_{\pi }({{\varvec{y}}}+{{\varvec{z}}})-f_{\pi }({{\varvec{y}}})| \le m\frac{C_6}{m}\delta _{k+1}=C_6\delta _{k+1}. \end{aligned}$$
\(\square \)
Keywords
- Transport maps
- Domains of holomorphy
- Uncertainty quantification
- Sparse approximation
- Neural networks
- Sampling
Mathematics Subject Classification
- 32D05
- 41A10
- 41A25
- 41A46
- 62D99
- 65D15