Sparse Approximation of Triangular Transports, Part I: The Finite-Dimensional Case

For two probability measures $\rho$ and $\pi$ with analytic densities on the $d$-dimensional cube $[-1,1]^d$, we investigate the approximation of the unique triangular monotone Knothe–Rosenblatt transport $T:[-1,1]^d\rightarrow [-1,1]^d$, such that the pushforward $T_\sharp\rho$ equals $\pi$.
It is shown that for $d\in\mathbb{N}$ there exist approximations $\tilde{T}$ of $T$, based on either sparse polynomial expansions or deep ReLU neural networks, such that the distance between $\tilde{T}_\sharp\rho$ and $\pi$ decreases exponentially.
More precisely, we prove error bounds of the type $\exp(-\beta N^{1/d})$ (or $\exp(-\beta N^{1/(d+1)})$ for neural networks), where $N$ refers to the dimension of the ansatz space (or the size of the network) containing $\tilde{T}$; the notion of distance comprises the Hellinger distance, the total variation distance, the Wasserstein distance and the Kullback–Leibler divergence. Our construction guarantees $\tilde{T}$ to be a monotone triangular bijective transport on the hypercube $[-1,1]^d$.
Analogous results hold for the inverse transport $S=T^{-1}$. The proofs are constructive, and we give an explicit a priori description of the ansatz space, which can be used for numerical implementations.

Normalizing flows, now widely used in the machine learning literature [36,50,54] for variational inference, generative modeling, and density estimation, are another instance of this transport framework. In particular, many so-called autoregressive flows, e.g., [34,51], employ specific neural network parametrizations of triangular maps. A complete mathematical convergence analysis of these methods is not yet available in the literature.
Sampling methods can be contrasted with deterministic approaches, where a quadrature rule is constructed that converges at a guaranteed (deterministic) rate for all integrands in some function class. Unlike sampling methods, deterministic quadratures can achieve higher-order convergence. They even overcome the curse of dimensionality, presuming certain smoothness properties of the integrand. We refer to sparse-grid quadratures [10,13,27,30,61,77] and quasi-Monte Carlo quadrature [9,20,60] as examples. It is difficult to construct deterministic quadrature rules for an arbitrary measure $\pi$, however, so typically they are only available in specific cases such as for the Lebesgue or Gaussian measure. Interpreting $\int_{[-1,1]^d} g(t)\,\mathrm{d}\pi(t)$ as the integral $\int_{[-1,1]^d} g(t) f_\pi(t)\,\mathrm{d}t$ w.r.t. the Lebesgue measure (here $f_\pi$ is again the Lebesgue density of $\pi$), such methods are still applicable. In Bayesian inference, however, $\pi$ can be strongly concentrated, corresponding to either small noise in the observations or a large data set. Then this viewpoint may become problematic. For example, the error of Monte Carlo quadrature depends on the variance of the integrand. The variance of $g f_\pi$ (w.r.t. the Lebesgue measure) can be much larger than the variance of $g$ (w.r.t. $\pi$) when $\pi$ is strongly concentrated, i.e., when $f_\pi$ is very "peaky." This problem was addressed in [62] by combining an adaptive sparse-grid quadrature with a linear transport map. This approach combines the advantage of high-order convergence with quadrature points that are mostly within the area where $\pi$ is concentrated. Yet if $\pi$ is multimodal (i.e., concentrated in several separated areas) or unimodal but strongly non-Gaussian, then the linearity of the transport precludes such a strategy from being successful. A similar statement can be made about the method analyzed in [63], where the (Gaussian) Laplace approximation is used in place of a strongly concentrated posterior measure.
For such π , the combination of nonlinear transport maps with deterministic quadrature rules may lead to significantly improved algorithms.
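To illustrate how a nonlinear transport can be combined with a deterministic quadrature rule, the following minimal sketch pushes Gauss-Legendre nodes through a one-dimensional monotone map. It relies only on the defining identity $\int g\,\mathrm{d}\pi = \int g\circ T\,\mathrm{d}\rho$ of the pushforward. The target density $(1+ay)/2$, the parameter `A`, and the function names are assumptions made for this example; the closed form of `transport` was derived for this particular pair of measures and is not taken from the text.

```python
import numpy as np

A = 0.5  # hypothetical tilt parameter with |A| < 1; target density f_pi(y) = (1 + A*y)/2


def transport(x, a=A):
    """Monotone map T = F_pi^{-1} o F_rho for uniform rho on [-1, 1]
    and the (assumed) target density f_pi(y) = (1 + a*y)/2."""
    return (-1.0 + np.sqrt(1.0 + a * a + 2.0 * a * x)) / a


def transported_quadrature(g, n, a=A):
    """Approximate the integral of g against pi using n Gauss-Legendre
    nodes that are mapped through the transport."""
    nodes, weights = np.polynomial.legendre.leggauss(n)
    # rho has density 1/2 on [-1, 1], hence the factor 0.5 on the weights.
    return np.sum(0.5 * weights * g(transport(nodes, a)))


estimate = transported_quadrature(lambda y: y, n=20)
exact = A / 3.0  # mean of pi: integral of y * (1 + A*y)/2 over [-1, 1]
print(abs(estimate - exact))
```

Because the quadrature is applied to $g\circ T$ with respect to the reference measure, the nodes $T(x_i)$ automatically concentrate where the target has most of its mass.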
In a related spirit, we also mention interacting particle systems such as kernel-based Stein variational gradient descent (SVGD) and its variants, which have recently emerged as a promising research direction [19,41]. Put simply, these methods try to find $n$ points $(x_i)_{i=1}^n$ minimizing an approximation of the KL divergence between the target $\pi$ and the measure represented by the sample $(x_i)_{i=1}^n$. A discrete convergence analysis is not yet available, but a connection between the mean field limit and gradient flow has been established [23,40,42].
In this paper, we analyze the approximation of the transport $T$ satisfying $T_\sharp\rho = \pi$ under the assumption that the reference and target densities $f_\rho$ and $f_\pi$ are analytic. This assumption is quite strong, but satisfied in many applications, including the main application we have in mind, which is Bayesian inference in partial differential equations (PDEs). The reference $\rho$ can be chosen at will, e.g., as a uniform measure so that $f_\rho$ is constant (thus analytic) on $[-1,1]^d$. Here the target density $f_\pi$ is a posterior density, stemming from a PDE-driven likelihood function. For certain linear and nonlinear PDEs (for example, the Navier-Stokes equations), it can be shown under suitable conditions that the corresponding posterior density is indeed an analytic function of the parameters; we refer, for instance, to [13,64].
As outlined above, $T$ can be employed in the construction of either sampling-based or deterministic quadrature methods. Understanding the approximation properties of $T$ is the first step toward a rigorous convergence analysis of such algorithms. In practice, once a suitable ansatz space has been identified, a (usually non-convex) optimization problem must be solved to find a suitable $\tilde{T}$ within the ansatz space. While this optimization is beyond the scope of the current paper, we intend to empirically analyze a numerical algorithm based on the present results in a forthcoming publication.
Throughout we consider transports on $[-1,1]^d$ with $d\in\mathbb{N}$. It is straightforward to generalize all the presented results to arbitrary Cartesian products $\times_{j=1}^d [a_j,b_j]\subset\mathbb{R}^d$ with $-\infty<a_j<b_j<\infty$ for all $j$, via an affine transformation of all occurring functions. Most (theoretical and empirical) earlier works on this topic have, however, assumed measures supported on all of $\mathbb{R}^d$. A similar analysis in the unbounded case, as well as numerical experiments and the development and improvement of algorithms in this case, will be the topics of future research.

Contributions
For $d\in\mathbb{N}$ and under the assumption that the reference and target densities $f_\rho, f_\pi:[-1,1]^d\to(0,\infty)$ are analytic, we prove that there exist sparse polynomial spaces of dimension $N\in\mathbb{N}$, in which the best approximation of the KR transport $T$ converges to $T$ at the exponential rate $\exp(-\beta N^{1/d})$ for some $\beta>0$ as $N\to\infty$; see Theorem 4.3. To guarantee that the approximation $\tilde{T}:[-1,1]^d\to[-1,1]^d$ is bijective, we propose to use an ansatz of rational functions, which ensures this property and retains the same convergence rate. In this case, $N$ refers to the dimension of the polynomial space used in the denominator and numerator, i.e., again to the number of degrees of freedom; see Theorem 4.5. The argument is based on a result quantifying the regularity of $T$ in terms of its complex domain of analyticity, which is given in Theorem 3.6.
In Sect. 6, we investigate the implications of approximating the transport map for the corresponding pushforward measures. We show that closeness of the transports in $W^{1,\infty}$ implies closeness of the pushforward measures in the Hellinger distance, the total variation distance and the KL divergence. A similar statement is true for the Wasserstein distance if the transports are close in $L^\infty$. Specifically, Proposition 6.4 states that the same $\exp(-\beta N^{1/d})$ error convergence as for the approximation of the transport is obtained for the distance between the pushforward $\tilde{T}_\sharp\rho$ and the target $\pi$.
We provide lower bounds on $\beta>0$, based on properties of $f_\rho$ and $f_\pi$. Furthermore, given $\varepsilon>0$, we provide a priori ansatz spaces guaranteeing the best approximation in this ansatz space to be $O(\varepsilon)$-close to $T$; see Theorem 4.5 and Sect. 7. This allows us to improve upon existing numerical algorithms: Previous approaches were either based on heuristics or based on adaptive (greedy) enrichment of the ansatz space [25], neither of which can guarantee asymptotic convergence or convergence rates in general. Moreover, greedy methods are inherently sequential (in contrast to a priori approaches), which can slow down computations.
Using known approximation properties of polynomials by rectified linear unit (ReLU) neural networks (NNs), we also show that ReLU NNs can approximate T at the exponential rate exp(−β N 1/(1+d) ). In this case, N refers to the number of trainable parameters ("weights and biases") in the network; see Theorem 5.1 for the convergence of the transport map and Proposition 6.5 for the convergence of the pushforward measure. We point out that normalizing flows in machine learning also try to approximate T with a neural network; see, for example, [26,29,33,54]. Recent theoretical works on the expressivity of neural network representations of transport maps include [43,68,69], which provide universal approximation results; moreover, Reference [37] provides estimates on the required network depth. In the present work we do not merely show universality, i.e., approximability of T by neural networks, but we even prove an exponential convergence rate. Similar results have not yet been established to the best of our knowledge.

Main Ideas
Consider the case $d=1$. Let $\pi$ and $\rho$ be two probability measures on $[-1,1]$ with strictly positive Lebesgue densities $f_\rho, f_\pi:[-1,1]\to\{x\in\mathbb{R} : x>0\}$. Let $F_\rho(x):=\int_{-1}^x f_\rho(t)\,\mathrm{d}t$ and $F_\pi(x):=\int_{-1}^x f_\pi(t)\,\mathrm{d}t$ be the corresponding cumulative distribution functions (CDFs), which are strictly increasing bijections from $[-1,1]$ onto $[0,1]$. For every interval $[a,b]\subseteq[-1,1]$ it then holds that $\rho\circ(F_\rho^{-1}\circ F_\pi)([a,b]) = F_\pi(b)-F_\pi(a) = \pi([a,b])$. Hence, $T:=F_\pi^{-1}\circ F_\rho$ is the unique monotone transport satisfying $\rho\circ T^{-1}=\pi$, i.e., $T_\sharp\rho=\pi$. The formula $T=F_\pi^{-1}\circ F_\rho$ implies that $T$ inherits the smoothness of $F_\pi^{-1}$ and $F_\rho$. Thus, it is at least as smooth as $f_\rho$ and $f_\pi$ (more precisely, $f_\rho, f_\pi\in C^k$ imply $T\in C^{k+1}$). We will see in Proposition 3.4 that if $f_\rho$ and $f_\pi$ are analytic, the domain of analyticity of $T$ is (under further conditions and in a suitable sense) proportional to the minimum of the domain of analyticity of $f_\rho$ and $f_\pi$. By the "domain of analyticity," we mean the domain of holomorphic extension to the complex numbers. Knowledge of the analyticity domain of $T$ allows one to prove exponential convergence of polynomial approximations: Assume for the moment that $T:[-1,1]\to[-1,1]$ admits an analytic extension to the complex disk with radius $r>1$ and center $0\in\mathbb{C}$. Then $T(x)=\sum_{k\in\mathbb{N}_0}\frac{\mathrm{d}^k}{\mathrm{d}y^k}T(y)\big|_{y=0}\,\frac{x^k}{k!}$ for $x\in[-1,1]$, and the $k$th Taylor coefficient can be bounded with Cauchy's integral formula by $Cr^{-k}$. This implies that the $n$th Taylor polynomial uniformly approximates $T$ on $[-1,1]$ with error $O(r^{-n})=O(\exp(-\log(r)n))$. Thus, $r$ determines the rate of exponential convergence.
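The exponential decay of the approximation error can be observed numerically. The sketch below is an illustration under assumptions and not part of the analysis: it takes a uniform reference, the hypothetical target density $(1+ay)/2$ with $a=0.5$, for which the monotone transport has the closed form used in `transport`, and it measures the uniform error of truncated Legendre expansions (rather than Taylor polynomials, since Legendre expansions are what the general case uses). All names are introduced for this example only.

```python
import numpy as np
from numpy.polynomial import legendre as leg

A = 0.5  # hypothetical parameter of the assumed target density (1 + A*y)/2


def transport(x, a=A):
    """Closed-form monotone transport from the uniform measure on [-1, 1]
    to the measure with density (1 + a*y)/2 (assumed example)."""
    return (-1.0 + np.sqrt(1.0 + a * a + 2.0 * a * x)) / a


def legendre_truncation_error(n, a=A, quad_nodes=200):
    """Uniform error on a grid of the degree-n truncated Legendre expansion."""
    x, w = leg.leggauss(quad_nodes)
    vander = leg.legvander(x, n)  # vander[i, j] = P_j(x_i)
    # Legendre coefficients c_j = (2j + 1)/2 * integral of T(x) P_j(x) over [-1, 1].
    coeffs = (2 * np.arange(n + 1) + 1) / 2 * (vander.T @ (w * transport(x, a)))
    grid = np.linspace(-1.0, 1.0, 2001)
    return np.max(np.abs(transport(grid, a) - leg.legval(grid, coeffs)))


errors = {n: legendre_truncation_error(n) for n in (5, 10, 20)}
for n, err in errors.items():
    print(n, err)
```

In this example the map is analytic up to a branch point at $x=-(1+a^2)/(2a)$, so the distance of this singularity to $[-1,1]$ governs how fast the printed errors decay.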
The above construction of the transport can be generalized to the KR transport $T:[-1,1]^d\to[-1,1]^d$ for $d\in\mathbb{N}$. We will determine an analyticity domain for each component $T_k$ of $T=(T_k)_{k=1}^d$: not in the shape of a polydisc, but rather as a pill-shaped set containing $[-1,1]^k$. The reason is that analyticity of $f_\rho$ and $f_\pi$ does not imply the existence of a polydisc, but does imply the existence of such pill-shaped domains. Instead of Taylor expansions, one can then prove exponential convergence of Legendre expansions. Rather than approximating $T$ with Legendre polynomials, we introduce a correction guaranteeing our approximation $\tilde{T}:[-1,1]^d\to[-1,1]^d$ to be bijective. This results in a rational function $\tilde{T}$. Using existing theory for ReLU networks, we also deduce a ReLU approximation result.

Outline
In Sect. 1.4, we introduce notation. Section 2 recalls the construction of the triangular KR transport T . In Sect. 3, we investigate the domain of analyticity of T . Section 4 applies the results of Sect. 3 to prove exponential convergence rates for the approximation of the transport through sparse polynomial expansions. Subsequently, Sect. 5 discusses a deep neural network approximation result for the transport. We then use these results in Sect. 6 to establish convergence rates for the associated measures (rather than the transport maps themselves). Finally, in Sect. 7 we present a standard example in uncertainty quantification and demonstrate how our results may be used in inference algorithms.
For the convenience of the reader, in Sect. 3.1 we discuss analyticity of the transport map in the one-dimensional case separately from the general case $d\in\mathbb{N}$ (which builds on similar ideas but is significantly more technical) and provide most parts of the proof in the main text. In the remaining sections, all proofs and technical arguments are deferred to the appendix.

Real and Complex Numbers
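Throughout, $\mathbb{R}^d$ is equipped with the Euclidean norm and $\mathbb{R}^{d\times d}$ with the spectral norm. We write $\mathbb{R}_+:=\{x\in\mathbb{R} : x>0\}$ and denote the real and imaginary part of $z\in\mathbb{C}$ by $\Re(z)$, $\Im(z)$, respectively. For any $\delta\in\mathbb{R}_+$ and $S\subseteq\mathbb{C}$, we set $B_\delta(S):=\{z\in\mathbb{C} : \exists\, y\in S \text{ s.t. } |z-y|<\delta\}$. If we omit the argument $S$, then $S:=\{0\}$, i.e., $B_\delta:=B_\delta(0)$.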

Derivatives and Function Spaces
R d j denotes the weak jth derivative, and d j f (x) denotes the norm on R d×···×d induced by the Euclidean norm. For j = 1 we simply write d f :=d 1 f . By abuse of notation, e.g., for

Transport Maps
Also note that $T_\sharp\rho=\pi$ is equivalent to $S^\sharp\rho=\pi$, where $S=T^{-1}$ and $S^\sharp\rho$ denotes the pullback of $\rho$ under $S$.

Knothe-Rosenblatt Transport
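Let $d\in\mathbb{N}$. Given a reference probability measure $\rho$ and a target probability measure $\pi$ on $[-1,1]^d$, under certain conditions (e.g., as detailed below) the KR transport is the (unique) triangular monotone transport $T:[-1,1]^d\to[-1,1]^d$ such that $T_\sharp\rho=\pi$. We now recall the explicit construction of $T$, as, for instance, presented in [59]. Throughout it is assumed that $\pi\ll\mu$ and $\rho\ll\mu$ have continuous and positive densities, i.e., $f_\pi:=\frac{\mathrm{d}\pi}{\mathrm{d}\mu}$ and $f_\rho:=\frac{\mathrm{d}\rho}{\mathrm{d}\mu}$ belong to $C^0([-1,1]^d;\mathbb{R}_+)$, where $\mu$ denotes the Lebesgue measure on $[-1,1]^d$.

For a continuous probability density $f:[-1,1]^d\to\mathbb{R}_+$, $x\in[-1,1]^d$ and $k\in\{1,\dots,d\}$, we write $x_{[k]}:=(x_j)_{j=1}^k$ and set

$$\hat f_k(x_{[k]}):=\int_{[-1,1]^{d-k}} f(x_{[k]},t)\,\mathrm{d}t,\qquad f_k(x_{[k]}):=\frac{\hat f_k(x_{[k]})}{\hat f_{k-1}(x_{[k-1]})},$$

with the convention $\hat f_0:=1$. Thus, $\hat f_k(\cdot)$ is the marginal density of $x_{[k]}$ and $f_k(x_{[k-1]},\cdot)$ is the marginal density of $x_k$ conditioned on $x_{[k-1]}$. The corresponding marginal conditional CDFs are

$$F_{\pi;k}(x_{[k]}):=\int_{-1}^{x_k} f_{\pi;k}(x_{[k-1]},t)\,\mathrm{d}t,\qquad F_{\rho;k}(x_{[k]}):=\int_{-1}^{x_k} f_{\rho;k}(x_{[k-1]},t)\,\mathrm{d}t.$$

The components of $T$ are defined recursively by

$$T_1(x_1):=F_{\pi;1}^{-1}\bigl(F_{\rho;1}(x_1)\bigr),\qquad (2.4a)$$

$$T_k(x_{[k-1]},\cdot):=F_{\pi;k}\bigl(T_1(x_1),\dots,T_{k-1}(x_{[k-1]}),\cdot\bigr)^{-1}\circ F_{\rho;k}(x_{[k-1]},\cdot),\qquad (2.4b)$$

where the inverse is taken in the last variable, and the KR transport is

$$T(x):=\bigl(T_1(x_1),\,T_2(x_{[2]}),\,\dots,\,T_d(x_{[d]})\bigr).\qquad (2.5)$$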
In general, T satisfying T ρ = π is not unique. To keep the presentation succinct, henceforth we will simply refer to "the transport T ," by which we always mean the unique triangular KR transport in (2.5).
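The construction of the KR transport via marginal conditional CDFs can be sketched numerically for $d=2$. The code below is a minimal illustration under assumptions, not the authors' implementation: it evaluates the marginal and conditional CDFs with Gauss-Legendre quadrature and inverts them by bisection. The function names, the quadrature order, and the example densities are all hypothetical choices. The densities must accept a scalar first argument and an array as second argument.

```python
import numpy as np

NODES, WEIGHTS = np.polynomial.legendre.leggauss(30)


def integrate(func, a, b):
    """Gauss-Legendre quadrature of a vectorized function over [a, b]."""
    t = 0.5 * (b - a) * NODES + 0.5 * (b + a)
    return 0.5 * (b - a) * np.sum(WEIGHTS * func(t))


def invert_increasing(cdf, u, tol=1e-13):
    """Solve cdf(y) = u on [-1, 1] by bisection; cdf must be increasing."""
    lo, hi = -1.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def marginal(f, s):
    """Marginal density of the first variable, evaluated at the scalar s."""
    return integrate(lambda t: f(s, t), -1.0, 1.0)


def cdf_first(f, s):
    """CDF of the first variable."""
    return integrate(lambda r: np.array([marginal(f, ri) for ri in r]), -1.0, s)


def cdf_second(f, s1, s2):
    """CDF of the second variable conditioned on the first one being s1."""
    return integrate(lambda t: f(s1, t), -1.0, s2) / marginal(f, s1)


def kr_transport_2d(f_rho, f_pi, x1, x2):
    """Evaluate the triangular map (T_1(x1), T_2(x1, x2)) at a single point."""
    t1 = invert_increasing(lambda y: cdf_first(f_pi, y), cdf_first(f_rho, x1))
    t2 = invert_increasing(lambda y: cdf_second(f_pi, t1, y),
                           cdf_second(f_rho, x1, x2))
    return t1, t2


a = 0.8  # hypothetical coupling parameter of the example target
f_rho = lambda s, t: 0.25 + 0.0 * t             # uniform reference on [-1, 1]^2
f_pi = lambda s, t: 0.25 * (1.0 + a * s * t)    # assumed analytic target density
t1, t2 = kr_transport_2d(f_rho, f_pi, 0.3, -0.5)
print(t1, t2)
```

For this particular target the first marginal is uniform, so $T_1$ is the identity, and only the second component is nonlinear. Nested quadrature and bisection are used here for transparency; they scale poorly with $d$ and are not meant as a practical algorithm.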

Analyticity
The explicit formulas for T given in Sect. 2 imply that positive analytic densities yield an analytic transport. Analyzing the convergence of polynomial approximations to T requires knowledge of the domain of analyticity of T . This is investigated in the present section.

One-Dimensional Case
Let d = 1. By (2.4a), T : [−1, 1] → [−1, 1] can be expressed through composition of the CDF of ρ and the inverse CDF of π . As the inverse function theorem is usually stated without giving details on the precise domain of extension of the inverse function, we give a proof, based on classical arguments, in Appendix A. This leads to the result in Lemma 3.2. Before stating it, we provide another short lemma that will be used multiple times.
Proof For any x ∈ K and any ε ∈ (0, δ), by Cauchy's integral formula Letting ε → 0 implies the claim. there then exists a unique function G : Proof We verify the conditions of Proposition A.2 withδ: Finally, due to 1 = F(G(y)) = F (G(y))G (y), it holds G (y) = 1 f (G(y)) for all y ∈ B αδ (F(x 0 )), which shows that G : B αδ (F(x 0 )) → C has Lipschitz constant 1 M . Hence, In case f allows an extension f :   We arrive at a statement about the domain of analytic extension of the onedimensional monotone transport T :

General Case
We now come to the main result of Sect. 3, which is a multidimensional version of Proposition 3.4. More precisely, we give a statement about the analyticity domain of (∂ k T k ) d k=1 . The reason is that, from both a theoretical and a practical viewpoint, it is convenient first to approximate ∂ k T k : [−1, 1] k → [0, 1] and then to obtain an approximation to T k by integrating over x k . We explain this in more detail in Sect. 4, see (4.9).
The following technical assumption gathers our requirements on the reference ρ and the target π .
Assumptions (a) and (b) state that f * is a positive analytic probability density on [−1, 1] d that allows a complex differentiable extension to the set B . Equation (2.4) shows that T k+1 is obtained by a composition of F π ;k+1 (T 1 , . . . , T k , ·) −1 (the inverse in the last variable) and F ρ;k+1 . The smallness conditions (c) and (d) can be interpreted as follows: They will guarantee F ρ;k+1 ( y) (for certain complex y) to belong to the domain where the complex extension of F π ;k+1 (T 1 , . . . , T k , ·) −1 is well defined. Theorem 3.6 Let 0 < M ≤ L < ∞, d ∈ N and δ ∈ (0, ∞) d . There exist C 6 , C 7 , and C 8 > 0 depending on M and L (but not on d or δ) such that if Assumption 3.5 holds with C 6 , then: it holds for all k ∈ {1, . . . , d} that Put simply, the first item of the theorem can be interpreted as follows: The function ∂ k T k allows in the jth variable an analytic extension to the set B ζ j ([−1, 1]), where ζ j is proportional to δ j . The constant δ j describes the domain of analytic extension of the densities f ρ , f π in the jth variable. Thus, the analyticity domain of each ∂ k T k is proportional to the (intersection of the) domains of analyticity of the densities. Additionally, the real part of ∂ k T k remains strictly positive on this extension to the complex domain. Note that ∂ k T k (x) is necessarily positive for real x ∈ [−1, 1] k , since the transport is monotone.
The second item of the theorem states that the $k$th variable $x_k$ plays a special role for $T_k$: if we merely extend $\partial_k T_k$ in the first $k-1$ variables to the complex domain and let the $k$th argument $x_k$ belong to the real interval $[-1,1]$, then the values of this extension behave like $1+O(1/\delta_k)$, and thus, the extension is very close to the constant 1 function for large $\delta_k$. In other words, if the densities $f_\rho$, $f_\pi$ allow a (uniformly bounded from above and below) analytic extension to a very large subset of the complex domain in the $k$th variable, then the $k$th component of the transport $T_k(x_{[k]})$ will be close to $-1+\int_{-1}^{x_k} 1\,\mathrm{d}t = x_k$, i.e., to the identity in the $k$th variable. We also emphasize that we state the analyticity results here for $\partial_k T_k$ (in the form they will be needed below), but this immediately implies that $T_k$ allows an analytic extension to the same domain.

Remark 3.7
Crucially, for any k < d the left-hand side of the inequality in Assumption 3.5 (d) depends on (δ j ) k j=1 , while the right-hand side depends only on δ k+1 but not on (δ j ) k j=1 . This will allow us to suitably choose δ when verifying this assumption (see the proof of Lemma 3.9).

Remark 3.8
The proof of Theorem 3.6 shows that there exists $C\in(0,1)$ independent of $M$ and $L$ such that we can choose $C_6 = C\,\frac{\min\{M,1\}^5}{\max\{L,1\}^4}$ and $C_7 = C\,\frac{\min\{M,1\}^3}{\max\{L,1\}^3}$.

To give an example for $\rho$, $\pi$ fitting our setting, we show that Assumption 3.5 holds (for some sequence $\delta$) whenever the densities $f_\rho$, $f_\pi$ are analytic.

Polynomial-Based Approximation
Analytic functions on [−1, 1] d → R allow for exponential convergence when approximated by multivariate polynomial expansions. We recall this in Sect. 4.1 for truncated Legendre expansions. These results are then applied to the KR transport in Sect. 4.2.

Exponential Convergence of Legendre Expansions
For n ∈ N 0 , let L n ∈ P n be the nth Legendre polynomial normalized in is an orthonormal basis of As is well known, functions that are holomorphic on sets of the type B δ ([−1, 1]) have exponentially decaying Legendre coefficients. We recall this in the next lemma, which is adapted to the regularity we showed for the transport in Theorem 3.6.
The previous lemma in combination with Theorem 3.6 yields a bound on the Legendre coefficients of the partial derivatives ∂ k T k − 1 of the kth component of the transport. Specifically, for ν ∈ N k 0 Theorem 3.6 (i) together with Lemma 4.1 (i) implies with t 1 := −ν ∂ k T k − 1 L ∞ (B δ ([−1,1])) and t 2 :=w ν j∈supp ν j j −1 the bound t 1 t 2 for the corresponding Legendre coefficient. For a multi-index ν ∈ N k−1 0 × {0}, applying instead Theorem 3.6 (ii) together with Lemma 4.1 (ii) yields the boundt 1 t 2 wheret 1 : 1]) . By Theorem 3.6, the last norm is bounded by C 3 δ k . Hence, compared to the first estimate t 1 , we gain the factor 1 δ k by using the second estimatet 1 instead. 1 Taking the minimum of these estimates leads us to introduce for k ∈ N, ν ∈ N k 0 and ∈ (1, ∞) k the quantity and the set k,ε :={ν ∈ N k 0 : γ ( , ν) ≥ ε}, (4.4) corresponding to the largest values of γ ( , ν).
The structure of $\Lambda_{k,\varepsilon}$ is as follows: The larger $\varrho_j$, the smaller $\varrho_j^{-1}$. Thus, the larger $\varrho_j$, the fewer multi-indices $\nu$ with $j\in\operatorname{supp}\nu$ belong to $\Lambda_{k,\varepsilon}$. In this sense, $\varrho_j$ measures the importance of the $j$th variable in the Legendre expansion of $\partial_k T_k$. The $k$th variable plays a special role, as it always is among the most important variables, since for all $\nu\in\mathbb{N}_0^k$ it holds that $\gamma(\varrho,\nu)\le\gamma(\varrho,e_k)$, where $e_k=(0,\dots,0,1)\in\mathbb{N}_0^k$. In other words, whenever $\varepsilon>0$ is so small that $\Lambda_{k,\varepsilon}\neq\emptyset$, at least one $\nu$ with $\nu_k\neq 0$ belongs to $\Lambda_{k,\varepsilon}$.
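The structure of such an index set can be made concrete by enumerating it. The sketch below assumes that the weight has the product form $\gamma(\varrho,\nu)=\varrho_k^{-\max\{1,\nu_k\}}\prod_{j<k}\varrho_j^{-\nu_j}$; this form is an assumption made for the illustration, because the displayed definition is not legible in this text. The function names are hypothetical.

```python
import itertools
import math


def gamma(varrho, nu):
    """Assumed weight: varrho_k^(-max(1, nu_k)) * prod_{j<k} varrho_j^(-nu_j)."""
    k = len(varrho)
    value = varrho[-1] ** (-max(1, nu[-1]))
    for j in range(k - 1):
        value *= varrho[j] ** (-nu[j])
    return value


def index_set(varrho, eps):
    """All multi-indices nu in N_0^k with gamma(varrho, nu) >= eps."""
    # Each nu_j is at most log(1/eps)/log(varrho_j); add 1 as a safety margin
    # against floating-point rounding of the logarithms.
    bounds = [int(math.log(1.0 / eps) / math.log(r)) + 1 for r in varrho]
    candidates = itertools.product(*(range(b + 1) for b in bounds))
    return [nu for nu in candidates if gamma(varrho, nu) >= eps]


# Cardinality of the set for a decreasing sequence of thresholds.
for m in range(1, 6):
    print(m, len(index_set((2.0, 2.0), 2.0 ** -m)))
```

Under this assumption the set is downward closed, and a larger $\varrho_j$ visibly removes multi-indices that are active in the $j$th variable.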
Having determined a set of multi-indices corresponding to the largest upper bounds obtained for the Legendre coefficients, we arrive at the next proposition. The assumptions on the function $f$ correspond to the regularity of $\partial_k T_k$ shown in Theorem 3.6. The proposition states that such $f$ can be approximated with the error decreasing as $O(\exp(-\beta N^{1/k}))$ in terms of the dimension $N$ of the polynomial space.

Polynomial and Rational Approximation
Combining Proposition 4.2 with Theorem 3.6 we obtain the following approximation result for the transport. It states that T : For k ∈ {1, . . . , d}, let k,ε be as in (4.4) and define Remark 4. 4 We set j = 1+C 7 δ j > 1 in Theorem 4.3, where δ j as in Assumption 3.5 encodes the size of the analyticity domain of the densities f ρ and f π (in the jth variable). The constant β in (4.7b) is an increasing function of each j . Loosely speaking, Theorem 4.3 states that the larger the analyticity domain of the densities, the faster the convergence when approximating the corresponding transport T with polynomials.
We skip the proof of the above theorem and instead proceed with a variation of this result. It states a convergence rate for an approximationT k to T k , which enjoys the property thatT k ( This is achieved as follows: Let g : R → {x ∈ R : x ≥ 0} be analytic, such that g(0) = 1 and h:=g −1 : (0, ∞) → R is also analytic. We first approximate h(∂ k T k ) by some function p k and then obtain −1 + , t)) dt as an approximatioñ T k to T k . This approach, similar to what is proposed in [53], and in the present context in [45], guarantees to be bijective we introduce a normalization which leads tõ (4.9) The meaning of g(0) = 1 is that the trivial approximation To avoid further technicalities, henceforth we choose g(x) = (x + 1) 2 (and thus h(x) = √ x − 1), but emphasize that our analysis works just as well with any other positive analytic function such that g(0) = 1, e.g., g(x) = exp(x) and h(x) = log(x). The choice g(x) = (x + 1) 2 has the advantage that g( p k ) is polynomial if p k is polynomial. This allows exact evaluation of the integrals in (4.9) without resorting to numerical quadrature, and results in a rational approximationT k : Theorem 4.5 Let m ∈ N 0 . Let f ρ , f π satisfy Assumption 3.5 for some constants 0 < M ≤ L < ∞, δ ∈ (0, ∞) d and with C 6 = C 6 (M, L) as in Theorem 3.6. Let j , β and k,ε be as in Theorem 4.3.

is a monotone triangular bijection, and with
We emphasize that our reason for using rational functions rather than polynomials in Theorem 4.5 is merely to guarantee that the resulting approximation $\tilde{T}:[-1,1]^d\to[-1,1]^d$ is a bijective and monotone map. We do not employ specific properties of rational functions (as done for Padé approximations) in order to improve the convergence order.
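How the choice $g(x)=(x+1)^2$ produces a monotone bijection can be sketched for the second component of a map on $[-1,1]^2$. The sketch assumes that the normalized construction has the form $\tilde{T}_2(x_1,x_2) = -1 + 2\int_{-1}^{x_2} g(p(x_1,t))\,\mathrm{d}t \,/ \int_{-1}^{1} g(p(x_1,t))\,\mathrm{d}t$ for a polynomial $p$; this form, the coefficient layout, and the name `rational_component` are assumptions made for the illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial


def rational_component(c):
    """Return the map (x1, x2) -> T2(x1, x2) built from the polynomial
    p(x1, t) = sum_{i,j} c[i, j] * x1**i * t**j with g(x) = (x + 1)**2."""
    c = np.asarray(c, dtype=float)

    def t2(x1, x2):
        powers = x1 ** np.arange(c.shape[0])
        p_t = Polynomial(powers @ c)          # p(x1, .) as a polynomial in t
        # g(p) is a polynomial, so its antiderivative is available exactly.
        antiderivative = ((p_t + 1) ** 2).integ(lbnd=-1)
        return -1.0 + 2.0 * antiderivative(x2) / antiderivative(1.0)

    return t2


t2 = rational_component([[0.3, -0.5], [0.2, 0.4]])  # hypothetical coefficients
print(t2(0.7, -1.0), t2(0.7, 0.0), t2(0.7, 1.0))
```

Since the integrand is nonnegative, the map is nondecreasing in $x_2$, and the normalization pins the endpoints to $\mp 1$. The denominator depends polynomially on $x_1$, which is why the resulting map is rational rather than polynomial. With $p\equiv 0$ the construction returns $x_2$.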
It is often simpler to first approximate S, and then compute T by inverting S, see [45]. Since the assumptions of Theorem 4.5 (and Theorem 3.6) on the measures ρ and π are identical, Theorem 4.5 also yields an approximation result for the inverse transport map S: For all ε > 0 and with k,ε as in Theorem 4.5, there exist multivariate polynomials p k ∈ P k,ε such that with

Deep Neural Network Approximation
Based on the seminal paper [73], it has recently been observed that ReLU neural networks (NNs) are capable of approximating analytic functions at an exponential convergence rate [24,49], and slight improvements can be shown for certain smoother activation functions, e.g., [39]. We also refer to [46] for much earlier results of this type for different activation functions. As a consequence, our analysis in Sect. 3 yields approximation results of the transport by deep neural networks (DNNs). Below we present the statement, which is based on [49,Thm. 3.7].
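To formulate the result, we recall the definition of a feedforward ReLU NN. The (nonlinear) ReLU activation function is defined as $\varphi(x):=\max\{0,x\}$. We call a function $f:\mathbb{R}^d\to\mathbb{R}^d$ a ReLU NN if it can be written as $f(x)=W_L\,\varphi\bigl(W_{L-1}\,\varphi(\cdots\varphi(W_0x+b_0)\cdots)+b_{L-1}\bigr)+b_L$, with $\varphi$ applied componentwise, for certain weight matrices $W_j\in\mathbb{R}^{n_{j+1}\times n_j}$ and bias vectors $b_j\in\mathbb{R}^{n_{j+1}}$ where $n_0=n_{L+1}=d$. For simplicity, we do not distinguish between the network (described by $(W_j,b_j)_{j=0}^L$), and the function it expresses (different networks can have the same output). We then write $\operatorname{depth}(f):=L$, $\operatorname{width}(f):=\max_j n_j$ and $\operatorname{size}(f):=\sum_{j=0}^L(|W_j|_0+|b_j|_0)$, where $|\cdot|_0$ denotes the number of nonzero entries. In other words, the depth corresponds to the number of applications of the activation function (the number of hidden layers) and the size equals the number of nonzero weights and biases, i.e., the number of trainable parameters in the network.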
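The notions of depth, width and size can be made concrete with a few lines of code. The following minimal sketch represents a network by its lists of weight matrices and bias vectors; the function names and the small example network, which computes $|x|=\varphi(x)+\varphi(-x)$ for $d=1$, are hypothetical.

```python
import numpy as np


def relu_network(weights, biases):
    """Return the function expressed by a feedforward ReLU network: ReLU is
    applied after every affine layer except the last one."""
    def f(x):
        for w, b in zip(weights[:-1], biases[:-1]):
            x = np.maximum(0.0, w @ x + b)
        return weights[-1] @ x + biases[-1]
    return f


def depth(weights):
    """Number of applications of the activation (hidden layers)."""
    return len(weights) - 1


def width(weights):
    """Maximum of the layer dimensions n_0, ..., n_{L+1}."""
    return max([weights[0].shape[1]] + [w.shape[0] for w in weights])


def size(weights, biases):
    """Number of nonzero weights and biases."""
    return (sum(np.count_nonzero(w) for w in weights)
            + sum(np.count_nonzero(b) for b in biases))


# Example: |x| = relu(x) + relu(-x) as a network with one hidden layer.
W = [np.array([[1.0], [-1.0]]), np.array([[1.0, 1.0]])]
b = [np.zeros(2), np.zeros(1)]
f = relu_network(W, b)
print(f(np.array([-0.3])), depth(W), width(W), size(W, b))
```

Counting only nonzero entries is what makes sparsely connected architectures smaller than fully connected ones of the same width and depth.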
Here C is a constant depending on d, f ρ and f π but independent of N .

Remark 5.2
Compared to Theorems 4.3 and 4.5, for ReLU networks we obtain the slightly worse convergence rate $\exp(-\beta N^{1/(d+1)})$ instead of $\exp(-\beta N^{1/d})$. This stems from the fact that, for ReLU networks, the best known approximation results for analytic functions in $d$ dimensions show convergence with the rate $\exp(-\beta N^{1/(d+1)})$.

The proof of Theorem 5.1 proceeds as follows: First, we apply results from [49] to obtain a neural network approximation $\tilde{\Phi}_k$ to $T_k$. The constructed map $(\tilde{\Phi}_k)_{k=1}^d$, however, is not necessarily a monotone bijective self-mapping of $[-1,1]^d$. To correct the construction, we use the following lemma:

Since the introduced correction terms $g_{f_1}$ and $g_{f_{-1}}$ have size and depth bounds of the same order as $\tilde{\Phi}_k$, they will not worsen the resulting convergence rates. The details are provided in Appendix D.1.
In the previous theorem, we consider a "sparsely connected" network, meaning that certain weights and biases are, by choice of the network architecture, set a priori to zero. This reduces the overall size of the network. We note that this also yields a convergence rate for fully connected networks: Consider all networks of width $O(N)$ and depth $O(\log(N)N^{1/2})$. The size of a network within this architecture is bounded by $O(N^{5/2}\log(N))$, since the number of elements of the weight matrix $W_j$ between two consecutive layers is $n_j n_{j+1}\le N^2$. The network from Theorem 5.1 belongs to this class, and thus, the best approximation among networks with this architecture achieves at least the exponential convergence $\exp(-\beta N^{1/(d+1)})$. In terms of the number of trainable parameters $M=O(N^{5/2}\log(N))$, this convergence rate is, up to logarithmic terms, $\exp(-\beta M^{2/(5d+5)})$.
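The conversion of the rate from the width parameter $N$ to the number $M$ of trainable parameters is a short computation. Ignoring the logarithmic factor, as the text does, it can be spelled out as follows; the symbol $\beta'$ is introduced here only to absorb the hidden constants.

```latex
M \simeq N^{5/2}
\;\Longrightarrow\;
N \simeq M^{2/5}
\;\Longrightarrow\;
N^{\frac{1}{d+1}} \simeq M^{\frac{2}{5(d+1)}} = M^{\frac{2}{5d+5}},
\qquad\text{so that}\qquad
\exp\bigl(-\beta N^{\frac{1}{d+1}}\bigr)
\simeq \exp\bigl(-\beta' M^{\frac{2}{5d+5}}\bigr).
```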

Remark 5.4
The constant C in (4.12) and the (possibly different) constant C in (5.2) typically depend exponentially on the dimension d: This dependence is true for polynomial approximation results of analytic functions in d dimensions, which is why it will hold for C in (4.12) in general. Since the proof in [49], upon which our analysis is based, uses polynomial approximations, the same can be expected for the constant in (5.2). In [76] we discuss the approximation of T by rational functions in the high-dimensional case. There we give sufficient conditions on the reference and target to guarantee algebraic convergence of the error, with all constants being controlled independent of the dimension.

Remark 5.5
Normalizing flows approximate a transport map $T$ using a variety of neural network constructions, typically by composing a series of nonlinear bijective transformations; each individual transformation employs neural networks in its parametrization, embedded into a specific functional form (possibly augmented with constraints) that ensures bijectivity [36,50]. "Residual" normalizing flows [1,54] compose maps that are not in general triangular, but "autoregressive" flows [33,34,72] use monotone triangular maps as their essential building block. Many practical implementations of autoregressive flows, however, limit the class of triangular maps that can be expressed. Thus, they cannot seek to directly approximate the KR transport in a single step; rather, they compose multiple such triangular maps, interleaved with permutations [68]. In principle, however, a direct approximation of the KR map is sufficient, and our results could be a starting point for constructive and quantitative guidance on the parametrization and expressivity of autoregressive flows in this setting. Our result is also close in style to [43], which shows low-order convergence rates for neural network approximations of transport maps for certain classes of target densities, by writing the map as a gradient of a potential function given by a neural network. This construction, which employs semi-discrete optimal transport, is not in general triangular and does not necessarily coincide with common normalizing flow architectures.
$m\in\mathbb{N}$. In the present section, we show that these results imply corresponding error bounds for the approximate pushforward measure, i.e., bounds for
$\mathrm{dist}(\pi,\tilde T_\sharp\rho)=\mathrm{dist}(T_\sharp\rho,\tilde T_\sharp\rho)$ (6.1)
with "dist" referring to the Hellinger distance, the total variation distance, the Wasserstein distance or the KL divergence. Specifically, we will show that smallness of $\|T-\tilde T\|_{W^{1,\infty}}$ (or $\|T-\tilde T\|_{L^\infty}$ in case of the Wasserstein distance) implies smallness of (6.1).
As mentioned before, when casting the approximation of the transport as an optimization problem, it is often more convenient to first approximate the inverse transport $S=T^{-1}$ by some $\tilde S$, and then to invert $\tilde S$ to obtain an approximation $\tilde T=\tilde S^{-1}$ of $T$; see [45] and also, e.g., the method in [22]. In this case, we usually have an upper bound on $S-\tilde S$ rather than $T-\tilde T$ in a suitable norm; cp. Remark 4.7. However, a bound on $\|S-\tilde S\|_{W^{m,\infty}}$ implies a bound of the same type on $\|T-\tilde T\|_{W^{m,\infty}}$ for $m\in\{0,1\}$, as the next lemma shows. Since closeness in $L^\infty$ or $W^{1,\infty}$ is all we require for the results of this section, the following analysis covers either situation.

Remark 6.2
As is well known, the Hellinger distance provides an upper bound on the difference of integrals w.r.t. two different measures. Assume that g ∈

Error Bounds
Thus, Theorems 4.5 and 5.1 readily yield bounds on $W_p(\tilde T_\sharp\rho,\pi)$ for the approximate polynomial, rational, and NN transport maps from Sects. 4.2 and 5.
For the other three distances/divergences in (6.2), to obtain a bound on $\mathrm{dist}(T_\sharp\rho,\tilde T_\sharp\rho)$, we will upper bound the difference between the densities of those measures. Since the density of $\tilde T_\sharp\rho$ is given by $f_\rho(\tilde S(x))\det d\tilde S(x)$ with $\tilde S:=\tilde T^{-1}$, this amounts to comparing $f_\rho(S(x))\det dS(x)$ with $f_\rho(\tilde S(x))\det d\tilde S(x)$, where $S:=T^{-1}$. This will be done in the proof of the following theorem. To state the result, for a triangular map and
Together with Theorem 4.5, we can now show exponential convergence of the pushforward measure in the case of analytic densities. For $\varepsilon>0$, denote by $\tilde T_\varepsilon=(\tilde T_{\varepsilon,k})_{k=1}^d$ the approximation to $T$ from Theorem 4.5. Moreover, recall that $N_\varepsilon$ in (4.11) denotes the number of degrees of freedom of this approximation (the number of coefficients of this rational function). As shown in Lemma 3.9, the exponential convergence shown in the next proposition holds in particular for positive and analytic reference and target densities $f_\rho$, $f_\pi$.

Proposition 6.4
Consider the setting of Theorem 4.5; in particular, let $f_\rho$, $f_\pi$ satisfy Assumption 3.5, and let $\beta>0$ be as in (4.7b). Then for every $\tilde\beta<\beta$ there exists $C>0$ such that for every $\varepsilon\in(0,1)$ and $\mathrm{dist}\in\{H,TV,KL,W_p\}$, with $\tilde T_\varepsilon$ as in (4.10) and $N_\varepsilon$ as in (4.11), it holds that $\mathrm{dist}((\tilde T_\varepsilon)_\sharp\rho,\pi)\le C\exp(-\tilde\beta N_\varepsilon^{1/d})$.
Similarly, we get a bound for the pushforward under the NN transport from Theorem 5.1.

Proposition 6.5
Let $f_\rho$ and $f_\pi$ be two positive and analytic probability densities on $[-1,1]^d$. Then for every $\mathrm{dist}\in\{H,TV,KL,W_p\}$, there exist constants $\beta>0$ and $C>0$, and for every $N\in\mathbb{N}$ there exists a ReLU neural network $\Phi_N$ as in Theorem 5.1 such that $\mathrm{dist}((\Phi_N)_\sharp\rho,\pi)\le C\exp(-\beta N^{1/(d+1)})$.
The proof is completely analogous to that of Proposition 6.4 (but using Theorem 5.1 instead of Theorem 4.5 to approximate the transport $T$ with the NN $\Phi_N$), which is why we do not give it in the appendix.
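Three of the distances appearing in Propositions 6.4 and 6.5 are easy to evaluate numerically for one-dimensional densities. The following minimal sketch does so for two illustrative analytic densities on $[-1,1]$. The densities, the quadrature rule and the normalizations $H^2=\frac12\int(\sqrt f-\sqrt g)^2$ and $TV=\frac12\int|f-g|$ (both w.r.t. the uniform probability measure) are assumptions made here for illustration; the Wasserstein distance is not covered.

```python
import numpy as np

# Gauss-Legendre rule on [-1, 1], rescaled so that the weights sum to one
# (uniform probability measure on [-1, 1]; an assumed convention).
nodes, weights = np.polynomial.legendre.leggauss(64)
weights = weights / 2.0


def as_density(values):
    """Rescale nodal values so that the discrete integral equals one."""
    return values / np.sum(weights * values)


# Two illustrative positive analytic densities (hypothetical examples).
f = as_density(np.exp(nodes))
g = as_density(1.0 + 0.5 * nodes)

# Hellinger distance with the factor 1/2 inside the square root.
hellinger = np.sqrt(0.5 * np.sum(weights * (np.sqrt(f) - np.sqrt(g)) ** 2))
# Total variation distance with the factor 1/2.
total_variation = 0.5 * np.sum(weights * np.abs(f - g))
# KL divergence of f relative to g.
kl_divergence = np.sum(weights * f * np.log(f / g))

print(hellinger, total_variation, kl_divergence)
```

With these normalizations the computed values satisfy $H^2\le\frac12 KL$ and $H^2\le TV\le\sqrt2 H$, so smallness of the KL divergence controls the other two quantities.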

Application to Inverse Problems in UQ
To give an application and explain in more detail the practical value of our results, we briefly discuss a standard inverse problem in uncertainty quantification.

Setting
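Let $n\in\mathbb{N}$ and let $D\subseteq\mathbb{R}^n$ be a bounded Lipschitz domain. For a diffusion coefficient $a\in L^\infty(D;\mathbb{R})$ such that $\operatorname{ess\,inf}_{x\in D}a(x)>0$, and a forcing term $f\in H^{-1}(D)$, the PDE
$-\operatorname{div}(a\nabla u)=f$ in $D$, $u=0$ on $\partial D$ (7.1)
has a unique weak solution in $H_0^1(D)$. We denote it by $u(a)$, and call $u:a\mapsto u(a)$ the forward operator.
Let $A:H_0^1(D)\to\mathbb{R}^m$ be a bounded linear observation operator for some $m\in\mathbb{N}$. The inverse problem consists of recovering the diffusion coefficient $a\in L^\infty(D)$, given the noisy observation
$\varsigma=A(u(a))+\eta\in\mathbb{R}^m$ (7.2)
with the additive observation noise $\eta\sim\mathcal{N}(0,\Gamma)$, for a symmetric positive definite covariance matrix $\Gamma\in\mathbb{R}^{m\times m}$.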

Prior and Posterior
In uncertainty quantification and statistical inverse problems, the diffusion coefficient a ∈ L ∞ (D) is modeled as a random variable (independent of the observation noise η) distributed according to some known prior distribution; see, e.g., [35]. Bayes' theorem provides a formula for the distribution of the diffusion coefficient conditioned on the observations. This conditional is called the posterior and interpreted as the solution to the inverse problem.
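Given $m$ measurements $(\varsigma_i)_{i=1}^m$ as in (7.2), the posterior measure $\pi$ on $[-1,1]^d$ is the distribution of $y|\varsigma$. Since $\eta\sim\mathcal{N}(0,\Gamma)$, the likelihood (the density of $\varsigma|y$) equals
$\frac{1}{\sqrt{(2\pi)^m\det\Gamma}}\exp\bigl(-\tfrac12(\varsigma-A(u(y)))^\top\Gamma^{-1}(\varsigma-A(u(y)))\bigr)$,
where $u(y)$ abbreviates the solution for the coefficient $1+\sum_{j=1}^d y_j\psi_j$ in (7.3). By Bayes' theorem, the posterior density $f_\pi$, corresponding to the distribution of $y|\varsigma$, is proportional to the density of $y$ times the density of $\varsigma|y$. Since the (uniform) prior has constant density 1,
$f_\pi(y)=\frac1Z\exp\bigl(-\tfrac12(\varsigma-A(u(y)))^\top\Gamma^{-1}(\varsigma-A(u(y)))\bigr)$. (7.4)
The normalizing constant $Z$ is in practice unknown. For more details see, e.g., [17].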
In order to compute expectations w.r.t. the posterior π , we want to determine a transport map T :
Let $\xi_j=1+C_7\delta_j$ be as in Theorem 4.5 (i.e., as in (4.7a)), where $C_7$ is as in Theorem 3.6. With $\kappa_j\in(0,1]$ and $\tau>0$ as in Lemma 7.1, we have in particular $\xi_j\ge1+C_7\tau\|\psi_j\|_{L^\infty(D)}^{-1}$. In practice, we do not know $\tau$ and $C_7$ (although pessimistic estimates could be obtained from the proofs), and we simply set $\xi_j:=1+\|\psi_j\|_{L^\infty(D)}^{-1}$ when defining the multi-index sets $\Lambda_{k,\varepsilon}$ in (7.5). Here $\varepsilon>0$ is a thresholding parameter, and as $\varepsilon\to0$ the ansatz spaces become arbitrarily large. We interpret (7.5) as follows: The smaller $\|\psi_j\|_{L^\infty(D)}$, the less important the variable $j$ is in the approximation of $T_k$ if $j<k$. The $k$th variable plays a special role for $T_k$, however, and is always among the most important in the approximation of $T_k$.

Performing Inference
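Given data $\varsigma\in\mathbb{R}^m$ as in (7.2), we can now describe a high-level algorithm to perform inference on the model problem in Sects. 7.1-7.2:
(i) Determine ansatz space: Fix $\varepsilon>0$ and determine $\Lambda_{k,\varepsilon}$ in (7.5) for $k=1,\dots,d$.
(ii) Find transport map: Use as a target $\pi$ the posterior with density in (7.4), and solve the optimization problem
$\arg\min_{\tilde T\text{ as in (4.10) with }p_k\in\mathbb{P}_{\Lambda_{k,\varepsilon}}}\mathrm{dist}(\tilde T_\sharp\rho,\pi)$. (7.6)
(iii) Estimate parameter: Approximate the conditional mean (CM) of the parameter by $\tilde y:=\int_{[-1,1]^d}\tilde T(x)\,d\rho(x)$. (7.7) An estimate of the unknown diffusion coefficient $a\in L^\infty(D)$ in (7.3) is obtained via $1+\sum_{j=1}^d\tilde y_j\psi_j$.
We next provide more details for each of those steps.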

Determining the Ansatz Space
An efficient algorithm (of linear complexity) to determine multi-index sets $\Lambda_{k,\varepsilon}$ of the type (7.5) is given in [3] or [75, Sec. 3.1.3]. We emphasize that, in general, it is not an easy task to come up with suitable ansatz spaces. In the current setting, our explicit knowledge of the prior measure and its possibly anisotropic structure, together with the analyticity of the forward operator, allows us to use the analysis of the previous sections to determine a priori ansatz spaces yielding proven exponential convergence for the best approximating transport map within this space. By "anisotropic," we mean that certain variables $y_j$ may contribute less to the posterior than others due to $\psi_j$ being small in (7.3); in this case, our construction (7.5) results in fewer degrees of freedom spent on such variables, thus increasing the efficiency of the algorithm. This is to be contrasted with the use of generic ansatz spaces, which do not leverage such knowledge, as, for example, proposed in [45]. The explicit construction of these spaces is one of the main contributions of this work.
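To make the anisotropy concrete, the following sketch enumerates a thresholded multi-index set. It is only an illustration under loud assumptions: the set is taken to be of the form $\{\nu\in\mathbb{N}_0^k:\ \xi_k^{-\max\{1,\nu_k\}}\prod_{j<k}\xi_j^{-\nu_j}\ge\varepsilon\}$ with weights $\xi_j>1$, which is a guess consistent with the verbal description of (7.5) and not a quotation of it. The function name `multi_index_set` is hypothetical, and the enumeration is a brute-force filter, not the linear-complexity algorithm of [3].

```python
import itertools
import math


def multi_index_set(xi, eps):
    """Enumerate all nu in N_0^k (k = len(xi)) such that

        xi[k-1] ** (-max(1, nu_k)) * prod_{j<k} xi[j] ** (-nu_j) >= eps.

    ASSUMED form of the threshold set; requires every xi[j] > 1.
    """
    budget = math.log(1.0 / eps)
    logs = [math.log(x) for x in xi]
    tol = 1e-12  # guards against round-off in borderline cases
    # No exponent nu_j can exceed budget / log(xi_j).
    ranges = [range(int(budget / lg + tol) + 1) for lg in logs]
    selected = []
    for nu in itertools.product(*ranges):
        # Cost of the first k-1 variables, plus the special last variable.
        cost = sum(n * lg for n, lg in zip(nu[:-1], logs[:-1]))
        cost += max(1, nu[-1]) * logs[-1]
        if cost <= budget + tol:
            selected.append(nu)
    return selected


# A larger weight xi_j (smaller psi_j) allows fewer powers of variable j.
index_set = multi_index_set([2.0, 4.0], eps=1.0 / 16.0)
print(index_set)
```

In this toy example the first variable (weight 2) may appear up to degree 2, while the second variable (weight 4) is cut off earlier relative to its cost, which mirrors the statement that fewer degrees of freedom are spent on less important variables.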

Finding the Transport Map
Again let $\pi$ with density $f_\pi$ be the posterior in (7.4). The "dist" function in (7.6) is often chosen to be the KL divergence. In this case, the optimization problem (7.6) can equivalently be written as
$\arg\min_{\tilde T\text{ as in (4.10) with }p_k\in\mathbb{P}_{\Lambda_{k,\varepsilon}}}\int_{[-1,1]^d}\log\Bigl(\frac{f_\rho(x)}{f_\pi(\tilde T(x))\det d\tilde T(x)}\Bigr)\,d\rho(x)$. (7.8)
To minimize this term, the normalizing constant $Z$ in (7.4), which is in general unavailable, need not be known; see, e.g., [45, Sec. 3.2] for more details. Moreover, since the reference $\rho$ is a tractable measure, the integral in (7.8) can be approximated. The simplest way is by Monte Carlo sampling (from $\rho$), but higher-order methods like quasi-Monte Carlo [21] or sparse-grid quadrature, e.g., [27,77], could also be used, though they have not yet been rigorously investigated in this context. Solving the optimization problem (7.8) is in general hard, as the objective is non-convex and defined on a high-dimensional space. Practical implementations have employed quasi-Newton or Newton-CG methods [4,8,25], with continuation heuristics to address non-convexity. Coming up with good optimization algorithms for this objective is out of the scope of the present paper but would be an interesting topic for future research.
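The independence of the KL objective from the normalizing constant can be seen in a one-dimensional toy problem. The sketch below is an illustration under assumptions, not the rational ansatz (4.10): the reference is uniform on $[-1,1]$, the target density is proportional to $1+Ay$ (a hypothetical choice), and the objective $-\mathbb{E}_\rho[\log(\hat f_\pi(T(x))\,T'(x))]$ with the unnormalized density $\hat f_\pi$ is estimated by Monte Carlo sampling. For this target the exact monotone transport is known in closed form, so it can be compared with the identity map.

```python
import numpy as np

A = 0.5  # slope of the illustrative target density 1 + A*y on [-1, 1]


def unnormalized_target(y, scale=7.0):
    """Target density up to a constant factor (unknown in practice)."""
    return scale * (1.0 + A * y)


def exact_map(x):
    """Monotone map pushing the uniform reference to the target."""
    return (-1.0 + np.sqrt(1.0 + A**2 + 2.0 * A * x)) / A


def exact_map_derivative(x):
    return 1.0 / np.sqrt(1.0 + A**2 + 2.0 * A * x)


def kl_objective(transport, derivative, samples):
    """Monte Carlo estimate of -E_rho[log(f_hat(T(x)) * T'(x))]."""
    values = np.log(unnormalized_target(transport(samples)))
    values = values + np.log(derivative(samples))
    return -np.mean(values)


rng = np.random.default_rng(0)
samples = rng.uniform(-1.0, 1.0, size=10_000)  # draws from the reference

loss_exact = kl_objective(exact_map, exact_map_derivative, samples)
loss_identity = kl_objective(lambda x: x, np.ones_like, samples)
print(loss_exact, loss_identity)
```

The objective differs from the KL divergence only by the additive constant $\log Z$ (here $\log 7$), so the exact map attains the value $-\log 7$ and any other candidate, such as the identity, yields a larger value.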
Instead of minimizing over rational transports from (4.10) as in (7.6) and (7.8), alternatively we could minimize over the NN transports from Theorem 5.1. Since the proof of Theorem 5.1 is constructive, we could in principle give an explicit architecture achieving the rate in (5.2). We refrain from doing so, since an implementation of this (sparse) architecture may be cumbersome in practice, and not necessarily yield significantly better global minima than a fully connected network of similar size (which also allows for exponential convergence; see the discussion after Theorem 5.1).
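Other objectives and approaches are possible as well. For instance, if the reference $\rho$ is chosen as the uniform measure, then $\tilde T_\sharp\rho$ has density $\det d\tilde S$ with $\tilde S=\tilde T^{-1}$. In this case, we may minimize the Hellinger distance between $\tilde T_\sharp\rho$ and $\pi$, i.e., the $L^2$ distance between $\sqrt{\det d\tilde S}$ and $\sqrt{f_\pi}$. To make this optimization problem independent of the normalizing constant (and, moreover, convex) one can proceed as follows: Substitute $\tilde g:=\sqrt{\det d\tilde S}$, and find the minimizer $\tilde g$ of the $L^2$ distance to the square root of the unnormalized target density. Observe that by using, for example, a polynomial ansatz $\tilde g(y)=\sum_{\nu\in\Lambda}c_\nu y^\nu$, this yields a (convex) linear least-squares problem for the expansion coefficients $(c_\nu)_{\nu\in\Lambda}$.
Moreover, after normalizing $\tilde g$ in $L^2$, which guarantees $\tilde g^2$ to be a probability density, this method does not depend on the unknown normalizing constant of the target $f_\pi$. Using the explicit formulas for the (inverse) KR transport in Sect. 2 we get, for $k=1,\dots,d$,
$\tilde S_k(y_{[k]})=-1+2\,\frac{\int_{-1}^{y_k}\int_{[-1,1]^{d-k}}\tilde g(y_{[k-1]},t,s)^2\,ds\,dt}{\int_{-1}^{1}\int_{[-1,1]^{d-k}}\tilde g(y_{[k-1]},t,s)^2\,ds\,dt}$.
If $\tilde g$ is a multivariate polynomial, these integrals can be evaluated analytically (without the use of numerical quadrature), which makes the method computationally feasible. A similar algorithm based on tensor-train approximations of the density has recently been proposed in [22].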
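The least-squares approach can be sketched in one dimension as follows. This is a minimal illustration under assumptions: the target is a hypothetical unnormalized density $3e^{y}$ on $[-1,1]$, the polynomial is expanded in Legendre polynomials rather than monomials, the $L^2$ norm is discretized by Gauss-Legendre quadrature, and the function names are invented for this example. The inverse transport is then obtained by integrating the squared polynomial exactly.

```python
import numpy as np
from numpy.polynomial import legendre


def fit_sqrt_density(unnormalized_density, degree, n_quad=64):
    """Weighted linear least-squares fit of a polynomial g to the square
    root of an unnormalized density on [-1, 1] (discrete L2 norm)."""
    nodes, weights = legendre.leggauss(n_quad)
    basis = legendre.legvander(nodes, degree)  # Legendre basis at the nodes
    root_w = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(
        basis * root_w[:, None],
        np.sqrt(unnormalized_density(nodes)) * root_w,
        rcond=None,
    )
    return coef


def inverse_transport(coef):
    """Return S(y) = -1 + 2 * int_{-1}^{y} g^2 / int_{-1}^{1} g^2."""
    g = legendre.Legendre(coef)
    cdf = (g * g).integ(lbnd=-1.0)  # exact antiderivative vanishing at -1
    total = cdf(1.0)                # normalization, replaces the unknown Z
    return lambda y: -1.0 + 2.0 * cdf(y) / total


coef = fit_sqrt_density(lambda y: 3.0 * np.exp(y), degree=8)
S = inverse_transport(coef)
print(S(np.array([-1.0, 0.0, 1.0])))
```

Because the derivative of $S$ is a multiple of $g^2\ge0$, monotonicity holds by construction, and the factor 3 in the target cancels in the normalization step.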

Estimating the Parameter
In the Bayesian setting, a natural point estimate of the parameter $y$ is the CM estimator in (7.7). An alternative is the MAP estimator (a point maximizing the posterior density), but the CM has the advantage of satisfying several useful stability properties. For example, in contrast to the MAP, under suitable assumptions the CM depends continuously on the data $\varsigma$; see, e.g., [32,38,67]. Moreover, minimizing objectives like the KL divergence or the Hellinger distance will guarantee convergence of the CM as the number of degrees of freedom in the approximation $\tilde T$ tends to infinity. It will not guarantee convergence of a MAP point, however, since the density of $\tilde T_\sharp\rho$ need not converge pointwise to that of $\pi$ if $\tilde T$ minimizes the KL divergence as in (7.8).
To see convergence of the CM, recall that the Hellinger distance is bounded according to $H(\tilde T_\sharp\rho,\pi)^2\le\frac12 KL(\tilde T_\sharp\rho,\pi)$; see [28]. Moreover, by Remark 6.2 the difference between $\int_{[-1,1]^d}\tilde T(x)\,d\rho(x)$ and $\int_{[-1,1]^d}y\,d\pi(y)$ is bounded by a constant multiple of $H(\tilde T_\sharp\rho,\pi)$. By Proposition 6.4, for $\tilde T$ solving the optimization problem (7.8) (i.e., minimizing $KL(\tilde T_\sharp\rho,\pi)$), we thus have
$\Bigl\|\int_{[-1,1]^d}y\,d\pi(y)-\int_{[-1,1]^d}\tilde T(x)\,d\rho(x)\Bigr\|\le C\,H(\tilde T_\sharp\rho,\pi)\le C'\,KL(\tilde T_\sharp\rho,\pi)^{1/2}\le C''\exp\bigl(-\tfrac{\tilde\beta}{2}N^{1/d}\bigr)$ (7.9)
with $\tilde\beta<\beta$ and $\beta$ as in (4.7), i.e., our approximation to the actual CM $\int_{[-1,1]^d}y\,d\pi(y)\in\mathbb{R}^d$ converges exponentially in terms of the number of degrees of freedom $N$ used in (4.10) to approximate $T$. If instead of (4.10) we use a NN, by Proposition 6.5 we obtain a similar bound in terms of the trainable parameters $N$ of the network, but for a different constant in the exponent, and with $N^{1/d}$ on the right-hand side of (7.9) replaced by $N^{1/(d+1)}$.
Given an approximation $\tilde T$ to the transport $T$, any other posterior expectation $\int g(y)\,d\pi(y)=\int g(T(x))\,d\rho(x)$ can also be approximated by substituting $\tilde T$ for $T$; the CM simply corresponds to setting $g$ to be the identity function. Choosing a polynomial $g$, for example, enables computation of the covariance or higher moments of the posterior, as a way of characterizing uncertainty in the parameter.
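As an illustration of how posterior expectations are computed from a transport map, the following sketch evaluates $\int g(T(x))\,d\rho(x)$ by Gauss-Legendre quadrature in one dimension. The setting is hypothetical: the reference is uniform on $[-1,1]$, the target density is $1+Ay$ with respect to the uniform probability measure, and the exact map is used in place of an approximation $\tilde T$, so that the results can be compared with the closed-form moments $A/3$ and $1/3$.

```python
import numpy as np

A = 0.5  # slope of the illustrative target density 1 + A*y


def transport(x):
    """Map pushing the uniform reference on [-1, 1] to the density 1 + A*y."""
    return (-1.0 + np.sqrt(1.0 + A**2 + 2.0 * A * x)) / A


# Quadrature for the uniform probability measure on [-1, 1].
nodes, weights = np.polynomial.legendre.leggauss(32)
weights = weights / 2.0


def posterior_expectation(g):
    """Approximate E_pi[g(y)] = E_rho[g(T(x))] by quadrature in x."""
    return np.sum(weights * g(transport(nodes)))


mean = posterior_expectation(lambda y: y)             # conditional mean
second_moment = posterior_expectation(lambda y: y**2)
variance = second_moment - mean**2
print(mean, variance)
```

Only integrals against the tractable reference are needed; no samples from the target and no normalizing constant enter the computation.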

Remark 7.2
The forward operator being defined by the diffusion equation (7.1) is not essential to the discussion in Sect. 7. Other models such as the Navier-Stokes equations allow similar arguments [14]. More generally, the proof of Lemma 7.1 merely requires the existence of a complex differentiable function $u$ (between two complex Banach spaces) such that $u(y)=u(1+\sum_{j=1}^d y_j\psi_j)$, and $u$ does not even need to stem from a PDE.

Conclusions
In this paper, we proved several results for the approximation of the KR transport $T:[-1,1]^d\to[-1,1]^d$. The central requirement was that the reference and target densities are both analytic. Based on this, we first conducted a careful analysis of the regularity of $T$ by investigating its domain of analytic extension. This implied exponential convergence for sparse polynomial and ReLU neural network approximations. We gave an ansatz for the computation of the approximate transport, which necessitates that it is a bijective map on the hypercube $[-1,1]^d$. This led to a statement for rational approximations of $T$. Moreover, we discussed how our results can be used in the development of inference algorithms.
Most of these results are generalized and extended in [76], where we establish dimension-robust convergence in the high- or infinite-dimensional case $d\gg1$ or $d=\infty$. For future research, we intend to use our proposed ansatz, including the a priori determined sparse polynomial spaces, to construct and analyze in more detail concrete inference algorithms as outlined in Sect. 7. The present regularity and approximation results for $T$ provide crucial tools to rigorously prove convergence and convergence rates for such methods. Additionally, we will investigate similar results on unbounded domains, e.g., $\mathbb{R}^d$.

Funding Open Access funding enabled and organized by Projekt DEAL.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A Inverse Function Theorem
In the following, if O ⊆ C n is a set, by f ∈ C 1 (O; C) we mean that f ∈ C 0 (O; C) and for every open S ⊆ O it holds f ∈ C 1 (S; C).

Moreover, (c) and (A.1) imply for all
(A.2)
Now define the Banach space $X:=C^0(O;\mathbb{C})$ with norm $\|f\|_X:=\sup_{t\in O}|f(t)|$, and consider the closed subset $A:=\{f\in X:\|f-t_0\|_X\le\delta\}\subset X$ (here, by abuse of notation, $t_0:O\to\mathbb{C}$ is interpreted as the constant function in $X$). By (A.2), $t(\cdot)\mapsto S(\cdot,t(\cdot))$ maps $A$ to itself, and by (A.1) the map is a contraction there, so that it has a unique fixed point by the Banach fixed point theorem. We have shown the existence of $t\in C^0(O;B_\delta(t_0))$ satisfying $S(x,t(x))\equiv t(x)$, which is equivalent to $f(x,t(x))\equiv0$. It remains to show that $t\in C^1(O;B_\delta(t_0))$. Letting $t_0:O\to\mathbb{C}$ again be the constant function and $t_k(x):=S(x,t_{k-1}(x))$ for every $k\ge1$, it holds $t_k\to t$ in $X$, i.e., $(t_k)_{k\in\mathbb{N}}$ converges uniformly. Since $t_0\in C^1(O;\mathbb{C})$, we inductively obtain $t_k\in C^1(O;\mathbb{C})$ for all $k\in\mathbb{N}$. Since $X$ is a complex Banach space, as a uniform limit of differentiable (analytic) functions it holds $\lim_{k\to\infty}t_k=t\in C^1(O;\mathbb{C})$; see, for instance, [31, Sec. 3.1].
Finally, to see that for each $x\in O$ there exists only one $s\in B_\delta(t_0)$ such that $f(x,s)=0$ (namely $s=t(x)$), one can argue similarly as above and apply the Banach fixed point theorem to the map $s\mapsto S(x,s)$ for $x\in O$ fixed and $s\in B_\delta(t_0)$.
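The fixed-point iteration $t_k(x)=S(x,t_{k-1}(x))$ used in this argument can be illustrated numerically. The sketch below is a real-valued toy example under assumptions: the definition of $S$ is not repeated here, so we assume the simplified Newton map $S(x,t)=t-f(x,t)/\partial_tf(x_0,t_0)$, whose fixed points in $t$ are exactly the zeros of $f(x,\cdot)$. The function $f(x,t)=t+\frac14\sin(t)-x$ and the base point $(x_0,t_0)=(0,0)$ are hypothetical choices.

```python
import numpy as np


def f(x, t):
    """Illustrative implicit equation f(x, t) = 0 defining t = t(x)."""
    return t + 0.25 * np.sin(t) - x


DERIVATIVE_AT_BASE = 1.25  # d/dt f at the base point (x0, t0) = (0, 0)


def S(x, t):
    """Assumed simplified Newton map; |dS/dt| <= 0.4, so a contraction."""
    return t - f(x, t) / DERIVATIVE_AT_BASE


x = np.linspace(-0.5, 0.5, 11)
t = np.zeros_like(x)      # t_0: the constant function t0 = 0
for _ in range(60):
    t = S(x, t)           # t_k = S(x, t_{k-1}), applied pointwise in x

residual = np.max(np.abs(f(x, t)))
print(residual)
```

Every iterate is a smooth function of $x$, and the uniform contraction factor $0.4$ gives geometric convergence on the whole grid, mirroring the uniform convergence $t_k\to t$ in $X$.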
Hence, the statement follows by Lemma A.1.
The next lemma shows that G in Proposition A.2 depends continuously on F.
We will also use the following consequence of Lemma 3.3.

Lemma B.1 Let f , F, M, L and α be as in Lemma
Therefore, 1] is bijective by Lemma 3.2. We next show well-definedness ofG, i.e., these local functions coincide whenever their domain of definition overlaps.

B.3 Theorem 3.6
To prove Theorem 3.6, we start with some preliminary results investigating the functions $f_k$ in (2.2). First, we will analyze the domain of analytic extension of $T$. The following variation of Assumption 3.5 will be our working assumption on the densities. In particular, item (c) stipulates that the analytic extensions of the densities do not deviate too much. This will guarantee that the inverse CDFs can be suitably analytically extended.
Remark B.3 We could have equivalently written $\min_{l\in\{k,\dots,d\}}\varepsilon_l$ on the right-hand side of the inequality in (c).
In particular, Item (i) of the following lemma states that $x_k\mapsto f_k(x_{[k]})$ is a probability density on $[-1,1]$, and $x_{[k]}\mapsto f_k(x_{[k]})$ has the same domain of analyticity as $x_{[k]}\mapsto f(x)$. Items (iii) and (iv) are statements about how much $f_k$ varies in its variables: (iii) is mainly a technical requirement used in later proofs and (iv) will be relevant for large values of $\delta_k>0$. It states that the maximum deviation of the probability density $x_k\mapsto f_k(x_{[k]})$ from the constant 1 function is inversely proportional to $\delta_k$, i.e., is small for large $\delta_k$.

C.3 Theorem 4.5
In the proof, we will need the following elementary lemma.

Lemma C.4
Let m ∈ N 0 , k ∈ N. There exists C = C(k, m) such that In case m = 0, the constant C = C(k, 0) is independent of k.
Proof Item (i) is a consequence of the (multivariate) Leibniz rule for weak derivatives, i.e., $\partial^\nu(fg)=\sum_{\mu\le\nu}\binom{\nu}{\mu}\partial^\mu f\,\partial^{\nu-\mu}g$ for all multi-indices $\nu\in\mathbb{N}_0^k$. Item (ii) follows by $f^2-g^2=(f-g)(f+g)$ and (i).

Next let us show (iii). Due to 1− f L
, which proves the statement in case m = 0.
For general m ∈ N, we claim that for any ν ∈ N k 0 such that 1 ≤ |ν| ≤ m it holds for a constant C depending on ν but independent of f . For |ν| = 1, i.e., ν = e j = (δ i j ) k i=1 for some j, this holds by ∂ . For the induction step, Then by the induction hypothesis (C.12). Hence, for 1 ≤ |ν| ≤ m, which shows (iii).
There exist constants K ∈ (0, 1] (independent of m and k) and C = C(k, m) > 0, both independent of Q, with the following property: 15) and Proof Throughout this proof, the constant C > 0 (which may change its value even within the same equation) will depend on m, k but be independent of Q. Moreover, it will only depend on k through the constants from Lemma C.4, and thus be independent of k in case m = 0. In the following, we use several times the fact that (C.17) Set ε:= Q − p W m,∞ ([−1,1] k ) ≤ 1. Using Lemma C.4 (ii), we get where we used the triangle inequality and ε ≤ 1, which holds by assumption. In addition toT in (C.14), let , t) dt, we get with (C.17) and (C.18) Then for someC =C(k, m) > 0, which is independent of k in case m = 0. Additionally by (C.20)  for some constant C depending on m and k. Due to T (x) = −1 + by Lemma C.4 (i). In all, this shows (C.15).
To show (C.16), we proceed similarly and obtain via (C.18) and thus, using (C.22)

Proof of Theorem 4.5
Without loss of generality, in the following we assume $m\ge1$, since the statement for $m\ge1$ trivially implies the statement for $m=0$. Throughout we omit writing the $\varepsilon>0$ index and write $\tilde T_k:=\tilde T_{k,\varepsilon}$ etc.
In the second case whereCε τ > max{1, }, we simply redefine p k :=0, so thatT k (x) = x k (cp. (4.10)). Then Step 3. We estimate the error in terms of N ε . By Lemma C.2 Together with (C.27) this shows We use Lemma 5.3 to correct˜ N ,k and guarantee that it is a bijection of The last term is bounded by C 0 exp(−β N 1/(k+1) ) by definition of˜ N ,k , and where C 0 and β are independent of N ∈ N (but depend on d). Choosing N large enough, it holds inf  For the second bound, let first A, B ∈ R d×d . Then Since Together with the first statement this concludes the proof.

E.2 Theorem 6.3
We'll start by bounding | det d S − det dS|. To this end we need the following lemma. (E.1) For every x ∈ R, exp : (−∞, x] → R has Lipschitz constant exp(x). Thus, since Set a min = min j≤d a j > 0. Using again log(1 + x) ≤ x so that log(a j + |a j − b j |) = log a j 1 + Then there exists C > 0 solely depending on d, and the positive constantsS min , S min , S max in (E. 4), such that (E.6) and .
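Proof For any $a>0$ the map $\log:[a,\infty)\to\mathbb{R}$ has Lipschitz constant $\frac1a$, so that
$x|\log(x)-\log(y)|\le\frac{x}{\min\{x,y\}}|x-y|$. (E.10)
If $x>y$, then $\frac{x}{\min\{x,y\}}=\frac{x}{y}=1+\frac{|x-y|}{y}$. If $x\le y$, then $\frac{x}{\min\{x,y\}}=1\le1+\frac{|x-y|}{y}$, so that the claimed bound also holds.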

Proof of Theorem 6.3
The assumptions on S and T imply det dT (x) = Next we show (6.5). It holds inf x∈[−1,1] d f π (x) ∈ (0, 1] since f π is a positive probability density. We obtain using Lemma E.3 Finally (E.12) with α = 1 implies (6.4) for the KL divergence, which concludes the proof.