We now discuss the accuracy of the approximation of the target density f given by
$$\begin{aligned} \tilde{f}^{\mathrm {TT}}:= \tilde{f}_0^{\mathrm {Trun},\mathrm {TT}} \circ \tilde{T}^{-1} \otimes |\mathcal {J}_{\tilde{T}^{-1}}|. \end{aligned}$$
(40)
Since our approach is based on several components like transport, truncation, low-rank compression and the VMC method, these components are examined separately in the upcoming sections. Our main result is stated in Sect. 4.5.
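To illustrate the structure of (40), the following minimal sketch evaluates such a composed density in one dimension, assuming an affine transport and a stand-in approximation of the perturbed prior; all names are illustrative and not part of the method's implementation.

```python
import numpy as np

# Minimal sketch of evaluating a composed density as in (40), assuming a
# 1D affine transport T(x) = A*x + b and a stand-in approximation
# f0_tilde of the perturbed prior (all names are illustrative).
A, b = 2.0, 1.0
T_inv = lambda y: (y - b) / A
jac_T_inv = lambda y: np.full_like(y, 1.0 / A)   # |J_{T^{-1}}| for affine T
f0_tilde = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
f_tt = lambda y: f0_tilde(T_inv(y)) * np.abs(jac_T_inv(y))

# f_tt is again a probability density on Y whenever f0_tilde is one on X:
ys = np.linspace(-10.0, 12.0, 10**5)
print(np.sum(f_tt(ys)) * (ys[1] - ys[0]))        # ~ 1.0
```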
Transport invariant measures of discrepancy
In this section we derive a transfer property which ensures that the error of the approximation \(\tilde{f}_0^{\mathrm {Trun},\mathrm {TT}}\) to the perturbed prior carries over directly to the discrepancy between \(\tilde{f}^{\mathrm {TT}}\) and f. Note that this property is canonical since passing to the image space of a measurable function is fundamental in probability theory. Ideally such a relation is an equivalence of the form
$$\begin{aligned} {{\text {d}}}\left( Y; f, \tilde{f}^{\mathrm {TT}}\right) = {{\text {d}}}\left( X; \tilde{f}_0, \tilde{f}_0^{\mathrm {Trun},\mathrm {TT}}\right) . \end{aligned}$$
(41)
Prominent measures of discrepancy for two absolutely continuous Lebesgue probability density functions \(h_1\) and \(h_2\) on some measurable space Z are the squared Hellinger distance
$$\begin{aligned} {{\text {d}}}_{\mathrm {Hell}}^2(Z; h_1, h_2) := \frac{1}{2} \int \limits _{Z} \left( \sqrt{h_1(z)}-\sqrt{h_2(z)}\right) ^2\,\mathrm {d}\lambda (z), \end{aligned}$$
(42)
and the Kullback–Leibler divergence
$$\begin{aligned} {{\text {d}}}_\mathrm {KL}(Z; h_1,h_2) := \int \limits _{Z} \log \left( \frac{h_1(z)}{h_2(z)}\right) h_1(z)\,\mathrm {d}\lambda (z). \end{aligned}$$
(43)
For the Hellinger distance, the absolute continuity assumption can be dropped from an analytical point of view. We observe that both \({{\text {d}}}_{\mathrm {Hell}}\) and \({{\text {d}}}_\mathrm {KL}\) satisfy (41).
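For illustration, the squared Hellinger distance (42) can be evaluated by simple quadrature. The following sketch (our example, not from the references) compares the quadrature value for two univariate Gaussians with the known closed form \(1-\mathrm{BC}\), where \(\mathrm{BC}\) denotes the Bhattacharyya coefficient.

```python
import numpy as np

# Squared Hellinger distance (42) by quadrature for two 1D Gaussians,
# checked against the closed form 1 - BC with the Bhattacharyya
# coefficient BC (standard formula for Gaussian densities).
m1, s1, m2, s2 = 0.0, 1.0, 0.5, 1.3
pdf = lambda z, m, s: np.exp(-(z - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
z = np.linspace(-12.0, 12.0, 10**5)
dz = z[1] - z[0]
d2 = 0.5 * np.sum((np.sqrt(pdf(z, m1, s1)) - np.sqrt(pdf(z, m2, s2)))**2) * dz
bc = np.sqrt(2 * s1 * s2 / (s1**2 + s2**2)) \
     * np.exp(-(m1 - m2)**2 / (4 * (s1**2 + s2**2)))
print(d2, 1.0 - bc)   # agree up to quadrature error
```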
Lemma 1
Let \(\sharp \in \{\mathrm {Hell},\mathrm {KL}\}\). It then holds
$$\begin{aligned} {{\text {d}}}_{\sharp }(Y; f, \tilde{f}^{\mathrm {TT}} ) = {{\text {d}}}_{\sharp }(X; \tilde{f}_0, \tilde{f}_0^{\mathrm {Trun},\mathrm {TT}} ). \end{aligned}$$
(44)
Proof
We only show (44) for \(\sharp =\mathrm {KL}\) since \(\sharp =\mathrm {Hell}\) follows by similar arguments. By definition
$$\begin{aligned} {{\text {d}}}_{\mathrm {KL}}(Y; f, \tilde{f}^{\mathrm {TT}} ) = \int \limits _{Y} \log \left( \frac{f(y)}{\tilde{f}^{\mathrm {TT}}(y)}\right) f(y)\,\mathrm {d}\lambda (y), \end{aligned}$$
(45)
and the change of variables \(y = \tilde{T}(x)\) with \(\mathrm {d}\lambda (y) = |\mathcal {J}_{\tilde{T}}(x)|\,\mathrm {d}\lambda (x)\) yields the claim
$$\begin{aligned}&\int \limits _{X} \log \left( \frac{f\circ \tilde{T}(x)}{\tilde{f}^{\mathrm {TT}}\circ \tilde{T}(x)} \cdot \frac{|\mathcal {J}_{\tilde{T}}(x)|}{|\mathcal {J}_{\tilde{T}}(x)|} \right) \tilde{f}_0(x)\,\mathrm {d}\lambda (x) \nonumber \\&\quad = {{\text {d}}}_{\mathrm {KL}}(X; \tilde{f}_0, \tilde{f}_0^{\mathrm {Trun},\mathrm {TT}} ). \end{aligned}$$
(46)
\(\square \)
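The invariance stated in Lemma 1 is easily verified numerically. A sanity check for the KL case, assuming an affine transport and Gaussian stand-in densities (all names illustrative):

```python
import numpy as np

# Numerical check of (44) for the KL case: an affine transport
# T(x) = A*x + b pushes a Gaussian prior f0 and its (deliberately
# perturbed) approximation forward to Y; both KL divergences coincide.
A, b = 2.0, 1.0
f0     = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)        # prior on X
f0_apx = lambda x: np.exp(-x**2 / 2.2) / np.sqrt(2.2 * np.pi)    # N(0, 1.1)
f     = lambda y: f0((y - b) / A) / A        # pushforward densities on Y
f_apx = lambda y: f0_apx((y - b) / A) / A
xs = np.linspace(-12.0, 12.0, 10**5); dx = xs[1] - xs[0]
ys = A * xs + b;                       dy = ys[1] - ys[0]
kl_X = np.sum(np.log(f0(xs) / f0_apx(xs)) * f0(xs)) * dx
kl_Y = np.sum(np.log(f(ys) / f_apx(ys)) * f(ys)) * dy
print(kl_X, kl_Y)    # equal up to quadrature error, as in (44)
```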
Truncation error
Since our approximation scheme relies on the truncation of the density, we introduce a convenient type of decay on the outer layer of the perturbed prior.
Definition 2
(outer polynomial exponential decay) A function \(\tilde{f}_0:X\rightarrow \mathbb {R}^+\) has outer polynomial exponential decay if there exist a simply connected compact set \(K\subset X\), a polynomial \(\pi ^+\) which is positive on \(X\setminus K\), and a constant \(C>0\) such that
$$\begin{aligned} \tilde{f}_0(x) \le C \exp {(-\pi ^+(x))},\quad x\in X\setminus {K}. \end{aligned}$$
(47)
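For instance, the standard normal density \(\tilde{f}_0(x)=(2\pi )^{-d/2}\exp (-\Vert x\Vert ^2/2)\) has outer polynomial exponential decay with \(K=\overline{B_R(0)}\) for any \(R>0\), \(\pi ^+(x)=\Vert x\Vert ^2/2\) and \(C=(2\pi )^{-d/2}\).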
The error introduced by the Gaussian extension is estimated in the next lemma.
Lemma 2
(truncation error) For \(\mu \in \mathbb {R}^d\) and \(\varSigma \in \mathbb {R}^{d, d}\), let \(\tilde{f}_0\) have outer polynomial exponential decay with positive polynomial \(\tilde{\pi }^+\) and \(\tilde{C}>0\) with \(K=\overline{B_R(\mu )}\) for some \(R>0\). Then, for \(C_\varSigma = \frac{1}{2}\lambda _{\min }(\varSigma ^{-1})\) there exists a \(C=C(\tilde{C},\varSigma ,d, C_\varSigma )>0\) such that
$$\begin{aligned} \Vert \tilde{f}_0 - \tilde{f}_0^{{{\text {Trun}}}}\Vert _{L^1(X\setminus K)} \lesssim&\Vert \exp {(-\tilde{\pi }^+)}\Vert _{L^1(X\setminus K)} \\&+\varGamma \left( d/2, C_\varSigma R^2\right) \end{aligned}$$
and
$$\begin{aligned}&\left| \, \int \limits _{X\setminus {K}} \log \left( \frac{\tilde{f}_0}{f_{\varSigma ,\mu }} \right) \tilde{f}_0\,\mathrm {d}x\right| \\&\qquad \le \int \limits _{X\setminus K} \left( \frac{1}{2}\Vert x\Vert _{\varSigma ^{-1}}^2 + \tilde{\pi }^+(x)\right) e^{-\tilde{\pi }^+(x)}\,\mathrm {d}\lambda (x) \end{aligned}$$
with the incomplete Gamma function \(\varGamma \).
Proof
The proof follows immediately from the definition of \(\tilde{f}_0^{{{\text {Trun}}}}\): on \(X\setminus K\) the triangle inequality separates the decay of \(\tilde{f}_0\) from the tail mass of the Gaussian \(f_{\varSigma ,\mu }\) outside \(B_R(\mu )\), the latter being expressed by the incomplete Gamma function. \(\square \)
In the case that the perturbed prior is close to a standard normal distribution, it holds \(C\approx 1\).
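The incomplete Gamma term in Lemma 2 is easy to evaluate numerically. The following sketch (our illustration for \(\varSigma = I\) and \(\mu = 0\), where \(C_\varSigma = 1/2\)) compares the regularized tail \(\varGamma (d/2, C_\varSigma R^2)/\varGamma (d/2)\) with the empirical Gaussian mass outside \(B_R(0)\).

```python
import numpy as np
from scipy.special import gamma, gammaincc

d, R = 4, 3.0
C_Sigma = 0.5                # = lambda_min(Sigma^{-1})/2 for Sigma = I
# Gamma(d/2, C_Sigma * R^2) from Lemma 2; scipy's gammaincc is the
# regularized upper incomplete Gamma, so multiply by Gamma(d/2).
tail = gammaincc(d / 2, C_Sigma * R**2) * gamma(d / 2)

# Monte Carlo check: standard Gaussian mass outside the ball B_R(0).
x = np.random.default_rng(0).standard_normal((10**6, d))
print(tail / gamma(d / 2), np.mean(np.linalg.norm(x, axis=1) > R))
```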
Low-rank compression error
In this section we discuss the error introduced by compressing a full algebraic tensor into a tensor in low-rank tensor train format. The higher order singular value decomposition (HOSVD) (Oseledets and Tyrtyshnikov 2010) is based on successive unfoldings of the full tensor into matrices, which are orthogonalized and possibly truncated by a singular value decomposition. This algorithm leads to the following result.
Lemma 3
(Theorem 2.2 Oseledets and Tyrtyshnikov 2010) For any \(g\in {\mathcal {V}}_{\varLambda }\) and \(\varvec{r}\in \mathbb {N}^{d-1}\) there exists an extended low-rank tensor train \(g_{\varvec{r}}\in \mathcal {M}_{\varvec{r}}\) such that
$$\begin{aligned} \Vert g - g_{\varvec{r}}\Vert _{{\mathcal {V}}(\hat{X})}^2 \le \sum _{i=1}^{d-1}\sigma _i^2, \end{aligned}$$
(48)
where \(\sigma _i\) is the distance of the i-th unfolding matrix of the coefficient tensor of g in the HOSVD to its best rank \(r_i\) approximation in the Frobenius norm.
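For concreteness, the following is a minimal numpy sketch of this successive-unfolding compression (a plain TT-SVD, not tied to any particular library); the accumulated squared discarded singular values realize the right-hand side of (48).

```python
import numpy as np

def tt_svd(tensor, ranks):
    # Successive unfoldings, each orthogonalized and truncated by an SVD;
    # returns the TT cores and the sum of squared discarded singular
    # values, which bounds the squared compression error as in (48).
    d = tensor.ndim
    cores, sq_err, r_prev = [], 0.0, 1
    unfolding = tensor.reshape(tensor.shape[0], -1)
    for i in range(d - 1):
        U, s, Vt = np.linalg.svd(unfolding, full_matrices=False)
        r = min(ranks[i], len(s))
        sq_err += np.sum(s[r:] ** 2)
        cores.append(U[:, :r].reshape(r_prev, -1, r))
        unfolding = (s[:r, None] * Vt[:r]).reshape(r * tensor.shape[i + 1], -1)
        r_prev = r
    cores.append(unfolding.reshape(r_prev, -1, 1))
    return cores, sq_err

# Usage: compress a random 4-way tensor and verify the bound (48).
T = np.random.default_rng(1).random((5, 6, 7, 8))
cores, sq_err = tt_svd(T, ranks=[3, 3, 3])
approx = cores[0]
for core in cores[1:]:
    approx = np.tensordot(approx, core, axes=([approx.ndim - 1], [0]))
print(np.sum((T - approx.reshape(T.shape)) ** 2) <= sq_err + 1e-12)
```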
Remark 3
Estimate (48) is rather unspecific since the \(\sigma _i\) cannot be quantified a priori. For the special case of Gaussian densities we refer to Rohrbach et al. (2020) for an examination of the low-rank representation depending on the covariance structure. When the transport \(\tilde{T}\) maps the considered density “close” to a standard Gaussian, these results apply immediately to our setting and more precise estimates become possible.
VMC error analysis
To examine the VMC convergence in our setting, we recall the analysis of Eigel et al. (2019b) in a slightly more general manner. Analogously to Sect. 3.1, we use \(\hat{\mathcal {X}}\) and w as placeholders for any layer \(\hat{X}^\ell \) and weight \(w_\ell \) with \(\ell =1,\ldots ,L\). Here we assume that \(L^2(\hat{\mathcal {X}},w)\) is continuously embedded in \(L^1(\hat{\mathcal {X}},w)\).
Recall the cost functional \(\mathscr {J}\) from (32) defined by a loss function \(\iota \) depending on the transformed perturbed prior \(\hat{f}_0\) as in Sect. 3.3. As a first step we show compatibility conditions of two specific types of loss functions corresponding to the Kullback–Leibler divergence and the \(L^2\)-norm.
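For illustration, an empirical version of such a cost functional with the KL-type loss (49) below could be sketched as follows (uniform samples on a bounded layer; densities and bounds are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
f0_hat = lambda x: np.exp(-x**2)          # stand-in transformed prior

def J_N(g, N=10**5, a=-1.0, b=1.0):
    # Monte Carlo estimate of J(g) = int -log(g(x)) f0_hat(x) dx with
    # uniform weight on the layer [a, b], cf. the loss (49).
    x = rng.uniform(a, b, N)
    return (b - a) * np.mean(-np.log(g(x)) * f0_hat(x))

# A candidate model bounded away from zero on the layer (c_lower > 0):
print(J_N(lambda x: np.exp(-x**2 / 1.5) / 1.2))
```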
Lemma 4
(KL loss compatibility) Let \({\hat{f}_0}\in {\mathcal {V}}({\hat{\mathcal {X}}}, 0, c^*)\) for \(c^* <\infty \) and \(0<\underline{c}< \overline{c}<\infty \). Then
$$\begin{aligned} {\mathcal {V}}({\hat{\mathcal {X}}}, \underline{c}, \overline{c})\ni {\hat{g}} \mapsto \iota ({\hat{g}}, \hat{x})&= \iota (\hat{g}, \hat{x}, \hat{f}_0) \nonumber \\&:= -\log (\hat{g}(\hat{x}))\hat{f}_0(\hat{x}) \end{aligned}$$
(49)
is uniformly bounded and Lipschitz continuous on the model class \({\mathcal {M}}=\mathcal {M}_{\varvec{r}}({\hat{\mathcal {X}}}, \underline{c},\overline{c})\) if \(P_{\alpha } \in L^\infty ({\hat{\mathcal {X}}})\) for every \(\alpha \in \varLambda \). Furthermore, \(\mathscr {J}\) is globally Lipschitz continuous on the metric space \(({\mathcal {V}}({\hat{\mathcal {X}}}, \underline{c}, \overline{c}), d_{\mathcal {V}({\hat{\mathcal {X}}},\underline{c},\overline{c})})\).
Proof
The loss \(\iota \) is bounded on \(\mathcal {M}_{\varvec{r}}({\hat{\mathcal {X}}}, \underline{c},\overline{c})\) since \(0< \underline{c}< \overline{c}<\infty \). Let \({\hat{g}_1, \hat{g}_2\in \mathcal {V}(\hat{\mathcal {X}},\underline{c},\overline{c})}\). Then, by the mean value theorem applied to the logarithm,
$$\begin{aligned} |\iota ({\hat{g}}_1, \hat{x}) - \iota ({\hat{g}}_2, \hat{x})| \le \underbrace{\frac{1}{\underline{c}}\sup \limits _{\hat{x} \in {\hat{\mathcal {X}}}} \{{\hat{f}_0}(\hat{x})\}}_{=:C^*<\infty } |{\hat{g}}_1(\hat{x})- {\hat{g}}_2(\hat{x})|. \end{aligned}$$
(50)
The global Lipschitz continuity of \(\mathscr {J}\) follows by using (50) and
$$\begin{aligned} |\mathscr {J}({\hat{g}}_1)-\mathscr {J}({\hat{g}}_2)|&\le C^*\Vert {\hat{g}}_1-{\hat{g}}_2\Vert _{L^1({\hat{\mathcal {X}}},w)}\nonumber \\&\le C C^*d_{\mathcal {V}({\hat{\mathcal {X}}},\underline{c},\overline{c})}({\hat{g}}_1,{\hat{g}}_2), \end{aligned}$$
(51)
with a constant C related to the embedding of \(L^2({\hat{\mathcal {X}}},w)\) into \(L^1({\hat{\mathcal {X}}},w)\). If \({\hat{g}}_1,{\hat{g}}_2\) are in \(\mathcal {M}_{\varvec{r}}({\hat{\mathcal {X}}}, \underline{c},\overline{c})\) with coefficient tensors \(G_1\) and \(G_2\in \mathbb {TT}_{\varvec{r}}\), then by Parseval’s identity and the finite dimensionality of \(\mathcal {M}_{\varvec{r}}({\hat{\mathcal {X}}},\underline{c}, \overline{c})\) there exists \(c=c\left( \sup _{\alpha \in \varLambda } \Vert P_\alpha \Vert _{L^\infty (\hat{\mathcal {X}})}\right) >0 \) such that
$$\begin{aligned} |{\hat{g}}_1(\hat{x})-{\hat{g}}_2(\hat{x})| \le c\Vert G_1-G_2\Vert _{\ell ^2(\mathbb {T})}&= c\Vert {\hat{g}}_1-{\hat{g}}_2\Vert _{\mathcal {V}} \nonumber \\&= c\, d_{\mathcal {V}({\hat{\mathcal {X}}},\underline{c},\overline{c})}({\hat{g}}_1,{\hat{g}}_2), \end{aligned}$$
(52)
which yields the Lipschitz continuity on \(\mathcal {M}_{\varvec{r}}({\hat{\mathcal {X}}},\underline{c}, \overline{c})\). \(\square \)
Lemma 5
(\(L^2\)-loss compatibility) Let \(\hat{f}_0\in {\mathcal {V}}({\hat{\mathcal {X}}}, 0, c^*)\) for \(c^*<\infty \) and let \(0\le \underline{c}<\overline{c}<\infty \). Then
$$\begin{aligned} {\mathcal {V}}({\hat{\mathcal {X}}}, \underline{c}, \overline{c})\ni {\hat{g}} \mapsto \iota ({\hat{g}}, \hat{x}) = \iota ({\hat{g}}, \hat{x},\hat{f}_0) := |{\hat{g}}(\hat{x})-\hat{f}_0(\hat{x})|^2 \end{aligned}$$
(53)
is uniformly bounded and Lipschitz continuous on
\(\mathcal {M}_{\varvec{r}}({\hat{\mathcal {X}}},\underline{c},\overline{c})\) provided \(P_{\alpha } \in L^\infty ({\hat{\mathcal {X}}})\) for every \(\alpha \in \varLambda \).
Proof
Let \({\hat{g}}_1, {\hat{g}}_2\in \mathcal {V}({\hat{\mathcal {X}}},\underline{c},\overline{c})\). Then
$$\begin{aligned} |\iota ({\hat{g}}_1, \hat{x}) - \iota ({\hat{g}}_2, \hat{x})|&\le | {\hat{g}}_1(\hat{x}) - {\hat{g}}_2(\hat{x})|\cdot |{\hat{g}}_1(\hat{x})+{\hat{g}}_2(\hat{x})|\nonumber \\&\quad + 2|{\hat{g}}_1(\hat{x})- {\hat{g}}_2(\hat{x})|{\hat{f}_0}(\hat{x}). \end{aligned}$$
(54)
Due to \(\overline{c}<\infty \), the Lipschitz property follows as in the proof of Lemma 4 if \({\hat{g}}_1,{\hat{g}}_2\in \mathcal {M}_{\varvec{r}}({\hat{\mathcal {X}}},\underline{c},\overline{c})\). \(\square \)
Let \(\hat{g}^*_{\mathcal {M}}\) and \(\hat{g}^*_{\mathcal {M},N}\) be as in (34) and (36). The analysis examines different error components with respect to \({\hat{f}_0}\in \mathcal {V}({\hat{\mathcal {X}}},0,c^*)\) for some \(0<c^*<\infty \) defined by
$$\begin{aligned}&\mathcal {E} := \left| \mathscr {J}({\hat{f}_0}) - \mathscr {J}\left( {\hat{g}}^*_{\mathcal {M},N}\right) \right| , \end{aligned}$$
(55)
$$\begin{aligned}&\mathcal {E}_{\mathrm {app}} := \left| \mathscr {J}({\hat{f}_0}) - \mathscr {J}\left( {\hat{g}}^*_{\mathcal {M}}\right) \right| , \end{aligned}$$
(56)
$$\begin{aligned}&\mathcal {E}_{\mathrm {gen}} := \left| \mathscr {J}\left( {\hat{g}}^*_{\mathcal {M}}\right) - \mathscr {J}\left( {\hat{g}}^*_{\mathcal {M},N}\right) \right| , \end{aligned}$$
(57)
denoting the VMC, approximation and generalization error, respectively. By the triangle inequality, the VMC error can be bounded by the sum of the approximation and the generalization error, namely
$$\begin{aligned} \mathcal {E} \le \mathcal {E}_{\mathrm {app}} + \mathcal {E}_{\mathrm {gen}}. \end{aligned}$$
(58)
Due to the global Lipschitz property on \(\mathcal {V}({\hat{\mathcal {X}}},\underline{c},\overline{c})\) with \(\underline{c} > 0 \) in the setting of (49) or \(\underline{c}\ge 0\) as in (53), the approximation error can be bounded by the best approximation in \(\mathcal {M}\). In particular there exists \(C>0\) such that
$$\begin{aligned} \mathcal {E}_{\mathrm {app}} \le C \inf \limits _{v\in \mathcal {M}}\Vert \hat{f}_0-v\Vert _{\mathcal {V}({\hat{\mathcal {X}}})}^2. \end{aligned}$$
(59)
We note that such a bound by the best approximation in \(\mathcal {M}\) with respect to the \(\mathcal {V}({\hat{\mathcal {X}}})\)-norm may not be required when using the Kullback–Leibler divergence if one is directly interested in the best approximation with respect to this divergence. In that case the assumption \(\underline{c}>0\) in the construction of \(\mathcal {V}({\hat{\mathcal {X}}},\underline{c},\overline{c})\) can be relaxed since the global Lipschitz continuity of \(\mathscr {J}\) from Lemma 4 is no longer necessary. Thus, the more natural subspace of \(\mathcal {V}({\hat{\mathcal {X}}},0,\overline{c})\) consisting of functions absolutely continuous with respect to \({\hat{f}_0}\) may be considered instead.
It remains to bound the statistical generalization error \(\mathcal {E}_{\mathrm {gen}}\). For this the notion of covering numbers is required. Let \((\varOmega ,\mathcal {F},\mathbb {P})\) be an abstract probability space.
Definition 3
(covering number) Let \(\epsilon > 0\). The covering number \(\nu (\mathcal {M},\epsilon )\) denotes the minimal number of open balls of radius \(\epsilon \) with respect to the metric \(d_{\mathcal {V}({\hat{\mathcal {X}}},\underline{c},\overline{c})}\) needed to cover \(\mathcal {M}\).
Lemma 6
Let \(\iota \) be defined as in (49) or (53). Then there exist \(C_1,C_2>0\), depending only on the uniform bound and the Lipschitz constant of \(\mathcal {M}\) from Lemmas 4 and 5, respectively, such that for every \(\epsilon >0\) and \(N\in \mathbb {N}\), with N denoting the number of samples in the empirical cost functional (35), it holds
$$\begin{aligned} \mathbb {P}[\mathcal {E}_{\mathrm {gen}}>\epsilon ] \le 2\nu (\mathcal {M}, C_2^{-1}\epsilon ) \delta (\epsilon /4, N), \end{aligned}$$
(60)
with \(\delta (\epsilon ,N)\le 2\exp (-2\epsilon ^2N/C_1^2)\).
Proof
The claim follows immediately from Lemmas 4 and 5, respectively, and (Thm. 4.12, Cor. 4.19 Eigel et al. 2019b). \(\square \)
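For orientation, combining (60) with the bound on \(\delta \) yields an explicit sufficient sample size: for a prescribed failure probability \(p\in (0,1)\),
$$\begin{aligned} \mathbb {P}[\mathcal {E}_{\mathrm {gen}}>\epsilon ] \le 4\nu (\mathcal {M}, C_2^{-1}\epsilon )\exp \left( -\frac{\epsilon ^2N}{8C_1^2}\right) \le p \quad \text {whenever}\quad N \ge \frac{8C_1^2}{\epsilon ^2}\log \left( \frac{4\nu (\mathcal {M}, C_2^{-1}\epsilon )}{p}\right) . \end{aligned}$$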
Remark 4
(choice of \(\underline{c},\overline{c}\) and \({\hat{\mathcal {X}}}\)) Due to the layer-based representation in (11) and (15), we have the freedom to choose \(\underline{c}\) separately on each layer \(\hat{X}^\ell = (\varPhi ^\ell )^{-1}(X^\ell )\). In particular, assuming that the perturbed prior \(\tilde{f}_0\) decays per layer, we can choose \(\underline{c}\) according to the decay and thereby control the constant in (50).
A priori estimate
In this section we state our main convergence result.
Assumption 1
For a target density \(f:Y\rightarrow \mathbb {R}_+\) and a transport map \(\tilde{T}:X \rightarrow Y\), there exists a simply connected compact domain K such that \(\tilde{f}_0=(f\circ \tilde{T})\otimes |\mathcal {J}_{\tilde{T}}|\in L^2(K)\) has outer polynomial exponential decay with polynomial \(\pi ^+\) on \(X\setminus K\). Consider the symmetric positive definite matrix \(\varSigma \in \mathbb {R}^{d, d}\) and \(\mu \in \mathbb {R}^d\) as the covariance and mean of the outer approximation \(f_{\varSigma , \mu }\). Furthermore, let \(K=\bigcup _{\ell =1}^L \overline{X^\ell }\), where \(X^\ell \) is the image of a rank-1 stable diffeomorphism \(\varPhi ^\ell :\hat{X}^\ell \rightarrow X^\ell \) such that there exists \(0<c_\ell ^*<\infty \) with \(\hat{f}_0^\ell (\hat{x}) \le c_\ell ^*\) for \(\hat{x}\in \hat{X}^\ell \) and \(\ell = 1,\ldots ,L\).
We can now formulate the main theorem of this section regarding the convergence of the developed approximation with respect to the Hellinger distance and the KL divergence.
Theorem 1
(A priori convergence) Let Assumption 1 hold and let a sequence of sample sizes \((N^\ell )_{\ell =1}^L\subset \mathbb {N}\) be given. For every \(\ell =1,\ldots ,L\), consider bounds \(0<\underline{c}^\ell<\overline{c}^\ell <\infty \) and let \(\tilde{f}^{\mathrm {TT}}\) be defined as in (40). Then there exist constants \(C,C_\varSigma ,C^\ell ,C_\iota ^\ell >0\), \(\ell =1,\ldots ,L\), such that for \(\sharp \in \{\mathrm {KL},\mathrm {Hell}\}\) with \(p_\mathrm {Hell}=2\) and \(p_\mathrm {KL}=1\)
$$\begin{aligned} {{\text {d}}}_{\sharp }^{{p_\sharp }}(Y,f,\tilde{f}^{\mathrm {TT}})&\le C\left( \sum \limits _{\ell =1}^L \left( \mathcal {E}_{\mathrm {best}}^\ell + \mathcal {E}_{\mathrm {sing}}^\ell + \mathcal {E}_{\mathrm {gen}}^\ell \right) + \mathcal {E}_{\mathrm {trun}}^\sharp \right) . \end{aligned}$$
(61)
Here, \(\mathcal {E}_{\mathrm {best}}^\ell \) denotes the error of the best approximation \({\hat{g}_\varLambda ^{\ell ,*}}\) to \(\hat{f}_0^\ell \) in the full truncated space \(\mathcal {V}_{\varLambda }^\ell (\underline{c}^\ell ,\overline{c}^\ell ) =\mathcal {V}_{\varLambda }^\ell \cap \mathcal {V}(\hat{X}^\ell ,\underline{c}^\ell ,\overline{c}^\ell )\) given by
$$\begin{aligned} \mathcal {E}_{\mathrm {best}}^\ell := \Vert \hat{f}_0^\ell - {\hat{g}_\varLambda ^{\ell ,*}}\Vert _{\mathcal {V}(\hat{X}^\ell )} =\inf \limits _{{\hat{g}}^\ell \in \mathcal {V}_{\varLambda }^\ell (\underline{c}^\ell ,\overline{c}^\ell )} \Vert \hat{f}_0^\ell - {\hat{g}}^\ell \Vert _{\mathcal {V}(\hat{X}^\ell )}, \end{aligned}$$
\(\mathcal {E}_{\mathrm {sing}}^\ell \) is the low-rank approximation error of the algebraic tensor \(G:\varLambda \rightarrow \mathbb {R}\) associated to \(\hat{g}_\varLambda ^{\ell ,*}\) and the truncation error \(\mathcal {E}_{\mathrm {trun}}^{\sharp }\) is given by
$$\begin{aligned} \left( \mathcal {E}_{\mathrm {trun}}^\mathrm {Hell}\right) ^2&:= \Vert \exp {(-\pi ^+)}\Vert _{L^1(X\setminus K)} + \varGamma \left( d/2, C_\varSigma R^2\right) ,\\ \mathcal {E}_{\mathrm {trun}}^\mathrm {KL}&:= \int \limits _{X\setminus K} \left( \frac{1}{2}\Vert x\Vert _{\varSigma ^{-1}}^2 + \pi ^+(x)\right) e^{-\pi ^+(x)}\,\mathrm {d}\lambda (x). \end{aligned}$$
Furthermore, for any \((\epsilon ^\ell )_{\ell =1}^L \subset \mathbb {R}_+\) the generalization errors \(\mathcal {E}_{\mathrm {gen}}^\ell \) can be bounded in probability by
$$\begin{aligned} \mathbb {P}(\mathcal {E}_{\mathrm {gen}}^\ell > \epsilon ^\ell ) \le 2\nu (\mathcal {M}^\ell , C^\ell \epsilon ^\ell )\delta ^\ell (\epsilon ^\ell /4, N^\ell ), \end{aligned}$$
with \(\nu \) denoting the covering number from Definition 3 and \(\delta ^\ell (\epsilon ,N)\le 2\exp (-2\epsilon ^2N/{C_\iota ^\ell })\).
Proof
We first prove (61) for \(\sharp =\mathrm {Hell}\). Note that \(|\sqrt{a}-\sqrt{b}| \le \sqrt{|a-b|}\) for \(a,b\ge 0\) and with Lemma 1 it holds
$$\begin{aligned} {{\text {d}}}_{\mathrm {Hell}}^{{2}}(Y; f,\tilde{f}^{\mathrm {TT}})&= {{\text {d}}}_{\mathrm {Hell}}^{{2}}(X; \tilde{f}_0, \tilde{f}_0^{\mathrm {Trun},\mathrm {TT}} ) \\&\le {\nicefrac {1}{2}} \Vert \tilde{f}_0 -\tilde{f}_0^{\mathrm {Trun},\mathrm {TT}}\Vert _{L^1(K)} \\&\quad +{\nicefrac {1}{2}}\Vert \tilde{f}_0 -\tilde{f}_0^{\mathrm {Trun},\mathrm {TT}}\Vert _{L^1(X\setminus K)}. \end{aligned}$$
Since \(K = \bigcup _{\ell =1}^L \overline{X^\ell }\) and the \(X^\ell \) are bounded, there exist constants \(C(X^\ell )>0\), \(\ell =1,\ldots ,L\), such that
$$\begin{aligned} \Vert \tilde{f}_0 -\tilde{f}_0^{\mathrm {Trun},\mathrm {TT}}\Vert _{L^1(K)}&= \sum \limits _{\ell =1}^L\Vert \tilde{f}_0 -\tilde{f}_0^{\mathrm {Trun},\mathrm {TT}}\Vert _{L^1(X^\ell )}\\&\le \sum \limits _{\ell =1}^L C(X^\ell ) \Vert \tilde{f}_0 -\tilde{f}_0^{\mathrm {Trun},\mathrm {TT}}\Vert _{L^2(X^\ell )}. \end{aligned}$$
Moreover, by construction
$$\begin{aligned} \Vert \tilde{f}_0 -\tilde{f}_0^{\mathrm {Trun},\mathrm {TT}}\Vert _{L^2(X^\ell )} = \Vert \hat{f}_0^\ell -\hat{f}_0^{\ell ,\mathrm {TT},N_\ell } \Vert _{\mathcal {V}(\hat{X}^\ell )}. \end{aligned}$$
(62)
The claim follows by application of Lemmas 2, 3 and 6 together with (58).
To show (61) for \(\sharp =\mathrm {KL}\), note that by Lemma 1 and the construction (38) it holds
$$\begin{aligned} {{\text {d}}}_{\mathrm {KL}}(Y; f,\tilde{f}^{\mathrm {TT}})&= \sum _{\ell =1}^L \int _{X^\ell }\log \frac{\tilde{f}_0}{\tilde{f}_0^{\ell , \mathrm {TT}}} \tilde{f}_0\mathrm {d}\lambda (x) \nonumber \\&\quad + \int _{X\setminus K} \log \frac{\tilde{f}_0}{f_{\varSigma , \mu }} \tilde{f}_0\mathrm {d}\lambda (x). \end{aligned}$$
(63)
Using Lemma 2 we can bound the integral over \(X\setminus K\) by the truncation error \(\mathcal {E}_{\mathrm {trun}}^{\mathrm {KL}}\). Employing the loss function and cost functional of Lemma 4 yields
$$\begin{aligned} \int _{X^\ell }\log \frac{\tilde{f}_0}{\tilde{f}_0^{\ell , \mathrm {TT}}} \tilde{f}_0\mathrm {d}\lambda (x) \le \mathcal {E}_{\mathrm {app}}^\ell + \mathcal {E}_{\mathrm {gen}}^\ell . \end{aligned}$$
(64)
The claim follows by application of Lemmas 3 and 6 together with (58). \(\square \)
Polynomial approximation in weighted \(L^2\) spaces
In order to make the error bound (61) in Theorem 1 more explicit with respect to \(\mathcal E_\text {best}\), we consider the case of a smooth density function with analytic extension. The analysis follows the presentation in Babuška et al. (2010) and leads to exponential convergence rates by an iterative argument based on univariate interpolation error bounds. An analogous analysis for more general regularity classes is possible but beyond the scope of this article.
Let \({\hat{\mathcal {X}}} = \bigotimes _{i=1}^d \hat{\mathcal {X}}_i\subset \mathbb {R}^d\) be bounded and \(w=\otimes _{i=1}^d w_i\in L^\infty (\hat{\mathcal {X}})\) a non-negative weight such that \(\mathcal {C}(\hat{\mathcal {X}})\subset \mathcal {V}:=L^2(\hat{\mathcal {X}},w) = \bigotimes _{i=1}^d L^2(\hat{\mathcal {X}}_i,w_i)\).
For a Hilbert space H, a bounded set \(I\subset \mathbb {R}\) and a function \(f\in \mathcal {C}(I;H)\subset L^2(I,w;H)\) with weight \(w:I\rightarrow \mathbb {R}\), let \(\mathcal {I}_n :\mathcal {C}(I;H) \rightarrow L^2(I,w;H)\) denote the continuous Lagrange interpolation operator.
Assume that \(f\in \mathcal {C}(I;H)\) admits an analytic extension in the region of the complex plane \(\varSigma (I;\tau ) := \{z\in \mathbb {C} | {\text {dist}}(z,I)\le \tau \}\) for some \(\tau > 0\). Then, referring to Babuška et al. (2010),
$$\begin{aligned} \Vert f - \mathcal {I}_nf\Vert _{L^2(I,w;H)} \lesssim \sigma (n,\tau ) \max \limits _{z\in \varSigma (I;\tau )}\Vert f(z)\Vert _H, \end{aligned}$$
(65)
with \( \sigma (n,\tau ):=2(\rho -1)^{-1} \exp {(-n\log (\rho ))}\) and \(\rho := 2\tau /|I| + \sqrt{1 + 4\tau ^2/|I|^2}>1\). Using an iterative argument over the d dimensions, a convergence rate for the interpolation of \(f\in \mathcal {C}(\hat{\mathcal {X}};\mathbb {R})\subset L^2(\hat{\mathcal {X}},w;\mathbb {R})\) can be derived from the 1-dimensional convergence. More specifically, let \(\mathcal {I}_\varLambda :\mathcal {C}(\hat{\mathcal {X}})\rightarrow L^2(\hat{\mathcal {X}},w)\) denote the continuous interpolation operator \(\mathcal {I}_\varLambda := \mathcal {I}_{n_1}^1\circ \mathcal {I}_{n_2:n_d}^{2:d}\), written as the composition of a continuous 1-dimensional interpolation \(\mathcal {I}_{n_1}^1\) acting on \(\hat{x}_1\) with values in H and a continuous \((d-1)\)-dimensional interpolation \(\mathcal {I}_{n_2:n_d}^{2:d}\), with \(H := L^2\left( \bigotimes _{i=2}^d \hat{\mathcal {X}}_i, \otimes _{i=2}^d w_i\right) \). Then, for \(f\in \mathcal {C}(\hat{\mathcal {X}})\) it follows that
$$\begin{aligned} \Vert f - \mathcal {I}_\varLambda f\Vert&\le \Vert f - \mathcal {I}_{n_1}^1 f\Vert + \Vert \mathcal {I}_{n_1}^1(f-\mathcal {I}_{n_2:n_d}^{2:d}f)\Vert \\&\lesssim \Vert f - \mathcal {I}_{n_1}^1 f\Vert \\&\quad + \sup \limits _{\hat{x}_1\in \hat{\mathcal {X}}_1}\Vert f(\hat{x}_1)-\mathcal {I}_{n_2:n_d}^{2:d}f(\hat{x}_1)\Vert _{ H}. \end{aligned}$$
The second term of the last bound is a \((d-1)\)-dimensional interpolation error and can hence be bounded uniformly in \(\hat{x}_1\) by iterating the argument. We summarize the convergence result for \(\mathcal {E}_{\mathrm {best}}^\ell \) in the spirit of (Theorem 4.1 Babuška et al. 2010).
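A small numerical illustration of the rate (65) (our example, using Chebyshev nodes): the Runge function \(f(x)=1/(1+25x^2)\) on \(I=[-1,1]\) has poles at \(\pm i/5\) and hence admits an analytic extension on \(\varSigma (I;\tau )\) for any \(\tau <1/5\), so the interpolation error should decay like \(\rho ^{-n}\) with \(\rho = \tau + \sqrt{1+\tau ^2}\) (here \(|I|=2\)).

```python
import numpy as np
from numpy.polynomial import chebyshev

f = lambda x: 1.0 / (1.0 + 25.0 * x**2)     # analytic, poles at +/- i/5
xs = np.linspace(-1.0, 1.0, 2001)           # evaluation grid
for n in (4, 8, 16, 32):
    nodes = np.cos((2 * np.arange(n + 1) + 1) * np.pi / (2 * (n + 1)))
    coeffs = chebyshev.chebfit(nodes, f(nodes), n)   # degree-n interpolant
    err = np.max(np.abs(f(xs) - chebyshev.chebval(xs, coeffs)))
    print(n, err, (0.2 + np.sqrt(1.04)) ** (-n))     # error vs. rho^{-n}
```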
Lemma 7
Let \({\hat{f}_0}\in \mathcal {C}(\hat{X}^\ell )\subset L^2(\hat{X}^\ell ,w)\) admit an analytic extension in each variable in the region \(\varSigma (\hat{X}_i^\ell ;\tau _i^\ell )\) for some \(\tau _i^\ell >0\), \(\ell =1,\ldots ,L\), \(i=1,\ldots ,d\). Then, with \(\sigma \) from (65),
$$\begin{aligned} \inf \limits _{v\in \mathcal {V}_{\varLambda }} \Vert {\hat{f}_0} - v\Vert _{L^2(\hat{X}^\ell ,w)} \lesssim \sum \limits _{i=1}^d \sigma (n_i,\tau _i^\ell ). \end{aligned}$$
In case that \(\underline{c}\le {\hat{f}_0}(\hat{x}), {\hat{g}}^*(\hat{x}) \le \overline{c}\) is satisfied for \({\hat{g}}^*:= {{\,\mathrm{argmin}\,}}_{{\hat{g}}\in \mathcal {V}_\varLambda }\Vert {\hat{f}_0} - {\hat{g}}\Vert _{L^2(\hat{X}^\ell ,w)}\), the decay rate carries over to the space \(\mathcal {V}_\varLambda ^\ell (\underline{c}^\ell ,\overline{c}^\ell )\). If only \(\underline{c}\le {\hat{f}_0}(\hat{x})\le \overline{c}\) holds, the image of \({\hat{g}}^*\) can be restricted to \([\underline{c},\overline{c}]\), see e.g. Cohen and Migliorati (2017). This restriction in fact admits an error no larger than that of \({\hat{g}}^*\) since clipping to \([\underline{c},\overline{c}]\) cannot increase the pointwise distance to \({\hat{f}_0}\).
Remark 5
The interpolation argument on polynomial discrete spaces could be extended to other orthonormal systems such as trigonometric polynomials, which admit well-known Lebesgue constants as in Da Fies and Vianello (2013).
Remark 6
Explicit best approximation bounds for appropriately smooth weights w, as in the case of spherical coordinates, can be obtained using integration by parts techniques as in Mead and Delves (1973). There, the regularity of \(\hat{f}_0\) is measured in high-order weighted Sobolev spaces based on derivatives of w, as in the case of classical orthogonal polynomials.