Variable Transformations in combination with Wavelets and ANOVA for high-dimensional approximation

We use hyperbolic wavelet regression for the fast reconstruction of high-dimensional functions having only low-dimensional variable interactions. Compactly supported periodic Chui-Wang wavelets are used for the tensorized hyperbolic wavelet basis on the torus. With a variable transformation we are able to transfer the approximation rates and fast algorithms from the torus to other domains. We perform and analyze scattered-data approximation for smooth but arbitrary density functions using a least squares method. The corresponding system matrix is sparse due to the compact support of the wavelets, which leads to a significant acceleration of the matrix-vector multiplication. For non-periodic functions we propose a new extension method. A proper choice of the extension parameter together with the piecewise polynomial Chui-Wang wavelets extends the functions appropriately. In every case we are able to bound the approximation error with high probability. Additionally, if the function has low effective dimension (i.e. only interactions of few variables matter), we qualitatively determine the variable interactions and omit ANOVA terms with low variance in a second step in order to decrease the approximation error. This allows us to suggest an adapted model for the approximation. Numerical results show the efficiency of the proposed method.


Introduction
The distribution of data points is a key component in machine learning. In many applications a target variable has to be predicted from given high-dimensional samples. We want to reconstruct an underlying function $f$ to obtain an interpretable approximation algorithm, which allows a prediction of the target variable for new samples. We consider the domains $\Omega \in \{\mathbb{T}^d, \mathbb{R}^d, [0,1]^d\}$ and also tensor products of these cases. We consider the setting of reconstructing a $d$-dimensional function $f\colon \Omega \to \mathbb{C}$ from discrete samples on the set of nodes $\{y_1, \ldots, y_M\} \subset \Omega$, which are distributed according to the continuous density $\varrho\colon \Omega \to \mathbb{R}_+$. One main aim is to also deal with an unknown density. Besides the natural question of finding a good approximation for $f$, we want to consider the question of interpretability, i.e. analyzing the importance of the input variables and variable interactions of the function.

Motivation
The starting point of our considerations is the question whether it is possible to transfer the good approximation results and the related fast algorithms for periodic functions on the torus $\mathbb{T}^d$ to the domain $\Omega$. To investigate the scattered-data problem on the torus, we worked with the sample set $\mathcal{X}$ and the corresponding function values $\boldsymbol{f} = (f(x))_{x \in \mathcal{X}}$, and we constructed a recovery operator $S_I^{\mathcal{X}}$ in [26]. This operator computes a best least squares fit in the finite-dimensional subspace spanned by semi-orthogonal wavelets $\psi_k\colon \mathbb{T}^d \to \mathbb{R}$ with indices in the hyperbolic cross type set $I$. Assuming i.i.d. uniform samples $\mathcal{X} \subset \mathbb{T}^d$, we showed in [26, Corollary 3.22]: Let $m$ be the order of vanishing moments of the wavelets and let the sample size have logarithmic oversampling, i.e. $|\mathcal{X}| \gtrsim r|I| \log |I|$. Then for $1/2 < s < m$ there is a constant $C(r,d,s) > 0$ such that the approximation error can be bounded with high probability for functions with fixed Besov norm $\|f\|_{B^s_{2,\infty}(\mathbb{T}^d)} \le 1$; see Appendix A for the definition of the Besov space. Also in [31], uniform i.i.d. samples on the torus in combination with different basis functions perform well, and there are possibilities for dealing with the curse of dimensionality. The results in [26] for the periodic case serve as the basis for this paper. In many practical applications, we have to take the data set as it is and have no uniform samples available. For that reason, we study here the case where the given sample points $\mathcal{Y} \subset \Omega$ are sampled from an arbitrary (but possibly unknown) density $\varrho(y)$. In Figure 1.1 we illustrate some random two-dimensional samples with respect to $\varrho$. We cannot guarantee good approximation rates and stability if we use these samples directly. To this end, we investigate transformations $R$ of the given samples. The main result of this paper is the transformed approximation operator $S_I^{\mathcal{Y}} f$. Besides the interesting results for $\Omega \in \{\mathbb{T}^d, \mathbb{R}^d\}$, we use a new extension technique for the non-periodic case $\Omega = [0,1]^d$. Furthermore, we present a detailed error analysis.

The approach
Our main approach is to transform given samples $\mathcal{Y} \subset \Omega$ to the $d$-dimensional torus $\mathbb{T}^d$ by $\mathcal{X} = R(\mathcal{Y})$, using the idea of inverse transform sampling: Let $F$ be the cumulative distribution function of a distribution $\varrho$ and $X \sim \mathcal{U}([0,1])$. Then the random variable $F^{-1}(X)$ is distributed according to the distribution $\varrho$. Based on this, we give possibilities for constructing a transformation $R$ in (3.2), which transforms the samples $\mathcal{Y} \subset \Omega$ to $\mathcal{X} \subset \mathbb{T}^d$ on the torus. In Figure 1.1 we show an illustration of what our constructed transformation $R$ does with the samples. In order to investigate the scattered-data problem on $\Omega$, we then use the recovery operator $S_I^{\mathcal{X}}$ from (1.1) on the torus. This operator minimizes the $\ell_2$-loss function $\sum_{x \in \mathcal{X}} |f \circ R^{-1}(x) - S_I^{\mathcal{X}}(f \circ R^{-1})(x)|^2$ by using an iterative LSQR algorithm. To transform the approximation back, we apply the transformation $R$. We give some explicit densities and the corresponding transformations in Example 4.3 for $\Omega = \mathbb{R}^d$ and in Example 4.11 for $\Omega = [0,1]^d$. Our procedure coincides with transforming the function $f$ to the function $f \circ R^{-1}$, which is a function on the torus. In approximation theory it is known that the error decay benefits from smoothness of the function. We will introduce weighted function spaces of mixed Sobolev smoothness $H^m_{\mathrm{mix}}(\Omega, \varrho)$ in Definition 4.6, and even for non-periodic functions in Definition 4.10. For the more general function class of mixed Besov spaces we also introduce weighted spaces $B^s_{2,\infty}(\Omega, \varrho)$ on $\Omega$ in Definition 4.9. All our function space definitions rely on the definition of the periodic spaces. The relevant facts about Sobolev and Besov spaces of mixed smoothness on $\mathbb{T}^d$ are collected in Appendix A.
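The transformation of the samples can be sketched numerically. Below is a minimal Python sketch using the Cauchy density (for which, as in Example 4.3, the transformation is $R(y) = \frac{1}{\pi}\arctan y$); NumPy and the variable names are our own choices:

```python
import numpy as np

# Inverse transform sampling in reverse: if y ~ rho with CDF F, then
# F(y) is uniform on [0, 1].  For the Cauchy density the transformation
# is R(y) = arctan(y) / pi, mapping the real line onto the torus,
# identified with [-1/2, 1/2).
rng = np.random.default_rng(0)
y = rng.standard_cauchy(10000)        # samples from the Cauchy density
x = np.arctan(y) / np.pi              # transformed samples on the torus

# The transformed samples are (approximately) uniformly distributed:
hist, _ = np.histogram(x, bins=10, range=(-0.5, 0.5))
print(hist.sum())                      # 10000: all samples land on the torus
```

The histogram counts are all close to 1000, confirming that the transformed samples behave like uniform samples on the torus, which is exactly the setting the periodic recovery operator requires.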
Since we take the position that we can only learn the function where we have sample points, our aim is to find an approximation to the function $f$ which minimizes the $L_2$-error with respect to the density $\varrho$. Indeed, it turns out that for functions in the defined weighted function spaces $H^m_{\mathrm{mix}}(\Omega, \varrho)$ or $B^s_{2,\infty}(\Omega, \varrho)$ we obtain in Theorem 5.1 the same approximation rates for the $L_2(\Omega, \varrho)$-error as in the periodic setting, i.e. we show that for $1/2 < s < m$, $r > 1$ and logarithmic oversampling $|\mathcal{Y}| \gtrsim r|I| \log |I|$ there is a constant $C(r,d,s) > 0$ such that the error can be bounded with high probability for all $f$ with fixed $\|f\|_{B^s_{2,\infty}(\Omega,\varrho)} \le 1$.
Wavelets have many applications in signal processing. Most commonly they are used in compression, edge detection, noise reduction, and other signal enhancements. The broad practicality of wavelets is mainly due to their localization in time and frequency, so that many signals can be sparsely represented. Hence, hyperbolic wavelet regression is a reasonable choice for our purposes.
The approximation of non-periodic functions is more challenging than the periodic setting because of the boundary behaviour. For wavelet approximation, one possibility is to construct boundary wavelets, see [11, 21]. We avoid these complicated constructions by extending the function, similar to Fourier extension [3, 5, 20]. Especially the Chui-Wang wavelets provide an opportunity for letting the approximation extend the function itself, so that we do not have to construct the extension explicitly. We suggest choosing the transformation with a fixed but small extension parameter $\eta$, i.e. $R(y) = \eta + (1-\eta)F(y)$. More details are described in Section 5.1. Our new extension method with a properly chosen extension parameter $\eta$ relies on the compact support of the Chui-Wang wavelets. With this approach it is possible in Corollary 5.7 to end up with nearly the same approximation rate as in the periodic setting.
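For a single non-periodic variable the effect of the extension parameter can be illustrated as follows. This is a sketch assuming a uniform density on $[0,1]$ (so $F(y) = y$) and the identification $y \mapsto y - 1/2$ of the unit interval with the torus:

```python
import numpy as np

# Extension transformation R(y) = eta + (1 - eta) * F(y) for a
# non-periodic variable, here with the uniform density, F(y) = y.
# After identifying [0,1] with the torus via y -> y - 1/2, all
# transformed samples land in [-1/2 + eta, 1/2]; the boundary strip
# [-1/2, -1/2 + eta) stays sample-free and is filled by the least
# squares fit, which extends the function there.
eta = 0.1
rng = np.random.default_rng(1)
y = rng.uniform(0.0, 1.0, 1000)
x = eta + (1.0 - eta) * y - 0.5

print(x.min() >= -0.5 + eta, x.max() <= 0.5)  # True True
```

The sample-free strip is what makes the extension implicit: the compactly supported wavelets overlapping this strip are determined by the least squares fit rather than by function values.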
In many applications we do not know the underlying density $\varrho$ and are only given the samples $\mathcal{Y}$. Therefore, we first estimate the underlying density by an estimator $\tilde{\varrho}$ and construct the slightly different transformation $\tilde{R}$ in Section 6. Using an approximation operator on $\mathbb{T}^d$ is also the core idea in this case. Naturally, the approximation error depends on the quality of the density estimation. But in Theorem 6.1 we state that we expect similar approximation results as in the case where the density is known in advance. Numerical experiments confirm this result.
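As a toy stand-in for the density estimation step, one can transform with the empirical distribution function of the sample itself. The paper works with a smooth density estimate $\tilde{\varrho}$ instead; this rank-based sketch only illustrates the idea:

```python
import numpy as np

# Rank-based surrogate for the transformation when rho is unknown:
# the empirical CDF maps the sample to an (exactly equispaced) set of
# points in (0, 1), which we then shift to the torus [-1/2, 1/2).
def empirical_transform(y):
    m = len(y)
    ranks = np.argsort(np.argsort(y)) + 1     # ranks 1, ..., m
    return ranks / (m + 1.0) - 0.5            # torus coordinates

rng = np.random.default_rng(2)
y = rng.exponential(scale=2.0, size=999)      # density unknown to us
x = empirical_transform(y)

hist, _ = np.histogram(x, bins=9, range=(-0.5, 0.5))
print(hist)                                    # every bin holds 111 samples
```

A smooth density estimate yields a differentiable transformation, which the error analysis in Section 6 requires; the rank-based version above is only the crudest consistent estimator of $F$.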
To deal with the curse of dimensionality we introduce the analysis of variance (ANOVA) decomposition, see [6, 27, 18] and [29, Section 3.1.6], which decomposes the $d$-variate function into $2^d$ ANOVA terms $f_u$, i.e.
Each term corresponding to $u$ only depends on the variables $y_i$ with $i \in u$. The number of these variables is called the order of the ANOVA term. In practical applications with high-dimensional functions, often only the ANOVA terms of low order play a role in describing the function well, see [6, 25, 12, 43, 32]. For a rigorous mathematical treatment of this observation we work with functions of low effective dimension, which allow for a truncation of the hyperbolic wavelet regression. The starting point of our work is [26], where the usage of the ANOVA decomposition was also beneficial for approximating periodic high-dimensional functions.
Mathematical modelling of complex systems often requires sensitivity analysis to determine how an output variable of interest is influenced by individual or subsets of input variables. A global sensitivity analysis studies how the output uncertainty of a mathematical model is apportioned among distinct sources of input variation in the model. We transfer the classical sensitivity analysis from the torus to a weighted function space. The transformation helps to tune the hyperbolic wavelet regression. Our main suggestion is Algorithm 1, which gives a tool for approximating high-dimensional functions from arbitrarily distributed samples with independent input variables. Furthermore, it is possible to interpret the results, since we learn which input variables and variable interactions play a role and which do not.
One main advantage of our transformation approach is that we can deal with a different domain in every variable direction. In applications it is often the case that we have a mixture of periodic, non-periodic and real-valued input variables on the larger domain $\mathbb{R}$. Our proposed Algorithm 1 is also applicable in these cases. Furthermore, since we handle every input variable separately, we can use density information wherever it is available. For a typical example see Section 7.2.

Related work and other approaches
We will heavily use the results of [26], which gives approximation bounds and fast algorithms for the periodic setting on $\mathbb{T}^d$. In this paper we generalize this to a more general (tensor product) domain $\Omega$. Clearly, the main idea is inverse transform sampling. But beyond that, we study function spaces on $\Omega$ which provide enough smoothness of the transformed function to fulfill the assumptions for the periodic approximation. Further important new aspects in this paper are the extension of non-periodic functions, similar to Fourier extension, and the idea of combining density estimation and the transformation.
A nice introduction with a detailed description of the challenges in high-dimensional approximation is given in the book [1]. The change of variables was successfully used in many applications.
In [22] the authors construct a least squares approximation method for the recovery of functions from a reproducing kernel Hilbert space on $\Omega \subset \mathbb{R}^d$. The key is to construct an orthonormal basis $(\eta_k)_{k=1}^N$ in $L_2(\Omega, \varrho)$, which is in general not accessible for arbitrary or unknown densities $\varrho$. The considerations in [8, 9] are also based on the knowledge of the basis $\eta_k$. With our approach we construct the concatenated functions $(\eta_k(y))_{k \in I} = (\psi_k^{\mathrm{per}}(R(y)))_{k \in I}$, which form a semi-orthogonal basis in $L_2(\Omega, \varrho)$. It is also possible to use other basis functions on $\mathbb{T}^d$ instead of the wavelet functions, but in any case the benefit is that we have the basis in $L_2(\Omega, \varrho)$ available, even for a very general class of density functions. Furthermore, we are able to transfer the fast algorithms from $\mathbb{T}^d$ to the domain $\Omega$. A recent improvement was made in [13], where the authors use a weighted least squares algorithm with weights related to the Christoffel function and reduce the sampling budget by canceling the logarithmic factor. But they also assume that an orthogonal basis is known. Furthermore, in contrast to this literature, we give in Theorem 5.1 and Corollary 5.7 a concentration inequality for the approximation error based on the probabilistic Bernstein inequality, instead of estimating the expected value.
For the example of the Chebyshev density, which is a special case of our examples, [22, Section 10.3] and [13] propose the Chebyshev polynomials $\eta_k(y) = \cos(k \arccos(y))$ as basis in $L_2([-1,1]^d, \varrho)$, where the inner function coincides with our transformation. The case of normally distributed samples was considered in [33]; this approach also coincides with our transformation. In Section 4.2 we give more details about the connection of our weighted function spaces to those in the literature. We study the case of fixed given samples $\mathcal{Y}$. In contrast to that, the task of choosing the sampling points was solved successfully in [28] by transforming rank-1 lattices from the torus to $\mathbb{R}^d$ or the cube $[0,1]^d$.

Outline
This paper is organized as follows. In Section 2 we recall an approximation operator for periodic functions, which is based on hyperbolic wavelet regression, as well as the well-known ANOVA decomposition of a function on the $d$-dimensional torus. Section 3 describes the main idea of our approach, namely how we construct a transformation $R$. Section 4 is dedicated to the introduction of weighted function spaces. We study the spaces of dominating mixed Sobolev regularity in Section 4.1 and dominating mixed Besov regularity in Section 4.3, and end with defining similar spaces for non-periodic functions in Section 4.4. We study two settings in this paper: First, in Section 5 we assume that the underlying density is known. There we show in Theorem 5.1 and Corollary 5.7 that we can transfer the approximation rates from the torus to our setting on $\Omega$. Second, in Section 6 we investigate the setting where we are given only the samples $\mathcal{Y}$ and no density function $\varrho$. Also in this case we are able to transfer the approximation results, see Theorem 6.1. Finally, Section 7 is dedicated to the presentation of Algorithm 1, which gives an interpretable high-dimensional approximation. Our theoretical results are supplemented by numerical experiments in Section 7.2 that demonstrate the practical efficacy of our algorithm.

Preliminaries
Let us introduce the general setting and notation. Let $f\colon \Omega \to \mathbb{C}$ be a function on a $d$-dimensional domain $\Omega$. Given are the function values $\boldsymbol{f} = (f(y))_{y \in \mathcal{Y}}$ at random points $\mathcal{Y} \subset \Omega$ with $|\mathcal{Y}| = M$. These samples are i.i.d. according to the density $\varrho\colon \Omega \to \mathbb{R}$, i.e.
$\int_\Omega \varrho(y)\, \mathrm{d}y = 1$. We will assume in this paper that $\varrho(y) > 0$, since we otherwise omit parts of $\Omega$ where the density is equal to zero. Furthermore, we assume that the density is continuous, sufficiently smooth and integrable. We aim to approximate the function $f$. In the case where the density $\varrho$ is the uniform distribution, we use the usual notations $\|f\|_{L_2(\Omega)}$ and $\|f\|_{L_\infty(\Omega)}$. We focus on the case $p = 2$, since in this case we have the scalar product

Notation
The multi-dimensional Fourier coefficients on the torus are defined by $c_k(f) = \int_{\mathbb{T}^d} f(x)\, \mathrm{e}^{-2\pi\mathrm{i}\, k \cdot x}\, \mathrm{d}x$, which allows us to write every function $f \in L_2(\mathbb{T}^d)$ as a Fourier series $f(x) = \sum_{k \in \mathbb{Z}^d} c_k(f)\, \mathrm{e}^{2\pi\mathrm{i}\, k \cdot x}$. In this paper we denote by $[d]$ the set $\{1, \ldots, d\}$. We work with a transformation idea, so we will always denote the $d$-dimensional input variable of the function $f$ in $\Omega$ by $y$ and the transformed values by $x \in \mathbb{T}^d$. The subset-vector is denoted by $y_u = (y_i)_{i \in u}$ for a subset $u \subseteq [d]$. The complement of such subsets is always taken with respect to $[d]$, i.e. $u^c = [d] \setminus u$. For an index set $u \subseteq [d]$ we define the order $|u|$ as the number of elements in $u$.
We will study the cases where $\Omega = \bigtimes_{i=1}^d \Omega_i$ with $\Omega_i \in \{\mathbb{T}, \mathbb{R}, [0,1]\}$ for all $i \in [d]$. Note that a general interval $[a,b]$ with $b > a$ can be transferred to the unit interval via $y \mapsto \frac{y-a}{b-a}$. Similar to the vector notation, a subset of the domain directions is denoted by $\Omega_u = \bigtimes_{i \in u} \Omega_i$ for $u \subseteq [d]$.

Hyperbolic Wavelet Regression on the Torus
In this section we introduce an approximation operator for periodic functions. For a more detailed description see [26]. We introduce the notation $\psi_{j,k}(x) = 2^{j/2} \psi(2^j x - k)$ for $j \in \mathbb{N}_0$, $k \in \mathbb{Z}$ and a wavelet function $\psi$. We use the periodization, where $\psi^{\mathrm{per}}_{-1,0}(x)$ denotes the scaling function, as well as the tensorization, where $j = (j_i)_{i=1}^d$ with $j_i \in \{-1, 0, 1, \ldots\}$ and $k = (k_i)_{i=1}^d$ are multi-indices with $k \in I_j$. Furthermore, we introduce the parameter $n$, which always denotes the maximal level of the used wavelets, i.e. $J_n = \{j \in \mathbb{Z}^d \mid j \ge -1,\ |j|_1 \le n\}$. For short notation we introduce the index set $I_n$. To construct an approximation operator which takes given samples $\mathcal{X}$, we solve the overdetermined system $A a = f$, where $A = (\psi^{\mathrm{per}}_{j,k}(x))_{x \in \mathcal{X},\, (j,k) \in I_n}$ is the hyperbolic wavelet matrix with $M > N$. We will always denote the number of parameters, i.e. the number of columns of our wavelet matrix, by $N = |I_n|$ and the number of samples by $|\mathcal{X}| = M$. A detailed connection between the maximal wavelet level $n$ and the number of wavelet functions $N$ can be found in [26, Lemma 3.11]. We compute the coefficients $a = (A^* A)^{-1} A^* f$, which we do iteratively by minimizing the norm $\|A a - f\|_2$. This gives us the wavelet coefficients of an approximation $S_n^{\mathcal{X}} f$ to $f$. A further analysis of this operator can be found in [26, Corollary 3.22].
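The least squares step can be sketched with SciPy's sparse LSQR solver. Here a random sparse matrix stands in for the actual wavelet matrix $A$ (assembling the $\psi^{\mathrm{per}}_{j,k}(x)$ entries is omitted), so this only illustrates the solver call and the role of sparsity:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import lsqr

# Solve the overdetermined system A a = f in the least squares sense.
# The wavelet matrix is sparse because each compactly supported wavelet
# overlaps only a few sample points; a random sparse matrix of the same
# shape M x N (M > N) stands in for it here.
M, N = 400, 50
A = sparse_random(M, N, density=0.05, random_state=3, format="csr")
rng = np.random.default_rng(3)
a_true = rng.standard_normal(N)
f = A @ a_true                              # noiseless right-hand side

a = lsqr(A, f, atol=1e-12, btol=1e-12)[0]   # iterative least squares fit
residual = np.linalg.norm(A @ a - f)
print(residual < 1e-8)                      # True: f lies in the range of A
```

Each LSQR iteration costs one sparse matrix-vector product with $A$ and one with $A^*$, which is where the compact support of the wavelets pays off.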
The estimates there are valid for general compactly supported wavelets with $m$ vanishing moments, i.e.
$\int_{-\infty}^{\infty} \psi(x)\, x^\beta\, \mathrm{d}x = 0$ for $\beta = 0, \ldots, m-1$, whose periodizations form a Riesz basis for every level $j$. Because of this semi-orthogonality we have to use the dual basis $\tilde{\psi}^{\mathrm{per}}_{j,k}$, such that every function $f \in L_2(\mathbb{T}^d)$ can be decomposed accordingly. Furthermore, this decomposition gives us a connection between the wavelet coefficients, $\langle f, \psi^{\mathrm{per}}_{j,k} \rangle = \sum_{k'} \langle f, \tilde{\psi}^{\mathrm{per}}_{j,k'} \rangle\, \langle \psi^{\mathrm{per}}_{j,k'}, \psi^{\mathrm{per}}_{j,k} \rangle$.
In [26] we also introduced and analyzed the projection operator onto the wavelet space. The following estimates (2.7) are a short summary of the results from [26] for periodic functions; for the definition of the function spaces see Appendix A. The last result holds for some $r > 1$ if we have $M \gtrsim rN \log N$ and uniformly i.i.d. samples $\mathcal{X}$. In our numerical experiments we focus on Chui-Wang wavelets, where we always denote by $m$ the order of the wavelets, which equals the number of vanishing moments.

The ANOVA decomposition
The curse of dimensionality comes into play whenever one deals with high-dimensional functions. The aim of sensitivity analysis is to describe the structure of multivariate periodic functions $f$ and to analyze the influence of each variable. A frequently used concept is the following [6, 18, 27].
Definition 2.1. Let $f$ be in $L_2(\mathbb{T}^d)$. For a subset $u \subseteq [d]$ we define the ANOVA (analysis of variance) terms by (2.11). The ANOVA decomposition of a function $f\colon \mathbb{T}^d \to \mathbb{C}$ is then given by (2.12). The terms (2.11) form the unique decomposition (2.12) such that they have mean zero and are pairwise orthogonal. Additionally, the decomposition (2.5) of a function $f \in L_2(\mathbb{T}^d)$ in terms of wavelets can be written in ANOVA terms, i.e. $f_u(x_u) = \sum_{(j,k) \in I_u} a_{j,k}\, \psi^{\mathrm{per}}_{j,k}(x)$ with $I_u := \{(j,k) \mid j_{u^c} = -1,\ k \in I_j\}$.
The same connection holds true for a truncated wavelet decomposition with $|j|_1 \le n$. That means, all hyperbolic indices $(j,k) \in I_n$ can be decomposed into a disjoint union of index sets, each belonging to one ANOVA term with index $u \subseteq [d]$. This connection is illustrated in Figure 2.1 for $d = 3$ and $n = 3$. The crucial property is that for an index $u$ the corresponding functions $\psi^{\mathrm{per}}_{j,k}$ have to be constant in all directions $i \notin u$, i.e. $j_{u^c} = -1$. For further details see [26, Section 4].
To get a notion of the importance of single terms compared to the entire function, we define the variance $\sigma^2(f)$ of a function. The idea of the ANOVA decomposition is to analyze which combinations of the input variables $x_i$ play a role for the approximation of $f$. The variances of the ANOVA terms indicate their importance, hence we do the following. For subsets $u \subseteq [d]$ with $u \neq \emptyset$ the global sensitivity indices (GSI) [39] are defined as $S(u,f) := \sigma^2(f_u)/\sigma^2(f) \in [0,1]$, where the variance of the ANOVA term $f_u$ equals its squared $L_2$-norm, since the mean of the ANOVA terms is zero. The $L_2(\mathbb{T}^d)$-orthogonality of the ANOVA terms implies that the variance of $f$ can be decomposed as $\sigma^2(f) = \sum_{u \subseteq [d],\, u \neq \emptyset} \sigma^2(f_u)$. This implies $\sum_{u \subseteq [d],\, u \neq \emptyset} S(u,f) = 1$. The global sensitivity index $S(u,f)$ represents the proportion of the variance of $f$ explained by the interaction between the variables indexed by $u$. These indices can also be computed using only the wavelet coefficients of a function via the connection (2.13).
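For a function whose ANOVA terms are known in closed form, the global sensitivity indices can be checked directly. A Monte Carlo sketch for the toy function $f(y_1, y_2) = y_1 + 2y_2$ with uniform inputs (our own illustrative example, not one from the paper):

```python
import numpy as np

# For f(y1, y2) = y1 + 2*y2 on [0,1]^2 the ANOVA terms are
# f_{1} = y1 - 1/2 (variance 1/12) and f_{2} = 2*y2 - 1 (variance 4/12),
# with vanishing interaction term, so S({1}, f) = 0.2 and S({2}, f) = 0.8.
rng = np.random.default_rng(4)
y = rng.uniform(size=(200000, 2))
f = y[:, 0] + 2.0 * y[:, 1]

var_f = f.var()                    # total variance, approx 5/12
s1 = (y[:, 0] - 0.5).var() / var_f
s2 = (2.0 * y[:, 1] - 1.0).var() / var_f
print(round(s1, 2), round(s2, 2))  # 0.2 0.8
```

Since the interaction term vanishes here, the two indices sum to one, matching the decomposition of the variance into ANOVA contributions.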

Transformations of functions to the torus
In this chapter we introduce our main approach: the transformation procedure.We transform a function defined on some domain Ω to the torus, use the well-studied approximation operator for periodic functions and transform the result back to a function defined on Ω.
The univariate setting The basis for our transformation is the cumulative distribution function $F\colon \Omega \to [0,1]$, which fulfills $\frac{\mathrm{d}}{\mathrm{d}y} F(y) = \varrho(y)$.
Subsequently, we identify $[0,1]$ with the torus by the bijective mapping $y \mapsto y - \frac{1}{2}$. Similarly to the cumulative distribution function, we define the transformation $R$ in (3.2). This transformation $R$ gives us a possibility to transfer the function $f$ to a function $f \circ R^{-1}$, which has its domain on the torus. Since we require the density $\varrho$ to be positive, the cumulative distribution function is strictly monotonically increasing and has a well-defined inverse function $R^{-1}$.
In the case of a non-periodic function, we have to use an extension with parameter $\eta$, since otherwise the transformed function is not even continuous. Our transformation $R$ maps the function $f$ from $[0,1]$ to $[-1/2 + \eta, 1/2]$, which means we extend the function on the boundary strip $[-1/2, -1/2 + \eta]$ to receive a periodic function; we give more details about this extension in Section 5.1. This relation between the $L_2$-norms motivates transforming the samples $\mathcal{Y} \subset \Omega$ to the transformed samples $\mathcal{X} = R(\mathcal{Y})$ and then using an approximation operator on $\mathbb{T}$. In this paper we use the operator $S_n^{\mathcal{X}}$ defined in (2.3), but in general the transformation can be applied to any approximation operator on $\mathbb{T}$. At the end we receive an approximation which takes the given sample points $\mathcal{Y}$ and gives back a function defined on $\Omega$.
In fact, we change the function and approximate the transformed function $f \circ R^{-1}$, which is a function on $\mathbb{T}$. For the approximation operator on $\mathbb{T}$ it is known that the smoother the function, the better the approximation. However, it is not clear a priori whether the transformed function $f \circ R^{-1}$ inherits the smoothness of the function $f$ itself, even if the density is smooth enough. In the following we will show that, if we request more regularity from the density $\varrho$, the regularity in the Sobolev and Besov sense is preserved. The multivariate setting In the multivariate setting we consider the domain $\Omega = \bigtimes_{i=1}^d \Omega_i$ with $\Omega_i \in \{\mathbb{T}, \mathbb{R}, [0,1]\}$ for $i \in [d]$. We require that the input variables $y_i$ are independent, which means that the density $\varrho(y)$ is a product measure, $\varrho(y) = \prod_{i=1}^d \varrho_i(y_i)$ (3.7). We build up a $d$-dimensional transformation $R\colon \Omega \to \mathbb{T}^d$ from the one-dimensional transformations (3.2) by $R(y) := (R_1(y_1), \ldots, R_d(y_d))$ with $\frac{\mathrm{d}}{\mathrm{d}y_i} R_i(y_i) = \varrho_i(y_i)$ if $\Omega_i \in \{\mathbb{T}, \mathbb{R}\}$ and $\frac{\mathrm{d}}{\mathrm{d}y_i} R_i(y_i) = (1-\eta)\,\varrho_i(y_i)$ if $\Omega_i = [0,1]$ (3.8). From time to time we use the notation $R_u(y_u) = (R_i(y_i))_{i \in u}$, which is similar to the notation for vectors with index $u$. The inverse transformation is given componentwise. Since the Jacobian matrix is diagonal because of the product structure (3.8) of our transformation, it follows, similarly to the univariate case (3.3), that the norms coincide up to a factor involving $\tilde{d} = |\{i \in [d] \mid \Omega_i = [0,1]\}|$, the number of non-periodic variables of $f$. The norm equality (3.10) ensures that the transformation $R$ preserves the $L_2$-norm of the function $f$, up to some factor in the non-periodic setting.
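Since $R$ acts componentwise, mixed domains pose no difficulty. A sketch with one Cauchy-distributed variable on $\mathbb{R}$ and one non-periodic variable on $[0,1]$ with extension parameter $\eta$ (both densities are example choices, not prescribed by the method):

```python
import numpy as np

# Componentwise transformation R(y) = (R_1(y_1), R_2(y_2)):
# direction 1 lives on R with a Cauchy density, R_1(y) = arctan(y)/pi;
# direction 2 lives on [0,1] with a uniform density and extension
# parameter eta, R_2(y) = eta + (1 - eta) * y - 1/2.
eta = 0.1

def transform(y):
    x = np.empty_like(y)
    x[:, 0] = np.arctan(y[:, 0]) / np.pi
    x[:, 1] = eta + (1.0 - eta) * y[:, 1] - 0.5
    return x

rng = np.random.default_rng(5)
samples = np.column_stack([rng.standard_cauchy(1000),
                           rng.uniform(0.0, 1.0, 1000)])
x = transform(samples)
print(x[:, 0].min() > -0.5, x[:, 1].min() >= -0.5 + eta)  # True True
```

Every coordinate ends up on the torus, with the non-periodic coordinate additionally avoiding the extension strip, so density information can be exploited direction by direction.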
One main advantage of our transformation approach is that we additionally obtain a semi-orthogonal system on $\Omega$ with respect to $\varrho$, i.e.
$\langle \psi^{\mathrm{per}}_{j,k} \circ R,\ \psi^{\mathrm{per}}_{j',k'} \circ R \rangle_\varrho = \delta_{j,j'}\, \langle \psi^{\mathrm{per}}_{j,k},\ \psi^{\mathrm{per}}_{j,k'} \rangle$. The next chapter is dedicated to the introduction of weighted function spaces on $\Omega$, such that also the smoothness of the function is inherited, i.e. we want to generalize the equations (3.5) and (3.6) to the function spaces of dominating mixed derivatives.
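The orthogonality on $\Omega$ can be checked by Monte Carlo. For brevity we use the Fourier basis $\eta_k(y) = \mathrm{e}^{2\pi\mathrm{i} k R(y)}$ instead of the wavelets, together with the Cauchy transformation (an illustrative substitution, not the paper's basis):

```python
import numpy as np

# With y ~ rho and eta_k(y) = exp(2*pi*i*k*R(y)), the weighted inner
# products <eta_k, eta_l>_rho equal the unweighted torus inner products
# <e_k, e_l> = delta_{k,l}.  Monte Carlo with samples drawn from rho
# approximates exactly these weighted inner products.
rng = np.random.default_rng(6)
y = rng.standard_cauchy(200000)
x = np.arctan(y) / np.pi                  # R for the Cauchy density

G = np.empty((3, 3))
for k in range(3):
    for l in range(3):
        G[k, l] = np.mean(np.exp(2j * np.pi * (k - l) * x)).real

print(np.allclose(G, np.eye(3), atol=0.01))  # True
```

The sample Gram matrix is close to the identity precisely because the transformed samples are uniform on the torus; for the semi-orthogonal wavelets the same computation yields the identity across levels $j \neq j'$ and the level-wise Gram matrix within a level.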

The transformation R meets the ANOVA decomposition
For an $L_2$-function on the domain $\Omega = \bigtimes_{i=1}^d \Omega_i$ with $\Omega_i \in \{\mathbb{T}, \mathbb{R}, [0,1]\}$ for $i = 1, \ldots, d$, with respect to the density $\varrho$, it is possible to define a generalized ANOVA decomposition. We assume in this paper that the input variables $y_i$ are independent, which means that $\varrho(y)$ has the product structure (3.7). Hence, for an ANOVA index $\emptyset \neq u \subseteq [d]$ we define the marginal distributions. Then the ANOVA decomposition with respect to the measure $\varrho$ is defined by (3.11), where the ANOVA terms are expressed, analogously to (2.11), by the recursive formula (3.12); see also [14, 25, 34] for the case of $\mathbb{R}^d$. Our main idea is to transform a function $f$ from $\Omega$ to the torus $\mathbb{T}^d$. Using Definition 2.1, we have a decomposition of the periodic function $f \circ R^{-1}$ on the torus. If we transform this decomposition back to $\Omega$, we obtain the decomposition (3.11). This can be seen as follows.
Lemma 3.1. Let $\Omega_i \in \{\mathbb{T}, \mathbb{R}\}$ for all $i \in [d]$. The ANOVA terms defined in (3.12) are the same as the transformed terms of the periodic function $f \circ R^{-1}$ with the transformation defined in (3.2), i.e.
$(f \circ R^{-1})_u(R_u(y_u)) = f_u(y_u).$
Proof. We defined the ANOVA terms on $\Omega$ as well as the terms on $\mathbb{T}^d$ recursively, see (2.11) and (3.11). Hence, the claim follows by induction over the order $|u|$ with the help of the substitution $R(y) = x$. This gives the assertion.
The decomposition (3.11) of $f \in L_2(\Omega, \varrho)$ preserves the orthogonality of the ANOVA terms, since a simple substitution similar to (3.10) shows it. Hence, the variance of the ANOVA term of the transformed function, $(f \circ R^{-1})_u$, is equal to the variance of $f_u$ with respect to the density $\varrho$. Analogously to the unweighted case (2.14), we define the global sensitivity indices for functions defined on $\Omega$ by $S(u,f) := \sigma_\varrho^2(f_u)/\sigma_\varrho^2(f) \in [0,1]$.
(3.14) Remark 3.2 (Dependent input variables). Note that there is no natural way to decompose $f$ into ANOVA terms for dependent input variables. Consider the extremal case where $y_i = y_j$ for $i \neq j$: it is not possible to say which proportion of the variance belongs to $f_{\{i\}}$ or $f_{\{j\}}$. Problems arise whenever the input variables are correlated. If we integrate over some product distribution when in reality the features are dependent, we create a new data set that deviates from the joint distribution and extrapolates to unlikely combinations of features, which can induce unwanted variances in the feature decomposition. Thus, there has to be a precomputation step to avoid such dependencies. It would be possible to preprocess the given data by a PCA and a linear data transformation. Furthermore, there are approaches to generalize the ANOVA decomposition to dependent variables, see for example [19, 35]. The generalized ANOVA decomposition is very difficult to estimate, and the generalization of our approach to this setting is beyond the scope of our paper and provides an opportunity for further research.

Weighted Function spaces
In this chapter we introduce weighted function spaces on $\Omega$, which generalize smoothness from periodic functions to functions defined on $\Omega$ using the transformation $R$. The general idea is to study the smoothness of the concatenated function $f \circ R^{-1}$, since in the periodic setting we know from [26]: the higher the smoothness, the better the approximation results using hyperbolic wavelet regression. On the torus there are results for Sobolev and Besov regularity, based on the wavelet coefficient decay in these spaces. For that reason, we study for these two cases which functions on $\Omega$ are transformed to a smooth function on $\mathbb{T}^d$. In Section 4.1 we study the Sobolev norm (A.1) or (A.2), which requires the norm inequality (3.5) to preserve smoothness. For that reason we introduce in Definition 4.6 a weighted Sobolev norm on $\Omega$, which can be generalized to fractional smoothness by (4.10). In Section 4.2 we present the relation to function spaces already known from the literature. To preserve Besov regularity under the transformation $R$, we use in Section 4.3 the characterization (A.3) and introduce the weighted Besov spaces in Definition 4.9, which fulfill (3.6).
We first study the case where $\Omega_i \in \{\mathbb{T}, \mathbb{R}\}$ for all $i \in [d]$. In Section 4.4 we then show that the non-periodic setting is similar, up to some slight modifications.

Weighted Sobolev Spaces
For measuring the smoothness of the transformed function $f \circ R^{-1}$, we have to calculate the derivatives of the concatenation $f \circ R^{-1}$, see Definition (A.1). We use the transformation (3.2) and consider in this subsection only the case $\Omega_i \in \{\mathbb{T}, \mathbb{R}\}$. The slight modification for non-periodic functions, i.e. $\eta > 0$, is described in Section 4.4.
With this notation Faà di Bruno's formula applies. The Bell polynomial $B_{\alpha,i}$ can be expressed in terms of derivatives of $\varrho(y)$ up to order $\alpha - 1$ and powers of $\varrho$ up to order $4\alpha - 2i - 1$; we state it explicitly for small indices. The $L_2(\mathbb{T})$-norm of the derivatives of $f \circ R^{-1}$ can thus be expressed accordingly. For the Sobolev norm we have to sum over $\alpha$ and interchange the sums. This motivates the following generalization of the Sobolev norm to functions defined on $\Omega$.
Definition 4.1. For $m \in \mathbb{N}$ we define the function space $H^m(\Omega, \varrho)$, with a norm built from the densities $\Upsilon_{m,k}$. Note that for $m \ge 2$ we have a useful recursion formula, and we state the definition for the cases $1 \le m \le 3$ explicitly. This can be interpreted as follows: the function $f$ cannot have a large $L_2$-norm of its derivatives up to order $m$ in areas where $\varrho$ is small. One cannot expect to capture such functions, since where the density is low, we cannot approximate derivatives of the function $f$ well. Note that only in special cases, where the derivatives of the density $\varrho$ and the density itself are bounded from below and above, is this norm equivalent to a norm defined using the derivatives of the function weighted with the density $\varrho$.
Lemma 4.2. Let $m \in \mathbb{N}$ be positive and let the density and its derivatives be bounded from below and above. Proof. Since every derivative $D^k R^{-1}$ can be expressed in terms of derivatives of $\varrho$ up to order $k-1$, and we also assume that $\varrho$ itself is bounded from above and below, the terms $\Upsilon_{m,k}(y)$ can be bounded with some constants $0 < C, C' < \infty$. This yields the assertion. ii) Cauchy distribution on $\mathbb{R}$: The density is a Cauchy distribution. The corresponding transformation is $R(y) = \frac{1}{\pi} \arctan y$.
iii) Laplace distribution on $\mathbb{R}$: The density is a Laplace distribution. For fractional smoothness $s$ we define the space $H^s(\Omega, \varrho)$, where the norm $\|f\|^2_{H^s(\Omega,\varrho)}$ is defined via the Fourier coefficients of the transformed function, $c_k^\varrho(f) := c_k(f \circ R^{-1})$, with the Fourier coefficients for periodic functions defined in (2.1).
Remark 4.5. For $m = s \in \mathbb{N}_0$, the norm in the previous definition is equivalent to the norm from Definition 4.1, since the terms $\Upsilon_{m,k}(y)$ are chosen accordingly, because of the equivalence of the norms (A.1) and (A.2), see [24].

The multivariate setting
The theory from the one-dimensional case can be transferred to the d-dimensional setting. Again, we have to use (A.1) and estimate norms of derivatives of the transformed function f ˝R´1 . Using the equations (3.9), we have that where we define the multivariate analogue of (4.1) by B α,k pyq :" B αi,ki py ki q. (4.9) This motivates generalizing Definition 4.1 to multivariate Sobolev spaces of dominating mixed smoothness. Definition 4.6. For m P N and Ω i P tT, Ru we define the function space where the norm is defined by and the density Υ m,k pyq is defined by Υ m,k pyq :" where the one-dimensional functions Υ m,ki are defined in (4.4).
This H m mix pΩ, ̺q-norm is equivalent to a norm defined via the decay of the Fourier coefficients of the transformed function, c ̺ k pf ˝R´1 q, as in (A.2), which can be generalized to the case of fractional smoothness by H s mix pΩ, ̺q :" where the norm is defined by The function spaces H s mix pΩ, ̺q are defined such that the transformed function f ˝R´1 inherits the smoothness of the function f , i.e. (4.12)

Weighted Function spaces in the literature
There is a vast literature on weighted function spaces; we restrict ourselves to a few references closely related to our approach. Sobolev norms as defined in Definition 4.1, where the norms of the derivatives are measured with respect to a different density, are also considered in [14] for the case m " 1. For the one-dimensional case on Ω " R the authors showed that the norms L2pR,ψq are equivalent under certain conditions on the density ψ. We are in the special case ψpyq " 1 {̺pyq and meet these conditions in our examples. The authors also showed a norm equivalence for multivariate functions. Another example of weighted norms is [30], where the authors weight the derivatives of functions with an exponential term in order to integrate functions on R numerically.
In [40] several weighted function spaces were introduced. On R d the authors defined the weighted counterpart to the classical Sobolev spaces by introducing a weight function w, weighting all derivatives of the function f similarly, i.e. in our notation
(4.14) The derivatives of R ´1 can be bounded by since pR ´1q p1q pRpyqq " 1{̺pyq and inductively every further derivative (a sum of several terms which are fractions of polynomials of derivatives of ̺ over a power of ̺ in the denominator) either increases the power of ̺ in the denominator while adding a ̺ 1 in the numerator or increases the order of the derivatives of ̺ (but not the degree of the polynomial) in the numerator. The condition (4.13) then shows, by induction and using (4.14), the result (4.15). This again gives us that which is also true in the case m " 0 because of the condition ̺pyq ă 8. The choice w " ̺ ´m`1{2 P W d then gives for 0 ď k ď m. The multivariate case follows from the fact that ̺ is assumed to be a product density, so that the weight w is also a product weight, since the densities Υ m,k pyq defined in (4.11) are also products of the one-dimensional functions.
Note that, for instance, the constant function is in all H m pR d , ̺q, but not in W m pR d , ̺ ´m`1{2 q, since lim xÑ˘8 ̺pxq " 0. The Cauchy distribution (4.6) belongs to the set of admissible weight functions. It is also possible to extend this theory of weighted function spaces to other weight functions, for instance exponential weights. Then one has to change the definition of admissible weights, see [40, Remark 6.4], but the connection to our function spaces via Theorem 4.8 remains possible.

Weighted Besov spaces
So far we studied the smoothness of the transformed function f ˝R´1 in Sobolev spaces with dominating mixed smoothness. The advantage of Besov-Nikolskij spaces over Sobolev spaces is that they are a much more general tool for describing the smoothness properties of functions. Be aware that in this chapter we do not use the indices j and k as wavelet indices; rather, k is the index of the Fourier coefficients and j the index of the dyadic blocks into which we decompose the indices k. We use the Fourier analytic characterization of the spaces (A.3). Therefore we introduce the dyadic blocks: for j P N d 0 we define if all components belong to N 0 . Using these dyadic blocks, we decompose the Fourier series of the function. Furthermore, we introduce the Fourier coefficients for functions defined on Ω, using the Fourier coefficients of the transformed function f ˝R´1 from (2.1), by Therefore, f pyq " ÿ This yields immediately by (3.10).
For the definition in (A.3) we have to split the periodic function f ˝R´1 into dyadic blocks belonging to the indices j. For a fixed index j P N d 0 we define u " ti P rds | j i ą 0u. Similar to the decomposition into wavelet functions, one can describe a connection between Fourier coefficients and ANOVA terms, see [31], for a function g P L 2 pT d q. The splitting of the function f ˝R´1 into dyadic blocks j ‰ 0 gives, with the notation from (4.9) and β P N |u| . In the special case j " 0 we have k " 0. Therefore, we define a Besov norm for functions defined on Ω as follows. Definition 4.9. Let s ą 1 {2 and m " tsu. Then we define This definition yields the estimate In contrast to the norm f ˝R´1 B s 2,8 pT d q , which deals with the transformed function on T d , the norm defined in Definition 4.9 considers products of the function f with terms consisting of powers and derivatives of the density ̺.

A note on non-periodic functions
So far we studied function spaces for the case Ω i P tT, Ru for all i P rds. A key difficulty in the approximation on Ω " r0, 1s is the non-periodicity of the function, i.e. its behaviour at the boundary. No matter what the density ̺ looks like, a transformation that equals the cumulative distribution function cannot ensure that the function f ˝R´1 is periodic. Therefore we introduced the extension parameter η in the transformation (3.2). We denote the extension of the function f by f with f ˇˇr´1{2`η, 1 {2s " f ˝R´1 .
The L 2 -norm of the function f itself behaves like the L 2 -norm of the transformed function up to some factor, see (3.10). The same is true for the derivatives: for α P N with α ď m we have Hence, for every non-periodic direction we get the additional factor p1 ´ηq. This leads to the reasonable assumption that the extension of f at the boundary has to have a small Sobolev norm; for more details see Section 5.1. Since the factor p1 ´ηq is a constant for fixed η, we define, for the sake of simplicity, the weighted function spaces for functions on r0, 1s d in the same manner as for periodic functions with the following definition.
Definition 4.10. For m P N let the extension of f ˝R´1 to the boundary be in the Sobolev space H m mix pT d zr´1{2 `η, 1 {2s d q. Then we define the function space where the norm is defined by and the density Υ m,k pyq is defined by Υ m,k pyq :" where the one-dimensional functions Υ m,ki are defined in (4.4).
Of course, this definition can be combined with Definition 4.6 for a function with different domains Ω i for different i P rds. Analogously, the function spaces for fractional smoothness s and the mixed Besov spaces are defined as in the periodic case. If the function f has a certain smoothness m on the interval r0, 1s, the transformed function inherits the smoothness of this function, as in the periodic case. This means, if the Sobolev norm of the extended function f at the boundary is finite, we have Let us finish this excursion into the non-periodic setting with an example for the distribution ̺: the beta distribution with shape parameter α, where Γ denotes the Gamma function. For α " 1 this is the uniform distribution. For α ą 1 the cumulative distribution function is the regularized incomplete beta function, so the transformation in the case η " 0 can be computed analytically for fixed α. These functions are plotted in Figure 4.4 for different parameters α.
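As a small illustration of the analytic computation for fixed α, the following Python sketch takes the symmetric case α " 2, where the regularized incomplete beta function reduces to the closed form 3y 2 ´2y 3 on r0, 1s; the shift by ´1{2 onto the torus is our assumed normalization.

```python
import numpy as np

def beta22_cdf(y):
    """CDF of the symmetric beta distribution with shape parameter alpha = 2,
    i.e. density 6 y (1 - y) on [0, 1]; the regularized incomplete beta
    function has the closed form 3 y**2 - 2 y**3 in this special case."""
    return 3.0 * y**2 - 2.0 * y**3

def R_beta22(y, eta=0.0):
    """Transformation eta + (1 - eta) * CDF(y), shifted by -1/2 so that the
    image lies in [-1/2 + eta, 1/2] (the shift is an assumed convention)."""
    return eta + (1.0 - eta) * beta22_cdf(y) - 0.5

y = np.linspace(0.0, 1.0, 101)
x = R_beta22(y)
assert np.isclose(x[0], -0.5) and np.isclose(x[-1], 0.5)  # endpoints hit the torus boundary
assert np.all(np.diff(x) > 0)                             # R is strictly increasing
```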
Remark 4.12. The beta distribution ̺ B, 1 {2 coincides with the Chebyshev distribution, which is defined on r´1, 1s. In this case [22, Section 10.3] and [13] propose the Chebyshev polynomials as basis in L 2 pr´1, 1s d , ̺ B, 1 {2 q. This coincides with using our transformation R and the cosine basis on T d .

First Setting: Known density ̺
In this chapter we study the case where the underlying density ̺ of the samples is known and is a tensor product density (3.7). With this information we use the transformation (3.2), transform the given samples Y Ă Ω to the transformed samples X " RpYq on the torus and apply the approximation operator S X n given in (2.3). With the weighted function spaces introduced in the previous chapter, we can estimate the error of this approximation if the function f itself is in a weighted function space. In fact, we formulate the following theorem. Theorem 5.1. Let Ω i P tT, Ru, the density ̺ i be in C m´1 pΩ i q for i P rds and let m P N be the order of vanishing moments of the wavelet ψ. Let the function fulfill for all i P rds Let furthermore M be the number of samples satisfying M Á rN log N , where N " |I n | is the number of wavelet functions and r ą 1. Let Y " py j q M j"1 Ă Ω be drawn i.i.d. at random according to ̺, let f P CpΩq be a continuous function and let the samples Y be transformed to X " RpYq using (3.7). In the case 1{2 ă s ă m we have and in the case s " m we have Proof. The theory in [26] studies the behaviour of periodic functions on T d . Because of the assumptions, the function f ˝R´1 is a function on T d , and the samples X " RpYq are uniform i.i.d. Hence, [26, Corollary 3.22] is applicable to the function f ˝R´1 . Together with the definitions of the Sobolev and Besov norms for functions of mixed smoothness, for which we have (4.12) as well as (4.18), this yields the assertion.
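The transform, solve, transform-back pipeline behind this theorem can be illustrated in Python for d " 1 and the standard normal density. A hyperbolic wavelet basis is beyond a short snippet, so a small Fourier basis on T serves here as a hypothetical stand-in for the periodic approximation operator; the test function is a Gaussian.

```python
import numpy as np
from math import erf

def R_normal(y):
    """Transformation for the standard normal density: its CDF shifted to
    the torus [-1/2, 1/2), i.e. R(y) = Phi(y) - 1/2."""
    return 0.5 * erf(y / np.sqrt(2.0))

f = lambda y: np.exp(-y**2 / 2.0)          # smooth test function on Omega = R

rng = np.random.default_rng(1)
M, N = 2000, 33                            # M >~ N log N samples, N basis functions
y = rng.standard_normal(M)                 # samples drawn according to the density
x = np.array([R_normal(t) for t in y])     # transformed samples, uniform on the torus

# Least squares in a periodic Fourier basis (stand-in for the wavelet basis).
k = np.arange(-(N // 2), N // 2 + 1)
A = np.exp(2j * np.pi * np.outer(x, k))    # system matrix A
coef, *_ = np.linalg.lstsq(A, f(y), rcond=None)

# Evaluate the approximation composed with R on fresh test samples.
y_test = rng.standard_normal(500)
x_test = np.array([R_normal(t) for t in y_test])
approx = (np.exp(2j * np.pi * np.outer(x_test, k)) @ coef).real
rmse = np.sqrt(np.mean((approx - f(y_test))**2))
assert rmse < 0.05   # the transformed function is smooth, so few basis functions suffice
```

Since f ˝R´1 is periodic and smooth, already 33 basis functions reach a small residual; increasing N mimics increasing the maximal wavelet level n.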
This theorem covers the case where no non-periodic variables y i are involved. We introduced in (3.2) the extension parameter η to also deal with non-periodic functions. In the next chapter we give a similar result for the non-periodic case. Of course, these results can be combined if the domain has different parts Ω i .

Extensions of non-periodic functions
Here we study the case Ω " r0, 1s d . Of course, one can interpret a function on r0, 1s d as a (possibly discontinuous) function on the torus by gluing the endpoints together. This coincides with the transformation (3.2) with no extension, η " 0. Since the function f ˝R´1 is then discontinuous, Theorem 5.1 does not give a reasonable error decay. For that reason, the aim of this section is to show that the transformation idea also works for non-periodic functions with a reasonable choice of the extension parameter η in the transformation (3.2).
For Fourier approximation there is the approach of Fourier extension [3,5,20], where the function is continued outside of the interval r0, 1s to a smooth function; see also [2] for a nice overview and the connection to the frame approach. We use a similar approach by introducing the extension parameter η, which allows us to extend f on the boundary region r´1{2, ´1{2 `ηs in an appropriate way. On the one hand, this gives better approximation rates, but on the other hand the stability gets worse. The occurring problem is that we have to bound the condition number of the approximation matrix A, see (2.2). To circumvent this, the mentioned literature uses for instance the truncated singular value decomposition. We do not want to set up the whole matrix A, but rather use a least squares algorithm which only needs the result of matrix-vector multiplications with A and A J . We will see in this chapter that an appropriate choice of the extension parameter ensures stability.
Remember, rather than taking the cumulative distribution function of the density ̺, we use the modification Rpyq " η `p1 ´ηq ż y 0 ̺ptq dt, (5.3) where 0 ă η ! 1 is some extension parameter. In fact, we get the transformed samples X " RpYq, which are uniformly distributed on the cube Ω :" r´1{2 `η, 1 {2s d . This procedure transforms and compresses the original function f into the box Ω and allows us to extend this function to a function f defined on T d . Figure 5.1 shows an illustration of the two-dimensional domains. On the boundary region r´1{2, ´1{2 `ηs we extend the function f . As mentioned in the discussion before Definition 4.10, it is reasonable to choose an extension with Sobolev smoothness. For the following result we introduce the notation for restricting. Let the function g : Ω Ñ C fulfill the boundary conditions |g pαeiq pxq| ă 8 for all i P rds, for all x with x i P t´1 2 `η, 1 2 u and for all α " 0, . . ., m ´2, where e i denotes the unit vector pe i q j " δ i,j . Let furthermore the extension parameter fulfill where ras denotes the smallest integer bigger than a. Then there exists an extension Proof. Let us begin with the one-dimensional case. The function space V p1q n is the space of all spline functions, which are piecewise polynomials of degree m with discontinuities of the pm ´1q-th derivative only at the grid points t k 2 n`1 | ´2n ď k ď 2 n u. These grid points divide the domain T naturally into pieces of length 2 ´pn`1q . The function g is defined on 2 n`1 ´pm ´1q of them. The extension g has to be a piecewise polynomial of degree m ´1 with m ´1 pieces. To construct its coefficients, one has to solve a system of linear equations, which are independent. In fact, we have m pm ´1q coefficients and p2 `pm ´2qq pm ´1q " m pm ´1q constraints (at the boundary and the conditions for piecewise polynomials). This system always has a solution. Figure 5.2 is an illustration of the one-dimensional case. For m " 2 a simple linear interpolation between
gp´1 2 q and gp´1 2 `ηq does the job. For the multivariate case, we have to observe that we use the index-set I n of hyperbolic structure. That means we need an index j with |j| 1 ď n such that η ě m´1 2 pj i `1q for all i " 1, . . ., d. Since all indices j i are natural numbers, multiplying these inequalities and taking the d-th root leads to the condition (5.4).
Remark 5.3. If we choose instead of the hyperbolic index-set I n the tensor index-set tpj, kq | ´1 ď j ď n ´1, k P I j u, the corresponding tensor product spaces are V pdq,n :" For the case d " 1 this coincides with the previous lemma. But for d ě 2 we then need in the previous proof only an index j with |j| 8 ď n such that η ě m´1 2 pj i `1q for all i " 1, . . ., d. This leads to the condition η ě m´1 2 n`1 . Remark 5.4. Consider the ANOVA decomposition (3.11) of a function f P L 2 pr0, 1s d , ̺q. One ANOVA term f u py u q is a function which depends on only |u| variables. Transforming f u to f u ˝R´1 u needs only a |u|-dimensional extension. For that reason, it is enough to choose η ě m´1 2 n`1 . (5.5) In the remaining part of this section we focus on the one-dimensional case, because this can also be applied to high-dimensional functions with only low-dimensional interactions. The following ideas can be generalized to d ą 1, but in this case we have to omit boundary wavelets to ensure stability.
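A minimal Python helper for the choice (5.5), together with the linear extension for m " 2 mentioned in the proof of the previous lemma; the function names are illustrative only.

```python
def extension_parameter(m, n):
    """Smallest admissible extension parameter eta = (m - 1) / 2**(n + 1)
    from (5.5): large enough for the extension to fit into the spline space,
    small enough to keep the system matrix well conditioned."""
    return (m - 1) / 2 ** (n + 1)

def linear_extension(g_left, g_right, eta, x):
    """For m = 2, extend g on [-1/2, -1/2 + eta] by linear interpolation
    between the boundary values g(-1/2) = g_left and g(-1/2 + eta) = g_right."""
    t = (x + 0.5) / eta
    return (1.0 - t) * g_left + t * g_right

assert extension_parameter(2, 5) == 1 / 64
assert linear_extension(0.0, 1.0, 0.25, -0.375) == 0.5  # midpoint of the extension interval
```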
The one-dimensional case. The following lemma shows that the projection operator (2.6) applied to the extended function f indeed inherits the approximation rate set by the order of the wavelets if the non-periodic function f is smooth enough on r0, 1s. Compared to the periodic setting (2.7), the only difference is the additional term. Theorem 5.5. Let d " 1 and let n P N be the maximal wavelet level. We choose the extension parameter η as in (5.5) and the transformation (5.3), which gives Ω " r´1{2`η, 1 {2s. Then we have for the approximation error of the projection operator P n defined in (2.6), for functions f P H m pr0, 1s, ̺q, that f ´pP n f q ˝R L2pr0,1s,̺q À 2 ´nm p1 ´ηq 3 {2 f H m pr0,1s,̺q , where f " f ˝R ´1.
Proof. For j P N 0 let us split the indices k into sets depending on the support of the wavelet functions. First, we have a look at the wavelet coefficients of the extended function f with j ą n. From [26, Lemma 3.4] we have that f pmq pxqΨ m p2 j x ´kq dx, where I j,k " supp ψ per j,k Ă T and the function Ψ m is defined in (B.2). Since the extension of f to f , namely f | Tz Ω , is contained in the space of wavelet functions, the m-th derivative of f is zero at the boundary p´1{2, ´1{2 `ηq and we have that f pmq pxq " δpx `1 2 qF 1 pf q `δpx `1 2 ´ηqF 0 pf q, x P r´1 2 , ´1 2 `ηs, where δp¨q is the delta distribution and the numbers F 0 pf q and F 1 pf q depend on the boundary behavior of the function f . For the indices k P I r we split the integral (5.6), x f , ψ per j,k y " 2 j{2 2 ´jm ˜żI j,k X Ω f pmq pxqΨ m p2 j x ´kq dx `F1 pf qΨ m p2 j p´1 2 `ηq ´kq ¸, if ´1{2 `η P supp ψ per j,k . The other case, where ´1{2 P supp ψ per j,k , is analogous. The function Ψ m has the property that Ψ m pxq " 0 for x P N, see Lemma B.2. Because of the choice η " m´1 2 n`1 and the assumption j ą n, the numbers 2 j p´1 2 `ηq ´k are in t1, . . ., mu. This allows us to omit the second term in (5.7) and we get that Using the Cauchy-Schwarz inequality we obtain where the last inequality follows from (4.19) and (B.3). The Riesz basis property (2.4), which also applies to the dual wavelets, yields (5.9). Also because of the Riesz basis property of the wavelet functions we have for j ą n that (5.10). To estimate the error of the projection operator P n we have to estimate the sum of wavelet coefficients; we first insert the definition (2.6) of P n ,

(5.10) The last sum over 2 ´jm is bounded by a geometric series. Taking the square root in (5.11) gives the result, since the factor δ 3{2 m γm is a constant. Note that we obtain, at least in the one-dimensional case, the same approximation rate as in the periodic setting. In the higher-dimensional setting this is not the case, since we lose the orthogonality between different wavelet levels in the L 2 p Ωq-norm. It is also possible to estimate the L 8 -error; here the only change is the additional term 1 1´η in comparison to the periodic result (2.8). Theorem 5.6. Let d " 1 and let n P N be the maximal wavelet level. We choose the extension parameter η as in (5.5) and the transformation (5.3), and denote f " f ˝R´1 . Then we have for the approximation error of the projection operator Proof. Similar to the proof of the previous theorem we consider |f pyq ´pP n f q ˝Rpyq| " sup Using the same lines as in [26, Theorem 3.15] we have which gives the assertion.
In the following we discuss the numerical properties that arise when using such an extension. We lose some stability in the sense that the wavelet matrix A has a bigger condition number. Nevertheless, we can bound the eigenvalues of the Moore-Penrose inverse from below. The eigenvalues of the expectation matrix Λ " ˆżT ψ per j,k pxq ψ per j 1 ,k 1 pxq dx ˙pj,kqPIn,pj 1 ,k 1 qPIn are bounded by the Riesz constants γ m and δ m , see [26, Lemma 3.18]. The transformation R changes the expectation matrix to Λ " ˆżΩ ψ per j,k pxq ψ per j 1 ,k 1 pxq dx ˙pj,kqPIn,pj 1 ,k 1 qPIn .
If we choose the extension parameter η as in (5.5), it turns out that the eigenvalues of Λ do not differ much from the eigenvalues of the initial expectation matrix Λ. We show this numerically. We lose the orthogonality of the wavelets of different levels, but the entries of the matrix Λ differ from the entries of Λ only at indices where supp ψ per j,k X r´1 2 , ´1 2 `ηs ‰ ∅. For different maximal levels n, we construct the matrix Λ and calculate the extremal eigenvalues µ min p Λq and µ max p Λq. The results are summarized in Table 5.1; the extremal eigenvalues stay between the Riesz constant γ m and 1. We leave the proof that µ min p Λq ą C ą 0 for a constant C as an open problem, but the numerics indicate that this is true.
In fact, the choice of η must balance an ill-conditioned matrix for big η against a large approximation error for small η. The choice (5.5) does this job.
Error estimates. Up to now we gave estimates for the L 2 and L 8 error of the projection operator P n , (2.6), instead of the approximation operator S X n . To end this subsection, we also give for the non-periodic case an error estimate holding with high probability, similar to Theorem 5.1.
Corollary 5.7. Let d " 1, let m P N be the order of vanishing moments of the wavelet ψ, let ̺ P C m´1 pr0, 1sq be a density and let n P N be the maximal level of the wavelets. We choose the transformation R as in (5.3) with the extension parameter η as in (5.5). Let furthermore Y " py i q M i"1 Ă r0, 1s with M Á r 2 n`1 pn `1q be drawn i.i.d. at random according to ̺ with r ą 1, and denote by X " RpYq the transformed samples. Let furthermore µ min p Λq ě C ą 0. Then we have Proof. First, we denote e 2 :" f ´pP n f q L2p Ωq and e 8 :" f ´pP n f q L8p Ωq . By using the extension we lose the orthogonality of the wavelet functions even for different levels, since the L 2 -norm is then defined on Ω and not on the whole torus. Therefore, we have to modify the proof of [26, Theorem 3.20] slightly. We have f ´pS X n f q ˝R L2pr0,1s,̺q " For the last term we use the same lines as in [26, Theorem 3.20], which are based on Bernstein's inequality, to get Taking into account the event that holds with high probability, we obtain by the union bound an overall failure probability not exceeding the sum of the individual probabilities, i.e.
Collecting the bounds from the terms occurring in Theorems 5.5 and 5.6, as well as logarithmic oversampling, which means log M M À 2 ´n, we end up with f ´pS X n f q ˝R L2pr0,1s,̺q " with high probability.

Numerical Experiments
In this section we study the approximation behavior numerically for some examples to underpin our findings from Theorem 5.1 and Corollary 5.7. We consider examples where Ω P tR d , T d , r0, 1s d u.
We proceed as follows. For a maximal level n, we draw M Á N log N ≍ 2 n n d i.i.d. samples according to the density ̺, which corresponds to logarithmic oversampling. The corresponding function values are also given. For our approximation, we transfer the samples in the set Y to the torus by RpYq " X and apply the approximation operator S X n given in (2.3). A good estimator for the L 2 pΩ, ̺q-error is the root mean square error (RMSE), which is defined by for sample points Y test Ă Ω which are i.i.d. according to ̺. We always use |Y test | " 3 |Y| " 3 M . We defined in Examples 4.3 and 4.11 only one-dimensional densities; in the following we interpret the d-dimensional densities as tensor products in the sense of (3.7).
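The RMSE estimator can be sketched as follows; the sanity check with f pyq " y against the zero approximation, whose exact L2 error on r0, 1s is 1/sqrt(3), is illustrative only.

```python
import numpy as np

def rmse(approx, f, y_test):
    """Root mean square error over i.i.d. test samples drawn according to
    rho; an estimator for the L2(Omega, rho)-error of the approximation."""
    err = f(y_test) - approx(y_test)
    return np.sqrt(np.mean(np.abs(err) ** 2))

# Sanity check: f(y) = y against the zero approximation with uniform test
# samples on [0, 1]; the exact L2-error is sqrt(1/3).
rng = np.random.default_rng(2)
y_test = rng.random(100_000)
e = rmse(lambda y: np.zeros_like(y), lambda y: y, y_test)
assert abs(e - np.sqrt(1.0 / 3.0)) < 1e-2
```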

Distributions on R d
We begin with the normal distribution ̺ N from (4.5) for all one-dimensional densities ̺ i py i q. As test function we use the Gaussian f : R d Ñ R, f pyq " e ´ y 2 2 , (5.13) which is smooth. But in the sense of (4.10), we have to note that in the one-dimensional case
Due to the embedding H 2 pTq Ă B 2 2,8 pTq we also have f ˝R´1 P B 2 2,8 pTq. To investigate the smoothness further, we have a look at the terms from Definition 4.9, i.e. for s " 5 {2. We have since R ´1pxq 2 is a smooth function except at x " 1 {2. Analogously, we get for the other term This yields since the index sets J j are defined such that k ą 2 |j|´1 and |J j | " 2 |j| . The function f as well as the density ̺ N have a tensor product structure, hence we get f ˝R´1 P B 5{2 2,8 pT d q and f P B 5{2 2,8 pΩ, ̺ N q.
We did the approximation for d P t1, 2, 3u and for the order m P t2, 3u of vanishing moments of the wavelet. In Figure 5.3 we plot the results. One can see that we end up with the proposed error decay rates from Theorem 5.1. In the case m " 2, we are in the case (5.2) and get the proposed error decay rate of 2 ´2n n pd´1q{2 . If we increase the order of vanishing moments of the wavelets to m " 3, we are in the case (5.1) and also receive numerically the proposed error decay of 2 ´5{2n n pd´1q{2 . The numerical results are even slightly better in some cases. Note that a different density ̺pyq can lead to a different smoothness of f ˝R´1 even for the same function f . For a tensor product of the one-dimensional Cauchy distribution ̺ C , (4.6), or the Laplace distribution ̺ L , (4.7), we even have f ˝R´1 P H s mix pΩ, ̺q for all s P N. That follows from D α f pyq " ppyqe ´y2 {2 for a polynomial ppyq and the fact that all derivatives of the Cauchy as well as the Laplace distribution are polynomials or polynomials with an exponential factor e k|y| . Furthermore, all integrals of the form ş R e ´y2 {2`k|y| ppyq dy are finite. In Figures 5.4 and 5.5 we plot the approximation results. In this case the order of vanishing moments determines the error decay. In dimension d " 3 with the Laplace distribution we are still in the preasymptotic regime.

The beta distribution on the torus
Let us consider the beta distribution (4.20), shifted by ´1{2 to the torus T " r´1{2, 1 {2q, for all one-dimensional densities ̺ i px i q; this also includes the uniform distribution on T for α " 1. We choose a test function which is the tensor product of a polynomial of degree 6, has triple zeros at 1 {2 and is in H 3 mix pT d q. Depending on the choice of the parameter α of the beta distribution, the transformed function has different regularity. We consider first the one-dimensional case. The crucial points for deciding whether the norm f L2pT,̺B,αq is finite are the points ˘1{2 of lower regularity. Because of the symmetry of the function f as well as of the density ̺, it is sufficient to have a look at the point y " 1 {2. There we have the behavior f piq " py ´1{2q 3´i for i " 0, 1, 2, 3. With the same arguments we have that ̺ piq B,α " py ´1{2q α´1´i for α ‰ 1, 2 and i " 0, 1, 2. Furthermore, the integral ş 1 0 x k dx is finite for k P R with k ą ´1. Hence, Definition 4.1 gives what we have to require. Since the function f is a tensor product, we have the same estimates for the multivariate cases. We used wavelets with m " 3 vanishing moments, which limits the maximal error decay rate; Figure 5.6 shows the resulting numerical approximation decay for different parameters α and d. Indeed, if we are below the critical values 6 {5, 2 and 6 for α, we get the desired approximation rates of 2 ´3n , 2 ´2n and 2 ´n given in Theorem 5.1.
Even dense samples at the boundary, which corresponds to small α, cannot increase the error decay rate beyond 2 ´3n .

Extensions of non-periodic functions on the cube
Here we want to demonstrate the benefits of the extension proposed in Section 5.1. Let us study the non-periodic function f : r0, 1s Ñ R, f pyq " y 3 , (5.15) and the uniform distribution on the cube, ̺pyq " 1. Also for this non-periodic function we manage to use the periodic approximation operator and get good approximation results. We use a polynomial of degree 3, since a lower degree together with the order m of the wavelets leads to a function f ˝R´1 which lies in the finite-dimensional function space we use for the approximation and gives approximation errors near machine precision. The results are plotted in Figure 5.7. We see that the extension increases the approximation rate to 2 ´mn , as proposed in Corollary 5.7. Additionally, the wavelet matrix is well-conditioned in both cases.

Second Setting: Using kernel density estimation for unknown density ̺
When working with real-world data, the underlying density ̺pyq is possibly not known a priori; we only have the given random sample points Y. For that reason we want to adapt our strategy to this setting. A transformation of the given data Y is also in this case a useful tool to approximate a function f well. Instead of using the transformation R belonging to the underlying density ̺ as in Chapter 5, we approximate the underlying density function by a kernel density estimation, [15]. The cumulative distribution function R of the estimated density gives us a transformation of the underlying data set Y to the samples X " RpYq on the torus. Then we apply our approximation method for functions on the torus and at the end transform the function back to a function defined on Ω. We restrict our study in this chapter to the one-dimensional case, which we apply in Chapter 7 to high-dimensional functions.
The error between the estimated density and the true density influences the total approximation error.
Let us introduce the kernel density estimator ̺pyq " where k is a non-negative kernel function k : Ω Ñ R, normed by ş Ω kpyq dy " 1, and σ is a smoothing parameter. Frequently used kernels are the standard normal distribution ̺ N , (4.5), or B-splines (B.1). The normalization ensures that ̺ is normalized as well. We get a transformation R : Ω Ñ T which fulfills (3.1) by integration, Rpyq " where K is the antiderivative of the kernel k. Hence, the integral K of the kernel k has to be calculated once in advance; the calculation of the integral of ̺ is then only the evaluation of K at different points. Using this transformation, we get our transformed sample points by X " RpYq. If ̺ is a good approximation to ̺, these samples X are nearly uniformly distributed on T, which allows us to approximate the function f ˝R ´1 using an approximation operator on T d , for instance S X n from (2.3). Our procedure to get an approximation of f out of the samples X " RpYq and the corresponding function values is summarized in Figure 6.1. For short notation, we denote the error function ef : Ω Ñ R by ef " f ´´S X n pf ˝R ´1q ¯˝R , which we aim to estimate. Using the theory of the previous section, we obtain a bound for ef L2pΩ,̺q in the norm with density ̺ if we choose the bandwidth σ so that ̺ « ̺, which yields that the samples X " RpYq are almost uniformly distributed on T. Since the original samples Y are distributed according to ̺ and we assume new test points are also sampled according to ̺, we are interested in the L 2 pΩ, ̺q-error ef L2pΩ,̺q . Intuitively, if ̺ and ̺ are close enough, these two errors have the same behavior. This can be made more precise by Theorem 6.1. In the case Ω " T we have that In the case Ω " R there is a set E Ă R such that ş E ̺pyqdy ď ε for some ε ą 0.
In the case Ω " R we cannot use this calculation, since ef pyq can be non-zero on the whole real axis. But splitting R into the set E and the complement RzE, together with the previous estimates, gives the bound for ef . Therefore, the assertion follows.
The introduction of the set E in the case Ω " R accounts for the fact that the given data Y is contained in a finite interval ra, bs and ̺pyq is small outside this interval. In fact, we cannot expect to approximate a function well where we have no information about it.
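A minimal Python sketch of the estimated transformation: a Gaussian kernel, whose antiderivative K is the normal CDF, and for simplicity the image is normalized to r0, 1s rather than the torus. Applying the estimated CDF to its own samples yields nearly uniformly distributed points.

```python
import numpy as np
from math import erf

def K(x):
    """Antiderivative of the standard normal kernel (its CDF)."""
    return 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def kde_transform(samples, sigma):
    """Estimated transformation R~(y) = (1/M) * sum_j K((y - y_j) / sigma),
    the CDF of the kernel density estimate; evaluating K replaces any
    numerical integration of the estimated density."""
    def R_tilde(y):
        return np.mean([K((y - yj) / sigma) for yj in samples])
    return R_tilde

rng = np.random.default_rng(3)
Y = rng.normal(loc=1.0, scale=2.0, size=500)   # density unknown to the method
R = kde_transform(Y, sigma=0.5)
X = np.array([R(y) for y in Y])                # close to uniform on [0, 1]
assert 0.0 < X.min() and X.max() < 1.0
assert abs(X.mean() - 0.5) < 1e-9   # exact by the symmetry K(a) + K(-a) = 1
```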
The previous theorem shows that a good estimator for the bandwidth σ ensures that the mean integrated squared error (MISE)
is small.This choice of the smoothing parameter has to be a good trade-off between over-and underfitting.Consider the two extremal cases, which do not work.This behavior is illustrated in Figure 6.2.
• If we chose a small smoothing parameter σ (σ " 0.01 in Figure 6.2), or even the delta distribution as kernel function k, we would get an equispaced sample set RpYq. But this yields no smooth cumulative distribution function R, and the estimated density ̺ is not a good approximation of ̺, since MISEp̺q would not decay.
• If we choose the parameter σ in the kernel function k too big (σ " 1 in Figure 6.2), the density ̺ does not capture the behavior of the density ̺ and the transformed samples are not uniformly distributed on T.

Smoothing parameter selection
There are some very simple and easily computed formulas for estimating the smoothing parameter σ; they are often called rules of thumb (ROT). One possibility is where std is the standard deviation and IQR is the inter-quartile range, see [15, Section 4.2.1]. The assumption behind this rule is that the unknown density belongs to the family of normal distributions.
In practice, we do not know whether ̺pyq is a normal distribution. If it is, then σ ROT gives the optimal smoothing parameter. If not, σ ROT will still give a parameter not too far from the optimum, provided the distribution of the samples Y is not too different from the normal distribution.
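A Python sketch of the rule of thumb; the constant 1.06 is the common normal-reference value and is an assumption here, since the exact constant of (6.4) is not reproduced above.

```python
import numpy as np

def sigma_rot(samples):
    """Rule-of-thumb bandwidth: normal-reference constant times the robust
    scale min(std, IQR / 1.34), scaled by M**(-1/5). The constant 1.06 is
    the usual normal-reference value (an assumption, compare (6.4))."""
    M = len(samples)
    std = np.std(samples, ddof=1)
    iqr = np.subtract(*np.percentile(samples, [75, 25]))
    return 1.06 * min(std, iqr / 1.34) * M ** (-1 / 5)

rng = np.random.default_rng(4)
Y = rng.standard_normal(1000)
sigma = sigma_rot(Y)
assert 0.1 < sigma < 0.5   # roughly 1.06 * 1000**(-1/5) ~ 0.27 for standard normal data
```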

Data on the real axis
Another approach, which is more general and performs better than the ROT, is the direct plug-in selector (DPI), see [15, Section 4.2.2]. To describe this approach, we have to introduce some notation. The second moment of the kernel k is defined by μ₂(k) = ∫ y² k(y) dy. Since the MISE is the error criterion to be minimized, our aim is to find its minimizer. The dominating part of the MISE is denoted by AMISE, which stands for asymptotic MISE.

The minimizer σ_AMISE is given by

σ_AMISE = ( R(k) / (μ₂(k)² Ψ₄ M) )^{1/5},  where  Ψ_r = ∫ ϱ^{(r)}(y) ϱ(y) dy

and r is an even number. The naming convention for Ψ_r was introduced in [41, Section 3.5]. The critical step is to estimate Ψ₄ in (6.5), as this is the only unknown quantity. Assuming that ϱ is a normal distribution leads to (6.4); this is an example of a zero-stage plug-in selector, a terminology inspired by the fact that Ψ₄ is estimated by directly plugging in a parametric assumption. Another possibility is to estimate Ψ₄ non-parametrically and then plug it into σ_AMISE. First note that integration by parts gives Ψ_r = (−1)^{r/2} ∫ (ϱ^{(r/2)}(y))² dy. Therefore, a possible way to estimate Ψ_r is

Ψ̂_r(g) = M⁻² ∑_{i=1}^{M} ∑_{j=1}^{M} k_g^{(r)}(X_i − X_j),

where g is the smoothing parameter of a kernel density estimation. Typically, two stages are considered to obtain a good trade-off between bias and variance. This is the method proposed in [38], which performs the following steps: first, Ψ₈ is estimated under a normal-scale assumption, Ψ₈ = 105 / (32 √π std(ϱ̃)⁹), where std(ϱ̃) is an estimate of the standard deviation of ϱ, which can be std(X) or min{std(X), IQR(X)/1.34}.
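A minimal sketch of the non-parametric Ψ_r estimator for r = 4 with the Gaussian kernel; this is our illustration, the bandwidth g = 0.5 and the sample are arbitrary. It uses the Hermite identity φ⁽⁴⁾(u) = (u⁴ − 6u² + 3)φ(u) and includes the diagonal terms i = j.

```python
import numpy as np

def psi4_hat(samples, g):
    """Estimate Psi_4 via M^-2 sum_{i,j} k_g^(4)(X_i - X_j) with a Gaussian
    kernel, where k_g^(4)(x) = g^-5 phi^(4)(x / g)."""
    x = np.asarray(samples, dtype=float)
    u = (x[:, None] - x[None, :]) / g
    phi4 = (u**4 - 6 * u**2 + 3) * np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return phi4.sum() / (len(x) ** 2 * g**5)

rng = np.random.default_rng(2)
X = rng.normal(size=2000)
# For the standard normal density, Psi_4 = 3 / (8 sqrt(pi)) ~ 0.212.
print(psi4_hat(X, g=0.5))
```

The pairwise double sum costs O(M²); for large M one would bin the data or subsample before this step.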

Non-periodic data
A general problem with kernel density estimation is that difficulties can arise at and near the boundaries. In many practical situations the values of a random variable are bounded; for example, the age of a person obviously cannot be negative. On the other hand, the normal kernel has unbounded support, and even if a kernel with finite support is used, the estimated density can go beyond the permissible domain.
For that reason we use compactly supported kernels k on Ω = [0, 1], but we allow the estimated density ϱ̃ to be nonzero outside the interval [0, 1]. In particular, we obtain a nonzero density on the interval Ω̃ = [ω₁, ω₂] with −|supp k| σ/2 ≤ ω₁ ≤ 0 and 1 ≤ ω₂ ≤ 1 + |supp k| σ/2. This allows us to construct a function f̃ which smoothly extends f to the whole of Ω̃, such that the transformed function f̃ ∘ R̃⁻¹ becomes periodic. This idea of extending the function is similar to the studies in Section 5.1. The choice of the extension parameter η is now hidden in the choice of the smoothing parameter σ, which determines the interval Ω̃; this is illustrated in Figure 6.3. The kernel density estimation of ϱ can be seen as a periodization of the function f, so that we can use approximation operators for functions on T. In contrast to the tent transformation, see [28], which passes through the function forth and back, there is no need to double the number of sample points. To select the smoothing parameter σ, one simple possibility is again the estimator σ_ROT from (6.4). The extension f̃ is similar to the method proposed in Section 5.1, but here the extension naturally occurs on both sides of the interval because of the kernel density estimation. Instead of the factor 1/(1 − η) we now have the term max_{y∈[0,1]} ϱ̃(y)/ϱ(y).
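The transformation itself is simply the CDF of the estimated density. The sketch below is our illustration, not the paper's implementation, and uses the Gaussian kernel for simplicity (on [0, 1] one would use the B-spline kernel): R̃(y) = M⁻¹ Σᵢ Φ((y − Xᵢ)/σ), and pushing the samples through R̃ yields nearly uniformly distributed points.

```python
import numpy as np
from math import erf, sqrt

def make_transform(samples, sigma):
    """R~(y) = M^-1 sum_i Phi((y - X_i)/sigma): CDF of the Gaussian KDE."""
    x = np.asarray(samples, dtype=float)
    def R(y):
        u = (np.atleast_1d(y).astype(float)[:, None] - x[None, :]) / sigma
        Phi = 0.5 * (1.0 + np.vectorize(erf)(u / sqrt(2.0)))
        return Phi.mean(axis=1)
    return R

rng = np.random.default_rng(3)
Y = rng.laplace(size=500)        # data from an "unknown" density
R = make_transform(Y, sigma=0.3)
U = np.sort(R(Y))                # transformed samples, nearly uniform on [0, 1]
print(U.min(), U.max())
```

Because R̃ is smooth and strictly increasing, it can be inverted numerically, which is what the approximation operators on the torus require.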

Numerical experiments
In this section we support our theoretical findings by two numerical experiments, on R and on [0, 1]. We compare the approximation results with unknown density to the results obtained when the density ϱ is known. In both cases we use Chui-Wang wavelets of order m = 3.
The Gauss kernel on the real axis. For the case Ω = R we choose k(y) = ϱ_N(y), the standard normal density (4.5), and estimate the smoothing parameter σ_DPI as described above. To study the performance of our algorithm, we again use the test function (5.13) and investigate the densities ϱ_N (4.5), ϱ_C (4.6) and ϱ_L (4.7). We follow the same procedure as described in Section 5.2 and study the resulting RMSE (5.12). We used the two proposed parameter selection methods and plotted the results in Figure 6.4. It is reasonable to compare the results with the approximation error of the previous section, where the density ϱ is assumed known, see (3.4) in Figure 6.4. One can see that our approximation approach works well for all three examples without knowing the density: for both investigated smoothing parameter selectors we end up with nearly the same error that we get when the density is known. Furthermore, we examined the bound in Theorem 6.1. We choose E as the interval where we do not expect data. Then we evaluated the right-hand side of (6.3) numerically for the choice σ_DPI. Figure 6.4 shows that this is indeed a good upper bound for the approximation error obtained with σ = σ_DPI.

The polynomial kernel on the cube. Let the density be the beta distribution ϱ_{B,α} from (4.20) with shape parameter α ∈ {1/2, 1, 2}, and let us study the test function f(y) = e^y. We use a B-spline kernel k(y) = B₃(y), see (B.1); in this case the integral K(y) can be calculated easily. The resulting RMSEs are plotted in Figure 6.5, where we compare with the case of known density, (3.4). In the case α = 1/2 the density ϱ_{B,1/2} is large at the boundary, see Figure 4.4a), which means that we have more points there. In this case we obtain the error rate 2^{−3n}, which is determined by the order of the wavelets. This behavior occurs even in the case α = 1, i.e. when the samples are uniformly distributed on [0, 1]. In the case α = 2, the benefit of our approximation approach is not as big as in the other cases, since the density ϱ tends to zero at the boundary and we do not have many samples there. This means that the function f̃ does not have support bigger than [0, 1] and the smoothing effect of f̃ ∘ R̃⁻¹ at the boundary does not apply. The approximation is slightly better when the density is known, but nevertheless this density does not inherit the decay rate of the other examples due to the lack of sample points at the boundary.

High-dimensional approximation
The main aim of this paper is the fast and effective approximation of high-dimensional functions. We study the setting where the variables y₁, …, y_d are independent, which means that the density ϱ(y) is a product of one-dimensional densities (3.7). Therefore, we transform every variable of the given samples separately after estimating the one-dimensional densities. Additionally, we utilize the ANOVA decomposition from Section 3.1 to deal with the curse of dimensionality.
For the function f ∈ L₂(Ω, ϱ) we have the ANOVA decomposition (3.11). The number of ANOVA terms of a function equals 2^d and therefore grows exponentially in the dimension d. This reflects the curse of dimensionality and poses a problem for the approximation of a function. In high-dimensional settings, however, the underlying function can very often be represented effectively as a sum of lower-order functions: the function can be expressed as a combination of component functions in which only ν ≪ d of the d variables are active, [12, 25]. Recent methods such as ANOVAapprox [26, 31] (with successful applications to different datasets in [32]), SALSA [23], SRFE [17] and SHRIMP [44] exploit this structure. To this end we introduce the notion of effective dimension, see [6].

Definition 7.1. For 0 < ε ≤ 1 the effective dimension of f, in the superposition sense, is the smallest integer ν ≤ d such that ∑_{|u|≤ν} σ²_ϱ(f_u) ≥ ε σ²_ϱ(f).
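Definition 7.1 is straightforward to evaluate once (estimates of) the variances σ²_ϱ(f_u) are available. A small sketch of ours, with hypothetical variance values chosen purely for illustration:

```python
def effective_dimension(variances, eps):
    """Smallest nu with sum_{|u| <= nu} sigma^2(f_u) >= eps * sigma^2(f).
    `variances` maps frozenset(u) -> sigma^2(f_u); the constant term u = {}
    carries no variance and is excluded."""
    total = sum(variances.values())
    nu, acc = 0, 0.0
    while acc < eps * total:
        nu += 1
        acc += sum(v for u, v in variances.items() if len(u) == nu)
    return nu

# Hypothetical variances: strong first-order terms, one pair interaction.
var = {frozenset({1}): 0.5, frozenset({2}): 0.3,
       frozenset({3}): 0.15, frozenset({1, 2}): 0.05}
print(effective_dimension(var, eps=0.9))   # first-order terms already carry 0.95
```

With ε = 0.9 the first-order terms suffice (ν = 1); raising ε to 0.99 forces the pair term in and gives ν = 2.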
A function with low effective dimension allows a good approximation using only ANOVA terms up to order ν. To approximate f_u, we have to use the transformation R_u or, if the density ϱ_u is unknown, estimate the one-dimensional densities ϱ_i(y_i) for i ∈ u and use a transformation R̃_u(y_u) ≈ R_u(y_u) to transform the samples Y_u to X̃_u = R̃_u(Y_u). Since we deal with independent input variables, we transform every variable separately, i.e.
Ru pyq " ´R i py i q ¯iPu , where we get Ri from one-dimensional transformations (6.2).In the truncated hyperbolic wavelet matrix A we insert only the indices pj, kq belonging to the low-dimensional terms, i.e.
For a chosen ν ≪ d we do this for all u with |u| ≤ ν, i.e. U_ν := {u ⊆ [d] : |u| ≤ ν}. This is summarized in Algorithm 1. The notation Y_{{i}} means, analogously to the notation y_u, that we consider only the components y_i of the samples in Y. Similar to [26, 31], we calculate the variances of the approximations of the ANOVA terms f_u and omit those with low variance in a second approximation step, in order to increase the accuracy by using a higher wavelet level n for the important terms. Hence, in the second approximation step we use only the ANOVA terms u ∈ U ⊆ P([d]) and obtain the approximant S_n^{Y,U} f, (7.1), also for an arbitrary ANOVA index set U.
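The size of U_ν grows only polynomially in d, which is what makes this truncation feasible. A sketch of the index-set construction (ours, for illustration):

```python
from itertools import combinations

def anova_index_set(d, nu):
    """U_nu = {u subset of [d] : |u| <= nu}, including the empty set u = ()."""
    return [u for order in range(nu + 1)
              for u in combinations(range(1, d + 1), order)]

U = anova_index_set(d=8, nu=2)
print(len(U))   # 1 + 8 + 28 = 37 terms instead of 2^8 = 256
```

For d = 8 and ν = 2 this leaves 37 of the 256 possible ANOVA terms, matching the 8 first-order and 28 second-order terms examined in the numerical example below.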
To summarize our algorithm, we have two possibilities for every variable y_i, as illustrated in Figure 7.1. Algorithm 1 works well if the underlying density is a tensor-product density.
The intuition is that a good approximation of the function also yields a good approximation of the ANOVA terms, and hence of their variances. Calculating the variances of the ANOVA terms for functions on T^d is easy because of the connection (2.13). Therefore we approximate the variances σ²_ϱ(f_u) by estimated variances σ̃²_ϱ(f_u); see [26, Section 4] for computational details with hyperbolic wavelet regression. In the following we study the error between the estimated variances σ̃²_ϱ(f_u) and the variances σ²_ϱ(f_u). First, let us consider the case where the density is known.

Theorem 7.2. Let ∅ ≠ u ∈ U, let v = {i ∈ u : Ω_i = [0, 1]} and f_u ∈ B^s_{2,∞}(Ω_u, ϱ_u). Denote furthermore e₂ := f_u − (S_n^X (f ∘ R_u⁻¹)) ∘ R and a = 1 − (1 − η)^{|v|/2−1}. Then the stated variance bound holds.

Proof. Let us denote in this proof g = (S_n^X (f ∘ R_u⁻¹)) ∘ R.

Only in the case v ≠ ∅ do we have the additional factor depending on η; in the case of periodic functions, the second term is zero. Putting all inequalities together gives the desired result.
The error between the function f_u and its approximation g can be estimated as follows. We are now concerned with a |u|-dimensional function; therefore the estimates (2.7) to (2.10) hold with d = |u| for the transformed function, and the connection (3.10), or rather Theorem 5.1, transfers the results to Ω_u. Therefore, the corresponding error bound holds with high probability. The logarithmic term in the approximation error appears only for ANOVA terms with |u| ≥ 2. In the case where the density ϱ is unknown we get an additional term, which depends on the error between the estimated density ϱ̃ and the actual density ϱ, similarly to Theorem 6.1. We introduce the subset E of the domain where we expect no data, for which the estimate in the next theorem follows.
Theorem 7.3. Let ∅ ≠ u ∈ U and f_u ∈ B^s_{2,∞}(Ω_u, ϱ_u). Let furthermore g := S_n^{X̃,U} f be an approximation obtained from our procedure in Algorithm 1 with unknown density. Then the stated bound with terms A, B and C holds.

Let us briefly explain the appearing terms. Term A is the error from Theorem 7.2, term C depends on the quality of the approximation ϱ̃ ≈ ϱ, and term B describes the variance of g_u in the region where we have no samples, i.e. where we extend the original function f_u. Of course, if the original function is non-periodic, we use an extension and study the variances, so we have to accept this additional term.
Proof. We split the difference of the variances into the terms A, B and C. Then, by splitting the domain of the second integral, the assertion follows.
The quintessence of this subsection is that approximating the global sensitivity indices (GSIs) of the function f by the GSIs of the approximant is a reasonable approach to gain insight into the variances of the ANOVA terms. In a second approximation step we reduce the index set to the ANOVA indices whose GSI exceeds some threshold ε > 0. On the other hand, this allows us to increase the maximal wavelet level for the important ANOVA terms and therefore decrease the approximation error, while ensuring logarithmic oversampling.
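The truncation step is then a simple filter on the estimated GSIs. A sketch of ours; the GSI values below are made up for illustration:

```python
def truncate_by_gsi(gsi, eps):
    """Keep the ANOVA indices u whose estimated global sensitivity index is
    at least eps; the constant term () is always kept."""
    return [()] + sorted((u for u, s in gsi.items() if s >= eps), key=len)

gsi = {(1,): 0.55, (2,): 0.001, (5,): 0.30, (1, 5): 0.14, (2, 5): 0.002}
print(truncate_by_gsi(gsi, eps=0.01))   # [(), (1,), (5,), (1, 5)]
```

The retained index set is then used for the second approximation step with an increased wavelet level.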

A synthetic numerical example
As a conclusion of this paper we apply Algorithm 1 to a high-dimensional test function. For that reason let f : R⁵ × [0, 1]³ → R be an 8-dimensional test function with y₅ > 0. We assume the given data Y to be sampled from the distribution ϱ : R⁵ × [0, 1]³ → R₊, ϱ(y) = ∏ᵢ ϱ_i(y_i), where we use different distributions already studied in this paper: ϱ₁(y₁) = ϱ_N(y₁), ϱ₂(y₂) = ϱ_L(y₂), ϱ₃(y₃) = ϱ_C(y₃), ϱ₅(y₅) = ½ e^{−y₅/2}, ϱ₆(y₆) = ϱ₈(y₈) = 1 and ϱ₇(y₇) = π⁻¹ y₇^{−1/2} (1 − y₇)^{−1/2}. We draw M = 1000 samples and use the corresponding function values f = f(Y). These samples, projected to the directions y₁ and y₂, are plotted in the introduction in Figure 1.1, together with the transformed samples R̃(Y), also projected to the directions y₁ and y₂. We use the superposition dimension ν = 2, which is a suitable guess in view of the function equation, which suggests only one ANOVA term of order 2. With the choice ν = 3 we would conclude in a first step that we do not need the three-dimensional terms. Furthermore, we use Chui-Wang wavelets of order m = 2.
We consider the setting where we do not know the underlying density, so for the variables y₁, …, y₅ we use the kernel density estimation for data on R from Section 6.1.1 with the Gaussian kernel introduced there, and for the remaining variables y₆, y₇, y₈ the kernel density estimation for data on [0, 1] from Section 6.2 with the B-spline kernel introduced there. For different wavelet levels n we plot in Figure 7.2a the approximated GSIs S(u, S_n^Y f), i.e. the 8 GSIs of order 1 followed by the 28 GSIs of order 2. Since we know the function explicitly, we compare these to the analytically calculated GSIs S(u, f). One can see that even with the low wavelet level n = 2 we could indeed identify the ANOVA terms with high variances. So we filter out the unnecessary variables y₂ and y₈ and all two-dimensional terms except the one with u = {1, 5}. Furthermore, we plotted in Figure 7.2b the approximation errors for different wavelet levels n. In a second approximation step we use only the ANOVA indices U = {∅, {1}, {3}, {4}, {5}, {6}, {7}, {1, 5}} for the approximation. This procedure is similar to the method suggested in [26, 32]. It is also reasonable to choose different maximal levels n for the one- and the two-dimensional terms, since these index sets have different sizes.
For the choice n " 5 for the one-dimensional terms and the choice n " 3 for the two-dimensional term we are able to reduce the RMSE on the test-set additionally from 0.4997 to 0.2248.Our procedure reduces the RMSE significantly, hence we are able to approximate an 8-dimensional function using only 1000 samples very well.Using only a min-max transformation of the data, it is not possible to detect the non-zero ANOVA-terms.
In a second experiment we use only the ANOVA indices in U and run our procedure for different sample sizes M, with all other parameters kept the same. The results are plotted in Figure 7.3. We used the maximal wavelet level n for which the RMSE is minimal for the given fixed dataset.

Real-world data
The proposed Algorithm 1 was used in [42] to estimate the vertical ground reaction force from time series of plantar pressure measured by instrumented insoles. The study included data from 16 persons moving at different speeds on a two-belt treadmill equipped with force plates. In total, about M ≈ 1.2 · 10⁶ data points were used, and the data was modelled as an 8-dimensional function with ANOVA terms up to order ν = 2. The approach successfully reached relative RMSEs of 10.6 %, which is comparable to other studies in the literature, with the advantage of being interpretable; moreover, much more data was available in the study [42] than in comparable ones.
In the following we compare the performance of models obtained by Algorithm 1 with other state-of-the-art algorithms on seven real-world datasets from the UCI repository (http://archive.ics.uci.edu/ml). For comparison we test against random forest regression (RF) and Gaussian processes (GP), both implemented in ScikitLearn.jl, which implements the popular scikit-learn interface and algorithms in Julia. Furthermore, we compare with sparse additive models [23, 36, 44] and follow the experimental setup in [23], where the training data is normalized so that the input and output values have zero mean and unit variance along each dimension. Each dataset is divided in half to form the training and testing sets; we use exactly the same splitting as in [23] for the datasets used there and proceed analogously for the datasets with more samples. Note that [23] tests only a single random splitting. In our experiments we use cross-validation on the original training dataset, with 20 % as validation data, to select the best parameters ν, n, λ, ε. An overview of the datasets is presented in Table 7.1. Furthermore, we give details of our trained model: the basis functions used on the torus (chui-m denotes Chui-Wang wavelets of order m and per denotes trigonometric polynomials) as well as the total number N of trained coefficients in the final model. For the transformation R̃ we use the DPI described in Section 6.1.1, applied with the Gaussian kernel.
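The preprocessing described above can be sketched as follows. This is our code, not from [23]; that paper fixes its own random split, for which a seeded permutation stands in here.

```python
import numpy as np

def normalize_and_split(X, y, seed=0):
    """z-score inputs and outputs per dimension using training statistics,
    then split the data in half into training and testing sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    tr, te = idx[: len(X) // 2], idx[len(X) // 2 :]
    mu, sd = X[tr].mean(axis=0), X[tr].std(axis=0)
    my, sy = y[tr].mean(), y[tr].std()
    Xn, yn = (X - mu) / sd, (y - my) / sy
    return (Xn[tr], yn[tr]), (Xn[te], yn[te])

rng = np.random.default_rng(4)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)
(Xtr, ytr), (Xte, yte) = normalize_and_split(X, y)
print(Xtr.shape, Xte.shape)   # (100, 5) (100, 5)
```

Computing the statistics on the training half only avoids leaking information from the test set into the model.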

The approximation results and comparisons are shown in Table 7.2. The results for SALSA are taken from [23], the results for HARFE and SRFE from [36], and we included the results of the SHRIMP algorithm [44]. Since the results of the randomized algorithms depend on the draw of the random features, in contrast to the given results for SHRIMP, HARFE and SALSA we repeated the approximation validation 50 times and present the mean in Table 7.2. Note that our parameter ν coincides with the parameter q in the random-feature literature. Furthermore, for the datasets with many samples, i.e. Protein and Elevators, the random-feature algorithms are not able to compute an approximation, since the involved random matrices become too big.
We want to highlight two cases: first, where the dataset has many samples M compared to the dimension d, and second, where few samples are available. In the first case (datasets Protein and Elevators) Algorithm 1 performs better than the other machine learning algorithms; since the random-feature models set up a matrix as large as the number of unknowns, they cannot handle such big datasets. In the case of a low sample size, Algorithm 1 even provides similar or smaller approximation errors compared to the random-feature models, with the advantage of being interpretable: it is easy to calculate the GSIs of the involved ANOVA terms for the final approximation model, and in applications the user can derive conclusions from them. Furthermore, we confirm the thesis that real-world data can be described by functions of low effective dimension. It should also be noted that in the dataset Airfoil 36 additional dimensions with random noise are added; Algorithm 1 easily finds the non-importance of these dimensions and reduces the data to the important 5 dimensions.

Table 7.2: MSE on real datasets using various approximation techniques. Details of the corresponding algorithms can be found in HARFE [36], SHRIMP [44], SALSA [23], SRFE [17] or are available via ScikitLearn.jl. The best results for every dataset are highlighted.

Conclusion and outlook
In this paper, we introduced a new method for function approximation from fixed samples drawn from an arbitrary density. This method combines previous work on least-squares approximation on the torus T^d and on the truncation of the ANOVA decomposition with a variable transformation and a kernel density estimation. We are able to transfer the error decay rates and the fast algorithms from the torus to the domain Ω. The new extension method, which benefits from the Chui-Wang wavelets, even allows the approximation of non-periodic functions. As shown in our numerical experiments, this procedure is beneficial in function approximation. The code is available in the Julia package ANOVAapprox on GitHub, see https://github.com/NFFT/ANOVAapprox. We assume for our algorithm that the input variables are independent, which is not necessarily the case in applications; in future work we want to adapt our algorithm to the setting of dependent input variables.

Here Id f = f and Δ^m_{h_i,i} is the univariate difference operator applied to the i-th coordinate of f with the other variables kept fixed.
Definition A.1. Let s > 0 and 1 ≤ p ≤ ∞. Fixing an integer m > s, we define the space B^s_{p,∞}(T^d) as the set of all f ∈ L_p(T^d) such that for any u ⊆ {1, …, d} the corresponding condition on the mixed moduli of smoothness holds. We also define functions in a Sobolev space with dominating mixed derivatives; this norm is equivalent to the norm in (A.1) for s ∈ N, see [24]. We consider the case s > 1/2, since then H^s_mix(T^d) ↪ C(T^d), which is necessary to sample the function. There is a further useful equivalent norm which is based on a decomposition of f into dyadic blocks f_j. Interestingly, there is also a Fourier-analytic characterization of the above-defined Besov-Nikolskij spaces B^s_{p,∞}(T^d) which even works for 1 < p < ∞: instead of taking the ℓ₂-norm of the weighted sequence (2^{|j|₁ s} ‖f_j‖_{L_p(T^d)})_{j∈N₀^d} we take the ℓ∞-norm,

‖f‖_{B^s_{p,∞}} = sup_{j∈N₀^d} 2^{|j|₁ s} ‖f_j‖_{L_p(T^d)}. (A.3)

Figure 1.1: Transformation of the samples in the two-dimensional case.

Figure 2.1: Illustration of the ANOVA indices of a three-dimensional function: every cuboid belongs to one index j. The size represents the number of translation indices k ∈ I_j, which gives this dyadic structure. All indices in I₃ are decomposed into indices belonging to the ANOVA terms with index u ⊆ [3].

Example 4.3. We give three examples of distributions on Ω = R; the density function, the transformation R, and the densities Υ_{m,k}(y) from (4.4) are plotted in Figures 4.1, 4.2 and 4.3. i) Standard normal distribution on R with density ϱ_N(y); because of this very smooth density we expect the transformation to pass the smoothness of f on to f ∘ R⁻¹. ii) Laplace distribution, with corresponding transformation R(y) = ½ sgn(y − 2)(1 − e^{−|y−2|}).
The functions Υ_{m,k}(y) defined in (4.4).

Figure 4.1: The standard normal distribution ϱ_N on R.

Theorem 4.8. Let m ∈ N and let ϱ ∈ W_d be an admissible weight function with ϱ(y) < ∞ for all y ∈ R^d. Then W^m₂(R^d, ϱ^{−m+1/2}) ⊂ H^m(R^d, ϱ).

Proof. We begin with the one-dimensional case. First we show by induction over α that for the Bell polynomials defined in (4.1) there holds |B_{α,k}(y)| ≲ |ϱ(y)|^{−α}. The condition (4.13) of an admissible weight function and the examples (4.…)

Figure 4.4: The beta distribution ϱ_{B,α} on the interval [0, 1] for α ∈ {1/2, 2, 3}.

Theorem 5.1. Let Ω_i ∈ {T, R}, let the density ϱ_i be in C^{m−1}(Ω_i) for i ∈ [d], and let m ∈ N be the order of vanishing moments of the wavelet ψ. Let the function fulfill, for all i ∈ [d] with Ω_i = R, that …

Figure 5.1: Illustration of the two-dimensional extension.

Lemma 5.2. … restricting function spaces V ⊂ L₂(T^d) to some smaller domain by V|_{[a,b]^d} := { g|_{[a,b]^d} : g ∈ V }.

Figure 5.2: The extension of the function g to g̃ for dimension d = 1 and order of wavelets m = 2.

for the transformation R_u; we go into more detail in Chapter 7. Motivated by the previous lemma, we make a reasonable choice of the extension parameter η depending on m and the level n.

Figure 5.3: Approximation of the function (5.13) on R^d for d ∈ {1, 2, 3} and the normal distribution ϱ_N. (a) Decay of the RMSE for m = 2. (b) Decay of the RMSE for m = 3.

Figure 5.4: Approximation of the function (5.13) on R^d for d ∈ {1, 2, 3} with the Cauchy distribution ϱ_C.

Figure 5.5: Approximation of the function (5.13) on R^d for d ∈ {1, 2, 3} with the Laplace distribution ϱ_L.
Decay of the RMSE for d " 1.

Figure 5.6: Approximation on T^d for d ∈ {1, 2} of the test function (5.14) with samples distributed according to the beta distribution ϱ_{B,α}. Decay of the RMSE.

(b) Condition number of the wavelet matrix A.
(a) The density function. (b) The cumulative distribution function.

Figure 6.2: Illustration of the problem of over- and underfitting.

Ψ₄ = ∫ ϱ⁽⁴⁾(y) ϱ(y) dy, or more generally Ψ_r = ∫ ϱ^{(r)}(y) ϱ(y) dy.
(a) Function f and extended function f̃ on Ω̃. (b) The density ϱ and the estimated density ϱ̃.

Figure 6.4: RMSE of the approximation on R of the test function (5.13) using kernel density estimation.

Figure 6.5: RMSE of the approximation on the interval [0, 1] of the test function f(y) = e^y using kernel density estimation.

∫_E |g_u(y)|² (ϱ(y) − ϱ̃(y)) dy (term B).
Figure 7.2b: Approximation errors, the ℓ₂-norm on the given samples, and the RMSE for different wavelet levels n.
the classical definition via mixed moduli of smoothness. Therefore, first recall the basic concepts. For univariate functions f : T → C the m-th difference operator Δ_h^m is defined by

Δ_h^m(f, x) := ∑_{j=0}^{m} (−1)^{m−j} (m choose j) f(x + jh),  x ∈ T, h ∈ [0, 1].

Let u be any subset of {1, …, d}. For multivariate functions f : T^d → C and h ∈ [0, 1]^d the mixed (m, u)-th difference operator Δ_h^{m,u} is defined as the concatenation of the univariate operators Δ^m_{h_i,i} for i ∈ u.

with the partial derivatives D^k f = ∂^{k₁+⋯+k_d} f / (∂y₁^{k₁} ⋯ ∂y_d^{k_d}). It clearly holds for d = 1 that H^m_mix(T) = H^m₂(T) =: H^m(T). The case p = 2 and Ω = T^d allows for a straightforward extension to fractional smoothness parameters.

Definition A.2. Let s > 0. Then we define H^s_mix(T^d) := …
Let us introduce the weighted L_p-norm. Since the norm should be preserved under the transformation, our aim is to introduce suitable weighted norms on Ω, namely ‖f‖_{H^s(Ω,ϱ)} and ‖f‖_{B^s…}, see [40, Definition 6.1].

Definition 4.7. The class W_d of admissible weight functions w is the collection of all positive functions w ∈ C^∞(R^d) with the properties: i) for all γ ∈ N₀^d there is a positive constant c_γ with |D^γ w(y)| ≤ c_γ w(y) for all y ∈ R^d. (4.13)

Table 5.1: Extremal eigenvalues of the expectation matrix Λ for different maximal level n and order m of the wavelets. For comparison we also give the extremal eigenvalues of the matrix Λ, which are the lower …
Proof. The first inequality follows from the triangle inequality and the Cauchy-Schwarz inequality,

(∫_{R∖E} |e_f(y)|⁴ dy)^{1/2} ‖ϱ(y) − ϱ̃(y)‖_{L₂(R)} + ε sup_{y∈E} |e_f|². (6.3)

∫_{Ω_u} |g(y_u)|² (1 − η)^{−|v|} ϱ_u(y_u) dy_u. For simplicity we denote in this proof g̃ = (1 − η)^{−|v|/2} g. Then we estimate the difference of the variances of the ANOVA terms given in (3.12) by the reverse triangle inequality and the Cauchy-Schwarz inequality,

|σ²_ϱ(f_u) − σ̃²_ϱ(f_u)| = |∫_{Ω_u} (|f_u(y_u)|² − |g(y_u)|²) ϱ_u(y_u) dy_u|
= |∫_{Ω_u} (|f_u(y_u)| − |g(y_u)|)(|f_u(y_u)| + |g(y_u)|) ϱ_u(y_u) dy_u|
≤ ‖f_u − g‖_{L₂(Ω_u,ϱ_u)} ‖f_u + g‖_{L₂(Ω_u,ϱ_u)}
≤ ‖f_u − g‖_{L₂(Ω_u,ϱ_u)} (2 ‖f_u‖_{L₂(Ω_u,ϱ_u)} + ‖f_u − g‖_{L₂(Ω_u,ϱ_u)}).

Figure 7.2b shows the error ‖f − S_n^Y f‖_{ℓ₂(Y)} = (∑_{y∈Y} |f(y) − S_n^Y f(y)|²)^{1/2} and the RMSE (5.12) for a test set Y_test sampled according to ϱ with |Y_test| = 3M. The low ℓ₂(Y) error indicates that n = 3 is already overfitting, i.e. using too many parameters for the 1000 samples. In a second approximation step we omit the ANOVA terms with low variance and use only the ANOVA indices in U.

Table 7.1: Overview of the seven datasets: dimension d and number of data points in the training and testing sets. The experimental setup and datasets for each test follow [23].