1 Introduction

Two of the major developments during the last quarter of a century are the use of radial basis function (RBF) networks in learning theory, and wavelet analysis in computational harmonic analysis. The subject of radial basis function networks is highly studied, with applications in various fields of mathematics, sciences, engineering, biology, learning theory, and so forth. A MathSciNet and a Web of Science search on January 10 of 2013 for “radial basis function*” showed 1,262 and 12,697 citations, respectively. A similar search with “wavelet*” revealed 10,842 and 65,271 citations, respectively. Therefore, we will make no effort even to attempt to survey the entire subject, and, instead, we will focus on certain facts which have impacted our own research in these areas.

Please note that, for the sake of clarity, the notation used in the introduction may differ from that used in the rest of the paper.

A major theme in learning theory is to discover a functional relationship in given data of the form \(\{(\mathbf{x}_k,y_k)\}_{k=1}^M\) where \(\mathbf{x}_k\)’s are vectors in a Euclidean space \(\mathbb{R }^q\) for some \(q\in \mathbb{N }\), and \(y_k\)’s are the corresponding function values, usually corrupted with noise. The problem is to find a target function (or model) \(f :\mathbb{R }^q\rightarrow \mathbb{R }\) so that \(f(\mathbf{x}_k)=y_k\), \(k=1, 2,\ldots ,M\) or at least, \(f(\mathbf{x}_k)\approx y_k\), \(k=1, 2,\dots ,M\). The first requirement is usually imposed either when the data is insufficient, or when it is important to reproduce the data exactly in the model. For example, in image registration applications, one needs to preserve some “landmarks”. On the other hand, if the data is plentiful, but noisy, one does not wish to reproduce it exactly, but wishes to “fit” a smooth model to the data.

In either case, the problem is clearly ill-posed—there are infinitely many models \(f\) which meet the requirements. Therefore, the usual method to obtain the model is first to choose a class \(\mathcal{M }\) of desired models, and find the desired model \(f\) so as to minimize a regularization functional such as

$$\begin{aligned} \sum _{k=1}^M (g(\mathbf{x}_k)-y_k)^2 + \delta \Vert \mathcal{L }g\Vert \end{aligned}$$
(1.1)

over all \(g\in \mathcal{M }\), where \(\mathcal{L }\) is a penalty functional (usually, a differential operator), \(\Vert \circ \Vert \) is a suitable norm, and \(\delta \) is the regularization parameter. The least square fit is obtained by setting \(\delta =0\); letting \(\delta \rightarrow \infty \) is equivalent to minimizing \(\Vert \mathcal{L }g\Vert \) subject to interpolatory conditions. A very classical example of this approach has \(\mathbf{x}_k\in [-1,1]\), \(\mathcal{M }\) is the class of all twice continuously differentiable functions, \(\Vert \circ \Vert \) is the \(L^2\) norm on \(\mathbb{R }\), \(\mathcal{L }g=g^{\prime \prime }\), and we let \(\delta \rightarrow \infty \). In this case, one recovers the cubic spline interpolant for the data. A technique well-known in image processing is the TV minimization, where \(\mathbf{x}_k\in [0,1]^2\), \(\Vert \circ \Vert \) is the \(L^1\) norm, and \(\mathcal{L }g\) is the so-called total variation of \(g\).

Motivated by the example of spline functions, regularization or smoothing interpolation is used with many other penalty functionals in learning theory. In view of the Golomb–Weinberger principle, the solution to these problems can often be obtained explicitly in the form \(\sum _{k=1}^M a_k\phi (|\circ -\mathbf{x}_k|)\) for a univariate function \(\phi \), where \(|\circ |\) denotes the usual Euclidean norm [15, 25]. A function of this form is called a radial basis function (RBF) network with \(M\) neurons. The function \(\phi \) is called the activation function and the \(\mathbf{x}_k\)’s are the centers of the network. More generally, a translation network is a function of the form \(\sum _{k=1}^M a_kG(\circ -\mathbf{x}_k)\) where the activation function \(G\) is defined on an appropriate Euclidean space.

The motivation for this terminology stems from an analogy with neurons in the brain, as indicated by Fig. 1.

Fig. 1

The image on the left depicts a rat neuron. Neurons fire when their electric potential exceeds a certain threshold value. The term “neural network” arises in conjunction with the similarity found between the architecture of actual neuron cells, and the topology of an artificial neural RBF network, pictured on the right. The input layer receives a vector, and passes it as an argument to each of the small computers represented by circles in the hidden layer. Each of these computers evaluates a term of the form \(G(\mathbf{x}-\mathbf{x}_j)\), and represents the analogous dependence on electric potential. The output layer takes the linear combination. The image copyrights are held by Testuya Tatsukawa 2010 and Unikom Center 2010–2012, respectively

A question of central interest in machine learning can be formulated as a problem of function approximation. A very classical theorem of Wiener states that the set of all translation networks is dense in \(L^1(\mathbb{R }^q)\) if and only if \(\hat{G}(\mathbf{t })\not =0\) for every \(\mathbf{t }\in \mathbb{R }^q\). Under mild conditions on \(\phi \), it was proved by Park and Sandberg [70] that the class of all RBF networks

$$\begin{aligned} \left\{ \sum _{k=1}^M a_k\phi (|\circ -\mathbf{x}_k|) : M\ge 1, \ \mathbf{x}_k\in \mathbb{R }^q\right\} \end{aligned}$$

is dense in the space of continuous real valued functions on \(\mathbb{R }^q\) with the topology of convergence on compact subsets.

In [50], we showed that the set of all translation networks with activation function \(G\) which can be interpreted as a tempered distribution is dense in the same sense if and only if the support of the distributional Fourier transform of \(G\) is a so-called “set of uniqueness” for entire functions of finite exponential type in \(q\) variables.

Perhaps, the most popular way of using translation networks is for interpolating data. An attractive feature of RBF networks is a result by Micchelli [62] that (again under mild conditions on \(G\)) the matrix \([G(\mathbf{x}_j-\mathbf{x}_k)]_{j,k=1}^M\) is always invertible for an arbitrary choice of \(M\) and \(\mathbf{x}_k\)’s, so that interpolation by RBF networks is always possible. It is proved in [25] that an interpolatory network also solves certain regularization problems.
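The interpolation step described above can be sketched in a few lines. This is only an illustration: the Gaussian activation, its shape parameter 5, the random centers, and the helper names `rbf_interpolate` and `rbf_evaluate` are assumptions made for the example and do not come from the cited references.

```python
import numpy as np

def rbf_interpolate(centers, values, phi):
    """Solve [phi(|x_j - x_k|)] a = y for the coefficients a of the RBF network."""
    # Pairwise Euclidean distances between the centers, shape (M, M).
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    return np.linalg.solve(phi(dists), values)

def rbf_evaluate(x, centers, coeffs, phi):
    """Evaluate sum_k a_k phi(|x - x_k|) at the rows of x."""
    dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=-1)
    return phi(dists) @ coeffs

rng = np.random.default_rng(0)
centers = rng.uniform(-1.0, 1.0, size=(15, 2))          # M = 15 scattered centers in R^2
values = np.sin(centers[:, 0]) * np.cos(centers[:, 1])  # sample data y_k (illustrative)

def gaussian(r):
    # Assumed activation function; the shape parameter 5 is arbitrary.
    return np.exp(-(5.0 * r) ** 2)

coeffs = rbf_interpolate(centers, values, gaussian)
# Largest deviation of the network from the data at the centers.
residual = np.max(np.abs(rbf_evaluate(centers, centers, coeffs, gaussian) - values))
print(residual)
```

Although the matrix is invertible in theory, it can become badly conditioned when centers cluster, which is one of the computational issues that motivates the constructions discussed in this paper.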

Interpolation by RBF networks and the use of extremal problems to obtain a functional relationship underlying the data are both very popular methods, but they have some disadvantages. First, there is no guarantee that the minimal value of the regularization functional will remain bounded as the amount of data increases. Second, there are usually no a priori bounds on the accuracy of the resulting approximation. Indeed, the theory of degree of approximation by interpolatory RBF networks is a fairly well established topic in its own right. Finally, there are numerous computational issues, including ill-conditioning, lack of convergence, local minima, and so forth.

In the past 20 years or so, the first author, together with collaborators, has explored methods to construct translation network approximations with a priori performance guarantees which avoid each of the pitfalls mentioned above. Quite often, the resulting networks also satisfy the extremal properties up to a constant multiple. This research has demonstrated a very close connection between the classical theory of polynomial approximation, and approximation by translation networks. Applications of the ideas in this research include the theory of probability density estimation [78], pattern classification [3], control theory [33], signal processing [16], numerical simulation of turbulent channel flow [23], and construction of Schauder basis [24, 31, 37].

Some of the highlights of our work are the following:

  • The conditions on the activation function required to achieve our approximation bounds are weaker than those required for interpolation.

  • Our procedures are given as explicit formulas, requiring neither training in the classical machine learning setting, nor a solution of a system of equations involving a possibly ill-conditioned large matrix. The formulas can be implemented as a matrix–vector multiplication.

  • The spaces to which the target function is assumed to belong are the usual smoothness classes, rather than the so-called native spaces for the networks.

  • It is easy to adopt a two-scale approach to constructing a network which provides both the optimal approximation bounds, and interpolation at fewer points than there are centers of the network (cf. [7]).

The purpose of this paper is to illustrate the main ideas in our research in a few contexts. In Sect. 2, we introduce the basic ideas in the context of approximation by univariate trigonometric polynomials. The material in this section is extended in the context of multivariate trigonometric polynomials in Sect. 3. A new feature here is the ability to approximate functions based on “scattered data”, that is, evaluations where one does not prescribe the location of the points where the function is to be evaluated. The analogues of these results in the context of periodic basis function networks are given in Sect. 4. Certain extensions of the various parts of this theory are discussed in Sect. 5.

2 Approximation of univariate periodic functions

2.1 Preliminaries

In this section we describe some basic results regarding approximation of univariate \(2\pi \)-periodic functions by trigonometric polynomials. There are numerous standard references on the subject (for instance, [6, 12, 57, 66, 77]). Our summary in this section is based on [57], where many results are given with elementary proofs for the case of uniform approximation, and [12], where these results are given in the full generality. First, some terminology. We denote by \(\mathbb T \) the quotient space of the interval \([-\pi ,\pi ]\) where the end points are identified. Geometrically, we think of \(\mathbb T \) as the unit complex circle, except that rather than denoting a point on \(\mathbb T \) by \(e^{ix}\), we simplify the notation and denote it by \(x\). If \(f : \mathbb T \rightarrow \mathbb{R }\) is Lebesgue measurable, and \(A\subseteq \mathbb T \) is Lebesgue measurable, we write

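$$\begin{aligned} \Vert f\Vert _{p,A}\mathop {=}\limits ^{\mathrm{def}}\left\{ \begin{array}{ll} \displaystyle \left( \frac{1}{2\pi }\int _A |f(t)|^p\,dt\right) ^{1/p}, &{}\quad {\hbox {if}}\ 0<p<\infty ,\\ \displaystyle \mathop {\hbox {ess sup}}\limits _{t\in A}|f(t)|, &{}\quad {\hbox {if}}\ p=\infty . \end{array} \right. \end{aligned}$$
(2.1)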

The set of all Lebesgue measurable functions for which \(\Vert f\Vert _{p,A}<\infty \) is denoted by \(L^p(A)\), with the understanding that functions which are equal almost everywhere on \(A\) are considered equal as members of \(L^p(A)\). The set of all uniformly continuous, bounded, and \(2\pi \)-periodic functions on \(A\), equipped with the norm of \(L^\infty (A)\) (which is known as the uniform or supremum norm in this case) is denoted by \(C^*(A)\). If \(A=\mathbb T \), we will omit it from the notation. In the sequel, we will assume that \(1\le p\le \infty \). The dual exponent notation is as follows:

$$\begin{aligned} p^{\prime }\mathop {=}\limits ^{\mathrm{def}}\left\{ \begin{array}{ll} p/(p-1), &{}\quad {\hbox {if}}\ 1<p<\infty ,\\ \infty , &{}\quad {\hbox {if}}\ p=1,\\ 1, &{}\quad {\hbox {if}}\ p=\infty . \end{array} \right. \end{aligned}$$

If \(n\) is a non-negative integer, then a trigonometric polynomial of degree (or order) \(n\) is defined to be a function of the form

$$\begin{aligned} x\mapsto \sum _{k=-n}^n c_ke^{ikx}\!, \quad x\in \mathbb T , \end{aligned}$$

where \(c_k\)’s are complex numbers, known as the coefficients of the polynomial, and \(|c_n|+|c_{-n}|\not =0\). The set of all trigonometric polynomials of degree \(<n\) will be denoted by \(\mathbb{H }_n\). It is convenient to extend this notation to non-integer values of \(n\) by setting \(\mathbb{H }_n\mathop {=}\limits ^{\mathrm{def}}\mathbb{H }_{\lfloor n\rfloor }\) where \(\lfloor n\rfloor \) is the integer part of \(n\), and setting \(\mathbb{H }_n=\{0\}\) if \(n\le 0\). According to the trigonometric variant of the Weierstrass theorem and its \(L^p\)-versions, the \(L^p\)-closure of \(\cup _{n=1}^\infty \mathbb{H }_n\) is \(L^p\) for \(1\le p<\infty \) and \(C^*\) for \(p=\infty \). In order to avoid making an elaborate distinction between the two cases every time we state a theorem, we will write \(X^p=L^p\) if \(p<\infty \) and \(X^\infty \mathop {=}\limits ^{\mathrm{def}}C^*\). A central theme of approximation theory in this context is to investigate the properties of the degree of approximation of \(f\in X^p\) from \(\mathbb{H }_n\) for all \(n\). This is defined for \(f\in X^p\) and \(n\ge 0\) by

$$\begin{aligned} E_{n,p}(f) \mathop {=}\limits ^{\mathrm{def}}\inf \{\Vert f-P\Vert _p : P\in \mathbb{H }_n\}. \end{aligned}$$

The quantity \(E_{n,p}(f)\) measures the minimal error that one must expect if one wishes to use an element of \(\mathbb{H }_n\) as a model for \(f\). Clearly, the function \(n\mapsto E_{n,p}(f)\) is non-increasing, the function \(p\mapsto E_{n,p}(f)\) is non-decreasing, and the function \(f\mapsto E_{n,p}(f)\) is a semi-norm on \(X^p\) if \(p\ge 1\). In the sequel, we will assume that \(1\le p\le \infty \).

Two of the main problems in this theory are the following. One of them is how to construct the best approximation \(P_p^*(f)\in \mathbb{H }_n\) such that \(\Vert f-P_p^*(f)\Vert _p =E_{n,p}(f)\). The other problem is to give, for \(\gamma >0\), a complete characterization of \(f\in X^p\) for which \(E_{n,p}(f)=\mathcal{O}(n^{-\gamma })\). Except in the case when \(p=2\), the operator \(P_p^*\) is non-linear, and it takes a very elaborate optimization technique to compute \(P_p^*(f)\) in general. Thus we adopt the stance that it is computationally worthwhile to construct near-best approximations or sub-optimal solutions, that is, finding some \(P\in \mathbb{H }_n\) for which \(\Vert f-P\Vert _p \le c_1E_{cn,p}(f)\) for some positive constants \(c, c_1\) independent of \(n\) or \(f\). We discuss these constructions in this subsection, postponing the discussion of the characterization question to Sect. 2.2.

2.1.1 Constant convention

In the remainder of this paper, the symbols \(c, c_1,\ldots ,\) will denote generic positive constants whose value is independent of the target function \(f\), and other variable parameters such as \(n\), but may depend on fixed parameters under discussion, such as the norm \(p\) or the smoothness index \(\gamma \), and so forth. Their value may be different at different occurrences, even within a single formula. The notation \(A\sim B\) will mean \(c_1A \le B\le c_2A\). As usual, the notation \(A=\mathcal{O}(B)\) will mean that \(|A| \le c |B|\), where \(c\) is some positive constant whose value may depend on \(f\) which one way or another may appear in the expressions \(A\) and \(B\).

In the case when \(p=2\), an explicit formula for the best approximation polynomial \(P_2^*(f)\) is well known. We define the Fourier coefficients of \(f\) by

$$\begin{aligned} \hat{f}(k)=\frac{1}{2\pi }\int _\mathbb T f(t)\exp (-ikt)\,dt, \quad k\in \mathbb{Z }, \end{aligned}$$

and for \(n\in \mathbb{N }\) we set

$$\begin{aligned} s_n(f,x)=\sum _{|k|\le n-1} \hat{f}(k) \exp (ikx), \quad x\in \mathbb T . \end{aligned}$$

It is well known that \(P_2^*(f)=s_n(f)\), that is,

$$\begin{aligned} \Vert s_n(f)-f\Vert _2 =\inf \{\Vert f-P\Vert _2 : P\in \mathbb{H }_n\}. \end{aligned}$$

It is well known [79, Chapter VII, Theorem 6.4] that

$$\begin{aligned} \Vert s_n(f)-f\Vert _p \le cE_{n,p}(f), \quad 1<p<\infty , \end{aligned}$$

where \(c\) is a constant depending on \(p\). The value of \(c\) tends to \(\infty \) if \(p\rightarrow 1,\infty \). There exist integrable functions \(f\) for which the sequence \(\{s_n(f,x)\}\) diverges for almost all \(x\), and a dense set of functions \(f\in C^*\) for which \(\{s_n(f,0)\}\) diverges. A very deep theorem in the theory of Fourier series, the Carleson–Hunt theorem [4, 30] states that if \(p>1\) and \(f\in L^p\) then the sequence \(\{s_n(f,x)\}\) converges for almost all \(x\) to \(f(x)\).

A ground-breaking theorem in this direction states that if

$$\begin{aligned} \sigma _n(f,x) \mathop {=}\limits ^{\mathrm{def}}\frac{1}{n}\sum _{m=1}^n s_m(f,x), \quad n\in \mathbb{N }\ {\hbox {and}}\ x\in \mathbb T , \end{aligned}$$

then for all \(p\) with \(1\le p\le \infty \) and \(f\in X^p\),

$$\begin{aligned} \lim _{n\rightarrow \infty }\Vert f-\sigma _n(f)\Vert _p =0. \end{aligned}$$

In the case when \(p=\infty \), this theorem was proved by Fejér in [18] and it appears in almost every textbook on approximation theory, see e.g., [12, Chapter 1, Corollary 2.2] or [57, Chapter 1, Section 1.1]. In the case of \(1 \le p<\infty \) and, in fact, in greater generality, it appears in [79, Chapter IV, Theorem 5.14]. The rate of convergence

$$\begin{aligned} \Vert f-\sigma _n(f)\Vert _p\le \frac{c}{n}\sum _{k=1}^n E_{k,p}(f), \end{aligned}$$

was given in 1961 by Stechkin (aka Stečkin), see [74]. In order to get a near-best approximation, we define

$$\begin{aligned} v_n(f,x) \mathop {=}\limits ^{\mathrm{def}}\frac{1}{n}\sum _{m=n+1}^{2n} s_m(f,x), \quad n\in \mathbb{N }\ {\hbox {and}}\ x\in \mathbb T , \end{aligned}$$

where \(v\) is in honor of C. de La Vallée Poussin. It is easy to check that

$$\begin{aligned} v_n(f)=2\sigma _{2n}(f)-\sigma _n(f). \end{aligned}$$

It can be deduced from here ([12, Chapter 9, Theorem 3.1]) that for all \(p\) such that \(1\le p\le \infty \) and for all \(f\in L^p\) one has

$$\begin{aligned} E_{2n,p}(f) \le \Vert f-v_n(f)\Vert _p \le 4E_{n,p}(f). \end{aligned}$$
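One way to see where the constant \(4\) comes from is sketched below. The sketch assumes two standard facts: the Fejér kernel is non-negative with mean value \(1\), so that \(\Vert \sigma _n(g)\Vert _p\le \Vert g\Vert _p\) for every \(g\in L^p\); and \(v_n(P)=P\) for every \(P\in \mathbb{H }_n\), since \(s_m(P)=P\) whenever \(m\ge n+1\). For an arbitrary \(P\in \mathbb{H }_n\),

$$\begin{aligned} \Vert f-v_n(f)\Vert _p&= \Vert (f-P)-v_n(f-P)\Vert _p\\&\le \Vert f-P\Vert _p+2\Vert \sigma _{2n}(f-P)\Vert _p+\Vert \sigma _n(f-P)\Vert _p \le 4\Vert f-P\Vert _p. \end{aligned}$$

Taking the infimum over \(P\in \mathbb{H }_n\) gives the upper estimate, while the lower estimate holds simply because \(v_n(f)\in \mathbb{H }_{2n}\).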

To understand the difference between the behavior of \(\sigma _n(f)\) and \(v_n(f)\), we point out an alternative expression for these. Namely,

$$\begin{aligned} \sigma _n(f, x)&= \sum _{|k|<n} \left( 1-\frac{|k|}{n}\right) \hat{f}(k)\exp (ikx), \nonumber \\ v_n(f,x)&= \sum _{|k|\le n} \hat{f}(k)\exp (ikx) +\sum _{n+1\le |k|<2n} \left( 2-\frac{|k|}{n}\right) \hat{f}(k)\exp (ikx).\qquad \end{aligned}$$
(2.2)

Thus, if \(P\in \mathbb{H }_n\) and \(P\) is not a constant, then \(\sigma _n(P)\not =P\) but \(v_n(P)\equiv P\). We observe further that \(v_n(f)\) can be written in the form
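The multiplier form (2.2) is easy to check numerically. The following sketch works only with the sequences of multipliers; the function names and the random test polynomial are illustrative choices, not part of the original text.

```python
import numpy as np

def fejer_multiplier(k, n):
    """Multiplier of sigma_n in (2.2): 1 - |k|/n for |k| < n, and 0 otherwise."""
    k = np.abs(k)
    return np.where(k < n, 1.0 - k / n, 0.0)

def vallee_poussin_multiplier(k, n):
    """Multiplier of v_n in (2.2): 1 for |k| <= n, 2 - |k|/n for n < |k| < 2n, else 0."""
    k = np.abs(k)
    return np.where(k <= n, 1.0, np.where(k < 2 * n, 2.0 - k / n, 0.0))

n = 8
k = np.arange(-3 * n, 3 * n + 1)

# The identity v_n = 2 sigma_{2n} - sigma_n, checked multiplier by multiplier.
identity_gap = np.max(np.abs(
    vallee_poussin_multiplier(k, n)
    - (2 * fejer_multiplier(k, 2 * n) - fejer_multiplier(k, n))))

# Fourier coefficients of a random P in H_n (frequencies |k| < n only).
rng = np.random.default_rng(1)
p_hat = np.where(np.abs(k) < n, rng.standard_normal(k.size), 0.0)

fejer_changes_p = not np.allclose(fejer_multiplier(k, n) * p_hat, p_hat)
vp_keeps_p = np.allclose(vallee_poussin_multiplier(k, n) * p_hat, p_hat)
print(identity_gap, fejer_changes_p, vp_keeps_p)
```

The Fejér multiplier damps every nonzero frequency, which is why \(\sigma _n\) cannot reproduce a non-constant polynomial, whereas the flat part of the de La Vallée Poussin multiplier leaves all frequencies \(|k|\le n\) untouched.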

$$\begin{aligned} v_n(f,x)=\sum _{|k|<2n} h\left( \frac{k}{2n}\right) \hat{f}(k)\exp (ikx), \end{aligned}$$

where

$$\begin{aligned} h(t)=\left\{ \begin{array}{ll} 1, &{}\quad {\hbox { if}}\ |t| \le 1/2,\\ 2-2|t|, &{}\quad {\hbox { if}} \ 1/2<|t|<1,\\ 0, &{}\quad {\hbox { if}} \ |t|\ge 1. \end{array}\right. \end{aligned}$$

Motivated by this observation, we make the following definition.

Definition 2.1

Let \(h: \mathbb{R }\rightarrow [0,1]\) be compactly supported. The summability operator (corresponding to the filter \(h\)) is defined for \(f\in L^1\) by

$$ \begin{aligned} \sigma _n(h,f,x)\mathop {=}\limits ^{\mathrm{def}}\sum _{k\in \mathbb{Z }}h\left( \frac{k}{n}\right) \hat{f}(k)\exp (ikx), \quad n\in \mathbb{N }\ \& \ x\in \mathbb T . \end{aligned}$$
(2.3)

The summability kernel corresponding to \(h\) is defined by

$$\begin{aligned} \Phi _n(h,t) \mathop {=}\limits ^{\mathrm{def}}\sum _{k\in \mathbb{Z }}h\left( \frac{k}{n}\right) \exp (ikt), \quad t\in \mathbb T . \end{aligned}$$
(2.4)

The function \(h\) is called a low pass filter if \(h\) is an even function, non-increasing on \([0,\infty )\), \(h(t)=1\) if \(|t|\le 1/2\), and \(h(t)=0\) if \(|t|\ge 1\).

We note that for \(n\in \mathbb{N }\) and \(f\in L^1\),

$$\begin{aligned}&\sigma _n(h,f,x)= \frac{1}{2\pi }\int _\mathbb T f(t)\Phi _n(h,x-t)\,dt\nonumber \\&\quad = \frac{1}{2\pi }\int _\mathbb T f(x-t)\Phi _n(h,t)\,dt, \quad f\in L^1, \ x\in \mathbb T . \end{aligned}$$
(2.5)

Even though the sum in (2.3) is written as an infinite sum, it is in fact a finite sum since \(h(|k|/n)=0\) if \(|k|\) is sufficiently large; if \(h\) is a low pass filter, then \(|k|\ge n\) is sufficient. Therefore, for such \(h\), \(\sigma _n(h,f)\in \mathbb{H }_n\). By writing the sum as an infinite sum, we avoid the need to restrict ourselves in the definition to the case when \(n\) is an integer. If \(h\) is supported on \([-A,A]\), then \(\Phi _{1/A}(h,t) \equiv 1\), and we redefine \(\Phi _n(h,t)\equiv 0\) if \(n<1/A\).
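A minimal sketch of the summability operator (2.3), acting on a finite table of Fourier coefficients, is given below. The filter `h3` is the twice continuously differentiable low pass filter \(h_3\) used as an example in this section; the coefficient table, the dictionary-based interface, and the function name `summability_operator` are assumptions made for illustration.

```python
import numpy as np

def h3(t):
    """C^2 low pass filter: 1 on [0, 1/2], 0 on [1, inf), polynomial bridge between."""
    t = np.abs(np.asarray(t, dtype=float))
    bridge = (1 - t) ** 3 * (8 + 48 * (t - 0.5) + 192 * (t - 0.5) ** 2)
    return np.where(t <= 0.5, 1.0, np.where(t < 1.0, bridge, 0.0))

def summability_operator(filter_fn, n, f_hat, x):
    """Evaluate sigma_n(h, f, x) = sum_k h(k/n) f_hat(k) exp(ikx).

    f_hat is a dict {k: Fourier coefficient}; n may be any positive real number.
    """
    ks = np.array(sorted(f_hat))
    coeffs = np.array([f_hat[k] for k in ks], dtype=complex)
    weights = filter_fn(ks / n)            # vanishes whenever |k| >= n
    return (weights * coeffs) @ np.exp(1j * np.outer(ks, x))

x = np.linspace(-np.pi, np.pi, 201)
n = 16
f_hat = {k: 1.0 / (1 + k * k) for k in range(-40, 41)}   # hypothetical coefficients

full = summability_operator(h3, n, f_hat, x)
# Only |k| < n contributes, so truncating the table changes nothing.
truncated = summability_operator(
    h3, n, {k: c for k, c in f_hat.items() if abs(k) < n}, x)

# A polynomial with frequencies |k| < n/2 is left unchanged by the operator.
p_hat = {k: f_hat[k] for k in range(-7, 8)}
p_values = summability_operator(lambda t: np.ones_like(t), n, p_hat, x)
reproduced = summability_operator(h3, n, p_hat, x)
print(np.max(np.abs(full - truncated)), np.max(np.abs(reproduced - p_values)))
```

Because the operator is a fixed set of weights applied to the coefficients, evaluating it on a grid amounts to a single matrix–vector multiplication.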

We summarize some important properties of the summability operator. In what follows, the constants may depend on the function \(h\).

Theorem 2.1

  1. (a)

    If \(S\ge 2\) is an integer, and \(h\) is an \(S\)-times continuously differentiable function, then

    $$ \begin{aligned} |\Phi _n(h,t)| \le cn\min \left( 1, (n|t|)^{-S}\right) , \quad t\in \mathbb T \ \& \ n\ge 1. \end{aligned}$$
    (2.6)
  2. (b)

    Let \(h\) be a twice continuously differentiable, even function, \(1\le p\le \infty \). Then

    $$ \begin{aligned} \Vert \sigma _n(h,f)\Vert _p \le c\Vert f\Vert _p, \quad f\in L^p \ \& \ n\in \mathbb{N }. \end{aligned}$$
    (2.7)
  3. (c)

    Let \(h\) be a twice continuously differentiable low pass filter, \(n\in \mathbb{N }\). Then \(\sigma _n(h,P)\equiv P\) for all \(P\in \mathbb{H }_{n/2}\). In addition, for \(1\le p\le \infty \) and \(f\in L^p\),

    $$\begin{aligned} E_{n,p}(f) \le \Vert f-\sigma _n(h,f)\Vert _p \le cE_{n/2,p}(f). \end{aligned}$$
    (2.8)

Since the localization estimate (2.6) and its analogues in various contexts play an important role in our theory, we pause in our discussion to illustrate it by an example. We consider two low pass filters, \(h_3\) and \(h_\infty \) defined on \((1/2,1)\) by

$$\begin{aligned} h_3(t)=(1-t)^3\left( 8+48(t-1/2)+192(t-1/2)^2\right) \!, \end{aligned}$$

and

$$\begin{aligned} h_\infty (t) =\exp \left( -\frac{\exp (2/(1-2t))}{1-t}\right) . \end{aligned}$$

Of course, both functions are equal to \(1\) on \([0,1/2]\) and \(0\) on \([1,\infty )\). The function \(h_3\) is twice continuously differentiable on \([0,\infty )\) whereas \(h_\infty \) is infinitely many times differentiable. In Fig. 2, we show the graphs of \(|\Phi _n(h_3,x)|\) and \(|\Phi _n(h_\infty ,x)|\) for \(n=512\) and \(n=1024\) on the interval \([\pi /3,\pi ]\). It is clear from the figure that for both the values of \(n\), the graph corresponding to \(h_\infty \) is an order of magnitude smaller than that for \(h_3\), and that the graph corresponding to \( h_\infty \) decreases much faster as \(n\) (and/or \(|x|\)) increases.

Fig. 2

Clockwise from the top left, the graphs of \(|\Phi _{512}(h_3,x)|\), \(|\Phi _{1024}(h_3,x)|\), \(|\Phi _{1024}(h_\infty ,x)|\), and \(|\Phi _{512}(h_\infty ,x)|\). The numbers on the \(x\) axis are in multiples of \(\pi \). The maximum absolute values of these are \(2.6002e-07\), \(3.2776e-08\), \(4.4293e-11\), and \(6.6296e-08\) respectively
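The localization of the two kernels can be explored with the sketch below, which evaluates (2.4) directly for \(h_3\) and \(h_\infty \). The evaluation grid, the value \(n=512\), and the function names are choices made here for illustration; the maxima printed by this script are taken over a finite grid and need not coincide with the values quoted in the caption of Fig. 2.

```python
import numpy as np

def h3(t):
    """C^2 low pass filter h_3."""
    t = np.abs(np.atleast_1d(np.asarray(t, dtype=float)))
    bridge = (1 - t) ** 3 * (8 + 48 * (t - 0.5) + 192 * (t - 0.5) ** 2)
    return np.where(t <= 0.5, 1.0, np.where(t < 1.0, bridge, 0.0))

def h_inf(t):
    """Infinitely differentiable low pass filter h_infinity."""
    t = np.abs(np.atleast_1d(np.asarray(t, dtype=float)))
    out = np.where(t <= 0.5, 1.0, 0.0)
    mid = (t > 0.5) & (t < 1.0)
    tm = t[mid]                      # the formula is only evaluated on (1/2, 1)
    out[mid] = np.exp(-np.exp(2.0 / (1.0 - 2.0 * tm)) / (1.0 - tm))
    return out

def kernel(filter_fn, n, t):
    """Phi_n(h, t) = sum_k h(k/n) exp(ikt) for an even filter supported in [-1, 1]."""
    k = np.arange(-(n - 1), n)       # h(k/n) = 0 for |k| >= n
    weights = filter_fn(k / n)
    # The imaginary parts cancel in pairs because the filter is even.
    return np.cos(np.outer(t, k)) @ weights

n = 512
far = np.linspace(np.pi / 3, np.pi, 1000)   # region away from the peak at t = 0
report = {}
for name, filt in (("h3", h3), ("h_inf", h_inf)):
    peak = kernel(filt, n, np.array([0.0]))[0]
    tail = np.max(np.abs(kernel(filt, n, far)))
    report[name] = (peak, tail)
    print(name, "peak:", peak, "max on [pi/3, pi]:", tail)
```

In both cases the kernel has a peak of size comparable to \(n\) at the origin and is many orders of magnitude smaller on \([\pi /3,\pi ]\), as the estimate (2.6) predicts.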

Proof of Theorem 2.1

Part (a) can be proved using the Poisson summation formula; see, e.g., [75, Chapter VII, Theorem 2.4 and Corollary 2.6]; a recent proof is given in [22]. If \(h\) is twice continuously differentiable, we use (2.6) with \(S=2\) to obtain

$$\begin{aligned} \int _\mathbb T |\Phi _n(h,t)|\,dt&= \int _{|t|\le 1/n} |\Phi _n(h,t)|\,dt+\int _{t\in \mathbb T , \ |t|>1/n}|\Phi _n(h,t)|\,dt\\&\le cn(2/n) + 2cn^{-1}\int _{1/n}^\pi t^{-2}\,dt \le 2c+2cn^{-1}\int _{1/n}^\infty t^{-2}\,dt = 4c. \end{aligned}$$

The estimate (2.7) is now deduced using the convolution identity (2.5) and Young’s inequality (cf. [79, Chapter II, Theorem 1.15]). This proves part (b). Since \(h\) is a low pass filter, \(h(|k|/n)=1\) if \(|k|<n/2\). If \(P\in \mathbb{H }_{n/2}\) then \(\hat{P}(k)=0\) if \(|k|\ge n/2\). Hence,

$$\begin{aligned} \sigma _n(h,P,x)=\sum _{|k|<n/2} h\left( \frac{|k|}{n}\right) \hat{P}(k)\exp (ikx)=P(x), \quad x\in \mathbb T . \end{aligned}$$

This proves the first assertion in part (c). Let \(f\in L^p\). The first inequality in (2.8) follows from the fact that \(\sigma _n(h,f)\in \mathbb{H }_n\). If \(P\in \mathbb{H }_{n/2}\) is arbitrary, then

$$\begin{aligned}&\Vert f-\sigma _n(h,f)\Vert _p = \Vert f-P-\sigma _n(h, f-P)\Vert _p \le \Vert f-P\Vert _p +\Vert \sigma _n(h, f-P)\Vert _p\\&\quad \le c_1\Vert f-P\Vert _p. \end{aligned}$$

The second inequality in (2.8) follows by taking the infimum over \(P\in \mathbb{H }_{n/2}\). \(\square \)
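In the case \(p=2\), the estimate (2.8) can be checked directly on Fourier coefficients, because by the Parseval identity all three quantities are weighted \(\ell ^2\) norms of the coefficient sequence (the normalization constant is the same in each and is therefore omitted). The sketch below uses a hypothetical coefficient sequence with algebraic decay, truncated to finitely many terms; in this special case the constant in (2.8) can be taken to be \(1\), since \(0\le 1-h\le 1\).

```python
import numpy as np

def h3(t):
    """C^2 low pass filter h_3."""
    t = np.abs(np.asarray(t, dtype=float))
    bridge = (1 - t) ** 3 * (8 + 48 * (t - 0.5) + 192 * (t - 0.5) ** 2)
    return np.where(t <= 0.5, 1.0, np.where(t < 1.0, bridge, 0.0))

def tail_norm(f_hat, ks, cutoff):
    """l^2 norm of the coefficients with |k| >= cutoff, i.e. E_{cutoff,2}(f) up to normalization."""
    return np.sqrt(np.sum(np.abs(f_hat[np.abs(ks) >= cutoff]) ** 2))

ks = np.arange(-2000, 2001)
f_hat = 1.0 / (1.0 + np.abs(ks)) ** 1.5      # assumed decay, for illustration only

results = []
for n in (8, 32, 128):
    # Coefficients of f - sigma_n(h, f) are (1 - h(k/n)) f_hat(k).
    error = np.sqrt(np.sum(((1.0 - h3(ks / n)) * f_hat) ** 2))
    lower = tail_norm(f_hat, ks, n)          # E_{n,2}(f)
    upper = tail_norm(f_hat, ks, n / 2)      # E_{n/2,2}(f)
    results.append((n, lower, error, upper))
    print(n, lower, error, upper)
```

The filtered error sits between the two best-approximation errors for every \(n\), which is exactly the near-best property expressed by (2.8).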

2.2 Degree of approximation

If \(\epsilon _n\downarrow 0\) as \(n\rightarrow \infty \), then it is easy to construct \(f\in C^*\) for which \(E_{n,\infty }(f) \le \epsilon _n\). We set \(\delta _k=\epsilon _k-\epsilon _{k+1}\) if \(k\ge 0\), and observe that \(\sum \delta _k\) is a convergent series with non-negative terms. Since \(|\cos (kx)| \le 1\) for all \(x\in \mathbb T \), the series

$$\begin{aligned} f(x)=\sum _{k=0}^\infty \delta _k\cos (kx) \end{aligned}$$

converges uniformly and absolutely to \(f\in C^*\). Clearly,

$$\begin{aligned} E_{n,\infty }(f)\le \sum _{k=n}^\infty \delta _k =\epsilon _n. \end{aligned}$$

As a matter of fact, as proved by Bernstein (cf. [66, Chapter V, Section 5, p. 109]), one could even have \(E_{n,\infty }(f) = \epsilon _n\). A central question in approximation theory is to determine the constructive properties (smoothness) of \(f\) which are equivalent to a given rate of decrease of the sequence \(\{E_{n,p}(f)\}\). In particular, we will examine the very classical case where \(E_{n,p}(f)=\mathcal{O}(n^{-\gamma })\) for some \(\gamma >0\).
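The construction at the beginning of this subsection can be tried out numerically. The sketch below assumes the particular sequence \(\epsilon _k=2^{-k}\), truncated so that \(\epsilon _k=0\) for \(k\ge 60\); both choices are illustrative. It compares the uniform error of the partial sums, which belong to \(\mathbb{H }_n\), with \(\epsilon _n\).

```python
import numpy as np

K = 60
# eps_0, ..., eps_{K-1}, followed by eps_K = 0 (assumed truncation of the sequence).
eps = np.append(2.0 ** -np.arange(K), 0.0)
delta = eps[:-1] - eps[1:]                       # delta_k >= 0 for k = 0, ..., K-1

x = np.linspace(-np.pi, np.pi, 1001)
cos_table = np.cos(np.outer(np.arange(K), x))    # row k holds cos(kx)
f = delta @ cos_table

def partial_sum(n):
    """Sum over k < n: a trigonometric polynomial of degree < n, hence in H_n."""
    return delta[:n] @ cos_table[:n]

errors = {n: np.max(np.abs(f - partial_sum(n))) for n in (1, 2, 5, 10, 20)}
for n, err in errors.items():
    print(n, err, eps[n])
```

Since the partial sum is only one candidate in \(\mathbb{H }_n\), its error is an upper bound for \(E_{n,\infty }(f)\); the script confirms that this upper bound does not exceed \(\epsilon _n\).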

The fundamental inequalities in this theory are given in the following theorem (for the uniform norm and elementary proofs, see [57, Chapter III, Section 3.1, Theorem 2, Chapter III, Section 2.2, Corollary 3] and in greater generality [12, Chapter 4, Theorem 2.4, Chapter 7, Section 4]). The best constant in the Bernstein inequality, see (2.9) below, is well-known. The idea of using summability operators to prove Bernstein-type inequalities goes back to 1914 when Riesz realized that Landau’s trick of applying Fejér’s ideas works for Bernstein-type inequalities. Riesz even managed to find the best constant with this seemingly inefficient approach; for the full story see [69]. While Riesz used an explicit expression for the kernel, we will use only the filter which gives rise to the summability kernel. We do not get the best constants, but the method can be generalized considerably to the settings when such an explicit expression for the kernel is not available, see, e.g., [39]. The best constants in (2.10) below are known [12, Chapter 7, Theorem 4.3], [57, Chapter III, Section 2.2, Theorem 4] as well.

Theorem 2.2

Let \(1\le p\le \infty \), \(n\in \mathbb{N }\), and let \(r\in \mathbb{N }\).

  1. (a)

    If \(P\in \mathbb{H }_n\) then the Bernstein-type inequality

    $$\begin{aligned} \Vert {P}^{(r)}\Vert _p \le cn^r\Vert P\Vert _p \end{aligned}$$
    (2.9)

    holds.

  2. (b)

    If \(f\in X^p\) and \({f}^{(r)}\in X^p\), then the Favard-type inequality

    $$\begin{aligned} E_{n,p}(f) \le \frac{c}{n^r}\Vert {f}^{(r)}\Vert _p. \end{aligned}$$
    (2.10)

    holds.

Proof

Let us fix an infinitely differentiable low pass filter \(h\). Clearly, the function \(h_1(t)\mathop {=}\limits ^{\mathrm{def}}t^rh(t)\), \(t\in \mathbb{R }\), is compactly supported and it is infinitely differentiable. To prove part (a), let \(P\in \mathbb{H }_n\). Then, in view of Theorem 2.1(c), we have

$$\begin{aligned} {P}^{(r)}(x)&= \sum _{k\in \mathbb{Z }} h\left( \frac{k}{2n}\right) \widehat{{P}^{(r)}}(k)\exp (ikx) = \sum _{k\in \mathbb{Z }} h\left( \frac{k}{2n}\right) (ik)^r \hat{P}(k)\exp (ikx) \\&= (2in)^r \sum _{k\in \mathbb{Z }} h_1\left( \frac{k}{2n}\right) \hat{P}(k)\exp (ikx)=(2in)^r\sigma _{2n}(h_1,P,x). \end{aligned}$$

The estimate (2.9) now follows immediately from (2.7). This proves part (a).

Next, we prove part (b). For \(f\in X^p\), we define

$$\begin{aligned} P_j \mathop {=}\limits ^{\mathrm{def}}\left\{ \begin{array}{ll} \sigma _1(h,f), &{}\quad {\hbox {if}}\ j=0,\\ \sigma _{2^j}(h,f)-\sigma _{2^{j-1}}(h,f), &{}\quad {\hbox {if}}\ j\in \mathbb{N }. \end{array}\right. \end{aligned}$$
(2.11)

Then for all integers \(m\ge 2\),

$$\begin{aligned} \sum _{j=0}^m P_j = \sigma _{2^m}(h,f), \end{aligned}$$
(2.12)

and so, since \(f\in X^p\), (2.8) implies that

$$\begin{aligned} f=\sum _{j=0}^\infty P_j, \end{aligned}$$
(2.13)

with the series converging in the sense of \(X^p\).

Next, let \(g(t)=h(t)-h(2t)\), and \(g_1(t)=g(t)t^{-r}\). Then \(g(t)=0\) if \(|t|\le 1/4\), and, hence, \(g_1\) is infinitely differentiable. For \(j\in \mathbb{N }\) and \(x\in \mathbb T \), we have

$$\begin{aligned}&P_j(x)= \sum _{k\in \mathbb{Z }}\left( h\left( \frac{|k|}{2^j}\right) -h\left( \frac{|k|}{2^{j-1}}\right) \right) \hat{f}(k)\exp (ikx)\\&\quad =\sum _{k\in \mathbb{Z }} g\left( \frac{|k|}{2^j}\right) \hat{f}(k)\exp (ikx)=\sigma _{2^j}(g,f,x). \end{aligned}$$
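The telescoping behind (2.12) and the vanishing of \(g\) near the origin can both be verified at the level of multipliers. The sketch below uses the filter \(h_3\) in place of the infinitely differentiable filter fixed in the proof; this substitution, like the value \(m=6\), is only for illustration, since the identities checked here hold for any low pass filter.

```python
import numpy as np

def h3(t):
    """C^2 low pass filter h_3, standing in for the filter h of the proof."""
    t = np.abs(np.asarray(t, dtype=float))
    bridge = (1 - t) ** 3 * (8 + 48 * (t - 0.5) + 192 * (t - 0.5) ** 2)
    return np.where(t <= 0.5, 1.0, np.where(t < 1.0, bridge, 0.0))

def g(t):
    """g(t) = h(t) - h(2t), the multiplier behind P_j = sigma_{2^j}(g, f)."""
    t = np.asarray(t, dtype=float)
    return h3(t) - h3(2 * t)

m = 6
k = np.arange(-(2 ** m) - 5, 2 ** m + 6)

# Multiplier of P_0 is h(k); multiplier of P_j is g(k / 2^j) for j >= 1.
telescoped = h3(k) + sum(g(k / 2.0 ** j) for j in range(1, m + 1))
gap = np.max(np.abs(telescoped - h3(k / 2.0 ** m)))

# g vanishes identically on [-1/4, 1/4], so g(t) / t^r has no singularity at 0.
g_vanishes_near_zero = bool(np.all(g(np.linspace(-0.25, 0.25, 101)) == 0.0))
print(gap, g_vanishes_near_zero)
```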

Let \(j\ge 2\). Taking into consideration that \(g(t)=0\) if \(|t|\le 1/4\), we can deduce that

$$\begin{aligned} P_j(x)&= \sigma _{2^j}(g,f,x)=\sum _{k\in \mathbb{Z }}g\left( \frac{k}{2^j}\right) \hat{f}(k)\exp (ikx)\\&= \sum _{k\in \mathbb{Z }}g\left( \frac{k}{2^j}\right) \frac{1}{(ik)^r}\widehat{{f}^{(r)}}(k)\exp (ikx)\\&= \frac{1}{(i2^j)^r}\sum _{k\in \mathbb{Z }}g_1\left( \frac{k}{2^j}\right) \widehat{{f}^{(r)}}(k)\exp (ikx) \\&= \frac{1}{(i2^j)^r}\sigma _{2^j}(g_1,{f}^{(r)},x). \end{aligned}$$

Therefore, (2.7) shows that

$$\begin{aligned} \Vert P_j\Vert _p \le c2^{-jr}\Vert {f}^{(r)}\Vert _p. \end{aligned}$$

Using (2.12) and (2.13), we conclude that

$$\begin{aligned} E_{2^m,p}(f) \!\le \! \Vert f-\sigma _{2^m}(h,f)\Vert _p \!\le \! \sum _{j=m+1}^\infty \Vert P_j\Vert _p \!\le \! c\Vert {f}^{(r)}\Vert _p\sum _{j=m+1}^\infty 2^{-jr} \!=\!c_12^{-mr}\Vert {f}^{(r)}\Vert _p. \end{aligned}$$

If \(n\ge 4\), we find \(m\ge 2\) such that \(2^m \le n \le 2^{m+1}\). Then the above estimate shows (2.10) in the case when \(n\ge 4\). The estimate is trivial if \(n=1,2,3\). \(\square \)
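As a numerical sanity check of the Bernstein-type inequality (2.9) in the uniform norm, the sketch below differentiates a random element of \(\mathbb{H }_n\) through its coefficients and compares norms on a fine grid. The random polynomial, the grid size, and the comparison with the constant \(1\) (the classical sharp constant for \(p=\infty \)) are choices made here; grid maxima only approximate the true uniform norms.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
k = np.arange(-(n - 1), n)                 # frequencies of an element of H_n
coeffs = rng.standard_normal(k.size) + 1j * rng.standard_normal(k.size)
x = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
basis = np.exp(1j * np.outer(k, x))        # row k holds exp(ikx) on the grid

def sup_norm_of_derivative(r):
    """Grid approximation of the uniform norm of P^(r), using (ik)^r c_k as coefficients."""
    multiplier = (1j * k) ** r if r > 0 else np.ones(k.size)
    return np.max(np.abs((multiplier * coeffs) @ basis))

ratios = {r: sup_norm_of_derivative(r) / sup_norm_of_derivative(0) for r in (1, 2, 3)}
for r, ratio in ratios.items():
    print(r, ratio, n ** r)
```

For a random polynomial the ratios are typically well below \(n^r\); equality in the classical inequality requires very special polynomials such as a single exponential.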

We pause in our discussion to introduce the notions of widths in approximation theory, in contradistinction with the characterization theorem (Theorem 2.3 below), which is our main interest in this section.

Let \(X\) be a normed linear space, \(n\ge 1\) be an integer, \(K\subset X\) be a compact set, and \(Y\subset X\) be a closed set. We define the distance

$$\begin{aligned} {\hbox {dist }}(K,Y)\mathop {=}\limits ^{\mathrm{def}}\sup _{f\in K}\inf _{P\in Y}\Vert f-P\Vert _X. \end{aligned}$$
(2.14)

In this paper, we will refer to bounds on \({\hbox {dist }}(K,Y)\) as distance bounds.

The notion of widths in approximation theory gives in some sense minimal distance of \(K\) from various different choices of \(Y\). Depending upon the nature of \(Y\), there are different variants of widths [36, 72]. We will describe two of these.

For integer \(n\ge 1\), the Kolmogorov \(n\)-width of \(K\) [32] is defined by

$$\begin{aligned} d_{n,kol}(K,X)\mathop {=}\limits ^{\mathrm{def}}\inf _{X_n} {\hbox { dist }}(K,X_n), \end{aligned}$$
(2.15)

where the infimum is taken over all linear subspaces \(X_n\) of \(X\), with the dimension of \(X_n\) being \(\le n\). More generally, any process of approximation of elements of \(K\) based on \(n\) parameters can be formalized as the composition of two functions: the first, \(\mathcal{M } :K\rightarrow \mathbb{R }^n\), represents the selection of the parameters, while the second, \(\mathcal{R }:\mathbb{R }^n\rightarrow X\), represents the reconstruction. Their composition \(f\mapsto \mathcal{R }(\mathcal{M }(f))\) is the desired approximation of \(f\in K\). The nonlinear \(n\)-width of \(K\) with respect to \(X\) is defined by DeVore et al. [11] and independently by Mathé [41] as

$$\begin{aligned} d_n(K,X)\mathop {=}\limits ^{\mathrm{def}}\inf \sup _{f\in K}\Vert f-\mathcal{R }(\mathcal{M }(f))\Vert _X, \end{aligned}$$
(2.16)

where the infimum is taken over all \(\mathcal R :\mathbb{R }^n\rightarrow X\), and all continuous \(\mathcal M :K\rightarrow \mathbb{R }^n\). Here, the reconstruction operation can be arbitrary. The nonlinear \(n\)-width gives the theoretically minimum “worst-case error” in approximating elements of \(K\), subject only to the prior knowledge that they are elements of \(K\), using \(n\) parameters selected in a stable manner.

It is not immediately clear that the choice of parameters involved in approximation from an arbitrary finite dimensional space is always continuous. Therefore, it is not obvious that the nonlinear \(n\)-width is a sharpening of the Kolmogorov \(n\)-width. However, in view of [11, Corollary 2.2], it follows that

$$\begin{aligned} d_n(K,X)\le d_{n,kol}(K,X). \end{aligned}$$

In many applications, \(K\) is defined in terms of a semi-norm \(|\!|\!|\circ |\!|\!|_K\) on a subset of \(X\):

$$\begin{aligned} K=\{f\in X : |\!|\!|f|\!|\!|_K\le 1\}. \end{aligned}$$

As a consequence of [11, Theorem 3.1], it follows that if there exists a linear subspace \(Y\subset X\) with dimension \(n+1\) such that a Bernstein inequality of the form

$$\begin{aligned} |\!|\!|g|\!|\!|_K \le b_n(K)\Vert g\Vert _X, \quad g\in Y, \end{aligned}$$
(2.17)

holds, then with an absolute constant \(c\),

$$\begin{aligned} d_n(K,X)\ge cb_n(K)^{-1}. \end{aligned}$$
(2.18)

For example, let \(X=X^p\), \(Y=\mathbb{H }_n\). Then \({\hbox {dist}}(K,\mathbb{H }_n)\) is the “worst-case error” in approximating elements of \(K\) by trigonometric polynomials of degree \(<n\). When

$$\begin{aligned} K\mathop {=}\limits ^{\mathrm{def}}\{f\in X^p : \Vert {f}^{(r)}\Vert _p \le 1\}, \end{aligned}$$
(2.19)

the Favard inequality (2.10) shows the upper distance bound

$$\begin{aligned} {\hbox {dist }}(K,\mathbb{H }_n)\le cn^{-r}. \end{aligned}$$

Therefore, with \(n=2m-1\), we see that \(d_n(K)\le d_{n,kol}(K)\le cn^{-r}\), while the Bernstein inequality (2.9) and (2.18) together show that \(d_n(K)\ge cn^{-r}\). Moreover, Theorem 2.1 then leads us to conclude that using the Fourier coefficients for the parameter selection, followed by the operator \(\sigma _n\) is, up to a constant factor, an optimal reconstruction algorithm for approximation of functions in \(K\).

The estimates on the \(n\)-widths or the lower distance bounds imply that the degree of approximation or method of approximation cannot be improved, in the sense that there is always some “bad function” in the class for which a lower bound is attained. They do not address the question of whether individual functions can be approximated better than what the degree of approximation theorem predicts, based on the a priori information known about the target function. For example, one could conceivably use some clever ideas appropriate for the target function, that can yield a better performance than the theoretically assumed a priori information. This is the question of characterization of smoothness classes. This is our main interest in this section, and we now resume this discussion.

It is perhaps clear that the number of derivatives is not a sufficiently sophisticated indication to characterize functions for which \(E_{n,p}(f)=\mathcal{O}(n^{-\gamma })\); e.g., when \(\gamma \) is not an integer. If \(\gamma \) is an integer and \({f}^{(\gamma )}\in X^p\), then \(E_{n,p}(f)=\mathcal{O}(n^{-\gamma })\) as proved in (2.10). The converse is not true. For example, if \(f(x)=|\cos x|\), then it is easy to compute that

$$\begin{aligned} f(x)=\frac{2}{\pi } +\frac{4}{\pi }\sum _{k=1}^\infty (-1)^{k+1}\frac{\cos (2kx)}{4k^2-1} \end{aligned}$$

where the series converges uniformly. Therefore,

$$\begin{aligned} E_{2n-1,\infty }(f) \le \Vert f-s_{2n}(f)\Vert _\infty \le \frac{4}{\pi } \sum _{k=n}^\infty \frac{1}{4k^2-1} \le c/n. \end{aligned}$$

Since \(f\) is not differentiable at \(\pm \pi /2\), the converse of the Favard estimate is not true.
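As a numerical sanity check of this example, the following sketch evaluates the partial sums of the cosine series above on a grid and compares the uniform error with the tail bound. The helper name `abs_cos_partial_sum`, the grid, and the closed form \(2/(\pi (2n-1))\) of the tail (obtained by telescoping \(1/(4k^2-1)\)) are our own illustrative choices, and a finite grid only approximates the supremum norm.

```python
import math

def abs_cos_partial_sum(x, n):
    """2/pi + (4/pi) * sum_{k=1}^{n-1} (-1)^(k+1) cos(2kx) / (4k^2 - 1)."""
    tail = sum((-1) ** (k + 1) * math.cos(2 * k * x) / (4 * k * k - 1)
               for k in range(1, n))
    return 2 / math.pi + 4 / math.pi * tail

# Uniform grid on [-pi, pi]; index 3000 is the kink x = pi/2.
grid = [-math.pi + 2 * math.pi * i / 4000 for i in range(4001)]

errors = {}
for n in (8, 16, 32, 64):
    errors[n] = max(abs(abs(math.cos(x)) - abs_cos_partial_sum(x, n)) for x in grid)
    # n * error stays bounded, consistent with an O(1/n) rate.
    print(n, errors[n], n * errors[n])
```

In this sketch the maximum error is attained at the kinks \(\pm \pi /2\), where every term of the tail has the same sign, so the bound \(c/n\) cannot be improved for this particular series.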

The correct device is a regularization functional that is known in this context as a \(K\)-functional.

Definition 2.2

If \(r\in \mathbb{N }\), \(1\le p\le \infty \), and \(f\in X^p\), then the \(K\)-functional is defined by

$$\begin{aligned} K_{r,p}(f,\delta ) \mathop {=}\limits ^{\mathrm{def}}\inf \{\Vert f-g\Vert _p +\delta ^r\Vert {g}^{(r)}\Vert _p\}, \end{aligned}$$
(2.20)

where the infimum is taken over all \(g\) for which \({g}^{(r)}\in X^p\). If \(\gamma >0\) and \(r>\gamma \) is an integer, then we define

$$\begin{aligned} |\!|\!|f|\!|\!|_{\gamma , p} \mathop {=}\limits ^{\mathrm{def}}\sup _{0<\delta <1/2} \frac{K_{r,p}(f,\delta )}{\delta ^\gamma }. \end{aligned}$$
(2.21)

The set of all functions \(f\in X^p\) for which \(|\!|\!|f|\!|\!|_{\gamma ,p} <\infty \) is denoted by \(W_{\gamma ,p}\).

The following theorem [12, Chapter 7, Theorem 9.2], [57, Chapter III, Section 3.2, Corollary 7] shows the close connection between the apparently somewhat artificially defined class \(W_{\gamma ,p}\), the rate of decrease of the degrees of approximation, and the smoothness in the classical sense.

Theorem 2.3

Let \(1\le p\le \infty \), \(f\in X^p\), \(\gamma >0\), \(r>\gamma \) be an integer, \(\gamma =s+\beta \) where \(s\ge 0\) is an integer chosen so that \(0<\beta \le 1\), and let \(h\) be a twice continuously differentiable low pass filter.

  1. (a)

    \(f\in W_{\gamma ,p}\) if and only if \(E_{n,p}(f) =\mathcal{O}(n^{-\gamma })\). More precisely,

    $$\begin{aligned} |\!|\!|f|\!|\!|_{\gamma ,p} \sim \sup _{n\in \mathbb{N }} n^\gamma E_{n,p}(f)\sim \sup _{n\in \mathbb{N }}{n^\gamma }\Vert f-\sigma _n(h,f)\Vert _p, \end{aligned}$$
    (2.22)

    where the constants involved in “\(\sim \)” are independent of \(f\).

  2. (b)

    We have \(f\in W_{\gamma , p}\) if and only if \({f}^{(s)}\in W_{\beta ,p}\).

We remark that since the quantity \(E_{n,p}(f)\) does not depend on the choice of \(r\) in the definition of \(W_{\gamma ,p}\) except that \(r>\gamma \), the class \(W_{\gamma ,p}\) does not depend on the choice of \(r\) either, as long as \(r>\gamma \). Second, we note that in the above theorem \(s\) need not equal \(\lfloor \gamma \rfloor \): when \(\gamma \) is an integer, the requirement \(0<\beta \le 1\) forces \(s=\gamma -1\) and \(\beta =1\). In particular, for the function \(x\mapsto |\cos x|\), one has \(\gamma =1\), \(s=0\) and \(\beta =1\). There are characterizations of the \(K\)-functional that are given directly in terms of the function \(f\) rather than as a regularization functional. For instance (cf. [57, Chapter III, Section 1.2, Theorem 6], [12, Chapter 6, Theorem 2.4]),

$$\begin{aligned} K_{r,p}(f,\delta ) \sim \sup _{|t|\le \delta }\left\| \sum _{k=0}^r (-1)^k {\left( {\begin{array}{c}r\\ k\end{array}}\right) } f(\circ +kt)\right\| _p\!, \end{aligned}$$

where the right hand expression is called the \(r\)-th order modulus of smoothness of \(f\), and the constants involved in “\(\sim \)” are independent of \(f\) and \(\delta \). In particular, if \(0<\beta < 1\), then \(f\in W_{\beta ,\infty }\) is equivalent to the Lipschitz–Hölder condition on \(f\) ([57, Chapter III, Section 3.2, Corollary 7] [12, Chapter 7, Theorem 3.3]):

$$\begin{aligned} |f(x+t)-f(x)| \le c|\!|\!|f|\!|\!|_{\beta ,\infty } |t|^\beta , \quad x\in [-\pi ,\pi ]. \end{aligned}$$
(2.23)

In modern approximation theory, it has become more customary to take the \(K\)-functional itself as a measurement of smoothness in various situations, taking such direct relationships for granted.
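The modulus of smoothness can be approximated directly from its definition. The sketch below does this for \(p=\infty \), replacing both suprema by maxima over finite grids, and applies it with \(r=1\) to \(x\mapsto |\cos x|^{1/2}\), for which the Lipschitz–Hölder condition (2.23) holds with \(\beta =1/2\). The function name and grid sizes are hypothetical choices of ours, and the result is only a grid approximation.

```python
import math
from math import comb

def modulus_of_smoothness(f, r, delta, n_x=4000, n_t=20):
    """Grid approximation of sup_{|t|<=delta} || sum_k (-1)^k C(r,k) f(. + k t) ||_inf.

    Only t in (0, delta] is scanned: for a 2*pi-periodic f, the difference with
    step -t is a translate of the one with step t (up to sign), so the norms agree.
    """
    xs = [-math.pi + 2 * math.pi * i / n_x for i in range(n_x)]
    worst = 0.0
    for i in range(1, n_t + 1):
        t = delta * i / n_t
        for x in xs:
            diff = sum((-1) ** k * comb(r, k) * f(x + k * t) for k in range(r + 1))
            worst = max(worst, abs(diff))
    return worst

f = lambda x: math.sqrt(abs(math.cos(x)))
# If the modulus behaves like delta^(1/2), these ratios stay bounded above and below.
ratios = [modulus_of_smoothness(f, 1, d) / math.sqrt(d) for d in (0.1, 0.05, 0.025)]
print(ratios)
```

The ratios remain close to a constant as \(\delta \) decreases, which is the numerical counterpart of membership in \(W_{1/2,\infty }\) and of the failure of any higher order \(\delta ^\gamma \), \(\gamma >1/2\).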

We end this subsection by pointing out an interesting property of the summability operator, referred to in approximation theory literature as realization of the \(K\)-functional.

Theorem 2.4

Let \(r\in \mathbb{N }\), \(1\le p\le \infty \), and \(f\in X^p\). Let \(h\) be a twice continuously differentiable low pass filter. Then for \(n\in \mathbb{N }\),

$$\begin{aligned} \Vert f-\sigma _n(h,f)\Vert _p +n^{-r}\Vert {\sigma _n}^{(r)}(h,f)\Vert _p \sim K_{r,p}(f,1/n), \end{aligned}$$
(2.24)

where the constants involved in “\(\sim \)” are independent of both \(f\) and \(n\).

Thus, when one is not interested in finding the function \(g\) that achieves the infimum in the definition of the \(K\)-functional, \(\sigma _n(h,f)\) supplies a near-optimal solution. Moreover, while finding the minimizer \(g\) is a separate optimization problem for each \(r\) and \(p\), the summability operator works for every \(r\) and every \(p\), and its construction does not involve any optimization.

The essential ideas behind the proof of Theorem 2.4 are in the paper [8] of Czipszer and Freud. For lack of an easily accessible reference, we include the simple proof.

Proof of Theorem 2.4

The definition (2.20) of the \(K\)-functional shows that

$$\begin{aligned} K_{r,p}(f,1/n)\le \Vert f-\sigma _n(h,f)\Vert _p +n^{-r}\Vert {\sigma _n}^{(r)}(h,f)\Vert _p. \end{aligned}$$

We prove the inequality in the other direction. Let \(g\in X^p\) be any function with \({g}^{(r)}\in X^p\). Then using Theorem 2.1 and the Favard estimate (2.10), we obtain

$$\begin{aligned}&\Vert f-\sigma _n(h,f)\Vert _p\le \Vert f-g-\sigma _n(h,f-g)\Vert _p+\Vert g-\sigma _n(h,g)\Vert _p\le \Vert f-g\Vert _p\nonumber \\&\quad +\Vert \sigma _n(h,f-g)\Vert _p +cE_{n/2,p}(g)\le c\left\{ \Vert f-g\Vert _p+\frac{1}{n^r}\Vert {g}^{(r)}\Vert _p\right\} \!. \end{aligned}$$
(2.25)

Further, using the fact that \({\sigma _n(h,g)}^{(r)}=\sigma _n(h,{g}^{(r)})\) and the Bernstein inequality (2.9), we deduce that

$$\begin{aligned}&n^{-r}\Vert {\sigma _n}^{(r)}(h,f)\Vert _p\le n^{-r}\Vert {\sigma _n(h,f-g)}^{(r)}\Vert _p + n^{-r}\Vert {\sigma _n(h,g)}^{(r)}\Vert _p\\&\quad \le c\{\Vert \sigma _n(h,f-g)\Vert _p +n^{-r}\Vert \sigma _n(h,{g}^{(r)})\Vert _p\}\le c\left\{ \Vert f-g\Vert _p+\frac{1}{n^r}\Vert {g}^{(r)}\Vert _p\right\} \!. \end{aligned}$$

Together with (2.25), we have shown that

$$\begin{aligned} \Vert f-\sigma _n(h,f)\Vert _p +n^{-r}\Vert {\sigma _n}^{(r)}(h,f)\Vert _p\le c\left\{ \Vert f-g\Vert _p+\frac{1}{n^r}\Vert {g}^{(r)}\Vert _p\right\} \!. \end{aligned}$$

Since \(g\) is an arbitrary function with \({g}^{(r)}\in X^p\), the definition (2.20) shows that

$$\begin{aligned} \Vert f-\sigma _n(h,f)\Vert _p +n^{-r}\Vert {\sigma _n}^{(r)}(h,f)\Vert _p\le cK_{r,p}(f,1/n). \end{aligned}$$

\(\square \)

2.3 Wavelet-like representation

In this section, \(h\) will denote a fixed, infinitely differentiable low pass filter. We have seen that \(f\in L^1\) if and only if \(\sigma _n(h,f)\rightarrow f\) in \(L^1\). Since \(\sigma _n(h,f)\) is defined entirely in terms of the sequence \(\{\hat{f}(k)\}_{k\in \mathbb{Z }}\), it follows that the sequence of Fourier coefficients of an integrable function determines the function uniquely. For this reason, this sequence is called the frequency domain description of \(f\), while a formula giving \(f\) directly as a function of its argument is called a space (or time) domain description. There are some problems with the frequency domain description.

Except in the case of the \(L^2\) norm, where the Parseval identity is available, the frequency domain description of a function does not reveal its smoothness directly. For example, we consider

$$\begin{aligned} f_1(x)\mathop {=}\limits ^{\mathrm{def}}\frac{\cos ((\pi -x)/4)}{\sqrt{2\sin (x/2)}}, \end{aligned}$$

for which the formal Fourier series expansion is given by

$$\begin{aligned} 1 + \frac{2}{\sqrt{\pi }}\sum _{k=1}^\infty \frac{\Gamma (k+1/2)}{k!}\cos (kx), \end{aligned}$$

see, e.g., [79, Chapter V, Formula (2.3)], so that, using Stirling’s formula, \(\hat{f_1}(k)=\mathcal{O}(k^{-1/2})\). The function \(f_1\) is discontinuous at \(0\). On the other hand, for

$$\begin{aligned} f_2(x) \mathop {=}\limits ^{\mathrm{def}}\frac{1}{2}+\sum _{j=1}^\infty \frac{\cos (4^jx)}{2^j}, \end{aligned}$$

it is easy to verify using Theorem 2.3 that \(f_2\in W_{1/2,\infty }\) even though \(\hat{f_2}(k) =\mathcal{O}(k^{-1/2})\) as well. For the function

$$\begin{aligned} f_3(x) \mathop {=}\limits ^{\mathrm{def}}|\cos x|^{1/2} = \frac{\Gamma (3/2)}{\sqrt{2}\Gamma (5/4)^2} +\sum _{j=1}^\infty (-1)^j\frac{\sqrt{2}\Gamma (3/2)}{\Gamma (-1/4)\Gamma (5/4)}\frac{\Gamma (j-1/4)}{\Gamma (j+5/4)}\cos (2jx), \end{aligned}$$

see, e.g., [17, p. 12, Eqn. (30)], the formulation (2.23) of the class \(W_{1/2,\infty }\) shows that \(f_3\in W_{1/2,\infty }\), but \(f_3\not \in W_{\gamma ,\infty }\) for any \(\gamma >1/2\). However, \(\hat{f_3}(k)\sim k^{-3/2}\). Finally, \(f_2\) is nowhere differentiable, while \(f_3\) and \(f_1\) both admit analytic continuations at all but finitely many points on \(\mathbb T \).

In the last couple of decades, wavelet analysis has become popular as an alternative to Fourier series where the coefficients characterize local smoothness of the target function, see [10, Theorems 9.2.1, 9.2.2]. We find it interesting both from a theoretical as well as practical point of view to develop a similar expansion which can be computed using the classical Fourier coefficients, but achieves the same purpose. In this section, we review some of our results in this direction, sketching some proofs related to the summability kernel. This section is based on [58–60].

We pause again to discuss an example from [34] for the utility of the summability kernel and its localization. In Fig. 3 below, we report the log-plot of the error between \(x\mapsto |\cos x|^{1/4}\) and its Fourier projection of degree \(31\), compared to the error obtained by using our summability operator with a smooth filter \(h\). In keeping with the converse theorem, the maximum error is of the same order of magnitude in both cases, but it is clear that the error using the summability operator decreases rapidly as the distance from the singular points \(\pm \pi /2\) increases. Thus, the summability operator is more robust with respect to local “singularities” in a function.

Fig. 3

The plot of the logarithm (base 10) of the absolute error between the function \(x\mapsto |\cos x|^{1/4}\) and (left) its Fourier projection, (right) the trigonometric polynomial obtained by our summability operator, where the Fourier coefficients are estimated by a 128-point FFT. The order of the trigonometric polynomials is \(31\) in each case. The numbers on the \(x\) axis are in multiples of \(\pi \); the actual absolute errors are \(10^{y}\).
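The comparison reported in Fig. 3 can be imitated with a few lines of code. The following is only a sketch under stated assumptions, not the computation of [34]: we assume that the univariate operator has the form \(\sigma _n(h,f,x)=\sum _k h(|k|/n)\hat{f}(k)\exp (ikx)\), we use a hypothetical \(C^\infty \) filter `smooth_filter` equal to \(1\) on \([0,1/2]\) and \(0\) on \([1,\infty )\), and we estimate the Fourier coefficients from 128 equispaced samples by a direct discrete Fourier sum (the same numbers an FFT would produce).

```python
import cmath
import math

def smooth_filter(t):
    """Hypothetical C-infinity low pass filter: 1 on [0, 1/2], 0 on [1, inf)."""
    s = min(max(2.0 * abs(t) - 1.0, 0.0), 1.0)
    psi = lambda u: math.exp(-1.0 / u) if u > 0 else 0.0
    return psi(1.0 - s) / (psi(1.0 - s) + psi(s))

N, n = 128, 32                                   # samples, degree bound
f = lambda x: abs(math.cos(x)) ** 0.25
samples = [f(2 * math.pi * m / N) for m in range(N)]
freqs = range(-n + 1, n)                         # frequencies |k| <= 31
fhat = {k: sum(s * cmath.exp(-2j * math.pi * k * m / N)
               for m, s in enumerate(samples)) / N for k in freqs}

def evaluate(weight, x):
    """Trigonometric polynomial sum_k weight(k) * fhat(k) * exp(ikx)."""
    return sum(weight(k) * fhat[k] * cmath.exp(1j * k * x) for k in freqs).real

xs = [-math.pi + 2 * math.pi * i / 400 for i in range(401)]
err_proj = [abs(f(x) - evaluate(lambda k: 1.0, x)) for x in xs]
err_sigma = [abs(f(x) - evaluate(lambda k: smooth_filter(k / n), x)) for x in xs]

# Indices of grid points away from the singularities at +-pi/2.
far = [i for i, x in enumerate(xs) if abs(x) < 0.3]
print("max error, projection :", max(err_proj))
print("max error, summability:", max(err_sigma))
print("far from +-pi/2       :", max(err_proj[i] for i in far),
      max(err_sigma[i] for i in far))
```

With these choices the two maximum errors are comparable, while away from \(\pm \pi /2\) the error of the filtered sum is visibly smaller than that of the projection; the precise numbers depend on the filter.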

To resume our main discussion, we note that unlike the notion of derivatives, membership in \(W_{\gamma ,p}\) is not a local property. The following is a standard way to define such spaces locally. We refer to an open connected subset of \(\mathbb T \) as an arc. Essentially, an arc is an open subinterval of \([-\pi ,\pi ]\), except that an arc might also be of the form \([-\pi ,a)\cup (b,\pi ]\). The class of all infinitely differentiable functions supported on an arc \(I\) will be denoted by \({C^\infty _I}\).

Definition 2.3

Let \(1\le p\le \infty \), \(\gamma >0\), \(x_0\in \mathbb T \), and \(f\in L^1\). We say that \(f\in W_{\gamma ,p}(x_0)\) if there exists an arc \(I\ni x_0\) such that for every \(\phi \in {C^\infty _I}\), \(\phi f\in W_{\gamma ,p}\).

Thus, among the three functions listed above, \(f_1\in W_{\gamma ,\infty }(x_0)\) for all \(\gamma >0\) and \(x_0\not =0\), \(f_2\not \in W_{\gamma ,\infty }(x_0)\) for any \(x_0\) if \(\gamma >1/2\), and \(f_3\in W_{\gamma ,\infty }(x_0)\) for all \(\gamma >0\), \(x_0\not =\pm \pi /2\).

We define the frame operators

$$\begin{aligned} \tau _j(h,f) \mathop {=}\limits ^{\mathrm{def}}\left\{ \begin{array}{ll} \sigma _1(h,f), &{}\quad {\hbox {if}}\ j=0,\\ \sigma _{2^j}(h,f)-\sigma _{2^{j-1}}(h,f), &{}\quad {\hbox {if}}\ j\in \mathbb{N }. \end{array}\right. \end{aligned}$$
(2.26)

Note that these are the same as the polynomials \(P_j\) in (2.11). We let

$$\begin{aligned} \Psi _j(h,t)\mathop {=}\limits ^{\mathrm{def}}\Phi _{2^{j+1}}(h,t)-\Phi _{2^{j-2}}(h,t), \quad j=0,1,\dots , \end{aligned}$$
(2.27)

where we recall our convention that \(\Phi _n(h,t)=0\) if \(n<1\).

We will not use the following constructions in the rest of the paper, but state them for the sake of completeness. Let

$$\begin{aligned} g^*(t)\mathop {=}\limits ^{\mathrm{def}}\sqrt{h(t)-h(2t)}. \end{aligned}$$
(2.28)

We write for \(t\in \mathbb T \), and \(f\in L^1\),

$$\begin{aligned} \Psi _j^*(t)\mathop {=}\limits ^{\mathrm{def}}\left\{ \begin{array}{ll} \Phi _{2^j}(g^*,t), &{}\quad j\in \mathbb{N },\\ 1, &{}\quad j=0, \end{array}\right. \end{aligned}$$
(2.29)

and

$$\begin{aligned} \tau _j^*(h,f)\mathop {=}\limits ^{\mathrm{def}}\sigma _{2^j}(g^*,f), \quad j=1,2,\ldots , \quad \tau _0^*(h,f,t)\mathop {=}\limits ^{\mathrm{def}}\hat{f}(0). \end{aligned}$$
(2.30)

Finally, we set

$$\begin{aligned} v_{k,j}= k\pi /2^j, \quad k=-2^j+1,\ldots ,2^j, \quad j=0,1,2,\ldots . \end{aligned}$$
(2.31)

It can be verified using the formula for the sum of geometric series that

$$\begin{aligned} \frac{1}{2^{j+1}}\sum _{k=-2^j+1}^{2^j} \exp (i\ell v_{k,j})=\left\{ \begin{array}{ll} 0, &{}\quad {\hbox { if}} \ \ell =\pm 1, \ldots , \pm (2^{j+1}-1),\\ 1, &{}\quad {\hbox { if}} \ \ell =0, \end{array}\right. \end{aligned}$$

and hence, the following quadrature formula holds.

$$\begin{aligned} \frac{1}{2\pi }\int _\mathbb T P(t)\,dt =\frac{1}{2^{j+1}}\sum _{k=-2^j+1}^{2^j} P(v_{k,j}), \quad P\in \mathbb{H }_{2^{j+1}}. \end{aligned}$$
(2.32)
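The discrete orthogonality relation behind (2.32) is easy to confirm numerically. The sketch below, with a hypothetical helper `node_average`, averages \(\exp (i\ell v_{k,j})\) over the nodes (2.31) for all \(|\ell |<2^{j+1}\), and also shows that the relation fails at \(\ell =2^{j+1}\), which is why the quadrature formula is stated only for \(P\in \mathbb{H }_{2^{j+1}}\).

```python
import cmath
import math

def node_average(ell, j):
    """(1/2^(j+1)) * sum_{k=-2^j+1}^{2^j} exp(i * ell * v_{k,j}), v_{k,j} = k*pi/2^j."""
    nodes = [k * math.pi / 2**j for k in range(-2**j + 1, 2**j + 1)]
    return sum(cmath.exp(1j * ell * v) for v in nodes) / 2 ** (j + 1)

j = 3
values = {ell: node_average(ell, j) for ell in range(-(2 ** (j + 1)) + 1, 2 ** (j + 1))}
print(abs(values[0]), max(abs(v) for ell, v in values.items() if ell != 0))

# First frequency that aliases to the constant: the average is 1, not 0.
aliased = node_average(2 ** (j + 1), j)
print(abs(aliased))
```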

The following theorem states the wavelet-like (frame) expansion of functions in \(X^p\). We note that, unlike classical Littlewood–Paley expansions or wavelet expansions, our results are valid also for \(f\in X^p\), where \(p=1\) and \(p=\infty \) are included. We note that the coefficients \(\tau _j(h,f,v_{k,j})\) and \(\tau _j^*(h,f,v_{k,j})\) can be computed as finite linear combinations of the Fourier coefficients of \(f\).

Theorem 2.5

Let \(1\le p\le \infty \), \(\gamma >0\), \(f\in X^p\), and let \(h\) be an infinitely differentiable low pass filter.

  1. (a)

    We have

    $$\begin{aligned} f&= \sum _{j=0}^\infty \tau _j(h,f)=\sum _{j=0}^\infty \frac{1}{2^{j+1}}\sum _{k=-2^j+1}^{2^j}\tau _j(h,f,v_{k,j})\Psi _j(h, \circ -v_{k,j})\nonumber \\&= \sum _{j=0}^\infty \frac{1}{2^{j+1}}\sum _{k=-2^j+1}^{2^j}\tau _j^*(h,f,v_{k,j})\Psi _j^*(h, \circ -v_{k,j}), \end{aligned}$$
    (2.33)

    with convergence in the sense of \(X^p\).

  2. (b)

    If \(f\in L^2\) then

    $$\begin{aligned} \Vert f\Vert _2^2 \le 2\sum _{j=0}^\infty \Vert \tau _j(h,f)\Vert _2^2 = \sum _{j=0}^\infty \frac{1}{2^j}\sum _{k=-2^j+1}^{2^j}|\tau _j(h,f,v_{k,j})|^2 \!\le \! 2\Vert f\Vert _2^2\qquad \qquad \end{aligned}$$
    (2.34)

    and

    $$\begin{aligned} \Vert f\Vert _2^2= \sum _{j=0}^\infty \Vert \tau _j^*(h,f)\Vert _2^2 = \sum _{j=0}^\infty \frac{1}{2^{j+1}}\sum _{k=-2^j+1}^{2^j}|\tau _j^*(h,f,v_{k,j})|^2. \end{aligned}$$
    (2.35)

Proof

The first equation in (2.33) follows from (2.8) by a telescoping series argument, cf. the proof of (2.13) (the polynomials denoted in the latter by \(P_j\) are in fact \(\tau _j(h,f)\)). A comparison of Fourier coefficients shows that for \(j=0,1\dots \),

$$\begin{aligned} \tau _j(h,f,x)&= \frac{1}{2\pi }\int _\mathbb T \tau _j(h,f,t)\Psi _j(h,x-t)\,dt\\&= \frac{1}{2\pi }\int _\mathbb T \tau _j^*(h,f,t)\Psi _j^*(h,x-t)\,dt. \end{aligned}$$

The quadrature formula (2.32) can be used to express \(\tau _j(h,f)\) in the first equation in (2.33) as the inner sums as indicated in the remaining two equations there.

We observe that \( h(t)-h(2t)\ge 0\) for all \(t\) and \(h(t)-h(2t)\not =0\) only if \(1/4<t<1\). So, for any \(k\in \mathbb{Z }{\setminus }\{0\}\), if \(m_k\) is the integer part of \(\log _2|k|\), then

$$\begin{aligned} h\left( \frac{k}{2^j}\right) -h\left( \frac{k}{2^{j-1}}\right) \not =0\; {\hbox { only if}}\; j=m_k+1\; {\hbox {or}}\; j=m_k+2 \end{aligned}$$

Consequently, recalling that \(0\le h(t)\le 1\) for all \(t\in \mathbb{R }\), we obtain for \(k\in \mathbb{Z }{\setminus }\{0\}\) that

$$\begin{aligned} 1&= \left( \sum _{j=1}^\infty h\left( \frac{k}{2^j}\right) -h\left( \frac{k}{2^{j-1}}\right) \right) ^2\le 2 \sum _{j=1}^\infty \left( h\left( \frac{k}{2^j}\right) -h\left( \frac{k}{2^{j-1}}\right) \right) ^2\nonumber \\&\le 2\sum _{j=1}^\infty h\left( \frac{k}{2^j}\right) -h\left( \frac{k}{2^{j-1}}\right) =2. \end{aligned}$$
(2.36)

Using the Parseval identity and Fubini’s theorem, we see that

$$\begin{aligned} \sum _{j=0}^\infty \Vert \tau _j(h,f)\Vert _2^2&= |\hat{f}(0)|^2 +\sum _{j=1}^\infty \sum _{k\in \mathbb{Z }\setminus \{0\}} \left( h\left( \frac{k}{2^j}\right) -h\left( \frac{k}{2^{j-1}}\right) \right) ^2|\hat{f}(k)|^2\\&= |\hat{f}(0)|^2 +\sum _{k\in \mathbb{Z }\setminus \{0\}}|\hat{f}(k)|^2 \sum _{j=1}^\infty \left( h\left( \frac{k}{2^j}\right) -h\left( \frac{k}{2^{j-1}}\right) \right) ^2. \end{aligned}$$

Using the Parseval identity again and (2.36), we conclude that

$$\begin{aligned} \Vert f\Vert _2^2= \sum _{k\in \mathbb{Z }}|\hat{f}(k)|^2\le 2\sum _{j=0}^\infty \Vert \tau _j(h,f)\Vert _2^2\le 2\Vert f\Vert _2^2. \end{aligned}$$
(2.37)

The quadrature formula (2.32) shows that

$$\begin{aligned} \Vert \tau _j(h,f)\Vert _2^2=\frac{1}{2^{j+1}}\sum _{k=-2^j+1}^{2^j}|\tau _j(h,f,v_{k,j})|^2, \quad j=0,1,\ldots . \end{aligned}$$

Together with (2.37), this completes the proof of (2.34).

Another simple application of the Parseval identity gives

$$\begin{aligned} \Vert f\Vert _2^2=\sum _{j=0}^\infty \Vert \tau _j^*(h,f)\Vert _2^2, \end{aligned}$$

which, together with the quadrature formula (2.32) leads to (2.35). \(\square \)
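The two elementary facts used in this proof can be checked numerically for a concrete filter. The sketch below assumes a hypothetical \(C^\infty \) low pass filter `smooth_filter` that equals \(1\) on \([0,1/2]\), vanishes on \([1,\infty )\), and is nonincreasing in between; for several frequencies \(k\) it computes the telescoping sum and the sum of squares appearing in (2.36).

```python
import math

def smooth_filter(t):
    """Hypothetical C-infinity low pass filter: 1 on [0, 1/2], 0 on [1, inf)."""
    s = min(max(2.0 * abs(t) - 1.0, 0.0), 1.0)
    psi = lambda u: math.exp(-1.0 / u) if u > 0 else 0.0
    return psi(1.0 - s) / (psi(1.0 - s) + psi(s))

def band(k, j):
    """h(k/2^j) - h(k/2^(j-1)): the Fourier multiplier of tau_j at frequency k."""
    return smooth_filter(k / 2**j) - smooth_filter(k / 2 ** (j - 1))

results = {}
for k in (1, 2, 3, 5, 8, 100, 1000):
    terms = [band(k, j) for j in range(1, 41)]   # 40 levels cover |k| <= 1000
    results[k] = (sum(terms), sum(b * b for b in terms))
    print(k, results[k])
```

The first number is always \(1\); since \(g^*(k/2^j)^2=h(k/2^j)-h(k/2^{j-1})\), this is also the identity behind (2.35). The second number lies between \(1/2\) and \(1\), which is the content of (2.36) and gives the constants in (2.34).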

The following theorem shows that the coefficients of the expansions in (2.33) provide a complete characterization of local smoothness classes, analogous to the corresponding theorems in wavelet analysis.

Theorem 2.6

Let \(1\le p\le \infty \), \(\gamma >0\), \(x_0\in \mathbb T \), \(f\in X^p\), and let \(h\) be an infinitely differentiable low pass filter. Then the following statements are equivalent.

  1. (a)

    \(f\in W_{\gamma ,p}(x_0)\).

  2. (b)

    There exists an arc \(I\ni x_0\) such that

    $$\begin{aligned} \left\{ \frac{1}{2^{j+1}}\sum _{k:v_{k,j}\in I}|\tau _j(h,f,v_{k,j})|^p\right\} ^{1/p} =\mathcal{O}(2^{-j\gamma }), \quad 1\le p<\infty . \end{aligned}$$
    (2.38)

    In the case when \(p=\infty \), the above estimate is interpreted as

    $$\begin{aligned} \max _{k: v_{k,j}\in I} |\tau _j(h,f,v_{k,j})|=\mathcal{O}(2^{-j\gamma }). \end{aligned}$$
    (2.39)
  3. (c)

    There exists an arc \(I\) containing \(x_0\) such that

    $$\begin{aligned} \left\{ \frac{1}{2^{j+1}}\sum _{k:v_{k,j}\in I}|\tau _j^*(h,f,v_{k,j})|^p\right\} ^{1/p} =\mathcal{O}(2^{-j\gamma }), \end{aligned}$$
    (2.40)

    with a modification as done in (b) in the case when \(p=\infty \).
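To illustrate the statement in the case \(p=\infty \), the following sketch computes the coefficients \(\tau _j(h,f,v_{k,j})\) for \(f(x)=|\cos x|^{1/2}\) on two arcs, one around the singular point \(\pi /2\) and one around \(0\), where \(f\) is analytic. It is a minimal sketch under assumptions of our own: the filter `smooth_filter` is a hypothetical \(C^\infty \) low pass filter, the Fourier coefficients are estimated by a 1024-point discrete Fourier sum, and the arcs have radius \(0.5\).

```python
import cmath
import math

def smooth_filter(t):
    """Hypothetical C-infinity low pass filter: 1 on [0, 1/2], 0 on [1, inf)."""
    s = min(max(2.0 * abs(t) - 1.0, 0.0), 1.0)
    psi = lambda u: math.exp(-1.0 / u) if u > 0 else 0.0
    return psi(1.0 - s) / (psi(1.0 - s) + psi(s))

f = lambda x: math.sqrt(abs(math.cos(x)))
N, K = 1024, 64
samples = [f(2 * math.pi * m / N) for m in range(N)]
# Estimated Fourier coefficients for |k| < K (enough for levels j <= 6).
fhat = {k: sum(s * cmath.exp(-2j * math.pi * k * m / N)
               for m, s in enumerate(samples)) / N for k in range(-K + 1, K)}

def tau(j, x):
    """tau_j(h,f,x) = sum_k [h(k/2^j) - h(k/2^(j-1))] fhat(k) exp(ikx), j >= 1."""
    return sum((smooth_filter(k / 2**j) - smooth_filter(k / 2 ** (j - 1)))
               * c * cmath.exp(1j * k * x) for k, c in fhat.items()).real

def arc_max(j, center, radius=0.5):
    """max |tau_j(h,f,v_{k,j})| over the nodes v_{k,j} within `radius` of `center`."""
    nodes = [k * math.pi / 2**j for k in range(-2**j + 1, 2**j + 1)]
    return max(abs(tau(j, v)) for v in nodes if abs(v - center) < radius)

levels = (4, 5, 6)
near = {j: arc_max(j, math.pi / 2) for j in levels}
far = {j: arc_max(j, 0.0) for j in levels}
for j in levels:
    print(j, near[j], near[j] * 2 ** (j / 2), far[j])
```

Near \(\pi /2\) the rescaled values \(2^{j/2}\max |\tau _j|\) stay of order one, in line with \(\gamma =1/2\) in (2.39), whereas on the arc around \(0\) the coefficients are much smaller and decay faster.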

We will prove only the equivalence of parts (a) and (b) above; the equivalence between parts (b) and (c) does not add much to the concepts we wish to emphasize here. This proof depends on the following Marcinkiewicz–Zygmund–type inequalities, see [79, Chapter X, Theorem (7.5) and the remark thereafter, and Theorem (7.28)].

Lemma 2.1

For \(1\le p\le \infty \) and \(T\in \mathbb{H }_{2^{j+1}}\), we have

$$\begin{aligned} \Vert T\Vert _p\sim \left\{ \frac{1}{2^{j+1}}\sum _{k=-2^{j}+1}^{2^{j}}|T(v_{k,j})|^p\right\} ^{1/p}, \end{aligned}$$
(2.41)

where the usual interpretation for the expression on the right is assumed when \(p=\infty \), and the constants involved in “\(\sim \)” are independent of \(j\) and \(T\).

Proof of (a)\(\Longleftrightarrow \)(b) of Theorem 2.6

In this proof, we will write for any arc \(I\), integer \(j\ge 1\) and \(g :\mathbb T \rightarrow \mathbb{R }\),

$$\begin{aligned}{}[g]_{j,p,I} = \left\{ \frac{1}{2^{j+1}}\sum _{k: v_{k,j}\in I}|g(v_{k,j})|^p\right\} ^{1/p}, \qquad 1\le p\le \infty . \end{aligned}$$

Let \(f\in W_{\gamma ,p}(x_0)\), and \(J\ni x_0\) be an arc such that \(\phi f\in W_{\gamma ,p}\) for every infinitely differentiable function \(\phi \) supported on \(J\). We consider subarcs \(I\subset I^{\prime }\subset J\) with \(x_0\in I\), and an infinitely differentiable function \(\psi \) which is equal to \(1\) on \(I^{\prime }\) and \(0\) outside \(J\). We choose and fix an integer \(S>\max (1,\gamma )\). Since \(h\) is infinitely differentiable, it follows from (2.6) that

$$\begin{aligned} |\Psi _j(t)| \le c2^j\min \left( 1, (2^j|t|)^{-S-1}\right) , \quad t\in \mathbb T , \ j\ge 0. \end{aligned}$$
(2.42)

Therefore, for \(x\in I\),

$$\begin{aligned} |\tau _j((1-\psi )f,x)|&\le \frac{1}{2\pi }\int _\mathbb T |(1-\psi (t))f(t)\Psi _j(x-t)|\,dt\\&= \frac{1}{2\pi }\int _{\mathbb T \setminus I^{\prime }} |(1-\psi (t))f(t)\Psi _j(x-t)|\,dt\le c(I,I^{\prime })2^{-jS}\Vert f\Vert _1, \end{aligned}$$

and

$$\begin{aligned} |\tau _j(f,x)| \le |\tau _j(\psi f,x)| + |\tau _j((1-\psi )f,x)|\le |\tau _j(\psi f,x)|+c(I,I^{\prime })2^{-jS}\Vert f\Vert _1. \end{aligned}$$

Consequently, using Lemma 2.1, we conclude that

$$\begin{aligned}{}[\tau _j(f)]_{j,p,I}&\le [\tau _j(\psi f)]_{j,p,I} + c(I,I^{\prime })2^{-jS}\Vert f\Vert _1\nonumber \\&\le [\tau _j(\psi f)]_{j,p,\mathbb T } + c(I,I^{\prime })2^{-jS}\Vert f\Vert _1\nonumber \\&\le c(I,I^{\prime })\{\Vert \tau _j(\psi f)\Vert _p +2^{-jS}\Vert f\Vert _1\}. \end{aligned}$$
(2.43)

Since \(\tau _j(P)\equiv 0\) for all \(P\in \mathbb{H }_{2^{j-1}}\), the boundedness of the operators \(\tau _j\) implies that for any \(P\in \mathbb{H }_{2^{j-1}}\),

$$\begin{aligned} \Vert \tau _j(\psi f)\Vert _p=\Vert \tau _j(\psi f-P)\Vert _p\le c\Vert \psi f-P\Vert _p, \end{aligned}$$

and, hence,

$$\begin{aligned} \Vert \tau _j(\psi f)\Vert _p \le cE_{2^{j-1},p}(\psi f). \end{aligned}$$

Therefore, (2.43) shows that

$$\begin{aligned}{}[\tau _j(f)]_{j,p,I}\le c(I,I^{\prime })\{E_{2^{j-1},p}(\psi f) +2^{-jS}\Vert f\Vert _1\}. \end{aligned}$$

Since \(\psi f\in W_{\gamma ,p}\) and \(S>\gamma \), an application of Theorem 2.3 now leads to part (b).

Conversely, let \(I\ni x_0\) be an arc such that

$$\begin{aligned}{}[\tau _j(f)]_{j,p,I} =\mathcal{O}(2^{-j\gamma }), \end{aligned}$$
(2.44)

and let \(\phi \) be any infinitely differentiable function supported on \(I\). The Favard estimate then shows that for every \(j\ge 1\) there exists \(R_j\in \mathbb{H }_{2^j}\) such that

$$\begin{aligned} \Vert \phi -R_j\Vert _\infty \le c(\phi )2^{-jS}. \end{aligned}$$
(2.45)

Therefore, in view of (2.7), we obtain

$$\begin{aligned} E_{2^{j+1},p}(\phi f) \!&\le \! \Vert \phi f\!-\!R_j\sigma _{2^j}(h,f)\Vert _p \!\le \! \Vert \phi (f\!-\!\sigma _{2^j}(h,f))\Vert _p \!+\!\Vert (\phi \!-\!R_j)\sigma _{2^j}(h,f)\Vert _p \nonumber \\ \!&\le \! \Vert \phi (f-\sigma _{2^j}(h,f))\Vert _p +c(\phi )2^{-jS}\Vert f\Vert _p. \end{aligned}$$
(2.46)

Using (2.33), Lemma 2.1, (2.45), and (2.7) in that order, we see that

$$\begin{aligned}&\Vert \phi (f-\sigma _{2^j}(h,f))\Vert _p = \left\| \phi \sum _{k=j+1}^\infty \tau _k(h,f)\right\| _p\le \sum _{k=j+1}^\infty \Vert \phi \tau _k(h,f)\Vert _p\\&\quad \le \sum _{k=j+1}^\infty \Vert R_k\tau _k(h,f)\Vert _p +\sum _{k=j+1}^\infty \Vert (\phi -R_k)\tau _k(h,f)\Vert _p\\&\quad \le c\left\{ \sum _{k=j+1}^\infty [R_k\tau _k(h,f)]_{k,p,\mathbb T }+\sum _{k=j+1}^\infty \Vert (\phi -R_k)\tau _k(h,f)\Vert _p\right\} \\&\quad \le c\left\{ \sum _{k=j+1}^\infty [\phi \tau _k(h,f)]_{k,p,\mathbb T }\!+\!\sum _{k=j+1}^\infty [(\phi \!-\!R_k)\tau _k(h,f)]_{k,p,\mathbb T }\!+\!\sum _{k=j+1}^\infty \Vert (\phi \!-\!R_k)\tau _k(h,f)\Vert _p\right\} \\&\quad \le c(\phi )\left\{ \sum _{k=j+1}^\infty [\tau _k(h,f)]_{k,p,I}+\sum _{k=j+1}^\infty 2^{-kS}[\tau _k(h,f)]_{k,p,\mathbb T }+\sum _{k=j+1}^\infty 2^{-kS}\Vert \tau _k(h,f)\Vert _p\right\} \\&\quad \le c(\phi )\left\{ \sum _{k=j+1}^\infty [\tau _k(h,f)]_{k,p,I}+\sum _{k=j+1}^\infty 2^{-kS}\Vert \tau _k(h,f)\Vert _p\right\} \\&\quad \le c(\phi ) \left\{ \sum _{k=j+1}^\infty [\tau _k(h,f)]_{k,p,I}+\sum _{k=j+1}^\infty 2^{-kS}\Vert f\Vert _p\right\} . \end{aligned}$$

Thus, the assumption (2.44) leads to

$$\begin{aligned} \Vert \phi (f-\sigma _{2^j}(h,f))\Vert _p=\mathcal{O}(2^{-j\gamma }). \end{aligned}$$

Since \(S>\gamma \), this estimate and (2.46) show that \(E_{2^{j+1},p}(\phi f)=\mathcal{O}(2^{-j\gamma })\). Therefore, Theorem 2.3 implies that for every infinitely differentiable \(\phi \) supported on \(I\) we have \(\phi f\in W_{\gamma ,p}\), that is, (a) holds. \(\square \)

We note that an expansion of the form (2.33) is used often in approximation theory. We reserve the term wavelet-like representation to indicate that the behavior of the terms of the expansion characterizes local smoothness of the target function. Thus, for example, although the expansion

$$\begin{aligned} f=\sum _{j=0}^\infty (v_{2^{j+1}}(f)-v_{2^j}(f)) +v_1(f) \end{aligned}$$

is very similar to (2.33), and holds for every \(f\in X^p\), \(p\in [1,\infty ]\), we do not refer to this expansion as a wavelet-like representation, because the localization properties of the operators \(v_n\) are not strong enough to admit an analogue of Theorem 2.6 for characterization of local smoothness classes for the smoothness parameter \(>1\).

3 Multivariate analogues

3.1 Notation

In what follows, \(q\ge 2\) is a fixed integer and the various constants will depend on \(q\). As usual, the notation \(\mathbb T ^q\) denotes the \(q\) dimensional torus, that is, a \(q\)-fold cross product of \(\mathbb T \) with itself. As in the univariate case, functions on \(\mathbb T ^q\) can be considered as functions on \(\mathbb{R }^q\), \(2\pi \)-periodic in each variable. An arc in this context has the form \(\prod _{k=1}^q I_k\), where each \(I_k\subseteq \mathbb T \) is a univariate arc as defined in Sect. 2. An element of \(\mathbb{R }^q\) will be denoted in bold face, e.g., \(\mathbf{x}=(x_1,\ldots ,x_q)\). We find it convenient to write \(x\) also for the vector \((x,x,\ldots ,x)\); e.g., \(0\) denotes both the scalar \(0\) and the vector \((0,0,\ldots ,0)\). We hope that it will be clear from the context whether a scalar is intended or a vector with equal components is intended. When applied to vectors, univariate operations and relations will be interpreted in a coordinatewise sense; e.g., \(\mathbf{x}\ge 0\) means that \(x_j\ge 0\) for all \(j=1,2,\ldots ,q\), \(\mathbf{x}^\mathbf{y}=(x_1^{y_1},\ldots ,x_q^{y_q})\), whenever the expressions are defined, and so forth. The notation \(\mathbf{x}\cdot \mathbf{y}\) denotes the inner product between \(\mathbf{x}\) and \(\mathbf{y}\). For \(0<p\le \infty \), we define

$$\begin{aligned} |\mathbf{x}|_p =\left\{ \begin{array}{ll} \left( \sum \nolimits _{k=1}^q |x_k{\hbox { mod }} (2\pi )|^p\right) ^{1/p}, &{}\quad {\hbox {if}} \ 0<p<\infty ,\\ \max \nolimits _{1\le k\le q}|x_k{\hbox { mod }} (2\pi )|, &{}\quad {\hbox {if}} \ p=\infty . \end{array}\right. \end{aligned}$$

In general, many of the definitions of norms and related expressions, are very similar to those in the univariate case, and we will often omit the dimension \(q\) whenever we feel that writing it explicitly makes the notation unnecessarily cumbersome.

If \(f :\mathbb T ^q\rightarrow \mathbb{R }\) is differentiable, we write \(\partial _rf\) to denote the partial derivative of \(f\) with respect to the \(r\)-th variable. If \(\mathbf r \in \mathbb{Z }^q_+\) and \(f\) is sufficiently smooth, we write \(\partial ^\mathbf{r } f\) to denote the partial derivative indicated by \(\mathbf r \).

The measure \(\mu _q^*\) denotes the Lebesgue measure on \(\mathbb T ^q\) normalized to \(1\). If \(f : \mathbb T ^q\rightarrow \mathbb{C }\) is Lebesgue measurable, and \(A\subseteq \mathbb T ^q\) is Lebesgue measurable, we write

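$$\begin{aligned} \Vert f\Vert _{p,A}\mathop {=}\limits ^{\mathrm{def}}\left\{ \begin{array}{ll} \left( \int _A |f(\mathbf{t })|^p\,d\mu _q^*(\mathbf{t })\right) ^{1/p}, &{}\quad {\hbox {if}} \ 1\le p<\infty ,\\ {\hbox {ess sup}}_{\mathbf{t }\in A}|f(\mathbf{t })|, &{}\quad {\hbox {if}} \ p=\infty . \end{array}\right. \end{aligned}$$
(3.1)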

As before, if \(A=\mathbb T ^q\), then it will be omitted from the notation; e.g., \(\Vert f\Vert _p =\Vert f\Vert _{p,\mathbb T ^q}\). The spaces \(L^p\) are defined as usual.

3.2 Approximation theory

There are many ways to define trigonometric polynomials of several variables. In particular, for \(p\) such that \(0<p\le \infty \), a trigonometric polynomial of \(\ell ^p\)-degree \(<n\) is a function of the form

$$\begin{aligned} \mathbf{x}\mapsto \sum _{\mathbf{k}: |\mathbf{k}|_p <n} a_\mathbf{k}\exp (i\mathbf{k}\cdot \mathbf{x}). \end{aligned}$$

When \(p=\infty \), we refer to coordinatewise degree \(<n\), whereas if \(p=1\), we refer to total degree \(<n\), and when \(p=2\), we refer to the spherical degree \(<n\). The dimensions of all these spaces are always \(\mathcal{O}(n^q)\). In our discussion below, we will restrict ourselves to the class \(\mathbb{H }_n^q\) of trigonometric polynomials of spherical degree \(<n\), but the results will be equally valid also for other values of \(p\). The \(L^p\) closure of \(\cup _{n\ge 0}\mathbb{H }_n^q\) will be denoted by \(X^p\) (or \(X^p(\mathbb T ^q)\) if some confusion is likely to result).
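The claim about the dimensions can be made concrete by counting the frequencies \(\mathbf{k}\in \mathbb{Z }^q\) that are admitted for each notion of degree. The sketch below does this by brute force for \(q=2\); the helper names are hypothetical, and for integer vectors we take \(|\mathbf{k}|_p\) to be the ordinary \(\ell ^p\) norm, which is an assumption on our part.

```python
import itertools
import math

def lp_norm(k, p):
    """Ordinary l^p norm of an integer vector k (p = math.inf gives the max norm)."""
    if p == math.inf:
        return max(abs(c) for c in k)
    return sum(abs(c) ** p for c in k) ** (1.0 / p)

def dimension(n, q, p):
    """Number of k in Z^q with |k|_p < n, i.e. the number of exponentials allowed."""
    box = range(-n + 1, n)        # every admissible k has all |k_i| < n
    return sum(1 for k in itertools.product(box, repeat=q) if lp_norm(k, p) < n)

counts = {(n, p): dimension(n, 2, p) for n in (3, 5, 8) for p in (1, 2, math.inf)}
for (n, p), d in sorted(counts.items()):
    print(n, p, d, d / n**2)      # the ratio d / n^q stays bounded
```

For \(q=2\) the counts are \(2n^2-2n+1\) for total degree and \((2n-1)^2\) for coordinatewise degree, with the spherical case in between, so all three grow like \(n^q\) with different constants.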

If \(f\in X^p(\mathbb T ^q)\) and \(n\ge 0\), then the degree of approximation of \(f\) from \(\mathbb{H }_n^q\) is defined by

$$\begin{aligned} E_{n,p}(f)\mathop {=}\limits ^{\mathrm{def}}E_{q;n,p}(f) \mathop {=}\limits ^{\mathrm{def}}\inf \{\Vert f-P\Vert _p : P\in \mathbb{H }_n^q\}. \end{aligned}$$

Again, the symbol \(q\) will be omitted when we don’t expect any confusion. If \(f\in L^1\), its Fourier coefficient is defined by

$$\begin{aligned} \hat{f}(\mathbf{k})=\int _\mathbb{T ^q}f(\mathbf{t })\exp (-i\mathbf{k}\cdot \mathbf{t })\,d\mu _q^*(\mathbf{t }), \quad \mathbf{k}\in \mathbb{Z }^q. \end{aligned}$$

Let \(H :\mathbb{R }^q\rightarrow \mathbb{R }\) be compactly supported. The role of the kernel \(\Phi _n\) is played by

$$\begin{aligned} \Phi _n(H,\mathbf{t })\mathop {=}\limits ^{\mathrm{def}}\Phi _{n,q}(H,\mathbf{t }) \mathop {=}\limits ^{\mathrm{def}}\sum _{\mathbf{k}\in \mathbb{Z }^q}H\left( \frac{\mathbf{k}}{n}\right) \exp (i\mathbf{k}\cdot \mathbf{t }), \quad n\in \mathbb{N }, \ \mathbf{t }\in \mathbb T ^q, \end{aligned}$$

and the corresponding operator is defined by

$$\begin{aligned}&\sigma _n(H,f,\mathbf{x}) =\int _\mathbb{T ^q}f(\mathbf{t })\Phi _n(H,\mathbf{x}-\mathbf{t })\,d\mu _q^*(\mathbf{t })\\&\quad =\sum _{\mathbf{k}\in \mathbb{Z }^q}H\left( \frac{\mathbf{k}}{n}\right) \hat{f}(\mathbf{k})\exp (i\mathbf{k}\cdot \mathbf{x}), \quad n\in \mathbb{N }, \ \mathbf{x}\in \mathbb T ^q. \end{aligned}$$

Of particular interest is the case when \(H(\mathbf{t })=h(|\mathbf{t }|_2)\) for some compactly supported \(h : [0,\infty )\rightarrow \mathbb{R }\). We will overload the notation again, and denote the corresponding kernel and operator by \(\Phi _n(h,\circ )\) and \(f\mapsto \sigma _n(h,f)\), with the domain of \(\Phi _n\) and \(f\) making it clear which meaning is intended. We observe that when the mapping \(\mathbf{x}\mapsto h(|\mathbf{x}|_2)\) is integrable on \(\mathbb{R }^q\), then its Fourier transform is also radial, that is, it has the form \(\mathbf{x}\mapsto F(|\mathbf{x}|_2)\) for some \(F : [0,\infty )\rightarrow \mathbb{C }\) [75, Chapter IV, Theorem 3.3]. Therefore, if the Poisson summation formula holds, then \(\Phi _n(h, \circ )\) is the periodization of a radial function, and it is \(2\pi \)-periodic in each of its variables.
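As an illustration, the kernel \(\Phi _n(h,\mathbf{t })\) can be evaluated directly from its defining sum. The sketch below assumes that a low pass filter equals \(1\) on \([0,1/2]\) and vanishes on \([1,\infty )\); the particular infinitely differentiable filter `h`, built from \(\exp (-1/x)\), and the function names are our own choices and are not prescribed by the text.

```python
import cmath
import math
from itertools import product

def smooth_step(x):
    """Infinitely differentiable transition: 0 for x <= 0, 1 for x >= 1."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    a = math.exp(-1.0 / x)
    b = math.exp(-1.0 / (1.0 - x))
    return a / (a + b)

def h(t):
    """Assumed low pass filter: 1 on [0, 1/2], 0 on [1, infinity)."""
    return 1.0 - smooth_step(2.0 * abs(t) - 1.0)

def kernel(n, t):
    """Phi_n(h, t) = sum over k in Z^q of h(|k|_2 / n) exp(i k.t)."""
    q = len(t)
    total = 0j
    # h(|k|_2 / n) = 0 once |k|_2 >= n, so the box [-n, n]^q is enough.
    for k in product(range(-n, n + 1), repeat=q):
        weight = h(math.sqrt(sum(c * c for c in k)) / n)
        if weight != 0.0:
            phase = sum(c * x for c, x in zip(k, t))
            total += weight * cmath.exp(1j * phase)
    return total

if __name__ == "__main__":
    for t in [(0.0, 0.0), (0.5, 0.5), (2.0, 2.0)]:
        print(t, abs(kernel(8, t)))
```

Because the filter is radial and real, the terms for \(\mathbf{k}\) and \(-\mathbf{k}\) combine into cosines, so the computed values are real up to rounding.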

The analogue of Theorem 2.1 is the following, where the conditions on the filter \(H\) are a bit stronger to ensure the validity of the Poisson summation formula used in the proof of the theorem.

Theorem 3.1

Let \(S> q\) be an integer and let \(H:\mathbb{R }^q\rightarrow \mathbb{R }\) be an \(S\)-times continuously differentiable, compactly supported function.

  1. (a)

    We have

    $$\begin{aligned} |\Phi _n(H,\mathbf{t })| \le c(H)n^q\min \left( 1, (n|\mathbf{t }|_2)^{-S}\right) , \quad \mathbf{t }\in \mathbb T ^q, \ n\ge 1. \end{aligned}$$
    (3.2)
  2. (b)

    For \(1\le p\le \infty \),

    $$\begin{aligned} \Vert \sigma _n(H,f)\Vert _p \le c(H)\Vert f\Vert _p, \quad f\in L^p, \ n\in \mathbb{N }. \end{aligned}$$
    (3.3)
  3. (c)

    Let \(h : \mathbb{R }\rightarrow [0,1]\) be an \(S\) times continuously differentiable low pass filter and \(n\in \mathbb{N }\). Then \(\sigma _n(h,P)=P\) for all \(P\in \mathbb{H }_{n/2}^q\). Moreover, for \(1\le p\le \infty \) and \(f\in L^p\),

    $$\begin{aligned} E_{n,p}(f) \le \Vert f-\sigma _n(h,f)\Vert _p \le c(h)E_{n/2,p}(f). \end{aligned}$$
    (3.4)

Proof

The proofs of parts (a) and (b) are given in [5, Section 6.1]. To prove part (c), we let \(H_1(\mathbf{x})=h(|\mathbf{x}|_2)\) and observe that since \(h\) is constant in a neighborhood of \(0\),

$$\begin{aligned} \nabla H_1(\mathbf{x}) = \frac{\mathbf{x}}{|\mathbf{x}|_2} h^{\prime }(|\mathbf{x}|_2) = 0 \quad {\hbox {in a neighborhood of}}\; 0. \end{aligned}$$

Since \(h\) is \(S\)-times continuously differentiable, it follows that \(H_1\) is as well. Since \(\sigma _n(h,f)=\sigma _n(H_1,f)\), the estimate (3.3) holds with \(H_1\) in place of \(H\), that is, with \(\sigma _n(h,f)\) in place of \(\sigma _n(H,f)\). The remainder of the proof is the same, verbatim, as the corresponding part of the proof of Theorem 2.1. \(\square \)
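The reproduction property in part (c) is easiest to see on the Fourier side, where \(\sigma _n(h,\cdot )\) multiplies the coefficient \(\hat{f}(\mathbf{k})\) by \(h(|\mathbf{k}|_2/n)\). The following sketch represents a trigonometric polynomial as a dictionary from frequencies to coefficients; this representation and the specific filter `h` (equal to \(1\) on \([0,1/2]\), \(0\) on \([1,\infty )\)) are assumptions made for the illustration.

```python
import math

def smooth_step(x):
    """Infinitely differentiable transition: 0 for x <= 0, 1 for x >= 1."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    a = math.exp(-1.0 / x)
    b = math.exp(-1.0 / (1.0 - x))
    return a / (a + b)

def h(t):
    """Assumed low pass filter: 1 on [0, 1/2], 0 on [1, infinity)."""
    return 1.0 - smooth_step(2.0 * abs(t) - 1.0)

def sigma(n, coeffs):
    """Apply sigma_n(h, .) to a polynomial given as {frequency k: coefficient}."""
    out = {}
    for k, c in coeffs.items():
        radius = math.sqrt(sum(j * j for j in k))
        out[k] = h(radius / n) * c
    return out

if __name__ == "__main__":
    P = {(0, 0): 1.0, (1, 2): 2 + 1j, (3, -2): -0.5j}  # all |k|_2 < 8/2
    print(sigma(8, P) == P)
```

Frequencies with \(|\mathbf{k}|_2<n/2\) pass unchanged, those with \(|\mathbf{k}|_2\ge n\) are removed, and those in between are damped by a factor in \((0,1)\).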

We observe an important corollary of Theorem 3.1.

Corollary 3.1

If \(n\in \mathbb{N }\) and \(T\in \mathbb{H }_n^q\), then for \(1\le p\le \infty \) and \(1\le r\le q\),

$$\begin{aligned} \Vert \partial _r T\Vert _p \le cn\Vert T\Vert _p. \end{aligned}$$
(3.5)

Proof

Since \(\widehat{\partial _rT}(\mathbf{k})=ik_r\hat{T}(\mathbf{k})\) for \(\mathbf{k}\in \mathbb{Z }^q\), we may use Theorem 3.1 with an appropriate smooth low pass filter \(h\), and the function \(H(\mathbf{x})=x_rh(|\mathbf{x}|_2)\) as in the proof of the univariate Bernstein inequality, see Theorem 2.2(a), to prove (3.5). \(\square \)

We will denote the Laplacian operator by \(\Delta \mathop {=}\limits ^{\mathrm{def}}\sum _{j=1}^q \partial _j^2\), and observe that for a sufficiently smooth \(f\),

$$\begin{aligned} \widehat{(-\Delta f)}(\mathbf{k})=|\mathbf{k}|_2^2{\widehat{f}}(\mathbf{k}), \end{aligned}$$

for each \(\mathbf{k}\in \mathbb{Z }^q\). For \(r>0\), we define the differential operator \((-\Delta )^{r/2}\) formally by

$$\begin{aligned} \widehat{(-\Delta )^{r/2}f}(\mathbf{k})\mathop {=}\limits ^{\mathrm{def}}|\mathbf{k}|_2^r{\widehat{f}}(\mathbf{k}) \end{aligned}$$
(3.6)

for \(f \in L^1\) and \(\mathbf{k}\in \mathbb{Z }^q\). Clearly, if \(T\in \mathbb{H }_n^q\) for some \(n\in \mathbb{N }\), then \({(-\Delta )^{r/2}}T\) is well defined for all \(r>0\).

With respect to these operators, the analogous Favard estimate and Bernstein inequality are the following.

Theorem 3.2

  1. (a)

    Let \(1\le p \le \infty \) and let \(r\) be a positive even integer. Then for all \(f\in X^p\) for which \({(-\Delta )^{r/2}}f\in X^p\), we have

    $$\begin{aligned} E_{n,p}(f) \le cn^{-r} \Vert {(-\Delta )^{r/2}}f\Vert _p. \end{aligned}$$
  2. (b)

    Let \(T \in \mathbb{H }_n^q\) and let \(r\) be a positive even integer. Then for \(1\le p \le \infty \),

    $$\begin{aligned} \Vert {(-\Delta )^{r/2}}T \Vert _p \le cn^r \Vert T\Vert _p. \end{aligned}$$
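In the special case \(p=2\), the Bernstein inequality of part (b) follows from the Parseval identity with \(c=1\), since every frequency of \(T\in \mathbb{H }_n^q\) satisfies \(|\mathbf{k}|_2<n\). The sketch below checks this numerically; it covers only \(p=2\), and the dictionary representation of \(T\) and the helper names are our own.

```python
import math
import random
from itertools import product

def laplacian_power(coeffs, r):
    """Fourier multiplier |k|_2^r, i.e. (-Delta)^{r/2} as defined in (3.6)."""
    return {k: (sum(j * j for j in k) ** (r / 2.0)) * c
            for k, c in coeffs.items()}

def l2_norm(coeffs):
    """L^2 norm with respect to the normalized measure, via Parseval."""
    return math.sqrt(sum(abs(c) ** 2 for c in coeffs.values()))

def random_polynomial(n, q, rng):
    """Random element of H_n^q: only frequencies with |k|_2 < n."""
    return {k: complex(rng.gauss(0, 1), rng.gauss(0, 1))
            for k in product(range(-(n - 1), n), repeat=q)
            if sum(j * j for j in k) < n * n}

if __name__ == "__main__":
    rng = random.Random(0)
    T = random_polynomial(6, 2, rng)
    for r in (2, 4):
        print(r, l2_norm(laplacian_power(T, r)) / (6 ** r * l2_norm(T)))
```

The printed ratios stay below \(1\); for \(p\ne 2\) the inequality requires the kernel estimates of Theorem 3.1 and a constant \(c\) that is not computed here.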

The \(K\)-functional appropriate for the multivariate theory is defined (with another overload of notation) as follows.

Definition 3.1

If \(r\ge 1\) is an even integer, \(1\le p\le \infty \), and \(f\in X^p\), then we define

$$\begin{aligned} K_{r,p}(f,\delta ) \mathop {=}\limits ^{\mathrm{def}}K_{q;r,p}(f,\delta )\mathop {=}\limits ^{\mathrm{def}}\inf \{\Vert f-g\Vert _p +\delta ^r\Vert {(-\Delta )^{r/2}}g\Vert _p\}, \end{aligned}$$
(3.7)

where the infimum is taken over all \(g\) for which \({(-\Delta )^{r/2}}g\in X^p\). If \(\gamma >0\) and \(r>\gamma \) is an even integer, then we define

$$\begin{aligned} |\!|\!|f|\!|\!|_{\gamma , p} \mathop {=}\limits ^{\mathrm{def}}|\!|\!|f|\!|\!|_{q;\gamma ,p}\mathop {=}\limits ^{\mathrm{def}}\sup _{0<\delta <1/2} \frac{K_{r,p}(f,\delta )}{\delta ^\gamma }. \end{aligned}$$
(3.8)

The set of all functions \(f\in X^p\) for which \(|\!|\!|f|\!|\!|_{\gamma ,p} <\infty \) is denoted by \(W_{\gamma ,p}\) or, if some confusion is likely to happen, then by \(W_{q;\gamma ,p}\).

The following theorem is the direct analogue of Theorem 2.3 and it is proved in the same way.

Theorem 3.3

Let \(1\le p\le \infty \), \(f\in X^p\), \(\gamma >0\), \(r>\gamma \) be an even integer, \(\gamma =s+\beta \) where \(s\ge 0\) is an integer chosen so that \(0<\beta \le 1\). Let \(S>q\) be an integer and \(h\) be an \(S\)-times continuously differentiable low pass filter.

  1. (a)

    \(f\in W_{\gamma ,p}\) if and only if \(E_{n,p}(f) =\mathcal{O}(n^{-\gamma })\). More precisely,

    $$\begin{aligned} |\!|\!|f|\!|\!|_{\gamma ,p} \sim \sup _{n\in \mathbb{N }} n^\gamma E_{n,p}(f)\sim \sup _{n\in \mathbb{N }} n^\gamma \Vert f-\sigma _n(h,f)\Vert _p, \end{aligned}$$
    (3.9)

    where the constants involved in “\(\sim \)” are independent of \(f\).

  2. (b)

    We have \(f\in W_{\gamma , p}\) if and only if \((-\Delta )^{s/2} f\in W_{\beta ,p}\).

Finally, we state the following theorem.

Theorem 3.4

Let \(1\le p\le \infty \), \(f\in X^p\), and let \(r\ge 1\) be an even integer. Let \(S>q\) be an integer and let \(h\) be an \(S\)-times continuously differentiable low pass filter. Then for \(n\in \mathbb{N }\),

$$\begin{aligned} \Vert f-\sigma _n(h,f)\Vert _p +n^{-r}\Vert {(-\Delta )^{r/2}}\sigma _n(h,f)\Vert _p \sim K_{r,p}(f,1/n), \end{aligned}$$
(3.10)

where the constants involved in “\(\sim \)” are independent of \(f\) and \(n\).

3.3 Discretization

For many applications in learning theory, the information available about the target function \(f\) consists of its values \(f(\mathbf{y}_k)\) at finitely many points \(\{\mathbf{y}_k\}_{k=1}^M\) but one cannot prescribe in advance the precise location of these points. The goal of this section is to survey the ideas behind a construction of summability operators based on such information which have properties similar to those of the operators \(\sigma _n(h)\). An essential ingredient is to obtain real numbers \(w_k\) such that for an integer \(N\ge 0\) as high as possible, both the quadrature formula (3.12) and M–Z (Marcinkiewicz–Zygmund) inequalities (3.13) below hold. It will be shown in Theorem 3.5 that such weights can always be found with the desired degree \(N\) being dependent on the so-called density content of the set \(\{\mathbf{y}_k\}\) [43].

Definition 3.2

Let \(\mathcal{C }\mathop {=}\limits ^{\mathrm{def}}\{\mathbf{y}_k\}_{k=1}^M\subset \mathbb T ^q\). We define the density content \(\delta (\mathcal{C })\) and the minimal separation \(\eta (\mathcal{C })\) by

$$\begin{aligned} \delta (\mathcal{C })=\max _{\mathbf{x}\in \mathbb T ^q}\min _{1\le k\le M}|\mathbf{x}-\mathbf{y}_k|_\infty , \quad \eta (\mathcal{C })\mathop {=}\limits ^{\mathrm{def}}\min _{1\le k\not = j\le M}|\mathbf{y}_k-\mathbf{y}_j|_\infty . \end{aligned}$$
(3.11)

We note that the density content has been referred to in the literature also as fill distance or mesh norm.
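For \(q=1\) both quantities in (3.11) can be read off from the gaps between cyclically consecutive points: the density content is half the largest gap, and the minimal separation is the smallest gap. The sketch below assumes that distances are measured on the circle (that is, modulo \(2\pi \)), that the points lie in \([-\pi ,\pi )\), and that there are at least two distinct points; the function names are ours.

```python
import math

TWO_PI = 2.0 * math.pi

def circular_gaps(points):
    """Gaps between cyclically consecutive points of the circle [-pi, pi)."""
    ys = sorted(points)
    gaps = [b - a for a, b in zip(ys, ys[1:])]
    gaps.append(TWO_PI - (ys[-1] - ys[0]))  # wrap-around gap
    return gaps

def density_content(points):
    """delta(C): the farthest any x can be from its nearest point of C."""
    return max(circular_gaps(points)) / 2.0

def minimal_separation(points):
    """eta(C): the smallest distance between two distinct points of C."""
    return min(circular_gaps(points))

if __name__ == "__main__":
    C = [-3.0, -1.0, 0.5, 2.0]
    print(density_content(C), minimal_separation(C))
```

For \(q>1\) the maximum over \(\mathbf{x}\) in the definition of \(\delta (\mathcal{C })\) has no such closed form and would have to be approximated, for example on a fine grid.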

Theorem 3.5

Let \(\mathcal{C }\mathop {=}\limits ^{\mathrm{def}}\{\mathbf{y}_k\}_{k=1}^M\subset \mathbb T ^q\) and \(\delta (\mathcal{C })\le 1\). There exists a positive constant \(\alpha \) dependent only on \(q\) with the property that there are non-negative numbers \(\{w_k\}_{k=1}^M\) such that for \(N\le \alpha \delta (\mathcal{C })^{-1}\) we have

$$\begin{aligned} \sum _{k=1}^M w_kP(\mathbf{y}_k)=\int _\mathbb{T ^q}P(\mathbf{t })d\mu _q^*(\mathbf{t }), \quad P\in \mathbb{H }^q _N, \end{aligned}$$
(3.12)

and

$$\begin{aligned} \sum _{k=1}^M |w_kP(\mathbf{y}_k)|\sim \int _\mathbb{T ^q}|P(\mathbf{t })|d\mu _q^*(\mathbf{t }), \quad P\in \mathbb{H }^q _N, \end{aligned}$$
(3.13)

where the constants involved in “\(\sim \)” depend only on \(q\) and not on \(\mathcal{C }\), \(M\), \(N\), the choice of the weights \(w_k\), or the polynomials \(P\).
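The simplest instance of the quadrature formula (3.12) is the case \(q=1\) with \(M=m\) equally spaced points and equal weights \(1/m\), which is exact for all exponentials \(\exp (ijx)\) with \(|j|<m\). The sketch below verifies this and shows the failure at \(j=m\); the equispaced configuration is our choice of example, whereas the theorem concerns arbitrary point sets.

```python
import cmath
import math

def equispaced_nodes(m):
    """The m points -pi + 2*pi*k/m, k = 0, ..., m-1."""
    return [-math.pi + 2.0 * math.pi * k / m for k in range(m)]

def discrete_moment(nodes, weights, j):
    """sum_k w_k exp(i j y_k): the quadrature applied to x -> exp(i j x)."""
    return sum(w * cmath.exp(1j * j * y) for w, y in zip(weights, nodes))

if __name__ == "__main__":
    m = 16
    nodes = equispaced_nodes(m)
    weights = [1.0 / m] * m
    # The exact integral of exp(i j x) against mu* is 1 if j == 0, else 0.
    for j in (0, 1, m - 1, m):
        print(j, abs(discrete_moment(nodes, weights, j)))
```

At \(j=m\) every node contributes the same value, so the discrete sum equals \(1\) while the integral is \(0\); this is the usual aliasing limit on the degree of exactness.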

Remark

Let \(\mathcal{C }\subset \mathbb T ^q\) be a finite set. It is easy to verify that \(\eta (\mathcal{C })\le 4\delta (\mathcal{C })\). When \(\delta (\mathcal{C })\le 1\), we can always select a subset of \(\mathcal{C }\) for which the minimal separation and the density content have the same order of magnitude. Let \(m\) be the integer part of \(2\pi /(3\delta (\mathcal{C }))\). For a multi-integer \(\mathbf{k}\) with \(0\le \mathbf{k}\le m-1\), we define

$$\begin{aligned} I_\mathbf{k}= I_{\mathbf{k},\mathcal{C }}\mathop {=}\limits ^{\mathrm{def}}\prod _{j=1}^q \left[ -\pi +\frac{2k_j\pi }{m}, -\pi +\frac{2(k_j+1)\pi }{m}\right] . \end{aligned}$$
(3.14)

Then the \(I_\mathbf{k}\)’s are arcs that are mutually disjoint except for common boundaries of measure \(0\), and their union is \(\mathbb T ^q\). Let \(\mathbf{z}_\mathbf{k}\) be the center of \(I_\mathbf{k}\). Since each side of the arc \(I_\mathbf{k}\) is \(\ge 3\delta (\mathcal{C })\), the set \(\mathcal{C }\) has at least one element in the arc \(\{\mathbf{x}: |\mathbf{x}-\mathbf{z}_\mathbf{k}|_\infty \le \delta (\mathcal{C })\}\subset I_\mathbf{k}\). We form a subset \(\mathcal{C }^{\prime }\) of \(\mathcal{C }\) by choosing exactly one such element for each \(\mathbf{k}\). Then it is easy to see that \(\delta (\mathcal{C }^{\prime }) \le 4\delta (\mathcal{C })\), and \(\eta (\mathcal{C }^{\prime }) \ge \delta (\mathcal{C })\ge (1/4)\delta (\mathcal{C }^{\prime })\). Thus, \((1/4)\delta (\mathcal{C }^{\prime })\le \eta (\mathcal{C }^{\prime })\le 4\delta (\mathcal{C }^{\prime })\). In all of our discussion below, we will therefore assume that such a subset \(\mathcal{C }^{\prime }\) has been chosen. The rest of the elements of \(\mathcal{C }\) do not make any difference to the statements; e.g., we may just set the weights corresponding to the points not so chosen to be \(0\). Therefore, rather than complicating our notation, we will just identify \(\mathcal{C }^{\prime }\) with \(\mathcal{C }\). Thus, we assume that

$$\begin{aligned} (1/4)\delta (\mathcal{C })\le \eta (\mathcal{C })\le 4\delta (\mathcal{C })\le 4, \end{aligned}$$
(3.15)

and that each \(I_{\mathbf{k},\mathcal{C }}\) contains exactly one element of \(\mathcal{C }\). Then \(M\mathop {=}\limits ^{\mathrm{def}}|\mathcal{C }| =m^q\), and we will re-index \(\mathcal{C }\) by setting \(\mathbf{y}_\mathbf{k}\) to be the unique element of \(\mathcal{C }\cap I_\mathbf{k}\). \(\square \)
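The reduction described in this remark can be sketched as follows for \(q=1\). The code assumes points in \([-\pi ,\pi )\), distances measured modulo \(2\pi \), and \(\delta (\mathcal{C })\le 1\); the helper names and the rule of taking the first admissible point in each arc are our own choices, since the remark allows any point within \(\delta (\mathcal{C })\) of the center.

```python
import math
import random

TWO_PI = 2.0 * math.pi

def torus_dist(x, y):
    """Distance between x and y on the circle of length 2*pi."""
    d = abs(x - y) % TWO_PI
    return min(d, TWO_PI - d)

def circular_gaps(points):
    ys = sorted(points)
    return [b - a for a, b in zip(ys, ys[1:])] + [TWO_PI - (ys[-1] - ys[0])]

def density_content(points):
    return max(circular_gaps(points)) / 2.0

def minimal_separation(points):
    return min(circular_gaps(points))

def reduce_points(points):
    """Keep one point of C near the center of each of the m arcs I_k."""
    delta = density_content(points)
    m = int(TWO_PI / (3.0 * delta))  # integer part of 2*pi / (3*delta)
    reduced = []
    for k in range(m):
        center = -math.pi + TWO_PI * (k + 0.5) / m
        # By the definition of delta, some point lies within delta of center;
        # the tiny tolerance only guards against rounding.
        near = [y for y in points if torus_dist(y, center) <= delta + 1e-12]
        reduced.append(near[0])
    return m, reduced

if __name__ == "__main__":
    rng = random.Random(1)
    C = [rng.uniform(-math.pi, math.pi) for _ in range(200)]
    m, C_prime = reduce_points(C)
    print(m, density_content(C_prime), minimal_separation(C_prime))
```

On the reduced set the two quantities are comparable, as in (3.15), even when the original points cluster heavily.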

One of the important steps in the proof of Theorem 3.5 is the following lemma.

Lemma 3.1

Let \(m\ge 1\) be an integer, let \(\{I_\mathbf{k}: 0\le \mathbf{k}\le m-1\}\) be a partition of \(\mathbb T ^q\) as defined in (3.14), \(\mathcal{C }\mathop {=}\limits ^{\mathrm{def}}\{\mathbf{y}_\mathbf{k}\}_{0\le \mathbf{k}\le m-1}\subset \mathbb T ^q\), where each \(\mathbf{y}_\mathbf{k}\in I_\mathbf{k}\), and let (3.15) be satisfied. Then there exists \(C>0\) such that if \(\epsilon >0\) and \(N=C\epsilon m\), then we have for every \(P\in \mathbb{H }^q _N\),

$$\begin{aligned} \left| \frac{1}{m^q}\sum _{0\le \mathbf{k}\le m-1}|P(\mathbf{y}_\mathbf{k})| - \int _\mathbb{T ^q}|P(\mathbf{z})|\,d\mu _q^*(\mathbf{z})\right| \le \epsilon \Vert P\Vert _1, \end{aligned}$$
(3.16)

and

$$\begin{aligned} \left| \frac{1}{m^q}\sum _{0\le \mathbf{k}\le m-1}P(\mathbf{y}_\mathbf{k}) - \int _\mathbb{T ^q}P(\mathbf{z})\,d\mu _q^*(\mathbf{z})\right| \le \epsilon \Vert P\Vert _1. \end{aligned}$$
(3.17)

The lemma is proved in much greater generality in [20]. The ideas behind the proof are quite well known, e.g. [77, Section 4.9.1] or [67]. Here we give a simplified version of the proof in [20], using some ideas in [54, 64, 65]. Our objective is to highlight the use of localized kernels.

Proof of Lemma 3.1

In this proof, let \(\delta =\delta (\mathcal{C })\), and \(\eta =\eta (\mathcal{C })\). We note that \(\delta (\mathcal{C })\sim 1/m\). Let \(N\ge 1\) be an integer to be chosen later, and \(P\in \mathbb{H }^q _N\). Using the mean value theorem, it is easy to see that

$$\begin{aligned} \max _{\mathbf{z}\in I_\mathbf{k}}|P(\mathbf{z})-P(\mathbf{y}_\mathbf{k})|\le \frac{c}{m} \max _{1\le r\le q}\Vert \partial _r P\Vert _{\infty , I_\mathbf{k}}. \end{aligned}$$
(3.18)

Consequently,

$$\begin{aligned}&\left| \frac{1}{m^q}\sum _{0\le \mathbf{k}\le m-1}|P(\mathbf{y}_\mathbf{k})| - \int _\mathbb{T ^q}|P(\mathbf{z})|\,d\mu _q^*(\mathbf{z})\right| \nonumber \\&\quad = \left| \sum _{0\le \mathbf{k}\le m-1}\int _{I_\mathbf{k}}\left( |P(\mathbf{y}_\mathbf{k})|-|P(\mathbf{z})|\right) \,d\mu _q^*(\mathbf{z})\right| \nonumber \\&\quad \le \sum _{0\le \mathbf{k}\le m-1}\int _{I_\mathbf{k}}\left| P(\mathbf{y}_\mathbf{k})-P(\mathbf{z})\right| \,d\mu _q^*(\mathbf{z})\nonumber \\&\quad \le \max _{1\le r\le q}\frac{c}{m^{q+1}}\sum _{0\le \mathbf{k}\le m-1}\Vert \partial _r P\Vert _{\infty , I_\mathbf{k}}. \end{aligned}$$
(3.19)

Now, let \(S>q\) be an integer, \(h\) be an infinitely differentiable low pass filter, \(1\le r\le q\), and \(H_r(\mathbf u )=iu_rh(|\mathbf u |_2)\). Then \(H_r\) is also infinitely differentiable. Hence, the fact that \(|\mathbf{t }|_2\sim |\mathbf{t }|_\infty \) for all \(\mathbf{t }\in \mathbb{R }^q\) and the estimate (3.2) used with \(H_r\) in place of \(H\) shows that for \(\mathbf{t }\in \mathbb T ^q\),

$$\begin{aligned} |\partial _r \Phi _N(h,\mathbf{t })|=N|\Phi _N(H_r,\mathbf{t })|\le c\frac{N^{q+1}}{\max (1, (N|\mathbf{t }|_\infty )^S)}. \end{aligned}$$
(3.20)

We observe that for \(\mathbf{z}\in \mathbb T ^q\),

$$\begin{aligned} \partial _rP(\mathbf{z}) =\int _\mathbb{T ^q}P(\mathbf{t })\partial _r\Phi _N(h,\mathbf{z}-\mathbf{t })\,d\mu _q^*(\mathbf{t }), \end{aligned}$$

and from this deduce that

$$\begin{aligned} \sum _{0\le \mathbf{k}\le m-1}\Vert \partial _r P\Vert _{\infty , I_\mathbf{k}}\!\le \! N\int _\mathbb{T ^q}|P(\mathbf{t })|\left\{ \sum _{0\le \mathbf{k}\le m-1}\max _{\mathbf{z}\in I_\mathbf{k}}|\Phi _N(H_r,\mathbf{z}-\mathbf{t })|\right\} \,d\mu _q^*(\mathbf{t }).\nonumber \\ \end{aligned}$$
(3.21)

For the rest of the proof, let \(R_r\) denote the maximum over all \(\mathbf{t }\in \mathbb T ^q\) of the expression in the braces in the above formula. In view of translation invariance of the kernels \(\Phi _N\), we may assume without loss of generality that

$$\begin{aligned} R_r=\sum _{0\le \mathbf{k}\le m-1}\max _{\mathbf{z}\in I_\mathbf{k}}|\Phi _N(H_r,\mathbf{z}-(-\pi ,\ldots ,-\pi ))|. \end{aligned}$$
(3.22)

We conclude using (3.19) that

$$\begin{aligned} \left| \frac{1}{m^q}\sum _{0\le \mathbf{k}\le m-1}|P(\mathbf{y}_\mathbf{k})| - \int _\mathbb{T ^q}|P(\mathbf{z})|\,d\mu _q^*(\mathbf{z})\right| \le c\frac{N}{m^{q+1}}\left( \max _{1\le r\le q}R_r\right) \Vert P\Vert _1.\qquad \end{aligned}$$
(3.23)

For the rest of the proof let \(\beta =(2\pi N)/m\). For \(\ell =0,1,\dots , m-1\), let

$$\begin{aligned} \mathcal J _\ell =\left\{ \mathbf{k}\in \mathbb{Z }^q : 0\le \mathbf{k}\le m-1,\ \frac{2\pi \ell }{m}\le \min _{\mathbf{z}\in I_\mathbf{k}}|\mathbf{z}+(\pi ,\ldots ,\pi )|_\infty \le \frac{2\pi (\ell +1)}{m}\right\} , \end{aligned}$$

and let \(n_\ell \) denote the number of elements in \(\mathcal J _\ell \). Then \(n_\ell \sim \ell ^{q-1}\). Therefore, (3.20) shows that

$$\begin{aligned} R_r&\le cN^q\sum _{\ell =0}^{m-1} \frac{1}{\max (1,(\beta \ell )^S)}n_\ell =cN^q\left\{ \sum _{\ell \le \beta ^{-1}}\ell ^{q-1} +\beta ^{-S}\sum _{\ell >\beta ^{-1}}\ell ^{q-1-S}\right\} \\&\le cN^q\beta ^{-q} \le cm^q. \end{aligned}$$

Moreover, the very last constant \(c\) above can be chosen independently of \(r\). Hence, (3.23) shows that

$$\begin{aligned} \left| \frac{1}{m^q}\sum _{0\le \mathbf{k}\le m-1}|P(\mathbf{y}_\mathbf{k})| - \int _\mathbb{T ^q}|P(\mathbf{z})|\,d\mu _q^*(\mathbf{z})\right| \le c\frac{N}{m}\Vert P\Vert _1. \end{aligned}$$

Choosing \(N=\epsilon m/c\), we arrive at (3.16).

Since

$$\begin{aligned}&\left| \frac{1}{m^q}\sum _{0\le \mathbf{k}\le m-1}P(\mathbf{y}_\mathbf{k}) - \int _\mathbb{T ^q}P(\mathbf{z})\,d\mu _q^*(\mathbf{z})\right| \nonumber \\&\quad = \left| \sum _{0\le \mathbf{k}\le m-1}\int _{I_\mathbf{k}}\left( P(\mathbf{y}_\mathbf{k})-P(\mathbf{z})\right) \,d\mu _q^*(\mathbf{z})\right| \nonumber \\&\quad \le \sum _{0\le \mathbf{k}\le m-1}\int _{I_\mathbf{k}}\left| P(\mathbf{y}_\mathbf{k})-P(\mathbf{z})\right| \,d\mu _q^*(\mathbf{z})\nonumber \\&\quad \le \max _{1\le r\le q}\frac{c}{m^{q+1}}\sum _{0\le \mathbf{k}\le m-1}\Vert \partial _r P\Vert _{\infty , I_\mathbf{k}}, \end{aligned}$$

the same proof as above shows that (3.17) holds as well. \(\square \)
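A small experiment for \(q=1\) illustrates the scaling \(N/m\) in the lemma. For a single exponential \(P(x)=\exp (ijx)\) with \(j\ne 0\) we have \(\int P\,d\mu _1^*=0\) and \(\Vert P\Vert _1=1\), and the mean value theorem alone gives an equal-weight error of at most \(2\pi |j|/m\) when one node is placed in each arc \(I_k\). The sketch checks this elementary bound on randomly placed nodes; the substance of the lemma, that a bound of this shape holds for every \(P\in \mathbb{H }_N^q\) in terms of \(\Vert P\Vert _1\), is not tested here, and the names used are ours.

```python
import cmath
import math
import random

def jittered_nodes(m, rng):
    """One node y_k in each arc I_k = [-pi + 2*pi*k/m, -pi + 2*pi*(k+1)/m]."""
    return [-math.pi + 2.0 * math.pi * (k + rng.random()) / m
            for k in range(m)]

def equal_weight_error(nodes, j):
    """|(1/m) sum_k P(y_k) - integral of P| for P(x) = exp(i j x), j != 0."""
    m = len(nodes)
    return abs(sum(cmath.exp(1j * j * y) for y in nodes) / m)

if __name__ == "__main__":
    rng = random.Random(0)
    nodes = jittered_nodes(64, rng)
    for j in (1, 2, 4, 8):
        print(j, equal_weight_error(nodes, j), 2.0 * math.pi * j / 64)
```

The observed errors are typically far below the bound, because the worst case requires all nodes to sit at the same relative position in their arcs.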

The other important ingredient in the proof of Theorem 3.5 is the following consequence of the Hahn–Banach theorem, known as the Krein extension theorem; see [21] for a recent proof. Let \(\mathbb{X }\) be a normed linear space, \(\mathcal{K }\) be a subset of its normed dual \(\mathbb{X }^*\), and \(\mathcal{V }\) be a linear subspace of \(\mathbb{X }\). We say that a linear functional \(x^*\in \mathcal{V }^*\) is positive on \(\mathcal{V }\) with respect to \(\mathcal{K }\) if \(x^*(f)\ge 0\) for every \(f\in \mathcal{V }\) with the property that \(y^*(f)\ge 0\) for every \(y^*\in \mathcal{K }\).

Theorem 3.6

Let \(\mathbb{X }\) be a normed linear space, let \(\mathcal{K }\) be a bounded subset of its normed dual \(\mathbb{X }^*\), let \(\mathcal{V }\) be a linear subspace of \(\mathbb{X }\), and let \(x^*\in \mathcal{V }^*\) be positive on \(\mathcal{V }\) with respect to \(\mathcal{K }\). We assume further that there exists \(v_0\in \mathcal{V }\) such that \(\Vert v_0\Vert _\mathbb{X }=1\) and

$$\begin{aligned} \inf _{y^*\in \mathcal{K }} y^*(v_0) =\beta ^{-1}>0. \end{aligned}$$
(3.24)

Then there exists an extension \(X^*\in \mathbb{X }^*\) of \(x^*\) which is positive on \(\mathbb{X }\) with respect to \(\mathcal{K }\) and satisfies

$$\begin{aligned} \Vert X^*\Vert _{\mathbb{X }^*}\le \beta \sup _{y^*\in \mathcal{K }} \Vert y^*\Vert _{\mathbb{X }^*}x^*(v_0). \end{aligned}$$
(3.25)

With this preparation, we are ready to prove Theorem 3.5.

Proof of Theorem 3.5

As explained earlier, we may assume that \(M=m^q\) and \(\mathcal{C }=\{\mathbf{y}_\mathbf{k}: 0\le \mathbf{k}\le m-1\}\).

We take \(\epsilon =1/4\) in Lemma 3.1 and conclude from (3.16) that

$$\begin{aligned} (3/4)\Vert P\Vert _1\le \frac{1}{m^q}\sum _{0\le \mathbf{k}\le m-1}|P(\mathbf{y}_\mathbf{k})| \le (5/4)\Vert P\Vert _1, \end{aligned}$$
(3.26)

and from (3.17) that

$$\begin{aligned} \left| \frac{1}{m^q}\sum _{0\le \mathbf{k}\le m-1}P(\mathbf{y}_\mathbf{k}) \!-\! \int _\mathbb{T ^q}P(\mathbf{z})\,d\mu _q^*(\mathbf{z})\right| \le (1/4)\Vert P\Vert _1\le \frac{1}{3m^q}\sum _{0\le \mathbf{k}\le m-1}|P(\mathbf{y}_\mathbf{k})|.\nonumber \\ \end{aligned}$$
(3.27)

Denote by \(\mathbb{X }\) the space \(\mathbb{R }^M\), equipped with the norm

$$\begin{aligned} \Vert \mathbf{x}\Vert =\sum _{0\le \mathbf{k}\le m-1}\mu _q^*(I_\mathbf{k})|x_\mathbf{k}| \quad \mathrm{where } \quad \mathbf{x}=(x_\mathbf{k})_{0\le \mathbf{k}\le m-1}. \end{aligned}$$

For the set \(\mathcal{K }\), we choose the set of coordinate functionals; \(z_\mathbf{k}^*(\mathbf{x})=x_\mathbf{k}\). Then \(\mathcal{K }\) is clearly a compact subset of \(\mathbb{X }^*\). We consider the operator \(\mathcal{S } :\mathbb{H }^q _N\rightarrow \mathbb{R }^M\) given by \(P\mapsto (P(\mathbf{y}_\mathbf{k}))_{\mathbf{k}=0}^{m-1}\), and take the subspace \(\mathcal{V }\) of \(\mathbb{X }\) to be the range of \(\mathcal{S }\). The lower estimate in (3.26) shows that \(\mathcal{S }\) is invertible on \(\mathcal{V }\). We define the functional \(x^*\) on \(\mathcal{V }\) by

$$\begin{aligned} x^*(\mathcal{S }(P))=\int _\mathbb{T ^q}P(\mathbf{z})\,d\mu _q^*(\mathbf{z})-\frac{1}{3m^q}\sum _{0\le \mathbf{k}\le m-1}P(\mathbf{y}_\mathbf{k}), \quad P\in \mathbb{H }^q _N. \end{aligned}$$

Moreover, if each \(P(\mathbf{y}_\mathbf{k})\ge 0\), then, using (3.27), we can conclude that

$$\begin{aligned} \int _\mathbb{T ^q}P(\mathbf{z})\,d\mu _q^*(\mathbf{z})-\frac{1}{3m^q}\sum _{0\le \mathbf{k}\le m-1}P(\mathbf{y}_\mathbf{k})\ge \frac{1}{3m^q}\sum _{0\le \mathbf{k}\le m-1}P(\mathbf{y}_\mathbf{k})\ge 0. \end{aligned}$$

Thus, \(x^*\) is positive on \(\mathcal{V }\) with respect to \(\mathcal{K }\). The element \((1,\ldots ,1)\in \mathcal{V }\) serves as \(v_0\) in Theorem 3.6. Theorem 3.6 then implies that there exists a nonnegative functional \(X^*\) on \(\mathbb{R }^M\) that extends \(x^*\). We may identify this functional with \((\tilde{W}_\mathbf{k})_{\mathbf{k}=0}^{m-1}\in \mathbb{R }^M\) such that each \(\tilde{W}_\mathbf{k}\ge 0\). The fact that \(X^*\) extends \(x^*\) means that for each \(P\in \mathbb{H }^q _N\),

$$\begin{aligned}&\int _\mathbb{T ^q} P(\mathbf{z})\,d\mu _q^*(\mathbf{z}) =\sum _{0\le \mathbf{k}\le m-1} \left( \tilde{W}_\mathbf{k}+\frac{1}{3m^q}\right) P(\mathbf{y}_\mathbf{k})\nonumber \\&\quad =\sum _{0\le \mathbf{k}\le m-1} w_\mathbf{k}P(\mathbf{y}_\mathbf{k}) \quad {\hbox {where}} \quad w_\mathbf{k}\mathop {=}\limits ^{\mathrm{def}}\tilde{W}_\mathbf{k}+\frac{1}{3m^q}. \end{aligned}$$
(3.28)

If we revert to the original set of points and set \(w_\mathbf{k}=0\) if \(\mathbf{y}_\mathbf{k}\) is not in the subset chosen as in the remark following the statement of Theorem 3.5, then this is (3.12).

It was proved in [21, Theorem 5.8] that (3.12) implies (3.13). \(\square \)

Next, we make a few comments about the numerical computation of the weights \(w_\mathbf{k}\). We observe that the reduction of the original data set \(\{\mathbf{y}_k\}\) may be done in several different ways and the weights \(w_\mathbf{k}\) are not uniquely defined either. Having made the reduction of the original set, a straightforward way to find the weights numerically is the following, see [34]. We minimize \(\sum _{0\le \mathbf{k}\le m-1}w_\mathbf{k}^2\) subject to the conditions

$$\begin{aligned} \sum _{0\le \mathbf{k}\le m-1} w_\mathbf{k}\exp (i\mathbf{j }\cdot \mathbf{y}_\mathbf{k})=\left\{ \begin{array}{ll} 1, &{}\quad {\hbox {if}}\ \mathbf{j }=\mathbf{0},\\ 0, &{}\quad {\hbox {otherwise,}} \end{array}\right. \end{aligned}$$

for \(0\le \mathbf{j }<N\). This involves the solution of a linear system of equations whose matrix, the so-called Gram matrix, is given by

$$\begin{aligned} V_{\mathbf{j },{\varvec{\ell }}}=\sum _{0\le \mathbf{k}\le m-1} \exp (i(\mathbf{j }-{\varvec{\ell }})\cdot \mathbf{y}_\mathbf{k}), \quad 0\le \mathbf{j }, {\varvec{\ell }}<N. \end{aligned}$$
(3.29)

If \((a_\mathbf{j })_{0\le \mathbf{j }<N}\) is an arbitrary vector and we take \(P(\mathbf{x})=\sum _{0\le \mathbf{j }<N} a_\mathbf{j }\exp (i\mathbf{j }\cdot \mathbf{x})\), then the Rayleigh quotient for this matrix can be calculated to be

$$\begin{aligned} \frac{\sum _{0\le \mathbf{k}\le m-1}|P(\mathbf{y}_\mathbf{k})|^2}{\sum _{0\,\le \, \mathbf{j }<N}|a_\mathbf{j }|^2} =\frac{\sum _{0\le \mathbf{k}\le m-1}|P(\mathbf{y}_\mathbf{k})|^2}{\Vert P\Vert _2^2}, \end{aligned}$$

where we used the Parseval identity in the last step. Thus, the largest and smallest eigenvalues of \(V\), \(\lambda _{\max }\) and \(\lambda _{\min }\) are given by

$$\begin{aligned} \lambda _{\min }=\min _{P\in \mathbb{H }^q _N} \frac{\sum _{0\le \mathbf{k}\le m-1}|P(\mathbf{y}_\mathbf{k})|^2}{\Vert P\Vert _2^2}, \quad \lambda _{\max }=\max _{P\in \mathbb{H }^q _N} \frac{\sum _{0\le \mathbf{k}\le m-1}|P(\mathbf{y}_\mathbf{k})|^2}{\Vert P\Vert _2^2},\qquad \quad \end{aligned}$$
(3.30)

see [29, Theorem 4.2.2, p. 176]. In practice, one has to choose \(N\) by trial and error so that the condition number of \(V\) is “reasonable”. We are tempted to solve a non-linear optimization problem with the additional requirement that the weights should be non-negative. It is our experience that the non-negativity of the weights is not important in practice, but an inequality of the form (3.13) is essential. For any given data, such an inequality will, of course, hold with some constants depending on \(N\) and the data set. However, it is of interest to estimate the constants. It turns out that the constants involved are proportional to \(\lambda _{\max }\) and \(\lambda _{\min }\).
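The minimum-norm computation described above can be sketched for \(q=1\) as follows. We deviate from the text in one respect, stated here as an assumption: the conditions are imposed for the symmetric range \(|j|<N\) rather than \(0\le j<N\), which makes the minimum-norm solution real. The small Gaussian elimination routine, the function names, and the way the nodes are generated are ours and are meant only for illustration.

```python
import cmath
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0j] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def min_norm_weights(points, N):
    """Weights of least sum of squares with sum_k w_k exp(i j y_k) = [j == 0]."""
    freqs = list(range(-(N - 1), N))  # assumption: symmetric range |j| < N
    # Gram matrix as in (3.29): V[j][l] = sum_k exp(i (j - l) y_k).
    V = [[sum(cmath.exp(1j * (j - l) * y) for y in points) for l in freqs]
         for j in freqs]
    rhs = [1.0 if j == 0 else 0.0 for j in freqs]
    c = solve(V, rhs)
    # w_k = sum_j c_j exp(-i j y_k); the imaginary part is rounding noise.
    return [sum(cj * cmath.exp(-1j * j * y) for cj, j in zip(c, freqs)).real
            for y in points]

if __name__ == "__main__":
    rng = random.Random(0)
    pts = [-math.pi + 2 * math.pi * (k + rng.random()) / 32 for k in range(32)]
    w = min_norm_weights(pts, 5)
    print(min(w), max(w), sum(w))
```

Nothing in this computation forces the weights to be non-negative; as noted in the text, that would require a constrained optimization problem instead of a linear solve.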

We would like to point out another interesting fact. Suppose one wishes to find a least squares fit from \(\mathbb{H }^q _N\) to the data of the form \(\{(\mathbf{y}_\mathbf{k},z_\mathbf{k})\}_{0\le \mathbf{k}\le m-1}\), that is, find \(a_\mathbf{j }\)’s to minimize

$$\begin{aligned} \sum _{0\le \mathbf{k}\le m-1} |w_\mathbf{k}|\left| z_\mathbf{k}-\sum _{0\le {\varvec{\ell }}<N} a_{{\varvec{\ell }}}\exp (i{\varvec{\ell }}\cdot \mathbf{y}_\mathbf{k})\right| ^2 \end{aligned}$$

for a suitable choice of \(w_\mathbf{k}\). This involves the solution of a linear system of equations where the matrix involved is \(G\), defined by

$$\begin{aligned} G_{\mathbf{j },{\varvec{\ell }}}=\sum _{0\le \mathbf{k}\le m-1} |w_\mathbf{k}|\exp (i(\mathbf{j }-{\varvec{\ell }})\cdot \mathbf{y}_\mathbf{k}). \end{aligned}$$

As before, we have the bounds

$$\begin{aligned} \tilde{\lambda }_{\min }\Vert P\Vert _2^2 \!\le \! \sum _{0\le \mathbf{k}\le m-1}|w_\mathbf{k}||P(\mathbf{y}_\mathbf{k})|^2 \!\le \! \tilde{\lambda }_{\max }\Vert P\Vert _2^2, \quad P\in \mathbb{H }^q _N,\nonumber \\ \end{aligned}$$
(3.31)

where \(\tilde{\lambda }_{\min }\) and \(\tilde{\lambda }_{\max }\) are the smallest and largest eigenvalues of \(G\).
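The link between the eigenvalues of \(G\) and the weighted discrete sums is the identity \(\sum _{\mathbf{j },{\varvec{\ell }}}a_\mathbf{j }\overline{a_{{\varvec{\ell }}}}G_{\mathbf{j },{\varvec{\ell }}}=\sum _\mathbf{k}|w_\mathbf{k}||P(\mathbf{y}_\mathbf{k})|^2\) for \(P=\sum _\mathbf{j }a_\mathbf{j }\exp (i\mathbf{j }\cdot \circ )\); dividing by \(\sum _\mathbf{j }|a_\mathbf{j }|^2=\Vert P\Vert _2^2\) gives a Rayleigh quotient of \(G\). The sketch below checks the identity numerically for \(q=1\) with arbitrary test data; the index range and the function names are assumptions made for the example.

```python
import cmath
import random

def gram_weighted(points, weights, freqs):
    """G[j][l] = sum_k |w_k| exp(i (j - l) y_k)."""
    return [[sum(abs(w) * cmath.exp(1j * (j - l) * y)
                 for w, y in zip(weights, points)) for l in freqs]
            for j in freqs]

def quadratic_form(G, a):
    """sum over j, l of a_j * conj(a_l) * G[j][l]."""
    n = len(a)
    return sum(a[j] * a[l].conjugate() * G[j][l]
               for j in range(n) for l in range(n))

def weighted_discrete_energy(points, weights, freqs, a):
    """sum_k |w_k| |P(y_k)|^2 with P(x) = sum_j a_j exp(i j x)."""
    total = 0.0
    for w, y in zip(weights, points):
        p = sum(c * cmath.exp(1j * j * y) for c, j in zip(a, freqs))
        total += abs(w) * abs(p) ** 2
    return total

if __name__ == "__main__":
    rng = random.Random(0)
    pts = [rng.uniform(-3.0, 3.0) for _ in range(20)]
    wts = [rng.uniform(-1.0, 1.0) for _ in range(20)]
    freqs = list(range(-3, 4))
    a = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in freqs]
    G = gram_weighted(pts, wts, freqs)
    print(quadratic_form(G, a).real,
          weighted_discrete_energy(pts, wts, freqs, a))
```

Since the quadratic form is a sum of non-negative terms, \(G\) is positive semidefinite, and its extreme eigenvalues are the extreme values of this quotient over all coefficient vectors.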

To describe these results while keeping track of the constants on the various data sets and weights, etc., and also to simplify our notation in further theory, it is convenient to use a measure notation. The actual choice of the reduced data set and the weights plays no role in our theoretical consideration. Therefore, it is convenient to define a measure \(\nu \) that associates the mass \(w_\mathbf{k}\) with each \(\mathbf{y}_\mathbf{k}\), that is, for any subset \(B\subset \mathbb T ^q\),

$$\begin{aligned} \nu (B)=\sum _{\mathbf{k}: \mathbf{y}_\mathbf{k}\in B}w_\mathbf{k}. \end{aligned}$$

We pause in our discussion to review some basic notions related to signed measures and introduce some notation before proceeding further. We recall that the total variation measure of any signed measure \(\mu \) is defined by

$$\begin{aligned} |\mu |(\mathcal{U }) \mathop {=}\limits ^{\mathrm{def}}\sup \sum _{i=1}^\infty |\mu (U_i)|, \quad \mathcal{U }\subset \mathbb T ^q\!, \end{aligned}$$

where the supremum is taken over all countable partitions \(\{U_i\}\) of \(\mathcal{U }\) into measurable sets. For the measure \(\nu \) as defined above, one can easily deduce that \(|\nu |(B)=\sum _{\mathbf{k}: \mathbf{y}_\mathbf{k}\in B}|w_{\mathbf{k}}|\) for any subset \(B\subset \mathbb T ^q\). It is well known that any signed measure \(\mu \) on \(\mathbb T ^q\) satisfies \(|\mu |(\mathbb T ^q)<\infty \). The support of a measure \(\mu \), denoted by \(\mathsf{supp }(\mu )\), is the set of all \(\mathbf{x}\in \mathbb T ^q\) such that for every open subset \(U\) of \(\mathbb T ^q\) containing \(\mathbf{x}\), \(|\mu |(U)>0\).

If \(1 \le p \le \infty \), \(\mu \) is a (possibly signed) measure on \(\mathbb T ^q\), \(B \subset \mathbb T ^q\) is \(\mu \)-measurable, and \(f:B \rightarrow \mathbb{C }\) is \(\mu \)-measurable, then the \(L^p\) norm of \(f\) with respect to \(\mu \) is given by

$$\begin{aligned} \Vert f\Vert _{\mu ;p,B}\mathop {=}\limits ^{\mathrm{def}}\left\{ \begin{array}{ll} \left\{ \int _B |f(\mathbf{x})|^p\,d|\mu |(\mathbf{x})\right\} ^{1/p}, &{}\quad {\hbox {if}}\ 1\le p<\infty ,\\ |\mu |\text{-}\mathop {\mathrm{ess\,sup}}\limits _{\mathbf{x}\in B}|f(\mathbf{x})|, &{}\quad {\hbox {if}}\ p=\infty . \end{array}\right. \end{aligned}$$
(3.32)
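For a discrete measure that places the mass \(w_\mathbf{k}\) at \(\mathbf{y}_\mathbf{k}\), the total variation and the norm in (3.32) reduce to finite sums. The following sketch makes this explicit; the class name, the representation of a set \(B\) by a membership predicate, and the sample data are hypothetical choices for the illustration.

```python
import math

class DiscreteMeasure:
    """nu = sum_k w_k * (point mass at y_k); the weights may be negative."""

    def __init__(self, points, weights):
        self.atoms = list(zip(points, weights))

    def mass(self, in_B):
        """nu(B) for the set B described by the predicate in_B."""
        return sum(w for y, w in self.atoms if in_B(y))

    def total_variation(self, in_B):
        """|nu|(B) = sum of |w_k| over the atoms lying in B."""
        return sum(abs(w) for y, w in self.atoms if in_B(y))

    def norm(self, f, p, in_B=lambda y: True):
        """The norm (3.32) of f over B with respect to nu."""
        # Atoms of zero weight are |nu|-null and are ignored.
        atoms = [(y, abs(w)) for y, w in self.atoms if in_B(y) and w != 0]
        if p == math.inf:
            return max((abs(f(y)) for y, _ in atoms), default=0.0)
        return sum(w * abs(f(y)) ** p for y, w in atoms) ** (1.0 / p)

if __name__ == "__main__":
    nu = DiscreteMeasure([0.0, 1.0, 2.0], [0.5, -0.25, 0.0])
    f = lambda y: y + 1.0
    print(nu.mass(lambda y: y <= 1.5), nu.total_variation(lambda y: y <= 1.5))
    print(nu.norm(f, 1), nu.norm(f, 2), nu.norm(f, math.inf))
```

Note that the essential supremum ignores the point \(2.0\), which carries no mass, so the \(p=\infty \) norm of \(f\) is \(2\) and not \(3\).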

As before, we will omit the mention of \(B\) if \(B=\mathbb T ^q\), and to keep the notation consistent, will also omit the measure \(\mu \) from the notation if \(\mu =\mu _q^*\). The space \(X^p(\mu )\) denotes the \(L^p(\mu )\) closure of the set of all trigonometric polynomials. In this notation, the estimate (3.13) becomes

$$\begin{aligned} \Vert P\Vert _{\nu ;1}\sim \Vert P\Vert _1, \quad P\in \mathbb{H }^q _N, \end{aligned}$$

and (3.31) can be written as

$$\begin{aligned} \tilde{\lambda }_{\min }\Vert P\Vert _2^2\le \Vert P\Vert _{\nu ;2}^2\le \tilde{\lambda }_{\max }\Vert P\Vert _2^2. \end{aligned}$$

In the sequel, we will not fix the measure \(\nu \) any more as above, and instead we use the notations \(\nu \), \(\mu \), etc., to denote arbitrary measures. With a different measure \(\nu \), the formulas (3.30) become

$$\begin{aligned} \lambda _{\min }\Vert P\Vert _2^2\le \Vert P\Vert _{\nu ;2}^2 \le \lambda _{\max }\Vert P\Vert _2^2, \quad P\in \mathbb{H }^q _N. \end{aligned}$$

We are now ready to resume our discussion of the M–Z inequalities and related topics.

Definition 3.3

  1. (a)

    A (possibly signed) measure \(\mu \) is called a quadrature measure of order \(n\) when

    $$\begin{aligned} {\int _\mathbb{T ^q}}T\,d\mu = {\int _\mathbb{T ^q}}T\,d{\mu _q^*}, \quad T \in \mathbb{H }^q _n. \end{aligned}$$
    (3.33)
  2. (b)

    A (possibly signed) measure \(\mu \) is called a Marcinkiewicz–Zygmund measure, or M–Z measure, of order \(n\) when the following M–Z inequality

    $$\begin{aligned} {\int _\mathbb{T ^q}}|T|\,d|\mu |=\Vert T\Vert _{\mu ;1} \le c(n,\mu ) \Vert T \Vert _1, \quad T \in \mathbb{H }^q _n, \end{aligned}$$
    (3.34)

    is satisfied, where \(c(n,\mu )\) is a constant independent of \(T\). The smallest \(c\) that works in (3.34) will be denoted by \(|\!|\!|\mu |\!|\!|_n\).

It can be shown easily that for each \(n\in \mathbb{N }\), \(|\!|\!|\circ |\!|\!|_n\) is a norm on the space of Radon measures. Clearly, for every \(n\in \mathbb{N }\), \(\mu _q^*\) itself is an M–Z quadrature measure of order \(n\) with \(|\!|\!|\mu _q^*|\!|\!|_n=1\). In general, if \(\mu \) is an M–Z quadrature measure of order \(n\) for some \(n>0\), then \(|\!|\!|\mu |\!|\!|_n \ge 1\); this follows by taking \(T\equiv 1\) in (3.33) and (3.34).
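As a rough numerical illustration of Definition 3.3(a) in the univariate case \(q=1\), the following sketch tests whether a discrete measure integrates the exponentials \(\exp (ikx)\), \(|k|<n\), exactly. The helper names `equispaced_measure` and `quadrature_defect` are hypothetical, and the convention that \(\mathbb{H }^1_n\) is spanned by \(\exp (ikx)\) with \(|k|<n\) is an assumption made for this example only.

```python
import cmath
import math

def equispaced_measure(M):
    """Nodes 2*pi*j/M on the circle, each carrying mass 1/M (illustrative choice)."""
    nodes = [2 * math.pi * j / M for j in range(M)]
    weights = [1.0 / M] * M
    return nodes, weights

def quadrature_defect(nodes, weights, n):
    """Largest error in integrating exp(i*k*x), |k| < n, against the discrete measure.

    With respect to the normalized Lebesgue measure, the exact integral is 1 for
    k = 0 and 0 otherwise.
    """
    worst = 0.0
    for k in range(-(n - 1), n):
        approx = sum(w * cmath.exp(1j * k * t) for t, w in zip(nodes, weights))
        exact = 1.0 if k == 0 else 0.0
        worst = max(worst, abs(approx - exact))
    return worst

nodes16, weights16 = equispaced_measure(16)
print(quadrature_defect(nodes16, weights16, 16))  # close to machine precision

nodes8, weights8 = equispaced_measure(8)
print(quadrature_defect(nodes8, weights8, 16))    # order-one defect caused by exp(8ix)
```

With 16 equispaced nodes the measure behaves as a quadrature measure of order 16 under the stated convention, while 8 nodes are too few: the exponential \(\exp (8ix)\) equals 1 at every node.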

In [21, Proposition 2.1, Theorem 5.4], we proved the following.

Theorem 3.7

Let \(\mu \) be a Radon measure and let \(n\ge 2\). Then we have the following inequalities where the constants \(c\) are independent of both \(\mu \) and \(n\).

  1. (a)

    If \(1\le p<\infty \) then

    $$\begin{aligned} \Vert P\Vert _{\mu ;p}\le c|\!|\!|\mu |\!|\!|_n^{1/p}\Vert P\Vert _p, \quad P\in \mathbb{H }^q _n. \end{aligned}$$
    (3.35)

    Conversely, if for some \(p\in [1,\infty )\) and \(A=A(\mu ,n)>0\),

    $$\begin{aligned} \Vert P\Vert _{\mu ;p}\le cA^{1/p}\Vert P\Vert _p, \quad P\in \mathbb{H }^q _n, \end{aligned}$$

    holds, then \(|\!|\!|\mu |\!|\!|_n\le cA\).

  2. (b)

    For any \(r>0\) and \(\mathbf{x}\in \mathbb T ^q\) we have \(|\mu |(\{\mathbf{y}: \Vert \mathbf{x}-\mathbf{y}\Vert _\infty \le r\}) \le c|\!|\!|\mu |\!|\!|_n (r+1/n)^q\). Conversely, if there exists a constant \(A=A(\mu ,n)\) such that for every \(r>0\) and \(\mathbf{x}\in \mathbb T ^q\), \(|\mu |(\{\mathbf{y}: \Vert \mathbf{x}-\mathbf{y}\Vert _\infty \le r\}) \le A(r+1/n)^q\), then \(A\le c|\!|\!|\mu |\!|\!|_n\).

  3. (c)

    For any constant \(\alpha >0\), we have \(|\!|\!|\mu |\!|\!|_n \sim |\!|\!|\mu |\!|\!|_{\alpha n}\), where the constants involved in the \(\sim \) relationship depend only on \(\alpha \) (and \(q\)) but not on \(\mu \) or \(n\).

Once we compute the maximum eigenvalue of the Gram matrix \(V\) defined in (3.29), Theorem 3.7(a) allows us to estimate the constant involved in (3.13). Theorem 3.7(b) gives a geometric criterion for (3.13) without referring to trigonometric polynomials. Theorem 3.7(c) shows that if (3.13) holds for some \(n\), then it also holds with equivalent constants for \(\alpha n\) for every \(\alpha >1\). In particular, even if a quadrature measure of order \(n\) supported on the minimal number of points, \({\hbox {dim}}(\mathbb{H }^q _{n/2})\), does not exist when \(q\ge 2\), M–Z measures of order \(n\) with this property are plentiful.
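The geometric criterion in Theorem 3.7(b) can be probed numerically. The sketch below, for \(q=1\), evaluates the ratio \(|\mu |(\{y: |x-y|\le r\})/(r+1/n)\) over a finite grid of centers and radii; since the grid is finite, the result is only a lower bound for the supremum in the theorem. The function names and the sample grids are hypothetical choices for illustration.

```python
import math

def torus_distance(x, y):
    """Distance between two points on the circle of length 2*pi."""
    d = abs(x - y) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def ball_mass_ratio(nodes, weights, n, centers, radii):
    """Largest observed value of |mu|(ball(x, r)) / (r + 1/n) over the given grid."""
    worst = 0.0
    for x in centers:
        for r in radii:
            mass = sum(abs(w) for t, w in zip(nodes, weights)
                       if torus_distance(x, t) <= r)
            worst = max(worst, mass / (r + 1.0 / n))
    return worst

M = 32
centers = [2 * math.pi * i / 64 for i in range(64)]
radii = [0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 3.0]

# Equispaced nodes with equal masses: the ratio stays bounded.
uniform = ball_mass_ratio([2 * math.pi * j / M for j in range(M)],
                          [1.0 / M] * M, M, centers, radii)
# All the mass at one point: the ratio is large for small radii.
clustered = ball_mass_ratio([0.0], [1.0], M, centers, radii)
print(uniform, clustered)
```

The contrast between the two measures reflects the statement of the theorem: a measure whose mass is spread evenly at scale \(1/n\) has a small regularity constant, while a concentrated measure does not.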

3.4 Wavelet-like representation

Our starting point here is the following analogue of Theorem 3.1, where the novelty is that in the case of functions in \(X^\infty \), their samples are used for approximation.

First, some notation. If \(\mu \) is a (possibly signed) measure and \(f\in L^1(\mu )\), we define

$$\begin{aligned} \hat{f}(\mu ;\mathbf{k}) ={\int _\mathbb{T ^q}}f(\mathbf{t })\exp (-i\mathbf{k}\cdot \mathbf{t })\,d\mu (\mathbf{t }), \quad \mathbf{k}\in \mathbb{Z }^q. \end{aligned}$$

If \(\mu =\mu _q^*\) then \( \hat{f}(\mu ;\mathbf{k})=\hat{f}(\mathbf{k})\). If \(\mu \) is a discretely supported measure then \( \hat{f}(\mu ;\mathbf{k})\) is a discretized approximation to \(\hat{f}(\mathbf{k})\) as indicated by \(\mu \). A common example is the discrete Fourier transform, obtained by letting \(\mu \) be supported on a set of dyadic points on \(\mathbb T ^q\). It is well known that the Fast Fourier Transform (FFT) algorithm allows a fast computation of the coefficients \(\hat{f}(\mu ;\mathbf{k})\) for the appropriate measures \(\mu \), and, for this reason, it is used widely in engineering applications. In the case when \(\mu \) is sparsely supported, there are new algorithms with sublinear complexity, see [28]. There are many other examples, including in particular, the so-called low discrepancy quadrature rules, see, e.g., [13] for a detailed discussion and further references. Our interest here is in the case when \(\mu \) is an M–Z quadrature measure.
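To make the relation between \(\hat{f}(\mu ;k)\) and \(\hat{f}(k)\) concrete, the following sketch (for \(q=1\)) evaluates the discretized coefficient for a measure with equal masses at equispaced nodes. The function name `discrete_fourier_coefficient` and the test function are hypothetical. The direct summation is used only for clarity; it is not the FFT.

```python
import cmath
import math

def discrete_fourier_coefficient(f, nodes, weights, k):
    """Integral of f(t) * exp(-i*k*t) against the discrete measure sum_j w_j delta_{t_j}."""
    return sum(w * f(t) * cmath.exp(-1j * k * t) for t, w in zip(nodes, weights))

M = 8
nodes = [2 * math.pi * j / M for j in range(M)]
weights = [1.0 / M] * M

# f(t) = 1 + 2 cos(3t): the exact coefficients are 1 at k = 0, 3, -3 and 0 elsewhere.
f = lambda t: 1.0 + 2.0 * math.cos(3.0 * t)

c3 = discrete_fourier_coefficient(f, nodes, weights, 3)
c5 = discrete_fourier_coefficient(f, nodes, weights, 5)
print(c3, c5)
```

Here the discretized coefficient at \(k=3\) agrees with the exact one, whereas at \(k=5\) it picks up the coefficient at \(k=-3\) by aliasing (since \(5\equiv -3\) modulo 8), although the exact value of \(\hat{f}(5)\) is 0.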

The analogue of the summability operator is defined by

$$\begin{aligned} \sigma _n(\mu ;H,f,\mathbf{x})&\mathop {=}\limits ^{\mathrm{def}} \sum _{0\le \mathbf{k}<n} H\left( \frac{\mathbf{k}}{n}\right) \hat{f}(\mu ;\mathbf{k})\exp (i\mathbf{k}\cdot \mathbf{x}) \nonumber \\&= {\int _\mathbb{T ^q}}f(\mathbf{t })\Phi _n(H;\mathbf{x}\!-\!\mathbf{t })\,d\mu (\mathbf{t }), \quad n\!\in \!\mathbb{N },\ \mathbf{x}\in \mathbb T ^q,\ f\in L^1(\mu ).\nonumber \\ \end{aligned}$$
(3.36)

We will adopt the same conventions as in Sect. 3.2 with respect to overloaded notations. For example, when \(H(\mathbf{t })=h(|\mathbf{t }|_2)\), we will write \(\sigma _n(\mu ;h,f)\) in place of \(\sigma _n(\mu ;H,f)\).

The analogue of Theorem 3.1 is the following.

Theorem 3.8

Let \(n\in \mathbb{N }\), let \(S>q\) be an integer, \(1 \le p \le \infty \), and let \(\mu \) be an M–Z quadrature measure of order \(3n/2\). Let \(h\) be an \(S\)-times continuously differentiable low pass filter.

  1. (a)

    For all \(P\in \mathbb{H }^q _{n/2}\), we have \(\sigma _{n}(\mu ;h,P)=P\).

  2. (b)

    We have for all \(f \in X^p(\mu )\),

    $$\begin{aligned} \Vert \sigma _n(\mu ;h, f)\Vert _p \le c|\!|\!|\mu |\!|\!|_n^{1-1/p}\Vert f\Vert _{\mu ;p}. \end{aligned}$$
    (3.37)

    Consequently, if \(f\in X^\infty \), then \(\Vert \sigma _{n}(\mu ;h,f)\Vert _\infty \le c|\!|\!|\mu |\!|\!|_n\Vert f\Vert _\infty \), and, furthermore,

    $$\begin{aligned} E_{n,\infty }(f) \le \Vert f-\sigma _{n}(\mu ;h,f)\Vert _\infty \le c|\!|\!|\mu |\!|\!|_n E_{n/2,\infty }(f). \end{aligned}$$
    (3.38)
  3. (c)

    If \(f \in L^1(\mu )\) is supported on a compact set \(K\) and \(V\) is an open set with \(K \subset V\), then

    $$\begin{aligned} \Vert \sigma _n(\mu ;h,f)\Vert _{\infty , \mathbb T ^q \setminus V}\le c\Vert f\Vert _{\mu ;1} n^{q-S}\!, \end{aligned}$$
    (3.39)

    where, in addition to \(S\) and \(h\), the constant \(c\) may depend on \(K\) and \(V\).

Proof

If \(P\in \mathbb{H }^q _{n/2}\), then for each \(\mathbf{x}\in \mathbb T ^q\) we have \(P\Phi _n(h,\mathbf{x}-\circ )\in \mathbb{H }^q _{3n/2}\). Since \(\mu \) is a quadrature measure of order \(3n/2\),

$$\begin{aligned} P(\mathbf{x})={\int _\mathbb{T ^q}}P(\mathbf{t })\Phi _n(h,\mathbf{x}-\mathbf{t })\,d\mu _q^*(\mathbf{t }) ={\int _\mathbb{T ^q}}P(\mathbf{t })\Phi _n(h,\mathbf{x}-\mathbf{t })\,d\mu (\mathbf{t })=\sigma _n(\mu ;h,P,\mathbf{x}). \end{aligned}$$

This proves part (a). The proof of part (b) is almost verbatim the same as that of Theorem 2.1(b), except that multivariate notation needs to be used, and, in addition, we need the fact that \(\mu \) is an M–Z measure of order \(3n/2\) to conclude that for each \(\mathbf{x}\in \mathbb T ^q\)

$$\begin{aligned} {\int _\mathbb{T ^q}}|\Phi _n(h,\mathbf{x}-\mathbf{t })|d|\mu |(\mathbf{t }) \le c|\!|\!|\mu |\!|\!|_n{\int _\mathbb{T ^q}}|\Phi _n(h,\mathbf{x}-\mathbf{t })|\,d\mu _q^*(\mathbf{t }) \le c|\!|\!|\mu |\!|\!|_n. \end{aligned}$$

Part (c) is easy to prove using the localization estimate (3.2). \(\square \)
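The reproduction property in Theorem 3.8(a) is easy to check numerically in the case \(q=1\). The sketch below assumes that a low pass filter is an even function equal to 1 on \([0,1/2]\) and vanishing on \([1,\infty )\), and that the sum in (3.36) runs over \(|k|<n\); both are assumptions about conventions fixed outside this section. The particular smooth filter and the names `low_pass_filter` and `sigma` are hypothetical.

```python
import cmath
import math

def low_pass_filter(t):
    """A smooth cutoff: 1 on [0, 1/2], 0 on [1, infinity), smooth in between."""
    t = abs(t)
    if t <= 0.5:
        return 1.0
    if t >= 1.0:
        return 0.0
    s = 2.0 * t - 1.0                      # s lies in (0, 1)
    a = math.exp(-1.0 / (1.0 - s))
    b = math.exp(-1.0 / s)
    return a / (a + b)

def sigma(n, nodes, weights, f, x):
    """sigma_n(mu; h, f, x) for the discrete measure mu = sum_j w_j delta_{t_j}."""
    total = 0.0
    for k in range(-(n - 1), n):
        coeff = sum(w * f(t) * cmath.exp(-1j * k * t) for t, w in zip(nodes, weights))
        total += low_pass_filter(k / n) * coeff * cmath.exp(1j * k * x)
    return total

n = 8
M = 16  # 16 equispaced nodes integrate exp(ikx) exactly for |k| < 16, hence up to order 3n/2 = 12
nodes = [2 * math.pi * j / M for j in range(M)]
weights = [1.0 / M] * M

# A trigonometric polynomial of degree 3 < n/2.
P = lambda t: 1.0 + math.cos(3.0 * t) + 0.5 * math.sin(2.0 * t)

for x in (0.0, 0.7, 2.5, 5.0):
    print(x, abs(sigma(n, nodes, weights, P, x) - P(x)))
```

The printed differences are at the level of rounding error, in agreement with part (a): only samples of \(P\) at the nodes enter the computation.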

As an immediate corollary, we note the following complement to the one-sided inequality (3.34), cf. Lemma 2.1 and (3.31).

Corollary 3.2

Let \(n\in \mathbb{N }\) and let \(\mu \) be an M–Z quadrature measure of order \(3n\). Then for \(P\in \mathbb{H }^q _n\) and \(1\le p<\infty \),

$$\begin{aligned} \Vert P\Vert _p\le c|\!|\!|\mu |\!|\!|_n^{1-1/p}\Vert P\Vert _{\mu ;p}\le c_1|\!|\!|\mu |\!|\!|_n\Vert P\Vert _p. \end{aligned}$$
(3.40)

Proof

We use (3.37) with \(P\) in place of \(f\) and \(\sigma _{2n}\) in place of \(\sigma _n\). Since \(\sigma _{2n}(\mu ;h,P)=P\), this leads to the first inequality in (3.40). The second inequality is a consequence of Theorem 3.7(a). \(\square \)

Next, we come to the definition and characterization of local smoothness classes. The local smoothness classes are defined exactly as in Definition 2.3, except that multivariate analogues of \(\mathbb T \), “arc”, and smoothness classes are used. Our objective is to state the analogues of Theorems 2.5 and 2.6. Since frames in the sense of \(L^2\) cannot be defined using values of the target function at countably many points, and the topic of tight frames does not add anything new to our discussion, we will not delve into the topic of tight frames.

We say a sequence of measures \( {\varvec{\mu }}= \left\{ \mu _n \right\} \) has nested support when

$$\begin{aligned} m < n \implies \mathsf{supp }(\mu _m) \subseteq \mathsf{supp }(\mu _n). \end{aligned}$$

In theoretical considerations for approximation based on values of the target function, we will usually assume that the data is available as a sequence \(\{\mathcal{C }_m\}\) of finite subsets of \(\mathbb T ^q\). By taking unions, we may assume without loss of generality that the sets are nested. The following proposition, proved in [49, Proposition 2.1], shows that the data reduction and construction of M–Z quadrature formulas can be done in a consistent manner.

Proposition 3.1

Let \(\{\mathcal{C }_m\}\) be a sequence of finite subsets of \(\mathbb T ^q\) with \(\delta (\mathcal{C }_m)\sim 1/m\), and let \(\mathcal{C }_m\subseteq \mathcal{C }_{m+1}\) for \(m\in \mathbb{N }\). Then there exists a sequence of subsets \(\{\tilde{\mathcal{C }}_m \subseteq \mathcal{C }_m\}\), where, for \(m\in \mathbb{N }\), we have \(\delta (\tilde{\mathcal{C }}_m)\sim 1/m\), \(\tilde{\mathcal{C }}_m\subseteq \tilde{\mathcal{C }}_{m+1}\), and \(\delta (\tilde{\mathcal{C }}_m)\le 2\eta (\tilde{\mathcal{C }}_m)\).

Definition 3.4

Let \(h :\mathbb{R }\rightarrow [0,\infty )\) be compactly supported and let \({\varvec{\mu }}=\{\mu _n\}_{n=0}^\infty \) be a sequence of Borel measures on \(\mathbb T ^q\) with nested support. Let \(f \in \cap _{n=0}^\infty L^1(\mu _n)\). Then, for each non-negative integer \(n\), the band-pass operator \(\tau _n({\varvec{\mu }};h,f)\) with respect to \({\varvec{\mu }}\) is given by

$$\begin{aligned}&\tau _0({\varvec{\mu }};h,f) \mathop {=}\limits ^{\mathrm{def}}\sigma _1(\mu _0;h,f) {\hbox { and }} \tau _n({\varvec{\mu }};h,f)\nonumber \\&\quad \mathop {=}\limits ^{\mathrm{def}}\sigma _{2^n}(\mu _n;h,f) - \sigma _{2^{n-1}}(\mu _{n-1};h,f), {\hbox { for }} n\in \mathbb{N }. \end{aligned}$$
(3.41)

To keep the notation consistent, we will generally omit the symbol \({\varvec{\mu }}\) if each \(\mu _n=\mu _q^*\). If we must mention this measure for emphasis, then we will write \(\tau _j(\mu _q^*;h,f)\) in this case.
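The telescoping structure behind Definition 3.4 can be seen directly on the Fourier side. The following sketch treats the case \(q=1\) with each \(\mu _n=\mu _q^*\), so that \(\sigma _n(h,f)\) simply multiplies \(\hat{f}(k)\) by \(h(|k|/n)\). The filter is the same hypothetical smooth cutoff assumed to equal 1 on \([0,1/2]\) and 0 on \([1,\infty )\), and the names `sigma_coeffs` and `tau_coeffs` are illustrative.

```python
import math

def low_pass_filter(t):
    """A smooth cutoff: 1 on [0, 1/2], 0 on [1, infinity), smooth in between."""
    t = abs(t)
    if t <= 0.5:
        return 1.0
    if t >= 1.0:
        return 0.0
    s = 2.0 * t - 1.0
    a = math.exp(-1.0 / (1.0 - s))
    b = math.exp(-1.0 / s)
    return a / (a + b)

def sigma_coeffs(n, coeffs):
    """Fourier coefficients of sigma_n(h, f), given those of f as a dict k -> value."""
    return {k: low_pass_filter(abs(k) / n) * c for k, c in coeffs.items()}

def tau_coeffs(j, coeffs):
    """Fourier coefficients of the band-pass operator tau_j(h, f) in (3.41)."""
    if j == 0:
        return sigma_coeffs(1, coeffs)
    high = sigma_coeffs(2 ** j, coeffs)
    low = sigma_coeffs(2 ** (j - 1), coeffs)
    return {k: high[k] - low[k] for k in coeffs}

# Coefficients of f(t) = 1 + cos(t) + 0.5 cos(5t).
coeffs = {0: 1.0, 1: 0.5, -1: 0.5, 5: 0.25, -5: 0.25}

# Summing tau_0, ..., tau_4 telescopes to sigma_16(h, f), which reproduces f here.
partial_sum = {k: sum(tau_coeffs(j, coeffs)[k] for j in range(5)) for k in coeffs}
print(partial_sum)
```

Each \(\tau _j\) only sees frequencies in a dyadic band, which is the frequency localization that the kernels \(\Psi _j\) exploit.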

The kernels \(\Psi _j\) are defined in the same way as (2.27), except for obvious multivariate substitutions.

The analogue of Theorem 2.5 is the following statement.

Theorem 3.9

Let \(1\le p\le \infty \), \(\gamma >0\), \(f\in X^p\), and let \(h\) be an infinitely differentiable low pass filter. Let \({\varvec{\nu }}=\{\nu _j\}_{j=0}^\infty \) be a sequence of measures such that each \(\nu _j\) is a quadrature measure of order \(2^{j+1}\). Further, let \({\varvec{\mu }}=\{\mu _j\}_{j=0}^\infty \) be a sequence of measures such that each \(\mu _j\) is an M–Z quadrature measure of order \(3 \times 2^{j-1}\).

  1. (a)

    We have

    $$\begin{aligned} f=\sum _{j=0}^\infty \tau _j(h,f)=\sum _{j=0}^\infty \ {\int _\mathbb{T ^q}}\tau _j(h,f,\mathbf{t })\Psi _j(h, \circ -\mathbf{t })\,d\nu _j(\mathbf{t })\nonumber \\ \end{aligned}$$

    with convergence in the sense of \(X^p\).

  2. (b)

    If \(f\in L^2\) then

    $$\begin{aligned} \Vert f\Vert _2^2 \le \sum _{j=0}^\infty \Vert \tau _j(h,f)\Vert _2^2 = \sum _{j=0}^\infty \ {\int _\mathbb{T ^q}}|\tau _j(h,f,\mathbf{t })|^2\,d\nu _j(\mathbf{t }) \le 4\Vert f\Vert _2^2. \end{aligned}$$
    (3.42)
  3. (c)

    If \(f\in X^\infty \), then

    $$\begin{aligned} f=\sum _{j=0}^\infty \tau _j({\varvec{\mu }};h,f)=\sum _{j=0}^\infty \ {\int _\mathbb{T ^q}}\tau _j(\mu _j;h,f,\mathbf{t })\Psi _j(h, \circ -\mathbf{t })\,d\nu _j(\mathbf{t }), \end{aligned}$$
    (3.43)

    where both the series converge uniformly.

The interest in the above theorem is clearly when \({\varvec{\nu }}\) is a sequence of discretely supported measures. The proofs of this theorem and of the following analogue of Theorem 2.6 are verbatim the same as those of their univariate analogues, and therefore we omit them.

Theorem 3.10

Let \(1\le p\le \infty \), \(\gamma >0\), \(\mathbf{x}_0\in \mathbb T ^q\), \(f\in X^p\), and let \(h\) be an infinitely differentiable low pass filter. Let \({\varvec{\mu }}\) and \({\varvec{\nu }}\) be sequences of measures as in Theorem 3.9. We assume further that \(\sup _{j\ge 0}|\!|\!|\mu _j|\!|\!|_{2^j} <\infty \) and \(\sup _{j\ge 0}|\!|\!|\nu _j|\!|\!|_{2^j} <\infty \). Then the following are equivalent.

  1. (a)

    \(f\in W_{\gamma ,p}(\mathbf{x}_0)\).

  2. (b)

    There exists an arc \(I\ni \mathbf{x}_0\) such that

    $$\begin{aligned} \Vert \tau _j(h,f)\Vert _{\nu _j;p,I} =\mathcal{O}(2^{-j\gamma }). \end{aligned}$$
    (3.44)
  3. (c)

    In the case when \(p=\infty \),

    $$\begin{aligned} \max _{\mathbf{t }\in \mathsf{supp }(\nu _j)\cap I} |\tau _j({\varvec{\mu }};h,f,\mathbf{t })| =\mathcal{O}(2^{-j\gamma }). \end{aligned}$$
    (3.45)

Remark

As before, the main interest is when \({\varvec{\nu }}\) is a sequence of discretely supported measures with nested supports. We note that the polynomials \(\tau _j(h,f)\) in the estimates (3.44) are computed using the Fourier coefficients of \(f\). The estimates (3.45) are more general in the sense that the sequence \({\varvec{\mu }}\) can be a sequence of discrete measures as well, in which case, the values of \(f\) at the points in the supports of these measures are used. Of course, such a generalization is possible only if \(p=\infty \), so that point evaluations are well defined. \(\square \)

4 Periodic basis function (PBF) networks

4.1 Trigonometric polynomials and PBFs

We recall that a PBF network is a function of the form \(\mathbf{x}\mapsto \sum _{k=1}^N a_kG(\mathbf{x}-\mathbf{x}_k)\), \(\mathbf{x}\in \mathbb T ^q\) where \(G\in X^\infty \) is called the activation function, \(N\) is the number of neurons, and \(\mathbf{x}_k\in \mathbb T ^q\) are called centers. Thus, a PBF network is a translation network with activation function defined on \(\mathbb T ^q\) and the centers in \(\mathbb T ^q\) as well.

Starting with [51], we discovered a close connection between approximation by trigonometric polynomials and that by periodic basis function networks. While all the research known to us prior to [51] gave the degree of approximation by RBF networks in terms of a scaling parameter, the results in [51] appear to be the first of their kind where the theory was developed in terms of the number of neurons. Later on, it was observed by Schaback and Wendland [73], and independently in [42]—in the context of the so-called Gaussian networks—that it is the minimal separation among the centers, cf. Definition 3.2, of the network which leads to a complete theory of direct and converse theorems for approximation in this context. Accordingly, we will formulate our theorems here in terms of the minimal separation among the centers.

In the sequel, let \(G\in X^\infty (\mathbb T ^q)\) be a fixed activation function satisfying the condition that \(\hat{G}(\mathbf{k})\not =0\) for any \(\mathbf{k}\in \mathbb{Z }^q\). For \(f\in L^1\), we may then define formally a “derivative” by the formula

$$\begin{aligned} \widehat{D_G(f)}(\mathbf{k})\mathop {=}\limits ^{\mathrm{def}}\frac{\hat{f}(\mathbf{k})}{\hat{G}(\mathbf{k})}, \quad \mathbf{k}\in \mathbb{Z }^q. \end{aligned}$$
(4.1)

In the case when \(D_G(f)\in L^2\), one says that \(f\) is in the native space of \(G\). However, we will not need this concept. For our purpose, it is enough to note that if \(P\in \mathbb{H }^q _n\) for some \(n\in \mathbb{N }\), then \(D_GP\in \mathbb{H }_n^q\) as well. In fact, we verify the following important observation:

Proposition 4.1

Let \(n\in \mathbb{N }\), \(P\in \mathbb{H }_n^q\). Then

$$\begin{aligned} P(\mathbf{x})={\int _\mathbb{T ^q}}G(\mathbf{x}-\mathbf{y})D_G(P,\mathbf{y})\,d\mu _q^*(\mathbf{y}), \quad \mathbf{x}\in \mathbb T ^q. \end{aligned}$$
(4.2)

The proof of this proposition is a simple comparison of Fourier coefficients of both sides of (4.2). The basic idea in our proofs of the direct theorems as well as in wavelet-like representations is to discretize the integral in (4.2). The resulting estimate is shown in Theorem 4.1 below. In the remainder of this section, we use the notation

$$\begin{aligned} \mathfrak{m }_n=\mathfrak{m }_n(G)=\min _{|\mathbf{k}|_2\le n}|\hat{G}(\mathbf{k})|, \end{aligned}$$
(4.3)

and

$$\begin{aligned} x_+=\max (x,0), \quad x\in \mathbb{R }. \end{aligned}$$

Theorem 4.1

Let \(1\le p\le \infty \), \(n\in \mathbb{N }\), \(N\in \mathbb{N }\), \(P\in \mathbb{H }_n^q\), and let \(\nu \) be an M–Z quadrature measure of order \(n+N\). Then

$$\begin{aligned} \left\| P\!-\!{\int _\mathbb{T ^q}}G(\circ \!-\!\mathbf{y})D_G(P,\mathbf{y})\,d\nu (\mathbf{y})\right\| _\infty \!\!\!\le \! c|\!|\!|\nu |\!|\!|_{n+N}\left( n^{\left( \frac{q}{p} - \frac{q}{2} \!\right) _+}\right) \frac{E_{N,\infty }(G)}{\mathfrak{m }_n}\Vert P\Vert _p.\qquad \quad \end{aligned}$$
(4.4)

We observe that in view of Theorem 3.5, an arbitrary finite subset with sufficiently high density content admits an M–Z quadrature measure of order \(n+N\) supported on this subset. In particular, such a measure \(\nu \) exists with \(|\mathsf{supp }(\nu )| \sim (n+N)^q\), \(|\!|\!|\nu |\!|\!|_{n+N}\le c\), and \(\eta (\mathsf{supp }(\nu ))\sim (n+N)^{-1}\). Using such a choice of measure for \(\nu \), the integral expression in (4.4) is a PBF network with \(\sim (n+N)^q\) neurons, the set of centers being \(\mathsf{supp }(\nu )\), and the minimal separation among its centers is \(\sim (n+N)^{-1}\).
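A small experiment illustrates how the discretization of (4.2) produces a PBF network. The sketch below takes \(q=1\), the Poisson kernel of Example 4.3 as the activation function (so that \(\hat{G}(k)=r^{|k|}\)), and the measure with equal masses at \(M\) equispaced points as a stand-in for \(\nu \). These choices, and the names `poisson_kernel`, `pbf_network`, and `max_error`, are assumptions made for illustration only.

```python
import math

def poisson_kernel(x, r):
    """Activation function G with Fourier coefficients r^|k| (q = 1)."""
    return (1.0 - r * r) / (1.0 + r * r - 2.0 * r * math.cos(x))

def pbf_network(x, centers, masses, dgp, r):
    """Discretized form of (4.2): sum_j masses_j * D_G P(y_j) * G(x - y_j)."""
    return sum(m * dgp(y) * poisson_kernel(x - y, r) for y, m in zip(centers, masses))

def max_error(M, r, P, dgp, n_test=200):
    """Maximum of |P - network| over a test grid, with M equispaced centers."""
    centers = [2 * math.pi * j / M for j in range(M)]
    masses = [1.0 / M] * M
    xs = [2 * math.pi * i / n_test for i in range(n_test)]
    return max(abs(P(x) - pbf_network(x, centers, masses, dgp, r)) for x in xs)

r = 0.5
P = lambda x: math.cos(2.0 * x)
# D_G P divides the coefficients of P at k = 2 and k = -2 by r^2, see (4.1).
dgp = lambda y: math.cos(2.0 * y) / r ** 2

print(max_error(8, r, P, dgp))    # a few centers: visible error
print(max_error(64, r, P, dgp))   # more centers: error near rounding level
```

The rapid decay of the error as the number of centers grows is consistent with the factor \(E_{N,\infty }(G)\) in (4.4), which is exponentially small for this activation function.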

We note a corollary of Theorem 4.1.

Corollary 4.1

Let \(1\le p\le \infty \), and let \(G\) satisfy \(\hat{G}(\mathbf{k})\not =0\) for any \(\mathbf{k}\in \mathbb{Z }^q\). Then the class of all translation networks with \(G\) as the activation function is dense in \(X^p\).

Proof

If \(f\in X^p\) and \(\epsilon >0\), we find \(n\in \mathbb{N }\) and \(P\in \mathbb{H }_n^q\) such that \(\Vert f-P\Vert _p<\epsilon /2\). For this \(n\), we may find \(N\) large enough so that the network \(\mathcal G \) as constructed in (4.4) satisfies \(\Vert P-\mathcal G \Vert _p <\epsilon /2\). Then \(\Vert f-\mathcal G \Vert _p <\epsilon \). \(\square \)

The estimate (4.4) is much more meaningful in the case when \(G\) is very smooth, that is, for every \(L>0\),

$$\begin{aligned} \lim _{|\mathbf{k}|_2\rightarrow \infty } |\mathbf{k}|_2^L|\hat{G}(\mathbf{k})|=0. \end{aligned}$$

We introduce the following definition.

Definition 4.1

Let \(A>0\). We will say that \(G\in \mathcal E _A\) if each of the following conditions is satisfied:

  1. 1.

    \(G \in X^\infty \),

  2. 2.

    for all \(\mathbf{k}\in \mathbb{Z }^q\), \(\hat{G}(\mathbf{k})\not =0\),

  3. 3.
    $$\begin{aligned} \limsup \limits _{n \rightarrow \infty } \left( \frac{E_{An,\infty }(G)}{\mathfrak{m }_n} \right) ^{1/n} < 1. \end{aligned}$$
    (4.5)

For \(G\in \mathcal E _A\), Theorem 4.1 (used with \(N=An\)) leads to the following corollary.

Corollary 4.2

Let \(A>0\) and \(G\in \mathcal E _A\). Let \(1\le p\le \infty \), \(n\in \mathbb{N }\), \(P\in \mathbb{H }_n^q\), and let \(\nu \) be an M–Z quadrature measure of order \((1+A)n\). Then there exists \(\rho =\rho (A, p)\) such that \(0<\rho <1\), and

$$\begin{aligned} \left\| P-{\int _\mathbb{T ^q}}G(\circ -\mathbf{y})D_G(P,\mathbf{y})\,d\nu (\mathbf{y})\right\| _\infty \le c|\!|\!|\nu |\!|\!|_{n}\rho ^n\Vert P\Vert _p. \end{aligned}$$
(4.6)

Many standard examples used in network expansions involve activation functions in \(\mathcal{E }_A\) for some \(A>0\), cf. [75] for the computation of Fourier transforms given in the examples below.

Example 4.1

Periodization of the Gaussian.

$$\begin{aligned} G(\mathbf{x}) = \sum _{\mathbf{k}\in \mathbb{Z }^q} \exp (-|\mathbf{x}-2\pi \mathbf{k}|_2^2 /2), \quad \hat{G} (\mathbf{k}) = (2 \pi )^{q/2} \exp (-|\mathbf{k}|_2^2 / 2). \end{aligned}$$

\(\square \)

Example 4.2

Periodization of the Hardy multiquadric (see Footnote 1).

$$\begin{aligned} G(\mathbf{x})= \sum _{\mathbf{k}\in \mathbb{Z }^q} (\alpha ^2+|\mathbf{x}-2\pi \mathbf{k}|_2^2)^{-1}\!, \quad \hat{G}(\mathbf{k}) = \frac{\pi ^{(q+1)/2}}{\Gamma \left( \frac{q+1}{2} \right) \alpha } \exp (-\alpha |\mathbf{k}|_2). \end{aligned}$$

\(\square \)

Example 4.3

Tensor product construction using the Poisson kernel. With \(0 < r < 1\),

$$\begin{aligned} G(\mathbf{x}) = \prod _{j=1}^q \frac{1-r^2}{1+r^2-2r\cos (x_j)}, \quad \hat{G}(\mathbf{k})=r^{|\mathbf{k}|_1}. \end{aligned}$$

\(\square \)
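For the Poisson kernel with \(q=1\), condition (4.5) can be examined through an elementary bound. The sketch below assumes that \(E_{N,\infty }(G)\) is bounded by the tail \(\sum _{|k|\ge N} r^{|k|}=2r^N/(1-r)\) and uses \(\mathfrak{m }_n=r^n\). It therefore computes only an upper bound for the quantity in (4.5), not its exact value. The function name `root_ratio_bound` is hypothetical.

```python
import math

def root_ratio_bound(r, A, n):
    """Upper bound for (E_{An,inf}(G) / m_n)^(1/n), Poisson kernel, q = 1.

    Uses E_{N,inf}(G) <= 2 r^N / (1 - r) with N = floor(A n) and m_n = r^n.
    Logarithms are used to avoid underflow for large n.
    """
    N = int(A * n)
    log_ratio = math.log(2.0 / (1.0 - r)) + (N - n) * math.log(r)
    return math.exp(log_ratio / n)

print(root_ratio_bound(0.5, 2.0, 200))  # close to r^(A-1) = 0.5, below 1
print(root_ratio_bound(0.5, 1.0, 200))  # slightly above 1: the bound is inconclusive
```

Under these assumptions the bound tends to \(r^{A-1}\) as \(n\rightarrow \infty \), which suggests that the Poisson kernel satisfies (4.5) for every \(A>1\).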

For the proof of Theorem 4.1, we recall the following Nikolskii inequalities, see, e.g., [77, Section 4.9.4, p. 231].

Lemma 4.1

Let \(1 \le p < r \le \infty \). Then for \(n\in \mathbb{N }\),

$$\begin{aligned} \Vert P\Vert _r \le c n^{\left( \frac{q}{p} - \frac{q}{r} \right) } \Vert P\Vert _p, \quad P\in \mathbb{H }_n^q. \end{aligned}$$
(4.7)

We are now ready to prove Theorem 4.1. In the remainder of this section, we fix an infinitely differentiable low pass filter \(h\).

Proof of Theorem 4.1

In this proof, let \(R=\sigma _N(h,G)\), so that \(R\in \mathbb{H }_N^q\), and

$$\begin{aligned} \Vert G-R\Vert _\infty \le cE_{N,\infty }(G). \end{aligned}$$
(4.8)

Since \(\nu \) is a quadrature measure of order \(n+N\) and \(D_G(P)\in \mathbb{H }_n^q\), we obtain

$$\begin{aligned} P(\mathbf{x})&= {\int _\mathbb{T ^q}}G(\mathbf{x}-\mathbf{y})D_G(P,\mathbf{y})\,d\mu _q^*(\mathbf{y})\\&= {\int _\mathbb{T ^q}}R(\mathbf{x}-\mathbf{y})D_G(P,\mathbf{y})\,d\mu _q^*(\mathbf{y}) \!+\!{\int _\mathbb{T ^q}}\left[ G(\mathbf{x}\!-\!\mathbf{y})-R(\mathbf{x}\!-\!\mathbf{y})\right] D_G(P,\mathbf{y})\,d\mu _q^*(\mathbf{y})\\&= {\int _\mathbb{T ^q}}R(\mathbf{x}-\mathbf{y})D_G(P,\mathbf{y})\,d\nu (\mathbf{y}) +{\int _\mathbb{T ^q}}\left[ G(\mathbf{x}-\mathbf{y})-R(\mathbf{x}-\mathbf{y})\right] D_G(P,\mathbf{y})\,d\mu _q^*(\mathbf{y})\\&= {\int _\mathbb{T ^q}}G(\mathbf{x}-\mathbf{y})D_G(P,\mathbf{y})\,d\nu (\mathbf{y}) +{\int _\mathbb{T ^q}}\left[ G(\mathbf{x}-\mathbf{y})-R(\mathbf{x}-\mathbf{y})\right] D_G(P,\mathbf{y})\,d\mu _q^*(\mathbf{y})\\&-{\int _\mathbb{T ^q}}\left[ G(\mathbf{x}-\mathbf{y})-R(\mathbf{x}-\mathbf{y})\right] D_G(P,\mathbf{y})\,d\nu (\mathbf{y}). \end{aligned}$$

Consequently,

$$\begin{aligned}&\left| P(\mathbf{x})-{\int _\mathbb{T ^q}}G(\mathbf{x}-\mathbf{y})D_G(P,\mathbf{y})\,d\nu (\mathbf{y})\right| \\&\quad \le {\int _\mathbb{T ^q}}\left| G(\mathbf{x}-\mathbf{y})-R(\mathbf{x}-\mathbf{y})\right| |D_G(P,\mathbf{y})|\,d\mu _q^*(\mathbf{y})\\&\qquad +{\int _\mathbb{T ^q}}\left| G(\mathbf{x}-\mathbf{y})-R(\mathbf{x}-\mathbf{y})\right| |D_G(P,\mathbf{y})|\,d|\nu |(\mathbf{y}). \end{aligned}$$

Since \(\nu \) is an M–Z quadrature measure of order \(n+N\), \(|\!|\!|\nu |\!|\!|_{n+N}\ge 1\), and we conclude that

$$\begin{aligned} \left\| P-{\int _\mathbb{T ^q}}G(\circ -\mathbf{y})D_G(P,\mathbf{y})\,d\nu (\mathbf{y})\right\| _\infty&\le cE_{N,\infty }(G)\left\{ \Vert D_G(P)\Vert _1+\Vert D_G(P)\Vert _{\nu ;1}\right\} \nonumber \\&\le c|\!|\!|\nu |\!|\!|_{n+N}E_{N,\infty }(G)\Vert D_G(P)\Vert _1. \end{aligned}$$
(4.9)

Using Nikolskii inequalities (4.7) and recalling the definition (4.1) of \(\widehat{D_G(P)}(\mathbf{k})\), we deduce that

$$\begin{aligned}&\Vert D_G(P)\Vert _1^2\le \Vert D_G(P)\Vert _2^2 =\sum _{\mathbf{k}: |\mathbf{k}|_2<n}\left| \frac{\hat{P}(\mathbf{k})}{\hat{G}(\mathbf{k})}\right| ^2 \le \mathfrak m _n^{-2}\sum _{\mathbf{k}: |\mathbf{k}|_2<n}|\hat{P}(\mathbf{k})|^2 =\mathfrak m _n^{-2}\Vert P\Vert _2^2\\&\quad \le c\left( n^{2\left( \frac{q}{p} - \frac{q}{2} \right) _+}\right) \mathfrak m _n^{-2} \Vert P\Vert _p^2. \end{aligned}$$

The estimate (4.4) follows by substituting the above estimate into (4.9). \(\square \)

While Theorem 4.1 shows that polynomials can be approximated well by PBF networks, the converse is also true. To describe this, we introduce some notation.

Let \(\{\mathbf{y}_j\}_{j=1}^M\subset \mathbb T ^q\) and let \(n\in \mathbb{N }\) be an integer with

$$\begin{aligned} \min _{j\not =k}|\mathbf{y}_j-\mathbf{y}_k|\ge 1/n. \end{aligned}$$
(4.10)

We note that this implies \(M\le cn^q\). In the sequel, we will assume tacitly that \(\{\mathbf{y}_j\}_{j=1}^M\) is one of the members of a sequence of finite subsets of \(\mathbb T ^q\). Thus \(M\) and \(n\) are treated as variables, and the constants appearing below are independent of both.
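The separation condition (4.10) is simple to evaluate for a given set of centers. The sketch below computes the minimal separation on the torus and the smallest integer \(n\) for which (4.10) holds. Measuring distance by the Euclidean norm of the coordinatewise wrapped differences is an assumption, since the norm in (4.10) is not specified here, and the function names are hypothetical.

```python
import itertools
import math

def wrapped_difference(a, b):
    """Distance between two angles on the circle of length 2*pi."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def minimal_separation(points):
    """Smallest pairwise distance among points of the q-dimensional torus."""
    return min(
        math.sqrt(sum(wrapped_difference(a, b) ** 2 for a, b in zip(y, z)))
        for y, z in itertools.combinations(points, 2)
    )

def smallest_admissible_n(points):
    """Smallest integer n with minimal separation >= 1/n, as required in (4.10)."""
    return math.ceil(1.0 / minimal_separation(points))

points = [(0.0, 0.0), (0.1, 0.0), (6.2, 0.0)]
print(minimal_separation(points))     # the pair (0, 0), (6.2, 0) is closest after wrapping
print(smallest_admissible_n(points))
```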

Theorem 4.2

Let \(\{\mathbf{y}_j\}_{j=1}^M\subset \mathbb T ^q\), and let \(n\in \mathbb{N }\) be an integer satisfying (4.10). Let \(1\le p\le \infty \), \(r\ge 0\), \(\mathbf a \in \mathbb{R }^M\), and let

$$\begin{aligned} \mathcal G (\mathbf{x})=\sum _{j=1}^M a_jG(\mathbf{x}-\mathbf{y}_j), \qquad \mathbf{x}\in \mathbb T ^q. \end{aligned}$$

Then for \(N\in \mathbb{N }\), we have

$$\begin{aligned} \Vert {(-\Delta )^{r/2}}\mathcal{G }\!-\!{(-\Delta )^{r/2}}\sigma _N(h,\mathcal{G })\Vert _\infty \!\le \! c_1\left( n^{\left( \frac{q}{p}\!-\!\frac{q}{2}\right) _+}\right) \frac{E_{N,\infty }({(-\Delta )^{r/2}}G)}{\mathfrak{m }_{cn}({(-\Delta )^{r/2}}G)}\Vert {(-\Delta )^{r/2}}\mathcal{G }\Vert _p.\nonumber \\ \end{aligned}$$
(4.11)

In particular, if \(G\in \mathcal E _A\) for some \(A>0\), then there exists a \(\rho =\rho (A,p,r, G)\in (0,1)\) such that

$$\begin{aligned} \Vert {(-\Delta )^{r/2}}\mathcal G -{(-\Delta )^{r/2}}\sigma _{cAn}(h,\mathcal G )\Vert _\infty \le c_1\rho ^n\Vert {(-\Delta )^{r/2}}\mathcal G \Vert _p. \end{aligned}$$
(4.12)

Before proceeding to the proof, we note the following corollary.

Corollary 4.3

Let \(A>0\) and \(G\in \mathcal{E }_A\). With notation as in Theorem 4.2, we have for \(n\in \mathbb{N }\),

$$\begin{aligned} \Vert {(-\Delta )^{r/2}}\mathcal G \Vert _p\le cn^r\Vert \mathcal G \Vert _p, \end{aligned}$$
(4.13)

where \(c\) is a positive constant, independent of the choice of the coefficients or the nodes \(\{\mathbf{y}_j\}\), as long as (4.10) holds.

Proof

If \(n\) is large enough, (4.12) yields

$$\begin{aligned} \Vert {(-\Delta )^{r/2}}\mathcal G -{(-\Delta )^{r/2}}\sigma _{cAn}(h,\mathcal G )\Vert _p\le (1/2)\Vert {(-\Delta )^{r/2}}\mathcal G \Vert _p, \end{aligned}$$
(4.14)

and since we may also use the same estimate with \(r=0\),

$$\begin{aligned} \Vert \mathcal{G }-\sigma _{cAn}(h,\mathcal{G })\Vert _p\le (1/2)\Vert \mathcal G \Vert _p. \end{aligned}$$
(4.15)

This leads to

$$\begin{aligned} \Vert {(-\Delta )^{r/2}}\mathcal{G }\Vert _p\le 2\Vert {(-\Delta )^{r/2}}\sigma _{cAn}(h,\mathcal{G })\Vert _p, \quad \Vert \sigma _{cAn}(h,\mathcal{G })\Vert _p\le (3/2)\Vert \mathcal{G }\Vert _p.\qquad \quad \end{aligned}$$
(4.16)

Then using the Bernstein-type inequality in Theorem 3.2(b), we obtain

$$\begin{aligned} \Vert {(-\Delta )^{r/2}}\mathcal{G }\Vert _p\le 2\Vert {(-\Delta )^{r/2}}\sigma _{cAn}(h,\mathcal{G })\Vert _p\le c_1n^r \Vert \sigma _{cAn}(h,\mathcal{G })\Vert _p\le c_2n^r\Vert \mathcal{G }\Vert _p. \end{aligned}$$

If \(n\) is not large enough, then (4.13) is obtained by adjusting the constant factor. \(\square \)

The proof of Theorem 4.2 is yet another interesting application of the localization estimate (3.2). Using this estimate in a very critical manner, we proved the following in [5, Theorem 6.2].

Proposition 4.2

Let \(\{\mathbf{y}_j\}_{j=1}^M\subset \mathbb T ^q\) and let \(n\in \mathbb{N }\) be an integer satisfying (4.10). Let \(1\le p\le \infty \) and \(\mathbf a \in \mathbb{R }^M\). Then we have

$$\begin{aligned} c_2n^{q/p^{\prime }}|\mathbf{a}|_p\le \left\| \sum _{j=1}^M a_j\Phi _m(h,\circ -\mathbf{y}_j)\right\| _p\le c_3n^{q/p^{\prime }}|\mathbf{a}|_p, \end{aligned}$$
(4.17)

for \(m\ge c_1n\).

Proof of Theorem 4.2

We observe first that it is enough to prove the theorem for \(r=0\). The general case follows by applying the result with the activation function \({(-\Delta )^{r/2}}G\) in place of \(G\). Let \(\mathbf{x}\in \mathbb T ^q\). We note that

$$\begin{aligned} \sigma _N(h,\mathcal G , \mathbf{x})=\sum _{j=1}^M a_j\sigma _N(h,G,\mathbf{x}-\mathbf{y}_j), \end{aligned}$$

and \(M\le c n^q\), so that

$$\begin{aligned} \Vert \mathcal{G }-\sigma _N(h,\mathcal G )\Vert _\infty \le cE_{N,\infty }(G)|\mathbf{a }|_1 \le cn^{q/2}E_{N,\infty }(G)|\mathbf{a }|_2. \end{aligned}$$
(4.18)

To estimate \(|\mathbf{a }|_2\), we fix \(m=C n\) for a sufficiently large constant \(C\) so as to satisfy the condition in Proposition 4.2. Let \(\alpha \) be defined by

$$\begin{aligned} \alpha \mathop {=}\limits ^{\mathrm{def}}\left( \frac{q}{p}-\frac{q}{2}\right) _+. \end{aligned}$$

Then we can deduce with some calculation and with (4.17) applied with \(p=2\) that

$$\begin{aligned} \Vert \mathcal{G }\Vert _p^2&\ge c\Vert \sigma _m(h,\mathcal{G })\Vert _p^2\ge cn^{-2\alpha }\Vert \sigma _m(h,\mathcal{G })\Vert _2^2\\&= cn^{-2\alpha }\sum _{j,\ell =1}^M a_j\overline{a_\ell }\sum _{\mathbf{k}\in \mathbb{Z }^q}h\left( \frac{|\mathbf{k}|_2}{m}\right) ^2|\hat{G}(\mathbf{k})|^2\exp (-i\mathbf{k}\cdot (\mathbf{y}_j-\mathbf{y}_\ell ))\\&= cn^{-2\alpha }\sum _{\mathbf{k}\in \mathbb{Z }^q}h\left( \frac{|\mathbf{k}|_2}{Cn}\right) ^2|\hat{G}(\mathbf{k})|^2\left| \sum _{j=1}^M a_j\exp (-i\mathbf{k}\cdot \mathbf{y}_j)\right| ^2\\&\ge cn^{-2\alpha }\mathfrak{m }_{Cn}(G)^{2}\sum _{\mathbf{k}\in \mathbb{Z }^q}h\left( \frac{|\mathbf{k}|_2}{m}\right) ^2\left| \sum _{j=1}^M a_j\exp (-i\mathbf{k}\cdot \mathbf{y}_j)\right| ^2\\&= cn^{-2\alpha }\mathfrak{m }_{Cn}(G)^{2}\left\| \sum _{j=1}^M a_j\Phi _m(h,\circ -\mathbf{y}_j)\right\| _2^2\\&\ge cn^{-2\alpha +q}\mathfrak{m }_{Cn}(G)^{2}|\mathbf a |_2^2. \end{aligned}$$

Thus,

$$\begin{aligned} n^{q/2}|\mathbf{a }|_2\le cn^\alpha \mathfrak{m }_{Cn}(G)^{-1}\Vert \mathcal{G }\Vert _p. \end{aligned}$$

Finally, the estimate (4.11) is obtained by substituting the upper bound on \(|\mathbf a |_2\) from this estimate into (4.18). \(\square \)

4.2 Direct and equivalence theorems

In this and the next subsection, we fix \(G\in \mathcal{E }_A\). We will deal with two different sequences of measures: the sequence \({{\varvec{\nu }}}\) will be a sequence of discrete measures, whose supports will give us the centers of the networks, and the sequence \({{\varvec{\mu }}}\) will be the sequence of measures so that the information on the target function is given in terms of integrals with respect to the members of this sequence. It is a common practice in the literature on RBF networks to choose the centers the same as the points at which the target function is sampled, but we wish to emphasize that this is not necessary. To avoid confusion, we make the following definition.

Definition 4.2

Let \(A>0\), and \({\varvec{\nu }}=\{\nu _j\}_{j=0}^\infty \) be a sequence of Borel measures. We say that \({\varvec{\nu }}\in \mathfrak{M }_A\) if each of the following conditions is satisfied. Here, the constants may depend upon \(A\) and the sequence \({{\varvec{\nu }}}\), but not on \(j\) or the individual measures \(\nu _j\).

  1. 1.

    Each \(\nu _j\) is a discrete measure, and the support \(\mathcal{C }_j=\mathsf{supp }(\nu _j)\) satisfies

    $$\begin{aligned} \delta (\mathcal{C }_j) \sim \eta (\mathcal{C }_j)\sim 2^{-j}. \end{aligned}$$
  2. 2.

    Each \(\nu _j\) is an M–Z quadrature measure of order \((1+A)2^j\), and \(|\!|\!|\nu _j|\!|\!|_{(1+A)2^j}\le c\).

  3. 3.

    If \(j>m\) then \(\mathcal{C }_m\subseteq \mathcal{C }_j\).

We point out an important example of measures in \(\mathfrak{M }_A\). Let

$$\begin{aligned} \textit{ZD}^q_j=\{\mathbf{k}\in \mathbb{Z }^q : -2^j+1\le k_\ell \le 2^j,\ \ell =1, 2,\ldots ,q\}, \quad j=0,1,2,\ldots , \end{aligned}$$

and

$$\begin{aligned} \mathbf{v }_{\mathbf{k},j}= \mathbf{k}\pi /2^j, \quad \mathbf{k}\in \textit{ZD}^q_j, \ j=0,1,2,\ldots . \end{aligned}$$
(4.19)

It can be verified using the corresponding univariate result repeatedly that the following quadrature formula holds.

$$\begin{aligned} \frac{1}{(2\pi )^q}{\int _\mathbb{T ^q}}P(\mathbf{t })\,d\mathbf{t }=\frac{1}{2^{q(j+1)}}\sum _{\mathbf{k}\in ZD^q_j} P(\mathbf{v }_{\mathbf{k},j}), \quad P\in \mathbb{H }_{2^{j+1}}^q. \end{aligned}$$
(4.20)

Let \(m\) be the integer part of \(\log _2(1+A)\), and for \(j=0,1,\ldots \), let \(\nu _j^*\) be the measure that associates the mass \(2^{-q(j+m+1)}\) with each \(\mathbf v _{\mathbf{k},j+m}\), \(\mathbf{k}\in \textit{ZD}^q_{j+m}\). Then the sequence \({\varvec{\nu }}^*=\{\nu _j^*\}_{j=0}^\infty \in \mathfrak M _A\). This fact can be checked using Lemma 2.1 for each coordinate.
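As an informal numerical check of (4.20), not part of the original argument, the following Python sketch applies the right-hand side of (4.20) to the exponentials \(\exp (i\mathbf{k}\cdot \mathbf{t })\) and compares the result with the exact normalized integral (1 if \(\mathbf{k}=0\), and 0 otherwise). It assumes that \(\mathbb{H }_n^q\) is spanned by exponentials whose frequencies satisfy \(|k_\ell |<n\) in every coordinate (a Euclidean-norm convention would give a subset of these). All function names are hypothetical. The last helper checks the nestedness of the univariate dyadic nodes, in the spirit of condition 3 of Definition 4.2.

```python
import cmath
import math
from fractions import Fraction
from itertools import product


def zd(j):
    """Index range -2^j+1, ..., 2^j for one coordinate of ZD^q_j."""
    return range(-2**j + 1, 2**j + 1)


def quadrature_of_exponential(freq, j):
    """Right-hand side of (4.20) applied to P(t) = exp(i freq . t)."""
    q = len(freq)
    total = 0j
    for k in product(zd(j), repeat=q):
        # Node v_{k,j} = k * pi / 2^j, as in (4.19).
        phase = sum(f * kk for f, kk in zip(freq, k)) * math.pi / 2**j
        total += cmath.exp(1j * phase)
    return total / 2 ** (q * (j + 1))


def max_quadrature_error(j, q=2):
    """Largest error over all frequencies with |k_l| < 2^(j+1) (assumed class)."""
    n = 2 ** (j + 1)
    worst = 0.0
    for freq in product(range(-n + 1, n), repeat=q):
        exact = 1.0 if all(f == 0 for f in freq) else 0.0
        worst = max(worst, abs(quadrature_of_exponential(freq, j) - exact))
    return worst


def nodes_in_units_of_pi(j):
    """Univariate nodes k / 2^j as exact rationals (in units of pi)."""
    return {Fraction(k, 2**j) for k in zd(j)}


if __name__ == "__main__":
    print(max_quadrature_error(2))                    # rounding-level error
    print(quadrature_of_exponential((8, 0), 2))       # aliased frequency
    print(nodes_in_units_of_pi(1) <= nodes_in_units_of_pi(3))
```

In this sketch the frequency \((2^{j+1},0)\) is the first one that the grid cannot distinguish from the constant function, which is consistent with the degree restriction in (4.20).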

Let \({\varvec{\nu }}\in \mathfrak M _A\). We define

$$\begin{aligned} \mathbb{G }_j=\mathbb{G }_j({\varvec{\nu }}) \mathop {=}\limits ^{\mathrm{def}}\mathsf{span }\{G(\circ -\mathbf{y}) : \mathbf{y}\in \mathsf{supp }(\nu _j)\}, \quad j\in \mathbb{N }, \quad \mathbb{G }_0=\{0\}, \end{aligned}$$
(4.21)

and observe that, since \(\{ \mathsf{supp }(\nu _j) \}\) is a nested sequence, \(\mathbb{G }_j\) is a nested sequence of linear subspaces of \(X^\infty \) and that the closure of their union in the sense of \(L^p\) is \(X^p\). For \(f\in L^p\), we write

$$\begin{aligned} \text{ dist } (L^p,f,\mathbb{G }_j)=\inf \{\Vert f-\mathcal G \Vert _p : \mathcal G \in \mathbb{G }_j\}. \end{aligned}$$

In addition to the centers, we will assume in the remainder of this and the next subsection that \({\varvec{\mu }}=\{\mu _j\}_{j=0}^\infty \) is a sequence of measures such that each \(\mu _j\) is an M–Z quadrature measure of order \(3\times 2^{j-1}\). The sequence \({\varvec{\mu }}^*\) in which each \(\mu _j=\mu _q^*\) is one such sequence. Our interest is both in this sequence and the case when the \(\mu _j\)’s are discretely supported.

The direct theorem can be stated as follows.

Theorem 4.3

Let \(A>0\), \(G\in \mathcal E _A\), \({\varvec{\nu }}\in \mathfrak M _A\). Then there exists \(\rho \in (0,1)\) such that for \(1\le p\le \infty \) and \(f\in L^p\) we have

$$\begin{aligned}&\text{ dist } (L^p,f,\mathbb{G }_j({\varvec{\nu }}))\le \left\| f-{\int _\mathbb{T ^q}}G(\circ -\mathbf{y})D_G\left( \sigma _{2^j}(\mu _q^*;h,f),\mathbf{y}\right) \,d\nu _j(\mathbf{y})\right\| _p\nonumber \\&\quad \le c\{E_{2^j,p}(f) +\rho ^{2^j}\Vert f\Vert _p\}. \end{aligned}$$
(4.22)

Consequently, if \(\gamma >0\) and \(f\in W_{\gamma ,p}\), then

$$\begin{aligned} \text{ dist } (L^p, f, \mathbb{G }_j({\varvec{\nu }})) =\mathcal{O}(2^{-j\gamma }). \end{aligned}$$
(4.23)

In the case when \(p=\infty \), (4.22) remains valid provided that \(\sigma _{2^j}(\mu _q^*;h,f)\) is replaced by \(\sigma _{2^j}(\mu _j;h,f)\), where \({\varvec{\mu }}\) is a sequence as described earlier.

We note the following variations of (4.22).

Corollary 4.4

With the set up as in Theorem 4.3,

$$\begin{aligned} \left\| f-{\int _\mathbb{T ^q}}G(\circ \!-\!\mathbf{y})D_G\left( \sigma _{2^j}(\mu _q^*;h,f),\mathbf{y}\right) \,d\nu _{j}^*(\mathbf{y})\right\| _p\!\le \! c\{E_{2^j,p}(f) \!+\!\rho ^{2^j}\Vert f\Vert _p\}.\nonumber \\ \end{aligned}$$
(4.24)

In the case when \(p=\infty \) and \({\varvec{\mu }}\in \mathfrak M _{\max (A,4)}\), we also have

$$\begin{aligned} \left\| f\!-\!{\int _\mathbb{T ^q}}G(\circ \!-\!\mathbf{y})D_G\left( \sigma _{2^j}(\mu _j;h,f),\mathbf{y}\right) \,d\mu _j(\mathbf{y})\right\| _\infty \!\le \! c\{E_{2^j,\infty }(f) \!+\!\rho ^{2^j}\Vert f\Vert _\infty \}.\nonumber \\ \end{aligned}$$
(4.25)

We observe that the integral expression in (4.24) is a PBF network with centers at the dyadic points (4.19). This is in keeping with traditional results on PBF approximation where scaled integer translates are considered. However, unlike in the classical setting, we do not require any a priori conditions such as the Strang–Fix conditions to ensure any polynomial reproduction. In particular, constants are not included in the network. In the case when \(\mu _j\) is used in place of \(\mu _q^*\), the information on \(f\) which is used in the construction of the network is the values of \(f\) at points in \(\mathsf{supp }(\mu _j)\), but we have different choices for the centers of the resulting network. A popular choice is to use points from the same sequence of data sets as centers as in (4.25). However, we may also use dyadic centers instead as in (4.24), potentially leading to faster computations.

All the constructions are linear operators on the information available on the target function. Finally, the bounds (4.22), (4.24), and (4.25), and also their variants where \(\sigma _{2^j}(\mu _q^*;h,f)\) is replaced by \(\sigma _{2^j}(\mu _j;h,f)\), show that the PBF networks constructed there are bounded operators on the spaces involved. Therefore, the constructions are stable.
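The exponential term \(\rho ^{2^j}\) in these bounds can be illustrated in a univariate toy case. The Python sketch below is only an illustration under stated assumptions: it takes as a hypothetical activation function the Poisson kernel, whose Fourier coefficients are \(r^{|k|}\) with \(r=1/2\) (membership in \(\mathcal{E }_A\) is not verified here), and it assumes that \(D_G\) acts on a trigonometric polynomial by dividing its \(k\)-th Fourier coefficient by \(\hat{G}(k)\). The network uses \(M\) equally spaced centers with equal weights \(1/M\); all names are hypothetical.

```python
import cmath
import math

R = 0.5  # hypothetical decay rate: G-hat(k) = R**|k|


def G(x):
    """Poisson kernel sum_k R^{|k|} exp(ikx), written in closed form."""
    return (1 - R * R) / (1 - 2 * R * math.cos(x) + R * R)


# Fourier coefficients of the sample polynomial P(x) = cos(x) + 0.25 cos(3x).
P_HAT = {1: 0.5, -1: 0.5, 3: 0.125, -3: 0.125}


def P(x):
    return sum(c * cmath.exp(1j * k * x) for k, c in P_HAT.items()).real


def DG_P(y):
    """Assumed form of D_G(P): divide the k-th coefficient by G-hat(k)."""
    return sum(c / R ** abs(k) * cmath.exp(1j * k * y)
               for k, c in P_HAT.items()).real


def network(x, M):
    """Translation network with M equally spaced centers, weights 1/M."""
    centers = [2 * math.pi * m / M for m in range(M)]
    return sum(G(x - y) * DG_P(y) for y in centers) / M


def max_error(M, samples=200):
    """Maximum of |P - network| over an equally spaced evaluation grid."""
    xs = [2 * math.pi * s / samples for s in range(samples)]
    return max(abs(P(x) - network(x, M)) for x in xs)


if __name__ == "__main__":
    for M in (8, 16, 32):
        print(M, max_error(M))
```

In this toy setting the discretization error comes only from aliased frequencies \(k+sM\), \(s\ne 0\), each damped by \(r^{|k+sM|-|k|}\), so doubling the number of centers roughly squares the error.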

Proof of Theorem 4.3

We need only to sketch the proof of (4.22); the other statements are immediate consequences. To prove (4.22), we use (4.6) with \(\sigma _{2^j}(\mu _q^*;h,f)\) (or with \(\sigma _{2^j}(\mu _j;h,f)\) respectively, when appropriate), recall that \(|\!|\!|\nu _j|\!|\!|_{2^j}\sim |\!|\!|\nu _j|\!|\!|_{(1+A)2^j} \le c\), and then use the resulting estimate together with the triangle inequality and (3.4) (or with (3.38) respectively) to arrive at (4.22). \(\square \)

Converse theorems for approximation by elements of \(\mathbb{G }_j\) can be obtained using Corollary 4.3, cf. [12, Theorem 9.1, also Chapter 6.7]. We note only the analogue of the equivalence theorem without proof.

Theorem 4.4

Let \(1\le p\le \infty \), \(\gamma >0\), \(A>0\), \(G\in \mathcal E _A\) and \({\varvec{\nu }}\in \mathfrak M _A\). Then the following are equivalent.

  1. (a)

    \(f\in W_{\gamma ,p}\).

  2. (b)

    \(\text{ dist } (L^p,f,\mathbb{G }_j({\varvec{\nu }})) =\mathcal{O}(2^{-j\gamma })\).

  3. (c)

    We have

    $$\begin{aligned} \left\| f-{\int _\mathbb{T ^q}}G(\circ -\mathbf{y})D_G\left( \sigma _{2^j}(\mu _q^*;h,f),\mathbf{y}\right) \,d\nu _j^*(\mathbf{y})\right\| _p=\mathcal{O}(2^{-j\gamma }). \end{aligned}$$
    (4.26)
  4. (d)

    If \(p=\infty \), then each of the above statements is also equivalent to

    $$\begin{aligned} \left\| f-{\int _\mathbb{T ^q}}G(\circ -\mathbf{y})D_G\left( \sigma _{2^j}(\mu _j;h,f),\mathbf{y}\right) \,d\nu _j(\mathbf{y})\right\| _\infty =\mathcal{O}(2^{-j\gamma }). \end{aligned}$$

We remark that the statement (a) in the above theorem is independent of the sequence \({\varvec{\nu }}\). Therefore, part (a) above implies part (b) for every \({\varvec{\nu }}\in \mathfrak M _A\). On the other hand, if part (b) holds for some \({\varvec{\nu }}\in \mathfrak M _A\), then part (a) holds. In particular, when approximating a function from \(W_{\gamma ,p}\), the asymptotic degree of approximation by PBF networks does not depend upon the choice of centers and weights (encoded in the measures \({\varvec{\nu }}\)) as long as \({\varvec{\nu }}\in \mathfrak{M }_A\).

4.3 Wavelet-like representation using PBFs

The wavelet-like representations using PBF networks with activation function in \(\mathcal{E }_A\) for some \(A>0\) are very similar to those for multivariate trigonometric polynomials. This can be shown by using Theorem 4.1. We summarize them below. We will sketch the proofs only, except for the proof of the frame property which requires some new ideas.

In the sequel, let \(A>0\) and let \(G\in \mathcal E _A\) be fixed. Also, let \(h\) be a fixed, infinitely differentiable low pass filter. We will also make the same assumptions as before on the sequences of measures \({\varvec{\nu }}=\{\nu _j\}_{j=0}^\infty \) and \({\varvec{\mu }}=\{\mu _j\}_{j=0}^\infty \); namely, \({\varvec{\nu }}\in \mathfrak M _A\) and each \(\mu _j\) is an M–Z quadrature measure of order \(3\times 2^{j-1}\).

First, we consider the operators

$$\begin{aligned} \mathcal{S }_j^*(f,\mathbf{x}) \mathop {=}\limits ^{\mathrm{def}}{\int _\mathbb{T ^q}}G(\mathbf{x}-\mathbf{y})D_G\left( \sigma _{2^j}(\mu _q^*;h,f),\mathbf{y}\right) \,d\nu _j(\mathbf{y}), \quad j=0,1,\ldots ,\ f\in L^p, \ p\in [1,\infty ],\nonumber \\ \end{aligned}$$
(4.27)

and the corresponding frame operators

$$\begin{aligned} \mathcal{T }_j^*(f,\mathbf{x})=\left\{ \begin{array}{ll} \mathcal{S }_0^*(f,\mathbf{x}), &{}\quad {\hbox {if}}\ j=0,\\ \mathcal{S }_j^*(f,\mathbf{x})-\mathcal{S }_{j-1}^*(f,\mathbf{x}), &{}\quad {\hbox {if}}\ j\in \mathbb{N }. \end{array}\right. \end{aligned}$$
(4.28)

The variants when the information about \(f\) is in terms of the sequence \({\varvec{\mu }}\) are given by

$$\begin{aligned} \mathcal{S }_j(f,\mathbf{x}) \mathop {=}\limits ^{\mathrm{def}}{\int _\mathbb{T ^q}}G(\mathbf{x}\!-\!\mathbf{y})D_G\left( \sigma _{2^j}(\mu _j;h,f),\mathbf{y}\right) \,d\nu _j(\mathbf{y}), \quad j=0,1,\ldots ,\ f\in X^\infty ,\nonumber \\ \end{aligned}$$
(4.29)

and

$$\begin{aligned} \mathcal{T }_j(f,\mathbf{x})=\left\{ \begin{array}{ll} \mathcal{S }_0(f,\mathbf{x}), &{}\quad {\hbox {if}}\ j=0,\\ \mathcal{S }_j(f,\mathbf{x})-\mathcal{S }_{j-1}(f,\mathbf{x}), &{}\quad {\hbox {if}}\ j\in \mathbb{N }. \end{array}\right. \end{aligned}$$
(4.30)

Theorem 4.1 leads immediately to the following estimates for these operators.

Proposition 4.3

There exists \(\rho \in (0,1)\) such that, if \(1\le p\le \infty \) and \(f\in X^p\), then we have

$$\begin{aligned} \Vert \mathcal S _j^*(f)-\sigma _{2^j}(\mu _q^*;h,f)\Vert _p \le c\rho ^{2^j}\Vert \sigma _{2^j}(\mu _q^*;h,f)\Vert _p \le c\rho ^{2^j}\Vert f\Vert _p. \end{aligned}$$
(4.31)

For any measure \(\tilde{\nu }\) on \(\mathbb T ^q\),

$$\begin{aligned} \Vert \mathcal T _j^*(f)-\tau _j(\mu _q^*;h,f)\Vert _{\tilde{\nu };p} \le c\rho ^{2^j}\Vert f\Vert _p. \end{aligned}$$
(4.32)

Analogous statements hold if \(p=\infty \) and \(\mathcal S _j^*(f)\) (respectively, \(\mathcal T _j^*(f)\), \(\sigma _{2^j}(\mu _q^*;h,f)\), \(\tau _j(\mu _q^*;h,f))\) is replaced by \(\mathcal S _j(f)\) (respectively, \(\mathcal T _j(f)\), \(\sigma _{2^j}(\mu _j;h,f)\), \(\tau _j({\varvec{\mu }};h,f))\).

The following two theorems, analogous to Theorems 4.5 and 3.10, are immediate consequences of these theorems and the above proposition, except for part (b) of Theorem 4.5 below.

Theorem 4.5

Let \(1\le p\le \infty \), \(\gamma >0\), \(f\in X^p\), and let \(G\) and the measures \({{\varvec{\nu }}}\), \({{\varvec{\mu }}}\) be as described earlier.

  1. (a)

    We have

    $$\begin{aligned} f=\sum _{j=0}^\infty \mathcal T _j^*(f) \end{aligned}$$
    (4.33)

    with convergence in the sense of \(X^p\).

  2. (b)

    If \(f\in L^2\) then

    $$\begin{aligned} \Vert f\Vert _2^2 \sim \sum _{j=0}^\infty \Vert \mathcal{T }_j^*(f)\Vert _2^2, \end{aligned}$$
    (4.34)

    where the constants are independent of \(f\), but may depend upon \(G\).

  3. (c)

    If \(f\in X^\infty \), then we have the uniformly convergent expansion

    $$\begin{aligned} f=\sum _{j=0}^\infty \mathcal T _j(f) \end{aligned}$$
    (4.35)

The analogue of Theorem 3.10 is the following:

Theorem 4.6

Let \(1\le p\le \infty \), \(\gamma >0\), \(\mathbf{x}_0\in \mathbb T ^q\), \(f\in X^p\), and let \(G\) and the measures \({\varvec{\nu }}\), \({\varvec{\mu }}\) be as described earlier. Then the following are equivalent.

  1. (a)

    \(f\in W_{\gamma ,p}(\mathbf{x}_0)\).

  2. (b)

    There exists an arc \(I\ni \mathbf{x}_0\) such that

    $$\begin{aligned} \Vert \mathcal{T }_j^*(f)\Vert _{\nu _j;p, I}=\mathcal{O}(2^{-j\gamma }). \end{aligned}$$
    (4.36)
  3. (c)

    In the case when \(p=\infty \), one may replace (4.36) by

    $$\begin{aligned} \max _{\mathbf{t }\in \mathsf{supp }(\nu _j)\cap I}|\mathcal T _j(f,\mathbf{t })| =\mathcal{O}(2^{-j\gamma }). \end{aligned}$$

Part (b) of Theorem 4.5 is not immediately obvious. In order to prove this part, we first state a lemma.

Lemma 4.2

Let \(f\in L^2\), and let \(\rho \) be as in Proposition 4.3. Then for \(N\in \mathbb{N }\) and \(j\in \mathbb{N }\),

$$\begin{aligned} (1/4)\Vert f\Vert _2^2 \le \Vert \sigma _{2^N}(\mu _q^*;h,f)\Vert _2^2 +\sum _{j=N+1}^\infty \Vert \tau _j(\mu _q^*;h,f)\Vert _2^2 \le \Vert f\Vert _2^2, \end{aligned}$$
(4.37)

and

$$\begin{aligned} \left| \Vert \mathcal T _j^*(f)\Vert _2^2-\Vert \tau _j(\mu _q^*;h,f)\Vert _2^2\right| \le c\rho ^{2^j}\Vert f\Vert _2^2. \end{aligned}$$
(4.38)

Proof

Using the same argument as in the proof of (2.36) in the proof of Theorem 2.5, we deduce that for all \(\mathbf{k}\in \mathbb{Z }^q\)

$$\begin{aligned} 1\le 2\left( h\left( \frac{|\mathbf{k}|_2}{2^N}\right) \right) ^2 +4\sum _{j=N+1}^\infty \left( h\left( \frac{|\mathbf{k}|_2}{2^j}\right) -h\left( \frac{|\mathbf{k}|_2}{2^{j-1}}\right) \right) ^2 \le 4. \end{aligned}$$

Then, again as in the proof of Theorem 2.5, we use the Parseval identity to obtain (4.37).

Next, we use (4.32) with \(\tilde{\nu }=\mu _q^*\), \(A=\mathbb T ^q\), and \(p=2\) to observe that

$$\begin{aligned} \left| \Vert \mathcal T _j^*(f)\Vert _2^2\!-\!\Vert \tau _j(\mu _q^*;h,f)\Vert _2^2\right|&= (\Vert \mathcal T _j^*(f)\Vert _2\!+\!\Vert \tau _j(\mu _q^*;h,f)\Vert _2)\left| \Vert \mathcal T _j^*(f)\Vert _2\!-\!\Vert \tau _j(\mu _q^*;h,f)\Vert _2\right| \\&\le c\Vert f\Vert _2\Vert \mathcal T _j^*(f)-\tau _j(\mu _q^*;h,f)\Vert _2 \le c\rho ^{2^j}\Vert f\Vert _2^2. \end{aligned}$$

This is precisely (4.38). \(\square \)
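The two-sided inequality for the filter used at the start of the proof above can be checked numerically. The following Python sketch is an illustration only. It assumes that a low pass filter means a nonincreasing function \(h\) with \(h(t)=1\) for \(0\le t\le 1/2\) and \(h(t)=0\) for \(t\ge 1\), builds one hypothetical infinitely differentiable example, and takes the first term at the scale \(2^N\), that is, \(h(t/2^N)\), matching \(\sigma _{2^N}\) in (4.37). The scalar \(t\) plays the role of \(|\mathbf{k}|_2\).

```python
import math


def smooth_step(s):
    """exp(-1/s) for s > 0 and 0 otherwise; a C-infinity building block."""
    return math.exp(-1.0 / s) if s > 0 else 0.0


def h(t):
    """Assumed low pass filter: 1 on [0, 1/2], 0 on [1, inf), nonincreasing."""
    a = smooth_step(1.0 - t)
    b = smooth_step(t - 0.5)
    return a / (a + b)


def two_sided_quantity(t, N, J=60):
    """2 h(t/2^N)^2 + 4 sum_{j=N+1}^{J} (h(t/2^j) - h(t/2^(j-1)))^2."""
    total = 2.0 * h(t / 2**N) ** 2
    for j in range(N + 1, J + 1):
        total += 4.0 * (h(t / 2**j) - h(t / 2 ** (j - 1))) ** 2
    return total


def extreme_values(N=3, t_max=1024):
    """Smallest and largest value of the quantity over t = 0, 1, ..., t_max."""
    values = [two_sided_quantity(float(t), N) for t in range(t_max + 1)]
    return min(values), max(values)


if __name__ == "__main__":
    print(extreme_values())
```

The reason the bounds hold in this sketch is that, for each fixed \(t\), the terms \(h(t/2^N)\) and \(h(t/2^j)-h(t/2^{j-1})\) are nonnegative, sum to \(h(0)=1\), and at most two of them are nonzero.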

Proof of Theorem 4.5

(b). Let \(N\in \mathbb{N }\) be fixed but to be chosen later. In light of (4.38),

$$\begin{aligned} \left| \sum _{j=N+1}^\infty \Vert \mathcal T _j^*(f)\Vert _2^2 -\sum _{j=N+1}^\infty \Vert \tau _j(\mu _q^*;h,f)\Vert _2^2\right| \le c\Vert f\Vert _2^2\sum _{j=N+1}^\infty \rho ^{2^j}. \end{aligned}$$
(4.39)

Also, (4.31) implies that

$$\begin{aligned} \left| \Vert \mathcal S _N^*(f)\Vert _2-\Vert \sigma _{2^N}(\mu _q^*;h,f)\Vert _2\right| \le c_1\rho ^{2^N}\Vert \sigma _{2^N}(\mu _q^*;h,f)\Vert _2. \end{aligned}$$
(4.40)

We now choose \(N\) such that

$$\begin{aligned} c\sum _{j=N+1}^\infty \rho ^{2^j}<1/8, \quad c_1\rho ^{2^N} <1/2, \end{aligned}$$
(4.41)

where \(c\) and \(c_1\) are as in the previous two displayed estimates. Then

$$\begin{aligned} \Vert \mathcal S _N^*(f)\Vert _2\le (3/2)\Vert \sigma _{2^N}(\mu _q^*;h,f)\Vert _2\le 3\Vert \mathcal S _N^*(f)\Vert _2. \end{aligned}$$
(4.42)

With this preparation, we first prove the upper bound on \(\Vert f\Vert _2^2\). The first inequality in (4.37) yields

$$\begin{aligned} (1/4)\Vert f\Vert _2^2 \le 4\Vert \mathcal S _N^*(f)\Vert _2^2 + \sum _{j=N+1}^\infty \Vert \mathcal T _j^*(f)\Vert _2^2 + (1/8)\Vert f\Vert _2^2, \end{aligned}$$

that is,

$$\begin{aligned} (1/32)\Vert f\Vert _2^2\le \Vert \mathcal S _N^*(f)\Vert _2^2 + \sum _{j=N+1}^\infty \Vert \mathcal T _j^*(f)\Vert _2^2. \end{aligned}$$
(4.43)

Since

$$\begin{aligned} \Vert \mathcal S _N^*(f)\Vert _2^2\le \left( \sum _{j=0}^N \Vert \mathcal T _j^*(f)\Vert _2\right) ^2\le (N+1)\sum _{j=0}^N \Vert \mathcal T _j^*(f)\Vert _2^2, \end{aligned}$$

(4.43) shows that

$$\begin{aligned} c\Vert f\Vert _2^2\le \sum _{j=0}^\infty \Vert \mathcal T _j^*(f)\Vert _2^2. \end{aligned}$$
(4.44)

The lower bound is easier. For each \(j\ge 0\), we have \(\Vert \mathcal T _j^*(f)\Vert _2\le c\Vert f\Vert _2\). Therefore, using (4.37) and (4.39) and keeping in mind our choice of \(N\) as in (4.41), we can conclude that

$$\begin{aligned} \sum _{j=0}^\infty \Vert \mathcal T _j^*(f)\Vert _2^2&= \sum _{j=0}^N \Vert \mathcal T _j^*(f)\Vert _2^2+\sum _{j=N+1}^\infty \Vert \mathcal T _j^*(f)\Vert _2^2\\&\le c(N+1)\Vert f\Vert _2^2 +\sum _{j=N+1}^\infty \Vert \tau _j(\mu _q^*;h,f)\Vert _2^2+(1/8)\Vert f\Vert _2^2\\&\le (c(N+1)+1+1/8)\Vert f\Vert _2^2. \end{aligned}$$

This completes the proof. \(\square \)
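The choice of \(N\) in (4.41) is easy to make concrete once values of the constants are available. The constants \(c\), \(c_1\), and \(\rho \) are not explicit in the text, so the Python sketch below uses purely hypothetical values for illustration; the function names are also hypothetical. The tail \(\sum _{j>N}\rho ^{2^j}\) is summed by repeated squaring, since \(\rho ^{2^{j+1}}=(\rho ^{2^j})^2\).

```python
def tail_sum(rho, N):
    """sum_{j >= N+1} rho^(2^j) for 0 < rho < 1, by repeated squaring."""
    term = rho ** (2 ** (N + 1))
    total = 0.0
    # Stop once the next term no longer changes the floating point sum.
    while term > 0.0 and total + term != total:
        total += term
        term *= term  # rho^(2^(j+1)) = (rho^(2^j))^2
    return total


def smallest_admissible_N(c, c1, rho):
    """Smallest N >= 1 satisfying both conditions in (4.41)."""
    N = 1
    while not (c * tail_sum(rho, N) < 1 / 8 and c1 * rho ** (2**N) < 1 / 2):
        N += 1
    return N


if __name__ == "__main__":
    # Hypothetical constants, chosen only to exercise the search.
    print(smallest_admissible_N(c=10.0, c1=5.0, rho=0.9))
```

Because the terms decay doubly exponentially, the admissible \(N\) grows only like the logarithm of \(\log (1/\rho )^{-1}\), so the factor \(N+1\) entering the frame constants stays moderate even when \(\rho \) is close to 1.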

We note in closing that with a proper normalization, we may assume that \(\hat{G}(0)=1\). Then \(\mathfrak m _n(G)\ge 1\) for all \(n\). Hence, the statement that \(G\in \mathcal E _A\) implies, in particular, that

$$\begin{aligned} \limsup _{m\rightarrow \infty } E_{m,\infty }(G)^{1/m} <1. \end{aligned}$$

This, in turn, implies that \(G\) is analytic on \(\mathbb T ^q\). There are many examples of activation functions which are used in practice, notably the periodization of the Wendland functions, or Green’s functions of the operators \({(-\Delta )^{r/2}}\), for which this condition is not satisfied. The analogues of Theorems 4.1 and 4.2 are then much weaker; they are given in [49] in a very general context. Direct and converse theorems for approximation with networks with such activation functions are also obtained in that paper. However, the estimates there are not strong enough to obtain the characterization of local smoothness classes.

It is worthwhile to comment on the relationship of our results with those in the papers [14] by Dũng and Micchelli and [40] by Maiorov.

  • The paper [40] deals with approximation in \(L^2\), and the paper [14] deals with \(L^p\), \(1<p<\infty \). Our paper includes both \(p=1\) and the case of continuous functions when \(p=\infty \).

  • The paper [14] deals with Korobov spaces rather than Sobolev spaces in the sense of our definitions. While our methods can be extended to the case of Korobov spaces, the papers in their present form are not comparable.

  • The papers [14, 40] give the bounds on approximation in terms of the number of neurons. Our paper gives the bounds in terms of the minimal separation among centers. In the case of uniform grids, the two concepts coincide. In this case, the upper bound in [40] is similar to ours, and the ideas behind its proof are essentially similar to those used in this paper. Both of these ideas were originally developed in [51].

  • The papers [14, 40] give lower bounds in the sense of worst case complexity; i.e., lower distance bounds. Our focus is on individual functions. For example, the lower bound in Theorem 1.1 of [40] implies that there exists some \(f\) in the function class under consideration for which the lower bound applies. The converse theorems in this paper are conceptually quite different. They state that for each individual function \(f\), without any prior knowledge of the class to which it belongs, the rate of decrease of the degree of approximation implies the smoothness class to which it belongs.

  • The paper [14] depends upon dyadic decompositions as is customary in the study of Korobov spaces. Preliminary numerical experiments suggest that the hyperbolic cross versions of the operators considered in [14] (or our operators for that matter) are not localized. We use the term wavelet-like representation to mean that the coefficients characterize local smoothness classes analogous to classical wavelet expansions. In this sense, it is an open problem to obtain a wavelet-like representation that characterizes local Korobov spaces.

  • There is an impressive lower bound in [14] which suggests that the best activation function in the case of approximation of Korobov spaces is the Korobov kernel itself. Intuitively, this deep result is somewhat expected, since the Korobov spaces are, in a limiting sense, translation networks with the Korobov kernel as the activation function. Analogous results for approximation of periodic functions are given in [52, 53]. The current paper does not deal with the question of the choice of an optimal activation function.

5 Further extensions

5.1 Jacobi expansions

Given \(\alpha >-1\) and \(\beta >-1\), the Jacobi weights are defined by

$$\begin{aligned} w_{\alpha ,\beta }(x)\mathop {=}\limits ^{\mathrm{def}}\left\{ \begin{array}{ll} (1-x)^\alpha (1+x)^\beta , &{}\quad {\hbox {if}}\ -1<x<1,\\ 0, &{}\quad {\hbox {otherwise.}} \end{array}\right. \end{aligned}$$

For \(1\le p < \infty \), the space \(L^p(\alpha ,\beta )\) is defined as the space of (equivalence classes of) functions \(f\) with

$$\begin{aligned} \Vert f\Vert _{\alpha ,\beta ;p} \mathop {=}\limits ^{\mathrm{def}}\left( \int _{-1}^1 |f(x)|^p w_{\alpha ,\beta }(x)\,dx\right) ^{1/p} < \infty . \end{aligned}$$

The symbol \(X^p(\alpha ,\beta )\) denotes \(L^p(\alpha ,\beta )\), if \(1 \le p < \infty \), and \(C([-1,1])\), the space of continuous functions on \([-1,1]\) with the maximum norm \(\Vert \circ \Vert _\infty \), if \(p=\infty \). The space of all algebraic polynomials of degree at most \(n\) will be denoted by \(\Pi _n\).

There exists a unique system of orthonormalized Jacobi polynomials \(\{{{p_j}^{(\alpha ,\beta )}}(x)=\gamma _j(\alpha ,\beta )x^j +\cdots \}\), \(\gamma _j(\alpha ,\beta )>0\) such that for integer \(j,\ell =0,1,\ldots \),

$$\begin{aligned} \int _{-1}^1 {{p_j}^{(\alpha ,\beta )}}(x){{p_\ell }^{(\alpha ,\beta )}}(x)w_{\alpha ,\beta }(x)\,dx=\left\{ \begin{array}{ll} 1, &{}\quad {\hbox {if}}\ j=\ell ,\\ 0, &{}\quad {\hbox {otherwise.}} \end{array}\right. \end{aligned}$$

The uniqueness of the system implies that \({{p_j}^{(\beta ,\alpha )}}(x)=(-1)^j{{p_j}^{(\alpha ,\beta )}}(-x)\), \(x\in \mathbb{R }\), \(j=0,1,\ldots \). Therefore, we may assume in the sequel that \(\alpha \ge \beta \). We will assume also that \(\alpha \ge \beta \ge -1/2\).

If \(f\in L^1(\alpha ,\beta )\), then, in this subsection, we define the Jacobi coefficients by

$$\begin{aligned} \hat{f}(j)\mathop {=}\limits ^{\mathrm{def}}\hat{f}(\alpha ,\beta ;j)\mathop {=}\limits ^{\mathrm{def}}\int _{-1}^1 f(y){{p_j}^{(\alpha ,\beta )}}(y)w_{\alpha ,\beta }(y)dy, \quad j=0,1,\ldots , \end{aligned}$$

so that the formal Jacobi expansion of \(f\) is given by \(\sum _{j\ge 0} \hat{f}(j){{p_j}^{(\alpha ,\beta )}}\). We define the summability operator analogous to \(\sigma _n\) as follows. Let \(h : [0,\infty )\rightarrow \mathbb{R }\) be a compactly supported function. We define

$$\begin{aligned} \Phi _n(\alpha ,\beta ;h,x,y)=\sum _{j=0}^\infty h\left( \frac{j}{n}\right) {{p_j}^{(\alpha ,\beta )}}(x){{p_j}^{(\alpha ,\beta )}}(y), \end{aligned}$$

and, for \(f\in L^1(\alpha ,\beta )\),

$$\begin{aligned}&\sigma _n(\alpha ,\beta ;h,f,x)=\sum _{j=0}^\infty h\left( \frac{j}{n}\right) \hat{f}(j){{p_j}^{(\alpha ,\beta )}}(x)\\&\quad =\int _{-1}^1 f(y)\Phi _n(\alpha ,\beta ;h,x,y)w_{\alpha ,\beta }(y)\,dy, \quad n\in \mathbb{N }. \end{aligned}$$

Using a result on the Cesàro means of the Jacobi expansion, see below, it is fairly easy to show that these operators are uniformly bounded. To describe this, we first recall the definition of Cesàro means. If \(\kappa >-1\), the Cesàro means of order \(\kappa \) of \(f\in L^1(\alpha ,\beta )\) are defined by

$$\begin{aligned} C_n^{[\kappa ]}(\alpha ,\beta ;f, x) \mathop {=}\limits ^{\mathrm{def}}\left( {\begin{array}{c}n+\kappa \\ \kappa \end{array}}\right) ^{-1} \sum _{j=0}^n \left( {\begin{array}{c}n-j+\kappa \\ \kappa \end{array}}\right) \hat{f}(j){{p_j}^{(\alpha ,\beta )}}(x) \end{aligned}$$
(5.1)

The following theorem is well known, see, e.g., [1, 76].

Theorem 5.1

Let \(\alpha ,\beta \ge -1/2\), \(\kappa >\max (\alpha ,\beta )+1/2\) be an integer, \(1\le p\le \infty \), and \(f\in X^p(\alpha ,\beta )\). Then

$$\begin{aligned} \Vert C_n^{[\kappa ]}(\alpha ,\beta ;f)\Vert _{\alpha ,\beta ;p} \le c\Vert f\Vert _{\alpha ,\beta ;p}, \end{aligned}$$
(5.2)

for \(n\in \mathbb{N }\).

Using a summation by parts argument as done in [26, Theorem 71, p. 128], Theorem 5.1 (used with \(S\) in place of \(\kappa \)) leads immediately to the following corollary.

Corollary 5.1

Let \(S>\max (\alpha ,\beta )+1/2\) be an integer. If \(\{h_j\}\) is a sequence of real numbers such that \(h_j\rightarrow 0\) as \(j\rightarrow \infty \), and

$$\begin{aligned} \sum _{j=0}^\infty (j+1)^S|\Delta ^{S+1}h_j| <\infty , \end{aligned}$$
(5.3)

where \(\Delta \) here is the forward difference operator, then for \(1\le p\le \infty \) and \(f\in X^p(\alpha ,\beta )\),

$$\begin{aligned}&\left\| \sum _{j=0}^\infty h_j\hat{f}(j){{p_j}^{(\alpha ,\beta )}}\right\| _{\alpha ,\beta ;p} = \left\| \sum _{j=0}^\infty \left( {\begin{array}{c}j+S\\ S\end{array}}\right) C_j^{[S]}(\alpha ,\beta ;f)\Delta ^{S+1}h_j\right\| _{\alpha ,\beta ;p}\nonumber \\&\quad \le c\left( \sum _{j=0}^\infty (j+1)^S|\Delta ^{S+1}h_j| \right) \Vert f\Vert _{\alpha ,\beta ;p}. \end{aligned}$$
(5.4)

In particular, if \(h\) is a compactly supported and \(S+1\) times continuously differentiable function, then

$$\begin{aligned} \Vert \sigma _n(\alpha ,\beta ;h,f)\Vert _{\alpha ,\beta ;p} \le c\Vert f\Vert _{\alpha ,\beta ;p}. \end{aligned}$$
(5.5)

Using the same methods as in the trigonometric case and the above results, one can easily obtain direct and converse theorems as well as the wavelet-like representation theorems for characterization of suitably defined global smoothness classes; these results are formulated in [59]. However, these bounds by themselves are not sufficient to obtain a characterization of local smoothness. We proved the following localization estimates on the kernels \(\Phi _n\), see [45, 48].

Theorem 5.2

Let \(\alpha ,\beta \ge -1/2\), \(S\in \mathbb{N }\), and let \(h_j=0\) for all sufficiently large \(j\). Then

$$\begin{aligned}&\left| \sum _{j=0}^\infty h_j {{p_j}^{(\alpha ,\beta )}}(\cos \theta ){{p_j}^{(\alpha ,\beta )}}(\cos \varphi )\right| \nonumber \\&\quad \le c \, \sum _{j=0}^\infty \min \left( (j+1), \frac{1}{|\theta -\varphi |}\right) ^{\max (\alpha ,\beta )+S+1/2} \times \sum _{m=1}^{S}(j+1)^{\max (\alpha ,\beta )+1/2-S+m}|\Delta ^{m}h_j|,\nonumber \\ \end{aligned}$$
(5.6)

for \(\theta ,\varphi \in [0,\pi ]\). In particular, if \(h :[0,\infty )\rightarrow [0,\infty )\) is a compactly supported function that can be expressed as an \(S\)-times iterated integral of a function of bounded total variation \(V\), and if \(h^{\prime }(t)=0\) in a neighborhood of \(0\), then

$$\begin{aligned} |\Phi _n(\alpha ,\beta ;h,\cos \theta ,\cos \varphi )|\!\le \! c \, n^{2\max (\alpha ,\beta )+2} \, V \min \left( 1, \frac{1}{(n|\theta -\varphi |)^{\max (\alpha ,\beta )+S+1/2}}\right) ,\nonumber \\ \end{aligned}$$
(5.7)

for \(n\in \mathbb{N }\).
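The effect of the smoothness of \(h\) on the localization of \(\Phi _n\) can be seen in the Chebyshev case \(\alpha =\beta =-1/2\), where \(p_0=\pi ^{-1/2}\) and \(p_j(\cos \theta )=(2/\pi )^{1/2}\cos (j\theta )\) for \(j\ge 1\). The Python sketch below is an informal illustration, not the construction used in [45, 48]: it compares a hypothetical infinitely differentiable filter with a discontinuous cutoff at a point well away from the diagonal. The filters and function names are assumptions made for this example.

```python
import math


def smooth_step(s):
    """exp(-1/s) for s > 0 and 0 otherwise."""
    return math.exp(-1.0 / s) if s > 0 else 0.0


def smooth_filter(t):
    """C-infinity filter: 1 on [0, 1/2], 0 on [1, inf), constant near 0."""
    a, b = smooth_step(1.0 - t), smooth_step(t - 0.5)
    return a / (a + b)


def sharp_filter(t):
    """Discontinuous cutoff: 1 on [0, 1), 0 on [1, inf)."""
    return 1.0 if t < 1.0 else 0.0


def kernel(n, h, theta, phi):
    """Phi_n(-1/2, -1/2; h, cos(theta), cos(phi)) in the Chebyshev case."""
    total = h(0.0) / math.pi  # j = 0 term, p_0^2 = 1/pi
    for j in range(1, n):     # both filters vanish for j >= n
        total += (2 / math.pi) * h(j / n) * math.cos(j * theta) * math.cos(j * phi)
    return total


if __name__ == "__main__":
    n, theta, phi = 64, math.pi / 3, 0.0
    print(abs(kernel(n, sharp_filter, theta, phi)))
    print(abs(kernel(n, smooth_filter, theta, phi)))
```

With the sharp cutoff the kernel reduces to a Dirichlet-type kernel, which decays only like \(1/|\theta -\varphi |\) away from the diagonal, whereas the smooth filter produces a much smaller off-diagonal value, in line with the dependence on \(S\) in (5.7).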

A wavelet-like representation with characterization of local smoothness classes is given in [48].

The subject of Marcinkiewicz–Zygmund inequalities in this context is very well studied. Perhaps the most classical example is a simple consequence of the Gauss–Jacobi quadrature formula. For \(n\in \mathbb{N }\), let \(\{x_{k,n}\}_{k=1}^n\) be the zeros of \(p_n^{(\alpha ,\beta )}\), and let

$$\begin{aligned} \lambda _{k,n} \mathop {=}\limits ^{\mathrm{def}}\left\{ \sum _{j=0}^{n-1} p_j^{(\alpha ,\beta )}(x_{k,n})^2\right\} ^{-1}, \quad k=1, 2,\dots ,n. \end{aligned}$$

Then the Gauss–Jacobi quadrature formula implies that

$$\begin{aligned} \sum _{k=1}^n \lambda _{k,n}|P(x_{k,n})|^2 =\int _{-1}^1 |P(y)|^2w_{\alpha ,\beta }(y)\,dy,\quad P\in \Pi _{n-1}. \end{aligned}$$
(5.8)
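For a concrete instance of (5.8), consider the Chebyshev case \(\alpha =\beta =-1/2\), where the zeros are \(x_{k,n}=\cos ((2k-1)\pi /(2n))\) and the orthonormal polynomials are \(p_0=\pi ^{-1/2}\), \(p_j(\cos \theta )=(2/\pi )^{1/2}\cos (j\theta )\). The following Python sketch, offered only as a numerical illustration with hypothetical function names, computes \(\lambda _{k,n}\) from the definition above and compares both sides of (5.8) for a polynomial written as \(P=\sum _j a_jp_j\), for which the integral equals \(\sum _j a_j^2\) by orthonormality.

```python
import math


def chebyshev_nodes(n):
    """Zeros of the degree-n Chebyshev polynomial (alpha = beta = -1/2)."""
    return [math.cos((2 * k - 1) * math.pi / (2 * n)) for k in range(1, n + 1)]


def orthonormal_p(j, x):
    """Orthonormal polynomial for the weight (1 - x^2)^(-1/2) on [-1, 1]."""
    theta = math.acos(max(-1.0, min(1.0, x)))
    scale = math.sqrt(1 / math.pi) if j == 0 else math.sqrt(2 / math.pi)
    return scale * math.cos(j * theta)


def christoffel_number(x, n):
    """lambda_{k,n} = 1 / sum_{j<n} p_j(x)^2, evaluated at a node x."""
    return 1.0 / sum(orthonormal_p(j, x) ** 2 for j in range(n))


def quadrature_side(coeffs, n):
    """Left side of (5.8) for P = sum_j coeffs[j] * p_j."""
    total = 0.0
    for x in chebyshev_nodes(n):
        value = sum(c * orthonormal_p(j, x) for j, c in enumerate(coeffs))
        total += christoffel_number(x, n) * value**2
    return total


def integral_side(coeffs):
    """Right side of (5.8): by orthonormality, the sum of squared coefficients."""
    return sum(c * c for c in coeffs)


if __name__ == "__main__":
    coeffs = [1.0, -2.0, 0.5, 0.0, 3.0, 1.5]  # a polynomial of degree 5
    print(quadrature_side(coeffs, 6), integral_side(coeffs))
```

In this case every \(\lambda _{k,n}\) equals \(\pi /n\). The restriction \(P\in \Pi _{n-1}\) cannot be dropped: for \(P=p_n\) the quadrature sum vanishes because the nodes are the zeros of \(p_n\), while the integral equals 1.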

The following analogue in the case of \(L^p\) norms was proved in [68, Theorem 25, p. 168].

Theorem 5.3

Let \(1\le p<\infty \) and let \(c_1>0\). Then there exists a constant \(c\) depending only on \(\alpha \), \(\beta \), and \(c_1\), such that for all \(m\in \mathbb{N }\) and \(n\in \mathbb{N }\) with \(1\le m\le c_1 n\),

$$\begin{aligned} \sum _{k=1}^n \lambda _{k,n}|P(x_{k,n})|^p \le c\Vert P\Vert _{\alpha ,\beta ;p}^p, \quad P\in \Pi _m. \end{aligned}$$
(5.9)

This theorem found deep applications in investigations related to weighted mean convergence of Lagrange interpolation [67, 68]. A survey of many of the classical results in this direction and their applications can be found in the paper [38] by Lubinsky. In [48], we proved the existence of M–Z quadrature measures based on an arbitrary set of points on \([-1,1]\) subject to a density condition, and we gave applications to wavelet-like representations based on values of the function at these points.

Characteristically for polynomial approximation, one can also construct localized operators which yield approximation commensurate with analyticity of the target function on intervals, rather than the much weaker smoothness conditions studied in the previous section. If \(1\le p\le \infty \), \(n\ge 0\), and \(f\in L^p(\alpha ,\beta )\), then we define the degree of best weighted approximation of \(f\) by polynomials of degree at most \(n\) by

$$\begin{aligned} E_{n,p}(\alpha ,\beta ;f)\mathop {=}\limits ^{\mathrm{def}}\min _{P\in \Pi _n}\Vert f-P\Vert _{\alpha ,\beta ;p}. \end{aligned}$$

For integer \(n\ge 1\), let the numbers \(H^*_{j,n}\), \(j=0,\ldots ,5n-1\) be defined by

$$\begin{aligned} \left( \frac{1+x}{2}\right) ^n\Phi _{4n}(\alpha ,\beta ;h,x,1)=\sum _{j=0}^{5n-1}H^*_{j,n}{{p_j}^{(\alpha ,\beta )}}(x){{p_j}^{(\alpha ,\beta )}}(1), \end{aligned}$$
(5.10)

and let

$$\begin{aligned} \sigma _n(\alpha ,\beta ; H^*, f,x)\mathop {=}\limits ^{\mathrm{def}}\sum _{j=0}^{5n-1} H^*_{j,n}\hat{f}(j) {{p_j}^{(\alpha ,\beta )}}(x), \quad x\in [-1,1]. \end{aligned}$$

In [19, Theorem 3.3], we proved the following.

Theorem 5.4

  1. (a)

    Let \(1\le p\le \infty \), \(\alpha ,\beta \ge -1/2\), and let \(f\in L^p(\alpha ,\beta )\). Then, with \(H^*\) as defined in (5.10), we have \(\sigma _n(\alpha ,\beta ; H^*, P)=P\) for \(P\in \Pi _{n}\), and \(\sup _{n\in \mathbb{N }}\Vert \sigma _n(\alpha ,\beta ; H^*, f)\Vert _{\alpha ,\beta ;p}\le c\Vert f\Vert _{\alpha ,\beta ;p}\). In addition,

    $$\begin{aligned} E_{5n,p}(\alpha ,\beta ;f)\le \Vert f-\sigma _n(\alpha ,\beta ; H^*, f)\Vert _{\alpha ,\beta ;p}\le c_1E_{n,p}(\alpha ,\beta ;f). \end{aligned}$$
    (5.11)
  2. (b)

    Let \(f\in C([-1,1])\), \(x_0\in [-1,1]\), and let \(f\) have an analytic continuation to a complex neighborhood of \(x_0\), given by \(\{z\in \mathbb{C }: |z-x_0|\le d\}\) for some \(d\) with \(0<d\le 2\). Then

    $$\begin{aligned}&|f(x)-\sigma _n(\alpha ,\beta ; H^*, f,x)|\le c(f,x_0)\exp \left( -c_1(d)n\frac{d^2\log (e/2)}{e^2\log (e^2/d)}\right) ,\nonumber \\&\quad x\in [x_0-d/e,x_0+d/e]\cap [-1,1], \end{aligned}$$
    (5.12)

    where \(\log \) is the natural logarithm, and \(e\) is the base of this logarithm.

5.2 Approximation on the sphere

Theorems about Jacobi expansions translate easily into analogous theorems for approximation on the sphere. The following paragraph is taken from [61]. Let \(q\in \mathbb{N }\), and let

$$\begin{aligned} \mathbb{S }^q\mathop {=}\limits ^{\mathrm{def}}\Bigg \{(x_1,\dots ,x_{q+1})\in \mathbb{R }^{q+1} :\sum _{j=1}^{q+1} x_j^2 =1\Bigg \}. \end{aligned}$$

A spherical cap, centered at \(\mathbf{x}_0\in \mathbb{S }^q\) and with radius \(\alpha \) is defined by

$$\begin{aligned} \mathbb{S }^q_\alpha (\mathbf{x}_0) \mathop {=}\limits ^{\mathrm{def}}\{\mathbf{x}\in \mathbb{S }^q\ :\ \mathbf{x}\cdot \mathbf{x}_0\ge \cos \alpha \}. \end{aligned}$$

We note that for all \(\mathbf{x}_0\in \mathbb{S }^q\) we have \(\mathbb{S }^q_\pi (\mathbf{x}_0)=\mathbb{S }^q\). In this subsection, the surface area (also known as the volume element) measure on \(\mathbb{S }^q\) will be denoted by \(\mu _q^*\) (there being no chance of confusion with the notation on the torus), and we write \(\omega _q \mathop {=}\limits ^{\mathrm{def}}\mu _q^*(\mathbb{S }^q)\). The spaces \(X^p(\mathbb{S }^q)\) and \(C(\mathbb{S }^q)\) on the sphere are defined analogously to the case of the interval.

A spherical polynomial of degree \(m\) is the restriction to \(\mathbb{S }^q\) of a polynomial in \(q+1\) real variables with total degree \(m\). For integer \(n\ge 0\), the class of all spherical polynomials of degree at most \(n\) will be denoted by \(\Pi ^q_n\). As before, we extend this notation for non-integer values of \(n\) by setting \(\Pi ^q_n\mathop {=}\limits ^{\mathrm{def}}\Pi ^q_{\lfloor n\rfloor }\). For integer \(\ell \ge 0\), the class of all homogeneous, harmonic, spherical polynomials of degree \(\ell \) will be denoted by \(\mathbf H ^q_\ell \), and its dimension by \(d^q_\ell \). For each integer \(\ell \ge 0\), let \(\{Y_{\ell ,k}\ :\ k=1, 2,\dots ,d_\ell ^q\}\) be a \(\mu _q^*\)-orthonormalized basis for \(\mathbf H ^q_\ell \). It is known that for any integer \(n\ge 0\), the system \(\{Y_{\ell ,k}\ :\ \ell =0, 1,\dots ,n,\ k=1, 2,\dots ,d_\ell ^q\}\) is an orthonormal basis for \(\Pi _n^q\), cf. [63, 75]. The connection with the theory of orthogonal polynomials on \([-1,1]\) is the following addition formula

$$\begin{aligned} \sum _{k=1}^{d\,^q_\ell } Y_{\ell ,k}(\mathbf{x}){Y_{\ell ,k}(\mathbf{y})} = \omega _{q-1}^{-1}p_\ell ^{(q/2-1,q/2-1)}(1)p_\ell ^{(q/2-1,q/2-1)}(\mathbf{x}\cdot \mathbf{y}), \quad \ell =0,1,\ldots , \end{aligned}$$

cf. [63], where the notation is different.
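As a sanity check, the addition formula can be verified numerically for \(q=2\). The sketch below assumes that \(p_\ell ^{(0,0)}\) denotes the orthonormalized Legendre polynomial \(\sqrt{(2\ell +1)/2}\,P_\ell \), so that, with \(\omega _1=2\pi \), the right-hand side becomes \(\frac{2\ell +1}{4\pi }P_\ell (\mathbf{x}\cdot \mathbf{y})\). It uses the standard real orthonormal harmonics of degrees 1 and 2 on \(\mathbb{S }^2\); the function names are illustrative only.

```python
import numpy as np
from numpy.polynomial.legendre import legval

def harmonics_deg1(p):
    """Real orthonormal harmonics of degree 1 on S^2 (d = 3 functions)."""
    return np.sqrt(3.0 / (4.0 * np.pi)) * np.asarray(p, dtype=float)

def harmonics_deg2(p):
    """Real orthonormal harmonics of degree 2 on S^2 (d = 5 functions)."""
    x, y, z = p
    c = np.sqrt(15.0 / (4.0 * np.pi))
    return np.array([
        c * x * y,
        c * y * z,
        c * x * z,
        0.5 * c * (x * x - y * y),
        np.sqrt(5.0 / (16.0 * np.pi)) * (3.0 * z * z - 1.0),
    ])

def addition_rhs(ell, t):
    """Right-hand side for q = 2 under the stated assumption: (2l+1)/(4 pi) P_l(t)."""
    coeffs = np.zeros(ell + 1)
    coeffs[ell] = 1.0  # selects the Legendre polynomial P_ell
    return (2 * ell + 1) / (4.0 * np.pi) * legval(t, coeffs)

# Two random points on S^2.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(2, 3))
x /= np.linalg.norm(x)
y /= np.linalg.norm(y)

for ell, basis in [(1, harmonics_deg1), (2, harmonics_deg2)]:
    lhs = basis(x) @ basis(y)        # sum over k of Y_{l,k}(x) Y_{l,k}(y)
    rhs = addition_rhs(ell, x @ y)
    print(ell, np.isclose(lhs, rhs))
```

Setting \(\mathbf{x}=\mathbf{y}\) and integrating over the sphere recovers, in this case, the dimension \(d^2_\ell =2\ell +1\).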

Analogues of the direct and converse theorems in the case of approximation on the sphere are given, for instance, by Pawelke [71] and Lizorkin and Rustamov [35]. The existence of M–Z quadrature measures was proved in [54]. Numerical constructions and various experiments demonstrating the effectiveness of the localized operators are given in [34]. A wavelet-like representation including local smoothness classes is given in [47]. In the case of spherical caps, the existence of M–Z quadrature measures is proved in [9, 46]; the corresponding results on spherical triangles are proved in [44], and numerical constructions are given in [2].

The analogue of the PBF network in this context is the so-called zonal function (ZF) network. A zonal function network is a function of the form \(\mathbf{x}\mapsto \sum _{k=1}^n c_k\phi (\mathbf{x}\cdot \mathbf{y}_k)\), where \(\mathbf{x}\) and the \(\mathbf{y}_k\)’s are points on \(\mathbb{S }^q\) and \(\phi \in L^1(q/2-1,q/2-1)\). We observe that, analogous to the “Mercer expansion”

$$\begin{aligned} G(\mathbf{x}-\mathbf{y})=\sum _\mathbf{k}\hat{G}(\mathbf{k})\exp (i\mathbf{k}\cdot \mathbf{x})\overline{\exp (i\mathbf{k}\cdot \mathbf{y})}, \end{aligned}$$

for the activation function \(G\) of a PBF network, one has the expansion

$$\begin{aligned} \phi (\mathbf{x}\cdot \mathbf{y})=\sum _{\ell =0}^\infty \hat{\phi }(\ell )\sum _{k=1}^{d\,^q_\ell } Y_{\ell ,k}(\mathbf{x}){Y_{\ell ,k}(\mathbf{y})} \end{aligned}$$

for the activation function \(\phi \) of a ZF network. The ideas in Sect. 4 can be carried over almost verbatim to the case of ZF networks. In particular, the direct and converse theorems in this connection are obtained in [55, 56]. The wavelet-like representation for ZF network frames was announced by Shvarts, in a joint paper with HNM, at a meeting in Barcelona, Spain, in December 2011.
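To make the definition of a ZF network concrete, the following minimal sketch evaluates \(\mathbf{x}\mapsto \sum _{k=1}^n c_k\phi (\mathbf{x}\cdot \mathbf{y}_k)\) and, for \(q=2\) only, approximates the coefficients \(\hat{\phi }(\ell )\) in the expansion above. The coefficient formula \(\hat{\phi }(\ell )=2\pi \int _{-1}^1\phi (t)P_\ell (t)\,dt\) is an assumption: it is what the expansion implies when the addition formula for \(q=2\) is read as \(\sum _k Y_{\ell ,k}(\mathbf{x})Y_{\ell ,k}(\mathbf{y})=\frac{2\ell +1}{4\pi }P_\ell (\mathbf{x}\cdot \mathbf{y})\). The function names and the choice of Gauss-Legendre quadrature are illustrative and are not taken from the cited papers.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss, legval

def zf_network(x, centers, coeffs, phi):
    """Evaluate sum_k coeffs[k] * phi(x . centers[k]) at one or more points x.

    x has shape (q+1,) or (m, q+1); centers has shape (n, q+1); all points
    are assumed to lie on the unit sphere. Returns an array of shape (m,).
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    inner = x @ np.asarray(centers, dtype=float).T   # inner products, shape (m, n)
    return phi(inner) @ np.asarray(coeffs, dtype=float)

def zonal_coefficients_q2(phi, max_degree, n_nodes=64):
    """Approximate phi_hat(l) = 2 pi * int_{-1}^{1} phi(t) P_l(t) dt for q = 2."""
    t, w = leggauss(n_nodes)                          # Gauss-Legendre nodes and weights
    values = phi(t)
    out = []
    for ell in range(max_degree + 1):
        e = np.zeros(ell + 1)
        e[ell] = 1.0                                  # selects the Legendre polynomial P_ell
        out.append(2.0 * np.pi * np.sum(w * values * legval(t, e)))
    return np.array(out)

# A three-neuron network on S^2 with centers at the coordinate unit vectors.
centers = np.eye(3)
coeffs = np.array([1.0, 2.0, 3.0])
print(zf_network([0.0, 0.0, 1.0], centers, coeffs, np.exp))

# For phi(t) = t only the degree-1 coefficient is nonzero.
print(zonal_coefficients_q2(lambda t: t, 2))
```

The network evaluation is a single matrix product followed by a pointwise application of \(\phi \), so it vectorizes over many evaluation points at once.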