1 Introduction

Neural networks form very popular function classes used in machine learning algorithms [2, 28]. The fundamental building blocks of neural networks are ridge functions (also called neurons) of the form \(x \in \mathbb {R}^d \mapsto \rho ((x^\intercal ,1)v)\), where \(\rho :\mathbb {R}\rightarrow \mathbb {R}\) is a continuous activation function and \(v\in \mathbb {R}^{d+1}\) is a trainable parameter. It is well-known that a shallow neural network with a non-polynomial activation

$$\begin{aligned} f(x) = \sum _{i=1}^N a_i \rho ((x^\intercal ,1)v_i), \end{aligned}$$
(1.1)

is universal in the sense that it can approximate any continuous function on any compact set to any desired accuracy when the number of neurons N is sufficiently large [11, 23, 48]. The approximation and statistical properties of neural networks with different architectures have also been widely studied in the literature [6, 52, 64, 67], especially when \(\rho \) is a sigmoidal activation or \(\rho \) is the ReLU\(^k\) function \(\sigma _k(t) = \max \{0,t\}^k\), the k-th power of the rectified linear unit (ReLU), with \(k\in \mathbb {N}_0:=\mathbb {N}\cup \{0\}\).
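To make the shallow model (1.1) concrete, here is a minimal numpy sketch (purely illustrative; the helper `shallow_net` and all sample values are ad hoc choices, not part of the analysis):

```python
# Minimal sketch of the shallow network (1.1) with rho = ReLU^k.
# `a` holds the outer weights a_i; each row of `V` is a direction v_i in R^{d+1}
# acting on the augmented input (x^T, 1).
import numpy as np

def shallow_net(x, a, V, k=1):
    """f(x) = sum_i a_i * sigma_k((x^T, 1) v_i) for a single input x in R^d."""
    x1 = np.append(x, 1.0)                     # augmented input (x^T, 1)
    pre = V @ x1                               # inner products (x^T, 1) v_i
    return a @ np.maximum(pre, 0.0) ** k       # sigma_k(t) = max(0, t)^k

# example: N = 3 neurons in dimension d = 2
rng = np.random.default_rng(0)
a = rng.normal(size=3)
V = rng.normal(size=(3, 3))
V /= np.linalg.norm(V, axis=1, keepdims=True)  # place each v_i on the unit sphere S^d
print(shallow_net(np.array([0.2, -0.1]), a, V, k=2))
```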

The main focus of this paper is rates of approximation by neural networks. For classical smooth function classes, such as Hölder functions, Mhaskar [39] (see also [48, Theorem 6.8]) presented approximation rates for shallow neural networks when the activation function \(\rho \in C^\infty (\Omega )\) is not a polynomial on some open interval \(\Omega \) (ReLU\(^k\) does not satisfy this condition). It is known that the rates obtained by Mhaskar are optimal if the network weights are required to be continuous functions of the target function. Recently, optimal rates of approximation have also been established for deep ReLU neural networks [31, 54, 64, 65], even without the continuity requirement on the network weights. All these approximation rates are obtained by using the idea that one can construct neural networks to approximate polynomials efficiently. There is another line of work [4, 25, 34, 56, 57] studying the approximation rates for functions of certain integral forms (such as (1.2)) by using a random sampling argument due to Maurey [49]. In particular, Barron [4] derived dimension-independent approximation rates for sigmoid type activations and functions h whose Fourier transform \(\widehat{h}\) satisfies \(\int _{\mathbb {R}^d} |\omega | |\widehat{h}(\omega )| d\omega <\infty \). This result has been improved and generalized to ReLU activation in recent articles [25, 56, 57].

In this paper, we continue the study of these two lines of approximation theory for neural networks (i.e. the constructive approximation of smooth functions and the random approximation of integral representations). Our main result shows how well integral representations corresponding to ReLU\(^k\) neural networks can approximate smooth functions. By combining this result with the random approximation theory of integral forms, we are able to establish the optimal rates of approximation for shallow ReLU\(^k\) neural networks. Specifically, we consider the following function class defined on the unit ball \(\mathbb {B}^d\) of \(\mathbb {R}^d\) and induced by vectors on the unit sphere \(\mathbb {S}^d\) of \(\mathbb {R}^{d+1}\):

$$\begin{aligned} \mathcal {F}_{\sigma _k}(M) := \left\{ f(x) = \int _{\mathbb {S}^d} \sigma _k((x^\intercal ,1)v) d\mu (v): \Vert \mu \Vert \le M, x\in \mathbb {B}^d \right\} , \end{aligned}$$
(1.2)

which can be regarded as a shallow ReLU\(^k\) neural network with infinite width [3]. The restriction on the total variation \(\Vert \mu \Vert :=|\mu |(\mathbb {S}^d) \le M\) gives a constraint on the size of the weights in the network. We study how well \(\mathcal {F}_{\sigma _k}(M)\) approximates the unit ball \(\mathcal {H}^\alpha \) of the Hölder class with smoothness index \(\alpha >0\) as \(M\rightarrow \infty \). Roughly speaking, our main theorem shows that, if \(\alpha > (d+2k+1)/2\), then \(\mathcal {H}^\alpha \subseteq \mathcal {F}_{\sigma _k}(M)\) for some constant M depending on \(k,d,\alpha \), and if \(\alpha < (d+2k+1)/2\), we obtain the approximation bound

$$\begin{aligned} \sup _{h\in \mathcal {H}^\alpha } \inf _{f\in \mathcal {F}_{\sigma _k}(M)} \Vert h-f\Vert _{L^\infty (\mathbb {B}^d)} \lesssim M^{-\frac{2\alpha }{d+2k+1-2\alpha }}, \end{aligned}$$

where for two quantities X and Y, \(X \lesssim Y\) (or \(Y \gtrsim X\)) denotes the statement that \(X\le CY\) for some constant \(C>0\) (we will also write \(X \asymp Y\) when \(X \lesssim Y \lesssim X\)). In other words, sufficiently smooth functions are always contained in the shallow neural network space \(\mathcal {F}_{\sigma _k}(M)\), and for less smooth functions we can characterize the approximation error by the variation norm. Furthermore, combining our result with the random approximation bounds from [3, 55, 57], we are able to prove that shallow ReLU\(^k\) neural networks of the form (1.1) achieve the optimal approximation rate \(\mathcal {O}(N^{-\alpha /d})\) for \(\mathcal {H}^\alpha \) with \(\alpha < (d+2k+1)/2\), which generalizes the result of Mhaskar [39] to the ReLU\(^k\) activation.

In addition to shallow neural networks, we can also apply our results to derive approximation bounds for multi-layer neural networks and convolutional neural networks (CNNs) when \(k=1\) (ReLU activation \(\sigma :=\sigma _1\)). These approximation bounds can then be used to study the performance of machine learning algorithms based on neural networks [2]. Here, we illustrate the idea by studying the nonparametric regression problem. The goal of this problem is to learn a function h in a hypothesis space \(\mathcal {H}\) from its noisy samples

$$\begin{aligned} Y_i = h(X_i) + \eta _i, \quad i=1,\dots , n, \end{aligned}$$

where \(X_i\) is sampled from an unknown probability distribution \(\mu \) and \(\eta _i\) is Gaussian noise. One popular algorithm for solving this problem is the empirical least square minimization

$$\begin{aligned} \mathop {\textrm{argmin}}\limits _{f\in \mathcal {F}_n} \frac{1}{n} \sum _{i=1}^n |f(X_i)- Y_i|^2, \end{aligned}$$

where \(\mathcal {F}_n\) is an appropriately chosen function class. For instance, in deep learning, \(\mathcal {F}_n\) is parameterized by deep neural networks and one solves the minimization by (stochastic) gradient descent methods. Assuming that we can compute a minimizer \(f_n^* \in \mathcal {F}_n\), the performance of the algorithm is often measured by the square loss \(\Vert f_n^*-h\Vert _{L^2(\mu )}^2\). A fundamental question in learning theory is to determine the convergence rate of the error \(\Vert f_n^*-h\Vert _{L^2(\mu )}^2 \rightarrow 0\) as the sample size \(n\rightarrow \infty \). The error can be decomposed into two components: approximation error and generalization error (also called estimation error). For neural network models \(\mathcal {F}_n\), our results provide bounds for the approximation errors, while the generalization errors can be bounded by the complexity of the models [40, 53]. We study the cases \(\mathcal {H}= \mathcal {H}^\alpha \) with \(\alpha < (d+3)/2\) or \(\mathcal {H}=\mathcal {F}_{\sigma }(1)\) for three ReLU neural network models: shallow neural network, over-parameterized neural network, and CNN. The models and our contributions are summarized as follows:

  1.

    Shallow ReLU neural network \(\mathcal {F}_{\sigma }(N,M)\), where N is the number of neurons and M is a bound for the variation norm that measures the size of the weights. We prove optimal approximation rates (in terms of N) for this model. It is also shown that this model can achieve the optimal convergence rates for learning \(\mathcal {H}^\alpha \) and \(\mathcal {F}_{\sigma }(1)\), which complements the recent results for deep neural networks [27, 52].

  2.

    Over-parameterized (deep or shallow) ReLU neural network \(\mathcal{N}\mathcal{N}(W,L,M)\) studied in [24], where W, L are the width and depth, respectively, and M is a constraint on the weight matrices. For fixed depth L, the generalization error for this model can be controlled by M [24]. When \(\mathcal {H}=\mathcal {H}^\alpha \), we characterize the approximation error by M, and allow the width W to be arbitrarily large so that the model can be over-parameterized (the number of parameters is larger than the number of samples). When \(\mathcal {H}=\mathcal {F}_{\sigma }(1)\), we can simply increase the width to reduce the approximation error so that the model can also be over-parameterized. Our result shows that this model can achieve nearly optimal convergence rates for learning \(\mathcal {H}^\alpha \) and \(\mathcal {F}_{\sigma }(1)\). Both the approximation and convergence rates improve the results of [24].

  3.

    Sparse convolutional ReLU neural network \(\mathcal {CNN}(s,L)\) introduced by [67], where L is the depth and \(s\ge 2\) is a fixed integer that controls the filter length. This model is shown to be universal for approximation [67] and universally consistent for regression [30]. We improve the approximation bound in [67] and give new convergence rates of this model for learning \(\mathcal {H}^\alpha \) and \(\mathcal {F}_{\sigma }(1)\).

The approximation rates and convergence rates of nonparametric regression for these models are summarized in Table 1, where we use the notation \(a\vee b:= \max \{a,b\}\).

Table 1 Approximation rates and convergence rates of nonparametric regression for three neural network models, ignoring logarithmic factors

The rest of the paper is organized as follows. In Sect. 2, we present our approximation results for shallow neural networks. Section 3 gives a proof of our main theorem. In Sect. 4, we apply our approximation results to study these neural network models and derive convergence rates for nonparametric regression using these models. Section 5 concludes this paper with a discussion on possible future directions of research.

2 Approximation Rates for Shallow Neural Networks

Let us begin with some notation for function classes. Let \(\mathbb {B}^d= \{x\in \mathbb {R}^d:\Vert x\Vert _2\le 1\}\) and \(\mathbb {S}^{d-1}=\{x\in \mathbb {R}^d:\Vert x\Vert _2=1\}\) be the unit ball and the unit sphere of \(\mathbb {R}^d\), respectively. We are interested in functions of the integral form

$$\begin{aligned} f(x) = \int _{\mathbb {S}^d} \sigma _k((x^\intercal ,1)v) d\mu (v), \quad x\in \mathbb {B}^d, \end{aligned}$$
(2.1)

where \(\mu \) is a signed Radon measure on \(\mathbb {S}^d\) with finite total variation \(\Vert \mu \Vert :=|\mu |(\mathbb {S}^d)<\infty \), and \(\sigma _k(t):= \max \{t,0\}^k\) with \(k\in \mathbb {N}_0:=\mathbb {N}\cup \{0\}\) is the ReLU\(^k\) function (when \(k=0\), \(\sigma _0(t)\) is the Heaviside function). For simplicity, we will also denote the ReLU function by \(\sigma :=\sigma _1\). The variation norm \(\gamma (f)\) of f is the infimum of \(\Vert \mu \Vert \) over all decompositions of f as (2.1) [3]. By the compactness of \(\mathbb {S}^d\), the infimum can be attained by some signed measure \(\mu \). We denote by \(\mathcal {F}_{\sigma _k}(M)\) the function class that contains all functions f of the form (2.1) whose variation norm satisfies \(\gamma (f) \le M\), see (1.2). The class \(\mathcal {F}_{\sigma _k}(M)\) can be thought of as an infinitely wide neural network with a constraint on its weights. The variation spaces corresponding to shallow neural networks have been studied by many researchers. We refer the reader to [8, 44, 45, 46, 51, 55, 56, 57, 58] for several other definitions and characterizations of these spaces.

We will also need the notion of classical smoothness of functions on Euclidean space. Given a smoothness index \(\alpha >0\), we write \(\alpha =r+\beta \) where \(r\in \mathbb {N}_0\) and \(\beta \in (0,1]\). Let \(C^{r,\beta }(\mathbb {R}^d)\) be the Hölder space with the norm

$$\begin{aligned} \Vert f\Vert _{C^{r,\beta }(\mathbb {R}^d)}:= \max \left\{ \Vert f\Vert _{C^r(\mathbb {R}^d)}, \max _{\Vert s\Vert _1=r}|\partial ^s f|_{C^{0,\beta }(\mathbb {R}^d)} \right\} , \end{aligned}$$

where \(s=(s_1,\dots ,s_d) \in \mathbb {N}_0^d\) is a multi-index and

$$\begin{aligned} \Vert f\Vert _{C^r(\mathbb {R}^d)}:= \max _{\Vert s\Vert _1\le r} \Vert \partial ^s f\Vert _{L^\infty (\mathbb {R}^d)}, \quad |f|_{C^{0,\beta }(\mathbb {R}^d)}:= \sup _{x\ne y\in \mathbb {R}^d} \frac{|f(x)-f(y)|}{\Vert x-y\Vert _2^\beta }. \end{aligned}$$

Here we use \(\Vert \cdot \Vert _{L^\infty }\) to denote the supremum norm, since we mainly consider continuous functions. We write \(C^{r,\beta }(\mathbb {B}^d)\) for the Banach space of all restrictions to \(\mathbb {B}^d\) of functions in \(C^{r,\beta }(\mathbb {R}^d)\). The norm is given by \(\Vert f\Vert _{C^{r,\beta }(\mathbb {B}^d)} = \inf \{ \Vert g\Vert _{C^{r,\beta }(\mathbb {R}^d)}: g\in C^{r,\beta }(\mathbb {R}^d) \text{ and } g=f \text{ on } \mathbb {B}^d\}\). For convenience, we will denote the unit ball of \(C^{r,\beta }(\mathbb {B}^d)\) by

$$\begin{aligned} \mathcal {H}^\alpha := \left\{ f\in C^{r,\beta }(\mathbb {B}^d): \Vert f\Vert _{C^{r,\beta }(\mathbb {B}^d)}\le 1 \right\} . \end{aligned}$$

Note that, for \(\alpha =1\), \(\mathcal {H}^\alpha \) is a class of Lipschitz continuous functions.

Due to the universality of shallow neural networks [48], \(\mathcal {F}_{\sigma _k}(M)\) can approximate any continuous function on \(\mathbb {B}^d\) if M is sufficiently large. Our main theorem estimates the rate of this approximation for the Hölder class \(\mathcal {H}^\alpha \).

Theorem 2.1

Let \(k\in \mathbb {N}_0\), \(d\in \mathbb {N}\) and \(\alpha >0\). If \(\alpha > (d+2k+1)/2\) or \(\alpha = (d+2k+1)/2\) is an even integer, then \(\mathcal {H}^\alpha \subseteq \mathcal {F}_{\sigma _k}(M)\) for some constant M depending on \(k,d,\alpha \). Otherwise,

$$\begin{aligned} \sup _{h\in \mathcal {H}^\alpha } \inf _{f\in \mathcal {F}_{\sigma _k}(M)} \Vert h-f\Vert _{L^\infty (\mathbb {B}^d)} \lesssim {\left\{ \begin{array}{ll} \exp (-\alpha M^2), &{} \text{ if } \alpha = (d+2k+1)/2 \text{ and } \alpha /2 \notin \mathbb {N}, \\ M^{-\frac{2\alpha }{d+2k+1-2\alpha }}, &{} \text{ if } \alpha < (d+2k+1)/2, \end{array}\right. } \end{aligned}$$

where the implied constants only depend on \(k,d,\alpha \).

The proof of Theorem 2.1 is deferred to the next section. Our proof uses ideas similar to those of [3, Proposition 3], which obtained the same approximation rate for \(\alpha =1\) (with an additional logarithmic factor). The conclusion is more complicated for the critical value \(\alpha =(d+2k+1)/2\). We think this is due to the proof technique and conjecture that \(\mathcal {H}^\alpha \subseteq \mathcal {F}_{\sigma _k}(M)\) for all \(\alpha \ge (d+2k+1)/2\), see Remark 3.4. Nevertheless, in practical applications of machine learning, the dimension d is large and it is reasonable to expect that \(\alpha <(d+2k+1)/2\).

In order to apply Theorem 2.1 to shallow neural networks with finite neurons, we can approximate \(\mathcal {F}_{\sigma _k}(M)\) by the subclass

$$\begin{aligned} \mathcal {F}_{\sigma _k}(N,M):= \left\{ f(x) = \sum _{i=1}^N a_i\sigma _k((x^\intercal ,1)v_i): v_i \in \mathbb {S}^d, \sum _{i=1}^N |a_i|\le M \right\} , \end{aligned}$$

where we restrict the measure \(\mu \) to be a discrete one supported on at most N points. The next proposition shows that any function in \(\mathcal {F}_{\sigma _k}(M)\) is the limit of functions in \(\mathcal {F}_{\sigma _k}(N,M)\) as \(N\rightarrow \infty \).

Proposition 2.2

For \(k\in \mathbb {N}_0\), \(\mathcal {F}_{\sigma _k}(1)\) is the closure of \(\cup _{N\in \mathbb {N}} \mathcal {F}_{\sigma _k}(N,1)\) in \(L^\infty (\mathbb {B}^d)\).

Proof

Let us denote the closure of \(\cup _{N\in \mathbb {N}} \mathcal {F}_{\sigma _k}(N,1)\) in \(L^\infty (\mathbb {B}^d)\) by \(\widetilde{\mathcal {F}}_{\sigma _k}(1)\). We first show that \(\mathcal {F}_{\sigma _k}(1) \subseteq \widetilde{\mathcal {F}}_{\sigma _k}(1)\). For any \(f\in \mathcal {F}_{\sigma _k}(1)\) with the integral form \(f(x) = \int _{\mathbb {S}^d} \sigma _k((x^\intercal ,1)v) d\mu (v)\), we can decompose f as

$$\begin{aligned} f(x)&= \Vert \mu _+\Vert \int _{\mathbb {S}^d} \sigma _k((x^\intercal ,1)v) \frac{d\mu _+(v)}{\Vert \mu _+\Vert } - \Vert \mu _-\Vert \int _{\mathbb {S}^d} \sigma _k((x^\intercal ,1)v) \frac{d\mu _-(v)}{\Vert \mu _-\Vert } \\&=: \Vert \mu _+\Vert f_+(x) - \Vert \mu _-\Vert f_-(x), \end{aligned}$$

where \(\mu _+\) and \(\mu _-\) are the positive and negative parts of \(\mu \). If \(f_+, f_-\in \widetilde{\mathcal {F}}_{\sigma _k}(1)\), then \(f \in \widetilde{\mathcal {F}}_{\sigma _k}(1)\). Hence, without loss of generality, we can assume \(\mu \) is a probability measure. We are going to approximate f by a uniform law of large numbers. Let \(\{v_i\}_{i=1}^N\) be N i.i.d. samples from \(\mu \). By a symmetrization argument (see [60, Theorem 4.10] for example), we can bound the expected approximation error by the Rademacher complexity [7]:

$$\begin{aligned} \mathbb {E}\left[ \sup _{x\in \mathbb {B}^d} \left| f(x) - \frac{1}{N}\sum _{i=1}^N \sigma _k((x^\intercal ,1)v_i)\right| \right] \le 2 \mathbb {E}\left[ \sup _{x\in \mathbb {B}^d} \left| \frac{1}{N} \sum _{i=1}^N \epsilon _i \sigma _k((x^\intercal ,1)v_i) \right| \right] =:\mathcal {E}_k(N), \end{aligned}$$

where \((\epsilon _1,\dots ,\epsilon _N)\) is an i.i.d. sequence of Rademacher random variables. For \(k\in \mathbb {N}\), the Lipschitz constant of \(\sigma _k\) on \([-\sqrt{2},\sqrt{2}]\) is \(k2^{(k-1)/2}\). By the contraction property of Rademacher complexity [29, Corollary 3.17],

$$\begin{aligned} \mathcal {E}_k(N)&\le \frac{k2^{(k+1)/2}}{N} \mathbb {E}\left[ \sup _{x\in \mathbb {B}^d} \left| \sum _{i=1}^N \epsilon _i (x^\intercal ,1)v_i \right| \right] \le \frac{k2^{k/2+1}}{N} \mathbb {E}\left[ \left\| \sum _{i=1}^N \epsilon _i v_i \right\| _2 \right] \\&\le \frac{k2^{k/2+1}}{N} \sqrt{\mathbb {E}\left[ \left\| \sum _{i=1}^N \epsilon _i v_i \right\| _2^2 \right] } = \frac{k2^{k/2+1}}{N} \sqrt{\mathbb {E}\left[ \sum _{i=1}^N \left\| v_i \right\| _2^2 \right] } = \frac{k2^{k/2+1}}{\sqrt{N}}. \end{aligned}$$

For \(k=0\), the VC dimension of the function class \(\{f_x(v)=\sigma _0((x^\intercal ,1)v): x\in \mathbb {B}^d\}\) is at most d [60, Proposition 4.20]. Thus, we have the bound \(\mathcal {E}_0(N) \lesssim \sqrt{d/N}\) by [60, Example 5.24]. Hence, f is in the closure of \(\cup _{N\in \mathbb {N}} \mathcal {F}_{\sigma _k}(N,1)\).

Next, we show that \(\widetilde{\mathcal {F}}_{\sigma _k}(1) \subseteq \mathcal {F}_{\sigma _k}(1)\) for \(k\in \mathbb {N}_0\). Since \(\cup _{N\in \mathbb {N}} \mathcal {F}_{\sigma _k}(N,1) \subseteq \mathcal {F}_{\sigma _k}(1)\), we only need to show that \(\mathcal {F}_{\sigma _k}(1)\) is closed in \(L^\infty (\mathbb {B}^d)\). Let \(f_n(x)= \int _{\mathbb {S}^d} \sigma _k((x^\intercal ,1)v) d\mu _n(v)\), where \(\Vert \mu _n\Vert \le 1\), be a convergent sequence with limit \(f \in L^\infty (\mathbb {B}^d)\). It remains to show that \(f\in \mathcal {F}_{\sigma _k}(1)\).

For \(k\in \mathbb {N}\), by the compactness of \(\mathbb {S}^d\) and Prokhorov’s theorem, there exists a weakly convergent subsequence \(\mu _{n_i} \rightarrow \mu \). In particular, \(\Vert \mu \Vert \le 1\) and for any \(x\in \mathbb {B}^d\),

$$\begin{aligned} \lim _{n_i\rightarrow \infty } \int _{\mathbb {S}^d} \sigma _k((x^\intercal ,1)v) d\mu _{n_i}(v) = \int _{\mathbb {S}^d} \sigma _k((x^\intercal ,1)v) d\mu (v) =:\widetilde{f}. \end{aligned}$$

By the compactness of \(\mathbb {B}^d\), \(f_{n_i}\) converges uniformly to \(\widetilde{f}\). Hence \(f= \widetilde{f} \in \mathcal {F}_{\sigma _k}(1)\).

For \(k=0\), we use the idea from [58, Lemma 3]. We can view \(f_n\) as a Bochner integral \(\int _\mathbb {D}i_{\mathbb {D}\rightarrow L^2(\mathbb {B}^d)} d\mu _n\) of the inclusion map \(i_{\mathbb {D}\rightarrow L^2(\mathbb {B}^d)}\), where \(\mathbb {D}:=\{g_v(x)=\sigma _0((x^\intercal ,1)v): v\in \mathbb {S}^d \}\). Notice that the set \(\mathbb {D}\subseteq L^2(\mathbb {B}^d)\) is compact, because the mapping \(v\mapsto g_v\) is continuous. By Prokhorov’s theorem, there exists a weakly convergent subsequence \(\mu _{n_i} \rightarrow \mu \). Let us denote \(\widetilde{f}= \int _\mathbb {D}i_{\mathbb {D}\rightarrow L^2(\mathbb {B}^d)} d\mu \), then \(\widetilde{f} \in \mathcal {F}_{\sigma _k}(1)\) by viewing the Bochner integral as an integral over \(\mathbb {S}^d\). If we choose a countable dense sequence \(\{g_j\}_{j=1}^\infty \) of \(L^2(\mathbb {B}^d)\), then the weak convergence implies that

$$\begin{aligned} \lim _{n_i\rightarrow \infty } \langle g_j, f_{n_i} \rangle _{L^2(\mathbb {B}^d)} = \left\langle g_j, \widetilde{f} \right\rangle _{L^2(\mathbb {B}^d)}, \end{aligned}$$

for all j. The strong convergence \(f_{n_i} \rightarrow f\) in \(L^\infty (\mathbb {B}^d)\) implies that the same equality holds with f in place of \(\widetilde{f}\). Therefore, \(\langle g_j, f \rangle _{L^2(\mathbb {B}^d)} = \langle g_j, \widetilde{f} \rangle _{L^2(\mathbb {B}^d)}\) for all j, which shows \(f= \widetilde{f} \in \mathcal {F}_{\sigma _k}(1)\). \(\square \)
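The sampling argument in the first half of the proof can be illustrated numerically. The following sketch (purely illustrative; the choice \(\mu =\tau _d\) and all sample sizes are ad hoc) draws N i.i.d. directions from the uniform measure on \(\mathbb {S}^d\) and checks that the empirical average \(\frac{1}{N}\sum _{i=1}^N \sigma _k((x^\intercal ,1)v_i)\) approaches \(f(x)=\int _{\mathbb {S}^d} \sigma _k((x^\intercal ,1)v) d\mu (v)\) uniformly at roughly the rate \(N^{-1/2}\); the integral itself is replaced by a very large reference sample.

```python
# Monte Carlo illustration of the O(N^{-1/2}) discretization of an infinite-width network.
import numpy as np

rng = np.random.default_rng(1)
d, k = 3, 1

def sphere(m):                                   # m i.i.d. uniform points on S^d
    V = rng.normal(size=(m, d + 1))
    return V / np.linalg.norm(V, axis=1, keepdims=True)

def ball(m):                                     # m test points in B^d
    Z = rng.normal(size=(m, d))
    R = rng.uniform(size=(m, 1)) ** (1 / d)
    return R * Z / np.linalg.norm(Z, axis=1, keepdims=True)

def empirical_net(X, V):                         # (1/|V|) sum_i sigma_k((x^T,1) v_i)
    X1 = np.hstack([X, np.ones((len(X), 1))])
    out = np.zeros(len(X))
    for block in np.array_split(V, max(1, len(V) // 10_000)):  # chunk over neurons
        out += np.sum(np.maximum(X1 @ block.T, 0.0) ** k, axis=1)
    return out / len(V)

X = ball(1000)
f_ref = empirical_net(X, sphere(400_000))        # large-sample stand-in for the integral f

for N in [100, 400, 1600, 6400]:
    err = np.max(np.abs(empirical_net(X, sphere(N)) - f_ref))
    print(N, float(err), float(err * np.sqrt(N)))  # last column should stay bounded
```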

The proof of Proposition 2.2 actually shows the approximation rate \(\mathcal {O}(N^{-1/2})\) for the subclass \(\mathcal {F}_{\sigma _k}(N,1)\). This rate can be improved if we take into account the smoothness of the activation function. For the ReLU activation, Bach [3, Proposition 1] showed that approximating \(f\in \mathcal {F}_{\sigma }(1)\) by neural networks with finitely many neurons is essentially equivalent to the approximation of a zonoid by zonotopes [10, 37]. Using this equivalence, he obtained the rate \(\mathcal {O}(N^{-\frac{1}{2}-\frac{3}{2d}})\) for ReLU neural networks. A similar idea was applied to the Heaviside activation in [32, Theorem 4], which proved the rate \(\mathcal {O}(N^{-\frac{1}{2}-\frac{1}{2d}})\) for such an activation function. For ReLU\(^k\) neural networks, the general approximation rate \(\mathcal {O}(N^{-\frac{1}{2} - \frac{2k+1}{2d}})\) was established in the \(L^2\) norm by [57], which also showed that this rate is sharp. The recent work [55] further proved that this rate indeed holds in the uniform norm. We summarize their results in the following lemma.

Lemma 2.3

( [55]) For \(k\in \mathbb {N}_0\) and \(d\in \mathbb {N}\), it holds that

$$\begin{aligned} \sup _{f \in \mathcal {F}_{\sigma _k}(1)}\inf _{f_N \in \mathcal {F}_{\sigma _k}(N,1)} \Vert f-f_N\Vert _{L^\infty (\mathbb {B}^d)} \lesssim N^{-\frac{1}{2} - \frac{2k+1}{2d}}. \end{aligned}$$

Combining Theorem 2.1 and Lemma 2.3, we can derive the rate of approximation by shallow neural network \(\mathcal {F}_{\sigma _k}(N,M)\) for Hölder class \(\mathcal {H}^\alpha \). Recall that we use the notation \(a\vee b:= \max \{a,b\}\).

Corollary 2.4

Let \(k\in \mathbb {N}_0\), \(d\in \mathbb {N}\) and \(\alpha >0\).

  1.

    If \(\alpha > (d+2k+1)/2\) or \(\alpha = (d+2k+1)/2\) is an even integer, then there exists a constant M depending on \(k,d,\alpha \) such that

    $$\begin{aligned} \sup _{h\in \mathcal {H}^\alpha } \inf _{f\in \mathcal {F}_{\sigma _k}(N,M)} \Vert h-f\Vert _{L^\infty (\mathbb {B}^d)} \lesssim N^{-\frac{1}{2} - \frac{2k+1}{2d}}. \end{aligned}$$
  2.

    If \(\alpha = (d+2k+1)/2\) is not an even integer, then there exists \(M \asymp \sqrt{\log N}\) such that

    $$\begin{aligned} \sup _{h\in \mathcal {H}^\alpha } \inf _{f\in \mathcal {F}_{\sigma _k}(N,M)} \Vert h-f\Vert _{L^\infty (\mathbb {B}^d)} \lesssim N^{-\frac{1}{2} - \frac{2k+1}{2d}} \sqrt{\log N}. \end{aligned}$$
  3.

    If \(\alpha < (d+2k+1)/2\), then

    $$\begin{aligned} \sup _{h\in \mathcal {H}^\alpha } \inf _{f\in \mathcal {F}_{\sigma _k}(N,M)} \Vert h-f\Vert _{L^\infty (\mathbb {B}^d)} \lesssim N^{-\frac{\alpha }{d}} \vee M^{-\frac{2\alpha }{d+2k+1-2\alpha }}. \end{aligned}$$

    Thus, the rate \(\mathcal {O}(N^{-\alpha /d})\) holds when \(M\gtrsim N^{(d+2k+1-2\alpha )/(2d)}\).

Proof

We only present the proof for part (3), since the other parts can be derived similarly. If \(\alpha < (d+2k+1)/2\), then by Theorem 2.1, for any \(h\in \mathcal {H}^\alpha \), there exists \(g\in \mathcal {F}_{\sigma _k}(K)\) such that \(\Vert h-g\Vert _{L^\infty (\mathbb {B}^d)} \lesssim K^{-\frac{2\alpha }{d+2k+1-2\alpha }}\). Then, by Lemma 2.3, there exists \(f\in \mathcal {F}_{\sigma _k}(N,K)\) such that \(\Vert g-f\Vert _{L^\infty (\mathbb {B}^d)} \lesssim KN^{-\frac{1}{2} - \frac{2k+1}{2d}}\). If \(M \ge N^{\frac{d+2k+1-2\alpha }{2d}}\), we choose \(K=N^{\frac{d+2k+1-2\alpha }{2d}}\); then \(f\in \mathcal {F}_{\sigma _k}(N,K) \subseteq \mathcal {F}_{\sigma _k}(N,M)\) and

$$\begin{aligned} \Vert h-f\Vert _{L^\infty (\mathbb {B}^d)}&\le \Vert h-g\Vert _{L^\infty (\mathbb {B}^d)} + \Vert g-f\Vert _{L^\infty (\mathbb {B}^d)} \\&\lesssim K^{-\frac{2\alpha }{d+2k+1-2\alpha }} + KN^{-\frac{d+2k+1}{2d}} \lesssim N^{-\frac{\alpha }{d}}. \end{aligned}$$

If \(M \le N^{\frac{d+2k+1-2\alpha }{2d}}\), we choose \(K=M\), then

$$\begin{aligned} \Vert h-f\Vert _{L^\infty (\mathbb {B}^d)} \lesssim K^{-\frac{2\alpha }{d+2k+1-2\alpha }} + KN^{-\frac{d+2k+1}{2d}} \lesssim M^{-\frac{2\alpha }{d+2k+1-2\alpha }}. \end{aligned}$$

Combining the two bounds gives the desired result. \(\square \)
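For concreteness, the exponent arithmetic behind the first case is the following: with \(K=N^{\frac{d+2k+1-2\alpha }{2d}}\),

$$\begin{aligned} K^{-\frac{2\alpha }{d+2k+1-2\alpha }} = N^{-\frac{\alpha }{d}} \quad \text{ and } \quad KN^{-\frac{d+2k+1}{2d}} = N^{\frac{d+2k+1-2\alpha }{2d}-\frac{d+2k+1}{2d}} = N^{-\frac{\alpha }{d}}, \end{aligned}$$

so both error terms are of order \(N^{-\alpha /d}\), as used in the display above.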

We make some comments on the approximation rate for \(\mathcal {H}^\alpha \) with \(\alpha < (d+2k+1)/2\). As shown by [48, Corollary 6.10], the rate \(\mathcal {O}(N^{-\alpha /d})\) in the \(L^2\) norm is already known for \(\alpha =1,2,\dots ,(d+2k+1)/2\). For the ReLU activation, the recent paper [36] obtained the rate \(\mathcal {O}(N^{-\frac{\alpha }{d} \frac{d+2}{d+4}})\) in the supremum norm. Corollary 2.4 shows that the rate \(\mathcal {O}(N^{-\alpha /d})\) holds in the supremum norm for all ReLU\(^k\) activations. More importantly, we also provide explicit control on the network weights to ensure that this rate can be achieved, which is useful for estimating generalization errors (see Sect. 4.2). It is well-known that the optimal approximation rate for \(\mathcal {H}^\alpha \) is \(\mathcal {O}(N^{-\alpha /d})\), if we approximate \(h\in \mathcal {H}^\alpha \) by a function class with N parameters and the parameters depend continuously on the target function h [13]. However, this result is not directly applicable to neural networks, because we have no guarantee that the parameters in the network depend continuously on the target function (in fact, this is not true for some constructions [31, 63, 65]). Nevertheless, one can still prove that the rate \(\mathcal {O}(N^{-\alpha /d})\) is optimal for shallow ReLU\(^k\) neural networks by arguments based on the pseudo-dimension, as done in [31, 63, 64].

We describe the idea of proving approximation lower bounds through pseudo-dimension by reviewing the result of Maiorov and Ratsaby [33] (see also [1]). Recall that the pseudo-dimension \(\,\textrm{Pdim}\,(\mathcal {F})\) of a real-valued function class \(\mathcal {F}\) defined on \(\mathbb {B}^d\) is the largest integer n for which there exist points \(x_1,\dots ,x_n \in \mathbb {B}^d\) and constants \(c_1,\dots ,c_n\in \mathbb {R}\) such that

$$\begin{aligned} \left| \{ \,\textrm{sgn}\,(f(x_1)-c_1),\dots ,\,\textrm{sgn}\,(f(x_n)-c_n): f\in \mathcal {F}\}\right| =2^n. \end{aligned}$$
(2.2)
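Condition (2.2) can be made concrete with a brute-force check. The sketch below (purely illustrative; the helper `realized_patterns`, the grid of directions, and the chosen points and thresholds are all ad hoc) enumerates the sign patterns realized by a finite sample of single ReLU neurons on \([-1,1]\); realizing all \(2^n\) patterns certifies that the pseudo-dimension of the class is at least n.

```python
# Brute-force witness check for pseudo-shattering in the sense of (2.2).
import numpy as np

def realized_patterns(funcs, xs, cs):
    """Set of sign patterns (f(x_i) > c_i)_{i} realized by the given functions."""
    return {tuple(f(x) > c for x, c in zip(xs, cs)) for f in funcs}

# class F (discretized): single ReLU neurons t -> max(t*cos(a) + sin(a), 0) on [-1,1]
angles = np.linspace(0.0, 2 * np.pi, 2000, endpoint=False)
funcs = [lambda t, a=a: max(t * np.cos(a) + np.sin(a), 0.0) for a in angles]

xs = [-0.5, 0.5]                 # candidate points in B^1
cs = [0.1, 0.1]                  # candidate thresholds
pats = realized_patterns(funcs, xs, cs)
print(len(pats), "of", 2 ** len(xs), "sign patterns realized")   # expect 4 of 4
```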

Maiorov and Ratsaby [33] introduced a nonlinear n-width defined as

$$\begin{aligned} \rho _n(\mathcal {H}^\alpha ) = \inf _{\mathcal {F}_n} \sup _{h\in \mathcal {H}^\alpha } \inf _{f\in \mathcal {F}_n} \Vert h-f\Vert _{L^p(\mathbb {B}^d)}, \end{aligned}$$

where \(p\in [1,\infty ]\) and \(\mathcal {F}_n\) runs over all the classes in \(L^p(\mathbb {B}^d)\) with \(\,\textrm{Pdim}\,(\mathcal {F}_n)\le n\). They constructed a well-separated subclass of \(\mathcal {H}^\alpha \) such that if a function class \(\mathcal {F}\) can approximate this subclass with small error, then \(\,\textrm{Pdim}\,(\mathcal {F})\) should be large. In other words, the approximation error of any class \(\mathcal {F}_n\) with \(\,\textrm{Pdim}\,(\mathcal {F}_n)\le n\) can be lower bounded. Consequently, they proved that

$$\begin{aligned} \rho _n(\mathcal {H}^\alpha ) \gtrsim n^{-\alpha /d}. \end{aligned}$$

By [6], we can upper bound the pseudo-dimension of shallow ReLU\(^k\) neural networks as \(n:= \,\textrm{Pdim}\,(\mathcal {F}_{\sigma _k}(N,M)) \lesssim N \log N\). Hence,

$$\begin{aligned} \sup _{h\in \mathcal {H}^\alpha } \inf _{f_N\in \mathcal {F}_{\sigma _k}(N,M)} \Vert h-f_N\Vert _{L^p(\mathbb {B}^d)} \ge \rho _n(\mathcal {H}^\alpha ) \gtrsim (N \log N)^{-\alpha /d}, \end{aligned}$$

which shows that the rate \(\mathcal {O}(N^{-\alpha /d})\) in Corollary 2.4 is optimal in the \(L^p\) norm (ignoring logarithmic factors). This also implies the optimality of Theorem 2.1 (otherwise, the proof of Corollary 2.4 would give a rate better than \(\mathcal {O}(N^{-\alpha /d})\)).

3 Proof of Theorem 2.1

Following the idea of [3], we first transfer the problem to approximation on spheres. Let us begin with a brief review of harmonic analysis on spheres [12]. For \(n\in \mathbb {N}_0\), the spherical harmonic space \(\mathbb {Y}_n\) of degree n is the linear space that contains the restrictions of real harmonic homogeneous polynomials of degree n on \(\mathbb {R}^{d+1}\) to the sphere \(\mathbb {S}^d\). The dimension of \(\mathbb {Y}_n\) is \(N(d,n):= \frac{2n+d-1}{n} \left( {\begin{array}{c}n+d-2\\ d-1\end{array}}\right) \) if \(n\ne 0\) and \(N(d,n):=1\) if \(n=0\). Spherical harmonics are eigenfunctions of the Laplace-Beltrami operator:

$$\begin{aligned} \Delta Y_n = -n(n+d-1)Y_n, \quad Y_n\in \mathbb {Y}_n, \end{aligned}$$

where in the coordinates \(u=(u_1,\dots ,u_{d+1}) \in \mathbb {S}^d\),

$$\begin{aligned} \Delta = \sum _{i=1}^{d+1} \frac{\partial ^2}{\partial u_i^2} - \sum _{i=1}^{d+1} \sum _{j=1}^{d+1} u_iu_j \frac{\partial ^2}{\partial u_i \partial u_j} - d \sum _{i=1}^{d+1} u_i \frac{\partial }{\partial u_i}. \end{aligned}$$

Spherical harmonics of different degrees are orthogonal with respect to the inner product \(\langle f,g \rangle = \int _{\mathbb {S}^d} f(u)g(u) d\tau _d(u)\), where \(\tau _d\) is the surface area measure of \(\mathbb {S}^d\) (normalized by the surface area \(\omega _d:= 2\pi ^{(d+1)/2}/\Gamma ((d+1)/2)\) so that \(\tau _d(\mathbb {S}^d)=1\)).
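As a quick sanity check of the eigenfunction relation and the coordinate expression of \(\Delta \) above, the following symbolic sketch (purely illustrative) applies the operator to the harmonic homogeneous polynomial \(Y(u)=u_1u_2\) of degree \(n=2\) for \(d=2\) and recovers the eigenvalue \(-n(n+d-1)\).

```python
# Symbolic check that Delta Y = -n(n+d-1) Y for a degree-2 harmonic polynomial on S^2.
import sympy as sp

d, n = 2, 2
u = sp.symbols(f'u1:{d + 2}')                          # ambient coordinates u_1,...,u_{d+1}
Y = u[0] * u[1]                                        # harmonic, homogeneous of degree 2

laplacian = sum(sp.diff(Y, ui, 2) for ui in u)
mixed = sum(ui * uj * sp.diff(Y, ui, uj) for ui in u for uj in u)
radial = sum(ui * sp.diff(Y, ui) for ui in u)
LB = sp.expand(laplacian - mixed - d * radial)         # Laplace-Beltrami applied to Y

print(sp.simplify(LB + n * (n + d - 1) * Y))           # prints 0
```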

Let \(\mathcal {P}_n:L^2(\mathbb {S}^d)\rightarrow \mathbb {Y}_n\) denote the orthogonal projection operator. For any orthonormal basis \(\{Y_{nj}:1\le j\le N(d,n)\}\) of \(\mathbb {Y}_n\), the addition formula [12, Theorem 1.2.6] shows

$$\begin{aligned} \sum _{j=1}^{N(d,n)} Y_{nj}(u) Y_{nj}(v) = N(d,n) P_n(u^\intercal v), \quad u,v \in \mathbb {S}^d, \end{aligned}$$
(3.1)

where \(P_n\) is the Gegenbauer polynomial

$$\begin{aligned} P_n(t):= \frac{(-1)^n}{2^n} \frac{\Gamma (d/2)}{\Gamma (n+d/2)} (1-t^2)^{(2-d)/2} \left( \frac{d}{dt}\right) ^n (1-t^2)^{n+(d-2)/2}, \quad t\in [-1,1], \end{aligned}$$

with normalization \(P_n(1)=1\). Applying the Cauchy-Schwarz inequality to (3.1), we get \(|P_n(t)|\le 1\). For \(n \ne 0\), \(P_n(t)\) is odd (even) if n is odd (even). Note that, for \(d=1\) and \(n\ne 0\), \(N(d,n)=2\) and \(P_n(t)\) is the Chebyshev polynomial such that \(P_n(\cos \theta )=\cos (n \theta )\). We can write the projection \(\mathcal {P}_n\) as

$$\begin{aligned} \mathcal {P}_n f(u) = N(d,n) \int _{\mathbb {S}^d} f(v) P_n(u^\intercal v) d\tau _d(v). \end{aligned}$$

This motivates the following definition of a convolution operator on the sphere.

Definition 3.1

(Convolution) Let \(\varrho \) be the probability distribution with density \(c_d (1-t^2)^{(d-2)/2}\) on \([-1,1]\), with the constant \(c_d =( \int _{-1}^1 (1-t^2)^{(d-2)/2}dt)^{-1} = \omega _{d-1}/\omega _d\). For \(f\in L^1(\mathbb {S}^d)\) and \(g\in L^1_\varrho ([-1,1])\), define

$$\begin{aligned} (f*g)(u):= \int _{\mathbb {S}^d} f(v) g(u^\intercal v) d\tau _d(v), \quad u\in \mathbb {S}^d. \end{aligned}$$

The convolution on the sphere satisfies Young’s inequality [12, Theorem 2.1.2]: for \(p,q,r \ge 1\) with \(p^{-1}= q^{-1}+r^{-1}-1\), it holds

$$\begin{aligned} \Vert f*g\Vert _{L^p(\mathbb {S}^d)} \le \Vert f\Vert _{L^q(\mathbb {S}^d)} \Vert g\Vert _{L^r_\varrho ([-1,1])}, \end{aligned}$$

where the norm is the uniform one when \(r=\infty \). Observe that the projection \(\mathcal {P}_n f = f*(N(d,n) P_n)\) is a convolution operator with \(\Vert N(d,n) P_n\Vert _{L^\infty ([-1,1])} \le N(d,n)\). Furthermore, for \(g\in L^1_\varrho ([-1,1])\), let \(\widehat{g}(n)\) denote the Fourier coefficient of g with respect to the Gegenbauer polynomials,

$$\begin{aligned} \widehat{g}(n):= \frac{\omega _{d-1}}{\omega _d} \int _{-1}^1 g(t) P_n(t) (1-t^2)^{(d-2)/2} dt. \end{aligned}$$

By the Funk-Hecke formula, one can show that [12, Theorem 2.1.3]

$$\begin{aligned} \mathcal {P}_n(f*g) = \widehat{g}(n) \mathcal {P}_n f, \quad f\in L^1(\mathbb {S}^d), n\in \mathbb {N}_0. \end{aligned}$$
(3.2)

This identity is analogous to the Fourier transform of ordinary convolution.

One of the key steps in our proof of Theorem 2.1 is the observation that functions of the form \(f(u)= \int _{\mathbb {S}^d} \phi (v) \sigma _k(u^\intercal v) d\tau _d(v)\) are convolutions \(\phi *\sigma _k\) with the activation function \(\sigma _k\in L^\infty ([-1,1])\). The Fourier coefficients \(\widehat{\sigma _k}(n)\) were computed explicitly in [3, Appendix D.2]. We summarize the result in the following proposition.

Proposition 3.2

For \(k\in \mathbb {N}_0\), \(\widehat{\sigma _k}(n)=0\) if and only if \(n\ge k+1\) and \(n \equiv k\bmod 2\). If \(n=0\),

$$\begin{aligned} \widehat{\sigma _k}(0) = \frac{\omega _{d-1}}{\omega _d} \frac{\Gamma (d/2)\Gamma ((k+1)/2)}{2\Gamma ((k+d+1)/2)}. \end{aligned}$$

If \(n\ge k+1\) and \(n+1 \equiv k\bmod 2\),

$$\begin{aligned} \widehat{\sigma _k}(n) = \frac{\omega _{d-1}}{\omega _d} \frac{k!(-1)^{(n-k-1)/2}}{2^n} \frac{\Gamma (d/2)\Gamma (n-k)}{\Gamma ((n-k+1)/2) \Gamma ((n+d+k+1)/2)}. \end{aligned}$$

By the Stirling formula \(\Gamma (x) = \sqrt{2\pi } x^{x-1/2}e^{-x}(1+\mathcal {O}(x^{-1}))\), we have \(|\widehat{\sigma _k}(n)| \asymp n^{-(d+2k+1)/2}\) for \(n\in \mathbb {N}\) satisfying \(\widehat{\sigma _k}(n)\ne 0\).
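Proposition 3.2 can also be checked numerically from the definition of \(\widehat{g}(n)\) above. The following sketch (purely illustrative; it assumes \(d\ge 2\) and rescales scipy's Gegenbauer polynomials \(C_n^{((d-1)/2)}\) so that \(P_n(1)=1\)) compares quadrature values of \(\widehat{\sigma _k}(n)\) with the closed form, including the vanishing pattern.

```python
# Quadrature check of the Gegenbauer coefficients of sigma_k (Proposition 3.2).
from math import factorial

from scipy.integrate import quad
from scipy.special import gegenbauer, gamma

def P(n, d):
    """Gegenbauer polynomial of degree n on [-1,1], normalized so that P_n(1) = 1."""
    C = gegenbauer(n, (d - 1) / 2)
    return lambda t: C(t) / C(1)

def c_d(d):
    """c_d = omega_{d-1}/omega_d = 1 / int_{-1}^1 (1-t^2)^{(d-2)/2} dt."""
    return 1.0 / quad(lambda t: (1 - t * t) ** ((d - 2) / 2), -1, 1)[0]

def sigma_hat_quad(k, n, d):
    """sigma_k-hat(n) by numerical integration (sigma_k vanishes on [-1,0))."""
    Pn = P(n, d)
    val, _ = quad(lambda t: t ** k * Pn(t) * (1 - t * t) ** ((d - 2) / 2), 0, 1)
    return c_d(d) * val

def sigma_hat_closed(k, n, d):
    """Closed form from Proposition 3.2 (stated for n = 0 or n >= k + 1)."""
    if n == 0:
        return c_d(d) * gamma(d / 2) * gamma((k + 1) / 2) / (2 * gamma((k + d + 1) / 2))
    if n >= k + 1 and (n - k) % 2 == 0:
        return 0.0
    if n >= k + 1:                      # here n + 1 = k (mod 2)
        sign = (-1) ** ((n - k - 1) // 2)
        return (c_d(d) * factorial(k) * sign / 2 ** n * gamma(d / 2) * gamma(n - k)
                / (gamma((n - k + 1) / 2) * gamma((n + d + k + 1) / 2)))
    return float('nan')                 # 1 <= n <= k is not covered by the proposition

d, k = 3, 1                              # ReLU on the sphere S^3
for n in range(7):
    print(n, round(sigma_hat_quad(k, n, d), 8), round(sigma_hat_closed(k, n, d), 8))
```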

Next, we introduce the smoothness of functions on the sphere. For \(0\le \theta \le \pi \), the translation operator \(T_\theta \), also called spherical mean operator, is defined by

$$\begin{aligned} T_\theta f(u):= \int _{\mathbb {S}^\perp _u} f(u\cos \theta + v\sin \theta ) d\tau _{d-1}(v), \quad u\in \mathbb {S}^d, f\in L^1(\mathbb {S}^d), \end{aligned}$$

where \(\mathbb {S}^\perp _u:= \{v\in \mathbb {S}^d:u^\intercal v =0\}\) is the equator in \(\mathbb {S}^d\) with respect to u (hence \(\mathbb {S}^\perp _u\) is isomorphic to the sphere \(\mathbb {S}^{d-1}\)). We note that the translation operator satisfies \(\mathcal {P}_n(T_\theta f) = P_n(\cos \theta ) \mathcal {P}_n(f)\). For \(\alpha >0\) and \(0<\theta <\pi \), we define the \(\alpha \)-th order difference operator

$$\begin{aligned} \Delta _\theta ^\alpha := (I-T_\theta )^{\alpha /2} = \sum _{j=0}^{\infty } (-1)^j \left( {\begin{array}{c}\alpha /2\\ j\end{array}}\right) T_\theta ^j, \end{aligned}$$

where \(\left( {\begin{array}{c}\alpha \\ j\end{array}}\right) =\frac{\alpha (\alpha -1)\cdots (\alpha -j+1)}{j!}\), in a distributional sense by \(\mathcal {P}_n(\Delta _\theta ^\alpha f) = (1-P_n(\cos \theta ))^{\alpha /2} \mathcal {P}_n f\), \(n\in \mathbb {N}_0\). For \(f\in L^p(\mathbb {S}^d)\) and \(1\le p<\infty \) or \(f\in C(\mathbb {S}^d)\) and \(p=\infty \), the \(\alpha \)-th order modulus of smoothness is defined by

$$\begin{aligned} \omega _\alpha (f,t)_p:= \sup _{0<\theta \le t} \Vert \Delta _\theta ^\alpha f\Vert _{L^p(\mathbb {S}^d)}, \quad 0<t<\pi . \end{aligned}$$

For even integers \(\alpha =2s\), one can also use combinations of \(T_{j\theta }\) and obtain [15, 50]

$$\begin{aligned} \omega _{2s}(f,t)_p \asymp \sup _{0<\theta \le t} \left\| \sum _{j=0}^{2s} (-1)^j \left( {\begin{array}{c}2s\\ j\end{array}}\right) T_{j\theta } f \right\| _{L^p(\mathbb {S}^d)}, \quad s\in \mathbb {N}. \end{aligned}$$
(3.3)

Another way to characterize the smoothness is through the K-functionals. We first introduce the fractional Sobolev space induced by the Laplace-Beltrami operator. We say a function \(f\in L^p(\mathbb {S}^d)\) belongs to the Sobolev space \(\mathcal {W}^{\alpha ,p}(\mathbb {S}^d)\) if there exists a function in \(L^p(\mathbb {S}^d)\), which will be denoted by \((-\Delta )^{\alpha /2}f\), such that

$$\begin{aligned} \mathcal {P}_n((-\Delta )^{\alpha /2}f) = (n(n+d-1))^{\alpha /2} \mathcal {P}_n f,\quad n\in \mathbb {N}_0, \end{aligned}$$

where we assume \(f,(-\Delta )^{\alpha /2}f\in C(\mathbb {S}^d)\) for \(p=\infty \). Then we can define the \(\alpha \)-th K-functional of \(f\in L^p(\mathbb {S}^d)\) as

$$\begin{aligned} K_\alpha (f,t)_p:= \inf _{g\in \mathcal {W}^{\alpha ,p}(\mathbb {S}^d)} \left\{ \Vert f-g\Vert _{L^p(\mathbb {S}^d)} + t^\alpha \Vert (-\Delta )^{\alpha /2} g\Vert _{L^p(\mathbb {S}^d)} \right\} , \quad t>0. \end{aligned}$$

It can be shown [12, Theorem 10.4.1] that the moduli of smoothness and the K-functional are equivalent:

$$\begin{aligned} \omega _\alpha (f,t)_p \asymp K_\alpha (f,t)_p. \end{aligned}$$
(3.4)

To prove Theorem 2.1, we denote the function class

$$\begin{aligned} \mathcal {G}_{\sigma _k}(M):= \left\{ g\in L^\infty (\mathbb {S}^d): g(u) = \int _{\mathbb {S}^d} \sigma _k(u^\intercal v) d\mu (v), \Vert \mu \Vert \le M \right\} , \end{aligned}$$

as the corresponding function class of \(\mathcal {F}_{\sigma _k}(M)\) on \(\mathbb {S}^d\). Abusing notation, we will also write \(\gamma (g) =\inf _{\mu } \Vert \mu \Vert \) for the variation norm of \(g\in \mathcal {G}_{\sigma _k}(M)\). The next proposition transfers our approximation problem on the unit ball \(\mathbb {B}^d\) to that on the sphere \(\mathbb {S}^d\).

Proposition 3.3

Let \(k\in \mathbb {N}_0\), \(d\in \mathbb {N}\) and \(\alpha =r+\beta \) where \(r\in \mathbb {N}_0\) and \(\beta \in (0,1]\). Denote \(\Omega := \{(u_1,\dots ,u_{d+1})^\intercal \in \mathbb {S}^d: u_{d+1}\ge 1/\sqrt{2} \}\) and define an operator \(S_k:L^\infty (\Omega ) \rightarrow L^\infty (\mathbb {B}^d)\) by

$$\begin{aligned} S_kg(x):= (\Vert x\Vert _2^2+1)^{k/2} g\left( \frac{1}{\sqrt{\Vert x\Vert ^2+1}} \begin{pmatrix} x \\ 1 \end{pmatrix} \right) ,\quad x\in \mathbb {B}^d. \end{aligned}$$

The operator \(S_k\) satisfies: (1) If \(g\in \mathcal {G}_{\sigma _k}(M)\), then \(S_kg\in \mathcal {F}_{\sigma _k}(M)\). (2) For any \(h\in \mathcal {H}^\alpha \), there exists \(\widetilde{h}\in C(\mathbb {S}^d)\) such that \(S_k \widetilde{h} =h\), \(\Vert \widetilde{h}\Vert _{L^\infty (\mathbb {S}^d)}\le C\) and \(\omega _{2s^*}(\widetilde{h},t)_\infty \le C t^{\alpha }\), where \(s^*\in \mathbb {N}\) is the smallest integer such that \(\alpha \le 2s^*\) and C is a constant independent of h. Furthermore, \(\widetilde{h}\) can be chosen to be odd or even.

Proof

  1.

    If \(g(u) = \int _{\mathbb {S}^d} \sigma _k(u^\intercal v) d\mu (v)\) for some \(\Vert \mu \Vert \le M\), then for \(x\in \mathbb {B}^d\),

    $$\begin{aligned} S_kg(x)&= (\Vert x\Vert _2^2+1)^{k/2} \int _{\mathbb {S}^d} \sigma _k\left( (\Vert x\Vert _2^2+1)^{-1/2}(x^\intercal ,1) v\right) d\mu (v) \\&= \int _{\mathbb {S}^d} \sigma _k\left( (x^\intercal ,1) v\right) d\mu (v). \end{aligned}$$

    Hence, \(S_kg \in \mathcal {F}_{\sigma _k}(M)\) by definition.

  2.

    Given \(h\in \mathcal {H}^\alpha \), for any \(u=(u_1,\dots ,u_{d+1})^\intercal \in \Omega \), we define \(\widetilde{h}(u):= u_{d+1}^k h(u_{d+1}^{-1}u')\), where \(u'=(u_1,\dots ,u_d)^\intercal \). It is easy to check that \(S_k \widetilde{h} =h\). Note that h is completely determined by the function values of \(\widetilde{h}\) on \(\Omega \). Observe that the smoothness of \(\widetilde{h}\) on \(\Omega \) can be controlled by the smoothness of h. We can extend \(\widetilde{h}\) to \(\mathbb {R}^{d+1}\) so that \(\Vert \widetilde{h}\Vert _{C^{r,\beta }(\mathbb {R}^{d+1})} \le C_0\) for some constant \(C_0\) independent of h, by using (refined version of) Whitney’s extension theorem [17,18,19]. It remains to show that \(\omega _{2\,s^*}(\widetilde{h},t)_\infty \lesssim t^{\alpha }\). By the equivalence (3.3),

    $$\begin{aligned} \omega _{2s^*}(\widetilde{h},t)_\infty&\lesssim \sup _{0<\theta \le t} \sup _{u\in \mathbb {S}^d} \left| \sum _{j=0}^{2s^*} (-1)^j \left( {\begin{array}{c}2s^*\\ j\end{array}}\right) \int _{\mathbb {S}^\perp _u} \widetilde{h}(u\cos j\theta + v\sin j\theta ) d\tau _{d-1}(v) \right| \\&\le \sup _{0<\theta \le t} \sup _{u\in \mathbb {S}^d} \sup _{v\in \mathbb {S}^\perp _u} \left| \sum _{j=0}^{2s^*} (-1)^j \left( {\begin{array}{c}2s^*\\ j\end{array}}\right) \widetilde{h}(u\cos j\theta + v\sin j\theta ) \right| \\&=: \sup _{0<\theta \le t} \sup _{u\in \mathbb {S}^d} \sup _{v\in \mathbb {S}^\perp _u} |H(u,v,\theta )|. \end{aligned}$$

    Next, we estimate \(\sup _{0<\theta \le t}|H(u,v,\theta )|\) for small \(t>0\) and fixed \(u,v\). One can check that the function \(f(\cdot ):= \widetilde{h}(u\cos (\cdot ) + v\sin (\cdot ))\) is in \(C^{r,\beta }([0,t_0])\) for small \(t_0>0\), and \(\Vert f\Vert _{C^{r,\beta }([0,t_0])} \lesssim \Vert \widetilde{h}\Vert _{C^{r,\beta }(\mathbb {R}^{d+1})}\). Let \(\widetilde{\Delta }_\theta f(\cdot ):= f(\cdot +\theta ) - f(\cdot )\) be the difference operator and \(\widetilde{\Delta }_\theta ^{n+1}:= \widetilde{\Delta }_\theta \widetilde{\Delta }_\theta ^n\) for \(n\in \mathbb {N}\). The binomial theorem shows \(H(u,v,\theta ) = \widetilde{\Delta }_\theta ^{2\,s^*}f(0)\). Then, the classical theory of moduli of smoothness [14, Chapters 2.6–2.9] implies

    $$\begin{aligned} \sup _{0<\theta \le t}|H(u,v,\theta )| \le \sup _{0<\theta \le t}|\widetilde{\Delta }_\theta ^{2s^*}f(0)| \lesssim \Vert f\Vert _{C^{r,\beta }([0,t_0])} t^\alpha . \end{aligned}$$

    Consequently, we get the desired bound \(\omega _{2s^*}(\widetilde{h},t)_\infty \lesssim t^{\alpha }\). Finally, in order to ensure that \(\widetilde{h}\) is odd or even, we can multiply \(\widetilde{h}\) by an infinitely differentiable function, which is equal to one on \(\Omega \) and zero for \(u_{d+1}\le 1/(2\sqrt{2})\), and extend \(\widetilde{h}\) to be odd or even. These operations do not decrease the smoothness of \(\widetilde{h}\). \(\square \)
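Part (1) of Proposition 3.3 rests on the positive homogeneity of degree k of \(\sigma _k\), which can be checked numerically. The snippet below (purely illustrative; the specific d, k, and random draws are ad hoc) verifies that applying \(S_k\) to \(g(u)=\sigma _k(u^\intercal v)\) recovers \(x\mapsto \sigma _k((x^\intercal ,1)v)\).

```python
# Numerical check of the homogeneity identity behind Proposition 3.3(1).
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 2
v = rng.normal(size=d + 1); v /= np.linalg.norm(v)          # direction v on S^d
x = rng.normal(size=d); x *= 0.5 / np.linalg.norm(x)        # a point x in B^d
relu_k = lambda t: max(t, 0.0) ** k                          # sigma_k

u = np.append(x, 1.0) / np.sqrt(x @ x + 1.0)                 # lift of x to the cap Omega
lhs = (x @ x + 1.0) ** (k / 2) * relu_k(u @ v)               # S_k g(x) with g(u) = sigma_k(u^T v)
rhs = relu_k(np.append(x, 1.0) @ v)                          # sigma_k((x^T, 1) v)
print(abs(lhs - rhs))                                         # ~ 0 up to rounding
```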

By Proposition 3.3, for any \(h\in \mathcal {H}^\alpha \) and \(g\in \mathcal {G}_{\sigma _k}(M)\), we have

$$\begin{aligned} \Vert h-S_k g\Vert _{L^\infty (\mathbb {B}^d)} = \Vert S_k \widetilde{h}-S_k g\Vert _{L^\infty (\mathbb {B}^d)} \le 2^{k/2} \Vert \widetilde{h}- g\Vert _{L^\infty (\mathbb {S}^d)}, \end{aligned}$$

for some \(\widetilde{h}\in C(\mathbb {S}^d)\). Since \(S_kg\in \mathcal {F}_{\sigma _k}(M)\), we can derive approximation bounds for \(\mathcal {F}_{\sigma _k}(M)\) by studying the approximation capacity of \(\mathcal {G}_{\sigma _k}(M)\). Now, we are ready to prove Theorem 2.1.

Proof of Theorem 2.1

By Proposition 3.3, for any \(h\in \mathcal {H}^\alpha \), there exists \(\widetilde{h}\in C(\mathbb {S}^d)\) such that \(S_k \widetilde{h} =h\), \(\Vert \widetilde{h}\Vert _{L^\infty (\mathbb {S}^d)}\le C\) and \(\omega _{2s^*}(\widetilde{h},t)_\infty \le Ct^{\alpha }\), where \(s^*\in \mathbb {N}\) is the smallest integer such that \(\alpha \le 2s^*\). We choose \(\widetilde{h}\) to be odd (even) if k is even (odd). Using \(\omega _s(\widetilde{h},t)_2\le 2^{s-2\,s^*+2}\omega _{2\,s^*}(\widetilde{h},t)_2\) for \(s>2\,s^*\) [12, Proposition 10.1.2] and the Marchaud inequality [15, Eq.(9.6)]

$$\begin{aligned} \omega _s(\widetilde{h},t)_2 \lesssim t^s \left( \int _t^1 \frac{\omega _{2s^*}(\widetilde{h},\theta )_2^2}{\theta ^{2s+1}}d\theta \right) ^{1/2},\quad s< 2s^*, \end{aligned}$$

we have

$$\begin{aligned} \omega _s(\widetilde{h},t)_2 \lesssim {\left\{ \begin{array}{ll} t^s, &{}\text{ if } s<\alpha , \\ t^\alpha , &{}\text{ if } s=\alpha = 2s^*,\\ t^\alpha \sqrt{\log (1/t)}, &{}\text{ if } s=\alpha \ne 2s^*,\\ t^\alpha , &{}\text{ if } s>\alpha . \end{array}\right. } \end{aligned}$$
(3.5)

We study how well \(g\in \mathcal {G}_{\sigma _k}(M)\) approximates \(\widetilde{h}\). It turns out that it is enough to consider a subset of \(\mathcal {G}_{\sigma _k}(M)\) that contains functions of the form

$$\begin{aligned} g(u) = \int _{\mathbb {S}^d} \phi (v) \sigma _k(u^\intercal v) d\tau _d(v), \quad u\in \mathbb {S}^d, \end{aligned}$$

for some \(\phi \in L^2(\mathbb {S}^d)\). Note that \(\gamma (g)\le \inf _{\phi } \Vert \phi \Vert _{L^1(\mathbb {S}^d)} \le \inf _{\phi } \Vert \phi \Vert _{L^2(\mathbb {S}^d)}\), where the infimum is taken over all \(\phi \in L^2(\mathbb {S}^d)\) satisfying the integral representation of g. Observing that \(g=\phi * \sigma _k\) is a convolution, identity (3.2) gives \(\mathcal {P}_n g = \widehat{\sigma _k}(n) \mathcal {P}_n\phi \). Hence, we have the Fourier decomposition

$$\begin{aligned} g = \sum _{n=0}^\infty \mathcal {P}_n g = \sum _{n=0}^\infty \widehat{\sigma _k}(n) \mathcal {P}_n\phi , \end{aligned}$$

which converges in \(L^2(\mathbb {S}^d)\). This implies that \(g\in \mathcal {G}_{\sigma _k}(M)\) if g is continuous, \(\mathcal {P}_n g=0\) for any \(n\in \mathbb {N}_0\) satisfying \(\widehat{\sigma _k}(n)= 0\) and

$$\begin{aligned} \gamma (g)^2 \le \sum _{\widehat{\sigma _k}(n) \ne 0} \widehat{\sigma _k}(n)^{-2} \Vert \mathcal {P}_n g\Vert ^2_{L^2(\mathbb {S}^d)} \le M^2. \end{aligned}$$

By Proposition 3.2, we know that \(\widehat{\sigma _k}(n)=0\) if and only if \(n\ge k+1\) and \(n \equiv k\bmod 2\). For \(\widehat{\sigma _k}(n)\ne 0\), we have \(|\widehat{\sigma _k}(n)| \asymp n^{-(d+2k+1)/2}\).

We consider the convolutions \(g_m:=\widetilde{h}*L_m\), that is, \(g_m(u) = \int _{\mathbb {S}^d} \widetilde{h}(v)L_m(u^\intercal v)d\tau _d(v)\), with

$$\begin{aligned} L_m(t):= \sum _{n=0}^\infty \eta \left( \frac{n}{m}\right) N(d,n) P_n(t), \quad m\in \mathbb {N}, \end{aligned}$$

where \(\eta \) is a \(C^\infty \)-function on \([0,\infty )\) such that \(\eta (t)=1\) for \(0\le t\le 1\) and \(\eta (t)=0\) for \(t\ge 2\). Since \(\eta \) is supported on [0, 2], the summation can be terminated at \(n=2m-1\), so that \(g_m\) is a polynomial of degree at most \(2m-1\). Since \(\widetilde{h}\) is odd (even) if k is even (odd), \(\mathcal {P}_n g_m= \eta (n/m) \mathcal {P}_n \widetilde{h}=0\) for any \(n \equiv k\bmod 2\). Furthermore, [12, Theorem 10.3.2] shows that

$$\begin{aligned} K_s(\widetilde{h},m^{-1})_p \asymp \Vert \widetilde{h} - g_m \Vert _{L^p(\mathbb {S}^d)} + m^{-s}\Vert (-\Delta )^{s/2} g_m\Vert _{L^p(\mathbb {S}^d)}. \end{aligned}$$
(3.6)

By the equivalence (3.4) and \(\omega _{2s^*}(\widetilde{h},m^{-1})_\infty \lesssim m^{-\alpha }\), the equivalence (3.6) for \(p=\infty \) implies that we can bound the approximation error as

$$\begin{aligned} \Vert \widetilde{h} - g_m \Vert _{L^\infty (\mathbb {S}^d)} \lesssim m^{-\alpha }. \end{aligned}$$

Applying the estimate (3.5) to the equivalence (3.6) with \(p=2\), we get

$$\begin{aligned} \Vert (-\Delta )^{s/2} g_m\Vert _{L^2(\mathbb {S}^d)} \lesssim {\left\{ \begin{array}{ll} 1, &{}\text{ if } s<\alpha , \\ 1, &{}\text{ if } s=\alpha = 2s^*,\\ \sqrt{\log m}, &{}\text{ if } s=\alpha \text{ and } \alpha \ne 2s^*,\\ m^{s-\alpha }, &{}\text{ if } \alpha <s \le 2s^*. \end{array}\right. } \end{aligned}$$

Using \(\mathcal {P}_n((-\Delta )^{s/2}g_m) = (n(n+d-1))^{s/2}\mathcal {P}_n g_m\), we can estimate the norm \(\gamma (g_m)\) as follows

$$\begin{aligned} \gamma (g_m)^2&\le \sum _{\widehat{\sigma _k}(n) \ne 0} \widehat{\sigma _k}(n)^{-2} \Vert \mathcal {P}_n g_m\Vert ^2_{L^2(\mathbb {S}^d)} \\&\lesssim \widehat{\sigma _k}(0)^{-2} \Vert \mathcal {P}_0 g_m\Vert ^2_{L^2(\mathbb {S}^d)} + \sum _{n=1}^{2m-1} n^{d+2k+1} n^{-2s}\Vert \mathcal {P}_n((-\Delta )^{s/2}g_m)\Vert ^2_{L^2(\mathbb {S}^d)} \\&\lesssim 1+ \sum _{n=1}^{2m-1} n^{d+2k+1-2s}\Vert \mathcal {P}_n((-\Delta )^{s/2}g_m)\Vert ^2_{L^2(\mathbb {S}^d)}\\&\lesssim 1 + \Vert (-\Delta )^{s/2}g_m\Vert ^2_{L^2(\mathbb {S}^d)}, \end{aligned}$$

where we choose \(s=(d+2k+1)/2\) in the last inequality. We continue the proof in three different cases.

Case I: \(\alpha > (d+2k+1)/2\) or \(\alpha = (d+2k+1)/2\) is an even integer. In this case, \(s<\alpha \) or \(s=\alpha =2\,s^*\). Thus,

$$\begin{aligned} \gamma (g_m)^2 \le \sum _{\widehat{\sigma _k}(n) \ne 0} \widehat{\sigma _k}(n)^{-2} \Vert \mathcal {P}_n g_m\Vert ^2_{L^2(\mathbb {S}^d)} \lesssim 1+ \Vert (-\Delta )^{s/2}g_m\Vert ^2_{L^2(\mathbb {S}^d)} \lesssim 1. \end{aligned}$$

Since \(\mathcal {P}_n g_m=\eta (n/m)\mathcal {P}_n \widetilde{h} = \mathcal {P}_n \widetilde{h}\) for \(n\le m\), we have

$$\begin{aligned} \gamma (\widetilde{h})^2&\le \lim _{m\rightarrow \infty } \sum _{n\le m,\widehat{\sigma _k}(n) \ne 0} \widehat{\sigma _k}(n)^{-2} \Vert \mathcal {P}_n \widetilde{h}\Vert ^2_{L^2(\mathbb {S}^d)} \\&\le \lim _{m\rightarrow \infty } \sum _{\widehat{\sigma _k}(n) \ne 0} \widehat{\sigma _k}(n)^{-2} \Vert \mathcal {P}_n g_m\Vert ^2_{L^2(\mathbb {S}^d)} \\&\lesssim 1. \end{aligned}$$

This shows that \(\widetilde{h} \in \mathcal {G}_{\sigma _k}(M)\) for some constant M. Hence, \(h=S_k \widetilde{h} \in \mathcal {F}_{\sigma _k}(M)\) by Proposition 3.3.

Case II: \(\alpha = (d+2k+1)/2\) is not an even integer. We have \(s=\alpha \ne 2s^*\) and

$$\begin{aligned} \gamma (g_m)^2 \lesssim 1+ \Vert (-\Delta )^{s/2}g_m\Vert ^2_{L^2(\mathbb {S}^d)} \lesssim \log m. \end{aligned}$$

This shows that \(g_m \in \mathcal {G}_{\sigma _k}(M)\) with \(M\lesssim \sqrt{\log m}\). Therefore,

$$\begin{aligned} \Vert \widetilde{h} - g_m\Vert _{L^\infty (\mathbb {S}^d)} \lesssim m^{-\alpha } \lesssim \exp (-\alpha M^2). \end{aligned}$$

By Proposition 3.3, \(f:=S_k g_m \in \mathcal {F}_{\sigma _k}(M)\) and

$$\begin{aligned} \Vert h-f\Vert _{L^\infty (\mathbb {B}^d)} \le 2^{k/2} \Vert \widetilde{h}- g_m\Vert _{L^\infty (\mathbb {S}^d)} \lesssim \exp (-\alpha M^2). \end{aligned}$$

Case III: \(\alpha < (d+2k+1)/2\). In this case, \(s>\alpha \) and

$$\begin{aligned} \gamma (g_m)^2 \lesssim 1+ \Vert (-\Delta )^{s/2}g_m\Vert ^2_{L^2(\mathbb {S}^d)} \lesssim m^{d+2k+1-2\alpha }. \end{aligned}$$

This shows that \(g_m \in \mathcal {G}_{\sigma _k}(M)\) with \(M\lesssim m^{(d+2k+1-2\alpha )/2}\). Therefore,

$$\begin{aligned} \Vert \widetilde{h} - g_m\Vert _{L^\infty (\mathbb {S}^d)} \lesssim m^{-\alpha } \lesssim M^{-\frac{2\alpha }{d+2k+1-2\alpha }}. \end{aligned}$$

By Proposition 3.3, \(f:=S_k g_m \in \mathcal {F}_{\sigma _k}(M)\) and

$$\begin{aligned} \Vert h-f\Vert _{L^\infty (\mathbb {B}^d)} \le 2^{k/2} \Vert \widetilde{h}- g_m\Vert _{L^\infty (\mathbb {S}^d)} \lesssim M^{-\frac{2\alpha }{d+2k+1-2\alpha }}, \end{aligned}$$

which finishes the proof. \(\square \)

Remark 3.4

Since we are only able to estimate the smoothness \(\omega _{2s^*}(\widetilde{h},t)_\infty \) for the even integer \(2s^*\), we have an extra logarithmic factor in the bound \(\omega _s(\widetilde{h},t)_2 \lesssim t^\alpha \sqrt{\log (1/t)}\) in (3.5) when \(s=\alpha \ne 2s^*\), due to the Marchaud inequality. Consequently, we can only obtain an exponential convergence rate when \(\alpha = (d+2k+1)/2\) is not an even integer. We conjecture that the bound \(\omega _s(\widetilde{h},t)_2 \lesssim t^\alpha \) holds for all \(s\ge \alpha \). If this is the case, then the proof of Theorem 2.1 implies \(\mathcal {H}^\alpha \subseteq \mathcal {F}_{\sigma _k}(M)\) for some constant M when \(\alpha \ge (d+2k+1)/2\).

4 Nonparametric Regression

In this section, we apply our approximation results to nonparametric regression using neural networks. For simplicity, we will only consider ReLU activation function (\(k=1\)), which is the most popular activation in deep learning.

We study the classical problem of learning a d-variate function \(h\in \mathcal {H}\) from its noisy samples, where we will assume \(\mathcal {H}= \mathcal {H}^\alpha \) with \(\alpha <(d+3)/2\) or \(\mathcal {H}= \mathcal {F}_{\sigma }(1)\). Note that, due to Theorem 2.1, the results for \(\mathcal {F}_{\sigma }(1)\) can be applied to \(\mathcal {H}^\alpha \) with \(\alpha >(d+3)/2\) by scaling the variation norm. Suppose we have a data set of \(n\ge 2\) samples \(\mathcal {D}_n = \{(X_i,Y_i)\}_{i=1}^n \subseteq \mathbb {B}^d \times \mathbb {R}\) which are independently and identically generated from the regression model

$$\begin{aligned} Y_i = h(X_i) + \eta _i, \quad X_i \sim \mu , \quad \eta _i \sim \mathcal {N}(0,V^2), \quad i=1,\dots ,n, \quad h\in \mathcal {H}, \end{aligned}$$
(4.1)

where \(\mu \) is the marginal distribution of the covariates \(X_i\) supported on \(\mathbb {B}^d\), and the \(\eta _i\) are i.i.d. Gaussian noises independent of the \(X_i\) (we will treat the variance \(V^2\) as a fixed constant). We are interested in the empirical risk minimizer (ERM)

$$\begin{aligned} f_n^* \in \mathop {\textrm{argmin}}\limits _{f\in \mathcal {F}_n} \mathcal {L}_n(f) := \mathop {\textrm{argmin}}\limits _{f\in \mathcal {F}_n} \frac{1}{n} \sum _{i=1}^n |f(X_i)- Y_i|^2, \end{aligned}$$
(4.2)

where \(\mathcal {F}_n\) is a function class parameterized by neural networks. For simplicity, we assume here and in the sequel that the minimum above indeed exists. The performance of the estimation is measured by the expected risk

$$\begin{aligned} \mathcal {L}(f):= \mathbb {E}_{(X,Y)} [(f(X)-Y)^2] = \mathbb {E}_{X\sim \mu } [(f(X) - h(X))^2] + V^2. \end{aligned}$$

It is equivalent to evaluating the estimator by the excess risk

$$\begin{aligned} \Vert f - h\Vert _{L^2(\mu )}^2 = \mathcal {L}(f) - \mathcal {L}(h). \end{aligned}$$

In the statistical analysis of learning algorithms, we often require that the hypothesis class is uniformly bounded. We define the truncation operator \(\mathcal {T}_B\) with level \(B>0\) for real-valued functions f as

$$\begin{aligned} \mathcal {T}_Bf(x):= {\left\{ \begin{array}{ll} f(x) &{}\quad \text{ if } |f(x)|\le B, \\ \,\textrm{sgn}\,(f(x)) B &{}\quad \text{ if } |f(x)|> B. \end{array}\right. } \end{aligned}$$

Since we always assume the regression function h is bounded, truncating the output of the estimator \(f_n^*\) appropriately does not increase the excess risk. We will estimate the convergence rate of \(\mathbb {E}_{\mathcal {D}_n} \Vert \mathcal {T}_{B_n}f_n^*-h\Vert _{L^2(\mu )}^2\), where \(B_n \lesssim \log n\), as the number of samples \(n\rightarrow \infty \).
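Before specializing to particular architectures, the regression pipeline above can be illustrated with a toy simulation. The sketch below (purely illustrative) generates data from model (4.1), fits a simplified surrogate of the ERM (4.2) in which the inner directions \(v_i\) are drawn at random from \(\mathbb {S}^d\) and only the outer coefficients are fit by ridge-regularized least squares, applies the truncation operator \(\mathcal {T}_B\), and estimates the excess risk by Monte Carlo. The target function, width, ridge parameter, and truncation level are all ad hoc choices, and this surrogate is not the constrained estimator analyzed in the following subsections.

```python
# Toy simulation of model (4.1) with a simplified random-feature surrogate of the ERM (4.2).
import numpy as np

rng = np.random.default_rng(0)
d, n, N, V = 2, 2000, 200, 0.1          # dimension, samples, width, noise level
h = lambda X: np.sin(np.pi * X[:, 0]) * np.cos(np.pi * X[:, 1])   # a smooth target

def sample_ball(m):                      # X_i drawn uniformly from B^d (plays the role of mu)
    Z = rng.normal(size=(m, d))
    R = rng.uniform(size=(m, 1)) ** (1 / d)
    return R * Z / np.linalg.norm(Z, axis=1, keepdims=True)

def features(X, Vdirs):                  # ReLU features sigma((x^T,1) v_i)
    X1 = np.hstack([X, np.ones((len(X), 1))])
    return np.maximum(X1 @ Vdirs.T, 0.0)

X = sample_ball(n)
Y = h(X) + V * rng.normal(size=n)        # noisy samples from the regression model (4.1)

Vdirs = rng.normal(size=(N, d + 1))
Vdirs /= np.linalg.norm(Vdirs, axis=1, keepdims=True)            # v_i on S^d
Phi = features(X, Vdirs)
lam = 1e-3
a = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ Y)    # outer least-squares fit

B = 2.0                                                          # truncation level B_n
f_hat = lambda X: np.clip(features(X, Vdirs) @ a, -B, B)         # truncated estimator T_B f_n^*

X_test = sample_ball(50_000)                                     # Monte Carlo excess risk
print("excess risk estimate:", np.mean((f_hat(X_test) - h(X_test)) ** 2))
```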

4.1 Shallow Neural Networks

The rate of convergence of neural network regression estimates has been analyzed by many papers [9, 26, 27, 38, 41, 52]. It is well-known that the optimal minimax rate of convergence for learning a regression function \(h\in \mathcal {H}^\alpha \) is \(n^{-2\alpha /(d+2\alpha )}\) [59]. This optimal rate has been established (up to logarithmic factors) for two-hidden-layers neural networks with certain squashing activation functions [26] and for deep ReLU neural networks [27, 52]. For shallow networks, [38] proved a rate of \(n^{-2\alpha /(2\alpha +d+5)+\epsilon }\) with \(\epsilon >0\) for a certain cosine squasher activation function. However, to the best of our knowledge, it is unknown whether shallow neural networks can achieve the optimal rate. In this section, we provide an affirmative answer to this question by proving that shallow ReLU neural networks can achieve the optimal rate for \(\mathcal {H}^\alpha \) with \(\alpha <(d+3)/2\).

We will use the following lemma to analyze the convergence rate. It decomposes the error of the ERM into generalization error and approximation error, and bounds the generalization error by the covering number of the hypothesis class \(\mathcal {F}_n\).

Lemma 4.1

( [27]) Let \(f_n^*\) be the estimator (4.2) and set \(B_n = c_1\log n\) for some constant \(c_1>0\). Then,

$$\begin{aligned}&\mathbb {E}_{\mathcal {D}_n} \Vert \mathcal {T}_{B_n}f_n^*-h\Vert _{L^2(\mu )}^2 \\&\quad \le \frac{c_2 (\log n)^2 \sup _{X_{1:n}\in (\mathbb {B}^d)^n}\log (\mathcal {N}(n^{-1}B_n^{-1}, \mathcal {T}_{B_n}\mathcal {F}_n,\Vert \cdot \Vert _{L^1(X_{1:n})})+1)}{n}\\&\qquad + 2 \inf _{f\in \mathcal {F}_n} \Vert f-h\Vert _{L^2(\mu )}^2, \end{aligned}$$

for \(n>1\) and some constant \(c_2>0\) (independent of n and \(f_n^*\)), where \(X_{1:n} =(X_1,\dots ,X_n)\) denotes a sequence of sample points in \(\mathbb {B}^d\) and \(\mathcal {N}(\epsilon , \mathcal {T}_{B_n}\mathcal {F}_n,\Vert \cdot \Vert _{L^1(X_{1:n})})\) denotes the \(\epsilon \)-covering number of the function class \(\mathcal {T}_{B_n}\mathcal {F}_n:=\{\mathcal {T}_{B_n}f,f\in \mathcal {F}_n\}\) in the metric \(\Vert f-g\Vert _{L^1(X_{1:n})} = \frac{1}{n}\sum _{i=1}^n|f(X_i)-g(X_i)|\).

For the shallow neural network model \(\mathcal {F}_n= \mathcal {F}_{\sigma }(N_n,M_n)\), Lemma 2.3 and Corollary 2.4 provide bounds for the approximation error. The covering number of the function class \(\mathcal {T}_{B_n}\mathcal {F}_n\) can be estimated by using the pseudo-dimension of \(\mathcal {T}_{B_n}\mathcal {F}_n\) [22]. Choosing \(N_n,M_n\) appropriately to balance the approximation and generalization errors, we can derive convergence rates for the ERM.

Theorem 4.2

Let \(f_n^*\) be the estimator (4.2) with \(\mathcal {F}_n = \mathcal {F}_{\sigma }(N_n,M_n)\) and set \(B_n = c_1\log n\) for some constant \(c_1>0\).

  1.

    If \(\mathcal {H}= \mathcal {H}^\alpha \) with \(\alpha <(d+3)/2\), we choose

    $$\begin{aligned} N_n \asymp n^{\frac{d}{d+2\alpha }}, \quad M_n \gtrsim n^{\frac{d+3-2\alpha }{2d+4\alpha }}, \end{aligned}$$

    then

    $$\begin{aligned} \mathbb {E}_{\mathcal {D}_n} \Vert \mathcal {T}_{B_n}f_n^*-h\Vert _{L^2(\mu )}^2 \lesssim n^{-\frac{2\alpha }{d+2\alpha }} (\log n)^4. \end{aligned}$$
  2.

    If \(\mathcal {H}= \mathcal {F}_{\sigma }(1)\), we choose

    $$\begin{aligned} N_n \asymp n^{\frac{d}{2d+3}}, \quad M_n \ge 1, \end{aligned}$$

    then

    $$\begin{aligned} \mathbb {E}_{\mathcal {D}_n} \Vert \mathcal {T}_{B_n}f_n^*-h\Vert _{L^2(\mu )}^2 \lesssim n^{-\frac{d+3}{2d+3}} (\log n)^4. \end{aligned}$$

Proof

To apply the bound in Lemma 4.1, we need to estimate the covering number \(\mathcal {N}(\epsilon , \mathcal {T}_{B_n}\mathcal {F}_n,\Vert \cdot \Vert _{L^1(X_{1:n})})\). The classical result of [22, Theorem 6] shows that the covering number can be bounded by the pseudo-dimension:

$$\begin{aligned} \log \mathcal {N}(\epsilon , \mathcal {T}_{B_n}\mathcal {F}_n,\Vert \cdot \Vert _{L^1(X_{1:n})}) \lesssim \,\textrm{Pdim}\,(\mathcal {T}_{B_n}\mathcal {F}_n) \log (B_n/\epsilon ), \end{aligned}$$
(4.3)

where \(\,\textrm{Pdim}\,(\mathcal {T}_{B_n}\mathcal {F}_n)\) is the pseudo-dimension of the function class \(\mathcal {T}_{B_n}\mathcal {F}_n\), see (2.2). For ReLU neural networks, [6] showed that

$$\begin{aligned} \,\textrm{Pdim}\,(\mathcal {T}_{B_n}\mathcal {F}_n) \lesssim N_n \log N_n. \end{aligned}$$

Consequently, we have

$$\begin{aligned} \log \mathcal {N}(\epsilon , \mathcal {T}_{B_n}\mathcal {F}_n,\Vert \cdot \Vert _{L^1(X_{1:n})}) \lesssim N_n \log (N_n) \log (B_n/\epsilon ). \end{aligned}$$

Applying Lemma 4.1 and Corollary 2.4, if \(\mathcal {H}= \mathcal {H}^\alpha \) with \(\alpha <(d+3)/2\), then

$$\begin{aligned} \mathbb {E}_{\mathcal {D}_n} \Vert \mathcal {T}_{B_n}f_n^*-h\Vert _{L^2(\mu )}^2 \lesssim \frac{(\log n)^2 N_n \log (N_n) \log (n B_n^2)}{n} + N_n^{-\frac{2\alpha }{d}} \vee M_n^{-\frac{4\alpha }{d+3-2\alpha }}. \end{aligned}$$

By choosing \(N_n \asymp n^{d/(d+2\alpha )}\) and \(M_n \gtrsim N_n^{(d+3-2\alpha )/(2d)}\), we get \(\mathbb {E}_{\mathcal {D}_n} \Vert \mathcal {T}_{B_n}f_n^*-h\Vert _{L^2(\mu )}^2 \lesssim n^{-2\alpha /(d+2\alpha )} (\log n)^4\). Similarly, by Lemmas 4.1 and 2.3, if \(\mathcal {H}= \mathcal {F}_{\sigma }(1)\) and \(M_n \ge 1\), then

$$\begin{aligned} \mathbb {E}_{\mathcal {D}_n} \Vert \mathcal {T}_{B_n}f_n^*-h\Vert _{L^2(\mu )}^2 \lesssim \frac{(\log n)^2 N_n \log (N_n) \log (n B_n^2)}{n} + N_n^{-\frac{d+3}{d}}. \end{aligned}$$

We choose \(N_n \asymp n^{d/(2d+3)}\), then \(\mathbb {E}_{\mathcal {D}_n} \Vert \mathcal {T}_{B_n}f_n^*-h\Vert _{L^2(\mu )}^2 \lesssim n^{-(d+3)/(2d+3)} (\log n)^4\). \(\square \)
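As a purely illustrative aside, the choice of \(N_n\) in the proof can be checked symbolically: ignoring logarithmic factors (and taking \(M_n\) large enough that the maximum is attained by the \(N_n\) term), the generalization term scales like \(N_n/n\) and the approximation term like \(N_n^{-2\alpha /d}\). The following sketch, our own and using sympy, verifies that \(N_n \asymp n^{d/(d+2\alpha )}\) equates the two exponents and yields the claimed exponent \(-2\alpha /(d+2\alpha )\).

```python
import sympy as sp

d, alpha = sp.symbols('d alpha', positive=True)

# Exponent of n in the choice N_n = n^{d/(d+2*alpha)}
p = d / (d + 2 * alpha)

# Generalization term ~ N_n / n  ->  exponent p - 1
gen_exp = p - 1
# Approximation term ~ N_n^{-2*alpha/d}  ->  exponent -(2*alpha/d) * p
app_exp = -2 * alpha / d * p

# Both should equal the claimed rate exponent -2*alpha/(d+2*alpha)
target = -2 * alpha / (d + 2 * alpha)
print(sp.simplify(gen_exp - target))  # 0
print(sp.simplify(app_exp - target))  # 0
```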

Remark 4.3

The Gaussian noise assumption on the model (4.1) can be weakened for Lemma 4.1 and hence for Theorem 4.2. We refer the reader to [27, Appendix B, Lemma 18] for more details. Theorem 4.2 can be easily generalized to shallow ReLU\(^k\) neural networks for \(k\ge 1\) by using the same proof technique. For example, one can show that, if \(h\in \mathcal {H}^\alpha \) with \(\alpha <(d+2k+1)/2\), then we can choose \(\mathcal {F}_n = \mathcal {F}_{\sigma _k}(N_n,M_n)\) with \(N_n \asymp n^{\frac{d}{d+2\alpha }}\) and \(M_n \gtrsim n^{\frac{d+2k+1-2\alpha }{2d+4\alpha }}\), such that \(\mathbb {E}_{\mathcal {D}_n} \Vert \mathcal {T}_{B_n}f_n^*-h\Vert _{L^2(\mu )}^2 \lesssim n^{-\frac{2\alpha }{d+2\alpha }} (\log n)^4\).

Theorem 4.2 shows that least square minimization using shallow ReLU neural networks can achieve the optimal rate \(n^{-\frac{2\alpha }{d+2\alpha }}\) for learning functions in \(\mathcal {H}^\alpha \) with \(\alpha <(d+3)/2\). For the function class \(\mathcal {F}_{\sigma }(1)\), the rate \(n^{-\frac{d+3}{2d+3}}\) is also minimax optimal as proven by [46, Lemma 25] (they studied a slightly different function class, but their result also holds for \(\mathcal {F}_{\sigma }(1)\)). Specifically, [57, Theorem 4 and Theorem 8] give a sharp estimate for the metric entropy

$$\begin{aligned} \log \mathcal {N}(\epsilon , \mathcal {F}_{\sigma }(1), \Vert \cdot \Vert _{L^2(\mathbb {B}^d)}) \asymp \epsilon ^{-\frac{2d}{d+3}}. \end{aligned}$$

Combining this estimate with the classical result of Yang and Barron (see [61, Proposition 1] and [60, Chapter 15]), we get

$$\begin{aligned} \inf _{\widehat{f}_n} \sup _{h\in \mathcal {F}_{\sigma }(1)} \mathbb {E}_{\mathcal {D}_n} \Vert \widehat{f}_n-h\Vert _{L^2(\mathbb {B}^d)}^2 \asymp n^{-\frac{d+3}{2d+3}}, \end{aligned}$$

where the infimum is taken over all estimators based on the samples \(\mathcal {D}_n\), which are generated from the model (4.1).

4.2 Deep Neural Networks and Over-Parameterization

There is a direct way to generalize the analysis in the last section to deep neural networks: we can implement shallow neural networks by sparse multi-layer neural networks with the same order of parameters, and estimate the approximation and generalization performance of the constructed networks. Since the optimal convergence rates of deep neural networks have already been established in [27, 52], we do not pursue this direction. Instead, we study the convergence rates of over-parameterized neural networks by using the idea discussed in [24]. The reason for studying such networks is that, in modern applications of deep learning, the number of parameters in the networks is often much larger than the number of samples. However, in the convergence analysis of [27, 52], the network that achieves the optimal rate is under-parameterized (see also the choice of \(N_n\) in Theorem 4.2). Hence, that analysis cannot explain the empirical performance of deep learning models used in practice.

Following [24], we consider deep neural networks with norm constraints on weight matrices. For \(W,L\in \mathbb {N}\), we denote by \(\mathcal{N}\mathcal{N}(W,L)\) the set of functions that can be parameterized by ReLU neural networks in the form

$$\begin{aligned} \begin{aligned} f^{(0)}(x)&= x \in \mathbb {R}^d, \\ f^{(\ell +1)}(x)&= \sigma (A^{(\ell )} f^{(\ell )}(x)+b^{(\ell )}), \quad \ell = 0,\dots ,L-1, \\ f(x)&= A^{(L)} f^{(L)}(x) + b^{(L)}, \end{aligned} \end{aligned}$$
(4.4)

where \(A^{(\ell )} \in \mathbb {R}^{N_{\ell +1}\times N_{\ell }}\), \(b^{(\ell )}\in \mathbb {R}^{N_{\ell +1}}\) with \(N_0 =d,N_{L+1} =1\) and \(\max \{N_1,\dots ,N_L\} =W\). The numbers W and L are called the width and depth of the neural network, respectively. Let us use the notation \(f_\theta \) to emphasize that the neural network function is parameterized by \(\theta =((A^{(0)},b^{(0)}),\dots ,(A^{(L)},b^{(L)}))\). We can define a norm constraint on the weight matrices as follows

$$\begin{aligned} \kappa (\theta ):= \Vert (A^{(L)},b^{(L)})\Vert \prod _{\ell =0}^{L-1} \max \left\{ \Vert (A^{(\ell )},b^{(\ell )})\Vert ,1\right\} , \end{aligned}$$

where we use \(\Vert A\Vert := \sup _{\Vert x\Vert _\infty \le 1} \Vert Ax\Vert _\infty \) to denote the operator norm (induced by the \(\ell ^\infty \) norm) of a matrix \(A = (a_{i,j}) \in \mathbb {R}^{m\times n}\). It is well-known that \(\Vert A\Vert \) is the maximum 1-norm of the rows of A:

$$\begin{aligned} \Vert A\Vert = \max _{1\le i\le m} \sum _{j=1}^{n} |a_{i,j}|. \end{aligned}$$

The motivation for this definition of \(\kappa (\theta )\) is discussed in [24]. For \(M\ge 0\), we denote by \(\mathcal{N}\mathcal{N}(W,L,M)\) the set of functions \(f_\theta \in \mathcal{N}\mathcal{N}(W,L)\) that satisfy \(\kappa (\theta ) \le M\). It is shown in [24, Proposition 2.5] that, if \(W_1\le W_2,L_1\le L_2, M_1\le M_2\), then \(\mathcal{N}\mathcal{N}(W_1,L_1,M_1) \subseteq \mathcal{N}\mathcal{N}(W_2,L_2,M_2)\). (Strictly speaking, [24] uses the convention that the bias \(b^{(L)}=0\) in the last layer. But the results can be easily generalized to the case \(b^{(L)}\ne 0\), see [62, Section 2.1] for details.)
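As an illustrative aside, \(\kappa (\theta )\) is straightforward to compute from the weight matrices; the following minimal sketch, our own and assuming NumPy (the names `op_norm_inf` and `kappa` are not from [24]), treats \((A^{(\ell )},b^{(\ell )})\) as the augmented matrix \([A^{(\ell )}\,|\,b^{(\ell )}]\) and uses the maximum row 1-norm formula above.

```python
import numpy as np

def op_norm_inf(A: np.ndarray) -> float:
    """l-infinity operator norm of a matrix: the maximum 1-norm of its rows."""
    return np.abs(A).sum(axis=1).max()

def kappa(theta) -> float:
    """Norm constraint kappa(theta) for theta = [(A_0, b_0), ..., (A_L, b_L)]:
    the norm of the last layer times, over the earlier layers, the product of
    max{||(A_l, b_l)||, 1}, with (A, b) treated as the augmented matrix [A | b]."""
    mats = [np.hstack([A, b.reshape(-1, 1)]) for A, b in theta]
    k = op_norm_inf(mats[-1])
    for M in mats[:-1]:
        k *= max(op_norm_inf(M), 1.0)
    return k
```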

To derive approximation bounds for deep neural networks, we consider the relationship between \(\mathcal {F}_{\sigma }(N,M)\) and \(\mathcal{N}\mathcal{N}(N,1,M)\). The next proposition shows that the function classes \(\mathcal{N}\mathcal{N}(N,1,M)\) and \(\mathcal {F}_{\sigma }(N,M)\) have essentially the same approximation power.

Proposition 4.4

For any \(N\in \mathbb {N}\) and \(M>0\), we have \(\mathcal {F}_{\sigma }(N,M)\subseteq \mathcal{N}\mathcal{N}(N,1,\sqrt{d+1}M)\) and \(\mathcal{N}\mathcal{N}(N,1,M) \subseteq \mathcal {F}_{\sigma }(N+1,M)\).

Proof

Each function \(f(x) =\sum _{i=1}^N a_i\sigma ((x^\intercal ,1)v_i)\) in \(\mathcal {F}_{\sigma }(N,M)\) can be parameterized in the form (4.4) with \(W=N,L=1\) and

$$\begin{aligned} (A^{(0)}, b^{(0)}) = (v_1,\dots ,v_N)^\intercal , \quad (A^{(1)}, b^{(1)}) = (a_1,\dots ,a_N,0). \end{aligned}$$

Since \(v_i\in \mathbb {S}^d\), it is easy to see that \(\kappa (\theta ) \le \sqrt{d+1} M\). Hence, \(\mathcal {F}_{\sigma }(N,M)\subseteq \mathcal{N}\mathcal{N}(N,1,\sqrt{d+1}M)\).

Conversely, let \(f_\theta \in \mathcal{N}\mathcal{N}(N,1,M)\) be a function parameterized in the form (4.4) with \((A^{(0)}, b^{(0)}) = (a^{(0)}_1,\dots ,a^{(0)}_N)^\intercal \) and \((A^{(1)}, b^{(1)}) = (a^{(1)}_1,\dots ,a^{(1)}_N,b^{(1)})\), where \(a^{(0)}_i\in \mathbb {R}^{d+1}\) and \(a^{(1)}_i, b^{(1)}\in \mathbb {R}\). Then, \(f_\theta \) can be represented as

$$\begin{aligned} f_\theta (x) = \sum _{i=1}^N a^{(1)}_i \Vert a^{(0)}_i\Vert _2 \sigma \left( (x^\intercal ,1) \frac{a^{(0)}_i}{\Vert a^{(0)}_i\Vert _2} \right) + b^{(1)}\sigma (1), \end{aligned}$$

where we assume \(\Vert a^{(0)}_i\Vert _2 \ne 0\) without loss of generality. Since

$$\begin{aligned} \gamma (f_\theta ) \le \sum _{i=1}^N |a^{(1)}_i| \Vert a^{(0)}_i\Vert _2 + |b^{(1)}| \le \Vert (A^{(0)}, b^{(0)})\Vert \sum _{i=1}^N |a^{(1)}_i| + |b^{(1)}| \le \kappa (\theta ), \end{aligned}$$

we conclude that \(\mathcal{N}\mathcal{N}(N,1,M) \subseteq \mathcal {F}_{\sigma }(N+1,M)\). \(\square \)
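To illustrate the first inclusion numerically, the following small check, our own and reusing the `kappa` sketch above, parameterizes a random shallow network with directions \(v_i\in \mathbb {S}^d\) in the form (4.4) and verifies that \(\kappa (\theta )\le \sqrt{d+1}\,\sum _i|a_i|\), since each augmented row \(v_i\) has 1-norm at most \(\sqrt{d+1}\).

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 5

# Random directions on the unit sphere S^d and outer weights a_i.
V = rng.standard_normal((N, d + 1))
V /= np.linalg.norm(V, axis=1, keepdims=True)
a = rng.standard_normal(N)
M = np.abs(a).sum()

# Parameterize in the form (4.4): first-layer rows are the v_i, last layer is (a, 0).
theta = [(V[:, :d], V[:, d]), (a.reshape(1, -1), np.zeros(1))]
print(kappa(theta) <= np.sqrt(d + 1) * M + 1e-12)  # True
```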

As a corollary of Theorem 2.1 and Lemma 2.3, we get the following approximation bounds for deep neural networks.

Corollary 4.5

For \(\mathcal {H}^\alpha \) with \(0<\alpha <(d+3)/2\), we have

$$\begin{aligned} \sup _{h\in \mathcal {H}^\alpha } \inf _{f\in \mathcal{N}\mathcal{N}(W,L,M)} \Vert h-f\Vert _{L^\infty (\mathbb {B}^d)} \lesssim W^{-\frac{\alpha }{d}} \vee M^{-\frac{2\alpha }{d+3-2\alpha }}. \end{aligned}$$

For \(\mathcal {F}_{\sigma }(1)\), there exists a constant \(M\ge 1\) such that

$$\begin{aligned} \sup _{h \in \mathcal {F}_{\sigma }(1)} \inf _{f\in \mathcal{N}\mathcal{N}(W,L,M)} \Vert h-f\Vert _{L^\infty (\mathbb {B}^d)} \lesssim W^{-\frac{1}{2} - \frac{3}{2d}}. \end{aligned}$$

Proof

The first part is a direct consequence of Corollary 2.4 and the inclusion \(\mathcal {F}_{\sigma }(W,M) \subseteq \mathcal{N}\mathcal{N}(W,L,\sqrt{d+1}M)\). The second part follows from Lemma 2.3 and we can choose \(M=\sqrt{d+1}\). \(\square \)

In the first part of Corollary 4.5, if we allow the width W to be arbitrarily large, say \(W\gtrsim M^{2d/(d+3-2\alpha )}\), then we can bound the approximation error solely by the size of the weights. Hence, this result can be applied to over-parameterized neural networks. (Note that, in Theorem 4.2, we use a different regime of the bound.) For the approximation of \(\mathcal {F}_{\sigma }(1)\), the size of the weights is bounded by a constant. We will show that this constant can be used to control the generalization error. Since the approximation error is bounded in terms of W and is independent of M, there is no trade-off in the error decomposition of the ERM, and we only need to choose W sufficiently large to reduce the approximation error. Hence, this result can also be applied to over-parameterized neural networks.

The approximation rate \(M^{-\frac{2\alpha }{d+3-2\alpha }}\) for \(\mathcal {H}^\alpha \) in Corollary 4.5 improves the rate \(M^{-\frac{\alpha }{d+1}}\) proven by [24]. Using the upper bound for Rademacher complexity of \(\mathcal{N}\mathcal{N}(W,L,M)\) (see Lemma 4.6), [24] also gave an approximation lower bound \((M\sqrt{L})^{-\frac{2\alpha }{d-2\alpha }}\). For fixed depth L, our upper bound is very close to this lower bound. We conjecture that the rate in Corollary 4.5 is optimal with respect to M (for fixed depth L). The discussion of optimality at the end of Sect. 2 implies that the conjecture is true for shallow neural networks (i.e. \(L=1\)).

To control the generalization performance of over-parameterized neural networks, we need size-independent sample complexity bounds for such networks. Several methods have been applied to obtain bounds of this kind in recent works [5, 21, 42, 43]. Here, we will use the result of [21], which estimates the Rademacher complexity of deep neural networks [7]. For a set \(S\subseteq \mathbb {R}^n\), let us denote its Rademacher complexity by

$$\begin{aligned} \mathcal {R}_n(S):= \mathbb {E}_{\xi _{1:n}} \left[ \sup _{(s_1,\dots ,s_n)\in S} \frac{1}{n} \sum _{i=1}^n \xi _i s_i \right] , \end{aligned}$$

where \(\xi _{1:n} = (\xi _1,\dots , \xi _n)\) is a sequence of i.i.d. Rademacher random variables. The following lemma is from [21, Theorem 3.2] and [24, Lemma 2.3].

Lemma 4.6

For any \(x_1,\dots ,x_n \in [-1,1]^d\), let \(S:= \{(f(x_1),\dots ,f(x_n)):f \in \mathcal{N}\mathcal{N}(W,L,M) \} \subseteq \mathbb {R}^n\), then

$$\begin{aligned} \mathcal {R}_n(S) \le \frac{M\sqrt{2(L+2+\log (d+1))}}{\sqrt{n}}. \end{aligned}$$
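As an illustrative aside, the Rademacher complexity in the definition above can be estimated by Monte Carlo for any finite set of vectors; the following minimal sketch, our own and assuming NumPy, averages the supremum over random sign vectors.

```python
import numpy as np

def rademacher_complexity(S: np.ndarray, num_draws: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of R_n(S) for a finite set S, given as an array of
    shape (|S|, n) whose rows are the elements of S."""
    rng = np.random.default_rng(seed)
    _, n = S.shape
    vals = []
    for _ in range(num_draws):
        xi = rng.choice([-1.0, 1.0], size=n)   # i.i.d. Rademacher signs
        vals.append(np.max(S @ xi) / n)        # sup over s in S of (1/n) sum_i xi_i s_i
    return float(np.mean(vals))
```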

Now, we can estimate the convergence rates of the ERM based on over-parameterized neural networks. As usual, we decompose the excess risk of the ERM into approximation error and generalization error, and bound them by Corollary 4.5 and Lemma 4.6, respectively. Note that the convergence rates in the following theorem are worse than the optimal rates in Theorem 4.2.

Theorem 4.7

Let \(f_n^*\) be the estimator (4.2) with \(\mathcal {F}_n = \{ \mathcal {T}_{B_n}f: f\in \mathcal{N}\mathcal{N}(W_n,L,M_n) \}\), where \(L\in \mathbb {N}\) is a fixed constant, \(1\le B_n \lesssim \log n\) in case (1) and \(\sqrt{2} \le B_n \lesssim \log n\) in case (2).

  1.

    If \(\mathcal {H}= \mathcal {H}^\alpha \) with \(\alpha <(d+3)/2\), we choose

    $$\begin{aligned} W_n \gtrsim n^{\frac{d}{d+3+2\alpha }}, \quad M_n \asymp n^{\frac{1}{2} - \frac{2\alpha }{d+3+2\alpha }}, \end{aligned}$$

    then

    $$\begin{aligned} \mathbb {E}_{\mathcal {D}_n} \Vert f_n^*-h\Vert _{L^2(\mu )}^2 \lesssim n^{-\frac{2\alpha }{d+3+2\alpha }} \log n. \end{aligned}$$
  2.

    If \(\mathcal {H}= \mathcal {F}_{\sigma }(1)\), we choose a large enough constant M and let

    $$\begin{aligned} W_n \gtrsim n^{\frac{d}{2d+6}}, \quad M_n=M, \end{aligned}$$

    then

    $$\begin{aligned} \mathbb {E}_{\mathcal {D}_n} \Vert f_n^*-h\Vert _{L^2(\mu )}^2 \lesssim n^{-\frac{1}{2}} \log n. \end{aligned}$$

Proof

The proof is essentially the same as that of [24, Theorem 4.1]. Observe that, for any \(f\in \mathcal {F}_n\),

$$\begin{aligned}&\Vert f_n^* - h\Vert ^2_{L^2(\mu )} = \mathcal {L}(f_n^*) - \mathcal {L}(h) \\&\quad = \left[ \mathcal {L}(f_n^*) - \mathcal {L}_n(f_n^*) \right] + \left[ \mathcal {L}_n(f_n^*) - \mathcal {L}_n(f)\right] + \left[ \mathcal {L}_n(f) - \mathcal {L}(f)\right] + \left[ \mathcal {L}(f) - \mathcal {L}(h)\right] \\&\quad \le \left[ \mathcal {L}(f_n^*) - \mathcal {L}_n(f_n^*) \right] + \left[ \mathcal {L}_n(f) - \mathcal {L}(f)\right] + \Vert f - h\Vert ^2_{L^2(\mu )}. \end{aligned}$$

Using \(\mathbb {E}_{\mathcal {D}_n} [\mathcal {L}_n(f)] = \mathcal {L}(f)\) and taking the infimum over \(f\in \mathcal {F}_n\), we get

$$\begin{aligned} \mathbb {E}_{\mathcal {D}_n} \Vert f_n^*-h\Vert _{L^2(\mu )}^2 \le \inf _{f\in \mathcal {F}_n} \Vert f - h\Vert ^2_{L^2(\mu )} + \mathbb {E}_{\mathcal {D}_n} \left[ \mathcal {L}(f_n^*) - \mathcal {L}_n(f_n^*) \right] . \end{aligned}$$
(4.5)

Let us denote the collections of sample points and noises by \(X_{1:n} =(X_1,\dots ,X_n)\) and \(\eta _{1:n}=(\eta _1,\dots ,\eta _n)\). We can bound the generalization error as follows

$$\begin{aligned}&\mathbb {E}_{\mathcal {D}_n} \left[ \mathcal {L}(f_n^*) - \mathcal {L}_n(f_n^*) \right] \nonumber \\&\quad = \mathbb {E}_{\mathcal {D}_n} \left[ \Vert f_n^* - h\Vert ^2_{L^2(\mu )} + V^2 - \left( \frac{1}{n} \sum _{i=1}^n (f_n^*(X_i)- h(X_i))^2 -2 \eta _i(f_n^*(X_i)- h(X_i)) + \eta _i^2\right) \right] \nonumber \\&\quad = \mathbb {E}_{\mathcal {D}_n} \left[ \Vert f_n^* - h\Vert ^2_{L^2(\mu )} - \frac{1}{n} \sum _{i=1}^n (f_n^*(X_i)- h(X_i))^2 \right] +2 \mathbb {E}_{\mathcal {D}_n}\left[ \frac{1}{n} \sum _{i=1}^n \eta _i(f_n^*(X_i)- h(X_i)) \right] \nonumber \\&\quad \le \mathbb {E}_{X_{1:n}} \left[ \sup _{\phi \in \Phi _n} \left( \mathbb {E}_X[\phi ^2(X)] - \frac{1}{n} \sum _{i=1}^n \phi ^2(X_i) \right) \right] + 2 \mathbb {E}_{X_{1:n}} \mathbb {E}_{\eta _{1:n}} \left[ \sup _{\phi \in \Phi _n} \frac{1}{n} \sum _{i=1}^n \eta _i \phi (X_i) \right] , \end{aligned}$$
(4.6)

where we denote \(\Phi _n:= \{f-h:f\in \mathcal {F}_n\}\). By a standard symmetrization argument (see [60, Theorem 4.10]), we can bound the first term in (4.6) by the Rademacher complexity:

$$\begin{aligned} \mathbb {E}_{X_{1:n}} \left[ \sup _{\phi \in \Phi _n} \left( \mathbb {E}_X[\phi ^2(X)] - \frac{1}{n} \sum _{i=1}^n \phi ^2(X_i) \right) \right] \le 2 \mathbb {E}_{X_{1:n}} \left[ \mathcal {R}_n(\Phi _n^2(X_{1:n})) \right] , \end{aligned}$$

where \(\Phi _n^2(X_{1:n}):= \{ (\phi ^2(X_1),\dots ,\phi ^2(X_n)) \in \mathbb {R}^n: \phi \in \Phi _n \} \subseteq \mathbb {R}^n\) is the set of function values on the sample points. Recall that \(\Vert h\Vert _{L^\infty (\mathbb {B}^d)} \le 1\) in case (1) and \(\Vert h\Vert _{L^\infty (\mathbb {B}^d)} \le \sqrt{2}\) in case (2), while we assume \(B_n\ge 1\) in case (1) and \(B_n\ge \sqrt{2}\) in case (2). Hence, \(\Vert \phi \Vert _{L^\infty (\mathbb {B}^d)} \le B_n + \Vert h\Vert _{L^\infty (\mathbb {B}^d)} \le 2B_n\) for any \(\phi \in \Phi _n\). By the structural properties of Rademacher complexity [7, Theorem 12],

$$\begin{aligned} \mathbb {E}_{X_{1:n}} \left[ \mathcal {R}_n(\Phi _n^2(X_{1:n})) \right]&\le 8 B_n \mathbb {E}_{X_{1:n}} \left[ \mathcal {R}_n(\Phi _n(X_{1:n})) \right] \\&\le 8B_n \left( \mathbb {E}_{X_{1:n}} \left[ \mathcal {R}_n(\mathcal {F}_n(X_{1:n})) \right] + \frac{\Vert h\Vert _{L^\infty (\mathbb {B}^d)}}{\sqrt{n}} \right) \\&\lesssim \frac{M_n \log n}{\sqrt{n}}, \end{aligned}$$

where we apply Lemma 4.6 in the last inequality. Note that the second term in (4.6) is a Gaussian complexity. We can also bound it by the Rademacher complexity [7, Lemma 4]:

$$\begin{aligned} \mathbb {E}_{X_{1:n}} \mathbb {E}_{\eta _{1:n}} \left[ \sup _{\phi \in \Phi _n} \frac{1}{n} \sum _{i=1}^n \eta _i \phi (X_i) \right] \lesssim \mathbb {E}_{X_{1:n}} \left[ \mathcal {R}_n(\Phi _n(X_{1:n})) \right] \log n \lesssim \frac{M_n \log n}{\sqrt{n}}. \end{aligned}$$

In summary, we conclude that

$$\begin{aligned} \mathbb {E}_{\mathcal {D}_n} \left[ \mathcal {L}(f_n^*) - \mathcal {L}_n(f_n^*) \right] \lesssim \frac{M_n \log n}{\sqrt{n}}. \end{aligned}$$
(4.7)

If \(\mathcal {H}= \mathcal {H}^\alpha \) with \(\alpha <(d+3)/2\), by Corollary 4.5, we have

$$\begin{aligned} \sup _{h\in \mathcal {H}^\alpha } \inf _{f\in \mathcal {F}_n} \Vert h-f\Vert _{L^\infty (\mathbb {B}^d)} \lesssim W_n^{-\frac{\alpha }{d}} \vee M_n^{-\frac{2\alpha }{d+3-2\alpha }}. \end{aligned}$$

Combining with (4.5) and (4.7), we know that if we choose \(M_n \asymp n^{\frac{1}{2} - \frac{2\alpha }{d+3+2\alpha }}\) and \(W_n \gtrsim n^{\frac{d}{d+3+2\alpha }}\), then

$$\begin{aligned} \mathbb {E}_{\mathcal {D}_n} \Vert f_n^*-h\Vert _{L^2(\mu )}^2 \lesssim W_n^{-\frac{2\alpha }{d}} \vee M_n^{-\frac{4\alpha }{d+3-2\alpha }} + \frac{M_n \log n}{\sqrt{n}} \lesssim n^{-\frac{2\alpha }{d+3+2\alpha }} \log n. \end{aligned}$$

Similarly, if \(\mathcal {H}= \mathcal {F}_{\sigma }(1)\), then by Corollary 4.5 there exists a constant \(M\ge 1\) such that, if \(M_n=M\),

$$\begin{aligned} \mathbb {E}_{\mathcal {D}_n} \Vert f_n^*-h\Vert _{L^2(\mu )}^2 \lesssim W_n^{-\frac{d+3}{d}} + \frac{M \log n}{\sqrt{n}}. \end{aligned}$$

Thus, for any \(W_n \gtrsim n^{d/(2d+6)}\), we get \(\mathbb {E}_{\mathcal {D}_n} \Vert f_n^*-h\Vert _{L^2(\mu )}^2 \lesssim n^{-1/2} \log n\). \(\square \)

4.3 Convolutional Neural Networks

In contrast to the vast amount of theoretical work on fully connected neural networks, only a few papers analyze the performance of convolutional neural networks [16, 20, 30, 35, 47, 66, 67, 68]. The recent work [30] showed the universal consistency of CNNs for nonparametric regression. In this section, we show how to use our approximation results to analyze the convergence rates of CNNs.

Following [67], we introduce a sparse convolutional structure on deep neural networks. Let \(s\ge 2\) be a fixed integer, which is used to control the filter length. Given a sequence \(w=(w_i)_{i\in \mathbb {Z}}\) on \(\mathbb {Z}\) supported on \(\{0,1,\dots ,s\}\), the convolution of the filter w with another sequence \(x=(x_i)_{i\in \mathbb {Z}}\) supported on \(\{1,\dots ,d\}\) is a sequence \(w*x\) given by

$$\begin{aligned} (w*x)_i:= \sum _{j\in \mathbb {Z}} w_{i-j} x_j = \sum _{j=1}^d w_{i-j} x_j, \quad i\in \mathbb {Z}. \end{aligned}$$

Regarding x as a vector in \(\mathbb {R}^d\), this convolution induces a \((d+s) \times d\) Toeplitz-type convolutional matrix

$$\begin{aligned} A^{w}:= (w_{i-j})_{1\le i \le d+s, 1\le j\le d}. \end{aligned}$$

Note that the number of rows of \(A^{w}\) is s greater than the number of columns. This leads us to consider deep neural networks of the form (4.4) with linearly increasing widths \(\{N_\ell = d + \ell s\}_{\ell =0}^L\). We denote by \(\mathcal {CNN}(s,L)\) the set of functions that can be parameterized in the form (4.4) such that \(A^{(\ell )} = A^{w^{(\ell )}}\) for some filter \(w^{(\ell )}\) supported on \(\{0,1,\dots ,s\}\), \(0\le \ell \le L-1\), and the biases \(b^{(\ell )}\) take the special form

$$\begin{aligned} b^{(\ell )} = \left( b^{(\ell )}_1,\dots , b^{(\ell )}_s, b^{(\ell )}_{s+1}, \dots , b^{(\ell )}_{s+1}, b^{(\ell )}_{N_\ell -s+1},\dots , b^{(\ell )}_{N_\ell } \right) ^\intercal , \quad 0\le \ell \le L-2, \end{aligned}$$
(4.8)

with \(N_\ell -2s\) repeated components in the middle. By definition, it is easy to see that \(\mathcal {CNN}(s,L) \subseteq \mathcal{N}\mathcal{N}(d+Ls,L)\). The assumption of the special form (4.8) for the biases is used to reduce the number of free parameters in the network. As in [67], one can compute that the number of free parameters in \(\mathcal {CNN}(s,L)\) is \((5s+2)L+2d-2s\), which grows linearly in L.
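As an illustrative aside, the following minimal NumPy sketch, our own (the name `conv_matrix` is not from [67]), builds the \((d+s)\times d\) Toeplitz convolutional matrix \(A^w\) and checks that \(A^w x\) agrees with the convolution \(w*x\).

```python
import numpy as np

def conv_matrix(w: np.ndarray, d: int) -> np.ndarray:
    """Toeplitz convolutional matrix A^w of shape (d + s, d) for a filter
    w = (w_0, ..., w_s): entry (i, j) is w_{i-j} when 0 <= i - j <= s and 0
    otherwise (0-based indices here; the text uses 1-based indices)."""
    s = len(w) - 1
    A = np.zeros((d + s, d))
    for i in range(d + s):
        for j in range(d):
            if 0 <= i - j <= s:
                A[i, j] = w[i - j]
    return A

# Sanity check: A^w applied to x equals the full convolution w * x.
w = np.array([1.0, -2.0, 0.5])   # filter of support length s = 2
x = np.arange(1.0, 6.0)          # input vector, d = 5
print(np.allclose(conv_matrix(w, 5) @ x, np.convolve(w, x)))  # True
```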

The next proposition shows that all functions in \(\mathcal{N}\mathcal{N}(N,1)\) can be implemented by CNNs.

Proposition 4.8

[67] If \(N,L\in \mathbb {N}\) satisfy \(L\ge \lfloor \frac{Nd}{s-1}+1 \rfloor \), then \(\mathcal{N}\mathcal{N}(N,1) \subseteq \mathcal {CNN}(s,L)\).

Proof

This result is proven in [67, Proof of Theorem 2]. We only give a sketch of the construction for completeness. Any function \(f\in \mathcal{N}\mathcal{N}(N,1)\) can be written as \(f(x) = \sum _{i=1}^N c_i \sigma (a_i^\intercal x+b_i) + c_0\), where \(a_i\in \mathbb {R}^d\) and \(b_i,c_i\in \mathbb {R}\). Define a sequence v supported on \(\{0,\dots ,Nd-1\}\) by stacking the vectors \(a_1,\dots , a_N\) (with components reversed):

$$\begin{aligned} (v_{Nd-1},\dots ,v_0) = (a_N^\intercal ,\dots ,a_1^\intercal ). \end{aligned}$$

Applying [67, Theorem 3] to the sequence v, we can construct filters \(\{w^{(\ell )}\}_{\ell =0}^{L-1}\) supported on \(\{0,1,\dots ,s\}\) such that \(v=w^{(L-1)}*w^{(L-2)}*\dots *w^{(0)}\), which implies \(A^{w^{(L-1)}} \cdots A^{w^{(0)}} = A^v \in \mathbb {R}^{(d+Ls)\times d}\). Note that, by definition, for \(i=1,\dots ,N\), the id-th row of \(A^v\) is exactly \(a_i^\intercal \). Then, for \(\ell =0,\dots ,L-2\), we can choose \(b^{(\ell )}\) satisfying (4.8) such that \(f^{(\ell +1)}(x) = A^{w^{(\ell )}} \cdots A^{w^{(0)}}x+B^{(\ell )}\), where \(B^{(\ell )}>0\) is a sufficiently large constant that makes the components of \(f^{(\ell +1)}(x)\) positive for all \(x\in \mathbb {B}^d\). Finally, we can construct \(b^{(L-1)}\) such that \(f^{(L)}_k(x)=\sigma (a_i^\intercal x+b_i)\) for \(i=1,\dots ,N\) and \(k=id\), which implies \(f\in \mathcal {CNN}(s,L)\). \(\square \)
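The key algebraic fact behind this construction is that convolving filters corresponds to multiplying their Toeplitz matrices, i.e. \(A^{w^{(1)}}A^{w^{(0)}} = A^{w^{(1)}*w^{(0)}}\) with compatible dimensions. A small numeric check of this identity, our own and reusing the `conv_matrix` sketch given after (4.8):

```python
import numpy as np

# Reuses conv_matrix from the sketch following (4.8).
d = 4
w0 = np.array([1.0, 2.0, -1.0])   # filter supported on {0, 1, 2}, so s = 2
w1 = np.array([0.5, 0.0, 3.0])

A0 = conv_matrix(w0, d)           # shape (d + 2, d)
A1 = conv_matrix(w1, d + 2)       # shape (d + 4, d + 2)

# Composing the two layers gives the Toeplitz matrix of the convolved filter.
print(np.allclose(A1 @ A0, conv_matrix(np.convolve(w1, w0), d)))  # True
```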

Note that Proposition 4.8 shows that each shallow neural network can be represented by a CNN with the same order of parameters. As a corollary, we obtain approximation rates for CNNs.

Corollary 4.9

Let \(s\ge 2\) be an integer.

  1.

    For \(\mathcal {H}^\alpha \) with \(0<\alpha <(d+3)/2\), we have

    $$\begin{aligned} \sup _{h\in \mathcal {H}^\alpha } \inf _{f\in \mathcal {CNN}(s,L)} \Vert h-f\Vert _{L^\infty (\mathbb {B}^d)} \lesssim L^{-\frac{\alpha }{d}}. \end{aligned}$$
  2.

    For \(\mathcal {F}_{\sigma }(1)\), we have

    $$\begin{aligned} \sup _{h \in \mathcal {F}_{\sigma }(1)} \inf _{f\in \mathcal {CNN}(s,L)} \Vert h-f\Vert _{L^\infty (\mathbb {B}^d)} \lesssim L^{-\frac{1}{2} - \frac{3}{2d}}. \end{aligned}$$

Proof

For any \(N\in \mathbb {N}\), we take \(L=\lfloor \frac{Nd}{s-1}+1 \rfloor \asymp N\), then \(\mathcal {F}_{\sigma }(N,M) \subseteq \mathcal{N}\mathcal{N}(N,1) \subseteq \mathcal {CNN}(s,L)\) for any \(M>0\), by Proposition 4.8. (1) follows from Corollary 2.4 and (2) is from Lemma 2.3. \(\square \)

Since the number of parameters in \(\mathcal {CNN}(s,L)\) is approximately L, the rate \(\mathcal {O}(L^{-\alpha /d})\) in part (1) of Corollary 4.9 is the same as the rate in [64] for fully connected neural networks. However, [31, 65] showed that this rate can be improved to \(\mathcal {O}(L^{-2\alpha /d})\) for fully connected neural networks by using the bit extraction technique [6]. It would be interesting to see whether this rate also holds for \(\mathcal {CNN}(s,L)\).

As in Theorem 4.2, we use Lemma 4.1 to decompose the error and bound the approximation error by Corollary 4.9. The covering number is again bounded by the pseudo-dimension.

Theorem 4.10

Let \(f_n^*\) be the estimator (4.2) with \(\mathcal {F}_n = \mathcal {CNN}(s,L_n)\), where \(s\ge 2\) is a fixed integer, and set \(B_n = c_1\log n\) for some constant \(c_1>0\).

  1.

    If \(\mathcal {H}= \mathcal {H}^\alpha \) with \(\alpha <(d+3)/2\), we choose

    $$\begin{aligned} L_n \asymp n^{\frac{d}{2d+2\alpha }}, \end{aligned}$$

    then

    $$\begin{aligned} \mathbb {E}_{\mathcal {D}_n} \Vert \mathcal {T}_{B_n}f_n^*-h\Vert _{L^2(\mu )}^2 \lesssim n^{-\frac{\alpha }{d+\alpha }} (\log n)^4. \end{aligned}$$
  2.

    If \(\mathcal {H}= \mathcal {F}_{\sigma }(1)\), we choose

    $$\begin{aligned} L_n \asymp n^{\frac{d}{3d+3}}, \end{aligned}$$

    then

    $$\begin{aligned} \mathbb {E}_{\mathcal {D}_n} \Vert \mathcal {T}_{B_n}f_n^*-h\Vert _{L^2(\mu )}^2 \lesssim n^{-\frac{d+3}{3d+3}} (\log n)^4. \end{aligned}$$

Proof

The proof is the same as that of Theorem 4.2 and [68]. We can use (4.3) to bound the covering number by the pseudo-dimension. For convolutional neural networks, [6] gave the following estimate of the pseudo-dimension:

$$\begin{aligned} \,\textrm{Pdim}\,(\mathcal {T}_{B_n}\mathcal {F}_n) \lesssim L_n p(s,L_n) \log (q(s,L_n)) \lesssim L_n^2 \log L_n, \end{aligned}$$

where \(p(s,L_n)=(5s+2)L_n+2d-2s \lesssim L_n\) and \(q(s,L_n)\le L_n (d+sL_n) \lesssim L_n^2\) are the numbers of parameters and neurons of the network \(\mathcal {CNN}(s,L_n)\), respectively. Therefore,

$$\begin{aligned} \log \mathcal {N}(\epsilon , \mathcal {T}_{B_n}\mathcal {F}_n,\Vert \cdot \Vert _{L^1(X_{1:n})}) \lesssim L_n^2 \log (L_n) \log (B_n/\epsilon ). \end{aligned}$$

Applying Lemma 4.1 and Corollary 4.9, if \(\mathcal {H}= \mathcal {H}^\alpha \) with \(\alpha <(d+3)/2\), then

$$\begin{aligned} \mathbb {E}_{\mathcal {D}_n} \Vert \mathcal {T}_{B_n}f_n^*-h\Vert _{L^2(\mu )}^2 \lesssim \frac{L_n^2 \log (L_n) (\log n)^3}{n} + L_n^{-\frac{2\alpha }{d}}. \end{aligned}$$

We choose \(L_n \asymp n^{d/(2d+2\alpha )}\), then \(\mathbb {E}_{\mathcal {D}_n} \Vert \mathcal {T}_{B_n}f_n^*-h\Vert _{L^2(\mu )}^2 \lesssim n^{-\alpha /(d+\alpha )} (\log n)^4\). Similarly, if \(\mathcal {H}= \mathcal {F}_{\sigma }(1)\), then

$$\begin{aligned} \mathbb {E}_{\mathcal {D}_n} \Vert \mathcal {T}_{B_n}f_n^*-h\Vert _{L^2(\mu )}^2 \lesssim \frac{L_n^2 \log (L_n) (\log n)^3}{n} + L_n^{-\frac{d+3}{d}}. \end{aligned}$$

We choose \(L_n \asymp n^{d/(3d+3)}\), then \(\mathbb {E}_{\mathcal {D}_n} \Vert \mathcal {T}_{B_n}f_n^*-h\Vert _{L^2(\mu )}^2 \lesssim n^{-(d+3)/(3d+3)} (\log n)^4\). \(\square \)

Finally, we note that the recent paper [68] also studied the convergence of CNNs and proved the rate \(\mathcal {O}(n^{-1/3}(\log n)^2)\) for \(\mathcal {H}^\alpha \) with \(\alpha >(d+4)/2\). The convergence rate we obtained in Theorem 4.10 for \(\mathcal {F}_{\sigma }(1)\), which includes \(\mathcal {H}^\alpha \) with \(\alpha >(d+3)/2\) by Theorem 2.1, is slightly better than their rate.

5 Conclusion

This paper has established approximation bounds for shallow ReLU\(^k\) neural networks. We showed how to use these bounds to derive approximation rates for (deep or shallow) neural networks with constraints on the weights and convolutional neural networks. We also applied the approximation results to study the convergence rates of nonparametric regression using neural networks. In particular, we established the optimal convergence rates for shallow neural networks and showed that over-parameterized neural networks can achieve nearly optimal rates.

There are a few interesting questions we would like to propose for future research. First, for approximation by shallow neural networks, we establish the optimal rate in the supremum norm by using the results of [55] (Lemma 2.3). The paper [55] actually showed that approximation bounds similar to Lemma 2.3 also hold in Sobolev norms. We think it is a promising direction to extend our approximation results in the supremum norm (Theorem 2.1 and Corollary 2.4) to the Sobolev norms. Second, it is unclear whether over-parameterized neural networks can achieve the optimal rate for learning functions in \(\mathcal {H}^\alpha \). It seems that refined generalization error analysis is needed. Finally, it would be interesting to extend the theory developed in this paper to general activation functions and study how the results are affected by the activation functions.