1 Introduction

We consider approximation spaces associated to neural networks with the ReLU activation function \(\varrho :{\mathbb {R}}\rightarrow {\mathbb {R}},\) \(\varrho (x) = \max \{0,x\}\), as introduced in our recent work [11]; see also Sect. 2. Roughly speaking, these spaces are defined as follows: Given a depth-growth function \(\varvec{\ell }\!:\! {\mathbb {N}} \rightarrow {\mathbb {N}}_{\ge 2} \cup \{\infty \}\) and a coefficient growth function \({\varvec{c}}: {\mathbb {N}} \rightarrow {\mathbb {N}} \cup \{\infty \}\), the approximation space consists of all functions that can be approximated up to error \({\mathcal {O}}(n^{-\alpha })\), with the error measured in the \(L^p([0,1]^d)\) (quasi)-norm for some \(p\in (0,\infty ]\), by neural networks with at most \(\varvec{\ell }(n)\) layers and n coefficients (non-zero weights) of size at most \({\varvec{c}}(n)\), for \(n\in {\mathbb {N}}\). As shown in [11], these spaces can be equipped with a natural (quasi)-norm; see also Lemma 2.1 below. In most applications in machine learning, one is interested in “learning” such functions from finitely many samples. To render this problem well-defined, we would like the space to embed into \(C([0,1]^d)\). The question that we consider in this work is therefore to find conditions under which such an embedding holds. More generally, we are interested in characterizing the existence of embeddings of the approximation spaces into the Hölder spaces \(C^{0,\beta }([0,1]^d)\) for \(\beta \in (0, 1]\). The existence of such embeddings is far from clear; for example, recent work [5,6,7, 9] shows that fractal and highly non-regular functions can be efficiently approximated by (increasingly deep) ReLU neural networks.

The main contributions of this paper are sharp conditions under which the approximation space embeds into \(C([0,1]^d)\) or \(C^{0,\beta }([0,1]^d)\). These results in particular show that, similar to classical Sobolev spaces, one can trade smoothness for (improved) integrability in the context of neural network approximation, where “smoothness” of a function means that it is well approximable by neural networks.

Building on these insights, we consider the problem of finding optimal numerical algorithms for reconstructing arbitrary functions in these approximation spaces from finitely many point samples, where the reconstruction error is measured in the \(L^\infty ([0,1]^d)\) norm: as a corollary to our embedding theorems, we establish the somewhat surprising fact that, irrespective of \(\alpha \), an optimal algorithm is given by piecewise constant interpolation with respect to a uniform tensor-product grid in \([0,1]^d\).

1.1 Description of Results

In the following sections we sketch our main results. To simplify the presentation, let us assume that the coefficient growth function \({\varvec{c}}\) satisfies the asymptotic growth \({\varvec{c}}(n) \asymp n^\theta \cdot (\ln (2n))^{\kappa }\) for certain \(\theta \ge 0\) and \(\kappa \in {\mathbb {R}}\) (our general results hold for any coefficient growth function). Let

$$\begin{aligned} \varvec{\ell }^*:= \sup _{n\in {\mathbb {N}}} \varvec{\ell }(n) \in {\mathbb {N}}_{\ge 2} \cup \{ \infty \} \qquad \text{ and } \qquad \gamma ^*:= \theta \cdot \varvec{\ell }^*+ \lfloor \varvec{\ell }^*/2\rfloor \in [1,\infty ]. \end{aligned}$$
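
To explain where the expression for \(\gamma ^*\) comes from (it is a special case of the quantity \(\gamma ^*(\varvec{\ell },{\varvec{c}})\) defined in Sect. 2.3), note that the decisive quantity in our analysis is the maximal growth of \(({\varvec{c}}(n))^L \cdot n^{\lfloor L/2 \rfloor }\) over all admissible depths \(L \le \varvec{\ell }^*\). For the polynomial-logarithmic coefficient growth assumed above, a direct computation gives, up to the logarithmic factor,

$$\begin{aligned} ({\varvec{c}}(n))^L \cdot n^{\lfloor L/2 \rfloor } \asymp n^{\theta L + \lfloor L/2 \rfloor } \cdot (\ln (2n))^{\kappa L} \qquad \text {and} \qquad \sup _{L \in {\mathbb {N}}_{\le \varvec{\ell }^*}} \bigl ( \theta L + \lfloor L/2 \rfloor \bigr ) = \theta \cdot \varvec{\ell }^*+ \lfloor \varvec{\ell }^*/2\rfloor = \gamma ^*, \end{aligned}$$

since \(\theta L + \lfloor L/2 \rfloor \) is non-decreasing in L.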

1.1.1 Embedding Results

Our first main theorem provides sharp embedding results into \(C([0,1]^d)\). A slight simplification of our main result reads as follows.

Theorem 1.1

If \(\alpha > \frac{d}{p}\cdot \gamma ^*\), then the approximation space embeds into \(C([0,1]^d)\). If \(\alpha < \frac{d}{p}\cdot \gamma ^*\), then it does not embed into \(C([0,1]^d)\).

Theorem 1.1 is a special case of Theorems 3.3 and 3.6 below that will be proven in Sect. 3.

We would like to draw attention to the striking similarity between Theorem 1.1 and the classical Sobolev embedding theorem for Sobolev spaces \(W^{\alpha ,p}\) which states that \(W^{\alpha ,p}([0,1]^d)\hookrightarrow C([0,1]^d)\) holds if \(\alpha > \frac{d}{p}\), but not if \(\alpha < \frac{d}{p}\). Just as in the case of classical Sobolev spaces we show that, also in the context of neural network approximation, it is possible to trade “smoothness” (i.e., approximation order) for (improved) integrability.

We also prove sharp embedding results into the Hölder spaces \(C^{0,\beta }([0,1]^d)\). A slight simplification of our main result reads as follows.

Theorem 1.2

Let \(\beta \in (0,1)\).

  • If \(\alpha > \frac{\beta + \frac{d}{p}}{1-\beta }\cdot \gamma ^*\), then the approximation space embeds into \(C^{0,\beta }([0,1]^d)\).

  • If \(\alpha < \frac{\beta + \frac{d}{p}}{1-\beta }\cdot \gamma ^*\), then it does not embed into \(C^{0,\beta }([0,1]^d)\).

Theorem 1.2 is a special case of Theorems 3.5 and 3.6 below that will be proven in Sect. 3.

Theorems 1.1 and 1.2 provide a complete characterization of the embedding properties of the neural network approximation spaces away from the “critical” index \(\alpha = \frac{d}{p}\cdot \gamma ^*\) (embedding into \(C([0,1]^d)\)), respectively \(\alpha = \frac{\beta + \frac{d}{p}}{1-\beta }\cdot \gamma ^*\) (embedding into \(C^{0,\beta }([0,1]^d)\)). We leave the (probably subtle) question of what happens at the critical index open for future work.

1.1.2 Optimal Learning Algorithms

Our embedding theorems allow us to deduce an interesting and surprising result related to the algorithmic feasibility of a learning problem posed on neural network approximation spaces.

More precisely, we consider the problem of approximating a function in the unit ball \(U\) of the neural network approximation space under the constraint that the approximating algorithm is only allowed to access point samples of the functions \(f \in U\). Formally, a (deterministic) algorithm using m point samples is a map \(A: U \rightarrow L^\infty ([0,1]^d)\) determined by a set of sample points \({\varvec{x}}= (x_1,\dots ,x_m) \in ([0,1]^d)^m\) and a map \(Q: {\mathbb {R}}^m \rightarrow L^\infty ([0,1]^d)\) such that \(A(f) = Q\bigl ( f(x_1),\dots ,f(x_m) \bigr )\) for all \(f \in U\).

Note that A need not be an “algorithm” in the usual sense, i.e., implementable on a Turing machine. The set of all such “algorithms” using m point samples is denoted by \(\mathrm {Alg}_m\), and we define the optimal order for (deterministically) uniformly approximating functions in \(U\) using point samples as the best possible convergence rate with respect to the number of samples:

$$\begin{aligned} s^{\mathrm {det}}(U) := \sup \Bigl \{ s \ge 0 \,:\, \exists \, C > 0 \,\, \forall \, m \in {\mathbb {N}} \,\, \exists \, A \in \mathrm {Alg}_m : \,\, \sup _{f \in U} \Vert f - A(f) \Vert _{L^\infty ([0,1]^d)} \le C \cdot m^{-s} \Bigr \}. \end{aligned}$$

In a similar way one can define randomized (Monte Carlo) algorithms and consider the optimal order \(s^{\mathrm {MC}}(U)\) for approximating functions in \(U\) using randomized algorithms based on point samples; see [11, Section 2.4.2]. In [11, Theorem 1.1] we showed (under the assumption \(\varvec{\ell }^*\ge 3\)) that

$$\begin{aligned} s^{\mathrm {det}}(U) = s^{\mathrm {MC}}(U) = \frac{1}{d} \cdot \frac{\alpha }{\gamma ^*+ \alpha }. \end{aligned}$$
(1.1)

We now describe a particularly simple algorithm that achieves this optimal rate. For \(m \in {\mathbb {N}}_{\ge 2^d}\), let \(n:= \lfloor m^{1/d}\rfloor - 1\) and let

$$\begin{aligned} A_m^{\mathrm {pw-const}}(f)(x):= \sum _{i_1 = 0}^n \cdots \sum _{i_d=0}^n f \left( \frac{i_1}{n},\dots , \frac{i_d}{n}\right) \mathbb {1}_{\left( \frac{i_1}{n},\dots , \frac{i_d}{n}\right) + \left[ 0,\frac{1}{n}\right) ^d}(x) \end{aligned}$$

be the piecewise constant interpolation of f on a tensor product grid with sidelength \(\frac{1}{n}\), where for \(\tau \in [0,1]^d\) the function \(\mathbb {1}_{\tau + \left[ 0,\frac{1}{n}\right) ^d}\) denotes the indicator function of the set \(\tau + \left[ 0,\frac{1}{n}\right) ^d\subset {\mathbb {R}}^d\). Since \((n+1)^d \le m\), the map \(A_m^{\mathrm {pw-const}}\) uses at most m point samples, and it clearly holds that \(A_m^{\mathrm {pw-const}} \in \mathrm {Alg}_m\).
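
To make the algorithm completely concrete, the following minimal sketch implements \(A_m^{\mathrm {pw-const}}\) in Python with NumPy. The function name and the implementation details are our own illustrative choices; the sampling pattern and the cell structure are exactly those of the display above.

```python
import numpy as np

def pw_const_interpolant(f, m, d):
    """Piecewise constant interpolant A_m^{pw-const}(f) on [0,1]^d.

    f: callable taking an array of shape (d,) and returning a float.
    m: sample budget; we require m >= 2**d.
    Returns a callable g with g(x) = f(i_1/n, ..., i_d/n) for
    x in (i_1/n, ..., i_d/n) + [0, 1/n)^d.
    """
    assert m >= 2 ** d
    n = int(np.floor(m ** (1.0 / d))) - 1   # grid parameter; (n+1)^d <= m
    grid = np.linspace(0.0, 1.0, n + 1)     # {0, 1/n, ..., n/n} per coordinate
    # Sample f on the tensor-product grid, using (n+1)^d <= m point samples.
    values = np.empty((n + 1,) * d)
    for idx in np.ndindex(*values.shape):
        values[idx] = f(grid[list(idx)])
    def g(x):
        # Locate the cell (i_1/n, ..., i_d/n) + [0, 1/n)^d containing x.
        i = np.minimum(np.floor(n * np.asarray(x, dtype=float)).astype(int), n)
        return values[tuple(i)]
    return g
```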

We have the following theorem.

Theorem 1.3

Under the above assumptions on \(\varvec{\ell },{\varvec{c}}\), the piecewise constant interpolation operator \(A_m^{\mathrm {pw-const}}\) constitutes an optimal algorithm for reconstructing functions in \(U\) from point samples: we have

$$\begin{aligned} \sup _{f \in U} \bigl \Vert f - A_m^{\mathrm {pw-const}}(f) \bigr \Vert _{L^\infty ([0,1]^d)} = {\mathcal {O}}\bigl ( m^{-\lambda } \bigr ) \qquad \text {for every } \lambda < \frac{1}{d} \cdot \frac{\alpha }{\gamma ^*+ \alpha }. \end{aligned}$$
(1.2)

Proof

Equation (1.1) yields the best possible rate with respect to the number of samples. Without loss of generality we assume that \(\varvec{\ell }^*< \infty \) (otherwise there is nothing to show as the optimal rate is zero). We now show that all rates below the optimal rate \(\frac{1}{d} \cdot \frac{\alpha }{\gamma ^*+ \alpha }\) can be achieved by piecewise constant interpolation, which readily implies our theorem. To this end, let \(\lambda \in \bigl (0,\frac{1}{d} \cdot \frac{\alpha }{\gamma ^*+ \alpha }\bigr )\) be arbitrary and let \(\beta := d \cdot \lambda \in (0, \frac{\alpha }{\gamma ^*+ \alpha }) \subset (0,1)\). It is easy to check that \(\alpha > \frac{\beta }{1-\beta } \gamma ^*\). Hence, Theorem 1.2 (applied with \(p=\infty \)) implies that the approximation space embeds into \(C^{0,\beta }([0,1]^d)\), yielding the existence of a constant \(C_1 > 0\) with \(U \subset C_1 \cdot U^\beta \),

where \(U^\beta \) denotes the unit ball in \(C^{0,\beta }([0,1]^d)\). Therefore, since \(A_m^{\mathrm {pw-const}}\) is linear,

$$\begin{aligned} \sup _{f \in U} \bigl \Vert f - A_m^{\mathrm {pw-const}}(f) \bigr \Vert _{L^\infty } \le C_1 \cdot \sup _{g \in U^\beta } \bigl \Vert g - A_m^{\mathrm {pw-const}}(g) \bigr \Vert _{L^\infty } \le C_1 C_2 \cdot m^{-\lambda }, \end{aligned}$$
(1.3)

where the last inequality follows from the elementary fact that piecewise constant interpolation on the uniform tensor-product grid \(\{ \frac{0}{n}, \frac{1}{n},..., \frac{n}{n} \}^d\) achieves \(L^\infty \) error of \({\mathcal {O}}(n^{-\beta }) \subset {\mathcal {O}}(m^{-\beta / d}) \!=\! {\mathcal {O}}(m^{-\lambda })\) for functions \(f \in U^\beta \). Since \(\lambda <\frac{1}{d}\cdot \frac{\alpha }{\gamma ^*+ \alpha }\) was arbitrary, Equation (1.3) proves Equation (1.2) and thus our theorem. \(\square \)
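
As a quick numerical illustration of the proof strategy (not part of the formal argument), one can observe the rate \({\mathcal {O}}(m^{-\beta /d})\) for an explicit Hölder function, here \(f(x) = \sqrt{x}\) with \(d = 1\) and \(\beta = 1/2\), using the sketch pw_const_interpolant from Sect. 1.1.2:

```python
import numpy as np

# f(x) = sqrt(x) lies in C^{0,1/2}([0,1]), so for d = 1 we expect the
# sup-error of piecewise constant interpolation to decay like m^{-1/2}.
f = lambda x: np.sqrt(x[0])
xs = np.linspace(0.0, 1.0, 10_001)
for m in [10, 100, 1_000, 10_000]:
    g = pw_const_interpolant(f, m, d=1)
    err = max(abs(f(np.array([x])) - g([x])) for x in xs)
    print(f"m = {m:6d}   sup-error = {err:.4f}   m**(-1/2) = {m ** -0.5:.4f}")
```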

Theorem 1.3 proves that, provided that the reconstruction error is measured in the \(L^\infty \) norm, the optimal “learning” algorithm for reconstructing functions that are well approximable by ReLU neural networks is simply given by piecewise constant interpolation on a tensor product grid, at least if no additional information about the target function is given. We consider it surprising that the optimal algorithm for this problem possesses such a simple and explicit representation.

We finally remark that we expect all our results to hold (in slightly modified form) if general bounded Lipschitz domains \(\Omega \subset {\mathbb {R}}^d\) are considered in place of the unit cube \([0,1]^d\).

1.2 Related Work

The study of approximation properties of neural networks has a long history, see for example the surveys [3, 7, 9, 12] and the references therein. Function spaces related to neural networks have been studied in several works. For example, [1, 4, 14] study so-called Barron-type spaces that arise in a certain sense as approximation spaces of shallow neural networks. The paper [2] develops a calculus on spaces of functions that can be approximated by neural networks without curse of dimension. Closest to our work are the papers [10, 11] which study functional analytic properties of neural network approximation spaces similar to the ones studied in the present paper. The paper [10] also considers embedding results between such spaces with different constraints on their underlying network architecture. However, none of the mentioned works considered embedding theorems of Sobolev type comparable to those in the present paper. An important reason for this is that the spaces considered in [10] do not restrict the magnitude of the weights of the approximating networks, so that essentially \({\varvec{c}}\equiv \infty \) in our terminology. The resulting approximation spaces then contain functions of very low smoothness.

1.3 Structure of the Paper

The structure of this paper is as follows. In Sect. 2 we provide a definition of the neural network approximation spaces considered in this paper and establish some of their basic properties. In Sect. 3 we prove our embedding theorems.

2 Definition of Neural Network Approximation Spaces

In this section, we review the mathematical definition of neural networks, then formally introduce the neural network approximation spaces \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}(\Omega )\), and finally define the quantities \(\varvec{\ell }^*\) and \(\gamma ^*(\varvec{\ell },{\varvec{c}})\) that will turn out to be decisive for characterizing whether the embeddings into \(C([0,1]^d)\) or \(C^{0,\beta }([0,1]^d)\) hold.

2.1 The Mathematical Formalization of Neural Networks

In our analysis, it will be helpful to distinguish between a neural network \(\Phi \) as a set of weights and the associated function \(R_\varrho \Phi \) computed by the network. Thus, we say that a neural network is a tuple \({\Phi = \big ( (A_1,b_1), \dots , (A_L,b_L) \big )}\), with \(A_\ell \in {\mathbb {R}}^{N_\ell \times N_{\ell -1}}\) and \(b_\ell \in {\mathbb {R}}^{N_\ell }\). We then say that \({{\varvec{a}}(\Phi ):= (N_0,\dots ,N_L) \in {\mathbb {N}}^{L+1}}\) is the architecture of \(\Phi \), \(L(\Phi ):= L\) is the number of layers of \(\Phi \), and \({W(\Phi ):= \sum _{j=1}^L (\Vert A_j \Vert _{\ell ^0} + \Vert b_j \Vert _{\ell ^0})}\) denotes the number of (non-zero) weights of \(\Phi \). The notation \(\Vert A \Vert _{\ell ^0}\) used here denotes the number of non-zero entries of a matrix (or vector) A. Finally, we write \(d_{\textrm{in}}(\Phi ):= N_0\) and \(d_{\textrm{out}}(\Phi ):= N_L\) for the input and output dimension of \(\Phi \), and we set \(\Vert \Phi \Vert _{{{\mathcal {N}}}{{\mathcal {N}}}}:= \max _{j = 1,\dots ,L} \max \{ \Vert A_j \Vert _{\infty }, \Vert b_j \Vert _{\infty } \}\), where \({\Vert A \Vert _{\infty }:= \max _{i,j} |A_{i,j}|}\).

To define the function \(R_\varrho \Phi \) computed by \(\Phi \), we need to specify an activation function. In this paper, we will only consider the so-called rectified linear unit (ReLU) \({\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto \max \{ 0, x \}}\), which we understand to act componentwise on \({\mathbb {R}}^n\), i.e., \(\varrho \bigl ( (x_1,\dots ,x_n)\bigr ) = \bigl (\varrho (x_1),\dots ,\varrho (x_n)\bigr )\). The function \(R_\varrho \Phi : {\mathbb {R}}^{N_0} \rightarrow {\mathbb {R}}^{N_L}\) computed by the network \(\Phi \) (its realization) is then given by

$$\begin{aligned} R_\varrho \Phi := T_L \circ (\varrho \circ T_{L-1}) \circ \cdots \circ (\varrho \circ T_1) \quad \text {where} \quad T_\ell \, x = A_\ell \, x + b_\ell . \end{aligned}$$
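
The following short sketch mirrors these definitions in Python with NumPy (our own illustrative code, not taken from [11]): a network \(\Phi \) is represented as a list of (A, b) pairs, and \(R_\varrho \Phi \), \(W(\Phi )\), and \(\Vert \Phi \Vert _{{{\mathcal {N}}}{{\mathcal {N}}}}\) are computed directly from this representation.

```python
import numpy as np

def realization(phi, x):
    """Compute (R_rho Phi)(x) for phi = [(A_1, b_1), ..., (A_L, b_L)]."""
    # The ReLU is applied after every affine map T_l except the last one.
    for A, b in phi[:-1]:
        x = np.maximum(A @ x + b, 0.0)
    A, b = phi[-1]
    return A @ x + b

def num_weights(phi):
    """W(Phi): total number of non-zero entries of all A_l and b_l."""
    return sum(np.count_nonzero(A) + np.count_nonzero(b) for A, b in phi)

def nn_norm(phi):
    """||Phi||_NN: largest absolute value of any individual weight."""
    return max(max(np.abs(A).max(), np.abs(b).max()) for A, b in phi)
```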

2.2 Neural Network Approximation Spaces

Approximation spaces [8] classify functions according to how well they can be approximated by a family \(\varvec{\Sigma } = (\Sigma _n)_{n \in {\mathbb {N}}}\) of certain “simple functions” of increasing complexity n, as \(n \rightarrow \infty \). Common examples consider the case where \(\Sigma _n\) is the set of polynomials of degree n, or the set of all linear combinations of n wavelets. The notion of neural network approximation spaces was originally introduced in [10], where \(\Sigma _n\) was taken to be a family of neural networks of increasing complexity. However, [10] does not impose any restrictions on the size of the individual network weights, which plays an important role in practice and which is important in determining the regularity of the function implemented by the network. Since the function spaces introduced in [10] ignore this quantity, the spaces defined there never embed into the Hölder-space \(C^{0,\beta }\) for any \(\beta > 0\).

For this reason, and following our previous work [11], we introduce a modified notion of neural network approximation spaces that also takes the size of the individual network weights into account. Precisely, given an input dimension \(d \in {\mathbb {N}}\) (which we will keep fixed throughout this paper) and non-decreasing functions \({\varvec{\ell }: {\mathbb {N}}\rightarrow {\mathbb {N}}_{\ge 2} \cup \{ \infty \}}\) and \({\varvec{c}}: {\mathbb {N}}\rightarrow {\mathbb {N}}\cup \{ \infty \}\) (called the depth-growth function and the coefficient growth function, respectively), we define

$$\begin{aligned} \Sigma _n = \Sigma _n^{\varvec{\ell },{\varvec{c}}} := \bigl \{ R_\varrho \Phi \,:\, \Phi \text { neural network with } d_{\textrm{in}}(\Phi ) = d, \, d_{\textrm{out}}(\Phi ) = 1, \, W(\Phi ) \le n, \, L(\Phi ) \le \varvec{\ell }(n), \, \Vert \Phi \Vert _{{{\mathcal {N}}}{{\mathcal {N}}}} \le {\varvec{c}}(n) \bigr \}. \end{aligned}$$

Then, given a measurable subset \(\Omega \subset {\mathbb {R}}^d\) as well as \(p \in (0,\infty ]\) and \(\alpha \in (0,\infty )\), for each measurable \(f: \Omega \rightarrow {\mathbb {R}}\), we define

$$\begin{aligned} \Gamma _{\alpha ,p}(f) := \max \Bigl \{ \Vert f \Vert _{L^p(\Omega )}, \,\, \sup _{n \in {\mathbb {N}}} \bigl [ n^\alpha \cdot d_{p}(f, \Sigma _n) \bigr ] \Bigr \} \in [0,\infty ], \end{aligned}$$

where \(d_{p}(f, \Sigma ):= \inf _{g \in \Sigma } \Vert f - g \Vert _{L^p(\Omega )}\).

The remaining issue is that since the set \(\Sigma _n\) is in general neither closed under addition nor under multiplication with scalars, \(\Gamma _{\alpha ,p}\) is not a (quasi)-norm. To resolve this issue, taking inspiration from the theory of Orlicz spaces (see e.g. [13, Theorem 3 in Section 3.2]), we define the neural network approximation space quasi-norm as

$$\begin{aligned} \Vert f \Vert _{A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}(\Omega )} := \inf \bigl \{ \sigma > 0 \,:\, \Gamma _{\alpha ,p}(f / \sigma ) \le 1 \bigr \} \in [0,\infty ], \end{aligned}$$

giving rise to the approximation space

$$\begin{aligned} A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}(\Omega ) := \bigl \{ f: \Omega \rightarrow {\mathbb {R}} \text { measurable} \,:\, \Vert f \Vert _{A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}(\Omega )} < \infty \bigr \}. \end{aligned}$$

The following lemma summarizes the main elementary properties of these spaces.

Lemma 2.1

Let \(\varnothing \ne \Omega \subset {\mathbb {R}}^d\) be measurable, let \(p \in (0,\infty ]\) and \(\alpha \in (0,\infty )\). Then, the approximation space \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}(\Omega )\) satisfies the following properties:

  1. \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}(\Omega )\) is a quasi-normed space. Precisely, given arbitrary measurable functions \(f,g: \Omega \rightarrow {\mathbb {R}}\), it holds that \(\Vert f + g \Vert _{A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}(\Omega )} \le C \cdot \bigl ( \Vert f \Vert _{A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}(\Omega )} + \Vert g \Vert _{A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}(\Omega )} \bigr )\) for \(C = C(\alpha ,p)\).

  2. We have \(\Vert c \, f \Vert _{A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}(\Omega )} \le \Vert f \Vert _{A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}(\Omega )}\) for \(c \in [-1,1]\).

  3. \(\Vert f \Vert _{A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}(\Omega )} \le 1\) if and only if \(\Gamma _{\alpha ,p}(f) \le 1\).

  4. \(f \in A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}(\Omega )\) if and only if \(\Gamma _{\alpha ,p}(f / \sigma ) \le 1\) for some \(\sigma > 0\).

  5. \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}(\Omega ) \hookrightarrow L^p(\Omega )\), with \(\Vert f \Vert _{L^p(\Omega )} \le \Vert f \Vert _{A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}(\Omega )}\).

Proof

For \(p\in [1,\infty ]\) this is precisely [11, Lemma 2.1]. The case \(p\in (0,1)\) can be proven in a completely analogous way and is left to the reader. \(\square \)

Remark 2.2

While we introduced the spaces for general domains \(\Omega \), in what follows we will specialize to \(\Omega = [0,1]^d\). We expect, however, that all our results will remain true (up to minor modifications) for general bounded Lipschitz domains.

2.3 Quantities Characterizing the Complexity of the Network Architecture

To conveniently summarize those aspects of the growth behavior of the functions \(\varvec{\ell }\) and \({\varvec{c}}\) most relevant to us, we introduce three quantities that will turn out to be crucial for characterizing whether the embeddings into \(C([0,1]^d)\) or \(C^{0,\beta }([0,1]^d)\) hold. First, we set

$$\begin{aligned} \varvec{\ell }^*:= \sup _{n\in {\mathbb {N}}} \varvec{\ell }(n) \in {\mathbb {N}}_{\ge 2} \cup \{ \infty \}. \end{aligned}$$
(2.1)

Furthermore, we define \(\gamma ^*(\varvec{\ell }, {\varvec{c}}), \gamma ^{\diamondsuit } (\varvec{\ell }, {\varvec{c}}) \in (0,\infty ]\) by

$$\begin{aligned} \begin{aligned} \gamma ^*(\varvec{\ell }, {\varvec{c}})&:= \sup \bigg \{ \gamma \in [0,\infty ) \quad :\quad \exists \, L \in {\mathbb {N}}_{\le \varvec{\ell }^*}: \,\, \limsup _{n \rightarrow \infty } \frac{({\varvec{c}}(n))^L \cdot n^{\lfloor L/2 \rfloor }}{n^\gamma } \in (0,\infty ] \bigg \}, \\ \gamma ^{\diamondsuit } (\varvec{\ell }, {\varvec{c}})&:= \inf \Big \{ \gamma \in [0,\infty ) \quad :\quad \exists \, C > 0 \, \forall \, n \in {\mathbb {N}}, L \in {\mathbb {N}}_{\le \varvec{\ell }^*}: \,\, ({\varvec{c}}(n))^L \cdot n^{\lfloor L/2 \rfloor } \le C \cdot n^\gamma \Big \}, \end{aligned} \end{aligned}$$
(2.2)

where we use the convention that \(\sup \varnothing = 0\) and \(\inf \varnothing = \infty \). As we now show, the quantities \(\gamma ^*\) and \(\gamma ^{\diamondsuit }\) actually coincide.

Lemma 2.3

We have \(\gamma ^*(\varvec{\ell },{\varvec{c}}) = \gamma ^{\diamondsuit }(\varvec{\ell },{\varvec{c}})\). Furthermore, if \(\varvec{\ell }^*= \infty \), then \(\gamma ^*(\varvec{\ell },{\varvec{c}}) = \gamma ^{\diamondsuit }(\varvec{\ell },{\varvec{c}}) = \infty \).

Proof

For brevity, write \(\gamma ^*:= \gamma ^*(\varvec{\ell },{\varvec{c}})\) and \(\gamma ^{\diamondsuit }:= \gamma ^{\diamondsuit } (\varvec{\ell },{\varvec{c}})\).

Case 1 (\(\varvec{\ell }^*= \infty \)): In this case, given any \(\gamma > 0\), one can choose an even \(L \in {\mathbb {N}}= {\mathbb {N}}_{\le \varvec{\ell }^*}\) satisfying \(L > 2 \gamma \). It is then easy to see \( \limsup _{n \rightarrow \infty } \frac{({\varvec{c}}(n))^L \cdot n^{\lfloor L/2 \rfloor }}{n^\gamma } \ge \limsup _{n \rightarrow \infty } 1 = 1 \in (0,\infty ]. \) By definition of \(\gamma ^*\), this implies \(\gamma ^*\ge \gamma \) for each \(\gamma > 0\), and hence \(\gamma ^*= \infty \).

Next, we show that \(\gamma ^{\diamondsuit } = \infty \) as well. To see this, assume towards a contradiction that \(\gamma ^{\diamondsuit } < \infty \). By definition of \(\gamma ^{\diamondsuit }\), this means that there exist \(\gamma \ge 0\) and \(C > 0\) satisfying

$$\begin{aligned} C \cdot n^\gamma \ge ({\varvec{c}}(n))^L \cdot n^{\lfloor L/2 \rfloor } \ge n^{\lfloor L/2 \rfloor } \qquad \forall \, n,L \in {\mathbb {N}}. \end{aligned}$$

But choosing an even \(L \in {\mathbb {N}}\) satisfying \(L > 2 \gamma \), we have \(\lfloor L/2 \rfloor = L/2 > \gamma \), showing that the preceding displayed inequality cannot hold. This is the desired contradiction.

Case 2 (\(\varvec{\ell }^*< \infty \)): We first show \(\gamma ^*\le \gamma ^{\diamondsuit }\). In case of \(\gamma ^{\diamondsuit } = \infty \), this is trivial, so that we can assume \(\gamma ^{\diamondsuit } < \infty \). Let \(\gamma _0 \in [0,\infty )\) such that there exists \(C > 0\) satisfying \(({\varvec{c}}(n))^L \cdot n^{\lfloor L/2 \rfloor } \le C \cdot n^{\gamma _0}\) for all \(n \in {\mathbb {N}}\) and \(L \in {\mathbb {N}}_{\le \varvec{\ell }^*}\). Then, given any \(L \in {\mathbb {N}}_{\le \varvec{\ell }^*}\) and \(\gamma > \gamma _0\), we have

$$\begin{aligned} \limsup _{n \rightarrow \infty } \frac{({\varvec{c}}(n))^L \cdot n^{\lfloor L/2 \rfloor }}{n^{\gamma }} \le \limsup _{n \rightarrow \infty } \frac{C \cdot n^{\gamma _0}}{n^{\gamma }} = 0. \end{aligned}$$

This easily implies \(\gamma ^*\le \gamma _0\). Since this holds for all \(\gamma _0\) as above, we see by definition of \(\gamma ^{\diamondsuit }\) that \(\gamma ^*\le \gamma ^{\diamondsuit }\).

We finally show that \(\gamma ^{\diamondsuit } \le \gamma ^*\). In case of \(\gamma ^*= \infty \), this is clear; hence, we can assume that \(\gamma ^*< \infty \). Let \(\gamma > \gamma ^*\) be arbitrary. Then

$$\begin{aligned} 0 \le \liminf _{n \rightarrow \infty } \frac{({\varvec{c}}(n))^{\varvec{\ell }^*} \cdot n^{\lfloor \varvec{\ell }^*/2 \rfloor }}{n^\gamma } \le \limsup _{n \rightarrow \infty } \frac{({\varvec{c}}(n))^{\varvec{\ell }^*} \cdot n^{\lfloor \varvec{\ell }^*/2 \rfloor }}{n^\gamma } = 0. \end{aligned}$$

Thus, since convergent sequences are bounded, we see that \( ({\varvec{c}}(n))^{\varvec{\ell }^*} \cdot n^{\lfloor \varvec{\ell }^*/ 2 \rfloor } \le C \cdot n^\gamma \) for all \(n \in {\mathbb {N}}\) and a suitable \(C > 0\). Given any \(L \in {\mathbb {N}}_{\le \varvec{\ell }^*}\), we thus have \( ({\varvec{c}}(n))^L \cdot n^{\lfloor L/2 \rfloor } \le ({\varvec{c}}(n))^{\varvec{\ell }^*} \cdot n^{\lfloor \varvec{\ell }^*/ 2 \rfloor } \le C \cdot n^\gamma . \) By definition of \(\gamma ^{\diamondsuit }\), this implies \(\gamma ^{\diamondsuit } \le \gamma \). Since this holds for all \(\gamma > \gamma ^*\), we conclude \(\gamma ^{\diamondsuit } \le \gamma ^*\). \(\square \)

3 Sobolev-Type Embeddings for Neural Network Approximation Spaces

In this section, we establish a characterization of the embeddings \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d) \hookrightarrow C([0,1]^d)\) and \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d) \hookrightarrow C^{0,\beta }([0,1]^d)\) solely in terms of \(\gamma ^*(\varvec{\ell },{\varvec{c}})\) and the quantities \(d \in {\mathbb {N}}\), \(\beta \in (0,1)\), and \(p \in (0,\infty ]\). This is roughly similar to Sobolev embeddings, which allow one to “trade smoothness for increased integrability.”

3.1 Sufficient Conditions for Embedding into \(C([0,1]^d)\)

In this subsection we establish sufficient conditions for an embedding \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d) \hookrightarrow C([0,1]^d)\) to hold. The following lemma—showing that functions with large (but finite) Lipschitz constant but small \(L^p\) norm have a controlled \(L^\infty \) norm—is an essential ingredient for the proof.

We first make precise the notion of Lipschitz constant that we use in what follows. Given \(d\in {\mathbb {N}}\) and a function \(f:[0,1]^d \rightarrow {\mathbb {R}}\) we define

$$\begin{aligned} {\text {Lip}}(f):= \sup _{x,y \in [0,1]^d, x \ne y} \frac{|f(x) - f(y)|}{\Vert x-y \Vert _{\ell ^\infty }} \in [0,\infty ]. \end{aligned}$$

Lemma 3.1

Let \(p \in (0,\infty )\) and \(d \in {\mathbb {N}}\). Then there exists a constant \(C = C(d,p) > 0\) such that for each \(0 < T \le 1\) and each \(f \in L^p([0,1]^d) \cap C^{0,1}([0,1]^d)\), we have

$$\begin{aligned} \Vert f \Vert _{L^\infty } \le C \cdot \big ( T^{-d/p} \, \Vert f \Vert _{L^p} + T \cdot {\text {Lip}}(f) \big ). \end{aligned}$$

Proof

For brevity, set \(Q:= [0,1]^d\). Let \(f \in L^p([0,1]^d) \cap C^{0,1}([0,1]^d)\). Since Q is compact and f is continuous, there exists \(x_0 \in Q\) satisfying \(|f(x_0)| = \Vert f \Vert _{L^\infty }\). Set \(P:= Q \cap (x_0 + [-T,T]^d)\). For \(x \in P\), we have

$$\begin{aligned} \Vert f \Vert _{L^\infty }= & {} |f(x_0)| \le |f(x_0) - f(x)| + |f(x)| \le {\text {Lip}}(f) \cdot \Vert x - x_0 \Vert _{\ell ^\infty }+ |f(x)| \\\le & {} {\text {Lip}}(f) \cdot T + |f(x)|. \end{aligned}$$

Now, denoting the Lebesgue measure by \(\varvec{\lambda }\), taking the \(L^p(P)\) norm of the preceding estimate, and recalling that \(L^p\) is a (quasi)-Banach space, we obtain a constant \(C_1 = C_1(p) > 0\) satisfying

$$\begin{aligned} \Vert f \Vert _{L^\infty } \cdot [\varvec{\lambda }(P)]^{1/p}&= \big \Vert \Vert f \Vert _{L^\infty } \big \Vert _{L^p(P)} \le \big \Vert {\text {Lip}}(f) \cdot T + |f| \big \Vert _{L^p(P)} \\&\le C_1 \cdot \big ( {\text {Lip}}(f) \cdot T \cdot [\varvec{\lambda }(P)]^{1/p} + \Vert f \Vert _{L^p(P)} \big ) \\&\le C_1 \cdot \big ( {\text {Lip}}(f) \cdot T \cdot [\varvec{\lambda }(P)]^{1/p} + \Vert f \Vert _{L^p} \big ) . \end{aligned}$$

Next, [11, Lemma A.2] shows that \(\varvec{\lambda }(P) \ge 2^{-d} \, T^d\), so that we finally see

$$\begin{aligned} \Vert f \Vert _{L^\infty }\le & {} C_1 \cdot \bigl ({\text {Lip}}(f) \cdot T + [\varvec{\lambda }(P)]^{-1/p} \, \Vert f \Vert _{L^p}\bigr )\\\le & {} C_1 \cdot \bigl ({\text {Lip}}(f) \cdot T + 2^{d/p} \, T^{-d/p} \, \Vert f \Vert _{L^p}\bigr ), \end{aligned}$$

which easily implies the claim. \(\square \)
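
As a quick numerical sanity check of Lemma 3.1 (purely illustrative: we test the inequality with the concrete constant \(C = 1\) for \(d = p = 1\) and \(f(x) = \sin (2\pi k x)\), although the lemma only guarantees the existence of some constant \(C(d,p)\)):

```python
import numpy as np

xs = np.linspace(0.0, 1.0, 200_001)
for k in [1, 4, 16]:
    f = np.sin(2 * np.pi * k * xs)
    sup_norm = np.abs(f).max()       # ||f||_{L^inf} = 1
    l1_norm = np.mean(np.abs(f))     # ||f||_{L^1([0,1])}, about 2/pi
    lip = 2 * np.pi * k              # Lip(f), exact for this f
    for T in [0.01, 0.1, 1.0]:
        bound = l1_norm / T + T * lip    # right-hand side for d = p = 1, C = 1
        assert sup_norm <= bound
        print(f"k={k:2d}  T={T:4.2f}  ||f||_inf={sup_norm:.3f}  bound={bound:8.3f}")
```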

In order to derive a sufficient condition for the embedding \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d) \hookrightarrow C([0,1]^d)\), we will use the following lemma that provides a bound for the Lipschitz constant of a function implemented by a neural network, in terms of the size (number of non-zero weights) of the network and of the magnitude of the individual network weights.

Lemma 3.2

([11, Lemma 4.1]) Let \(\varvec{\ell }: {\mathbb {N}}\rightarrow {\mathbb {N}}_{\ge 2} \cup \{ \infty \}\) and \({\varvec{c}}: {\mathbb {N}}\rightarrow [1,\infty ]\). Let \(n \in {\mathbb {N}}\) and assume that \(L:= \varvec{\ell }(n)\) and \(C:= {\varvec{c}}(n)\) are finite. Then each \(F \in \Sigma _n\) satisfies

$$\begin{aligned} {\text {Lip}}(F) \le d \cdot C^L \cdot n^{\lfloor L/2 \rfloor }. \end{aligned}$$
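
Continuing the NumPy sketch from Sect. 2.1, the bound of Lemma 3.2 can be probed numerically. Note that random sampling only yields a lower estimate of \({\text {Lip}}(F)\), so this is a one-sided consistency check; the architecture and the weight bound below are arbitrary illustrative choices, and we treat the resulting values of n and L as if \({\varvec{c}}(n) \ge C\) and \(\varvec{\ell }(n) \ge L\).

```python
import numpy as np

rng = np.random.default_rng(0)
widths = [2, 8, 8, 1]                   # architecture (N_0, ..., N_L) with d = N_0 = 2
d, C = widths[0], 1.5
phi = [(C * rng.uniform(-1, 1, (n2, n1)), C * rng.uniform(-1, 1, n2))
       for n1, n2 in zip(widths[:-1], widths[1:])]
L, n = len(phi), num_weights(phi)
bound = d * nn_norm(phi) ** L * n ** (L // 2)    # Lemma 3.2

# Monte Carlo lower estimate of Lip(F) w.r.t. the l^infty norm.
X = rng.uniform(0, 1, (2000, d))
Y = rng.uniform(0, 1, (2000, d))
vx = np.array([realization(phi, x)[0] for x in X])
vy = np.array([realization(phi, y)[0] for y in Y])
lip_est = np.max(np.abs(vx - vy) / np.max(np.abs(X - Y), axis=1))
assert lip_est <= bound
print(f"estimated Lip(F) = {lip_est:.2f} <= bound = {bound:.2e}")
```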

Now, we can derive a sufficient condition for the embedding \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d) \hookrightarrow C([0,1]^d)\). Below, in Sect. 3.3, we will see that this sufficient condition is essentially optimal.

Theorem 3.3

Let \(d \in {\mathbb {N}}\) and \(p \in (0,\infty )\), and suppose that \(\gamma ^{*}(\varvec{\ell },{\varvec{c}}) < \infty \). If \(\alpha > \frac{d}{p} \cdot \gamma ^{*}(\varvec{\ell },{\varvec{c}})\), then \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d) \hookrightarrow C([0,1]^d)\). Furthermore, for each \(\gamma \in \bigl ( \gamma ^{*}(\varvec{\ell },{\varvec{c}}), \frac{p}{d} \, \alpha \bigr )\), we have the embedding

$$\begin{aligned} A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d) \hookrightarrow A^{\mu ,\infty }_{\varvec{\ell },{\varvec{c}}}([0,1]^d) \qquad \text {for} \qquad \mu = \frac{p}{d+p} \cdot \Bigl ( \alpha - \frac{d}{p} \, \gamma \Bigr ). \end{aligned}$$

Proof

As shown in Lemma 2.3, the assumption \(\gamma ^{*}(\varvec{\ell },{\varvec{c}}) < \infty \) ensures that \(L:= \varvec{\ell }^*< \infty \). Furthermore, since \( \gamma ^{\diamondsuit }(\varvec{\ell },{\varvec{c}}) = \gamma ^{*}(\varvec{\ell },{\varvec{c}}) < \frac{p}{d} \alpha \) we see by definition of \(\gamma ^{\diamondsuit }\) (see Eq. (2.2)) that there exist \(\gamma < \frac{p}{d} \alpha \) (arbitrarily close to \(\gamma ^{*}(\varvec{\ell },{\varvec{c}})\)) and \(C = C(\gamma ,\varvec{\ell },{\varvec{c}}) > 0\) satisfying \(({\varvec{c}}(n))^{L} \cdot n^{\lfloor L/2 \rfloor } \le C \cdot n^{\gamma }\) for all \(n \in {\mathbb {N}}\).

Let \(f \in A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d)\) with \(\Vert f \Vert _{A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d)} \le 1\). By Lemma 2.1, this implies \(\Gamma _{\alpha ,p}(f) \le 1\), so that for each \(m \in {\mathbb {N}}\), we can choose \(F_m \in \Sigma _{2^m}\) satisfying \(\Vert f - F_m \Vert _{L^p} \le 2 \cdot 2^{-\alpha m}\). Since furthermore \(\Vert f \Vert _{L^p} \le 1\), this remains true for \(m = 0\) if we set \(F_0:= 0\). Note that there exists a constant \(C_1 = C_1(p) > 0\) satisfying

$$\begin{aligned} \Vert F_{m+1} - F_m \Vert _{L^p}{} & {} \le C_1 \cdot \big ( \Vert F_{m+1} - f \Vert _{L^p} + \Vert f - F_m \Vert _{L^p} \big ) \\{} & {} \le C_1 \cdot \big ( 2 \cdot 2^{-\alpha (m+1)} + 2 \cdot 2^{-\alpha m} \big ) \le 4 C_1 \cdot 2^{- \alpha m}. \end{aligned}$$

Furthermore, note as a consequence of Lemma 3.2 and because of \(\varvec{\ell }(n) \le L\) for all \(n \in {\mathbb {N}}\) that

$$\begin{aligned} {\text {Lip}}(F_{m+1} - F_m)&\le {\text {Lip}}(F_{m+1}) + {\text {Lip}}(F_m)\\ {}&\le d \cdot \bigl ( [{\varvec{c}}(2^{m+1})]^{L} \, (2^{m+1})^{\lfloor L/2 \rfloor }+ [{\varvec{c}}(2^{m})]^{L} \, (2^{m})^{\lfloor L/2 \rfloor } \bigr )\\ {}&\le d \, C \cdot \bigl ( (2^{m+1})^\gamma + (2^m)^\gamma \bigr ) \le C_2 \cdot 2^{\gamma m} \end{aligned}$$

for a suitable constant \(C_2 = C_2(d,\gamma ,\varvec{\ell },{\varvec{c}}) > 0\).

Now, set \(\theta := \frac{p}{p+d} (\gamma + \alpha ) > 0\). Then, for each \(m \in {\mathbb {N}}\), we can apply Lemma 3.1 with \(T = T_m:= 2^{- m \theta } \in (0,1]\) to deduce for a suitable constant \(C_3 = C_3(d,p) > 0\) that

$$\begin{aligned} \Vert F_{m+1} - F_m \Vert _{L^\infty }&\le C_3 \cdot \bigl ( T_m^{-d/p} \, \Vert F_{m+1} - F_m \Vert _{L^p} + T_m \cdot {\text {Lip}}(F_{m+1} - F_m) \bigr ) \\&\le C_3 \cdot \bigl ( 4 C_1 \cdot 2^{\frac{d}{p} \theta m} \cdot 2^{-\alpha m} + C_2 \cdot 2^{-\theta m} \cdot 2^{\gamma m} \bigr ) \\&\le C_4 \cdot 2^{- \mu m} , \end{aligned}$$

for a suitable constant \(C_4 = C_4(d,p,\gamma ,\varvec{\ell },{\varvec{c}}) > 0\) and \(\mu := \frac{p}{d+p} (\alpha - \frac{d}{p}\gamma ) > 0\). Therefore, we see for \(m_0 \in \mathbb {N}\) and \(M \ge m \ge m_0\) that

$$\begin{aligned} \Vert F_M - F_m \Vert _{L^\infty } \le \sum _{\ell =m}^{M-1} \Vert F_{\ell +1} - F_\ell \Vert _{L^\infty } \le C_4 \sum _{\ell =m_0}^\infty 2^{- \mu \ell } \xrightarrow [m_0\rightarrow \infty ]{} 0, \end{aligned}$$
(3.1)

showing that \((F_m)_{m \in {\mathbb {N}}} \subset C([0,1]^d)\) is a Cauchy sequence. Since \(F_m \rightarrow f\) in \(L^p\), this implies (after possibly redefining f on a null-set) that \(f \in C([0,1]^d)\) and \(F_m \rightarrow f\) uniformly. Finally, we see

$$\begin{aligned} \Vert f \Vert _{L^\infty }= & {} \lim _{m \rightarrow \infty } \Vert F_m \Vert _{L^\infty } \le \lim _{m \rightarrow \infty } \sum _{\ell =0}^{m-1} \Vert F_{\ell +1} - F_\ell \Vert _{L^\infty } \le C_4 \cdot \sum _{\ell =0}^\infty 2^{- \mu \ell }\le C_5\nonumber \\= & {} C_5 (d,p,\gamma ,\alpha ,\varvec{\ell },{\varvec{c}}). \end{aligned}$$
(3.2)

Since this holds for all \(f \in A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d)\) with \(\Vert f \Vert _{A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d)} \le 1\), we have shown \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d) \hookrightarrow C([0,1]^d)\).

We complete the proof by showing \(\Vert f \Vert _{A^{\mu ,\infty }_{\varvec{\ell },{\varvec{c}}}([0,1]^d)} \lesssim 1\). By definition of \(\mu \) and since \(\gamma < \frac{p}{d}\alpha \) can be chosen arbitrarily close to \(\gamma ^{*}(\varvec{\ell },{\varvec{c}})\), this implies the second claim of the theorem. To see this, note thanks to Equation (3.1) and because of \(F_m \rightarrow f\) uniformly that

$$\begin{aligned} \Vert f - F_m \Vert _{L^\infty }= & {} \lim _{M\rightarrow \infty } \Vert F_M - F_m \Vert _{L^\infty } \le C_4 \sum _{\ell =m}^\infty 2^{-\mu \ell }\le \frac{C_4}{1 - 2^{-\mu }} 2^{- \mu m} \\=: & {} C_6 \cdot 2^{- \mu m}, \end{aligned}$$

for a suitable constant \(C_6 = C_6(d,p,\gamma ,\alpha ,\varvec{\ell },{\varvec{c}})\).

Now, given any \(t \in {\mathbb {N}}_{\ge 2}\), choose \(m \in {\mathbb {N}}\) with \(2^m \le t \le 2^{m+1}\). Since \(F_m \in \Sigma _{2^m} \subset \Sigma _t\), we then see \(d_\infty (f, \Sigma _t) \le \Vert f - F_m \Vert _{L^\infty } \le C_6 \cdot 2^{-\mu m} \le 2^\mu C_6 \cdot t^{-\mu }\). For \(t = 1\), we have \(d_\infty (f, \Sigma _1) \le \Vert f \Vert _{L^\infty } \le C_5\); see Equation (3.2). Overall, setting \(C_7:= \max \{ 1, 2^\mu C_6, C_5 \}\) and recalling Lemma 2.1, we see \(\Gamma _{\mu ,\infty }(f / C_7) \le 1\) and hence \(\Vert f \Vert _{A^{\mu ,\infty }_{\varvec{\ell },{\varvec{c}}}([0,1]^d)} \le C_7\), meaning \(f \in A^{\mu ,\infty }_{\varvec{\ell },{\varvec{c}}}([0,1]^d)\). All in all, this shows that indeed \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d) \hookrightarrow A^{\mu ,\infty }_{\varvec{\ell },{\varvec{c}}}([0,1]^d)\). \(\square \)

3.2 Sufficient Conditions for Embedding into \(C^{0,\beta }([0,1]^d)\)

Our next goal is to prove a sufficient condition for the embedding \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d) \hookrightarrow C^{0,\beta }([0,1]^d)\). We first introduce some notation. For \(d\in {\mathbb {N}}\), \(\beta \in (0,1]\) and \(f:[0,1]^d\rightarrow {\mathbb {R}}\) we define

$$\begin{aligned} {\text {Lip}}_\beta (f):= \sup _{x,y \in [0,1]^d, x \ne y} \frac{|f(x) - f(y)|}{\Vert x - y\Vert _{\ell ^\infty }^\beta } \in [0,\infty ]. \end{aligned}$$

Then, the Hölder space \(C^{0,\beta }([0,1]^d)\) consists of all (necessarily continuous) functions \(f: [0,1]^d \rightarrow {\mathbb {R}}\) satisfying

$$\begin{aligned} \Vert f \Vert _{C^{0,\beta }}:= \Vert f \Vert _{L^\infty } + {\text {Lip}}_\beta (f) < \infty . \end{aligned}$$

It is well-known that \(C^{0,\beta }([0,1]^d)\) is a Banach space with this norm.

The proof of our sufficient embedding condition relies on the following analogue of Lemma 3.1.

Lemma 3.4

Let \(p \in (0,\infty ]\), \(\beta \in (0,1)\), and \(d \in {\mathbb {N}}\). Then there exists a constant \(C = C(d,p,\beta ) \!>\! 0\) such that for each \(0 < T \le 1\) and each \(f \in L^p([0,1]^d) \cap C^{0,1}([0,1]^d)\), we have

$$\begin{aligned} {\text {Lip}}_\beta (f) \le C \cdot \big ( T^{-\frac{d}{p} - \beta } \Vert f \Vert _{L^p} + T^{1 - \beta } {\text {Lip}}(f) \big ). \end{aligned}$$

Proof

For \(x,y \in [0,1]^d\), we have \( |f(x) - f(y)|^{1-\beta } \le \big ( 2 \, \Vert f \Vert _{L^\infty } \big )^{1-\beta } \) and hence

$$\begin{aligned} |f(x) - f(y)|{} & {} = |f(x) - f(y)|^\beta \cdot |f(x) - f(y)|^{1-\beta }\\{} & {} \le 2^{1-\beta } \, \Vert f \Vert _{L^\infty }^{1-\beta } \cdot \big ( {\text {Lip}}(f) \cdot \Vert x - y\Vert _{\ell ^\infty } \big )^\beta , \end{aligned}$$

and this easily implies \({\text {Lip}}_\beta (f) \le 2^{1-\beta } \cdot \Vert f \Vert _{L^\infty }^{1-\beta } \cdot ({\text {Lip}}(f))^\beta \).

Let us first consider the case \(p < \infty \). In this case, we combine the preceding bound with the estimate of \(\Vert f \Vert _{L^\infty }\) provided by Lemma 3.1 to obtain a constant \(C_1 = C_1(d,p) > 0\) satisfying

$$\begin{aligned} {\text {Lip}}_\beta (f)&\le 2^{1-\beta } C_1^{1-\beta } \cdot \bigl ( T^{-d/p} \, \Vert f \Vert _{L^p} + T \cdot {\text {Lip}}(f) \bigr )^{1-\beta } \cdot \big ( {\text {Lip}}(f) \big )^\beta \\&\le 2^{1-\beta } C_1^{1-\beta } \cdot \big ( T^{-(1-\beta ) d / p} \cdot \Vert f \Vert _{L^p}^{1-\beta } \cdot ({\text {Lip}}(f))^\beta + T^{1-\beta } \cdot {\text {Lip}}(f) \big ) . \end{aligned}$$

The second step used the elementary estimate \((a + b)^\theta \le a^\theta + b^\theta \) for \(\theta \in (0,1)\) and \(a,b \ge 0\).

The preceding estimate shows that it suffices to prove that

$$\begin{aligned} T^{-(1-\beta ) d / p} \cdot \Vert f \Vert _{L^p}^{1-\beta } \cdot ({\text {Lip}}(f))^\beta \le C_2 \cdot \big ( T^{-\frac{d}{p} - \beta } \Vert f \Vert _{L^p} + T^{1-\beta } {\text {Lip}}(f) \big ) \end{aligned}$$

for a suitable constant \(C_2 = C_2(d,p,\beta ) > 0\). To prove the latter estimate, note for \(a, b > 0\) and \(\theta \in (0,1)\) that

$$\begin{aligned} a^\theta \cdot b^{1-\theta }= & {} \exp \bigl (\theta \ln a + (1-\theta ) \ln b \bigr ) \le \theta \exp (\ln a) + (1-\theta ) \exp (\ln b)\nonumber \\= & {} \theta a + (1-\theta ) b, \end{aligned}$$
(3.3)

thanks to the convexity of \(\exp \). The preceding estimate clearly remains valid if \(a = 0\) or \(b = 0\). Applying this with \(\theta = \beta \), \(a = T^{1-\beta } {\text {Lip}}(f)\), and \(b = T^{-\frac{d}{p}-\beta } \Vert f \Vert _{L^p}\), we thus obtain the needed estimate

$$\begin{aligned} T^{-(1-\beta ) d / p} \cdot \Vert f \Vert _{L^p}^{1-\beta } \cdot ({\text {Lip}}(f))^\beta&= \big ( T^{1-\beta } {\text {Lip}}(f) \big )^\beta \cdot \big ( T^{-\frac{d}{p} - \beta } \, \Vert f \Vert _{L^p} \big )^{1-\beta } \\&\le \beta \, T^{1-\beta } {\text {Lip}}(f) + (1-\beta ) \, T^{-\frac{d}{p} - \beta } \, \Vert f \Vert _{L^p} \\&\le T^{1-\beta } {\text {Lip}}(f) + T^{-\frac{d}{p} - \beta } \, \Vert f \Vert _{L^p} . \end{aligned}$$

This completes the proof in the case \(p < \infty \).

In case of \(p = \infty \), we combine the estimate from the beginning of the proof with Equation (3.3) to obtain

$$\begin{aligned} {\text {Lip}}_\beta (f)&\le 2^{1-\beta } \cdot \Vert f \Vert _{L^\infty }^{1-\beta } \cdot ({\text {Lip}}(f))^\beta \!=\!2^{1-\beta } \cdot \bigl ( T^{-\beta } \, \Vert f \Vert _{L^\infty } \bigr )^{1-\beta } \cdot \bigl (T^{1-\beta } \, {\text {Lip}}(f)\bigr )^\beta \\&\le 2^{1-\beta } \cdot \big ( (1-\beta ) T^{-\beta } \Vert f \Vert _{L^\infty } + \beta T^{1-\beta } \, {\text {Lip}}(f) \big ) \\&\le 2^{1-\beta } \cdot \big ( T^{-\frac{d}{p} - \beta } \Vert f \Vert _{L^p} + T^{1-\beta } {\text {Lip}}(f) \big ) , \end{aligned}$$

where the last step used that \(p = \infty \). \(\square \)

We now prove our sufficient criterion for the embedding \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d) \hookrightarrow C^{0,\beta }([0,1]^d)\).

Theorem 3.5

Let \(d \in {\mathbb {N}}\), \(\beta \in (0,1)\), and \(p \in (0,\infty ]\), and suppose that \(\gamma ^*(\varvec{\ell },{\varvec{c}}) < \infty \). If

$$\begin{aligned} \alpha > \frac{\beta + \frac{d}{p}}{1 - \beta } \cdot \gamma ^*(\varvec{\ell },{\varvec{c}}), \end{aligned}$$

then \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d) \hookrightarrow C^{0,\beta }([0,1]^d)\).

Remark

It is folklore knowledge that the modulus of continuity of a function is closely connected to how well the function can be uniformly approximated by Lipschitz functions. Precisely, for \(s > 0\), let

$$\omega _f (s):= \sup \{ |f(x) - f(y)| :x,y \in [0,1]^d, \Vert x-y \Vert _{\ell ^\infty } \le s \}$$

denote the modulus of continuity of a function \(f: [0,1]^d \rightarrow {\mathbb {R}}\). It is easy to see that if \(g: [0,1]^d \rightarrow {\mathbb {R}}\) satisfies \({\text {Lip}}(g) < \infty \), then \(\omega _f (t) \le 2 \, \Vert f - g \Vert _{L^\infty } + t {\text {Lip}}(g)\) for all \(t > 0\). In combination with Lemma 3.2 and the definition of the approximation space \(A^{\alpha ,\infty }_{\varvec{\ell },{\varvec{c}}}([0,1]^d)\), this observation can be used to obtain a simplified proof of Theorem 3.5 in the case \(p = \infty \).

However, for general \(p \in (0,\infty )\), the approximation error in the definition of the space \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d)\) is measured in \(L^p\) instead of \(L^\infty \). Since we are not aware of a well-known connection between the modulus of continuity of a function and its \(L^p\)-approximability by Lipschitz functions, we therefore provide a proof of Theorem 3.5 from first principles.

Proof

The assumption on \(\alpha \) implies in particular that \(\alpha > \frac{d}{p} \gamma ^*(\varvec{\ell },{\varvec{c}})\). Hence, Theorem 3.3 shows in case of \(p < \infty \) that \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d) \hookrightarrow C([0,1]^d)\), and this embedding trivially holds if \(p = \infty \). Thus, it remains to show for \(f \in A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d)\) with \(\Vert f \Vert _{A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d)} \le 1\) that \({\text {Lip}}_\beta (f) \lesssim 1\).

Let \(\nu := (\beta + \frac{d}{p}) / (1 - \beta )\). Recall from Lemma 2.3 and the assumptions of the present theorem that \( \gamma ^{\diamondsuit }(\varvec{\ell },{\varvec{c}}) = \gamma ^*(\varvec{\ell },{\varvec{c}})< \alpha / \nu < \infty \) and hence that \(L:= \varvec{\ell }^*\in {\mathbb {N}}\). By definition of \(\gamma ^{\diamondsuit }(\varvec{\ell },{\varvec{c}})\), we thus see that there exist \(\gamma > 0\) and \(C = C(\gamma ,\varvec{\ell },{\varvec{c}}) > 0\) satisfying \(\gamma < \alpha / \nu \) and \(({\varvec{c}}(n))^L \cdot n^{\lfloor L/2 \rfloor } \le C \cdot n^{\gamma }\) for all \(n \in {\mathbb {N}}\).

Let \(f \in A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d)\) with \(\Vert f \Vert _{A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d)} \le 1\). By Lemma 2.1, this implies \(\Gamma _{\alpha ,p}(f) \le 1\), meaning that for each \(m \in {\mathbb {N}}\) we can choose \(F_m \in \Sigma _{2^m}\) satisfying \(\Vert f - F_m \Vert _{L^p} \le 2 \cdot 2^{-\alpha m}\). Since furthermore \(\Vert f \Vert _{L^p} \le 1\), this remains true for \(m = 0\) if we set \(F_0:= 0\). As in the proof of Theorem 3.3, we then obtain constants \(C_1 = C_1(p) > 0\) and \(C_2 = C_2(d,\gamma ,\varvec{\ell },{\varvec{c}}) > 0\) satisfying

$$\begin{aligned} \Vert F_{m+1} - F_m \Vert _{L^p} \le 4 C_1 \cdot 2^{-\alpha m} \quad \text {and} \quad {\text {Lip}}(F_{m+1} - F_m) \le C_2 \cdot 2^{\gamma m} \quad \forall \, m \in {\mathbb {N}}. \end{aligned}$$

Now, set \(\theta := (\alpha + \gamma )/(1 + \frac{d}{p})\). A direct computation shows that

$$\begin{aligned} \theta \cdot \left( \frac{d}{p} + \beta \right) - \alpha = \gamma - \theta \cdot (1 - \beta ) = \frac{1 - \beta }{1 + \frac{d}{p}} \cdot (\nu \gamma - \alpha ) =: \mu < 0. \end{aligned}$$

Thus, applying Lemma 3.4 with \(T = 2^{-\theta m}\), we obtain a constant \(C_3 = C_3(d,p,\beta ) > 0\) satisfying

$$\begin{aligned} {\text {Lip}}_\beta (F_{m+1} - F_m)&\le C_3 \cdot \big ( T^{-\frac{d}{p} - \beta } \Vert F_{m+1} - F_m \Vert _{L^p} + T^{1 - \beta } \, {\text {Lip}}(F_{m+1} - F_m) \big ) \\&\le C_3 \cdot \big ( 4 C_1 \cdot 2^{m \cdot (\theta \cdot (\frac{d}{p} + \beta ) - \alpha )} + C_2 \cdot 2^{m \cdot (\gamma - \theta \cdot (1 - \beta ))} \big ) \\&= C_3 \cdot (4C_1 + C_2) \cdot 2^{\mu m} =: C_4 \cdot 2^{\mu m} . \end{aligned}$$

Next, note for \(g \in C^{0,\beta }([0,1]^d)\) and \(x,y \in [0,1]^d\) that

$$\begin{aligned} |g(x)| \le |g(x) - g(y)| \!+\! |g(y)| \le {\text {Lip}}_\beta (g) \cdot \Vert x - y\Vert _{\ell ^\infty }^\beta \!+\! |g(y)| \le {\text {Lip}}_\beta (g) \!+\! |g(y)|. \end{aligned}$$

Taking the \(L^p\)-norm (with respect to y) of this estimate shows that

$$\begin{aligned} |g(x)|{} & {} =\big \Vert |g(x)| \big \Vert _{L^p([0,1]^d, d y)} \le \big \Vert {\text {Lip}}_\beta (g) + |g(y)| \big \Vert _{L^p([0,1]^d, d y)}\\{} & {} \le C_5 \cdot \bigl ({\text {Lip}}_\beta (g) + \Vert g \Vert _{L^p}\bigr ) \,\,\, \forall \, x \!\in \! [0,1]^d, \end{aligned}$$

for a suitable constant \(C_5 = C_5(p) > 0\). Setting \(\sigma := \max \{ \mu , -\alpha \} < 0\) and applying the preceding estimate to \(g = F_{m+1} - F_m\), we see for a suitable constant \(C_6 = C_6(d,p,\beta ) > 0\) that \(\Vert F_{m+1} - F_m \Vert _{\sup } \le C_6 \cdot 2^{\sigma m}\) for all \(m \in {\mathbb {N}}_0\). Hence, the series \(F:= \sum _{m=0}^\infty (F_{m+1} - F_m)\) converges uniformly, so that we get \(F \in C^{0,\beta }([0,1]^d)\) with

$$\begin{aligned} {\text {Lip}}_\beta (F) \le \sum _{m=0}^\infty {\text {Lip}}_\beta (F_{m+1} - F_m) \le C_4 \cdot \sum _{m=0}^\infty 2^{\mu m} = \frac{C_4}{1 - 2^\mu } =: C_7 < \infty . \end{aligned}$$

Since \(\sum _{m=0}^N (F_{m+1} - F_m) = F_{N+1} \rightarrow f\) with convergence in \(L^p([0,1]^d)\) as \(N \rightarrow \infty \), we see that \(F = f\) almost everywhere, and hence (after changing f on a null-set) \(f \in C^{0,\beta }([0,1]^d)\) with \({\text {Lip}}_\beta (f) \le C_7\). As seen above, this completes the proof. \(\square \)

3.3 Necessary Conditions

In this section we complement our sufficient conditions by corresponding necessary conditions. Our main result reads as follows.

Theorem 3.6

Let \(p \in (0,\infty ]\), \(\beta \in (0,1]\), and \(d \in {\mathbb {N}}\). Then:

  • If \(p < \infty \) and \(\gamma ^{*}(\varvec{\ell },{\varvec{c}}) > \alpha / (d / p)\), then the embedding \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d) \hookrightarrow C([0,1]^d)\) does not hold.

  • If \(\gamma ^*(\varvec{\ell },{\varvec{c}}) > \frac{1-\beta }{\beta + \frac{d}{p}} \alpha \), then the embedding \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d) \hookrightarrow C^{0,\beta }([0,1]^d)\) does not hold.

Remark

A combination of the preceding result with Theorem 3.3 provides a complete characterization of the embedding \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d) \hookrightarrow C([0,1]^d)\), except in the case where \(\gamma ^{*}(\varvec{\ell },{\varvec{c}}) = \alpha / (d/p)\). This remaining case is more subtle and seems to depend on the precise growth of the function \({\varvec{c}}\).

Likewise, combining the preceding result with Theorem 3.5 provides a complete characterization of the embedding \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d) \hookrightarrow C^{0,\beta }([0,1]^d)\), except in the “critical” case where \(\gamma ^{*}(\varvec{\ell },{\varvec{c}}) = \frac{1-\beta }{\beta + \frac{d}{p}} \, \alpha \).

Proof

The proof is divided into several steps, where the first two steps are concerned with constructing and analyzing a certain auxiliary function \(\zeta _{M'}\) which is then used to construct counterexamples to the considered embeddings.

Step 1: Given \(M' \ge 0\), define

$$\begin{aligned} \zeta _{M'}: \quad {\mathbb {R}}^d \rightarrow {\mathbb {R}}, \quad x \mapsto \varrho \left( 1 - M' \, \sum _{i=1}^d x_i\right) . \end{aligned}$$

In this step, we show for any \(n \in {\mathbb {N}}\), \(0 \le C \le {\varvec{c}}(n)\), \(L \in {\mathbb {N}}_{\ge 2}\) with \(L \le \varvec{\ell }(n)\) and any \(M' \ge 1\) and \(0 \le M \le C^L \, n^{\lfloor L/2 \rfloor }\) that

$$\begin{aligned} \frac{M}{M'} \cdot \zeta _{M'} \in \Sigma _{(d+L) \, n}. \end{aligned}$$
(3.4)

To prove this, set \(\mu := C^L \, n^{\lfloor L/2 \rfloor }\) and \(D:= C \cdot (1)\in {\mathbb {R}}^{1 \times 1}\), as well as

$$\begin{aligned} b_1:= & {} \frac{C}{M'} \cdot (1,\dots ,1)^T \in {\mathbb {R}}^n,\quad A:= C \cdot (1,\dots ,1) \in {\mathbb {R}}^{1 \times n},\\ B:= & {} C \cdot (1,\dots ,1)^T \in {\mathbb {R}}^{n \times 1}, \end{aligned}$$

and finally

$$\begin{aligned} A_1:= - C \cdot \left( \begin{matrix} 1 &{} \cdots &{} 1 \\ \vdots &{} \ddots &{} \vdots \\ 1 &{} \cdots &{} 1 \end{matrix} \right) \in {\mathbb {R}}^{n \times d}. \end{aligned}$$

Note because of \(M' \ge 1\) that \( \Vert b_1 \Vert _{\infty }, \Vert A \Vert _{\infty }, \Vert B \Vert _{\infty }, \Vert D \Vert _{\infty }, \Vert A_1 \Vert _{\infty } \le C \le {\varvec{c}}(n) \le {\varvec{c}}( (L+d) n). \) Now, we distinguish two cases.

Case 1 (\(L \ge 2\) is even): In this case, define

$$\begin{aligned} \Phi := \big ( (A_1,b_1), (A,0), \underbrace{ (B,0), (A,0), \dots , (B,0), (A,0) }_{\frac{L-2}{2} \text { copies of } (B,0),(A,0)} \big ). \end{aligned}$$

Note that \(W(\Phi ) \le d n + n + n + \frac{L-2}{2} (n + n) \le (d+L) n\) and \(L(\Phi ) = L \le \varvec{\ell }(n) \le \varvec{\ell }( (d+L) n)\), showing that \(R_\varrho \Phi \in \Sigma _{(d+L) \, n}\). Because of \(A (\varrho (B x)) = C \sum _{j=1}^n \varrho (C x) = C^2 \, n \cdot \varrho (x)\) for \(x \in {\mathbb {R}}\) and because of

$$\begin{aligned} A \cdot \varrho (A_1 x + b_1)= & {} C \cdot \sum _{j=1}^n \varrho \bigl (\tfrac{C}{M'} - {\textstyle \sum _{i=1}^d} C \, x_i\bigr ) = \frac{C^2 \, n}{M'} \cdot \varrho \bigl (1 - M' \, {\textstyle \sum _{i=1}^d} x_i\bigr ) \\= & {} \frac{C^2 \, n}{M'} \cdot \zeta _{M'} (x) \end{aligned}$$

for \(x \in {\mathbb {R}}^d\), it follows that \( R_\varrho \Phi = C^{L-2} \cdot n^{\frac{L-2}{2}} \cdot \frac{C^2 \, n}{M'} \cdot \zeta _{M'} = \frac{\mu }{M'} \cdot \zeta _{M'}. \) Since \(\frac{M}{M'} \cdot \zeta _{M'} = \frac{M}{\mu } \cdot R_\varrho \Phi \) with \(0 \le M \le \mu \), and since multiplying the entries of the last layer of \(\Phi \) by \(M/\mu \in [0,1]\) again yields a network with weights bounded by \({\varvec{c}}( (d+L) n)\), this proves Equation (3.4) in this case.

Case 2 (\(L \ge 3\) is odd): In this case, define

$$\begin{aligned} \Phi := \big ( (A_1,b_1), (A,0), \underbrace{ (B,0),(A,0),\dots ,(B,0),(A,0) }_{\frac{L-3}{2} \text { copies of } (B,0),(A,0)}, (D,0) \big ). \end{aligned}$$

Then \(W(\Phi ) \le d n + n + n + \frac{L-3}{2}(n + n) + 1 \le (d + L) n\) and \(L(\Phi ) = L \le \varvec{\ell }(n) \le \varvec{\ell }( (d+L) n)\), so that \(R_\varrho \Phi \in \Sigma _{(d+L) \, n}\). Using essentially the same arguments as in the previous case, we see \(R_\varrho \Phi = \frac{\mu }{M'} \cdot \zeta _{M'}\), which then implies that Equation (3.4) holds.

Step 2: We show \(\big \Vert \zeta _{M'} \big \Vert _{L^p([0,1]^d)} \le (M')^{-d/p}\). To see this, note for \(x \in [0,1]^d\) that \(\sum _{i=1}^d x_i \ge 0\) and hence \(1 - M' \, \sum _{i=1}^d x_i \le 1\), showing that \(0 \le \zeta _{M'}(x) = \varrho \bigl (1 - M' \, \sum _{i=1}^d x_i\bigr ) \le \varrho (1) = 1\). Next, note that if \(0 \ne \zeta _{M'}(x) = \varrho \big ( 1 - M' \, \sum _{i=1}^d x_i \big )\), then \(1 - M' \, \sum _{i=1}^d x_i > 0\), which is only possible if \(x_j \le \sum _{i=1}^d x_i \le \frac{1}{M'}\) for all \(j \in \{ 1,\dots ,d \}\), that is, if \(x \in \frac{1}{M'} [0,1]^d\). Therefore, we see as claimed that

$$\begin{aligned} \big \Vert \zeta _{M'} \big \Vert _{L^p([0,1]^d)} \le \big [ \varvec{\lambda }\bigl (\tfrac{1}{M'} [0,1]^d\bigr ) \big ]^{1/p} = (M')^{-d/p}. \end{aligned}$$
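
The construction of Step 1 is explicit enough to be verified numerically. The following sketch (reusing realization and num_weights from the code in Sect. 2.1, with small illustrative parameters of our own choosing) checks the identity \(R_\varrho \Phi = \frac{\mu }{M'} \cdot \zeta _{M'}\) from Case 1 as well as the weight count \(W(\Phi ) \le (d+L) \, n\):

```python
import numpy as np

d, n, C, L, Mp = 2, 3, 1.5, 4, 2.0      # Mp plays the role of M'; L is even
mu = C ** L * n ** (L // 2)             # mu = C^L * n^{floor(L/2)}
A1 = -C * np.ones((n, d))
b1 = (C / Mp) * np.ones(n)
A = C * np.ones((1, n))
B = C * np.ones((n, 1))
# Phi = ((A1,b1), (A,0), (B,0), (A,0)): L = 4 layers, (L-2)/2 = 1 copy of (B,0),(A,0).
phi = [(A1, b1), (A, np.zeros(1)), (B, np.zeros(n)), (A, np.zeros(1))]

zeta = lambda x: max(0.0, 1.0 - Mp * x.sum())       # zeta_{M'}(x)
xs = np.random.default_rng(1).uniform(0, 1, (1000, d))
err = max(abs(realization(phi, x)[0] - (mu / Mp) * zeta(x)) for x in xs)
print(f"max deviation from (mu/M') * zeta_M': {err:.2e}")   # tiny, ~ 1e-13
print(f"W(Phi) = {num_weights(phi)} <= (d+L)*n = {(d + L) * n}")
```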

Step 3: In this step, we prove the first part of the theorem. By the assumptions of that part, we have \(p < \infty \) and \(\gamma ^*(\varvec{\ell },{\varvec{c}}) > \alpha / (d/p)\). By definition of \(\gamma ^*\), this implies that there exists \(\gamma > \alpha / (d/p)\) and some \(L \in {\mathbb {N}}_{\le \varvec{\ell }^*}\) satisfying

$$\begin{aligned} c_0:= \limsup _{n \rightarrow \infty } \Big [ ({\varvec{c}}(n))^L \cdot n^{\lfloor L/2 \rfloor - \gamma } \Big ] \in (0,\infty ]. \end{aligned}$$

Since \(\varvec{\ell }^*\ge 2\), we can clearly assume that \(L \ge 2\). Choose a subsequence \((n_k)_{k \in {\mathbb {N}}}\) such that \(c_0 = \lim _{k \rightarrow \infty } ({\varvec{c}}(n_k))^L \cdot n_k^{\lfloor L/2 \rfloor - \gamma }\). Then, setting \(c:= \min \{ 1, c_0 / 2 \}\), there exists \(k_0 \in {\mathbb {N}}\) satisfying \(({\varvec{c}}(n_k))^L \cdot n_k^{\lfloor L/2 \rfloor } \ge c \cdot n_k^\gamma > 0\) and \(L \le \varvec{\ell }(n_k)\) for all \(k \in {\mathbb {N}}_{\ge k_0}\).

Choose \(\kappa \in (0,1]\) satisfying \(\kappa \le c\) and \(\kappa \cdot (d + L)^\alpha \le 1\). Furthermore, define

$$\begin{aligned} \theta:= & {} \tfrac{1}{2} (\gamma + \tfrac{\alpha p}{d}) \in (\tfrac{\alpha p}{d}, \gamma ) \text { and } \beta := \theta + \min \{ \delta _1, \delta _2 \}\quad \text { where }\\ \delta _1:= & {} \tfrac{\theta d}{p} - \alpha> 0 \text { and } \delta _2:= \gamma - \theta > 0 \end{aligned}$$

and choose \(k_1 \in {\mathbb {N}}\) satisfying \(n_k^\theta \le \kappa \cdot n_k^\beta \) for all \(k \ge k_1\).

We now claim for all \(k \ge \max \{ k_0, k_1 \}\) that if we set \(M:= \kappa \cdot n_k^\beta \) and \(M':= n_k^\theta \), then \(\frac{M}{M'} \, \zeta _{M'}\) lies in the unit ball of \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d)\). This will prove the first part of the theorem, since \(\big \Vert \frac{M}{M'} \, \zeta _{M'} \big \Vert _{C([0,1]^d)} \ge \frac{M}{M'} \rightarrow \infty \) as \(k \rightarrow \infty \); here, we used that \(\zeta _{M'}(0) = 1\) and that \(\beta > \theta \).

To see this, note that \(\beta \le \theta + \delta _2 = \gamma \) and hence

$$\begin{aligned} M = \kappa \cdot n_k^\beta \le c \cdot n_k^\gamma \le ({\varvec{c}}(n_k))^L \, n_k^{\lfloor L/2 \rfloor }, \end{aligned}$$

so that \(M \le C^L \, n_k^{\lfloor \frac{L}{2} \rfloor }\) for a suitable \(0 \le C \le {\varvec{c}}(n_k)\). Thus, Equation (3.4) shows \(\frac{M}{M'} \, \zeta _{M'} \in \Sigma _{(d+L) \, n_k}\). Now, for \(t \in {\mathbb {N}}\) there are two cases: If \(t \ge (d+L) \cdot n_k\), then \(d_p \bigl ( \frac{M}{M'} \, \zeta _{M'}, \Sigma _t \bigr ) = 0\). If otherwise \(t < (d+L) \cdot n_k\), then the estimate from Step 2 combined with our choice of M, \(M'\) and \(\kappa \) shows that

$$\begin{aligned} t^\alpha \cdot d_p \Bigl ( \frac{M}{M'} \, \zeta _{M'}, \Sigma _t \Bigr ) \le \bigl ( (d+L) \, n_k \bigr )^\alpha \cdot \Bigl \Vert \frac{M}{M'} \, \zeta _{M'} \Bigr \Vert _{L^p} \le (d+L)^\alpha \, n_k^\alpha \cdot \kappa \, n_k^{\beta - \theta - \theta \frac{d}{p}} \le \kappa \cdot (d+L)^\alpha \le 1. \end{aligned}$$

Finally, we also have \( \Vert \frac{M}{M'} \, \zeta _{M'} \Vert _{L^p([0,1]^d)} \le \frac{M}{M'} \cdot (M')^{-d/p} \le \kappa \cdot n_k^{\beta - \theta - \theta \frac{d}{p}} \le n_k^{\theta + \delta _1 - \theta - \theta \frac{d}{p}} = n_k^{-\alpha } \le 1. \) Overall, this shows \(\Gamma _{\alpha ,p} \bigl ( \frac{M}{M'} \, \zeta _{M'} \bigr ) \le 1\), so that Lemma 2.1 shows \(\bigl \Vert \frac{M}{M'} \, \zeta _{M'} \bigr \Vert _{A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d)} \le 1\) as well. As seen above, this completes the proof of the first part of the theorem.

Step 4: In this step, we prove the second part of the theorem. By the assumptions of that part, we have \( \gamma ^*(\varvec{\ell },{\varvec{c}}) > \alpha \frac{1 - \beta }{\beta + \frac{d}{p}} = \frac{\alpha }{\nu } \) for \(\nu := \frac{\beta + \frac{d}{p}}{1 - \beta }\). By definition of \(\gamma ^*\), we can thus choose \(\gamma > \frac{\alpha }{\nu }\) and \(L \in {\mathbb {N}}_{\le \varvec{\ell }^*}\) satisfying

$$\begin{aligned} c_0:= \limsup _{n \rightarrow \infty } \Big [ ({\varvec{c}}(n))^L \cdot n^{\lfloor L/2 \rfloor - \gamma } \Big ] \in (0,\infty ]. \end{aligned}$$

Since \(\varvec{\ell }^*\ge 2\), we can clearly assume that \(L \ge 2\). Choose a subsequence \((n_k)_{k \in {\mathbb {N}}}\) such that \(c_0 = \lim _{k \rightarrow \infty } ({\varvec{c}}(n_k))^L \cdot n_k^{\lfloor L/2 \rfloor - \gamma }\). Then, setting \(c:= \min \{ 1, c_0 / 2 \}\), there exists \(k_0 \in {\mathbb {N}}\) satisfying \(({\varvec{c}}(n_k))^L \cdot n_k^{\lfloor L/2 \rfloor } \ge c \cdot n_k^\gamma > 0\) and \(L \le \varvec{\ell }(n_k)\) for all \(k \in {\mathbb {N}}_{\ge k_0}\).

Since \(\gamma > \frac{\alpha }{\nu } = \frac{1-\beta }{\beta + \frac{d}{p}} \alpha \), we have \(\frac{\alpha }{\beta + \frac{d}{p}} < \frac{\gamma }{1-\beta }\). Hence, we can choose \( \tau \in \bigl (\frac{\alpha }{\beta + \frac{d}{p}}, \frac{\gamma }{1-\beta }\bigr ) \subset (0,\infty ). \) Then \(\tau \cdot (1 - \beta ) < \gamma \) and furthermore

$$\begin{aligned} \tau \cdot (\beta + \tfrac{d}{p}) > \alpha \Longrightarrow \tau \cdot (1 - \beta - 1 - \tfrac{d}{p})< - \alpha \Longrightarrow \tau \cdot (1 - \beta ) < \tau \cdot (1 + \tfrac{d}{p}) - \alpha . \end{aligned}$$

Hence, we can choose \( \theta \in \big ( \tau \cdot (1 - \beta ), \,\, \min \{ \gamma , \tau \cdot (1 + \frac{d}{p}) - \alpha \} \big ) \subset (0,\infty ). \)

Now, choose \(\kappa \in (0,1]\) such that \(\kappa \le c\) and \(\kappa \cdot (L + d)^\alpha \le 1\). Fix \(k \in {\mathbb {N}}_{\ge k_0}\) for the moment, set \(M:= \kappa \cdot n_k^\theta \) and \(M':= n_k^\tau \ge 1\), and define \(f:= \frac{M}{M'} \cdot \zeta _{M'}\). Since \( ({\varvec{c}}(n_k))^L \cdot n_k^{\lfloor L/2 \rfloor } \ge c \cdot n_k^\gamma \ge \kappa \cdot n_k^\gamma , \) we can choose \(0 \le C \le {\varvec{c}}(n_k)\) satisfying \( C^L \cdot n_k^{\lfloor L/2 \rfloor } \ge \kappa \cdot n_k^\gamma \ge \kappa \cdot n_k^\theta = M. \) Then, Step 1 shows that \(f \in \Sigma _{(L+d) \, n_k}\). Now, for \(t \in {\mathbb {N}}\) there are two cases: If \(t \ge (L + d) \cdot n_k\), then \(d_p(f, \Sigma _t) = 0\). If otherwise \(t < (L+d) \cdot n_k\), then the estimate from Step 2 combined with the bound \(\theta \le \tau \cdot (1 + \frac{d}{p}) - \alpha \) shows

$$\begin{aligned} t^\alpha \cdot d_p(f, \Sigma _t) \le \bigl ( (L+d) \, n_k \bigr )^\alpha \cdot \Vert f \Vert _{L^p} \le (L+d)^\alpha \, n_k^\alpha \cdot \kappa \, n_k^{-\alpha } = \kappa \cdot (L+d)^\alpha \le 1. \end{aligned}$$

Furthermore, Step 2 also shows that

$$\begin{aligned} \Vert f \Vert _{L^p}= & {} \frac{M}{M'} \Vert \zeta _{M'} \Vert _{L^p} \le \kappa \cdot n_k^{\theta - \tau } \cdot (M')^{-d/p} \le n_k^{\theta - \tau - \tau \frac{d}{p}} \le n_k^{\tau \cdot (1 + \frac{d}{p}) - \alpha - \tau - \tau \frac{d}{p}} \\= & {} n_k^{-\alpha } \le 1. \end{aligned}$$

Overall, we see that \(\Gamma _{\alpha ,p}(f) \le 1\), so that Lemma 2.1 shows \(\Vert f \Vert _{A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d)} \le 1\).

We now show that \(\Vert f \Vert _{C^{0,\beta }} \rightarrow \infty \) as \(k \rightarrow \infty \); since each such f lies in the unit ball of \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d)\), this will imply that the embedding \(A^{\alpha ,p}_{\varvec{\ell },{\varvec{c}}}([0,1]^d) \hookrightarrow C^{0,\beta }([0,1]^d)\) does not hold. To see that \(\Vert f \Vert _{C^{0,\beta }} \rightarrow \infty \) as \(k \rightarrow \infty \), note that \(f(0) = \frac{M}{M'}\) and \(f(\frac{1}{M'},0,\dots ,0) = 0\). Hence,

$$\begin{aligned} \Vert f \Vert _{C^{0,\beta }} \ge {\text {Lip}}_\beta (f){} & {} \ge \frac{|f(0) - f(\frac{1}{M'},0,\dots ,0)|}{\Vert 0 - (\frac{1}{M'},0,\dots ,0)\Vert _{\ell ^\infty }^\beta }= (M')^\beta \cdot \frac{M}{M'} = n_k^{\beta \tau - \tau } \cdot \kappa \cdot n_k^\theta \\{} & {} = \kappa \cdot n_k^{\theta - \tau \cdot (1 - \beta )} \!\xrightarrow [k\rightarrow \infty ]{}\! \infty , \end{aligned}$$

since \(\theta > \tau \cdot (1 - \beta )\). \(\square \)