1 Introduction

Sparsity reduces network complexities and, consequently, lowers the demands on memory and computation, reduces overfitting, and improves interpretability (Changpinyo et al. 2017; Han et al. 2016; Kim et al. 2016; Liu et al. 2015; Wen et al. 2016). Sparsity is at the heart of many current techniques in deep learning, such as dropouts (Srivastava et al. 2014), lottery tickets (Frankle and Carbin 2019), augmenting small networks (Ash 1989; Bello 1992), pruning large networks (Simonyan and Zisserman 2015; Han et al. 2016), sparsity constraints (Ledent et al. 2019; Neyshabur et al. 2015; Schmidt-Hieber 2020), and sparsity regularization (Taheri et al. 2021).

The many empirical observations of the benefits of sparsity have sparked interest in mathematical support in the form of statistical theories. Two current approaches are based on Rademacher complexities (Bartlett and Mendelson 2002; Neyshabur et al. 2015) and ideas from nonparametric statistics (Schmidt-Hieber 2020), respectively. While their results provide important support for sparse deep learning, they still have major limitations: The first approach is restricted to bounded loss functions (which excludes the \(\ell_{2}\)-loss, for example), is either restricted to a simple form of sparsity (which we will call “connection sparsity” later) or suffers from an exponential dependence on the number of layers (which contradicts the current interest in very deep networks), caters to constraints rather than regularization (which is the predominant implementation in practice), and is limited to a single output node and ReLU activation. The second approach is restricted to \(\ell_{0}\)-constraints (which are infeasible in practice), assumes bounded weights, and is also limited to a single output node and ReLU activation. In short, while some progress in the statistical understanding of sparse deep learning has been made already, many aspects have not yet been considered.

The goal of this paper is to establish a statistical theory that accounts for these missing aspects. For this, we follow a third, very recent approach introduced in Taheri et al. (2021). This approach is based on ideas from high-dimensional statistics and empirical-process theory (Lederer 2022). The main feature of their results is that they apply to the \(\ell_{2}\)-loss, to regularization instead of constraints, and to a variety of activation functions. But their results still have limitations, such as the lack of more complex notions of sparsity (we will speak of “node sparsity” later) and the restriction to a single output node. Moreover, their estimator involves an additional, arguably unnatural parameter.

In this paper, we remove these limitations from Taheri et al. (2021). We focus on regression-type settings with layered, feedforward neural networks. The estimators under consideration consist of a standard least-squares estimator with regularizers that induce different types of sparsity—without the need for an additional parameter. We then derive prediction and generalization guarantees by using techniques from high-dimensional statistics (Dalalyan et al. 2017) and empirical-process theory (van de Geer 2000). In the case of sub-Gaussian noise, we find the rates

$$\begin{aligned} \sqrt{\frac{{l}\bigl (\log [{m}{n}{\overline{p}}]\bigr )^3}{{n}}}~~~~~\text {and}~~~~~\sqrt{\frac{{m}{l}{\underline{p}}\bigl (\log [{m}{n}{\overline{p}}]\bigr )^3}{{n}}} \end{aligned}$$

for the connection-sparse and node-sparse estimators (see the following section for the notions of sparsity), respectively, where \(l\) is the number of hidden layers, \(m\) the number of output nodes, \(n\) the number of samples, \(\overline{p}\) the total number of parameters, and \(\underline{p}\) the maximal width of the network. The rates suggest that sparsity-inducing approaches can provide accurate prediction even in very wide (with connection sparsity) and very deep (with either type of sparsity) networks while, at the same time, ensuring low network complexities. These findings underpin the current trend toward sparse but wide and especially deep networks from a statistical perspective. More generally speaking, our paper complements the existing statistical theories for sparse deep learning with new results, and it refines the techniques that were introduced in (Taheri et al. 2021).
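
To give a rough feel for these rates, the following sketch evaluates them for a hypothetical architecture; the dimensions are invented for illustration, and constants as well as the input-dependent factors from Section 3 are omitted.

```python
import math

# Hypothetical architecture; the numbers are invented for illustration only.
n = 60_000                 # samples
m = 10                     # output nodes
d = 784                    # input dimension
hidden = [512, 512, 512]   # hidden-layer widths
l = len(hidden)            # number of hidden layers

dims = [d] + hidden + [m]
p_total = sum(dims[j + 1] * dims[j] for j in range(l + 1))  # total number of parameters
p_max = max(dims[1:l + 1])                                  # maximal (hidden-layer) width

log_term = math.log(m * n * p_total) ** 3
rate_connection = math.sqrt(l * log_term / n)           # first rate
rate_node = math.sqrt(m * l * p_max * log_term / n)     # second rate
print(rate_connection, rate_node)
```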

Outline of the paper   Section 2 recapitulates the notions of connection and node sparsity and introduces the corresponding deep learning framework and estimators. Section 3 confirms the empirically observed accuracies of connection- and node-sparse estimation in theory. Section 4 discusses connections of our theoretical results and weight initialization. Section 5 summarizes the key features and limitations of our work. The Appendix contains all proofs.

2 Connection- and node-sparse deep learning

We consider data \((\varvec{y}_1,{\varvec{x}}_1),\dots , (\varvec{y}_{{n}},{{\varvec{x}}}_{{n}})\in {\mathbb {R}}^{{m}}\times {\mathbb {R}}^{{d}}\) that are related via

$$\begin{aligned} \varvec{y}_i=\varvec{g}_{*}[{{\varvec{x}}_i}]+{\varvec{u}_i}~~~~~~~~~~~~\text {for}~i\in \{1,\dots ,{n}\} \end{aligned}$$
(1)

for an unknown data-generating function \(\varvec{g}_{*}\,:\,{\mathbb {R}}^{{d}}\rightarrow {\mathbb {R}}^{{m}}\) and unknown, random noise \(\varvec{u}_1,\dots ,\varvec{u}_{{n}}\in {\mathbb {R}}^{{m}}\). We allow all aspects, namely \(\varvec{y}_i\), \(\varvec{g}_{*}\), \({\varvec{x}}_i\), and \(\varvec{u}_i\), to be unbounded. Our goal is to model the data-generating function with a feedforward neural network of the form

$$\begin{aligned} {{\varvec{g}_{{\varvec{\Theta }}}}[{\varvec{x}}]}:={\Theta }^{{l}}{\varvec{f}}^{{l}}\bigl [{\Theta }^{{l}-1}\cdots {\varvec{f}}^1[{\Theta }^0{\varvec{x}}]\bigr ]~~~~~~~~~~~~\text {for}~{\varvec{x}}\in {\mathbb {R}}^{{d}} \end{aligned}$$
(2)

indexed by the parameter space \({\mathcal{M}}:=\{{\varvec{\Theta }}=({\Theta ^{{l}}},\dots ,{\Theta }^0)\,:\,{\Theta }^j\in {\mathbb {R}}^{{{p}^{j+1}}\times {{p}^j}}\}\). The functions \({\varvec{f}}^j\,:\,{\mathbb {R}}^{{{p}^j}}\rightarrow {\mathbb {R}}^{{{p}^j}}\) are called the activation functions (Lederer 2021), and \({p}^0:={d}\) and \({p}^{{l}+1}:={m}\) are called the input and output dimensions, respectively. The depth of the network is \({l}\), the maximal width is \({\underline{p}}:=\max _{j\in \{0,\dots ,{l}-1\}}{{p}^{j+1}}\), and the total number of parameters is \({\overline{p}}:=\sum _{j=0}^{{l}}{{p}^{j+1}}{{p}^j}\).
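
As a concrete illustration of the model (2), the following minimal sketch implements the forward pass with ReLU activation; the dimensions are invented, and ReLU merely stands in for any admissible activation function.

```python
import numpy as np

def relu(z):
    """Example activation; the theory allows many other choices."""
    return np.maximum(z, 0.0)

def network(thetas, x):
    """Forward pass of the feedforward network (2).

    thetas = (Theta^0, ..., Theta^l), where Theta^j has shape (p^{j+1}, p^j);
    x has shape (p^0,) = (d,). Returns a vector of shape (p^{l+1},) = (m,).
    """
    h = x
    for theta in thetas[:-1]:   # inner layers: activation after each affine map
        h = relu(theta @ h)
    return thetas[-1] @ h       # outermost layer Theta^l, no activation

# Tiny example: d = 4, one hidden layer of width 3, m = 2
rng = np.random.default_rng(0)
thetas = (rng.normal(size=(3, 4)), rng.normal(size=(2, 3)))
print(network(thetas, rng.normal(size=4)).shape)  # (2,)
```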

In practice, the total number of parameters often rivals or exceeds the number of samples: \({\overline{p}}\approx {n}\) or \({\overline{p}}\gg {n}\). We then speak of high dimensionality. A common technique for avoiding overfitting in high-dimensional settings is regularization that induces additional structures, such as sparsity. Sparsity has the interesting side-effect of reducing the networks’ complexities, which can facilitate interpretations and reduce demands on energy and memory. Three common notions of sparsity are connection sparsity, which means that there is only a small number of nonzero connections between nodes, node sparsity, which means that there is only a small number of active nodes (Alvarez and Salzmann 2016; Changpinyo et al. 2017; Feng and Simon 2017; Kim et al. 2016; Lee et al. 2008; Liu et al. 2015; Nie et al. 2015; Scardapane et al. 2017; Wen et al. 2016), and layer sparsity, which means that there is only a small number of active layers (Hebiri and Lederer 2020).

In the following, we focus on connection and node sparsity. Our first sparse estimator is

$$\begin{aligned} {\widehat{{\varvec{\Theta }}}_{{\text {con}}}}\in {{\,\mathrm{arg\,min}\,}}_{{\varvec{\Theta }}\in {\mathcal {M}_{1}}}\Biggl \{\sum _{i=1}^{{n}}\big |\!\big |\varvec{y}_i-{{\varvec{g}_{{\varvec{\Theta }}}}[{{\varvec{x}}_i}]}\big |\!\big |_2^2+{r_{{\text {con}}}}|\!|\!|{\Theta ^{{l}}}|\!|\!|_1\Biggr \} \end{aligned}$$
(3)

for a tuning parameter \({r_{{\text {con}}}}\in [0,\infty )\), a nonempty set of parameters

$$\begin{aligned} {\mathcal{M}}_{1} \subset \Bigl \{{\varvec{\Theta }}\in {\mathcal{M}}\, : \, \max_{j \in \{0,\dots,{l}-1\}}|\!|\!|{\Theta^{j}}|\!|\!|_1\le 1\Bigr \}, \end{aligned}$$

and the \(\ell_{1}\)-norm

$$\begin{aligned} |\!|\!|{\Theta ^j}|\!|\!|_1:=\sum _{i=1}^{{{p}^{j+1}}}\sum _{k=1}^{{{p}^j}}|({\Theta ^j})_{ik}|~~\text {for}~j\in \{0,\dots ,{l}\},\,{\Theta }^j\in {\mathbb {R}}^{{{p}^{j+1}}\times {{p}^j}}\,. \end{aligned}$$

This estimator is an analog of the lasso estimator in linear regression (Tibshirani 1996). It induces sparsity on the level of connections: the larger the tuning parameter \(r_{{\text {con}}}\), the fewer connections among the nodes.
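
For illustration, a minimal sketch of the objective in (3) and of the constraint defining \(\mathcal {M}_{1}\) could look as follows; the ReLU activation and the function names are our own choices for the example, not part of the estimator's definition.

```python
import numpy as np

def l1_norm(theta):
    """Entrywise l1-norm |||Theta^j|||_1 of a weight matrix."""
    return np.abs(theta).sum()

def forward(thetas, x):
    """Network (2) with ReLU activation; thetas = (Theta^0, ..., Theta^l)."""
    h = x
    for theta in thetas[:-1]:
        h = np.maximum(theta @ h, 0.0)
    return thetas[-1] @ h

def objective_connection(thetas, X, Y, r_con):
    """Objective of the connection-sparse estimator (3): least-squares loss
    plus r_con times the l1-norm of the outermost layer only."""
    loss = sum(np.sum((y - forward(thetas, x)) ** 2) for x, y in zip(X, Y))
    return loss + r_con * l1_norm(thetas[-1])

def in_M1(thetas):
    """Membership in M_1: every inner layer has l1-norm at most 1."""
    return all(l1_norm(theta) <= 1.0 for theta in thetas[:-1])
```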

Fig. 1: Exemplary networks produced by the connection-sparse estimator (3) and the node-sparse estimator (6)

Deep learning with \(\ell_{1}\)-regularization has become common in theory and practice (Kim et al. 2016; Taheri et al. 2021). Our estimator (3) specifies one way to formulate this type of regularization. The estimator is indeed a regularized estimator (rather than a constrained estimator), because the complexity is regulated entirely through the tuning parameter \({r_{{\text {con}}}}\) in the objective function (rather than through a tuning parameter in the set over which the objective function is optimized). But \(\ell_{1}\)-regularization could also be formulated slightly differently. For example, one could consider the estimators

$$\begin{aligned} {\overline{{\varvec{\Theta }}}_{{\text {con}}}}\in {{\,\mathrm{arg\,min}\,}}_{{\varvec{\Theta }}\in {\mathcal {M}}}\Biggl \{\sum _{i=1}^{{n}}\big |\!\big |\varvec{y}_i-{{\varvec{g}_{{\varvec{\Theta }}}}[{{\varvec{x}}_i}]}\big |\!\big |_2^2+{r_{{\text {con}}}}\prod _{j=0}^{{l}}|\!|\!|{\Theta ^j}|\!|\!|_1\Biggr \} \end{aligned}$$
(4)

or

$$\begin{aligned} {\widetilde{{\varvec{\Theta }}}_{{\text {con}}}}\in {{\,\mathrm{arg\,min}\,}}_{{\varvec{\Theta }}\in {\mathcal {M}}}\Biggl \{\sum _{i=1}^{{n}}\big |\!\big |\varvec{y}_i-{{\varvec{g}_{{\varvec{\Theta }}}}[{{\varvec{x}}_i}]}\big |\!\big |_2^2+{r_{{\text {con}}}}\sum _{j=0}^{{l}}|\!|\!|{\Theta ^j}|\!|\!|_1\Biggr \}\,. \end{aligned}$$
(5)

The differences among the estimators (3)–(5) are small: for example, our theory can be adjusted for (4) with almost no changes to the derivations. The differences mainly concern the normalizations of the parameters; we illustrate this in the following proposition.

Proposition 1

(Scaling of Norms) Assume that the all-zeros parameter \(({\mathbf{0}}_{{p}^{{l}+1}\times {p}^{{l}}},\dots ,{\mathbf{0}}_{{p}^{1}\times {p}^{0}})\in {\mathcal {M}_{1}}\) is neither a solution of (3) nor of (5), that \({r_{{\text {con}}}}>0\), and that the activation functions are nonnegative homogeneous: \({\varvec{f}}^j[a\varvec{b}]=a{\varvec{f}}^j[\varvec{b}]\) for all \(j\in \{1,\dots ,{l}\}\), \(a\in [0,\infty )\), and \(\varvec{b}\in {\mathbb {R}}^{{{p}^j}}\). Then, \(|\!|\!|({\widehat{{\Theta }}_{{\text {con}}}})^0|\!|\!|_1=\cdots =|\!|\!|({\widehat{{\Theta }}_{{\text {con}}}})^{{l}-1}|\!|\!|_1=1\) (concerns the inner layers) for all solutions of (3), while \(|\!|\!|({\widetilde{{\Theta }}_{{\text {con}}}})^0|\!|\!|_1=\cdots =|\!|\!|({\widetilde{{\Theta }}_{{\text {con}}}})^{{l}}|\!|\!|_1\) (concerns all layers) for at least one solution of (5).

In brief, the goal of our paper is not to promote a new way of implementing sparsity in practice but to reproduce practical implementations as accurately as possible in theory.

Another way to formulate \(\ell_{1}\)-regularization was proposed in Taheri et al. (2021): they reparametrize the networks through a scale parameter and a constrained version of \(\mathcal {M}\) and then focus the regularization on the scale parameter only. Our above-stated estimator (3) is more elegant in that it avoids the reparametrization and the additional parameter.

The factor \(|\!|\!|{\Theta ^{{l}}}|\!|\!|_1\) in the regularization term of (3) measures the complexity of the network over the set \(\mathcal {M}_{1}\), and the factor \({r_{{\text {con}}}}\) regulates the complexity of the resulting estimator. This provides a convenient lever for data-adaptive complexity regularization through well-established calibration schemes for the tuning parameter, such as cross-validation. This practical aspect is an advantage of regularized formulations like ours as compared to constrained estimation over sets with a predefined complexity.

The constraints in the set \(\mathcal {M}_{1}\) of the estimator (3) can also retain the expressiveness of the full parameterization that corresponds to the set \(\mathcal {M}\): for example, assuming again nonnegative-homogeneous activation, one can check that for every \({\varvec{\Gamma }}\in {\mathcal {M}}\), there is a \({\varvec{\Gamma }}'\in \{{\varvec{\Theta }}\in {\mathcal{M}}\, :\, \max _{j\in \{0,\dots ,{l}-1\}}|\!|\!|{\Theta ^j}|\!|\!|_1\le 1\}\) such that \(\varvec{g}_{{\varvec{\Gamma }}}=\varvec{g}_{{\varvec{\Gamma }}'}\)—cf. (Taheri et al. 2021, Proposition 1). In contrast, existing theories on neural networks often require the parameter space to be bounded, which limits the expressiveness of the networks.
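
To sketch the rescaling behind this statement (assuming, in addition, that all inner-layer norms \(|\!|\!|{\Gamma ^j}|\!|\!|_1\) are nonzero), one can take

$$\begin{aligned} ({\Gamma }')^j:=\frac{{\Gamma }^j}{|\!|\!|{\Gamma ^j}|\!|\!|_1}~~~~\text {for}~j\in \{0,\dots ,{l}-1\}~~~~~~\text {and}~~~~~~({\Gamma }')^{{l}}:={\Gamma ^{{l}}}\prod _{j=0}^{{l}-1}|\!|\!|{\Gamma ^j}|\!|\!|_1\,, \end{aligned}$$

so that nonnegative homogeneity moves the factors \(|\!|\!|{\Gamma ^j}|\!|\!|_1\) through the activations and collects them in the outermost layer, which yields \(\varvec{g}_{{\varvec{\Gamma }}'}=\varvec{g}_{{\varvec{\Gamma }}}\) with \(\max _{j\in \{0,\dots ,{l}-1\}}|\!|\!|({\Gamma }')^j|\!|\!|_1=1\).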

Our regularization approach is, therefore, closer to practical setups than constrained approaches. The price is that to develop prediction theories, we have to use different tools than those typically used in theoretical deep learning. For example, we cannot use established risk bounds such as (Bartlett and Mendelson 2002, Theorem 8) (because Rademacher complexities over classes of unbounded functions are unbounded) or (Lederer 2020a, Theorem 1) (because our loss function is not Lipschitz continuous) or established concentration bounds such as McDiarmid’s inequality in (McDiarmid 1989, Lemma (3.3)) (because that would require a bounded loss). We instead invoke ideas from high-dimensional statistics, prove Lipschitz properties for neural networks, and use empirical-process theory, specifically concentration inequalities that are based on chaining (see the Appendix).

Our second estimator is

$$\begin{aligned} {\widehat{{\varvec{\Theta }}}_{{\text {node}}}}\in {{\,\mathrm{arg\,min}\,}}_{{\varvec{\Theta }}\in {\mathcal {M}_{2,1}}}\Biggl \{\sum _{i=1}^{{n}}\big |\!\big |\varvec{y}_i-{{\varvec{g}_{{\varvec{\Theta }}}}[{{\varvec{x}}_i}]}\big |\!\big |_2^2+{r_{{\text {node}}}}|\!|\!|{\Theta ^{{l}}}|\!|\!|_{2,1}\Biggr \} \end{aligned}$$
(6)

for a tuning parameter \({r_{{\text {node}}}}\in [0,\infty )\), a nonempty set of parameters

$$\begin{aligned} {\mathcal {M}_{2,1}}\subset \Bigl \{{\varvec{\Theta }}\in {\mathcal{M}}\ :\ \max _{j\in \{0,\dots ,{l}-1\}}|\!|\!|{\Theta ^j}|\!|\!|_{2,1}\le 1\Bigr \}\,, \end{aligned}$$

and the \(\ell_{2}/\ell_{1}\)-norm

$$\begin{aligned}&|\!|\!|{\Theta ^j}|\!|\!|_{2,1}:=\sum _{k=1}^{{{p}^j}}\sqrt{\sum _{i=1}^{{{p}^{j+1}}}|({\Theta ^j})_{ik}|^2}\\&\quad \text {for}~j\in \{0,\dots ,{l}-1\},\,{\Theta }^j\in {\mathbb {R}}^{{{p}^{j+1}}\times {{p}^j}}\,. \end{aligned}$$

This estimator is an analog of the group-lasso estimator in linear regression (Bakin 1999). Again, to avoid ambiguities in the regularization, our formulation is slightly different from the standard formulations in the literature, but the fact that group-lasso regularizers lead to node-sparse networks has been discussed extensively before (Alvarez and Salzmann 2016; Liu et al. 2015; Scardapane et al. 2017): the larger the tuning parameter \(r_{{\text {node}}}\), the fewer active nodes in the network.
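
As a small illustration of the group norm above (assuming the column-wise grouping written there, one group per node), consider the following sketch.

```python
import numpy as np

def l21_norm(theta):
    """Group norm |||Theta^j|||_{2,1}: sum over columns (one group per node)
    of the Euclidean norms of the columns of Theta^j."""
    return np.linalg.norm(theta, axis=0).sum()

# Zeroing a column deactivates the corresponding node entirely.
theta = np.array([[0.0,  1.0, 0.0],
                  [0.0, -2.0, 0.0]])
print(l21_norm(theta))        # sqrt(1 + 4) = 2.236...
print(np.abs(theta).sum())    # l1-norm for comparison: 3.0
```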

The above-stated comments about the specific form of the connection-sparse estimator also apply to the node-sparse estimator.

An illustration of connection and node sparsity is given in Fig. 1. Connection-sparse networks have only a small number of active connections between nodes (left panel of Fig. 1); node-sparse networks have inactive nodes, that is, completely unconnected nodes (right panel of Fig. 1). The two notions of sparsity are connected: for example, connection sparsity can render entire nodes inactive “by accident” (see the layer that follows the input layer in the left panel of the figure). In general, node sparsity is the weaker assumption, because it allows for highly connected nodes; this observation is reflected in the theoretical guarantees in the following section.

The optimal network architecture for given data (such as the optimal width) is hardly known beforehand in a data analysis. A main feature of sparsity-inducing regularization is, therefore, that it adjusts parts of the network architecture to the data. In other words, sparsity-inducing regularization is a data-driven approach to adapting the complexity of the network.

While versions of the estimators (3) and (6) are popular in deep learning, statistical analyses, especially of node-sparse deep learning, are scarce. Such a statistical analysis is, therefore, the goal of the following section.

3 Statistical prediction guarantees

We now develop statistical guarantees for the sparse estimators described above. The guarantees are formulated in terms of the squared average (in-sample) prediction error

$$\begin{aligned} {\text {err}}[{\varvec{\Theta }}]:=\frac{1}{{n}}\sum _{i=1}^{{n}}\big |\!\big |\varvec{g}_{*}[{{\varvec{x}}_i}]-{{\varvec{g}_{{\varvec{\Theta }}}}[{{\varvec{x}}_i}]}\big |\!\big |_2^2~~~~~~\text {for}~{\varvec{\Theta }}\in {\mathcal{M}}, \end{aligned}$$

which is a measure for how well the network \(\varvec{g}_{{\varvec{\Theta }}}\) fits the unknown function \(\varvec{g}_{*}\) (which does not need to be a neural network) on the data at hand, and in terms of the prediction risk (or generalization error) for a new sample \((\varvec{y},{\varvec{x}})\) that has the same distribution as the original data

$$\begin{aligned} {\text {risk}}[{\varvec{\Theta }}]:=E_{\varvec{y},{\varvec{x}}}|\!|\varvec{y}-{{\varvec{g}_{{\varvec{\Theta }}}}[{\varvec{x}}]}|\!|_2^2~~~~~~\text {for}~{\varvec{\Theta }}\in {\mathcal{M}}\,, \end{aligned}$$

which measures how well the network \(\varvec{g}_{{\varvec{\Theta }}}\) can predict a new sample. We first study the prediction error, because it is agnostic to the distribution of the input data; in the end, we then translate the bounds for the prediction error into bounds for the generalization error.
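
For concreteness, a minimal sketch of these two quantities could look as follows; g_star and g_theta are placeholder callables, and the risk is approximated by an average over fresh samples.

```python
import numpy as np

def prediction_error(g_star, g_theta, X):
    """In-sample prediction error err[Theta]: average squared distance between
    the true function g_star and the network g_theta on the inputs X."""
    return np.mean([np.sum((g_star(x) - g_theta(x)) ** 2) for x in X])

def empirical_risk(g_theta, X_new, Y_new):
    """Monte-Carlo estimate of the prediction risk risk[Theta] on fresh samples."""
    return np.mean([np.sum((y - g_theta(x)) ** 2) for x, y in zip(X_new, Y_new)])
```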

We first observe that the networks in (2) can be somewhat “linearized:” For every parameter \({\varvec{\Theta }}\in {\mathcal {M}_{1}}\), there is a parameter

$$\begin{aligned} {\overline{{\varvec{\Theta }}}}\in {\overline{\mathcal {M}}_{1}}:=\Bigl \{{\overline{{\varvec{\Theta }}}}=({\overline{{\Theta }}}^{{l}-1},\dots ,{\overline{{\Theta }}}^0)\,:\,{\overline{{\Theta }}}^j\in {\mathbb {R}}^{{{p}^{j+1}}\times {{p}^j}},\ \max _{j\in \{0,\dots ,{l}-1\}}|\!|\!|{\overline{{\Theta }}}^j|\!|\!|_1\le 1\Bigr \} \end{aligned}$$

such that for every \({\varvec{x}}\in {\mathbb {R}}^{{d}}\)

$$\begin{aligned}&{{\varvec{g}_{{\varvec{\Theta }}}}[{\varvec{x}}]}={\Theta ^{{l}}}\,{{\overline{\varvec{g}}_{{\overline{{\varvec{\Theta }}}}}}[{\varvec{x}}]}\nonumber \\&\quad \text {with}~~~~{{\overline{\varvec{g}}_{{\overline{{\varvec{\Theta }}}}}}[{\varvec{x}}]}:={\varvec{f}}^{{l}}\bigl [{\overline{{\Theta }}}^{{l}-1}\cdots {\varvec{f}}^1[{\overline{{\Theta }}}^0{\varvec{x}}]\bigr ]\in {\mathbb {R}}^{{{p}^{{l}}}}\,. \end{aligned}$$
(7)

This additional notation allows us to disentangle the outermost layer (which is regularized directly) from the other layers (which are regularized indirectly). More generally speaking, the additional notation makes a connection to linear regression, where the above holds trivially with \({{\overline{\varvec{g}}_{{\overline{{\varvec{\Theta }}}}}}[{\varvec{x}}]}={\varvec{x}}\).

We also define

$$\begin{aligned} {\overline{\mathcal {M}}_{2,1}}:=\Bigl \{{\overline{{\varvec{\Theta }}}}=({\overline{{\Theta }}}^{{l}-1},\dots ,{\overline{{\Theta }}}^0)\,:\,{\overline{{\Theta }}}^j\in {\mathbb {R}}^{{{p}^{j+1}}\times {{p}^j}},\ \max _{j\in \{0,\dots ,{l}-1\}}|\!|\!|{\overline{{\Theta }}}^j|\!|\!|_{2,1}\le 1\Bigr \} \end{aligned}$$

accordingly.

In high-dimensional linear regression, the quantity central to prediction guarantees is the effective noise (Lederer and Vogt 2020). In our notation (with \({l}=0\) and \({m}=1\) to describe linear regression), the effective noise is \(2|\!|\sum _{i=1}^{{n}}u_i{\varvec{x}}_i|\!|_\infty\). The above linearization allows us to generalize the effective noise to our general deep learning framework:

$$\begin{aligned} {r^*_{{\text {con}}}}&:=2\sup _{{\overline{{\varvec{\Psi }}}}\in {\overline{\mathcal {M}}_{1}}}|\!|\!|\sum _{i=1}^{{n}}\varvec{u}_i\bigl ({{\overline{\varvec{g}}_{{\overline{{\varvec{\Psi }}}}}}[{{\varvec{x}}_i}]}\bigr )^\top |\!|\!|_\infty \\ {r^*_{{\text {node}}}}&:=2\sqrt{{m}}\,\sup _{{\overline{{\varvec{\Psi }}}}\in {\overline{\mathcal {M}}_{2,1}}}|\!|\!|\sum _{i=1}^{{n}}\varvec{u}_i\bigl ({{\overline{\varvec{g}}_{{\overline{{\varvec{\Psi }}}}}}[{{\varvec{x}}_i}]}\bigr )^\top |\!|\!|_\infty \,, \end{aligned}$$
(8)

where \(|\!|\!|A|\!|\!|_\infty :=\max _{(i,j)\in \{1,\dots ,{m}\}\times \{1,\dots ,{{p}^{{l}}}\}}|A_{ij}|\) for \(A\in {\mathbb {R}}^{{m}\times {{p}^{{l}}}}\). The effective noises, as we will see below, are the optimal tuning parameters in our theories; at the same time, the effective noises depend on the noise random variables \(\varvec{u}_1,\dots ,\varvec{u}_{{n}}\), which are unknown in practice. Accordingly, we call the quantities \(r^*_{{\text {con}}}\) and \(r^*_{{\text {node}}}\) the oracle tuning parameters.
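
In the linear-regression special case (\({l}=0\), \({m}=1\)) mentioned above, the effective noise can be computed directly. The following sketch does so for simulated Gaussian noise; the data are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.normal(size=(n, d))          # fixed inputs x_1, ..., x_n
u = rng.normal(scale=0.5, size=n)    # centered (here: Gaussian) noise

# Effective noise in the linear case (l = 0, m = 1): 2 * || sum_i u_i x_i ||_inf
r_star = 2.0 * np.max(np.abs(X.T @ u))
print(r_star)

# For such noise, r_star typically grows only like sqrt(n log d) (up to constants),
# so r_star / n vanishes as n grows; this ratio is the statistical error that
# appears in the oracle inequalities below.
```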

We take a moment to compare the effective noises in (8) to Rademacher complexities (Koltchinskii 2001; Koltchinskii and Panchenko 2002). Rademacher complexities are the basis of a line of other statistical theories for deep learning (Bartlett and Mendelson 2002; Golowich et al. 2018; Lederer 2020a; Neyshabur et al. 2015). In our framework, the Rademacher complexities in the case \({m}=1\) are (Lederer 2020a, Definition 1)

$$\begin{aligned}&{E}_{{\varvec{x}}_1,\dots ,{\varvec{x}}_{{n}},k_1,\dots ,k_{{n}}}\biggl [\sup _{{\varvec{\Theta }}\in {\mathcal {M}_{1}}}\Bigl |\frac{1}{{n}}\sum _{i=1}^{{n}}k_i{{\varvec{g}_{{\varvec{\Theta }}}}[{{\varvec{x}}_i}]}\Bigr |\biggr ]\\&\quad \text {and}~~~~{E}_{{\varvec{x}}_1,\dots ,{\varvec{x}}_{{n}},k_1,\dots ,k_{{n}}}\biggl [\sup _{{\varvec{\Theta }}\in {\mathcal {M}_{2,1}}}\Bigl |\frac{1}{{n}}\sum _{i=1}^{{n}}k_i{{\varvec{g}_{{\varvec{\Theta }}}}[{{\varvec{x}}_i}]}\Bigr |\biggr ] \end{aligned}$$

for i.i.d. Rademacher random variables \(k_1,\dots ,k_{{n}}\). The effective noises might look like (rescaled) empirical versions of these quantities at first sight, but this is not the case. Two immediate differences are that the quantities in (8) apply to general \({m}\) and circumvent the outermost layer of the networks. But more importantly, Rademacher complexities involve external i.i.d. Rademacher random variables that are not connected with the statistical model at hand, while the effective noises involve the noise variables, which are completely specified by the model and, therefore, can have any distribution (see our sub-Gaussian example further below). Hence, there are no general techniques to relate Rademacher complexities and effective noises.

Not only are the two concepts distinct, but they are also used in very different ways. For example, existing theories use Rademacher complexities to measure the size of the function class at hand, while we use effective noises to measure the maximal impact of the stochastic noise on the estimators. (Our proofs also require a measure of the size of the function class, but this measure is entropy—cf. Lemma 1.) In general, our proof techniques are very different from those in the context of Rademacher complexities.

We can now state a general prediction guarantee.

Theorem 1

(General Prediction Guarantees) If \({r_{{\text {con}}}}\ge {r^*_{{\text {con}}}}\), it holds that

$$\begin{aligned} {\text {err}}[{\widehat{{\varvec{\Theta }}}_{{\text {con}}}}] \le \inf _{{\varvec{\Theta }}\in {\mathcal {M}_{1}}}\Bigl \{{\text {err}}[{\varvec{\Theta }}]+\frac{2{r_{{\text {con}}}}}{{n}}|\!|\!|{\Theta ^{{l}}}|\!|\!|_1\Bigr \}\,. \end{aligned}$$

Similarly, if \({r_{{\text {node}}}}\ge {r^*_{{\text {node}}}}\), it holds that

$$\begin{aligned} {\text {err}}[{\widehat{{\varvec{\Theta }}}_{{\text {node}}}}] \le \inf _{{\varvec{\Theta }}\in {\mathcal {M}_{2,1}}}\Bigl \{{\text {err}}[{\varvec{\Theta }}]+\frac{2{r_{{\text {node}}}}}{{n}}|\!|\!|{\Theta ^{{l}}}|\!|\!|_{2,1}\Bigr \}\,. \end{aligned}$$

Each bound contains an approximation error \({\text {err}}[{\varvec{\Theta }}]\) that captures how well the class of networks can approximate the true data-generating function \(\varvec{g}_*\) and a statistical error proportional to \({r_{{\text {con}}}}/{n}\) and \({r_{{\text {node}}}}/{n}\), respectively, that captures how well the estimator can select within the class of networks at hand. In other words, Theorem 1 ensures that the estimators (3) and (6) predict—up to the statistical error described by \({r_{{\text {con}}}}/{n}\) and \({r_{{\text {node}}}}/{n}\), respectively—as well as the best connection- and node-sparse network. This observation can be illustrated further:

Corollary 1

(Parametric Setting) If additionally \(\varvec{g}_*={\varvec{g}_{{\varvec{\Theta }^*}}}\) for a \({\varvec{\Theta }^*}\in {\mathcal {M}_{1}}\), it holds that

$$\begin{aligned} {\text {err}}[{\widehat{{\varvec{\Theta }}}_{{\text {con}}}}] \le \frac{2{r_{{\text {con}}}}}{{n}}|\!|\!|{({\varvec{\Theta }^*})^{{l}}}|\!|\!|_1\,. \end{aligned}$$

If instead \(\varvec{g}_*={\varvec{g}_{{\varvec{\Theta }^*}}}\) for a \({\varvec{\Theta }^*}\in {\mathcal {M}_{2,1}}\), it holds that

$$\begin{aligned} {\text {err}}[{\widehat{{\varvec{\Theta }}}_{{\text {node}}}}] \le \frac{2{r_{{\text {node}}}}}{{n}}|\!|\!|{({\varvec{\Theta }^*})^{{l}}}|\!|\!|_{2,1}\,. \end{aligned}$$

Hence, if the underlying data-generating function is a sparse network itself, the prediction errors of the estimators are essentially bounded by the statistical errors \({r_{{\text {con}}}}/{n}\) and \({r_{{\text {node}}}}/{n}\). In high-dimensional statistics, bounds similar to those in Theorem 1 and Corollary 1 are called oracle inequalities (Lederer et al. 2019; Lederer 2022).

The above-stated results also identify the oracle tuning parameters \(r^*_{{\text {con}}}\) and \(r^*_{{\text {node}}}\) as optimal tuning parameters: they give the best prediction guarantees in Theorem 1. But since the oracle tuning parameters are unknown in practice, the guarantees implicitly presume a calibration scheme that satisfies \({r_{{\text {con}}}}\approx {r^*_{{\text {con}}}}\) in practice. A natural candidate is cross-validation, but there are no guarantees that cross-validation provides such tuning parameters. This is a limitation that our theories share with all other theories in the field.
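
For illustration, calibrating \(r_{{\text {con}}}\) by \(K\)-fold cross-validation could be sketched as follows; fit_connection_sparse and predict are hypothetical user-supplied routines, and, as emphasized above, there is no guarantee that the selected value approximates \(r^*_{{\text {con}}}\).

```python
import numpy as np

def cross_validate_r_con(X, Y, r_grid, fit_connection_sparse, predict, k=5):
    """K-fold cross-validation for the tuning parameter r_con in (3).
    fit_connection_sparse(X, Y, r) and predict(theta_hat, x) are hypothetical
    user-supplied routines; X and Y are arrays with one sample per row."""
    n = len(X)
    folds = np.array_split(np.random.permutation(n), k)
    scores = []
    for r in r_grid:
        errors = []
        for fold in folds:
            train = np.setdiff1d(np.arange(n), fold)
            theta_hat = fit_connection_sparse(X[train], Y[train], r)
            errors.append(np.mean([np.sum((Y[i] - predict(theta_hat, X[i])) ** 2)
                                   for i in fold]))
        scores.append(np.mean(errors))
    return r_grid[int(np.argmin(scores))]
```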

Rather than dealing with the practical calibration of the tuning parameters, we exemplify the oracle tuning parameters in a specific setting. This analysis will illustrate the rates of convergence that we can expect from Theorem 1, and it will allow us to compare our theories with other theories in the literature. Assume that the activation functions satisfy \({\varvec{f}}^j[{\mathbf{0}}_{{{p}^j}}]={\mathbf{0}}_{{{p}^j}}\) and are 1-Lipschitz continuous with respect to the Euclidean norms on the functions’ input and output spaces \({\mathbb {R}}^{{{p}^j}}\). A popular example is ReLU activation, but the conditions are met by many other functions as well. Also, assume that the noise vectors \(\varvec{u}_1,\dots ,\varvec{u}_{{n}}\) are independent and centered and have uniformly sub-Gaussian entries (van de Geer 2000, Display (8.2) on Page 126). Keep the input vectors fixed and capture their normalizations by

$$\begin{aligned} {\overline{v}_\infty }:=\sqrt{\frac{1}{{n}}\sum _{i=1}^{{n}}|\!|{{\varvec{x}}_i}|\!|_\infty ^2}~~~~~~\text {and}~~~~~~{\overline{v}_2}:=\sqrt{\frac{1}{{n}}\sum _{i=1}^{{n}}|\!|{{\varvec{x}}_i}|\!|_2^2}\,. \end{aligned}$$

Then, we obtain the following bounds for the effective noises.

Proposition 2

(Sub-Gaussian Noise) There is a constant \({c}\in (0,\infty )\) that depends only on the sub-Gaussian parameters of the noise such that

$$\begin{aligned} P\biggl \{{r^*_{{\text {con}}}}\le {c}{\overline{v}_\infty }\sqrt{{n}{l}\bigl (\log [2{m}{n}{\overline{p}}]\bigr )^3}\biggr \}\ge 1-\frac{1}{{n}} \end{aligned}$$

and

$$\begin{aligned} P\biggl \{{r^*_{{\text {node}}}}\le {c}{\overline{v}_2}\sqrt{{m}{n}{l}{\underline{p}}\bigl (\log [2{m}{n}{\overline{p}}]\bigr )^3}\biggr \}\ge 1-\frac{1}{{n}}\,. \end{aligned}$$

Broadly speaking, this result combined with Theorem 1 illustrates that accurate prediction with connection- and node-sparse estimators is possible even when using very wide and deep networks. Let us analyze the factors one by one and compare them to the factors in the bounds of Taheri et al. (2021) and Neyshabur et al. (2015), which are the two most related papers. The connection-sparse case compares to the results in Taheri et al. (2021), to the results in Neyshabur et al. (2015) with the parameters in that paper set to \(p=q=1\) (a setting slightly more restrictive than ours) or \(p=1;q=\infty\) (a setting slightly less restrictive than ours), and to (Golowich et al. 2018, Theorem 2). The node-sparse case compares to Neyshabur et al. (2015) with \(p=2;q=\infty\) (which gives a setting that is more restrictive than ours, though). Our setup is also more general than the one in Neyshabur et al. (2015) in the sense that it allows for activation functions other than ReLU.

The dependence on \(n\) is, as usual, \(1/\sqrt{{n}}\) up to logarithmic factors.

In the connection-sparse case, our bounds involve \({\overline{v}_\infty }=\sqrt{\sum _{i=1}^{{n}}|\!|{{\varvec{x}}_i}|\!|_\infty ^2/{n}}\) rather than the factor \({v_\infty }:=\max _{i\in \{1,\dots ,{n}\}}|\!|{{\varvec{x}}_i}|\!|_\infty\) of Golowich et al. (2018) and Neyshabur et al. (2015) or the factor \({\overline{v}_2}=\sqrt{\sum _{i=1}^{{n}}|\!|{{\varvec{x}}_i}|\!|_2^2/{n}}\) of Taheri et al. (2021). In principle, the improvements of \(\overline{v}_\infty\) over \(v_\infty\) and \(\overline{v}_2\) can be up to a factor \(\sqrt{{n}}\) and up to a factor \(\sqrt{{d}}\), respectively; in practice, the improvements depend on the specifics of the data. For example, on the training data of MNIST (LeCun et al. 1998) and Fashion-MNIST (Xiao et al. 2017) (\(\sqrt{{n}}\approx 250;\sqrt{{d}}=28\) in both data sets), it holds that \({\overline{v}_\infty }\approx {v_\infty }\approx {\overline{v}_2}/9\) and \({\overline{v}_\infty }\approx {v_\infty }\approx {\overline{v}_2}/12\), respectively. In the node-sparse case, our bounds involve \(\overline{v}_2\), which is again somewhat smaller than the factor \({v_2}:=\max _{i\in \{1,\dots ,{n}\}}|\!|{{\varvec{x}}_i}|\!|_2\) in Neyshabur et al. (2015).
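
These factors are straightforward to compute for a given data set; a minimal sketch (with the inputs stacked into an \(n\times d\) array) could be:

```python
import numpy as np

def normalization_factors(X):
    """Input-normalization factors from the text: the averaged quantities
    v_bar_inf and v_bar_2 versus the maxima v_inf and v_2.
    X is an (n, d) array whose rows are the inputs x_1, ..., x_n."""
    v_bar_inf = np.sqrt(np.mean(np.max(np.abs(X), axis=1) ** 2))
    v_bar_2 = np.sqrt(np.mean(np.sum(X ** 2, axis=1)))
    v_inf = np.max(np.abs(X))
    v_2 = np.max(np.linalg.norm(X, axis=1))
    return v_bar_inf, v_bar_2, v_inf, v_2

# For MNIST-like inputs, one would load the (60000, 784) training matrix into X
# and compare the four factors directly.
```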

The main difference between the bounds for the connection-sparse and node-sparse estimators is their dependencies on the networks’ maximal width \({\underline{p}}\). The bound for the connection-sparse estimator (3) depends on the width \(\underline{p}\) only logarithmically (through \(\overline{p}\)), while the bound for the node-sparse estimator (6) depends on \(\underline{p}\) sublinearly. The dependence in the connection-sparse case is the same as in Taheri et al. (2021), while Neyshabur et al. (2015) can avoid even that logarithmic dependence (and, therefore, allow for networks with infinite widths). The node-sparse case in Neyshabur et al. (2015) does not involve our dependence on the width, but this difference stems from the fact that they use a more restrictive version of the grouping—we take the maximum over each layer, while they take the maximum over each node—and our results can be readily adjusted to their notion of group sparsity. These observations indicate that node sparsity as formulated above is suitable for slim networks (\({\underline{p}}\ll {n}\)) but should be strengthened or complemented with other notions of sparsity otherwise. To give a numeric example, the training data in MNIST (LeCun et al. 1998) and Fashion-MNIST (Xiao et al. 2017) comprise \({n}=60\,000\) samples, which means that the width should be considerably smaller than \(60\,000\) when using node sparsity alone. (Note that the input layer does not take part in \(\underline{p}\), which means that \({d}\) could be larger.)

For unconstrained estimation, one can expect a linear dependence of the error on the total number of parameters (Anthony and Bartlett 1999). Our bounds for the sparse estimators, in contrast, only have a \(\log [{\overline{p}}]\) dependence on the total number of parameters. This difference illustrates the virtue of regularization in general, and the virtue of sparsity in particular.

Both of our bounds have a mild \(\sqrt{{l}}\) dependence on the depth. These dependencies align with the results in (Golowich et al. 2018, Theorem 2) but considerably improve on the exponentially increasing dependencies on the depth in Neyshabur et al. (2015) and, therefore, are particularly suited to describe deep network architectures. If the conditions \(\max _j|\!|\!|{\Theta ^j}|\!|\!|_1\le 1\) and \(\max _j|\!|\!|{\Theta ^j}|\!|\!|_{2,1}\le 1\) in the definitions of the connection-sparse and node-sparse estimators are replaced by the stricter conditions \(\sum _j|\!|\!|{\Theta ^j}|\!|\!|_1\le 1\) and \(\sum _j|\!|\!|{\Theta ^j}|\!|\!|_{2,1}\le 1\), respectively (cf. Taheri et al. (2021) and our discussion in Section 2), the dependence on the depth can be improved further from \(\sqrt{{l}}\) to \((2/{l})^{{l}}\sqrt{{l}}\) (this only requires a simple adjustment of the last display in the proof of Proposition 4), which is exponentially decreasing in the depth.
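
For a sense of scale, a simple numerical illustration of the two stated factors at depth \({l}=10\) is

$$\begin{aligned} \Bigl (\frac{2}{{l}}\Bigr )^{{l}}\sqrt{{l}}\,\Big |_{{l}=10}=(0.2)^{10}\sqrt{10}\approx 3.2\cdot 10^{-7}~~~~~~\text {versus}~~~~~~\sqrt{{l}}\,\Big |_{{l}=10}\approx 3.2\,. \end{aligned}$$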

Our connection-sparse bounds have a mild \(\log [{m}]\) dependence on the number of output nodes; the node-sparse bound involves an additional factor \(\sqrt{{m}}\). The case of multiple outputs has not been considered in statistical prediction bounds before.

Proposition 2 also highlights another advantage of our regularization approach over theories such as Golowich et al. (2018) and Neyshabur et al. (2015) that apply to constrained estimators. The theories for constrained estimators require bounding the sparsity levels directly, but in practice, suitable values for these bounds are rarely known. In our framework, in contrast, the sparsity is controlled via tuning parameters indirectly, and Proposition 2—although not providing a complete practical calibration scheme—gives insights into how these tuning parameters should scale with \({n}\), \({d}\), \({l}\), and so forth.

We also note that the bounds in Theorem 1 can be generalized readily to every estimator of the form

$$\begin{aligned} {\widehat{{\varvec{\Theta }}}_{{\text {gen}}}}\in {{\,\mathrm{arg\,min}\,}}_{{\varvec{\Theta }}\in {\mathcal {M}_{{\text {gen}}}}}\Biggl \{\sum _{i=1}^{{n}}\big |\!\big |\varvec{y}_i-{{\varvec{g}_{{\varvec{\Theta }}}}[{{\varvec{x}}_i}]}\big |\!\big |_2^2+{r_{{\text {gen}}}}|\!|\!|{\Theta ^{{l}}}|\!|\!|\Biggr \}\,, \end{aligned}$$

where \({r_{{\text {gen}}}}\in [0,\infty )\) is a tuning parameter, \({\mathcal {M}_{{\text {gen}}}}\) any nonempty subset of \(\mathcal {M}\), and \(|\!|\!|\cdot |\!|\!|\) any norm. The bound for such an estimator is then

$$\begin{aligned} {\text {err}}[{\widehat{{\varvec{\Theta }}}_{{\text {gen}}}}] \le \inf _{{\varvec{\Theta }}\in {\mathcal {M}_{{\text {gen}}}}}\Bigl \{{\text {err}}[{\varvec{\Theta }}]+\frac{2{r_{{\text {gen}}}}}{{n}}|\!|\!|{\Theta ^{{l}}}|\!|\!|\Bigr \} \end{aligned}$$

for \({r_{{\text {gen}}}}\ge {r^*_{{\text {gen}}}}\), where \({r^*_{{\text {gen}}}}\) is as \(r^*_{{\text {con}}}\) but based on the dual norm of \(|\!|\!|\cdot |\!|\!|\) instead of the dual norm of \(|\!|\!|\cdot |\!|\!|_1\). For example, one could impose connection sparsity on some layers and node sparsity on others, or one could impose different regularizations altogether. We omit the details to avoid digression.

The above oracle inequalities bound the prediction error, a standard measure of accuracy in statistics. Broadly speaking, this measure captures “how well the estimator describes the data-generating process.” So our comparison with Neyshabur et al. (2015) and Golowich et al. (2018) might seem questionable, because they instead bound the generalization error, a measure that is more common in machine learning and captures “how well the estimator describes new samples.” But we can derive such bounds as well. For simplicity, we consider a parametric setting and sub-Gaussian noise again. We then find the following bounds:

Proposition 3

(Generalization Guarantees) Assume that the inputs \({\varvec{x}},{\varvec{x}}_1,\dots ,{\varvec{x}}_{{n}}\) are i.i.d. random vectors, that the noise vectors \(\varvec{u}_1,\dots ,\varvec{u}_{{n}}\) are independent and centered and have uniformly sub-Gaussian entries, and that \({r_{{\text {con}}}}={r^*_{{\text {con}}}},{r_{{\text {node}}}}={r^*_{{\text {node}}}}\rightarrow 0\) as \({n}\rightarrow \infty\). Consider an arbitrary positive constant \(b\in (0,\infty )\). If \(\varvec{g}_*={\varvec{g}_{{\varvec{\Theta }^*}}}\) for a \({\varvec{\Theta }^*}\in {\mathcal {M}_{1}}\), it holds with probability at least \(1-1/{n}\) that

$$\begin{aligned} {\text {risk}}[{\widehat{{\varvec{\Theta }}}_{{\text {con}}}}] \le (1+b) {\text {risk}}[{\varvec{\Theta }^*}]+{c}{\overline{v}_\infty }\sqrt{\frac{{l}\bigl (\log [2{m}{n}{\overline{p}}]\bigr )^3}{{n}}}\,|\!|\!|{({\varvec{\Theta }^*})^{{l}}}|\!|\!|_1 \end{aligned}$$

for a constant \({c}\in (0,\infty )\) that depends only on b and the sub-Gaussian parameters of the noise. Similarly, if \(\varvec{g}_*={\varvec{g}_{{\varvec{\Theta }^*}}}\) for a \({\varvec{\Theta }^*}\in {\mathcal {M}_{2,1}}\), it holds with probability at least \(1-1/{n}\) that

$$\begin{aligned} {\text {risk}}[{\widehat{{\varvec{\Theta }}}_{{\text {node}}}}] \le (1+b) {\text {risk}}[{\varvec{\Theta }^*}]+{c}{\overline{v}_2}\sqrt{\frac{{m}{l}{\underline{p}}\bigl (\log [2{m}{n}{\overline{p}}]\bigr )^3}{{n}}}\,|\!|\!|{({\varvec{\Theta }^*})^{{l}}}|\!|\!|_{2,1} \end{aligned}$$

for a constant \({c}\in (0,\infty )\) that depends only on b and the sub-Gaussian parameters of the noise.

Hence, the generalization errors are bounded by the same terms as the prediction errors.

4 Outlook: Initialization

Our theoretical results also suggest further research on a practical problem in deep learning: weight initialization (Glorot and Bengio 2010; He et al. 2015; Mishkin and Matas 2015). To highlight the connection between our work and weight initialization, we consider once more our guarantees’ dependence on the depth \(l\). Proposition 3, for example, comprises a sublinear dependence through the factor \(\sqrt{{l}}\) and a logarithmic dependence through the total number of parameters \(\overline{p}\) inside the logarithm—we have discussed these dependencies in detail. But there is another potential source of dependence on \(l\): the factor \(|\!|\!|{({\varvec{\Theta }^*})^{{l}}}|\!|\!|_1\). Naively thinking, one could suspect that this factor scales exponentially in \(l\): the argument would be that the weight matrix of each of the \({l}-1\) inner layers needs to be rescaled to fit into \(\mathcal {M}_{1}\) or \(\mathcal {M}_{2,1}\), which means that the weight matrix of the outer layer needs to be rescaled by a product of these \({l}-1\) factors.

The argument is intuitive, but it is wrong: the problem with it is that the optimal weight matrices \({({\varvec{\Theta }^*})^{{l}}}\) change with the depth of the network, while the data-generating process remains unaffected by what function we use to approximate it. In other words, we cannot expect a simple relationship between \({({\varvec{\Theta }^*})^{{l}}}\) and \(({\varvec{\Theta }^*})^{{l}-1}\), but we can expect the overall “scales” of the corresponding networks to be similar, that is, \(|\!|\!|{({\varvec{\Theta }^*})^{{l}}}|\!|\!|_1\approx |\!|\!|({\varvec{\Theta }^*})^{{l}-1}|\!|\!|_1\). Hence, we can expect the factor \(|\!|\!|{({\varvec{\Theta }^*})^{{l}}}|\!|\!|_1\) in our bounds to be approximately independent of \(l\).

One can also argue that recent results on the approximation properties of sparse neural networks, such as those in Beknazaryan (2021) and Schmidt-Hieber (2020), suggest that sparse networks with parameters in \(\mathcal {M}_{1}\) or \(\mathcal {M}_{2,1}\) and fixed norms \(|\!|\!|{({\varvec{\Theta }^*})^{{l}}}|\!|\!|_1\) or \(|\!|\!|{({\varvec{\Theta }^*})^{{l}}}|\!|\!|_{2,1}\), respectively, can indeed approximate large classes of functions.

In any case, we can draw two conclusions: First, our bounds indeed depend on the network depth as advertised. Second, our results hint at the fact that initialization schemes should take network depths into account, and it might be favorable to use sparse initialization schemes rather than distributing weights “uniformly” across the entire network. More generally, we conclude that the connection between sparse networks and weight initializations might be an interesting topic for further research.

5 Discussion

We have developed guarantees for sparse deep learning both in terms of the prediction error (Theorem 1 and Corollary 1 together with Proposition 2), a standard measure of accuracy in statistics, and in terms of the generalization error (Proposition 3), a standard measure of accuracy in machine learning. These results extend and complement existing guarantees in the literature—see Table 1 below.

Table 1 Presence (\(\checkmark\)) or absence ( ) of certain features in previous statistical theories for sparse deep learning

Even though many deep learning applications fall into the framework of classification, we have focussed on regression with least-squares loss. The reason is that the regression setting is much more challenging: since the loss is unbounded, many of the techniques regularly used in classification (like McDiarmid’s inequality (McDiarmid 1989, Lemma (3.3))) are not applicable. In this sense, our derivations are more general, and we expect that our approach will provide very similar classification bounds in the future as well (see Appendix 1 for possible extensions more generally).

Evidence for the benefits of deep networks has been established in practice (LeCun et al. 2015; Schmidhuber 2015), approximation theory (Liang and Srikant 2016; Telgarsky 2016; Yarotsky 2017), and statistics (Taheri et al. 2021; Kohler et al. 2019). Since our guarantees scale at most sublinearly in the number of layers (or even improve with increasing depth—see our comment on Page 5), our paper complements these lines of research and shows that sparsity-inducing regularization is an effective approach to coping with the complexity of deep and very deep networks.

While previous theories mostly considered connection sparsity (small number of active connections between nodes), we also include node sparsity (small number of active nodes). Moreover, as discussed on Page 5, Theorem 1 can be readily extended to any norm-based regularization. Hence, it is straightforward to adjust our results to granularities between connection and node sparsity—cf. Mao et al. (2017). On the other hand, our techniques do not seem appropriate for “hard-coded” types of sparsity, such as 2:4 (“two-to-four”) sparsity (Mishra et al. 2021).

Connection sparsity limits the number of nonzero entries in each parameter matrix, while node sparsity only limits the total number of nonzero rows. Hence, the number of columns in a parameter matrix, that is, the width of the preceding layer, is regularized only in the case of connection sparsity. Our theoretical results reflect this insight in that the bounds for the connection- and node-sparse estimators depend on the networks’ width logarithmically and sublinearly, respectively. Practically speaking, our results indicate that connection sparsity is suitable to handle wide networks, but node sparsity is suitable for wide networks only when complemented by connection sparsity or other strategies.

The mild logarithmic dependence of our connection-sparse bounds on the number of output nodes illustrates that networks with many outputs can be learned in practice. Our prediction theory is the first one to consider multiple output nodes; a classification theory with a logarithmic dependence on the number of output nodes has been established very recently in Ledent et al. (2019).

The mathematical underpinnings of our theory are very different from those of most other papers in theoretical deep learning. The proof of the main theorem shares similarities with proofs in high-dimensional statistics, such as the concept of the effective noise (Lederer 2022). The treatment of the relevant empirical processes uses metric entropy, chaining, and Lipschitz properties of neural networks. These concepts and tools are not standard in deep learning and, therefore, might be of more general interest (see again Appendix 1 for further ideas).

Our theory has three limitations: First, the bounds apply only to global optima of the optimization landscapes rather than local optima or other points in which certain algorithms might be trapped. However, there is evidence that global optimization can be feasible at least in wide and deep networks (Lederer 2020b). Second, the theory does not entail a practical scheme for the calibration of the tuning parameters. However, the inclusion of regularization (rather than constraints) is already a step forward, because it reveals how the tuning parameters should scale with the problem dimensions (see our Proposition 2). Third, the network architecture is limited to fully connected feedforward layers, which excludes some aspects of modern pipelines (such as convolutions, dropout, and so forth). In any case, all three limitations are open problems in the literature; in particular, the mentioned limitations are shared by most theories on the topic.

We can summarize what this paper contributes—and what it does not—as follows: From a practical perspective, it is well established that sparsity can benefit deep learning, and there are several methods to generate sparsity in practice. Thus, this paper does not provide new practical insights or methods. Instead, our paper (i) backs up these practical observations with statistical theories that are more general and closer to practice than previous theories, and it (ii) establishes refined concepts and techniques for the statistical analysis of deep learning more generally.