Statistical guarantees for sparse deep learning

Neural networks are becoming increasingly popular in applications, but our mathematical understanding of their potential and limitations is still limited. In this paper, we further this understanding by developing statistical guarantees for sparse deep learning. In contrast to previous work, we consider different types of sparsity, such as few active connections, few active nodes, and other norm-based types of sparsity. Moreover, our theories cover important aspects that previous theories have neglected, such as multiple outputs, regularization, and l2-loss. The guarantees have a mild dependence on network widths and depths, which means that they support the application of sparse but wide and deep networks from a statistical perspective. Some of the concepts and tools that we use in our derivations are uncommon in deep learning and, hence, might be of additional interest.

The many empirical observations of the benefits of sparsity have sparked interest in mathematical support in the form of statistical theories. Two current approaches are based on Rademacher complexities [Bartlett and Mendelson, 2002, Neyshabur et al., 2015] and ideas from nonparametric statistics [Schmidt-Hieber, 2020], respectively. While their results provide important support for sparse deep learning, they still have major limitations. The first approach is restricted to bounded loss functions (which excludes the $\ell_2$-loss, for example), is either restricted to a simple form of sparsity (which we will call "connection sparsity" later) or suffers from an exponential dependence on the number of layers (which contradicts the current interest in very deep networks), caters to constraints rather than regularization (which is the predominant implementation in practice), and is limited to a single output node and ReLU activation. The second approach is restricted to $\ell_0$-constraints (which are infeasible in practice), assumes bounded weights, and is also limited to a single output 3 n for the connection-sparse and node-sparse estimators (see the following section for the notions of sparsity), respectively, where $l$ is the number of hidden layers, $m$ the number of output nodes, $n$ the number of samples, $p$ the total number of parameters, and $\overline{p}$ the maximal width of the network. The rates suggest that sparsity-inducing approaches can provide accurate prediction even in very wide (with connection sparsity) and very deep (with either type of sparsity) networks while, at the same time, ensuring low network complexities. These findings underpin the current trend toward sparse but wide and especially deep networks from a statistical perspective. More generally speaking, our paper complements the existing statistical theories for sparse deep learning with new results, and it refines the techniques that were introduced in [Taheri et al., 2021].
Outline of the paper. Section II recapitulates the notions of connection and node sparsity and introduces the corresponding deep learning framework and estimators. Section III confirms the empirically observed accuracies of connection- and node-sparse estimation in theory. Section IV discusses connections between our theoretical results and weight initialization. Section V summarizes the key features and limitations of our work. The Appendix contains all proofs.

II. CONNECTION- AND NODE-SPARSE DEEP LEARNING
We consider data $(y_1, x_1), \dots, (y_n, x_n) \in \mathbb{R}^m \times \mathbb{R}^d$ that are related via
$$y_i = g^*[x_i] + u_i \quad \text{for } i \in \{1, \dots, n\} \tag{1}$$
for an unknown data-generating function $g^* : \mathbb{R}^d \to \mathbb{R}^m$ and unknown, random noise $u_1, \dots, u_n \in \mathbb{R}^m$. We allow all aspects, namely $y_i$, $g^*$, $x_i$, and $u_i$, to be unbounded. Our goal is to model the data-generating function with a feedforward neural network of the form
$$g_\Theta[x] := \Theta^l f^l\bigl[\Theta^{l-1} \cdots f^1[\Theta^0 x]\bigr] \quad \text{for } x \in \mathbb{R}^d \tag{2}$$
indexed by the parameter space $\mathcal{M} := \{\Theta = (\Theta^l, \dots, \Theta^0) : \Theta^j \in \mathbb{R}^{p_{j+1} \times p_j}\}$. The functions $f^j : \mathbb{R}^{p_j} \to \mathbb{R}^{p_j}$ are called the activation functions [Lederer, 2021], and $p_0 := d$ and $p_{l+1} := m$ are called the input and output dimensions, respectively. The depth of the network is $l$, the maximal width is $\overline{p} := \max_{j \in \{0, \dots, l-1\}} p_{j+1}$, and the total number of parameters is $p := \sum_{j=0}^{l} p_{j+1} p_j$. In practice, the total number of parameters often rivals or exceeds the number of samples: $p \approx n$ or $p \gg n$. We then speak of high dimensionality. A common technique for avoiding overfitting in high-dimensional settings is regularization that induces additional structures, such as sparsity. Sparsity has the interesting side effect of reducing the networks' complexities, which can facilitate interpretations and reduce demands on energy and memory. Three common notions of sparsity are connection sparsity, which means that there is only a small number of nonzero connections between nodes, node sparsity, which means that there is only a small number of active nodes [Alvarez and Salzmann, 2016, Changpinyo et al., 2017, Feng and Simon, 2017, Kim et al., 2016, Lee et al., 2008, Liu et al., 2015, Nie et al., 2015, Scardapane et al., 2017, Wen et al., 2016], and layer sparsity, which means that there is only a small number of active layers [Hebiri and Lederer, 2020].
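To make the notation concrete, the following minimal sketch evaluates a small feedforward network of the kind described above and computes its maximal width and total number of parameters from the layer dimensions. The function names, the ReLU activation, and the example dimensions are illustrative assumptions and do not come from the paper.

```python
def relu(v):
    # ReLU activation applied entrywise (an assumed choice of activation)
    return [max(0.0, t) for t in v]

def matvec(A, x):
    # Matrix-vector product for a matrix stored as a list of rows
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def network(thetas, x, activation=relu):
    # thetas = [Theta^0, ..., Theta^l]; the activation is applied
    # between consecutive weight matrices, not after the last one
    z = matvec(thetas[0], x)
    for theta in thetas[1:]:
        z = matvec(theta, activation(z))
    return z

def max_width(dims):
    # dims = [p_0, ..., p_{l+1}]; the input and output layers are excluded
    return max(dims[1:-1])

def total_parameters(dims):
    # Sum of p_{j+1} * p_j over all weight matrices
    return sum(dims[j + 1] * dims[j] for j in range(len(dims) - 1))

# Hypothetical example: d = 3 inputs, hidden widths 4 and 2, m = 1 output
dims = [3, 4, 2, 1]
width = max_width(dims)
n_params = total_parameters(dims)
```

In this toy example the hidden layers have widths 4 and 2, so the maximal width ignores the input dimension, in line with the definition above.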
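In the following, we focus on connection and node sparsity. Our first sparse estimator is
$$\widehat{\Theta}_{\mathrm{con}} \in \operatorname*{arg\,min}_{\Theta \in \mathcal{M}_1} \Bigl\{ \sum_{i=1}^{n} ||y_i - g_\Theta[x_i]||_2^2 + r_{\mathrm{con}} |||\Theta^l|||_1 \Bigr\} \tag{3}$$
for a tuning parameter $r_{\mathrm{con}} \in [0, \infty)$, a nonempty set of parameters
$$\mathcal{M}_1 \subset \Bigl\{ \Theta \in \mathcal{M} : \max_{j \in \{0, \dots, l-1\}} |||\Theta^j|||_1 \le 1 \Bigr\},$$
and the $\ell_1$-norm
$$|||\Theta^j|||_1 := \sum_{i=1}^{p_{j+1}} \sum_{k=1}^{p_j} |(\Theta^j)_{ik}| \quad \text{for } j \in \{0, \dots, l\}.$$
This estimator is an analog of the lasso estimator in linear regression [Tibshirani, 1996]. It induces sparsity on the level of connections: the larger the tuning parameter $r_{\mathrm{con}}$, the fewer connections among the nodes.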
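As a rough illustration of this type of lasso-style objective, the sketch below evaluates a least-squares loss plus a tuning parameter times the elementwise $\ell_1$-norm of the outermost weight matrix, and checks the norm constraint on the inner layers. It assumes this specific form of the objective and constraint; all function names and numbers are hypothetical.

```python
def l1_norm(matrix):
    # Elementwise l1-norm: sum of absolute values of all entries
    return sum(abs(entry) for row in matrix for entry in row)

def penalized_objective(residuals, theta_outer, r_con):
    # residuals: list of vectors y_i - g_Theta[x_i] (assumed precomputed)
    loss = sum(sum(t * t for t in res) for res in residuals)
    return loss + r_con * l1_norm(theta_outer)

def inner_layers_feasible(thetas):
    # thetas = [Theta^0, ..., Theta^l]; only the inner layers are constrained
    return all(l1_norm(theta) <= 1.0 for theta in thetas[:-1])

# Hypothetical numbers for illustration
residuals = [[1.0, -2.0], [0.0, 1.0]]
theta_outer = [[0.5, -0.5], [0.0, 1.0]]
value = penalized_objective(residuals, theta_outer, r_con=3.0)
```

Increasing `r_con` makes nonzero entries of the outermost matrix more expensive, which is the mechanism behind the lasso analogy.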
Our estimator (3) specifies one way to formulate this type of regularization. The estimator is indeed a regularized estimator (rather than a constrained estimator), because the complexity is regulated entirely through the tuning parameter $r_{\mathrm{con}}$ in the objective function (rather than through a tuning parameter in the set over which the objective function is optimized). But $\ell_1$-regularization could also be formulated slightly differently. For example, one could consider the estimators (5) The differences among the estimators (3)-(5) are small: for example, our theory can be adjusted for (4) with almost no changes to the derivations. The differences among the estimators mainly concern the normalizations of the parameters; we illustrate this in the following proposition.
In brief, the goal of our paper is not to promote a new way of implementing sparsity in practice but to reproduce practical implementations as accurately as possible in theory.
Another way to formulate $\ell_1$-regularization was proposed in Taheri et al. [2021]: they reparametrize the networks through a scale parameter and a constrained version of $\mathcal{M}$ and then focus the regularization on the scale parameter only. Our above-stated estimator (3) is more elegant in that it avoids the reparametrization and the additional parameter.
The factor $|||\Theta^l|||_1$ in the regularization term of (3) measures the complexity of the network over the set $\mathcal{M}_1$, and the factor $r_{\mathrm{con}}$ regulates the complexity of the resulting estimator. This provides a convenient lever for data-adaptive complexity regularization through well-established calibration schemes for the tuning parameter, such as cross-validation. This practical aspect is an advantage of regularized formulations like ours as compared to constrained estimation over sets with a predefined complexity.
The constraints in the set $\mathcal{M}_1$ of the estimator (3) can also retain the expressiveness of the full parameterization that corresponds to the set $\mathcal{M}$: for example, assuming again nonnegative-homogeneous activation, one can check that for every $\Gamma \in \mathcal{M}$, there is a $\Gamma' \in \{\Theta \in \mathcal{M} : \max_{j \in \{0, \dots, l-1\}} |||\Theta^j|||_1 \le 1\}$ such that $g_\Gamma = g_{\Gamma'}$; cf. Taheri et al. [2021, Proposition 1]. In contrast, existing theories on neural networks often require the parameter space to be bounded, which limits the expressiveness of the networks. Our regularization approach is, therefore, closer to practical setups than constrained approaches. The price is that to develop prediction theories, we have to use different tools than those typically used in theoretical deep learning. For example, we cannot use established risk bounds such as Bartlett and Mendelson [2002, Theorem 8] or concentration inequalities such as McDiarmid [1989, Lemma (3.3)] (because they would require a bounded loss). We instead invoke ideas from high-dimensional statistics, prove Lipschitz properties for neural networks, and use empirical-process theory, specifically concentration inequalities that are based on chaining (see the Appendix).
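The rescaling argument behind this expressiveness claim can be sketched numerically. The code below divides each inner weight matrix by its $\ell_1$-norm whenever that norm exceeds one and multiplies the outermost matrix by the product of these factors; for a nonnegative-homogeneous activation such as ReLU, the network function is unchanged. This is an illustrative sketch under these assumptions, not the construction used in Taheri et al. [2021]; all names and numbers are hypothetical.

```python
def relu(v):
    return [max(0.0, t) for t in v]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def network(thetas, x):
    # thetas = [Theta^0, ..., Theta^l] with ReLU between the layers
    z = matvec(thetas[0], x)
    for theta in thetas[1:]:
        z = matvec(theta, relu(z))
    return z

def l1_norm(A):
    return sum(abs(e) for row in A for e in row)

def rescale(thetas):
    # Shrink every inner layer to l1-norm at most 1 and move the
    # accumulated scale into the outermost layer
    scale = 1.0
    rescaled = []
    for theta in thetas[:-1]:
        c = max(1.0, l1_norm(theta))
        rescaled.append([[e / c for e in row] for row in theta])
        scale *= c
    rescaled.append([[e * scale for e in row] for row in thetas[-1]])
    return rescaled

# Hypothetical two-hidden-layer network with a single output
thetas = [
    [[2.0, -1.0], [0.5, 3.0]],
    [[1.0, -2.0], [4.0, 0.5]],
    [[1.0, 1.0]],
]
x = [1.0, 2.0]
rescaled = rescale(thetas)
original_output = network(thetas, x)
rescaled_output = network(rescaled, x)
```

The two outputs agree up to floating-point error, while the complexity of the network is now carried entirely by the outermost matrix, which is the matrix that the regularization term acts on.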
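Our second estimator is
$$\widehat{\Theta}_{\mathrm{node}} \in \operatorname*{arg\,min}_{\Theta \in \mathcal{M}_{2,1}} \Bigl\{ \sum_{i=1}^{n} ||y_i - g_\Theta[x_i]||_2^2 + r_{\mathrm{node}} |||\Theta^l|||_{2,1} \Bigr\} \tag{6}$$
for a tuning parameter $r_{\mathrm{node}} \in [0, \infty)$, a nonempty set of parameters
$$\mathcal{M}_{2,1} \subset \Bigl\{ \Theta \in \mathcal{M} : \max_{j \in \{0, \dots, l-1\}} |||\Theta^j|||_{2,1} \le 1 \Bigr\},$$
and the $\ell_2/\ell_1$-norm $|||\cdot|||_{2,1}$. This estimator is an analog of the group-lasso estimator in linear regression [Bakin, 1999]. Again, to avoid ambiguities in the regularization, our formulation is slightly different from the standard formulations in the literature, but the fact that group-lasso regularizers lead to node-sparse networks has been discussed extensively before [Alvarez and Salzmann, 2016, Liu et al., 2015, Scardapane et al., 2017]: the larger the tuning parameter $r_{\mathrm{node}}$, the fewer active nodes in the network. The above-stated comments about the specific form of the connection-sparse estimator also apply to the node-sparse estimator.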
An illustration of connection and node sparsity is given in Figure 1. Connection-sparse networks have only a small number of active connections between nodes (left panel of Figure 1); node-sparse networks have inactive nodes, that is, completely unconnected nodes (right panel of Figure 1). The two notions of sparsity are connected: for example, connection sparsity can render entire nodes inactive "by accident" (see the layer that follows the input layer in the left panel of the figure). In general, node sparsity is the weaker assumption, because it allows for highly connected nodes; this observation is reflected in the theoretical guarantees in the following section.
The optimal network architecture for given data (such as the optimal width) is rarely known beforehand in a data analysis. A main feature of sparsity-inducing regularization is, therefore, that it adjusts parts of the network architecture to the data. In other words, sparsity-inducing regularization is a data-driven approach to adapting the complexity of the network.
While versions of the estimators (3) and (6) are popular in deep learning, statistical analyses, especially of node-sparse deep learning, are scarce. Such a statistical analysis is, therefore, the goal of the following section.

III. STATISTICAL PREDICTION GUARANTEES
We now develop statistical guarantees for the sparse estimators described above. The guarantees are formulated in terms of the squared average (in-sample) prediction error which is a measure for how well the network $g_\Theta$ fits the unknown function $g^*$ (which does not need to be a neural network) on the data at hand, and in terms of the prediction risk (or generalization error) for a new sample $(y, x)$ that has the same distribution as the original data which measures how well the network $g_\Theta$ can predict a new sample. We first study the prediction error, because it is agnostic to the distribution of the input data; in the end, we then translate the bounds for the prediction error into bounds for the generalization error. We first observe that the networks in (2) can be somewhat "linearized": for every parameter $\Theta \in \mathcal{M}_1$, there is a parameter This additional notation allows us to disentangle the outermost layer (which is regularized directly) from the other layers (which are regularized indirectly). More generally speaking, the additional notation makes a connection to linear regression, where the above holds trivially with $g_\Theta[x] = x$.
We also define In high-dimensional linear regression, the quantity central to prediction guarantees is the effective noise [Lederer and Vogt, 2020]. The effective noise is in our notation (with $l = 0$ and $m = 1$ to describe linear regression) The above linearization allows us to generalize the effective noise to our general deep-learning framework: where The effective noises, as we will see below, are the optimal tuning parameters in our theories; at the same time, the effective noises depend on the noise random variables $u_1, \dots, u_n$, which are unknown in practice. Accordingly, we call the quantities $r^*_{\mathrm{con}}$ and $r^*_{\mathrm{node}}$ the oracle tuning parameters.
We take a moment to compare the effective noises in (8) to Rademacher complexities [Koltchinskii, 2001, Koltchinskii and Panchenko, 2002]. Rademacher complexities are the basis of a line of other statistical theories for deep learning [Bartlett and Mendelson, 2002, Golowich et al., 2018, Lederer, 2020a, Neyshabur et al., 2015]. In our framework, the Rademacher complexities in the case $m = 1$ are [Lederer, 2020a, for i.i.d. Rademacher random variables $k_1, \dots, k_n$. The effective noises might look like (rescaled) empirical versions of these quantities at first sight, but this is not the case. Two immediate differences are that the effective noises in (8) apply to general $m$ and circumvent the outermost layers of the networks. But more importantly, Rademacher complexities involve external i.i.d. Rademacher random variables that are not connected with the statistical model at hand, while the effective noises involve the noise variables, which are completely specified by the model and, therefore, can have any distribution (see our sub-Gaussian example further below). Hence, there are no general techniques to relate Rademacher complexities and effective noises.
Not only are the two concepts distinct, but they are also used in very different ways. For example, existing theories use Rademacher complexities to measure the size of the function class at hand, while we use effective noises to measure the maximal impact of the stochastic noise on the estimators. (Our proofs also require a measure of the size of the function class, but this measure is entropy; cf. Lemma 1.) In general, our proof techniques are very different from those in the context of Rademacher complexities.
We can now state a general prediction guarantee.
Theorem 1 (General Prediction Guarantees). If $r_{\mathrm{con}} \ge r^*_{\mathrm{con}}$, it holds that Each bound contains an approximation error $\mathrm{err}[\Theta]$ that captures how well the class of networks can approximate the true data-generating function $g^*$ and a statistical error proportional to $r_{\mathrm{con}}/n$ and $r_{\mathrm{node}}/n$, respectively, that captures how well the estimator can select within the class of networks at hand. In other words, Theorem 1 ensures that the estimators (3) and (6) predict (up to the statistical error described by $r_{\mathrm{con}}/n$ and $r_{\mathrm{node}}/n$, respectively) as well as the best connection- and node-sparse network. This observation can be illustrated further: Hence, if the underlying data-generating function is a sparse network itself, the prediction errors of the estimators are essentially bounded by the statistical errors $r_{\mathrm{con}}/n$ and $r_{\mathrm{node}}/n$. In high-dimensional statistics, bounds similar to those in Theorem 1 and Corollary 1 are called oracle inequalities [Lederer et al., 2019, Lederer, 2022].
The above-stated results also identify the oracle tuning parameters $r^*_{\mathrm{con}}$ and $r^*_{\mathrm{node}}$ as optimal tuning parameters: they give the best prediction guarantees in Theorem 1. But since the oracle tuning parameters are unknown in practice, the guarantees implicitly presume a calibration scheme that satisfies $r_{\mathrm{con}} \approx r^*_{\mathrm{con}}$ in practice. A natural candidate is cross-validation, but there are no guarantees that cross-validation provides such tuning parameters. This is a limitation that our theories share with all other theories in the field.
Rather than dealing with the practical calibration of the tuning parameters, we exemplify the oracle tuning parameters in a specific setting. This analysis will illustrate the rates of convergence that we can expect from Theorem 1, and it will allow us to compare our theories with other theories in the literature. Assume that the activation functions satisfy $f^j[0_{p_j}] = 0_{p_j}$ and are 1-Lipschitz continuous with respect to the Euclidean norms on the functions' input and output spaces $\mathbb{R}^{p_j}$. A popular example is ReLU activation, but the conditions are met by many other functions as well. Also, assume that the noise vectors $u_1, \dots, u_n$ are independent and centered and have uniformly subgaussian entries [van de Geer, 2000, Display (8.2) on Page 126]. Keep the input vectors fixed and capture their normalizations by Then, we obtain the following bounds for the effective noises.
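The two conditions on the activation functions are easy to check numerically for ReLU. The following sketch draws random pairs of vectors and verifies that ReLU maps the zero vector to itself and never increases Euclidean distances. It is a sanity check on random inputs rather than a proof, and the dimension, sample size, and seed are arbitrary choices.

```python
import math
import random

def relu(v):
    return [max(0.0, t) for t in v]

def euclidean_norm(v):
    return math.sqrt(sum(t * t for t in v))

def lipschitz_ratio(a, b):
    # ||relu(a) - relu(b)||_2 / ||a - b||_2, which should not exceed 1
    num = euclidean_norm([s - t for s, t in zip(relu(a), relu(b))])
    den = euclidean_norm([s - t for s, t in zip(a, b)])
    return num / den

rng = random.Random(0)  # fixed seed for reproducibility
ratios = []
for _ in range(1000):
    a = [rng.gauss(0.0, 1.0) for _ in range(5)]
    b = [rng.gauss(0.0, 1.0) for _ in range(5)]
    ratios.append(lipschitz_ratio(a, b))

largest_ratio = max(ratios)
```

The largest observed ratio stays at or below one because ReLU is 1-Lipschitz in each coordinate, and coordinatewise contraction carries over to the Euclidean norm.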
Proposition 2 (Subgaussian Noise). There is a constant $c \in (0, \infty)$ that depends only on the subgaussian parameters of the noise such that Broadly speaking, this result combined with Theorem 1 illustrates that accurate prediction with connection- and node-sparse estimators is possible even when using very wide and deep networks. Let us analyze the factors one by one and compare them to the factors in the bounds of Taheri et al. [2021] and Neyshabur et al. [2015], which are the two most related papers. The connection-sparse case compares to the results in Taheri et al. [2021], and it compares to the results in Neyshabur et al. [2015] when setting the parameters in that paper to $p = q = 1$ (which gives a setting that is slightly more restrictive than ours) or $p = 1$; $q = \infty$ (which gives a setting that is slightly less restrictive than ours), and it compares to Golowich et al. [2018, Theorem 2]. The node-sparse case compares to Neyshabur et al. [2015] with $p = 2$; $q = \infty$ (which gives a setting that is more restrictive than ours, though). Our setup is also more general than the one in Neyshabur et al. [2015] in the sense that it allows for activation other than ReLU.
The dependence on $n$ is, as usual, $1/\sqrt{n}$ up to logarithmic factors.
In the connection-sparse case, our bounds involve Golowich et al. [2018] and Neyshabur et al. [2015] or the factor Taheri et al. [2021]. In principle, the improvements of $v_\infty$ over $\overline{v}_\infty$ and $v_2$ can be up to a factor $\sqrt{n}$ and up to a factor $\sqrt{d}$, respectively; in practice, the improvements depend on the specifics of the data. For example, on the training data of MNIST [LeCun et al., 1998] and , respectively. In the node-sparse case, our bounds involve $v_2$, which is again somewhat smaller than the factor $\overline{v}_2 := \max_{i \in \{1, \dots, n\}} ||x_i||_2$ in Neyshabur et al. [2015].
The main difference between the bounds for the connection-sparse and node-sparse estimators is their dependence on the networks' maximal width $\overline{p}$. The bound for the connection-sparse estimator (3) depends on the width $\overline{p}$ only logarithmically (through $p$), while the bound for the node-sparse estimator (6) depends on $\overline{p}$ sublinearly. The dependence in the connection-sparse case is the same as in Taheri et al. [2021], while Neyshabur et al. [2015] can avoid even that logarithmic dependence (and, therefore, allow for networks with infinite widths). The node-sparse case in Neyshabur et al. [2015] does not involve our linear dependence on the width, but this difference stems from the fact that they use a more restrictive version of the grouping (we take the maximum over each layer, while they take the maximum over each node), and our results can be readily adjusted to their notion of group sparsity. These observations indicate that node sparsity as formulated above is suitable for slim networks ($\overline{p} \ll n$) but should be strengthened or complemented with other notions of sparsity otherwise. To give a numeric example, the training data in MNIST [LeCun et al., 1998] and Fashion-MNIST [Xiao et al., 2017] comprise $n = 60\,000$ samples, which means that the width should be considerably smaller than $60\,000$ when using node sparsity alone. (Note that the input layer does not take part in $\overline{p}$, which means that $d$ could be larger.) For unconstrained estimation, one can expect a linear dependence of the error on the total number of parameters [Anthony and Bartlett, 1999]. Our bounds for the sparse estimators, in contrast, only have a $\log[p]$ dependence on the total number of parameters. This difference illustrates the virtue of regularization in general, and the virtue of sparsity in particular.
Both of our bounds have a mild $\sqrt{l}$ dependence on the depth. These dependencies align with the results in Golowich et al. [2018, Theorem 2] but considerably improve on the exponentially increasing dependencies on the depth in Neyshabur et al. [2015] and, therefore, are particularly suited to describe deep network architectures. Replacing the conditions $\max_j |||\Theta^j|||_1 \le 1$ and $\max_j |||\Theta^j|||_{2,1} \le 1$ in the definitions of the connection-sparse and node-sparse estimators by the stricter conditions $\sum_j |||\Theta^j|||_1 \le 1$ and $\sum_j |||\Theta^j|||_{2,1} \le 1$, respectively (cf. Taheri et al. [2021] and our discussion in Section II), the dependence on the depth can be improved further from $\sqrt{l}$ to $(2/l)^l \sqrt{l}$ (this only requires a simple adjustment of the last display in the proof of Proposition 4), which is exponentially decreasing in the depth.
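A short numerical comparison shows how quickly the second depth factor decays. The sketch below tabulates $\sqrt{l}$ and $(2/l)^l \sqrt{l}$ for a few depths; it looks at these two factors in isolation and ignores all constants and other terms of the bounds. The function names are illustrative.

```python
import math

def depth_factor_max_constraint(l):
    # Depth factor sqrt(l) under the max-type norm constraints
    return math.sqrt(l)

def depth_factor_sum_constraint(l):
    # Depth factor (2/l)^l * sqrt(l) under the stricter sum-type constraints
    return (2.0 / l) ** l * math.sqrt(l)

depths = [1, 2, 3, 4, 8, 16]
table = [
    (l, depth_factor_max_constraint(l), depth_factor_sum_constraint(l))
    for l in depths
]
for l, factor_max, factor_sum in table:
    print(f"l={l:2d}  sqrt(l)={factor_max:.4f}  (2/l)^l*sqrt(l)={factor_sum:.3e}")
```

Taken in isolation, the second factor is larger than $\sqrt{l}$ for $l = 1$, coincides with it for $l = 2$, and is smaller from $l = 3$ onward, after which it decreases rapidly.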
Our connection-sparse bounds have a mild $\log[m]$ dependence on the number of output nodes; the node-sparse bound involves an additional factor $\sqrt{m}$. The case of multiple outputs has not been considered in statistical prediction bounds before.
Proposition 2 also highlights another advantage of our regularization approach over theories such as Golowich et al. [2018] and Neyshabur et al. [2015] that apply to constrained estimators. The theories for constrained estimators require bounding the sparsity levels directly, but in practice, suitable values for these bounds are rarely known. In our framework, in contrast, the sparsity is controlled indirectly via tuning parameters, and Proposition 2, although not providing a complete practical calibration scheme, gives insights into how these tuning parameters should scale with $n$, $d$, $l$, and so forth.
We also note that the bounds in Theorem 1 can be generalized readily to every estimator of the form where $r_{\mathrm{gen}} \in [0, \infty)$ is a tuning parameter, $\mathcal{M}_{\mathrm{gen}}$ any nonempty subset of $\mathcal{M}$, and $|||\cdot|||$ any norm. The bound for such an estimator is then for $r_{\mathrm{gen}} \ge r^*_{\mathrm{gen}}$, where $r^*_{\mathrm{gen}}$ is as $r^*_{\mathrm{con}}$ but based on the dual norm of $|||\cdot|||$ instead of the dual norm of $|||\cdot|||_1$. For example, one could impose connection sparsity on some layers and node sparsity on others, or one could impose different regularizations altogether. We omit the details to avoid digression.
The above oracle inequalities bound the prediction error, a standard measure of accuracy in statistics. Broadly speaking, this measure captures "how well the estimator describes the data-generating process." So our comparison with Neyshabur et al. [2015] and Golowich et al. [2018] might seem questionable, because they instead bound the generalization error, a measure that is more common in machine learning and captures "how well the estimator describes new samples." But we can derive such bounds as well. For simplicity, we consider a parametric setting and subgaussian noise again. We then find the following bounds: Proposition 3 (Generalization Guarantees). Assume that the inputs $x, x_1, \dots, x_n$ are i.i.d. random vectors, that the noise vectors $u_1, \dots, u_n$ are independent and centered and have uniformly subgaussian entries, and that $r_{\mathrm{con}} = r^*_{\mathrm{con}}, r_{\mathrm{node}} = r^*_{\mathrm{node}} \to 0$ as $n \to \infty$. Consider an arbitrary positive constant $b \in (0, \infty)$.
for a constant $c \in (0, \infty)$ that depends only on $b$ and the subgaussian parameters of the noise. Similarly, if for a constant $c \in (0, \infty)$ that depends only on $b$ and the subgaussian parameters of the noise.
Hence, the generalization errors are bounded by the same terms as the prediction errors.

IV. OUTLOOK: INITIALIZATION
Our theoretical results also suggest further research on a practical problem in deep learning: weight initialization [Glorot and Bengio, 2010, He et al., 2015, Mishkin and Matas, 2015]. To highlight the connection between our work and weight initialization, we consider once more our guarantees' dependence on the depth $l$. Proposition 3, for example, comprises a sublinear dependence through the factor $\sqrt{l}$ and a logarithmic dependence through the total number of parameters $p$ inside the logarithm; we have discussed these dependencies in detail. But there is another potential source of dependence on $l$: the factor $|||(\Theta^*)^l|||_1$. Naively, one could suspect that this factor scales exponentially in $l$: the argument would be that the weight matrix of each of the $l - 1$ inner layers needs to be rescaled to fit into $\mathcal{M}_1$ or $\mathcal{M}_{2,1}$, which means that the weight matrix of the outer layer needs to be rescaled by a product of these $l - 1$ factors.
The argument is intuitive, but it is wrong: the problem with it is that the optimal weight matrices $(\Theta^*)^l$ change with the depth of the network, while the data-generating process remains unaffected by what function we use to approximate it. In other words, we cannot expect a simple relationship between $(\Theta^*)^l$ and $(\Theta^*)^{l-1}$, but we can expect the overall "scales" of the corresponding networks to be similar, that is, Hence, we can assume the factor $|||(\Theta^*)^l|||_1$ in our bounds to be approximately independent of $l$.
In any case, we can draw two conclusions: First, our bounds indeed depend on the network depth as advertised. Second, our results hint at the fact that initialization schemes should take network depths into account, and it might be favorable to use sparse initialization schemes rather than distributing weights "uniformly" across the entire network. More generally, we conclude that the connection between sparse networks and weight initializations might be an interesting topic for further research.

V. DISCUSSION
We have developed guarantees for sparse deep learning both in terms of the prediction error (Theorem 1 and Corollary 1 together with Proposition 2), a standard measure of accuracy in statistics, and in terms of the generalization error (Proposition 3), a standard measure of accuracy in machine learning. These results extend and complement existing guarantees in the literature (see Table I below).
Even though many deep-learning applications fall into the framework of classification, we have focussed on regression with least-squares loss. The reason is that the regression setting is much more challenging: since the loss is unbounded, many of the techniques regularly used in classification (like McDiarmid's inequality [McDiarmid, 1989, Lemma (3.3)]) are not applicable. In this sense, our derivations are more general, and we expect that our approach will provide very similar classification bounds in the future as well (see Appendix G for possible extensions more generally).
Evidence for the benefits of deep networks has been established in practice [LeCun et al., 2015, Schmidhuber, 2015], approximation theory [Liang and Srikant, 2016, Telgarsky, 2016, Yarotsky, 2017], and statistics [Taheri et al., 2021, Kohler et al., 2019]. Since our guarantees scale at most sublinearly in the number of layers (or even improve with increasing depth; see our comment in Section III), our paper complements these lines of research and shows that sparsity-inducing regularization is an effective approach to coping with the complexity of deep and very deep networks.
While previous theories mostly considered connection sparsity (small number of active connections between nodes), we also include node sparsity (small number of active nodes). Moreover, as discussed in Section III, Theorem 1 can be readily extended to any norm-based regularization. Hence, it is straightforward to adjust our results to granularities between connection and node sparsity; cf. Mao et al. [2017]. On the other hand, our techniques do not seem appropriate for "hard-coded" types of sparsity, such as 2:4 ("two-to-four") sparsity [Mishra et al., 2021].
Connection sparsity limits the number of nonzero entries in each parameter matrix, while node sparsity only limits the total number of nonzero rows. Hence, the number of columns in a parameter matrix, that is, the width of the preceding layer, is regularized only in the case of connection sparsity. Our theoretical results reflect this insight in that the bounds for the connection- and node-sparse estimators depend on the networks' width logarithmically and sublinearly, respectively. Practically speaking, our results indicate that connection sparsity is suitable to handle wide networks, but node sparsity is suitable for wide networks only when complemented by connection sparsity or other strategies.
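The distinction between counting nonzero entries and counting nonzero rows can be illustrated with a small weight matrix. The sketch below follows the description in the preceding paragraph; the function names and the example matrix are hypothetical.

```python
def nonzero_entries(matrix):
    # Number of active connections represented by the matrix
    return sum(1 for row in matrix for entry in row if entry != 0)

def nonzero_rows(matrix):
    # Number of rows with at least one nonzero entry
    return sum(1 for row in matrix if any(entry != 0 for entry in row))

# Hypothetical 3 x 4 weight matrix with a single, fully connected active row
theta = [
    [0.0, 0.0, 0.0, 0.0],
    [0.3, -1.2, 0.7, 2.0],
    [0.0, 0.0, 0.0, 0.0],
]
active_connections = nonzero_entries(theta)
active_rows = nonzero_rows(theta)
```

Here only one row is active, but that row uses every column, so the count of nonzero entries grows with the width of the preceding layer while the row count does not.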
The mild logarithmic dependence of our connection-sparse bounds on the number of output nodes illustrates that networks with many outputs can be learned in practice. Our prediction theory is the first one to consider multiple output nodes; a classification theory with a logarithmic dependence on the output nodes has been established very recently in Ledent et al. [2019].
The mathematical underpinnings of our theory are very different from those of most other papers in theoretical deep learning. The proof of the main theorem shares similarities with proofs in high-dimensional statistics, such as the concept of the effective noise [Lederer, 2022]. The treatments of the relevant empirical processes use metric entropy, chaining, and Lipschitz properties of neural networks. These concepts and tools are not standard in deep learning and, therefore, might be of more general interest (see again Appendix G for further ideas).
Our theory has three limitations: First, the bounds apply only to global optima of the optimization landscapes rather than local optima or other points in which certain algorithms might be trapped. However, there is evidence that global optimization can be feasible at least in wide and deep networks [Lederer, 2020b]. Second, the theory does not entail a practical scheme for the calibration of the tuning parameters. However, the inclusion of regularization (rather than constraints) is already a step forward, because it reveals how the tuning parameters should scale with the problem dimensions (see our Proposition 2). Third, the network architecture is limited to fully connected feedforward layers, which excludes some aspects of modern pipelines (such as convolutions, dropout, and so forth). In any case, all three limitations are open problems in the literature; in particular, the mentioned limitations are shared by most theories on the topic.
We can summarize what this paper contributes, and what it does not, as follows: From a practical perspective, it is well established that sparsity can benefit deep learning, and there are several methods to generate sparsity in practice. Thus, this paper does not provide new practical insights or methods. Instead, our paper (i) backs up these practical observations with statistical theories that are more general and closer to practice than previous theories, and it (ii) establishes refined concepts and techniques for the statistical analysis of deep learning more generally.

APPENDIX
The Appendix consists of two auxiliary results and the proofs of Theorem 1 and Propositions 1 and 2. Our approach combines techniques from high-dimensional statistics and empirical-process theory that are very different from the techniques used in most other approaches in the literature.

A. Lipschitz Property
In this section, we prove a Lipschitz property that we use in the proof of Proposition 2.
Proposition 4 (Lipschitz Property).In the framework of Sections II and III, it holds for all Θ, Γ ∈ M 1 that The Frobenius norm is defined as Proposition 4 generalizes [Taheri et al., 2021, Proposition 2] to vector-valued network outputs and to node sparsity, and it replaces their ||x|| 2 with the smaller ||x|| ∞ in the connection-sparse case.
Proof of Proposition 4. This proof generalizes and sharpens the proof of Taheri et al. [2021], and it simplifies some arguments of that proof.We define the "inner subnetworks" of a network g Θ with Θ ∈ M 2,1 as the vector-valued functions for j ∈ {1, . . ., l − 1}.Similarly, we define the "outer subnetworks" of g Θ as the real-valued functions for j ∈ {1, . . ., l − 1} and The initial network can be split into an inner and an outer network along every layer j ∈ {1, . . ., l}: We call this our splitting argument.
To exploit the splitting argument, we derive a contraction result for the inner subnetworks and a Lipschitz result for the outer subnetworks. We denote the ℓ 2 -operator norm of a matrix A, that is, the largest singular value of A, by |||A||| op . Using then the assumptions that the activation functions are 1-Lipschitz and f j [0 p j ] = 0 p j , we get for every Θ = (Θ l−1 , …, Θ 0 ) ∈ M 2,1 and x ∈ R d that and Θ ∈ M 2,1 , we can deduce from the display that This inequality is our contraction property.
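The mechanism behind a contraction bound of this kind can be checked numerically. The following is a minimal sketch, assuming ReLU activations (which are 1-Lipschitz and map zero to zero) and randomly drawn weight matrices; the widths, the seed, and the helper names `inner_network` and `operator_norm_product` are illustrative assumptions and are not taken from the paper. It verifies that the ℓ 2 -norm of a stack of hidden layers applied to x is at most the product of the layers' operator norms times the ℓ 2 -norm of x.

```python
import numpy as np

def relu(z):
    # ReLU is 1-Lipschitz and maps the zero vector to the zero vector.
    return np.maximum(z, 0.0)

def inner_network(weights, x):
    # Apply the layers in order: x -> f(W0 x) -> f(W1 f(W0 x)) -> ...
    out = x
    for W in weights:
        out = relu(W @ out)
    return out

def operator_norm_product(weights):
    # Product of the largest singular values of the layer matrices.
    return float(np.prod([np.linalg.norm(W, 2) for W in weights]))

rng = np.random.default_rng(0)
widths = [5, 8, 6, 4]  # hypothetical input dimension and hidden widths
weights = [rng.normal(size=(widths[j + 1], widths[j]))
           for j in range(len(widths) - 1)]
x = rng.normal(size=widths[0])

# Left-hand side: norm of the network output; right-hand side: the bound.
lhs = float(np.linalg.norm(inner_network(weights, x)))
rhs = operator_norm_product(weights) * float(np.linalg.norm(x))
print(lhs <= rhs)
```

If every layer has operator norm at most 1, the right-hand side reduces to the norm of x, which is the sense in which such subnetworks act as contractions.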

C. Proof of Theorem 1
In this section, we state a proof for Theorem 1. The proof is inspired by derivations in high-dimensional statistics; see, for example, [Zhuang and Lederer, 2018, Lederer, 2022] and references therein.
Proof of Theorem 1. The main idea of the proof is to contrast the estimators' objective functions evaluated at their minima with the estimators' objective functions at other points. Our first step is to derive what we call a basic inequality. By the definition of the estimator in (6), it holds for every Θ ∈ M 2,1 that where we use the shorthand Θ̂ := Θ̂ node . We then invoke the model in (1) to rewrite this inequality as Expanding the squared terms and rearranging the inequality then yields This is our basic inequality.
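The steps "compare objective values, plug in the model, expand the squares" can be illustrated on a toy problem. The sketch below is not the paper's estimator: it assumes a hypothetical one-parameter model g(θ, x) = θ·tanh(x), an absolute-value penalty, Gaussian noise, and a grid search in place of exact minimization. Under these assumptions, the resulting inequality reads: squared prediction error of the minimizer ≤ squared approximation error of θ + 2Σ u_i g(θ̂, x_i) − 2Σ u_i g(θ, x_i) + r|θ| − r|θ̂|, for every candidate θ.

```python
import math
import random

random.seed(0)

def g(theta, x):
    # Hypothetical one-parameter "network": a scaled tanh unit.
    return theta * math.tanh(x)

n = 50
theta_star = 1.5                                      # assumed true parameter
xs = [random.uniform(-2.0, 2.0) for _ in range(n)]
us = [random.gauss(0.0, 0.5) for _ in range(n)]       # noise
ys = [g(theta_star, x) + u for x, u in zip(xs, us)]   # model: y = g*(x) + u

r = 2.0                                               # tuning parameter
grid = [k / 100.0 for k in range(-300, 301)]          # candidate parameters

def objective(theta):
    # Regularized least-squares objective.
    fit = sum((y - g(theta, x)) ** 2 for x, y in zip(xs, ys))
    return fit + r * abs(theta)

theta_hat = min(grid, key=objective)                  # minimizer over the grid

def basic_inequality_gap(theta):
    # Right-hand side minus left-hand side; should be nonnegative.
    lhs = sum((g(theta_star, x) - g(theta_hat, x)) ** 2 for x in xs)
    approx = sum((g(theta_star, x) - g(theta, x)) ** 2 for x in xs)
    emp_hat = 2.0 * sum(u * g(theta_hat, x) for x, u in zip(xs, us))
    emp_theta = 2.0 * sum(u * g(theta, x) for x, u in zip(xs, us))
    return approx + emp_hat - emp_theta + r * abs(theta) - r * abs(theta_hat) - lhs

worst_gap = min(basic_inequality_gap(theta) for theta in grid)
print(worst_gap >= -1e-9)
```

The two sums involving the noise play the role of the empirical-process terms; everything else is deterministic given the inputs.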
In the remainder of the proof, we need to bound the first two terms in the last line of the basic inequality. We call these terms the empirical-process terms. Using the reformulation of the networks in (7), we can write the empirical-process term of a general parameter Γ ∈ M 2,1 according to with Γ ∈ M 2,1 . Using 1. the properties of transpositions, 2. the definition of the trace function, 3. the cyclic property of the trace function, and 4. the linearity of the trace function yields further Now, 1. denoting the column-vector that corresponds to the kth column of a matrix A by A •k , 2. using Hölder's inequality, 3. using Hölder's inequality again, and 4. again Hölder's inequality and our definitions of the elementwise ℓ ∞ - and ℓ 1 -norms, we find which implies in view of the definition of the effective noise in (8) This inequality is our bound on the empirical-process terms.
We can combine the bound on the empirical-process terms and the basic inequality to find Using then the assumption r node ≥ r * node yields Multiplying both sides by 1/n and taking the infimum over Θ ∈ M 2,1 on the right-hand side then gives Invoking the definition of the prediction error on Page 3 gives the desired result.
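The trace and Hölder steps listed above rest on generic matrix inequalities, which the following sketch checks numerically. The matrices A and B are random stand-ins (assumptions for illustration, not the paper's quantities), and grouping by columns is likewise an assumption about how the norms are arranged. The code verifies the column-wise form of trace(AᵀB), the elementwise bound by the ℓ 1 -norm of A times the ℓ ∞ -norm of B, and a group-wise bound obtained from the Cauchy-Schwarz inequality per column followed by Hölder's inequality across columns.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 6))   # stands in for a parameter matrix
B = rng.normal(size=(4, 6))   # stands in for a noise-weighted matrix

# Trace form of the inner product: trace(A^T B) = sum_k <A_{.k}, B_{.k}>.
inner = float(np.trace(A.T @ B))
columnwise = float(sum(A[:, k] @ B[:, k] for k in range(A.shape[1])))

# Elementwise Hoelder bound: |trace(A^T B)| <= ||A||_1 * ||B||_inf.
bound_l1_linf = float(np.abs(A).sum() * np.abs(B).max())

# Group-wise bound: Cauchy-Schwarz per column, then Hoelder over columns.
col_norms_A = np.linalg.norm(A, axis=0)
col_norms_B = np.linalg.norm(B, axis=0)
bound_21_2inf = float(col_norms_A.sum() * col_norms_B.max())

print(abs(inner - columnwise) < 1e-10,
      abs(inner) <= bound_l1_linf,
      abs(inner) <= bound_21_2inf)
```

In both bounds, one factor depends only on the parameter and the other only on the noise-weighted matrix, which is what allows the noise part to be absorbed into a single effective-noise quantity.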
The proof for the connection-sparse estimator is virtually the same.

D. Proof of Proposition 1
In this section, we give a short proof of Proposition 1.
Given a solution Θ con of (5), define a … has the same value in the objective function as Θ con .

E. Proof of Proposition 2
In this section, we establish a proof of Proposition 2. The key tools are the Lipschitz property of Proposition 4 and the entropy bounds of Lemma 1.
Proof of Proposition 2. The main idea is to rewrite the event under consideration in a form that is amenable to known tail bounds for suprema of empirical processes with subgaussian random variables.
The connection-sparse bound follows from where we use in turn 1. the definition of r * con in (8), 2. the union bound, 3. van de Geer [2000, Corollary 8.3] and our Proposition 4 and Lemma 1, and 4. the inequality p l ≤ p = Σ_{j=0}^{l} p j+1 p j and consolidating the factors. The key concept underlying van de Geer [2000, Corollary 8.3 on Page 128] is chaining [van der Vaart and Wellner, 1996, Page 90].
The same considerations also apply to the node-sparse case, but we get an additional factor √m from the definition of the effective noise in (8) and a factor √p from the entropy bound in Lemma 1. The differences between the bounds for the connection- and node-sparse cases in terms of v ∞ vs. v 2 stem from the different Lipschitz constants in Proposition 4.
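The union-bound step can be illustrated in a much simpler setting than the one treated by chaining. The sketch below assumes a finite collection of independent Gaussian variables (an illustrative simplification; the proof deals with suprema over continuous parameter sets) and compares a Monte Carlo estimate of the tail of their maximum with the union bound combined with the Gaussian tail inequality P(|Z| ≥ t) ≤ 2 exp(−t²/(2σ²)). All names and constants are hypothetical.

```python
import math
import random

random.seed(0)

def union_tail_bound(num_terms, sigma, t):
    # Union bound plus the Gaussian tail bound for each single term.
    return min(1.0, 2.0 * num_terms * math.exp(-t ** 2 / (2.0 * sigma ** 2)))

def empirical_max_tail(num_terms, sigma, t, trials):
    # Monte Carlo estimate of P(max_j |Z_j| >= t).
    hits = 0
    for _ in range(trials):
        largest = max(abs(random.gauss(0.0, sigma)) for _ in range(num_terms))
        if largest >= t:
            hits += 1
    return hits / trials

num_terms, sigma, t = 10, 1.0, 3.0
empirical = empirical_max_tail(num_terms, sigma, t, trials=20000)
bound = union_tail_bound(num_terms, sigma, t)
print(empirical, bound)
```

The bound grows only linearly in the number of terms, so that solving for t gives a threshold that grows like the square root of the logarithm of that number; this is the basic reason why logarithmic factors appear in tail bounds of this type.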

F. Proof of Proposition 3
Proof of Proposition 3. The proof is based on standard empirical-process theory, including contraction and symmetrization arguments.
Using basic algebra and measure theory, one can easily show that for a constant c b ∈ (0, ∞) that depends only on b. The first term in this bound is the minimal risk as stated in the proposition, and the second term can be bounded by Corollary 1 and Proposition 2. Hence, it remains to bound the third term.
In view of the law of large numbers, it is reasonable to hope for the third term to be small. But to make this precise, we have to keep in mind that the estimator itself depends on the input vectors. We, therefore, need to prepare the third term for the application of a uniform version of the law of large numbers. Using standard contraction arguments (see [Boucheron et al., 2013, Chapter 11.3], for example) and Hölder's inequality, we can bound the third term by bounding which removes the dependence on the estimator Θ con up to the leading factor. To see that we can also neglect that factor, verify (see Proposition 2 and the proof of Theorem 1) that |||( Θ con ) l ||| 1 ≤ 2|||(Θ * ) l ||| 1 with high probability as long as r * con ≥ cv ∞ nl(log[2mnp]) 3 with c large enough. Consequently, we just need to consider the quantity The last step is to bring this term into a form that is amenable to our earlier proofs. Using standard symmetrization arguments (see van der Vaart and Wellner [1996, Chapter 2.3], for example), we can bound this quantity by bounding where k 1 , …, k n are i.i.d. Rademacher random variables. But even though k 1 , …, k n are i.i.d. Rademacher random variables, we do not resort to Rademacher complexities; instead, we use that Rademacher random variables are subgaussian, so that we can then proceed similarly as in the proof of Proposition 2.
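The subgaussianity of Rademacher variables used in the last step can be checked directly through moment-generating functions. The sketch below verifies on a grid that E[exp(λk)] = cosh(λ) is at most exp(λ²/2), which is the subgaussian bound with variance proxy 1; the grid range and the function names are illustrative choices.

```python
import math

def rademacher_mgf(lam):
    # E[exp(lam * k)] for k uniform on {-1, +1}.
    return math.cosh(lam)

def subgaussian_mgf_bound(lam):
    # Moment-generating-function bound for variance proxy 1.
    return math.exp(lam ** 2 / 2.0)

lams = [i / 10.0 for i in range(-50, 51)]
# The small relative slack guards against rounding at lam = 0, where both sides equal 1.
holds = all(rademacher_mgf(lam) <= subgaussian_mgf_bound(lam) * (1 + 1e-12)
            for lam in lams)
print(holds)
```

The inequality follows from comparing the power series of cosh(λ) and exp(λ²/2) term by term, since (2k)! ≥ 2^k k! for every k.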
The node-sparse case can be treated along the same lines.

G. Extensions
Our proof approach disentangles the specifics of the objective function (proof of Theorem 1), of the network structure (proof of Proposition 4), and of the stochastic terms (proofs of Lemma 1 and Proposition 2). This feature allows one to generalize and extend the results of this paper in straightforward ways. For example, extensions to different noise distributions only need a corresponding version of Proposition 2, with everything else unchanged. One could envision, for example, using concentration inequalities for heavy-tailed distributions such as in Lederer and van de Geer [2014]. Extensions to different loss functions, to give another example, can be established by adjusting Theorem 1 accordingly. This can be done, for example, by invoking ideas from specialized literature on high-dimensional logistic regression such as Li and Lederer [2019]. We do not go into further details to avoid digression; the key message is that the flexibility of the proofs is yet another advantage of our approach.

Fig. 1. Exemplary networks produced by the connection-sparse estimator (3) and the node-sparse estimator (6).

… OR ABSENCE ( ) OF CERTAIN FEATURES IN PREVIOUS STATISTICAL THEORIES FOR SPARSE DEEP LEARNING. WE EXTEND THE HIGHDIM RESULTS TO NODE SPARSITY AND MULTIPLE OUTPUTS. MOREOVER, WE IMPROVE THE DEPENDENCE OF THE HIGHDIM BOUNDS ON THE DATA, AND WE AVOID THEIR AUXILIARY PARAMETERS.