Singular Value Perturbation and Deep Network Optimization

We develop new theoretical results on matrix perturbation to shed light on the impact of architecture on the performance of a deep network. In particular, we explain analytically what deep learning practitioners have long observed empirically: the parameters of some deep architectures (e.g., residual networks, ResNets, and Dense networks, DenseNets) are easier to optimize than others (e.g., convolutional networks, ConvNets). Building on our earlier work connecting deep networks with continuous piecewise-affine splines, we develop an exact local linear representation of a deep network layer for a family of modern deep networks that includes ConvNets at one end of a spectrum and ResNets, DenseNets, and other networks with skip connections at the other. For regression and classification tasks that optimize the squared-error loss, we show that the optimization loss surface of a modern deep network is piecewise quadratic in the parameters, with local shape governed by the singular values of a matrix that is a function of the local linear representation. We develop new perturbation results for how the singular values of matrices of this sort behave as we add a fraction of the identity and multiply by certain diagonal matrices. A direct application of our perturbation results explains analytically why a network with skip connections (such as a ResNet or DenseNet) is easier to optimize than a ConvNet: thanks to its more stable singular values and smaller condition number, the local loss surface of such a network is less erratic, less eccentric, and features local minima that are more accommodating to gradient-based optimization. Our results also shed new light on the impact of different nonlinear activation functions on a deep network's singular values, regardless of its architecture.


Introduction
Deep learning has significantly advanced our ability to address a wide range of difficult inference and approximation problems. Today's machine learning landscape is dominated by deep (neural) networks (DNs) [19], which are compositions of a large number of simple parameterized linear and nonlinear operators, known as layers. An all-too-familiar story of late is that of plugging a DN into an application as a black box, learning its parameter values using gradient descent optimization with copious training data, and then significantly improving performance over classical task-specific approaches.
Over the past decade, a menagerie of DN architectures has emerged, including Convolutional Neural Networks (ConvNets) that feature affine convolution operations [18], Residual Networks (ResNets) that extend ConvNets with skip connections that jump over some layers [12], Dense Networks (DenseNets) with several parallel skip connections [15], and beyond. A natural question for the practitioner is: Which architecture should be preferred for a given task? Approximation capability does not offer a point of differentiation, because, as their size (number of parameters) grows, these and most other DN models attain universal approximation capability [6].
Practitioners know that DNs with skip connections, such as ResNets and DenseNets, are much preferred over ConvNets, because empirically their gradient descent learning converges faster and more stably to a better solution. In other words, it is not what a DN can approximate that matters, but rather how it learns to approximate. Empirical studies [21] have indicated that this is because the so-called loss landscape of the objective function navigated by gradient descent as it optimizes the DN parameters is much smoother for ResNets and DenseNets as compared to ConvNets (see Figure 1). However, to date there has been no analytical work in this direction.
In this paper, we provide the first analytical characterization of the local properties of the DN loss landscape that enables us to quantitatively compare different DN architectures. The key is that, since the layers of all modern DNs (ConvNets, ResNets, and DenseNets included) are continuous piecewise-affine (CPA) spline mappings [2,1], the loss landscape is a continuous piecewise function of the DN parameters. In particular, for regression and classification problems where the DN approximates a function by minimizing the ℓ_2-norm squared-error, we show that the loss landscape is a continuous piecewise quadratic function of the DN parameters. The local eccentricity of this landscape and the width of each local minimum basin are governed by the singular values of a matrix that is a function of not only the DN parameters but also the DN architecture. This enables us to quantitatively compare different DN architectures in terms of their singular values.

Let us introduce some notation to elaborate on the above programme. We study state-of-the-art DNs whose layers f_k comprise a nonlinear, continuous piecewise-linear activation function φ that is applied element-wise to the output of an affine transformation. We focus on the ubiquitous class of activations φ(t) = max(ηt, t), which yields the so-called rectified linear unit (ReLU) for η = 0, leaky-ReLU for small η ≥ 0, and absolute value for η = −1 [11]. In an abuse of notation that is standard in the machine learning community, φ can also operate on vector inputs by applying the above activation function to each coordinate of the vector input separately. Focusing for this introduction on ConvNets and ResNets (we deal with DenseNets in Section 3), denote the input to layer k by z_k and the output by z_{k+1}. Then we can write

z_{k+1} = f_k(z_k) = φ(W_k z_k + b_k) + ρ z_k,   (1)

where W_k is an n × n weight matrix and b_k is a vector of offsets. Typically W_k is a (strided) convolution matrix. The choice ρ = 0 corresponds to a ConvNet, while ρ = 1 corresponds to a ResNet.
Previous work [1,2] has demonstrated that the operator f_k is a collection of continuous piecewise-affine (CPA) splines that partition the layer's input space into polytopal regions and fit a different affine transformation on each region with the constraint that the overall mapping is continuous. This means that, locally around the input z_k and the DN parameters W_k, b_k, (1) can be written as

z_{k+1} = f_k(z_k) = D_k (W_k z_k + b_k) + ρ z_k,   (2)

Fig. 1 Optimization loss landscape of two deep networks (DNs) along a 2D slice of their parameter space (from [21]). (left) Convolutional neural network (ConvNet) with no skip connections. (right) Residual network (ResNet) with skip connections. This paper develops new matrix perturbation theory tools to understand these landscapes and in particular explain why skip connections produce landscapes that are less erratic, less eccentric, and feature local minima that are more accommodating to gradient-based optimization. Used with permission.
where the diagonal matrix D_k contains 1s at the positions where the corresponding entries of W_k z_k + b_k are positive and η where they are negative.
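As a minimal numerical sketch of this local linear form (NumPy; the names W, b, z, η, ρ follow the text, while the sizes and values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
W = rng.standard_normal((n, n))
b = rng.standard_normal(n)
z = rng.standard_normal(n)
eta, rho = 0.1, 1.0   # leaky-ReLU slope; rho = 1 adds the skip connection

pre = W @ z + b
phi = np.maximum(eta * pre, pre)          # phi(t) = max(eta*t, t), coordinatewise
D = np.diag(np.where(pre > 0, 1.0, eta))  # 1 where the pre-activation is positive, eta elsewhere

layer_out = phi + rho * z                 # z_{k+1} = phi(W z + b) + rho z
linear_out = D @ (W @ z + b) + rho * z    # local linear form with diagonal D
assert np.allclose(layer_out, linear_out)
print("local linear form matches the layer output")
```

The diagonal matrix D changes only when the input crosses an activation boundary, which is what makes the layer affine within each region.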
In supervised learning, we are given a set of G labeled training data pairs {x^(g), y^(g)}, g = 1, …, G, and we tune the DN parameters W_k, b_k such that, when datum x^(g) is input to the DN, the output ŷ^(g) is close to the true label y^(g) as measured by some loss function L. In this paper, we focus on the ℓ_2-norm squared-error loss averaged over a subset of the training data (called a mini-batch)

L = (1/G) Σ_{g=1}^G ‖y^(g) − ŷ^(g)‖_2^2.   (3)

Standard practice is to use some flavor of gradient descent to iteratively reduce L by differentiating with respect to W_k, b_k. For ease of exposition, we focus our analysis first on the case of fully stochastic gradient descent that involves only a single data point per gradient step (i.e., G = 1), meaning that we iteratively minimize

L^(g) := ‖y^(g) − ŷ^(g)‖_2^2   (4)

for different choices of g. We then extend our theoretical results to arbitrary G > 1 in Section 3.6. Let z_k^(g) denote the input to the k-th layer when the DN input is x^(g). Using the CPA spline formulation (2), it is easy to show (see (46)) that, for fixed y^(g), the loss function L^(g) is continuous and piecewise-quadratic in W_k, b_k.
The squared-error loss L^(g) is ubiquitous in regression tasks, where y^(g) is real-valued, but also relevant for classification tasks, where y^(g) is discrete-valued. Indeed, recent empirical work [16] has demonstrated that state-of-the-art DNs trained for classification tasks using the squared-error loss perform as well as or better than the same networks trained using the more popular cross-entropy loss.

We can characterize the local geometric properties of this piecewise-quadratic loss surface as follows. First, without loss of generality, we simplify our notation. For the rest of the paper, we label the layer of interest by k = 0 and suppress the subscript 0 (i.e., W_0 → W); we assume that there are p subsequent layers (f_1, …, f_p) between layer 0 and the output ŷ. Optimizing the weights W of layer f with training datum x^(g) that produces layer input z^(g) requires the analysis of the DN output ŷ, due to the chain rule calculation of the gradient of L with respect to W. (As we discuss below, there is no need to analyze the optimization of b.) For further simplicity, we suppress the superscript (g) that enumerates the training data whenever possible. We thus write the DN output as

ŷ = M(ρ) D W z + B,   (5)

where

M(ρ) = D_p W_p ⋯ D_1 W_1 + ρ Id   (6)

collects the combined effect of applying W_k in subsequent layers f_k, k = 1, …, p, and B collects the combined offsets. Note that z reflects the influence of the training datum combined with the action of the layers preceding f and can be interpreted as the input to the shallower network consisting of layers f, f_1, …, f_p. This justifies using the index 0 for the layer under consideration.
Using this notation and fixing b as well as the input z, the piecewise-quadratic loss function L can be written locally as a linear term in W plus a quadratic term in W, and the latter can be written as a quadratic form featuring a matrix Q (see Lemma 8):

L = ‖Q w‖_2^2 + (terms affine in w).   (7)

Here, w denotes the columnized vector version of the matrix W.
The semi-axes of the ellipsoidal level sets of the local quadratic loss (7) are determined by the singular values of the matrix Q, which we can write as s_i · |z[j]| according to Corollary 9, with s_i = s_i(M D) the i-th singular value of the linear mapping M D and z[j] the j-th entry of the vector z. The eccentricity of the ellipsoidal level sets is governed by the condition number of Q which, therefore, factors as

κ(Q) = κ(M D) · κ(diag(z)) = κ(M D) · (max_j |z[j]| / min*_j |z[j]|).   (8)

Here, min* denotes the minimum after discarding all vanishing elements (cf. (53)). In this paper we focus on the effect of the DN architecture on the condition number of the matrix M D representing the action of the subsequent layers, which is

κ(M D) = s_1(M D) / min*_i s_i(M D).   (9)

However, the factorization in (8) also provides insights into the role played by the training datum z (see Corollary 9). Extensions of these results to batches (G > 1) are found in Section 3.6. (We note at this point that the optimization of the offset vector b has no effect on the shape of the loss function.)

For a fixed set of training data (i.e., fixed x), the difference between the loss landscapes of a ConvNet and a ResNet is determined solely by ρ in (6). Therefore, we can make a fair, quantitative comparison between the loss landscapes of the ConvNet and ResNet architectures by studying how the singular values of the linear mapping M(ρ)D evolve as we move infinitesimally from a ConvNet (ρ = 0) towards a ResNet (ρ = 1).
Addressing how the singular values and condition number of the linear mapping M(ρ)D (and hence the DN optimization landscape) change when passing from ρ = 0 towards ρ = 1 requires a nontrivial extension of matrix perturbation theory. Our focus and main contributions in this paper lie in exploring this under-explored territory. Our analysis technique is new, and so we dedicate most of the paper to its development, which is potentially of independent interest.
We briefly summarize our key theoretical findings from Sections 2-6 and our penultimate result in Theorem 22. We provide concrete bounds on the growth or decay of the singular values of any square matrix M_o = M(0) when it is perturbed to M(ρ) = M_o + ρ Id by adding a multiple of the identity ρ Id for 0 ≤ ρ ≤ 1. The bounds are in terms of the largest and smallest eigenvalues of the symmetrized matrix M_o^T + M_o. The behavior of the largest singular value under this perturbation can be controlled quite accurately, in the sense that random matrices tend to follow the upper bound we provide quite closely with high probability. For all other singular values, we require that they be non-degenerate, meaning that they have multiplicity 1 for all but finitely many ρ, a property that holds for most matrices in a generic sense (see Lemma 12). Further, we establish bounds for arbitrary singular values that do not hinge on the assumption of multiplicity 1 but that are somewhat less tight.
Our new perturbation results enable us to draw three rigorous conclusions regarding the optimization landscape of a DN. First, with regard to network architecture, we establish that the addition of skip connections improves the condition number of a DN's optimization landscape. To this end, we point out that no guarantees on the piecewise-quadratic optimization landscape's condition number are available for ConvNets; in particular, κ(M(0)) can take any value, even if the entries of M(0) are bounded. However, as we move from ConvNets (ρ = 0) towards networks with skip connections such as ResNets and DenseNets (ρ > 0), bounds on κ(M(ρ)) become available. For an appropriate random initialization of the weights and a wide network architecture (large n), we show that the condition number of a ResNet (ρ = 1) is bounded above asymptotically by just 3 (see (121), (122), and (128)). Second, for the particular case of DNs employing absolute value activation functions, we show that the relevant condition number is easier to control than for ReLU activations; that is, absolute value requires less restrictive assumptions on the entries of M in order to bound the condition number. For absolute value, we also prove that the condition number of M(1) (ResNet) is smaller than that of M(0) (ConvNet) with probability asymptotically equal to 1. Third, with regard to the influence of the training data, we prove that extreme values of the data negatively affect the performance of linear least squares and that this impact is tempered when applying mini-batch learning due to its inherent averaging.
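The claimed benefit of the identity shift can be sketched numerically. The snippet below is an illustration, not one of the paper's experiments; the scale 0.1/√n is an assumed small-weight initialization. A random M_o is badly conditioned, while M_o + Id is not:

```python
import numpy as np

def cond(A):
    """Ratio of largest to smallest singular value."""
    s = np.linalg.svd(A, compute_uv=False)
    return s[0] / s[-1]

rng = np.random.default_rng(1)
n = 200
M_o = rng.standard_normal((n, n)) * (0.1 / np.sqrt(n))  # assumed small-weight init

kappa_conv = cond(M_o)             # no skip connection: rho = 0
kappa_res = cond(M_o + np.eye(n))  # skip connection:    rho = 1
print(f"kappa(M(0)) = {kappa_conv:.1f}, kappa(M(1)) = {kappa_res:.2f}")
assert kappa_res < kappa_conv
assert kappa_res < 3   # consistent with the asymptotic bound discussed in the text
```

Here the smallest singular value of M_o is nearly zero, so κ(M(0)) is huge, whereas all singular values of M_o + Id stay close to 1.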
In the context of DN optimization, these results demonstrate that the local loss landscape of a DN with skip connections (e.g., ResNet, DenseNet) is better conditioned than that of a ConvNet and thus less erratic, less eccentric, and with local minima that are more accommodating to gradient-based optimization, particularly when the weights in W are small. This is typically the case at the initialization of the optimization [10] and often remains true throughout training [9]. In particular, our results also provide new insights into the best magnitude at which to randomly initialize the weights at the beginning of learning in order to reap the maximum benefit from a skip connection architecture.

This paper is organized as follows. Section 2 introduces our notation and provides a review of existing useful results on the perturbation of singular values. Section 3 overviews our prior work establishing that CPA DNs can be viewed as locally linear and leverages this essential property to identify the singular values relevant for understanding the shape of the quadratic loss surface used in gradient-based optimization. Section 4 develops the perturbation results needed to capture the influence on the singular values of adding a skip connection, i.e., of passing from M_o to M_o + Id. In particular, we provide bounds on the condition number of M_o + ρ Id in terms of the largest singular value of M_o and the largest and smallest eigenvalues of M_o^T + M_o. Section 5 combines the findings of Sections 3 and 4 to show that a layer with a skip connection and weights bounded by an appropriate constant will have a bounded condition number. Clearly, this applies to weights drawn from a uniform distribution. Section 6 is dedicated to a ResNet layer with random weights of bounded standard deviation. Here, we establish bounds on the condition number of the corresponding ResNet and on the probability with which they hold, explicitly in terms of the width n of the network.
Further, we provide convincing numerical evidence that the asymptotic bounds apply in practice for widths as modest as n ≥ 5. We conclude in Section 7 with a synthesis of our results and perspectives on future research directions. All proofs are provided within the main paper. Various empirical experiments sprinkled throughout the paper support our theoretical results.

Notation
Singular values. Let A be an n × n matrix, and let ‖u‖ denote the ℓ_2-norm of the vector u. All products are to be understood as matrix multiplications, even if the factors are vectors. Vectors are always column vectors.
Let u_i denote a unit eigenvector of A^T A corresponding to its eigenvalue λ_i = λ_i(A^T A), such that the vectors u_i (i = 1, …, n) form an orthonormal basis.
The singular values of A are defined as the square roots of the eigenvalues of A^T A:

s_i(A) := √(λ_i(A^T A)).

Note that the singular values are always real and non-negative, since

λ_i(A^T A) = u_i^T A^T A u_i = ‖A u_i‖^2 ≥ 0.

For reasons of definiteness, we take the eigenvalues of A^T A, and thus the singular values of A, to be ordered by size:

s_1 ≥ s_2 ≥ … ≥ s_n ≥ 0.

Singular vectors. The vectors u_i are called right singular vectors of A. The left singular vectors of A are denoted v_i and are defined as an orthonormal basis with the property that

A u_i = s_i v_i.

Such a basis of left singular vectors always exists.
SVD. Collecting the right singular vectors u i as columns into an orthogonal matrix U , the left singular vectors v i into an orthogonal matrix V , and the singular values s i into a diagonal matrix Σ in corresponding order, we obtain the Singular Value Decomposition (SVD) of A = V ΣU T .
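These definitions can be checked directly with a numerical linear algebra routine (a sketch; note that NumPy's `svd` returns the factors in the order V, Σ, U^T of our notation):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n))

V, s, Ut = np.linalg.svd(A)              # A = V @ diag(s) @ Ut, s ordered by size
lam = np.linalg.eigvalsh(A.T @ A)[::-1]  # eigenvalues of A^T A, descending

assert np.allclose(s, np.sqrt(np.maximum(lam, 0.0)))  # s_i = sqrt(lambda_i(A^T A))
assert np.allclose(V @ np.diag(s) @ Ut, A)            # the SVD reconstructs A
assert np.allclose(Ut @ Ut.T, np.eye(n))              # right singular vectors orthonormal
print("SVD identities verified")
```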
Eigenvectors. Note that A may have eigenvectors that differ from the singular vectors. In particular, eigenvectors are not always orthogonal to each other. Clearly, if A is symmetric then the eigenvectors and right singular vectors coincide, and every eigenvalue r_i of A corresponds to a singular value s_i = |r_i|. However, to avoid confusion, we will refrain from referring to both the eigenvalues and the singular values of the same matrix whenever possible.
Smooth matrix deformations. Consider a family of n × n matrices A(ρ), where the variable ρ is assumed to lie in some interval I. We say that the family A(ρ) is C^m(I) if all its entries are C^m(I), with m = 0 corresponding to continuous and m = 1 to continuously differentiable. If A is C^m(I), then so is A^T A.
We denote the matrix of derivatives of the entries of A by A'(ρ) = (d/dρ) A(ρ).

Multiplicity of singular values. Denote by S_i ⊂ I the set of ρ values for which s_i is a simple singular value:

S_i := {ρ ∈ I : s_j(ρ) ≠ s_i(ρ) for all j ≠ i}.   (13)

For later use, let

S*_i := {ρ ∈ S_i : s_i(ρ) > 0}.   (14)

For completeness, we mention an obvious fact that follows from the SVD: the singular values of A and of its transpose A^T coincide.

Singular values under additive perturbation
A classical result that will prove to be key in our study is due to Wielandt and Hoffman [14]. It uses the notion of the Frobenius norm ‖A‖_F of the n × n matrix A. Denoting the entries of A by A[i, j] and its singular values by s_i(A), it is well known that

‖A‖_F^2 = Σ_{i,j} A[i, j]^2   (15)
= Σ_i s_i(A)^2.   (16)

Proposition 1 (Wielandt and Hoffman) Let A and Ã be symmetric n × n matrices with eigenvalues ordered by size. Then

Σ_{i=1}^n (λ_i(A) − λ_i(Ã))^2 ≤ ‖A − Ã‖_F^2.   (17)

This result can easily be strengthened to read as follows.
Corollary 2 (Wielandt and Hoffman) Let A and Ã be any n × n matrices with singular values ordered by size. Then

Σ_{i=1}^n (s_i(A) − s_i(Ã))^2 ≤ ‖A − Ã‖_F^2.   (18)

Note that this result, as most others, actually also holds for rectangular matrices with the usual adjustments.
Proof We establish Corollary 2. The 2n eigenvalues of the symmetric matrix

H_A = [[0, A], [A^T, 0]]

are ±s_i(A), i = 1, …, n. Therefore, when applying Proposition 1 with A replaced by H_A and Ã replaced by H_Ã, we run through the left-hand side of (18) twice and in turn obtain twice the right-hand side.

Corollary 2 immediately implies a well-known result that will be key: the singular values of a matrix depend continuously on its entries, as we state next.
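A quick numerical check of Corollary 2 (the matrices and the perturbation size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
A = rng.standard_normal((n, n))
At = A + 0.3 * rng.standard_normal((n, n))   # perturbed matrix A-tilde

s = np.linalg.svd(A, compute_uv=False)       # ordered by size
st = np.linalg.svd(At, compute_uv=False)
lhs = np.sum((s - st) ** 2)
rhs = np.linalg.norm(A - At, 'fro') ** 2
assert lhs <= rhs
print(f"sum of squared singular value gaps {lhs:.4f} <= squared Frobenius gap {rhs:.4f}")
```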
Corollary 3 (Continuity) If a family of matrices A(ρ) is C^0(I), then all its singular values s_i(ρ) depend continuously on ρ. Consequently, all S_i in (13) and S*_i in (14) are then open.
As will become clear in the sequel, the main difficulty in establishing further results on the regularities of the singular values with elementary arguments and estimates lies in dealing with multiple zeros, i.e., with multiple singular values.
Also useful are some facts about the special role of the largest singular value.
Proposition 4 (Operator norm) Let A be an n × n matrix. Then its largest singular value is equal to its operator norm,

s_1(A) = max_{u ≠ 0} ‖A u‖ / ‖u‖,

and we have the following inequalities:

s_1(A) ≤ ‖A‖_F ≤ √n · s_1(A).

In particular, if A is rank 1, then both norms coincide. Also, being a norm, the largest singular value satisfies

s_1(A + Ã) ≤ s_1(A) + s_1(Ã)

for any matrices A and Ã.
Proof The first equality follows from the fact that s_1 is the largest semi-axis of the ellipsoid that is the image of the unit sphere under A. The first inequality follows from (16). The last inequality follows again from considering the aforementioned ellipsoid.
Another proof of the continuity of singular values is found in the well-known interlacing properties of singular values, which we adapt to our notation and summarize for the convenience of the reader (see, e.g., [5, Theorem 6.1.3-5]).
Proposition 5 (Interlacing) Let A and Ã be n × n matrices with singular values ordered by size.

- (Weyl additive perturbation) We have

s_{i+j−1}(A + Ã) ≤ s_i(A) + s_j(Ã), i + j − 1 ≤ n.   (22)

- (Cauchy interlacing by deletion) Let Ã be obtained from A by deleting m rows or m columns. Then

s_i(A) ≥ s_i(Ã) ≥ s_{i+m}(A).   (23)

From (22) applied with j = 1, we note that the singular values are in fact uniformly continuous. Indeed, we have for any i that

|s_i(A + Ã) − s_i(A)| ≤ s_1(Ã).
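The uniform continuity bound obtained from (22) with j = 1 is easy to verify numerically (illustrative random matrices):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
A = rng.standard_normal((n, n))
At = 0.5 * rng.standard_normal((n, n))   # the perturbation A-tilde

s_sum = np.linalg.svd(A + At, compute_uv=False)
s_A = np.linalg.svd(A, compute_uv=False)
s1_At = np.linalg.svd(At, compute_uv=False)[0]   # largest singular value of A-tilde
assert np.all(np.abs(s_sum - s_A) <= s1_At + 1e-12)
print("uniform continuity: |s_i(A + At) - s_i(A)| <= s_1(At) for all i")
```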

Singular values under smooth perturbation
We now turn to the study of how singular values of A(ρ) change as a function of ρ, thereby leveraging the concept of differentiability.
This can be done in several ways; the following computation is particularly simple and well known. Let N(ρ) denote a family of symmetric C^1(I) matrices. Dropping all indices and the variable ρ for ease of reading, the eigenvalues and eigenvectors satisfy

N u = λ u,  u^T u = 1.

Denoting the derivatives with respect to ρ by N', λ' and u' and assuming they exist, we obtain

N' u + N u' = λ' u + λ u',  u^T u' = 0.

We then left-multiply the first equality by u^T and use the second equality to find

u^T N' u + u^T N u' = λ'.

Finally, as N is symmetric we see that u^T N u' = (N u)^T u' = λ u^T u' = 0, leading to

λ' = u^T N' u.

In quantum mechanics, the last equation is known as the Hellmann-Feynman Theorem and was discovered independently by several authors in the 1930s. It is known, though seldom mentioned, that the formula may fail when eigenvalues coincide. For further studies in this direction, we refer the interested reader to works in the field of degenerate perturbation theory.
For a proof of the Hellmann-Feynman Theorem, it is natural to consider the function

Ψ(ρ, u, λ) := (N(ρ) u − λ u, u^T u − 1),

where u is an n-dimensional column vector. The Implicit Function Theorem applied to Ψ then yields the following result.
Theorem 6 (Hellmann-Feynman) Assume N(ρ) forms a family of symmetric C^1(I) matrices. Then, as a function of ρ, each of its eigenvalues λ_i = λ_i(ρ) is continuously differentiable wherever it is simple. Its derivative reads

λ'_i(ρ) = u_i(ρ)^T N'(ρ) u_i(ρ).

Its corresponding eigenvector u_i(ρ) can be chosen such that it becomes C^1(S_i) as well.
If in addition to the assumptions of Theorem 6 the family N is actually C m (I), then λ i (ρ) and u i (ρ) are C m (S i ).
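The Hellmann-Feynman formula can be sanity-checked by finite differences (a sketch for an assumed linear family N(ρ) = N_0 + ρ N_1, so that N' = N_1, at a point where the eigenvalues are simple):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5
N0 = rng.standard_normal((n, n)); N0 = N0 + N0.T  # symmetric base point
N1 = rng.standard_normal((n, n)); N1 = N1 + N1.T  # symmetric direction, N' = N1

def eigs(rho):
    return np.linalg.eigvalsh(N0 + rho * N1)  # real eigenvalues, ascending order

rho, h = 0.3, 1e-6
lam, U = np.linalg.eigh(N0 + rho * N1)        # columns of U are unit eigenvectors
hf = np.array([U[:, i] @ N1 @ U[:, i] for i in range(n)])  # Hellmann-Feynman
fd = (eigs(rho + h) - eigs(rho - h)) / (2 * h)             # central differences
assert np.allclose(hf, fd, atol=1e-4)
print("Hellmann-Feynman derivatives match finite differences")
```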
From Hellmann-Feynman we may conclude the following.

Corollary 7 Assume A(ρ) forms a family of C^1(I) matrices. Then each singular value s_i(ρ) is continuously differentiable wherever it is simple and positive, with derivative

s'_i(ρ) = v_i(ρ)^T A'(ρ) u_i(ρ).

Simple examples show that the above results cannot be improved without stronger assumptions (see Example 1 below and Figure 2).
Proof Apply Theorem 6 to the family N = A^T A. Note that N' = A^T A' + (A')^T A and apply the chain rule to λ_i = (s_i(ρ))^2 to find the first equality of the following formula:

s'_i = (u_i^T (A^T A' + (A')^T A) u_i) / (2 s_i) = (u_i^T A^T A' u_i) / s_i.

Indeed, u_i^T A^T A' u_i is a real-valued number and hence equal to its own transpose u_i^T (A')^T A u_i; this implies the second equality in the formula above. Finally, note that A u_i = s_i v_i, so that s'_i = v_i^T A' u_i, which completes the proof.
Example 1 Consider the families N(ρ) = diag(ρ − 4, ρ − 2) and M(ρ) = [[0, ρ − 4], [ρ − 2, 0]]. The singular values of N are |ρ − 4| and |ρ − 2|. They are not differentiable where they coincide. In particular, the matrix norm of N is not differentiable at ρ = 3. The singular values of M are s_1 = max(|ρ − 4|, |ρ − 2|) and s_2 = min(|ρ − 4|, |ρ − 2|); they are not differentiable at their respective zeros. Also, the left singular vectors v_i change sign at the zero of s_i and so are not even continuous there (see Figure 2 for a similar example).
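The derivative formula for singular values can likewise be checked by finite differences (a sketch for an assumed linear family A(ρ) = A_0 + ρ A_1, at a generic ρ where the singular values are simple and positive):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5
A0 = rng.standard_normal((n, n))
A1 = rng.standard_normal((n, n))   # direction of the perturbation, A' = A1

def svals(rho):
    return np.linalg.svd(A0 + rho * A1, compute_uv=False)  # ordered by size

rho, h = 0.2, 1e-6
V, s, Ut = np.linalg.svd(A0 + rho * A1)   # A = V diag(s) U^T
formula = np.array([V[:, i] @ A1 @ Ut[i, :] for i in range(n)])  # v_i^T A' u_i
fd = (svals(rho + h) - svals(rho - h)) / (2 * h)
assert np.allclose(formula, fd, atol=1e-4)
print("singular value derivatives match finite differences")
```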

Singular Values in Deep Learning
In this section, we establish that the squared-error loss landscape of a deep network (DN) is governed locally by the singular values of the matrix of weights of a layer and that the difference between certain network architectures can be interpreted as a perturbation of this matrix. Combining this key insight with the perturbation theory of singular values for DNs, which will be developed in the sections to follow, will enable us to characterize the dependence of the loss landscape on the underlying architecture. Doing so, we will ultimately provide analytical insight into numerous phenomena that have been observed only empirically so far, such as the fact that so-called ResNets [13] are easier to optimize than ConvNets [18].

The continuous piecewise-affine structure of deep networks
The most popular DNs contain activation nonlinearities that are continuous piecewise-affine (CPA), such as the ReLU, leaky-ReLU, absolute value, and max-pooling nonlinearities [11]. To be more precise, let f_k denote a layer of a DN. We will assume that it can be written as an affine map W_k z_k + b_k followed by the activation φ:

f_k(z_k) = φ(W_k z_k + b_k).

The entries of the matrix W_k are called weights; the constant additive term b_k is called the bias. More general settings can easily be employed; however, the above will be sufficient for our purpose. The above-mentioned, commonly used nonlinearities can be written in this form using

φ(t) = max(ηt, t),   (34)

where the max is taken coordinate by coordinate. Clearly, any activation of the form (34) is continuous and piecewise linear, rendering the corresponding layer CPA (Fig. 3). The following choices for η correspond to ReLU (η = 0) [10], leaky-ReLU (0 < η < 1) [25], and absolute value (η = −1) [4] (see also [2,1]). In summary: we will consider only nonlinearities that can be written in the form (34). Such nonlinearities are in fact linear in each hyper-octant of space and can be written as

φ(t) = D_k t,

where D_k is a diagonal matrix with entries that depend on the signs of the coordinates of the input t = W_k z_k + b_k. The diagonal entries of D_k equal either 1 or η depending on the hyper-octant.
Consequently, the input space of the layer f_k is divided into regions that depend on W_k and b_k, in which f_k is affine and can be rewritten as

f_k(z_k) = D_k (W_k z_k + b_k).

By the linear nature of z_k → W_k z_k, these regions of the input space (z_k-space) are delimited by piecewise-linear boundaries. Concatenating several layers into a multi-layer F(z_1) = f_p ∘ … ∘ f_1(z_1) then leads to a subdivision of its input space into piecewise-linear regions (see Figure 3) in which F is affine, a result due to [1,2]. In each linear region, this multi-layer can be written as

F(z_1) = D_p W_p ⋯ D_1 W_1 z_1 + B,

where all constant terms are collected in

B = Σ_{k=1}^p D_p W_p ⋯ D_{k+1} W_{k+1} D_k b_k.

A multi-layer can be considered to be an entire DN or a part of one.
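A short numerical sketch of this local linearity (leaky-ReLU layers with assumed illustrative sizes; the D_k are read off from the signs of the pre-activations during the forward pass):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, eta = 4, 3, 0.1
Ws = [rng.standard_normal((n, n)) for _ in range(p)]
bs = [rng.standard_normal(n) for _ in range(p)]
z1 = rng.standard_normal(n)

def forward(z):
    """Forward pass through the multi-layer, recording the diagonal D_k."""
    Ds = []
    for W, b in zip(Ws, bs):
        pre = W @ z + b
        Ds.append(np.diag(np.where(pre > 0, 1.0, eta)))
        z = np.maximum(eta * pre, pre)
    return z, Ds

F_out, Ds = forward(z1)

# Local affine form: slope M_o = D_p W_p ... D_1 W_1, offset B = F(z1) - M_o z1.
M_o = np.eye(n)
for W, D in zip(Ws, Ds):
    M_o = D @ W @ M_o
B = F_out - M_o @ z1

# A nearby input in the same activation region follows the same affine map.
z1b = z1 + 1e-7 * rng.standard_normal(n)
F_out_b, _ = forward(z1b)
assert np.allclose(F_out_b, M_o @ z1b + B, atol=1e-8)
print("multi-layer is affine within the activation region")
```

The tiny perturbation is assumed not to cross a region boundary; crossing one would change some D_k and hence the affine map.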

From ConvNets to ResNets and DenseNets, and learning
A DN with an architecture F(z_1) = f_p ∘ … ∘ f_1(z_1) as described above is called a Convolutional Network (ConvNet) [18] whenever the W_k are circulant-block-circulant; otherwise, it is called a multi-layer perceptron. Without loss of generality, we will focus on ConvNets and their extensions, particularly ResNets and DenseNets, in this paper. While such architectures are responsible for the recent cresting wave of deep learning successes, they are known to be intricate to employ effectively when the number of layers p is large. This has motivated research on alternative architectures. When adding the input z_k to the output φ(W_k z_k + b_k) of a convolutional layer (a skip connection), one obtains a residual layer, which can be written in each region of linearity as

f_k(z_k) = D_k (W_k z_k + b_k) + z_k.

A DN in which all (or most) layers are residual is called a Residual Network (ResNet) [13]. Densely Connected Networks (DenseNets) [15] allow the residual connection to bypass not just a single layer but multiple layers. All of the above DN architectures are universal approximators. Learning the per-layer weights of any DN architecture is performed by solving an optimization problem that involves a training data set, a differentiable loss function such as the squared-error or cross-entropy, and a weight update policy such as gradient descent. Hence, the key benefit of one architecture over another only emerges when considering the learning problem, and in particular gradient-based optimization.

Learning
We now consider the learning process for one particular layer of a DN. As we make clear below, for the comparison of different network architectures, it is sufficient to study each layer independently of the others. For ease of exposition, our analysis in Sections 3.3-3.5 also focuses on a single training data point, which corresponds to learning via fully stochastic gradient descent (G = 1 in the Introduction). We extend our analysis to the general case of mini-batch gradient descent (G > 1) in Section 3.6.
The layer being learned is f_0; however, for clarity we drop the index 0 and use the notation

f := f_0,  W := W_0,  b := b_0,  z := z_0.

The layers following f are the only ones relevant in the study of learning layer f. In fact, the previous layers transform the DN input into a feature map that, for the layer in question, can be seen as a fixed input. This part of the network will be denoted by F = f_p ∘ … ∘ f_1. In each partition region, the output takes the form

ŷ = M D W z + B.

Here, B collects all of the constants, D represents the nonlinearity of the trained layer in the current region, and M collects the action of the subsequent layers F. We use the parameter ρ to distinguish between ConvNets and ResNets, with ρ = 0 corresponding to convolutional layers and ρ = 1 to residual ones. The matrix M = M(ρ) representing the subsequent layers F becomes

M(ρ) = D_p W_p ⋯ D_1 W_1 + ρ Id.

As a shorthand, we will often write M(ρ) = M_o + ρ Id, where

M_o = D_p W_p ⋯ D_1 W_1.   (45)

When saying "let M_o be arbitrary" we mean that no restrictions on the factors in (45) are imposed except that the D_i are diagonal and the W_i are square. When choosing D_p = ⋯ = D_1 = W_p = ⋯ = W_2 = Id and W_1 arbitrary, we have M_o = W_1 arbitrary. So there is no ambiguity in our usage of the term "arbitrary".

Quadratic loss
We wish to quantify the advantage in training of a ResNet architecture over a ConvNet one. We consider a quadratic loss function such as the squared-error mentioned in the Introduction (cf. (4); we drop the superscript (g)):

L = ‖y − ŷ‖_2^2 = y^T y − 2 y^T ŷ + ŷ^T ŷ.   (46)

In the above, we used that y^T ŷ = (y^T ŷ)^T = ŷ^T y, since this is a real number.
Recalling that the DN is a continuous piecewise-affine operator and fixing all parameters except for W, we can locally write ŷ = M D W z + B for all W in some region R of F (cf. (4); z is the input to f produced by the datum x^(g)). Writing ‖ŷ‖_2^2 = ŷ^T ŷ, we find that the loss can be viewed as an affine term P(W), with P a polynomial of degree 1, plus a quadratic term

‖M D W z‖_2^2.   (47)

We stress once more that all parameters are fixed except for W. Denote by h(W) this local quadratic loss and by W_o its minimizer, and set ∆W := W − W_o. It is worthwhile noting that the region R does not change when changing the value of ρ in M = M(ρ), since D∆W z does not depend on ρ. From this, writing c = h(W_o) for short, we obtain for any ρ

L = ‖M(ρ) D ∆W z‖_2^2 + c.   (48)

This means that the loss surface in each region R of F is a part of an exact ellipsoid whose shape is determined by the second-order expression ‖M(ρ) D ∆W z‖_2^2 as a function of ∆W. The value c should not be interpreted as the minimum of the loss function, since W_o may lie outside the region R.
Before commenting further on the loss, however, let us untangle the influence of the data z (which remains unchanged during training) from the weights W in (47) (analogous for ∆W in (48)). When listing the variables W [i, j] row by row in a "flattened" vector w, the term (47) becomes a quadratic form defined by the following matrix Q.
Lemma 8 In the quadratic term (47) of the MSE, the matrices have the following dimensions: W is n × n, M D is n × n, and z is n × 1. Denote the flattened version of W by w (of size n² × 1) and the singular values of M D by s_1 ≥ . . . ≥ s_n. Then we have (50), and the singular values of Q factor as s_{i,j}(Q) = s_i(MD)·|z[j]| (51). Proof Direct computation easily verifies (50). For (51), note that the characteristic polynomial of A = Q^T Q factors accordingly, where in slight abuse of notation Id denotes the identity in the appropriate dimensions; its zeros are easily identified, establishing (51).
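The factorization of Lemma 8 can be checked numerically. The sketch below assumes, purely as an illustration, that the flattening gives Q the Kronecker structure diag(z) ⊗ (MD); singular values of a Kronecker product are the pairwise products of the factors' singular values, which yields exactly s_{i,j}(Q) = s_i(MD)·|z[j]|:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

M = rng.standard_normal((n, n))
D = np.diag(rng.choice([0.0, 1.0], size=n))   # ReLU-style diagonal (illustrative)
z = rng.standard_normal(n)

# Hypothetical flattening: Q carries the Kronecker structure diag(z) x (M D).
Q = np.kron(np.diag(z), M @ D)

s_MD = np.linalg.svd(M @ D, compute_uv=False)   # s_i(MD), descending
s_Q = np.linalg.svd(Q, compute_uv=False)        # s_{i,j}(Q), descending

# Predicted factorization: s_{i,j}(Q) = s_i(MD) * |z[j]|
predicted = np.sort(np.outer(s_MD, np.abs(z)).ravel())[::-1]
assert np.allclose(s_Q, predicted)
```

The same computation also exhibits the separation of the "data factor" κ(diag(z)) from the "layer factor" κ(MD) in the condition number of Q.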

Condition number
We return now to the shape of the loss surface of the trained layer f. We point out that the loss surface is exactly equal to a portion of an ellipsoid whose geometry is completely characterized by the spectrum of the matrix Q, due to (48) and Lemma 8. Its curvature and higher-dimensional eccentricity have a considerable influence on the numerical behavior of gradient-based optimization. In essence, the larger the eccentricity, the less accurately the gradient points towards the direction of maximal change of the loss when making a non-infinitesimal change in W.
The condition number κ of the matrix Q provides a measure of its higher-dimensional eccentricity and, therefore, of the difficulty of optimization by linear least squares. It is defined as the ratio of the largest singular value s_1 to the smallest non-zero singular value; denoting the latter by s_* for convenience, the condition number of an arbitrary matrix A is given by κ(A) = s_1(A)/s_*(A). Having identified κ(Q) as a most relevant quantity for understanding the performance of linear least squares, we can apply Lemma 8 to identify and separate the influence of data and subsequent layers on the loss landscape as follows. For an efficient notation, we recall that the singular values of diag(z) are |z[j]| and that vanishing z[j] do not contribute to κ(diag(z)) (cf. (53)).

Corollary 9 (The Data Factor) The condition number of the matrix Q of Lemma 8 factors as κ(Q) = κ(MD)·κ(diag(z)). Our main interest lies in the impact of the architecture of the p layers following f (adding a residual link or not), leaving the rest of the DN as is. Since the input data z does not depend on this architectural choice, the performance of the linear least-squares optimization we focus on here is governed by κ(M D) = κ(M(ρ)D).
Consequently, we are faced with two types of perturbation settings: First, the evolution of the loss landscape of DNs as their architecture is continuously interpolated from a ConvNet (ρ = 0) to a ResNet (ρ = 1), that is, when moving from M (0) to M (1). In this case, since the region R of F does not depend on ρ, such an analysis is meaningful for understanding the optimization of the loss via (48). Second, the effect of multiplying M by a diagonal matrix D. Before studying the perturbations in the subsequent sections, we make a remark regarding the use of mini-batches.

Learning using mini-batches
In deep learning, one typically works with mini-batches of data x^(g) (g = 1, . . . , G) that produce the inputs z^(g) to the trained layer f. The loss of a mini-batch is obtained by averaging the losses of the individual data points (cf. (3), (47) and (50)). Using the superscript (g) in a self-explanatory way, letting w be as in Lemma 8, and letting z^(g)[k] denote the k-th coordinate of the input to the trained layer produced by the g-th datum x^(g), we can rewrite the mini-batch loss in a form analogous to (47), which enables us to extend our analysis of the shape of the loss surface to mini-batches.
Lemma 10 There exist a linear polynomial P_batch and a symmetric matrix Q_batch such that the mini-batch loss takes a form analogous to (47), and the singular values of Q_batch can be bounded as in (57) and (58). Proof The linear polynomial is obviously P_batch(W) = (1/G) Σ_{g=1}^G P^(g)(W). To clarify the choice of Q_batch, define A as the average of the matrices (Q^(g))^T Q^(g). Then L_batch = P_batch(W) + w^T A w, and A is symmetric and positive semidefinite.
Notably, the matrix A inherits the diagonal block structure of the Q^(g) as given in (49). In a slight abuse of notation, we can write A = diag(A_1, . . . , A_n), where the j-th block A_j is a symmetric n × n matrix. Following the argumentation of the proof of Lemma 8, that is, exploiting the fact that the blocks do not interact, it becomes apparent that the eigenvalues of A are those of the blocks A_j. In short, we can list the eigenvalues of A as λ_{i,j}(A) = λ_i(A_j) ≥ 0, where eigenvalues are arranged in descending order within each block. Arranging these eigenvalues by block into a diagonal matrix Λ, there is an orthogonal matrix U such that A = U^T Λ U. Arranging their square roots √λ_{i,j}(A) in the same order into the diagonal matrix S, we can set Q_batch = U^T S U. Then A = Q_batch^T Q_batch and ‖Q_batch w‖₂² = w^T A w, as desired. While we could choose Q_batch to be any matrix of the form V^T S U with V orthogonal and obtain identical singular values, we chose a symmetric matrix for simplicity and convenience.
For the upper bound on these singular values, we proceed via (59), from which (57) follows easily. For the lower bound (58), let u denote the normalized eigenvector associated with the smallest eigenvalue λ_{n,j} of A_j. Using (59) and ‖(M^(g)D^(g))u‖₂ ≥ s_n(M^(g)D^(g))‖u‖₂ = s_n(M^(g)D^(g)), the bound (58) follows.
For a convenient bound on the condition number of Q_batch, we set z̄[j] = (1/G) Σ_{g=1}^G |z^(g)[j]|², which can be interpreted as the j-th mean-square feature of the batch. From Lemma 10 we can conclude that z̄[j] = 0 implies s_{i,j}(Q_batch) = 0 for all i. Since vanishing singular values are disregarded in the computation of the condition number as defined in this paper, we obtain the following result. Corollary 11 In the notation of Lemma 10, assume that there are constants C and c > 0 bounding the relevant singular values uniformly in g. Assume that there is at least one non-vanishing z̄[j], and denote by min*_j z̄[j] the minimum over all non-vanishing z̄[j]. Then (63) holds.
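The block structure exploited in the proof of Lemma 10 can be illustrated numerically. This sketch assumes, hypothetically and only for illustration, that each per-datum matrix takes the Kronecker form Q^(g) = diag(z^(g)) ⊗ (M^(g)D^(g)); then A is exactly block diagonal, and its eigenvalues are the union of the block eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(1)
n, G = 3, 5

# Average of (Q^(g))^T Q^(g) over a mini-batch of G data points.
A = np.zeros((n * n, n * n))
for g in range(G):
    B = rng.standard_normal((n, n)) @ np.diag(rng.choice([0.0, 1.0], n))  # M^(g) D^(g)
    z = rng.standard_normal(n)
    Q = np.kron(np.diag(z), B)          # hypothetical per-datum structure
    A += Q.T @ Q / G

# A is block diagonal, so its eigenvalues are the union of the block eigenvalues.
eig_A = np.sort(np.linalg.eigvalsh(A))
eig_blocks = np.sort(np.concatenate(
    [np.linalg.eigvalsh(A[j * n:(j + 1) * n, j * n:(j + 1) * n]) for j in range(n)]))
assert np.allclose(eig_A, eig_blocks)

# The singular values of Q_batch are the square roots of these eigenvalues.
s_batch = np.sqrt(np.clip(eig_A, 0.0, None))
```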

Perturbation of a DN's Loss Landscape When Moving from a ConvNet to a ResNet
In this section we exploit differentiability of the family M (ρ) in order to study the perturbation of its singular values when moving from ConvNets with ρ = 0 to ResNets with ρ = 1 (see Section 3.2).
We start with a general result that will prove most useful.

Lemma 12
Let N(ρ) be a family of symmetric n × n matrices with entries that are polynomials in ρ. Let E denote the set of ρ ∈ IR for which N(ρ) has multiple eigenvalues. Then the following dichotomy holds: either E is finite or E = IR. Clearly, this result extends easily to the multiplicity of the singular values of any family of square matrices A(ρ) by considering N(ρ) = A(ρ)^T A(ρ).
Proof Fixing ρ for a moment, the eigenvalues of N (ρ) are the zeros of the characteristic polynomial h(λ) = det(N (ρ) − λ Id). This polynomial h has multiple zeros λ if and only if its discriminant is zero. It is well known that the discriminant χ h of a polynomial h is itself a polynomial in the coefficients of h.
If the entries of N are polynomials in ρ, then the discriminant of h is, as a function of ρ, in fact a polynomial χ h (ρ). In other words, the set E is the set of zeros of the polynomial χ h (ρ). This set is finite unless the polynomial χ h (ρ) vanishes identically, in which case E becomes all of IR.
Finally, if the entries of N are real analytic, then so is χ_h(ρ), and the identity theorem implies the claim.
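Two toy 2 × 2 families (our own examples, not from the text) illustrate the dichotomy of Lemma 12 via the discriminant argument of the proof:

```python
import numpy as np

# For a symmetric 2x2 matrix [[a, b], [b, c]], the characteristic polynomial
# lambda^2 - (a + c) lambda + (a c - b^2) has discriminant (a - c)^2 + 4 b^2,
# which vanishes exactly when the two eigenvalues coincide.
def eigen_discriminant(a, b, c):
    return (a - c) ** 2 + 4 * b ** 2

rho = np.linspace(-5.0, 5.0, 1001)

# Family 1 (toy example): N(rho) = [[rho, 1], [1, -rho]].
# Discriminant 4 rho^2 + 4 > 0 for every rho, so E is empty (hence finite).
disc1 = eigen_discriminant(rho, 1.0, -rho)
assert np.all(disc1 > 0)

# Family 2 (toy example): N(rho) = rho * Id.
# Discriminant vanishes identically, so E = IR.
disc2 = eigen_discriminant(rho, 0.0, rho)
assert np.all(disc2 == 0)
```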
Combining with Lemma 3 and Theorem 6 we obtain the following.
Corollary 13 Let N be as in Lemma 12. In the real analytic case, assume that ρ_1 = 0 and ρ_2 = ρ for some fixed ρ. Assume that N(0) has no multiple eigenvalues. Then E is finite. Thus, for any i the set [0, ρ] \ S_i is finite (cf. (45)). Writing u_i = u_i(ρ) for short, the derivative of the eigenvalue λ_i(ρ) of M^T M takes, on S_i, the form (13); on S*_i (see (14)) it is expressed in terms of the angle α_i(ρ) between the i-th left and right singular vectors.
The chain rule implies (66); alternatively, (66) follows from Corollary 7. We mention that a random matrix with iid uniform or Gaussian entries almost surely has no multiple singular values (see Theorem 34).
We continue with some observations on the growth of singular values that leverage the simple form of M (ρ) (see (45)) explicitly and that apply even in the case of M o having multiple zeros.
Obviously, (70) also holds if, for every ρ, the matrix M(ρ) is known to have no multiple singular values. Even without any such knowledge, we have the following useful consequence of Weyl's additive perturbation bound.

Lemma 17
Let M_o be arbitrary (see (45)). Then (71) holds. Proof Apply (24) with A = M(ρ) = M_o + ρ Id and Ã = ρ Id to obtain (71). For comparison, note that (70) implies a corresponding bound for any i whenever it applies. Since τ ≤ 2s_1(M_o) and s_i(0) ≤ s_1(0), the latter upper bound is clearly sharper than that of (71).
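Weyl's additive perturbation bound, applied with A = M_o + ρId and Ã = ρId as in the proof, gives |s_i(ρ) − ρ| ≤ s_1(M_o) for ρ ≥ 0. A quick numerical check (with a generic random M_o, our own illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
M_o = rng.standard_normal((n, n))
s1_Mo = np.linalg.svd(M_o, compute_uv=False)[0]   # s_1(M_o) = operator norm of M_o

for rho in [0.0, 0.25, 0.5, 1.0]:
    s = np.linalg.svd(M_o + rho * np.eye(n), compute_uv=False)
    # Weyl with A = M_o + rho*Id and A~ = rho*Id (all singular values equal to rho):
    # |s_i(rho) - rho| <= s_1(M_o - 0) = s_1(M_o)
    assert np.all(np.abs(s - rho) <= s1_Mo + 1e-12)
```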
Continuing our observations, the following fact is obvious due to the simple form of M(ρ). Note that the singular value associated with the fixed vector w may change its order among the singular values, as other singular values may increase or decrease at different speeds.
Further, we have that |s_i′(ρ)| ≤ 1 on S*_i, which is obvious both from geometry and from (66). The eigenvectors of M_o, if they exist, achieve extreme growth or decay.
Similarly, we can estimate the second-to-last expression from below by λ_n(0) + ρτ + ρ². Since i is arbitrary, the claim follows.
The advantage of (71) is its simple form, while (74) is sharper (cf. (73)). As we will see, certain random matrices of interest tend to achieve the upper bound (74) quite closely with high probability. Combining this observation with Lemmata 17 and 21, we obtain the following.
Alternatively, assume that τ > −1. Then (77) holds. We list all four bounds due to their different advantages. Due to (69), the first bounds in both (76) and (77) are in general sharper than the second. Indeed, for random matrices as used in Section 6.2 we typically have τ ≈ √2·s_1(M_o) < 2s_1(M_o). The second bounds are simpler and sufficient in certain deterministic settings.
The first bound of (77) is the tightest of all four bounds for the random matrices of interest in this paper. On the other hand, since we typically have −τ ≈ √2·s_1(M_o) for such matrices, the stated sufficient condition for (76) is somewhat easier to meet than that of (77).

Influence of the Activation Nonlinearity on a Deep Network's Loss Landscape
As we put forward in Section 3, each layer of a DN can be represented locally as a product of matrices D i W i , where W i contains the weights of the layer while the diagonal D i with entries equal to η or 1 (compare (35) and (36)) captures the effect of the activation nonlinearity.
In the last section we explored how the singular values of a matrix M o change when adding a fraction of the identity, i.e., when moving from a ConvNet to a ResNet. In this section we explore the effect of the activation nonlinearities on the singular values. To this end, let D(m, n, η) denote the set of all n × n diagonal matrices with exactly m diagonal entries equal to η and n − m entries equal to 1: We will always assume that 0 ≤ m ≤ n. Note that D(0, n, η) = {Id} and D(n, n, η) = {ηId}.

Absolute value activation
Consider the case η = −1, corresponding to an activation of the type "absolute value" (see (35)).

ReLU activation
The next simple case is the ReLU activation, corresponding to η = 0 in (35). As we state next, the singular values of a linear map M decrease when the map is combined with a ReLU activation, from left or right.

Lemma 24 (ReLU Decreases Singular Values)
Let D ∈ D(m, n, 0). Then s_i(MD) ≤ s_i(M) for all i. Proof This is a simple consequence of the Cauchy interlacing property under deletion of rows or columns (see (23)).

As a consequence, we obtain for ReLU the perturbation bounds (81) and (82) below.
Proof Write M̃ = M D for short. The following equality simplifies our computation and holds in this special setting because D amounts to an orthogonal projection, enabling us to appeal to Pythagoras for the Frobenius norm; this can also be verified by direct computation. Lemma 24 then yields (81). Using Corollary 2 with A = M and Ã = M̃, we obtain (82).
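Lemma 24 is easy to verify numerically: for D ∈ D(m, n, 0), multiplying by D zeroes out m columns (or rows, on the other side), and the interlacing property guarantees that the singular values can only decrease:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 6, 2
M = rng.standard_normal((n, n))

d = np.ones(n)
d[:m] = 0.0                                  # ReLU: m diagonal entries equal to 0
D = np.diag(rng.permutation(d))

s_M = np.linalg.svd(M, compute_uv=False)
s_MD = np.linalg.svd(M @ D, compute_uv=False)
assert np.all(s_MD <= s_M + 1e-12)           # ReLU decreases singular values

# By transposition, the same holds when the nonlinearity acts on the input side:
s_DM = np.linalg.svd(D @ M, compute_uv=False)
assert np.all(s_DM <= s_M + 1e-12)
```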
Bounds on the effect of nonlinearities at the input can be obtained easily from bounds at the output using the following fact.
Proposition 26 (Input-Output via Transposition) For D diagonal, the singular values of DM coincide with those of M^T D. Proof This follows from the fact that matrix transposition does not alter singular values.

Bounding the Weights of a ResNet Bounds its Condition Number
In this section, we combine the results of the preceding two sections in order to quantify how bounds on the entries of the matrix W i translate into bounds on the condition number. As we have seen, this is of relevance in the context of optimizing DNs, especially when comparing ConvNets to ResNets (recall Section 3.2).
To this end, we treat the two cases of deterministic weights (hard bounds) and random weights (high-probability bounds) separately.

Hard bounds
We start with a few facts on how hard bounds on the entries of a matrix are inherited by the entries and singular values of perturbed versions of the matrix. We use the term "hard bounds" for the following results on weights W i bounded by a constant in order to distinguish them from "high probability bounds" for random weights W i with bounded standard deviation. Clearly, for uniform random weights we can apply both types of results.
Lemma 27 Assume that all entries of the n × n matrix W_i are bounded in absolute value by the same constant c. Then the stated bound holds, with equality attained, for example, by the matrix W_i with all entries equal to c.
Proof We proceed along the lines of the arguments of Theorem 31. To this end, we need to generalize the results of Proposition 25.
Since all entries of M_o are bounded by c due to Lemma 28, and using cn < 1, it follows that the entries of A are bounded accordingly. Using these bounds, direct computation shows that A − Ã = A − DAD contains 2m(n − m) off-diagonal terms bounded by 3c(1 − η), m diagonal terms bounded by (3c + 1)(1 − η²), a further m² − m off-diagonal terms bounded by 3c(1 − η²), and the remaining (n − m)² terms equal to zero. Using cn < 1 wherever applicable, as well as (1 − η²) ≤ 2(1 − η), we can estimate the resulting perturbation. We conclude that, for n ≥ 5, the stated bound on the differences of squared singular values holds; together with s_j ≥ 1 − cn, we obtain (100).
We note that the earlier Theorem 22 will not lead to strong results in this context. However, it will prove effective in the random setting of the next section.

Bounds with high probability for random matrices
While the results of the last section provide absolute guarantees, they are quite restrictive on the matrix M o since they cover the worst cases. Working with the average behavior of a randomly initialized matrix M o enables us to relax the restrictions from bounding the actual values of the weights to only bounding their standard deviation.
The strongest results in this context concern the operator norm and exploit cancellations in the entries of M o (see [3,Proposition 3.1], [23,24]). Most useful for this study is the well-known result due to Tracy and Widom [23,24]. Translated into our setting, it reads as follows.
Theorem 33 (Tracy-Widom) Let X_{i,j} (1 ≤ i, j < ∞) be an infinite family of iid copies of a uniform or Gaussian random variable X with zero mean and variance 1. Let K_n be a sequence of n × n matrices with entries K_n[i, j] = X_{i,j}. Let t̄_n denote the largest eigenvalue of K_n^T + K_n and t̲_n its smallest. Then, as n → ∞, both t̄_n and −t̲_n lie with high probability in an interval around √(8n). More precisely, the distributions of the normalized variables Z_n and Z̃_n (see (107)) converge to a Tracy-Widom law. The random variables Z_n and Z̃_n are equal in distribution due to the symmetry of the random variable X. However, they are not independent.
It is easy to verify that the theory of Wigner matrices and the Tracy-Widom law apply here, establishing Theorem 33. In particular, K_n^T + K_n is symmetric with iid entries above the diagonal and iid entries on the diagonal, all with finite non-zero second moment. The normalization of t̄_n leading to Z_n (see (107)) is quite sharp, meaning that the distributions of Z_n are close to their limit starting at values as low as n = 5 (see Figure 4). The Tracy-Widom distributions are well studied and documented, with a right tail that decays exponentially fast [7]. Moreover, they are well concentrated at values close to 0. Combined with the fast convergence, this implies that t̄_n is close to √(8n) with high probability even for modest n.
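The √(8n) concentration can be observed in a quick simulation; here we take K_n with iid standard Gaussian entries (our reading of the theorem's setting) and compare the largest eigenvalue of K_n^T + K_n to √(8n):

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 400, 20

ratios = []
for _ in range(trials):
    K = rng.standard_normal((n, n))
    # K + K^T is a Wigner-type symmetric matrix; its largest eigenvalue
    # concentrates near 2*sqrt(2n) = sqrt(8n).
    t_max = np.linalg.eigvalsh(K + K.T).max()
    ratios.append(t_max / np.sqrt(8 * n))

ratios = np.array(ratios)
# Already for modest n the largest eigenvalue is within a few percent of sqrt(8n).
assert np.all(np.abs(ratios - 1.0) < 0.1)
```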
Lemma 36 (High Probability Bound) In the notation of Theorem 33, let r > 1 and n ≥ 2, set σ_n = 1/(r√(8n)), and let W_1 = σ_n K_n. Denote the largest and smallest eigenvalues of W_1^T + W_1 by τ̄_n = τ̄_n(r) and τ̲_n = τ̲_n(r) (compare with (69)).
(ii) In the Gaussian case, s_1(W_1) → 1/(√2 r) almost surely. (iii) For any realization of W_1 with s_1(W_1) < 1, the stated bounds hold. (iv) For any realization of W_1 with τ̲_n > −1 we have the slightly tighter bounds (115); the condition number of W_1 + Id is then bounded as in (116). When comparing (i) and (ii), it becomes apparent that the stated sufficient condition in (iii) is more easily satisfied than that of (iv).
For a single residual layer (43) with weights W_1 and a ResNet with absolute value nonlinearities, we obtain the following.
(iii) For any realization of W_1 with τ̲ > −1 the following bound holds. Proof Note that M_o = D_1 W_1 is a random matrix with the same properties as W_1 itself, since D_1 is deterministic and merely changes the sign of some of the entries of W_1. We may, therefore, apply Lemma 36 to M_o (replacing W_1 by M_o).

Probabilities of exceptional events
- Clarification regarding τ̄ vs. τ̄_n in the setting of Corollary 37: While we have s_1(M_o) = s_1(W_1), one should note that τ̄ (the largest eigenvalue of M_o^T + M_o) may differ from τ̄_n (the largest eigenvalue of W_1^T + W_1) for any particular realization. However, since M_o and W_1 are equal in distribution due to the special form of D_1, we have that τ̄ and τ̄_n, as well as −τ̲ and −τ̲_n, are all equal in distribution, with distribution well concentrated around 1/r.
Moreover, P[E_τ̲_n] ≈ 0 for values as small as r ≥ 2 and n ≥ 5. We support this claim through the following computations and simulations. On the one hand, P[τ̲_n > −1] = P[Z_n < (r − 1)√8 n^{2/3}] → 1 for r > 1. On the other hand, for r ≥ 2 and n ≥ 5, we have (r − 1)√8 n^{2/3} > 8, and from Figure 4 it appears that P[Z_n < 8] = 1 for n ≥ 5 for all practical purposes. This is confirmed by our simulations in the following sense: every single one of the 500,000 random matrices simulated with r = 2 satisfied τ̲_n > −1.
- Exception E_{s_1(W_1)} = {s_1(W_1) ≥ 1}: We have the tail bound (113). Moreover, P[E_{s_1(W_1)}] ≈ 0 for values as small as r ≥ 2 and n ≥ 5. According to (113) and Theorem 35, and as seen in Figure 5, Y_n converges rapidly to a Tracy-Widom law, whence P[E_{s_1(W_1)}] is small for r ≥ 2 and modest n. In the setting of Corollary 37, the probability of the event E_{s_1(M_o)} = {s_1(M_o) ≥ 1} tends to zero rapidly in a similar fashion.
The bounds (115) and (116) are surprisingly efficient. The first upper bound of (116), for example, is only about 10% larger than κ(W 1 + Id) for large n (see Figure 6).
We believe this efficiency of the bounds is rooted in the fact that the singular values of random matrices tend to be as widely spread as possible, meaning that the largest singular values grow as fast as possible while the smallest singular values grow as slowly as possible. While this maximal spreading is well known for symmetric random matrices called "Wigner matrices," we are not aware of any reports of this behavior for non-symmetric matrices such as W_1.
For design purposes, an asymptotic bound for the condition number κ(M(1)D) in terms of the network parameters may be more useful than the bounds in terms of the random realization given in Corollary 37. Theorem 38 (High Probability Bound in the Limit) With the notation and assumptions of Theorem 33 and Lemma 36, the upper bounds given there converge in distribution to constants, as stated in (121) and (122). For a demonstration of the deterministic expression bounding the condition number as in (121), see Figure 7. For r = 2, for example, the right-hand side of (121) becomes 1.8028, while the right-hand side of (122) becomes the slightly larger 1.9719.
Proof Since Z_n converges in distribution to a real-valued random variable Z, we can conclude that Z_n/n^{2/3} converges in distribution to 0. Indeed, for any ε > 0 and any n_o ≥ 1, the probability P[|Z_n/n^{2/3}| ≥ ε] for n ≥ n_o can be bounded in terms of Z and n_o; letting n_o → ∞, we conclude that P[|Z_n/n^{2/3}| ≥ ε] → 0. This establishes that Z_n/n^{2/3} → 0 in distribution. Similarly, Z̃_n/n^{2/3} → 0 in distribution.
By the continuous mapping theorem, and since r > 1, we obtain the limit of the first bound, and similarly that of the second. Simple algebra completes the proof.
Our simulations suggest that τ̄_n/s_1(W_1) actually converges in distribution to √2 as n → ∞, implying that we should see τ̄_n ≈ √2·s_1(W_1) for most matrices in real-world applications.
We turn now to the condition number of the relevant matrix (D_1 W_1 + Id)D for a ResNet with ReLU activations, represented by D_1 ∈ D(m̃, n, 0) and D ∈ D(m, n, 0). We address the case of a single residual layer following the trained layer (cf. (43) with p = 1).
To this end, we let ν = m + m̃(1 − m/n), which can be considered a measure of the extent to which these two nonlinearities affect the output in combination.
We mention further that the restriction 4 ≤ θ ≤ 2√n is solely for simplicity of presentation and can be dropped at the cost of more complicated formulae (see proof). In the given form, useful bounds on the probability are obtained even for modest n, as low as n = 7.
Clearly, the bound (128) on the condition number κ can be forced arbitrarily close to 2 by choosing r sufficiently large. In fact, bounds can be given that come arbitrarily close to 1, as becomes apparent from (149) and (143) in the proof.
Proof Let A = W_1 + Id with singular values s_i, and Ã = (D_1 W_1 + Id)D with singular values s̃_i. Clearly, r′ > r > 1.
We proceed in several steps, addressing first (129).
(i) Sure bound on the sum of perturbations. We cannot apply Proposition 25 as we did in Theorem 31, since we must also take into account the effect of the left factor D_1. Due to Corollary 2, we consider à − A = (D_1 W_1 D − W_1) + (D − Id). We find k = n(m + m̃) − m·m̃ non-zero terms in the first matrix (D_1 W_1 D − W_1). The second matrix (D − Id) is diagonal with exactly m non-zero diagonal terms, all equal to −1. Thus, the Frobenius norm ‖Ã − A‖² consists of exactly k terms of the form W_1[i, j]², which we collect into the variable V_1; several terms of the form 2W_1[i, j], which we collect into the variable V_2; and finally m terms equal to 1. Note that V_2 consists of at most m terms, depending on how many diagonal positions are non-zero in both (D_1 W_1 D − W_1) and (D − Id). By elementary calculus we have n ≤ k ≤ n²; together with m, m̃ ≤ n and m + m̃ ≥ 1, this implies that 1 ≤ ν ≤ n.
(ii) Concentration of the values of V_1, V_2 and τ̲_n. The variable V_2 is zero-mean Gaussian with variance at most 4mσ_n². To formulate where the values of V_2 concentrate, we will use a parameter θ together with a standard Gaussian tail bound. Since the W_1[i, j]/σ_n are independent standard normal variables, the variable Y = V_1/σ_n² follows a Chi-square distribution with k degrees of freedom. One of the sharpest known bounds for the tail of a Chi-square distribution is that of Laurent and Massart [17, Lemma 1, p. 1325]; a corollary of their bound yields the concentration of V_1. Choosing q = θ²/2 for symmetry with V_2, we obtain the corresponding bound. Finally, since −τ̲_n and τ̄_n are equal in distribution, we obtain a bound on the exceptional probability which is negligibly small for r ≥ 2.
In summary, using (132), (134) and (135) we find that, with probability at least 1 − 2 exp(−θ²/2) − P[E_τ̲_n(r′)], we have (115) and (136) simultaneously. (iii) Sure bound on individual perturbations. As in earlier proofs, we note that the bulk of the sum of deviations |s_i − s̃_i|² stems from some m singular values of A being changed to zero by D, compensating the constant term m on the right. More precisely, s_i² ≥ s_n² ≥ 1 + τ̲_n by (115), while s̃_i = 0 for i = n − m + 1, . . . , n. Bounding the partial sum over i = n − m + 1, . . . , n accordingly yields a bound for any 1 ≤ j ≤ n − m. (iv) Probabilistic bound on the perturbation. Combining the three bounds of (136), which hold simultaneously with the indicated probability, we obtain the desired probabilistic bound. Here, we factored out σ_n√(8n), since it simplifies to 1/r.
(v) Simplifications for convenience. There are many ways to simplify this bound; for convenience of presentation, here is one. Assuming that θ ≤ 2√n, we have θ ≤ 2√k. Using σ_n√(8n)·ν ≤ 1 and m/n ≤ 1, as well as ν = m + m̃(1 − m/n) ≥ m and 1 + θ/√2 ≤ θ, we obtain a simplified expression. Applying a standard inequality valid whenever 0 < a, b < 1, and using (115), s_j ≥ s_n and (135), we obtain a lower bound on s̃_j; replacing σ_n√(8n) by 1/r, this becomes (129). This simple form is for convenience; tighter bounds can be obtained along the same lines. This takes care of the denominator of κ̃. For the numerator, we estimate, using well-known rules of operator norms (or alternatively Lemma 24 and (71)), s̃_1 ≤ 1 + s_1(W_1). (vi) Let us now turn to (128). We may proceed as in steps (iii) through (v) with minor modifications. First, replace (135) by the corresponding bound in terms of s_1(W_1). Thus, with probability at least 1 − 2 exp(−θ²/2) − P[s_1(W_1) ≥ 1] (where we use the last expression for convenience of presentation, as it relates to the exceptional event E_{s_1(W_1)}), we obtain the sure bound similar to (iii) and from this, computing as in (iv) and (v), the bound (128), which holds with probability at least 1 − 2 exp(−θ²/2) − P[s_1(W_1) ≥ 1].
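The Laurent-Massart tail bound invoked in step (ii) of the proof, P[Y ≥ k + 2√(kq) + 2q] ≤ e^{−q} for Y Chi-square with k degrees of freedom, is easy to check by simulation:

```python
import numpy as np

rng = np.random.default_rng(5)
k, q, N = 30, 3.0, 200_000

# Monte-Carlo estimate of the tail probability of a Chi-square(k) variable.
Y = rng.chisquare(k, size=N)
threshold = k + 2 * np.sqrt(k * q) + 2 * q
empirical = np.mean(Y >= threshold)

# Laurent-Massart: P[Y >= k + 2 sqrt(k q) + 2 q] <= exp(-q)
assert empirical <= np.exp(-q)
```

In practice the empirical tail is far below the bound, which is one reason the resulting probabilistic guarantees are comfortable even for modest n.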
A similar argument can be given for uniform random entries instead of Gaussian ones, using the Central Limit Theorem and the Berry-Esseen Theorem instead of the tail behavior of Gaussian and Chi-square variables. The statements and arguments would be more involved.
Also, similarly to the hard bounds, it is possible to find bounds on the condition number for leaky-ReLU, implying again that choosing η close to 1 keeps the condition number under control. Consider now a random weight setting and assume that the standard deviations of the entries of the weight matrix W_1 are bounded on the order of 1/√n. Then the condition number κ(M(1)D) can be bounded for the absolute value activation with known probability using (116). For the ReLU activation, bounds similar to those for the absolute value are available. Again, as in the deterministic case, the restrictions on the standard deviation of the weights become more severe the more outputs are affected by the ReLU (see (129)).

Advantages of residual layers
To the best of our knowledge, similar bounds are not available for the random matrix M (0)D corresponding to a ConvNet. Rather, the relevant condition number increases with n, as we put forward in the proof of the next result (see Theorem 42).
To better appreciate the impact of this difference between ConvNets and ResNets, recall that the loss surface is piecewise quadratic, meaning that, close to the optimal values W , regions with quite different activities of the nonlinearities and thus with different D and D k may exist. Thus, lacking any bounds, the condition number κ(M (0)D) of a ConvNet may very well exhibit large fluctuations, resulting in a volatile shape for the loss landscape. For ResNets, the relevant condition number κ(M (1)D) is not only bounded via Theorem 39, but the bounds can also easily be rendered independent of the activities of the activation nonlinearities, i.e., independent of D and D 1 . Indeed, obviously we can replace ν everywhere by n and the statements still hold (cf. (149)).
Then, for any D_1 ∈ D(m̃, n, 0) and D ∈ D(m, n, 0) with 1 ≤ m + m̃, the bound (151) holds with high probability. The guarantees provided for the relevant matrix of a ResNet in terms of (151) ensure that the resulting piecewise-quadratic loss surface cannot change shape too drastically.
Influence of the training data. As mentioned earlier, we focus in our results and discussions mainly on how the weights and nonlinearities of the layers, i.e., M (ρ)D influence the loss shape, with the exception of Subsection 3.6. This makes sense, since the influence of the data is literally factored out via Corollary 9 with a contribution to the condition number that is independent of the design choice of ConvNet versus ResNet.
Using mini-batches for learning affects the loss shape in two ways. First, it leads to smaller regions as put forward in Subsection 3.6 (cf. (55)) and thus to potentially increased volatility of the loss landscape. Second, Corollary 11 implies that using mini-batches alleviates the influence of data and features z (g) (input to the trained layer f ) in the sense that the "data factor" of the condition number κ(Q) has a less severe impact, since feature values are averaged over G data points.
As mentioned above, these two kinds of influence of the data on the condition number are the same for both ConvNets and ResNets. Nevertheless, some difference prevails. For ConvNets, the "layer factor" κ(M(0)D) of the condition number κ(Q) comes without guarantees, so the smaller regions induced by mini-batches can lead to even more loss-landscape volatility. For ResNets, the uniform bound (151) carries over to mini-batches, and hence the smaller regions do not cause increased condition numbers or loss-landscape volatility. Again, no such guarantees are available for ConvNets.
Eccentricity of the loss landscape. Intuition from our singular values results says that the loss landscape of a ResNet should be less eccentric than that of a ConvNet. We now provide a theoretical result as well as a numerical validation to support this point of view.
Recalling that absolute value nonlinearities do not change singular values (recall Corollary 37), we obtain the following.
Theorem 42 (Residual Links Improve Condition Number) Let W 1 be as in Lemma 36, and let D, D 1 ∈ D(m, n, −1) for some m, corresponding to an absolute value activation nonlinearity. Consider a realization of W 1 such that (152) holds. Then, adding a residual link improves the condition number of the matrix relevant for learning κ(D 1 W 1 D) ≥ κ((D 1 W 1 + Id)D).
By Lemma 41, the condition on W 1 is satisfied with probability asymptotically equal to 1.
Proof To establish Lemma 41, it suffices to observe that s_1(M_o) converges almost surely to 1/(√2 r) < 1, together with (117), and that κ = κ(D_1 W_1) is asymptotically of the order n.
To provide a rigorous argument, note that D 1 W 1 is a Gaussian random matrix just like W 1 itself, since D 1 only changes some of the signs of W 1 's entries. Write κ = κ(D 1 W 1 ) for short.
Applying a result due to Edelman [8, Theorem 6.1] to the matrix D_1 W_1 (multiplying a matrix by a constant does not change its condition number), we find that κ/n converges in distribution to a random variable K with density (for t > 0) f_K(t) = ((2t + 4)/t³)·exp(−2/t − 2/t²).
Since D_1 W_1 has the same properties as W_1, it follows that s_1(M_o) is equal in distribution to s_1(W_1) from (113). Then, since Y_n converges in distribution, the relevant probability is asymptotically at least 1 − 2δ. With δ > 0 arbitrary, this establishes Lemma 41.
The theorem follows now from (117).
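The conclusion of Theorem 42 can be reproduced in a few lines; this sketch uses σ_n = 1/(r√(8n)) (the scaling of Lemma 36) and diagonal ±1 matrices for the absolute-value activation (the helper `cond` is our own):

```python
import numpy as np

rng = np.random.default_rng(6)
n, r = 100, 2.0
sigma_n = 1.0 / (r * np.sqrt(8 * n))         # entry standard deviation of W_1

W1 = sigma_n * rng.standard_normal((n, n))
D1 = np.diag(rng.choice([-1.0, 1.0], n))     # absolute-value activation
D = np.diag(rng.choice([-1.0, 1.0], n))

def cond(A):
    # Ratio of largest to smallest singular value.
    s = np.linalg.svd(A, compute_uv=False)
    return s[0] / s[-1]

kappa_conv = cond(D1 @ W1 @ D)                 # no skip connection (ConvNet-style)
kappa_res = cond((D1 @ W1 + np.eye(n)) @ D)    # with skip connection (ResNet-style)

assert np.linalg.svd(W1, compute_uv=False)[0] < 1   # s_1(W_1) < 1, as in the theorem
assert kappa_res < kappa_conv                       # residual link improves conditioning
```

Since the ±1 diagonals are orthogonal, κ(D_1 W_1 D) = κ(W_1), which is typically of order n, whereas κ((D_1 W_1 + Id)D) stays close to (1 + s_1(W_1))/(1 − s_1(W_1)).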
Numerical validation. To conclude the paper, we report on the results of a numerical experiment that validates our theoretical results on the singular values of deep networks and the conclusion that the loss landscape of a ResNet is more stable and less eccentric than that of a ConvNet. We constructed a toy four-layer deep network with f_k of the form (1), with the nonlinear activation function φ_k the leaky-ReLU with η = 0.1 in (35). For f_{−1}, f_0, f_2, we set ρ = 0 in (1), while for f_1 we let ρ roam as a free parameter in the range ρ ∈ [0, 1], which includes as endpoints a ConvNet layer (ρ = 0) and a ResNet layer (ρ = 1). The input x was two-dimensional, and all other inputs, outputs, and bias vectors were 6-dimensional, making the weight matrix W_2 of size 6 × 2 and all other weight matrices of size 6 × 6. We generated the entries of the input x, random weights W_k, and random biases b_k as iid zero-mean unit-variance Gaussian random variables and tasked the network with predicting the following six statistics of x by minimizing the ℓ2 norm of the prediction error (squared error): [x[1], x[2], mean(x), max(x), min(x), ‖x‖₂]^T.
For each random x, W_k, b_k and each value of ρ ∈ [0, 1], we computed the singular values of the matrix Q from (50) in Lemma 8 that determines the local shape of the piecewise-quadratic loss landscape for the 0-th layer. We conducted 8 realizations of this experiment and summarize the results in Figure 8. Figure 8 not only validates the theory we have developed above, it also enables us to draw the following concrete conclusions for gradient-based optimization of the family of DNs we have considered. Recall from (35) in Section 3.4 that the loss surface of these networks consists of pieces of (hyper)ellipsoids. Independent of the location (W) on the ellipsoid, the more eccentric the ellipsoid, the more the gradient points away from the local optimum. Figures 8(a) and (b) indicate that, as ρ increases, the singular values increase, the image of the unit ball under Q becomes wider, and thus the level sets of the local loss surface become narrower [8]. Moreover, Figures 8(c) and (d) show that the local loss surface becomes less eccentric as ρ increases (a sphere would have a ratio equal to 1). In short, the singular values of Q increase steadily with ρ, while the condition number, and hence the eccentricity of the loss landscape, decreases steadily with ρ. Like our theory, these numerical results strongly recommend ResNets over ConvNets from an optimization perspective.
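The qualitative effect can be reproduced without the full experiment. The sketch below is not the exact matrix Q of (50); it assumes only the generic leaky-ReLU local linear map A(ρ) = D W + ρ I, where D is diagonal with entries in {η, 1} encoding the activation pattern and ρ interpolates from a ConvNet layer (ρ = 0) to a ResNet layer (ρ = 1). The matrix sizes and trial count are our own illustrative choices:

```python
import numpy as np

# Illustrative sketch (not the exact Q of Lemma 8): for a leaky-ReLU layer,
# the local linear map is A(rho) = D W + rho * I, with D diagonal with
# entries in {eta, 1}. We track the median condition number of A(rho) as rho
# moves from a ConvNet layer (rho = 0) to a ResNet layer (rho = 1).

rng = np.random.default_rng(0)
n, eta, trials = 6, 0.1, 200
cond = {0.0: [], 0.5: [], 1.0: []}
for _ in range(trials):
    W = rng.standard_normal((n, n)) / np.sqrt(n)
    D = np.diag(rng.choice([eta, 1.0], size=n))
    for rho in cond:
        s = np.linalg.svd(D @ W + rho * np.eye(n), compute_uv=False)
        cond[rho].append(s[0] / s[-1])

for rho, c in cond.items():
    print(f"rho = {rho}: median condition number {np.median(c):.1f}")
```

The median condition number drops sharply as ρ grows, mirroring the steady decrease in eccentricity observed in Figure 8.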

Conclusions
In this paper, we have developed new theoretical results on the perturbation of the singular values of a class of matrices relevant to the optimization of an important family of DNs. Applied to deep learning, our results show that the transition from a ConvNet to a network with skip connections (e.g., ResNet, DenseNet) leads to an increase in the size of the singular values of a local linear representation and a decrease in the condition number. For regression and classification tasks employing the classical squared-error loss function, the implication is that the piecewise-quadratic loss surface of a ResNet or DenseNet is less erratic, less eccentric, and features local minima that are more accommodating to gradient-based optimization than that of a ConvNet. Our results are the first to provide an exact quantification of the evolution of the singular values and conditioning under a smooth transition from a ConvNet to a network with skip connections. They also confirm the empirical observations from [20] depicted in Figure 1.
Our results also imply that extreme values of the training data affect the optimization, an effect that is tempered when using mini-batches thanks to the inherent averaging that takes place. Finally, our results shed new light on the impact of different nonlinear activation functions on a DN's singular values, regardless of its architecture. In particular, absolute value activations (such as those employed in the Scattering Network [22]) do not affect the singular values governing the local loss landscape, whereas ReLU can decrease them.
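The activation effect admits a one-line linear-algebra demonstration. With absolute-value activations, the local diagonal matrix D has entries ±1, so D is orthogonal and D W has exactly the singular values of W; with ReLU, D has entries in {0, 1}, i.e., it zeroes out rows, which can only shrink each singular value (by interlacing under row deletion). A minimal sketch, with an illustrative 6 × 6 Gaussian W:

```python
import numpy as np

# Activation effect on the local singular values (sketch):
# absolute value -> D with entries +/-1 (orthogonal): spectrum preserved;
# ReLU           -> D with entries {0, 1} (row projection): spectrum shrinks.

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 6))
sv = lambda M: np.linalg.svd(M, compute_uv=False)

D_abs = np.diag(rng.choice([-1.0, 1.0], size=6))   # absolute-value pattern
D_relu = np.diag(rng.choice([0.0, 1.0], size=6))   # ReLU pattern

print(np.allclose(sv(D_abs @ W), sv(W)))           # True: singular values preserved
print(np.all(sv(D_relu @ W) <= sv(W) + 1e-12))     # True: each singular value shrinks
```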
Our derivations have covered both deterministic and random settings. In the latter, some of our results are asymptotic, in the sense that the bounds hold with probability as close to 1 as desired, provided the parameters are set accordingly. We have also provided results in the pre-asymptotic regime that enable one to assess the probability that the bounds hold.
Finally, while our application focus here has been on analyzing the loss landscape of an important family of DNs, our journey has enabled us to develop novel theoretical results that characterize the evolution of matrix singular values under a range of perturbation regimes that we hope will be of independent interest and broader use.