Abstract
We study the stability and convergence of training deep ResNets with gradient descent. Specifically, we show that the parametric branch in the residual block should be scaled down by a factor \(\tau =O(1/\sqrt{L})\) to guarantee a stable forward/backward process, where L is the number of residual blocks. Moreover, we establish a converse result that the forward process is unbounded when \(\tau >L^{-\frac{1}{2}+c}\), for any positive constant c. These two results together establish a sharp value of the scaling factor for determining the stability of deep ResNet. Based on the stability result, we further show that gradient descent finds the global minima if the ResNet is properly over-parameterized, which significantly improves over previous work by admitting a much larger range of \(\tau\) for global convergence. Moreover, we show that the convergence rate is independent of the depth, theoretically justifying the advantage of ResNet over the vanilla feedforward network. Empirically, with such a factor \(\tau\), one can train deep ResNets without normalization layers. Moreover, for ResNets with normalization layers, adding such a factor \(\tau\) also stabilizes the training and yields significant performance gains for deep ResNets.
1 Introduction
Residual Network (ResNet) has achieved great success in computer vision tasks since the seminal paper (He et al., 2016). The ResNet structure has also been extended to natural language processing and achieved the state-of-the-art performance (Vaswani et al., 2017; Devlin et al., 2018). In this paper, we study the forward/backward stability and convergence of training deep ResNet with gradient descent.
Specifically, we consider the following residual block (He et al., 2016),
where \(\phi (\cdot )\) is the point-wise activation function, \(h_l\) and \(h_{l-1}\) are the output and input of the residual block l, \(\mathcal {F}_l(\cdot )\) is the parametric branch, e.g., \(\mathcal {F}_l(h_{l-1}) = {\varvec{W}}_{l}h_{l-1}\) and \({\varvec{W}}_{l}\) is the trainable parameter, and \(\tau\) is a scaling factor on the parametric branch.
We note that standard initialization schemes, e.g., the Kaiming initialization or the Glorot initialization, are designed to keep the forward and backward variance constant when passing through one layer. However, things become different for ResNet because of the existence of the identity mapping. If \({\varvec{W}}_l\) adopts the standard initialization, a small \(\tau\) is necessary for a stable forward process of deep ResNet, because the output magnitude quickly explodes for \(\tau =1\) as L gets large. On the other side, the limit \((1+1/L)^{L}\rightarrow e\) indicates that \(\tau =O(1/L)\) is sufficient for forward/backward stability, which is assumed in previous work (Allen-Zhu et al., 2018; Du et al., 2019b). We ask
“Are there other values of \(\tau\) that can guarantee the stability of ResNet with arbitrary depth?”
We answer the above question affirmatively by establishing a non-asymptotic analysis showing that stability is guaranteed for ResNet of arbitrary depth as long as \(\tau =O(1/\sqrt{L})\). Conversely, for any positive constant c, if \(\tau = L^{-\frac{1}{2}+c}\), the network output norm grows at least at rate \(L^c\) in expectation, which implies the forward/backward process is unbounded as L gets large.
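These two regimes are easy to observe numerically. The following minimal NumPy sketch (our illustration, not part of the paper) simulates a stack of residual blocks \(h_l=\phi (h_{l-1}+\tau {\varvec{W}}_{l}h_{l-1})\) at Gaussian initialization and compares \(\tau =1\) with \(\tau =1/\sqrt{L}\):

```python
import numpy as np

def resnet_forward_norm(L, m, tau, seed=0):
    """Norm of the last residual-block output for the toy stack
    h_l = relu(h_{l-1} + tau * W_l @ h_{l-1}) with W_l ~ N(0, 1/m)."""
    rng = np.random.default_rng(seed)
    h = np.ones(m) / np.sqrt(m)  # unit-norm input to the residual stack
    for _ in range(L):
        W = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, m))
        h = np.maximum(h + tau * W @ h, 0.0)  # ReLU residual block
    return float(np.linalg.norm(h))

L, m = 200, 256
print(resnet_forward_norm(L, m, tau=1.0))           # explodes with depth
print(resnet_forward_norm(L, m, tau=1/np.sqrt(L)))  # stays of order one
```

With \(\tau =1\) the output norm blows up by many orders of magnitude already at depth 200, while \(\tau =1/\sqrt{L}\) keeps it close to the input norm.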
One step further, based on the stability result, we show that if the network is properly over-parameterized, gradient descent finds global minima for training ResNet with \(\tau \le \tilde{O}(1/\sqrt{L})\)Footnote 1. This is essentially different from previous work that assumes \(\tau \le \tilde{O}(1/L)\) (Allen-Zhu et al., 2018; Du et al., 2019a; Frei et al., 2019).
Our contribution is summarized as follows.
-
We establish a non-asymptotic analysis showing that \(\tau =1/\sqrt{L}\) is sharp in the order sense to guarantee the stability of deep ResNet.
-
For \(\tau \le \tilde{O}(1/\sqrt{L})\), we establish the convergence of gradient descent to global minima for training over-parameterized ResNet with a depth-independent rate.
The key step in proving our first claim is a new bound on the spectral norm of the forward process for ResNet with \(\tau = O(1/\sqrt{L})\). We find that, although the natural bound \((1+1/\sqrt{L})^L\) explodes, the randomness of the trainable parameters in the parametric branch helps control the growth of the output norm. Specifically, we bound the mean and the variance of the largest possible change after deep residual mappings when \(\tau =O(1/\sqrt{L})\).
We also argue the advantage of adding \(\tau\) over other stabilization methods, such as batch normalization (BN) (Ioffe & Szegedy, 2015) and Fixup (Zhang et al., 2018a). First, it has an advantage over BN in guaranteeing stability: BN is architecture-agnostic, and the output norm of ResNet with BN still grows unboundedly as the depth increases. In practice, one has to employ a learning rate warm-up stage to train very deep ResNet even with BN (He et al., 2016). In comparison, we prove that ResNet with \(\tau\) is stable over all depths and hence does not require any learning rate warm-up stage. Second, it is also more stable than the approach of scaling down the initialization that is adopted in Fixup. Scaling down the initial residual weights does not scale down the gradient properly, and Fixup could explode after gradient descent updates for deep ResNet.
At last, we corroborate the theoretical findings with extensive experiments. First, we demonstrate that with \(\tau =1/\sqrt{L}\), ResNet can be effectively trained without the normalization layers. It is more stable and achieves better performance than Fixup. Second, we demonstrate that adding \(\tau =1/\sqrt{L}\) on top of the normalization layer can obtain even better performance.
1.1 Related works
There is a large volume of literature studying ResNet. We can only give a partial list.
To argue the benefit of skip connection, (Veit et al., 2016) interpret ResNet as an ensemble of shallower networks, (Zhang et al., 2018) study the local Hessian of residual blocks, (Hardt & Ma, 2016) show that deep linear residual networks have no spurious local optima, (Orhan & Pitkow, 2018) observe that skip connection eliminates the singularity, and (Balduzzi et al., 2017) find that ResNet is more resistant to the gradient shattering problem than the feedforward network. However, these results mainly rely on empirical observation or strong model assumption.
There are also several papers studying ResNet from the stability perspective (Arpit et al., 2019; Zhang et al., 2018a, b; Yang & Schoenholz, 2017; Haber & Ruthotto, 2017). In comparison, we study the model closest to the original ResNet and provide a rigorous non-asymptotic analysis for the stability when \(\tau =O(1/\sqrt{L})\) and a converse result showing the sharpness of \(\tau\). We also demonstrate the empirical advantage of learning ResNet with \(\tau\).
Our work is also related to recent literature on the theory of learning deep neural networks with gradient descent in the over-parameterized regime. Many works (Jacot et al., 2018; Allen-Zhu et al., 2018; Du et al., 2019a; Chizat & Bach, 2018a; Zou et al., 2018; Zou & Gu, 2019; Arora et al., 2019a; Oymak & Soltanolkotabi, 2019; Chen et al., 2019; Ji & Telgarsky, 2019) use the Neural Tangent Kernel (NTK) or similar techniques to argue the global convergence of gradient descent for training over-parameterized deep neural networks. Some (Brutzkus et al., 2017; Li & Liang, 2018; Allen-Zhu et al., 2019a; Arora et al., 2019b; Cao & Gu, 2019; Neyshabur et al., 2019) study the generalization properties of over-parameterized neural networks. On the other side, there are papers (Ghorbani et al., 2019; Chizat et al., 2019; Yehudai & Shamir, 2019; Allen-Zhu & Li, 2019) discussing the limitations of the NTK approach in characterizing the behavior of neural networks. Additionally, several papers (Chizat & Bach, 2018b; Mei et al., 2018, 2019; Nguyen, 2019; Fang et al., 2019a) study the convergence of the weight distribution in the probabilistic space via gradient flow for two- or multi-layer networks. To the best of our knowledge, we are the first to provide the global convergence of learning ResNet in the regime \(\tau \le O(1/\sqrt{L})\).
2 Preliminaries
There are many residual network models since the seminal paper (He et al., 2016). Here we study a simplified ResNet that shares the same structure as He et al. (2016)Footnote 2, which is described as follows,
-
Input layer: \(h_{0}=\phi (\varvec{A}x)\), where \(x\in \mathbb {R}^{p}\) and \(\varvec{A}\in \mathbb {R}^{m\times p}\);
-
\(L-1\) residual blocks: \(h_{l}=\phi (h_{l-1}+\tau {\varvec{W}}_{l}h_{l-1})\), where \({\varvec{W}}_{l}\in \mathbb {R}^{m\times m}\);
-
A fully-connected layer: \(h_{L}=\phi ({\varvec{W}}_{L}h_{L-1})\), where \({\varvec{W}}_{L}\in \mathbb {R}^{m\times m}\);
-
Output layer: \(y={\varvec{B}}h_{L}\), where \({\varvec{B}}\in \mathbb {R}^{d\times m}\);
-
Initialization: The entries of \(\varvec{A}, {\varvec{W}}_l\) for \(l\in [L-1]\), \({\varvec{W}}_L\) and \({\varvec{B}}\) are independently sampled from \(\mathcal {N}(0,\frac{2}{m})\), \(\mathcal {N}(0,\frac{1}{m})\), \(\mathcal {N}(0,\frac{2}{m})\) and \(\mathcal {N}(0,\frac{1}{d})\), respectively;
where \(\phi (\cdot ):=\max \{0, \cdot \}\) is the ReLU activation function. We assume the input dimension is p, the intermediate layers have the same width m and the output has dimension d. For a positive integer L, we use [L] to denote the set \(\{1, 2, ..., L\}\). We denote the values before activation by \(g_{0}=\varvec{A}x,g_{l}=h_{l-1}+\tau {\varvec{W}}_{l}h_{l-1}\) for \(l\in [L-1]\) and \(g_{L}={\varvec{W}}_{L}h_{L-1}\). We use \(h_{i,l}\) and \(g_{i,l}\) to denote the value of \(h_{l}\) and \(g_{l}\), respectively, when the input vector is \(x_{i}\), and \({\varvec{D}}_{i,l}\) the diagonal activation matrix where \([{\varvec{D}}_{i,l}]_{k,k}=\varvec{1}_{\{(g_{i,l})_{k}\ge 0\}}\). We use superscript \(^{(0)}\) to denote the value at initialization, e.g., \({\varvec{W}}_l^{(0)}\), \(h_{i,l}^{(0)}\), \(g_{i,l}^{(0)}\) and \({\varvec{D}}_{i,l}^{(0)}\). We may omit the subscript \(_i\) and the superscript \(^{(0)}\) when they are clear from the context to simplify the notation.
We introduce a notation \(\overrightarrow{{\varvec{W}}}:=({\varvec{W}}_1, {\varvec{W}}_2, \dots , {\varvec{W}}_L)\) to represent all the trainable parameters. We note that \(\varvec{A}\) and \({\varvec{B}}\) are fixed after initialization. Throughout the paper, we use \(\Vert \cdot \Vert\) to denote the \(l_2\) norm of a vector. We further use \(\Vert \cdot \Vert\) and \(\Vert \cdot \Vert _F\) to denote the spectral norm and the Frobenius norm of a matrix, respectively. Denote \(\Vert \overrightarrow{{\varvec{W}}}\Vert := \max _{l\in [L]}\Vert {\varvec{W}}_l\Vert\) and \(\Vert {\varvec{W}}_{[L-1]}\Vert := \max _{l\in [L-1]}\Vert {\varvec{W}}_l\Vert\).
The training data set is \(\{(x_i, y_i^*)\}_{i=1}^n\), where \(x_i\) is the feature vector and \(y_i^*\) is the target signal for \(i=1, ..., n\). We consider the objective function \(F(\overrightarrow{{\varvec{W}}}):=\sum _{i=1}^{n}F_{i}(\overrightarrow{{\varvec{W}}})\) where \(F_{i}(\overrightarrow{{\varvec{W}}}):=\ell ({\varvec{B}}h_{i,L}, y_{i}^{*})\) and \(\ell (\cdot )\) is the loss function. The model is trained by running the gradient descent algorithm. Though ReLU is nonsmooth, we abuse the word “gradient" to denote the value computed through back-propagation.
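For concreteness, the architecture and initialization above can be sketched in a few lines of NumPy (an illustrative implementation of the model in this section with the stated initialization variances, not the authors' code):

```python
import numpy as np

def init_resnet(p, m, d, L, seed=0):
    """Initialization from Sect. 2: A ~ N(0,2/m), W_l ~ N(0,1/m) for l < L,
    W_L ~ N(0,2/m), B ~ N(0,1/d)."""
    rng = np.random.default_rng(seed)
    A = rng.normal(0, np.sqrt(2 / m), (m, p))
    Ws = [rng.normal(0, np.sqrt(1 / m), (m, m)) for _ in range(L - 1)]
    WL = rng.normal(0, np.sqrt(2 / m), (m, m))
    B = rng.normal(0, np.sqrt(1 / d), (d, m))
    return A, Ws, WL, B

def forward(x, params, tau):
    A, Ws, WL, B = params
    relu = lambda v: np.maximum(v, 0.0)
    h = relu(A @ x)           # input layer: h_0
    for W in Ws:              # L-1 residual blocks
        h = relu(h + tau * W @ h)
    h = relu(WL @ h)          # fully-connected layer: h_L
    return B @ h              # output layer: y = B h_L

p, m, d, L = 32, 512, 10, 100
x = np.ones(p) / np.sqrt(p)   # unit-norm input
y = forward(x, init_resnet(p, m, d, L), tau=1/np.sqrt(L))
print(y.shape)  # (10,)
```

Note that, as in the paper, only \(\overrightarrow{{\varvec{W}}}\) would be trained; \(\varvec{A}\) and \({\varvec{B}}\) are fixed after initialization.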
3 Forward and backward stability of ResNet
In this section, we establish the stability of training ResNet. We show that when \(\tau = O(1/\sqrt{L})\) the forward and backward passes are bounded at the initialization and after small perturbations. On the converse side, for an arbitrary positive constant c, if \(\tau >L^{-0.5+c}\), the output magnitude grows at least polynomially with the depth at the initialization. We also argue the advantage of using a factor \(\tau\) over other stabilization methods, such as BN and Fixup. The stability result forms the basis for establishing the global convergence in Sect. 4.
3.1 Forward process is bounded if \(\tau = O(1/\sqrt{L})\)
We first give a non-asymptotic bound on the forward process at initialization.
Theorem 1
Suppose that \(\overrightarrow{{\varvec{W}}}^{(0)}\), \(\varvec{A}\) are randomly generated as in the initialization step, and \({\varvec{D}}^{(0)}_{i, 0},\dots ,{\varvec{D}}^{(0)}_{i, L}\) are diagonal activation matrices for \(i\in [n]\). Suppose that c and \(\epsilon\) are arbitrary positive constants with \(0<\epsilon <1\). If \(\tau\) satisfies \(\tau ^2L \le \min \{ \frac{1}{2}\log (1+c), \frac{\log ^2(1+c)}{16(1+\log (1+2/\epsilon ))}\}\), then with probability at least \(1-3nL^2\cdot \exp \left( -m\right)\) over the initialization randomness, we have for any two integers \(a,b\in [L-1]\) with \(b>a\) and for all \(i\in [n]\),
The proof is based on Markov’s inequality with recursive conditioning. The full proof is deferred to Appendix B. Here we give an outline.
Proof Outline
We omit the subscript i and the superscript (0) for simplicity. Suppose that \(\Vert h_{a-1}\Vert =1\). Let \(g_l = h_{l-1}+\tau {\varvec{W}}_lh_{l-1}\) and \(h_l = {\varvec{D}}_l g_{l}\) for \(l\in \{a,\dots , b\}\). We have
where the inequality is due to \(\Vert {\varvec{D}}_l\Vert \le 1\). Taking the logarithm on both sides, we have
If we let \(\tilde{h}_{l-1} := \frac{h_{l-1}}{\Vert h_{l-1}\Vert }\), then we obtain that
where the inequality is because \(\log (1+x) < x\) for \(x>-1\). Let \(\xi _{l} := 2\tau \left\langle \tilde{h}_{l-1},{\varvec{W}}_{l}\tilde{h}_{l-1} \right\rangle\) and \(\zeta _{l}:= \tau ^{2}\Vert {\varvec{W}}_{l}\tilde{h}_{l-1}\Vert ^{2}\). Then given \(\tilde{h}_{l-1}\), we have \(\xi _{l}\sim \mathcal {N}\left( 0, \frac{4\tau ^2}{m}\right)\), \(\zeta _{l}\sim \frac{\tau ^{2}}{m}\chi _{m}^2\).
We can argue that \(\sum _{l=a}^{b} \xi _l \sim \mathcal {N}\left( 0, \frac{4(b-a)\tau ^2}{m}\right)\) and \(\sum _{l=a}^{b} \zeta _l \sim \frac{(b-a)\tau ^{2}}{m}\chi _{m}^2\). Hence for an arbitrary positive constant \(c_1\), if \(\tau ^2L \le c_1/4\) then \(\sum _{l=a}^{b}\log \Delta _{l}\le c_1\) with probability at least \(1- 3\exp (-\frac{mc_1^2}{64\tau ^2L})\). We then convert the condition on \(c_1\) to the condition on c in the theorem. Using an \(\varepsilon\)-net argument, we can establish the spectral norm bound for all vectors \(h_{a-1}\). Letting a and b vary from 1 to \(L-1\) and taking the union bound gives the claim. The full proof is presented in Appendix B. \(\square\)
We note that the constants c and \(\epsilon\) can be chosen arbitrarily small such that \((1+c)/(1-\epsilon )\) is arbitrarily close to 1 given a stronger assumption on \(\tau ^2L\). Theorem 1 indicates that the norm of every residual block output is upper bounded by \((1+c)/(1-\epsilon )\) if the input vector has norm 1, which demonstrates that the forward process is stable. This result is a bit surprising since for \(\tau = O(1/\sqrt{L})\) the natural bound on the spectral norm \(\Vert (\varvec{I}+\tau {\varvec{W}}^{(0)}_L)\cdots (\varvec{I}+\tau {\varvec{W}}^{(0)}_1)\Vert \le (1+\frac{1}{\sqrt{L}})^L\) explodes. Here the intuition is that the cross-product terms concentrate on the mean 0 because of the independent randomness of the matrices \({\varvec{W}}_l^{(0)}\), and the variance can be bounded at the same time.
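The concentration of the cross-product terms is easy to observe numerically. The sketch below (our illustration) applies the linear residual maps \(\varvec{I}+\tau {\varvec{W}}_l\) to a random unit vector with \(\tau = 1/\sqrt{L}\), ignoring the ReLU masks \({\varvec{D}}_l\), which can only shrink norms. The resulting norm stays of order one even though the naive spectral bound \((1+1/\sqrt{L})^L\) is enormous:

```python
import numpy as np

L, m = 400, 256
tau = 1 / np.sqrt(L)
rng = np.random.default_rng(1)
v = rng.normal(size=m)
v /= np.linalg.norm(v)                   # random unit vector
for _ in range(L):
    W = rng.normal(0, np.sqrt(1 / m), (m, m))
    v = v + tau * W @ v                  # one linear residual map (I + tau W)
print(np.linalg.norm(v))                 # stays O(1)
print((1 + 1 / np.sqrt(L)) ** L)         # naive spectral bound: huge
```

Indeed, since the cross terms \(\langle v, {\varvec{W}}v\rangle\) have zero mean, \(\mathbb {E}\Vert v\Vert ^2\) only multiplies by \(1+\tau ^2\) per layer, giving \((1+1/L)^{L}\approx e\) rather than \((1+1/\sqrt{L})^{2L}\).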
Moreover, we can also establish a lower bound on the output norm of each residual block as follows.
Theorem 2
Suppose that c is an arbitrary constant with \(0<c<1\). If \(\tau ^2L \le \frac{1}{4}\log (1-c)^{-1}\), then with probability at least \(1-2nL^2\cdot \exp \left( -\frac{1}{32}m\log (1-c)^{-1}\right)\) over the randomness of \(\varvec{A}\in \mathbb {R}^{m\times p}\) and \(\overrightarrow{{\varvec{W}}}^{(0)}\in (\mathbb {R}^{m\times m})^{L}\) the following holds
Proof
The proof is similar to that of Theorem 1 but harder. The high level idea is to control the mean and the variance of the mapping of the intermediate residual blocks simultaneously by utilizing the Markov’s inequality with the recursive conditioning. The full proof is deferred to Appendix C.1. \(\square\)
Combining these two theorems, we can conclude that the norm of each residual block output concentrates around 1 with high probability \(1-O(nL^2) \exp (-\Omega (m))\). Moreover, these two theorems also hold for \(\overrightarrow{{\varvec{W}}}\) within a neighborhood of \(\overrightarrow{{\varvec{W}}}^{(0)}\), which is presented in Appendix C.2.
3.2 Backward process is bounded for \(\tau \le O(1/\sqrt{L})\)
For ResNet, the gradient with respect to the parameter is computed through back-propagation. For any input sample i, we denote \(\partial h_{i,l} := \frac{\partial F_i(\overrightarrow{{\varvec{W}}})}{\partial h_{i,l}}\) and \(\nabla _{{\varvec{W}}_{l}}F_i(\overrightarrow{{\varvec{W}}}) := \frac{\partial F_i(\overrightarrow{{\varvec{W}}})}{\partial {\varvec{W}}_{l}} =\left( \tau {\varvec{D}}_{i,l} \partial h_{i,l}\right) \cdot h_{i,l-1}^T\). Therefore, the gradient upper bound is guaranteed if \(h_{i,l}\) and \(\partial h_{i,l}\) are bounded for all blocks. We next show that the backward process is bounded for each individual sample at the initialization stage.
Theorem 3
For every input sample \(i\in [n]\) and for any positive constants c and \(\epsilon\) with \(0<\epsilon <1\), if \(\tau\) satisfies \(\tau ^2L \le \min \{ \frac{1}{2}\log (1+c), \frac{\log ^2(1+c)}{16(1+\log (1+2/\epsilon ))}\}\), then with probability at least \(1- 3nL^2\cdot \exp \left( -\frac{1}{4}mc^2\right)\) over the randomness of \(\varvec{A},{\varvec{B}}\) and \(\overrightarrow{{\varvec{W}}}^{(0)}\), the following holds \(\forall l\in [L-1]\)
The full proof is deferred to Appendix 6. Here we give an outline.
Proof Outline
The argument is based on the back-propagation formula and Theorem 1. We omit the superscript \(^{(0)}\) for notation simplicity. For each \(i\in [n]\) and \(l\in [L-1]\), i.e., the residual layers, we have
where the last inequality is due to Theorem 1 and the spectral norm bound of \({\varvec{W}}_L\) given in Appendix A. The full proof is deferred to Appendix 6. \(\square\)
This theorem indicates that the gradient of the residual layers could be \(\tau\) times smaller than that of the usual feedforward layers. Moreover, for ResNet with \(\tau =1/\sqrt{L}\), the norms of all layer gradients are independent of the depth, which allows us to use a depth-independent learning rate to train ResNets of all depths. This is essentially different from the feedforward case (Allen-Zhu et al., 2018; Zou & Gu, 2019). We note that the gradient upper bound also holds for \(\overrightarrow{{\varvec{W}}}\) within a neighborhood of \(\overrightarrow{{\varvec{W}}}^{(0)}\) (see details in Appendix C.2 and 6).
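As a sanity check on these gradient magnitudes, one can backpropagate through the toy residual stack by hand. The sketch below (our illustration; the loss \(F=\Vert h_L\Vert ^2/2\) is a hypothetical stand-in for the paper's objective) computes \(\nabla _{{\varvec{W}}_{l}}F =\left( \tau {\varvec{D}}_{l} \partial h_{l}\right) h_{l-1}^T\) for every block and shows the \(\tau\)-scaling of the gradient norms:

```python
import numpy as np

def grad_norms(L, m, tau, seed=0):
    """Forward pass then manual backprop through a stack of L residual
    blocks, for the toy loss F = ||h_L||^2 / 2. Returns ||dF/dW_l||_F."""
    rng = np.random.default_rng(seed)
    hs, Ds, Ws = [np.ones(m) / np.sqrt(m)], [], []
    for _ in range(L):                       # forward pass, caching h and D
        W = rng.normal(0, np.sqrt(1 / m), (m, m))
        g = hs[-1] + tau * W @ hs[-1]
        Ds.append((g >= 0).astype(float))
        Ws.append(W)
        hs.append(np.maximum(g, 0.0))
    grad_h = hs[-1]                          # dF/dh_L for F = ||h_L||^2/2
    norms = []
    for l in range(L - 1, -1, -1):           # backward pass
        gh = Ds[l] * grad_h                  # through the ReLU mask D_l
        norms.append(np.linalg.norm(tau * np.outer(gh, hs[l])))  # dF/dW_l
        grad_h = gh + tau * Ws[l].T @ gh     # through skip + parametric branch
    return norms

ns = grad_norms(L=200, m=256, tau=1/np.sqrt(200))
print(max(ns))   # bounded, roughly tau times an O(1) quantity
```

With \(\tau =1/\sqrt{L}\) the per-block gradient norms are on the order of \(\tau\), consistent with the depth-independent learning rate discussed above.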
3.3 A converse result for \(\tau > L^{-\frac{1}{2}+c}\)
We have built the stability of the forward/backward process for \(\tau = O(1/\sqrt{L})\). We next establish a converse result showing that if \(\tau\) is slightly larger than \(L^{-\frac{1}{2}}\) in the order sense, the network output norm grows uncontrollably as the depth L increases. This justifies the sharpness of the value \(\tau = 1/\sqrt{L}\). Without loss of generality, we assume \(\Vert h_0\Vert =1\).
Theorem 4
Suppose that c is an arbitrary positive constant and the ResNet is defined and initialized as in Sect. 2. If \(\tau \ge L^{-\frac{1}{2}+c}\), then we have
Proof
The proof is based on a new inequality \((h_{l})_{k}\ge \phi \left( \sum _{a=1}^{l}\left( \tau {\varvec{W}}_{a}h_{a-1}\right) _{k}\right)\) for \(l\in [L-1]\) and for \(k\in [m]\). By the symmetry of Gaussian variables and the recursive conditioning, we can compute the expectation of \(\Vert h_L\Vert ^2\) exactly. The whole proof is relegated to Appendix G. \(\square\)
This indicates that \(\tau = O(1/\sqrt{L})\) is sharp to guarantee the forward stability of deep ResNet. We note that Theorems 1 and 3 hold with high probability when \(m> \Omega (\log L)\) and Theorem 2 holds with high probability when \(m>\Omega (\log (nL))\). These are very mild conditions on the width m, which are satisfied by practical networks.
3.4 Comparison with other approaches for stability
Up to now, we have provided a sharp value of \(\tau\) in terms of determining the stability of deep ResNet. In practice, two other approaches are used in residual networks to provide the stability: adding normalization layers, e.g., batch normalization (BN) (Ioffe & Szegedy, 2015), and scaling down the initial residual weights, e.g., Fixup (Zhang et al., 2018a). Next, we discuss BN and Fixup from the stability perspective, respectively, and make comparison with adding \(\tau =1/\sqrt{L}\).
Batch normalization is placed right after each convolutional layer in (He et al., 2016). Here for the ResNet model defined in Sect. 2, we put BN after each parametric branch and the residual block becomes \(h_l = \phi (h_{l-1} + \tilde{z}_l)\), where \((\tilde{z}_{i,l})_k := \text {BN}\left( (z_{i,l})_k\right) = \frac{(z_{i,l})_k- \mathbb {E}[(z_{\cdot , l})_k]}{\sqrt{\mathrm {Var}[(z_{\cdot , l})_k]}}\) and \((z_{i,l})_k := \left( {\varvec{W}}_lh_{i,l-1}\right) _k\) for \(k\in [m]\) and \(l\in [L-1]\), and the expectation and the variance are taken over samples in a mini-batch. Then we have \(\mathbb {E} (\tilde{z}_{\cdot , l})_k =0\) and \(\mathrm {Var}[(\tilde{z}_{\cdot ,l})_k] = 1\). We use the following proposition to estimate the norm of each residual block output for the ResNet with BN.
Proposition 1
Assume that \((\tilde{z}_{l})_k\) are independent random variables over l, k with \(\mathbb {E} (\tilde{z}_{l})_k =0\) and \(\mathrm {Var}[(\tilde{z}_{l})_k] = 1\). The output norm of residual block l satisfies \(\mathbb {E}\Vert h_{l}\Vert ^2 \ge \frac{1}{2}ml\), for \(l\in [L-1]\).
Proof
The proof is adapted from the proof of Theorem 4, and is presented in Appendix G. \(\square\)
This indicates that the block output norm of ResNet with BN grows roughly at the rate \(\sqrt{l}\) at the initialization stage, where l is the block index (the larger l, the closer the block is to the output). To verify this, we plot how the output norm of each residual block grows for ResNet1202 (with/without BN)Footnote 3 in Fig. 1. We see that at epoch 0 (the initialization stage), the output norm grows almost at the rate \(\sqrt{l}\), as predicted by Proposition 1. After training, the estimate in Proposition 1 is less accurate than at initialization because the independence assumption no longer holds. Besides the output norm growth, in practice, (He et al., 2016) have to use warm-up learning rates to train very deep ResNets, e.g., ResNet1202\(+\)BN. In contrast, we prove that the approach of adding \(\tau =1/\sqrt{L}\) is stable over all depths and hence does not require any learning rate warm-up stage.
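Proposition 1's growth rate can be reproduced with a toy simulation that replaces the BN output with i.i.d. \(\mathcal {N}(0,1)\) entries, exactly as the proposition assumes (our illustration, not the paper's experiment with real batch statistics):

```python
import numpy as np

m, L = 256, 400
rng = np.random.default_rng(0)
h = np.zeros(m)
sq_norms = []
for l in range(1, L + 1):
    z = rng.normal(size=m)         # stands in for the BN output: mean 0, var 1
    h = np.maximum(h + z, 0.0)     # residual block with BN: h_l = relu(h_{l-1} + z_l)
    sq_norms.append(np.linalg.norm(h) ** 2)
# squared norm grows roughly linearly in l, i.e. the norm grows like sqrt(l)
print(sq_norms[-1] / (m * L))      # roughly matches the >= 1/2 bound
```

Each coordinate follows a ReLU-reflected random walk, so its squared value grows linearly in l, and summing over m coordinates gives \(\Vert h_l\Vert ^2 \approx ml\), consistent with the \(\frac{1}{2}ml\) lower bound.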
Recently, Zhang et al. (2018a) propose Fixup to train residual networks without the normalization layer. Essentially, for each residual block, Fixup sets the weight matrix near the output to be 0 at the initialization stage, and then scales down all the other weight matrices by a factor that is determined by the network structure. However, in practice Fixup does not always converge for training very deep residual networks, as shown in Sect. 5.2. Moreover, for the ResNet model defined in Sect. 2, Fixup could be unstable after gradient updates. The residual block is given by \(h_l = \phi (h_{l-1} + {\varvec{W}}_l h_{l-1})\), and following Fixup, \({\varvec{W}}_l^{(0)}\) is initialized to be 0 for \(l \in [L-1]\). At the initial stage for input sample i, \(h_{i,l} = h_{i,0}\) and hence \(\nabla _{{\varvec{W}}_{l}}F_i = \partial h_{i,L-1}\cdot h_{i,0}^T\), the same for all \(l \in [L-1]\). Then after one gradient update the residual block mapping \(\prod _{l=1}^{L-1}{\varvec{D}}_{i,l}(\varvec{I}+\eta \cdot \nabla _{{\varvec{W}}_{l}}F_i )\) could behave like \(({\varvec{D}}(\varvec{I}+\eta \cdot \partial h_{i,L-1}\cdot h_{i,0}^T))^{L-1}\) when \({\varvec{D}}_{i,l}={\varvec{D}}\) for all l, which grows exponentially. Empirically, such explosion is observed for deep ResNet with Fixup (see Fig. 2). In contrast, the ResNet with \(\tau\) is stable for varying depths (see Fig. 3), as guaranteed by our theory.
4 Global convergence for over-parameterized ResNet
In this section, we establish that gradient descent converges to global minima for learning an over-parameterized ResNet with \(\tau \le \tilde{O}(1/\sqrt{L})\). Compared to the recent work (Allen-Zhu et al., 2018), our result significantly enlarges the region of \(\tau\) that admits the global convergence of gradient descent. Moreover, our result also theoretically justifies the advantage of ResNet over vanilla feedforward network in terms of facilitating the convergence of gradient descent. Before stating the theorem, we introduce common assumptions on the training data and the loss function (Allen-Zhu et al., 2018; Zou & Gu, 2019; Oymak & Soltanolkotabi, 2019).
Assumption 1
(training data) For any \(x_i\), it holds that \(\Vert x_i\Vert =1\) and \((x_i)_p = 1/\sqrt{2}\). There exists \(\delta >0\), such that \(\forall i,j \in [n], i\ne j, \Vert x_{i}-x_{j}\Vert \ge \delta\).
The loss function \(\ell (\cdot , \cdot )\) is quadratic and the individual objective is \(F_{i}(\overrightarrow{{\varvec{W}}}):=\frac{1}{2}\Vert {\varvec{B}}h_{i,L}-y_{i}^{*}\Vert ^{2}\). We note that the assumption \((x_i)_p=1/\sqrt{2}\) means that the last coordinate of every \(x_i\) is \(1/\sqrt{2}\). This gives a random bias term after the first layer \({\varvec{A}}(\cdot )\), which simplifies the proof of Lemma 6 for the gradient lower bound. This assumption is made for proof convenience rather than being a condition that must be satisfied in practice.
Theorem 5
Suppose that the ResNet is defined and initialized as in Sect. 2 with \(\tau \le O(1/(\sqrt{L}\log m))\) and the training data satisfy Assumption 1. If the network width \(m\ge \Omega (n^{8} L^7\delta ^{-4}d\log ^2 m)\), then with probability at least \(1-\exp (-\Omega (\log ^{2}m))\), gradient descent with learning rate \(\eta =\Theta (\frac{d}{nm})\) finds a point with \(F(\overrightarrow{{\varvec{W}}})\le \varepsilon\) in \(T=\Omega (n^2\delta ^{-1}\log \frac{n\log ^2 m}{\varepsilon })\) iterations.
Proof
The full proof is deferred to Appendix F. \(\square\)
This theorem establishes the linear convergence of gradient descent for learning ResNet in the range \(\tau \le O(1/(\sqrt{L}\log m))\). Combined with the unstable case of \(\tau > 1/\sqrt{L}\) in Sect. 3.3, we give a nearly full characterization of the convergence in terms of the range of \(\tau\). Moreover, our result indicates that the learning rate and the total number of iterations are depth-independent. We note that a recent paper Frei et al. (2019) also achieves a depth-independent rate but only for the case \(\tau \le O(1/(L\log m))\), and its proof critically relies on the choice of \(\tau =1/L\). The overparameterization dependence and the number of iterations are not directly comparable, as we study the regression problem while Frei et al. (2019) study the classification problem with a different data assumption. Other previous results (Allen-Zhu et al., 2018; Du et al., 2019a) characterize the convergence guarantee only for the case \(\tau \le O(1/(L\log m))\), and their total number of iterations scales with the order \(L^2\). Our depth-independent results are achieved by tighter smoothness and gradient upper bounds.
In the analysis of the feedforward case (Allen-Zhu et al., 2018; Zou & Gu, 2019), the learning rate has to scale with \(1/L^2\) and the total number of iterations scales with \(L^2\) for the convergence of learning the feedforward network. Therefore, our result theoretically justifies the advantage of ResNet over the vanilla feedforward network in terms of facilitating the convergence of gradient descent.
Finally, we add a remark on the width requirement in Theorem 5. The width grows polynomially with the number of training examples. This dependence arises because we need more neurons to sufficiently distinguish the data points as the number of examples grows, which is common for the regression task (Allen-Zhu et al., 2019b; Zou & Gu, 2019). This dependence could be avoided by assuming that the training data follow specific distributions for the classification task (Cao & Gu, 2020). However, this is orthogonal to our main claim that ResNet converges with a depth-independent rate.
5 Empirical study
In this section, we present experiments to verify our theory and show the practical value of ResNet with \(\tau\). We first compare the performance of ResNet with different \(\tau\)’s and demonstrate that \(\tau =\frac{1}{\sqrt{L}}\) is a sharp value in determining the trainability of deep ResNet. We then compare the performance of adding the factor \(\tau\) and using the Fixup initialization when training the popular residual networks without normalization layers. We finally show that with normalization layers, adding \(\tau\) also significantly improves the performance on both the CIFAR and ImageNet tasks. The source code is available online at https://github.com/dayu11/tau-ResNet.
5.1 Theoretical verification
We train feedforward fully-connected neural networks (PlainNet) and ResNets with different values of \(\tau\), and compare their convergence behaviors. Specifically, for ResNets, we adopt exactly the same residual architecture as described in Eq. (1) and Sect. 2. The PlainNet adopts the same architecture as the ResNets but without the skip connection. The models are generated with width \(m=128\) and depth \(L\in \{10, 100, 1000\}\). For ResNets with \(\tau\), we choose \(\tau =\frac{1}{L}, \frac{1}{\sqrt{L}},\frac{1}{L^{1/4}}\) to show the sharpness of the value \(\frac{1}{\sqrt{L}}\). We conduct classification on the MNIST dataset (LeCun et al., 1998). We train the models with SGDFootnote 4 with a minibatch size of 256. The learning rate is set to \(\eta =0.01\) for all networks without tuning.
We plot the training curves in Fig. 3. For ResNets with \(\tau\), we see that both \(\tau =\frac{1}{L}\) and \(\tau =\frac{1}{\sqrt{L}}\) are able to train very deep ResNets successfully and \(\tau =\frac{1}{\sqrt{L}}\) achieves lower training loss than \(\tau =\frac{1}{L}\). For \(\tau =\frac{1}{L^{1/4}}\), the training loss explodes for models with depth 100 and 1000. This indicates that the bound \(\tau =\frac{1}{\sqrt{L}}\) is sharp for learning deep ResNets. Moreover, the convergence of ResNets with \(\tau =\frac{1}{\sqrt{L}}\) does not depend on the depth while training feedforward network becomes harder as the depth increases, corroborating our theory nicely.
To clearly see the benefit of \(\tau =\frac{1}{\sqrt{L}}\) over \(\tau =\frac{1}{L}\), we conduct the classification task on the CIFAR10 dataset (Krizhevsky & Hinton, 2009) with the residual networks from He et al. (2016). Slightly different from the model described in Sect. 2, here each residual block is composed of two stacked convolutional layers. We argue that our theoretical analysis still applies if we treat the number of channels in a convolutional layer as the width in Sect. 2. We plot the training/validation curves in Fig. 4. We can see that with \(\tau =\frac{1}{\sqrt{L}}\), both ResNet110 and ResNet1202 can be trained to good accuracy without BN. In contrast, with \(\tau =\frac{1}{L}\), the performance of ResNet110 and ResNet1202 drops considerably.
In the sequel, we use “adding \(\tau ^*\)" or “\(+\tau ^*\)" to denote residual network with \(\tau =\frac{1}{\sqrt{L}}\).
5.2 Comparison of adding \(\tau ^*\) and using Fixup
In this section we compare our approach of adding \(\tau ^*\) with the approach of using Fixup for training residual networks without BN. We conduct the classification task on the CIFAR10 dataset. We use the residual models in (He et al., 2016) with all the normalization layers removed. For the Fixup approach, we use the code from their github website with the same hyperparameter setting. We note that Fixup has a learnable scalar with initial value 1 on the output of the parametric branch in each residual block, which is equivalent to setting \(\tau =1\). For our approach, we use the same model as Fixup, setting \(\tau =\frac{1}{\sqrt{L}}\) and using the Kaiming initialization instead of the Fixup initialization.
The results are presented in Table 1. We can see that our approach achieves much better performance than the Fixup approach across all depths. Moreover, the Fixup approach fails to converge in 2 out of 5 runs when training ResNet1202, and hence the standard deviation is not presented in Table 1.
5.3 Adding \(\tau ^*\) on top of normalization
In this section, we empirically show that adding \(\tau ^*\) in the residual block with batch normalization can also help to achieve better performance. We conduct experiments on standard classification datasets: CIFAR10/100 and ImageNet. The baseline models are the residual networks in He et al. (2016). We note that the residual block here is with batch normalization, which is discussed in Sect. 3.4 but not precisely covered by the theoretical model (Sect. 2). For our approach, the only modification is adding a fixed \(\tau =\frac{1}{\sqrt{L}}\) at the output of each residual block (right before the residual addition). We also tried a learnable \(\tau\) but did not observe a gain, which may be because the BN layers already have learnable scaling factors. The validation errors on CIFAR10/100 are illustrated in Fig. 5, where all numbers are averaged over five runs. Adding \(\tau ^*\) performs much better than the baseline models, and the benefit of adding \(\tau ^*\) becomes larger as the network gets deeper. We note that one needs a learning-rate warm-up to successfully train ResNet1202+BN, while with \(\tau ^*\) we use the same learning rate schedule for all depths.
As the models for ImageNet classification have different numbers of residual blocks in each stage, we choose \(\tau ^*=\frac{1}{\sqrt{L}}\) where L is the average number of blocks over all stages. We take the average instead of the sum because there is a BN layer on the output of each stage. All models are trained for 200 epochs with the learning rate divided by 10 every 60 epochs. The other hyperparameters are the same as in He et al. (2016). Table 2 shows the top-1 validation error on ImageNet. We can see that just by adding \(\tau ^*\) on top of BN we achieve a significant performance gain.
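As a concrete illustration of this choice, the following snippet computes \(\tau ^*\) from a list of per-stage block counts (the layout [3, 4, 6, 3] is the standard ResNet-50 configuration, used here only as an example):

```python
import math

def tau_star(blocks_per_stage):
    """Compute tau* = 1/sqrt(L), where L is the average number of
    residual blocks over all stages."""
    L = sum(blocks_per_stage) / len(blocks_per_stage)
    return 1.0 / math.sqrt(L)

# The standard ResNet-50 stage layout [3, 4, 6, 3] gives L = 4, so tau* = 0.5.
print(tau_star([3, 4, 6, 3]))  # 0.5
```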
6 Conclusion
In this paper, we provide a non-asymptotic analysis of the forward/backward stability of ResNet, which unveils that \(\tau =1/\sqrt{L}\) is a sharp value in terms of characterizing the stability. We also bridge the theoretical understanding and the practical design of the ResNet structure. We empirically verify the efficacy of adding \(\tau\) for ResNet with/without batch normalization. As the residual block is also widely used in the Transformer model (Vaswani et al., 2017), it is interesting to study the effect of \(\tau\) and layer normalization there.
Notes
We use \(\tilde{O}(\cdot )\) to hide logarithmic factors.
Throughout the paper, the naming rule for ResNet is as follows. “ResNet” refers to the model defined in Sect. 2; “ResNet#” refers to the models in He et al. (2016) with all BN layers removed, e.g., ResNet1202; “ResNet#\(+\)BN” corresponds to the original model in He et al. (2016); “\(+\)Fixup” corresponds to initializing the model with Fixup; and “\(+\tau\)” refers to adding \(\tau\) on the output of the parametric branch in each residual block.
GD exhibits the same phenomenon. We use SGD due to the expensive per-iteration cost of GD.
References
Allen-Zhu, Z., & Li, Y. (2019). What can ResNet learn efficiently, going beyond kernels? Advances in Neural Information Processing Systems.
Allen-Zhu, Z., Li, Y., & Song, Z. (2018). A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962.
Allen-Zhu, Z., Li, Y., & Liang, Y. (2019a). Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in Neural Information Processing Systems, pp. 6155–6166.
Allen-Zhu, Z., Li, Y., & Song, Z. (2019b). On the convergence rate of training recurrent neural networks. Advances in Neural Information Processing Systems.
Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., & Wang, R. (2019a). On exact computation with an infinitely wide neural net. Advances in Neural Information Processing Systems.
Arora, S., Du, S. S., Hu, W., Li, Z., & Wang, R. (2019b). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. International Conference on Machine Learning (ICML).
Arpit, D., Campos, V., & Bengio, Y. (2019). How to initialize your network? robust initialization for weightnorm & resnets. Advances in Neural Information Processing Systems.
Balduzzi, D., Frean, M., Leary, L., Lewis, J. P., Wan-Duo Ma, K., & McWilliams, B. (2017). The shattered gradients problem: If resnets are the answer, then what is the question? In International Conference on Machine Learning (ICML), pp. 342–350.
Brutzkus, A., Globerson, A., Malach, E., & Shalev-Shwartz, S. (2018). SGD learns over-parameterized networks that provably generalize on linearly separable data. In Proceedings of the 6th international conference on learning representations (ICLR 2018).
Cao, Y., & Gu, Q. (2019). A generalization theory of gradient descent for learning over-parameterized deep ReLU networks. arXiv preprint arXiv:1902.01384.
Cao, Y., & Gu, Q. (2020). Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in Neural Information Processing Systems (NeurIPS).
Chen, Z., Cao, Y., Zou, D., & Gu, Q. (2021). How much over-parameterization is sufficient to learn deep ReLU networks? In Proceedings of the international conference on learning representations (ICLR 2021).
Chizat, L., & Bach, F. (2018a). On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in Neural Information Processing Systems 31.
Chizat, L., & Bach, F. (2018b). A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956, 8.
Chizat, L., Oyallon, E., & Bach, F. (2019). On lazy training in differentiable programming. Advances in Neural Information Processing Systems.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT.
Du, S. S., Lee, J. D., Li, H., Wang, L., & Zhai, X. (2019a). Gradient descent finds global minima of deep neural networks. In: International Conference on Machine Learning (ICML).
Du, S. S., Zhai, X., Poczos, B., & Singh, A. (2019b). Gradient descent provably optimizes over-parameterized neural networks. In: International Conference on Learning Representations (ICLR).
Fang, C., Dong, H., & Zhang, T. (2019a). Over parameterized two-level neural networks can learn near optimal feature representations. arXiv preprint arXiv:1910.11508.
Fang, C., Gu, Y., Zhang, W., & Zhang, T. (2019b). Convex formulation of overparameterized deep neural networks. arXiv preprint arXiv:1911.07626.
Frei, S., Cao, Y., & Gu, Q. (2019). Algorithm-dependent generalization bounds for overparameterized deep residual networks. Advances in Neural Information Processing Systems, pages 14769–14779.
Ghorbani, B., Mei, S., Misiakiewicz, T., Montanari, A. (2019). Limitations of lazy training of two-layers neural networks. Advances in Neural Information Processing Systems.
Haber, E., & Ruthotto, L. (2017). Stable architectures for deep neural networks. Inverse Problems, 34(1), 014004.
Hardt, M., & Ma, T. (2016). Identity matters in deep learning. In: International Conference on Learning Representations (ICLR).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Ioffe, S., & Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning (ICML), pp. 448–456.
Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, pp. 8571–8580.
Ji, Z., & Telgarsky, M. (2020). Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks. In Proceedings of the international conference on learning representations (ICLR 2020).
Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images.
Laurent, B., & Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pp. 1302–1338.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Li, Y., & Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. Advances in Neural Information Processing Systems, pp. 8168–8177.
Mei, S., Montanari, A., & Nguyen, P.-M. (2018). A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33), E7665–E7671.
Mei, S., Misiakiewicz, T., & Montanari, A. (2019). Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Proceedings of the thirty-second conference on learning theory (pp. 2388–2464).
Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., & Srebro, N. (2019). The role of over-parametrization in generalization of neural networks. In: International Conference on Learning Representations (ICLR).
Nguyen, P.-M. (2019). Mean field limit of the learning dynamics of multilayer neural networks. arXiv preprint arXiv:1902.02880.
Orhan, A. E., & Pitkow, X. (2018). Skip connections eliminate singularities. In: International Conference on Learning Representations (ICLR).
Oymak, S., & Soltanolkotabi, M. (2019). Overparameterized nonlinear learning: Gradient descent takes the shortest path? In: International Conference on Machine Learning (ICML).
Spielman, D. A., & Teng, S.-H. (2004). Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM, 51(3), 385–463.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems.
Veit, A., Wilber, M. J., & Belongie, S. (2016). Residual networks behave like ensembles of relatively shallow networks. Advances in Neural Information Processing Systems, pp 550–558.
Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices. In Compressed sensing: Theory and applications (pp. 210–268).
Yang, G., & Schoenholz, S. (2017). Mean field residual networks: On the edge of chaos. Advances in Neural Information Processing Systems, pp. 7103–7114.
Yehudai, G., & Shamir, O. (2019). On the power and limitations of random features for understanding neural networks. Advances in Neural Information Processing Systems.
Zhang, H., Dauphin, Y. N., & Ma, T. (2019a). Fixup initialization: Residual learning without normalization. In: International Conference on Learning Representations (ICLR).
Zhang, H., Chen, W., & Liu, T.-Y. (2018). On the local hessian in back-propagation. In Advances in Neural Information Processing Systems, pp. 6521–6531.
Zhang, J., Han, B., Wynter, L., Low, K. H., & Kankanhalli, M. (2019b). Towards robust resnet: A small step but a giant leap. In: International Joint Conferences on Artificial Intelligence (IJCAI).
Zou, D., & Gu, Q. (2019). An improved analysis of training over-parameterized deep neural networks. Advances in Neural Information Processing Systems.
Zou, D., Cao, Y., Zhou, D., & Gu, Q. (2020). Stochastic gradient descent optimizes over-parameterized deep ReLU networks. Machine Learning, 109(3), 467–492.
Editor: Paolo Frasconi.
Appendices
A Useful Lemmas
First we list several useful bounds on the Gaussian distribution.
Lemma 1
Suppose \(X\sim \mathcal {N}(0,\sigma ^{2})\), then
Another bound is on the spectral norm of a random matrix (Vershynin, 2012, Corollary 5.35).
Lemma 2
Let \(\varvec{A}\in \mathbb {R}^{N\times n}\), and suppose the entries of \(\varvec{A}\) are independent standard Gaussian random variables. Then for every \(t\ge 0\), with probability at least \(1-\exp (-t^{2}/2)\) one has \(s_{\max }(\varvec{A})\le \sqrt{N}+\sqrt{n}+t\),
where \(s_{\max }(\varvec{A})\) is the largest singular value of \(\varvec{A}\).
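A quick numerical check of Lemma 2 (a sanity test with illustrative sizes, not part of the analysis): draw a Gaussian matrix and compare its largest singular value with the bound \(\sqrt{N}+\sqrt{n}+t\).

```python
import numpy as np

# Draw A with i.i.d. standard Gaussian entries and compare s_max(A)
# with the Lemma 2 bound sqrt(N) + sqrt(n) + t.
rng = np.random.default_rng(0)
N, n, t = 400, 100, 5.0
A = rng.standard_normal((N, n))
s_max = np.linalg.norm(A, ord=2)     # largest singular value of A
bound = np.sqrt(N) + np.sqrt(n) + t  # holds w.p. >= 1 - exp(-t^2/2)
print(s_max, bound)
```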
B Spectral norm bound at initialization
Next we present a spectral norm bound related to the forward process of ResNet with \(\tau\).
Proof
Without introducing ambiguity, we drop the superscript \(^{(0)}\) for notational simplicity. We first establish the claim for one fixed sample \(i\in [n]\) and drop the subscript i for convenience. Let \(g_l = h_{l-1}+\tau {\varvec{W}}_lh_{l-1}\) and \(h_l = {\varvec{D}}_l g_{l}\) for \(l\in \{a,\dots , b\}\). We will show that for a vector \(h_{a-1}\) with \(\Vert h_{a-1}\Vert =1\), we have \(\Vert h_b\Vert \le 1+c\) with high probability, where
Then we have \(\Vert g_l\Vert \ge \Vert h_l\Vert\) due to the assumption \(\Vert {\varvec{D}}_l\Vert \le 1\). Hence we have
Taking the logarithm on both sides, we have
Letting \(\tilde{h}_{l-1} := \frac{h_{l-1}}{\Vert h_{l-1}\Vert }\), we obtain
where the inequality is due to the fact that \(\log (1+x) \le x\) for all \(x>-1\). Let \(\xi _{l} := 2\tau \left\langle \tilde{h}_{l-1},{\varvec{W}}_{l}\tilde{h}_{l-1} \right\rangle\) and \(\zeta _{l}:= \tau ^{2}\Vert {\varvec{W}}_{l}\tilde{h}_{l-1}\Vert ^{2}\); then, given \(h_{l-1}\), we have \(\xi _{l}\sim \mathcal {N}\left( 0, \frac{4\tau ^2}{m}\right)\) and \(\zeta _{l}\sim \frac{\tau ^{2}}{m}\chi _{m}^2\) because of the random initialization of \({\varvec{W}}_l\). We see that
Next we bound the two terms on the right hand side one by one. For the first term we have
where \(\lambda\) is any positive number and the last inequality uses Markov’s inequality. Moreover,
Hence we obtain
by choosing \(\lambda = \frac{mc_1}{16\tau ^{2}L}\) and using \(b-a+1 \le L\). Due to the symmetry of \(\sum _{l=a}^{b}\xi _{l}\), the conclusion can be generalized to the quantity \(|\sum _{l=a}^{b}\xi _{l}|\) that \(\mathbb {P}\left( \left| \sum \limits _{l=a}^{b}\xi _{l}\right| \ge \frac{c_1}{2}\right) \le 2\exp \left( -\frac{mc_1^{2}}{64\tau ^{2}L}\right)\).
Then, for the second term, we follow the above procedure but for a \(\chi _m^2\) variable. We note that the moment generating function of \(\chi _{m}^{2}\) is \((1-2t)^{-m/2}\) for \(t<1/2\). We will use the inequality \((1-\frac{x}{m})^{-m}\le e^{x}\) for \(x\ge 0\). By Markov’s inequality, we first have for any \(\lambda >0\),
Then we have
Hence we obtain
by choosing \(\lambda = \frac{mc_1}{\tau ^2L}\) and using \(b-a+1 \le L\). If further setting \(\tau\) such that \(\tau ^2 L\le \frac{c_1}{4}\), we have
Combining (13) and (17), we obtain \(\mathbb {P}\left( \sum \limits _{l=a}^{b}\log \Delta _{l}\ge c_1\right) \le 3\exp \left( -\frac{mc_1^2}{64\tau ^2L}\right)\) under the condition \(\tau ^2 L\le \frac{c_1}{4}\). Hence we have \(\mathbb {P}\left( \Vert h_b\Vert \ge 1+c\right) \le \mathbb {P}\left( \sum \limits _{l=a}^{b}\log \Delta _{l}\ge 2\log (1+c)\right) \le 3\exp \left( -\frac{m\log ^2(1+c)}{16\tau ^2L}\right)\) under the condition that \(\tau ^2 L\le \frac{1}{2} \log (1+c)\). We next use an \(\epsilon\)-net argument to prove the claim for all m-dimensional vectors \(h_{a-1}\). Let \(\mathcal {N}_{\epsilon }\) be an \(\epsilon\)-net over the unit ball in \(\mathbb {R}^m\) with \(\epsilon < 1\); then the cardinality satisfies \(|\mathcal {N}_{\epsilon }|\le (1+2/\epsilon )^m\). Taking the union bound over all vectors \(h_{a-1}\) in the net \(\mathcal {N}_{\epsilon }\), we obtain
where the last equality is obtained by choosing \(\tau\) appropriately to make \(\frac{\log ^2 (1+c)}{16\tau ^2L} -\log (1+2/\epsilon )>1\). Then we have the spectral norm bound
This is because of the following argument. For a matrix \(\varvec{M}\), let \(v_i\) be the vector in the net closest to a unit vector v; then \(\Vert \varvec{M}v\Vert \le \Vert \varvec{M}v_i \Vert + \Vert \varvec{M}(v-v_i)\Vert \le \Vert \varvec{M}v_i \Vert + \epsilon \Vert \varvec{M}\Vert\), and hence, taking the supremum over v, one obtains \((1-\epsilon ) \Vert \varvec{M}\Vert \le \max _i \Vert \varvec{M}v_i\Vert\).
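The \(\epsilon\)-net argument can be illustrated numerically in small dimension; the sketch below builds a net over the unit circle in \(\mathbb {R}^2\) and checks \((1-\epsilon )\Vert \varvec{M}\Vert \le \max _i \Vert \varvec{M}v_i\Vert\) for an arbitrary test matrix (all sizes are illustrative):

```python
import numpy as np

eps = 0.1
# 200 equally spaced points on the unit circle form an eps-net:
# any unit vector lies within 2*sin(pi/200) < 0.1 of some net point.
angles = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
net = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (200, 2)

M = np.array([[3.0, 1.0], [0.0, 2.0]])     # an arbitrary test matrix
spec = np.linalg.norm(M, ord=2)            # true spectral norm
net_max = np.max(np.linalg.norm(net @ M.T, axis=1))
print(spec, net_max)
```

The maximum over the net is at most the spectral norm, and the spectral norm exceeds it by at most a factor \(1/(1-\epsilon )\), matching the argument above.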
Finally taking a union bound over a and b with \(1\le a\le b <L\) and a union bound over all samples \(i\in [n]\), we have the claimed result. \(\square\)
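As a sanity check on the distributional claims used in the proof above, namely \(\xi _{l}\sim \mathcal {N}(0, 4\tau ^2/m)\) and \(\zeta _{l}\sim \frac{\tau ^{2}}{m}\chi _{m}^2\) for \({\varvec{W}}_l\) with i.i.d. \(\mathcal {N}(0,1/m)\) entries, one can simulate both quantities for a fixed unit vector (sizes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, tau, trials = 50, 0.1, 4000
h = rng.standard_normal(m)
h /= np.linalg.norm(h)  # a fixed unit vector playing the role of h_tilde

xi = np.empty(trials)
zeta = np.empty(trials)
for k in range(trials):
    W = rng.standard_normal((m, m)) / np.sqrt(m)  # entries N(0, 1/m)
    Wh = W @ h
    xi[k] = 2.0 * tau * h @ Wh    # claimed N(0, 4 tau^2 / m)
    zeta[k] = tau ** 2 * Wh @ Wh  # claimed (tau^2 / m) chi^2_m

print(np.var(xi) * m / (4 * tau ** 2))  # should be close to 1
print(np.mean(zeta) / tau ** 2)         # should be close to 1
```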
C Bounded forward/backward process
C.1 Proof at initialization
Proof
We ignore the superscript \(^{(0)}\) for simplicity. First we have
Then we see
We introduce notation \(\Delta _{a}:=\frac{\Vert h_{i,a}\Vert ^{2} - \Vert h_{i,a-1}\Vert ^{2}}{\Vert h_{i,a-1}\Vert ^{2}}\). We next give a lower bound on \(\Delta _{a}\). Let S be the set \(\{k: k\in [m] \text { and } (h_{i,a-1})_{k} +\tau ({\varvec{W}}_{a}h_{i,a-1})_{k} >0\}\). We have that
where the inequality is due to the fact that for \(k\notin S\), \(|(h_{i,a-1})_{k}|<|(\tau {\varvec{W}}_{a}h_{i,a-1})_{k}|\) and \((h_{i,a-1})_{k}({\varvec{W}}_{a}h_{i,a-1})_{k}\le 0\). Let \(\xi _{a}:=\frac{2\tau \left\langle h_{i,a-1},{\varvec{W}}_{a}h_{i,a-1} \right\rangle }{\Vert h_{i,a-1}\Vert ^{2}}\) and \(\zeta _{a}:=\frac{\Vert \tau {\varvec{W}}_{a}h_{i,a-1}\Vert ^{2}}{\Vert h_{i,a-1}\Vert ^{2}}\), then \(\Delta _{a} \ge \xi _{a}- \zeta _{a}\). We note that given \(h_{i,a-1}\), \(\xi _{a}\sim \mathcal {N}\left( 0, \frac{4\tau ^2}{m}\right) \) and \(\zeta _{a}\sim \frac{\tau ^{2}}{m}\chi _{m}^2\). We use a tail bound for a \(\chi ^2_m\) variable X (see Lemma 1 in Laurent and Massart (2000))
By applying the tail bound on Gaussian and Chi-square variables, for a constant \(c_0\) such that \(4\tau ^2\le c_0 \) we have
Thus, by choosing \(c_0 = 0.5\), we have \(\mathbb {P}\left( \Delta _a \ge -0.5, \forall a\in [L-1]\right) \ge 1- L\exp \left( -\frac{m}{128\tau ^2}\right) \). On the event \(\{\Delta _a \ge -0.5, \forall a \in [L-1]\}\), we can use the relation \(\log (1 + x)\ge x - x^{2}\) for \(x\ge -0.5\) and have
Due to (13) and (17), we have for any \(c_1>0\), and \(\tau ^2L\le c_1/4\),
Thus we have for any \(c_1>0\), and \(\tau ^2L\le c_1/4\),
We can derive a similar result that \(\mathbb {P}\left( \sum _{a=1}^{l}\Delta _{a}\ge c_1\right) \le \mathbb {P}\left( \sum _{a=1}^{l}\xi _{a}\ge c_1\right) \le \exp \left( -\frac{mc_1^{2}}{16\tau ^{2}L}\right) \). Setting \(a = b\) in (24), we obtain that for a single \(\Delta _{a}\), for a constant \(c_1\) such that \(4\tau ^2\le c_1 \),
In addition, we see that for any \(16\tau ^4L\le c_1 \)
Thus, similar to (25), we obtain that for any \(c_1>0\) and \(8\tau ^2L< c_1\),
Thus on the event of \(\{\Delta _a \ge -0.5, \forall a\in [L-1]\}\), we have for any \(c_1>0\) and \(8\tau ^2L< c_1\),
Then we get the conclusion \(\mathbb {P}\left( \Vert h_{i,l}\Vert < 1-c\right) = \mathbb {P}\left( \log {\Vert h_{i,l}\Vert ^{2}} \le -2\log (1-c)^{-1}\right) \le 2L\exp \left( -\frac{1}{32}m\log (1-c)^{-1}\right) \). Taking a union bound over \(i\in [n]\) and \(l\in [L-1]\), we get the claimed result with probability \(1-2nL^2\exp \left( -\frac{1}{32}m\log (1-c)^{-1}\right) \) under the condition \(\tau ^2L \le \frac{1}{4}\log (1-c)^{-1}\). \(\square \)
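The chi-square tail bound of Laurent and Massart (2000) invoked in this proof, \(\mathbb {P}(X\ge m+2\sqrt{mx}+2x)\le e^{-x}\) for \(X\sim \chi ^2_m\), can be checked empirically; a Monte Carlo sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
m, x, trials = 50, 3.0, 200_000
X = rng.chisquare(m, size=trials)               # samples of chi^2_m
threshold = m + 2.0 * np.sqrt(m * x) + 2.0 * x  # Laurent-Massart threshold
empirical = np.mean(X >= threshold)             # empirical tail probability
print(empirical, np.exp(-x))
```

The empirical tail probability falls well below \(e^{-x}\), as the bound is not tight.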
C.2 Lemmas and proofs after perturbation
We use \(\overrightarrow{{\varvec{W}}}^{(0)}\) to denote the weight matrices at initialization and \(\overrightarrow{{\varvec{W}}}'\) to denote the perturbation matrices. Let \(\overrightarrow{{\varvec{W}}} = \overrightarrow{{\varvec{W}}}^{(0)} + \overrightarrow{{\varvec{W}}}'\). We define \(h_{i,l}^{(0)} = \phi ((\varvec{I}+\tau {\varvec{W}}_l^{(0)})h_{i,l-1}^{(0)})\) and \(h_{i,l} = \phi ((\varvec{I}+\tau {\varvec{W}}_l)h_{i,l-1})\) for \(l\in [L-1]\), and \(h_{i,L}^{(0)} = \phi ({\varvec{W}}_L^{(0)}h_{i,L-1}^{(0)})\) and \(h_{i,L} = \phi ({\varvec{W}}_L h_{i,L-1})\). Furthermore, let \(h'_{i,l} := h_{i,l}- h_{i,l}^{(0)}\) and \({\varvec{D}}'_{i,l} := {\varvec{D}}_{i,l} - {\varvec{D}}_{i,l}^{(0)}\). We note that \(\Vert \cdot \Vert _0\) denotes the number of nonzero entries of its argument. In the sequel, we use the notation O and \(\Omega\) to simplify the presentation. The spectral norm bound after perturbation is as follows.
Lemma 3
Suppose that \(\overrightarrow{{\varvec{W}}}^{(0)}\), \(\varvec{A}\) are randomly generated as in the initialization step, and \({\varvec{W}}'_{1},\dots ,{\varvec{W}}'_{L-1}\in \mathbb {R}^{m\times m}\) are perturbation matrices with \(\Vert {\varvec{W}}'_l\Vert <\tau \omega\) for all \(l\in [L-1]\) for some \(\omega <1\). Suppose \({\varvec{D}}_{i,0},\dots ,{\varvec{D}}_{i,L}\) are diagonal matrices representing the activation status of sample i. If \(\tau ^2L \le O(1)\), then with probability at least \(1-3nL^2\cdot \exp (-\Omega (m))\) over the initialization randomness we have
Proof
This proof is similar to the proof of Theorem 1. We first establish the claim for one fixed sample \(i\in [n]\) and drop the subscript i for convenience. We will show that for a vector \(h_{a-1}\) with \(\Vert h_{a-1}\Vert =1\), we have \(\Vert h_b\Vert \le 1+c\) with high probability, where
Let \(g_l = h_{l-1}+\tau {\varvec{W}}^{(0)}_lh_{l-1} +\tau {\varvec{W}}'_lh_{l-1}\) and \(h_l = {\varvec{D}}_l g_{l}\) for \(l=\{a,..., b\}\). Then we have \(\Vert g_l\Vert \ge \Vert h_l\Vert\) due to the fact \(\Vert {\varvec{D}}_l\Vert \le 1\). Hence we have
Taking the logarithm on both sides, we have
Letting \(\tilde{h}_{l-1} := \frac{h_{l-1}}{\Vert h_{l-1}\Vert }\), we obtain
where the inequality is due to the fact that \(\log (1+x) \le x\) for all \(x>-1\). We can bound the sum over layers of the first two terms as in the proof of Theorem 1. Next we control the last two terms, which involve \({\varvec{W}}'_l\), on the high-probability event \(\{\Vert {\varvec{W}}^{(0)}_l \Vert \le 4, \text { for all } l \in [L-1]\}\)
Hence, given \(\tau ^2 L \le c_1/4\) as in the proof of Theorem 1 and \(\omega\) being a small constant, the above two sums are well controlled, and we can obtain the claimed spectral norm bound. Here the theorem is established for one \({\varvec{W}}'_l\). At the end of the whole proof, we will see that the number of iterations is \(\Omega (n^2)\). If we take a union bound over all the \({\varvec{W}}'_l\)'s encountered along the optimization trajectory, the overall probability is still as high as \(1 - O(n^3 L^2)\exp (-\Omega (m))\). \(\square\)
The output vector of each layer also changes only slightly after perturbation.
Lemma 4
Suppose that \(\omega \le O(1)\) and \(\tau ^2L\le O(1)\). If \(\Vert {\varvec{W}}_{L}'\Vert \le \omega\) and \(\Vert {\varvec{W}}_{l}'\Vert \le \tau \omega\) for \(l\in [L-1]\), then with probability at least \(1-\exp (-\Omega (m\omega ^{\frac {2}{3}}))\), the following bounds on \(h'_{i,l}\) and \({\varvec{D}}'_{i,l}\) hold for all \(i\in [n]\) and all \(l\in [L-1]\),
Proof
Fixing i and ignoring the subscript i, by Claim 8.2 in Allen-Zhu et al. (2018), for \(l\in [L-1]\), there exists \({\varvec{D}}''_{l}\) such that \(|({\varvec{D}}''_{l})_{k,k}|\le 1\) and
We claim that
due to the fact \(\Vert {\varvec{D}}''_{l}\Vert \le 1\) and the assumption \(\Vert {\varvec{W}}'_{l}\Vert \le \tau \omega\) for \(l\in [L-1]\). This implies that \(\Vert h'_{i,l}\Vert ,\Vert g'_{i,l}\Vert \le O(\tau ^2L\omega )\) for all \(l\in [L-1]\) and for all i with probability at least \(1-O(nL)\cdot \exp (-\Omega (m))\). One step further, we have \(\Vert h'_{L}\Vert ,\Vert g'_{L}\Vert \le O(\omega )\).
As for the sparsity \(\Vert {\varvec{D}}'_{l}\Vert _{0}\), we have \(\Vert {\varvec{D}}'_{l}\Vert _{0}\le O(m(\omega \tau L)^{\frac {2}{3}})\) for every \(l=[L-1]\) and \(\Vert {\varvec{D}}'_{L}\Vert _{0}\le O(m\omega ^{\frac {2}{3}})\).
The argument is as follows (adapted from Claim 5.3 in Allen-Zhu et al. (2018)).
We first study the case \(l\in [L-1]\). We see that if \(({\varvec{D}}'_{l})_{j,j}\ne 0\) one must have \(|(g'_{l})_{j}|>|(g_{l}^{(0)})_{j}|\).
We note that \((g_{l}^{(0)})_{j}=(h_{l-1}^{(0)}+\tau {\varvec{W}}_{l}^{(0)}h_{l-1}^{(0)})_{j}\sim \mathcal {N}\left( (h_{l-1}^{(0)})_{j},\frac{\tau ^{2}\Vert h_{l-1}^{(0)}\Vert ^{2}}{m}\right)\). Let \(\xi \le \frac{1}{\sqrt{m}}\) be a parameter to be chosen later. Let \(S_{1}\subseteq [m]\) be an index set satisfying \(S_{1}:=\{j:|(g_{l}^{(0)})_{j}|\le \xi \tau \}\). We have \(\mathbb {P}\{|(g_{l}^{(0)})_{j}|\le \xi \tau \}\le O(\xi \sqrt{m})\) for each \(j\in [m]\). By the Chernoff bound, with probability at least \(1-\exp (-\Omega (m^{3/2}\xi ))\) we have
Let \(S_{2}:=\{j:j\notin S_{1},\ \text {and }({\varvec{D}}'_{l})_{j,j}\ne 0\}\). Then for \(j\in S_{2}\), we have \(|(g'_{l})_{j}|>\xi \tau\). As we have proved that \(\Vert g'_{l}\Vert \le O(\tau ^2L\omega )\), we have
Choosing \(\xi\) to minimize \(|S_{1}|+|S_{2}|\), we have \(\xi =(\omega \tau L)^{\frac {2}{3}}/\sqrt{m}\) and consequently, \(\Vert {\varvec{D}}'_{l}\Vert _{0}\le O(m(\omega \tau L)^{\frac {2}{3}})\). Similarly, we have \(\Vert {\varvec{D}}'_{L}\Vert _{0}\le O(m\omega ^{\frac {2}{3}})\). \(\square\)
We next bound the norm of a sparse vector after the ResNet mapping.
Lemma 5
Suppose that \(s\ge \Omega (d/\log m)\) and \(\tau ^2L\le O(1)\). If \({\varvec{W}}_l\) for \(l\in [L]\) satisfy the conditions in Lemma 3, then for all \(i\in [n]\) and \(a\in [L]\), and for all s-sparse vectors \(u\in \mathbb {R}^{m}\) and all \(v\in \mathbb {R}^{d}\), the following bound holds with probability at least \(1-(nL)\cdot \exp (-\Omega (s\log m))\)
where \({\varvec{D}}_{i, a}\) is diagonal activation matrix for sample i.
Proof
For any fixed vector \(u\in \mathbb {R}^{m}\), \(\Vert {\varvec{D}}_{i,L}{\varvec{W}}_{L}{\varvec{D}}_{i,L-1}(\varvec{I}+\tau {\varvec{W}}_{L-1})\cdots {\varvec{D}}_{i,a}(\varvec{I}+\tau {\varvec{W}}_{a})u\Vert \le 1.1 \Vert u\Vert\) holds with probability at least \(1-\exp (-\Omega (m))\) because of Lemma 3.
On the above event, for a fixed vector \(v\in \mathbb {R}^{d}\) and any fixed \({\varvec{W}}_{l}\) for \(l\in [L]\), the randomness only comes from \({\varvec{B}}\), then \(v^{T}{\varvec{B}}{\varvec{D}}_{i,L}{\varvec{W}}_{L}{\varvec{D}}_{i,L-1}(\varvec{I}+\tau {\varvec{W}}_{L-1})\cdots {\varvec{D}}_{i,a}(\varvec{I}+\tau {\varvec{W}}_{a})u\) is a Gaussian variable with mean 0 and variance no larger than \(1.1^2\Vert u\Vert ^2\cdot \Vert v\Vert ^2/d\). Hence
Taking an \(\epsilon\)-net over all s-sparse vectors u and all d-dimensional vectors v, if \(s\ge \Omega (d/\log m)\) then with probability \(1-\exp (-\Omega (s\log m))\) the claim holds for all s-sparse vectors u and all d-dimensional vectors v. Further taking the union bound over all \(i\in [n]\) and \(a\in [L]\), the lemma is proved. \(\square\)
D Gradient lower/upper bounds and their proofs
Because the gradient is pathological and data-dependent, in order to bound the gradient we need to consider all possible points and all cases of data. Hence we first introduce an arbitrary loss vector, and then the gradient bound can be obtained by taking a union bound.
We define the \(\mathsf {BP}_{\overrightarrow{{\varvec{W}}}, i}(v, \cdot )\) operator. It back-propagates a vector v to its second argument, which can be the intermediate output \(h_l\) or the parameter \({\varvec{W}}_l\) at a specific layer l, using the forward-propagation state of input i through the network with parameter \(\overrightarrow{{\varvec{W}}}\). Specifically,
Moreover, we introduce
where \(\overrightarrow{v}\) is composed of n vectors \(v_i\) for \(i\in [n]\). If \(v_i\) is the error signal of input i, then \(\nabla _{{\varvec{W}}_l} F_i(\overrightarrow{{\varvec{W}}}) = \mathsf {BP}_{\overrightarrow{{\varvec{W}}},i}({\varvec{B}}h_{i,L}-y_i^* , {\varvec{W}}_l)\).
D.1 Gradient upper bound
Proof
We ignore the superscript \(^{(0)}\) for simplicity. Then for an \(i\in [n]\) we have
because of Theorem 1. Similarly, we have for \(l\in [L-1]\),
because of Theorem 1 and Lemma 2. \(\square\)
The above upper bounds hold for the initialization \(\overrightarrow{{\varvec{W}}}^{(0)}\) because of Theorem 1 and Theorem 2. They also hold for all the \(\overrightarrow{{\varvec{W}}}\) such that \(\Vert \overrightarrow{{\varvec{W}}}-\overrightarrow{{\varvec{W}}}^{(0)}\Vert \le \omega\) due to Lemma 3.
For the quadratic loss function, we have \(\Vert \partial h_{i,L}\Vert ^2= \Vert {\varvec{B}}^T({\varvec{B}}h_{i,L}-y_{i}^{*})\Vert ^2= O(m/d)F_i(\overrightarrow{{\varvec{W}}})\). We have the gradient upper bound as follows.
Theorem 6
Suppose \(\omega = O(1)\). For every input sample \(i\in [n]\) and for every \(l\in [L-1]\) and for every \(\overrightarrow{{\varvec{W}}}\) such that \(\Vert {\varvec{W}}_L-{\varvec{W}}_L^{(0)}\Vert \le \omega\) and \(\Vert {\varvec{W}}_l-{\varvec{W}}_l^{(0)}\Vert \le \tau \omega\), the following holds with probability at least \(1- O(nL^2)\cdot \exp (-\Omega (m))\) over the randomness of \(\varvec{A},{\varvec{B}}\) and \(\overrightarrow{{\varvec{W}}}^{(0)}\)
D.2 Gradient lower bound
For the quadratic loss function, we have the following gradient lower bound.
Theorem 7
Let \(\omega =O\left( \frac{\delta ^{3/2}}{n^{3}\log ^{3}m}\right)\). With probability at least \(1-\exp (-\Omega (m\omega ^{\frac {2}{3}}))\) over the randomness of \(\overrightarrow{{\varvec{W}}}^{(0)},\varvec{A},{\varvec{B}}\), it satisfies for every \(\overrightarrow{{\varvec{W}}}\) with \(\Vert \overrightarrow{{\varvec{W}}}-\overrightarrow{{\varvec{W}}}^{(0)}\Vert \le \omega\),
This gradient lower bound on \(\Vert \nabla _{{\varvec{W}}_{L}}F(\overrightarrow{{\varvec{W}}})\Vert _{F}^{2}\) acts like the gradient dominance condition (Zou and Gu, 2019; Allen-Zhu et al., 2018) except that our range on \(\omega\) does not depend on the depth L.
Proof
The gradient lower bound at initialization is given by Section 6.2 in Allen-Zhu et al. (2018) and Lemma 4.1 in Zou and Gu (2019) via smoothed analysis (Spielman & Teng, 2004): with high probability the gradient is lower bounded, although in the worst case it might be 0. We adopt the same proof as for Lemma 4.1 in Zou and Gu (2019), based on the two precondition results Theorem 2 and Lemma 6, and shall not repeat it here.
Now suppose that we have \(\Vert \nabla _{{\varvec{W}}_{L}}F(\overrightarrow{{\varvec{W}}}^{(0)})\Vert _{F}^{2}\ge \Omega \left( \frac{F(\overrightarrow{{\varvec{W}}}^{(0)})}{dn/\delta }\times m\right)\). We next bound the change of the gradient after perturbing the parameter. Recall that
By Lemma 4 and Lemma 5, we know,
Furthermore, we know
By Theorem 2 and Lemma 4, we have
Combining the above bounds, we have
Hence the gradient lower bound still holds for \(\overrightarrow{{\varvec{W}}}\) given \(\omega <O\left( \frac{\delta ^{3/2}}{n^{3}}\right)\).
Finally, taking an \(\epsilon\)-net over all possible vectors \(\overrightarrow{v}=(v_{1},\dots ,v_{n})\in (\mathbb {R}^{d})^{n}\), we prove that the above gradient lower bound holds for all \(\overrightarrow{v}\). In particular, we can now plug in the choice \(v_{i}={\varvec{B}}h_{i,L}-y_{i}^{*}\), which implies our desired bounds on the true gradients. \(\square\)
The gradient lower bound requires the following property.
Lemma 6
For any \(\delta\) and any pair \((x_{i},x_{j})\) satisfying \(\Vert x_{i}-x_{j}\Vert \ge \delta\), the bound \(\Vert h_{i,l}-h_{j,l}\Vert \ge \Omega (\delta )\) holds for all \(l\in [L]\) with probability at least \(1-O(n^{2}L)\cdot \exp (-\Omega (\log ^{2}{m}))\), given that \(\tau \le O(1/(\sqrt{L}\log {m}))\) and \(m\ge \Omega (\tau ^2 L^2\delta ^{-2})\).
The proof of Lemma 6 follows Appendix C in Allen-Zhu et al. (2018).
E Semi-smoothness for \(\tau \le O(1/\sqrt{L})\)
With the help of Theorem 6 and several other improvements, we can obtain a tighter bound on the semi-smoothness condition of the objective function.
Theorem 8
Let \(\omega = O\left( \frac{\delta ^{3/2}}{n^3L^{7/2}}\right)\) and \(\tau ^2L\le 1\). With high probability, we have for every \(\breve{\overrightarrow{{\varvec{W}}}}\in (\mathbb {R}^{m\times m})^{L}\) with \(\left\| \breve{\overrightarrow{{\varvec{W}}}}-\overrightarrow{{\varvec{W}}}^{(0)}\right\| \le \omega\) and for every \(\overrightarrow{{\varvec{W}}}'\in (\mathbb {R}^{m\times m})^{L}\) with \(\Vert \overrightarrow{{\varvec{W}}}'\Vert \le \omega\), we have
We will prove the semi-smoothness theorem for a more general \(\omega \in \left[ \Omega \left( \left( d/(m\log m)\right) ^{\frac{3}{2}}\right) , O(1)\right]\); the above statement holds with probability at least \(1-\exp (-\Omega (m\omega ^{\frac{2}{3}}))\) over the randomness of \(\overrightarrow{{\varvec{W}}}^{(0)},\varvec{A},{\varvec{B}}\).
Before going to the proof of the theorem, we introduce a lemma.
Lemma 7
There exist diagonal matrices \({\varvec{D}}''_{i,l}\in \mathbb {R}^{m\times m}\) with entries in [-1,1] such that \(\forall i\in [n],\forall l\in [L-1]\),
and
Furthermore, we then have \(\forall l\in [L-1],\Vert h_{i,l}-\breve{h}_{i,l}\Vert \le O(\tau ^2L\omega )\), \(\Vert {\varvec{D}}''_{i,l}\Vert _{0}\le O(m(\omega \tau L)^{\frac {2}{3}})\), and \(\Vert h_{i,L}-\breve{h}_{i,L}\Vert \le O((1+\tau \sqrt{L})\Vert {\varvec{W}}'\Vert _F)\), \(\Vert {\varvec{D}}''_{i,L}\Vert _{0}\le O(m\omega ^{\frac {2}{3}})\) and
hold with probability \(1-\exp (-\Omega (m\omega ^{\frac {2}{3}}))\) given \(\Vert {\varvec{W}}'_{L}\Vert \le \omega , \Vert {\varvec{W}}'_{l}\Vert \le \tau \omega\) for \(l\in [L-1]\) and \(\omega \le O(1), \tau \sqrt{L}\le 1\).
Proof of Theorem 8
First of all, define \(\breve{loss}_{i}:={\varvec{B}}\breve{h}_{i,L}-y_{i}^{*}\)
and
We use the fact that, for two matrices A and B, \(\langle A, B\rangle = \text {tr}(A^TB)\). Then, we can write
Then further by Lemma 7, we have
where (a) is due to Lemma 7.
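As a quick numerical sanity check of the trace identity used above (an illustration, not part of the proof), with arbitrary matrix sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 3))

# Frobenius inner product as an entrywise sum, and its trace form <A, B> = tr(A^T B)
inner = np.sum(A * B)
trace_form = np.trace(A.T @ B)
assert np.isclose(inner, trace_form)
```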
We next bound the RHS of (45). We first use Lemma 7 to get
Next we calculate that for \(l=L\),
For the first term, by Lemma 5 and Lemma 7, we have
where the last inequality is due to \(\Vert h_{i,L-1}\Vert \le O(1)\). For the second term, by Lemma 7 we have
where the last inequality is due to the assumption \(\Vert {\varvec{W}}'_{L}\Vert \le \omega\). A similar calculation applies for \(l\in [L-1]\), where we drop the index i for simplicity.
We next bound the terms in (50) one by one. For the first term, by Lemma 5 and Lemma 7, we have
where \(\Vert {\varvec{W}}'_{L-1:1}\Vert _F=\sqrt{\sum _{l=1}^{L-1}\Vert {\varvec{W}}'_l\Vert _F^2}\) and (a) is due to an argument similar to (56) in the proof of Lemma 7 and the fact that \(\left\| \breve{{\varvec{W}}}_{L}({\varvec{D}}_{L-1}+{\varvec{D}}''_{L-1})(\varvec{I}+\tau \breve{{\varvec{W}}}_{L-1})\cdots ({\varvec{D}}_{l}+{\varvec{D}}''_{l})\right\| = O(1)\) and \(\Vert h_{l-1}\Vert =O(1)\) hold with high probability. We note that the inequality (a) saves a \(\sqrt{L}\) factor in our main theorem.
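The notation \(\Vert {\varvec{W}}'_{L-1:1}\Vert _F\) can be restated numerically: the Frobenius norm of the stacked layerwise perturbations equals the root of the summed squared layerwise Frobenius norms. A small check with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
L, m = 6, 8
Ws = [rng.standard_normal((m, m)) for _ in range(L - 1)]  # W'_1, ..., W'_{L-1}

# ||W'_{L-1:1}||_F = sqrt(sum_l ||W'_l||_F^2)
stacked = np.linalg.norm(np.stack(Ws))  # 2-norm of the flattened stack
layerwise = np.sqrt(sum(np.linalg.norm(W, "fro") ** 2 for W in Ws))
assert np.isclose(stacked, layerwise)
```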
We have similar bound for the second term of (50)
For the last term in (50), we have
where the last inequality is due to the bound on \(\Vert h_{l-1}-\breve{h}_{l-1}\Vert\) in Lemma 7. Hence
Putting all of the above together and using the triangle inequality, we obtain the result. \(\square\)
Proof of Lemma 7
The proof relies on the following lemma.
Lemma 8
(Proposition 8.3 in Allen-Zhu et al. (2018)) Given vectors \(a, b\in \mathbb {R}^m\), let \({\varvec{D}}\in \mathbb {R}^{m\times m}\) be the diagonal matrix with \({\varvec{D}}_{k,k}=\varvec{1}_{a_k\ge 0}\). Then there exists a diagonal matrix \({\varvec{D}}''\in \mathbb {R}^{m\times m}\) with
-
\(|{\varvec{D}}_{k,k}+{\varvec{D}}''_{k,k}|\le 1\) and \(|{\varvec{D}}''_{k,k}|\le 1\) for every \(k\in [m]\),
-
\({\varvec{D}}''_{k,k}\ne 0\) only when \(\varvec{1}_{a_k\ge 0}\ne \varvec{1}_{b_k\ge 0}\),
-
\(\phi (a)-\phi (b) = ({\varvec{D}}+{\varvec{D}}'')(a-b)\).
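For the ReLU activation, one valid choice of \({\varvec{D}}''\) (the lemma only asserts existence) can be constructed coordinate-wise; the sketch below is illustrative and represents the diagonal matrices by their diagonals:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def diag_correction(a, b):
    """One valid diagonal D'' for Lemma 8 with phi = ReLU.

    Where a_k and b_k share an activation pattern, the ReLU difference quotient
    (phi(a_k)-phi(b_k))/(a_k-b_k) equals 1_{a_k>=0}, so D''_{kk}=0; otherwise
    D'' absorbs the pattern mismatch and stays in [-1, 1].
    """
    D = (a >= 0).astype(float)            # D_{kk} = 1_{a_k >= 0}
    slope = np.zeros_like(a)
    nz = a != b
    slope[nz] = (relu(a[nz]) - relu(b[nz])) / (a[nz] - b[nz])
    slope[~nz] = D[~nz]                   # a_k = b_k: difference is 0, keep D
    return slope - D                      # the diagonal of D''

rng = np.random.default_rng(1)
a, b = rng.standard_normal(8), rng.standard_normal(8)
D = (a >= 0).astype(float)
Dpp = diag_correction(a, b)

# phi(a) - phi(b) = (D + D'')(a - b), entrywise since the matrices are diagonal
assert np.allclose(relu(a) - relu(b), (D + Dpp) * (a - b))
# |D_{kk} + D''_{kk}| <= 1 and |D''_{kk}| <= 1 for every k
assert np.all(np.abs(D + Dpp) <= 1 + 1e-12) and np.all(np.abs(Dpp) <= 1 + 1e-12)
# D''_{kk} != 0 only where the activation patterns of a and b disagree
assert np.all((Dpp != 0) <= ((a >= 0) != (b >= 0)))
```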
Fixing index i and ignoring the subscript in i for simplicity, by Lemma 8, for each \(l\in [L-1]\) there exists a \({\varvec{D}}''_{l}\) such that \(|({\varvec{D}}''_{l})_{k,k}|\le 1\) and
Then we have the following properties. For \(l\in [L-1]\), \(\Vert h_l - \breve{h}_{l}\Vert \le O(\tau ^2 L \omega )\). This is because \(\Vert (\breve{{\varvec{D}}}_{l}+{\varvec{D}}''_{l})(\varvec{I}+\tau \breve{{\varvec{W}}}_{l})\cdots (\varvec{I}+\tau \breve{{\varvec{W}}}_{a+1})(\breve{{\varvec{D}}}_{a}+{\varvec{D}}''_{a})\Vert \le 1.1\) from Lemma 3; \(\Vert h_{a-1}\Vert \le O(1)\) from Theorem 2; and the assumption \(\Vert {\varvec{W}}'_l\Vert \le \tau \omega\) for \(l\in [L-1]\).
To have a tighter bound on \(\Vert h_L-\breve{h}_L\Vert\), let us introduce \({\varvec{W}}''_{b}:= \sum _{a=b}^{l} (\breve{{\varvec{D}}}_{l}+{\varvec{D}}''_{l})(\varvec{I}+\tau \breve{{\varvec{W}}}_{l})\cdots (\varvec{I}+\tau \breve{{\varvec{W}}}_{a+1})(\breve{{\varvec{D}}}_{a}+{\varvec{D}}''_{a}){\varvec{W}}'_{a}\), for \(b=1, ..., l\). Then we have
It is easy to get
where the inequality is because of \(\Vert h_{a-1}\Vert \le O(1)\) from Theorem 2. Next, we have
where the second inequality is from the definition of the spectral norm, and the third inequality is because \(\Vert (\breve{{\varvec{D}}}_{l}+{\varvec{D}}''_{l})(\varvec{I}+\tau \breve{{\varvec{W}}}_{l})\cdots (\varvec{I}+\tau \breve{{\varvec{W}}}_{a+1})(\breve{{\varvec{D}}}_{a}+{\varvec{D}}''_{a})\Vert \le 1.1\) from Lemma 3.
Hence we have \(\Vert h_L- \breve{h}_{L}\Vert \le O\left( (1+\tau \sqrt{L})\Vert {\varvec{W}}'\Vert _F\right) = O\left( \Vert {\varvec{W}}'\Vert _F\right)\) because of the assumption \(\tau \sqrt{L}\le 1\).
For \(l\in [L]\), we bound \(\Vert {\varvec{D}}''_l\Vert _0\). Note that \(({\varvec{D}}''_l)_{k,k}\) is non-zero only at coordinates k where \((\breve{g}_l)_k\) and \((g_l)_k\) have opposite signs, in which case either \(({\varvec{D}}_l^{(0)})_{k,k}\ne (\breve{{\varvec{D}}}_l)_{k,k}\) or \(({\varvec{D}}_l^{(0)})_{k,k}\ne ({\varvec{D}}_l)_{k,k}\). Therefore by Lemma 4, we have \(\Vert {\varvec{D}}''_l\Vert _0\le O(m(\omega \tau L)^{\frac {2}{3}})\) for \(l\in [L-1]\) given \(\Vert {{\varvec{W}}}'_l\Vert \le \tau \omega\), and \(\Vert {\varvec{D}}''_L\Vert _0\le O(m\omega ^{\frac {2}{3}})\) given \(\Vert {{\varvec{W}}}'_L\Vert \le \omega\). \(\square\)
F Proof for Theorem 5
1.1 F.1 Convergence result for GD
Proof
Using Theorem 2 we have \(\Vert h^{(0)}_{i,L}\Vert \le 1.1\) and then using the randomness of \({\varvec{B}}\), it is easy to show that \(\Vert {\varvec{B}}h_{i,L}^{(0)}-y_{i}^{*}\Vert ^{2}\le O(\log ^{2}m)\) with probability at least \(1-\exp (-\Omega (\log ^{2}m))\), and therefore
Assume that for every \(t=0,1,\dots ,T-1\), the following holds,
We shall prove the convergence of GD under the assumption that (58) holds, so that the previous statements can be applied. At the end, we verify that (58) is indeed satisfied.
Letting \(\nabla _{t}=\nabla F(\overrightarrow{{\varvec{W}}}^{(t)})\), we calculate that
where the first inequality uses Theorem 4, the second inequality uses the gradient upper bound in Theorem 6, and the last inequality uses the gradient lower bound in Theorem 7, the choice of \(\eta =O(d/(mn))\), and the assumption (58) on \(\omega\). That is, after \(T=\Omega (\frac{dn}{\eta \delta m})\log \frac{n\log ^{2}m}{\epsilon }\) iterations, \(F(\overrightarrow{{\varvec{W}}}^{(T)})\le \epsilon\).
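The mechanism behind this calculation is the standard one: a gradient lower bound of the form \(\Vert \nabla F\Vert ^2\ge \mu F\) (here supplied by Theorem 7, a Polyak-Lojasiewicz-type condition) combined with a smoothness-type upper bound yields a geometric decrease of the objective. A toy illustration on a quadratic, with constants that are placeholders rather than the paper's:

```python
import numpy as np

# Toy objective F(w) = 0.5 * w^T H w with H positive definite; it satisfies
# the PL inequality ||grad F(w)||^2 >= 2 * lambda_min(H) * F(w).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((20, 20)))
H = Q @ np.diag(np.linspace(0.5, 4.0, 20)) @ Q.T   # eigenvalues in [0.5, 4]
mu, beta = 0.5, 4.0                                # PL and smoothness constants

F = lambda w: 0.5 * w @ H @ w
eta = 1.0 / beta                                   # step size from smoothness

w = rng.standard_normal(20)
losses = [F(w)]
for _ in range(200):
    w = w - eta * (H @ w)                          # gradient descent step
    losses.append(F(w))

# F(w_{t+1}) <= (1 - eta * mu) * F(w_t): a geometric decrease, i.e. linear convergence
rate = 1.0 - eta * mu
assert all(losses[t + 1] <= rate * losses[t] + 1e-15 for t in range(200))
```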
We need to verify that (58) holds for each t. Here we use a result from Lemma 4.2 in Zou and Gu (2019), which states that \(\Vert {\varvec{W}}_L^{(t)}-{\varvec{W}}_L^{(0)}\Vert _F\le O(\sqrt{\frac{n^2 d\log m }{m\delta }})\).
To guarantee that the iterates fall into the region given by \(\omega\) in (58), we obtain the bound \(m\ge n^8\delta ^{-4}dL^7\log ^2 m\). \(\square\)
1.2 F.2 Convergence result for SGD
Theorem 9
For the ResNet defined and initialized as in Sect. 2, suppose the network width satisfies \(m\ge \Omega (n^{17}L^7b^{-4}\delta ^{-8}d\log ^2 m)\). Suppose we run the stochastic gradient descent update starting from \(\overrightarrow{{\varvec{W}}}^{(0)}\) and
where \(S_t\) is a random subset of [n] with \(|S_t|=b\). Then with probability at least \(1-\exp (-\Omega (\log ^{2}m))\), stochastic gradient descent (61) with learning rate \(\eta =\Theta (\frac{db\delta }{n^{3}m\log m})\) finds a point with \(F(\overrightarrow{{\varvec{W}}})\le \epsilon\) in \(T=\Omega (n^{5}b^{-1}\delta ^{-2}\log m \log ^2 \frac{1}{\epsilon })\) iterations.
Proof
The proof of the case of SGD can be adapted from the proof of Theorem 3.8 in Zou and Gu (2019). \(\square\)
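The update (61) is a standard minibatch step; the sketch below is a generic form consistent with the description of \(S_t\) (a random size-b subset of [n]), not the paper's exact update, applied to a toy least-squares problem:

```python
import numpy as np

def sgd(grad_i, w0, n, b, eta, T, seed=0):
    """Generic minibatch SGD: at step t, draw a random subset S_t of [n] with
    |S_t| = b and step along the averaged per-sample gradients.

    grad_i(w, i) returns the gradient of the i-th sample's loss at w.
    """
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(T):
        S_t = rng.choice(n, size=b, replace=False)   # random subset of [n]
        g = sum(grad_i(w, i) for i in S_t) / b
        w -= eta * g
    return w

# Usage on a realizable least-squares problem: per-sample loss 0.5*(x_i^T w - y_i)^2
rng = np.random.default_rng(1)
n, d = 64, 5
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star
grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i]

w = sgd(grad_i, np.zeros(d), n=n, b=8, eta=0.1, T=2000)
assert np.linalg.norm(w - w_star) < 1e-3
```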
G Proofs of Theorem 4 and Proposition 1
Proof
By induction we can show for any \(k\in [m]\) and \(l\in [L-1]\),
It is easy to verify \((h_1)_k = \phi \left( (h_0)_k+(\tau {\varvec{W}}_1h_0)_k\right) \ge \phi \left( (\tau {\varvec{W}}_1h_0)_k\right)\) because \((h_0)_k\ge 0\) and \(\phi\) is non-decreasing.
Then, assuming \((h_l)_k \ge \phi \left( \sum _{a=1}^{l}\left( \tau {\varvec{W}}_{a}h_{a-1}\right) _{k}\right)\), we show that the bound holds for \(l+1\).
where the last inequality can be shown by a case analysis.
Next we can compute the mean and variance of \(\sum _{a=1}^{l}\left( \tau {\varvec{W}}_{a}h_{a-1}\right) _{k}\) by taking iterative conditioning. We have
Moreover, \((\tau {\varvec{W}}_a h_{a-1})_{k}\) are jointly Gaussian over all a with mean 0 because the \({\varvec{W}}_a\)’s are drawn from independent Gaussian distributions. We use \(l=2\) as an example to illustrate the conclusion; it generalizes to other l. Assume that \(h_{0}\) is fixed. First, it is easy to verify that \((\tau {\varvec{W}}_1 h_{0})_{k}\) is a Gaussian variable with mean 0 and that \((\tau {\varvec{W}}_2 h_{1})_{k}\big |{\varvec{W}}_1\) is also a Gaussian variable with mean 0. Hence \([(\tau {\varvec{W}}_1 h_{0})_{k}, (\tau {\varvec{W}}_2 h_{1})_{k}]\) is jointly Gaussian with mean vector [0, 0], and thus \((\tau {\varvec{W}}_1 h_{0})_{k}+(\tau {\varvec{W}}_2 h_{1})_{k}\) is Gaussian with mean 0. By induction, \(\sum _{a=1}^l(\tau {\varvec{W}}_a h_{a-1})_{k}\) is Gaussian with mean 0. Then we have
where the first step is due to (62), the second step is due to the symmetry of the Gaussian distribution, and the third step is due to (66). Since \((h_{l})_{k} = \phi \left( (h_{l-1})_{k} + \left( \tau {\varvec{W}}_{l}h_{l-1}\right) _{k}\right)\), we can show \(\mathbb {E}(h_l)_k^2 \ge (h_{l-1})_{k}^2\) given \(h_{l-1}\) by numerical integration of a Gaussian variable over an interval. Hence we have \(\mathbb {E}\Vert h_{l}\Vert ^{2}\ge \mathbb {E} \Vert h_{l-1}\Vert ^2\ge \cdots \ge \mathbb {E} \Vert h_{0}\Vert ^2=1\) by iteratively taking conditional expectations. Combined with (64) and the choice of \(\tau =L^{-\frac{1}{2}+c}\), this gives \(\mathbb {E}\Vert h_{L-1}\Vert ^2 \ge \frac{1}{2}L^{2c}\). Because \(({\varvec{W}}_L)_{i,j}\sim \mathcal {N}(0,2/m)\) and \(h_L=\phi ({\varvec{W}}_L h_{L-1})\), we have \(\mathbb {E} \Vert h_L\Vert ^2 = \Vert h_{L-1}\Vert ^2\). Thus, the claim is proved. \(\square\)
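The dichotomy this proof establishes, bounded forward norms for \(\tau =O(1/\sqrt{L})\) versus blow-up for \(\tau =L^{-1/2+c}\), is easy to observe numerically. The sketch below simulates the residual recursion \(h_l=\phi (h_{l-1}+\tau {\varvec{W}}_l h_{l-1})\) at the Gaussian initialization; the width, depth, and trial count are small illustrative choices:

```python
import numpy as np

def forward_norm(tau, L, m, trials=20, seed=0):
    """Average ||h_L||^2 of h_l = relu(h_{l-1} + tau * W_l h_{l-1}) with
    W_l entries i.i.d. N(0, 2/m), starting from a unit-norm nonnegative h_0."""
    rng = np.random.default_rng(seed)
    norms = []
    for _ in range(trials):
        h = np.maximum(rng.standard_normal(m), 0.0)
        h /= np.linalg.norm(h)
        for _ in range(L):
            W = rng.standard_normal((m, m)) * np.sqrt(2.0 / m)
            h = np.maximum(h + tau * (W @ h), 0.0)
        norms.append(np.linalg.norm(h) ** 2)
    return float(np.mean(norms))

L, m = 64, 128
stable   = forward_norm(tau=1.0 / np.sqrt(L), L=L, m=m)   # tau = L^{-1/2}
unstable = forward_norm(tau=L ** (-0.25),     L=L, m=m)   # tau = L^{-1/2+c}, c = 1/4

# tau = 1/sqrt(L) keeps ||h_L||^2 = O(1); the larger scaling inflates it sharply
assert stable < 15.0 and unstable > 10.0 * stable
```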
Proof
From the inequality (62) in the previous proof, we know for any \(k\in [m]\) and \(l\in [L-1]\),
Next we can compute the mean and variance of \(\sum _{a=1}^{l}\left( \tilde{z}_a\right) _{k}\) by taking iterative conditioning. We have
Then we have
where the first step is due to (62), the second step is due to the symmetry of the random variable \((\tilde{z}_a)_k\), and the third step is due to (66). The proposition is proved. \(\square\)
H More empirical studies
We do more experiments to demonstrate the points in Sect. 5.
Besides the basic feedforward structure in Sect. 5.1, we run another experiment to demonstrate that \(\tau =1/\sqrt{L}\) is sharp for practical structures (see Fig. 6). We can see that for ResNet110 and ResNet1202, \(\tau =1/L^{1/4}\) cannot train the network effectively.
One may wonder whether tuning the learning rate for \(\tau =1/L\) can match the validation accuracy of \(\tau =1/\sqrt{L}\). We run a new experiment to check this (see Table 3). Specifically, for ResNet110 with fixed \(\tau =1/L\) and \(\tau =1/\sqrt{L}\) on the CIFAR10 classification task, we tune the learning rate from 0.1 to 1.6 and record the validation accuracy in Table 3. We can see that ResNet110 with \(\tau =1/L\) performs worse than with \(\tau =1/\sqrt{L}\) even under a grid search over learning rates. It may be possible to obtain slightly better performance by adjusting the learning rate for \(\tau =1/L\), but this requires tuning for each depth. In contrast, we have shown that with \(\tau =1/\sqrt{L}\), one learning rate fits all nets of different depths.
Zhang, H., Yu, D., Yi, M. et al. Stabilize deep ResNet with a sharp scaling factor \(\tau\). Mach Learn 111, 3359–3392 (2022). https://doi.org/10.1007/s10994-022-06192-x