1 Introduction

The residual network (ResNet) has achieved great success in computer vision tasks since the seminal paper (He et al., 2016). The ResNet structure has also been extended to natural language processing, where it achieves state-of-the-art performance (Vaswani et al., 2017; Devlin et al., 2018). In this paper, we study the forward/backward stability and the convergence of training deep ResNet with gradient descent.

Specifically, we consider the following residual block (He et al., 2016),

$$\begin{aligned}&\text { residual block: } \quad h_{l}=\phi (h_{l-1}+\tau \mathcal {F}_l(h_{l-1})), \end{aligned}$$
(1)

where \(\phi (\cdot )\) is the point-wise activation function, \(h_l\) and \(h_{l-1}\) are the output and input of the residual block l, \(\mathcal {F}_l(\cdot )\) is the parametric branch, e.g., \(\mathcal {F}_l(h_{l-1}) = {\varvec{W}}_{l}h_{l-1}\) and \({\varvec{W}}_{l}\) is the trainable parameter, and \(\tau\) is a scaling factor on the parametric branch.

We note that standard initialization schemes, e.g., the Kaiming initialization or the Glorot initialization, are designed to keep the forward and backward variance constant when passing through one layer. However, things become different for ResNet because of the identity mapping. If \({\varvec{W}}_l\) adopts a standard initialization, a small \(\tau\) is necessary for a stable forward process of deep ResNet, because the output magnitude quickly explodes for \(\tau =1\) as \(L\) gets large. On the other hand, the limit \((1+1/L)^{L}\rightarrow e\) (Euler's number) indicates that \(\tau =O(1/L)\) is sufficient for the forward/backward stability, which is assumed in previous work (Allen-Zhu et al., 2018; Du et al., 2019b). We ask

“Are there other values of \(\tau\) that can guarantee the stability of ResNet with arbitrary depth?”

We answer the above question affirmatively by establishing a non-asymptotic analysis showing that stability is guaranteed for ResNet of arbitrary depth as long as \(\tau =O(1/\sqrt{L})\). Conversely, for any positive constant c, if \(\tau = L^{-\frac{1}{2}+c}\), the network output norm grows at least at rate \(L^c\) in expectation, which implies that the forward/backward process becomes unbounded as L gets large.

Going one step further, based on the stability result, we show that if the network is properly over-parameterized, gradient descent finds global minima when training ResNet with \(\tau \le \tilde{O}(1/\sqrt{L})\)Footnote 1. This is essentially different from previous work, which assumes \(\tau \le \tilde{O}(1/L)\) (Allen-Zhu et al., 2018; Du et al., 2019a; Frei et al., 2019).

Our contribution is summarized as follows.

  • We establish a non-asymptotic analysis showing that \(\tau =1/\sqrt{L}\) is sharp in the order sense to guarantee the stability of deep ResNet.

  • For \(\tau \le \tilde{O}(1/\sqrt{L})\), we establish the convergence of gradient descent to global minima for training over-parameterized ResNet with a depth-independent rate.

The key step in proving our first claim is a new bound on the spectral norm of the forward process of ResNet with \(\tau = O(1/\sqrt{L})\). We find that, although the natural bound \((1+1/\sqrt{L})^L\) explodes, the randomness of the trainable parameters in the parametric branch helps to control the growth of the output norm. Specifically, we bound the mean and the variance of the largest possible change induced by the deep residual mappings when \(\tau =O(1/\sqrt{L})\).

We also argue the advantage of adding \(\tau\) over other stabilization methods, such as batch normalization (BN) (Ioffe & Szegedy, 2015) and Fixup (Zhang et al., 2018a). First, it has an advantage over BN in guaranteeing stability. BN is architecture-agnostic and the output norm of ResNet with BN still grows unbounded as the depth increases. In practice, one has to employ a learning rate warm-up stage to train very deep ResNet even with BN (He et al., 2016). In comparison, we prove that ResNet with \(\tau\) is stable over all depths and hence does not require any learning rate warm-up stage. Second, it is also more stable than scaling down the initialization, the approach adopted in Fixup. Scaling down the initial residual weights does not scale down the gradients properly, and Fixup can explode after gradient descent updates for deep ResNet.

Finally, we corroborate the theoretical findings with extensive experiments. First, we demonstrate that with \(\tau =1/\sqrt{L}\), ResNet can be effectively trained without normalization layers; it is more stable and achieves better performance than Fixup. Second, we demonstrate that adding \(\tau =1/\sqrt{L}\) on top of the normalization layers yields even better performance.

1.1 Related works

There is a large volume of literature studying ResNet. We can only give a partial list.

To argue the benefit of skip connections, Veit et al. (2016) interpret ResNet as an ensemble of shallower networks, Zhang et al. (2018) study the local Hessian of residual blocks, Hardt and Ma (2016) show that deep linear residual networks have no spurious local optima, Orhan and Pitkow (2018) observe that skip connections eliminate the singularity, and Balduzzi et al. (2017) find that ResNet is more resistant to the gradient shattering problem than feedforward networks. However, these results mainly rely on empirical observation or strong model assumptions.

There are also several papers studying ResNet from the stability perspective (Arpit et al., 2019; Zhang et al., 2018a, b; Yang & Schoenholz, 2017; Haber & Ruthotto, 2017). In comparison, we study the model closest to the original ResNet and provide a rigorous non-asymptotic analysis for the stability when \(\tau =O(1/\sqrt{L})\) and a converse result showing the sharpness of \(\tau\). We also demonstrate the empirical advantage of learning ResNet with \(\tau\).

Our work is also related to recent literature on the theory of learning deep neural networks with gradient descent in the over-parameterized regime. Many works (Jacot et al., 2018; Allen-Zhu et al., 2018; Du et al., 2019a; Chizat & Bach, 2018a; Zou et al., 2018; Zou & Gu, 2019; Arora et al., 2019a; Oymak & Soltanolkotabi, 2019; Chen et al., 2019; Ji & Telgarsky, 2019) use the Neural Tangent Kernel (NTK) or similar techniques to argue the global convergence of gradient descent for training over-parameterized deep neural networks. Some (Brutzkus et al., 2017; Li & Liang, 2018; Allen-Zhu et al., 2019a; Arora et al., 2019b; Cao & Gu, 2019; Neyshabur et al., 2019) study the generalization properties of over-parameterized neural networks. On the other side, there are papers (Ghorbani et al., 2019; Chizat et al., 2019; Yehudai & Shamir, 2019; Allen-Zhu & Li, 2019) discussing the limitations of the NTK approach in characterizing the behavior of neural networks. Additionally, several papers (Chizat & Bach, 2018b; Mei et al., 2018, 2019; Nguyen, 2019; Fang et al., 2019a) study the convergence of the weight distribution in the probabilistic space via gradient flow for two-layer or multi-layer networks. To the best of our knowledge, we are the first to provide the global convergence of learning ResNet in the regime of \(\tau \le O(1/\sqrt{L})\).

2 Preliminaries

Many residual network models have appeared since the seminal paper (He et al., 2016). Here we study a simplified ResNet that shares the same structure as He et al. (2016)Footnote 2, described as follows:

  • Input layer: \(h_{0}=\phi (\varvec{A}x)\), where \(x\in \mathbb {R}^{p}\) and \(\varvec{A}\in \mathbb {R}^{m\times p}\);

  • \(L-1\) residual blocks: \(h_{l}=\phi (h_{l-1}+\tau {\varvec{W}}_{l}h_{l-1})\), where \({\varvec{W}}_{l}\in \mathbb {R}^{m\times m}\);

  • A fully-connected layer: \(h_{L}=\phi ({\varvec{W}}_{L}h_{L-1})\), where \({\varvec{W}}_{L}\in \mathbb {R}^{m\times m}\);

  • Output layer: \(y={\varvec{B}}h_{L}\), where \({\varvec{B}}\in \mathbb {R}^{d\times m}\);

  • Initialization: The entries of \(\varvec{A}, {\varvec{W}}_l\) for \(l\in [L-1]\), \({\varvec{W}}_L\) and \({\varvec{B}}\) are independently sampled from \(\mathcal {N}(0,\frac{2}{m})\), \(\mathcal {N}(0,\frac{1}{m})\), \(\mathcal {N}(0,\frac{2}{m})\) and \(\mathcal {N}(0,\frac{1}{d})\), respectively;

where \(\phi (\cdot ):=\max \{0, \cdot \}\) is the ReLU activation function. We assume the input dimension is p, the intermediate layers have the same width m, and the output has dimension d. For a positive integer L, we use [L] to denote the set \(\{1, 2, ..., L\}\). We denote the values before activation by \(g_{0}=\varvec{A}x\), \(g_{l}=h_{l-1}+\tau {\varvec{W}}_{l}h_{l-1}\) for \(l\in [L-1]\) and \(g_{L}={\varvec{W}}_{L}h_{L-1}\). We use \(h_{i,l}\) and \(g_{i,l}\) to denote the values of \(h_{l}\) and \(g_{l}\), respectively, when the input vector is \(x_{i}\), and \({\varvec{D}}_{i,l}\) the diagonal activation matrix with \([{\varvec{D}}_{i,l}]_{k,k}=\varvec{1}_{\{(g_{i,l})_{k}\ge 0\}}\). We use the superscript \(^{(0)}\) to denote the value at initialization, e.g., \({\varvec{W}}_l^{(0)}\), \(h_{i,l}^{(0)}\), \(g_{i,l}^{(0)}\) and \({\varvec{D}}_{i,l}^{(0)}\). We may omit the subscript \(_i\) and the superscript \(^{(0)}\) when they are clear from the context to simplify the notation.

We introduce a notation \(\overrightarrow{{\varvec{W}}}:=({\varvec{W}}_1, {\varvec{W}}_2, \dots , {\varvec{W}}_L)\) to represent all the trainable parameters. We note that \(\varvec{A}\) and \({\varvec{B}}\) are fixed after initialization. Throughout the paper, we use \(\Vert \cdot \Vert\) to denote the \(l_2\) norm of a vector. We further use \(\Vert \cdot \Vert\) and \(\Vert \cdot \Vert _F\) to denote the spectral norm and the Frobenius norm of a matrix, respectively. Denote \(\Vert \overrightarrow{{\varvec{W}}}\Vert := \max _{l\in [L]}\Vert {\varvec{W}}_l\Vert\) and \(\Vert {\varvec{W}}_{[L-1]}\Vert := \max _{l\in [L-1]}\Vert {\varvec{W}}_l\Vert\).

The training data set is \(\{(x_i, y_i^*)\}_{i=1}^n\), where \(x_i\) is the feature vector and \(y_i^*\) is the target signal for \(i=1, ..., n\). We consider the objective function \(F(\overrightarrow{{\varvec{W}}}):=\sum _{i=1}^{n}F_{i}(\overrightarrow{{\varvec{W}}})\), where \(F_{i}(\overrightarrow{{\varvec{W}}}):=\ell ({\varvec{B}}h_{i,L}, y_{i}^{*})\) and \(\ell (\cdot )\) is the loss function. The model is trained by running the gradient descent algorithm. Though ReLU is nonsmooth, we abuse the word “gradient” to represent the value computed through back-propagation.
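As an illustration of the architecture and notation above, the following minimal NumPy sketch instantiates the ResNet of this section at initialization and runs one forward pass. The function names and the concrete sizes (p, m, d, L) are illustrative choices for this sketch, not values prescribed by the analysis.

```python
import numpy as np

def init_resnet(p, m, d, L, rng):
    # Initialization as in Sect. 2: A ~ N(0, 2/m), W_l ~ N(0, 1/m) for l < L,
    # W_L ~ N(0, 2/m), B ~ N(0, 1/d), all entries independent.
    A = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, p))
    Ws = [rng.normal(0.0, np.sqrt(1.0 / m), size=(m, m)) for _ in range(L - 1)]
    W_L = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m))
    B = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, m))
    return A, Ws, W_L, B

def forward(x, A, Ws, W_L, B, tau):
    relu = lambda v: np.maximum(v, 0.0)
    h = relu(A @ x)                      # h_0 = phi(Ax)
    hs = [h]
    for W in Ws:                         # residual blocks: h_l = phi(h_{l-1} + tau * W_l h_{l-1})
        h = relu(h + tau * (W @ h))
        hs.append(h)
    h = relu(W_L @ h)                    # fully-connected layer: h_L = phi(W_L h_{L-1})
    hs.append(h)
    return B @ h, hs                     # output y = B h_L and all intermediate h_l

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, m, d, L = 32, 128, 10, 100        # illustrative sizes
    x = rng.normal(size=p); x /= np.linalg.norm(x)
    params = init_resnet(p, m, d, L, rng)
    y, hs = forward(x, *params, tau=1.0 / np.sqrt(L))
    print(["%.3f" % np.linalg.norm(h) for h in hs[::20]])  # block output norms stay O(1)
```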

3 Forward and backward stability of ResNet

In this section, we establish the stability of training ResNet. We show that when \(\tau = O(1/\sqrt{L})\) the forward and backward passes are bounded at initialization and after small perturbations. On the converse side, for an arbitrary positive constant c, if \(\tau \ge L^{-\frac{1}{2}+c}\), the output magnitude grows at least polynomially with the depth at initialization. We also argue the advantage of using a factor \(\tau\) over other stabilization methods, such as BN and Fixup. The stability result forms the basis for establishing the global convergence in Sect. 4.

3.1 Forward process is bounded if \(\tau = O(1/\sqrt{L})\)

We first give a non-asymptotic bound on the forward process at initialization.

Theorem 1

Suppose that \(\overrightarrow{{\varvec{W}}}^{(0)}\), \(\varvec{A}\) are randomly generated as in the initialization step, and \({\varvec{D}}^{(0)}_{i, 0},\dots ,{\varvec{D}}^{(0)}_{i, L}\) are diagonal activation matrices for \(i\in [n]\). Suppose that c and \(\epsilon\) are arbitrary positive constants with \(0<\epsilon <1\). If \(\tau\) satisfies \(\tau ^2L \le \min \{ \frac{1}{2}\log (1+c), \frac{\log ^2(1+c)}{16(1+\log (1+2/\epsilon ))}\}\), then with probability at least \(1-3nL^2\cdot \exp \left( -m\right)\) over the initialization randomness, we have for any two integers \(a,b\in [L-1]\) with \(b>a\) and for all \(i\in [n]\),

$$\begin{aligned} \left\| {\varvec{D}}^{(0)}_{i,b}\left( \varvec{I}+\tau {\varvec{W}}_{b}^{(0)}\right) \cdots {\varvec{D}}^{(0)}_{i,a}\left( \varvec{I}+\tau {\varvec{W}}_{a}^{(0)}\right) \right\| \le \frac{1+c}{1-\epsilon }. \end{aligned}$$
(2)

The proof is based on Markov's inequality with recursive conditioning. The full proof is deferred to Appendix B. Here we give an outline.

Proof Outline

We omit the subscript i and the superscript (0) for simplicity. Suppose that \(\Vert h_{a-1}\Vert =1\). Let \(g_l = h_{l-1}+\tau {\varvec{W}}_lh_{l-1}\) and \(h_l = {\varvec{D}}_l g_{l}\) for \(l\in \{a,..., b\}\). We have

$$\begin{aligned} \Vert h_{b}\Vert ^2&= \frac{\Vert h_{b}\Vert ^2}{\Vert h_{b-1}\Vert ^2} \cdots \frac{\Vert h_{a}\Vert ^2}{\Vert h_{a-1}\Vert ^2}\Vert h_{a-1}\Vert ^2 \le \frac{\Vert g_{b}\Vert ^2}{\Vert h_{b-1}\Vert ^2} \cdots \frac{\Vert g_{a}\Vert ^2}{\Vert h_{a-1}\Vert ^2}\Vert h_{a-1}\Vert ^2, \end{aligned}$$

where the inequality is due to \(\Vert {\varvec{D}}_l\Vert \le 1\). Taking the logarithm on both sides, we have

$$\begin{aligned} \log {\Vert h_{b}\Vert ^2}\le \sum _{l=a}^{b}\log \Delta _{l},\quad \text {where } \Delta _{l} := \frac{\Vert g_{l}\Vert ^2}{\Vert h_{l-1}\Vert ^2}. \end{aligned}$$

Letting \(\tilde{h}_{l-1} := \frac{h_{l-1}}{\Vert h_{l-1}\Vert }\), we obtain

$$\begin{aligned} \begin{aligned} \log {\Delta _{l}}&= \log \left( 1 + 2\tau \left\langle \tilde{h}_{l-1},{\varvec{W}}_{l}\tilde{h}_{l-1} \right\rangle + \tau ^{2}\Vert {\varvec{W}}_{l}\tilde{h}_{l-1}\Vert ^{2}\right) \\&\le 2\tau \left\langle \tilde{h}_{l-1},{\varvec{W}}_{l}\tilde{h}_{l-1} \right\rangle + \tau ^{2}\Vert {\varvec{W}}_{l}\tilde{h}_{l-1}\Vert ^{2}, \end{aligned} \end{aligned}$$

where the inequality is because \(\log (1+x) < x\) for \(x>-1\). Let \(\xi _{l} := 2\tau \left\langle \tilde{h}_{l-1},{\varvec{W}}_{l}\tilde{h}_{l-1} \right\rangle\) and \(\zeta _{l}:= \tau ^{2}\Vert {\varvec{W}}_{l}\tilde{h}_{l-1}\Vert ^{2}\). Then given \(\tilde{h}_{l-1}\), we have \(\xi _{l}\sim \mathcal {N}\left( 0, \frac{4\tau ^2}{m}\right)\), \(\zeta _{l}\sim \frac{\tau ^{2}}{m}\chi _{m}^2\).

We can argue that \(\sum _{l=a}^{b} \xi _l \sim \mathcal {N}\left( 0, \frac{4(b-a)\tau ^2}{m}\right)\) and \(\sum _{l=a}^{b} \zeta _l \sim \frac{(b-a)\tau ^{2}}{m}\chi _{m}^2\). Hence for an arbitrary positive constant \(c_1\), if \(\tau ^2L \le c_1/4\) then \(\sum _{l=a}^{b}\log \Delta _{l}\le c_1\) with probability at least \(1- 3\exp (-\frac{mc_1^2}{64\tau ^2L})\). We then convert the condition on \(c_1\) to that on c in the theorem. Using an \(\varepsilon\)-net argument, we can establish the spectral norm bound uniformly over all vectors \(h_{a-1}\). Letting a and b vary from 1 to \(L-1\) and taking the union bound gives the claim. The full proof is presented in Appendix B. \(\square\)

We note that the constants c and \(\epsilon\) can be chosen arbitrarily small such that \((1+c)/(1-\epsilon )\) is arbitrarily close to 1 given a stronger assumption on \(\tau ^2L\). Theorem 1 indicates that the norm of every residual block output is upper bounded by \((1+c)/(1-\epsilon )\) if the input vector has norm 1, which demonstrates that the forward process is stable. This result is a bit surprising since for \(\tau = O(1/\sqrt{L})\) the natural bound on the spectral norm \(\Vert (\varvec{I}+\tau {\varvec{W}}^{(0)}_L)\cdots (\varvec{I}+\tau {\varvec{W}}^{(0)}_1)\Vert \le (1+\frac{1}{\sqrt{L}})^L\) explodes. The intuition is that the cross-product terms concentrate around their mean 0 because of the independent randomness of the matrices \({\varvec{W}}_l^{(0)}\), and the variance can be bounded at the same time.
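A quick Monte Carlo sketch makes this gap visible: it estimates the spectral norm of the random product directly and compares it with the naive bound \((1+1/\sqrt{L})^L\). For simplicity the sketch drops the activation matrices \({\varvec{D}}_l\), which can only shrink the norm; the width and the number of trials are illustrative choices, and the sketch is not part of the formal proof.

```python
import numpy as np

def product_spectral_norm(L, m, tau, rng):
    # Spectral norm of (I + tau*W_L) ... (I + tau*W_1) with W_l ~ N(0, 1/m) entrywise.
    M = np.eye(m)
    for _ in range(L):
        W = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, m))
        M = (np.eye(m) + tau * W) @ M
    return np.linalg.norm(M, 2)

rng = np.random.default_rng(0)
m = 128
for L in [10, 100, 1000]:
    tau = 1.0 / np.sqrt(L)
    observed = np.mean([product_spectral_norm(L, m, tau, rng) for _ in range(5)])
    naive = (1.0 + tau) ** L            # explodes roughly like e^{sqrt(L)}
    print(f"L={L:4d}  observed spectral norm ~ {observed:6.2f}   naive bound (1+1/sqrt(L))^L ~ {naive:.3g}")
```

The observed norm remains a moderate constant while the naive bound blows up, consistent with Theorem 1.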

Moreover, we can also establish a lower bound on the output norm of each residual block as follows.

Theorem 2

Suppose that c is an arbitrary constant with \(0<c<1\). If \(\tau ^2L \le \frac{1}{4}\log (1-c)^{-1}\), then with probability at least \(1-2nL^2\cdot \exp \left( -\frac{1}{32}m\log (1-c)^{-1}\right)\) over the randomness of \(\varvec{A}\in \mathbb {R}^{m\times p}\) and \(\overrightarrow{{\varvec{W}}}^{(0)}\in (\mathbb {R}^{m\times m})^{L}\) the following holds

$$\begin{aligned} \forall i\in [n],l\in [L-1]:\;\left\| h_{i,l}^{(0)}\right\| \ge 1-c. \end{aligned}$$
(3)

Proof

The proof is similar to that of Theorem 1 but harder. The high-level idea is to control the mean and the variance of the mapping of the intermediate residual blocks simultaneously by utilizing Markov's inequality with recursive conditioning. The full proof is deferred to Appendix C.1. \(\square\)

Combining these two theorems, we conclude that the norm of each residual block output concentrates around 1 with high probability \(1-O(nL^2) \exp (-\Omega (m))\). Moreover, these two theorems also hold for \(\overrightarrow{{\varvec{W}}}\) within a neighborhood of \(\overrightarrow{{\varvec{W}}}^{(0)}\), which is presented in Appendix C.2.

3.2 Backward process is bounded for \(\tau \le O(1/\sqrt{L})\)

For ResNet, the gradient with respect to the parameter is computed through back-propagation. For any input sample i, we denote \(\partial h_{i,l} := \frac{\partial F_i(\overrightarrow{{\varvec{W}}})}{\partial h_{i,l}}\) and \(\nabla _{{\varvec{W}}_{l}}F_i(\overrightarrow{{\varvec{W}}}) := \frac{\partial F_i(\overrightarrow{{\varvec{W}}})}{\partial {\varvec{W}}_{l}} =\left( \tau {\varvec{D}}_{i,l} \partial h_{i,l}\right) \cdot h_{i,l-1}^T\). Therefore, the gradient upper bound is guaranteed if \(h_{i,l}\) and \(\partial h_{i,l}\) are bounded for all blocks. We next show that the backward process is bounded for each individual sample at the initialization stage.

Theorem 3

For every input sample \(i\in [n]\) and for any positive constants c and \(\epsilon\) with \(0<\epsilon <1\), if \(\tau\) satisfies \(\tau ^2L \le \min \{ \frac{1}{2}\log (1+c), \frac{\log ^2(1+c)}{16(1+\log (1+2/\epsilon ))}\}\), then with probability at least \(1- 3nL^2\cdot \exp \left( -\frac{1}{4}mc^2\right)\) over the randomness of \(\varvec{A},{\varvec{B}}\) and \(\overrightarrow{{\varvec{W}}}^{(0)}\), the following holds \(\forall l\in [L-1]\)

$$\begin{aligned}&\Vert \nabla _{{\varvec{W}}_{l}}F_i(\overrightarrow{{\varvec{W}}}^{(0)})\Vert _{F}\le \frac{(1+c)^2}{(1-\epsilon )^2}(2\sqrt{2}+c)\tau \Vert \partial h_{i,L}\Vert , \;\; \Vert \nabla _{{\varvec{W}}_{L}}F_i(\overrightarrow{{\varvec{W}}}^{(0)})\Vert _{F}\le \frac{(1+c)}{(1-\epsilon )}\Vert \partial h_{i,L}\Vert . \end{aligned}$$
(4)

The full proof is deferred to Appendix 6. Here we give an outline.

Proof Outline

The argument is based on the back-propagation formula and Theorem 1. We omit the superscript \(^{(0)}\) for notation simplicity. For each \(i\in [n]\) and \(l\in [L-1]\), i.e., the residual layers, we have

$$\begin{aligned} \Vert \nabla _{{\varvec{W}}_{l}}F_i(\overrightarrow{{\varvec{W}}})\Vert _{F}&=\left\| \tau \left( {\varvec{D}}_{i,l} (\varvec{I}+\tau {\varvec{W}}_{l+1})^T\cdots {\varvec{D}}_{i,L-1}{\varvec{W}}_L^T{\varvec{D}}_{i,L} \partial h_{i,L}\right) h_{i, l-1}^T\right\| _F \\&\le \tau \Vert {\varvec{D}}_{i,l} (\varvec{I}+\tau {\varvec{W}}_{l+1})^T\cdots {\varvec{D}}_{i,L-1}\Vert \cdot \Vert {\varvec{W}}_L^T{\varvec{D}}_{i,L}\Vert \cdot \Vert \partial h_{i,L}\Vert \cdot \Vert h_{i,l-1}\Vert ,\\&\le \frac{(1+c)^2}{(1-\epsilon )^2}(2\sqrt{2}+c)\tau \Vert \partial h_{i,L}\Vert , \end{aligned}$$

where the last inequality is due to Theorem 1 and the spectral norm bound of \({\varvec{W}}_L\) given in Appendix A. The full proof is deferred to Appendix 6. \(\square\)

This theorem indicates that the gradient of a residual layer can be \(\tau\) times smaller than that of a usual feedforward layer. Moreover, for ResNet with \(\tau =1/\sqrt{L}\), the gradient norms of all layers are bounded independently of the depth, which allows us to use a depth-independent learning rate to train ResNets of all depths. This is essentially different from the feedforward case (Allen-Zhu et al., 2018; Zou & Gu, 2019). We note that the gradient upper bound also holds for \(\overrightarrow{{\varvec{W}}}\) within a neighborhood of \(\overrightarrow{{\varvec{W}}}^{(0)}\) (see details in Appendix C.2 and 6).
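As a numerical sanity check of this depth independence (illustrative sizes; not part of the proof), the sketch below runs one manual backward pass through the ResNet of Sect. 2 with the quadratic loss used later in Sect. 4 and reports the largest residual-layer gradient norm and the gradient norm of \({\varvec{W}}_L\): the residual-layer gradients carry the extra factor \(\tau\) from Theorem 3, and none of the norms grows with the depth.

```python
import numpy as np

def grad_norms(L, m, p, d, tau, rng):
    relu = lambda v: np.maximum(v, 0.0)
    # Initialization as in Sect. 2.
    A = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, p))
    Ws = [rng.normal(0.0, np.sqrt(1.0 / m), size=(m, m)) for _ in range(L - 1)]
    W_L = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m))
    B = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, m))
    x = rng.normal(size=p); x /= np.linalg.norm(x)
    y_star = rng.normal(size=d)

    # Forward pass, storing block outputs h_l and activation masks D_l.
    hs, Ds = [relu(A @ x)], []
    for W in Ws:
        g = hs[-1] + tau * (W @ hs[-1])
        Ds.append(g > 0)
        hs.append(relu(g))
    g_L = W_L @ hs[-1]
    D_L = g_L > 0
    h_L = relu(g_L)

    # Backward pass for F_i = 0.5 * ||B h_L - y*||^2.
    dh = B.T @ (B @ h_L - y_star)                               # dF/dh_L
    dg = D_L * dh
    grads = [np.linalg.norm(dg[:, None] * hs[-1][None, :])]     # ||grad W_L||_F
    dh = W_L.T @ dg
    for l in range(L - 2, -1, -1):                              # residual blocks L-1, ..., 1
        dg = Ds[l] * dh
        grads.append(tau * np.linalg.norm(dg[:, None] * hs[l][None, :]))  # ||grad W_{l+1}||_F
        dh = dg + tau * (Ws[l].T @ dg)                          # (I + tau W_{l+1})^T dg
    return max(grads[1:]), grads[0]

rng = np.random.default_rng(0)
for L in [10, 100, 1000]:
    res, last = grad_norms(L, m=128, p=32, d=10, tau=1.0 / np.sqrt(L), rng=rng)
    print(f"L={L:4d}  max residual-layer grad ~ {res:.3f}   W_L grad ~ {last:.3f}")
```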

3.3 A converse result for \(\tau > L^{-\frac{1}{2}+c}\)

We have established the stability of the forward/backward process for \(\tau = O(1/\sqrt{L})\). We next establish a converse result showing that if \(\tau\) is slightly larger than \(L^{-\frac{1}{2}}\) in the order sense, the network output norm grows uncontrollably as the depth L increases. This justifies the sharpness of the value \(\tau = 1/\sqrt{L}\). Without loss of generality, we assume \(\Vert h_0\Vert =1\).

Theorem 4

Suppose that c is an arbitrary positive constant and the ResNet is defined and initialized as in Sect. 2. If \(\tau \ge L^{-\frac{1}{2}+c}\), then we have

$$\begin{aligned} \mathbb {E}\Vert h_{L}\Vert ^2 \ge \frac{1}{2}L^{2c}. \end{aligned}$$
(5)

Proof

The proof is based on a new inequality \((h_{l})_{k}\ge \phi \left( \sum _{a=1}^{l}\left( \tau {\varvec{W}}_{a}h_{a-1}\right) _{k}\right)\) for \(l\in [L-1]\) and for \(k\in [m]\). By the symmetry of Gaussian variables and the recursive conditioning, we can compute the expectation of \(\Vert h_L\Vert ^2\) exactly. The whole proof is relegated to Appendix G. \(\square\)

This indicates that \(\tau = O(1/\sqrt{L})\) is sharp for guaranteeing the forward stability of deep ResNet. We note that Theorems 1 and 3 hold with high probability when \(m> \Omega (\log L)\) and Theorem 2 holds with high probability when \(m>\Omega (\log (nL))\). These are very mild conditions on the width m, which are satisfied by practical networks.
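The dichotomy can also be observed numerically. The sketch below (an illustration, not part of the proof) tracks the squared norm of the residual stack output, starting from a unit-norm nonnegative vector as a stand-in for \(h_0\): with \(\tau = L^{-1/2}\) it stays O(1), while with \(\tau = L^{-1/2+c}\) it exceeds the \(\frac{1}{2}L^{2c}\) lower bound of Theorem 4 (and in fact grows much faster). Sizes and trial counts are illustrative.

```python
import numpy as np

def resblock_output_sq_norm(L, m, tau, rng):
    # ||h_{L-1}||^2 after L-1 residual blocks h_l = relu(h_{l-1} + tau*W_l h_{l-1}),
    # starting from a unit-norm nonnegative vector as a stand-in for h_0.
    relu = lambda v: np.maximum(v, 0.0)
    h = relu(rng.normal(size=m)); h /= np.linalg.norm(h)
    for _ in range(L - 1):
        W = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, m))
        h = relu(h + tau * (W @ h))
    return np.linalg.norm(h) ** 2

rng = np.random.default_rng(0)
m, c, trials = 128, 0.25, 3
for L in [16, 256, 1024]:
    stable = np.mean([resblock_output_sq_norm(L, m, L ** -0.5, rng) for _ in range(trials)])
    growing = np.mean([resblock_output_sq_norm(L, m, L ** (-0.5 + c), rng) for _ in range(trials)])
    bound = 0.5 * L ** (2 * c)
    print(f"L={L:5d}  tau=L^(-1/2): {stable:7.2f}   tau=L^(-1/2+c): {growing:12.4g}   0.5*L^(2c) = {bound:6.1f}")
```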

3.4 Comparison with other approaches for stability

Up to now, we have provided a sharp value of \(\tau\) for determining the stability of deep ResNet. In practice, two other approaches are used in residual networks to provide stability: adding normalization layers, e.g., batch normalization (BN) (Ioffe & Szegedy, 2015), and scaling down the initial residual weights, e.g., Fixup (Zhang et al., 2018a). Next, we discuss BN and Fixup from the stability perspective and compare them with adding \(\tau =1/\sqrt{L}\).

Batch normalization is placed right after each convolutional layer in (He et al., 2016). Here, for the ResNet model defined in Sect. 2, we put BN after each parametric branch, and the residual block becomes \(h_l = \phi (h_{l-1} + \tilde{z}_l)\), where \((\tilde{z}_{i,l})_k := \text {BN}\left( (z_{i,l})_k\right) = \frac{(z_{i,l})_k- \mathbb {E}[(z_{\cdot , l})_k]}{\sqrt{\mathrm {Var}[(z_{\cdot , l})_k]}}\) and \((z_{i,l})_k := \left( {\varvec{W}}_lh_{i,l-1}\right) _k\) for \(k\in [m]\) and \(l\in [L-1]\), and the expectation and the variance are taken over the samples in a mini-batch. Then we have \(\mathbb {E} (\tilde{z}_{\cdot , l})_k =0\) and \(\mathrm {Var}[(\tilde{z}_{\cdot ,l})_k] = 1\). We use the following proposition to estimate the norm of each residual block output for ResNet with BN.

Proposition 1

Assume that the \((\tilde{z}_{l})_k\) are independent random variables over l, k with \(\mathbb {E} (\tilde{z}_{l})_k =0\) and \(\mathrm {Var}[(\tilde{z}_{l})_k] = 1\). The output norm of residual block l satisfies \(\mathbb {E}\Vert h_{l}\Vert ^2 \ge \frac{1}{2}ml\), for \(l\in [L-1]\).

Proof

The proof is adapted from the proof of Theorem 4, and is presented in Appendix G. \(\square\)

This indicates that the block output norm of ResNet with BN grows roughly at the rate \(\sqrt{l}\) at the initialization stage, where l is the block index (the larger l, the closer to the output). To verify this, we plot how the output norm of each residual block grows for ResNet1202 (with/without BN)Footnote 3 in Fig. 1. We see that at epoch 0 (the initialization stage), the output norm grows almost at the rate \(\sqrt{l}\), as predicted by Proposition 1. After training, the estimate in Proposition 1 is not as accurate as at initialization because the independence assumption no longer holds. Besides the output norm growth, in practice He et al. (2016) have to use a learning rate warm-up to train very deep ResNets, e.g., ResNet1202\(+\)BN. In contrast, we prove that the approach of adding \(\tau =1/\sqrt{L}\) is stable over all depths and hence does not require any learning rate warm-up stage.
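The \(\sqrt{l}\) growth can also be reproduced directly from the BN recursion above. The following sketch simulates \(h_l = \phi (h_{l-1}+\text{BN}({\varvec{W}}_lh_{l-1}))\) on a random mini-batch at initialization; the batch size, width, and depth are illustrative choices, and the measured ratio matches \(\sqrt{l}\) only up to a constant factor.

```python
import numpy as np

def bn_resnet_norm_growth(L, m, p, batch, rng):
    relu = lambda v: np.maximum(v, 0.0)
    # Mini-batch of unit-norm inputs and the first layer h_0 = phi(Ax), as in Sect. 2.
    X = rng.normal(size=(batch, p)); X /= np.linalg.norm(X, axis=1, keepdims=True)
    A = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, p))
    H = relu(X @ A.T)                                   # rows are h_{i,0}
    norms = []
    for l in range(1, L):
        W = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, m))
        Z = H @ W.T
        Z_tilde = (Z - Z.mean(axis=0)) / Z.std(axis=0)  # BN over the mini-batch, per coordinate
        H = relu(H + Z_tilde)                           # h_l = phi(h_{l-1} + BN(W_l h_{l-1}))
        norms.append(np.linalg.norm(H, axis=1).mean())
    return np.array(norms) / norms[0]                   # norm ratio relative to the first block

rng = np.random.default_rng(0)
growth = bn_resnet_norm_growth(L=101, m=128, p=32, batch=256, rng=rng)
for l in [1, 4, 16, 64, 100]:
    print(f"block {l:3d}: norm ratio {growth[l - 1]:6.2f}   sqrt(l) = {np.sqrt(l):5.2f}")
```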

Fig. 1: The \(l_2\) norm of the residual block outputs in the first stage of ResNet1202 at epoch 0 and epoch 50. The X axis is the block index and the Y axis is the output norm ratio relative to the first block.

Fig. 2: Training curves of ResNets with Fixup on MNIST classification, with width \(m=128\) and learning rate \(\eta =0.01\).

Recently, Zhang et al. (2018a) proposed Fixup to train residual networks without normalization layers. Essentially, for each residual block, Fixup sets the weight matrix near the output to 0 at the initialization stage, and then scales down all other weight matrices by a factor determined by the network structure. However, in practice Fixup does not always converge when training very deep residual networks, as shown in Sect. 5.2. Moreover, for the ResNet model defined in Sect. 2, Fixup can be unstable after gradient updates. The residual block is given by \(h_l = \phi (h_{l-1} + {\varvec{W}}_l h_{l-1})\), and following Fixup, \({\varvec{W}}_l^{(0)}\) is initialized to 0 for \(l \in [L-1]\). At the initial stage, for input sample i, \(h_{i,l} = h_{i,0}\) and hence \(\nabla _{{\varvec{W}}_{l}}F_i = \partial h_{i,L-1}\cdot h_{i,0}^T\), which is the same for all \(l \in [L-1]\). Then after one gradient update the residual block mapping \(\prod _{l=1}^{L-1}{\varvec{D}}_{i,l}(\varvec{I}+\eta \cdot \nabla _{{\varvec{W}}_{l}}F_i )\) can behave like \(({\varvec{D}}(\varvec{I}+\eta \cdot \partial h_{i,L-1}\cdot h_{i,0}^T))^{L-1}\) when \({\varvec{D}}_{i,l}={\varvec{D}}\) for all l, which grows exponentially. Empirically, such an explosion is observed for deep ResNet with Fixup (see Fig. 2). In contrast, ResNet with \(\tau\) is stable across depths (see Fig. 3), as guaranteed by our theory.
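A toy numerical illustration of this rank-one blow-up argument (not the actual Fixup update rule; here u, v, and \(\eta\) are stand-ins for \(\partial h_{i,L-1}\), \(h_{i,0}\), and the learning rate): repeatedly applying \(\varvec{I}+\eta uv^T\) with \(\langle u,v\rangle >0\) grows geometrically at rate \(1+\eta \langle u,v\rangle\).

```python
import numpy as np

rng = np.random.default_rng(0)
m, eta = 128, 0.01
u = rng.normal(size=m)
v = u / np.linalg.norm(u)                      # ensures <u, v> = ||u|| > 0
M = np.eye(m) + eta * np.outer(u, v)           # toy stand-in for D(I + eta * dh * h0^T)

x = rng.normal(size=m)
x /= np.linalg.norm(x)
rate = 1.0 + eta * np.dot(v, u)                # leading eigenvalue of M
for k in range(1, 1001):
    x = M @ x
    if k in (10, 100, 1000):
        print(f"k={k:4d}: ||M^k x|| = {np.linalg.norm(x):.3e}   (leading rate^k = {rate ** k:.3e})")
```

The constant offset between the two printed columns reflects the projection of the initial vector onto the growing direction; the exponential rate is the same.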

4 Global convergence for over-parameterized ResNet

In this section, we establish that gradient descent converges to global minima for learning an over-parameterized ResNet with \(\tau \le \tilde{O}(1/\sqrt{L})\). Compared to the recent work (Allen-Zhu et al., 2018), our result significantly enlarges the region of \(\tau\) that admits the global convergence of gradient descent. Moreover, our result also theoretically justifies the advantage of ResNet over vanilla feedforward network in terms of facilitating the convergence of gradient descent. Before stating the theorem, we introduce common assumptions on the training data and the loss function (Allen-Zhu et al., 2018; Zou & Gu, 2019; Oymak & Soltanolkotabi, 2019).

Assumption 1

(training data) For any \(x_i\), it holds that \(\Vert x_i\Vert =1\) and \((x_i)_p = 1/\sqrt{2}\). There exists \(\delta >0\), such that \(\forall i,j \in [n], i\ne j, \Vert x_{i}-x_{j}\Vert \ge \delta\).

The loss function \(\ell (\cdot , \cdot )\) is quadratic and the individual objective is \(F_{i}(\overrightarrow{{\varvec{W}}}):=\frac{1}{2}\Vert {\varvec{B}}h_{i,L}-y_{i}^{*}\Vert ^{2}\). We note that the assumption \((x_i)_p=1/\sqrt{2}\) means that the last coordinate of every \(x_i\) is \(1/\sqrt{2}\). This gives a random bias term after the first layer \({\varvec{A}}(\cdot )\), which simplifies the proof of the gradient lower bound in Lemma 6. This assumption is made for proof convenience rather than being something that must be satisfied in practice.

Theorem 5

Suppose that the ResNet is defined and initialized as in Sect. 2 with \(\tau \le O(1/(\sqrt{L}\log m))\) and the training data satisfy Assumption 1. If the network width \(m\ge \Omega (n^{8} L^7\delta ^{-4}d\log ^2 m)\), then with probability at least \(1-\exp (-\Omega (\log ^{2}m))\), gradient descent with learning rate \(\eta =\Theta (\frac{d}{nm})\) finds a point with \(F(\overrightarrow{{\varvec{W}}})\le \varepsilon\) in \(T=\Omega (n^2\delta ^{-1}\log \frac{n\log ^2 m}{\varepsilon })\) iterations.

Proof

The full proof is deferred to Appendix F. \(\square\)

This theorem establishes the linear convergence of gradient descent for learning ResNet in the range \(\tau \le O(1/(\sqrt{L}\log m))\). Combined with the unstable case of \(\tau > 1/\sqrt{L}\) in Sect. 3.3, this gives a nearly complete characterization of the convergence in terms of the range of \(\tau\). Moreover, our result indicates that the learning rate and the total number of iterations are depth-independent. We note that a recent paper (Frei et al., 2019) also achieves a depth-independent rate but only for the case \(\tau \le O(1/(L\log m))\), and its proof critically relies on the choice of \(\tau =1/L\). The over-parameterization dependence and the number of iterations are not directly comparable, as we study the regression problem while Frei et al. (2019) consider the classification problem under a different data assumption. Other previous results (Allen-Zhu et al., 2018; Du et al., 2019a) characterize the convergence guarantee only for the case \(\tau \le O(1/(L\log m))\), and their total number of iterations scales with the order \(L^2\). Our depth-independent results are achieved by tighter smoothness and gradient upper bounds.
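A minimal PyTorch sketch of the setting of Theorem 5 is given below. It uses tiny illustrative sizes (far below the width the theorem requires), data normalized as in Assumption 1, fixed \(\varvec{A},{\varvec{B}}\), and full-batch gradient descent with a learning rate of order \(d/(nm)\); the constant 0.2 in the learning rate and the choice \(\tau =1/\sqrt{L}\) are illustrative, and the sketch is a sanity check of the training procedure rather than a verification of the theorem.

```python
import math
import torch

torch.manual_seed(0)
p, m, d, L, n = 32, 512, 10, 50, 8           # illustrative sizes (Theorem 5 needs far larger m)
tau = 1.0 / math.sqrt(L)                     # theorem uses tau <= O(1/(sqrt(L) log m)); illustrative choice

# Data normalized as in Assumption 1 (unit norm, last coordinate 1/sqrt(2)).
X = torch.randn(n, p - 1)
X = X / X.norm(dim=1, keepdim=True) / math.sqrt(2.0)
X = torch.cat([X, torch.full((n, 1), 1.0 / math.sqrt(2.0))], dim=1)
Y = torch.randn(n, d)

# Initialization as in Sect. 2; A and B stay fixed, only W_1, ..., W_L are trained.
A = torch.randn(m, p) * math.sqrt(2.0 / m)
B = torch.randn(d, m) * math.sqrt(1.0 / d)
Ws = [(torch.randn(m, m) * math.sqrt(1.0 / m)).requires_grad_() for _ in range(L - 1)]
W_L = (torch.randn(m, m) * math.sqrt(2.0 / m)).requires_grad_()

def objective():
    H = torch.relu(X @ A.t())
    for W in Ws:
        H = torch.relu(H + tau * H @ W.t())
    H = torch.relu(H @ W_L.t())
    return 0.5 * ((H @ B.t() - Y) ** 2).sum()   # quadratic loss of Sect. 4

eta = 0.2 * d / (n * m)                         # learning rate of order d/(nm), conservative constant
for step in range(1001):
    loss = objective()
    loss.backward()
    with torch.no_grad():
        for W in Ws + [W_L]:
            W -= eta * W.grad
            W.grad.zero_()
    if step % 200 == 0:
        print(f"step {step:4d}: F = {loss.item():10.4f}")
```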

In the analysis of the feedforward case (Allen-Zhu et al., 2018; Zou & Gu, 2019), the learning rate has to scale with \(1/L^2\) and the total number of iterations scales with \(L^2\) for the convergence of learning a feedforward network. Therefore, our result theoretically justifies the advantage of ResNet over the vanilla feedforward network in terms of facilitating the convergence of gradient descent.

Finally, we add a remark on the width requirement in Theorem 5. The width grows polynomially with the number of training examples. This dependence arises because we need more neurons to sufficiently distinguish the data points as the number of examples grows, which is common for the regression task (Allen-Zhu et al., 2019b; Zou & Gu, 2019). The dependence could be avoided by assuming that the training data follow specific distributions, for the classification task (Cao & Gu, 2020). However, this is orthogonal to our main claim that ResNet converges with a depth-independent rate.

Fig. 3: Training curves for PlainNet and ResNet with \(\tau =\frac{1}{L}\), \(\tau =\frac{1}{\sqrt{L}}\) and \(\tau =\frac{1}{L^{1/4}}\) (from left to right). Markers denote that training encounters numerical overflow.

5 Empirical study

In this section, we present experiments to verify our theory and show the practical value of ResNet with \(\tau\). We first compare the performance of ResNet with different \(\tau\)'s and demonstrate that \(\tau =\frac{1}{\sqrt{L}}\) is a sharp value in determining the trainability of deep ResNet. We then compare the performance of adding the factor \(\tau\) and of using the Fixup initialization when training the popular residual networks without normalization layers. We finally show that with normalization layers, adding \(\tau\) also significantly improves the performance on both CIFAR and ImageNet tasks. The source code is available online at https://github.com/dayu11/tau-ResNet.

5.1 Theoretical verification

We train feedforward fully-connected neural networks (PlainNet) and ResNets with different values of \(\tau\), and compare their convergence behaviors. Specifically, for ResNets, we adopt exactly the same residual architecture as described in Eq. (1) and Sect. 2. The PlainNet adopts the same architecture as the ResNets but without the skip connections. The models are generated with width \(m=128\) and depth \(L\in \{10, 100, 1000\}\). For ResNets with \(\tau\), we choose \(\tau =\frac{1}{L}, \frac{1}{\sqrt{L}},\frac{1}{L^{1/4}}\) to show the sharpness of the value \(\frac{1}{\sqrt{L}}\). We conduct classification on the MNIST dataset (LeCun et al., 1998). We train the models with SGDFootnote 4 with a minibatch size of 256. The learning rate is set to \(\eta =0.01\) for all networks without tuning.

We plot the training curves in Fig. 3. For ResNets with \(\tau\), we see that both \(\tau =\frac{1}{L}\) and \(\tau =\frac{1}{\sqrt{L}}\) are able to train very deep ResNets successfully, and \(\tau =\frac{1}{\sqrt{L}}\) achieves lower training loss than \(\tau =\frac{1}{L}\). For \(\tau =\frac{1}{L^{1/4}}\), the training loss explodes for models with depth 100 and 1000. This indicates that the bound \(\tau =\frac{1}{\sqrt{L}}\) is sharp for learning deep ResNets. Moreover, the convergence of ResNets with \(\tau =\frac{1}{\sqrt{L}}\) does not depend on the depth, while training the feedforward network becomes harder as the depth increases, corroborating our theory nicely.

To clearly see the benefit of \(\tau =\frac{1}{\sqrt{L}}\) over \(\tau =\frac{1}{L}\), we conduct the classification task on the CIFAR10 dataset (Krizhevsky & Hinton, 2009) with the residual networks from He et al. (2016). Slightly different from the model described in Sect. 2, here each residual block is composed of two stacked convolutional layers. We argue that our theoretical analysis still applies if one treats the number of channels in a convolutional layer as the width in Sect. 2. We plot the training/validation curves in Fig. 4. We can see that with \(\tau =\frac{1}{\sqrt{L}}\), both ResNet110 and ResNet1202 can be trained to good accuracy without BN. In contrast, with \(\tau =\frac{1}{L}\), the performance of ResNet110 and ResNet1202 drops considerably.

In the sequel, we use “adding \(\tau ^*\)” or “\(+\tau ^*\)” to denote a residual network with \(\tau =\frac{1}{\sqrt{L}}\).

Fig. 4: Training/validation curves of ResNet110/1202 with \(\tau =1/\sqrt{L}\) and \(\tau =1/L\) on the CIFAR10 classification task. We use the models from He et al. (2016) with all BN layers removed.

Fig. 5: Validation error bar charts for the CIFAR classification tasks. Numbers are averages of 5 runs with standard deviations. The deeper the network, the larger the benefit of \(\tau ^*\).

5.2 Comparison of adding \(\tau ^*\) and using Fixup

In this section we compare our approach of adding \(\tau ^*\) with the approach of using Fixup for training residual networks without BN. We conduct the classification task on the CIFAR10 dataset. We use the residual models in (He et al., 2016) with all normalization layers removed. For the Fixup approach, we use the code from their GitHub repository with the same hyperparameter setting. We note that Fixup has a learnable scalar, with initial value 1, on the output of the parametric branch in each residual block, which is equivalent to setting \(\tau =1\). For our approach, we use the same model as Fixup but set \(\tau =\frac{1}{\sqrt{L}}\) and use the Kaiming initialization instead of the Fixup initialization.

The results are presented in Table 1. We can see that our approach achieves much better performance than the Fixup approach over all depths. Moreover, the Fixup approach fails to converge in 2 out of 5 runs when training ResNet1202, and hence the standard deviation is not presented in Table 1.

Table 1 Validation errors of ResNets+Fixup and ResNets+\(\tau ^*\) on CIFAR10. Numbers are averages of 5 runs with standard deviations
Table 2 Top1 validation error on ImageNet. The models are adapted from He et al. (2016)

5.3 Add \(\tau ^*\) on top of normalization

In this section, we empirically show that adding \(\tau ^*\) to the residual block with batch normalization also helps to achieve better performance. We conduct experiments on standard classification datasets: CIFAR10/100 and ImageNet. The baseline models are the residual networks in He et al. (2016). We note that the residual block here includes batch normalization, which is discussed in Sect. 3.4 but not precisely covered by the theoretical model (Sect. 2). For our approach, the only modification is adding a fixed \(\tau =\frac{1}{\sqrt{L}}\) at the output of each residual block (right before the residual addition). We also tried a learnable \(\tau\) but did not observe a gain, which may be because the BN layers already have learnable scaling factors. The validation errors on CIFAR10/100 are illustrated in Fig. 5, where all numbers are averaged over five runs. The performance of adding \(\tau ^*\) is much better than that of the baseline models, and the benefit of adding \(\tau ^*\) becomes larger as the network gets deeper. We note that one needs a learning rate warm-up to successfully train ResNet1202+BN, while with \(\tau ^*\) we use the same learning rate schedule for all depths.

As the models for ImageNet classification have different numbers of residual blocks in each stage, we choose \(\tau ^*=\frac{1}{\sqrt{L}}\) where L is the average number of blocks over all stages. We take the average instead of the sum because there is a BN layer on the output of each stage. All models are trained for 200 epochs with the learning rate divided by 10 every 60 epochs. The other hyperparameters are the same as in He et al. (2016). Table 2 shows the top-1 validation error on ImageNet. We can see that just by adding \(\tau ^*\) on top of BN we achieve a significant performance gain.
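For concreteness, a sketch of the modification in PyTorch is given below. It is a simplified re-implementation of a CIFAR-style basic block rather than the released code (see the repository linked in Sect. 5); the class name, the channel count, and the choice of using the total residual block count as L are illustrative assumptions of the sketch.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlockTau(nn.Module):
    """CIFAR-style basic residual block (conv-BN-ReLU-conv-BN) with a fixed tau* before the addition.
    Identity shortcut only; stride/channel changes are omitted for brevity."""
    def __init__(self, channels, num_blocks):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        # Fixed, non-learnable tau* = 1/sqrt(L); for the multi-stage ImageNet models above,
        # L is taken to be the average number of blocks per stage.
        self.tau = 1.0 / math.sqrt(num_blocks)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(x + self.tau * out)   # tau* applied right before the residual addition

# Illustration: ResNet1202 for CIFAR has (1202 - 2) / 6 = 200 blocks per stage (600 in total);
# using the total block count for L here is an assumption of this sketch.
block = BasicBlockTau(channels=16, num_blocks=600)
print(block(torch.randn(2, 16, 32, 32)).shape, round(block.tau, 4))
```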

6 Conclusion

In this paper, we provide a non-asymptotic analysis of the forward/backward stability of ResNet, which unveils that \(\tau =1/\sqrt{L}\) is a sharp value in terms of characterizing the stability. We also bridge the theoretical understanding and the practical design of the ResNet structure. We empirically verify the efficacy of adding \(\tau\) for ResNet with and without batch normalization. As the residual block is also widely used in the Transformer model (Vaswani et al., 2017), it is interesting to study the effect of \(\tau\) and layer normalization there.