1 Introduction

The design of neural networks is often considered a black art, driven by trial and error rather than foundational principles. This is exemplified by the success of recent architecture random-search techniques [22, 37], which take the extreme approach of applying no human guidance at all. Although as a field we are far from fully understanding the nature of learning and generalization in neural networks, this does not mean that we should proceed blindly.

In this work, we define a scaling quantity \(\gamma _{l}\) for each layer l that approximates two quantities of interest when considering the optimization of a neural network: The ratio of the gradient to the weights, and the average squared singular value of the corresponding diagonal block of the Hessian for layer l. This quantity is easy to compute from the (non-central) second moments of the forward-propagated values and the (non-central) second moments of the backward-propagated gradients. We argue that networks that have constant \(\gamma _{l}\) are better conditioned than those that do not, and we analyze how common layer types affect this quantity. We call networks that obey this rule preconditioned neural networks.

As an example of some of the possible applications of our theory, we:

  • Propose a principled weight initialization scheme that can often provide an improvement over existing schemes;

  • Show which common layer types automatically result in well-conditioned networks;

  • Show how to improve the conditioning of common structures such as bottlenecked residual blocks by the addition of fixed scaling constants to the network.

2 Notation

Consider a neural network mapping \(x_{0}\) to \(x_{L}\) made up of L layers. These layers may be individual operations or blocks of operations. During training, a loss function is computed for each minibatch of data, and the gradient of the loss is back-propagated to each layer l and weight of the network. We prefix each quantity with \(\varDelta \) to represent the back-propagated gradient of that quantity. We assume a batch-size of 1 in our calculations, although all conclusions hold using mini-batches as well.

Each layer’s input activations are represented by a tensor \(x_{l}:n_{l}\times \rho _{l}\times \rho _{l}\) made up of \(n_{l}\) channels, and spatial dimensions \(\rho _{l}\times \rho _{l}\), assumed to be square for simplicity (results can be adapted to the rectangular case by using \(h_{l}w_{l}\) in place of \(\rho _{l}\) everywhere).

3 A model of ReLU network dynamics

Our scaling calculus requires the use of simple approximations of the dynamics of neural networks, in the same way that simplifications are used in physics to make approximate calculations, such as the assumptions of zero friction or ideal gases. These assumptions constitute a model of the behavior of neural networks that allows for easy calculation of quantities of interest, while still being representative enough of the real dynamics.

To this end, we will focus in this work on the behavior of networks at initialization. Furthermore, we will make strong assumptions on the statistics of forward and backward quantities in the network. These assumptions include:

  1. The input to layer l, denoted \(x_{l}\), is a random tensor assumed to contain i.i.d. entries. We represent its element-wise uncentered second moment by \(E[x_{l}^{2}]\).

  2. The back-propagated gradient of \(x_{l}\) is \(\varDelta x_{l}\), assumed to be i.i.d. and uncorrelated with \(x_{l}\). We represent the uncentered second moment of \(\varDelta x_{l}\) by \(E[\varDelta x_{l}^{2}]\).

  3. All weights in the network are initialized i.i.d. from a centered, symmetric distribution.

  4. All bias terms are initialized to zero.

Our calculations rely heavily on the uncentered second moments rather than the variance of weights and gradients. This is a consequence of the behavior of the ReLU activation, which zeros out entries. The effect of this zeroing operation is simple when considering uncentered second moments under a symmetric input distribution, as half of the entries will be zeroed, resulting in a halving of the uncentered second moment. In contrast, expressing the same operation in terms of variance is complicated by the fact that the mean after application of the ReLU is distribution-dependent. We will refer to the uncentered second moment just as the “second moment” henceforth.
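As a quick empirical check of this halving behavior, the following minimal sketch (our own illustration, using standard normal inputs as an example of a symmetric distribution) compares the second moment before and after a ReLU:

```python
# Minimal check: for a symmetric zero-mean input, ReLU zeroes half of the entries,
# so the uncentered second moment is halved.
import torch

x = torch.randn(1_000_000)                 # symmetric distribution, E[x^2] = 1
print(x.pow(2).mean().item())              # ~1.0
print(torch.relu(x).pow(2).mean().item())  # ~0.5
```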

4 Activation and layer scaling factors

The key quantity in our calculus is the activation scaling factor \(\varsigma _{l}\), of the input activations for a layer l, which we define as:

$$\begin{aligned} \varsigma _{l}=n_{l}\rho _{l}^{2}E[\varDelta x_{l}^{2}]E[x_{l}^{2}]. \end{aligned}$$
(1)

This quantity arises from its utility in computing other quantities of interest in the network, such as the scaling factors for the weights of convolutional and linear layers. In ReLU networks many, but not all, operations maintain this quantity, in the sense that \(\varsigma _{l}=\varsigma _{l+1}\) for a layer \(x_{l+1}=F(x_{l})\) with operation F, under the assumptions of Sect. 3. Table 1 lists common operations and indicates whether they maintain scaling. As an example, consider adding a simple scaling layer of the form \(x_{l+1}=\sqrt{2}x_{l}\), which doubles the forward second moment and likewise doubles the second moment of the back-propagated gradient (going from layer \(l+1\) to layer l). We can see that:

$$\begin{aligned} \varsigma _{l+1}&=n_{l+1}\rho _{l+1}^{2}E[\varDelta x_{l+1}^{2}]E[x_{l+1}^{2}]\\&=n_{l}\rho _{l}^{2}\frac{1}{2}E[\varDelta x_{l}^{2}]\cdot 2E[x_{l}^{2}]=\varsigma _{l} \end{aligned}$$

Our analysis focuses on ReLU networks primarily because ReLU nonlinearities maintain this scaling factor.
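The scaling factor can also be estimated empirically. The sketch below is our own illustration, not code from the paper; `net`, `x0` and `loss_fn` are placeholders, and NCHW activations are assumed. It records forward and backward second moments with hooks and forms \(\varsigma _{l}=n_{l}\rho _{l}^{2}E[\varDelta x_{l}^{2}]E[x_{l}^{2}]\) for each leaf module's input.

```python
# Sketch: estimating the activation scaling factor for each leaf module's input,
# varsigma = n * rho^2 * E[dx^2] * E[x^2], using forward/backward hooks.
import torch

def activation_scaling_factors(net, x0, loss_fn):
    stats = {}

    def make_hooks(name):
        def fwd(module, inputs, output):
            x = inputs[0]
            stats.setdefault(name, {})["fwd"] = (x.pow(2).mean().item(), tuple(x.shape))
        def bwd(module, grad_inputs, grad_outputs):
            g = grad_inputs[0]
            if g is not None:
                stats.setdefault(name, {})["bwd"] = g.pow(2).mean().item()
        return fwd, bwd

    handles = []
    for name, m in net.named_modules():
        if len(list(m.children())) == 0:  # leaf modules only
            f, b = make_hooks(name)
            handles.append(m.register_forward_hook(f))
            handles.append(m.register_full_backward_hook(b))

    loss_fn(net(x0)).backward()
    for h in handles:
        h.remove()

    factors = {}
    for name, s in stats.items():
        if "fwd" in s and "bwd" in s:
            (ex2, shape), edx2 = s["fwd"], s["bwd"]
            n = shape[1]                                # channel count n_l
            rho = shape[-1] if len(shape) == 4 else 1   # spatial resolution rho_l
            factors[name] = n * rho ** 2 * edx2 * ex2
    return factors
```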

Table 1 Scaling of common layers

Using the activation scaling factor, we define the layer or weight scaling factor of a convolutional layer with kernel \(k_{l}\times k_{l}\) as:

$$\begin{aligned} \gamma _{l}=\frac{\varsigma _{l}}{n_{l+1}n_{l}k_{l}^{2}E[W_{l}^{2}]^{2}}. \end{aligned}$$
(2)

Recall that \(n_{l}\) is the fan-in and \(n_{l+1}\) is the fan-out of the layer. This expression also applies to linear layers by taking \(k_{l}=1\). This quantity can also be defined extrinsically without reference to the weight initialization via the expression:

$$\begin{aligned} \gamma _{l}=n_{l}k_{l}^{2}\rho _{l}^{2}E\left[ x_{l}^{2}\right] ^{2}\frac{E[\varDelta x_{l+1}^{2}]}{E[x_{l+1}^{2}]}. \end{aligned}$$

We establish this equivalence under the assumptions of Sect. 3 in the Appendix.

5 Motivations for scaling factors

We can motivate the utility of our scaling factor definition by comparing it to another simple quantity of interest. For each layer, consider the ratio of the second moment of the gradient to that of the weights:

$$\begin{aligned} \nu _{l}\doteq \frac{E[\varDelta W_{l}^{2}]}{E[W_{l}^{2}]}. \end{aligned}$$

This ratio approximately captures the relative change that a single SGD step with unit step-size on \(W_{l}\) will produce. We call this quantity the weight-to-gradient ratio. When \(E[\varDelta W_{l}^{2}]\) is very small compared to \(E[W_{l}^{2}]\), the weights will stay close to their initial values for longer than when \(E[\varDelta W_{l}^{2}]\) is large. In contrast, if \(E[\varDelta W_{l}^{2}]\) is very large compared to \(E[W_{l}^{2}]\), then learning can be expected to be unstable, as the sign of the elements of W may change rapidly between optimization steps. A network with constant \(\nu _{l}\) is also well-behaved under weight-decay, as the ratio of weight-decay second moments to gradient second moments will stay constant throughout the network, keeping the push-pull of gradients and decay constant across the network. This ratio also captures a relative notion of exploding or vanishing gradients. Rather than consider if the gradient is small or large in absolute value, we consider its relative magnitude instead.

Theorem 1

The weight-to-gradient ratio \(\nu _{l}\) is equal to the scaling factor \(\gamma _{l}\) under the assumptions of Sect. 3.
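Measuring \(\nu _{l}\) after a single backward pass is straightforward; a minimal sketch (assuming a PyTorch module whose `.grad` fields have been populated by a backward pass) is:

```python
# Sketch: nu_l = E[dW_l^2] / E[W_l^2] per weight tensor, after loss.backward().
def weight_to_gradient_ratios(net):
    ratios = {}
    for name, p in net.named_parameters():
        if p.grad is not None and p.dim() > 1:  # weight matrices/kernels only; biases are zero at init
            ratios[name] = (p.grad.pow(2).mean() / p.pow(2).mean()).item()
    return ratios
```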

5.1 Conditioning of the Hessian

The scaling factor of a layer l is also closely related to the singular values of the diagonal block of the Hessian corresponding to that layer. We derive a correspondence in this section, providing further justification for our definition of the scaling factor above. For simplicity we focus on non-convolutional layers, although the result extends to the convolutional case without issue.

ReLU networks have a particularly simple structure for the Hessian for any set of activations, as the network’s output is a piecewise-linear function g fed into a final layer consisting of a loss. This structure results in greatly simplified expressions for diagonal blocks of the Hessian with respect to the weights, and allows us to derive expressions involving the singular values of these blocks.

We will consider the output of the network as a composition of two functions, the current layer g, and the remainder of the network h. We write this as a function of the weights, i.e. \(f(W_{l})=h(g(W_{l}))\). The dependence on the input to the network is implicit in this notation, and the network below layer l does not need to be considered.

Let \(R_{l}=\nabla _{x_{l+1}}^{2}h(x_{l+1})\) be the Hessian of h, the remainder of the network after application of layer l (for a linear layer, \(x_{l+1}=W_{l}x_{l}\)). Let \(J_{l}\) be the Jacobian of \(x_{l+1}\) with respect to \(W_{l}\); it has shape \(J_{l}:n_{l}^{\text {out}}\times \left( n_{l}^{\text {out}}n_{l}^{\text {in}}\right) \). Given these quantities, the diagonal block of the Hessian corresponding to \(W_{l}\) is equal to:

$$\begin{aligned} G_{l}=J_{l}^{T}R_{l}J_{l}. \end{aligned}$$

This is the lth diagonal block of the generalized Gauss-Newton matrix G [23]. We discuss this decomposition further in the appendix.

Assume that the input-output Jacobian \(\Phi \) of the remainder of the network above each block is initialized so that \(\left\| \Phi \right\| _{2}^{2}=O(1)\) with respect to \(n_{l+1}\). This assumption simply encodes the requirement that the initialization used for the remainder of the network is sensible, so that the output of the network does not blow up for large widths.

Theorem 2

Under the assumptions outlined in Sect. 3, for linear layer l, the average squared singular value of \(G_{l}\) is equal to:

$$\begin{aligned} n_{l}E\left[ x_{l}^{2}\right] ^{2}\frac{E[\varDelta x_{l+1}^{2}]}{E[x_{l+1}^{2}]}+O\left( \frac{n_{l}E\left[ x_{l}^{2}\right] ^{2}}{n_{l+1}E[x_{l+1}^{2}]}\right) . \end{aligned}$$

The Big-O term is with respect to \(n_{l}\) and \(n_{l+1}\); its precise value depends on properties of the remainder of the network above the current layer.

Despite the approximations required for its derivation, the scaling factor can still be close to the actual average squared singular value. We computed the ratio of the scaling factor (Eq. 2) to the actual expectation \(E[\left( G_{l}r\right) ^{2}]\) for a strided (rather than max-pooled, see Table 1) LeNet model, where we use random input data and a random loss (i.e. for outputs y we use \(y^{T}Ry\) for an i.i.d normal matrix R), with batch-size 1024, and \(32\times 32\) input images. The results are shown in Fig. 1 for 100 sampled setups; there is generally good agreement with the theoretical expectation.

Fig. 1 Distributions of the ratio of theoretical scaling to actual for a strided LeNet network. The ratios are close to the ideal value of 1, indicating good theoretical and practical agreement

6 Initialization of ReLU networks

An immediate consequence of our definition of the scaling factor is a rule for the initialization of ReLU networks. Consider a network where the activation scaling factor is constant throughout. Then any two layers l and r will have the same weight scaling factor, \(\gamma _{l}=\gamma _{r}\), which holds immediately when each layer is initialized with:

$$\begin{aligned} E[W_{l}^{2}]=\frac{c}{k_{l}\sqrt{n_{l}n_{l+1}}}, \end{aligned}$$
(3)

for some fixed constant c independent of the layer. Initialization using the geometric mean of the fan-in and fan-out ensures a constant layer scaling factor throughout the network, aiding optimization. Notice that the dependence on the kernel size is also unusual: rather than \(k_{l}^{2}\), we normalize by \(k_{l}\).
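A minimal sketch of this initialization for PyTorch convolutional and linear layers follows (our own illustration; zero-mean Gaussian weights are one choice of distribution satisfying Eq. 3, and `c` is the free constant discussed in Sect. 6.2):

```python
# Sketch of geometric initialization (Eq. 3): E[W^2] = c / (k * sqrt(n_in * n_out)).
import math
import torch.nn as nn

def geometric_init_(layer, c=2.0):
    if isinstance(layer, nn.Conv2d):
        n_in, n_out, k = layer.in_channels, layer.out_channels, layer.kernel_size[0]
    elif isinstance(layer, nn.Linear):
        n_in, n_out, k = layer.in_features, layer.out_features, 1
    else:
        return
    second_moment = c / (k * math.sqrt(n_in * n_out))
    # For a zero-mean Gaussian, E[W^2] equals the variance.
    nn.init.normal_(layer.weight, mean=0.0, std=math.sqrt(second_moment))
    if layer.bias is not None:
        nn.init.zeros_(layer.bias)
```

For a whole network, `net.apply(geometric_init_)` applies the scheme to every convolutional and linear layer.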

6.1 Other initialization schemes

The most common approaches are the Kaiming [11] (sometimes called He) and Xavier [6] (sometimes called Glorot) initializations. The Kaiming technique for ReLU networks is one of two approaches:

$$\begin{aligned}&(\text {fan-in)}\quad \text {Var}[W_{l}]=\frac{2}{n_{l}k_{l}^{2}}\;\text {or}\nonumber \\&(\text {fan-out)}\quad \text {Var}[W_{l}]=\frac{2}{n_{l+1}k_{l}^{2}} \end{aligned}$$
(4)

For the feed-forward network above, assuming random activations, the forward-activation variance will remain constant in expectation throughout the network if fan-in initialization of weights [21] is used, whereas the fan-out variant maintains a constant variance of the back-propagated signal. The constant factor 2 corrects for the variance-reducing effect of the ReLU activation. Although popularized by [11], similar scaling was in use in early neural network models that used tanh activation functions [1].

These two principles are clearly in conflict; unless \(n_{l}=n_{l+1}\), either the forward variance or the backward variance will become non-constant. No prima facie reason for preferring one initialization over the other is provided. Unfortunately, there is some confusion in the literature, as many works report using Kaiming initialization without specifying whether the fan-in or fan-out variant is used.

The Xavier initialization [6] is the closest to our proposed approach. It balances these conflicting objectives using the arithmetic mean:

$$\begin{aligned} \text {Var}[W_{l}]=\frac{2}{\frac{1}{2}\left( n_{l}+n_{l+1}\right) k_{l}^{2}}, \end{aligned}$$
(5)

to “... approximately satisfy our objectives of maintaining activation variances and back-propagated gradients variance as one moves up or down the network”. This approach to balancing is essentially heuristic, in contrast to the geometric mean approach that our theory directly guides us to.
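To make the differences concrete, the small sketch below (our own example numbers) evaluates the weight variances that each scheme assigns to a convolution with fan-in 64, fan-out 256 and kernel size 3:

```python
# Sketch: weight variances under Kaiming fan-in/fan-out (Eq. 4), Xavier (Eq. 5),
# and geometric initialization (Eq. 3, shown here with c = 2).
import math

def init_variances(n_in, n_out, k, c=2.0):
    return {
        "kaiming_fan_in":  2.0 / (n_in * k ** 2),
        "kaiming_fan_out": 2.0 / (n_out * k ** 2),
        "xavier":          2.0 / (0.5 * (n_in + n_out) * k ** 2),
        "geometric":       c / (k * math.sqrt(n_in * n_out)),
    }

print(init_variances(n_in=64, n_out=256, k=3))
```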

Figure 2 shows heat maps of the average singular values for each block of the Hessian of a LeNet model under the initializations considered. The use of geometric initialization results in an equally weighted diagonal, in contrast to the other initializations considered.

Fig. 2 Average singular value heat maps for the strided LeNet model, where each square represents a block of the Hessian, with blocking at the level of weight matrices (biases omitted). Using geometric initialization maintains an approximately constant block-diagonal weight. The scale runs from yellow (larger) through green to blue (smaller)

6.2 Practical application

The dependence of the geometric initialization on the kernel size, rather than its square, will result in a large increase in forward second moments if c is not carefully chosen. We recommend setting \(c=2/k\), where k is the typical kernel size in the network. Any other layer in the network with a kernel size differing from this default should be preceded by a fixed scaling factor \(x_{l+1}=\alpha x_{l}\) that corrects for this. For instance, if the typical kernel size is 1, then a \(3\times 3\) convolution would be preceded by a fixed scaling factor of \(\alpha =\sqrt{1/3}\).
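A small sketch of this correction follows. It assumes the general form \(\alpha =\sqrt{k_{\text {typ}}/k}\), our own generalization of the example above: with \(c=2/k_{\text {typ}}\), a layer with kernel size k gains an extra forward factor of \(k/k_{\text {typ}}\), which the pre-scaling undoes.

```python
# Sketch: fixed pre-scaling for a layer whose kernel size differs from the typical one.
import math

def pre_scale(k, k_typical):
    # alpha^2 * (k / k_typical) = 1  =>  alpha = sqrt(k_typical / k)
    return math.sqrt(k_typical / k)

print(pre_scale(k=3, k_typical=1))  # ~0.577 = sqrt(1/3), the example from the text
```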

In general, we have the freedom to modify the initialization of a layer and then apply a fixed multiplier before or after the layer to “undo” the increase. This allows us to change the behavior of a layer during learning by modifying the network rather than the optimizer. Potentially, we can avoid the need for sophisticated adaptive optimizers by designing networks to be easily optimizable in the first place. In a sense, the goal of maintaining the forward or backward variance that motivates the fan-in/fan-out initializations can be decoupled from the choice of initialization, allowing us to choose the initialization to improve the optimizability of the network.

Initialization by the principle of dynamical isometry [30, 35], a form of orthogonal initialization [24], has been shown to allow the training of very deep networks. Such orthogonal initializations can be combined with the scaling in our theory without issue, by ensuring that the input-output second-moment scaling is equal to the scaling required by our theory. Our analysis is concerned with the correct initialization when layer widths change within a network, which is a separate concern from the behavior of a network in the large-depth limit, where all layers are typically taken to be the same width. In ReLU networks orthogonal initialization is less interesting, as “... the ReLU nonlinearity destroys the qualitative scaling advantage that linear networks possess for orthogonal weights versus Gaussian” [27].

7 Output second moments

A neural network’s behavior is also very sensitive to the second moment of the outputs. We are not aware of any existing theory guiding the choice of output variance at initialization for the case of log-softmax losses, where it has a non-trivial effect on the back-propagated signals, although output variances of 0.01 to 0.1 are reasonable choices to avoid saturating the nonlinearity while not being too close to zero. The output variance should always be checked and potentially corrected when switching initialization schemes, to avoid inadvertently large or small values.

In general, the variance at the last layer may easily be modified by inserting a fixed scalar multiplier \(x_{l+1}=\alpha x_{l}\) anywhere in the network, and so we have complete control over this variance independently of the initialization used. For a simple ReLU convolutional network with all kernel sizes the same, and without pooling layers we can compute the output second moment when using geometric-mean initialization (\(c=2/k\)) with the expression:

$$\begin{aligned} E[x_{l+1}^{2}]&=\frac{1}{2}k_{l}^{2}n_{l}E[W_{l}^{2}]E[x_{l}^{2}]=\sqrt{\frac{n_{l}}{n_{l+1}}}E[x_{l}^{2}]. \end{aligned}$$
(6)

The application of a sequence of these layers gives a telescoping product:

$$\begin{aligned} E[x_{L}^{2}]&=\left( \prod _{l=0}^{L-1}\sqrt{\frac{n_{l}}{n_{l+1}}}\right) E[x_{0}^{2}]=\sqrt{\frac{n_{0}}{n_{L}}}E[x_{0}^{2}]. \end{aligned}$$

so the output variance is independent of the interior structure of the network and depends only on the input and output channel sizes.
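In practice, the simplest way to set this scale is empirical: run one batch through the freshly initialized network and append a fixed scalar that brings the output to the desired level, as we also do in Sect. 10. A minimal sketch (with placeholder names `net` and `x0`, and a target standard deviation of 0.05 as an example):

```python
# Sketch: compute a fixed output multiplier at initialization so that the network
# output has a chosen standard deviation.
import torch

@torch.no_grad()
def output_scale_correction(net, x0, target_std=0.05):
    y = net(x0)
    return target_std / y.std().item()  # multiply the network's output by this constant
```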

8 Biases

The conditioning of the additive biases in a network is also crucial for learning. Since our model requires that biases be initialized to zero, we cannot use the weight-to-gradient ratio to capture the conditioning of the biases in the network. The average-singular-value notion of conditioning still applies, which leads to the following definition: the scaling of the bias of a layer l, \(x_{l+1}=C_{l}(x_{l})+b_{l}\), is defined as:

$$\begin{aligned} \gamma _{l}^{b}=\rho ^{2}\frac{E[\varDelta x_{l}^{2}]}{E[x_{l}^{2}]}. \end{aligned}$$
(7)

In terms of the activation scaling this is:

$$\begin{aligned} \gamma _{l}^{b}&=\rho ^{2}\frac{E[\varDelta x_{l}^{2}]}{E[x_{l}^{2}]}\nonumber \\&=\frac{\varsigma _{l}}{n_{l}E[\varDelta x_{l}^{2}]E[x_{l}^{2}]}\frac{E[\varDelta x_{l}^{2}]}{E[x_{l}^{2}]}\nonumber \\&=\frac{\varsigma _{l}}{n_{l}E[x_{l}^{2}]^{2}}. \end{aligned}$$
(8)

From Eq. 6 it is clear that when geometric initialization is used with \(c=2/k\), then:

$$\begin{aligned} n_{l+1}E[x_{l+1}^{2}]^{2}=n_{l}E[x_{l}^{2}]^{2}, \end{aligned}$$

and so all bias terms will be equally scaled against each other. If kernel sizes vary in the ReLU network, then a setting of c following Sect. 6.2 should be used, combined with fixed scalar multipliers that ensure that at initialization \(E[x_{l+1}^{2}]=\sqrt{\frac{n_{l}}{n_{l+1}}}E[x_{l}^{2}]\).

8.1 Network input scaling balances weights against biases

It is traditional to normalize a dataset before applying a neural network so that the input vector has mean 0 and variance 1 in expectation. This scaling originated when neural networks commonly used sigmoid and tanh nonlinearities, which depended heavily on the input scaling. This principle is no longer questioned today, even though there is no longer a good justification for its use in modern ReLU based networks. In contrast, our theory provides direct guidance for the choice of input scaling.

Consider the scaling factors for the bias and weight parameters in the first layer of a ReLU-based network, as considered in previous sections. We assume the data is already centered. Then the scaling factors for the weight and bias layers are:

$$\begin{aligned} \gamma _{0}=n_{0}k_{0}^{2}\rho _{1}^{2}E\left[ x_{0}^{2}\right] ^{2}\frac{E[\varDelta y_{0}^{2}]}{E[y_{0}^{2}]},\qquad \gamma _{0}^{b}=\rho _{1}^{2}\frac{E[\varDelta y_{0}^{2}]}{E[y_{0}^{2}]}. \end{aligned}$$

We can cancel terms to find the value of \(E\left[ x_{0}^{2}\right] \) that makes these two quantities equal:

$$\begin{aligned} E\left[ x_{0}^{2}\right] =\frac{1}{\sqrt{n_{0}k_{0}^{2}}}. \end{aligned}$$

In common computer vision architectures, the input planes are the 3 color channels and the kernel size is \(k=3\), giving \(E\left[ x_{0}^{2}\right] \approx 0.2\). Using the traditional variance-one normalization will therefore result in the effective learning rate for the bias terms being lower than that for the weight terms, potentially slowing the learning of the biases compared to the input scaling we propose. We recommend including an initial forward scaling factor of \(1/(n_{0}k^{2})^{1/4}\) in the network to correct for this (Table 2).
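As a concrete sketch, using the common computer-vision case \(n_{0}=3\), \(k=3\) as an example:

```python
# Sketch: forward input scaling 1 / (n0 * k^2)^(1/4) applied after standard
# mean-0 / variance-1 normalization, so that E[x_0^2] = 1 / sqrt(n0 * k^2).
def input_scale(n0=3, k=3):
    return 1.0 / (n0 * k ** 2) ** 0.25

s = input_scale()
print(s, s ** 2)  # ~0.44 and ~0.19, matching E[x_0^2] ≈ 0.2 from the text
```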

9 Experimental results on 26 LIBSVM datasets

Table 2 Comparison on 26 LIBSVM repository datasets

We considered a selection of dense and moderate-sparsity multi-class classification datasets from the LibSVM repository, 26 in total, collated from a variety of sources [3, 4, 5, 13, 14, 16, 18, 19, 20, 25, 28, 34]. The same model was used for all datasets: a non-convolutional ReLU network with 3 weight layers in total. The inner two layer widths were fixed at 384 and 64 nodes, respectively. These widths were chosen to produce a larger gap between the methods compared; less difference would be expected with a more typical \(2\times \) ratio between successive widths. Our results are otherwise generally robust to the choice of layer widths.

For every dataset, learning rate, and initialization combination we ran 10 seeds and picked the median loss after 5 epochs as the focus of our study (The largest differences can be expected early in training). Learning rates in the range \(2^{1}\) to \(2^{-12}\) (in powers of 2) were checked for each dataset and initialization combination, with the best learning rate chosen in each case based on the median of the 10 seeds. Training loss was used as the basis of our comparison as we care primarily about convergence rate, and are comparing identical network architectures. Some additional details concerning the experimental setup and which datasets were used are available in the appendix.

Table 2 shows that geometric initialization is the most consistent of the initialization approaches considered. The best value in each column is in bold. It has the lowest loss, after normalizing each dataset, and it is never the worst of the 4 methods on any dataset. Interestingly, the fan-out method is most often the best method, but consideration of the per-dataset plots (Fig. 3) shows that it often completely fails to learn for some problems, which pulls up its average loss and results in it being the worst for 9 of the datasets.

Fig. 3 Training loss comparison across 26 datasets from the LibSVM repository

10 Convolutional case: AlexNet experiments

To provide a clear idea of the effect of our scaling approach on larger networks, we used the AlexNet architecture [17] as a test bench. This architecture has a large variety of filter sizes (11, 5, 3, linear), which according to our theory will adversely affect the conditioning, and which should highlight the differences between the methods. The network was modified to replace max-pooling with striding, as max-pooling is not well-scaled by our theory.

Fig. 4 CIFAR-10 training loss for a strided AlexNet architecture. The median as well as a 25%-75% IQR of 40 seeds is shown for each initialization, where for each seed a sliding window of minibatch training loss over 400 steps is used

Following Sect. 7, we normalize the output of the network at initialization by running a single batch through the network and adding a fixed scaling factor to the network to produce an output standard deviation of 0.05. We tested on CIFAR-10 following standard practice as closely as possible, as detailed in the Appendix. We performed a geometric learning rate sweep over a power-of-two grid. Results are shown in Fig. 4 for 40 seeds per initialization. Preconditioning is a statistically significant improvement (\(p=3.9\times 10^{-6}\)) over arithmetic-mean and fan-in initialization; however, it only shows an advantage over fan-out initialization at mid-training.

11 Case study: unnormalized residual networks

In the case of more complex network architectures, some care needs to be taken to produce well-scaled neural networks. We consider in this section the example of a residual network, a common architecture in modern machine learning. Consider a simplified residual architecture like the following, where we have omitted ReLU operations for our initial discussion:

$$\begin{aligned} x_{1}&=C_{0}(x_{0}),\\ x_{2}&=B_{0}(x_{1},\alpha _{0},\beta _{0}),\\ x_{3}&=B_{1}(x_{2},\alpha _{1},\beta _{1}),\\ x_{4}&=AvgPool(x_{3}),\\ x_{5}&=L(x_{4}). \end{aligned}$$

where for some sequence of operations F:

$$\begin{aligned} B(x,\alpha ,\beta )=\alpha x+\beta F(x), \end{aligned}$$

we further assume that \(\alpha ^{2}+\beta ^{2}=1\) and that \(E[F(x)^{2}]=E[x^{2}]\) following [31]. The use of weighted residual blocks is necessary for networks that do not use batch normalization [7, 10, 32, 36].
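A minimal sketch of such a weighted block follows (our own illustration; `F` is any module preserving the forward second moment):

```python
# Sketch: weighted residual block B(x, alpha, beta) = alpha * x + beta * F(x),
# with alpha^2 + beta^2 = 1 following [31].
import math
import torch.nn as nn

class WeightedResidual(nn.Module):
    def __init__(self, F, beta):
        super().__init__()
        self.F = F
        self.beta = beta
        self.alpha = math.sqrt(1.0 - beta ** 2)

    def forward(self, x):
        return self.alpha * x + self.beta * self.F(x)
```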

If geometric initialization is used, then \(C_{0}\) and L will have the same scaling; however, the operations within the residual blocks will not. To see this, we can calculate the activation scaling factor within a residual block. We define the shortcut branch of the residual block as the \(\alpha x\) term and the main branch as the \(\beta F(x)\) term. Let \(x_{R}=\beta F(x)\) and \(x_{S}=\alpha x\), and define \(y=x_{S}+x_{R}\).

Let \(\varsigma \) be the activation scaling factor at x:

$$\begin{aligned} \varsigma =n\rho ^{2}E[\varDelta x^{2}]E[x^{2}], \end{aligned}$$

We will use the fact that:

$$\begin{aligned} E[\varDelta x^{2}]&=\left( \alpha ^{2}+\beta ^{2}\right) E[\varDelta y^{2}]\\&=E[\varDelta y^{2}]. \end{aligned}$$

From rewriting the scale factor for \(x_{R}\), we see that:

$$\begin{aligned} \varsigma _{R}&=n\rho ^{2}E[\varDelta x_{R}^{2}]E[x_{R}^{2}]\\&=n\rho ^{2}E[\varDelta x^{2}]E[x_{R}^{2}]\\&=\beta ^{2}n\rho ^{2}E[\varDelta x^{2}]E[x^{2}]\\&=\beta ^{2}\varsigma . \end{aligned}$$

A similar calculation shows that the shortcut branch’s scaling factor is multiplied by \(\alpha ^{2}\). To ensure that convolutions within the main branch of the residual block have the same scaling as those outside the block, we must rescale their initialization. We can calculate the required numerator \(c_{l}\) when geometric initialization is used for an operation in layer l of the main branch:

$$\begin{aligned} \gamma _{l}&=\frac{\varsigma _{R}}{n_{l+1}n_{l}k_{l}^{2}E[W_{l}^{2}]^{2}}\\&=\frac{\varsigma _{R}k_{l}^{2}n_{l+1}n_{l}}{n_{l+1}n_{l}k_{l}^{2}c_{l}^{2}}=\varsigma _{R}/c_{l}^{2} \end{aligned}$$

For \(\gamma _{l}\) to match the scaling factor \(\gamma =\varsigma /c^{2}\) of a layer outside the block, initialized with numerator c, we thus need \(\varsigma _{R}/c_{l}^{2}=\beta ^{2}\varsigma /c_{l}^{2}=\varsigma /c^{2}\), i.e. \(c_{l}=\beta c\): the initialization second moment of the main-branch convolutions must be multiplied by \(\beta \). If the shortcut branch uses convolutions (such as for channel-widening operations or down-sampling, as in a ResNet-50 architecture), they should instead be scaled by \(\alpha \). Modifying the initialization of the operations within the block changes \(E[F(x)^{2}],\) so a fixed scalar multiplier must be introduced within the main branch to undo the change, ensuring \(E[F(x)^{2}]=E[x^{2}]\).

11.1 Design of a pre-activation ResNet block

Using the principle above we can modify the structure of a standard pre-activation ResNet block to ensure all convolutions are well-conditioned both across blocks and against the initial and final layers of the network. We consider the full case now, where the shortcut path may include a convolution that changes the channel count or the resolution. Consider a block of the form:

$$\begin{aligned} B(x,\alpha ,\beta )=\alpha S(x)+\beta F(x) \end{aligned}$$

We consider a block with fan-in n and fan-out m. There are two cases, depending on whether the block is a downsampling block or not. In the case of a downsampling block, a well-scaled shortcut branch consists of the following sequence of operations:

$$\begin{aligned} y_{0}&=\text {AvgPool2D}(x,\text {kernel\_size=2},\text {stride=}2),\\ y_{1}&=y_{0}+b_{0},\\ y_{2}&=C(y_{1},\text {op=}m,\text {ks=}1,c={{\alpha }/4}),\\ y_{3}&=y_{2}/\sqrt{\alpha /4}. \end{aligned}$$

In our notation, C is a convolution initialized with the geometric initialization scheme of Eq. 3 using the listed value of c as the numerator; op is the number of output planes and ks is the kernel size. The constant 4 corrects for the downsampling, and the constant \(\alpha \) corrects the scaling factor of the convolution as described above. In the non-downsampled case, this simplifies to:

$$\begin{aligned} y_{0}&=x+b_{0},\\ y_{1}&=C(y_{0},\text {op=}m,\text {ks=}1,c={{\alpha }}),\\ y_{2}&=y_{1}/\sqrt{\alpha }. \end{aligned}$$
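The following sketch implements both forms of the shortcut branch (our own illustration; it reuses the hypothetical `geometric_init_` helper from Sect. 6 with the listed numerator c, and represents the bias as a per-channel parameter initialized to zero):

```python
# Sketch: well-scaled shortcut branch, downsampling and non-downsampling variants.
import math
import torch
import torch.nn as nn

class ScaledShortcut(nn.Module):
    def __init__(self, n_in, n_out, alpha, downsample=False):
        super().__init__()
        c = alpha / 4 if downsample else alpha            # the /4 corrects for downsampling
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2) if downsample else nn.Identity()
        self.bias = nn.Parameter(torch.zeros(n_in, 1, 1))
        self.conv = nn.Conv2d(n_in, n_out, kernel_size=1, bias=False)
        geometric_init_(self.conv, c=c)                   # Eq. 3 with numerator c
        self.post_scale = 1.0 / math.sqrt(c)              # fixed scalar undoing the changed init

    def forward(self, x):
        y = self.pool(x) + self.bias
        return self.conv(y) * self.post_scale
```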

For the main branch of a bottlenecked residual block in a pre-activation network, the sequence begins with a single scaling operation \(x_{0}=\sqrt{\beta }x\), after which the following pattern is used, with w being the inner bottleneck width:

$$\begin{aligned} x_{1}&=\text {ReLU}(x_{0})\\ x_{2}&=\sqrt{\frac{2}{\beta }}x_{1}\\ x_{3}&=x_{2}+b_{1}\\ x_{4}&=C(x_{3},\text {op=}w,\text {ks=}1,c={{\beta }}) \end{aligned}$$

This is followed by a \(3\times 3\) convolution:

$$\begin{aligned} x_{5}&=\text {ReLU}(x_{4})\\ x_{6}&=\sqrt{\frac{2}{3\beta }}x_{5}\\ x_{7}&=x_{6}+b_{2}\\ x_{8}&=C(x_{7},\text {op=}w,\text {ks=}3,c={{\beta }}) \end{aligned}$$

and the final sequence of operations mirrors the initial one, with a downscaling convolution instead of an upscaling one. At the end of the block, a learnable scalar \(x_{9}=\frac{v}{\sqrt{\beta }}x_{8}\) with \(v=\sqrt{\beta }\) is included following the approach of [31], and a fixed scalar \(x_{10}=\sqrt{\frac{m}{\beta n}}x_{9}\) corrects for any increase in the forward second moment from the entire sequence. This scaling is derived from Eq. 6 (the \(\beta \) factor undoes the initial \(\sqrt{\beta }\) scaling from the first step of the block).

11.2 Experimental results

We ran a series of experiments on an unnormalized pre-activation ResNet-50 architecture, using our geometric initialization and scaling scheme both within and outside of the blocks. We compared against the RescaleNet unnormalized ResNet-50 architecture. Following their guidelines, we added dropout, which is necessary for good performance, and used the same \(\alpha /\beta \) scheme that they use. Our implementation is available in the supplementary material. We performed our experiments on the ImageNet dataset [29], using standard data preprocessing pipelines and hyper-parameters. In particular, we use batch size 256, decay 0.0001, momentum 0.9, and learning rate 0.1 with SGD, using a 30-60-90 decreasing schedule for 90 epochs. Following our recommendation in Sect. 7, we performed a sweep on the output scaling factor and found that a final scalar of 0.05 gives the best results. Across 5 seeds, our approach achieved a test-set accuracy of 76.18 (SE 0.04), matching the performance of RescaleNet within our test framework (76.13, SE 0.03). Our approach supersedes the “fixed residual scaling” that they propose as a way of balancing the contributions of each block.

12 Related work

Our approach of balancing the diagonal blocks of the Gauss-Newton matrix has close ties to a large literature studying the input-output Jacobian of neural networks. The Jacobian has been studied from a number of angles. Its singular values are the focus of theoretical study in [30, 35], where it is shown that orthogonal initializations better control the spread of the spectrum than Gaussian initializations. [8, 9] also study the effect of layer width and depth on the spectrum. Regularization of the Jacobian, where additional terms are added to the loss to minimize the Frobenius norm of the Jacobian, can be seen as another way to control the spectrum [12, 33], as the Frobenius norm is the sum of the squared singular values. The spectrum of the Jacobian captures the sensitivity of a network to input perturbations and is key to the understanding of adversarial machine learning, including generative modeling [26] and robustness [2, 15].

13 Conclusion

Although not a panacea, the scaling principle we have introduced allows neural networks to be designed with a reasonable expectation that they will be optimizable by stochastic gradient methods, minimizing the amount of guess-and-check neural network design. Our approach is a step towards “engineering” neural networks, where aspects of the behavior of a network can be studied in an off-line fashion before use, rather than by a guess-implement-test-and-repeat experimental loop.