Limitations of neural network training due to numerical instability of backpropagation

We study the training of deep neural networks by gradient descent where floating-point arithmetic is used to compute the gradients. In this framework and under realistic assumptions, we demonstrate that gradient descent is highly unlikely to find ReLU neural networks that maintain, over the course of training, superlinearly many affine pieces with respect to their number of layers. Virtually all approximation-theoretical arguments that yield high-order polynomial rates of approximation rely on sequences of ReLU neural networks with exponentially many affine pieces relative to their number of layers. As a consequence, we conclude that the approximating sequences of ReLU neural networks resulting from gradient descent in practice differ substantially from theoretically constructed sequences. The assumptions and the theoretical results are compared to a numerical study, which yields concurring results.


Introduction
Deep learning is a machine learning technique based on artificial neural networks which are trained by gradient-based methods and which have a large number of layers. This technique has been tremendously successful in a wide range of applications [26,24,44,41]. Of particular interest for applied mathematicians are recent developments in which deep neural networks are applied to tasks of numerical analysis such as the numerical solution of inverse problems [1,34,27,20,38] or of (parametric) partial differential equations [7,12,39,9,40,25,29,3].
The appeal of deep neural networks for these applications is due to their exceptional efficiency in representing functions from several approximation classes that underlie well-established numerical methods. In terms of approximation accuracy with respect to the number of approximation parameters, deep neural networks have been theoretically proven to achieve approximation rates that are at least as good as those of finite elements [15,35,30], local Taylor polynomials or splines [47,11], wavelets [42] and, more generally, affine systems [5].
In the sequel, we consider neural networks with the rectified-linear-unit (ReLU) activation function, which is standard in most applications. In this case, the neural-network approximations are piecewise-affine functions. We point out that all state-of-the-art results on the rates of approximation with deep ReLU neural networks that achieve higher order polynomial approximation rates are based on explicit constructions with the number of affine pieces growing exponentially with respect to the number of layers; see, e.g., [47,46].
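The exponential growth of affine pieces in such constructions can be illustrated concretely: composing a one-hidden-layer ReLU "hat" function with itself L times yields a sawtooth with 2^L affine pieces. The following numpy sketch makes this observable; the grid-based piece counter is an illustrative device of ours, not a construction from the cited references.

```python
import numpy as np

def hat(x):
    # One-layer ReLU hat function: g(x) = 2x on [0, 1/2] and 2(1 - x) on [1/2, 1],
    # realisable as g(x) = 2*relu(x) - 4*relu(x - 1/2).
    return 2 * np.maximum(x, 0.0) - 4 * np.maximum(x - 0.5, 0.0)

def count_pieces(y, x):
    # Count affine pieces of samples (x, y) on a uniform grid by counting
    # slope changes between consecutive grid intervals.
    slopes = np.diff(y) / np.diff(x)
    return 1 + int(np.sum(~np.isclose(slopes[1:], slopes[:-1])))

x = np.linspace(0.0, 1.0, 2 ** 16 + 1)
y = x.copy()
for L in range(1, 6):
    y = hat(y)  # L-fold composition: a ReLU network with L hidden layers
    print(L, count_pieces(y, x))  # the number of pieces doubles with every layer
```

Constructions such as the approximation of the square function in [47] build on precisely this composition; the question studied in this paper is whether gradient descent can reach such configurations.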
In this work, we argue that this central building block, functions with exponentially many affine pieces, cannot be learned with the state-of-the-art techniques. We show theoretically that training in floating-point arithmetic is hindered by a bound on the number of affine pieces created during an iterative learning process. This bound is polynomial with respect to the number of approximation parameters and also with respect to the number of iterations. Notably, this non-computability of functions with exponentially many affine pieces is not derived from an abstract result of computability theory, as is the case for the related results established in [4,6]; instead, we identify a concrete reason why gradient descent based on backpropagation cannot produce these functions in floating-point arithmetic. The effect of numerical instability is also demonstrated in numerical examples. We stress that our results do not imply that theoretically derived approximation rates cannot be realised with learned functions. Instead, we can merely conclude that the approximating sequences of neural networks found in practice must be fundamentally different from the theoretically derived ones. We give a detailed description of our findings in Subsection 1.3. This work is strongly influenced by [14], which shows that functions with exponentially many affine pieces with respect to the number of layers typically do not appear in randomly initialised neural networks. The ideas presented in that work form the basis of our analysis.
Before presenting our main results, we recapitulate the notion of floating-point arithmetic and furnish an example illustrating how the finite precision of floating-point arithmetic can undermine learning due to the phenomenon known in numerical analysis as catastrophic cancellation.

Floating-point arithmetic
Computations are performed almost exclusively in binary floating-point arithmetic, which consists in restricting the arithmetic of real numbers to a discrete set of the form

M = { ±2^e ∑_{k=0}^{p} 2^{−k} c_k : c_0 = 1, c_1, …, c_p ∈ {0, 1}, e ∈ {e_min, …, e_max} },    (1.1)

extended with a few special elements, such as zero, infinity and NaN elements. Here, the radix of the floating-point arithmetic is two, the parameter p ∈ N is the precision and e_min, e_max ∈ Z are the minimum and maximum exponents. Naturally, M is not closed under basic arithmetic operations such as addition and multiplication, which necessitates rounding. The round-to-nearest addition ⊕ and multiplication ⊗ are defined so that

x ⊕ y = (x + y)(1 + δ_⊕)  and  x ⊗ y = (x · y)(1 + δ_⊗)  with  |δ_⊕|, |δ_⊗| ≤ ϵ,    (1.3)

where ϵ denotes the machine epsilon of the arithmetic.
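The standard model of round-to-nearest arithmetic can be checked empirically. The sketch below verifies, for IEEE binary64 (instantiating the precision as p = 53, an illustrative assumption), that the relative rounding error of each addition is bounded by the unit roundoff u = 2^{−53}, using exact rational arithmetic as the reference.

```python
from fractions import Fraction
import random

u = 2.0 ** -53  # unit roundoff of IEEE binary64 (precision p = 53)

random.seed(0)
worst = Fraction(0)
for _ in range(1000):
    x, y = 1.0 + random.random(), 1.0 + random.random()
    exact = Fraction(x) + Fraction(y)            # exact rational sum
    err = abs(Fraction(x + y) - exact) / exact   # relative rounding error of x (+) y
    worst = max(worst, err)

print(float(worst) <= u)  # the model x (+) y = (x + y)(1 + d) with |d| <= u holds
```

The same experiment with multiplication in place of addition yields the same bound, in accordance with the model above.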

Instability of deep neural networks due to catastrophic cancellation
The bounds (1.3) allow floating-point operations to be modelled as perturbations of the corresponding infinite-precision operations of real arithmetic.
Consider a feed-forward ReLU neural network with L ∈ N layers, d = N_0 ∈ N real inputs and N_j ∈ N neurons in each layer j ∈ {1, …, L}. The evaluation ϕ : R^d → R^{N_L} of such a neural network at a point x^{(0)} ∈ R^d consists of iteratively applying the transformations x^{(j)} = ϱ(A_j x^{(j−1)} + b_j) sequentially for j = 1, …, L, which produces the corresponding output ϕ(x^{(0)}) = x^{(L)} ∈ R^{N_L}. Here, ϱ : R → R given by ϱ(x) = max{x, 0} for all x ∈ R is the ReLU activation function, which is applied componentwise to the elements of R^{N_j} for each j ∈ {1, …, L}, whereas A_j ∈ R^{N_j × N_{j−1}} and b_j ∈ R^{N_j} with j ∈ {1, …, L} are the weight matrices and bias vectors. For simplicity, let us consider the case of b_j = 0 for all j ∈ {1, …, L} and assume that x = x^{(0)} ∈ R^d and A_1, …, A_L are such that all entries of x^{(j)} are nonnegative for each j ∈ {1, …, L}. Then

ϕ(x) = A_L A_{L−1} ⋯ A_1 x,    (1.5)

i.e., the evaluation of the network reduces to the multiplication of the input by a matrix product consisting of L factors.
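The reduction to a matrix product is easy to verify numerically: nonnegative weights together with a nonnegative input keep every preactivation nonnegative, so no ReLU ever clips. The architecture and random weights below are illustrative assumptions.

```python
import numpy as np

def relu_forward(weights, x):
    # Evaluate a bias-free ReLU network: x^(j) = relu(A_j x^(j-1)).
    for A in weights:
        x = np.maximum(A @ x, 0.0)
    return x

rng = np.random.default_rng(1)
# Nonnegative weights and a nonnegative input keep every preactivation
# nonnegative, so every ReLU acts as the identity and the network
# collapses to the matrix product A_L ... A_1 x.
weights = [rng.uniform(0.0, 1.0, size=(4, 4)) for _ in range(6)]
x = rng.uniform(0.0, 1.0, size=4)

via_network = relu_forward(weights, x)
via_product = np.linalg.multi_dot(weights[::-1] + [x])
print(np.allclose(via_network, via_product))  # → True
```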
The instability of such products for large L was demonstrated in the context of tensor decompositions, in [2, Example 3]: the first component of the tensor considered therein yields the matrix product (1.5) with

and with weight matrices
for each j = 2, …, L − 1, where a > 1 is a fixed parameter and ε_1, …, ε_L ≥ 0 are perturbation parameters; the unperturbed product is recovered in the case ε_1 = ⋯ = ε_L = 0. Considering ε as a perturbation parameter, which may be at the level of the machine epsilon ϵ, we observe that a relative perturbation of magnitude ε in the j-th weight matrix leads to an error on the order of a^L ε, which implies the total loss of accuracy for any ε > 0 whenever a^L ≳ ε^{−1}. This effect is an immediate consequence of the subtraction encoded by A_L, in which the two terms may be perturbed individually (just as the first is perturbed by a multiplicative factor of 1 + ε and the second is not perturbed at all in the above example). This is an archetypal example of so-called catastrophic cancellation; see, e.g., [17, Section 1.7]. This example, based on inaccurate matrix multiplication, can be generalised into a statement for deep neural networks; the precise statement is given as Proposition A.3 in the Appendix. In the finite-precision setting, we observe that the evaluation of very small neural networks can already lead to a relative error of 1 compared to the evaluation of the same neural network in arbitrary precision. Admissible parameters such that the conclusion of Proposition 1.1 holds are, for example, N = 65, a = 10, and L = 8. Proposition 1.1 considers only the forward application of a neural network. However, the construction of this neural network is based on the example (1.6) and is such that the neural network is entirely linear. By the backpropagation rule, recalled in Definition 2.3, the reversal of the neural network constructed in Proposition 1.1 yields a neural network such that the accurate computation of the gradient of the output with respect to the weights is infeasible in floating-point arithmetic.
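The cancellation mechanism can be reproduced in a scalar caricature of the matrix example: two terms of size a^L are subtracted so that the exact result is of order one, while a relative perturbation ε in a single term produces an absolute error of about a^L ε. The values a = 10 and L = 8 match the admissible parameters above; the single-precision-level perturbation ε = 2^{−23} is an illustrative choice.

```python
a, L = 10.0, 8       # matches the admissible parameters a = 10, L = 8 above
eps = 2.0 ** -23     # a relative perturbation at single-precision level

big = a ** L                                 # size of the two large terms
exact = big - (big - 1.0)                    # exact result: 1
perturbed = big * (1.0 + eps) - (big - 1.0)  # one term carries relative error eps
rel_error = abs(perturbed - exact) / abs(exact)
print(rel_error)  # ~ a**L * eps ≈ 11.9: the computed result is pure rounding noise
```

Since a^L ε ≈ 11.9 > 1, the relative error exceeds one, i.e., total loss of accuracy, exactly as predicted by the condition a^L ≳ ε^{−1}.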

Contribution
In this work, we analyse the effect of inaccurate computations stemming from floating-point arithmetic in a more moderate regime than that of Proposition 1.1. In contrast to the error amplification that was the basis of the example (1.6), we assume in the following that the errors stem mostly from the accumulation of many inaccurate operations. We assume that, in the computation of the gradient of a neural network, a small and controlled relative error appears in each layer. Under this assumption, we observe that neural networks trained with gradient descent will, with high probability, not exhibit exponentially many affine pieces with respect to the number of layers.
We describe a framework for learning in floating-point arithmetic in Section 2. There, we define a gradient descent procedure where the gradient is computed with noisy updates.
To facilitate our main results, we make two assumptions on the gradient descent procedure (Assumptions 2.6 and 2.7), which can be intuitively formulated as follows.
(A) We assume that the average of the reciprocals of the non-zero bias updates in an iteration is bounded by a polynomial in the maximum number N of neurons in a single layer. Moreover, every neuron with zero output (dead neuron) is assumed to remain so after the iteration. Both assumptions together essentially require the gradient of every bias to be either zero or not too small for most neurons. We verify this assumption numerically in Section 4.
(B) For all layers j, the derivative of each coordinate of the preactivations, i.e., of A_j x^{(j−1)} + b_j, with respect to the input of the neural network is bounded by a uniform constant.
Under these assumptions, we prove Theorem 3.4, which can be intuitively phrased as follows: the expected number of affine pieces that the neural network has after a single perturbed gradient descent update has an upper bound that is polynomial (in practice quadratic) in the number of neurons, linear in the number of layers and inversely proportional to the level of perturbation.
The result is based on the following insight of [14]. Consider as a prototypical neuron a function of the form ϱ(h − b), where h is a piecewise affine function of bounded variation. Then, for ϱ(h − b) to have more affine pieces than h, it is necessary that h(x) = b for some x in the domain of h. Intuitively, for many such x to exist, h − b has to change sign many times. If b is a random variable, then, for the event h(x) = b to happen with high probability, h needs to repeatedly cross a substantial part of the region where b has most of its mass. This, however, can only happen to some degree, since the variation of h was assumed to be bounded. The resulting estimate of expected newly generated affine pieces in one neuron under random biases is formalised in Lemma 3.1. Since the floating-point operations are assumed to introduce noise into the bias updates in a way quantified in Definition 2.3, and since Assumption (B) implies bounded variation of the outputs of each neuron, we deduce that with high probability the number of affine pieces in a neural network is bounded.
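The crossing argument can be probed by Monte Carlo simulation. In the sketch below, h is a triangle wave with |h′| = 1 on [0, 1], and a uniform random bias plays the role of the noisy update; the empirical probability of many crossings stays below the bound suggested by Lemma 3.1, here instantiated as c/(t·δ). The triangle wave, the grid-based crossing count, and all parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 200_001)

k = 8                                # number of teeth of the triangle wave
s = (grid * k) % 1.0
h = np.minimum(s, 1.0 - s) / k       # |h'| = 1 a.e., amplitude 1/(2k), domain length c = 1

t, delta = 8, 0.25                   # crossing threshold, width of the bias noise
trials, hits = 2000, 0
for _ in range(trials):
    b = rng.uniform(0.0, delta)      # random bias, as in the perturbed update
    crossings = np.sum(np.diff(np.sign(h - b)) != 0)
    hits += int(crossings >= t)

estimate = hits / trials
print(estimate, 1.0 / (t * delta))   # empirical P(#crossings >= t) vs the bound c/(t*delta)
```

Only biases below the amplitude 1/(2k) of h produce any crossings at all, so the empirical frequency (about 0.25 here) remains below the bound 0.5, in line with the heuristic that bounded variation limits how often h can traverse the bulk of the bias distribution.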
In a gradient descent iteration, the noise in each gradient update step also depends on the current empirical risk. Notably, if the neural network is initialised with zero training error, no update happens. However, under standard assumptions on the dynamics of the loss, we can still prove that the number of affine pieces during the iteration does not increase quickly. This is the content of Theorem 3.6, which we state below in a simplified and informal version.

Theorem 1.2 (Informal version of Theorem 3.6). During gradient descent, where the gradient is computed by backpropagation in floating-point arithmetic, it holds for each iteration with probability 1/2 that the number of affine pieces of a neural network along a given line grows at most polynomially (almost linearly) with respect to the number of iterations, polynomially in the number of neurons, and linearly in the number of layers.
Assumptions (A) and (B) will be discussed in detail after their formal statement in Section 2. Since Assumption (A) relates to the distribution of updates, we numerically study if it is satisfied in practice in Section 4.
Finally, in Section 5, we numerically analyse the influence of the magnitude of round-off errors. We observe that, while lower numerical accuracy influences the generation of affine pieces negatively, as claimed by our main theorem, the effect is not as pronounced as the direct application of the theorem may suggest. The reason is that, for the approximations involved, the number of affine pieces is already significantly lower than our upper bounds. Hence we conclude that numerical stability is one reason that prevents the learning of functions with exponentially many affine pieces, but there may be many more such reasons that prevent the number of affine pieces from reaching the theoretically established threshold. If the number of affine pieces is large for the initial neural network, it reduces in the course of training; in accordance with our theoretical results, that reduction occurs faster for lower computational accuracies.

Related work
This work represents a counter-point to many approximation theoretical approaches used in the analysis of deep learning. Here, we study to what extent piecewise-affine functions with a number of affine pieces that is exponential with respect to the number of layers can be learned by deep neural networks using gradient descent algorithms. Functions with high numbers of affine pieces are necessary in the approximation of the square function [47], which forms the basis for many approximation results in the literature, e.g., [48,35,30,37,40,11].
The primary motivation for the approach that we follow is taken from [14]. There it is shown that randomly initialised neural networks typically have few affine pieces. The main idea of that work is to show that, to generate many affine pieces, the bias vectors in neural networks need to be very accurately chosen. This is unlikely to be achieved if the biases are initialised randomly. This very idea is also the basis of our analysis. We cannot, however, simply apply the main results of [14], since analysing the whole iterative procedure of gradient descent requires keeping track of the interdependence of the random variables modelling the weights and biases of the network and estimating their variance throughout the process.
Floating-point arithmetic and its effect on iterated computations have been studied in the literature before. The example given by (1.5)–(1.6) is derived from the study of the stability of long tensor factorisations in [2]. The explicit constructions of low-rank tensor approximations presented in [23,21,31,22] may all, depending on the specific implementation, suffer from an instability similar to that demonstrated in (1.5)–(1.6).
In [28], an empirical study is presented which finds that the numerical accuracy affects the overall performance of a neural network classifier. In this context, an effect of the number of layers is also measured, showing that deeper neural networks are more sensitive to low accuracies. Another analysis is given in [45], where it is shown that numerical inaccuracies introduce stochasticity into the trajectories of trained neural networks, similar to noise in stochastic gradient descent. In the context of fixed-point arithmetic, [10] showed that in many classification problems one can restrict the computation accuracy significantly without harming the classification accuracy. There is also a vast body of literature studying how neural networks can be efficiently quantised to speed up computations or minimise storage requirements [49,13,19,18]. Typically, it is found that moderate or specifically designed quantisation does not harm classification accuracy. To the best of our knowledge, the effect on the generation of many affine pieces has not been studied.
There have also been more abstract studies showing that on digital hardware or in finite precision computations, there are learning scenarios that cannot be solved to any reasonable accuracy; see [6,4].

Framework
In this section, we introduce the framework for this article. We start by formally introducing neural networks in Definition 2.1. Thereafter, we fix the notion of the number of affine pieces of a neural network in Definition 2.2. To study the effect of numerical accuracy on the convergence and performance of gradient-descent-based learning, we define the perturbed gradient descent update in Definition 2.3 and the full perturbed gradient descent iteration in Definition 2.5.
To facilitate the results in the sequel, we make Assumptions 2.6 and 2.7 on the gradient descent dynamics. Assumption 2.6 requires the absolute value of the derivative of the objective function with respect to each of the biases to either vanish or be bounded from below. In the event that the derivative vanishes, we assume that the associated neuron is dead, i.e., its output is constant on the whole domain. Assumption 2.7 stipulates that the output of each neuron should have a bounded derivative with respect to the input variable. We will discuss the plausibility of these assumptions at the end of this section.
We start by defining a neural network. Here we focus only on the most widely used activation function, the ReLU, ϱ(x) := max{0, x}, for x ∈ R.

Definition 2.1 ([37, 35]). Let d, L ∈ N. A neural network (NN) with input dimension d and L layers is a sequence of matrix-vector tuples

Φ = ((A_1, b_1), …, (A_L, b_L)),

where N_0 := d and N_1, …, N_L ∈ N, and where A_j ∈ R^{N_j × N_{j−1}} and b_j ∈ R^{N_j} for j = 1, …, L. The number N_L is referred to as the output dimension.
For j′ ∈ {1, …, L − 1}, the subnetwork from layer 1 to layer j′ of Φ is the sequence of matrix-vector tuples Φ_{j′} := ((A_1, b_1), …, (A_{j′}, b_{j′})). For a NN Φ and a domain Ω ⊂ R^d, we define the associated realisation of the NN Φ as R(Φ) : Ω → R^{N_L}, R(Φ)(x) := x^{(L)}, where the output x^{(L)} ∈ R^{N_L} results from

x^{(0)} := x,
x^{(j)} := ϱ(A_j x^{(j−1)} + b_j) for j = 1, …, L − 1,
x^{(L)} := A_L x^{(L−1)} + b_L.

Here ϱ is understood to act component-wise on vector-valued inputs, i.e., for y = (y_1, …, y_m) ∈ R^m, ϱ(y) := (ϱ(y_1), …, ϱ(y_m)). We call N(Φ) := d + ∑_{j=1}^{L} N_j the number of neurons of the NN Φ and L(Φ) := L the number of layers or depth. In addition, we refer to (d, N_1, …, N_L) as the architecture of Φ. Furthermore, for a domain Ω and for j ∈ {1, …, L − 1}, we define the preactivation functions η_j : Ω → R^{N_j}, η_j(x) := A_j x^{(j−1)} + b_j.

Since the ReLU is piecewise linear, it is not hard to see that the realisation of a NN is always a piecewise affine function. A widely used tool to study deep NNs is to count the number of their affine pieces. To formalise this, we present a definition of the number of affine pieces of a piecewise affine function.
Definition 2.2. Let d ∈ N. For a piecewise affine function f : Ω → R with Ω ⊂ R^d and for any Ω′ ⊂ Ω, we define the number of affine pieces of f on Ω′ as the smallest integer P such that there exist pairwise disjoint open sets Ω_1, …, Ω_P ⊂ Ω′ whose union is dense in Ω′ and such that f restricted to each Ω_i is affine. We denote the number of affine pieces of f on Ω′ by pieces(f, Ω′).
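In one dimension, i.e., along a line κ, this quantity can be estimated numerically by sampling the realisation on a fine grid and counting slope changes. The helper below, its tolerances, and the random architecture are illustrative assumptions of ours, not the paper's experimental code.

```python
import numpy as np

def pieces_along_line(weights, biases, a, b, n=100_001):
    # Estimate pieces(R(Phi), kappa) for the segment kappa from a to b by
    # sampling the realisation on a fine grid and counting slope changes.
    ts = np.linspace(0.0, 1.0, n)
    x = a[None, :] + ts[:, None] * (b - a)[None, :]
    for A, c in zip(weights[:-1], biases[:-1]):
        x = np.maximum(x @ A.T + c, 0.0)          # hidden layers with ReLU
    y = (x @ weights[-1].T + biases[-1]).ravel()  # affine output layer
    slopes = np.diff(y) / np.diff(ts)
    changed = ~np.isclose(slopes[1:], slopes[:-1], rtol=1e-4, atol=1e-8)
    # An off-grid breakpoint flags one or two consecutive transitions, so we
    # count maximal runs of flagged transitions rather than individual flags.
    runs = int(changed[0]) + int(np.sum(changed[1:] & ~changed[:-1]))
    return 1 + runs

rng = np.random.default_rng(2)
widths = [2, 16, 16, 1]
weights = [rng.normal(0.0, 1.0, (widths[i + 1], widths[i])) for i in range(3)]
biases = [rng.normal(0.0, 1.0, w) for w in widths[1:]]
print(pieces_along_line(weights, biases, np.zeros(2), np.ones(2)))
```

For this architecture, the count is guaranteed to stay below the elementary one-dimensional bound (N_1 + 1)(N_2 + 1) = 289, and for randomly initialised networks it is typically far smaller, consistent with [14].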
Next, we introduce a model for the training process of deep NNs. It is customary to train NNs by minimising an empirical risk function using gradient descent over the weights. Assume we are given a loss function ℓ : R^q × R^q → R_+, for q ∈ N, which could, for example, be the square loss ℓ(y, y′) = ∥y − y′∥². For NNs Φ with a fixed architecture, input dimension d, and output dimension N_L = 1, we formalise the procedure to minimise R(Y, R(Φ)(X)) for given training data X = (x_i)_{i=1}^M, Y = (y_i)_{i=1}^M.

Definition 2.3 (Gradient descent update of a neural network). Let d ∈ N, Ω ⊂ R^d, and let Φ be a NN of depth L ∈ N, input dimension d, and output dimension 1. Let A_j, b_j, N_j be as in Definition 2.1. Further, let ε = (ε_j)_{j=1}^L > 0 be the sequence of effective relative perturbations and λ > 0 be the step size. Moreover, let M ∈ N and let (x_i)_{i=1}^M ⊂ Ω, (y_i)_{i=1}^M ⊂ R be the training samples. Let j ∈ {1, …, L}. The exact gradient descent update of the biases in the j-th layer, denoted by u_j^b, is given by the backpropagation formula (2.3), where I_j(x) ∈ {0, 1}^{N_j} with (I_j(x))_k = 1 if and only if (R(Φ_j)(x))_k ≥ 0. The exact gradient descent update of the weights in the j-th layer, denoted by u_j^w, is defined analogously. Let Θ_j^b, Θ_j^w be two independent random diagonal matrices of appropriate dimensions containing i.i.d. entries uniformly distributed on [−0.5, 0.5]. The perturbed bias and weight updates are defined as

ũ_j^b := (Id + ε_j Θ_j^b) u_j^b  and  ũ_j^w := (Id + ε_j Θ_j^w) u_j^w.

We define the updated exact weight and bias matrices as A_j − λ u_j^w and b_j − λ u_j^b, and the updated perturbed weight and bias matrices as Ã_j := A_j − λ ũ_j^w and b̃_j := b_j − λ ũ_j^b. The update of Φ with sequence of effective relative perturbations ε is the random variable Φ̃^ε := ((Ã_1, b̃_1), …, (Ã_L, b̃_L)). Moreover, for a domain Ω and for j ∈ {1, …, L − 1}, we define the perturbed preactivations as η̃_j := R(Φ̃_j^ε) : Ω → R^{N_j}.

Remark 2.4. The perturbed updates of Definition 2.3 are a model for numerical errors arising in floating-point arithmetic. In this model, the effective relative perturbations ε_j introduced in Definition 2.3 comprise all numerical errors resulting from repeated applications of matrix multiplications as well as from summing over potentially very large data sets. The effective relative perturbations can therefore be assumed to be bounded from below by the machine precision. However, there are many reasons to assume that they may be significantly larger.
• Amplification. As shown in Section 1.2, unstable NNs may significantly amplify perturbations.
The resulting accumulation of errors will affect lower layers more, since the computation of the associated updates involves more matrix-vector multiplications. As a result, we expect ε_j to grow quickly as j decreases.
• Large data sets. The computation of the update over large data sets requires computing the mean of noisy, already perturbed values. The computation of the mean leads to further error amplification, which may be expected to grow with respect to the size of the data set. This issue is also pointed out in the documentation of the mean function of numpy; see https://numpy.org/doc/stable/reference/generated/numpy.mean.html.
• Forward pass. The computation of the updates in (2.3) is based on the current values ℓ(y_i, R(Φ)(x_i)), i ∈ {1, …, M}, of the loss function, as well as on the intermediate-layer outputs R(Φ_j)(x_i) with i ∈ {1, …, M} and j ∈ {1, …, L}. In practice, both of these computations are also affected by numerical errors, which can be quite substantial (as seen in Section 1.2). This adds yet another source of amplification to ε_1, …, ε_L.
• Stochasticity. The effective relative perturbation quantifies the level of stochasticity of the updates. In stochastic gradient descent, which is often used instead of the gradient descent described here, updates are designed as random variables centred at the true gradient. In this setting, perturbations as above arise naturally.
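The mean-accumulation effect from the "Large data sets" item above can be isolated in a deterministic toy case: with naive left-to-right accumulation in single precision, summands are eventually absorbed by a large running sum, which biases the computed mean. (numpy's own pairwise summation mitigates, but does not eliminate, such errors; the data below are an illustrative assumption.)

```python
import numpy as np

# Absorption in naive float32 accumulation: once the running sum has reached
# 2**24, adding 1.0 no longer changes it, so a left-to-right mean is biased.
vals = [np.float32(2 ** 24)] + [np.float32(1.0)] * 1000

acc = np.float32(0.0)
for v in vals:
    acc += v                      # naive single-precision accumulation

naive_mean = float(acc) / len(vals)
exact_mean = (2 ** 24 + 1000) / len(vals)
rel = abs(naive_mean - exact_mean) / exact_mean
print(float(acc), rel)            # 16777216.0 and a relative error of ~ 6e-5
```

In the model of Definition 2.3, such data-set-dependent errors are absorbed into the effective relative perturbations ε_j.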
To illustrate the size of ε compared to the machine precision, we present a numerical study in Section 4.
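A minimal version of such a measurement compares the bias gradients obtained by backpropagation in single and in double precision, layer by layer; the relative difference serves as a proxy for ε_j. The architecture, initialisation, and squared loss below are illustrative assumptions, not the experimental setup of Section 4.

```python
import numpy as np

def bias_grads(weights, biases, x, y, dtype):
    # Forward and backward pass of a scalar-output ReLU network carried out
    # entirely in the given precision; returns the gradient of the squared
    # loss with respect to each bias vector.
    W = [A.astype(dtype) for A in weights]
    B = [c.astype(dtype) for c in biases]
    acts = [x.astype(dtype)]
    for A, c in zip(W[:-1], B[:-1]):
        acts.append(np.maximum(A @ acts[-1] + c, 0))
    out = W[-1] @ acts[-1] + B[-1]
    delta = 2 * (out - y.astype(dtype))        # d(loss)/d(output) for the square loss
    grads = [None] * len(W)
    for j in reversed(range(len(W))):
        grads[j] = delta                       # d(loss)/d(b_j)
        if j > 0:
            delta = (W[j].T @ delta) * (acts[j] > 0)
    return grads

rng = np.random.default_rng(3)
widths = [8] + [32] * 6 + [1]
weights = [rng.normal(0.0, (2.0 / widths[i]) ** 0.5, (widths[i + 1], widths[i]))
           for i in range(len(widths) - 1)]
biases = [rng.normal(0.0, 0.1, w) for w in widths[1:]]
x, y = rng.normal(size=8), np.array([1.0])

g32 = bias_grads(weights, biases, x, y, np.float32)
g64 = bias_grads(weights, biases, x, y, np.float64)
for j, (g_lo, g_hi) in enumerate(zip(g32, g64), start=1):
    rel = np.linalg.norm(g_lo - g_hi) / (np.linalg.norm(g_hi) + 1e-30)
    print(j, rel)   # per-layer proxy for the effective relative perturbation eps_j
```

The per-layer relative differences sit well above the single-precision machine epsilon, illustrating that the effective perturbation of a gradient computed through several layers exceeds the error of a single floating-point operation.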
A complete iteration of gradient descent with perturbed updates corresponds to repeatedly applying Definition 2.3. We present a formal definition below.

Definition 2.5 (Training sequence). We define a sequence of NNs (Φ̃^{ε,n})_{n∈N}, where, for n ∈ N, Φ̃^{ε,n+1} is the update of Φ̃^{ε,n} with effective relative perturbation ε and step size λ_n, and Φ̃^{ε,1} = Φ. We call (Φ̃^{ε,n})_{n∈N} the training sequence with initialisation Φ.
Next, we present two crucial assumptions on the perturbed gradient descent updates of Definition 2.3. First, we assume that the average of the reciprocals of the non-zero bias updates is not too large. In addition, we assume that if the update of the bias weight of a neuron is equal to zero, then the associated neuron is dead.
The second assumption that we make is that the derivative of each of the preactivated neurons is bounded by one.

Assumption 2.7. Let d ∈ N and Ω ⊂ R^d be a domain. Maintaining the notation of Definition 2.1, we assume that for all j ∈ {1, …, L − 1} and k ∈ {1, …, N_j}, we have that ∥∇(η_j)_k(x)∥ ≤ 1 for all x ∈ Ω.

We end this section by discussing the plausibility of Assumptions 2.6 and 2.7 above. Assumption 2.6 requires the average of the reciprocals of the non-zero entries of the bias updates to be bounded by c_0 N^{ν−1}. We numerically check this assumption in Section 4 and find that, for ν = 2, the first part of the assumption is satisfied with probability at least 1/2 for c_0 < 0.1. We also show theoretically that, under mild assumptions on the distribution of (u_j^b)_k, we can expect Assumption 2.6 to hold for moderate ν. Moreover, when sufficiently many training samples are used, any neuron such that the derivative with respect to its bias value is zero at all points in the training set is likely to be dead. Since this implies that the associated bias will not be changed in the gradient descent step, and since the bias value is the most important (but not the only) parameter determining whether a neuron lives, it is likely that the neuron remains dead after one gradient descent step.
On the other hand, Assumption 2.7 requires the output of each neuron to have a gradient of length not exceeding one. It will be clear from the proofs of our main results that the bound of 1 can be replaced by any other positive number. Note that the boundedness of the gradient is reasonable to assume in general since a) it is a direct consequence of most initialisation schemes [14], b) it is a sufficient condition to prevent the amplification effects that were the basis of the examples in Subsection 1.2 of the introduction, and c) quite often an additional regularisation, such as weight decay [8, Section 7.1], is used, which promotes smaller weights and implies bounds on the derivatives of the outputs of each neuron.

The number of affine pieces generated in perturbed gradient descent
In this section, we demonstrate that, in the framework of Section 2 and under Assumptions 2.6 and 2.7, a neural network trained via gradient descent with an appropriately chosen step size will not admit a high number of affine pieces. We demonstrate in Theorem 3.4 that the realisation of a NN after one step of gradient descent, in expectation, admits a number of affine pieces that scales polynomially with the number of neurons. Moreover, we show in Theorem 3.6 that the number of affine pieces is polynomial in the number of gradient descent steps and in the number of neurons of the neural network. The polynomial dependence on the number of neurons in both results depends on the parameters in Assumption 2.6, and this polynomial is numerically identified to be quadratic in Section 4.
Before we can state the main results, we collect some auxiliary lemmas in Subsection 3.1. In Subsection 3.2, we combine these results to obtain an upper bound on the expected number of affine pieces of a full neural network. Finally, we present a high-probability upper bound on the number of affine pieces during the full iteration of gradient descent in Subsection 3.3.

Expected number of generated affine pieces in one neuron
Intuitively speaking, a neuron of the form ϱ(h + b), where h is a piecewise-affine map from R^d to R and where b ∈ R, generates an affine piece if h(x) + b = 0 for a point x at which h is affine and not constant. In all other cases, h + b is either constant or the range of h(·) + b locally lies in one of the two linear regions of ϱ, which implies that the regularity of ϱ(h + b) is the same as that of h.
If b in the argument above is chosen randomly, we expect that we can quantify the probability of generating a given number of affine pieces. The following lemma is a first step in this direction. To not disturb the flow of reading, the proof of this auxiliary result is deferred to Appendix A.2.1.

Lemma 3.1. Let c > 0 and let h be a piecewise-affine function on [0, c] with P ∈ N affine pieces and |h′| ≤ 1 almost everywhere. Let t ∈ N and let A be a Lebesgue measurable set such that for every y ∈ A it holds that #{x ∈ [0, c] : h(x) = y} ≥ t. Then λ(A) ≤ c/t, where λ is the Lebesgue measure.

Remark 3.2. Lemma 3.1 can be reformulated as saying that

P(#{x ∈ [0, c] : h(x) = U} ≥ t) ≤ c/(tδ),

where U is a uniformly distributed random variable on an interval of length δ and |h′| ≤ 1.

An interpretation of this is that the probability of a uniform random variable having many intersections with a function of bounded gradient is inversely proportional both to the width of the uniform distribution and to the number of intersections.
Remark. Note that the estimate in Lemma 3.1 is independent of the number of affine pieces P, but the proof requires that P be finite.
In the framework of Definition 2.3, the updates at each gradient descent step are randomly perturbed. It follows from Lemma 3.1 that a random bias vector is unlikely to generate a high number of affine pieces. As a consequence, we obtain the following result, which is proved in Appendix A.2.2.

Expected number of affine pieces of neural networks after one gradient descent step
Next, we bound from above the expected number of affine pieces of a full NN after a single gradient descent update according to Definition 2.3. In Proposition 3.3, we bounded from above the expected number of affine pieces added by one neuron. The expected number of affine pieces added in the full NN is found by summing the added affine pieces over the individual neurons. The following theorem then follows by the linearity of the expected value. We present a proof in Appendix A.3.

Theorem 3.4. Let L, d ∈ N and N_1, …, N_{L−1} ∈ N, N_L = 1, and let Φ be a NN with architecture (d, N_1, …, N_L). Further, let Φ̃^ε be a NN after one backpropagation step with sequence of effective relative perturbations ε ∈ (0, ∞)^L satisfying Assumption 2.6 with c_0, ν > 0 and Assumption 2.7, and with step size λ > 0. Then, for every line κ ⊂ [0, 1]^d of length L(κ), the estimate (3.3) on the expected number of affine pieces of R(Φ̃^ε) on κ holds for every j′ = 1, …, L, where ε̄_{j′} := min{ε_1, …, ε_{j′}}.
Remark 3.5. To interpret (3.3), we note that, depending on the choice of j′, the upper bound varies between exponential in L (for j′ = 1) and linear in L (for j′ = L). Moreover, if L − j′ is fixed, then the whole estimate is a polynomial upper bound on the number of affine pieces in terms of the number of neurons N. On a theoretical level, this demonstrates that the number of affine pieces of a NN after one backpropagation step does not, in expectation, scale exponentially with the number of layers.
For practical purposes, it should be mentioned that this upper bound is void if ε_j is too small for some j < j′. However, as described in Remark 2.4, we expect that, for NNs with many neurons per layer, already for reasonably small j′, performing j′ matrix-vector multiplications as in (2.2) as well as summations over large data sets should result in the accumulation of errors such that the effective perturbations ε_j with j < j′ are sufficiently large. Theorem 3.4 demonstrates that the number of affine pieces that are expected after one gradient descent step is polynomial in the number of layers. This is problematic for constructions of NNs that create an exponential number of affine pieces with respect to the number of layers: when the effective perturbations are not too small, it is unlikely that a NN with a number of affine pieces exponentially large with respect to the number of layers is found after a single gradient descent step.

Generation of affine pieces during full iteration
We now take a closer look at the implications of Theorem 3.4 when optimising a NN using gradient descent. As a consequence of Theorem 3.4 and Markov's inequality, the bound on the expected number of affine pieces along a given line κ also holds, up to a factor of two, with probability at least 1/2. Applying this argument, together with Assumptions 2.6 and 2.7, in every update step of Φ̃^{ε,n} yields Theorem 3.6.

Remark 3.7. As presented in [33,32], gradient descent as used in most machine learning applications typically achieves a convergence order of 1/√n. Consider a step-size rule λ_n ≍ n^{−α}, where α ∈ [0, 1], e.g., α = 1/2 as in [43, Section 14.4.2] or α = 1 as in [32].
Considering the square loss, we can assume for a converging training sequence that (3.5) holds. The scaling law (3.5) shows that it is reasonable to assume (3.6). Indeed, assuming the reverse inequality of (3.6), a direct computation (plugging (3.6) into (2.3)) shows that the weights of the NNs (Φ_{ε,n})_{n∈N} would converge with a rate faster than n^{-1/2}. By the local Lipschitz property of the realisation map [36, Proposition 4.1], this yields pointwise convergence of (R(Φ_{ε,n}))_{n∈N} to a limit with a rate faster than n^{-1/2}. Due to the quadratic dependence of the square loss on the pointwise error, this would violate (3.5).
As a consequence of (3.6), as well as the linearity of (2.2) and (2.3), it is sensible to claim that c_0^{(n)} of Assumption 2.6 should scale as (λ_n n^{-3/2})^{-1} = n^{α+3/2} with respect to n ∈ N. For loss functions other than the square loss, similar arguments can be made, which may yield different exponents depending on the modulus of continuity of the loss.
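For concreteness, the arithmetic behind this scaling can be spelled out as follows (the order n^{-3/2} of the gradient entries is taken from the discussion above; the display is our reading of the argument, with the bias updates u_{b_j} as in Assumption 2.6):

```latex
\lambda_n \cong n^{-\alpha},
\qquad
|(u_{b_j})_k| \sim \lambda_n \, n^{-3/2} = n^{-\alpha - 3/2},
\qquad
c_0^{(n)} \sim \bigl(\lambda_n \, n^{-3/2}\bigr)^{-1} = n^{\alpha + 3/2},
```

so that λ_n c_0^{(n)} ∼ n^{3/2} for every choice of α ∈ [0, 1].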
Based on (2.3), we therefore conclude that an admissible scaling law for Assumption 2.6 for n ∈ N is

If we assume a different lower bound for the convergence of gradient descent than 1/√n, which could be sensible in special cases, then the scaling of c_0^{(n)} would change accordingly.
We observe that, if λ_n c_0^{(n)} scales polynomially with respect to n ∈ N, then Theorem 3.6 states that, with probability 1/2, NNs obtained by training with gradient descent will not have exponentially many affine pieces with respect to the number of layers, unless an exponential number of iterations is performed. Remark 3.7 shows that λ_n c_0^{(n)} can be assumed to grow not faster than n^{3/2} with respect to n ∈ N.
Since, by construction, the NNs (Φ_{ε,n})_{n∈N} are independent random variables, the event that the sequence (R(Φ_{ε,n}))_{n∈N} maintains many affine pieces over a long series of gradient descent steps is highly unlikely.

Experimental studies

Size of the effective relative perturbation
In Definition 2.3, we assume that the gradient descent updates are perturbed by an effective relative perturbation ε. Theorem 3.4 then shows that the expected number of pieces is bounded in terms of the reciprocal of ε. Therefore, we would like to assume that ε is not extremely small. In Remark 2.4, we argue that it is sensible to assume that ε is substantially larger than the machine precision, especially in the coordinates corresponding to lower layers. In this subsection, we present numerical results that support this claim in practice.
We carry out the following experiment. We initialise a random neural network according to the He initialisation [16] and carry out one gradient descent step over a random training set of 100 data points. The training data are chosen randomly, but such that they result in an ℓ^2-norm perturbation of the data points of size pert.
We perform the forward and backward propagation, as well as the computation of the involved means, in single precision and record the relative error of the resulting gradients of the bias vectors against an update computed in double precision. Then, for each layer, we compute the median of the relative errors. We take the median instead of the mean so that the result is not dominated by a few very large updates.
The results are averaged over 10000 runs and shown in Figure 1. In addition to the mean over the 10000 runs, a shaded region corresponding to one relative standard deviation is shown.
We observe that the computation in single precision loses up to five orders of magnitude against its machine precision, which is of the order of 10^{-8}. Architectures with more layers lead to higher errors. The relative error also appears to scale inversely with the norm of the loss.
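The precision comparison can be sketched in a few lines of NumPy (a simplified re-implementation under our own choices of architecture and data, not the paper's exact code; the square loss, the batch of 100 points, and the He initialisation follow the text, while the sin labels and layer widths are illustrative):

```python
import numpy as np

def bias_grad_errors(widths=(1, 64, 64, 64, 1), n_pts=100, seed=0):
    """Per-layer median relative error of bias gradients, float32 vs float64.

    A simplified sketch of the experiment above: one batch, square loss,
    He initialisation; widths and labels are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    Ws = [rng.standard_normal((widths[l + 1], widths[l]))
          * np.sqrt(2.0 / widths[l]) for l in range(len(widths) - 1)]
    bs = [np.zeros(widths[l + 1]) for l in range(len(widths) - 1)]
    X = rng.uniform(0, 2 * np.pi, (n_pts, widths[0]))
    Y = np.sin(X[:, :1])

    def bias_grads(dtype):
        W = [w.astype(dtype) for w in Ws]
        b = [v.astype(dtype) for v in bs]
        acts, pre = [X.astype(dtype)], []
        for l in range(len(W)):                       # forward pass
            z = acts[-1] @ W[l].T + b[l]
            pre.append(z)
            acts.append(np.maximum(z, 0) if l < len(W) - 1 else z)
        delta = (acts[-1] - Y.astype(dtype)) * dtype(2.0 / n_pts)
        grads = [None] * len(W)
        for l in reversed(range(len(W))):             # backward pass
            grads[l] = delta.sum(axis=0)              # bias gradient
            if l > 0:
                delta = (delta @ W[l]) * (pre[l - 1] > 0)
        return grads

    g32, g64 = bias_grads(np.float32), bias_grads(np.float64)
    return [float(np.median(np.abs(a - b) / (np.abs(b) + 1e-300)))
            for a, b in zip(g32, g64)]

errs = bias_grad_errors()
```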
Overall, the claim that ε can often be considerably larger than the underlying machine precision appears warranted. To verify the sensibility of Assumption 2.6, we numerically check whether the gradient descent update of the biases of a randomly initialised NN is, on average, appropriately far away from zero. For L = 3, 4, 5, 6, 7, we construct random NNs with L layers, input and output dimensions equal to 1, and various numbers of neurons between 100 and 400 per layer. We initialise the NNs according to the He initialiser [16]. Next, we choose 500 random training points (x_i)_{i=1}^{500} uniformly in [0, 2π] and associate the labels (sin(x_i))_{i=1}^{500}. We perform one step of gradient descent and collect the sizes of the updates of the bias vectors. Finally, we compute for each NN the value

where N ∈ N is the total number of neurons of the NN and (u_{b_j})_k are the updates of the bias vectors as in Assumption 2.6. The term (4.1) corresponds to ν = 2 in Assumption 2.6. In Figure 2, we depict, for each number of layers, the resulting values of (4.1). More precisely, for each number of layers L and each N_1 ∈ {100, 110, ..., 390, 400}, we initialise 500 neural networks with architecture (1, N_1, ..., N_1, 1) and compute the associated values of (4.1). We average these values over the 500 runs. Then, we depict in Figure 2 the maximum of these averages over all values of N_1. We observe that, in at least half of the cases, the value of (4.1) is bounded by c_0 < 0.1.
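The collection of bias updates that enters the statistic (4.1) can be sketched as follows (one gradient descent step on the sin data described above; the step size lam is an illustrative assumption, and the statistic itself would then be computed from the returned vector):

```python
import numpy as np

def collect_bias_updates(widths=(1, 200, 200, 200, 1), n_pts=500,
                         lam=0.01, seed=0):
    """Bias updates u_{b_j} = -lam * dL/db_j after one gradient descent
    step on (x_i, sin(x_i)) data, flattened over all neurons.

    A sketch of the setup behind Figure 2; lam is an illustrative step
    size not specified in the text.
    """
    rng = np.random.default_rng(seed)
    Ws = [rng.standard_normal((widths[l + 1], widths[l]))
          * np.sqrt(2.0 / widths[l]) for l in range(len(widths) - 1)]
    bs = [np.zeros(widths[l + 1]) for l in range(len(widths) - 1)]
    X = rng.uniform(0, 2 * np.pi, (n_pts, 1))
    Y = np.sin(X)
    acts, pre = [X], []
    for l, (W, b) in enumerate(zip(Ws, bs)):          # forward pass
        z = acts[-1] @ W.T + b
        pre.append(z)
        acts.append(np.maximum(z, 0) if l < len(Ws) - 1 else z)
    delta = (acts[-1] - Y) * (2.0 / n_pts)            # square loss
    updates = []
    for l in reversed(range(len(Ws))):                # backward pass
        updates.append(-lam * delta.sum(axis=0))
        if l > 0:
            delta = (delta @ Ws[l]) * (pre[l - 1] > 0)
    return np.concatenate(updates[::-1])

u = collect_bias_updates()
```

Note that some entries of u are exactly zero (units that are inactive on the whole batch), which is harmless here due to the convention 0^† = 0 mentioned below.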
In this case, we have, for a q > 1/α with q ∈ N, by the binomial theorem, that

Moreover, for large N ∈ N, we have by the law of large numbers that, with high probability,

where c is such that supp σ ⊂ [0, c] and we used that -1/q + α > 0.
We can study the distribution of (u_{b_j})_k numerically to see whether it indeed has a mild pole at zero. We use the same setup as in the experiment of Figure 2 and observe in Figure 3 that the density σ of (u_{b_j})_k appears to satisfy σ(x) ≃ x^{-1/2+δ} as x → 0 for some δ > 0. By the previous argument, this implies that Assumption 2.6 should hold with ν = 2. This estimate coincides with the findings of the experiment shown in Figure 2.
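The slope diagnostic underlying Figure 3 can be sketched as follows (our own sketch of the procedure, not the paper's code). As a sanity check, we feed it synthetic samples drawn from a density proportional to x^{-1/2} on (0, 1], obtained by squaring uniform variates, for which the fitted log-log slope should be close to -1/2:

```python
import numpy as np

def pole_exponent(samples, x_min=1e-4, x_max=1.0, n_bins=30):
    """Estimate the exponent p in density(x) ~ x^p as x -> 0 from samples,
    via a least-squares fit in log-log coordinates over log-spaced bins."""
    edges = np.logspace(np.log10(x_min), np.log10(x_max), n_bins + 1)
    counts, _ = np.histogram(samples, bins=edges)
    widths = np.diff(edges)
    centers = np.sqrt(edges[:-1] * edges[1:])        # geometric bin centres
    density = counts / (widths * len(samples))
    mask = counts > 0                                # skip empty bins
    slope, _ = np.polyfit(np.log(centers[mask]), np.log(density[mask]), 1)
    return slope

rng = np.random.default_rng(0)
samples = rng.uniform(size=1_000_000) ** 2   # density ~ x^(-1/2) on (0, 1]
slope = pole_exponent(samples)
```

Applied to the empirical bias updates instead of synthetic samples, a slope no steeper than -1/2 is the behaviour reported in Figure 3.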
Note that the assumption that the gradient updates have a density with a mild pole is quite weak. Consider, for example, a random neural network whose weight matrices are all Gaussian. If we replaced all ReLUs by identities, then the bias updates would be normally distributed. Since the normal distribution has a bounded density, the assumption would be satisfied in this case. The application of the ReLU activation function sets some updates to 0. This, however, is inconsequential for our analysis due to our convention 0^† = 0.

Experimental study of the main theorem
In this section, we study the number of pieces of NN realisations in practical scenarios, as well as the effect of the accuracy of floating-point computations. Computing the number of pieces of a NN by identifying regions whose gradients differ by more than a threshold is extremely sensitive to the choice of the threshold and therefore does not yield reliable results. Instead, we count so-called activation regions, which were considered as a proxy for affine pieces in [14]. Indeed, the number of activation regions is an upper bound on the number of affine pieces. We refer to [14, Definition 1] for the definition of activation regions and to [14, Lemma 3] for the connection between activation regions and affine pieces.
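A minimal version of this counting procedure looks as follows (our own sketch, following the notion of activation regions from [14]: points on a line are grouped by the joint sign pattern of all hidden-layer pre-activations; the resolution is limited by the grid spacing):

```python
import numpy as np

def count_activation_regions(weights, biases, t_grid, p0, p1):
    """Count activation regions of a ReLU network along the line
    p0 + t * (p1 - p0), t in t_grid, by counting changes of the joint
    sign pattern of all hidden-layer pre-activations."""
    x = p0[None, :] + t_grid[:, None] * (p1 - p0)[None, :]
    patterns = []
    n_layers = len(weights)
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = x @ W.T + b
        if l < n_layers - 1:            # hidden layers carry ReLUs
            patterns.append(z > 0)
            x = np.maximum(z, 0)
        else:                           # affine output layer
            x = z
    pat = np.concatenate(patterns, axis=1)          # one row per grid point
    changes = np.any(pat[1:] != pat[:-1], axis=1)
    return int(changes.sum()) + 1

# toy check: one hidden layer with breakpoints at t = 0.25, 0.5, 0.75
W1 = np.ones((3, 1)); b1 = np.array([-0.25, -0.5, -0.75])
W2 = np.ones((1, 3)); b2 = np.zeros(1)
t = np.linspace(0.0, 1.0, 1001)
n = count_activation_regions([W1, W2], [b1, b2], t,
                             np.array([0.0]), np.array([1.0]))
```

For the toy network, the three hidden units switch on at 0.25, 0.5 and 0.75, so the line decomposes into four activation regions.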
In our first experiment, depicted in Figures 4 and 5, we train NNs with various numbers of layers and neurons to approximate the functions x → 1 - x^2/2 and x → cos(x), respectively. For this, the mean square loss on 500 randomly chosen training points is minimised via gradient descent. As step size at iteration n, we use λ_n = 0.02/(1 + √n/8); this rule was empirically found to yield the fastest convergence. The NNs are randomly initialised according to the He initialisation [16]. In the computation of the gradient, as well as when evaluating the NNs, we perturb the output of each matrix-vector multiplication with relative noise whose amplitude is specified below.
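The perturbed matrix-vector multiplication can be sketched as follows (the coordinate-wise uniform noise model is our assumption; the text only specifies that the perturbation is relative with a given amplitude):

```python
import numpy as np

def perturbed_matvec(W, x, eps, rng):
    """Matrix-vector product whose output is perturbed coordinate-wise by
    relative noise of amplitude eps, mimicking the perturbed computations
    described above."""
    y = W @ x
    return y * (1.0 + eps * rng.uniform(-1.0, 1.0, size=y.shape))

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
x = rng.standard_normal(4)
y = perturbed_matvec(W, x, 1e-4, rng)
exact = W @ x
```

By construction, each output coordinate deviates from the exact product by at most eps in relative terms.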
In Figures 4 and 5, we observe that the number of activation regions always stays far below the upper bound of Theorem 3.6. However, there is no clear effect of the accuracy of the computations. Note that, in the regime far away from the upper bound of Theorem 3.6, it is conceivable that noise in the updates of the biases can even create new affine pieces, since, in this case, as seen in Remark 3.2, the generation of affine pieces is not necessarily a low-probability event. Hence, it should not be surprising that no clear effect of the level of numerical accuracy is visible in this regime. Moreover, we expect that, in addition to the accuracy of the computations, there are further reasons which prevent the creation of many pieces in practice.

To make the effect of the accuracy of floating-point arithmetic visible, we need to enter a regime where the underlying NNs already have many affine pieces and then observe the effect of the accuracy on the number of pieces. We propose the following example. As famously shown in [47, Proposition 2], the function x → x^2 can be approximated very efficiently by ReLU NNs. The NN with L layers and four neurons per layer whose weights are given by (5.1) for ℓ = 1, ..., L - 1, with b_L := 0, has 2^{L-1} affine pieces. We initialise a NN in this way and train it with step size λ_n = 0.1/(1 + √n), for n = 1, ..., 200, on training data comprised of 5000 uniformly randomly sampled elements of [0, 1] with labels given by the function x → cx^2, where c = 0.99999. Note that simply changing A_L to A_L = c · (1, -1, 2, -1) would already achieve zero training error. Hence, the target function is very close to the initialisation. In this example, we clearly observe the effect of the numerical accuracy: high-accuracy computations maintain activation regions much better and in some cases even increase the number of activation regions further. The results are depicted in Figure 6.
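The mechanism behind the 2^{L-1} pieces can be reproduced with the classical sawtooth construction of [47] (here realised with a three-ReLU hat function rather than the four-neuron layers of (5.1), so this is an illustration of the principle, not the paper's initialisation): composing the hat function m times yields a sawtooth with 2^m affine pieces.

```python
import numpy as np

def hat(x):
    """Piecewise-linear hat function on [0, 1], realised with three ReLUs."""
    r = np.maximum
    return 2 * r(x, 0) - 4 * r(x - 0.5, 0) + 2 * r(x - 1.0, 0)

def count_pieces(f, a=0.0, b=1.0, n=4001, tol=1e-6):
    """Count affine pieces of f on [a, b] by detecting slope changes on a
    fine grid. Adequate here because the sawtooth's slopes are +-2^m and
    n is chosen so that all breakpoints k/2^m land on grid nodes."""
    x = np.linspace(a, b, n)
    s = np.diff(f(x)) / np.diff(x)
    return int(np.sum(np.abs(np.diff(s)) > tol)) + 1

f = lambda x: hat(hat(hat(x)))   # m = 3 compositions -> sawtooth
pieces = count_pieces(f)
```

Each additional composition doubles the number of pieces, mirroring the exponential growth with depth that the perturbation analysis above predicts is hard to maintain under gradient descent.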
We observe in Figure 6 that, if higher-accuracy computations are used, fewer activation regions are lost during the iteration than with low-accuracy computations. In all runs, the mean squared error does not change significantly during the iteration.
Having defined finite-precision operations, we continue by presenting a neural network Φ such that, for every non-negative input x, it holds that |Φ̃(x) - Φ(x)| = |Φ(x)| = |x|, where Φ̃ denotes the evaluation of Φ in floating-point arithmetic. This implies that the relative error of passing to the finite-precision network is 1.

Proposition 1.1. Let L ∈ N and N ∈ N. Let a = 2^r ≥ 1 for r ∈ Z and let M be as in (1.1), with e_min ≤ 0, e_max > 53 + r, p = 53 and hence ϵ = 2^{-53}. If (L - 3) log_2(N - 1) + (L - 1) log_2(a - 2ϵ) ≥ 54, then there is a neural network Φ_{a,N,L} with L layers, N neurons per layer, and all weights bounded in absolute value below by 1 and above by a such that |Φ̃_{a,N,L}(x) - Φ_{a,N,L}(x)| = |Φ_{a,N,L}(x)| = |x| for all x ≥ 0, where Φ̃_{a,N,L}(x) denotes the neural network Φ_{a,N,L} evaluated in floating-point arithmetic with machine epsilon ϵ.
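The flavour of this result can be reproduced with a one-neuron example (a minimal illustration of total cancellation in float32; it is not the construction of Proposition 1.1, whose weights are bounded by a): the network x → relu(x + c) - c is the identity on x ≥ 0 in exact arithmetic, but for large c the addition absorbs x in single precision, so the finite-precision evaluation has relative error 1.

```python
import numpy as np

def lossy_identity(x, dtype=np.float32, c=2.0 ** 30):
    """One-hidden-neuron ReLU network relu(x + c) - c, equal to x for
    x >= 0 in exact arithmetic. In float32 with c = 2^30, the sum x + c
    rounds to c for small x, so the output collapses to 0."""
    x = dtype(x)
    return np.maximum(x + dtype(c), dtype(0)) - dtype(c)

y32 = lossy_identity(1.0, np.float32)   # x absorbed by c in float32
y64 = lossy_identity(1.0, np.float64)   # exact in float64
```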

Definition 2.5. Let d ∈ N, Ω ⊂ R^d, and let Φ be a NN of depth L ∈ N, input dimension d, and output dimension 1. Further, let M ∈ N, and let X = (x_i)_{i=1}^M ⊂ Ω, Y = (y_i)_{i=1}^M ⊂ R be the training samples. Let ε be the sequence of effective relative perturbations and (λ_n)_{n∈N} ⊂ R_+ be the sequence of step sizes.

Assumption 2.6. Let d ∈ N and let Ω ⊂ R^d be a domain. Let ν, c_0 > 0. Maintaining the notation of Definition 2.3, we assume that

We continue by studying the impact of (3.4) when training a NN with the gradient descent algorithm and obtain the following main result.

Theorem 3.6. Let d ∈ N and Ω ⊂ [0, 1]^d. Let Φ be a NN of depth L ∈ N, input dimension d, and output dimension 1, let M ∈ N, and let X = (x_i)_{i=1}^M ⊂ Ω, Y = (y_i)_{i=1}^M ⊂ R be the training samples. Let ε > 0 be the effective relative perturbation and (λ_n)_{n∈N} the sequence of step sizes. Let (Φ_{ε,n})_{n∈N} be the associated training sequence with initialisation Φ. Assume that κ ⊂ [0, 1]^d is a line. Then, for each n ∈ N satisfying Assumption 2.6 with c_0 = c_0^{(n)}

Figure 1: Mean size of ε = (ε_j)_{j=1}^L for the experiment described above, for a variety of architectures and perturbations. All results are averaged over 10000 runs; the relative change of one standard deviation is shown as a shaded area.

Figure 2: Cumulative distribution function of the results of the experiment estimating the value of (4.1) in practice. In about half of the cases, the value of (4.1) is bounded by 0.1.

Figure 3: Relative frequencies of the bias updates (u_{b_j})_k computed over 10000 neural network instantiations with N = 200 neurons per layer and, from top to bottom, L = 4, 5, 6 layers. We only include updates with absolute value less than one, since we are interested in the asymptotic behaviour at zero. The constants C_1, C_2, C_3 in the plots are 1/160, 1/270, 1/160, respectively. The right panel shows the same distribution in a log-log plot and indicates that the density of the bias updates has a mild algebraic pole of order 1/2 at the origin, since the slope of the distribution appears to be no steeper than -1/2.

Figure 6: Training of neural networks initialised with many pieces according to (5.1). In the top row, 12 layers are used; in the bottom row, 14 layers. Each matrix-vector multiplication during training is perturbed by relative noise of size in {5 · 10^{-5}, 10^{-4}, 5 · 10^{-4}, 10^{-3}, 5 · 10^{-3}}. All results are averaged over fifteen runs; the relative change of one standard deviation is shown as a shaded area.

Figure 7: The unstable neural network Φ_{a,N,L}. As defined earlier, all weights between green neurons equal a.