On the computation of the gradient in implicit neural networks

Implicit neural networks and the related deep equilibrium models are investigated. To train these networks, the gradient of the corresponding loss function has to be computed. Bypassing the implicit function theorem, we develop an explicit representation of this quantity, which leads to an easily implementable computational algorithm.


Introduction
Conventional neural networks have a feedforward structure: several layers are stacked after each other, and their output can be computed explicitly. To generalize this structure, the so-called implicit neural networks were introduced and analyzed in [1][2][3][4][5]. A related approach, called Deep Equilibrium Models, is described in the works [6][7][8]. In short, this model can be described as a feedforward deep neural network with identical layers. In practice, by increasing the number of layers, the existence and the computation of an equilibrium state is investigated. More precisely, Bai et al. [6] formulated an L-layer feedforward network as

z^[k+1] = f_θ(z^[k]; x), k = 0, …, L − 1,

where z^[k] denotes the hidden state in the k-th layer and the input representation x is injected into each layer. Under certain stability conditions, the limit L → ∞ corresponds to a fixed-point iteration and hence leads to an equilibrium solution z of the equation

z = f_θ(z; x). (1)

The main task, however, is to minimize a given loss function E(z; t) among the solutions of equation (1) by optimizing the parameters θ of the transformation f_θ. Here, for a gradient-based optimization technique, the gradient ∂E/∂θ of the loss function has to be calculated. The corresponding theory is based on the implicit function theorem, see [9], [6], and [1]. Solving this problem numerically in an efficient way is also difficult; for further details see [7, 10, 11].
Our contributions in this paper are theoretical, and their outline is as follows. We introduce implicit neural networks similarly to Deep Equilibrium Models, using the link between the two frameworks. We propose a theory that bypasses the application of the implicit function theorem and leads to an easily accessible algorithm for computing the above gradient. Instead of applying a long automatic differentiation over the deep model, we develop a closed formula for this quantity.

Construction of the network
In general, a neural network is represented as a directed graph [12]; this is the computational graph of the network. A network is called feedforward or acyclic if the corresponding graph is acyclic. Similarly, cyclic directed graphs correspond to the so-called implicit neural networks. Let us assume that the total number of vertices (which are also called the neurons) is K. Let a_j denote the activation value of the j-th neuron and f_j : R → R the corresponding activation function. Common examples are f_j(z) = tanh(z) or f_j(z) = max{0, z}. Evaluation of the network for an input vector x ∈ R^L is as follows. If a neuron with index j receives a stimulus of magnitude a_l from its ancestor of index l along the edge with the weight w_{j,l}, and a constant stimulus b_j (also called the bias) is applied to it, then the cumulated input z_j of this neuron is

z_j = Σ_l w_{j,l} a_l + b_j + x̆_j, (2)

where the operator ˘ : R^L → R^K lifts the input vector x onto the inputs of the neurons. Accordingly, the activation value of the neuron with index j is

a_j = f_j(z_j). (3)

Assume for simplicity that the indexing of the neurons is such that the last N neurons are the output ones.
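As a tiny numerical illustration of these two formulas (the weights, activations, and the choice f_j = tanh below are hypothetical, not values from the paper):

```python
import numpy as np

# Hypothetical values: a neuron of index 3 with two incoming edges from
# neurons 1 and 2 (weights w31, w32), a bias b3, and no lifted input.
w31, w32, b3 = 0.5, -0.2, 0.1
a1, a2 = 1.0, 2.0

z3 = w31 * a1 + w32 * a2 + b3   # cumulated input of neuron 3, cf. (2)
a3 = np.tanh(z3)                # its activation value, cf. (3), with f_3 = tanh
```

Any other admissible activation, e.g. f_3(z) = max{0, z}, could be substituted in the last line.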

Problem statement
Indeed, the formulas in (2)-(3) define the following system of K equations:

z(x) = W a(x) + b + x̆, a(x) = f(z(x)). (4)

Here z(x), a(x), b ∈ R^K, W ∈ R^{K×K} and x ∈ R^L. The functions f_j : R → R are given for j = 1, …, K, and they define the vector-valued function f(z) = (f_1(z_1), …, f_K(z_K))^T for z = (z_1, …, z_K)^T. Sometimes we simplify the notation and omit the x-dependence of the terms in (4).
We also have M pairs of training samples (x, y), where x ∈ R^L are input and y ∈ R^N are target vectors. At the m-th pair of samples, i.e., at the input x^(m), the error is defined as

E(m) = 1/2 Σ_{j=K−N+1}^{K} (a_j(x^(m)) − ỹ_j^(m))², (5)

where ỹ_j^(m) denotes the corresponding component of the lift ỹ^(m) of the m-th target vector y^(m) = (y_1^(m), …, y_N^(m)) ∈ R^N, which is to be compared with the value of a_j(x^(m)).
Therefore, let the lift operator ˜ : R^N → R^K be defined by the formula ỹ = (0, …, 0, y_1, …, y_N)^T. Likewise, consider the lift operator ˘ : R^L → R^K, defined by the formula x̆ = (x_1, …, x_L, 0, …, 0)^T. That is, we assume that the input data is copied to the first L neurons of the network and the output of the network is yielded by the final N neurons. We also assume that 1 ≤ L, N ≤ K. The average error E over all pairs of training samples is

E = (1/M) Σ_{m=1}^{M} E(m). (6)

We investigate the task of determining W and b such that the error given by (6) is minimal. Solving equation (4) by a fixed-point iteration yields the vector z = (z_1, …, z_K)^T of neuron input values and the vector a = (a_1, …, a_K)^T of activation values. A single step of the fixed-point iteration for solving (4) has the form

z^(l+1) = W f(z^(l)) + b + x̆. (7)

An important observation is that the iteration in (7) delivers a feedforward neural network with an infinite number of layers and K neurons in each layer, so we obtain a deep equilibrium model. In this framework, the fixed-point iteration can be interpreted as the layer-wise computation with the original input. The weights of the edges between each two adjacent layers are given by the matrix W ∈ R^{K×K}, and b ∈ R^K is the bias vector. Such an interpretation is shown in Figure 2, which is the unrolling of the small network shown in Figure 1, focusing only on the 3rd neuron.

Remark 1 The algorithm for computing the network is given in Pseudocode 1 in Appendix A.
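To make the unrolling concrete, here is a minimal NumPy sketch (our own naming and toy sizes, not the authors' code) of solving equation (4) by the fixed-point iteration (7):

```python
import numpy as np

def evaluate_network(W, b, x_lift, f, tol=1e-10, maxiter=1000):
    """Solve z = W f(z) + b + x_lift by the fixed-point iteration (7).

    W      : (K, K) weight matrix of the implicit network
    b      : (K,)   bias vector
    x_lift : (K,)   lifted input (x_1, ..., x_L, 0, ..., 0)
    f      : elementwise activation function
    """
    z = np.zeros_like(b)
    for _ in range(maxiter):
        z_new = W @ f(z) + b + x_lift
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z

# Hypothetical toy network: K = 3 neurons, L = 2 inputs, tanh activations.
K, L = 3, 2
rng = np.random.default_rng(0)
W = rng.standard_normal((K, K))
W *= 0.5 / np.linalg.norm(W, 2)        # keep the iteration contractive
b = rng.standard_normal(K)
x_lift = np.concatenate([rng.standard_normal(L), np.zeros(K - L)])

z = evaluate_network(W, b, x_lift, np.tanh)
a = np.tanh(z)                         # equilibrium activation values
```

At convergence, z satisfies z = W f(z) + b + x̆ up to the tolerance; the scaling of W is only one convenient way to ensure that the iteration contracts.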

Further notations
Summarizing, we use the following notations in the infinitely deep network:
• The initial vector of the iteration is z^(1) = x̆ ∈ R^K. Let z_i^(l)(x) denote the input value of the i-th neuron in the l-th layer, and let z_i(x) = lim_{l→∞} z_i^(l)(x), provided that this limit exists and is finite. In vector form, we have z^(l)(x) = (z_1^(l)(x), …, z_K^(l)(x))^T and z(x) = (z_1(x), …, z_K(x))^T. Sometimes, for simplicity, we omit the argument x.
• Let a_i^(l)(x) denote the activation value of the i-th neuron in the l-th layer with the input vector x. We use the notation a_i(x) = lim_{l→∞} a_i^(l)(x), provided that this limit exists and is finite. Accordingly, we use a^(l)(x) = (a_1^(l)(x), …, a_K^(l)(x))^T and a(x) = (a_1(x), …, a_K(x))^T.
• In parallel with the formula (5), we also introduce the error E(m, z), i.e., the value of the error in (5) when the fixed-point iteration (7) is initialized from a given vector z ∈ R^K.
• We use the notation D_i^(l) = f_i′(z_i^(l)) for the utility value of the i-th neuron in the l-th layer, and D_i = lim_{l→∞} D_i^(l), provided that it exists and is finite. Here, f_i′ denotes the usual derivative of the activation function f_i : R → R applied at the i-th neuron. We also define the diagonal matrix D ∈ R^{K×K} by D = diag(D_1, …, D_K), and the diagonal matrices D^(l) ∈ R^{K×K} in the same way.

Results
As discussed previously, we transform the original implicit network into an infinitely deep feedforward one, and we apply the gradient backpropagation method in this network. To minimize the error function (6) using some gradient-based method, we need to determine the partial derivatives ∂E(m)/∂w_{j,i} and ∂E(m)/∂b_j. In the following theorem, we express these in concrete terms. We make use of the gradient backpropagation method [13] by applying it first to a finite network, and then performing a limit transition with respect to the number of layers. For our main result, we use the following assumptions.

Assumptions:
(i) Equation (4) has a unique solution, and the iteration in (7) is convergent; moreover, the partial derivatives ∂a_k/∂w_{j,i} and ∂a_k/∂b_j exist for all indices k = K − N + 1, …, K and i, j = 1, …, K.
(ii) f_i ∈ C¹(R) for all i = 1, …, K, and their derivatives are bounded.
(iii) The linear mapping DW^T ∈ R^{K×K} is a contraction in some norm, i.e., ‖DW^T‖ < 1.
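Assumption (iii) is easy to verify numerically. The following hedged sketch (hypothetical toy data, f = tanh as one admissible activation) checks the contraction property in the spectral norm; since 0 < tanh′(z) ≤ 1, it suffices to scale W below unit norm:

```python
import numpy as np

# Numerical check of assumption (iii) on a hypothetical toy network:
# for f = tanh the derivatives satisfy 0 < f'(z) <= 1, so if the
# spectral norm of W is below 1, the map v -> D W^T v contracts.
rng = np.random.default_rng(1)
K = 4
W = rng.standard_normal((K, K))
W *= 0.5 / np.linalg.norm(W, 2)          # scale so that ||W||_2 = 0.5
z = rng.standard_normal(K)               # some neuron input values
D = np.diag(1.0 - np.tanh(z) ** 2)       # D_i = f_i'(z_i) for f = tanh

q = np.linalg.norm(D @ W.T, 2)           # spectral norm of D W^T
is_contraction = q < 1.0
```

Here ‖DW^T‖ ≤ ‖D‖‖W^T‖ ≤ 0.5, so the check passes by construction; for unbounded-derivative activations this bound would not be automatic.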
Theorem 1 With the assumptions (i)-(iii), the system of equations

(I − DW^T) d = D e^(m), where e_j^(m) = a_j(x^(m)) − ỹ_j^(m) for K − N < j ≤ K and e_j^(m) = 0 otherwise,

has a unique solution d ∈ R^K. Here I ∈ R^{K×K} is the identity matrix. Furthermore, the partial derivatives of the error function can be given as

∂E(m)/∂b_j = d_j and ∂E(m)/∂w_{j,i} = d_j a_i.

Proof Consider the finite network that consists of the first R ≥ 2 layers of the previously constructed infinite feedforward network. Let z ∈ R^K be given as the initialization of the fixed-point iteration in (7). With these, we have

z^(1),R = z and z_j^(l),R = Σ_{k=1}^{K} w_{j,k} a_k^(l−1),R + b_j + x̆_j with a_k^(l−1),R = f_k(z_k^(l−1),R), l = 2, …, R.

In this context, the letter R in the superscripts of z_j^(l),R and a_k^(l−1),R refers to the actual truncated finite network, which consists of R layers.
Using (x^(m), y^(m)) as an input-output pair, the error of this truncated network on the R-th layer is given by

E_R(m, z) = 1/2 Σ_{j=K−N+1}^{K} (a_j^(R),R − ỹ_j^(m))²,

where we indicate the z-dependence of the error. We perform the gradient backpropagation on this truncated network. The partial derivatives d_j^(l),R = ∂E_R(m, z)/∂z_j^(l),R are computed first for the output neurons and then extended to the non-output ones. In the R-th layer, according to the classical algorithm for gradient backpropagation, we have

d_j^(R),R = f_j′(z_j^(R),R)(a_j^(R),R − ỹ_j^(m)) and d_j^(R),R = 0

for the output (K − N < j ≤ K) and non-output (1 ≤ j ≤ K − N) neurons, respectively. For 1 ≤ l < R, corresponding to the gradient backpropagation algorithm, we have

d^(l),R = D^(l) W^T d^(l+1),R.

For calculating ∂E_R(m, z)/∂b_j, we have to sum up the above vectors d^(l),R, as shown in the following identity:

∂E_R(m, z)/∂b_j = Σ_{l=2}^{R} d_j^(l),R.

Note that this principle is similar to the one in Backpropagation Through Time [14].
According to the backpropagation recursion above, we have

d^(l),R = (D^(l) W^T)(D^(l+1) W^T) ⋯ (D^(R−1) W^T) d^(R),R, 1 ≤ l < R.

Therefore, substituting this product representation into the summation identity for ∂E_R(m, z)/∂b_j, we arrive at the next equation:

∂E_R(m, z)/∂b = Σ_{l=2}^{R} (Π_{i=l}^{R−1} D^(i) W^T) d^(R),R,

where the empty product for l = R is understood as the identity.
Observe that this holds true also for the fixed point z ∈ R^K of the iteration in (7). Since in this case the diagonal matrices D^(l), 1 ≤ l ≤ R, coincide, denoting their common value by D, the above equation simplifies to

∂E_R(m, z)/∂b = Σ_{l=2}^{R} (DW^T)^{R−l} d^(R),R = Σ_{k=0}^{R−2} (DW^T)^k D e^(m).

Taking the limit R → ∞ in this equation and using assumptions (i) and (ii), we get

∂E(m)/∂b = Σ_{k=0}^{∞} (DW^T)^k D e^(m) = (I − DW^T)^{−1} D e^(m) = d.

Clearly, by the contraction condition of the theorem, the inverse of I − DW^T exists, and d is the unique solution of (I − DW^T) d = D e^(m).
We turn now to the statement for ∂E(m, z)/∂w_{j,i}. Similarly to the bias case, the following equation holds true for an arbitrary initialization z:

∂E_R(m, z)/∂w_{j,i} = Σ_{l=2}^{R} d_j^(l),R a_i^(l−1),R.

We can apply the product representation of d^(l),R again, so that we get

∂E_R(m, z)/∂w_{j,i} = Σ_{l=2}^{R} [(Π_{r=l}^{R−1} D^(r) W^T) d^(R),R]_j a_i^(l−1),R.

We assume again that z ∈ R^K is the limit in (7). Therefore, D^(l) ≡ D and a^(l),R ≡ a hold for all l. With these, we can rewrite the above as

∂E_R(m, z)/∂w_{j,i} = [Σ_{k=0}^{R−2} (DW^T)^k D e^(m)]_j a_i.

Then we perform the limit transition with respect to the number of layers again, and we get

∂E(m)/∂w_{j,i} = d_j a_i,

which completes the proof of the statement in the theorem. □
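The closed formula of the theorem lends itself to a direct implementation. Below is a minimal NumPy sketch under our own assumptions (tanh activations, a quadratic per-sample error, hypothetical toy dimensions); the analytic value of ∂E(m)/∂b is checked against a finite-difference quotient:

```python
import numpy as np

def deq_gradient(W, b, x_lift, y, n_out, n_iter=2000):
    """Gradient of the per-sample error via the formula of Theorem 1:
    solve (I - D W^T) d = D e at the equilibrium, then dE/db_j = d_j
    and dE/dw_{j,i} = d_j * a_i."""
    K = W.shape[0]
    z = np.zeros(K)
    for _ in range(n_iter):                 # fixed-point iteration (7)
        z = W @ np.tanh(z) + b + x_lift
    a = np.tanh(z)
    D = np.diag(1.0 - a ** 2)               # D_i = f_i'(z_i) for f = tanh
    e = np.zeros(K)
    e[K - n_out:] = a[K - n_out:] - y       # error on the output neurons
    d = np.linalg.solve(np.eye(K) - D @ W.T, D @ e)
    return a, d, np.outer(d, a)             # a, grad_b = d, grad_W[j, i] = d_j a_i

def sample_error(W, b, x_lift, y, n_out, n_iter=2000):
    """E(m) = 1/2 * sum of squared output errors (quadratic loss assumed)."""
    K = W.shape[0]
    z = np.zeros(K)
    for _ in range(n_iter):
        z = W @ np.tanh(z) + b + x_lift
    a = np.tanh(z)
    return 0.5 * np.sum((a[K - n_out:] - y) ** 2)

# Hypothetical toy instance: K = 4 neurons, L = 2 inputs, N = 1 output.
rng = np.random.default_rng(2)
K, L, N = 4, 2, 1
W = rng.standard_normal((K, K))
W *= 0.5 / np.linalg.norm(W, 2)             # enforce the contraction (iii)
b = rng.standard_normal(K)
x_lift = np.concatenate([rng.standard_normal(L), np.zeros(K - L)])
y = np.array([0.5])

a, grad_b, grad_W = deq_gradient(W, b, x_lift, y, N)

# Sanity check against a finite-difference quotient for dE/db_0.
h = 1e-6
b_pert = b.copy()
b_pert[0] += h
fd = (sample_error(W, b_pert, x_lift, y, N) - sample_error(W, b, x_lift, y, N)) / h
```

The design point is that no differentiation through the unrolled layers is needed: one linear solve of size K replaces the long backpropagation.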

Remark 2
The algorithm of the calculation of the gradient can be found in the Pseudocode 2 in Appendix A.

Conclusion
In this work, we have presented an illustrative approach to constructing Deep Equilibrium Models, also called Implicit Neural Networks, highlighting that these networks are given by a computational graph that may even include directed cycles. We have also proved a theorem for calculating the gradient in such a network. Instead of using the implicit function theorem, we have calculated the gradient directly in the infinitely deep feedforward network that has been assigned to the computational graph.

Ethical Approval
Not Applicable.

Availability of supporting data
Not Applicable.

Competing interests
The authors declare no competing interests. We certify that the submission is original work and is not under review at any other publication.

Algorithm 2 Calculation of the gradient
1: procedure calc_grad(y, a, f′, maxiter, tol, W, b, K, L, N)
2:   e_i ← 0, i = 1, …, K; e_i ← a_i − ỹ_i, i = K − N + 1, …, K
3:   D_i ← f_i′(z_i), i = 1, …, K
4:   dh_i ← D_i e_i, i = 1, …, K; iter ← 0; error ← ∞
5:   while error > tol and iter < maxiter do   ▷ The fixed-point iteration
6:     iter ← iter + 1; dh_old ← dh; dh_i ← 0, i = 1, …, K
7:     for j = 1 : K do
8:       for k = 1 : K do
9:         if ∃ j → k edge then
10:          dh_j ← dh_j + w_{k,j} · dh_old_k
11:      dh_j ← D_j · (dh_j + e_j)
12:    error ← max_{1≤i≤K} |dh_i − dh_old_i|
13:  for i = 1 : K do   ▷ Assembling of the gradients
14:    ∂E(m)/∂b_i ← dh_i
15:    for j = 1 : K do
16:      if ∃ i → j edge then
17:        ∂E(m)/∂w_{j,i} ← dh_j · a_i
18:  return ∂E(m)/∂b_i, ∀ i = 1, …, K, and ∂E(m)/∂w_{j,i}, ∀ i, j = 1, …, K, if ∃ i → j edge in the network.

Fig. 1
Fig. 1 Example of computing the value of a neuron in an implicit neural network. The neuron of index 3 has conventional inputs from neurons 1 and 2, plus a loop edge leading back to itself, so the output of the neuron is also one of its inputs; therefore z_3 = w_{3,1} a_1 + w_{3,2} a_2 + w_{3,3} a_3 + b_3.