On the computation of the gradient in implicit neural networks

Szekeres, Béla J.; Izsák, Ferenc

doi:10.1007/s11227-024-06117-6

On the computation of the gradient in implicit neural networks

Open access
Published: 25 April 2024

(2024)
Cite this article

Download PDF

You have full access to this open access article

The Journal of Supercomputing Aims and scope Submit manuscript

On the computation of the gradient in implicit neural networks

Download PDF

Béla J. Szekeres¹^na1 &
Ferenc Izsák^2,3^na1

116 Accesses
Explore all metrics

Abstract

Implicit neural networks and the related deep equilibrium models are investigated. To train these networks, the gradient of the corresponding loss function should be computed. Bypassing the implicit function theorem, we develop an explicit representation of this quantity, which leads to an easily accessible computational algorithm. The theoretical findings are also supported by numerical simulations.

An implicit gradient-descent procedure for minimax problems

Article 17 November 2022

Translating Numerical Concepts for PDEs into Neural Architectures

Learning dynamics of gradient descent optimization in deep neural networks

Article 08 April 2021

1 Introduction

The conventional neural networks have a feedforward structure: several layers are stacked after each other and their output can be computed explicitly. To generalize this structure, the so called implicit neural networks were introduced and analyzed in [1,2,3,4,5]. Also, a related approach called the deep equilibrium models was developed in the works [6,7,8]. Shortly, this model can be described as a feedforward deep neural network with identic layers. Practically, by increasing the number of layers, the existence and the computation of an equilibrium state is investigated. More precisely, Bai et al. [6] formulated an L-layer feedforward network as

$$\begin{aligned} z^{[k+1]} = f_{\theta } (z^{[k]};\;x), \quad k=0,\dots ,L-1, \end{aligned}$$

, where $z^{[k]}$ are the hidden states in the k-th layer and the input x was utilized in each transition. In case of certain stability conditions, the limit $L\rightarrow \infty$ corresponds to a fixed-point iteration and hence leads to an equilibrium solution z of the equation

$$\begin{aligned} z = f_{\theta } (z;\;x). \end{aligned}$$

(1)

However, the main task is to minimize a given loss function ${\mathcal {E}}(z;t)$ among the solutions of Eq. (1) by optimizing the parameters $\theta$ of the transformation $f_\theta$. Here, for a gradient-based optimization technique, the gradient $\frac{\partial {\mathcal {E}}}{\partial \theta }$ of the loss function has to be calculated. The corresponding theory is based on the implicit function theorem, see [6, 9], and [1]. An efficient numerical approximation of z in Eq. (1) is also rather complex, for further details see [7, 10, 11].

Our contributions in this paper, being theoretical and empirical, are as follows. We introduce implicit neural networks similarly to deep equilibrium models, and propose a novel theory bypassing the traditional reliance on the implicit function theorem. This advancement leads to an easily accessible algorithm for computing the gradient. Our theoretical results are also confirmed with numerical simulations providing empirical evidence for our theoretical contributions.

2 Preliminaries

2.1 Construction of the network

In general, a neural network is represented by a directed graph [12], the computational graph of the network. A network is called feedforward or acyclic if the corresponding graph is acyclic. Similarly, cyclic directed graphs correspond to the so-called implicit neural networks.

We use K for the total number of vertices, which are also called the neurons. Let $a_j$ denote the activation value of the j-th neuron with $f_j: {\mathbb {R}} \rightarrow {\mathbb {R}}$ being the corresponding activation function. Common examples are, e.g., $f_{j}(z)=\tanh (z)$ or $f_{j}(z)=\max \{0,z\}$. Let the lift operator ${\hat{\cdot }}: {\mathbb {R}}^L \rightarrow {\mathbb {R}}^K$ be defined by the formula ${\hat{x}} = \left( x_1, \dots , x_L,0\dots ,0 \right) ^T$. That is, we assume that the input data is copied to the first L neurons of the network for simplicity.

For an input vector $x\in {\mathbb {R}}^L$, the network is evaluated as follows. If a neuron with index j receives a stimulus of magnitude $a_l$ from its ancestor of index l along the edge with the weight $w_{j,l}$ and a constant stimulus $b_{j}$ (also called the bias) applied to it, then the cumulated input $z_j$ of this neuron is

$$\begin{aligned} z_j=\sum _{l} w_{j,l} a_l + b_j + {\hat{x}}_j. \end{aligned}$$

(2)

Accordingly, the activation value of the neuron with index j is

$$\begin{aligned} a_j=f_{j}(z_j). \end{aligned}$$

(3)

Assume for simplicity that the indexing of the neurons is such that the last N neurons are the output ones.

2.2 Problem statement

Indeed, the formulas in (2), (3) define the following system of K equations:

$$\begin{aligned} {\left\{ \begin{array}{ll} z(x) = W a(x) + {\hat{x}} + b \\ a(x) = f(z(x)). \end{array}\right. } \end{aligned}$$

(4)

Here, summarized, $z(x), a(x), b \in {\mathbb {R}}^K,$ $W\in {\mathbb {R}}^{K \times K}$ and $x \in {\mathbb {R}}^L$. Also, the functions $f_j: {\mathbb {R}}\rightarrow {\mathbb {R}}$ are given for $j = 1, \dots , K$, which define $f(z) = \left( f_1 (z_1), \dots , f_K (z_K) \right) ^T\in {\mathbb {R}}^K$ with $z = \left( z_1, \dots , z_K \right) ^T$. Sometimes, we simplify the notation and omit the x-dependence of the terms in (4).

We also have M pairs of training samples (x, y) with $x\in {\mathbb {R}}^L$ the input and $y\in {\mathbb {R}}^N$ the target vectors assuming that $1 \le L,N \le K$. To compute with vectors of size K, we introduce the operator ${\tilde{\cdot }}: {\mathbb {R}}^N \rightarrow {\mathbb {R}}^K$ defined by the formula ${\tilde{y}} = \left( 0,\dots ,0,y_1,\dots ,y_N \right) ^T$.

At the m-th pair of samples, i.e., at the input $x^{(m)}$, the error is defined as

$$\begin{aligned} {\mathcal {E}}^{(m)}= \frac{1}{2} \sum _{j=K-N+1}^{K} \big ( {\tilde{y}}_{j}^{(m)} - a_j(x^{(m)}) \big )^2, \end{aligned}$$

(5)

where ${\tilde{y}}_{j}^{(m)}$ denotes the corresponding component of the m-th training sample $y^{(m)}=(y^{(m)}_1,\dots ,y^{(m)}_N) \in {\mathbb {R}}^N$ of the target vector to be compared with the value of $a_j(x^{(m)})$.

The average error ${\mathcal {E}}$ over all pairs of training samples is given by

$$\begin{aligned} {\mathcal {E}}=\frac{1}{M} \sum _{m=1}^{M} {\mathcal {E}}^{(m)}, \end{aligned}$$

(6)

which will be minimized with respect W and b.

Solving Eq. (4) by a fixed-point iteration yields the vector $z=\left( z_1,\dots ,z_K \right) ^T$ of neuron input values and the vector $a=\left( a_1,\dots ,a_K \right) ^T$ of activation values. A single step of this has the form

$$\begin{aligned} z^{(l)} = W a^{(l-1)} + b + {\hat{x}},\quad a^{(l)}=f(z^{(l)}), \quad l\ge 2 \quad \text {and } z^{(1)} = {\hat{x}} \in {\mathbb {R}}^K. \end{aligned}$$

(7)

An important observation is that the iteration in (7) delivers a feedforward neural network of infinite number of layers with K neurons in each layer, so we get a deep equilibrium model. In this framework, the fixed point iteration corresponds to the layer-wise computation with the original input. The weights of the edges passing between each two adjacent layers are given by the matrix $W \in {\mathbb {R}}^{K\times K}$ and the bias vector $b \in {\mathbb {R}}^K$. A small network is shown in Fig. 1 while Fig. 2 illustrates the unrolling of this network, focusing on the third neuron.

The pseudocode for computing the network is given in Algorithm 1.

2.3 Further notations

Summarized, we use the following notations in the infinitely deep network:

The initial vector of the iteration is $z^{(1)} = {\hat{x}} \in {\mathbb {R}}^K$. Let $z_i^{(l)}(x)$ denote the input value of the i-th neuron in the l-th layer and use $z_{i} (x) = \lim _{{l \to \infty }} z_{i}^{{(l)}} (x)$ provided that this exists and is finite. In vector form, we have
$$\begin{aligned} {z^{(l)}(x)=\left( z_1^{(l)}(x),\dots ,z_K^{(l)}(x) \right) ^T\quad \text {and}\quad z (x)=\left( z_1 (x),\dots ,z_K (x) \right) ^T.} \end{aligned}$$
Sometimes, for simplicity, we omit the arguments x.
Let $a_i^{(l)}(x)$ denote the activation value of the i-th neuron in the l-th layer with the input vector x. We use the notation $a_{i} (x) = \lim _{{l \to \infty }} a_{i}^{{(l)}} (x),$ provided that this exists and is finite. Accordingly, we use
$$\begin{aligned} f(z^{(l)})(x) = a^{(l)}(x)=\left( a_1^{(l)}(x),\dots ,a_K^{(l)}(x) \right) ^T \end{aligned}$$
and $a (x)=\left( a_1 (x),\dots ,a_K (x) \right) ^T.$
Parallel with the formula (5), we also introduce $d^{(\infty )} \in {\mathbb {R}}^K$ as
$$\begin{aligned} d^{(\infty )}_j = {\left\{ \begin{array}{ll} \left( a_j - {\tilde{y}}_j^{(m)} \right) {f_j}' (z_j), \quad K-N< j \le K\\ 0, \quad 1\le j \le K-N. \end{array}\right. } \end{aligned}$$
Here, the activation function $f_j: {\mathbb {R}} \rightarrow {\mathbb {R}}$, which is applied to the j-th neuron.
We use the notation
$$\begin{aligned} D_{i}^{(l)} = f'_{i} (z_{i}^{(l)}) \end{aligned}$$
for the utility value of the i-th neuron in the l-th layer and $D_{i} = \lim _{{l \to \infty }} D_{i}^{{(l)}}$ provided that it exists and is finite. We also define the diagonal matrix $D \in {\mathbb {R}}^{K \times K}$ such that $D = \textrm{diag} \left( D_{1}, \dots , D_{K} \right)$ and similarly, the diagonal matrices $D^{(l)} \in {\mathbb {R}}^{K \times K}$ on the l-th layer.

3 Theoretical results

As discussed previously, we transform the original implicit network into an infinitely deep feedforward one. We apply the gradient backpropagation method in this network. To minimize, the error function (6) using some gradient-based method, we need to determine the partial derivatives $\frac{\partial {{\mathcal {E}}^{(m)}}}{\partial w_{j,i}}$ and $\frac{\partial {{\mathcal {E}}^{(m)}}}{\partial b_{j}}$ after calculating the equilibrium in iteration (7). In the following statement, we express these in concrete terms. We make use the gradient backpropagation method [13] by applying it first to a finite network, and then performing a limit transition with respect to the number of the layers. For our main result, we use the following assumptions.

Assumptions:

(i)
Equation (4) has a unique solution and the iteration in (7) is convergent such that we also have
$$\begin{aligned} \frac{\partial a_k}{\partial b_j} = \lim _{R\rightarrow \infty } \frac{\partial a_k^{(R),R}}{\partial b_j} \end{aligned}$$
for all indices $k = K-N+1,\dots , K$ and $j= 1,\dots , K$.
(ii)
$f_i \in C^1 ({\mathbb {R}})$, $\forall i = 1, \dots , K$ and their derivatives are bounded.
(iii)
The linear mapping $D W^T \in {\mathbb {R}}^{K \times K}$ is a contraction in some induced norm, i.e. $\Vert D W^T \Vert <1$.

Theorem 1

Using assumptions in (i)–(iii), the system of equations

$$\begin{aligned} d = \left( I - D W^T \right) ^{-1} d^{(\infty )} \end{aligned}$$

(8)

has a unique solution. Here, $I \in {\mathbb {R}}^{K \times K}$ is the identity matrix. Furthermore, the partial derivatives of the error function ${\mathcal {E}}^{(m)}$ can be given as

$$\begin{aligned} \frac{\partial {\mathcal {E}}^{(m)}}{\partial b_{j}}=d_j \quad \text {and} \quad \frac{\partial {\mathcal {E}}^{(m)}}{\partial w_{j,i}}=a_i d_j. \end{aligned}$$

(9)

Proof

Consider the finite network that consists of the first $R\ge 2$ layers from the previously constructed infinite forward-connected network. Let $z \in {\mathbb {R}}^K$ given as the initialization of the fixed point iteration in (7). With these, we have

$$\begin{aligned} z^{(l),R} = W a^{(l-1),R}+b+{\hat{x}}^{(m)} \quad \text {and} \quad z^{(1),R}=z \end{aligned}$$

or component wise, $\displaystyle {z_j^{(l),R}=\sum _{k} w_{j,k} a_k^{(l-1),R}+b_j + {\hat{x}}^{(m)}_j}.$ Here, the letter R in the superscripts refers to the actual truncated finite network including R layers, where the gradient backpropagation is performed.

Using $(x^{(m)},y^{(m)})$ as an input–output pair, the error on the R-th layer of this truncated network is given by

$$\begin{aligned} {\mathcal {E}}^{(m)}_R(z)=\frac{1}{2} \sum _{j=K-N+1}^K \big ( a_j^{(R),R}(x^{(m)}) -{\tilde{y}}_{j}^{(m)} \big )^2, \end{aligned}$$

where, we denote the z-dependence of the error. The partial derivative $d_i^{(l),R}=\frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial z_i^{(l),R}}$ is defined for the output neurons and will be extended to the non-output ones. In the R-th layer, according to the classical algorithm for gradient backpropagation, we have

$$\begin{aligned} d_j^{(R),R}= {\left\{ \begin{array}{ll} \left( a_j^{(R),R} - {\tilde{y}}_j^{(m)} \right) {f_j}' (z_j^{(R),R}), \quad K-N< j \le K\\ 0, \quad 1\le j \le K-N \end{array}\right. } \end{aligned}$$

(10)

for the output ($K-N< j \le K$) and nonoutput ($1\le j \le K-N$) neurons, respectively.

For $1\le l<R$, correspondig to the gradient backpropagation algorithm [13], we have

$$\begin{aligned} d^{(l),R} = \frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial z^{(l+1),R}} \cdot \frac{\partial z^{(l+1),R}}{\partial z^{(l),R}} = D^{(l)} W^T d^{(l+1),R}. \end{aligned}$$

(11)

For calculating $\frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial b_j}$, we have to sum up the above vectors $d^{(l),R}$. This principle is similar to the one in backpropagation through time [14]. Thus, we get the following identity:

$$\begin{aligned} \frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial b_j} = \sum _{k=0}^{R-1} \frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial {z}^{(R-k),R}_j} \frac{\partial {z}^{(R-k),R}_j}{\partial b_j} = \sum _{k=0}^{R-1} d^{(R-k),R}_j. \end{aligned}$$

(12)

According to the identity in (11), we have

$$\begin{aligned} d^{(R-k),R} = \left( \prod _{l=0}^{k-1} D^{(R-l)} W^T \right) d^{(R),R}, \end{aligned}$$

(13)

which can be inserted into (12) to get

$$\begin{aligned} \frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial b_j} = \sum _{k=0}^{R-1} \left[ \left( \prod _{l=0}^{k-1} D^{(R-l)} W^T \right) d^{(R),R} \right] _j. \end{aligned}$$

(14)

Observe that this should be true also for the fixed point $z\in {\mathbb {R}}^K$ of the iteration in (7). Since in this case, the diagonal matrices $D^{(l)}$ $1 \le l \le R$ coincide, denoting their common value with D, Eq. (14) is simplified to

$$\begin{aligned} \frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial b_j} = \sum _{k=0}^{R-1} \left[ \left( D W^T \right) ^k d^{(R),R} \right] _j. \end{aligned}$$

(15)

Taking the limit $R\rightarrow \infty$ in Eq. (15) and using assumptions (i) and (ii), we get the equation

$$\begin{aligned} \begin{aligned}\frac{\partial {\mathcal {E}}^{(m)} (z)}{\partial b_j} &= \lim _{R \rightarrow \infty }\frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial b_j} \\&= \lim _{R \rightarrow \infty } \sum _{k=0}^{R-1} \left[ \left( D W^T \right) ^k d^{(R),R} \right] _j = \left[ \left( I - DW^T \right) ^{-1} d^{(\infty )}\right] _j, \end{aligned} \end{aligned}$$

(16)

, where the existence of the inverse follows from the assumption (iii).

We turn now to the statement for $\frac{{\mathcal {E}}(m,z)}{\partial w_{j,i}}$. Similarly to (12), we have the following equality for arbitrary $z \in {\mathbb {R}}^K$:

$$\begin{aligned} \frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial w_{j,i}} = \sum _{k=0}^{R-2} \frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial z^{(R-k),R}_j} \frac{\partial z^{(R-k),R}_j}{\partial w_{j,i}} = \sum _{k=0}^{R-2} d^{(R-k),R}_j a^{(R-k-1),R}_i. \end{aligned}$$

(17)

We can apply formula (13) again in (17) to get

$$\begin{aligned} \frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial w_{j,i}} = \sum _{k=0}^{R-2} \left[ \left( \prod _{l=0}^{k-1} D^{(R-l)} W^T \right) d^{(R),R} \right] _j a^{(R-k-1),R}_i. \end{aligned}$$

(18)

We assume again, that $z\in {\mathbb {R}}^K$ is the limit in (7). Therefore, $D^{(l)}\equiv D$ holds $\forall \le l \in {\mathbb {N}}$. With these, we can rewrite (18) as

$$\begin{aligned} \frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial w_{j,i}} = \sum _{k=0}^{R-2} \left[ \left( D W^T \right) ^k d^{(R),R} \right] _j a_i. \end{aligned}$$

(19)

Performing here limit transition with respect to the number of the layers again, we finally get

$$\begin{aligned} \begin{aligned} \frac{\partial {\mathcal {E}} (m,z)}{\partial w_{j,i}}&= \lim _{R \rightarrow \infty }\frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial w_{j,i}} \\&= \lim _{R \rightarrow \infty } \sum _{k=0}^{R-2} \left[ \left( D W^T \right) ^k d^{(R),R} \right] _j a_i = \left[ \left( I - DW^T \right) ^{-1} d^{(\infty )}\right] _j a_i, \end{aligned} \end{aligned}$$

(20)

which completes the proof of the theorem.$\square$

The pseudocode of the gradient calculation can be found in Algorithm 2. Note, that here the product $\left( I - D W^T \right) ^{-1} d^{(\infty )}$, is approximated using Neumann series.

3.1 Computational complexity

If $\textrm{T}$ represents the maximum number of iterations as in Algorithm 2, and $K$ denotes the number of neurons, it is easy to see that the computational complexity of both the network evaluation and gradient calculation is $O(\textrm{T} \cdot K)$. In the case of a feedforward network, this is considerably more favorable since a calculation is performed exactly once on every neuron for both forward propagation and backpropagation, making the computational complexity for the feedforward case simply $O(K)$. This implies that the training duration for an implicit network could substantially exceed that of a feedforward network.

4 Numerical experiments

We will classify pulsars in the HTRU2 dataset [15] and network intrusions in the NSL-KDD dataset [16]. The methodology involves using autoencoders to compress and reconstruct the multidimensional data, facilitating the identification of normal and anomalous signals. Anomalies are distinguished from typical noise by leveraging the reconstruction error as a metric. This approach demonstrates the effectiveness of autoencoders in detecting patterns within complex datasets and their utility in astronomical data analysis and in the detection of network intrusions.

The structure of this section is as follows. Firstly, the datasets under investigation are described, including the data preprocessing. Then, the applied evaluation metrics and then the training approach based on [17] will be described. Finally, the results of the numerical simulations are shown and discussed.

4.1 The investigated datasets

4.1.1 The HTRU2 dataset

The HTRU2 dataset [15] is a collection of pulsar candidates collected during the high time resolution universe survey. It consists of 17,898 total samples, with 1639 real pulsar examples. These samples are described by eight continuous variables and a class label that distinguishes between the pulsar and nonpulsar candidates. Pulsars are a rare type of neutron stars that emit radiation that can be observed on Earth, making them of significant interest in astrophysics.

We partition the dataset into learning and testing sets. The learning set contains 90% of the original data. Then, the SMOTE algorithm [18] is applied to the learning set. That is, the learning set consists of 29,282 elements in a balanced ratio. This resampled set is further divided into training and validation sets for cross-validation using the above ratio.

Then, a simple standardization is used to normalize the data based only on non-pulsar samples. The exact sizes of the sets and the element numbers of the classes are shown in Table 1.

Table 1 Sizes of the classes in the training, validation, and testing sets in the HTRU2 dataset

Full size table

4.1.2 The NSL-KDD dataset

The NSL-KDD dataset [16] is widely used to detect network intrusions. This dataset was created to address certain deficiencies of the KDD Cup 99 dataset [19], such as redundant records in the training and testing sets, which could distort the evaluation of machine learning models. It includes data from normal traffic and various types of attacks.

In the preprocessing phase, numerical representations are derived from non-numerical data to construct inputs suitable for the autoencoder. The features Protocol, Service, and Flag are identified as categorical. The Protocol feature, which specifies a network protocol, can assume the values tcp, udp, or icmp. Through the application of one-hot encoding, this feature is transformed into a 3-dimensional vector. The Service feature, encompassing 70 distinct categories, is represented by a vector with 70 entries. Similarly, the Flag feature, with coded with 11 entries. Consequently, the input data are formulated as vectors of 117 dimensions by combining 33 numerical features with 84 one-hot-encoded categorical features.

Subsequently, the dataset is partitioned into sets for learning and testing. The learning set, comprising 84.83% of the total data, is further partitioned into training and validation subsets to facilitate cross-validation. This arrangement adopts a splitting ratio of 63.62/21.21/15.17% for the training, validation, and test sets. Then, a simple standardization is used to normalize all the features by calculating the two required scalars, the mean and the standard deviation of the training set only for benign samples. Table 2 shows the precise distribution of class elements.

Table 2 Sizes of the classes in the training, validation, and testing sets in the NSL-KDD dataset

Full size table

4.2 Evaluation metrics

The predictions made can vary for different values of the threshold c to belonging to the positive class. With this, it can be determined whether a prediction qualifies as true positive ($\textrm{TP}_c$), false positive ($\textrm{FP}_c$), true negative ($\textrm{TN}_c$) or false negative ($\textrm{FN}_c$). The following evaluation metrics are used based on these values.

Accuracy ($A_c$) It is the ratio of all correctly identified instances, positive and negative classes, to the total number of instances in the dataset with a given classification threshold c. This quantity is given by the formula
$$\begin{aligned} A_c = \frac{\textrm{TP}_c+\textrm{TN}_c}{\textrm{TP}_c + \textrm{TN}_c+ \textrm{FP}_c+ \textrm{FN}_c}. \end{aligned}$$
Precision ($P_c$) It is defined as the ratio of correctly identified positive instances and the total number of instances classified as positive. This metric evaluates the accuracy of positive predictions with a given classification threshold c as follows:
$$\begin{aligned} P_c = \frac{\textrm{TP}_c}{\textrm{TP}_c + \textrm{FP}_c}. \end{aligned}$$
Recall ($R_c$) or True Positive Rate ($\textrm{TPR}_c$) It quantifies the percentage of positive cases correctly identified with a given c classification threshold. It can be calculated using the following equation:
$$\begin{aligned} R_c = \textrm{TPR}_c = \frac{\textrm{TP}_c}{\textrm{TP}_c + \textrm{FN}_c}. \end{aligned}$$
Matthew’s Correlation Coefficient ($\textrm{MCC}_c$) We use this as the primary metric to represent the best performance the model can achieve at a fixed threshold c. It is in the range between $-1$ and 1, where one indicates a perfect prediction and $-1$ means all predictions are false. In concrete terms, this score is given by
$$\begin{aligned} \textrm{MCC}_c = \frac{\textrm{TP}_c \cdot \textrm{TN}_c - \textrm{FP}_c \cdot \textrm{FN}_c}{\sqrt{\left( \textrm{TP}_c+\textrm{FP}_c \right) \cdot \left( \textrm{TP}_c+\textrm{FN}_c \right) \cdot \left( \textrm{TN}_c+\textrm{FP}_c \right) \cdot \left( \textrm{TN}_c+\textrm{FN}_c \right) }}. \end{aligned}$$
F1-score ($\textrm{F1}_c$) It is the harmonic mean of the precision and recall values, it can be calculated as follows.
$$\begin{aligned} \textrm{F1}_c = 2 \cdot \frac{P_c \cdot R_c}{P_c+R_c} \end{aligned}$$

4.3 Training approach

Here, the steps of the suggested training approach are described, slightly modified from those in article [17].

1.
Preprocessing Initially, samples of the positive class are excluded from the training dataset. Feature values are scaled via standardization. Specific preprocessing techniques applied to the HTRU2 and NSL-KDD datasets are shown in detail in Sect. 4.1.
2.
Training The autoencoder is designed to reconstruct inputs so that they closely resemble normal patterns, utilizing the $L_2$-norm for calculating reconstruction loss, as introduced in the formula (6). Inputs corresponding to anomalies, when processed, are expected to deviate significantly from their original form, facilitating classification based on the disparity between the input and its output. A threshold differentiates anomalies from normal instances. During training, the Stochastic Gradient optimizer and Cosine learning rate scheduler [20] are utilized, incorporating Nesterov momentum and a weight decay of 0.0001. The initial learning rate is set at 0.01, while the final learning rate is adjusted to 0.001. For the HTRU2 dataset, training is conducted over 20 epochs with a minibatch size of 16, whereas for the NSL-KDD dataset, training is conducted for 5 epochs with a minibatch size of 32.
3.
Model Selection The $\textrm{F1}$-score is calculated at each epoch multiple times at a particular frequency. The highest $\textrm{F1}$-score’s model weights are utilized upon training completion. It is calculated through the selection of the threshold, as shown in the following step.
4.
Threshold Selection Selecting the threshold for reconstruction distance between input and output significantly impacts performance. The optimal threshold ${\hat{c}}$ is determined by evaluating the model on the validation set by maximizing the $\textrm{F1}_c$-score in c. Also, this maximal $\textrm{F1}_c$ value is the $\textrm{F1}$-score mentioned in the previous step. This process involves standardizing reconstruction distances using mean and variance from negative samples in the validation set to determine the most effective threshold for class separation. This means the Z-score of the validation loss is calculated. The optimal threshold is sought within the $[-\,4,4]$ interval with a division of 0.001. According to the properties of the standard normal distribution, $99.994\%$ of the scaled errors falls within the $[-\,4,4]$ interval. The Z-score that most effectively separates the anomalies from normal samples in the validation data is the threshold ${\hat{c}}$.
5.
Evaluation Performance is assessed using test data, classifying samples based on their Z-score relative to the threshold. Samples identified as anomalies exceed the threshold. Predicted labels are compared to actual labels to compute the model’s metrics.

4.4 The proposed implicit autoencoder models

Autoencoder networks for anomaly detection are operating by learning to replicate normal data input. They consist of three main components:

Encoder It compresses the input into a latent-space representation. It learns the most important features of the data, effectively reducing its dimensionality.
Decoder It reconstructs the input data from the latent space representation, aiming to compute output as close to the original input as possible.
Latent Layer This is the core of the autoencoder: a compressed knowledge representation of the input data between the encoder and decoder.

The proposed autoencoder models are shown in Fig. 3. In case of v1 models, every neuron in the latent layer is interconnected in a directed manner, with no edges leading back to itself. The activation function chosen for the encoder, decoder, and latent layers is the $\arctan$ function. For v2 models, the latent layer consists of two blocks of neurons. The pointwise product of the two blocks constitutes the layer’s output, which is directed towards the decoder. The input to the first block of the latent layer is sourced from the last encoder layer. The activation function for the first block and the encoders and decoders is the $\arctan$ function. In contrast, the activation function for the second block of the latent layer is the sigmoid function. Also, in v2 models, no edges lead from a neuron back to itself within the latent layer. With these choices, we examine whether adding extra weights to feedforward networks and, thus, getting implicit networks brings advantages by fixing the number of neurons. Here we made only the smallest, i.e., the latent layer implicit to keep computational costs at a relatively low level.

For a fair comparison, we consider six different feedforward autoencoder configurations, each with 8 and 16 neurons in the latent layer. We refer to these networks as the v0 model family. All the configurations are shown in Table 3. In the feedforward networks, the ReLu activation function is applied in the encoder, decoder, and latent layers.

Also, for the purpose of ensuring fair comparisons, two classic models, the Random Forest and XGBoost models, are selected.

Table 3 Configuration of the autoencoder models

Full size table

4.5 Experiment environment

With our implementation, all models were executed on a single NVIDIA Geforce GTX 1080 8GB GPU and a Xeon X5670 CPU, utilizing Python 3.8 and the CUDA C programming language with CUDA version 11.4. We set all possible random seeds during our numerical experiments for reproducibility purposes. To account for the variability of random processes, we repeat each experiment ten times and report the best and average scores in Sect. 5.

5 Numerical results

5.1 The HTRU2 dataset

We summarize the results in Table 4. The two largest values for each model family are displayed in bold in each column. Also, the two largest values in each column are underlined. The implicit model, identified as $(5;32;16)-v1$, dominates four out of five test metrics. From the perspective of the dataset under consideration, the most crucial test metric is recall, for which the $(5;64;16)-v2$ model performs the best, achieving the highest score of 0.9244. Recall is crucial for the HTRU2 dataset, because minimizing false negative predictions is essential for pulsar detection. Nevertheless, the XGBoost and Random Forest models exhibit marginally higher performance in the other metrics. The confusion matrices created on the test set by the implicit $(5;64;8)-v1$ and $(5;64;16)-v2$ models can be seen in Fig. 4.

Table 4 Evaluation metrics among ten simulations for the HTRU2 dataset with the following abbreviations

Full size table

We have also studied the stability of the methods by computing the average test metrics over ten simulations. This can confirm the computations’ reliability in Table 4. The results are shown in Table 5. The results are observed to be very close to one another. The implicit v1 family dominates the CTR metric within the autoencoders, while the v2 family excels in the other test metrics. Overall, however, the XGBoost and Random Forest models yield the most consistent values.

Table 5 Average evaluation metrics among ten simulations for the HTRU2 dataset

Full size table

5.2 The NSL-KDD dataset

The results are shown in Table 6. The two largest values for each model family are printed in bold in each column. Also, the two largest values in each column are underlined. The $(5;64;16)-v1$ model is observed to dominate in four out of five test metrics, including the CTR metric, which is also this dataset’s most crucial test metric. The performance of the best v2 model, namely the $(5;32;8)-v2$, is observed to fall behind only a little from this. The XGBoost and Random Forest models are observed to perform substantially weaker than the autoencoders here. The confusion matrices created on the test set by the implicit $(5;64;8)-v1$ and $(5;64;16)-v2$ models can be shown in Fig. 5.

In Table 7, the stability of the models is investigated. Here, the $(5;64;16)-v1$ and $(5;32;8)-v2$ implicit autoencoder are observed to dominate, too.

Table 6 Evaluation metrics among 10 simulations for the NSL-KDD dataset with the following abbreviations

Full size table

Table 7 Average evaluation metrics among ten simulations for the NSL-KDD dataset

Full size table

Unfortunately, a significant cost is associated with the implicit models. This can be shown in Table 8. The computational time is about four times as much as their feedforward variants.

The comparison with the XGBoost and Random Forest models in terms of computational time, this is only partially fair, as these were executed without the use of GPUs.

Table 8 Training time TT inference time IT in various models

Full size table

6 Future work

Expanding and deepening our numerical simulations is a priority in the future work, especially in scenarios where current frameworks like PyTorch [21] or TensorFlow [22] encounter limitations. Furthermore, we aim to understand better the impact of activation functions and the sparsity of computational graphs representing network architectures on the speed and efficiency of convergence. Additionally, accelerating computations is a crucial issue, for which we would like to develop further the algorithm proposed in the article [6].

7 Conclusion

In this work, we have shown an illustrative approach to constructing deep equilibrium models, also called implicit neural networks, highlighting that these networks are given by such a computational graph that may even include a directed cycle. We have proved a theorem for calculating the gradient in such a network, enabling the computation of gradients without resorting to the implicit function theorem, but by directly calculating them in an infinitely deep feedforward network associated with the computational graph. Furthermore, numerical experiments confirmed our findings, providing empirical evidence to support the theoretical results. This work lays a foundation for further exploration into the capabilities and applications of Implicit Neural Networks, marrying theoretical insights with practical validation.

Data availability

The implemented codes and the dataset used in this article are available on the GitHub repository https://github.com/szbela87/imp_autoencoder. On this page, one can also find the documentation of the developed CUDA C code.

References

El Ghaoui L, Gu F, Travacca B, Askari A, Tsai A (2021) Implicit deep learning. SIAM J Math Data Sci 3(3):930–958. https://doi.org/10.1137/20M1358517
Article MathSciNet Google Scholar
Bianchini M, Gori M, Sarti L, Scarselli F (2006) Recursive neural networks and graphs: dealing with cycles. In: Apolloni B, Marinaro M, Nicosia G, Tagliaferri R (eds) Neural nets. Springer, Berlin, Heidelberg, pp 38–43
Chapter Google Scholar
Bianchini M, Gori M, Scarselli F (2002) Recursive processing of cyclic graphs. In: Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN’02 (Cat. No.02CH37290), vol 1, pp 154–1591. https://doi.org/10.1109/IJCNN.2002.1005461
Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G (2009) The graph neural network model. IEEE Trans Neural Netw 20(1):61–80. https://doi.org/10.1109/TNN.2008.2005605
Article Google Scholar
Kawaguchi K (2021) On the theory of implicit deep learning: global convergence with implicit layers. In: International Conference on Learning Representations. https://openreview.net/forum?id=p-NZIuwqhI4
Bai S, Kolter JZ, Koltun V (2019) Deep equilibrium models. In: Wallach H, Larochelle H, Beygelzimer A, d’ alché-Buc F, Fox E, Garnett R (eds) Advances in neural information processing systems, vol 32. Curran Associates, Inc., Vancouver, Canada
Google Scholar
Bai S, Koltun V, Kolter JZ (2020) Multiscale deep equilibrium models, vol 33. Vancouver, Canada, pp 5239–5250
Google Scholar
Gu F, Chang H, Zhu W, Sojoudi S, El Ghaoui L (2020) Implicit graph neural networks. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H (eds) Advances in neural information processing systems, vol 33. Curran Associates Inc., Vancouver, Canada, pp 11984–11995
Google Scholar
Steven G, Krantz HRP (2013) The implicit function theorem, 1st edn. Modern Birkhäuser classics. Birkhäuser, Boston. https://doi.org/10.1007/978-1-4614-5981-1
Book Google Scholar
Winston E, Kolter JZ (2020) Monotone operator equilibrium networks. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H (eds) Advances in neural information processing systems, vol 33. Curran Associates Inc., Vancouver, Canada, pp 10718–10728
Google Scholar
Geng Z, Zhang X-Y, Bai S, Wang Y, Lin Z (2021) On training implicit models. In: Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Vaughan JW (eds) Advances in neural information processing systems, vol 34. Curran Associates Inc., Vancouver, Canada, pp 24247–24260
Google Scholar
Bondy JA, Murty USR (2008) Graph theory, 1st edn. Springer, New York
Book Google Scholar
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536. https://doi.org/10.1038/323533a0
Article Google Scholar
Werbos PJ (1990) Backpropagation through time: what it does and how to do it. Proc IEEE 78(10):1550–1560. https://doi.org/10.1109/5.58337
Article Google Scholar
Lyon R (2017) HTRU2. UCI machine learning repository. https://doi.org/10.24432/C5DK6R
Tavallaee M, Bagheri E, Lu W, Ghorbani AA (2009) A detailed analysis of the KDD CUP 99 data set. In: 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, pp 1–6. https://doi.org/10.1109/CISDA.2009.5356528
Song Y, Hyun S, Cheong Y-G (2021) Analysis of autoencoders for network intrusion detection. Sensors 21(13):4294. https://doi.org/10.3390/s21134294
Article Google Scholar
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Int Res 16(1):321–357
Google Scholar
Stolfo S, Fan W, Lee W, Prodromidis A, Chan P (1999) KDD Cup 1999 data. UCI machine learning repository. https://doi.org/10.24432/C51C7N
Loshchilov, I, Hutter F (2017) SGDR: stochastic gradient descent with warm restarts
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) Pytorch: an imperative style, high-performance deep learning library. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems, Curran Associates Inc, Red Hook, NY, USA, pp 8026–8037
Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mané D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viégas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. https://www.tensorflow.org/

Download references

Funding

Open access funding provided by Eötvös Loránd University. Béla J. Szekeres was supported by the Project No. 2019-1.3.1-KK-2019-00011 financed by the National Research, Development and Innovation Fund of Hungary under the Establishment of Competence Centers, Development of Research Infrastructure Programme funding scheme. Ferenc Izsák was supported by the National Research, Development and Innovation Office within the framework of the Thematic Excellence Program 2021 - National Research Sub programme: “Artificial intelligence, large networks, data security: mathematical foundation and applications”

Author information

Béla J. Szekeres and Ferenc Izsák have contributed equally to this work.

Authors and Affiliations

Department of Numerical Analysis, Faculty of Informatics, Eötvös Loránd University, Pázmány P. stny. 1C, Budapest, 1117, Hungary
Béla J. Szekeres
Alfréd Rényi Institute of Mathematics, Reáltanoda u. 13-15., Budapest, 1053, Hungary
Ferenc Izsák
MTA ELTE NumNet Research Group, Eötvös Loránd University, Pázmány P. stny. 1C, Budapest, 1117, Hungary
Ferenc Izsák

Authors

Béla J. Szekeres
View author publications
You can also search for this author in PubMed Google Scholar
Ferenc Izsák
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

The authors contributed equally to this work.

Corresponding author

Correspondence to Béla J. Szekeres.

Ethics declarations

Conflict of interest

The authors declare no Conflict of interest. We certify that the submission is original work and is not under review at any other publication.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Szekeres, B.J., Izsák, F. On the computation of the gradient in implicit neural networks. J Supercomput (2024). https://doi.org/10.1007/s11227-024-06117-6

Download citation

Accepted: 29 March 2024
Published: 25 April 2024
DOI: https://doi.org/10.1007/s11227-024-06117-6

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

On the computation of the gradient in implicit neural networks

Abstract

Similar content being viewed by others

An implicit gradient-descent procedure for minimax problems

Translating Numerical Concepts for PDEs into Neural Architectures

Learning dynamics of gradient descent optimization in deep neural networks

1 Introduction

2 Preliminaries

2.1 Construction of the network

2.2 Problem statement

2.3 Further notations

3 Theoretical results

Theorem 1

Proof

3.1 Computational complexity

4 Numerical experiments

4.1 The investigated datasets

4.1.1 The HTRU2 dataset

4.1.2 The NSL-KDD dataset

4.2 Evaluation metrics

4.3 Training approach

4.4 The proposed implicit autoencoder models

4.5 Experiment environment

5 Numerical results

5.1 The HTRU2 dataset

5.2 The NSL-KDD dataset

6 Future work

7 Conclusion

Data availability

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation