1 Introduction

The conventional neural networks have a feedforward structure: several layers are stacked after each other and their output can be computed explicitly. To generalize this structure, the so called implicit neural networks were introduced and analyzed in [1,2,3,4,5]. Also, a related approach called the deep equilibrium models was developed in the works [6,7,8]. Shortly, this model can be described as a feedforward deep neural network with identic layers. Practically, by increasing the number of layers, the existence and the computation of an equilibrium state is investigated. More precisely, Bai et al. [6] formulated an L-layer feedforward network as

$$\begin{aligned} z^{[k+1]} = f_{\theta } (z^{[k]};\;x), \quad k=0,\dots ,L-1, \end{aligned}$$

, where \(z^{[k]}\) are the hidden states in the k-th layer and the input x was utilized in each transition. In case of certain stability conditions, the limit \(L\rightarrow \infty\) corresponds to a fixed-point iteration and hence leads to an equilibrium solution z of the equation

$$\begin{aligned} z = f_{\theta } (z;\;x). \end{aligned}$$
(1)

However, the main task is to minimize a given loss function \({\mathcal {E}}(z;t)\) among the solutions of Eq. (1) by optimizing the parameters \(\theta\) of the transformation \(f_\theta\). Here, for a gradient-based optimization technique, the gradient \(\frac{\partial {\mathcal {E}}}{\partial \theta }\) of the loss function has to be calculated. The corresponding theory is based on the implicit function theorem, see [6, 9], and [1]. An efficient numerical approximation of z in Eq. (1) is also rather complex, for further details see [7, 10, 11].

Our contributions in this paper, being theoretical and empirical, are as follows. We introduce implicit neural networks similarly to deep equilibrium models, and propose a novel theory bypassing the traditional reliance on the implicit function theorem. This advancement leads to an easily accessible algorithm for computing the gradient. Our theoretical results are also confirmed with numerical simulations providing empirical evidence for our theoretical contributions.

2 Preliminaries

2.1 Construction of the network

In general, a neural network is represented by a directed graph [12], the computational graph of the network. A network is called feedforward or acyclic if the corresponding graph is acyclic. Similarly, cyclic directed graphs correspond to the so-called implicit neural networks.

We use K for the total number of vertices, which are also called the neurons. Let \(a_j\) denote the activation value of the j-th neuron with \(f_j: {\mathbb {R}} \rightarrow {\mathbb {R}}\) being the corresponding activation function. Common examples are, e.g., \(f_{j}(z)=\tanh (z)\) or \(f_{j}(z)=\max \{0,z\}\). Let the lift operator \({\hat{\cdot }}: {\mathbb {R}}^L \rightarrow {\mathbb {R}}^K\) be defined by the formula \({\hat{x}} = \left( x_1, \dots , x_L,0\dots ,0 \right) ^T\). That is, we assume that the input data is copied to the first L neurons of the network for simplicity.

For an input vector \(x\in {\mathbb {R}}^L\), the network is evaluated as follows. If a neuron with index j receives a stimulus of magnitude \(a_l\) from its ancestor of index l along the edge with the weight \(w_{j,l}\) and a constant stimulus \(b_{j}\) (also called the bias) applied to it, then the cumulated input \(z_j\) of this neuron is

$$\begin{aligned} z_j=\sum _{l} w_{j,l} a_l + b_j + {\hat{x}}_j. \end{aligned}$$
(2)

Accordingly, the activation value of the neuron with index j is

$$\begin{aligned} a_j=f_{j}(z_j). \end{aligned}$$
(3)

Assume for simplicity that the indexing of the neurons is such that the last N neurons are the output ones.

2.2 Problem statement

Indeed, the formulas in (2), (3) define the following system of K equations:

$$\begin{aligned} {\left\{ \begin{array}{ll} z(x) = W a(x) + {\hat{x}} + b \\ a(x) = f(z(x)). \end{array}\right. } \end{aligned}$$
(4)

Here, summarized, \(z(x), a(x), b \in {\mathbb {R}}^K,\) \(W\in {\mathbb {R}}^{K \times K}\) and \(x \in {\mathbb {R}}^L\). Also, the functions \(f_j: {\mathbb {R}}\rightarrow {\mathbb {R}}\) are given for \(j = 1, \dots , K\), which define \(f(z) = \left( f_1 (z_1), \dots , f_K (z_K) \right) ^T\in {\mathbb {R}}^K\) with \(z = \left( z_1, \dots , z_K \right) ^T\). Sometimes, we simplify the notation and omit the x-dependence of the terms in (4).

We also have M pairs of training samples (xy) with \(x\in {\mathbb {R}}^L\) the input and \(y\in {\mathbb {R}}^N\) the target vectors assuming that \(1 \le L,N \le K\). To compute with vectors of size K, we introduce the operator \({\tilde{\cdot }}: {\mathbb {R}}^N \rightarrow {\mathbb {R}}^K\) defined by the formula \({\tilde{y}} = \left( 0,\dots ,0,y_1,\dots ,y_N \right) ^T\).

At the m-th pair of samples, i.e., at the input \(x^{(m)}\), the error is defined as

$$\begin{aligned} {\mathcal {E}}^{(m)}= \frac{1}{2} \sum _{j=K-N+1}^{K} \big ( {\tilde{y}}_{j}^{(m)} - a_j(x^{(m)}) \big )^2, \end{aligned}$$
(5)

where \({\tilde{y}}_{j}^{(m)}\) denotes the corresponding component of the m-th training sample \(y^{(m)}=(y^{(m)}_1,\dots ,y^{(m)}_N) \in {\mathbb {R}}^N\) of the target vector to be compared with the value of \(a_j(x^{(m)})\).

The average error \({\mathcal {E}}\) over all pairs of training samples is given by

$$\begin{aligned} {\mathcal {E}}=\frac{1}{M} \sum _{m=1}^{M} {\mathcal {E}}^{(m)}, \end{aligned}$$
(6)

which will be minimized with respect W and b.

Solving Eq. (4) by a fixed-point iteration yields the vector \(z=\left( z_1,\dots ,z_K \right) ^T\) of neuron input values and the vector \(a=\left( a_1,\dots ,a_K \right) ^T\) of activation values. A single step of this has the form

$$\begin{aligned} z^{(l)} = W a^{(l-1)} + b + {\hat{x}},\quad a^{(l)}=f(z^{(l)}), \quad l\ge 2 \quad \text {and } z^{(1)} = {\hat{x}} \in {\mathbb {R}}^K. \end{aligned}$$
(7)

An important observation is that the iteration in (7) delivers a feedforward neural network of infinite number of layers with K neurons in each layer, so we get a deep equilibrium model. In this framework, the fixed point iteration corresponds to the layer-wise computation with the original input. The weights of the edges passing between each two adjacent layers are given by the matrix \(W \in {\mathbb {R}}^{K\times K}\) and the bias vector \(b \in {\mathbb {R}}^K\). A small network is shown in Fig. 1 while Fig. 2 illustrates the unrolling of this network, focusing on the third neuron.

Fig. 1
figure 1

Example of computing the value of a neuron in an implicit neural network. The neuron of index 3 has three conventional inputs, plus a loop edge leading back to itself and the output of the neuron is also its input, therefore \(z_3 = w_{3,1} a_1 + w_{3,2} a_2 + b_3\)

Fig. 2
figure 2

Unrolling of the implicit neural network shown in Fig. 1 focusing on the third neuron in the \(l-1\)-th, l-th and \(l+1\)-th layers, \(z_3^{(l)}=w_{3,1}a_1^{(l-1)}+w_{3,2}a_2^{(l-1)}+w_{3,3}a_3^{(l-1)}+b_3\), and \(z_3^{(l+1)}=w_{3,1}a_1^{(l)}+w_{3,2}a_2^{(l)}+w_{3,3}a_3^{(l)}+b_3\)

The pseudocode for computing the network is given in Algorithm 1.

Algorithm 1
figure a

Calculation of the network

2.3 Further notations

Summarized, we use the following notations in the infinitely deep network:

  • The initial vector of the iteration is \(z^{(1)} = {\hat{x}} \in {\mathbb {R}}^K\). Let \(z_i^{(l)}(x)\) denote the input value of the i-th neuron in the l-th layer and use \(z_{i} (x) = \lim _{{l \to \infty }} z_{i}^{{(l)}} (x)\) provided that this exists and is finite. In vector form, we have

    $$\begin{aligned} {z^{(l)}(x)=\left( z_1^{(l)}(x),\dots ,z_K^{(l)}(x) \right) ^T\quad \text {and}\quad z (x)=\left( z_1 (x),\dots ,z_K (x) \right) ^T.} \end{aligned}$$

    Sometimes, for simplicity, we omit the arguments x.

  • Let \(a_i^{(l)}(x)\) denote the activation value of the i-th neuron in the l-th layer with the input vector x. We use the notation \(a_{i} (x) = \lim _{{l \to \infty }} a_{i}^{{(l)}} (x),\) provided that this exists and is finite. Accordingly, we use

    $$\begin{aligned} f(z^{(l)})(x) = a^{(l)}(x)=\left( a_1^{(l)}(x),\dots ,a_K^{(l)}(x) \right) ^T \end{aligned}$$

    and \(a (x)=\left( a_1 (x),\dots ,a_K (x) \right) ^T.\)

  • Parallel with the formula (5), we also introduce \(d^{(\infty )} \in {\mathbb {R}}^K\) as

    $$\begin{aligned} d^{(\infty )}_j = {\left\{ \begin{array}{ll} \left( a_j - {\tilde{y}}_j^{(m)} \right) {f_j}' (z_j), \quad K-N< j \le K\\ 0, \quad 1\le j \le K-N. \end{array}\right. } \end{aligned}$$

    Here, the activation function \(f_j: {\mathbb {R}} \rightarrow {\mathbb {R}}\), which is applied to the j-th neuron.

  • We use the notation

    $$\begin{aligned} D_{i}^{(l)} = f'_{i} (z_{i}^{(l)}) \end{aligned}$$

    for the utility value of the i-th neuron in the l-th layer and \(D_{i} = \lim _{{l \to \infty }} D_{i}^{{(l)}}\) provided that it exists and is finite. We also define the diagonal matrix \(D \in {\mathbb {R}}^{K \times K}\) such that \(D = \textrm{diag} \left( D_{1}, \dots , D_{K} \right)\) and similarly, the diagonal matrices \(D^{(l)} \in {\mathbb {R}}^{K \times K}\) on the l-th layer.

3 Theoretical results

As discussed previously, we transform the original implicit network into an infinitely deep feedforward one. We apply the gradient backpropagation method in this network. To minimize, the error function (6) using some gradient-based method, we need to determine the partial derivatives \(\frac{\partial {{\mathcal {E}}^{(m)}}}{\partial w_{j,i}}\) and \(\frac{\partial {{\mathcal {E}}^{(m)}}}{\partial b_{j}}\) after calculating the equilibrium in iteration (7). In the following statement, we express these in concrete terms. We make use the gradient backpropagation method [13] by applying it first to a finite network, and then performing a limit transition with respect to the number of the layers. For our main result, we use the following assumptions.

Assumptions:

  1. (i)

    Equation (4) has a unique solution and the iteration in (7) is convergent such that we also have

    $$\begin{aligned} \frac{\partial a_k}{\partial b_j} = \lim _{R\rightarrow \infty } \frac{\partial a_k^{(R),R}}{\partial b_j} \end{aligned}$$

    for all indices \(k = K-N+1,\dots , K\) and \(j= 1,\dots , K\).

  2. (ii)

    \(f_i \in C^1 ({\mathbb {R}})\), \(\forall i = 1, \dots , K\) and their derivatives are bounded.

  3. (iii)

    The linear mapping \(D W^T \in {\mathbb {R}}^{K \times K}\) is a contraction in some induced norm, i.e. \(\Vert D W^T \Vert <1\).

Theorem 1

Using assumptions in (i)–(iii), the system of equations

$$\begin{aligned} d = \left( I - D W^T \right) ^{-1} d^{(\infty )} \end{aligned}$$
(8)

has a unique solution. Here, \(I \in {\mathbb {R}}^{K \times K}\) is the identity matrix. Furthermore, the partial derivatives of the error function \({\mathcal {E}}^{(m)}\) can be given as

$$\begin{aligned} \frac{\partial {\mathcal {E}}^{(m)}}{\partial b_{j}}=d_j \quad \text {and} \quad \frac{\partial {\mathcal {E}}^{(m)}}{\partial w_{j,i}}=a_i d_j. \end{aligned}$$
(9)

Proof

Consider the finite network that consists of the first \(R\ge 2\) layers from the previously constructed infinite forward-connected network. Let \(z \in {\mathbb {R}}^K\) given as the initialization of the fixed point iteration in (7). With these, we have

$$\begin{aligned} z^{(l),R} = W a^{(l-1),R}+b+{\hat{x}}^{(m)} \quad \text {and} \quad z^{(1),R}=z \end{aligned}$$

or component wise, \(\displaystyle {z_j^{(l),R}=\sum _{k} w_{j,k} a_k^{(l-1),R}+b_j + {\hat{x}}^{(m)}_j}.\) Here, the letter R in the superscripts refers to the actual truncated finite network including R layers, where the gradient backpropagation is performed.

Using \((x^{(m)},y^{(m)})\) as an input–output pair, the error on the R-th layer of this truncated network is given by

$$\begin{aligned} {\mathcal {E}}^{(m)}_R(z)=\frac{1}{2} \sum _{j=K-N+1}^K \big ( a_j^{(R),R}(x^{(m)}) -{\tilde{y}}_{j}^{(m)} \big )^2, \end{aligned}$$

where, we denote the z-dependence of the error. The partial derivative \(d_i^{(l),R}=\frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial z_i^{(l),R}}\) is defined for the output neurons and will be extended to the non-output ones. In the R-th layer, according to the classical algorithm for gradient backpropagation, we have

$$\begin{aligned} d_j^{(R),R}= {\left\{ \begin{array}{ll} \left( a_j^{(R),R} - {\tilde{y}}_j^{(m)} \right) {f_j}' (z_j^{(R),R}), \quad K-N< j \le K\\ 0, \quad 1\le j \le K-N \end{array}\right. } \end{aligned}$$
(10)

for the output (\(K-N< j \le K\)) and nonoutput (\(1\le j \le K-N\)) neurons, respectively.

For \(1\le l<R\), correspondig to the gradient backpropagation algorithm [13], we have

$$\begin{aligned} d^{(l),R} = \frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial z^{(l+1),R}} \cdot \frac{\partial z^{(l+1),R}}{\partial z^{(l),R}} = D^{(l)} W^T d^{(l+1),R}. \end{aligned}$$
(11)

For calculating \(\frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial b_j}\), we have to sum up the above vectors \(d^{(l),R}\). This principle is similar to the one in backpropagation through time [14]. Thus, we get the following identity:

$$\begin{aligned} \frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial b_j} = \sum _{k=0}^{R-1} \frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial {z}^{(R-k),R}_j} \frac{\partial {z}^{(R-k),R}_j}{\partial b_j} = \sum _{k=0}^{R-1} d^{(R-k),R}_j. \end{aligned}$$
(12)

According to the identity in (11), we have

$$\begin{aligned} d^{(R-k),R} = \left( \prod _{l=0}^{k-1} D^{(R-l)} W^T \right) d^{(R),R}, \end{aligned}$$
(13)

which can be inserted into (12) to get

$$\begin{aligned} \frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial b_j} = \sum _{k=0}^{R-1} \left[ \left( \prod _{l=0}^{k-1} D^{(R-l)} W^T \right) d^{(R),R} \right] _j. \end{aligned}$$
(14)

Observe that this should be true also for the fixed point \(z\in {\mathbb {R}}^K\) of the iteration in (7). Since in this case, the diagonal matrices \(D^{(l)}\) \(1 \le l \le R\) coincide, denoting their common value with D, Eq. (14) is simplified to

$$\begin{aligned} \frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial b_j} = \sum _{k=0}^{R-1} \left[ \left( D W^T \right) ^k d^{(R),R} \right] _j. \end{aligned}$$
(15)

Taking the limit \(R\rightarrow \infty\) in Eq. (15) and using assumptions (i) and (ii), we get the equation

$$\begin{aligned} \begin{aligned}\frac{\partial {\mathcal {E}}^{(m)} (z)}{\partial b_j} &= \lim _{R \rightarrow \infty }\frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial b_j} \\&= \lim _{R \rightarrow \infty } \sum _{k=0}^{R-1} \left[ \left( D W^T \right) ^k d^{(R),R} \right] _j = \left[ \left( I - DW^T \right) ^{-1} d^{(\infty )}\right] _j, \end{aligned} \end{aligned}$$
(16)

, where the existence of the inverse follows from the assumption (iii).

We turn now to the statement for \(\frac{{\mathcal {E}}(m,z)}{\partial w_{j,i}}\). Similarly to (12), we have the following equality for arbitrary \(z \in {\mathbb {R}}^K\):

$$\begin{aligned} \frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial w_{j,i}} = \sum _{k=0}^{R-2} \frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial z^{(R-k),R}_j} \frac{\partial z^{(R-k),R}_j}{\partial w_{j,i}} = \sum _{k=0}^{R-2} d^{(R-k),R}_j a^{(R-k-1),R}_i. \end{aligned}$$
(17)

We can apply formula (13) again in (17) to get

$$\begin{aligned} \frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial w_{j,i}} = \sum _{k=0}^{R-2} \left[ \left( \prod _{l=0}^{k-1} D^{(R-l)} W^T \right) d^{(R),R} \right] _j a^{(R-k-1),R}_i. \end{aligned}$$
(18)

We assume again, that \(z\in {\mathbb {R}}^K\) is the limit in (7). Therefore, \(D^{(l)}\equiv D\) holds \(\forall \le l \in {\mathbb {N}}\). With these, we can rewrite (18) as

$$\begin{aligned} \frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial w_{j,i}} = \sum _{k=0}^{R-2} \left[ \left( D W^T \right) ^k d^{(R),R} \right] _j a_i. \end{aligned}$$
(19)

Performing here limit transition with respect to the number of the layers again, we finally get

$$\begin{aligned} \begin{aligned} \frac{\partial {\mathcal {E}} (m,z)}{\partial w_{j,i}}&= \lim _{R \rightarrow \infty }\frac{\partial {\mathcal {E}}^{(m)}_R (z)}{\partial w_{j,i}} \\&= \lim _{R \rightarrow \infty } \sum _{k=0}^{R-2} \left[ \left( D W^T \right) ^k d^{(R),R} \right] _j a_i = \left[ \left( I - DW^T \right) ^{-1} d^{(\infty )}\right] _j a_i, \end{aligned} \end{aligned}$$
(20)

which completes the proof of the theorem.\(\square\)

The pseudocode of the gradient calculation can be found in Algorithm 2. Note, that here the product \(\left( I - D W^T \right) ^{-1} d^{(\infty )}\), is approximated using Neumann series.

Algorithm 2
figure b

Calculation of the gradient

3.1 Computational complexity

If \(\textrm{T}\) represents the maximum number of iterations as in Algorithm 2, and \(K\) denotes the number of neurons, it is easy to see that the computational complexity of both the network evaluation and gradient calculation is \(O(\textrm{T} \cdot K)\). In the case of a feedforward network, this is considerably more favorable since a calculation is performed exactly once on every neuron for both forward propagation and backpropagation, making the computational complexity for the feedforward case simply \(O(K)\). This implies that the training duration for an implicit network could substantially exceed that of a feedforward network.

4 Numerical experiments

We will classify pulsars in the HTRU2 dataset [15] and network intrusions in the NSL-KDD dataset [16]. The methodology involves using autoencoders to compress and reconstruct the multidimensional data, facilitating the identification of normal and anomalous signals. Anomalies are distinguished from typical noise by leveraging the reconstruction error as a metric. This approach demonstrates the effectiveness of autoencoders in detecting patterns within complex datasets and their utility in astronomical data analysis and in the detection of network intrusions.

The structure of this section is as follows. Firstly, the datasets under investigation are described, including the data preprocessing. Then, the applied evaluation metrics and then the training approach based on [17] will be described. Finally, the results of the numerical simulations are shown and discussed.

4.1 The investigated datasets

4.1.1 The HTRU2 dataset

The HTRU2 dataset [15] is a collection of pulsar candidates collected during the high time resolution universe survey. It consists of 17,898 total samples, with 1639 real pulsar examples. These samples are described by eight continuous variables and a class label that distinguishes between the pulsar and nonpulsar candidates. Pulsars are a rare type of neutron stars that emit radiation that can be observed on Earth, making them of significant interest in astrophysics.

We partition the dataset into learning and testing sets. The learning set contains 90% of the original data. Then, the SMOTE algorithm [18] is applied to the learning set. That is, the learning set consists of 29,282 elements in a balanced ratio. This resampled set is further divided into training and validation sets for cross-validation using the above ratio.

Then, a simple standardization is used to normalize the data based only on non-pulsar samples. The exact sizes of the sets and the element numbers of the classes are shown in Table 1.

Table 1 Sizes of the classes in the training, validation, and testing sets in the HTRU2 dataset

4.1.2 The NSL-KDD dataset

The NSL-KDD dataset [16] is widely used to detect network intrusions. This dataset was created to address certain deficiencies of the KDD Cup 99 dataset [19], such as redundant records in the training and testing sets, which could distort the evaluation of machine learning models. It includes data from normal traffic and various types of attacks.

In the preprocessing phase, numerical representations are derived from non-numerical data to construct inputs suitable for the autoencoder. The features Protocol, Service, and Flag are identified as categorical. The Protocol feature, which specifies a network protocol, can assume the values tcp, udp, or icmp. Through the application of one-hot encoding, this feature is transformed into a 3-dimensional vector. The Service feature, encompassing 70 distinct categories, is represented by a vector with 70 entries. Similarly, the Flag feature, with coded with 11 entries. Consequently, the input data are formulated as vectors of 117 dimensions by combining 33 numerical features with 84 one-hot-encoded categorical features.

Subsequently, the dataset is partitioned into sets for learning and testing. The learning set, comprising 84.83% of the total data, is further partitioned into training and validation subsets to facilitate cross-validation. This arrangement adopts a splitting ratio of 63.62/21.21/15.17% for the training, validation, and test sets. Then, a simple standardization is used to normalize all the features by calculating the two required scalars, the mean and the standard deviation of the training set only for benign samples. Table 2 shows the precise distribution of class elements.

Table 2 Sizes of the classes in the training, validation, and testing sets in the NSL-KDD dataset

4.2 Evaluation metrics

The predictions made can vary for different values of the threshold c to belonging to the positive class. With this, it can be determined whether a prediction qualifies as true positive (\(\textrm{TP}_c\)), false positive (\(\textrm{FP}_c\)), true negative (\(\textrm{TN}_c\)) or false negative (\(\textrm{FN}_c\)). The following evaluation metrics are used based on these values.

  • Accuracy (\(A_c\)) It is the ratio of all correctly identified instances, positive and negative classes, to the total number of instances in the dataset with a given classification threshold c. This quantity is given by the formula

    $$\begin{aligned} A_c = \frac{\textrm{TP}_c+\textrm{TN}_c}{\textrm{TP}_c + \textrm{TN}_c+ \textrm{FP}_c+ \textrm{FN}_c}. \end{aligned}$$
  • Precision (\(P_c\)) It is defined as the ratio of correctly identified positive instances and the total number of instances classified as positive. This metric evaluates the accuracy of positive predictions with a given classification threshold c as follows:

    $$\begin{aligned} P_c = \frac{\textrm{TP}_c}{\textrm{TP}_c + \textrm{FP}_c}. \end{aligned}$$
  • Recall (\(R_c\)) or True Positive Rate (\(\textrm{TPR}_c\)) It quantifies the percentage of positive cases correctly identified with a given c classification threshold. It can be calculated using the following equation:

    $$\begin{aligned} R_c = \textrm{TPR}_c = \frac{\textrm{TP}_c}{\textrm{TP}_c + \textrm{FN}_c}. \end{aligned}$$
  • Matthew’s Correlation Coefficient (\(\textrm{MCC}_c\)) We use this as the primary metric to represent the best performance the model can achieve at a fixed threshold c. It is in the range between \(-1\) and 1, where one indicates a perfect prediction and \(-1\) means all predictions are false. In concrete terms, this score is given by

    $$\begin{aligned} \textrm{MCC}_c = \frac{\textrm{TP}_c \cdot \textrm{TN}_c - \textrm{FP}_c \cdot \textrm{FN}_c}{\sqrt{\left( \textrm{TP}_c+\textrm{FP}_c \right) \cdot \left( \textrm{TP}_c+\textrm{FN}_c \right) \cdot \left( \textrm{TN}_c+\textrm{FP}_c \right) \cdot \left( \textrm{TN}_c+\textrm{FN}_c \right) }}. \end{aligned}$$
  • F1-score (\(\textrm{F1}_c\)) It is the harmonic mean of the precision and recall values, it can be calculated as follows.

    $$\begin{aligned} \textrm{F1}_c = 2 \cdot \frac{P_c \cdot R_c}{P_c+R_c} \end{aligned}$$

4.3 Training approach

Here, the steps of the suggested training approach are described, slightly modified from those in article [17].

  1. 1.

    Preprocessing Initially, samples of the positive class are excluded from the training dataset. Feature values are scaled via standardization. Specific preprocessing techniques applied to the HTRU2 and NSL-KDD datasets are shown in detail in Sect. 4.1.

  2. 2.

    Training The autoencoder is designed to reconstruct inputs so that they closely resemble normal patterns, utilizing the \(L_2\)-norm for calculating reconstruction loss, as introduced in the formula (6). Inputs corresponding to anomalies, when processed, are expected to deviate significantly from their original form, facilitating classification based on the disparity between the input and its output. A threshold differentiates anomalies from normal instances. During training, the Stochastic Gradient optimizer and Cosine learning rate scheduler [20] are utilized, incorporating Nesterov momentum and a weight decay of 0.0001. The initial learning rate is set at 0.01, while the final learning rate is adjusted to 0.001. For the HTRU2 dataset, training is conducted over 20 epochs with a minibatch size of 16, whereas for the NSL-KDD dataset, training is conducted for 5 epochs with a minibatch size of 32.

  3. 3.

    Model Selection The \(\textrm{F1}\)-score is calculated at each epoch multiple times at a particular frequency. The highest \(\textrm{F1}\)-score’s model weights are utilized upon training completion. It is calculated through the selection of the threshold, as shown in the following step.

  4. 4.

    Threshold Selection Selecting the threshold for reconstruction distance between input and output significantly impacts performance. The optimal threshold \({\hat{c}}\) is determined by evaluating the model on the validation set by maximizing the \(\textrm{F1}_c\)-score in c. Also, this maximal \(\textrm{F1}_c\) value is the \(\textrm{F1}\)-score mentioned in the previous step. This process involves standardizing reconstruction distances using mean and variance from negative samples in the validation set to determine the most effective threshold for class separation. This means the Z-score of the validation loss is calculated. The optimal threshold is sought within the \([-\,4,4]\) interval with a division of 0.001. According to the properties of the standard normal distribution, \(99.994\%\) of the scaled errors falls within the \([-\,4,4]\) interval. The Z-score that most effectively separates the anomalies from normal samples in the validation data is the threshold \({\hat{c}}\).

  5. 5.

    Evaluation Performance is assessed using test data, classifying samples based on their Z-score relative to the threshold. Samples identified as anomalies exceed the threshold. Predicted labels are compared to actual labels to compute the model’s metrics.

4.4 The proposed implicit autoencoder models

Autoencoder networks for anomaly detection are operating by learning to replicate normal data input. They consist of three main components:

  • Encoder It compresses the input into a latent-space representation. It learns the most important features of the data, effectively reducing its dimensionality.

  • Decoder It reconstructs the input data from the latent space representation, aiming to compute output as close to the original input as possible.

  • Latent Layer This is the core of the autoencoder: a compressed knowledge representation of the input data between the encoder and decoder.

The proposed autoencoder models are shown in Fig. 3. In case of v1 models, every neuron in the latent layer is interconnected in a directed manner, with no edges leading back to itself. The activation function chosen for the encoder, decoder, and latent layers is the \(\arctan\) function. For v2 models, the latent layer consists of two blocks of neurons. The pointwise product of the two blocks constitutes the layer’s output, which is directed towards the decoder. The input to the first block of the latent layer is sourced from the last encoder layer. The activation function for the first block and the encoders and decoders is the \(\arctan\) function. In contrast, the activation function for the second block of the latent layer is the sigmoid function. Also, in v2 models, no edges lead from a neuron back to itself within the latent layer. With these choices, we examine whether adding extra weights to feedforward networks and, thus, getting implicit networks brings advantages by fixing the number of neurons. Here we made only the smallest, i.e., the latent layer implicit to keep computational costs at a relatively low level.

Fig. 3
figure 3

The proposed implicit autoencoder models. Directed arrows are used to denote fully connected layers. In the left figure, a member of the v1 family is shown. The connection of the latent layer to itself indicates that each neuron in this layer is interconnected in a directed way. In the right figure, a member of the v2 family is shown. The latent layer consists of two blocks of neurons. The output of this layer is given by the pointwise product of the two blocks. The output of the layer is directed towards the decoder. The input to the first block of the latent layer comes from the last encoder layer. In both cases, no edges are drawn from neurons to themselves in the implicit latent layers. For each layer, it has been indicated to which part of the autoencoder it belongs and the number of neurons contained therein

For a fair comparison, we consider six different feedforward autoencoder configurations, each with 8 and 16 neurons in the latent layer. We refer to these networks as the v0 model family. All the configurations are shown in Table 3. In the feedforward networks, the ReLu activation function is applied in the encoder, decoder, and latent layers.

Also, for the purpose of ensuring fair comparisons, two classic models, the Random Forest and XGBoost models, are selected.

Table 3 Configuration of the autoencoder models

4.5 Experiment environment

With our implementation, all models were executed on a single NVIDIA Geforce GTX 1080 8GB GPU and a Xeon X5670 CPU, utilizing Python 3.8 and the CUDA C programming language with CUDA version 11.4. We set all possible random seeds during our numerical experiments for reproducibility purposes. To account for the variability of random processes, we repeat each experiment ten times and report the best and average scores in Sect. 5.

5 Numerical results

5.1 The HTRU2 dataset

We summarize the results in Table 4. The two largest values for each model family are displayed in bold in each column. Also, the two largest values in each column are underlined. The implicit model, identified as \((5;32;16)-v1\), dominates four out of five test metrics. From the perspective of the dataset under consideration, the most crucial test metric is recall, for which the \((5;64;16)-v2\) model performs the best, achieving the highest score of 0.9244. Recall is crucial for the HTRU2 dataset, because minimizing false negative predictions is essential for pulsar detection. Nevertheless, the XGBoost and Random Forest models exhibit marginally higher performance in the other metrics. The confusion matrices created on the test set by the implicit \((5;64;8)-v1\) and \((5;64;16)-v2\) models can be seen in Fig. 4.

Table 4 Evaluation metrics among ten simulations for the HTRU2 dataset with the following abbreviations

We have also studied the stability of the methods by computing the average test metrics over ten simulations. This can confirm the computations’ reliability in Table 4. The results are shown in Table 5. The results are observed to be very close to one another. The implicit v1 family dominates the CTR metric within the autoencoders, while the v2 family excels in the other test metrics. Overall, however, the XGBoost and Random Forest models yield the most consistent values.

Table 5 Average evaluation metrics among ten simulations for the HTRU2 dataset
Fig. 4
figure 4

The confusion matrices generated with the best \((5;64;8)-v1\) and the \((5;64;16)-v2\) implicit models on the testing set for the HTRU2 dataset

5.2 The NSL-KDD dataset

The results are shown in Table 6. The two largest values for each model family are printed in bold in each column. Also, the two largest values in each column are underlined. The \((5;64;16)-v1\) model is observed to dominate in four out of five test metrics, including the CTR metric, which is also this dataset’s most crucial test metric. The performance of the best v2 model, namely the \((5;32;8)-v2\), is observed to fall behind only a little from this. The XGBoost and Random Forest models are observed to perform substantially weaker than the autoencoders here. The confusion matrices created on the test set by the implicit \((5;64;8)-v1\) and \((5;64;16)-v2\) models can be shown in Fig. 5.

In Table 7, the stability of the models is investigated. Here, the \((5;64;16)-v1\) and \((5;32;8)-v2\) implicit autoencoder are observed to dominate, too.

Table 6 Evaluation metrics among 10 simulations for the NSL-KDD dataset with the following abbreviations
Fig. 5
figure 5

The confusion matrices generated with the best \((5;64;16)-v1\) and the \((5;32;8)-v2\) implicit models on the testing set for the NSL-KDD dataset

Table 7 Average evaluation metrics among ten simulations for the NSL-KDD dataset

Unfortunately, a significant cost is associated with the implicit models. This can be shown in Table 8. The computational time is about four times as much as their feedforward variants.

The comparison with the XGBoost and Random Forest models in terms of computational time, this is only partially fair, as these were executed without the use of GPUs.

Table 8 Training time TT inference time IT in various models

6 Future work

Expanding and deepening our numerical simulations is a priority in the future work, especially in scenarios where current frameworks like PyTorch [21] or TensorFlow [22] encounter limitations. Furthermore, we aim to understand better the impact of activation functions and the sparsity of computational graphs representing network architectures on the speed and efficiency of convergence. Additionally, accelerating computations is a crucial issue, for which we would like to develop further the algorithm proposed in the article [6].

7 Conclusion

In this work, we have shown an illustrative approach to constructing deep equilibrium models, also called implicit neural networks, highlighting that these networks are given by such a computational graph that may even include a directed cycle. We have proved a theorem for calculating the gradient in such a network, enabling the computation of gradients without resorting to the implicit function theorem, but by directly calculating them in an infinitely deep feedforward network associated with the computational graph. Furthermore, numerical experiments confirmed our findings, providing empirical evidence to support the theoretical results. This work lays a foundation for further exploration into the capabilities and applications of Implicit Neural Networks, marrying theoretical insights with practical validation.