1 Introduction

Deep Neural Network (DNN)-based models have become the standard for solving many real-world problems [1,2,3,4,5], including computer vision tasks [6, 7], due to their high accuracy. However, their high computational and energy requirements [8] make them resource-intensive to train, highlighting the importance of exploring alternative models that can provide more sustainable solutions. One such model is the echo state network (ESN) [9], a type of recurrent neural network (RNN) that has gained popularity due to its effectiveness in solving time series prediction problems [10]. The lightness of the model, as well as its fast training, makes it ideal for use on resource-constrained devices, even without GPUs, which is difficult to achieve with DNN-based models. Like traditional RNNs, ESNs have an internal feedback matrix that allows them to store information about previous states, making them well suited to provide predictions based on both current inputs and historical information.

Despite their success in predicting time series [11], their potential to effectively capture spatial relationships in data has been relatively understudied. This study aims to investigate the ability of ESNs to handle purely spatial information in image data, which has important implications for real-world applications. In particular, we seek to assess the performance of ESNs in handling purely spatial tasks using the MNIST and FashionMNIST data sets. This thorough evaluation will offer insight into the capabilities and limitations of ESNs in image processing, guiding future research in this field.

The main contributions of this article are the following:

  1. Exploration of the vanilla ESN architecture in a domain beyond time series, namely in an image classification task. We study its behaviour for different sets of hyperparameters and test the influence of each of them on the performance of the architecture, shedding light on its adaptability to spatial tasks.

  2. Implementation and refinement of three established ESN-based architectures. The modifications include the integration of multiple output layers, tailoring these architectures for best performance in an image classification task.

  3. Evaluation of the aforementioned architectures in the context of image classification, providing an analysis of their performance and highlighting the advantages of the parallel reservoir architecture over other ESN-based architectures.

  4. Advancement of the state of the art through the application of ESNs for image classification, showcasing a significant contribution to the field.

The remainder of this article is structured as follows: Sect. 2 presents an overview of related works on ESNs. In Sect. 3, we detail our proposed ESN architectures for image processing. Section 4 describes the experimental methodology, and Sect. 5 presents and analyses the results of our experiments. Finally, Sect. 6 summarises our findings and provides insights for future work.

2 Previous works

Echo state networks were initially introduced in the early 2000s [9] as a type of recurrent neural network. The inception of ESNs was motivated by the need to address the challenges associated with training RNNs, whose recurrent connections make it complex to apply backpropagation over long sequences.

The key feature of ESNs is the use of a randomly generated sparse connectivity matrix in the recurrent layer, whose weights remain fixed so that the network state reacts to the input and its previous states in a constant way. This allows the network to capture complex temporal dependencies while reducing the number of trainable parameters. The training of an ESN can be summarised in four phases: initialisation, reservoir computation, weight adjustment, and testing.

2.1 ESN architecture

2.1.1 Initialisation

The randomness with which both the input layer and the reservoir are generated, generally using a normal or uniform distribution, is one of the factors that enables the rapid training of ESNs. However, this initial randomness can sometimes lead to suboptimal results, so careful selection of some parameters is necessary to achieve the desired results. Factors such as the number of neurons, the connection density, and the spectral radius of the connection matrix all play a crucial role in determining the reservoir behaviour, which subsequently affects the overall performance of the network. Next, we briefly discuss them to understand their impact (see [12] for an extended explanation of each hyperparameter; a minimal initialisation sketch is given after the list):

  • Number of Nodes: Having too many or too few nodes can lead to poor results. The number of nodes determines the size and complexity of the reservoir.

  • Density: The ratio of connections between reservoir nodes, also known as the connection density, can greatly influence the performance of the network.

  • Spectral Radius: Usually denoted by \(\varvec{\rho }\). Rescaling the connection matrix to a given spectral radius helps to prevent the predictions of the network from tending to infinity or zero when the network works in recurrent mode.

  • Input Scaling: Ensuring that the input data falls within a suitable range for the network to process is crucial for accurate predictions. It is usually referred to as \(\varvec{\gamma }\).

  • Leaking Rate: The extent to which information from the previous time-step is retained in the current time-step plays an important role in the network’s performance. It is usually called \(\varvec{\alpha }\).

  • Regularisation coefficient: This hyperparameter helps to prevent overfitting during the computation of the weights of the output layer. Regularisation is especially important when working with small data sets. Usually denoted by \(\varvec{\beta }\).
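To make the initialisation phase concrete, the following minimal Julia sketch builds the two fixed random structures from the hyperparameters above: an input matrix scaled by \(\gamma\) and a sparse reservoir rescaled to a target spectral radius \(\rho\). The helper name `init_esn` and the defaults are ours, not the implementation used in this work.

```julia
using LinearAlgebra, SparseArrays, Random

# Illustrative initialisation sketch (not the code used in this study).
# K: input size, N: reservoir nodes, density: connection density,
# ρ: target spectral radius, γ: input scaling.
function init_esn(K, N; density = 0.1, ρ = 0.9, γ = 1.0, rng = Random.default_rng())
    W_in = γ .* (2 .* rand(rng, N, K) .- 1)       # input weights, uniform in [-γ, γ]
    W = sprandn(rng, N, N, density)               # sparse random reservoir
    W .*= ρ / maximum(abs, eigvals(Matrix(W)))    # rescale to the desired spectral radius
    return W_in, W
end
```

The exact eigenvalue computation is expensive for very large reservoirs; the sketch favours clarity over efficiency.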

Optimising the hyperparameters of echo state networks remains a challenging task. Various methods, such as genetic algorithms (GA) [13], particle swarm optimisation (PSO) [14], as well as gradient-based [15] or grid search approaches, have been used to fine-tune the hyperparameters and improve network performance.

In this work, a grid search was used to observe the response of the architecture to the different hyperparameters.

2.1.2 Reservoir computation

During training, input sequences are fed into the network and the activations of the reservoir neurons are computed. These activations and inputs are then collected and stored in a matrix, which is a historical record of the dynamic behaviour of the network (as shown in Fig. 1). This matrix is referred to as H for convenience. Upon completion of the training phase, these data determine the weights of the output layer. As a result, the performance of the ESN model depends on the information contained in H, making it a crucial element of the training phase.

Fig. 1 ESN. Basic architecture. Reservoir computing phase. The symbol \(+\!\!\!+\) is used to express vector concatenation, while \(\lambda\) is used to express the computation of the new state. At each time t, the new input causes a reaction in the state of the neurons, and this reaction, together with the input that caused it, is stored in the H matrix

The most commonly used approach to updating states in ESNs is the leaky-integration variant, introduced by [16]. This approach updates the state of the network at each time step using:

$$\begin{aligned} x_{t} = (1-\alpha )\,x_{t-1} + \alpha \,\lambda (W_{\rm{in}} \cdot u_{t} + W \cdot x_{t-1}) \end{aligned}$$
(1)

In the given equation, the function \(\lambda\) is typically chosen as a sigmoid function, with the hyperbolic tangent being a commonly used example. If we let K represent the size of the input and N denote the number of nodes in the reservoir, the vector \(u_{t} \in \mathbb {R}^{K}\) denotes the input at time t, and \(W_{\rm{in}} \in \mathbb {R}^{N \times K}\) represents the weights of the input layer. The matrix \(W \in \mathbb {R}^{N \times N}\) is the adjacency matrix that describes the connections between neurons (this structure is commonly referred to as the reservoir), and \(x_{t-1} \in \mathbb {R}^{N}\) represents the state of the neurons in the previous time step. The parameter \(\alpha \in [0,1]\) controls the network's sensitivity to new inputs. A low value of \(\alpha\) results in a network that is less responsive to new inputs, tending to retain more information about its previous state. On the other hand, a high value of \(\alpha\) leads to a network that responds more strongly to new inputs. Since the calculation of new states is partly based on the previous state of the network, each new state contains an “echo” of the previous one. This is why these networks are called echo state networks.
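A direct Julia transcription of Eq. (1), given as a sketch with our own variable names, with tanh playing the role of \(\lambda\):

```julia
# Leaky-integration state update, Eq. (1); tanh acts as the activation λ.
function update_state(x_prev, u, W_in, W; α = 0.7)
    return (1 - α) .* x_prev .+ α .* tanh.(W_in * u .+ W * x_prev)
end
```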

2.1.3 Finding the weights of the output layer

Fig. 2 ESN. Basic architecture. Regression with Tikhonov regularisation (expressed as \(\tau\)) is used to calculate the weights of the output layer

In an ESN, the input layer and the reservoir work together to generate a single state that is highly dependent on both the current input and the previous state of the reservoir. During the training phase, these states are stored in the matrix H, which is then used to calculate the weights of the output layer. The task of the output layer is to decipher the network response to inputs and to provide an accurate output prediction. This is achieved by solving a regularised linear system that relates the network response to the input data (stored in H) to the desired output (Fig. 2).

The accuracy of the output layer is critical for the performance of the ESN, as it directly impacts the network’s ability to generate accurate predictions. A linear regression algorithm is often used to train the output layer, which is less computationally expensive than backpropagation. The Tikhonov regularisation method (expressed in Eq. 2) is frequently used for stability:

$$\begin{aligned} W_{\rm{out}} = ( H^\intercal H + \beta I )^{-1}H^\intercal Y \end{aligned}$$
(2)

where \(W_{\rm{out}}\) represents the output layer weights, \(H^\intercal\) is the transpose of H, the regularisation coefficient \(\beta\) is typically a small value between \(10^{-10}\) and \(10^{-4}\), and Y denotes the target vector.
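In code, Eq. (2) reduces to a single regularised least-squares solve. A sketch, assuming H stores one row per training example (input and state) and Y the matching targets:

```julia
using LinearAlgebra

# Ridge regression / Tikhonov regularisation, Eq. (2).
# H: T × (K + N) matrix of inputs and states, Y: T × outputs target matrix.
train_output(H, Y; β = 1e-8) = (H' * H + β * I) \ (H' * Y)
```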

2.1.4 Testing

During the test phase, the ESN generates predictions for novel input sequences. Each new input produces a reaction of the neurons within the network, and the output layer processes the resulting state alongside the input to produce a prediction. When working with time series, this prediction is typically the next value in the series. Figure 3 illustrates the test phase in the vanilla ESN architecture.

Fig. 3 ESN. Basic architecture. Testing. In the test phase, each new input causes a response in the state of the neurons, and this response, together with the input that caused it, is processed by the output layer to produce the final output

In some cases, the network takes each prediction as the next input, allowing the ESN to work in a generative mode (predicting sequences rather than just the next value in the series).
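A sketch of the test-time step (names are ours, consistent with the earlier sketches): the input drives a new state, the trained output layer reads the concatenation of input and state, and feeding the prediction back as the next input gives the generative mode.

```julia
# One test step: update the state with the new input and read out a prediction.
# W_out has one row per entry of [u; x], as produced during training.
function predict_step(W_out, W_in, W, x_prev, u; α = 0.7)
    x = (1 - α) .* x_prev .+ α .* tanh.(W_in * u .+ W * x_prev)
    return W_out' * vcat(u, x), x      # prediction and updated state
end
```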

2.2 ESN applications

In the context of neural networks, DNNs and ESNs represent two distinct approaches to learning through training. In DNNs, a complex structure of layers of neurons connected by edges is constructed, and the network is trained by backpropagation to adjust the weights of the connections and improve prediction accuracy. ESNs, in contrast, comprise three fundamental components: the input layer, a group of neurons organised in a graph-like structure (the reservoir), and the output layer. The input layer and the reservoir are generated randomly, and their weights remain unchanged during training. The learning process in ESNs focuses only on determining the weights of the output layer through a linear regression algorithm (which has lower complexity than backpropagation methods), enabling accurate predictions from the input data.

In summary, while DNNs aim to fine-tune all network weights during training, ESNs rely on a fixed and random structure to make predictions, adjusting only the output layer. In this way, ESN-based architectures offer a significant advantage over traditional DNNs, reducing the training time and computation required and, consequently, the size of data sets for training. The simplicity and robustness of ESNs, with their short training times, low resource consumption, and ability to approximate complex system dynamics, have made them attractive for researchers to apply in a wide range of scenarios. We can find applications of ESN in diverse fields such as speech recognition [17], natural language processing [18, 19], control systems [20], anomaly detection [21], image classification [22], music generation [23], phone call prediction [24] and bioinformatics [25].

Their popularity in recent years has led to several proposals based on the original architecture [26]. In [27], three ESN-based architectures were proposed and their ability to generate states was investigated. These architectures are:

  • deepESN, where a series of layers are generated and fed with the output of the previous layer (Fig. 4a).

  • deepESN-IA, similar to the previous one, but each layer receives the original input along with the output of the previous layer (Fig. 4b).

  • groupedESN, where a series of reservoirs are generated and work in parallel to produce a single state (Fig. 4c).

Fig. 4 Multi-layer ESN-based architectures. a deepESN: the first layer processes the input as in the vanilla ESN, and the other layers receive the state of the previous reservoir as input. b deepESN-IA: the first layer processes the input as in the vanilla ESN; the other layers receive the state of the previous reservoir along with the original input. c groupedESN: a group of reservoirs operate independently as in vanilla ESNs and together form a single state vector

In some cases, ESNs have been used in combination with other types of neural networks to improve their performance. For example, in [28] an ESN was integrated with a multi-layer perceptron (MLP) to address the task of colour image segmentation. Similarly, [29] used a combination of long short-term memory (LSTM) and ESN to control the temperature of an industrial hot-blast furnace. The integration of ESN with convolutional neural networks (CNNs) has also been explored, as demonstrated in the research conducted by [30], where the network was applied to a solar energy prediction task.

ESNs have also been shown to be effective in image processing, as evidenced by [31, 32], after transforming the images into time series. [33] also proposed a combination of ESNs with CNNs for image classification. A summary of these works and their scopes can be seen in Table 1.

Table 1 Previous work overview

Although, as a type of recurrent neural network, ESNs excel at processing sequential data, research on their capability to comprehend spatial relationships remains limited. The purpose of this paper is to push ESNs beyond their traditional use and to evaluate their performance in scenarios devoid of temporal information. Specifically, we investigate the network’s capacity in image classification, where the emphasis shifts from temporal relationships to spatial understanding. As our experimental phase will demonstrate, this change makes certain network parameters, typically crucial for generating echo states (e.g., leaking rate or spectral radius), less significant in this context than in time series applications.

In the next section, we present a multi-reservoir echo state network (MRESN) architecture based on groupedESNs [27], which uses multiple reservoirs working in parallel. Each reservoir is tasked with processing the entire input image, allowing the network to capture a variety of spatial features simultaneously. Unlike parallelESN [35], which partitions the input and distributes it to the reservoirs, each reservoir in our proposed architecture processes the entire image in a single step. This approach is reminiscent of using an ESN ensemble to determine the output, but with the key difference that the ensemble information is taken into account when finding the output layer weights, allowing us to obtain high-dimensional state vectors while significantly reducing training times without negatively affecting network performance.

3 Multi-reservoir ESN (MRESN) architecture for image classification

As we have seen, in ESNs the hyperparameters have a direct impact on the performance of the model. Although most of these parameters strongly influence its behaviour when analysing time series, they become less relevant when trying to capture spatial relationships in images. In particular, as will be confirmed in the experimentation section, the spectral radius and the leaking rate do not have a significant impact on the results. This is because in this task we do not have a sequence in which the points depend on the previous ones, such as the Mackey-Glass benchmark signal. In our case, the input examples are independent of each other and are analysed in a single step, so the echoes contained in the states become less relevant when analysing static data such as images. However, the number of nodes in the reservoir becomes critical, and large reservoirs are necessary to obtain good results. The biggest drawback is that as we increase the number of nodes in the reservoir, the computation required to train the model increases, directly affecting both the training time and the required memory space. The simplicity of calculating the weights of an ESN is both an advantage and a disadvantage: the larger the number of nodes in the network, the larger the reservoir (which grows quadratically with the number of nodes) and the matrix H, so the configuration of these networks is limited by the available hardware resources.

In this study, we implemented the three architectures previously presented, making necessary modifications to tailor them for an image classification task that involves multiple output layers. Although deepESNs have been shown to be very robust in the time series domain [34] and even in real-world problems [36], they have not performed well in the image classification task proposed in this work. However, the parallel architecture allows us to achieve high performance compared to ESNs.

Our MRESN architecture is based on groupedESN. It uses multiple reservoirs that work in parallel, each processing the entire input image. The high-dimensional state vectors produced play a crucial role in preserving spatial information and significantly reducing training times compared to DNNs and even ESNs with the same number of nodes. The process for generating the state of the MRESN architecture is described by:

$$\begin{aligned} \textbf{x}_{t}^{\; i} = (1-\alpha ^i)\,\textbf{x}_{t-1}^{\; i} + \alpha ^i \lambda ^i(W_{\rm{in}}^i \cdot \textbf{u}_{t} + W^i \cdot \textbf{x}_{t-1}^{\; i} ), \; \; i \in [1,r] \end{aligned}$$
(3)
$$\begin{aligned} \textbf{x}_{t} = ( \textbf{x}_{t}^{\; 1}, \textbf{x}_{t}^{\; 2}, \ldots , \textbf{x}_{t}^{\; r}) \in \mathbb {R}^{\sum _{i=1}^{r} |r_i |} \end{aligned}$$
(4)

where r is the total number of reservoirs. At each time step t, the vector \(\textbf{x}_t\) is obtained by concatenating the vectors \(\textbf{x}_{t}^{\; i}\) produced by each individual reservoir \(W^i\), resulting in a final vector of size \(\sum _{i=1}^{r} |r_i |\). In the original ESN architecture, the size of the reservoir scales quadratically with the number of nodes, leading to a significant increase in the memory footprint of the architecture when a large number of nodes are used. In contrast, groupedESN, parallelESN and MRESN only require allocating the sum of the squares of the number of nodes in each reservoir. As the number of nodes in each reservoir is typically much smaller than that required in the original architecture, the MRESN architecture exhibits a significantly reduced memory requirement.

$$\begin{aligned} \mathcal {O}_{\rm{ESN}} = n \times n \end{aligned}$$
(5)
$$\begin{aligned} \mathcal {O}_{\rm{MRESN}} = \sum _{i=1}^{r} n_i \times n_i \end{aligned}$$
(6)

where \(n_i \ll n\) for \(i = 1, 2, \ldots , r\).
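The saving expressed by Eqs. (5) and (6) is easy to quantify; the small Julia sketch below compares the number of recurrent weight entries for a single 5000-node reservoir and for five 1000-node reservoirs (illustrative figures, not a configuration reported in this paper).

```julia
# Reservoir memory footprint, Eqs. (5)-(6): recurrent weight entries for one
# monolithic reservoir versus several smaller ones.
esn_entries(n)    = n * n                   # O_ESN
mresn_entries(ns) = sum(n -> n * n, ns)     # O_MRESN

esn_entries(5000)             # 25_000_000 entries for a single 5000-node reservoir
mresn_entries(fill(1000, 5))  #  5_000_000 entries for five 1000-node reservoirs
```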

Despite the differences in state computing, the training phase of this architecture is similar to the original (see Fig. 5).

Fig. 5 groupedESN and MRESN architecture. A concatenation of each reservoir state yields the final state, which is stored in the H matrix during the training phase

Another aspect to consider in image classification tasks is the independence among images from the data set. Consequently, it would not be advisable to rely on previous states of the network when processing a new image, which could lead to unexpected network behaviours. To avoid this, we use a zero vector \(\textbf{x}_0\) every time the network has to process a new image, and before computing the final state (which will be stored in the H matrix), a two-step initial transient is used to warm up the network.
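The per-image state computation can be sketched as follows; representing the network as a vector of (W_in, W, α) tuples is our own hypothetical layout, not the paper's implementation. Each reservoir starts from the zero vector, runs the two warm-up steps on the same image, and the final states are concatenated with the input to form the row stored in H.

```julia
# Per-image MRESN state, Eqs. (3)-(4), with zero reset and a two-step transient.
function image_state(reservoirs, u; transient = 2)
    states = map(reservoirs) do (W_in, W, α)
        x = zeros(size(W, 1))                             # x₀ = 0: images are independent
        for _ in 1:transient                              # warm-up steps, not stored in H
            x = (1 - α) .* x .+ α .* tanh.(W_in * u .+ W * x)
        end
        (1 - α) .* x .+ α .* tanh.(W_in * u .+ W * x)     # final state of this reservoir
    end
    return vcat(u, states...)                             # row stored in H: input ++ states
end
```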

As in the vanilla ESN architecture, we store each input of the training set in the matrix H along with the resulting state. Once the H matrix is ready, we calculate the weights of the output layer in a single step using Tikhonov regularisation. However, a key difference from the basic architecture must be emphasised: instead of a target vector consisting of the next step in the input series, the target vector must contain the class of the image. This modification enables the ESN to perform classification tasks rather than prediction tasks (see Fig. 6a).

As suggested by [12], we can train specific output layers for each class instead of using a single output layer to decide the class of the image:

$$\begin{aligned} W_{\rm{outs}}^{\rm{c}} = ( H^\intercal H + \beta I )^{-1}H^\intercal Y^c, \; \; \; c \in [1,2, \dots , k] \end{aligned}$$
(7)

In our experiments (MNIST and FashionMNIST), we have \(k=10\). Training each output layer to specialise in a particular class improves accuracy. It does not require multiple passes through the data set and does not significantly lengthen the training process, despite the increase in k. We reuse the states already stored in the H matrix to train these layers, pairing them with different target vectors derived from the original. These new target vectors are binary, containing 1 when the example belongs to the class being trained and 0 otherwise. In this way, we obtain specialised output layers for each class that can accurately decide whether the input belongs to that class or not (see Fig. 6b).
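A sketch of Eq. (7) in Julia, with our own variable names: the Gram matrix \(H^\intercal H + \beta I\) is built once and reused for every class, which is why adding output layers barely increases the training cost.

```julia
using LinearAlgebra

# One specialised output layer per class, Eq. (7).
# labels: integer class labels in 1:k; the binary targets Yᶜ are built on the fly.
function train_class_layers(H, labels; k = 10, β = 1e-8)
    G = H' * H + β * I                          # shared Gram matrix, computed once
    return [G \ (H' * (labels .== c)) for c in 1:k]
end
```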

Fig. 6 a One layer for all classes. b One layer per class

When we have multiple output layers, we need a mechanism to make a decision based on their outputs, for example, a softmax layer that takes the output of all the specialised output layers and returns a probability distribution over the classes. The class with the highest probability is then chosen as the predicted class for the input (see Fig. 7).
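A minimal sketch of this decision step over the k specialised scores (softmax and argmax pick the same winning class):

```julia
using LinearAlgebra

# Turn the per-class scores into a prediction: softmax for probabilities,
# argmax of the scores for the final class.
softmax(z) = (e = exp.(z .- maximum(z)); e ./ sum(e))

predict_class(layers, state) = argmax(softmax([dot(w, state) for w in layers]))
```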

Fig. 7 MRESN architecture. The output layer is a matrix where each column is specialised in a particular class. Argmax or softmax can be used as the final layer

4 Methodology

In this section, we describe the experimental setup used to evaluate the proposed architecture on the MNIST and FashionMNIST data sets. To measure the effectiveness of our architecture, we compare its results with those obtained with the original ESN architecture. Our experiments involve testing various parameters and configurations, and we conduct several trials to obtain reliable results.

4.1 MNIST and FashionMNIST data sets

Fig. 8 a MNIST. b FashionMNIST

The MNIST and FashionMNIST data sets are both commonly used benchmarks in image classification. MNIST consists of greyscale images of handwritten digits and is simple to work with, making it a valuable benchmark for assessing the performance and robustness of image classification models. FashionMNIST includes greyscale images of clothing items instead of digits (see Fig. 8). Both data sets provide training and testing splits, and each example is associated with a label from one of ten classes. There is no temporal relationship between the examples; they are independent of each other.

4.2 Experimentation

We have selected Julia for all the implementations: [12] for the original ESNs, and our own implementation for deepESN, deepESN-IA and MRESN. These architectures are trained and tested on the MNIST and FashionMNIST data sets using the same experimental setup: 60,000 examples for training and 10,000 for testing. Due to the poor results obtained for this task with deepESN and deepESN-IA, we focused on MRESN, which was compared to the vanilla ESN architecture. The same range of hyperparameters was tested for both architectures.

To ensure the robustness and reliability of our results, we conduct each experiment several times using different seeds to generate the random structures of the input layers and reservoirs and report the average performance of the architectures. Performance is evaluated using various metrics, including classification accuracy, training time, and memory usage.

All experiments are performed on a server equipped with an AMD Ryzen 7 5800X 16-core processor, 64 GB of RAM, and an Nvidia GeForce RTX 4090 GPU.

5 Results and discussion

One of the first questions that arose was whether an initial transient is necessary. When using an ESN with a leaky integrator [16] to analyse time series, the network is usually warmed up at the beginning of training by a series of steps in which information is allowed to circulate through the network without registering these initial states in H. This allows more coherent states to be stored in H, as they lack the noise that a cold start of the network can generate. In our case, lacking the temporal component, we introduce the image into the network at each step and use the generated state, along with the image, to generate the next state. But is an initial transient necessary for this task? To answer this question, we conducted an experiment to test how different values of the initial transient affect the final result.

Fig. 9 Errors committed by networks with different values of \(\alpha\) as the initial transient increases

As \(\alpha\) significantly influences the retention of information from the previous state within the network, we performed experiments with various networks, each assigned a value of \(\alpha \in \{0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99\}\). Figure 9 shows the accuracy of these networks as the initial transient is varied. The maximum accuracy is achieved by networks with \(\alpha > 0.3\), depicted in the bottom section of Fig. 9. Networks with \(0.1 \le \alpha \le 0.3\) require more initial transient steps to produce satisfactory results, as seen in the middle of Fig. 9. Notably, networks with \(\alpha = 0.01\) do not reach the accuracy levels of the other networks, even when subjected to high initial transient values. In any case, for networks with a sufficiently high value of \(\alpha\), two initial transient steps are sufficient to obtain good results, and increasing this value would only lead to networks with longer training and inference times. For this reason, in this work we set the value of the initial transient to 2 and tested the behaviour of the architecture by changing the rest of the parameters.

Once the initial transient value had been set, our next experiment analysed the performance of a vanilla ESN architecture with multiple output layers, one for each class. We measured its effectiveness over a range of different parameter sets. Based on [12], we identified a spectrum of possible values for each hyperparameter. Although that study provides a set of values that generally perform well, values outside these ranges have been employed successfully for specific problems and hyperparameters. In our study, we expanded the values suggested by the authors to explore the limits of our grid search. The ranges we considered are summarised in Table 2 and Fig. 10.

Fig. 10 Grid search hyperparameter values

Table 2 Range of hyperparameter values for grid search in vanilla ESN

The influence of each hyperparameter on the final performance of the network can be seen in Figs. 11–15.

Fig. 11 Influence of the number of nodes (N) on network performance

Fig. 12 Influence of the \(\alpha\) parameter on network performance

Fig. 13 Influence of the \(\beta\) parameter on network performance

Fig. 14 Influence of the density of the network on performance

Fig. 15 Influence of the \(\rho\) parameter on network performance

Although these figures show the results obtained on the MNIST data set, the same trends are observed on the FashionMNIST data set.

For this range of hyperparameters, the best results were obtained with \(N=5000\) nodes, with an error ranging from 0.0337 to 0.0345, and this minimum was reached with several different sets of hyperparameters. When \(\alpha =0.7\) and \({\rm{density}}=0.5\), the results were consistently favourable, regardless of the values assigned to \(\rho\) and \(\beta\) (within the ranges chosen for the grid search). Similarly, the minimum error was also achieved when \(\alpha = 0.5\), \({\rm{density}}=0.2\), and \(\beta \in [10^{-9}, 10^{-6}]\), regardless of the specific value of \(\rho\). It should be noted that the hyperparameter space is extremely rugged and that different sets may work well for other reservoirs. However, the results indicate that the leaking rate (\(\alpha\)), regularisation coefficient (\(\beta\)), density, and spectral radius (\(\rho\)) do not have a significant impact on network performance. This is expected, since \(\rho\) and \(\alpha\) are typically more relevant for time series data, where the network needs to maintain a memory of previous states. In our approach, the network only has two initial-transient steps before computing the final state, which reduces the risk of the state converging to atypical values, so some hyperparameters have little impact on the final results. On the other hand, N has a significant impact on the performance of the network: increasing the number of nodes (even beyond the maximum used in this grid search) reduces the prediction error, because the state vector becomes larger and can encode more information about the input.

The downside is that increasing N also increases the time and memory required for training since it affects the size of the structures used, especially H.

It should be noted that the number of nodes in the network appears to be the critical factor in capturing spatial relationships. This observation allows greater flexibility in adjusting other hyperparameters, such as spectral radius and leaking rate, in architectures designed for scenarios involving spatial and temporal relationships, such as video sequences.

Fig. 16 a deepESN. b deepESN-IA. c deepESN with identity matrix as input layer. d deepESN-IA with identity matrix as input layer. e Same as c but over a shorter range. f Same as d but over a shorter range

After this exploration of the behaviour of vanilla ESNs, we implemented the architectures proposed by [27] (see Fig. 4) and enhanced their design by incorporating multiple specialised output layers, customising them for a classification task. We used an argmax layer as the final step. The reservoirs and the input layers are randomly generated. We fix \(\alpha =0.7\), but assign random values to \(\rho\) and the input scaling \(\gamma\) to improve diversity. The range of hyperparameters used for these networks can be found in Table 3. In this first approach, we wanted to test whether an initial transient is relevant in these architectures, so we fixed this parameter to 0.

Table 3 Hyperparameters used in deepESN and deepESN-IA

These networks were tested at depths ranging from 1 to 10 layers. However, neither deepESN nor deepESN-IA yielded good results, and we observed that the results worsened as the number of layers increased (Fig. 16a and b). One possible explanation is that, in these architectures, the input layer of each ESN in the hidden layers applies a linear transformation to the data it receives, causing the error to grow as more hidden layers are added. If, for example, we replace the input layers of the reservoirs located in the hidden layers with an identity matrix that does not alter the input received (that is, the state of the previous layer), the results become more consistent (Fig. 16e and f), as sketched below. In any case, the lack of an initial transient does not help the network make better predictions. As we will see later in the comparison, a few initial transient steps help the network perform better, but this was not enough to match the accuracy of the other ESN-based models.
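The stacked update we tested can be sketched as follows (hypothetical code, ours, for the deepESN-style stack without input injection): each layer is driven by the state of the layer below, and passing Julia's `I` as a hidden layer's input matrix reproduces the identity-input variant, which requires equal layer sizes.

```julia
using LinearAlgebra

# Stacked (deepESN-style) state computation. W_ins[1] maps the image into the
# first reservoir; for i > 1, W_ins[i] maps the state of layer i-1 into layer i.
# Passing `I` for the hidden input matrices leaves the previous state unaltered.
function deep_states(W_ins, Ws, u; α = 0.7, steps = 1)   # steps = 1: no initial transient
    xs = [zeros(size(W, 1)) for W in Ws]                 # per-layer states, start at zero
    for _ in 1:steps                                     # repeated presentations of the image
        drive = u
        for i in eachindex(Ws)
            xs[i] = (1 - α) .* xs[i] .+ α .* tanh.(W_ins[i] * drive .+ Ws[i] * xs[i])
            drive = xs[i]                                # the updated state feeds the next layer
        end
    end
    return vcat(xs...)                                   # concatenated deep state
end
```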

Although sequential architectures based on stacked ESN layers may not provide optimal performance for this task, the use of multiple ESNs in parallel proves to be a successful approach. The MRESN approach allows us to generate large state vectors while significantly reducing the required training time compared to the conventional ESN architecture, where training times are considerably longer. Furthermore, with the same number of nodes, the overall performance of the MRESN network improves over ESN and deepESN. The multi-reservoir architecture thus provides a promising way to generate large state vectors while maintaining high accuracy and reducing training times. Tables 4 and 5 compare the three implemented architectures and the vanilla ESN in terms of execution time and accuracy, respectively.

Table 4 Global execution time
Table 5 Networks accuracy for MNIST and FashionMNIST

Our experiments demonstrated that the proposed architecture outperforms the original ESN architecture in terms of classification accuracy on both data sets. Specifically, the proposed architecture achieved 98.43% accuracy on the MNIST data set, compared to 97.84% for the original ESN architecture. On the FashionMNIST data set, the proposed architecture achieved 89.12% accuracy, while the original ESN architecture achieved 88.1%. Notably, the new architecture not only performs better but also requires less training time and less memory for the structures used.

Finally, Table 6 presents a comparative analysis of three MRESN architectures: one with 20 reservoirs of 1000 nodes, one with 6 reservoirs of 1000 nodes, and a small network with 4 reservoirs of 500 nodes. These architectures are compared with other architectures based on ESNs and DNNs, focusing on two key metrics: accuracy and number of trainable parameters. This comparison provides insight into the effectiveness of the MRESN architecture in achieving competitive levels of accuracy while maintaining a manageable number of trainable parameters across different \(r \times N\) configurations. A graphical representation of this table is shown in Fig. 17.

Table 6 Architecture comparison with other S.O.T.A. methods
Fig. 17 Relationship between accuracy and trainable parameters for different models. Stars denote the MRESN architecture, while other symbols represent models identified by their reference numbers in Table 6

6 Conclusions

In this paper, we have investigated the behaviour of different ESN-based architectures and evaluated their performance in the context of image classification. For the basic architecture, we tested the network response under various sets of hyperparameters and analysed different modifications for deepESN-type networks.

Building upon groupedESN, we introduced the MRESN architecture for image classification. This architecture uses multiple reservoirs that operate in parallel to capture pure spatial relationships in images without transforming them into time series. Our experiments on the MNIST data set demonstrated that the MRESN architecture achieved a classification accuracy of 98.43%, surpassing both the classical ESN architecture and other state-of-the-art ESN-based architectures, all while requiring less training time.

Unlike previous approaches that divide the image into parts and allocate one slice to each reservoir, the MRESN architecture processes the entire image in each reservoir. This method uses the information of the ensemble to determine the weights of the output layer, overcoming the limitations of vanilla ESNs and providing more accurate predictions. Furthermore, utilising a set of smaller reservoirs instead of a single large structure with many nodes results in reduced training and testing times for the network and significantly lowers the required disk space for model execution.

Our results suggest that the proposed MRESN architecture can serve as a viable alternative for learning spatial relationships in scenarios where energy efficiency and training time are critical concerns. The MRESN architecture holds promise for applications in image classification tasks. Future research could explore its potential for generalisation to more complex data sets. Furthermore, investigating the results when preprocessing data sets by applying techniques such as data augmentation [40] or feature extraction [41] could further enhance its utility.

In this study, we examine the response of the network to different sets of hyperparameters and demonstrate that the number of nodes within the network emerges as a critical factor in capturing spatial relationships. This discovery provides greater flexibility for adjusting other hyperparameters, such as spectral radius and leaking rate, in architectures designed for scenarios that involve not only spatial but also temporal relationships, as seen in video sequences. This adaptability should allow these networks to be optimised in various contexts without compromising their ability to capture spatial relationships, a hypothesis that we intend to test in our future work.