1 Introduction

Artificial Neural Networks (ANNs) are computing systems consisting of connected units, commonly called neurons or nodes. ANNs are inspired by biological neural systems (McCulloch and Pitts 1943; Hebb 1949) and constitute a class of powerful machine learning tools for a wide range of applications (Lecun et al. 2015; Schmidhuber 2015; Abiodun et al. 2018; Tealab 2018; Goodfellow et al. 2016). Recurrent Neural Networks (RNNs) are ANNs designed to process sequential data. They contain at least three layers: an input layer, representing the input data, a hidden layer with recurrent connections, and an output layer. A given sequence of input data is fed into the RNN step by step. The recurrent connections of the hidden layer ensure that the network retains information about past input elements.

A standard method to train RNNs is gradient descent, where the gradient is computed by backpropagation through time (Robinson and Fallside 1987; Mozer 1989; Goodfellow et al. 2016). This training method can suffer from vanishing or exploding gradients (Mozer 1989; Kolen and Kremer 2001; Goodfellow et al. 2016), which can be avoided by using advanced RNN models, such as long short-term memories (LSTMs) (Hochreiter and Schmidhuber 1997) or gated recurrent units (GRUs) (Cho et al. 2014). LSTMs and GRUs perform well, even on challenging tasks (Sak et al. 2014; Uhlich et al. 2017; Takahashi et al. 2018; Ravanelli et al. 2018), but are sophisticated and rather difficult to implement.

Another, earlier and elegantly simple approach to overcoming the problem of vanishing or exploding gradients is the Echo State Network (ESN) (Jaeger 2001, 2002; Lu et al. 2017). ESNs are RNNs with fixed, randomly chosen input and internal weights and adaptable output weights. Since only the output weights are trained, there is no need to employ gradient descent. Instead, one can use a simple regression method. Fixed input and internal weights come at the price that ESNs perform worse than advanced methods, such as LSTMs or GRUs, on challenging tasks. For rather simple or moderately challenging tasks, however, ESNs should still be considered as a possible solution due to their simplicity and their fast training process. ESNs have been successfully applied to chaotic signal prediction (Lu et al. 2017; Pathak et al. 2017), signal separation (Krishnagopal et al. 2020), stock price prediction (Lin et al. 2009), and the simulation of cardiac electrical waves (Zimmermann and Parlitz 2018).

Furthermore, ESNs are a subclass of reservoir computing, which is the principle of employing an input-driven dynamical system, called the reservoir, together with an adaptable readout transformation to solve machine learning problems. The hidden layer of an ESN can be considered as a discrete-time dynamical system and serves as the reservoir; the ESN’s output layer realizes the readout transformation. The ESN was the first reservoir computing concept to be introduced. Since then, further concepts have emerged, e.g., time-delay reservoir computing (Appeltant et al. 2011; Larger et al. 2012; Schumacher et al. 2013; Brunner et al. 2013).

Classical ESNs possess only one hidden layer, which obtains its connections from an Erdős-Rényi graph (Erdös and Renyi 1961). In addition, deep ESNs have been introduced (Gallicchio and Micheli 2016; Gallicchio et al. 2018; Gallicchio and Micheli 2020; Dettori et al. 2020). Deep ESNs possess multiple hidden layers, which are internally recurrent, but coupled to each other in a feed-forward way.

In spite of decent results obtained from ESNs with simple Erdős-Rényi topology, there is an interest in employing complex network topologies for further improvements. Complex networks are graphs with nontrivial connection patterns and provide a powerful tool for modeling complex systems by unifying spatial, topological, functional, and evolutionary properties. One of their salient features is the presence of communities forming a clustered structure. While a community is a group of densely connected nodes, the connections between nodes of different communities are sparse. Interestingly, such clustered structures have been found in, e.g., the brains of humans and animals (Martens et al. 2017; Berry and Tkacik 2020). Data on both anatomical and functional connectomes have shown a small-world structure with highly clustered modules at different scales (Akiki and Abdallah 2019; Gleiser and Spoormaker 2010; Hagmann et al. 2008).

Various concepts of clustered ESNs (CESNs), i.e., ESNs with a reservoir described by a complex network with community structure, have been proposed, including geometrically motivated (Deng and Zhang 2007), data driven (Li et al. 2015), and knowledge-based (Yu et al. 2011) methods. The recent work (Oliveira et al. 2020) introduced two further types of CESNs, where the clusters are defined by equally sized blocks in the adjacency matrix. These blocks are generated using the Erdős-Rényi model or a Barabási-Albert-like model, respectively. However, only simple time-series prediction and frequency filtering problems on artificial datasets have been considered to evaluate the effectiveness of these methods.

In this article, we show that the CESNs from Oliveira et al. (2020) can also be applied for signal denoising. Specifically, we reconstruct signals from their noisy versions, which are corrupted by Gaussian, impulse, or ECG noise. To demonstrate the effectiveness of CESNs, we compare them to classical ESNs and to the Wiener filter (Wiener 1949). Furthermore, we introduce deep CESNs, i.e., ESNs with multiple clustered layers.

Our results show that CESNs and deep CESNs perform significantly better than the Wiener filter and classical ESNs. ESNs, in particular CESNs and deep CESNs, are thus robust methods that provide decent performance in signal denoising.

2 Review of echo state networks

In this section, we provide a general definition of ESNs and explain how to train them using ridge regression. We describe classical ESNs with a random Erdős-Rényi network topology as introduced in Jaeger (2001), which we call random ESN throughout this article. Further, we describe deep ESNs (Gallicchio and Micheli 2016, 2020).

We use the term adjacency matrix for the matrix indicating the number of connections between two nodes of a directed graph (including self-connections). The entries of an adjacency matrix are in general non-negative integers. In ESNs, there is at most one connection per direction between two nodes, i.e., the reservoir’s adjacency matrix contains only ones and zeros. Matrices describing weighted connections are referred to as weight matrices. We call the matrix of the weighted connections inside the reservoir the hidden weight matrix.

2.1 General definition of echo state networks

ESNs are RNNs consisting of an input layer u(t), an output layer s(t), and one or multiple hidden layers defining the reservoir x(t), where \(t = 0, 1, \ldots , t_\mathrm {max}\) is a time index.

The input layer u(t) represents a sequence of input vectors with dimension \(M\in {\mathbb {N}}\), which are inserted step by step into the reservoir x(t). The reservoir itself is a high-dimensional nonlinear discrete-time dynamical system with states in \({\mathbb {R}}^N\), where \(N\gg M\). The high dimensionality ensures that a sufficiently large number of input elements can be stored (Jaeger 2002).

The dynamics of the reservoir is described by the equation

$$\begin{aligned} x(t+1)= (1 - \alpha ) x(t) + \alpha f(A x(t)+ W^{\mathrm {in}} u(t) + \gamma {\mathbf {1}}), \end{aligned}$$
(1)

where \(0 < \alpha \le 1\) is a leakage rate, \(A\in {\mathbb {R}}^{N\times N}\) is the hidden weight matrix, \(W^{\mathrm {in}} \in {\mathbb {R}}^{N \times M}\) is the input weight matrix, \(u(t)\in {\mathbb {R}}^{M}\) is the input vector at time t, and \(\gamma {\mathbf {1}}\) is a bias vector, where \(\gamma \in {\mathbb {R}}\) is the bias strength and \({\mathbf {1}}\in {\mathbb {R}}^N\) denotes a vector filled with ones. The leakage rate \(\alpha\) is a fixed hyperparameter that controls how fast the system state changes. If the ESN contains multiple hidden layers, they can all be described by just a single hidden weight matrix A with a certain block structure. See Sect. 2.5 about deep ESNs for details.

The output of an ESN is a time series with elements of dimension \(P\in {\mathbb {N}}\) given by the equation

$$\begin{aligned} {\hat{s}}(t)=W^{\mathrm {out}} x(t) + c, \end{aligned}$$
(2)

where \(W^{\mathrm {out}}\in {\mathbb {R}}^{P \times N}\) is the output weight matrix and \(c\in {\mathbb {R}}^P\) is a vector containing output bias weights.
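To make Eqs. (1) and (2) concrete, the following minimal NumPy sketch implements one reservoir update step and the linear readout. The function names are illustrative; the matrices A, \(W^{\mathrm {in}}\), \(W^{\mathrm {out}}\) and the vector c are assumed to be given, and the hyperbolic tangent is used as the activation f (as in our experiments, Sect. 5).

```python
import numpy as np

def reservoir_step(x, u, A, W_in, alpha, gamma, f=np.tanh):
    """One iteration of Eq. (1): leaky update of the reservoir state x given input u."""
    return (1 - alpha) * x + alpha * f(A @ x + W_in @ u + gamma * np.ones_like(x))

def readout(x, W_out, c):
    """Eq. (2): linear readout of the reservoir state."""
    return W_out @ x + c
```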

All types of ESNs considered in this article differ only in the structure of the hidden weight matrix A. Consequently, they can all be trained with the same training procedure.

2.2 Training process

ESNs are applied for supervised sequential machine learning problems. That is, in addition to the sequence of inputs u(t), we are given a sequence of target vectors \(s(t)\in {\mathbb {R}}^P\). The training process aims to find a good estimator \({\hat{s}}(t)\) to approximate the target sequence s(t). This is done by fitting the variables \(W^\mathrm {out}\) and c using a regression method.

Lu et al. (2017) explain how to optimize \(W^\mathrm {out}\) and c by ridge regression. We use the same method for the numerical studies presented below and, to keep this article self-contained, repeat the explanation from Lu et al. (2017) here.

For the training process, we assign the negative time index \(t_0 < 0\) to the initial state \(x(t_0)\) of the system. We solve Eq. (1) iteratively to generate the system states x(t). Only the states \(x(1), \ldots ,x(t_\mathrm {max})\) need to be stored. The period up to \(t=0\) is called the initial wash-out phase and ensures that the training results do not depend on the initial state (Jaeger 2001; Lu et al. 2017).

To perform the actual regression, we calculate the element-wise mean \({\bar{x}}\in {\mathbb {R}}^N\) of the reservoir states \(x(1), \ldots , x(t_\mathrm {max})\) and the element-wise mean \({\bar{s}}\in {\mathbb {R}}^P\) of the target states \(s(1), \ldots , s(t_\mathrm {max})\), i.e.,

$$\begin{aligned} {\bar{x}} = \frac{1}{t_\mathrm {max}} \sum _{t=1}^{t_\mathrm {max}} x(t),\quad {\bar{s}} = \frac{1}{t_\mathrm {max}} \sum _{t=1}^{t_\mathrm {max}} s(t). \end{aligned}$$
(3)

Let \(\delta X\) be a matrix with N rows and \(t_\mathrm {max}\) columns, where the tth column is the vector \(x(t) - {\bar{x}} \in {\mathbb {R}}^N\). Analogously, let \(\delta S\) be a matrix with P rows and \(t_\mathrm {max}\) columns, where the tth column is the vector \(s(t) - {\bar{s}} \in {\mathbb {R}}^P\). Then, the output weight matrix \(W^\mathrm {out} \in {\mathbb {R}}^{P\times N}\) can be calculated using the equation

$$\begin{aligned} W^\mathrm {out}=\delta S \delta X^\mathrm {T} (\delta X \delta X^\mathrm {T} + \beta I)^{-1}, \end{aligned}$$
(4)

where I is the identity matrix and \(\beta\) is the ridge regression parameter. The output bias vector \(c \in {\mathbb {R}}^P\) is calculated by

$$\begin{aligned} c=-(W^\mathrm {out} {\bar{x}} - {\bar{s}}). \end{aligned}$$
(5)

The required computational time for the training process is of order \({\mathcal {O}}(N^2 t_\mathrm {max})\), which is the computational complexity of the matrix multiplication of \(\delta X\) with \(\delta X^\mathrm {T}\).

The required memory space corresponds to the size of the largest matrix used by the algorithm and is of order \({\mathcal {O}}(N t_\mathrm {max})\). In particular due to the memory requirement, it can be advisable to truncate the training signal if a large amount of training data is available. This way the length of the training sequence \(t_\mathrm {max}\) can be limited to a moderate value.
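The following sketch shows the ridge-regression fit of Eqs. (3)–(5), assuming the reservoir states after the wash-out phase have been collected as columns of a matrix X and the targets as columns of a matrix S; the function name is an illustrative choice.

```python
import numpy as np

def train_readout(X, S, beta):
    """Fit W_out and c via ridge regression, Eqs. (3)-(5).

    X: reservoir states, shape (N, t_max), one column per time step (wash-out removed).
    S: target vectors,   shape (P, t_max).
    beta: ridge regression parameter.
    """
    x_bar = X.mean(axis=1, keepdims=True)   # element-wise mean, Eq. (3)
    s_bar = S.mean(axis=1, keepdims=True)
    dX = X - x_bar                           # columns x(t) - x_bar
    dS = S - s_bar                           # columns s(t) - s_bar
    N = X.shape[0]
    # Eq. (4); solving the symmetric system is equivalent to the explicit inverse
    W_out = np.linalg.solve(dX @ dX.T + beta * np.eye(N), (dS @ dX.T).T).T
    c = -(W_out @ x_bar - s_bar)             # Eq. (5)
    return W_out, c.ravel()
```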

2.3 The Erdős-Rényi model

Below, we explain several types of ESNs that differ in the graph used for the construction of their hidden weight matrix A. Several of these ESN types, including the classical model introduced in Jaeger (2001), are based on Erdős-Rényi graphs (Erdös and Renyi 1961), which we briefly describe here.

Erdős-Rényi graphs are directed graphs with sparse connectivity, which are randomly generated. First, we choose two parameters: the number of nodes N and an average node degree D. The edges of the graph are randomly chosen in such a way that all possible connections can occur with the same probability. In practice, this means that we generate the random adjacency matrix of an Erdős-Rényi graph, which contains the entries 1 and 0 to indicate whether there is a directed connection between two nodes or not. We call this adjacency matrix an Erdős-Rényi matrix.

Precisely speaking, there are two equally common ways to define an Erdős-Rényi matrix. One can either set all entries of the matrix independently of each other to one or zero with probability D/N or \(1 - D/N\), respectively; or one can set the number of connections to exactly ND and choose them randomly. In this work, we use the second version.
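A minimal sketch of the second variant (an exact number of connections chosen uniformly at random, self-connections allowed) could look as follows; the function name is an illustrative choice, not taken from any particular library.

```python
import numpy as np

def erdos_renyi_adjacency(n_nodes, n_edges, rng=None):
    """Directed Erdős-Rényi adjacency matrix with exactly n_edges ones, placed
    uniformly at random among all n_nodes**2 possible entries (self-connections allowed)."""
    rng = np.random.default_rng() if rng is None else rng
    positions = rng.choice(n_nodes * n_nodes, size=n_edges, replace=False)
    adj = np.zeros((n_nodes, n_nodes), dtype=int)
    adj.flat[positions] = 1
    return adj

# Example: N = 100 nodes with average degree D = 3, i.e., N*D = 300 connections.
adj = erdos_renyi_adjacency(100, 300)
```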

2.4 Random echo state networks

We refer to the classical ESN model, introduced in Jaeger (2001), as random ESN. As illustrated in Fig. 1, random ESNs contain one hidden layer. An Erdős-Rényi matrix is employed as the adjacency matrix of the reservoir. Thus, the reservoir’s connectivity is in general sparse and all possible connections occur with the same probability. We need to specify two parameters to generate the adjacency matrix: the number of hidden nodes N and the average node degree \(D \le N\). The adjacency matrix is then obtained by the algorithm described in Sect. 2.3.

Fig. 1

Illustration of a random ESN. The reservoir is illustrated by a box containing the hidden nodes (black circles). The connections inside the reservoir are drawn randomly from an Erdős-Rényi graph. The ESN also contains multiple input nodes (red circles) and output nodes (blue circles) (Color figure online)

For the hidden weight matrix A, we need to specify one further parameter: the spectral radius \(\rho _A\) of the matrix. In order to generate the weight matrix, we draw the weights for all connections (given by the adjacency matrix) from a uniform distribution on the interval \([-1,1]\). Then we compute the spectral radius \(\rho _\mathrm {init}\) of the obtained matrix and multiply the matrix with the factor \(\rho _A / \rho _\mathrm {init}\). This way we obtain a hidden weight matrix A with Erdős-Rényi topology, uniformly distributed random weights, and the desired spectral radius \(\rho _A\).
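The weight assignment and rescaling can be sketched as follows (assuming an adjacency matrix generated as in Sect. 2.3; the function name is illustrative).

```python
import numpy as np

def hidden_weight_matrix(adj, rho_A, rng=None):
    """Assign uniform random weights on [-1, 1] to the given adjacency matrix
    and rescale the matrix so that its spectral radius equals rho_A."""
    rng = np.random.default_rng() if rng is None else rng
    W = adj * rng.uniform(-1.0, 1.0, size=adj.shape)
    rho_init = np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius before rescaling
    return W * (rho_A / rho_init)
```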

2.5 Deep echo state networks

Deep Echo State Networks (deep ESNs) are ESNs with multiple hidden layers (Gallicchio and Micheli 2016; Gallicchio et al. 2018; Gallicchio and Micheli 2020; Dettori et al. 2020), as illustrated in Fig. 2. We consider deep ESNs with equally sized layers, which is the usual case, i.e., the total number of nodes N of a deep ESN is a multiple of the number of hidden layers L and the number of nodes per hidden layer is \(N_L = N/L\). Although deep ESNs can have hidden layers of different sizes, this would require additional model parameters and is therefore left for future work. The adjacency matrix of a deep ESN is a block matrix consisting of blocks with dimension \(N_L \times N_L\). The hidden layers are given by the main-diagonal blocks, and the inter-layer connections are given by the blocks directly below the main diagonal. All other blocks are zero matrices. Thus, the layers are connected in a feed-forward manner, i.e., the ith hidden layer is forward-connected to the \(i+1\)-st hidden layer, for \(i=1,\ldots ,L-1\).

The main-diagonal blocks are generated by the Erdős-Rényi model with a given mean degree \(D\le N_L\). Also, the blocks determining the inter-layer connections are Erdős-Rényi matrices with a given mean degree \(D_\mathrm {inter} \le N_L\).

Fig. 2

Illustration of a deep ESN with three hidden layers. The reservoir is indicated by the large outer box and contains three hidden layers (smaller rectangles inside the box). Each hidden layer contains multiple nodes (black circles) which are recurrently connected inside the layers. The hidden layers are connected to each other in a feed-forward manner. The input nodes (red circles) are connected to the first hidden layer, and the output nodes (blue circles) are connected to all hidden layers, i.e., to the whole reservoir (Color figure online)

Once we have obtained the whole \(N\times N\) adjacency matrix of the deep ESN’s reservoir, the weights of the connections are drawn from a uniform distribution on the interval \([-1,1]\). Again, we rescale the resulting matrix to obtain a hidden weight matrix A with the desired spectral radius \(\rho _A\).

The input layer is only connected to the first hidden layer, i.e., only the first \(N_L\) rows of the input weight matrix \(W^\mathrm {in}\) have non-zero entries. All nodes of the reservoir (i.e., of all hidden layers) are connected to the output layer. Hence, there are no restrictions for the output weight matrix \(W^\mathrm {out}\).
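The block structure of the deep ESN’s adjacency matrix can be sketched as follows; the function name and the inner helper for the Erdős-Rényi blocks are illustrative, and equally sized layers are assumed.

```python
import numpy as np

def deep_esn_adjacency(N, L, D, D_inter, rng=None):
    """Block adjacency matrix of a deep ESN with L equally sized hidden layers.

    Main-diagonal blocks: Erdős-Rényi layers with mean degree D.
    Blocks directly below the diagonal: feed-forward connections from layer i
    to layer i+1 with mean degree D_inter. All other blocks are zero.
    """
    rng = np.random.default_rng() if rng is None else rng
    N_L = N // L

    def er_block(n_edges):
        block = np.zeros((N_L, N_L), dtype=int)
        idx = rng.choice(N_L * N_L, size=int(n_edges), replace=False)
        block.flat[idx] = 1
        return block

    adj = np.zeros((N, N), dtype=int)
    for i in range(L):
        rows = slice(i * N_L, (i + 1) * N_L)
        adj[rows, rows] = er_block(N_L * D)                 # recurrent layer i
        if i < L - 1:
            nxt = slice((i + 1) * N_L, (i + 2) * N_L)
            adj[nxt, rows] = er_block(N_L * D_inter)        # layer i -> layer i+1
    return adj
```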

For the numerical tests presented in Sect. 5, we sometimes did not use exactly equally sized layers, but layers that may differ in size by one node. This way, the number of layers L could be chosen more flexibly.

3 Clustered echo state networks

Several works (Deng and Zhang 2007; Li et al. 2015; Yu et al. 2011; Oliveira et al. 2020) have proposed new types of ESNs based on complex networks: Clustered Echo State Networks (CESNs). Instead of a simple random Erdős-Rényi graph, these approaches use connection patterns with clusters to define the reservoir. See Fig. 3 for an illustration.

Fig. 3

Illustration of a clustered ESN. The reservoir (large box) contains three clusters, where the nodes belonging to the same cluster are drawn in the same color (green, purple or orange). The inter-cluster connectivity is sparser than the connectivity inside the clusters. The input and the output layer of the clustered ESN (red and blue) are connected to all clusters in the reservoir (Color figure online)

In this section, we focus on the models from Oliveira et al. (2020). The clusters are represented by equally sized blocks in the adjacency matrix, and the internal connections of each cluster are either realized using the Erdős-Rényi model or a custom model which is similar to the Barabási-Albert model. These new ESN types are called Erdős-Rényi CESN or Barabási-Albert-like CESN, respectively. Moreover, we propose a combination of CESNs and deep ESNs: deep ESNs with clustered layers, which we call deep CESNs.

For all CESNs and deep CESNs defined below, we denote the total number of hidden nodes by N. We let D be the mean degree of the hidden layer(s). We consider deep CESNs with L equally sized hidden layers, each consisting of \(N_L = N/L\) nodes. Further, let \(C\in {\mathbb {N}}\) be the number of clusters per hidden layer. The clusters are also equally sized, i.e., N (or \(N_L\), respectively) must be a multiple of C. Let \(P_\mathrm {in} \in [0,1]\) be the share of cluster-internal connections and \(P_\mathrm {out} = 1 - P_\mathrm {in}\) be the share of inter-cluster connections within a hidden layer.

Since D is the mean degree of the hidden layer(s) and N (or \(N_L\), for the deep case) is the number of nodes per hidden layer, one hidden layer contains ND (or \(N_L D\)) connections: specifically, \(NDP_\mathrm {in}\) (or \(N_LDP_\mathrm {in}\)) cluster-internal connections and \(NDP_\mathrm {out}\) (or \(N_LDP_\mathrm {out}\)) inter-cluster connections.

3.1 Erdős-Rényi clustered echo state networks

The clustered network topology of an Erdős-Rényi CESN is described by an adjacency matrix which is partitioned into blocks of size \(N_C \times N_C\), where \(N_C = N/C\) is the number of nodes per cluster. Each cluster is represented by a block on the main diagonal. The off-diagonal blocks contain the connections between nodes of different clusters. The connections inside the main-diagonal blocks are determined by an Erdős-Rényi graph with \(N_C\) nodes and \(NDP_\mathrm {in}/C\) edges. To realize the inter-cluster connections, we randomly set \(NDP_\mathrm {out}\) entries of the off-diagonal blocks to 1.

The adjacency matrix of an Erdős-Rényi CESN is similar to a stochastic block model (Holland et al. 1983). It is, however, not exactly the same because the main-diagonal and off-diagonal blocks are constructed following different rules.

Given the adjacency matrix, we obtain the weight matrix of the reservoir by drawing random \({\mathcal {U}}([0,1])\)-distributed weights and rescaling them to achieve the desired spectral radius.
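A sketch of the Erdős-Rényi CESN adjacency matrix could look as follows (for the non-deep case with N nodes and C clusters; the function name is illustrative).

```python
import numpy as np

def er_cesn_adjacency(N, C, D, P_in, rng=None):
    """Adjacency matrix of an Erdős-Rényi CESN with C equally sized clusters.

    Each main-diagonal block receives N*D*P_in/C cluster-internal connections;
    N*D*(1 - P_in) connections are scattered over the off-diagonal blocks.
    """
    rng = np.random.default_rng() if rng is None else rng
    N_C = N // C
    adj = np.zeros((N, N), dtype=int)

    # Cluster-internal connections (main-diagonal blocks).
    edges_per_cluster = int(round(N * D * P_in / C))
    for i in range(C):
        block = np.zeros((N_C, N_C), dtype=int)
        idx = rng.choice(N_C * N_C, size=edges_per_cluster, replace=False)
        block.flat[idx] = 1
        adj[i * N_C:(i + 1) * N_C, i * N_C:(i + 1) * N_C] = block

    # Inter-cluster connections (off-diagonal blocks).
    off_mask = np.ones((N, N), dtype=bool)
    for i in range(C):
        off_mask[i * N_C:(i + 1) * N_C, i * N_C:(i + 1) * N_C] = False
    off_positions = np.flatnonzero(off_mask)
    n_inter = int(round(N * D * (1 - P_in)))
    adj.flat[rng.choice(off_positions, size=n_inter, replace=False)] = 1
    return adj
```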

3.2 Barabási-Albert-like clustered echo state network

The reservoir of a Barabási-Albert-like CESN is described by a block matrix, which has the same block structure as the adjacency matrix of the Erdős-Rényi CESN defined above. However, the main-diagonal blocks, which represent the clusters, are obtained from a different graph model and have themselves a clustered structure. This graph model is inspired by the model introduced in Bollobas et al. (2003) and the well-known Barabási-Albert model (Barabási and Albert 1999). We apply a modified model instead of the original Barabási-Albert model or the model from Bollobas et al. (2003) because we require directed graphs and are given an exact number of connections.

To generate the main-diagonal blocks, we initialize all \(N_C\) nodes of a cluster at once (in contrast to the models from Barabási and Albert (1999) and Bollobas et al. (2003)). We seed the graph with one connection from one node to itself (a loop). Then we repeat the following procedure until the graph contains \(NDP_\mathrm {in}/C\) directed edges:

  1. With probability \(P_1\) do: (a) Choose a node u with uniform probability. (b) Choose a node v with a probability proportional to the node’s in-degree. (c) Add an edge from u to v.

  2. With probability \(P_2\) do: (a) Choose a node u with a probability proportional to the node’s out-degree. (b) Choose a node v with a probability proportional to the node’s in-degree. (c) Add an edge from u to v.

  3. With probability \(P_3\) do: (a) Choose a node u with a probability proportional to the node’s out-degree. (b) Choose a node v with uniform probability. (c) Add an edge from u to v.

In each step, only one of the procedures 1), 2), or 3) is conducted; note that the probabilities add up to 1. If there is already an edge between the randomly selected nodes, we do not add another edge with the same direction. Here, we choose \(P_1 = 0.41\), \(P_2 = 0.54\), and \(P_3 = 0.05\).
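A sketch of this cluster-internal generation procedure is given below; the adjacency convention (row = source node, column = target node) and the function name are illustrative choices.

```python
import numpy as np

def ba_like_cluster(n_nodes, n_edges, p=(0.41, 0.54, 0.05), rng=None):
    """Directed Barabási-Albert-like cluster block: seed with one self-loop and
    add edges by preferential attachment until the block contains n_edges edges
    (at most one edge per direction between two nodes)."""
    rng = np.random.default_rng() if rng is None else rng
    adj = np.zeros((n_nodes, n_nodes), dtype=int)   # adj[u, v] = 1 means edge u -> v
    seed = rng.integers(n_nodes)
    adj[seed, seed] = 1                              # seed: one self-loop

    def by_degree(degree):                           # degree-proportional node choice
        return rng.choice(n_nodes, p=degree / degree.sum())

    while adj.sum() < n_edges:
        in_deg = adj.sum(axis=0).astype(float)       # incoming edges per node
        out_deg = adj.sum(axis=1).astype(float)      # outgoing edges per node
        case = rng.choice(3, p=p)
        if case == 0:    # probability P_1
            u, v = rng.integers(n_nodes), by_degree(in_deg)
        elif case == 1:  # probability P_2
            u, v = by_degree(out_deg), by_degree(in_deg)
        else:            # probability P_3
            u, v = by_degree(out_deg), rng.integers(n_nodes)
        adj[u, v] = 1    # no effect if the edge already exists
    return adj
```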

Subsequently, we add inter-cluster connections in two steps:

  1. We construct a graph with C nodes and \(NDP_\mathrm {out}\) edges using the modified Barabási-Albert-like model described above, with one difference: this time we allow multiple edges per direction between two nodes. That is, the entries of the resulting adjacency matrix can be any non-negative integer. Moreover, we do not allow loops in this case, i.e., the entries on the diagonal are zeros.

  2. According to the adjacency matrix generated in step 1), we decide how many connections we draw from one cluster to another. Recall that the reservoir’s adjacency matrix is a block matrix with \(C\times C\) blocks of size \(N_C\times N_C\). If the adjacency matrix obtained in step 1) has the entry n at position (i, j), we generate the block with the index (i, j) of the reservoir’s adjacency matrix using the Erdős-Rényi model with \(N_C\) nodes and n edges.
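A sketch of step 2) is shown below, assuming the \(C\times C\) count matrix from step 1) is given; the function name is illustrative and the counts are assumed not to exceed the capacity \(N_C^2\) of a block.

```python
import numpy as np

def fill_inter_cluster_blocks(adj, counts, N_C, rng=None):
    """For each cluster pair (i, j), place counts[i, j] connections uniformly at
    random in block (i, j) of the reservoir adjacency matrix adj."""
    rng = np.random.default_rng() if rng is None else rng
    C = counts.shape[0]
    for i in range(C):
        for j in range(C):
            if i == j or counts[i, j] == 0:
                continue
            block = np.zeros((N_C, N_C), dtype=int)
            idx = rng.choice(N_C * N_C, size=int(counts[i, j]), replace=False)
            block.flat[idx] = 1
            adj[i * N_C:(i + 1) * N_C, j * N_C:(j + 1) * N_C] = block
    return adj
```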

Again, the hidden weight matrix A is obtained by drawing the weights from a uniform distribution on [0, 1] and rescaling.

3.3 Deep clustered echo state networks

We propose combinations of deep ESNs and CESNs: deep Erdős-Rényi CESNs and deep Barabási-Albert-like CESNs. The basic structure of these networks is the same as for the deep ESN defined in Sect. 2.5: the hidden weight matrix is a block matrix consisting of \(L\times L\) blocks of size \(N_L\times N_L\), where \(N_L = N/L\). The main-diagonal blocks, which represent the hidden layers, are constructed according to the models described in Sects. 3.1 and 3.2, respectively. That is, each hidden layer has a clustered Erdős-Rényi or Barabási-Albert-like structure and is itself a block matrix. The connections between the hidden layers are constructed in the same way as for the classical deep ESN (Sect. 2.5).

Note that equally sized clusters and layers require the total number of nodes N to be a multiple of L, and \(N_L\) to be a multiple of C. For a more flexible choice of these parameters, we allowed the clusters and layers to differ in size by one node for our numerical tests presented in Sect. 5.

4 Materials and tasks

The performance of the ESN methods described above was evaluated using three different tasks: two based on artificial data and one based on real-world signals.

Gaussian noise reduction: We add Gaussian noise with mean \(\mu = 0\) and standard deviation \(\sigma = 1\) to a randomly generated wave signal. The wave signal consists of four sine signals with random phases, slow frequencies 0.005, 0.01, 0.02, 0.03, and fluctuating (between 0.5 and 1.5) amplitudes determined by randomly generated envelope functions. The objective of the task is to remove the noise and to reconstruct the original artificial wave signal. For training and testing, we use the same parameters for the noise, frequencies, and amplitudes, but independent realizations of the randomly generated signal.

Impulse noise reduction: We use the same kind of randomly generated wave signal as for task 1), but we add impulse noise instead of Gaussian noise. For a given number \(n_\mathrm {dp}\in {\mathbb {N}}\) and a given noise amplitude \(\delta > 0\), we randomly choose \(n_\mathrm {dp}\) data points and deviate each of them by a random value drawn from a uniform distribution on the interval \([-\delta , \delta ]\). For our numerical tests, we chose \(\delta = 1\) and \(n_\mathrm {dp}=K/20\), where K is the number of data points of the original discrete-time signal. Again, the objective of the task is to reconstruct the original wave signal. For training and testing, we use the same parameters for the noise, frequencies, and amplitudes, but independent realizations of the randomly generated signal.
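The following sketch illustrates how such artificial test signals could be generated; the piecewise-linear construction of the envelope functions and the interpretation of the frequencies as cycles per sample are assumptions, as the exact procedure is not fully specified above.

```python
import numpy as np

def make_wave_signal(K, rng):
    """Sum of four sines with random phases, frequencies 0.005-0.03 (cycles per
    sample) and slowly fluctuating amplitudes between 0.5 and 1.5."""
    t = np.arange(K)
    signal = np.zeros(K)
    for freq in (0.005, 0.01, 0.02, 0.03):
        phase = rng.uniform(0, 2 * np.pi)
        # piecewise-linear random envelope between 0.5 and 1.5 (assumption)
        knots = rng.uniform(0.5, 1.5, size=K // 1000 + 2)
        envelope = np.interp(t, np.linspace(0, K - 1, knots.size), knots)
        signal += envelope * np.sin(2 * np.pi * freq * t + phase)
    return signal

rng = np.random.default_rng()
K = 50_000
clean = make_wave_signal(K, rng)

# Task 1: additive Gaussian noise with mu = 0, sigma = 1.
noisy_gaussian = clean + rng.normal(0.0, 1.0, size=K)

# Task 2: impulse noise on K/20 randomly chosen samples with delta = 1.
noisy_impulse = clean.copy()
idx = rng.choice(K, size=K // 20, replace=False)
noisy_impulse[idx] += rng.uniform(-1.0, 1.0, size=idx.size)
```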

ECG signal noise reduction: For this task, we used the dataset from Lugovaya (2005) and Goldberger et al. (2000), which contains ECG records from 90 persons (multiple records per person). Each record consists of two different signals: the noisy, raw ECG signal and a filtered signal without noise. The signals have a length of 20 seconds and were recorded with a sampling rate of 500 Hz. The amplitude was measured in units of mV. The raw signals serve as input and the filtered signals serve as target signals.

To account for non-causal relationships between input and target data, i.e., the fact that s(t) may depend on \(u(t^{\prime })\) with \(t^{\prime }>t\), we introduce the delay parameter d. If \(d>0\), the input signals are shifted by d time steps relative to the target signal. Consequently, the ESN methods can take into account d input data points ahead of the target signal.
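A minimal sketch of this shift could look as follows; trimming both series to a common length is an implementation detail assumed here.

```python
def apply_delay(u, s, d):
    """Shift the input d steps ahead of the target: at target time t the reservoir
    has already received u(t), ..., u(t + d). Both series are trimmed to the
    same length."""
    if d == 0:
        return u, s
    return u[d:], s[:-d]
```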

For the ECG signal noise reduction task, we did not use the complete dataset. Instead, we randomly selected 50 records as the training dataset and 20 records as the test dataset. All records were taken from different people. For the training process, we concatenated the 50 training signals into one long sequence of data points. The test signals were processed in the same way. Each of the 50 training records contains 10,000 data points, i.e., the complete training signal contains 500,000 data points, of which 5000 are used for the initial wash-out phase and 495,000 contribute directly to the training. This amount of data is sufficient to train an ESN. It might be possible to increase the prediction performance for the ECG signal noise reduction task by increasing the amount of training data. However, the right choice of this amount is a trade-off between prediction performance and memory requirement (see Sect. 2.2).

Note that for all tasks we have one-dimensional input and target time series, i.e., \(M = 1\) and \(P = 1\). Thus, the input and output matrices \(W^\mathrm {in} \in {\mathbb {R}}^{N\times M}\) and \(W^\mathrm {out} \in {\mathbb {R}}^{P\times N}\) become vectors and the output bias vector \(c\in {\mathbb {R}}^P\) is actually a scalar. Variables other than \(W^\mathrm {in}\), \(W^\mathrm {out}\), and c are not affected by the input dimension M or the target dimension P. In particular, the hidden weight matrix A does not depend on M or P. Therefore, adapting our approach to multivariate time series is straightforward. Lu et al. (2017) contains examples of random ESNs applied to multivariate time series. Since the other ESN types described in Sects. 2 and 3 differ only in the choice of A, they can easily be applied to multivariate time series too.

5 Experimental results

We applied the CESNs and deep CESNs, described in Sect. 3, to the tasks from Sect. 4, and compared them to the classical ESN methods, described in Sect. 2. We used the hyperbolic tangent as the activation function f. The default parameters for our experiments are listed in Table 1. The results were evaluated by the Normalized Root Mean Squared Error (NRMSE) of the output signal \({\hat{s}}(t)\) in comparison to the target signal s(t):

$$\begin{aligned} \mathrm {NRMSE}({\hat{s}},s) = \sqrt{\frac{\sum _{t=1}^{t_\mathrm {test}} ({\hat{s}}(t) - s(t))^2}{\mathrm {Var}(s) t_\mathrm {test}}} \end{aligned}$$
(6)

where \(\mathrm {Var}(s)\) is the variance of the target signal and \(t_\mathrm {test}\) is the number of time steps of the test time series. All presented results are an average over 8 executions.
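Eq. (6) corresponds to the following short implementation (a direct transcription, with the function name as an illustrative choice).

```python
import numpy as np

def nrmse(s_hat, s):
    """Normalized root mean squared error of the prediction s_hat
    with respect to the target s, Eq. (6)."""
    return np.sqrt(np.mean((s_hat - s) ** 2) / np.var(s))
```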

Table 1 Default parameters for the numerical experiments

5.1 Gaussian noise reduction

We conducted numerical tests to compare the ESN methods to each other and to the causal Wiener filter. We applied the Wiener filter implementation from scipy.signal.wiener with optimized window size and noise power parameters. We found that the random ESN performs as well as the Wiener filter on Gaussian noise reduction and that all other ESN methods perform even better if their parameters are properly chosen.
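The Wiener-filter baseline can be reproduced along the following lines; the parameter grid is an illustrative assumption, since the optimal values depend on the task.

```python
import numpy as np
from scipy.signal import wiener

def best_wiener(noisy, clean, window_sizes=(3, 5, 9, 15, 25, 51),
                noise_powers=(0.25, 0.5, 1.0, 2.0)):
    """Apply scipy.signal.wiener over a small parameter grid and return the
    filtered signal with the lowest NRMSE against the clean reference."""
    def nrmse(a, b):
        return np.sqrt(np.mean((a - b) ** 2) / np.var(b))
    candidates = (wiener(noisy, mysize=w, noise=p)
                  for w in window_sizes for p in noise_powers)
    return min(candidates, key=lambda filtered: nrmse(filtered, clean))
```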

Figure 4 shows the performance of various ESN methods on the Gaussian noise reduction task for different leakage rate values \(\alpha\). For all methods, we need rather large values of \(\alpha\) to obtain good results. This can be explained by the fact that, for removing Gaussian noise from our artificially created time series, the local structure of the time series is more important than its states in the more distant past. The leakage rate controls how fast the system state changes: a larger leakage rate leads to a faster changing system state, which makes the ESN more sensitive to short-term fluctuations of the input time series. The Wiener filter achieved an NRMSE of approximately 0.23. For \(\alpha \ge 0.95\), the deep ESN and the CESNs yield better results than the Wiener filter. Moreover, we found that the deep ESN and the CESNs yield better results than the classical random ESN for all \(\alpha\).

Fig. 4

Gaussian noise reduction NRMSE depending on \(\alpha\) for the random ESN (red), the Erdős-Rényi CESN (green), the Barabási-Albert-like CESN (purple), and the deep ESN (blue). For comparison, the Wiener filter’s NRMSE is indicated by a dashed black line. The deep ESN and the CESNs perform better than the Wiener filter if the leakage rate is large enough. The random ESN’s performance is similar to that of the Wiener filter if \(\alpha = 1\) (Color figure online)

Figure 5a shows the NRMSE obtained from the deep ESN for different values of the delay parameter d and \(\alpha\). One can see that d must not be chosen too small. The results for \(\alpha\) correspond to the results shown in Fig. 4.

Fig. 5

NRMSE of the deep ESN for different values of \(\alpha\) and d for a the Gaussian noise reduction task, and b the impulse noise reduction task. For both tasks, optimal results are achieved for a relatively wide range of the delay parameter d. Only if d is very small, the performance drops significantly. The leakage rate \(\alpha\) should be larger than 0.8 to obtain optimal results for the Gaussian noise reduction tasks and larger than 0.6 for the impulse noise reduction task. The NRMSE minima are indicated by red squares (Color figure online)

Figure 6 compares the performance of the Erdős-Rényi and the Barabási-Albert-like CESN for different numbers of clusters C. Both CESN methods work similarly well if C is at least 15. The performance of the Barabási-Albert-like CESN drops strongly when we decrease C. For the Erdős-Rényi CESN we obtain a similar NRMSE over the whole range of C.

Fig. 6

Gaussian noise reduction performance of the CESNs depending on the number of clusters C. For the Erdős-Rényi CESN (green), the NRMSE remains almost unchanged if we vary C. The Barabási-Albert-like CESN (purple) has a high NRMSE for small C, but can compete with the Erdős-Rényi CESN when C is sufficiently large. The Wiener filter’s NRMSE is indicated by a dashed black line (Color figure online)

Moreover, we evaluated the performance of the deep ESN and deep CESNs for different numbers of hidden layers L. The results are presented in Fig. 7. All three methods achieve similar NRMSEs, mostly between 0.2 and 0.21, which is significantly better than the Wiener filter’s NRMSE. The choice of the parameter L does not have a strong influence on the performance, but one can notice a slight trend that the deep CESNs do better with a larger number of hidden layers.

Fig. 7

Gaussian noise reduction with the deep ESN (blue), the deep Erdős-Rényi CESN (orange) and the deep Barabási-Albert-like CESN (brown). The NRMSE tends to be slightly better for a larger number of layers L (Color figure online)

5.2 Impulse noise reduction

As for the reduction of Gaussian noise, we require sufficiently large values of \(\alpha\) for the impulse noise reduction task; see Fig. 8. With \(\alpha \ge 0.7\), we obtain near optimal results for the deep ESN and the CESNs. For the random ESN \(\alpha\) should be near 1 to obtain good results. Again, we can explain the need for a rather large leakage rate by the local nature of the task. For any choice of \(\alpha\), the deep ESN and the CESNs perform better than the random ESN.

A 2D plot showing the NRMSE of the deep ESN for the impulse noise reduction task for different values of \(\alpha\) and d is presented in Fig. 5b. In comparison to the Gaussian noise reduction task, a smaller delay parameter d is sufficient. Further, the NRMSE remains small for a wider range of \(\alpha\), which corresponds to the findings shown in Fig. 8.

Fig. 8

Impulse noise reduction NRMSE depending on the leakage rate \(\alpha\) for the random ESN (red), the deep ESN (blue), the Erdős-Rényi CESN (green), and the Barabási-Albert-like CESN (purple). The deep ESN and the CESN methods perform substantially better than the random ESN over the whole range of \(\alpha\). For all methods, \(\alpha\) should be chosen to be rather large (Color figure online)

Figure 9 shows how the CESNs perform depending on the number of clusters C. For the Erdős-Rényi CESN, C should be large, whereas for the Barabási-Albert-like CESN, fewer clusters are beneficial. Overall, we obtained significantly better results from the Barabási-Albert-like CESN.

Fig. 9

Impulse noise reduction performance of the CESNs depending on the number of clusters C. For the Erdős-Rényi CESN (green), the NRMSE decreases as the number of clusters increases. In contrast, for the Barabási-Albert-like CESN (purple), we obtained the best NRMSE for the rather small value \(C=3\). Over the whole range of C, the Barabási-Albert-like CESN performs as well as or better than the Erdős-Rényi CESN (Color figure online)

We tested the deep ESN and deep CESNs with different numbers of layers L; see Fig. 10. In all cases, the NRMSE is largely independent of L. Moreover, the results are not better than the results obtained from the CESNs (Fig. 9). These findings suggest that the classical deep ESN, the CESNs, and the deep CESNs are equally suitable for removing impulse noise.

Fig. 10

Impulse noise reduction with the deep ESN (blue), the deep Erdős-Rényi CESN (orange) and the deep Barabási-Albert-like CESN (brown). All methods perform similarly well and the NRMSE seems not to depend on L (Color figure online)

5.3 ECG signal noise reduction

The numerical studies presented above have been conducted using artificially generated data. To demonstrate that the investigated ESN methods are also useful for real world problems, we apply them for the reduction of ECG noise (Lugovaya 2005; Goldberger et al. 2000).

Figure 11 shows the NRMSE obtained from the random ESN, the deep ESN, and the CESNs depending on the leakage rate \(\alpha\). The deep ESN and the CESNs show a similar behavior. For these methods, \(\alpha = 0.2\) is optimal and yields an NRMSE of roughly 0.4. For the random ESN, larger values of \(\alpha\) are needed and the optimal NRMSE is about 0.45. In contrast to the Gaussian and impulse noise removal tasks considered above, the ECG signal noise reduction task requires a smaller leakage rate for optimal results. This indicates that considering a wider time window of the noisy input improves the quality of the prediction of the signal without noise. Although these results are not competitive with more advanced methods, they show that ESNs are in principle suitable for real-world tasks.

Fig. 11

ECG noise reduction NRMSE depending on the leakage rate \(\alpha\) for the random ESN (red), the deep ESN (blue), the Erdős-Rényi CESN (green), and the Barabási-Albert-like CESN (purple). The deep ESN and the CESNs achieve an optimal NRMSE for \(\alpha = 0.2\). The random ESN requires a larger value for \(\alpha\) for an optimal NRMSE and performs slightly worse than the other methods (Color figure online)

5.4 Computing time

ESNs are trained by processing a training input series once and using linear regression to fit the ESN’s output to a given target series. See Sect. 2.2 for a detailed explanation of the training process. For the tasks with artificial time series (Gaussian and impulse noise reduction), the length of the training time series is 50,000 and the CPU time needed for the training is between 4.3 and 4.8 seconds for all tested ESN models. For the ECG noise reduction tasks, the length of the training time series is 495,000 and the measured CPU time for the training is between 38 and 42 seconds for all models.

Regarding inference time, ESNs do not offer a speed benefit. For the artificial datasets (test series length 30,000), we measured inference times of 2.2 to 2.5 seconds for each ESN model; for the ECG noise reduction task (test series length 195,000), we measured 14.5 to 15.6 seconds.

In summary, ESNs are in many cases a suitable method when fast training is required.

The CPU time measurements were performed on a HP Z1 computer with Intel Core i7-10700 processor.

5.5 Comparison of echo state networks to alternative methods

We performed numerical tests with a selection of alternative denoising methods: a low pass filter implemented using fast Fourier transform (FFT filter), the wavelet filter (implemented in skimage.restoration.denoise_wavelet) (Kohler 2005), and the Wiener filter (scipy.signal.wiener). In Table 2 we compare the results of these classical methods with the results of the ESN methods.
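The low-pass FFT filter mentioned above can be sketched as follows; the cutoff frequency is a free parameter that would have to be tuned per task (the value in the comment is only an illustrative choice for the artificial wave signal).

```python
import numpy as np

def fft_lowpass(x, cutoff, fs=1.0):
    """Zero out all Fourier components above `cutoff` (in the same frequency
    units as fs) and transform back to the time domain."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    spectrum[freqs > cutoff] = 0.0
    return np.fft.irfft(spectrum, n=x.size)

# Example: the artificial wave signal contains frequencies up to 0.03 cycles per
# sample, so a cutoff slightly above that removes most of the noise.
# denoised = fft_lowpass(noisy_signal, cutoff=0.04)
```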

For the tasks based on artificial time series data (Gaussian and impulse noise reduction), the FFT filter has the best performance among all considered methods. This result is expected because the target time series for these tasks are obtained by discretizing a signal with a discrete frequency spectrum containing only four frequencies. For the Gaussian and the impulse noise reduction tasks, ESNs perform about as well as the Wiener filter and the wavelet filter.

It is noteworthy that the ESN methods perform significantly better than the classical filters when we consider real-world data (the ECG noise reduction task). These results show that the ESN methods are robust enough to deal with real-world problems, where the signals usually contain wide noise frequency bands that vary from one signal to another.

Table 2 Average NRMSE of the ESNs and classical methods for each task

6 Conclusion

We demonstrated that Echo State Networks (ESNs), deep ESNs, and the Erdős-Rényi and Barabási-Albert-like clustered Echo State Networks (CESNs), introduced in Oliveira et al. (2020), are adequate tools to denoise time series. Further, we introduced deep CESNs, which are CESNs with multiple hidden layers.

All considered ESN methods can compete with the Wiener filter and the wavelet filter in removing Gaussian noise and impulse noise from an artificial wave signal. Moreover, ESNs yield significantly better results than classical denoising methods for the ECG noise reduction task. In comparison to common machine learning methods for sequential data, such as LSTMs, ESNs are significantly simpler in terms of their architecture and training process, which makes an implementation from scratch easier. Taking the ESN method’s simplicity into account, we also obtained satisfactory results on the ECG noise reduction task. Overall, the deep ESN, the CESNs, and the deep CESNs have shown a significantly better performance than the classical ESN with simple Erdős-Rényi connectivity. This implies that ESNs can benefit from both clustered and layered structures of the internal network. For the deep ESN, this result was expected because prior studies have shown its advantage over classical ESNs (Gallicchio and Micheli 2016, 2020). Our contribution is the finding that CESNs perform as well as deep ESNs for certain tasks. Hence, CESNs may be considered an equivalent alternative to deep ESNs. Moreover, in some situations deep CESNs can provide an additional benefit over deep ESNs or CESNs. Although the optimal NRMSE of deep CESNs is similar to the results obtained by deep ESNs and CESNs for the tasks that we investigated, deep CESNs are able to achieve this NRMSE with a smaller number of total connections within and between the hidden layers (see Table 1). Using algorithms for sparse matrices, this leads to faster computation at inference time.