Prediction for Nonlinear Time Series by Improved Deep Echo State Network Based on Reservoir States Reconstruction

Abstract
Echo state networks (ESN) have received extensive attention because of their superior performance on nonlinear time series forecasting tasks. Most research on ESN focuses on extracting features from time series and improving the activation function to enhance prediction accuracy. In this study, we propose recombining the obtained features to further improve prediction accuracy. To this end, two state reconstruction methods are developed based on the existing features in the ESN. Additionally, ESN is very sensitive to the setting of hyperparameters, which makes its time series prediction performance unstable; we therefore adopt the Self-Normalizing Activation (SNA) function to reduce the model's sensitivity to hyperparameters and improve its stability. Our proposed model is compared with popular improved ESN models, namely DeepESN, GroupESN, and Grouped DeepESN, on four time series datasets. The results show that our model has the best prediction performance.


Introduction
Reliable prediction of nonlinear time series is a complex task [1]. A time series is the numerical sequence obtained by observing a measured variable over time. Through appropriate methods, knowledge can be extracted from the time series data and future values can be predicted. Forecasting future values has applications in many scenarios, such as telecommunications, electricity, and engineering [2]. For example, in the telecommunications field, user experience can be improved by predicting data traffic and rationally allocating network resources. In the electricity sector, forecasting electricity demand can help ensure a stable energy supply. In general, time series prediction can help relevant industries anticipate changes in important parameters in advance and ensure their stable operation.
Many methods have been proposed to predict nonlinear time series, particularly machine learning methods such as support vector machines [3], fuzzy systems [4], and neural networks. Among them, neural networks have the strongest nonlinear modeling capability and the best time series prediction ability [5]. Within machine learning, the recurrent neural network (RNN) has been widely used in time series prediction [6]. The RNN is the model closest to the biological brain and can effectively extract features from time series. An essential part of training an RNN is error back-propagation [7]. However, there are limitations: bifurcations can make training non-convergent, and when the time span grows or the network is deepened, the computational cost of the model increases significantly [8]. In addition, exploding and vanishing gradients may occur in RNNs [9]. Based on the RNN, the reservoir computing (RC) model was developed with higher efficiency and better modeling ability, effectively overcoming the bifurcation and gradient limitations of RNN-based prediction models. RC [10][11], a special computational framework derived from RNNs, has a complete theoretical basis and effectively avoids high computational complexity and exploding or vanishing gradients. The two main types of RC, Echo State Networks (ESN) and Liquid State Machines (LSM), are easy to train and have excellent prediction performance. They are widely applied to problems of system identification, signal processing, and time series prediction.
The subject of this study is ESN, which converts the input information into high-dimensional signals and stores them in its reservoirs. ESN has a strong nonlinear mapping ability to capture the dynamics of the inputs. Compared with other RNN methods, ESN only needs to adjust the output weight matrix with a linear regression algorithm, which is easy to implement and has low computational complexity. Thus, ESN is widely applied to complex data-based problems such as classification [29] and time series forecasting [14]. Although ESN offers low computational complexity and stable prediction performance, it has the weaknesses of being sensitive to hyperparameters [33] and of lacking further research on extracting more features from the existing reservoir states. Existing ways to improve ESN focus on changing the model's topology or adjusting the sampling method of the time series so as to extract features in different ways; however, research on making full use of the features ESN has already obtained is rare. To find the hidden temporal information in the historical sequence, we propose further exploiting the current features rather than mining new ones. This paper aims to improve ESN by designing two state reconstruction methods, which further extract features from the existing states and improve the model's prediction accuracy. In addition, we replace the tanh activation function in ESN with the SNA activation function to ensure that the model runs in a stable regime.
The remainder of this paper is organized as follows. Section 2 summarizes current work on time series prediction with ESN and methods to improve ESN. In Section 3, the operating mechanism of the original ESN is introduced. Section 4 elaborates on the improved ESN proposed in this study: the SNA activation function is introduced to improve the stability of ESN's predictions, and a state reconstruction module is designed to further extract temporal features from the existing states in ESN. In Section 5, the improved ESN model is compared with three other models with SNA activation functions on multiple datasets, and the results show that our proposed model has the best performance. Finally, conclusions and perspectives are given in Section 6.

Literature review
ESN offers fast training and robust nonlinear mapping, making it a popular tool for addressing time series prediction problems [14][15].
ESN was first applied to time series problems in 2004; compared with previous prediction methods, the accuracy of ESN was improved by a factor of 2400 [16]. Later, a leakage rate was added to the standard ESN as a parameter to optimize global learning and learn sequences with different temporal features; it is superior to the traditional ESN on slow dynamic systems, noisy time series, and time-warped dynamic pattern datasets [17]. However, the traditional single-layer randomly connected reservoir cannot preserve the features of long-term time series. Over time, the temporal features in the reservoir gradually fade, which limits ESN's prediction ability. To increase the long-term memory capacity of the reservoir, many researchers have contributed improved ESN models. According to work [5], research on improved ESN can be divided into improvements of the original ESN, Deep ESN, and combinations of ESN with other algorithms.
Research improving the original ESN includes dynamic adjustment of the internal weights and changes to the internal topology. The internal weights of the reservoir are generated by random initialization; however, this weight matrix may not adapt well to a given task. The work of [18] proposes the Principal Neuron Reinforcement (PNR) algorithm to determine the main neurons in the reservoir and strengthen the connections between other neurons and the main neurons. The authors of [19] introduce Anti-Oja (AO) learning to update the neuron weights in the reservoir, reducing the correlation between neurons, increasing the diversity of the internal state dynamics, and improving prediction performance. The original ESN has complex internal connections and high computational complexity; by simplifying the internal structure of the reservoir, better performance can be achieved at relatively low complexity. The Simple Cycle Reservoir (SCR), Delay Line Reservoir (DLR), and Delay Line Reservoir with feedback connections (DLRB) [20] simplify the construction of the reservoirs in ESN and achieve better performance. The Adjacent-feedback Loop Reservoir (ALR) [21] introduces a feedback mechanism between adjacent neurons on top of SCR and achieves satisfactory results.
The Deep ESN, which consists of multiple stacked reservoirs, mainly focuses on how to design the model structure [22]. With the development of deep learning, researchers found that higher accuracy can be achieved by stacking multiple layers of reservoirs. Compared with the traditional shallow ESN, this effectively improves the short-term memory capacity and the richness of the reservoir dynamics. Based on Deep ESN, [23] proposes a new hidden-layer structure with three gate states that address the information retention problem and improve deep memory ability. Compared with Deep ESN, Deepr-ESN [24] continuously projects the high-level state space of the reservoirs to a low-level feature space through encoders, thereby removing the high-frequency components of the representations, and the obtained results are significantly improved. Modular Deep ESN (Mod-Deep ESN) [25] proposes several models with heterogeneous topologies to capture multi-scale dynamic features of time series; its Criss-Cross and Wide Layered topologies have been found to perform better on some time series prediction tasks. Multi-reservoir ESN with sequence resampling (MRESN) [26] proposes three resampling methods based on Deep ESN and Group ESN to improve prediction performance on nonlinear time series. In [27], three ESN models with SNA are constructed, Deep ESN with SNA, Group ESN with SNA, and Grouped Deep ESN with SNA, ensuring stable prediction while obtaining better performance.
ESN can be combined not only with deep learning but also with machine learning algorithms, obtaining superior performance on specific tasks. In [28], a feedforward neural network replaces the output layer of ESN, and the backpropagation algorithm is used to optimize the feedforward network. The work of [29] proposes an enhanced echo-state restricted Boltzmann machine (eERBM), which extracts temporal features through the RBM and then inputs them into the ESN; in experiments, eERBM achieves better nonlinear approximation and robustness on traffic prediction tasks. ESN combined with the least absolute shrinkage and selection operator (LASSO-ESN) [30] eliminates the parameters with the minimum weights during prediction and improves prediction accuracy.

Most current ESN improvement studies focus on obtaining more reservoir states but lack research on extracting temporal features from the existing states. Inspired by research on ESN sampling [26], we propose an improved ESN that takes the states between different layers and synthesizes several new states through a state reconstruction module. Further, the effectiveness of our state reconstruction method is verified by comparing it with the three deep models proposed in [27].

Preliminaries of ESN
ESN is a particular recurrent neural network proposed by Jaeger [31]. A traditional ESN, shown in Figure 1, consists of an input layer, a reservoir with sparse connections, and a readout layer. W_in denotes the weight matrix of the input units, W the weight matrix inside the reservoir, and W_out the weight matrix of the output units. W_in and W are generated from random values and are fixed during training. W_out is obtained by linear regression, which dramatically reduces the amount of computation compared to a standard RNN. The input units, reservoir units, and output units at time t are denoted by u(t), x(t), and y(t), respectively (1). The update formula of the neurons in the reservoir is

x(t) = f(W_in u(t) + W x(t - 1)),

where f is the activation function; the traditional ESN uses tanh to activate the neurons. The prediction output is computed as

y(t) = f_out(W_out [u(t); x(t)]),

where f_out is the readout function and W_out is the trained output weight matrix. The symbol [;] stands for vertical concatenation.
To ensure the dynamic stability of the network during initialization, ESN needs to satisfy an asymptotic stability property called the Echo State Property (ESP) [32]. To guarantee the ESP, the spectral radius of the reservoir's weight matrix must be less than 1, which is achieved by the rescaling

W = α W_1 / λ_max,    (6)

where W_1 is the randomly generated reservoir weight matrix, λ_max is the spectral radius of W_1, and α is a scaling parameter for tuning. The time series forecasting process of the proposed ESN is as follows (see Figure 3):
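As a concrete illustration, the reservoir update and the spectral-radius rescaling described above can be sketched in NumPy (a minimal sketch; the sizes, scaling parameter, and input signal are illustrative choices, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

N_u, N_r = 1, 100      # input and reservoir sizes (illustrative)
alpha = 0.9            # scaling parameter, kept below 1 for the ESP

# W_in and W are drawn randomly once and stay fixed during training.
W_in = rng.uniform(-1.0, 1.0, size=(N_r, N_u))
W1 = rng.uniform(-0.5, 0.5, size=(N_r, N_r))

# Rescale W so that its spectral radius equals alpha (Equation 6).
lam_max = max(abs(np.linalg.eigvals(W1)))
W = alpha * W1 / lam_max

def update(x_prev, u):
    """Reservoir update x(t) = tanh(W_in u(t) + W x(t-1))."""
    return np.tanh(W_in @ u + W @ x_prev)

# Drive the reservoir with a short input sequence; only W_out,
# fit by linear regression on the collected states [u(t); x(t)],
# would be trained afterwards.
x = np.zeros(N_r)
for t in range(10):
    x = update(x, np.array([np.sin(0.1 * t)]))
```

Because only the readout is trained, fitting an ESN reduces to a single linear regression over the collected states.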

Self-Normalizing Activation Function on Hyper-Sphere

Generally, the traditional ESN model uses tanh as the reservoir activation function. This activation function is simple and practical, but it is very sensitive to hyperparameters. Only proper hyperparameter configurations can keep the ESN at the edge of criticality and maximize its performance; otherwise, the network becomes useless and exhibits chaotic behavior. Therefore, we use the SNA activation function (Equation 7) instead of tanh in the ESN. Theoretical analysis of the SNA activation function shows that the maximum Lyapunov exponent of the model is always zero. This means that no matter how the hyperparameters of the network are configured, the model always runs in a stable regime, eliminating the excessive dependence on hyperparameters [33]. Furthermore, SNA guarantees that the ESN exhibits nonlinear behavior and can handle tasks that require rich dynamics. SNA also provides memory behavior similar to that of linear networks, effectively balancing nonlinearity and memory capacity [33].
In Equation (7), the pre-activation vector α_k is computed from the previous state x_{k-1} and the current input u_k. In Equation (8), α_k is projected onto an (N - 1)-dimensional hypersphere of radius r, yielding the post-activation state x_k. The SNA activation function is not element-wise: it depends on all the state components, making it a global activation function.
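A minimal sketch of this projection follows (our reading of the description above; the pre-activation α_k is assumed to be the usual combination of W_in u_k and W x_{k-1}, and the exact form in [33] may differ in detail):

```python
import numpy as np

def sna(alpha_k, r):
    """Project the pre-activation vector onto a hypersphere of radius r.

    This is a global activation: every output component depends on the
    whole vector through its Euclidean norm, unlike element-wise tanh.
    """
    return r * alpha_k / np.linalg.norm(alpha_k)

x_k = sna(np.array([3.0, 4.0]), r=1.0)
# x_k lies on the unit circle: [0.6, 0.8]
```

Whatever the pre-activation, the resulting state always has norm r, which is what pins the network's dynamics to the hypersphere.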

States Reconstruction Module
We design the state reconstruction module based on Deep ESN, as shown in Figure 4. The state reconstruction module cross-stitches the original states of the input while keeping their original order. Unlike a general Deep ESN, we further process the reservoir states that have already been obtained, expecting to extract more temporal features. As shown in Figure 2(b), after the data are input into the multi-layer ESN model from the top, the states are fed into the state reconstruction module in pairs to extract more temporal features. We employ s(k) ∈ R^{N_r}, k = 1, 2, ..., to denote the inputs to the state reconstruction module, where N_r is the size of the reservoirs; a collection matrix of the inputs taken in pairs can then be defined. Based on the sampling and splicing of states, we put forward two state reconstruction methods: single adjacent elements concatenation and double adjacent elements concatenation. They are designed to mix the states of different layers with each other and improve the model's ability to extract features from time series.

Concatenation of single adjacent elements
In the single adjacent element method, the states of two adjacent layers are input into the state reconstruction module. The elements are divided into two groups according to whether their subscript is odd or even, keeping the relative positions of the reservoir states unchanged. The odd-subscript elements of the first state are combined with the even-subscript elements of the second state, and the even-subscript elements of the first state are combined with the odd-subscript elements of the second state.
with τ_1 = {t : t mod 2 = 0, t ∈ (0, N_r)}, where τ represents the indices of the original state elements and s_rec represents the new state after reconstruction. We construct s_rec1 by interval sampling and construct s_rec2 by exchanging the indices.
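One plausible reading of this parity-based mixing, sketched on two hypothetical four-element states (the paper states the rule only in words, so which parity is drawn from which state is an assumption):

```python
import numpy as np

def single_adjacent(s1, s2):
    """Mix two adjacent-layer states element-wise by index parity,
    keeping every element at its original position."""
    s_rec1, s_rec2 = s1.copy(), s2.copy()
    # Odd indices of the first reconstructed state come from s2,
    # and the second reconstructed state swaps the roles.
    s_rec1[1::2] = s2[1::2]
    s_rec2[1::2] = s1[1::2]
    return s_rec1, s_rec2

s1 = np.array([1, 2, 3, 4])
s2 = np.array([5, 6, 7, 8])
r1, r2 = single_adjacent(s1, s2)
# r1 = [1, 6, 3, 8], r2 = [5, 2, 7, 4]
```

Note that each element keeps its index in the new states, which is the "relative position unchanged" property claimed above.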

Concatenation of double adjacent elements
This method treats every two adjacent elements as a group: odd-numbered groups from the first state are concatenated with even-numbered groups from the second state, and vice versa.
G represents the index set formed by taking every two adjacent elements as a group. Through these two reconstruction methods, we can add several new states to the original ones while ensuring that the relative positions of the elements in the original states remain unchanged in the new states.
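A sketch of the grouped variant under the same caveat, assuming "two adjacent elements" form a group and alternating groups are exchanged between the two states (the grouping convention is our assumption, since the paper describes it only in words):

```python
import numpy as np

def double_adjacent(s1, s2):
    """Treat every two adjacent elements as a group and swap
    alternating groups between the two states."""
    s_rec1, s_rec2 = s1.copy(), s2.copy()
    for g in range(len(s1) // 2):
        if g % 2 == 1:              # odd-numbered groups are exchanged
            lo, hi = 2 * g, 2 * g + 2
            s_rec1[lo:hi] = s2[lo:hi]
            s_rec2[lo:hi] = s1[lo:hi]
    return s_rec1, s_rec2

s1 = np.array([1, 2, 3, 4, 5, 6, 7, 8])
s2 = np.array([11, 12, 13, 14, 15, 16, 17, 18])
r1, r2 = double_adjacent(s1, s2)
# r1 = [1, 2, 13, 14, 5, 6, 17, 18]
```

As with the single-element variant, every group stays at its original offset, so relative positions are preserved.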

States Concatenation Module
In the states concatenation module, all states corresponding to time t are vertically concatenated. The vertically arranged states include the SNA-activated states and the new states obtained from the state reconstruction module.
The definition of states concatenation is as follows.
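A minimal sketch of this vertical concatenation, assuming three layer states and two reconstructed states of size 100 each (the counts and sizes are illustrative, not the paper's configuration):

```python
import numpy as np

N_r = 100
rng = np.random.default_rng(1)

# SNA-activated states from each reservoir layer at time t ...
layer_states = [rng.standard_normal(N_r) for _ in range(3)]
# ... plus the new states produced by the reconstruction module.
reconstructed_states = [rng.standard_normal(N_r) for _ in range(2)]

# All states at time t are stacked vertically and fed to the readout.
z_t = np.concatenate(layer_states + reconstructed_states)
```

The readout weight matrix W_out then acts on this stacked vector, so adding reconstructed states enlarges the feature space seen by the linear regression.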

Dataset description
We use two benchmark prediction tasks, Multiple Superimposed Oscillators (MSO) and the Rossler system, to evaluate the proposed ESN. Further, Laser and Electromyograms (EMG), two real-world datasets, are also used to simulate time series prediction tasks (see Figure 5).
The Rossler system [36] consists of three nonlinear ordinary differential equations that define a continuous-time dynamical system.
The system shows chaotic behavior for a = 0.15, b = 0.2, c = 10. The time series of the Rossler system is generated with initial values (−1, 0, 3) and a step size of 0.01. In our experiments, we make a one-step-ahead prediction of the x(t) component of the Rossler system.
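The series generation described above can be sketched with a simple forward-Euler integration (the paper does not state which integrator it uses, so Euler is an assumption; a higher-order scheme such as Runge-Kutta would give a more accurate trajectory):

```python
import numpy as np

def rossler_series(n_steps, a=0.15, b=0.2, c=10.0, dt=0.01,
                   init=(-1.0, 0.0, 3.0)):
    """Generate the x(t) component of the Rossler system by Euler steps."""
    x, y, z = init
    xs = np.empty(n_steps)
    for i in range(n_steps):
        # Rossler equations: dx = -y - z, dy = x + a y, dz = b + z (x - c)
        dx = -y - z
        dy = x + a * y
        dz = b + z * (x - c)
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        xs[i] = x
    return xs

series = rossler_series(4000)
```

The resulting series can then be split into training, validation, and test segments as described in Table 1.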

Laser
The Santa Fe Laser dataset is a benchmark dataset in time series forecasting tasks [37]. It is a real-world dataset of periodic laser output power measured in the laboratory, with a length of 10092 points. It is used for one-step-ahead prediction in our simulations.

EMG
Electromyography is mainly used in clinical settings to assess the function of muscles and nerves. We use the EMG time series of a 44-year-old healthy male with no history of muscle disease, collected in work [38], for one-step-ahead prediction.
We use the same lengths for the training set, validation set, test set, and initial transient set on all four experimental datasets. The specific parameters are shown in Table 1. Each dataset has a length of 4000, of which the first 3000 points are used for training, the next 500 for validation, and the last 500 for testing. The initial transient length is 30. For a fair comparison, the same dataset partitioning is used in our improved ESN model and the three contrasting models.

Evaluation metrics
Two metrics are used to evaluate our model: the root-mean-square error (RMSE) in Equation (18) and the normalized root-mean-square error (NRMSE) in Equation (19). They have a strong theoretical correlation in statistical modeling and are the most commonly used measures in time series prediction tasks [39].
y(t) refers to the actual data observed at time t over a horizon of length N_T, ŷ(t) refers to the predicted value at time t, and ȳ represents the average value of the real data.
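The two metrics can be sketched as follows, assuming NRMSE normalizes the RMSE by the root-mean-square deviation of the real data from its mean (a common convention; the paper's Equation (19) may use a different normalizer, such as the data range):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error, Equation (18)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def nrmse(y_true, y_pred):
    """RMSE divided by the RMS deviation of the data from its mean."""
    return rmse(y_true, y_pred) / np.sqrt(
        np.mean((y_true - y_true.mean()) ** 2))

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.5, 2.5, 3.5])
# rmse(y, y_hat) = 0.5
```

Unlike RMSE, NRMSE is scale-free, which makes scores comparable across datasets with different amplitudes.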

Experiments settings
The parameter settings of the simulated ESNs are shown in Table 2: the input scaling θ, the spectral radius ρ, the density of the internal weights η, the leaking rate α, and the regularization factor β. The reservoir size N_R is varied in the range [100, 1000] with an interval of 100. The activation radius r takes values in {10, 50, 100, 200, ..., 800}.
After adopting the SNA activation function, the sensitivity to the input scaling θ and the spectral radius ρ decreases, so we can fix these two values to improve search efficiency [27]. However, the SNA function introduces another hyperparameter, the activation radius r. Thus, for each dataset, we test the reservoir size N_R ∈ {100, 200, ..., 1000} and the SNA activation radius r ∈ {10, 50, 100, 200, ..., 800}.
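The grid search over these two hyperparameters can be sketched as follows; the `evaluate` function here is a hypothetical placeholder for a full train-and-validate run, not part of the paper:

```python
import itertools

reservoir_sizes = range(100, 1001, 100)                  # N_R grid
radii = [10, 50, 100, 200, 300, 400, 500, 600, 700, 800]  # r grid

def evaluate(n_r, r):
    """Placeholder validation error; a real run would train an ESN
    with reservoir size n_r and SNA radius r and return its NRMSE."""
    return (n_r - 600) ** 2 + (r - 100) ** 2

# Pick the configuration with the lowest validation error.
best = min(itertools.product(reservoir_sizes, radii),
           key=lambda cfg: evaluate(*cfg))
# best = (600, 100)
```

Fixing θ and ρ keeps the grid two-dimensional (100 combinations here), which is what makes the exhaustive search tractable.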
The improved ESN is compared with the three models with SNA activation functions proposed in work [34]: Deep ESN with SNA, Grouped Deep ESN with SNA, and Grouped ESN with SNA. We use a grid search strategy to test each model under the same conditions and compare their optimal values.

Results analysis
In this section, we present and analyze the performance of our proposed ESN and the comparison ESN models on the aforementioned four datasets. The optimal parameters of our proposed model on each dataset are shown in Table 3, Table 4, Table 5, and Table 6.
• MSO
Table 3 lists the optimal performance of our improved model and the three comparative models under the optimal configuration within the given range. On this dataset, the average RMSE (18) and NRMSE (19) obtained by Grouped ESN are the best, and the improved ESN is better than the other two models. Figure 6(b) shows the grid search result of our improved model, with N_r = 600 and r = 80. The predicted time series is shown in Figure 6(a).

• Laser
Fig. 6 The prediction performance and grid search result on MSO12
Fig. 7 The prediction performance and grid search result on Rossler

The results on the Laser dataset are shown in Table 5. The improved ESN is still better than the three comparison models in RMSE and NRMSE, with model parameters N_R = 600 and r = 100. The prediction and grid search results are shown in Figure 8.

• Electromyograms
The experimental results on the EMG dataset are shown in Table 6. The standard deviation of the RMSE (18) of Deep ESN is slightly smaller than that of our proposed ESN; the improved ESN is better than the comparison models on the remaining metrics. The optimal parameter configuration is N_R = 100, r = 800. The prediction results are shown in Figure 9.
Fig. 8 The prediction performance and grid search result on Laser
Fig. 9 The prediction performance and grid search result on EMG

Compared with the other three ESNs with the SNA activation function, our improved ESN model has advantages on multiple indicators across the four datasets. On all datasets, the minimum RMSE (18) and NRMSE (19) obtained by our model are the best. The likely reason is that the proposed state reconstruction module can further extract features related to the time series by taking the states of the multi-layer reservoirs as input and re-splicing them in sequence. Compared with the original model, the state reconstruction module increases the richness of the original states. Although the average values of our NRMSE and RMSE on the MSO dataset are slightly worse than those of some compared models, the performance of our model is improved from a global view.

Conclusions
This paper proposes an improved ESN model by designing a state reconstruction module combined with an SNA activation function. First, the state reconstruction module recombines the existing states of the reservoirs to extract more features. Second, through the SNA activation function, the model runs at the critical edge, ensuring stability. At the same time, the SNA activation function reduces the number of hyperparameters that need to be tuned and thus the complexity of the grid search. The comparative experiments show that the state reconstruction strategy adopted in our model can effectively extract more temporal features and improve the model's prediction accuracy. A possible reason is that the state reconstruction module fuses the states of different reservoir layers and adds more features related to the current time series while leaving the original state matrix unchanged; after adding these features, the memory capacity is further enriched. In addition, our method is flexible: it can be combined with other Deep ESN variants without affecting the training method of the original model.
A limitation of this study is that each layer's state is reconstructed in the same way. Each layer of the reservoir may contain different hidden features of the time series, and our method neglects a dedicated design for the state of each layer. Future work will focus on exploring the differences between the features of the reservoirs at different layers.

Fig. 1
Fig. 1 Traditional ESN

This paper focuses on ESN's problems of being sensitive to hyperparameters and of not fully exploiting the temporal features contained in the states obtained during training. We propose an improved ESN-based framework to deal with these two issues. The proposed ESN is developed based on Deep ESN; Figure 2 shows the schematic diagram of Deep ESN and the proposed improved ESN. First, we replace the traditional hyperbolic tangent activation function with the SNA activation function to reduce the sensitivity to hyperparameters. Then, we design a state reconstruction module based on Deep ESN. The proposed ESN model aims to extract further time series features from the states obtained between different layers of the ESN.

Fig. 2
Fig. 2 Comparison of Deep ESN and our proposed ESN

Fig. 3
Fig. 3 Flowchart of the proposed ESN's training process

Fig. 4
Fig. 4 An example of state reconstruction of two states in the proposed ESN

Table 1
Data partition for time series datasets

Table 2
The parameters settings of experiments

Table 3
Hyperparameter setting in MSO task

Table 4
Hyperparameter setting in Rossler task

Table 5
Hyperparameter setting in Laser task

Table 6
Hyperparameter setting in EMG task