2.1 Predictors’ Identification
The forecasting of a time series is a typical example of a supervised learning task: the training is performed by optimizing a suitable metric (the loss function), which assesses the distance between N target samples, \(\mathbf{y} = \big[ y(1), y(2), \dots, y(N)\big]\), and the corresponding predictions, forecast i steps ahead, \(\hat{\mathbf{y}}^{(i)} = \big[ \hat{y}(1)^{(i)}, \hat{y}(2)^{(i)}, \dots, \hat{y}(N)^{(i)} \big]\). A widely used loss function is the mean squared error (MSE), which, for the \(i^{\text{th}}\) step ahead, is computed as:
$$\begin{aligned} \text{MSE}\big(\mathbf{y}, \hat{\mathbf{y}}^{(i)}\big) = \frac{1}{N} \sum_{t=1}^{N} \Big( y(t) - \hat{y}(t)^{(i)} \Big)^2. \end{aligned}$$
(1)
One-step predictors are optimized by minimizing \(\text{MSE}\big(\mathbf{y}, \hat{\mathbf{y}}^{(1)}\big)\). Conversely, a multi-step predictor can be trained directly on the loss function computed over the entire h-step horizon:
$$\begin{aligned} \frac{1}{h} \sum_{i=1}^{h} \text{MSE}\big(\mathbf{y}, \hat{\mathbf{y}}^{(i)}\big). \end{aligned}$$
(2)
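To make Eqs. (1) and (2) concrete, the following minimal NumPy sketch computes the per-step MSE and the horizon-averaged loss; the array names, shapes, and toy data are illustrative assumptions, not taken from the chapter.

```python
import numpy as np

def mse(y, y_hat):
    """Eq. (1): mean squared error between targets and i-step predictions."""
    return np.mean((y - y_hat) ** 2)

def horizon_loss(y, y_hat_steps):
    """Eq. (2): MSE averaged over the h steps of the forecasting horizon.

    y           -- target samples, shape (N,)
    y_hat_steps -- predictions, shape (h, N); row i-1 holds the
                   i-step-ahead forecasts aligned with y
    """
    return np.mean([mse(y, y_hat_i) for y_hat_i in y_hat_steps])

# Toy usage with synthetic data and a horizon of h = 4 steps.
rng = np.random.default_rng(0)
y = rng.standard_normal(100)
y_hat = y[None, :] + 0.1 * rng.standard_normal((4, 100))
print(horizon_loss(y, y_hat))
```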
Following the classical neural network learning framework, training is performed for each combination of the hyper-parameters (mini-batch size, learning rate, and decay factor), and the configuration performing best on the validation dataset is selected. The hyper-parameters specifying the neural structure (i.e., the number of hidden layers and of neurons per layer) are fixed: 3 hidden layers of 10 neurons each. Once the predictor has been selected, it is evaluated on the test dataset (never used in the previous phases) in order to obtain a fair performance assessment and to avoid overfitting both the training and the validation datasets. A sketch of this selection protocol is given below.
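The following scikit-learn sketch illustrates the protocol under stated assumptions: the library, the hyper-parameter grids, the toy data, and the split sizes are ours, and the decay factor is omitted since its implementation depends on the training setup; only the fixed 3-layer, 10-neuron structure and the train/validation/test roles come from the text.

```python
import numpy as np
from itertools import product
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression data standing in for the lagged time series inputs.
rng = np.random.default_rng(0)
X = rng.standard_normal((600, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(600)
X_tr, y_tr = X[:400], y[:400]        # training set
X_va, y_va = X[400:500], y[400:500]  # validation set
X_te, y_te = X[500:], y[500:]        # test set (touched only once)

best_score, best_model = np.inf, None
for bs, lr in product([32, 64], [1e-2, 1e-3]):
    # Fixed structure: 3 hidden layers of 10 neurons each.
    model = MLPRegressor(hidden_layer_sizes=(10, 10, 10),
                         batch_size=bs, learning_rate_init=lr,
                         max_iter=500, random_state=0).fit(X_tr, y_tr)
    score = mean_squared_error(y_va, model.predict(X_va))
    if score < best_score:
        best_score, best_model = score, model

# Fair assessment on data never used for training or selection.
print(mean_squared_error(y_te, best_model.predict(X_te)))
```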
A well-known issue with the MSE is that its value alone gives no general insight into the quality of the forecast. To overcome this inconvenience, the \(R^2\)-score is usually adopted in the test phase:
$$\begin{aligned} R^2\big(\mathbf{y}, \hat{\mathbf{y}}^{(i)}\big) = 1 - \frac{\text{MSE}\big(\mathbf{y}, \hat{\mathbf{y}}^{(i)}\big)}{\text{MSE}\big(\mathbf{y}, \bar{y}\big)}. \end{aligned}$$
(3)
Note that \(\bar{y}\) is the average of the data, and thus the denominator corresponds to the variance, var(y). For this reason, the \(R^2\)-score can be seen as a normalized version of the MSE. It is equal to 1 in the case of a perfect forecast, while a value of 0 indicates a performance equivalent to that of a trivial model always predicting the mean value of the data. An \(R^2\)-score of \(-1\) reveals that the target and predicted sequences are two trajectories with the same statistical properties (they move within the same chaotic attractor) but are uncorrelated [3, 4]. In other words, the predictor is able to reproduce the actual attractor, but the timing of the forecast is completely wrong (in this case, we would say that it reproduces the long-term behavior, or climate, of the attractor [5, 6]).
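A direct transcription of Eq. (3), with toy checks of the three reference values discussed above (1, 0, and approximately \(-1\)); the data are synthetic, and the permutation trick is merely one way to obtain an uncorrelated sequence with the same statistics.

```python
import numpy as np

def r2_score(y, y_hat):
    """Eq. (3): R^2-score as a normalized MSE."""
    mse = np.mean((y - y_hat) ** 2)
    var = np.mean((y - y.mean()) ** 2)  # MSE against the mean = var(y)
    return 1.0 - mse / var

rng = np.random.default_rng(0)
y = rng.standard_normal(1000)
print(r2_score(y, y))                          # perfect forecast -> 1
print(r2_score(y, np.full_like(y, y.mean())))  # mean predictor   -> 0
print(r2_score(y, rng.permutation(y)))         # same statistics,
                                               # uncorrelated     -> ~ -1
```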
2.2 Feed-Forward Neural Networks
The easiest approach to adapt a static FF architecture to time series forecasting consists in identifying a one-step predictor and using it recursively (FF-recursive predictor). Its learning problem requires minimizing the MSE in Eq. (1) with \(i=1\) (Fig. 1, left), meaning that only observed values are used as input.
Once the one-step predictor has been trained, it can be used in inference mode to forecast a multi-step horizon (of h steps) by feeding the predicted output back as input for the following step (Fig. 1, right), as sketched below.
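A minimal sketch of this recursive scheme; `model`, the window length, and the toy averaging predictor are illustrative assumptions, standing in for any trained one-step predictor that maps the last p values to the next one.

```python
import numpy as np

def recursive_forecast(model, history, h):
    """Forecast h steps ahead by feeding predictions back as inputs.

    history -- the last p observed values, shape (p,)
    """
    window = list(history)
    forecasts = []
    for _ in range(h):
        y_next = model(np.asarray(window))  # one-step prediction
        forecasts.append(y_next)
        window = window[1:] + [y_next]      # slide the window: drop the
                                            # oldest value, append y_next
    return np.asarray(forecasts)

# Toy usage: a "model" that simply averages its input window.
print(recursive_forecast(lambda w: w.mean(), np.array([1.0, 2.0, 3.0]), h=4))
```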
Alternative approaches based on FF networks are the so-called FF-multi-output and FF-multi-model. In the multi-output approach, the network has h neurons in the output layer, each performing the forecast at a given time step of the horizon. The multi-model approach requires identifying h predictors, each specifically trained for a given time step. In this chapter, we limit the analysis to the FF-recursive predictor; a broad exploration including the other FF-based approaches can be found in the literature [7, 8].
2.3 Recurrent Neural Networks
Recurrent architectures are naturally suited for sequential tasks such as time series forecasting, as they explicitly take temporal dynamics into account. In this work, we make use of RNNs formed by long short-term memory (LSTM) cells, which have been successfully employed in a wide range of sequential tasks, from natural language processing to policy identification.
When dealing with multi-step forecasting, RNNs are usually trained with a technique known as teacher forcing (TF), which consists in always feeding the actual values as input, instead of the predictions computed at the previous steps (Fig. 2, left).
TF has proved necessary in almost all natural language processing tasks, since it guides the optimization, ensuring convergence to a suitable parameterization. For this reason, it has become the standard technique implemented in deep learning libraries. TF simplifies the optimization at the cost of making the training and inference modes differ: in inference, at time t, the future values \(y(t+1), y(t+2), \dots\) are unknown and must be replaced with their predictions \(\hat{y}(t+1), \hat{y}(t+2), \dots\) (Fig. 2, right).
This discrepancy between the training and inference phases is known in the machine learning literature as exposure bias [9]. The main issue with TF is that, even if we optimize the average MSE over the h-step horizon in Eq. (2), we are not really performing multi-step forecasting (it is rather a sequence of h single-step forecasts, similarly to what happens in FF-recursive). In other words, the predictor is not specifically trained on the multi-step task.
We thus propose to train the LSTM without TF (LSTM-no-TF), so that the predictor's behavior in training and inference coincides (Fig. 2, right). The sketch below contrasts the two training modes.
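A schematic PyTorch sketch contrasting the two training modes; the single-cell architecture, sizes, and toy data are illustrative assumptions rather than the chapter's exact setup. The only difference between LSTM-TF and LSTM-no-TF is which value is fed as the next input.

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """Single-cell LSTM predictor; sizes are illustrative."""
    def __init__(self, hidden_size=10):
        super().__init__()
        self.hidden_size = hidden_size
        self.cell = nn.LSTMCell(input_size=1, hidden_size=hidden_size)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, y, teacher_forcing):
        # y: (batch, T). Predict y(t+1) from the input at step t.
        batch, T = y.shape
        h = torch.zeros(batch, self.hidden_size)
        c = torch.zeros(batch, self.hidden_size)
        x = y[:, 0:1]                 # the first input is always observed
        preds = []
        for t in range(T - 1):
            h, c = self.cell(x, (h, c))
            y_hat = self.head(h)      # prediction of y(t+1)
            preds.append(y_hat)
            # TF feeds the actual value; no-TF feeds the prediction back,
            # so training matches the recursive inference mode.
            x = y[:, t + 1:t + 2] if teacher_forcing else y_hat
        return torch.cat(preds, dim=1)

# One training step in each mode on toy data.
torch.manual_seed(0)
y = torch.sin(torch.linspace(0, 6.28, 32)).repeat(4, 1)  # (batch=4, T=32)
model = LSTMForecaster()
loss_fn = nn.MSELoss()
for tf_mode in (True, False):        # LSTM-TF vs LSTM-no-TF
    model.zero_grad()
    loss = loss_fn(model(y, teacher_forcing=tf_mode), y[:, 1:])
    loss.backward()                  # without TF, gradients also flow
                                     # through the fed-back predictions
    print("TF" if tf_mode else "no-TF", float(loss))
```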