1 Introduction

Time series analysis and forecasting techniques process data points that are ordered in a discrete-time sequence. While time series analysis focuses on extracting meaningful descriptive statistics of the data, time series forecasting uses a model for predicting the next value(s) of the series based on the previous ones. Traditionally, time series forecasting has been tackled with statistical techniques based on auto-regression or the moving average, such as exponential smoothing (ETS) Hyndman et al. [28] and the auto-regressive integrated moving average (ARIMA) Box et al. [8]. These methods are relatively simple and perform well in univariate scenarios and with relatively small data. However, they are more limited in predicting a long time horizon or dealing with multivariate scenarios.

The ubiquity of data generation in today’s society brings the opportunity to exploit recurrent neural network (RNN) architectures for time series forecasting. RNN-based models have reported promising results in multivariate forecasting of long series Hewamalage et al. [25]. In contrast to feed-forward neural networks, RNN-based models capture long-term dependencies in the time sequence through their feedback loops. The majority of works published in this field are based on vanilla RNNs, Long Short-Term Memory (LSTM) Hochreiter and Schmidhuber [26] or Gated Recurrent Unit (GRU) Cho et al. [13] architectures. In the last M4 forecasting competition Makridakis et al. [35], the winners were models combining RNNs with traditional forecasting techniques, such as exponential smoothing Smyl [50]. However, the use of RNN architectures is not entirely embraced by the forecasting community due to their lack of transparency, need for very specific configurations, and high computational cost [25, 35, 36].

In this regard, the development of RNN architectures for time series forecasting can bring serious financial and environmental costs. As anecdotal evidence, one of the participants in the last M4 forecasting competition reported getting a huge electricity bill from 5 computers running for 4.5 months Makridakis et al. [35]. More formally, Strubell et al. [51] presented an eye-opening study characterizing the energy required to train recent deep learning models, including their estimated carbon footprint. An example of a training-intensive task is the tuning of the BERT model Devlin et al. [16] for natural language processing tasks, whose \(CO_2\) emissions are comparable to those of a trans-American flight. One of the conclusions of the study in Strubell et al. [51] is that researchers should focus on developing more efficient techniques and report measures (such as the training time) next to the model’s accuracy.

A second concern related to the use of deep machine learning models is their lack of interpretability. For most high-stakes decision problems, having an accurate model is insufficient; some degree of interpretability is also needed. There exist several model-agnostic post-hoc methods for computing explanations based on the predictions of a black-box model. For example, feature attribution methods such as SHAP Lundberg and Lee [34] approximate the Shapley values that explain the role of the features in the prediction of a particular instance. Other techniques such as LIME Ribeiro et al. [47] leverage the intrinsic transparency of other machine learning models (e.g., linear regression) to approximate the decisions locally. In contrast, intrinsically interpretable methods provide explanations from their structure that can be mapped to the problem domain Grau et al. [21]. In Rudin [48], the author argues that these explanations are more reliable and faithful to what the model computes. However, developing environmentally friendly RNN-based forecasting models able to provide a certain degree of transparency is a significant challenge.

In this paper, we propose the long short-term cognitive networks (LSTCNs) to cope with the efficient and transparent forecasting of long univariate and multivariate time series. LSTCNs involve a sequential collection of short-term cognitive network (STCN) blocks Nápoles et al. [39], each processing a specific time patch in the sequence. The STCN model allows for transparent reasoning since both weights and neurons map to specific features in the problem domain. Besides, STCNs allow for hybrid reasoning since the experts can inject knowledge into the network using prior knowledge matrices. As a second contribution, we propose a deterministic learning algorithm to compute the tunable parameters of each STCN block. This highly efficient algorithm replaces the non-synaptic learning method presented in Nápoles et al. [39]. As a final contribution, we present a feature influence score as a proxy to explain the reasoning process of our neural system. The numerical simulations using three case studies show that our model produces high-quality predictions with little computational effort. In short, we have found that our model can be remarkably faster than state-of-the-art recurrent neural networks.

The rest of this paper is organized as follows. Section 2 reviews the literature on time series forecasting with recurrent neural networks, while Sect. 3 presents the theory behind the STCN block. Section 4 is devoted to LSTCN’s architecture, learning, and interpretability. Section 5 evaluates the performance of our model using three case studies involving long univariate and multivariate time series. Section 6 concludes the paper and provides future research directions.

2 Related work on time series forecasting

In the last decade, we observed a constantly growing share of artificial neural network-based approaches for time series forecasting. Prominent studies, including Bhaskar and Singh [7] and Ticknor [53], use traditional feed-forward neural architectures trained with the backpropagation algorithm for time series prediction. However, in more recent papers, we see a shift toward other neural models. In particular, RNNs have gained momentum Kong et al. [29].

Feed-forward neural networks consist of layers of neurons that are one-way connected, from the input to the output layer, without cycles. In contrast, RNNs allow connections to previous layers and self-connections, resulting in cycles. In the special case of a fully connected recurrent neural network Menguc and Acir [38], the outputs of all neurons are also the inputs of all neurons. The literature is rich with various RNN architectures applied to time series forecasting. Yet, we can summarize the various RNNs by stating that they allow having self-connected hidden layers Chen et al. [9]. Compared with feed-forward neural networks, RNNs unfold their hidden layers over time, which makes them able to process sequential data. This explains their vast popularity in the analysis of temporal data, such as time series or natural language Cortez et al. [14].

A popular RNN architecture is called long short-term memory (LSTM). It was designed by Hochreiter and Schmidhuber [26] to overcome the problems arising when training vanilla RNN models. Traditional RNN training takes a very long time, mostly because of insufficient, decaying error when doing the error backpropagation Guo et al. [23]. The LSTM architecture uses a special type of neurons called memory cells that mimic three kinds of gate operations Hewamalage et al. [25]. These are referred to as the multiplicative input, output, and forget gates. These gates filter out unrelated and perturbed inputs Guo et al. [20]. Standard LSTM models are constructed in a way that past observations influence only the future ones, but there exists a variant called bidirectional LSTM that lifts this restriction Cui et al. [15]. Numerous studies show that both unidirectional and bidirectional LSTM networks outperform traditional RNNs due to their ability to capture long-term dependencies more precisely Tang et al. [52].

The Gated Recurrent Unit (GRU) is another RNN model Cho et al. [13]. In comparison with LSTM, GRU executes simplified gate operations by using only two types of memory cells: input merged with output and forget cell Wang et al. [57], called here update and reset gate, respectively Becerra-Rico et al. [6]. As in the case of LSTM, GRU training is less sensitive to the vanishing/exploding gradient problem that is encountered in traditional RNNs Ding et al. [17].

The inclusion of recurrent/delayed connections boosted the capability of neural models to predict time series accurately, while further improvements of their architecture (like LSTM or GRU model) made training dependable. However, it shall be mentioned that the use of the traditional error backpropagation is not the only option to learn network weights from historical data. Alternatively, we can use meta-heuristic approaches to train a model. There exists a range of interesting studies, where the authors used Genetic Algorithm Sadeghi-Niaraki et al. [49] or Ant Colony Optimization ElSaid et al. [19]. The study in Abdulkarim and Engelbrecht [1] concluded that, for the tested time series, dynamic Particle Swarm Optimization obtained a similar forecasting error compared with a feed-forward neural architecture and a recurrent one.

It shall be noted that the application of a modern neural architecture does not relieve a model designer from introducing required data staging techniques. This is why we find a range of domain-dependent studies that link various RNN architectures with supplemental processing options. For example, Liu and Shen [33] used a wavelet transform alongside a GRU architecture, while Nikolaev et al. [43] included a regime-switching step, and Cheng et al. [11] employed wavelet-based de-noising and adaptive neuro-fuzzy inference system.

We should mention recent studies on fusing RNN architectures with Convolutional Neural Networks (CNNs). The latter model has attracted much attention due to its superior efficiency in pattern classification. We find a range of studies [37, 59], where a CNN is merged with an RNN in a deep neural model that aims at time series forecasting. The role of a CNN is to extract features that are used to train an RNN forecasting model Li et al. [32]. Attention mechanisms have also been successfully merged with RNNs, as presented by Zhang et al. [61].

From a high-level perspective on time series forecasting with RNNs, we can also distinguish architectures that read in an entire time series and produce an internal representation of the series, i.e., a network plays the role of an encoder Laubscher [31]. A decoder network then needs to be used to employ this internal representation to produce forecasts Bappy et al. [5]. The described scheme is called an encoder–decoder network and was applied, for example, by Habler and Shabtai [24] together with LSTM, by Chen et al. [10] with convolutional LSTM, and by Yang et al. [60] with GRU.

We shall also mention the forecasting models based on Fuzzy Cognitive Maps (FCMs) Kosko [30]. Such networks are knowledge-oriented architectures with processing capabilities that mimic the ones of RNNs. The most attractive feature of these models is network interpretability. There are numerous papers, including the works of [44, 45, 54] or [58], where FCMs are applied to process temporal data. However, recent studies show that even better forecasting capabilities can be achieved with STCNs Nápoles et al. [39] or long-term cognitive networks Nápoles et al. [40]. As far as we know, these FCM generalizations have not yet been used for time series forecasting. This paper extends the research on the STCN model, which will be used as the main building block of our proposal.

3 Short-term cognitive networks

The STCN model was introduced in Nápoles et al. [39] to cope with short-term WHAT-IF simulation problems where problem variables are mapped to neural concepts. In these problems, the goal is to compute the immediate effect of input variables on output ones given long-term prior knowledge. Remark that the model in Nápoles et al. [39] was trained using a gradient-based non-synaptic learning approach devoted to adjusting a set of parametric transfer functions. In this section, we redefine the STCN model such that it can be trained in a synaptic fashion.

The STCN block involves four matrices to perform reasoning: \(W_1^{(t)}\), \(B_1^{(t)}\), \(W_2^{(t)}\), and \(B_2^{(t)}\). The first two matrices denote the prior knowledge coming from a previous learning process and can be modified by the experts to include new pieces of knowledge that have not yet been recorded in the historical data (e.g., an expected increase in the Bitcoin value as Tesla decides to accept such cryptocurrency as a valid payment method). These prior knowledge matrices allow for hybrid reasoning, which is an appealing feature of the STCN model. The third and fourth matrices contain learnable weights that adapt the input \(X^{(t)}\) and the prior knowledge to the expected output \(Y^{(t)}\) in the current step. The matrices \(B_1^{(t)}\) and \(B_2^{(t)}\) represent the bias weights.

Figure 1 shows how the different components interact with each other in an STCN block. It is important to highlight that this model lacks hidden neurons, so each inner block (abstract layer) has exactly M neurons, with M being the number of neural concepts in the model. This means that we have a neural system in which each component has a well-defined meaning. For example, the intermediate state \(H^{(t)}\) represents the outcome that the network would have produced given \(X^{(t)}\) if the network would not have been adjusted to the expected output \(Y^{(t)}\). Similarly, the bias weights denote the external information that cannot be inferred from the given inputs.

Fig. 1 The STCN block involves two components: the prior knowledge matrices \(W^{(t)}_1\) and \(B^{(t)}_1\), and the learnable matrices \(W^{(t)}_2\) and \(B^{(t)}_2\). The prior knowledge matrices are a result of a previous learning process and can be modified by domain experts if deemed opportune

Equations 1 and 2 formalize the short-term reasoning process of this model in the t-th iteration,

$$\begin{aligned} \hat{Y}^{(t)}=f\left( H^{(t)} W_2^{(t)} \oplus B_2^{(t)}\right) \end{aligned}$$
(1)

and

$$\begin{aligned} H^{(t)}=f\left( X^{(t)} W_1^{(t)} \oplus B_1^{(t)} \right) \end{aligned}$$
(2)

where \(X^{(t)}\) and \(\hat{Y}^{(t)}\) are \(K \times M\) matrices encoding the input and the forecast in the current iteration, respectively, with K being the number of observations and M the number of neurons. \(B_1^{(t)}\) and \(B_2^{(t)}\) are \({1\times M}\) matrices representing the bias weights. \(H^{(t)}\) is a \(K \times M\) matrix, while \(W_1^{(t)}\) and \(W_2^{(t)}\) are \(M \times M\) matrices. In these equations, the \(\oplus\) operator performs a matrix-vector addition by adding a vector to each row of a given matrix, provided that both the matrix and the vector have the same number of columns. Finally, \(f(\cdot )\) stands for the nonlinear transfer function, typically the sigmoid function:

$$\begin{aligned} f(x) = \frac{1}{1+e^{-x}}. \end{aligned}$$
(3)

The inner working of an STCN block can be summarized as follows. The block receives a weight matrix \(W_1^{(t)}\), the bias weight matrix \(B_1^{(t)}\) and a chunk of data \(X^{(t)}\) as the input data. Firstly, we compute an intermediate state \(H^{(t)}\) that mixes \(X^{(t)}\) with the prior knowledge (e.g., knowledge resulting from the previous iteration). Secondly, we operate \(H^{(t)}\) with \(W_2^{(t)}\) and \(B_2^{(t)}\) to approximate the expected output \(Y^{(t)}\).
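The reasoning steps summarized above can be illustrated with a minimal NumPy sketch of Eqs. (1) to (3). The function names `sigmoid` and `stcn_forward`, the random matrices and the sizes K = 5 and M = 3 are illustrative assumptions and are not part of the original model description.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid transfer function of Eq. (3)."""
    return 1.0 / (1.0 + np.exp(-x))

def stcn_forward(X, W1, B1, W2, B2):
    """Short-term reasoning of one STCN block, Eqs. (1) and (2).

    X is K x M, W1 and W2 are M x M, B1 and B2 are 1 x M.
    NumPy broadcasting plays the role of the row-wise operator in the text.
    """
    H = sigmoid(X @ W1 + B1)        # Eq. (2): intermediate state
    Y_hat = sigmoid(H @ W2 + B2)    # Eq. (1): forecast for the current patch
    return H, Y_hat

# Illustrative data only: random inputs and weights, zero bias.
rng = np.random.default_rng(0)
K, M = 5, 3
X = rng.random((K, M))
W1, W2 = rng.normal(size=(M, M)), rng.normal(size=(M, M))
B1, B2 = np.zeros((1, M)), np.zeros((1, M))
H, Y_hat = stcn_forward(X, W1, B1, W2, B2)
```

Since no layer changes the width M, both `H` and `Y_hat` keep the shape of `X`, which is what allows each column to be read as a neural concept.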

The short-term reasoning of this model makes it less sensitive to the convergence issues of long-term cognitive networks, such as the unique fixed-point attractors Nápoles et al. [39]. Furthermore, the short-term reasoning allows extracting clearer patterns to be used to generate explanations.

4 Long short-term cognitive network

In this section, we introduce the long short-term cognitive networks for time series forecasting, which can be defined as a collection of chained STCN blocks.

4.1 Architecture

As mentioned, the model presented in this section is devoted to the multiple-ahead forecast of very long (multivariate) time series. Therefore, the first step is splitting the time series into T time patches, each comprising a collection of tuples with the form \((X^{(t)},X^{(t+1)})\). In these tuples, the first matrix denotes the input used to feed the network in the current iteration, while the second one is the expected output \(Y^{(t)}=X^{(t+1)}\). Notice that each time patch often contains several time steps (e.g., all tuples produced within a 24-hour time frame).

Figure 2 shows, as an example, how to decompose a given time series into T time patches of equal length where each time patch will be processed by an STCN block. This procedure holds for multivariate time series such that both \(X^{(t)}\) and \(Y^{(t)}\) have a dimension of \(K \times M\). In this case, K denotes the number of time steps allocated to the time patch, whereas M defines the width of each STCN block. Therefore, if we have a multivariate time series described by N features and want to forecast L steps, then \(M=N \times L\).
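As a hedged illustration of this decomposition, the sketch below builds the \((X^{(t)}, Y^{(t)})\) tuples and groups them into T time patches. It assumes that each row is a non-overlapping window of L consecutive time steps flattened to width \(M=N \times L\), and that the expected output is the next window; the text does not fix these details, and the helper name `make_patches` is hypothetical.

```python
import numpy as np

def make_patches(series, L, T):
    """Split a (steps x N) series into T time patches of (X, Y) tuples.

    Assumptions (not stated in the text): rows are non-overlapping windows
    of L consecutive steps flattened to width M = N * L, and Y is X shifted
    by one window.
    """
    series = np.asarray(series, dtype=float)
    if series.ndim == 1:
        series = series[:, None]              # univariate case, N = 1
    n_steps, N = series.shape
    start = n_steps % L                       # drop the first observations
    windows = series[start:].reshape(-1, N * L)
    X, Y = windows[:-1], windows[1:]          # Y^(t) = X^(t+1)
    # array_split yields patches of (almost) equal length.
    return list(zip(np.array_split(X, T), np.array_split(Y, T)))

# Toy series with 20 time steps and N = 2 features.
series = np.arange(40).reshape(20, 2)
patches = make_patches(series, L=2, T=3)
X0, Y0 = patches[0]
```

With N = 2 and L = 2, every row has M = 4 entries ordered as two temporal blocks of two features each.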

Fig. 2 Recurrent approach to process a (multivariate) time series with an LSTCN model. The sequence is split into T time patches of equal length. Each time patch is used to train an STCN block that employs information from the previous block as prior knowledge

In short, the LSTCN model can be defined as a collection of STCN blocks, each processing a specific time patch and passing knowledge to the next block. In each time patch, the matrices of the previous model are aggregated and used as prior knowledge for the current STCN block, that is to say:

$$\begin{aligned} W_1^{(t)}=\Psi \left( W_1^{(t-1)},W_2^{(t-1)}\right) \end{aligned}$$
(4)

and

$$\begin{aligned} B_1^{(t)}=\Psi \left( B_1^{(t-1)},B_2^{(t-1)}\right) \end{aligned}$$
(5)

such that \(\Psi (x,y)=\tanh (\max \{x,y\})\). The aggregation procedure creates a chained neural structure that allows for long-term predictions since the learned knowledge is used when performing reasoning in the current iteration.
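The aggregation in Eqs. (4) and (5) can be sketched in a few lines, under the assumption that the maximum is taken element-wise over the two matrices. The function name `aggregate` and the toy matrices are illustrative.

```python
import numpy as np

def aggregate(prior, learned):
    """Psi(x, y) = tanh(max{x, y}), applied element-wise (assumption)."""
    return np.tanh(np.maximum(prior, learned))

# Toy prior and learned weights of the previous block.
W1_prev = np.array([[0.5, -1.0], [0.0, 2.0]])
W2_prev = np.array([[-0.5, 1.0], [0.3, -2.0]])
W1_next = aggregate(W1_prev, W2_prev)   # prior knowledge for the next block
```

The hyperbolic tangent keeps the transferred weights in the \((-1, 1)\) interval regardless of the magnitude of the learned weights.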

Figure 3 shows the LSTCN architecture to process the time series in Fig. 2, which was split into three time patches of equal length. In the figure, blue boxes represent STCN blocks, while orange boxes denote learning processes.

Fig. 3 Example of an LSTCN composed of three STCN blocks. In each iteration, the model receives a time patch \(X^{(t)}\) to be processed and produces an approximation of the expected output \(Y^{(t)}\). The weights learned in the current block are aggregated (using Eqs. 4 and 5) and transferred to the following STCN block as prior knowledge matrices

It should be highlighted that, although the LSTCN model works in a sequential fashion, each STCN block performs an independent learning process (to be explained in the next subsection) before moving to the next block. Therefore, the long-term component refers to how we process the whole sequence, which is done by transferring the knowledge (in the form of weights) from one STCN block to another. Notice that we do not pass the neurons’ activation values to the subsequent blocks. Once we have processed the whole sequence, the model narrows down to the last STCN in the pipeline.

We would like to draw attention to a certain design analogy between the LSTCN and the LSTM model. To address this topic, we outline how short-term and long-term dependencies in temporal data are captured in both models. Let us recall that LSTM networks are derived from RNNs. An RNN in an unfolded state can be illustrated as a sequence of neural layers. The hidden layers in an RNN are responsible for window-based time series processing. In an RNN, the values computed by the network for the previous time step are used as input when processing the current time step. Due to the cyclic nature of the entire process, training an RNN is challenging. The input signals tend to either decay or grow exponentially. Graves et al. [22] explain that this is referred to in the literature as the vanishing/exploding gradient problem. The most significant difference between the RNN and the LSTM model is that the latter adds a forgetting mechanism at each hidden layer. The LSTM model processes the data using a windowing technique in which the number of hidden layers is equal to the length of the window. This window is responsible for processing and recognizing short-term dependencies in time series. The forgetting mechanism in each layer acts as a symbolic switch that either retains the incoming signal or forgets it. (Please note that this switch is not binary.) Thus, the forgetting mechanism in LSTM adds flexibility that allows the network to accumulate long-term temporal contextual information in its internal states, but at the same time, short-term dependencies are also modeled because the processing scheme is still sequential and window-based.

Similar to the LSTM model, the LSTCN model analyzes data in a sequential, window-based manner (see Fig. 3). The difference is that each STCN block that makes up the LSTCN model can be viewed as a sub-window. The aggregation function \(\Psi (x,y)\) can roughly be seen as an analogy to the forgetting mechanism in an LSTM. Thus, as a signal is passed through the network, the internal states \(H^{(t)}\) of the LSTCN accumulate knowledge of long-term temporal contextual information. At the same time, the short-term dependencies in the time series are processed in a conventional way, in each STCN block (see Fig. 1).

4.2 Learning

Training an LSTCN model means training each STCN block with its corresponding time patch. In this neural system, the learned knowledge up to the current iteration is stored in \(B_1\) and \(W_1\), while \(B_2\) and \(W_2\) contain the knowledge needed to make the prediction in the current iteration. Therefore, the learning problem consists of computing \(W_2^{(t)}\) and \(B_2^{(t)}\) given the tuple \((X^{(t)}, Y^{(t)})\) corresponding to the current time patch. Let us recall that \(H^{(t)}\) is a \(K \times M\) matrix, \(W_2^{(t)}\) is an \(M \times M\) matrix, while \(B_2^{(t)}\) is a \(1 \times M\) matrix. The underlying optimization problem is given below:

$$\begin{aligned} min \rightarrow \left\| f\left( H^{(t)} W_2^{(t)} \oplus B_2^{(t)}\right) - Y^{(t)}\right\| _{\ell _2} + \lambda \left\| \Gamma _2^{(t)} \right\| _{\ell _2} \end{aligned}$$
(6)

such that

$$\begin{aligned} \Gamma _2^{(t)} = \begin{bmatrix} W_2^{(t)} \\ B_2^{(t)} \end{bmatrix} \end{aligned}$$
(7)

represents the matrix with dimension \((M+1) \times M\) that results after performing a row-wise concatenation of the bias weight matrix \(B_2^{(t)}\) to \(W_2^{(t)}\), while \(\lambda \ge 0\) is the ridge regularization penalty. The added value of using a ridge regression approach is regularizing the model and preventing overfitting. In our network, overfitting is likely to happen when splitting the original time series into too many time patches covering few observations.

Equation (8) displays the deterministic learning rule solving this ridge regression problem,

$$\begin{aligned} \Gamma _2^{(t)} = \left( \left( \Phi ^{(t)} \right) ^{\top } \Phi ^{(t)} + \lambda \Omega ^{(t)} \right) ^{-1} \left( \Phi ^{(t)} \right) ^{\top } f^{-} \left( Y^{(t)}\right) \end{aligned}$$
(8)

where \(\Phi ^{(t)}=(H^{(t)} \vert A)\) such that \(A_{K \times 1}\) is a column vector filled with ones, \(\Omega ^{(t)}\) denotes the diagonal matrix of \((\Phi ^{(t)})^{\top } \Phi ^{(t)}\), \(f^{-}(\cdot )\) stands for the inverse of the transfer function, while \((\cdot )^{-1}\) represents the Moore–Penrose pseudo-inverse Penrose [46]. This generalized inverse is computed using singular value decomposition and is defined and unique for all real matrices. Remark that this learning rule assumes that the activation values in the inner layer are standardized. When the final weights are returned, they are adjusted back into their original scale.
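A minimal NumPy sketch of the learning rule in Eq. (8) is given next, assuming a sigmoid transfer function so that \(f^{-}\) is the logit. The sketch omits the standardization step mentioned above, and the clipping constant `eps` as well as the names `logit` and `fit_block` are assumptions introduced for illustration.

```python
import numpy as np

def logit(y, eps=1e-6):
    """Inverse of the sigmoid; clipping avoids infinities (assumption)."""
    y = np.clip(y, eps, 1.0 - eps)
    return np.log(y / (1.0 - y))

def fit_block(H, Y, lam=1e-2):
    """Closed-form estimate of W2 and B2 following Eq. (8)."""
    Phi = np.hstack([H, np.ones((H.shape[0], 1))])    # K x (M + 1)
    G = Phi.T @ Phi
    Omega = np.diag(np.diag(G))                       # diagonal matrix of G
    Gamma = np.linalg.pinv(G + lam * Omega) @ Phi.T @ logit(Y)
    return Gamma[:-1], Gamma[-1:]                     # W2 (M x M), B2 (1 x M)

# Synthetic check: targets generated by known weights, no regularization.
rng = np.random.default_rng(0)
H = rng.random((50, 3))
W_true, B_true = rng.normal(size=(3, 3)), rng.normal(size=(1, 3))
Y = 1.0 / (1.0 + np.exp(-(H @ W_true + B_true)))
W2, B2 = fit_block(H, Y, lam=0.0)
```

With \(\lambda = 0\) and noise-free targets, the rule recovers the generating weights up to numerical precision; a positive \(\lambda\) shrinks them toward zero.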

It can be noticed that an STCN block trained using the learning rule in Eq. (8) is similar to an Extreme Learning Machine (ELM) Huang et al. [27], which is a special case of a two-layer multilayer perceptron. However, there are three main differences between these models. Firstly, the \(W_{1}^{(t)}\) and \(B_{1}^{(t)}\) matrices are not random but initialized with the prior knowledge arriving at the STCN block from previous learning processes. Secondly, while the hidden layer of ELMs is of arbitrary width, the number of neurons in an STCN is given by the number of steps ahead to be predicted and the number of features in the multivariate time series. Finally, each neuron (also referred to as neural concept) represents the state of a problem feature in a given time step. While this constraint equips our model with interpretability features, it might also limit its approximation capabilities.

Another issue that deserves attention is how to estimate the first weight matrix \(W_1^{(0)}\) to be used as prior knowledge in the first iteration. This matrix is expected to be (partially) provided by domain experts or computed from a previous learning process (e.g., using a transfer learning approach). In this paper, we simulate such knowledge by fitting a stateless STCN (that is to say, \(H^{(t)} = X^{(t)}\)) on a smoothed representation of the whole time series we are processing. The smoothed time series is obtained using the moving average method for a given window size. Finally, we generate some white noise over the computed weights to compensate for the moving average operation. Equation (9) shows how to compute this matrix,

$$\begin{aligned} W_1^{(0)} \sim \mathcal {N}\left( \left( \bar{X}^{\top } \bar{X} + \lambda \Omega \right) ^{-1} \bar{X}^{\top } f^{-} \left( \bar{Y} \right) , \sigma \right) \end{aligned}$$
(9)

where \(\bar{X}\) and \(\bar{Y}\) are the smoothed inputs and outputs obtained for the whole time series, respectively, while \(\sigma\) is the standard deviation. In this case, we will use \(\Omega\) again to denote the diagonal matrix of \(\bar{X}^{\top } \bar{X}\) if no confusion arises.
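The simulation of the prior knowledge in Eq. (9) can be sketched as follows. The sketch assumes a sigmoid transfer function, a moving average computed without padding, and Gaussian noise added independently to each weight; the helper names `moving_average` and `init_prior`, the toy series and the one-step windows (L = 1) are illustrative choices rather than the exact experimental setup.

```python
import numpy as np

def logit(y, eps=1e-6):
    """Inverse of the sigmoid; clipping avoids infinities (assumption)."""
    y = np.clip(y, eps, 1.0 - eps)
    return np.log(y / (1.0 - y))

def moving_average(series, w):
    """Smooth each column of a (steps x N) series with window size w."""
    kernel = np.ones(w) / w
    return np.column_stack(
        [np.convolve(col, kernel, mode="valid") for col in series.T]
    )

def init_prior(X_bar, Y_bar, lam=1e-2, sigma=0.05, rng=None):
    """Sample W1^(0) around the stateless ridge solution of Eq. (9)."""
    rng = np.random.default_rng() if rng is None else rng
    G = X_bar.T @ X_bar
    Omega = np.diag(np.diag(G))
    mean = np.linalg.pinv(G + lam * Omega) @ X_bar.T @ logit(Y_bar)
    return rng.normal(loc=mean, scale=sigma)          # white noise on weights

# Toy univariate series already scaled to [0, 1].
rng = np.random.default_rng(0)
series = rng.random((200, 1))
smooth = moving_average(series, w=10)
X_bar, Y_bar = smooth[:-1], smooth[1:]                # L = 1 for brevity
W1_0 = init_prior(X_bar, Y_bar, rng=rng)
```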

The prior bias matrix \(B_1^{(0)}\) is assumed to be zero since we use that component to model the external stimulus of neurons after performing an STCN’s learning process.

The intuition dictates that the training error will go down as more time patches are processed. Of course, such time patches should not be too small to avoid overfitting. In some cases, we might obtain an optimal performance using a single time patch containing the whole sequence such that we will have a single STCN block. In other cases, it might occur that we do not have access to the whole sequence (e.g., as happens when solving online learning problems), such that using a single STCN block would not be an option.

4.3 Interpretability

As mentioned, the architecture of our neural system allows explaining the forecasting since both neurons and weights have a precise meaning for the problem domain being modeled. However, the interpretability cannot be confined to the absence of hidden components in the network since the structure might involve hundreds or thousands of edges.

In this subsection, we introduce a measure to quantify the influence of each feature in the forecasting of multivariate time series. Our proposal is based on the knowledge structures of the LSTCN model, i.e., the learned weights connecting the neurons. This implies that our measure is a model-intrinsic feature importance measure, reflecting what the model considers important when learning the relations between the time points. This approach contrasts with model-agnostic methods that inspect how the variations in the input data affect the model’s output.

The proposed measure can be computed from \(W_1^{(t)}\), \(W_2^{(t)}\) or their combination. The scores obtained from \(W_1^{(t)}\) can be understood as the feature influence up to the t-th time patch, while scores obtained from \(W_2^{(t)}\) can be understood as the feature influence to the current time patch. Let us recall that \(W_1^{(t)}\) and \(W_2^{(t)}\) are \(M \times M\) matrices such that \(M=N \times L\), assuming that we have a multivariate time series with N features and that we want to forecast L steps ahead. Moreover, the neurons are organized temporally, which means that we have L blocks of neurons, each containing N units. Equations (10) and (11) show how to quantify the effect of feature \(f_i\) on feature \(f_j\) given a matrix \(W^{(t)}\) that characterizes the interaction among the problem features,

$$\begin{aligned} \gamma ^{(t)}(f_i,f_j) = \sum _{p_i \in P(i)} \sum _{p_j \in P(j)} \left| w^{(t)}_{p_i p_j} \right| , w^{(t)}_{p_i p_j} \in W^{(t)} \end{aligned}$$
(10)

such that

$$\begin{aligned} P(i)= \{ p \in \mathbb {N}, p \le M~\vert ~(p~mod~i)=0\}. \end{aligned}$$
(11)

The feature influence score in Equation (10) can be normalized such that the sum of all scores related to the j-th feature is one. This can be done as follows:

$$\begin{aligned} \hat{\gamma }^{(t)}(f_i,f_j) = \frac{\gamma ^{(t)}(f_i,f_j)}{\sum _{k=1}^{N} \gamma ^{(t)}(f_k,f_j)}. \end{aligned}$$
(12)

The rationale behind the proposed feature influence score is that the most important problem features will have attached weights with large absolute values. Moreover, it is expected for the learning algorithm to produce sparse weights with a zero-mean normal distribution, which is an appreciated characteristic when it comes to interpretability.

The idea of computing the relevance of features from the weights in neural systems has been explored in the literature. For example, the Layer-Wise Relevance Propagation (LRP) algorithm Bach et al. [4] explains the predictions made by a neural classifier for a given instance by assigning relevance scores to features, which are computed using the learned weights and neurons’ activation values. It should be stated that we do not use neurons’ activation values in our feature influence score as we intend to produce global explanations based on the learned weights only. Similar approaches have been proposed in Nápoles et al. [41] and Nápoles et al. [42] but applied to LTCN-based classifiers. In the first study, the feature scores indicate which features play a significant role in obtaining a given class instead of an alternative class. This type of interpretability responds to the question “why not?”. Conversely, the second study measures the feature importance in obtaining the decision class. The results were contrasted with the feature scores obtained from logistic regression and both models agreed on the top features that play a role in the outcome. Both feature score measures operate on neural systems where the neurons have an explicit meaning for the problem domain. Therefore, the learned weights can be used as a proxy for interpretability.

5 Numerical simulations

In this section, we will explore the performance (forecasting error and training time) of our neural system on three case studies involving univariate and multivariate time series. In the case of multivariate time series, we will also depict the feature contribution score to explain the predictions.

When it comes to the pre-processing steps, we interpolate the missing values (whenever applicable) and normalize the series using the min-max method. In addition, we split the series into 80% for training and validation and 20% for testing purposes. As for the performance metric, we use the mean absolute error in all simulations reported in the section. For the sake of convenience, we trimmed the training sequence (by deleting the first observations) such that the number of time steps is a multiple of L (the number of steps ahead we want to forecast).
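The pre-processing pipeline can be sketched as follows. The sketch assumes that the min-max scaling is computed over the whole series and that the split is chronological; interpolation of missing values is omitted. The helper names `min_max`, `split_and_trim` and `mae` and the toy series are illustrative.

```python
import numpy as np

def min_max(series):
    """Scale each column to [0, 1] (computed on the whole series, assumption)."""
    lo, hi = series.min(axis=0), series.max(axis=0)
    return (series - lo) / (hi - lo)

def split_and_trim(series, L, train_frac=0.8):
    """Chronological split, then trim the training part to a multiple of L."""
    cut = int(len(series) * train_frac)
    train, test = series[:cut], series[cut:]
    return train[len(train) % L:], test               # drop first observations

def mae(y_true, y_pred):
    """Mean absolute error used as the performance metric."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Toy univariate series with 103 time steps.
series = np.arange(103, dtype=float).reshape(-1, 1)
scaled = min_max(series)
train, test = split_and_trim(scaled, L=5)
```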

The models used for comparison are a fully connected Recurrent Neural Network (RNN) where the output is to be fed back to the input, GRU, LSTM and Extreme Learning Machine (ELM). In the first three models, the number of epochs was set to 20, while the batch size was obtained through hyperparameter tuning (using grid search). The candidate batch sizes were the powers of two from 32 to 4,096. The values for the remaining parameters were retained as provided in the Keras library. In the case of the LSTCN model, we fine-tuned the number of time patches \(T \in \{1,2,\ldots ,10\}\) and the regularization penalty \(\lambda \in \{\text {1.0E-3}, \text {1.0E-2}, \text {1.0E-1}, \text {1.0E+1}, \text {1.0E+2}, \text {1.0E+3}\}\). In Eq. (9), we arbitrarily set the standard deviation \(\sigma\) to 0.05 and the moving window size w to 100. These two hyperparameters were not optimized during the hyperparameter tuning step as they were used to simulate the prior knowledge component. In the case of ELM, we use the implementation provided in Scikit–ELM [3]. The number of neurons in the hidden layer was set to \(M=N \times L\) where N is the number of features while L is the number of steps ahead to be forecast. The values for the remaining hyperparameters were retained as provided in the library.
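For reference, the search spaces listed above can be enumerated in a few lines. The sketch below only builds the candidate grids; the variable and function names are hypothetical and no model is trained.

```python
from itertools import product

# Candidate batch sizes for RNN, GRU and LSTM: powers of two from 32 to 4,096.
batch_sizes = [2 ** k for k in range(5, 13)]

# LSTCN grid: number of time patches T and regularization penalty lambda.
time_patches = list(range(1, 11))
penalties = [1e-3, 1e-2, 1e-1, 1e1, 1e2, 1e3]
lstcn_grid = list(product(time_patches, penalties))

def elm_hidden_neurons(n_features, steps_ahead):
    """Hidden-layer size used for ELM: M = N x L."""
    return n_features * steps_ahead
```

The LSTCN grid therefore contains 60 configurations, against eight candidate batch sizes for each recurrent baseline. As an example, four features and 200 steps ahead give an ELM hidden layer of 800 neurons.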

Finally, all experiments presented in this section were performed on a high-performance computing environment that uses two Intel Xeon Gold 6152 CPUs at 2.10 GHz, each with 22 cores and 768 GB of memory.

5.1 Apple Health’s step counter

The first case study concerns physical activity prediction based on daily step counts. In this case study, the health data of one individual were extracted from the Apple Health application in the period from 2015 to 2021. In total, the time series dataset is composed of 79,142 instances or time steps. The Apple Health application records the step counts in small sessions during which the walking occurs. The dataset (available at https://bit.ly/2S9vzMD) contains two timestamps (start date and end date), the number of recorded steps, and separate columns for year, month, date, day, and hour. Besides, the day of the week that each value was recorded is known. Table 1 presents descriptive statistics attached to this univariate time series before normalization.

Table 1 Descriptive statistics for the steps case study

The target variable (number of steps) follows an exponential distribution with very infrequent, extremely high step counts and very common low step counts. Overall, the data follow neither seasonal patterns nor a trend.

Table 2 shows the normalized errors attached to the models under consideration when forecasting 50 steps ahead in the Steps dataset. In addition, we portray the training and test times (in seconds) for the optimized models. The hyperparameter tuning reported that our neural system needed two iterations to produce the lowest forecasting errors, while the optimal batch size for RNN, GRU and LSTM was found to be 32. Although LSTCN outperforms the remaining methods in terms of forecasting error, what is truly remarkable is its efficiency. The results show that LSTCN is 2.7E+2 times faster than RNN, 2.3E+3 times faster than GRU, 2.2E+3 times faster than LSTM, and just 2.0E-2 times slower than ELM. In this experiment, we ran all models five times with optimized hyperparameters and selected the shortest training time in each case. Hence, the time measures reported in Table 2 concern the fastest executions observed in our simulations.
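The timing protocol above (five runs, shortest time retained) and the speed-up ratios can be sketched as follows. This is an illustrative helper based on the Python standard library, not the instrumentation used in our experiments; `best_of` and `speedup` are hypothetical names, and the numbers in the usage note are made up.

```python
import time

def best_of(fn, repeats=5):
    """Run fn several times and return the shortest wall-clock time (seconds)."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)

def speedup(t_baseline, t_model):
    """How many times faster the model trains than the baseline."""
    return t_baseline / t_model
```

For instance, a baseline that trains in 270 s compared with a model that trains in 1 s yields `speedup(270.0, 1.0) == 270.0`, that is, a factor of 2.7E+2.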

Table 2 Simulation results for the steps case study

Figure 4 displays the distributions of weights in the \(W_1\) and \(W_2\) matrices attached to the last STCN block (the one to be used to perform the forecasting). In other words, we visualize the differences in the distributions of prior knowledge weights and the weights learned in the last STCN block. It is worth recalling that the prior knowledge in that last block is what the network has learned after processing all the time patches but the last one. In contrast, the learned weights in that block adapt the prior knowledge to forecast the last time patch. In this case study, most prior knowledge weights are distributed in the \([-0.2,1.0]\) interval, while the weights in \(W_2\) follow a zero-mean Gaussian distribution. This figure illustrates that the network significantly adapts the prior knowledge to the last piece of data available.

Fig. 4 Distribution of weights for the Steps case study

Figure 5 depicts the overall behavior of weights connecting the inner neurons with the outer ones in the last STCN block. In this simulation, we averaged the \(W_1\) and \(W_2\) matrices for the sake of simplicity, thus resulting in an average layer. In that layer, inner and outer neurons refer to the leftmost and rightmost neurons, respectively. Observe that the learning algorithm assigns larger weights to connections between neurons processing the last steps in the input sequence and neurons processing the first steps in the output sequence. This is an expected behavior in time series forecasting that supports the rationale of the proposed feature relevance measure.

Fig. 5 Behavior of weights connecting the inner neurons with the outer ones in the last STCN block after averaging \(W_1\) and \(W_2\)

The fact that each neuron has a well-defined meaning for the problem domain makes it possible to elucidate how the network uses the current L time steps to predict the following ones. Using that knowledge, experts could estimate how many previous time steps would be needed to predict a sequence of length L without performing further simulations.

5.2 Household electric power consumption

The second case study concerns the energy consumption in one house in France measured each minute from December 2006 to November 2010 (47 months). This dataset (available at https://bit.ly/3ugv8Pt) involves nine features and 2,075,259 observations, of which 1.25% are missing. Hence, records with missing values were interpolated using the nearest neighbor method. In our experiments, we retained the following variables: global minute-averaged active power (in kilowatt), global minute-averaged reactive power (in kilowatt), minute-averaged voltage (in volt), and global minute-averaged current intensity (in ampere) (Table 3).
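Nearest neighbor interpolation of a single series can be sketched as below. This is a NumPy illustration under our own assumptions (neighbors are measured by distance in time steps, and ties are resolved in favor of the earlier observation); it is not necessarily the routine used to prepare the dataset, and `fill_nearest` is a hypothetical name.

```python
import numpy as np

def fill_nearest(x):
    """Replace each NaN with the value of the closest observed time step."""
    x = np.asarray(x, dtype=float).copy()
    observed = np.flatnonzero(~np.isnan(x))
    missing = np.flatnonzero(np.isnan(x))
    # Position of each missing index within the sorted observed indices.
    pos = np.searchsorted(observed, missing)
    left = observed[np.clip(pos - 1, 0, len(observed) - 1)]
    right = observed[np.clip(pos, 0, len(observed) - 1)]
    # Pick the closer neighbor; ties go to the left (earlier) one.
    use_left = np.abs(missing - left) <= np.abs(right - missing)
    nearest = np.where(use_left, left, right)
    x[missing] = x[nearest]
    return x

filled = fill_nearest([1.0, np.nan, np.nan, 4.0, np.nan])
```

In this toy example, the first gap takes the value 1.0 from its left neighbor, while the second and the trailing gaps take the value 4.0.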

Table 3 Descriptive statistics concerning four variables in the Power case study: mean value, standard deviation, minimal, and maximal values (“pwr” stands for power)

The series exhibits cyclic patterns. On the most fine-grained scale, we observe a repeating low nighttime power consumption. We also noted a less distinct but still present pattern related to the day of the week: higher power consumption during weekend days. Finally, we observed high power consumption during the winter months each year (peaks in January) and low consumption in summer (lowest values recorded for July).

Table 4 portrays the normalized errors obtained by each optimized model when forecasting 200 steps ahead, and the training and test times (in seconds). The hyperparameter tuning reported that our network produced the optimal forecasts with two iterations, while the optimal batch size for RNN, GRU and LSTM was 4,096, 64 and 256, respectively. According to these simulations, LSTCN obtains the best results followed by GRU with the latter being notably slower than the former. Overall, LSTCN proved to be 2.3E+1 times faster than RNN, 2.2E+3 times faster than GRU, and 1.6E+3 times faster than LSTM. ELM was the second-fastest algorithm and showed competitive results in terms of error compared to LSTCN.

Table 4 Simulation results for the power case study

Figure 6 displays the distributions of weights in the \(W_1\) and \(W_2\) matrices for the last STCN block. These histograms reveal that weights follow a zero-mean Gaussian distribution and that the second matrix has more weights near zero (the shape of the second curve contracts toward zero). In this case study, the network does not shift the distribution of weights as happened in the first case study. Actually, the accumulated prior knowledge does not seem to suffer much distortion (distribution-wise) when adapted to the last time patch.

Fig. 6 Distribution of weights for the Power case study

Figure 7 displays the feature influence scores obtained with Equation (12). These scores were computed after averaging the \(W_1\) and \(W_2\) matrices that result from adjusting the network to the last time patch. In this figure, the bubble size denotes the extent to which one feature in the y-axis is deemed relevant to forecast the value of another feature in the x-axis. For example, it was observed that the first feature (global active power) is the most important one to forecast the second feature (global reactive power). Observe that the sum of all scores by column is one due to the normalization step.
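The two operations mentioned in this paragraph, averaging \(W_1\) and \(W_2\) and normalizing the scores by column, can be illustrated as follows. Equation (12) is defined earlier in the paper and is not reproduced by this sketch: the neuron layout (neuron k encodes feature k mod N) and the aggregation by sums of absolute weights are assumptions made only for the sake of the example, and `feature_influence` is a hypothetical name.

```python
import numpy as np

def feature_influence(W1, W2, n_features):
    """Illustrative feature-to-feature scores from averaged weights.

    Assumptions (not taken from Equation (12)): neuron k encodes feature
    k % n_features, and raw influence is the sum of absolute weights.
    """
    W = (W1 + W2) / 2.0                          # average layer
    steps = W.shape[0] // n_features
    # Reshape to (step_in, feature_in, step_out, feature_out).
    A = np.abs(W).reshape(steps, n_features, steps, n_features)
    raw = A.sum(axis=(0, 2))                     # rows: source, columns: target
    return raw / raw.sum(axis=0, keepdims=True)  # each column sums to one

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 8))                     # toy matrices: N = 4, L = 2
W2 = rng.normal(size=(8, 8))
scores = feature_influence(W1, W2, n_features=4)
```

Each column of `scores` sums to one, so an entry can be read as the share of influence that a source feature (row) has on a target feature (column).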

Fig. 7 Feature influence in the Power case study

Overall, the results indicate that the proposed network obtains small forecasting errors while being markedly faster than the state-of-the-art recurrent neural networks. Moreover, its knowledge structures facilitate explaining how the forecasting was made, using feature relevance explanations.

5.3 Bitcoin transactions analysis

In this section, we inspect a case study concerning changes in the Bitcoin transaction graph observed with a daily frequency from January 2009 to December 2018. The data set is publicly available in the UCI Repository (https://bit.ly/3ES71M1). Using a time interval of 24 hours, the contributors of this dataset Akcora et al. [2] extracted daily transactions and characterized them. In total, we have 2,916,697 observations of six numerical features (the remaining ones were discarded).

Due to the nature of this dataset, we do not observe typical statistical properties (there are no seasonal patterns, the data are not stationary, the fluctuations do not show evident patterns and features are not normally distributed). Table 5 depicts descriptive statistics for the retained features.

Table 5 Descriptive statistics (mean, standard deviation, minimum and maximum value) of variables in the Bitcoin dataset

Table 6 shows the errors obtained by each optimized network when forecasting 200 steps ahead, and the training and test times (in seconds). After performing hyperparameter tuning, we found that the optimal batch size for RNN, GRU and LSTM was 4,096, 128 and 64, respectively, while the number of LSTCN iterations was set to eight. In this problem, the LSTCN model clearly outperformed the remaining algorithms selected for comparison. When it comes to the training time, LSTCN is 2.1E+1 times faster than RNN, 1.7E+3 times faster than GRU, 2.2E+3 times faster than LSTM, and 1.4 times faster than ELM. Similarly to the other experiments, ELM was the second-fastest algorithm; however, it obtained the second-worst score in terms of the test error.

Table 6 Simulation results for the Bitcoin case study

Figure 8 shows the distribution of weights in the first prior knowledge matrix and the matrix computed in the last learning process. This figure illustrates how the weights become sparser as the network performs more iterations. This happens due to a heavy \(\ell _2\) regularization, with \(\lambda = \text {1.0E+3}\) being the best penalty value obtained with grid search. However, no shift in the distributions is observed.

Fig. 8 Distribution of weights in the Bitcoin case study

Figure 9 shows the feature influence scores obtained for the Bitcoin case study. Similarly to the previous scenario, these scores were computed after averaging the \(W_1\) and \(W_2\) matrices that result from adjusting the network to the last time patch. The relevance scores suggest that the sixth (income), the second (weight) and the third (count) features have the biggest influence in the forecasting.

Fig. 9 Feature influence in the Bitcoin case study

Overall, it should be highlighted that the intention of the proposed feature score is to provide the LSTCN with intrinsic interpretability, as opposed to model-agnostic measures of feature importance. In general, model-intrinsic explanations are preferred when the fidelity of the explanations to the model is an important factor for the user Rudin [48]. In practice, our approach can help practitioners elucidate the degree to which each feature influences the forecasting. Comparing the quality of our model-intrinsic explanations with model-agnostic explanations would require a specific application and the availability of experts to measure satisfaction, informativeness, usefulness, trust, etc. Doshi-Velez and Kim [18]. This is an interesting step to consider in future work along this research line.

6 Concluding remarks

In this paper, we have presented a recurrent neural system termed Long Short-term Cognitive Networks to forecast long time series. The proposed model consists of a collection of STCN blocks, each processing a specific data chunk (time patch). In this neural ensemble, each STCN block passes information to the next one in the form of prior knowledge matrices that preserve what the model has learned up to the current iteration. This means that, in each iteration, the learning problem narrows down to solving a regression problem. Furthermore, neurons and weights can be mapped to the problem domain, making our neural system interpretable.

The underlying model aims at solving an optimization problem, which is practically realized with ridge regression. The natural limitation of such a solution is that we may encounter computational issues in ultra-high dimensional spaces. The literature of the domain suggests solving these issues with the help of dimensionality reduction algorithms (see [12, 55, 56]).
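To make the dimensionality concern concrete, the textbook closed-form ridge solution is sketched below. It is a generic NumPy illustration and not the exact learning rule of the LSTCN model, which is defined earlier in the paper; `ridge_fit` is a hypothetical name.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Closed-form ridge solution W = (X^T X + lam * I)^{-1} X^T Y."""
    d = X.shape[1]
    # Solving the d x d linear system is the costly step when d is large.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

X = np.eye(3)                 # toy design matrix
Y = np.ones((3, 2))           # toy targets
W = ridge_fit(X, Y, lam=1.0)
```

The linear system has as many rows and columns as there are input dimensions, so the cost of solving it grows quickly with the dimensionality, which is where dimensionality reduction becomes attractive. In the toy example, the penalty halves every weight with respect to the unregularized solution.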

The numerical simulations using three case studies allow us to draw the following conclusions. Firstly, our model performs better than (or comparably to) state-of-the-art recurrent neural networks. It has not escaped our notice that these algorithms could have produced smaller forecasting errors if we had optimized other hyperparameters (such as the learning rate, the optimizer, the regularizer, etc.). However, such an increase in performance would come at the expense of a significant increase in the computations needed to produce fully optimized models. Secondly, the simulation results have shown that our proposal is noticeably faster than GRU and LSTM, which are popular recurrent models for time series forecasting, and comparable to ELM. Such a conclusion is particularly relevant since our primary goal was to design a fairly accurate forecasting model with fast training time rather than outperforming the forecasting capabilities of these recurrent models. Finally, we have illustrated how to derive insights into the relevance of features using the network’s knowledge structures with little effort.

Future research efforts will be devoted to exploring the forecasting capabilities of our model further. On the one hand, we plan to conduct a larger experiment involving more univariate and multivariate time series. On the other hand, we will analytically study the generalization properties of LSTCNs under the PAC-Learning formalism. This seems especially interesting since the network’s size depends on the number of features and the number of steps ahead to be forecast.