1 Introduction

We study the online prediction problem, where we sequentially observe a data sequence related to a target signal, possibly along with feature vectors, and aim to predict its future values. This is a fundamental problem that has been extensively studied in the machine learning and signal processing literatures, with various practical applications including finance, meteorology and health sciences (Sun et al., 2014; Saadallah et al., 2022; Shih et al., 2019). In these applications, a wide range of methods, both linear and nonlinear, have been employed.

Linear models are widely used for sequential prediction tasks due to their simplicity, which allows for easy interpretation and implementation. Additionally, they have lower variance and are less prone to overfitting than more complex models (Landwehr et al., 2005). Among these linear models, the auto-regressive moving-average (ARMA) model (Box et al., 2015) is a popular choice for sequential prediction tasks. Auto-regressive (AR) models are linear models that use a linear combination of past observations to predict the future values of a time series. They can be enhanced by incorporating the past error terms, or residuals, i.e., moving-average (MA) terms, resulting in the highly powerful ARMA models (Fan & Yao, 2013). By combining the AR and MA components, the ARMA model can effectively capture the linear patterns in time series data. The inclusion of MA terms in the ARMA model is essential, as it allows the model to use the past errors, i.e., its performance on the previous samples, in other words, the “state information” carried by the residual terms from the previous time steps (Box et al., 2015).

However, in most real-life applications, linear models have limitations since real-life data frequently exhibits nonlinear characteristics (Landwehr et al., 2005). Consequently, nonlinear models have gained popularity in recent years, as they can capture such patterns with a great degree of accuracy and flexibility, surpassing the linear methods as shown in several highly publicized real-life competitions (Kaggle, 2020). One of the primary nonlinear approaches for time series forecasting is hard decision trees. These trees utilize decision rules to partition the input space and predict the output. Decision trees are a versatile tool for sequential prediction tasks because they can handle both continuous and discrete data, and they are easy to interpret and visualize (Bertsimas & Dunn, 2017). Decision trees have also been incorporated into many gradient boosting machines as the base learners, where they are used to generate an ensemble of trees to improve prediction accuracy. As a result, gradient boosting decision tree methods such as XGBoost (Chen & Guestrin, 2016) and LightGBM (Ke et al., 2017) have achieved remarkable success in several machine learning applications. Another popular nonlinear method for sequential prediction tasks is artificial neural networks (ANNs), which are made up of interconnected processing nodes that learn complex patterns. ANNs have been shown to outperform linear models in various real-life applications and are highly flexible in handling different types of data (Alpaydin, 2014). In addition, ANNs have several variations, such as recurrent neural networks (RNNs) and long short-term memory (LSTM) networks (Hochreiter & Schmidhuber, 1997), which are specifically designed to handle sequential data. With the advancements in deep learning, ANNs have achieved state-of-the-art results in various sequential prediction tasks (Shih et al., 2019).

Although nonlinear methods have shown superior performance in a variety of real-life applications, unlike ARMA models, neither decision trees nor ANNs utilize the past error terms, or residuals, in a certain optimal manner to model the sequential data. Instead, they mostly rely on the past time instances, i.e., AR terms, to make predictions, since including the past error terms would require a highly complex formulation. However, leveraging the residual information, which can be considered as the state, i.e., it carries the prediction performance information from the past time steps, can significantly enhance the accuracy by adjusting the algorithm with this highly relevant information (Box et al., 2015).

To this end, we introduce a new approach to nonlinear online prediction which exploits the past error terms as additional features. We emphasize that such inclusion is not straightforward, since online algorithms are sequential, i.e., they update or learn their parameters in time based on the newly arriving data. Hence, for each new sample, all the parameters of the algorithm change, and the error terms that are included as new features change as well. In this sense, our algorithm optimizes both the model parameters and the feature vectors. Note that by doing so, we obtain the optimized residual values of the final model, rather than using a set of constant errors that are fixed after a certain training stage and would be associated with a past version of the model that has lost its relevance to the final version. We achieve this joint optimization by introducing a new gradient update rule that accounts for the effects resulting from the changes in this new feature vector, and we provide the corresponding equations. Naturally, our method outperforms linear models as well as extensively used nonlinear models that only use the past observation samples, as we demonstrate through experiments on both well-known real-life competition datasets and synthetically generated data.

Our main contributions are as follows.

1. To the best of our knowledge, our work is the first in the literature to introduce a nonlinear sequential prediction algorithm that uses and trains the past error terms as additional features in order to exploit the algorithm’s performance on the previous samples.

2. We introduce an approach that jointly optimizes the model parameters and the feature vectors that are extended to include the past error terms. This algorithm is highly versatile as it can work with any linear or nonlinear prediction model as long as its first-order derivatives exist.

3. We provide the corresponding equations that account for the effects resulting from the changing error terms, for both decision trees and ANNs, since they are among the most widely used methods in the machine learning literature.

4. Through an extensive set of simulations on both artificial and real-life datasets, we demonstrate that our algorithm shows significant performance improvements over the state-of-the-art.

The organization of this paper is as follows. In Sect. 2, we provide a review of the works in the literature that are related to our paper. In Sect. 3, we describe the sequential prediction problem and then briefly introduce some of the prominent methods used in this problem, namely ARMA models, ANNs, and hard decision trees. Then, we present soft decision trees as a differentiable alternative to hard decision trees in Sect. 3.5. In Sect. 4.1, we introduce our gradient-based learning algorithm which exploits the past error terms and provide the necessary equations for soft decision trees. In Sect. 4.2, we provide the equations for the application of our algorithm to ANNs as well. Then, in Sect. 5, we illustrate the performance of our algorithm on both synthetically generated and real-life datasets, showing significant improvements over the state-of-the-art methods. We then finalize our paper with concluding remarks in Sect. 6.

2 Related work

For sequential prediction or forecasting tasks, two of the most common classical methods are the linear Exponential Smoothing (ES) and ARMA models. The ES method uses a weighted sum of the past observations with exponentially decaying weights to predict the next value in the sequence. In ARMA models, a linear combination of the past error terms (MA terms) is incorporated with the past observations (AR terms) to better model the autocorrelations of the time series. The ARMA models can be extended to include seasonal parts and exogenous features, resulting in the SARMAX models, which can effectively capture the linear patterns in time series data.

Since linear models are often insufficient for the nonlinear patterns that frequently appear in real-life data, nonlinear methods have overtaken the classical approaches in both performance and popularity in recent years. Among the most widely applied nonlinear methods for sequential prediction tasks are artificial neural networks (ANNs) and their variations, such as recurrent neural networks (RNNs) and long short-term memory (LSTM) networks (Hochreiter & Schmidhuber, 1997), which are specifically designed to handle sequential data. Another state-of-the-art nonlinear approach is to use gradient-boosted hard decision trees, where the model is built by iteratively adding base learners that are trained to predict the residuals of the previous base learners (Friedman, 2001). These gradient boosting models have several variations, such as XGBoost (Chen & Guestrin, 2016), LightGBM (Ke et al., 2017) and CatBoost (Prokhorenkova et al., 2019), and have shown remarkable performance in various machine learning applications. However, none of these nonlinear methods conventionally integrate the error terms as additional features, which are shown to effectively improve the performance in the case of linear ARMA models.

In the machine learning literature, the integration of past error terms as additional features in nonlinear sequential prediction tasks has been mainly explored with hybrid models. These models use a linear combination of an ARMA model and a suitable nonlinear model such as ANNs (Zhang, 2003; Li et al., 2018; Fard & Akbari-Zadeh, 2013), RNNs (Aladag et al., 2009), support vector machines (SVMs) (Pai & Lin, 2005), LSTMs (Fan et al., 2021) or decision trees (Nie et al., 2021). This hybrid methodology deals with the original time series and the error series separately. Namely, a linear ARMA model is first fitted on the data, and then a nonlinear model is fitted on its residuals to capture the nonlinear patterns that are initially unused (Zhang, 2003). However, this approach assumes an additive relationship between the linear and nonlinear components, which may not accurately describe the underlying relation (Taskaya Temizel & Casey, 2005); moreover, the error terms that are used are not those of the mixture, but only those of the first-stage algorithm. In this sense, these methods can be considered within the boosting framework, i.e., as the first stage of a boosting procedure.

To mitigate this issue, recent works have proposed methods that combine the ARMA and nonlinear patterns using another nonlinear machine learning model, or a “meta-learner”, i.e., the ensemble approaches (Santos et al., 2019; Khashei & Bijari, 2011). There are also studies that propose methods that decompose the data into more suitable forms to separately fit the ARMA and ANN models (Babu & Reddy, 2014). Nonetheless, all of these previous techniques use the same residuals throughout their training phases, i.e., the residuals of an initially fitted ARMA model, rather than the residuals of the entire model. In addition, in gradient boosting machines such as XGBoost and LightGBM, the weak learners fit the residuals of the previous weak learners to minimize the loss function sequentially. Although all weak learners predict different targets, which are the remaining errors/residuals from the ensemble of the previous weak learners, they do not directly benefit from the information of past error terms over time. In this paper, unlike other methods, our algorithm does not use a single error series acquired at some training step. Instead, we consider the fact that as we update the model parameters, the error terms, and by extension the feature vectors, change accordingly. Hence, a joint optimization is needed to handle the effects of the past error terms effectively as time progresses. We accomplish this so that the effects of the past error terms, i.e., the state information, are propagated to the future steps to effectively improve the prediction performance.

3 Background

In this section, we describe the sequential prediction problem and then briefly introduce some of the prominent methods used in this problem, namely ARMAX models, ANNs, and hard decision trees. Lastly, in Sect. 3.5, we present soft decision trees as a differentiable alternative to hard decision trees.

3.1 Problem description

All vectors are column vectors and denoted by boldface lowercase letters. Matrices are represented by boldface capital letters. For a vector \(\varvec{u}\) (or a matrix \(\varvec{U}\)), \(\varvec{u}^T\) (\(\varvec{U}^T\)) is the ordinary transpose. The time index is given as subscript, e.g., \(\varvec{u}_t\) is the vector at time t. \(\varvec{I}_{d} \in {\mathbb {R}}^{d \times d}\) is the identity matrix and the \(\varvec{0}\) (or \(\varvec{1}\)) is a vector or a matrix of all zeros (ones), where the size is understood from the context.

We sequentially observe \(\{y_t\}_{t \ge 1}\), \(y_t \in {\mathbb {R}}\) along with the feature vectors \(\{\varvec{x}_t\}_{t \ge 1}\), \(\varvec{x}_t \in {\mathbb {R}}^d\). Our goal is to predict \(y_{t+1}\) in an online manner by constructing a sequential function, which only depends on the past information, as:

$$\begin{aligned} {\hat{y}}_{t+1} = f_t(\{\dots , y_{t-1}, y_t\}, \{\dots , \varvec{x}_{t-1}, \varvec{x}_t\}) \end{aligned}$$

where the function \(f_t(\cdot )\) can vary in time to accommodate the non-stationarity in the data. At each time step t, we generate the output \({\hat{y}}_{t+1}\) and suffer the loss \(l(y_{t+1}, {\hat{y}}_{t+1})\). Here, the loss function \(l\) can be any differentiable function such as the squared error loss \((y_{t+1}-{\hat{y}}_{t+1})^2\). Our goal is to minimize the accumulated online error

$$\begin{aligned} L_t = \sum _{k=1}^t l(y_k, {\hat{y}}_k). \end{aligned}$$
(1)

For this problem, linear models are commonly used as \(f(\cdot )\), i.e., \(f(\varvec{z}_t) = \varvec{w}^T \varvec{z}_t\), where \(\varvec{z}_t\) is the combined feature vector at time \(t\), e.g., in addition to the feature vectors \(\varvec{x}_t\), one can generate further features from the past values of \(y_t\), and \(\varvec{w}\) is the corresponding coefficient vector. As an example, the well-known linear regression model (Brockwell & Davis, 2002) predicts the next time instance as

$$\begin{aligned} {\hat{y}}_{t+1} = \beta _0 + \beta _1 z_{t,1} + \beta _2 z_{t,2} + \dots + \beta _n z_{t,n}, \end{aligned}$$
(2)

where \({\hat{y}}_{t+1}\) is the predicted value for the next time instance, \(z_{t,i}\) represents the value of the \(i\)’th feature which can be the past values of \(y_t\) and exogenous variables at time \(t\), and \(\beta _i\) are the regression coefficients.

However, in most applications, linear models are found to be insufficient when the data contains nonlinear patterns (Landwehr et al., 2005). Hence, nonlinear models are employed where \(f_t(\cdot )\) is a time-varying nonlinear function of the current and past data and \({\hat{y}}_{t+1} = f_t(\varvec{y}_t, \varvec{x}_t)\). Recently, decision tree based models such as LightGBM (Ke et al., 2017), XGBoost (Chen & Guestrin, 2016) and neural network based models are among the most popular nonlinear models due to their success in real-life applications and competitions such as Kaggle competitions (Kaggle, 2020). Nonetheless, in such nonlinear models, the past errors used in the ARMA models, i.e., \(e_t=y_t-{\hat{y}}_t\) for past instances, are not exploited. These error terms can improve the prediction performance when used as features since they carry state information, i.e., past performance, through time (Box et al., 2015).

We introduce a nonlinear prediction algorithm which exploits the past data instances \(y_t\), exogenous feature vectors \(\varvec{x}_t\), and also the past error terms \(e_t\). We provide the corresponding equations for two choices of the nonlinear function \(f(\cdot )\): a decision tree model and an ANN. However, as we show in Sect. 4, the algorithm can be used with any differentiable nonlinear model of choice. To this end, we first explain the ARMA with exogenous terms (ARMAX) models (Box et al., 2015) as the source of inspiration for exploiting the error terms. Then we briefly introduce the ANNs, hard decision trees and soft decision trees and discuss their utility for this particular problem. Finally, in the next section, we introduce a model which jointly optimizes this structure.

3.2 ARMAX models

Autoregressive moving average (ARMA) models are a class of linear time series models that are popular due to their efficiency and flexibility in approximating many stationary processes (Fan & Yao, 2013). ARMA models present the future value of the time series as a linear function of the past time steps and past errors, which are called the autoregressive (AR) and moving-average (MA) terms, respectively. Specifically, an ARMA(p, q) model has the form

$$\begin{aligned} \begin{aligned} y_t&= \phi _1 y_{t-1} + \phi _2 y_{t-2} + \dots + \phi _p y_{t-p}\\&+ \theta _1 e_{t-1} + \theta _2 e_{t-2} + \dots + \theta _q e_{t-q} + e_t, \end{aligned} \end{aligned}$$
(3)

where \(\{e_k\}_{k=t-q}^{t-1}\) are the past q residual errors. ARMAX models incorporate exogenous variables, \(\varvec{x}_t\), into this representation, and their prediction of the next instance of a time series can be written in vector form as

$$\begin{aligned} {\hat{y}}_{t+1} = [\varvec{\phi }^T \varvec{\theta }^T \varvec{v}^T] \begin{bmatrix}\varvec{y}_t\\ \varvec{e}_t\\ \varvec{x}_t\end{bmatrix} \end{aligned}$$
(4)

where \(\varvec{y}_t = [y_{t}, \dots , y_{t-p+1}]^T\), \(\varvec{e}_t = [e_{t}, \dots , e_{t-q+1}]^T\), \(\varvec{x}_t\) is the feature vector at time \(t\), and \(\varvec{\phi } \in {\mathbb {R}}^p\), \(\varvec{\theta } \in {\mathbb {R}}^q\), \(\varvec{v} \in {\mathbb {R}}^d\) are the corresponding coefficient vectors. Here, past residual errors are defined as \(e_k=y_k - {\hat{y}}_k\).
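For illustration, the one-step ARMAX prediction in (4) can be evaluated as in the following minimal sketch; the coefficient values, array names and toy inputs are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the one-step ARMAX forecast in (4); names and values are illustrative.
import numpy as np

def armax_predict(phi, theta, v, y_vec, e_vec, x_t):
    """y_hat_{t+1} = phi^T y_t + theta^T e_t + v^T x_t."""
    return phi @ y_vec + theta @ e_vec + v @ x_t

# toy usage with p = q = 2 and d = 3
phi, theta, v = np.array([0.5, -0.2]), np.array([0.3, 0.1]), np.array([0.05, 0.0, -0.1])
y_vec = np.array([1.2, 0.9])        # [y_t, y_{t-1}]
e_vec = np.array([0.05, -0.02])     # [e_t, e_{t-1}]
x_t = np.array([1.0, 0.5, -0.3])
print(armax_predict(phi, theta, v, y_vec, e_vec, x_t))
```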

3.3 Artificial neural networks (ANNs)

ANNs are a versatile family of models; as universal approximators, they can approximate a wide range of nonlinear functions with high accuracy (Alpaydin, 2014). For presentation purposes, we only work with ANNs with one hidden layer, since the generalization to multi-layer ANNs is straightforward. An ANN with one hidden layer is given by:

$$\begin{aligned} f(\varvec{z}) = \varvec{W}_2 g(\varvec{W}_1 \varvec{z} + \varvec{b}_1) + b_2, \end{aligned}$$
(5)

where \(\varvec{W}_1 \in {\mathbb {R}}^{n \times m}\), \(\varvec{W}_2 \in {\mathbb {R}}^{1 \times n}\), \(\varvec{b}_1 \in {\mathbb {R}}^{n}\), \(b_2 \in {\mathbb {R}}\) are the connection weights and biases, and \(g(\cdot )\) is the nonlinear activation function, which is commonly chosen as the sigmoid function \(\sigma (x)=(1+e^{-x})^{-1}\). Note that our derivations equally apply to other activation functions such as ReLU and tanh. Here, the number of input nodes is m, the number of hidden nodes is n, and the number of output nodes is one, representing the regressed value of the next instance in the sequence.
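A minimal sketch of the forward pass in (5) with a sigmoid activation is given below; the shapes follow the text (m inputs, n hidden units, scalar output), and the storage of \(\varvec{W}_2\) as a length-n vector as well as all variable names are assumptions made for the example.

```python
# Minimal sketch of the one-hidden-layer ANN in (5) with a sigmoid activation.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ann_forward(W1, b1, W2, b2, z):
    """f(z) = W2 sigmoid(W1 z + b1) + b2, returning a scalar prediction."""
    h = sigmoid(W1 @ z + b1)   # hidden activations, shape (n,)
    return W2 @ h + b2         # scalar output

rng = np.random.default_rng(0)
m, n = 6, 4
W1, b1 = rng.normal(size=(n, m)), np.zeros(n)
W2, b2 = rng.normal(size=n), 0.0   # W2 stored here as a length-n vector
print(ann_forward(W1, b1, W2, b2, rng.normal(size=m)))
```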

3.4 Hard decision trees

In order to overcome the restrictions of linear methods, one can employ a nonlinear method such as decision trees. A hard binary decision tree creates an axis-aligned partition of the feature space and produces an output according to the partition under which the input falls. As shown in Fig. 1, a conventional binary decision tree consists of two types of nodes, the internal nodes and the leaf nodes, where each internal node represents a decision rule based on its input. Starting from the root node of the tree, the input is routed either to the left or the right child node based on the decision rules applied at the internal nodes (Bertsimas & Dunn, 2017). As an example, these decision rules can be produced by constructing histograms along each feature dimension, as in LightGBM (Ke et al., 2017). This process continues until the input reaches one of the leaf nodes, at which an output is produced. Consequently, for a tree with N leaf nodes, the input space is partitioned into N regions \(P_n\), where \(n = 1,2,\dots ,N\). Thus, for an input vector \(\varvec{z}\), the output of the decision tree is given as

$$\begin{aligned} f(\varvec{z}) = \sum _{n = 1}^N \gamma _n \mathbbm {1}(\varvec{z} \in P_n), \end{aligned}$$
(6)

where \(\gamma _n \in {\mathbb {R}}\) is the produced output at the leaf node \(n\) and \(\mathbbm {1}\) is the indicator function. Therefore, only one leaf node, i.e., the leaf node into which the input vector \(\varvec{z}\) is routed, contributes to the output because of the hard decisions made at each internal node.

Fig. 1 Depiction of the forward pass of a simple hard decision tree. The larger squares are the internal nodes, where \(N_i\)’s are the corresponding decision rules, while the smaller squares are the leaf nodes with predictions \(\gamma _i\). As an example, the input \(\varvec{z}\) is subjected to the decision rule \(N_1\) at the root node and routed to the right child as a result. Similarly, when subjected to \(N_3\), the input is routed to the third leaf node, at which the output \(\gamma _3\) is produced

Despite their high performance in time series modeling, we cannot use a decision tree with hard boundaries, since these boundaries should be trained to minimize the final error while the hard decision rules are not differentiable.

Remark 1

We note that the error terms directly depend on the tree itself. Hence, while optimizing the tree, the inputs for the future samples, i.e., the \(e_t\) terms, should change accordingly in online learning. Thus, the hard decision tree should also accommodate this, which is not possible. Therefore, in Sect. 3.5, we introduce a differentiable decision tree model as a suitable alternative, which not only optimizes the tree but also handles the changing feature vectors, i.e., \(\varvec{z}_t = [\varvec{y}_t^T, \varvec{e}_t^T, \varvec{x}_t^T]^T\).

3.5 Soft decision trees

A soft decision tree model is a variant of decision trees that directs the input to all leaf nodes at the same time with different weights, unlike the hard decision trees that direct the input to a single leaf node due to the hard decision rules applied at the nodes (Olaru & Wehenkel, 2003; Irsoy et al., 2012; Frosst & Hinton, 2017). This is achieved by the soft decision rules applied at each internal node that route the input to both child nodes at the same time, as shown in Fig. 2. Provided that these decision rules are differentiable, the resulting model is also differentiable and can be updated at each time step by using the gradients of the cost function.

In a soft decision tree with a set of internal nodes \({\mathcal {N}}\) and leaf nodes \({\mathcal {L}}\), the input vector is subjected to a weighted routing at every internal node. As an example, in Fig. 2, we show a simple soft decision tree with three internal nodes and four leaf nodes. Note that, according to this structure, the paths with higher weights contribute more to the final output of the tree. At each of the internal nodes \(N_m \in {\mathcal {N}}\), the routing is controlled by

$$\begin{aligned} p_m = \sigma (\varvec{w}_m^T\varvec{z} + b_m), \end{aligned}$$
(7)

where \(\varvec{z}\) is the input vector, \(\sigma (\cdot )\) is the sigmoid activation function, and \(\varvec{w}_m\), \(b_m\) are the coefficients for the \(m\)th node. Here, the weights for the left and right directions are calculated as \(p_m\) and \(1 - p_m\), respectively. For each leaf node \(\ell \in {\mathcal {L}}\), we define the corresponding full path weight \(p^*_\ell\), which represents the contribution of each leaf node’s decision to the final outcome, as

$$\begin{aligned} p^*_\ell = \prod _{i \in \textrm{A}(\ell )} p_i^{v_{i,\ell }} (1 - p_i)^{1-v_{i,\ell }}, \end{aligned}$$
(8)

where \(\textrm{A}(\ell )\) denotes the ascendant nodes of the leaf node \(\ell\) including the root node, and \(v_{i,\ell }\) represents whether \(\ell\) is a member of the left descendants of node i, i.e., if the next node in the path to \(\ell\) is the left (or right) child of node i, \(v_{i,\ell }\) is equal to 1 (0). Hence, the output of the soft decision tree given the input vector \(\varvec{z}\) is

$$\begin{aligned} f(\varvec{z}) = \sum _{\ell \in {\mathcal {L}}} p^*_\ell \gamma _\ell , \end{aligned}$$
(9)

where \(\gamma _\ell\) is the decision of the leaf node \(\ell \in {\mathcal {L}}\).
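A minimal sketch of the forward pass in (7)-(9) is given below for a full binary soft tree; the heap-order storage of nodes (children of internal node m are 2m+1 and 2m+2) is an implementation assumption made for the example, not a requirement of the model.

```python
# Minimal sketch of the soft decision tree forward pass in (7)-(9).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_tree_forward(W, b, gamma, z):
    """W: (n_internal, d), b: (n_internal,), gamma: (n_leaves,), with
    n_leaves = n_internal + 1 for a full binary tree stored in heap order."""
    n_internal = len(b)
    p = sigmoid(W @ z + b)                    # routing weights p_m, eq. (7)
    path = np.zeros(len(gamma))               # path weights p*_l, eq. (8)

    def descend(node, weight):
        if node >= n_internal:                # reached a leaf
            path[node - n_internal] = weight
            return
        descend(2 * node + 1, weight * p[node])          # left child gets weight p_m
        descend(2 * node + 2, weight * (1.0 - p[node]))  # right child gets 1 - p_m

    descend(0, 1.0)
    return path @ gamma                       # weighted leaf decisions, eq. (9)

# toy usage: depth-2 tree (3 internal nodes, 4 leaves) with a 5-dimensional input
rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 5)), np.zeros(3)
gamma = rng.normal(size=4)
print(soft_tree_forward(W, b, gamma, rng.normal(size=5)))
```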

Fig. 2 Depiction of the forward pass of a simple soft decision tree. At every internal node, starting with the root node, the input \(\varvec{z}\) is routed to both the right and left children according to the weights calculated for that node. The output is calculated as the weighted combination of leaf scores \(\gamma _\ell\), where the weights are given by the path weights \(p^*_\ell\)

4 A nonlinear prediction algorithm based on ARMA models

In this section, we introduce a nonlinear decision tree based prediction algorithm which uses the past input samples (AR terms) \(\{y_{t-i}\}_{i=0}^{p-1}\), the past error terms \(\{e_{t-j}\}_{j=0}^{q-1}\), and also the feature vector \(\varvec{x}_t\). Firstly, we provide a method for training soft decision tree models in an online manner with the stochastic gradient descent (SGD) algorithm (Duda et al., 2000) in the presence of MA terms. Later, we show that the algorithm can easily be applied to ANNs as well by only updating some of the equations. We emphasize that any optimization algorithm using the gradients or Hessians of the final error can be used instead of the SGD algorithm after updating the proper equations, as shown in the following.

4.1 Training soft decision trees with MA terms

We introduce a soft decision tree model which, as shown in Fig. 3, uses its residuals as additional features and then, we describe our algorithm (see Algorithm 1 for the pseudo-code) that jointly optimizes the tree parameters and the residuals that we use as features.

Fig. 3 The conventional training method for sequential prediction in a nonlinear setting involves the use of AR terms, \(\varvec{y}_t\), and exogenous features, \(\varvec{x}_t\), as shown at the top. We present a new method which, as shown at the bottom, utilizes the MA terms, i.e., the residuals that carry state information, to exploit the past performance of the nonlinear model \(f(\cdot )\)

Remark 2

Note that in most applications, gradient descent is applied only in the presence of the AR terms, \(\varvec{y}_t\), or the extra features, \(\varvec{x}_t\). However, MA terms are usually not included, since the error terms \(e_t\), and therefore the feature vector \(\varvec{z}_t = [\varvec{y}_t^T, \varvec{e}_t^T, \varvec{x}_t^T]^T\), dynamically change with every update of the model’s parameters. Hence, they have not been used in a gradient descent algorithm, and their optimization is mainly done with the Kalman filter (Harvey, 1990) for linear models. The algorithm we introduce handles the effects of the changes in the error terms on the gradients, thus enabling the error terms to be used as features that are jointly optimized with the model’s parameters. In this sense, our method performs feature extraction by updating the errors that carry the state information from the past as a result of this process.

At each time step \(t\), we observe a new data instance \(y_t\) and our predictions are given by

$$\begin{aligned} {\hat{y}}_{t+1} = f(\varvec{y}_t, \varvec{e}_{t}, \varvec{x}_t), \end{aligned}$$
(10)

where \(\varvec{y}_t = [y_{t}, \dots , y_{t-p+1}]^T\) are the AR terms, \(\varvec{e}_{t} = [e_{t}, \dots , e_{t-q+1}]^T\) are the MA terms, \(\varvec{x}_t\) is the observed feature vector, and \(e_{t} = y_t - {\hat{y}}_{t}\) is the induced error of the predictor \(f(\cdot )\). As shown in Algorithm 1, we update the tree parameters of \(f(\cdot )\) whenever a new data instance is observed.

We denote the combined feature vector at time \(t\) by \(\varvec{z}_{t} = [\varvec{y}_t^T, \varvec{e}_{t}^T, \varvec{x}_t^T]^T\), i.e., \({\hat{y}}_{t+1} = f_t(\varvec{z}_{t})\), and collect the tree parameters \(\{\varvec{w}_m\}_{m\in {\mathcal {N}}}\), \(\{b_m\}_{m\in {\mathcal {N}}}\) and \(\{\gamma _\ell \}_{\ell \in {\mathcal {L}}}\) in the combined parameter vector \(\varvec{\alpha }_t\). We also partition the node weights of the soft decision tree as \(\varvec{w}_m = [\varvec{w}_{m_1}^T, \varvec{w}_{m_2}^T, \varvec{w}_{m_3}^T]^T\) so that, for the combined input \(\varvec{z}_{t}\), (7) becomes

$$\begin{aligned} p_m = \sigma (\varvec{w}_{m_1}^T\varvec{y}_t + \varvec{w}_{m_2}^T\varvec{e}_{t} + \varvec{w}_{m_3}^T\varvec{x}_t + b_m). \end{aligned}$$
(11)

Our goal is to minimize the accumulated online error

$$\begin{aligned} L_t = \sum _{k=1}^t l(y_k, {\hat{y}}_{k}), \end{aligned}$$
(12)

where \(l(y_k, {\hat{y}}_{k})\) represents the loss suffered from the prediction \({\hat{y}}_{k}\) and, as noted earlier, \(l(\cdot )\) can be chosen as any differentiable loss function such as the squared error loss \((y_k-{\hat{y}}_{k})^2\).

Now, we present the calculations of the gradients used to update the tree parameters according to Algorithm 1. At time step \(t\), we first predict \({\hat{y}}_t =f(\varvec{y}_{t-1}, \varvec{e}_{t-1}, \varvec{x}_{t-1})\) and then suffer the loss \(l(y_t, {\hat{y}}_{t})\) once we observe \(y_t\) and \(\varvec{x}_t\) at time t. Hence, the corresponding gradient is found with the derivative

$$\begin{aligned} \begin{aligned} \frac{\partial l(y_t, {\hat{y}}_{t})}{\partial \varvec{\alpha }}&= \frac{\partial l(y_t, {\hat{y}}_{t})}{\partial {\hat{y}}_{t}} \frac{\partial {\hat{y}}_{t}}{\partial \varvec{\alpha }} \\&= -\frac{\partial l(y_t, {\hat{y}}_{t})}{\partial {\hat{y}}_{t}} \frac{\partial e_{t}}{\partial \varvec{\alpha }}, \end{aligned} \end{aligned}$$
(13)

where the second line follows from the definition \(e_t = y_t - {\hat{y}}_t\).

Algorithm 1 SGD with MA terms

The function \(f(\cdot )\) has the past error terms among its inputs, and each of these error terms also indirectly depends on the differentiation variable \(\varvec{\alpha }\). Therefore, the derivative term \(\frac{\partial e_{t}}{\partial \varvec{\alpha }}\) should be expanded as

$$\begin{aligned} \begin{aligned} \frac{\partial e_{t}}{\partial \varvec{\alpha }}&= \frac{\partial }{\partial \varvec{\alpha }} (y_{t} -f(\varvec{z}_{t-1})) \\&= \frac{\partial }{\partial \varvec{\alpha }} (y_{t} -f(\varvec{y}_{t-1}, e_{t-1}, \dots , e_{t-q}, \varvec{x}_{t-1})) \\&= -\frac{\partial {f(\varvec{z}_{t-1})}}{\partial \varvec{\alpha }} - \sum _{i=1}^q\frac{\partial {f(\varvec{z}_{t-1})}}{\partial e_{t-i}}\frac{\partial e_{t-i}}{\partial \varvec{\alpha }}, \end{aligned} \end{aligned}$$
(14)

which defines a recursive relation of order \(q\) for the series \(\left\{ \frac{\partial e_{k}}{\partial \varvec{\alpha }}\right\} _{k=1}^t\). For notational simplicity, let us define the quantities

$$\begin{aligned} E_k&= \frac{\partial e_{k}}{\partial \varvec{\alpha }}&F_k&= \frac{\partial {f(\varvec{z}_{k})}}{\partial \varvec{\alpha }}&G_{k, k-i}&= \frac{\partial {f(\varvec{z}_{k})}}{\partial e_{k-i}}. \end{aligned}$$

where \(E_k = 0\) for \(k \le 0\), and \(G_{k, k-i} = 0\) for \(k-i \le 0\). Then, the recursive relation for \(\left\{ \frac{\partial e_{k}}{\partial \varvec{\alpha }}\right\} _{k=1}^t\) shown in (14) becomes

$$\begin{aligned} E_k = -F_{k-1} - \sum _{i=1}^q G_{k, k-i} E_{k-i}. \end{aligned}$$
(15)

Equivalently, in order to write the equation (15) in the vector form, we can define the vectors and matrices

$$\begin{aligned} \varvec{E}_k&= \begin{bmatrix} E_k\\ E_{k-1}\\ \vdots \\ E_{k-q+1} \end{bmatrix}&\varvec{F}_k&= \begin{bmatrix} F_k\\ 0 \\ \vdots \\ 0 \end{bmatrix} \end{aligned}$$
$$\begin{aligned} \varvec{G}_k = \left[ \begin{array}{@{}c|c@{}} \begin{matrix} G_{k, k-1} \dots G_{k, k-q+1} \end{matrix} &{} G_{k, k-q} \\ -\varvec{I}_{q-1} &{} \varvec{0} \end{array}\right] \end{aligned}$$

so that equation (15) becomes

$$\begin{aligned} \varvec{E}_k = -\varvec{F}_{k-1} - \varvec{G}_{k} \varvec{E}_{k-1}, \end{aligned}$$
(16)

for \(k=1,\dots ,t\). As a result, the closed-form solution to this set of equations can be written in the form

$$\begin{aligned} \varvec{E}_k = -\varvec{F}_{k-1} + \sum _{i=1}^{k-1} (-1)^{i+1} \Big (\prod _{j=0}^{i-1} \varvec{G}_{k-j}\Big ) \varvec{F}_{k-i-1}, \end{aligned}$$
(17)

through which we can sequentially compute the desired values of the derivative terms \(E_k\) from the values of \(F_k\) and \(G_{k, k-i}\).
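A minimal sketch of this recursion is given below. Following (14)-(15), `F(k)` is assumed to return \(\partial f(\varvec{z}_k)/\partial \varvec{\alpha }\) as a flat vector and `G(k, i)` to return the scalar derivative of the prediction \({\hat{y}}_k = f(\varvec{z}_{k-1})\) with respect to \(e_{k-i}\); both callables are supplied by the chosen model.

```python
# Minimal sketch of the recursion (15) that propagates the error-term derivatives E_k.
import numpy as np

def error_derivatives(t, q, F, G):
    """Return {k: E_k} for k = 1..t, where E_k = d e_k / d alpha is computed as
    E_k = -F_{k-1} - sum_{i=1}^{q} G(k, i) * E_{k-i}, with E_k = 0 for k <= 0."""
    E = {}
    for k in range(1, t + 1):
        acc = -np.asarray(F(k - 1), dtype=float)
        for i in range(1, min(q, k - 1) + 1):   # earlier terms with k - i <= 0 are zero
            acc = acc - G(k, i) * E[k - i]
        E[k] = acc
    return E
```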

The partial derivatives \(F_k\) and \(G_{k, k-i}\) can be calculated directly from the soft decision tree equations (7)-(9). \(G_{k, k-i} = \frac{\partial {f(\varvec{z}_{k})}}{\partial e_{k-i}}\) refers to the derivative of the output of the tree with respect to one of its input entries and can be calculated as

$$\begin{aligned} \begin{aligned} G_{k, k-i}&= \frac{\partial {f(\varvec{y}_k, \varvec{e}_{k}, \varvec{x}_k)}}{\partial e_{k-i}} \\&= \sum _{\ell \in {\mathcal {L}}} \gamma _\ell \frac{\partial {p^*_\ell }}{\partial e_{k-i}}, \\&= \sum _{\ell \in {\mathcal {L}}} \gamma _\ell \sum _{m \in A(\ell )} \frac{p^*_\ell }{p_m^{v_{m,\ell }} (1 - p_m)^{1-v_{m,\ell }}} \frac{\partial p_m^{v_{m,\ell }} (1 - p_m)^{1-v_{m,\ell }}}{\partial e_{k - i}} \end{aligned} \end{aligned}$$

since the leaf scores \(p^*_\ell = \prod _{m \in \textrm{A}(\ell )} p_m^{v_{m,\ell }} (1 - p_m)^{1-v_{m,\ell }}\) are a product of the node scores for the ascendant nodes. Considering the fact that the derivative of sigmoid function is \(\frac{\partial \sigma (x)}{\partial x} = \sigma (x)(1 - \sigma (x))\), it follows that,

$$\begin{aligned} \begin{aligned} G_{k, k-i}&= \sum _{\ell \in {\mathcal {L}}} \gamma _\ell \sum _{m \in A(\ell )} p^*_\ell ((1-v_{m,\ell }) - p_m) \frac{\partial p_m}{\partial e_{k-i}} \\&= \sum _{\ell \in {\mathcal {L}}} \gamma _\ell p^*_\ell \sum _{m \in A(\ell )} ((1-v_{m,\ell }) - p_m) w_{m_2, i+1}, \end{aligned} \end{aligned}$$
(18)

where, for each node m, \(w_{m_2, i+1}\) is the element of node weight \(\varvec{w}_{m_2}\) that multiplies the specific input entry \(e_{k-i}\).

In the case of \(F_k\), the partial derivative with respect to the leaf scores, \(\gamma _\ell\) is simply

$$\begin{aligned} F_k = \frac{\partial {f(\varvec{z}_{k})}}{\partial \gamma _\ell } = p^*_\ell . \end{aligned}$$
(19)

For the internal node weights \(\varvec{w}_m\)

$$\begin{aligned} \begin{aligned} F_k&= \frac{\partial {f(\varvec{z}_{k})}}{\partial \varvec{w}_m } \\&= \sum _{\ell \in {\mathcal {L}} \cap D(m)} \gamma _\ell p^*_\ell ((1-v_{m,\ell }) - p_m) \varvec{z}_{k}, \end{aligned} \end{aligned}$$
(20)

where D(m) denotes the descendant nodes of node m. Lastly, for the biases \(b_m\), the derivatives are calculated in a similar manner,

$$\begin{aligned} \begin{aligned} F_k&= \frac{\partial {f(\varvec{z}_{k})}}{\partial b_m } \\&= \sum _{\ell \in {\mathcal {L}} \cap D(m)} \gamma _\ell p^*_\ell ((1-v_{m,\ell }) - p_m). \end{aligned} \end{aligned}$$
(21)

After \(E_t = \frac{\partial e_{t}}{\partial \varvec{\alpha }}\) is calculated with (17), the gradient can be found using (13). For example, in the case where we use the squared error loss, \(l(y_t, {\hat{y}}_{t}) = (y_t-{\hat{y}}_{t})^2\), the gradient becomes

$$\begin{aligned} \frac{\partial }{\partial \varvec{\alpha }} (y_t-{\hat{y}}_{t})^2 = 2(y_t-{\hat{y}}_{t})\frac{\partial e_{t}}{\partial \varvec{\alpha }}. \end{aligned}$$
(22)
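Putting the pieces together, a minimal sketch of the online update loop under the squared error loss is given below; Algorithm 1 is only summarized as pseudo-code above, so the `model` object, its `predict`, `F`, `G` and `build_features` methods, and its flat parameter vector `params` are assumed interfaces introduced purely for illustration.

```python
# Minimal sketch of SGD with MA terms under squared loss, following (13)-(17) and (22).
import numpy as np

def sgd_with_ma_terms(model, y, X, q, lr=0.01):
    T = len(y)
    e = np.zeros(T + 1)                       # e[k] = y_k - yhat_k, zero for k <= 0
    E = [np.zeros_like(model.params)]         # E[k] = d e_k / d alpha, E[0] is a placeholder
    yhat = np.zeros(T + 1)
    for t in range(1, T + 1):
        z_prev = model.build_features(y, e, X, t - 1)   # z_{t-1} = [y_{t-1}, e_{t-1}, x_{t-1}]
        yhat[t] = model.predict(z_prev)
        e[t] = y[t - 1] - yhat[t]                       # y is 0-indexed: y[t-1] = y_t
        # E_t = -F_{t-1} - sum_i G * E_{t-i}, eqs. (14)-(15)
        E_t = -model.F(z_prev)
        for i in range(1, min(q, t - 1) + 1):
            E_t = E_t - model.G(z_prev, i) * E[t - i]
        E.append(E_t)
        grad = 2.0 * e[t] * E_t                         # eq. (22) combined with (13)
        model.params -= lr * grad                       # SGD step on all parameters
    return yhat[1:], e[1:]
```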

Remark 3

Note that the algorithm we introduced can be easily modified to accommodate batch gradient descent or mini-batch gradient descent instead of a single online update by adjusting the proper terms. Namely, although we present the parameter update only for the most recent data instance in (13), the same structure can be used to calculate the gradients for every data instance. Moreover, the calculation of the necessary terms \(\{E_k\}_{k=1}^t\) is already performed during the recursive calculations described in (15)-(17).

Remark 4

We also note that our calculations can easily be adapted to replace the soft decision model with any other differentiable model as the predictor function. We only need to update the model-specific derivative calculations in (18)–(21) to employ another preferred differentiable predictor depending on the application. As an example, we present the necessary modifications to implement the algorithm on ANNs in Sect. 4.2.

4.2 Application of the algorithm on ANNs

As noted earlier, our algorithm is highly versatile in that it can be easily adjusted to work with any nonlinear differentiable function of choice. To this end, we provide the necessary equations for applying the algorithm on ANNs. This only requires us to substitute the gradient equations given in (18)-(21) with their counterparts for the ANN model. The exact equations can be derived from the original ANN equation (5) with sigmoid activation.

First, for notational clarity, let us write the ANN’s connection weights between the input and hidden layers, \(\varvec{W}_1\), as \(\varvec{W}_1 = [\varvec{W}_{11} \varvec{W}_{12} \varvec{W}_{13}]\) so that, for the combined input \(\varvec{z}_{t}\), the hidden layer activations can be written as \(\sigma (\varvec{W}_{11} \varvec{y}_t + \varvec{W}_{12} \varvec{e}_t + \varvec{W}_{13} \varvec{x}_t + \varvec{b}_1)\).

Based on (5), the partial derivative \(G_{k, k-i}\) can be found as

$$\begin{aligned} G_{k, k-i}&= \frac{\partial {f(\varvec{y}_k, \varvec{e}_{k}, \varvec{x}_k)}}{\partial e_{k-i}} \nonumber \\&= (\varvec{W}_2^T \odot (\sigma (\varvec{W_1} \varvec{z}_k + \varvec{b}_1) \odot (\varvec{1} - \sigma (\varvec{W_1} \varvec{z}_k + \varvec{b}_1))))^T \varvec{w}_{12, i+1}, \end{aligned}$$
(23)

where \(\varvec{w}_{12, i+1} \in {\mathbb {R}}^{n \times 1}\) is the column of \(\varvec{W}_{12}\) that multiplies \(e_{k-i}\), and \(\odot\) denotes the element-wise product operator. In the case of \(F_k\), the partial derivative with respect to the connection weights, \(\varvec{W}_1\) and \(\varvec{W}_2\), and biases, \(\varvec{b}_1\) and \(b_2\) can be calculated as follows:

$$\begin{aligned} \frac{\partial {f(\varvec{z}_{k})}}{\partial \varvec{W}_2} = \sigma (\varvec{W_1} \varvec{z}_k + \varvec{b}_1)^T, \end{aligned}$$
(24)
$$\begin{aligned} \frac{\partial {f(\varvec{z}_{k})}}{\partial b_2} = 1, \end{aligned}$$
(25)
$$\begin{aligned} \frac{\partial {f(\varvec{z}_{k})}}{\partial \varvec{W}_1} = (\varvec{W}_2^T \odot (\sigma (\varvec{W_1} \varvec{z}_k + \varvec{b}_1) \odot (\varvec{1} - \sigma (\varvec{W_1} \varvec{z}_k + \varvec{b}_1)))) \varvec{z}_k^T, \end{aligned}$$
(26)
$$\begin{aligned} \frac{\partial {f(\varvec{z}_{k})}}{\partial b_1} = \varvec{W}_2^T \odot (\sigma (\varvec{W_1} \varvec{z}_k + \varvec{b}_1) \odot (\varvec{1} - \sigma (\varvec{W_1} \varvec{z}_k + \varvec{b}_1))). \end{aligned}$$
(27)
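For reference, the derivatives in (23)-(27) can be evaluated as in the following sketch, where \(\varvec{W}_2\) is stored as a length-n vector and `e_col_idx` denotes the column of \(\varvec{W}_1\) that multiplies the error entry \(e_{k-i}\); these storage choices and names are assumptions made for the example.

```python
# Minimal sketch of the ANN partial derivatives in (23)-(27) with a sigmoid hidden layer.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ann_partials(W1, b1, W2, z, e_col_idx):
    """Return (df/dW2, df/db2, df/dW1, df/db1, G) where G = df/de_{k-i}."""
    h = sigmoid(W1 @ z + b1)        # hidden activations sigma(W1 z + b1)
    s = W2 * h * (1.0 - h)          # W2 (elementwise) sigma'(W1 z + b1), shape (n,)
    dW2 = h                         # eq. (24)
    db2 = 1.0                       # eq. (25)
    dW1 = np.outer(s, z)            # eq. (26)
    db1 = s                         # eq. (27)
    G = s @ W1[:, e_col_idx]        # eq. (23)
    return dW2, db2, dW1, db1, G
```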

5 Experiments

In this section, we analyze the performance of our algorithm under different scenarios. The experiments consist of three parts. In the first part, we demonstrate the learning characteristics of the introduced soft decision tree model with the ARMA features (SDT ARMA) on synthetic data. We compare its performance against a variety of models: soft decision trees with AR features (SDT AR), hard decision trees with AR features (HDT AR), conventional ARMA models (Hyndman & Athanasopoulos, 2021), contemporary LSTM (long short-term memory) networks, and state-of-the-art models such as LightGBM, XGBoost, and CatBoost. We also include the artificial neural network with AR features (ANN AR) and ARMA features (ANN ARMA) to show the effect of our algorithm on another differentiable nonlinear model. Next, we present our experiments on well-known competition datasets, namely the hourly, daily, and weekly data of the M4 Competition Dataset (Makridakis et al., 2020). This stage of the experiments is crucial for understanding how our model performs under different sequential prediction scenarios. Here, we conduct a meticulous comparison of our algorithm with state-of-the-art models as well as a diverse set of additional algorithms. Furthermore, we extend our evaluation to real-world datasets, which include the Delhi daily climate dataset, the residential natural gas consumption in Turkey, and the Hong Kong exchange rate against USD (Frees, 2009). This enables us to assess the practical applicability of our model in diverse and dynamic environments.

For all datasets, we divide the whole dataset into three parts, namely the training data, validation data and test data. There is no shuffling operation while dividing the dataset, i.e., the parts are separated in temporal order. Firstly, we train the models on the training data, and then we fine-tune the hyper-parameters by using the validation data as an independent set for evaluation during training. We reserve 30% of the training data as validation data for both the synthetic and the real-life datasets. The size of the test data is determined to be the same as the validation data. In the M4 competition datasets, on the other hand, the test data was prepared separately by Makridakis et al. (2020), and its size varies depending on the type of the M4 competition dataset. We always set the test data and validation data to be equal in size for the different types of the M4 competition datasets. In all experiments, the only pre-processing step is standardization. The mean and standard deviation of the features are calculated over the training data, and using these statistics, both the training and test data are scaled. During training, we use the stochastic gradient descent algorithm to optimize the soft decision trees and the artificial neural networks. As illustrated in Fig. 3, after we obtain a prediction in each iteration, we sequentially extract the moving-average (MA) features by calculating the error between the predicted value and the target value. We update the feature set by adding this error term as an MA feature, and we can also add MA terms obtained from previous iterations to the feature set for a certain window length. Then, the updated feature vector is prepared for the next iteration. This is crucial during both the training and testing phases of the implementation.
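The sequential feature update described above can be sketched as follows; the fixed window of q residuals and the `model.predict` interface are illustrative assumptions.

```python
# Minimal sketch of the sequential MA-feature update used at both training and test time.
import numpy as np

def rolling_prediction(model, y, X, q):
    e_window = np.zeros(q)                       # [e_t, ..., e_{t-q+1}], zeros initially
    preds = []
    for t in range(len(y)):
        z_t = np.concatenate([X[t], e_window])   # AR/exogenous features plus MA features
        y_hat = model.predict(z_t)
        preds.append(y_hat)
        e_window = np.roll(e_window, 1)          # shift the error window by one step
        e_window[0] = y[t] - y_hat               # the newest residual becomes e_t
    return np.array(preds)
```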

5.1 Synthetic data

This section demonstrates the advantage of the soft decision trees with the ARMA features over different models. We compare the soft decision trees with the ARMA features against the soft decision trees with the AR features, the hard decision trees with the AR features, the artificial neural networks with the ARMA and AR features, the ARMA models, and an LSTM network. Note that our aim in this section is to demonstrate the individual effect of each component, e.g., the MA features and the soft decision trees, on the performance. Moreover, we critically assess our algorithm by comparing it against leading gradient boosting algorithms such as LightGBM, XGBoost, and CatBoost. This comparison is essential to understand how our model stands in relation to current state-of-the-art methodologies.

For the controlled experiments, we generate the artificial data as follows. Firstly, we divide the feature space into four regions using a random vector, say \(\varvec{a}\) and another vector that is orthogonal to this random vector, say \(\varvec{b}\). Then, we use the ARMA equation with different weights for each region as

$$\begin{aligned} \begin{aligned} \text {if }&\left( \varvec{x}^T \varvec{a} \ge 0 \text { and } \varvec{x}^T \varvec{b} \ge 0\right) , \\&y_t = [0.20, -0.30, 0.10, 0.12, 0.80, -0.60, 0.50, -0.40]\varvec{x} + e_t, \\ \text {if }&\left( \varvec{x}^T \varvec{a} \ge 0 \text { and } \varvec{x}^T \varvec{b}< 0\right) , \\&y_t = [0.15, -0.35, 0.05, 0.10, 0.75, -0.80, 0.50, -0.40]\varvec{x} + e_t, \\ \text {if }&\left( \varvec{x}^T \varvec{a}< 0 \text { and } \varvec{x}^T \varvec{b} \ge 0\right) , \\&y_t = [-0.40, 0.25, -0.20, -0.22, 0.65, 0.70, 0.50, -0.40]\varvec{x} + e_t, \\ \text {if }&\left( \varvec{x}^T \varvec{a}< 0 \text { and } \varvec{x}^T \varvec{b} < 0\right) , \\&y_t = [0.28, -0.38, 0.18, 0.15, -0.70, -0.85, 0.60, 0.50] \varvec{x} + e_t, \end{aligned} \end{aligned}$$
(28)

where \(\varvec{a} = [a_1, a_2, a_3, a_4, a_5, a_6, a_7, a_8]^T\) and \(\varvec{b} = [b_1, b_2, b_3, b_4, b_5, b_6, b_7, b_8]^T\) are orthogonal random vectors, \(\varvec{x} = [y_{t-1}, y_{t-2}, y_{t-3}, y_{t-4}, e_{t-1}, e_{t-2}, e_{t-3}, e_{t-4}]^T\) is the vector of the ARMA inputs, and \(e_t\) is white noise consisting of independently and identically distributed (i.i.d.) normal random variables with mean \(\mu = 0\) and standard deviation \(\sigma = 0.1\). Starting from \(t = 0\), we generate the target \(y_t\) iteratively by using (28). We assign greater weights to the moving-average (MA) terms to emphasize the significance of these terms while generating the synthetic data. Note that the target and error values for \(t < 0\) are set to 0. We generate 1500 synthetic samples and partition the data into training (the first 900 instances), validation (the next 300 instances) and test sets (the last 300 instances). Moreover, we generate another set of synthetic data by using \(y_{t-1},y_{t-2},e_{t-1},e_{t-2}\) as the features to conduct experiments with a different feature set size and different target-feature relationships.
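A minimal sketch of this generator is given below; the region weights and the noise level follow (28), while the random seed, the orthogonalization of \(\varvec{b}\) against \(\varvec{a}\), and the zero initialization are implementation assumptions.

```python
# Minimal sketch of the piecewise ARMA generator in (28) with 4 AR and 4 MA inputs.
import numpy as np

def generate_synthetic(T=1500, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.normal(size=8)
    b = rng.normal(size=8)
    b -= (b @ a) / (a @ a) * a                    # make b orthogonal to a
    W = np.array([
        [0.20, -0.30, 0.10, 0.12, 0.80, -0.60, 0.50, -0.40],
        [0.15, -0.35, 0.05, 0.10, 0.75, -0.80, 0.50, -0.40],
        [-0.40, 0.25, -0.20, -0.22, 0.65, 0.70, 0.50, -0.40],
        [0.28, -0.38, 0.18, 0.15, -0.70, -0.85, 0.60, 0.50],
    ])
    y, e = np.zeros(T), rng.normal(0.0, sigma, size=T)
    for t in range(T):
        # x = [y_{t-1..t-4}, e_{t-1..t-4}], with zeros for indices before t = 0
        x = np.array([y[t - i] if t - i >= 0 else 0.0 for i in range(1, 5)] +
                     [e[t - i] if t - i >= 0 else 0.0 for i in range(1, 5)])
        region = 2 * int(x @ a < 0) + int(x @ b < 0)   # select one of the four regions
        y[t] = W[region] @ x + e[t]
    return y
```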

5.1.1 Ablation study

We conduct an ablation study to demonstrate the effect of each component of the soft decision tree with the ARMA features on the synthetic data. We consider the first four lags of both \(y_t\) and \(e_t\), which correspond to \(y_{t-1}\), \(y_{t-2},\) \(y_{t-3}\), \(y_{t-4}\), and \(e_{t-1}\), \(e_{t-2}\), \(e_{t-3}\), \(e_{t-4}\), as the features. Lags of \(y_t\) are used as the AR features, and our algorithm extracts the MA features to predict \(y_t\). The first comparison is between the soft decision trees with the ARMA features and with the AR features, to show the contribution of the MA features. Subsequently, we compare the results with the AR features for the soft and hard decision trees. The next comparison is between the ARMA models and the soft decision trees with the ARMA features, to highlight the importance of the soft decision trees, because both use auto-regressive terms and past error values. In addition, we conduct a comprehensive comparison between our algorithm and state-of-the-art gradient boosting machine models, exemplified by LightGBM, XGBoost and CatBoost. Finally, we apply our algorithm that exploits the MA terms to the artificial neural network to compare it with the artificial neural network with AR features and to show the general applicability and the merit of our algorithm. Note that all the other hyper-parameters are optimized using the validation set. We implement the soft decision trees with the AR and ARMA features using the same set of hyper-parameters for a fair comparison.

To evaluate the performance of our models, we generate 200 synthetic time series and apply the various models to each of them. Loss sequences are obtained by computing the difference between the predicted values and the target values for each series. In order to remove any bias from individual sequences, we average these loss sequences over the 200 randomly generated sequences as \(l_t^{ave} = \frac{1}{200}\sum _{i=1}^{200}(y_t^{(i)}-{\hat{y}}_t^{(i)})^2\), where \(y_t^{(i)}\) and \({\hat{y}}_t^{(i)}\) are the target and forecast values at time t, respectively, for the i’th sequence. Then, we calculate the time-averaged accumulated error to further smooth the average error curves, \(l_t^{acc} = \frac{1}{t}\sum _{j=1}^{t} l_j^{ave}\). Figure 4 shows the time-averaged accumulated error of the models on the synthetic data generated by using 8 features. All models have a similar error pattern, but the soft decision tree and the artificial neural network with the ARMA features achieve the lowest time-accumulated error across the test data. While the error patterns of the SDT ARMA and ANN ARMA models are very close to each other, they maintain a significant margin over the other models. Note that we do not show the time-averaged accumulated error of the linear ARMA model because its error values are much larger than those of the other models and distort the plot. Both the soft decision trees and the hard decision trees are appropriate for the nonlinear synthetic data. The soft decision trees outperform the hard decision trees in fitting the data and reducing the error because of their ability to learn region boundaries in addition to partitioning. Also, the hidden layer and the nonlinear activation functions in the artificial neural network increase the representation capability for the intricate nonlinear patterns within the data. Furthermore, the utilization of MA features in the models leads to a significant reduction of the time-accumulated error. The ARMA model, on the other hand, provides worse performance than the soft trees with the ARMA features since the generated data has nonlinear characteristics and the linear ARMA model fails to fit this type of data well. In contrast, the soft decision tree with the ARMA features can capture nonlinear relationships in the data. The state-of-the-art gradient boosting algorithms and the LSTM perform similarly, i.e., their error patterns are almost the same in the time-averaged accumulated error plot. Extracting MA features in a sequential manner increases the performance of the ANN and the soft decision trees beyond the state-of-the-art results, since our algorithm enables important state information to be obtained and included in the gradient flow while optimizing the model.
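The two averaging steps above can be computed directly from the per-sequence predictions, as in the following sketch.

```python
# Minimal sketch of the averaged and time-averaged accumulated errors used in Fig. 4.
import numpy as np

def averaged_accumulated_error(y_true, y_pred):
    """y_true, y_pred: arrays of shape (n_sequences, T) over the generated sequences."""
    l_ave = np.mean((y_true - y_pred) ** 2, axis=0)           # l_t^{ave}
    l_acc = np.cumsum(l_ave) / np.arange(1, len(l_ave) + 1)   # l_t^{acc}
    return l_ave, l_acc
```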

We also use the root mean square error (RMSE) to demonstrate the prediction errors of all models, given as \(RMSE = \sqrt{\frac{1}{n} \sum _{i=1}^{n} (y_i - {\hat{y}}_i)^2}\), where \(y_i\) and \(\hat{y_i}\) are the target and forecast values, respectively, and n is the total number of instances in the test data. We conduct our experiments on two different synthetic data configurations. While the first synthetic data configuration is generated by using 2 AR and 2 MA features, the second is generated with 4 AR and 4 MA features. We then generate 200 sequences for each of these two configurations, average the results over the 200 sequences, and report the RMSE of the test predictions in Table 1. Based on the results of these experiments, the soft decision tree with the ARMA features has the best performance. For soft decision trees, using ARMA features produces better results than using AR features, demonstrating the effect of the MA features on the model performance. Moreover, the superior performance of the ANN with the ARMA features over the ANN with only the AR features further reinforces our initial observation. In the most important comparison, where we compare the overall performance of our model with the state-of-the-art gradient boosting algorithms and the LSTM, we significantly surpass the state-of-the-art performance.

We also performed additional experiments on synthetic data to show that employing a set of static errors that are fixed after a certain stage of the training does not have any beneficial contribution. As shown in Table 2, if we use the static error terms derived from a certain training step as additional features, the learning process is adversely affected and the error on the test data increases, since the static errors have become obsolete in relation to the final version of the model parameters, ultimately misleading the model. On the other hand, our method extracts the residuals and updates the feature set at each step. In this way, the model and the error terms are always linked to each other, allowing the error terms to provide beneficial information to the model.

Table 1 RMSE performance of all models on synthetic data
Fig. 4 Comparison of the time-averaged accumulated errors for synthetic data generated by 8 features

Table 2 Comparison with static ARMA extraction on synthetic dataset

5.2 M4 competition dataset

The well-known and publicized M4 Competition Dataset (Makridakis et al., 2020) contains 100,000 time series with different lengths and properties. There exist diverse types of sequences in this dataset, such as yearly, quarterly, monthly, weekly, daily, and hourly data. Both the training and test sequences are provided separately in this dataset. To conduct our experiments, we select sequences with various temporal resolutions, including weekly, daily, and hourly data. The weekly dataset consists of 359 time series with varying lengths and scales. The daily dataset involves 4227 time series; these are long time series with an average length of 2357.38 samples. We select the daily data so that we can evaluate how well our models capture long-term trends. We also pick the hourly data as the last dataset because the data in this set shows seasonal characteristics. The hourly dataset contains 414 time series, far fewer than the daily dataset.

For the M4 Competition Dataset, we use the mean absolute percentage error (MAPE) as the forecasting error since it is scale-independent. Hence, it is suitable for comparing time series of varying scales in the M4 Competition Dataset, and it is given as \(MAPE = \frac{1}{n}\sum _{i=1}^{n}\left| \frac{y_i - \hat{y_i}}{y_i}\right|\), where \(y_i\) and \(\hat{y_i}\) are the target and forecast values, respectively, and n is the total number of instances in the test sequence. Also, we provide the mean absolute error (MAE), which calculates the average absolute difference between the target and forecast values as \(MAE = \frac{1}{n} \sum _{i=1}^{n} |y_i - {\hat{y}}_i|\), where \(y_i\), \(\hat{y_i}\) and n are the same as in the MAPE.
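Both metrics are straightforward to compute; a minimal sketch is given below.

```python
# Minimal sketch of the MAPE and MAE metrics used in the M4 experiments.
import numpy as np

def mape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y_true - y_pred) / y_true))

def mae(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_true - y_pred))
```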

Our experimental design includes the following 11 models:

1. The naive model, which predicts the target as the previous instance, i.e., \({\hat{y}}_{t} = y_{t-1}\).

2. The hard decision tree with the AR features (HDT AR).

3. The autoregressive moving average model with exogenous regressors (ARMAX).

4. LightGBM, as a state-of-the-art gradient boosting model.

5. CatBoost, as a state-of-the-art gradient boosting model.

6. XGBoost, as a state-of-the-art gradient boosting model.

7. LSTM, as a recurrent neural network.

8. The soft decision tree with the AR features (SDT AR).

9. The soft decision tree with the ARMA features (SDT ARMA).

10. The artificial neural network with the AR features (ANN AR).

11. The artificial neural network with the ARMA features (ANN ARMA).

For all models, we use the same features, including lag values and the rolling mean and standard deviation calculated over varying window sizes determined according to the specifics of the task and dataset. For instance, in the daily M4 competition experiment, we select window sizes of 4, 7, and 28 days, whereas in the hourly experiment, the window sizes are 4, 12, and 24 hours to capture short-term trends and periodicity effectively. After preparing the features of the datasets, we run the models and algorithms on them. Since the hourly dataset has seasonality, we use the seasonal ARMAX (SARMAX) (Korstanje, 2021) model instead of the ARMAX only for the hourly dataset. It is an extension of the traditional ARMA that considers exogenous inputs and seasonality to analyze more complex time series data. We tune all the hyper-parameters for all models using the validation set. The validation set has the same size as the test set, and it is selected from the end of the training set. For example, in the weekly dataset, a time series is chosen, and then the last 13 data points are reserved for validation. Before testing, we refit the model to all of the training data by using the best hyper-parameters. To reduce the computational burden and improve the efficiency of our analysis, we randomly select 200 time series for each data type in the M4 Competition Dataset, normalize them, and compute the average MAE and MAPE scores. The results are displayed in Table 3.
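A minimal sketch of the lag and rolling-window feature construction is given below, using pandas; the lag set is illustrative and the window sizes are the ones quoted for the daily data.

```python
# Minimal sketch of the lag / rolling-window feature engineering used for all models.
import pandas as pd

def build_features(y, lags=(1, 2, 3), windows=(4, 7, 28)):
    """y: pd.Series of the target. Returns lag values plus rolling means and
    standard deviations computed over past values only (shifted by one step)."""
    feats = pd.DataFrame(index=y.index)
    for lag in lags:
        feats[f"lag_{lag}"] = y.shift(lag)
    past = y.shift(1)                           # exclude the current value from windows
    for w in windows:
        feats[f"roll_mean_{w}"] = past.rolling(w).mean()
        feats[f"roll_std_{w}"] = past.rolling(w).std()
    return feats
```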

Table 3 shows that our algorithm with the soft decision tree performs the best for all datasets. We obtain more accurate results than the soft decision tree with the AR features. Similarly, exploiting the MA features in the artificial neural network model yields significantly improved outcomes compared to utilizing only the AR features. If the data is non-stationary, we observe a remarkable effect of the MA features. In particular, the hourly data of the M4 Competition Dataset is considerably non-stationary. Hence, the MA features help to reduce the prediction error significantly. As a result, the MA features have a significant influence on correcting predictions by using the residual information obtained from prior prediction errors. In particular, we outperform the state-of-the-art LightGBM, XGBoost and CatBoost models and the LSTM as a recurrent model. Some of our results surpass the state-of-the-art results, albeit by a small margin. The important point is that, while SDT AR and ANN AR are not capable of competing with the state-of-the-art models, when our algorithm is applied and the feature set is changed by sequentially extracting MA features, these models surpass the state-of-the-art models. These improvements for both the ANN and the soft decision tree demonstrate the potential merit and validity of our algorithm on datasets with various temporal characteristics.

The box plots in Fig. 5 show the MAPE distribution of the model predictions over the 200 randomly selected time series for the weekly, daily and hourly datasets. The results indicate that the median MAPE score of the SDT ARMA model is lower than that of the SDT AR model. The interquartile range (IQR) of the SDT ARMA model is also narrower than that of the SDT AR model, indicating that the middle 50% of its MAPE scores are more tightly clustered. Similarly, the ANN ARMA model reduces both the median MAPE score and the IQR compared to the ANN AR model. Hence, our algorithm enables the models to perform better on the 200 randomly selected time series; it decreases the MAPE scores especially for the time series where the MAPE is high and makes the models more robust.

Table 3 Performance metrics of various forecasting models on M4 Competition Datasets
Fig. 5 Box plots of MAPE scores of the M4 competition datasets

5.3 Real-life datasets

We next perform experiments on three well-known real-life time series datasets:

  • Delhi daily climate dataset The Delhi daily climate dataset (Rao, 2020) is a weather forecasting dataset. It consists of 4 years of weather data from 2013 to 2017 and thus provides a large training set. The target is the mean temperature, and the dataset provides 3 exogenous features, namely humidity, wind speed and mean pressure. We additionally extract date-related features such as the month of year, quarter of year and day of year encoded with sine and cosine (see the sketch after this list). We also extract the AR features \(y_{t-1}\), \(y_{t-2}, y_{t-3}\) and calculate additional lags, rolling means and rolling standard deviations with different window lengths.

  • Hong Kong Exchange Rate against USD The Hong Kong exchange rate against USD dataset (Frees, 2009) covers the period from 2005 to 2007. It is a daily exchange rate series, and the objective is to predict the future rate. The AR features are the past values \(y_{t-1}\), \(y_{t-2}, y_{t-3}\). Similar to the Delhi daily climate dataset, we extract exogenous features in addition to the AR features.

  • The residential natural gas consumption in Turkey This dataset consists of the total residential natural gas demand of Turkey for 1000 days, spanning from 2018 to 2020. The data has a nonlinear characteristic while also displaying a linear pattern. The dataset includes exogenous features such as religious holidays, national holidays, weekends and temperature readings. We also extract additional exogenous features similar to the Delhi daily climate dataset.
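A minimal sketch of the sine/cosine date encoding mentioned above is given below; the helper name add_date_features, the column suffixes and the assumption of a pandas DatetimeIndex are illustrative choices, not the exact implementation.

    import numpy as np
    import pandas as pd

    def add_date_features(df: pd.DataFrame) -> pd.DataFrame:
        """Encode month, quarter and day of year as sine/cosine pairs."""
        idx = df.index  # assumed to be a pandas DatetimeIndex
        for name, values, period in [
            ("month", idx.month, 12),
            ("quarter", idx.quarter, 4),
            ("dayofyear", idx.dayofyear, 366),
        ]:
            df[f"{name}_sin"] = np.sin(2 * np.pi * values / period)
            df[f"{name}_cos"] = np.cos(2 * np.pi * values / period)
        return df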

To compare the models, we use the same experimental setup as in the M4 Competition Dataset experiments. The only difference is that we use the SARMAX model, which includes seasonal terms, instead of the ARMAX for all datasets. For hyper-parameter tuning, 30% of the data is reserved for testing, and the remaining data is split into a 30% validation set and a 70% training set. The test results are presented in terms of the MAPE score in Table 4.
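For clarity, the sketch below shows a minimal version of this chronological split; the helper name and the purely index-based slicing are assumptions, not the exact code used.

    def train_val_test_split(series):
        """Chronological split: last 30% test; of the rest, last 30% validation."""
        n = len(series)
        test_start = int(0.7 * n)
        train_val = series[:test_start]
        val_start = int(0.7 * len(train_val))
        return train_val[:val_start], train_val[val_start:], series[test_start:]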

Table 4 Performance metrics of various forecasting models on real-life datasets
Fig. 6 Comparison of the time-averaged accumulated errors for the Delhi daily climate dataset

Based on the results presented in Table 4, we observe that our algorithm with the soft decision tree outperforms all other models for all three datasets. For the Delhi daily climate and gas demand datasets, the soft decision tree with the ARMA features gives the best scores. Our performance exceeds that of leading models such as LightGBM, XGBoost and CatBoost; although the margins are not large, our results surpass these top-tier baselines. The extraction of MA features improves the performance of both the soft decision tree and the artificial neural network models in a similar way. The datasets exhibit nonlinear properties; therefore, the soft decision tree with the ARMA features outperforms the SARMAX model in predicting the given time series. The soft decision tree with the ARMA features is capable of learning both the parameters and the partitioning orientation, allowing for a more flexible fit to the data. In Fig. 6, the soft decision tree with the ARMA features achieves the lowest cumulative error across all test data. As the test data progresses, the accumulated errors of the models with only the AR features increase significantly after a point, whereas the models to which our algorithm is applied are not affected, i.e., their error does not change significantly, which shows the importance of using MA features for stability. Note that models such as SARMAX, Deep ANN AR and Deep ANN ARMA are not included in the figure since their large error jumps would distort it. Another important observation is that the real-life datasets depend strongly on \(y_{t-1}\), as can be seen from the naive results; models therefore tend to give great importance to \(y_{t-1}\), but extracting MA terms changes this behavior and breaks the dominance of \(y_{t-1}\). The large difference in the errors on the HK exchange dataset results from this situation.
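To make the quantity plotted in Fig. 6 concrete, the sketch below computes a time-averaged accumulated error over the test horizon; the use of the absolute error is our reading of the plotted measure and is stated here as an assumption.

    import numpy as np

    def time_averaged_error(y_true, y_pred):
        """Running average of the absolute prediction error up to each test step."""
        abs_err = np.abs(np.asarray(y_true) - np.asarray(y_pred))
        return np.cumsum(abs_err) / np.arange(1, len(abs_err) + 1)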

To understand the learning characteristics of our algorithm on a deeper network, we enhance our approach by incorporating a more complex network structure. We increase the number of layers in the search space and fine-tune the model with larger layer counts. In order to further increase the model complexity, we also add residual connections to the deeper network. We then conduct experiments with and without our algorithm. Except for the Gas Demand dataset, the large networks without our algorithm (Deep ANN AR) diverge on the Delhi Daily Climate and HK Exchange Rate datasets. On the other hand, when we apply our algorithm to the same networks (Deep ANN ARMA) that had diverged due to over-fitting, these networks perform well and the error decreases significantly. Moreover, on the Gas Demand dataset the deeper network does not diverge, and the error decreases considerably in terms of both the MAE and MAPE scores. As a result, our algorithm can prevent the ANN model from over-fitting and diverging, since it creates a dynamic relationship between the model and its predictions by sequentially extracting the error terms.
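As an indication of the kind of residual connection referred to above, a minimal sketch of a fully connected residual block is given below; the width, the two-layer form and the ReLU activation are assumptions, since the exact architecture is not specified in the text.

    import torch
    from torch import nn

    class ResidualBlock(nn.Module):
        """Fully connected block with a skip connection: output = x + F(x)."""
        def __init__(self, width: int):
            super().__init__()
            self.fc1 = nn.Linear(width, width)
            self.fc2 = nn.Linear(width, width)
            self.act = nn.ReLU()

        def forward(self, x):
            return x + self.fc2(self.act(self.fc1(x)))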

In all of the experiments, we observe an improvement in the error when we apply our algorithm to the soft decision tree and the ANN, and we also outperform the state-of-the-art models. Although the soft decision tree AR and ANN AR models alone do not compete with the leading models, their capabilities improve significantly when our algorithm is applied and the feature set is dynamically extended by sequentially extracting MA features; they can then exceed the performance of the current state-of-the-art models. These enhancements come with a computational cost: there is a trade-off between accuracy and time, and the higher accuracy achieved through the more complex algorithm comes at the cost of increased training time. In Table 5, we report the computational cost of our algorithm both for training one epoch and for inference on one instance during testing. This experiment is conducted on the Delhi Daily Climate dataset, and we observed very similar results on the other datasets. Here we can see the main limitation of our method, namely the increased training time due to the optimization process. The joint optimization of the model parameters and the features requires a sequential gradient calculation, which introduces additional algorithmic steps. Moreover, since we employ a sequential feature extraction process, we cannot use mini-batch optimization. However, although our algorithm increases the training time, the inference times are nearly identical for the models with AR and ARMA features, indicating that once the models are trained they make predictions at almost the same speed. As a result, although training is more costly, the significant increase in performance and the lack of additional inference cost make our algorithm a reasonable choice.
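To illustrate why mini-batching is not directly applicable, the schematic training loop below (in PyTorch) shows that the MA features at time t are residuals of the model's own earlier predictions made with the evolving parameters, so the time steps must be processed in order. The names model, loss_fn, optimizer, the tensors X and y, and the number of MA terms q are assumptions, and the sketch deliberately omits the rederived gradient terms through the residual features; it only captures the sequential dependence that rules out independent batching.

    import torch

    def train_one_epoch(model, loss_fn, optimizer, X, y, q=3):
        """One sequential pass; X: (T, d) feature tensor, y: (T,) target tensor."""
        residuals = torch.zeros(q)  # buffer of the last q prediction errors (MA terms)
        for t in range(len(y)):
            # The input at time t includes the residual buffer, which depends on
            # predictions made earlier in this same pass with updated parameters.
            x_t = torch.cat([X[t], residuals])
            y_hat = model(x_t).squeeze()
            loss = loss_fn(y_hat, y[t])
            optimizer.zero_grad()
            loss.backward()   # gradient of a single-step loss
            optimizer.step()
            # Shift the newest prediction error into the residual buffer.
            residuals = torch.cat([(y[t] - y_hat).detach().reshape(1), residuals[:-1]])
        return model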

Table 5 Time cost of training for 1 epoch and of 1 inference for the soft decision tree and ANN with the AR and ARMA configurations on the Delhi daily climate dataset

6 Conclusion

We introduced a highly versatile and effective gradient-based nonlinear sequential prediction algorithm. In this framework, we use not only the AR terms but also the MA terms to exploit the algorithm's performance on the previous samples. We jointly optimize the model parameters and the feature vectors, which are extended to include the residuals. To this end, after including the residual terms, we rederive the gradient equations, optimizing both the parameters of the nonlinear learner and the feature vectors it currently uses. We provide the full sets of update equations for both the soft decision trees and the ANNs. As shown, our method is generic and can be applied to other machine learning models that support gradient-based updates. By using the residual information, which represents the state or current performance of the learner, our algorithm provides significant performance improvements. We demonstrate these improvements through experiments on synthetically generated data and real-life datasets, including the well-known competition datasets.