1 Introduction

Global time series forecasting models (GFM, Januschowski et al., 2020) have shown their capability of providing accurate forecasts compared to traditional univariate forecasting models such as Exponential Smoothing (ETS, Hyndman et al., 2008) and Auto-Regressive Integrated Moving Average models (ARIMA, Box et al., 2015), in particular with obtaining superior results in the recently held M4 (Makridakis et al., 2018) and M5 (Makridakis et al., 2022) forecasting competitions. In contrast to traditional univariate forecasting models that build isolated models to forecast each series, GFMs are trained across collections of time series allowing the models to learn cross-series information, which also allows them to generalise better and control model complexity on a global level (Montero-Manso & Hyndman, 2021).

While a lot of research in this space focuses on neural networks and deep-learning models (e.g., Salinas et al., 2020; Hewamalage et al., 2021; Oreshkin et al., 2020), tree-based GFM algorithms have risen greatly in popularity after dominating the winning methods of the M5 forecasting competition (Makridakis et al., 2022). In particular, these approaches use LightGBM (Ke et al., 2017) which is a highly efficient gradient boosted tree-based algorithm, to train global models across time series. Also other tree-based algorithms that can be trained as GFMs such as eXtreme Gradient Boosting (XGBoost, Chen & Guestrin, 2016) and CatBoost (Prokhorenkova et al., 2018) have been used in the forecasting space. The advantages of these methods (especially compared with deep-learning methods) are their speed of training, their ease to include external regressors (numerical and categorical), and their relative ease and robustness of hyperparameter tuning. However, these algorithms are general-purpose algorithms, and none of them is specifically designed for forecasting. When performing regression, they use the average of the training outputs at a leaf node as the prediction for an instance that reaches that node during forecasting. Furthermore, while potentially easier to handle than in other algorithms, their accuracy is still subject to tuning of a number of hyperparameters such as the maximum tree depth, maximum number of leaf nodes, minimum number of instances in a leaf node and learning rate. These findings motivate us to develop a global tree-based algorithm specifically for forecasting, achieving higher forecasting accuracy with minimal hyperparameter tuning.

Piecewise linear models such as Threshold Autoregressive (TAR) models (Tong, 1978; 1993; Terasvirta, 1994) have greatly contributed to the econometric and forecasting fields during the past few decades. These models partition the instance space of a particular single time series to be forecast into several subspaces where each subspace is modeled by a separate Autoregressive (AR) model. There are many variations of TAR models such as Self Exciting Threshold Autoregressive (SETAR, Tong, 1993), Smooth Transition Autoregressive (STAR, Terasvirta, 1994), Exponential STAR (ESTAR), Logistic STAR (LSTAR) and Neuro-Coefficient STAR (Medeiros & Veiga, 2000) models.

Similar to the works of Aznarte and Benítez (2010), Aznarte et al. (2010) and Aznarte et al. (2007) that establish analogies between certain types of TAR models and Fuzzy Rule Based Systems, we exploit the analogy of TAR models and decision trees that fit a linear model per leaf, for global time series forecasting, to introduce a new tree-based algorithm that uses the underlying concept of SETAR models in defining the splits. Thus, we name this model as SETAR-Tree. In particular, our method internally finds the optimal past lag and the threshold that should be used to split the instances of a particular tree node into child nodes. The instances whose corresponding values of the optimal lag are greater than or equal to the selected optimal threshold and less than the selected optimal threshold are grouped separately whereas each instance group is considered as a child node. In contrast to the traditional general purpose tree-based regression algorithms that compute the average of the training outputs at a leaf node as the final prediction, our proposed method is a forecasting-specific tree which builds a Pooled Regression model (PR, Gelman & Hill, 2006; Montero-Manso & Hyndman, 2021) which is a global linear AR model, at each leaf node allowing the model to learn cross-series information. As our model is a forecasting-specific tree with global regression models in the leaf nodes, it can be essentially identified as a global (cross-series) hierarchical SETAR model. Furthermore, the SETAR-Tree uses some time-series-specific stopping procedures. Exploiting the analogy of TAR models and regression trees, our proposed SETAR-Tree internally tunes the maximum tree depth using a statistical linearity test well established in the literature for TAR models (Terasvirta, 1994), as well as measuring the error reduction percentage at each node split. Thus, our proposed model requires minimal external hyperparameter tuning whereas it provides competitive results even with its default hyperparameters. The proposed method is also applicable to time series forecasting problems that use external covariates during modelling.

Proposing a novel tree algorithm for forecasting naturally lends itself to extend the procedure to a forest, which is another main contribution of this study. In line with the traditional forest-based algorithms such as Random Forest (RF, Breiman, 2001), our forest model uses a collection of diverse SETAR-Trees that are trained using a set of randomly selected time series. The forecasts provided by all trees are averaged to obtain the final forecasts. The trees are made diverse in terms of the significance of the statistical linearity test and error reduction percentage that are used to split each node. Thus, the forecasting accuracy of the forest model is even less sensitive to the hyperparameters than the tree.

Our proposed SETAR-Tree and SETAR-Forest algorithms in particular also outperform a set of state-of-the-art tree-based algorithms and forecasting benchmarks with statistical significance across eight experimental datasets on four error metrics. All implementations of this study are publicly available at: https://github.com/rakshitha123/SETAR_Trees.

The remainder of this paper is organized as follows: Sect. 2 reviews the relevant prior work. Section 3 explains the theoretical concepts of SETAR models, and the proposed tree and forest algorithms in detail. Section 4 explains the experimental framework, including the datasets, error metrics, benchmarks and statistical testing. An analysis of the results is presented in Sect. 5. Section 6 concludes the paper and discusses possible future research.

2 Related work

In the following, we discuss the related prior work in the areas of TAR models, GFMs and the state-of-the-art tree-based forecasting models.

2.1 Threshold autoregressive models

TAR models (Tong, 1978; 1993) are piece-wise linear models that model the state space of a given prediction problem using multiple AR models. TAR models are also known as AR regime switching models which use separate (same or different order) AR functions to model different regimes.

TAR models have variants such as SETAR and STAR (Terasvirta, 1994) which are different in terms of the transition function and the method of defining regimes. The simplest version of the TAR model is known as the SETAR model (Tong, 1993) which defines two or more regimes based on a particular lagged value of the time series itself. Thus, SETAR models are required to estimate the optimal lag, and the optimal threshold corresponding with that chosen optimal lag to define regimes. The most common way of defining these optimal lag and threshold values is using a grid search. The optimal thresholds can be also identified by analysing the places of time series showing significant increasing or decreasing rates (Ghosh et al., 2006) and by using memetic algorithms (Bergmeir et al., 2012). The transition function used by the SETAR model is generally a step function. Thus, during prediction, SETAR models use only one AR model. Versions of the SETAR model exist that use statistical tests such as a Lagrange Multiplier (LM) test (Breusch and Pagan, 1980) to determine the linearity captured by the AR models (Terasvirta, 1994). In STAR models (Terasvirta, 1994), the transition can happen either using a past value of a series or an external variable. Unlike the SETAR models, STAR models use a transition function such as exponential or logistic, resulting in ESTAR and LSTAR models, respectively, and thus, the transition happens smoothly. During prediction, STAR models use multiple AR models to determine the final output where the corresponding weight of each contributing AR model is determined based on the transition function.

Aznarte and Benítez (2010) and Aznarte et al. (2010) establish the equivalence between TAR models and certain types of fuzzy rule-based systems. Inspired by this work, we exploit equivalences between TAR models and decision trees, and extend TAR models to global hierarchical models which train across series.

TAR models have been used to address many real-world forecasting problems such as stock market forecasting (Narayan, 2006), exchange rate forecasting (Pippenger & Goering, 1998) and electricity price forecasting (Rambharat et al., 2005). To the best of our knowledge, none of these TAR models have been used as global models across time series yet.

2.2 Global time series forecasting

Global time series forecasting (Januschowski et al., 2020) is a relatively recent trend in forecasting that became popular after contributing to the winning methods of the M4 (Makridakis et al., 2018) and M5 (Makridakis et al., 2022) forecasting competitions. In particular, many top solutions of the M5 competition used global (tree-based) models. GFMs build a single forecasting model across many series sharing the same parameters across all series. Thus, the model has the capability to explore cross-series information with a fewer amount of parameters compared to traditional univariate forecasting models.

Many works use Recurrent Neural Networks (RNN, Hewamalage et al., 2021) as the baseline model when developing global forecasting frameworks. The winning method of the M4 competition (Smyl, 2020) uses an ensembling approach, ensemble of specialists (Smyl, 2020), that combines the predictions of a set of globally trained RNNs. Smyl and Kuber (2016) use globally trained RNNs for forecasting. One of the state-of-the-art probabilistic forecasting algorithms, DeepAR (Salinas et al., 2020), uses a global AR RNN. Godahewa et al. (2022) propose a weekly forecasting framework that uses global RNNs.

In particular, the winning method of the M5 forecasting competition uses a globally trained LightGBM model (Ke et al., 2017). GFMs have also contributed to the winning methods of many other recently held Kaggle competitions (Bojer & Meldgaard, 2020). Nowadays, GFMs are used with many real-world applications such as energy optimisation (Godahewa et al., 2022), sales demand forecasting (Bandara et al., 2019) and emergency medical services demand forecasting (Bandara et al., 2020).

Godahewa et al. (2021) and Bandara et al. (2020) show that a GFM should be trained with a set of related time series. Those authors further claim that if a collection of time series has heterogeneous series, then the series should be first grouped based on their similarity and then, separate GFMs should be trained per each series group. In line with this concept, our proposed SETAR-Tree model trains multiple GFMs, one per leaf node, because the instances belonging to a particular leaf node are similar in terms of their values corresponding with the optimal lag identified using the SETAR methodology. As the baseline GFM, we use a linear AR model trained across a set of time series that is known as a PR model (Gelman & Hill, 2006).

2.3 Tree-based forecasting models

There are many popular tree-based algorithms used for forecasting, including regression trees (Loh, 2011), gradient boosted trees (Friedman, 2001), XGBoost (Chen & Guestrin, 2016), CatBoost (Prokhorenkova et al., 2018), and LightGBM (Ke et al., 2017).

As forecasting is usually a regression task with numerical values, regression trees (Loh, 2011) are used for forecasting, as opposed to decision trees for classification. Spiliotis (2022) provides an overview of how decision trees can be used for time series forecasting. In particular, the author highlights that in the forecasting domain, a regression tree provides a piece-wise approximation to a single continuous function in standard regression. Regression trees use every possible binary split on every input attribute to determine the node splits. Regression trees have been successfully applied to address real-world time series forecasting applications. Torgo and Oliveira (2014) use bagged regression trees to handle the diverse dynamic regimes and non-stationarities observed in real-world time series. Cerqueira et al. (2019) use a regression tree as a base model of a pool of forecasting experts where the forecasts of experts are dynamically combined using arbitrating to obtain the final forecasts. Gradient boosted trees (Friedman, 2001) use a collection of regression trees as a sequential ensemble model. The first regression tree is trained on the actual observations of the training instances where the remaining trees are trained to predict the residuals of the previous tree in the sequence. XGBoost (Chen & Guestrin, 2016) is an accurate, efficient and scalable version of gradient boosted trees. To define the best node splits, XGBoost uses similarity scores in a way that the gain is maximised. Furthermore, to reduce overfitting, XGBoost uses regularisation parameters when calculating similarity scores and tree pruning. The second winning method of the M4 forecasting competition, Feature-based Forecast Model Averaging (FFORMA, Montero-Manso et al., 2020) uses XGBoost as a meta-learner to determine the weights that should be used to combine the predictions of a set of base forecasting models. CatBoost (Prokhorenkova et al., 2018) considers the order of data points during modelling and thus, it is more suitable to address time series forecasting problems. CatBoost more effectively deals with categorical variables compared to other tree-based algorithms.

As stated earlier, tree-based forecasting models recently obtained popularity in the forecasting field after contributing to the winning approach and many other top submissions in the M5 forecasting competition (Makridakis et al., 2022). In particular, many top submissions of the competition incorporate a highly efficient gradient boosted tree-based model, LightGBM (Ke et al., 2017). The LightGBM trees use the leaf-wise tree growth approach instead of the level-wise tree growth used in traditional tree-based algorithms. It identifies the best node splits by using Gradient-based One-Side Sampling (GOSS). LightGBM can handle extremely large datasets as well and nowadays, it is used with many real-world forecasting applications such as cryptocurrency price trend forecasting (Sun et al., 2020), sales demand forecasting (Weng et al., 2020) and wind power forecasting (Ju et al., 2019). Januschowski et al. (2021) highlight that tree-based algorithms are blackbox learners which are highly efficient and more robust than the existing deep learning techniques. In particular, the LightGBM algorithm offers a large number of loss functions and missing value handling methods where the users can choose the corresponding options based on their application.

Even though these tree algorithms are efficient, they are not tailored to the forecasting problem. For example, they consider the average of the training outputs in a particular leaf node as the forecast of the test instances that reach that node during forecasting. In contrast, our proposed SETAR-Tree algorithm trains a global linear AR model in each leaf node, allowing each leaf node to learn the cross-series information.

In the literature, there already exist tree models which train linear models in leaf nodes, known as linear regression model trees (Quinlan, 1992). They analyse every possible split to determine the node splits where the split which maximises the difference between the standard deviation of the actual target values of the parent node and the mean of the standard deviations of the actual target values of child nodes, is considered as the optimal split. The prediction for a given test instance is obtained by its corresponding leaf node where the prediction is then adjusted based on the nodes from the root node to its corresponding leaf node. Cubist (Kuhn & Johnson, 2013) is such a linear model tree that performs rule-based modelling. It trains linear models in all tree nodes where the predictions are obtained from the leaf nodes and are smoothed afterwards. Lefakis et al. (2019) propose a piecewise linear regression tree which trains regularised linear models in leaf nodes. The nodes are split in a way that the least square errors of linear predictors after splitting are minimised. Loh (2002) proposes a regression tree which uses the significance of the chi-square test to identify the variables that should be used during node splitting. Dutang and Guibert (2022) propose a splitting procedure for regression trees where the optimal splits are chosen based on the explicit likelihood. Zeileis et al. (2008) propose a regression model which uses recursive partitioning where the nodes are only further split if there exists an overall parameter instability. Hothorn et al. (2006) propose CTree, a regression model that uses recursive partitioning with conditional inference procedures. There are hardly any works in the literature that propose tree-based models particularly for forecasting that we are aware of, and the few works available address local forecasting, not global forecasting models. In particular, the STR-Tree proposed by da Rosa et al. (2008), though proposed for regression, is deemed by its authors to be trivially extendable to forecasting, and Epprecht and Veiga (2012) propose a tree method to predict stock market returns where in the leaf nodes AR models with extra inputs are trained. These works use the concept of STAR models during node splitting, which is similar to our approach but limits them effectively to smaller sample sizes, as the smooth transitions make their model training computationally considerably more costly. Thus, these methods are inherently more suitable for local forecasting and using them in current typical global forecasting scenarios brings scalability issues.

Our proposed SETAR-Tree algorithm is in particular tailored to the time series forecasting task by using the concept of SETAR models in defining splits where the optimal lag and the threshold that should be used for node splitting is determined using a grid search approach, without considering every possible split as used by the above explained linear regression model trees. Our grid search approach selects the lag and the threshold that is corresponding with the split which minimises the Sum of Squared Errors (SSE) at the child nodes. Furthermore, to determine the validity of a split, we use a statistical linearity test and the error reduction percentage that can be gained from the split, where a node is only further split if making that split is worth enough. In leaf nodes, we train global AR models allowing the models to learn the cross-series information and they do not use any regularisation during model fitting. Also, the predictions for the test instances are obtained only from their corresponding leaf nodes where no further adjustments are applied to the predictions as in linear regression model trees (for details, see Sect. 3.3).

A collection of parallel executed tree models is known as a forest. The RF (Breiman, 2001) algorithm is the most common type of forest algorithms which averages the predictions provided by a set of diverse regression trees to obtain the final predictions. The trees can be diverse in terms of the data and features used during training. This ensembling technique is also known as bagging in the machine learning literature and it is expected to improve the accuracy of the final predictions compared to the predictions of the individual trees by properly addressing data, model and parameter uncertainties (Petropoulos et al., 2018; Bergmeir et al., 2016). Also, Coulombe (2021) argues that the RF does an optimal pruning as the splits that overfit are averaged out through bagging. In the context of linear model trees, the Macroeconomic RF (MRF) proposed by Coulombe (2020) consists of a collection of linear model trees that use the concept of TAR models during node splitting. However, MRF is not designed for global time series forecasting. Unlike our SETAR-Tree, the individual trees in MRF use regularisation during parameter estimation, and use moving average factors calculated using the lagged values of time series as regressors during training. The trees also do not use any stopping criterion based on statistical tests or error reduction percentages. In particular, the individual trees in MRF do not use early stopping where they are allowed to grow into higher depths ensuring the trees are sufficiently diversified. Furthermore, Athey et al. (2019) propose generalized RFs, a method based on RFs that can be used for non-parametric statistical estimation. Consequently, we propose a forest model that uses a collection of SETAR-Trees where the individual trees are diverse in terms of the significance of the linearity test and the error threshold used to make the node splits (for details, see Sects. 3.3.2, 3.3.3 and 3.4).

3 Methodology

This section first gives a brief overview of the theory of SETAR models. Then, we explain the terminology of GFMs. Later, the proposed SETAR-Tree and forest algorithms are explained in detail.

3.1 Self exciting threshold autoregressive models

For our study, we consider a 2-regime SETAR model (Tong, 1993) which is defined in Eq. 1. Here, \(y_t\) is the value of the series at time t, n is the number of past time series lags, T is the threshold value, and \(\beta ^{i}\) and \(\epsilon ^{i}\) are respectively the AR parameters and error terms of the ith regime.

$$\begin{aligned} y_t = {\left\{ \begin{array}{ll} \sum _{l=1}^{n} \beta ^{(1)}_{l}{y_{t-l}} + \epsilon ^{(1)}_t,&{} \text {if } y_{t-l} < T \\ \\ \sum _{l=1}^{n} \beta ^{(2)}_{l}{y_{t-l}} + \epsilon ^{(2)}_t,&{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(1)

As shown in Eq. 1, SETAR models define regimes based on a particular lagged value of the time series itself. One regime is defined for the training instances where the corresponding lagged value is less than the threshold value T, and the other regime is defined for the remaining instances where the corresponding lagged value is greater than or equal to T. Thus, SETAR models have as hyperparameters the optimal lag, l, and the optimal threshold value, T corresponding with l to define the regimes.

3.2 Terminology of global forecasting models

For our study, we consider univariate time series forecasting which predicts the future values of a particular time series using its own past values and possibly, some external covariates. We use the usual out-of-sample evaluation in forecasting that reserves a block of data from the very end of the time series for evaluation (Bergmeir & Benítez, 2012). For that, the time series are first split into training and test parts where the training parts are used for model training and the test parts are the actual values corresponding with the expected forecast horizon.

Let \(TS_{1}\) and \(TS_{2}\) be two time series. To train purely autoregressive GFMs, as shown in Fig. 1, an embedded matrix is constructed using \(TS_{1}\) and \(TS_{2}\) where each row contains a window of a time series which is composed of a set of lagged values and the corresponding next value. The rows of the embedded matrix that are corresponding with the training parts of the series are used to train GFMs. The forecasts corresponding with the expected forecast horizon are obtained for each series using the trained GFMs. During the evaluation, the forecasts are compared with their corresponding actual values that are in the test set and the forecasting errors are measured using the error metrics explained in Sect. 4.2.

Fig. 1
figure 1

Visualisation of an embedded matrix created using two time series: \(TS_{1}\) and \(TS_{2}\). Each row contains a window of a time series which is composed with a set of lagged values (L1 to L5) and the corresponding next value/true output (y). The windows corresponding with the training and test sets are respectively highlighted in yellow and blue

3.3 SETAR-Tree model

Figure 2 shows an overview of an example SETAR-Tree model. Level 0 refers to the root node of the tree. The root node is an embedded matrix constructed using the training parts of \(TS_{1}\) and \(TS_{2}\) as explained in Sect. 3.2. Each row of the embedded matrix is used as a training instance. We use the concept of SETAR models to split the training instances into child nodes. As we only consider 2-regime SETAR models, a given tree node is split into two child nodes. The optimal lag, l and the optimal threshold, T that are used to split each node are determined by using a grid search approach as explained in Sect. 3.3.1. The instances where the corresponding value of the Lag l is less than T, and greater than or equal to T, are separately grouped into child nodes. However, a node is only split if making that split is worth enough. In particular, a node is only split if there exists remaining non-linearity in the training instances of that node (Sect. 3.3.2) and/or it can gain a certain training error reduction by splitting (Sect. 3.3.3). Thus, our model uses the leaf-wise tree growth approach where the branches of the tree may have different lengths. For an example, at Level 1, only one node is further split into child nodes. The other node is not split and it has been treated as a leaf node because further splitting that node does not fulfill the specified criteria.

Fig. 2
figure 2

Overview of an example SETAR-Tree constructed using \(TS_{1}\) and \(TS_{2}\). The matrices containing the past time series lags and their corresponding true outputs are considered as nodes. Each row of a matrix is considered as a training instance. The training instances of each node are optimally split into child nodes based on the concept of SETAR models. Here l and T mentioned with the split nodes refer to the optimal lag and the optimal threshold that are used to split at a particular node. A node is only split if there exists remaining non-linearity in the training instances of that node and/or it can gain a certain training error reduction by splitting. This example tree model has three levels. This model uses a leaf-wise tree growth approach and thus, the branches of the tree may have different lengths

The two methods we use to determine whether a node should be split or not, a statistical linearity test and the error reduction gained by node splitting, are explained in detail in Sects. 3.3.2 and 3.3.3. They control the growth of the tree and thus, the model can itself determine the maximum tree depth that it should allow. In particular, these two methods act as the stopping criteria of the tree growth where constructing the tree is completed when none of the nodes should be further split. The leaf nodes can be there at different tree levels as shown in Fig. 2 where one leaf node is in Level 1 and two leaf nodes are in Level 2. In each leaf node, a global PR model is trained using the instances corresponding with that node.

During testing, the model identifies the leaf node corresponding with a given test instance by following the same optimal lags and thresholds it previously used during node splitting starting from the root node. The prediction of the test instance is then obtained using the trained PR model corresponding with its leaf node.

The SETAR-Tree requires minimal hyperparameter tuning. A few parameters are used for the statistical linearity test and for measuring error reduction in node spitting. However, the model provides competitive results with their default parameters that we provide and thus, setting externally tuned hyperparameters is not mandatory for our model.

3.3.1 Finding optimal lags and thresholds

To find the lag and threshold that will be used to split a given node, we utilise a standard grid-search approach. Given a node, we consider each lag and a set (grid) of thresholds with which to split the node into two children. For each lag and threshold pair in the grid, separate linear models are fitted to the two subsets of data formed by partitioning the chosen lag at the specified threshold. The lag and the threshold that are results in the split with the minimum SSE at the child nodes is selected as the “optimal” lag and threshold for splitting.

To increase the efficiency of the grid-search, we exploit the properties of the least-squares estimates during fitting. Recall that the leaf models are linear AR models of the form

$$\begin{aligned} y = X \beta + \varepsilon \end{aligned}$$

where X is the lag-matrix, y are the outputs, \(\beta\) the AR coefficients and \(\varepsilon\) a vector of errors. The least-squares estimates of \(\beta\) are well known to be

$$\begin{aligned} \hat{\beta } = (X^{T} X)^{-1} X^{T} y \end{aligned}$$
(2)

where T and -1 respectively refer to the transpose and inverse of the matrix. The SSE of the least-squares fit is given by

$$\begin{aligned} \textrm{SSE}(\hat{\beta }) = y^T y - \hat{\beta }^T X^T X \hat{\beta }. \end{aligned}$$
(3)

An important property of the least-squares estimates are that they depend on X and y only through their inner products. We can exploit this to fit the models for all potential thresholds in \(O(n \log n)\) time. Let \(X_l\) be the lag variable we will be splitting on, and let \(s = (s_1,\ldots ,s_q)\) be the q candidate thresholds for splitting. Define the sets

$$\begin{aligned} S_i = \left\{ j: x_{j,l} \le s_i \right\} , (i = 1,\ldots ,q) \; \textrm{and} \; S_{q+1} = \left\{ j: x_{j,l} > s_q \right\} . \end{aligned}$$

These sets indicate which rows fall into each of the \(q+1\) different partitions of the data defined by the set of thresholds. Let \(\bar{x}_i\) denote the i-th row of X; we can then define the quantities

$$\begin{aligned} B_{(k)} = \sum _{i=1}^k \sum _{j \in S_i} \bar{x}_{j} \bar{x}_{j}^{T}, \; \; c_{(k)} = \sum _{i=1}^k \sum _{j \in S_i} \bar{x}_{j} y_j \; \; \textrm{and}\; \; d_{(k)} = \sum _{i=1}^k \sum _{j \in S_i} y_j^2. \end{aligned}$$
(4)

These are the relevant inner products for the “left” child node model formed by splitting the data at each of the thresholds \(s_k\). From Eq. 2, if we choose to split at threshold \(s_k\), the least squares estimates for the left and right child node models can be written as

$$\begin{aligned} \hat{\beta }{(L_k)} = B_{(k)}^{-1} c_{(k)} \; \; \textrm{and} \; \; \hat{\beta }{(R_k)} = \left( X^{T} X - B_{(k)} \right) ^{-1} \left( X^{T} y - c_{(k)} \right) \end{aligned}$$
(5)

respectively. Using Eq. 3, the SSE in the left and right child node models can be then written as

$$\begin{aligned} \textrm{SSE}{(L_k)} = d_{(k)} - \left( \hat{\beta }{(L_k)} \right) ^{T} B_{(k)} \left( \hat{\beta }{(L_k)} \right) \end{aligned}$$
(6)

and

$$\begin{aligned} \textrm{SSE}{(R_k)} = \left( y^{T} y - d_{(k)} \right) - \left( \hat{\beta }{(L_k)} \right) ^{T} \left( X^{T} X - B_{(k)} \right) \left( \hat{\beta }{(L_k)} \right) \end{aligned}$$
(7)

respectively. When considering a lag \(x_l\) to split on, we can first sort the rows of X by \(x_l\). In practice, these sorted values can then be used to find a set of suitable thresholds; typically we choose fifteen equispaced quantiles of the vector \(x_l\). We can then use Eq. 4 to compute the relevant inner products for the first threshold, i.e., \(B_{(1)}\), \(c_{(1)}\) and \(d_{(1)}\). We can then iteratively try each potential threshold \(s_1,\ldots ,s_q\) in turn using Eqs. 5, 6 and 7, exploiting the fact that we can update the inner products using

$$\begin{aligned} B_{(k+1)} = B_{(k)} + \sum _{j \in S_k} \bar{x}_j \bar{x}_j^T, \end{aligned}$$

as we move from threshold \(s_k\) to \(s_{k+1}\), with similar expressions for \(c_{(k)}\) and \(d_{(k)}\) to efficiently compute the running totals of these quantities.

3.3.2 Statistical linearity test

The usage of linearity tests in determining the amount of regimes and the goodness of TAR models is well-established in the econometrics and statistics literature (Terasvirta, 1994). However, to the best of our knowledge, the linearity tests have not been used to determine the goodness of a SETAR model trained across series, in particular to determine the depth of a hierarchical SETAR model as we propose.

We use the general linear F-test (Box, 1953) to determine whether there exists a remaining non-linearity of a set of training instances in a particular tree node. The node is only further split if there exists a significant remaining non-linearity.

The null hypothesis of the linear F-test, \(H_0\) states that there exists no significant remaining non-linearity in the training instances and thus, the parent node is enough to model the instances. The alternative hypothesis, \(H_1\) states that there exists a significant remaining non-linearity in the training instances and therefore, a node should be further split into child nodes.

In particular, the linear F-test calculates a statistic, namely the F-statistic, which compares a complex model and a reduced model, and determines whether the reduced model should be rejected in favour of the complex model. In our node splitting scenario, the reduced and complex models respectively refer to the global PR models that are trained on parent and child nodes corresponding with a particular split. The suitability of the split is determined based on the value of the F-statistic calculated using the SSE corresponding with PR models built on parent and child nodes.

The SSE of a parent node and the corresponding child nodes are respectively defined in Eqs. 8 and 9. Here, N is the number of training instances, \(y_k\) are the observed values, \(yp_k\) are the fitted values of the parent node model and \(yc_k\) are the fitted values of the child node models.

$$\begin{aligned} SSE(P)= & {} \sum _{k=1}^{N} ({y_{k} - yp_{k}})^2 \end{aligned}$$
(8)
$$\begin{aligned} SSE(C)= & {} \sum _{k=1}^{N} ({y_{k} - yc_{k}})^2 \end{aligned}$$
(9)

The F-statistic corresponding with a node split is then calculated using Eq. 10 where \(df_P\) and \(df_C\) are the degrees of freedom of parent and child node models, respectively.

$$\begin{aligned} F^* = \left( \frac{SSE(P) - SSE(C)}{df_P - df_C}\right) {\left( \frac{SSE(C)}{df_C}\right) }^{-1} \end{aligned}$$
(10)

For simplicity, we consider the same number of past lag values when training the PR models in all nodes. Hence, \(df_P\) and \(df_C\) are defined as in Eqs. 11 and 12 where L is the number of past lags used for modelling.

$$\begin{aligned} df_P= & {} N - L - 1 \end{aligned}$$
(11)
$$\begin{aligned} df_C= & {} N - 2L - 2 \end{aligned}$$
(12)

Substituting \(df_P\) and \(df_C\) reduces Eq. 10 as follows.

$$\begin{aligned} F^* = \left( \frac{SSE(P) - SSE(C)}{L + 1}\right) {\left( \frac{SSE(C)}{N - 2L - 2}\right) }^{-1} \end{aligned}$$
(13)

The p-value of the F-test is determined by comparing the value of \(F^*\) to the F distribution. If the p-value is less than the considered significance level, \(\alpha\), then \(H_1\) is accepted and the node is further split into child nodes. Otherwise, the node is kept as a leaf node of the tree.

As a form of multiple testing correction (Noble, 2009), we gradually decrease the value of \(\alpha\) when the tree grows level by level as defined in Eq. 14, where \(\alpha _{d}\) and \(\alpha _{d+1}\) are the significance levels corresponding with the tree levels d and \(d+1\), respectively.

$$\begin{aligned} \alpha _{d+1} = \frac{\alpha _{d}}{ significance\_divider } \end{aligned}$$
(14)

This reduces unnecessary tree growth and acts as a way of preventing model overfitting. The values of \(\alpha _0\) and \(significance\_divider\) are user defined, with default values of 0.05 and 2, repectively. As our trees perform binary splits, using a \(significance\_divider\) of 2 corresponds to a Bonferroni correction, where the family-wise error is kept constant across the levels of the tree. Larger values lead accordingly to more conservative significance levels.

3.3.3 Error reduction in node splitting

We also consider the error reduction percentage to determine whether a node should be further split or not. The child nodes models together use more parameters during model fitting and thus, they always give lower training error compared to the parent node model. Here, the purpose is to check whether the error reduction percentage that can be gained by splitting a node is considerably high or not. A node is only split if the error reduction percentage is greater than or equal to a particular error threshold (\(e_t\)) and otherwise, the node is considered as a leaf node of the tree. The value of \(e_t\) is user defined and based on our preliminary experiments, it is set to 3%, by default.

Algorithm 1 shows the training and testing phases of the SETAR-Tree model. The algorithm takes the following parameters as inputs.

figure a
  • training_set contains the training parts of each series of the dataset

  • forecast_horizon the length of the prediction period

  • lag the number of past lags used to forecast the next series value

  • stopping_criteria the criteria that are used to determine whether a node should be split

  • max_depth maximum possible tree depth (optional)

  • alpha0 the initial significance level to be used by the F-test (optional)

  • significance_divider the significance level used by the F-test is sequentially divided at each tree level by this value (optional)

  • error_threshold minimum error reduction that should be gained to split a node (optional)

The algorithm first creates an embedded matrix using the training parts of each series which is considered as the root node of the tree. The function create_input_matrix creates this embedded matrix taking training_set and lag as inputs (line 6). It then finds the optimal lag and the threshold that should be used to split the node into child nodes using the function get_opt_params (line 11). This function internally conducts a grid search over the lags of the node and selects the lag and the threshold that are corresponding with the split which provides the minimum SSE at the child nodes. Then, the function split_node (line 14) splits the parent node into child nodes using the optimal lag (opt_lag) and the threshold (opt_threshold) returned by get_opt_params. Based on the stopping_criteria, the split is further assessed to determine whether the child nodes should be added to the tree or not. The functions check_linearity (line 16) and check_error_reduction (line 18) respectively check whether there exists remaining non-linearity in the training instances of the parent node and whether the training error reduction gain by node splitting is greater than the error_threshold. If the split is worth to make, then the child nodes are added to the tree (line 24). The tree building procedure is stopped if none of the nodes at the current tree level is split (line 34). If at least one node at the current tree level is split, then the same node splitting procedure is conducted for the next tree level. At each tree level, the significance level used by the F-test, alpha, is decreased by dividing it by the significance_divider (line 32). After completing the tree construction process, our method trains a global PR model at each leaf node using the function train_pr_model (line 40). During the testing phase, it first creates a set of test instances, test_set, using the function create_test_set which returns the last set of lags in each training series (line 44). The forecasts for the required forecast horizon are then obtained iteratively. In each iteration, the function find_leaf_node_index (line 47) identifies the corresponding leaf node for each test instance by following the same optimal lags (opt_lags) and thresholds (opt_thresholds) it previously used during node splitting. The predictions for the test instances are then obtained using the corresponding trained models at their leaf nodes (line 48). In each iteration, the test_set is also updated using the function update_test_set, by appending the forecasts of the previous iteration (line 50).

3.3.4 Training with covariates

The proposed SETAR-Tree model can also be trained with external numerical and categorical covariates in addition to the past lags of time series. The numerical covariates are treated in the same way as the time series lagged values. The categorical covariates are converted into numerical format beforehand by applying one-hot encoding which then allows to treat them in the same way as numerical attributes. When determining the optimal attribute and the threshold to split a node, the past lags of series, numerical covariates and categorical covariates are all considered together.

3.4 SETAR-Forest model

A forest is an ensemble model that aggregates the predictions provided by a set of diverse tree models to obtain a prediction, often through bootstrap aggregation (“bagging”) (Breiman, 2001).

We introduce a new forest algorithm, SETAR-Forest, which consists of a collection of diverse SETAR-Trees. The trees are made diverse by varying the initial significance level (\(\alpha _0\)), the significance divider used to calculate the sequence significance levels and the error reduction percentage threshold (\(e_t\)) used during node splitting. In a SETAR-Tree, these three parameters are set by default based on our preliminary experiments and in contrast to that, in the forest algorithm, each tree selects the values of these parameters randomly. For our experiments, we consider ten SETAR-Trees for a SETAR-Forest. An individual SETAR-Tree does not use bagging whereas the SETAR-Forest does. A set of randomly chosen 80% of instances and all attributes of them are used to train each SETAR-Tree in a SETAR-Forest. Thus, the values of bagging frequency, bagging fraction and feature fraction of our SETAR-Forest are 10, 0.8, and 1, respectively, by default. However, our implementation takes these values as external parameters and thus, users can modify them based on the application.

During testing, the forecasts of the test instances are obtained by averaging the corresponding forecasts provided by all SETAR-Trees in the forest.

4 Experimental framework

In this section, we discuss the experimental datasets, error metrics, and benchmarks used in our experiments.

4.1 Datasets

Table 1 Datasets information

We use eight experimental datasets,Footnote 1 six publicly available datasets and two simulated datasets, to evaluate the performance of our proposed SETAR-Tree and forest algorithms. The datasets are briefly explained in the following and Table 1 provides their summary statistics.

  • Rossmann Sales Dataset (Kaggle, 2015) The dataset from the Rossmann Sales forecasting competition that shows the daily sales of a set of Rossmann stores. The missing observations of this dataset are replaced by carrying forward the corresponding last observations (LOCF method). Several covariates are also available with this dataset such as the daily number of customers to each store, status of the stores (open or closed), promotion details and state/school holidays information.

  • Kaggle Wikipedia Web Traffic Dataset (Google, 2017) The first 1000 time series from the Kaggle Wikipedia Web Traffic forecasting competition that shows the number of daily hits for a given set of Wikipedia web pages. Following the procedure suggested by Hewamalage et al. (2021), the missing observations of this dataset are replaced by zeros. Covariates are also available with this dataset such as day of the week. The dataset was extracted from Godahewa et al. (2021).

  • Favorita Dataset (Kaggle, 2018) The first 1000 time series from the Corporación Favorita Grocery Sales forecasting competition that shows daily unit sales of a set of items sold at different Favorita stores. The missing observations of this dataset are replaced by zeros. Covariates are also available with this dataset such as day of the week.

  • M5 Dataset (Makridakis et al., 2022) A subset of time series from the M5 forecasting competition that shows daily unit sales of a set of items sold at different Walmart stores. To make the amount of time series comparable with the other datasets, we select the items that belong to the \(HOBBIES\_2\) department which is one of the ten departments used in the competition.

  • Tourism Monthly Dataset (Athanasopoulos et al., 2011) Monthly dataset from the Tourism forecasting competition. The dataset was extracted from Godahewa et al. (2021).

  • Tourism Quarterly Dataset (Athanasopoulos et al., 2011) Quarterly dataset from the Tourism forecasting competition. The dataset was extracted from Godahewa et al. (2021).

  • Chaotic Logistic Dataset A simulated dataset used in Hewamalage et al. (2021a) constructed by using the Chaotic Logistic Map (May, 1976) Data Generation Process (DGP)

  • Mackey-Glass Dataset A simulated dataset used in Hewamalage et al. (2021a) constructed by using the Mackey-Glass Equation (Mackey and Glass, 1977) DGP.

For our study, we have mostly used datasets where globally trained tree models are known to provide good forecasts. Note that four datasets: Rossmann, Kaggle Web Traffic, Favorita and M5 are from Kaggle competitions where tree-based models have won the first or second positions (Bojer & Meldgaard, 2020). The two simulated datasets also contain non-linear time series where tree-based models have provided accurate forecasts compared to PR and neural network models based on the results published by Hewamalage et al. (2021a). To add some diversity into the pool of datasets, we consider the Tourism Monthly and Quarterly datasets from the Tourism forecasting competition where the traditional univariate forecasting models have provided better forecasts. There are other popular benchmarking datasets in the global forecasting field, most notably the M3 (Makridakis & Hibon, 2000) and M4 (Makridakis et al., 2018) datasets. However, these datasets are very heterogeneous as they contain time series from many different use cases and require special preprocessing and/or ensembling techniques such as clustering and ensemble of specialists (Godahewa et al., 2021; Bandara et al., 2020; Smyl, 2020) to make global models competitive. As we deem such additional techniques to be out of the scope of our current paper, we do not consider these datasets for this study.

4.2 Error metrics

We measure the performance of our models using the modified version of symmetric Mean Absolute Percentage Error (msMAPE, Suilin, 2017) and Mean Absolute Scaled Error (MASE, Hyndman & Koehler, 2006) which are commonly used error metrics in the time series forecasting field. Equations 15 and 16 respectively define the msMAPE and MASE error metrics. Here, \(F_k\) are the forecasts, \(Y_k\) are the actual values for the required forecast horizon, M is the number of instances in the training set, N is the number of data points in the test set and S is the length of the seasonal cycle. In Eq. 15, the \(\epsilon\) is set to its proposed default value of 0.1.

$$\begin{aligned} msMAPE= & {} \frac{100\%}{N}\sum _{k=1}^{N} \frac{|F_{k} - Y_{k}|}{(max(|Y_{k}|+ |F_{k}|+ \epsilon , 0.5 + \epsilon ))/2} \end{aligned}$$
(15)
$$\begin{aligned} MASE= & {} \frac{\sum _{k=1}^{N} |F_{k} - Y_{k}|}{\frac{N}{M - S}\sum _{k=S+1}^{M} |Y_{k} - Y_{k - S} |} \end{aligned}$$
(16)

For the two simulated datasets, the length of the seasonal cycle is considered as one. Since all these error measures are defined for each time series, we calculate the mean and median values of them across a dataset to measure the model performance. Therefore, four error metrics are used to evaluate each model: mean msMAPE, median msMAPE, mean MASE and median MASE.

4.3 Benchmarks and variants

We use seven global models: a PR model (Gelman & Hill, 2006), Feed-Forward Neural Network (FFNN, Goodfellow et al., 2016) and five tree-based models: a regression tree (Loh, 2011), Cubist (Kuhn & Johnson, 2013), LightGBM (Ke et al., 2017), XGBoost (Chen & Guestrin, 2016), and CatBoost (Prokhorenkova et al., 2018), as the main benchmarks of our study. These tree-based models are popular well-performing models in the forecasting domain. Regression trees have provided successful insights and foundations for machine learning approaches in the forecasting domain (Spiliotis, 2022). Cubist is a well-known regression model tree that performs rule-based modelling (Quinlan, 1992). Many top solutions of the M5 forecasting competition (Makridakis et al., 2022) including the winning method use globally trained LightGBM models. Furthermore, the winning methods of many recently held Kaggle forecasting competitions use tree-based models (Bojer & Meldgaard, 2020). The FFORMA method (Montero-Manso et al., 2020) which placed second in the M4 forecasting competition (Makridakis et al., 2018) uses XGBoost as a meta-learner. CatBoost considers the order of data points during modelling which makes it more suitable for time series forecasting (Prokhorenkova et al., 2018).

The R packages glmnet (Friedman et al., 2010), nnet (Venables & Ripley, 2002), rpart (Therneau & Atkinson, 2019), Cubist (Kuhn & Quinlan, 2022), lightgbm (Ke et al., 2020), xgboost (Chen et al., 2020) and catboost (Prokhorenkova et al., 2018) are respectively used to implement the PR, FFNN, regression tree, Cubist, LightGBM, XGBoost and CatBoost models. The hyperparameters of FFNN, LightGBM and XGBoost are tuned using a grid search approach. For the FFNN, the number of nodes in the hidden layers is varied from 10 to 60 in steps of 10, and the learning rate decay is varied from 0.01 to 0.1 in steps of 0.01. For LightGBM, the minimum number of instances in a leaf node is varied from 50 to 200 in steps of 50, and the learning rate is varied from 0.01 to 0.1 in steps of 0.01. For XGBoost, the maximum tree depth is varied from 3 to 10 in steps of 1, and \(\eta\) is varied from 0.1 to 0.5 in steps of 0.1. The regression tree, Cubist and CatBoost models are used with their default parameters that are assumed to provide good results. The PR model does not require any hyperparameter tuning.

We also consider two traditional univariate forecasting models: ETS (Hyndman et al., 2008) and ARIMA (Box et al., 2015) that are commonly used in the forecasting space, as the benchmarks of our study. They are implemented using the R package, forecast (Hyndman & Khandakar, 2008) under the default configurations.

The SETAR (Tong, 1993) and STAR (Terasvirta, 1994) models that are trained per series are also considered as benchmarks. They are implemented using the R package, tsDyn (Narzo et al., 2022) under the default configurations.

We further consider RF (Breiman, 2001) as a benchmark to evaluate the performance of our SETAR-Forest algorithm. For that, the implementation of RF in the R package lightgbm (Ke et al., 2020) is used where the hyperparameters are tuned using a grid search approach. The minimum number of instances in a leaf node is varied from 50 to 200 in steps of 50, and the learning rate is varied from 0.01 to 0.1 in steps of 0.01. The bagging frequency, bagging fraction and feature fraction are respectively set to the same default values used in our SETAR-Forest algorithm: 10, 0.8 and 1 to facilitate the comparison between RF and the proposed SETAR-Forest model.

Three variants of the proposed SETAR-Tree model (Sect. 3.3) are used as follows.

  • Tree.Lin.Test Uses the significance of the linearity test as the stopping criterion of node splitting.

  • Tree.Error.Red Uses the error reduction percentage that can be gained by node splitting as the stopping criterion.

  • Tree.Lin.Test.Error.Red Uses both the significance of the linearity test and the error reduction percentage gained by node splitting as the stopping criteria.

Three variants of the proposed SETAR-Forest model (Sect. 3.4) are used as follows.

  • Forest.Significance The trees are randomised in terms of the significance level and the sequence significance divider considered in the linearity test.

  • Forest.Error.Red The trees are randomised in terms of the error threshold used to measure the error reduction percentage of node splitting.

  • Forest.Significance.Error.Red The trees are randomised in terms of the significance level and the sequence significance divider considered in the linearity test as well as the error threshold used to measure the error reduction percentage of node splitting.

The number of lagged values used in all GFMs including our SETAR-Tree and forest variants are determined using the heuristics suggested by Hewamalage et al. (2021) where the number of lags used with a dataset depends on its seasonal cycle length or forecast horizon. In particular, for daily and monthly datasets, we consider the heuristic, \(seasonality \times 1.25\) to determine the number of lags. For daily datasets: Rossmann, Kaggle Web Traffic, Favorita and M5, this heuristic suggests 8.75 lags, where this value is rounded (to steps of 5) and 10 lags are finally considered with those datasets. For the Tourism Monthly dataset, 15 lags are considered as suggested by the above heuristic. For Chaotic Logistic, Mackey-Glass and Tourism Quarterly datasets, the seasonality-based heuristic suggests small values, so that for these datasets we use the horizon-based heuristic, and consider 10 lags which is equal to \(horizon \times 1.25\) of those datasets. For the three datasets with external covariates: Rossmann, Kaggle Web Traffic and Favorita, the GFMs are separately executed with and without covariates where only the past lags are used when training without covariates, and the past lags and covariates are together used when training with covariates.

4.4 Statistical testing of the results

The non-parametric Friedman rank-sum test is used to assess the statistical significance of the results provided by different forecasting models across time series considering a significance level of \(\alpha\) = 0.05 (García et al., 2010). Based on the corresponding msMAPE errors, the methods are ranked on every series of the eight primary experimental datasets. In line with traditional forecasting modelling and evaluation (Koning et al., 2005), we treat each series as a separate entity for statistical testing, so that the ranks per series are used as inputs for the statistical tests. This seems reasonable as the methods are univariate and – once trained – forecast each series in isolation. In this way, the statistical test has much more information instead of averaging the results over the datasets before applying the test. All primary experimental datasets have a comparable amount of series and thus, none of the datasets will dominate the evaluation and we obtain a representative result across all primary experimental datasets. The best method according to the average rank is chosen as the control method. To further characterise the statistical differences, Hochberg’s post-hoc procedure is used (García et al., 2010).

5 Results

This section discusses the results in terms of main accuracy, statistical significance and computational performance.

5.1 Main accuracy results

This section first presents the results and a comparison of our SETAR-Tree and SETAR-Forest variants. The best SETAR-Tree and SETAR-Forest variants are then compared with the benchmarks.

5.1.1 Comparison of SETAR-Tree and SETAR-Forest variants

Table 2 Results of SETAR-Tree and SETAR-Forest variants across all experimental datasets.

Table 2 shows the results of the considered SETAR-Tree and SETAR-Forest variants across all experimental datasets for mean msMAPE, median msMAPE, mean MASE and median MASE. The tree variants and forest variants are separately grouped in Table 2. The results of the best performing variants in each group are italicized, and the overall best performing variants across the datasets are highlighted in boldface. The preliminary experiments are also conducted with other stopping criteria such as using Akaike Information Criterion (AIC), however, the performance of the SETAR-Tree was not improved with them confirming the claims of prior similar studies (da Rosa et al., 2008). Thus, the corresponding results are not included here.

Across the tree model variants, on mean msMAPE, the Tree.Lin.Test model variant shows the best performance across five primary datasets where the Tree.Lin.Test.Error.Red model variant shows the best performance across four primary datasets. However, on all other error metrics, Tree.Lin.Test.Error.Red shows the overall best performance across the majority of the experimental datasets from the tree model variants. Compared with the Tree.Lin.Test and Tree.Lin.Test.Error.Red model variants, the Tree.Error.Red model variant shows worse performance on all error metrics except across the Kaggle Web Traffic dataset. This shows that using the significance of the statistical linearity test individually or together with the error reduction percentage gained by node splitting are better options rather than using the error reduction percentage on its own as the stopping criterion of the SETAR-Tree to determine its maximum depth.

In the SETAR-Forest model variants, as the stopping criteria of each individual SETAR-Tree, we consider both the significance of the statistical linearity test and the error reduction percentage gained by node splitting which is the stopping criterion of our best SETAR-Tree variant, Tree.Lin.Test.Error.Red. Out of the three SETAR-Forest variants, Forest.Significance.Error.Red shows an overall better performance compared to the other two forest variants across all experimental datasets on all error metrics. This shows that when the individual SETAR-Trees of the forest are more diversified, the resultant forecasts are more accurate. We see that overall, the SETAR-Forest variants show the best performance across the majority of datasets compared to the SETAR-Tree variants on all error metrics. The forests properly address the data, model, and parameter uncertainties through bagging compared to the individual trees and that can be considered as the major reason for this phenomenon.

Our SETAR-Tree and forest model variants also provide interesting results when trained with external covariates. As discussed in Sect. 3.3.4, the categorical covariates of the Rossmann (status of the store, promotions, state holiday and school holiday flags), Kaggle Web Traffic (day of the week) and Favorita (day of the week) datasets are converted into the numerical format by applying a one-hot encoding mechanism before model training. Furthermore, with the Rossmann dataset, the daily number of customers to each store is considered as an external numerical covariate. In particular, the overall result of the Rossmann dataset has considerably improved across our proposed SETAR-Tree and forest variants after using external covariates on all error metrics. However, the results of the Kaggle Web Traffic and Favorita datasets do not show such improvements after using covariates across our proposed model variants.

5.1.2 Comparison of proposed models with benchmarks

Table 3 shows the results of the benchmarks and our best SETAR-Tree and SETAR-Forest variants, Tree.Lin.Test.Error.Red and Forest.Significance.Error.Red, across all experimental datasets for mean msMAPE, median msMAPE, mean MASE and median MASE.

Table 3 Results across all experimental datasets.

The models in Table 3 are grouped based on the sub-experiments. The results of the best performing models in each group are italicized, and the overall best performing models across the datasets are highlighted in boldface. The first group contains the traditional univariate forecasting models, ETS, ARIMA, SETAR and STAR, where these models are only executed with our eight primary experimental datasets without using external covariates. The second group contains benchmark GFMs including PR, FFNN, state-of-the-art tree-based models and RF. The last group contains our best SETAR-Tree and SETAR-Forest variants that are discussed in Sect. 5.1.1. The GFMs are executed with and without covariates for three datasets: Rossmann, Kaggle Web Traffic and Favorita, separately.

In the first group, ETS shows an overall better performance compared to ARIMA, SETAR and STAR across all error metrics. ETS shows the best performance across the Tourism Monthly dataset on all error metrics, Tourism Quarterly dataset on mean msMAPE and median MASE, and the Favorita dataset on median MASE. However, GFMs including our proposed models show a better performance than these traditional univariate forecasting models across all experimental datasets on all error metrics except for these seven cases.

In the second group, overall, PR and Cubist show a better performance compared to the other benchmark GFMs across all error metrics. The Cubist and CatBoost models show a better performance when training with external covariates.

The third group shows the results of our best SETAR-Tree variant, Tree.Lin.Test.Error.Red, and the best SETAR-Forest variant, Forest.Significance.Error.Red. Our proposed models show an overall better performance than other benchmark GFMs listed in the second group. In particular, our proposed models outperform the state-of-the-art tree-based and forest-based forecasting algorithms such as the regression tree, Cubist, LightGBM, XGBoost, CatBoost and RF across all experimental datasets on all error metrics except for twelve cases: Rossmann on mean msMAPE (Cubist), M5 on all error metrics (LightGBM), M5 on median msMAPE (CatBoost and XGBoost), Tourism Monthly on mean msMAPE, median msMAPE and median MASE (Cubist), Favorita with covariates on median msMAPE (LightGBM) and on all error metrics (Cubist, Catboost and XGBoost). This further shows training a global AR model in the leaf nodes leads to more accurate forecasts rather than simply considering the average of the training instances in the leaf nodes. We also see the results of PR and Tree.Lin.Test.Error.Red are the same across the Favorita dataset without covariates and the M5 dataset on all error metrics. This means the corresponding SETAR-Tree have only one node which is the parent node where the parent node has not been further split based on the considered stopping criterion. The SETAR-Tree and the PR model have overall provided better results across the above two datasets compared to most of the other benchmark GFMs. It shows limiting the tree to a single node is a worthwhile decision in these cases. We see that overall, Forest.Significance.Error.Red shows the best performance across the majority of datasets compared to all other considered benchmarks listed in the first and second groups as well as Tree.Lin.Test.Error.Red.

Overall, our proposed models show better performance when training with covariates compared to the benchmarks including PR, FFNN and the state-of-the-art tree-based algorithms. As explained in Sect. 5.1.1, the overall result of the Rossmann dataset has considerably improved across our proposed models after using external covariates on all error metrics while considerably outperforming all benchmarks. The results of the Kaggle Web Traffic and Favorita datasets do not show such improvements after using covariates across our proposed models as well as all other considered benchmarks. However, across the Kaggle Web Traffic dataset, Tree.Lin.Test.Error.Red shows the overall best performance compared to the other benchmarks when training with covariates.

5.1.3 Analysis of SETAR-Forest size

Table 4 Results of Forest.Significance.Error.Red for Chaotic Logistic dataset across four different sizes.

We analyse the effect of the size of the SETAR-Forest, i.e., the number of SETAR-Trees in the forest, for the final forecasting accuracy. Table 4 shows the results of our best SETAR-Forest variant, Forest.Significance.Error.Red, for the Chaotic Logistic dataset across four different sizes: 5, 10, 20 and 50 for mean msMAPE, median msMAPE, mean MASE and median MASE.

As expected, based on the results, overall, the performance of the SETAR-Forest gets better when the number of SETAR-Trees is increased. The SETAR-Forest containing 10 SETAR-Trees shows the best performance on mean msMAPE and mean MASE and the second best performance on median MASE. Adding more trees does not lead to considerable accuracy gains. Thus, we deem ten SETAR-Trees sufficient for a SETAR-Forest to obtain accurate forecasts in our experiments.

5.2 Statistical testing results

Table 5 Results of statistical testing

Table 5 shows the results of the statistical testing evaluation, namely the adjusted p-values calculated from the Friedman test with Hochberg’s post-hoc procedure considering a significance level of \(\alpha = 0.05\) (García et al., 2010). We consider the msMAPE of each series provided by the benchmarks and our best SETAR-Tree and SETAR-Forest variants, Tree.Lin.Test.Error.Red and Forest.Significance.Error.Red, for the eight primary experimental datasets, namely the Rossmann, Kaggle Web Traffic, Favorita, M5, Tourism Monthly, Tourism Quarterly, Chaotic Logistic, and Mackey-Glass datasets. The datasets are not considered with the external covariates during statistical testing as ETS, ARIMA, SETAR and STAR models are only executed without covariates.

The overall p-value of the Friedman rank sum test is less than \(10^{-30}\) which is highly significant. Forest.Significance.Error.Red performs the best on ranking over msMAPE per each series of the eight primary experimental datasets and thus, it is used as the control method as mentioned in the first row. All benchmarks and Tree.Lin.Test.Error.Red are significantly worse than the control method as they report \(p_{Hoch}\) values less than \(\alpha\).

5.3 Computational performance

Table 6 Computational times (in minutes) of all models across all experimental datasets
Table 7 Times (in seconds) required by the proposed models to predict a data point (given the fully trained models). The prediction time of Forest.Significance.Error.Red is ten times higher than the prediction time of Tree.Lin.Test.Error.Red as we consider ten SETAR-Trees for a SETAR-Forest

To compare the computational performance of the benchmarks and our proposed models, all experiments are executed in a controlled environment, namely an Intel(R) Core(TM) i7 processor (2.6GHz) and 32GB of main memory.

Table 6 shows the computational times (in minutes) of all benchmarks and our proposed tree and forest models across all experimental datasets. The reported computational times include the model training, hyperparameter tuning and forecast computation times.

From Table 6, we can see that the simple models such as PR and regression tree show the lowest computational times. Cubist, CatBoost and LightGBM models generally show lower computational times compared to XGBoost. Our proposed SETAR-Tree model, Tree.Lin.Test.Error.Red, overall shows lower computational times than Cubist, LightGBM and XGBoost across all primary datasets except Rossmann and Mackey-Glass. The most accurate model which is our proposed SETAR-Forest model, Forest.Significance.Error.Red, generally shows higher computational times than the benchmarks as it executes multiple SETAR-Trees and combines their forecasts to produce the final forecasts where a single tree internally does a lot of automatic parameter tuning. However, our SETAR-Forest shows lower computational times across the M5, Tourism Quarterly and Chaotic Logistic datasets than the LightGBM and XGBoost models. RF shows lower computational times than the SETAR-Forest model, however, as our SETAR-Forest is more accurate than the RF, the higher computational times seem justified.

Table 7 shows the times (in seconds) required by the proposed models, once trained, to predict a data point. Across the majority of datasets, the proposed SETAR-Tree model takes less than a second for prediction. The proposed SETAR-Forest model takes a few seconds for the same prediction, given the trained models. Thus, we deem our models computationally feasible.

Based on our results, we recommend to use our best SETAR-Tree variant, Tree.Lin.Test.Error.Red, and the best SETAR-Forest variant, Forest.Significance.Error.Red, over the state-of-the-art tree-based algorithms.

6 Conclusions and future research

Globally trained tree-based models have recently become popular in the forecasting field due to their contribution to most top-performing methods in the M5 forecasting competition. However, the models used are general-purpose models not specific to the forecasting task, and they only consider the average of the training instances at a leaf node as their prediction.

In this paper, we have proposed a new forecasting-specific globally trained tree model and a forest model, the SETAR-Tree and SETAR-Forest, which use time-series-specific splitting and stopping procedures. The SETAR-Tree uses the underlying concept of SETAR models during node splitting. In contrast to the state-of-the-art tree-based algorithms, our proposed tree model trains a global PR model in each leaf node allowing the models to learn the cross-series information. The SETAR-Tree internally controls the tree depth by conducting a statistical linearity test and measuring the error reduction percentage at each node split while making the requirement of external hyperparameter tuning to a minimal state. The SETAR-Forest model uses a collection of SETAR-Trees during forecasting where the trees are made diverse by randomising the significance of the statistical linearity test and the threshold of the error reduction percentage used during node splitting. The proposed tree and forest models can also be trained with external covariates. Across eight experimental datasets, we have shown that our proposed tree and forest models can significantly outperform a set of state-of-the-art tree-based GFMs and traditional univariate forecasting models. As our proposed models show a better forecasting accuracy, with less parameter tuning with a competitive computational performance compared to the benchmarks, we recommend SETAR-Tree and SETAR-Forest as strong tree-based models for global time series forecasting.

From our experiments, we conclude that training a global model in the leaf nodes often leads to more accurate results in tree-based algorithms rather than considering the average of the training instances in the leaf nodes in the forecasting context. We further conclude that considering the significance of the statistical linearity test and the error reduction percentage gained by node splitting together is a good option in automatically determining the maximum tree depth. The SETAR-Forest shows the best performance across the majority of the datasets as it adequately addresses the data, model, and parameter uncertainties compared to the individual tree models. The results get further improved when the individual trees in the forest are more diversified.

The success of this approach encourages as future work to extend the proposed tree models to work with multiple regimes, to work with boosting techniques and to provide probabilistic forecasts. It will be also interesting to study whether incorporating tree pruning can further increase the forecasting accuracy of SETAR-Trees. Analysing the effects of training a GFM in the leaf nodes of the state-of-the-art tree-based algorithms such as LightGBM and CatBoost is also worth to study.