1 Introduction

Boosting methods have become increasingly successful in machine learning over the past decade. While early weighted boosting algorithms such as AdaBoost (Freund & Schapire, 1997) showed promise, they were later surpassed by gradient boosting methods (Friedman, 2001, 2002; Friedman et al., 2000). Gradient boosting leverages the previous base learner’s gradient information (i.e., the slope of the loss function) to boost the performance of the next learner in the ensemble. eXtreme Gradient Boosting (XGBoost) (Chen & Guestrin, 2016) takes this approach to another level, achieving high efficiency and superior performance on various time-critical real-world problems. However, in many real-world scenarios, traditional batch learning under the Independent and Identically Distributed (iid) assumption cannot keep pace with the evolving nature of the underlying data stream (Gomes et al., 2017; Bifet et al., 2018).

On the other hand, Stream Learning (SL) accounts for the possibility of change in the underlying data distribution (concept drift) (Bifet et al., 2018). A model should respond efficiently in real-time when learning from an evolving data stream (Bifet et al., 2018). While methods such as Adaptive eXtreme Gradient Boosting (Axgb) (Montiel et al., 2020) and Adaptive Iterations (AdIter) (Wang et al., 2022) were proposed by the research community to enable gradient boosting for evolving data streams, they failed to outperform state-of-the-art ensemble learners like Adaptive Random Forest (Arf) (Gomes et al., 2017) and Streaming Random Patches (Srp) (Gomes et al., 2019).

The proposed work utilizes streaming regression trees with inbuilt drift detectors in a gradient-boosted setting. The paper makes the following contributions:

  1. To our knowledge, Sgbt is the first instance in which the weighted squared loss elicited in Friedman et al. (2000) and Chen and Guestrin (2016), with the hessian as the weight and the gradient over the hessian as the target while accounting for the previous boosting step’s loss, is used to develop a streaming gradient-boosted method for evolving data streams. This allows Sgbt to leverage any streaming regression tree as its base learner.

  2. Sgbt utilizes trees with an internal Tree Replacement (TR) mechanism instead of externally monitoring each member of the boosting ensemble for drifts and adjusting it, as in Axgb (Montiel et al., 2020), or resetting some parts, as in AdIter (Wang et al., 2022). This Tree Replacement mechanism allows the trees in the booster to adapt dynamically to concept drifts. Unlike the binary-class gradient-boosted streaming implementations Axgb and AdIter, Sgbt can solve multi class problems using a committee of trees at each boosting step or a committee of Sgbts.

  3. We present an extensive empirical evaluation of Sgbt against current state-of-the-art streaming bagging with random subspaces (Srp), random forest (Arf), boosting (OSB), and gradient boosting (AdIter) methods on 14 datasets with different drift types.

Overall, Sgbt outperforms existing techniques for evolving data streams. The paper is structured as follows. The next section reviews the current state-of-the-art stream learning methods and recent gradient boosting work for evolving data streams. The subsequent section explains our proposed Sgbt method. The experiments section describes the experimental setup where Sgbt was evaluated against state-of-the-art stream learning methods. The final section provides conclusions and directions for future research.

2 Related work

Boosting and bagging are two popular ensemble learning techniques in machine learning. Bagging randomly samples instances with replacement to train each member of the ensemble. Boosting, on the other hand, attempts to boost the performance of the next base learner in the ensemble, considering the loss of the previous one. It combines the predictions of weak learners additively to produce a strong learner (Friedman et al., 2000; Friedman, 2001). AdaBoost (Freund & Schapire, 1997) assigns higher weights to the instances misclassified by the current base learner to improve the next base learner. Gradient boosting uses the current base learner’s gradient information of the loss to improve the next base learner (Friedman et al., 2000). XGBoost (Chen & Guestrin, 2016) uses this gradient information to derive a particular regression tree that predicts a raw score at the leaf for a given instance. It contains an efficient split-finding mechanism, cache-aware data processing, and parallel processing to produce a highly scalable and efficient algorithm for batch learning (Chen & Guestrin, 2016).

Compared to batch learning, a Stream Learning model learns from an evolving data stream (non-iid data), processing one instance at a time. Here, the model should be able to predict at any given moment using limited processing time and memory (Bifet et al., 2018; Gomes et al., 2017). It should also adjust to distribution changes (concept drifts) in the underlying data stream (Bifet et al., 2018; Bifet & Gavalda, 2007; Gomes et al., 2017).

Data stream boosting is challenging due to the evolving nature of the data stream. Here, the model needs to adjust to the new input distribution of the stream after a concept drift (Bifet et al., 2018; Montiel et al., 2020). Online Bagging (OBg) and Online Boosting (OB) (Oza & Russell, 2001) were inspired by the observation that a binomial distribution \(Binomial(n, p)\) can be approximated by a Poisson distribution \(Poisson(\lambda )\) with \(\lambda =np\) as \(n \rightarrow \infty\). Here, n is the number of instances, and p is the probability of success in the binomial distribution. Since the probability of selecting a given example is 1/n in batch bagging, the uniform sampling with replacement of the bagging algorithm is approximated by Poisson(1) in OBg. In OB, on the other hand, \(\lambda\) is computed by tracking the total weights of correctly classified and misclassified examples for each base learner. An online version of SmoothBoost (Servedio, 2003) was proposed in Chen et al. (2012). This Online Smooth Boost (OSB) uses smooth distributions that do not assign too much weight to a single example. When the numbers of weak learners and examples are sufficiently large, OSB is guaranteed to achieve an arbitrarily small error rate (Chen et al., 2012; Gomes et al., 2019). Recently, two notable approaches were proposed by the stream learning community to leverage gradient boosting for data streams: Axgb (Montiel et al., 2020) and AdIter (Wang et al., 2022). Axgb employs mini-batch-trained XGBoost models as its base learners and adjusts the ensemble in response to concept drifts, which it detects using ADWIN (Bifet & Gavalda, 2007). AdIter attempts to identify the weak learners in the ensemble and prune them when confronted with concept drift. It then employs multiple training iterations with a majority vote among the ensemble to support different drift types. Both Axgb and AdIter only support binary classification. In contrast, our proposed streaming gradient boosting method (Sgbt) supports both binary and multi class problems.
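
As an illustration of this approximation, the minimal sketch below (our own simplification, assuming scikit-learn-style incremental learners with a partial_fit method) applies a Poisson(1)-weighted update for online bagging; online boosting follows the same pattern but derives \(\lambda\) from each learner's running record of correctly and incorrectly classified weight.

```python
import numpy as np

rng = np.random.default_rng(42)

def online_bagging_update(learners, x, y):
    """One Oza & Russell-style online-bagging step: each base learner sees the
    instance k times, where k ~ Poisson(1) approximates sampling with
    replacement as the stream grows."""
    for learner in learners:
        k = rng.poisson(lam=1.0)
        for _ in range(k):
            # assumes scikit-learn-style incremental learners
            learner.partial_fit([x], [y])
```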

Arf (Gomes et al., 2017) and Srp (Gomes et al., 2019) are popular ensemble learning methods for streaming data. They allow one to use efficient stream learning base learners like Hoeffding Tree (HT) in a random forest or random subspaces setup in conjunction with efficient drift detectors like ADWIN. Arf is a streaming random forest adaptation that combines re-sampling strategies, drift detection, and drift recovery strategies (Gomes et al., 2017). Srp combines random subspaces and re-sampling (i.e., random patches) to leverage diversity among base incremental learners (Gomes et al., 2019). It uses the same drift detection and recovery strategy as Arf, but tends to outperform Arf (Gomes et al., 2019) in some benchmarks while not being limited to decision trees.

OSB performed better than OB (Chen et al., 2012). The empirical evaluation in Gomes et al. (2019) shows that even with 100 base learners, Arf and Srp outperform OSB by a large margin; in the same evaluation, Srp outperformed Arf. Axgb failed to outperform Arf in the Montiel et al. (2020) empirical evaluation. In the Wang et al. (2022) experiments, AdIter also failed to surpass Arf on synthetic evolving datasets with 10,000 instances. However, in the same evaluation, AdIter surpassed Arf on real-world data. In that evaluation, all the other datasets, apart from airlines, had fewer than 100,000 instances. The above empirical evaluations suggest that the latest gradient boosting methods for evolving data streams are yet to surpass current state-of-the-art ensemble methods like Srp and Arf. However, our proposed Sgbt was able to outperform Srp and Arf on a variety of evolving datasets.

3 Streaming gradient boosted trees (SGBT)

For a dataset with n instances, let \(x_i\) denote the features of the i-th instance and \(y_i\) its corresponding target value. In gradient boosting, a model \(\phi\) can be represented as S additive functions:

$$\begin{aligned} {\hat{y}}_{i}=\phi (x_i)=\sum _{s=1}^{S}f_{s}(x_{i}), f_{s}\in {\mathcal {F}} \end{aligned}$$

to predict \({\hat{y}}_{i}\) (Friedman, 2002; Chen & Guestrin, 2016). Here, \({\mathcal {F}}\) is the space of regression trees. In XGBoost (Chen & Guestrin, 2016), each \(f_s\) corresponds to an independent tree structure with leaf scores \(\omega\). Each regression tree contains a continuous score \(\omega _{i}\) at the leaf for the i-th instance. The authors proposed summing up the corresponding leaf scores of each tree for prediction. The goal is to minimize the regularized objective:

$$\begin{aligned} {\mathcal {L}}(\phi ) = \sum _{i=1}^{n}l(y_i, {\hat{y}}_i) + \sum _{s=1}^{S}\Omega (f_s) \end{aligned}$$
(1)

where \(\Omega\) penalizes the complexity of the tree f:

$$\begin{aligned} \Omega (f) = \gamma T + \frac{1}{2} \beta \left\| \omega \right\| ^2 \end{aligned}$$

Here, \(\gamma\) penalizes adding a new leaf, and \(\beta\) forces the leaf predictions to be small. T is the number of leaves in the tree, and l is a differentiable convex loss function that measures the difference between the prediction \({\hat{y}}_i\) and the target \(y_i\). Furthermore, the loss at the s-th step is the loss incurred when tree \(f_s\) is added to the prediction of the previous \(s-1\) steps, plus the regularization term:

$$\begin{aligned} {\mathcal {L}}^{(s)}= \sum _{i=1}^{n} l(y_i, {\hat{y}}^{(s-1)} + f_s(x_i))+ \Omega (f_s) \end{aligned}$$

This loss can be approximated using a second-order Taylor expansion (Chen & Guestrin, 2016) as:

$$\begin{aligned} {\mathcal {L}}^{(s)} \simeq \sum _{i=1}^{n}\left[ l(y_i, {\hat{y}}^{(s-1)}) + g_i f_s(x_i) + \frac{1}{2}h_i f^2_s(x_i) \right] + \Omega (f_s) \end{aligned}$$

Here, \(g_i = \partial _{{\hat{y}}^{(s-1)}} l(y_i, {\hat{y}}^{(s-1)})\) and \(h_i = \partial ^2_{{\hat{y}}^{(s-1)}} l(y_i, {\hat{y}}^{(s-1)})\) are the first-order (gradient) and second-order (hessian) statistics of the loss with respect to the \((s-1)\)-th prediction. Though the authors (Chen & Guestrin, 2016) use a simplified version of the above loss function, removing constants, to derive the raw score values at the leaves, the version below was elicited to explain it as a weighted squared loss with weight \(h_i\) and target \(g_i/h_i\):

$$\begin{aligned} \sum _{i=1}^{n} \frac{1}{2} h_i (f_s(x_i) - g_i/h_i )^2 + \Omega (f_s) + constant. \end{aligned}$$
(2)

This weighted squared loss with hessian as the weight and gradient over hessian as the target, considering the previous boosting step’s loss, was first introduced in Friedman et al. (2000).
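
To make this reduction concrete, the hedged sketch below (our illustration, not the authors' code; a batch scikit-learn tree stands in for a streaming regression tree) fits one boosting step as an ordinary weighted regression with target \(g_i/h_i\) and instance weight \(h_i\):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # batch stand-in for a streaming regression tree

def fit_boosting_step(X, g, h, eps=1e-12):
    """Fit one boosting step as the weighted squared loss of Eq. (2):
    regression target = g / h, per-instance weight = h."""
    target = g / np.maximum(h, eps)           # gradient over hessian as the target
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, target, sample_weight=h)      # hessian as the weight
    return tree
```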

Algorithm 1: Training Sgbt

Equation 2 provides the flexibility to utilize various streaming regression trees instead of the one employed in XGBoost. Moreover, depending on the implementation, the streaming regression tree’s regularization term can diverge from that employed in XGBoost. In this work, the Tree Replacement strategy, explained later in this paper, acts as a regularization mechanism.

Table 1 Notations

3.1 Streaming regression trees with internal tree replacement strategy for gradient boosting

In data stream learning, n could be infinite, and learning happens online: the model \(\phi _{i-1}\) learned up to the \((i-1)\)-th instance is used to predict the i-th instance. Also, from any i-th instance onward, the underlying distribution of x could change (concept drift). The model \(\phi _{i}\) should adjust its regression trees to adapt to this new distribution at i. Instead of externally monitoring and resetting each \(f_s\) tree as in AdIter (Wang et al., 2022), in Sgbt, the trees internally monitor their standardized absolute error and train an alternate tree if it rises above a warning level. The tree \(f_s\) switches to its alternate tree once the error reaches a danger zone. Each \(f_s\) tree employs a drift detector that monitors its standardized absolute error to trigger these warning and danger signals. The rest of the paper refers to this strategy of replacing the active tree with an alternate tree on a drift detection signal as Tree Replacement (TR). In the experiments, we used two regression trees for data streams, FIMT-DD (Ikonomovska et al., 2011) and SGT (Gouk et al., 2019), with the in-built drift detectors Page-Hinckley Test (Pht) (Mouss et al., 2004) and Ddm (Gama et al., 2004), respectively. The implementation of SGT with Ddm is generic, and one could replace SGT with any other regression tree for data streams. Here, TR also serves as a dynamic regularization mechanism by replacing trees as the data evolves during learning. Arf and Srp use a similar Tree Replacement strategy under random-forest and bagging settings for SL classification, but to our knowledge, this is the first time TR has been used in gradient-boosted trees for SL classification. It allows the booster to dynamically adjust to underlying input distribution changes as some active trees are replaced by their alternate trees on drift detection.
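
The following sketch illustrates this TR wrapper (our own simplification, not the MOA code): the assumed `detector.update` call stands in for Pht or Ddm and is expected to return a warning or drift signal, and `learn_one`/`predict_one` stand in for a streaming regression tree's incremental API.

```python
class TreeReplacementRegressor:
    """Sketch of the Tree Replacement (TR) strategy (illustrative only).
    A drift detector monitors the active tree's standardized absolute error;
    a warning signal starts an alternate tree, and a drift signal swaps it in."""

    def __init__(self, tree_factory, detector_factory):
        self.tree_factory = tree_factory
        self.detector_factory = detector_factory
        self.detector = detector_factory()
        self.active = tree_factory()
        self.alternate = None
        self.n = self.err_sum = self.err_sq_sum = 0.0

    def predict_one(self, x):
        return self.active.predict_one(x)

    def learn_one(self, x, y, weight=1):
        error = abs(y - self.active.predict_one(x))
        # standardize the absolute error with a running mean and variance
        self.n += 1
        self.err_sum += error
        self.err_sq_sum += error * error
        mean = self.err_sum / self.n
        var = max(self.err_sq_sum / self.n - mean * mean, 1e-12)
        signal = self.detector.update((error - mean) / var ** 0.5)  # assumed API
        if signal == "warning" and self.alternate is None:
            self.alternate = self.tree_factory()          # start training an alternate tree
        elif signal == "drift":
            self.active = self.alternate or self.tree_factory()  # replace the active tree
            self.alternate = None
            self.detector = self.detector_factory()       # reset the detector
        self.active.learn_one(x, y, weight)
        if self.alternate is not None:
            self.alternate.learn_one(x, y, weight)
```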

The loss function in Eq. 2 requires the regression trees to support fractional weights, as \(h_i\) could be fractional for some loss functions. The streaming regression trees considered in this work (SGT and FIMT-DD) only support integer weights, and supporting fractional weights in them is not trivial. For example, SGT and FIMT-DD would require the incremental calculation of variance and covariance under fractional weights. Though recent work by Pébay et al. (2016) and Schubert and Gertz (2018) suggests this is possible, it is a separate research topic in itself. Also, as clarified later in the text, the hessians for the popular categorical cross-entropy loss with softmax used in the experiments are consistently below 1. Hence, even though Sgbt calculates these weights (hessians), it does not pass them to the underlying trees for these practical reasons; instead, it passes a weight of 1 to the trees.

Instead of using all the features to train at each boosting step, Sgbt uses a subset of features based on a predefined feature percentage. This approach of using a subset of features to train each ensemble member is also used in Srp (Gomes et al., 2019) to increase the diversity among the base learners. Algorithm 1 explains the training procedure of Sgbt.
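
The sketch below condenses this training procedure for the binary case (our own illustration, not the MOA implementation; the `learn_one`/`predict_one` API, the update ordering, and `tree_factory` are assumptions, and a weight of 1 is passed to the trees as discussed above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class SGBTSketch:
    """Condensed, binary-class reading of the Sgbt training loop (illustrative;
    tree_factory would produce FIMT-DD or SGT wrapped with TR in the real system)."""

    def __init__(self, tree_factory, n_steps=100, lr=0.0125, feature_frac=0.75, seed=1):
        self.lr = lr
        self.feature_frac = feature_frac
        self.rng = np.random.default_rng(seed)
        self.trees = [tree_factory() for _ in range(n_steps)]
        self.subsets = None   # one fixed random feature subset per boosting step

    def _init_subsets(self, n_features):
        k = max(1, int(self.feature_frac * n_features))
        self.subsets = [self.rng.choice(n_features, size=k, replace=False)
                        for _ in self.trees]

    def learn_one(self, x, y):
        x = np.asarray(x, dtype=float)
        if self.subsets is None:
            self._init_subsets(len(x))
        raw = 0.0
        for tree, cols in zip(self.trees, self.subsets):
            p = sigmoid(raw)                    # prediction from the preceding steps
            g, h = y - p, p * (1.0 - p)         # gradient and hessian (paper's convention)
            tree.learn_one(x[cols], g / max(h, 1e-12))  # target g/h, tree weight of 1
            raw += self.lr * tree.predict_one(x[cols])

    def predict_proba_one(self, x):
        if self.subsets is None:
            return 0.5
        x = np.asarray(x, dtype=float)
        raw = sum(self.lr * t.predict_one(x[cols])
                  for t, cols in zip(self.trees, self.subsets))
        return sigmoid(raw)
```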

3.2 Multi class support

Two approaches are used to support multi class problems: Sgbt and Sgbt\(^{MC}\).

  • SGBT uses a committee of regression trees at a given boosting step s, with a single tree trained for each class. The committee’s outputs are combined with a softmax function, so the probability that an instance \(x_i\) belongs to class c is given by: \({\hat{y}}_{i,c}= \frac{exp(f_{s,c}(x_i))}{\sum _{c=1}^{C}exp(f_{s,c}(x_i))}\). Here, \(f_{s,c}\) is the regression tree trained to predict a real-valued score for class c at the s-th boosting step, and C is the number of classes. In practice, hard-wiring \(f_{s,C}(x_i)=0\) allows Sgbt to reduce the number of trees being trained.Footnote 1 The categorical cross-entropy loss (\(l^{CE}\)) is used to train the model: \(l^{CE}(y,{\hat{y}}) = - \sum _{c=1}^{C} y_c log ({\hat{y}}_c)\), where y is the ground truth encoded as a one-hot vector. For \(l^{CE}\), the gradient (g) is \(y_c -{\hat{y}}_c\), and the hessian (h) is \({\hat{y}}_c (1 - {\hat{y}}_c)\). The regression tree committee (comprising \(C-1\) trees) at the s-th boosting step constitutes the base learner for that step. This approach is also used in SGT to support multi class classification.

  • SGBT\(\mathbf {^{MC}}\) uses the same loss function (\(l^{CE}\)) as Sgbt, but it uses a wrapper classifier to invoke a binary Sgbt classifier for each class. The task of each binary Sgbt classifier is to distinguish a given class from all the other classes. At prediction time, the positive-outcome votes of all C classifiers are collected and normalized, and the class whose classifier predicted the positive outcome most confidently is taken as the final class for the instance. This approach is very popular in batch learning and is commonly known as one-vs-rest or one-vs-all in the literature (Witten et al., 2016). Sgbt\(^{MC}\) reverts to Sgbt for binary class problems to avoid any computational overhead.

Unlike Axgb and AdIter, the above two approaches allow Sgbt to support gradient boosting for evolving data streams on multi class problems.
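
A minimal sketch of the two prediction paths is given below (our illustration; `predict_proba_one` and the model list are assumed names). The first function forms a boosting step's committee probabilities with the last class hard-wired to a score of zero; the second combines C binary Sgbts in the one-vs-rest fashion described above.

```python
import numpy as np

def committee_proba(scores):
    """Class probabilities from one boosting step's tree committee: softmax over
    the C-1 raw scores, with the last class hard-wired to a score of 0."""
    scores = np.append(np.asarray(scores, dtype=float), 0.0)  # f_{s,C}(x) = 0
    e = np.exp(scores - scores.max())                         # numerically stable softmax
    return e / e.sum()

def one_vs_rest_predict(binary_sgbts, x):
    """Sgbt^{MC}: collect and normalize each binary model's positive-class vote
    and return the class whose model is most confident."""
    votes = np.array([model.predict_proba_one(x) for model in binary_sgbts])
    votes = votes / votes.sum()
    return int(np.argmax(votes))
```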

Algorithm 2: Training Sgbt\(^{SK}_{MI}\)

3.3 Predicting and computing improvements

Two variants of Sgbt are proposed below to improve computational performance and to utilize the already calculated hessian weights.

  • SGBT\(\mathbf {^{SK}}\): In most streaming regression trees, the computation and memory complexities grow with the number of instances they process. Some computation and memory savings can be achieved by skipping training on random instances. Sgbt\(^{SK}\) randomly skips 1/k-th of the instances (\(k \in {\mathbb {N}}, k \ge 1\)). By default, k is set to 1, which disables skipping so that all instances are processed, as in Sgbt. Work by Gunasekara et al. (2022) and Pavlovski et al. (2017) also exploited skip training for Stream Learning. Line 5 in Algorithm 2 highlights this skip training.

  • SGBT\(\mathbf {_{MI}}\): Even though the current base learners only support integer weights, utilizing the already calculated fractional hessian weights is helpful. For \(l^{CE}\), the hessian for class c at the i-th instance is always less than 1 (\(h_{i,c}< 1\)), so even if one passes \(h_{i,c}\) to a ceiling function,Footnote 2 it will always return 1. Multiplying \(h_{i,c}\) by 10 and passing the result to a ceiling function instead yields a positive integer weight that is greater than 1 for some instances; for all the other instances, the weight is set to 1. If \(ceiling(h_{i,c} * 10 ) = T\), Sgbt trains the \(f_{s,c}\) base learner T times using instance \(x_{i,s}\) with label \(g_{i,c}/h_{i,c}\). Here, the multiplier 10 ensures that \(T\le 10\) for all instances, providing a reasonable upper limit on the computational cost of this approach. This technique of training a base learner multiple times based on a calculated integer weight for an instance is quite common in stream learning (Oza & Russell, 2001; Gomes et al., 2019). Line 12 in Algorithm 2 highlights this multiple-training-iteration approach. Furthermore, it allows Sgbt to use streaming regression trees that do not support weights.

Algorithm 2 explains the above two variants of Sgbt in detail. In the experiments, we evaluate the effectiveness of these Sgbt variants.
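
A compact sketch of how the two variants could act on a single instance is shown below (our illustration, not the MOA implementation; `learn_one` is an assumed streaming-tree method):

```python
import math
import random

def sgbt_sk_mi_update(tree, x, target, h, k=1):
    """Illustrative combination of the two Algorithm 2 variants (not the MOA code).
    SK: with k > 1, roughly 1/k of the instances are skipped at random (k = 1 keeps all).
    MI: the fractional hessian h becomes an integer repeat count ceil(10 * h),
        bounded below by 1 and above by 10."""
    if k > 1 and random.random() < 1.0 / k:
        return                                    # SK: skip this training instance
    repeats = min(10, max(1, math.ceil(10 * h)))  # MI: hessian-driven iteration count
    for _ in range(repeats):
        tree.learn_one(x, target)
```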

As Sgbt allows different streaming regression trees as its base learners, its final time and memory complexities are influenced by those of the base learner. Assuming \({\mathcal {O}}(f)\) time and memory complexity for the base learner, Sgbt’s time complexity is \({\mathcal {O}}(CSf)\) and its memory complexity is \({\mathcal {O}}(CSf)\). Here, S is the number of boosting steps and C is the number of classes. Sgbt\(^{MC}\) has the same time and memory complexities as Sgbt. The time complexity of Sgbt\(^{MC}\) can be further improved by training each binary Sgbt in parallel; our implementation of Sgbt\(^{MC}\) leverages this parallel processing, reducing its time complexity to \({\mathcal {O}}(Sf)\). This is similar to the current state-of-the-art streaming bagging and random-forest-based methods Srp and Arf. For Sgbt\(^{MC}_{SK}\), this time complexity is further reduced to \({\mathcal {O}}((1-{1/k})Sf)\) by skipping 1/k-th of the instances at training. Table 1 contains all the notations introduced in this section.

4 Experiments

We begin our experiments by comparing Sgbt against current state-of-the-art streaming bagging with random subspaces (Srp), random forest (Arf), boosting (OSB), and gradient boosting (AdIter) methods on 14 datasets. We also conducted a parameter exploration to illustrate the effects of different Sgbt components.

Table 2 Dataset properties: has (D)rifts, (R)eal, (S)ynthetic

Finally, we show an in-depth analysis concerning the computational requirements of Sgbt.

Datasets: AGRa, AGRg, LEDa, LEDg, RBFf, RBFm, electricity, airlines and covtype are from Gomes et al. (2019). RandomTree, LED, RBF5, RBF_Bm, RBF_Bf were generated using MOA synthetic generators. The synthetic datasets with drifts simulate different types of concept drifts, i.e., abrupt (AGRa, LEDa), gradual (AGRg, LEDg), fast incremental changes (RBF_Bf, RBFf), and moderate incremental changes (RBF_Bm, RBFm).

AGR\(_{{\textbf {a}}}\) is a binary class synthetic dataset with 1 M instances, where abrupt concept drifts occur after every 250000 instances, with a drift width of 50 instances.

AGR\(_{{\textbf {g}}}\) also contains binary class synthetic data. Here, gradual concept drifts occur after every 250000 instances, with a drift width of 50000 instances. The dataset has 1 M instances.

LED\(_{{\textbf {a}}}\) is a multi class synthetic dataset with 1 M instances, where abrupt concept drifts occur after every 250000 instances, with a drift width of 50 instances.

LED\(_{{\textbf {g}}}\) is also a multi class synthetic dataset. It has gradual concept drifts occurring after every 250000 instances, with a drift width of 50000 instances. The dataset has 1 M instances.

RBF\(_{{\textbf {f}}}\) contains multi class synthetic data. Here, fast incremental concept drifts occur with the centroid’s speed of change set to 0.001. There are 1 M instances in this dataset.

RBF\(_{{\textbf {m}}}\) also has multi class synthetic data. The dataset contains 1 M instances. Here, moderate incremental concept drifts occur with the centroid’s speed of change set to 0.0001.

RBF_B\(_{{\textbf {f}}}\) is a binary class synthetic dataset with 1 M instances that includes fast incremental concept drifts with the centroid’s speed of change set to 0.001.

RBF_B\(_{{\textbf {m}}}\) has 1 M instances. It is a binary class synthetic dataset with moderate incremental concept drifts occurring with the centroid’s speed of change set to 0.0001.

RandomTree is a binary class synthetic dataset without any drifts. It was generated using MOA RandomTreeGenerator. It has 100K instances.

LED contains multi class synthetic data without any drifts. The dataset was generated using MOA LEDGenerator. It also has 100K instances.

RBF5 is a dataset with 100K instances. It contains multi class synthetic data without drifts. Data was generated using MOA RandomRBFGenerator.

Electricity contains data from the Australian New South Wales Electricity Market, where prices are not fixed but are affected by the supply and demand of the market and are set every five minutes. It is a binary class real-world dataset. The class label identifies the price change (up or down) relative to a moving average of the last 24 h. The dataset exhibits temporal dependencies. It contains 45310 instances.

Airlines is a binary class real-world dataset. The task is to predict whether a given flight will be delayed, given information on the scheduled departure. The dataset has 539382 instances.

The Covertype dataset represents the forest cover type for 30 x 30-meter cells obtained from the US Forest Service Region 2 Resource Information System (RIS) data. Each class corresponds to a different cover type. The dataset poses a multi class problem with seven imbalanced class labels. It includes 581010 instances.

Table 3 Accuracy: Sgbt\(^{MC}\) against other baselines (values are rounded to 2 decimals). Relevant Shaffer Post-hoc test results are shown in figure 1
Fig. 1: Shaffer Post-hoc test with p-value 0.05 for all, binary class, multi class, and evolving (AGRa, AGRg, LEDa, LEDg, RBF_Bm, RBF_Bf, RBFm, RBFf) datasets (accuracy): Sgbt\(^{MC}\) against other baselines (10 iterations with different random seeds). A lower rank is better. Table 3 contains the individual accuracy values for each algorithm on each dataset

Fig. 2: Accuracy over time: Sgbt\(^{MC}\) against Srp, Arf, and OSB on AGR\(_{{\textbf {g}}}\). X axis is the number of instances seen so far. Vertical dotted lines mark a concept drift’s start, center, and end. Sgbt\(^{MC}\)[\(S=100\), \(m=75\), \(lr=\)1.25e\(-\)2]

Fig. 3: Accuracy over time: Sgbt\(^{MC}\) against Srp, Arf, and OSB on LED\(_{{\textbf {g}}}\). X axis is the number of instances seen so far. Vertical dotted lines mark a concept drift’s start, center, and end. Sgbt\(^{MC}\)[\(S=100\), \(m=75\), \(lr=\)1.25e\(-\)2]

Table 4 Time (seconds): Sgbt\(^{MC}\) against Srp (values are rounded to 0 decimals, except ranks)
Table 5 Accuracy: different variants of Sgbt (values are rounded to 2 decimals)

Datasets RBF_Bm, RBFm, RBF_Bf, and RBFf were generated using the MOA RandomRBFGeneratorDrift, while AGRa and AGRg were generated using the MOA ConceptDriftStream with the AgrawalGenerator. The LEDa and LEDg datasets were generated using the MOA ConceptDriftStream with the LEDGeneratorDrift. Table 2 summarizes the characteristics of the datasets.

Sgbt was compared against the current state-of-the-art stream learning baseline Srp, the streaming random forest method Arf, the latest gradient-boosted method for data streams AdIter, and the stream-boosting method OSB. Axgb was not considered in the evaluation, as it failed to outperform Arf (Montiel et al., 2020). Srp used the best parameter configuration reported in Gomes et al. (2019). As 100 base learners produced the best results for Srp in Gomes et al. (2019), all the baselines used 100 base learners. Arf and OSB used the same parameters as in the Gomes et al. (2019) evaluation and the same base learner (HT) as Srp, with the same hyperparameters.

We collected votes for each class on each instance from AdIter’s Python implementation and ran them through a MOA dummy classifier to yield the same evaluation as the other methods. Sgbt was implemented as a MOA classifier, and it used 100 boosting steps (S) to match the other baselines’ 100 base learners. The Sgbt\(^{MC}\) variant was compared against the above baselines; here, the one-vs-rest wrapper classifier was also implemented in MOA. Sgbt used a learning rate of 0.0125 and 75% of the features at each boosting step. As Sgbt requires streaming regression trees as its base learners, the streaming classifier tree HT cannot be used as a base learner. Therefore, the streaming regression tree FIMT-DD (Ikonomovska et al., 2011) was chosen as its base learner. FIMT-DD used a variance reduction split criterion, a grace period of 25, a split confidence interval of 0.05, a constant learning rate at the leaves, and the regression tree option.

Each algorithm was executed multiple times with different random seeds, and the average accuracy was considered in the evaluation process.Footnote 3 Appendix A contains detailed information about the experimental setup.
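
All reported accuracies follow the test-then-train (prequential) protocol; for reference, a generic sketch of that loop is shown below (`predict_one`/`learn_one` are illustrative method names, not the MOA evaluator):

```python
def test_then_train_accuracy(model, stream):
    """Prequential (test-then-train) evaluation: each instance is first used to
    test the current model and only then used to train it."""
    correct = total = 0
    for x, y in stream:
        if model.predict_one(x) == y:    # test first ...
            correct += 1
        total += 1
        model.learn_one(x, y)            # ... then train
    return correct / max(total, 1)
```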

Table 3 compares Sgbt\(^{MC}\)’s accuracy against the baselines mentioned above. As one can see, Sgbt\(^{MC}\) outperforms all the baselines on binary class problems considering average accuracy and rank. It also performs equally well on multi class problems. It is also evident that Sgbt\(^{MC}\) outperforms other methods on datasets with drifts: AGRa, AGRg, LEDa, LEDg, and RBFm. This suggests that Sgbt\(^{MC}\) is a good candidate not only for evolving data, but also for binary class problems.

Table 6 Test then train accuracy of Sgbt\(^{MC}\) for different learning rates (lr), boosting steps (S), feature percentages (m), and Tree Replacement (TR) mechanisms: TR via drift detector and no TR. Notes: (i) Results are ranked and highlighted separately for lr, S, m and TR. (ii) TR: FIMT-DD does TR via Pht, and SGT uses a wrapper classifier to do TR via Ddm. (iii) Values are rounded to 2 decimals, 4 decimals were considered to select the winner

It also performed well on the airlines dataset. On the other hand, Srp yielded good results on the LEDa, LED, and covtype datasets, while Arf performed well on RBF_Bm, RBF_Bf, electricity, RBFf, and RBF5. OSB performed well on the RandomTree dataset. The streaming gradient boosting method AdIter was the least performant among all methods. As it is a binary classifier, AdIter was only evaluated on binary class problems.Footnote 4 Furthermore, the KappaM results in Appendix C (Table 9), which evaluate a learner’s performance on imbalanced data (Bifet et al., 2018), also align with the accuracy rankings in Table 3.

Figure 1 shows the Shaffer Post-hoc test results with a p-value of 0.05 for all, binary class, multi class, and evolving (AGRa, AGRg, LEDa, LEDg, RBF_Bm, RBF_Bf, RBFm, RBFf) datasets considering accuracy. It further highlights that Sgbt\(^{MC}\) outperforms the other methods on binary and evolving datasets with statistical significance. For multi class problems, it is on par with the current state-of-the-art Srp. To our knowledge, this is the first time a streaming gradient-boosted method has been able to surpass the current state-of-the-art bagging and random-forest-based methods on a wide range of evolving data and perform well on all types of data.Footnote 5 These Fig. 1 post-hoc test results for accuracy also align with the KappaM post-hoc test results in Fig. 8 (Appendix C) at the same p-value.

We investigate each algorithm’s performance on evolving data further in Figs. 2 and  3 by comparing accuracy over time for Sgbt\(^{MC}\), Srp, Arf, and OSB on AGRg and LEDg. Based on Figs. 2 and  3, it is evident that Sgbt\(^{MC}\) had the lowest decrease in performance around drift points.

Table 4 compares the evaluation time in seconds reported by MOA for single-threaded Sgbt\(^{MC}\) (Sgbt\(^{MC}_{ST}\)), multi-threaded Sgbt\(^{MC}\), and Srp. Srp was chosen because it had the best predictive performance among the competitors according to Table 3 and Fig. 1. For binary class problems, both Sgbt\(^{MC}\) variants are faster than Srp; this may be because FIMT-DD in Sgbt\(^{MC}\) is a faster base learner than HT in Srp. On multi class problems, Srp is faster than Sgbt\(^{MC}_{ST}\). However, Sgbt\(^{MC}\) was the fastest on multi class problems, leveraging parallel processing at training and prediction. Considering the Gomes et al. (2019) empirical evaluation of run time with 100 base learners, we acknowledge that Arf and OSB can run faster than Srp in practice. However, they have inferior predictive performance compared to Srp in the Table 3 evaluation and in the empirical evaluation of Gomes et al. (2019).

4.1 Multiple steps and multi class support

Another study was conducted to understand the performance of different Sgbt variants: Sgbt, Sgbt\(^{MC}\), Sgbt\(_{MI}\), and Sgbt\(^{MC}_{MI}\). Sgbt\(^{MC}\) supports multi class problems using binary Sgbts, and Sgbt\(_{MI}\) employs multiple training iterations driven by the hessian weights. The two mechanisms are orthogonal, so they can be combined to yield Sgbt\(^{MC}_{MI}\). All Sgbt variants used the same hyperparameter configurations as in the previous experiments. Table 5 shows the results of the study. Since Sgbt\(^{MC}\) reverts to Sgbt and Sgbt\(^{MC}_{MI}\) reverts to Sgbt\(_{MI}\) on binary class problems, those two can be ignored for that category; there, Sgbt performs better than Sgbt\(_{MI}\) on most of the binary class datasets. However, Sgbt\(_{MI}\) has a higher average accuracy for that category, which suggests that it performs exceptionally well on certain datasets such as RBF_Bm, RBF_Bf, and electricity. The result on RBF_Bf, which has fast-evolving drifts, is interesting, as it suggests that the multiple training iterations by hessian in Sgbt\(_{MI}\) improve Sgbt’s performance on fast-evolving data. For multi class problems, Sgbt\(^{MC}_{MI}\) is the clear winner. Comparing Sgbt\(^{MC}\) with Sgbt, it is clear that multi class support using binary Sgbts performs better than Sgbt’s committee-based multi class support. On the other hand, the multi class results for Sgbt and Sgbt\(_{MI}\) suggest that multiple iterations by hessian improve Sgbt’s accuracy on multi class problems. This explains why Sgbt\(^{MC}_{MI}\) performs best on multi class problems, as it includes both multi class support using binary Sgbts and multiple iterations by hessian. The overall performance of Sgbt\(^{MC}_{MI}\) exceeds that of Sgbt\(^{MC}\), which was compared against the other baselines in Table 3; however, Sgbt\(^{MC}\) was used in the Table 3 evaluation because of its computational efficiency compared to Sgbt\(^{MC}_{MI}\). Sgbt\(^{MC}_{MI}\), on the other hand, is a good candidate for evolving data stream applications that prioritize predictive performance over computational efficiency.

4.2 Parameter exploration

A parameter exploration was conducted to understand the impact of learning rate (lr), boosting steps (S), weight (\(h_i\)) transfer methods, percentage of features (m), and the independent TR mechanism at each tree via drift detection on Sgbt\(^{MC}\)’s predictive performance. The results for all these analyses are shown in Table 6 (ranked separately).

Three learning rates: 6.25e\(-\)3, 1.25e\(-\)2, and 2.50e\(-\)2, were used in the study to understand the effect of learning rate (lr) on Sgbt\(^{MC}\)’s performance. All the other configurations: FIMT-DD base learner, 75% of features (m), and 100 boosting steps (S) were kept unchanged. As per Table 6, considering Sgbt\(^{MC}\) [\(S=100\), \(m=75\), \(lr=\){6.25e\(-\)3, 1.25e\(-\)2, 2.50e\(-\)2}, FIMT-DD] configurations, in general, larger learning rates (lr) seem to favour both binary and multi class problems.

In a separate study to understand the effect of boosting steps on Sgbt\(^{MC}\)’s performance, five settings for the number of boosting steps (20, 40, 60, 80, 100) were considered. In this study, the base learner (FIMT-DD), feature percentage (m=75%), and learning rate (lr=1.25e\(-\)2) were kept unchanged. According to Table 6, when considering the Sgbt\(^{MC}\) [\(S={20, 40, 60, 80, 100}\), \(m=75\), \(lr=\)1.25e\(-\)2, FIMT-DD] configurations, 100 boosting steps yield better results than fewer boosting steps for both binary and multi class problems. This aligns with the OSB results in Gomes et al. (2019), where more boosting iterations performed better than fewer boosting iterations.

In another study investigating the influence of different feature percentages (m) on Sgbt\(^{MC}\)’s performance, all Sgbt\(^{MC}\) configurations remained constant, including the base learner (FIMT-DD), learning rate (lr=1.25e\(-\)2), and boosting steps (S=100), except for the feature percentage (m).

Fig. 4: Accuracy over time: Different Sgbt\(^{MC}\) versions on AGR\(_{{\textbf {g}}}\). X axis is the number of instances seen so far. Vertical dotted lines mark a concept drift’s start, center, and end. Sgbt\(^{MC}_{SK\_{1/k}}\)[\(S=100\), \(m=75\), \(lr=\)1.25e\(-\)2]

Fig. 5: Accuracy over time: Different Sgbt\(^{MC}\) versions on LED\(_{{\textbf {g}}}\). X axis is the number of instances seen so far. Vertical dotted lines mark a concept drift’s start, center, and end. Sgbt\(^{MC}_{SK\_{1/k}}\)[\(S=100\), \(m=75\), \(lr=\)1.25e\(-\)2]

Fig. 6: Model size over time: Different Sgbt\(^{MC}\) versions on AGR\(_{{\textbf {g}}}\). X axis is the number of instances seen so far. Vertical dotted lines mark a concept drift’s start, center, and end. Sgbt\(^{MC}_{SK\_{1/k}}\)[\(S=100\), \(m=75\), \(lr=\)1.25e\(-\)2]

Fig. 7: Model size over time: Different Sgbt\(^{MC}\) versions on LED\(_{{\textbf {g}}}\). X axis is the number of instances seen so far. Vertical dotted lines mark a concept drift’s start, center, and end. Sgbt\(^{MC}_{SK\_{1/k}}\)[\(S=100\), \(m=75\), \(lr=\)1.25e\(-\)2]

Table 7 Accuracy and evaluation time(s) of Sgbt\(^{MC}_{SK\_{1/k}}\)[\(S=100\),\(m=75\),\(lr=\)1.25e\(-\)2]

According to Table 6, among the Sgbt\(^{MC}\) [\(S=100\), \(m={45, 60, 75, 100}\), \(lr=\)1.25e\(-\)2] configurations, 75% of the features yields good accuracy on most datasets. Not using 100% of the features helps to increase the diversity of the ensemble, which avoids overfitting to the data. These results match the findings of Gomes et al. (2017, 2019), where Arf and Srp perform best with 60% of the features.

A separate study examines the effect of an independent TR mechanism in each base learner on Sgbt\(^{MC}\)’s performance. For this study, SGT was selected as the base learner, since FIMT-DD already has a built-in TR mechanism. Hence, a generic regressor with an inbuilt TR mechanism based on Ddm’s warning and out-of-control signals was introduced into MOA. This allows us to enable or disable the underlying TR strategy by using the generic regressor with SGT and Ddm or just using SGT. The Ddm settings were: minimum number of instances before permitting a change detection = 250, warning level = 2.0, and out-of-control level = 2.5. SGT used the same default configurations as in Gouk et al. (2019). From the Table 6 results, one can see that having an internal TR mechanism often improves performance. Also, all the Sgbt\(^{MC}\) configurations with SGT perform poorly on RBFf. This may be because SGT’s default warmStart of 1000 (the number of instances used to estimate bin boundaries for numeric values) is too large for RBFf, which has fast-moving drifts.

4.3 Skip training on instances

Another study was conducted using Sgbt\(^{MC}_{SK\_{1/k}}\)[\(S=100\), \(m=75\), \(lr=\)1.25e\(-\)2] with different k values to understand the effect of random skip training. Here, k was set to 1, 2, and 3, so that Sgbt\(^{MC}_{SK\_{1/k}}\) skipped none, 1/2, and 1/3 of the instances, respectively. As per Table 7, apart from RBF_Bf and RBFf, Sgbt\(^{MC}_{SK\_{1/3}}\) produced good results even with a third of the instances skipped. The slightly poorer accuracy on those two datasets may be because both RBF_Bf and RBFf have fast-moving drifts.

To further illustrate the influence of random skipping, another study was conducted using Sgbt\(^{MC}_{SK\_{1/k}}\)[\(S=100\), \(m=75\), \(lr=\)1.25e\(-\)2] with different k values (1, 2, 3) on the AGRg and LEDg datasets. The idea here is to understand the effect of skip training on Sgbt\(^{MC}_{SK\_{1/k}}\)’s performance for binary and multi class problems. Both AGRg and LEDg have drifts happening at the same intervals; however, AGRg is a binary problem, and LEDg is a multi class problem with 10 classes. Accuracy and model size statistics were collected every 10000 instances. Considering the classification accuracy in Figs. 4 and 5, skipping instances for training does not significantly hinder accuracy on either AGRg or LEDg. On the other hand, skipping instances results in significant memory savings on both datasets, as shown in Figs. 6 and 7. These savings are much more prevalent on LEDg, as Sgbt\(^{MC}_{SK\_{1/k}}\) needs 10 Sgbts compared to 1 for AGRg.

5 Conclusion

This work uses the generic weighted squared loss elicited in Friedman et al. (2000) and Chen and Guestrin (2016), with the hessian as the weight and the gradient over the hessian as the target while accounting for the loss of the previous boosting step, together with streaming regression trees that employ an internal TR strategy, to propose Sgbt. In the experiments, the Sgbt variant Sgbt\(^{MC}\) with FIMT-DD as the base learner produced superior results compared to state-of-the-art streaming methods on large evolving data with multiple drifts and drift types.

The hessian weights calculated by Sgbt are fractional for most loss functions, and to our knowledge, none of the streaming regression trees support non-integer weights. To circumvent this limitation, Sgbt employs either a weight of 1 or a transformed weight (which yields a positive integer) to train the base learner. As future work, one could explore the work by Pébay et al. (2016) and Schubert and Gertz (2018) on the incremental calculation of variance and covariance to support fractional weights for the SGT and FIMT-DD base learners. Another direction for future work is to skip training selectively on certain instances based on whether their loss exceeds a certain threshold, as in Gunasekara et al. (2022), instead of skipping at random.