Abstract
Gradient Boosting is a widely-used machine learning technique that has proven highly effective in batch learning. However, its effectiveness in stream learning contexts lags behind bagging-based ensemble methods, which currently dominate the field. One reason for this discrepancy is the challenge of adapting the booster to new concept following a concept drift. Resetting the entire booster can lead to significant performance degradation as it struggles to learn the new concept. Resetting only some parts of the booster can be more effective, but identifying which parts to reset is difficult, given that each boosting step builds on the previous prediction. To overcome these difficulties, we propose Streaming Gradient Boosted Trees (Sgbt), which is trained using weighted squared loss elicited in XGBoost. Sgbt exploits trees with a replacement strategy to detect and recover from drifts, thus enabling the ensemble to adapt without sacrificing the predictive performance. Our empirical evaluation of Sgbt on a range of streaming datasets with challenging drift scenarios demonstrates that it outperforms current state-of-the-art methods for evolving data streams.
1 Introduction
Boosting methods have become increasingly successful in machine learning over the past decade. While early weighted boosting algorithms such as AdaBoost (Freund & Schapire, 1997) showed promise, they were later surpassed by gradient boosting methods (Friedman, 2001, 2002; Friedman et al., 2000). Gradient Boosting leverages the previous base learner’s gradient information (i.e., the slope of the loss function) to boost the performance of the next learner in an ensemble. The eXtreme Gradient Boosting (XGBoost) (Chen & Guestrin, 2016) takes this approach to another level, achieving high efficiency and superior performance on various time-critical real-world problems. However, in many real-world scenarios, traditional batch learning with the Independent and Identically Distributed (iid) assumption cannot keep pace with the evolving nature of the underlying data stream (Gomes et al., 2017; Bifet et al., 2018).
On the other hand, Stream Learning (SL) accounts for the possibility of change in the underlying data distribution (concept drift) (Bifet et al., 2018). A model should respond efficiently in real-time when learning from an evolving data stream (Bifet et al., 2018). While methods such as Adaptive eXtreme Gradient Boosting (Axgb) (Montiel et al., 2020) and Adaptive Iterations (AdIter) (Wang et al., 2022) were proposed by the research community to enable gradient boosting for evolving data streams, they failed to outperform state-of-the-art ensemble learners like Adaptive Random Forest (Arf) (Gomes et al., 2017) and Streaming Random Patches (Srp) (Gomes et al., 2019).
The proposed work utilizes streaming regression trees with inbuilt drift detectors in a gradient-boosted setting. The paper makes the following contributions:
1. To our knowledge, Sgbt is the first streaming gradient-boosted method for evolving data streams built on the weighted squared loss elicited in Friedman et al. (2000) and Chen and Guestrin (2016), which uses the hessian as the weight and the gradient over the hessian as the target, both computed from the previous boosting step’s loss. This allows Sgbt to leverage any streaming regression tree as its base learner.
2. Sgbt utilizes trees with an internal Tree Replacement (TR) mechanism instead of externally monitoring each item in the boosting ensemble for drifts and adjusting each item like Axgb (Montiel et al., 2020) or resetting some parts as in AdIter (Wang et al., 2022). This Tree Replacement mechanism allows the trees in the booster to adapt dynamically to concept drifts. Unlike the binary-class gradient-boosted streaming implementations Axgb and AdIter, Sgbt can solve multi class problems using either a committee of trees at each boosting step or a committee of Sgbts.
3. We present an extensive empirical evaluation of Sgbt against current state-of-the-art streaming bagging with random subspaces (Srp), random forest (Arf), boosting (OSB), and gradient boosting (AdIter) methods on 14 datasets with different drift types.
Overall, Sgbt outperforms existing techniques for evolving data streams. The paper is structured as follows. The next section reviews the current state-of-the-art stream learning methods and recent gradient boosting work for evolving data streams. The subsequent section explains our proposed Sgbt method. The experiments section describes the experimental setup where Sgbt was evaluated against state-of-the-art stream learning methods. The final section provides conclusions and directions for future research.
2 Related work
Boosting and bagging are two popular ensemble learning techniques used in machine learning. Bagging randomly samples instances with replacement to train each member of the ensemble. Boosting, on the other hand, attempts to boost the performance of the next base learner in the ensemble, considering the loss of the previous one. It combines the predictions of weak learners additively to produce a strong learner (Friedman et al., 2000; Friedman, 2001). AdaBoost (Freund & Schapire, 1997) assigns higher weights to the instances misclassified by the current base learner to improve the next base learner. Gradient boosting uses the current base learner’s gradient information of the loss to improve the next base learner (Friedman et al., 2000). XGBoost (Chen & Guestrin, 2016) uses this gradient information to derive a particular regression tree that predicts a raw score at the leaf for a given instance. It contains an efficient split-finding mechanism, cache-aware data processing, and parallel processing to produce a highly scalable and efficient algorithm for batch learning (Chen & Guestrin, 2016).
Compared to batch learning, a Stream Learning model learns from an evolving data stream (non-iid data), processing one instance at a time. Here, the model should be able to predict at any given moment using limited processing and memory (Bifet et al., 2018; Gomes et al., 2017). Also, it should adjust to distribution changes (concept drifts) in the underlying data stream (Bifet et al., 2018; Bifet & Gavalda, 2007; Gomes et al., 2017).
Data stream boosting is challenging due to the evolving nature of the data stream. Here, the model needs to adjust to the new input distribution of the stream after a concept drift (Bifet et al., 2018; Montiel et al., 2020). Online Bagging (OBg) and Online Boosting (OB) (Oza & Russell, 2001) were inspired by the observation that a binomial distribution Binomial(p, n) can be approximated by a Poisson distribution \(Poisson(\lambda )\) with \(\lambda =np\) as \(n \rightarrow \infty\). Here, n is the number of instances, and p is the probability of success in the binomial distribution. Since the probability of selecting a given example is 1/n in batch bagging, the uniform sampling with replacement of the bagging algorithm is approximated by Poisson(1) in OBg. On the other hand, in OB, \(\lambda\) is computed by tracking the total weights of correctly classified and misclassified examples for each base learner. An online version of SmoothBoost (Servedio, 2003) was proposed in Chen et al. (2012). This Online Smooth Boost (OSB) uses smooth distributions that do not assign too much weight to a single example. When the number of weak learners and examples are sufficiently large, OSB is guaranteed to achieve an arbitrarily small error rate (Chen et al., 2012; Gomes et al., 2019). Recently, two notable approaches were proposed by the stream learning community to leverage gradient boosting for data streams: Axgb (Montiel et al., 2020) and AdIter (Wang et al., 2022). Axgb employs mini-batch trained XGBoost as its base learners and adjusts the ensemble in response to concept drifts, which it detects using ADWIN (Bifet & Gavalda, 2007). AdIter attempts to identify the weak learners in the ensemble and prune them when confronted with concept drift. It then employs multiple training iterations via majority vote among the ensemble to support different drift types. Both Axgb and AdIter only support binary classification. 
In contrast, our proposed streaming gradient boosting method (Sgbt) supports both binary and multi class problems.
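The Poisson approximation behind Online Bagging described above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names (`poisson_sample`, `online_bagging_weights`) are hypothetical, and the sampler uses Knuth's multiplication method rather than anything prescribed by Oza and Russell (2001).

```python
import math
import random


def binomial_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(K = k) for K ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def online_bagging_weights(n_learners, rng):
    """How many times each ensemble member trains on the current instance.

    Batch bagging picks each example with probability 1/n in n draws, so the
    count is Binomial(n, 1/n), which approaches Poisson(1) as n grows.
    """
    return [poisson_sample(1.0, rng) for _ in range(n_learners)]


if __name__ == "__main__":
    n = 1000
    for k in range(4):
        print(k, round(binomial_pmf(k, n, 1 / n), 4), round(poisson_pmf(k, 1.0), 4))
    print(online_bagging_weights(5, random.Random(7)))
```

Each ensemble member would then be trained on the incoming instance as many times as its drawn weight, which is why no stored dataset is needed to emulate sampling with replacement.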
Arf (Gomes et al., 2017) and Srp (Gomes et al., 2019) are popular ensemble learning methods for streaming data. They allow one to use efficient stream learning base learners like Hoeffding Tree (HT) in a random forest or random subspaces setup in conjunction with efficient drift detectors like ADWIN. Arf is a streaming random forest adaptation that combines re-sampling strategies, drift detection, and drift recovery strategies (Gomes et al., 2017). Srp combines random subspaces and re-sampling (i.e., random patches) to leverage diversity among base incremental learners (Gomes et al., 2019). It uses the same drift detection and recovery strategy as Arf, but tends to outperform Arf (Gomes et al., 2019) in some benchmarks while not being limited to decision trees.
OSB performed better than OB (Chen et al., 2012). The empirical evaluation in Gomes et al. (2019) shows that even with 100 base learners, Arf and Srp outperform OSB by a large margin. In the same evaluation, Srp outperformed Arf. Axgb failed to outperform Arf in the empirical evaluation of Montiel et al. (2020). In the experiments of Wang et al. (2022), AdIter also failed to surpass Arf on synthetic evolving datasets with 10,000 instances. However, in the same evaluation, AdIter surpassed Arf on real-world data. In that evaluation, all datasets apart from airlines had fewer than 100,000 instances. The above empirical evaluations suggest that the latest gradient boosting methods for evolving data streams are yet to surpass current state-of-the-art ensemble methods like Srp and Arf. However, our proposed Sgbt was able to outperform Srp and Arf on a variety of evolving datasets.
3 Streaming gradient boosted trees (SGBT)
For a dataset with n instances, let \(x_i\) be the features for the i-th instance and \(y_i\) be its target value. In gradient boosting, a model \(\phi\) can be represented as S additive functions:
\[{\hat{y}}_{i} = \phi (x_i) = \sum _{s=1}^{S} f_s(x_i), \quad f_s \in {\mathcal {F}},\]
to predict \({\hat{y}}_{i}\) (Friedman, 2002; Chen & Guestrin, 2016). Here, \({\mathcal {F}}\) is the space of regression trees. In XGBoost (Chen & Guestrin, 2016), each \(f_s\) corresponds to an independent tree structure with leaf scores \(\omega\). Each regression tree contains a continuous score \(\omega _{i}\) at the leaf for the i-th instance. The authors proposed to sum up the corresponding scores at the leaves of each tree for prediction. The learning objective is to minimize the regularized objective:
\[{\mathcal {L}}(\phi ) = \sum _{i=1}^{n} l(y_i, {\hat{y}}_i) + \sum _{s=1}^{S} \Omega (f_s),\]
where \(\Omega\) penalizes the complexity of the tree f:
\[\Omega (f) = \gamma T + \frac{1}{2} \beta \Vert \omega \Vert ^2.\]
Here, \(\gamma\) penalizes adding a new leaf and \(\beta\) forces leaf predictions to be small. T is the number of leaves in the tree. l is a differentiable convex loss function that measures the difference between the prediction \({\hat{y}}_i\) and the target \(y_i\). Furthermore, the loss at the s-th step is the loss incurred by the previous (\(s-1\)) step and the loss incurred by tree \(f_s\) plus the regularization term:
\[{\mathcal {L}}^{(s)} = \sum _{i=1}^{n} l\left( y_i, {\hat{y}}^{(s-1)} + f_s(x_i)\right) + \Omega (f_s).\]
This loss can be approximated using a second-order Taylor approximation (Chen & Guestrin, 2016):
\[{\mathcal {L}}^{(s)} \simeq \sum _{i=1}^{n} \left[ l(y_i, {\hat{y}}^{(s-1)}) + g_i f_s(x_i) + \frac{1}{2} h_i f_s^2(x_i)\right] + \Omega (f_s).\]
Here, \(g_i = \partial _{{\hat{y}}^{(s-1)}} l(y_i, {\hat{y}}^{(s-1)})\) and \(h_i = \partial ^2_{{\hat{y}}^{(s-1)}} l(y_i, {\hat{y}}^{(s-1)})\) are the first-order (gradient) and second-order (hessian) statistics of the loss at the \((s-1)\)-th prediction. Though the authors (Chen & Guestrin, 2016) use a simplified version of the above loss function by removing constants to derive raw score values at the leaves, completing the square in \(f_s(x_i)\) explains it as a weighted squared loss with weight \(h_i\) and the gradient over the hessian as the target. With \(g_i\) defined as the derivative above, the target carries a minus sign, \(-g_i/h_i\); it is written \(g_i/h_i\) when g denotes the negative gradient, as in the cross-entropy case used later:
\[{\mathcal {L}}^{(s)} \simeq \sum _{i=1}^{n} \frac{1}{2} h_i \left( f_s(x_i) - \left( -\frac{g_i}{h_i}\right) \right) ^2 + \Omega (f_s) + constant. \tag{2}\]
This weighted squared loss with hessian as the weight and gradient over hessian as the target, considering the previous boosting step’s loss, was first introduced in Friedman et al. (2000).
Equation 2 provides the flexibility to utilize various streaming regression trees instead of the one employed in XGBoost. Moreover, depending on the implementation, the streaming regression tree’s regularization term can differ from that employed in XGBoost. In this work, the Tree Replacement strategy, explained later in this paper, acts as a regularization mechanism.
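The step from the second-order Taylor terms to a weighted squared loss can be checked numerically. The sketch below is an illustration only, assuming the binary logistic loss for concreteness; the function names are hypothetical. It uses g as the true derivative of the loss, in which case the regression target is \(-g/h\); the same target is written \(g/h\) when g denotes the negative gradient.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_grad_hess(y, raw_score):
    """First and second derivative of the log-loss w.r.t. the raw score."""
    p = sigmoid(raw_score)
    return p - y, p * (1.0 - p)


def taylor_terms(g, h, f):
    """Second-order Taylor terms that depend on the new tree's output f."""
    return g * f + 0.5 * h * f * f


def weighted_squared_form(g, h, f):
    """0.5 * h * (f - target)^2 with target = -g / h, minus the constant g^2 / (2h)."""
    target = -g / h
    return 0.5 * h * (f - target) ** 2 - g * g / (2.0 * h)


if __name__ == "__main__":
    g, h = logistic_grad_hess(y=1, raw_score=0.3)
    for f in (-1.0, 0.0, 0.5, 2.0):
        print(f, taylor_terms(g, h, f), weighted_squared_form(g, h, f))
```

Because the two expressions differ only by a term that does not depend on f, any regression learner that minimizes a weighted squared error can stand in for the tree at a boosting step.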
3.1 Streaming regression trees with internal tree replacement strategy for gradient boosting
In data stream learning, n could be infinite, and learning happens online, where a model \(\phi _{i-1}\) learned at the \((i-1)\)-th instance is used to predict the i-th instance. Also, from any i-th instance, the underlying distribution of x could change (concept drift). The model \(\phi _{i}\) should adjust its regression trees to adapt to this new distribution at i. Instead of externally monitoring and resetting each \(f_s\) tree like in AdIter (Wang et al., 2022), in Sgbt, the trees internally monitor their standardized absolute error and train an alternate tree if it goes above a warning level. The tree \(f_s\) switches to its alternate tree once the error reaches a danger zone. Each \(f_s\) tree employs a drift detector to monitor its standardized absolute error and trigger these warning and danger signals. The rest of the paper refers to this strategy of replacing the active tree with an alternate tree on the drift detection signal as Tree Replacement (TR). In the experiments, we used two regression trees for data streams: FIMT-DD (Ikonomovska et al., 2011) and SGT (Gouk et al., 2019) with in-built drift detectors: Page-Hinckley Test (Pht) (Mouss et al., 2004) and Ddm (Gama et al., 2004). The implementation of SGT with Ddm is generic, and one could replace SGT with any other regression tree for data streams. Here, TR also serves as a dynamic regularization mechanism by replacing trees as data evolves during learning. Arf and Srp use a similar Tree Replacement strategy under random-forest and bagging settings for SL classification. However, to our knowledge, this is the first time TR is used in gradient-boosted trees for SL classification. Here, the booster is allowed to dynamically adjust to underlying input distribution changes as some active trees are replaced by their alternate trees on drift detection.
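The warning and danger logic of Tree Replacement can be sketched as follows. This is a deliberately simplified illustration, not the detectors used in the experiments: it replaces Pht and Ddm with two fixed thresholds on an exponentially smoothed absolute error, and it uses a running-mean predictor as a stand-in for a streaming regression tree. The class names, thresholds, and decay factor are all hypothetical.

```python
class RunningMean:
    """Stand-in for a streaming regression tree: predicts the mean target seen so far."""

    def __init__(self):
        self.n, self.mean = 0, 0.0

    def predict(self, x):
        return self.mean

    def learn(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n


class ReplaceableLearner:
    """Active learner plus an alternate that is trained after a warning signal."""

    def __init__(self, make_learner, warning=1.0, danger=2.0, decay=0.9):
        self.make_learner = make_learner
        self.active = make_learner()
        self.alternate = None
        self.warning, self.danger, self.decay = warning, danger, decay
        self.smoothed_error = 0.0
        self.replacements = 0

    def predict(self, x):
        return self.active.predict(x)

    def learn(self, x, y):
        error = abs(y - self.active.predict(x))
        self.smoothed_error = (self.decay * self.smoothed_error
                               + (1.0 - self.decay) * error)
        if self.smoothed_error > self.danger and self.alternate is not None:
            # Danger zone: switch to the alternate and restart monitoring.
            self.active, self.alternate = self.alternate, None
            self.smoothed_error = 0.0
            self.replacements += 1
        elif self.smoothed_error > self.warning and self.alternate is None:
            # Warning level: start training an alternate in the background.
            self.alternate = self.make_learner()
        elif self.smoothed_error <= self.warning:
            # Error recovered: discard any alternate as a false alarm.
            self.alternate = None
        self.active.learn(x, y)
        if self.alternate is not None:
            self.alternate.learn(x, y)


if __name__ == "__main__":
    learner = ReplaceableLearner(RunningMean)
    for y in [0.0] * 200 + [10.0] * 200:  # abrupt change in the target
        learner.learn(None, y)
    print(learner.replacements, learner.predict(None))
```

Because the monitoring lives inside each learner, a booster built from such learners needs no external bookkeeping to decide which boosting step to reset.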
The loss function in Eq. 2 requires the regression trees to support fractional weights, as \(h_i\) could be a fractional value for some loss functions. The streaming regression trees (SGT and FIMT-DD) considered in this work only support integer weights. Supporting fractional weights for them is not trivial. For example, SGT and FIMT-DD would require the incremental calculation of variance and co-variance for fractional weights. Though recent work by Pébay et al. (2016) and Schubert and Gertz (2018) suggests this is possible, it is a separate research topic in itself. Also, as clarified later in the text, the hessians for the popular categorical cross-entropy loss with softmax used in the experiments are consistently below 1. Hence, even though Sgbt calculates these weights (hessians), it does not pass them to the underlying trees for these practical reasons. Instead, it passes a weight of 1 to the trees.
Instead of using all the features to train at each boosting step, Sgbt uses a subset of features based on a predefined feature percentage. This approach of using a subset of features to train each ensemble member is also used in Srp (Gomes et al., 2019) to increase the diversity among the base learners. Algorithm 1 explains the training procedure of Sgbt.
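A minimal sketch of one possible per-instance training loop is given below for the binary case. It is not Algorithm 1 itself, only an illustration under several assumptions: the logistic loss is used, the base learner is a hypothetical piecewise-constant regressor (`BinnedMeanRegressor`) standing in for a streaming regression tree, every instance is passed with weight 1, the target is the negative gradient over the hessian, and each step's prediction is read before the step is trained. The learning rate and number of steps are chosen for a quick demonstration and are not the values used in the experiments.

```python
import math
import random


class BinnedMeanRegressor:
    """Piecewise-constant stand-in for a streaming regression tree."""

    def __init__(self, feature_ids):
        self.feature_ids = feature_ids
        self.stats = {}  # cell -> (count, mean target)

    def _cell(self, x):
        return tuple(x[j] > 0.0 for j in self.feature_ids)

    def predict(self, x):
        return self.stats.get(self._cell(x), (0, 0.0))[1]

    def learn(self, x, target):
        cell = self._cell(x)
        count, mean = self.stats.get(cell, (0, 0.0))
        count += 1
        self.stats[cell] = (count, mean + (target - mean) / count)


class SgbtSketch:
    def __init__(self, n_features, steps=10, lr=0.3, feature_fraction=0.75, seed=1):
        rng = random.Random(seed)
        m = max(1, round(feature_fraction * n_features))
        self.lr = lr
        # Each boosting step sees its own random subset of the features.
        self.learners = [
            BinnedMeanRegressor(sorted(rng.sample(range(n_features), m)))
            for _ in range(steps)
        ]

    def raw_score(self, x):
        return sum(self.lr * f.predict(x) for f in self.learners)

    def predict(self, x):
        return 1 if self.raw_score(x) > 0.0 else 0

    def learn(self, x, y):
        score = 0.0  # raw score accumulated over the previous boosting steps
        for f in self.learners:
            p = 1.0 / (1.0 + math.exp(-score))
            neg_grad = y - p
            hess = max(p * (1.0 - p), 1e-6)  # guard against division by zero
            step_output = f.predict(x)       # read before updating this step
            f.learn(x, neg_grad / hess)      # weight 1, target g / h
            score += self.lr * step_output


if __name__ == "__main__":
    rng = random.Random(0)
    model = SgbtSketch(n_features=4)
    for _ in range(2000):
        x = [rng.uniform(-1, 1) for _ in range(4)]
        model.learn(x, 1 if x[0] > 0 else 0)
    print(model.predict([0.5, 0.0, 0.0, 0.0]), model.predict([-0.5, 0.0, 0.0, 0.0]))
```

The feature subset per step mirrors the diversity argument made above; steps whose subset omits the informative feature simply learn outputs close to zero.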
3.2 Multi class support
Two approaches are used to support multi class problems: Sgbt and Sgbt\(^{MC}\).
- SGBT uses a committee of regression trees in a given boosting step s. Here, a single tree is trained for each class. The committee's outputs are combined through a softmax function, so the probability that an instance, \(x_i\), belongs to class c is given by: \({\hat{y}}_{i,c}= \frac{\exp (f_{s,c}(x_i))}{\sum _{c'=1}^{C}\exp (f_{s,c'}(x_i))}\). Here \(f_{s,c}\) is the regression tree trained to predict a real-valued score for class c at the s-th boosting step, and C is the number of classes. In practice, hard-wiring \(f_{s,C}(x_i)=0\) allows Sgbt to reduce the number of trees being trained (see Footnote 1). The categorical cross-entropy loss (\(l^{CE}\)) is used to train the model: \(l^{CE}(y,{\hat{y}}) = - \sum _{c=1}^{C} y_c \log ({\hat{y}}_c)\). Here, y is the ground truth encoded as a one-hot vector. For \(l^{CE}\), the gradient term (g) is \(y_c -{\hat{y}}_c\), that is, the negative derivative of the loss with respect to the raw score of class c, and the hessian (h) is \({\hat{y}}_c (1 - {\hat{y}}_c)\). The regression tree committee (comprising \(C-1\) items) at the s-th boosting step represents the base learner for the s-th boosting step. This approach is also used in SGT to support multi class classification.
- SGBT\(\mathbf {^{MC}}\) uses the same loss function (\(l^{CE}\)) as Sgbt, but it uses a wrapper classifier to invoke a binary Sgbt classifier for each class. The task of each binary Sgbt classifier is to distinguish a given class from all the other classes. At prediction, the votes of all C classifiers for the positive outcome are collected and normalized. The class associated with the classifier that predicted the positive outcome most confidently is considered the final class for the instance. This approach is very popular in batch learning and is commonly known as one-vs-rest or one-vs-all in the literature (Witten et al., 2016). Sgbt\(^{MC}\) reverts to Sgbt for binary class problems to avoid any computing overhead.
Unlike Axgb and AdIter, the above two approaches allow Sgbt to support gradient boosting for evolving data streams on multi class problems.
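The per-class statistics used by the softmax committee can be computed as in the sketch below. It is an illustration with hypothetical function names; the last raw score is fixed to 0 in the example to mirror the hard-wiring described above, and the gradient term follows the sign convention \(y_c - {\hat{y}}_c\) given in the text.

```python
import math


def softmax(scores):
    """Numerically stable softmax over raw class scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_stats(scores, true_class):
    """Return probabilities and, per class, (g, h) with g = y_c - p_c and h = p_c (1 - p_c)."""
    probs = softmax(scores)
    stats = []
    for c, p in enumerate(probs):
        y_c = 1.0 if c == true_class else 0.0
        stats.append((y_c - p, p * (1.0 - p)))
    return probs, stats


def cross_entropy(scores, true_class):
    return -math.log(softmax(scores)[true_class])


if __name__ == "__main__":
    raw_scores = [1.0, -0.5, 0.0]  # last class hard-wired to 0
    probs, stats = cross_entropy_stats(raw_scores, true_class=0)
    for c, (g, h) in enumerate(stats):
        print(c, round(probs[c], 4), round(g, 4), round(h, 4), round(g / h, 4))
```

The ratio g / h printed in the last column is the regression target that the tree for class c would be trained on at this boosting step.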
3.3 Predicting and computing improvements
Two variants of Sgbt are proposed below to improve the computing performance and utilise already calculated hessian weights.
- SGBT\(\mathbf {^{SK}}\): In most streaming regression trees, the computation and memory complexities are affected by the number of instances they process. Some computation and memory savings could be achieved by skipping training on random instances. Sgbt\(^{SK}\) randomly skips 1/k-th of instances (\(k \in {\mathbb {N}}, k \ge 1\)). By default k is set to 1, which disables skipping so that all instances are processed as in Sgbt. Work by Gunasekara et al. (2022) and Pavlovski et al. (2017) also exploited skip training for Stream Learning. Line 5 in Algorithm 2 highlights this skip training.
- SGBT\(\mathbf {_{MI}}\): Even though current base learners only support integer weights, utilizing the already calculated fractional hessian weights is helpful. For \(l^{CE}\), the hessian for class c at the i-th instance is always less than 1 (\(h_{i,c}< 1\)), so passing \(h_{i,c}\) directly to a ceiling function (see Footnote 2) always returns 1. Multiplying \(h_{i,c}\) by 10 before applying the ceiling function instead yields a positive integer weight that is greater than 1 for some instances and equal to 1 for all the others. If \(ceiling(h_{i,c} * 10 ) = T\), Sgbt can train the \(f_{s,c}\) base learner T times using instance \(x_{i,s}\) with label \(g_{i,c}/h_{i,c}\). Here, the multiplier 10 ensures that \(T\le 10\) for all instances, providing a reasonable upper limit to the computational cost of this approach. This technique of training a base learner multiple times based on a calculated integer weight for an instance is quite common in stream learning (Oza & Russell, 2001; Gomes et al., 2019). Line 12 in Algorithm 2 highlights this multiple training iteration approach. Furthermore, this approach allows Sgbt to use streaming regression trees that do not support weights.
Algorithm 2 explains the above two variants of Sgbt in detail. In the experiments, we evaluate the effectiveness of these Sgbt variants.
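The two variants reduce to two small per-instance decisions, sketched below with hypothetical function names. The skip rule shown (skip with probability 1/k when k > 1) is an assumption about how "randomly skips 1/k-th of instances" could be realised; it is not taken from Algorithm 2. For the cross-entropy hessian \({\hat{y}}(1-{\hat{y}})\), which never exceeds 1/4, the repeat count in this sketch is at most 3, well within the bound of 10 stated above.

```python
import math
import random


def training_repeats(hessian, multiplier=10):
    """Integer number of training passes derived from a fractional hessian weight."""
    return max(1, math.ceil(hessian * multiplier))


def should_skip(k, rng):
    """Skip roughly 1/k of the instances; k = 1 disables skipping."""
    return k > 1 and rng.randrange(k) == 0


if __name__ == "__main__":
    for p in (0.5, 0.8, 0.9, 0.99):
        h = p * (1.0 - p)
        print(p, round(h, 4), training_repeats(h))
    rng = random.Random(3)
    skipped = sum(should_skip(4, rng) for _ in range(10000))
    print("fraction skipped with k=4:", skipped / 10000)
```

Confident predictions (probabilities near 0 or 1) have small hessians and are trained on once, whereas uncertain predictions near 0.5 are repeated, which concentrates the extra training effort where the model is least sure.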
As Sgbt allows different streaming regression trees for its base learners, its final time and memory complexities are influenced by the base learner’s time and memory complexities. Sgbt’s time complexity can be derived as \({\mathcal {O}}(CSf)\), and its memory complexity as \({\mathcal {O}}(CSf)\), assuming \({\mathcal {O}}(f)\) for the base learner’s time and memory complexities. Here, S is the number of boosting steps, and C is the number of classes. Sgbt\(^{MC}\) has the same time and memory complexities as Sgbt. The time complexity of Sgbt\(^{MC}\) could be further improved by training each Sgbt in parallel. Our implementation of Sgbt\(^{MC}\) leverages this parallel processing, which brings Sgbt\(^{MC}\)’s time complexity down to \({\mathcal {O}}(Sf)\). This is similar to the current state-of-the-art streaming bagging and random-forest based methods Srp and Arf. For Sgbt\(^{MC}_{SK}\) with \(k > 1\), this time complexity is further reduced to \({\mathcal {O}}((1-{1/k})Sf)\) by skipping 1/k-th of instances at training. Table 1 contains all the notations introduced in this section.
4 Experiments
We begin our experiments by comparing Sgbt against current state-of-the-art streaming bagging with random subspaces (Srp), random forest (Arf), boosting (OSB), and gradient boosting (AdIter) methods on 14 datasets. We also conducted a parameter exploration to illustrate the effects of different Sgbt components.
Finally, we show an in-depth analysis concerning the computational requirements of Sgbt.
Datasets: AGRa, AGRg, LEDa, LEDg, RBFf, RBFm, electricity, airlines and covtype are from Gomes et al. (2019). RandomTree, LED, RBF5, RBF_Bm, RBF_Bf were generated using MOA synthetic generators. The synthetic datasets with drifts simulate different types of concept drifts, i.e., abrupt (AGRa, LEDa), gradual (AGRg, LEDg), fast incremental changes (RBF_Bf, RBFf), and moderate incremental changes (RBF_Bm, RBFm).
AGR\(_{{\textbf {a}}}\) is a binary class synthetic dataset with 1 M instances, where abrupt concept drifts occur after every 250000 instances, with 50 instances drift width.
AGR\(_{{\textbf {g}}}\) also contains binary class synthetic data. Here, gradual concept drifts occur after every 250000 instances, with 50000 instances drift width. The dataset has 1 M instances.
LED\(_{{\textbf {a}}}\) is a multi class synthetic dataset with 1 M instances, where abrupt concept drifts occur after every 250000 instances, with 50 instances drift width.
LED\(_{{\textbf {g}}}\) is also a multi class synthetic dataset. The dataset has gradual concept drifts occurring after every 250000 instances, with 50000 instances drift width. The dataset has 1 M instances.
RBF\(_{{\textbf {f}}}\) contains multi class synthetic data. Here, fast incremental concept drifts occur with 0.001 centroid’s speed of change. There are 1 M instances in this dataset.
RBF\(_{{\textbf {m}}}\) also has multi class synthetic data. The dataset contains 1 M instances. Here, moderate incremental concept drifts occur with 0.0001 centroid’s speed of change.
RBF_B\(_{{\textbf {f}}}\) is a binary class synthetic dataset with 1 M instances that includes fast incremental concept drifts with the centroid’s speed of change set to 0.001.
RBF_B\(_{{\textbf {m}}}\) has 1 M instances. It is a binary class synthetic dataset with moderate incremental concept drifts occurring with 0.0001 centroid’s speed of change.
RandomTree is a binary class synthetic dataset without any drifts. It was generated using MOA RandomTreeGenerator. It has 100K instances.
LED contains multi class synthetic data without any drifts. The dataset was generated using MOA LEDGenerator. It also has 100K instances.
RBF5 is a dataset with 100K instances. It contains multi class synthetic data without drifts. Data was generated using MOA RandomRBFGenerator.
Electricity contains the Australian New South Wales Electricity Market data when the prices are not fixed. These prices are affected by the supply and demand of the market itself and are set every five minutes. It is a binary class real-world dataset. The class label identifies the price changes (up or down) relative to a moving average of the last 24 h. The dataset exhibits temporal dependencies. It contains 45310 instances.
Airlines is a binary class real-world dataset. The task is to predict whether a given flight will be delayed, given information on the scheduled departure. The dataset has 539382 instances.
Covertype dataset represents forest cover type for 30 x 30-meter cells obtained from the US Forest Service Region 2 Resource Information System (RIS) data. Each class corresponds to a different cover type. The dataset contains a multi class problem with seven imbalanced class labels. It includes 581010 instances.
Datasets RBF_Bm, RBFm, RBF_Bf and RBFf were generated using MOA RandomRBFGeneratorDrift. AGRa and AGRg were generated using MOA ConceptDriftStream with AgrawalGenerator, and LEDa and LEDg were generated using MOA ConceptDriftStream with LEDGeneratorDrift. Table 2 summarizes the characteristics of the datasets.
Sgbt was compared against the current state-of-the-art stream learning baseline Srp, the streaming random forest method Arf, the latest gradient-boosted method for data streams AdIter, and the stream-boosting method OSB. Axgb was not considered in the evaluation as it failed to outperform Arf (Montiel et al., 2020). Srp used the best parameter configurations explained in Gomes et al. (2019). As 100 base learners produced the best results for Srp in Gomes et al. (2019), all the baselines used 100 base learners. Arf and OSB used the same parameters as in the evaluation of Gomes et al. (2019), and the same base learner (HT) as Srp with the same hyperparameters.
We collected votes for each class on each instance from AdIter’s Python implementation and ran them through a MOA dummy classifier to yield the same evaluation as the other methods. Sgbt was implemented as a MOA classifier, and it used 100 boosting steps (S) to match the 100 base learners of the other baselines. The Sgbt\(^{MC}\) variant was compared against the above baselines. Here, the one-vs-rest wrapper classifier was also implemented in MOA. Sgbt used a learning rate of 0.0125 and 75% of the features at each boosting step. As Sgbt requires streaming regression trees as its base learners, the streaming classifier tree HT cannot be used as a base learner. Therefore, the streaming regression tree FIMT-DD (Ikonomovska et al., 2011) was chosen as its base learner. FIMT-DD used a variance reduction split criterion, a grace period of 25, a split confidence interval of 0.05, a constant learning rate at the leaves, and the regression tree option.
Each algorithm was executed multiple times with different random seeds, and the average accuracy was considered in the evaluation process (see Footnote 3). Appendix A contains detailed information about the experimental setup.
Table 3 compares Sgbt\(^{MC}\)’s accuracy against the baselines mentioned above. As one can see, Sgbt\(^{MC}\) outperforms all the baselines on binary class problems considering average accuracy and rank. It also performs equally well on multi class problems. It is also evident that Sgbt\(^{MC}\) outperforms other methods on datasets with drifts: AGRa, AGRg, LEDa, LEDg, and RBFm. This suggests that Sgbt\(^{MC}\) is a good candidate not only for evolving data, but also for binary class problems.
It also performed well on the airlines dataset. On the other hand, Srp yielded good results on LEDa, LED, and covtype datasets, while Arf performed well on RBF_Bm, RBF_Bf, electricity, RBFf, and RBF5. OSB performed well on the RandomTree dataset. The streaming gradient boosting method AdIter was the least performant among all methods. As it is a binary classifier, AdIter was only evaluated on binary class problems (see Footnote 4). Furthermore, the KappaM results in Appendix C (Table 9), which evaluate a learner’s performance on imbalanced data (Bifet et al., 2018), also align with the accuracy rankings in Table 3.
Figure 1 shows the Shaffer post-hoc test results with a p-value threshold of 0.05 for all, binary class, multi class, and evolving (AGRa, AGRg, LEDa, LEDg, RBF_Bm, RBF_Bf, RBFm, RBFf) datasets considering accuracy. It further highlights the fact that Sgbt\(^{MC}\) outperforms other methods on binary and evolving datasets with statistical significance. For multi class problems it is on par with the current state-of-the-art Srp. To our knowledge, this is the first time a streaming gradient boosted method has been able to surpass current state-of-the-art bagging and random-forest based methods on a wide range of evolving data and perform well on all types of data (see Footnote 5). These post-hoc test results for accuracy in Fig. 1 also align with the KappaM post-hoc test results in Fig. 8 (Appendix C) with the same p-value threshold.
We investigate each algorithm’s performance on evolving data further in Figs. 2 and 3 by comparing accuracy over time for Sgbt\(^{MC}\), Srp, Arf, and OSB on AGRg and LEDg. Based on Figs. 2 and 3, it is evident that Sgbt\(^{MC}\) had the lowest decrease in performance around drift points.
Table 4 compares the evaluation time in seconds reported by MOA among single-threaded Sgbt\(^{MC}\) (Sgbt\(^{MC}_{ST}\)), multi-threaded Sgbt\(^{MC}\), and Srp. Srp was chosen because it had the best predictive performance among the competitors considering Table 3 and Fig. 1. For binary class problems, both Sgbt\(^{MC}\) variants are faster than Srp. A possible explanation is that FIMT-DD in Sgbt\(^{MC}\) is a faster base learner than HT in Srp. Compared to Sgbt\(^{MC}_{ST}\), Srp is faster on multi class problems. However, Sgbt\(^{MC}\) was the fastest on multi class problems, leveraging parallel processing at training and prediction. Considering the run-time evaluation for 100 base learners in Gomes et al. (2019), we acknowledge that Arf and OSB can run faster than Srp in practice. However, they have an inferior predictive performance compared to Srp both in the Table 3 evaluation and in the empirical evaluation of Gomes et al. (2019).
4.1 Multiple steps and multi class support
Another study was conducted to understand the performance of different Sgbt variants: Sgbt, Sgbt\(^{MC}\), Sgbt\(_{MI}\), and Sgbt\(^{MC}_{MI}\). Sgbt\(^{MC}\) supports multi class problems using binary Sgbts, and Sgbt\(_{MI}\) employs multiple iterations by hessian weights. Sgbt\(^{MC}\) and Sgbt\(_{MI}\) are orthogonal, so they can be fused to yield Sgbt\(^{MC}_{MI}\). All Sgbt variants used the same hyperparameter configurations as in the previous experiments. Table 5 shows the results of the study. Since Sgbt\(^{MC}\) reverts to Sgbt and Sgbt\(^{MC}_{MI}\) reverts to Sgbt\(_{MI}\) on binary class problems, only Sgbt and Sgbt\(_{MI}\) need to be compared for that category. Sgbt performs better than Sgbt\(_{MI}\) on most of the binary class datasets. However, Sgbt\(_{MI}\) has a higher average accuracy for that category, which suggests it performs exceptionally well on certain datasets such as RBF_Bm, RBF_Bf and electricity. The result on RBF_Bf, which has fast-evolving drifts, is interesting, as it suggests that multiple training iterations by hessian in Sgbt\(_{MI}\) improve Sgbt’s performance on fast-evolving data. For multi class problems, Sgbt\(^{MC}_{MI}\) is the clear winner. When one compares Sgbt\(^{MC}\) with Sgbt, it is clear that multi class support using binary Sgbts performs better than Sgbt’s committee-based multi class support. On the other hand, the multi class results for Sgbt and Sgbt\(_{MI}\) suggest that multiple iterations by hessian improve Sgbt’s accuracy on multi class problems. This explains why Sgbt\(^{MC}_{MI}\) performs best on multi class problems, as it combines multi class support using binary Sgbts with multiple iterations by hessian. The overall performance of Sgbt\(^{MC}_{MI}\) exceeds that of Sgbt\(^{MC}\), which is compared against the other baselines in Table 3. However, Sgbt\(^{MC}\) was used in the Table 3 evaluation because of its computational efficiency compared to Sgbt\(^{MC}_{MI}\).
On the other hand, Sgbt\(^{MC}_{MI}\) is a good candidate for evolving data stream applications that prioritize predictive performance over computation efficiency.
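As a rough illustration of multi class support built from binary learners, the following Python sketch trains one binary learner per class and predicts the class whose learner gives the highest score. It is only a sketch under stated assumptions: `OnlineLogistic` is a hypothetical stand-in for a binary Sgbt, and the argmax combination rule is an assumption, since this section does not specify how the outputs of the binary Sgbts are combined.

```python
import math


class OnlineLogistic:
    """Hypothetical stand-in for a binary Sgbt (NOT the authors' learner)."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def score(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def learn(self, x, y):
        err = self.score(x) - y  # gradient of the log loss w.r.t. the raw score
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


class OneVsRest:
    """One binary learner per class; argmax over scores (assumed rule)."""

    def __init__(self, n_classes, n_features):
        self.members = [OnlineLogistic(n_features) for _ in range(n_classes)]

    def predict(self, x):
        scores = [m.score(x) for m in self.members]
        return scores.index(max(scores))

    def learn(self, x, y):
        # Member k sees the instance relabelled as "class k" vs "any other class".
        for k, member in enumerate(self.members):
            member.learn(x, 1 if y == k else 0)
```

With this layout a 10-class problem needs 10 binary members, while a binary problem needs only one learner.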
4.2 Parameter exploration
A parameter exploration was conducted to understand the impact of learning rate (lr), boosting steps (S), weight (\(h_i\)) transfer methods, percentage of features (m), and the independent TR mechanism at each tree via drift detection on Sgbt\(^{MC}\)’s predictive performance. The results for all these analyses are shown in Table 6 (ranked separately).
Three learning rates: 6.25e\(-\)3, 1.25e\(-\)2, and 2.50e\(-\)2, were used in the study to understand the effect of learning rate (lr) on Sgbt\(^{MC}\)’s performance. All the other configurations: FIMT-DD base learner, 75% of features (m), and 100 boosting steps (S) were kept unchanged. As per Table 6, considering Sgbt\(^{MC}\) [\(S=100\), \(m=75\), \(lr=\){6.25e\(-\)3, 1.25e\(-\)2, 2.50e\(-\)2}, FIMT-DD] configurations, in general, larger learning rates (lr) seem to favour both binary and multi class problems.
In a separate study to understand the effect of boosting steps on Sgbt\(^{MC}\)’s performance, five values for the number of boosting steps (20, 40, 60, 80, 100) were considered. In this study, the base learner (FIMT-DD), feature percentage (m=75%), and learning rate (lr=1.25e\(-\)2) were kept unchanged. According to Table 6, among the Sgbt\(^{MC}\) [\(S=\{20, 40, 60, 80, 100\}\), \(m=75\), \(lr=\)1.25e\(-\)2, FIMT-DD] configurations, 100 boosting steps seem to yield better results than fewer boosting steps for both binary and multi class problems. This aligns with the OSB results in Gomes et al. (2019), where more boosting iterations performed better than fewer boosting iterations.
Another study investigated the influence of different feature percentages (m) on Sgbt\(^{MC}\)’s performance. All other Sgbt\(^{MC}\) settings were kept constant: the base learner (FIMT-DD), learning rate (lr=1.25e\(-\)2), and boosting steps (S=100). According to Table 6, among the Sgbt\(^{MC}\) [\(S=100\), \(m=\{45, 60, 75, 100\}\), \(lr=\)1.25e\(-\)2] configurations, 75% of features yields good accuracy on most datasets. Using fewer than 100% of the features helps to increase the diversity of the ensemble, which helps avoid overfitting to the data. These results are in line with the findings of Gomes et al. (2017, 2019), where Arf and Srp perform best with 60% of the features.
A separate study examines the effect of an independent TR mechanism in each base learner on Sgbt\(^{MC}\)’s performance. SGT was selected as the base learner for this study, since FIMT-DD already has a built-in TR mechanism. A generic regressor with an inbuilt TR mechanism based on Ddm’s warning and out-of-control signals was therefore introduced into MOA. This makes it possible to enable the TR strategy by wrapping SGT in the generic regressor with Ddm, or to disable it by using SGT alone. The Ddm settings were: minimum number of instances before permitting a change detection = 250, warning level = 2.0, and out-of-control level = 2.5. SGT used the same default configurations as in Gouk et al. (2019). The results in Table 6 show that having an internal TR mechanism often improves performance. Also, all the Sgbt\(^{MC}\) configurations with SGT perform poorly on RBFf. One possible explanation is that SGT’s default warmStart of 1000 (the number of instances used to estimate bin boundaries for numeric values) is too large for RBFf, which has fast-moving drifts.
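To make the warning and out-of-control signals concrete, here is a minimal Python sketch of the standard Ddm rule (Gama et al., 2004), using the settings listed above as defaults. It is illustrative only and is not the MOA implementation. The class name `Ddm` and the string return values are hypothetical, and the detector is assumed to receive a 0/1 error indicator per instance; how the generic regressor derives that indicator, and how it reacts to each signal, is not stated in this section.

```python
import math


class Ddm:
    """Sketch of the Ddm rule on a stream of 0/1 errors (illustrative only)."""

    def __init__(self, min_instances=250, warning_level=2.0, out_control_level=2.5):
        self.min_instances = min_instances
        self.warning_level = warning_level
        self.out_control_level = out_control_level
        self.reset()

    def reset(self):
        self.n = 0            # instances seen since the last reset
        self.p = 0.0          # running error rate
        self.p_min = math.inf
        self.s_min = math.inf

    def update(self, error):
        """error is 1 for a mistake, 0 otherwise; returns the current signal."""
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.n < self.min_instances:
            return "stable"
        # Remember the lowest p + s observed so far.
        if self.p + s <= self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s > self.p_min + self.out_control_level * self.s_min:
            self.reset()
            return "drift"
        if self.p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"
```

A wrapper around a base learner could, for example, react to "warning" and "drift" to decide when a tree is replaced; that policy is left out here because it would be a guess.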
4.3 Skip training on instances
Another study was conducted using Sgbt\(^{MC}_{SK\_{1/k}}\)[\(S=100\), \(m=75\), \(lr=\)1.25e\(-\)2] with different k values to understand the effect of random skip training. Here, k was set to 1, 2, and 3, so that Sgbt\(^{MC}_{SK\_{1/k}}\) skips no instances, 1/2 of the instances, and 1/3 of the instances, respectively. As per Table 7, Sgbt\(^{MC}_{SK\_{1/3}}\) produced good results on all datasets apart from RBF_Bf and RBFf, even with one third of the instances skipped. The slightly lower accuracy on those two datasets may be because both RBF_Bf and RBFf have fast-moving drifts.
To further illustrate the influence of random skipping, another study was conducted using Sgbt\(^{MC}_{SK\_{1/k}}\)[\(S=100\), \(m=75\), \(lr=\)1.25e\(-\)2] with k values of 1, 2, and 3 on the AGRg and LEDg datasets. The idea here is to understand the effect of skipping training instances on Sgbt\(^{MC}_{SK\_{1/k}}\)’s performance for binary and multi class problems. Both AGRg and LEDg had drifts happening at the same time intervals. However, AGRg is a binary problem, and LEDg is a multi class problem with 10 classes. Accuracy and model size statistics were collected every 10000 instances. As the classification accuracy in Figs. 4 and 5 shows, skipping instances during training does not significantly hinder accuracy on either AGRg or LEDg. On the other hand, skipping instances results in significant memory savings on both datasets, as shown in Figs. 6 and 7. These savings are more pronounced on LEDg, as Sgbt\(^{MC}_{SK\_{1/k}}\) needs 10 Sgbts compared to 1 for AGRg.
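The random skipping studied in this section can be sketched in a few lines of Python. This is a simplified illustration, not the authors' code: the function name `train_with_random_skips`, the `learn` callback, and the per-instance Bernoulli draw are all assumptions. The sketch is parameterized directly by the fraction of instances to skip (0, 1/2 or 1/3 in the experiments above).

```python
import random


def train_with_random_skips(stream, learn, skip_fraction, seed=9):
    """Call learn(instance) on a random subset of the stream.

    Each instance is skipped independently with probability skip_fraction
    (an assumed mechanism). Returns the number of instances trained on.
    """
    rng = random.Random(seed)
    trained = 0
    for instance in stream:
        if rng.random() < skip_fraction:
            continue  # skip training on this instance
        learn(instance)
        trained += 1
    return trained
```

Skipping applies to training only; in a test-then-train evaluation every instance would still be used for prediction before this decision is made.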
5 Conclusion
This work proposes Sgbt, which uses the generic weighted squared loss elicited in Friedman et al. (2000) and Chen and Guestrin (2016), with the hessian as the weight and the gradient over the hessian as the target, both computed from the loss of the previous boosting step. Its base learners are streaming regression trees with an internal TR strategy. In the experiments, the Sgbt variant Sgbt\(^{MC}\) with FIMT-DD as the base learner produced superior results compared to the state-of-the-art streaming methods on large evolving data with multiple drifts and drift types.
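As a small worked example of the weighted squared loss, the second-order approximation used in XGBoost, \(\sum_i [g_i f(x_i) + \frac{1}{2} h_i f(x_i)^2]\), equals \(\sum_i \frac{1}{2} h_i (f(x_i) - (-g_i/h_i))^2\) up to a constant, so each base learner can be fitted to the target \(-g_i/h_i\) with weight \(h_i\). The Python sketch below computes both quantities for one instance. It assumes a binary logistic loss, which is an illustrative choice rather than something stated in this section, and the function name is hypothetical.

```python
import math


def boosting_target_and_weight(y, raw_score, eps=1e-12):
    """Return (target, weight) for one instance under binary logistic loss.

    y is the 0/1 label and raw_score is the ensemble output before the
    sigmoid. The logistic loss is an assumption made for this sketch.
    """
    z = max(-30.0, min(30.0, raw_score))  # clamp for numerical safety
    p = 1.0 / (1.0 + math.exp(-z))
    g = p - y                        # first derivative of the loss
    h = max(p * (1.0 - p), eps)      # second derivative, kept positive
    return -g / h, h
```

For a positive instance with a raw score of 0, the gradient is -0.5 and the hessian is 0.25, so the next tree is trained towards a target of 2.0 with weight 0.25.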
The hessian weights calculated by Sgbt are fractional for most loss functions. To our knowledge, none of the streaming regression trees support non-integer weights. To circumvent this limitation, Sgbt employs a weight of 1 or a transformed weight (which yields a positive integer) to train the base learner. As future work, one could explore the work by Pébay et al. (2016) and Schubert and Gertz (2018) on incremental calculation of variance and covariance to support fractional weights for the SGT and FIMT-DD base learners. Another direction for future work is to skip training selectively on certain instances by comparing their loss against a threshold, as in Gunasekara et al. (2022), instead of skipping at random.
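To indicate what fractional weight support might involve, the following Python sketch maintains a weighted mean and variance incrementally with arbitrary positive weights. It uses a standard weighted form of Welford's update and is not taken from Pébay et al. (2016) or Schubert and Gertz (2018), nor from any SGT or FIMT-DD implementation. The class name `WeightedStats` is hypothetical.

```python
class WeightedStats:
    """Incremental weighted mean and (population) variance."""

    def __init__(self):
        self.w_sum = 0.0  # sum of weights seen so far
        self.mean = 0.0
        self.m2 = 0.0     # weighted sum of squared deviations from the mean

    def update(self, x, w):
        if w <= 0.0:
            return  # ignore non-positive weights
        self.w_sum += w
        delta = x - self.mean
        self.mean += (w / self.w_sum) * delta
        self.m2 += w * delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.w_sum if self.w_sum > 0.0 else 0.0
```

Because each update needs only the running sums, a split criterion based on variance reduction could in principle consume fractional hessian weights directly, without rounding them to integers.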
Data availability
Open source
Notes
This practice is used in Gouk et al. (2019) as well.
Similar to Java’s java.lang.Math.ceil(v), which returns the smallest integer value greater than or equal to the passed-in value v.
Table 3, Figs. 1 and 8 used ten iterations. All the other experiments used three iterations. Code and data are available at https://github.com/nuwangunasekara/SGBT
Considering AdIter’s weak performance on binary class problems and its Python implementation, it was not evaluated on multi class problems using the MOA one-vs-rest wrapper classifier.
References
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1), 119–139.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of statistics, 29, 1189–1232.
Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4), 367–378.
Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2), 337–407.
Chen, T., & Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM Sigkdd international conference on knowledge discovery and data mining, pp. 785–794.
Gomes, H. M., Barddal, J. P., Enembreck, F., & Bifet, A. (2017). A survey on ensemble learning for data stream classification. ACM Computing Surveys (CSUR), 50(2), 1–36.
Bifet, A., Gavaldá, R., Holmes, G., & Pfahringer, B. (2018). Machine learning for data streams: With practical examples in MOA (pp. 52–96). Massachusetts: The MIT Press. https://doi.org/10.7551/mitpress/10654.001.0001
Montiel, J., Mitchell, R., Frank, E., Pfahringer, B., Abdessalem, T., & Bifet, A. (2020). Adaptive xgboost for evolving data streams. In 2020 international joint conference on neural networks (IJCNN), pp. 1–8. IEEE.
Wang, K., Lu, J., Liu, A., Song, Y., Xiong, L., & Zhang, G. (2022). Elastic gradient boosting decision tree with adaptive iterations for concept drift adaptation. Neurocomputing, 491, 288–304.
Gomes, H. M., Bifet, A., Read, J., Barddal, J. P., Enembreck, F., Pfahringer, B., Holmes, G., & Abdessalem, T. (2017). Adaptive random forests for evolving data stream classification. Machine Learning, 106(9), 1469–1495.
Gomes, H. M., Read, J., & Bifet, A. (2019). Streaming random patches for evolving data stream classification. In 2019 IEEE International conference on data mining (ICDM), pp. 240–249. IEEE.
Bifet, A., & Gavalda, R. (2007). Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM international conference on data mining, pp. 443–448. SIAM.
Oza, N. C., & Russell, S. J. (2001). Online bagging and boosting. In International workshop on artificial intelligence and statistics, pp. 229–236. PMLR.
Servedio, R. A. (2003). Smooth boosting and learning with malicious noise. The Journal of Machine Learning Research, 4, 633–648.
Chen, S.-T., Lin, H.-T., & Lu, C.-J. (2012). An online boosting algorithm with theoretical justifications. arXiv preprint arXiv:1206.6422.
Ikonomovska, E., Gama, J., & Džeroski, S. (2011). Learning model trees from evolving data streams. Data Mining and Knowledge Discovery, 23(1), 128–168.
Gouk, H., Pfahringer, B., & Frank, E. (2019). Stochastic gradient trees. In Asian conference on machine learning, pp. 1094–1109. PMLR.
Mouss, H., Mouss, D., Mouss, N., & Sefouhi, L. (2004). Test of page-hinckley, an approach for fault detection in an agro-alimentary production system. In 2004 5th Asian control conference (IEEE Cat. No. 04EX904), vol. 2, pp. 815–818. IEEE.
Gama, J., Medas, P., Castillo, G., & Rodrigues, P. (2004). Learning with drift detection. In Brazilian symposium on artificial intelligence, pp. 286–295. Springer.
Pébay, P., Terriberry, T. B., Kolla, H., & Bennett, J. (2016). Numerically stable, scalable formulas for parallel and online computation of higher-order multivariate central moments with arbitrary weights. Computational Statistics, 31(4), 1305–1325.
Schubert, E., & Gertz, M. (2018). Numerically stable parallel computation of (co-) variance. In Proceedings of the 30th international conference on scientific and statistical database management, pp. 1–12.
Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data mining: Practical machine learning tools and techniques. The Morgan Kaufmann Series in Data Management Systems (pp. 322–328). San Francisco: Elsevier.
Gunasekara, N., Gomes, H. M., Pfahringer, B., & Bifet, A. (2022). Online hyperparameter optimization for streaming neural networks. In 2022 international joint conference on neural networks (IJCNN), pp. 1–9. IEEE.
Pavlovski, M., Zhou, F., Stojkovic, I., Kocarev, L., & Obradovic, Z. (2017). Adaptive skip-train structured regression for temporal networks. In machine learning and knowledge discovery in databases: European conference, ECML PKDD 2017, Skopje, Macedonia, September 18–22, 2017, Proceedings, Part II 10, pp. 305–321. Springer.
Funding
Open Access funding enabled and organized by CAUL and its Member Institutions. NZ Tertiary Education Commission funded Real-time Analytics of Big Data Programme.
Ethics declarations
Conflict of interest
Not applicable
Ethical approval
Not applicable
Consent to participate
Not applicable
Consent for publication
Not applicable
Additional information
Editor: João Gama.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Experimental setup
Experiments relating to Table 3, Figs. 1 and 8 used ten iterations with random seeds: 5, 9, 17, 13, 19, 23, 29, 31, 37 and 121. All the other experiments used three iterations with random seeds: 9, 17, and 121.
Experiments were run on (i) an Ubuntu 18.04 LTS system with an AMD EPYC 7702 64-Core Processor at 4.00GHz and 1000GB RAM, and (ii) an Ubuntu 20.04.3 system with an Intel(R) Core(TM) i7-6700K CPU at 4.00GHz and 64GB RAM. All CPU time experiments were done on system (i). The OpenJDK version was 11.0.11, and the JVM configurations were: -Xmx96g, -Xms50m, and -Xss1g.
Appendix B: Parameter exploration results
Table 8 contains average accuracy and standard deviation for parameter exploration experiments in Sect. 4.2.
Appendix C: KappaM results
Table 9 and Fig. 8 contain KappaM results for the learners Sgbt\(^{MC}\), Srp, Arf, OSB and AdIter on all datasets discussed in Sect. 4. KappaM measures a learner’s performance against a majority class classifier (Bifet et al., 2018). It is used to evaluate a learner’s performance on imbalanced datasets (Bifet et al., 2018). The learner rankings in Table 9 and Fig. 8 align with the rankings in Table 3 and Fig. 1.
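For reference, KappaM is commonly defined as \((p_0 - p_m)/(1 - p_m)\), where \(p_0\) is the learner's accuracy and \(p_m\) is the accuracy of a majority class classifier. The Python sketch below computes it over a finished list of predictions. It is a simplified illustration: the function name `kappa_m` is hypothetical, and the majority class is taken from the whole label list, whereas a streaming evaluator would typically maintain it incrementally.

```python
from collections import Counter


def kappa_m(y_true, y_pred):
    """KappaM of predictions against a static majority class baseline."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    p0 = sum(t == p for t, p in zip(y_true, y_pred)) / n
    pm = Counter(y_true).most_common(1)[0][1] / n  # majority class accuracy
    if pm == 1.0:
        return 0.0  # only one class present; the ratio is undefined
    return (p0 - pm) / (1.0 - pm)
```

A learner that always predicts the majority class scores 0, a perfect learner scores 1, and a negative value means the learner is worse than the majority class baseline.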
Appendix D: Skip training on instances
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Gunasekara, N., Pfahringer, B., Gomes, H. et al. Gradient boosted trees for evolving data streams. Mach Learn 113, 3325–3352 (2024). https://doi.org/10.1007/s10994-024-06517-y
DOI: https://doi.org/10.1007/s10994-024-06517-y