Traditional Machine Learning problems consist of a number of examples that are observed in arbitrary order. In this work we consider classification problems. Each example \(e = (\mathbf {x}, l(\mathbf {x}))\) is a tuple of p predictive attributes \(\mathbf {x} = (x_1, \ldots , x_p)\) and a target attribute \(l(\mathbf {x})\). A data set is an (unordered) set of such examples. The goal is to approximate a labelling function \(l: \mathbf {x} \rightarrow l(\mathbf {x})\). In the data stream setting the examples are observed in a given order; each data stream S is therefore a sequence of examples \(S = (e_1, e_2, e_3, \dots , e_n, \ldots )\), possibly infinite. Consequently, \(e_i\) refers to the \(i^{th}\) example in data stream S. The set of predictive attributes of that example is denoted by \(PS_i\); likewise, \(l(PS_i)\) denotes the corresponding label. Furthermore, the labelling function that needs to be learned can change over time due to concept drift.
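To make this notation concrete, the following minimal sketch shows one way to represent a stream of examples in Python; the names Example and toy_stream are purely illustrative and not part of any existing framework.

```python
from typing import Iterator, NamedTuple, Tuple

class Example(NamedTuple):
    """One stream example e_i = (x, l(x))."""
    x: Tuple[float, ...]   # p predictive attributes (x_1, ..., x_p)
    label: str             # target attribute l(x)

def toy_stream() -> Iterator[Example]:
    """A data stream S = (e_1, e_2, ...): examples arrive one by one, in a fixed order."""
    yield Example(x=(0.7, 1.2, 0.0), label="UP")
    yield Example(x=(0.6, 1.1, 1.0), label="DOWN")
    # ... possibly infinite; the labelling function l may drift over time
```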
When applying an ensemble of classifiers, the most important design decisions are which base-classifiers (members) to use and how to weight their individual votes. This work mainly focuses on the latter question. Section 3.1 describes the Performance Estimation framework to weight member votes in an ensemble. In Sect. 3.2 we show how to use the Classifier Output Difference to select ensemble members. Section 3.3 describes an ensemble that employs these techniques.
Online performance estimation
In most common ensemble approaches all base-classifiers are given the same weight (as done in Bagging and Boosting) or their predictions are otherwise combined to optimise the overall performance of the ensemble (as done in Stacking). An important property of the data stream setting is often neglected: due to the possible occurrence of concept drift, recent examples are typically more relevant than older ones. Moreover, because the data has a temporal component, we can actually measure how ensemble members have performed on recent examples and adjust their weight in the voting accordingly. In order to estimate the performance of a classifier on recent data, van Rijn et al. (2015) proposed:
$$\begin{aligned} P_{win}(l',c,w,L) = 1 - \sum _{i=\max (1,c-w)}^{c-1}\frac{L(l'(PS_i),l(PS_i))}{\min (w,c-1)} \end{aligned}$$
(2)
where \(l'\) is the learned labelling function of an ensemble member, c is the index of the last seen training example and w is the number of training examples over which we want to estimate the performance of ensemble members. Note that there is a certain start-up time (i.e., when w is larger than or equal to c) during which we can only calculate the performance estimation over a number of instances smaller than w. Also note that it can only be computed after at least one label has been observed (i.e., \(c > 1\)). Finally, L is a loss function that compares the labels predicted by the ensemble member to the true labels. The simplest version is a zero/one loss function, which returns 0 when the predicted label is correct and 1 otherwise. More complicated loss functions can also be incorporated. The outcome of \(P_{win}\) is in the range [0, 1], with better performing classifiers obtaining a higher score. The performance estimates for the ensemble members can be converted into a weight for their votes, at various points over the stream. For instance, the best performing members at that point could receive the highest weights. Figure 2 illustrates this.
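As a concrete illustration of Eq. 2, the sketch below computes \(P_{win}\) for a single ensemble member under zero/one loss; the function names and the explicit buffers of predictions and labels are our own simplification, not part of MOA.

```python
def zero_one_loss(predicted, true):
    """Zero/one loss: 0 for a correct prediction, 1 otherwise."""
    return 0 if predicted == true else 1

def p_win(predictions, labels, c, w, loss=zero_one_loss):
    """Windowed performance estimate (Eq. 2) over the (at most) w examples before index c.

    predictions[i] holds the member's prediction l'(PS_i) and labels[i] the true label
    l(PS_i); indices are 1-based as in the text (put a placeholder at index 0), and
    c > 1 is required since at least one label must have been observed.
    """
    assert c > 1
    start = max(1, c - w)        # start-up phase: the window may hold fewer than w examples
    n = min(w, c - 1)            # number of examples actually inside the window
    total_loss = sum(loss(predictions[i], labels[i]) for i in range(start, c))
    return 1 - total_loss / n
```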
There are a few drawbacks to this approach. First, it requires the ensemble to store \(w \times n\) additional values, which is inconvenient in a data stream setting, where both time and memory are important factors. Second, it requires the user to tune a parameter that highly influences performance. Last, there is a hard cut-off point: an observation is either in or out of the window. What we would rather model is that the most recent observations receive the most weight, which gradually decreases for less recent observations.
In order to address these issues, we propose an altered version of performance estimation, based on fading factors, as described by Gama et al. (2013). Fading factors give a high importance to recent predictions, whereas the importance fades away when they become older. This is illustrated by Fig. 3.
The red (solid) line shows a relatively fast fading factor, where the effect of a given prediction has already faded away almost completely after 500 predictions, whereas the blue (dashed) line shows a relatively slow fading factor, where the effect of an observation is still considerably high, even when 10,000 observations have passed in the meantime. Note that even though all these functions start at 1, in practice we need to scale this down to \(1 - \alpha \), in order to constrain the complete function within the range [0, 1]. Putting this all together, we propose:
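To give a feel for these decay rates, the snippet below prints how much of a single prediction's contribution remains after n newer observations under the fading-factor update of Eq. 3 (a fraction \(\alpha ^n\) of it); the two \(\alpha \) values are illustrative choices on our part, not the ones used to draw the figure.

```python
# Under Eq. 3, a prediction initially contributes (1 - alpha) to the estimate and is
# multiplied by alpha at every subsequent update, so alpha**n of it remains after n updates.
for alpha in (0.99, 0.9999):            # illustrative "fast" and "slow" fading factors
    for n in (100, 500, 10_000):
        print(f"alpha={alpha}: {alpha ** n:.4f} of the contribution left after {n} observations")
```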
$$\begin{aligned} P(l',c,\alpha ,L) = \begin{cases} 1 &{} \text {if } c = 0 \\ P(l',c-1,\alpha ,L) \cdot \alpha + \bigl (1 - L(l'(PS_c),l(PS_c))\bigr ) \cdot (1 - \alpha ) &{} \text {otherwise} \end{cases} \end{aligned}$$
(3)
where, similar to Eq. 2, \(l'\) is the learned labelling function of an ensemble member, c is the index of the last seen training example and L is a loss function that compares the labels predicted by the ensemble member to the true labels. Fading factor \(\alpha \) (range [0, 1]) determines at what rate historic performance becomes irrelevant, and is to be tuned by the user. A value close to 0 will allow for rapid changes in estimated performance, whereas a value close to 1 will keep them rather stable. The outcome of P is in the range [0, 1], with better performing classifiers obtaining a higher score. In Sect. 6 we will see that the fading factor parameter is more robust and easier to tune than the window size parameter. When building an ensemble based upon Online Performance Estimation, we can now choose between a Windowed approach (Eq. 2) and Fading Factors (Eq. 3).
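The recursion in Eq. 3 translates into a constant-memory update per ensemble member. A minimal sketch, assuming zero/one loss and our own naming:

```python
class FadingPerformanceEstimate:
    """Online performance estimate with fading factor alpha (Eq. 3).

    Only the current estimate is stored, so the memory cost is O(1) per member,
    in contrast to the w values needed by the windowed approach of Eq. 2.
    """

    def __init__(self, alpha: float):
        self.alpha = alpha
        self.p = 1.0                                  # P(l', 0, alpha, L) = 1 by definition

    def update(self, predicted, true) -> float:
        loss = 0 if predicted == true else 1          # zero/one loss
        self.p = self.p * self.alpha + (1 - loss) * (1 - self.alpha)
        return self.p
```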
Figure 4 shows how the estimated performance for each base-classifier evolves at the start of the electricity data stream. Both figures expose similar trends: apparently, on this data stream the Hoeffding Tree classifier performs best and the Stochastic Gradient Descent algorithm performs worst. However, the two approaches differ subtly in the way the performance of individual classifiers is measured. The Windowed approach contains many spikes, whereas the Fading Factor approach is more stable.
Ensemble composition
In order for an ensemble to be successful, the individual classifiers should be both accurate and diverse. When employing a homogeneous ensemble, choosing an appropriate base-learner is an important decision. For heterogeneous ensembles this is even more important, as a whole set of base-learners has to be chosen. We consider a set of classifiers from MOA 2016.04 (Bifet et al. 2010a). Furthermore, we consider some fast batch-incremental classifiers from Weka 3.7.12 (Hall et al. 2009) wrapped in the Accuracy Weighted Ensemble (Wang et al. 2003). Table 1 lists all classifiers and their parameter settings.
Table 1 Classifiers considered in this research
Figure 5 shows some basic results of the classifiers on 60 data streams. Figure 5a shows a violin plot of the predictive accuracy of all classifiers, with a box plot in the middle. Violin plots show the probability density of the data at different values (Hintze and Nelson 1998). The classifiers are sorted by median accuracy. Two common data stream baseline methods, the No Change classifier and the Majority Class classifier, end at the bottom of the accuracy ranking. This indicates that most of the selected data streams are balanced in terms of class labels and do not exhibit high auto-correlation. In general, tree-based methods seem to perform best.
Figure 5b shows violin plots of the run time (in seconds) that the classifiers needed to complete the tasks. Among the top half of classifiers in terms of accuracy, Hoeffding Trees is the best-ranked algorithm in terms of run time. Lazy algorithms (k-NN and its variations) turn out to be rather slow, despite a reasonable value of the window size parameter (which controls the number of instances that are remembered). The results also confirm the observation made by Read et al. (2012) that batch-incremental classifiers generally take more resources than instance-incremental classifiers; all classifiers wrapped in the Accuracy Weighted Ensemble are on the right half of the figure.
Figure 6 shows the result of a statistical test on the base-classifiers. Classifiers are sorted by their average rank (lower is better). Classifiers that are connected by a horizontal line are statistically equivalent. The results confirm some of the observations made based on the violin plots, e.g., the baseline models (Majority Class and No Change) perform worst; other simple models such as the Decision Stumps and OneRule (which is essentially a Decision Stump) are also inferior to the tree-based models. Oddly enough, the instance-incremental Rule Classifier does not compete at all with its batch-incremental counterpart (AWE(JRIP)).
When creating a heterogeneous ensemble, a diverse set of classifiers should be selected (Hansen and Salamon 1990). Classifier Output Difference (COD) is a metric that measures the difference in predictions between a pair of classifiers. We can use it to create a hierarchical agglomerative clustering of data stream classifiers in the same way as Lee and Giraud-Carrier (2011). For each pair of classifiers involved in this study, we measure the number of observations for which the classifiers produce different outputs, aggregated over all data streams involved. Hierarchical agglomerative clustering (HAC) converts this information into a hierarchical clustering. It starts by assigning each classifier to its own cluster, and greedily joins the two clusters with the smallest distance (Rokach and Maimon 2005). The complete linkage strategy is used to measure the distance between two clusters. Formally, the distance between two clusters A and B is defined as \(\max \left\{ COD (a,b) : a \in A, b \in B\right\} \). Figure 7 shows the resulting dendrogram. There were 9 data streams on which several classifiers did not terminate; we left these out of the dendrogram.
We can use a dendrogram like the one in Fig. 7 to obtain a collection of diverse and well performing ensemble members. A COD-threshold is to be determined, and a representative classifier is selected from each cluster whose internal distances are below this threshold. A higher COD-threshold results in a smaller set of classifiers, and vice versa. For example, if we set the COD-threshold to 0.2, we end up with an ensemble consisting of classifiers from 11 clusters. The ensemble will consist of one representative classifier from each cluster, which can be chosen based on accuracy, run time, a combination of the two (e.g., Brazdil et al. 2003) or any other criterion. Which exact criterion to use is outside the scope of this research; in this study we used a combination of accuracy and run time. Clearly, when using this technique in experiments, the dendrogram should be constructed in a leave-one-out setting: it is created based on all data streams except the one that is being tested.
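Given a symmetric matrix of pairwise COD values, this selection procedure could be reproduced along the following lines, here using SciPy's hierarchical clustering; the matrix cod, the classifier names and the threshold of 0.2 are assumed inputs, and choosing the representative per cluster is left to whichever criterion one prefers.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_by_cod(cod: np.ndarray, names: list, threshold: float = 0.2) -> dict:
    """Complete-linkage clustering of classifiers from a pairwise COD matrix.

    cod[i, j] is the fraction of observations on which classifiers i and j predict
    differently, aggregated over all (training) data streams.
    """
    condensed = squareform(cod, checks=False)       # square matrix -> condensed distance vector
    tree = linkage(condensed, method="complete")    # complete linkage, as in the text
    cluster_ids = fcluster(tree, t=threshold, criterion="distance")
    clusters: dict = {}
    for name, cid in zip(names, cluster_ids):
        clusters.setdefault(cid, []).append(name)
    return clusters                                 # pick one representative per cluster

# e.g.: clusters = cluster_by_cod(cod_matrix, classifier_names, threshold=0.2)
```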
Figure 7 can also be used to make some interesting observations. First, it confirms some well-established assumptions. The clustering seems to respect the taxonomy of classifiers provided by MOA. Many of the tree-based and rule-based classifiers are grouped together. There is a cluster of instance-incremental tree classifiers (Hoeffding Tree, AS Hoeffding Tree, Hoeffding Option Tree and Hoeffding Adaptive Tree), a cluster of batch-incremental tree-based and rule-based classifiers (REP Tree, J48 and JRip) and a cluster of simple tree-based and rule-based classifiers (Decision Stumps and One Rule). Also the Logistic and SVM models seem to produce similar predictions, having a sub-cluster of batch-incremental classifiers (SMO and Logistic) and a sub-cluster of instance incremental classifiers (Stochastic Gradient Descent and SPegasos with both loss functions).
The dendrogram also provides some surprising results. For example, the instance-incremental Rule Classifier seems to be fairly distant from the tree-based classifiers. As decision rules and decision trees work with similar decision boundaries and can easily be translated to each other, a higher similarity would be expected (Apté and Weiss 1997). Also the internal distances in the simple tree-based and rule-based classifiers seem rather high.
A possible explanation for this could be the mediocre performance of the Rule Classifier (see Fig. 5). Even though COD clusters are based on instance-level predictions rather than accuracy, well performing classifiers have a higher prior probability of being clustered together: as they misclassify only few observations, there are naturally also few observations on which their predictions can disagree.
BLAST
BLAST (short for best last) is an ensemble embodying the performance estimation framework. Ideally, it consists of a group of diverse base-classifiers. These are all trained using the full set of available training observations. For every test example, it selects one of its members to make the prediction. This member is referred to as the active classifier. The active classifier is selected based on Online Performance Estimation: the classifier that performed best over the set of w previous training examples is selected as the active classifier (i.e., it gets 100% of the weight), hence the name. Formally,
$$\begin{aligned} AC_c = \mathop {\mathrm {arg\,max}}\limits _{m_j \in M} P(m_j,c-1,\alpha ,L) \end{aligned}$$
(4)
where M is the set of models generated by the ensemble members, c is the index of the current example, \(\alpha \) is a parameter to be set by the user (the fading factor) and L is a zero/one loss function, giving a penalty of 1 to all misclassified examples. Note that the performance estimation function P can be replaced by any measure; for example, if we replaced it with Eq. 2, we would get exactly the same predictions as reported by van Rijn et al. (2015). When multiple classifiers obtain the same estimated performance, an arbitrary one of them can be selected as active classifier. The details of this method are summarised in Algorithm 1.
Line 2 initialises a variable that keeps track of the estimated performance of each base-classifier. Everything that happens on lines 5–7 can be seen as an internal prequential evaluation method. On line 5 each training example is first used to test all individual ensemble members. The predicted label is compared against the true label \(l(\mathbf {x})\): if a base-classifier predicts the correct label, its estimated performance increases; if it predicts the wrong label, its estimated performance decreases (line 6). After this, base-classifier \(m_j\) is trained on the example (line 7). When, at any time, a test example needs to be classified, the ensemble looks up the highest value of \(p_j\) and lets the corresponding ensemble member make the prediction.
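Combining Eqs. 3 and 4, the internal test-then-train loop of BLAST can be sketched as follows; FadingPerformanceEstimate is the helper sketched in Sect. 3.1, and the predict/train interface of the members is kept deliberately abstract, since the actual implementation lives in MOA.

```python
class Blast:
    """Sketch of BLAST: heterogeneous members, one active classifier at a time."""

    def __init__(self, members, alpha: float):
        self.members = members                                     # diverse base-classifiers
        self.estimates = [FadingPerformanceEstimate(alpha) for _ in members]

    def train_on(self, x, label):
        """Internal prequential step: test each member first, then train it (lines 5-7)."""
        for member, estimate in zip(self.members, self.estimates):
            estimate.update(member.predict(x), label)              # update Eq. 3 with the outcome
            member.train(x, label)                                 # only then learn from the example

    def predict(self, x):
        """Let the active classifier (Eq. 4: highest estimated performance) make the prediction."""
        best = max(range(len(self.members)), key=lambda j: self.estimates[j].p)
        return self.members[best].predict(x)
```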
The concept of an active classifier can also be extended to multiple active classifiers. Rather than selecting the single best classifier on recent predictions, we can select the best k classifiers, whose votes for the specified class-label are all weighted according to some weighting schedule. First, we can weight them all equally. Indeed, when using this weighting schedule and setting \(k = |M|\), we would get the same behaviour as the Majority Vote Ensemble, as described by van Rijn et al. (2015), which showed only mediocre performance. Alternatively, we can use Online Performance Estimation to weight the votes. This way, the best performing classifier obtains the highest weight, the second best performing classifier a bit less, and so on. Formally, for each \(y \in Y\) (with Y being the set of all class labels):
$$\begin{aligned} votes_y = \sum _{m_j \in M} P(m_j,i,\alpha ,L) \times B(l_j'(PS_i),y) \end{aligned}$$
(5)
where M is the set of all models, \(l_j'\) is the labelling function produced by model \(m_j\) and B is a binary function, returning 1 if \(l_j'\) votes for class label y and 0 otherwise. Other functions regulating the voting process can also be incorporated, but are beyond the scope of this research. The label y that obtains the highest value of \(votes_y\) is then predicted. BLAST is available in the MOA framework as of version 2017.06.
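A minimal sketch of this weighted voting scheme (Eq. 5), restricted to the k members with the highest estimated performance; the indicator B is implicit in the dictionary update, and k as well as the member interface are assumptions on our part.

```python
from collections import defaultdict

def weighted_vote(members, estimates, x, k):
    """Predict by weighting the votes of the k best members by their online performance estimate."""
    ranked = sorted(zip(members, estimates), key=lambda pair: pair[1].p, reverse=True)[:k]
    votes = defaultdict(float)
    for member, estimate in ranked:
        votes[member.predict(x)] += estimate.p      # Eq. 5: each vote weighted by P(m_j, i, alpha, L)
    return max(votes, key=votes.get)                # predict the label with the highest total vote
```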