Context-based unsupervised ensemble learning and feature ranking
Abstract
In ensemble systems, several experts, which may have access to possibly different data, make decisions that are then fused by a combiner (meta-learner) to obtain a final result. Such ensemble-based systems are well-suited for processing big data from sources such as social media, in-stream monitoring systems, networks, and markets, and provide more accurate results than single-expert systems. However, most existing ensemble-learning techniques have two limitations: (i) they are supervised, and hence they require access to the true label, which is often unknown in practice, and (ii) they are not able to evaluate the impact of the various data features/contexts on the final decision, and hence they do not learn which data is required. In this paper we propose a joint estimation–detection method for evaluating the accuracy of each expert as a function of the data features/context and for fusing the experts' decisions. The proposed method is unsupervised: the true labels are not available and no prior information is assumed regarding the performance of each expert. Extensive simulation results show the improvement of the proposed method as compared to state-of-the-art approaches. We also provide a systematic, unsupervised method for ranking the informativeness of each feature in the decision-making process.
Keywords
Ensemble learning · Unsupervised learning · Decision making · Contextual estimation · Feature selection · Big data

1 Introduction
In numerous big data applications [e.g., data-driven marketing (Brown et al. 2011), surveillance (Craig and Ludloff 2011), sensing and networking (Segaran and Hammerbacher 2009), and health monitoring (Tseng et al. 2008)] involving data mining, decision making, predictions, etc., ensemble-based approaches have been shown to produce more accurate results than single-expert systems (Kuncheva and Whitaker 2003; Tekin and van der Schaar 2013; Zhang et al. 2013). Another key advantage is that, when the various experts have access to and base their decisions on heterogeneous sources of data, ensemble-based approaches do not need to centralize the data acquisition and processing, thereby enabling low-delay, distributed processing by individual experts.
An ensemble system is constructed from a set of (possibly heterogeneous)^{1} experts and a proper combining rule for fusing the outputs of the experts. Individual experts may have access to heterogeneous data and may have been trained using different data sets. Hence, by properly combining their outputs, ensemble-based methods can achieve more accurate decisions.
As mentioned above, in many applications, the data may be distributed among the experts, with each of the experts using a part of the data. The data may be partitioned horizontally so that each expert works with different disjoint subsets of the entire data set, or vertically so that each expert works with a subset of dimensions (or features) of the same data (Zhang et al. 2013; Zheng et al. 2011).
It is well-known that the success of ensemble methods depends on the diversity of the experts. Bagging (bootstrap aggregating, Breiman 1996), Boosting (Schapire 1990), and AdaBoost (Freund and Schapire 1997) are examples of ensemble learning methods in which diversity is achieved by using different training subsets. Neural networks and decision trees are examples in which diversity is achieved through the structure of the expert and the parameters selected during its training stage.
In any ensemble system, the combiner, which fuses the local decisions of the experts, plays an essential role in determining the overall performance. Different methods have been proposed to aggregate the individual decisions of the experts. When the performance of the experts is unknown, the majority rule is often employed (Kuncheva 2004), while when the performances of the experts are known, weighted majority rules are often employed, in which different optimally designed weights are assigned to the experts based on their accuracies.
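To make the weighted fusion concrete, the sketch below (our illustration, not an equation from this paper) fuses binary decisions when each expert's detection probability \(p_d\) and false-alarm probability \(p_f\) are known, using the classical log-likelihood-ratio weights of Chair and Varshney (1986):

```python
import math

def fuse(decisions, p_d, p_f, prior1=0.5):
    """Log-likelihood-ratio fusion of binary expert decisions given
    known per-expert detection (p_d) and false-alarm (p_f) rates.
    Each expert votes with weight log(p_d/p_f) for decision 1 and
    log((1-p_d)/(1-p_f)) for decision 0; decide 1 if the sum, plus
    the prior log-odds, is positive."""
    llr = math.log(prior1 / (1.0 - prior1))
    for u, pd, pf in zip(decisions, p_d, p_f):
        if u == 1:
            llr += math.log(pd / pf)
        else:
            llr += math.log((1.0 - pd) / (1.0 - pf))
    return 1 if llr > 0 else 0
```

Accurate experts (large \(p_d/p_f\) ratio) thus automatically receive large weights, which is the sense in which the weights are optimally designed.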
The method of tracking the best expert is one of the seminal works in online ensemble learning based on weighted majority rule (Herbster and Warmuth 1998). In this approach, the importance of each expert is modeled by a weight which is updated over time using an adaptation method. Different variations of this method have been proposed in which the fusion rule or adaptation algorithm were improved. To improve the adaptation equation, new cost functions are suggested in Choromanska and Monteleoni (2012), Herbster and Warmuth (2001), Monteleoni and Jaakkola (2004).
A priori information regarding the performance of each expert can be obtained using training and validation data sets. For instance, the behavior knowledge space (BKS) method estimates the densities of the classifier outputs and requires large training and validation data sets (Huang and Suen 1995). In some ensemble systems, the experts and the combiner are trained together, using a joint procedure, such as stacked generalization or mixture of experts (Jacobs et al. 1991; Wolpert 1992).
Optimal fusion of local decisions requires a priori knowledge of the accuracy of the experts which, in many applications, may not be available. For example, the data may have an extremely large dimension, or the data stream may be time-varying, which makes it difficult to accurately evaluate the experts' performance based on a priori, limited validation data sets. Moreover, data streams are often received along with their context. The context could be side information, such as a description of the way the data is acquired (Tekin and van der Schaar 2013), or it could be a low-dimensional portion of the actual high-dimensional data representing one of its features or attributes. However, the accuracies of the experts often vary with the context, and the combiner needs to know the accuracies of the experts for every arriving context in order to optimally fuse their decisions, resulting in prohibitively high processing, communication, and storage costs.
In this paper we present an unsupervised ensemble learning method in which the combiner has no prior information regarding the experts' performance. In addition, the methods adopted by the experts, and the data on which they operate, are also unknown to the combiner. Each expert may use a different part of the big data, preprocessed data, or even different correlated data streams obtained from multiple sources. The combiner uses an unsupervised approach to evaluate the accuracies of the experts as functions of the data context, as well as to fuse the decisions of the individual experts. We introduce a model for estimating the experts' accuracies in terms of probabilities of false alarm and correct detection.
To contrast our approach with those in Choromanska and Monteleoni (2012), Herbster and Warmuth (1998), Herbster and Warmuth (2001), and Monteleoni and Jaakkola (2004), we would like to point out that the main focus of these papers is to design an online fusion rule using the unsupervised weighted majority rule. Our approach, on the other hand, uses batch processing. We assume that the data is received along with some context and that the performance of the individual experts is unknown. Our proposed method estimates the performance of the experts in terms of their probabilities of detection and false alarm as functions of the data context, and fuses the decisions of the individual experts. A novel feature of our approach is the manner in which we develop the expectation maximization (EM) algorithm to enable ensemble learning. Ordinarily, for a set of I instances or time indices, the well-known EM algorithm (Dempster et al. 1977) must run \(2^I\) times in order to obtain an estimate of the parameters and the fused decisions for the I instances. Instead, we introduce separate prior probabilities for the fused decision of each instance. This allows us to obtain an estimate of the parameters and the fused decisions for the I instances from a single run of the EM algorithm. We show that, even though unsupervised, our proposed ensemble learning method outperforms numerous state-of-the-art ensemble approaches that are supervised.
In many applications we wish to determine the importance or influence of different features on the final decision. Various traditional feature selection methods have previously been proposed in Holte (1993), Karegowda et al. (2010), Roobaert et al. (2006), and Kannan and Ramaraj (2010). These are supervised methods in which the true labels are known, and the features are selected based on different criteria. Mutual information quotient (MIQ) and mutual information difference (MID) are two effective feature selection methods based on the mutual information between the true label and the different features (Ding and Peng 2003; Peng et al. 2005). The main drawback of these methods is that they are supervised, i.e., they need to know the true label. We extend our proposed method to select/rank the features (data contexts) in terms of their impact on the ensemble's decision-making process. We show that, even though unsupervised, our proposed feature selection method is similar in performance to supervised feature selection methods such as MIQ and MID.

– We have explained our algorithm in more detail, with derivations and additional discussion.

– We have provided a discussion of the computational complexity of the proposed algorithm.

– Assuming that the expectation maximization algorithm converges to the maximum likelihood solution, we have justified the combiner's fusion rule.

– Using an information-theoretic approach, we have proposed a new feature ranking algorithm. The results show that this new feature ranking approach performs similarly to supervised ranking methods.

– We have included several additional results on the performance of the algorithm, on a comparison of the proposed method with the majority rule, and on the effect of the values of the algorithm's parameters on its performance.
2 Problem formulation and notations
We consider an ensemble learning system with K experts; each expert classifies an input data stream characterized by its context.
Since a multiple-choice decision making problem can be divided into a set of binary decision problems (Lienhart et al. 2003), without loss of generality we consider the binary decision problem here.
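As a small illustration of this decomposition (our sketch, not part of the paper's formulation), a multi-class label stream can be turned into one binary stream per class:

```python
def one_vs_rest_labels(labels, classes):
    """Decompose a multi-class label stream into one binary label
    stream per class: stream c carries 1 where the label equals c
    and 0 elsewhere (class c vs. not-c)."""
    return {c: [1 if y == c else 0 for y in labels] for c in classes}
```

Each binary stream can then be handled by the binary ensemble machinery described below.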
For each instance^{2} i, let the portion of data available for the \(k\hbox {th}\) expert be denoted by \(s_k(i) \in {\mathcal {S}}_k\), and let \(Z(i) \in {\mathcal {Z}}\) be the context of the received data. As mentioned before, the context may be a vector in general, and may represent a side information about the data or it may be a subset of the features (attributes) of the data. The set \({\mathcal {Z}}\) is assumed to be a (subset of a) metric space with the metric \(d_{{\mathcal {Z}}}(z_1,z_2)\) that represents the distance between \(z_1\) and \(z_2\). Let \(y(i) \in {\mathcal {Y}}\triangleq \left\{ 0, 1 \right\} \) denote the true label at instance i. In the proposed approach, the true label y(i) is not available to the combiner/ensemble learner and the combiner does not know the methods used by the experts to classify the data. Our unsupervised method will use the context Z(i) to estimate the accuracy of each expert.
The combiner receives the decisions of all the experts, Y, (as well as the context \(\mathbf Z \)) and needs to fuse them to obtain an estimate of the (unknown) true labels. However, to enable efficient fusion of the received decisions, the combiner must estimate the accuracy of each expert. We describe these accuracies in terms of the probabilities of correct decision for each expert. More specifically, we associate a probability of (correct) detection^{3} and a probability of false alarm^{4} with each expert. In order to estimate these probabilities, we require the true labels (which are unknown). On the other hand, for the combiner to detect the true labels, we require the probabilities of detection and false alarm. It can thus be easily seen that these two problems are connected. A naive solution is to estimate the probabilities of false alarm and detection for every possible label (decision) vector for instances 1 to I (\(2^I\) possibilities), then use these estimated probabilities to evaluate the likelihood of observing the corresponding label vector, and, among all the label vectors, select the one with the highest likelihood. Clearly, the computational complexity of this approach is prohibitive. In the next section we present a novel method based on the EM algorithm which can be used to effectively detect the true labels and estimate the probabilities of false alarm and detection for each expert with significantly lower complexity than the brute-force method.
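For contrast, the naive \(2^I\) search can be sketched as follows. This simplified version (our illustration) treats each expert's accuracies as constants rather than functions of the context z, and uses add-one smoothing when estimating the rates; both choices are ours, not the paper's:

```python
import math
from itertools import product

def brute_force_fuse(Y):
    """Exhaustive ML fusion.  Y[k][i] is expert k's decision on
    instance i.  For every candidate label vector y in {0,1}^I,
    estimate each expert's detection / false-alarm rates under y
    (with add-one smoothing), score the log-likelihood of Y, and
    keep the best y.  Cost grows as O(2^I).  Note: a labeling and
    its complement score identically, so one of the two symmetric
    solutions is returned."""
    K, I = len(Y), len(Y[0])
    best_y, best_ll = None, float("-inf")
    for y in product([0, 1], repeat=I):
        n1 = sum(y)
        n0 = I - n1
        ll = 0.0
        for k in range(K):
            # empirical detection / false-alarm rates under candidate y
            pd = (sum(Y[k][i] for i in range(I) if y[i] == 1) + 1) / (n1 + 2)
            pf = (sum(Y[k][i] for i in range(I) if y[i] == 0) + 1) / (n0 + 2)
            for i in range(I):
                p = pd if y[i] == 1 else pf
                ll += math.log(p if Y[k][i] == 1 else 1.0 - p)
        if ll > best_ll:
            best_y, best_ll = y, ll
    return list(best_y)
```

Even in this context-free form, the loop over \(2^I\) candidates explodes with I, which is what motivates the single-run EM method of the next section.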
3 Estimation of the experts’ accuracies and decision making
In this section, given the local decisions, Y, and the observed vector of contexts, \(\mathbf Z \), we first develop an estimation method for \(\varTheta \) (which includes the estimation of P(z), \(\forall z \in {\mathcal {Z}}\), and \(\varPhi \)). Then, we use the estimated parameters to detect the true labels \(\mathbf y \).
3.1 Estimation procedure
Expectation step
By iterating between the expectation step and the maximization step, until a stopping criterion is satisfied,^{7} we find an estimation of the parameter set.^{8}
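Stripped of the context dependence and Lipschitz constraints, the iteration can be sketched as a Dawid–Skene-style EM over binary labels with a separate prior \(\phi(i)\) per instance. This is our simplified illustration, not a transcription of the paper's Eqs. (11)–(20); the initial values follow footnote 12 (detection 0.8, false alarm 0.2, priors 0.6):

```python
def em_fuse(Y, iters=5, pd0=0.8, pf0=0.2, phi0=0.6):
    """One-run EM for unsupervised fusion of binary decisions.
    Y[k][i] is expert k's decision on instance i.  Maintains a
    per-instance prior/posterior phi[i] = P(y(i) = 1), so a single
    EM run recovers both the parameters and all I fused labels."""
    K, I = len(Y), len(Y[0])
    pd, pf = [pd0] * K, [pf0] * K
    phi = [phi0] * I
    for _ in range(iters):
        # E-step: posterior probability that each label is 1,
        # assuming experts decide independently given the label.
        for i in range(I):
            l1, l0 = phi[i], 1.0 - phi[i]
            for k in range(K):
                if Y[k][i] == 1:
                    l1 *= pd[k]
                    l0 *= pf[k]
                else:
                    l1 *= 1.0 - pd[k]
                    l0 *= 1.0 - pf[k]
            phi[i] = l1 / (l1 + l0)
        # M-step: re-estimate detection / false-alarm probabilities
        # from the soft labels, clamped away from 0 and 1.
        w1 = sum(phi)
        w0 = I - w1
        for k in range(K):
            pd[k] = sum(phi[i] * Y[k][i] for i in range(I)) / w1
            pf[k] = sum((1 - phi[i]) * Y[k][i] for i in range(I)) / w0
            pd[k] = min(max(pd[k], 1e-6), 1 - 1e-6)
            pf[k] = min(max(pf[k], 1e-6), 1 - 1e-6)
    return [1 if p > 0.5 else 0 for p in phi], pd, pf
```

Because \(\phi(i)\) is updated per instance inside one run, the \(2^I\) enumeration of label vectors is avoided entirely.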
In each iteration of the EM, Eqs. (11), (14), and (20) are calculated. Since (14) is derived directly from (11), together they require \({\mathcal {O}}(KI)\) multiplications. However, each iteration also requires solving (14) with an interior-point algorithm, which is the dominant term in the computational complexity and requires \({\mathcal {O}}(\sqrt{KI})\) operations in each of its iterations. Assuming the interior-point method requires \(N_{\mathrm{IP}}\) iterations, its computational complexity is \({\mathcal {O}}(N_{\mathrm{IP}}\sqrt{KI})\) (Anstreicher 1999). Finally, assuming that we run the EM for \(N_{\mathrm{EM}}\) iterations, the computational complexity of the entire algorithm is \({\mathcal {O}}(N_{\mathrm{EM}}N_{\mathrm{IP}}\sqrt{KI})\).
We denote the final estimates of the parameter set by \(\tilde{\varTheta }\). Similarly we denote the final estimates of P and \(\varPhi \) and their entries \(p_{\eta k}(z)\) and \(\phi _\eta (i)\) by \(\tilde{P}\), \(\tilde{\varPhi }\), \(\tilde{p}_{\eta k}(z)\) and \(\tilde{\phi }_\eta (i)\), respectively.
Remark 1
The Lipschitz constants \(c_{\eta k}\) affect the performance of the algorithm in estimating the parameters \(p_{\eta k}(z)\) as functions of z. As evident from (4), (23) and (27), smaller values of \(c_{\eta k}\) result in a smoother estimate for \(p_{\eta k}(z)\), while larger values of \(c_{\eta k}\) allow for larger variations in the estimates. Therefore the Lipschitz constants \(c_{\eta k}\) must be selected in accordance with the performance of the classifiers as a function of the context variables. In particular, if for example the detection performance \(p_{1k}(z)\) of the kth classifier is believed to be very sensitive to the context variable z, i.e., small changes in z result in large changes in \(p_{1k}(z)\), then the value of \(c_{1k}\) must be chosen to be large. On the other hand, if the detection performance of the kth classifier is not very sensitive to the context variable z, then a smaller value should be assigned to \(c_{1k}\). That said, we would like to also point out that the Lipschitz condition in (4) is introduced to enable the estimation of the functions \(p_{\eta k}(z)\) with a smaller number of samples. It is important to note that even if \(c_{\eta k}\) do not satisfy (4) for the true functions \(p_{\eta k}(z)\), our algorithm still works. However, in this case our estimates of \(p_{\eta k}(z)\) may not be as accurate. In Fig. 5 of Sect. 5 we present results of the estimations for different values of the Lipschitz constants to highlight this point. When the Lipschitz constants are not known, they can be set to larger values initially. It is worth noting that for a given number of data samples, with smaller values of the Lipschitz constants the results would be smoother. Knowing the Lipschitz continuity provides extra information to the combiner about the performance of the experts as a function of the context. 
This information limits the possibilities for the probabilities of false alarms and detection of the experts to the set of functions satisfying the Lipschitz continuity constraints. However, knowing this information is not critical in the detection of the labels or the estimation of the parameters. When this information is not available at the combiner, it can be set to a large number relative to the domain of context, \({\mathcal {Z}}\). In this way, no constraint will be applied to the EM algorithm. However to achieve a performance similar to the case that the Lipschitz constants are known, more data samples will be required.
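To illustrate how the constant c constrains the estimates, a McShane-type extension interpolates sampled probability estimates to a new context without violating the Lipschitz bound. This is our sketch, using a scalar context and the min-based rule mentioned in footnote 9; clipping to 1 is our addition to keep the output a valid probability:

```python
def lipschitz_interpolate(samples, c, z):
    """Extend probability estimates to a new context z while
    respecting |p(z1) - p(z2)| <= c * |z1 - z2|.  `samples` is a
    list of (z_i, p_i) pairs of contexts and estimated probabilities;
    the McShane extension takes the tightest upper envelope
    min_i(p_i + c * |z - z_i|), clipped to 1."""
    return min(1.0, min(p_i + c * abs(z - z_i) for z_i, p_i in samples))
```

With a small c the extension changes slowly between samples (smooth estimates); with a large c it follows the nearest samples closely, mirroring the trade-off described in the remark.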
3.2 Combiner’s decisions
In the previous section, we evaluated the estimates of probabilities of false alarm and detection for all the experts as well as the prior probabilities of the true labels \(\tilde{\varPhi }\).
4 Feature selection
In this section, we extend the proposed approach in Sect. 3 in order to extract the importance of each individual feature of a data set in the decisions of the individual experts as well as the combiner’s decisions. Suppose that the received data is described by \(N_F\) features or attributes. We denote the \(\ell \hbox {th}\) feature by \(x^\ell \in {\mathcal {X}}^\ell \) where \({\mathcal {X}}^\ell \) denotes the set of values that feature \(x^\ell \) may assume. Hereafter, \(x^\ell = {x}\) implies that the value of the \(\ell \hbox {th}\) feature, \(x^{\ell }\), is x. The system model is the same as that in Fig. 1. Each expert sends its decisions to the combiner and the combiner implements the proposed approach described in Sect. 3 once for each feature, where a feature in this section is the same as a context in Sect. 3.
We assume that the ensemble system is constructed from a variety of experts and, with the proposed approach, a very accurate detection performance can be achieved. In other words, we assume that the combiner’s decisions are the same as the true labels. This assumption allows us to analyze each feature independently of the others in terms of the information between the feature and the actual label.
The unsupervised feature selection method described above can be extended to more than one feature per expert by letting the context be formed from the set of features of interest.
5 Numerical results
In this section, we first use a system with up to 8 experts to evaluate the performance of the proposed unsupervised ensemble approach. The probabilities of false alarm and detection of these 8 experts as functions of the context z are shown in Table 1. These probabilities are selected so that they represent a variety of behaviors. Many of the experts have accuracies that vary with the context value, and for many values of the context the false alarm and detection probabilities of the experts are close to 0.5, i.e., these experts are not very effective in detecting the true labels. Finally, the \({\mathcal {L}}_1\) norm is used as the distance measure, i.e., \(d_{{\mathcal {Z}}}(z_1, z_2)=\left\| z_1-z_2\right\| _1\).
Table 1 The probabilities of false alarm and detection of the classifiers

         | \(p_{0k}(z)\)                       | \(c_{0k}\) | \(p_{1k}(z)\)                       | \(c_{1k}\)
Expert 1 | \(-2z^2+2z\)                        | 2.0        | \(.5+.5\left|\sin (2\pi z)\right|\) | 3.1
Expert 2 | \(2(z-.5)^2\)                       | 2.0        | .9                                  | 0.1
Expert 3 | \(.5\left|\sin (2\pi z)\right|\)    | 3.1        | \(1-2(z-.5)^2\)                     | 2.0
Expert 4 | .1                                  | 0.1        | \(.5+2(z-.5)^2\)                    | 2.0
Expert 5 | \(.5z\)                             | 0.5        | \(.75+2(z-.5)^3\)                   | 1.5
Expert 6 | \(.25+2(z-.5)^3\)                   | 1.5        | \(.75-2(z-.5)^3\)                   | 1.5
Expert 7 | \(.5(1-z)\)                         | 0.5        | \(.75+.5(z-.5)\)                    | 0.5
Expert 8 | \(.25+2(z-.5)^3\)                   | 1.5        | \(.5(2-z)\)                         | 0.5
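Experiments of this kind can be reproduced by simulating context-dependent experts directly from profiles like those in Table 1. The sketch below uses our reading of Expert 1's profiles, \(p_{01}(z) = -2z^2+2z\) and \(p_{11}(z) = .5+.5\left|\sin (2\pi z)\right|\); the helper names are ours:

```python
import math
import random

# Expert 1's profiles as we read them from Table 1.
pf1 = lambda z: -2.0 * z**2 + 2.0 * z                      # false alarm
pd1 = lambda z: 0.5 + 0.5 * abs(math.sin(2 * math.pi * z))  # detection

def expert_decision(y, z, pf_fn, pd_fn, rng):
    """One simulated binary decision for true label y at context z:
    the expert outputs 1 with probability pd_fn(z) when y = 1 and
    with probability pf_fn(z) when y = 0."""
    p_one = pd_fn(z) if y == 1 else pf_fn(z)
    return 1 if rng.random() < p_one else 0
```

Feeding such simulated decision streams (with z drawn from [0, 1]) to the combiner is one way to check how well the estimated \(p_{\eta k}(z)\) curves track the profiles in Table 1.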
To evaluate the ability of the fusion rule to make the right decision about benign or malignant samples, we compare the performance of our approach against the supervised and unsupervised versions of the method of tracking the best expert (MTBE) (Herbster and Warmuth 1998), the adaptive Perceptron weighted majority rule (APMR) (Canzian et al. 2013), and the supervised optimal fusion rule (SOFR) (Chair and Varshney 1986), in terms of the probability of error, \(p_e\). In MTBE, at each instance, the decision of each expert is compared against the actual label in the supervised version (or against the pool of decisions in the unsupervised version). A coefficient is associated with each expert and determines the weight of the expert in the pooling. This weight is updated at each instance using a nonlinear function, based on how close the expert's latest decision was to the actual label in the supervised version (or to the pool of decisions in the unsupervised version). We use MTBE for the comparison as one of the seminal and widely used online learning approaches. APMR is similar to MTBE, with the experts' weights updated using a linear function of the previous decision and the pooled or actual decision. However, in supervised APMR, an expert's weight is updated only when the combiner's decision differs from the actual label. We compare our approach against APMR as one of the newer online learning approaches based on the Perceptron. SOFR is a supervised method in the sense that the combiner knows the performance of all the experts in the system. It uses the ML rule to fuse the decisions at each instance. The error rate of SOFR can be considered a lower bound for any (supervised or unsupervised) method. In Fig. 9, we compare these approaches with our proposed approach. It can be seen that the proposed approach outperforms the MTBE and APMR methods, including the supervised MTBE.
APMR and MTBE do not fuse the data optimally. Moreover, in its modeling, APMR does not "reward" or "punish" the experts who make decisions similar to or different from the combiner's, even when the combiner correctly detects the true label. Another fundamental problem with the unsupervised MTBE and APMR lies in their modeling: suppose an expert can correctly detect the event (detection probability of one) but has poor performance when the true label is 0 (large false-alarm probability). Since the model used in MTBE and APMR only considers correct detection, it cannot properly characterize this expert.
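For reference, one unsupervised weighted-majority step of the kind used by this family of methods can be sketched as follows; this is our illustration, and the actual update functions in Herbster and Warmuth (1998) and Canzian et al. (2013) differ in detail:

```python
def weighted_majority_step(weights, decisions, beta=0.5):
    """One unsupervised weighted-majority step: fuse the binary
    decisions by weighted vote, then multiplicatively penalise
    (by factor beta) every expert that disagreed with the pooled
    decision, and renormalise the weights."""
    vote1 = sum(w for w, d in zip(weights, decisions) if d == 1)
    vote0 = sum(w for w, d in zip(weights, decisions) if d == 0)
    fused = 1 if vote1 >= vote0 else 0
    new_w = [w * (1.0 if d == fused else beta)
             for w, d in zip(weights, decisions)]
    s = sum(new_w)
    return fused, [w / s for w in new_w]
```

Note that a single scalar weight per expert conflates detection and false-alarm behavior, which is exactly the modeling limitation discussed above.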
The importance of each feature for every expert, \({{\mathrm{imp}}}(\ell ;k)\)

                         | DecisionStump | KNN | kStar | \(\hbox {LogitBoost}+\hbox {ZeroR}\) | Multilayer Perceptron | NaiveBayes | Importance, \({{\mathrm{imp}}}(k)\)
Clump thickness          | .32 | .72 | .67  | .32 | .25 | .12 | 0.7158
Uniformity of cell size  | .92 | .48 | .995 | .92 | .95 | .4  | 0.9950
Uniformity of cell shape | .8  | .24 | .994 | .8  | .96 | .4  | 0.9940
Marginal adhesion        | .23 | .65 | .994 | .24 | .97 | .86 | 0.9942
Single epithelial        | .88 | .38 | .99  | .88 | .93 | .69 | 0.9897
Bare nucleoli            | .37 | .82 | .99  | .37 | .98 | .86 | 0.9904
Bland chromatin          | .73 | .88 | .98  | .73 | .96 | .94 | 0.9834
Normal nucleoli          | .78 | .34 | .97  | .78 | .95 | .83 | 0.9682
Mitoses                  | .92 | .86 | .99  | .92 | .99 | .82 | 0.9880
Feature rankings obtained by our approach (for different Lipschitz constants c) and by MIQ and MID

                   | Our approach                                  | MIQ | MID
                   | \(c=0.005\) | 0.01 | 0.05 | 0.5 |
                   | \(p_e=0.05\) | 0.05 | 0.331 | 0.137 | 0.05 | 0.05 |
Clump thickness    | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9
Uni. of cell size  | 1 | 1 | 4 | 1 | 1 | 1 | 1 | 1
Uni. of cell shape | 2 | 3 | 1 | 3 | 3 | 3 | 3 | 3
Marginal adhesion  | 6 | 2 | 3 | 2 | 2 | 2 | 4 | 2
Single epithelial  | 3 | 7 | 5 | 5 | 5 | 5 | 6 | 5
Bare nucleoli      | 5 | 4 | 7 | 4 | 4 | 4 | 2 | 4
Bland chromatin    | 8 | 8 | 6 | 8 | 7 | 7 | 5 | 6
Normal nucleoli    | 7 | 6 | 8 | 6 | 8 | 8 | 7 | 8
Mitoses            | 4 | 5 | 2 | 7 | 6 | 6 | 8 | 7
6 Conclusion
In this paper, we provided an approach to estimate the accuracies of experts in ensemble-based decision systems and to make a final decision based on the local decisions of the experts. Moreover, since in many applications (especially medicine) the true label may be unknown, the proposed approach is unsupervised. Our approach does not assume any prior information about the experts' accuracies or about how they process the data to issue their decisions. The results show the efficiency and accuracy of the proposed approach in decision making and learning systems, as well as for extracting the importance of each data feature. The proposed method has many applications, including clinical decision support systems, surveillance systems, transportation systems, etc. The methods introduced in this paper can be extended in numerous directions; here we describe only a few. First, in the current system, the experts are fixed and do not adapt their expertise (rules) over time. Future work will investigate the case in which experts change and adapt their expertise over time, and the impact this has on the ensemble operation and its performance. Second, in certain applications, such as predictions from social media, financial, or transportation data, the experts may differ significantly in terms of the quantity and quality of the data available to them. In such settings, it may be important to adapt the operation of the proposed ensemble scheme to take such variations into consideration. Finally, while the current experts are computer systems (machine learning algorithms), future systems may consider local experts that are a mixture of humans and computer systems. Understanding in which settings and for which applications it is beneficial to adopt ensembles of both human and computerized experts represents yet another interesting direction of future research.
Footnotes
 1.
Here heterogeneity of classifiers implies that they may adopt different processing schemes, which may lead to different error rates in classifying the data (Webb and Copsey 2011).
 2.
For other applications such as processing a database, instance can be replaced by the index of the data sample.
 3.
The probability that the expert makes a correct determination of the true label when the label is 1.
 4.
The probability that the expert makes an incorrect determination of the true label when the label is 0.
 5.
We would like to point out that although we refer to \(\varPhi \) as the prior probability matrix, it is only introduced here to convert the problem of detection of \(\mathbf y \) into a problem of estimation of the matrix \(\varPhi \). This point is made clear in Sect. 3.2.
 6.
As mentioned previously, we assume that given the true labels, i.e., \(\varDelta \), and the contexts Z (as well as the parameter set \(\varTheta \)), the experts’ decisions are independent.
 7.
A stopping criterion could be a preselected number of iterations or a threshold on the relative difference between the last two estimations. In Sect. 5, we have used the number of iterations as the stopping criterion where we show the results of the parameter estimation for 1, 2 and 5 iterations of the algorithm. It is shown that the estimated parameters after 2 and 5 iterations are very close and also close to the actual parameters.
 8.
 9.
The minimum in (27) provides a max-min approximation for the values of the detection (false alarm) probabilities that have been calculated. This is an interpolation problem and our approach is admittedly heuristic. Another approach is to select the mean.
 10.
All our numerical results verify this to be the case.
 11.
Although we have used the symbol I for the number of indexes as well as the mutual information, we believe from the context it should be clear which notation is in use.
 12.
The initial values for the parameter set can be arbitrary. However, as is normally expected, the probabilities of detection are greater than 0.5 and the probabilities of false alarm are less than 0.5. Therefore, in Sect. 5 we set all the initial probabilities of detection to 0.8 and all the initial probabilities of false alarm to 0.2. The labels' prior probabilities, \(\phi _{\eta }(i)\), can be any number in the interval [0, 1]. We initialized \(\phi _1(i)\) to 0.6 for \(i = 1, 2, \ldots , I\). We also used the number of iterations as the stopping criterion and observed that good results can be obtained with 5 or fewer iterations.
 13.
We used machine learning classifiers from Weka. Detailed description of each classifier can be found in Witten et al. (2011).
 14.
While in the first six columns we show only two digits after the decimal point, in the last column four digits are shown to more clearly distinguish the results.
Notes
Acknowledgments
M. van der Schaar acknowledges the support of NSF grant ECCS 1462245.
References
 Anstreicher, K. M. (1999). Linear programming in o ([n3/ln n] l) operations. SIAM Journal on Optimization, 9(4), 803–812.MathSciNetCrossRefzbMATHGoogle Scholar
 Bishop, C. M. (2006). Pattern recognition and machine learning (information science and statistics). Secaucus, NJ: Springer.zbMATHGoogle Scholar
 Blum, A. (1995). Empirical support for winnow and weightedmajority algorithms: Results on a calendar scheduling domain. In Proceedings of 12th International Conference on Machine learning (pp. 64–72). San Francisco, CA: Morgan Kaufmann.Google Scholar
 Boyd, S., & Vandenberghe, L. (2004). Convex optimization. New York, NY: Cambridge University Press.CrossRefzbMATHGoogle Scholar
 Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.MathSciNetzbMATHGoogle Scholar
 Brown, B., Chui, M., & Manyika, J. (2011). Are you ready for the era of big data? McKinsey Quarterly, 4, 24–35.Google Scholar
 Canzian, L., Zhang, Y., & van der Schaar, M. (2013). Ensemble of distributed learners for online classification of dynamic data streams. Preprint. arXiv:1308.5281.
 Chair, Z., & Varshney, P. K. (1986). Optimal data fusion in multiple sensor detection systems. IEEE Transactions on Aerospace and Electronic Systems, AES–22(1), 98–101.CrossRefGoogle Scholar
 Choromanska, A., & Monteleoni, C. (2012). Online clustering with experts. In International conference on artificial intelligence and statistics (pp. 227–235).Google Scholar
 Craig, T., & Ludloff, M. E. (2011). Privacy and big data. Sebastopol: O’Reilly Media.Google Scholar
 Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.Google Scholar
 Ding, C., & Peng, H. (Aug. 2003). Minimum redundancy feature selection from microarray gene expression data. In Proceedings of the 2003 IEEE Bioinformatics conference, 2003. CSB 2003 (pp. 523–528).Google Scholar
 Fan, W., Stolfo, S. J., & Zhang, J. (1999). The application of AdaBoost for distributed, scalable and online learning. In Proceedings of the 5th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’99 (pp. 362–366). New York, NY:ACM.Google Scholar
 Freund, Y., & Schapire, R. E. (1997). A decisiontheoretic generalization of online learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.MathSciNetCrossRefzbMATHGoogle Scholar
 Hadavandi, E., Shahrabi, J., & Shamshirband, S. (2015). A novel boosted-neural network ensemble for modeling multi-target regression problems. Engineering Applications of Artificial Intelligence, 45, 204–219.
 Herbster, M., & Warmuth, M. K. (1998). Tracking the best expert. Machine Learning, 32(2), 151–178.
 Herbster, M., & Warmuth, M. K. (2001). Tracking the best linear predictor. The Journal of Machine Learning Research, 1, 281–309.
 Holte, R. C. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11(1), 63–90.
 Huang, Y. S., & Suen, C. Y. (1995). A method of combining multiple experts for the recognition of unconstrained handwritten numerals. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1), 90–94.
 Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87.
 Kannan, S. S., & Ramaraj, N. (2010). A novel hybrid feature selection via symmetrical uncertainty ranking based local memetic search algorithm. Knowledge-Based Systems, 23(6), 580–585.
 Karegowda, A. G., Manjunath, A. S., & Jayaram, M. A. (2010). Comparative study of attribute selection using gain ratio and correlation based feature selection. International Journal of Information Technology and Knowledge Management, 2(2), 271–277.
 Kleinberg, R., Slivkins, A., & Upfal, E. (2008). Multi-armed bandits in metric spaces. In Proceedings of the 40th annual ACM symposium on theory of computing (pp. 681–690). ACM.
 Kuncheva, L. I. (2004). Combining pattern classifiers: Methods and algorithms. Hoboken: Wiley.
 Kuncheva, L. I., & Whitaker, C. J. (2003). Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51(2), 181–207.
 Lienhart, R., Liang, L., & Kuranov, A. (2003). A detector tree of boosted classifiers for real-time object detection and tracking. In IEEE international conference on multimedia and expo (ICME 2003).
 Littlestone, N., & Warmuth, M. K. (1994). The weighted majority algorithm. Information and Computation, 108(2), 212–261.
 Monteleoni, C., & Jaakkola, T. S. (2004). Online learning of non-stationary sequences. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems 16 (pp. 1093–1100). Cambridge: MIT Press.
 Murphy, P. M., & Aha, D. W. (1994). UCI repository of machine learning databases: Machine readable data repository. Irvine: Univ. of California at Irvine.
 Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1226–1238.
 Roobaert, D., Karakoulas, G., & Chawla, N. V. (2006). Information gain, correlation and support vector machines. In I. Guyon, S. Gunn, M. Nikravesh, & L. A. Zadeh (Eds.), Feature extraction (pp. 463–470). Berlin, Heidelberg: Springer-Verlag.
 Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227.
 Segaran, T., & Hammerbacher, J. (2009). Beautiful data: The stories behind elegant data solutions. Sebastopol: O’Reilly Media, Inc.
 Stahl, F., May, D., Mills, H., Bramer, M., & Gaber, M. M. (2015). A scalable expressive ensemble learning using random prism: A MapReduce approach. In A. Hameurlain, J. Küng, R. Wagner, S. Sakr, L. Wang, & A. Zomaya (Eds.), Transactions on large-scale data- and knowledge-centered systems (pp. 90–107). Springer.
 Tekin, C., & van der Schaar, M. (2013). Distributed online big data classification using context information. Preprint. arXiv:1307.0781.
 Tseng, V. S., Lee, C.-H., & Chen, J. C.-Y. (2008). An integrated data mining system for patient monitoring with applications on asthma care. In 21st IEEE international symposium on computer-based medical systems, 2008. CBMS ’08 (pp. 290–292).
 Wang, H., Fan, W., Yu, P. S., & Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the 9th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’03 (pp. 226–235). New York, NY: ACM.
 Wang, Y., Li, H., Wang, H., Zhou, B., & Zhang, Y. (2015). Multi-window based ensemble learning for classification of imbalanced streaming data. In J. Wang, W. Cellary, D. Wang, H. Wang, S.-C. Chen, T. Li, & Y. Zhang (Eds.), Web information systems engineering—WISE 2015 (pp. 78–92). Switzerland: Springer International Publishing.
 Webb, A. R., & Copsey, K. D. (2011). Statistical pattern recognition. Hoboken: Wiley.
 Witten, I. H., Frank, E., & Hall, M. A. (2011). Data mining: Practical machine learning tools and techniques. The Morgan Kaufmann series in data management systems. Elsevier Science.
 Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2), 241–259.
 Zhang, D. T. Y., Sow, D., & van der Schaar, M. (2013). A fast online learning algorithm for distributed mining of big data. In The big data analytics workshop at SIGMETRICS 2013.
 Zheng, H., Kulkarni, S. R., & Poor, H. V. (2011). Attribute-distributed learning: Models, limits, and algorithms. IEEE Transactions on Signal Processing, 59(1), 386–398.