
1 Introduction

Nowadays, data is available from heterogeneous and dynamic web data sources, and their integration has become a crucial problem [5]. Due to the increase of such data sources and/or data web services, finding and ranking suitable data sources and/or data web services has led to deep investigations and significant research efforts [12]. From another viewpoint, processing the collected heterogeneous data remains a challenging problem, especially for classification tasks. The objective of combining multiple classifiers is to build a powerful solution to handle difficult pattern recognition problems [2]. There are two existing categories of classifier combination, i.e., weighting and meta-learning [11]. Both categories tend to focus on the label prediction to build a decision rather than considering the class probability. The predicted class probability can be used to represent the classifier’s confidence score, as it is applied in weighted voting [9]. However, the quality of weighted voting depends on the performance of its base classifiers. Moreover, each supervised learning algorithm has its own method to provide a confidence score, and this score is not always in probability form. For these reasons, we extend the previous study on the transformation of confidence scores [15]. In the context of ensemble learning, each base learner has its own performance. The reliability score of each classifier within the binary problem is considered to handle these contrasted performances and to propose a better prediction. It is important to mention that the prediction score also represents the algorithm’s level of certainty for each instance of a dataset, so that the composition of base classifiers can be changed dynamically based on their confidence level. We propose a new weighted voting approach that adapts to varied data characteristics. The reliability aspect is tested by adding spammers to the base classifiers. The rest of this paper is organized as follows: Sect. 2 reviews previous work on supervised learning and ensemble methods; the proposed algorithm is described in Sect. 3; Sect. 4 is dedicated to the experimental setup, results, and discussion; and concluding remarks are given in Sect. 5.

2 Related Work

In this section, a brief description of several supervised learning approaches is provided. These approaches are used as base classifiers. Then, different ensemble combination methods are discussed as previous work.

2.1 Supervised Learning as Base Classifiers

There are five approaches to represent knowledge in supervised learning: Bayesian classifiers, decision trees, rules, functions, and lazy classifiers [14]. The Bayesian classifier is a classification method developed from Bayes’ rule of conditional probability. The decision tree is a “divide-and-conquer” approach to learning from a set of independent instances, with attributes as its nodes. Rules have a similar representation to decision trees, although several preconditions are applied to determine their conclusions. Function learners consist of various algorithms that can be written down as mathematical equations in a reasonably natural way. Lazy learners work by delaying the classification until all the training data is collected and a request is made. In addition to the above algorithms, there are predictors called spammers, which label a class randomly and are often found in crowdsourcing settings [10]. One algorithm from each category and a group of spammers are selected as ensemble inputs to ensure the diversity of the base classifiers, which is further discussed in Sect. 4.

2.2 Confidence Score in Ensemble Learning

Voting, Stacked Generalization (Stacking), and Multi-Scheme are examples of combination methods that are able to handle different base classifiers as inputs, while Random Forest, Bootstrap aggregating (Bagging), and Boosting combine several models from the same algorithm [8]. Majority voting (MV) is a popular combination method compared to the others [16]. The quality of MV depends on the performance of the base classifiers. In order to make voting more robust, weighted voting (WMV) can be considered as a potential method [13]. The confidence score can be used as the weight parameter of voting. This method is conducted by collecting confidence scores from a set of classifiers and taking the average for each class. The highest value determines which label is assigned to the tested data. Our work enhances the utilization of the confidence score to build a reliability diagram. Our proposed method is compared to MV, WMV, Stacking, and Multi-Scheme, since several types of classifiers are used as the base.
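As a rough illustration of this confidence-weighted voting for a binary problem (the class and method names below are ours, not taken from the cited works), the per-class confidence scores of all classifiers are summed and the class with the highest total wins; dividing by the number of classifiers to obtain the average would not change the outcome:

```java
/** Minimal sketch of confidence-weighted voting for a binary problem (illustration only). */
public final class ConfidenceWeightedVote {
    /** scores[t][c] is the confidence that classifier t assigns to class c, in [0, 1]. */
    public static int vote(double[][] scores) {
        double[] sum = new double[2];
        for (double[] s : scores) {      // accumulate per-class confidence over all classifiers
            sum[0] += s[0];
            sum[1] += s[1];
        }
        return sum[1] > sum[0] ? 1 : 0;  // class with the highest (average) confidence wins
    }
}
```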

3 Dynamic Reliable Voting Algorithm

In this section, we specify our problem and the basic notation used in binary classification. We also provide an overview of the reliability diagram according to previous literature. Then, the workflow of the Dynamic Reliable Voting (DRV) algorithm is proposed. As shown in Fig. 1, our approach can be summarized in three steps: (a) transforming confidence scores into a reliability diagram, (b) removing spammers or weak classifiers by a static threshold, and (c) selecting the best combination decision for each bin. The X-axis represents the confidence score, while the Y-axis is the value of the empirical class membership probability. The four different points in Fig. 1a describe the diversity of the learning algorithms. Static thresholds are illustrated as blue and purple dashed lines (see Fig. 1b), and the dynamic threshold is drawn as a solid blue-orange line (see Fig. 1c).

Fig. 1. The framework of the proposed algorithm.

3.1 Problem Formulation and Modeling

The training process of ensemble learning starts with the prediction of a set of instances \(X=\{x_1, x_2, ..., x_i, ..., x_N \}\) by T base classifiers. Then, the decisions from the multiple models are combined to improve the overall performance. Each decision of classifier t consists of an independent label prediction y and its confidence score s. These ensemble decisions can be noted as follows: \(D = \left\{ (y^1_i,s^1_i), (y^2_i,s^2_i),..., (y^t_i,s^t_i), ..., (y^T_i,s^T_i) \right\} _{i=1}^N\) such that \(y^t_i \in \{0,1\}\) and \(s^t_i \in [0,1]\) for every classifier t and instance i. A confidence score needs to be normalized if \(s \in [a,b]\) with \(a<b\) and \(a \ne 0\) or \(b \ne 1\). The normalization can be expressed as follows: \(s' = \frac{s-a}{b-a}\). In the case where a and b are unknown, Platt scaling [6] can be applied to adjust the confidence score range.
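A minimal sketch of this min-max rescaling is given below (class and method names are our own illustration; the Platt scaling fallback for unknown bounds is only mentioned in a comment):

```java
/** Rescales a confidence score s from a known range [a, b] to [0, 1]; sketch only. */
public final class ScoreNormalizer {
    public static double normalize(double s, double a, double b) {
        if (b <= a) {
            throw new IllegalArgumentException("normalization requires a < b");
        }
        // When a and b are unknown, Platt scaling would be used instead of this formula.
        return (s - a) / (b - a);
    }
}
```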

The reliability diagram was first introduced to display forecast probabilities at Chicago [4]. The forecast probability and the observed relative frequency are drawn on the X-axis and Y-axis of the diagram, respectively. Later, this representation was applied to classification by converting a confidence score into an empirical class membership probability. This probability can be denoted as P(c|s), where \(c\in \{0,1\}\) and c is a class label. The diagram is built in the training step because it requires the true label of each instance, which is denoted as \(z_i\) with \(z_i \in \{0,1\}\). In order to reflect the distribution of s, the confidence score is converted according to the class c, that is, \(s(c=1)^t_i = 1 - s(c=0)^t_i \) for the binary case. Then, the set of \(s(c=1)\) values is split into several bins of interval width v. Let V be the set of interval limits. \(s^t_i\) is categorized in interval j when \(v_j \le s^t_i < v_{j+1}\). P(c|s) is defined as the number of true labels corresponding to c divided by the number of all predictions in interval j. The resulting distribution is represented in Fig. 1a.
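The sketch below (a hypothetical helper of ours, not the authors’ code) bins the class-1 confidence scores of a single classifier and estimates \(P(c=1|s)\) in each bin as the fraction of instances whose true label is 1:

```java
/** Builds a reliability diagram for one binary classifier; illustrative sketch. */
public final class ReliabilityDiagram {
    /**
     * @param scores confidence scores s(c=1) on the training instances, each in [0, 1]
     * @param labels true labels z_i in {0, 1}
     * @param width  bin width v (e.g. 0.1)
     * @return empirical class membership probability P(c=1|s) for each bin
     */
    public static double[] build(double[] scores, int[] labels, double width) {
        int bins = (int) Math.ceil(1.0 / width);
        double[] positives = new double[bins];
        double[] counts = new double[bins];
        for (int i = 0; i < scores.length; i++) {
            int j = Math.min((int) (scores[i] / width), bins - 1); // bin j with v_j <= s < v_{j+1}
            counts[j]++;
            positives[j] += labels[i];
        }
        double[] p = new double[bins];
        for (int j = 0; j < bins; j++) {
            p[j] = counts[j] > 0 ? positives[j] / counts[j] : Double.NaN; // NaN marks an empty bin
        }
        return p;
    }
}
```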

3.2 Classifier Selection

Once we have the reliability representation of each classifier, a threshold is defined to filter reliable classifiers, denoted as RC. The selection process contains two steps: static and dynamic selection. In the static phase, the average probability estimate \(\overline{P(c|s_{i}^t)}\) and the training accuracy \(A^t\) of each classifier are calculated, as formulated by Eq. 1. A set of reliable candidates RC is determined by choosing the algorithms that satisfy the thresholds \(\epsilon _1\) and \(\epsilon _2\), or have an accuracy \(A^t\) better than the average accuracy \(\overline{A}\). The purpose of this step is to eliminate probable spammers in the training step, since the \(\overline{P(c|s_{i}^t)}\) of a spammer lies in the area of uncertainty (from 0.4 to 0.6).

$$\begin{aligned} RC = \left\{ t \in T | \left( \overline{P(c|s_{1i}^t)} \le \epsilon _1 \wedge \overline{P(c|s_{2i}^t)} \ge \epsilon _2 \right) \vee \left( A^t > \overline{A} \right) \right\} \end{aligned}$$
(1)

where \(s^t_{1i}\) denotes the scores \(s^t_i \le 0.5\), \(s^t_{2i}\) denotes the scores \(s^t_i > 0.5\), \(\epsilon _1 \in [0,0.5]\), and \(\epsilon _2 \in [0.5,1]\).
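One possible reading of Eq. 1 in code is sketched below (helper and parameter names are ours): a classifier is kept if its mean probability estimates on low- and high-confidence predictions respect \(\epsilon _1\) and \(\epsilon _2\), or if its training accuracy beats the average.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Static selection of reliable candidates RC (Eq. 1); illustrative sketch. */
public final class StaticSelection {
    public static List<Integer> select(double[] meanPLow,   // mean P(c|s) over scores s <= 0.5, per classifier
                                       double[] meanPHigh,  // mean P(c|s) over scores s > 0.5, per classifier
                                       double[] accuracy,   // training accuracy A^t, per classifier
                                       double eps1, double eps2) {
        double meanAcc = Arrays.stream(accuracy).average().orElse(0.0);
        List<Integer> rc = new ArrayList<>();
        for (int t = 0; t < accuracy.length; t++) {
            boolean reliableEstimates = meanPLow[t] <= eps1 && meanPHigh[t] >= eps2;
            if (reliableEstimates || accuracy[t] > meanAcc) {
                rc.add(t);   // classifier t becomes a reliable candidate
            }
        }
        return rc;
    }
}
```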

Fig. 2. An example of the reliability diagram of six classifiers with the possible thresholds of \(\lambda _1\) and \(\lambda _2\) at precision 0.1.

The dynamic selection process starts by searching for the optimal values of the thresholds \(\lambda = \{\lambda _1,\lambda _2\}\), where \(\lambda _1, \lambda _2 \in [-0.5,0.5]\). This selection is applied in the testing step. A classifier is excluded from RC if its confidence score lies in [0, 0.5] and its probability estimate is higher than the bin threshold, while a classifier whose confidence score lies in (0.5, 1] is eliminated from RC if its probability estimate is less than the bin threshold (see Eq. 2). The final set of reliable classifiers \(RC_f\) is defined as the subset of RC containing the classifiers that pass the threshold. This selection is called dynamic because the confidence score \(s^t_i\) of each instance is different and independent, so the size of \(RC_f\) also differs for each test datum.

$$\begin{aligned} RC_f=\left\{ t \in RC \;\middle|\; \begin{array}{cc} P(c|s^t_i) \le (v_j+v_{j+1})/2 - \lambda _1 &{} \qquad \text { if } s^t_i \le 0.5 \\ P(c|s^t_i) \ge (v_j+v_{j+1})/2 - \lambda _2 &{} \qquad \text { otherwise}\\ \end{array} \right\} \end{aligned}$$
(2)
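For a single test instance, the per-classifier check of Eq. 2 could be sketched as follows (method and parameter names are ours); \(v_j\) is the lower limit of the bin containing the score, and the bin midpoint shifted by \(\lambda\) acts as the threshold:

```java
/** Dynamic, per-instance reliability check of one classifier (Eq. 2); illustrative sketch. */
public final class DynamicSelection {
    public static boolean keep(double score,      // confidence score s^t_i of this classifier
                               double pGivenS,    // P(c|s^t_i) read from its reliability diagram
                               double binWidth,   // interval width v, e.g. 0.1
                               double lambda1, double lambda2) {
        int bins = (int) Math.ceil(1.0 / binWidth);
        int j = Math.min((int) (score / binWidth), bins - 1);  // bin index, clamped for score = 1.0
        double midpoint = (j + 0.5) * binWidth;                // (v_j + v_{j+1}) / 2
        return score <= 0.5 ? pGivenS <= midpoint - lambda1
                            : pGivenS >= midpoint - lambda2;
    }
}
```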

There is no efficient way to define the values of \(\epsilon _1, \epsilon _2, \lambda _1, \) and \(\lambda _2\) except with an iterative process. The time complexity of the iterative loop C depends on its precision p, which can be written as follows:

$$\begin{aligned} C = \Big ( \frac{0.6-0.4}{p} \Big )^2 \Big ( \frac{0.5-(-0.5)}{p} \Big )^2 \end{aligned}$$
(3)

where \( p \in [0,1]\) is the precision; 0.6 and 0.4 are the highest and lowest limits of uncertainty, respectively, while 0.5 and −0.5 are the highest and lowest limits of \(\lambda \), respectively. The representation of the possible values of \(\lambda \) is illustrated in Fig. 2. C can be reduced by limiting the values of \(\lambda _1\) and \(\lambda _2\) so that \(0<(v_j+v_{j+1})/2 - \lambda _1 \le 0.5\) and \(0.5<(v_j+v_{j+1})/2 - \lambda _2 \le 1\).
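For example, with the precision used later in our experiments (\(p = 0.1\)), the search evaluates

$$C = \Big ( \frac{0.6-0.4}{0.1} \Big )^2 \Big ( \frac{0.5-(-0.5)}{0.1} \Big )^2 = 2^2 \times 10^2 = 400$$

combinations of \((\epsilon _1, \epsilon _2, \lambda _1, \lambda _2)\).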

Our proposed method is detailed in Algorithm 1. Line 1 describes the instances X, which consist of \(X_{train}\) and \(X_{test}\), the ground truth of the training data \(Z_{train}\), the base classifiers T, and the set of interval limits V. The training step of the base classifiers is processed in lines 2–4. The reliability transformation and the static selection are conducted in lines 5–16. Then, lines 17–19 determine the optimal thresholds \(\lambda _1\) and \(\lambda _2\). After defining the reliable candidates RC and the thresholds, the final set of reliable classifiers \(RC_f\) is extracted in lines 20–23. As shown in line 20, the composition of \(RC_f\) depends on the characteristics of each instance x, hence the name dynamic selection. Finally, a majority vote is applied in line 23 to produce the set of recommended decisions.

[Algorithm 1: Dynamic Reliable Voting (DRV) pseudocode]
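To make this workflow concrete, the sketch below chains the hypothetical helpers introduced above into one prediction routine. It is our own glue code under the paper’s definitions, not the authors’ implementation; the thresholds \(\epsilon _1, \epsilon _2, \lambda _1, \lambda _2\) are assumed to have been chosen already by the iterative search of Eq. 3.

```java
import java.util.List;

/** End-to-end sketch of DRV prediction built from the illustrative helpers above. */
public final class DynamicReliableVoting {
    public static int[] predict(double[][] trainScores,  // trainScores[t][i]: s(c=1) of classifier t on train instance i
                                int[] trainLabels,       // ground truth z_i of the training data
                                double[][] testScores,   // testScores[t][i]: s(c=1) of classifier t on test instance i
                                double[] accuracy,       // training accuracy A^t of each classifier
                                double binWidth, double eps1, double eps2,
                                double lambda1, double lambda2) {
        int T = trainScores.length;
        // (a) reliability diagram and mean probability estimates of each classifier (lines 5-16)
        double[][] diagram = new double[T][];
        double[] meanPLow = new double[T], meanPHigh = new double[T];
        for (int t = 0; t < T; t++) {
            diagram[t] = ReliabilityDiagram.build(trainScores[t], trainLabels, binWidth);
            double sumLow = 0, sumHigh = 0;
            int nLow = 0, nHigh = 0;
            for (double s : trainScores[t]) {
                double p = lookup(diagram[t], s, binWidth);
                if (s <= 0.5) { sumLow += p; nLow++; } else { sumHigh += p; nHigh++; }
            }
            meanPLow[t] = nLow > 0 ? sumLow / nLow : 0.5;
            meanPHigh[t] = nHigh > 0 ? sumHigh / nHigh : 0.5;
        }
        // (b) static selection of reliable candidates RC (Eq. 1)
        List<Integer> rc = StaticSelection.select(meanPLow, meanPHigh, accuracy, eps1, eps2);
        // (c) dynamic selection (Eq. 2) and majority vote for each test instance (lines 20-23)
        int nTest = testScores[0].length;
        int[] predictions = new int[nTest];
        for (int i = 0; i < nTest; i++) {
            int positive = 0, total = 0;
            for (int t : rc) {
                double s = testScores[t][i];
                if (DynamicSelection.keep(s, lookup(diagram[t], s, binWidth), binWidth, lambda1, lambda2)) {
                    positive += s > 0.5 ? 1 : 0;
                    total++;
                }
            }
            if (total == 0) {                // fall back to all static candidates if none pass
                for (int t : rc) { positive += testScores[t][i] > 0.5 ? 1 : 0; total++; }
            }
            predictions[i] = 2 * positive >= total ? 1 : 0;   // majority vote on the selected classifiers
        }
        return predictions;
    }

    /** Reads P(c=1|s) from the reliability diagram bin that contains s. */
    private static double lookup(double[] diagram, double s, double binWidth) {
        int j = Math.min((int) (s / binWidth), diagram.length - 1);
        return diagram[j];
    }
}
```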

4 Experimentation Results and Discussion

To evaluate our algorithm, a series of experiments was performed on eight different datasets. The next subsections describe the datasets used and the protocol, and then the results are presented and discussed.

Table 1. Dataset information.

4.1 Data Description and Protocol

Table 1 provides the information on the datasets used in these experiments. Eight datasets from the UCI repository [3] were used: Breast Cancer Wisconsin Diagnostic (BCWD), Vertebral Column (Vertebral), Ionosphere, Musk (version 1), Pima Indians Diabetes (Diabetes), Spambase, Phishing Websites, and EEG Eye State (EES). These data were selected in order to study the behavior of our algorithm on datasets ranging from BCWD (286 instances and 10 attributes) to EES (14980 instances and 15 attributes). The attributes vary between numeric (integer and real) and categorical. Since our focus is to study the reliability aspect of base predictors with various expertise, we avoid using imbalanced datasets so that the performances of the algorithms are not affected by this condition. Class imbalance can be measured by the imbalance ratio (IR), defined as the ratio of the number of instances in the majority class to the number of instances in the minority class [1]. Balanced data are indicated in Table 1 by an IR score close to 1. The experiments were conducted with a train-test evaluation, and the data were split into 67% training set and 33% test set.
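As a purely illustrative example (the numbers are ours, not taken from Table 1), a dataset with 400 majority-class and 350 minority-class instances would have \(IR = 400/350 \approx 1.14\) and would therefore be considered nearly balanced.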

Five base classifiers were applied, based on different knowledge representations, to obtain diversity among the combined models. We used the Weka library to build the models of C4.5 (decision tree), Naive Bayes (Bayesian), JRip (rule), Sequential Minimal Optimization (function), and k-nearest neighbors (lazy). Then, we compared our proposed algorithm with MV, WMV [9], Stacking, and Multi-Scheme (MS) in terms of accuracy. The parameters of all algorithms were left unchanged, i.e., we used the default settings of Weka. In the first experiment, the ensemble algorithms were tested in a condition where the base classifiers do not contain any spammer. Then, 25 spammers were added to the base inputs in a second experiment. This second scenario, where the random predictors outnumber the original classifiers, is important to study the reliability aspect of the combination methods [7]. Both experiments were implemented in Java. We set the precision of the threshold p to 0.1, with a bin interval equal to 0.1.
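A minimal sketch of this setup with Weka’s standard classifier classes is given below; the dataset path is a placeholder and error handling is omitted. The per-instance confidence scores used by DRV could then be read from each trained model via distributionForInstance.

```java
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.rules.JRip;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Trains the five base classifiers with Weka default parameters; the ARFF path is a placeholder. */
public final class BaseClassifiers {
    public static Classifier[] train(String arffPath) throws Exception {
        Instances data = new DataSource(arffPath).getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // assume the class is the last attribute
        Classifier[] base = {
            new J48(),          // C4.5 decision tree
            new NaiveBayes(),   // Bayesian
            new JRip(),         // rule learner
            new SMO(),          // sequential minimal optimization (function)
            new IBk()           // k-nearest neighbors (lazy)
        };
        for (Classifier c : base) {
            c.buildClassifier(data);                    // default Weka settings
        }
        return base;
    }
}
```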

Table 2. Accuracy comparison between base classifiers and ensemble methods.
Table 3. Accuracy comparison of ensemble methods after 25 spammers were added.

4.2 Results and Discussion

Table 2 shows the accuracy comparison between the base classifiers and the ensemble methods. The ID column represents the sequence number of the dataset according to Table 1. The best algorithm is the one with the highest accuracy and the smallest standard deviation. Among the base classifiers, kNN shows better results than the other learners, while C4.5 provides the smallest standard deviation. Four out of five ensemble methods exceed the average scores of all single classifiers. This confirms that ensemble methods can give a better accuracy score than relying on a single classifier. The three voting-based algorithms (MV, WMV, DRV) show superior average results compared to Stacking and MS. It is expected that voting-based methods perform well here, since the scores of the base classifiers are quite good. WMV and DRV have the same standard deviation even though their accuracy scores differ on each dataset. Considering the accuracy of the ensemble methods individually, DRV provides the highest accuracy for six datasets.

Fig. 3. Accuracy distance before and after 25 spammers were added, for the eight datasets (smaller value is better).

In contrast to the results of the first experiment, Table 3 shows a significant decrease for MV and WMV after the 25 spammers were added. DRV gives the best results, followed by MS, Stacking, MV, and WMV. MV tends to give similar accuracy scores across the eight datasets, as indicated by the smallest standard deviation, while the accuracy scores of WMV are the lowest on six datasets. This means that the decisions of MV and WMV are distorted by the presence of the spammers, while DRV is able to select the best combination and to eliminate the weak predictions. The ability of the ensemble methods to maintain their accuracy is illustrated in Fig. 3. The X-axis shows the sequence of the datasets, while the Y-axis shows the absolute accuracy distance between the first and second trials (lower value is better). This score is formulated as \(\varDelta A = |A_1 - A_2|\), where \(A_1\) is the accuracy value of the first experiment and \(A_2\) is the accuracy score in the presence of spammers. The distance scores of MV are higher than those of DRV, Stacking, and MS on all datasets, while WMV shows the highest accuracy instability. This measure allows us to see the effect of random predictors on the popular voting techniques. DRV overcomes this drawback by considering the reliability of each predictor, as indicated by low distance scores similar to those of MS and Stacking.

Fig. 4. Computation time comparison during the training phase (smaller value is better).

Figure 4 illustrates the computation time of the five ensemble methods during the training phase. It covers two conditions: one where the five base classifiers were used as input (see Fig. 4a) and one where the spammers were added (see Fig. 4b). The X-axis represents the eight datasets and the Y-axis indicates the number of seconds needed, on a log scale. The values written in the diagram describe the lowest and highest times for each dataset. The performances of MV and WMV were computed as the sum of the training times of the base classifiers, while the score of DRV was obtained from that of MV plus the reliability diagram building time (see Eq. 3). Because of this identical complexity, the MV line is not visible in the figure and is covered by the WMV line. According to Fig. 4a, all ensemble methods require a similar time to train when the number of instances is less than 500. It also shows that the number of instances generally influences the computation time, although BCWD and Diabetes show the opposite behavior due to their specific characteristics. The superiority of the voting-based methods compared to Stacking and MS can be seen on Diabetes, Spambase, Phishing, and EES. Similar results are also presented in Fig. 4b. Stacking and MS processed Vertebral, Ionosphere, and Musk faster than the others. On the contrary, the gap between their running times and those of the voting algorithms in the second setup is greater than in the first experiment. MV, WMV, and DRV do not show much variation because the spammers require negligible computation time. Based on the comparison of the two figures, the number of base classifiers clearly affects the computation time during the training phase.

5 Conclusion

A diverse group of classifiers is likely to make better decisions than a single learner. However, in an ensemble learning context, each classifier has its own performance. Hence, reliability is a crucial problem when such classifiers have contrasted performances. We propose Dynamic Reliable Voting to solve the problem of how to select the best combination of reliable classifiers and how to handle uncertain labelers, i.e., spammers. The confidence score of each prediction is used as the main information to produce a reliability diagram for each algorithm, and several filters are set to select the best candidates. Five classifiers are chosen as the base models, and the voting combination of their predictions for each datum is changed dynamically according to the past experience of their probability estimates. The results show that our proposed algorithm provides a reliable performance compared to previous approaches on eight datasets, before and after the introduction of spammers. In future work, we will improve our approach to handle uncertainty and imbalanced classes. We will also enhance our algorithm to handle multi-class and multi-label classification.