1 Introduction

Uncertainty caused by incomplete data has become a great challenge for pattern classification [1,2,3,4,5]. Multi-class classification using Error Correcting Output Codes (ECOC), first proposed by Dietterich and Bakiri [6] in 1995, has attracted attention due to its excellent performance. As a decomposition framework, the ECOC method effectively reduces a complex multi-class problem to a set of binary problems. ECOC classification thus simplifies the complexity of pattern recognition and leverages state-of-the-art binary classifiers for multi-class classification. So far, ECOC has been widely applied to spectrum sensing [7], image recognition [8, 9], and disease and fault diagnosis [10, 11] with fairly good classification performance.

There are two steps when using ECOC methods to solve multi-class problems: the encoding process and the decoding process. The goal of encoding is to construct a matrix \( \mathbf{M}={\left({m}_{ij}\right)}_{c\times l} \), mij ∈ {1, 0, −1}, where rows hold the code words of the classes and columns represent bipartitions for the dichotomizers. The classes denoted by zero are ignored in training. The decoding strategy is chosen to merge the outputs of the base classifiers. The framework of ECOC is described in Fig. 1:
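As a concrete sketch (hypothetical helper name, not from the paper), a ternary one-versus-one code for four classes can be built as follows; each column trains one dichotomizer on a pair of classes and codes all other classes as zero:

```python
import numpy as np
from itertools import combinations

def one_vs_one_matrix(n_classes):
    """Ternary one-versus-one ECOC matrix: rows are class code words,
    columns are bipartitions; classes coded 0 are ignored in training."""
    pairs = list(combinations(range(n_classes), 2))
    M = np.zeros((n_classes, len(pairs)), dtype=int)
    for j, (a, b) in enumerate(pairs):
        M[a, j] = 1    # positive meta-class of dichotomizer j
        M[b, j] = -1   # negative meta-class of dichotomizer j
    return M

M = one_vs_one_matrix(4)  # a 4 x 6 matrix, as in the four-class case of Fig. 1
```

Each of the six columns defines one binary problem, and a test sample's six outputs are compared against the four row code words during decoding.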

Fig. 1
figure 1

Four-class classification using ternary ECOC

Encoding, as the first step, is especially crucial. The three main encoding methods are predefined codes, data-dependent codes, and dichotomies-based codes [12]. Predefined codes, including the one-versus-one, one-versus-all, and dense and sparse random matrices, are independent of the specific application and of the classes used to train the dichotomizers; they therefore ignore the potential information in the original classes and limit the achievable classification accuracy. Dichotomies-based codes seek an optimal code matrix for a given set of binary classifiers, a problem proven NP-complete by Crammer and Singer [12].

However, data-dependent codes can exploit the class separability among samples to enhance overall classification performance, and have therefore drawn special attention. Classic data-driven ECOC variants include Discriminant ECOC [13], Subclass ECOC [14], Linear Discriminant ECOC [15], and Hierarchical ECOC [16]. As Wang et al. [17] note: “ECOC brings a simple and common implementation of multi-class classification, but simultaneously, results in the under-exploitation of already-provided structure knowledge in individual original classes”. In their work, they show that utilizing such prior structure knowledge improves the performance of ECOC. Escalera [18] proposed re-coding ECOC without re-training. Zhou et al. [19] used the Fisher criterion to construct an ECOC coding matrix from the confusion matrix; experiments on UCI datasets showed that the approach could offset the base classifiers’ errors. To overcome small sample sizes and overlaps among classes, Zhong [20] proposed a Self-Adjusting ECOC technique that generates diverse code matrices based on different feature subsets in terms of data complexity measures. Rocha et al. [21] took the correlation and joint probability of the base binary learners into consideration when using ECOC-based approaches. Sun et al. [22] presented a new ECOC algorithm to Enhance Class Separability, named ECOC-ECS, which obtains the optimal class split based on data complexity during the encoding process. Zhao et al. [9] used deep neural networks as base classifiers, which exhibit superior performance in spite of long computation time. Marie et al. [23] proposed efficient modeling of the coding and decoding processes to best exploit the data information and enhance performance.

In order to form data-dependent codes that promote classification performance, two problems of ECOC classification are addressed in this paper. The first is the non-competence problem [24]: the dichotomizers in ternary ECOCs exclude the prior information of the classes denoted by zero. Consequently, a classifier becomes non-competent when it classifies an instance whose real class does not belong to its meta-subclasses. Our aim is to use knowledge of the problem domain to design better codes and reduce classification error. The other is the cost-sensitive classification problem. The common criterion for evaluating the performance of a base classifier is classification error: the smaller the error, the better the performance. However, when classification risk and class imbalance [25] must be considered, classification error becomes an inadequate criterion. In such cases, it is better and cheaper to reject an unlabeled sample than to misclassify it. Classic ECOC dichotomizers can only produce binary outputs and have no capability of rejection. It is therefore necessary to modify the dichotomizers to produce a third output and adapt them to cost-sensitive classification in terms of cost loss.

Taking the analysis above into consideration, a new variant of the ECOC algorithm called Reject-Option-based Re-encoding ECOC (ROECOC) is presented in this paper. ROECOC embeds a reject option into ECOC and finds the best threshold values for constructing data-driven codes, using an arbitrary ECOC matrix as initialization and the ROC curve as the optimization tool. The optimal base classifier with reject option, which produces a three-symbol output and classifies selectively, is then obtained and used to classify the corresponding classes denoted by zeros. Finally, the initial coding matrix is recoded according to the optimal classifier outputs and thresholds without retraining. A decoding strategy suitable for three-symbol-output classifiers is discussed as well. ROECOC not only uses prior class information to construct a data-driven matrix but is also applicable to cost-sensitive classification.

The paper is organized as follows: Section 2 discusses the idea behind ROECOC. Section 3 details the proposed re-encoding ECOC and addresses the potential problems. Section 4 presents the experiments and results, and Section 5 concludes the paper.

2 Re-encoding idea

As a more universal algorithm, the ternary ECOC [26] method nearly unifies all decomposing frameworks for multi-class classification. The introduction of a third output value, zero, makes the construction of the encoding matrix more flexible and diverse, which has greatly promoted the development of ECOC classification. However, some problems remain.

On the one hand, the dichotomizers may provide useful decision information for the ignored classes. Consider a five-class dataset with a mixed Gaussian distribution as shown in Fig. 2. The decision boundary of classifier h15, trained with C1 and C5, can classify C2, C3, and C4 correctly for most samples. On the other hand, the dichotomizers are non-competent for the ignored classes. Consider the matrix M4 × 7 shown in Fig. 3. The testing instance x might be classified as C1 according to Hamming and Euclidean distance decoding. However, from another perspective, x belongs to class C2, because the predicted class label of x would be in accordance with C2 as long as C2 was not ignored during base classifier training. If C2 is ignored, the classifiers’ decision boundaries cannot make the correct classification. This is the disturbance introduced by zero, which injects classification error into the decoding process. When there are zeros in the coding matrix, the dichotomizers lack the distribution information of the corresponding classes and fail to make the right decision.

Fig. 2
figure 2

The distribution of five-class dataset

Fig. 3
figure 3

M4 × 7 coding matrix

The non-competence problem has been discussed by Escalera [18] who proposed a re-encoding approach without retraining and used the classification accuracy as the threshold to recode the zeros. However, how to determine the thresholds is still an open issue.

In view of the above, the case where the dichotomizers classify the ignored classes correctly is easy to handle. If they cannot, it is better for their outputs to remain zero. Therefore, the output of the dichotomizers must be expanded from {1, −1} to {1, 0, −1}, where zero means rejection. A binary output has no capability of rejection. The most common remedy is to construct a reject option (t1, t2), t2 > t1, that produces a third output and classifies selectively. The question is how to form the reject option.

As an efficient means of classifier performance assessment, the ROC is clear, intuitive, and independent of the prior distribution, the base classifiers, and the cost matrix, which makes it a powerful tool for constructing the reject option. Tortorella [27] presented a 2D ROC-based rejection mechanism, but the experiments show that the decision thresholds coincide and become uninformative when the datasets are small. Bernard et al. [28] found that it is more efficient to exploit the ROC space for learning a pool of classifiers than a single classifier. Zhao [29] modeled the loss functions of the cost-sensitive problem with a rejection option and obtained the thresholds via the tangent of the ROC curve. In binary classification, the ROC curve has been shown to be very powerful for designing cost-sensitive classifiers, but it has been poorly exploited for multi-class classification. Since an ECOC base classifier cannot produce a precise posterior probability, t1 + t2 ≠ 1 and the reject option differs for each dichotomizer. Formulating the three-symbol decision and modeling the minimal cost-loss objective function and its constraints is the prerequisite for obtaining a ROC-based reject option for multi-class classification. To this end, we first construct the reject cost matrix and find a formulation that expresses the three-symbol output while using the cost loss as the objective function. Finally, the threshold values are calculated with the help of the ROC convex hull.

In conclusion, ROECOC applies ROC analysis to design the reject option, extending the two-symbol output into a three-symbol output. The ignored classes are then classified by the corresponding optimal classifiers as 1, −1, or 0 respectively. Figure 4 shows the framework of the re-encoding ROECOC based on ROC, where S_train and S_V represent the training and validation subsets respectively. By doing so, a new data-driven matrix including more class information is obtained. It is worth noting that the re-encoding process remains in the training phase, avoiding a second training step and reducing complexity. The three-symbol-output classifiers can be applied to cost-sensitive classification.

Fig. 4
figure 4

The framework of the re-encoding ROECOC

3 Reject option-based ECOC by using ROC

This section focuses on ROECOC and explains the determination of the reject option (t1, t2) based on the cost-sensitive model and ROC curve [30].

A confusion matrix (Fig. 5) can be acquired after training on a bipartition. The true positive rate and false positive rate are calculated as \( tp=\frac{TP}{TP+ FN}=\frac{TP}{P}, fp=\frac{FP}{FP+ TN}=\frac{FP}{N} \).
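These two rates can be computed directly from the four confusion-matrix counts (a trivial helper, shown only to fix the notation used below):

```python
def roc_point(TP, FN, FP, TN):
    """True-positive and false-positive rates from the confusion matrix:
    tp = TP / P with P = TP + FN, and fp = FP / N with N = FP + TN."""
    tp = TP / (TP + FN)
    fp = FP / (FP + TN)
    return tp, fp
```

Each threshold setting of a dichotomizer yields one such (fp, tp) point; sweeping the threshold traces the ROC curve used in Section 3.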

Fig. 5
figure 5

Confusion matrix

To realize classification with rejection, the decision rule is defined as:

$$ {w}_i\left(x,{t}_1^i,{t}_2^i\right)=\left\{\begin{array}{ll}1& if\;{f}_i(x)>{t}_2^i\\ {}-1& if\;{f}_i(x)<{t}_1^i\\ {}\delta & otherwise\end{array}\right. $$
(1)

we define a new cost matrix \( {\mathbf{C}}_r=\begin{array}{c|ccc}& 1& -1& \delta \\ \hline 1& 0& {c}_{12}& {c}_{13}\\ -1& {c}_{21}& 0& {c}_{23}\end{array} \), where rows index the true label, columns index the decision (δ denotes rejection), c11 = c22 = 0, c12 > c13 > 0, and c21 > c23 > 0.
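Eq. (1) can be sketched in code as follows (hypothetical helper; `score` stands for the real-valued output fi(x) of dichotomizer i, and the reject symbol δ is coded 0):

```python
def reject_decision(score, t1, t2):
    """Three-symbol decision rule of eq. (1): +1 above t2, -1 below t1,
    and the reject symbol delta (coded 0 here) inside the interval (t1, t2)."""
    assert t2 > t1, "a reject option requires t2 > t1"
    if score > t2:
        return 1
    if score < t1:
        return -1
    return 0  # reject: incurs cost c13 or c23 depending on the true label
```

Widening the interval (t1, t2) rejects more samples, trading misclassification cost c12/c21 for the smaller rejection cost c13/c23.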

It is hard to acquire the reject option thresholds directly in practice. The classic binary output of classifiers can be described as

$$ {w}_i\left(x,r\right)=\left\{\begin{array}{ll}1& if\;{f}_i(x)>r\\ {}-1& if\;{f}_i(x)<r\end{array}\right. $$
(2)

where r is the cutoff value. Eq. (1) can be seen as a voting rule over eq. (2) under certain constraints:

$$ {w}_i\left(x,{r}_1,{r}_2\right)=\left\{\begin{array}{ll}1& if\;{f}_{t_1}(x)>{r}_1\wedge {f}_{t_2}(x)>{r}_2\\ {}-1& if\;{f}_{t_1}(x)<{r}_1\wedge {f}_{t_2}(x)<{r}_2\\ {}\delta & if\;\left({f}_{t_1}(x)<{r}_1\wedge {f}_{t_2}(x)>{r}_2\right)\vee \left({f}_{t_1}(x)>{r}_1\wedge {f}_{t_2}(x)<{r}_2\right)\end{array}\right. $$
$$ s.t.\kern0.5em \forall x\kern0.5em {w}_i\left(x,{r}_i\right)=1\Rightarrow {w}_j\left(x,{r}_j\right)=1\wedge {w}_j\left(x,{r}_j\right)=-1\Rightarrow {w}_i\left(x,{r}_i\right)=-1 $$
(3)

Therefore, the reject option can be constructed by finding a formulation that meets the constraints of eq. (3) while using the cost loss as the objective function. To this end, we use eq. (2) as the decision rule to obtain the ROC curve. The ROC convex hull (ROCCH) is acquired by fitting the ROC curve, drawn as the heavy line in Fig. 6. By the properties of the ROCCH, any two of its points satisfy the constraints of eq. (3), so the problem reduces to finding the two points p1, p2 that meet our need. Suppose the confusion matrices corresponding to the two points (base classifiers) are (TP1, FN1, FP1, TN1) and (TP2, FN2, FP2, TN2) respectively.

Fig. 6
figure 6

ROC, ROCCH and the optimal classifier fROC(fp)

The classification cost loss function can be defined as:

$$ {\displaystyle \begin{array}{l} EC=\frac{\left(F{P}_2-F{P}_1\right){c}_{23}+\left(F{N}_1-F{N}_2\right){c}_{13}+F{P}_1\cdot {c}_{21}+F{N}_2\cdot {c}_{12}}{TP+ FN+ FP+ TN}\\ {}\kern0.72em =\frac{\left(F{N}_1\cdot {c}_{13}+F{P}_1\cdot \left({c}_{21}-{c}_{23}\right)+F{N}_2\cdot \left({c}_{12}-{c}_{13}\right)+F{P}_2\cdot {c}_{23}\right)}{TP+ FN+ FP+ TN}\end{array}} $$
(4)

Note that

$$ {\displaystyle \begin{array}{l}P= FN+ TP\Rightarrow \\ {} FN=P- TP=P\left(1-\frac{TP}{P}\right)=P\left(1- tp\right)=P\left(1-{f}_{ROC}(fp)\right)\end{array}} $$

Then eq. (4) can be rewritten as:

$$ {\displaystyle \begin{array}{l} EC=P\left(1-{f}_{ROC}\left(\frac{F{P}_1}{N}\right)\right){c}_{13}+F{P}_1\left({c}_{21}-{c}_{23}\right)\\ {}\kern0.84em +P\left(1-{f}_{ROC}\left(\frac{F{P}_2}{N}\right)\right)\left({c}_{12}-{c}_{13}\right)+F{P}_2\cdot {c}_{23}\end{array}} $$
(5)

Taking the partial derivatives of eq. (5) with respect to FP1 and FP2 and setting them to zero, we obtain the final results:

$$ {\displaystyle \begin{array}{l}{f}_{ROC}^{\prime}\left(f{p_1}^{\ast}\right)=\frac{c_{21}-{c}_{23}}{c_{13}}\frac{N}{P}\\ {}{f}_{ROC}^{\prime}\left(f{p_2}^{\ast}\right)=\frac{c_{23}}{c_{12}-{c}_{13}}\frac{N}{P}\end{array}} $$
(6)

fp1∗ and fp2∗ are the two points we seek; they are used as the threshold values of the reject option (t1, t2).
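Assuming the ROCCH is available as a list of vertices sorted by fp, the two points of eq. (6) can be located by sweeping an iso-cost line of the prescribed slope, i.e. maximizing tp − slope·fp over the hull vertices. This sketch uses made-up hull values together with the cost matrix of Section 4 (c12 = 6, c13 = 1, c21 = 3, c23 = 1) and N/P = 1:

```python
import numpy as np

def rocch_point_for_slope(fp, tp, slope):
    """Vertex of the ROC convex hull first touched by an iso-cost line
    of the given slope: the vertex maximizing tp - slope * fp."""
    return int(np.argmax(np.asarray(tp) - slope * np.asarray(fp)))

# Hypothetical ROCCH vertices, sorted by fp:
fp = [0.0, 0.1, 0.3, 1.0]
tp = [0.0, 0.6, 0.9, 1.0]
t1 = fp[rocch_point_for_slope(fp, tp, (3 - 1) / 1 * 1.0)]  # slope of eq. (6) for fp1*
t2 = fp[rocch_point_for_slope(fp, tp, 1 / (6 - 1) * 1.0)]  # slope of eq. (6) for fp2*
# The steeper slope selects the smaller fp, so t1 <= t2 as required.
```

Since the hull slope is non-increasing in fp, the larger slope of eq. (6) always lands at or left of the smaller one, yielding a valid ordered pair (t1, t2).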

After embedding a reject option in each classifier, the outputs become three-symbol. To decode these outputs, the classic Hamming decoding strategy is extended as follows, according to the decoding process of the ternary ECOC [31]:

  1. Step 1:

    Replace all the rejected code words (denoted by zero) with “−1”, and use the classic Hamming distance to find the nearest class code c−1;

  2. Step 2:

    Replace all the rejected code words (denoted by zero) with “1”, and use the classic Hamming distance to find the nearest class code c1;

  3. Step 3:

    Compare the distances over the non-rejected code words between c−1 and c1; the nearer class code is the final decision.
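The three steps above can be sketched as follows (a minimal version, assuming classes are indexed by the rows of M, the reject symbol is coded 0, and ties in Step 3 go to c−1):

```python
import numpy as np

def modified_hamming_decode(output, M):
    """Extended Hamming decoding for three-symbol outputs.
    output: vector in {1, 0, -1}; M: c x l coding matrix (rows = classes)."""
    out = np.asarray(output)
    rejected = out == 0
    # Step 1: fill rejected positions with -1, find the nearest class code
    o_neg = np.where(rejected, -1, out)
    c_neg = int(np.argmin([np.sum(row != o_neg) for row in M]))
    # Step 2: fill rejected positions with +1, find the nearest class code
    o_pos = np.where(rejected, 1, out)
    c_pos = int(np.argmin([np.sum(row != o_pos) for row in M]))
    # Step 3: compare distances on the non-rejected positions only
    d_neg = np.sum(M[c_neg][~rejected] != out[~rejected])
    d_pos = np.sum(M[c_pos][~rejected] != out[~rejected])
    return c_neg if d_neg <= d_pos else c_pos
```

Because Step 3 restricts the comparison to non-rejected positions, rejected dichotomizers neither penalize nor favor any class, which is exactly the point of the extension.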

After settling the construction of the reject option and the decoding strategy, the full ROECOC procedure is presented in Table 1.

Table 1 The algorithm of the re-encoding ROECOC using the reject option

The reject option is trained on the original binary splits, which ensures the independence and diversity of the dichotomizers. Meanwhile, the outputs of ROECOC base classifiers provide prior knowledge of the ignored classes. It is worth noting that zeros denoting rejection may still appear in the new matrix, which keeps the output flexible. However, when a classifier is non-competent for an instance to be classified, the output should be zero, which exactly matches the corresponding code in the original matrix.

4 Experiments

4.1 Experimental data and design

To validate the performance of our proposal, we use two kinds of datasets: 16 multi-class datasets from the University of California at Irvine (UCI) repository [32, 33] and a high resolution range profile (HRRP) dataset of three airplanes: B-52, Farmer, and Fishbed. Table 2 provides the characteristics of the UCI datasets [19]. Principal component analysis (PCA) is used to reduce the dimensionality and speed up classification. The HRRP dataset was acquired with scale models in a microwave anechoic chamber and covers the range 0°-155°. There are 322 location data for B-52, 311 for Farmer, and 451 for Fishbed. Each data sample is described by 64 attributes, namely range cells.

Table 2 Characteristics of the UCI datasets (Features: C-continuous, B-binary, and N-nominal)

4.2 Experimental design

The experiments chose six coding matrices with different freedom degrees (the distribution density of zeros): one-versus-all, dense random, Discriminant ECOC, sparse random, SA-ECOC [20], and one-versus-one coding. For ternary ECOC, we also add ReECOC [18] to compare against ROECOC. The random matrices were selected from a set of 2000 randomly generated matrices with P(−1) = P(+1) = 0.5 for the dense random matrix and P(−1) = P(+1) = P(0) = 1/3 for the sparse random matrix [19]. In all algorithms, the parameters are the predefined or default values given by the authors. LOGLC and SVM with a polynomial kernel \( K\left(x,{x}_i\right)={\left(\left\langle x,{x}_i\right\rangle +1\right)}^q \) are used as base classifiers. The regularization parameter C and the kernel parameter q are selected by K-fold cross-validation (K = 5); the allowed range for q is 1–10, and the initial value of C is 2. The cost matrix is set manually as \( {\mathbf{C}}_r=\left[\begin{array}{ccc}0& 6& 1\\ {}3& 0& 1\end{array}\right] \); it is the same for each base classifier and does not affect the feasibility of the experiment.
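The sparse random generation described above can be sketched as follows (a minimal version with a hypothetical helper name; the experiments additionally select the best matrix among 2000 candidates):

```python
import numpy as np

def sparse_random_matrix(n_classes, n_cols, rng):
    """Sparse random ternary ECOC matrix with P(-1) = P(+1) = P(0) = 1/3.
    Columns lacking both a +1 and a -1 would define untrainable
    dichotomizers, so they are redrawn."""
    M = np.empty((n_classes, n_cols), dtype=int)
    for j in range(n_cols):
        while True:
            col = rng.choice([-1, 0, 1], size=n_classes)
            if (col == 1).any() and (col == -1).any():
                M[:, j] = col
                break
    return M

M = sparse_random_matrix(6, 20, np.random.default_rng(0))
```

A dense variant is the same loop with `rng.choice([-1, 1], ...)`, i.e. P(−1) = P(+1) = 0.5 and no zeros.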

Furthermore, ROECOC is used for target recognition with the HRRP dataset of three different planes. We pick four angle ranges to evaluate the performance in practice (0°-20°, 20°-40°, 130°-150°, 0°-150°). To simplify the experiments, SVM was used for the base classifiers with the same parameters as mentioned before. The decoding strategy is the modified Hamming distance strategy. Finally, we discuss the influence of the freedom degree on cost-loss classification.

To evaluate the different results, the experiments perform stratified ten-fold cross validation and test at a 95% confidence level with a two-tailed t-test when the number of samples is larger than 500 [24, 34]. The test statistic is given as follows:

$$ \frac{\mid \overline{x}-\mu \mid }{\sigma /\sqrt{n}}\ge {t}_{0.025}\left(n-1\right) $$
(7)

where μ and σ denote the mean and standard deviation respectively, and t0.025(9) = 2.2622.
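One common way to apply eq. (7) when comparing two methods is a paired t-test on the ten fold-wise accuracy differences; a sketch (hypothetical helper name, with the critical value t0.025(9) = 2.2622 from the text):

```python
import math

def significant_at_95(accs_a, accs_b):
    """Paired two-tailed t-test on per-fold accuracy differences, per eq. (7):
    |mean| / (std / sqrt(n)) >= t_{0.025}(n - 1), with t_{0.025}(9) = 2.2622."""
    d = [a - b for a, b in zip(accs_a, accs_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    if var == 0:
        return mean != 0  # identical folds: significant iff any difference
    t_stat = abs(mean) / math.sqrt(var / n)
    return t_stat >= 2.2622  # critical value assumes n = 10 folds
```

The same routine underlies the boldface markings in the result tables: a difference counts only when the statistic clears the critical value.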

4.3 Experimental results and analysis

4.3.1 UCI dataset

Tables 3 and 4 show the classification accuracy of the different encoding matrices with different base classifiers and the modified Hamming decoding. The best performance per dataset is highlighted in boldface. The results for the remaining datasets are listed in Tables 5 and 6 in the Appendix. From the tables, we can see that the classification accuracy obtained by ROECOC outperforms the corresponding state-of-the-art matrices and re-coding methods most of the time. This illustrates that ROECOC with its reject option performs much better than classic ECOCs without one. It is worth noting that the classification results show no distinct difference when the one-versus-one code is used as the initial matrix. The most likely reason is that each binary split of the one-versus-one matrix contains only one class per side, so the class imbalance problem has little impact on classification. As noted in the introduction, one advantage of ROECOC is overcoming the class imbalance problem, which is most easily observed with the one-versus-all code as the initial matrix. In general, ROECOC with the reject option performs better and avoids making direct decisions on samples that are easily misclassified or carry high misclassification risk, as can be seen in comparison with the other ECOC matrices.

Table 3 Accuracy rates and confidence interval at 95% for ECOC matrices using SVM(%)
Table 4 Accuracy rates and confidence interval at 95% for ECOC matrices using LOGLC(%)

4.3.2 HRRP dataset

Figure 7 shows the classification cost over four angle ranges of the HRRP dataset for classic ECOCs and ROECOC. The freedom degree of the matrices along the x-axis increases from left to right. According to the results, the classification cost of ROECOC is in general much lower than that of classic ECOCs. We can also see that as the freedom degree increases, the classification performance of classic ECOCs and ROECOC gradually converge; when the freedom degree increases beyond a certain point, the advantage diminishes. In particular, when the one-versus-one code is used as the initial matrix, the classification performance shows no big difference, in accordance with the results on the UCI datasets. The most likely reason is that as the freedom degree increases, the degree of class imbalance decreases: the more balanced the data distribution, the fewer samples are rejected. The freedom degree is therefore worth considering when classifying with ROECOC (Tables 5 and 6).

Fig. 7
figure 7

Normalized classification cost of four different angle ranges of HRRP datasets

It is also worth noting that the classification cost is lower when the angle range is small, such as 0°-20°, 20°-40°, and 130°-150°, because HRRP-based recognition is orientation sensitive. Taking the results and analysis above into consideration, the proposed ROECOC can improve classification performance while reducing classification risk in practice.

5 Conclusions

Multi-class classification aiming at reducing classification cost is widely needed in practice. However, classic ECOC classification still takes the error rate as its evaluation criterion, which is not suitable for cost-loss classification; meanwhile, its dichotomizers can only produce binary outputs and have no capability for rejection. To reduce classification cost and construct a data-driven matrix, a new reject-option-based ECOC is proposed. ROECOC keeps the framework of ECOC classification and introduces a reject option for each base classifier. The reject option, based on the ROC curve, is constructed by minimizing the cost-loss function model with the help of the cost matrix and the ROCCH. Dichotomizers with a reject option produce three-symbol outputs and classify selectively, which provides more information about the class distribution. Given any initial matrix, ROECOC can produce a data-driven and competent matrix for cost-sensitive classification. The experimental results illustrate that ROECOC reduces the classification cost and enhances performance, especially when the freedom degree of the initial matrix is low.