Abstract
A common way of solving a multi-class classification problem is to decompose it into a collection of simpler two-class problems. One major disadvantage of such a binary decomposition scheme is that it may be difficult to represent subtle between-class differences in many-class classification problems due to the limited choices of binary-value partitions. To overcome this challenge, we propose a new decomposition method called N-ary decomposition that decomposes the original multi-class problem into a set of simpler multi-class subproblems. We theoretically show that the proposed N-ary decomposition can be unified into the framework of error-correcting output codes, and we give the generalization error bound of an N-ary decomposition for multi-class classification. Extensive experimental results demonstrate the state-of-the-art performance of our approach.
Introduction
Many real-world problems are multi-class in nature, and many approaches have been proposed to handle them. One research direction focuses on solving multi-class problems directly. These approaches include decision-tree based methods (Quinlan 1986; Beygelzimer et al. 2009; Su and Zhang 2006; Bengio et al. 2010; Yang and Tsang 2011; Gao and Koller 2011; Deng et al. 2011). In particular, decision-tree based algorithms label each leaf of the decision tree with one of the \(N_C\) classes, and internal nodes can be selected to discriminate between these classes. The performance of decision-tree based algorithms heavily depends on the internal tree structure; thus, these methods are usually vulnerable to outliers. To achieve better generalization, Yang and Tsang (2011) and Gao and Koller (2011) propose to learn the decision tree structure based on the large-margin criterion. However, these algorithms usually involve solving sophisticated optimization problems, and their training time increases dramatically with the number of classes. In contrast to these complicated methods, K-Nearest Neighbour (KNN) (Cover and Hart 1967) is a simple but effective and stable approach to handling multi-class problems. However, KNN is sensitive to noisy features and can therefore suffer from the curse of dimensionality. Meanwhile, Crammer and Singer (2002) and Jenssen et al. (2012) propose a direct approach for learning multi-class support vector machines (MSVM) by deriving generalized notions of margins and separating hyperplanes.
Another research direction focuses on the framework of ensembles of binary classifiers. It first decomposes a multi-class problem into multiple binary problems, so that well-studied binary classification algorithms can be reused for their simplicity and efficiency. To obtain base classifiers, different decomposition strategies can be found in the literature (Galar et al. 2011; Rocha and Goldenstein 2014). The most common strategies are binary decomposition, such as “one-vs-all” (OVA) (Knerr et al. 1990), and ternary decomposition, such as “one-vs-one” (OVO) (Galar et al. 2011). Beyond these, more general decomposition strategies have been developed; for instance, error-correcting output codes (ECOC) approaches (Dietterich and Bakiri 1991; Liu et al. 2013; Rocha and Goldenstein 2014; Zhao and Xing 2013; Übeyli 2007; García-Pedrajas and Ortiz-Boyer 2011) have been proposed in recent years to design codes that enable a good partition scheme. One of the most popular ways is to use a random dense/sparse binary matrix to represent the class assignment in each subproblem; such codes are able to correct errors committed by some base classifiers through the final aggregation of results. As for the aggregation strategy, a number of methods have been developed to combine the outputs of the base classifiers, such as probability estimates (Wu et al. 2004), binary-tree based strategies (Fei and Liu 2006), and dynamic classification schemes (Hong et al. 2008).
Though all the above-mentioned approaches endeavor to enhance the partition strategy for classification tasks, their designs of base learners are confined to binary classification. In more challenging real-world applications, there exist multi-class problems where some of the classes are very similar and difficult to differentiate from each other. The existing binary partition schemes cannot handle this challenge due to their limited choices: it is highly possible that some classes are assigned the same or similar codes.
To address this issue, we investigate whether one can extend the existing binary decomposition scheme to an N-ary decomposition scheme that (i) allows users the flexibility to choose N when constructing the subclasses in order to (ii) improve the classification performance for a given dataset. The main idea of our scheme is to decompose the original multi-class problem into a series of smaller multi-class subproblems instead of binary classification problems. To make this clearer, we first define a meta-class as follows.
Definition 1
A meta-class is defined as a subset of classes such that the original classes are partitioned into different meta-classes.
Figure 1 illustrates the meta-class concept. Suppose that all original classes (i.e., black, red, blue and green) can be merged into a series of larger meta-classes (i.e., black, red, blue). Then, at each level, a classifier divides the data into N smaller meta-classes (\(N=3\) in Fig. 1d). Here we only consider two levels. Based on the definition of meta-classes, this N-ary decomposition scheme can be viewed as a divide-and-conquer method.
More interestingly, we revisit N-ary decomposition for multi-class classification from the perspective of ECOC. Each partition scheme corresponds to a specific coding matrix M. The performance of ensemble-based methods relies on the minimum distance, \(\varDelta _{\min }(M)\), between any distinct pair of rows in the coding matrix M: a larger \(\varDelta _{\min }(M)\) is more likely to rectify the errors committed by individual base classifiers (Allwein et al. 2001). We further theoretically investigate the impact of the error-correcting capability of different decomposition strategies for multi-class classification and show that N-ary decomposition has more advantages in correcting errors than binary decomposition strategies.
The main contributions of this paper are as follows.

We propose a novel N-ary decomposition scheme that achieves a large expected distance between any pair of rows in M at a reasonable \(N (> 3)\) for a multi-class problem (see Sect. 3). The two main advantages of such a decomposition scheme are (i) the ability to construct more discriminative codes and (ii) the flexibility for the user to select the best N for ensemble-based multi-class classification. In light of this approach, class binarization techniques can be considered special cases of N-ary decomposition.

We provide theoretical insights into the dependency of the generalization error bound of N-ary decomposition on the average base classifier generalization error and the minimum distance between any two constructed codes (see Sect. 5). Furthermore, we conduct a series of empirical analyses to verify the validity of the theorem on the error bound (see Sect. 6).

We show empirically that the optimal N (based on classification performance) lies in [3, 10] with a slight trade-off in computational cost (see Sect. 6).

We show empirically the superiority of the proposed decomposition scheme over state-of-the-art coding methods for multi-class prediction tasks on a set of benchmark datasets (see Sect. 6).
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the generalization from the binary to the N-ary decomposition scheme. In Sect. 4, we give the complexity analysis of N-ary decomposition and compare it with other schemes using the SVM classifier as a showcase. Section 5 gives the error bound analysis of N-ary decomposition. Finally, Sect. 6 discusses our empirical studies and Sect. 7 concludes this work.
Related work
Many decomposition strategies for multi-class classification (Allwein et al. 2001) have been proposed in recent years to design a good coding matrix M for partition assignment, with \(M_{ij}\in \{1,-1,0\}\), where \(1/-1\) denotes the assigned positive/negative class and 0 denotes an unselected class. Most of them fall into two categories: problem-dependent decomposition and problem-independent decomposition (Rocha and Goldenstein 2014; Zhong and Cheriet 2013). Our proposed random decomposition belongs to the problem-independent category, so here we mainly survey problem-independent decomposition.
Problem-independent decomposition designs a good coding matrix for partition assignment independently of the data; examples include OVO, OVA, and random binary/ternary decomposition (Dietterich 2000). However, the coding matrix design is not optimized for the training dataset or the instance labels. Therefore, these approaches usually require a large number of base classifiers generated by the pre-designed coding matrix. For example, the random binary decomposition approach constructs a coding matrix \(M \in \{-1,1\}^{N_C\times N_L}\), where \(N_C\) is the number of classes and \(N_L\) is the code length, and its elements are randomly chosen as either 1 (positive class) or \(-1\) (negative class) (Dietterich and Bakiri 1995). Allwein et al. (2001) extend this binary decomposition scheme to ternary decomposition by using a coding matrix \(M \in \{-1,0,1\}^{N_C\times N_L}\), where the classes corresponding to 0 are not considered in the learning process. However, a random binary decomposition approach cannot guarantee that the created base binary classification tasks are always well designed and easily trained. Therefore, Allwein et al. (2001) suggest that binary and ternary decomposition approaches require at least \(10\log _2(N_C)\) and \(15\log _2(N_C)\) base classifiers, respectively, to achieve optimal results.
Problem-independent decomposition yields an ensemble of binary classifiers, which has the following advantages: (1) it is easy to use in practice without any ad-hoc design of the coding matrix; (2) it can be parallelized due to the independence of the base tasks; (3) it enjoys a good theoretical guarantee on classification performance.
Due to the favorable properties and promising performance of problem-independent approaches for the classification task, they have been applied to real-world classification applications such as face verification (Kittler et al. 2003) and ECG beat classification (Übeyli 2007), and even beyond multi-class problems, such as feature extraction (Zhong and Liu 2013) and fast similarity search (Yu et al. 2010).
Though all the above-mentioned variations of binary decomposition endeavor to enhance the error-correcting ability for the classification task, their designs are still based on aggregating base binary classification results, which lack some desirable properties available in their generalized form.
From binary to N-ary decomposition
In this section, we discuss the necessity and advantages of the N-ary decomposition scheme from the perspective of the coding matrix, and investigate the column correlations of the coding matrix and the separation between codewords of different classes.
Existing ensemble methods design the coding matrix with coding values constrained either to \(\{-1,1\}\) (binary) or \(\{-1,0,1\}\) (ternary), and train a number of different binary classifiers accordingly. Many studies show that, when there are sufficiently many classifiers, ensemble-based multi-class classification algorithms can reach stable and reasonable performance (Rocha and Goldenstein 2014; Dietterich and Bakiri 1995). Nevertheless, binary and ternary codes can generate at most \(2^{N_C}\) and \(3^{N_C}\) binary classifiers, respectively, where \(N_C\) denotes the number of classes. Moreover, due to the limited choices of coding values, existing codes tend to create correlated and redundant classifiers, making them less effective “voters”. Indeed, some studies show that binary and ternary codes usually require only \(10\log _2(N_C)\) and \(15\log _2(N_C)\) base classifiers, respectively, to achieve optimal results (Allwein et al. 2001). Furthermore, when the original multi-class problem is difficult, the existing coding schemes cannot handle it well. For example, as shown in Fig. 1b, a binary decomposition that creates binary codes like OVA may result in difficult base binary classification tasks. A ternary decomposition (see Fig. 1c) may cause cases where test data from the same class are assigned to different classes.
To address these issues, we extend the binary and ternary codes to N-ary codes. An example of an N-ary coding matrix representing seven classes is shown in Table 1. Unlike the existing binary decomposition methods, a row of the coding matrix M represents the code of each class, and the code consists of \(N_L\) numbers in \(\{1,\ldots ,N\}\), where \(N>3\); a column \(M_s\) of M represents the partition of classes into N groups to be considered. To be specific, the N-ary ECOC approach consists of four main steps:

1.
Generate an N-ary matrix M by uniformly random sampling from the range \(\{1,\ldots ,N\}\) (e.g., Table 1).

2.
For each of the \(N_L\) matrix columns, partition the original training data into N groups based on the new class assignments and build an N-class classifier.

3.
Given a test example \(\mathbf {x}_t\), use the \(N_L\) classifiers to output \(N_L\) predicted labels for the testing output code (e.g., \(f(\mathbf {x}_t) = [4, 3, 1, 2, 4, 2]\)).

4.
The final label prediction \(y_t\) for \(\mathbf {x}_t\) is the nearest class based on the minimum distance between the training and testing output codes (e.g., \(y_t = \arg \min _i d(f(\mathbf {x}_t),C_i) = 4\)).
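The four steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `base_fit` is a placeholder for any N-class learner, and a simple nearest class-mean classifier (`mean_fit`, our own toy choice) is assumed only to keep the sketch self-contained.

```python
import numpy as np

def nary_ecoc_fit(X, y, n_classes, N, N_L, base_fit, rng):
    """Steps 1-2: draw a random N-ary coding matrix and train N_L base classifiers."""
    M = rng.integers(1, N + 1, size=(n_classes, N_L))    # entries uniform over {1..N}
    models = [base_fit(X, M[y, s]) for s in range(N_L)]  # relabel y by column s
    return M, models

def nary_ecoc_predict(X, M, models):
    """Steps 3-4: collect the N_L predicted symbols, decode by minimum Hamming distance."""
    codes = np.stack([m(X) for m in models], axis=1)         # (n_samples, N_L)
    dist = (codes[:, None, :] != M[None, :, :]).sum(axis=2)  # distance to each class code
    return dist.argmin(axis=1)

def mean_fit(X, meta_y):
    """Toy base learner: nearest class-mean over the meta-classes."""
    labels = np.unique(meta_y)
    mus = np.stack([X[meta_y == c].mean(axis=0) for c in labels])
    return lambda Z: labels[((Z[:, None, :] - mus[None]) ** 2).sum(-1).argmin(1)]
```

On four well-separated Gaussian clusters, this pipeline with \(N=3\) and \(N_L=8\) recovers the original four classes almost perfectly, even though no single base classifier sees the original labels.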
Note that N-ary decomposition randomly breaks a large multi-class problem into a number of smaller multi-class subproblems. These subproblems are more complicated than binary problems and incur additional computational cost. Hence, there is a trade-off between error-correcting capability and computational cost.^{Footnote 1} Fortunately, our empirical studies indicate that N does not need to be large to achieve good classification performance.
Column correlations of coding matrix
The traditional binary decomposition strategy suggests using longer codes (i.e., larger \(N_L\)); however, additional binary base classifiers are then likely to be more correlated, so many of the base classifiers created by binary or ternary codes are not effective for the final multi-class classification. To illustrate the advantage of N-ary decomposition in creating uncorrelated class assignments for base classifications, we conduct an experiment to investigate the column correlations of the coding matrix M. The results are shown in Fig. 2. In the experiment, we set \(N_C = 20\), \(N = 5\), vary \(N_L\) in [10, 80], and use Pearson’s correlation coefficient (PCC), a normalized correlation measure that eliminates the scaling effect of the codes. From Fig. 2, we observe that N-ary decomposition achieves lower correlations between the columns of the coding matrix compared to conventional ternary codes. In particular, when the number of tasks is small, the correlations over the created tasks for binary decomposition are significantly higher than those of N-ary decomposition. Therefore, N-ary decomposition not only provides more flexibility in creating a coding matrix, but also generates codes that are less correlated and less redundant compared to traditional binary decomposition methods.
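A simple way to quantify such column correlations is the mean absolute Pearson correlation over all distinct column pairs of M; this is a sketch of the measurement, and the exact PCC-over-tasks computation behind Fig. 2 may differ in detail.

```python
import numpy as np

def avg_column_correlation(M):
    """Mean absolute Pearson correlation over all distinct column pairs of M.
    Columns must be non-constant, otherwise the correlation is undefined."""
    C = np.corrcoef(M.T)               # column-by-column correlation matrix
    iu = np.triu_indices_from(C, k=1)  # upper triangle, excluding the diagonal
    return float(np.abs(C[iu]).mean())

# e.g., coding matrices of the two schemes being compared (N_C = 20, N_L = 40)
rng = np.random.default_rng(0)
nary = rng.integers(1, 6, size=(20, 40))       # random 5-ary codes
binary = rng.choice([-1, 1], size=(20, 40))    # random binary codes
```

Applying this measure to the coding matrices of the compared schemes, over a range of \(N_L\), yields curves like those in Fig. 2.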
Separation between codewords of different classes
Apart from the column correlation, the row separation is another important measure of the error-correcting ability of the coding matrix M (Dietterich and Bakiri 1995; Allwein et al. 2001). The codes for different classes are expected to be as dissimilar as possible: if the codes (rows) for different classes are similar, errors are more likely to be committed. Thus, the capability of error correction relies on the minimum distance \(\varDelta _{\min }(M)\), or the expected distance \(\varDelta (M)\), between any distinct pair of rows in the coding matrix \(M \in \{1,2,\ldots ,N\}^{N_C\times N_L}\), where \(N_C\) is the number of classes and \(N_L\) is the code length. Both the absolute distance and the Hamming distance can serve as measures of row separation. The key difference between the two is that the Hamming distance is scale-invariant: it only counts the number of differing elements and ignores the magnitude of the differences.
Hamming Distance One can use the generalized Hamming distance to calculate \(\varDelta ^{Ham}(M)\) for the existing coding schemes; it is defined as follows.
Definition 1
(Generalized Hamming Distance) Let \(M(r_1,:)\) and \(M(r_2,:)\) denote the coding vectors of rows \(r_1\) and \(r_2\) in the coding matrix M with length \(N_L\), respectively. Then the generalized Hamming distance can be expressed as
\[ \varDelta ^{Ham}(M(r_1,:),M(r_2,:)) = \sum _{s=1}^{N_L} \delta \big (M(r_1,s),M(r_2,s)\big ), \]
where \(\delta (u,v) = 0\) if \(u = v\) with \(u,v \ne 0\); \(\delta (u,v) = 1\) if \(u \ne v\) with \(u,v \ne 0\); and \(\delta (u,v) = \frac{1}{2}\) if \(u = 0\) or \(v = 0\). For N-ary codes, whose entries are all nonzero, this reduces to counting the positions where the two rows differ.
For the OVA coding, every two rows have exactly two entries with opposite signs, so \( \varDelta _{min}^{Ham(OVA)}(M)=2\). For the OVO coding, every two rows have exactly one entry with opposite signs, so \(\varDelta _{min}^{Ham(OVO)}(M)=\left( \left( \begin{array}{c} N_C \\ 2 \end{array} \right) -1\right) /2+1\), where \(N_C\) is the number of classes. Moreover, for a random binary coding matrix with its entries uniformly chosen, the expected distance between any two different class codes is \(\varDelta ^{Ham(RAND)}(M) = N_L/2\), where \(N_L\) is the code length. A larger \(\varDelta ^{Ham(RAND)}(M)\) is more likely to rectify the errors committed by individual base classifiers. Therefore, when \(N_L\gg N_C\), a random coding matrix is expected to be more robust and to rectify more errors than the OVO and OVA approaches (Allwein et al. 2001). However, the choice of only binary or ternary codes hinders the construction of longer and more discriminative codes. For example, binary codes can only construct codes of length \(N_L \le 2^{N_C}\); moreover, they lead to many redundant base learners. In contrast, for an N-ary random matrix, the expected value of \(\varDelta ^{Ham(N)}(M)\) is \(N_L (1-\frac{1}{N})\) (see Lemma 1 for the proof). \(\varDelta ^{Ham(N)}(M)\) is expected to be larger than \(\varDelta ^{Ham(RAND)}(M)\) when \(N\ge 3\) (Table 2).
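These minimum distances for ternary codes can be checked numerically. The sketch below assumes the convention of Allwein et al. (2001), under which each column contributes \((1 - M(r_1,s)M(r_2,s))/2\), so a zero entry contributes 1/2; function names are ours.

```python
import itertools
import numpy as np

def gen_hamming_ternary(r1, r2):
    """Generalized Hamming distance for rows with entries in {-1, 0, 1}:
    0 per agreeing nonzero column, 1 per disagreeing one, 1/2 if either entry is 0."""
    return float(((1 - r1 * r2) / 2).sum())

def min_row_distance(M, dist):
    """Minimum distance over all distinct row pairs of M."""
    return min(dist(M[i], M[j])
               for i, j in itertools.combinations(range(len(M)), 2))

def ova_matrix(n_classes):
    return 2 * np.eye(n_classes) - 1          # +1 on the diagonal, -1 elsewhere

def ovo_matrix(n_classes):
    pairs = list(itertools.combinations(range(n_classes), 2))
    M = np.zeros((n_classes, len(pairs)))
    for s, (i, j) in enumerate(pairs):
        M[i, s], M[j, s] = 1, -1              # one column per class pair
    return M
```

For \(N_C = 5\), this gives 2 for OVA and \(\left( \binom{5}{2}-1\right) /2+1 = 5.5\) for OVO, matching the formulas above.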
Lemma 1
The expected Hamming distance between any two distinct rows in a random N-ary coding matrix \(M \in \{1,2, \ldots , N\}^{N_C \times N_L}\) is
\[ \varDelta ^{Ham(N)}(M) = N_L \left( 1-\frac{1}{N}\right) . \]
Proof
Given a random matrix M with components chosen uniformly over \(\{1,2, \ldots , N\}\), for any distinct pair of entries in column s, i.e., \(M(r_i, s)\) and \(M(r_j, s)\), the probability that \(M(r_i,s) = M(r_j,s)\) is \(\frac{1}{N}\). Then the probability that \(M(r_i,s) \ne M(r_j,s)\) is \(1 - \frac{1}{N}\).
Therefore, according to Definition 1, the expected Hamming distance for M can be computed as follows:
\[ \varDelta ^{Ham(N)}(M) = \sum _{s=1}^{N_L} \Pr \left[ M(r_i,s) \ne M(r_j,s)\right] = N_L \left( 1-\frac{1}{N}\right) . \]
\(\square \)
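Lemma 1 is easy to verify empirically; the following sketch compares a sampled estimate against \(N_L(1-\frac{1}{N})\) for one choice of parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
N, N_L, trials = 5, 40, 20000

# sample pairs of independent random N-ary codewords
r1 = rng.integers(1, N + 1, size=(trials, N_L))
r2 = rng.integers(1, N + 1, size=(trials, N_L))

empirical = (r1 != r2).sum(axis=1).mean()  # mean Hamming distance over the pairs
expected = N_L * (1 - 1 / N)               # Lemma 1: 40 * (1 - 1/5) = 32
```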
Absolute distance Unlike the Hamming distance, the absolute distance measures the scale of the differences. Thus, for a fair comparison, we assume that the coding values are on the same scale in the absolute distance analysis. The definition of the absolute distance is given as follows.
Definition 2
(Absolute distance) Let \(M(r_1,:)\) and \( M(r_2,:)\) denote the coding vectors of rows \(r_1\) and \( r_2\) in the coding matrix M with length \(N_L\), respectively. Then the absolute distance can be expressed as
\[ \varDelta ^{abs}(M(r_1,:),M(r_2,:)) = \sum _{s=1}^{N_L} \left| M(r_1,s) - M(r_2,s)\right| . \]
For the convenience of analysis, we first give the expected absolute distance for an N-ary coding matrix in Lemma 2.
Lemma 2
The expected absolute distance between any two distinct rows in a random N-ary coding matrix \(M\in \{1,2, \ldots , N\}^{N_C \times N_L}\) is
\[ \varDelta ^{abs(N)}(M) = \frac{(N^2-1) N_L}{3N}. \]
Proof
Given a random matrix M with components chosen uniformly over \(\{1,2, \ldots , N\}\), for any distinct pair of entries in column s, i.e., \(M(r_i, s)\) and \(M(r_j, s)\), we denote the corresponding expected absolute distance as \(\varDelta ^{abs(N)} (M(:,s)) = {\mathbb E}\,{d_{ij}} = {\mathbb E}\, \left| M(r_i,s) - M(r_j,s)\right| \).
It can be calculated by averaging all the possible pairwise distances \(d_{ij}\) for \(i,j \in \{1,2,\ldots , N\}\). Since the two entries are chosen uniformly at random from \(\{1,\ldots ,N\}\), \(\varDelta ^{abs(N)}(M(:,s))\) can be computed as follows.
Table 3 gives all the possible values of \(d_{ij}\). Thus, computing \(\varDelta ^{abs(N)}(M(:,s))\) amounts to taking the average of all the entries in Table 3:
\[ \varDelta ^{abs(N)}(M(:,s)) = \frac{1}{N^2}\sum _{i=1}^{N}\sum _{j=1}^{N} |i-j| = \frac{2}{N^2}\sum _{d=1}^{N-1} d\,(N-d) = \frac{N^2-1}{3N}, \]
where the second equality comes from the symmetry of \(d_{ij}\). Then
\[ \varDelta ^{abs(N)}(M) = \sum _{s=1}^{N_L} \varDelta ^{abs(N)}(M(:,s)) = \frac{(N^2-1)N_L}{3N}. \]
\(\square \)
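The closed form of Lemma 2 can be checked by exact enumeration of \(\mathbb {E}\,|i-j|\) over all pairs, with no sampling involved:

```python
from itertools import product

def expected_abs_distance(N, N_L):
    """Exact E|i - j| for i, j uniform over {1, ..., N}, scaled by the code length."""
    per_column = sum(abs(i - j) for i, j in product(range(1, N + 1), repeat=2)) / N ** 2
    return N_L * per_column

# Lemma 2 closed form for comparison: N_L * (N**2 - 1) / (3 * N)
```

For instance, `expected_abs_distance(5, 40)` gives \(40 \cdot (25-1)/15 = 64\), in agreement with the lemma.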
For the OVA coding scheme, every two rows have exactly two entries with opposite signs, so the minimum absolute distance is \(\varDelta _{min}^{abs(OVA)}(M)=4\); for the OVO coding scheme, every two rows have exactly one entry with opposite signs and \(2N_C-4\) entries with a difference of exactly one, so \(\varDelta _{min}^{abs(OVO)}(M)=2N_C-2\). For binary random codes, the expected absolute distance between any two different rows is \(\varDelta ^{abs(RAND)}(M)=N_L\). Thus, when N is large, \(\varDelta ^{abs(N)}(M)\) is much larger than \(\varDelta ^{abs(RAND)}(M)\), and N-ary coding is expected to perform better.
The Hamming and absolute distance comparisons for the different codes are summarized in Table 2. We can see that the N-ary coding scheme has an advantage in creating more discriminative codes, with larger distances between different classes under both distance measures. This advantage is central to the generalization error analysis of N-ary ECOC.
Complexity comparison
As discussed in Sect. 3, N-ary codes have a better error-correcting capability than the traditional random codes when N is larger than 3. However, note that the base classifier of each column no longer solves a binary problem. Instead, we randomly break a large multi-class problem into a number of smaller multi-class subproblems. These subproblems are more complicated than binary problems and incur additional computational cost. Hence, there is a trade-off between the error-correcting capability and the computational cost.
If the complexity of the algorithm employed to learn a small multi-class base problem with N classes, \(N_{tr}\) training examples, and d predictive features is \(\mathcal {O}(g(N,N_{tr},d))\), where \(g(N,N_{tr},d)\) is the complexity function w.r.t. N, \(N_{tr}\), and d, then the computational complexity of N-ary codes is \(\mathcal {O}(N_L\, g(N,N_{tr},d))\) for codes of length \(N_L\).
Taking SVM as the base learner, for example, one can learn each binary classification task created by a binary coding matrix with a training complexity of \(\mathcal {O}(N_{tr}^3)\) for traditional SVM solvers that build on quadratic programming (QP). However, a major stumbling block for these traditional methods is scaling up the QPs to large data sets, such as those commonly encountered in data mining applications. Thus, some state-of-the-art SVM implementations, e.g., LIBSVM (Chang and Lin 2011) and Core Vector Machines (Tsang et al. 2005), have been proposed to reduce the training time complexity from \(\mathcal {O}(N_{tr}^3)\) to \(\mathcal {O}(N_{tr}^2)\) and \(\mathcal {O}(N_{tr})\), respectively. Nevertheless, how to efficiently train SVMs is not the focus of this paper; for the convenience of the complexity analysis, we use the time complexity of the traditional SVM solvers as the complexity of the base learners. Then, the complexity of binary codes is \(\mathcal {O}(N_L N_{tr}^3)\). Differently from existing decomposition methods, one can also address the multi-class problem directly in one single optimization process, e.g., with multi-class SVM (Crammer and Singer 2002). This kind of model combines multiple binary-class optimization problems into one single objective function and simultaneously achieves the classification of multiple classes, so the correlations across the multiple binary classification tasks are captured in the learning model. The resulting QP optimization requires a complexity of \(\mathcal {O}((N_C N_{tr})^3)\), which is computationally expensive for a relatively large number of classes. In contrast, N-ary codes have a complexity of \(\mathcal {O}(N_L (NN_{tr})^3)\), where \(N < N_C\). In this case, they achieve a better trade-off between the error-correcting capability and the computational cost, especially for a large class size \(N_C\).
We summarize the time complexities of the different codes in Table 4. In Sect. 6.1.4, our empirical studies indicate that N does not need to be large to achieve optimal classification performance.
Generalization analysis of N-ary decomposition for multi-class classification
In Sect. 5.1, we study the error-correcting ability of an N-ary decomposition. In Sect. 5.2, we derive the generalization error bound for N-ary decomposition, independent of the base classifier.
Analysis of error correction in N-ary decomposition
To study the error-correcting ability of N-ary decomposition, we first define the distance between the codes in any distinct pair of rows, \(M(r_i)\) and \(M(r_j)\), in an N-ary coding matrix M as \(\varDelta ^N(M(r_i),M(r_j))\). It is the sum of the \(N_L\) distances between the two entries \(M(r_i, s)\) and \(M(r_j, s)\) in the same column s of the two different rows \(r_i\) and \(r_j\), i.e., \( \varDelta ^N(M(r_i),M(r_j)) = \sum _{s=1}^{N_L} \varDelta ^N(M(r_i,s),M(r_j,s)). \)
We further define \(\rho = \min _{r_i\ne r_j}{\varDelta ^N(M(r_i),M(r_j))}\) as the minimum distance between any two rows in M.
Proposition 1
Given an N-ary coding matrix M and a vector of predicted labels \(f(\mathbf {x})=[f_1(\mathbf {x}),\ldots ,f_{N_L}(\mathbf {x})]\) output by the \(N_L\) base classifiers for a test instance \(\mathbf {x}\): if \(\mathbf {x}\) is misclassified by the N-ary ECOC decoding, then the distance between the code of the correct label, M(y), and \(f(\mathbf {x})\) is at least one half of \(\rho \), i.e.,
\[ \varDelta ^N(M(y),f(\mathbf {x})) \ge \frac{1}{2}\rho . \]
Proof
Suppose that the distance-based decoding incorrectly classifies a test instance \(\mathbf {x}\) with known label y. In other words, there exists a label \(r\ne y\) for which
\[ \varDelta ^N(M(r),f(\mathbf {x})) \le \varDelta ^N(M(y),f(\mathbf {x})). \]
Here, \(\varDelta ^N(M(y),f(\mathbf {x}))\) and \(\varDelta ^N(M(r),f(\mathbf {x}))\) can be expanded as the elementwise summation. Then, we have
Based on the above inequality, we obtain:
where Inequality (8) uses Inequality (7) and Inequality (9) follows from the triangle inequality. \(\square \)
Remark 1
From Proposition 1, one notes that a mistake on a test instance \((\mathbf {x},y)\) implies that \(\varDelta ^N(M(y),f(\mathbf {x})) \ge \frac{1}{2}\rho \). In other words, the prediction codes are not required to be exactly the same as the ground-truth codes for all the base classifications: as long as the distance is smaller than \(\frac{1}{2}\rho \), N-ary coding can rectify the errors committed by some base classifiers and still make an accurate prediction. This error-correcting ability is very important, especially when the labeled data is insufficient. Moreover, a larger minimum distance \(\rho \) leads to a stronger error-correcting capability. Note that this proposition holds for all the distance measures and traditional coding schemes, since only the triangle inequality is required in the proof.
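This error-correcting behavior can be illustrated directly: corrupt fewer than \(\rho /2\) of the base predictions and minimum-distance decoding still recovers the true class. A small sketch (our own construction, using the Hamming distance and a random N-ary matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
N, N_C, N_L = 4, 6, 20
M = rng.integers(1, N + 1, size=(N_C, N_L))

# minimum pairwise Hamming distance rho of the coding matrix
rho = min((M[i] != M[j]).sum() for i in range(N_C) for j in range(i + 1, N_C))

y = 2                                  # ground-truth class
code = M[y].copy()
k = (rho - 1) // 2                     # corrupt fewer than rho / 2 base predictions
flip = rng.choice(N_L, size=k, replace=False)
code[flip] = code[flip] % N + 1        # change each corrupted symbol to a different value

# decoding: distance to M[y] is k < rho/2, to any other row at least rho - k > rho/2
pred = int(np.argmin([(code != M[i]).sum() for i in range(N_C)]))
```

By the triangle-inequality argument of Proposition 1, `pred` must equal `y` whenever fewer than \(\rho /2\) symbols are corrupted.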
Generalization error of N-ary decomposition
The next result provides a generalization error bound for any type of base classifier, such as the SVM classifier or a decision tree, used in the N-ary decomposition for multi-class classification.
Theorem 1
(N-ary decomposition error bound) Given \(N_L\) base classifiers, \(f_1, \ldots , f_{N_L}\), trained on the \(N_L\) relabeled subsets \(\{(\mathbf {x}_i, M(y_i,s))_{i=1,\ldots ,N_{tr}}\}_{s=1,\ldots ,N_L}\) of the dataset with \(N_{tr}\) instances for a coding matrix \(M\in \{1,2,\ldots ,N\}^{N_C\times N_L}\), the generalization error rate of the N-ary ECOC approach using distance-based decoding is upper bounded by
\[ \frac{2 N_L \bar{B}}{\rho }, \]
where \(\bar{B} = \frac{1}{N_L}\sum _{s=1}^{N_L} B_s\) and \(B_s\) is the upper bound of the distancebased loss for the \(s^{th}\) base classifier.
Proof
According to Proposition 1, for any misclassified data instance, the distance between its predicted label vector \(f(\mathbf {x})\) and the true label vector M(y) is at least \(\frac{\rho }{2}\), i.e., \( \varDelta ^N(M(y),f(\mathbf {x})) = \sum _{s=1}^{N_L} \varDelta ^N(M(y,s),f_s(\mathbf {x})) \ge \frac{\rho }{2}. \)
Let a be the number of incorrect label predictions over a set of test instances of size \(N_{te}\). One obtains
\[ a\, \frac{\rho }{2} \le \sum _{t=1}^{N_{te}} \varDelta ^N(M(y_t),f(\mathbf {x}_t)) = \sum _{t=1}^{N_{te}} \sum _{s=1}^{N_L} \varDelta ^N(M(y_t,s),f_s(\mathbf {x}_t)) \le N_{te} \sum _{s=1}^{N_L} B_s. \]
Then,
\[ \frac{a}{N_{te}} \le \frac{2}{\rho } \sum _{s=1}^{N_L} B_s = \frac{2 N_L \bar{B}}{\rho }, \]
where \(\bar{B} = \frac{1}{N_L}\sum _{s=1}^{N_L} B_s\).
Hence, the testing error rate is bounded by \(\frac{2 {N_L} \bar{B}}{\rho }\). \(\square \)
Remark 2
From Theorem 1, one notes that, for a fixed \(N_L\), the generalization error bound of the N-ary decomposition depends on the following two factors:

1.
The average loss \(\bar{B}\) over all the base classifiers. In practice, some base tasks may be badly designed due to randomness. As long as the average loss \(\bar{B}\) over all the tasks is small, the resulting ensemble classifier is still able to make a precise prediction.

2.
The minimum distance \(\rho \) of the coding matrix M. As discussed in Proposition 1, the larger \(\rho \) is, the stronger the error-correcting capability of the N-ary code.
Both factors are affected by the choice of N. In particular, \(\bar{B}\) increases as N increases, since the base classification tasks become more difficult. On the other hand, from the experimental results in Fig. 3b, it is observed that \(\rho \) becomes larger as N increases. Therefore, there is a trade-off between these two factors.
Experimental results
We present experimental results on 11 well-known UCI multi-class datasets from a wide range of application domains. The statistics of these datasets are summarized in Table 5. The parameter N is chosen by a cross-validation procedure. With the tuned parameters, all methods are run for ten realizations, each with a different random split and the fixed training and testing sizes given in Table 5. Our experimental results focus on the comparison of different encoding schemes rather than decoding schemes; therefore, for a fair comparison, we fix the generalized Hamming distance as the decoding strategy for all the coding designs.
To investigate the effectiveness of the proposed N-ary coding scheme, we compare it with data-independent coding schemes, including OVO, OVA, and random binary encoding, as well as with the direct multi-class methods multi-class SVM (MSVM) and decision tree. For the random binary encoding scheme (ECOC for short) and the N-ary strategy, we select the matrix with the largest minimum absolute distance from 1000 randomly generated matrices.
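The selection of the coding matrix from 1000 random candidates can be sketched as follows; the function and parameter names are ours, for illustration only.

```python
import numpy as np

def best_coding_matrix(n_classes, code_len, N, n_candidates=1000, seed=0):
    """Among random N-ary matrices, keep the one maximizing the minimum
    pairwise absolute row distance."""
    rng = np.random.default_rng(seed)
    best_M, best_rho = None, -1
    for _ in range(n_candidates):
        M = rng.integers(1, N + 1, size=(n_classes, code_len))
        rho = min(int(np.abs(M[i] - M[j]).sum())
                  for i in range(n_classes) for j in range(i + 1, n_classes))
        if rho > best_rho:
            best_M, best_rho = M, rho
    return best_M, best_rho
```

The same routine applies to the random binary scheme by drawing entries from \(\{-1,1\}\) instead.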
To ensure a fair comparison and easy replication of results, the base learners, the decision tree CART (Breiman et al. 1984) and linear SVM, are implemented with the CART decision tree MATLAB toolbox and LIBSVM (Chang and Lin 2011) with the linear kernel in the default settings, respectively.
Error bound analysis of N-ary decomposition
In the bound analysis, we choose the Hamming distance to measure the row separation as a showcase. According to Theorem 1, the generalization error bound depends on the minimum distance \(\rho \) between any two distinct rows in the N-ary coding matrix M, as well as on the average loss \(\bar{B}\) of the base classifiers. In particular, the expected value of \(\varDelta ^N(M)\) increases with N.
In this subsection, we investigate the effect of N, the number of classes per subproblem, using the Pendigits dataset with CART as the base classifier, to illustrate the following aspects: (i) \(\varDelta ^{N}(M)\) between any two distinct rows of codes (see Fig. 3a), (ii) \(\rho \) (see Fig. 3b), (iii) \(\frac{\bar{B}}{\rho }\) (see Fig. 3c), and (iv) the classification performance (see Fig. 4). The empirical results corroborate the error bound proposed in Theorem 1.
Average distance \(\varDelta ^N(M)\) versus N
Recall the Hamming distances for the different coding matrices discussed in Sect. 3: \(\varDelta ^{N}(M)=N_L (1-\frac{1}{N})\), \(\varDelta ^{rand}(M)=N_L/2\), \(\varDelta _{\min }^{ova}(M)=2\), and \(\varDelta _{\min }^{ovo}(M)=\left( \left( \begin{array}{c} N_C\\ 2 \end{array} \right) -1\right) /2+1\).
From Fig. 3a, we observe that the empirical average Hamming distances of the constructed coding matrices for random N-ary schemes are close to \(N_L (1-\frac{1}{N})\). Furthermore, when there are 45 base classifiers, the average distance for N-ary coding matrices is larger than 30, exceeding that of the binary random codes, whose average absolute distance is 22.5. Moreover, a higher N leads to a larger average distance. Comparing Fig. 3a, b, a large average distance \(\varDelta ^N(M)\) also correlates with a large minimum distance \(\rho \).
Minimum distance \(\rho \) versus N
For the Pendigits dataset with 10 classes, \(\rho \) for OVA and OVO is 4 and 18, respectively. From Fig. 3b, we observe that with a fixed number of base classifiers, \(\rho \) increases with the class size N of the multiclass subproblems; meanwhile, \(\rho \) also increases with the code length \(N_L\). Furthermore, in comparison to the other coding schemes, our proposed method usually creates a coding matrix with a large \(\rho \). For example, in Fig. 3b, one observes that with 25 and 45 base classifiers, the corresponding \(\rho \) for binary random codes is 0, whereas N-ary decomposition, given a sufficiently large N, creates a coding matrix with \(\rho \) larger than 10 and 20, respectively. Although N-ary decomposition creates a coding matrix with a larger \(\rho \) when N is larger, in real-world applications it is preferable that N is not too large, so as to keep the computational cost and the difficulty of the subproblems reasonable. In short, N-ary decomposition provides a better alternative for creating a coding matrix with large class separation than traditional coding schemes.
Ratio \(\bar{B}/\rho \) versus N
Both \(\bar{B}\) and \(\rho \) depend on N. Moreover, from the generalization error bound, we observe that \(\bar{B}/\rho \) directly affects the classification performance.
Hence, this ratio, which bounds the classification error, requires further investigation. Figure 3c shows that the ratio \(\bar{B}/\rho \) is lowest when \(N = 4\). This observation suggests that the greater the row and column separation of the coding matrix, the stronger the error-correcting capability (Dietterich and Bakiri 1995). Therefore, compared to the binary and ternary coding schemes, N-ary decomposition is a better way of creating a coding matrix with large separation among the classes as well as more diversity. One notes that \(\bar{B}/\rho \) starts to increase when \(N \ge 5\): the increase in the average base classifier loss \(\bar{B}\) overwhelms the increase in \(\rho \). The reason for this phenomenon is that the subproblems become harder to classify as they involve more classes.
Classification accuracy versus N
Next, we study the impact of N on the multiclass classification accuracy. We use the datasets Pendigits, Letters, Sector, and Aloi, with 10, 26, 105, and 1000 classes respectively, as showcases. To obtain a meaningful analysis, we choose a suitable base classifier for each dataset: we apply CART to Pendigits, Letters, and Aloi and the linear SVM to Sector. One observes from Fig. 4 that N-ary decomposition achieves competitive prediction performance when \(3\le N \le 10\). However, given sufficient base learners, the classification error starts increasing when N is large (e.g. \(N > 4\) for Pendigits, \(N > 5\) for Letters, and \(N > 8\) for Sector). This is because the base tasks are more challenging to solve when N is large, indicating that the influence of \(\bar{B}\) outweighs that of \(\rho \). Furthermore, one observes that the performance curves in Figs. 3c and 4a roughly correlate with each other. Hence, one can estimate the trend in the empirical error using the ratio \(\bar{B}/\rho \). This verifies the validity of the generalization error bound in Theorem 1. To investigate the choice of N more comprehensively, we further conduct experiments on the other datasets. The results for Pendigits, Letters, Sector, and Aloi are summarized in Fig. 4a–d, respectively; for the remaining datasets we make similar observations. In general, smaller values of N (\(N \in [3,10]\)) usually lead to reasonably competitive performance. In other words, the complexity of the base learners for N-ary codes does not need to increase significantly above 3 for the performance to exceed that of existing binary or ternary coding approaches.
Classification accuracy versus \(N_L\)
From Fig. 5, we observe that high accuracy can be achieved with a small number of base learners. Another important observation is that, given fewer base learners, it is better to choose a large value of N rather than a small one. This may be because a larger N leads to stronger discrimination among the codes as well as among the base learners. However, given a sufficiently large \(N_L\), neither a very large nor a very small N reaches the optimal results.
Comparison to state-of-the-art decomposition strategies
We compare our proposed N-ary decomposition to other popular decomposition strategies with different base classifiers, including the decision tree (DT) (Breiman et al. 1984) and the support vector machine (SVM) (Chang and Lin 2011).^{Footnote 2} The two binary classifiers can be easily extended to the multiclass setting. In particular, we use the multiclass SVM (MSVM) (Crammer and Singer 2002) implemented with MSVMpack (Lauer and Guermeur 2011). In addition to the multiclass extensions of the two classifiers, we also compare N-ary decomposition to the OVO, OVA, and random ternary decomposition strategies with the two binary classifiers. For the random ternary and N-ary decomposition strategies, we report the best results with \(N_L \le N_C(N_C-1)/2\), which is sufficient for conventional random decomposition methods to reach optimal performance (Allwein et al. 2001). For the Aloi dataset with 1000 classes, however, we only report the results for all decomposition strategies with \(N_L=1000\) due to its large class size.
Comparison to state-of-the-art baselines with SVM classifiers
The classification accuracies of the different coding schemes as well as the proposed N-ary coding with SVM classifiers are presented in Table 6. We observe that, of all the encoding schemes except N-ary coding, OVO has the best and most stable performance on most datasets. This is because all the information between any two classes is used during classification, and the OVO coding strategy has no redundancy among the different base classifiers. However, it sacrifices efficiency for this performance: both training and testing are very expensive when there are many classes, as in Auslan, Sector, and Aloi. In particular, for Aloi with 1000 classes, it is often not viable to compute the full set of OVO classifiers in a real-world application, as it would require 499,500 base learners for training and testing. The performance of OVA is unstable: on the datasets News20 and Sector, OVA even significantly outperforms OVO, yet on Vowel, Letters, and Glass it performs much worse than the other encoding schemes. Note that ECOC-ONE is initialized with OVA. We observe that MSVM achieves better results than random binary decomposition because it considers the relationships among classes; however, its training complexity is very high. In contrast, random binary decomposition is an ensemble of binary classifiers that can be parallelized thanks to the independence of the base tasks. N-ary decomposition combines the advantages of both MSVM and the ensemble approach to achieve better performance.
Comparison to state-of-the-art baselines with decision tree classifiers
Next, we compare N-ary decomposition with other state-of-the-art coding schemes using the binary decision tree classifier CART (Breiman et al. 1984) as well as its multiclass extension MCART. We implement it with the CART toolbox in the default setting, and the results are reported in Table 7. We observe that binary decision tree classifiers with traditional decomposition strategies are worse than the direct multiclass extension of the decision tree. The decision tree classifiers perform better than SVM on the Pendigits, Vowel, and Letters datasets, but very poorly on high-dimensional datasets such as News20 and Sector. This is because high-dimensional features often lead to complex tree structures. Nevertheless, N-ary decomposition can still significantly improve the performance over both the traditional coding schemes with a binary decision tree learner and the multiclass decision tree.
In summary, our proposed N-ary decomposition is superior to traditional decomposition schemes and direct multiclass algorithms on most tasks, and it provides a flexible strategy for decomposing many classes into many smaller multiclass problems, each of which can be solved independently and in parallel by either MSVM or MCART.
Discussion on the many-class situation
From the experimental results, we observe that N-ary decomposition shows significant improvement on the Aloi dataset with 1000 classes over the existing coding schemes as well as over the direct multiclass classification algorithms, especially the decision tree classifiers. For binary or ternary codes, it is highly likely that different classes are assigned the same codes: from the experimental results, we observe that the minimum distance \(\rho \) for binary and ternary coding is small or even tends to 0. In other words, the existing codings cannot help the classification algorithms differentiate some classes. In contrast, for N-ary coding with \(N_L = 1000\) and \(N=5\), the minimum distance \(\rho \) is 741. It thus creates codes with larger margins between different classes, which explains its superior performance. On the other hand, the direct multiclass algorithms do not work well when the class size is large. Furthermore, the computational cost of direct multiclass algorithms is in \(O(N_C^3)\); when the class size \(N_C\) is large, these algorithms are expensive to train. In contrast, random binary codes can be easily parallelized due to the independence among the subproblems.
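The gap in minimum row distance between binary and N-ary random codes can be illustrated at a smaller scale. The following is a Monte Carlo sketch with our own parameter choices, not the paper's Aloi setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def min_row_distance(M):
    """Minimum Hamming distance between any two distinct rows of M."""
    n = M.shape[0]
    return min(int((M[i] != M[j]).sum())
               for i in range(n) for j in range(i + 1, n))

# 100 classes, 50 base learners: binary symbols {0,1} vs 5-ary {0,...,4}.
n_classes, code_length = 100, 50
rho_binary = min_row_distance(rng.integers(0, 2, (n_classes, code_length)))
rho_nary   = min_row_distance(rng.integers(0, 5, (n_classes, code_length)))
print(rho_binary, rho_nary)  # the 5-ary matrix separates rows far better
```

Intuitively, two random rows agree in a position with probability 1/2 for binary codes but only 1/5 for 5-ary codes, so the worst-separated pair of classes is pushed much further apart in the N-ary case.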
Discussion on performance on each individual class
To understand the performance of the different codes on each individual class, we show the confusion matrices on the Pendigits dataset in Fig. 6. First, we observe that the binary code (i.e., OVA) performs very poorly on some classes in terms of recall or precision; for example, the recall on classes 2 and 6 and the precision on class 10 are below 50%. This can be explained by the fact that, as illustrated in Fig. 1b, binary codes may lead to non-separable cases. Nevertheless, it achieves the best classification results on classes 2, 4, 6, and 9. Compared to the binary code, the ternary code (i.e., OVO) largely reduces the bias and improves the precision and recall scores on most classes. More interestingly, while the ternary code and N-ary decomposition achieve comparable overall performance, N-ary decomposition achieves smaller maximal errors. This may benefit from the simpler subtasks created by N-ary decomposition, as shown in Fig. 1d.
Conclusions
In this paper, we investigate whether one can relax binary decomposition to N-ary decomposition to achieve better multiclass classification performance. In particular, we present an N-ary decomposition strategy that decomposes the original multiclass problem into simpler multiclass subproblems. The advantages of such a decomposition are: (i) the ability to construct more discriminative codes and (ii) the flexibility for the user to select the best N for random decomposition-based classification. We derive a base-classifier-independent generalization error bound for the N-ary decomposition classification problem. We show empirically that the optimal N (in terms of classification performance) lies in [3, 10], with some trade-off in computational cost. Experimental results on benchmark multiclass datasets show that the proposed decomposition achieves superior prediction performance over the state-of-the-art multiclass baselines. In the future, we will investigate more efficient realizations of N-ary decomposition to improve the prediction speed.
Notes
 1.
More complexity analyses can be found in Sect. 4.
 2.
Note that the coding design is independent of the base learners, so it is fair to fix the base learners when comparing decomposition strategies.
References
Allwein, E. L., Schapire, R. E., & Singer, Y. (2001). Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1, 113–141.
Bengio, S., Weston, J., & Grangier, D. (2010). Label embedding trees for large multiclass tasks. In NIPS (pp. 163–171).
Beygelzimer, A., Langford, J., Lifshits, Y., Sorkin, G., & Strehl, A. (2009). Conditional probability tree estimation analysis and algorithms. In UAI (pp. 51–58).
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Statistics/Probability Series. Belmont, CA: Wadsworth Publishing Company.
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 27:1–27:27.
Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.
Crammer, K., & Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 265–292.
Deng, J., Satheesh, S., Berg, A., & Fei-Fei, L. (2011). Fast and balanced: Efficient label tree learning for large scale object recognition. In NIPS.
Dietterich, T. G. (2000). Ensemble methods in machine learning. In MCS (pp. 1–15). Springer.
Dietterich, T. G., & Bakiri, G. (1991). Error-correcting output codes: A general method for improving multiclass inductive learning programs. In AAAI (pp. 572–577). AAAI Press.
Dietterich, T. G., & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263–286.
Fei, B., & Liu, J. (2006). Binary tree of SVM: A new fast multiclass training and classification algorithm. IEEE Transactions on Neural Networks, 17(3), 696–704.
Galar, M., Fernández, A., Barrenechea, E., Bustince, H., & Herrera, F. (2011). An overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes. Pattern Recognition, 44(8), 1761–1776.
Gao, T., & Koller, D. (2011). Discriminative learning of relaxed hierarchy for large-scale visual recognition. In ICCV (pp. 2072–2079).
García-Pedrajas, N., & Ortiz-Boyer, D. (2011). An empirical study of binary classifier fusion methods for multiclass classification. Information Fusion, 12(2), 111–130.
Hong, J.-H., Min, J.-K., Cho, U.-K., & Cho, S.-B. (2008). Fingerprint classification using one-vs-all support vector machines dynamically ordered with Naïve Bayes classifiers. Pattern Recognition, 41(2), 662–671.
Jenssen, R., Kloft, M., Zien, A., Sonnenburg, S., & Müller, K.-R. (2012). A scatter-based prototype framework and multi-class extension of support vector machines. PLoS ONE, 7(10), e42947.
Kittler, J., Ghaderi, R., Windeatt, T., & Matas, J. (2003). Face verification via error correcting output codes. Image and Vision Computing, 21(13–14), 1163–1169.
Knerr, S., Personnaz, L., & Dreyfus, G. (1990). Single-layer learning revisited: A stepwise procedure for building and training a neural network. In F. F. Soulié & J. Hérault (Eds.), Neurocomputing. NATO ASI Series (Series F: Computer and Systems Sciences), Vol. 68. Berlin, Heidelberg: Springer.
Lauer, F., & Guermeur, Y. (2011). MSVMpack: A multiclass support vector machine package. Journal of Machine Learning Research, 12, 2269–2272.
Liu, X.-Y., Li, Q.-Q., & Zhou, Z.-H. (2013). Learning imbalanced multi-class data with optimal dichotomy weights. In IEEE 13th international conference on data mining (ICDM) (pp. 478–487). IEEE.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Rocha, A., & Goldenstein, S. (2014). Multiclass from binary: Expanding one-vs-all, one-vs-one and ECOC-based approaches. IEEE Transactions on Neural Networks and Learning Systems, 25(2), 289–302.
Su, J., & Zhang, H. (2006). A fast decision tree learning algorithm. In Proceedings of the 21st national conference on artificial intelligence, AAAI’06 (Vol. 1, pp. 500–505). AAAI Press.
Tsang, I. W., Kwok, J. T., & Cheung, P. (2005). Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6, 363–392.
Übeyli, E. D. (2007). ECG beats classification using multiclass support vector machines with error correcting output codes. Digital Signal Processing, 17(3), 675–684.
Wu, T.-F., Lin, C.-J., & Weng, R. C. (2004). Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5(Aug), 975–1005.
Yang, J.-B., & Tsang, I. W. (2011). Hierarchical maximum margin learning for multi-class classification. In UAI.
Yu, Z., Cai, D., & He, X. (2010). Errorcorrecting output hashing in fast similarity search. In Proceedings of the second international conference on internet multimedia computing and service, ICIMCS ’10 (pp. 7–10). New York, NY, USA: ACM.
Zhao, B., & Xing, E. P. (2013). Sparse output coding for largescale visual recognition. In CVPR (pp. 3350–3357). IEEE.
Zhong, G., & Cheriet, M. (2013). Adaptive errorcorrecting output codes. In IJCAI (pp. 1932–1938).
Zhong, G., & Liu, C.-L. (2013). Error-correcting output codes based ensemble feature extraction. Pattern Recognition, 46(4), 1091–1100.
Acknowledgements
Joey Tianyi Zhou is supported by Programmatic Grant No. A1687b0033 from the Singapore government’s Research, Innovation and Enterprise 2020 plan (Advanced Manufacturing and Engineering domain). Ivor Tsang is supported by Australian Research Council grants DP180100106, LP150100671, and FT130100746.
Editors: Masashi Sugiyama, Yung-Kyun Noh.
Zhou, J. T., Tsang, I. W., Ho, S.-S., et al. N-ary decomposition for multiclass classification. Machine Learning 108, 809–830 (2019). https://doi.org/10.1007/s10994-019-05786-2
Keywords
 Ensemble learning
 Multiclass classification
 N-ary ECOC