# Efficient implementation of class-based decomposition schemes for Naïve Bayes


## Abstract

Previous studies have shown that the classification accuracy of a Naïve Bayes classifier in the domain of text-classification can often be improved using binary decompositions such as error-correcting output codes (ECOC). The key contribution of this short note is the realization that ECOC and, in fact, all class-based decomposition schemes, can be efficiently implemented in a Naïve Bayes classifier, so that—because of the additive nature of the classifier—all binary classifiers can be trained in a single pass through the data. In contrast to the straight-forward implementation, which has a complexity of *O*(*n*⋅*t*⋅*g*), the proposed approach improves the complexity to *O*((*n*+*t*)⋅*g*). Large-scale learning of ensemble approaches with Naïve Bayes can benefit from this approach, as the experimental results shown in this paper demonstrate.

## Keywords

Naïve Bayes · Error-correcting output codes · Scalability

## 1 Introduction

A common approach for tackling multi-class classification is the decomposition of a given *k*-class problem into a set of *n* binary classification problems, which in turn can be learned by a binary base classifier. The best-known decomposition schemes are *one-against-all* and *one-against-one* (Friedman 1996; Fürnkranz 2002), where *n*=*k* and \(n=\frac{k(k-1)}{2}\) respectively. These particular schemes and, in general, every class-based decomposition scheme can be modeled within the *error-correcting output coding* (ECOC) framework (Dietterich and Bakiri 1995), and its later generalization known as *ternary ECOC* (Allwein et al. 2000). It consists of a code matrix (*m* _{ ij })∈{−1,0,1}^{ k×n }, whose *k* rows represent the classes, and the columns encode the training protocol for the *n* classifiers. Each entry *m* _{ ij } encodes whether the examples of class *i* are used as positive examples (+1), as negative examples (−1), or are ignored (0) in the training of classifier *j*.
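
For illustration, the two standard schemes can be written down directly as coding matrices. The following is a minimal sketch (the function names are ours, not from the paper); rows are classes, columns are binary classifiers, and entries follow the (+1)/(−1)/(0) convention above:

```python
import itertools

def one_vs_all_matrix(k):
    """k x k coding matrix: classifier j treats class j as positive (+1)
    and every other class as negative (-1)."""
    return [[1 if i == j else -1 for j in range(k)] for i in range(k)]

def one_vs_one_matrix(k):
    """k x k(k-1)/2 ternary coding matrix: one column per class pair (p, q),
    with class p positive, class q negative, and all other classes ignored (0)."""
    pairs = list(itertools.combinations(range(k), 2))
    return [[1 if i == p else (-1 if i == q else 0) for (p, q) in pairs]
            for i in range(k)]
```

For *k*=3 this yields a 3×3 one-against-all matrix and a 3×3 one-against-one matrix, matching *n*=*k* and \(n=\frac{k(k-1)}{2}\) respectively.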

The original motivation for the development of the ECOC framework was to transfer the error-correcting properties that are known from signal theory (Macwilliams and Sloane 1983) to the problem of multi-class classification (Dietterich and Bakiri 1995), which could be empirically confirmed in several studies (Kong and Dietterich 1995; Kittler et al. 2003; Melvin et al. 2007). In particular, several authors have shown that the Naïve Bayes algorithm can benefit from the ECOC framework for text-classification (Berger 1999; Ghani 2000). Note, however, that this advantage is not guaranteed. One can show that there exist code types, for which the binary decomposition in conjunction with voting aggregation (which is in principle identical to Hamming decoding (Park and Fürnkranz 2012)) is equivalent to standard Naïve Bayes. For example, in Sulzmann et al. (2007) it was shown that a one-against-one decomposition of Naïve Bayes is equivalent to standard Naïve Bayes. Moreover, most of the above studies used a fairly basic Naïve Bayes algorithm, whose performance can certainly be improved. It is well-known that the probabilities obtained by Naïve Bayes tend to over-emphasize the winning class (Domingos and Pazzani 1997), and should be used with care. Calibration of the probabilities, with isotonic regression (Zadrozny and Elkan 2001, 2002) or, equivalently, via the ROC convex hull (Fawcett and Niculescu-Mizil 2007; Flach and Matsubara 2007), may lead to improved predictions, both in multi-class and binary classification settings. Similarly, algorithms that weaken the conditional independence assumption may lead to better predictions (Webb et al. 2005). Nevertheless, it is also well-known that for multi-class classification problems, Naïve Bayes can be optimal despite violations of its independence assumption and the resulting erroneous probability estimates (Domingos and Pazzani 1997; Zhang 2005).

However, the subject of this paper is not whether or in which cases a combination of ECOC and Naïve Bayes classifier can result in increased classification accuracy, or how it compares to the above-mentioned alternative approaches for improving predictive accuracy in Naïve Bayes. Instead, we show how this combination can be implemented in a single pass through the data, so that it is no more costly than standard Naïve Bayes. In this way, the question whether a certain output code can yield a gain in classification accuracy can be efficiently answered for specific practical problems. The key idea behind the approach is the realization that the binary decompositions of a Naïve Bayes classifier can be computed very efficiently from the estimated conditional probabilities of the original Naïve Bayes procedure.

First, we briefly recapitulate ECOC and Naïve Bayes in Sects. 2 and 3 and derive the efficient computation of ECOC ensembles with Naïve Bayes base classifiers in Sect. 4. Then, in Sect. 4.3, we demonstrate a suitable *precalculation* method for discrete, normal and kernel density estimation methods. Finally, we provide empirical support for the efficiency of the method in Sect. 5, and end with the conclusion in Sect. 6. For completeness, we also show a comparison of the accuracies of Naïve Bayes to its ECOC decomposition in the Appendix.

## 2 Error-correcting output codes

Error-correcting output codes (Dietterich and Bakiri 1995) are a well-known technique for converting multi-class problems into a set of binary problems. Each of the *k* original classes *c* _{ i } receives a codeword **c** _{ i } in {−1,1}^{ n }, resulting in a *k*×*n* coding matrix **M**. Each column **b** _{ i }, *i*∈{1,…,*n*}, of the matrix corresponds to a binary classifier, which considers all examples of a class corresponding to a (+1) entry as positive, and all examples of a class corresponding to a (−1) entry as negative. Ternary ECOC (Allwein et al. 2000) is an elegant generalization of this technique which allows (0)-values in the codes, corresponding to ignoring the examples of that class.

A crucial step in the design of an ECOC classifier is the selection of the coding matrix. Many well-known reduction algorithms, such as *one-against-all* or *one-against-one*, can be realized by choosing appropriate coding matrices. Many other domain-independent coding schemes, such as random codes, have also been investigated.

In our experiments, we follow prior work on Naïve Bayes text classification (Ghani 2000) and use *Bose–Chaudhuri–Hocquenghem (BCH) codes*, because they have properties which are favorable in practical applications. For example, BCH codes allow specifying the desired minimum Hamming distance of the codewords, which is directly related to the error-correcting ability of the ECOC framework: the greater this distance, the greater the number of errors that can be detected and corrected. Besides, related results in the literature support their usability for multi-class classification.

The set of all BCH codewords of a specific length and desired Hamming distance can be computed using an appropriate *generator polynomial*. Similar to Dietterich and Bakiri (1995), each of the *k* classes is initialized with a randomly selected vector which is then multiplied with the generator polynomial to yield the code word for this class. In our evaluation, we used the bchpoly routine of Gnu Octave to generate binary BCH codes of lengths 15,31,63,127,255,511 and 1023 with maximal designed minimum Hamming distance respectively. In the usual notation, we used BCH codes (15,5,7), (31,6,15), (63,7,31), (127,8,63), (255,9,127), (511,10,255) and (1023,11,511), where the parameters describe (in this order) the codeword bit-length, the bit-length for coded information, and the minimal Hamming distance between any pair of codewords. A detailed description and further information on BCH codes can, e.g., be found in Bose and Ray-Chaudhuri (1960), Macwilliams and Sloane (1983).
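
The error-correcting ability mentioned above is governed by the minimum pairwise Hamming distance of the codewords: a code with minimum distance *d* can correct up to ⌊(*d*−1)/2⌋ bit errors. A small sketch (our own helper, not part of the paper's tooling) for checking this property of a generated code:

```python
def min_hamming_distance(codewords):
    """Minimum pairwise Hamming distance of a set of equal-length codewords
    over {-1, +1}; determines how many bit errors the code can correct."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(hamming(a, b)
               for i, a in enumerate(codewords)
               for b in codewords[i + 1:])
```

For instance, the two-codeword repetition code {(1,1,1), (−1,−1,−1)} has minimum distance 3 and can therefore correct a single bit error.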

## 3 Naïve Bayes

Though Naïve Bayes (NB) is capable of directly learning multi-class predictors, results in the literature indicate that its classification performance can be increased in combination with ECOC methods. Especially for text classification the combination seems to be promising (Berger 1999; Ghani 2000).

Let an instance **x** be characterized by *g* values \((a_{1},\dots,a_{g})\) for attributes \(A_{1},\dots,A_{g}\), and let \(C=\lbrace c_{1},\dots,c_{k}\rbrace\) be the set of classes. Using Bayes' theorem, we can compute the probability that **x** belongs to class \(c_{i}\) as

$$\Pr(c_i \mid \mathbf{x}) = \frac{\Pr(\mathbf{x} \mid c_i)\Pr(c_i)}{\Pr(\mathbf{x})}$$

Since the denominator \(\Pr(\mathbf{x})\) is the same for all classes and a given **x**, we can ignore this term, and focus on the numerator. More precisely, for the case of classification, the following holds:

$$\operatorname*{arg\,max}_{c_i \in C} \Pr(c_i \mid \mathbf{x}) = \operatorname*{arg\,max}_{c_i \in C} \Pr(\mathbf{x} \mid c_i)\Pr(c_i)$$

Using the *class-conditional independence assumption*, we can estimate the class-conditional probability with

$$\Pr(\mathbf{x} \mid c_i) = \prod_{j=1}^{g} \Pr(a_j \mid c_i)$$

where \(\Pr(a_{j}\mid c_{i})\) and \(\Pr(c_{i})\) are estimated from the training data.

## 4 Computation of ECOC for Naïve Bayes in a single pass

In this section, we describe the key idea of this note. We first show that all probability estimates that are conditioned on a mutually exclusive group of classes are additive (Sect. 4.1), and that this can be used for faster probability estimation in ECOC codes (Sect. 4.2). Finally, we discuss how the idea can be implemented for nominal and numeric data (Sect. 4.3).

### 4.1 Reduction to base probabilities

The key idea behind the efficient computation of arbitrary class-based decomposition schemes such as ECOC is that all constituent classifiers of a class-based decomposition can be reduced to the estimation of the parameters of the Naïve Bayes classifier, Pr(*a* _{ j }∣*c* _{ i }) and Pr(*c* _{ i }).

Each column \(\mathbf{b} \in \lbrace -1,+1\rbrace^{k}\) of the coding matrix corresponds to the training set of a binary classifier, in which all classes that correspond to a −1 (+1) entry in **b** are labeled as negative (positive). Let \(C_{\mathbf{b}}^{+} = \lbrace c_{1},\dots, c_{l}\rbrace\) be the set of classes defined as positive given by column **b**. Then it holds that

$$\Pr\bigl(C_{\mathbf{b}}^{+}\bigr) = \sum_{c \in C_{\mathbf{b}}^{+}} \Pr(c) \tag{1}$$

and

$$\Pr\bigl(a_{j} \mid C_{\mathbf{b}}^{+}\bigr) = \frac{\Pr\bigl(a_{j} \wedge C_{\mathbf{b}}^{+}\bigr)}{\Pr\bigl(C_{\mathbf{b}}^{+}\bigr)} = \frac{\sum_{c \in C_{\mathbf{b}}^{+}} \Pr(a_{j} \mid c)\Pr(c)}{\sum_{c \in C_{\mathbf{b}}^{+}} \Pr(c)} \tag{2}$$

since the events (the classes *c*) are mutually exclusive. The probabilities \(\Pr(C_{\mathbf{b}}^{-})\) and \(\Pr(a_{j}\mid C_{\mathbf{b}}^{-})\) for the negative class can be reduced analogously.

Equations (1) and (2) simply show that all necessary values \(\Pr(C_{\mathbf{b}}^{+})\) and \(\Pr(a_{j}\mid C_{\mathbf{b}}^{+})\) can be computed from Pr(*c* _{ i }) and Pr(*a* _{ j }∣*c* _{ i }), exactly as for standard Naïve Bayes. Therefore, different decompositions within the ECOC framework can be applied with Naïve Bayes without further probability-estimation passes over the training data, which would only involve redundant computations.
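
The reduction of Eqs. (1) and (2) to the base estimates can be sketched in a few lines. This is our own illustrative layout (not the paper's implementation): `prior[i]` stands for Pr(*c* _{ i }), `cond[i][j]` for Pr(*a* _{ j }∣*c* _{ i }), and `b` is one column of the coding matrix:

```python
def positive_class_prob(prior, b):
    """Eq. (1): Pr(C_b^+) is the sum of the priors of the classes with a
    +1 entry in column b, since classes are mutually exclusive."""
    return sum(prior[i] for i, m in enumerate(b) if m == +1)

def positive_cond_prob(cond, prior, b, j):
    """Eq. (2): Pr(a_j | C_b^+) = sum_i Pr(a_j|c_i) Pr(c_i) / Pr(C_b^+),
    summing over the classes with a +1 entry in column b."""
    num = sum(cond[i][j] * prior[i] for i, m in enumerate(b) if m == +1)
    return num / positive_class_prob(prior, b)
```

For example, with priors (0.5, 0.3, 0.2), conditionals Pr(*a*∣*c* _{ i }) = (0.9, 0.5, 0.1), and column **b** = (+1, +1, −1), this yields Pr(*C* ^{+}) = 0.8 and Pr(*a*∣*C* ^{+}) = 0.6/0.8 = 0.75, without any pass over the training data.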

This tight combination of Naïve Bayes and ECOC, i.e. applying standard Naïve Bayes learning and computing the appropriate probabilities using (1) and (2), will be called ECOC-NB in the following text, for convenience.

### 4.2 Complexity

Instead of *n* iterations over the dataset for estimating the corresponding estimations of \(\Pr(a_{j}\mid C_{\mathbf{b}}^{+})\) and \(\Pr(C_{\mathbf{b}}^{+})\) for an *n*-bit ECOC scheme, only one pass is necessary. The usual training complexity of *O*(*n*⋅*t*⋅*g*) can thus be reduced to *O*(*t*⋅*g*), where *n* is the number of classifiers, *t* the number of training instances and *g* the number of features. This is possible by estimating the parameters Pr(*a* _{ j }∣*c* _{ i }) and Pr(*c* _{ i }) as for the regular Naïve Bayes classifier, and applying Eqs. (1) and (2) at classification time for each decomposed classifier and test instance.

Note, however, that although the training complexity is significantly decreased in comparison to the straight-forward application of the ECOC framework, the cost is moved to the prediction or testing phase, because we now have to perform more calculations for estimating the class-probabilities of an example. In particular, for a problem with a large number of attributes and classifiers, this can lead to a significant increase of testing complexity.

As we will show in the next section, this drawback can be solved by *precalculating* the probability distributions needed by the classifiers, i.e. precalculating the combined probability distribution \(\Pr(a_{j}\mid C_{\mathbf{b}}^{+})\), instead of always aggregating over a series of part-probabilities according to (2) for each test instance. This approach, which will be described in the following sections, results in a *training* complexity of *O*((*n*+*t*)⋅*g*) and the same *testing* complexity as the standard approach.

### 4.3 Precalculation

We take a closer look into probability estimation methods for common attribute types and show how to precalculate the needed probability distributions.

#### 4.3.1 Discrete/nominal attribute

For a discrete attribute value \(a_{j} \in A_{j}\), where \(A_{j}\) is a finite set of distinct values, the following frequency-based model is usually used:

$$\Pr(a_j \mid c_i) = \frac{\Pr(a_j \wedge c_i)}{\Pr(c_i)} = \frac{|a_j \wedge c_i|}{|c_i|}$$

where |*x*| denotes the number of observed instances which satisfy *x*. Also, \(\Pr(a_{j} \wedge c_{i}) = \frac{|a_{j} \wedge c_{i}|}{n}\) and \(\Pr(c_{i}) = \frac{|c_{i}|}{n}\), where *n* is the number of instances observed so far. So, evaluating Eq. (2) requires \((|A_{j}|+1)\cdot k\) additions and \(|A_{j}|\) divisions for generating the pseudo probability estimator. Note that this complexity does not depend on the number of training instances.
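
The precalculation of the pooled discrete distribution from per-class counts can be sketched as follows (our own function name and data layout; Laplace smoothing is omitted for brevity). `counts[i]` maps each attribute value to its count within class *i*:

```python
def pooled_nominal_dist(counts, class_counts, pos_classes):
    """Precalculate Pr(a_j | C^+) for a nominal attribute from per-class
    value counts alone, without touching the training data again.
    counts[i][v]    -- instances of class i with attribute value v
    class_counts[i] -- total instances of class i
    pos_classes     -- indices of the classes forming the meta-class C^+"""
    values = set().union(*(counts[i].keys() for i in pos_classes))
    pos_total = sum(class_counts[i] for i in pos_classes)          # |C^+|
    return {v: sum(counts[i].get(v, 0) for i in pos_classes) / pos_total
            for v in values}
```

The cost is a handful of additions and divisions per attribute value, independent of the number of training instances, in line with the count stated above.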

#### 4.3.2 Numeric attribute

For numeric attributes, the following two estimation procedures are commonly applied:

#### Normal density estimation

\(\Pr(a_{j}\mid c_{i})\) is in this case usually modeled as a normal distribution:

$$\Pr(a_j \mid c_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \exp\left(-\frac{(a_j-\mu_{ij})^2}{2\sigma_{ij}^2}\right)$$

whose mean \(\mu_{ij}\) and variance \(\sigma_{ij}^{2}\) can be estimated incrementally. For this, it suffices to store the number of instances *t*, the sum of observed attribute values \(\sum^{t}_{m=1} x_{m}\), and the sum of squared values \(\sum^{t}_{m=1} x_{m}^{2}\) for each attribute. These values can be analogously summed up for representing the pseudo probability distribution, whose computational cost is also independent of the number of instances but dependent on the number of attributes.
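
The merging of these sufficient statistics can be sketched as follows (an illustrative helper of our own; each per-class entry is the triple \((t_i, \sum x_m, \sum x_m^2)\) for one attribute):

```python
def pooled_gaussian(stats, pos_classes):
    """Merge per-class sufficient statistics (count, sum, sum of squares)
    into the pooled mean and variance of the meta-class C^+.
    stats[i] = (t_i, sum_i, sumsq_i) for one attribute of class i."""
    t = sum(stats[i][0] for i in pos_classes)
    s = sum(stats[i][1] for i in pos_classes)
    sq = sum(stats[i][2] for i in pos_classes)
    mean = s / t
    var = sq / t - mean ** 2      # population variance from the raw moments
    return mean, var
```

Merging only adds three numbers per class, so the cost is again independent of the number of training instances.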

#### Kernel density estimation

For kernel density estimation, the observed values \(x_{m}\) are stored, and the probability estimate of a requested value *x* depends on its distance to these values. This results in a somewhat smooth and not necessarily unimodal probability density function. The definition is

$$\Pr(x \mid c_i) = \frac{1}{t\,h}\sum_{m=1}^{t} K\!\left(\frac{x - x_m}{h}\right)$$

where *K*(⋅) is some kernel (often a standard Gaussian function with mean zero and variance one) and *h* is a smoothing parameter, called the *bandwidth*.

In our context, the straight-forward method to combine these probability distributions is to merge the observations, i.e., to form the union of all observed values of an attribute in the relevant classes \(C_{\mathbf{b}}^{+}\). This can be done in *O*(*t*). In contrast to the previous estimation techniques, the overall worst-case complexity in this case is therefore only equivalent to that of the straight-forward ECOC method. Note, however, that the merging of the values of two partitions can be sped up in domains where numerical values repeat across training instances. The idea, as, e.g., realized in the WEKA software (Hall et al. 2009), is to maintain a list of all unique values of an attribute together with their numbers of occurrences. This allows additional savings if the total number of unique values *w* is much smaller than the total number of values *z*=*t*⋅*g*, because the merging of the value lists in two partitions takes only *O*(*w*) instead of *O*(*z*). In the experimental evaluation, we also report the ratio *w*/*z* as the *diversity* value of a dataset.
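
The unique-value-list idea can be sketched as follows (our own illustrative code, assuming each class stores a value → occurrence-count table for an attribute); the merge then costs *O*(*w*) rather than *O*(*z*):

```python
from collections import Counter

def merge_value_lists(per_class_counters, pos_classes):
    """Merge per-class (value -> occurrence count) tables into the table
    for the meta-class C^+. The cost is linear in the number of unique
    values w, not in the total number of stored observations."""
    merged = Counter()
    for i in pos_classes:
        merged.update(per_class_counters[i])   # Counter.update adds counts
    return merged
```

If a value occurs in several of the merged classes, its occurrence counts are simply summed, exactly as when the raw observations were pooled.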


## 5 Evaluation

### 5.1 Experimental setup

Our evaluation is based on a collection of 19 text-classification datasets,^{1} from which we selected 17 datasets. The remaining two datasets were excluded because one yielded a very low accuracy for all algorithms (<1 %) and the other already exhibited a very high time complexity for standard Naïve Bayes, so that we did not attempt a complete evaluation of all variants of the ECOC-NB classifiers. This collection is composed of well-known benchmark datasets such as TREC and OHSUMED. For a detailed description, we refer to Forman (2003). Table 1 shows the dataset characteristics; the last column shows the diversity of each dataset.

Dataset characteristics. This table shows the number of instances *t*, the number of features *g*, the number of classes *k* and the diversity of a dataset

| #instances *t* | #features *g* | #classes *k* | Diversity |
|---|---|---|---|
| 2463 | 2000 | 17 | 0.0050 |
| 3204 | 31472 | 6 | 0.0013 |
| 3075 | 31472 | 6 | 0.0013 |
| 1003 | 3182 | 10 | 0.0038 |
| 918 | 3012 | 10 | 0.0038 |
| 1050 | 3238 | 10 | 0.0043 |
| 913 | 3100 | 10 | 0.0042 |
| 1504 | 2886 | 13 | 0.0023 |
| 1657 | 3758 | 25 | 0.0022 |
| 414 | 6429 | 9 | 0.0127 |
| 313 | 5804 | 8 | 0.0151 |
| 336 | 7902 | 6 | 0.0184 |
| 204 | 5832 | 6 | 0.0294 |
| 927 | 10128 | 7 | 0.0058 |
| 878 | 7454 | 10 | 0.0049 |
| 690 | 8261 | 10 | 0.0076 |
| 1560 | 8460 | 20 | 0.0022 |

We could not find datasets with discrete attributes with similar characteristics (a large number of attributes and a large number of classes), so we decided to perform experiments on discretized versions of these datasets. In particular, we converted each numerical attribute into a 10-valued discrete attribute using equal-frequency discretization in order to get a good idea of potential run-time savings in such a setting.
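
One simple way to compute the cut points for such an equal-frequency discretization could look like this (a sketch under our own assumptions; the actual discretization filter used in practice, e.g. in WEKA, may differ in detail):

```python
def equal_frequency_bins(values, n_bins=10):
    """Cut points that split the sorted values into n_bins bins of
    (approximately) equal frequency."""
    s = sorted(values)
    t = len(s)
    return [s[(m * t) // n_bins] for m in range(1, n_bins)]
```

Each numeric attribute is then replaced by the index of the bin its value falls into, yielding a 10-valued discrete attribute.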

For all experiments, 10-fold cross-validation was applied; the experiments were conducted on a 2.4 GHz AMD Opteron 250 system with 8 GB RAM. For kernel density estimation, a Gaussian kernel with mean zero and variance one was used. No feature selection was applied, since we are mainly interested in the training complexity, which is more interesting with a high number of features. This comes with the disadvantage that the reported accuracies may not represent the optimal values, so the following accuracy results should be interpreted with this reservation in mind.

In the following performance tables, some cells are empty because, for the particular combination of dataset and BCH bit-length, the BCH code generation process could not generate a valid ECOC matrix satisfying some properties relevant for machine learning: the code generation process randomly picks *k* BCH codewords of the specified length as the ECOC matrix and checks, for every column, whether it contains at least one (+1) and one (−1) symbol. Furthermore, no two columns may be identical or inverse to each other. The code generation process is aborted after 100,000 iterations. In this context, the lower the number of classes, the lower the probability of generating a suitable ECOC matrix with a high bit-length, since the number of valid combinations shrinks with a decreasing number of classes, i.e., rows in the coding matrix.
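
The validity conditions just described can be sketched as a simple check on a candidate matrix (our own illustrative code, not the original generator); rows are classes, columns are classifiers:

```python
def is_valid_ecoc_matrix(M):
    """Check the validity conditions: every column contains at least one
    +1 and one -1 entry, and no two columns are identical or inverse."""
    cols = list(zip(*M))
    for c in cols:
        if +1 not in c or -1 not in c:
            return False
    seen = set()
    for c in cols:
        inv = tuple(-x for x in c)
        if c in seen or inv in seen:
            return False
        seen.add(c)
    return True
```

A random generator would repeatedly draw *k* codewords and retry until this check passes or the iteration limit is reached.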

Note that we used weighted decoding (Dietterich and Bakiri 1995) instead of Hamming decoding, because it performed slightly better with respect to accuracy in some preliminary tests in our setting. In the following, we show the run-time results, which are our main focus. For completeness, accuracy results can be found in the Appendix.

### 5.2 Run-time evaluation

Training time comparison of Naïve Bayes and ECOC-NB with different BCH code lengths (with precalculation and normal density estimators, in seconds). For bit-lengths 15, 31, 63 and 127 the second column shows the corresponding training time for the straight-forward ECOC implementation

| NB | 15-Bit BCH | straight-forward | 31-Bit BCH | straight-forward | 63-Bit BCH | straight-forward | 127-Bit BCH | straight-forward | 255-Bit BCH | 511-Bit BCH | 1023-Bit BCH |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 11.16 | 11.34 | 165.24 | 12.03 | 340.76 | 11.84 | 630.94 | 12.58 | 1355.48 | 12.78 | 14.04 | 20.39 |
| 101.74 | 101.20 | 1596.21 | 110.98 | 3219.63 | – | – | – | – | – | – | – |
| 91.02 | 92.41 | 1400.38 | 98.21 | 2865.61 | – | – | – | – | – | – | – |
| 4.23 | 4.39 | 57.30 | 4.54 | 122.39 | 4.57 | 235.56 | 4.81 | 489.79 | 5.27 | 7.07 | – |
| 3.80 | 3.79 | 48.54 | 3.86 | 104.57 | 3.92 | 201.35 | 4.12 | 414.65 | 4.66 | 6.75 | – |
| 4.57 | 4.73 | 61.87 | 4.87 | 132.27 | 4.92 | 253.59 | 5.15 | 524.22 | 5.69 | 7.84 | – |
| 3.98 | 3.94 | 52.04 | 4.06 | 109.39 | 4.18 | 210.48 | 4.39 | 430.19 | 4.84 | 6.96 | – |
| 6.07 | 6.14 | 76.51 | 6.16 | 156.57 | 6.24 | 309.98 | 6.48 | 657.10 | 7.03 | 8.54 | 15.11 |
| 9.08 | 9.00 | 118.84 | 8.97 | 244.10 | 9.25 | 484.02 | 9.85 | 1013.32 | 10.98 | 12.93 | 21.55 |
| 4.37 | 4.45 | 61.24 | 4.51 | 124.40 | 4.59 | 249.29 | 5.03 | 506.52 | 6.05 | – | – |
| 2.83 | 2.85 | 39.66 | 2.92 | 81.02 | 3.03 | 161.02 | 3.39 | 338.22 | – | – | – |
| 4.35 | 4.46 | 62.16 | 4.51 | 128.97 | – | – | – | – | – | – | – |
| 1.78 | 1.82 | 24.92 | 1.85 | 50.94 | – | – | – | – | – | – | – |
| 16.10 | 16.49 | 229.83 | 16.35 | 467.13 | 16.32 | 940.35 | – | – | – | – | – |
| 11.23 | 11.31 | 153.79 | 11.35 | 313.71 | 11.46 | 629.27 | 12.15 | 1314.06 | 13.06 | 15.62 | – |
| 9.82 | 9.94 | 134.89 | 9.80 | 279.80 | 10.16 | 558.91 | 10.87 | 1174.95 | 11.63 | 14.58 | – |
| 23.78 | 23.19 | 309.03 | 22.93 | 651.38 | 23.68 | 1294.39 | 24.63 | 2714.96 | 26.04 | 29.62 | 40.95 |

Training time comparison of Naïve Bayes and ECOC-NB with different BCH code lengths (with precalculation and kernel density estimators, in seconds). For bit-lengths 15, 31, 63 and 127 the second column shows the corresponding training time for the straight-forward ECOC implementation

| NB | 15-Bit BCH | straight-forward | 31-Bit BCH | straight-forward | 63-Bit BCH | straight-forward | 127-Bit BCH | straight-forward | 255-Bit BCH | 511-Bit BCH | 1023-Bit BCH |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 11.85 | 13.53 | 181.11 | 13.51 | 376.83 | 13.62 | 750.80 | 15.88 | 1559.50 | 17.58 | 23.74 | 38.46 |
| 107.12 | 120.18 | 1700.63 | 119.79 | 3474.00 | – | – | – | – | – | – | – |
| 105.69 | 106.56 | 1517.53 | 107.69 | 3080.02 | – | – | – | – | – | – | – |
| 4.95 | 5.26 | 67.72 | 5.60 | 140.77 | 5.95 | 286.17 | 7.30 | 584.12 | 9.10 | 15.83 | – |
| 4.31 | 4.55 | 58.76 | 4.86 | 121.46 | 5.17 | 248.96 | 6.49 | 505.73 | 8.88 | 15.01 | – |
| 5.36 | 5.72 | 73.57 | 6.02 | 152.98 | 6.30 | 311.64 | 7.78 | 635.28 | 10.32 | 16.82 | – |
| 4.48 | 4.73 | 60.54 | 5.01 | 124.98 | 5.36 | 252.04 | 6.75 | 513.94 | 9.14 | 15.48 | – |
| 6.94 | 7.25 | 91.68 | 7.59 | 190.76 | 7.83 | 388.70 | 9.58 | 792.19 | 12.27 | 18.31 | 32.40 |
| 10.24 | 10.94 | 136.05 | 11.76 | 283.06 | 12.72 | 575.87 | 16.61 | 1175.56 | 22.14 | 36.46 | 63.40 |
| 4.76 | 5.25 | 66.61 | 5.79 | 138.48 | 6.62 | 280.17 | 9.19 | 568.41 | 13.53 | – | – |
| 3.12 | 3.49 | 43.84 | 3.92 | 90.84 | 4.69 | 184.31 | 6.64 | 374.89 | – | – | – |
| 4.76 | 5.26 | 69.61 | 5.75 | 143.73 | – | – | – | – | – | – | – |
| 1.92 | 2.29 | 27.81 | 2.72 | 57.59 | – | – | – | – | – | – | – |
| 17.75 | 18.41 | 256.99 | 19.18 | 526.89 | 19.90 | 1070.32 | – | – | – | – | – |
| 12.37 | 12.96 | 172.23 | 13.68 | 357.73 | 14.48 | 719.62 | 18.15 | 1489.90 | 23.13 | 36.71 | – |
| 10.75 | 11.52 | 152.41 | 12.30 | 313.86 | 13.30 | 640.29 | 17.14 | 1318.36 | 22.33 | 37.56 | – |
| 25.87 | 27.24 | 354.87 | 28.60 | 731.24 | 30.24 | 1482.71 | 37.65 | 3110.42 | 46.45 | 70.85 | 119.63 |

Training time comparison of Naïve Bayes and ECOC-NB with different BCH code lengths on the discretized datasets. For bit-lengths 15, 31, 63 and 127 the second column shows the corresponding training time for the straight-forward ECOC implementation

| NB | 15-Bit BCH | straight-forward | 31-Bit BCH | straight-forward | 63-Bit BCH | straight-forward | 127-Bit BCH | straight-forward | 255-Bit BCH | 511-Bit BCH | 1023-Bit BCH |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.32 | 1.38 | 9.93 | 1.47 | 20.77 | 1.50 | 41.73 | 1.86 | 84.66 | 2.95 | 5.70 | 13.12 |
| 13.10 | 14.02 | 153.54 | 13.76 | 343.46 | – | – | – | – | – | – | – |
| 11.50 | 12.48 | 140.24 | 12.85 | 319.59 | – | – | – | – | – | – | – |
| 0.94 | 1.02 | 7.72 | 1.05 | 17.30 | 1.12 | 32.71 | 1.45 | 69.50 | 2.19 | 4.46 | – |
| 0.81 | 0.88 | 6.21 | 0.93 | 13.84 | 0.91 | 25.02 | 1.31 | 52.21 | 1.86 | 4.48 | – |
| 0.96 | 1.07 | 8.31 | 1.07 | 18.73 | 1.14 | 35.51 | 1.58 | 76.56 | 2.26 | 5.10 | – |
| 0.77 | 0.86 | 7.01 | 0.96 | 14.81 | 0.96 | 27.85 | 1.30 | 59.65 | 1.96 | 4.49 | – |
| 1.22 | 1.34 | 8.69 | 1.32 | 18.64 | 1.40 | 35.91 | 1.83 | 74.00 | 2.30 | 4.60 | 11.06 |
| 2.07 | 2.26 | 18.81 | 2.37 | 38.34 | 2.30 | 76.67 | 3.56 | 165.50 | 4.34 | 8.65 | 18.03 |
| 0.85 | 1.01 | 10.21 | 1.18 | 22.71 | 1.33 | 47.60 | 2.24 | 100.39 | 3.63 | – | – |
| 0.56 | 0.72 | 6.92 | 0.78 | 15.52 | 0.93 | 28.79 | 1.47 | 59.65 | – | – | – |
| 0.83 | 0.94 | 11.21 | 1.08 | 23.86 | – | – | – | – | – | – | – |
| 0.38 | 0.49 | 4.94 | 0.57 | 10.30 | – | – | – | – | – | – | – |
| 2.93 | 3.18 | 37.17 | 3.45 | 79.85 | 3.44 | 155.42 | – | – | – | – | – |
| 2.15 | 2.31 | 24.89 | 2.40 | 54.05 | 2.58 | 109.35 | 3.49 | 223.02 | 4.80 | 10.32 | – |
| 1.88 | 2.12 | 22.83 | 2.17 | 48.89 | 2.49 | 95.80 | 3.37 | 198.83 | 4.85 | 11.22 | – |
| 4.46 | 4.64 | 45.62 | 5.05 | 105.40 | 5.11 | 219.85 | 6.46 | 434.70 | 8.73 | 16.24 | 30.09 |

Also, using kernel density estimators, we observe only a relatively slight increase in training time for an increasing number of classifiers (Table 3). As previously mentioned, the worst-case training complexity in this case is still the same as for the straight-forward baseline, but in many cases, when the dataset has a very small ratio of distinct values to the number of instances, run-times will be significantly smaller. This is also the case here: the last column of Table 1 shows the ratio of the sum of distinct values over all attributes to the number of instances times the number of features, and all datasets are far away from the worst-case scenario. In addition, the tight combination of ECOC and Naïve Bayes may also benefit from reduced overhead at the programming-language level, e.g., fewer function calls and I/O operations.

Note that for discrete and normal density estimation, the difference in training time between ECOC-NB and NB is independent of the number of instances *t*. For instance, even if dataset *fbis* had far more instances, training ECOC-NB with 31-bit BCH codes would still only take about one second longer than standard Naïve Bayes.

## 6 Conclusion

We report a simple combined computation of ECOC ensembles with Naïve Bayes as base learner. Compared to the straight-forward method with a training complexity of *O*(*n*⋅*t*⋅*g*) its complexity using normal and discrete density estimation methods is reduced to *O*((*n*+*t*)⋅*g*).

In conjunction with kernel density estimators, the worst-case complexity remains the same, but the method can benefit from a low number of distinct feature values. We provide empirical evaluations supporting this statement and expect similar reductions in training complexity on the majority of real-world datasets, which, in our experience, typically exhibit such low diversity.

A possible disadvantage of the decomposition approach is the need for tuning parameters such as the bit-length. However, with the efficient computation scheme proposed in this work, the cost of such parameter tuning becomes feasible. Furthermore, ECOC-NB can naturally benefit from more sophisticated or specialized code types in the future, which is an active research topic (e.g., Pujol et al. 2006; Escalera et al. 2010). Finally, we note that the results of this paper also facilitate an implementation of ECOC-NB for learning from sufficient statistics, in much the same way as is possible for the regular Naïve Bayes classifier (Koul et al. 2008).

In summary, we have shown that the combination of Naïve Bayes with Error-Correcting Output Codes is almost as fast as a conventional Naïve Bayes classifier. ECOC are thus a viable technique for trying to improve the classification performance of Naïve Bayes on large-scale datasets.

## Acknowledgements

We would like to thank Eyke Hüllermeier and Jan-Nikolas Sulzmann for helpful suggestions and discussions. This work was supported by the German Science Foundation (DFG).

## References

- Allwein, E. L., Schapire, R. E., & Singer, Y. (2000). Reducing multiclass to binary: a unifying approach for margin classifiers. *Journal of Machine Learning Research*, *1*, 113–141.
- Berger, A. (1999). Error-correcting output coding for text classification. In *Proceedings of the IJCAI-99 workshop on machine learning for information filtering (IJCAI99-MLIF)*, Stockholm, Sweden.
- Bose, R. C., & Ray-Chaudhuri, D. K. (1960). On a class of error correcting binary group codes. *Information and Control*, *3*(1), 68–79.
- Dietterich, T. G., & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. *Journal of Artificial Intelligence Research*, *2*, 263–286.
- Domingos, P., & Pazzani, M. J. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. *Machine Learning*, *29*(2–3), 103–130.
- Escalera, S., Pujol, O., & Radeva, P. (2010). Error-correcting output codes library. *Journal of Machine Learning Research*, *11*, 661–664.
- Fawcett, T., & Niculescu-Mizil, A. (2007). PAV and the ROC convex hull. *Machine Learning*, *68*(1), 97–106.
- Flach, P. A., & Matsubara, E. T. (2007). A simple lexicographic ranker and probability estimator. In J. N. Kok, J. Koronacki, R. Lopez de Mantaras, S. Matwin, D. Mladenič, & A. Skowron (Eds.), *Proceedings of the 18th European conference on machine learning (ECML-07)*, Warsaw, Poland (pp. 575–582). Berlin: Springer.
- Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. *Journal of Machine Learning Research*, *3*, 1289–1305.
- Friedman, J. H. (1996). Another approach to polychotomous classification. Technical report, Department of Statistics, Stanford University, Stanford, CA.
- Fürnkranz, J. (2002). Round robin classification. *Journal of Machine Learning Research*, *2*, 721–747.
- Ghani, R. (2000). Using error-correcting codes for text classification. In P. Langley (Ed.), *Proceedings of the 17th international conference on machine learning (ICML-00)* (pp. 303–310). Stanford: Morgan Kaufmann.
- Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: an update. *SIGKDD Explorations*, *11*(1), 10–18.
- Kittler, J., Ghaderi, R., Windeatt, T., & Matas, J. (2003). Face verification via error correcting output codes. *Image and Vision Computing*, *21*(13–14), 1163–1169.
- Kong, E. B., & Dietterich, T. G. (1995). Error-correcting output coding corrects bias and variance. In *Proceedings of the 12th international conference on machine learning (ICML-95)* (pp. 313–321). Stanford: Morgan Kaufmann.
- Koul, N., Caragea, C., Honavar, V., Bahirwani, V., & Caragea, D. (2008). Learning classifiers from large databases using statistical queries. In *Proceedings of the 2008 IEEE/WIC/ACM international conference on web intelligence and intelligent agent technology* (pp. 923–926).
- Macwilliams, F. J., & Sloane, N. J. A. (1983). *The theory of error-correcting codes*. Amsterdam: North-Holland.
- Melvin, I., Ie, E., Weston, J., Noble, W. S., & Leslie, C. (2007). Multi-class protein classification using adaptive codes. *Journal of Machine Learning Research*, *8*, 1557–1581.
- Park, S. H., & Fürnkranz, J. (2012). Efficient prediction algorithms for binary decomposition techniques. *Data Mining and Knowledge Discovery*, *24*(1), 40–77.
- Pujol, O., Radeva, P., & Vitrià, J. (2006). Discriminant ECOC: a heuristic method for application dependent design of error correcting output codes. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, *28*(6), 1007–1012.
- Sulzmann, J. N., Fürnkranz, J., & Hüllermeier, E. (2007). On pairwise Naïve Bayes classifiers. In J. N. Kok, J. Koronacki, R. Lopez de Mantaras, S. Matwin, D. Mladenič, & A. Skowron (Eds.), *Proceedings of the 18th European conference on machine learning (ECML-07)*, Warsaw, Poland (pp. 371–381). Berlin: Springer.
- Webb, G. I., Boughton, J., & Wang, Z. (2005). Not so naive Bayes: aggregating one-dependence estimators. *Machine Learning*, *58*(1), 5–24.
- Zadrozny, B., & Elkan, C. (2001). Obtaining calibrated probability estimates from decision trees and Naïve Bayesian classifiers. In *Proceedings of the 18th international conference on machine learning (ICML-01)*, Williamstown, MA (pp. 609–616).
- Zadrozny, B., & Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. In *Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining (KDD-02)*, Edmonton, Canada (pp. 694–699).
- Zhang, H. (2005). Exploring conditions for the optimality of Naïve Bayes. *International Journal of Pattern Recognition and Artificial Intelligence*, *19*(2), 183–198.