
Experimental validation for N-ary error correcting output codes for ensemble learning of deep neural networks

Abstract

N-ary error correcting output codes (ECOC) decompose a multi-class problem into simpler multi-class problems by splitting the classes into N subsets (meta-classes) to form an ensemble of N-class classifiers and combine them to make predictions. It is one of the most accurate ensemble learning methods for traditional classification tasks. Deep learning has gained increasing attention in recent years due to its successes on various tasks such as image classification and speech recognition. However, little is known about N-ary ECOC with deep neural networks (DNNs) as base learners, probably due to the long computation time. In this paper, we show by experiments that N-ary ECOC with DNNs as base learners generally exhibits superior performance compared with several state-of-the-art ensemble learning methods. Moreover, our work contributes to a more efficient setting of the two crucial hyperparameters of N-ary ECOC: the value of N and the number of base learners to train. We also explore valuable strategies for further improving the accuracy of N-ary ECOC.



Notes

  1. https://github.com/alexvking/neural-net-house-number-recognition

  2. https://gist.github.com/mavenlin/e56253735ef32c3c296d

  3. https://github.com/jimgoo/caffe-oxford102

References

  1. Ahmed, E., Jones, M., Marks, T.K. (2015). An improved deep learning architecture for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3908–3916.

  2. Ahonen, T., Hadid, A., Pietikainen, M. (2006). Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12), 2037–2041.


  3. Allwein, E.L., Schapire, R.E., Singer, Y. (2000). Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1, 113–141.


4. Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.


  5. Buckland, M., & Gey, F. (1994). The relationship between recall and precision. Journal of the American Society for Information Science, 45(1), 12.


  6. Buda, M., Maki, A., Mazurowski, M.A. (2017). A systematic study of the class imbalance problem in convolutional neural networks. arXiv:1710.05381.

  7. Chawla, N.V. (2009). Data mining for imbalanced datasets: An overview. In: Data mining and knowledge discovery handbook, pp. 875–886. Springer.

  8. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.


  9. Dietterich, T.G. (2002). Ensemble Learning. The handbook of brain theory and neural networks, 2nd edn. Cambridge: MIT Press.


  10. Dietterich, T.G., & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2(1), 263–286.


  11. Freund, Y., & Schapire, R.E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55 (1), 119–139.


  12. Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep learning. London: MIT Press.


  13. Han, J., Pei, J., Kamber, M. (2011). Data mining: Concepts and techniques, 3rd edn., Waltham, Morgan Kaufmann.

  14. Hastie, T., Rosset, S., Zhu, J., Zou, H. (2009). Multi-class AdaBoost. Statistics and Its Interface, 2(3), 349–360.


  15. Ho, T.K. (1995). Random Decision Forests. In: Proceedings of the 3rd international conference on document analysis and recognition, pp. 278–282.

  16. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T. (2014). Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM international conference on multimedia, pp. 675–678.

  17. Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images. Master’s thesis, department of computer science. Canada: University of Toronto.


18. Krizhevsky, A., Sutskever, I., Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. In: Proceedings of the advances in neural information processing systems, pp. 1097–1105.

  19. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P. (1998). Gradient-based learning applied to document recognition. In: Proceedings of the IEEE, pp. 2278–2324.

  20. Lee, S., Purushwalkam, S., Cogswell, M., Crandall, D., Batra, D. (2015). Why M heads are better than one: Training a diverse ensemble of deep networks. arXiv:1511.06314.

  21. Lin, M., Chen, Q., Yan, S. (2014). Network in network. In: Proceedings of the international conference on learning representations.

  22. Masko, D., & Hensman, P. (2015). The impact of imbalanced training data for convolutional neural networks. Bachelor’s thesis, school of computer science and communication. Sweden: KTH Royal Institute of Technology.


  23. Müller, H., Michoux, N., Bandon, D., Geissbuhler, A. (2004). A review of content-based image retrieval systems in medical applications - clinical benefits and future directions. International Journal of Medical Informatics, 73(1), 1–23.


  24. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y. (2011). Reading digits in natural images with unsupervised feature learning. In: Proceedings of the NIPS workshop on deep learning and unsupervised feature learning.

  25. Nilsback, M.E., & Zisserman, A. (2008). Automated flower classification over a large number of classes. In: Proceedings of the 6th Indian conference on computer vision, graphics & image processing, pp. 722–729.

  26. Opitz, M., Possegger, H., Bischof, H. (2016). Efficient model averaging for deep neural networks. In: Proceedings of the 13th Asian conference on computer vision (ACCV’16).

  27. Pearson, K. (1895). Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58, 240–242.


  28. Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.


29. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.


30. Schapire, R.E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227.


  31. Seiffert, C., Khoshgoftaar, T.M., Van Hulse, J., Napolitano, A. (2008). Resampling or reweighting: A comparison of boosting implementations. In: Proceedings of the IEEE international conference on tools with artificial intelligence, pp. 445–451.

  32. Skalak, D.B. (1996). The sources of increased accuracy for two proposed boosting algorithms. In: Proceedings of the american association for artificial intelligence workshop on integrating multiple learned models.

  33. Van Rijsbergen, C. (1979). Information Retrieval. London: Butterworth.


34. Vriesmann, L.M., de Souza Britto Jr, A., De Oliveira, L.E.S., Sabourin, R., Ko, A.H.R. (2012). Improving a dynamic ensemble selection method based on oracle information. International Journal of Innovative Computing and Applications, 4(3-4), 184–200.


  35. Wei, X.S., Gao, B.B., Wu, J. (2015). Deep spatial pyramid ensemble for cultural event recognition. In: Proceedings of the IEEE international conference on computer vision workshops, pp. 38–44.

  36. Xie, J., Xu, B., Chuang, Z. (2013). Horizontal and vertical ensemble with deep representation for classification. In: Proceedings of the ICML workshop on representation learning.

  37. Yosinski, J., Clune, J., Bengio, Y., Lipson, H. (2014). How transferable are features in deep neural networks?. In: Proceedings of the advances in neural information processing systems, pp. 3320–3328.

  38. Yule, G.U. (1900). On the association of attributes in statistics: With illustrations from the material of the childhood society. Philosophical Transactions of the Royal Society of London Series A, 194, 257–319.


  39. Zhou, J.T., Tsang, I.W., Ho, S.S., Muller, K.R. (2016). N-ary error correcting coding scheme. arXiv:1603.05850.

  40. Zhou, Z.H. (2012). Ensemble methods: Foundations and algorithms, Chapman & Hall/CRC, Boca Raton.

Author information

Correspondence to Kaikai Zhao.

Appendix

A. Supporting Evidence for the Effect of the Third Factor in Hypothesis 1

To verify the effect of merging classes on the feature learning of the DNN base learners of N-ary ECOC, we train the AlexNet (Krizhevsky et al. 2012) using the ILSVRC2012 dataset with two different encoding strategies. The first one is ECOC encoding, in which the degree of merging classes is large. We follow Yosinski et al. (2014) to split the 1,000 classes into two meta-classes. Each meta-class contains approximately half (551 classes versus 449 classes) of the 1,000 original classes. The second one is RI encoding (keeping the original 1,000 classes), in which the degree of merging classes is small.
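In N-ary ECOC, each base learner is trained on a relabeling of the data in which the NC original classes are merged into N meta-classes. As a rough illustration of such an encoding (the function name and the guarantee that every meta-class is non-empty are our assumptions; the exact ILSVRC2012 split above follows Yosinski et al. 2014, not this sketch):

```python
import random

def make_meta_class_mapping(num_classes, n, seed=None):
    """Randomly split `num_classes` original classes into `n` meta-classes.

    Returns a list `mapping` where mapping[c] is the meta-class label
    (0..n-1) assigned to original class c. Each meta-class is seeded with
    one class so that none of them ends up empty.
    """
    rng = random.Random(seed)
    classes = list(range(num_classes))
    rng.shuffle(classes)
    mapping = [0] * num_classes
    # Seed each meta-class with one class, then assign the rest at random.
    for meta, c in enumerate(classes[:n]):
        mapping[c] = meta
    for c in classes[n:]:
        mapping[c] = rng.randrange(n)
    return mapping

# The ECOC encoding above: 1,000 classes merged into two meta-classes.
ecoc = make_meta_class_mapping(1000, 2, seed=0)
# The RI encoding: n = num_classes, so every class keeps its own label.
ri = make_meta_class_mapping(1000, 1000, seed=0)
print(len(set(ecoc)))  # 2
```

Note that setting n = num_classes degenerates into a permutation of the original labels, which is why RI encoding involves no merging at all.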

Figure 5 shows the visualizations of the learned features of the first convolutional layer at different iterations (5,000, 50,000 and 450,000) for the two encoding strategies. There are six plots: Figs. 5a, b and c correspond to the ECOC encoding, and Figs. 5d, e and f to the RI encoding. Each plot contains 96 squared blocks, each of which represents a learned feature map (also called a filter) of size 11 (width) × 11 (height) × 3 (channels).

Fig. 5 Visualizations of the first convolutional layer (conv1) of the trained AlexNet models with two meta-classes (the first row) and 1,000 meta-classes (the second row) at different iterations. Note that the maximum number of iterations is 450,000

We see from Fig. 5 that a DNN with two meta-classes (Fig. 5a, b, c) tends to learn less representative features compared with a DNN with 1,000 classes (Fig. 5d, e and f). For instance, if we compare Fig. 5a with Fig. 5d, we see that most of the learned features of the DNN with ECOC encoding at the 5,000th iteration look like noisy features (e.g., the lower half of the rows of the learned features in Fig. 5a). However, the DNN with RI encoding at the 5,000th iteration has learned many representative features (e.g., the edges and the color patterns in Fig. 5d).

If we compare Fig. 5b with Fig. 5e, i.e., at the 50,000th iteration, we see that the DNN with ECOC encoding (Fig. 5b) has learned more representative features (compared with itself at the 5,000th iteration in Fig. 5a), while the DNN with RI encoding (Fig. 5e) keeps refining its learned features.

If we compare Fig. 5c with Fig. 5f, i.e., at the 450,000th iteration (the end of the training), we see that the learned features of the DNN with ECOC encoding (Fig. 5c) are not as representative as those of the DNN with RI encoding (Fig. 5f) in terms of the complexity of their patterns. For instance, most of the color patches of the DNN with ECOC encoding resemble simple gradient maps along a single edge direction, while many of the color patches of the DNN with RI encoding represent more complex gradient maps, such as a color patch surrounded by patches of different colors (e.g., the features in the 6th row and the 5th column, the 6th row and the 7th column, and the last row and the 5th column of Fig. 5f), which are useful for detecting a sharp change in color in almost all directions.

To verify the effect of merging classes on the generalization ability of the DNN base learners of N-ary ECOC, we show the average base model accuracies (BAs) of the tested DNN models (as listed in Table 2 in Section 3.1) with respect to N in Fig. 6, as BA shows how well the learned DNNs generalize to the test data. We see from Fig. 6 that, for the DNNs that follow Pattern 1 (MNIST-LeNet, SVHN, CIFAR10-QUICK and CIFAR10-FULL), the BA generally decreases as N increases. Moreover, the base learners with N < NC have higher BAs than the BA of RI (N = NC). Two possible reasons behind the above observations are that, for a dataset with a small number of classes (i.e., NC < 25), the effect of merging classes is small, and the BA is practically determined by the value of N (the smaller N is, the easier the classification task).

Fig. 6 Average base model accuracies with respect to N

On the other hand, for the DNNs that follow Pattern 2 (CIFAR100-NIN, CIFAR100-QUICK and FLOWER102-AlexNet), the BA first decreases by a large amount (more than 10%), then slowly increases as N continues to increase. Moreover, most of the base learners with N < NC have BAs which are much lower than the BA of RI. Intuitively, an N-class (N < 100) classification problem is easier than a 100-class classification problem. However, the results are the opposite. One possible reason behind the above observations is that, for a dataset with a large number of classes (e.g., NC ≥ 100), when N is small (N ≪ NC), NC/N is large (i.e., the degree of merging classes is large), thus the BA is generally low compared with the BA of RI. Note that the DNN with N = 2 (a binary classification problem) corresponds to the easiest classification task (i.e., a random guess will be 50% correct if the class distribution is balanced) among all possible values of N (2 ≤ N ≤ NC), and thus it can still have a high BA. When N begins to increase, the accuracy of a random guess drops sharply, for instance, to 33% for N = 3 and 25% for N = 4, even if the class distribution is balanced. Thus there is a drop in the BA. When N further increases, the effect of merging classes is weakened, thus the BA begins to increase and finally saturates.

The above observations suggest that both the feature learning and the generalization ability of the DNN base learners of N-ary ECOC are largely influenced by the effect of merging classes.

B. Our Extended Disagreement Measure

As defined in Section 4.3, m is the number of examples in the test dataset, Ci and Cj are two base learners of N-ary ECOC, and yk,i and yk,j are the predictions of Ci and Cj, respectively, on the k th test example. We propose to map the predicted meta-class label yk,i (each label takes a value in {0, ..., N − 1}) back to the original class labels (each taking a value in {0, ..., NC − 1}), and then construct a one-by-NC bit vector \(y_{k,i}^{\prime }\) that indicates whether each original class label is included in the predicted meta-class yk,i. Thus the mapping from yk,i to \(y_{k,i}^{\prime }\) is \(\lbrace 0,\ ...,\ N - 1 \rbrace \rightarrow \lbrace 0,\ 1\rbrace ^{N_{C}}\).

Take Table 1(b) in Section 2 as an example: suppose the base learners trained with f0 and f2 both predict the meta-class 0 for the k th test example, i.e., yk,0 = 0 and yk,2 = 0. Since the meta-class 0 of f0 consists of the classes c0 and c8, while the meta-class 0 of f2 consists of the classes c2 and c6, the corresponding \(y_{k,0}^{\prime }\) and \(y_{k,2}^{\prime }\) are \(y_{k,0}^{\prime }\) = [1 0 0 0 0 0 0 0 1 0] and \(y_{k,2}^{\prime }\) = [0 0 1 0 0 0 1 0 0 0]. That is, the c0 and c8 positions of \(y_{k,0}^{\prime }\), as well as the c2 and c6 positions of \(y_{k,2}^{\prime }\), are 1, while all other positions are 0.

Then we estimate the Hamming distance between \(y_{k,i}^{\prime }\) and \(y_{k,j}^{\prime }\), i.e., \((y_{k,i}^{\prime } - y_{k,j}^{\prime } ) (y_{k,i}^{\prime } - y_{k,j}^{\prime } )^{T}\), as the distance between yk,i and yk,j. For instance, in the above example, the distance between the predictions of the meta-class 0 by DNNs trained with f0 and f2 is 4 as four bit positions of \(y_{k,i}^{\prime }\) and \(y_{k,j}^{\prime }\) are different.

The advantage of our proposal is that it fits our intuition: the more overlap there is between the classes included in the predicted meta-classes, the smaller the distance. In the general case where the predicted meta-classes of two base learners share common classes, the distance measured by our proposal is smaller than in the above example, in which no common class exists. For instance, if the predicted meta-classes are 2 and 3 for f0 and f2, respectively (note that the meta-class 2 of f0 and the meta-class 3 of f2 share a common class c1), then the distance is 2, as only two bit positions of \(y_{k,i}^{\prime }\) and \(y_{k,j}^{\prime }\) are different. In the extreme case that the classes included in the predicted meta-classes are identical, the distance is 0; for instance, both the meta-class 0 of f0 and the meta-class 4 of f2 consist of the classes c0 and c8.
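The mapping and the distance can be made concrete with a short sketch that reproduces the Table 1(b) example above (the meta-class compositions are copied from that example; the helper name and array layout are ours):

```python
import numpy as np

NUM_CLASSES = 10  # NC in the Table 1(b) example

# Compositions of the meta-classes used in the example:
# meta-class 0 of f0 = {c0, c8}, meta-class 0 of f2 = {c2, c6}.
F0_META = {0: [0, 8]}
F2_META = {0: [2, 6]}

def to_bit_vector(meta_label, meta_composition, num_classes=NUM_CLASSES):
    """Map a predicted meta-class label to a one-by-NC bit vector with 1s
    exactly at the original classes contained in that meta-class."""
    v = np.zeros(num_classes, dtype=int)
    v[meta_composition[meta_label]] = 1
    return v

y0 = to_bit_vector(0, F0_META)  # [1 0 0 0 0 0 0 0 1 0]
y2 = to_bit_vector(0, F2_META)  # [0 0 1 0 0 0 1 0 0 0]

# (y' - y'')(y' - y'')^T reduces to the Hamming distance for 0/1 vectors.
d = int((y0 - y2) @ (y0 - y2))
print(d)  # 4
```

Since the two meta-classes share no original class, all four set bits differ and the distance is 4, matching the example.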

Finally, we propose to estimate the disagreement measure of the base learners of N-ary ECOC by

$$ \mathit{Dis}_{i,j}^{\prime} = \frac{1}{m} \sum\limits_{k = 1}^{m} (y_{k,i}^{\prime} - y_{k,j}^{\prime} ) (y_{k,i}^{\prime} - y_{k,j}^{\prime} )^{T}, $$
(6)

where \(y_{k,i}^{\prime }\) and \(y_{k,j}^{\prime }\) are the mapped back predictions of the base learners Ci and Cj, respectively, on the k th test example.

By averaging the disagreement measures over all pairs of base learners based on Skalak (1996), we obtain

$$ \mathit{Dis}_{avg} = \frac{2}{N_{L}(N_{L}-1)} \sum\limits_{i = 1}^{N_{L}-1} \sum\limits_{j=i + 1}^{N_{L}} \mathit{Dis}_{i,j}^{\prime}. $$
(7)
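Equations (6) and (7) can be sketched in a few lines of numpy, assuming the predictions have already been mapped back to bit vectors; the function name and the (NL, m, NC) array layout are our assumptions:

```python
import numpy as np
from itertools import combinations

def disagreement_avg(bit_predictions):
    """Average extended disagreement following Eqs. (6) and (7).

    bit_predictions: array of shape (N_L, m, NC) holding the mapped-back
    bit-vector predictions y'_{k,i} for base learner i on test example k.
    Dis'_{i,j} averages the per-example Hamming distance over the m test
    examples; Dis_avg then averages over all N_L(N_L-1)/2 learner pairs.
    """
    n_l = bit_predictions.shape[0]
    total = 0.0
    for i, j in combinations(range(n_l), 2):
        diff = bit_predictions[i] - bit_predictions[j]   # shape (m, NC)
        total += np.mean(np.sum(diff * diff, axis=1))    # Eq. (6)
    return 2.0 * total / (n_l * (n_l - 1))               # Eq. (7)

# Single pair, single test example: the y'_{k,0} and y'_{k,2} vectors from
# the Table 1(b) example give Dis_avg = 4.
preds = np.array([[[1, 0, 0, 0, 0, 0, 0, 0, 1, 0]],
                  [[0, 0, 1, 0, 0, 0, 1, 0, 0, 0]]])
print(disagreement_avg(preds))  # 4.0
```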


Cite this article

Zhao, K., Matsukawa, T. & Suzuki, E. Experimental validation for N-ary error correcting output codes for ensemble learning of deep neural networks. J Intell Inf Syst 52, 367–392 (2019). https://doi.org/10.1007/s10844-018-0516-5


Keywords

  • N-ary error correcting output codes
  • Ensemble learning
  • Deep neural networks
  • Image classification