
Experimental validation for N-ary error correcting output codes for ensemble learning of deep neural networks

Abstract

N-ary error correcting output codes (ECOC) decompose a multi-class problem into simpler multi-class problems by splitting the classes into N subsets (meta-classes) to form an ensemble of N-class classifiers and combine them to make predictions. It is one of the most accurate ensemble learning methods for traditional classification tasks. Deep learning has gained increasing attention in recent years due to its successes on various tasks such as image classification and speech recognition. However, little is known about N-ary ECOC with deep neural networks (DNNs) as base learners, probably due to the long computation time. In this paper, we show by experiments that N-ary ECOC with DNNs as base learners generally exhibits superior performance compared with several state-of-the-art ensemble learning methods. Moreover, our work contributes to a more efficient setting of the two crucial hyperparameters of N-ary ECOC: the value of N and the number of base learners to train. We also explore valuable strategies for further improving the accuracy of N-ary ECOC.



Notes

  1. https://github.com/alexvking/neural-net-house-number-recognition

  2. https://gist.github.com/mavenlin/e56253735ef32c3c296d

  3. https://github.com/jimgoo/caffe-oxford102

References

  1. Ahmed, E., Jones, M., Marks, T.K. (2015). An improved deep learning architecture for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3908–3916.

  2. Ahonen, T., Hadid, A., Pietikainen, M. (2006). Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12), 2037–2041.


  3. Allwein, E.L., Schapire, R.E., Singer, Y. (2000). Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1, 113–141.


4. Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.


  5. Buckland, M., & Gey, F. (1994). The relationship between recall and precision. Journal of the American Society for Information Science, 45(1), 12.


  6. Buda, M., Maki, A., Mazurowski, M.A. (2017). A systematic study of the class imbalance problem in convolutional neural networks. arXiv:1710.05381.

  7. Chawla, N.V. (2009). Data mining for imbalanced datasets: An overview. In: Data mining and knowledge discovery handbook, pp. 875–886. Springer.

  8. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.


  9. Dietterich, T.G. (2002). Ensemble Learning. The handbook of brain theory and neural networks, 2nd edn. Cambridge: MIT Press.


  10. Dietterich, T.G., & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2(1), 263–286.


  11. Freund, Y., & Schapire, R.E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55 (1), 119–139.


  12. Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep learning. London: MIT Press.


  13. Han, J., Pei, J., Kamber, M. (2011). Data mining: Concepts and techniques, 3rd edn., Waltham, Morgan Kaufmann.

  14. Hastie, T., Rosset, S., Zhu, J., Zou, H. (2009). Multi-class AdaBoost. Statistics and Its Interface, 2(3), 349–360.


  15. Ho, T.K. (1995). Random Decision Forests. In: Proceedings of the 3rd international conference on document analysis and recognition, pp. 278–282.

  16. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T. (2014). Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM international conference on multimedia, pp. 675–678.

  17. Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images. Master’s thesis, department of computer science. Canada: University of Toronto.


18. Krizhevsky, A., Sutskever, I., Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. In: Proceedings of the advances in neural information processing systems, pp. 1097–1105.

  19. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P. (1998). Gradient-based learning applied to document recognition. In: Proceedings of the IEEE, pp. 2278–2324.

  20. Lee, S., Purushwalkam, S., Cogswell, M., Crandall, D., Batra, D. (2015). Why M heads are better than one: Training a diverse ensemble of deep networks. arXiv:1511.06314.

  21. Lin, M., Chen, Q., Yan, S. (2014). Network in network. In: Proceedings of the international conference on learning representations.

  22. Masko, D., & Hensman, P. (2015). The impact of imbalanced training data for convolutional neural networks. Bachelor’s thesis, school of computer science and communication. Sweden: KTH Royal Institute of Technology.


  23. Müller, H., Michoux, N., Bandon, D., Geissbuhler, A. (2004). A review of content-based image retrieval systems in medical applications - clinical benefits and future directions. International Journal of Medical Informatics, 73(1), 1–23.


  24. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y. (2011). Reading digits in natural images with unsupervised feature learning. In: Proceedings of the NIPS workshop on deep learning and unsupervised feature learning.

  25. Nilsback, M.E., & Zisserman, A. (2008). Automated flower classification over a large number of classes. In: Proceedings of the 6th Indian conference on computer vision, graphics & image processing, pp. 722–729.

  26. Opitz, M., Possegger, H., Bischof, H. (2016). Efficient model averaging for deep neural networks. In: Proceedings of the 13th Asian conference on computer vision (ACCV’16).

  27. Pearson, K. (1895). Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58, 240–242.


  28. Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.


29. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.


30. Schapire, R.E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227.


  31. Seiffert, C., Khoshgoftaar, T.M., Van Hulse, J., Napolitano, A. (2008). Resampling or reweighting: A comparison of boosting implementations. In: Proceedings of the IEEE international conference on tools with artificial intelligence, pp. 445–451.

  32. Skalak, D.B. (1996). The sources of increased accuracy for two proposed boosting algorithms. In: Proceedings of the american association for artificial intelligence workshop on integrating multiple learned models.

  33. Van Rijsbergen, C. (1979). Information Retrieval. London: Butterworth.


34. Vriesmann, L.M., de Souza Britto Jr, A., De Oliveira, L.E.S., Sabourin, R., Ko, A.H.R. (2012). Improving a dynamic ensemble selection method based on oracle information. International Journal of Innovative Computing and Applications, 4(3-4), 184–200.


  35. Wei, X.S., Gao, B.B., Wu, J. (2015). Deep spatial pyramid ensemble for cultural event recognition. In: Proceedings of the IEEE international conference on computer vision workshops, pp. 38–44.

  36. Xie, J., Xu, B., Chuang, Z. (2013). Horizontal and vertical ensemble with deep representation for classification. In: Proceedings of the ICML workshop on representation learning.

  37. Yosinski, J., Clune, J., Bengio, Y., Lipson, H. (2014). How transferable are features in deep neural networks?. In: Proceedings of the advances in neural information processing systems, pp. 3320–3328.

  38. Yule, G.U. (1900). On the association of attributes in statistics: With illustrations from the material of the childhood society. Philosophical Transactions of the Royal Society of London Series A, 194, 257–319.


  39. Zhou, J.T., Tsang, I.W., Ho, S.S., Muller, K.R. (2016). N-ary error correcting coding scheme. arXiv:1603.05850.

  40. Zhou, Z.H. (2012). Ensemble methods: Foundations and algorithms, Chapman & Hall/CRC, Boca Raton.

Author information

Correspondence to Kaikai Zhao.

Appendix

A. Supporting Evidence for the Effect of the Third Factor in Hypothesis 1

To verify the effect of merging classes on the feature learning of the DNN base learners of N-ary ECOC, we train the AlexNet (Krizhevsky et al. 2012) using the ILSVRC2012 dataset with two different encoding strategies. The first one is ECOC encoding, in which the degree of merging classes is large. We follow Yosinski et al. (2014) to split the 1,000 classes into two meta-classes. Each meta-class contains approximately half (551 classes versus 449 classes) of the 1,000 original classes. The second one is RI encoding (keeping the original 1,000 classes), in which the degree of merging classes is small.
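In N-ary ECOC, each base learner is trained on a relabeling of the data in which the NC original classes are merged into N meta-classes. As a rough illustration of such an encoding (the function name and the guarantee that every meta-class is non-empty are our assumptions; the exact ILSVRC2012 split above follows Yosinski et al. 2014, not this sketch):

```python
import random

def make_meta_class_mapping(num_classes, n, seed=None):
    """Randomly split `num_classes` original classes into `n` meta-classes.

    Returns a list `mapping` where mapping[c] is the meta-class label
    (0..n-1) assigned to original class c. Each meta-class is seeded with
    one class so that none of them ends up empty.
    """
    rng = random.Random(seed)
    classes = list(range(num_classes))
    rng.shuffle(classes)
    mapping = [0] * num_classes
    # Seed each meta-class with one class, then assign the rest at random.
    for meta, c in enumerate(classes[:n]):
        mapping[c] = meta
    for c in classes[n:]:
        mapping[c] = rng.randrange(n)
    return mapping

# The ECOC encoding above: 1,000 classes merged into two meta-classes.
ecoc = make_meta_class_mapping(1000, 2, seed=0)
# The RI encoding: n = num_classes, so every class keeps its own label.
ri = make_meta_class_mapping(1000, 1000, seed=0)
print(len(set(ecoc)))  # 2
```

Note that setting n = num_classes degenerates into a permutation of the original labels, which is why RI encoding involves no merging at all.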

Figure 5 shows the visualizations of the learned features of the first convolutional layer at different iterations (5,000, 50,000 and 450,000) for the two encoding strategies. There are six plots: Figs. 5a, b and c correspond to the ECOC encoding, and Figs. 5d, e and f to the RI encoding. Each plot contains 96 squared blocks, each of which represents a learned feature map (also called a filter) of size 11 (width) × 11 (height) × 3 (channels).

Fig. 5 Visualizations of the first convolutional layer (conv1) of the trained AlexNet models with two meta-classes (the first row) and 1,000 meta-classes (the second row) at different iterations. Note that the maximum number of iterations is 450,000

We see from Fig. 5 that a DNN with two meta-classes (Fig. 5a, b, c) tends to learn less representative features compared with a DNN with 1,000 classes (Fig. 5d, e and f). For instance, if we compare Fig. 5a with Fig. 5d, we see that most of the learned features of the DNN with ECOC encoding at the 5,000th iteration look like noisy features (e.g., the lower half of the rows of the learned features in Fig. 5a). However, the DNN with RI encoding at the 5,000th iteration has learned many representative features (e.g., the edges and the color patterns in Fig. 5d).

If we compare Fig. 5b with Fig. 5e, i.e., at the 50,000th iteration, we see that the DNN with ECOC encoding (Fig. 5b) has learned more representative features (compared with itself at the 5,000th iteration in Fig. 5a), while the DNN with RI encoding (Fig. 5e) keeps refining its learned features.

If we compare Fig. 5c with Fig. 5f, i.e., at the 450,000th iteration (the end of the training), we see that the learned features of the DNN with ECOC encoding (Fig. 5c) are not as representative as those of the DNN with RI encoding (Fig. 5f) in terms of the complexity of their patterns. For instance, most of the color patches of the DNN with ECOC encoding resemble simple gradient maps along a single edge direction, while many of the color patches of the DNN with RI encoding represent more complex gradient maps, such as a color patch surrounded by patches of different colors (e.g., the features in the 6th row and the 5th column, the 6th row and the 7th column, and the last row and the 5th column of Fig. 5f), which are useful for detecting a sharp change in color in almost all directions.

To verify the effect of merging classes on the generalization ability of the DNN base learners of N-ary ECOC, we show the average base model accuracies (BAs) of the tested DNN models (as listed in Table 2 in Section 3.1) with respect to N in Fig. 6, as BA shows how well the learned DNNs generalize to the test data. We see from Fig. 6 that, for the DNNs that follow Pattern 1 (MNIST-LeNet, SVHN, CIFAR10-QUICK and CIFAR10-FULL), the BA generally decreases as N increases. Moreover, the base learners with N < NC have higher BAs than the BA of RI (N = NC). Two possible reasons behind the above observations are that, for a dataset with a small number of classes (i.e., NC < 25), the effect of merging classes is small, and the BA is practically determined by the value of N (the smaller N is, the easier the classification task).

Fig. 6 Average base model accuracies with respect to N

On the other hand, for the DNNs that follow Pattern 2 (CIFAR100-NIN, CIFAR100-QUICK and FLOWER102-AlexNet), the BA first decreases by a large amount (more than 10%), then slowly increases as N continues to increase. Moreover, most of the base learners with N < NC have BAs which are much lower than the BA of RI. Intuitively, an N-class (N < 100) classification problem is easier than a 100-class classification problem. However, the results are the opposite. One possible reason behind the above observations is that, for a dataset with a large number of classes (e.g., NC ≥ 100), when N is small (N ≪ NC), NC/N is large (i.e., the degree of merging classes is large), thus the BA is generally low compared with the BA of RI. Note that the DNN with N = 2 (a binary classification problem) corresponds to the easiest classification task (i.e., a random guess will be 50% correct if the class distribution is balanced) among all possible values of N (2 ≤ N ≤ NC), and thus it can still have a high BA. When N begins to increase, the accuracy of a random guess drops sharply, for instance, to 33% for N = 3 and 25% for N = 4, even if the class distribution is balanced. Thus there is a drop in the BA. When N further increases, the effect of merging classes is weakened, thus the BA begins to increase and finally saturates.

The above observations suggest that both the feature learning and the generalization ability of the DNN base learners of N-ary ECOC are largely influenced by the effect of merging classes.

B. Our Extended Disagreement Measure

As defined in Section 4.3, m is the number of examples in the test dataset, Ci and Cj are two base learners of N-ary ECOC, and yk,i and yk,j are the predictions of Ci and Cj, respectively, on the k th test example. We propose to map the predicted meta-class label yk,i (each label takes a value in {0, ..., N − 1}) back to the original class labels (each taking a value in {0, ..., NC − 1}), and then construct a one-by-NC bit vector \(y_{k,i}^{\prime }\) that indicates whether each original class label is included in the predicted meta-class yk,i. Thus the mapping from yk,i to \(y_{k,i}^{\prime }\) is \(\lbrace 0,\ ...,\ N - 1 \rbrace \rightarrow \lbrace 0,\ 1\rbrace ^{N_{C}}\).

Take Table 1(b) in Section 2 as an example: suppose the base learners trained with f0 and f2 both predict the meta-class 0 for the k th test example, i.e., yk,0 = 0 and yk,2 = 0. Since the meta-class 0 of f0 consists of the classes c0 and c8, while the meta-class 0 of f2 consists of the classes c2 and c6, the corresponding \(y_{k,0}^{\prime }\) and \(y_{k,2}^{\prime }\) are \(y_{k,0}^{\prime }\) = [1 0 0 0 0 0 0 0 1 0] and \(y_{k,2}^{\prime }\) = [0 0 1 0 0 0 1 0 0 0]. That is, the c0 and c8 positions of \(y_{k,0}^{\prime }\), as well as the c2 and c6 positions of \(y_{k,2}^{\prime }\), are 1, while all other positions are 0.

Then we estimate the Hamming distance between \(y_{k,i}^{\prime }\) and \(y_{k,j}^{\prime }\), i.e., \((y_{k,i}^{\prime } - y_{k,j}^{\prime } ) (y_{k,i}^{\prime } - y_{k,j}^{\prime } )^{T}\), as the distance between yk,i and yk,j. For instance, in the above example, the distance between the predictions of the meta-class 0 by DNNs trained with f0 and f2 is 4 as four bit positions of \(y_{k,i}^{\prime }\) and \(y_{k,j}^{\prime }\) are different.

The advantage of our proposal is that it fits our intuition: the more overlap there is between the classes included in the predicted meta-classes, the smaller the distance. In the general case where the predicted meta-classes of two base learners share common classes, the distance measured by our proposal is smaller than in the above example, in which no common class exists. For instance, if the predicted meta-classes are 2 and 3 for f0 and f2, respectively (note that the meta-class 2 of f0 and the meta-class 3 of f2 share a common class c1), then the distance is 2, as only two bit positions of \(y_{k,i}^{\prime }\) and \(y_{k,j}^{\prime }\) are different. In the extreme case that the classes included in the predicted meta-classes are identical, the distance is 0; for instance, both the meta-class 0 of f0 and the meta-class 4 of f2 consist of the classes c0 and c8.
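The mapping and the distance can be made concrete with a short sketch that reproduces the Table 1(b) example above (the meta-class compositions are copied from that example; the helper name and array layout are ours):

```python
import numpy as np

NUM_CLASSES = 10  # NC in the Table 1(b) example

# Compositions of the meta-classes used in the example:
# meta-class 0 of f0 = {c0, c8}, meta-class 0 of f2 = {c2, c6}.
F0_META = {0: [0, 8]}
F2_META = {0: [2, 6]}

def to_bit_vector(meta_label, meta_composition, num_classes=NUM_CLASSES):
    """Map a predicted meta-class label to a one-by-NC bit vector with 1s
    exactly at the original classes contained in that meta-class."""
    v = np.zeros(num_classes, dtype=int)
    v[meta_composition[meta_label]] = 1
    return v

y0 = to_bit_vector(0, F0_META)  # [1 0 0 0 0 0 0 0 1 0]
y2 = to_bit_vector(0, F2_META)  # [0 0 1 0 0 0 1 0 0 0]

# (y' - y'')(y' - y'')^T reduces to the Hamming distance for 0/1 vectors.
d = int((y0 - y2) @ (y0 - y2))
print(d)  # 4
```

Since the two meta-classes share no original class, all four set bits differ and the distance is 4, matching the example.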

Finally, we propose to estimate the disagreement measure of the base learners of N-ary ECOC by

$$ \mathit{Dis}_{i,j}^{\prime} = \frac{1}{m} \sum\limits_{k = 1}^{m} (y_{k,i}^{\prime} - y_{k,j}^{\prime} ) (y_{k,i}^{\prime} - y_{k,j}^{\prime} )^{T}, $$
(6)

where \(y_{k,i}^{\prime }\) and \(y_{k,j}^{\prime }\) are the mapped back predictions of the base learners Ci and Cj, respectively, on the k th test example.

By averaging the disagreement measures over all pairs of base learners based on Skalak (1996), we obtain

$$ \mathit{Dis}_{avg} = \frac{2}{N_{L}(N_{L}-1)} \sum\limits_{i = 1}^{N_{L}-1} \sum\limits_{j=i + 1}^{N_{L}} \mathit{Dis}_{i,j}^{\prime}. $$
(7)
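Equations (6) and (7) can be sketched in a few lines of numpy, assuming the predictions have already been mapped back to bit vectors; the function name and the (NL, m, NC) array layout are our assumptions:

```python
import numpy as np
from itertools import combinations

def disagreement_avg(bit_predictions):
    """Average extended disagreement following Eqs. (6) and (7).

    bit_predictions: array of shape (N_L, m, NC) holding the mapped-back
    bit-vector predictions y'_{k,i} for base learner i on test example k.
    Dis'_{i,j} averages the per-example Hamming distance over the m test
    examples; Dis_avg then averages over all N_L(N_L-1)/2 learner pairs.
    """
    n_l = bit_predictions.shape[0]
    total = 0.0
    for i, j in combinations(range(n_l), 2):
        diff = bit_predictions[i] - bit_predictions[j]   # shape (m, NC)
        total += np.mean(np.sum(diff * diff, axis=1))    # Eq. (6)
    return 2.0 * total / (n_l * (n_l - 1))               # Eq. (7)

# Single pair, single test example: the y'_{k,0} and y'_{k,2} vectors from
# the Table 1(b) example give Dis_avg = 4.
preds = np.array([[[1, 0, 0, 0, 0, 0, 0, 0, 1, 0]],
                  [[0, 0, 1, 0, 0, 0, 1, 0, 0, 0]]])
print(disagreement_avg(preds))  # 4.0
```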


Cite this article

Zhao, K., Matsukawa, T. & Suzuki, E. Experimental validation for N-ary error correcting output codes for ensemble learning of deep neural networks. J Intell Inf Syst 52, 367–392 (2019). https://doi.org/10.1007/s10844-018-0516-5


Keywords

  • N-ary error correcting output codes
  • Ensemble learning
  • Deep neural networks
  • Image classification