1 Introduction

Deep Neural Networks (DNNs) have become increasingly popular in computer science, especially in the computer vision community, for the classification of images and other types of multimedia content. The neural network was originally based on a binary classifier, the perceptron, but it was later extended with more output nodes for multiclass classification. Each node in the output layer of a DNN can still be regarded as a binary classifier, and the outputs of these classifiers can be combined by information fusion as an alternative to the state-of-the-art multiclass DNN classifiers.

In computer vision, object detection in images is a common task, and there are application areas where not only the misclassification ratio (which measures the average) but also the largest error between any two classes is crucial. For example, the recognition of traffic signs is important, and confusing one sign with another should be avoided. So, in our special task, the aim was not only to achieve high accuracy in the classification but also to avoid a large error between any two classes. The accuracy indicator can measure the average goodness (the number of correct decisions divided by all decisions of the classifier), but it cannot measure the bottleneck of this classification problem. Therefore, new indicators were also required to evaluate the classifiers.

The focus of this paper is on the error-correcting output codes (ECOC) combination of binary DNN classifiers. The decoding of the results of the binary classifiers is based on the nearest code (closest row) in the ECOC matrix, which contains ones and zeros indicating whether a class belongs to a binary classifier or not. The coding methods can be categorized into data-dependent and data-independent codes; our research focuses only on the latter. Both have a large literature, and current ECOC research deals with weighted decoding for the competence reliability problem [24] and novel strategies focusing on the accuracy-complexity trade-off [1]. In the ECOC literature, early researchers used classical binary classifiers (e.g., SVM) as base learners; recently, deep neural networks (DNNs) have also been considered. The ECOC generation can be based on the Hadamard matrix; this Hadamard-ECOC algorithm with a DNN as the base classifier has been proposed in a recent paper [9]. However, the drawback of Hadamard-ECOC is that the number of binary classifiers cannot be chosen arbitrarily; it is always equal to \(2^{k} - 1\). For example, if the number of classes is 5 or 6, then the number of binary classifiers is 7. An open question in ECOC design is how to construct a good ECOC matrix. Existing methods answer this question, but they focus only on the general goodness of the classifier. The main purpose of our research was to strengthen the bottleneck of the ensemble classifier, where we examined (i) the mistakes and (ii) the mistake pairs in the confusion matrix of the classifier. The latter consists of two values: the number of errors where a given real class is predicted as some other class, and the number of errors where those two classes switch roles. The number of errors in the confusion matrix can be normalized by (i) the number of elements in the corresponding real classes or by (ii) all elements. In the former case, we took into account the sum of the mistake pairs (as can be seen later in Sect. 4.1, where the LCPE indicator is defined), and in the latter case, we used the number of mistakes (for the LME indicator, also defined in Sect. 4.1). Our aim was to minimize the largest values of these ratios. Our motivation in this research was to find a theoretically optimal design for this subproblem and to use it in the experiments with greater flexibility, where the number of binary classifiers can be selected from a range.

2 Related works

In machine learning, one can take advantage of using multiple models together through different strategies, like bagging, boosting [15], and stacking [16], and a frequent case in this topic is when the model is a binary classifier. In the literature, there are many possible approaches to reduce multiclass to binary classification problems [12, 34] (in special cases the binary problem can be a binary descriptor), which can be used in different areas, like medical prognosis [19], stock market prediction [37], or visual computing, e.g., pedestrian detection [42]. Various authors modified the multiclass classification approach, such as one versus one (OVO) [17, 44], one versus all (OVA) [13, 39], which is also used for transforming classifier scores into estimates of multiclass probabilities [41], and the Directed Acyclic Graph (DAG), which creates many binary (base) classifiers and combines their results to determine the correct class label [3, 10, 27]. Error Correcting Output Coding (ECOC) [11] is also a frequently used approach for the multiclass classification problem; ECOC is a generic ensemble classification framework built around an ECOC matrix, which decomposes the task into several two-class problems, and the results of the binary classifiers are aggregated by information fusion. For a special task (cancer classification), where only a few samples are available in each class, a special ECOC with a hierarchical ensemble strategy, named Hierarchical Ensemble of Error Correcting Output Codes (HE-ECOC) [25], was a good solution. Besides the multiclass problem, biometric cryptosystems can also use the ECOC framework because they often require a binarization phase to transform the original real-valued templates into their binary versions; there is a solution that combines a genetic algorithm with ECOC matrices using specific crossover, mutation, and extension operations that consider the properties of optimally constructed ECOC matrices [28].

Many types of machine learning methods, e.g., the Support Vector Machine [23], SVM with a Gaussian kernel [32], logistic regression, the naive Bayes classifier, the multi-layer perceptron, the C4.5 decision tree, and the Multiple Birth Least Squares Support Vector Machine [8], have already been investigated in information fusion [14]. Researchers constructed combined methods; for example, OVO-SVM, OVA-SVM, and ECOC-SVM were developed [36] as combinations of SVM with OVO, OVA [26], and ECOC; but it is worth examining Deep Neural Networks (DNNs) as well. Deep Neural Networks can be used in end-to-end machine learning for image classification without manual feature extraction; thus, our work focused on image datasets instead of investigating the UCI datasets [40, 45] with a simple data structure. A DNN consists of smaller parts that can be considered binary classifiers (perceptrons); thus, in this paper, we analyzed only classifiers where the number of classes in the base learners is two, although there are other possibilities, e.g., the one-class classifier [21], the N-ary classifier [43], and the Generic Subclass Ensemble [5].

Researchers developed ECOC for Recurrent Neural Networks (especially LSTM) [29] and for other types of DNN; for the Convolutional Neural Network (CNN), a construction based on the Hadamard matrix, the so-called Hadamard-ECOC, was proposed [9]. Another study analyzes ECOC in the context of deep learning (CNN) research, focusing on the accuracy-complexity trade-off [1]. We also planned to develop ECOC for CNNs, not only constructing an ECOC matrix but theoretically finding the best matrix from the point of view mentioned at the end of Sect. 1. We considered the base binary learners as independent classifiers, in contrast to the paper [30], where the authors examined the concept of correlation among them.

Row separation is an important measure for evaluating the error-correcting ability of the ECOC coding matrix [2]. The codes for different classes are expected to be as dissimilar as possible; otherwise, it is easier to commit errors. Thus, the capability of error correction relies on the distances among the rows [43]. The absolute distance and the Hamming distance can be examined, but we used only the latter because in the binary case they are equivalent. The Hamming distance cannot be exploited in a DNN without ECOC coding, because in a DNN one-hot encoding is required for multiclass labels. From the coding theory perspective, the Hamming distance between one-hot codes is always two, which does not allow error-detection or error-correcting capabilities. ECOC coding provides more possibilities for encoding categorical data into the output codes, which mitigates the limitations of one-hot encoding mentioned above [20]. So, we used ECOC coding with the Hamming distance, but contrary to the paper [20], we did not apply Zadeh fuzzy logic.
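For concreteness, a minimal sketch of the nearest-code decoding used in ECOC fusion could look as follows (our illustration, not the exact implementation of the cited works; thresholding the binary outputs at 0.5 is an assumption, and weighted or soft decoding variants are not shown):

```python
import numpy as np

def ecoc_decode(binary_outputs, ecoc_matrix):
    """Nearest-row ECOC decoding: return the class whose code row is closest
    (in Hamming distance) to the thresholded binary classifier outputs."""
    bits = (np.asarray(binary_outputs) >= 0.5).astype(int)  # threshold assumed at 0.5
    distances = np.sum(ecoc_matrix != bits, axis=1)         # Hamming distance to each code row
    return int(np.argmin(distances))
```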

Deep Neural Networks (DNNs) are multiclass classifiers, but we used networks constructed so that the DNN operated as a binary classifier. Two types of construction were developed: one type was a self-made Convolutional Neural Network (briefly, CNN21), and the other type was taken from the literature. CNN21 consists of 21 layers: convolutional layers with the ReLU activation function, max-pooling layers, dropout, and a softmax layer; and we used the binary cross-entropy loss function (the details can be seen in Table 18 in the Appendix).

Furthermore, we used the VGG19 [33] and ResNet50 [18] architectures from the literature; however, the last fully connected layer (of both VGG19 and ResNet50) was replaced with a new layer for binary classification. For the multiclass problem, an appropriate ECOC construction was needed, which is described in the next section.
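As an illustration of how such a binary base learner can be assembled, a minimal Keras-style sketch is shown below; the input size, pooling layer, and optimizer are our own assumptions for illustration, and the paper's exact layer configuration is given in Table 18 in the Appendix.

```python
import tensorflow as tf

def make_binary_base_learner(input_shape=(224, 224, 3)):
    # Pretrained ResNet50 backbone without its original multiclass head
    backbone = tf.keras.applications.ResNet50(
        weights="imagenet", include_top=False, input_shape=input_shape)
    x = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
    # New output layer for binary classification (replaces the last fully connected layer)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(backbone.input, out)
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",  # loss function used in the paper
                  metrics=["accuracy"])
    return model
```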

We demonstrated our information fusion results on complex datasets, testing them on four datasets, e.g., on the Fashion-MNIST dataset instead of the easier MNIST dataset used in papers [4, 6] (and the paper [31] also confirms that Fashion-MNIST is more difficult for deep learning models to learn).

3 Min–Max ECOC method

3.1 ECOC construction

The number of rows in the ECOC matrix is given because it is equal to the number of classes (the number of different labels) in the classification problem; let us denote this number by \(N_L\). The number of columns is larger than \(N_L\), but a value that is too large causes a long runtime during learning; this number is denoted by N in this paper. Before calculating the theoretical largest N (depending on \(N_L\)), we present the constraints of the design. No column vector may consist of only “1” values (or only “0” values), because the columns encode the classes for the binary classification learning, and at least one class must fall on each side of the two-group split. Another constraint concerns uniqueness: each column vector should be different from the others, otherwise the columns would be redundant during learning. This uniqueness condition is not enough; an additional condition is that two column vectors cannot be complements of each other, because in binary classification swapping the “1” and “0” label values leads to the same situation with the same results. Thus, the maximal number of columns is equal to \(2^{{N_{L} - 1}} - 1\), as can be seen in the next equation.

$$ N \le 2^{{N_{L} - 1}} - 1 $$
(1)

When N reaches the maximum value according to this equation, we call this kind of matrix a full ECOC matrix. In the literature, the number of columns is greater than or equal to the number of rows in the ECOC matrix, and this restriction was also applied in our work, i.e., \(N_{L} \le N\) (this restriction is necessary but not sufficient for error-correcting capabilities).
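To make these constraints concrete, a small sketch that checks whether a candidate 0/1 matrix satisfies them could look as follows (our illustration in Python/NumPy; rows correspond to classes, columns to binary learners):

```python
import numpy as np

def is_valid_ecoc(M):
    """Check the design constraints for a 0/1 ECOC matrix M (rows: classes, columns: dichotomies)."""
    n_classes, n_cols = M.shape
    cols = [tuple(M[:, k]) for k in range(n_cols)]
    for k, col in enumerate(cols):
        # no constant column: each binary learner must see both label values
        if len(set(col)) < 2:
            return False
        # columns must be pairwise distinct and must not be complements of each other
        comp = tuple(1 - c for c in col)
        if col in cols[:k] or comp in cols[:k]:
            return False
    # at most 2^(N_L - 1) - 1 admissible columns exist (Eq. 1)
    return n_cols <= 2 ** (n_classes - 1) - 1
```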

3.2 Construction of full ECOC matrix

In the ECOC matrix, the row vectors determine the goodness of the decisions, so having different row vectors is an advantage. The difference between row vectors is measured by the Hamming distance function. The symmetric matrix constructed from each pair of row vectors is called the Hamming matrix; it contains the Hamming distance values of the pairs.

Theorem 1

For the full ECOC matrix, the entries in the Hamming matrix are all the same, except for the diagonal, where the values are equal to zero.

Theorem 2

For the full ECOC matrix, the positive entries in the Hamming matrix are equal to \(2^{{N_{L} - 2}}\).

A full ECOC matrix can be constructed from a smaller full ECOC matrix whose number of rows is less by one. We describe the construction algorithm, and based on it the proofs of both theorems will be presented.

Algorithm 1 (shown as a figure in the original article): recursive construction of a full ECOC matrix with \(N_L + 1\) rows from a full ECOC matrix with \(N_L\) rows.
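A minimal sketch of this construction, reconstructed from the steps spelled out in the proof below, is given here; the concrete choice of the appended column and the ordering of the “10” snippets are assumptions of ours, and the result is a full ECOC matrix up to column order and complementation.

```python
import numpy as np

def algorithm1_extend(M):
    """Sketch of Algorithm 1: build an (NL+1)-class full ECOC matrix from an NL-class one.

    Steps (reconstructed from the proof of Theorems 1-2): duplicate every column,
    append one extra column (assumed all-zeros for the original rows), and append
    a new last row of "10" snippets over the duplicated pairs, ending in 1.
    """
    n_rows, n_cols = M.shape
    duplicated = np.repeat(M, 2, axis=1)               # each column appears twice
    extra_col = np.zeros((n_rows, 1), dtype=M.dtype)   # assumed layout of the added column
    top = np.hstack([duplicated, extra_col])
    last_row = np.array([[1, 0] * n_cols + [1]], dtype=M.dtype)  # "10" snippets + final 1
    return np.vstack([top, last_row])

# Example: starting from the 3-class full ECOC matrix (the identity matrix, cf. Table 1)
full3 = np.eye(3, dtype=int)
full4 = algorithm1_extend(full3)   # a 4 x 7 full ECOC matrix (cf. Table 2)
```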

Proofs of the theorems

We prove both theorems by mathematical induction. When \(N_L = 3\), the matrix in Table 1 is a full ECOC matrix, and the statements of Theorems 1 and 2 are true because the Hamming distance between any two different rows is 2; thus the diagonal entries of the Hamming matrix are 0 and the other entries are equal to 2.

Table 1 Full ECOC matrix with 3 classes (identity matrix)

Let us suppose that both theorems are true when \(N_L = n\); we will prove that they are also true in the next step (\(N_L = n + 1\)). Let us use Algorithm 1; for example, when \(N_L = 3\), the algorithm gives the full ECOC matrix for \(N_L = 4\) presented in Table 2.

Table 2 Full ECOC matrix with 4 classes

The duplication step of the algorithm doubles the Hamming distance between every pair of row vectors of the original matrix. The addition of the last column does not change the Hamming distances among them, because its entries are identical for all original rows. Since in the original matrix the Hamming distances of every pair are equal, in the new matrix these pairs (i.e., all rows except the last one) remain equal as well. The added row (last row) of the new matrix consists of double snippets (“1” and “0”, briefly 10). The Hamming distance between such a snippet and the corresponding snippet (i.e., the entries in the same columns) of any other row vector is always 1, because the corresponding snippet can only be 00 or 11. The number of snippets is \(N = 2^{n - 1} - 1\). The last entry of the last row differs from every other entry in the last column, thus we add 1 when calculating the Hamming distance between the last row and any other row (we obtain \(2^{n-1}\)). These are the distances that belong to the last row; let us now consider the distances between any two rows other than the last one. These Hamming distances are \(2 \cdot 2^{n - 2} = 2^{n - 1}\), as discussed above. Since \(N_L = n + 1\), i.e., \(n = N_{L} - 1\), the Hamming distances are \(2^{n - 1} = 2^{{N_{L} - 2}}\), as stated in Theorem 2. Furthermore, Theorem 1 is also true, because the entries of the new Hamming matrix are all the same, except for the diagonal, where the values are equal to zero. □

3.3 Min–Max ECOC matrix for optimized fusion

In the fusion of outputs, the number of binary classifiers is sometimes too large; thus, the full ECOC matrix is avoided in most cases to keep the learning procedure from becoming too long. The Hamming matrix, which contains the Hamming distance values of each pair of row vectors, can help in the ECOC design. A low value in this matrix indicates that the corresponding row vectors are close to each other, which can cause a larger mistake between the two classes belonging to these rows. If the aim is to minimize the largest values (mistakes and mistake pairs) in the confusion matrix, then we should maximize the lowest value in the Hamming matrix (H) derived from the ECOC matrix M, as can be seen in the next equations.

$$ H_{ij} = \mathop \sum \limits_{k = 1}^{N} \left| {M_{ik} - M_{jk} } \right| \quad i,j = 1,2, \ldots ,N_{L} $$
(2)
$$ maximize\; H_{min} ,\quad {\text{where}}\; H_{min} = \mathop {\min }\limits_{i \ne j} \left( {H_{ij} } \right) $$
(3)

Let S(NL, N) denote the set of all possible ECOC matrices (with \(N_L\) rows and N columns) that meet all requirements (constraints) described above. The largest \(H_{min}\) in this set is denoted by \(H_{opt}\), and the ECOC matrix achieving it is called the Min–Max ECOC matrix.

$$ H_{opt} \left( {N_{L} ,N} \right) = \mathop {\max }\limits_{{M \in S\left( {N_{L} ,N} \right)}} \left( {H_{min} } \right) $$
(4)
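As a concrete sketch of Eqs. (2)–(4), the Hamming matrix, \(H_{min}\), and a brute-force search for \(H_{opt}\) over column subsets of the full ECOC matrix can be computed as follows (our illustration in Python/NumPy; restricting the search to subsets of the full matrix relies on the fact that complementing a column does not change row distances, and the exhaustive search is only feasible for small problems):

```python
import numpy as np
from itertools import combinations, product

def hamming_matrix(M):
    """Pairwise Hamming distances between the rows of the ECOC matrix M (Eq. 2)."""
    return np.sum(M[:, None, :] != M[None, :, :], axis=2)

def h_min(M):
    """Smallest off-diagonal entry of the Hamming matrix (Eq. 3)."""
    H = hamming_matrix(M)
    n = H.shape[0]
    return int(H[~np.eye(n, dtype=bool)].min())

def full_ecoc_columns(n_classes):
    """All admissible columns: non-constant 0/1 vectors, one per complement pair (first entry fixed to 1)."""
    cols = []
    for bits in product([0, 1], repeat=n_classes - 1):
        col = (1,) + bits             # fixing the first entry removes complements
        if any(b == 0 for b in col):  # drop the all-ones column
            cols.append(col)
    return np.array(cols).T           # shape: n_classes x (2^(n_classes - 1) - 1)

def h_opt_bruteforce(n_classes, n_cols):
    """Exhaustive search for H_opt(N_L, N) over all column subsets of the full matrix (Eq. 4)."""
    full = full_ecoc_columns(n_classes)
    best = 0
    for idx in combinations(range(full.shape[1]), n_cols):
        best = max(best, h_min(full[:, list(idx)]))
    return best
```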

The full matrix is always a Min–Max ECOC matrix, because all matrices in S(NL, Nmax) are equivalent (they differ only in column order and in complemented columns, which do not affect the Hamming matrix), so there is nothing to optimize. Based on Theorems 1 and 2, we can write the following.

$$ H_{opt} \left( {N_{L} ,N_{max} } \right) = \mathop {\max }\limits_{{M \in S\left( {N_{L} ,N_{max} } \right)}} \left( {H_{min} } \right) = 2^{{N_{L} - 2}} $$
(5)
$$ {\text{where}}\quad N_{max} = 2^{{N_{L} - 1}} - 1 $$
(6)

3.4 Theorems for Min–Max ECOC matrix

Before the learning procedure of the classification, the number of columns (N) should be decided. After this decision, a good ECOC matrix should be found: the optimized one, previously called the Min–Max ECOC matrix, or one close to this optimum when the optimization would take too long. The running time of the optimization is shorter if N is smaller (provided that N is less than half of the maximal \(2^{{N_{L} - 1}} - 1\)); thus, we can reduce the optimization problem to finding a smaller ECOC matrix. If we have a Min–Max ECOC matrix in smaller dimensions (the number of rows is one less than the number of classes, the number of columns is half of the chosen N), then based on this Min–Max ECOC matrix we can construct a good ECOC matrix for the actual classification problem: (i) if N is odd, then we should use Algorithm 1; (ii) otherwise (i.e., N is even), Algorithm 2 constructs the output (as can be seen below). Note that although the input matrix of Algorithm 1 is not a full matrix here, the algorithm can still be used.

Algorithm 2 (shown as a figure in the original article): construction of an \(N_L\)-class ECOC matrix with an even number of columns N from a Min–Max ECOC matrix with \(N_L - 1\) rows and N/2 columns.
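Analogously to Algorithm 1, a minimal sketch of Algorithm 2, reconstructed from the proof of Theorem 5 below, could be the following (the layout of the appended row is again our assumption):

```python
import numpy as np

def algorithm2_extend(M):
    """Sketch of Algorithm 2: from an (NL-1) x (N/2) ECOC matrix build an NL x N one.

    Reconstructed from the proof of Theorem 5: duplicate every column and append a
    new last row consisting of "10" snippets over the duplicated column pairs.
    """
    n_rows, n_cols = M.shape
    duplicated = np.repeat(M, 2, axis=1)                   # doubles all row-pair distances
    last_row = np.array([[1, 0] * n_cols], dtype=M.dtype)  # distance N/2 to every original row
    return np.vstack([duplicated, last_row])
```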

While searching for a good ECOC matrix (by random generation or by a constructive algorithm), the \(H_{min}\) values of the trials fall in a range between \(H_{worst}\) (defined below) and \(H_{opt}\).

$$ H_{worst} \left( {N_{L} ,N} \right) = \mathop {\min }\limits_{{M \in S\left( {N_{L} ,N} \right)}} \left( {H_{min} } \right) $$
(7)

For a small fusion problem, where the number of classes is 4, we calculated the \(H_{worst}\) and \(H_{opt}\) values by investigating all alternatives (by exhaustive search); the results can be seen in Table 3.

Table 3 \(H_{worst}\) and \(H_{opt}\) values with different N, where NL = 4

Without exhaustive search, the \(H_{worst}\) values can be determined based on the next theorem, Theorem 3.

Theorem 3

The \(H_{worst}\) value of the full ECOC matrix is equal to its \(H_{opt}\) value. Starting from \(N_{max}\) and decreasing N by 1 (at the same \(N_L\)), the \(H_{worst}\) value is also reduced by 1 until it reaches zero (after that, it remains zero). This value can be determined by a closed-form expression, as can be seen in the next equation.

$$ H_{worst} \left( {N_{L} ,N} \right) = \max \left( {0, 2^{{N_{L} - 2}} - \left( {2^{{N_{L} - 1}} - 1 - N} \right)} \right) = \max \left( {0, N - 2^{{N_{L} - 2}} + 1} \right) $$
(8)
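For example, with \(N_L = 4\) (so \(N_{max} = 7\) and \(2^{{N_L - 2}} = 4\)), Eq. (8) gives \(H_{worst}(4,7) = 4\), \(H_{worst}(4,6) = 3\), \(H_{worst}(4,5) = 2\), and \(H_{worst}(4,4) = 1\).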

Proof of the theorem

In the beginning, we have the full ECOC matrix, and in each step we try to delete a column that reduces the \(H_{min}\) value by the largest amount. We search for the two rows of the actual ECOC matrix whose Hamming distance is the smallest, and we delete a column in which these two rows differ. Thus, the Hamming distance between them decreases by 1. If the two rows are already the same, then the actual \(H_{min}\) value is equal to zero and cannot be reduced further. Otherwise, they differ in at least one column, so by deleting this column the \(H_{min}\) value is also reduced by 1.□

The \(H_{opt}\) values cannot be determined so easily, but based on the next theorems, lower and upper limits can be expressed.

Theorem 4

The upper limit for the \(H_{opt}\) values can be seen in the next equation.

$$ H_{opt} \left( {N_{L} ,N} \right) \le 2^{{N_{L} - 2}} $$
(9)

Proof of the theorem

The equality is true when \(N = N_{max}\); otherwise, the inequality holds, because any matrix in S(NL, N) can be obtained by deleting columns from the full ECOC matrix (up to complementing some columns, which does not change the Hamming distances between rows), and deleting columns cannot cause larger Hamming distances among the rows.

$$ H_{opt} \left( {N_{L} ,N} \right) \le H_{opt} \left( {N_{L} ,N_{max} } \right) = 2^{{N_{L} - 2}} $$
(10)

Theorem 5

The lower limit for the \(H_{opt}\) values can be seen in the following equation.

$$ {\text{min}}\left( {2 \cdot H_{opt} \left( {N_{L} - 1,\lfloor\frac{N}{2}\rfloor} \right),\lfloor\frac{N}{2}\rfloor} \right) \le H_{opt} \left( {N_{L} ,N} \right) $$
(11)

where \(\lfloor x \rfloor\) is the floor of x (x rounded down to an integer).

Proof of the theorem

Let us suppose that we have a Min–Max ECOC matrix with \(N_{L} - 1\) rows and half of N columns. If N is even, then Algorithm 2 constructs the output: in the first step (duplication of columns), all Hamming distances become twice as large, and thus so does the \(H_{min}\) value. The Hamming distance between the additional row and any other row is N/2, because the number of columns before the algorithm was N/2, and the Hamming distance between each (“1”, “0”) snippet of the new row and the corresponding (“0”, “0”) or (“1”, “1”) snippet of an original row is always 1. We do not know which of the two quantities is smaller, thus the new \(H_{min}\) value is the minimum of them, as can be seen in the equation of the theorem.

If N is odd, then using Algorithm 1 we obtain a new ECOC matrix (\(N_L\), N) from the ECOC matrix (\(N_L - 1\), (N − 1)/2). The additional column in the algorithm does not change the Hamming distances, so the statements for the even-N case also hold. By the construction (Algorithm 1 or Algorithm 2), the \(H_{min}\) value described above can always be reached, so \(H_{opt}\) is equal to or larger than this \(H_{min}\) value.□

4 Evaluation

4.1 Metrics for evaluation

In this section, we present our experimental evaluation. We used the well-known accuracy indicator, which is equal to \(1 - misclassification\;rate\) in classification. Accuracy measures the average goodness (the number of correct decisions divided by all decisions), and in our task we were also interested in the goodness of the worst class. Therefore, the minimum of the precision, the minimum of the recall, and the minimum of the F1 value (the smallest value among the per-class F1 values, as can be seen in the next equation) were also measured. We also calculated their average, the macro average of F1 (the sum of the per-class F1 values divided by the number of classes).

$$ minimal\; F_{1} = \mathop {\min }\limits_{c \in C} \frac{2 \cdot precision\left( c \right) \cdot recall\left( c \right)}{{precision\left( c \right) + recall\left( c \right)}} \quad C\, is\, the\, set\, of\,classes $$
(12)

Furthermore, we defined two new indicators based on the confusion matrix \((\overline{\overline{A}} )\) computed from the predicted and the true values.

$$ \overline{\overline{A}} = \left( {a_{ij} } \right) \in R^{n \times n} $$
(13)

The confusion matrix contains the counts of the different cases, where each entry (\(a_{ij}\)) gives the number of cases in which the predicted class is i while the real class is j. We introduce the Largest Class Pair Error (LCPE) as the largest ratio of mistakes between two classes:

$$ LCPE = \mathop {\max }\limits_{i \ne j} \frac{{\left( {a_{ij} + a_{ji} } \right)}}{{\left| {C_{i} } \right| + \left| {C_{j} } \right|}} \quad i,j = 1,2, \ldots ,n $$
(14)

where the denominator is the sum of the numbers of elements in the real classes \(C_i\) and \(C_j\), which can also be determined from the confusion matrix (as can be seen in the next equation).

$$ \left| {C_{j} } \right| = \mathop \sum \limits_{k = 1}^{n} a_{kj} \quad j = 1,2, \ldots ,n $$
(15)

The other new indicator is the Largest Mistake Error (LME), which measures the largest ratio of mistakes among all classes (relative to all cases).

$$ LME = \mathop {\max }\limits_{i \ne j} \frac{{a_{ij} }}{{Sum\left( {\overline{\overline{A}} } \right)}} \quad i,j = 1,2, \ldots ,n $$
(16)

where

$$ Sum\left( {\overline{\overline{A}} } \right) = \mathop \sum \limits_{j = 1}^{n} \mathop \sum \limits_{i = 1}^{n} a_{ij} \quad i,j = 1,2, \ldots ,n $$
(17)
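A small sketch of how these indicators can be computed from predictions follows (our illustration using NumPy and scikit-learn; the helper function names are ours):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

def lcpe(y_true, y_pred):
    """Largest Class Pair Error (Eq. 14): worst pairwise mistake ratio."""
    A = confusion_matrix(y_true, y_pred).T   # A[i, j]: predicted class i, real class j
    class_sizes = A.sum(axis=0)              # |C_j| as in Eq. (15)
    n = A.shape[0]
    worst = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            worst = max(worst, (A[i, j] + A[j, i]) / (class_sizes[i] + class_sizes[j]))
    return worst

def lme(y_true, y_pred):
    """Largest Mistake Error (Eq. 16): largest off-diagonal entry relative to all cases."""
    A = confusion_matrix(y_true, y_pred).T
    off_diag = A - np.diag(np.diag(A))
    return off_diag.max() / A.sum()

def minimal_f1(y_true, y_pred):
    """Minimum of the per-class F1 values (Eq. 12)."""
    return float(f1_score(y_true, y_pred, average=None).min())
```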

4.2 Results on 4 datasets

Our experiments were based on four datasets: the Linnaeus [7], FMNIST [38], GTSRB [35], and CIFAR-10 [22] datasets. These image datasets were selected by a criterion on the number of classes, to obtain a medium (i.e., not too small and not too large) number of categories. Linnaeus consists of 5 classes of images: berry, bird, dog, flower, and other. The images are color images with 256 × 256 pixels; there are 1200 training images and 400 test images in each class. FMNIST (Fashion-MNIST) is a dataset comprising 28 × 28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per category. The training set has 60,000 images and the test set has 10,000 images. We randomly selected 5 and 6 classes among these categories (this was repeated a few times, and we calculated the average of the results), and we also tested all 10 categories. The third dataset is the GTSRB (German Traffic Sign Recognition Benchmark); this benchmark is a multi-class, single-image classification task, where each image contains one traffic sign. The images are stored in PPM format, and their sizes vary from 15 × 15 to 250 × 250 pixels. The dataset contains more than 50,000 images in total; 75% of them form the training set, and the rest form the test set. The dataset consists of 43 classes, but we randomly selected 5, 6, and 10 classes among them (several times). The last dataset, CIFAR-10, contains 60,000 32 × 32 color images in 10 different classes (airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks); our experiments were based on these 10 classes. There are 6,000 images of each class, of which 5,000 belong to the training set and the rest form the test set.

All training methods in our experiments were based on the training set, and the next tables present the results on the test set. The first method in every table is a deep learning method without information fusion; we used three different Convolutional Neural Network (CNN) multiclass classifiers: (i) CNN21 for FMNIST and CIFAR-10, (ii) the VGG19 architecture [33] for Linnaeus and GTSRB, and (iii) ResNet50 [18] for CIFAR-10 (CIFAR-10 was classified by both CNN21 and ResNet to investigate the differences). The next method is the OVA information fusion method from binary classifiers; we implemented it based on the well-known one-vs-all mechanism. In this case, the CNN21, VGG19, and ResNet50 architectures were the same as described in Sect. 2, and these binary classifiers were used as base classifiers. Hadamard-ECOC [9] (briefly H-ECOC later in our evaluation) was also one of the methods in our comparison. Besides that, we investigated the ECOC method with random ECOC matrices, similarly to a recent paper [1], where the authors call this method randECOC. The next tables contain the average results of the trials with this random ECOC matrix; the two subscripts give the number of rows and columns of the matrix. The Full ECOC methods contain the maximal number of binary classifiers (\(2^{{N_{L} - 1}} - 1\)), i.e., 15 (and 31) classifiers when the number of classes is equal to 5 (and 6).

With 10 classes, the maximal number of binary classifiers would be too large (511), so Full ECOC was omitted in this case. The Min–Max ECOC with a smaller number of binary classifiers (e.g., 20) could not be calculated either, because of the computational cost. Even though the search for the theoretically best 20-column subset among the 511 possible columns would be very time-consuming, an approximately good solution is possible. For this purpose, we generated an ECOC10,20 from the Min–Max ECOC6,10 by applying Algorithm 2 in several steps (because at the output of the algorithm in each step, the best half of the columns should be selected based on the Min–Max criterion); we briefly call it gen-MM-ECOC10,20 (generated Min–Max ECOC10,20), and the matrix is shown in Table 16 in the Appendix.

The next tables contain the results, where “prec. min” is the minimum of the precision, “recall min” denotes the minimum of the recall, “F1 macro” is the macro average of the F1 values, and “F1 min” denotes the minimum of the F1 values. For these indicators and the accuracy, larger is better. The LCPE and LME indicators defined above measure the error, so for them smaller is better. In all tables, the bold numbers are the best ones (i.e., the smallest for LCPE and LME, the largest for the other indicators), and the underlined numbers are the second-best ones.

Table 4 presents the results on the Linnaeus dataset, where Min–Max ECOC5,10 and the Full ECOC were the best models (but the Full ECOC is also a Min–Max matrix, as we noted earlier). The Min–Max ECOC5,10, i.e., the optimized ECOC matrix for 5 classes with the fusion of 10 binary classifiers, is shown in Table 13 in the Appendix. This optimized ECOC matrix was found by examining all subsets containing ten elements from the whole set of 15 elements.

Table 4 Accuracy and pair error results on 5 classes of the Linnaeus dataset

Table 5 contains the results on the FMNIST dataset (with only 5 classes), where the Min–Max ECOC5,10 outperforms the other methods; and Table 6 (with 6 classes) shows that the Min–Max ECOC6,15 is the second best after the Full ECOC method. The theoretically best ECOC for 10 classes could not be calculated, but a matrix generated from a smaller Min–Max ECOC, the gen-MM-ECOC10,20, was used. This generated ECOC exceeded every competitor from every point of view according to Table 7.

Table 5 Accuracy and pair error results on 5 random classes from the FMNIST dataset
Table 6 Accuracy and pair error results on 6 random classes from the FMNIST dataset
Table 7 Accuracy and pair error results on 10 classes from the FMNIST dataset

On the GTSRB dataset, the Min–Max ECOC methods (Min–Max ECOC5,10, Min–Max ECOC6,10, and Min–Max ECOC6,15) also surpass the competitor methods, as can be seen in Tables 8 and 9. The Min–Max ECOC6,10 (and Min–Max ECOC6,15), i.e., the optimized ECOC matrix for 6 classes with the fusion of 10 (and 15) binary classifiers, is shown in Table 14 (and Table 15) in the Appendix. These optimized ECOC matrices were also found by examining all subsets of the whole set. Table 10 shows that the gen-MM-ECOC10,20 is the best among the investigated methods.

Table 8 Accuracy and pair error results on 5 random classes from the GTSRB dataset
Table 9 Accuracy and pair error results on 6 random classes from the GTSRB dataset
Table 10 Accuracy and pair error results on 10 classes from the GTSRB dataset

On the CIFAR-10 dataset, the theoretically best ECOC could not be computed, thus the gen-MM-ECOC10,20 method was included; and for most of the indicators, this method outperformed the others (with both CNN21 and ResNet). When CNN21 was the base classifier, our method achieved a 0.053 improvement in accuracy; and in the case of the stronger base classifier (ResNet), the improvement was only 0.028 (from 0.923 to 0.951) (Tables 11, 12).

Table 11 Accuracy and pair error results on 10 classes from the CIFAR-10 dataset (CNN21)
Table 12 Accuracy and pair error results on 10 classes from the CIFAR-10 dataset (ResNet)

The \(H_{opt}\) values of the Min–Max ECOC6,10 and Min–Max ECOC6,15 matrices were 6 and 8, respectively. All \(H_{opt}\) values of the Min–Max ECOC matrices for 6 classes with different numbers of binary classifiers are shown in Table 17 in the Appendix.

5 Conclusion

This paper aimed to minimize the largest error ratios of two types of bottleneck (LCPE and LME) in deep neural network-based classification using an ensemble of binary classifiers. To minimize these error ratios, we suggested maximizing the lowest value in the Hamming matrix derived from the ECOC matrix (this lowest value was denoted by \(H_{min}\)), where the elements of the Hamming matrix measure the Hamming distance between pairs of rows of the ECOC matrix. We suggested a special matrix, the Min–Max ECOC matrix, among the large set of all possible ECOC matrices with predefined numbers of rows and columns, which possesses the largest \(H_{min}\) in this set. The largest \(H_{min}\) (\(H_{opt}\)) and the corresponding Min–Max ECOC matrix can be an optimal solution to the misclassification problem of a multiclass classification task. Besides the Min–Max ECOC matrix, our contribution is an interrelation between the properties of the Min–Max ECOC matrix and the full ECOC matrix, and an estimation of the exact \(H_{opt}\) value. The full matrix is always a Min–Max ECOC matrix, and we presented a recursive construction algorithm for it. The significance of the Min–Max ECOC method is its flexibility, because it gives an optimal solution for each number of classifiers in the corresponding range (e.g., from 6 to 31 when the number of classes is 6). The usefulness of the construction algorithm lies in the fact that it can generate a matrix close to the optimum when the optimal solution cannot be calculated due to computational costs.

While searching for a good ECOC matrix (by random generation or by a constructive algorithm), the \(H_{min}\) values of the trials fall in a range between the worst, so-called \(H_{worst}\), and the best (\(H_{opt}\)) values. The \(H_{worst}\) values can be determined by a closed-form expression based on our theorem, which was proved in the paper. It is not easy to calculate exact \(H_{opt}\) values for large ECOC matrices; therefore, we stated and proved two theorems giving an upper and a lower limit for the \(H_{opt}\) values. The importance of these theorems is that the interval between the upper and lower limits gives an estimation of the exact \(H_{opt}\) value. Three types of Convolutional Neural Networks with the Min–Max ECOC matrix were tested on four real datasets and compared with OVA and variants of ECOC methods (random and Hadamard ECOC methods) in terms of known and two new indicators; the experimental results show that the proposed method surpasses the others.

The drawback of the Min–Max ECOC method is that the running time of the optimization process may be long when the number of rows is large. Thus, this method can be used only for a small number of classes. The proposed recursive construction algorithm for the ECOC matrix can help with this limitation. However, the disadvantage of this construction algorithm is that the optimal Hamming distance of the generated ECOC matrix is not guaranteed. There is a trade-off between these limitations, and a possible future work is the quantification of the advantages and disadvantages of this trade-off. Our research focused only on data-independent codes in ECOC, but in the future, we plan to investigate and work out data-dependent ECOC algorithms as well.