Abstract
Automatic classification tasks on structured data have been revolutionized by Convolutional Neural Networks (CNNs), but the focus has been on binary and nominal classification tasks. Only recently, ordinal classification (where class labels present a natural ordering) has been tackled through the framework of CNNs. Also, ordinal classification datasets commonly present a high imbalance in the number of samples of each class, making it an even harder problem. Focus should be shifted from classic classification metrics towards per-class metrics (like AUC or Sensitivity) and rank agreement metrics (like Cohen’s Kappa or Spearman’s rank correlation coefficient). We present a new CNN architecture based on the Ordinal Binary Decomposition (OBD) technique using Error-Correcting Output Codes (ECOC). We aim to show experimentally, using four different CNN architectures and two ordinal classification datasets, that the OBD+ECOC methodology significantly improves the mean results on the relevant ordinal and class-balancing metrics. The proposed method is able to outperform a nominal approach as well as already existing ordinal approaches, achieving a mean performance of \({{\,\mathrm{\textit{RMSE}}\,}}= 1.0797\) for the Retinopathy dataset and \({{\,\mathrm{\textit{RMSE}}\,}}= 1.1237\) for the Adience dataset averaged over 4 different architectures.
1 Introduction
There exists a large variety of classification tasks tackled in Machine Learning (ML) literature. It is natural to group them, for example, depending on the number of different class labels assigned to the classification samples. According to this, we differentiate between binary classification tasks (those where only two different labels are present, usually a “positive” class and a “negative” class) and multi-class classification tasks (those where more than two different labels exist).
Focusing on multi-class tasks, one could also pay attention to the relation between the class labels. Classic approaches treat all classes equally, assume no relations between them, and simply try to minimize the number of samples assigned an incorrect label.
However, when an order relation between the class labels is present due to the nature of the problem itself, these tasks can be posed as “ordinal classification” (sometimes referred to as “ordinal regression”) tasks, which have gained popularity in the last decade. This family of problems, halfway between nominal classification and regression, presents extra information which can be exploited in order to improve performance, sometimes regarding different metrics than usual [1, 4, 16]. Methods that exploit this information have been shown to outperform purely nominal methods in the context of unstructured data [10, 29, 30], and some methods have been proposed to search for ordinality in the class labels of apparently purely categorical datasets [24].
In this work, we propose and explore a novel general methodology for ordinal classification tasks of 2D images. This includes a generic structure for the final layers of a Convolutional Neural Network (CNN), adaptable to a wide range of already existing architectures, as well as a prediction scheme adapted to this structure and an ordinal target label encoding, both based on the Error-Correcting Output Code (ECOC) framework. Our hypothesis is that this exploitation of ordinal information in the context of image classification may improve performance, not only on ordinal metrics but also on nominal ones.
This work is structured as follows: in Sect. 2 a brief literature review on ordinal classification and CNNs is presented. In Sect. 3 a baseline nominal methodology for training CNNs to solve classification problems is posed. Then in Sect. 4 the ordinal classification framework is described and three different ordinal classification methodologies for CNNs (two already existing methods based on previous works and one novel method) are described. In Sect. 5, the experiments for the comparison of these four approaches are presented, including the datasets used for evaluation. Finally, in Sect. 6, the experiment results are shown, and Sect. 7 concludes with a discussion of these results.
2 Related Work
Early ordinal classification approaches were limited to unstructured input data, where no spatial or temporal relations exist between the inputs. Some basic approaches include using regular regression methods with rounding applied at the outputs [23] or using the label distance as a cost penalty [22]. The performance of such methods is limited because of the potentially unequal underlying distance between labels. Cumulative Link Models (CLMs) such as the Proportional Odds Model (POM) [27] or the gologit model [35], which not only learn a latent continuous variable but also a set of thresholds for each rank, are able to overcome this limitation. There are also adaptations of Support Vector Machines (SVMs) like SVORIM or SVOREX [7] which add ordinal constraints to the optimization of the model. Lastly, an approach known as Ordinal Binary Decomposition (OBD), where the original ordinal problem is split into a set of binary problems, has also proven to improve performance. Examples of this are the cascade linear utility model [36], where a different model solves each binary problem, or neural networks coupled with multiple outputs, one for each binary subproblem [9, 20]. The main problem with OBD is the matter of combining the different outputs to produce a final decision.
These approaches are not suitable for structured information such as 2D images, where domain-specific feature extraction is still necessary. In this regard, CNNs provide an automatic method for extracting learned features from structured data in classification tasks. Unfortunately, due to their high number of parameters CNNs suffer easily from overfitting problems resulting in low generalization performance. In addition to classic techniques such as \(L_2\) regularization and dropout, recent techniques include multi-stage implicit regularization [37] and network path pruning [38] to avoid this problem.
Adapting CNNs to work with ordinal information is a recent line of research, still needing extensive work. In [12, 33], a CLM has been adapted as the activation function of a single output of a CNN. In [28] a CNN architecture for solving the OBD version of an age estimation problem is proposed, with a very simple combination of the binary outputs for obtaining a rank. [25] proposes a different methodology for small datasets based on triplets of samples and majority voting. Finally, [6] proposes an improvement over [28] by bounding the maximum binary error of each output.
3 Base Nominal CNN Methodology
Nominal classification is the general framework for tasks where there is a need to assign a class label to a randomly sampled object from a specific distribution. More formally, we want to obtain a rule \(r: \mathcal {X} \rightarrow \mathcal {Y}\) that associates an input vector \({\mathbf {x}}\in \mathcal {X} \subseteq {\mathbb {R}}^K\) to a class label \(y \in \mathcal {Y} = \{{\mathcal {C}}_1, {\mathcal {C}}_2, \dots , {\mathcal {C}}_Q\}\) in a finite set. In order to learn this relation, a dataset D is provided consisting of tuples of correctly classified samples \(D = \left\{ ({\mathbf {x}}_i, y_i) \mid {\mathbf {x}}_i \in \mathcal {X},\, y_i \in \mathcal {Y},\, i \in \{1, \dots , N\} \right\} \).
Focusing on image classification tasks, CNNs are able to capture the spatial nature of image features, where nearby pixels have a stronger association between them than far away ones. We have considered four different well-known and competitive CNN architectures for image classification in order to have a good performance baseline: VGG11 [31], ResNet18 [17], MobileNetV3 [19] and ShuffleNetV2 [26]. We use these architectures as a baseline for traditional nominal classification.
While the specifics of each architecture vary widely, their general design follows the same overall premise:
- First, several blocks of convolution and pooling operations are applied to the input image.
- Then, the mapped features are processed by one or more hidden fully-connected layers.
- Finally, an output layer with as many units as classes and softmax activation is used, whose values represent the probability of input sample \({\mathbf {x}}\) being assigned each class label, \(P(y = {\mathcal {C}}_q \mid {\mathbf {x}})\). These are compared to the ground truth labels of dataset D to compute a loss function \(\ell \), which is minimized through a gradient descent procedure.
3.1 Decision Rule
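During evaluation of the model, the maximum probability class of \({\mathbf {x}}_i\) is selected as the predicted class label \({\hat{y}}_i\):
\[{\hat{y}}_i = \mathop {\mathrm {arg\,max}}\limits _{{\mathcal {C}}_q \in \mathcal {Y}} P(y_i = {\mathcal {C}}_q \mid {\mathbf {x}}_i) \tag{1}\]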
where \(P(y_i = {\mathcal {C}}_q \mid {\mathbf {x}}_i)\) is the probability of sample \({\mathbf {x}}_i\) being assigned label \({\mathcal {C}}_q\) predicted by the network.
3.2 Loss Function
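For the baseline nominal methodology, categorical cross-entropy is used as the loss function \(\ell \) during training:
\[\ell = -\frac{1}{N} \sum _{i=1}^{N} \sum _{q=1}^{Q} 1\{y_i = {\mathcal {C}}_q\} \log P(y_i = {\mathcal {C}}_q \mid {\mathbf {x}}_i) \tag{2}\]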
where \(1\{y_i = {\mathcal {C}}_q\}\) is the indicator function that is equal to 1 when \(y_i = {\mathcal {C}}_q\) and 0 otherwise.
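As an illustration of the nominal pipeline of this section, the following minimal Python sketch applies a softmax to raw output scores, selects the maximum probability class and evaluates the categorical cross-entropy. The function names, the 0-based class indices, the averaging over the batch and the small constant `eps` are assumptions made for this example and are not taken from the experimental code.

```python
import math

def softmax(logits):
    """Convert raw output-layer scores into class probabilities."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_nominal(probs):
    """Decision rule: 0-based index of the most probable class."""
    return max(range(len(probs)), key=lambda q: probs[q])

def cross_entropy(batch_probs, labels, eps=1e-12):
    """Categorical cross-entropy averaged over a batch (assumed normalization).

    labels holds the 0-based index of the true class of each sample, so the
    indicator function simply picks the probability of the true class.
    """
    total = sum(math.log(p[y] + eps) for p, y in zip(batch_probs, labels))
    return -total / len(labels)

probs = softmax([2.0, 0.5, -1.0])
print(predict_nominal(probs))       # index of the predicted class
print(cross_entropy([probs], [0]))  # loss when the true class is the first one
```

Note that this loss only looks at the probability given to the true class, so it penalizes a prediction one rank away exactly as much as one several ranks away, which is what the ordinal methodologies try to improve on.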
4 The Ordinal Classification Framework
As in a nominal classification framework, an ordinal classification task is characterised as the prediction process of assigning a label y to an input vector \({\mathbf {x}}\), where \({\mathbf {x}}\in \mathcal {X} \subseteq {\mathbb {R}}^K\) and \(y \in \mathcal {Y} = \{{\mathcal {C}}_1, {\mathcal {C}}_2, \dots , {\mathcal {C}}_Q\}\), i.e., \({\mathbf {x}}\) is a K-dimensional vector and y is a class label in a finite set. The goal is also to obtain some classification rule \(r: \mathcal {X} \rightarrow \mathcal {Y}\) that predicts the categories of new patterns given a dataset D.
Where the ordinal framework differs from the nominal framework is in the presence of a natural ordering of the class labels: \({\mathcal {C}}_1 \prec {\mathcal {C}}_2 \prec \dots \prec {\mathcal {C}}_Q\), where \(\prec \) is an order relation. This is similar to regression, where \(y \in {\mathbb {R}}\), and real values can be ordered by the < operator but, in this case, the labels are discrete and include qualitative information instead of quantitative [16]. Throughout this work, we adopt the convention that \(i < j \Rightarrow {\mathcal {C}}_i \prec {\mathcal {C}}_j\).
4.1 Adapting CNNs for ordinal classification
Without altering the architectures in a major way, several different options are available for introducing the ordinal information of the original dataset in the model and its training process:
(a) Using a loss function that incorporates ordinal information in the optimization procedure.
(b) Altering only the fully-connected layer phase of the architecture, maintaining all previous layers as-is.
(c) Furthermore, altering the decision rule that assigns a label to each sample when making a prediction.
In the following three sections, three different ordinal methodologies are described: two already present in the literature as well as our proposed method.
4.2 Using an ordinal loss function: Quadratic Weighted Kappa
A naive approach to integrating ordinal information in the learning process of the model consists of optimizing an order-sensitive loss function instead of the classic categorical cross-entropy.
One promising such function is the weighted Kappa metric [3] (described in Sect. 5.4), a relevant score for ordinal classifiers as it measures the rank agreement between two raters (in our case, the ground truth labels and the model outputs) based on a disagreement penalty. This penalty is usually defined as the absolute (linear) or square (quadratic, used in the rest of this paper) difference between the rank labels. It is often used in medical diagnosis systems, where the severity of a disease presents naturally ordered stages. It is defined as:
\[\kappa = 1 - \frac{\sum _{i=1}^{Q} \sum _{j=1}^{Q} w_{{\mathcal {C}}_i,{\mathcal {C}}_j}\, p_{{\mathcal {C}}_i,{\mathcal {C}}_j}}{\sum _{i=1}^{Q} \sum _{j=1}^{Q} w_{{\mathcal {C}}_i,{\mathcal {C}}_j}\, e_{{\mathcal {C}}_i,{\mathcal {C}}_j}} \tag{3}\]
where \(w_{{\mathcal {C}}_i,{\mathcal {C}}_j}\) is the disagreement cost when \(y = {\mathcal {C}}_i\) and \({\hat{y}} = {\mathcal {C}}_j\) (\(w_{{\mathcal {C}}_i,{\mathcal {C}}_j} = (i - j)^2\) for the quadratic case), and \(p_{{\mathcal {C}}_i,{\mathcal {C}}_j}\) and \(e_{{\mathcal {C}}_i,{\mathcal {C}}_j}\) are the observed agreement and expected agreement due to chance for classes \({\mathcal {C}}_i\) and \({\mathcal {C}}_j\), respectively. A larger \(\kappa \) value corresponds with a better agreement and vice versa, and so it is a metric to be maximised.
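To make the metric concrete, the following Python sketch computes the quadratic weighted Kappa from two lists of labels. It assumes the standard formulation in which the observed agreement is the normalized confusion matrix and the expected agreement is the product of its marginals; the function name and the 0-based integer labels are choices made for this example only.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Quadratic weighted Cohen's Kappa for 0-based integer labels.

    Undefined (division by zero) when both raters use a single, identical class.
    """
    n = len(y_true)
    # Observed agreement: normalized confusion matrix
    observed = [[0.0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1.0 / n
    # Marginal distributions of true and predicted labels
    true_marginal = [sum(row) for row in observed]
    pred_marginal = [sum(observed[i][j] for i in range(num_classes))
                     for j in range(num_classes)]
    numerator = 0.0
    denominator = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2  # quadratic disagreement cost
            numerator += weight * observed[i][j]
            # Expected agreement by chance: product of the marginals
            denominator += weight * true_marginal[i] * pred_marginal[j]
    return 1.0 - numerator / denominator

print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4))  # perfect agreement
print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 3, 3], 4))  # one adjacent error
```

Because the cost grows with the squared rank distance, confusing neighbouring classes lowers the score far less than confusing the two extremes of the scale.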
Unfortunately, as is the case with accuracy, this metric is not continuous and is expressed in terms of discrete labels, preventing the application of gradient descent methods. In [8], a proposal is made to adapt this metric as a loss function for CNN model training while maintaining the architecture of the network as well as the decision rule.
4.2.1 Loss Function
First, \(\kappa \) is expressed in terms of probabilities instead of class labels, maintaining the penalty matrix \(w_{{\mathcal {C}}_i,{\mathcal {C}}_j}\) but substituting \(p_{{\mathcal {C}}_i,{\mathcal {C}}_j}\) and \(e_{{\mathcal {C}}_i,{\mathcal {C}}_j}\) for the probability outputs of the model:
where \(N_j\) is the number of samples with class label \({\mathcal {C}}_j\) in the dataset D.
Then, in order to pose it as a minimization problem, loss \(\ell \) is defined as:
Further derivation and a more in-depth discussion can be found in [8].
4.2.2 Decision Rule
In the same manner as the nominal approach of Sect. 3, the maximum probability class of \({\mathbf {x}}_i\) is selected as the predicted class label \({\hat{y}}_i\).
4.3 The Cumulative Link Model Approach
For the CLM framework (a family of models which includes the POM [27]), only a small modification to the baseline nominal model is needed: the output is reduced to a single unit in the last layer, and the logit cumulative link function is used as the activation function instead of softmax:
\[P(y \preceq {\mathcal {C}}_q \mid {\mathbf {x}}) = \sigma (b_q - f({\mathbf {x}})), \quad 1 \le q < Q \tag{6}\]
where \(P(y \preceq {\mathcal {C}}_q \mid {\mathbf {x}})\) is the probability, predicted by the network, of sample \({\mathbf {x}}\) being assigned label \({\mathcal {C}}_q\) or lower, \(f({\mathbf {x}})\) is the single output of the model, \(\sigma \) is the sigmoid function and \(b_q\) is one of the \(Q-1\) thresholds learned as additional parameters. Note that cumulative probabilities \(P(y \preceq {\mathcal {C}}_q \mid {\mathbf {x}})\) are predicted by this function instead of individual ones like \(P(y = {\mathcal {C}}_q \mid {\mathbf {x}})\).
4.3.1 Decision Rule
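During evaluation, elementary probability rules are used to combine the cumulative probabilities from Eq. (6) into individual probabilities [15]:
\[P(y = {\mathcal {C}}_q \mid {\mathbf {x}}) = \begin{cases} P(y \preceq {\mathcal {C}}_1 \mid {\mathbf {x}}), & q = 1, \\ P(y \preceq {\mathcal {C}}_q \mid {\mathbf {x}}) - P(y \preceq {\mathcal {C}}_{q-1} \mid {\mathbf {x}}), & 1< q < Q, \\ 1 - P(y \preceq {\mathcal {C}}_{Q-1} \mid {\mathbf {x}}), & q = Q, \end{cases} \tag{7}\]
and the maximum probability class is then selected as the predicted label \({\hat{y}}_i\):
\[{\hat{y}}_i = \mathop {\mathrm {arg\,max}}\limits _{{\mathcal {C}}_q \in \mathcal {Y}} P(y_i = {\mathcal {C}}_q \mid {\mathbf {x}}_i) \tag{8}\]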
4.3.2 Loss Function
Cross-entropy loss is used as the loss function in the same manner as in the nominal model.
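The CLM prediction step can be sketched in a few lines of Python. The sketch assumes the usual proportional odds convention \(P(y \preceq {\mathcal {C}}_q \mid {\mathbf {x}}) = \sigma (b_q - f({\mathbf {x}}))\) with thresholds sorted in increasing order; the function names and the numeric values are illustrative and are not taken from the experiments.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def clm_class_probabilities(f_x, thresholds):
    """Individual class probabilities from a single model output f_x.

    thresholds are the Q-1 learned values b_1 < ... < b_{Q-1}. The sign
    convention sigma(b_q - f_x) is an assumption of this sketch.
    """
    # Cumulative probabilities P(y <= C_q | x); the last one is 1 by definition
    cumulative = [sigmoid(b - f_x) for b in thresholds] + [1.0]
    # Differences of consecutive cumulative probabilities
    probs = [cumulative[0]]
    for q in range(1, len(cumulative)):
        probs.append(cumulative[q] - cumulative[q - 1])
    return probs

probs = clm_class_probabilities(0.2, [-2.0, 0.0, 2.0])
predicted = max(range(len(probs)), key=lambda q: probs[q])  # 0-based class index
print(probs, predicted)
```

Since the thresholds are sorted and the sigmoid is monotonic, the cumulative probabilities are non-decreasing, so the resulting individual probabilities are always non-negative and sum to one.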
4.4 Our approach: Ordinal Binary Decomposition
For our ordinal approach, we decompose the original Q-class ordinal problem into \(Q-1\) binary decision problems, which is known as Ordinal Binary Decomposition (OBD). Each problem q (\(1 \le q < Q\)) consists of deciding whether \(y \succ {\mathcal {C}}_q\) given sample \({\mathbf {x}}\) (this is referred to as the “Ordered partitions” scheme in [16]).
To adapt the outputs of the model to this, the final fully-connected block is substituted by \(Q-1\) fully-connected blocks, each one with a single output unit with sigmoid activation (see Notes). Each of the \(Q-1\) outputs of the model \(o_q\) is trying to predict the probability \(P(y \succ {\mathcal {C}}_q \mid {\mathbf {x}})\). The result of this modification is obtaining \(Q-1\) different models, which share their convolutional feature extraction parameters and are trained simultaneously.
4.4.1 Decision Rule
In the case of the OBD models, because the outputs are not individual probabilities but cumulative ones (\(o_k = P(y \succ {\mathcal {C}}_k \mid {\mathbf {x}})\)), the decision rule requires combining several outputs. Moreover, these probabilities may be inconsistent: nothing forces them to fulfil basic probability properties like \(P(y \succ {\mathcal {C}}_i) \ge P(y \succ {\mathcal {C}}_{i+1})\) and \(\sum _{i=1}^{Q} P(y = {\mathcal {C}}_i) = 1\). For this reason, Eq. (7) cannot be applied as for the CLM.
In order to circumvent this problem, a stable approach based on the ECOC framework is used: the ideal output vector \(\mathbf {v}({\mathcal {C}}_i)\) for each class \({\mathcal {C}}_i\) is considered, \(\mathbf {v}({\mathcal {C}}_i) = (c_1, \dots , c_{Q-1})\) where \(c_j = 1\{{\mathcal {C}}_j \prec {\mathcal {C}}_i\}\), i.e. a vector with ones in all positions corresponding with classes which are lower than \(\mathcal {C}_i\) in the ordinal scale. This makes the ideal output vector for a sample \({\mathbf {x}}_i\) with label \(y_i = {\mathcal {C}}_k\) be:
\[\mathbf {v}({\mathcal {C}}_k) = (\underbrace{1, \dots , 1}_{k-1}, \underbrace{0, \dots , 0}_{Q-k})\]
For example, for a 4-class ordinal problem with labels \({\mathcal {C}}_1\), \({\mathcal {C}}_2\), \({\mathcal {C}}_3\), and \({\mathcal {C}}_4\) the ideal outputs would be \(\mathbf {v}({\mathcal {C}}_1) = (0,0,0)\), \(\mathbf {v}({\mathcal {C}}_2) = (1,0,0)\), \(\mathbf {v}({\mathcal {C}}_3) = (1,1,0)\), and \(\mathbf {v}({\mathcal {C}}_4) = (1,1,1)\).
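The decision rule is based on determining the ideal vector which minimizes the distance to the obtained output vector \(\mathbf {o}\):
\[{\hat{y}} = \mathop {\mathrm {arg\,min}}\limits _{{\mathcal {C}}_q \in \mathcal {Y}} \Vert \mathbf {o} - \mathbf {v}({\mathcal {C}}_q) \Vert _2\]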
where \(\Vert \cdot \Vert _2\) is the \(L_2\) norm. This distance metric is selected in order to align it with the loss function of the optimization process.
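As an example to illustrate this prediction criterion, assume a 4-class ordinal problem like the one previously mentioned. For sample \({\mathbf {x}}\), let the output of the model be the 3-dimensional vector \(\mathbf {o} = ( 0.8, 0.3, 0.2 )\). The distance to each ideal class vector would be computed as:
\[\begin{aligned} \Vert \mathbf {o} - \mathbf {v}({\mathcal {C}}_1) \Vert _2&= \sqrt{0.8^2 + 0.3^2 + 0.2^2} \approx 0.877, \\ \Vert \mathbf {o} - \mathbf {v}({\mathcal {C}}_2) \Vert _2&= \sqrt{0.2^2 + 0.3^2 + 0.2^2} \approx 0.412, \\ \Vert \mathbf {o} - \mathbf {v}({\mathcal {C}}_3) \Vert _2&= \sqrt{0.2^2 + 0.7^2 + 0.2^2} \approx 0.755, \\ \Vert \mathbf {o} - \mathbf {v}({\mathcal {C}}_4) \Vert _2&= \sqrt{0.2^2 + 0.7^2 + 0.8^2} \approx 1.082. \end{aligned}\]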
This process is illustrated in Fig. 1. The vector closest to \(\mathbf {o}\) is \(\mathbf {v}({\mathcal {C}}_2)\) and thus, sample \({\mathbf {x}}\) would be assigned the class label \({\hat{y}} = {\mathcal {C}}_2\).
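The ECOC decision rule and the example above can be reproduced with the following short Python sketch. The function names are hypothetical and classes are identified by their 1-based position in the ordinal scale.

```python
import math

def ideal_vector(k, num_classes):
    """Ideal output for class C_k (1-based): k-1 ones followed by Q-k zeros."""
    return [1.0 if j < k else 0.0 for j in range(1, num_classes)]

def ecoc_decode(outputs):
    """Return the 1-based class whose ideal vector is closest in L2 distance,
    together with the distance to every class."""
    num_classes = len(outputs) + 1
    distances = {}
    for k in range(1, num_classes + 1):
        target = ideal_vector(k, num_classes)
        distances[k] = math.sqrt(sum((o - v) ** 2 for o, v in zip(outputs, target)))
    best = min(distances, key=distances.get)
    return best, distances

best, distances = ecoc_decode([0.8, 0.3, 0.2])
print(best)       # class C_2 for the example of the text
print(distances)
```

The rule still returns a valid class when the outputs are not monotonically decreasing, for instance \(\mathbf {o} = (0.2, 0.9, 0.1)\), which is the situation where the cumulative probabilities cannot be combined directly.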
4.4.2 Loss Function
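For the OBD methodology, categorical cross-entropy has been substituted by the Mean Squared Error loss because it copes better with the distance function used for the ECOC decision [2]:
\[\ell = \frac{1}{N} \sum _{i=1}^{N} \sum _{k=1}^{Q-1} \left( 1\{y_i \succ {\mathcal {C}}_k\} - P(y_i \succ {\mathcal {C}}_k \mid {\mathbf {x}}_i) \right) ^2\]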
where \(1\{y_i \succ {\mathcal {C}}_k\}\) is the indicator function that is equal to 1 when \(y_i \succ {\mathcal {C}}_k\) and 0 otherwise, and \(P(y_i \succ {\mathcal {C}}_k \mid {\mathbf {x}}_i)\) is the probability that \(y_i \succ {\mathcal {C}}_k\) predicted by the network given a sample \({\mathbf {x}}_i\).
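A minimal Python sketch of the OBD target encoding and its squared error loss follows. The normalization (squared errors summed over the \(Q-1\) outputs and averaged over the samples) is an assumption of this sketch, as are the function names; labels are 1-based positions in the ordinal scale.

```python
def obd_targets(label, num_classes):
    """Binary targets 1{y > C_k} for k = 1..Q-1, given a 1-based label."""
    return [1.0 if label > k else 0.0 for k in range(1, num_classes)]

def obd_mse_loss(batch_outputs, labels, num_classes):
    """Squared error between the Q-1 sigmoid outputs and the OBD targets,
    summed over outputs and averaged over samples (assumed normalization)."""
    total = 0.0
    for outputs, label in zip(batch_outputs, labels):
        targets = obd_targets(label, num_classes)
        total += sum((t - o) ** 2 for t, o in zip(targets, outputs))
    return total / len(labels)

# A sample of class C_2 in a 4-class problem with outputs (0.8, 0.3, 0.2)
print(obd_mse_loss([[0.8, 0.3, 0.2]], [2], 4))
```

With this normalization, the loss of a single sample is the squared \(L_2\) distance between the output vector and the ideal vector of its class, which is the quantity that the ECOC decision rule compares across classes.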
The four methodologies described in this section are illustrated in Fig. 2.
5 Experiment Design
5.1 Datasets
The effects of the four described methodologies will be tested against the following two different datasets, chosen specifically for the ordinal nature of their class labels and an acute class imbalance.
5.1.1 Diabetic Retinopathy Dataset
The diabetic retinopathy dataset from Kaggle (referred to as “Retinopathy” from now on) consists of a total of 88 702 retina images labelled by a clinician on a 0 to 4 scale evaluating the presence of Diabetic Retinopathy (DR), an eye disease present in a large proportion of diabetes patients. It contains 65 343 images labelled as No DR, 6205 images labelled as Mild DR, 13 153 images labelled as Moderate DR, 2087 images labelled as Severe DR, and 1914 images labelled as Proliferative DR. The task consists of predicting the clinician label using the colour image of the retina. Three sample images can be seen in Fig. 3. All images have been resized down to \(128 \times 128\) pixels.
5.1.2 Adience Faces Dataset
The Adience faces dataset for age classification [11] (referred to simply as “Adience” from now on) is composed of 26 580 photos of 2284 different subjects extracted from real online albums and automatically cropped and aligned. 17 702 of these photos have an age label attached, referring to one of 8 different age groups of increasing value: 0–2 years, 4–6 years, 8–13 years, 15–20 years, 25–32 years, 38–43 years, 48–53 years, and 60 years and up. The task consists of assigning one of these 8 age labels to each photo. A sample of these images can be seen in Fig. 4. As a preprocessing step, all images have been resized down to \(256 \times 256\) pixels.
5.2 Methodologies and Validation Scheme
Four different methodologies are tested against each other:
- The baseline nominal architecture, with categorical cross-entropy loss function.
- The same architecture, but substituting the loss function by the Quadratic Weighted Kappa (QWK) function described in Sect. 4.2.
- The CLM approach, as described in Sect. 4.3.
- The OBD approach with ECOC decision rule, as described in Sect. 4.4.
All of these are applied to all four of the previously mentioned architectures (VGG11, ResNet18, MobileNetV3 and ShuffleNetV2), yielding a total of sixteen different experiments for each of the two datasets.
In order to obtain a statistically significant result to test the hypotheses, each experiment is repeated 30 times on 30 different holdout splits of the original dataset, where 80% of samples are used for training and 20% are used for model evaluation. This split is performed in a stratified fashion, preserving the original proportion of the classes of the original dataset in the subsets. For the Retinopathy dataset this leaves 70 962 training samples (of which 7096 are reserved for validation) and 17 740 test samples in each split. In the case of the Adience dataset, 14 161 are used for training (of which 1416 are reserved for validation) and 3541 are used for evaluation.
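The stratified holdout scheme can be sketched as follows in plain Python. This is an illustrative reimplementation rather than the code used in the experiments: the function name, the per-class rounding and the use of the repetition index as the random seed are assumptions, so the resulting subset sizes may differ from the reported ones by a few samples.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train and test, preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)  # per-class test size
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test

# 30 repetitions, each with a different split of a toy imbalanced dataset
labels = [0] * 80 + [1] * 20
splits = [stratified_holdout(labels, seed=repetition) for repetition in range(30)]
train, test = splits[0]
print(len(train), len(test))
```

The same function can be applied again to the training indices with `test_fraction=0.1` to reserve the validation subset.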
5.3 Training Scheme
In all experiments, weights are initialized randomly using the He initialization scheme described in [18]. They are then adjusted using the Adam method [21] with a learning rate \(\eta = 1\times 10^{-4}\).
In the case of VGG11, both dropout (\(p=0.5\)) and \(L_2\) regularization (with a weight of \(5\times 10^{-4}\)) are applied only in the fully-connected layers as in the original paper [31]. For ResNet18, batch normalization is applied after every convolution operation and \(L_2\) penalty (with a weight of \(1\times 10^{-4}\)) is added to all mappings [17]. The number of trainable parameters for each model is available in Table 1.
In order to help overcome the class imbalance, class weighting is applied to the loss function based on \(N_q\) (number of training samples for class \({\mathcal {C}}_q\)):
where C is a constant defined as \(C=3\times 10^{-5}\). This weight \(w_q\) is multiplied by the loss contribution of each sample with a ground truth label of \({\mathcal {C}}_q\).
Before training, 10% of training samples are reserved for validation, again selected in a stratified fashion according to the class labels. Model weights are updated in batches of 72 training samples and the loss is monitored on both the training and validation sets. If validation performance does not improve for 5 full epochs, training is halted, and the best performing parameters over the validation set are restored.
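The stopping criterion can be illustrated with the sketch below, which takes the validation loss observed after each epoch and reports when training would be halted and which epoch's parameters would be restored. It assumes that the monitored quantity is the validation loss (lower is better); the function name and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops once `patience` consecutive epochs have passed without
    improving on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch  # halt and restore weights of best_epoch
    return len(val_losses) - 1, best_epoch

losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9, 0.8, 0.6]
stop_epoch, best_epoch = early_stopping_epoch(losses)
print(stop_epoch, best_epoch)
```

In this toy sequence the last value is never reached, since five epochs pass without improving on the third one.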
The code used to perform the experiments can be accessed through GitHub.
5.4 Performance Metrics
The classical performance metric in classification tasks is the Correct Classification Rate (CCR). However, given that both datasets present a very high class imbalance, the traditional CCR is not a representative measure of model performance: for example, in the case of the Retinopathy dataset, a dummy classifier that always assigns the majority class label (class 0) would obtain a CCR of 73%.
In order to monitor global per-class performance, metrics such as the Average Area Under the Receiver Operating Characteristic (ROC) curve (\({{\,\mathrm{\textit{AvAUC}}\,}}\)), minimum sensitivity (\({{\,\mathrm{\textit{MS}}\,}}\)) and geometric mean of the sensitivities (\({{\,\mathrm{\textit{GMS}}\,}}\)) [10] will also be included.
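The effect of class imbalance on these metrics can be checked numerically. The sketch below evaluates a dummy classifier that always predicts class 0 using the Retinopathy class counts given in Sect. 5.1.1; the helper names are hypothetical and the sensitivity helper assumes that every class appears in the ground truth.

```python
def ccr(y_true, y_pred):
    """Correct Classification Rate: fraction of correctly labelled samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def sensitivities(y_true, y_pred, num_classes):
    """Per-class sensitivity (recall); assumes every class appears in y_true."""
    counts = [0] * num_classes
    hits = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        counts[t] += 1
        if t == p:
            hits[t] += 1
    return [h / c for h, c in zip(hits, counts)]

def geometric_mean(values):
    product = 1.0
    for v in values:
        product *= v
    return product ** (1.0 / len(values))

class_counts = [65343, 6205, 13153, 2087, 1914]  # Retinopathy, classes 0 to 4
y_true = [q for q, n in enumerate(class_counts) for _ in range(n)]
y_pred = [0] * len(y_true)  # dummy classifier: always the majority class

sens = sensitivities(y_true, y_pred, len(class_counts))
print(ccr(y_true, y_pred))   # about 0.737
print(min(sens))             # minimum sensitivity (MS)
print(geometric_mean(sens))  # geometric mean of the sensitivities (GMS)
```

The dummy classifier reaches a high CCR while both \({{\,\mathrm{\textit{MS}}\,}}\) and \({{\,\mathrm{\textit{GMS}}\,}}\) drop to zero, because four of the five classes are never predicted.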
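Also, for ordinal classification problems, rank agreement metrics have been selected for evaluation as well: the Root Mean Squared Error (\({{\,\mathrm{\textit{RMSE}}\,}}\)) (comparing actual and predicted labels, represented as consecutive integers in the ordinal scale), Spearman’s rank correlation coefficient (\({{\,\mathrm{r_s}\,}}\)) [5] and the Quadratic Weighted Cohen’s Kappa (\(\kappa \)) [3] (described in Eq. (3)):
\[{{\,\mathrm{\textit{RMSE}}\,}}= \sqrt{\frac{1}{N} \sum _{i=1}^{N} \left( O(y_i) - O({\hat{y}}_i) \right) ^2}, \qquad {{\,\mathrm{r_s}\,}}= \frac{{{\,\mathrm{Cov}\,}}(O(y), O({\hat{y}}))}{\sigma _{O(y)}\, \sigma _{O({\hat{y}})}}\]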
where \(O({\mathcal {C}}_q) = q\) is the ordinal number of label \({\mathcal {C}}_q\), \({{\,\mathrm{Cov}\,}}(O(y), O({\hat{y}}))\) is the covariance between the ordinal numbers of the ground truth labels and those of the predicted labels, and \(\sigma _{O(y)}\) and \(\sigma _{O({\hat{y}})}\) are their respective standard deviations.
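Both rank agreement metrics can be computed directly from the ordinal numbers of the labels, as in the following Python sketch. It follows the definition given in the text (covariance of the ordinal numbers divided by the product of their standard deviations) and does not apply any tie-averaged ranking; the function names are illustrative.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ordinal numbers."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def rank_correlation(y_true, y_pred):
    """Covariance of the ordinal numbers over the product of their standard
    deviations. Undefined when either list is constant (zero deviation)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    std_t = math.sqrt(sum((t - mean_t) ** 2 for t in y_true) / n)
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in y_pred) / n)
    return cov / (std_t * std_p)

y_true = [1, 2, 3, 4, 5]
y_pred = [1, 2, 4, 4, 5]  # a single error, one rank away
print(rmse(y_true, y_pred))
print(rank_correlation(y_true, y_pred))
```

A classifier that always predicts the same class makes the correlation undefined, which is one more reason to report it alongside the per-class metrics.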
An illustrated example of the global experimentation procedure can be found in Fig. 5.
In Sect. 6, mean results and standard deviations are reported for each methodology. Then, statistical hypothesis testing is performed in order to discern the effects of the different factors and conclude whether the OBD methodology shows a significant improvement over the other three.
6 Results
The average of the training curves over all 30 repetitions is shown in Figs. 6 and 7 (see Notes). Note how the QWK methodology fails to converge when used in conjunction with the VGG11 architecture: the high depth of this architecture makes the gradients disappear in the backpropagation phase of training, something known as the “vanishing gradients” problem. All the other architectures tested implement residual paths into the network, allowing them to avoid this problem [34]. Note how the OBD methodology does not alter the depth of the CNN model, so it will never cause this problem by itself.
Additionally, Figs. 8 and 9 show the training time for each methodology and architecture. In accordance with the number of parameters (Table 1), the VGG11 architecture takes the longest time to train compared to the other three, which all take a similar time. Regarding the methodologies, while the nominal approach usually takes less time than the ordinal ones, the OBD methodology is a close second in speed.
The average experimental results for each experiment are shown in Appendix A (Tables 4–11) and the mean over all four architectures is summarized in Table 2 for convenience. It can be noted that for the Retinopathy dataset the CLM models are able to improve ordinal metrics by a little, at the cost of worsening metrics related to the imbalance problem (\({{\,\mathrm{\textit{AvAUC}}\,}}\), \({{\,\mathrm{\textit{MS}}\,}}\), and \({{\,\mathrm{\textit{GMS}}\,}}\)). Meanwhile, the OBD models improve the ordinal metrics further while also improving class balancing metrics. This is done at the cost of worsening CCR, but only because of the high class imbalance. In the case of the Adience dataset the OBD models still achieve a higher score in class-balancing metrics, although \({{\,\mathrm{r_s}\,}}\) and \(\kappa \) are worsened slightly.
From the confusion matrices it can be noted that, although Table 2 shows that the CLM improves on the ordinal metrics on the Adience dataset, it fails on every class balancing metric compared to the nominal model, as it ignores both classes 1 and 3 in the Retinopathy dataset, as well as classes 3, 5 and 6 in the Adience dataset. The OBD model, on the other hand, is able to improve both class balancing and ordinal metrics. This is achieved at the cost of losing some performance on the extreme classes, but note how sensitivity and precision never fall to zero when using the OBD model on any class, that is, no class is ignored systematically. This is easily seen on the confusion matrices for each methodology and architecture shown in Appendix A (Figs. 10–17).
6.1 Statistical Analysis
To determine the statistical significance of the mean differences observed for each classifier, each architecture and each dataset, we have carried out a parametric Analysis of Variance (ANOVA) test [13, 14] for each of the evaluated metrics. The three factors considered for the experimental design are: (i) the database (Adience and Retinopathy), (ii) the CNN network architecture (VGG11, ResNet18, MobileNetV3 and ShuffleNetV2) and (iii) the methodology (nominal, QWK, CLM and OBD).
For each combination of these three factors we have repeated the experiment 30 times with different data splits and weight initialization seeds. We have tested, using the Kolmogorov-Smirnov test for all metrics mentioned in Sect. 5.4, whether the null hypothesis stating that the results are drawn from a normal distribution cannot be rejected (for a significance level of \(\alpha = 0.05\)). This is true for all metrics except \({{\,\mathrm{\textit{MS}}\,}}\) and \({{\,\mathrm{\textit{GMS}}\,}}\), namely the Quadratic Weighted Cohen’s Kappa (\(\kappa \)), \({{\,\mathrm{\textit{AvAUC}}\,}}\), \({{\,\mathrm{\textit{RMSE}}\,}}\), Spearman’s rank correlation coefficient (\({{\,\mathrm{r_s}\,}}\)) and \({{\,\mathrm{\textit{CCR}}\,}}\). Only these metrics will be considered for the subsequent analysis, given that ANOVA is a parametric test and can only be applied to normally distributed variables.
After this, ANOVA is performed for these five metrics. The ANOVA tables are available in Appendix B. According to this analysis, for all normally distributed metrics there exist significant differences in the mean value (for a significance level of \(\alpha = 0.05\)) concerning the three individual factors (Dataset, Architecture and Methodology, all p-values \(< 0.001\)). Then, we also found significant interactions between all the pairs of factors (p-values \(< 0.001\)) and between all three factors (p-value \(< 0.001\)). This shows that:
1. the impact of the architecture and the methodology varies across datasets,
2. the architecture significantly affects performance,
3. the effect of the methodology is affected by the CNN architecture (that is, some architectures are better suited for each methodology), and
4. the methodology alone affects the performance, OBD being in the lead according to the mean results of Table 2.
That is why we now analyse the magnitude of those differences according to the Methodology factor. A post-hoc Tukey’s HSD multiple comparison test [32] has been performed on each of the metrics shown to be affected by this factor. The purpose of this test is to arrange the methodologies into groups whose members show no significant difference in performance, where each group is significantly different from the rest. The results of this test are summarized in Table 3 by grouping the methodologies in subsets according to their performance on each metric. The first subset contains the worst methodology, while the last one includes the best methodologies.
Note that for \(\kappa \), \({{\,\mathrm{\textit{AvAUC}}\,}}\), and \({{\,\mathrm{r_s}\,}}\) the OBD methodology has a significantly better mean performance than the other three methodologies. For the \({{\,\mathrm{\textit{RMSE}}\,}}\) metric both CLM and OBD exhibit similar performance, but significantly better than the other two methodologies. Finally, for \({{\,\mathrm{\textit{CCR}}\,}}\) both CLM and the nominal methodology perform similarly and better than OBD and QWK.
7 Conclusions and Future Work
A new ordinal CNN architecture based on Ordinal Binary Decomposition has been proposed, as well as a decision scheme based on ECOC, showing that it is able to significantly outperform a purely nominal approach as well as already existing ordinal approaches, especially when considering highly imbalanced scenarios like medical datasets and web-scraped datasets. Specifically, the proposed OBD methodology is able to improve both class balancing and ordinal metrics such as \({{\,\mathrm{\textit{RMSE}}\,}}\), Spearman’s rank correlation coefficient and Quadratic Weighted Cohen’s Kappa. This methodology is easy to adapt to any other ordinal tasks.
While the tested architectures are widely established and overall well performing models, different and more novel architectures could also be adapted in the same manner. This adds a new fairly generic tool for classification tasks where ordinal information can be exploited. Moreover, this modification does not increase the number of parameters or memory consumption of the network and does not significantly increase the running time for training, making it applicable in memory limited scenarios.
In future work, more complex data structures such as 3D images can be studied. This is possible because the required modifications only alter the final stages of the network, allowing for arbitrary input shapes. Also, even though we have been able to improve on metrics sensitive to class imbalance, further work is necessary, as can be seen from the confusion matrices. Better class-balancing approaches than loss weighting, such as a data augmentation scheme sensitive to ordinal information, could be applied to improve on this.
Notes
For architectures where an extra hidden layer of size H is present (like VGG11 and MobileNetV3), these are reduced to \(\lfloor H / (Q-1) \rfloor \) units in order to maintain a similar number of parameters.
The original experiment results can be checked in the GitHub repository: https://github.com/ayrna/ordinal-cnn-ecoc/blob/main/results.xlsx
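The hidden-layer reduction described in the first note can be worked through with a short sketch. It assumes that each of the Q-1 binary sub-problems receives its own reduced hidden layer; the function name `units_per_branch` and the example values of H and Q are hypothetical and chosen only for illustration.

```python
def units_per_branch(hidden_size: int, num_classes: int) -> int:
    """Return floor(H / (Q - 1)), the reduced hidden layer size.

    hidden_size is H, the size of the original extra hidden layer;
    num_classes is Q, the number of ordinal classes.
    """
    if num_classes < 2:
        raise ValueError("at least two classes are required")
    return hidden_size // (num_classes - 1)


if __name__ == "__main__":
    # Illustrative values only: H = 4096 with Q = 5, and H = 1280 with Q = 8.
    for hidden_size, num_classes in [(4096, 5), (1280, 8)]:
        per_branch = units_per_branch(hidden_size, num_classes)
        total = per_branch * (num_classes - 1)
        print(hidden_size, num_classes, per_branch, total)
```

Because of the floor operation, the total number of hidden units across the Q-1 branches never exceeds H, which is consistent with the note's aim of keeping a similar number of parameters.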
References
Agresti A (2010) Analysis of ordinal categorical data. Wiley Series in Probability and Statistics, Wiley, Hoboken, NJ
Allwein EL, Schapire RE, Singer Y (2000) Reducing multiclass to binary: a unifying approach for margin classifiers. J Mach Learn Res 1:113–141
Ben-David A (2008) Comparison of classification accuracy using Cohen’s Weighted Kappa. Expert Syst Appl 34(2):825–832. https://doi.org/10.1016/j.eswa.2006.10.022
Cardoso JS, Pinto da Costa JF (2007) Learning to classify ordinal data: the data replication method. J Mach Learn Res 8:1393–1429
Cardoso JS, Sousa R (2011) Measuring the performance of ordinal classification. Int J Pattern Recognit Artif Intell 25(08):1173–1195. https://doi.org/10.1142/S0218001411009093
Chen S, Zhang C, Dong M, et al (2017) Using ranking-CNN for age estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 5183–5192
Chu W, Keerthi SS (2007) Support vector ordinal regression. Neural Comput 19(3):792–815. https://doi.org/10.1162/neco.2007.19.3.792
de la Torre J, Puig D, Valls A (2018) Weighted kappa loss function for multi-class classification of ordinal data in deep learning. Pattern Recognit Lett 105:144–154. https://doi.org/10.1016/j.patrec.2017.05.018
Deng WY, Zheng QH, Lian S et al (2010) Ordinal extreme learning machine. Neurocomputing 74(1):447–456. https://doi.org/10.1016/j.neucom.2010.08.022
Dorado-Moreno M, Pérez-Ortiz M, Gutiérrez PA et al (2017) Dynamically weighted evolutionary ordinal neural network for solving an imbalanced liver transplantation problem. Artif Intell Med 77:1–11. https://doi.org/10.1016/j.artmed.2017.02.004
Eidinger E, Enbar R, Hassner T (2014) Age and gender estimation of unfiltered faces. IEEE Trans Inform Forensics Secur 9(12):2170–2179. https://doi.org/10.1109/TIFS.2014.2359646
Fernández-Navarro F (2017) A generalized logistic link function for cumulative link models in ordinal regression. Neural Process Lett 46(1):251–269. https://doi.org/10.1007/s11063-017-9589-3
Fisher RA (1925) Theory of statistical estimation. Math Proc Camb Philos Soc 22(5):700–725. https://doi.org/10.1017/S0305004100009580
Fisher RA (1954) Statistical methods for research workers, twentieth. Oliver and Boyd, Edinburgh
Frank E, Hall M (2001) A simple approach to ordinal classification. In: European Conference on Machine Learning. Springer, Berlin, Heidelberg, Freiburg, Germany, 145–156, https://doi.org/10.1007/3-540-44795-4_13
Gutiérrez PA, Pérez-Ortiz M, Sánchez-Monedero J et al (2016) Ordinal regression methods: survey and experimental study. IEEE Trans Knowl Data Eng 28(1):127–146. https://doi.org/10.1109/TKDE.2015.2457911
He K, Zhang X, Ren S, et al (2015a) Deep residual learning for image recognition. arXiv:1512.03385
He K, Zhang X, Ren S, et al (2015b) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. arXiv:1502.01852
Howard A, Sandler M, Chu G, et al (2019) Searching for MobileNetV3. arXiv:1905.02244
Cheng J, Wang Z, Pollastri G (2008) A neural network approach to ordinal regression. In: IEEE international joint conference on neural networks (IEEE world congress on computational intelligence), 1279–1284, https://doi.org/10.1109/IJCNN.2008.4633963
Kingma DP, Ba J (2017) Adam: a method for stochastic optimization. arXiv:1412.6980
Kotsiantis SB, Pintelas PE (2004) A cost sensitive technique for ordinal classification problems. In: Vouros GA, Panayiotopoulos T (eds.) Methods and applications of artificial intelligence, lecture notes in computer science, https://doi.org/10.1007/978-3-540-24674-9_24
Kramer S, Widmer G, Pfahringer B, et al (2010) Prediction of ordinal classes using regression trees. In: Raś ZW, Ohsuga S (eds) Foundations of intelligent systems. Springer, Berlin, Heidelberg, Lecture notes in computer science, 426–434, https://doi.org/10.1007/3-540-39963-1_45
Lausser L, Schäfer LM, Kühlwein SD et al (2020) Detecting ordinal subcascades. Neural Process Lett 52(3):2583–2605. https://doi.org/10.1007/s11063-020-10362-0
Liu Y, Kong A, Goh C (2017) Deep ordinal regression based on data relationship for small datasets. In: IJCAI international joint conference on artificial intelligence, https://doi.org/10.24963/ijcai.2017/330
Ma N, Zhang X, Zheng HT, et al (2018) ShuffleNet V2: Practical guidelines for efficient cnn architecture design. arXiv:1807.11164
McCullagh P (1980) Regression models for ordinal data. J Royal Stat Soc: Series B (Methodol) 42(2):109–127. https://doi.org/10.1111/j.2517-6161.1980.tb01109.x
Niu Z, Zhou M, Wang L, et al (2016) Ordinal regression with multiple output CNN for age estimation. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), 4920–4928, https://doi.org/10.1109/CVPR.2016.532
Pérez-Ortiz M, Fernández-Delgado M, Cernadas E et al (2016) On the use of nominal and ordinal classifiers for the discrimination of states of development in fish oocytes. Neural Process Lett 44(2):555–570. https://doi.org/10.1007/s11063-015-9476-8
Sánchez-Monedero J, Pérez-Ortiz M, Sáez A et al (2018) Partial order label decomposition approaches for melanoma diagnosis. Appl Soft Comput 64:341–355. https://doi.org/10.1016/j.asoc.2017.11.042
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556
Tukey JW (1949) Comparing individual means in the analysis of variance. Biometrics 5(2):99–114. https://doi.org/10.2307/3001913
Vargas VM, Gutiérrez PA, Hervás-Martínez C (2020) Cumulative link models for deep ordinal classification. Neurocomputing 401:48–58. https://doi.org/10.1016/j.neucom.2020.03.034
Veit A, Wilber M, Belongie S (2016) Residual networks behave like ensembles of relatively shallow networks. arXiv:1605.06431
Williams R (2006) Generalized ordered logit/partial proportional odds models for ordinal dependent variables. Stata J 6:58–82. https://doi.org/10.1177/1536867X0600600104
Wu H, Lu H, Ma S (2003) A practical SVM-based algorithm for ordinal regression in image retrieval. In: 11th ACM international conference on multimedia. Association for computing machinery, Berkeley, 612–621, https://doi.org/10.1145/957013.957144
Zheng Q, Yang M, Yang J et al (2018) Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process. IEEE Access 6:15844–15869. https://doi.org/10.1109/ACCESS.2018.2810849
Zheng Q, Tian X, Yang M et al (2020) PAC-Bayesian framework based drop-path method for 2D discriminative convolutional network pruning. Multidimens Syst Signal Process 31(3):793–827. https://doi.org/10.1007/s11045-019-00686-z
Acknowledgements
This work has been partially subsidised by the “Agencia Estatal de Investigación” (Spain) [grant reference: PID2020-115454GB-C22/AEI/10.13039/501100011033], the “Consejería de Salud y Familias” (Junta de Andalucía) [grant reference: PS-2020-780] and the “Consejería de Transformación Económica, Industria, Conocimiento y Universidades” (Junta de Andalucía) and the “Programa Operativo FEDER 2014-2020” [grant references: UCO-1261651 and PY20_00074]. Javier Barbero-Gómez’s research has been subsidised by the FPI Predoctoral Program of the “Ministerio de Ciencia, Innovación y Universidades” (Spain) [grant reference: PRE2018-085659].
Funding
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
Author information
Contributions
JB-G: Writing – original draft, Conceptualization, Methodology, Software, Validation, Investigation. P-AG: Writing – review & editing, Supervision. CH-M: Formal analysis, Writing – review & editing, Supervision.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Barbero-Gómez, J., Gutiérrez, P.A. & Hervás-Martínez, C. Error-Correcting Output Codes in the Framework of Deep Ordinal Classification. Neural Process Lett 55, 5299–5330 (2023). https://doi.org/10.1007/s11063-022-10824-7
DOI: https://doi.org/10.1007/s11063-022-10824-7