Error-Correcting Output Codes in the Framework of Deep Ordinal Classification

Automatic classification tasks on structured data have been revolutionized by Convolutional Neural Networks (CNNs), but the focus has been on binary and nominal classification tasks. Only recently has ordinal classification (where class labels present a natural ordering) been tackled through the framework of CNNs. In addition, ordinal classification datasets commonly present a high imbalance in the number of samples of each class, making the problem even harder. Focus should be shifted from classic classification metrics towards per-class metrics (like AUC or Sensitivity) and rank agreement metrics (like Cohen’s Kappa or Spearman’s rank correlation coefficient). We present a new CNN architecture based on the Ordinal Binary Decomposition (OBD) technique using Error-Correcting Output Codes (ECOC). We aim to show experimentally, using four different CNN architectures and two ordinal classification datasets, that the OBD+ECOC methodology significantly improves the mean results on the relevant ordinal and class-balancing metrics. The proposed method is able to outperform a nominal approach as well as already existing ordinal approaches, achieving a mean performance of RMSE = 1.0797 for the Retinopathy dataset and RMSE = 1.1237 for the Adience dataset, averaged over 4 different architectures.


Introduction
There exists a large variety of classification tasks tackled in the Machine Learning (ML) literature. It is natural to group them, for example, by the number of different class labels that can be assigned to the samples. Accordingly, we differentiate between binary classification tasks (those where only two different labels are present, usually a "positive" class and a "negative" class) and multi-class classification tasks (those where more than two different labels exist).
Focusing on multi-class tasks, one could also pay attention to the relation between the class labels. Classic approaches treat all classes equally, assume no relations between them, and simply try to minimize the number of samples assigned an incorrect label.
However, when an order relation between the class labels is present due to the nature of the problem itself, these tasks can be posed as "ordinal classification" (sometimes referred to as "ordinal regression") tasks, which have gained popularity in the last decade. This family of problems, halfway between nominal classification and regression, presents extra information which can be exploited in order to improve performance, sometimes with respect to metrics other than the usual ones [1,4,16]. Methods that exploit this information have been shown to outperform purely nominal methods in the context of unstructured data [10,29,30], and some methods have been proposed to search for ordinality in the class labels of apparently purely categorical datasets [24].
In this work, we propose and explore a novel general methodology for ordinal classification tasks of 2D images. This includes a generic structure for the final layers of a Convolutional Neural Network (CNN), adaptable to a wide range of already existing architectures, as well as a prediction scheme adapted to this structure and an ordinal target label encoding, both based on the Error-Correcting Output Code (ECOC) framework. Our hypothesis is that this exploitation of ordinal information in the context of image classification may improve performance, not only on ordinal metrics but also on nominal ones.
This work is structured as follows: in Sect. 2 a brief literature review on ordinal classification and CNNs is presented. In Sect. 3 a baseline nominal methodology for training CNNs to solve classification problems is posed. Then, in Sect. 4, the ordinal classification framework is described along with three different ordinal classification methodologies for CNNs (two already existing methods based on previous works and one novel method). In Sect. 5, the experiments for the comparison of these four approaches are presented, including the datasets used for evaluation. Finally, in Sect. 6, the experiment results are shown, and Sect. 7 concludes with a discussion of these results.

Related Work
Early ordinal classification approaches were limited to unstructured input data, where no spatial or temporal relations exist between the inputs. Some basic approaches include using regular regression methods with rounding applied at the outputs [23] or using the label distance as a cost penalty [22]. The performance of such methods is limited because of the potentially unequal underlying distance between labels. Cumulative Link Models (CLMs) such as the Proportional Odds Model (POM) [27] or the gologit model [35], which learn not only a latent continuous variable but also a set of thresholds for each rank, are able to overcome this limitation. There are also adaptations of Support Vector Machines (SVMs) like SVORIM or SVOREX [7] which add ordinal constraints to the optimization of the model. Lastly, an approach known as Ordinal Binary Decomposition (OBD), where the original ordinal problem is split into a set of binary problems, has also been shown to improve performance. Examples of this are the cascade linear utility model [36], where a different model solves each binary problem, or neural networks with multiple outputs, one for each binary subproblem [9,20]. The main difficulty with OBD is how to combine the different outputs to produce a final decision.
These approaches are not suitable for structured information such as 2D images, where domain-specific feature extraction is still necessary. In this regard, CNNs provide an automatic method for extracting learned features from structured data in classification tasks. Unfortunately, due to their high number of parameters, CNNs easily suffer from overfitting, resulting in low generalization performance. In addition to classic techniques such as L2 regularization and dropout, recent techniques to avoid this problem include multi-stage implicit regularization [37] and network path pruning [38].
Adapting CNNs to work with ordinal information is a recent line of research that still needs extensive work. In [12,33], a CLM has been adapted as the activation function of a single output of a CNN. In [28], a CNN architecture for solving the OBD version of an age estimation problem is proposed, with a very simple combination of the binary outputs for obtaining a rank. [25] proposes a different methodology for small datasets based on triplets of samples and majority voting. Finally, [6] proposes an improvement over [28] by bounding the maximum binary error of each output.

Base Nominal CNN Methodology
Nominal classification is the general framework for tasks where a class label must be assigned to an object randomly sampled from a specific distribution. More formally, we want to obtain a rule $r: X \to Y$ that associates an input vector $x \in X \subseteq \mathbb{R}^K$ with a class label $y \in Y = \{C_1, C_2, \ldots, C_Q\}$. In order to learn this relation, a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ is provided, consisting of tuples of correctly classified samples.
Focusing on image classification tasks, CNNs are able to capture the spatial nature of image features, where nearby pixels are more strongly associated with each other than distant ones. We have considered four different well-known and competitive CNN architectures for image classification in order to have a good performance baseline: VGG11 [31], ResNet18 [17], MobileNetV3 [19] and ShuffleNetV2 [26]. We use these architectures as a baseline for traditional nominal classification.
While the specifics of each architecture vary widely, their general design follows the same overall premise:
• First, several blocks of convolution and pooling operations are applied to the input image.
• Then, the mapped features are processed by one or more hidden fully-connected layers.
• Finally, an output layer with as many units as classes and softmax activation is used, whose values represent the probability $P(y = C_q \mid x)$ of input sample $x$ being assigned each class label. These are compared to the ground truth labels of dataset $D$ to compute a loss function, which is minimized through some form of gradient descent procedure.

Decision Rule
During evaluation of the model, the maximum probability class of $x_i$ is selected as the predicted class label $\hat{y}_i$:
$$\hat{y}_i = \arg\max_{C_q} P(y_i = C_q \mid x_i),$$
where $P(y_i = C_q \mid x_i)$ is the probability, predicted by the network, of sample $x_i$ being assigned label $C_q$.

Loss Function
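For the baseline nominal methodology, categorical cross-entropy is used as the loss function during training:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{q=1}^{Q} \mathbb{1}\{y_i = C_q\} \log P(y_i = C_q \mid x_i),$$
where $N$ is the number of samples, $Q$ is the number of classes, and $\mathbb{1}\{y_i = C_q\}$ is the indicator function that is equal to 1 when $y_i = C_q$ and 0 otherwise.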
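As a minimal sketch of the nominal decision rule and loss described in this section, the following pure-Python snippet turns hypothetical output-layer values into softmax probabilities, selects the maximum-probability class and evaluates the categorical cross-entropy averaged over the batch. The function names, the logit values and the 0-based class indexing are illustrative assumptions and are not taken from the experimental code.

```python
import math

def softmax(logits):
    """Convert raw output-layer values into class probabilities."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_nominal(probs):
    """Decision rule: 0-based index of the maximum-probability class."""
    return max(range(len(probs)), key=lambda q: probs[q])

def cross_entropy(batch_probs, labels):
    """Categorical cross-entropy averaged over the batch (0-based labels)."""
    eps = 1e-12  # avoids log(0)
    total = sum(math.log(p[y] + eps) for p, y in zip(batch_probs, labels))
    return -total / len(labels)

# Hypothetical logits for two samples of a 3-class problem (illustration only)
batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
batch_probs = [softmax(z) for z in batch_logits]
predictions = [predict_nominal(p) for p in batch_probs]
loss = cross_entropy(batch_probs, [0, 2])
```

Note that neither the prediction nor the loss uses the order of the classes: any permutation of the class indices yields the same loss value.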

The Ordinal Classification Framework
As in a nominal classification framework, an ordinal classification task is characterised as the process of assigning a label $y$ to an input vector $x$, where $x \in X \subseteq \mathbb{R}^K$ and $y \in Y = \{C_1, C_2, \ldots, C_Q\}$; that is, $x$ is a $K$-dimensional vector and $y$ is a class label in a finite set. The goal is also to obtain some classification rule $r: X \to Y$ that predicts the categories of new patterns given a dataset $D$.
The ordinal framework differs from the nominal framework in the presence of a natural ordering of the class labels:
$$C_1 \prec C_2 \prec \cdots \prec C_Q,$$
where $\prec$ is an order relation. This is similar to regression, where $y \in \mathbb{R}$ and real values can be ordered by the $<$ operator but, in this case, the labels are discrete and carry qualitative rather than quantitative information [16]. Throughout this work, the convention that $i < j \Rightarrow C_i \prec C_j$ always holds.

Adapting CNNs for ordinal classification
Without altering the architectures in a major way, several different options are available for introducing the ordinal information of the original dataset in the model and its training process:
(a) Using a loss function that incorporates ordinal information in the optimization procedure.
(b) Altering only the fully-connected layer phase of the architecture, maintaining all previous layers as-is.
(c) Furthermore, altering the decision rule that assigns a label to each sample when making a prediction.
In the following three sections, three different ordinal methodologies are described: two already present in the literature as well as our proposed method.

Using an ordinal loss function: Quadratic Weighted Kappa
A naive approach to integrating ordinal information in the learning process of the model consists of optimizing an order-sensitive loss function instead of the classic categorical cross-entropy.
One promising such function is the weighted Kappa metric [3] (described in Sect. 5.4), a relevant score for ordinal classifiers as it measures the rank agreement between two raters (in our case, the ground truth labels and the model outputs) based on a disagreement penalty. This penalty is usually defined as the absolute (linear) or squared (quadratic, used in the rest of this paper) difference between the rank labels. It is often used in medical diagnosis systems, where the severity of a disease presents naturally ordered stages. It is defined as:
$$\kappa = 1 - \frac{\sum_{i=1}^{Q}\sum_{j=1}^{Q} w_{C_i,C_j}\, p_{C_i,C_j}}{\sum_{i=1}^{Q}\sum_{j=1}^{Q} w_{C_i,C_j}\, e_{C_i,C_j}}, \tag{3}$$
where $w_{C_i,C_j}$ is the disagreement cost when $y = C_i$ and $\hat{y} = C_j$ ($w_{C_i,C_j} = (i - j)^2$ for the quadratic case), and $p_{C_i,C_j}$ and $e_{C_i,C_j}$ are the observed agreement and the expected agreement due to chance for classes $C_i$ and $C_j$, respectively. A larger $\kappa$ value corresponds to a better agreement and vice versa, and so it is a metric to be maximised. Unfortunately, as is the case with accuracy, this metric is not continuous and is expressed in terms of discrete labels, preventing the application of gradient descent methods. In [8], an adaptation of this metric as a loss function for CNN model training is proposed, which maintains both the architecture of the network and the decision rule.
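To make the metric concrete, the following sketch computes the quadratic weighted kappa from two sequences of discrete labels, taking the observed frequencies from the joint counts and the chance frequencies from the product of the marginals. The function name and the 0-based integer encoding of the ranks are assumptions made for illustration only.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Quadratic weighted kappa between two label sequences (0-based ranks)."""
    n = len(y_true)
    # Observed joint frequencies of (true rank i, predicted rank j)
    observed = [[0.0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1.0 / n
    # Marginal frequencies, used for the agreement expected by chance
    true_marg = [sum(row) for row in observed]
    pred_marg = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]
    num = 0.0
    den = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            w = (i - j) ** 2  # quadratic disagreement cost
            num += w * observed[i][j]
            den += w * true_marg[i] * pred_marg[j]
    # den is zero only if both raters always use one and the same class
    return 1.0 - num / den

# Toy sequences, chosen only for illustration
kappa_perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4)
kappa_reversed = quadratic_weighted_kappa([0, 1, 2, 3], [3, 2, 1, 0], 4)
```

Perfect agreement yields a value of 1, while the fully reversed ranking in this toy example yields -1, which illustrates why the metric is maximised. Because it is computed from hard labels, it is piecewise constant with respect to the network outputs.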

Loss Function
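First, $\kappa$ is expressed in terms of probabilities instead of class labels, maintaining the penalty matrix $w_{C_i,C_j}$ but substituting $p_{C_i,C_j}$ and $e_{C_i,C_j}$ with the probability outputs of the model:
$$\kappa = 1 - \frac{\sum_{i=1}^{N}\sum_{q=1}^{Q} w_{y_i,C_q}\, P(y = C_q \mid x_i)}{\sum_{j=1}^{Q}\frac{N_j}{N}\sum_{q=1}^{Q} w_{C_j,C_q} \sum_{i=1}^{N} P(y = C_q \mid x_i)},$$
where $N_j$ is the number of samples with class label $C_j$ in the dataset $D$.
Then, in order to pose it as a minimization problem, the loss is defined as:
$$L = \log(1 - \kappa).$$
Further derivation and a more in-depth discussion can be found in [8].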
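A possible reading of this loss can be sketched as follows. The snippet replaces hard predictions with probability outputs and returns the logarithm of one minus the resulting kappa. The exact normalisation, the use of the class frequencies $N_j / N$ in the denominator and the $\log(1 - \kappa)$ form are assumptions based on our reading of [8]; the function name, the small stabilising constant and the toy probabilities are hypothetical.

```python
import math

def soft_qwk_loss(batch_probs, labels, class_counts):
    """log(1 - kappa), with kappa computed from probabilities (0-based labels)."""
    num_classes = len(class_counts)
    n_total = sum(class_counts)
    # Numerator: penalty-weighted probability mass given the true labels
    num = sum(
        (y - q) ** 2 * probs[q]
        for probs, y in zip(batch_probs, labels)
        for q in range(num_classes)
    )
    # Total probability mass assigned to each class over the batch
    col_mass = [sum(p[q] for p in batch_probs) for q in range(num_classes)]
    # Denominator: chance-level disagreement using class frequencies N_j / N
    den = sum(
        (class_counts[j] / n_total) * (j - q) ** 2 * col_mass[q]
        for j in range(num_classes)
        for q in range(num_classes)
    )
    eps = 1e-12  # assumed constant, keeps the logarithm finite
    return math.log(num / den + eps)  # 1 - kappa equals num / den

# Toy 3-class batch (illustration only): confident-correct vs. confident-wrong
labels = [0, 1, 2]
good_probs = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
bad_probs = [[0.1, 0.1, 0.8], [0.1, 0.8, 0.1], [0.8, 0.1, 0.1]]
loss_good = soft_qwk_loss(good_probs, labels, class_counts=[1, 1, 1])
loss_bad = soft_qwk_loss(bad_probs, labels, class_counts=[1, 1, 1])
```

Since every term is a smooth function of the probabilities, this quantity can be minimised with gradient descent, unlike the label-based metric.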

Decision Rule
In the same manner as the nominal approach of Sect. 3, the maximum probability class of $x_i$ is selected as the predicted class label $\hat{y}_i$.

The Cumulative Link Model Approach
For the CLM framework (a family of models which includes the POM [27]), only a small modification to the baseline nominal model is needed: the output is reduced to a single unit in the last layer, and the logit cumulative link function is used as the activation function instead of softmax:
$$P(y \preceq C_q \mid x) = \sigma(b_q - f(x)), \quad 1 \le q < Q, \tag{6}$$
where $P(y \preceq C_q \mid x)$ is the probability, predicted by the network, of sample $x$ being assigned label $C_q$ or lower, $f(x)$ is the single output of the model, $\sigma$ is the sigmoid function and $b_q$ is one of the $Q - 1$ thresholds learned as additional parameters. Note that this function predicts cumulative probabilities $P(y \preceq C_q \mid x)$ instead of individual ones like $P(y = C_q \mid x)$.

Decision Rule
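During evaluation, elementary probability rules are used to combine the cumulative probabilities from Eq. (6) into individual probabilities [15]:
$$P(y = C_q \mid x) = \begin{cases} P(y \preceq C_1 \mid x), & q = 1, \\ P(y \preceq C_q \mid x) - P(y \preceq C_{q-1} \mid x), & 1 < q < Q, \\ 1 - P(y \preceq C_{Q-1} \mid x), & q = Q, \end{cases} \tag{7}$$
and the maximum probability class is then selected as the predicted label $\hat{y}_i$:
$$\hat{y}_i = \arg\max_{C_q} P(y_i = C_q \mid x_i).$$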
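The CLM prediction step can be sketched as below: a single latent output and $Q - 1$ thresholds are mapped to cumulative probabilities through the sigmoid, which are then differenced into per-class probabilities. The sign convention $\sigma(b_q - f(x))$, the function names and the numeric values are assumptions for illustration; the thresholds are assumed to be sorted in increasing order, since otherwise the differences could become negative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def clm_class_probabilities(f_x, thresholds):
    """Map a latent output f(x) and Q-1 sorted thresholds to Q class probabilities."""
    # Cumulative probabilities P(y <= C_q | x), assuming the form sigmoid(b_q - f(x))
    cumulative = [sigmoid(b - f_x) for b in thresholds]
    cumulative.append(1.0)  # P(y <= C_Q | x) = 1 by definition
    # Individual probabilities are differences of consecutive cumulative ones
    probs = [cumulative[0]]
    for q in range(1, len(cumulative)):
        probs.append(cumulative[q] - cumulative[q - 1])
    return probs

# Hypothetical latent value and thresholds for a 4-class problem
probs = clm_class_probabilities(f_x=0.3, thresholds=[-1.0, 0.0, 1.5])
predicted_class = max(range(len(probs)), key=lambda q: probs[q])  # 0-based index
```

Because all classes share a single latent value, the resulting distribution is always unimodal along the ordinal scale, which is the main structural constraint this approach imposes.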

Loss Function
Cross-entropy loss is used as the loss function in the same manner as in the nominal model.

Our approach: Ordinal Binary Decomposition
For our ordinal approach, we decompose the original $Q$-class ordinal problem into $Q - 1$ binary decision problems, which is known as Ordinal Binary Decomposition (OBD). The $q$-th problem ($1 \le q < Q$) consists of deciding whether $y \succ C_q$ given sample $x$ (this is referred to as the "Ordered partitions" scheme in [16]).
To adapt the outputs of the model to this decomposition, the final fully-connected block is substituted by $Q - 1$ fully-connected blocks, each one with a single output unit with sigmoid activation. Each of the $Q - 1$ outputs $o_q$ of the model tries to predict the probability $P(y \succ C_q \mid x)$. The result of this modification is a set of $Q - 1$ different models, which share their convolutional feature extraction parameters and are trained simultaneously.

Decision Rule
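In the case of the OBD models, because the outputs are not individual probabilities but cumulative ones ($o_k = P(y \succ C_k \mid x)$), the decision rule requires combining several outputs. Moreover, these probabilities may be inconsistent: nothing forces them to fulfil basic probability properties like $P(y \succ C_k \mid x) \ge P(y \succ C_{k+1} \mid x)$. For this reason, Eq. (7) cannot be applied as for the CLM.
In order to circumvent this problem, a stable approach based on the ECOC framework is used: the ideal output vector $v(C_i)$ for each class $C_i$ is defined as
$$v(C_i) = (c_1, c_2, \ldots, c_{Q-1}),$$
where $c_j = \mathbb{1}\{C_j \prec C_i\}$, i.e. a vector with ones in all positions corresponding to classes which are lower than $C_i$ in the ordinal scale. This makes the ideal output vector for a sample $x_i$ with label $y_i = C_k$:
$$v(y_i) = (\underbrace{1, \ldots, 1}_{k-1}, \underbrace{0, \ldots, 0}_{Q-k}),$$
e.g., for a 4 class ordinal problem with labels $C_1$, $C_2$, $C_3$ and $C_4$, the ideal outputs would be
$$v(C_1) = (0, 0, 0), \quad v(C_2) = (1, 0, 0), \quad v(C_3) = (1, 1, 0), \quad v(C_4) = (1, 1, 1).$$
The decision rule is based on determining the ideal vector which minimizes the distance to the obtained output vector $o$:
$$\hat{y} = \arg\min_{C_q} \lVert o - v(C_q) \rVert_2,$$
where $\lVert \cdot \rVert_2$ is the $L_2$ norm. This distance metric is selected in order to align it with the loss function of the optimization process.
As an example to illustrate this prediction criterion, assume a 4 class ordinal problem like the one previously mentioned. For sample $x$, let the output of the model be the 3-dimensional vector $o = (0.8, 0.3, 0.2)$. The distance to each ideal class vector would be computed as:
$$\begin{aligned} \lVert o - v(C_1) \rVert_2 &= \sqrt{0.8^2 + 0.3^2 + 0.2^2} \approx 0.877, \\ \lVert o - v(C_2) \rVert_2 &= \sqrt{0.2^2 + 0.3^2 + 0.2^2} \approx 0.412, \\ \lVert o - v(C_3) \rVert_2 &= \sqrt{0.2^2 + 0.7^2 + 0.2^2} \approx 0.755, \\ \lVert o - v(C_4) \rVert_2 &= \sqrt{0.2^2 + 0.7^2 + 0.8^2} \approx 1.082. \end{aligned}$$
This process is illustrated in Fig. 1. The vector closest to $o$ is $v(C_2)$ and thus sample $x$ would be assigned the class label $\hat{y} = C_2$.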
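The worked example can be reproduced with a short sketch of the ECOC decision rule: build the ideal vector of each class and pick the class whose vector is closest to the model output in the $L_2$ sense. The function names are hypothetical and class indices are 1-based to match the notation $C_1, \ldots, C_Q$.

```python
import math

def ideal_vector(class_index, num_classes):
    """Ideal OBD output for class C_k (1-based k): k-1 ones followed by zeros."""
    return [1.0 if j < class_index else 0.0 for j in range(1, num_classes)]

def ecoc_predict(outputs, num_classes):
    """Return the 1-based class with the closest ideal vector, plus all distances."""
    distances = {
        k: math.dist(outputs, ideal_vector(k, num_classes))
        for k in range(1, num_classes + 1)
    }
    return min(distances, key=distances.get), distances

# Output vector from the example in the text
predicted, distances = ecoc_predict([0.8, 0.3, 0.2], num_classes=4)
```

The rule always returns a valid class even when the outputs are not monotonically decreasing, which is the inconsistency that prevents differencing the probabilities as in the CLM.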

Loss Function
For the OBD methodology, categorical cross-entropy has been substituted by the Mean Squared Error loss because it is better aligned with the distance function used for the ECOC decision [2]:
$$L = \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{Q-1}\left(\mathbb{1}\{y_i \succ C_k\} - P(y_i \succ C_k \mid x_i)\right)^2,$$
where $\mathbb{1}\{y_i \succ C_k\}$ is the indicator function that is equal to 1 when $y_i \succ C_k$ and 0 otherwise, and $P(y_i \succ C_k \mid x_i)$ is the probability that $y_i \succ C_k$ predicted by the network given a sample $x_i$. The four methodologies described in this section are illustrated in Fig. 2.
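A minimal sketch of the OBD target encoding and the associated squared-error loss is given below. Averaging over the number of samples only (and not over the $Q - 1$ outputs) is an assumption, as are the function names and the toy outputs; class indices are 1-based.

```python
def obd_targets(class_index, num_classes):
    """Binary targets 1{y > C_k} for k = 1..Q-1, given a 1-based class index."""
    return [1.0 if class_index > k else 0.0 for k in range(1, num_classes)]

def obd_mse_loss(batch_outputs, class_indices, num_classes):
    """Squared error between outputs and OBD targets, averaged over samples."""
    total = 0.0
    for outputs, y in zip(batch_outputs, class_indices):
        targets = obd_targets(y, num_classes)
        total += sum((t - o) ** 2 for t, o in zip(targets, outputs))
    return total / len(class_indices)

# Hypothetical sigmoid outputs for two samples with true labels C_2 and C_4
loss = obd_mse_loss([[0.8, 0.3, 0.2], [0.9, 0.9, 0.6]], [2, 4], num_classes=4)
```

For a single sample, this loss is the squared $L_2$ distance between the output vector and the ideal vector of the true class, so minimising it pulls the outputs towards the code word that the ECOC decision rule looks for.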
Fig. 2 The four compared methodologies, from left to right: the baseline nominal architecture (both using categorical cross-entropy as well as QWK as the loss function), CLM, and our proposal, OBD

Experiment Design

Datasets
The effects of the four described methodologies will be tested against the following two different datasets, chosen specifically for the ordinal nature of their class labels and an acute class imbalance.

Diabetic Retinopathy Dataset
The diabetic retinopathy dataset from Kaggle (referred to as "Retinopathy" from now on) consists of a total of 88 702 retina images labelled by a clinician on a 0 to 4 scale evaluating the presence of Diabetic Retinopathy (DR), an eye disease present in a large proportion of diabetes patients. It contains 65 343 images labelled as No DR, 6205 images labelled as Mild DR, 13 153 images labelled as Moderate DR, 2087 images labelled as Severe DR, and 1914 images labelled as Proliferative DR. The task consists of predicting the clinician label using the colour image of the retina. Three sample images can be seen in Fig. 3. All images have been resized to 128 × 128 pixels.

Adience Faces Dataset
The Adience faces dataset for age classification [11] (referred to simply as "Adience" from now on) is composed of 26 580 photos of 2284 different subjects extracted from real online albums and automatically cropped and aligned.17

Methodologies and Validation Scheme
Four different methodologies are tested against each other:
• The baseline nominal architecture, with categorical cross-entropy loss function.
• The same architecture, but substituting the loss function by the Quadratic Weighted Kappa (QWK) function described in Sect. 4.2.
• The CLM approach, as described in Sect. 4.3.
• The OBD approach with ECOC decision rule, as described in Sect. 4.4.
All of these are applied to all four of the previously mentioned architectures (VGG11, ResNet18, MobileNetV3 and ShuffleNetV2), yielding a total of sixteen different experiments for each of the two datasets.
In order to obtain a statistically significant result to test the hypotheses, each experiment is repeated 30 times on 30 different holdout splits of the original dataset, where 80% of samples are used for training and 20% are used for model evaluation. This split is performed in a stratified fashion, preserving the class proportions of the original dataset in the subsets. For the Retinopathy dataset this leaves 70 962 training samples (of which 7096 are reserved for validation) and 17 740 test samples in each split. In the case of the Adience dataset, 14 161 samples are used for training (of which 1416 are reserved for validation) and 3541 are used for evaluation.
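The repeated stratified holdout scheme can be sketched as follows. The helper shuffles the sample indices of each class separately and moves 20% of them to the test set, so class proportions are preserved; one seed per repetition gives 30 different splits. The function name, the seeding strategy and the toy label list are assumptions and do not reproduce the actual experimental code.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train/test sets, preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test

# Toy imbalanced label list (illustration only); 30 seeds give 30 holdout splits
labels = [0] * 70 + [1] * 20 + [2] * 10
splits = [stratified_holdout(labels, seed=s) for s in range(30)]
```

The same helper could be applied a second time to the training indices with a fraction of 0.1 to reserve the stratified validation subset.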

Training Scheme
In all experiments, weights are initialized randomly using the He initialization scheme described in [18]. They are then adjusted using the Adam method [21] with a learning rate $\eta = 1 \times 10^{-4}$.
In the case of VGG11, both dropout ($p = 0.5$) and L2 regularization (with a weight of $5 \times 10^{-4}$) are applied only in the fully-connected layers, as in the original paper [31]. For ResNet18, batch normalization is applied after every convolution operation and an L2 penalty (with a weight of $1 \times 10^{-4}$) is added to all mappings [17]. The number of trainable parameters for each model is available in Table 1.
In order to help overcome the class imbalance, class weighting is applied to the loss function: each class $C_q$ is assigned a weight $w_q$ computed from $N_q$ (the number of training samples of class $C_q$) and a constant $C = 3 \times 10^{-5}$. This weight $w_q$ is multiplied by the loss contribution of each sample with a ground truth label of $C_q$.
Before training, 10% of training samples are reserved for validation, again selected in a stratified fashion according to the class labels. Model weights are updated in batches of 72 training samples and loss performance is monitored on both the training and validation sets. If validation performance does not improve for 5 full epochs, training is halted, and the parameters that performed best on the validation set are restored.
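The stopping criterion can be illustrated with a small sketch that scans a sequence of per-epoch validation losses and reports which epoch would be restored and at which epoch training would halt. Monitoring the validation loss (lower is better), the function name and the loss values are assumptions; the patience of 5 epochs comes from the description above.

```python
def early_stopping_epochs(validation_losses, patience=5):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            # Improvement: a checkpoint of the weights would be stored here
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: halt and restore the checkpoint
            return best_epoch, epoch
    return best_epoch, len(validation_losses) - 1

# Hypothetical validation losses (illustration only)
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.74, 0.73, 0.9, 0.6]
best_epoch, stop_epoch = early_stopping_epochs(losses)
```

In this toy sequence the later improvement at the last epoch is never reached, which shows the trade-off that the patience value controls.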
The code used to perform the experiments can be accessed through GitHub.

Performance Metrics
The classical performance metric in classification tasks is the Correct Classification Rate (CCR). However, given that both datasets present a very high class imbalance, the traditional CCR is not a representative measure of model performance: for example, in the case of the Retinopathy dataset, a dummy classifier that always assigns the majority class label (class 0) would obtain a CCR of 73%.
In order to monitor global per-class performance, metrics such as the Average Area Under the Receiver Operating Characteristic (ROC) curve (AvAUC), minimum sensitivity (MS) and geometric mean of the sensitivities (GMS) [10] will also be included.
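The sensitivity-based metrics can be sketched from a confusion matrix as shown below, where rows are assumed to correspond to the true classes and columns to the predicted ones. The function names and the toy matrix are hypothetical; the example only illustrates how a high CCR can coexist with a low minimum sensitivity on imbalanced data.

```python
import math

def sensitivities(confusion):
    """Per-class sensitivity (recall); rows are true classes, columns predictions."""
    return [row[q] / sum(row) for q, row in enumerate(confusion)]

def minimum_sensitivity(confusion):
    """MS: sensitivity of the worst-classified class."""
    return min(sensitivities(confusion))

def geometric_mean_sensitivity(confusion):
    """GMS: geometric mean of the per-class sensitivities."""
    s = sensitivities(confusion)
    return math.prod(s) ** (1.0 / len(s))

# Toy imbalanced 3-class confusion matrix (illustration only)
confusion = [[90, 10, 0], [5, 4, 1], [0, 2, 2]]
ms = minimum_sensitivity(confusion)
gms = geometric_mean_sensitivity(confusion)
ccr = sum(confusion[q][q] for q in range(3)) / sum(map(sum, confusion))
```

Both MS and GMS drop to zero as soon as one class is never predicted correctly, which makes them sensitive to classes that a model ignores systematically.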
Also, for ordinal classification problems, rank agreement metrics have been selected for evaluation, including the Root Mean Squared Error (RMSE) (comparing actual and predicted labels, represented as consecutive integers in the ordinal scale), Spearman's rank correlation coefficient ($r_s$) [5] and the Quadratic Weighted Cohen's Kappa ($\kappa$) [3] (described in Eq. (3)):
$$\mathit{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(O(y_i) - O(\hat{y}_i)\right)^2}, \qquad r_s = \frac{\mathrm{Cov}(O(y), O(\hat{y}))}{\sigma_{O(y)}\,\sigma_{O(\hat{y})}},$$
where $O(C_q) = q$ is the ordinal number of label $C_q$, $\mathrm{Cov}(O(y), O(\hat{y}))$ is the covariance between the ordinal numbers of the ground truth labels and those of the predicted labels, and $\sigma_{O(y)}$ and $\sigma_{O(\hat{y})}$ are their respective standard deviations.
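The two rank agreement metrics defined in terms of ordinal numbers can be sketched as follows. Labels are passed directly as their ordinal numbers $O(C_q) = q$, and the correlation is computed exactly as described in the text, as the covariance divided by the product of the standard deviations. The function names and the toy label sequences are assumptions for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ordinal numbers."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rank_correlation(y_true, y_pred):
    """Covariance of the ordinal numbers divided by the product of their std devs."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    std_t = math.sqrt(sum((t - mean_t) ** 2 for t in y_true) / n)
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in y_pred) / n)
    return cov / (std_t * std_p)  # undefined if either sequence is constant

# Toy ordinal labels for a 5-class problem (illustration only)
y_true = [1, 2, 3, 4, 5]
y_pred = [1, 2, 2, 4, 4]
error = rmse(y_true, y_pred)
r_s = rank_correlation(y_true, y_pred)
```

Unlike CCR, both values depend on how far a wrong prediction falls from the true rank: predicting $C_4$ for a $C_5$ sample is penalised less than predicting $C_1$.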
An illustrated example of the global experimentation procedure can be found in Fig. 5. In Sect. 6, mean results and standard deviations are reported for each methodology. Then, statistical hypothesis testing is performed in order to discern the effects of the different factors and conclude whether the OBD methodology shows a significant improvement over the other three.
Fig. 5 The general experimentation procedure as described in Sect. 5
Fig. 6 Average training curves for each model and methodology when applied to the diabetic retinopathy dataset. Train loss is shown as solid lines and validation loss as dashed lines. The average is calculated over all executions which reach the corresponding iteration

Results
The average of the training curves over all 30 repetitions is shown in Figs. 6 and 7. Note how the QWK methodology fails to converge when used in conjunction with the VGG11 architecture: the high depth of this architecture makes the gradients vanish in the backpropagation phase of training, which is known as the "vanishing gradients" problem. All the other architectures tested implement residual paths in the network, allowing them to avoid this problem [34]. Note also that the OBD methodology does not alter the depth of the CNN model, so it will never cause this problem by itself.
Additionally, the training time for each methodology and architecture is shown in Figs. 8 and 9. In accordance with the number of parameters (Table 1), the VGG11 architecture takes the longest time to train, while the other three take a similar time. Regarding the methodologies, while the nominal approach usually takes less time than the ordinal ones, the OBD methodology is a close second in speed.
The average experimental results for each experiment are shown in Appendix A (Tables 4-11) and the mean over all four architectures is summarized in Table 2 for convenience. It can be noted that for the Retinopathy dataset the CLM models slightly improve the ordinal metrics, at the cost of worsening the metrics related to the imbalance problem (AvAUC, MS, and GMS). Meanwhile, the OBD models improve the ordinal metrics further while also improving the class balancing metrics. This is done at the cost of worsening CCR, but only because of the high class imbalance. In the case of the Adience dataset the OBD models still achieve a higher score in class-balancing metrics, although $r_s$ and $\kappa$ are worsened slightly.
From the confusion matrices it can be noted that, although Table 2 shows that the CLM improves on the ordinal metrics on the Adience dataset, it fails on every class balancing metric compared to the nominal model, as it ignores both classes 1 and 3 in the Retinopathy dataset, as well as classes 3, 5 and 6 in the Adience dataset. The OBD model, on the other hand, is able to improve both class balancing and ordinal metrics. This is achieved at the cost of losing some performance on the extreme classes, but note how sensitivity and precision never fall to zero on any class when using the OBD model, that is, no class is ignored systematically. This is easily seen in the confusion matrices for each methodology and architecture shown in Appendix A (Figs. 10-17).

Statistical Analysis
To determine the statistical significance of the mean differences observed for each classifier, each architecture and each dataset, we have carried out a parametric Analysis of Variance (ANOVA) test [13,14] for each of the evaluated metrics. The three factors considered for the experimental design are: (i) the dataset (Adience and Retinopathy), (ii) the CNN architecture (VGG11, ResNet18, MobileNetV3 and ShuffleNetV2) and (iii) the methodology (nominal, QWK, CLM and OBD).
For each combination of these three factors we have repeated the experiment 30 times with different data splits and weight initialization seeds. Using the Kolmogorov-Smirnov test, we have checked for all metrics mentioned in Sect. 5.4 whether the null hypothesis stating that the results are drawn from a normal distribution cannot be rejected (for a significance level of $\alpha = 0.05$). This is true for all metrics except MS and GMS, namely for the Quadratic Weighted Cohen's Kappa ($\kappa$), AvAUC, RMSE, Spearman's rank correlation coefficient ($r_s$) and CCR. Only these five metrics are considered for the subsequent analysis, given that ANOVA is a parametric test and can only be applied to normally distributed variables.
After this, ANOVA is performed for these five metrics. The ANOVA tables are available in Appendix B. According to this analysis, for all normally distributed metrics there exist significant differences in the mean value (for a significance level of $\alpha = 0.05$) concerning the three individual factors (Dataset, Architecture and Methodology, all p-values < 0.001). We also found significant interactions between all the pairs of factors (p-values < 0.001) and between all three factors (p-value < 0.001). This shows that: (1) the impact of the architecture and the methodology varies across datasets, (2) the architecture significantly affects performance, (3) the effect of the methodology is affected by the CNN architecture (that is, some architectures are better suited for each methodology), and (4) the methodology alone affects the performance, OBD being in the lead according to the mean results of Table 2.
For this reason, we now analyse the magnitude of those differences according to the Methodology factor. A post-hoc Tukey's HSD multiple comparison test [32] has been performed on each of the metrics shown to be affected by this factor. The purpose of this test is to arrange the methodologies into groups of statistically similar performance, where each group is significantly different from the rest. The results of this test are summarized in Table 3 by grouping the methodologies in subsets according to their performance on each metric. The first subset contains the worst methodologies, while the last one includes the best methodologies.
Note that for $\kappa$, AvAUC, and $r_s$ the OBD methodology has a significantly better mean performance than the other three methodologies. For the RMSE metric both CLM and OBD exhibit similar performance, which is significantly better than that of the other two methodologies. Finally, for CCR both CLM and the nominal methodology perform similarly and better than OBD and QWK.

Conclusions and Future Work
A new ordinal CNN architecture based on Ordinal Binary Decomposition has been proposed, as well as a decision scheme based on ECOC, showing that it is able to significantly outperform a purely nominal approach as well as already existing ordinal approaches, especially when considering highly imbalanced scenarios like medical datasets and web-scraped datasets. Specifically, the proposed OBD methodology is able to improve both class balancing metrics and ordinal metrics such as RMSE, Spearman's rank correlation coefficient and Quadratic Weighted Cohen's Kappa. This methodology is easy to adapt to any other ordinal task.
While the tested architectures are widely established and overall well performing models, different and more novel architectures could also be adapted in the same manner. This adds a new, fairly generic tool for classification tasks where ordinal information can be exploited. Moreover, this modification does not increase the number of parameters or memory consumption of the network and does not significantly increase the running time for training, making it applicable in memory-limited scenarios.
In future work, more complex data structures like 3D images can be studied. This is possible because the needed modifications only alter the latter stages of the network, allowing for arbitrary input shapes. Also, even though we have been able to improve on class imbalance sensitive metrics, further work is necessary, as can be noted from the confusion matrices. Better class balancing approaches than loss weighting, such as a data augmentation scheme sensitive to ordinal information, can be applied in order to improve on this.

Fig. 1
The model output vector $o$ (dot in red) for sample $x$ and its distance to each of the ideal class vectors (dashed lines), illustrated as a 3D graphic where each dimension represents one of the three model outputs. The closest point is $v(C_2)$ (marked in red), and thus $x$ is assigned label $C_2$

Fig. 7
Average training curves for each model and methodology when applied to the Adience dataset. Train loss is shown as solid lines and validation loss as dashed lines. The average is calculated over all executions which reach the corresponding iteration

Fig. 10
Average confusion matrices for each methodology using the VGG11 architecture on the Retinopathy dataset. Rows are normalised according to the total number of samples in the test set for each class

Table 1
Number of trainable parameters and total memory size of the trained models for each methodology and architecture

Table 2
Average experimental results for each of the four methodologies on the test sets of both datasets.Metrics to maximize are marked with (↑) and metrics to minimize with (↓).Best results are highlighted in bold and second best in italics

Table 3
Results of the Tukey's HSD test for all tested metrics.Methodologies are grouped such that the elements within a subset are not significantly different, while the differences between each group are significant.The first subset contains the worst methodologies, while the last subset groups the best methodologies.The best performing subset is highlighted in bold

Table 4
Mean results using the VGG11 architecture for the Retinopathy dataset.Metrics to maximize are marked with (↑) and metrics to minimize with (↓).Best results are highlighted in bold and second best in italics

Table 5
Mean results using the ResNet18 architecture for the Retinopathy dataset.Metrics to maximize are marked with (↑) and metrics to minimize with (↓).Best results are highlighted in bold and second best in italics

Table 6
Mean results using the MobileNetV3 architecture for the Retinopathy dataset.Metrics to maximize are marked with (↑) and metrics to minimize with (↓).Best results are highlighted in bold and second best in italics

Table 7
Mean results using the ShuffleNetV2 architecture for the Retinopathy dataset.Metrics to maximize are marked with (↑) and metrics to minimize with (↓).Best results are highlighted in bold and second best in italics

Table 8
Mean results using the VGG11 architecture for the Adience dataset.Metrics to maximize are marked with (↑) and metrics to minimize with (↓).Best results are highlighted in bold and second best in italics

Table 9
Mean results using the ResNet18 architecture for the Adience dataset.Metrics to maximize are marked with (↑) and metrics to minimize with (↓).Best results are highlighted in bold and second best in italics

Table 10
Mean results using the MobileNetV3 architecture for the Adience dataset.Metrics to maximize are marked with (↑) and metrics to minimize with (↓).Best results are highlighted in bold and second best in italics

Table 11
Mean results using the ShuffleNetV2 architecture for the Adience dataset.Metrics to maximize are marked with (↑) and metrics to minimize with (↓).Best results are highlighted in bold and second best in italics

Table 12
ANOVA III table for the CCR test results

Table 13
ANOVA III table for the AvAUC test results

Table 14
ANOVA III table for the RMSE test results

Table 15
ANOVA III table for the $r_s$ test results

Table 16
ANOVA III table for the $\kappa$ test results