1 Introduction

There exists a large variety of classification tasks tackled in Machine Learning (ML) literature. It is natural to group them, for example, depending on the number of different class labels assigned to the classification samples. According to this, we differentiate between binary classification tasks (those where only two different labels are present, usually a “positive” class and a “negative” class) and multi-class classification tasks (those where more than two different labels exist).

Focusing on multi-class tasks, one could also pay attention to the relation between the class labels. Classic approaches treat all classes equally, assuming no relations between them, and simply try to maximize the number of samples assigned the correct label.

However, when an order relation between the class labels is present due to the nature of the problem itself, these tasks can be posed as “ordinal classification” (sometimes referred to as “ordinal regression”) tasks, which have gained popularity in the last decade. This family of problems, halfway between nominal classification and regression, presents extra information which can be exploited in order to improve performance, sometimes regarding different metrics than usual [1, 4, 16]. Methods that exploit this information have been shown to outperform purely nominal methods in the context of unstructured data [10, 29, 30], and some methods have been proposed to search for ordinality in the class labels of apparently purely categorical datasets [24].

In this work, we propose and explore a novel general methodology for ordinal classification tasks of 2D images. This includes a generic structure for the final layers of a Convolutional Neural Network (CNN), adaptable to a wide range of already existing architectures, as well as a prediction scheme adapted to this structure and an ordinal target label encoding, both based on the Error-Correcting Output Code (ECOC) framework. Our hypothesis is that this exploitation of ordinal information in the context of image classification may improve performance, not only on ordinal metrics but also in nominal ones.

This work is structured as follows: in Sect. 2 a brief literature review on ordinal classification and CNNs is presented. In Sect. 3 a baseline nominal methodology for training CNNs to solve classification problems is posed. Then in Sect. 4 the ordinal classification framework is described and three different ordinal classification methodologies for CNNs (two already existing methods based on previous works and one novel method) are described. In Sect. 5, the experiments for the comparison of these four approaches are presented, including the datasets used for evaluation. Finally, in Sect. 6, the experiment results are shown, and Sect. 7 concludes with a discussion of these results.

2 Related Work

Early ordinal classification approaches were limited to unstructured input data, where no spatial or temporal relations exist between the inputs. Some basic approaches include using regular regression methods with rounding applied at the outputs [23] or using the label distance as a cost penalty [22]. The performance of such methods is limited because of the potentially unequal underlying distance between labels. Cumulative Link Models (CLMs) such as the Proportional Odds Model (POM) [27] or the gologit model [35], which not only learn a latent continuous variable but also a set of thresholds for each rank, are able to overcome this limitation. There are also adaptations of Support Vector Machines (SVMs) like SVORIM or SVOREX [7] which add ordinal constraints to the optimization of the model. Lastly, an approach known as Ordinal Binary Decomposition (OBD), where the original ordinal problem is split into a set of binary problems, has also proven to improve performance. Examples of this are the cascade linear utility model [36], where a different model solves each binary problem, or neural networks coupled with multiple outputs, one for each binary subproblem [9, 20]. The main problem with OBD is the matter of combining the different outputs to produce a final decision.

These approaches are not suitable for structured information such as 2D images, where domain-specific feature extraction is still necessary. In this regard, CNNs provide an automatic method for extracting learned features from structured data in classification tasks. Unfortunately, due to their high number of parameters CNNs suffer easily from overfitting problems resulting in low generalization performance. In addition to classic techniques such as \(L_2\) regularization and dropout, recent techniques include multi-stage implicit regularization [37] and network path pruning [38] to avoid this problem.

Adapting CNNs to work with ordinal information is a recent line of research, still needing extensive work. In [12, 33], a CLM has been adapted as the activation function of a single output of a CNN. In [28] a CNN architecture for solving the OBD version of an age estimation problem is proposed, with a very simple combination of the binary outputs for obtaining a rank. [25] proposes a different methodology for small datasets based on triplets of samples and majority voting. Finally, [6] proposes an improvement over [28] by bounding the maximum binary error of each output.

3 Base Nominal CNN Methodology

Nominal classification is the general framework for tasks where there is a need to assign a class label to a randomly sampled object from a specific distribution. More formally, we want to obtain a rule \(r: \mathcal {X} \rightarrow \mathcal {Y}\) that associates an input vector \({\mathbf {x}}\in \mathcal {X} \subseteq {\mathbb {R}}^K\) to a class label \(y \in \mathcal {Y} = \{{\mathcal {C}}_1, {\mathcal {C}}_2, ..., {\mathcal {C}}_Q\}\) in a finite set. In order to learn this relation, a dataset D is provided consisting of tuples of correctly classified samples \(D = \left\{ ({\mathbf {x}}_i, y_i) \mid {\mathbf {x}}_i \in \mathcal {X},\, y_i \in \mathcal {Y},\, i \in \{1, \dots , N\} \right\} \).

Focusing on image classification tasks, CNNs are able to capture the spatial nature of image features, where nearby pixels have a stronger association between them than far away ones. We have considered four different well-known and competitive CNN architectures for image classification in order to have a good performance baseline: VGG11 [31], ResNet18 [17], MobileNetV3 [19] and ShuffleNetV2 [26]. We use these architectures as a baseline for traditional nominal classification.

While the specifics of each architecture vary widely, their general design follows the same overall premise:

  • First, several blocks of convolution and pooling operations are applied to the input image.

  • Then, the mapped features are processed by one or more hidden fully-connected layers.

  • Finally, an output layer with as many units as classes and softmax activation is used, whose values represent the probability of input sample \({\mathbf {x}}\) being assigned each class label, \(P(y = {\mathcal {C}}_q \mid {\mathbf {x}})\). These are compared to the ground truth labels of dataset D to compute a loss function \(\ell \), which is minimized through some form of gradient descent procedure.

3.1 Decision Rule

During evaluation of the model, the maximum probability class of \({\mathbf {x}}_i\) is selected as the predicted class label \({\hat{y}}_i\):

$$\begin{aligned} {\hat{y}}_i = {\mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{\mathcal {C}_q \in \mathcal {Y}}} P(y_i = {\mathcal {C}}_q \mid {\mathbf {x}}_i), \end{aligned}$$
(1)

where \(P(y_i = {\mathcal {C}}_q \mid {\mathbf {x}}_i)\) is the probability of sample \({\mathbf {x}}_i\) being assigned label \({\mathcal {C}}_q\) predicted by the network.

3.2 Loss Function

For the baseline nominal methodology, categorical cross-entropy is used as the loss function \(\ell \) during training:

$$\begin{aligned} \ell = -\frac{1}{N}\sum _{i=1}^N \sum _{q=1}^Q 1\{y_i = {\mathcal {C}}_q\} \log (P(y_i = {\mathcal {C}}_q \mid {\mathbf {x}}_i)), \end{aligned}$$
(2)

where \(1\{y_i = {\mathcal {C}}_q\}\) is the indicator function that is equal to 1 when \(y_i = {\mathcal {C}}_q\) and 0 otherwise.
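As an illustrative sketch (not the code used in the experiments), Eqs. (1) and (2) can be written in plain Python as follows. The function names, the 0-based class indices and the sample probabilities are assumptions made only for this example.

```python
import math

def predict(probs):
    """Eq. (1): 0-based index of the class with maximum predicted probability."""
    return max(range(len(probs)), key=lambda q: probs[q])

def cross_entropy(prob_rows, labels):
    """Eq. (2): mean negative log-probability assigned to the true class."""
    total = sum(math.log(row[y]) for row, y in zip(prob_rows, labels))
    return -total / len(labels)

# Hypothetical softmax outputs for N = 2 samples and Q = 3 classes.
prob_rows = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
labels = [0, 2]  # 0-based ground truth class indices
predictions = [predict(row) for row in prob_rows]
loss = cross_entropy(prob_rows, labels)
```

Because the indicator in Eq. (2) selects a single term per sample, the inner sum over classes reduces to the log-probability of the true class, which is what `cross_entropy` computes.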

4 The Ordinal Classification Framework

As in a nominal classification framework, an ordinal classification task is characterised as the prediction process of assigning a label y to an input vector \({\mathbf {x}}\), where \({\mathbf {x}}\in \mathcal {X} \subseteq {\mathbb {R}}^K\) and \(y \in \mathcal {Y} = \{{\mathcal {C}}_1, {\mathcal {C}}_2, \dots , {\mathcal {C}}_Q\}\), i.e., \({\mathbf {x}}\) is a K-dimensional vector and y is a class label in a finite set. The goal is also to obtain some classification rule \(r: \mathcal {X} \rightarrow \mathcal {Y}\) that predicts the categories of new patterns given a dataset D.

Where the ordinal framework differs from the nominal framework is in the presence of a natural ordering of the class labels: \({\mathcal {C}}_1 \prec {\mathcal {C}}_2 \prec \dots \prec {\mathcal {C}}_Q\), where \(\prec \) is an order relation. This is similar to regression, where \(y \in {\mathbb {R}}\), and real values can be ordered by the < operator but, in this case, the labels are discrete and include qualitative information instead of quantitative [16]. Throughout this work, the convention that \(i < j \Rightarrow {\mathcal {C}}_i \prec {\mathcal {C}}_j\) always holds.

4.1 Adapting CNNs for ordinal classification

Without altering the architectures in a major way, several different options are available for introducing the ordinal information of the original dataset in the model and its training process:

  (a) Using a loss function that incorporates ordinal information in the optimization procedure.

  (b) Altering only the fully-connected layer phase of the architecture, maintaining all previous layers as-is.

  (c) Furthermore, altering the decision rule that assigns a label to each sample when making a prediction.

In the following three sections, three different ordinal methodologies are described: two already present in the literature as well as our proposed method.

4.2 Using an ordinal loss function: Quadratic Weighted Kappa

A naive approach to integrating ordinal information into the learning process of the model consists of optimizing an order-sensitive loss function instead of the classic categorical cross-entropy.

One promising function of this kind is the weighted Kappa metric [3] (described in Sect. 5.4), a relevant score for ordinal classifiers as it measures the rank agreement between two raters (in our case, the ground truth labels and the model outputs) based on a disagreement penalty. This penalty is usually defined as the absolute (linear) or squared (quadratic, used in the rest of this paper) difference between the rank labels. It is often used in medical diagnosis systems, where the severity of a disease presents naturally ordered stages. It is defined as:

$$\begin{aligned} \kappa = 1 - \dfrac{\sum _{i=1}^Q \sum _{j=1}^Q w_{{\mathcal {C}}_i,{\mathcal {C}}_j} p_{{\mathcal {C}}_i,{\mathcal {C}}_j}}{\sum _{i=1}^Q \sum _{j=1}^Q w_{{\mathcal {C}}_i,{\mathcal {C}}_j} e_{{\mathcal {C}}_i,{\mathcal {C}}_j}}, \end{aligned}$$
(3)

where \(w_{{\mathcal {C}}_i,{\mathcal {C}}_j}\) is the disagreement cost when \(y = {\mathcal {C}}_i\) and \({\hat{y}} = {\mathcal {C}}_j\) (\(w_{{\mathcal {C}}_i,{\mathcal {C}}_j} = (i - j)^2\) for the quadratic case), and \(p_{{\mathcal {C}}_i,{\mathcal {C}}_j}\) and \(e_{{\mathcal {C}}_i,{\mathcal {C}}_j}\) are the observed agreement and expected agreement due to chance for classes \({\mathcal {C}}_i\) and \({\mathcal {C}}_j\), respectively. A larger \(\kappa \) value corresponds with a better agreement and vice versa, and so it is a metric to be maximised.
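A minimal sketch of Eq. (3) for the quadratic case is given below. It assumes that \(p_{{\mathcal {C}}_i,{\mathcal {C}}_j}\) is the observed proportion of samples with true class \({\mathcal {C}}_i\) and predicted class \({\mathcal {C}}_j\), and that \(e_{{\mathcal {C}}_i,{\mathcal {C}}_j}\) is the product of the corresponding marginal proportions; the function name and the 0-based labels are illustrative choices.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Eq. (3) with w_ij = (i - j)^2, for 0-based integer labels."""
    n = len(y_true)
    # Observed proportions p_ij (normalized confusion matrix).
    observed = [[0.0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1.0 / n
    # Marginals used for the chance-expected proportions e_ij.
    row = [sum(observed[i]) for i in range(num_classes)]
    col = [sum(observed[i][j] for i in range(num_classes))
           for j in range(num_classes)]
    classes = range(num_classes)
    num = sum((i - j) ** 2 * observed[i][j] for i in classes for j in classes)
    den = sum((i - j) ** 2 * row[i] * col[j] for i in classes for j in classes)
    # den is zero if both raters use one and the same single class;
    # that degenerate case is not handled in this sketch.
    return 1.0 - num / den

kappa_perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4)
kappa_reversed = quadratic_weighted_kappa([0, 1, 2, 3], [3, 2, 1, 0], 4)
```

Perfect agreement gives \(\kappa = 1\), while the fully reversed ranking in the second call gives \(\kappa = -1\).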

Unfortunately, as is the case with accuracy, this metric is not continuous and is expressed in terms of discrete labels, preventing the application of gradient descent methods. In [8], a proposal is made to adapt this metric as a loss function for CNN model training, maintaining both the architecture of the network and the decision rule.

4.2.1 Loss Function

First, \(\kappa \) is expressed in terms of probabilities instead of class labels, maintaining the penalty matrix \(w_{{\mathcal {C}}_i,{\mathcal {C}}_j}\) but substituting \(p_{{\mathcal {C}}_i,{\mathcal {C}}_j}\) and \(e_{{\mathcal {C}}_i,{\mathcal {C}}_j}\) for the probability outputs of the model:

$$\begin{aligned} {\hat{\kappa }} = 1 - \frac{\sum _{i=1}^N \sum _{q=1}^Q w_{y_i,{\mathcal {C}}_q} P(y_i = {\mathcal {C}}_q \mid {\mathbf {x}}_i)}{\sum _{j=1}^Q \frac{N_j}{N} \sum _{k=1}^Q (w_{{\mathcal {C}}_j,{\mathcal {C}}_k} \sum _{i=1}^N P(y_i = {\mathcal {C}}_k \mid {\mathbf {x}}_i))}, \end{aligned}$$
(4)

where \(N_j\) is the number of samples with class label \({\mathcal {C}}_j\) in the dataset D.

Then, in order to pose it as a minimization problem, loss \(\ell \) is defined as:

$$\begin{aligned} \ell = \log (1-{\hat{\kappa }}) \text {, where } \ell \in (-\infty , \log 2]. \end{aligned}$$
(5)

Further derivation and a more in-depth discussion can be found in [8].
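Eqs. (4) and (5) can be sketched directly from their definitions, as shown below. This is not the implementation of [8]; the function name, the 0-based labels and the small constant `eps` (added here only to avoid \(\log 0\) for perfect predictions) are assumptions of this example.

```python
import math

def qwk_loss(prob_rows, labels, eps=1e-10):
    """Loss of Eq. (5), log(1 - kappa_hat), with kappa_hat from Eq. (4)."""
    n = len(labels)
    num_classes = len(prob_rows[0])
    classes = range(num_classes)
    # Numerator: quadratic penalty weighted by the predicted probabilities.
    num = sum((y - q) ** 2 * row[q]
              for row, y in zip(prob_rows, labels) for q in classes)
    # Denominator: class frequencies N_j / N times summed probabilities.
    class_counts = [labels.count(j) for j in classes]
    prob_sums = [sum(row[k] for row in prob_rows) for k in classes]
    den = sum(class_counts[j] / n * (j - k) ** 2 * prob_sums[k]
              for j in classes for k in classes)
    return math.log(num / den + eps)  # eps is an assumption, not in Eq. (5)

labels = [0, 1]
loss_chance = qwk_loss([[0.5, 0.5], [0.5, 0.5]], labels)
loss_good = qwk_loss([[0.9, 0.1], [0.2, 0.8]], labels)
```

In this toy case, uniform predictions give \({\hat{\kappa }} = 0\) and a loss close to zero, while the better predictions give a negative loss, consistent with the loss being minimized as agreement improves.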

4.2.2 Decision Rule

In the same manner as the nominal approach of Sect. 3, the maximum probability class of \({\mathbf {x}}_i\) is selected as the predicted class label \({\hat{y}}_i\).

4.3 The Cumulative Link Model Approach

For the CLM framework (family of models which includes the POM [27]), only a small modification to the baseline nominal model is needed: the output is reduced to only a single unit in the last layer, and the logit cumulative link function is used as the activation function instead of softmax:

$$\begin{aligned} P(y \preceq {\mathcal {C}}_q \mid {\mathbf {x}}) = \sigma (b_q - f({\mathbf {x}})) \;, \; 1 \le q < Q, \end{aligned}$$
(6)

where \(P(y \preceq {\mathcal {C}}_q \mid {\mathbf {x}})\) is the probability, predicted by the network, of sample \({\mathbf {x}}\) being assigned label \({\mathcal {C}}_q\) or lower, \(f({\mathbf {x}})\) is the single output of the model, \(\sigma \) is the sigmoid function and \(b_q\) is one of the \(Q-1\) thresholds learned as additional parameters. Note that cumulative probabilities \(P(y \preceq {\mathcal {C}}_q \mid {\mathbf {x}})\) are predicted by this function instead of individual ones like \(P(y = {\mathcal {C}}_q \mid {\mathbf {x}})\).

4.3.1 Decision Rule

During evaluation, elementary probability rules are used to combine the cumulative probabilities from Eq. (6) into individual probabilities [15]:

$$\begin{aligned} P(y_i = {\mathcal {C}}_q \mid {\mathbf {x}}_i) = {\left\{ \begin{array}{ll} P(y_i \preceq {\mathcal {C}}_1 \mid {\mathbf {x}}_i), \; &{}\text {if} \; q=1, \\ P(y_i \preceq {\mathcal {C}}_q \mid {\mathbf {x}}_i) - P(y_i \preceq {\mathcal {C}}_{q-1} \mid {\mathbf {x}}_i), \; &{}\text {if} \; 1< q < Q, \\ 1 - P(y_i \preceq {\mathcal {C}}_{Q-1} \mid {\mathbf {x}}_i), \; &{}\text {if} \; q=Q, \end{array}\right. } \end{aligned}$$
(7)

and the maximum probability class is then selected as the predicted label \({\hat{y}}_i\):

$$\begin{aligned} {\hat{y}}_i = {\mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{{\mathcal {C}}_1 \preceq {\mathcal {C}}_q \preceq {\mathcal {C}}_Q }} P(y_i = {\mathcal {C}}_q \mid {\mathbf {x}}_i). \end{aligned}$$
(8)
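The following sketch illustrates Eqs. (6) to (8) for a single sample. The projection value and the thresholds are hypothetical, and the thresholds are assumed to be sorted in increasing order so that the resulting individual probabilities are non-negative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def clm_probabilities(f_x, thresholds):
    """Individual class probabilities from the cumulative ones, Eqs. (6)-(7)."""
    # P(y <= C_q | x) for q = 1, ..., Q-1, then P(y <= C_Q | x) = 1.
    cumulative = [sigmoid(b - f_x) for b in thresholds] + [1.0]
    probs = [cumulative[0]]
    for q in range(1, len(cumulative)):
        probs.append(cumulative[q] - cumulative[q - 1])
    return probs

# Hypothetical model output f(x) and Q - 1 = 3 increasing thresholds.
probs = clm_probabilities(0.5, [-1.0, 0.0, 1.5])
# Eq. (8): 0-based index of the maximum-probability class.
predicted = max(range(len(probs)), key=lambda q: probs[q])
```

Appending 1 as the last cumulative probability makes the case \(q = Q\) of Eq. (7) fall out of the same difference used for the intermediate classes.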

4.3.2 Loss Function

Cross-entropy loss is used as the loss function in the same manner as in the nominal model.

4.4 Our approach: Ordinal Binary Decomposition

For our ordinal approach, we decompose the original Q-class ordinal problem into \(Q-1\) binary decision problems, which is known as Ordinal Binary Decomposition (OBD). The q-th problem consists of deciding whether \(y \succ {\mathcal {C}}_q\) conditioned on sample \({\mathbf {x}}\) (\(1 \le q < Q\)) (this is referred to as the “Ordered partitions” scheme in [16]).

To adapt the outputs of the model to this, the final fully-connected block is substituted by \(Q-1\) fully-connected blocks, each one with a single output unit with sigmoid activation (Footnote 1). Each of the \(Q-1\) outputs of the model \(o_q\) is trying to predict the probability \(P(y \succ {\mathcal {C}}_q \mid {\mathbf {x}})\). The result of this modification is obtaining \(Q-1\) different models, which share their convolutional feature extraction parameters and are trained simultaneously.

4.4.1 Decision Rule

In the case of the OBD models, because the outputs are not individual probabilities but cumulative ones (\(o_k = P(y \succ {\mathcal {C}}_k \mid {\mathbf {x}})\)), the decision rule requires combining several outputs. Moreover, these probabilities may be inconsistent: nothing forces them to fulfil basic probability properties like \(P(y \succ {\mathcal {C}}_i) \ge P(y \succ {\mathcal {C}}_{i+1})\) and \(\sum _{i=1}^{Q} P(y = {\mathcal {C}}_i) = 1\). For this reason, Eq. (7) cannot be applied as for the CLM.

In order to circumvent this problem, a stable approach based on the ECOC framework is used: the ideal output vector \(\mathbf {v}({\mathcal {C}}_i)\) for each class \({\mathcal {C}}_i\) is considered, \(\mathbf {v}({\mathcal {C}}_i) = (c_1, \dots , c_{Q-1})\) where \(c_j = 1\{{\mathcal {C}}_j \prec {\mathcal {C}}_i\}\), i.e. a vector with ones in all positions corresponding with classes which are lower than \(\mathcal {C}_i\) in the ordinal scale. This makes the ideal output vector for a sample \({\mathbf {x}}_i\) with label \(y_i = {\mathcal {C}}_k\) be:

$$\begin{aligned} \mathbf {v}({\mathcal {C}}_k) = (c_1, ..., c_{k-1}, c_k, ..., c_{Q-1}) = (1, ..., 1, 0, ..., 0), \end{aligned}$$
(9)

i.e., for a 4 class ordinal problem with labels \({\mathcal {C}}_1\), \({\mathcal {C}}_2\), \({\mathcal {C}}_3\), and \({\mathcal {C}}_4\) the ideal outputs would be \(\mathbf {v}({\mathcal {C}}_1) = (0,0,0)\), \(\mathbf {v}({\mathcal {C}}_2) = (1,0,0)\), \(\mathbf {v}({\mathcal {C}}_3) = (1,1,0)\), and \(\mathbf {v}({\mathcal {C}}_4) = (1,1,1)\).

The decision rule is based on determining the ideal vector which minimizes the distance to the obtained output vector \(\mathbf {o}\):

$$\begin{aligned} {\hat{y}}_i = {\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{{\mathcal {C}}_1 \preceq {\mathcal {C}}_q \preceq {\mathcal {C}}_Q }} \left\Vert \mathbf {o} - \mathbf {v}({\mathcal {C}}_q) \right\Vert _2, \end{aligned}$$
(10)

where \(\Vert \cdot \Vert _2\) is the \(L_2\) norm. This distance metric is selected in order to align it with the loss function of the optimization process.

As an example to illustrate this prediction criterion, assume a 4 class ordinal problem like the one previously mentioned. For sample \({\mathbf {x}}\), let the output of the model be the 3 dimensional vector \(\mathbf {o} = ( 0.8, 0.3, 0.2 )\). The distance to each ideal class vector would be computed as:

$$\begin{aligned} \begin{aligned} \Vert \mathbf {o} - \mathbf {v}({\mathcal {C}}_1) \Vert _2 = \Vert (0.8 - 0, 0.3 - 0, 0.2 - 0) \Vert _2 = \sqrt{0.77} \approx 0.88,\\ \Vert \mathbf {o} - \mathbf {v}({\mathcal {C}}_2) \Vert _2 = \Vert (0.8 - 1, 0.3 - 0, 0.2 - 0) \Vert _2 = \sqrt{0.17} \approx 0.41,\\ \Vert \mathbf {o} - \mathbf {v}({\mathcal {C}}_3) \Vert _2 = \Vert (0.8 - 1, 0.3 - 1, 0.2 - 0) \Vert _2 = \sqrt{0.57} \approx 0.75,\\ \Vert \mathbf {o} - \mathbf {v}({\mathcal {C}}_4) \Vert _2 = \Vert (0.8 - 1, 0.3 - 1, 0.2 - 1) \Vert _2 = \sqrt{1.17} \approx 1.08. \end{aligned} \end{aligned}$$
(11)

This process is illustrated in Fig. 1. The vector closest to \(\mathbf {o}\) is \(\mathbf {v}({\mathcal {C}}_2)\) and thus, sample \({\mathbf {x}}\) would be assigned the class label \({\hat{y}} = {\mathcal {C}}_2\).
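The ECOC decision rule of Eqs. (9) and (10) can be sketched in a few lines. The helper names are hypothetical and classes are indexed from 1 to match the notation of this section.

```python
import math

def ideal_vector(k, num_classes):
    """Eq. (9): v(C_k) has k - 1 leading ones followed by zeros (1-based k)."""
    return [1.0 if j < k else 0.0 for j in range(1, num_classes)]

def ecoc_predict(outputs):
    """Eq. (10): 1-based class whose ideal vector is closest to the outputs."""
    num_classes = len(outputs) + 1
    distances = [math.dist(outputs, ideal_vector(k, num_classes))
                 for k in range(1, num_classes + 1)]
    best = min(range(num_classes), key=lambda i: distances[i])
    return best + 1, distances

# Output vector of the 4-class example in the text.
label, distances = ecoc_predict([0.8, 0.3, 0.2])
squared = [round(d ** 2, 2) for d in distances]
```

For this output vector the squared distances are 0.77, 0.17, 0.57 and 1.17. Since squaring is monotonic for non-negative values, the minimizing class is \({\mathcal {C}}_2\) whether the distances or their squares are compared.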

Fig. 1 The model output vector \(\mathbf {o}\) (dot in red) for sample \({\mathbf {x}}\) and its distance to each of the ideal class vectors (dashed lines), illustrated as a 3D graphic where each dimension represents each of the three model outputs. The closest point is \(\mathbf {v}({\mathcal {C}}_2)\) (marked in red), and thus \({\mathbf {x}}\) is assigned label \({\mathcal {C}}_2\)

4.4.2 Loss Function

For the OBD methodology, categorical cross-entropy has been substituted by the Mean Squared Error loss because it copes better with the distance function used for the ECOC decision [2]:

$$\begin{aligned} \ell = \frac{1}{N}\sum _{i=1}^N \sum _{k=1}^{Q-1} (1\{y_i \succ {\mathcal {C}}_k\} - P(y_i \succ {\mathcal {C}}_k \mid {\mathbf {x}}_i))^2, \end{aligned}$$
(12)

where \(1\{y_i \succ {\mathcal {C}}_k\}\) is the indicator function that is equal to 1 when \(y_i \succ {\mathcal {C}}_k\) and 0 otherwise, and \(P(y_i \succ {\mathcal {C}}_k \mid {\mathbf {x}}_i)\) is the probability that \(y_i \succ {\mathcal {C}}_k\) predicted by the network given a sample \({\mathbf {x}}_i\).
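A minimal sketch of the target encoding and of the loss in Eq. (12) is shown below; the function names are illustrative and labels are 1-based class indices.

```python
def obd_targets(k, num_classes):
    """Binary targets 1{C_k > C_q} for q = 1, ..., Q-1 (1-based label k)."""
    return [1.0 if k > q else 0.0 for q in range(1, num_classes)]

def obd_mse_loss(output_rows, labels, num_classes):
    """Eq. (12): squared errors summed over outputs, averaged over samples."""
    total = 0.0
    for outputs, k in zip(output_rows, labels):
        targets = obd_targets(k, num_classes)
        total += sum((t - o) ** 2 for t, o in zip(targets, outputs))
    return total / len(labels)

# One sample with label C_2 in a 4-class problem.
loss = obd_mse_loss([[0.8, 0.3, 0.2]], [2], 4)
```

For a single sample, this loss equals the squared \(L_2\) distance between the output vector and the ideal vector of its true class (0.17 here), which is the sense in which the loss and the ECOC decision rule are aligned.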

Fig. 2 The four compared methodologies, from left to right: the baseline nominal architecture (both using categorical cross-entropy as well as QWK as the loss function), CLM, and our proposal, OBD

The four methodologies described in this section are illustrated in Fig. 2.

5 Experiment Design

5.1 Datasets

The effects of the four described methodologies will be tested on the following two datasets, chosen specifically for the ordinal nature of their class labels and their acute class imbalance.

5.1.1 Diabetic Retinopathy Dataset

The diabetic retinopathy dataset from Kaggle (Footnote 2) (referred to as “Retinopathy” from now on) consists of a total of 88 702 retina images labelled by a clinician on a 0 to 4 scale evaluating the presence of Diabetic Retinopathy (DR), an eye disease present in a large proportion of diabetes patients. It contains 65 343 images labelled as No DR, 6205 images labelled as Mild DR, 13 153 images labelled as Moderate DR, 2087 images labelled as Severe DR, and 1914 images labelled as Proliferative DR. The task consists of predicting the clinician label using the colour image of the retina. Three sample images can be seen in Fig. 3. All images have been resized down to \(128 \times 128\) pixels.

Fig. 3 Sample images from the Retinopathy dataset

5.1.2 Adience Faces Dataset

The Adience faces dataset for age classification [11] (referred to simply as “Adience” from now on) is composed of 26 580 photos of 2284 different subjects extracted from real online albums and automatically cropped and aligned. 17 702 of these photos have an age label attached, referring to one of 8 different age groups of increasing value: 0–2 years, 4–6 years, 8–13 years, 15–20 years, 25–32 years, 38–43 years, 48–53 years, and 60 years and up. The task consists of assigning one of these 8 age labels to each photo. A sample of these images can be seen in Fig. 4. As a preprocessing step, all images have been resized down to \(256 \times 256\) pixels.

Fig. 4 Sample images from the Adience faces dataset

5.2 Methodologies and Validation Scheme

Four different methodologies are tested against each other:

  • The baseline nominal architecture, with categorical cross-entropy loss function.

  • The same architecture, but substituting the loss function by the Quadratic Weighted Kappa (QWK) function described in Sect. 4.2.

  • The CLM approach, as described in Sect. 4.3.

  • The OBD approach with ECOC decision rule, as described in Sect. 4.4.

All of these are applied to all four of the previously mentioned architectures (VGG11, ResNet18, MobileNetV3 and ShuffleNetV2), yielding a total of sixteen different experiments for each of the two datasets.

In order to obtain a statistically significant result to test the hypotheses, each experiment is repeated 30 times on 30 different holdout splits of the original dataset, where 80% of samples are used for training and 20% are used for model evaluation. This split is performed in a stratified fashion, preserving the original proportion of the classes of the original dataset in the subsets. For the Retinopathy dataset this leaves 70 962 training samples (of which 7096 are reserved for validation) and 17 740 test samples in each split. In the case of the Adience dataset, 14 161 are used for training (of which 1416 are reserved for validation) and 3541 are used for evaluation.
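A stratified holdout split of this kind could be sketched as follows. This is an assumption about the procedure rather than the code used in the experiments: the function name is hypothetical, and the per-class rounding used here may yield subset sizes that differ by a few samples from the ones reported above.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=0.2, seed=0):
    """Split sample indices so that each class keeps its proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, test = [], []
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return train, test

# Toy imbalanced dataset: 80 samples of class 0 and 20 of class 1.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_holdout(labels, test_fraction=0.2, seed=1)
```

Repeating the call with 30 different values of `seed` would produce 30 different splits, and applying the same function to the training indices with `test_fraction=0.1` would give a stratified validation subset.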

5.3 Training Scheme

In all experiments, weights are initialized randomly using the He initialization scheme described in [18]. They are then adjusted using the Adam method [21] with a learning rate \(\eta = 1\times 10^{-4}\).

In the case of VGG11, both dropout (\(p=0.5\)) and \(L_2\) regularization (with a weight of \(5\times 10^{-4}\)) are applied only in the fully-connected layers as in the original paper [31]. For ResNet18, batch normalization is applied after every convolution operation and \(L_2\) penalty (with a weight of \(1\times 10^{-4}\)) is added to all mappings [17]. The number of trainable parameters for each model is available in Table 1.

Table 1 Number of trainable parameters and total memory size of the trained models for each methodology and architecture

In order to help overcome the class imbalance, class weighting is applied to the loss function based on \(N_q\) (number of training samples for class \({\mathcal {C}}_q\)):

$$\begin{aligned} w_q = \frac{e^{-C N_q}}{\sum _{i=1}^{Q} e^{-C N_i}}, \end{aligned}$$
(13)

where C is a constant defined as \(C=3\times 10^{-5}\). This weight \(w_q\) is multiplied by the loss contribution of each sample with a ground truth label of \({\mathcal {C}}_q\).
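The weighting of Eq. (13) can be illustrated as below. For simplicity, the sketch uses the class counts of the full Retinopathy dataset given in Sect. 5.1.1 rather than the counts of an actual training subset, so the resulting values are only indicative.

```python
import math

def class_weights(class_counts, c=3e-5):
    """Eq. (13): softmax of -C * N_q over the classes."""
    raw = [math.exp(-c * n) for n in class_counts]
    total = sum(raw)
    return [r / total for r in raw]

# Full-dataset counts for No DR, Mild, Moderate, Severe and Proliferative DR.
retinopathy_counts = [65343, 6205, 13153, 2087, 1914]
weights = class_weights(retinopathy_counts)
```

The weights sum to one and decrease with class size, so the majority class (No DR) receives the smallest weight and the rarest class (Proliferative DR) the largest.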

Before training, 10% of training samples are reserved for validation, again selected in a stratified fashion according to the class labels. Model weights are updated in batches of 72 training samples and loss performance is monitored on both training and validation. If validation performance does not improve for 5 full epochs, training is halted, and the best performing parameters over the validation set are restored.
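The stopping criterion can be sketched as a simple patience rule. The example assumes that the monitored quantity is the validation loss (lower is better) and works on a hypothetical list of per-epoch values instead of a real training loop.

```python
def early_stopping(val_losses, patience=5):
    """Return (best_epoch, last_epoch_run) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` full epochs: halt training.
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Hypothetical validation losses; the last value is never reached.
history = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9, 0.8, 0.6]
best_epoch, stopped_at = early_stopping(history, patience=5)
```

In a real training loop, the model parameters saved at `best_epoch` would be the ones restored after halting.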

The code used to perform the experiments can be accessed through GitHub (Footnote 3).

5.4 Performance Metrics

The classical performance metric in classification tasks is the Correct Classification Rate (CCR). However, given that both datasets present a very high class imbalance, the traditional CCR is not a representative measure of model performance: for example, in the case of the Retinopathy dataset, a dummy classifier that always assigns the majority class label (class 0) would obtain a CCR of approximately 73.7%.

In order to monitor global per-class performance, metrics such as the Average Area Under the Receiver Operating Characteristic (ROC) curve (\({{\,\mathrm{\textit{AvAUC}}\,}}\)), minimum sensitivity (\({{\,\mathrm{\textit{MS}}\,}}\)) and geometric mean of the sensitivities (\({{\,\mathrm{\textit{GMS}}\,}}\)) [10] will also be included.
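The sketch below contrasts CCR with the sensitivity-based metrics on a toy imbalanced problem. It assumes that \({{\,\mathrm{\textit{MS}}\,}}\) is the minimum and \({{\,\mathrm{\textit{GMS}}\,}}\) the geometric mean of the per-class sensitivities (recalls); the function names and data are illustrative.

```python
import math

def sensitivities(y_true, y_pred, num_classes):
    """Per-class recall; assumes every class appears at least once in y_true."""
    correct = [0] * num_classes
    total = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return [c / n for c, n in zip(correct, total)]

def ccr(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Dummy classifier that always predicts the majority class 0.
y_true = [0] * 8 + [1, 2]
y_pred = [0] * 10
sens = sensitivities(y_true, y_pred, 3)
ms = min(sens)
gms = math.prod(sens) ** (1.0 / len(sens))
accuracy = ccr(y_true, y_pred)
```

The dummy classifier reaches a CCR of 0.8, yet both \({{\,\mathrm{\textit{MS}}\,}}\) and \({{\,\mathrm{\textit{GMS}}\,}}\) are zero because two classes are never predicted.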

Also, for ordinal classification problems, rank agreement metrics including the Root Mean Squared Error (RMSE) (comparing actual and predicted labels, represented as consecutive integers in the ordinal scale), Spearman’s rank correlation coefficient (\({{\,\mathrm{r_s}\,}}\)) [5] and the Quadratic Weighted Cohen’s Kappa (\(\kappa \)) [3] (described in Eq. (3)) have been selected as well for evaluation:

$$\begin{aligned} {{\,\mathrm{\textit{RMSE}}\,}}&= \sqrt{\frac{1}{N} \sum _{i=1}^{N} (O({\hat{y}}_i) - O(y_i))^2}, \end{aligned}$$
(14)
$$\begin{aligned} {{\,\mathrm{r_s}\,}}&= \dfrac{{{\,\mathrm{Cov}\,}}(O(y), O({\hat{y}}))}{\sigma _{O(y)} \sigma _{O({\hat{y}})}}, \end{aligned}$$
(15)

where \(O({\mathcal {C}}_q) = q\) is the ordinal number of label \({\mathcal {C}}_q\), \({{\,\mathrm{Cov}\,}}(O(y), O({\hat{y}}))\) is the covariance between the ordinal numbers of the ground truth labels and those of the predicted labels, and \(\sigma _{O(y)}\) and \(\sigma _{O({\hat{y}})}\) are their respective standard deviations.
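Eqs. (14) and (15) translate directly into code, as in the sketch below. It follows Eq. (15) as written, that is, it correlates the ordinal numbers themselves; the textbook Spearman coefficient would first replace them by tie-averaged ranks, which is not done here. Function names and labels are illustrative.

```python
import math

def rmse(y_true, y_pred):
    """Eq. (14) on ordinal numbers O(y) and O(y_hat)."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)

def ordinal_correlation(y_true, y_pred):
    """Eq. (15): covariance divided by the product of standard deviations."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in y_true) / n)
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in y_pred) / n)
    # Undefined (division by zero) if either sequence is constant.
    return cov / (sd_t * sd_p)

y_true = [1, 2, 3, 4]
y_pred = [1, 3, 3, 4]  # one prediction is off by a single rank
error = rmse(y_true, y_pred)
r_s = ordinal_correlation(y_true, y_pred)
```

A single one-rank mistake among four samples gives an RMSE of 0.5 and a correlation slightly below 1.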

An illustrated example of the global experimentation procedure can be found in Fig. 5.

Fig. 5 The general experimentation procedure as described in Sect. 5

In Sect. 6, mean results and standard deviations are reported for each methodology. Then, statistical hypothesis testing is performed in order to discern the effects of the different factors and conclude whether the OBD methodology shows a significant improvement over the other methodologies.

6 Results

The average of the training curves over all 30 repetitions is shown in Figs. 6 and 7 (Footnote 4). Note how the QWK methodology fails to converge when used in conjunction with the VGG11 architecture: the high depth of this architecture makes the gradients disappear in the backpropagation phase of training, something known as the “vanishing gradients” problem. All the other architectures tested implement residual paths into the network, allowing them to avoid this problem [34]. Note also that the OBD methodology does not alter the depth of the CNN model, so it will never cause this problem by itself.

Fig. 6 Average training curves for each model and methodology when applied to the diabetic retinopathy dataset. Train loss is shown as solid lines and validation loss as dashed lines. The average is calculated over all executions which reach the corresponding iteration

Fig. 7 Average training curves for each model and methodology when applied to the Adience dataset. Train loss is shown as solid lines and validation loss as dashed lines. The average is calculated over all executions which reach the corresponding iteration

Fig. 8 Average training time of each methodology and architecture for the Retinopathy dataset. Error bars indicate ± the standard deviation

Fig. 9 Average training time of each methodology and architecture for the Adience dataset. Error bars indicate ± the standard deviation

Additionally, Figs. 8 and 9 show the training time for each methodology and architecture. In accordance with the number of parameters (Table 1), the VGG11 architecture takes the longest time to train compared to the other three, which all take a similar time. Regarding the methodologies, while the nominal approach usually takes less time than the ordinal ones, the OBD methodology is a close second in speed.

Table 2 Average experimental results for each of the four methodologies on the test sets of both datasets. Metrics to maximize are marked with (\(\uparrow \)) and metrics to minimize with (\(\downarrow \)). Best results are highlighted in bold and second best in italics

The average experimental results for each experiment are shown in Appendix A (Tables 4–11) and the mean over all four architectures is summarized in Table 2 for convenience. It can be noted that for the Retinopathy dataset the CLM models are able to improve ordinal metrics slightly, at the cost of worsening metrics related to the imbalance problem (\({{\,\mathrm{\textit{AvAUC}}\,}}\), \({{\,\mathrm{\textit{MS}}\,}}\), and \({{\,\mathrm{\textit{GMS}}\,}}\)). Meanwhile, the OBD models improve the ordinal metrics further while also improving class balancing metrics. This is done at the cost of worsening CCR, but only because of the high class imbalance. In the case of the Adience dataset the OBD models still achieve a higher score in class-balancing metrics, although \({{\,\mathrm{r_s}\,}}\) and \(\kappa \) are worsened slightly.

From the confusion matrices it can be noted that, although Table 2 shows that the CLM improves on the ordinal metrics on the Adience dataset, it fails on every class balancing metric compared to the nominal model, as it ignores both classes 1 and 3 in the Retinopathy dataset, as well as classes 3, 5 and 6 in the Adience dataset. The OBD model, on the other hand, is able to improve both class balancing and ordinal metrics. This is achieved at the cost of losing some performance on the extreme classes, but note how sensitivity and precision never fall to zero when using the OBD model on any class, that is, no class is ignored systematically. This is easily seen on the confusion matrices for each methodology and architecture shown in Appendix A (Figs. 10–17).

6.1 Statistical Analysis

To determine the statistical significance of the mean differences observed for each classifier, each architecture and each dataset, we have carried out a parametric Analysis of Variance (ANOVA) test [13, 14] for each of the evaluated metrics. The three factors considered for the experimental design are: (i) the dataset (Adience and Retinopathy), (ii) the CNN architecture (VGG11, ResNet18, MobileNetV3 and ShuffleNetV2) and (iii) the methodology (nominal, QWK, CLM and OBD).

For each combination of these three factors we have repeated the experiment 30 times with different data splits and weight initialization seeds. For every metric mentioned in Sect. 5.4, we have used the Kolmogorov-Smirnov test to check the null hypothesis that the results are drawn from a normal distribution (for a significance level of \(\alpha = 0.05\)). The null hypothesis cannot be rejected for any metric except \({{\,\mathrm{\textit{MS}}\,}}\) and \({{\,\mathrm{\textit{GMS}}\,}}\); that is, normality can be assumed for the Quadratic Weighted Cohen’s Kappa (\(\kappa \)), \({{\,\mathrm{\textit{AvAUC}}\,}}\), \({{\,\mathrm{\textit{RMSE}}\,}}\), Spearman’s rank correlation coefficient (\({{\,\mathrm{r_s}\,}}\)) and \({{\,\mathrm{\textit{CCR}}\,}}\). Only these five metrics are considered in the subsequent analysis, given that ANOVA is a parametric test that assumes normally distributed variables.
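A normality check of this kind can be sketched as follows. This is a minimal illustration and not the code used for the reported analysis: it assumes that the normal parameters are estimated from the sample itself (which makes the plain Kolmogorov-Smirnov test conservative), it uses the asymptotic Kolmogorov series for the p-value, and the function name `ks_normality` and both samples are made up for the example.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def ks_normality(sample):
    """One-sample Kolmogorov-Smirnov test against a fitted normal.

    Returns (D, p). The p-value uses the asymptotic Kolmogorov series with a
    common small-sample correction of the statistic.
    """
    xs = sorted(sample)
    n = len(xs)
    # Assumption: mean and standard deviation are estimated from the sample.
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # Largest gap between empirical and fitted CDF, on both sides of each step.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


random.seed(0)
# Hypothetical metric values for 30 repetitions of one configuration.
normal_like = [random.gauss(0.70, 0.02) for _ in range(30)]
# A clearly non-normal sample, such as a metric that is often exactly zero.
skewed = [0.0] * 25 + [1.0] * 5

alpha = 0.05
for name, sample in [("normal-like", normal_like), ("skewed", skewed)]:
    d, p = ks_normality(sample)
    print(f"{name}: D={d:.3f}, p={p:.4f}, reject normality: {p < alpha}")
```

The second sample mimics a metric that collapses to zero in many runs, which is one plausible way for a metric to fail the test.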

Table 3 Results of the Tukey’s HSD test for all tested metrics. Methodologies are grouped such that the elements within a subset are not significantly different, while the differences between each group are significant. The first subset contains the worst methodologies, while the last subset groups the best methodologies. The best performing subset is highlighted in bold

ANOVA is then performed for these five metrics. The ANOVA tables are available in Appendix B. According to this analysis, for all normally distributed metrics there are significant differences in the mean value (for a significance level of \(\alpha = 0.05\)) for each of the three individual factors (Dataset, Architecture and Methodology, all p-values \(< 0.001\)). We also found significant interactions between all pairs of factors (p-values \(< 0.001\)) and between all three factors (p-value \(< 0.001\)). This shows that:

  1. the impact of the architecture and the methodology varies across datasets,

  2. the architecture significantly affects performance,

  3. the effect of the methodology is affected by the CNN architecture (that is, some architectures are better suited for each methodology), and

  4. the methodology alone affects the performance, OBD being in the lead according to the mean results of Table 2.
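To illustrate how main effects and interactions are separated in a balanced three-factor design such as this one (2 datasets, 4 architectures, 4 methodologies, 30 repetitions), the following sketch computes the sums of squares, F statistics and p-values with NumPy and SciPy. It is a sketch under stated assumptions: the data are synthetic, with an artificial shift added to one methodology, and the function `three_way_anova` is written for this example rather than taken from the software used to produce Appendix B.

```python
import numpy as np
from scipy import stats


def three_way_anova(y):
    """Balanced three-way ANOVA for y of shape (A, B, C, n_repetitions).

    Returns (table, ss_error, df_error), where table maps each effect name
    to (sum_of_squares, degrees_of_freedom, F, p_value).
    """
    A, B, C, n = y.shape
    g = y.mean()
    m_a = y.mean(axis=(1, 2, 3), keepdims=True)
    m_b = y.mean(axis=(0, 2, 3), keepdims=True)
    m_c = y.mean(axis=(0, 1, 3), keepdims=True)
    m_ab = y.mean(axis=(2, 3), keepdims=True)
    m_ac = y.mean(axis=(1, 3), keepdims=True)
    m_bc = y.mean(axis=(0, 3), keepdims=True)
    m_abc = y.mean(axis=3, keepdims=True)

    effects = {
        "dataset": (m_a - g, A - 1),
        "architecture": (m_b - g, B - 1),
        "methodology": (m_c - g, C - 1),
        "dataset:architecture": (m_ab - m_a - m_b + g, (A - 1) * (B - 1)),
        "dataset:methodology": (m_ac - m_a - m_c + g, (A - 1) * (C - 1)),
        "architecture:methodology": (m_bc - m_b - m_c + g, (B - 1) * (C - 1)),
        "dataset:architecture:methodology": (
            m_abc - m_ab - m_ac - m_bc + m_a + m_b + m_c - g,
            (A - 1) * (B - 1) * (C - 1),
        ),
    }
    ss_error = ((y - m_abc) ** 2).sum()
    df_error = A * B * C * (n - 1)
    ms_error = ss_error / df_error

    table = {}
    for name, (effect, df) in effects.items():
        # Broadcast to the data shape so that each observation is counted once.
        ss = (np.broadcast_to(effect, y.shape) ** 2).sum()
        f_stat = (ss / df) / ms_error
        table[name] = (ss, df, f_stat, stats.f.sf(f_stat, df, df_error))
    return table, ss_error, df_error


# Synthetic scores: pure noise plus an artificial gain for the last methodology.
rng = np.random.default_rng(0)
scores = rng.normal(0.70, 0.02, size=(2, 4, 4, 30))
scores[:, :, 3, :] += 0.05

table, ss_error, df_error = three_way_anova(scores)
for name, (ss, df, f_stat, p) in table.items():
    print(f"{name:35s} df={df:2d}  F={f_stat:9.2f}  p={p:.4g}")
```

With unbalanced data the sums of squares no longer decompose this cleanly, and a regression-based ANOVA would be the safer choice.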

We therefore analyse the magnitude of those differences with respect to the Methodology factor. A post-hoc Tukey’s HSD multiple comparison test [32] has been performed on each of the metrics shown to be affected by this factor. The purpose of this test is to arrange the methodologies into subsets such that the methodologies within a subset do not differ significantly in performance, while each subset is significantly different from the rest. The results of this test are summarized in Table 3, which groups the methodologies into subsets according to their performance on each metric. The first subset contains the worst methodologies, while the last one includes the best methodologies.
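The grouping step can be sketched with `scipy.stats.tukey_hsd` (available in SciPy 1.8 and later). The example below is only an illustration under assumptions: the scores are synthetic, the groups have equal size, a higher score is taken to be better, and the helper `homogeneous_subsets` is a hypothetical name introduced here, not part of SciPy.

```python
import numpy as np
from scipy import stats


def homogeneous_subsets(names, samples, alpha=0.05):
    """Group methodologies into maximal subsets with no significant pairwise difference.

    Methodologies are sorted by mean, so the first subset holds the lowest
    scores and the last subset the highest ones.
    """
    order = sorted(range(len(names)), key=lambda i: np.mean(samples[i]))
    result = stats.tukey_hsd(*[samples[i] for i in order])
    k = len(order)
    subsets = []
    for start in range(k):
        end = start
        # Extend the run while the two extremes are not significantly different.
        while end + 1 < k and result.pvalue[start, end + 1] >= alpha:
            end += 1
        subset = [names[order[i]] for i in range(start, end + 1)]
        # Keep only subsets that are not contained in the previous one.
        if not subsets or not set(subset) <= set(subsets[-1]):
            subsets.append(subset)
    return subsets


# Synthetic scores for one metric: three similar methodologies and one clearly better.
rng = np.random.default_rng(0)
names = ["Nominal", "QWK", "CLM", "OBD"]
true_means = [0.50, 0.50, 0.50, 0.60]
samples = [rng.normal(m, 0.02, size=30) for m in true_means]

for i, subset in enumerate(homogeneous_subsets(names, samples), start=1):
    print(f"subset {i}: {', '.join(subset)}")
```

For error metrics such as \({{\,\mathrm{\textit{RMSE}}\,}}\), where lower is better, the order of the subsets would need to be reversed to match the convention of listing the worst methodologies first.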

Note that for \(\kappa \), \({{\,\mathrm{\textit{AvAUC}}\,}}\), and \({{\,\mathrm{r_s}\,}}\) the OBD methodology has a significantly better mean performance than the other three methodologies. For the \({{\,\mathrm{\textit{RMSE}}\,}}\) metric both CLM and OBD exhibit similar performance, but significantly better than the other two methodologies. Finally, for \({{\,\mathrm{\textit{CCR}}\,}}\) both CLM and the nominal methodology perform similarly and better than OBD and QWK.

7 Conclusions and Future Work

A new ordinal CNN architecture based on Ordinal Binary Decomposition has been proposed, together with a decision scheme based on ECOC. The experiments show that it is able to significantly outperform a purely nominal approach as well as existing ordinal approaches, especially in highly imbalanced scenarios such as medical datasets and web-scraped datasets. Specifically, the proposed OBD methodology is able to improve both class balancing metrics and ordinal metrics such as \({{\,\mathrm{\textit{RMSE}}\,}}\), Spearman’s rank correlation coefficient and Quadratic Weighted Cohen’s Kappa. This methodology is easy to adapt to other ordinal tasks.

While the tested architectures are widely established and perform well overall, other, more recent architectures could be adapted in the same manner. This adds a new, fairly generic tool for classification tasks where ordinal information can be exploited. Moreover, this modification does not increase the number of parameters or the memory consumption of the network and does not significantly increase the training time, making it applicable in memory-limited scenarios.

In future work, more complex data structures such as 3D images can be studied. This is possible because the required modifications only alter the final stages of the network, allowing for arbitrary input shapes. Also, even though we have been able to improve the metrics that are sensitive to class imbalance, further work is necessary, as can be noted from the confusion matrices. Class balancing approaches better than loss weighting, such as a data augmentation scheme sensitive to ordinal information, could be applied to improve on this.