Error-Correcting Output Codes in the Framework of Deep Ordinal Classification

Barbero-Gómez, Javier; Gutiérrez, Pedro Antonio; Hervás-Martínez, César

doi:10.1007/s11063-022-10824-7

Open access
Published: 12 May 2022

Volume 55, pages 5299–5330, (2023)

Neural Processing Letters

Abstract

Automatic classification tasks on structured data have been revolutionized by Convolutional Neural Networks (CNNs), but the focus has been on binary and nominal classification tasks. Only recently, ordinal classification (where class labels present a natural ordering) has been tackled through the framework of CNNs. Also, ordinal classification datasets commonly present a high imbalance in the number of samples of each class, making it an even harder problem. Focus should be shifted from classic classification metrics towards per-class metrics (like AUC or Sensitivity) and rank agreement metrics (like Cohen’s Kappa or Spearman’s rank correlation coefficient). We present a new CNN architecture based on the Ordinal Binary Decomposition (OBD) technique using Error-Correcting Output Codes (ECOC). We aim to show experimentally, using four different CNN architectures and two ordinal classification datasets, that the OBD+ECOC methodology significantly improves the mean results on the relevant ordinal and class-balancing metrics. The proposed method is able to outperform a nominal approach as well as already existing ordinal approaches, achieving a mean performance of $\mathrm{RMSE} = 1.0797$ for the Retinopathy dataset and $\mathrm{RMSE} = 1.1237$ for the Adience dataset averaged over 4 different architectures.


1 Introduction

There exists a large variety of classification tasks tackled in Machine Learning (ML) literature. It is natural to group them, for example, depending on the number of different class labels assigned to the classification samples. According to this, we differentiate between binary classification tasks (those where only two different labels are present, usually a “positive” class and a “negative” class) and multi-class classification tasks (those where more than two different labels exist).

Focusing on multi-class tasks, one can also consider the relations between the class labels. Classic approaches treat all classes as equal and unrelated, and simply try to minimize the number of misclassified samples.

However, when an order relation between the class labels is present due to the nature of the problem itself, these tasks can be posed as “ordinal classification” (sometimes referred to as “ordinal regression”) tasks, which have gained popularity in the last decade. This family of problems, halfway between nominal classification and regression, presents extra information which can be exploited in order to improve performance, sometimes with respect to metrics other than the usual ones [1, 4, 16]. Methods that exploit this information have been shown to outperform purely nominal methods in the context of unstructured data [10, 29, 30], and some methods have been proposed to search for ordinality in the class labels of apparently purely categorical datasets [24].

In this work, we propose and explore a novel general methodology for ordinal classification tasks of 2D images. This includes a generic structure for the final layers of a Convolutional Neural Network (CNN), adaptable to a wide range of already existing architectures, as well as a prediction scheme adapted to this structure and an ordinal target label encoding, both based on the Error-Correcting Output Code (ECOC) framework. Our hypothesis is that this exploitation of ordinal information in the context of image classification may improve performance, not only on ordinal metrics but also on nominal ones.

This work is structured as follows: in Sect. 2 a brief literature review on ordinal classification and CNNs is presented. In Sect. 3 a baseline nominal methodology for training CNNs to solve classification problems is posed. Then in Sect. 4 the ordinal classification framework is described and three different ordinal classification methodologies for CNNs (two already existing methods based on previous works and one novel method) are described. In Sect. 5, the experiments for the comparison of these four approaches are presented, including the datasets used for evaluation. Finally, in Sect. 6, the experiment results are shown, and Sect. 7 concludes with a discussion of these results.

2 Related Work

Early ordinal classification approaches were limited to unstructured input data, where no spatial or temporal relations exist between the inputs. Some basic approaches include using regular regression methods with rounding applied at the outputs [23] or using the label distance as a cost penalty [22]. The performance of such methods is limited because of the potentially unequal underlying distance between labels. Cumulative Link Models (CLMs) such as the Proportional Odds Model (POM) [27] or the gologit model [35], which not only learn a latent continuous variable but also a set of thresholds for each rank, are able to overcome this limitation. There are also adaptations of Support Vector Machines (SVMs) like SVORIM or SVOREX [7] which add ordinal constraints to the optimization of the model. Lastly, an approach known as Ordinal Binary Decomposition (OBD), where the original ordinal problem is split into a set of binary problems, has also proven to improve performance. Examples of this are the cascade linear utility model [36], where a different model solves each binary problem, or neural networks coupled with multiple outputs, one for each binary subproblem [9, 20]. The main problem with OBD is the matter of combining the different outputs to produce a final decision.

These approaches are not suitable for structured information such as 2D images, where domain-specific feature extraction is still necessary. In this regard, CNNs provide an automatic method for extracting learned features from structured data in classification tasks. Unfortunately, due to their high number of parameters, CNNs easily suffer from overfitting, resulting in low generalization performance. In addition to classic techniques such as $L_2$ regularization and dropout, recent techniques include multi-stage implicit regularization [37] and network path pruning [38] to avoid this problem.

Adapting CNNs to work with ordinal information is a recent line of research, still needing extensive work. In [12, 33], a CLM has been adapted as the activation function of a single output of a CNN. In [28] a CNN architecture for solving the OBD version of an age estimation problem is proposed, with a very simple combination of the binary outputs for obtaining a rank. [25] proposes a different methodology for small datasets based on triplets of samples and majority voting. Finally, [6] proposes an improvement over [28] by bounding the maximum binary error of each output.

3 Base Nominal CNN Methodology

Nominal classification is the general framework for tasks where there is a need to assign a class label to a randomly sampled object from a specific distribution. More formally, we want to obtain a rule $r: \mathcal {X} \rightarrow \mathcal {Y}$ that associates an input vector ${\mathbf {x}}\in \mathcal {X} \subseteq {\mathbb {R}}^K$ to a class label $y \in \mathcal {Y} = \{{\mathcal {C}}_1, {\mathcal {C}}_2, \dots , {\mathcal {C}}_Q\}$ in a finite set. In order to learn this relation, a dataset D is provided, consisting of tuples of correctly classified samples $D = \left\{ ({\mathbf {x}}_i, y_i) \mid {\mathbf {x}}_i \in \mathcal {X},\, y_i \in \mathcal {Y},\, i \in \{1, \dots , N\} \right\} $.

Focusing on image classification tasks, CNNs are able to capture the spatial nature of image features, where nearby pixels have a stronger association between them than far away ones. We have considered four different well-known and competitive CNN architectures for image classification in order to have a good performance baseline: VGG11 [31], ResNet18 [17], MobileNetV3 [19] and ShuffleNetV2 [26]. We use these architectures as a baseline for traditional nominal classification.

While the specifics of each architecture vary widely, their general design follows the same overall scheme:

First, several blocks of convolution and pooling operations are applied to the input image.
Then, the mapped features are processed by one or more hidden fully-connected layers.
Finally, an output layer with as many units as classes and softmax activation is used, whose values represent the probability of input sample ${\mathbf {x}}$ being assigned each class label, $P(y = {\mathcal {C}}_q \mid {\mathbf {x}})$. These probabilities are compared to the ground truth labels of dataset D to compute a loss function $\ell $, which is minimized through a gradient descent procedure.

3.1 Decision Rule

During evaluation of the model, the maximum probability class of ${\mathbf {x}}_i$ is selected as the predicted class label ${\hat{y}}_i$:

$$\begin{aligned} {\hat{y}}_i = {\mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{\mathcal {C}_q \in \mathcal {Y}}} P(y_i = {\mathcal {C}}_q \mid {\mathbf {x}}_i), \end{aligned}$$

(1)

where $P(y_i = {\mathcal {C}}_q \mid {\mathbf {x}}_i)$ is the probability of sample ${\mathbf {x}}_i$ being assigned label ${\mathcal {C}}_q$ predicted by the network.

3.2 Loss Function

For the baseline nominal methodology, categorical cross-entropy is used as the loss function $\ell $ during training:

$$\begin{aligned} \ell = -\frac{1}{N}\sum _{i=1}^N \sum _{q=1}^Q 1\{y_i = {\mathcal {C}}_q\} \log (P(y_i = {\mathcal {C}}_q \mid {\mathbf {x}}_i)), \end{aligned}$$

(2)

where $1\{y_i = {\mathcal {C}}_q\}$ is the indicator function that is equal to 1 when $y_i = {\mathcal {C}}_q$ and 0 otherwise.
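The decision rule of Eq. (1) and the loss of Eq. (2) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names, the toy logits, and the 0-based class indices are assumptions made for the example and are not taken from the paper's implementation.

```python
import math

def softmax(logits):
    """Turn raw output-layer scores into class probabilities."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(probs):
    """Eq. (1): 0-based index of the maximum-probability class."""
    return max(range(len(probs)), key=lambda q: probs[q])

def cross_entropy(batch_probs, labels):
    """Eq. (2): mean negative log-probability of the true class."""
    n = len(labels)
    return -sum(math.log(p[y]) for p, y in zip(batch_probs, labels)) / n

# Hypothetical logits for two samples of a 3-class problem.
logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
labels = [0, 2]  # 0-based ground-truth class indices
probs = [softmax(z) for z in logits]
preds = [predict(p) for p in probs]
loss = cross_entropy(probs, labels)
```

Because the indicator in Eq. (2) is non-zero only for the true class, the inner sum over classes reduces to a single log term per sample, which is what `cross_entropy` computes.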

4 The Ordinal Classification Framework

As in a nominal classification framework, an ordinal classification task is characterised as the prediction process of assigning a label y to an input vector ${\mathbf {x}}$, where ${\mathbf {x}}\in \mathcal {X} \subseteq {\mathbb {R}}^K$ and $y \in \mathcal {Y} = \{{\mathcal {C}}_1, {\mathcal {C}}_2, \dots , {\mathcal {C}}_Q\}$, i.e., ${\mathbf {x}}$ is a K-dimensional vector and y is a class label in a finite set. The goal is also to obtain some classification rule $r: \mathcal {X} \rightarrow \mathcal {Y}$ that predicts the categories of new patterns given a dataset D.

Where the ordinal framework differs from the nominal framework is in the presence of a natural ordering of the class labels: ${\mathcal {C}}_1 \prec {\mathcal {C}}_2 \prec \dots \prec {\mathcal {C}}_Q$, where $\prec $ is an order relation. This is similar to regression, where $y \in {\mathbb {R}}$, and real values can be ordered by the < operator but, in this case, the labels are discrete and include qualitative information instead of quantitative [16]. Throughout this work, we adopt the convention that $i < j \Rightarrow {\mathcal {C}}_i \prec {\mathcal {C}}_j$.

4.1 Adapting CNNs for ordinal classification

Without altering the architectures in a major way, several different options are available for introducing the ordinal information of the original dataset in the model and its training process:

(a) Using a loss function that incorporates ordinal information in the optimization procedure.
(b) Altering only the fully-connected layer phase of the architecture, maintaining all previous layers as-is.
(c) Furthermore, altering the decision rule that assigns a label to each sample when making a prediction.

In the following three sections, three different ordinal methodologies are described: two already present in the literature as well as our proposed method.

4.2 Using an ordinal loss function: Quadratic Weighted Kappa

A naive approach to integrating ordinal information into the learning process of the model consists of optimizing an order-sensitive loss function instead of the classic categorical cross-entropy.

One promising candidate is the weighted Kappa metric [3] (described in Sect. 5.4), a relevant score for ordinal classifiers as it measures the rank agreement between two raters (in our case, the ground truth labels and the model outputs) based on a disagreement penalty. This penalty is usually defined as the absolute (linear) or squared (quadratic, used in the rest of this paper) difference between the rank labels. It is often used in medical diagnosis systems, where the severity of a disease presents naturally ordered stages. It is defined as:

$$\begin{aligned} \kappa = 1 - \dfrac{\sum _{i=1}^Q \sum _{j=1}^Q w_{{\mathcal {C}}_i,{\mathcal {C}}_j} p_{{\mathcal {C}}_i,{\mathcal {C}}_j}}{\sum _{i=1}^Q \sum _{j=1}^Q w_{{\mathcal {C}}_i,{\mathcal {C}}_j} e_{{\mathcal {C}}_i,{\mathcal {C}}_j}}, \end{aligned}$$

(3)

where $w_{{\mathcal {C}}_i,{\mathcal {C}}_j}$ is the disagreement cost when $y = {\mathcal {C}}_i$ and ${\hat{y}} = {\mathcal {C}}_j$ ($w_{{\mathcal {C}}_i,{\mathcal {C}}_j} = (i - j)^2$ for the quadratic case), and $p_{{\mathcal {C}}_i,{\mathcal {C}}_j}$ and $e_{{\mathcal {C}}_i,{\mathcal {C}}_j}$ are the observed agreement and expected agreement due to chance for classes ${\mathcal {C}}_i$ and ${\mathcal {C}}_j$, respectively. A larger $\kappa $ value corresponds with a better agreement and vice versa, and so it is a metric to be maximised.
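As a rough illustration of Eq. (3), the sketch below computes the quadratic weighted Kappa from a confusion matrix. It assumes the usual reading of the definition: $p_{{\mathcal {C}}_i,{\mathcal {C}}_j}$ is the observed proportion of samples with true class ${\mathcal {C}}_i$ and predicted class ${\mathcal {C}}_j$, and $e_{{\mathcal {C}}_i,{\mathcal {C}}_j}$ is the product of the corresponding row and column marginals. The function name and the toy matrices are hypothetical.

```python
def quadratic_weighted_kappa(conf):
    """Eq. (3) from a Q x Q confusion matrix (rows: true, columns: predicted)."""
    q = len(conf)
    n = sum(sum(row) for row in conf)
    # Marginal proportions of true and predicted labels.
    row_prop = [sum(row) / n for row in conf]
    col_prop = [sum(conf[i][j] for i in range(q)) / n for j in range(q)]
    num = 0.0
    den = 0.0
    for i in range(q):
        for j in range(q):
            w = (i - j) ** 2                    # quadratic disagreement cost
            num += w * conf[i][j] / n           # observed proportion p_ij
            den += w * row_prop[i] * col_prop[j]  # chance proportion e_ij
    return 1.0 - num / den

# Hypothetical 3-class confusion matrices.
perfect = [[5, 0, 0], [0, 5, 0], [0, 0, 5]]
reversed_order = [[0, 0, 5], [0, 5, 0], [5, 0, 0]]
kappa_perfect = quadratic_weighted_kappa(perfect)
kappa_reversed = quadratic_weighted_kappa(reversed_order)
```

The sketch does not guard against a zero denominator, which occurs when all true labels or all predictions fall into a single class.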

Unfortunately, as is the case with accuracy, this metric is not continuous and is expressed in terms of discrete labels, which prevents the application of gradient descent methods. In [8], a proposal is made to adapt this metric as a loss function for CNN model training while maintaining both the architecture of the network and the decision rule.

4.2.1 Loss Function

First, $\kappa $ is expressed in terms of probabilities instead of class labels, maintaining the penalty matrix $w_{{\mathcal {C}}_i,{\mathcal {C}}_j}$ but replacing $p_{{\mathcal {C}}_i,{\mathcal {C}}_j}$ and $e_{{\mathcal {C}}_i,{\mathcal {C}}_j}$ with expressions based on the probability outputs of the model:

$$\begin{aligned} {\hat{\kappa }} = 1 - \frac{\sum _{i=1}^N \sum _{q=1}^Q w_{y_i,{\mathcal {C}}_q} P(y_i = {\mathcal {C}}_q \mid {\mathbf {x}}_i)}{\sum _{j=1}^Q \frac{N_j}{N} \sum _{k=1}^Q (w_{{\mathcal {C}}_j,{\mathcal {C}}_k} \sum _{i=1}^N P(y_i = {\mathcal {C}}_k \mid {\mathbf {x}}_i))}, \end{aligned}$$

(4)

where $N_j$ is the number of samples with class label ${\mathcal {C}}_j$ in the dataset D.

Then, in order to pose it as a minimization problem, loss $\ell $ is defined as:

$$\begin{aligned} \ell = \log (1-{\hat{\kappa }}) \text {, where } \ell \in (-\infty , \log 2]. \end{aligned}$$

(5)

Further derivation and a more in-depth discussion can be found in [8].
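A minimal sketch of Eqs. (4) and (5) follows, assuming quadratic weights and 0-based class indices. Two simplifications are assumptions of this example rather than details from the paper or from [8]: the class frequencies $N_j / N$ are estimated from the batch passed to the function instead of the whole dataset D, and no smoothing term is added, so the logarithm is undefined for a batch of perfectly confident, correct predictions.

```python
import math

def qwk_loss(batch_probs, labels):
    """Eq. (5): log(1 - kappa_hat), with kappa_hat as in Eq. (4)."""
    q = len(batch_probs[0])
    n = len(labels)
    # Numerator of Eq. (4): expected quadratic penalty under the model output.
    num = sum((y - c) ** 2 * p[c]
              for p, y in zip(batch_probs, labels)
              for c in range(q))
    # N_j / N, estimated here from the batch (assumption of this sketch).
    class_freq = [labels.count(j) / n for j in range(q)]
    # Total probability mass assigned to each class over the batch.
    prob_mass = [sum(p[k] for p in batch_probs) for k in range(q)]
    den = sum(class_freq[j] * (j - k) ** 2 * prob_mass[k]
              for j in range(q)
              for k in range(q))
    return math.log(num / den)

# Hypothetical softmax outputs for three samples of a 3-class problem.
probs = [[0.80, 0.15, 0.05], [0.10, 0.80, 0.10], [0.05, 0.15, 0.80]]
loss_good = qwk_loss(probs, [0, 1, 2])  # predictions agree with the labels
loss_bad = qwk_loss(probs, [2, 1, 0])   # predictions reverse the order
```

In this toy case the agreeing batch yields a negative loss and the reversed batch a positive one, which matches the intent of minimizing $\ell $ in order to maximise ${\hat{\kappa }}$.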

4.2.2 Decision Rule

In the same manner as the nominal approach of Sect. 3, the maximum probability class of ${\mathbf {x}}_i$ is selected as the predicted class label ${\hat{y}}_i$.

4.3 The Cumulative Link Model Approach

For the CLM framework (a family of models which includes the POM [27]), only a small modification to the baseline nominal model is needed: the output is reduced to only a single unit in the last layer, and the logit cumulative link function is used as the activation function instead of softmax:

$$\begin{aligned} P(y \preceq {\mathcal {C}}_q \mid {\mathbf {x}}) = \sigma (b_q - f({\mathbf {x}})) \;, \; 1 \le q < Q, \end{aligned}$$

(6)

where $P(y \preceq {\mathcal {C}}_q \mid {\mathbf {x}})$ is the probability, as predicted by the network, of sample ${\mathbf {x}}$ being assigned label ${\mathcal {C}}_q$ or lower, $f({\mathbf {x}})$ is the single output of the model, $\sigma $ is the sigmoid function and $b_q$ is one of the $Q-1$ thresholds learned as additional parameters. Note that cumulative probabilities $P(y \preceq {\mathcal {C}}_q \mid {\mathbf {x}})$ are predicted by this function instead of individual ones like $P(y = {\mathcal {C}}_q \mid {\mathbf {x}})$.

4.3.1 Decision Rule

During evaluation, elementary probability rules are used to combine the cumulative probabilities from Eq. (6) into individual probabilities [15]:

$$\begin{aligned} P(y_i = {\mathcal {C}}_q \mid {\mathbf {x}}_i) = {\left\{ \begin{array}{ll} P(y_i \preceq {\mathcal {C}}_1 \mid {\mathbf {x}}_i), \; &{}\text {if} \; q=1, \\ P(y_i \preceq {\mathcal {C}}_q \mid {\mathbf {x}}_i) - P(y_i \preceq {\mathcal {C}}_{q-1} \mid {\mathbf {x}}_i), \; &{}\text {if} \; 1< q < Q, \\ 1 - P(y_i \preceq {\mathcal {C}}_{Q-1} \mid {\mathbf {x}}_i), \; &{}\text {if} \; q=Q, \end{array}\right. } \end{aligned}$$

(7)

and the maximum probability class is then selected as the predicted label ${\hat{y}}_i$:

$$\begin{aligned} {\hat{y}}_i = {\mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{{\mathcal {C}}_1 \preceq {\mathcal {C}}_q \preceq {\mathcal {C}}_Q }} P(y_i = {\mathcal {C}}_q \mid {\mathbf {x}}_i). \end{aligned}$$

(8)
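Eqs. (6) to (8) can be illustrated with the short sketch below. The threshold values and the model output `f_x` are made up for the example, and the thresholds are assumed to be in increasing order so that every individual probability is non-negative; how that ordering is enforced during training is not covered here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def clm_probabilities(f_x, thresholds):
    """Eqs. (6)-(7): individual class probabilities from a single output."""
    # Eq. (6): P(y <= C_q | x) for q = 1, ..., Q-1.
    cumulative = [sigmoid(b - f_x) for b in thresholds]
    cumulative.append(1.0)  # P(y <= C_Q | x) is always 1
    # Eq. (7): differences of consecutive cumulative probabilities.
    probs = [cumulative[0]]
    probs += [cumulative[q] - cumulative[q - 1]
              for q in range(1, len(cumulative))]
    return probs

def clm_predict(f_x, thresholds):
    """Eq. (8): 0-based index of the maximum-probability class."""
    probs = clm_probabilities(f_x, thresholds)
    return max(range(len(probs)), key=lambda q: probs[q])

# Hypothetical 4-class problem: three increasing thresholds, one output.
thresholds = [-1.0, 0.0, 1.5]
f_x = 0.4
probs = clm_probabilities(f_x, thresholds)
predicted = clm_predict(f_x, thresholds)
```

With these values the output falls between the second and third thresholds, so the third class receives the largest probability.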

4.3.2 Loss Function

Cross-entropy loss is used as the loss function in the same manner as in the nominal model.

4.4 Our approach: Ordinal Binary Decomposition

For our ordinal approach, we decompose the original Q-class ordinal problem into $Q-1$ binary decision problems, a technique known as Ordinal Binary Decomposition (OBD). The $q$-th problem ($1 \le q < Q$) consists of deciding whether $y \succ {\mathcal {C}}_q$ given sample ${\mathbf {x}}$ (this is referred to as the “Ordered partitions” scheme in [16]).

To adapt the outputs of the model to this, the final fully-connected block is substituted by $Q-1$ fully-connected blocks, each one with a single output unit with sigmoid activation^{Footnote 1}. Each of the $Q-1$ outputs of the model, $o_q$, estimates the probability $P(y \succ {\mathcal {C}}_q \mid {\mathbf {x}})$. The result of this modification is a set of $Q-1$ different models, which share their convolutional feature extraction parameters and are trained simultaneously.

4.4.1 Decision Rule

In the case of the OBD models, because the outputs are not individual probabilities but cumulative ones ($o_k = P(y \succ {\mathcal {C}}_k \mid {\mathbf {x}})$), the decision rule requires combining several outputs. Moreover, these probabilities may be inconsistent: nothing forces them to fulfil basic probability properties like $P(y \succ {\mathcal {C}}_i) \ge P(y \succ {\mathcal {C}}_{i+1})$ and $\sum _{i=1}^{Q} P(y = {\mathcal {C}}_i) = 1$. For this reason, Eq. (7) cannot be applied as for the CLM.

In order to circumvent this problem, a stable approach based on the ECOC framework is used: the ideal output vector $\mathbf {v}({\mathcal {C}}_i)$ for each class ${\mathcal {C}}_i$ is considered, $\mathbf {v}({\mathcal {C}}_i) = (c_1, \dots , c_{Q-1})$ where $c_j = 1\{{\mathcal {C}}_j \prec {\mathcal {C}}_i\}$, i.e. a vector with ones in all positions corresponding with classes which are lower than $\mathcal {C}_i$ in the ordinal scale. This makes the ideal output vector for a sample ${\mathbf {x}}_i$ with label $y_i = {\mathcal {C}}_k$ be:

$$\begin{aligned} \mathbf {v}({\mathcal {C}}_k) = (c_1, ..., c_{k-1}, c_k, ..., c_{Q-1}) = (1, ..., 1, 0, ..., 0), \end{aligned}$$

(9)

For example, for a 4-class ordinal problem with labels ${\mathcal {C}}_1$, ${\mathcal {C}}_2$, ${\mathcal {C}}_3$, and ${\mathcal {C}}_4$ the ideal outputs would be $\mathbf {v}({\mathcal {C}}_1) = (0,0,0)$, $\mathbf {v}({\mathcal {C}}_2) = (1,0,0)$, $\mathbf {v}({\mathcal {C}}_3) = (1,1,0)$, and $\mathbf {v}({\mathcal {C}}_4) = (1,1,1)$.
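The ideal vectors of Eq. (9) are simple to generate programmatically. The helper below is a sketch with a hypothetical name, using 1-based class indices to match the notation ${\mathcal {C}}_1, \dots , {\mathcal {C}}_Q$.

```python
def ideal_vector(k, num_classes):
    """Eq. (9): ideal OBD output for class C_k (1-based k).

    Position j (1 <= j <= Q-1) holds 1 when C_j precedes C_k, else 0.
    """
    return [1 if j < k else 0 for j in range(1, num_classes)]

# Codebook for the 4-class example: one row per class C_1, ..., C_4.
codebook = [ideal_vector(k, 4) for k in range(1, 5)]
```

Each row has $Q-1$ entries, and consecutive rows differ in exactly one position.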

The decision rule is based on determining the ideal vector which minimizes the distance to the obtained output vector $\mathbf {o}$:

$$\begin{aligned} {\hat{y}}_i = {\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{{\mathcal {C}}_1 \preceq {\mathcal {C}}_q \preceq {\mathcal {C}}_Q }} \left\Vert \mathbf {o} - \mathbf {v}({\mathcal {C}}_q) \right\Vert _2, \end{aligned}$$

(10)

where $\Vert \cdot \Vert _2$ is the $L_2$ norm. This distance metric is selected in order to align it with the loss function of the optimization process.

As an example to illustrate this prediction criterion, assume a 4 class ordinal problem like the one previously mentioned. For sample ${\mathbf {x}}$, let the output of the model be the 3 dimensional vector $\mathbf {o} = ( 0.8, 0.3, 0.2 )$. The distance to each ideal class vector would be computed as:

$$\begin{aligned} \begin{aligned} \Vert \mathbf {o} - \mathbf {v}({\mathcal {C}}_1) \Vert _2 = \Vert (0.8 - 0, 0.3 - 0, 0.2 - 0) \Vert _2 = \sqrt{0.77} \approx 0.877,\\ \Vert \mathbf {o} - \mathbf {v}({\mathcal {C}}_2) \Vert _2 = \Vert (0.8 - 1, 0.3 - 0, 0.2 - 0) \Vert _2 = \sqrt{0.17} \approx 0.412,\\ \Vert \mathbf {o} - \mathbf {v}({\mathcal {C}}_3) \Vert _2 = \Vert (0.8 - 1, 0.3 - 1, 0.2 - 0) \Vert _2 = \sqrt{0.57} \approx 0.755,\\ \Vert \mathbf {o} - \mathbf {v}({\mathcal {C}}_4) \Vert _2 = \Vert (0.8 - 1, 0.3 - 1, 0.2 - 1) \Vert _2 = \sqrt{1.17} \approx 1.082. \end{aligned} \end{aligned}$$

(11)

This process is illustrated in Fig. 1. The vector closest to $\mathbf {o}$ is $\mathbf {v}({\mathcal {C}}_2)$ and thus, sample ${\mathbf {x}}$ would be assigned the class label ${\hat{y}} = {\mathcal {C}}_2$.
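The prediction criterion of Eq. (10) can be sketched as follows. The function name is hypothetical and class indices are 1-based. The function returns the Euclidean distances themselves, that is, the square roots of the summed squared differences; the predicted label would be the same if squared distances were compared instead.

```python
import math

def ecoc_predict(outputs):
    """Eq. (10): class whose ideal vector is closest to the OBD outputs.

    Returns the 1-based class index and the distance to every ideal vector.
    """
    q = len(outputs) + 1  # Q-1 outputs correspond to Q classes
    distances = []
    for k in range(1, q + 1):
        # Ideal vector of class C_k: ones in the first k-1 positions.
        ideal = [1.0 if j < k else 0.0 for j in range(1, q)]
        dist = math.sqrt(sum((o - v) ** 2 for o, v in zip(outputs, ideal)))
        distances.append(dist)
    best = min(range(q), key=lambda i: distances[i]) + 1
    return best, distances

# Output vector from the worked example in the text.
label, distances = ecoc_predict([0.8, 0.3, 0.2])

# A non-monotonic (inconsistent) output still yields a well-defined label.
label_inconsistent, _ = ecoc_predict([0.2, 0.9, 0.1])
```

For the vector $(0.8, 0.3, 0.2)$ the closest ideal vector is that of ${\mathcal {C}}_2$. The second call shows that the rule does not require the outputs to be monotonically decreasing: for $(0.2, 0.9, 0.1)$ the nearest ideal vector is $(1, 1, 0)$, so ${\mathcal {C}}_3$ is returned.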

4.4.2 Loss Function

For the OBD methodology, categorical cross-entropy has been substituted by the Mean Squared Error loss because it copes better with the distance function used for the ECOC decision [2]:

$$\begin{aligned} \ell = \frac{1}{N}\sum _{i=1}^N \sum _{k=1}^{Q-1} (1\{y_i \succ {\mathcal {C}}_k\} - P(y_i \succ {\mathcal {C}}_k \mid {\mathbf {x}}_i))^2, \end{aligned}$$

(12)

where $1\{y_i \succ {\mathcal {C}}_k\}$ is the indicator function that is equal to 1 when $y_i \succ {\mathcal {C}}_k$ and 0 otherwise, and $P(y_i \succ {\mathcal {C}}_k \mid {\mathbf {x}}_i)$ is the probability that $y_i \succ {\mathcal {C}}_k$ predicted by the network given a sample ${\mathbf {x}}_i$.
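The loss of Eq. (12) can be sketched in plain Python as shown below. The function names are hypothetical and labels are 1-based class indices. Note that, as in Eq. (12), the squared errors are summed over the $Q-1$ outputs and averaged only over the $N$ samples.

```python
def obd_targets(label, num_classes):
    """Binary targets 1{y > C_k} for k = 1, ..., Q-1 (1-based label)."""
    return [1.0 if label > k else 0.0 for k in range(1, num_classes)]

def obd_mse_loss(batch_outputs, labels, num_classes):
    """Eq. (12): squared error summed over outputs, averaged over samples."""
    total = 0.0
    for outputs, label in zip(batch_outputs, labels):
        targets = obd_targets(label, num_classes)
        total += sum((t - o) ** 2 for t, o in zip(targets, outputs))
    return total / len(labels)

# One sample of a 4-class problem with true label C_2.
loss = obd_mse_loss([[0.8, 0.3, 0.2]], [2], num_classes=4)

# Outputs that match the targets exactly give zero loss.
loss_perfect = obd_mse_loss([[1.0, 1.0, 0.0]], [3], num_classes=4)
```

For a single sample, this loss equals the squared $L_2$ distance between the output vector and the ideal vector of the true class, which is the link to the decision rule of Eq. (10).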

The four methodologies described in this section are illustrated in Fig. 2.

5 Experiment Design

5.1 Datasets

The four described methodologies are tested on the following two datasets, chosen specifically for the ordinal nature of their class labels and their acute class imbalance.

5.1.1 Diabetic Retinopathy Dataset

The diabetic retinopathy dataset from Kaggle^{Footnote 2} (referred to as “Retinopathy” from now on) consists of a total of 88 702 retina images labelled by a clinician on a 0 to 4 scale evaluating the presence of Diabetic Retinopathy (DR), an eye disease present in a large proportion of diabetes patients. It contains 65 343 images labelled as No DR, 6205 images labelled as Mild DR, 13 153 images labelled as Moderate DR, 2087 images labelled as Severe DR, and 1914 images labelled as Proliferative DR. The task consists of predicting the clinician label using the colour image of the retina. Three sample images can be seen in Fig. 3. All images have been resized to $128 \times 128$ pixels.

5.1.2 Adience Faces Dataset

The Adience faces dataset for age classification [11] (referred to simply as “Adience” from now on) is composed of 26 580 photos of 2284 different subjects extracted from real online albums and automatically cropped and aligned. 17 702 of these photos have an age label attached, referring to one of 8 different age groups of increasing value: 0–2 years, 4–6 years, 8–13 years, 15–20 years, 25–32 years, 38–43 years, 48–53 years, and 60 years and up. The task consists of assigning one of these 8 age labels to each photo. A sample of these images can be seen in Fig. 4. As a preprocessing step, all images have been resized to $256 \times 256$ pixels.

5.2 Methodologies and Validation Scheme

Four different methodologies are tested against each other:

The baseline nominal architecture, with categorical cross-entropy loss function.
The same architecture, but substituting the loss function by the Quadratic Weighted Kappa (QWK) function described in Sect. 4.2.
The CLM approach, as described in Sect. 4.3.
The OBD approach with ECOC decision rule, as described in Sect. 4.4.

All of these are applied to all four of the previously mentioned architectures (VGG11, ResNet18, MobileNetV3 and ShuffleNetV2), yielding a total of sixteen different experiments for each of the two datasets.

In order to obtain a statistically significant result to test the hypotheses, each experiment is repeated 30 times on 30 different holdout splits of the original dataset, where 80% of samples are used for training and 20% are used for model evaluation. This split is performed in a stratified fashion, preserving the original proportion of the classes of the original dataset in the subsets. For the Retinopathy dataset this leaves 70 962 training samples (of which 7096 are reserved for validation) and 17 740 test samples in each split. In the case of the Adience dataset, 14 161 are used for training (of which 1416 are reserved for validation) and 3541 are used for evaluation.
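A stratified 80/20 holdout split of the kind described above can be sketched with the standard library alone. The function name, the seeding scheme, and the toy label list are assumptions of this example; it does not reproduce the authors' exact splits, and it omits the further validation subset taken from the training data.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=0.2, seed=0):
    """Split sample indices so each class keeps roughly its proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)  # shuffle within the class only
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return train, test

# Hypothetical imbalanced label list with three classes.
labels = [0] * 70 + [1] * 20 + [2] * 10

# Thirty different splits, one per seed.
splits = [stratified_holdout(labels, seed=s) for s in range(30)]
train, test = splits[0]
test_counts = [sum(1 for i in test if labels[i] == c) for c in range(3)]
```

Each class contributes about 20% of its own samples to the test set, so the class imbalance of the full dataset is preserved in both subsets.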

5.3 Training Scheme

In all experiments, weights are initialized randomly using the He initialization scheme described in [18]. They are then adjusted using the Adam method [21] with a learning rate $\eta = 1\times 10^{-4}$.

In the case of VGG11, both dropout ($p=0.5$) and $L_2$ regularization (with a weight of $5\times 10^{-4}$) are applied only in the fully-connected layers as in the original paper [31]. For ResNet18, batch normalization is applied after every convolution operation and $L_2$ penalty (with a weight of $1\times 10^{-4}$) is added to all mappings [17]. The number of trainable parameters for each model is available in Table 1.

Table 1 Number of trainable parameters and total memory size of the trained models for each methodology and architecture
