1 Introduction

Psoriasis is a chronic, immune-mediated, relapsing, inflammatory skin disease that is usually associated with itch. The prevalence of psoriasis varies from 1% to 12% among different populations worldwide [1]. The disease develops when the immune system mistakes normal skin cells for a pathogen and sends out faulty signals that cause overproduction of new skin cells. It can be diagnosed by visual and haptic inspection. The visible changes of the outer skin surface include elevated, red, scaly, dry patches with well-demarcated borders. However, the shape, size, color and distribution of these patches vary. In dermatology, these patches are termed psoriatic plaques [13].

No drug is yet available to cure psoriasis completely, but its severity can be controlled by suitable drug doses. As drug response varies among patients, a reliable severity assessment procedure is required to decide the type and dose of the drugs as well as to measure disease progress and drug efficacy. Dermatologists use the Psoriasis Area Severity Index (PASI) [5] for estimating severity. PASI considers two major aspects of the disease: the ratio of body surface area affected and the severity of the plaques formed on the skin surface. The severity of a plaque is determined by the visual disorder formed on the affected skin regions. Three aspects are considered: degree of redness (erythema), thickness (induration) and scaling (desquamation). Each aspect is scored with a value from 0 to 4. Table 1 contains a sample image for every severity class.

Table 1. Visualization of psoriasis plaques with different severity scores.

The severity factors are determined by dermatologists through visual estimation, so the assessment procedure suffers from both inter- and intra-observer variability. Hence, an automated and robust system for severity assessment of psoriatic plaques is necessary for clinical studies. Some approaches have already been proposed for automatic scoring of scaling [2] and erythema [4, 6, 7, 12]. In [14], an image-based system is proposed to compute the aggregated severity score according to plaque characteristics. In [11], an attempt is made to assess the erythema, scaling and induration scores from psoriatic plaque images. However, all of these approaches treat severity grading as a plain image classification problem and fail to capture the underlying ordinal relationship among the severity labels. This motivates us to develop CNN-based ordinal classifiers for severity assessment of psoriatic plaques.

To summarize, the key contributions of this paper are: (i) a pioneering attempt towards developing a deep convolutional neural network based ordinal classifier for predicting the severity score of psoriatic plaques, (ii) the use of a new loss function for training a CNN that captures the ordinal relationship among the class labels, (iii) the fine-tuning of two CNN models pre-trained on the ImageNet dataset (namely, ResNet-50 and MobileNet) to develop the severity assessment classifiers, and finally, (iv) a comparison of the proposed CNN with several baselines.

2 Methodology

2.1 Convolutional Neural Network

Convolutional Neural Networks (CNNs) are now widely used for image classification tasks, as they relieve researchers from designing hand-engineered feature descriptors and automatically learn powerful models directly from the training images. These models are made up of multiple processing units, and each processing unit consists of trainable weights and biases. In the training phase, the network parameters are updated by comparing the distribution of predicted class labels with the actual class labels of the training images. A brief description of the traditional categorical cross entropy (CCE) loss and the mean square error (MSE) loss is given below.

Suppose that, for a C-class (\(C>2\)) single-label image classification problem, the ground truth of a particular image is given by a binary vector G of length C such that \(G_{i}=1\) when \(i = k\), the correct class, and \(G_{i}=0\) otherwise. The output of the CNN is a probability distribution P of length C whose \(i^{th}\) entry (\(P_i\)) represents the predicted probability of the \(i^{th}\) class. The CCE loss and the MSE loss are then defined in Eqs. 1 and 2.

$$\begin{aligned} \mathcal {L}_{CCE}=-\sum _{i=1}^{C}G_i\, \ln (P_i) \end{aligned}$$
$$\begin{aligned} \mathcal {L}_{MSE}=\sum _{i=1}^{C}(P_i -G_i)^2 \end{aligned}$$
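
As a minimal sketch of Eqs. 1 and 2, the two losses can be written for a single image in plain Python. The function names, the helper `one_hot` and the example probabilities are illustrative assumptions and do not come from the paper.

```python
import math


def one_hot(k, num_classes):
    """Ground-truth vector G for correct class k (classes are 1-indexed)."""
    return [1.0 if i == k else 0.0 for i in range(1, num_classes + 1)]


def cce_loss(p, g):
    """Eq. 1: -sum_i G_i * ln(P_i). Entries with G_i = 0 are skipped,
    which also avoids evaluating ln(0) for classes with zero probability."""
    return -sum(gi * math.log(pi) for pi, gi in zip(p, g) if gi > 0)


def mse_loss(p, g):
    """Eq. 2: sum_i (P_i - G_i)^2."""
    return sum((pi - gi) ** 2 for pi, gi in zip(p, g))


# Hypothetical prediction for a 5-class problem whose correct class is k = 2.
g = one_hot(2, 5)
p = [0.1, 0.6, 0.2, 0.05, 0.05]
print("CCE:", cce_loss(p, g))  # equals -ln(0.6)
print("MSE:", mse_loss(p, g))
```

Only the probability of the correct class enters `cce_loss`, whereas `mse_loss` sums a squared difference over all C classes.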

2.2 Ordinal Classification and Limitation of CCE and MSE Loss

In the present severity assessment task, there exists an ordinal relationship among the severity grades. Suppose the actual and predicted severity scores of a misclassified image are K and \(K_{1}\), respectively. Then we would prefer the classifier to have the least possible absolute difference \(|K-K_{1}|\). However, as can be seen from Eqs. 1 and 2, the CCE and MSE losses ignore this relationship: CCE only considers the probability of the correct class, and MSE is invariant to permutation of the probabilities of the incorrect classes.
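
The limitation can be illustrated with a small numerical sketch. Two hypothetical predictions assign the same probability to the correct class but place the remaining mass on a neighbouring class in one case and on the farthest class in the other. The probability values and the helper `expected_abs_error` are assumptions chosen for illustration only.

```python
import math


def cce_loss(p, k):
    """CCE for correct class k (1-indexed): only P_k matters."""
    return -math.log(p[k - 1])


def mse_loss(p, k):
    """MSE against the one-hot ground truth of class k."""
    return sum((pi - (1.0 if i == k else 0.0)) ** 2
               for i, pi in enumerate(p, start=1))


def expected_abs_error(p, k):
    """Expected |predicted class - k| under the predicted distribution."""
    return sum(pi * abs(i - k) for i, pi in enumerate(p, start=1))


k = 1                                  # correct class
p_near = [0.6, 0.4, 0.0, 0.0, 0.0]     # wrong mass on the adjacent class
p_far = [0.6, 0.0, 0.0, 0.0, 0.4]      # wrong mass on the farthest class

for name, p in (("near", p_near), ("far", p_far)):
    print(name, cce_loss(p, k), mse_loss(p, k), expected_abs_error(p, k))
```

Both predictions receive identical CCE and MSE values, although the second one implies a much larger expected label error.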

2.3 Proposed Loss Function

Motivated by [9], we use the Earth Mover’s Distance (EMD) based loss function for the present classification task. Let \(X^{CDF}_{i}\) denote the \(i^{th}\) element of the cumulative distribution of X; then the loss function is as follows:

$$\begin{aligned} \nonumber \mathcal {L}_{EMD}= & {} \sum _{i=1}^{C}(P^{CDF}_i - G^{CDF}_i)^2\\ \nonumber= & {} \sum _{i=1}^{C}(\sum _{j=1}^{i}P_j-\sum _{j=1}^{i}G_j)^2\\= & {} \underbrace{\sum _{i=1}^{k-1}\bigg (\sum _{j=1}^{i}P_j\bigg )^{2}}_{\mathcal {A}} + \underbrace{\sum _{i=k}^{C}\bigg (\sum _{j=1}^{i}P_j-1\bigg )^{2}}_{\mathcal {B}} \end{aligned}$$

where k is the correct class. According to Eq. 3, when \(i<k\), increasing the value of \(P_{i}\) increases the value of \(\mathcal {A}\), whereas when \(i\ge k\), increasing the value of \(P_{i}\) decreases the value of \(\mathcal {B}\). In \(\mathcal {A}\), \(P_{i}\) (\(i<k\)) occurs in \((k-i)\) of the partial sums, so the penalty for probability placed on class i grows as \(|i - k|\) increases. Similarly, in \(\mathcal {B}\), \(P_{i}\) (\(i\ge k\)) occurs in the partial sums for \(i,\ldots ,C\), of which the last is identically zero because P sums to 1; this leaves \((C-i)\) effective terms. The farther i lies above k, the fewer terms of \(\mathcal {B}\) it reduces, so the value of \(\mathcal {L}_{EMD}\) again increases as \(|i - k|\) increases. Thus the proposed loss function trains the network in such a way that class labels farther from the actual class receive less probability.
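
A minimal sketch of Eq. 3 for a single image is given below, together with the split into \(\mathcal {A}\) and \(\mathcal {B}\). The function names and the example distributions are illustrative assumptions; the sketch operates on plain Python lists rather than on network outputs.

```python
from itertools import accumulate


def emd_loss(p, g):
    """Eq. 3: squared difference between the cumulative distributions."""
    p_cdf = list(accumulate(p))
    g_cdf = list(accumulate(g))
    return sum((a - b) ** 2 for a, b in zip(p_cdf, g_cdf))


def emd_split(p, k):
    """Terms A (i < k) and B (i >= k) of Eq. 3 for correct class k (1-indexed)."""
    p_cdf = list(accumulate(p))
    term_a = sum(c ** 2 for c in p_cdf[:k - 1])
    term_b = sum((c - 1.0) ** 2 for c in p_cdf[k - 1:])
    return term_a, term_b


# Same hypothetical setting as a near miss versus a far miss for class k = 1.
g = [1.0, 0.0, 0.0, 0.0, 0.0]
p_near = [0.6, 0.4, 0.0, 0.0, 0.0]
p_far = [0.6, 0.0, 0.0, 0.0, 0.4]
print("near:", emd_loss(p_near, g))
print("far: ", emd_loss(p_far, g))

# The split reproduces the full loss for an interior class, here k = 3.
p = [0.1, 0.2, 0.4, 0.2, 0.1]
a, b = emd_split(p, 3)
print(a + b, emd_loss(p, [0.0, 0.0, 1.0, 0.0, 0.0]))
```

Unlike CCE and MSE, the loss for the far miss is larger than for the near miss, even though both assign the same probability to the correct class.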

3 Experimental Setup

Dataset: In this research, an image dataset of 707 psoriatic plaque images with expert-annotated severity scores for erythema, scaling and induration is used. This dataset is built by cropping sub-images from a dataset of psoriasis images collected from 80 patients. The original images were collected in an uncontrolled environment by layman photographers with different view angles, distances, lighting conditions and varying backgrounds. Apart from photographic limitations and skin color tone variation, the presence of several artefacts such as hair and wrinkles makes the severity assessment task challenging.

Network: As the data volume is small, training a Convolutional Neural Network (CNN) from scratch does not produce satisfactory performance. Fine-tuning of a pre-trained network is therefore adopted for the present classification task. Two networks pre-trained on the ImageNet dataset, ResNet-50 [8] and MobileNet [10], are considered for fine-tuning. ResNet-50 is chosen for its strong performance on ImageNet classification. MobileNet is chosen because it contains comparatively fewer parameters yet still performs well on ImageNet classification.

Training: In this paper, the performance of the developed system is reported on the basis of 7-fold cross validation. The model is trained with the stochastic gradient descent optimizer using a batch size of 4 images, a momentum of 0.9, a weight decay of \(10^{-6}\) and a learning rate of 0.001. For every fold, the network is trained 10 times and the trained model that ends with minimal loss is chosen for prediction of the test images. Horizontal and vertical flipping augmentation is used to improve the generalization ability of the classifiers.
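
The protocol above can be sketched without any deep learning framework. The text does not state how images are assigned to folds, so the shuffled, image-level split below is an assumption, as are the names `k_fold_indices`, `SGD_CONFIG`, `hflip` and `vflip`; only the numeric settings are taken from the description.

```python
import random

# Hyperparameters as stated in the text; the dictionary itself is illustrative.
SGD_CONFIG = {
    "batch_size": 4,
    "momentum": 0.9,
    "weight_decay": 1e-6,
    "learning_rate": 0.001,
}


def k_fold_indices(n_items, n_folds, seed=0):
    """Yield (train, test) index lists for a shuffled k-fold split.
    Assumption: folds are formed over images, not over patients."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, folds[f]


def hflip(image):
    """Horizontal flip of an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]


def vflip(image):
    """Vertical flip of an image stored as a list of pixel rows."""
    return image[::-1]


splits = list(k_fold_indices(707, 7))
print(len(splits), len(splits[0][0]), len(splits[0][1]))

tiny = [[1, 2], [3, 4]]
print(hflip(tiny), vflip(tiny))
```

With 707 images and 7 folds, each test fold holds 101 images and the corresponding training set holds 606.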

Baselines: In this paper, the performance of the CNN trained with the proposed ordinal loss minimization is compared with four baselines. The first two CNNs are trained with traditional categorical cross entropy (\(\mathbf {CNN_{CCE}}\)) and mean-square error (\(\mathbf {CNN_{MSE}}\)) loss minimization. In the third approach (\(\mathbf {CNN_{Regr}}\)), the severity scores are projected onto C equal partitions of [0, 1] and the CNN is trained in such a way that an image of the \(i^{th}\) class \((i=1,2,\ldots ,C)\) outputs a value in \([\frac{i-1}{C},\frac{i}{C}]\). The last approach is the decomposition (\(\mathbf {CNN_{Decomp}}\)) of the C-class classification problem into \(C-1\) binary classification problems, where the \(i^{th}\) classifier predicts whether or not an image has a class label greater than i. These trained classifiers are then used to predict the class labels of the test images. It is worth mentioning that the binary CNNs are trained with binary cross-entropy loss minimization. Among all considered baselines, only the last two classifiers can capture the ordinal relationship among the labels.
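
The label encodings behind the last two baselines can be sketched as follows. The text fixes only the target interval for the regression baseline and the meaning of each binary classifier; the midpoint target, the clamped decoding and the count-based decoding rule with a 0.5 threshold are assumptions made for this illustration, as are all function names.

```python
def regression_target(i, num_classes):
    """Midpoint of [(i-1)/C, i/C] for class i (1-indexed). Assumed target."""
    return (i - 0.5) / num_classes


def regression_decode(y, num_classes):
    """Map a network output y to the class whose interval contains it."""
    y = min(max(y, 0.0), 1.0)  # clamp outputs that fall outside [0, 1]
    return min(num_classes, int(y * num_classes) + 1)


def binary_targets(label, num_classes):
    """Targets for the C-1 classifiers: the i-th one answers 'label > i?'."""
    return [1 if label > i else 0 for i in range(1, num_classes)]


def binary_decode(probs, threshold=0.5):
    """Assumed decoding: 1 + number of classifiers voting 'greater'."""
    return 1 + sum(1 for q in probs if q > threshold)


C = 5
print([regression_target(i, C) for i in range(1, C + 1)])
print(regression_decode(0.47, C))
print(binary_targets(3, C))
print(binary_decode([0.9, 0.8, 0.3, 0.1]))
```

The count-based rule is only one possible way of combining the binary outputs; it does not require the individual classifiers to be mutually consistent.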

Performance Evaluation Metrics: The performance of the trained CNN is measured with three evaluation metrics: (i) Mean Accuracy (MA), (ii) Mean Absolute Error (MAE) and (iii) Kendall’s \(\tau _b\). The value of MA lies in [0, 1], and a higher value represents better performance. A lower value of MAE represents better performance. Kendall’s \(\tau _b\) measures the association or rank correlation between two measured quantities. The \(\tau _b\) value lies in \([- 1,+1]\), where +1 is the maximum agreement between the predicted and the ground-truth class labelling, 0 represents no correlation between them and \(-1\) represents maximum disagreement. MAE and Kendall’s \(\tau _b\) are used because MA ignores the ordinal relationship between the predicted and actual class of a misclassified image.

Suppose there are N test images, each having a discrete class label in [1, C]. Let \(Y_i^p\) and \(Y_i^g\) represent the predicted and the ground-truth class label of the \(i^{th}\) test image, respectively. The mathematical expressions of these metrics are then given in Eqs. 4, 5 and 6.

$$\begin{aligned} MA = \frac{1}{N}\sum _{i=1}^{N}\delta (Y^g_i,Y^p_i); \qquad \qquad {\delta (x,y)}={\left\{ \begin{array}{ll} 1, &{} \text {if}\,\, {x=y}\\ 0, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
$$\begin{aligned}&\quad MAE = \frac{1}{N}\sum _{i=1}^{N}|Y^g_i-Y^p_i|\\&\tau _b = \frac{\sum _{i,j=1}^{N} \hat{C}_{i,j}{C}_{i,j}}{\sqrt{\sum _{i,j=1}^{N} \hat{C}_{i,j}^2 \sum _{i,j=1}^N C_{i,j}^2}}, \text {where} \end{aligned}$$
$$\begin{aligned} \hat{C}_{i,j}={\left\{ \begin{array}{ll} 1, &{} \text {if}\,\, Y_i^p>Y_j^p \\ -1, &{} \text {if}\,\, Y_i^p<Y_j^p\\ 0, &{} \text {otherwise}. \end{array}\right. } \qquad \qquad {C_{i,j}}={\left\{ \begin{array}{ll} 1, &{} \text {if}\,\, Y_i^g>Y_j^g \\ -1, &{} \text {if}\,\, Y_i^g<Y_j^g \\ 0, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
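
The three metrics can be sketched directly from their definitions. The function names and the toy label lists are illustrative assumptions. The \(\tau _b\) routine follows the double sum over all pairs as written above, so it runs in quadratic time and is undefined (division by zero) when all predictions or all ground-truth labels are identical.

```python
import math


def sign(x):
    """Return 1, -1 or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def mean_accuracy(y_true, y_pred):
    """MA: fraction of images whose predicted label equals the ground truth."""
    return sum(1 for g, p in zip(y_true, y_pred) if g == p) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute difference between the label indices."""
    return sum(abs(g - p) for g, p in zip(y_true, y_pred)) / len(y_true)


def kendall_tau_b(y_true, y_pred):
    """Kendall's tau_b computed from the pairwise sign matrices."""
    n = len(y_true)
    num = sum_pred = sum_true = 0
    for i in range(n):
        for j in range(n):
            c_pred = sign(y_pred[i] - y_pred[j])
            c_true = sign(y_true[i] - y_true[j])
            num += c_pred * c_true
            sum_pred += c_pred ** 2
            sum_true += c_true ** 2
    return num / math.sqrt(sum_pred * sum_true)


# Toy example: the last two predictions are swapped.
y_true = [1, 2, 3, 4, 5]
y_pred = [1, 2, 3, 5, 4]
print(mean_accuracy(y_true, y_pred))        # 0.6
print(mean_absolute_error(y_true, y_pred))  # 0.4
print(kendall_tau_b(y_true, y_pred))        # 0.8
```

In this toy case two of five labels are wrong, each by one grade, and one of the ten label pairs is ordered incorrectly.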

4 Results and Discussion

The average performance (in terms of the metrics described in Sect. 3) of MobileNet and ResNet-50 for erythema, scaling and induration scoring using the considered approaches is listed in Table 2. According to Table 2, each network trained with the proposed loss function outperforms the same network trained with CCE or MSE loss minimization, while the networks trained with CCE and MSE loss minimization perform comparably to each other. Performance is poor when the CNN is trained for regression (Regr) output. A likely reason is that this method is sensitive to noise in the test images, which degrades its performance, so this approach is unsuitable for the present task. On the other hand, the binary decomposition approach outperforms the CNN models trained with CCE and MSE loss minimization. However, in most cases, it is beaten by the proposed method. The success of the binary decomposition approach depends on the robustness of all decomposed classifiers, and a single weak classifier may adversely affect the whole classification scheme. According to Table 2, among all considered approaches, the best performance is achieved when ResNet-50 is fine-tuned with EMD loss minimization. Some images from our dataset, along with their actual severity scores and the erythema, scaling and induration scores predicted by the best models, are given in Fig. 1.

Table 2. Experimental Result
Fig. 1. Psoriasis images and their ground-truth (GT) and predicted (Pred) severity scores obtained from the best classifiers. The scores are given as (Erythema, Scaling, Induration). The errors are highlighted in yellow.

The psoriasis image dataset developed for [11] is reused in our research. In [11], the best models for erythema and induration were obtained from the AlexNet-based MTL network, and the best model for scaling was obtained from the AlexNet-based STL network. The performance was evaluated with the average correct classification accuracy without and with \(\pm 1\) tolerance, and with the combined average classification accuracy without and with \(\pm 1\) tolerance. In Table 3, the first row contains the previous best result and the second row contains the result produced by our best model. According to Table 3, our model sets a new state of the art.

Table 3. Comparison with the state of the art. WoT denotes without tolerance and WT denotes with tolerance.

5 Conclusion

A novel loss function is designed to make a CNN suitable for ordinal classification and is used for automatic severity assessment of psoriatic plaques. The use of such a loss function for this task is a pioneering attempt. The proposed learning scheme successfully improves the classification performance. Specifically, the improvement of MAE and \(\tau _{b}\) in comparison to the considered baselines justifies the advantage of training a CNN with ordinal loss minimization. The proposed loss minimization in CNN training can also be employed for other ordinal prediction tasks, such as severity prediction from other medical images and age group estimation from face images [3].