
1 Introduction

Medical imaging is an emerging and successful tool increasingly employed in precision medicine. It aids medical decision making for providing appropriate and optimal therapies to an individual patient. Skin cancer is one such disease that can be identified through medical imaging using dermoscopic techniques. There are many types of skin cancer, but they can be broadly divided into two general categories, viz., non-melanoma and melanoma. Non-melanoma cancers are unlikely to spread to other parts of the body, whereas melanoma is likely to spread and is known to be an aggressive cancer. Malignant melanoma is a cutaneous disease that affects the melanin-producing cells known as melanocytes. Melanoma is often fatal; it has caused more deaths than any other type of skin disease [18]. The dermoscopic acquisition of a skin image targets segmentation into two regions: lesion and normal skin. The part of an organ or tissue affected by a disease or an injury is generally termed a lesion. Efficient and accurate segmentation of the lesion region in dermoscopic images aids the classification of various skin diseases. Furthermore, the severity of a disease can be predicted through various grading techniques, enabling early identification of a skin disease, which plays a vital role in its treatment and cure.

2 Literature

In the existing literature, several attempts have been made to develop more robust and efficient segmentation of the lesion region in dermoscopic and non-dermoscopic clinical images. Methods for skin lesion segmentation can be grouped into the following categories [8], viz., thresholding, active contours, region merging methods [13] and deep learning architectures. Some methods [7, 16] proposed on non-dermoscopic images address skin lesion segmentation based on colour features and textural properties respectively. The method in [15] addresses illumination effects and artifacts. These methods apply post-processing steps to refine the segmentation results. In [19], a deep convolutional neural network (CNN) has been proposed which combines both local texture and global structure information to predict a label for each pixel for segmentation of the lesion region. In [11], an automated system for skin lesion region segmentation has been proposed which classifies each pixel based on pertinent geometrical, textural and colour features selected using Ant Colony Optimization (ACO). The complementary strengths of a saliency and a Bayesian framework are applied in [2] to distinguish the shape and boundaries of the lesion region from the background. In [23], an unsupervised methodology based on the wavelet lattice and the shift and scale parameters of wavelets has been proposed for the segmentation of skin lesion regions in dermoscopic images. In [6], image-wise supervised learning is proposed to derive a probabilistic map for automated seed selection, and multi-scale super-pixel based cellular automata are used to acquire structural information for skin lesion region segmentation. In [12], a Gaussian membership function is applied for image fuzzification and to quantify each pixel for skin segmentation.

Despite the several methods available for segmenting the lesion region in images of skin diseases, there is still scope for exploring new models that are more efficient and provide better segmentation. Thus, in this work, we propose a deep residual architecture inspired by UNet [22] for skin lesion segmentation. The rest of this paper is organized as follows: Sect. 3 elaborates the proposed model for segmentation of the skin lesion region, Sect. 4 presents the experimental and comparative analysis, and Sect. 5 concludes the paper.

3 Proposed Method

The proposed methodology for automatic skin lesion region segmentation using a deep learning architecture is shown in Fig. 1. The architecture is inspired by UNet [22] and the residual network [17]. The input to the network is the RGBH representation (Red, Green, Blue and Hue planes respectively) of a dermoscopic image, and the output is a binary segmented image with white and black pixels representing the affected and non-affected skin regions respectively. There are four important components in the proposed network. The first is the construction of a multi-scale [14] image pyramid input, which makes the network scale invariant. The second is a U-shaped convolutional network that learns a rich hierarchical representation. The third incorporates residual learning to preserve spatial and contextual information from the preceding layers. The residual connections are used at two levels: firstly, at each step of the contracting (encoder) and expansive (decoder) paths of the U-Net, and secondly, as a short connection between the multi-scale input and the corresponding step of the expansive (decoder) path. The information lost in the encoder stages due to the max-pooling layer at each level is preserved through these multi-scale residual connections. Finally, a layer with a binary cross-entropy loss function based on the Jaccard index [3] is included for the classification of pixels.
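
As an illustration, a minimal sketch of such a loss is given below, combining per-pixel binary cross-entropy with a differentiable (soft) Jaccard term. It assumes the tensorflow.keras API; the exact way the two terms are combined is not specified here, so the \(-\log(JA)\) formulation is an assumption.

```python
from tensorflow.keras import backend as K

def soft_jaccard(y_true, y_pred, smooth=1.0):
    # Soft intersection-over-union computed on per-pixel probabilities;
    # the smoothing constant avoids division by zero on empty masks.
    intersection = K.sum(y_true * y_pred, axis=[1, 2, 3])
    union = K.sum(y_true + y_pred, axis=[1, 2, 3]) - intersection
    return (intersection + smooth) / (union + smooth)

def bce_jaccard_loss(y_true, y_pred):
    # Per-pixel binary cross-entropy averaged over the image, plus
    # -log(JA), which rewards higher overlap with the ground truth.
    bce = K.mean(K.binary_crossentropy(y_true, y_pred), axis=[1, 2, 3])
    return bce - K.log(soft_jaccard(y_true, y_pred))
```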

3.1 Multi-scale Input Layer

The proposed method adopts an architecture similar to the methodology in [14] for constructing the multi-scale input: an average pooling layer downsamples the image naturally, yielding a multi-scale input along the encoder path. These scaled input layers serve two purposes: they act as shortcut connections to the corresponding steps of the encoder path, and they increase the network width of the decoder path. A sketch of the pyramid construction is given below.
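
A minimal sketch of this pyramid construction, assuming tensorflow.keras; the four-channel RGBH shape and the pyramid depth of four follow the architecture described below.

```python
from tensorflow.keras.layers import AveragePooling2D, Input

def multi_scale_inputs(shape=(256, 256, 4), depth=4):
    # Full-resolution RGBH input plus one downscaled copy per encoder step.
    x = Input(shape=shape)
    scales = [x]
    for _ in range(depth - 1):
        # Average pooling downsamples the image "naturally",
        # i.e. without any learned parameters.
        scales.append(AveragePooling2D(pool_size=(2, 2))(scales[-1]))
    return x, scales
```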

3.2 Network Structure

U-Net [22] is an efficient fully convolutional network that was proposed for biomedical image segmentation. The proposed model adopts a similar architecture, consisting of two kinds of blocks arranged in a U-shape as shown in Fig. 1. The green block (Fig. 2(a)) represents the residual downsampling block and the red block (Fig. 2(b)) represents the residual upsampling block. A \(2\times 2\) max-pooling operation with stride 2 is used for downsampling, and the number of feature channels chosen at each stage of the proposed architecture is shown in Fig. 1. The left-side path consists of repeated residual downsampling blocks (henceforth referred to as resDownBlocks), which are connected to the corresponding residual upsampling blocks (henceforth referred to as resUpBlocks). This connection is shown with dotted lines in Fig. 1; similar to U-Net, the feature maps of each resDownBlock are concatenated with those of the corresponding resUpBlock. Along with these u-connections, there are also short connections between the multi-scale input at each step of the U-Net and the corresponding resUpBlock, obtained by convolving the scaled input with a \(3\times 3\) convolution; this helps the network avoid converging on a local optimum and thus achieve good performance in complex image segmentation.

Fig. 1. Proposed architecture of Multi-Scale Residual UNet

Fig. 2. (a) Details of resDownBlock, (b) Details of resUpBlock (Color figure online)

3.3 Residual-Down-sampling Block (resDownBlock)

The structure of the resDownBlock consists of two \(3\times 3\) convolutions, each followed by a rectified linear unit (ReLU). A shortcut connection from the input layer is added to the output feature-maps of the second convolution layer before passing through the ReLU, as shown in Fig. 2(a). Batch normalization is adopted between each convolutional layer and its rectified linear unit, as well as on the shortcut connection. The max-pooling layer in the resDownBlock has a kernel size of \(2\times 2\) and a stride of 2. Except for the initial resDownBlock in the encoder path, every resDownBlock receives the output feature-maps of the preceding block concatenated with the corresponding scaled input.
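
A sketch of this block in tensorflow.keras is given below. The \(1\times 1\) projection on the shortcut, needed to match channel counts before the addition, is an assumption; the text only specifies batch normalization on the shortcut.

```python
from tensorflow.keras.layers import (Activation, Add, BatchNormalization,
                                     Concatenate, Conv2D, MaxPooling2D)

def res_down_block(inputs, filters, scaled_input=None):
    # Except for the first block, the pooled features from the previous
    # block are concatenated with the matching scaled input.
    x = inputs if scaled_input is None else Concatenate()([inputs, scaled_input])
    y = Conv2D(filters, 3, padding='same')(x)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(filters, 3, padding='same')(y)
    y = BatchNormalization()(y)
    # Shortcut from the block input, batch-normalized as described above.
    shortcut = BatchNormalization()(Conv2D(filters, 1, padding='same')(x))
    y = Activation('relu')(Add()([y, shortcut]))
    pooled = MaxPooling2D(pool_size=(2, 2), strides=2)(y)
    return y, pooled  # y feeds the skip connection, pooled the next block
```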

3.4 Residual-Up-sampling Block (resUpBlock)

The structure of the resUpBlock consists of two \(3\times 3\) convolutional layers, each followed by a rectified linear unit (ReLU); a shortcut connection from the input layer is added to the output feature-maps of the second convolution layer, along with a shortcut connection from the scaled input image, before passing through the ReLU, as shown in Fig. 2(b). A concatenation layer merges the upsampled feature-maps from the previous block with the feature-maps of the corresponding resDownBlock. In this architecture, the resolution of the resDownBlock's output must match the resUpBlock's input, hence an upsampling layer is placed at the beginning of each block. Batch normalization is adopted between each convolutional layer and its rectified linear unit, as described in the previous section.
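
A corresponding sketch of the resUpBlock, under the same assumptions (the \(1\times 1\) projection of the block input is ours; the \(3\times 3\) convolution of the scaled input follows Sect. 3.2):

```python
from tensorflow.keras.layers import (Activation, Add, BatchNormalization,
                                     Concatenate, Conv2D, UpSampling2D)

def res_up_block(inputs, skip_features, scaled_input, filters):
    # Upsample so the resolution matches the corresponding resDownBlock,
    # then concatenate with its feature-maps (the u-connection).
    x = Concatenate()([UpSampling2D(size=(2, 2))(inputs), skip_features])
    y = Conv2D(filters, 3, padding='same')(x)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(filters, 3, padding='same')(y)
    y = BatchNormalization()(y)
    # Two shortcuts are summed in: the block input and the 3x3-convolved
    # scaled input image (which must be at this block's resolution).
    shortcut = Conv2D(filters, 1, padding='same')(x)
    scaled = Conv2D(filters, 3, padding='same')(scaled_input)
    return Activation('relu')(Add()([y, shortcut, scaled]))
```

These two block functions, together with the multi-scale inputs from Sect. 3.1, can then be chained in the U-shape of Fig. 1.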

4 Experiments

In order to evaluate the efficacy of the proposed model, experiments have been conducted on the official ISIC 2017 Challenge dataset [10]. The dataset consists of dermoscopic images with 2000 training, 150 validation and 600 test samples respectively. The proposed network is implemented using the Keras neural network API [9] with a Tensorflow backend [1] and trained on a single GPU (GeForce GTX TITAN X, 12 GB RAM). The network is optimized by the Adam optimizer [20] with an initial learning rate of 0.001. To increase the number of samples during the training phase, we have used standard geometric (linear) data augmentation techniques, namely rotation (−45\(^{\circ }\) to +45\(^{\circ }\)), horizontal and vertical flipping, and translation and scaling (−10% to +10%) of the input image. We use \(256\times 256\) square images with a batch size of 4 samples. The number of learning steps in each epoch is set to 1000. We exploit the RGB and HSI color space models to derive the RGBH input (Red, Green, Blue and Hue channels of the dermoscopic images) to the network, capturing the color variations in the data. Figure 3 presents the lesion region segmentation for a few test samples with an overlay of the segmentation results. The overlay uses blue, green and red to represent false negatives, true positives and false positives respectively. It is evident that the proposed model effectively captures the lesion region without any post-processing steps.
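
The following sketch illustrates the RGBH input construction and the augmentation settings described above. The hue plane is taken here from OpenCV's HSV conversion as a close stand-in for HSI hue, and the helper names are illustrative.

```python
import cv2
import numpy as np
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def to_rgbh(bgr_image):
    # Stack the hue plane onto the RGB planes -> 4-channel input.
    rgb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)
    hue = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)[..., :1]
    rgbh = np.concatenate([rgb, hue], axis=-1)
    # Resize to the 256x256 network input and scale to [0, 1] (approximate
    # for hue, which OpenCV stores in [0, 179] for 8-bit images).
    return cv2.resize(rgbh, (256, 256)).astype('float32') / 255.0

# Geometric augmentation matching the ranges reported in the text.
augmenter = ImageDataGenerator(rotation_range=45,      # -45 to +45 degrees
                               horizontal_flip=True,
                               vertical_flip=True,
                               width_shift_range=0.1,  # +/- 10% translation
                               height_shift_range=0.1,
                               zoom_range=0.1)         # +/- 10% scaling

# model.compile(optimizer=Adam(learning_rate=1e-3), loss=bce_jaccard_loss)
# model.fit(augmenter.flow(x_train, y_train, batch_size=4), steps_per_epoch=1000)
```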

To evaluate the segmentation performance, we use Accuracy (AC), Jaccard Index (JA), Dice coefficient (DI), Sensitivity (SE) and Specificity (SP). Let \(\beta _{tp}\), \(\beta _{tn}\), \(\beta _{fp}\) and \(\beta _{fn}\) denote the numbers of true positives, true negatives, false positives and false negatives respectively. All the above-mentioned metrics are computed using Eqs. (1)–(5):

$$\begin{aligned} Accuracy (AC) = \frac{\beta _{tp} + \beta _{tn}}{\beta _{tp} + \beta _{tn} + \beta _{fp} + \beta _{fn}} \end{aligned}$$
(1)
$$\begin{aligned} Sensitivity (SE) = \frac{\beta _{tp}}{\beta _{tp} + \beta _{fn}} \end{aligned}$$
(2)
$$\begin{aligned} Dice~coefficient (DI) = \frac{2*\beta _{tp}}{2*\beta _{tp} + \beta _{fp} + \beta _{fn}} \end{aligned}$$
(3)
$$\begin{aligned} Specificity (SP) = \frac{\beta _{tn}}{\beta _{tn} + \beta _{fp}} \end{aligned}$$
(4)
$$\begin{aligned} Jaccard~Index (JA) = \frac{\beta _{tp}}{\beta _{tp} + \beta _{fp} + \beta _{fn}} \end{aligned}$$
(5)
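
For reference, Eqs. (1)–(5) can be computed from a pair of binary masks as in the following NumPy sketch (with specificity in its standard form \(\beta _{tn}/(\beta _{tn} + \beta _{fp})\)):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    # Confusion counts between a predicted and a ground-truth binary mask.
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    return {
        'AC': (tp + tn) / (tp + tn + fp + fn),  # Eq. (1)
        'SE': tp / (tp + fn),                   # Eq. (2)
        'DI': 2 * tp / (2 * tp + fp + fn),      # Eq. (3)
        'SP': tn / (tn + fp),                   # Eq. (4)
        'JA': tp / (tp + fp + fn),              # Eq. (5)
    }
```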

Figure 4 compares the segmentation results of the proposed model with other methods in the literature by depicting the number of test samples in each bin, where each bin on the x-axis represents a Jaccard Index range and the y-axis gives the number of test samples. The results of our method are presented in Table 1. From Table 1, it is evident that the proposed method outperforms the other methods in terms of Accuracy, Dice coefficient and Sensitivity. The results of our method are quite competitive on the ISIC 2017 dataset in comparison with the top-performing methods in the literature.

Fig. 3. Visual examples of lesion region segmentation: the \(1^{st}\) row shows the test images, the \(2^{nd}\) row the corresponding ground truth images and the \(3^{rd}\) row the segmented lesion regions (Color figure online)

Table 1. Comparison of Skin Lesion Segmentation on ISIC 2017.
Fig. 4. Graphical representation of the Jaccard Index on the overall test set.

5 Conclusion

In this work, we have proposed a deep architecture for skin lesion segmentation termed the Multi-scale Residual UNet. From the results in Fig. 3, it can be observed that the boundaries of the lesion regions and the background are well separated and differentiable. Furthermore, the proposed model uses only \(\approx \)16M parameters compared to other well-known deep architectures for various complex applications. To further improve performance, our future work shall explore visual saliency in conjunction with deep features, as well as post-processing methods based on Conditional Random Fields.