1 Introduction

Skin disease is one of the most common illnesses in daily life. It pervades all cultures, occurs at all ages, and affects between 30 % and 70 % of individuals [1]. Tens of millions of people are affected by it every day. Skin diseases fall into two broad groups, skin infections and skin neoplasms, within which thousands of skin conditions have been described [2]. Skin disease has a major adverse impact on quality of life, and many conditions are associated with significant psychosocial morbidity. However, only a small proportion of people can recognize these diseases without access to a field guide. Moreover, many over-the-counter (OTC) drugs are available to treat the skin diseases that occur frequently in daily life, so recognizing a skin disease correctly is very important for people who must choose among these medicines. For people who want to make a preliminary self-diagnosis, a visual recognition system is likely to be useful even if it is not perfect. For example, if an accurate skin disease classifier were developed, a user could submit a photo of a recent skin condition to query a diagnosis. Surprisingly, there is little research that uses computer vision techniques to recognize many common skin diseases from ordinary photographic images.

Although some related applications exist, the problem of recognizing skin diseases has not been fully solved by the computer vision community. In contrast to object or scene classification, where we can label a bird by its body and head or an outdoor scene by its sky region and house, a skin disease image has no distinctive spatial layout. For example, it is difficult to give an accurate description of scattered red eczema. Further challenges include low contrast between the lesion and the surrounding skin, irregular and fuzzy borders, and fragmentation or variegated coloring inside the lesion, all of which make skin diseases hard to recognize.

Most previous works on the recognition of skin disease are restricted to dermoscopic images [3, 4], which are acquired through a digital dermatoscope. A dermatoscope is a special device that dermatologists use to examine skin lesions; it acts as a filter and magnifier [5]. As a result, dermoscopic images have low levels of noise and consistent lighting. We show some examples of dermoscopic images in Fig. 1(a). Clinical skin disease images, on the other hand, are collected from a variety of sources, and most of them are acquired with digital cameras and cell phones. Examples are shown in Fig. 1(b). We have found some work based on clinical disease images [5, 6]. However, all of this work is built on small datasets that contain very few categories and are not publicly available. The absence of benchmark datasets is a barrier to a more dynamic development of this research area. Consequently, in this paper, we introduce a new, publicly available dataset for real-world skin disease image recognition. This dataset contains 6,584 images of 198 fine-grained skin disease categories.

Fig. 1. Examples of dermoscopic and clinical images. (a) Dermoscopic images are acquired through a digital dermatoscope and have relatively low levels of noise and consistent background illumination. (b) Clinical images are collected via various sources, most of which are captured with digital cameras and cell phones.

As is well known, image classification is one of the most fundamental problems of computer vision, and has been studied for many years. Large-scale annotated image datasets have been instrumental for driving progress in object recognition over the last decade. These datasets contain a wide variety of basic-level classes, such as different kinds of animals and inanimate objects. Significant progress has been made in the past few years in object classification as researchers compete on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

Compared to generic object classification, fine-grained visual categorization [7–11] aims to classify categories which belong to the same basic-level class. In recent years, fine-grained recognition has been demonstrated in many domains with corresponding datasets, including birds [12, 13], flowers [14, 15], leaves [16], dogs [17, 18], and cars [19]. A variety of methods have been developed for classifying fine-grained categories [7, 11, 20–23].

Skin disease image classification naturally belongs to the problem of fine-grained visual object classification. However, it has characteristics of its own that set it apart from existing fine-grained classification work, because it is a difficult problem that pushes the limits of the visual abilities of both humans and computers. Clinically, the diagnosis of any particular skin condition is made by gathering pertinent information regarding the presenting skin lesion(s), including the location, symptoms, duration, arrangement (solitary, generalized, annular, linear), morphology (macules, papules, vesicles), and color (red, blue, brown, black, white, yellow) [24]. In addition, the diagnosis of many conditions often requires more complicated information.

To validate the usefulness of our proposed dataset and to inspire the computer vision community to carry out more research in this field, we perform a series of basic experiments with both hand-crafted features and deep features to establish a baseline performance on the dataset. Deep learning has recently enabled robust and accurate feature learning, which in turn produces state-of-the-art performance on many computer vision tasks. In this work, we want to find out whether applying CNNs to skin disease classification provides advantages over hand-crafted features.

Our contributions are summarized as follows. First, we collect a novel, large-scale benchmark dataset for skin disease image recognition. Second, we evaluate the performance of skin disease classification using CNNs as well as hand-engineered features. Extensive experimental results show that an existing CNN model does not outperform manually crafted visual features. Nevertheless, we hope this benchmark can promote future research on skin disease classification with deep learning.

2 Related Work

Our work is closely related to image classification on both dermoscopic and clinical images, and convolutional neural networks.

2.1 Dermoscopic Image Recognition

Present works on skin disease image classification fall into two groups: dermoscopic image recognition and clinical image recognition. We first introduce the representative works on dermoscopic images.

Dermoscopy is a technique for visualizing lesions by directing light onto the skin, and dermoscopic images are the ones most widely used in computer-aided diagnosis. Because dermoscopic images are taken under bright illumination, they are clear enough for recognition. In addition, the viewpoint is essentially fixed and background clutter is very limited. These characteristics make dermoscopic images easier to process, and as a result there are far more computer vision studies based on dermoscopic images than on clinical images.

Some works have focused on developing different components of dermoscopic image recognition, including segmentation [25], detection [26], and classification [3, 27]. Gonzalez-Castro et al. [3] introduce a color texture descriptor and apply it to classify images of nevi into benign lesions and melanoma. Celebi et al. [27] present a methodological approach for the classification of dermoscopy images. The approach involves border detection, feature extraction, and SVM classification with model selection. Kasmi and Mokrani [28] introduce an algorithm that extracts the characteristics of ABCD (asymmetry, border irregularity, colour and dermoscopic structure) attributes to build a binary classifier, again distinguishing melanoma from benign nevus.

The popular datasets of dermoscopic images used in recent works are shown in Table 1. These studies have undoubtedly advanced the diagnosis of skin diseases. However, their applications are limited by the need for specialized medical equipment and expert knowledge. In contrast to the datasets mentioned above, in this paper we build a large-scale clinical image dataset to encourage further research that can be applied in real-life scenes.

Table 1. Statistics of recent datasets of dermoscopic images. The representative works employing these datasets are also listed.

2.2 Clinical Image Recognition

Some efforts have been made to classify clinical skin disease images [35–37]. Concretely, Glaister et al. [5] propose a segmentation algorithm based on texture distinctiveness (TD) to locate skin lesions in photographs. They introduce a joint statistical TD metric and a texture-based region classification algorithm, which captures the dissimilarity between learned representative texture distributions. Alcón et al. [38] describe an automatic system for inspecting pigmented skin lesions and discriminating between malignant and benign lesions. The system includes a dedicated image processing system for feature extraction and classification, and patient-related data decision support machinery for calculating a personal risk factor. It has been shown that their algorithm is capable of recreating controlled lighting conditions and correcting for uneven illumination.

Moreover, Razeghi et al. [6, 39, 40] rely heavily on a human in the loop and on high-level knowledge in their work. They combine human-provided information with a random forest or a Bayesian framework. This interaction information comes from questions designed in advance. For example, the answer to a binary question such as "Is the object red?" can be regarded as the presence of the tag Red, which can be used as a visual feature to improve the final classification result. Their system includes 10 questions and 37 possible binary answers/tags.

Previous works on clinical skin disease images are typically built on small datasets. To the best of our knowledge, the largest such dataset contains 2,309 images from only 44 different diseases, and it is not publicly available to the community.

2.3 Convolutional Neural Networks

In recent years, Convolutional Neural Networks (CNNs) have achieved great empirical successes in many computer vision tasks, such as image classification [41], object detection [42], scene recognition [43], and fine-grained classification [23, 44, 45]. It is now possible to train a very deep network [46] on large collections of images with the help of the increasing computational power of GPU.

Skin diseases share characteristics with the objects in fine-grained classification: lesion areas in skin disease images show large intra-class variation and small inter-class variation. CNNs are therefore expected to be useful for skin disease recognition as well. On the other hand, skin disease images differ from conventional fine-grained object images to some degree. For example, some current works in fine-grained classification use bounding boxes of the objects of interest to help recognition, whereas it is more difficult to label bounding boxes in skin disease images or to distinguish lesions from the background. Furthermore, objects usually have specific shapes and parts, which has led to a large number of part-based methods that train fine-grained part models in CNNs. Choosing parts of a lesion, however, is almost impossible in skin disease images.

Table 2. Statistics of the existing clinical skin disease datasets. In [6, 37], the authors add a question and answer bank to their datasets, which is used to provide human-computer interaction in their systems. Note that none of the current datasets is publicly available. For comparison, we also show information about our work in the last two columns, which expands both the dataset size and the number of categories.

3 Our Dataset

Several datasets have been used for skin disease studies [6, 37]. Concretely, Razeghi et al. [37] collect two subsets in their work, which contain 90 and 706 images from 3 and 7 different skin diseases, respectively. In another work by the same team [6], they acquire a new dataset containing 2,309 visually similar images of skin conditions from 44 different diseases. The authors state that the lesions in their dataset are manually segmented using a bounding box, and that the dataset has a question and answer bank to help classification. Unfortunately, neither of these datasets is publicly available.

In this work, we present a new clinical skin disease dataset, namely SD-198. To the best of our knowledge, it is the largest available skin disease image database, whether clinical or dermoscopic images are considered. The statistics of the existing skin disease datasets are shown in Table 2.

Our SD-198 dataset contains 198 different diseases, covering different types of eczema, acne, and various cancerous conditions. There are 6,584 images in total. We also select the classes with more than 20 image samples as a subset, namely SD-128. In general, overall classification improves when there are fewer categories and more samples per category, which is verified in our experiments. Examples of images in our dataset can be found in Fig. 2.

Fig. 2. Examples from our SD-198 dataset, each selected from a different class. In other words, no two of the images listed here have the same class label. However, these skin disease images are difficult to distinguish, because some of them have nearly identical color and shape. For example, the five images in the first column belong to different categories, yet finding the differences among them is challenging. (Color figure online)

3.1 Image Collection and Annotation

Collection. The images are downloaded from DermQuest, an online medical resource for dermatologists and healthcare professionals with an interest in dermatology. It contains an extensive collection of clinical images shared by the wider dermatology community. The images are submitted by patients or dermatologists.

The website contains 729 categories of skin lesions in total, which include all kinds of conditions that affect the integumentary system, i.e., the organ system that encloses the body and includes skin, hair, nails, and related muscle and glands [47]. We perform a statistical analysis of these skin lesion categories and remove the categories that rarely appear in real life or that have fewer than 10 samples.

We initially collected more than 10,000 clinical skin disease images. To keep the categories balanced, we remove some samples from the categories that have plenty of images, so that each category has at most 60 images. We then drop duplicate and low-quality images. Finally, we obtain a dataset containing 6,584 images from 198 different categories.
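
The filtering described above can be sketched as follows. This is only an illustration under stated assumptions: the text gives the thresholds (fewer than 10 samples, at most 60 per category) but not how the surplus images were chosen, so the random subsampling, the function name `balance_classes`, and the toy class names are hypothetical.

```python
import random
from collections import defaultdict


def balance_classes(samples, min_per_class=10, max_per_class=60, seed=0):
    """Drop rare classes and cap large ones.

    samples: list of (image_id, label) pairs.
    Classes with fewer than `min_per_class` images are removed. Classes with
    more than `max_per_class` images are randomly subsampled (random choice
    is an assumption; the paper does not say how images were removed).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image_id, label in samples:
        by_class[label].append(image_id)

    balanced = []
    for label in sorted(by_class):
        ids = by_class[label]
        if len(ids) < min_per_class:
            continue  # too few samples to keep this class
        if len(ids) > max_per_class:
            ids = rng.sample(ids, max_per_class)
        balanced.extend((image_id, label) for image_id in ids)
    return balanced


# Toy data with hypothetical class names and image ids.
toy = (
    [(f"a{i}", "eczema") for i in range(100)]
    + [(f"b{i}", "angioma") for i in range(15)]
    + [(f"c{i}", "rare_condition") for i in range(4)]
)
balanced = balance_classes(toy)
```

In this toy run the large class is capped at 60 images, the mid-sized class is kept whole, and the class with only 4 images is dropped. Duplicate and low-quality removal is not modeled here.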

Annotation. The ground truth annotations of the images in our collected dataset are obtained from DermQuest, since each image has been recognized by experts and labeled with the name of its class. Because the clinical case notes and diagnosis quizzes on the website are reviewed by an international editorial board comprised of renowned dermatologists, the labels obtained for our dataset are considered reliable. Despite that, in order to ensure the label quality of our dataset, we have invited two professionals to review our dataset.

Fig. 3. Statistics of the numbers of images for each class in our SD-128 and SD-198 datasets. Note that each category of SD-128 contains more than 20 samples, while SD-198 has some categories whose samples are between 10 and 20.

3.2 Properties of Dataset

Our dataset is not only larger than previous datasets; it also has several useful properties, which we introduce in this section.

Scale. This paper aims to provide a large-scale clinical skin disease benchmark dataset. To the best of our knowledge, it is about 3 to 10 times the reported size of previous datasets. It contains 198 categories, which cover most of the common skin diseases. We hope that a dataset with 6,584 well-labeled clinical skin disease images can promote vision research in this area.

Diversity. All images come from real scenes and vary in color, exposure, illumination, and level of detail. That is, the images may have been taken with any equipment configuration and in a variety of environments. We therefore hope that future work based on our benchmark dataset will be easier to apply in practice. The diversities mainly include:

Fig. 4. Species diversity and appearance diversity of our proposed dataset. The images in the first row of (a) belong to the same class, whereas those in the second row, which one might also take for a single class, do not. Moreover, the third row of (a) shows that different shooting distances and illumination have a big influence on the appearance of skin diseases. In (b) we show some examples with the ABCD criteria. Note that these diversities, as well as attribute diversity, contribute to making automatic recognition of clinical skin disease images a challenging task. (Color figure online)

(1) Species Diversity: Skin lesion images in our dataset contain: eczema, psoriasis, acne vulgaris, pruritus, alopecia areata, decubitus ulcer, urticaria, scabies, impetigo, abscess, bacterial skin diseases, viral warts, molluscum, melanoma and non-melanoma skin cancer, which have covered most of the common skin diseases. Figure 3 shows the statistics of the number of images in each class.

We also show some images in Fig. 4. For example, in Fig. 4(a), the first row is angioma, and the second row contains four kinds of diseases. In the third row, images of acne vulgaris and guttate psoriasis are in green and yellow boxes, respectively. Figure 4(b) also contains three kinds of diseases. Due to space limitation, we do not show all classes. However, one can already find the species diversity of clinical skin disease images in these figures.

(2) Appearance Diversity: In real life, clinicians and dermatologists determine whether a lesion is a melanoma using the ABCD criteria (asymmetry, border irregularity, colour, and diameter or differential structures). The criteria were proposed by Friedman et al. [48] and have been widely adopted in previous works, especially in dermoscopic image recognition.

The ABCD criteria take on different meanings in clinical skin disease images than in dermoscopic images. We summarize the conventional meanings of ABCD and refine them so that they apply to the clinical images in our dataset. In Fig. 4(b), we show some examples of clinical images based on the ABCD criteria. The images in the same row represent A (asymmetry), B (border irregularity), and C (multiple colors), respectively. D (diameter) is difficult to judge from images, but we can see from Fig. 4(b) that it varies greatly among different diseases.

From an ABCD perspective, the skin diseases in our dataset have different appearances, which include arrangement (solitary, generalized, annular, and linear), color (red, blue, brown, black, white, and yellow), border (well defined, poorly defined), and shape (circular, strip, and irregular). Most of these styles can be found in Fig. 4. Other arrangement styles are also included in our dataset. In particular, appearance diversity also exists within a single class; e.g., the images in the first row of Fig. 4(a) show different colors and shapes although they all come from the same category, angioma.

(3) Attribute Diversity: Images in our dataset cover many patient attributes, such as age (child, adult, old), sex, disease site (hand, feet, head, nails), skin color (white, yellow, brown, black), and lesion stage (early, middle, late). Our dataset also covers many environmental conditions, such as illumination and shooting distance. All these diversities make our benchmark more comprehensive and challenging.

Challenges and Limitations. Our dataset is a special image dataset that differs from object or scene datasets. A change in any single condition, e.g. illumination, focal distance, or point of view, can make classification considerably more difficult. For example, the images with yellow boxes in the third row of Fig. 4(a) are from the same category, guttate psoriasis. From left to right, the images differ in illumination and shooting distance, which leads to big differences among them. Furthermore, pathological changes in different periods and the different skin colors of the patients produce large intra-class variation. Some diseases have low color contrast between foreground (lesion) and background (healthy skin), which makes them hard to recognize.

Of course, our dataset also has disadvantages. Annotating the details of dermatosis requires more professional knowledge than annotating other object and scene datasets. Given the differences between clinical skin disease images and fine-grained object images, e.g. birds and dogs, it is difficult for us to label part annotations in skin disease images. In addition, as Fig. 3 shows, our dataset is imbalanced across categories. We tried to collect the same number of samples for each category, but some diseases rarely appear in real life.

4 Clinical Skin Disease Classification

To establish a baseline performance on our proposed dataset and to evaluate the performance of different features, we design experiments with two aims: (a) comparing the influence of different baseline features; (b) evaluating some existing methods designed for fine-grained classification. In all of the experiments, we randomly select half of the images from each class as the training set and use the rest as the testing set. We introduce our implementation details in the next paragraph. In addition, we present the color and texture features used for classification and analyze their influence.
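
A minimal sketch of such a per-class random split is given below. The function name `split_half_per_class`, the fixed seed, and the rule that the extra image of an odd-sized class goes to the testing set are assumptions made for illustration; the text specifies only that half of the images of each class are used for training.

```python
import random
from collections import defaultdict


def split_half_per_class(samples, seed=0):
    """Randomly split (image_id, label) pairs into train and test halves per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image_id, label in samples:
        by_class[label].append(image_id)

    train, test = [], []
    for label in sorted(by_class):
        ids = list(by_class[label])
        rng.shuffle(ids)
        half = len(ids) // 2  # odd classes: the extra image goes to test (assumption)
        train.extend((image_id, label) for image_id in ids[:half])
        test.extend((image_id, label) for image_id in ids[half:])
    return train, test


# Toy data with hypothetical class names and image ids.
toy = [(f"a{i}", "class_a") for i in range(10)] + [(f"b{i}", "class_b") for i in range(11)]
train, test = split_half_per_class(toy)
```

Splitting within each class keeps the class proportions of the training and testing sets close to those of the full dataset, which matters here because the categories are imbalanced.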

4.1 Hand-Crafted Features Based Classification

Implementation Details. We first investigate how well conventional computer vision methods recognize clinical skin disease images. We employ seven kinds of texture and color features and use LIBSVM, a popular library for support vector machines, to build baseline algorithms. We use these algorithms to measure the classification accuracy on our dataset. Then, we evaluate our dataset using some existing works with their hand-tuned features and off-the-shelf frameworks. SIFT and Color Names features are extracted following the routine of [49]. HOG and LBP features are obtained with VLFeat [50].

Baseline Approaches. Two representative works [49, 51] are included in this paper, and their performance on our proposed benchmark dataset is evaluated. Although these methods are designed to classify fine-grained object or natural scene images, they rely on texture and color cues, which are also informative for skin disease images.

In detail, Goering et al. [49] compute a global representation using the whole image. Feature types are the same as commonly used for fine-grained classification, i.e. bag-of-visual words with SIFT and Color Names, but with additional spatial pyramid pooling. Furthermore, they apply GrabCut segmentation to estimate the foreground. This algorithm performs iterative segmentation with a conditional Markov random field, where unary potentials are modeled with a Gaussian mixture model re-estimated in each iteration, and pairwise potentials are added to favor strong image edges. Lazebnik et al. [51] have presented a holistic approach for image categorization based on a modification of pyramid match kernels. They repeatedly subdivide an image and compute histograms of image features over the resulting subregions, showing promising results on scene databases.
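
The subdivide-and-histogram idea behind the spatial pyramid of [51] can be sketched as follows. This is an illustrative sketch rather than the implementation evaluated here: it assumes each local feature has already been quantized to a visual word index and has coordinates normalized to [0, 1), and it uses the level weights of the pyramid match kernel (1/2^L for level 0 and 1/2^(L-l+1) for level l ≥ 1). The function name `spatial_pyramid` and the toy points are hypothetical.

```python
def spatial_pyramid(points, vocab_size, levels=2):
    """Concatenate weighted visual-word histograms over a spatial pyramid.

    points: iterable of (x, y, word), with x, y in [0, 1) and word in
            range(vocab_size).
    Returns a list of length vocab_size * sum(4**l for l in range(levels + 1)).
    """
    feature = []
    for level in range(levels + 1):
        cells = 2 ** level  # the image is divided into cells x cells subregions
        if level == 0:
            weight = 1.0 / 2 ** levels
        else:
            weight = 1.0 / 2 ** (levels - level + 1)
        hist = [0.0] * (cells * cells * vocab_size)
        for x, y, word in points:
            col = min(int(x * cells), cells - 1)
            row = min(int(y * cells), cells - 1)
            hist[(row * cells + col) * vocab_size + word] += weight
        feature.extend(hist)
    return feature


# Three toy features with a two-word vocabulary.
toy_points = [(0.1, 0.1, 0), (0.9, 0.9, 1), (0.6, 0.2, 0)]
pyramid = spatial_pyramid(toy_points, vocab_size=2, levels=2)
```

Finer levels receive larger weights because a match inside a small cell is more spatially specific than a match in the whole image. In practice the resulting vector would be normalized and passed to an SVM with a histogram intersection kernel.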

Table 3. Classification performance with different hand-engineered features on both of our datasets, i.e. SD-198 and SD-128. Each of the first seven methods is built with a popular off-the-shelf feature, using SVM as its classifier. On the other hand, the last two methods are designed for similar vision tasks, i.e. fine-grained object classification and natural scene classification, respectively.

Results and Analysis. To establish a baseline performance on our dataset, we evaluate the features listed in Table 3. The experimental results show that texture features play a more important role than color features on this dataset. We find that the colors of foreground and background are nearly identical in some skin disease images. The lesions, on the other hand, often present different textures and shapes, such as annular, linear, concave, and convex shapes.

Furthermore, some skin disease categories share very similar shapes and differ only slightly in their color cues, e.g. neurofibroma and apocrine hydrocystoma. Considering these cases, the off-the-shelf tool [49] performs best in this configuration, although it is designed for fine-grained object recognition. Note that the influence of background clutter is significant in this method.

4.2 Deep Features Based Classification

Implementation Details. In our experiments, we extract deep convolutional features from a CNN model pre-trained on ImageNet. To match the 198 skin disease classes in our dataset, we replace the original 1000-way fc8 classification layer with a new 198-way fc8 layer, whose weights are randomly initialized from a Gaussian distribution. We set the fine-tuning learning rates as proposed for the CaffeNet CNN, and initialize the global rate to a tenth of the initial ImageNet learning rate. During training, we drop the learning rate by a factor of 10. Furthermore, we independently fine-tune the ImageNet pre-trained CNN for classification on ground truth crops of each region warped to the 227 × 227 network input size. At test time, we extract features from the test images using the network fine-tuned on the training set of our skin disease images. We also fine-tune a very deep CNN architecture, i.e. VGGNet [52] with 16 layers, to extract deep features.
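
The learning rate schedule described above can be illustrated with a simple step-decay function. The concrete numbers are assumptions: the base rate of 0.001 presumes an initial ImageNet rate of 0.01, and the step size of 10,000 iterations is hypothetical, since the text does not state when the rate is dropped.

```python
def step_lr(iteration, base_lr=0.001, gamma=0.1, step_size=10000):
    """Step decay: multiply the rate by `gamma` every `step_size` iterations.

    base_lr=0.001 assumes a tenth of an initial ImageNet rate of 0.01.
    step_size=10000 is a hypothetical value, not taken from the paper.
    """
    return base_lr * gamma ** (iteration // step_size)


# Learning rate at a few training iterations.
schedule = [step_lr(i) for i in (0, 9999, 10000, 20000)]
```

Starting from a small rate keeps the pre-trained weights close to their ImageNet values, while the newly initialized 198-way layer still has to be learned from scratch.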

Results and Analysis. We fine-tune the pre-trained CNN model and compare it with the original CaffeNet, using an SVM as the classifier in both cases. We extract deep features from the last layer of CaffeNet and obtain a 4096-dimensional feature representation. For both of our skin disease datasets, i.e. SD-198 and SD-128, half of the images of each class are used for fine-tuning the model. From Table 4, we can conclude that the fine-tuned VGGNet improves significantly, which mainly benefits from our larger-scale, well-labeled dataset.

Table 4. The average classification accuracy with different models of convolutional neural networks. "ft" indicates that the corresponding model is fine-tuned with our training samples.

Fig. 5. Accuracy for each class with different models. (a) The performance of CaffeNet on SD-198. (b) The performance of VGGNet on SD-198. (c) The performance of CaffeNet on SD-128. (d) The performance of VGGNet on SD-128. In each figure, the secondary Y-axis (right) represents the number of testing images.

To further analyze their performance, we also calculate the per-class accuracy of the CNN models. Figure 5(a, b) shows the classification results on SD-198, and Fig. 5(c, d) shows the results on SD-128. The accuracies fluctuate more as the number of images per class decreases. The skin diseases in these classes have a relatively low morbidity in daily life. Furthermore, examining these classes, which include stomatitis, histiocytosis-X, lymphangioma-circumscriptum, pomade-acne, etc., we find that they share a common point: the corresponding images usually carry strong landmarks of the lesions. For example, the diseased region is a salient area. On the other hand, we find classes whose accuracy is close to 0; these are hard to distinguish even for professional doctors. For these classes, we may need more labeled data to provide insight into their characteristics.
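
Per-class accuracy of the kind plotted in Fig. 5 can be computed as in the sketch below. The function names and the toy labels are hypothetical; the sketch only shows how a class with few test images can swing between very high and very low accuracy.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Return {class: fraction of its test images that were predicted correctly}."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: correct[cls] / total[cls] for cls in total}


def overall_accuracy(y_true, y_pred):
    """Return the fraction of all test images that were predicted correctly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


# Toy labels: class "c" has a single test image, so its accuracy is 0 or 1.
y_true = ["a", "a", "b", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "a", "a"]
acc = per_class_accuracy(y_true, y_pred)
overall = overall_accuracy(y_true, y_pred)
```

With only a handful of test images per class, one misclassified image changes the class accuracy by a large step, which is consistent with the larger fluctuation observed for small classes.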

5 Discussion

We have shown the performance of traditional features that are commonly used in computer vision tasks. We have also run experiments with deep visual features on our skin disease benchmark dataset. The accuracies of these features are shown in Tables 3 and 4, respectively. In this section, we compare the best performance of hand-crafted features with that of deep visual features.

Fig. 6. Examples of classification results on our proposed benchmark dataset. (a) Images are correctly classified by [49] and wrongly classified by VGGNet. (b) Images are correctly classified by the deep network and wrongly classified by [49]. (Color figure online)

For SD-198, the best classification result is 52.19 %, which is obtained by combining the SIFT and Color Names features. The accuracy of a pre-trained and fine-tuned VGGNet is 50.27 %. Interestingly, hand-crafted features perform better than deep visual features for skin disease classification.

To investigate the reason, in Fig. 6(a) we show some representative images that are correctly classified by [49] and wrongly classified by VGGNet. We show images in the opposite situation in Fig. 6(b). Useful observations can be drawn from these images. First, images in (a) always have a cleaner background than the images in (b); second, the appearance of the lesions in (a) is much simpler than in (b). Since [49] applies a segmentation procedure with GrabCut to estimate the foreground, it is reasonable that this algorithm outperforms CNNs on the images in (a). For example, consider the images in the first row of both (a) and (b), which correspond to skin diseases such as dermatofibroma, basal cell carcinoma, angioma, seborrheic keratosis, and blue nevus. Compared to the images in (a), the lesions in (b) are surrounded by more hair, which weakens the segmentation employed in [49]. Moreover, CNNs have shown advantages in finding structure and semantic information. Images in (b) include more cues about the location of the lesion, e.g. mouth, foot, eye, hand, etc., and are therefore handled better by the powerful VGGNet.

6 Conclusion

In this paper, we pose the challenging problem of automatic visual classification of clinical skin disease images. The absence of benchmark datasets is a barrier to a more dynamic development of this research area. We build a new and challenging clinical skin disease image dataset, which includes 6,584 real-world images from 198 categories. Each sample in our benchmark is well labeled. We intend to release the dataset to the community to promote related research. Furthermore, we evaluate the performance of different features to establish a baseline performance on our dataset.