This study was approved by the Ethics Committee of the Hospital of Skin Diseases, Chinese Academy of Medical Sciences (IRB number: 2019-KY-005). Because all patient records were anonymised and identifying features were removed during image preprocessing, no personal data were collected from existing patient records and no written informed consent was required. Patients were informed that, after privacy removal and approval by the ethics committee, the images might be used for scientific research and publication.
The design and workflow of the study is shown schematically in Fig. 1.
Clinical image data collection and preprocessing
Clinical image data collection
For this study, we collected 5871 clinical photographs of 1957 patients who visited the Hospital of Skin Diseases, Chinese Academy of Medical Sciences between 2004 and 2016. The basic parameters of the data were as follows: (1) the photographs were taken with two single-lens reflex cameras (FinePix S9500, Fujifilm, Tokyo, Japan; and EOS 800D, Canon Inc., Tokyo, Japan), and the image resolutions ranged from 2 million to 20 million pixels; (2) the clinical photographs contained all the lesion areas. All images used in this study were processed for privacy masking.
Preprocessing of the clinical image data
Facial photographs of each acne patient were preprocessed using the following procedure. First, 68 facial landmarks were detected on each photograph. Based on these landmarks, each facial image was rotated so that the two eyes lay on a horizontal line, and the image was then resized so that the distance between the two pupils was 800 pixels; the purpose of this step was to keep every face horizontal and scaled to a common standard. Next, to avoid interference from the eyes, nose and mouth, each face was divided into regions, as shown in Fig. 1a, and the four regions of the same patient (i.e. the forehead, the lower jaw, and the left and right sides of the face) were combined to form a complete facial region, so that the entire face of the patient was fully visible in a single two-dimensional image. Finally, because the training data were limited and imbalanced, we used the ImageDataGenerator class of the neural network library Keras as an image augmentation technique to increase the size of the training set. It should be noted that all of the above steps ran automatically, without human involvement.
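The rotation-and-scaling step can be sketched as follows. The helper name and the pupil-coordinate inputs are illustrative assumptions; the paper does not publish its implementation.

```python
import math

def alignment_params(left_pupil, right_pupil, target_distance=800.0):
    """Illustrative helper: given the two pupil centres (x, y), taken from the
    68 facial landmarks, return the angle (degrees) of the eye line relative
    to horizontal, and the scale factor that makes the interpupillary
    distance equal to target_distance pixels."""
    dx = right_pupil[0] - left_pupil[0]
    dy = right_pupil[1] - left_pupil[1]
    angle = math.degrees(math.atan2(dy, dx))  # rotating by -angle levels the eyes
    scale = target_distance / math.hypot(dx, dy)
    return angle, scale
```

Rotating the image by the negative of the returned angle and resizing by the returned factor yields the standardised face described above.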
Rating of acne severity on clinical images
The processed clinical images were independently graded by two experienced dermatologists, and a third, more experienced dermatologist was consulted in cases of disagreement. The three dermatologists were blinded to, and had no access to, the deep-learning predictions, as shown in Fig. 1b. According to the Chinese guidelines for the management of acne vulgaris, the images were classified into four grades based on the type of clinical features and the corresponding treatment strategy, as shown in Table 1. The three dermatologists entered their consensus grades directly into a dataset that we named the Acne Dataset. The Acne Dataset was divided into two separate directories, with 80% of the images used for training and 20% for validation.
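The 80/20 division can be sketched as below; the function name is hypothetical, and the paper does not state whether the split was random or stratified, so a seeded random shuffle is assumed here.

```python
import random

def split_dataset(image_paths, train_fraction=0.8, seed=42):
    """Shuffle the image paths with a fixed seed (assumed; the paper does not
    specify the split procedure) and return (training, validation) lists."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    cut = int(len(paths) * train_fraction)
    return paths[:cut], paths[cut:]
```

The two returned lists correspond to the two directories described above.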
The model was trained using the Inception-v3 network (a deep learning-based classification model), as shown in Fig. 1c. When training the network, we first initialised it with parameters pretrained on the ImageNet Dataset (1.28 million images across 1000 object classes), because such pretrained networks preserve the shallow features of images, which helps improve classification accuracy. Then, following the transfer learning approach, we used the training set of the Acne Dataset to train the network to learn the high-level semantic features of the images. In the experiments, we trained with a learning rate of 0.001, the cross-entropy loss function and the Adam optimizer. The validation set of the Acne Dataset was later used to provide an unbiased evaluation of the model fitted on the training set while tuning the model hyperparameters for better performance. The model was trained and validated on a server (Intel® Xeon® Processor E5, 2.10 GHz; 32 GB RAM; GTX 1080 GPU; Intel Corp., Santa Clara, CA, USA).
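As a concrete illustration of the two optimisation components named above, the following is a minimal numpy sketch of the cross-entropy loss and a single Adam update step with the stated learning rate of 0.001; this is didactic and is not the authors' Keras/Inception-v3 code.

```python
import numpy as np

def cross_entropy(probs, label):
    """Cross-entropy loss for one sample: the negative log-likelihood of the
    true class under the predicted softmax probabilities."""
    return -np.log(probs[label])

def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (default moment decay rates assumed). Returns the new
    parameter value and the updated first and second moment estimates."""
    m = b1 * m + (1 - b1) * grad          # biased first moment
    v = b2 * v + (1 - b2) * grad ** 2     # biased second moment
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

In practice Keras applies these per-parameter updates automatically when the model is compiled with the Adam optimizer and a cross-entropy loss.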
Testing the model
Using the same preprocessing method as for the Acne Dataset, a test set of 40 preprocessed images was obtained, as shown in Fig. 1c. Three attending dermatologists and three dermatology residents, none of whom had participated in the labelling, were invited to classify the photographs of the test set, and each completed the evaluation independently. The grades given by the three attending dermatologists were combined by majority vote: a grade on which at least two of the three attending dermatologists agreed was taken as the evaluation result at the attending-dermatologist level. The same method was used to obtain the evaluation result at the dermatology-resident level.
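The majority-vote rule used for each rater group can be sketched as follows; the handling of three-way disagreement is not described in the text, so this sketch returns None in that case.

```python
from collections import Counter

def majority_grade(grades):
    """Return the acne grade that at least two of the three raters agree on.
    If all three raters disagree, return None (the paper does not specify a
    tie-breaking rule, so this behaviour is an assumption)."""
    grade, count = Counter(grades).most_common(1)[0]
    return grade if count >= 2 else None
```

Applying this to the three attending dermatologists' grades per image yields the attending-level result; the same call on the residents' grades yields the resident-level result.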
Statistical indicators for evaluating the model classification performance
This study used the F1 score as the classification evaluation indicator. The F1 score is the harmonic mean of precision and recall: when both precision and recall are high, the F1 score is correspondingly high. The F1 score reaches its optimum value of 1 (perfect precision and recall) and its worst value of 0.
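The harmonic-mean definition above corresponds to the following:

```python
def f1_score(precision, recall):
    """F1 score: the harmonic mean of precision and recall.
    Defined as 0 when both precision and recall are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the harmonic mean is dominated by the smaller of the two quantities, a high F1 requires both precision and recall to be high, as stated above.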
This study also used Kendall's coefficient of concordance (Kendall's W) and its significance test to evaluate the consistency of the evaluation results among the three attending dermatologists; the same method was applied to the evaluation results among the dermatology residents. The linearly weighted kappa coefficient and its test were used to evaluate the consistency between the model-based and the dermatologist-based evaluation results. A value > 0.75 indicates high consistency, a value between 0.65 and 0.75 indicates moderate consistency, and a value < 0.65 indicates low consistency. These tests were carried out in SPSS version 24.0 (IBM Corp., Armonk, NY, USA).
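For reference, minimal numpy versions of the two agreement statistics are sketched below. These assume untied ranks for Kendall's W and integer grades in {0, ..., k-1} for the kappa; the SPSS implementation used in the study additionally applies a tie correction and significance tests, which are omitted here.

```python
import numpy as np

def kendalls_w(rankings):
    """Kendall's W for an (m raters x n subjects) array of ranks, assuming no
    ties: W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared
    deviations of the per-subject rank sums from their mean."""
    r = np.asarray(rankings, dtype=float)
    m, n = r.shape
    col_sums = r.sum(axis=0)
    s = ((col_sums - col_sums.mean()) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

def linear_weighted_kappa(a, b, k):
    """Linearly weighted kappa between two raters' grades in {0, ..., k-1},
    with disagreement weights w[i][j] = |i - j| / (k - 1)."""
    a, b = np.asarray(a), np.asarray(b)
    observed = np.zeros((k, k))
    for i, j in zip(a, b):
        observed[i, j] += 1
    observed /= len(a)
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    w = np.abs(np.subtract.outer(np.arange(k), np.arange(k))) / (k - 1)
    return 1 - (w * observed).sum() / (w * expected).sum()
```

Both statistics equal 1 under perfect agreement, matching the interpretation thresholds given above.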