Key Summary Points

To reduce misdiagnosis and missed diagnosis, an effective and accurate method for melasma diagnosis is necessary.

On the basis of deep learning, we developed an intelligent diagnostic model for melasma.

Our model was trained with a large sample of melasma and non-melasma facial images and achieved high accuracy and area under the curve.

In further experiments, we found that multichannel image input obtained by fusing multiple modes of images in VISIA increased our network performance.

More data from multiple centers and improved applicability are needed before the model can become a valuable tool in clinical practice.


Melasma is a common acquired pigmentation disorder characterized by symmetrical brown macules and patches with irregular borders on the face, and it has a negative effect on patients' appearance and self-esteem [1,2,3]. Its pathophysiology is complex and not fully understood, but it is believed to involve genetic and environmental factors [4]. Melasma mainly affects women and people with more heavily pigmented phenotypes. Its prevalence is higher in East Asians, Indians, Latin Americans, and Hispanics [5,6,7]. However, specific epidemiological data remain unclear.

The diagnosis of melasma usually depends on physicians' naked-eye assessment of the clinical characteristics of lesions. However, for pigmented skin lesions, the diagnostic ability of non-dermatologists is not comparable to that of dermatologists [8, 9]. Melasma and other atypical hyperpigmentation disorders, such as nevus of Ota, are often missed or misdiagnosed [10]. Thus, correct diagnosis of melasma with the naked eye alone requires a certain level of clinical experience, especially in complicated facial conditions. In addition, auxiliary diagnostic tools, such as Wood's lamp and dermoscopy, are time consuming and not well suited to accurately distinguishing melasma from other pigmentation disorders [10]. Their findings must also be interpreted by physicians, which can be a subjective process.

Misdiagnosis and missed diagnosis of melasma can harm patients, because treatments required for other skin diseases, such as CO2 laser, are not acceptable for melasma [11,12,13]. Treatment for melasma should be selected cautiously because of its high recurrence rate [14]. Moreover, improper treatment following misjudgment of melasma might result in serious sequelae, such as pigmentation and scarring after CO2 laser [11,12,13]. In addition, the remoteness of certain regions and a lack of knowledge lead some patients with melasma to seek help from beauty salons and estheticians instead of dermatologists. However, lacking professional expertise and accurate diagnostic tools, such non-professionals often cannot make a correct diagnosis and may choose the wrong treatment. Thus, it is necessary to develop an accurate and rapid diagnostic method for melasma.

The purpose of this study was to develop and validate an intelligent diagnostic system for melasma images on the basis of deep learning and to provide a reference for accurate and rapid diagnosis of melasma. We collected a large number of clinical melasma images and evaluated the performance of four deep learning models as binary classifiers of melasma versus non-melasma. We further conducted image fusion via multichannel image input and found an improvement in network performance.


This study was approved by the Ethics Committee of the First Affiliated Hospital of Chongqing Medical University (no. 2022-K349) and was performed in accordance with the Declaration of Helsinki of 1964. No exclusion criteria were applied: we retrospectively collected images of all patients with melasma who visited the Dermatology Clinic of Chongqing Medical University between January 2017 and September 2021. All images were stored in a VISIA imaging system (Canfield Scientific, NJ, USA). A similar number of images from patients without melasma were randomly selected from the VISIA system. The problem we intended to solve was to judge the presence or absence of melasma on the basis of a facial image, which requires the binary classifier to work in a variety of situations. Therefore, the “non-melasma” images in this study were obtained from patients with non-pigmentary diseases (such as rosacea and acne), patients with pigmentary diseases other than melasma (such as freckles, lentigines, and nevus of Ota), and healthy people.

The VISIA system comprised an imaging chamber with a 15-megapixel camera and captured images at a resolution of 3128 × 4171 pixels. Using three types of light sources, i.e., standard incandescent light, ultraviolet (UV) light, and polarized light, it obtained five images of different modes for each shot. The “NORMAL” mode was taken under standard incandescent light and used to identify spots, wrinkles, texture, and pores. The “UV SPOTS” and “PORPHYRINS” modes were taken under UV light to detect UV spots and porphyrins, respectively. The “BROWN SPOTS” and “RED AREAS” modes were taken under polarized light to observe brown spots and prominent blood vessels, respectively [15]. Together, the five modes, i.e., “NORMAL”, “UV SPOTS”, “PORPHYRINS”, “BROWN SPOTS”, and “RED AREAS”, showed different skin characteristics (Supplementary Material Fig. S1).

A total of 4005 melasma and 4005 non-melasma images were collected. Detailed clinical data of patients were not collected because of confidentiality requirements and inapplicability. The diagnosis of each patient was reached through discussion among three experienced dermatologists on the basis of the images and was regarded as the ground truth in this study. We randomly divided the images into training and test sets at the patient level, with a ratio of approximately 2:1, to ensure that images of the same patient did not appear in both sets (Fig. 1). To achieve a balanced distribution, the numbers of melasma and non-melasma images were kept approximately equal. Thus, the training set contained 2650 melasma and 2670 non-melasma images, while the test set contained 1355 melasma and 1335 non-melasma images. The training images were augmented by rotation, random erasing, and gray level adjustment. All images were then resized to 480 × 640 pixels.
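A patient-level split of this kind can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `split_by_patient`, the record layout, and the patient identifiers are hypothetical, and the split is made over patients, so the image ratio is only approximately 2:1.

```python
import random
from collections import defaultdict


def split_by_patient(records, test_fraction=1 / 3, seed=0):
    """Split (patient_id, image_path) records so that no patient
    contributes images to both the training and the test set."""
    by_patient = defaultdict(list)
    for patient_id, image_path in records:
        by_patient[patient_id].append(image_path)

    # Shuffle patients (not images) reproducibly, then cut the list.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)
    n_test = round(len(patient_ids) * test_fraction)

    test = [(p, img) for p in patient_ids[:n_test] for img in by_patient[p]]
    train = [(p, img) for p in patient_ids[n_test:] for img in by_patient[p]]
    return train, test


# Hypothetical records: 30 patients with one to three images each.
records = [
    (f"patient_{i:02d}", f"img_{i:02d}_{k}.jpg")
    for i in range(30)
    for k in range(1 + i % 3)
]
train_set, test_set = split_by_patient(records)
```

Splitting over patients rather than images is what prevents near-duplicate images of the same face from leaking between the two sets.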

Fig. 1 Flowchart of data collation and preprocessing

Our task was to build a binary classifier with “melasma” as the positive class and “non-melasma” as the negative class. Because each shot yielded five images of different modes, both single-mode and multimode binary classifiers were studied. The images of the “NORMAL” mode were the same as those seen by clinicians with the naked eye; therefore, we used this mode to explore a network for direct and rapid diagnosis. Four deep learning models, i.e., MobileNetv2, Swin Transformer, ResNet50, and DenseNet121, were used to build binary classifiers of melasma and non-melasma. For each model, the record with the best performance was saved. To visualize the features selected by the network, we used Gradient-weighted Class Activation Mapping (Grad-CAM), which demonstrates the interpretability of the optimal network via gradient-based localization. Subsequently, we investigated the differences in network performance for the four other image modes: “UV SPOTS,” “PORPHYRINS,” “BROWN SPOTS,” and “RED AREAS.” Next, we studied multimode images to examine whether they could further improve the performance of our diagnostic system. We fused different modes of the same shot through a multichannel input and fed the integrated multimode features to the network. A flowchart of the network for multimode images is shown in Fig. 2. For data analysis, the performance of all models on the test set was evaluated in terms of accuracy, area under the curve (AUC), sensitivity, and specificity. All analyses were conducted using Python 3.7.3. All patient images shown in the figures were anonymized by manually covering the eyes for privacy purposes.
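For reference, accuracy, sensitivity, and specificity follow directly from the counts of a binary confusion matrix. The sketch below uses a hypothetical helper `binary_metrics` and made-up labels purely for illustration; it does not reproduce any result of this study.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for labels coded as
    1 (melasma, positive class) and 0 (non-melasma, negative class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of melasma images detected
        "specificity": tn / (tn + fp),  # share of non-melasma images rejected
    }


# Made-up example: 4 melasma and 6 non-melasma images.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
```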

Fig. 2 Visual representation of the network with multimode input. The eyes are covered manually for privacy purposes
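One plausible way to realize a multichannel input is to stack the images of the selected modes along the channel axis, as in the sketch below. This is an assumption for illustration: the text does not specify the channel layout or preprocessing, the helper `fuse_modes` is hypothetical, and random arrays stand in for real VISIA images.

```python
import numpy as np

MODES = ["NORMAL", "UV SPOTS", "PORPHYRINS", "BROWN SPOTS", "RED AREAS"]


def fuse_modes(images, modes):
    """Concatenate the H x W x 3 images of the selected modes along the
    channel axis, giving one H x W x (3 * len(modes)) array."""
    return np.concatenate([images[m] for m in modes], axis=-1)


# Random placeholders for one shot; 640 x 480 is an assumed orientation
# of the 480 x 640 resized images.
rng = np.random.default_rng(0)
height, width = 640, 480
shot = {m: rng.random((height, width, 3), dtype=np.float32) for m in MODES}

fused = fuse_modes(shot, ["NORMAL", "BROWN SPOTS", "UV SPOTS"])
print(fused.shape)
```

Under this layout, the first convolutional layer of the network has to accept 3 × k input channels for k fused modes instead of the usual 3.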


Performance of Four Models

We examined the receiver operating characteristic (ROC) curves of the four models trained in this study; the results on the test set are shown in Fig. 3. In terms of AUC, an important performance index for binary classifiers that reflects how well the predicted scores separate positive from negative cases, the DenseNet121 model outperformed the others with a value of 97.87%. In addition, confusion matrices were used to visualize the performance of the four models (Supplementary Material Fig. S2). The ResNet50 model achieved the highest sensitivity, 97.14%, but a relatively poor specificity of 88.76%, resulting in an accuracy of 91.45% on the test set. In contrast, the DenseNet121 model performed well in identifying both negative (95.88% specificity) and positive (94.29% sensitivity) samples. Comparing the four deep learning models, we found that the network based on DenseNet121 achieved the highest accuracy, 93.68% (Table 1). Therefore, among these four models, DenseNet121 was regarded as the optimal model for melasma diagnosis on the basis of clinical images.
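As a reminder of what the AUC measures, it equals the probability that a randomly chosen positive image receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name `roc_auc` and the scores are made up for illustration and are unrelated to the reported results.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up predicted probabilities of melasma for six images.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(y_true, scores)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; library implementations typically use a sort-based computation.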

Fig. 3 Receiver operating characteristic (ROC) curves of the four deep learning models on the test set. AUC, area under the curve

Table 1 Performance of four deep learning models on test set

Interpretability of Optimal Model

Subsequently, Grad-CAM was used to examine the interpretability of the model. In the Grad-CAM results presented in Fig. 4, the red regions represent areas activated by the network, whereas the blue regions represent areas that were not activated. Activation was focused on the melasma lesions, which mainly appeared in the cheek and malar areas. Notably, in images of patients with melasma coexisting with other facial skin conditions (such as seborrheic keratosis and post-acne hyperpigmentation), the network focused more on the melasma lesions than on the other skin disorders.
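For readers unfamiliar with the method, the standard Grad-CAM heat map can be written as below. The notation is introduced here for illustration: $A^{k}$ is the $k$-th feature map of the chosen convolutional layer, $y^{c}$ is the network score for class $c$ (here, melasma), and $Z$ is the number of spatial positions in a feature map. Which layer was used in this study is not stated in the text.

```latex
\alpha_k^{c} = \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial y^{c}}{\partial A_{ij}^{k}},
\qquad
L_{\text{Grad-CAM}}^{c} = \mathrm{ReLU}\!\left(\sum_{k}\alpha_k^{c}\,A^{k}\right)
```

The weights $\alpha_k^{c}$ average the gradients over each feature map, and the ReLU keeps only regions that contribute positively to the class score, which correspond to the red regions in the heat maps.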

Fig. 4 Visual explanations of melasma cases via gradient-based localization: (A) a patient with a facial dark brown melasma lesion; (B) a patient with a facial hazel melasma lesion; (C) a patient with facial melasma and seborrheic keratosis; (D) a patient with facial melasma and post-acne hyperpigmentation. The color of each pixel, from dark blue to red, indicates importance ranging from lowest to highest. Eyes are covered manually for privacy purposes

Network Performance Using Multimode Input

Since different modes in the VISIA system showed different skin characteristics, we investigated the performance of the network on each image mode. Among the five modes, “BROWN SPOTS” had the best performance, with an accuracy of 94.42% and an AUC of 98.57% (Table 2 and Fig. 5). The accuracy and AUC of “UV SPOTS” were 93.49% and 97.55%, respectively, which were similar to those of the “NORMAL” mode. The “PORPHYRINS” and “RED AREAS” modes performed worse, with accuracy values of 88.29% and 82.34%, respectively. The confusion matrices are shown in Supplementary Material Fig. S3.

Table 2 Network performance on single mode
Fig. 5 Receiver operating characteristic (ROC) curves for network performance on single-mode input. AUC, area under the curve

Having ranked the five modes by their single-mode performance, we further explored the performance of the network on multimode input. On the basis of the results for each mode, we determined the combinations of multimode input as follows: “NORMAL + BROWN SPOTS,” “NORMAL + BROWN SPOTS + UV SPOTS,” “NORMAL + BROWN SPOTS + UV SPOTS + PORPHYRINS,” and “NORMAL + BROWN SPOTS + UV SPOTS + PORPHYRINS + RED AREAS.” Among these combinations, “NORMAL + BROWN SPOTS + UV SPOTS” achieved the highest accuracy, 97.4% (Table 3). Its AUC was 99.28%, which was slightly higher than those of the others (Fig. 6). Supplementary Material Fig. S4 shows the confusion matrices for these multimode combinations.
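The nested combinations above are consistent with adding modes in descending order of single-mode accuracy, as the sketch below illustrates. Two assumptions are made: the DenseNet121 accuracy on “NORMAL” images (93.68%) is taken as the “NORMAL” entry, since Table 2 is not reproduced here, and each mode is assumed to contribute three color channels.

```python
# Single-mode accuracies (%) as reported in the text; the NORMAL value
# is assumed to be the DenseNet121 result on NORMAL images.
single_mode_accuracy = {
    "NORMAL": 93.68,
    "UV SPOTS": 93.49,
    "PORPHYRINS": 88.29,
    "BROWN SPOTS": 94.42,
    "RED AREAS": 82.34,
}

# Rank modes from best to worst single-mode accuracy.
ranked = sorted(single_mode_accuracy, key=single_mode_accuracy.get, reverse=True)

# Nested combinations: top 2, top 3, top 4, and all 5 modes.
combinations = [ranked[:k] for k in range(2, len(ranked) + 1)]

for combo in combinations:
    # Assumes 3 color channels per mode in the fused input.
    print(" + ".join(combo), "->", 3 * len(combo), "input channels")
```

The order of modes within a combination differs from the labels used in the text (for example, “BROWN SPOTS” precedes “NORMAL” here), but each set contains the same modes.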

Table 3 Network performance on multimode input
Fig. 6 Receiver operating characteristic (ROC) curves for network performance on multimode input. AUC, area under the curve


In recent years, deep learning has gained widespread attention for medical diagnosis, grading, and efficacy evaluation. In particular, deep learning shows superior performance in image classification and recognition tasks and has been applied in the field of dermatology. A previous study reported an artificial intelligence-assisted decision-making system for skin tumors with a recognition rate of 91.2% for benign and malignant skin tumors [16]. Lim et al. used a convolutional neural network to grade acne severity from facial images of patients and obtained a best classification accuracy of 67% [17]. Additionally, in rosacea, psoriasis, eczema, and atopic dermatitis, deep learning has shown excellent diagnostic or classification capabilities [18,19,20,21].

To date, there have been only a few reports on the application of deep learning to melasma. One study used a voting-based probabilistic linear discriminant analysis to classify non-tumorous skin pigmentation diseases, including melasma, with an accuracy of 67.7% for melasma [22]. Another study presented a spatial compounding-based denoising convolutional neural network for quantifying and evaluating melanin in optical coherence tomography images of melasma [23]. However, there is still a lack of research on large training datasets and high-accuracy diagnostic systems for melasma facial images. In this study, a large number of images were used to train deep learning models, and we developed an accurate diagnostic system on the basis of DenseNet121 for clinical melasma facial images. In a further experiment with multimode image input, we fused different modes of images in the VISIA system and fed them to the network to simulate how a dermatologist uses multiple image modes to diagnose melasma. Finally, with the “NORMAL + BROWN SPOTS + UV SPOTS” combination, we achieved a high accuracy of 97.4% and an AUC of 99.28%.

In this study, we chose four deep learning models for comparison, including three traditional convolutional neural networks (i.e., MobileNetv2, ResNet50, and DenseNet121) and a novel network (i.e., Swin Transformer). In previous studies, MobileNetv2 was able to work on lightweight computing devices and had high accuracy in the classification of skin disease images [24]; ResNet50 showed superior performance for segmentation and classification in the diagnosis of multiple skin lesions [25]; DenseNet121 was also used to segment skin and lesions [26]; and Swin Transformer was a novel fine-grained recognition framework that showed more powerful and robust features in medical image segmentation [27]. Therefore, we selected these mainstream, high-performance deep learning networks to explore their performance in the melasma diagnosis task. Our results indicated that, among MobileNetv2, ResNet50, and DenseNet121, DenseNet121 performed slightly better than ResNet50, and MobileNetv2 performed worst. In recent studies, DenseNet and ResNet have been compared for the identification of ductal carcinoma in situ and microinvasion of the breast using ultrasound images [28], recognition of digital dental X-ray images [29], and classification of glaucomatous fundus images [30]. For each of these applications, DenseNet has been reported to perform better than ResNet. ResNet bypasses the signal from one layer to the next through identity connections and combines features by summing them before passing them into a layer. In contrast, to ensure maximum information flow, each layer of DenseNet takes additional input from all preceding layers, passes its own feature maps to all subsequent layers, and combines features by concatenating them. These properties of DenseNet appear to be useful for our clinical images of melasma and non-melasma.
Similarly, in another set of images of pigmented facial skin lesions, DenseNet also performed better than ResNet, which is consistent with our results [31]. Swin Transformer, on the other hand, was proposed in 2021 as a novel model for computer vision that constructs hierarchical feature maps and has computational complexity linear in image size. At present, only a few studies have reported the application of Swin Transformer to medical images [32, 33]. To the best of our knowledge, this is the first study to apply Swin Transformer to clinical dermatology images. Although Swin Transformer was not the best model for our task, its accuracy reached 88.48%. The medical application of Swin Transformer is worthy of further study.
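The difference between the ResNet and DenseNet connectivity patterns discussed above can be stated compactly. In the notation below, which is introduced here for illustration, $x_\ell$ is the output of layer $\ell$, $H_\ell$ is the nonlinear transformation applied by that layer, and $[\,\cdot\,]$ denotes concatenation of feature maps along the channel dimension.

```latex
\text{ResNet:}\quad x_\ell = H_\ell(x_{\ell-1}) + x_{\ell-1}
\qquad\qquad
\text{DenseNet:}\quad x_\ell = H_\ell\big([x_0, x_1, \ldots, x_{\ell-1}]\big)
```

In the residual form, earlier features reach later layers only through summation, whereas in the dense form every earlier feature map remains individually available to each subsequent layer.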

In clinical practice, dermatologists usually combine patients' clinical information, such as age, medical history, dermoscopy results, and multimode images from the VISIA system, to make the final diagnosis. This increases the likelihood that dermatologists make the correct decision, and it is worth investigating whether the same applies to artificial intelligence. Previous studies have shown that multiple types of information can improve the diagnostic ability of deep learning models. In a study by Tschandl et al., both dermoscopic images and clinical close-ups were used to train the network, and the combination achieved a better result than the individual modalities [34]. Jin et al. found that multiple extracted histological features, including nuclei, mitosis, epithelial, and tubular cells, could further improve the detection of lymph node metastasis in patients with breast cancer [35]. This study used a multichannel image input method to fuse multiple image modes from the VISIA system and found that it could improve the accuracy of our network to some extent. In general, the network performance improved as the amount of image information increased. However, the optimal combination was “NORMAL + BROWN SPOTS + UV SPOTS,” which had slightly higher accuracy and AUC than the other combinations. It also appears that the “PORPHYRINS” and “RED AREAS” modes did not improve the network performance under this multimode input method. This is noteworthy because it was previously believed that more information in training data could improve network performance [36].
The five modes had the following characteristics: (1) the “NORMAL” mode image could identify spots by their color and contrast with the surrounding skin; (2) the “UV SPOTS” mode image was generated by the selective absorption of UV light by epidermal melanin; (3) the “BROWN SPOTS” mode could detect deeper deposition of melanin under cross-polarized light; (4) the “PORPHYRINS” mode image was photographed in UV light on the basis that porphyrin fluoresces in UV light; and (5) the “RED AREAS” mode image was used as a measure of hemoglobin content under cross-polarized light [15]. Since the pigmentation of melasma is considered to involve both the epidermis and dermis [37], the “NORMAL,” “UV SPOTS,” and “BROWN SPOTS” modes might be more useful for dermatologists in melasma diagnosis. Although the basis on which the deep learning model classifies images remains a black box, information from the “PORPHYRINS” and “RED AREAS” mode images might have confused our multichannel input network during information fusion, thereby degrading its performance. The specific reason for this outcome requires further investigation.

On the basis of common clinical images, we developed a straightforward and rapid melasma diagnosis network, thus avoiding time-consuming invasive or noninvasive methods such as Wood's lamp, dermoscopy, and reflectance confocal microscopy. In subsequent clinical translation, we expect to develop a remote diagnostic tool using smartphones or networks on the basis of single-mode imaging for convenient self-diagnosis by patients, as well as downstream medical software for the VISIA system based on multimode image input, with the aim of providing an accurate auxiliary diagnostic tool.

However, this study has some limitations. First, the data were obtained from a single center, and it would be better to validate our network in multiple centers to include more patients with different skin conditions. Second, owing to the fixed background and lighting, almost all image backgrounds were black. Thus, in future practical applications, it might be necessary to add a component to our diagnostic system that masks the background and to better simulate various conditions during training. Third, this study was conducted in Southwest China, and the Fitzpatrick types of all included patients were III and IV. Since higher phototypes exhibit more melanocytes [38], the network performance might vary in people of different Fitzpatrick phototypes. This requires further exploration in datasets of different skin types. In addition, our model was not able to quantify melasma severity on the basis of the Melasma Area and Severity Index, which would require higher-performance networks and more sophisticated algorithms. Finally, our results lack further validation of the model in use, i.e., whether it could play a role in the diagnosis of melasma by non-dermatologists. This needs further investigation.


This study is the first to use a large sample of melasma images to compare the diagnostic performance of multiple deep learning models and develop a diagnostic system for melasma. Multimode image combinations were evaluated using images taken under different lighting conditions in the VISIA system, which further increased the diagnostic accuracy for melasma. Our study could provide a basis for the development of clinical diagnostic applications for melasma and other skin disorders. However, more clinical images of patients from multiple centers are needed to further improve the proposed diagnostic system.