1 Introduction

Skin cancer is a common cancer worldwide and includes squamous cell carcinoma, basal cell carcinoma, and melanoma. Melanoma accounts for a smaller proportion of cases than the other types, but it is more likely to invade nearby tissue and spread throughout the body. According to statistics, the 5-year survival rate exceeds 95% when melanoma is treated early, whereas it falls below 15% when effective treatment is not received at an early stage [1]. Doctors usually diagnose melanoma with dermoscopic imaging, a non-invasive technique that produces a magnified image of a skin region at relatively high resolution. Such images clearly capture region-specific skin information [2], providing additional lesion detail that helps doctors identify skin lesions. Although dermoscopy has achieved encouraging results in skin lesion detection, it reaches only average sensitivity when diagnosing melanoma [3]. Therefore, complementary diagnostic methods are needed to help dermatologists diagnose melanoma more accurately and efficiently. In recent years, deep learning methods have dominated the field of melanoma diagnosis [4, 5]; in particular, solutions based on deep convolutional neural networks have achieved significant performance improvements [5,6,7]. However, improving the accuracy of melanoma diagnosis remains a major challenge because of the following five factors.

First, the number of images available in this field is relatively small. Skin lesion datasets typically contain only a few thousand, or even a few hundred, images, because the images are difficult to acquire and accurate annotation is demanding [8]. This shortage of training samples limits the success of neural network methods in this field [5].

Second, there is a large gap between the number of target-class samples and the number of non-target-class samples within the same dataset. As shown in Table 1(A), most of the images in the dataset are non-melanoma images, which can lead to overfitting or bias.

Table 1 The number of each lesion type in the ISIC 2016 dataset

Third, melanoma and non-melanoma exhibit large intraclass differences and small interclass differences. Skin lesion images belonging to the same category differ markedly in features such as color and shape, while images from different categories can be highly similar. Even for doctors with professional knowledge, correct classification under these conditions remains a challenge.

Fourth, previous machine learning methods for classifying skin lesion images generally lack effective semantic focus on the lesion area. The lesion occupies only a small part of a skin lesion image, and most of the image is normal skin tissue. As shown in Fig. 1B, these tissues are not only irrelevant to melanoma classification but may also interfere with it. Although deep learning methods are widely used in skin lesion analysis [9, 10], few studies explain which parts of the image carry the meaningful semantic information the model actually focuses on.

Fig. 1

A Melanoma, B Non-melanoma, C Patch image

The last problem is the weak generalization ability of a single deep convolutional neural network. Researchers usually address this difficulty with Ensemble Learning (EL) or by adding external data [7]. Using external data is the easiest option, and many deep convolutional neural network (DCNN) methods rely on it to improve classification performance [5, 11]; however, in most medical fields, large numbers of labeled images are not available [12], and using different external data for training undermines the fairness of comparisons between methods. Therefore, EL methods become the key to solving this problem. EL methods combine the outputs of different classifiers according to certain rules to obtain better generalization than any single classifier. Researchers believe that different DCNNs extract different semantic information from skin lesion images, so EL-based methods usually aggregate several powerful DCNN architectures; for example, Balazs [12] integrated four models and obtained an average AUC value of 0.891 on the ISIC 2017 dataset [13], and Mahbod and Schaefer [14] integrated 72 models and obtained an average AUC value of 0.914 on the same dataset.

To address the above problems, we propose a new method for melanoma diagnosis, as shown in Fig. 2. The contributions of this article are as follows:

  (1)

    The effective semantic information in skin lesion images accounts for only a small proportion of each sample, so obtaining both local fine-grained information and global information is the key to accurate classification. We design a new method for melanoma diagnosis in which the Basic Convolutional Neural Network (BCNN) extracts patch image features; the Auxiliary Convolutional Neural Network (ACNN) consists of two parts, ACNN_WI and ACNN_O, where ACNN_WI takes the corresponding global image, ACNN_O takes the patch image, and the intermediate-layer feature matrices of BCNN and ACNN_WI are fused through a weighted integration strategy; and the Fusion Convolutional Neural Network (FCNN) combines the local fine-grained and global information of the skin lesion image and outputs the classification result.

  (2)

    Extracting effective semantic features from skin lesion images more effectively. We propose a novel patch strategy for generating patch images. The patch algorithm extracts a patch image from each original image, and both the patch image and the original training image are fed into the Basic-Aided-Fusion Convolutional Neural Network (BAFCNN) to further improve melanoma diagnosis. The patch image contains the entire lesion together with some of the surrounding normal skin tissue. Through this purposeful operation, the neural network can focus correctly and quickly on the meaningful information in the skin lesion image, improving classification performance.

  (3)

    Better classification of original and patch image features. We introduce a new integration strategy: because the different networks in BAFCNN attend to skin lesion images to different degrees, assigning different weight values in the FCNN allows complementary discriminative information to be extracted from dermoscopic images, further improving melanoma diagnosis.

Fig. 2

The structure diagram of the melanoma detection method proposed in this article. a Conditional image synthesis and extraction of patch images. b The patch feature information and global feature information are sent to the BCNN and ACNN networks, respectively. c The outputs of BCNN, ACNN_WI, and ACNN_O are sent to the FCNN to obtain the melanoma diagnosis result

2 Related Work

2.1 Solving the Class Imbalance Problem

In the field of skin lesion classification, the number of images in a dataset is relatively small, and the proportion of melanoma images is even lower. Researchers have done much work to address this problem. IG Díaz [15] rotates each lesion image and crops the largest internal rectangle of the original data from the rotated image. The author generates normalized coordinates for each view, so that subsequent processing is invariant to displacement, rotation, scale changes, and even irregular shape changes of the lesion. In the final experiment, eight rotations were applied, with three regions cropped each time, so each original image yields an augmented set of 24 training images. The experiment was carried out on the ISIC 2017 dataset [13] and achieved a 3.4% increase in AUC value. References [6, 16,17,18] address the problem by adding extra data. Zhang and Xie's [5] experiments on the ISIC 2017 dataset show that, with an additional 1320 training images, the AUC value increases by 1.6% and melanoma classification performance improves. More recently, [19, 20] adopted a different approach: exploiting the inter-class variation of the data distribution to perform conditional image synthesis and thereby address the imbalance. By learning the mapping between classes and using unpaired image-to-image translation to synthesize under-represented class samples, the final experiment was carried out on the ISIC 2016 dataset [21]. The experiments show that the proposed data augmentation method increases the AUC value by 2.1%.

For narrowly defined tasks, it is relatively easy to learn transformations given prior knowledge or conditions [22, 23], whereas unconditional image synthesis often leads to artifacts and unstable training [24]. In recent years, the cycle-consistent adversarial networks (CycleGAN) method has been used for conditional image synthesis. It is a technique for automatically training image-to-image translation models without paired samples [25,26,27]; the model is trained in an unsupervised manner using a collection of source and target images [20]. In this paper, the CycleGAN conditional image synthesis method is used to address the imbalance between the numbers of melanoma and non-melanoma images in the dataset.

2.2 DCNN Model

In image classification research, the classic approach is to train a relatively deep CNN model (such as the currently popular EfficientNet) on the ImageNet dataset and then replace the last layer of the model so that its final output dimension matches the target dataset [28].

In recent years, DCNNs have achieved state-of-the-art performance in many computer vision applications and perform particularly well in melanoma diagnosis [6, 16,17,18, 29,30,31]. However, in this field, DCNN methods generally lack the ability to focus semantically on lesions. Kawahara [32] showed, by visualizing CNN features, that the learned features concentrate on the color of the skin lesion, the lesion edges, and the hair and artifacts. Thus, the area carrying effective semantic information occupies only a small part of the dermoscopy image, and most of the image is normal skin tissue, rulers, and so on; these parts are irrelevant to melanoma classification. Effectively removing these interfering factors is an urgent open problem in melanoma diagnosis. Liu and Zerubia [31] argue that such irrelevant factors interfere with correct melanoma classification. Kim and Hong [33] argue that undesirable features, such as hair and artifacts in dermoscopic images, degrade image visibility, and therefore use generative adversarial learning to make the reconstructed distribution of images with hair resemble the distribution of hair-free images, thereby removing irrelevant parts such as hair from the dermoscopic image. Quantitative evaluation shows that, on the ISIC 2020 dataset, the AUC value can increase by 12.8% after hair removal. Almarazdamian and Ponomaryov [30] also showed that hair removal is of great significance for melanoma diagnosis.

In addition, many works have shown that DCNNs trained for classification tasks have remarkable localization capabilities and can highlight meaningful regions in images. Song and Li [34] designed a three-stage method to classify melanoma. The first stage uses a U-Net segmentation network to extract a segmentation mask from the original training image and crops the image according to the mask. The second stage uses five recent deep neural networks to extract features from the cropped image, and finally a new locally connected neural network is constructed to integrate the classification results. The method was tested on the ISIC 2017 dataset, and the results show that the segmentation step improves classification performance by 10%. Tang and Liang [29] also demonstrated the effectiveness of the segmentation–classification model. Yang [35] trains a model that handles classification and segmentation simultaneously, but such models cannot be trained effectively when pixel-level annotations are unavailable. To determine the meaningful area in the image more accurately, researchers drew inspiration from studies of human vision and introduced the attention mechanism into neural networks. In recent years, the attention mechanism has been widely used in deep learning to selectively focus on information of interest [26, 36, 37]; Yan and Kawahara [9] proposed a melanoma classification method combining VGGNet with an attention module, in which the attention map highlights image regions relevant to skin lesion classification.

3 Method

3.1 Conditional Image Synthesis

For narrowly defined tasks, learning transformations with given prior knowledge or conditions is relatively easy, but the use of unconditional image synthesis often leads to artifacts and makes training unstable [38]. In recent years, the cycle-consistent adversarial networks (CycleGAN) method has been used for conditional image synthesis; it is a technique for automatically training image-to-image translation models without paired samples [39]. The model is trained in an unsupervised manner using a set of source and target images [40]. This article adopts the CycleGAN conditional image synthesis method to address the imbalance between the number of target images and the number of non-target images in the dataset.

We divide the classes into two domains for conditional image synthesis, with the aim of generating target samples from non-target samples. Data generation for the minority target class is performed to alleviate class imbalance, since for narrowly defined tasks, learning transformations with given prior knowledge or conditions is relatively easy [38]. In contrast, using unconditional image synthesis to generate target-distribution data from noise often produces artifacts and may destabilize training. Most image-to-image translation methods, however, require paired training data, in which a new image is a controlled modification of a given image [41]. Since paired samples are not available for melanoma detection, we use CycleGAN, a technique for automatically training image-to-image translation models without paired samples; these models are trained in an unsupervised manner using image sets from the source and target domains. CycleGAN is a framework for training image-to-image translation models that combines the GAN architecture with cycle consistency to learn mapping functions between two domains. The idea behind cycle consistency is to prevent the two learned mappings from contradicting each other.

The non-melanoma images and melanoma images in the dataset are defined as the B domain and M domain, respectively. The framework contains two types of networks, generators G and discriminators D, and must learn two generators, \(G_{\text{BM}}\) and \(G_{\text{MB}}\), and two discriminators, \(D_{\text{BM}}\) and \(D_{\text{MB}}\). The generator \(G_{\text{BM}}\) converts B-domain images into M-domain feature images, while \(G_{\text{MB}}\) converts M-domain images into B-domain feature images. The discriminator \(D_{\text{BM}}\) scores the real M-domain images against the synthesized M'-domain feature images, and \(D_{\text{MB}}\) scores the real B-domain images against the synthesized B'-domain feature images. In other words, the discriminator D evaluates the credibility of the images produced by the generator G, the generator G is updated according to this evaluation, and the two eventually reach a balanced state through this game, as shown in Fig. 3.

Fig. 3

Diagram of the process by which the CycleGAN network generates target-domain images from source-domain images in an unpaired manner. B' and M' represent the conditional composite images of the non-melanoma and melanoma images, respectively. The network generates melanoma feature images from non-melanoma images and then restores them to non-melanoma feature images; cycle consistency should be maintained throughout the process

Theoretically, the adversarial training game can make the generators \(G_{\text{BM}}\) and \(G_{\text{MB}}\) powerful enough to produce outputs that follow the distributions of the melanoma image domain M and the non-melanoma image domain B, respectively [42]. However, in practice, with large enough capacity, a generator can map the same set of input images to any random permutation of images in the target domain [25]. Therefore, to further constrain the space of mapping functions, we use a cycle-consistency loss.

The idea of cycle consistency is to make the CycleGAN network learn the two mappings \(G_{\text{BM}}\) and \(G_{\text{MB}}\) simultaneously while enforcing \(G_{\text{MB}}(G_{\text{BM}}(b)) \approx b\) and \(G_{\text{BM}}(G_{\text{MB}}(m)) \approx m\). In other words, after a non-melanoma image has been converted to a melanoma image, it should be convertible back to a non-melanoma image, which reduces the space of admissible mapping functions. In this model, \(G_{\text{BM}}\) and \(G_{\text{MB}}\) use the U-Net architecture [43], and \(D_{\text{MB}}\) and \(D_{\text{BM}}\) use PatchGAN [44]. The U-Net architecture consists of an encoder subnet and a decoder subnet connected by a bridge, while PatchGAN is essentially a convolutional neural network classifier that determines whether an image patch is real or fake.
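To make the roles of the two mappings and the cycle-consistency constraint concrete, the following minimal PyTorch sketch expresses the generator objective. It assumes G_BM, G_MB, D_BM, and D_MB are network modules defined elsewhere (U-Net generators and PatchGAN discriminators in our setting); the least-squares adversarial loss and the weight lambda_cyc = 10 are illustrative choices, not values taken from this paper.

```python
import torch
import torch.nn as nn

adv_criterion = nn.MSELoss()   # least-squares GAN loss, a common CycleGAN choice
cyc_criterion = nn.L1Loss()    # cycle-consistency term
lambda_cyc = 10.0              # illustrative weight for the cycle term

def generator_loss(G_BM, G_MB, D_BM, D_MB, real_b, real_m):
    """One generator objective: adversarial terms plus cycle consistency."""
    fake_m = G_BM(real_b)            # B -> M'
    fake_b = G_MB(real_m)            # M -> B'
    rec_b = G_MB(fake_m)             # B -> M' -> B, should recover the input
    rec_m = G_BM(fake_b)             # M -> B' -> M, should recover the input

    # Adversarial terms: the generators try to make the discriminators score fakes as real.
    score_m, score_b = D_BM(fake_m), D_MB(fake_b)
    loss_adv = adv_criterion(score_m, torch.ones_like(score_m)) + \
               adv_criterion(score_b, torch.ones_like(score_b))

    # Cycle-consistency terms: G_MB(G_BM(b)) ≈ b and G_BM(G_MB(m)) ≈ m.
    loss_cyc = cyc_criterion(rec_b, real_b) + cyc_criterion(rec_m, real_m)
    return loss_adv + lambda_cyc * loss_cyc
```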

In our experiment, we used the weights of the trained \(G_{\text{BM}}\) generator to produce melanoma conditional composite images from the non-melanoma images in the original dataset. However, we found that, with only a single prediction, the melanoma features in the synthesized image are not obvious. Therefore, where conditions permitted, we performed multiple prediction passes with the saved weights and fitted the resulting images to obtain the final result, shown in Fig. 1C. In total, 702 synthetic images were generated, and the original training set was combined with these synthetic images to expand and balance the dataset. The balanced result is shown in Table 1(C).

3.2 Patch Strategy

The effective semantic information of skin lesion images is clustered at the center of the image [7, 45, 46] and occupies only a small portion of it. Enabling the model to focus effectively on meaningful regions is therefore a central research question in this area. References [47, 48] proposed a patch-based attention architecture that provides global context between small, high-resolution patches; the results show that the proposed attention mechanism outperforms previous methods and increases average sensitivity by 7%. However, this method has drawbacks. First, the segmentation network is large, resource-intensive, and slow to train, which hampers its application to melanoma classification. Second, after segmentation the method retains only the melanoma region and discards the normal skin tissue; we believe such a skin lesion image ignores the transition information between normal skin and the target region, which is not conducive to melanoma classification. Tang and Liang [7] introduced a method for extracting effective semantic information from dermoscopy images, calling the extracted image a patch image; however, their extraction method cannot guarantee that the entire lesion is included and may retain too much normal skin tissue.

Therefore, to remove interfering factors from the lesion image, we propose a pixel-block-difference patch strategy (PBD strategy). Before the algorithm starts, two parameters, T and threshold, are set to control the granularity of the extracted patch image. First, the pixel axes of the original sample are extracted and defined as width and height. Both axes are then passed to the Section() function to compute the partition coordinates. In Section(), the pixel values along each axis are divided into T blocks; the pixel sum of each block is stored in the variable sum, and the difference between two adjacent sums is computed. If this difference exceeds the predefined threshold, the pixel content of the original sample is considered to change significantly at that location, that is, the lesion boundary lies between these two pixel blocks. After determining the transition interval between the target and non-target areas, Formulas 1–4 return the absolute pixel coordinates used for cutting the original sample. In our experiments, the transition between diseased and non-diseased tissue provides features for the classifier to recognize. Therefore, we do not cut exactly at the boundary between the two, but choose, within the pixel block with the large difference, the position farther from the diseased tissue as the cutting line, as shown in Fig. 4. The algorithm is as follows:

$$ L = \frac{W}{T} \times B\left([S(T_{i}), S(T_{i+1})]_{i = 1, 2, \ldots, T-1}\right), $$
(1)
$$ R = W - \frac{W}{T} \times B\left([S(T_{i}), S(T_{i+1})]_{i = 1, 2, \ldots, T-1}\right), $$
(2)
$$ U = \frac{H}{T} \times B\left([S(T_{i}), S(T_{i+1})]_{i = 1, 2, \ldots, T-1}\right), $$
(3)
$$ D = H - \frac{H}{T} \times B\left([S(T_{i}), S(T_{i+1})]_{i = 1, 2, \ldots, T-1}\right), $$
(4)

where L, R, U, and D represent the distances of the target patch image from the left, right, top, and bottom edges of the original image, respectively, and W and H represent the width and height of the original image. T is the number of grid blocks, T_i is the ith pixel block, and S(T_i) is the pixel sum of the ith block. B() returns the index i of the first or last pair of adjacent pixel blocks whose difference exceeds the threshold. According to Formulas 1–4, T is positively related to the fineness of the division: a larger T can yield finer patch images. However, skin lesion images contain abundant non-target information, such as rulers, artifacts, and hair, that affects classification. When T is too large, the ratio of the patch image to the original sample pixels returned by the algorithm grows; in other words, a smaller T can yield a larger standard deviation between grid blocks. An overly large T therefore does not produce finer slicing results but instead produces more meaningless patch images.
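To make Formulas 1–4 and the Section()/B() operations concrete, the following sketch gives one possible NumPy implementation of the PBD strategy. The helper names block_sums and boundary and the default parameter values are our own illustrative choices; the tie-breaking details of the original algorithm may differ.

```python
import numpy as np

def block_sums(profile, T):
    """S(T_i): sum of the values in each of the T blocks along one axis profile."""
    return np.array([b.sum() for b in np.array_split(profile, T)], dtype=np.float64)

def boundary(sums, threshold, first=True):
    """B(): index of the first (or last) adjacent-block pair whose difference exceeds threshold."""
    hits = np.where(np.abs(np.diff(sums)) > threshold)[0]
    if hits.size == 0:                      # no significant change: keep the whole axis
        return 0 if first else len(sums)
    return int(hits[0]) if first else int(hits[-1] + 1)

def pbd_patch(image, T=8, threshold=30.0):
    """Crop a patch around the lesion using block-wise intensity changes (Formulas 1-4)."""
    gray = image.mean(axis=2) if image.ndim == 3 else image
    H, W = gray.shape
    col_sums = block_sums(gray.mean(axis=0), T)   # profile along the width axis
    row_sums = block_sums(gray.mean(axis=1), T)   # profile along the height axis
    L = int(W / T * boundary(col_sums, threshold, first=True))
    R = int(W - W / T * boundary(col_sums, threshold, first=False))
    U = int(H / T * boundary(row_sums, threshold, first=True))
    D = int(H - H / T * boundary(row_sums, threshold, first=False))
    return image[U:H - D, L:W - R]
```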

Fig. 4

Schematic diagram of PBD strategy


In the experiments, to demonstrate the effectiveness of the proposed PBD strategy, we compare it with a Geometric-Slice patch strategy (GS strategy). The GS extraction bisects the two central axes of the original image, connects the four bisecting points, and takes the middle part of the image as the patch image, as shown in Fig. 5.
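As we read this description, the GS baseline reduces to a fixed central crop covering one quarter of the image area; a minimal sketch:

```python
def gs_patch(image):
    """GS strategy: keep the central region bounded by the midpoints of the four
    half-axes, i.e. the middle quarter of the original image (slices 5-8 in Fig. 5)."""
    H, W = image.shape[:2]
    return image[H // 4: 3 * H // 4, W // 4: 3 * W // 4]
```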

Fig. 5

Schematic diagram of the GS strategy. The numbers indicate the slice indices of the original image: 1–4 indicate the division of the original image into four parts, and 5–8 the further division into eight. Slices 5–8 are combined to form the patch image

Before the experiments, we statistically analyzed the patch images generated by the two patch strategies, using the lesion mask images of the ISIC 2016 dataset as a reference; all calculations are pixel-level operations.

First, we examine the slicing results, as shown in Fig. 6. Comparing the original images with the patch images generated by the PBD and GS strategies, we find that in columns a, b, e, and f, the patch generated by the PBD strategy extracts the lesion area as completely as possible along with some normal skin tissue around the lesion, whereas the patch image generated by the GS strategy in columns a, e, and f captures only part of the lesion; such images lose much meaningful semantic information. For columns c and d, the patch image generated by the PBD strategy contains most of the lesion but also a large amount of non-lesion information, while the patch generated by the GS strategy contains less non-lesion information; we believe both types of patch images can be successfully focused on by the network. In addition, when the training image is affected by artifacts, as shown in column g, the PBD strategy ensures that the entire lesion is included, whereas the GS strategy removes part of it. Clearly, the PBD strategy is better suited to patch extraction in this case.

Fig. 6

Examples of patch images generated by applying PBD strategy and GS strategy, respectively

In addition to comparing the slicing results visually, we also analyzed the two types of patch images statistically.

First, the lesion in the PBD patch image covers 95.41% of the lesion in the original image and occupies 55.15% of the patch image area, whereas lesions occupy only 27.0% of the original training images. The lesion in the GS patch image covers only 27.87% of the original lesion and occupies 27.83% of the patch image. These figures show that the patch image extracted by the PBD strategy contains most of the lesion area of the original image while removing part of the normal skin tissue. Given the operating mechanism of PBD, we believe that the skin tissue retained in the patch image is the tissue surrounding the lesion. Such images not only allow BAFCNN to find meaningful semantic information faster and more accurately, but also largely remove the interference of normal skin tissue on BAFCNN.

Second, the PBD patch image covers 47.24% of the original image on average. When the patch image covers less than 45% of the original image, the lesion in the patch covers 94.72% of the original lesion; between 45 and 80%, the proportion is 94.24%; above 80%, it is 99.32%. The patch image generated by the GS strategy always covers 25% of the original image. These data show that even when the PBD strategy extracts a small part of the original image, it still accurately locates the lesion; larger patch images result from larger lesion areas in the original image, not from errors of the algorithm itself. Therefore, the proposed patch strategy adapts the patch size to the lesion, accurately locates the lesion in the original dermoscopic image, and effectively reduces the size of the training image while avoiding the omission of effective semantic information, as shown in Fig. 7.

Fig. 7

The pixel size distribution of the original image and the slice image. The color indicates the number of images with corresponding width and height, and the reference scale is given on the right

Third, we also counted the proportion of the lesion within the patch images. When the patch image covers less than 45% of the original image, the lesion occupies about 51% of the patch; between 45 and 80%, the lesion occupies 63.88%; above 80%, the lesion occupies 55.23%. This confirms that the PBD strategy keeps the ratio of lesion to non-lesion area stable in the patch image, prevents the non-lesion area of the dermoscopy image from becoming too large, and keeps the training network from paying too much attention to irrelevant positions, thereby yielding better network performance. In conclusion, we believe the PBD strategy performs better.

3.3 BAFCNN

Combining different classifiers can improve the accuracy and robustness of classification tasks and addresses the weak generalization ability of a single neural network. However, ensemble learning has not been fully explored for melanoma detection. In this paper, we use a new ensemble learning framework that learns both the information of the original image samples and the fine-grained information of the patch images [7]. Figure 8 illustrates the proposed ensemble learning framework. The BCNN uses a conventional neural network model, which we selected through extensive experiments, as shown in Fig. 10; the experiments show that the EfficientNet-B6 network performs best, so we adopt it for the BCNN. The structures of ACNN_WI and ACNN_O are shown in Fig. 8, and the FCNN structure is shown in Fig. 9.

Fig. 8

BCNN and ACNN (ACNN_WI, ACNN_O) structure diagram. The BCNN takes the training image as input, and pixel-level operations are performed between its intermediate-layer features and the intermediate-layer features of the ACNN_WI network. Note that an upsampling operation is applied before the training image is sent to the ACNN network to highlight the characteristics of the patch image. The BCNN and ACNN networks output result vectors of the same dimension

Fig. 9

Structure diagram of the three fusion strategies

EfficientNet differs from the design of commonly used CNNs: it searches for the best architecture around a baseline model by changing the network depth, width, and input resolution. EfficientNet-B6 is derived from the baseline network by compound scaling, which uses a compound coefficient Ø to scale the network depth d, network width w, and resolution r uniformly, as shown in Eq. (5).

In Eq. (5), α, β, and γ are constants determined by grid search, and Ø is a user-specified coefficient that depends on how many computing resources are available. Among these models, EfficientNet-B6 offers better performance and requires less computation than the other models. Its advantage over other networks can be attributed to three factors. First, a deeper network can capture richer, more complex features and generalizes well to new tasks. Second, a wider network can extract more fine-grained features and is easier to train. Finally, a higher input resolution means more pixels are fed to the network, so the model may capture more fine-grained patterns. EfficientNet-B6 therefore expands these dimensions of EfficientNet-B0 and achieves excellent performance.

$$\begin{gathered} \text{Depth:}\; d = \alpha^{\emptyset}, \\ \text{Width:}\; w = \beta^{\emptyset}, \\ \text{Resolution:}\; r = \gamma^{\emptyset}, \\ \text{s.t.}\;\; \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \\ \alpha \ge 1,\; \beta \ge 1,\; \gamma \ge 1. \end{gathered}$$
(5)
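As a small illustration of Eq. (5), the function below computes the scaling multipliers from the compound coefficient; the α, β, γ defaults are the constants reported in the original EfficientNet grid search and are used here only as an example, not as values specific to this paper.

```python
def compound_scaling(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Eq. (5): depth, width and resolution multipliers for a compound coefficient phi (Ø).
    The constants satisfy the constraint alpha * beta**2 * gamma**2 ≈ 2, with all ≥ 1."""
    d = alpha ** phi    # depth multiplier
    w = beta ** phi     # width multiplier
    r = gamma ** phi    # resolution multiplier
    return d, w, r

# e.g. compound_scaling(6) roughly corresponds to the EfficientNet-B6 scaling factors.
```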

The ACNN network consists of two parts. The first part upsamples the patch image by a factor of four and sends it to the ACNN_WI network. The intermediate-layer output of the BCNN is fed into ACNN_WI as a form of global guidance and is combined with it through the weighted integration strategy to obtain the final output. The weighted integration strategy is given in Eq. (6):

$$ \hat{G} = \operatorname{sqr}\left[ \left( F \oplus G \right) \, \Theta \, \left( G, F \right) \right], $$
(6)

where \(F = (f_{1}, f_{2}, \ldots, f_{n})\) denotes the BCNN intermediate features, \(G = (g_{1}, g_{2}, \ldots, g_{n})\) denotes the ACNN local features, \(f_{i}\) and \(g_{i}\) are the feature vectors at the ith spatial location of F and G, respectively, and \(\oplus\) denotes the stitching (concatenation) of the two feature vectors. \(\Theta\) denotes a composite operation in which F and G are multiplied pixel by pixel with G and F, respectively, and the result is then squared. ACNN_O first upsamples the training image and then feeds it into a network containing seven convolutional layers to obtain the output vector.
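Because the wording of Eq. (6) leaves some room for interpretation, the following PyTorch sketch shows one plausible reading of the weighted integration: the BCNN and ACNN_WI feature maps are multiplied pixel by pixel in a crossed fashion, stitched along the channel dimension, and squared. The tensor shapes and the exact composition of ⊕ and Θ are assumptions, not the authors' exact implementation.

```python
import torch

def weighted_integration(F, G):
    """One reading of Eq. (6): Ĝ = sqr[(F ⊕ G) Θ (G, F)].
    F: BCNN intermediate feature map, G: ACNN_WI feature map, same shape (N, C, H, W)."""
    crossed = torch.cat([F * G, G * F], dim=1)   # pixel-wise products (Θ), stitched (⊕)
    return crossed.pow(2)                        # sqr[...]
```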

FCNN is responsible for fusing the outputs of BCNN, ACNN_WI, and ACNN_O. We experimented with three fusion strategies, namely the Splicing-Fusion-Strategy (SFS), the Weight-Fusion-Strategy (WFS), and the Result-Weight-Fusion-Strategy (RWFS), shown in Fig. 9. SFS merges the outputs of BCNN, ACNN_WI, and ACNN_O at the pixel level and sends the result to fully connected layers for classification. WFS recombines the three outputs according to three weight values and sends the result to fully connected layers for classification. RWFS sends the three outputs to three separate classification layers, weights the three resulting predictions with different weight values, and produces the final diagnosis. The formulas of the three fusion strategies are as follows:

$$ {\text{SFS}}: \; \Pi_{x3} \left( {\text{RB}} \oplus {\text{RAWI}} \oplus {\text{RAO}} \right), $$
(7)
$$ {\text{WFS}}: \; \Pi_{x2} \left( {\text{RB}} * W1 \circledast {\text{RAWI}} * W2 \circledast {\text{RAO}} * W3 \right), $$
(8)
$$ {\text{RWFS}}: \; \Pi_{x2} ({\text{RB}}) * {\text{RW}}1 \circledast \Pi_{x2} ({\text{RAWI}}) * {\text{RW}}2 \circledast \Pi_{x2} ({\text{RAO}}) * {\text{RW}}3, $$
(9)

where RB = (rb1, rb2, …, rbn), RAWI = (rawi1, rawi2, …, rawin), and RAO = (rao1, rao2, …, raon) denote the feature vectors obtained from the outputs of BCNN, ACNN_WI, and ACNN_O through a dimensionality reduction operation. W1, W2, W3, RW1, RW2, and RW3 are weight values, \(\oplus\) denotes vector concatenation (stitching), \(\circledast\) denotes pixel-level element-wise summation, and \(\Pi_{xi}\) denotes a stack of i fully connected layers.
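The sketch below expresses Eqs. (7)–(9) in PyTorch. The layer widths, the weights W1–W3 / RW1–RW3, and the two-class output head are illustrative placeholders, since the text fixes only the operators (concatenation, weighted element-wise summation, and stacks of fully connected layers Π_xi).

```python
import torch
import torch.nn as nn

def fc_stack(in_dim, depth, out_dim=2):
    """Π_xi: a stack of `depth` fully connected layers ending in the class logits."""
    layers, dim = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(dim, dim), nn.ReLU(inplace=True)]
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)

def sfs(rb, rawi, rao, head):                          # Eq. (7): concatenate, then Π_x3
    return head(torch.cat([rb, rawi, rao], dim=1))

def wfs(rb, rawi, rao, head, w=(0.5, 0.3, 0.2)):       # Eq. (8): weighted sum, then Π_x2
    return head(w[0] * rb + w[1] * rawi + w[2] * rao)

def rwfs(rb, rawi, rao, heads, rw=(0.5, 0.3, 0.2)):    # Eq. (9): Π_x2 per branch, then weighted sum
    outs = [h(x) for h, x in zip(heads, (rb, rawi, rao))]
    return rw[0] * outs[0] + rw[1] * outs[1] + rw[2] * outs[2]
```

For example, with feature dimension d, the SFS head would be fc_stack(3 * d, depth=3), while WFS uses a single fc_stack(d, depth=2) head and RWFS uses three such heads, one per branch.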

4 Experiments and Results

4.1 Dataset

We evaluate the proposed method on the publicly available dataset of the ISIC 2016 Skin Cancer Classification Challenge. The ISIC 2016 dataset includes 900 images for training and 379 images for testing. Specifically, the training images comprise 727 non-melanoma cases and 173 melanoma cases, an interclass ratio of about 1:4, and the test images comprise 304 non-melanoma cases and 75 melanoma cases, as shown in Table 1. The image sizes in this dataset range from 556 × 679 to 2848 × 4828 pixels. The main task is binary melanoma classification (melanoma vs. non-melanoma). Melanoma and non-melanoma cases from the ISIC 2016 dataset are shown in Fig. 1; the two categories have high visual similarity, which makes melanoma detection very difficult. Note that melanoma samples show high intraclass differences in color, texture, and shape. In addition, artifacts such as ruler marks and hairs occlude the region of interest and hinder the extraction of classification features. Note also that, since a patient may already have been diagnosed with melanoma before image acquisition and the medical staff may have acquired dermatoscopic images in greater depth to better distinguish benign from malignant grades, most malignant images show more diffuse boundaries. Since the internationally published ISIC 2020 skin lesion dataset for melanoma classification does not provide ground-truth labels for its test set, we evaluate the proposed method only on the publicly available ISIC 2016 dataset.

4.2 Data Preprocessing

As shown in Table 1, the dataset suffers from severe class imbalance, so we use conditional image synthesis to address this problem and generate samples for the under-represented class. The method oversamples the target class in the training set and synthesizes new target-class images from the minority-class samples, so that all classes contain a comparable number of samples, as shown in Table 1. Patch images are then extracted from the balanced dataset using the patch algorithm, and the extracted patches retain the original pixel resolution.

To achieve faster convergence, feature standardization is performed: we rescale the images to values between 0 and 1. Note that our method does not use domain-specific or application-specific post-processing. We then resize the original images to 224 × 224 and the patch images to 128 × 128, and the patch images are upsampled by factors of eight and four, respectively. In addition, data augmentation is usually performed on medical datasets to improve classification performance; it creates modified versions of the input images through random transformations, including horizontal and vertical flipping, Gaussian noise, brightness and zoom enhancement, horizontal and vertical shifting, per-pixel sampling noise, color space conversion, and rotation.
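A minimal augmentation pipeline matching the transformations listed above, built with the albumentations library mentioned in Sect. 4.4; the probabilities and limits below are illustrative, not the exact settings used in our experiments.

```python
import albumentations as A

global_aug = A.Compose([
    A.Resize(224, 224),                       # original images are resized to 224 x 224
    A.HorizontalFlip(p=0.5),                  # horizontal flipping
    A.VerticalFlip(p=0.5),                    # vertical flipping
    A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.1, rotate_limit=30, p=0.5),  # shift / zoom / rotation
    A.RandomBrightnessContrast(p=0.3),        # brightness enhancement
    A.GaussNoise(p=0.2),                      # Gaussian / per-pixel sampling noise
    A.ToFloat(max_value=255.0),               # rescale pixel values to [0, 1]
])

# usage: augmented = global_aug(image=image)["image"]
```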

During training, we apply random data augmentation online and use the augmented data to train the classifier. We do not apply data augmentation during conditional image synthesis, because it would not contribute further to the synthetic images.

4.3 Evaluation Indicators

We use AUC [21, 49, 50], ACC, PRE, and TPR to evaluate melanoma classification performance on this dataset. The formulas are as follows:

$$ {\text{ACC = }}\frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}}, $$
(10)
$$ {\text{PRE}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}}, $$
(11)
$$ {\text{TPR}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}. $$
(12)

TN, TP, FP, and FN are the numbers of true-negative, true-positive, false-positive, and false-negative samples, respectively. Classifiers that reduce FN and FP show better performance [19]. Sensitivity, also known as recall or true-positive rate (TPR), is one of the most commonly used measures for evaluating classifiers in medical image classification. Another common metric is AUC, which summarizes the information contained in the ROC curve, a plot of the TPR against the false-positive rate FPR = FP/(FP + TN) at various thresholds. A higher AUC value indicates better performance in distinguishing melanoma from non-melanoma images.
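These metrics can be computed directly from the confusion counts or, equivalently, with scikit-learn; a small sketch with an illustrative 0.5 decision threshold:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

def evaluate(y_true, y_prob, threshold=0.5):
    """y_true: ground-truth labels (1 = melanoma); y_prob: predicted melanoma probabilities."""
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "ACC": accuracy_score(y_true, y_pred),    # Eq. (10)
        "PRE": precision_score(y_true, y_pred),   # Eq. (11)
        "TPR": recall_score(y_true, y_pred),      # Eq. (12), sensitivity
        "AUC": roc_auc_score(y_true, y_prob),     # area under the ROC curve
    }
```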

4.4 Implementation Details

In this experiment, conditional image synthesis and patch image extraction are performed offline, and the resulting images are then fed into the training network. The CycleGAN network is trained with 500 training cycles, using the Adam optimizer with a batch size of 1 and a learning rate of 0.0002, for 30 epochs. BAFCNN uses tenfold cross-validation, the Adam optimizer with a learning rate of 0.0005, and the BCEWithLogitsLoss loss function; the ACNN input size is 224 × 224 × 3 and the BCNN input size is 128 × 128 × 3, and albumentations-based augmentation is applied to the images before they are fed into BAFCNN. During training, the weights with the highest validation AUC are saved. The patch strategy proposed in this article uses a threshold of 30. We use PyTorch to build the network architecture and to compute the evaluation metrics. Experiments are run on a platform with an Intel(R) Xeon(R) Gold 5118 CPU @ 2.30 GHz and two RTX 2080Ti GPUs.
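For reference, a hedged sketch of one BAFCNN training epoch using the optimizer and loss named above; model and train_loader stand for the assembled BAFCNN and a data loader yielding patch/global image pairs and are not defined here, and the two-input forward signature is an assumption.

```python
import torch

def train_one_epoch(model, train_loader, optimizer, criterion, device="cuda"):
    """One BAFCNN training epoch; patch images are 128x128x3, global images 224x224x3."""
    model.train()
    for patch_img, global_img, labels in train_loader:
        patch_img, global_img = patch_img.to(device), global_img.to(device)
        labels = labels.to(device).float()
        logits = model(patch_img, global_img)        # assumed two-input forward signature
        loss = criterion(logits.squeeze(1), labels)  # BCEWithLogitsLoss works on raw logits
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# setup with the values stated in Sect. 4.4 (wiring illustrative):
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# criterion = torch.nn.BCEWithLogitsLoss()
```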

4.5 Experiment and Comparison of Results

To validate the melanoma classification method proposed in this article, we designed a series of experiments.

We add different patch images to the dataset. We refer to the experiment using the PBD patch strategy with the WFS fusion strategy as BAFCNN, and the experiment using the GS patch strategy with the WFS fusion strategy as GS-BAFCNN. In addition, the experiment using the SFS fusion strategy is called SFS-BAFCNN, and the experiment using the RWFS strategy is called RWFS-BAFCNN. In these four experiments, the BCNN input is the patch image, the ACNN_WI input is the global image, and the ACNN_O input is the patch image.

The comparison between the BAFCNN and GS-BAFCNN methods is shown in Table 2. The test results obtained with the PBD strategy are better than those obtained with the GS strategy: the AUC value is 85.4%, the accuracy is 85.2%, and the performance improves by 2.9%. This shows that the patch images generated by the PBD strategy help the training network focus more accurately on the meaningful area. We also expanded the data volume by enlarging the training set tenfold, as shown in Table 2 (BAFCNN(× 10)); interestingly, increasing the number of images too much causes the training network to focus excessively on meaningless regions, resulting in performance degradation. We also verified the three fusion strategies experimentally. The WFS fusion strategy obtains the best classification performance. The TPR values of the RWFS and SFS fusion strategies are higher than that of WFS, that is, these two strategies obtain a higher true-positive proportion in the BAFCNN outputs, but they do not achieve outstanding overall results in melanoma diagnosis. The loss comparison for these experiments is shown in Fig. 11b.

Table 2 Comparison of classification performance between different fusion strategies

In addition, we compare BAFCNN with other methods that use the ISIC 2016 dataset [9, 20, 36, 51,52,53]. In [9], an attention mechanism is used to classify melanoma, and during training the regions of interest (ROIs) are defined by lesion segmentation or dermoscopic features; the authors show that a regularized attention module refines classification performance. In [52], segmentation is performed before classification to give the network model the ability to focus on meaningful parts of the skin lesion image. In [20], a two-stage framework is used: the first stage exploits the differences in data distribution between classes, learns the mapping between classes, and uses unpaired image-to-image translation to synthesize under-represented class samples from over-represented class samples, completing the conditional image synthesis task; in the second stage, the original training set combined with the newly synthesized under-represented samples is used to train a deep convolutional neural network for skin lesion classification. The classification results are shown in Table 3, indicating that the proposed method outperforms the other methods on several metrics.

Table 3 Comparison of the proposed method with other studies

We hypothesize that the improvement in melanoma classification performance is attributable not only to the simultaneous extraction of global and local features in the BAFCNN network, but also to the proposed PBD strategy. In addition to the statistical comparison with the GS strategy presented above, we design the following comparative experiments to analyze the effect of the PBD strategy on the results.

First, only the BCNN is trained on the patch images; the results are shown in Table 4 (PBCNN). Comparing PBCNN with EfficientNet-B6 in Fig. 10, we find that PBCNN achieves better performance, obtaining a higher AUC value and identifying more non-melanoma cases, although the prediction accuracy on the test set decreases. We believe the decrease occurs because the patch image lacks the global information of the original training image and part of the semantic information of the lesion. Therefore, we add the ACNN and FCNN to the network model, forming the BAFCNN model, to effectively extract the semantic information of both the global image and the patch image. Since the BAFCNN network has three inputs, we designed five experiments to verify the impact of different input configurations on the results. The inputs and results are shown in Table 4, and the losses are shown in Fig. 11a. The BAFCNN experiment shows the best performance on the test set, demonstrating that the BAFCNN network not only effectively extracts the semantic information of the patch image, but also applies the weighted integration strategy to fuse the features of the patch image and the global image, so that the final classification result is greatly improved.

Fig. 10

Comparison of classification performance under different models

Table 4 Comparison of classification performance between different input methods
Fig. 11

a The loss curves of the five experiments listed in Table 4. The loss of the BAFCNN experiment reaches a stable value faster and is lower; PBCNN has the highest loss and the largest fluctuation. b The loss curves of the experiments listed in Table 2

5 Conclusion

Melanoma is a fatal skin disease. We verify the effectiveness of the proposed melanoma diagnosis method based on CycleGAN and the Basic-Aided-Fusion Convolutional Neural Network (BAFCNN) using a training set containing only 1279 images, and the results show that the method achieves strong performance. We improve melanoma diagnostic performance through a three-stage procedure. First, we use CycleGAN to synthesize samples for the under-represented class and balance the dataset. Second, we design a patch strategy that extracts the lesion site of the original training image together with the surrounding normal skin tissue, so that the network model can focus more accurately on the meaningful skin lesion. Finally, we use ensemble learning to extract and classify the feature information of the original and patch images. The good performance of the proposed melanoma diagnosis method is largely attributable to the fact that the mapping between the source and target category feature distributions for conditional image synthesis can be learned easily from the interclass differences of medical images. In addition, although image-to-image translation schemes are sometimes considered to hallucinate content by adding or removing image features, we show that in our scheme the division between categories does not introduce bias or unwanted hallucination. Figure 1 shows benign lesions collected from the ISIC 2016 training set and transformed into malignant samples using CycleGAN; the benign images and the corresponding synthetic malignant images have a high degree of visual similarity, mainly because of the nature of the dataset. How to remove hair and other artifacts, and how to extract patch images and sample features more scientifically and effectively, are the focus of our future work. We are grateful to the International Skin Imaging Collaboration (ISIC) for its efforts in collecting and sharing the skin lesion classification database used to evaluate computer-aided methods for classifying skin lesions in dermatoscopic images.