1 Introduction

Deep learning has dominated the field of computer vision since 2012 [1], taking advantage of the large improvements in data storage and computing power of modern processing devices. Currently, most advanced methods for computer vision are based on deep learning. In this context, medical image analysis is an important research direction. The advantage of deep learning networks is their ability to extract features automatically [2], so researchers can describe medical images without constructing complex handcrafted features. In deep learning-based medical image analysis, end-to-end network training shows significant advantages. Moreover, medical image analysis has huge practical demand and market space. It can reasonably be expected that deep learning-based medical image analysis will remain a promising research area in the near future.

Most current artificial intelligence (AI) methods and applications belong to the category of supervised learning [3], which in this case means that medical image data must be labeled. This is very difficult and costly to achieve in practice. On the one hand, each medical image in principle corresponds to a patient, so the amount of medical image data available is very limited. On the other hand, medical image labeling requires highly specialized medical staff and plenty of time. For example, to train a deep convolutional neural network (CNN) for tumor segmentation, a specialized physician must mark all tumor pixels in the training images. These problems greatly restrict the development of automated, intelligent medical image analysis tools. Generative adversarial networks (GAN) have the potential to provide efficient solutions to these problems.

GAN was proposed in 2014 [4], with the original intention of imitating real data. GAN consists of two subnetworks: a generator (\(G\)) and a discriminator (\(D\)). During training, \(G\) is used to generate data with a given (expected) distribution, whereas \(D\) is used to determine whether its input data are real or generated. The two are trained alternately and improve together [5]. Eventually, a \(G\) is obtained that can generate data close to the real data distribution, which is the ultimate goal of the method. Applied to medical imaging, GAN can expand datasets that contain insufficient medical image data, so that deep learning methods can be trained on the expanded datasets. Another very useful feature of GAN for medical image analysis is its adversarial training strategy, which can be applied to image segmentation, detection, or classification.

Compared with other medical image analysis techniques [6], GAN is still in its infancy and the number of related works available in the literature is relatively small, but it has huge potential. The application of GAN to medical images began in 2016, when only one article on the topic was published [7]. Since 2017, there have been more relevant studies. In this article, the studies on GAN in medical images published in the past five years are analyzed and summarized in terms of application direction, methods, and other aspects. The rest of this article is organized as follows (Fig. 1). In the second section, GAN methods commonly applied in the medical image field are described in detail, focusing on their technical characteristics. The third section addresses the main application of GAN in this context, namely medical image synthesis. A classification is proposed according to the different conditions of generation. The fourth section analyzes the application of GAN in medical image data augmentation. In the fifth section, GAN is discussed as a semi-supervised learning method, which mainly operates through feature and annotation sharing. The sixth section describes the functions of GAN that can be extended to other medical tasks. The seventh section discusses technical and non-technical challenges and directions. Finally, the conclusions are summarized in the eighth section.

Fig. 1 Main content of this paper

2 GAN technology

This section starts with the original GAN and then covers the evolution process of GAN when used for image generation. The methods considered are also frequently used in the specific field of medical images. The section emphasizes the overall architecture, data flow, and objective function of GAN, and does not address the network details of specific generators or discriminators.

2.1 Original GAN

The operation of the original GAN is shown in Fig. 2a, where the symbols \(G\) and \(D\) denote neural networks. The input of \(G\) is a random noise vector \(z\), which is sampled from the distribution \(p(z)\). Generally, for consistency and convenience of training, \(p(z)\) is either a Gaussian or a uniform distribution. It should be noted that \(z\) is a low-dimensional vector, whereas images in actual applications are high-dimensional data, so \(G\) learns the mapping from a low-dimensional noise space to a high-dimensional real data space. The inputs to \(D\) include \(G(z)\), the generated fake data, and \(X\), real sample data used to balance the training data. \(D\) is a classifier whose purpose is to judge whether data are real or fake. The purpose of \(G\) is to produce data as close to the real ones as possible, confusing \(D\) so that it cannot distinguish which ones are real and which ones are fake. In this way, \(G\) and \(D\) take part in a dynamic game, improving each other during training. Data generated by \(G\) become more and more realistic, and the recognition rate of \(D\) gradually decreases from an initial value equal (or close) to 1 (perfect discrimination of real and fake data) to (optimally) 0.5, the chance level at which fake data cannot be distinguished from real ones. The optimization functions for \(D\) and \(G\) are as follows:

$$ \begin{gathered} L_{D} { = }\mathop {\max }\limits_{D} E_{x \sim p(x)} \left[ {\log D(x)} \right] + E_{z \sim p(z)} \left[ {\log (1 - D(G(z)))} \right], \hfill \\ L_{G} = \mathop {\min }\limits_{G} E_{z \sim p(z)} \left[ {\log (1 - D(G(z)))} \right] \hfill \\ \end{gathered} $$

Fig. 2 Four common GAN structures

where \(L_{D}\) and \(L_{G}\) represent the loss functions of \(D\) and \(G\), respectively. \(D(x)\) should be close to 1 because \(x\) are real data, whereas \(D\) tries to push \(D(G(z))\) toward 0. The optimization process consists in maximizing \(L_{D}\) and minimizing \(L_{G}\).
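The two objectives can be estimated on a batch by averaging over samples. The following minimal sketch assumes that the discriminator outputs are already available as probabilities; the function names and the example values are hypothetical and only illustrate how the two expectations are evaluated.

```python
import math

def discriminator_objective(d_real, d_fake):
    """Batch estimate of E[log D(x)] + E[log(1 - D(G(z)))], which D maximizes."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_objective(d_fake):
    """Batch estimate of E[log(1 - D(G(z)))], which G minimizes."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs (probability that a sample is real).
confident = discriminator_objective(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
confused = discriminator_objective(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
print(confident, confused)
```

A discriminator that separates real from generated samples well obtains a higher objective value than one that outputs 0.5 everywhere, which is the situation \(G\) tries to bring about.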


2.2 DCGAN

The original GAN is not well suited to generating images, because \(G\) and \(D\) are ordinary fully connected networks, whereas image data are high-dimensional and their distribution is very complex and hard to model. CNNs are more image-friendly than fully connected networks, and deep convolutional generative adversarial networks (DCGAN) have successfully combined CNNs with GAN, resulting in a more suitable solution for image generation [8]. DCGAN also adopts the structure shown in Fig. 2a, except that \(G\) and \(D\) are both replaced by CNNs. GAN training is unstable and prone to mode collapse: generated images may only belong to a few fixed categories, or some strange images may appear. DCGAN proposes a series of techniques to stabilize the training process. \(G\) and \(D\) are fully convolutional networks (FCNs, i.e., CNNs without fully connected layers) that use strided convolutions instead of pooling layers for down-sampling. Batch normalization [9], a data normalization layer that can be embedded in the network to accelerate learning and convergence, is used in both networks, except in the output layer of \(G\) and the input layer of \(D\). Activation functions are also changed with regard to GAN. GAN uses the ReLU activation function [10] (Fig. 3a) for both \(G\) and \(D\), whereas DCGAN uses ReLU for \(G\) and LeakyReLU [11] (Fig. 3b) for \(D\), to prevent gradient sparsity. In addition, the activation function of the output layer of \(G\) is tanh.

Fig. 3 ReLU and Leaky ReLU activation functions
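
The difference between the two activation functions can be sketched in a few lines. The negative slope of 0.2 is an assumption chosen for illustration, not a value taken from this article. The point of the sketch is that ReLU has a zero gradient for negative inputs, whereas Leaky ReLU keeps a small nonzero gradient, which is what counteracts gradient sparsity.

```python
def relu(x):
    """ReLU: passes positive inputs and clips negative inputs to zero."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU (taken as 0 at x = 0)."""
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, negative_slope=0.2):
    """Leaky ReLU: negative inputs are scaled by a small slope instead of clipped."""
    return x if x > 0 else negative_slope * x

def leaky_relu_grad(x, negative_slope=0.2):
    """Derivative of Leaky ReLU (taken as the negative slope at x = 0)."""
    return 1.0 if x > 0 else negative_slope

for value in (-2.0, 0.5):
    print(value, relu(value), leaky_relu(value), relu_grad(value), leaky_relu_grad(value))
```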

2.3 CGAN

GAN uses a random noise vector with a very low dimension to generate high-dimensional image data. This modeling method has too many degrees of freedom. If the noise signal has only hundreds of dimensions but the generated image has thousands of pixels, then controllability will be very poor. Conditional generative adversarial networks (CGAN, Fig. 2b) increase controllability by adding a constraint \(c\) [12], which is fed to the input layers of both \(G\) and \(D\) and guides data generation. The objective function of CGAN is

$$ \begin{gathered} L_{D} { = }\mathop {\max }\limits_{D} E_{x \sim p(x)} \left[ {\log D(x|c)} \right] + E_{z \sim p(z)} \left[ {\log (1 - D(G(z|c)))} \right], \hfill \\ L_{G} = \mathop {\min }\limits_{G} E_{z \sim p(z)} \left[ {\log (1 - D(G(z|c)))} \right] \hfill \\ \end{gathered} $$

where \(c\) can be a label, tags, data from different modalities, or even an image. For example, the prior condition (see Sect. 3.3) used by Pix2pix [13] is a segmentation image or a contour image, so Pix2pix can perform image-to-image transformation. When the prior condition is an image, a loss between conditional and generated images is usually added, so that the generated image has higher authenticity [13]. InfoGAN [14] can also be viewed as a special kind of CGAN. Unlike CGAN, it adds constraints to the random noise \(z\) and uses a regularization term based on mutual information. As part of the network input, \(z\) then controls image generation. For instance, on the MNIST dataset [15], \(z\) controls the thickness, slope, and other characteristics of the generated digits.
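
One simple way to feed a class label \(c\) to both networks is to encode it as a one-hot vector and concatenate it with the other input. The sketch below assumes this concatenation scheme, a 100-dimensional Gaussian noise vector, and 10 classes; all of these are illustrative choices, and the helper names are hypothetical.

```python
import random

def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot list."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def generator_input(z, c):
    """Input of G: noise vector z concatenated with condition c."""
    return list(z) + list(c)

def discriminator_input(x, c):
    """Input of D: flattened (real or generated) sample x concatenated with c."""
    return list(x) + list(c)

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(100)]  # noise sampled from a Gaussian p(z)
c = one_hot(3, 10)                             # condition: class 3 out of 10
g_in = generator_input(z, c)
print(len(g_in))
```

Because the same \(c\) is appended to the input of \(D\), the discriminator can penalize samples that look realistic but do not match the requested condition.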

2.4 CycleGAN

Pix2pix requires paired images, one of them annotated, whose preparation takes a lot of time and implies a high cost. In contrast, CycleGAN [16] proposes a closed-loop network consisting of two generators and two discriminators (Fig. 2c), which performs the conversion between two image domains without the need for paired images. Because of the two generators and discriminators, the overall structure and data flow are more complex than in the previous methods. \(G_{B}\) and \(G_{A}\) perform the transformation from domain A to domain B and from domain B to domain A, respectively, so they act as two mutually inverse mappings. \(G_{B}\) generates images \(X_{{{\text{fB}}}}\) with domain B characteristics from images \(X_{A}\) of domain A, whereas \(G_{A}\) generates images \(X_{{{\text{fA}}}}\) with domain A characteristics from images \(X_{B}\) of domain B. Discriminators \(D_{A}\) and \(D_{B}\) identify images of domains A and B, respectively. The objective function of CycleGAN can be written as:

$$ \begin{aligned} L(G_{A} ,G_{B} ,D_{A} ,D_{B} ) = L_{GAN} (G_{B} ,D_{B} ,X_{A} ,X_{B} ) \\ + L_{GAN} (G_{A} ,D_{A} ,X_{B} ,X_{A} ) + \lambda L_{cyc} (G_{A} ,G_{B} ) \\ \end{aligned} $$

where \(L_{GAN}\) is a regular GAN loss, as described by Eq. (1). After one full loop through both generators, real data should return to their original domain, so \(L_{cyc}\) measures the difference between real data and their cycled reconstruction. \(\lambda\) is a coefficient used to balance the GAN loss and the cycle loss.

$$ \begin{aligned} L_{cyc} (G_{A} ,G_{B} ) = {\text{E}}_{{X_{A} \sim A}} \left[ {\left\| {G_{A} (G_{B} (X_{A} )) - X_{A} } \right\|_{1} } \right] \\ + {\text{E}}_{{X_{B} \sim B}} \left[ {\left\| {G_{B} (G_{A} (X_{B} )) - X_{B} } \right\|_{1} } \right] \\ \end{aligned} $$

Since GAN training easily becomes unbalanced, the two generators and two discriminators in CycleGAN need to be carefully balanced during training. The use of paired images is equivalent to feature filtering, so GAN can easily learn which parts of the images need to be converted. With unpaired images, as in the case of CycleGAN, the training process instead requires huge amounts of data.
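
The cycle loss can be illustrated with toy mappings in place of trained generators. In the sketch below, images are flat lists of pixel values and the generators are hypothetical hand-written functions (domain B is simply domain A shifted by one intensity unit), so the code only demonstrates how \(L_{cyc}\) is evaluated, not how CycleGAN is trained.

```python
def l1_distance(a, b):
    """L1 norm of the difference between two flattened images."""
    return sum(abs(p - q) for p, q in zip(a, b))

def cycle_loss(g_a, g_b, batch_a, batch_b):
    """Batch estimate of L_cyc: A -> B -> A and B -> A -> B reconstruction errors."""
    forward = sum(l1_distance(g_a(g_b(x)), x) for x in batch_a) / len(batch_a)
    backward = sum(l1_distance(g_b(g_a(x)), x) for x in batch_b) / len(batch_b)
    return forward + backward

def g_b(x):
    """Toy generator A -> B: adds one intensity unit."""
    return [v + 1.0 for v in x]

def g_a_exact(x):
    """Toy generator B -> A that exactly inverts g_b."""
    return [v - 1.0 for v in x]

def g_a_biased(x):
    """Toy generator B -> A that only partially inverts g_b."""
    return [v - 0.5 for v in x]

batch_a = [[0.0, 2.0]]
batch_b = [[1.0, 3.0]]
print(cycle_loss(g_a_exact, g_b, batch_a, batch_b))
print(cycle_loss(g_a_biased, g_b, batch_a, batch_b))
```

The loss vanishes only when the two generators invert each other, which is the constraint that replaces pixel-wise supervision when paired images are not available.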


2.5 LAPGAN

Humans usually paint a picture with multiple strokes, so machines can likewise create images in multiple steps. This is the idea behind LAPGAN [17]: GAN does not need to complete the whole task at once, but can generate a full image in several steps. Figure 2d shows a three-stage LAPGAN, the red arrows representing down-sampling and the blue arrows representing up-sampling. The three down-sampling processes can be regarded as a three-layer Laplacian pyramid, and an independent conditional GAN model is trained at each level. Using the multi-scale structure of natural images, a series of generative models is constructed, each one capturing the image structure at a specific scale of the pyramid. The training process is carried out from left to right. The original image \(X_{r1}\) is transformed into \(X_{r1}^{^{\prime}}\) through down-sampling, and \(X_{r1}^{^{\prime}}\) becomes \(X_{r1}^{^{\prime\prime}}\) through up-sampling. Then a residual image is obtained by comparing \(X_{r1}\) with \(X_{r1}^{^{\prime\prime}}\). \(G_{1}\) takes a noise signal \(z_{1}\) as input and \(X_{r1}^{^{\prime\prime}}\) as the condition to generate the residual image. Training in the remaining levels is similar. The LAPGAN test process is shown in Fig. 2e. In this case it is performed from right to left. It is important to note that the target of \(G\) is the residual image, so there is a summation process. Serialization and the use of residual images are the two LAPGAN characteristics that effectively reduce the content and difficulty that GAN needs to learn.
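
The residual and summation steps can be made concrete with a small pyramid built on a 1D signal. The sketch assumes pair averaging for down-sampling and sample repetition for up-sampling, which are simplifications chosen for brevity. The residuals here are computed from the real signal, which corresponds to the training targets; at test time LAPGAN would replace them with generated residuals.

```python
def downsample(signal):
    """Halve the resolution by averaging neighboring pairs."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]

def upsample(signal):
    """Double the resolution by repeating each sample."""
    return [v for v in signal for _ in range(2)]

def build_pyramid(signal, levels):
    """Return the residuals (finest level first) and the coarsest signal."""
    residuals = []
    current = signal
    for _ in range(levels):
        coarse = downsample(current)
        blurred = upsample(coarse)
        residuals.append([a - b for a, b in zip(current, blurred)])
        current = coarse
    return residuals, current

def reconstruct(residuals, coarsest):
    """Go from coarse to fine: up-sample, then add the residual of that level."""
    current = coarsest
    for residual in reversed(residuals):
        current = [a + b for a, b in zip(upsample(current), residual)]
    return current

signal = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 0.0, 4.0]
residuals, coarsest = build_pyramid(signal, levels=3)
restored = reconstruct(residuals, coarsest)
print(coarsest, restored)
```

Each level only has to model the detail that is lost by one down-sampling step, which is why the task assigned to each generator is easier than generating the full image directly.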

3 Medical image synthesis

The most successful application of GAN in medical image analysis to date is medical image synthesis, which can alleviate the problems of insufficient medical images or imbalanced data categories [18, 19]. Traditional data augmentation techniques include image cropping, flipping, and mirroring, among others. These techniques only change the orientation or size of existing data and generate no new data, whereas GAN can generate completely new data. In this section, unconditional synthesis, domain transformation, and other conditional synthesis methods are described according to the conditions used to generate medical images. Figure 4 shows some examples of these applications.

Fig. 4 Medical image synthesis examples. a Unconditional synthesis of brain magnetic resonance images (MRI) [20] and b of skin lesions [21]: synthesized (left) and real (right) images. c Stain normalization by GAN [22]: original (left) and stained (right) images. d MRI image (left) converted into computed tomography (CT) image (right) [23]. e MRI image (left) converted into positron emission tomography (PET) image (right) [24]. f Retinal vessels used as condition to generate color fundus image [25]: original color fundus image (left), retinal vessels map (center), synthesized color fundus image (right). h Blood vessel geometry synthesis [26]

3.1 Unconditional medical image synthesis

Unconditional synthesis was the first method used in medical image analysis. It is usually a tentative method in preliminary studies or when constraints are not available. A random noise is transformed into an image. Since the information contained in random noise is very small compared with that of a large image, the resolution of generated images is usually not very high. DCGAN and LAPGAN are the most commonly used methods for unconditional medical image synthesis.

Medical images of the brain, chest, organs, and fundus are usually taken at very high resolution, and synthesized images must have a reasonable outline, shape, and structure to look realistic enough. Therefore, it is in principle difficult to use unconditional synthesis to generate such complex images. Camilo et al. [20] showed that unconditional synthesis could generate high-resolution images. They used one-dimensional uniform noise as input to \(G\) and generated brain MRI images with a resolution of 220 × 172. In [27], Wasserstein GAN (WGAN) was used to generate 128 × 128 brain MRI slices. Furthermore, for the synthesis of brain MRI slices, a Laplacian pyramid adversarial network was designed and implemented in [28] by building a series of generative models, each one capturing a specific image structure. A low-resolution image is generated first, and more details are then gradually added to it to obtain the high-resolution image. Progressively grown generative adversarial networks (PGGAN) have also been applied in [21], where colored fundus images from 8 to 512 pixels were synthesized. In order to balance the amount of data in different categories, Hojjat et al. [29] proposed to use DCGAN and generated six different categories of chest X-ray (CXR) images.

In contrast with the generation of medical images that contain a whole structure, many researchers have targeted lesion sites, such as tumors or plaques, which are usually smaller and easier to synthesize. Andy and Jarrel [30] used GAN to synthesize images of prostate lesions with a resolution of 16 × 16. In [31], images of benign and malignant pulmonary nodules were generated with a resolution of 56 × 56, and their fidelity even fooled expert doctors. Christoph et al. [32, 33] used DCGAN and LAPGAN to generate images of skin lesions, and their results showed that LAPGAN was more effective. Xin et al. [34] proposed catWGAN, where categorical generative adversarial networks (catGAN) were the primary structure and WGAN the auxiliary network, to generate images of cutaneous melanoma tissue. Maayan et al. [35] synthesized high-quality liver lesion regions of interest (ROIs) using GAN structures and then trained a hepatopathy classification network on both traditionally augmented data and synthetic data. The results showed that synthetic images are fully adequate for classification. Table 1 summarizes published methods for unconditional medical image synthesis. Since unconditional medical image synthesis mostly takes random noise vectors as input, only the model output is given in Table 1. Measures refer to the methods used to evaluate how realistic the synthetic images are.

Table 1 Summary of publications for unconditional medical image synthesis

3.2 Domain transformation

Domain transformation is one of several existing conditional synthesis methods but, because of the many published works that use it and also because it is a special condition, it is described separately in this subsection.

Domain transformation refers to the transformation from one type of image into another. MRI and CT are the most widely used medical images, and domain transformations between them account for nearly half of all published works on domain transformation. MRI is better than CT in terms of imaging, but it requires the patient to lie still for a long time and cannot be used on patients with metal in the body, so some patients are only suitable for CT. On the other hand, CT exposes patients to radiation and thus to a risk of cancer, so some patients are only suitable for MRI. In this context, domain transformation helps doctors make a comprehensive judgment for patients who can only undergo one kind of imaging.

Dong et al. [36] proposed an FCN as generator to transform brain MRI images into CT ones by adopting an adversarial training strategy. The input of the FCN is the MRI image, and the label is the corresponding CT image. Since images are processed in the form of patches, an auto-context model is also used to construct context information. In contrast, [23] used unpaired data, meaning that the MRI and CT images were not taken from the same patient at the same location, and CycleGAN was used to carry out the conversion between unpaired images. Zhao et al. [37] designed a deep-supervision cascade GAN, which they applied to the automatic segmentation of bone structures. The first module in the network is used to generate high-quality CT images from MRI ones, and the second module is used to segment bones from the generated CT images. In [38], a 3D generation network based on CycleGAN was proposed, which transformed 3D cardiovascular volume MRI images into CT images and added a loss function for shape. The generated image is segmented to measure the gap between generated and real data. Thomas et al. [39] also used unpaired cardiac MRI and CT images. Yuta et al. [40] proposed a gradient consistency loss for CycleGAN to place more emphasis on the boundary between joints and muscles, so that the resulting images have clearer boundaries.

PET has also been the subject of many works on image domain transformation. Avi et al. [24] used a VGG16 CNN as generator to transform CT images into PET ones, with a very small dataset consisting of only 17 pairs of training samples and 8 pairs of test samples. Lei et al. [41] addressed the low resolution and low signal-to-noise ratio of synthetic PET images. They proposed a multi-channel GAN, which can capture features with advanced semantic information based on the concept of dual learning. This method produces PET lung cancer images that are very close to real ones. In [42], a sketcher-refiner GAN composed of two CGANs is proposed to predict PET-derived myelin content from multimodal MRI. Karim et al. [43] proposed a new GAN framework (MedGAN) and a new generator (CasNet) for converting PET images into CT ones. MedGAN captures high- and low-frequency components of the target modality through a new combination of adversarial framework and non-adversarial losses. CasNet is inspired by residual neural networks (ResNets), and connects several fully convolutional encoder-decoder networks together as generator. In [44], an FCN is used to transform CT images into initial PET images, which are then improved and refined with CGAN.

Another interesting application of medical image domain transformation is the microscopic examination of human tissue, which requires chemical staining to produce contrast. Similar tissues may differ greatly in appearance depending on the stain, scanner, slice thickness, and experimental environment. This variability of staining poses a great challenge to the automatic analysis of medical images. GAN makes it possible to generate stains automatically or to transform images with different stains into the same style. Aicha and Ghassan [22] proposed a stain normalization method and designed a discriminative image analysis model with a stain normalization component. It can learn the staining properties of a particular dataset, so that different types of images can be normalized before other image tasks are performed (e.g., classification or segmentation). Farhad et al. [45] tested three stain normalization methods, namely GAN, variational auto-encoder (VAE), and deep convolutional Gaussian mixture model (DCGMM), among which DCGMM provided the best results. In [46], the first two components of principal component analysis (PCA) and the average image were combined into a three-channel color image used as the condition, and CGAN was then used to stain lung histology images. Published methods for medical image domain transformation are summarized in Table 2. Table 2 also shows the evaluation metrics of synthetic images. The commonly used metrics include Mean Squared Error (MSE), Structural Similarity Index (SSIM), and Peak Signal to Noise Ratio (PSNR). Visual Information Fidelity (VIF), Universal Quality Index (UQI), and Learned Perceptual Image Patch Similarity (LPIPS) are also adopted in a few works.
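
Two of the metrics mentioned above, MSE and PSNR, are simple enough to sketch directly. The example assumes 8-bit images stored as nested lists (so the maximum pixel value is 255) and uses small hypothetical images; SSIM and the other metrics require more elaborate computations and are not covered here.

```python
import math

def mse(image_a, image_b):
    """Mean squared error between two images given as lists of rows."""
    flat_a = [v for row in image_a for v in row]
    flat_b = [v for row in image_b for v in row]
    return sum((p - q) ** 2 for p, q in zip(flat_a, flat_b)) / len(flat_a)

def psnr(image_a, image_b, max_value=255.0):
    """Peak signal-to-noise ratio in decibels (infinite for identical images)."""
    error = mse(image_a, image_b)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)

# Hypothetical 2 x 2 images: every synthetic pixel is off by 5 gray levels.
real = [[0.0, 250.0], [128.0, 64.0]]
synthetic = [[5.0, 245.0], [133.0, 59.0]]
print(mse(real, synthetic), psnr(real, synthetic))
```

A lower MSE and a higher PSNR both indicate that the synthetic image is closer to the reference, but neither metric reflects perceived structure, which is one reason why SSIM and learned metrics such as LPIPS are reported as well.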

Table 2 Summary of publications for medical image domain transformation

3.3 Conditional medical image synthesis

Several different prior conditions can be used in conditional medical image synthesis. Apart from domain transformation, the segmentation map is the most commonly used prior condition. The segmentation task consists in obtaining the segmentation map from the original image, whereas the generation task consists in regenerating the original image from the segmentation map, possibly with a different style. Therefore, the two can be regarded as inverse tasks.

He et al. [25] proposed a GAN that can synthesize color fundus images based on retinal vessel maps and designed a style extractor based on VGG16. The features of the middle layer of the style extractor were added to GAN in the form of a style loss. Retinal vessel maps are the input of GAN, and a color fundus image of any style can be used as the input of the feature extractor, so that vessel maps are converted into color fundus images. This method produces a variety of color fundus images, but the retinal vessels are fixed. In [47], VAE is used to reconstruct retinal vessels and then to generate color images. In [48], two GANs are used, one to generate retinal vessels from random noise and another to generate color fundus images from them. Hu et al. [49] used the calibrated pixel coordinates of global physical space as a prior condition to generate fetal ultrasound images. Francis and Debdoot [50] used the speckle mapping of a digitally defined phantom as condition, and pathological ultrasound images were then generated by a two-stage GAN. In [51], the generation condition is a real image: the role of GAN is to add different disease features to healthy images in order to generate realistic CXR images with different diseases. Lejmer et al. [26] used an attribute vector containing real sample and synthetic sample information as condition, and then synthesized 3D coronary geometry figures with CGAN. Magnetic resonance angiography (MRA) is an important imaging technique, usually captured for vascular intervention, but the MRA sequence may be missing in patient scans. Sahin et al. [52] proposed an MRA generation method based on T1-weighted and T2-weighted MRI images. Published methods for conditional medical image synthesis are summarized in Table 3.

Table 3 Summary of publications for conditional medical image synthesis

4 Data augmentation

Although there has been a lot of work on medical image synthesis, the ultimate purpose of applying GAN in the field of medical imaging is to improve the performance of models such as classification or segmentation models. Small dataset size and poor image quality reduce the accuracy of deep learning medical image models. To overcome these problems, some researchers use GAN technology for data augmentation, including super-resolution, image denoising, reconstruction, registration, and dataset expansion. In fact, image synthesis can also be regarded as a kind of data augmentation, but because there is a large body of work on medical image synthesis, it is covered in a separate section.

4.1 Super-resolution

Super-resolution technology refers to the generation of high-resolution images from low-resolution ones, to obtain more detailed information. The super-resolution of medical images using GAN technology mainly relies on the capabilities of the generator. GAN-based medical image super-resolution is usually based on pairs of low- and high-resolution images. Almalioglu et al. [53] proposed a framework combining an attention mechanism with CGAN and designed a high-fidelity loss function. This is a weighted hybrid loss function specifically optimized for endoscope images, which synergistically combines the benefits of perception, content, texture, and pixel-based loss description. Ma et al. [54] proposed a GAN-based progressive multi-supervised super-resolution model, whose first stage corresponds to a densely-connected U-Net CNN [55], whereas the generator of the second stage corresponds to a residual-in-residual DenseBlock CNN.

In the field of medical imaging, high-resolution images require better equipment or longer image processing, and the acquisition of paired low- and high-resolution images is difficult. For example, to obtain paired high- and low-resolution CT images, patients would have to undergo multiple CT scans with additional radiation dose, which is obviously not feasible. Therefore, some researchers use unsupervised or semi-supervised GAN methods for super-resolution. Daniele et al. [56] used the cyclic consistency method for microscopic endoscopic images. They designed a deep learning framework for unsupervised training with two special loss functions, in which paired low- and high-resolution images are no longer required. The function of GAN is to convert a low-resolution image in the input domain into an image in any target domain. You et al. [57] incorporated residual learning into the network for feature extraction and recovery. They also enforced cyclic consistency according to the Wasserstein distance, and included joint constraints in the loss function to achieve structure preservation. Das et al. [58] used adversarial learning with cycle consistency and identity mapping priors to preserve spatial correlation, color, and texture details in the generated clean high-resolution images. As can be seen, most existing unsupervised super-resolution methods are based on the cyclic consistency loss.

Deep CNNs require a large number of parameters to be optimized and a large amount of memory. The methods discussed above all focus on 2D images. When reconstruction takes place in a 3D environment, processing times for both training and inference need to be considered [59]. Chen et al. [60] proposed a multistage densely connected network with GAN for 3D brain MRI super-resolution reconstruction. The convolutional layers of the generator are all densely connected, and the main advantage of the network is its high speed. Irina et al. [61] used a combination of least squares dual loss and image gradient as loss function for the generator in GAN-based 3D super-resolution reconstruction, improving the quality of the generated images.

4.2 Denoising

Noise in medical images seriously affects the diagnostic accuracy of doctors. This problem can be alleviated by GAN image denoising capabilities. In CT images, since high doses can harm the patient's health, the past decade has seen a trend towards dose reduction in CT examinations, at the expense of noise appearing in the low-dose images. Yang et al. [62] proposed a GAN with Wasserstein distance and perceptual similarity, which suppresses noise by comparing the perceptual features of a denoised output against those of the ground truth in a given feature space. Wolterink et al. [63] compared three training strategies, namely voxel loss, combined voxelwise and adversarial loss, and adversarial loss. Choi et al. [64] considered the statistical characteristics of CT images and introduced a loss function to incorporate the noise property in the image domain derived from noise statistics in the sinogram domain.

In addition to the use of lower doses, the imaging equipment itself (e.g., portable devices) may also introduce noise. Zhou et al. [65] constructed a two-stage GAN to improve the quality of ultrasonic images and reduce noise. In the training process, a transfer learning method based on plane wave image (PWI) data was introduced to facilitate convergence and eliminate the influence of deformation caused by respiratory activity. Chen et al. [66] proposed an unsupervised learning framework for high-quality pixel-level smoke detection and removal. The detection network is regarded as prior knowledge, and a loss function is used to support the training of the smoke removal network.

4.3 Reconstruction

MRI is a widely used clinical medical imaging method, but one of its main disadvantages is the long acquisition time. During MRI imaging, data samples are not collected directly in the image space, but in the k space. The k space contains spatial frequency information obtained row by row and at any position. Slow acquisition causes interference that may reduce image quality, for instance because of patient movements such as heartbeats or breathing. Compressed sensing-based imaging provides a solution to accelerate the acquisition of MRI images by reconstructing them from a small part of the k space. In theory, assuming that the original data can be compressed, the reconstruction can be performed through nonlinear optimization of randomly under-sampled original data. GAN-based MRI image reconstruction is based on this theory and can be summarized as follows. The generator consists of multiple end-to-end networks. The first one converts a zero-filled reconstructed image into a complete reconstructed image. The following refinement network improves the accuracy of the reconstructed image. Then a discriminator network assesses whether or not the reconstruction is accurate. The works reported in [67, 68, 69] are all based on this framework, whose structure is shown in Fig. 4, the difference being the loss functions used. In order to improve the perceived quality of reconstruction, a content loss is designed for generator training in [67]. This loss includes three parts: pixel mean squared error loss, frequency domain mean squared error loss, and VGG loss. Feature matching loss and penalty are added in [68]. The work in [69] adds a cycle loss, which is a cyclic combination of under-sampled and completely reconstructed images.
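
The zero-filled reconstruction that serves as generator input can be illustrated on a 1D signal. The sketch below assumes a naive discrete Fourier transform as a stand-in for the k space of an MRI scanner and a hypothetical sampling mask that keeps only three low-frequency coefficients; real acquisitions are 2D or 3D and use more elaborate sampling patterns.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform (stand-in for k-space acquisition)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def zero_filled_reconstruction(signal, mask):
    """Keep only the sampled k-space coefficients, set the rest to zero, and invert."""
    k_space = dft(signal)
    sampled = [value if keep else 0.0 for value, keep in zip(k_space, mask)]
    # The real part is used because the toy signal is real-valued.
    return [v.real for v in idft(sampled)]

def mean_abs_error(a, b):
    """Mean absolute difference between two signals."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

signal = [0.0, 1.0, 4.0, 9.0, 4.0, 1.0, 0.0, 0.0]
full_mask = [True] * 8
low_mask = [True, True, False, False, False, False, False, True]  # k = 0, 1, 7 only

full = zero_filled_reconstruction(signal, full_mask)
under = zero_filled_reconstruction(signal, low_mask)
print(mean_abs_error(full, signal), mean_abs_error(under, signal))
```

With all coefficients sampled the signal is recovered up to rounding error, whereas the under-sampled version is visibly degraded. Removing this degradation is the task assigned to the generator in the framework described above.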

4.4 Registration

To obtain accurate pathological information during medical diagnosis, several images of the same body part are usually taken, and it is often necessary to analyze them quantitatively at the same time. These images need to be strictly aligned, which is called image registration. It requires a spatial transformation of the images, so that corresponding points in the different images are spatially consistent. In [70], a constrained CNN replaced heuristic smoothness measures of displacement fields: the generator is the registration network, and the discriminator distinguishes the dense displacement field predicted by the generator from motion data simulated with the finite element method. During training, the registration network maximizes the similarity between anatomical labels and minimizes the difference between measured and simulated deformation. The generator in [71] generates conversion parameters between fixed and moving images. Unlike in [70], the discriminator is not used to assess conversion parameters, but to determine whether or not the processed moving image has completed registration. The work in [72] used CGAN for multimodal registration. By adding appropriate terms to the loss function of image generation, the generated output image has the same features as the moving image. Christine et al. [73] converted MRI images into CT ones with CycleGAN, and then used single-modality image similarity measurements for registration.

It can be seen that when GAN is applied to medical image registration, the generator no longer generates images, but it is more like a parameter fitting machine, responsible for fitting registration parameters. The discriminator may serve two purposes: to assess whether a set of parameters is suitable for the target system, or whether or not images processed with this set of parameters are registered.
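To make the role of the registration parameters concrete, the sketch below applies a dense displacement field to a moving image and checks the result against the fixed image. It is only an illustration: the displacement field is set by hand rather than predicted by a generator, nearest-neighbour sampling is an assumed simplification, and a plain mean squared difference stands in for the discriminator's judgment of whether the images are registered.

```python
import numpy as np

def warp_nearest(moving, displacement):
    """Warp a 2D image with a dense displacement field of shape (2, H, W).

    Output pixel (y, x) samples the moving image at (y + dy, x + dx),
    rounded to the nearest pixel and clamped to the image border.
    """
    h, w = moving.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.rint(ys + displacement[0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + displacement[1]).astype(int), 0, w - 1)
    return moving[src_y, src_x]

def mean_squared_difference(a, b):
    """Simple similarity measure used here in place of a discriminator."""
    return float(np.mean((a - b) ** 2))

# Fixed image: a bright square. Moving image: the same square shifted by (1, 2).
fixed = np.zeros((8, 8))
fixed[2:5, 2:5] = 1.0
moving = np.roll(fixed, shift=(1, 2), axis=(0, 1))

# Hand-set displacement field that undoes the shift (hypothetical generator output).
displacement = np.stack([np.full((8, 8), 1.0), np.full((8, 8), 2.0)])
warped = warp_nearest(moving, displacement)

print("before registration:", mean_squared_difference(moving, fixed))
print("after registration: ", mean_squared_difference(warped, fixed))
```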

4.5 Dataset expansion

As a data expansion technology, GAN can generate medical images that look realistic under visual inspection [74]. In this section, the influence of synthetic images on the accuracy of deep learning models is analyzed.

The number of medical images with annotations is usually small. GAN can synthesize images with specific labels to expand datasets. Diaz-Pinto et al. [75] proposed an accurate method for glaucoma assessment based on GAN and semi-supervised learning. The system is able not only to generate synthetic images but also to label them automatically. Medical datasets tend to be highly imbalanced due to privacy concerns and the lack of data sharing between medical institutions. Salehinejad et al. [76] implemented a DCGAN to create synthetic CXR images based on a medium-sized labeled dataset. A combination of real and synthetic images was used to train deep CNNs to detect the etiology of five types of CXR images. The results show that these networks outperform similar ones trained only with real images. It is doubtful that every synthetic image is suitable for the training process, so efficient selection algorithms for synthetic images are also necessary. Xue et al. [77] proposed a CGAN to generate histopathological images based on classification labels. They also designed a synthetic sample selection algorithm that compares the features of real and synthetic images, where the ground truth label of the real images matches the conditional label used to generate the synthetic images. Only synthetic images that are close enough to the real-image centroid in the feature space are selected.
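The centroid-based selection idea can be sketched as follows for a single class label. This is a simplified illustration and not the algorithm of [77]: the feature vectors are assumed to come from some pretrained feature extractor, and the Euclidean distance and the fixed threshold `max_distance` are assumptions.

```python
import numpy as np

def select_synthetic(real_features, synthetic_features, max_distance):
    """Return indices of synthetic samples close to the real-image centroid.

    Both arrays have shape (n_samples, n_features) and belong to one class:
    the real images carry the label used to condition the synthetic ones.
    """
    centroid = real_features.mean(axis=0)
    distances = np.linalg.norm(synthetic_features - centroid, axis=1)
    return np.flatnonzero(distances <= max_distance)

# Toy 2D features standing in for CNN embeddings (made-up values).
real = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
synthetic = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0]])

kept = select_synthetic(real, synthetic, max_distance=1.0)
print("selected synthetic samples:", kept)
```

In a multi-class setting the same function would be called once per label, so that each synthetic image is compared only with the centroid of its own class.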

5 Semi-supervised learning

5.1 Feature sharing

The classification process starts with extracting appropriate features from the samples, which must differ among data categories. Features form a feature space in which sample points of different categories are separated. Image classification based on deep neural networks mainly uses CNNs to extract features and the last layer to make classification decisions, and the whole process is learnt automatically. Although the generator's task is to generate images and the discriminator's task is to distinguish real from fake, both are essentially CNN structures with abundant hidden-layer features. These can be used directly for classification or as auxiliary features for classification. More importantly, both generator and discriminator are trained under unsupervised or semi-supervised conditions, which builds the relationship between different image domains. Generator and discriminator features thus make semi-supervised or unsupervised learning possible in other medical image tasks.

The generator usually transforms images between domains or reconstructs images, so its features need to retain sufficient image structure. Xie et al. [78] proposed a semi-supervised adversarial classification model for the classification of benign and malignant pulmonary nodules. The model takes an autoencoder-based unsupervised reconstruction network as generator, and the remaining components include a supervised classification network, a discriminator and a learnable transition layer, which applies generator features to the classification network. Yuan et al. [79] proposed a 3D unified GAN, which unifies any-to-any modality translation and multimodal segmentation in a single network. Since the anatomical structure is preserved during modality translation, the auxiliary translation task is used to extract modality-invariant features and implicitly generate additional training data.

The discriminator is used to identify the domain of generated images, so its features retain domain information. Hu et al. [80] proposed a unified GAN architecture to perform cell-level unsupervised learning. During training, the generator and discriminator learn a distribution that matches that of real data. Then an auxiliary network is used to maximize the mutual information between selected random variables and the generated samples. The weights of the auxiliary network and discriminator heads are shared. Wang et al. [81] proposed a multi-path GAN. The multichannel generator generates a series of sub-fundus images corresponding to scattered diabetic retinopathy features. For input features, each discriminator produces an independent classification result. Results from multiple discriminators are weighted to determine the diabetic retinopathy level, and feature matching between multiple discriminators is implemented.

5.2 Annotations sharing

Most deep learning methods are supervised, and image labeling is a difficult task [82]. Because the number of real medical images in training sets is very limited, it is impossible for deep learning models to learn all types of images from them. To address the problem that supervised algorithms perform poorly on data they have not been exposed to before, GAN can implement semi-supervised learning through annotation sharing. Specifically, annotated datasets are used to train supervised deep learning models. Unannotated images are then mapped through domain transformation into the annotated image domain, so that the supervised deep learning models can be applied to all images in that domain.

Gadermayr et al. [83] used CycleGAN to convert between unstained and stained kidney histological images. Based on this, they developed a completely unsupervised segmentation method that relies on image-to-image transformation to obtain color-independent intermediate representations. Chen et al. [84] constructed a cross-modal GAN that learns the mapping between CT and MRI, together with an MRI segmentation network, so that unlabeled CT images are converted into MRI ones for segmentation. They also proposed a neighborhood-based anchoring method to reduce the ambiguity in cross-modal synthesis. In 3D CT scans, organs remain more clearly structured. Zhang et al. [85] proposed a GAN to synthesize 3D CT scan images from X-ray ones, and then used a multi-organ segmentation network for segmentation. Konstantinos et al. [86] proposed a domain adaptive multi-connected adversarial network, where different data types are treated as different domains, making features learnt by segmentation independent of domain-specific factors. Good adaptability was shown with two different brain MRI databases. The work in [87] also uses domains to solve segmentation problems caused by inconsistent data, by means of a network that transfers specific image styles. An unannotated color fundus image dataset was converted to the style of an annotated dataset. In this way, the segmentation network trained on annotated datasets can be used to segment unannotated images.

6 Function expansion of GAN

6.1 Extended generator and discriminator

The adversarial learning process of the generator and discriminator produces a large number of advanced semantic features that can be extended to other tasks. The applicability of extended generators and discriminators is not limited, respectively, to image synthesis and to the classification of true and fake images.

Das et al. [88] proposed a generalizable classifier using adversarial learning between generator and discriminator to predict progressive retinal diseases such as age-related macular degeneration and diabetic macular edema. Gu et al. [89] proposed a transfer recurrent feature learning framework for probe-based confocal laser endomicroscopy (pCLE) video classification tasks. In a first phase, the discriminator features of pCLE images are learnt by GAN. In a second phase, discriminator features are applied to a recurrent neural network (RNN) to distinguish between true and false data and lesion grade. It can be seen that the discriminator is mainly expanded into a multiclass classifier.

Some researchers suggested using generators to segment images [87, 90]. In conditional GANs, the generator is usually an ordinary U-net structure, which is well suited to segmentation. For example, in tumor segmentation [91] the generator is used to generate segmentation maps, while the discriminator has two inputs, namely the predicted and real tumor masks, and the loss between masks is used. The work in [92] emphasized adversarial training and proposed an energy-based autoencoder as discriminator. Prior knowledge of spine shape is also applied to the network through adversarial learning. Wu et al. [93] proposed an automatic left ventricle segmentation method, which consists of a multi-scale segmentation network and two shared discriminator networks. Two different types of images are alternately applied to the segmentation network to achieve semi-supervised training. Lei et al. [94] suggested using generators to segment skin lesions and designed two discriminators. One of them analyzes the difference between the generated segmentation mask boundary and the ground truth, and the other detects the contextual environment of target objects in the original image.

6.2 Prediction

The prediction task is important in the medical field, because it can provide information about the evolution and estimation of a patient's condition. The time series prediction models commonly used in deep learning include RNN and long short-term memory (LSTM), among others, but these models are more suitable for time series signal vectors, whereas medical images have higher dimensions. GAN provides the possibility of image-level prediction by synthesizing the image at the next time step from the image at the current time step.

Rachmadi et al. [95] proposed a disease evolution predictor (DEP) model to predict the evolution of white matter hyperintensities from baseline to follow-up (i.e., one year later). The DEP model takes a baseline image as input to generate a disease evolution map representing the evolution of the disease. In order to simulate the non-deterministic and unknown parameters involved in the evolution process, a Gaussian noise vector is added to the DEP model as an auxiliary input, which forces the DEP model to simulate a wider range of prediction results. Elazab et al. [96] proposed a stacked 3D GAN for predicting glioma growth. The generator is designed based on a modified 3D U-Net architecture with skip connections to combine hierarchical features. Segmented feature maps are used to guide the generator to generate better images. Wei et al. [97] proposed Sketcher-Refiner GANs to predict demyelination. A first generator generates global anatomical and physiological information. A second one extracts and produces tissue myelin content. Zhao et al. [98] proposed a framework for predicting the progression of Alzheimer's disease, which includes a 3D multi-information GAN (mi-GAN) and a 3D DenseNet-based multi-class classification network. The generator predicts what a whole brain will look like after a given time interval, and a classification network determines the clinical stage of the brain.

6.3 Pseudo-healthy synthesis

The task of pseudo-healthy synthesis is to create a healthy image from a pathological one. Synthetic healthy images can be used to detect anomalies and understand changes caused by pathology and disease. Tang et al. [99] proposed a deep disentangled generative model (DGM) to generate both abnormal disease residual maps and healthy chest X-ray images from real abnormal chest X-ray images. The DGM consists of three encoder-decoder blocks: the first is used for healthy chest X-ray image synthesis, the second for generating residual maps describing potential lesion areas, and the third for facilitating the training process and enhancing the model's robustness against noisy data. Sun et al. [100] proposed an abnormal-to-normal translation GAN (ANT-GAN), which generates healthy medical images based on their abnormal-looking counterparts without the need for paired training data. Xia et al. [101] proposed a model based on CycleGAN, in which the training data used images of healthy domains from different unrelated datasets. A pathological image is first disentangled into a corresponding pseudo-healthy image and a pathology segmentation. In the reconstruction network, the pseudo-healthy image is further combined with the segmented image to reconstruct the pathological image. Results for optical coherence tomography in the retina show that this method can correctly identify images containing retinal fluid or highly reflective lesions [102]. The work in [103] adds a simple and efficient constraint to better map abnormal to normal, putting forward stronger requirements for the generator. Domain transformation is usually a change in the overall style of an image, but in this method the generator can only convert the abnormal part. Therefore, it needs to be highly sensitive to the characteristics of abnormal parts.
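One simple way in which a pseudo-healthy image can support anomaly detection is to inspect the residual between the pathological input and its synthetic healthy counterpart. The sketch below assumes that a pseudo-healthy image is already available from some generator; the absolute difference and the fixed threshold are assumptions made for illustration and are not the residual maps of [99], which are produced by a dedicated network.

```python
import numpy as np

def residual_anomaly_map(pathological, pseudo_healthy, threshold):
    """Return the absolute residual and a binary mask of suspicious pixels."""
    residual = np.abs(pathological.astype(float) - pseudo_healthy.astype(float))
    return residual, residual > threshold

# Toy example: the "lesion" is a bright 2x2 patch absent from the healthy image.
pseudo_healthy = np.zeros((4, 4))
pathological = pseudo_healthy.copy()
pathological[1:3, 1:3] = 1.0

residual, mask = residual_anomaly_map(pathological, pseudo_healthy, threshold=0.5)
print("number of flagged pixels:", int(mask.sum()))
```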

7 Key challenges and future research directions

7.1 Technology challenges and directions

1) Evaluation metrics. The first challenge is the lack of metrics to evaluate the quality of synthetic images. At present, medical image generation has proven to be visually effective, but the quality of generated images is still assessed with traditional evaluation indexes, such as MSE, PSNR, and SSIM, whose mathematical expressions are shown in Table 4. MSE and PSNR only describe the difference between pixels, without considering the visual characteristics of the human eye. SSIM only measures image similarity in terms of brightness, contrast and structure. These evaluation indexes are not objective enough to evaluate the quality of medical image generation [104]. Moreover, it is not necessarily reasonable to evaluate synthetic images by similarity. For example, if a color fundus image of one style is used as the condition to generate an image of another style, the images of the two styles are obviously not similar, and more advanced and abstract metrics are required to compare them [105]. Existing works often rely on physician evaluation to make up for the lack of indicators, or evaluate a GAN through the performance of its downstream tasks.
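For reference, the three indexes named above can be computed as in the following sketch, which uses their standard textbook definitions. Two simplifications are assumed: SSIM is evaluated once over the whole image instead of being averaged over local windows, as is common in practice, and the dynamic range `max_value` defaults to 255 for 8-bit images.

```python
import numpy as np

def mse(x, y):
    """Mean squared error between two images of equal shape."""
    return float(np.mean((x.astype(float) - y.astype(float)) ** 2))

def psnr(x, y, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(x, y)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / err))

def ssim_global(x, y, max_value=255.0, k1=0.01, k2=0.03):
    """SSIM computed over the whole image as a single window (simplification)."""
    x = x.astype(float)
    y = y.astype(float)
    c1 = (k1 * max_value) ** 2
    c2 = (k2 * max_value) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)

# Toy images: y is x with a constant brightness offset (made-up values).
x = (np.arange(16).reshape(4, 4) * 10).astype(float)
y = x + 5.0
print("MSE :", mse(x, y))
print("PSNR:", psnr(x, y))
print("SSIM:", ssim_global(x, y))
```

All three indexes compare a generated image with a reference pixel by pixel or through global statistics, which is why they say little about style-translated images that are not expected to resemble their source.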

Table 4 Image similarity evaluation indexes

Loss functions drive networks during training, whereas evaluation indexes check the quality of generated data during testing, so the two are based on similar ideas. GAN loss functions and evaluation indexes more suitable for medical imaging are also a trend for the future. Unlike natural images, medical ones are often complex in structure, and the details are likely to contain important pathological information. Moreover, there are significant differences between medical images of different types. In GAN's design process, more medical a priori knowledge can be added, which will be specifically reflected in network structures and loss functions. For example, in [92] the spine is labeled using BtrflyNet with sagittal and coronal information.

2) Semi-supervised methods. Traditional supervised methods are useful when training with a large amount of labeled data. However, although the learning effect is good, the process of collecting and labeling large datasets is time-consuming, expensive, and error-prone [106]. GAN makes weakly supervised and unsupervised learning possible for medical images. In this paper, the application of GAN to semi-supervised learning is considered from two aspects, namely feature sharing and annotation sharing, whose advantages and defects are analyzed.

Feature sharing transfers generator features to deep learning models of unannotated images. As proposed in [79], generators are usually assumed to preserve certain structural features. However, it cannot be verified which structural features are preserved, and they cannot be matched to specific categories. One potential solution is to design feature decoupling modules that define feature structures when generator features are transferred to other tasks.

3) Domain selection. In addition to the domain transformation described in Sect. 3.2, many problems can also be abstracted into domain transformation problems, whose pure mathematical description is to map data from one kind of distribution to another, the generator being the mapping between two distributions. Existing works mainly focus on how to do the mapping. Actual medical images are certainly not perfect enough to strictly follow a given distribution, but data of the same type are usually considered to belong approximately to the same distribution. As a result, little attention is paid to the distribution of the original data. It is unreasonable to assume that all original data coming from the same device, or of the same type, follow the same distribution.

In the authors’ opinion, introducing domain selection into the GAN task is a good solution to this problem. Domain selection can be implemented in the data preprocessing stage. By calculating distances within the data distribution, the data domain can be determined. It is also feasible to reclassify the data domain by clustering methods. In addition, it is possible to integrate domain selection into the GAN model. Existing generators are usually autoencoder structures, and domain selection can be integrated into GAN by extracting the features of the autoencoder and then performing cluster analysis on them.
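The clustering-based variant of domain selection can be sketched as follows. The feature vectors are assumed to be bottleneck features already extracted from an autoencoder-style generator; plain k-means with a deterministic farthest-point initialization is an assumed choice of clustering method, and the toy feature values are made up.

```python
import numpy as np

def init_centers(features, k):
    """Farthest-point initialization: start at sample 0, then repeatedly
    add the sample farthest from all centers chosen so far."""
    centers = [features[0]]
    for _ in range(1, k):
        dists = np.min(
            [np.linalg.norm(features - c, axis=1) for c in centers], axis=0
        )
        centers.append(features[int(dists.argmax())])
    return np.array(centers, dtype=float)

def assign_domains(features, k, n_iter=20):
    """Cluster feature vectors with k-means; each cluster is treated as a domain."""
    centers = init_centers(features, k)
    for _ in range(n_iter):
        # Distance of every sample to every center, shape (n_samples, k).
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels, centers

# Toy encoder features from two hypothetical acquisition settings.
features = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [10.0, 10.0], [10.1, 10.0], [10.0, 10.1],
])
labels, centers = assign_domains(features, k=2)
print("domain label per image:", labels)
```

Images sharing a label would then be treated as one domain when the GAN is trained, instead of relying on device or modality metadata alone.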

4) Simulation. GAN's application in medical imaging mainly focuses on laboratory work and has not yet entered the clinical stage. To enter it, improving a certain technology or a certain index is not enough; a more practical application background is also required. For example, GAN for surgical simulation is a good combination. Generated images, particularly 3D ones, can not only guide actual surgery but also simulate it. More communication and cooperative research with doctors are needed in this context.

5) Mode collapse. GAN is prone to mode collapse during training, which leads to generated images with a strange appearance or to the generation of only a certain type of image. The causes of mode collapse are various, including unstable and diminishing gradients, convergence failure, and generator dominance. Since the basic structure of GAN is a neural network, diminishing gradients may appear in the shallow layers during training, which means that the parameters of the shallow layers are updated slowly and do not keep pace with those of the deep layers. For some difficult image synthesis tasks, GAN may not be able to converge. Generator dominance means that, although the generator and discriminator are trained alternately, the generator may come to dominate. GAN still lacks a precise mathematical proof of adversarial learning and pattern transformation. There is no fundamental solution to mode collapse, and designers can only use more advanced techniques and carefully control the training process. Mode collapse is particularly likely for medical images, which have high dimensions and complex patterns.

7.2 Non-technology challenges and directions

1) Privacy. The collection of medical images for scientific research requires patient consent. It is not clear whether generated images, or datasets generated from them, are to be considered original data or new data, and therefore whether they should be subject to patient consent or not. The legality of new data is also uncertain. Some applications of GAN, such as domain transformation, may even expose more of patients' private information than the original images. Therefore, for the application of new technology, not only its feasibility but also ethics and law must be considered.

2) Image confidence. In the field of medical imaging, the interpretation of an image may affect the life of the patient, so many technologies that work well for similar purposes in other areas may not be applicable in the medical field. Sometimes even a normal medical image is not fully trusted by doctors, and multi-level detection is still needed. In this context, there is currently no reason for doctors to trust images generated by GAN. Cohen et al. [107] questioned the medical images generated by GAN, which may lead to misjudging the medical condition of patients. They trained a CycleGAN to convert normal brain MRI images to brain MRI images with tumors. In fact, the images generated by their network are visually realistic, but without tumors. There are many possible reasons behind this. For instance, the generalization performance of a well-trained model may be poor, or the transformations between some data domains cannot be accurately carried out. Attention should be paid to this issue, but this does not mean that all GANs will lead to misdiagnosis.

3) Datasets. Although there are many publicly available datasets, most of them were created not for use with GAN, but for other medical tasks. The quality of existing medical datasets is uneven, and some are old and scattered. For some tasks, such as the transformation between MRI and CT images, it is difficult to find relevant images at sufficient scale. Most researchers collect them by themselves through hospitals.

8 Conclusion

Focusing on GAN for medical imaging, this paper summarizes commonly used GAN methods, medical image synthesis and the function of adversarial learning in other medical image tasks. The relevant papers in the area published in the last five years are reviewed. The challenges of datasets, training methods, reliability, and legality are pointed out. Future directions of unsupervised learning, breakthroughs in clinical needs, and the need for GANs more suitable for medical imaging are also discussed. In general, the existing medical image synthesis technology is highly reliable, and the combination of GAN with other medical image models also produces good results. It can be concluded that GAN has great potential and good development prospects in medical imaging. In fact, the whole development trend of artificial intelligence is towards unsupervised (deep) learning.