1 Introduction

Medical imaging is an essential part of the disease diagnosis process. Medical images are classified and segmented by experts to diagnose disorders and, in some situations, anticipate how diseases will progress. Medical image processing was previously done manually by specialists. However, with the rapid advancement of imaging equipment, a substantial volume of medical images is generated daily, which takes time to process and requires specific domain knowledge [1]. Therefore, there is an increasing need to automate this process. The most successful approach to automating medical image processing has been to treat it as a deep learning (DL) task and solve it by training a deep neural network (DNN) model. The performance of a DL model can be improved by training it on a sufficiently large and correctly labeled dataset. Furthermore, if the aim is to classify diseases, the model must be trained on correctly labeled classes of comparable size. In other words, diversity of the data is necessary to obtain better generalizability of the developed model; otherwise, the model will perform poorly [2,3,4].

There are many challenging obstacles to developing an efficient DL model for medical image processing. The first is obtaining sufficient data, which depends on many factors, such as sharing data between healthcare facilities and researchers while maintaining the confidentiality of patients’ data. Second, the lack of labeled data is a significant impediment to developing reliable image segmentation and classification models, because annotating medical images by hand is difficult, time-consuming, and inconsistent across imaging modalities. Third, most datasets are imbalanced because disorders usually occur with much lower frequency than non-diseased conditions, which makes developing an automated diagnosis system much more difficult. As a result, researchers began to consider enriching medical image datasets in order to create good classification or segmentation models. One technique that adds more information to the original dataset is data augmentation, which includes: (1) basic image augmentation (e.g., rotation, flipping, cropping, and color space transformations); and (2) deep learning approaches (e.g., Generative Adversarial Networks (GANs) and Neural Style Transfer) [5]. The basic augmentation methods are limited in the amount of data they can generate and depend entirely on the original dataset. In contrast, deep learning approaches, specifically GANs, are capable of generating a wide range of diversity in the data independently of the original dataset [5].
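
As a brief illustration of the basic (non-generative) augmentation operations listed above, the following is a minimal sketch using torchvision transforms; the crop size and jitter strengths are illustrative placeholders rather than values taken from the reviewed studies.

```python
import torchvision.transforms as T

# Minimal sketch of basic image augmentation: rotation, flipping, cropping,
# and a color-space perturbation. Parameter values are illustrative only.
basic_augmentation = T.Compose([
    T.RandomRotation(degrees=15),                   # small random rotation
    T.RandomHorizontalFlip(p=0.5),                  # random horizontal flip
    T.RandomResizedCrop(size=224),                  # random crop, resized back
    T.ColorJitter(brightness=0.2, contrast=0.2),    # color-space perturbation
    T.ToTensor(),
])

# Typically applied on the fly during training, e.g. with an ImageFolder dataset:
# dataset = torchvision.datasets.ImageFolder("path/to/images", transform=basic_augmentation)
```

Because each transformed image is derived from an existing one, such operations cannot introduce genuinely new anatomical variation, which motivates the generative approaches reviewed here.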

In this work, we systematically review the use of GANs in medical image augmentation and the extent to which they improve the performance of downstream DL models that use the augmented dataset to solve either classification or segmentation tasks. In our review strategy, we take into consideration several factors, including: (1) the type of GAN used for augmentation; (2) the medical image modality (e.g., MRI, CT scan) used as input to the GAN model; (3) the purpose of using the augmented dataset; and (4) the models used to evaluate the augmented dataset.

In this review, we aim to explore how GAN-based architectures are used in medical image augmentation. To achieve this aim, we defined the following research questions:

  • RQ1: What are the GAN-based architectures used in medical image augmentation?

  • RQ2: What were the different medical image modalities that used GAN in image augmentation?

  • RQ3: What were the body organs associated with the medical image modalities?

  • RQ4: What was the targeted task following GAN-based image augmentation?

  • RQ5: How was the performance of the GAN model for augmentation evaluated in the retrieved articles?

Following this introduction section, Sect. 2 provides the background and related work. Section 3 describes the research methodology. Then, Sect. 4 presents the results obtained in the order of the research questions. Section 5 then presents the discussion of our results. Finally, in Sect. 6, we cover the limitations and reliability of our work, followed by the conclusion.

2 Background and related work

We explain GANs and the different architectures in Sect. 2.1. In Sect. 2.2, we discuss the evaluation methods used by the primary articles to evaluate the GAN-based augmentation models. Then, in Sect. 2.3, related reviews are presented.

2.1 GAN architecture

GANs, proposed by Goodfellow et al. [6] in 2014, are deep neural networks that produce realistic samples. A GAN model consists of a generator network G and a discriminator network D. The generator is the generative model responsible for producing fake samples that follow the distribution of the real samples, while the discriminator learns to distinguish between real samples and fake samples produced by the generator. The dynamics between the generator and the discriminator are what is known as adversarial learning, where each network is trained to make the other fail; for example, the generator tries to fool the discriminator into judging a fake sample as real. Training is done using backpropagation: the generator maps a random distribution into one that matches the distribution of the real data, and the discriminator evaluates the differences between the fake and real data distributions, with the error backpropagated to the generator when the discriminator detects that a sample is fake. After some time, the two models reach a balance, with the generator producing images that look extremely realistic and are so close to the real data distribution that the discriminator can achieve only about 50% accuracy, i.e., no better than random guessing.

Several variations on the GAN architecture have been proposed for medical image synthesis, with the most popular and common being conditional GAN (cGAN), deep convolutional GAN (DCGAN), cycle-consistent GAN (CycleGAN), auxiliary classifier GAN (ACGAN), Wasserstein GAN (WGAN), information-maximizing GAN (InfoGAN), Pix2Pix, StyleGAN, and progressive growing GAN (Progressive GAN or PGGAN). cGAN [7] adds conditional information, such as class labels, to the generator and discriminator, which improves the generation of detailed features and of imbalanced classes. In DCGAN [8], the generator and discriminator both use a deep convolutional network architecture. CycleGAN [9] uses two GANs to transform images from one domain into another and back to the original domain, solving the problem of image-to-image translation. ACGAN [10] extends cGAN in that, rather than adding class labels to the discriminator, the discriminator is made to predict the class label of an image in addition to the probability of that image being real or fake. WGAN [11] uses the Wasserstein loss function to improve the learning stability of the GAN. InfoGAN [12] allows the learning of meaningful and interpretable representations of images using latent codes (e.g., producing images with different rotations or widths). Pix2Pix [13] is another variation of cGAN that solves image-to-image translation by generating an image conditioned on the input image, where the discriminator determines whether the produced image is a plausible transformation of the input image. StyleGAN [14] allows control of smaller-scale features of the image without altering other levels of detail. Progressive GAN [15] progressively increases the depth and number of layers in the generator and discriminator throughout the training process, allowing the model to learn finer details and increasing the quality and stability of the GAN.
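
To make the adversarial training dynamic described above concrete, the following is a minimal PyTorch-style sketch of a single vanilla-GAN training step; the tiny fully connected networks, latent dimension, and learning rates are assumptions for illustration and do not correspond to any model in the reviewed articles.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: tiny fully connected networks stand in for any
# generator/discriminator pair (e.g., the convolutional ones used by DCGAN).
latent_dim = 100
img_dim = 28 * 28  # assumed flattened image size

G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real_images):
    """One adversarial update; real_images has shape (batch, img_dim)."""
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # 1) Discriminator: learn to label real samples 1 and generated samples 0.
    z = torch.randn(batch, latent_dim)
    fake_images = G(z).detach()                     # block gradients into G
    d_loss = bce(D(real_images), real_labels) + bce(D(fake_images), fake_labels)
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # 2) Generator: produce samples the discriminator labels as real.
    z = torch.randn(batch, latent_dim)
    g_loss = bce(D(G(z)), real_labels)
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```

At equilibrium the discriminator outputs roughly 0.5 for both real and generated samples, which corresponds to the 50% accuracy mentioned above.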

Table 1 compares the main commonly used GAN architectures. A number of tools are available for building a given GAN architecture; these are summarized in Table 2.

Table 1 Commonly used GAN architectures
Table 2 A summary of GAN building tools
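
To illustrate the conditioning mechanism that distinguishes cGAN (the architecture most frequently reported in the articles reviewed in Sect. 4.1) from the vanilla GAN sketched above, the snippet below shows one common way of injecting a class label into the generator; the embedding size, layer widths, and two-class setting are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Sketch of a cGAN generator: the class label is embedded and concatenated
    with the noise vector, so samples of a requested class can be generated."""

    def __init__(self, latent_dim=100, n_classes=2, img_dim=28 * 28):
        super().__init__()
        self.label_embed = nn.Embedding(n_classes, n_classes)  # label -> dense vector
        self.net = nn.Sequential(
            nn.Linear(latent_dim + n_classes, 256),
            nn.ReLU(),
            nn.Linear(256, img_dim),
            nn.Tanh(),
        )

    def forward(self, z, labels):
        return self.net(torch.cat([z, self.label_embed(labels)], dim=1))

# Usage: synthesize extra samples for an under-represented class (here label 1).
g = ConditionalGenerator()
z = torch.randn(16, 100)
fake_minority = g(z, torch.ones(16, dtype=torch.long))
```

Conditioning in this way lets the practitioner request samples of a specific, possibly under-represented class, which is one reason cGAN is attractive for rebalancing medical image datasets.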

2.2 Evaluation methods

Since the losses used to train GAN models do not directly measure sample quality, there is no way to objectively assess training progress and model quality from the losses alone [16]. Various methodologies have therefore been developed to assess the quality and diversity of the synthetic images created by GANs, and they fall into two broad categories: qualitative and quantitative. In qualitative evaluation, practitioners visually assess the images created by the GAN with respect to the target domain. Although manual assessment is the most basic form of model evaluation, it has several drawbacks: it is subjective and influenced by the reviewer’s expectations about the model, its configuration, and its goal; it requires an understanding of what is and is not plausible in the target domain; and the number of images that can be reviewed in an acceptable amount of time is limited [17]. As a result, many quantitative methods have been developed that evaluate the GAN output either directly or indirectly. In direct evaluation, the augmentation model itself is evaluated. One method is to quantify experts’ opinions on synthetic images using a numerical score. A second method is to compare the generated images to the original images at the pixel level. Another method is to compare the data distribution of the original dataset with the data distribution of the new dataset after adding the synthetic images. The other type of quantitative method evaluates GANs indirectly, by comparing the performance of downstream models (e.g., classification or segmentation models that use the augmented dataset) with and without the synthetic images, which reflects the quality and diversity of the GAN output. Table 3 lists all qualitative and quantitative methods.
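
As an example of a direct quantitative metric that compares the distribution of the original data with that of the synthetic data, the sketch below computes the Fréchet distance between two sets of feature vectors (the Fréchet Inception Distance, when the features come from a pretrained Inception network); feature extraction is assumed to have been done beforehand, and the function is illustrative rather than a reference implementation.

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_feats, fake_feats):
    """Fréchet distance between two sets of feature vectors, shape (n_samples, n_dims).

    Lower values mean the generated distribution is closer to the real one.
    Extracting the features (e.g., Inception pooling features) is assumed to
    have been done beforehand.
    """
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)

    covmean = linalg.sqrtm(cov_r @ cov_f)   # matrix square root of the product
    if np.iscomplexobj(covmean):            # discard tiny imaginary parts
        covmean = covmean.real

    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```

Indirect evaluation, by contrast, requires no image-level metric at all: the downstream classification or segmentation model is trained once with and once without the synthetic images, and its usual task metrics (e.g., accuracy or Dice score) are compared.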

Table 3 Classification of different performance evaluation metrics

2.3 Related work

Several reviews address the use of GANs in medical images for different objectives, including synthesis [18], augmentation [19], segmentation [20], and classification [21]. Singh et al. [18] discussed the different GAN architectures used in medical image generation, in addition to the applications of these GANs in the reconstruction of medical images to reduce noise as well as the synthesis of medical images. For image reconstruction, 22 articles were summarized in terms of modalities, methods, losses, and additional remarks; another 22 articles were summarized for image synthesis in terms of modalities, methods, and additional remarks for unconditional synthesis, as well as the conditional information for conditional synthesis. The paper concludes with some future research directions for GANs in medical image generation. However, this review was not conducted systematically, and its focus was not on augmentation. Chen et al. [19] presented a review of GANs in medical image augmentation using 105 papers published from 2018 to 2021, mainly collected from Elsevier, IEEE Xplore, and Springer, including both published and pre-published papers. The papers were analyzed based on the organs involved, the datasets utilized for training and testing the models, the loss functions employed in training, and the metrics used to evaluate the performance of the proposed models. Based on this analysis, the advantages of each method, loss function, and evaluation metric are discussed. Nevertheless, this review was not conducted systematically, which may affect its repeatability, and it was not restricted to peer-reviewed published articles. In addition, the searched databases did not include Science Direct, Scopus, or PubMed, and the review did not include articles beyond 2021. Xun et al. [20], on the other hand, based their review on GANs used in medical image segmentation, where 120 papers published before September 2021 were reviewed and analyzed based on the segmentation region (i.e., brain, chest, abdomen, etc.), image modality, and classification methods. The articles were collected from Google Scholar, PubMed, Semantic Scholar, Springer, arXiv, and some top conferences in computer science. Similarly, the advantages of the proposed methods are discussed, in addition to the limitations and future research directions for the use of GANs in segmentation. However, this review did not focus on augmentation, and it was not conducted systematically. Following PRISMA guidelines, Joeng et al. [21] conducted a systematic literature review on the use of GANs in medical image classification and segmentation. A meta-analysis was performed on 54 papers published from 2015 to 2020 based on image modality, task, and clinical domain, focusing on how GANs were utilized for classification and segmentation purposes. The articles were retrieved from PubMed, Science Direct, and Google Scholar. The 54 papers comprise 33 papers that address segmentation (only 12 of which address augmentation for segmentation), 13 papers that address generation, and 9 papers that address classification. Their primary focus was on the classification and segmentation objectives of GANs rather than on GANs for medical image augmentation, which was reflected in the search keys they used and the small number of augmentation papers retrieved. Additional manual searches from retrieved articles, such as backward and forward snowballing, were not reported in their review.
Finally, the performance evaluation metrics of the included articles were not retrieved, quantified, or grouped in detail.

The main contribution of this work is that we perform a systematic review following PRISMA guidelines to retrieve and synthesize data from peer-reviewed published articles in the IEEE, PubMed, Science Direct, and Scopus electronic databases, complemented by a manual search to identify additional relevant articles through backward and forward snowballing. We included articles published up to February 2022. From the included articles, our work identified, quantified, and grouped all performance evaluation metrics that were used to evaluate the developed GAN-based models.

3 Research methodology

In this article, we systematically reviewed the use of GANs to augment medical images, following PRISMA guidelines. In this section, we describe our research methodology, which comprises the review protocol, research questions (RQs), data sources, search criteria, inclusion and exclusion criteria, and quality assessment. To explore the literature related to GAN-based augmentation of medical images, we formulated the five research questions presented in the introduction and developed a search strategy defining the key search terms, electronic databases, and inclusion and exclusion criteria. Data from the finally included articles, which met the selection criteria and quality assessment, were collected and synthesized to answer our RQs. By searching the IEEE, PubMed, Scopus, and Science Direct scientific databases, we identified peer-reviewed, published articles that met our inclusion criteria and quality assessment. Our search string combined several keywords as follows: (“GAN” OR “Generative Adversarial Network”) AND “medical image” AND “augmentation”. In addition, the reference lists of included articles and the literature citing those articles were tracked manually to find additional relevant articles for inclusion. We defined inclusion and exclusion criteria to govern the selection process and retrieve relevant articles; they are presented in Tables 4 and 5, respectively.

Table 4 Inclusion criteria
Table 5 Exclusion criteria

After applying the above-mentioned selection criteria, 93 articles from the different data sources were considered for further screening. After eliminating duplicates, three independent researchers screened the titles and abstracts of the remaining 72 articles. The resulting 49 articles were subjected to full-text reading, and none were discarded. An additional manual search of the reference lists of included articles (backward snowballing) and of published articles citing them (forward snowballing) resulted in the addition of three further articles. Finally, 52 articles were assessed on the basis of their quality, and all of them were retained.

The criteria utilized to assess the quality of the included articles are presented in Table 6.

Table 6 Quality assessment criteria

A cumulative score was calculated for each article based on the above criteria. It is important to highlight that these quality criteria were not used to exclude articles but to evaluate the degree of relevance of the literature included in our research. For each article deemed to have met or exceeded the quality assessment, the following facets were explored in depth: motivations, contributions, and results achieved. The number of articles selected at the end of this process was 52. Figure 1 shows the PRISMA flow diagram of the study screening and selection process.

Fig. 1
figure 1

The PRISMA flow diagram

4 Results

In this section, we present the results synthesized from our collected data, organized in the order of the research questions stated in the introduction. Out of the 93 initially identified articles (from searching the PubMed, Scopus, Science Direct, and IEEE electronic databases), we finally included 52 articles [22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73] as shown in Fig. 1. The publication years of these articles are shown in Fig. 2. The numbers show a growing trend in the number of publications that use GAN-based augmentation models for medical images. Since our search period ended in February 2022, we could not cover all the papers published in 2022, so the count for 2022 appears limited; however, we expect more papers to be published by the end of 2022. Considering the previous years, it is clear that interest in this topic is increasing dramatically.

Fig. 2
figure 2

The number of articles per publication year

4.1 RQ1: What are the GAN-based architectures that were used in medical image augmentation?

The included articles reported the use of a variety of basic GAN architectures. Although 8 articles did not specify exactly which GAN architecture was used, 9 types were reported, used either independently or in combination, as shown in Fig. 3. cGAN was the most frequently used architecture (\(n = 13\)), followed by DCGAN (\(n = 10\)), CycleGAN (\(n = 5\)), Pix2Pix (\(n = 4\)), and others. Only two articles combined two or three of these basic architectures.

Fig. 3
figure 3

Distribution of the GAN architectures included in the selected articles

4.2 RQ2: What were the different medical image modalities that used GAN in image augmentation?

The most frequently reported image modality in the included articles was Magnetic Resonance Imaging (MRI) (\(n = 17\)), followed by Computed Tomography (CT) (\(n = 13\)), X-ray (\(n = 9\)), ultrasound (\(n = 4\)), mammography (\(n = 4\)), and others. Figure 4 presents the count of articles per image modality.

Fig. 4
figure 4

The number of articles per image modality

4.3 RQ3: What were the body organs associated with the above medical imaging modalities?

The top reported body organ in the included articles was the brain (\(n = 15\)), followed by the chest (\(n = 8\)), the breast (\(n = 8\)), the lung (\(n = 7\)), and others. Figure 5 shows the number of articles per body organ.

Fig. 5
figure 5

The number of articles per body organ

4.4 RQ4: What was the targeted task following GAN-based image augmentation?

The majority of the articles (\(n = 37\)) implemented medical image classification tasks following the development of the GAN-based augmentation model. Segmentation tasks were implemented in 11 articles following augmentation, whereas augmentation alone was performed in 4 articles. The number of articles based on the tasks that follow augmentation is depicted in Fig. 6.

Fig. 6
figure 6

The number of articles per task that follows augmentation

4.5 RQ5: How was the performance of the GAN model for augmentation evaluated in the retrieved articles?

The most frequent single performance evaluation method used was indirect quantitative performance evaluation (\(n = 23\)), whereas direct quantitative methods were used alone in 7 articles and qualitative methods (expert opinion) were never used alone. Eleven articles combined both direct and indirect quantitative methods in their model evaluations. Qualitative methods were combined with direct methods (\(n = 4\)), indirect methods (\(n = 4\)), and both direct and indirect methods (i.e., the full set of performance evaluation methods) (\(n = 4\)). Figure 7 presents the number of articles based on the performance evaluation method of the developed GAN-based augmentation models.

Fig. 7
figure 7

The number of articles per performance evaluation method

5 Discussion

GANs have been increasingly used in a variety of applications related to medical images. Since the introduction of the GAN concept in 2014, the number of publications investigating GAN applications has grown notably. Some uses were related to classification tasks that may guide the decision to diagnose a disease, detect its presence, or even track its progression. In 2016, publications started to include segmentation tasks as another application of GANs. Our review revealed the inclusion of GANs in medical image augmentation starting in 2018. Since we stopped our review in February 2022, we expect this trend to continue through 2022 and beyond for publications that develop GAN-based models for the augmentation of medical images. Medical imaging is a very suitable area for GAN-based augmentation since it is usually associated with imbalanced datasets in favor of normal (i.e., healthy) images rather than diseased images. In addition, the volume of available datasets may be insufficient for training on other tasks such as classification or segmentation. In this section, we discuss the results related to each of our research questions.

Regarding the basic GAN architectures used for augmentation, we found that cGAN came at the top of the list. The popularity of cGAN is mainly attributed to its ability to learn specific features from the dataset via the use of conditional information [7, 74]. DCGAN is the second most popular architecture; DCGANs are known for their capacity to produce high-resolution medical images, which is also why they are commonly used in models intended for segmentation tasks. Most of the articles reported the use of a single basic GAN architecture for augmentation; only two articles reported combining two or three basic types to develop a more complex model. We found that eight articles did not specify the GAN architecture used in medical image augmentation. Although an in-depth explanation of the GAN architecture is needed to facilitate the replication of study results, most of the included studies also omitted other details, such as the hyperparameter values used in their models.

Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) appeared as the top two image modalities leveraging GAN-based augmentation models. This may be attributed to the fact that these two tests are considered less available and more expensive diagnostic tests, whereas X-ray and ultrasound come next because they are more available and less expensive general screening tests. Both kinds of tests share the characteristic of having imbalanced classes (healthy vs. diseased). The least used image modalities were laparoscopic images, Human Epithelial type 2 (HEp-2) cell images, functional near-infrared spectroscopy (fNIRS), and digital histopathological slides. These studies may be in the early phases of experimenting with GAN-based augmentation and may increase in the upcoming years.

The brain, chest, breast, and lung were the top reported body organs in the included articles. This may be related to the increasing demand for the availability of medical images of these organs in comparison to other organs. On the other hand, the cervix, heart, and pancreas were examples of the least reported body organs that used GAN-based augmentation. Medical images of these body organs might be more readily available, or the specialized physicians might be in the early stages of experimenting with GAN-based augmentation for these organs’ images.

Detection or diagnosis of a disease through a classification task was the most commonly reported use of the GAN-augmented datasets, being the most demanded application. GAN-augmented images were used for segmentation tasks in fewer than one-third as many articles as for classification tasks.

In Table 3, we presented the different types of performance evaluation metrics used for GAN-based medical image augmentation. The ability of the subsequent model (whether for classification or segmentation) to perform better in its task according to a set of performance evaluation metrics is considered a quantitative, indirect method of evaluating the performance of the GAN-based augmentation model. This is why we found indirect evaluation to be the most frequently reported single method in the included articles. Indirect metrics were also used in combination with direct metrics only or with both qualitative and direct methods. All reported indirect metrics suggest that GAN-based augmentation was effective in improving the performance of the subsequent task. Quantitative, direct performance evaluation metrics were used alone in a smaller number of articles; however, they were combined with indirect, or with both indirect and qualitative, methods in some articles. Finally, qualitative performance evaluation metrics were not used as the sole evaluation method in any of the articles; they were always combined with either indirect metrics, direct metrics, or both. Qualitative metrics were included in a limited number of articles (12 articles). This may be because qualitative methods require the availability of experts to examine and evaluate the generated medical images. However, validating the performance of GAN-based augmented medical images by combining both qualitative and quantitative (direct and indirect) methods should be sought in a wider range of articles, because augmented images can generate features that do not exist or hide existing features, which necessitates expert opinion.

As mentioned above, many of the included articles did not explain the details of their augmentation architectures or the hyperparameters they used. Therefore, the replication of such studies may be limited without further details from the authors. In addition, none of the included articles evaluated the effectiveness of the proposed model in a practical clinical setting.

6 Conclusion and future work

GANs have a wide range of applications in a variety of domains; however, the scope of this review is limited to applications of GANs in medical image augmentation tasks only. We restricted our search to publications from January 2012 to February 2022, although GANs were first introduced in 2014, and we did not review papers released after February 2022. We focused on the augmentation of medical images using GANs; although GANs have been used for other tasks on medical images, such as segmentation, classification, super-resolution, image modality translation, and image denoising, segmentation and classification were covered only when performed as a second step following GAN-based augmentation. Furthermore, the included articles did not share full details about the augmentation models they used, such as the values of the hyperparameters, which limited the scope of our data synthesis. We did not restrict our search to a single type of medical image or a single usage of augmented medical images. The selection process we followed was compliant with PRISMA guidelines, and pre-defined search key terms, inclusion criteria, and exclusion criteria were used. The results of each selection stage were agreed upon by three independent reviewers, and a further quality assessment check was performed.

In conclusion, this systematic review covered GAN-based models that were used for medical image augmentation. We presented the basic GAN architectures used for building these augmentation models; cGAN and DCGAN were the most popular. Furthermore, we presented the most common medical image modalities (MRI, CT, X-ray, and ultrasound) and body organs (brain, chest, breast, and lung) reported in the included articles. In addition, we reported the second task that followed medical image augmentation (classification or segmentation). We also explained the different types of metrics used to evaluate the performance of GAN-based models: qualitative, quantitative direct, and quantitative indirect methods. Following that, we grouped the number of articles that used each of these performance evaluation metrics, whether individually or in combination. The most reported performance evaluation metrics were the quantitative indirect methods, whereas qualitative metrics were the least used.

Publications on the use of GANs for medical image augmentation are expected to continue to grow steadily, as this line of research is still in its early stages. Future publications should explain more details about the developed models and the hyperparameter values used, and should combine the three types of performance evaluation metrics, to facilitate the replication and validation of the published results. Further work to evaluate GAN-based augmentation models in real clinical settings would be another future direction to pursue.