1 Introduction

Medical images are mostly interpreted by medical professionals such as clinicians and radiologists. However, variation among experts and the complexity of medical images make it very difficult for experts to diagnose diseases accurately all the time. With computerized techniques, the tedious image analysis task can be performed semi- or fully automatically, which helps experts make objective, rapid, and accurate diagnoses. Therefore, designing deep learning based methods for medical images has always been an attractive area of research (Tsuneki 2022; van der Velden et al. 2022; Chen et al. 2022a). In particular, deep Convolutional Neural Network (CNN) architectures have been used for several tasks (e.g., image segmentation, classification, registration, content-based image retrieval, identification of disease severity, and lesion/tumor/tissue detection) in medical diagnosis fields such as eye, breast, brain, and lung (Yu et al. 2021). Although CNNs can outperform traditional machine learning based methods, improving their generalization ability and developing robust models remain significant challenges, because training data sets should contain a large number of images covering all variations (i.e., heterogeneous images) to provide high generalization ability and robustness. However, the number of medical images is usually limited. This data scarcity can be due to an insufficient number of patients for some diseases, patients not allowing their images to be used, a lack of medical equipment, an inability to obtain images that meet the desired criteria, or an inability to access images of patients living in different geographical regions or of different races.

Another important issue is that, even if enough images are acquired, labeling them is not easy (unlike labeling natural images, which is relatively easy): it requires the domain knowledge of medical professionals and is time-consuming. Legal issues and privacy are further concerns in the labeling of medical images. Therefore, imbalanced data, which causes overfitting and biased, inaccurate results, is a significant challenge that should be handled while developing deep network based methods.

To overcome these challenges, image augmentation methods are used to balance the number of images and to expand training datasets automatically. With augmentation, the learning and generalization ability of deep network architectures can be improved, and the desired robustness of the networks can be provided (Sect. 2). However, various augmentation techniques have been applied with different types of images in the literature. It is not clear which data augmentation technique provides more effective results for which image type, since different diseases are handled, different network architectures are used, and these architectures are trained and tested with data sets of different sizes (Sect. 2). From the segmentation or classification results presented in the literature, it is not possible to understand or compare the contributions of the augmentation approaches used, or to decide on the most appropriate augmentation method.

In the literature, researchers have presented some surveys about augmentation methods. However, they have focused on augmentation methods applied either:

  (i) with a specific type of images, such as natural images (Khosla and Saini 2020), mammography images (Oza et al. 2022), or Computed Tomography (CT) and Magnetic Resonance (MR) images (Chlap et al. 2021), which are acquired with different imaging techniques and have different properties;

  (ii) to improve the performance of specific operations, such as polyp segmentation (Dorizza 2021) and text classification (Bayer et al. 2021); or

  (iii) with a specific technique, such as Generative Adversarial Network (GAN) (Chen et al. 2022b) or erasing and mixing (Naveed 2021).

In this work, a more comprehensive range of image types has been handled. Also, a more in-depth analysis has been performed by implementing common augmentation methods with the same data sets for objective comparisons and evaluations based on quantitative results. To our knowledge, there is neither (i) a comprehensive review that covers a greater range of medical imaging modalities and applications than the articles in the literature, nor (ii) any work on quantitative evaluation of the augmentation methods applied with different medical images to determine the most appropriate augmentation method.

To fill this gap in the literature, in this paper, the augmentation techniques that have been applied to improve performances of deep learning-based diagnosis of diseases in different organs (brain, lung, breast, and eye) using different imaging modalities (MR, CT, mammography, and fundoscopy) have been examined. The chosen imaging modalities are widely used in medicine for many applications such as classification of brain tumors, lung nodules, and breast lesions (Meijering 2020). Also, in this study, the most commonly used augmentation methods have been implemented using the same datasets. Additionally, to evaluate the effectiveness of those augmentation methods in classifications from the four types of images, a classifier model has been designed and classifications with the augmented images for each augmentation method have been performed with the same classifier model. The effectiveness of the augmentation methods in the classifications has been discussed based on quantitative results. This paper presents a novel comprehensive insight into the augmentation methods proposed for diverse medical images by considering popular medical research problems altogether.

The key features of this paper are as follows:

  1. The techniques used for augmentation of brain MR images, lung CT images, breast mammography images, and eye fundus images have been reviewed.

  2. Papers published between 2020 and 2022, collected mainly from Springer, IEEE Xplore, and ELSEVIER, have been reviewed, and a systematic and comprehensive review and analysis of the augmentation methods has been carried out.

  3. Commonly used augmentations in the literature have been implemented with four different medical image modalities.

  4. To compare the effectiveness of the augmentation methods in deep learning-based classification, the same network architecture has been used with data sets expanded by the augmented images from each augmentation method.

  5. Five evaluation metrics have been used, and the results of each classification have been evaluated with the same evaluation metrics.

  6. The best augmentation method for different imaging modalities showing different organs has been determined according to quantitative values.

We believe that the results and conclusions presented in this work will be helpful for researchers to choose the most appropriate augmentation technique in developing deep learning-based approaches to diagnose diseases with a limited number of image data. Also, this work can be helpful for researchers interested in developing new augmentation methods for different types of images. This manuscript has been structured as follows: A detailed review of the augmentation approaches applied with the medical images handled in this work between 2020 and 2022 is presented in the second section. The data sets and methods implemented in this work are explained in the third section. Evaluation metrics and quantitative results are given in the fourth section. Discussions are given in the fifth section, and conclusions are presented in the sixth section.

2 Related works

In this section, the methods used in the literature for augmentation of brain MR images, lung CT images, breast mammography images, and eye fundus images are presented.

2.1 Augmentation of brain MR images

In the literature, augmentation of brain MR images has been performed with various methods such as rotation, noise addition, shearing, and translation to increase the number of images and improve the performance of several tasks (Table 1). For example, augmentation has been provided by noise addition and sharpening to increase the accuracy of tumor segmentation and classification in a study (Khan et al. 2021). In another study, augmented images have been obtained by translation, rotation, random cropping, blurring, and noise addition to obtain high performance from the method applied for age prediction, schizophrenia diagnosis, and sex classification (Dufumier et al. 2021). In different studies, random scaling, rotation, and elastic deformation have been applied to increase tumor segmentation accuracy (Isensee et al. 2020; Fidon et al. 2020). Although those techniques have mostly been preferred in the literature, augmentation has also been provided by producing synthetic images. To generate synthetic images, the Mixup method, which combines two randomly selected images and their labels, a modified Mixup, or GAN structures have been used (Table 1).

Table 1 Augmentation methods applied with brain MR images and details given in publications

For instance, TensorMixup, a modified, voxel-based Mixup approach that combines two image patches using a tensor, has been applied (Wang et al. 2022) to boost the accuracy of tumor segmentation. In a recent study, a GAN based augmentation has been used for cerebrovascular segmentation (Kossen et al. 2021).

All methods applied for augmentation of brain MR images in the publications reviewed in this work and the details in those publications are presented in Table 1. Those augmentation methods have been used in the applications developed for different purposes. In those applications, different network architectures, datasets, and numbers of images have been used, and also their performances have been evaluated with different metrics. High performances have been obtained from those applications according to the reported results. However, the most effective and appropriate augmentation method used in them with the brain MR images is not clear due to those differences. Also, it is not possible to determine augmentation methods’ contributions to those performances according to the presented results in those publications.

2.2 Augmentation of lung CT images

In computerized methods being developed to assist medical professionals in various tasks [e.g., nodule or lung parenchyma segmentation, nodule classification, prediction of malignancy level, and diagnosis of diseases such as coronavirus (covid-19)], augmentation of lung CT images is commonly provided by traditional techniques like rotation, flipping, and scaling (Table 2). For example, image augmentation has been provided using flipping, translation, rotation, and brightness changing to improve automated diagnosis of covid-19 by Hu et al. (2020). Similarly, a set of augmentations (Gaussian noise addition, cropping, flipping, blurring, brightness changing, shearing, and rotation) has been applied to increase the performance of covid-19 diagnosis by Alshazly et al. (2021). Also, several GAN models have been used to augment lung CT images by generating new synthetic images (Table 2). In GAN based augmentation of lung CT scans, researchers usually tend to generate only the area showing lung nodules rather than the whole image, because lung nodules are small and generating them is therefore more cost-effective than generating a larger whole image or background area. For instance, guided GAN structures have been implemented to produce synthetic images showing the area with lung nodules by Wang et al. (2021). Nodules’ size information and guided GANs have been used to produce images with diverse sizes of lung nodules by Nishio et al. (2020).

All methods applied for augmentation of lung CT images in the publications reviewed in this study and the details in those publications are presented in Table 2. Those augmentation methods have been applied in the applications developed for different purposes. In those applications, different network models, data sets, and numbers of images have been used, and also their performances have been evaluated with different metrics. High performances have been obtained from those applications according to the reported results. However, the most effective and appropriate augmentation method used in them with the lung CT images is not clear due to those differences. Also, it is not possible to determine augmentation methods’ contributions to those performances according to the presented results in those publications.

Table 2 Augmentation methods applied with lung CT images and details given in publications

2.3 Augmentation of breast mammography images

Generally, augmentation methods with mammography images have been applied to improve automated recognition, segmentation, and detection of breast lesions, because failures in the detection or identification of lesions lead to unnecessary biopsies or inaccurate diagnoses. In some works in the literature, augmentation methods have been applied to extracted positive and negative patches instead of the whole image (Karthiga et al. 2022; Zeiser et al. 2020; Miller et al. 2022; Kim and Kim 2022; Wu et al. 2020b). Positive patches are extracted based on radiologist marks, while negative patches are extracted as random regions from the same image. Also, refinement and segmentation of the lesion regions before the application of an augmentation method is the most preferred way to obtain good performance in the detection and classification of lesions. For instance, image refinement has been provided by noise reduction and resizing steps, and then segmentation and augmentation steps have been applied (Zeiser et al. 2020). Similarly, noise and artifacts have been eliminated before the segmentation of the images into non-overlapping parts and augmentation (Karthiga et al. 2022). Augmentation of patches or whole mammography images has usually been performed by scaling, noise addition, mirroring, shearing, and rotation techniques, or by combinations of them (Table 3). For instance, a set of augmentation methods (namely mirroring, zooming, and resizing) has been applied to improve automated classification of images into three categories (normal, malignant, and benign) (Zeiser et al. 2020). Also, those techniques have been used together with GANs to provide augmentation with new synthetic images. For instance, flipping and a deep convolutional GAN have been combined for binary classification of images (as normal and with a mass) by Alyafi et al. (2020). Augmentations based only on GANs have also been used. For instance, a contextual GAN, which generates new images by synthesizing lesions according to the context of the surrounding tissue, has been applied to improve binary classification (as normal and malignant) by Wu et al. (2020b). In another work (Shen et al. 2021), both contextual and deep convolutional GANs have been used to generate synthetic lesion images with margin and texture information and to improve the detection and segmentation of lesions.

All methods applied for augmentation of breast mammography images in the publications reviewed in this work and the details in those publications are presented in Table 3. Those augmentation methods have been applied in the applications developed for different purposes. In those applications, different network architectures, datasets and number of images have been used, and also their performances have been evaluated with different metrics. High performances have been obtained from those applications according to the reported results. However, the most effective and appropriate augmentation method used in them with breast mammography images is not clear due to those differences. Also, it is not possible to determine augmentation methods’ contributions to those performances according to the presented results in those publications.

Table 3 Augmentation methods applied with mammography images and details given in publications

2.4 Augmentation of eye fundus images

Eye fundus images show various important features that are symptoms of eye diseases, such as abnormal blood vessels (neovascularization), red/reddish spots (hemorrhages, microaneurysms), and bright lesions (soft or hard exudates). Those features are used in deep learning based methods being developed for different purposes like glaucoma identification, classification of diabetic retinopathy (DR), and identification of DR severity. The augmentation techniques commonly used with eye fundus images to boost the performance of deep learning based approaches are rotation, shearing, flipping, and translation (Table 4). For example, in a study (Shyamalee and Meedeniya 2022), a combined method including rotation, shearing, zooming, flipping, and shifting has been applied for glaucoma identification. Another combined augmentation including shifting and blurring has been used by Tufail et al. (2021) before binary classification (as healthy and nonhealthy) and multi-class (5 classes) classification of images to differentiate the stages of DR. In a different study (Kurup et al. 2020), a combination of rotation, translation, mirroring, and zooming has been used to achieve automated detection of malarial retinopathy. Also, pixel intensity values have been used in some augmentation techniques since fundus images are color images. For instance, to achieve robust segmentation of retinal vessels, augmentation has been provided by a channel-wise random gamma correction (Sun et al. 2021). The authors have also applied a channel-wise vessel augmentation using morphological transformations. In another work (Agustin et al. 2020), to improve the robustness of DR detection, image augmentation has been performed by zooming and a contrast enhancement process known as Contrast Limited Adaptive Histogram Equalization (CLAHE). Augmentations have also been provided by generating synthetic images using GAN models. For instance, conditional GAN (Zhou et al. 2020) and deep convolutional GAN (Balasubramanian et al. 2020) based augmentations have been used to increase the performance of the methods developed for automatic classification of DR according to its severity level. In some works, retinal features (such as lesion and vascular features) have been synthesized and added to new images. For instance, NeoVessel (NV)-like structures have been synthesized in a heuristic image augmentation (Araújo et al. 2020) to improve detection of proliferative DR, which is an advanced DR stage characterized by neovascularization. In this augmentation, different NV kinds (trees, wheels, brooms) have been generated depending on the expected shape and location of NVs to synthesize new images.

Table 4 Augmentation methods applied with eye fundus images and details given in publications

All methods applied for augmentation of eye fundus images in the publications reviewed in this study and the details in those publications are presented in Table 4. Those augmentation methods have been applied in the applications developed for different purposes. In those applications, different network structures, data sets and number of images have been used, and also their performances have been evaluated with different metrics. High performances have been obtained from those applications according to the reported results. However, the most effective and appropriate augmentation method used in them with eye fundus images is not clear due to those differences. Also, it is not possible to determine augmentation methods’ contributions to those performances according to the presented results in those publications.

2.5 Commonly applied augmentation methods

2.5.1 Rotation

Rotation-based image augmentation is performed by rotating an image with respect to its original position. The rotation uses a new coordinate system and retains the relative positions of the pixels of the image. It can be rightward or leftward about an axis, with an angle within the range [1°, 359°]. Image labels may not be preserved as the rotation angle increases; therefore, the safety of this augmentation technique depends on the rotation angle. Although rotation is not safe on images showing 6 and 9 in digit recognition applications, it is generally safe on medical images.
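The rotation described above can be sketched as follows. This is a minimal illustration, not the implementation used in the reviewed studies: the function name `rotate`, the nearest-neighbour sampling, the rotation about the image centre, and the use of plain nested lists (to avoid any dependency) are all assumptions. A positive angle is taken as clockwise in image coordinates, where the y-axis points down.

```python
import math


def rotate(img, angle_deg, fill=0):
    """Rotate a 2-D image (list of rows) clockwise about its centre.

    Hypothetical helper: uses inverse mapping with nearest-neighbour
    sampling; pixels that map outside the source image are set to `fill`.
    """
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = x - cx, y - cy
            # Inverse rotation: where in the source does this output pixel come from?
            sx = round(c * dx + s * dy + cx)
            sy = round(-s * dx + c * dy + cy)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = img[sy][sx]
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
rotated = rotate(image, 90)  # [[7, 4, 1], [8, 5, 2], [9, 6, 3]]
```

For angles that are not multiples of 90°, some corner pixels fall outside the frame and are replaced by `fill`, which is one reason why large rotation angles may affect label preservation.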

2.5.2 Flipping

Flipping generates a mirror image of an image. Pixel positions are inverted with respect to one of the two axes (for a two-dimensional image). Although it can be applied about the vertical and/or horizontal axis, vertical flipping is rarely preferred since the bottom and top regions of an image may not always be interchangeable (Nalepa et al. 2019).
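As a brief illustration, flipping only reorders pixels. The function names below are hypothetical, the image is represented as a plain list of rows, and "horizontal" and "vertical" are taken here to mean left-right and top-bottom mirroring respectively, which is an assumption about terminology.

```python
def flip_horizontal(img):
    """Mirror left-right: reverse the pixel order inside every row."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Mirror top-bottom: reverse the order of the rows."""
    return [row[:] for row in img[::-1]]


image = [[1, 2],
         [3, 4]]
mirrored = flip_horizontal(image)    # [[2, 1], [4, 3]]
upside_down = flip_vertical(image)   # [[3, 4], [1, 2]]
```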

2.5.3 Translation

Images are translated along an axis using a translation vector. This technique preserves the relative positions between pixels. Therefore, translation helps to prevent positional bias (Shorten and Khoshgoftaar 2019), so that models do not focus on properties at a single spatial location (Nalepa et al. 2019).
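A minimal sketch of translation by an integer vector is given below. The function name `translate`, the zero fill for uncovered pixels, and the list-of-rows image representation are assumptions made for illustration.

```python
def translate(img, tx, ty, fill=0):
    """Shift an image by (tx, ty) pixels; positive values move right/down.

    Hypothetical helper: pixels shifted in from outside the frame are `fill`.
    """
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx, sy = x - tx, y - ty  # source position of this output pixel
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = img[sy][sx]
    return out


image = [[1, 2],
         [3, 4]]
shifted = translate(image, 1, 0)  # [[0, 1], [0, 3]]
```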

2.5.4 Scaling

Images are scaled along different axes with a scaling factor, which can be different or the same for each axis. In particular, a change of scale can be interpreted as zooming out (when the scaling factor is less than 1) or zooming in (when the scaling factor is greater than 1).
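The zoom interpretation can be illustrated with a small sketch. It assumes nearest-neighbour sampling, scaling about the image centre, and an output of the same size as the input; the function name `scale` and these choices are illustrative assumptions rather than a description of any reviewed method.

```python
import math


def scale(img, fx, fy, fill=0):
    """Scale an image about its centre by factors fx (x-axis) and fy (y-axis).

    Hypothetical helper: factors > 1 zoom in, factors < 1 zoom out and
    leave a border filled with `fill`.
    """
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: divide the offset from the centre by the factor.
            sx = math.floor((x - cx) / fx + cx + 0.5)
            sy = math.floor((y - cy) / fy + cy + 0.5)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = img[sy][sx]
    return out


image = [[0, 0, 0, 0],
         [0, 1, 2, 0],
         [0, 3, 4, 0],
         [0, 0, 0, 0]]
zoomed_in = scale(image, 2, 2)  # the central 2x2 block fills the frame
```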

2.5.5 Shearing

This transformation slides one edge of an image along the vertical or horizontal axis, creating a parallelogram. A vertical direction shear slides an edge along the vertical axis, while a horizontal direction shear slides an edge along the horizontal axis. The amount of the shear is controlled by a shear angle.
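To make the parallelogram effect concrete, the sketch below maps the four corners of an image rectangle under a shear. The function names are hypothetical, and defining the shear factor as the tangent of the shear angle is a common convention assumed here, not something stated in the reviewed publications.

```python
import math


def shear_point(x, y, angle_deg, direction="horizontal"):
    """Map a point under a shear whose factor is tan(angle)."""
    k = math.tan(math.radians(angle_deg))
    if direction == "horizontal":
        return (x + k * y, y)   # x is displaced in proportion to y
    if direction == "vertical":
        return (x, y + k * x)   # y is displaced in proportion to x
    raise ValueError("direction must be 'horizontal' or 'vertical'")


def shear_corners(width, height, angle_deg, direction="horizontal"):
    """Return the sheared positions of the rectangle's four corners."""
    corners = [(0, 0), (width, 0), (width, height), (0, height)]
    return [shear_point(x, y, angle_deg, direction) for x, y in corners]


# A 100 x 50 rectangle sheared horizontally by 15 degrees becomes a parallelogram:
# the top edge stays in place while the bottom edge slides by 50 * tan(15 deg).
parallelogram = shear_corners(100, 50, 15)
```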

2.5.6 Augmentation using intensities

This augmentation is provided by modification of contrast or brightness, blurring, intensity normalization, histogram equalization, sharpening, and addition of noise. If the preferred type of noise is Gaussian, intensity values are modified with samples drawn randomly from a Gaussian distribution. If it is salt-and-pepper noise, randomly selected pixel values are set to white or black. Uniform noise addition is performed by modifying pixel values with samples drawn randomly from a uniform distribution.
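The three noise types can be sketched as below. The function names, the 8-bit intensity range [0, 255], the clipping step, and the equal split between salt and pepper are assumptions for illustration only.

```python
import random


def _clip(value, lo=0, hi=255):
    """Keep an intensity inside the assumed 8-bit range."""
    return max(lo, min(hi, value))


def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma`."""
    return [[_clip(round(p + rng.gauss(0.0, sigma))) for p in row] for row in img]


def add_salt_and_pepper(img, density, rng):
    """Set roughly a fraction `density` of pixels to white (255) or black (0)."""
    out = []
    for row in img:
        new_row = []
        for p in row:
            if rng.random() < density:
                new_row.append(255 if rng.random() < 0.5 else 0)
            else:
                new_row.append(p)
        out.append(new_row)
    return out


def add_uniform_noise(img, amplitude, rng):
    """Add noise sampled uniformly from [-amplitude, amplitude]."""
    return [[_clip(round(p + rng.uniform(-amplitude, amplitude))) for p in row]
            for row in img]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
image = [[128] * 8 for _ in range(8)]
noisy_gaussian = add_gaussian_noise(image, sigma=10, rng=rng)
noisy_sp = add_salt_and_pepper(image, density=0.05, rng=rng)
noisy_uniform = add_uniform_noise(image, amplitude=5, rng=rng)
```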

2.5.7 Random cropping and random erasing

Augmentation by random cropping is applied by taking a small region from an original image and resizing it to match the dimensions of the original image. Therefore, this augmentation can also be called scaling or zooming. Augmentation by random erasing is performed by randomly eliminating image regions.
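Both operations can be sketched in a few lines. The function names, the nearest-neighbour resizing of the crop, the rectangular erased region, and the zero fill value are illustrative assumptions.

```python
import random


def random_crop_resize(img, crop_h, crop_w, rng):
    """Take a random crop and resize it back to the original dimensions."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - crop_h)
    left = rng.randint(0, w - crop_w)
    crop = [row[left:left + crop_w] for row in img[top:top + crop_h]]
    # Nearest-neighbour resize of the crop to h x w.
    return [[crop[y * crop_h // h][x * crop_w // w] for x in range(w)]
            for y in range(h)]


def random_erase(img, erase_h, erase_w, rng, fill=0):
    """Overwrite a randomly placed rectangle with a constant value."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - erase_h)
    left = rng.randint(0, w - erase_w)
    out = [row[:] for row in img]
    for y in range(top, top + erase_h):
        for x in range(left, left + erase_w):
            out[y][x] = fill
    return out


rng = random.Random(0)
image = [[y * 6 + x + 1 for x in range(6)] for y in range(6)]
cropped = random_crop_resize(image, 3, 3, rng)  # same 6 x 6 size, zoomed content
erased = random_erase(image, 2, 3, rng)         # six pixels replaced by 0
```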

2.5.8 Color modification

Image augmentation by color modification is performed in several ways. For example, an RGB-colored image is stored as arrays, which correspond to 3 color channels representing levels of red, green, and blue intensity values. Because of this, the color of the image can be changed without losing its spatial properties. This operation is called color space shifting, and it is possible to perform the operation in any number of spaces because the original color channels are combined in different ratios in every space. Therefore, one way to augment an RGB-colored image by color modification is to put the pixel values in the color channels of the image into a histogram and then manipulate them using filters to generate new images by changing the color space characteristics. Another way is to take 3 shifts (integer numbers) from 3 RGB filters and add each shift to a channel of the original image. Also, image augmentation can be provided by updating the hue, saturation, and value components, by isolating a channel (such as the blue, red, or green color channel), and by converting color spaces into one another. However, converting a color image to its grayscale version should be performed carefully because this conversion reduces performance by up to 3% (Chatfield et al. 2014).
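Two of the color modifications mentioned above, per-channel RGB shifting and updating the saturation component, can be sketched as follows. The function names, the representation of pixels as `(r, g, b)` tuples with 8-bit values, and the clipping are assumptions; the HSV conversion uses Python's standard `colorsys` module.

```python
import colorsys


def shift_rgb(img, shifts):
    """Add one integer shift per channel and clip to [0, 255]."""
    return [[tuple(max(0, min(255, c + s)) for c, s in zip(px, shifts))
             for px in row] for row in img]


def scale_saturation(img, factor):
    """Multiply the saturation of every pixel by `factor` in HSV space."""
    out = []
    for row in img:
        new_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            s = max(0.0, min(1.0, s * factor))
            r2, g2, b2 = colorsys.hsv_to_rgb(h, s, v)
            new_row.append((round(r2 * 255), round(g2 * 255), round(b2 * 255)))
        out.append(new_row)
    return out


image = [[(10, 20, 250), (200, 100, 50)]]
shifted = shift_rgb(image, (5, -30, 10))   # first pixel becomes (15, 0, 255)
desaturated = scale_saturation(image, 0.0)  # every pixel becomes a gray value
```

Setting the saturation factor to 0 removes all color information, which corresponds to the grayscale conversion that the text advises using with care.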

2.5.9 GAN based augmentation

GANs are generative architectures constructed with a discriminator to separate synthetic and true images and a generator to generate synthetic realistic images. The main challenging issues of GAN based image augmentations are generating images with high quality (i.e., high-resolution, clear) and maintaining training stability. To overcome those issues, several GAN variants were developed such as conditional, deep convolutional, and cycle GAN. Conditional GAN adds parameters (e.g., class labels) to the input of the generator to control generated images and allows conditional image generation from the generator. Deep convolutional GAN uses deep convolution networks in the GAN structure. Cycle GAN is a conditional GAN and translates images from one domain to another (i.e., image-to-image translation), which modifies input images into novel synthetic images that meet specific conditions. Wasserstein GAN is the structure that is constructed with the loss function designed with the earth-mover distance to increase the stability of the model. Style GAN and progressively trained GAN add coarse to fine details and learn features from training images to synthesize new images.
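For reference, the training objectives behind the plain, conditional, and Wasserstein variants are commonly written as below. These are the standard textbook forms, given here as an aid to the reader rather than taken from the reviewed publications; the symbols (generator G, discriminator or critic D, data distribution p_data, noise prior p_z, condition y) are notation introduced for this illustration.

```latex
% Standard GAN: D separates real from synthetic images, G tries to fool D
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]

% Conditional GAN: both networks also receive a condition y (e.g., a class label)
\min_G \max_D \;
  \mathbb{E}_{x, y}\big[\log D(x \mid y)\big]
  + \mathbb{E}_{z, y}\big[\log\big(1 - D(G(z \mid y) \mid y)\big)\big]

% Wasserstein GAN: the critic D is constrained to be 1-Lipschitz, so the
% objective estimates the earth-mover distance between real and generated data
\min_G \max_{\|D\|_{L} \le 1} \;
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[D(x)\big]
  - \mathbb{E}_{z \sim p_z}\big[D(G(z))\big]
```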

3 Materials and methods

3.1 Materials

In this work, publicly available brain MR, lung CT, breast mammography, and eye fundus images have been used.

Brain MR images have been provided from the Multimodal Brain Tumor Image Segmentation Challenge (BRaTS) (Bakas et al. 2018). The data for each patient include four different MR modalities: Fluid-Attenuated Inversion Recovery (FLAIR) images, T2-weighted MR images, T1-weighted MR images with Contrast Enhancement (T1-CE), and T1-weighted MR images. The different modalities provide different information about tumors. In this work, FLAIR images have been used. Example FLAIR images showing High-Grade Glioma (HGG) (Fig. 1a, b) and Low-Grade Glioma (LGG) (Fig. 1c, d) are presented in Fig. 1. The BRaTS database contains 285 images including HGG (210 images) and LGG (75 images) cases.

Fig. 1 Brain MR scans showing HGG (a, b) and LGG (c, d) (Bakas et al. 2018)

Lung images have been provided from the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) database (Armato et al. 2011; McNitt-Gray et al. 2007; Wang et al. 2015) by the LUNA16 challenge (Setio et al. 2017). In the database, there are 1018 CT scans which belong to 1010 people. Those scans were annotated by four experienced radiologists. The nodules marked by those radiologists were divided into three groups, which are “non-nodule ≥ 3 millimeters”, “nodule < 3 millimeters”, and “nodule ≥ 3 millimeters” (Armato et al. 2011). The images having a slice thickness greater than 3 mm were removed because of their limited usefulness, following the suggestions in the literature (Manos et al. 2014; Naidich et al. 2013). Also, the images having missing slices or inconsistent slice spacing were discarded. Therefore, a total of 888 images were extracted from the 1018 cases. Example images with nodules (Fig. 2a) and without nodules (Fig. 2b) are presented in Fig. 2.

Fig. 2 Example images showing nodules (a) and non-nodule regions (b) (Armato et al. 2011)

Breast mammography images have been provided from the INbreast database (Moreira et al. 2012). It contains 410 images, of which only 107 images show breast masses, and the total number of mass lesions is 115 since an image may contain more than one lesion. The number of benign and malignant masses is 52 and 63, respectively. Example mammography images without any mass (Fig. 3a), with a benign mass (Fig. 3b), and malignant mass (Fig. 3c) lesions have been presented in Fig. 3.

Fig. 3 Mammography images without a mass (a), with a benign mass (b), and with a malignant mass (c) (Moreira et al. 2012)

Eye fundus images have been provided from the Messidor database (Decencière et al. 2014) including 1200 images categorized into three classes (0-risk, 1-risk, and 2-risk) according to the diabetic macular edema risk level. The number of images in the categories is 974, 75, and 151, respectively. Example images showing macular edema with 0-risk (Fig. 4a), 1-risk (Fig. 4b), and 2-risk (Fig. 4c) have been presented in Fig. 4.

Fig. 4 Eye fundus images showing macular edema with 0-risk (a), 1-risk (b), and 2-risk (c) (Decencière et al. 2014)

3.2 Implemented augmentation methods

In this work, commonly used augmentation methods based on transformations and intensity modifications have been implemented to see and compare their effects on the performance of deep learning based classifications from MR, CT, and mammography images. Also, a method based on a color modification to augment colored eye fundus images has been applied. Those methods are explained below.

  • 1st Method In this method, a shearing angle is selected randomly within the range [− 15°, 15°], and a shearing operation is applied along the x- and y-axes. This operation is repeated 10 times with 10 different shearing angles. By this method, 10 images are generated from an original image.

  • 2nd Method In this method, a translation operation is applied along the x- and y-axes. This operation is repeated 10 times by using different values sampled randomly within the range [− 15, 15]. By this method, 10 images are generated from an original image.

  • 3rd Method Rotation angles are selected randomly within the range [− 25°, 25°] and clockwise rotations are applied using 10 different angle values. 10 images are generated from an original image by this method.

  • 4th Method In this method, noise addition is applied by I = I + (rand(size(I)) − FV) × N, where the term FV refers to a fixed value, N refers to noise and I refers to an input image. The noise to be added is generated using a Gaussian distribution. The variance of the distribution is computed from the input images while the mean value is set to zero. In this study, three different fixed values (0.3, 0.4, and 0.5) are used and 3 images are generated from an input image.

  • 5th Method In this method, noise is added to each image with 3 density values (0.01, 0.02, and 0.03) and a salt-and-pepper type noise is used. Therefore, 3 images are generated from an input image by this method.

  • 6th Method In this method, augmentation is provided by salt-and-pepper type noise addition (5th method) followed by shearing (1st method). 30 images are generated from an input image by this method.

  • 7th Method In this method, augmentation is provided by Gaussian noise addition (4th method) followed by rotation (3rd method). 30 images are generated from an input image by this method.

  • 8th Method In this augmentation method, clockwise rotation and then translation operations are applied. Rotation angles are selected randomly within the range [− 25°, 25°], and translation is applied along the x- and y-axes by using different values sampled randomly within the range [− 15, 15]. These successive operations are applied 10 times, and 10 images are generated from an input image by this method.

  • 9th Method In this method, translation is applied, followed by shearing. The translation step is applied along the x- and y-axes with different values sampled randomly within the range [− 15, 15]. In the shearing step, an angle is selected randomly within the range [− 15°, 15°], and shearing is applied along the x- and y-axes. These successive operations are applied 10 times, and 10 images are generated from an input image by this method.

  • 10th Method In this method, three operations (translation, shearing, and rotation) are applied successively. The translation and shearing operations are performed as explained in the 9th method. In the rotation step, a random angle is selected within the range [− 25°, 25°] and a clockwise rotation is applied. The successive operations are applied 10 times, and 10 images are generated from an input image by this method.

  • 11th Method Color shifting, sharpening, and contrast changing have been combined in this method. Color shifting has been applied by taking 3 shifts (integer numbers) from 3 RGB filters and adding each shift to one of the 3 color channels of the original images. Sharpening has been applied by blurring the original images and subtracting the blurred images from the original images. A Gaussian filter with a variance of 1 has been used for the blurring process. Contrast changing has been applied by scaling the original images linearly between i1 and i2 (i1 < i2), which will be the minimum and maximum intensities in the augmented images, respectively, and then mapping the original images’ pixels with intensity values higher than i2 (or lower than i1) to 255 (or 0). Therefore, 3 images are generated from an input image by this method.
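
The two noise-based methods in the list above (4th and 5th) can be sketched in a few lines of NumPy. This is only an illustrative reading of the description, not the implementation used in the study: the function names are hypothetical, 8-bit images are assumed, `rand` is taken to be uniform on [0, 1), and the Gaussian variance is assumed to be the variance of the input image itself, since the text does not state how it is computed.

```python
import numpy as np


def add_scaled_gaussian_noise(image, fixed_value, rng):
    """4th method, read literally: I + (rand(size(I)) - FV) * N."""
    img = image.astype(np.float64)
    # N: zero-mean Gaussian noise. ASSUMPTION: its standard deviation is
    # taken from the input image, since the text gives no exact formula.
    noise = rng.normal(0.0, img.std(), size=img.shape)
    # rand(size(I)): one uniform sample in [0, 1) per pixel (assumption).
    uniform = rng.random(img.shape)
    noisy = img + (uniform - fixed_value) * noise
    # ASSUMPTION: results are clipped back to the 8-bit range.
    return np.clip(noisy, 0, 255).astype(np.uint8)


def add_salt_and_pepper(image, density, rng):
    """5th method: set roughly `density` of the pixels to 0 or 255."""
    out = image.copy()
    r = rng.random(image.shape)
    out[r < density / 2] = 0          # pepper
    out[r >= 1 - density / 2] = 255   # salt
    return out


def augment_with_noise(image, seed=0):
    """Return 3 images per method, using the values given in the text."""
    rng = np.random.default_rng(seed)
    gaussian = [add_scaled_gaussian_noise(image, fv, rng) for fv in (0.3, 0.4, 0.5)]
    salt_pepper = [add_salt_and_pepper(image, d, rng) for d in (0.01, 0.02, 0.03)]
    return gaussian, salt_pepper
```

For a color image, this sketch corrupts each channel value independently; whether the study applied noise per pixel or per channel is not stated.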

3.3 Classification

In this work, a CNN-based classifier has been used to compare the effectiveness of the augmentation approaches in classification. In this step, the original images taken from the public databases (Sect. 3.1) have been used to construct the training and testing datasets. However, the databases have imbalanced class distributions. Therefore, to obtain unbiased classification results, an equal number of images has been taken from the databases for each class. The augmentation methods have been applied to these balanced sets, and the generated images have been added to the datasets. 80% of the total data has been used in the training stage, and the remaining 20% has been used in the testing stage. It should be noted that the goal of this study is to evaluate how the augmentation techniques, used with different types of images, improve classification, rather than to evaluate classifier models. Therefore, a single architecture, ResNet101, has been chosen and implemented in this work. This architecture has been chosen for its advantage in addressing the vanishing gradient issue and its high learning ability (He et al. 2016). Pre-training of networks by using large datasets [such as ImageNet (Russakovsky et al. 2015)] and then fine-tuning the networks for target tasks having less training data has been commonly applied in recent years. Pre-training has provided superior results on many tasks, such as action recognition (Simonyan and Zisserman 2014; Carreira and Zisserman 2017), image segmentation (He et al. 2017; Long et al. 2015), and object detection (Ren et al. 2015; Girshick 2015; Girshick et al. 2014). Therefore, transfer learning from a model trained using ImageNet has been used in this work for fine-tuning of the network. Adam has been used as the optimizer, and ReLU as the activation function. The initial learning rate has been chosen as 0.0003, and cross-entropy loss has been computed to update the weights in the training phase. The number of epochs and the mini-batch size have been set to 6 and 10, respectively.
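
The class balancing and the 80/20 split described above can be illustrated with a short NumPy sketch. The helper names and the dictionary layout are hypothetical, and the text does not state how images were selected or shuffled, so random selection without replacement is an assumption here. The hyperparameter values are the ones reported in the text.

```python
import numpy as np

# Hyperparameters as reported in the text; the dictionary itself is illustrative.
TRAINING_CONFIG = {
    "architecture": "ResNet101 (ImageNet pre-trained, fine-tuned)",
    "optimizer": "Adam",
    "initial_learning_rate": 3e-4,
    "loss": "cross-entropy",
    "epochs": 6,
    "mini_batch_size": 10,
}


def balanced_indices(labels, rng):
    """Pick the same number of samples from every class.

    ASSUMPTION: the per-class count equals the size of the smallest class
    and samples are drawn at random without replacement.
    """
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    n_per_class = counts.min()
    picked = [
        rng.choice(np.flatnonzero(labels == c), size=n_per_class, replace=False)
        for c in classes
    ]
    return np.concatenate(picked)


def split_indices(n_items, train_fraction, rng):
    """Shuffle item positions and split them into train and test parts."""
    order = rng.permutation(n_items)
    cut = int(round(train_fraction * n_items))
    return order[:cut], order[cut:]
```

A possible use is `idx = balanced_indices(labels, rng)` followed, after augmentation, by `train, test = split_indices(n_total, 0.8, rng)`.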

4 Results

Classification results have been evaluated by using these five measurements: accuracy, specificity, sensitivity, F1-score, and Matthews Correlation Coefficient (MCC), computed by:

$$\mathrm{Accuracy}=\frac{TP+TN}{TP+FP+TN+FN} \tag{1}$$
$$\mathrm{Specificity}=\frac{TN}{TN+FP} \tag{2}$$
$$\mathrm{Sensitivity}=\frac{TP}{TP+FN} \tag{3}$$
$$F1_{score}=\frac{2 \times TP}{2 \times TP+FP+FN} \tag{4}$$
$$MCC=\frac{TP \times TN - FP \times FN}{\sqrt{\left(TP+FP\right)\left(TP+FN\right)\left(TN+FP\right)\left(TN+FN\right)}} \tag{5}$$

where FP, FN, TN, and TP refer to false positives, false negatives, true negatives, and true positives, respectively. The number of images obtained from each augmentation method for each augmentation approach is given in Table 5.
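
As a worked illustration, Eqs. (1) to (5) can be computed directly from the four confusion-matrix counts. The function name is hypothetical, and returning 0 when a denominator is zero is a convention assumed here, not something stated in the text. The equations are binary; how the counts were aggregated for the three-class tasks is not specified, so this sketch covers the binary case only.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute the five measurements of Eqs. (1)-(5) from binary counts."""

    def safe_div(num, den):
        # ASSUMPTION: an undefined ratio (zero denominator) is reported as 0.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),      # Eq. (1)
        "specificity": safe_div(tn, tn + fp),                  # Eq. (2)
        "sensitivity": safe_div(tp, tp + fn),                  # Eq. (3)
        "f1_score": safe_div(2 * tp, 2 * tp + fp + fn),        # Eq. (4)
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),           # Eq. (5)
    }
```

For example, `classification_metrics(tp=40, fp=10, tn=45, fn=5)` gives an accuracy of 0.85 and a sensitivity of 40/45.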

Table 5 The number of images from each augmentation method

The augmentation methods and quantitative results obtained by classifications of HGG and LGG cases from brain MR images have been presented in Table 6.

Table 6 Results of classifications from brain MR images

Lung CT scans have been classified as a pulmonary nodule or a non-nodule with the same classifier. A total of 1004 lung nodule candidates provided by the LIDC-IDRI, of which 450 are positive candidates, have been used. After generating more positive candidate nodules with the augmentation methods, equal numbers of negative and positive candidates have been used in the classifications. The augmentation methods and quantitative results of classifications have been presented in Table 7.

Table 7 Results of classifications from lung CT images

The breast mammography images have been classified into three groups as malignant, benign, and normal. The augmentation methods and quantitative results of classifications have been presented in Table 8.

Table 8 Results of classifications from mammography images

Unlike MR, CT, and mammography images, the eye fundus images are colored. Therefore, in addition to the 10 augmentation methods applied for grayscale images, the 11th method has also been used for augmentation of these images. The augmentation methods and quantitative results of the classifications into three classes according to macular edema risk level have been presented in Table 9.

Table 9 Results of classifications from eye fundus images

5 Discussions

To overcome the imbalanced distribution and data scarcity problems in deep learning based approaches with medical images, data augmentations are commonly applied with various techniques in the literature (Sect. 2). In those approaches, different network architectures, different types of images, parameters, and functions have been used. Although they achieved good performances with augmented images (Tables 1, 2, 3 and 4), the results presented in the publications do not indicate which aspects of the approaches contributed the most. Although augmentation methods are applied as pre-processing steps and therefore have significant impacts on the remaining steps and overall performance, their contributions are not clear. Because of this, it is not clear which data augmentation technique provides more efficient results for which image type. Therefore, in this study, the effects of augmentation methods have been investigated to determine the most appropriate method for automated diagnosis from different types of medical images. This has been performed by these steps: (i) The augmentation techniques used to improve the performance of deep learning based diagnosis of diseases in different organs (brain, lung, breast, and eye) have been reviewed. (ii) The most commonly used augmentation methods have been implemented with four types of images (MR, CT, mammography, and fundoscopy). (iii) To evaluate the effectiveness of those augmentation methods in classifications, a classifier model has been designed. (iv) Classifications using the augmented images for each augmentation method have been performed with the same classifier and datasets for fair comparisons of those augmentation methods. (v) The effectiveness of the augmentation methods in classifications has been discussed based on the quantitative results.

It has been observed from the experiments in this work that transformation-based augmentation methods are easy to implement with medical images. Also, their effectiveness depends on the characteristics (e.g., color, intensity, etc.) of the images, and on whether all significant visual features according to medical conditions exist in the training sets. If most of the images in the training sets have similar features, then there is a risk of constructing a classifier model that overfits and has less generalization capability. In general, the pros and cons of the widely used traditional augmentation approaches are presented in Table 10.

Table 10 The pros and cons of traditional augmentation approaches

Generating synthetic images by GAN based methods can increase diversity. However, GANs have their own challenging problems. Two significant problems are vanishing gradients and mode collapse (Yi et al. 2019). Since mode collapse restricts the capacity of the GAN architecture to produce varied outputs, it is detrimental for augmentation of medical images (Zhang 2021). Another significant problem is that it is hard to acquire satisfactory training results if the training procedure does not assure the symmetry and alignment of the generator and discriminator networks. Artifacts can be added while synthesizing images, and therefore the quality of the produced images can be very low. Also, the trained model can be unpredictable since GANs are complex structures and managing the coordination of the generator and discriminator is difficult. Furthermore, they share similar flaws with neural networks (such as poor interpretability). Additionally, they require strong computing resources and a long processing time, and the quality of the generated images suffers when they are run on computers with low computational power and constrained resources.

Quality and diversity are two critical criteria in the evaluation of synthetically generated images from GANs. The quality criterion indicates the level of similarity between the synthetic and real images. In other words, it shows how representative synthetic images are of that class. A synthetic image’s quality is characterized by a low level of distortions, fuzziness, noise (Thung and Raveendran 2009), and its feature distribution matching with the class label (Zhou et al. 2020; Costa et al. 2017; Yu et al. 2019).

The diversity criterion indicates the level of dissimilarity of the synthetic and real images. In other words, it shows how uniform or wide the feature distribution of the synthetic image is (Shmelkov et al. 2018). When the training datasets are expanded with images lacking diversity, the datasets can provide only limited coverage of the target domain and cause low classification performance due to incorrect classifications of the images containing features that belong to the less represented regions. When the training datasets are expanded by adding synthetic images with low quality, the classifier cannot learn the features representing different classes, which leads to low classification performance. Therefore, if synthetic images are used in the training sets, they should be sufficiently diverse to represent features of each class and have high quality.

The diversity and quality of the new images produced by GANs are evaluated manually by a physician or quantitatively by using similarity evaluation measurements, which are generally the Fréchet Inception Distance (FID), structural similarity index, and peak signal-to-noise ratio (Borji 2019; Xu et al. 2018). The validity of these measurements for medical image datasets is still under investigation, and there is no accepted consensus. Manual evaluations are subjective as well as time-consuming (Xu et al. 2018) since they are based on the domain knowledge of the physician. Also, there is no commonly accepted metric for evaluating the diversity and quality of synthesized images.
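
Of the similarity measurements named above, the peak signal-to-noise ratio is the simplest to state. The sketch below uses its standard definition, PSNR = 10 · log10(MAX² / MSE), for two images of the same size. The function name and the 8-bit default peak value are assumptions for illustration, and the text does not say how synthetic images would be paired with real ones for such a comparison.

```python
import numpy as np


def psnr(reference, candidate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    if ref.shape != cand.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((ref - cand) ** 2)  # mean squared error over all pixels
    if mse == 0:
        return float("inf")           # identical images
    # ASSUMPTION: max_value is the peak intensity (255 for 8-bit images).
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Higher values indicate that the candidate is closer to the reference at the pixel level, which speaks to fidelity rather than diversity.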

About Brain MR Image Augmentation techniques The methods used for brain MR image augmentation have some limitations or drawbacks. For instance, in a study (Isensee et al. 2020), the augmentation approach uses elastic deformations, which add shape variations. However, the deformations can introduce considerable damage and noise when the deformation field varies strongly. Also, the generated images do not seem realistic and natural. It has been shown in the literature that widely used elastic deformations produce unrealistic brain MR images (Mok and Chung 2018). If the simulated tumors are not in realistic locations, then the classifier model can focus on the lesion’s appearance features and be invariant to contextual information. In another study (Kossen et al. 2021), although the generated images yielded a high Dice score in the transfer learning approach, performance only slightly improved when training the network with real images and additional augmented images, according to the results presented by the authors. This might be because of the less blurry and noisy appearance of the images produced by the used augmentation. In the augmentation with a multi-scale gradient GAN (Deepak and Ameer 2020), the discriminator observed the results of the generator network at intermediate levels because the proposed GAN structure included a single discriminator and generator with multiple connections between them. Although augmentation with TensorMixup improved the diversity of the dataset (Wang et al. 2022), its performance should be further verified with more datasets including medical images with complicated patterns, as samples generated with this augmentation may not satisfy the clinical characterizations of the images. Li et al. (2020) have observed that an unseen tumor label cannot be provided with the augmentation method and therefore the diversity of the virtual semantic labels is limited.

It has been observed that usage of the augmented images obtained by the combination of rotation, shearing, and translation provides higher performance than the other augmentation techniques in the classification of HGG and LGG cases from FLAIR brain images. On the other hand, augmentation with the combination of shearing and salt-and-pepper noise addition is the least efficient approach for improving classification performance (Table 6).

About Lung CT Image Augmentation techniques Further studies on augmentation techniques for lung CT images are still needed since current methods still suffer from some issues. For instance, in a study (Onishi et al. 2020), although performance in the classification of isolated nodules and nodules having pleural tails increased, it did not increase in the classification of nodules connected to blood vessels or the pleura. The reason might be the heavy usage of isolated nodule images (rather than images with nodules adjacent to other wide tissues like the pleura) for the training of the GAN. Also, the Wasserstein GAN structure leads to gradient vanishing problems when the weight clipping threshold is small and takes a long time to converge when the clipping threshold is large. In another study (Nishio et al. 2020), the quality of the generated 3-dimensional CT images that show nodules is not low. On the other hand, in some of those images, the lung parenchyma surrounding the nodules does not seem natural. For instance, the CT values around the lung parenchyma in the generated images are relatively higher than those in the real images. Also, lung vessels, chest walls, and bronchi in some of those generated images are not regular. Radiologists can easily distinguish those generated images by their irregular and unnatural structures. The augmentation method in a different work (Nishio et al. 2020) generates only 3-dimensional CT images of lung nodules. However, there exist various radiological findings (e.g., ground glass, consolidation, and cavity) and it is not clear whether those findings are generated or not. Also, the application can classify nodules only according to their sizes rather than other properties (e.g., absence or presence of spicules, margin characteristics, etc.). Therefore, further evaluation should be performed to see whether nodule classification performance can be increased with the new lung nodules for other classifications, such as malignant or benign cases. A constant coefficient was used in the loss function to synthesize lung nodules by Wang et al. (2021). It affects the training performance and should be chosen carefully. Therefore, the performance of the augmentation method should be evaluated with an increased number of images. Although the method proposed by Toda et al. (2021) has the potential to generate images, spicula-like features are obscure in the generated images and it does not include the distribution of true images. Also, the generation of some features contained in the images (e.g., cavities around or inside the tumor, prominent pleural tail signs, non-circular shapes) is difficult with the proposed augmentation. Therefore, images with those features tend to be misclassified. In a study, augmentation with elastic deformations, which add shape variations, brings noise and damage when the deformation field varies strongly (Müller et al. 2021). In a different approach, as identified by the authors, the predicted tumor regions are prone to imperfections (Farheen et al. 2022).

It has been observed that usage of the augmented images obtained by the combination of translation and shearing techniques in the classification of images with nodules and without nodules provides higher performance than the other augmentation methods. On the other hand, augmentation only by shearing is the least efficient approach (Table 7).

About Breast Mammography Image Augmentation techniques Augmentation of breast mammography images is one of the significant and fundamental directions that need further investigation and future research efforts. Although scaling, translation, rotation, and flipping are widely used, they are not sufficient for augmentation since the additional information provided by them is not enough to create variations in the images, and so the diversity of the resulting dataset is limited. For instance, in a study performed by Zeiser et al. (2020), although the classifier classifies pixels having high intensity values as masses, as desired, the network ends up producing FPs in dense breasts. Therefore, the augmentation technique applied as a pre-processing step should be modified to expand the datasets more efficiently and to increase the generalization ability of the proposed classifier. Besides, the performance of the applications should be tested with not only virtual images but also real-world images. Although the results of the applications that use generated images from a GAN based augmentation are promising (Table 3), their performances should be evaluated with increased numbers and variations of images, since augmentation by synthesizing existing mammography images is difficult because of the variations of masses in terms of shape and texture as well as the presence of diverse and intricate breast tissues around the masses. The common problems in those GAN based augmentations are mode collapse and saddle point optimization (Yadav et al. 2017). In the optimization problem, there exists almost no guarantee of equilibrium between the training of the discriminator and generator functions, causing one network, generally the discriminator, to inevitably become stronger than the other. In the mode collapse problem, the generator focuses on a limited number of data distribution modes and generates images with limited diversity.

Experiments in this study indicated that the combined technique consisting of translation, shearing, and rotation is the most appropriate approach for the augmentation of breast mammography images in order to improve the classification of the images as normal, benign, and malignant. On the other hand, the combined technique consisting of salt-and-pepper noise addition and shearing is the least appropriate approach (Table 8).

About Eye Fundus Image Augmentation techniques Further research on the augmentation techniques for fundus images is still needed to improve the reliability and robustness of computer-assisted applications, because the tone qualities of fundus images are affected by the properties of fundus cameras (Tyler et al. 2009) and the images used in the literature have been acquired from different types of fundus cameras. Therefore, the presented applications that fit well to images obtained from one fundus camera may not generalize to images from other kinds of fundus camera systems. Also, images can be affected by pathological alterations. For this reason, the applications should be robust to those alterations, particularly to the pathological changes that do not exist in the images used in the training steps. Therefore, although the results of the applications in the current literature indicated high performances (Table 4), their robustness should be evaluated using other datasets with increased numbers and variations of images taken from different types of fundus cameras. Also, the contributions of the applied augmentation techniques to the presented performances are not clear. In a study performed by Zhou et al. (2020), although the GAN structure is able to synthesize high-quality images in most cases, the lesion and structural masks used as inputs are not real ground truth images. Therefore, the generator’s performance depends on the quality of those masks. Also, the applied GAN architecture fails to synthesize some lesions, such as microaneurysms. In another study performed by Ju et al. (2021), the GAN based augmentation might lead to biased results because of matching the generated images to the distribution of the target domain.

Experiments in this study indicated that usage of the augmented images obtained by the combination of color shifting, sharpening, and contrast changing provides higher performance than the other augmentation techniques in the classification of eye fundus images. On the other hand, augmentation with translation is the least efficient approach for improving classification performance (Table 9).

6 Conclusion

Transformation-based augmentation methods are easy to implement with medical images. Also, their effectiveness depends on the characteristics (e.g., color, intensity, etc.) of the images, and whether all significant visual features according to medical conditions exist in the training sets.

GAN based augmentation methods can increase diversity. On the other hand, GANs have vanishing gradient and mode collapsing problems. Also, obtaining satisfactory training results is not easy if the training procedure does not assure the symmetry and alignment of both generator and discriminator networks. Besides, GANs are complex structures and managing the coordination of the generator and discriminator is difficult.

The combination of rotation, shearing, and translation provides higher performance than the other augmentation techniques in the classification of HGG and LGG cases from FLAIR brain images. However, augmentation with the combination of shearing and salt-and-pepper noise addition is the least efficient approach for improving classification performance.

The combination of translation and shearing techniques in the classification of lung CT images with nodules and without nodules provides higher performance than the other augmentation methods. However, augmentation only by shearing is the least efficient approach.

The combination of translation, shearing, and rotation is the most appropriate approach for the augmentation of breast mammography images in order to improve the classification of the images as normal, benign, and malignant. On the other hand, the combined technique consisting of salt-and-pepper noise addition and shearing is the least appropriate approach.

The combination of color shifting, sharpening, and contrast changing provides higher performance than the other augmentation techniques in the classification of eye fundus images. On the other hand, augmentation with translation is the least efficient approach for improving classification performance.

As an extension of this work, the effectiveness of the augmentation methods will be evaluated in the diagnosis of diseases from positron emission tomography, ultrasonography images, other types of MR sequences (e.g., T2, T1, and proton density-weighted), and also in the classification of other types of images such as satellite or natural images. Also, GAN based augmentations will be applied, and quantitative and qualitative analyses of the generated images will be performed to ensure their diversity and realism. In this study, ResNet101 has been used due to its advantage based on residual connections and efficiency in classification. Other convolutional network models will be implemented in our future work.