1 Introduction

Deep learners, especially convolutional neural networks (CNNs), are now the dominant paradigm in image classification, as witnessed by the plethora of articles in the literature that spotlight these networks. CNNs are attractive, in part, because they learn to extract salient features directly from the samples, bypassing the need for human intervention in selecting an appropriate feature extraction method for the task at hand. These learned features have been shown to eclipse the power of handcrafted features, chiefly because CNNs progressively downsample the spatial resolution of images while enlarging the depth of the feature maps.

Despite the strengths of CNNs, there are some significant drawbacks. Because CNNs have an enormous number of parameters, these networks tend to overfit when trained on small datasets. Overfitting reduces the classifier's ability to generalize, that is, to correctly predict unseen samples. Researchers are therefore pressured to collect colossal datasets to accommodate the needs of deep learners, as exemplified by the ever-growing ImageNet [1], which now contains over 14 million images spanning more than 1000 classes. In many domains, such as medical image analysis and bioinformatics (where samples might only number in the hundreds), collecting sufficient data for proper CNN training is prohibitively expensive and labor-intensive. The need for enormous datasets also requires that researchers have access to costly machines with considerable computational power.

Several solutions to the problem of overfitting that bypass the need for collecting more data have been proposed. Two of the most powerful techniques are (1) transfer learning, where a given CNN architecture is pretrained on a massive dataset and provided to researchers and practitioners so that the network can be finetuned on smaller datasets, and (2) data augmentation, which adds new data points based on the original samples in a training set. Other methods unrelated to the techniques employed here include dropout [2], zero-shot/one-shot learning [3, 4], and batch normalization [2].

This study focuses on data augmentation since it has become a vital technology in fields where large datasets are difficult to procure [5,6,7]. Data augmentation methods increase the amount of training data by adding slightly modified copies of existing samples or synthetic samples derived from them. They act as a regularizer and help reduce overfitting when training a machine learning model. Data augmentation not only promotes learning that leads to better CNN generalization but also mitigates overfitting by exploiting information that is inherent within the training space. Data augmentation (DA) is a key element in the success of CNN models, as its use can lead to faster convergence and better prediction accuracy.

In the literature (see the surveys [5,6,7]), most of the work focuses on geometric transforms, statistical methods for color modification, and, more recently, learned methods such as those based on GANs. To the best of our knowledge, this is the first paper to report a very large study of data augmentation approaches based on feature transforms, testing them on several datasets spanning different applications.

Specifically, we focus on data augmentation techniques for image classification. These techniques can be divided into two broad types depending on whether the methods are based on basic image manipulations (such as translating and cropping) or on deep learning approaches [5]. The main objective of this study is to evaluate the feasibility of building ensembles at the data level by adding augmented images generated using different sets of image manipulation methods, an approach taken in [8]. Unlike [8], however, this work performs a more exhaustive study, building ensembles of augmentation methods by assessing over twice the number of techniques across eleven (versus only four) benchmark datasets.

The remainder of this paper is organized as follows: in Sect. 2, we review some of the best-performing image manipulation approaches used for data augmentation. In Sect. 3, novel data augmentation algorithms based on the Radon transform (RT) [9], the discrete cosine transform (DCT), and the Fourier transform (FT) are proposed. As described more fully in Sect. 3, ensembles are built with pretrained ResNet50s finetuned on training sets composed of images taken from the original data combined with images generated by an augmentation technique. These networks and their fusions are evaluated on the benchmarks described at the end of Sect. 3. In Sect. 4, we compare the performance of the individual augmentation approaches and the ensembles built on them. The best ensemble reported in this work either exceeds or matches the state-of-the-art performance reported in the literature on all the tested datasets. In Sect. 5, we conclude with a few suggestions for further research in this area.

The main contributions of this study can be summarized as follows:

  • Presented is an extensive evaluation of common image manipulation methods used for data augmentation across eleven freely available and diverse benchmarks.

  • Proposed are three new augmentation approaches utilizing the RT, the FT, and the DCT.

  • Demonstrated is the value of building deep ensembles of classifiers on the data level by adding to the training sets images generated using different data augmentation approaches: the experimentally derived ensemble developed in this work is shown to achieve state-of-the-art performance on several benchmarks.

  • Provided to the public at no charge is the MATLAB source code used in the experiments reported in this work (available at https://github.com/LorisNanni/Feature-transforms-for-image-data-augmentation).

2 Related work

As mentioned in the Introduction, this study focuses on building ensembles with augmentation methods produced by the application of image manipulation techniques. In [5], these methods are divided into the following groups: (1) simple geometric transformations, (2) random erasing and cutting, (3) mixing images, (4) kernel filters, and (5) color space transforms. Most of these augmentation algorithms are easy to implement. Practitioners must be careful, however, when applying these image manipulations because it is possible to produce new images that no longer belong to the same class as the original. Flipping an image of the digit six, for instance, can result in an image recognized as the digit nine.

Flipping, especially along the horizontal axis, is one of the simplest and most popular geometric transforms for data augmentation [5], as are rotation (typically clockwise or counterclockwise by an angle in the range [1°, 359°]) and translation (where positional bias is avoided by shifting a sample up, down, left, and right) [5]. A problem with translation is that it can introduce undesirable noise [10]. Random cropping is another simple technique; it reduces the size of new images, which is often needed to fit the input of a network. Augmented data can also be generated by merely substituting random values in an image, as extensively evaluated in [11]. In [12], the authors compared the performance of these simple augmentation techniques, each used to train AlexNet and assessed on two datasets, ImageNet and CIFAR10 [13]: rotation was found to perform better than translation, random cropping, and random value substitution.

Random erasing [14] and cutting [15] occlude images; these methods model what occurs regularly in the real world, where objects are often only partially visible in the visual field. A review of the literature on augmentation methods in this category can be found in [6]. Of particular interest is the method proposed in [14], which randomly erases patches of varying size from an image. This method of partially erasing images was evaluated on ResNet architectures across three datasets (Fashion-MNIST, CIFAR10, and CIFAR100) and showed consistent performance improvements.

Another simple method for constructing images is to mix them. A simple way to accomplish this is to average the pixels of two or more images belonging to the same class [16]. Alternatively, images can be submitted to a transform, and the resulting components can be mixed, for example, by chaining, as in [17]. Masks can also be applied. In [16], the authors combined images using several image manipulation techniques: random images were flipped and cropped, and then the RGB channel values for each pixel were averaged. Nonlinear methods for mixing images were also proposed in [18], where GANs were used to blend images.

Kernel filters can also be applied to create new images within a sample space. Filters are often used to sharpen or blur images. Filters, such as Gaussian blur, are applied by sliding an n × n window across the image. PatchShuffle, proposed in [19], randomly swaps matrix values in the filter window to make new images.

Novel color images can be created by means of color space transformations. A positive side-effect of this technique is the removal of illumination bias [5]. Transformations of color space can involve computing a histogram of the pixels in a color channel and applying different filters, much like the filters positioned over camera lenses to alter the characteristics of the color space in a scene. Alternatively, one color space can be converted into another. Care must be taken when transforming the color space, as it has been observed, for example, that changing an RGB image to a grayscale image can reduce the performance of a classifier [20]. New images can be produced by adding noise to color distributions or by jittering and adjusting the brightness, contrast, and saturation of samples [12, 21]. These color adjustments run the risk of removing valuable information. A review of color space transforms for image augmentation and a comparison of this type of image manipulation with geometric transforms are available in [22].

Not all data augmentation techniques operate on a single image in isolation. One popular technique in this vein is PCA jittering [12, 21,22,23,24], which produces new images by perturbing the PCA components by a small amount. In [22], only the first component, which contains the most information, is jittered by multiplying it by a random number drawn from a uniform distribution. Finally, in [23], an original image is transformed by PCA and DCT and jittered by adding noise to all components before reconstructing the image.

3 Materials and methods

3.1 Proposed approach

The method proposed here for building ensembles of deep learners is illustrated in Fig. 1. A given training set is augmented using n = 14 approaches, each detailed in Sect. 3.2. These new training sets are then used to finetune fourteen ResNet50s pretrained on ImageNet. In this work, each pre-trained ResNet50 is finetuned with a batch size of 30 and a learning rate of 0.001. In the testing phase, each unknown sample is classified by the 14 CNNs, and the resulting scores are fused by the sum rule.

Fig. 1 Ensembles are built on pretrained ResNet50s finetuned on training sets composed of original images combined with images generated by several data augmentation techniques
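For concreteness, the following is a minimal MATLAB sketch of the finetuning configuration described above (batch size 30, learning rate 0.001). The layer names, the Deep Learning Toolbox calls, and the variable trainData (an imageDatastore holding the original plus augmented images) are our assumptions, not the authors' released code:

    % Minimal finetuning sketch; assumes the Deep Learning Toolbox, the ResNet-50
    % support package, and an imageDatastore trainData of original + augmented images.
    net = resnet50;                                    % pretrained on ImageNet
    lgraph = layerGraph(net);
    numClasses = numel(categories(trainData.Labels));
    % Replace the 1000-class head with a new head for the target problem.
    lgraph = replaceLayer(lgraph, 'fc1000', ...
        fullyConnectedLayer(numClasses, 'Name', 'fc_new'));
    lgraph = replaceLayer(lgraph, 'ClassificationLayer_fc1000', ...
        classificationLayer('Name', 'out_new'));
    opts = trainingOptions('sgdm', ...
        'MiniBatchSize', 30, ...                       % batch size used in this work
        'InitialLearnRate', 0.001, ...                 % learning rate used in this work
        'Shuffle', 'every-epoch');
    trainedNet = trainNetwork(trainData, lgraph, opts);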

3.2 Data augmentation methods

In this section, we describe the data augmentation sets (APP1-14) explored in this study. APP1-11 are detailed in [8], so they receive less attention here. APP12-14 are proposed here for the first time; these augmentations are explained more fully and illustrated in Fig. 2 (resulting images) and Fig. 3 (methods).

Fig. 2 Examples of an original image (lower left-hand side) augmented by methods APP12 (top), APP13 (middle), and APP14 (bottom)

Fig. 3 Schema for augmentation sets APP12-14 (designated #12-#14 to save space)

The fourteen augmentation sets are defined as follows:

APP1 (number of new images generated: 3) takes a given image and randomly reflects it top-bottom and left-right to produce two new images. The third transform linearly scales the original image along both axes by two factors randomly drawn from the uniform distribution on [1, 2].

APP2 (number of new images generated: 6) replicates APP1 and adds three manipulations: rotation (by an angle randomly drawn from [−10, 10] degrees), translation (along both axes by a value randomly sampled from the interval [0, 5] pixels), and shear (with vertical and horizontal angles randomly sampled from the interval [0, 30] degrees).

APP3 (number of new images generated: 4) replicates APP2 without shear.
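To illustrate the geometric manipulations in APP1-3, the following MATLAB sketch generates one image per transform. The Image Processing Toolbox functions used here (imresize, imrotate, imtranslate, randomAffine2d, imwarp) are our choice of primitives, not necessarily the authors' implementation:

    % Hedged sketch of the APP1-3 geometric manipulations for one image img.
    a1 = flipud(img);                                        % top-bottom reflection
    a2 = fliplr(img);                                        % left-right reflection
    sz = [size(img, 1), size(img, 2)];
    a3 = imresize(img, round(sz .* (1 + rand(1, 2))));       % scaling factors in [1, 2]
    a4 = imrotate(img, -10 + 20*rand, 'bilinear', 'crop');   % rotation in [-10, 10] degrees
    a5 = imtranslate(img, randi([0 5], 1, 2));               % translation of [0, 5] pixels
    tform = randomAffine2d('XShear', [0 30], 'YShear', [0 30]); % shear angles in [0, 30] degrees
    a6 = imwarp(img, tform, 'OutputView', imref2d(sz));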

APP4 (number of new images generated: 3), proposed in [23], applies a transform based on PCA, where the PCA coefficients extracted from a given image are perturbed (1) by randomly setting to zero (with a probability of 0.5) each element of the feature vector; (2) by adding noise to each component based on the standard deviation of the projected image; and (3) by selecting five images from the same class as the original image, computing their PCA vectors, and randomly selecting components (with a probability of 0.05) from the original PCA vector and swapping them with the corresponding components of the five other PCA vectors. The inverse PCA transform is applied to the three perturbed PCA vectors of the original image to produce the augmented images.

APP5 (number of new images generated: 3) applies the same perturbation methods as those described for APP4, using the DCT rather than PCA. With APP5, the DC coefficient is never changed.

APP6 (number of new images generated: 3) is applied to color images. The three images are constructed by altering contrast, altering sharpness, and shifting color. The contrast is altered by linearly scaling the original contrast of the image between a, the lowest value, and b, the highest value allowed for the augmented image; every pixel in the original image outside this range is mapped to 0 if less than a or to 255 if greater than b. Sharpness is altered by blurring the original with a Gaussian filter (variance = 1) and subtracting the blurred image from the original. The color is shifted with three integer shifts obtained from three RGB filters; each shift is then added to one of the three channels of the original image.

APP7 (number of new images generated: 7) is applied to color images. The first four augmented images are produced by altering the pixel colors of the original image using the MATLAB function jitterColorHSV with values randomly selected in the range [0.05, 0.15] for hue, [−0.4, −0.1] for saturation, [−0.3, −0.1] for brightness, and [1.2, 1.4] for contrast. Image five is generated with the MATLAB function imgaussfilt, where the standard deviation of the Gaussian filter is randomly chosen in [1, 6]. Image six is generated by the MATLAB function imsharpen with the sharpening strength equal to two and the radius of the Gaussian low-pass filter equal to one. Image seven applies the same color shifting detailed for APP6.
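A minimal sketch of the MATLAB calls named above is given below; the exact name-value pairs and the random sampling are our reading of the description, not the authors' code:

    % Hedged sketch of the APP7 color manipulations for a color image img.
    j1 = jitterColorHSV(img, 'Hue', [0.05 0.15], 'Saturation', [-0.4 -0.1], ...
                        'Brightness', [-0.3 -0.1], 'Contrast', [1.2 1.4]);
    j5 = imgaussfilt(img, 1 + 5*rand);                % Gaussian blur, sigma in [1, 6]
    j6 = imsharpen(img, 'Amount', 2, 'Radius', 1);    % sharpening as described above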

APP8 (number of new images generated: 2) is applied to color images; the two augmented images are produced by randomly selecting a target image belonging to the same class as the given image and then applying two nonlinear mappings: RGB histogram specification and stain normalization using the Reinhard method [25].

APP9 (number of new images generated: 6), proposed in [8], applies two elastic deformations: a MATLAB method that introduces distortions into the original image and an RGB adaptation of ElasticTransform from the computer vision tool Albumentations (available at https://albumentations.ai/, accessed 01/15/22). Both methods transform a given image by applying a randomly generated displacement field to its pixels, with the displacement values drawn from the uniform distribution on [−1, 1]. The resulting horizontal and vertical displacement fields are passed through three low-pass filters: (1) a circular averaging filter, (2) a rotationally symmetric Gaussian low-pass filter, and (3) a rotationally symmetric Laplacian of Gaussian filter. For more details, see [8].

APP10 (number of new images generated: 3), proposed in [8], is based on the DWT [26] (specifically, the Daubechies wavelet db1, which has one vanishing moment). The DWT produces four matrices: the approximation coefficients (cA) and the horizontal, vertical, and diagonal coefficients (cH, cV, and cD, respectively). APP10 performs three perturbations on these matrices to generate three new images. In the first method, each element in the coefficient matrices is randomly selected, with a probability of 0.5, to be set to zero. In the second method, a constant, calculated by summing the standard deviation of the original image and a number randomly selected in the range [−0.5, 0.5], is added to each element. In the third method, five images from the same class as the original image are randomly selected, and the DWT coefficient matrices are calculated for each one; elements of the original cA, cH, cV, and cD matrices are then replaced, with a probability of 0.05, with values from the matrices of the other five images. The three augmented images are produced by applying the inverse DWT to the three sets of perturbed matrices.
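The following MATLAB sketch illustrates the first of the three APP10 perturbations; the Wavelet Toolbox functions dwt2/idwt2 and the per-channel treatment are our assumptions about the implementation:

    % Hedged sketch of the first APP10 perturbation for one channel img.
    [cA, cH, cV, cD] = dwt2(double(img), 'db1');   % one-level 2-D DWT (Daubechies db1)
    cA(rand(size(cA)) < 0.5) = 0;                  % zero each coefficient with prob. 0.5
    cH(rand(size(cH)) < 0.5) = 0;
    cV(rand(size(cV)) < 0.5) = 0;
    cD(rand(size(cD)) < 0.5) = 0;
    aug = idwt2(cA, cH, cV, cD, 'db1');            % inverse DWT gives the augmented channel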

APP11 (number of new images generated: 3) was first proposed in [8] and is based on the Constant-Q Transform (CQT) [27]. After the CQT arrays of a given image are calculated, they undergo the same perturbations used in APP10 to generate the three images. The three augmented images are then produced by applying the inverse CQT to the perturbed CQT arrays.

APP12 (number of new images generated: 5) is a new method proposed here based on the DCT and the random selection of other images. Three images are randomly selected from the same class as the original image and two from a different class. The original image and the five selected images are projected into the DCT space, and each element of the five images has a probability of 0.2 of being averaged with the corresponding DCT element of the original. Notice that this approach is cumulative, as can be observed in the following pseudo-code:

figure a

The inverse DCT (IDCT) is applied at the end. Since there are three channels in each color image, these perturbations are applied to each channel independently.
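Because the pseudo-code figure is not reproduced here, the following MATLAB sketch gives one plausible reading of the description above; the function name app12_sketch, the saving of an augmented channel after each mixing step, and the assumption that all selected images share the size of the original are ours:

    % Hedged sketch of APP12 for one channel; img is the original channel and
    % others is a cell array of 5 channels (3 from the same class, 2 from another).
    function augs = app12_sketch(img, others)
        D = dct2(double(img));                     % DCT of the original channel
        augs = cell(1, numel(others));
        for k = 1:numel(others)
            Dk = dct2(double(others{k}));          % DCT of the k-th selected image
            mask = rand(size(D)) < 0.2;            % each coefficient mixed with prob. 0.2
            D(mask) = (D(mask) + Dk(mask)) / 2;    % cumulative averaging with the original
            augs{k} = idct2(D);                    % inverse DCT after each mixing step
        end
    end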

APP13 (number of new images generated: 3) takes an original image and builds three augmented images using the Radon transform, as can be observed in the following pseudo-code:

figure b

where randSel(num, range) randomly selects num values in the given range.

The first image is produced with the Radon transform (RT), which projects the original image's intensity along radial lines oriented at specific angles (with values in [0, 179] degrees); for the first image, twenty angles are randomly selected and discarded before the image is back-projected with the inverse RT (IRT). The second image is generated in the same way as the first, except that all angles are used to project the image with RT, after which 15% of the angle values (that is, the columns of the projected image) are set to zero before IRT is applied. The third image is like the first in that 20 angles are randomly selected and discarded in the projection step and, as in the construction of the second image, 15% of the remaining angles (columns of the projected image) are set to zero before IRT is applied.
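A minimal MATLAB sketch of the first two variants is given below; the use of radon/iradon from the Image Processing Toolbox is our choice, and the reconstructed image may need cropping or resizing back to the original size, which is omitted here:

    % Hedged sketch of the first two APP13 variants for one grayscale channel img.
    theta = 0:179;                                    % full set of projection angles
    keep = theta;
    keep(randperm(numel(theta), 20)) = [];            % variant 1: drop 20 random angles
    R1 = radon(double(img), keep);                    % Radon projection
    aug1 = iradon(R1, keep);                          % back-projection with the inverse RT

    R2 = radon(double(img), theta);                   % variant 2: project with all angles
    cols = randperm(numel(theta), round(0.15*numel(theta)));
    R2(:, cols) = 0;                                  % zero 15% of the angle columns
    aug2 = iradon(R2, theta);
    % Variant 3 combines both steps: drop 20 angles, then zero 15% of the columns.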

APP14 (number of new images generated: 2): the first image is generated using the Fast Fourier Transform (FFT); after the FFT is applied, 50% of the coefficients are randomly set to zero before performing the inverse FFT. The second image is built by applying the DCT; a square low-frequency filter (size 40 × 40) is then applied to the DCT image before performing the inverse DCT:

figure c

where randomMask(image, prob) returns a random pixel mask of the same size as the image, and prob is the probability of each pixel being 1.
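A minimal MATLAB sketch of APP14 for a single channel follows; fft2/ifft2 and dct2/idct2 are our choice of primitives, and a color image would be processed per channel:

    % Hedged sketch of APP14 for one channel img.
    F = fft2(double(img));
    F(rand(size(F)) < 0.5) = 0;          % randomly zero about 50% of the FFT coefficients
    aug1 = real(ifft2(F));               % first augmented channel (real part kept)

    D = dct2(double(img));
    lowpass = zeros(size(D));
    lowpass(1:40, 1:40) = 1;             % square 40 x 40 low-frequency filter
    aug2 = idct2(D .* lowpass);          % second augmented channel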

3.3 Datasets

In this work, ensembles of augmentation methods are tested and compared with the literature on eleven benchmark datasets for image classification (see Table 1).

Table 1 Description of the eleven datasets used in this study

In Table 1, the following information is reported for each dataset: a short name, the original dataset name (if provided in the reference), the number of classes and samples, the size(s) of the images, the testing protocol, and the original reference. The abbreviations for the testing protocols in Table 1 are detailed as follows:

  • 5CV and 10CV denote fivefold and tenfold cross-validation, respectively.

  • Tr-Te denotes a dataset that is pre-divided into training and testing sets. For LAR, a threefold division is provided by the authors. For PBC, the official protocol specifies that 88% of the images be included in the training set and 12% in the test set, with both sets maintaining the same per-class sample ratio as the original dataset. END includes a training set of 3302 images and an external validation set of 200 images.

The performance indicator typically reported on these datasets is accuracy, which measures the rate of correct classifications. For the GRAV dataset, four different views are extracted at different durations from each glitch/image; therefore, the final score is obtained by combining the four classification scores via the average rule. Validation of the superiority of one method over the others is provided by the Wilcoxon signed rank test [39].

4 Experimental results

We start our experiments by comparing the performance of the augmentation sets with ResNet50 (see Table 2), along with the results obtained by combining these approaches into the following ensembles:

  • EnsDA_A: this is the fusion by sum rule of all the ResNet50s trained using APP1-11, each trained with a different data augmentation approach (a sketch of the sum-rule fusion is given after this list). The data augmentation methods based on color spaces (APP6-8) are not used for VIR, HE, and MA since those datasets contain gray-level images.

  • EnsDA_B: this fusion is the same as EnsDA_A except for the addition of the ResNet50s trained with the new augmentation methods APP12-14.

  • EnsDA_C: this is the fusion by sum rule of the methods not based on feature transforms. Each approach is iterated twice (three times for the gray-level datasets VIR, HE, and MA, so that the size of EnsDA_C is similar to that of EnsDA_B).

  • EnsBase(X): this is a baseline ensemble intended to validate the performance of EnsDA_* (i.e., 1–3 above); EnsBase(X) combines (via sum rule) X ResNet50 networks trained separately on APP3, which produces the best average performance among all the data augmentation sets.
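As referenced above, the sum-rule fusion can be sketched in MATLAB as follows; the cell array nets of finetuned networks and the test datastore testData are hypothetical names:

    % Hedged sketch of the sum-rule fusion used by the ensembles above.
    scores = 0;
    for k = 1:numel(nets)
        scores = scores + predict(nets{k}, testData);   % accumulate per-class scores
    end
    [~, predictedClass] = max(scores, [], 2);           % index of the highest summed score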

Table 2 Performance accuracy (in %) of data augmentation sets (APP1-14) and defaults

The label NoDA in the first row of Table 2 denotes a stand-alone ResNet50 trained without data augmentation.

Several conclusions can be drawn by examining Table 2:

  • The best augmentation set varies with each dataset: in some, the best approach is a feature transform (see GRAV, Triz, and END); in others, a color-based method (see POR), and for some, the best performance is obtained using affine transformations.

  • Considering a stand-alone CNN, in some datasets (see PBC), the performance of NoDA is similar to that of the best APP augmentation sets.

  • In general, though, the ensembles strongly boost the performance of NoDA: both EnsBase(11) and EnsBase(14) outperform NoDA with a p value of 0.1, and all the EnsDA_* outperform NoDA with a p value of 0.001. Across all the datasets, the EnsDA_* ensembles obtain an accuracy higher than or equal to that obtained by NoDA.

  • EnsDA_C outperforms EnsBase(14) with a p value of 0.1. Both EnsDA_A and EnsDA_B (which include the augmentation methods based on feature transforms) outperform EnsBase(14) with a p value of 0.05. Among the tested ensembles, our suggested approach is EnsDA_B since it obtains the highest average performance among the EnsDA_* (EnsDA_A: 93.34%, EnsDA_B: 93.54%, EnsDA_C: 92.98%, EnsBase(14): 90.76%).

The approach proposed in [40] selects only a subset of images from a larger set of generated images. Here, we used as the base set all the images created by the methods belonging to EnsDA_B; to reduce computation time, this selection approach was tested only on a subset of the datasets, and it produced no improvement in performance compared to EnsDA_B.

A second experiment was performed to confirm the previous results on a different architecture. In Table 3, the same ensembles reported in Table 2 are evaluated using MobileNetV2 [41] instead of ResNet50. MobileNetV2 is a lightweight architecture that produces results comparable to heavier architectures while using far fewer computational resources.

Table 3 Performance accuracy (in %) using MobileNetv2 as a model

The results in Table 3 substantially confirm the conclusions drawn from Table 2, viz., that the proposed data augmentation approaches are valid methods for increasing diversity among classifiers and designing high-performing ensembles: EnsDA_* outperform EnsBase(14) with a p value of 0.1.

Although accuracy is probably the most widely used performance indicator for classification problems, it is not the most suitable for comparing classifiers. The area under the ROC curve (AUC) is preferred as a standard measure of predictive performance. The AUC is an estimate of the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. In this work, we use the complement of the AUC, i.e., the error under the ROC curve (EUC): EUC = 1 − AUC. Thus, in Tables 4 and 5, the performance of the proposed approaches is reported in terms of EUC. Because EUC is an indicator for binary classifiers, in multiclass problems the average one-versus-all EUC is used (computed with the MATLAB rocmetrics function). The results in Tables 4 and 5 largely reflect the trend observed for accuracy, namely, that the proposed ensembles based on data augmentation outperform both the baseline ensembles and the stand-alone approaches.
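A minimal MATLAB sketch of this EUC computation is given below; rocmetrics is from the Statistics and Machine Learning Toolbox (R2022a or later), and trueLabels, scores, and classNames are hypothetical variable names:

    % Hedged sketch of the multiclass EUC computation described above.
    rocObj = rocmetrics(trueLabels, scores, classNames);  % one-versus-all ROC per class
    meanAUC = mean(rocObj.AUC);                            % average one-versus-all AUC
    EUC = 1 - meanAUC;                                     % error under the ROC curve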

Table 4 EUC (1-AUC) using ResNet50 as a model
Table 5 EUC using MobileNetv2 as a model

The results reported in Tables 4 and 5 substantially confirm previous conclusions:

Considering EUC, EnsDA_A and EnsDA_B outperform EnsBase(14) with a p value of 0.1 for both tested network topologies (MobileNetV2 and ResNet50), and EnsDA_B obtains a better average performance than EnsDA_A.

In Table 6, our best ensemble is compared with the best methods reported in the literature on the same datasets. As can be observed, our proposed method obtains state-of-the-art or comparable performance. Note that the performance indicator for the LAR dataset is the F1-measure because that is the measure most commonly reported in the literature for this dataset.

Table 6 Performance (accuracy, in %) compared with the best results in the literature

In Fig. 4, we evaluate the disagreement of predictions among different CNNs to assess their degree of diversity [58]. The cosine similarity among scores is calculated for eight networks: the first five are trained separately on APP3 (the POR dataset was used for this experiment), while the last three are trained on datasets augmented by APP12, APP13, and APP14, respectively. As can be observed from Fig. 4, the dissimilarity among scores is highest in the last three rows/columns, indicating higher diversity among the classifiers.

Fig. 4 Cosine similarity between the scores of eight ResNet50s: networks 1–5 trained on the original training set and networks 6–8 trained on an augmented training set
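A minimal MATLAB sketch of the diversity analysis follows; scores{i} is assumed to be the N-by-C score matrix produced by the i-th network on the test set:

    % Hedged sketch of the pairwise cosine similarity between network scores.
    n = numel(scores);
    S = zeros(n);
    for i = 1:n
        for j = 1:n
            a = scores{i}(:);                          % flatten the N-by-C score matrix
            b = scores{j}(:);
            S(i, j) = (a' * b) / (norm(a) * norm(b));  % cosine similarity
        end
    end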

5 Conclusion

In this study, we compared ensembles of pretrained ResNet50s finetuned on training sets augmented with some of the best-performing image manipulation methods for generating new images. The performance of these networks and their fusions was compared across eleven benchmarks representing diverse image classification tasks.

This study shows that constructing ensembles of deep learners at the data level, by adding images generated by different data augmentation techniques, increases the robustness of CNNs. Given the diversity of the selected benchmarks, the approach taken here for building CNN ensembles should work on most image classification problems.