1 Introduction

In recent years Deep Learning (DL) research has made incredible progress in solving complex image-related problems. This progress, and the accompanying rise of interest in the area, was greatly motivated by the first widely-known successful application of Convolutional Neural Networks (CNNs) to the problem of image classification in 2012, when AlexNet (Krizhevsky et al. 2012a) clearly outperformed shallow methods. There were three main reasons for this success: advancements in deep network architectures, the availability of huge computational power, and access to large amounts of training data. An acclaimed example of a large data set that fostered further development in this area is ImageNet (Deng et al. 2009), around which a competition called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was organized. This competition enabled tracking the performance of various CNN models over time and spurred the development of several renowned architectures like VGG (Simonyan and Zisserman 2015), GoogLeNet (Szegedy et al. 2015) or ResNet (He et al. 2016a). Throughout the years one could observe how the classification top-5 error plummeted from more than \(25\%\) in 2010–2011, when shallow methods were used, to less than \(5\%\) in 2015 with the use of DL.

CNNs are successfully used in various computer vision tasks such as image classification, object localization, object detection, image segmentation, action recognition in videos or image captioning (the last task usually in combination with Recurrent Neural Networks). The main focus of this survey is on the image classification problem, since the vast majority of mixing augmentation methods were designed specifically with this task in mind.

Generally speaking, when creating any DL algorithm for image categorization the goal is to train the model to correctly predict classes for new (previously unseen) images based on the available training data. This ability is called generalization, as the model is able to generalize the knowledge extracted from training images to correctly classify images not encountered during training. A failure to generalize is referred to as model over-fitting and is an unwanted property. The goal of a DL practitioner is to find a model that is complex enough to learn from all training samples (training examples) but simple enough not to memorize them. In many practical cases a viable approach is to create a model that is overly complex for a given training set while, at the same time, introducing mechanisms that prevent the algorithm from over-fitting. There are various ways to increase the generalization ability of DL models (and avoid over-fitting), for example by means of regularization mechanisms (Kukacka et al. 2017) such as dropout, weight decay or batch normalization, as well as the application of specific training strategies like transfer learning or zero-shot learning.

Although many tasks can be solved using the data sets, model architectures and computational power available right now, in many domains the lack of sufficient amounts of training data remains one of the key obstacles to DL application. This problem is especially severe in areas where gathering labeled training samples is costly or requires special skills (e.g. medical images), or where the scale of the business case is not large enough to justify an investment in gathering labeled training data.

One possible remedy, leading to partial alleviation of this problem, is data augmentation (DA), which is one of the regularization methods (Kukacka et al. 2017) designed to create additional observations based on the available ones and thus increase the size of the training set. The focus of this survey is on DA methods that apply certain transformations to the source (original) images in order to create new training examples. Both data warping (i.e. applying a label preserving transformation to the image) and synthetic over-sampling (i.e. creation of artificial samples) approaches are considered. Additionally, we discuss strategies for selecting the augmentation technique(s) best suited for a given task.

This review paper complements the recent DA survey (Shorten and Khoshgoftaar 2019) that covered a full spectrum of data augmentation methods, ranging from traditional augmentation techniques to Generative Adversarial Network (GAN) based augmentations. The present paper significantly extends the above-cited work in the area of mixing augmentation methods and augmentation strategies, which are only briefly covered in the work of Shorten and Khoshgoftaar (2019).

The main sources of papers for this review are recent top-tier conferences related to Artificial Intelligence, Computer Vision, and Machine Learning from the period 2017–2021. Certain exceptions include relevant and highly influential papers published in other venues.

1.1 Taxonomy of presented data augmentation methods

The presented methods are divided into two high-level categories: data augmentation methods and strategies for selecting data augmentation method(s). The first group is further divided into approaches that rely on erasing part of the image and those relying on image mixing. Technically, the former methods can also be thought of as mixing ones, in which a given image is mixed with an empty image. However, due to their somewhat different specificity, and following the taxonomy generally agreed upon within the DA community, these two groups of methods are considered separately.

Strategies for data augmentation selection are mostly applied in the literature to traditional DA techniques, i.e. label preserving geometric transformations or operations in the color space, and only occasionally to more advanced augmentation methods.

While these two areas (mixing DA methods and DA selection strategies) are compared in the literature, they are rarely combined within one DA system. In this review both groups of methods are intentionally considered together, since we hypothesize that future DA research will largely include the application of DA selection strategies to advanced mixing or GAN-based augmentation methods.

1.2 Notation and general remarks

Let us start by introducing the basic notation. First of all, mixing DA techniques are divided into two main classes: those that mix images using a pixel-wise weighted average (referred to as pixel-wise mixing methods) and those that mix images spatially, by extracting patches from different images and joining them together (referred to as patch-wise mixing methods). Examples from both classes are presented in Fig. 1.

Fig. 1
figure 1

From left to right: two sample images and examples of pixel-wise and patch-wise mixing, respectively. The pixel-wise and patch-wise images present the zoomed region indicated by a red rectangle to show the detailed characteristics of the mixed images. (Color figure online)

Furthermore, whenever a result of applying a given DA technique is presented, the baseline result (or simply baseline) refers to the respective result obtained using the same architecture, trained on the same data and with the same training hyper-parameters, but without the analyzed DA method. For the sake of consistency all DA methods are written in italic font, e.g. Mixup, CutMix or Attentive CutMix.

The remainder of the survey is organized as follows: Sect. 2 presents DA methods that rely on erasing part of the image and Sect. 3 introduces the image mixing methods. The mixing methods are further qualitatively compared and analyzed in the context of their specific aspects (e.g. in which part of the learning model the DA is applied, how many images are involved in a single DA sample preparation, whether the DA method is pixel-wise or patch-wise, whether or not the DA technique mixes labels, and others) in Sect. 4. A thorough quantitative experimental comparison of the methods is presented in Sect. 5. The vast majority of results refer to the CIFAR-10 (Krizhevsky et al. 2009a) and CIFAR-100 (Krizhevsky et al. 2009b) data sets, which are most frequently used in the assessment of DA methods. Section 6 presents and compares strategies for optimal DA method selection. The last section concludes the paper by summarizing its main findings and pointing out some possible future prospects in the DA domain, notably the application of DA selection strategies to the class of mixing DA methods.

2 Data augmentation by erasing part of the image

Methods in this group rely on removing part of the image, either by masking it out (Devries and Taylor 2017; Zhong et al. 2020) or by replacing it with noise (Lopes et al. 2019).

A foundational method in this area is Cutout (Devries and Taylor 2017), which erases/masks a square region of the input image by zeroing the respective pixels. Cutout was an inspiration for several subsequent methods, both erasing [e.g. Patch Gaussian (Lopes et al. 2019)] and mixing [e.g. CutMix (Yun et al. 2019), SmoothMix (Lee et al. 2020), Attentive CutMix (Walawalkar et al. 2020), Puzzle Mix (Kim et al. 2020), Saliency Mix (Uddin et al. 2020) or SnapMix (Huang et al. 2020)], which are presented in further sections.

The main motivation behind Cutout is better utilization of the entire context of the image, which is especially important when dealing with partial occlusion. Due to masking a particular region of the image in the input layer, the network needs to recreate the missing information based on the context. Conceptually, the technique is similar to dropout (Krizhevsky et al. 2012b) and can be thought of as an extension of dropout to the input layer. However, there are two key differences: firstly, Cutout is applied exclusively to the input layer (not to all layers as dropout is) and secondly, it drops regions of the input image, rather than individual pixels. While there are variants of dropout that work with continuous regions, e.g. DropBlock (Ghiasi et al. 2018), they are applied to randomly selected regions, different in every layer, so, unlike in Cutout, the key regularization mechanism in this case relies on randomness.

The method uses just one hyperparameter, the size of the filter. Devries and Taylor (2017) also considered a second hyperparameter, the shape of the filter, but since the size turned out to be much more important, all experiments focused on optimizing the size of the cutout region, with a square shape assumed by default. Based on the experiments, the relation between model accuracy and filter size has a parabolic shape: the accuracy increases with the filter size up to an optimal point, after which masking out too much information deteriorates the resulting accuracy.

An important point is that Cutout's occlusion ratio (the proportion of masked pixels to all pixels) can vary even for filters of the same size, because the center point of the cutout region is selected randomly and part of the mask may fall outside the image. This diversity of the effective filter size is crucial for achieving high performance. An alternative approach is to always select the center point of the filter far enough from the image edges so that the entire mask falls within the image, but apply the method with \(50\%\) probability (Devries and Taylor 2017).
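To make the procedure concrete, below is a minimal NumPy sketch of Cutout, assuming an \(H\times W\times C\) image array and a single size hyperparameter; the function name and defaults are illustrative. Because the center point is sampled anywhere in the image, part of the square may fall outside it, which produces the varying effective occlusion ratio discussed above.

```python
import numpy as np

def cutout(image: np.ndarray, size: int = 16) -> np.ndarray:
    """Zero out a size x size square at a random center; the square may be clipped."""
    h, w = image.shape[:2]
    cy, cx = np.random.randint(h), np.random.randint(w)    # random center point
    y1, y2 = max(0, cy - size // 2), min(h, cy + size // 2)
    x1, x2 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = image.copy()
    out[y1:y2, x1:x2] = 0                                   # mask out the region
    return out
```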

Another erasing augmentation approach, Random Erasing (Zhong et al. 2020), builds directly on Cutout and its high-level description is very similar. However, Random Erasing is a more flexible approach and transforms the image in a slightly different manner. The method admits a rectangular masking patch and its effective application depends on 3 (instead of 1, as in Cutout) hyperparameters: the erasing probability p, the range of the area ratio \(S_e/S\in [s_l, s_h]\) (1) between the masked region and the whole image, which controls the patch size, and the range of the aspect ratio \(r_e\in [r_1, r_2]\) (2) between the height and width of the masked region (3), which controls the patch shape.

Additionally, the method fills in the erased region differently from Cutout. Each pixel value within the masked region is replaced with a value selected randomly between 0 and 255 (other, less successful approaches are discussed in the work of Zhong et al. (2020)).

Furthermore, the entire patch area must not extend beyond the original image range, so the following patch selection process (1)–(4) is repeated until an appropriate patch is found.

$$\begin{aligned}&S_e \leftarrow \text {Rand} (s_l, s_h) \times S \end{aligned}$$
(1)
$$\begin{aligned}&r_e \leftarrow \text {Rand} (r_1, r_2) \end{aligned}$$
(2)
$$\begin{aligned}&H_e \leftarrow \sqrt{S_e \times r_e}, W_e \leftarrow \sqrt{\dfrac{S_e}{r_e}} \end{aligned}$$
(3)
$$\begin{aligned}&x_e \leftarrow \text {Rand} (0, W), y_e \leftarrow \text {Rand} (0, H) \end{aligned}$$
(4)

where S is the image area, \(S_e, r_e, H_e, W_e\) are the area, the aspect ratio, the height and the width of the filtering rectangle, respectively, and \(x_e, y_e\) indicate the upper left corner of this rectangle. The effects of Cutout and Random Erasing are compared in Fig. 2.
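A possible NumPy sketch of the selection loop (1)–(4) is given below; the function name, the defaults and the uniform 0–255 fill follow the description above, while the rejection loop enforces that the patch stays within the image.

```python
import numpy as np

def random_erasing(image, p=0.5, s_l=0.02, s_h=0.4, r_1=0.3, r_2=1/0.3):
    """Replace a random rectangle with uniform noise, following Eqs. (1)-(4)."""
    if np.random.rand() > p:
        return image                                    # applied with probability p
    h, w = image.shape[:2]
    area = h * w                                        # S
    while True:
        s_e = np.random.uniform(s_l, s_h) * area        # Eq. (1): patch area
        r_e = np.random.uniform(r_1, r_2)               # Eq. (2): aspect ratio
        h_e = int(round(np.sqrt(s_e * r_e)))            # Eq. (3): patch height and width
        w_e = int(round(np.sqrt(s_e / r_e)))
        y_e, x_e = np.random.randint(h), np.random.randint(w)   # Eq. (4): upper left corner
        if y_e + h_e <= h and x_e + w_e <= w:           # patch must fit inside the image
            break
    out = image.copy()
    out[y_e:y_e + h_e, x_e:x_e + w_e] = np.random.randint(
        0, 256, size=out[y_e:y_e + h_e, x_e:x_e + w_e].shape)
    return out
```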

Fig. 2
figure 2

The effects of Cutout and Random Erasing. (Color figure online)

Generally, Random Erasing yields better results than the baseline for all reasonable parameter settings; however, the final parameter values need to be optimized for a given data set. For instance, for CIFAR-10, Zhong et al. (2020) recommend setting p, \(s_h\) and \(r_1\) to 0.5, 0.4 and 0.3, respectively. The method is generally not the best choice as a standalone DA technique. It is inferior, for instance, to random flipping (Shorten and Khoshgoftaar 2019) or random cropping (Shorten and Khoshgoftaar 2019), although when added on top of any of them as a complementary augmentation method it improves classification accuracy.

The next erasing method, Patch Gaussian (Lopes et al. 2019), is a hybrid of Cutout and Gaussian noise augmentation. Adding Gaussian noise to the image region selected by Cutout improves the performance of the trained model on corrupted images, in addition to efficiently dealing with partial occlusion. The motivation behind Patch Gaussian is twofold: (a) Cutout is generally able to increase the model's accuracy but does not ensure its robustness, while (b) the addition of Gaussian noise increases robustness but hurts the accuracy of the model. The aim of Patch Gaussian is to get the best of both worlds and combine high accuracy on a held-out set of images with robustness to corrupted images. The method is parameterized by the patch size (as in Cutout) and the maximum standard deviation of the Gaussian noise. Its effects are visualized in Fig. 3 and compared with the application of Gaussian blur.
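A minimal sketch of Patch Gaussian in NumPy is shown below, assuming images scaled to [0, 1]; the noise standard deviation is drawn uniformly up to the maximum-scale hyperparameter and added only inside a randomly centered square patch.

```python
import numpy as np

def patch_gaussian(image, patch_size=25, max_scale=1.0):
    """Add Gaussian noise inside one random square patch of an image in [0, 1]."""
    h, w = image.shape[:2]
    cy, cx = np.random.randint(h), np.random.randint(w)         # patch center
    y1, y2 = max(0, cy - patch_size // 2), min(h, cy + patch_size // 2)
    x1, x2 = max(0, cx - patch_size // 2), min(w, cx + patch_size // 2)
    sigma = np.random.uniform(0.0, max_scale)                   # sampled noise strength
    out = image.astype(np.float32).copy()
    noise = np.random.normal(0.0, sigma, size=out[y1:y2, x1:x2].shape)
    out[y1:y2, x1:x2] = np.clip(out[y1:y2, x1:x2] + noise, 0.0, 1.0)
    return out
```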

Fig. 3
figure 3

Original image, its version after application of Gaussian blur, and the effects of Patch Gaussian augmentation. (Color figure online)

In order to confirm that high accuracy and robustness to noise are indeed caused by the application of Patch Gaussian and are not just a result of stacking data augmentation techniques, in the ablation study presented by Lopes et al. (2019) Patch Gaussian was compared with Cutout followed by Gaussian noise addition, as well as with either of these two techniques applied alone to half of the batches. Among these approaches Patch Gaussian achieved the smallest error on corrupted data while maintaining high accuracy on non-corrupted test samples.

An interesting approach is the combination of Patch Gaussian with a regularization strategy or another DA technique. Using Patch Gaussian together with any of the following: weight decay (Goodfellow et al. 2016), label smoothing (Szegedy et al. 2016), DropBlock (Ghiasi et al. 2018), or AutoAugment (Cubuk et al. 2019) decreases the error on corrupted data whilst making the error on non-corrupted data only marginally higher than the baseline.

The family of erasing methods is relatively small, since researchers quickly shifted their attention towards the methods described in Sect. 3 that mix images in a patch-wise manner. Patch-wise mixing has similar effects to erasing part of the original image, but without reducing the training signal provided by each image. In the case of mixing methods the masked part of the image is replaced by a part of another image, thus providing a stronger training stimulus.

3 Data augmentation by image mixing

Fig. 4
figure 4

Image mixing DA methods presented on a time scale, with key characteristics and dependencies indicated. Dotted regions separate methods in which mixing takes pixel-wise form (Pixel-wise) from those with spatial mixing (Patch-wise), and those with mixing applied not to a pair of images, but either just one image and its transformed version or more than 2 images (Other than 2 images). Directed lines indicate inspirations (dotted lines) or direct extensions (solid lines) of the methods. Each method appears once except RICAP which in a patch-wise manner mixes 4 images, hence appears in both Patch-wise and Other than 2 images groups. (Color figure online)

Image mixing DA methods rely on blending two input images and their corresponding labels according to the following equations:

$$\begin{aligned}&{\tilde{x}} = B \odot x_1 + (I-B) \odot x_2 \end{aligned}$$
(5)
$$\begin{aligned}&{\tilde{y}} = \lambda y_1 + (1-\lambda ) y_2 \end{aligned}$$
(6)

where \(x_1, x_2\) are original input images, \(y_1, y_2\) are their one-hot label encodings, \(\lambda\) is a mixing ratio, B is a mixing mask matrix suitable for both pixel-wise and patch-wise mixing, I is a matrix of ones of the same dimensionality as B, and \(\odot\) denotes element-wise matrix multiplication. The vast majority of approaches described in this section are built around Eqs. (5), (6) and mainly differ in the method of \(\lambda\) selection and the construction of matrix B.

Figure 4 presents a map of the mixing methods, indicating for each of them the publication date, certain key characteristics and relations to other methods. In the remainder of this section the methods presented in Fig. 4 are described and discussed in more detail.

A foundational mixing method is Mixup (Zhang et al. 2018), introduced in 2018, which laid the groundwork for many subsequent papers in this area (Guo et al. 2019; Verma et al. 2019; Summers and Dinneen 2019; Yun et al. 2019; Lee et al. 2020; Walawalkar et al. 2020; Kim et al. 2020; Uddin et al. 2020; Kim et al. 2021; Huang et al. 2020; Hendrycks et al. 2020; Jackson et al. 2019; Zhou et al. 2021).

Mixup constructs new training samples according to Eqs. (5), (6) by computing a weighted mean of the images and the same weighted mean of their labels, i.e. the entire matrix B is populated with \(\lambda\). The underlying assumption of Mixup is that linear interpolation of feature vectors should lead to an adequate linear combination of the associated labels. This linear combination of images/classes is controlled by \(\lambda\); e.g., \(\lambda =0.5\) leads to averaging the images and their corresponding labels, while \(\lambda \in \{0,1\}\) preserves one of the original images and its label.
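In code, Mixup reduces to a few lines; the sketch below assumes float image arrays, one-hot label vectors and \(\lambda\) drawn from a \(Beta(\alpha ,\alpha )\) distribution, as proposed by Zhang et al. (2018).

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two images and their one-hot labels with a single ratio lambda."""
    lam = np.random.beta(alpha, alpha)      # mixing ratio
    x = lam * x1 + (1.0 - lam) * x2         # Eq. (5) with B filled with lambda
    y = lam * y1 + (1.0 - lam) * y2         # Eq. (6)
    return x, y
```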

The authors of Mixup additionally checked whether it is more beneficial to sample image pairs from the entire data set or solely from observations belonging to the same class, and whether it is preferable to apply the weighted average (5), (6) in the input layer or in subsequent layers. The results presented in Table 1 show that it is more beneficial to mix images selected at random rather than within the same class and that mixing images in the input layer is more advantageous than mixing their latent representations.

Table 1 Part of the ablation study presented by Zhang et al. (2018)

Although the work of Zhang et al. (2018) is considered to be a founding paper in the image mixing augmentation research line, historically, there were three papers published prior to Mixup that tackled the problem of mixing images (Lemley et al. 2017; DeVries and Taylor 2017; Inoue 2018) but approached it from different angles.

Smart Augmentation (SA) (Lemley et al. 2017), the first of these methods, attempts to learn a way of mixing that minimizes the loss. This learning-based approach is quite specific, since the vast majority of current augmentation papers focus on applying effective, yet arbitrarily chosen, mixing techniques. SA employs two separate networks: the first one (called Augmentor) is responsible for mixing two or more images, while the other one (Target) is responsible for image classification. Selection of images for mixing is performed randomly and is limited to observations from the same class. Parameters of both networks are updated based on the loss of the Target network. The loss of the Augmentor network is further extended with an MSE-based comparison of the mixed image with a randomly selected image (from the same class) other than the ones used to create the mixed sample. Conceptually, the approach is flexible and the architectures of both networks can vary as long as the Augmentor network returns an output of the same size as its input. Once the training is completed the Augmentor is discarded.

Mixing images in a learned feature space is proposed in the second of the pre-Mixup papers by DeVries and Taylor (2017). The authors did not propose any particular name for their method and referred to it as “Data set augmentation in Feature Space”. For the sake of brevity, we refer to this method as Feature Space. In Feature Space augmentation, latent features are generated using a sequence-to-sequence model, which enables cross-domain application of the method, e.g. to both text and image classification. The set of operations performed on those latent features encompasses adding random noise, as well as interpolation and extrapolation of features from different observations.

Technically, the method uses a sequence autoencoder (a stacked LSTM (Li and Wu 2015) with encoder and decoder layers) to generate the training features. During method execution an image is propagated through the encoder network and subsequently its hidden representation is either augmented with Gaussian noise or mixed with a similar observation. The mixing process consists of finding the K nearest neighbors in the feature space with the same class label as the selected image. For each pair of hidden representations (of a given image and one of its K nearest neighbours) the mixing is performed according to one of the following equations:

$$\begin{aligned}&\ c'=(c_k-c_j)\lambda + c_j \end{aligned}$$
(7)
$$\begin{aligned}&c'=(c_j-c_k)\lambda + c_j \end{aligned}$$
(8)

where \(c_j\) is the vector representing latent features of a given image, \(c_k\) is its neighboring vector in the latent feature space, and \(\lambda \in [0,1]\) controls the degree of interpolation (7) and extrapolation (8), respectively. It is suggested in the work of DeVries and Taylor (2017) that extrapolation should be preferred over interpolation as it is able to create samples that display higher variability.
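The two mixing variants can be summarized by the short sketch below, which operates on latent vectors produced by the encoder (the vectors themselves are assumed to be given).

```python
import numpy as np

def mix_latent(c_j, c_k, lam=0.5, mode="extrapolate"):
    """Interpolate (Eq. 7) or extrapolate (Eq. 8) between two latent feature vectors."""
    if mode == "interpolate":
        return (c_k - c_j) * lam + c_j      # Eq. (7): move towards the neighbour
    return (c_j - c_k) * lam + c_j          # Eq. (8): move away from the neighbour
```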

The third DA approach that appeared prior to Mixup, SamplePairing (Inoue 2018), is a relatively straightforward technique that constructs a new data sample by pixel-wise averaging of two randomly selected images. Unlike Mixup, it does not consider the label of the second image and simply assigns the label of the first image to the newly created mixed training sample.

In the image classification task SamplePairing is used as a pre-training mechanism which is turned on and off at regular intervals before being completely disabled at the end of the pre-training phase. The effects of Mixup and SamplePairing are compared in Fig. 5. Both methods mix images in a pixel-wise manner: Mixup with a user-defined mixing ratio (in this case 0.7) and SamplePairing with a fixed ratio of 0.5.

Fig. 5
figure 5

Comparison of Mixup and SamplePairing augmentations. (Color figure online)

Another approach related to Mixup is Between-Class learning (BCL) (Tokozume et al. 2018a), published in the same year. BCL was initially proposed in the context of mixing sound waveforms (Tokozume et al. 2018b) and then adjusted to image domain (Tokozume et al. 2018a). A major distinction between Mixup and BCL is that the latter mixes images from different classes only, whereas Mixup selects two images at random, which makes it possible to mix two images from the same class.

There are two BCL implementations proposed by Tokozume et al. (2018a). The first one follows Eq. (5) and the other one (named BC+), instead of (5), applies Eq. (9), which takes into account characteristics of the image, i.e. the pixel-wise image mean and standard deviation:

$$\begin{aligned} {\tilde{x}}=\frac{p(x_1-\mu _1)+(1-p)(x_2-\mu _2)}{\sqrt{p^2+(1-p)^2}} \text {,}\ \ \ \ \ \ \ p = \frac{1}{1+ \frac{\sigma _1}{\sigma _2}*\frac{1-\lambda }{\lambda }} \end{aligned}$$
(9)

where \(\lambda\) is the mixing ratio, \(\sigma _1, \sigma _2\) are pixel-wise standard deviations, and \(\mu _1, \mu _2\) are pixel-wise means, respectively. Both methods apply Eq. (6) to create mixed label encoding.
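For illustration, a NumPy sketch of the BC+ image mixing (9) is given below, assuming float images and a mixing ratio \(\lambda \in (0,1)\); labels are still combined with Eq. (6).

```python
import numpy as np

def bc_plus(x1, x2, lam):
    """BC+ mixing of two images according to Eq. (9); lam must lie in (0, 1)."""
    mu1, mu2 = x1.mean(), x2.mean()          # per-image pixel means
    s1, s2 = x1.std(), x2.std()              # per-image pixel standard deviations
    p = 1.0 / (1.0 + (s1 / s2) * (1.0 - lam) / lam)
    return (p * (x1 - mu1) + (1.0 - p) * (x2 - mu2)) / np.sqrt(p**2 + (1.0 - p)**2)
```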

During CNN training the goal of the network is to predict the newly created mixed label encoding, which is equivalent to predicting the mixing ratio used to create the mixed sample. For this task the standard categorical cross-entropy loss function is replaced with the Kullback-Leibler divergence, which proved to be more effective in the ablation study presented in the work of Tokozume et al. (2018a). Based on the ablation experiments it can also be concluded that arbitrarily choosing one of the two possible labels for the mixed sample, instead of their linear combination, worsens the performance. Furthermore, while mixing examples from different classes yields better results, the performance also improves (compared to the baseline) if the mixed images belong to the same class. Similarly to the Mixup study (Zhang et al. 2018), Tokozume et al. (2018a) found that applying the mixing process in the input layer is more beneficial than its usage in subsequent layers. In the latter case, for deeper CNN layers mixing may even deteriorate performance.

Going back to the Mixup method, subsequent related research was mainly concentrated along two axes: direct improvements of the method (Guo et al. 2019; Verma et al. 2019) (discussed in Sect. 3.1), and questioning the efficacy of pixel-wise mixing (Summers and Dinneen 2019; Yun et al. 2019; Lee et al. 2020; Walawalkar et al. 2020; Uddin et al. 2020) (discussed in Sect. 3.2).

3.1 Direct extensions of Mixup

AdaMixup (Guo et al. 2019) was the first direct extension of Mixup. The method attempts to learn a mixing distribution better than that of Mixup in order to avoid the so-called manifold intrusion problem, i.e. a situation in which a synthetically created image coincides with a class different from either of the two classes assigned to the images being mixed.

AdaMixup extends Mixup by adding two neural networks called the Intrusion Discriminator (ID) and the Policy Region Generator (PRG). The former is responsible for predicting whether or not the resultant mixed sample would lead to manifold intrusion and the role of the latter is to propose the mixing policy. All three networks are trained jointly using the following loss function:

$$\begin{aligned} \ L_{total} = L_{D}(H) + L_{D^{'}}(H,\{\pi _k\})+L_{intr}(\{\pi _k\}, \varphi ) \end{aligned}$$
(10)

where \(L_{D}\), \(L_{D^{'}}\) and \(L_{intr}\) are standard loss, data space regularization loss and intrusion regularization, respectively. H, \(\varphi\) and \(\{\pi _k\}\) denote the main classifier, ID and PRG, respectively.

Another direct extension of Mixup is Manifold Mixup (Verma et al. 2019), which is motivated by the observation that Mixup produces sharp decision boundaries in latent image representations. Manifold Mixup addresses this issue by performing the mixing operation on latent feature representations in hidden layers. More precisely, the method randomly selects one layer in the network, either hidden or input, processes two random mini-batches of data until reaching the selected layer, at which point the two mini-batches are mixed according to (5). Afterwards, the processing is continued until the output layer is reached, where the loss is calculated and all network parameters are updated according to the gradients. If the randomly selected layer happens to be the input one, the method is equivalent to Mixup.

Manifold Mixup achieves comparable or higher performance than Mixup and AdaMixup. Note that these results somewhat contradict the conclusion presented in the work of Zhang et al. (2018) that mixing should be applied in the input layer.

3.2 Patch-wise extensions of Mixup

The other stream of Mixup follow-up papers, which questions the efficacy of mixing pixels linearly, shares many properties with the augmentation methods focused on occluding parts of the image, described in Sect. 2. A primary approach here is the Mixed-Example method (Summers and Dinneen 2019), which tests various patch-wise mixing approaches and concludes that they work comparably well to pixel-wise methods and even surpass them in certain settings.

Fig. 6
figure 6

Effects of various patch-wise Mixed-Example augmentation methods. Random Square variant, which is visually equivalent to CutMix, is omitted in the figure and presented in Fig. 7. (Color figure online)

An underlying assumption of Mixed-Example (Summers and Dinneen 2019) is that linear interpolations represent just a small subset of mixing operations that can potentially be used for data augmentation. Consequently, the following patch-wise mixing schemes are proposed in the work of Summers and Dinneen (2019):

  • Vertical Concat - a method where the bottom and upper parts of the new image come from two different images. This effect can be implemented with Eq. (5) where B has 1s in the bottom part and 0s in the upper part, or vice versa;

  • Horizontal Concat - same as Vertical Concat, but in reference to the left and right parts of the new image;

  • Mixed Concat - a combination of both the above methods in which a new image is constructed from 4 patches and patches on each diagonal come from the same source image;

  • Random 2×2 - a randomized version of Mixed Concat where the source of each of the 4 patches is independently selected at random (either the first or the second image). Random 2×2 implements an idea similar to the RICAP method described later in this section, which combines 4 images so that each patch comes from a different image;

  • Random Square - a method in which a randomly chosen square from one image is placed on the other image. Random Square is similar to the CutMix method described later in this section;

  • Random Column Interval - a generalization of Horizontal Concat where a randomly selected vertical stripe is picked in one image and pasted onto the other image;

  • Random Row Interval - an analogous generalization of Vertical Concat;

  • Random Columns/Random Rows - a further generalization of the Random Interval approaches in which a subset of columns/rows, respectively, is chosen in one image and pasted onto another image;

  • Random Pixels - a new image is created based on pixels sampled uniformly from both images;

  • Random Elements - a new image is created based on color channel values of the respective pixels sampled uniformly from both images.

The effects of patch-wise Mixed-Example augmentations are presented in Fig. 6.
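As an illustration, the sketch below implements two of the listed variants (Vertical Concat and Random 2×2) in NumPy; names and parameterization are illustrative, and the remaining variants follow the same pattern of composing matrix B from rows, columns or blocks.

```python
import numpy as np

def vertical_concat(x1, x2, lam):
    """Top lam-fraction of rows taken from x1, the remaining rows from x2."""
    cut = int(round(lam * x1.shape[0]))
    return np.concatenate([x1[:cut], x2[cut:]], axis=0)

def random_2x2(x1, x2):
    """Each of the four quadrants is taken at random from x1 or x2."""
    h, w = x1.shape[:2]
    out = x1.copy()
    for rows in (slice(0, h // 2), slice(h // 2, h)):
        for cols in (slice(0, w // 2), slice(w // 2, w)):
            if np.random.rand() < 0.5:
                out[rows, cols] = x2[rows, cols]
    return out
```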

The Mixed-Example paper (Summers and Dinneen 2019) additionally proposes several strategies which combine pixel-wise and patch-wise mixing, e.g. a blend of Vertical Concat and Mixup, where Mixup is applied to one of the input images and is followed by the Vertical Concat procedure.

Several subsequent papers in this line of research either build on (Lee et al. 2020; Walawalkar et al. 2020; Uddin et al. 2020) or explore in depth (Yun et al. 2019) one of the augmentation methods presented by Summers and Dinneen (2019). CutMix (Yun et al. 2019) mixes images by occluding part of one image with a patch extracted from the other image. SmoothMix (Lee et al. 2020) and Attentive CutMix (Walawalkar et al. 2020) extend CutMix by either smoothing sharp edges in the resultant image (SmoothMix) or by using a specific way of patch selection (Attentive CutMix). Saliency Mix (Uddin et al. 2020) attempts to achieve the same properties as Attentive CutMix but using saliency information. All four above-mentioned methods are discussed in more detail below and the effects of their application are illustrated in Fig. 7 (CutMix, SmoothMix, Attentive CutMix) and Fig. 8 (Saliency Mix).

Fig. 7
figure 7

Effects of CutMix (which is visually equivalent to Random Square variant of Mixed-Example method described above), SmoothMix and Attentive CutMix augmentations. (Color figure online)

CutMix (Yun et al. 2019) is motivated by the Cutout method (described in Sect. 2); however, unlike Cutout, which simply clears out part of the image with a possible loss of relevant information, CutMix introduces to the resultant image new information extracted from the other image it is mixed with. The method is regarded as a strong alternative to Mixup.

The patch size in CutMix is proportional to the image size and is calculated according to the following equations:

$$\begin{aligned} \ r_w = W \sqrt{1-\lambda },\ \ \ \ \ \ \ r_h = H \sqrt{1-\lambda } \end{aligned}$$
(11)

where \(r_w\) and \(r_h\) are the width and the height of the patch, respectively, W and H are the width and the height of the original images, and \(\lambda\) is the mixing ratio. According to Yun et al. (2019) the method should be applied in the input layer (to raw images), as its application to hidden layer representations degrades performance.
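A possible NumPy sketch of CutMix is shown below; following common practice (an assumption, not a quote from the paper), the sampled box is clipped to the image and the label weight is recomputed from the actually visible area.

```python
import numpy as np

def cutmix(x1, y1, x2, y2, alpha=1.0):
    """Paste a random rectangle of x2 onto x1; labels mixed by visible area (Eq. 6)."""
    h, w = x1.shape[:2]
    lam = np.random.beta(alpha, alpha)
    r_w, r_h = int(w * np.sqrt(1 - lam)), int(h * np.sqrt(1 - lam))   # Eq. (11)
    cx, cy = np.random.randint(w), np.random.randint(h)               # box center
    x_a, x_b = max(0, cx - r_w // 2), min(w, cx + r_w // 2)
    y_a, y_b = max(0, cy - r_h // 2), min(h, cy + r_h // 2)
    x = x1.copy()
    x[y_a:y_b, x_a:x_b] = x2[y_a:y_b, x_a:x_b]
    lam_eff = 1.0 - (x_b - x_a) * (y_b - y_a) / (w * h)   # fraction of x1 still visible
    return x, lam_eff * y1 + (1.0 - lam_eff) * y2
```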

SmoothMix (Lee et al. 2020) is a direct extension of CutMix that attempts to avoid sharp edges, i.e. unnatural changes in pixel values around the inserted patch. SmoothMix creates a new observation according to Eqs. (5), (6) with a specific form of matrix B which has 0s at the edges of the patch and 1s in its center. B implements a smooth transition from the edge pixels towards the patch center, i.e. its elements change gradually from 0 to 1 when moving away from the edges and approaching the central region of the patch. Both circular and rectangular patches were tested in the work of Lee et al. (2020), with the final preference for a circular one.
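One way to realize such a graded mask is a Gaussian-shaped bump, as in the sketch below (an assumed form; the exact mask used by Lee et al. (2020) may differ in its parameterization). Consistently with Eq. (5), the patch region (where the mask equals 1) shows x1 and the surroundings show x2.

```python
import numpy as np

def smoothmix(x1, y1, x2, y2, center, sigma):
    """Soft circular mask B (Eq. 5): 1 at the patch center, decaying smoothly to 0."""
    h, w = x1.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    dist2 = (yy - center[0]) ** 2 + (xx - center[1]) ** 2
    b = np.exp(-dist2 / (2.0 * sigma ** 2))[..., None]      # graded mask in [0, 1]
    x = b * x1 + (1.0 - b) * x2                              # Eq. (5)
    lam = b.mean()                                           # effective ratio for Eq. (6)
    return x, lam * y1 + (1.0 - lam) * y2
```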

Attentive CutMix (Walawalkar et al. 2020) is another extension of CutMix. The method divides both images into patches (typically a \(7\times 7\) grid) and in one of the images selects a number of patches that most strongly activate the respective class-related output. Next, the method pastes those most representative patches onto the other image in the corresponding places. The label of the new image is calculated based on the proportion of pixels corresponding to each of the two source images. Attentive CutMix addresses one of the problems of CutMix, i.e. its susceptibility to copying a non-representative part of the image (e.g. background) and consequently producing an example with a noisy label. Attentive CutMix, on the contrary, pays special attention to the selection of meaningful and class-representative patches to be transferred to the resultant image. One disadvantage of Attentive CutMix, compared to the other discussed methods, is the requirement of a pretrained feature extractor that is used for selecting the most relevant patches.

The idea of selecting the most representative patches is also implemented in SaliencyMix method (Uddin et al. 2020) which uses saliency information to find the most representative pixels and then, similarly to CutMix, selects a patch of a size determined by \(\lambda\). In SaliencyMix the patch is either centered on the most salient pixel (if the entire patch fits within the image) or keeps this pixel within the patched area (otherwise). The method follows Eqs. (5), (6) and is implemented by setting the values of matrix B to 1 within the patch area (selected based on the saliency peak region) and to 0 elsewhere. Figure 8 presents an example of saliency map and the effect of SaliencyMix application.

Fig. 8
figure 8

Effects of SaliencyMix augmentation method with the corresponding mixed label. From left: two original images, the saliency map with a bounding box around saliency peak region, and the resulting image after SaliencyMix application. (Color figure online)

Uddin et al. (2020) considered four saliency calculation methods relying either on statistical approaches (Montabone and Soto 2010; Hou and Zhang 2007; Achanta et al. 2009) or on a learning-based approach (Qin et al. 2019), and eventually decided to use the statistical saliency method of Montabone and Soto (2010), as it offered slightly better classification accuracy (with SaliencyMix) and, additionally, is invariant to image size, as opposed to learning-based methods, which are limited by the input size of the architecture used. Various placements of the salient patch were also tested, including placing the patch on the salient/non-salient/corresponding part of the other image, with the conclusion that the choice of the corresponding place is the best option in terms of regularization.

The following two sections consider methods that deviate from certain baseline assumptions common to the methods discussed so far. Section 3.3 presents methods that do not follow Eq. (5) (Kim et al. 2020; Huang et al. 2020) and Sect. 3.4 discusses methods that use a number of images other than two to produce a mixed sample (Takahashi et al. 2018; Hendrycks et al. 2020; Jackson et al. 2019; Zhou et al. 2021; Kim et al. 2021).

3.3 Beyond equation (5)

The next method cannot be definitively classified as either pixel-wise or patch-wise. Puzzle Mix (Kim et al. 2020) is motivated by the same principles as the previously described Attentive CutMix and Saliency Mix methods, namely ensuring that the resultant image contains patches that are relevant for both classes. Similarly to Saliency Mix, this goal is achieved through utilization of the saliency information (Simonyan et al. 2014) to determine the most important patches in both images, though Puzzle Mix additionally performs an optimal transport of these most salient parts so as to maximize their exposure in the resultant augmented image. This results in placing a relevant patch in the target image not in the place that corresponds to its location in the source image, but in a destination that contains low-saliency features compared to the rest of the target image. The overall goal is achieved by jointly seeking an optimal mixing mask and optimal placements for the patches coming from both source images in the augmented image. Consequently, instead of (5), the following equation is used for creating a mixed sample in Puzzle Mix:

$$\begin{aligned} {\tilde{x}} = (1-B) \odot \prod _{1} x_1 + B \odot \prod _{2} x_2 \end{aligned}$$
(12)

where B is a mask with \(b_{ij}\in [0,1]\) and \(\prod _{1}, \prod _{2}\) are \(n \times n\) transportation matrices. \(\prod _{i}, i=1,2\) encodes the placement of patches extracted from the source images \(x_i, i=1,2\) in the augmented sample. The mixing ratio \(\lambda\) is defined as \(\lambda = \frac{1}{n} \sum _{ij} b_{ij}\).

Since there are two goals in Puzzle Mix, the problem is solved by iteratively alternating between the following two phases: (1) minimization of B and (2) simultaneous optimization of \(\prod _{1}\) and \(\prod _{2}\). Although Kim et al. (2020) propose certain improvements of both CPU and GPU implementations, as well as down-sampling strategies for mask optimization, the method still requires substantially more computational resources than previous DA techniques, e.g. Mixup or CutMix. On the other hand, whenever Puzzle Mix is compared with other methods in the same experimental setup, it achieves higher accuracy. Numerical results are presented in Sect. 5.

The other method that departs from Eq. (5) is SnapMix (Huang et al. 2020). The method is dedicated to the fine-grained image classification problem, in which classes differ only in details, as, for instance, in the CUB (Wah et al. 2011), Stanford Cars (Yang et al. 2015) or FGVC Aircraft (Maji et al. 2013) data sets. Consequently, SnapMix assumes that visual features indicative of a given class may occupy only a small part of the image, in which case traditional methods like Mixup (Zhang et al. 2018) or CutMix (Yun et al. 2019) would not be effective. An example of SnapMix application is depicted in Fig. 9.

Fig. 9
figure 9

Effects of SnapMix augmentation with the corresponding mixed label. Left: original images with randomly selected patches of different sizes. Middle: heatmaps presenting the output of CAM for the respective class. Right: a resulting image after SnapMix application. Observe that elements of the label vector (right figure) do not have to sum up to 1. (Color figure online)

SnapMix modifies CutMix in the following two aspects: the way the label vector of the mixed sample is calculated, and the independent selection of the size and placement of the patch in each of the two input images (the patch is not automatically pasted in the corresponding location of the target image).

The SnapMix method can be expressed as follows:

$$\begin{aligned} {\tilde{x}} = (I-B_{\lambda ^{1}}) \odot x_1 + T_{\theta }(B_{\lambda ^{2}} \odot x_2) \end{aligned}$$
(13)

where \(B_{\lambda ^{1}}\) and \(B_{\lambda ^{2}}\) are two binary masks containing random box regions with the area ratio \(\lambda ^{1}\) and \(\lambda ^{2}\), respectively, and \(T_{\theta }\) is a function that maps the patch from \(x_2\) onto the patch in \(x_1\).

In order to calculate the label for the augmented image (13), the class activation map (CAM) (Zhou et al. 2016) is used. CAM is a transformation applied on top of the last convolutional layer of the network so as to point to the locations of class-discriminative features. The output of CAM is normalized to obtain the Semantic Percent Map (SPM) using Eq. (14):

$$\begin{aligned} \ S(x_{i}) = \frac{CAM(x_{i})}{sum(CAM(x_{i}))} \end{aligned}$$
(14)

The label of the image created according to (13) is finally calculated as follows:

$$\begin{aligned} \ p_{1} = 1 - sum(B_{\lambda ^{1}} \odot S(x_{1})), \ \ \ \ \ p_{2} = sum(B_{\lambda ^{2}} \odot S(x_{2})) \end{aligned}$$
(15)

where \(p_{1}, p_{2} \in [0,1]\) are partial labels assigned to classes corresponding to the class of images 1 and 2, respectively.
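A small sketch of the label computation (14)–(15) is given below; it assumes the two CAMs have already been resized to image resolution and that the box masks \(B_{\lambda ^{1}}\), \(B_{\lambda ^{2}}\) are binary arrays of the same shape.

```python
import numpy as np

def snapmix_labels(cam1, cam2, box1, box2):
    """Partial labels p1, p2 (Eq. 15) from CAMs and the two binary box masks."""
    s1 = cam1 / cam1.sum()                 # Semantic Percent Map of image 1, Eq. (14)
    s2 = cam2 / cam2.sum()                 # Semantic Percent Map of image 2
    p1 = 1.0 - (box1 * s1).sum()           # semantic share of image 1 that remains
    p2 = (box2 * s2).sum()                 # semantic share of image 2 pasted in
    return p1, p2
```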

An interesting aspect of the method is the lack of a constraint that the partial labels should sum up to 1. This way it is possible to indicate in the label that, as a result of the operation, one image has become relatively more or less relevant than before (for instance, when the patch masked a discriminative feature in that image).

Another relevant feature of SnapMix is its ability to offer more effective augmentations as the training process progresses. This is due to the fact that CAM works on top of the classifier and therefore becomes more accurate as classification improves, making the resulting augmented samples more effective.

3.4 Mixing other than 2 images

Until now all described methods used two images to create a mixed sample. The next group of approaches (Takahashi et al. 2018; Hendrycks et al. 2020; Jackson et al. 2019; Zhou et al. 2021; Kim et al. 2021) veered off in other directions. RICAP (Takahashi et al. 2018) combines 4 images in a patch-wise fashion, while AugMix (Hendrycks et al. 2020) and Style Augmentation (Jackson et al. 2019) use one image in the augmentation process. MixStyle (Zhou et al. 2021), on the other hand, relies only on per-channel statistics, such as the mean and standard deviation calculated from several hidden layers of the second image, disregarding its input pixels. A potential breakthrough is proposed in Co-Mixup (Kim et al. 2021), one of the most recent methods, which performs mixing of the entire mini-batch rather than each pair separately. All five methods are briefly discussed below.

The RICAP augmentation process consists of the following three steps. Firstly, four images are randomly selected, one for each of the four parts of the new image: upper left, upper right, lower left and lower right. Secondly, the images are randomly cropped with respect to a boundary position point, i.e. a point that determines the area a given image will occupy in the newly created augmented image. The boundary point \(BP=(w,h)\) is calculated using the following equations:

$$\begin{aligned}&w = round(w^{\prime }I_x), \quad h = round(h^{\prime }I_y) \\&w^{\prime } \sim Beta(\beta , \beta ), \quad h^{\prime } \sim Beta(\beta , \beta ) \end{aligned}$$
(16)

where \(I_x\) and \(I_y\) are width and height of the original image, respectively.

For selecting the BP coordinates the ratios \(w^{\prime }\) and \(h^{\prime }\) are sampled from the Beta distribution parameterized by \(\beta \in (0,\infty )\). Based on BP the cropping sizes \([w_i, h_i], i=1,\ldots ,4\) are calculated, i.e. \(w_1=w_3=w\), \(w_2=w_4=I_x-w\), \(h_1=h_2=h\) and \(h_3=h_4=I_y-h\). The new class label is calculated as a combination of the four original one-hot encoded labels weighted by the relative size of the area assigned to each of the four images in the newly created image. An example of RICAP application is presented in Fig. 10.
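A compact NumPy sketch of the whole RICAP procedure is given below; crop positions inside each source image are chosen uniformly at random, and the label is the area-weighted combination of the four one-hot vectors, as described above.

```python
import numpy as np

def ricap(images, labels, beta=0.3):
    """Combine four equally sized images around a random boundary point (Eq. 16)."""
    x1, x2, x3, x4 = images            # upper-left, upper-right, lower-left, lower-right
    h, w = x1.shape[:2]
    wp = int(round(np.random.beta(beta, beta) * w))    # boundary point, Eq. (16)
    hp = int(round(np.random.beta(beta, beta) * h))

    def crop(img, ch, cw):             # random crop of size ch x cw from one source image
        y0 = np.random.randint(0, img.shape[0] - ch + 1)
        x0 = np.random.randint(0, img.shape[1] - cw + 1)
        return img[y0:y0 + ch, x0:x0 + cw]

    top = np.concatenate([crop(x1, hp, wp), crop(x2, hp, w - wp)], axis=1)
    bottom = np.concatenate([crop(x3, h - hp, wp), crop(x4, h - hp, w - wp)], axis=1)
    x = np.concatenate([top, bottom], axis=0)
    areas = np.array([hp * wp, hp * (w - wp), (h - hp) * wp, (h - hp) * (w - wp)]) / (h * w)
    y = sum(a * lab for a, lab in zip(areas, labels))   # area-weighted one-hot labels
    return x, y
```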

Fig. 10
figure 10

Effects of RICAP augmentation with the corresponding mixed label. Left: original images: a boat, an airplane, a cow and a dog. Right: a resulting augmented image. (Color figure online)

The next two methods apply the mixing procedure to the original image and its transformed version(s). AugMix (Hendrycks et al. 2020) starts off with the application of traditional augmentation operations (e.g. translation, shear, rotation, etc.) to the input image. More precisely, a set of k (\(k=3\) by default) augmentation chains, each composed of 3 operations, is selected and applied independently to k copies of the original image. The resulting k images are mixed together linearly with random weights. Next, this augmented image is mixed linearly with the original non-augmented image. AugMix creates 2 images in the above-described manner.

In order to enforce a consistent embedding across diverse augmentations of the same input image, a consistency loss in the form of the Jensen-Shannon (JS) divergence (Shannon 1948) is included in the model loss function (17):

$$\begin{aligned}&L(p_{orig},y) + \lambda JS(p_{orig};p_{aug1};p_{aug2}) \end{aligned}$$
(17)
$$\begin{aligned}&JS(p_{orig};p_{aug1};p_{aug2}) = \frac{1}{3}(KL[p_{orig}||M] + KL[p_{aug1}||M] + KL[p_{aug2}||M]) \end{aligned}$$
(18)

where \(M = (p_{orig} + p_{aug1} + p_{aug2})/3\) and \(p_{orig}, p_{aug1}, p_{aug2}\) are posterior distributions of the original sample and its augmented variants.
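The consistency term (18) can be computed directly from the three predicted distributions, as in the short sketch below (a small \(\epsilon\) is added for numerical stability, which is an implementation detail rather than part of the original formulation).

```python
import numpy as np

def js_consistency(p_orig, p_aug1, p_aug2, eps=1e-12):
    """Jensen-Shannon consistency of Eq. (18) for three posterior distributions."""
    m = (p_orig + p_aug1 + p_aug2) / 3.0
    kl = lambda p, q: np.sum(p * (np.log(p + eps) - np.log(q + eps)))
    return (kl(p_orig, m) + kl(p_aug1, m) + kl(p_aug2, m)) / 3.0
```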

Additionally, two alternative approaches were tested in the work of Hendrycks et al. (2020) that use 1 or 3 augmented samples, respectively, to calculate the consistency loss. The former setup performed worse, whereas the gains from the latter were marginal and did not justify the related increase in computational complexity. The results of AugMix application are visualized in Fig. 11.

Fig. 11
figure 11

Effects of AugMix method with \(k=3\). Left and middle: original image and its augmentations (translate, rotate and posterize). Right: the image after AugMix application (mixing augmentations 1–3). (Color figure online)

The other single-image method, Style Augmentation (Jackson et al. 2019), randomly changes the texture, contrast and colors of the image while preserving the objects’ shapes and the semantic meaning of the image. This is achieved by using two neural networks: P and T, described in detail in the work of Ghiasi et al. (2017). Network P is trained on image styles and provides style embeddings which are subsequently used by network T that performs style transfer operation.

In practice, for the sake of computational efficiency, Style Augmentation does not use network P to provide a style embedding of a randomly sampled image, but instead samples a style embedding vector directly from a distribution with mean and covariance matching those used for training network P. Such random sampling (from the appropriate distribution) simulates the process of choosing an image from the training data set and calculating its style embedding, without the computational load related to its actual calculation.

Additionally, in order to control the strength of the augmentation process, a randomly sampled embedding vector is linearly mixed with a style embedding of the input image. The effective embedding vector z used for style transfer is therefore defined in the following way:

$$\begin{aligned} \ z = \alpha {\mathcal {N}}(\mu , \Sigma ) + (1 - \alpha ) P(c) \end{aligned}$$
(19)

where P is the style predictor network, c is the input image, and \(\mu , \Sigma\) are the mean vector and covariance matrix of embeddings from the training set. An example of Style Augmentation application is presented in Fig. 12.
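A sketch of the embedding sampling step (19) is given below; the image's own style embedding P(c) is assumed to be precomputed, and \(\mu\), \(\Sigma\) are the stored statistics of the training embeddings.

```python
import numpy as np

def sample_style_embedding(image_style, mu, cov, alpha=0.5):
    """Blend a randomly drawn style vector with the image's own style (Eq. 19)."""
    z_random = np.random.multivariate_normal(mu, cov)   # simulated random style
    return alpha * z_random + (1.0 - alpha) * image_style
```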

Fig. 12
figure 12

Effects of Style Augmentation method with 3 arbitrary style images. Upper row: original image and 3 style images. Lower row: original sample after application of the style extracted from the same (original) sample, and the 3 augmented samples (Style Augment 1,2,3) obtained by applying style vectors in the form of a linear combination of the original image style (Image styled) and the respective image style vector (Style 1,2,3). The results are presented for style mixing parameter \(\alpha =0.75\). (Color figure online)

Another method, MixStyle (Zhou et al. 2021), attempts to achieve a goal similar to that of Style Augmentation but using a different approach to style transfer. The method is dedicated to the problem of Domain Generalization (DG), i.e. the construction of classifiers robust to domain shift and able to generalize to unseen domains. To this end MixStyle does not mix pixels but instance-level feature statistics of the two images. These statistics, extracted from early layers of a CNN, are calculated in the following manner:

$$\begin{aligned} \ \gamma _{mix} = \lambda \sigma (x_1) + (1-\lambda )\sigma (x_2) \end{aligned}$$
(20)
$$\begin{aligned} \ \beta _{mix} = \lambda \mu (x_1) + (1-\lambda )\mu (x_2) \end{aligned}$$
(21)

where \(\gamma _{mix}, \beta _{mix}\) are the mixed feature statistics, \(\lambda\) is the instance-wise weight sampled from the \(Beta(\beta , \beta )\) distribution parameterized with \(\beta \in (0,\infty )\), and \(\mu (x_i)\), \(\sigma (x_i)\), \(i=1,2\) are the means and standard deviations computed over all elements of each channel of each instance. Once the mixing statistics are calculated, the mixed sample is defined as:

$$\begin{aligned} \ {\tilde{x}} = \gamma _{mix} \frac{x_1-\mu (x_1)}{\sigma (x_1)} + \beta _{mix} \end{aligned}$$
(22)

Equation (22) is inspired by arbitrary style transfer (Huang and Belongie 2017) and is applied with probability 0.5; otherwise, the instance is not augmented. Zhou et al. (2021) recommend applying MixStyle to multiple lower-level layers, which yields the best performance. The optimal combination of layers depends on the final task and including the last layer always deteriorates the performance. An interesting observation is that if multiple layers are selected, the same sample can be processed with different mixed statistics, as the shuffling process that selects pairs of samples is independent in each layer. This property constitutes an additional regularization mechanism in MixStyle.
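For a single pair of feature maps, MixStyle can be sketched as below (channel-first arrays of shape C×H×W are assumed; in practice the operation is applied batch-wise, with a random shuffle defining the pairs).

```python
import numpy as np

def mixstyle(f1, f2, beta=0.1, p=0.5, eps=1e-6):
    """Mix channel-wise feature statistics of two feature maps (Eqs. 20-22)."""
    if np.random.rand() > p:
        return f1                                       # applied with probability 0.5
    lam = np.random.beta(beta, beta)
    mu1 = f1.mean(axis=(1, 2), keepdims=True)
    mu2 = f2.mean(axis=(1, 2), keepdims=True)
    sig1 = f1.std(axis=(1, 2), keepdims=True) + eps
    sig2 = f2.std(axis=(1, 2), keepdims=True) + eps
    gamma_mix = lam * sig1 + (1.0 - lam) * sig2         # Eq. (20)
    beta_mix = lam * mu1 + (1.0 - lam) * mu2            # Eq. (21)
    return gamma_mix * (f1 - mu1) / sig1 + beta_mix     # Eq. (22)
```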

The final method in this area, Co-Mixup (Kim et al. 2021), lifts the idea of mixing from the level of pairs of images to the level of entire training mini-batches. The method, similarly to Saliency Mix, uses saliency information to generate mixed samples, but at the same time looks at the whole mini-batch to encourage diversity among the constructed mixed samples. Since Co-Mixup works on the entire mini-batch, Eqs. (5), (6) are no longer valid and the mixing process is described as follows:

$$\begin{aligned} \ h(x_B) = (g(z_1 \odot x_B),...,g(z_{m'} \odot x_B)) \end{aligned}$$
(23)

where \(z_j \in {\mathcal {L}}^{m \times n}\) for \(j = 1,...,m^{\prime }\) and \({\mathcal {L}}^{m\times n} = \{ \frac{l}{L} | l=0,1,...,L \}\) is a discretized mask whose dimensions equal the number of images in the mini-batch (m) and the number of regions into which the image is divided (n), for optimization purposes (see below). \(x_B\) is the mini-batch and \(g: {\mathbb {R}}^{m \times n} \rightarrow {\mathbb {R}}^{n}\) returns a column-wise sum of the matrix \(z_j \odot x_B\) for \(j = 1,...,m^{\prime }\). As \(z_j\) is a 2D matrix, one can interpret the \(k{\text {th}}\) column of \(z_j\) (\(z_{j,k} \in {\mathcal {L}}^{m}\)) as the mixing ratio of the m inputs from the mini-batch at the \(k{\text {th}}\) area.

Additionally, an optimization procedure is applied, as the method aims to maximize the exposed saliency, maintain local smoothness within the images (adjacent locations in the mixed sample are similar to one another) and encourage diversity among the constructed mixed samples (Kim et al. 2021).

4 Comparison of DA methods based on particular properties

In this section the aforementioned mixing methods are compared on the basis of certain aspects which refer to their operational properties, efficacy and computational complexity. The key characteristic features of the methods are summarized in Table 2.

4.1 Where is the augmentation applied?

All methods apply the DA procedure in the input layer except three: Feature Space, Manifold Mixup and MixStyle (cf. the first column of Table 2). Feature Space uses an additional encoder-decoder network to create image embeddings which are then mixed. Manifold Mixup is a direct extension of Mixup with the mixing mechanism applied to a pool of eligible layers that includes both the input and hidden layers. The layer in which mixing is performed is selected randomly. MixStyle does not actually mix images but the instance-level feature statistics extracted from early layers of a CNN.

Table 2 Comparison of data augmentation techniques with respect to particular baseline properties: where, how and in which form the augmentation is applied, whether or not it mixes labels or utilizes a specific loss function, how many images take part in a single augmentation and what is the computational complexity of the method

4.2 How is the augmentation applied?

Generally, there are three ways in which the discussed DA techniques are applied. The first one, which includes Smart Augmentation, AdaMixup, Attentive CutMix, Saliency Mix and Style Augmentation, is to rely on an auxiliary mechanism to aid the augmentation process. In the case of Smart Augmentation a CNN-based augmentation model precedes the classification network and the two are trained together to optimize the effects of augmentation for a given problem. In AdaMixup an additional network is used to predict whether or not mixing a given pair of images will result in manifold intrusion. In Attentive CutMix a CNN is utilized to detect the regions that are most representative of a given class, whereas Saliency Mix achieves the same objective by using saliency information. Style Augmentation applies a style transfer network to change the visual attributes of an image without changing its meaning.

In the second group, encompassing all but two of the remaining methods, an augmentation technique follows a certain rule or procedure which is applied in some randomized way. These rules differ in terms of complexity and the number of hyperparameters; however, the main dividing line is between methods whose application results in pixel-wise mixing and those which lead to patch-wise mixing. This aspect is further discussed in the following section.

The third set of methods is represented by Puzzle Mix and Co-Mixup. The former attempts to find an optimal way of mixing a given pair of images by solving an optimization task with two objectives. The first objective is to find an optimal mask, i.e. decide how much of the image should be concealed with the other image in a given image region. The second goal is to find optimal locations for the patches extracted from original images in the mixed image, in order to maximize the exposed saliency.

The other method, Co-Mixup, works on the whole mini-batch and attempts to optimize the exposed saliency, as well as encourage diversity among created (augmented) samples.

Both methods bear some resemblance to Saliency Mix, yet the key difference is that Saliency Mix uses saliency directly to make a decision, while Puzzle Mix and Co-Mixup utilize this information to steer an optimization process. A summary of how the augmentation is applied in each method is presented in the second column of Table 2.

4.3 Is the resulting mix pixel-wise or patch-wise?

Probably the most fundamental division of mixing augmentation methods is whether they output a pixel-wise or patch-wise combination of the images (cf. the third column of Table 2). The distinction between pixel-wise and patch-wise mixing was clarified in Sect. 1.2. Please recall that the term pixel-wise is used for mixing images using a pixel-wise weighted average and patch-wise for mixing images spatially, by means of taking patches from the original images and joining them together.

Pixel-wise methods are specifically well suited for dealing with corrupted images and adversarial attacks whereas patch-wise augmentations are useful in the context of partial occlusion and weakly supervised object localization. A detailed performance comparison of both types of methods is presented in Sect. 5.

Smart Augmentation, SmoothMix, Puzzle Mix and Co-Mixup are the exceptions from this division as they combine aspects of both types of mixing. Smart Augmentation uses an auxiliary network to learn the optimal way of mixing and the resulting approach can be either pixel-wise or patch-wise. SmoothMix constructs a graded mixing mask: at the center of the mask and at the edges of the image the mixing is effectively patch-wise, since only one image contributes there, whereas in the transition zone between the mask center and its boundary the mixing becomes pixel-wise as the mask values gradually change from 1 to 0. Puzzle Mix also exhibits properties of both pixel-wise and patch-wise approaches as it is steered by the saliency information, which can result in a patch containing just one image for salient regions and a mixture of two images for less salient ones. Co-Mixup is also guided by saliency information and can exhibit both pixel-wise and patch-wise properties in certain patches of the augmented image.
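
To make the distinction concrete, the following sketch contrasts the two mixing modes on a pair of images (NumPy; the image shapes, the beta-distribution parameter and the box-size rule are assumptions on our side, loosely following Mixup and CutMix):

    import numpy as np

    rng = np.random.default_rng(0)
    x1, x2 = rng.random((2, 32, 32, 3))      # two example images (H, W, C)
    lam = rng.beta(1.0, 1.0)                 # mixing coefficient lambda

    # pixel-wise mixing (Mixup-like): every pixel is a weighted average of both images
    x_pixelwise = lam * x1 + (1.0 - lam) * x2

    # patch-wise mixing (CutMix-like): a rectangle cut from x2 replaces the same region in x1
    h, w = x1.shape[:2]
    cut_h, cut_w = int(h * np.sqrt(1.0 - lam)), int(w * np.sqrt(1.0 - lam))
    top, left = rng.integers(0, h - cut_h + 1), rng.integers(0, w - cut_w + 1)
    x_patchwise = x1.copy()
    x_patchwise[top:top + cut_h, left:left + cut_w, :] = x2[top:top + cut_h, left:left + cut_w, :]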

4.4 Does the augmentation technique mix labels?

Label smoothing has proven to be a very effective regularization technique in the presence of label noise (Szegedy et al. 2016; Goodfellow et al. 2016). Maximizing \(\log {p(y|x)}\) when y actually represents an incorrect label can be harmful and one way to prevent this is explicit modeling of label noise. Label smoothing regularizes the model by replacing the 0 and 1 targets with the targets \(\epsilon / (k-1)\) and \(1-\epsilon\), where k is the number of output classes.

Mixing techniques that combine images from various classes together with mixing their corresponding labels also benefit from this regularization property. The majority of DA techniques mix labels. The ones that do not can be divided into three distinct groups: early methods, methods designed to mix samples from the same class only, and methods that need just one image to work.
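
As a minimal numeric illustration (NumPy; the number of classes, the smoothing strength and the mixing weight are assumed values), the snippet below builds a smoothed target according to the formula above and a mixed target of the kind produced by label-mixing methods:

    import numpy as np

    k, eps, lam = 10, 0.1, 0.7               # classes, smoothing strength, mixing weight (assumed)

    def one_hot(c, k):
        t = np.zeros(k)
        t[c] = 1.0
        return t

    # label smoothing: the 1 target becomes 1 - eps, every 0 target becomes eps / (k - 1)
    y_smooth = np.full(k, eps / (k - 1))
    y_smooth[3] = 1.0 - eps

    # label mixing (Mixup-like): a convex combination of the two one-hot targets
    y_mixed = lam * one_hot(3, k) + (1.0 - lam) * one_hot(7, k)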

SamplePairing, the only representative of early methods that we consider, was designed to assign just one label (from one of the images being mixed) to the augmented image. This resulted in a complicated training procedure that requires the DA to be alternately turned on and off several times, and this idea was not continued in subsequent research.

The methods designed and tested only in the context of mixing images from the same class are Feature Space and Smart Augmentation.

The last group encompasses AugMix, Style Augmentation and MixStyle, which all use the content of just one image during the mixing process, hence there is no room for label mixing. AugMix mixes an image with its versions processed with traditional (label-preserving) techniques and Style Augmentation mixes an image with its own version transformed using the style transfer algorithm. In MixStyle a second image is technically required, however the method does not refer to its pixel-based (raw) representation but considers the instance-level feature statistics that correspond to visual features.

A taxonomy of the methods based on the label mixing property is summarized in the fourth column of Table 2.

4.5 Does the augmentation technique use standard loss function?

Most of the DA techniques described in this survey use the standard loss function during training, as presented in the fifth column of Table 2. There are four exceptions. AdaMixup introduces a separate element in the loss function (10) responsible for penalizing manifold intrusion (a situation when an augmented sample corresponds to a class different from any of the two image classes). AugMix extends the standard loss function to ensure that augmented samples are as close to the original ones as possible (17). Between-Class learning employs KL-divergence without specifying reasons other than better experimental outcomes (Tokozume et al. 2018a). Finally, Smart Augmentation uses an auxiliary Augmentor network whose error is added to the standard loss during training.
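
For the AugMix case, a sketch of a Jensen-Shannon consistency term of the kind added to the standard loss is given below (PyTorch; the weight of the extra term and the exact correspondence to Eq. (17) are assumptions on our side):

    import torch
    import torch.nn.functional as F

    def augmix_style_loss(logits_clean, logits_aug1, logits_aug2, targets, jsd_weight=12.0):
        # standard cross-entropy on the clean image ...
        ce = F.cross_entropy(logits_clean, targets)
        # ... plus a Jensen-Shannon consistency term between the clean image
        # and its two augmented versions, keeping their predictions close
        p_clean = F.softmax(logits_clean, dim=1)
        p_aug1 = F.softmax(logits_aug1, dim=1)
        p_aug2 = F.softmax(logits_aug2, dim=1)
        log_p_mix = ((p_clean + p_aug1 + p_aug2) / 3.0).clamp(1e-7, 1.0).log()
        jsd = (F.kl_div(log_p_mix, p_clean, reduction='batchmean') +
               F.kl_div(log_p_mix, p_aug1, reduction='batchmean') +
               F.kl_div(log_p_mix, p_aug2, reduction='batchmean')) / 3.0
        return ce + jsd_weight * jsd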

4.6 How many images are used to create an augmented image?

Among the discussed methods all but four use two images to obtain an augmented sample (see the second to last column of Table 2). Out of those four methods, two use more than two images: RICAP, which mixes four images in a patch-wise fashion by cropping them and then stitching the crops into one image, and Co-Mixup, which works on the entire training mini-batch at once. AugMix and Style Augmentation mix an image with a transformed version of itself.

4.7 What is the computational complexity of the augmentation technique?

The two key factors contributing to the computational complexity of the considered methods are: (1) whether an augmentation uses a simple rule or relies on a more complex approach, and (2) whether any additional components or a specific training process are required for the method to work. Based on the above, the methods can be roughly divided into the following two groups (cf. the last column of Table 2).

  • Group A: methods that do not incur any significant computational overhead, as they mix images using simple, often randomized, rules, i.e. RICAP (Takahashi et al. 2018), Mixup (Zhang et al. 2018), Between-Class learning (Tokozume et al. 2018a), Mixed-Example (Summers and Dinneen 2019), CutMix (Yun et al. 2019), SmoothMix (Lee et al. 2020), Cutout (Devries and Taylor 2017), Random Erasing (Zhong et al. 2020) and Patch Gaussian (Lopes et al. 2019).

  • Group B: methods that require either a special training process, multiple evaluations, or an auxiliary component, which incur additional computational cost.

The functional overheads of particular methods related to using the above-mentioned additional components are summarized below.

  • Smart Augmentation (Lemley et al. 2017) - utilizes an additional neural network responsible for mixing two images.

  • Feature Space augmentation (DeVries and Taylor 2017) - utilizes an additional network for transforming input data into context vectors.

  • SamplePairing (Inoue 2018) - although the augmentation itself does not incur additional computational cost, its training process requires that the augmentation be turned on and off alternately, which greatly increases the number of training epochs required.

  • Manifold Mixup (Verma et al. 2019) - requires the samples to be propagated through the network before mixing them in hidden layers, hence increasing the computational burden of the forward step.

  • AdaMixup (Guo et al. 2019) - solves two additional tasks (discrimination of intrusions and generation of policy regions) on top of the standard classification; for the latter task an additional neural network is employed.

  • Style Augmentation (Jackson et al. 2019) - requires two additional forward steps through the style transfer network prior to mixing the images.

  • Attentive CutMix (Walawalkar et al. 2020) - utilizes an auxiliary network to propose important regions that should be included in the mixing process (an additional forward step of that network is required).

  • Puzzle Mix (Kim et al. 2020) and Co-mixup (Kim et al. 2021) - both require calculation of saliency information and utilize an optimization procedure for optimal mixing that incurs additional computational burden.

  • AugMix (Hendrycks et al. 2020) - even though it applies standard augmentations and a simple mixing rule, its modified loss function requires more than one forward step to calculate the JS divergence.

  • Saliency Mix (Uddin et al. 2020) - requires an additional step of saliency calculation.

  • SnapMix (Huang et al. 2020) - requires two forward steps to calculate the CAM, which is needed in the mixing process.

  • MixStyle (Zhou et al. 2021) - applies different statistics at each layer which incurs additional computational cost of their calculation.

Generally speaking, it can be roughly estimated that, depending on the augmentation method, the use of the above-listed functional components introduces from a few percent to a few hundred percent of computational time overhead compared to the baseline (straightforward augmentation methods).

5 Experimental evaluation on standard benchmarks

This section summarizes the accuracy results of both erasing and mixing methods on popular benchmark sets. Additionally, the methods are compared based on their relative improvement over baseline results.

We start in Sect. 5.1 with performance analysis on classification task on clean data using 3 data sets. Next, the methods are compared on classification of corrupted images in Sect. 5.2, and in the context of adversarial examples in Sect. 5.3. The final comparison is made in the problem of weakly supervised object localization and for the partial occlusion task—Sect. 5.4. The next two sections address the possibility of application of analysed augmentation methods to image-related tasks other than classification (Sect. 5.5) and to modalities other than images (Sect. 5.6). Section 5.7 discusses the robustness of augmentation methods to parameter selection.

Please note that the quantitative comparisons of the methods presented in this section rely on experiments that compared at least 3 different methods using a common architecture. The only exceptions to this rule are the tables presenting top results (Tables 3, 4 and 5), which are compiled based on all available outcomes.

5.1 Evaluation on clean data

Quantitative assessment of the methods on clean data is presented on three widely-used benchmarks: CIFAR-10, CIFAR-100 and ImageNet.

Unless stated otherwise, a reference to the baseline result, labeled as “NO DA”, means the outcome of the same network architecture, the same training conditions and the same pre-processing of the input data, but without the analyzed data augmentation.

5.1.1 CIFAR-10

The CIFAR-10 data set (Krizhevsky et al. 2009a) consists of \(60,000\) color images of size 32 × 32 pixels grouped into 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck), with \(6000\) images per class. The set is composed of \(50,000\) training samples and \(10,000\) test ones, with \(1000\) images per class in the test subset. Examples of CIFAR-10 images are depicted in Fig. 13.

Fig. 13 Example images from the CIFAR-10 data set. (Color figure online)

Fig. 14 Accuracy results for Mixup, CutMix and AttentiveCutMix on CIFAR-10 reported in the work of Walawalkar et al. (2020). Each panel presents a particular type of architecture (ResNet, DenseNet, EfficientNet) with complexity of the network increasing from top to bottom. Left panels present an absolute error value and the right ones a relative improvement over the baseline (“NO DA” results). (Color figure online)

Figure 14 presents performance results of Mixup, CutMix and AttentiveCutMix on CIFAR-10 grouped by networks of the same type but with varying complexity. The following architectures are compared: ResNet (He et al. 2016a), DenseNet (Huang et al. 2017) and EfficientNet (Tan and Le 2019), each of them in several realizations. The best overall result is achieved by EfficientNet-B7 and Attentive CutMix, with an error of \(4.14\%\) (compared to \(5.05\%\) baseline).

Looking at the left panels of the figure, one can conclude that DA improves the accuracy regardless of particular architecture type and complexity. In all three left panels the error decreases when gradually more complex architectures are used (top to bottom) and with an application of more advanced DA methods (right to left). It is generally assumed that Mixup, as the initial method in the area, is the least advanced, followed by CutMix, developed based on a certain criticism of Mixup, and AttentiveCutMix which is an extension of CutMix.

Another way to look at the results is from the perspective of a relative improvement over the baseline (the right panels). A general observation is that using DA with more complex types of architectures yields lower relative boost, on average equal to around \(22\%\), \(21\%\) and \(11\%\), respectively for ResNet, DenseNet and EfficientNet. However, the relative effects of DA vary substantially within each architecture type. The highest relative advantage is achieved by the most complex model (ResNet-152) in ResNet group, but for the other two architectures the highest boost is observed for DenseNet-201 and EfficientNet-B1, respectively, which are not the most complex ones.

Fig. 15 Accuracy results of various augmentation methods on CIFAR-10 reported in: c—(Summers and Dinneen 2019), d—(Takahashi et al. 2018), e—(Uddin et al. 2020), f—(Yun et al. 2019) and grouped around common baselines (particular architectures). Left panels present absolute error values and the right ones the relative improvements over the baseline. (Color figure online)

Figure 15 shows the results of experiments grouped around the same baselines (i.e. particular architectures) for a wider selection of augmentation methods. The following architectures are considered: two ResNets of the same complexity trained differently than in Fig. 14 (coming from two different papers (Summers and Dinneen 2019; Uddin et al. 2020) with slightly varying experiment designs), two Wide ResNets (Takahashi et al. 2018; Uddin et al. 2020), a specific case of DenseNet (Takahashi et al. 2018) where the compression factor for both bottleneck and transition layers is smaller than one, two Pyramidal ResNets (Yun et al. 2019; Takahashi et al. 2018) with different numbers of convolutional layers and different widening factors, as well as Shake-Shake network (Takahashi et al. 2018).

Generally, similar trends can be observed as in the case of groups of architectures. Simpler models benefit relatively more from data augmentation, however, in terms of absolute figures they still yield higher errors. All in all, every DA technique is able to improve over the baseline, with patch-wise methods achieving slightly better results. For both ResNet-18 architectures the best results were achieved by a patch-wise method (Saliency Mix), for PyramidNet-200-240 by CutMix (a patch-wise mixing approach), and for the remaining architectures RICAP, which is also a patch-wise mixing method, performed better than or on par with its competitors. The results additionally confirm that erasing methods (Cutout and RandomErasing) are slightly inferior to the mixing ones.

Table 3 Top-5 overall best outcomes on CIFAR-10

The last comparison, presented in Table 3, lists top-5 combinations of an architecture and an augmentation method for CIFAR-10 found in the literature. The list is led by two versions of the Shake-Shake model (Gastaldi 2017).

5.1.2 CIFAR-100

The CIFAR-100 data set (Krizhevsky et al. 2009b) is defined in a similar way to CIFAR-10, except that it is more fine-grained and has 100 classes. Each class contains 600 images, divided into 500 training and 100 test samples. Fifty randomly selected images with corresponding classes are depicted in Fig. 16.

Fig. 16 Examples of images from the CIFAR-100 set. (Color figure online)

Fig. 17 Accuracy results for Mixup, CutMix and AttentiveCutMix on CIFAR-100 reported in the work of Walawalkar et al. (2020). Each panel presents a particular type of architecture (ResNet, DenseNet, EfficientNet) with complexity of the network increasing from top to bottom. Left panels present an absolute error value and the right ones a relative improvement over the baseline (“NO DA” results). (Color figure online)

Figure 17 is a CIFAR-100 analogue of Fig. 14 with the same range of tested architectures and DA methods. Its aim is to investigate how the network’s complexity impacts the relative benefit of using particular DA techniques. The absolute errors (left panels) are much higher than in the case of CIFAR-10, because CIFAR-100 is a much more challenging data set (there are only 500 training samples per each of 100 fine-grained categories).

Interestingly, for CIFAR-100 a relative boost (right panels) is generally much smaller than in the case of its less complex counterpart (cf. Fig. 14). This difference is most probably a direct consequence of smaller amount of training data available for each class compared to CIFAR-10.

Certain differences can also be observed in the average relative improvement across the considered architecture types. While for CIFAR-10 the highest relative gain was observed for the smallest model (ResNet), in the case of CIFAR-100 the biggest relative improvement is noted for the DenseNet network. On average, the use of augmentation methods resulted in a relative gain of around \(9.4\%\), \(10.1\%\) and \(7.1\%\), respectively for ResNet, DenseNet and EfficientNet. Similarly to CIFAR-10, the architecture-based trends (within each panel) vary significantly.

Fig. 18 Accuracy results of various augmentation methods on CIFAR-100 reported in: a—(Kim et al. 2021), b—(Kim et al. 2020), c—(Summers and Dinneen 2019), d—(Takahashi et al. 2018), e—(Uddin et al. 2020), f—(Yun et al. 2019) and grouped for particular ResNet architectures. (Color figure online)

Figure 18 compares various augmentation methods across common ResNet baselines: PreAct ResNet (He et al. 2016b), several ResNet and WideResNet architectures (Zagoruyko and Komodakis 2016) and Pyramidal ResNet (Han et al. 2017). This time the models that benefit relatively most from DA application are not the simplest ones. Furthermore, it can be observed that the application of certain augmentation techniques (AugMix and Cutout) actually deteriorates accuracy for some architectures.

Table 4 Top-5 overall best outcomes on CIFAR-100

Table 4 presents the top-5 combinations of model architecture and augmentation method found in the literature. The best accuracy is achieved either by highly complex models (Pyramidal ResNet, Shake-Shake) combined with one of the founding methods (Mixup, BCL or CutMix) or their extensions (SmoothMix), or by applying one of the most recent data augmentation techniques, i.e. Puzzle Mix, with a less complex architecture.

5.1.3 ImageNet

The most comprehensive benchmark that we consider is ImageNet (Deng et al. 2009), which contains 1.2 million training samples and 50,000 validation ones, divided into \(1000\) categories. The size of the images varies, with an average of 469 × 387 pixels. In the experiments reported in the literature the images are usually cropped to 256 × 256 or 224 × 224 pixels, depending on the architecture used. Example images are presented in Fig. 19.

Fig. 19 Examples of images from the ImageNet data set. (Color figure online)

Fig. 20 Accuracy results of various augmentation methods on ImageNet reported in: a—(Kim et al. 2021), b—(Kim et al. 2020), d—(Takahashi et al. 2018), e—(Uddin et al. 2020) and grouped around particular ResNet architectures. Left panels present absolute error values and the right ones the relative improvements over the baseline. (Color figure online)

Figure 20 compares augmentation methods applied to various ResNet architectures (Kim et al. 2021, 2020; Takahashi et al. 2018; Uddin et al. 2020). A predominant observation from the figure is that for ImageNet, a far more complex data set than the CIFAR sets, the relative advantage of applying data augmentation is much lower. The average relative increment is between 5% and 6% and the maximum gain equals \(10.5\%\) for Puzzle Mix method and ResNet-50 model.

Table 5 Top-5 overall best outcomes on ImageNet

Table 5 presents top-5 combinations of model architecture and augmentation method. Similarly to the CIFAR sets the leaders are the most advanced architectures (ResNeXt-101 64*4d and ResNeXt-101) combined with base mixing techniques, followed by certain less complex architectures paired with more recent mixing methods.

5.2 Evaluation on corrupted images

In this section we verify the efficacy of models trained with DA techniques on corrupted test data. Experimental evaluation is performed on CIFAR-100-C (Hendrycks and Dietterich 2019), which is obtained from CIFAR-100 by applying 15 different corruptions which are divided into four major categories: noise, blur, weather changes, and digital corruptions. The effects of applying these corruptions are presented in Fig. 21. For each image from CIFAR-100 each corruption is applied at 5 severities indicating the corruption strength, which leads to 75 transformed versions of this image. Please note that in all experiments discussed in this section corrupted images were used only at test time and were not presented to the model at training time. The error measure, Mean Corruption Error (MCE), is calculated for a given image as the average across all its corrupted versions i.e. 75 instances.
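
Following the convention described above, the aggregation can be written as a short sketch (NumPy; the layout of the error array, indexed by corruption type and severity, is an assumption):

    import numpy as np

    # test errors of one model on CIFAR-100-C: 15 corruption types x 5 severity levels (assumed layout)
    errors = np.random.default_rng(0).uniform(0.2, 0.6, size=(15, 5))

    # Mean Corruption Error: the average over all 75 corrupted variants
    mce = errors.mean()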

Fig. 21 The effects of 15 corruptions applied to an example image. Corruptions were applied with the highest severity. (Color figure online)

Fig. 22 A summary of Mean Corruption Error results for various augmentation techniques and network architectures. Puzzle Mix (Aug) denotes a combined usage of Puzzle Mix and AugMix techniques. The first three rows come from the work of Hendrycks et al. (2020) and the next ones from Kim et al. (2020) and Lee et al. (2020), respectively. (Color figure online)

Figure 22 aggregates the results for various DA techniques and architectures of varying complexity (DenseNet, ResNeXt, two Wide ResNet-s and Pyramidal ResNet). It can be observed that on corrupted data, regardless of particular architecture, certain data augmentations (Cutout, Mixup, CutMix, SmoothMix) do not offer substantial improvement compared to the baseline results.

When leaving aside Puzzle Mix (Aug), the method that consistently outperforms its competitors is AugMix, which was explicitly designed for dealing with corrupted data. The runner-up technique, which is also substantially better than the remaining methods, is Puzzle Mix. Not surprisingly, the strongest overall approach is an application of Puzzle Mix on top of AugMix, in which a pair of images, after being processed with AugMix, is mixed using Puzzle Mix. This combination, denoted as Puzzle Mix (Aug) in the figure, managed to further decrease the classification error down to the level of \(29.9\%\), which is the lowest result on CIFAR-100-C reported in the literature.

5.3 Evaluation on adversarial examples

Fig. 23 FGSM error rates on CIFAR-100 for two architectures (PreAct ResNet-18 and WideResNet-28-10) trained using various data augmentation techniques. Puzzle Mix (adv) is a combination of Puzzle Mix and adversarial training (Wong et al. 2020). The left column presents absolute error values and the right one a relative improvement over the baseline. All results come from the work of Kim et al. (2020) which offers a broad comparison of data augmentation methods, in both clean data and adversarial attacks scenarios. (Color figure online)

Fig. 24 FGSM error rates on ImageNet for ResNet-50 architecture trained using three data augmentation techniques (Cutout, Mixup and CutMix). The left column presents absolute error values and the right one a relative improvement over the baseline. All results come from the work of Uddin et al. (2020). (Color figure online)

While this area is not the mainstream of image augmentation research, it is worth mentioning that some augmentation methods were also evaluated against adversarial examples. The adversarial attack used most often in this context is the Fast Gradient Sign Method (FGSM) employed as a white-box attack. In this case the gradient of the loss function is analyzed to find the most effective way of perturbing the image so as to mislead the classifier. Similarly to the case of image corruption, adversarial examples are used exclusively at test time.
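
A minimal sketch of such an FGSM white-box attack is given below (PyTorch; the model, the pixel range and the perturbation budget eps are assumptions on our side):

    import torch
    import torch.nn.functional as F

    def fgsm_attack(model, images, labels, eps=8 / 255):
        # perturb each image in the direction of the sign of the loss gradient
        images = images.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(images), labels)
        loss.backward()
        adversarial = images + eps * images.grad.sign()
        return adversarial.clamp(0.0, 1.0).detach()   # assumes inputs scaled to [0, 1]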

Figures 23 and 24 present the FGSM error rates from the experiments reported in the works of Kim et al. (2020) and Uddin et al. (2020) regarding CIFAR-100 and ImageNet, respectively. Generally, the results are inconclusive: for CIFAR-100 pixel-wise mixing methods (Mixup, Manifold Mixup, AugMix) are superior, whereas for ImageNet the best performing method is Saliency Mix, a patch-wise mixing approach. It should be noted that in the work of Verma et al. (2019) visibly lower FGSM error rates for Mixup (\(59.3\%\)) and Manifold Mixup (\(55.0\%\)) were reported for CIFAR-100 and the PreAct ResNet-18 network, however with no details regarding the experiment setup.

The method that clearly stands out in terms of accomplished results is Puzzle Mix (adv) (Kim et al. 2020), which is a variant of Puzzle Mix employing adversarial training. At the same time, this is the only method that requires substantially more computational power compared to the “NO DA” reference case.

5.4 Evaluation on weakly supervised object localization and partial occlusion tasks

In this section we discuss the efficacy of DA techniques in the problems of weakly supervised object localization (WSOL) and partial occlusion of the target object, which are frequently encountered in image classification and object detection settings. In both contexts a desired property of the model is to make predictions based on a wide array of class-relevant visual features and not just the most distinctive ones. In WSOL this property helps in finding a fragment of an image that entirely covers a given object, and in the context of partial occlusion it supports the network in inferring the correct class based on the remaining, less distinctive features.

Utilization of augmented data in WSOL is discussed in (Yun et al. 2019; Takahashi et al. 2018; Uddin et al. 2020; Huang et al. 2020; Kim et al. 2021) in reference to CutMix, RICAP, Saliency Mix, SnapMix and Co-Mixup, respectively. In all cases the results are assessed using Class Activation Mapping (Zhou et al. 2016) already mentioned in Sect. 3.3. CAM creates a heat map of the same size as input images, indicating regions in the original image that contributed most towards a certain class prediction. CAM requires a CNN model to possess a global average pooling layer to obtain the spatial average of the feature maps prior to the output layer, so as to visualize the most class-contributive pixels.
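
A schematic computation of a class activation map is sketched below (PyTorch; the tensor shapes and the way the output-layer weights are accessed are assumptions that depend on the concrete CNN):

    import torch

    def class_activation_map(feature_maps, fc_weights, class_idx):
        # feature_maps: (C, H, W) activations of the last convolutional block
        # fc_weights:   (num_classes, C) weights of the output layer placed after
        #               global average pooling; the result is an (H, W) heat map
        cam = torch.einsum('c,chw->hw', fc_weights[class_idx], feature_maps)
        cam = torch.relu(cam)
        return cam / (cam.max() + 1e-7)   # normalized to [0, 1] for visualization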

The results indicate that augmented data indeed broadens the regions used by the model to infer the correct class (Yun et al. 2019; Takahashi et al. 2018; Kim et al. 2021), with Saliency Mix being more discriminative against the image background. The only exception is SnapMix, which was designed for fine-grained classification and limits the regions activated by CAM to the parts of the image that are relevant for a particular fine-grained class (e.g. wings instead of a whole bird).

RICAP was also tested in the work of Takahashi et al. (2018) on the object detection task. The experiment was conducted on the MS COCO (Microsoft Common Objects in Context) set (Lin et al. 2014) using the YOLOv3 architecture (Redmon and Farhadi 2018). MS COCO is a large set of images presenting common objects with associated bounding boxes and segmentation maps, accompanied by the respective captions. In the reported experiments only information about the bounding boxes was used.

RICAP was adjusted so as not to interfere with the area in which an object was located. The target bounding boxes were restricted to the cropped regions. This way the model was no longer able to benefit from partial labels (each object had one class assigned to it) but could learn partial features (only part of an object might fall into the cropped region, so the network had to learn to make predictions based on the remaining part). This adjusted version of RICAP slightly improved the mean Average Precision (mAP) over the baseline (YOLOv3 without augmentation) from \(51.3\%\) to \(52.7\%\) (Takahashi et al. 2018).
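
The bounding-box adjustment described above can be expressed roughly as follows (plain Python; the coordinate convention and the helper name are our own):

    def clip_box_to_patch(box, patch):
        # restrict a ground-truth box (x1, y1, x2, y2) to the cropped region kept by RICAP;
        # return None when the object falls entirely outside the patch
        x1, y1 = max(box[0], patch[0]), max(box[1], patch[1])
        x2, y2 = min(box[2], patch[2]), min(box[3], patch[3])
        return (x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None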

5.5 Application to image-related tasks other than classification

The canonical tasks in Computer Vision are image categorization, object localization, object detection and semantic segmentation. These tasks can be further divided into specific cases:

  • Fine grained categorization in which differences between classes refer to small visual features.

  • Categorization under domain shift in which a model is trained on data from a distribution other than the one applied at inference stage.

  • Categorization with federated learning in which a model is trained on data that is scattered among many devices as opposed to being collected in one place.

  • Categorization of corrupted images which refers to images distorted by certain noise, blur, weather conditions, digital processing, etc.

  • Categorization of adversarial images which refers to images that were purposefully modified to confuse the model.

  • Weakly supervised object localization in which the goal is to localize an object based solely on the image label.

  • Object detection in the presence of partial occlusion where objects are often partly occluded by other objects.

When it comes to the application of certain data augmentation techniques to various tasks, there are two key properties: (1) whether the mixing is performed pixel-wise or patch-wise and (2) how many images are mixed. Consequently, the methods can be divided into the following three groups:

  • Group A: Pixel-wise or mixed Pixel-wise and Patch-wise augmentations that work on 2 or more images: Smart Augmentation (Lemley et al. 2017), Feature Space (DeVries and Taylor 2017), Mixup (Zhang et al. 2018), SamplePairing (Inoue 2018), Between-Class learning (Tokozume et al. 2018a), Manifold Mixup (Verma et al. 2019), AdaMixup (Guo et al. 2019), SmoothMix (Lee et al. 2020), Puzzle Mix (Kim et al. 2020) and Co-mixup (Kim et al. 2021).

  • Group B: Patch-wise methods: RICAP (Takahashi et al. 2018), patch-wise versions of Mixed-Example (Summers and Dinneen 2019), CutMix (Yun et al. 2019), Attentive CutMix (Walawalkar et al. 2020), Saliency Mix (Uddin et al. 2020) and SnapMix (Huang et al. 2020).

  • Group C: methods that work on just 1 image: Style Augmentation (Jackson et al. 2019), AugMix (Hendrycks et al. 2020), MixStyle (Zhou et al. 2021), Cutout (Devries and Taylor 2017), Random Erasing (Zhong et al. 2020) and Patch Gaussian (Lopes et al. 2019).

Methods from group A are limited to the categorization task due to their underlying property of mixing images pixel-wise. This leads to certain regions of the image representing more than one class and renders the application of these methods to other tasks difficult (e.g. what should be done with a bounding box for a part of the image that is mixed?).

Augmentations in group B directly address categorization and localization tasks and can be further adjusted to object detection and segmentation by proper handling of the additional information associated with the task (e.g. the RICAP (Takahashi et al. 2018) method limits the bounding box to the area corresponding to the selected patch).

Augmentation approaches from group C, which work on a single image, can in principle be applied to all tasks.

On a more detailed level, there are certain methods that were either developed with a particular problem in mind or were studied in the context of a specific problem and perform well on it.

  • Fine grained categorization: SnapMix (Huang et al. 2020).

  • Categorization under domain shift: MixStyle (Zhou et al. 2021).

  • Categorization with federated learning: Mixup applied to federated learning (Yoon et al. 2021).

  • Categorization of corrupted images: Manifold Mixup (Verma et al. 2019), SmoothMix (Lee et al. 2020), Puzzle Mix (Kim et al. 2020), AugMix (Hendrycks et al. 2020).

  • Categorization of adversarial images: Mixup (Zhang et al. 2018), Manifold Mixup (Verma et al. 2019), CutMix (Yun et al. 2019), Puzzle Mix (Kim et al. 2020).

  • Weakly supervised object localization: CutMix (Yun et al. 2019), RICAP (Takahashi et al. 2018).

  • Object detection in the presence of partial occlusion: CutMix (Yun et al. 2019).

5.6 Application to modalities other than images

The main focus of this survey is data augmentation for images, however there are also other modalities that could be considered for application of the analyzed augmentation methods: text and audio. In this section we point out the methods that could potentially be extended to other modalities, and present examples of such applications already reported in the literature.

When it comes to applying image augmentation methods to other modalities, the following three groups can be distinguished:

  • Group A: Mixup-like methods (Pixel-wise mixing) that work on 2 or more images and do not utilize any complex mixing mechanism, i.e. Feature Space (DeVries and Taylor 2017), Mixup (Zhang et al. 2018), SamplePairing (Inoue 2018), Between-Class learning (Tokozume et al. 2018a) and Manifold Mixup (Verma et al. 2019).

  • Group B: Patch-wise methods or mixed Pixel-wise and Patch-wise, i.e. RICAP (Takahashi et al. 2018), patch-wise versions of Mixed-Example (Summers and Dinneen 2019), CutMix (Yun et al. 2019), SmoothMix (Lee et al. 2020), Cutout (Devries and Taylor 2017) and Random Erasing (Zhong et al. 2020).

  • Group C: methods that cannot be directly applied to other modalities due to their inherent connection to image-specific data transformations or architectures, i.e. Style Augmentation (Jackson et al. 2019), AugMix (Hendrycks et al. 2020), MixStyle (Zhou et al. 2021), Smart Augmentation (Lemley et al. 2017), AdaMixup (Guo et al. 2019), Puzzle Mix (Kim et al. 2020), Attentive CutMix (Walawalkar et al. 2020), Saliency Mix (Uddin et al. 2020), SnapMix (Huang et al. 2020), Co-mixup (Kim et al. 2021) and Patch Gaussian (Lopes et al. 2019).

Methods from group A can be applied to other modalities without any adaptation as long as the same size of the input objects is ensured. In the context of audio it means having the same length and the same spectrum of frequencies, and for the text data, the same size of vector embeddings. Examples of successful applications of the methods from group A to other modalities are presented below in this section.

For methods from group B, application to modalities other than images is technically possible, however not yet empirically tested. Such mixing would have to account for modality-specific aspects, e.g. spatial mixing of embeddings of different sentences or pasting a part of a voice spectrogram into another one.

A potential application of the methods from group C to other modalities would require introducing major changes to their design and operation, as they are inherently related to image data. Some methods from this group utilize image saliency information (Kim et al. 2021; Huang et al. 2020; Uddin et al. 2020; Kim et al. 2020), others use image-specific data transformations, like style transfer or rotation (Zhou et al. 2021; Hendrycks et al. 2020; Jackson et al. 2019). Yet other ones utilize architectures dedicated to processing image data (Lemley et al. 2017; Guo et al. 2019; Walawalkar et al. 2020).

So far, several examples of applying image augmentation methods to other modalities have been proposed in the recent literature. Most notably, the Mixup method was applied to the speaker classification task (Zhu et al. 2019) and to sentence classification (Jindal et al. 2020). Also, Mixup-like methods were applied to rare event detection based on sound data (Chen and Jin 2019) and to text classification (Guo 2020).

Even though the application of augmentation methods to non-image data seems to be scarce, we are convinced that this area offers many promising research directions related to both the adaptation of the methods from groups B and C, as well as the invention of entirely new augmentation approaches devoted to particular data modalities and addressing their specificity.

5.7 Robustness to parameter selection

The majority of the methods presented in the survey rely on no more than 3 hyper-parameters and are relatively robust to their selection or change, as well as the change of the data set.

The methods' robustness to a problem change (same task, different data set) is mostly correlated with the way the augmentation method is applied (cf. Sect. 4.2). For the rule-based methods, the change of a problem does not require any hyper-parameter changes. For other methods, however, one can face some difficulties with their application to other data sets. Attentive CutMix (Walawalkar et al. 2020) utilizes an additional feature extractor that might not be available for all problems. Puzzle Mix (Kim et al. 2020) and Co-mixup (Kim et al. 2021) employ an optimization procedure that requires some assumptions on how to simplify the optimization problem, which might differ between problems (e.g. the width of the image may impact how the grid of locations for the optimization process is created). The next method, Style Augmentation (Jackson et al. 2019), was shown to be sensitive to hyper-parameter changes in the ablation study: both the strength of regularization via style transfer and the ratio of augmented to non-augmented samples greatly impact the method's accuracy. Yet another method that we believe is hard to apply directly to other data sets is Smart Augmentation (Lemley et al. 2017), one of the older approaches that employs a dedicated problem-specific network responsible for mixing images. Training of this network, in addition to the computational overhead, is generally difficult and may potentially lead to poor mixed samples that would impact the accuracy of the target network.

6 Selecting the best data augmentation strategy

Up to now we have discussed the newest mixing augmentation methods whose utilization improved state-of-the-art results in many visual tasks, for instance image classification (either clean (Tokozume et al. 2018a; Yun et al. 2019) or corrupted (Hendrycks et al. 2020; Kim et al. 2020)), object detection (Takahashi et al. 2018) or WSOL (Takahashi et al. 2018; Yun et al. 2019). While the application of gradually more advanced and more complicated methods has proven successful, an alternative approach is to consider an ensemble of traditional augmentation methods and pick those that are particularly well suited to a given task and/or experiment setup.

Traditional data augmentation techniques, e.g. image scaling, translation or rotation are generally effective in improving accuracy of DL image classifiers, however an impact of a particular augmentation depends on characteristics of the data set and the task at hand, which poses certain limitations to their effective application. Moreover, the number and diversity of traditional augmentation methods prohibit using them all at once, as such a situation would heavily slow down the training process and actually deteriorate the accuracy. Hence, the ability to limit the number of considered augmentation options for a given problem (task, data set, imposed constraints, etc.) is an important issue. In this section several automated approaches to effective selection of the optimal subset of traditional augmentation techniques (for a given task and dataset) are summarized and evaluated. These methods will be referred to as data augmentation policy selection (DAPS) approaches.

6.1 Black Box methods

A seminal DAPS method is AutoAugment (Cubuk et al. 2019), which is composed of two major components: the search algorithm and the search space. The search engine (an RNN controller) samples from the search space the DA policy S defined as the following triple: (image augmentation operation, probability of its application, magnitude of the operation). The controller is trained with the Reinforcement Learning Proximal Policy Optimization algorithm (Schulman et al. 2017) based on the reward signal from the auxiliary model (e.g. a CNN model in the case of image classification), which measures the efficacy of the selected policy in improving model generalization. After completion of the policy exploration phase the auxiliary models are discarded and the best policies found are used for target model training. The choice of RL as the training algorithm is arbitrary and is inspired by automated architecture search techniques (Zoph and Le 2017); actually, any other suitable technique, e.g. augmented random search or an evolutionary strategy, could be used in its place. In order to cast the problem of selecting S into a discrete search space, the augmentation methods' probabilities and magnitudes are discretized to uniformly spaced values.
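
For illustration, a sampled sub-policy of this kind can be represented roughly as follows (plain Python; the operation names, the number of operations per sub-policy and the discretization bins are assumptions on our side):

    import random

    OPS = ['ShearX', 'TranslateY', 'Rotate', 'Color', 'Contrast', 'Equalize']  # assumed subset
    PROBS = [i / 10 for i in range(11)]      # probabilities discretized to uniformly spaced values
    MAGS = list(range(10))                   # magnitudes discretized as well

    def sample_subpolicy(num_ops=2):
        # a sub-policy is a short sequence of (operation, probability, magnitude) triples
        return [(random.choice(OPS), random.choice(PROBS), random.choice(MAGS))
                for _ in range(num_ops)]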

In the search for the optimal DA policy only a fraction of observations from the original data set are used, for instance \(4000\) and \(1000\) in the case of CIFAR-10 and SVHN (Netzer et al. 2011), respectively. The underlying principle of AutoAugment is that an optimal DA policy for a given data set would not change if the entire data set was considered.

The best policies found for the two above-mentioned data sets are different and clearly linked to the image content. For CIFAR-10 the vast majority of the best augmentation policies are focused on color-based transformations, since within the respective classes (e.g. airplane or truck) the diversity introduced by color-related augmentations coincides with real-life cases where one object can be of various colors. For SVHN the best augmentations are geometric transformations, since for street number categorization they seem to be more important than color, which is actually irrelevant as long as the number is readable.

An interesting question is whether augmentation policies are transferable between data sets. To this end the best AutoAugment policies found for SVHN were subsequently applied in the training process of the networks solving classification problems on five other data sets [Oxford 102 Flowers (Nilsback and Zisserman 2008), Caltech-101 (Li et al. 2007), Oxford-IIIT Pets (Em et al. 2017), FGVC Aircraft (Maji et al. 2013) and Stanford Cars (Yang et al. 2015)]. In all cases using policies learned on SVHN resulted in performance increase over the baseline (Cubuk et al. 2019).

In the ablation experiments the application of an optimal policy found by AutoAugment was compared with the use of the same subset of augmentation techniques, but with random probabilities and magnitudes, as well as with random policy sampling. In both cases AutoAugment proved superior.

The main limitation of the method is its extensive computational cost. In order to generate a sufficient number of training signals for the RNN controller, thousands of auxiliary models need to be trained with sampled augmentation policies before the final out-of-sample accuracy is achieved. All the remaining methods presented in this section efficiently address this problem while keeping the results at a comparable level.

The first method that builds upon the idea of AutoAugment and matches its accuracy is Population Based Augmentation (PBA) (Ho et al. 2019). PBA produces a dynamic augmentation policy schedule and combines random search with an evolutionary strategy to significantly decrease the computation time.

The difference between the policy (constructed in AutoAugment) and the policy schedule (in PBA) lies in the time range for which the optimal augmentation policy is selected. In AutoAugment the optimal policy is selected once for the entire training time. In PBA it is chosen for each epoch, leading to a sequence of optimal policies, i.e. a policy schedule. The underpinning idea of PBA is that diversification of policies applied at different training stages is advantageous for the final accuracy.

The main motivation of Ho et al. (2019) is to demonstrate that state-of-the-art results (comparable to those of the AutoAugment application) can be achieved with significantly lower computational cost. As a backbone PBA uses the Population Based Training (PBT) algorithm (Jaderberg et al. 2017) that selects a subset of augmentation techniques independently for each epoch. The schedule learned so far is used as a starting point. In other words, at any given epoch all candidate policies share the common pool of policies selected historically.

Initially PBT trains in parallel an ensemble of randomly initialized models with the same architecture. At certain intervals the performance of these models is evaluated on a validation set. At this point two mechanisms are applied: exploitation and exploration. Exploitation consists in replacing the weights of \(25\%\) of the worst-performing models with the weights from the top-\(25\%\) models. Exploration relies on randomly perturbing the hyperparameters of the models with replaced weights, in order to extend the hyperparameter space search. PBA uses exactly the same hyperparameters as AutoAugment, i.e. a list of triples: (transformation type, probability of its application, its magnitude).

Similarly to AutoAugment, the quest for an optimal DA policy is performed on a reduced data set. In the experiments presented by Ho et al. (2019) the policy schedule is learned only once using the WideResNet architecture, and is subsequently applied to training other models (based on other architectures). Since the DA method applied in a given epoch is not linked to any particular architecture, the only relevant difference from the perspective of the DA process is the number of required training epochs. The training schedule is adjusted proportionally by stretching it out (shrinking it) in case a particular architecture requires longer (shorter) training than WideResNet. For instance, an optimal schedule for 2 epochs consisting of a set of augmentations A and B could be stretched out to 4 epochs by repeating each of these sets, leading to the schedule A, A, B, B.
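
Such proportional stretching (or shrinking) of a per-epoch schedule can be written as a one-line sketch (plain Python; the function name is ours):

    def stretch_schedule(schedule, target_epochs):
        # repeat (or subsample) a per-epoch policy schedule so that it spans target_epochs
        return [schedule[e * len(schedule) // target_epochs] for e in range(target_epochs)]

For the example above, stretch_schedule(['A', 'B'], 4) returns ['A', 'A', 'B', 'B'].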

Fast AutoAugment (FAA) (Lim et al. 2019) is another approach indicating AutoAugment as its inspiration. FAA addresses the problem of AutoAugment's computational complexity by using a more effective search strategy based on density matching. Instead of repeatedly training auxiliary models with different augmentations in order to select the best-performing one, FAA searches for augmentation policies that minimize the categorical cross-entropy loss on a validation set. The process relies on treating the augmented data as missing data points and selecting the best-performing augmentations using the Tree-structured Parzen Estimator (TPE) algorithm (Bergstra et al. 2011).

FAA search space is similar to that of AutoAugment and PBA as the method uses a list of transformation operations, each of them described by its probability and magnitude. The goal is to find a set of optimal policies that can be used for the final model training. The key difference is that TPE enables searching over continuous space so neither probabilities nor magnitudes need to be discretized.

The search process consists of the following steps. First, the data set is split into K pairs (\({D_{{\mathcal {M}}}^{(k)}}, {D_{{\mathcal {A}}}^{(k)}}\)) using K-fold stratified shuffling, where \({D_{{\mathcal {M}}}^{(k)}}\) is the non-augmented data and \({D_{{\mathcal {A}}}^{(k)}}\) is the data that will be used to evaluate different augmentations, for \(k = {1,\ldots ,K}\). Next, the model is trained from scratch using \({D_{{\mathcal {M}}}^{(k)}}\) only. After training, a set of augmentation policies is selected for evaluation. The policy parameters are optimized by minimizing the categorical cross-entropy on the \({D_{{\mathcal {A}}}^{(k)}}\) set. Since evaluating a new policy requires only augmentation and prediction, there is no need to train the model again from scratch.

Once the above exploration phase is completed, top-N policies are selected as the optimal augmentation policies for a given data set. In the final step, the model is trained using the entire training data set and the optimal policies selected across K folds.
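
At a high level, the search procedure can be sketched as follows (pseudocode-style Python; the helpers train, apply_policy, evaluate_loss and tpe_suggest, as well as the trial budget, are placeholders of our own, and only the structure of the loop follows the description above):

    def fast_autoaugment_search(folds, num_trials, top_n):
        # folds: list of (D_M, D_A) pairs obtained from K-fold stratified shuffling
        selected = []
        for d_m, d_a in folds:
            model = train(d_m)                           # trained once, on non-augmented data only
            trials = []
            for _ in range(num_trials):
                policy = tpe_suggest(trials)             # TPE proposes the next candidate policy
                loss = evaluate_loss(model, apply_policy(d_a, policy))  # no retraining needed
                trials.append((policy, loss))
            trials.sort(key=lambda t: t[1])
            selected.extend(policy for policy, _ in trials[:top_n])
        return selected     # used afterwards to train the final model on the full training set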

All three above-described methods (AutoAugment, PBA, FAA) are based on the assumption that a set of augmentation policies that will result in high model accuracy can be found based on a smaller proxy task. Therefore, all of them limit the size of the training set used to search for the optimal augmentation policies.

6.2 Reducing the search space

The sole method in this area is RandAugment (Cubuk et al. 2020), which questions the assumption that a search for optimal policies over the entire solution space can be effectively performed on a smaller proxy task. This doubt is based on experiments showing that the optimal augmentation magnitude depends on the size of the training set and the size of the model. Consequently, RandAugment postulates a significant reduction of the search space so as to make the selection of an optimal augmentation policy a feasible task.

To this end it is proposed by Cubuk et al. (2020) to jointly optimize the magnitudes of all augmentation operations by means of a single parameter called distortion magnitude. Furthermore, it is proposed to uniformly set the probability of all augmentation operations, thus reducing the search space even further. The resulting algorithm depends on just two parameters: the number of transformations applied to an image and the distortion magnitude parameter indicating regularization strength.
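
A minimal RandAugment-style sketch is given below (plain Python; the way transformations are represented and how the magnitude is interpreted are assumptions, the essential point being that only the two scalars n and m remain to be tuned):

    import random

    def randaugment(image, transforms, n=2, m=9):
        # apply n transformations drawn uniformly at random,
        # all sharing the same global distortion magnitude m
        for op in random.choices(transforms, k=n):
            image = op(image, magnitude=m)
        return image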

Two experiments were performed on CIFAR-10, in which a family of WideResNet architectures with various values of the widening parameter (responsible for the number of convolutional filters and hence the complexity of the network) were trained on data sets of various sizes. The first one tested 7 WideResNet models with various widening factors, and the other one considered 8 different sizes of the training set, between \(1000\) and 45,000 images sampled randomly from CIFAR-10, with a fixed WideResNet architecture with widening factor \(= 10\). Both evaluations showed that more complex networks and/or bigger data sets require stronger regularization to achieve their full classification potential. This finding is somewhat counterintuitive with regard to the training set size, where a common belief is that smaller training sets require more regularization. A hypothetical explanation presented by Cubuk et al. (2020) is that strong augmentation of a small data set might result in a high noise-to-signal ratio.

6.3 Making the augmentation pipeline differentiable

Methods in this section transform augmentation operations, which up to now were intrinsically non-differentiable, to differentiable ones, thus allowing joint optimization of the data augmentation pipeline and classification network.

Chronologically, the first paper in this area is Faster AutoAugment (Hataya et al. 2020), which builds on and enhances the Fast AutoAugment approach (Lim et al. 2019) described in Sect. 6.1. The method adjusts the policy search pipeline so as to make it fully differentiable and consequently enable gradient-based optimization. This is achieved by implementing the following three concepts: (1) approximate gradient calculation for discrete image operations (Bengio et al. 2013), (2) making the operation selection process differentiable thanks to the adaptation of neural architecture search methods (Liu et al. 2019), and (3) expanding the idea presented by Lim et al. (2019), where the validation loss on the augmented sample was minimized, to a GAN approach with the augmentation policy treated as a generator and a critic network (Arjovsky et al. 2017).

Faster AutoAugment also adopts the foundations of DAPS, i.e. the organization of the augmentations into policies and sub-policies, as well as the way the experiment is structured (policy search on a smaller sample, model training on full data). Since the method transforms the search for optimal parameters into an optimization task that is more time-effective to solve, it can be applied to larger search spaces. It is concluded by Hataya et al. (2020) that increasing the number of sub-policies or the number of operations in each sub-policy increases the performance.

Differentiable Automatic Data Augmentation (DADA) (Li et al. 2020) relaxes the DA policy selection problem to a differentiable one using the (continuous) Gumbel-Softmax distribution that approximates samples from the categorical distribution.

Technically, DADA approaches the tasks of sub-policy selection and augmentation parameter selection by sampling from a categorical distribution and a Bernoulli distribution, respectively, and then relaxes this optimization problem to a differentiable one using Gumbel-Softmax. Eventually, joint optimization of the augmentation parameters and the classification network weights is carried out using the RELAX estimator (Grathwohl et al. 2018).

Finally, the bi-level optimization of DA parameters and classification network weights is carried out according to the following equations:

$$\begin{aligned}&\min {\mathcal {L}}_{val}( \omega ^{*}(d)) \end{aligned}$$
(24)
$$\begin{aligned}&\text {subject to} \ \omega ^{*}(d) = \underset{\omega }{\arg \min } {\mathbf {E}}[{\mathcal {L}}_{train}(\omega , d)] \end{aligned}$$
(25)

where \(d = \{\alpha , \beta , m, \phi \}\) represents the probability of selecting a sub-policy (\(\alpha\)), the probability of applying a transformation (\(\beta\)), the magnitude of the transformation (m), and the RELAX network parameters (\(\phi\)), respectively, while \(\omega\) represents the parameters of the classification network. In order to solve the bi-level optimization problem (24)–(25), \(\omega\) and d are optimized alternately through gradient descent. As in AutoAugment, the policies are searched for on a reduced data set, although using only one network training limited to 20 epochs.
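
The relaxation that makes this sampling differentiable can be illustrated as follows (PyTorch; the number of candidate sub-policies, the temperature and the placeholder loss are assumptions on our side):

    import torch
    import torch.nn.functional as F

    alpha = torch.zeros(5, requires_grad=True)               # logits over 5 candidate sub-policies
    weights = F.gumbel_softmax(alpha, tau=1.0, hard=False)   # differentiable "sample" from the categorical

    # any loss built from `weights` can now be back-propagated into the policy logits
    loss = (weights * torch.arange(5.0)).sum()               # placeholder loss, for illustration only
    loss.backward()
    print(alpha.grad)                                        # gradients flow through the relaxed sampling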

The next method, Adversarial AutoAugment (Zhang et al. 2020), generates adversarial images as augmented samples. The method performs training on the full data set, getting rid of the smaller proxy task, and returns a dynamic policy schedule which is updated during training.

Technically, Adversarial AutoAugment uses two neural networks: a Task Network (TN) and a Policy Search Network (PSN). PSN is responsible for sampling policies that should be applied to the training data. It is implemented as a one-layer LSTM (Li and Wu 2015) with 100 neurons in the hidden layer and an embedding size of 32. TN is trained on data augmented with the sampled policies. Each image from the training batch is processed multiple times using different policies and all these transformed versions are added to the training batch. Based on the parts of the TN loss function associated with each sampled policy, better policies are generated. PSN parameters are updated using the REINFORCE algorithm (Williams 1992). Adversarial AutoAugment modifies the default policy search space by skipping the policy probability and considering continuous policy magnitudes:

$$\begin{aligned}&\ w^{*} = \underset{w}{\arg \min } \underset{x\sim \Omega }{{\mathbf {E}}} \underset{\tau \sim {\mathcal {A}}(.,\theta )}{{\mathbf {E}}} {\mathcal {L}}[{\mathcal {F}}(\tau (x), w), y] \end{aligned}$$
(26)
$$\begin{aligned}&\theta ^{*} = \underset{\theta }{\arg \max } \underset{x\sim \Omega }{{\mathbf {E}}} \underset{\tau \sim {\mathcal {A}}(.,\theta )}{{\mathbf {E}}} {\mathcal {L}}[{\mathcal {F}}(\tau (x), w), y] \end{aligned}$$
(27)

where TN and PSN are denoted by \({\mathcal {F}}\) and \({\mathcal {A}}\), respectively, and are parameterized by w and \(\theta\). The optimization problem is formulated as a min-max game: TN tries to train a network and PSN attempts to increase the loss by providing harder augmentations, thereby also making TN more robust. Such a setup conforms to the current state of the training process: it provides augmentations of lower severity in the initial training phase (to allow for quick learning of the main patterns) and increases the severity later in training, once the network has learned the simple patterns.

The high level setup of Adversarial AutoAugment is similar to that of GANs in that one network tries to outplay the other one. Yet the method is inherently different as it works on existing images by applying a transformation from a pre-defined list and does not synthesize new images from noise, as GANs typically do.

Policies learned using Adversarial AutoAugment and transferred to other problems give comparable or better results than AutoAugment; however, the best results are achieved when the policies are searched for independently for a given data set (Zhang et al. 2020).

Another method in this group, MetaAugment (Zhou et al. 2020), challenges one of the well-established DA assumptions, namely that an augmentation transformation is selected for the entire data set. MetaAugment evaluates how well the selected augmentation operation fits the image and assigns a weight based on the significance of this fit. The weight is calculated by an Augmentation Policy Network (APN) and passed on to a Task Network (TN) to calculate the weighted loss of the augmented training image.

APN takes as its input an embedding of a transformation function together with deep features extracted from the image by TN, and outputs a weight used to adjust the augmented image loss computed by TN. The goal of APN is to improve the performance of TN on a validation set by adjusting the weights of the losses. APN is implemented as an MLP with one hidden fully-connected layer of size 100 with ReLU activations and an output layer with a single sigmoid neuron. TN can be implemented as any standard CNN architecture; it minimizes the weighted training loss using the weights provided by APN:

$$\begin{aligned}&\theta ^{*} = \underset{\theta }{\arg \min }\ {\mathcal {L}}({\mathcal {X}}_{val}, \omega ^{*}(\theta )) \end{aligned}$$
(28)
$$\begin{aligned}&\text {subject to} \ \omega ^{*}(\theta ) = \underset{\omega }{\arg \min }\ {\mathcal {L}}({\mathcal {X}}_{tr}, \omega , \theta ) \end{aligned}$$
(29)

where \(\theta\) and \(\omega\) are the parameters of APN and TN, and \({\mathcal {X}}_{tr}\) and \({\mathcal {X}}_{val}\) denote the training and validation sets, respectively. MetaAugment proposes an additional mechanism, the transformation sampler (TS), working in tandem with APN and TN. TS samples transformations according to a probability distribution estimated from the outputs of APN, which reflects the overall effectiveness of each transformation for the entire data set. As the distribution is estimated based on APN, it evolves over the training process and provides gradually better recommendations.

One of MetaAugment's disadvantages is its computational overhead. The method requires three forward and backward passes through TN, which takes three times longer than a typical training scenario. Its clear advantage, on the other hand, is the ability to fit the transformation to a given sample. While this fitting is not performed explicitly (the image and the transformation are sampled independently), it is enforced by assigning a weight to the image, which denotes the impact of this image on the loss function used to update the TN weights.
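
The per-sample weighting described above can be sketched as follows. The APN shape follows the description given earlier (one hidden layer of size 100 with ReLU, a sigmoid output); the assumption that the task network also returns its deep features, and the use of a simple mean over weighted losses, are illustrative simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AugmentationPolicyNetwork(nn.Module):
    """Maps (transformation embedding, deep image features) to a weight in (0, 1)."""
    def __init__(self, emb_dim, feat_dim, hidden=100):
        super().__init__()
        self.hidden = nn.Linear(emb_dim + feat_dim, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, t_emb, feats):
        h = F.relu(self.hidden(torch.cat([t_emb, feats], dim=1)))
        return torch.sigmoid(self.out(h))           # per-sample weight

def weighted_training_loss(task_net, apn, x_aug, y, t_emb):
    # Assumption: task_net returns (logits, deep features) for a batch.
    logits, feats = task_net(x_aug)
    per_sample = F.cross_entropy(logits, y, reduction="none")
    weights = apn(t_emb, feats.detach()).squeeze(1)
    return (weights * per_sample).mean()             # loss minimized by TN (Eq. 29)
```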

The most recent method in this area (Mounsaveng et al. 2020) jointly learns the optimal data augmentation parameters while training the end task model. The method will henceforth be referred to as ABO (Data Augmentation with Online Bi-level Optimization). ABO employs two neural networks, a CNN Classification Network (CN) and an MLP Augmentation Network (AN). The training process is divided into two streams. In the inner loop AN predicts augmentation parameters, based on which the image is augmented and then fed into CN to calculate the loss. In the outer loop a feedback signal for AN is created:

$$\begin{aligned}&\theta ^{*} = \underset{\theta }{\arg \min } {\mathcal {L}}({\mathcal {X}}_{val}, \omega ^{*}) \end{aligned}$$
(30)
$$\begin{aligned}&\text {subject to} \ \omega ^{*} = \underset{\omega }{\arg \min } {\mathcal {L}}({\mathcal {A}}_{\theta }({\mathcal {X}}_{tr}), \omega ) \end{aligned}$$
(31)

where \({\mathcal {A}}_{\theta }\) is AN parameterized by \(\theta\), and CN is parameterized by \(\omega\).

Joint training of AN using backpropagation is possible thanks to the following features. Firstly, AN is trained on the validation set [cf. Eq. (30)] to improve the generalization properties of CN. Secondly, in order to enable gradient calculation, the AN error on the validation set is calculated using CN [Eq. (30)] and subsequently used to update the parameters of AN. Thirdly, an online approximation of the bi-level optimization is proposed to enable updating of the AN parameters at each training step. Fourthly, differentiable augmentation operations from the Kornia library (Riba et al. 2020) are used.
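
The fourth ingredient, differentiable augmentation, simply means that the transformation is expressed with tensor operations through which gradients can flow to the augmentation parameters. A minimal brightness/contrast example in plain PyTorch is shown below; it is a sketch of the idea, not the Kornia implementation used in the paper.

```python
import torch

def differentiable_color_jitter(images, brightness, contrast):
    """Brightness/contrast adjustment built from differentiable tensor ops,
    so that `brightness` and `contrast` can be trained with backpropagation.

    `images` is an (N, C, H, W) tensor in [0, 1]; the parameters are scalar tensors.
    """
    mean = images.mean(dim=(2, 3), keepdim=True)
    out = (images - mean) * contrast + mean + brightness
    return out.clamp(0.0, 1.0)
```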

The online approximation of the bi-level optimization is used to overcome the problematic setup of many previous methods (Cubuk et al. 2019; Ho et al. 2019; Cubuk et al. 2020), where the optimization process treats the two objectives separately as black-box problems. In ABO the weights of AN are updated after each step with reference to the CN outputs:

$$\begin{aligned}&\nabla _{\theta } {\mathcal {L}}({\mathcal {X}}_{val}, \omega ^{*}) = \ \frac{\partial {\mathcal {L}}({\mathcal {X}}_{val}, \omega ^{*})}{\partial \theta } = \ \frac{\partial {\mathcal {L}}({\mathcal {X}}_{val}, \omega ^{*})}{\partial \omega ^{*}} \frac{\partial \omega ^{*}}{\partial \theta } \end{aligned}$$
(32)
$$\begin{aligned}&\frac{\partial \omega ^{*}}{\partial \theta } \approx \frac{\partial \omega ^{(t)}}{\partial \theta ^{(t)}} = \sum _{i=1}^{t} \frac{\partial \omega ^{(t)}}{\partial \omega ^{(i)}} \frac{\partial \omega ^{(i)}}{\partial {\mathcal {G}}^{(i-1)}} \frac{\partial {\mathcal {G}}^{(i-1)}}{\partial \theta ^{(i)}} \end{aligned}$$
(33)

where \({\mathcal {G}}^{(t)}\) is the gradient of the training loss at iteration t.

AN is implemented as an MLP with input and output of size n, where n is the number of hyperparameters to optimize, and two hidden layers with n and 10n neurons, respectively. Bigger architectures were also tested, but it was found empirically that the size of AN does not have a significant impact on the accuracy of the classifier (Mounsaveng et al. 2020).
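
A sketch of such an augmentation network, together with the shape of one online iteration, is given below. The layer sizes mirror the description above; the input to the network and the helper names are assumptions made for illustration.

```python
import torch.nn as nn

class AugmentationNetwork(nn.Module):
    """MLP predicting n augmentation hyperparameters
    (two hidden layers of n and 10n units, as described above)."""
    def __init__(self, n):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n, n), nn.ReLU(),
            nn.Linear(n, 10 * n), nn.ReLU(),
            nn.Linear(10 * n, n),
        )

    def forward(self, z):
        return self.net(z)

# One online iteration, in the spirit of Eqs. (30)-(33):
#   1. params = augmentation_net(z)                       # predict hyperparameters
#   2. x_aug  = differentiable_augment(x_train, params)   # e.g. the jitter sketch above
#   3. inner step: update the classifier weights on the training loss over x_aug
#   4. outer step: the validation loss, propagated through the unrolled classifier
#      update, provides the gradient for the augmentation network
#      (same one-step unrolling idea as in the bi-level sketch after Eq. (25)).
```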

Contrary to Adversarial AutoAugment (Zhang et al. 2020) and MetaAugment (Zhou et al. 2020), where augmentations become more demanding as training progresses, in the case of ABO the transformations applied at the beginning of training are the strongest and approach identity towards the end of training.

6.4 Quantitative evaluation

We quantitatively compare the results of applying the nine DAPS methods presented in Sects. 6.1–6.3 on the same three benchmark sets that were used in the evaluation of the previously introduced DA methods. The methods are compared with each other and against selected erasing and mixing approaches described in Sects. 2 and 3, respectively.

Fig. 25

Accuracy results on CIFAR-10 of nine DAPS methods (AutoAugment, FFA, PBA, Rand Augment, Faster AutoAugment, DADA, Adversarial AutoAugment, MetaAugment, ABO) and selected mixing and erasing DA approaches. Top row presents absolute error values and the bottom one relative improvements over the baseline. All papers introducing DAPS methods (Cubuk et al. 2019; Lim et al. 2019; Ho et al. 2019; Cubuk et al. 2020; Hataya et al. 2020; Li et al. 2020; Zhang et al. 2020; Zhou et al. 2020; Mounsaveng et al. 2020) adopted the same experiment design. (Color figure online)

A summary of the results on CIFAR-10 presented in Fig. 25 shows that the application of DAPS methods yields, on average, better results than the use of advanced pixel-wise or patch-wise DA approaches. The only exception is Manifold Mixup, which slightly outperforms the majority of DAPS solutions, albeit the supporting evidence comes from one experiment only. Adversarial AutoAugment and MetaAugment, the two methods that introduce changes not aimed solely at reducing AutoAugment computation time, lead among DAPS methods.

Fig. 26

Accuracy on CIFAR-100 for nine DAPS methods (AutoAugment, FFA, PBA, Rand Augment, Faster AutoAugment, DADA, Adversarial AutoAugment, MetaAugment, ABO) and selected mixing and erasing DA approaches. Top row presents absolute error values and the bottom one relative improvements over the baseline. All the papers describing DAPS methods (Cubuk et al. 2019; Lim et al. 2019; Ho et al. 2019; Cubuk et al. 2020; Hataya et al. 2020; Li et al. 2020; Zhang et al. 2020; Zhou et al. 2020; Mounsaveng et al. 2020) adopted the same experiment design. (Color figure online)

The above observations are also confirmed in experiments with more complex data sets. In the case of CIFAR-100 (see Fig. 26), again all DAPS methods fare very well, with the average error on the WideResNet-28-10 architecture slightly below \(17\%\). At the same time Puzzle Mix (a DA approach) achieves the best overall score. Likewise, on ImageNet (Fig. 27) the results of the majority of DAPS methods are close to each other, except for the two leading approaches (Adversarial AutoAugment and MetaAugment) which outperform the rest. The results are generally better than those of DA methods, except for Puzzle Mix, which demonstrates a slight upper hand over all but the two above-mentioned methods.

Fig. 27

Accuracy results on ImageNet for nine DAPS methods (AutoAugment, FFA, PBA, Rand Augment, Faster AutoAugment, DADA, Adversarial AutoAugment, MetaAugment, ABO) and selected mixing and erasing augmentation approaches. Top row presents absolute error values and the bottom one relative improvements over the baseline. All the papers describing DAPS methods (Cubuk et al. 2019; Lim et al. 2019; Ho et al. 2019; Cubuk et al. 2020; Hataya et al. 2020; Li et al. 2020; Zhang et al. 2020; Zhou et al. 2020; Mounsaveng et al. 2020) adopted the same experiment design. (Color figure online)

A general observation is that the more complex and bigger the data set, the smaller the relative improvement over the baseline that can be expected from DAPS usage. Depending on the architecture, the relative improvement varies between \(24\%\) and \(51\%\) for CIFAR-10, between \(5\%\) and \(26\%\) for CIFAR-100, and between \(1\%\) and \(14\%\) for ImageNet. Based on the results we conclude that the two top methods are MetaAugment and Adversarial AutoAugment, both of which are recently developed DAPS approaches.

An open question, which we believe is worth investigating, is the application of DAPS methods to more complex DA techniques, going beyond traditional augmentations. The results presented in Figs. 25, 26 and 27 suggest that the inclusion of more complex erasing or mixing methods in the DAPS pool of augmentation techniques might improve their efficacy beyond the state-of-the-art MetaAugment and Adversarial AutoAugment solutions.

6.5 Comparison of DAPS methods based on particular properties

This section collates all DAPS approaches introduced above in reference to their key aspects summarized in Table 6.

Table 6 Comparison of DAPS methods along key differentiating axes

6.5.1 The DAPS baseline algorithm

One of the key factors differentiating DAPS methods is the algorithm used to find the best augmentation parameters (cf. row “Algorithm” in Table 6). AutoAugment uses a Reinforcement Learning based approach in which many networks have to be trained on different versions of augmented data in order to generate a strong enough learning signal for the RNN controller. PBA employs the Evolutionary Strategy metaheuristic, in which a population of models trained on different versions of augmented data is maintained and gradually improved through the application of evolutionary operators. FFA relies on the Density Matching algorithm, which consists in initially training the classifier on non-augmented data and then verifying which transformations minimize the cross-entropy loss on the validation set. RandAugment uses a simple grid search over a predefined list of hyper-parameters, as sketched below.
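
The grid-search strategy can be summarized in a few lines: train and evaluate the target model for each pair of the two global hyperparameters and keep the best configuration. The function names below are hypothetical; `train_and_eval` stands for a full training run under the given policy.

```python
from itertools import product

def grid_search_augmentation(train_and_eval, n_ops_values, magnitude_values):
    """Exhaustively evaluate each (N, M) pair of a RandAugment-style policy.

    `train_and_eval(n_ops, magnitude)` is assumed to train the target model
    with that policy and return its validation accuracy.
    """
    best = None
    for n_ops, magnitude in product(n_ops_values, magnitude_values):
        acc = train_and_eval(n_ops, magnitude)
        if best is None or acc > best[0]:
            best = (acc, n_ops, magnitude)
    return best   # (best accuracy, best N, best M)
```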

The other five methods share a common approach that consists in making the data augmentation pipeline differentiable. Faster AutoAugment actually follows an approach similar to FAA, as it uses Density Matching to optimize augmentation parameters. However, instead of searching for the best augmentation policy parameters, it simply learns them from the data using a differentiable data augmentation pipeline. DADA formulates the policy search problem as a Monte Carlo sampling problem. Adversarial AutoAugment replicates the data set n times, applies different augmentations to each replica and calculates the part of the loss associated with that replica. MetaAugment tries to find an optimal augmentation for each sample by reformulating the problem as a sample reweighting problem. The most recent work, ABO, introduces an online approach to solving the bi-level optimization problem.

6.5.2 Fixed or varying in time policy schedule

Following the intuitive reasoning presented by Ho et al. (2019), an optimal augmentation policy (i.e. the one with the best generalization properties) depends on the advancement of the training process. However, among the older methods (up to 2019), PBA (Ho et al. 2019) is the only one that uses a variable policy (policy schedule); all other methods employ a fixed policy. As for the recent methods (2020–2021), 3 out of 5 (Adversarial AutoAugment, MetaAugment and ABO) employ dynamic policies. Furthermore, in the RandAugment paper (Cubuk et al. 2020) the authors tested different schedules of the distortion parameter M (a constant magnitude, a random magnitude, a linearly increasing magnitude, and a randomly selected magnitude with an increasing upper bound; cf. the sketch below) but concluded that a constant global distortion value yields comparable results, so there is no point in increasing the method's complexity. The schedule characteristics adopted by particular methods are summarized in row “Dynamic” of Table 6.
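
The four schedules compared in the RandAugment paper can be written as simple functions of the training progress. The signatures below are illustrative assumptions; `progress` denotes the fraction of training completed, in [0, 1].

```python
import random

def constant_magnitude(m_max, progress):
    return m_max

def random_magnitude(m_max, progress):
    return random.uniform(0, m_max)

def linear_magnitude(m_max, progress):
    # Magnitude grows linearly with training progress.
    return m_max * progress

def random_with_increasing_bound(m_max, progress):
    # Random magnitude whose upper bound grows with training progress.
    return random.uniform(0, m_max * progress)
```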

Figure 28 presents example outputs of a single policy of a DAPS method for the same image that was used throughout the survey. The key factor determining the output of a given policy is whether the method utilizes a fixed or a varying in time augmentation policy schedule.

Policies that use a fixed schedule apply the same augmentations throughout the entire training process, whereas for a varying in time schedule the policy can change as the training progresses. Another aspect of a policy is the augmentation strength; three such levels are illustrated in Fig. 28. Again, for a fixed policy the strength remains the same throughout the training process, while for a varying one it can change along with the training progress.

Fig. 28

Example behavior of a single policy with shear, translate or rotate, and color data augmentations applied. Rows correspond to various augmentation strengths. The first column depicts example output of a fixed policy that does not change throughout the training. Columns 2–4 refer to a policy varying in time during training. (Color figure online)

6.5.3 Computational complexity of the method

An assessment of the methods' complexity based on the reported GPU hours would not be adequate, as the presented solutions rely on various graphics cards, training set sizes, etc. For this reason, we decided to present two computational complexity parameters in Table 6: “Epochs”, the number of times the algorithm goes through the entire training data set as reported in the respective papers (these values are not directly comparable due to different sizes of the training sets), and “Epoch norm.” (normalized epochs), a crude estimate of epoch-data-set-size complexity which assumes that the same number of training epochs (200) and the same amount of data are considered by each method. Based on this rough estimation, it can be concluded that the most computationally efficient methods are MetaAugment, Faster AutoAugment and DADA, followed by FAA and PBA. At the other extreme lies AutoAugment, which is orders of magnitude more time-consuming than the remaining approaches.

6.5.4 Search space

The majority of DAPS methods consider discrete parameter spaces. The three exceptions search continuous spaces as they use either Density Matching (FAA, Faster AutoAugment) or a separate Augmentation Network that works on augmentation embeddings (MetaAugment).

Among all methods only RandAugment significantly reduces the search space, restricting it to roughly \(10^{2}\) possible options. Furthermore, Adversarial AutoAugment removes the probability parameter of each operation and hence reduces the search space to \(10^{22}\) elements. The remaining methods consider much bigger spaces: \(10^{32}\) elements in the case of AA, FFA, Faster AutoAugment, DADA and MetaAugment, and \(10^{61}\) for PBA, which uses the same number of parameters but additionally searches for an optimal schedule of the policies.
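
The orders of magnitude quoted above follow directly from the cardinality of the parameter grids. A back-of-the-envelope check, assuming the AutoAugment-style convention of 16 operations, 10 magnitude bins, 11 probability bins and policies built from 10 operation slots (treat these counts as an approximation), is shown below.

```python
# Rough search-space sizes behind the numbers quoted above.
ops, magnitudes, probabilities, slots = 16, 10, 11, 10

autoaugment_like = (ops * magnitudes * probabilities) ** slots   # ~2.9e32
adversarial_aa = (ops * magnitudes) ** slots                     # ~1.1e22 (no probability)
randaugment = 10 * 10                                            # ~1e2 (grids for N and M)

print(f"{autoaugment_like:.1e}, {adversarial_aa:.1e}, {randaugment}")
```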

A summary of the search space characteristics is presented in rows “Space type” and “Space size” of Table 6.

6.5.5 Proxy task

This aspect is connected to the baseline algorithm employed by each method, described in Sect. 6.5.1. Three non-differentiable methods use a proxy task for finding the optimal data augmentation policy. The complexity of the proxy task differs between AA, PBA and FAA, but in each case additional models need to be trained. More details are presented below in Sect. 6.5.6. The only method in this group that trains the final classifier models right from the start is RandAugment.

Among the differentiable methods there are two (Faster AutoAugment and DADA) that use a separate augmentation parameter search phase, although both of them use just one auxiliary model. The remaining differentiable methods do not use a proxy task, as they jointly train the augmentation pipeline and the target model.

Rows “Proxy task” and “Final task” in Table 6 show the number of observations used, respectively, to solve the proxy task and the final task (with the augmentation strategy selected based on the proxy).

6.5.6 Number of trained models

RandAugment is the sole method that trains several final classifiers, the number of which depends on the data set. The remaining methods train just one final classifier, since at this stage the best policy has already been selected or is optimized jointly with the target model training.

The target classifier is always trained on the full available training data, while auxiliary models are trained only on a fraction of this data. In the case of AA, PBA, Faster AutoAugment and DADA, between \(1000\) and \(4000\) observations are used in auxiliary training, depending on the data set, and in the case of FFA the training set is divided into a number of equal-size parts, each of them used to train one auxiliary model. The respective numbers of auxiliary models used for CIFAR-10 are presented in the last two rows of Table 6. One more thing worth mentioning is the n-fold multiplication of the training set size of the target model in Adversarial AutoAugment, with one instance per sampled augmentation policy.

7 Conclusions

Data augmentation in the image classification task is mainly applied to increase the size of the training data set and to make the model more robust by creating variations of the images that, although procedurally generated, resemble real-world test settings. DA is a well-known regularization mechanism helpful in preventing over-fitting of the learning process. Additionally, scarce availability of annotated training data is considered one of the biggest impediments in DL applications to narrow domains (e.g. particular business problems) or those requiring high-level expertise. In such scenarios DA can play a critical role in increasing the size of the training data without increasing the cost of manual creation of the annotated data.

This survey is focused on two DA areas: data augmentation via mixing images and data augmentation policy selection (Cubuk et al. 2019; Ho et al. 2019; Lim et al. 2019; Cubuk et al. 2020; Hataya et al. 2020; Li et al. 2020; Zhang et al. 2020; Zhou et al. 2020; Mounsaveng et al. 2020). The former genre is further divided into methods that erase part of the image (Devries and Taylor 2017; Zhong et al. 2020; Lopes et al. 2019) and image mixing methods (Zhang et al. 2018; Yun et al. 2019; Guo et al. 2019; Verma et al. 2019; Lee et al. 2020; Walawalkar et al. 2020; Kim et al. 2020; Inoue 2018; Tokozume et al. 2018a; Summers and Dinneen 2019; Takahashi et al. 2018; Hendrycks et al. 2020; Zhou et al. 2021; Huang et al. 2020; Kim et al. 2021; Uddin et al. 2020), which can be further divided based on particular properties, e.g. pixel-wise vs. patch-wise methods, or approaches working on pairs of images (the typical situation) vs. those that use a number of images other than two to produce an augmented sample.

Mixing DA methods and DAPS approaches for image categorization constitute a relatively new research area; therefore, the vast majority of the papers discussed in this survey were published between 2018 and 2021. Methods created with the image classification problem in mind are gradually being explored in the context of other data modalities, for instance the application of Mixup to text data (Guo 2020; Jindal et al. 2020), or are being adapted to particular use cases like semantic segmentation (Olsson et al. 2021), federated learning (Yoon et al. 2021) or fairness (Chuang and Mroueh 2021).

The properties and design of a DA method determine its impact on the augmented image and are crucial when deciding which DA method fits the task at hand best. Among mixing methods, the pixel-wise approaches (e.g. Mixup) work better with noise (corrupted images or incorrect labels), while the patch-wise ones (e.g. CutMix) are better suited to partial occlusion or to the weakly supervised object localization problem. Patch-wise methods (nonetheless with some adjustments required, like the adaptation of RICAP described in Sect. 5.4) are also effective in the object detection task. Additionally, there are DA methods designed specifically for a particular problem or setting, like AugMix, which is devoted to the case of corrupted images.

With reference to data augmentation policy search (DAPS) methods, a general observation is that the more complex and bigger the data set, the smaller the relative improvement over the baseline that can be expected from DAPS usage. In other words, for complex and diverse data the benefit of existing DAPS approaches is still limited, possibly due to considering only simple, traditional DA techniques in the developed policies.

A similar observation is also valid in the case of individual DA approaches, where the experiments have shown that the relative gain from DA application is smaller for more complex data sets.

An open question is whether mixing raw input images is more effective than mixing their latent feature-based representations. While some related discussion has already taken place in the literature, the outcomes are as yet inconclusive. It seems reasonable to assume that the advantage of either of these two approaches may depend on the particular DA method applied.

7.1 Possible future directions

Generally, within mixing DA methods one can observe two major tendencies: (1) joining approaches that demonstrate different capabilities into coherent, though more complex, synergetic methods, and (2) extending well-established methods by combining them with certain statistical image-based information, so as to achieve better properties and higher accuracy of such hybridized approaches. A motivating example of the former direction is a combination of Puzzle Mix (Kim et al. 2020) and AugMix (Hendrycks et al. 2020) that achieves better results on corrupted images than either of the constituent methods alone. Prime examples of the latter trend are Patch Gaussian (Lopes et al. 2019), which joins the idea of Cutout (Devries and Taylor 2017) with the addition of Gaussian noise to make the method work well on both clean and corrupted data, and Puzzle Mix, which combines the idea of mixing with information on the saliency of visual features to expose the most discriminative parts of the two mixed images in the augmented sample.

We suspect that unless some major breakthrough occurs, the predictable future of data augmentation lies in further hybridization, with more and more methods joined together to optimize the final accuracy or to address some niche problem.

Another promising direction, which we believe will be explored in the near future, is image-based selection of DA techniques that aims at finding an optimal augmentation for each sample. One possible realization of this idea has been proposed in the MetaAugment approach (Zhou et al. 2020). There is also room for optimization-based solutions that combine learning the optimal DA selection and parametrization with the target model training. Such a bi-level approach has recently been proposed in the ABO method (Mounsaveng et al. 2020).

In the field of DAPS methods a promising, though yet under-explored, research avenue is considering more recent, well-established DA techniques, thus extending the search space beyond traditional transformations (e.g. affine or color-related transformations), which may potentially lead to improved performance.