1 Introduction

Neural networks are acknowledged to be learn-by-example techniques. Assuming the training set is representative, the problem becomes choosing a proper loss function that avoids poor local optima and a good backbone. We know that some convolutional neural networks (CNNs) are more accurate than others: in image classification tasks, ResNet [9] usually performs better than VGG [21], for it has a more complex architecture with residual connections that help generalization [10].

Changing the training protocol is another way to improve generalization. Notably, neural networks obtain better results when more (and properly labeled) training instances are available. One can also train a model on a larger dataset before fine-tuning it to the desired problem, i.e., transfer learning [16]. Results show it can boost performance significantly in a variety of problems. Google improved its results in the ImageNet challenge [4] by building a dataset with more than 300 million images to pre-train a model and then fine-tune it on the ImageNet set [23].

However, some scenarios do not allow us to collect additional training data, for the labeling cost is prohibitive. Regularization can help in such cases by making training harder or the loss function landscape smoother [7], and we expect improvements in the model’s generalization after applying such approaches. Classical regularization approaches in deep learning include data augmentation, which must preserve the semantics of the transformed samples. A model trained on the MNIST dataset [13], for instance, can rotate images only to some extent: rotating a “6” by \(180^{\circ}\) turns it into a “9”, generating a wrong instance-label pair.

This work introduces Random Label Smoothing (RLS), a new regularization approach that works on the output layer. RLS operates by randomly changing values in the label vector (ground truth). We demonstrate state-of-the-art results in image classification and super-resolution, evidencing it can be used in a broad range of application domains.

The manuscript is organized as follows. Sections 2 and 3 present some related works and the proposed approach, respectively. Section 4 introduces the methodology, and Sect. 5 demonstrates the robustness of the proposed approach in experiments under different scenarios. Section 6 provides a brief discussion about the outcomes, and Sect. 7 states conclusions.

2 Related Works

Several regularization techniques are available to help neural networks accomplish better results in different domains. Regularization based on data augmentation is classical, with many approaches in the literature. AutoAugment [3] performs data augmentation by first learning the best policy for creating synthetic samples. However, it may take too long to determine the best data augmentation strategy for a given dataset. Aiming to make this process faster, Fast AutoAugment [15] calculates the gradient from just one batch, decreasing the computational effort considerably. Other methods, such as Cutout [5] and RandomErasing [30], work by removing random areas of the image: the former removes a patch and leaves its content empty, while the latter fills it with random noise.

Other methods operate by changing the feature maps generated during training. Dropout [22] randomly drops neurons, while MaxDropout [18] eliminates the most active ones, i.e., the neurons with the highest activation values. An improved version, called MaxDropoutV2 [19], includes a more efficient approach to finding the most active neurons: instead of directly comparing values in the output feature map of a given layer, it first sums the values of each neuron along the depth axis before performing the comparison, thus carrying more semantic information than its original counterpart. Additional methods consider other internal aspects of CNN training. Shake–Shake [6] randomly rescales the forward and backward contributions of each branch during training in multi-branch models, such as ResNeXt [26], and results show it can improve accuracy significantly.

A recent analysis of regularization methods for CNNs [17] highlighted some drawbacks in the area. The first one is the shortage of algorithms that perform regularization at the label level. The other concerns the application domain, i.e., most regularization techniques designed for deep nets focus on image classification. Label Smoothing [24] changes the values of the output layer (label vector): it decreases the value at the position that represents the true label and increases the values of the inactive positions. Two-Stage Label Smoothing (TSLA) [27] applies label smoothing only up to a certain point of training, showing that turning it off at a late training stage helps the model generalize better.
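For reference, the standard (uniform) Label Smoothing transform can be written in a few lines of Python. The sketch below uses hypothetical names (`one_hot`, `epsilon`) and assumes the common formulation in which the removed mass \(\epsilon\) is spread uniformly over all K classes:

```python
import numpy as np

def uniform_label_smoothing(one_hot: np.ndarray, epsilon: float = 0.1) -> np.ndarray:
    """Standard Label Smoothing: every class receives epsilon / K, while the
    active class keeps 1 - epsilon + epsilon / K (K = number of classes)."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - epsilon) + epsilon / k

# Example with 5 classes and the true class at index 2:
# uniform_label_smoothing(np.eye(5)[2], epsilon=0.1) -> [0.02, 0.02, 0.92, 0.02, 0.02]
```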

3 Proposed Approach

According to [17], there are several issues related to recent regularization methods, and we address a few of them in this work. Regularization approaches are often evaluated in a single context only, primarily image classification; here, we also consider image super-resolution. Still according to [17], a good regularization technique should improve results even if the model already employs another regularization technique, a requirement RLS satisfies.

Traditional data augmentation usually changes features in the input data. A simple way to perform data augmentation is to rotate the image to the left or to the right by a random angle. Another way is to crop some areas of the input image, e.g., Cutout [5]. Following a similar logic, RLS augments the data by performing random but controlled changes in the label of every single instance during training. We explain how to do that for image classification and super-resolution tasks below.

3.1 Image Classification

Concerning image classification, we vary the output values that define the label within a controlled but random range. In the “active” position, i.e., the index holding the value “1” that encodes the label (one-hot representation), we randomly decrease the value by an amount between 0.05 and 0.49, guaranteeing that the active label always keeps the greatest value. For the inactive positions (set to “0”), we distribute the amount removed from the active position among them. For instance, if the problem has 10 classes and the removed value is 0.3, the active position is set to 0.7, and each of the other 9 positions receives a portion of the remaining 0.3.
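A minimal sketch of this procedure for a single label vector is given below. The function and parameter names are ours, and the exact sampling scheme is not fully specified in the text, so we assume a uniform draw of the removed amount and a random split among the inactive classes, as depicted in Fig. 1:

```python
import numpy as np

def random_label_smoothing(one_hot: np.ndarray,
                           low: float = 0.05, high: float = 0.49,
                           rng=None) -> np.ndarray:
    """Sketch of RLS for one one-hot label vector.

    The amount removed from the active class is drawn uniformly from
    [low, high]; splitting it among the inactive classes with random
    proportions is our assumption."""
    if rng is None:
        rng = np.random.default_rng()
    k = one_hot.shape[-1]
    active = int(one_hot.argmax())                 # index of the "1" entry
    removed = rng.uniform(low, high)               # e.g., 0.3
    shares = rng.random(k - 1)
    shares = removed * shares / shares.sum()       # random split of the removed mass
    smoothed = np.empty_like(one_hot, dtype=np.float64)
    smoothed[np.arange(k) != active] = shares
    smoothed[active] = 1.0 - removed               # active class keeps the largest value
    return smoothed

# Toy example: 10 classes, true class 3 -> active value in [0.51, 0.95],
# inactive values are random and sum to the removed amount.
print(random_label_smoothing(np.eye(10)[3]))
```

Since the active value is at least 0.51 and the removed amount is at most 0.49, no inactive position can ever exceed the active one, which matches the guarantee stated above.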

By doing these transformations on the labels during training, we create various acceptable labels for a given instance, working as a label augmentation algorithm. Our results show it is effective for image classification, helping the model generalize better and outperforming other methods, such as TargetDrop [31] and MaxDropout [18]. Figure 1 demonstrates how RLS works for a classification problem in a toy example.

Fig. 1

Simulation of Label Smoothing and Random Label Smoothing over a batch of labels during training for a classification model (active label in bold). Traditionally, the active label is set to “1” while all other classes’ indices are set to “0”. In Label Smoothing, the active class is set to a constant (and higher) value, while the inactive classes are set to a smaller constant value. In Random Label Smoothing, the active label receives a random (and higher) value (e.g., greater than 0.5), while the inactive labels receive random values that, summed with the active label value, reach 1

3.2 Image Super-Resolution

For image reconstruction, we first tried an approach similar to the classification task by randomly changing the label’s values (the pixel values of the ground truth) following a Gaussian distribution. However, it did not work as expected: changing the pixel values with a Gaussian distribution introduces a considerable amount of Gaussian noise into the new reference image (the modified ground truth), and the entire system then learns to reconstruct noisy images.

Changing the values pixel by pixel with different random amounts did not seem to be a good strategy, for we lose semantic details. We therefore decided to perturb all pixels by the same amount, i.e., a single random value that is either added to or subtracted from every pixel value in a given training iteration. The problem is now defining which range of values leads to the best results.

We achieved promising outcomes by reverse-engineering the results of the neural networks employed in this study. We used the results (i.e., PSNR—peak signal-to-noise ratio) of each architecture to set the range of values. For instance, PyNET [11] achieved a PSNR of 21.19 dB; converting this value to a gray-scale amount results in about 12 units. Therefore, all pixel values (for all dataset images) were either increased or decreased by that amount (in this example). For EDSR (Enhanced Deep Residual Networks for Single Image Super-Resolution) [14], the results are a PSNR of 29.21 dB on Div2K [1] and 28.89 dB on the RealSR dataset [2]; therefore, we set 4.5 units for both datasets.

In the last experiment, we verified whether the above methodology could be further improved. We found that we could achieve even better results by using half of the range instead of the full range, i.e., 6 units for PyNET and 2.25 for EDSR. Even though image regions may differ, smaller perturbations over continuous regions help the entire model understand that small deviations from the exact pixel values are acceptable.
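A minimal sketch of this ground-truth perturbation is shown below, assuming 8-bit images. The function and parameter names are hypothetical, and the sampling scheme is our assumption (a single offset drawn uniformly in [-max_offset, +max_offset]; the text can also be read as a fixed amount with a random sign). The reported half-range values (6 gray levels for PyNET, 2.25 for EDSR) would be passed as `max_offset`:

```python
import numpy as np

def perturb_ground_truth(hr_image: np.ndarray, max_offset: float = 6.0,
                         rng=None) -> np.ndarray:
    """Sketch of the RLS ground-truth perturbation for super-resolution.

    One offset is drawn per image and added to every pixel, so local
    structure (semantic detail) is preserved; only the overall intensity
    shifts at each training iteration."""
    if rng is None:
        rng = np.random.default_rng()
    offset = rng.uniform(-max_offset, max_offset)
    return np.clip(hr_image.astype(np.float32) + offset, 0.0, 255.0)

# Hypothetical usage inside a training loop:
# for lr_img, hr_img in loader:
#     target = perturb_ground_truth(hr_img, max_offset=2.25)  # EDSR, half-range setting
#     ...compute the reconstruction loss against `target`...
```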

4 Methodology

This section provides a complete description of the experimental protocol used to evaluate RLS. We first describe the evaluation scenarios and the neural architectures employed, then the training protocol, and finally the datasets.

4.1 Scenarios

We considered three different scenarios to evaluate RLS, each differing in the input data or at the label level. The first one is standard image classification. The second task concerns image super-resolution, in which the neural network must magnify the input image, producing an enlarged version of it. For example, for a magnification factor of four, if the input has a size of \(200\times 200\), the model must produce an output of size \(800\times 800\).

Last but not least, another challenge is to simulate an image signal processor (ISP), which essentially creates an RGB image from a CFA (color filter array) acquired by the camera’s sensor. Therefore, given a CFA input, the task is to learn a CNN that can generate its corresponding RGB output.

4.2 Neural Network Architecture

As mentioned earlier [17], a good regularization technique should improve results in different problems to show it can enhance a given CNN’s outcome. To provide a fair evaluation, we tested four neural backbones across different tasks: two for image classification, one for single image super-resolution, and one for software ISP.

The first CNN we use to evaluate RLS is ResNet [9], more precisely ResNet-18. We have chosen this architecture because it is widely used to evaluate regularization techniques, allowing a natural comparison. This backbone comprises a sequence of convolutional and pooling layers, with pooling applied after every two or three convolutional layers. The main innovation in its architecture is the residual connections, which improve effectiveness to a certain extent.

EDSR [14] is one of the few neural networks already used to evaluate regularization methods [29], making it another natural choice. It is a residual convolutional network with a sequence of convolution–ReLU–convolution operations in its residual blocks and pixel shuffle operations [20] at the end to perform image super-resolution. PyNET [11], a multi-branch CNN with several parallel layers that uses different measures for error calculation, is well suited to evaluate problems related to image and signal processing, specifically image reconstruction.

4.3 Training Protocol

For the image classification problem, we considered the protocol suggested by [17]. The images were resized to \(32\times 32\) pixels and then randomly cropped into \(28\times 28\) patches. Stochastic gradient descent with Nesterov momentum is used for gradient calculation. The learning rate starts at \(10^{-2}\) and is multiplied by \(10^{-1}\) at epochs 80, 120, and 160.
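A sketch of this optimization setup in PyTorch is given below. The momentum value, weight decay, and total number of epochs are not stated in the text, so typical values are shown as assumptions; the model and data pipeline are placeholders:

```python
import torch
from torch import nn, optim
from torchvision import models

model = models.resnet18(num_classes=100)   # CIFAR-100 setup used in the experiments
criterion = nn.CrossEntropyLoss()          # accepts soft (RLS-smoothed) targets in recent PyTorch versions

# SGD with Nesterov momentum; the learning rate starts at 1e-2 and is divided
# by 10 at epochs 80, 120, and 160, as described in the protocol above.
# Momentum 0.9, weight decay 5e-4, and 200 epochs are assumed, not stated.
optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9,
                      weight_decay=5e-4, nesterov=True)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 120, 160], gamma=0.1)

for epoch in range(200):
    # ... one training pass over the 28x28 random crops goes here ...
    scheduler.step()
```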

Concerning image super-resolution, we did not find any well-established protocol. We therefore followed the same parameters used in CutBlur [29] and PyNET [11], which allows a natural comparison with these previous works.

In all scenarios, five training runs were performed to avoid results achieved merely by chance. We report the mean and standard deviation for each performance measure.

4.4 Datasets

We used a different dataset for each task evaluated in this work to allow a fair comparison against other methods. In each case, we selected the datasets that, according to our research, are the most used in each application domain considered here, i.e., image classification and super-resolution.

For the image classification task, we selected CIFAR-100 [12], one of the most used datasets to evaluate regularization techniques [17]. It comprises 50,000 training images from 100 different classes and 10,000 images for validation.

For the single image super-resolution task, we considered two different datasets, inspired by [29]. The first one is the Div2K [1] dataset, with 800 pairs of low- and high-resolution RGB images for training and 100 for evaluation. The other is RealSR [2], which comprises 459 pairs of images for training and 100 for model validation. In both cases, we used a magnification factor of four for comparison purposes.

The last one is the Zurich RAW to RGB dataset [11], used to evaluate techniques for image reconstruction. It is divided into 46,839 pairs of Bayer filter data and RGB images for training and 1,204 similar pairs for testing.

5 Experimental Results

This section compares RLS against some state-of-the-art regularization approaches. RLS is first evaluated on image classification tasks (Sect. 5.1) and later on image super-resolution problems (Sect. 5.2). The best results are presented in bold.

5.1 Image Classification

Table 1 presents the error rate of ResNet-18 on the CIFAR-100 dataset. RLS alone achieved the best outcome, outperforming seven other techniques. The first row of the table is our baseline, i.e., ResNet-18 without any regularization.

Table 1 Image classification experiment on the CIFAR-100 dataset

5.1.1 Working Along with Other Regularizers

As mentioned by [17], it is vital to check how a particular regularization algorithm works along with other regularization methods. Here, we provide some interesting outcomes. Table 2 presents the results of ResNet-18 using Cutout combined with other methods. Combined with Cutout, the proposed approach outperforms MaxDropout and TargetDrop (also combined with Cutout) by more than 0.5% on average, which we consider a good improvement.

Table 2 Results on CIFAR-100 using ResNet-18 with one or more regularization methods

Combining PyramidNet with ShakeDrop and RLS also results in some improvement. Table 3 shows the outcomes of PyramidNet without any regularization, using ShakeDrop, and using ShakeDrop+RLS. On average, the latter combination yielded an improvement of \(0.1\%\).

Table 3 Results on CIFAR-100 using PyramidNet with one or more regularization methods

5.2 Image Super-Resolution

Table 4 shows the outcomes of the EDSR backbone using different regularization techniques from [29]. Some conclusions can be drawn in this scenario. The first one concerns the Div2K dataset, whose results show that RLS using half of the perturbation value (RLS-Half) outperforms all methods, including the situation in which all techniques (i.e., EDSR, Cutout, Cutmix, Mixup, RGB permutation, Blend, and CutBlur) are used together (All).

The second analysis concerns the RealSR dataset. Although RLS did not surpass the neural network trained with all methods combined, it still achieved the best individual result.

Table 4 PSNR results on the Div2K and RealSR datasets for EDSR using regularization methods

We considered an additional experiment related to image reconstruction. Table 5 shows the outcomes of PyNET using the RLS algorithm, considering the PSNR and the Multiscale Structural Similarity Index Measure (MS-SSIM) [25] quality measures. An improvement over the original results (i.e., standard PyNET) can be observed when RLS is applied. It is worth mentioning that, even though several regularization methods are available for CNNs, none of them tackles, or is at least evaluated in, the context of image reconstruction. As far as we are aware [17], this is the first regularization approach that improves the results of deep learning models in this task.

Table 5 Results on the Zurich RAW to RGB dataset for PyNET

6 Discussion

Providing new regularization algorithms is not straightforward, for it often requires specialist knowledge of the problem at hand. Deep learning by itself is already an area of research that demands plenty of work when improvements are required. This section discusses the outcomes presented in the previous section.

6.1 Lack of Label Regularization Methods

Achieving the best possible results with a neural network is always desired, and regularization methods should be encouraged in most cases, as long as they do not break the semantics of the dataset. For multi-class classification, the label level can be considered a safe place to intervene, regardless of the application domain: the output is usually one-hot encoded, so there are few possibilities of harming or losing semantics.

This work introduces a label-level regularization method for convolutional neural networks. Comparing it with other algorithms is fair and easy because we follow the same evaluation protocol. However, more label-level techniques besides TSLA would be desirable to allow a better and more direct comparison.

6.2 Lack of Comparison for General-Purpose Applications

There might be a bias toward creating deep learning regularization algorithms only for the image classification task. We could find several regularization methods [5, 18, 30, 31] for comparison purposes in that context; however, we found only one for a direct comparison in the context of image super-resolution [29]. Besides, as far as we are aware, no other in-depth study has compared regularization techniques for image reconstruction.

The scarcity of works comparing regularization techniques on problems other than image classification is worrying. Indeed, we found some works [17, 29] that also point out this lack of research on regularization algorithms for a broader set of problems. We encourage researchers to develop new methods for other image processing problems, for it might be a promising area of research.

7 Conclusions and Future Works

We presented RLS, a technique for label-level regularization of convolutional neural networks. Our results demonstrate that it can outperform other techniques when applied to different image processing problems. As such, we tackle not only the enhancement of neural networks but also the problem of generalizing regularization algorithms, as pointed out by [17]. RLS can be combined with other techniques and used within any backbone.

In future works, we intend to apply RLS to other problems, such as natural language processing. Another goal is to check whether other random distributions can improve our current results.