
1 Introduction

Deep learning has enabled impressive performance increases across a range of computer vision tasks. However, this performance improvement is largely dependent upon the size and variation of labeled training datasets that are available for a chosen task. For some tasks, benchmark datasets contain millions of hand-labeled images for the supervised training of deep neural networks (DNNs) [1, 2]. Ideally, we could compile a large, comprehensive training set that is representative of all domains and is labelled for all visual tasks. However, it is expensive and time-consuming to both collect and label large amounts of training data, especially for more complex tasks like detection or pixelwise segmentation [40]. Furthermore, it is practically impossible to gather a single real dataset that captures all of the variability that exists in the real world.

Two promising methods have been proposed to overcome the limitations of real data collection: graphics rendering engines and image augmentation pipelines. These approaches enable increased variability of scene features across an image set without requiring any additional manual data annotation. Recent work in rendering datasets has shown success in training DNNs with large amounts of highly photorealistic synthetic data and testing on real data [17], [?]. Pixel-wise labels for synthetic images can be generated automatically by rendering engines, greatly reducing the cost and effort it takes to create ground truth for different tasks. Recent work on image augmentation has focused on modeling environmental effects such as scene lighting, time of day, scene background, weather, and occlusions in training images as a way to increase the representation of these visual factors in training sets, thereby increasing robustness to these cases at test time [5, 6]. Another proposed augmentation approach is to increase the occurrence of objects of interest (such as cars or pedestrians) in images in order to provide more training examples of those objects in different scenes and spatial configurations [4, 7].

Fig. 1.

Examples of object detection tested on KITTI for baseline unaugmented data (left) and for our proposed method (right). Blue boxes show correct detections; red boxes show detections missed by the baseline method but detected by our proposed approach for sensor-based image augmentation. (Color figure online)

However, even with varying spatial geometry and environmental factors in an image scene, there remain challenges to achieving robust task performance when transferring trained networks between synthetic and real image domains. To further understand the gaps between synthetic and real datasets, it is worthwhile to consider the failure modes of DNNs in visual learning tasks. One factor that has been shown to contribute to degraded performance and cross-dataset generalization for various benchmark datasets is sensor bias [8,9,10,11]. The interaction between the camera model and lighting in the environment can greatly influence the pixel-level artifacts, distortions, and dynamic range induced in each image [12,13,14]. Sensor effects, such as blur and overexposure, have been shown to decrease performance of object detection networks in urban driving scenes [15]. Examples of failure modes caused by overexposure, manifesting as missed detections, are shown in Fig. 1. However, the literature has yet to examine how to mitigate failure modes due to sensor effects for learned visual tasks in the wild.

In this work, we propose a novel framework for augmenting synthetic data with realistic sensor effects – effectively randomizing the sensor domain for synthetic images. Our augmentation pipeline is based on sensor effects that occur in image formation and processing that can lead to loss of information and produce failure modes in learning frameworks – chromatic aberration, blur, exposure, noise and color cast. We show that our proposed method improves performance for object detection in urban driving scenes when trained on synthetic data and tested on real data, an example of which is shown in Fig. 1. Our results demonstrate that sensor effects present in real images are important to consider for bridging the domain gap between real and simulated environments.

This paper is organized as follows: Sect. 2 presents related background work; Sect. 3 details the proposed image augmentation pipeline; Sect. 4 describes experiments and discusses results of these experiments and Sect. 5 concludes the paper. Code for this paper can be found at https://github.com/alexacarlson/SensorEffectAugmentation.

2 Related Work

Domain Randomization with Synthetic Data: Rendering and gaming engines have been used to synthesize large, labelled datasets that contain a wide variety of environmental factors that could not be feasibly captured during real data collection [3, 16]. Such factors include time of day, weather, and community architecture. Improvements to rendering engines have focused on matching the photorealism of the generated data to real images, which comes at a huge computational cost. Recent work on domain randomization seeks to bridge the reality gap by generating synthetic data with sufficient random variation over scene factors and rendering parameters such that the real data falls into this range of variation, even if rendered data does not appear photorealistic. Tobin et al. [27] focus on the task of object localization trained with synthetic data. They perform domain randomization over textures, occlusion levels, scene lighting, camera field of view, and uniform noise within the rendering engine, but their experiments are limited to highly simplistic toy scenes. Building on [27], Tremblay et al. [?] generate a synthetic dataset via domain randomization for object detection of real urban driving scenes. They randomize camera viewpoint, light source, object properties, and introduce flying distractors. Our work focuses on image augmentation outside of the rendering pipeline and could be applied in addition to domain randomization in the renderer.

Augmentation with Synthetic Data: Shrivastava et al. recently developed SimGAN, a generative adversarial network (GAN) to augment synthetic data to appear more realistic. They evaluated their method on the tasks of gaze estimation and hand pose estimation [19]. Similarly, Sixt et al. proposed RenderGAN, a generative network that uses structured augmentation functions to augment synthetic images of markers attached to honeybees [20]. The augmented images are used to train a detection network to track the honeybees. Both of these approaches focus on image sets that are homogeneously structured and low resolution. We instead focus on the application of autonomous driving, which features highly varied, complex scenes and environmental conditions.

Traditional Augmentation Techniques: Standard geometric augmentations, such as rotation, translation, and mirroring, have become commonplace in deep learning for achieving invariance to spatial factors that are not relevant to the given task [24]. Photometric augmentations aim to increase robustness to differing illumination color and intensity in a scene. These augmentations induce small changes in pixel intensities that do not produce loss of information in the image. A well-known example is the PCA-based color shift introduced by Krizhevsky et al. [1] to perform more realistic RGB color jittering. In contrast, our augmentations are modeled directly from real sensor effects and can induce large changes in the input data that mimic the loss of information that occurs in real data.

Sensor Effects in Learning: More generally, recent work has demonstrated that elements of the image formation and processing pipeline can have a large impact on learned representations [10, 28, 29]. Andreopoulos and Tsotsos demonstrate the sensitivities of popular vision algorithms under variable illumination, shutter speed, and gain [8]. Doersch et al. show that dataset bias is introduced by chromatic aberration in visual context prediction and object recognition tasks [11]; they correct for chromatic aberration to eliminate this bias. Diamond et al. demonstrate that blur and noise degrade neural network performance on classification tasks [29]; they propose an end-to-end denoising and deblurring neural network framework that operates directly on raw image data. Rather than correcting for the effects of the camera during image formation of real images, we propose to augment synthetic images to simulate these effects. As many of these effects can lead to loss of information, correcting for them is non-trivial and may result in the hallucination of visual information in the restored image.

3 Sensor-Based Image Augmentation

Figure 2 shows a side-by-side comparison of two real benchmark vehicle datasets, KITTI [38, 39] and Cityscapes [40], and two synthetic datasets, Virtual KITTI [16] and Grand Theft Auto [17, 41]. Both of the real datasets share many spatial and environmental visual features: both are captured during similar times of day, in similar weather conditions, and in cities regionally close together, with the camera located on a car pointing at the road. In spite of these similarities, images from these datasets are visibly different, which suggests that the two real datasets differ in their global pixel statistics. Qualitatively, KITTI images feature more pronounced effects due to blur and over-exposure, while Cityscapes has a distinct color cast compared to KITTI. Synthetic datasets such as Virtual KITTI and GTA have many spatial similarities with real benchmark datasets, but are still visually distinct from real data. Our work aims to close the gap between real and synthetic data by modelling the sensor effects that can cause such distinct visual differences between real-world datasets.

Figure 3 shows the architecture of the proposed sensor-based image augmentation pipeline. We consider a general camera framework, which transforms radiant light captured from the environment into an image [30]. Several stages comprise the process of image formation and post-processing, as shown in the first row of Fig. 3. The incoming light is first focused by the camera lens to be incident upon the camera sensor. The camera sensor then transforms the incident light into RGB pixel intensities. On-board camera software manipulates the image (e.g., color space conversion and dynamic range compression) to produce the final output image. At each stage of the image formation pipeline, loss of information can occur and degrade the image: lens effects can introduce visual distortions such as chromatic aberration and blur; sensor effects can introduce over- or under-saturation, depending on exposure, as well as high-frequency pixel artifacts characteristic of sensor noise; and post-processing effects shift the color cast to create a desirable output.

Our image augmentation pipeline therefore focuses on five sensor effect augmentations that model the loss of information that can occur at each stage of image formation and post-processing: chromatic aberration, blur, exposure, noise, and color shift. To model how these effects manifest in images from a camera, we implement the image processing pipeline as a composition of physically-based augmentation functions across these five effects, where lens effects are applied first, then sensor effects, and finally post-processing effects:

$$\begin{aligned} I_{aug.} = \phi _{color}(\phi _{noise}(\phi _{exposure}(\phi _{blur}(\phi _{chrom.ab.}(I))))) \end{aligned}$$
(1)

Note that these chosen augmentation functions are not exhaustive, and are meant to approximate the camera image formation pipeline. Each augmentation function is described in detail in the following subsections.
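As a concrete illustration of Eq. (1), the minimal sketch below (in Python with NumPy) composes the five augmentation functions in pipeline order. The apply_* function names and the params dictionary are hypothetical interfaces, not the released implementation; possible forms of each helper are sketched in the following subsections.

```python
import numpy as np

def augment(image: np.ndarray, params: dict) -> np.ndarray:
    """Sketch of Eq. (1): compose lens, sensor, and post-processing effects.

    `image` is an RGB array of shape (H, W, 3); the apply_* helpers are
    placeholders for the per-effect augmentations in Sects. 3.1-3.5.
    """
    out = apply_chromatic_aberration(image, **params["chrom_ab"])  # lens
    out = apply_blur(out, **params["blur"])                        # lens
    out = apply_exposure(out, **params["exposure"])                # sensor
    out = apply_noise(out, **params["noise"])                      # sensor
    out = apply_color_shift(out, **params["color"])                # post-processing
    return out
```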

Fig. 2.

A comparison of images from the KITTI Benchmark dataset (upper left), Cityscapes dataset (upper right), Virtual KITTI (lower left) and Grand Theft Auto (lower right). Note that each dataset has differing color cast, brightness, and detail.

Fig. 3.

A schematic of the image formation and processing pipeline used in this work. A given image undergoes augmentations that approximate the same pixel-level effects that a camera would cause in an image.

3.1 Chromatic Aberration

Chromatic aberration is a lens effect that causes color distortions, or fringes, along edges that separate dark and light regions within an image. There are two types of chromatic aberration, longitudinal and lateral, both of which can be modeled by geometrically warping the color channels with respect to one another [31]. Longitudinal chromatic aberration occurs when different wavelengths of light converge on different points along the optical axis, effectively magnifying the RGB channels relative to one another. We model this aberration type by scaling the green color channel of an image by a value S. Lateral chromatic aberration occurs when different wavelengths of light converge to different points within the image plane. We model this by applying translations \((t_{x},t_{y})\) to each of the color channels of an image. We combine these two effects into the following affine transformation, which is applied to each (x, y) pixel location in a given color channel C of the image:

$$\begin{aligned} \begin{bmatrix} x^{\tiny {chrom.ab.}}_{C} \\ y^{\tiny {chrom.ab.}}_{C} \\ 1 \end{bmatrix} = \begin{bmatrix} S&0&t_{x} \\ 0&S&t_{y} \\ 0&0&1 \end{bmatrix} \begin{bmatrix} x_{C} \\ y_{C} \\ 1 \end{bmatrix} \end{aligned}$$
(2)
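A minimal sketch of this channel-wise affine warp, assuming an RGB image and OpenCV's warpAffine; the default parameter values are illustrative only, and in practice the translations could be sampled independently per channel.

```python
import cv2
import numpy as np

def apply_chromatic_aberration(img, S=1.002, tx=1.0, ty=1.0):
    """Warp each color channel with the affine map of Eq. (2).

    The scaling S is applied to the green channel only (longitudinal
    aberration); the translations (tx, ty) model lateral aberration.
    """
    h, w = img.shape[:2]
    out = np.empty_like(img)
    for c in range(img.shape[-1]):
        scale = S if c == 1 else 1.0  # green channel in RGB ordering
        M = np.float32([[scale, 0.0, tx],
                        [0.0, scale, ty]])
        out[..., c] = cv2.warpAffine(img[..., c], M, (w, h))
    return out
```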

3.2 Blur

While there are several types of blur that occur in image-based datasets, we focus on out-of-focus blur, which can be modeled using a Gaussian filter [33]:

$$\begin{aligned} G = \frac{1}{2\pi \sigma ^2}e^{-\frac{x^2+y^2}{2\sigma ^2}} \end{aligned}$$
(3)

where x and y are spatial coordinates of the filter and \(\sigma \) is the standard deviation. The output image is given by:

$$\begin{aligned} I_{blur} = I*G \end{aligned}$$
(4)
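A minimal sketch of this out-of-focus blur, assuming SciPy's Gaussian filter applied per channel; the default sigma is illustrative and would be drawn from the sampled parameter range in practice.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def apply_blur(img, sigma=1.5):
    """Out-of-focus blur: convolve each channel with the Gaussian of Eq. (3)."""
    channels = [gaussian_filter(img[..., c].astype(np.float32), sigma=sigma)
                for c in range(img.shape[-1])]
    return np.stack(channels, axis=-1)
```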

3.3 Exposure

To model exposure, we use the exposure density function developed in [34, 35]:

$$\begin{aligned} I = f(S) = \frac{255}{1 + e^{-A \times S}} \end{aligned}$$
(5)

where I is image intensity, S indicates incoming light intensity, or exposure, and A is a constant value for contrast. We use this model to re-expose an image as follows:

$$\begin{aligned} S' = f^{-1}(I) + \varDelta S \end{aligned}$$
(6)
$$\begin{aligned} I_{exp} = f(S') \end{aligned}$$
(7)

We vary \(\varDelta S\) to model changing exposure: a positive \(\varDelta S\) increases the exposure, which can lead to over-saturation, while a negative \(\varDelta S\) decreases it.
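A minimal sketch of this re-exposure step, following Eqs. (5)-(7) directly; the contrast constant A and the default \(\varDelta S\) value are illustrative assumptions, and intensities are clipped away from 0 and 255 so the inverse sigmoid is defined.

```python
import numpy as np

def apply_exposure(img, delta_S=0.3, A=0.85, eps=1e-6):
    """Re-expose an image with the sigmoid model of Eqs. (5)-(7).

    Positive delta_S brightens the image (possible over-saturation);
    negative delta_S darkens it.
    """
    I = np.clip(img.astype(np.float32), eps, 255.0 - eps)
    S = -np.log(255.0 / I - 1.0) / A                    # S = f^{-1}(I)
    return 255.0 / (1.0 + np.exp(-A * (S + delta_S)))   # f(S + delta_S)
```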

3.4 Noise

The sources of image noise caused by elements of the sensor array can be modeled as either signal-dependent or signal-independent noise. Therefore, we use the Poisson-Gaussian noise model proposed in [14]:

$$\begin{aligned} I_{noise}(x,y)=I(x,y)+\eta _{poiss}(I(x,y))+\eta _{gauss} \end{aligned}$$
(8)

where \(I(x,y)\) is the ground truth image at pixel location (x, y), \(\eta _{poiss}\) is the signal-dependent Poisson noise, and \(\eta _{gauss}\) is the signal-independent Gaussian noise. We sample the noise for each pixel based upon its location in a GBRG Bayer grid array, assuming bilinear interpolation as the demosaicing function.
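A minimal sketch of the Poisson-Gaussian model of Eq. (8), using the common heteroscedastic Gaussian approximation for the signal-dependent term; the Bayer-pattern sampling and demosaicing step described above is omitted for brevity, and the noise parameters a and b are illustrative.

```python
import numpy as np

def apply_noise(img, a=0.01, b=2.0, rng=None):
    """Additive Poisson-Gaussian noise (Eq. 8), without the Bayer/demosaic step.

    The signal-dependent term is approximated by zero-mean Gaussian noise with
    variance a * I(x, y); the signal-independent term has standard deviation b.
    """
    rng = rng or np.random.default_rng()
    I = img.astype(np.float32)
    poiss = rng.normal(scale=np.sqrt(np.maximum(a * I, 0.0)))
    gauss = rng.normal(scale=b, size=I.shape)
    return np.clip(I + poiss + gauss, 0.0, 255.0)
```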

3.5 Post-processing

In standard camera pipelines, post-processing techniques, such as white balancing or gamma transformation, are nonlinear color corrections performed on the image to compensate for the presence of different environmental illuminants. These post-processing methods are generally proprietary and cannot be easily characterized [12]. We model these effects by performing translations in the CIELAB color space, also known as L*a*b* space, to remap the image tonality to a different range [36, 37]. Given that our chosen datasets are all taken outdoors during the day, we assume a D65 illuminant in our L*a*b* color space conversion.
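A minimal sketch of this post-processing color shift, assuming scikit-image's CIELAB conversion (which uses a D65 reference white by default); the channel offsets here are illustrative and would be randomly sampled per image in practice.

```python
import numpy as np
from skimage import color

def apply_color_shift(img, dL=0.0, da=5.0, db=-5.0):
    """Shift the color cast by translating the image in L*a*b* space."""
    lab = color.rgb2lab(img.astype(np.float32) / 255.0)
    lab[..., 0] = np.clip(lab[..., 0] + dL, 0.0, 100.0)   # lightness
    lab[..., 1] += da                                     # green-red axis
    lab[..., 2] += db                                     # blue-yellow axis
    rgb = color.lab2rgb(lab)                              # back to RGB in [0, 1]
    return np.clip(rgb * 255.0, 0.0, 255.0).astype(img.dtype)
```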

3.6 Generating Augmented Training Data

The bounds on the sensor effect parameter regimes were chosen experimentally; the parameter selection process is discussed in more detail in Sect. 4. To augment an image, we first randomly sample from these visually realistic parameter ranges. Both the chosen parameters and the unaugmented image are then input to the augmentation pipeline, which outputs the image augmented with the camera effects determined by the chosen parameters. We augmented each image multiple times with different sets of randomly sampled parameters; note that this augmentation method serves as a pre-processing step. Figure 4 shows sample images augmented with individual sensor effects as well as our full proposed sensor-based image augmentation pipeline. We use the original image labels as the labels for the augmented data: although sensor effects such as chromatic aberration and blur perturb pixel values near object boundaries, the underlying scene content is unchanged, so training against the original labels encourages the network to make robust and accurate predictions in the presence of camera effects.
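The sketch below illustrates the per-image random sampling step; the parameter bounds shown are placeholders rather than the experimentally chosen ranges from Sect. 4, and the augment function is the composition sketched in Sect. 3.

```python
import numpy as np

# Placeholder parameter bounds; the actual ranges were chosen experimentally
# (see Sect. 4) and are not reproduced here.
PARAM_RANGES = {
    "chrom_ab": {"S": (1.0, 1.01), "tx": (-2.0, 2.0), "ty": (-2.0, 2.0)},
    "blur":     {"sigma": (0.0, 3.0)},
    "exposure": {"delta_S": (-1.0, 1.0)},
    "noise":    {"a": (0.0, 0.02), "b": (0.0, 4.0)},
    "color":    {"dL": (-5.0, 5.0), "da": (-10.0, 10.0), "db": (-10.0, 10.0)},
}

def sample_params(rng=None):
    """Draw one random set of sensor-effect parameters for a single image."""
    rng = rng or np.random.default_rng()
    return {effect: {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
            for effect, bounds in PARAM_RANGES.items()}

# Usage: augmented_image = augment(image, sample_params())
# The original labels of `image` are reused for `augmented_image`.
```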

Fig. 4.

Example augmentations of GTA (left column) and VKITTI (right column) using the proposed sensor effect augmentation pipeline. Each image has a randomly sampled level of blur, chromatic aberration, exposure, sensor noise, and color temperature shift applied to it in an effort to model the visual structure/information loss caused by cameras when capturing real images.

4 Experiments

We evaluate the proposed sensor-based image augmentation pipeline on the task of object detection on benchmark vehicle datasets to assess its effectiveness at bridging the synthetic-to-real domain gap. We apply our image augmentation pipeline to two benchmark synthetic vehicle datasets, each of which was rendered with a different level of photorealism. The first, Virtual KITTI (VKITTI) [16], features over 21000 images and is designed to model the spatial layout of KITTI with varying environmental factors such as weather and time of day. The second is Grand Theft Auto (GTA) [17, 41], which features 21000 images and is noted for its high quality and increased photorealism compared to VKITTI.

To evaluate the proposed augmentation method for 2D object detection, we used Faster R-CNN as our base network [42]. Faster R-CNN achieves relatively high performance on the KITTI benchmark test dataset, and many state-of-the-art object detection networks that improve upon these results use Faster R-CNN as their base architecture. For all experiments, we apply the sensor effect augmentation pipeline to all images in the given dataset, then train an object detection network on the combination of the original unaugmented data and the sensor effect augmented data. We ran experiments to determine the number of sensor effect augmentations per image, and found that optimal performance was achieved by augmenting each image in each dataset once. To determine the bounds of the sensor effect parameter ranges from which to sample, we augmented small datasets of 2975 images by randomly sampling from increasingly larger parameter bounds and chose the ranges for each sensor effect that yielded the highest performance as well as visually realistic images. We found that the same parameter regime yielded optimal performance for both synthetic datasets.

All of the trained networks are tested on a held-out validation set of 1480 images from the KITTI training data, and we report the Pascal VOC \(AP50_{bbox}\) value for the car class. We also report the gain in \(AP50_{bbox}\), which is the difference in performance relative to the baseline (unaugmented) dataset. We compare the performance of object detection networks trained on sensor-effect augmented data to object detection networks trained on unaugmented data as our baseline. For each dataset, we trained each Faster R-CNN network for 10 epochs using four Titan X Pascal GPUs in order to control for potential confounds between performance and training time.

4.1 Performance on Baseline Object Detection Benchmarks

Table 1 shows results for Faster R-CNN networks trained on unaugmented synthetic data and sensor-effect augmented data for both VKITTI and GTA. Note that we provide experiments trained on the full training datasets, as well as experiments trained on subsets of 2975 images to allow comparison of performance across differently sized datasets. Synthetic data augmented with the proposed method yields significant performance gains over the baseline (unaugmented) synthetic datasets. This is expected, as rendering engines generally do not model sensor effects such as noise, blur, and chromatic aberration as accurately as our proposed approach. Another important result for the synthetic datasets (both VKITTI and GTA) is that, by leveraging our approach, we are able to outperform the networks trained on over 20000 unaugmented images using a subset of only 2975 augmented images. This means not only that networks can be trained faster, but also that, when training with synthetic data, varying camera effects can outweigh the value of simply generating more data with varied spatial features. The VKITTI baseline dataset tested on KITTI performs relatively well compared to GTA, even though GTA is a more photorealistic dataset. This can most likely be attributed to the similarity in spatial layout and image features between VKITTI and KITTI. With our proposed approach, VKITTI gives comparable performance to the network trained on the Cityscapes baseline, showing that synthetic data augmented with our proposed sensor-based image pipeline can perform comparably to real data for cross-dataset generalization.

Table 1. Object detection trained on synthetic data, tested on KITTI

4.2 Comparison to Other Augmentation Techniques

We ran experiments to compare our proposed method to photometric augmentation (specifically, the PCA-based color shift of [1]), complex spatial/geometric augmentation (elastic deformation [47]), standard additive Gaussian noise augmentation, and a suite of standard spatial augmentations (random rotations, scaling, translations, and cropping). We provide the results of training Faster R-CNN networks on the full VKITTI and GTA datasets augmented with the above methods in Table 2. All networks were tested on the same held-out set of KITTI images used in the previous object detection experiments. Our results show that our proposed method drastically outperforms the other standard augmentation techniques, and that for certain synthetic data, spatial augmentations actually decrease performance on real data. This suggests that the proposed sensor effect augmentations capture more salient visual structure than traditional, non-photorealistic augmentation methods. We hypothesize this is because the physically-based sensor augmentations better model the information loss, and the resulting global pixel statistics, that occur in real images. For example, our proposed method alters the color cast of an image with a CIELAB-space transformation, whereas traditional approaches operate in RGB space. Because CIELAB space is device independent, this results in a more accurate, physically-based augmentation than [1].

Table 2. We provide the results of training Faster-RCNN networks on GTA and Virtual KITTI augmented with various augmentation methods. All networks were tested on KITTI.

4.3 Ablation Study

To evaluate the contribution of each sensor effect augmentation to performance, we used the proposed pipeline to generate datasets with only one type of sensor effect augmentation. We trained Faster R-CNN on each of these single-effect augmented datasets; the results are given in Table 3. Performance increases across all ablation experiments for training on synthetic data. This further validates our hypothesis that each of the sensor effects is important for closing the gap between synthetic and real data.

Table 3. Ablation study for object detection trained on synthetic data, tested on KITTI
Fig. 5.

Virtual KITTI examples are in the left column, GTA examples are in the right column. Blue boxes show correct detections; red boxes show detections missed by the FasterRCNN network trained on baseline, unaugmented image datasets but detected by FasterRCNNs trained on data augmented using our proposed approach for sensor-based image augmentation. (Color figure online)

4.4 Failure Mode Analysis

Figure 5 shows qualitative failure modes of Faster R-CNN trained on each synthetic training dataset and tested on KITTI, where blue bounding boxes indicate correct detections and red bounding boxes indicate detections missed by the baseline but correctly detected by our proposed method. Qualitatively, our method more reliably detects cars that appear small in the image, particularly in the far background, at a scale where the pixel statistics of the image are more pronounced. Note that our method also improves car detection in cases where the image is over-saturated due to increased exposure, an effect we directly model in our proposed augmentation pipeline. Additionally, our method produces improved detections under other effects that obscure the presence of a car, such as occlusion and shadows, even though we do not directly model these effects. This may be attributed to increased robustness to effects that lead to loss of visual information about an object in general.

5 Conclusions

We have proposed a novel sensor-based image augmentation pipeline for augmenting synthetic training data input to DNNs for the task of object detection in real urban driving scenes. Our augmentation pipeline models a range of physically-realistic sensor effects that occur throughout the image formation and post-processing pipeline. These effects were chosen as they lead to loss of information or distortion of a scene, which degrades network performance on learned visual tasks. By training on our augmented datasets, we can effectively increase dataset size and variation in the sensor domain, without the need for further labeling, in order to improve robustness and generalizability of resulting object detection networks. We achieve significantly improved performance across a range of benchmark synthetic vehicle datasets, independent of the level of photorealism. Overall, our results reveal insight into the importance of modeling sensor effects for the specific problem of training on synthetic data and testing on real data.