
1 Introduction

Deep learning has enabled impressive performance increases across a range of computer vision tasks. However, this performance improvement is largely dependent upon the size and variation of labeled training datasets that are available for a chosen task. For some tasks, benchmark datasets contain millions of hand-labeled images for the supervised training of deep neural networks (DNNs) [1, 2]. Ideally, we could compile a large, comprehensive training set that is representative of all domains and is labelled for all visual tasks. However, it is expensive and time-consuming to both collect and label large amounts of training data, especially for more complex tasks like detection or pixelwise segmentation [40]. Furthermore, it is practically impossible to gather a single real dataset that captures all of the variability that exists in the real world.

Two promising methods have been proposed to overcome the limitations of real data collection: graphics rendering engines and image augmentation pipelines. These approaches enable increased variability of scene features across an image set without requiring any additional manual data annotation. Recent work in rendering datasets has shown success in training DNNs with large amounts of highly photorealistic synthetic data and testing on real data [17], [?]. Pixel-wise labels for synthetic images can be generated automatically by rendering engines, greatly reducing the cost and effort it takes to create ground truth for different tasks. Recent work on image augmentation has focused on modeling environmental effects such as scene lighting, time of day, scene background, weather, and occlusions in training images as a way to increase the representation of these visual factors in training sets, thereby increasing robustness to these cases at test time [5, 6]. Another proposed augmentation approach is to increase the occurrence of objects of interest (such as cars or pedestrians) in images in order to provide more training examples of those objects in different scenes and spatial configurations [4, 7].

Fig. 1.

Examples of object detection tested on KITTI for baseline unaugmented data (left) and for our proposed method (right). Blue boxes show correct detections; red boxes show detections missed by the baseline method but detected by our proposed approach for sensor-based image augmentation. (Color figure online)

However, even with varying spatial geometry and environmental factors in an image scene, there remain challenges to achieving robust task performance when transferring trained networks between synthetic and real image domains. To further understand the gaps between synthetic and real datasets, it is worthwhile to consider the failure modes of DNNs in visual learning tasks. One factor that has been shown to contribute to degraded performance and cross-dataset generalization for various benchmark datasets is sensor bias [8,9,10,11]. The interaction between the camera model and lighting in the environment can greatly influence the pixel-level artifacts, distortions, and dynamic range induced in each image [12,13,14]. Sensor effects, such as blur and overexposure, have been shown to decrease performance of object detection networks in urban driving scenes [15]. Examples of failure modes caused by overexposure, manifesting as missed detections, are shown in Fig. 1. However, the literature has yet to examine how to mitigate failure modes due to sensor effects for learned visual tasks in the wild.

In this work, we propose a novel framework for augmenting synthetic data with realistic sensor effects – effectively randomizing the sensor domain for synthetic images. Our augmentation pipeline is based on sensor effects that occur in image formation and processing that can lead to loss of information and produce failure modes in learning frameworks – chromatic aberration, blur, exposure, noise and color cast. We show that our proposed method improves performance for object detection in urban driving scenes when trained on synthetic data and tested on real data, an example of which is shown in Fig. 1. Our results demonstrate that sensor effects present in real images are important to consider for bridging the domain gap between real and simulated environments.

This paper is organized as follows: Sect. 2 presents related background work; Sect. 3 details the proposed image augmentation pipeline; Sect. 4 describes experiments and discusses results of these experiments and Sect. 5 concludes the paper. Code for this paper can be found at https://github.com/alexacarlson/SensorEffectAugmentation.

2 Related Work

Domain Randomization with Synthetic Data: Rendering and gaming engines have been used to synthesize large, labelled datasets that contain a wide variety of environmental factors that could not be feasibly captured during real data collection [3, 16]. Such factors include time of day, weather, and community architecture. Improvements to rendering engines have focused on matching the photorealism of the generated data to real images, which comes at a huge computational cost. Recent work on domain randomization seeks to bridge the reality gap by generating synthetic data with sufficient random variation over scene factors and rendering parameters such that the real data falls into this range of variation, even if rendered data does not appear photorealistic. Tobin et al. [27] focus on the task of object localization trained with synthetic data. They perform domain randomization over textures, occlusion levels, scene lighting, camera field of view, and uniform noise within the rendering engine, but their experiments are limited to highly simplistic toy scenes. Building on [27], Tremblay et al. [?] generate a synthetic dataset via domain randomization for object detection of real urban driving scenes. They randomize camera viewpoint, light source, object properties, and introduce flying distractors. Our work focuses on image augmentation outside of the rendering pipeline and could be applied in addition to domain randomization in the renderer.

Augmentation with Synthetic Data: Shrivastava et al. recently developed SimGAN, a generative adversarial network (GAN) to augment synthetic data to appear more realistic. They evaluated their method on the tasks of gaze estimation and hand pose estimation [19]. Similarly, Sixt et al. proposed RenderGAN, a generative network that uses structured augmentation functions to augment synthetic images of markers attached to honeybees [20]. The augmented images are used to train a detection network to track the honeybees. Both of these approaches focus on image sets that are homogeneously structured and low resolution. We instead focus on the application of autonomous driving, which features highly varied, complex scenes and environmental conditions.

Traditional Augmentation Techniques: Standard geometric augmentations, such as rotation, translation, and mirroring, have become commonplace in deep learning for achieving invariance to spatial factors that are not relevant to the given task [24]. Photometric augmentations aim to increase robustness to differing illumination color and intensity in a scene. These augmentations induce small changes in pixel intensities that do not produce loss of information in the image. A well-known example is the PCA-based color shift introduced by Krizhevsky et al. [1] to perform more realistic RGB color jittering. In contrast, our augmentations are modeled directly from real sensor effects and can induce large changes in the input data that mimic the loss of information that occurs in real data.

Sensor Effects in Learning: More generally, recent work has demonstrated that elements of the image formation and processing pipeline can have a large impact on learned representations [10, 28, 29]. Andreopoulos and Tsotsos demonstrate the sensitivities of popular vision algorithms under variable illumination, shutter speed, and gain [8]. Doersch et al. show that dataset bias is introduced by chromatic aberration in visual context prediction and object recognition tasks [11]; they correct for chromatic aberration to eliminate this bias. Diamond et al. demonstrate that blur and noise degrade neural network performance on classification tasks [29]; they propose an end-to-end denoising and deblurring neural network framework that operates directly on raw image data. Rather than correcting for the effects of the camera during image formation of real images, we propose to augment synthetic images to simulate these effects. As many of these effects can lead to loss of information, correcting for them is non-trivial and may result in the hallucination of visual information in the restored image.

3 Sensor-Based Image Augmentation

Figure 2 shows a side-by-side comparison of two real benchmark vehicle datasets, KITTI [38, 39] and Cityscapes [40], and two synthetic datasets, Virtual KITTI [16] and Grand Theft Auto [17, 41]. Both of the real datasets share many spatial and environmental visual features: both are captured during similar times of day, in similar weather conditions, and in cities regionally close together, with the camera located on a car pointing at the road. In spite of these similarities, images from these datasets are visibly different, which suggests that the two real datasets differ in their global pixel statistics. Qualitatively, KITTI images feature more pronounced effects due to blur and over-exposure, while Cityscapes has a distinct color cast compared to KITTI. Synthetic datasets such as Virtual KITTI and GTA have many spatial similarities with real benchmark datasets, but are still visually distinct from real data. Our work aims to close the gap between real and synthetic data by modelling the sensor effects that can cause such distinct visual differences between real-world datasets.

Figure 3 shows the architecture of the proposed sensor-based image augmentation pipeline. We consider a general camera framework, which transforms radiant light captured from the environment into an image [30]. Several stages comprise the process of image formation and post-processing, as shown in the first row of Fig. 3. The incoming light is first focused by the camera lens to be incident upon the camera sensor. The camera sensor then transforms the incident light into RGB pixel intensities. On-board camera software manipulates the image (e.g., color space conversion and dynamic range compression) to produce the final output image. At each stage of the image formation pipeline, loss of information can occur and degrade the image: lens effects can introduce visual distortions such as chromatic aberration and blur; sensor effects can introduce over- or under-saturation, depending on exposure, as well as high-frequency pixel artifacts characteristic of sensor noise; and post-processing effects shift the color cast to create a desirable output.

Our image augmentation pipeline therefore focuses on five sensor effect augmentations that model the loss of information that can occur at each stage of image formation and post-processing: chromatic aberration, blur, exposure, noise, and color shift. To model how these effects manifest in images from a camera, we implement the image processing pipeline as a composition of physically-based augmentation functions across these five effects, where lens effects are applied first, then sensor effects, and finally post-processing effects:

$$\begin{aligned} I_{aug.} = \phi _{color}(\phi _{noise}(\phi _{exposure}(\phi _{blur}(\phi _{chrom.ab.}(I))))) \end{aligned}$$
(1)

Note that these chosen augmentation functions are not exhaustive, and are meant to approximate the camera image formation pipeline. Each augmentation function is described in detail in the following subsections.
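As a concrete illustration of Eq. (1), the minimal sketch below (in Python with NumPy) composes the five augmentation functions in pipeline order. The apply_* function names and the params dictionary are hypothetical interfaces, not the released implementation; possible forms of each helper are sketched in the following subsections.

```python
import numpy as np

def augment(image: np.ndarray, params: dict) -> np.ndarray:
    """Sketch of Eq. (1): compose lens, sensor, and post-processing effects.

    `image` is an RGB array of shape (H, W, 3); the apply_* helpers are
    placeholders for the per-effect augmentations in Sects. 3.1-3.5.
    """
    out = apply_chromatic_aberration(image, **params["chrom_ab"])  # lens
    out = apply_blur(out, **params["blur"])                        # lens
    out = apply_exposure(out, **params["exposure"])                # sensor
    out = apply_noise(out, **params["noise"])                      # sensor
    out = apply_color_shift(out, **params["color"])                # post-processing
    return out
```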

Fig. 2.

A comparison of images from the KITTI Benchmark dataset (upper left), Cityscapes dataset (upper right), Virtual KITTI (lower left) and Grand Theft Auto (lower right). Note that each dataset has differing color cast, brightness, and detail.

Fig. 3.

A schematic of the image formation and processing pipeline used in this work. A given image undergoes augmentations that approximate the same pixel-level effects that a camera would cause in an image.

3.1 Chromatic Aberration

Chromatic aberration is a lens effect that causes color distortions, or fringes, along edges that separate dark and light regions within an image. There are two types of chromatic aberration, longitudinal and lateral, both of which can be modeled by geometrically warping the color channels with respect to one another [31]. Longitudinal chromatic aberration occurs when different wavelengths of light converge on different points along the optical axis, effectively magnifying the RGB channels relative to one another. We model this aberration type by scaling the green color channel of an image by a value S. Lateral chromatic aberration occurs when different wavelengths of light converge to different points within the image plane. We model this by applying translations \((t_{x},t_{y})\) to each of the color channels of an image. We combine these two effects into the following affine transformation, which is applied to each (x, y) pixel location in a given color channel C of the image:

$$\begin{aligned} \begin{bmatrix} x^{\tiny {chrom.ab.}}_{C} \\ y^{\tiny {chrom.ab.}}_{C} \\ 1 \end{bmatrix} = \begin{bmatrix} S&0&t_{x} \\ 0&S&t_{y} \\ 0&0&1 \end{bmatrix} \begin{bmatrix} x_{C} \\ y_{C} \\ 1 \end{bmatrix} \end{aligned}$$
(2)
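A minimal sketch of this channel-wise affine warp, assuming an RGB image and OpenCV's warpAffine; the default parameter values are illustrative only, and in practice the translations could be sampled independently per channel.

```python
import cv2
import numpy as np

def apply_chromatic_aberration(img, S=1.002, tx=1.0, ty=1.0):
    """Warp each color channel with the affine map of Eq. (2).

    The scaling S is applied to the green channel only (longitudinal
    aberration); the translations (tx, ty) model lateral aberration.
    """
    h, w = img.shape[:2]
    out = np.empty_like(img)
    for c in range(img.shape[-1]):
        scale = S if c == 1 else 1.0  # green channel in RGB ordering
        M = np.float32([[scale, 0.0, tx],
                        [0.0, scale, ty]])
        out[..., c] = cv2.warpAffine(img[..., c], M, (w, h))
    return out
```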

3.2 Blur

While there are several types of blur that occur in image-based datasets, we focus on out-of-focus blur, which can be modeled using a Gaussian filter [33]:

$$\begin{aligned} G = \frac{1}{2\pi \sigma ^2}e^{-\frac{x^2+y^2}{2\sigma ^2}} \end{aligned}$$
(3)

where x and y are spatial coordinates of the filter and \(\sigma \) is the standard deviation. The output image is given by:

$$\begin{aligned} I_{blur} = I*G \end{aligned}$$
(4)
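A minimal sketch of this out-of-focus blur, assuming SciPy's Gaussian filter applied per channel; the default sigma is illustrative and would be drawn from the sampled parameter range in practice.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def apply_blur(img, sigma=1.5):
    """Out-of-focus blur: convolve each channel with the Gaussian of Eq. (3)."""
    channels = [gaussian_filter(img[..., c].astype(np.float32), sigma=sigma)
                for c in range(img.shape[-1])]
    return np.stack(channels, axis=-1)
```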

3.3 Exposure

To model exposure, we use the exposure density function developed in [34, 35]:

$$\begin{aligned} I = f(S) = \frac{255}{1 + e^{-A \times S}} \end{aligned}$$
(5)

where I is image intensity, S indicates incoming light intensity, or exposure, and A is a constant value for contrast. We use this model to re-expose an image as follows:

$$\begin{aligned} S' = f^{-1}(I) + \varDelta S \end{aligned}$$
(6)
$$\begin{aligned} I_{exp} = f(S') \end{aligned}$$
(7)

We vary \(\varDelta S\) to model changing exposure: a positive \(\varDelta S\) increases the exposure, which can lead to over-saturation, while a negative \(\varDelta S\) decreases it.
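A minimal sketch of this re-exposure step, following Eqs. (5)-(7) directly; the contrast constant A and the default \(\varDelta S\) value are illustrative assumptions, and intensities are clipped away from 0 and 255 so the inverse sigmoid is defined.

```python
import numpy as np

def apply_exposure(img, delta_S=0.3, A=0.85, eps=1e-6):
    """Re-expose an image with the sigmoid model of Eqs. (5)-(7).

    Positive delta_S brightens the image (possible over-saturation);
    negative delta_S darkens it.
    """
    I = np.clip(img.astype(np.float32), eps, 255.0 - eps)
    S = -np.log(255.0 / I - 1.0) / A                    # S = f^{-1}(I)
    return 255.0 / (1.0 + np.exp(-A * (S + delta_S)))   # f(S + delta_S)
```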

3.4 Noise

The sources of image noise caused by elements of the sensor array can be modeled as either signal-dependent or signal-independent noise. Therefore, we use the Poisson-Gaussian noise model proposed in [14]:

$$\begin{aligned} I_{noise}(x,y)=I(x,y)+\eta _{poiss}(I(x,y))+\eta _{gauss} \end{aligned}$$
(8)

where \(I(x,y)\) is the ground truth image at pixel location (x, y), \(\eta _{poiss}\) is the signal-dependent Poisson noise, and \(\eta _{gauss}\) is the signal-independent Gaussian noise. We sample the noise for each pixel based upon its location in a GBRG Bayer grid array, assuming bilinear interpolation as the demosaicing function.
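A minimal sketch of the Poisson-Gaussian model of Eq. (8), using the common heteroscedastic Gaussian approximation for the signal-dependent term; the Bayer-pattern sampling and demosaicing step described above is omitted for brevity, and the noise parameters a and b are illustrative.

```python
import numpy as np

def apply_noise(img, a=0.01, b=2.0, rng=None):
    """Additive Poisson-Gaussian noise (Eq. 8), without the Bayer/demosaic step.

    The signal-dependent term is approximated by zero-mean Gaussian noise with
    variance a * I(x, y); the signal-independent term has standard deviation b.
    """
    rng = rng or np.random.default_rng()
    I = img.astype(np.float32)
    poiss = rng.normal(scale=np.sqrt(np.maximum(a * I, 0.0)))
    gauss = rng.normal(scale=b, size=I.shape)
    return np.clip(I + poiss + gauss, 0.0, 255.0)
```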

3.5 Post-processing

In standard camera pipelines, post-processing techniques, such as white balancing or gamma transformation, are nonlinear color corrections performed on the image to compensate for the presence of different environmental illuminants. These post-processing methods are generally proprietary and cannot be easily characterized [12]. We model these effects by performing translations in the CIELAB color space, also known as L*a*b* space, to remap the image tonality to a different range [36, 37]. Given that our chosen datasets are all taken outdoors during the day, we assume a D65 illuminant in our L*a*b* color space conversion.
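A minimal sketch of this post-processing color shift, assuming scikit-image's CIELAB conversion (which uses a D65 reference white by default); the channel offsets here are illustrative and would be randomly sampled per image in practice.

```python
import numpy as np
from skimage import color

def apply_color_shift(img, dL=0.0, da=5.0, db=-5.0):
    """Shift the color cast by translating the image in L*a*b* space."""
    lab = color.rgb2lab(img.astype(np.float32) / 255.0)
    lab[..., 0] = np.clip(lab[..., 0] + dL, 0.0, 100.0)   # lightness
    lab[..., 1] += da                                     # green-red axis
    lab[..., 2] += db                                     # blue-yellow axis
    rgb = color.lab2rgb(lab)                              # back to RGB in [0, 1]
    return np.clip(rgb * 255.0, 0.0, 255.0).astype(img.dtype)
```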

3.6 Generating Augmented Training Data

The bounds on the sensor effect parameter regimes were chosen experimentally; the parameter selection process is discussed in more detail in Sect. 4. To augment an image, we first randomly sample from these visually realistic parameter ranges. Both the chosen parameters and the unaugmented image are then input to the augmentation pipeline, which outputs the image augmented with the camera effects determined by the chosen parameters. We augmented each image multiple times with different sets of randomly sampled parameters; note that this augmentation method serves as a pre-processing step. Figure 4 shows sample images augmented with individual sensor effects as well as our full proposed sensor-based image augmentation pipeline. We use the original image labels as the labels for the augmented data: although sensor effects such as chromatic aberration and blur perturb pixel values near object boundaries, the underlying scene content is unchanged, so training against the original labels encourages the network to make robust and accurate predictions in the presence of camera effects.
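The sketch below illustrates the per-image random sampling step; the parameter bounds shown are placeholders rather than the experimentally chosen ranges from Sect. 4, and the augment function is the composition sketched in Sect. 3.

```python
import numpy as np

# Placeholder parameter bounds; the actual ranges were chosen experimentally
# (see Sect. 4) and are not reproduced here.
PARAM_RANGES = {
    "chrom_ab": {"S": (1.0, 1.01), "tx": (-2.0, 2.0), "ty": (-2.0, 2.0)},
    "blur":     {"sigma": (0.0, 3.0)},
    "exposure": {"delta_S": (-1.0, 1.0)},
    "noise":    {"a": (0.0, 0.02), "b": (0.0, 4.0)},
    "color":    {"dL": (-5.0, 5.0), "da": (-10.0, 10.0), "db": (-10.0, 10.0)},
}

def sample_params(rng=None):
    """Draw one random set of sensor-effect parameters for a single image."""
    rng = rng or np.random.default_rng()
    return {effect: {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
            for effect, bounds in PARAM_RANGES.items()}

# Usage: augmented_image = augment(image, sample_params())
# The original labels of `image` are reused for `augmented_image`.
```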

Fig. 4.

Example augmentations of GTA (left column) and VKITTI (right column) using the proposed sensor effect augmentation pipeline. Each image has a randomly sampled level of blur, chromatic aberration, exposure, sensor noise, and color temperature shift applied to it in an effort to model the visual structure/information loss caused by cameras when capturing real images.

4 Experiments

We evaluate the proposed sensor-based image augmentation pipeline on the task of object detection on benchmark vehicle datasets to assess its effectiveness at bridging the synthetic-to-real domain gap. We apply our image augmentation pipeline to two benchmark synthetic vehicle datasets, each of which was rendered with a different level of photorealism. The first, Virtual KITTI (VKITTI) [16], features over 21000 images and is designed to model the spatial layout of KITTI with varying environmental factors such as weather and time of day. The second is Grand Theft Auto (GTA) [17, 41], which features 21000 images and is noted for its high quality and increased photorealism compared to VKITTI.

To evaluate the proposed augmentation method for 2D object detection, we used Faster R-CNN as our base network [42]. Faster R-CNN achieves relatively high performance on the KITTI benchmark test dataset, and many state-of-the-art object detection networks that improve upon these results use Faster R-CNN as their base architecture. For all experiments, we apply the sensor effect augmentation pipeline to all images in the given dataset, then train an object detection network on the combination of the original unaugmented data and the sensor effect augmented data. We ran experiments to determine the number of sensor effect augmentations per image, and found that optimal performance was achieved by augmenting each image in each dataset once. To determine the bounds of the sensor effect parameter ranges from which to sample, we augmented small datasets of 2975 images by randomly sampling from increasingly larger parameter bounds and chose the ranges for each sensor effect that yielded the highest performance as well as visually realistic images. We found that the same parameter regime yielded optimal performance for both synthetic datasets.

All of the trained networks are tested on a held-out validation set of 1480 images from the KITTI training data, and we report the Pascal VOC \(AP50_{bbox}\) value for the car class. We also report the gain in \(AP50_{bbox}\), which is the difference in performance relative to the baseline (unaugmented) dataset. We compare the performance of object detection networks trained on sensor-effect augmented data to object detection networks trained on unaugmented data as our baseline. For each dataset, we trained each Faster R-CNN network for 10 epochs using four Titan X Pascal GPUs in order to control for potential confounds between performance and training time.

4.1 Performance on Baseline Object Detection Benchmarks

Table 1 shows results for Faster R-CNN networks trained on unaugmented synthetic data and sensor-effect augmented data for both VKITTI and GTA. Note that we provide experiments trained on the full training datasets, as well as experiments trained on subsets of 2975 images to allow comparison of performance across differently sized datasets. Synthetic data augmented with the proposed method yields significant performance gains over the baseline (unaugmented) synthetic datasets. This is expected, as rendering engines generally do not model sensor effects such as noise, blur, and chromatic aberration as accurately as our proposed approach. Another important result for the synthetic datasets (both VKITTI and GTA) is that, by leveraging our approach, we are able to outperform the networks trained on over 20000 unaugmented images using a subset of only 2975 augmented images. This means not only that networks can be trained faster, but also that, when training with synthetic data, varying camera effects can outweigh the value of simply generating more data with varied spatial features. The VKITTI baseline dataset tested on KITTI performs relatively well compared to GTA, even though GTA is a more photorealistic dataset. This can most likely be attributed to the similarity in spatial layout and image features between VKITTI and KITTI. With our proposed approach, VKITTI gives comparable performance to the network trained on the Cityscapes baseline, showing that synthetic data augmented with our proposed sensor-based image pipeline can perform comparably to real data for cross-dataset generalization.

Table 1. Object detection trained on synthetic data, tested on KITTI

4.2 Comparison to Other Augmentation Techniques

We ran experiments to compare our proposed method to photometric augmentation (specifically, the PCA-based color shift of [1]), complex spatial/geometric augmentation (elastic deformation [47]), standard additive Gaussian noise augmentation, and a suite of standard spatial augmentations (random rotations, scaling, translations, and cropping). We provide the results of training Faster R-CNN networks on the full VKITTI and GTA datasets augmented with the above methods in Table 2. All networks were tested on the same held-out set of KITTI images used in the previous object detection experiments. Our results show that our proposed method drastically outperforms the other standard augmentation techniques, and that for certain synthetic data, spatial augmentations actually decrease performance on real data. This suggests that the proposed sensor effect augmentations capture more salient visual structure than traditional, non-photorealistic augmentation methods. We hypothesize this is because the physically-based sensor augmentations better model the information loss, and the resulting global pixel statistics, that occur in real images. For example, our proposed method alters the color cast of an image with a CIELAB-space transformation, whereas traditional approaches operate in RGB space. Because CIELAB space is device independent, this results in a more accurate, physically-based augmentation than [1].

Table 2. We provide the results of training Faster-RCNN networks on GTA and Virtual KITTI augmented with various augmentation methods. All networks were tested on KITTI.

4.3 Ablation Study

To evaluate the contribution of each sensor effect augmentation to performance, we used the proposed pipeline to generate datasets with only one type of sensor effect augmentation. We trained Faster R-CNN on each of these single-effect augmented datasets; the results are given in Table 3. Performance increases across all ablation experiments for training on synthetic data. This further validates our hypothesis that each of the sensor effects is important for closing the gap between synthetic and real data.

Table 3. Ablation study for object detection trained on synthetic data, tested on KITTI
Fig. 5.

Virtual KITTI examples are in the left column, GTA examples are in the right column. Blue boxes show correct detections; red boxes show detections missed by the FasterRCNN network trained on baseline, unaugmented image datasets but detected by FasterRCNNs trained on data augmented using our proposed approach for sensor-based image augmentation. (Color figure online)

4.4 Failure Mode Analysis

Figure 5 shows qualitative failure modes of Faster R-CNN trained on each synthetic training dataset and tested on KITTI, where blue bounding boxes indicate correct detections and red bounding boxes indicate detections missed by the baseline but correctly detected by our proposed method. Qualitatively, our method more reliably detects cars that appear small in the image, particularly in the far background, at a scale where the pixel statistics of the image are more pronounced. Note that our method also improves car detection in cases where the image is over-saturated due to increased exposure, an effect we directly model in our proposed augmentation pipeline. Additionally, our method produces improved detections under other effects that obscure the presence of a car, such as occlusion and shadows, even though we do not directly model these effects. This may be attributed to increased robustness to effects that lead to loss of visual information about an object in general.

5 Conclusions

We have proposed a novel sensor-based image augmentation pipeline for augmenting synthetic training data input to DNNs for the task of object detection in real urban driving scenes. Our augmentation pipeline models a range of physically-realistic sensor effects that occur throughout the image formation and post-processing pipeline. These effects were chosen as they lead to loss of information or distortion of a scene, which degrades network performance on learned visual tasks. By training on our augmented datasets, we can effectively increase dataset size and variation in the sensor domain, without the need for further labeling, in order to improve robustness and generalizability of resulting object detection networks. We achieve significantly improved performance across a range of benchmark synthetic vehicle datasets, independent of the level of photorealism. Overall, our results reveal insight into the importance of modeling sensor effects for the specific problem of training on synthetic data and testing on real data.