International Journal of Computer Vision, Volume 126, Issue 9, pp 961–972

Augmented Reality Meets Computer Vision: Efficient Data Generation for Urban Driving Scenes

  • Hassan Abu Alhaija
  • Siva Karthik Mustikovela
  • Lars Mescheder
  • Andreas Geiger
  • Carsten Rother

Abstract

The success of deep learning in computer vision is based on the availability of large annotated datasets. To lower the need for hand-labeled images, virtually rendered 3D worlds have recently gained popularity. Unfortunately, creating realistic 3D content is challenging in its own right and requires significant human effort. In this work, we propose an alternative paradigm which combines real and synthetic data for learning semantic instance segmentation and object detection models. Exploiting the fact that not all aspects of the scene are equally important for this task, we propose to augment real-world imagery with virtual objects of the target category. Capturing real-world images at large scale is easy and cheap, and directly provides real background appearances without the need for creating complex 3D models of the environment. We present an efficient procedure to augment these images with virtual objects. In contrast to modeling complete 3D environments, our data augmentation approach requires only a few user interactions in combination with 3D models of the target object category. Leveraging our approach, we introduce a novel dataset of augmented urban driving scenes with 360-degree images that are used as environment maps to create realistic lighting and reflections on rendered objects. We analyze the significance of realistic object placement by comparing manual placement by humans to automatic methods based on semantic scene analysis. This allows us to create composite images which exhibit both realistic background appearance and a large number of complex object arrangements. Through an extensive set of experiments, we determine the set of parameters that produces augmented data which maximally enhances the performance of instance segmentation models. Further, we demonstrate the utility of the proposed approach for training standard deep models for semantic instance segmentation and object detection of cars in outdoor driving scenarios. We test the models trained on our augmented data on the KITTI 2015 dataset, which we have annotated with pixel-accurate ground truth, and on the Cityscapes dataset. Our experiments demonstrate that the models trained on augmented imagery generalize better than those trained on fully synthetic data or on limited amounts of annotated real data.
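
The core idea of augmenting a real photograph with a rendered object can be illustrated with a minimal sketch. It covers only a final compositing step and is not the authors' pipeline: it assumes the renderer has already produced an RGBA image of the object aligned to the photograph, with straight (non-premultiplied) alpha. The function name `composite`, the array layout, and the 0.5 coverage threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def composite(background, render_rgba, instance_id, instance_map=None):
    """Alpha-blend a rendered RGBA object over a real RGB background.

    background:  (H, W, 3) float array in [0, 1], the real photograph.
    render_rgba: (H, W, 4) float array in [0, 1], the rendered object,
                 with alpha = 0 wherever no object was rendered.
    Returns the augmented image and an integer instance-label map.
    """
    if instance_map is None:
        instance_map = np.zeros(background.shape[:2], dtype=np.int32)
    rgb, alpha = render_rgba[..., :3], render_rgba[..., 3:4]
    # Standard "over" operator: virtual object in front, real scene behind.
    image = alpha * rgb + (1.0 - alpha) * background
    # Label pixels mostly covered by the object (0.5 is an assumed threshold).
    labels = np.where(alpha[..., 0] > 0.5, instance_id, instance_map)
    return image, labels


# Toy example: a 2x2 opaque red "object" on a 4x4 gray background.
bg = np.full((4, 4, 3), 0.2)
car = np.zeros((4, 4, 4))
car[1:3, 1:3] = [1.0, 0.0, 0.0, 1.0]
img, labels = composite(bg, car, instance_id=1)
```

Because the object is rendered, its pixel-accurate instance mask comes for free from the alpha channel, which is why this kind of augmentation yields annotations without manual labeling.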

Keywords

Synthetic training data · Data augmentation · Autonomous driving · Instance segmentation · Object detection

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Visual Learning Lab, Heidelberg University, Heidelberg, Germany
  2. Autonomous Vision Group, MPI-IS Tübingen, Tübingen, Germany
  3. Computer Vision and Geometry Group, ETH Zürich, Zurich, Switzerland