Abstract
Data-driven algorithms have surpassed traditional techniques in almost every aspect in robotic vision problems. Such algorithms need vast amounts of quality data to be able to work properly after their training process. Gathering and annotating that sheer amount of data in the real world is a time-consuming and error-prone task. These problems limit scale and quality. Synthetic data generation has become increasingly popular since it is faster to generate and automatic to annotate. However, most of the current datasets and environments lack realism, interactions, and details from the real world. UnrealROX is an environment built over Unreal Engine 4 which aims to reduce that reality gap by leveraging hyperrealistic indoor scenes that are explored by robot agents which also interact with objects in a visually realistic manner in that simulated world. Photorealistic scenes and robots are rendered by Unreal Engine into a virtual reality headset which captures gaze so that a human operator can move the robot and use controllers for the robotic hands; scene information is dumped on a per-frame basis so that it can be reproduced offline to generate raw data and ground truth annotations. This virtual reality environment enables robotic vision researchers to generate realistic and visually plausible data with full ground truth for a wide variety of problems such as class and instance semantic segmentation, object detection, depth estimation, visual grasping, and navigation.
Similar content being viewed by others
Notes
References
Bhoi A (2019) Monocular depth estimation: a survey. arXiv preprint arXiv:1901-09402
Bousmalis K, Irpan A, Wohlhart P, Bai Y, Kelcey M, Kalakrishnan M, Downs L, Ibarz J, Pastor P, Konolige K et al (2017) Using simulation and domain adaptation to improve efficiency of deep robotic grasping. arXiv preprint arXiv:1709.07857
Brodeur S, Perez E, Anand A, Golemo F, Celotti L, Strub F, Rouat J, Larochelle H, Courville A (2017) Home: a household multimodal environment. arXiv preprint arXiv:1711.11017
Butler DJ, Wulff J, Stanley GB, Black MJ (2012) A naturalistic open source movie for optical flow evaluation. In: Proceedings of the European conference on computer vision (ECCV), pp 611–625
Eigen D, Fergus R (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 2650–2658
Eigen D, Puhrsch C, Fergus R (2014) Depth map prediction from a single image using a multi-scale deep network. In: Advances in neural information processing systems (NIPS), pp 2366–2374
Gaidon A, Wang Q, Cabon Y, Vig E (2016) Virtual worlds as proxy for multi-object tracking analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 4340–4349
He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision (CVPR), pp 2961–2969
Kolve E, Mottaghi R, Gordon D, Zhu Y, Gupta A, Farhadi A (2017) Ai2-thor: an interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474
Laina I, Rupprecht C, Belagiannis V, Tombari F, Navab N (2016) Deeper depth prediction with fully convolutional residual networks. In: Proceedings of the IEEE conference on 3D vision (3DV), pp 239–248
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436
Lenz I, Lee H, Saxena A (2015) Deep learning for detecting robotic grasps. Int J Robot Res 34(4–5):705–724
Levine S, Pastor P, Krizhevsky A, Ibarz J, Quillen D (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int J Robot Res 37(4–5):421–436
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 3431–3440
Looman T (2017) Vr template. https://wiki.unrealengine.com/VR_Template. Accessed 1 Sept 2018
Mahler J, Liang J, Niyaz S, Laskey M, Doan R, Liu X, Ojea JA, Goldberg K (2017) Dex-net 2.0: deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. arXiv preprint arXiv:1703.09312
McCormac J, Handa A, Leutenegger S, Davison AJ (2016) Scenenet rgb-d: 5m photorealistic images of synthetic indoor trajectories with ground truth. arXiv preprint arXiv:1612.05079
Oculus (2017a) Distance grab sample now available in oculus unity sample framework. https://developer.oculus.com/blog/distance-grab-sample-now-available-in-oculus-unity-sample-framework/. Accessed 1 Sept 2018
Oculus (2017b) Oculus first contact. https://www.oculus.com/experiences/rift/1217155751659625/. Accessed 1 Sept 2018
Pashevich A, Strudel R, Kalevatykh I, Laptev I, Schmid C (2019) Learning to augment synthetic images for sim2real policy transfer. arXiv preprint arXiv:1903.07740
Qiu W, Yuille A (2016) Unrealcv: connecting computer vision to unreal engine. In: Proceedings of the European conference on computer vision (ECCV), pp 909–916
Qiu W, Zhong F, Zhang Y, Qiao S, Xiao Z, Kim TS, Wang Y (2017) Unrealcv: virtual worlds for computer vision. In: Proceedings of the 2017 ACM on multimedia conference (ACMMM), pp 1221–1224
Redmon J, Farhadi A (2017) YOLO9000: better, faster, stronger. In: Proceedings—30th IEEE conference on computer vision and pattern recognition, CVPR 2017 2017-Janua, pp 6517–6525. https://doi.org/10.1109/CVPR.2017.690
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition. https://doi.org/10.1109/CVPR.2016.91
Ros G, Sellart L, Materzynska J, Vazquez D, Lopez AM (2016) The synthia dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 3234–3243
Savva M, Chang AX, Dosovitskiy A, Funkhouser T, Koltun V (2017) Minos: multimodal indoor simulator for navigation in complex environments. arXiv preprint arXiv:1712.03931
Silberman N, Hoiem D, Kohli P, Fergus R (2012) Indoor segmentation and support inference from rgbd images. In: Proceedings of the European conference on computer vision (ECCV), pp 746–760
Tekin B, Sinha SN, Fua P (2018) Real-time seamless single shot 6d object pose prediction. In: 2018 IEEE/CVF conference on computer vision and pattern recognition, pp 292–301. https://doi.org/10.1109/CVPR.2018.00038
To T, Tremblay J, McKay D, Yamaguchi Y, Leung K, Balanon A, Cheng J, Birchfield S (2018) NDDS: NVIDIA deep learning dataset synthesizer. https://github.com/NVIDIA/Dataset_Synthesizer
Tobin J, Fong R, Ray A, Schneider J, Zaremba W, Abbeel P (2017a) Domain randomization for transferring deep neural networks from simulation to the real world. In: Proceedings of the IEEE international conference on intelligent robots and systems (IROS), pp 23–30
Tobin J, Zaremba W, Abbeel P (2017b) Domain randomization and generative models for robotic grasping. arXiv preprint arXiv:1710.06425
Tremblay J, Prakash A, Acuna D, Brophy M, Jampani V, Anil C, To T, Cameracci E, Boochoon S, Birchfield S (2018) Training deep networks with synthetic data: bridging the reality gap by domain randomization. arXiv preprint arXiv:1804.06516
Ummenhofer B, Zhou H, Uhrig J, Mayer N, Ilg E, Dosovitskiy A, Brox T (2017) Demon: depth and motion network for learning monocular stereo. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 5038–5047
Xia F, Zamir RA, He ZY, Sax A, Malik J, Savarese S (2018) Gibson env: real-world perception for embodied agents. In: Proceedings of the IEEE computer vision and pattern recognition (CVPR)
Xu D, Wang W, Tang H, Liu H, Sebe N, Ricci E (2018) Structured attention guided convolutional neural fields for monocular depth estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 3917–3925
Yan C, Misra D, Bennnett A, Walsman A, Bisk Y, Artzi Y (2018) Chalet: cornell house agent learning environment. arXiv preprint arXiv:1801.07357
Acknowledgements
This work has been funded by the Spanish Government TIN2016-76515-R Grant for the COMBAHO project, supported with Feder funds. This work has also been supported by three Spanish national grants for Ph.D. studies (FPU15/04516, FPU17/00166, and ACIF/2018/197), by the University of Alicante Project GRE16-19, and by the Valencian Government Project GV/2018/022. Experiments were made possible by a generous hardware donation from NVIDIA. We would also like to thank Zuria Bauer for her collaboration in the depth estimation experiments.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Martinez-Gonzalez, P., Oprea, S., Garcia-Garcia, A. et al. UnrealROX: an extremely photorealistic virtual reality environment for robotics simulations and synthetic data generation. Virtual Reality 24, 271–288 (2020). https://doi.org/10.1007/s10055-019-00399-5
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10055-019-00399-5