Abstract
Three-dimensional (3D) modeling of urban road scenes has garnered considerable interest in entertainment, urban planning, and autonomous vehicle simulation. Building such realistic scenes, however, remains a predominantly manual process that relies on 3D artists. Cameras mounted on vehicles now routinely provide images of road scenes, which can serve as references for automating scene layout. Our goal is to use the information in such images, captured by a single camera sensor on a moving vehicle, to build an approximate 3D virtual world. We propose a workflow that takes the human out of the loop by combining deep learning to generate a dense depth map, an inverse projection to correct for perspective distortion in the image, collision detection, and a rendering engine. The engine loads and displays 3D models of the appropriate object types at accurate relative positions, thus building and rendering a virtual world corresponding to the image. This virtual world can then be edited and animated. When integrated with a modeling tool, our workflow can potentially speed up the modeling of virtual environments significantly. We have evaluated the efficacy of our 3D virtual world-building and rendering through user studies of image-to-image similarity and video-to-image correspondences. Even with limited photorealism in the rendering, our user study results demonstrate that 3D world-building can be done effectively, with minimal human intervention, using our workflow with monocular images from moving vehicles as input.
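As a concrete illustration of the inverse-projection step mentioned in the abstract, the following is a minimal sketch, assuming a standard pinhole camera model with known intrinsics; the function name `back_project`, the parameter names (`fx`, `fy`, `cx`, `cy`), the NumPy implementation, and the KITTI-like intrinsic values in the usage line are our illustrative assumptions, not the authors' code. It maps each image-space X-Y pixel, together with its predicted depth, to an X-Y-Z point in camera coordinates (see the note on axis conventions below).

```python
import numpy as np

def back_project(depth, fx, fy, cx, cy):
    """Inverse-project a dense depth map to 3D points (pinhole model).

    depth : (H, W) array of metric depths along the camera Z axis.
    fx, fy : focal lengths in pixels; cx, cy : principal point.
    Returns an (H, W, 3) array of X-Y-Z points in camera coordinates.
    """
    h, w = depth.shape
    # Image-space X-Y pixel grid: u varies across columns, v across rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx  # undo perspective scaling along X
    y = (v - cy) * depth / fy  # undo perspective scaling along Y
    return np.stack([x, y, depth], axis=-1)

# Usage with a synthetic constant-depth map and KITTI-like intrinsics
# (illustrative values only): every point lands on the plane Z = 10 m.
pts = back_project(np.full((375, 1242), 10.0),
                   fx=721.5, fy=721.5, cx=609.6, cy=172.9)
```

Placing objects at these back-projected positions is what lets the rendering engine reproduce the relative layout of the scene without manual placement.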
Notes
1. We refer to the axes in image space as X-Y, and those in 3D world space as X-Y-Z.
Acknowledgments
The authors are grateful to all members of the Graphics-Visualization-Computing Lab and peers at IIITB for their support. This work has been financially supported by the Machine Intelligence and Robotics (MINRO) grant from the Government of Karnataka.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Victor, A.C., Sreevalsan-Nair, J. (2021). Building 3D Virtual Worlds from Monocular Images of Urban Road Traffic Scenes. In: Bebis, G., et al. (eds.) Advances in Visual Computing. ISVC 2021. Lecture Notes in Computer Science, vol. 13018. Springer, Cham. https://doi.org/10.1007/978-3-030-90436-4_37
DOI: https://doi.org/10.1007/978-3-030-90436-4_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-90435-7
Online ISBN: 978-3-030-90436-4
eBook Packages: Computer Science, Computer Science (R0)