Abstract
We introduce a novel framework for 3D scene reconstruction with simultaneous object annotation that combines a pre-trained 2D convolutional neural network (CNN), incremental data streaming, and remote exploration in a virtual reality setup. It enables versatile integration of any 2D box-detection or segmentation network. We integrate new approaches to (i) asynchronously perform dense 3D reconstruction and object annotation at interactive frame rates, (ii) efficiently optimize CNN results in terms of object prediction and spatial accuracy, and (iii) generate computationally efficient colliders in large triangulated 3D reconstructions at run time for 3D scene interaction. Our method is novel in combining CNNs that have long and varying inference times with live 3D reconstruction from RGB-D camera input. We further propose a lightweight data structure that stores the 3D reconstruction data and object annotations to enable fast incremental data transmission for real-time exploration by a remote client, which has not been presented before. Our framework achieves update rates of 22 fps (SSD MobileNet) and 19 fps (Mask R-CNN) for indoor environments of up to 800 m³. We also evaluated the accuracy of 3D object detection. Our work provides a versatile foundation for semantic scene understanding of large streamed 3D reconstructions while remaining independent of the CNN's processing time. Source code is available for non-commercial use.
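The decoupling the abstract describes, reconstruction running at interactive rates while a CNN with long and varying inference time annotates in the background, can be sketched as a single-slot "latest keyframe" mailbox: a slow annotator simply skips stale frames instead of stalling the reconstruction loop. The class and method names below are illustrative assumptions, not the paper's actual implementation.

```python
import threading


class LatestFrameSlot:
    """Single-slot mailbox between the reconstruction and annotation threads.

    The reconstruction loop overwrites any unprocessed frame, so the CNN
    worker always sees the newest keyframe and never blocks the producer.
    (Hypothetical sketch; not the authors' data structure.)
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._frame = None

    def put(self, frame):
        """Called by the reconstruction loop; overwrites any stale frame."""
        with self._lock:
            self._frame = frame

    def take(self):
        """Called by the CNN worker; returns the newest frame, or None."""
        with self._lock:
            frame, self._frame = self._frame, None
            return frame


# Usage: the producer may run many iterations per annotator poll.
slot = LatestFrameSlot()
slot.put("keyframe-0")
slot.put("keyframe-1")      # keyframe-0 is dropped as stale
latest = slot.take()        # annotator only ever processes keyframe-1
```

With this scheme the reconstruction frame rate is bounded by mapping cost alone, which is one plausible way to realize the abstract's claim of being "independent of the CNN's processing time".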
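The incremental transmission to a remote client can likewise be sketched as a per-block versioning scheme: the server keeps a version counter per reconstruction block and ships only the blocks the client has not yet acknowledged, compressed before transmission. The function name, the dictionary layout, and the use of zlib (DEFLATE) are assumptions for illustration, not the paper's wire format.

```python
import zlib


def incremental_update(server_blocks, client_versions):
    """Compute the compressed delta for one client sync.

    server_blocks:   {block_id: (version, payload_bytes)} held by the server.
    client_versions: {block_id: version} last acknowledged by the client.
    Returns {block_id: compressed_payload} containing only changed blocks.
    (Illustrative sketch; not the authors' transmission format.)
    """
    delta = {}
    for block_id, (version, payload) in server_blocks.items():
        if client_versions.get(block_id, -1) < version:
            delta[block_id] = zlib.compress(payload)  # DEFLATE-style compression
    return delta


# Usage: the client already has block 1 at version 0, so only block 2 is sent.
server = {1: (0, b"mesh-chunk-a"), 2: (3, b"mesh-chunk-b")}
client = {1: 0}
delta = incremental_update(server, client)
```

Sending per-block deltas keeps update messages small even as the reconstructed volume grows, which is the property a lightweight streaming structure for large indoor scenes needs.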
Acknowledgements
This work was solely supported by Vienna University of Technology.
Author information
Benjamin Höller is a postgraduate student with the Interactive Media System Group at the Institute of Visual Computing and Human-Centered Technology at Vienna University of Technology, where he received his M.Sc. degree with distinction in 2019. His research interests lie at the intersection of virtual and augmented reality, and 3D computer vision, with a strong focus on machine learning.
Annette Mossel is a post-doctoral researcher at the Institute of Visual Computing and Human-Centered Technology at Vienna University of Technology, and a scientific entrepreneur. She received her Ph.D. degree in 2014 from Vienna University of Technology. During her studies, she worked as a visiting researcher at the Fraunhofer Institute for Computer Graphics and the MIT Media Lab. She has 12 years of experience in mixed reality with strong expertise in vision-based self-localization, dense 3D mapping, and 3D human-computer interaction (HCI). She has authored or co-authored more than 20 scientific publications and has participated in and led multiple nationally funded scientific projects on wide-area indoor localization, multi-user VR, and dense 3D surface reconstruction.
Hannes Kaufmann is full professor of virtual and augmented reality at the Institute of Visual Computing & Human-Centered Technology at TU Wien. He has conducted research into virtual reality, tracking, mobile augmented reality, training spatial abilities in AR/VR, tangible interaction, medical VR/AR applications, real-time ray tracing, redirected walking, geometry, and educational mathematics software. His habilitation (2010) was on "applications of mixed reality" with a major focus on educational mixed reality applications. He has acted on behalf of the European Commission as a project reviewer, participated in EU projects in FP5, FP7, and Horizon 2020, managed over 30 research projects, and published more than 100 scientific papers.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Höller, B., Mossel, A. & Kaufmann, H. Automatic object annotation in streamed and remotely explored large 3D reconstructions. Comp. Visual Media 7, 71–86 (2021). https://doi.org/10.1007/s41095-020-0194-4