Abstract
We introduce a fully automated 360° video processing pipeline that uses a hierarchical combination of Artificial Intelligence (AI) modules to create immersive volumetric XR experiences. Two critical production tasks (person segmentation and depth estimation) are addressed with a parallel Deep Neural Network (DNN) pipeline that combines instance segmentation, person detection, pose estimation, camera stabilization, neural tracking, 3D face detection, hair masking, and monocular 360° depth computation in a single, robust tool set. To facilitate the rapid uptake of these techniques we provide a detailed review of AI-based methods for these problems (complete with links to recommended open-source implementations) as well as references to existing authoring tools on the market. Our key contributions include a method for creating semi-synthetic data sets for automatic data augmentation, which we use to generate over 3.8 million images as part of a concise evaluation and subsequent retraining of DNNs for person detection tasks. Furthermore, we apply the same techniques to develop a spherical DNN for monocular depth estimation with a Free Viewpoint Video (FVV) capture system and a novel method for generating 3D human shapes and posed mannequins for training. To evaluate the performance of our AI authoring tool set we address four challenging production tasks and demonstrate the practical use of our solution with videos showing processed output.
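The semi-synthetic augmentation described above amounts to a compositing step: a segmented person cut-out is pasted onto an equirectangular 360° background at a chosen position on the sphere. The sketch below is a minimal illustration of that idea only, not the paper's actual pipeline; the function name, parameters, and seam handling are assumptions for this example.

```python
import numpy as np

def composite_person(background: np.ndarray, person_rgba: np.ndarray,
                     yaw_deg: float, top: int) -> np.ndarray:
    """Alpha-blend an RGBA person cut-out onto an equirectangular background.

    The yaw angle is mapped to a horizontal pixel offset so the subject can
    be placed anywhere around the 360° sphere. Horizontal wrap-around is
    handled explicitly, because the first and last image columns are
    adjacent on the sphere.
    """
    h, w, _ = background.shape
    ph, pw, _ = person_rgba.shape
    left = int((yaw_deg % 360.0) / 360.0 * w)   # yaw -> column offset

    out = background.astype(np.float32).copy()
    alpha = person_rgba[..., 3:4].astype(np.float32) / 255.0
    fg = person_rgba[..., :3].astype(np.float32)

    cols = (np.arange(pw) + left) % w           # wrap across the 0/360° seam
    rows = slice(top, top + ph)
    out[rows, cols] = alpha * fg + (1.0 - alpha) * out[rows, cols]
    return out.astype(np.uint8)
```

A real pipeline would also account for the projection's latitude-dependent distortion when scaling the cut-out; this sketch only shows the placement and blending logic.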
Notes
Combination of real photographs and 3D models in a virtually rendered scene.
Github source code links (https://github.com/) are also included to aid the rapid uptake of these techniques by the immersive media production and research community at large.
In our work we focus on the central task of person segmentation in the scene, but the techniques described can be readily generalized to any object class as required.
The actual settings are relative to the coordinate system of the animation space but simulate physical distances and angular changes with respect to the virtual spherical camera.
We note that while the Face3D algorithm generally works well for a broad range of 360° footage, in this particular video the rather low pixel resolution and difficult poses, such as people looking down with many facial features rendered invisible, caused it to produce somewhat noisy results that required an additional pass to clean up the data.
Note that neither the Samsung Gear360 nor the GoPro6 camera rig has stereoscopic capabilities or enough overlap of the lenses to compute disparity maps.
References
3DVista Pro (2020) https://www.3dvista.com. Accessed 1 Jan 2021
Adobe Creative Suite Tools (2020) https://www.adobe.com/creativecloud/video/virtual-reality.html. Accessed 1 Jan 2021
Andersson Technologies (2020), SynthEyes 3D Camera Tracking and Stabilization Software, https://www.ssontech.com/synovu.html. Accessed 1 Jan 2021
Bochkovskiy A, Wang CY, Liao HYM (2020) YOLOv4: Optimal Speed and Accuracy of Object Detection. https://arxiv.org/abs/2004.10934. Accessed 1 Jan 2021
Bodini M (2019) A Review of Facial Landmark Extraction in 2D Images and Videos Using Deep Learning. Big Data Cogn. Comput. 3(1):14. https://doi.org/10.3390/bdcc3010014
Bolya D, Zhou C, Xiao F, Lee YJ (2019) YOLACT++: better real-time instance segmentation, Source Code https://github.com/dbolya/yolact. Accessed 1 Jan 2021
Bulat A, Tzimiropoulos G (2017) super-FAN: integrated facial landmark localization and super-resolution of real-world low resolution faces in arbitrary poses with GANs, https://arxiv.org/abs/1712.02765, Source Code https://github.com/1adrianb/face-alignment. Accessed 1 Jan 2021
Cao Z, Hidalgo G, Simon T, Wei S, Sheikh Y (2018) OpenPose: Realtime multi-person 2D pose estimation using part affinity fields, Computer Vision and Pattern Recognition, Source Code https://github.com/CMU-Perceptual-Computing-Lab/openpose. Accessed 1 Jan 2021
Cohen T, Geiger M, Koehler J, Welling M (2018) Spherical CNNs. ICLR 2018. https://openreview.net/pdf?id=Hkbd5xZRb, Source Code: https://github.com/jonas-koehler/s2cnn. Accessed 1 Jan 2021
Cubuk ED, Zoph B, Mane D, Vasude V, Le QV (2019) AutoAugment: Learning Augmentation Strategies From Data; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 113–123. https://openaccess.thecvf.com/content_CVPR_2019/html/Cubuk_AutoAugment_Learning_Augmentation_Strategies_From_Data_CVPR_2019_paper.html
CVAT - Computer Vision Annotation Tool (2020), Source Code https://github.com/openvinotoolkit/cvat. Accessed 1 Jan 2021
de La Garanderie GP, Abarghouei AA, Breckon TP (2018) Eliminating the Blind Spot: Adapting 3D Object Detection and Monocular Depth Estimation to 360° Panoramic Imagery, in Proc. European Conference on Computer Vision, Springer. https://arxiv.org/abs/1808.06253 Source Code https://github.com/gdlg/panoramic-depth-estimation. Accessed 1 Jan 2021
Dhiman C, Vishwakarma DK (2019) A Review of State-of-the-art Techniques for Abnormal Human Activity Recognition. Eng Appl Artificial Intell 77:21–45
Duan Z, Tezcan MO, Nakamura H, Ishwar P, Konrad J (2020) RAPiD: rotation-aware people detection in overhead fisheye images, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Omnidirectional Computer Vision in Research and Industry (OmniCV) Workshop. https://arxiv.org/abs/2005.11623
Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A (2010) The PASCAL visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 http://host.robots.ox.ac.uk/pascal/VOC/. Accessed 1 Jan 2021
Fader (2020) https://getfader.com. Accessed 1 Jan 2021
Fang HS, Xie S, Tai YW, Lu C (2018) RMPE: Regional Multi-Person Pose Estimation, https://arxiv.org/abs/1612.00137. Accessed 1 Jan 2021
Gao K, Yang S, Fu K, Cheng P (2019) Deep 3D Facial Landmark Detection on Position Maps. In: Cui Z, Pan J, Zhang S, Xiao L, Yang J (eds) Intelligence Science and Big Data Engineering. Visual Data Engineering. IScIDE 2019. Lecture notes in computer science, vol 11935. Springer, Cham
Ghiasi G, Lee H, Kudlur M, Dumoulin V, Shlens J (2017) Exploring the structure of a real-time, Arbitrary Neural Artistic Stylization Network. https://arxiv.org/abs/1705.06830. Accessed 1 Jan 2021
Godard C, Aodha OM, Firman M, Brostow GJ (2019) Digging into self-supervised monocular depth estimation, in Proc the international conference on computer vision (ICCV19), Source Code https://github.com/nianticlabs/monodepth2. Accessed 1 Jan 2021
Google Research (2019), BodyPix2.0, Source Code https://github.com/tensorflow/tfjs-models/tree/master/body-pix. Accessed 1 Jan 2021
Guo K, et al. (2019) The Relightables: Volumetric Performance Capture of Humans with Realistic Relighting. ACM Trans Graphics 38(6). https://doi.org/10.1145/3355089.3356571
Han Z, Ban X, Wang X, Wu J (2020) MIPOSE: A Micro-intelligent Platform for Dynamic Human Pose Recognition, in Proc. AsianHCI '19: Proceedings of Asian CHI Symposium 2019: Emerging HCI Research Collection, pp 60–65, https://doi.org/10.1145/3309700.3338440
He K, Gkioxari G, Dollár P, Girshick R (2017) Mask R-CNN, IEEE international conference on computer vision (ICCV), Source Code: https://github.com/matterport/Mask_RCNN. Accessed 1 Jan 2021
Hohman F, Wongsuphasawat K, Kery MB, Patel K (2020), Understanding and Visualizing Data Iteration in Machine Learning, in Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3313831.3376177
Huang J, Chen Z, Ceylan D, Jin H (2017) 6-DOF VR videos with a single 360-camera. Proc. IEEE Virtual Reality (VR), Los Angeles
Hyper360 Project (2020) http://www.hyper360.eu/. Accessed 1 Jan 2021
Insta360 Stitching Software (2020) https://www.insta360.com/download/insta360-pro. Accessed 1 Jan 2021
Karakottas A, Zioulis N, Zarpalas D, Daras P (2018) 360D: a dataset and baseline for dense depth estimation from 360° images. In: 1st workshop on 360° perception and interaction. European Conf. on Computer Vision (ECCV), Munich
Keyframe Interpolation (2017), Source Code https://github.com/Kay1794/Mocap-Keyframe-Interpolation. Accessed 1 Jan 2021
Kolotouros N, Pavlakos G, Black MJ, Daniilidis K (2019) Learning to Reconstruct 3D Human Pose and Shape via Model-fitting in the Loop, in Proc ICCV2019, Source Code https://github.com/nkolot/SPIN. Accessed 1 Jan 2021
Kopf J (2016) 360° Video Stabilization. ACM Trans Graph 35(6):19 https://dl.acm.org/citation.cfm?id=2982405. Accessed 1 Jan 2021
Li C, Xu M, Zhang S, Le Callet P (2018) Distortion-aware CNNs for spherical images, in Proc. of the 27th Int. Joint Conference on Artificial Intelligence, pp 1198–1204. https://www.ijcai.org/Proceedings/2018/167. Accessed 1 Jan 2021
Li Z, Dekel T, Cole F, Tucker R, Snavely N, Liu C, Freeman WT (2019) learning the depths of moving people by watching frozen people, in IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), Source Code https://github.com/google/mannequinchallenge. Accessed 1 Jan 2021
Li C, Xu M, Zhang S, Le Callet P (2020) State-of-the-art in 360° Video/Image Processing: Perception, Assessment and Compression. IEEE J Select Topics Signal Process 14(1)
Lin TY, Maire M, Belongie S, Bourdev L, Girshick R, Hays J, Perona P, Ramanan D, Zitnick CL, Dollár P (2015) Microsoft COCO: Common Objects in Context. https://arxiv.org/abs/1405.0312, http://cocodataset.org/#home. Accessed 1 Jan 2021
Lindlbauer D, Feit A, Hilliges O (2019) Context-Aware Online Adaptation of Mixed Reality Interfaces, in UIST '19: Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology. https://doi.org/10.1145/3332165.3347945
Liquid Cinema (2020) https://liquidcinemavr.com. Accessed 1 Jan 2021
Liu SJ, Agrawala M, DiVerdi S, Hertzmann A (2019) View-dependent video textures for 360° video, in proceedings of the 32nd annual ACM symposium on user Interface Software and technology, Source Code: https://lseancs.github.io/viewdepvrtextures/. Accessed 1 Jan 2021
Liu L, Ouyang W, Wang X et al (2020) Deep learning for generic object detection: a survey. Int J Computer Vision 128:261–318. https://doi.org/10.1007/s11263-019-01247-4
Lyu W, Zhou Z, Chen L, Zhou Y (2019) A survey on image and video stitching. Virtual Reality Intell Hardware 1(1):55–83. https://doi.org/10.3724/SP.J.2096-5796.2018.0008
Maninis KK, Caelles S, Pont-Tuset J, Van Gool L (2018), Deep extreme cut: from extreme points to object segmentation, computer vision and pattern recognition (CVPR), Source Code: https://github.com/scaelles/DEXTR-PyTorch. Accessed 1 Jan 2021
Matos T, Nóbrega R, Rodrigues R, Pinheiro M (2018) Dynamic Annotations on an Interactive Web-based 360° Video Player, Proc. of the 23rd International ACM Conference on 3D Web Technology (Web3D '18). ACM, New York, Article 22. https://doi.org/10.1145/3208806.3208818
Label Me (2020), Source Code: https://github.com/wkentaro/labelme. Accessed 1 Jan 2021
Nakatani A, Shinohara T, Miyaki K (2019) Live 6DoF Video Production with Stereo Camera in Proc SA '19: Siggraph Asia XR, pp 23–24, https://doi.org/10.1145/3355355.3361880
Omnivirt (2020) https://www.omnivirt.com/. Accessed 1 Jan 2021
Papandreou G, Zhu T, Chen LC, Gidaris S, Tompson J, Murphy K (2018) PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y (eds) Computer Vision – ECCV 2018. Lecture notes in computer science, vol 11218. Springer, Cham Source Code https://github.com/scnuhealthy/Tensorflow_PersonLab. Accessed 1 Jan 2021
Paulsen RR, Juhl KA, Haspang TM, Hansen T, Ganz M, Einarsson G (2019) Multi-view Consensus CNN for 3D Facial Landmark Placement. In: Jawahar C, Li H, Mori G, Schindler K (eds) Computer Vision – ACCV 2018. ACCV 2018. Lecture notes in computer science, vol 11361. Springer, Cham https://arxiv.org/abs/1910.06007. Accessed 1 Jan 2021
Pixel Annotation Tool (2020), Source Code : https://github.com/abreheret/PixelAnnotationTool. Accessed 1 Jan 2021
Pseudoscience (2020) Volumetric 360 6DoF Video / Stereo2Depth Conversion algorithm http://pseudoscience.pictures/index.html. Accessed 1 Jan 2021
Schönberger JL, Frahm JM (2016) Structure-from-Motion Revisited, in Proc Conference on Computer Vision and Pattern Recognition (CVPR)
SGO Mistika VR Optic Flow Stitcher (2020) https://www.sgo.es/mistika-vr/. Accessed 1 Jan 2021
PanoCAST (2021) http://www.panocast.com. Accessed 1 Jan 2021
Sreenu G, Durai MAS (2019) Intelligent video surveillance: a review through deep learning techniques for crowd analysis, in J. Big Data 6:48. https://doi.org/10.1186/s40537-019-0212-5
Su YC, Grauman K (2017) Flat2Sphere: learning spherical convolution for fast features from 360° imagery, Neural Information Processing Systems (NIPS). https://proceedings.neurips.cc/paper/2017/hash/0c74b7f78409a4022a2c4c5a5ca3ee19-Abstract.html. Accessed 1 Jan 2021
Supervisely (2020), Community Edition http://www.supervise.ly/. Accessed 1 Jan 2021
Svanera M, Muhammad UR, Leonardi R, Benini S (2016) Figaro, Hair Detection and Segmentation in the wild, in IEEE International Conference on Image Processing, Source Code https://github.com/YBIGTA/pytorch-hair-segmentation. Accessed 1 Jan 2021
Szczuko P (2019) Deep neural networks for human pose estimation from a very low resolution depth image. Multimed Tools Appl 78:29357–29377. https://doi.org/10.1007/s11042-019-7433-7
Takacs B (2011) Immersive interactive reality: internet-based on-demand VR for cultural presentation. Virtual Reality 15(4):267–278
Takacs B, Vincze Z, Fassold H, Karakottas A, Zioulis N, Zarpalas D, Daras P (2019) Hyper 360 – towards a unified Tool set supporting next generation VR film and TV productions. J Software Eng Appl 12:127–148. https://doi.org/10.4236/jsea.2019.125009
Takacs B, Vincze Zs, Richter G (2020) MultiView Mannequins for Deep Depth Estimation in 360° Videos, in Proc. Siggraph 2020. https://doi.org/10.1145/3388770.3407410
ThingLink (2020) https://www.thinglink.com. Accessed 1 Jan 2021
Tripathi S, Ranade S, Tyagi A, Agrawal A (2020) PoseNet3D: Unsupervised 3D Human Shape and Pose Estimation. https://arxiv.org/abs/2003.03473. Accessed 1 Jan 2021
Viar360 (2020) https://www.viar360.com. Accessed 1 Jan 2021
VRDirect (2021) https://www.vrdirect.com. Accessed 1 Jan 2021
Wang FE, Hu HN, Cheng HT, Lin JT, Yang ST, Shih ML, Chu HK, Sun M (2018) Self-Supervised Learning of Depth and Camera Motion from 360° Videos, in Proc ACCV 2018 https://arxiv.org/abs/1811.05304. Accessed 1 Jan 2021
Wang Q, Zhang L, Bertinetto L, Hu W, Torr PHS. (2019) Fast Online Object Tracking and Segmentation: A Unifying Approach, in IEEE conference on computer vision and pattern recognition (CVPR), Source Code: https://github.com/STVIR/pysot. Accessed 1 Jan 2021
Wikipedia (2020), List of Map Projections, https://en.wikipedia.org/wiki/List_of_map_projections. Accessed 1 Jan 2021
Wonda VR (2020) https://www.wondavr.com. Accessed 1 Jan 2021
Wu D et al (2019) Deep learning-based methods for person re-identification: a comprehensive review. Neurocomputing 337(14):354–371
Xiu Y, Li J, Wang H, Fang Y, Lu C (2018) Pose flow: efficient online pose tracking, British Machine Vision Conference, Source Code https://github.com/MVIG-SJTU/AlphaPose. Accessed 1 Jan 2021
Yan Y, Berthelier A, Duffner S, Naturel X, Garcia C, Chateau T (2019) Human hair segmentation in the wild using deep shape prior, in CVPR19 workshop on computer vision for augmented and virtual reality (CV4ARVR), Long Beach. https://yozey.github.io/Hair-Segmentation-in-the-wild/. Accessed 1 Jan 2021
Yu K, Li J, Zhang Y, Zhao Y, Xu L (2019) Image Quality Assessment for Omnidirectional Cross-reference Stitching, https://arxiv.org/abs/1904.04960. Accessed 1 Jan 2021
Zhang Z, Xu Y, Yu J, Gao S (2018) Saliency detection in 360° videos, in Proceedings of the European Conference on Computer Vision, Source Code: https://github.com/svip-lab/Saliency-Detection-in-360-Videos. Accessed 1 Jan 2021
Zioulis N, Karakottas A, Zarpalas D, Alvarez F, Daras P (2019) Spherical view synthesis for self-supervised 360° depth estimation, in Proc international conference on 3D vision (3DV). https://arxiv.org/pdf/1909.08112.pdf. Accessed 1 Jan 2021
Acknowledgments
This work has received funding from the European Union’s Horizon 2020 research and innovation programme, grant n° 761934, Hyper360 (“Enriching 360 media with 3D storytelling and personalisation elements”).
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Takacs, B., Vincze, Z. Deep authoring - an AI Tool set for creating immersive MultiMedia experiences. Multimed Tools Appl 80, 31105–31134 (2021). https://doi.org/10.1007/s11042-020-10275-z