Deep authoring - an AI Tool set for creating immersive MultiMedia experiences

  • 1173: Interaction in Immersive Experiences

Multimedia Tools and Applications

Abstract

We introduce a fully automated 360° video processing pipeline using a hierarchical combination of Artificial Intelligence (AI) modules to create immersive volumetric XR experiences. Two critical production tasks (person segmentation and depth estimation) are addressed with a parallel Deep Neural Network (DNN) pipeline that combines instance segmentation, person detection, pose estimation, camera stabilization, neural tracking, 3D face detection, hair masking, and monocular 360° depth computation in a single, robust tool set. To facilitate the rapid uptake of these techniques we provide a detailed review of AI-based methods for these problems (complete with links to recommended open source implementations) as well as references to existing authoring tools in the market. Our key contributions include a method for creating semi-synthetic data sets for data auto-augmentation, which we use to generate over 3.8 million images as part of a concise evaluation and subsequent retraining of DNNs for person detection tasks. Furthermore, we apply the same techniques to develop a spherical DNN for monocular depth estimation with a Free Viewpoint Video (FVV) capture system and a novel method to generate 3D human shapes and pose mannequins for training. To evaluate the performance of our AI authoring tool set we address four challenging production tasks and demonstrate the practical use of our solution with videos showing processed output.
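The parallel DNN pipeline described above fans each stabilized equirectangular frame out to independent analysis modules (segmentation, detection, pose, face, hair, depth) and collects their outputs. The following is a minimal orchestration sketch in Python; the module names and callable interfaces are our own illustrative assumptions, not the authors' actual tool set:

    from concurrent.futures import ThreadPoolExecutor

    def analyze_frame(frame, modules):
        # Run the independent DNN modules on one equirectangular frame
        # in parallel; collect each output under its module's name.
        with ThreadPoolExecutor(max_workers=len(modules)) as pool:
            futures = {name: pool.submit(fn, frame)
                       for name, fn in modules.items()}
            return {name: fut.result() for name, fut in futures.items()}

    def process_video(frames, modules, stabilize):
        # Stabilization runs first because the downstream modules assume
        # a stabilized view; the remaining modules are mutually independent.
        return [analyze_frame(stabilize(f), modules) for f in frames]

Hypothetical usage: modules = {"segmentation": instance_segmenter, "detection": person_detector, "pose": pose_estimator, "face3d": face_detector_3d, "hair": hair_masker, "depth": depth_net_360}, where each value is any callable that takes a frame and returns that module's output (mask, boxes, keypoints, depth map, and so on).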



Notes

  1. Combination of real photographs and 3D models in a virtually rendered scene (see the compositing sketch after these notes).

  2. GitHub source code links (https://github.com/) are also included to aid the rapid uptake of these techniques by the immersive media production and research community at large.

  3. In our work we focus on the central task of person segmentation in the scene, but the techniques described can be readily generalized to any object class as required.

  4. The actual settings are relative to the coordinate system of the animation space but simulate physical distances and angular changes with respect to the virtual spherical camera (see the coordinate-mapping sketch after these notes).

  5. We note that while the Face3D algorithm generally works well for a broad range of 360° footage, in this particular video the rather low pixel resolution and difficult poses, such as people looking down with many features invisible, caused it to produce somewhat noisy results that required an additional pass to clean up the data.

  6. Note that neither the Samsung Gear360 nor the GoPro6 camera rig has stereoscopic capabilities or enough overlap of the lenses to compute disparity maps.
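Note 1 above defines the semi-synthetic compositing used for data auto-augmentation. A minimal alpha-blending sketch in Python/NumPy, under our own assumptions (the function name and array layout are illustrative, not the authors' implementation):

    import numpy as np

    def composite(person_rgba, background_rgb, x, y):
        # Alpha-blend a segmented person cut-out (RGBA, uint8, with the
        # person mask in the alpha channel) onto a rendered background
        # (RGB, uint8), top-left corner at pixel (x, y).
        h, w = person_rgba.shape[:2]
        alpha = person_rgba[..., 3:].astype(np.float32) / 255.0
        fg = person_rgba[..., :3].astype(np.float32)
        bg = background_rgb[y:y + h, x:x + w].astype(np.float32)
        blended = alpha * fg + (1.0 - alpha) * bg
        background_rgb[y:y + h, x:x + w] = blended.astype(np.uint8)
        return background_rgb

Note 4 describes angular settings simulated with respect to a virtual spherical camera. For reference, the standard mapping from viewing angles to pixel coordinates in an equirectangular frame (a textbook projection formula, not code from the paper) is:

    import math

    def sphere_to_equirect(yaw, pitch, width, height):
        # Map viewing angles (radians) relative to the virtual spherical
        # camera to pixel coordinates in a width x height equirectangular
        # frame; yaw in [-pi, pi), pitch in [-pi/2, pi/2].
        x = (yaw / (2 * math.pi) + 0.5) * width
        y = (0.5 - pitch / math.pi) * height
        return x, y

    # Example: a 10-degree pan in a 4096x2048 frame shifts the point of
    # interest by about 114 pixels horizontally.
    x0, _ = sphere_to_equirect(0.0, 0.0, 4096, 2048)
    x1, _ = sphere_to_equirect(math.radians(10), 0.0, 4096, 2048)
    print(round(x1 - x0))  # ~114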


Acknowledgments

This work has received funding from the European Union’s Horizon 2020 research and innovation programme, grant n° 761934, Hyper360 (“Enriching 360 media with 3D storytelling and personalisation elements”).

Author information


Corresponding author

Correspondence to Barnabas Takacs.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

ESM 1 (MP4 7,588 kb)

ESM 2 (MP4 63,834 kb)

ESM 3 (MP4 42,384 kb)

ESM 4 (MP4 47,089 kb)

ESM 5 (MP4 128,642 kb)

ESM 6 (MP4 148,995 kb)

ESM 7 (MP4 74,639 kb)

ESM 8 (MP4 33,757 kb)

ESM 9 (MP4 32,220 kb)

ESM 10 (MP4 35,117 kb)


Cite this article

Takacs, B., Vincze, Z. Deep authoring - an AI Tool set for creating immersive MultiMedia experiences. Multimed Tools Appl 80, 31105–31134 (2021). https://doi.org/10.1007/s11042-020-10275-z

