Skip to main content

Scene Reconstruction for Storytelling in 360\(^\circ \) Videos

Part of the Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering book series (LNICST,volume 273)


In immersive and interactive contents like 360-degrees videos the user has the control of the camera, which poses a challenge to the content producer since the user may look to where he wants. This paper presents the concept and first steps towards the development of a framework that provides a workflow for storytelling in 360-degrees videos. With the proposed framework it will be possible to connect a sound to a source and taking advantage of binaural audio it will help to redirect the user attention to where the content producer wants. To present this kind of audio, the scenario must be mapped/reconstructed so as to understand how the objects contained in it interfere with the sound waves propagation. The proposed system is capable of reconstructing the scenario from a stereoscopic, still or motion 360-degrees video when provided in an equirectangular projection. The system also incorporates a module that detects and tracks people, mapping their motion from the real world to the 3D world. In this document we describe all the technical decisions and implementations of the system. To the best of our knowledge, this system is the only that has shown the capability to reconstruct scenarios in a large variety of 360 footage and allows for the creation of binaural audio from that reconstruction.


  • 360 videos
  • Storytelling
  • Scene Reconstruction
  • Binaural sound
  • Computer vision
  • Computer graphics
  • 3D reconstruction
  • People detection
  • People tracking

This is a preview of subscription content, access via your institution.

Buying options

USD   29.95
Price excludes VAT (USA)
  • DOI: 10.1007/978-3-030-16447-8_12
  • Chapter length: 10 pages
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
USD   44.99
Price excludes VAT (USA)
  • ISBN: 978-3-030-16447-8
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
Softcover Book
USD   60.00
Price excludes VAT (USA)
Fig. 1.
Fig. 2.
Fig. 3.
Fig. 4.
Fig. 5.
Fig. 6.
Fig. 7.
Fig. 8.
Fig. 9.


  1. 360-degree projection. Accessed 11 June 2018

  2. S3A spatial audio. Accessed 24 July 2018

  3. Akbarzadeh, A., et al.: Towards urban 3D reconstruction from video. In: Proceedings of the Third International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT 2006). IEEE Computer Society (2006)

    Google Scholar 

  4. Andriluka, M., Roth, S., Schiele, B.: People-tracking-by-detection and people-detection-by-tracking. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8, June 2008.

  5. Breuers, S., Beyer, L., Rafi, U., Leibe, B.: Detection-tracking for efficient person analysis: the DetTA pipeline. CoRR abs/1804.10134 (2018).

  6. Chen, W., Fu, Z., Yang, D., Deng, J.: Single-image depth perception in the wild. CoRR abs/1604.03901 (2016).

  7. Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. CoRR abs/1406.2283 (2014).

  8. Ewerth, R., et al.: Estimating relative depth in single images via rankboost. In: 2017 IEEE International Conference on Multimedia and Expo (ICME), pp. 919–924, July 2017.

  9. Geiger, A., Ziegler, J., Stiller, C.: StereoScan: Dense 3D reconstruction in real-time. In: 2011 IEEE Intelligent Vehicles Symposium (IV), pp. 963–968, June 2011.

  10. Grani, F., et al.: Audio-visual attractors for capturing attention to the screens when walking in cave systems. In: 2014 IEEE VR Workshop: Sonic Interaction in Virtual Environments (SIVE), pp. 3–6, March 2014.

  11. Güler, R.A., Neverova, N., Kokkinos, I.: DensePose: dense human pose estimation in the wild. CoRR abs/1802.00434 (2018).

  12. Jafari, O.H., Mitzel, D., Leibe, B.: Real-time RGB-D based people detection and tracking for mobile robots and head-worn cameras. In: 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 5636–5643, May 2014.

  13. Kim, A., Eustice, R.M.: Active visual slam for robotic area coverage: theory and experiment. Int. J. Robot. Res. 34(4–5), 457–475 (2015).

    CrossRef  Google Scholar 

  14. Kim, H., Hilton, A.: Block world reconstruction from spherical stereo image pairs. Comput. Vis. Image Underst. 139, 104–121 (2015).

    CrossRef  Google Scholar 

  15. Kim, H., et al.: Acoustic room modelling using a spherical camera for reverberant spatial audio objects. In: Audio Engineering Society Convention 142, May 2017.

  16. Lin, T., Goyal, P., Girshick, R.B., He, K., Dollár, P.: Focal loss for dense object detection. CoRR abs/1708.02002 (2017).

  17. Liu, B., Gould, S., Koller, D.: Single image depth estimation from predicted semantic labels. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1253–1260, June 2010.

  18. Liu, C., Yang, J., Ceylan, D., Yumer, E., Furukawa, Y.: PlaneNet: piece-wise planar reconstruction from a single RGB image. CoRR abs/1804.06278 (2018).

  19. Polic, M., Förstner, W., Pajdla, T.: Fast and accurate camera covariance computation for large 3D reconstruction (2018)

    Google Scholar 

  20. Riazuelo, L., Montano, L., Montiel, J.M.M.: Semantic visual SLAM in populated environments. In: 2017 European Conference on Mobile Robots (ECMR), pp. 1–7, Sept 2017.

  21. Saurer, O., Pollefeys, M., Hee Lee, G.: Sparse to dense 3D reconstruction from rolling shutter images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3337–3345 (2016)

    Google Scholar 

  22. Spinello, L., Arras, K.O., Triebel, R., Siegwart, R.: A layered approach to people detection in 3D range data. In: Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, pp. 1625–1630. AAAI Press (2010).

  23. Stewart, R., Andriluka, M., Ng, A.Y.: End-to-end people detection in crowded scenes. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

    Google Scholar 

  24. Sturm, P., Triggs, B.: A factorization based algorithm for multi-image projective structure and motion. In: Buxton, B., Cipolla, R. (eds.) ECCV 1996. LNCS, vol. 1065, pp. 709–720. Springer, Heidelberg (1996).

    CrossRef  Google Scholar 

  25. Toldo, R., Gherardi, R., Farenzena, M., Fusiello, A.: Hierarchical structure-and-motion recovery from uncalibrated images. Comput. Vis. Image Underst. 140, 127–143 (2015).

    CrossRef  Google Scholar 

  26. Tome, D., Russell, C., Agapito, L.: Lifting from the deep: convolutional 3D pose estimation from a single image. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017

    Google Scholar 

  27. Wong, K.H., Chang, M.M.Y.: 3D model reconstruction by constrained bundle adjustment. In: 2004 Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004, vol. 3, pp. 902–905, Aug 2004.

  28. Yu, R., Russell, C., Campbell, N.D.F., Agapito, L.: Direct, dense, and deformable: Template-based non-rigid 3D reconstruction from RGB video. In: The IEEE International Conference on Computer Vision (ICCV), December 2015

    Google Scholar 

  29. Yu, S., Lhuillier, M.: Incremental reconstruction of manifold surface from sparse visual mapping. In: 2012 Second International Conference on 3D Imaging, Modeling, Processing, Visualization Transmission, pp. 293–300, October 2012.

  30. Zakharov, A.A., Barinov, A.E.: An algorithm for 3D-object reconstruction from video using stereo correspondences. Pattern Recogn. Image Anal. 25(1), 117–121 (2015).

    CrossRef  Google Scholar 

  31. Zhang, G., Liu, J., Li, H., Chen, Y.Q., Davis, L.S.: Joint human detection and head pose estimation via multistream networks for RGB-D videos. IEEE Signal Process. Lett. 24(11), 1666–1670 (2017).

    CrossRef  Google Scholar 

  32. Zhou, H., Zou, D., Pei, L., Ying, R., Liu, P., Yu, W.: StructSLAM: visual SLAM with building structure lines. IEEE Trans. Veh. Technol. 64(4), 1364–1375 (2015).

    CrossRef  Google Scholar 

  33. Zhuo, W., Salzmann, M., He, X., Liu, M.: Indoor scene structure analysis for single image depth estimation. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 614–622, June 2015.

Download references


This article is a result of the project CHIC - Cooperative Holistic view on Internet and Content (project n\(^\circ \) 24498), supported by the European Regional Development Fund (ERDF), through the Competitiveness and Internationalization Operational Program (COMPETE 2020) under the PORTUGAL 2020 Partnership Agreement.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Gonçalo Pinheiro .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and Permissions

Copyright information

© 2019 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Verify currency and authenticity via CrossMark

Cite this paper

Pinheiro, G., Alves, N., Magalhães, L., Agrellos, L., Guevara, M. (2019). Scene Reconstruction for Storytelling in 360\(^\circ \) Videos. In: Cortez, P., Magalhães, L., Branco, P., Portela, C., Adão, T. (eds) Intelligent Technologies for Interactive Entertainment. INTETAIN 2018. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 273. Springer, Cham.

Download citation

  • DOI:

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-16446-1

  • Online ISBN: 978-3-030-16447-8

  • eBook Packages: Computer ScienceComputer Science (R0)