OmniDepth: Dense Depth Estimation for Indoors Spherical Panoramas

  • Nikolaos Zioulis
  • Antonis Karakottas
  • Dimitrios Zarpalas
  • Petros Daras
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11210)

Abstract

Recent work on depth estimation has so far focused only on projective images, ignoring 360° content, which is now increasingly and more easily produced. We show that monocular depth estimation models trained on traditional images produce sub-optimal results on omnidirectional images, demonstrating the need to train directly on 360° datasets, which, however, are hard to acquire. In this work, we circumvent the challenges associated with acquiring high-quality 360° datasets with ground-truth depth annotations by re-using recently released large-scale 3D datasets and re-purposing them to 360° via rendering. The resulting dataset, which is considerably larger than similar projective datasets, is publicly offered to the community to enable future research in this direction. We use this dataset to learn the task of depth estimation from 360° images in an end-to-end fashion. We show promising results both on our synthesized data and on unseen realistic images.
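The re-purposing step described above rests on the standard equirectangular (latitude-longitude) projection: a 3D scene is rendered to a 360° panorama by casting one ray per pixel over the full sphere. The sketch below is our own minimal illustration of that pixel-to-ray mapping, not the authors' rendering code; the function name equirect_rays and the axis convention (x right, y up, z forward) are assumptions.

```python
import numpy as np

def equirect_rays(width, height):
    """Map every pixel of an equirectangular panorama to a unit ray
    direction on the sphere (longitude/latitude parameterization).
    Illustrative sketch only; names and conventions are assumed."""
    # Pixel centers -> normalized coordinates in [0, 1)
    u = (np.arange(width) + 0.5) / width
    v = (np.arange(height) + 0.5) / height
    lon = (u - 0.5) * 2.0 * np.pi   # longitude (azimuth) in [-pi, pi)
    lat = (0.5 - v) * np.pi         # latitude (elevation) in (-pi/2, pi/2)
    lon, lat = np.meshgrid(lon, lat)  # (height, width) grids
    # Spherical -> Cartesian unit vectors
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)  # (height, width, 3)

# Usage: every returned direction is a unit vector, so a renderer can
# intersect each ray with the scene and store hit distances as depth.
rays = equirect_rays(512, 256)
assert np.allclose(np.linalg.norm(rays, axis=-1), 1.0)
```

Rendering a textured 3D scan along these rays yields a color panorama and, from the ray-hit distances, a perfectly aligned dense depth map, which is what makes repurposed 3D datasets attractive as 360° ground truth.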

Keywords

Omnidirectional media · 360° · Spherical panorama · Scene understanding · Depth estimation · Synthetic dataset · Learning with virtual data

Notes

Acknowledgements

This work was supported by and received funding from the European Union's Horizon 2020 Framework Programme project Hyper360, under Grant Agreement No. 761934 (http://www.hyper360.eu/). We also gratefully acknowledge the support of NVIDIA through a hardware donation.

Supplementary material

Supplementary material 1: 474211_1_En_28_MOESM1_ESM.pdf (PDF, 7623 KB)

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Centre for Research and Technology Hellas (CERTH), Information Technologies Institute (ITI), Visual Computing Lab (VCL), Thessaloniki, Greece