Self-supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12365)

Abstract

Self-supervised monocular depth estimation presents a powerful method to obtain 3D scene information from single camera images, which is trainable on arbitrary image sequences without requiring depth labels, e.g., from a LiDAR sensor. In this work we present a new self-supervised semantically-guided depth estimation (SGDepth) method to deal with moving dynamic-class (DC) objects, such as moving cars and pedestrians, which violate the static-world assumptions typically made during training of such models. Specifically, we propose (i) mutually beneficial cross-domain training of (supervised) semantic segmentation and self-supervised depth estimation with task-specific network heads, (ii) a semantic masking scheme providing guidance to prevent moving DC objects from contaminating the photometric loss, and (iii) a detection method for frames with non-moving DC objects, from which the depth of DC objects can be learned. We demonstrate the performance of our method on several benchmarks, in particular on the Eigen split, where we exceed all baselines without test-time refinement.
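To illustrate contribution (ii), below is a minimal sketch (not the authors' released code) of how a semantic masking scheme can keep moving dynamic-class (DC) pixels from contaminating the photometric loss: pixels whose predicted segmentation class is dynamic (e.g., car, pedestrian) are excluded before averaging. Names such as `photometric_error`, `masked_photometric_loss`, and `dc_class_ids` are illustrative assumptions, and the error term is simplified to L1 where the paper uses an SSIM/L1 mix.

```python
# Hypothetical sketch of semantic masking for the photometric loss,
# under the assumptions stated above. PyTorch.
import torch

def photometric_error(target, warped):
    """Per-pixel photometric error between the target frame and the frame
    warped from a neighboring view via predicted depth and pose.
    Simplified to L1 here; the paper combines SSIM and L1."""
    return (target - warped).abs().mean(dim=1, keepdim=True)   # (B,1,H,W)

def masked_photometric_loss(target, warped, seg_logits, dc_class_ids):
    """Average the photometric error over non-DC pixels only, so moving
    cars/pedestrians cannot contaminate the self-supervised loss."""
    err = photometric_error(target, warped)                    # (B,1,H,W)
    pred_classes = seg_logits.argmax(dim=1, keepdim=True)      # (B,1,H,W)
    dc_mask = torch.zeros_like(pred_classes, dtype=torch.bool)
    for c in dc_class_ids:                                     # e.g. car, person
        dc_mask |= pred_classes == c
    static_mask = (~dc_mask).float()
    # Normalize by the number of static pixels (guard against empty masks).
    return (err * static_mask).sum() / static_mask.sum().clamp(min=1.0)

# Usage with random tensors standing in for real data:
B, C, H, W = 2, 3, 192, 640
target = torch.rand(B, C, H, W)
warped = torch.rand(B, C, H, W)
seg_logits = torch.randn(B, 20, H, W)   # 20 Cityscapes-style classes
loss = masked_photometric_loss(target, warped, seg_logits,
                               dc_class_ids=[11, 13])  # illustrative IDs
```

Note that masking alone would prevent the network from ever learning DC-object depth, which is why the paper pairs it with contribution (iii): detecting frames in which DC objects are static, so their depth can still be supervised by the photometric loss there.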

Supplementary material

Supplementary material 1 (PDF, 13.2 MB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

Technische Universität Braunschweig, Braunschweig, Germany
