
APAC-Net: Unsupervised Learning of Depth and Ego-Motion from Monocular Video

  • Rui Lin
  • Yao Lu
  • Guangming Lu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11935)

Abstract

We propose a novel unsupervised method, the Attention-Pixel and Attention-Channel Network (APAC-Net), for monocular learning of scene depth and ego-motion. Our model uses only monocular image sequences and requires no additional sensor information, such as IMU or GPS, for supervision. An attention mechanism is employed in APAC-Net to improve the network's efficiency; specifically, three attention modules are proposed to adjust feature weights during training. Moreover, to minimize the effect of noise produced during reconstruction, an image-reconstruction loss based on PSNR, \(L_{PSNR}\), is used to evaluate reconstruction quality. In addition, because depth estimation tends to fail for objects close to the camera, a temporal-consistency loss \(L_{Temp}\) between adjacent frames and a scale-based loss \(L_{Scale}\) across different scales are proposed. Experimental results show that APAC-Net performs well on both the depth and ego-motion tasks, and even outperforms prior methods on several metrics on KITTI and Cityscapes.
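The abstract does not give the exact formulation of the PSNR-based reconstruction loss, so the following is only a minimal sketch of one plausible reading: the PSNR is computed between the view-synthesized (reconstructed) frame and the real target frame, and the loss is the negative PSNR so that maximizing reconstruction quality minimizes the loss. The function name, tensor shapes, and value range are illustrative assumptions, not taken from the paper.

```python
import torch

def psnr_reconstruction_loss(reconstructed, target, max_val=1.0, eps=1e-8):
    """Hypothetical sketch of an image-reconstruction loss based on PSNR.

    `reconstructed` is the target frame synthesized from an adjacent frame
    using the predicted depth and ego-motion; `target` is the real frame.
    Both are tensors of shape (B, 3, H, W) with values in [0, max_val].
    Higher PSNR means a better reconstruction, so the loss is -PSNR.
    """
    # Per-image mean squared error over channels and spatial dimensions.
    mse = torch.mean((reconstructed - target) ** 2, dim=(1, 2, 3))
    # Standard PSNR definition; eps guards against a perfect reconstruction.
    psnr = 10.0 * torch.log10(max_val ** 2 / (mse + eps))
    return -psnr.mean()
```

Under this reading, the temporal-consistency loss \(L_{Temp}\) and scale-based loss \(L_{Scale}\) would be added to this term with weighting factors to form the full training objective; the paper itself should be consulted for the actual definitions and weights.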

Keywords

Depth estimation · Ego-motion estimation · Attention mechanism


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Harbin Institute of Technology (Shenzhen), Shenzhen, China
