
Learning Deeply Supervised Good Features to Match for Dense Monocular Reconstruction

  • Chamara Saroj Weerasekera
  • Ravi Garg
  • Yasir Latif
  • Ian Reid
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11365)

Abstract

Visual SLAM (Simultaneous Localization and Mapping) methods typically rely on handcrafted visual features or raw RGB values for establishing correspondences between images. These features, while suitable for sparse mapping, often lead to ambiguous matches in texture-less regions when performing dense reconstruction, due to the aperture problem. In this work, we explore the use of learned features for the matching task in dense monocular reconstruction. We propose a novel convolutional neural network (CNN) architecture, along with a deeply supervised feature learning scheme, for pixel-wise regression of visual descriptors from an image that are best suited for dense monocular SLAM. In particular, our learning scheme minimizes a multi-view matching cost-volume loss with respect to the regressed features at multiple stages within the network, explicitly learning contextual features that are suited to dense matching along epipolar lines between images captured by a moving monocular camera. We integrate the learned features from our model for depth estimation inside a real-time dense monocular SLAM framework, where the photometric error is replaced by our learned descriptor error. Our extensive evaluation on several challenging indoor datasets demonstrates greatly improved accuracy in the dense reconstructions of well-established dense SLAM systems such as DTAM, without compromising their real-time performance.
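The paper's code is not reproduced here; the following is a minimal sketch (not the authors' implementation) of the core idea the abstract describes: building a plane-sweep matching cost volume from learned per-pixel descriptors of a reference and a source view, and supervising the descriptors so that the best match lies at the true depth. It assumes PyTorch, known camera intrinsics K (and its inverse K_inv), a known relative pose (R, t) between the two views, a discretised set of candidate inverse depths, and ground-truth inverse-depth bin indices for supervision. The function names, the dot-product matching cost, and the cross-entropy supervision are illustrative assumptions; the paper applies its cost-volume loss at multiple stages of the network (deep supervision) and uses the descriptor error in place of photometric error inside the SLAM depth optimisation.

```python
# Minimal sketch of a multi-view matching cost-volume loss over learned
# per-pixel descriptors. Not the authors' code; names such as
# build_cost_volume and cost_volume_loss are hypothetical.
import torch
import torch.nn.functional as F

def build_cost_volume(feat_ref, feat_src, K, K_inv, R, t, inv_depths):
    """Plane-sweep cost volume: for each candidate inverse depth, warp the
    source-view descriptors into the reference view and score the match."""
    B, C, H, W = feat_ref.shape
    device = feat_ref.device
    # Reference-view pixel grid in homogeneous coordinates, shape (3, H*W).
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)
    rays = K_inv @ pix  # back-projected viewing rays
    costs = []
    for inv_d in inv_depths:
        # 3D points at this inverse depth, transformed into the source frame
        # and projected back onto the source image plane.
        pts = R @ (rays / inv_d) + t.reshape(3, 1)
        proj = K @ pts
        uv = proj[:2] / proj[2:].clamp(min=1e-6)
        # Normalise pixel coordinates to [-1, 1] for grid_sample.
        u = 2.0 * uv[0] / (W - 1) - 1.0
        v = 2.0 * uv[1] / (H - 1) - 1.0
        grid = torch.stack([u, v], dim=-1).reshape(1, H, W, 2).expand(B, -1, -1, -1)
        warped = F.grid_sample(feat_src, grid, align_corners=True)
        # Per-pixel matching cost: negative descriptor dot product
        # (lower cost = better match), shape (B, H, W).
        costs.append(-(feat_ref * warped).sum(dim=1))
    return torch.stack(costs, dim=1)  # (B, D, H, W)

def cost_volume_loss(cost_volume, gt_inv_depth_idx):
    """Encourage the cost minimum to coincide with the ground-truth
    inverse-depth bin: cross-entropy over the depth dimension, where
    gt_inv_depth_idx is a long tensor of shape (B, H, W)."""
    logits = -cost_volume  # low cost -> high score
    return F.cross_entropy(logits, gt_inv_depth_idx)
```

In the same spirit, the descriptor-based per-pixel costs computed above are what would replace the photometric error term in a DTAM-style depth optimisation at run time, while the classification-style loss would only be used during training of the feature network.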

Keywords

Mapping · Visual learning · 3D reconstruction · SLAM

Supplementary material

Supplementary material 1 (pdf 4757 KB)

Supplementary material 2 (mp4 7229 KB)

Supplementary material 3 (mp4 4235 KB)

Supplementary material 4 (mp4 6001 KB)

Supplementary material 5 (mp4 6947 KB)

References

  1. Bloesch, M., Czarnowski, J., Clark, R., Leutenegger, S., Davison, A.J.: CodeSLAM: learning a compact, optimisable representation for dense visual SLAM. arXiv preprint arXiv:1804.00874 (2018)
  2. Choy, C.B., Gwak, J., Savarese, S., Chandraker, M.: Universal correspondence network. In: Advances in Neural Information Processing Systems 30 (2016)
  3. Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2650–2658 (2015)
  4. Engel, J., Koltun, V., Cremers, D.: Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. 40, 611–625 (2017)
  5. Engel, J., Schöps, T., Cremers, D.: LSD-SLAM: large-scale direct monocular SLAM. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8690, pp. 834–849. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10605-2_54
  6. Fácil, J.M., Concha, A., Montesano, L., Civera, J.: Deep single and direct multi-view depth fusion. CoRR abs/1611.07245 (2016). http://arxiv.org/abs/1611.07245
  7. Garg, R., B.G., V.K., Carneiro, G., Reid, I.: Unsupervised CNN for single view depth estimation: geometry to the rescue. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 740–756. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_45
  8. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. (IJRR) 32, 1231–1237 (2013)
  9. Handa, A., Whelan, T., McDonald, J., Davison, A.: A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In: IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, May 2014
  10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
  11. Jia, Y., et al.: Caffe: convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
  12. Kendall, A., et al.: End-to-end learning of geometry and context for deep stereo regression. CoRR abs/1703.04309 (2017). http://arxiv.org/abs/1703.04309
  13. Klein, G., Murray, D.: Parallel tracking and mapping for small AR workspaces. In: 6th IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2007), pp. 225–234. IEEE (2007)
  14. Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: 2016 Fourth International Conference on 3D Vision (3DV), pp. 239–248. IEEE (2016)
  15. Liu, C., Yuen, J., Torralba, A.: SIFT Flow: dense correspondence across scenes and its applications. In: Hassner, T., Liu, C. (eds.) Dense Image Correspondences for Computer Vision, pp. 15–49. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-23048-1_2
  16. Mur-Artal, R., Montiel, J.M.M., Tardós, J.D.: ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Trans. Robot. 31(5), 1147–1163 (2015)
  17. Mur-Artal, R., Tardós, J.D.: ORB-SLAM2: an open-source SLAM system for monocular, stereo and RGB-D cameras. CoRR abs/1610.06475 (2016). http://arxiv.org/abs/1610.06475
  18. Newcombe, R.A., et al.: KinectFusion: real-time dense surface mapping and tracking. In: IEEE ISMAR. IEEE (2011)
  19. Newcombe, R.A., Lovegrove, S.J., Davison, A.J.: DTAM: dense tracking and mapping in real-time. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp. 2320–2327. IEEE (2011)
  20. Prisacariu, V., et al.: A framework for the volumetric integration of depth images. arXiv e-prints (2014)
  21. Ranjan, A., Black, M.J.: Optical flow estimation using a spatial pyramid network. CoRR abs/1611.00850 (2016). http://arxiv.org/abs/1611.00850
  22. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
  23. Schmidt, T., Newcombe, R., Fox, D.: Self-supervised visual descriptor learning for dense correspondence. IEEE Robot. Autom. Lett. 2(2), 420–427 (2017)
  24. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33715-4_54
  25. Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D.: A benchmark for the evaluation of RGB-D SLAM systems. In: Proceedings of the International Conference on Intelligent Robots and Systems (IROS) (2012)
  26. Tateno, K., Tombari, F., Laina, I., Navab, N.: CNN-SLAM: real-time dense monocular SLAM with learned depth prediction. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6565–6574. IEEE (2017)
  27. Ummenhofer, B., et al.: DeMoN: depth and motion network for learning monocular stereo. CoRR abs/1612.02401 (2016). http://arxiv.org/abs/1612.02401
  28. Weerasekera, C.S., Latif, Y., Garg, R., Reid, I.: Dense monocular reconstruction using surface normals. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2524–2531, May 2017. https://doi.org/10.1109/ICRA.2017.7989293
  29. Xie, S., Tu, Z.: Holistically-nested edge detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1395–1403 (2015)
  30. Yi, K.M., Trulls, E., Lepetit, V., Fua, P.: LIFT: learned invariant feature transform. CoRR abs/1603.09114 (2016). http://arxiv.org/abs/1603.09114
  31. Zbontar, J., LeCun, Y.: Stereo matching by training a convolutional neural network to compare image patches. J. Mach. Learn. Res. 17(1–32), 2 (2016)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Chamara Saroj Weerasekera (1, 2)
  • Ravi Garg (1, 2)
  • Yasir Latif (1, 2)
  • Ian Reid (1, 2)
  1. University of Adelaide, Adelaide, Australia
  2. ARC Centre of Excellence for Robotic Vision, Brisbane, Australia
