
DeepTAM: Deep Tracking and Mapping with Convolutional Neural Networks

  • Huizhong Zhou (corresponding author)
  • Benjamin Ummenhofer
  • Thomas Brox

Abstract

We present a system for dense keyframe-based camera tracking and depth map estimation that is entirely learned. For tracking, we estimate small pose increments between the current camera image and a synthetic viewpoint. This formulation significantly simplifies the learning problem and alleviates the dataset bias for camera motions. Further, we show that generating a large number of pose hypotheses leads to more accurate predictions. For mapping, we accumulate information in a cost volume centered at the current depth estimate. The mapping network then combines the cost volume and the keyframe image to update the depth prediction, thereby effectively making use of depth measurements and image-based priors. Our approach yields state-of-the-art results with few images and is robust with respect to noisy camera poses. We demonstrate that the performance of our 6 DOF tracking competes with RGB-D tracking algorithms. We compare favorably against strong classic and deep-learning-powered dense depth estimation algorithms.
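The mapping step described above centres a plane-sweep cost volume at the current depth estimate. As a rough, self-contained illustration of that idea (not the authors' implementation), the NumPy sketch below warps a second frame into the keyframe for a band of depth hypotheses and records a per-pixel photometric cost. The band width, the number of depth labels, the nearest-neighbour sampling, and the sum-of-absolute-differences cost are all assumptions made for this sketch.

```python
import numpy as np

def warp_to_keyframe(frame, R, t, K, depth):
    """Backproject keyframe pixels at the hypothesised depth, transform
    them with the relative pose (R, t), reproject into `frame`, and
    sample it with nearest-neighbour lookup (an illustrative choice)."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1).astype(np.float64)
    # Backproject to 3D in the keyframe camera, then move to the other view.
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    pts = R @ pts + t.reshape(3, 1)
    # Project into the other image; clamp lookups to the image border.
    proj = K @ pts
    u = np.clip(np.round(proj[0] / proj[2]), 0, w - 1).astype(int)
    v = np.clip(np.round(proj[1] / proj[2]), 0, h - 1).astype(int)
    return frame[v, u].reshape(h, w, -1)

def build_cost_volume(keyframe, frame, R, t, K, depth_estimate,
                      num_labels=32, band=0.5):
    """Accumulate photometric matching costs over depth labels sampled
    in a multiplicative band around the current depth estimate."""
    h, w = depth_estimate.shape
    volume = np.empty((num_labels, h, w), dtype=np.float32)
    for i, off in enumerate(np.linspace(-band, band, num_labels)):
        depth = depth_estimate * (1.0 + off)          # depth hypothesis
        warped = warp_to_keyframe(frame, R, t, K, depth)
        # Sum of absolute differences over colour channels.
        volume[i] = np.abs(warped.astype(np.float32)
                           - keyframe.astype(np.float32)).sum(axis=-1)
    return volume
```

In the paper's pipeline a network consumes such a volume together with the keyframe image to refine the depth; the point of the sketch is only that the volume encodes, per pixel, how well each depth hypothesis explains the second view.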

Keywords

Camera tracking · Multi-view stereo · ConvNets

Notes

Supplementary material

Supplementary material 1 (mp4 16091 KB)

Supplementary material 2 (pdf 6299 KB)

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. University of Freiburg, Freiburg, Germany
  2. Intel Labs, Munich, Germany
