Stereo Vision-Based Semantic 3D Object and Ego-Motion Tracking for Autonomous Driving

  • Peiliang Li
  • Tong Qin
  • Shaojie Shen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11206)

Abstract

We propose a stereo vision-based approach for tracking the camera ego-motion and 3D semantic objects in dynamic autonomous driving scenarios. Instead of directly regressing the 3D bounding box using end-to-end approaches, we propose to use easy-to-label 2D detection and discrete viewpoint classification together with a light-weight semantic inference method to obtain rough 3D object measurements. Based on object-aware-aided camera pose tracking, which is robust in dynamic environments, in combination with our novel dynamic object bundle adjustment (BA) approach to fuse temporal sparse feature correspondences and the semantic 3D measurement model, we obtain 3D object pose, velocity and anchored dynamic point cloud estimation with instance accuracy and temporal consistency. The performance of our proposed method is demonstrated in diverse scenarios. Both the ego-motion estimation and object localization are compared with state-of-the-art solutions.
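To illustrate what a "rough 3D object measurement" from a 2D detection can look like, the sketch below uses the standard pinhole relation Z = f·H/h to recover an approximate object depth from the pixel height of a 2D bounding box and a class-level height prior. This is a hypothetical, simplified illustration of the general idea, not the paper's actual semantic inference method; the focal length and height prior below are assumed example values (KITTI-like).

```python
def rough_depth_from_bbox(bbox_h_px, obj_height_m=1.5, fy_px=721.5):
    """Approximate object depth via the pinhole model: Z = f * H / h.

    bbox_h_px    -- height of the 2D detection box in pixels
    obj_height_m -- class-level object height prior in metres (assumed)
    fy_px        -- camera focal length in pixels (assumed, KITTI-like)
    """
    return fy_px * obj_height_m / bbox_h_px

# A 72-pixel-tall car detection with a 1.5 m height prior and
# f = 721.5 px sits at roughly 15 m.
depth = rough_depth_from_bbox(72.0)
print(round(depth, 2))  # prints 15.03
```

Such a coarse, prior-based estimate is deliberately cheap; the paper's dynamic object bundle adjustment then refines it by fusing temporal feature correspondences with the semantic measurement model.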

Keywords

Semantic SLAM · 3D object localization · Visual odometry

Notes

Acknowledgment

This work was supported by the Hong Kong Research Grants Council Early Career Scheme under project no. 26201616. The authors also thank Xiaozhi Chen for providing the 3DOP [1] results on the KITTI tracking dataset.

References

  1. Chen, X., et al.: 3D object proposals for accurate object class detection. In: Advances in Neural Information Processing Systems, pp. 424–432 (2015)
  2. Chen, X., Kundu, K., Zhang, Z., Ma, H., Fidler, S., Urtasun, R.: Monocular 3D object detection for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2147–2156 (2016)
  3. Qin, T., Li, P., Shen, S.: VINS-Mono: a robust and versatile monocular visual-inertial state estimator. arXiv preprint arXiv:1708.03852 (2017)
  4. Mur-Artal, R., Tardós, J.D.: ORB-SLAM2: an open-source SLAM system for monocular, stereo and RGB-D cameras. IEEE Trans. Robot. 33(5), 1255–1262 (2017)
  5. Engel, J., Schöps, T., Cremers, D.: LSD-SLAM: large-scale direct monocular SLAM. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8690, pp. 834–849. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10605-2_54
  6. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)
  7. Frost, D.P., Kähler, O., Murray, D.W.: Object-aware bundle adjustment for correcting monocular scale drift. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 4770–4776. IEEE (2016)
  8. Sucar, E., Hayet, J.B.: Probabilistic global scale estimation for MonoSLAM based on generic object detection. In: Computer Vision and Pattern Recognition Workshops (CVPRW) (2017)
  9. Bowman, S.L., Atanasov, N., Daniilidis, K., Pappas, G.J.: Probabilistic data association for semantic SLAM. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1722–1729. IEEE (2017)
  10. Atanasov, N., Zhu, M., Daniilidis, K., Pappas, G.J.: Semantic localization via the matrix permanent. In: Proceedings of Robotics: Science and Systems, vol. 2 (2014)
  11. Pillai, S., Leonard, J.J.: Monocular SLAM supported object recognition. In: Proceedings of Robotics: Science and Systems, vol. 2 (2015)
  12. Dong, J., Fei, X., Soatto, S.: Visual-inertial-semantic scene representation for 3-D object detection. arXiv preprint arXiv:1606.03968 (2016)
  13. Civera, J., Gálvez-López, D., Riazuelo, L., Tardós, J.D., Montiel, J.: Towards semantic SLAM using a monocular camera. In: 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1277–1284. IEEE (2011)
  14. Salas-Moreno, R.F., Newcombe, R.A., Strasdat, H., Kelly, P.H., Davison, A.J.: SLAM++: simultaneous localisation and mapping at the level of objects. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1352–1359. IEEE (2013)
  15. Gálvez-López, D., Salas, M., Tardós, J.D., Montiel, J.: Real-time monocular object SLAM. Robot. Auton. Syst. 75, 435–449 (2016)
  16. Bao, S.Y., Savarese, S.: Semantic structure from motion. In: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2025–2032. IEEE (2011)
  17. Bao, S.Y., Bagra, M., Chao, Y.W., Savarese, S.: Semantic structure from motion with points, regions, and objects. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2703–2710. IEEE (2012)
  18. Kundu, A., Li, Y., Dellaert, F., Li, F., Rehg, J.M.: Joint semantic segmentation and 3D reconstruction from monocular video. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 703–718. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10599-4_45
  19. Vineet, V., et al.: Incremental dense semantic stereo fusion for large-scale semantic scene reconstruction. In: 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 75–82. IEEE (2015)
  20. Li, X., Belaroussi, R.: Semi-dense 3D semantic mapping from monocular SLAM. arXiv preprint arXiv:1611.04144 (2016)
  21. McCormac, J., Handa, A., Davison, A., Leutenegger, S.: SemanticFusion: dense 3D semantic mapping with convolutional neural networks. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 4628–4635. IEEE (2017)
  22. Bao, S.Y., Chandraker, M., Lin, Y., Savarese, S.: Dense object reconstruction with semantic priors. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1264–1271. IEEE (2013)
  23. Zia, M.Z., Stark, M., Schiele, B., Schindler, K.: Detailed 3D representations for object recognition and modeling. IEEE Trans. Pattern Anal. Mach. Intell. 35(11), 2608–2623 (2013)
  24. Xiang, Y., Choi, W., Lin, Y., Savarese, S.: Data-driven 3D voxel patterns for object category recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1903–1911 (2015)
  25. Mousavian, A., Anguelov, D., Flynn, J., Košecká, J.: 3D bounding box estimation using deep learning and geometry. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5632–5640. IEEE (2017)
  26. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)
  27. Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: an efficient alternative to SIFT or SURF. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp. 2564–2571. IEEE (2011)
  28. Gu, T.: Improved trajectory planning for on-road self-driving vehicles via combined graph search, optimization and topology analysis. Ph.D. thesis, Carnegie Mellon University (2017)
  29. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2012)
  30. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. (IJRR) (2013)
  31. Cordts, M., et al.: The Cityscapes dataset. In: CVPR Workshop on the Future of Datasets in Vision, vol. 1, March 2015
  32. Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D.: A benchmark for the evaluation of RGB-D SLAM systems. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 573–580. IEEE (2012)

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong