
Visual-Inertial Object Detection and Mapping

  • Xiaohan Fei
  • Stefano Soatto
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11215)

Abstract

We present a method to populate an unknown environment with models of previously seen objects, placed in a Euclidean reference frame that is inferred causally and on-line using monocular video along with inertial sensors. The system we implement returns a sparse point cloud for the regions of the scene that are visible but not recognized as a previously seen object, and a detailed object model and its pose in the Euclidean frame otherwise. The system includes bottom-up and top-down components, whereby deep networks trained for detection provide likelihood scores for object hypotheses proposed by a nonlinear filter, whose state serves as memory. Additional networks provide likelihood scores for edges, which complement detection networks trained to be invariant to small deformations. We test our algorithm on existing datasets and also introduce the VISMA dataset, which provides ground-truth pose, a point-cloud map, and object models, along with time-stamped inertial measurements.
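The bottom-up/top-down interplay described above can be read as a hypothesis-scoring loop: the filter maintains a set of object (pose, identity) hypotheses, which act as memory, and each hypothesis is re-weighted by likelihood scores from a detection network and an edge network. The sketch below is only an illustration of that loop under assumed interfaces, not the authors' implementation; `Hypothesis`, `detection_log_likelihood`, `edge_log_likelihood`, and the `motion_noise` parameter are hypothetical placeholders.

```python
# Minimal sketch (assumed, not the paper's code) of a sampling-based filter that
# maintains object-pose hypotheses and re-weights them with likelihood scores
# from a detection network and an edge network.
import numpy as np
from dataclasses import dataclass


@dataclass
class Hypothesis:
    pose: np.ndarray      # 6-DoF object pose in the gravity-aligned Euclidean frame
    shape_id: int         # index into a database of previously seen object models
    log_weight: float     # log-weight of this hypothesis


def detection_log_likelihood(image, pose, shape_id):
    # Placeholder: score the projection of model `shape_id` at `pose` with a
    # CNN detector (e.g., class score inside the projected bounding box).
    return 0.0


def edge_log_likelihood(image, pose, shape_id):
    # Placeholder: score alignment between projected model contours and an
    # edge map, complementing the deformation-invariant detector.
    return 0.0


def filter_step(hypotheses, image, motion_noise=0.01, rng=None):
    """One predict-update-resample step over object hypotheses."""
    rng = rng or np.random.default_rng()
    for h in hypotheses:
        # Predict: diffuse each pose; objects are static, so the noise models
        # residual pose uncertainty relative to the moving camera.
        h.pose = h.pose + rng.normal(scale=motion_noise, size=6)
        # Update: fold the bottom-up likelihood scores into the weight.
        h.log_weight += detection_log_likelihood(image, h.pose, h.shape_id)
        h.log_weight += edge_log_likelihood(image, h.pose, h.shape_id)
    # Normalize and resample to keep the hypothesis set concentrated.
    w = np.array([h.log_weight for h in hypotheses])
    w = np.exp(w - w.max())
    w /= w.sum()
    idx = rng.choice(len(hypotheses), size=len(hypotheses), p=w)
    return [Hypothesis(hypotheses[i].pose.copy(), hypotheses[i].shape_id, 0.0) for i in idx]


# Example usage: initialize hypotheses from a detector proposal and run one step.
hyps = [Hypothesis(pose=np.zeros(6), shape_id=0, log_weight=0.0) for _ in range(100)]
hyps = filter_step(hyps, image=None)
```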

Notes

Acknowledgment

Research supported by ONR N00014-17-1-2072 and ARO W911NF-17-1-0304.

Supplementary material

Supplementary material 1 (PDF, 489 KB)

Supplementary material 2 (MP4, 88341 KB)


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. UCLA Vision Lab, University of California, Los Angeles, USA
