Seeing the Un-Scene: Learning Amodal Semantic Maps for Room Navigation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12363)

Abstract

We introduce a learning-based approach for room navigation using semantic maps. Our proposed architecture learns to predict top-down belief maps of regions that lie beyond the agent's field of view, modeling the architectural and stylistic regularities of houses. First, we train a model to generate amodal semantic top-down maps that indicate beliefs about the location, size, and shape of rooms by learning the underlying architectural patterns in houses. Next, we use these maps to predict a point that lies in the target room and train a policy to navigate to that point. We empirically demonstrate that by predicting semantic maps, the model learns common correlations found in houses and generalizes to novel environments. We also demonstrate that reducing room navigation to point navigation further improves performance.
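The two-stage pipeline described in the abstract lends itself to a short sketch: an encoder-decoder predicts per-room belief maps over a top-down grid, and the most confident cell of the target room's map becomes a point goal. The PyTorch code below is a minimal illustration under assumed shapes, not the authors' implementation; `AmodalMapPredictor`, `select_point_goal`, the layer sizes, and the room label set are all names and choices introduced here for illustration.

```python
# Minimal sketch (PyTorch): predict amodal per-room belief maps from an
# egocentric top-down observation, then reduce room navigation to point
# navigation by selecting the most confident cell of the target room's map.
# All names and shapes are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

NUM_ROOM_TYPES = 8  # assumed label set, e.g. bedroom, kitchen, bathroom, ...

class AmodalMapPredictor(nn.Module):
    """Encoder-decoder mapping an egocentric observation to amodal
    per-room belief maps over the same H x W grid."""
    def __init__(self, in_channels=3, num_rooms=NUM_ROOM_TYPES):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_rooms, 4, stride=2, padding=1),
        )

    def forward(self, obs):
        # Returns logits; sigmoid yields a per-cell belief that each cell
        # belongs to each room type, including rooms not yet observed.
        return self.decoder(self.encoder(obs))

def select_point_goal(belief_logits, target_room):
    """Pick the (row, col) grid cell with the highest belief for the
    target room; this point becomes the goal for a PointNav policy."""
    beliefs = torch.sigmoid(belief_logits[:, target_room])  # (B, H, W)
    idx = beliefs.flatten(1).argmax(dim=1)                  # (B,)
    w = belief_logits.shape[-1]
    return torch.stack((idx // w, idx % w), dim=1)          # (B, 2)

# Usage: one forward pass on a dummy observation.
model = AmodalMapPredictor()
obs = torch.randn(1, 3, 64, 64)     # placeholder egocentric map input
logits = model(obs)                 # (1, NUM_ROOM_TYPES, 64, 64)
goal = select_point_goal(logits, target_room=2)
print(goal)                         # grid coordinates of the PointGoal
```

In the paper's terms, the first stage supplies the amodal belief maps and the second stage converts the room goal into a point goal that a standard point-navigation policy can then pursue.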

Keywords

Embodied AI · Room navigation

Notes

Acknowledgements

We thank Abhishek Kadian, Oleksandr Maksymets, and Manolis Savva for their help with Habitat, and Arun Mallya and Alexander Sax for feedback on the manuscript. The Georgia Tech effort was supported in part by NSF, AFRL, DARPA, ONR YIPs, ARO PECASE, Amazon. Prof. Darrell’s group was supported in part by DoD, NSF, BAIR, and BDD. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.

Supplementary material

Supplementary material 1 (pdf 244 KB)

Supplementary material 2 (mp4 781 KB)

Supplementary material 3 (mp4 6036 KB)

Supplementary material 4 (mp4 7178 KB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Facebook AI Research, Menlo Park, USA
  2. University of California, Berkeley, Berkeley, USA
  3. Georgia Institute of Technology, Atlanta, USA
