Abstract
We develop a language-guided navigation task set in a continuous 3D environment where agents must execute low-level actions to follow natural language navigation directions. By being situated in continuous environments, this setting lifts a number of assumptions implicit in prior work that represents environments as a sparse graph of panoramas with edges corresponding to navigability. Specifically, our setting drops the presumptions of known environment topologies, short-range oracle navigation, and perfect agent localization. To contextualize this new task, we develop models that mirror many of the advances made in prior settings, as well as single-modality baselines. While some of these advances transfer, we find significantly lower absolute performance in the continuous setting, suggesting that performance in prior 'navigation-graph' settings may be inflated by these strong implicit assumptions. Code is available at jacobkrantz.github.io/vlnce.
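To make the contrast with nav-graph navigation concrete, the sketch below models an agent that moves through a continuous 2D plane via short, fixed low-level actions rather than teleporting between panorama nodes. The specific step sizes (0.25 m forward, 15-degree turns) and the `AgentState`/`step` names are illustrative assumptions for this sketch, not an API from the paper.

```python
import math
from dataclasses import dataclass

# Illustrative step sizes for a low-level action space; these particular
# values are assumptions for this sketch, not taken from the abstract above.
FORWARD_STEP = 0.25   # metres per forward action
TURN_ANGLE = 15.0     # degrees per turn action

@dataclass
class AgentState:
    x: float = 0.0
    y: float = 0.0
    heading: float = 0.0  # degrees; 0 points along the +x axis

def step(state: AgentState, action: str) -> AgentState:
    """Apply one low-level action. Unlike a nav-graph hop, each call moves
    the agent only a short, fixed amount through continuous space."""
    if action == "forward":
        rad = math.radians(state.heading)
        return AgentState(state.x + FORWARD_STEP * math.cos(rad),
                          state.y + FORWARD_STEP * math.sin(rad),
                          state.heading)
    if action == "turn-left":
        return AgentState(state.x, state.y, (state.heading + TURN_ANGLE) % 360)
    if action == "turn-right":
        return AgentState(state.x, state.y, (state.heading - TURN_ANGLE) % 360)
    if action == "stop":
        return state  # episode ends; evaluation checks proximity to the goal
    raise ValueError(f"unknown action: {action}")

# Covering the distance of a single ~2 m nav-graph edge takes many actions:
s = AgentState()
for _ in range(8):
    s = step(s, "forward")
print(s.x)  # → 2.0
```

Under this framing, a single nav-graph edge expands into a long sequence of low-level decisions, and localization errors can accumulate along the way, which is one reason performance drops in the continuous setting.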
Notes
- 1. Details included from correspondence with the author of [4].
- 2. 68% forward, 15% turn-left, 15% turn-right, and 2% stop.
- 3. Note that the VLN test set is not publicly available except through this leaderboard.
References
LoCoBot: an open source low cost robot (2019). https://locobot-website.netlify.com/
Anderson, P., et al.: On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757 (2018)
Anderson, P., Shrivastava, A., Parikh, D., Batra, D., Lee, S.: Chasing ghosts: instruction following as Bayesian state tracking. In: NeurIPS (2019)
Anderson, P., et al.: Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In: CVPR (2018)
Chang, A., et al.: Matterport3D: learning from RGB-D data in indoor environments. In: 3DV (2017). MatterPort3D dataset license available at: http://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf
Chen, H., Suhr, A., Misra, D., Snavely, N., Artzi, Y.: Touchdown: natural language navigation and spatial reasoning in visual street environments. In: CVPR (2019)
Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., Batra, D.: Embodied question answering. In: CVPR (2018)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009)
Fried, D., et al.: Speaker-follower models for vision-and-language navigation. In: NeurIPS (2018)
Gordon, D., Kembhavi, A., Rastegari, M., Redmon, J., Fox, D., Farhadi, A.: IQA: visual question answering in interactive environments. In: CVPR (2018)
Gupta, S., Davidson, J., Levine, S., Sukthankar, R., Malik, J.: Cognitive mapping and planning for visual navigation. In: CVPR (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Hermann, K.M., Malinowski, M., Mirowski, P., Banki-Horvath, A., Anderson, K., Hadsell, R.: Learning to follow directions in street view. In: AAAI (2020)
Kadian, A., et al.: Are we making real progress in simulated environments? Measuring the sim2real gap in embodied visual navigation. In: IROS (2020)
Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
Kohlbrecher, S., Meyer, J., von Stryk, O., Klingauf, U.: A flexible and scalable SLAM system with full 3D motion estimation. In: SSRR. IEEE, November 2011
Ma, C.Y., et al.: Self-monitoring navigation agent via auxiliary progress estimation. In: ICLR (2019)
Magalhaes, G., Jain, V., Ku, A., Ie, E., Baldridge, J.: Effective and general evaluation for instruction conditioned navigation using dynamic time warping. arXiv preprint arXiv:1907.05446 (2019)
Savva, M., et al.: Habitat: a platform for embodied AI research. In: ICCV (2019)
Misra, D., Bennett, A., Blukis, V., Niklasson, E., Shatkhin, M., Artzi, Y.: Mapping instructions to actions in 3D environments with visual goal prediction. In: EMNLP (2018)
Mur-Artal, R., Montiel, J.M.M., Tardos, J.D.: ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Trans. Robot. 31(5), 1147–1163 (2015)
Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS (2019)
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: EMNLP (2014)
Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: AISTATS (2011)
Stentz, A.: Optimal and efficient path planning for partially known environments. In: Hebert, M.H., Thorpe, C., Stentz, A. (eds.) Intelligent Unmanned Ground Vehicles. SECS, vol. 388, pp. 203–220. Springer, Boston (1997). https://doi.org/10.1007/978-1-4615-6325-9_11
Tan, H., Yu, L., Bansal, M.: Learning to navigate unseen environments: back translation with environmental dropout. In: NAACL HLT (2019)
Thomason, J., Gordon, D., Bisk, Y.: Shifting the baseline: single modality performance on visual navigation & QA. In: NAACL HLT (2019)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)
Wang, X., et al.: Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In: CVPR (2019)
Wijmans, E., et al.: Embodied question answering in photorealistic environments with point cloud perception. In: CVPR (2019)
Wijmans, E., et al.: DD-PPO: learning near-perfect pointgoal navigators from 2.5 billion frames. In: ICLR (2020)
Acknowledgements
We thank Anand Koshy for his implementation of nDTW. The GT effort was supported in part by NSF, AFRL, DARPA, ONR YIPs, ARO PECASE, Amazon. The OSU effort was supported in part by DARPA. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Krantz, J., Wijmans, E., Majumdar, A., Batra, D., Lee, S. (2020). Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12373. Springer, Cham. https://doi.org/10.1007/978-3-030-58604-1_7
DOI: https://doi.org/10.1007/978-3-030-58604-1_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58603-4
Online ISBN: 978-3-030-58604-1
eBook Packages: Computer Science, Computer Science (R0)