
Sim-2-Sim Transfer for Vision-and-Language Navigation in Continuous Environments

  • Conference paper in Computer Vision – ECCV 2022
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13699)

Abstract

Recent work in Vision-and-Language Navigation (VLN) has presented two environmental paradigms with differing realism – the standard VLN setting built on topological environments where navigation is abstracted away [3], and the VLN-CE setting where agents must navigate continuous 3D environments using low-level actions [21]. Despite sharing the high-level task and even the underlying instruction-path data, performance on VLN-CE lags behind VLN significantly. In this work, we explore this gap by transferring an agent from the abstract environment of VLN to the continuous environment of VLN-CE. We find that this sim-2-sim transfer is highly effective, improving over the prior state of the art in VLN-CE by +12% success rate. While this demonstrates the potential for this direction, the transfer does not fully retain the original performance of the agent in the abstract setting. We present a sequence of experiments to identify what differences result in performance degradation, providing clear directions for further improvement.
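To make the abstraction gap concrete: in the topological VLN setting the agent "teleports" between adjacent graph nodes, while in VLN-CE each hop must be realized as a sequence of low-level actions. The following is a minimal illustrative sketch only; the helper function, the 0.25 m step size, and the 15° turn angle are assumptions modeled on common VLN-CE configurations, not this paper's exact interface.

```python
import math

def waypoint_to_actions(agent_pos, agent_heading, waypoint,
                        step=0.25, turn=math.radians(15)):
    """Greedily convert a 2D waypoint into discrete low-level actions.

    Hypothetical helper: rotates toward the waypoint in fixed turn
    increments, then moves forward in fixed-length steps. Step and
    turn sizes are assumed defaults, not values from the paper.
    """
    dx = waypoint[0] - agent_pos[0]
    dy = waypoint[1] - agent_pos[1]
    target = math.atan2(dy, dx)
    # Signed smallest angle from the current heading to the target.
    delta = math.atan2(math.sin(target - agent_heading),
                       math.cos(target - agent_heading))
    actions = []
    n_turns = round(delta / turn)
    actions += ["TURN_LEFT" if n_turns > 0 else "TURN_RIGHT"] * abs(n_turns)
    dist = math.hypot(dx, dy)
    actions += ["MOVE_FORWARD"] * round(dist / step)
    return actions

# A 1 m straight-ahead waypoint becomes four forward steps.
print(waypoint_to_actions((0.0, 0.0), 0.0, (1.0, 0.0)))
```

Even this toy conversion shows why transfer is lossy: quantized turns and steps cannot land exactly on a graph node, and real continuous environments add collisions and drift on top.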


Notes

  1. github.com/jacobkrantz/Sim2Sim-VLNCE.

  2. As defined by the Matterport3D Simulator used in VLN.

  3. eval.ai/web/challenges/challenge-page/97.

  4. eval.ai/web/challenges/challenge-page/719.

References

  1. Anderson, P., et al.: On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757 (2018)

  2. Anderson, P., et al.: Sim-to-real transfer for vision-and-language navigation. In: CoRL (2020)

  3. Anderson, P., et al.: Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In: CVPR (2018)

  4. Blukis, V., Terme, Y., Niklasson, E., Knepper, R.A., Artzi, Y.: Learning to map natural language instructions to physical quadcopter control using simulated flight. In: CoRL (2020)

  5. Chang, A., et al.: Matterport3D: learning from RGB-D data in indoor environments. In: 3DV (2017). Matterport3D dataset license: http://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf

  6. Chaplot, D.S., Gandhi, D., Gupta, S., Gupta, A., Salakhutdinov, R.: Learning to explore using active neural SLAM. In: ICLR (2020)

  7. Chen, K., Chen, J.K., Chuang, J., Vázquez, M., Savarese, S.: Topological planning with transformers for vision-and-language navigation. In: CVPR (2021)

  8. Chen, S., Guhur, P.L., Schmid, C., Laptev, I.: History aware multimodal transformer for vision-and-language navigation. In: NeurIPS (2021)

  9. Cuturi, M.: Sinkhorn distances: lightspeed computation of optimal transport. In: NeurIPS (2013)

  10. Deitke, M., et al.: RoboTHOR: an open simulation-to-real embodied AI platform. In: CVPR (2020)

  11. Fried, D., et al.: Speaker-follower models for vision-and-language navigation. In: NeurIPS (2018)

  12. Gordon, D., Kadian, A., Parikh, D., Hoffman, J., Batra, D.: SplitNet: sim2sim and task2task transfer for embodied visual navigation. In: CVPR (2019)

  13. Hahn, M., Chaplot, D.S., Tulsiani, S., Mukadam, M., Rehg, J.M., Gupta, A.: No RL, no simulation: learning to navigate without navigating. In: NeurIPS (2021)

  14. Hao, W., Li, C., Li, X., Carin, L., Gao, J.: Towards learning a generic agent for vision-and-language navigation via pre-training. In: CVPR (2020)

  15. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)

  16. Hong, Y., Wu, Q., Qi, Y., Rodriguez-Opazo, C., Gould, S.: VLN BERT: a recurrent vision-and-language BERT for navigation. In: CVPR (2021)

  17. Irshad, M.Z., Ma, C.Y., Kira, Z.: Hierarchical cross-modal agent for robotics vision-and-language navigation. In: ICRA (2021)

  18. Irshad, M.Z., Mithun, N.C., Seymour, Z., Chiu, H.P., Samarasekera, S., Kumar, R.: SASRA: semantically-aware spatio-temporal reasoning agent for vision-and-language navigation in continuous environments. In: ICPR (2022)

  19. Kadian, A., et al.: Are we making real progress in simulated environments? Measuring the sim2real gap in embodied visual navigation. In: IROS (2020)

  20. Krantz, J., Gokaslan, A., Batra, D., Lee, S., Maksymets, O.: Waypoint models for instruction-guided navigation in continuous environments. In: ICCV (2021)

  21. Krantz, J., Wijmans, E., Majumdar, A., Batra, D., Lee, S.: Beyond the nav-graph: vision-and-language navigation in continuous environments. In: ECCV (2020)

  22. Majumdar, A., Shrivastava, A., Lee, S., Anderson, P., Parikh, D., Batra, D.: Improving vision-and-language navigation with image-text pairs from the web. In: ECCV (2020)

  23. Savva, M., et al.: Habitat: a platform for embodied AI research. In: ICCV (2019)

  24. Quigley, M., et al.: ROS: an open-source robot operating system. In: ICRA Workshop on Open Source Software (2009)

  25. Raychaudhuri, S., Wani, S., Patel, S., Jain, U., Chang, A.X.: Language-aligned waypoint (LAW) supervision for vision-and-language navigation in continuous environments. In: EMNLP (2021)

  26. Tan, H., Yu, L., Bansal, M.: Learning to navigate unseen environments: back translation with environmental dropout. In: NAACL-HLT (2019)

  27. Wang, X., et al.: Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In: CVPR (2019)

  28. Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: a 10 million image database for scene recognition. TPAMI (2017)


Acknowledgements

This work was supported in part by the DARPA Machine Common Sense program. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.

Author information

Corresponding author: Jacob Krantz


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Krantz, J., Lee, S. (2022). Sim-2-Sim Transfer for Vision-and-Language Navigation in Continuous Environments. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13699. Springer, Cham. https://doi.org/10.1007/978-3-031-19842-7_34


  • DOI: https://doi.org/10.1007/978-3-031-19842-7_34

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19841-0

  • Online ISBN: 978-3-031-19842-7

  • eBook Packages: Computer Science, Computer Science (R0)
