
An Exploration of Embodied Visual Exploration

Abstract

Embodied computer vision considers perception for robots in novel, unstructured environments. Of particular importance is the embodied visual exploration problem: how might a robot equipped with a camera scope out a new environment? Despite the progress thus far, many basic questions pertinent to this problem remain unanswered: (i) What does it mean for an agent to explore its environment well? (ii) Which methods work well, and under which assumptions and environmental settings? (iii) Where do current approaches fall short, and where might future work seek to improve? Seeking answers to these questions, we first present a taxonomy for existing visual exploration algorithms and create a standard framework for benchmarking them. We then perform a thorough empirical study of the four state-of-the-art paradigms using the proposed framework with two photorealistic simulated 3D environments, a state-of-the-art exploration architecture, and diverse evaluation metrics. Our experimental results offer insights and suggest new performance metrics and baselines for future work in visual exploration. Code, models and data are publicly available.



Notes

  1. See Appendix G for more details on how landmarks are mined.

  2. We select area visited because it is a simple metric that does not require additional semantic annotations.

  3. The NRC values may be larger than 1.0 for learned methods. This is due to domain differences between the inferred occupancy used in training and the GT occupancy used in testing.

  4. RANSAC implementation: https://github.com/facebookresearch/exploring_exploration/blob/master/exploring_exploration/utils/pose_estimation.py

  5. See https://github.com/s-gupta/map-plan-baseline for a well-tuned implementation.

  6. Examples of AVD object instances: https://www.cs.unc.edu/~ammirato/active_vision_dataset_website/get_data.html

References

  1. Aloimonos, J., Weiss, I., & Bandyopadhyay, A. (1988). Active vision. International Journal of Computer Vision, 1, 333–356.


  2. Ammirato, P., Poirson, P., Park, E., Kosecka, J. & Berg, A. (2016). A dataset for developing and benchmarking active vision. In: ICRA.

  3. Anderson, P., Chang, A., Chaplot, D. S., Dosovitskiy, A., Gupta, S., Koltun, V., Kosecka, J., Malik, J., Mottaghi, R., Savva, M. & Zamir, A. R. (2018a). On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757.

  4. Anderson, P., Chang, A., Chaplot, D. S., Dosovitskiy, A., Gupta, S., Koltun, V., Kosecka, J., Malik, J., Mottaghi, R. & Savva, M., et al. (2018b). On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757.

  5. Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S. & van den Hengel, A. (2018c). Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  6. Armeni, I., Sax, A., Zamir, A. R. & Savarese, S. (2017). Joint 2D-3D-Semantic Data for Indoor Scene Understanding. ArXiv e-prints.

  7. Bajcsy, R. (1988). Active perception. Proceedings of the IEEE.

  8. Ballard, D. H. (1991). Animate vision. Artificial intelligence.

  9. Batra, D., Gokaslan, A., Kembhavi, A., Maksymets, O., Mottaghi, R., Savva, M., Toshev, A. & Wijmans, E. (2020). Objectnav revisited: On evaluation of embodied agents navigating to objects.

  10. Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D. & Munos, R. (2016). Unifying count-based exploration and intrinsic motivation. In: Advances in Neural Information Processing Systems.

  11. Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller, U., Zhang, J., et al. (2016). End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316.

  12. Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T. & Efros, A. A. (2018a). Large-scale study of curiosity-driven learning. In: arXiv:1808.04355.

  13. Burda, Y., Edwards, H., Storkey, A. & Klimov, O. (2018b) Exploration by random network distillation. arXiv preprint arXiv:1810.12894.

  14. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A. & Zagoruyko, S. (2020). End-to-end object detection with transformers. arXiv preprint arXiv:2005.12872.

  15. Cassandra, A. R., Kaelbling, L. P. & Kurien, J. A. (1996). Acting under uncertainty: Discrete bayesian models for mobile-robot navigation. In: Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems. IROS’96, vol. 2, pp. 963–972. IEEE.

  16. Chang, A., Dai, A., Funkhouser, T., Nießner, M., Savva, M., Song, S., Zeng, A. & Zhang, Y. (2017). Matterport3d: Learning from rgb-d data in indoor environments. In: Proceedings of the International Conference on 3D Vision (3DV). MatterPort3D dataset license available at: http://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf.

  17. Chaplot, D. S., Gandhi, D., Gupta, S., Gupta, A., & Salakhutdinov, R. (2019). Learning To Explore Using Active Neural SLAM. In International Conference on Learning Representations.

  18. Chen, T., Gupta, S. & Gupta, A. (2019). Learning exploration policies for navigation. In: International Conference on Learning Representations. https://openreview.net/pdf?id=SyMWn05F7.

  19. Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

  20. Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D. & Batra, D. (2018a). Embodied Question Answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  21. Das, A., Gkioxari, G., Lee, S., Parikh, D. & Batra, D. (2018b). Neural modular control for embodied question answering. In: Conference on Robot Learning, pp. 53–62.

  22. Devlin, J., Chang, M.W., Lee, K. & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

  23. Duan, Y., Chen, X., Houthooft, R., Schulman, J. & Abbeel, P. (2016). Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338.

  24. Fang, K., Toshev, A., Fei-Fei, L. & Savarese, S. (2019). Scene memory transformer for embodied agents in long-horizon tasks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 538–547.

  25. Fischler, M. A., & Bolles, R. C. (1981). Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), 381–395.


  26. Giusti, A., Guzzi, J., Cireşan, D. C., He, F. L., Rodríguez, J. P., Fontana, F., Faessler, M., Forster, C., Schmidhuber, J., Di Caro, G., et al. (2016). A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters.

  27. Goyal, P., Mahajan, D., Gupta, A. & Misra, I. (2019). Scaling and benchmarking self-supervised visual representation learning. arXiv preprint arXiv:1905.01235.

  28. Gupta, S., Davidson, J., Levine, S., Sukthankar, R. & Malik, J. (2017a). Cognitive mapping and planning for visual navigation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2616–2625.

  29. Gupta, S., Fouhey, D., Levine, S. & Malik, J. (2017b). Unifying map and landmark based representations for visual navigation. arXiv preprint arXiv:1712.08125.

  30. Haber, N., Mrowca, D., Fei-Fei, L. & Yamins, D. L. (2018). Learning to play with intrinsically-motivated self-aware agents. arXiv preprint arXiv:1802.07442.

  31. Hart, P. E., Nilsson, N. J., & Raphael, B. (1968). A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2), 100–107.


  32. He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.

  33. Henriques, J. F. & Vedaldi, A. (2018). Mapnet: An allocentric spatial memory for mapping environments. In: proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8476–8484.

  34. Isola, P., Zhu, J. Y., Zhou, T. & Efros, A. A. (2016). Image-to-image translation with conditional adversarial networks. arxiv.

  35. Jayaraman, D., & Grauman, K. (2018a). End-to-end policy learning for active visual categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(7), 1601–1614.


  36. Jayaraman, D. & Grauman, K. (2018b). Learning to look around: Intelligently exploring unseen environments for unknown tasks. In: Computer Vision and Pattern Recognition, 2018 IEEE Conference on.

  37. Kadian, A., Truong, J., Gokaslan, A., Clegg, A., Wijmans, E., Lee, S., Savva, M., Chernova, S. & Batra, D. (2019). Are we making real progress in simulated environments? measuring the sim2real gap in embodied visual navigation. arXiv preprint arXiv:1912.06321.

  38. Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al. (2017). The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.

  39. Kingma, D. P. & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

  40. Kolve, E., Mottaghi, R., Han, W., VanderBilt, E., Weihs, L., Herrasti, A., Gordon, D., Zhu, Y., Gupta, A. & Farhadi, A. (2017). AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv.

  41. Kostrikov, I. (2018). Pytorch implementations of reinforcement learning algorithms. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail.

  42. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In: European Conference on Computer Vision.

  43. Lopes, M., Lang, T., Toussaint, M. & Oudeyer, P. Y. (2012). Exploration in model-based reinforcement learning by empirically estimating learning progress. In: Advances in neural information processing systems, pp. 206–214.

  44. Lovejoy, W. S. (1991). A survey of algorithmic methods for partially observed markov decision processes. Annals of Operations Research, 28(1), 47–65.


  45. Mahmood, A. R., Korenkevych, D., Vasan, G., Ma, W. & Bergstra, J. (2018). Benchmarking reinforcement learning algorithms on real-world robots. In: Conference on Robot Learning, pp. 561–591.

  46. Malmir, M., Sikka, K., Forster, D., Movellan, J. & Cottrell, G. W. (2015). Deep Q-learning for active recognition of GERMS. In: BMVC.

  47. Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., Malik, J., Parikh, D. & Batra, D. (2019). Habitat: A Platform for Embodied AI Research. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).

  48. Mishkin, D., Dosovitskiy, A. & Koltun, V. (2019). Benchmarking classic and learned navigation in complex 3d environments. arXiv preprint arXiv:1901.10915.

  49. Narasimhan, M., Wijmans, E., Chen, X., Darrell, T., Batra, D., Parikh, D. & Singh, A. (2020). Seeing the un-scene: Learning amodal semantic maps for room navigation. arXiv preprint arXiv:2007.09841.

  50. Ostrovski, G., Bellemare, M. G., van den Oord, A. & Munos, R. (2017). Count-based exploration with neural density models. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2721–2730. JMLR.org.

  51. Oudeyer, P. Y., Kaplan, F., & Hafner, V. V. (2007). Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2), 265–286.


  52. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L. & Lerer, A. (2017). Automatic differentiation in PyTorch. In: NIPS Autodiff Workshop.

  53. Pathak, D., Agrawal, P., Efros, A. A. & Darrell, T. (2017). Curiosity-driven exploration by self-supervised prediction. In: International Conference on Machine Learning.

  54. Pathak, D., Gandhi, D. & Gupta, A. (2018). Beyond games: Bringing exploration to robots in real-world.

  55. Pathak, D., Gandhi, D. & Gupta, A. (2019) Self-supervised exploration via disagreement. arXiv preprint arXiv:1906.04161.

  56. Qi, W., Mullapudi, R. T., Gupta, S. & Ramanan, D. (2020) Learning to move with affordance maps. arXiv preprint arXiv:2001.02364.

  57. Ramakrishnan, S. K. & Grauman, K. (2018). Sidekick policy learning for active visual exploration. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 413–430.

  58. Ramakrishnan, S. K., Jayaraman, D. & Grauman, K. (2019). Emergence of exploratory look-around behaviors through active observation completion. Science Robotics 4(30). https://doi.org/10.1126/scirobotics.aaw6326. https://robotics.sciencemag.org/content/4/30/eaaw6326.

  59. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.


  60. Savinov, N., Dosovitskiy, A. & Koltun, V. (2018a). Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653.

  61. Savinov, N., Raichuk, A., Marinier, R., Vincent, D., Pollefeys, M., Lillicrap, T. & Gelly, S. (2018b). Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274.

  62. Savva, M., Chang, A. X., Dosovitskiy, A., Funkhouser, T. & Koltun, V. (2017). Minos: Multimodal indoor simulator for navigation in complex environments. arXiv preprint arXiv:1712.03931.

  63. Schmidhuber, J. (1991). Curious model-building control systems. In: Proc. international joint conference on neural networks, pp. 1458–1463.

  64. Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

  65. Seifi, S. & Tuytelaars, T. (2019). Where to look next: Unsupervised active visual exploration on 360° input. arXiv preprint arXiv:1909.10304.

  66. Soomro, K., Zamir, A. R. & Shah, M. (2012). Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.

  67. Stachniss, C., Grisetti, G., & Burgard, W. (2005). Information gain-based exploration using Rao-Blackwellized particle filters. Robotics Science and Systems, 2, 101.


  68. Straub, J., Whelan, T., Ma, L., Chen, Y., Wijmans, E., Green, S., Engel, J. J., Mur-Artal, R., Ren, C., Verma, S., Clarkson, A., Yan, M., Budge, B., Yan, Y., Pan, X., Yon, J., Zou, Y., Leon, K., Carter, N., Briales, J., Gillingham, T., Mueggler, E., Pesqueira, L., Savva, M., Batra, D., Strasdat, H. M., Nardi, R. D., Goesele, M., Lovegrove, S. & Newcombe, R. (2019). The Replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797.

  69. Strehl, A. L., & Littman, M. L. (2008). An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8), 1309–1331.


  70. Sun, Y., Gomez, F. & Schmidhuber, J. (2011). Planning to be surprised: Optimal bayesian exploration in dynamic environments. In: International Conference on Artificial General Intelligence, pp. 41–51. Springer.

  71. Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. Cambridge: MIT press.


  72. Tang, H., Houthooft, R., Foote, D., Stooke, A., Chen, O. X., Duan, Y., Schulman, J., DeTurck, F. & Abbeel, P. (2017). # exploration: A study of count-based exploration for deep reinforcement learning. In: Advances in neural information processing systems, pp. 2753–2762.

  73. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. (2017). Attention is all you need. In: Advances in neural information processing systems, pp. 5998–6008.

  74. Wijmans, E., Datta, S., Maksymets, O., Das, A., Gkioxari, G., Lee, S., Essa, I., Parikh, D. & Batra, D. (2019). Embodied question answering in photorealistic environments with point cloud perception. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6659–6668.

  75. Wilkes, D. & Tsotsos, J. K. (1992). Active object recognition. In: Computer Vision and Pattern Recognition, 1992. IEEE Computer Society Conference on.

  76. Xia, F., Zamir, A. R., He, Z., Sax, A., Malik, J. & Savarese, S. (2018). Gibson env: Real-world perception for embodied agents. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9068–9079.

  77. Yamauchi, B. (1997). A frontier-based approach for autonomous exploration.

  78. Yang, J., Ren, Z., Xu, M., Chen, X., Crandall, D., Parikh, D. & Batra, D. (2019a). Embodied visual recognition. arXiv preprint arXiv:1904.04404.

  79. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R. & Le, Q. V. (2019b). Xlnet: Generalized autoregressive pretraining for language understanding. In: Advances in neural information processing systems, pp. 5753–5763.

  80. Zamir, A. R., Wekel, T., Agrawal, P., Wei, C., Malik, J. & Savarese, S. (2016). Generic 3D representation via pose estimation and matching. In: European Conference on Computer Vision, pp. 535–553. Springer.

  81. Zhu, Y., Gordon, D., Kolve, E., Fox, D., Fei-Fei, L., Gupta, A., Mottaghi, R. & Farhadi, A. (2017). Visual Semantic Planning using Deep Successor Representations. In: Computer Vision, 2017 IEEE International Conference on.


Acknowledgements

UT Austin is supported in part by the NSF AI Institute on Foundations of Machine Learning, DARPA Lifelong Learning Machines, and the GCP Research Credits Program. The authors thank Ziad Al-Halah for the helpful discussions. The authors thank Tao Chen for clarifying details about the exploration architecture implementation.

Author information


Corresponding author

Correspondence to Santhosh K. Ramakrishnan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Code: https://github.com/facebookresearch/exploring_exploration.

Communicated by Torsten Sattler.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary material 1 (mp4 19332 KB)

Appendices


We provide additional information to support the text from the main paper. In particular, the appendix includes additional details regarding the following key topics:

Section A: Downstream task transfer

Section B: Criteria for visiting objects and landmarks

Section C: Hyperparameters for learning exploration policies

Section D: Comparative study of different coverage variants

Section E: Frontier-exploration algorithm

Section F: Generating difficult testing episodes for PointNav

Section G: Automatically mining landmarks

Section H: The factors influencing exploration performance

Section I: Measuring the correlation between different metrics

The supplementary video contains example trajectories that support some of the key conclusions arrived at in the main paper. For each method, we recap their strengths and weaknesses, and then provide sample exploration trajectories that highlight these properties.

A Downstream Task Transfer

We now elaborate on the three downstream tasks defined in Sect. 4.2: view localization, reconstruction, and PointNav navigation.

A.1 View Localization Pipeline

A.1.1 Problem Setup

An exploration agent is required to gather information from the environment that will allow it to localize key landmark views in the environment after exploration. Since the exploration agent does not know what views will be presented to it a priori, a general exploration policy that gathers useful information about the environment will perform best on this task.

More formally, the problem of view localization is as follows. The exploration agent is spawned at a random pose \(p_{0}\) in some unknown environment, and is allowed to explore the environment for a time budget \(T_{exp}\). Let \(V_{exp} = \{x_{t}\}_{t=1}^{T_{exp}}\) be the set of observations the agent received and \(\mathcal {P}_{exp} = \{p_{t}\}_{t=1}^{T_{exp}}\) be the corresponding agent poses (relative to pose \(p_{0}\)). After exploration, a set of N query views \(V = \{x_{i}^{q}\}_{i=1}^{N}\) is sampled from query poses \(\mathcal {P} = \{p_{i}^{q}\}_{i=1}^{N}\) within the same environment and presented to the agent. The agent is then required to use the information \(V_{exp},~\mathcal {P}_{exp}\) gathered during exploration to predict \(\{p_{i}^{q}\}_{i=1}^{N}\). In practice, the agent is only required to predict the translation components of the pose, i.e., \(p_{i}^{q} = (\varDelta x, \varDelta y)\) where \(\varDelta x, \varDelta y\) represent the translation along the X and Y axes from a top-down view of the environment. An agent that can successfully predict this has a good understanding of the layout of the environment, because it can point to where a large set of views in the environment was sampled from. We next review the architecture for view localization. For the sake of simplicity, we consider the case of \(N=1\) with \(x^{q}, p^{q}\) denoting the query view and pose respectively.

A.1.2 View Localization Architecture

We design a simple view localizer motivated by past work in the literature (see Fig. 14). Given the history of views observed by the agent, we retrieve the observations most visually similar to the query view using a retrieval network (Zamir et al. 2016; Savinov et al. 2018a, b). Next, we estimate the relative pose between each retrieved observation and the query view to obtain a pose estimate from each retrieval (Zamir et al. 2016). We then use the well-known RANSAC approach to disregard outliers in the pose estimates and obtain the average pose estimate (Fischler and Bolles 1981). Next, we review the individual components of the model and provide implementation details.

Episodic Memory (E) In order to store information over the course of a trajectory, we use an episodic memory E that stores the history of past observations and corresponding poses \(\{(x_{t}, p_{t})\}_{t=1}^{T}\). For efficient storage, we only store the image feature vectors obtained from the pairwise pose predictor (P) and retrieval network (R) (as described in subsequent sections) in the memory.

Pairwise pose predictor (P) We train a pairwise pose predictor that takes in pairs of images \(x_{i}, x_{j}\) that are visually similar (see next section), and predicts \(\varDelta p^{j}_{i} = \text {P}(x_{i}, x_{j})\), where \(\varDelta p^{j}_{i}\) is the relative pose of \(x_{j}\) in the egocentric coordinates of \(x_{i}\). The architecture is shown in Fig. 13. We follow a different parameterization of the pose prediction when compared to Zamir et al. 2016. Instead of directly regressing \(\varDelta p^{j}_{i}\), we first predict the distances \(d_{i}, d_{j}\) to the points of focus (central pixel) for each image, and the baseline angle \(\beta \) between the two viewpoints (see Fig. 12). The relative pose is then computed as follows:

$$\begin{aligned} \varDelta p_{i}^{j} = (d_{i} - d_{j}\text {cos}(\beta ),~- d_{j}\text {sin}(\beta ),~\beta ) \end{aligned}$$

This pose parameterization was more effective than directly regressing \(\varDelta p_{i}^{j}\), especially when the data diversity was limited (e.g., in AVD). To sample data for training the pose estimator, we opt for the sampling strategy from Zamir et al. 2016 (see Fig. 12). The predictions of \(d_{i}, d_{j}\) are cast as independent regression problems with the MSE loss \(L_{d}\). The prediction of \(\beta \) is split into two problems: predicting the baseline magnitude, and predicting the baseline sign. Baseline magnitude prediction is treated as a regression problem for AVD and as a 15-class classification problem for MP3D, with the corresponding MSE or cross entropy loss (\(L_{\text {mag}}\)). Baseline sign prediction is treated as a binary classification problem with a binary cross entropy loss \(L_{\text {sign}}\). The overall loss function is:

$$\begin{aligned} L = L_{d} + L_{\mathrm{mag}} + L_{\mathrm{sign}} \end{aligned} \tag{11}$$
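
As a rough illustration of the pose parameterization described above, the following sketch converts predicted distances and a baseline angle into a relative pose. The function name `relative_pose`, the use of radians, and the tuple layout are assumptions made for this example and are not taken from the released code.

```python
import math


def relative_pose(d_i, d_j, beta):
    """Relative pose of view j in the egocentric frame of view i.

    d_i, d_j: distances from each viewpoint to the shared point of focus.
    beta: signed baseline angle in radians (assumed convention for this sketch).
    Returns (dx, dy, dtheta) following the relative pose equation in the text.
    """
    dx = d_i - d_j * math.cos(beta)
    dy = -d_j * math.sin(beta)
    return dx, dy, beta


# With a zero baseline, view j lies on the viewing ray of view i,
# so only the forward offset d_i - d_j remains.
forward_only = relative_pose(3.0, 2.0, 0.0)

# With a 90 degree baseline, view j sits d_j to the side of the point of focus.
sideways = relative_pose(3.0, 2.0, math.pi / 2)
```

In this form the network only has to predict three scalar quantities with clear geometric meaning, which is one plausible reading of why the parameterization helps when data diversity is limited.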
Fig. 12

Pairwise pose data sampling: First, a random viewpoint \(x_{i}\) (red) is selected from the environment. A ray is cast along its viewing direction to reach the obstacle (gray) at the point of focus. A new ray (green dotted) is cast out from the point of focus and another viewpoint \(x_{j}\) (green) is selected along this ray. Since \(x_{i}\) and \(x_{j}\) share similar visual content, it should be possible to estimate the pose between these viewpoints. \(d_{i}, d_{j}\) are the distances from the viewpoints to the point of focus. \(\beta \) is the baseline angle between the two viewpoints

Fig. 13

Pairwise pose predictor: A ResNet-18 feature extractor (He et al. 2016) extracts features from both images. The concatenated features are then used by three separate networks to predict (1) the distances \(d_{1}, d_{2}\) of the points of focus of each image, (2) the magnitude \(|\beta |\), and (3) \(\text {sign}(\beta )\) of the baseline \(\beta \) (notations in Fig. 12). Parameters are shared between the ResNets (orange), and distance prediction MLPs (blue)

Retrieval network (R) We train a retrieval network R that, given a query image \(x^{q}\), can retrieve matching observations from E. Similar to Savinov et al. 2018a, we use a siamese architecture consisting of a ResNet-18 feature extractor followed by a 2-layer MLP to predict the similarity score \(\text {R}(x_{i}, x^{q})\) between images \(x_{i}\) and \(x^{q}\). Since our goal is to retrieve observations that can be used by the pairwise pose predictor (P), the positive pairs are the same pairs used for training the pose predictor. Negative pairs are obtained by choosing random images that are far away from the current location. We use the binary cross entropy loss function to train the retrieval network (Fig. 13).

Fig. 14

View localization architecture: consists of four main components. (1) Episodic memory (E) (left top) stores the sequences of the agent’s past egocentric observations along with their poses (relative to the agent’s starting viewpoint). (2) Retrieval network (R) (left bottom) compares a reference image \(x^{ref}\) with the episodic memory and retrieves the top K similar images \(\{x_{j}\}_{j=1}^{K}\). (3) Pairwise pose predictor (P) (center) estimates the real-world pose \(\hat{p}^{ref}_{j}\) of \(x^{ref}\) using each retrieval \(x_{j}, p_{j}\) and \(x^{ref}\). (4) View localizer (L) (right) combines the individual pose predictions \(\{\hat{p}^{ref}_{j}\}_{j=1}^{K}\) by filtering the noisy estimates using RANSAC to localize \(x^{ref}\)

View localizer (L) So far, we have a retrieval network R that retrieves observations that are similar to a query view \(x^{q}\), and a pairwise pose predictor P that predicts the relative pose between \(x^{q}\) and each retrieved image. The goal of the view localizer (L) is to combine predictions made by P on individual retrievals given by R to obtain a robust estimate of the final pose \(p^{q}\). The overall pipeline works as follows (see Fig. 14).

Similarity scores \(\{\text {R}(x_{t}, x^{q})\}_{t=1}^{T_{exp}}\) are computed between each \(x_{t}\) in the episodic memory E and \(x^{q}\). The sequence of scores is temporally smoothed using a median filter to remove noisy predictions. After filtering out dissimilar images in the episodic memory to obtain \(\mathcal {V}_{sim} = \{x_{t} | \text {R}(x_{t}, x^{q}) > \eta _{noise}\}\), we sample the top K observations \(\{x_{j}\}_{j=1}^{K}\) from \(\mathcal {V}_{sim}\) with the highest similarity scores. For each retrieved observation \(x_{j}\), we compute the relative pose \(\varDelta p^{q}_{j} = \text {P}(x_{j}, x^{q})\), i.e., the predicted pose of \(x^q\) in the egocentric coordinates of \(x_{j}\). We rotate and translate \(\varDelta p_{j}^{q}\) using \(p_{j}\), the real-world pose of \(x_{j}\), to get \(\hat{p}_{j}^{q}\), the real-world pose of \(x^{q}\) estimated from \(x_{j}\):

$$\begin{aligned} \hat{p}_{j}^{q} = \varvec{R}_{j} \varDelta p_{j}^{q} + \varvec{t}_{j} \end{aligned} \tag{12}$$

where \(p_{j} = \{\varvec{R}_{j}, \varvec{t}_{j}\}\) are the rotation and translation components of \(p_{j}\). Given the set of individual predictions \(\varvec{\hat{p}} = \{\hat{p}^{q}_{j}\}_{j=1}^{K}\), we use RANSAC (Fischler and Bolles 1981) to aggregate these predictions to arrive at a consistent estimate \(\hat{p}^{q}\) of the query pose.
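
The retrieval and pose-transfer steps of the view localizer can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: poses are represented as hypothetical `(x, y, theta)` tuples, the median filter pads by repeating the edge scores, and the threshold is interpreted as keeping frames whose smoothed similarity exceeds \(\eta _{noise}\). All function names are invented for this example.

```python
import math
from statistics import median


def median_filter(scores, size=3):
    """Temporal median filter; edge replication is an assumed padding choice."""
    half = size // 2
    padded = [scores[0]] * half + list(scores) + [scores[-1]] * half
    return [median(padded[t:t + size]) for t in range(len(scores))]


def select_retrievals(scores, eta_noise=0.95, k=3):
    """Indices of the top-k smoothed scores that exceed the similarity threshold."""
    smoothed = median_filter(scores)
    kept = [t for t, s in enumerate(smoothed) if s > eta_noise]
    kept.sort(key=lambda t: smoothed[t], reverse=True)
    return kept[:k]


def to_world(delta_p, p_j):
    """Rotate the relative pose by the heading of x_j, then translate by its position."""
    dx, dy, dtheta = delta_p
    x_j, y_j, theta_j = p_j
    wx = x_j + math.cos(theta_j) * dx - math.sin(theta_j) * dy
    wy = y_j + math.sin(theta_j) * dx + math.cos(theta_j) * dy
    return wx, wy, theta_j + dtheta


# Toy similarity scores for a 7-step exploration episode.
scores = [0.1, 0.97, 0.99, 0.98, 0.2, 0.96, 0.1]
retrieved = select_retrievals(scores, eta_noise=0.95, k=2)

# A query predicted 1 unit ahead of a retrieved frame that faces along +Y.
estimate = to_world((1.0, 0.0, 0.0), (2.0, 3.0, math.pi / 2))
```

The isolated high score at step 5 is suppressed by the median filter, which is the kind of noisy prediction the smoothing is meant to remove. Each retrieved index then yields one world-frame estimate via `to_world`, and these estimates are what the aggregation step consumes.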

A.1.3 Implementation Details

For AVD, we restrict the baseline angle to lie in the range \([0, 90]^{\circ }\) and depth values to lie in the range \([1.5, 3]\hbox {m}\). We sample \(\sim 1M\) training, 240K validation and 400K testing pairs. While the number of samples is high, the diversity is quite limited since there are only 20 environments in total. For MP3D, we restrict the baseline angle to lie in the range \([0, 90]^{\circ }\) and depth values to lie in the range \([1, 4]\hbox {m}\). We sample \(\sim 0.5M\) training, 88K validation and 144K testing pairs. Both the pairwise pose predictor and the retrieval network are trained (independently) using the Adam optimizer with a learning rate of 0.0001, a weight decay of 0.00001, and a batch size of 128. The ResNet-18 feature extractor is pretrained on ImageNet. The models are trained for 200 epochs and early stopping is performed using the loss on validation data. In the case of AVD, the baseline magnitude predictor is a regression model that is trained using the MSE loss. In MP3D, the baseline magnitude predictor is a 15-class classification model where each class represents a uniformly sampled bin in the range \([0, 90]^{\circ }\). The choice of classification vs. regression and the number of classes for predicting \(|\beta |\) is made based on the validation performance. In both datasets, we use a median filter of size 3 with \(\eta _{noise} = 0.95\) for the view localizer L. We sample the reference views from the set of landmark views that we used in Sect. 4.2. Since these views are distinct and do not repeat in the environment, they are less ambiguous to localize. We use the simplest version of RANSAC without added bells and whistles (Fischler and Bolles 1981). Our implementation is publicly available (see Note 4).
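
To make the aggregation step concrete, the following is a minimal sketch of a basic RANSAC-style consensus average over 2D position estimates. It is an assumption-laden illustration rather than the implementation linked in the notes: the function name `ransac_mean`, the inlier radius, the iteration count, and the fixed seed are all hypothetical choices for this example.

```python
import math
import random


def ransac_mean(points, inlier_radius=0.5, iterations=50, seed=0):
    """Consensus average of 2D position estimates.

    Each iteration picks one estimate as the hypothesis, gathers every estimate
    within `inlier_radius` of it, and keeps the largest consensus set found.
    The returned value is the mean of that set.
    """
    rng = random.Random(seed)
    best = []
    for _ in range(iterations):
        hx, hy = rng.choice(points)
        inliers = [
            (x, y) for x, y in points
            if math.hypot(x - hx, y - hy) <= inlier_radius
        ]
        if len(inliers) > len(best):
            best = inliers
    n = len(best)
    return sum(x for x, _ in best) / n, sum(y for _, y in best) / n


# Three mutually consistent estimates and one outlier from a bad retrieval.
estimates = [(1.0, 2.0), (1.1, 2.1), (0.9, 1.9), (8.0, -3.0)]
location = ransac_mean(estimates)
```

Because the model here is a single point, one sample suffices to form a hypothesis, so the outlier can only win if it gathers more support than the consistent cluster.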

A.2 Reconstruction Pipeline

Fig. 15

Reconstruction architecture: consists of three key components: (1) Observation encoder: It encodes the input observation \((x_{t}, p_{t})\) obtained during exploration into a high-dimensional feature representation \(o_{t}\). The image \(x_{t}\) and pose \(p_{t}\) are independently encoded using an ImageNet pretrained ResNet-50 and a 2-layer MLP, respectively. (2) Exploration memory: It keeps track of all the encoded features \(\{o_{t}\}_{t=1}^{T}\) obtained during exploration, (3) Reconstruction transformer: It contains a transformer encoder and decoder. The transformer encoder uses self-attention between the encoded features to refine the representation and obtain improved features \(\mathcal {F} = \{f_{t}\}_{t=1}^{T}\). The transformer decoder uses pose encoding of \(p^{q}\) to attend to the right parts of the encoded features \(\mathcal {F}\) and predicts a probability distribution over the set of concepts present at a pose \(p^{q}\)

A.2.1 Problem Setup: Reconstruction

An exploration agent is required to gather information from the environment that will allow it to reconstruct views from arbitrarily sampled poses in the environment after exploration. Since the exploration agent does not know what poses will be presented to it a priori, a general exploration policy that gathers useful information about the environment will perform best on this task. This can be viewed as the inverse of the view localization problem where views are presented after exploration and their poses must be predicted.

Following the task setup from Sect. 3.4 in the main paper, the exploration agent is spawned at a random pose \(p_{0}\) in some unknown environment and obtains views \(V_{exp} = \{x_{t}\}_{t=1}^{T_{exp}}\) and poses \(\mathcal {P}_{exp} = \{p_{t}\}_{t=1}^{T_{exp}}\) during exploration. After exploration, N query poses \(\mathcal {P}^{q} = \{p_{i}^{q}\}_{i=1}^{N}\) are sampled from the same environment and the agent is required to reconstruct the corresponding views \(V^{q} = \{x_{i}^{q}\}_{i=1}^{N}\).

This reconstruction is performed in a concept space \(\mathcal {C}\) which is automatically discovered from the training environments. We sample views uniformly from the training environments and cluster their ResNet-50 features using K-means. The concepts \(c \in \mathcal {C}\) are, therefore, the cluster centroids obtained after clustering image features. Each query location \(p_{i}^{q}\) has a set of “reconstructing” concepts \(C_{i} = \{c_{j}\}_{j=1}^{J} \subset \mathcal {C}\). These are determined by extracting the ResNet-50 features from \(x_{i}^{q}\) and obtaining the J nearest cluster centroids. We use a transformer-based architecture (Vaswani et al. 2017) for predicting the concepts, as described in Fig. 15. This architecture has been very successful for natural language tasks (Vaswani et al. 2017; Devlin et al. 2018; Yang et al. 2019b) and has achieved promising results in computer vision (Fang et al. 2019; Carion et al. 2020). While our design choices are motivated by Fang et al. 2019, we use the model to predict concepts present at a given location in the environment instead of learning a motion policy.
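The assignment of reconstructing concepts to a query view can be sketched as a nearest-centroid lookup. The example below is only an illustration: it assumes Euclidean distance in feature space, uses toy 2-D vectors in place of ResNet-50 features, and the name `nearest_concepts` is hypothetical.

```python
import math


def nearest_concepts(feature, centroids, j):
    """Return the indices of the j centroids closest to `feature`.

    Assumption: closeness is measured with Euclidean distance.
    """
    distances = [math.dist(feature, centroid) for centroid in centroids]
    order = sorted(range(len(centroids)), key=lambda k: distances[k])
    return order[:j]


if __name__ == "__main__":
    # Toy 2-D stand-ins for cluster centroids and a query-view feature.
    centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
    query_feature = (0.9, 0.2)
    print(nearest_concepts(query_feature, centroids, j=3))
```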

A.2.2 Loss Function

Reconstruction in the concept space is treated as a multilabel classification problem. For a particular query view \(x^{q}\) at a query pose \(p^{q}\), the reconstructing concepts are obtained by retrieving the top J nearest neighbouring clusters in the image feature space. These J clusters are treated as positive labels for \(x^{q}\) and the rest are treated as negative labels. The ground-truth probability distribution C assigned to \(x^{q}\) consists of 0s for the negative labels and 1/J for the positive labels. Let \(\hat{C} = \text {P}(.|p^{q}, V_{exp}, \mathcal {P}_{exp})\) be the posterior probabilities for each concept inferred by the model (see Fig. 15). Then, the loss \(L_{rec}\) is

$$\begin{aligned} L_{rec}(p^q) = D (C||{\hat{C}}) \end{aligned}$$

where D is the KL-divergence between the two distributions.
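A minimal numerical sketch of this loss follows. It builds the ground-truth distribution with mass 1/J on the positive concepts and evaluates \(D(C||\hat{C})\), using the convention that terms with zero ground-truth mass contribute nothing. The function names and the small clamping constant `eps` are assumptions added for the example.

```python
import math


def target_distribution(positive_ids, num_concepts):
    """Ground-truth distribution: 1/J on positives, 0 elsewhere."""
    j = len(positive_ids)
    return [1.0 / j if k in positive_ids else 0.0 for k in range(num_concepts)]


def kl_divergence(target, predicted, eps=1e-12):
    """D(target || predicted); `eps` only guards against log of zero."""
    return sum(
        t * math.log(t / max(p, eps))
        for t, p in zip(target, predicted)
        if t > 0.0
    )


if __name__ == "__main__":
    target = target_distribution({0, 2}, num_concepts=4)
    uniform_prediction = [0.25, 0.25, 0.25, 0.25]
    print(kl_divergence(target, uniform_prediction))
```

With two positives out of four concepts and a uniform prediction, the loss equals \(\log 2\); it is zero when the prediction matches the ground-truth distribution exactly.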

A.2.3 Reward Function

The reconstruction method relies on rewards from a trained reconstruction model to learn an exploration policy. Note that the reconstruction model is not updated during policy learning. For each episode, a set of N query poses \(\mathcal {P}^{q} = \{p_{i}^{q}\}_{i=1}^{N}\) and their views \(V^{q} = \{x_{i}^{q}\}_{i=1}^{N}\) are sampled initially. This information is hidden from the exploration policy and does not affect the exploration directly. At time t during an exploration episode, the agent will have obtained observations \(V_{exp}^t = \{x_{\tau }\}_{\tau =1}^{t}\) and \(\mathcal {P}_{exp}^t = \{p_{\tau }\}_{\tau =1}^{t}\). The reconstruction model uses \(V_{exp}^t, \mathcal {P}_{exp}^t\) to predict posteriors over the concepts for the different queries \(p^{q} \in \mathcal {P}^{q}\): \(\hat{C}_{t}~=~\text {P}(.|p^{q}, V_{exp}^t, \mathcal {P}_{exp}^t)\). The reconstruction loss for each prediction is \(L_{rec, t}(p^{q}) = D (C||{\hat{C}}_{t})\) where C is the ground-truth distribution over the reconstructing concepts for query \(p^q\). The reward is then computed as follows:

$$\begin{aligned} r_{t} = \frac{1}{N} \sum _{p^q \in \mathcal {P}^{q} }\bigg (L_{rec, t-\varDelta _{rec}}(p^q) - L_{rec, t}(p^q)\bigg ) \end{aligned}$$

where the reward is provided to the agent after every \(\varDelta _{rec}\) steps. This reward is the reduction in the reconstruction loss over the past \(\varDelta _{rec}\) time-steps. The goal of the agent is to constantly reduce the reconstruction loss, and it is rewarded more for larger reductions.
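The reward computation can be sketched as follows, given the per-query losses at step \(t - \varDelta _{rec}\) and at step t. The function name is hypothetical, and the losses are assumed to be listed in the same query order in both arguments.

```python
def reconstruction_reward(previous_losses, current_losses):
    """Mean reduction in per-query reconstruction loss.

    previous_losses[i] is the loss for query i at step t - delta_rec,
    current_losses[i] is the loss for the same query at step t.
    """
    if not current_losses or len(previous_losses) != len(current_losses):
        raise ValueError("expected two non-empty lists of equal length")
    reductions = [prev - cur for prev, cur in zip(previous_losses, current_losses)]
    return sum(reductions) / len(reductions)


if __name__ == "__main__":
    print(reconstruction_reward([1.0, 2.0], [0.5, 1.0]))
```

The reward is positive when the loss decreases on average and negative when newly gathered observations make the predictions worse.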

Fig. 16

Examples of images in each cluster with the corresponding cluster IDs on Matterport3D (first row) and Active Vision Dataset (second row). The clusters typically correspond to meaningful concepts such as pillars / arches, doors, windows / lights, and geometric layouts in MP3D, and windows, doors, computer screens, sofas and kitchens in AVD

A.2.4 Implementation Details

We use 30 clusters for AVD and 50 clusters for MP3D. These values are chosen with the elbow method, which selects the number of clusters beyond which the reduction in within-cluster separation saturates. Additionally, we manually inspect the clusters for different numbers of clusters to ensure that they contain meaningful concepts (see Fig. 16).

In practice, we do not directly use the ResNet-50 image features as the output of the image encoder. Instead, we compute the similarity scores between the ResNet-50 features for a given \(x_{t}\) and all the cluster centroids in \(\mathcal {C}\). This gives 30- and 50-dimensional vectors of similarities for AVD and MP3D, respectively, which serve as the output of the image encoder. This design choice achieves two things: (1) it reduces computational complexity significantly, as the number of clusters is much smaller than the dimensionality of the ResNet-50 features (2048-D), and (2) it directly incorporates the information from cluster centroids into the reasoning process of the reconstruction, as the reasoning happens in the cluster similarity space rather than the image feature space.
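The similarity-based encoding can be sketched as below. The text does not state which similarity measure is used, so cosine similarity is assumed here purely for illustration; the function names and the toy 2-D vectors are likewise hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def encode_by_similarity(feature, centroids):
    """One similarity score per centroid (assumed measure: cosine)."""
    return [cosine_similarity(feature, centroid) for centroid in centroids]


if __name__ == "__main__":
    centroids = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
    print(encode_by_similarity((1.0, 0.0), centroids))
```

The output has one entry per cluster, so its length equals the number of clusters regardless of the dimensionality of the input feature.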

For training the reconstruction model, we sample exploration trajectories using the oracle exploration method. First, N query views are sampled for each environment by defining a discrete grid of locations and sampling images from multiple heading angles at each valid location on the grid. For AVD, we use a grid cell distance of \(1\hbox {m}\) while sampling views from 4 uniformly separated heading angles. For MP3D, we use a grid cell distance of \(2\hbox {m}\) while sampling views from 3 uniformly separated heading angles. These values were selected to ensure a good spread of views, low redundancy in the views, and adequate supervision (the larger the grid cell distance, the fewer the valid points). The model is trained on trajectories of length \(T_{exp} = 200\) in AVD and \(T_{exp} = 500\) in MP3D. We use the \(J = 3\) nearest-neighbor clusters as positives for both AVD and MP3D. To make the model more robust to the actual trajectory length, we also train on intermediate time-steps of the episode (after every 20 steps in AVD and 100 steps in MP3D). The optimization is performed using the Adam optimizer with a learning rate of 0.0001 for AVD and 0.00003 for MP3D. We use 2 layers in both the transformer encoder and decoder with 2 attention heads each. For training the reconstruction exploration agent, we use \(\varDelta _{rec} = 1\) for AVD and \(\varDelta _{rec} = 5\) for MP3D.

A.3 Navigation Pipeline

A.3.1 Problem Setup

An exploration agent is required to gather information from the environment that will allow it to navigate to a given target location \(p^{tgt}\) after exploration.

More formally, the exploration agent is spawned at a random pose \(p_{0}\) in some unknown environment, and is allowed to explore the environment for a time budget \(T_{exp}\). Let \(V^{d}_{exp} = \{x_{t}^{d}\}_{t=1}^{T_{exp}}\) be the set of depth observations the agent received and \(\mathcal {P}_{exp} = \{p_{t}\}_{t=1}^{T_{exp}}\) be the corresponding agent poses (relative to pose \(p_{0}\)). The depth observations along with the corresponding poses are used to build a 2D top-down occupancy map of the environment \(\mathcal {M} \in \mathbb {R}^{h\times w}\) that indicates whether an (x, y) location in the map is free, occupied, or unknown. After exploration, the agent is respawned at \(p_{0}\) and is provided a target coordinate \(p^{tgt}\) that it must navigate to within a budget of time \(T_{nav}\), using the occupancy information \(\mathcal {M}\) gathered during exploration. After reaching the target, it is required to execute a STOP action indicating that it has successfully reached the target. Following past work on navigation (Anderson et al. 2018b; Savva et al. 2019), the episode is considered to be a success only if the agent executes the STOP action within a threshold geodesic distance \(\eta _{\text {success}}\) from the target.

A.3.2 Navigation Policy

Our navigation policy is based on a mapping + planning baseline that is known to be very competitive and has achieved 0.92 SPL on the Gibson validation set (Gupta et al. 2017a; see Footnote 5). See Algo. 1. The input to the policy consists of the egocentric occupancy map \(\mathcal {M}\) generated at the end of exploration and a target location \(p^{tgt}\) on that map. The map \(\mathcal {M}\) consists of free, occupied and unexplored regions. \(\text {ProcessMap}(\mathcal {M})\) converts this into a binary map by treating all free and unexplored regions as free space, and the occupied regions as obstacles. It also applies the morphological closing operator to fill any holes in the binary map.

Next, the AStarPlanner uses the processed map \(\bar{\mathcal {M}}\) to generate the shortest path from the current position to the target. If the policy has reached the target, then it returns STOP. Otherwise, if the path is successfully generated, the policy samples the next location on the path to navigate to (\(p^{\text {next}}\)) and selects an action to navigate to that target. \(\text {get\_action}()\) is a simple rule-based action selector that moves forward if the agent is already facing the target, otherwise rotates left / right to face \(p^{\text {next}}\). However, if the path does not exist, the policy samples a random action. This condition is typically reached if ProcessMap blocks narrow paths to the target or assigns the agent’s position as an obstacle while closing holes.
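The control flow described above can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: the map is a small 4-connected grid, the morphological closing step is omitted, the rule-based \(\text {get\_action}()\) is replaced by returning the next cell on the path, and all names (`process_map`, `astar`, `navigation_step`) are hypothetical.

```python
import heapq

FREE, OCCUPIED, UNKNOWN = 0, 1, 2


def process_map(grid):
    """Binary obstacle map: free and unexplored cells are traversable.
    The morphological closing step from the text is omitted here."""
    return [[cell == OCCUPIED for cell in row] for row in grid]


def astar(obstacles, start, goal):
    """4-connected A* with a Manhattan heuristic; returns a path or None."""
    rows, cols = len(obstacles), len(obstacles[0])

    def heuristic(p):
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    frontier = [(heuristic(start), 0, start)]
    parent, cost = {start: None}, {start: 0}
    while frontier:
        _, g, cur = heapq.heappop(frontier)
        if g > cost[cur]:
            continue  # stale queue entry
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        r, c = cur
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and not obstacles[nr][nc]:
                if g + 1 < cost.get((nr, nc), float("inf")):
                    cost[(nr, nc)] = g + 1
                    parent[(nr, nc)] = cur
                    heapq.heappush(
                        frontier, (g + 1 + heuristic((nr, nc)), g + 1, (nr, nc))
                    )
    return None


def navigation_step(grid, position, target):
    """One policy step: STOP at the target, otherwise plan and pick p_next."""
    if position == target:
        return ("STOP", None)
    path = astar(process_map(grid), position, target)
    if path is None:
        return ("RANDOM", None)  # the text samples a random action here
    return ("GOTO", path[1])     # next cell on the planned path


if __name__ == "__main__":
    grid = [[FREE, OCCUPIED, FREE],
            [FREE, OCCUPIED, FREE],
            [FREE, FREE, FREE]]
    print(navigation_step(grid, (0, 0), (0, 2)))
```

Because unexplored cells are treated as free, the planner is optimistic: it plans through unknown space and relies on re-planning when obstacles are later revealed.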

[Pseudo-code figure: navigation policy]

A.3.3 Implementation Details

We use a publicly available A* implementation: https://github.com/hjweide/a-star. We vary \(T_{exp}\) for benchmarking and set \(T_{nav} = 200\) for AVD and \(T_{nav} = 500\) for MP3D. We set \(\eta _{success} = 0.5\hbox {m}\) for AVD and \(\eta _{success} = 0.25\hbox {m}\) for MP3D. The value is larger for AVD since the environment is discrete, and a threshold of \(0.5\hbox {m}\) is satisfied only when the agent is one step away from the target.

The map \(\mathcal {M}\) is an egocentric crop of the allocentric map generated during exploration. For AVD, we freeze the map after exploration, i.e., we do not update the map based on observations received during the navigation phase. Therefore, the agent is required to have successfully discovered a path to the target during exploration (even though it does not know the target during exploration). This type of evaluation generally fails for MP3D since the floor-plans are very large and it is generally not possible for an exploration agent to discover the full floor plan within the restricted time-budget. Therefore, for MP3D, we permit online updates to \(\mathcal {M}\) during navigation. This means that the role of exploration in MP3D is not necessarily to discover a path to the target; instead, it is used to rule out certain regions of the environment that may cause planning failure, which would reduce the navigation efficiency.

B Criteria for Visiting Objects and Landmarks

We highlight the exact success criteria for what counts as visiting an object or landmark.

B.1 Visiting Objects

AVD The object instances in AVD (Ammirato et al. 2016) are annotated as follows: If an object is visible in an image, the bounding box and the instance ID are listed. A particular object instance is considered to be visited if it is annotated in the current image, the distance to the object is less than \(1.5\hbox {m}\), and the bounding box area is larger than 70 square pixels (approximately \(1\%\) of an \(84 \times 84\) image). We keep the bounding box size threshold low since many of the object instances in AVD are very small objects (see Footnote 6), and we primarily rely on visibility and the agent’s proximity to the object to determine visitation.

MP3D The objects in MP3D (Chang et al. 2017) are annotated with (x, y, z) center-coordinates in 3D space along with their extents specified as (width, height, depth). As stated in Sect. 4.2 in the main paper, to determine object visitation, we check if the agent is close to the object, has the object within its field of view, and if the object is not occluded. While it is also possible to arrive at similar visitation criteria by rendering semantic segmentations of the scene at each time step, we refrain from doing that as it typically requires larger memory to load semantic meshes and slows down rendering significantly. See Fig. 17 for the exact criteria. The values for this evaluation metric were determined upon manual inspection on training environments.
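The four criteria listed in the caption of Fig. 17 can be written as a single predicate. The sketch below assumes the caller has already projected the object centroid to a pixel and read the depth image at that pixel; the function and parameter names are hypothetical, while the thresholds are the ones stated in the caption.

```python
import math


def object_visited(d, theta_deg, pixel, image_size, depth_at_pixel,
                   max_dist=3.0, max_angle_deg=60.0, depth_tol=0.3):
    """Check the MP3D object-visitation criteria from Fig. 17.

    d: agent-to-object distance in metres.
    theta_deg: angle between the viewing direction and the object ray.
    pixel: (x, y) projection of the object centroid.
    image_size: (width, height) of the egocentric image.
    depth_at_pixel: depth reading (metres) at `pixel`.
    """
    x, y = pixel
    width, height = image_size
    in_image = 0 <= x < width and 0 <= y < height
    if not (d < max_dist and theta_deg <= max_angle_deg and in_image):
        return False
    # Occlusion test: the sensed depth must agree with the expected depth.
    expected_depth = d * math.cos(math.radians(theta_deg))
    return abs(depth_at_pixel - expected_depth) < depth_tol


if __name__ == "__main__":
    print(object_visited(2.0, 0.0, (42, 42), (84, 84), depth_at_pixel=2.0))
    print(object_visited(2.0, 0.0, (42, 42), (84, 84), depth_at_pixel=1.0))
```

In the second call the depth sensor reports a surface one metre closer than the object, so the object is treated as occluded and not counted as visited.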

Fig. 17

Object visitation criteria on MP3D: The left image shows a top-down view of the environment containing the agent (red) and the object (blue). d is the Euclidean distance between the agent’s centroid and the object’s centroid, the dotted line represents the center of the agent’s field of view, and \(\theta \) represents the angle between the agent’s viewing angle and the ray connecting the agent’s centroid to the object’s centroid. The right image shows the egocentric view of the agent containing the blue object. The indicated (x, y) represents the pixel coordinates of the object centroid. An object is considered visited if (1) \(d < 3.0\hbox {m}\), (2) \(\theta \le 60^{\circ }\), (3) (x, y) is within the image extents, and (4) \(|\text {depth}[x, y] - d\cos (\theta )| < 0.3\hbox {m}\) where \(\text {depth}\) is the depth image. The final criterion checks for occlusions since the expected distance to the object must be consistent with the depth sensor readings at the object centroid

Fig. 18

Landmark visitation criteria on AVD, MP3D: The image on the left shows the top-down view of the environment with the agent in red, the landmark-view in green, and rays representing their field of view centers in corresponding colors. The gray lines represent obstacles. \(\theta \) represents the discrepancy between the direction the agent is looking at and the point the landmark-view is focused on. d is the geodesic distance between the two viewpoints. To successfully visit the landmark-view, the field of view of the agent must closely overlap with that of the landmark-view. On AVD, this is ensured by satisfying two criteria: (1) \(\theta < 30^{\circ }\), and (2) \(d < 1\hbox {m}\). On MP3D, we specify three criteria: (1) \(\theta < 20^{\circ }\), (2) \(d < 2\hbox {m}\), and (3) \(|d_{1} - d_{2}| < 0.5\hbox {m}\) where \(d_{1}, d_{2}\) are the lengths of the red and blue line-segments respectively. We additionally impose the third condition to check for occlusions that block the agent’s view of the landmarks. If the agent is close to the landmark-view, lower \(\theta \) leads to success

B.2 Visiting Landmarks

The criteria for visiting landmarks differ from those for visiting objects, as the goal here is to match a particular \((x, y, z, \phi )\) pose in the environment rather than be close to some (x, y, z) location and have it within the agent’s field of view. Specifically, the goal is to look at the same things that the landmarks are looking at. See Fig. 18.
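The thresholds given in the caption of Fig. 18 translate into the following check. This is a sketch with hypothetical names; it assumes that \(\theta \), d, and the two segment lengths \(d_{1}, d_{2}\) have already been computed from the agent pose and the landmark-view pose.

```python
def landmark_visited(theta_deg, d, dataset, d1=None, d2=None):
    """Check the landmark-visitation criteria from Fig. 18.

    theta_deg: discrepancy between agent and landmark-view directions.
    d: geodesic distance (metres) between the two viewpoints.
    d1, d2: segment lengths used for the MP3D occlusion check.
    """
    if dataset == "AVD":
        return theta_deg < 30.0 and d < 1.0
    if dataset == "MP3D":
        if d1 is None or d2 is None:
            raise ValueError("MP3D requires d1 and d2 for the occlusion check")
        return theta_deg < 20.0 and d < 2.0 and abs(d1 - d2) < 0.5
    raise ValueError("dataset must be 'AVD' or 'MP3D'")


if __name__ == "__main__":
    print(landmark_visited(25.0, 0.8, "AVD"))
    print(landmark_visited(25.0, 0.8, "MP3D", d1=2.0, d2=2.1))
```

The same viewpoint can count as a visit on AVD but not on MP3D, since MP3D uses a tighter angular threshold.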

C Hyperparameters for Learning Exploration Policies

Table 5 Values for hyperparameters for optimizing exploration policies and the spatial memory common across methods. The learning rate is selected from the specified range based on grid-search

We expand on the implementation details provided in Sect. 5 from the main paper. We use PyTorch (Paszke et al. 2017) and a publicly available codebase for PPO (Kostrikov 2018) for all our experiments. The hyperparameters for training different exploration algorithms are shown in Table 5. The optimization and spatial memory hyperparameters are kept fixed across different exploration algorithms. We use ImageNet pre-trained image encoders and keep them frozen to facilitate extensive experimentation. The primary factor that varies across methods is the reward scale. For MP3D, the models are trained on 4 Titan V GPUs and typically take 1–2 days for training. For AVD, the models are trained on 1 Titan V GPU and typically take 1 day to train.

Next, we compare our curiosity implementation with the one given in Burda et al. 2018a. For training our curiosity policy, we use the forward-dynamics architecture proposed in Burda et al. 2018a which consists of four MLP residual blocks. We use the GRU hidden state from the policy as our feature representation to account for partial observability. As recommended in Burda et al. 2018a, we do not backpropagate the gradients from the forward dynamics model to the feature representation to have relatively stable features. However, since the policy is updated, the features from the recurrent states are not fixed during training (as suggested in Burda et al. 2018a). Nevertheless, we found that it was more important to use memory-based features that account for partial observability, than to use stable image features that are frozen (see Fig. 11 in the main paper). We additionally use PPO, advantage normalization, reward normalization, feature normalization, and observation normalization following the practical considerations suggested in Burda et al. 2018a. We are limited to using only 8 parallel actors due to computational and memory constraints.

D Comparative Study of Different Coverage Variants

While we use area as our primary quantity of interest for coverage in the main paper, this idea can be extended more generally to learning to visit other interesting things such as objects (similar to the search task from Fang et al. 2019) and landmarks (see Sect. 4.2 from the main paper). The coverage reward consists of the increment in some observed quantity of interest:

$$\begin{aligned} r_{t} \propto I_{t} - I_{t-1}, \end{aligned} \tag{13}$$

where \(I_{t}\) is the quantity of interesting things (e.g., objects) visited by time t. Apart from area coverage (regular and smooth), we also consider objects and landmarks for I, where the agent is rewarded based on the corresponding visitation metric from Sect. 4.2 in the main paper.
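As a small worked example, the per-step coverage rewards can be computed from the cumulative quantity \(I_{t}\) as follows. The proportionality constant is exposed as a `scale` argument, which is an assumption of this sketch; the function name is hypothetical.

```python
def coverage_rewards(cumulative_quantity, scale=1.0):
    """Rewards r_t = scale * (I_t - I_{t-1}) for t = 1, ..., T.

    cumulative_quantity[t] is I_t, e.g., the area, number of objects, or
    number of landmarks visited by time t (index 0 is the initial value).
    """
    return [
        scale * (current - previous)
        for previous, current in zip(cumulative_quantity, cumulative_quantity[1:])
    ]


if __name__ == "__main__":
    # The agent visits 3 new objects, then none, then 4 more.
    print(coverage_rewards([0, 3, 3, 7]))
```

Steps that reveal nothing new receive zero reward, which is why sparsely occurring quantities lead to sparse rewards.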

Fig. 19

Plots comparing different coverage variants on the three visitation metrics

For each of the visitation metrics, we have one method that is optimized for doing well on that metric. For example, area-coverage optimizes for area visited, objects-coverage optimizes for objects visited, etc. As expected, on AVD we generally observe that the method optimized for a particular metric typically does better than most methods on that metric. However, on MP3D, we find that smooth-coverage and area-coverage dominate on most metrics. This shift in the trend is likely due to optimization difficulties caused by reward sparsity: landmarks and objects occur more sparsely in the large MP3D environments. Objects tend to occur more frequently in the environment than landmarks, and this is reflected in the performance as objects-coverage generally performs better. For this reason, we use smooth-coverage as the standard coverage method in the main paper.

E Frontier-Based Exploration Algorithm

We briefly described the frontier-exploration baseline in Sect. 4.1 in the main paper. Here, we provide a detailed description of the algorithm along with the pseudo-code.

We implement a variant of the frontier-based exploration algorithm from Yamauchi 1997 as shown in Algorithm 2. The core structure of the algorithm is similar to the navigation policy used in Algorithm 1. The key difference here is that the target \(p^{tgt}\) is assigned by the algorithm itself.

DetectFrontiers() Given the egocentric occupancy map \(\mathcal {M}\) of the environment, frontiers are detected. Frontiers are defined as the edges between free and unexplored regions in the map. In our case, we detect these edges and group them into contours using the contour detection algorithm from OpenCV. “frontiers” is the list of these contours representing different frontiers in the environment.

SampleTarget() We sort the frontiers based on their lengths since longer contours represent potentially larger areas to uncover. We then randomly sample one of the three longest frontiers, and sample a random point within this contour to get \(p^{tgt}\).

UpdateMap() We update the map based on observations received while navigating to \(p^{tgt}\). Once we sample a frontier target \(p^{tgt}\), we use a navigation policy (see Algo. 1 in the main paper) to navigate to the target. Since the occupancy maps can be noisy, we add two simple heuristics to make frontier-based exploration more robust. First, we keep track of the number of times planning to \(p^{tgt}\) fails. This can happen if the map that is updated during exploration reveals that it is not possible to reach \(p^{tgt}\). Second, we keep track of the total time spent on navigating to \(p^{tgt}\). Depending on the map updates during exploration, certain targets may be very far away from the agent’s current position since the geodesic distance changes based on the revealed obstacles. If planning fails more than \(N_{fail}\) times or the time spent reaching \(p^{tgt}\) crosses \(T_{max}\), then we sample a new frontier target.

We use \(N_{fail} = 2\) for both AVD and MP3D, and \(T_{max} = 20\) and \(T_{max} = 200\) for AVD and MP3D, respectively. Preliminary analysis showed that the algorithm was relatively robust to different values of \(T_{max}\).
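The DetectFrontiers() and SampleTarget() steps can be sketched without external dependencies. Note the substitution: the paper groups frontier edges with OpenCV's contour detection, whereas this illustration groups frontier cells with an 8-connected breadth-first search, so frontier "length" is approximated by the number of cells. All names are hypothetical.

```python
import random
from collections import deque

FREE, OCCUPIED, UNKNOWN = 0, 1, 2


def detect_frontiers(grid):
    """Group free cells that border unknown cells into connected frontiers."""
    rows, cols = len(grid), len(grid[0])

    def neighbors(r, c, diagonal):
        steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        if diagonal:
            steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
        for dr, dc in steps:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                yield nr, nc

    cells = {
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if grid[r][c] == FREE
        and any(grid[nr][nc] == UNKNOWN for nr, nc in neighbors(r, c, False))
    }
    frontiers, seen = [], set()
    for cell in sorted(cells):
        if cell in seen:
            continue
        group, queue = [], deque([cell])
        seen.add(cell)
        while queue:
            cur = queue.popleft()
            group.append(cur)
            for nxt in neighbors(cur[0], cur[1], True):
                if nxt in cells and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        frontiers.append(group)
    return frontiers


def sample_target(frontiers, rng):
    """Pick one of the three longest frontiers, then a random cell in it.
    Raises IndexError if there are no frontiers left to explore."""
    longest = sorted(frontiers, key=len, reverse=True)[:3]
    return rng.choice(rng.choice(longest))


if __name__ == "__main__":
    grid = [[0, 0, 2, 2],
            [0, 0, 2, 2],
            [1, 1, 1, 1],
            [0, 0, 0, 2]]
    frontiers = detect_frontiers(grid)
    print(frontiers, sample_target(frontiers, random.Random(0)))
```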

[Pseudo-code figure: frontier-based exploration]

F Generating Difficult Testing Episodes for PointNav

In the implementation details provided in Sect. 5 of the main paper, we mentioned that we generate difficult test episodes for navigation. Here, we describe the rationale behind selecting difficult episodes and show some examples. In order to generate difficult navigation episodes, we ask the following question:

“How difficult would it be for an agent that has not explored the environment to reach the target?”

An agent that has not explored the environment would assume the entire environment is free and plan accordingly. When this assumption fails, i.e., a region that was expected to be free space is blocked, the navigation agent will most likely have to reverse course and find a different path (re-plan). As new obstacles are discovered along the planned shortest paths, navigation efficiency reduces as more re-planning is required. Therefore, an exploration agent that uses the exploration time budget to discover these obstacles a priori is expected to have higher navigation efficiency. We select start and goal points for navigation by manually inspecting floor-plans and identifying candidate (start, goal) locations that will likely require good exploration for efficient navigation. See Fig. 20 for some examples on MP3D. We use the same idea to generate episodes for AVD.

Fig. 20

Difficult navigation episodes: Six examples of difficult navigation episodes in MP3D that would benefit from exploration are shown. The ground truth shortest path is shown in purple. The navigation agent starts at “Start” and must navigate to “Goal”. If it is not aware of the presence of the obstacle (the red X indicator), it is likely to follow the incorrect yellow path, discover the obstacle, walk all the way back, and re-plan. The larger the deviation from the shortest path in purple, the lower the navigation efficiency (SPL). Good exploration agents discover these obstacles during exploration and, therefore, have better navigation efficiency

Fig. 21

Examples of good and poor landmarks on AVD (Ammirato et al. 2016), MP3D (Chang et al. 2017) are shown. Good landmarks typically include things that occur uniquely within one environment such as kitchens, living rooms, bedrooms, study tables, etc. Poor landmarks typically include repetitive things in the environment such as doorways, doors, plain walls, and plants. Note that one concept could be a good landmark in one environment, but poor in another. For example, if there is just one television in a house, it is a good landmark. However, as we can see in the last column, last row, televisions that occur more than once in an environment are poor landmarks

G Automatically Mining Landmarks

In Sect. 4.2 from the main paper, we briefly motivated what landmarks are and how they are used for learning an exploration policy. Here, we explain how these landmarks are mined automatically from training data.

We define landmarks to be visually distinct parts of the environment, i.e., similar looking viewpoints do not occur in any other spatially distinct part of the environment. To extract such viewpoints from the environment, we sample a large number of randomly selected viewpoints from each environment. For each viewpoint, we extract features from a visual similarity prediction network (see Sec. A.1) and cluster the features using K-Means. Visually similar viewpoints are clustered together due to the embedding learned by the similarity network. We then sort the clusters based on the intra-cluster variance in the (x, y, z) positions and select clusters with low variance. These clusters include viewpoints which do not have similar views in any other part of the environment, i.e., they are visually and spatially distinct. In indoor environments, these typically include distinct objects such as bicycles, mirrors and jackets, and also more abstract concepts such as kitchens, bedrooms and study areas (see Fig. 21 for examples).
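The cluster selection step can be sketched as follows, assuming the K-Means cluster labels have already been computed for each sampled viewpoint. The variance measure (mean squared distance to the cluster's mean position), the explicit threshold `max_variance`, and the function names are assumptions made for this illustration; the text only states that low-variance clusters are selected.

```python
def positional_variance(positions):
    """Mean squared distance of positions to their mean position."""
    count = len(positions)
    dims = len(positions[0])
    means = [sum(p[k] for p in positions) / count for k in range(dims)]
    squared = sum((p[k] - means[k]) ** 2 for p in positions for k in range(dims))
    return squared / count


def select_landmark_clusters(labels, positions, max_variance):
    """Return cluster labels sorted by positional variance, keeping only
    clusters whose variance does not exceed `max_variance`."""
    clusters = {}
    for label, position in zip(labels, positions):
        clusters.setdefault(label, []).append(position)
    variances = {
        label: positional_variance(points) for label, points in clusters.items()
    }
    kept = [label for label, var in variances.items() if var <= max_variance]
    return sorted(kept, key=lambda label: variances[label])


if __name__ == "__main__":
    # Cluster 0: two nearby viewpoints. Cluster 1: similar-looking views
    # that are 10 m apart, hence not spatially distinct.
    labels = [0, 0, 1, 1]
    positions = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0),
                 (0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    print(select_landmark_clusters(labels, positions, max_variance=1.0))
```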

Fig. 22

Success cases of frontier-exploration: for smaller environments that typically do not have mesh defects, frontier-exploration is successful at systematically identifying regions that were not explored and covering them. While novelty does fairly well, it generally does worse than frontier-exploration on these cases

Fig. 23

Failure cases of frontier-exploration: for larger environments that tend to have either outdoor regions or mesh defects, the occupancy estimation often tends to be incorrect. Since frontier-exploration relies on heuristics for exploration, it is less robust to these noisy cases and gets stuck in regions where noise is high. A learned approach like novelty is more robust to these cases

H Factors Influencing Performance

In Sect. 5.3 from the main paper, we briefly discussed two different factors that affect the quality of exploration, specifically, the number of training environments and the size of testing environments. Here, we provide qualitative examples of success and failure cases of frontier-exploration in the noise-free case on Matterport3D. We additionally analyze the impact of using imitation learning as a pre-training strategy for learning an exploration strategy.

H.1 Influence of Testing Environment Size

As discussed in Sect. 5.3 in the main paper, novelty performs well on most environments. While frontier-exploration performs very well in small environments, it struggles in large MP3D environments. This is due to mesh defects present in the scans of large environments where the frontier agent gets stuck. Here, we show qualitative examples where frontier-exploration succeeds in small environments (see Fig. 22), and fails in large environments (see Fig. 23). For each example, we also show the exploration trajectories of novelty to serve as a reference because it succeeds on a wide variety of cases (see Fig. 10 from the main paper).

H.2 Influence of Imitation-Based Pre-Training

In Sect. 2.4 from the main paper, we mentioned that we pre-train policies with imitation learning before the reinforcement learning stage. Here, we evaluate the impact of doing so. See Fig. 24.

We train a few sample methods on both AVD and MP3D with and without the imitation learning stage, using three different random seeds. We then evaluate their exploration performance on 100 AVD episodes and 90 MP3D episodes as a function of the number of training episodes. Except in the case of novelty in MP3D, pre-training policies using imitation does not seem to improve performance or speed up convergence. While it is possible that expert trajectories gathered from humans (as opposed to synthetically generated ones) could lead to better performance, we reserve such analyses for future work.

Fig. 24

Impact of imitation pre-training: The yellow curves show the results with imitation pre-training, and the green curves show results when the policy is trained from random initialization. The curves represent the area covered by the agent on validation episodes over the period of training. The blue dot indicates the performance of the pure imitation policy. The yellow curves are shifted to account for the number of training episodes used for imitation learning

I Measuring the Correlation Between Different Metrics

In the main paper, we quantified the performance of different exploration strategies on various visitation metrics and downstream tasks. Here, we measure the correlation between performance on the visitation metrics and the performance on downstream tasks. For example, if an agent covers more area, does that imply it will perform well on navigation?

Based on our results on Matterport3D, we estimate the Spearman’s rank correlation between a subset of visitation metrics and downstream tasks. To estimate the correlation, we first compute the metric values for each method on all episodes of the test set. We then estimate the correlation between two metrics using their corresponding values from each episode and method (see Table 6). As expected, larger values of the visitation metrics are correlated with better performance on the reconstruction task. However, the area covered is negatively correlated with the pointnav performance. This is an interesting phenomenon, and echoes the point we made earlier in Sect. 5.1 about how current exploration methods do not incorporate biases, such as uncovering potential obstacles, which can aid future navigation.
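For reference, Spearman's rank correlation is the Pearson correlation computed on ranks. The sketch below uses only the standard library and assigns average ranks to ties; it is an illustration of the statistic, not the script used to produce Table 6, and the toy numbers in the usage example are invented purely to show the call.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rank correlation between two equally long sequences."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Invented toy values: one pair per (episode, method) combination.
    metric_a = [10.0, 20.0, 30.0, 40.0]
    metric_b = [0.9, 0.7, 0.6, 0.2]
    print(spearman(metric_a, metric_b))
```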

Table 6 Correlation between visitation metrics and downstream task performance

Cite this article

Ramakrishnan, S.K., Jayaraman, D. & Grauman, K. An Exploration of Embodied Visual Exploration. Int J Comput Vis 129, 1616–1649 (2021). https://doi.org/10.1007/s11263-021-01437-z

Keywords

  • Visual navigation
  • Visual exploration
  • Learning for navigation