1 Introduction

With breakthroughs in computer vision and natural language understanding, the embodiment hypothesis, i.e., that an intelligent agent is born from its interaction with environments (Smith & Gasser, 2005), is drawing increasing attention to embodied AI tasks such as vision-and-language navigation (VLN). VLN was first defined by Anderson et al. (2018b) toward the goal of a robot carrying out general verbal instructions: an agent is required to follow natural-language instructions based on what it sees and to adapt to previously unseen environments. VLN has since developed into various settings, such as fine-grained, short-horizon navigation (e.g., R2R, Anderson et al., 2018b, and RxR, Ku et al., 2020a), long-horizon navigation (e.g., R4R, Jain et al., 2019), vision-and-dialogue navigation (e.g., CVDN, Thomason et al., 2020), and navigation with high-level instructions (e.g., REVERIE, Qi et al., 2020b). Compared with non-embodied vision-and-language tasks such as visual question answering (Antol et al., 2015) and visual captioning (Chen et al., 2015; Xu et al., 2016), VLN agents suffer from domain shift and changing observations during multi-step decision-making.

Fig. 1

The longer blue trajectory shows an agent carrying out Instruction 1. The agent later enters this scene again to conduct the second instruction along the shorter red path. ESceme allows it to recall the visited nodes (i.e., the blue ones \(\textrm{B}_1\) and \(\textrm{B}_3\)) from its current location (A) and to choose the neighboring node B\(_1\), from which it will see “the white bookshelf” one more step ahead at C. It finally navigates along the red dashed route and reaches the target (Color figure online)

A vanilla Seq2Seq pipeline (Anderson et al., 2018b) that implicitly encodes path history with LSTMs (Hochreiter & Schmidhuber, 1997) shows only moderate navigation ability. Since then, VLN performance has been considerably improved by pre-training (Hao et al., 2020; Hong et al., 2021; Chen et al., 2021b; Qiao et al., 2022), data augmentation (Fried et al., 2018; Tan et al., 2019; Li et al., 2022a), and algorithms that explicitly track past decisions along the trajectory (Chen et al., 2021b; Wang et al., 2021; Chen et al., 2022b). These methods learn enhanced representations by training VLN agents within each episode but ignore the dynamics of navigating over the whole data. Other strategies, including modified beam search (Fried et al., 2018) and pre-exploration (Wang et al., 2019; Tan et al., 2019; Majumdar et al., 2020; Zhu et al., 2020a), are devised specifically to increase adaptation to unseen environments at the cost of efficiency: beam search significantly extends route length and requires many more interactions with the environment, while pre-exploration takes extra steps to gather information and train the agent with auxiliary objectives before it can conduct given instructions. Such strategies incur burdensome time and computational expenses in practical usage.

In this work, we propose a navigation mechanism with Episodic Scene memory (ESceme) that balances generalization and efficiency by exploiting the dynamics of navigating across all episodes. ESceme requires no extra annotations or heavy computation and is agent-agnostic. We encode observation, instruction, and path history separately and update the scene memory during navigation via candidate enhancing. By preserving the memory across episodes, ESceme lets the agent see a bigger picture at each decision. This way, the agent learns to utilize dynamically updated information instead of merely adapting to the current observations. During inference, it then predicts actions with the progressively completed memory. A demonstration is shown in Fig. 1. When carrying out an instruction at Location A, the agent must select one of the adjacent nodes B\(_1\)-B\(_5\) to navigate to. It recalls the episodic scene memory, i.e., the blue route of a completed trajectory, and chooses Node B\(_1\), from which it will see “the white bookshelf” one more step ahead at C.

We verify the superiority of ESceme on short-horizon navigation with fine-grained instructions (R2R), long-horizon navigation (R4R), and vision-and-dialog navigation (CVDN). We find that ESceme notably benefits navigation with longer routes (R4R and CVDN), improving both goal reaching and path fidelity. Our method achieves the highest Goal Progress in the CVDN challenge. Besides a fair comparison with existing approaches under a single run, we test the performance with an approximately complete memory, where the agent fully updates its scene memory in a first pass over all the episodes. We denote this variant ESceme*, which serves as the upper bound of ESceme. ESceme* yields a further improvement, indicating that a more complete memory magnifies the advantage of ESceme. We hope this work can inspire further explorations in modeling episodic scene memory for VLN.

Since ESceme introduces no extra time or steps before following the instruction at inference, it is fair to compare it with its counterparts in the single-run setting. Unlike pre-exploration, which optimizes the parameters of an agent before it solves the task, ESceme only renews its episodic memory while conducting instructions and requires no back-propagation. Moreover, ESceme neither involves beam search nor changes the local action space in sequential decision-making. These properties make ESceme both efficient and effective in real-world use. Our contributions are summarized as follows:

  • We devise the first navigation mechanism with episodic scene memory (ESceme) for VLN to balance generalization and efficiency.

  • We provide a simple yet effective implementation of ESceme via candidate enhancing, tested with two navigation architectures and two inferring strategies.

  • We verify the superiority of ESceme in short-horizon (R2R), long-horizon (R4R), and vision-and-dialog (CVDN) navigation, and achieve a new state-of-the-art.

2 Related Work

2.1 Vision-and-Language Navigation

Since Anderson et al. (2018b) defined the VLN task and provided an LSTM-based sequence-to-sequence baseline (Seq2Seq), numerous approaches have been developed. A branch of methods improves navigation via data augmentation, such as SF (Fried et al., 2018), EnvDrop (Tan et al., 2019), and EnvEdit (Li et al., 2022a). As for agent training, Wang et al. (2018) model the environment to provide planned-ahead information during navigation. RCM (Wang et al., 2019) provides an intrinsic reward for reinforcement learning via an instruction-trajectory matching critic. Wang et al. (2020b) jointly train an agent on VLN and vision-dialog navigation (MT-RCM+EnvAg). To fully use available semantic information in the environment, AuxRN (Zhu et al., 2020a) devises four self-supervised auxiliary reasoning tasks. TDSTP (Zhao et al., 2022) introduces an extra target location estimation during finetuning to achieve reliable path planning. Many methods explore more effective feature representations and architectures, such as PTA (Cornia & Cucchiara, 2019), OAAM (Qi et al., 2020a), NvEM (An et al., 2021), RelGraph (Hong et al., 2020), MTVM (Lin et al., 2022b), and SEvol (Chen et al., 2022a).

Some methods construct and reason about a graph of navigation while conducting an episode, such as NTS (Chaplot et al., 2020) and RECON (Shah et al., 2022) in the ImageGoal space and ETPNav (An et al., 2023) and CMTP (Chen et al., 2021a) in the VLN space. VLN-SIG (Li & Bansal, 2023) adds the tasks of generating semantics for future navigation views in pre-training and fine-tuning, and contributes to a more powerful agent backbone. KERM (Li et al., 2023) introduces knowledge described by text to aid action prediction, which is useful mainly in seen environments. GridMM (Wang et al., 2023) builds a grid memory with fine-grained features and adopts a global action space, which improves the success rate but suffers from a much longer trajectory length.

Inspired by the breakthrough of large-scale pre-trained BERT (Kenton & Toutanova, 2019) in natural language processing tasks, PRESS (Li et al., 2019) replaces RNNs with pre-trained BERT to encode instructions and achieves a non-trivial improvement in unseen environments. PREVALENT (Hao et al., 2020) pre-trains BERT from scratch using image-text-action triplets and further boosts the performance. RecBERT (Hong et al., 2021) integrates a recurrent unit into a BERT model to be time-aware. Chen et al. (2021b) propose the first VLN network that allows a sequence of historical memory and can be optimized end-to-end (HAMT). HOP (Qiao et al., 2022) designs trajectory order modeling and group order modeling tasks to model temporal order information in pre-training. CSAP (Wu et al., 2022) proposes trajectory-conditioned masked fragment modeling and contrastive semantic-alignment modeling tasks for pre-training. ADAPT (Lin et al., 2022a) explicitly learns action-level modality alignment with action prompts. There are also some works specially designed for vision-and-dialog navigation, such as VISITRON (Shrivastava et al., 2022), SCoA (Zhu et al., 2021), and CMN (Zhu et al., 2020b).

The differences between the proposed ESceme and previous graph- or map-construction approaches are twofold. First, they construct path-level memory along a route within a single episode, whereas ESceme maintains scene-level memory across multiple episodes in the same scenario. Second, they use the path-level memory for planning by extending the agent’s action space from local to global, whereas ESceme does not change the agent’s action space. Instead, it improves navigation by enriching the information at each node, which is the core idea that makes ESceme perform better than path-level memory methods (e.g., EGP and SSM). Scene memory is also studied in other works (Datta et al., 2022; Georgakis et al., 2022; Li et al., 2022b; Krantz et al., 2023; Vasudevan et al., 2021; Gupta et al., 2017).

The method most closely related to ours is IVLN (Krantz et al., 2023). Our setting is essentially identical to that of IVLN, which reorganizes episodes into tours. We store scene IDs during inference instead of explicitly organizing episodes according to their IDs, yet the two ways have the same effect. Although both works explore the impact of episodic memory and compare with the same baseline model, i.e., HAMT, ours provides a more effective design of the memory mechanism and obtains better performance thanks to candidate enhancing. In contrast, the episodic memory in IVLN encodes the memory map as a whole and performs even worse than the path-level memory baseline (cf. Table 2 in IVLN).

Fig. 2

An overview of the Episodic Scene memory mechanism for VLN. On the left is the partial episodic memory for the current scene, which gets updated during navigation 1) while following the previous instruction, i.e., the blue route \(\textrm{B}_3\rightarrow \textrm{A}\rightarrow \textrm{B}_1\rightarrow \textrm{E}\), and 2) while following the current instruction from Step 1 to \(t-1\), i.e., the solid trajectory \(\textrm{B}_4\rightarrow \textrm{A}\). The cyan nodes \(\textrm{B}_2\), \(\textrm{B}_5\), C, and D are those viewed but not visited. The shaded box shows the memory of node B\(_1\), which has six adjacent neighbors, i.e., A, B\(_2\), B\(_5\), C, D, and E; the integration of these nodes constitutes the memory of B\(_1\). At Step t, the agent stands at Node A and is expected to choose one node from B\(_1\) to B\(_5\). Given observations from K views, each view retrieves its memory in ESceme, producing \(\{{\textbf{m}}_1,...,{\textbf{m}}_K\}\). The memory representations are then fused with the original encoded observations, yielding \(\{{\textbf{o}}_1,...,{\textbf{o}}_K,{\textbf{o}}_s\}\), where \({\textbf{o}}_s\) is the representation for STOP. The enhanced observations, instruction text, and history from Step 1 to \(t-1\) compose the input to a navigation network that predicts the action \(a_t=i\in \{1,...,K,s\}\). Generally, a navigation network uses the encoded features of the original K views as the input to the cross-modal encoder, i.e., output \(\textcircled {1}\). Our ESceme instead exploits the enhanced observations from \(\textcircled {2}\) (Color figure online)

2.2 Exploration Strategies in VLN

As the navigation graph is pre-defined in discrete VLN, diverse strategies have been adopted beyond the regular single run. For example, Fried et al. (2018) modify the standard beam search to select the final navigation route, which notably increases navigation success at the cost of unbearable trajectory lengths. More efficient exploration strategies have also been studied. For instance, a progress monitor is trained to discard unfinished trajectories during inference (Ma et al., 2019a). Ma et al. (2019b) learn a regret module to decide when to backtrack. Ke et al. (2019) compare partial paths with global information considered and backtrack only when necessary. AcPercep (Wang et al., 2020a) learns an exploration policy to gather visual information for navigation. Although these methods improve search efficiency, they depend heavily on manually designed or heuristic rules. Deng et al. (2020) are the first to define a global action space and build a graphical representation of the environment for elegant exploration/backtracking. Wang et al. (2021) extend EnvDrop (Tan et al., 2019) with an external structured scene memory (SSM) to promote exploration in the global action space.

Pre-exploration, which allows an agent to pre-explore unseen environments before navigating, was first introduced in Wang et al. (2019) as a setting distinct from single-run and beam search. The obtained information is used in diverse ways. RCM (Wang et al., 2019) uses the exploration experience in self-supervised imitation learning. EnvDrop (Tan et al., 2019) exploits the environment information for data augmentation via back-translation. VLN-BERT (Majumdar et al., 2020) provides the agent with a global view for optimal route selection. AuxRN (Zhu et al., 2020a) finetunes the agent in unseen environments with auxiliary tasks.

3 Method

3.1 Problem Formulation

Given an instruction \(X_i\), e.g., “Turn around and walk to the right of the room...”, an agent starts from the initial location of route \(R_i\). It observes a panoramic view of the environment \(Y_i\). The panoramic view consists of \(K{=}36\) single viewpoints, each of which is accompanied by an orientation \((\theta ,\phi )\) indicating heading and elevation and by a binary navigable signal. The agent selects a viewpoint from the navigable ones and moves to the next location, where it receives new observations. This process repeats until the agent takes the STOP action.
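To make the observation structure concrete, the following minimal sketch (our own illustrative code, not part of the original formulation; all field names are hypothetical) represents one panoramic observation with its K = 36 views and navigability flags.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class View:
    """One of the K = 36 single views in a panoramic observation."""
    feature: List[float]    # visual feature of this view (e.g., from an offline encoder)
    heading: float          # theta (radians)
    elevation: float        # phi (radians)
    navigable: bool         # whether a neighboring viewpoint is reachable through this view
    viewpoint_id: str = ""  # ID of that neighboring viewpoint, if navigable

@dataclass
class PanoObservation:
    """Panoramic observation at the agent's current location."""
    location_id: str        # ID of the current viewpoint in the scene
    views: List[View]       # exactly K = 36 views

    def candidates(self) -> List[View]:
        # the action space at this step: navigable views (plus an implicit STOP)
        return [v for v in self.views if v.navigable]
```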

In a regular VLN task, there is a set of training samples \({\mathcal {D}}=\{(Y_1,X_1,R_1),...,(Y_{N_1},X_{N_1},R_{N_1})\}\), where \((X_i,R_i)\) is the instruction-route pair in an environment \(Y_i\). The set \(\{Y_1,...,Y_{N_1}\}\) denotes the seen environments during training. An agent is expected to learn navigation with \({\mathcal {D}}\) and carry out instructions in unseen scenarios given by \({\mathcal {D}}^u=\{(Y^u_1,X_1),...,(Y^u_{N_2},X_{N_2})\}\). The set \(\{Y^u_1,...,Y^u_{N_2}\}\) denotes the unseen environments for test.

For a sequence prediction problem, history is an important source of information apart from observations and instructions. The shaded part in Fig. 2 shows a decision step of a general navigation approach that follows the pretraining-finetuning paradigm and encodes path history, represented by HAMT (Chen et al., 2021b). We denote the vanilla features of the K single views extracted by the observation encoder as \(\{{\textbf{f}}_1,...,{\textbf{f}}_K,{\textbf{f}}_s\}\), which can be obtained by concatenating the separate features of encoded RGB images and orientations. \({\textbf{f}}_s\) is appended to allow a STOP action. Together with history representations \(\{{\textbf{h}}_1,...,{\textbf{h}}_{t-1}\}\) from the history encoder and text representations \(\{{\textbf{x}}_{cls},{\textbf{x}}_1,...,\) \({\textbf{x}}_L\}\) from the instruction encoder, the features of the observations \(\textcircled {1}\) are input into a cross-modal encoder for multi-modal fusion. A predictor block takes in the cross-modal representations \(\{{\textbf{o}}'_1,...,{\textbf{o}}'_K,{\textbf{o}}'_s\}\), \(\{{\textbf{h}}'_1,...,{\textbf{h}}'_{t-1}\}\), and \(\{{\textbf{x}}'_{cls},{\textbf{x}}'_1,...,{\textbf{x}}'_L\}\) to predict action \(a_t\).

Fig. 3

Episodic memory construction of a scene during navigation. The state of ESceme at the beginning of each time step is shown in each sub-figure; it comprises green nodes and edges and is empty at the beginning of \(t=1\). The blue nodes indicate the current location while following the first instruction at each time step, and the red ones correspond to the second instruction. The small cyan nodes mark the remaining navigable viewpoints of the current location. Nodes with a green boundary are the viewpoints chosen at each time step. ESceme at the end of that time step is updated with the node with a green boundary and the dashed edges connecting it to existing nodes. Please refer to Fig. 1 for the complete global graph of the scene, which is unavailable to the agent in either navigation or ESceme construction (Color figure online)

Due to potential differences between seen and unseen environments, such as the appearance and layout of the scenario and the arrangement of objects, an agent trained in the above way suffers from decreased decision ability. Mistakes accumulate along the path, incurring a heavy drop in navigation success in new environments. Since strategies such as pre-exploration and beam search that exploit extra clues in a new scene are too expensive for a deployed robot, we propose a mechanism of episodic scene memory to balance accuracy and efficiency. Figure 2 provides an overview of the proposed ESceme mechanism. By retrieving episodic memory for the K views at Step t, ESceme replaces the vanilla encoded observations with enhanced representations for cross-modal encoding and action prediction, i.e., \(\textcircled {1}\rightarrow \textcircled {2}\). In the following sections, we detail how to build the episodic scene memory and how to enhance observations with it during navigation.

3.2 Episodic Scene Memory Construction

We initialize the episodic memory of Scene Y with an empty graph \({\mathcal {G}}^{(0)}_Y=({\mathcal {V}}^{(0)}_Y{=}\emptyset ,~ {\mathcal {E}}^{(0)}_Y{=}\emptyset )\) if an agent has never seen the scene. Namely, for the first instruction in Scene Y, an agent starts navigation with an empty episodic memory. As shown in Fig. 3a, the start location has four neighbors and is added to \({\mathcal {G}}_Y\) at the end of \(t{=}1\) by \({\mathcal {V}}^{(1)}_Y=\{V_1\}\). Node feature \({\textbf{m}}_{V_1}\) is an integration of its neighbors,

$$\begin{aligned} {\textbf{m}}_{V_1}=\text {pooling}({\textbf{f}}_{V_{1,i}}), \end{aligned}$$
(1)

where \(i{\in } \{1,2,3,4\}\) in Fig. 3a, \({\textbf{f}}_{V_{1,i}}\in {\mathbb {R}}^d\) is the d-dimensional plain representation of the i-th neighbor view from the observation encoder, and \({\textbf{m}}_{V_1}\in {\mathbb {R}}^d\). The pooling function can be either max or mean pooling over the neighbor features. It is worth noting that obtaining \({\textbf{f}}_{V_{1,i}}\) involves no extra computation since these features have already been calculated during offline feature extraction. The agent selects its right neighbor to navigate to, and at the end of \(t{=}2\), the visited node is added to \({\mathcal {G}}_Y\) by \({\mathcal {V}}^{(2)}_Y{=}\{V_1,V_2\},~{\mathcal {E}}^{(2)}_Y{=}\{e_{12}\}\), with node feature \({\textbf{m}}_{V_2}\) calculated in the same way as Eq. (1). We set all edges \(e_{jk}{=}1\).

While following the first instruction, the agent updates its episodic scene memory \({\mathcal {G}}_Y\) accordingly, i.e., the green nodes and edges in Fig. 3b, c. At the end of \(t=5\), \({\mathcal {V}}_Y^{(5)}=\{V_1,V_2,...,V_5\},~{\mathcal {E}}_Y^{(5)}=\{e_{12},e_{23},e_{34},e_{45}\}\). When the agent is directed to the second instruction in Scene Y, the memory from its previous visits is preserved in \({\mathcal {G}}_Y\) and is updated at the end of each time step, as Fig. 3d, e demonstrate. In Fig. 3f, since the agent’s location A was already added to ESceme while conducting the first instruction, there is no update to \({\mathcal {G}}_Y\). The agent stores episodic memory for each scene separately in the same way; we therefore omit the subscript Y for simplicity.
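A minimal sketch of this construction, assuming the offline view features are available as vectors; the class and method names below are ours and do not reflect the authors' released implementation.

```python
import numpy as np

class EpisodicSceneMemory:
    """Per-scene graph G_Y: node features pool the neighbor-view features (Eq. 1)."""

    def __init__(self, dim: int, pooling: str = "max"):
        self.dim = dim
        self.pooling = pooling
        self.node_feat = {}   # viewpoint_id -> m_V in R^d
        self.edges = set()    # undirected edges, all weights e_jk = 1

    def update(self, viewpoint_id, neighbor_feats, neighbor_ids=()):
        """Add the node visited at the end of a time step; no update if already stored."""
        if viewpoint_id in self.node_feat:
            return
        # Eq. (1): pool the plain features of all neighbor views of this node
        pooled = neighbor_feats.max(axis=0) if self.pooling == "max" \
            else neighbor_feats.mean(axis=0)
        self.node_feat[viewpoint_id] = pooled
        for nid in neighbor_ids:
            if nid in self.node_feat:
                self.edges.add(frozenset((viewpoint_id, nid)))

    def retrieve(self, viewpoint_id):
        """Eq. (2): m_V for visited nodes, otherwise a zero vector."""
        return self.node_feat.get(viewpoint_id,
                                  np.zeros(self.dim, dtype=np.float32))

# one memory per scene, preserved across episodes within that scene
memories = {}  # scene_id -> EpisodicSceneMemory
```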

3.3 ESceme Navigation by Candidate Enhancing

In addition to information from instruction, current observation, and route history, an agent can refer to its episodic scene memory in decision-making at each step.

Since the node representation in ESceme integrates information within the neighborhood, it is expected to give the agent a bigger picture of the current location. We therefore devise a candidate-enhancing (CE) mechanism to improve navigation. A flowchart of CE is shown in Fig. 2. Faced with K candidate views at Step t, the agent retrieves their representations \({\textbf{m}}_k,~k\in \{1,...,K\}\) from the episodic memory \({\mathcal {G}}^{(t-1)}\),

$$\begin{aligned} {\textbf{m}}_k=\left\{ \begin{array}{ll} {\textbf{m}}_{V_j} &{} \text {if the } k \text {-th view is } V_j \in {\mathcal {V}}^{(t-1)} \\ {\textbf{0}} &{} \text {otherwise.} \end{array} \right. \end{aligned}$$
(2)

Then the Fusion block integrates the ESceme representations with the plain features \(\{{\textbf{f}}_1,...,{\textbf{f}}_K\}\) to produce enhanced candidate viewpoints,

$$\begin{aligned} {\textbf{o}}_k=\text {MLP}([{\textbf{m}}_k;{\textbf{f}}_k]), \end{aligned}$$
(3)

where \([\cdot ;\cdot ]\) denotes concatenation along the feature dimension. The MLP function is a two-layer non-linear projection from \({\mathbb {R}}^{2d}\) to \({\mathbb {R}}^d\). Following Chen et al. (2021b) and Zhao et al. (2022), type embedding that distinguishes visual and linguistic signals, navigable embedding that indicates the navigability of each candidate view, and orientation encoding are added to \({\textbf{o}}_k\). A zero vector \({\textbf{o}}_s\in {\mathbb {R}}^d\) is appended as the feature for the STOP action.
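The fusion in Eqs. (2) and (3) could be realized roughly as follows, a PyTorch-style sketch under our own naming rather than the released code; the type, navigable, and orientation embeddings mentioned above are omitted for brevity.

```python
import torch
import torch.nn as nn

class CandidateEnhancer(nn.Module):
    """Fuse retrieved scene-memory features m_k with plain view features f_k (Eq. 3)."""

    def __init__(self, dim: int = 768):
        super().__init__()
        # two-layer non-linear projection from R^{2d} to R^d
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, f: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        # f, m: [K, d]; m_k is zero for views whose viewpoint is not yet in memory (Eq. 2)
        o = self.mlp(torch.cat([f, m], dim=-1))               # enhanced candidates, [K, d]
        o_stop = torch.zeros(1, o.size(-1), device=o.device)  # zero vector for STOP
        return torch.cat([o, o_stop], dim=0)                  # [K+1, d]
```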

Finally, together with encoded history features, the enhanced candidate representations \(\{{\textbf{o}}_1,...,{\textbf{o}}_K,{\textbf{o}}_s\}\) are input to the cross-modal encoder to merge linguistic information from encoded text features. The agent predicts the distribution of action \(a_t\) via a two-layer non-linear Predictor block,

$$\begin{aligned} P(a_t{=}k{\in }\{1,...,K,s\}) = \frac{e^{\textrm{MLP}({\textbf{o}}'_k\odot {\textbf{x}}'_{cls})}}{\sum _{j\in \{1,...,K,s\}}e^{\textrm{MLP}({\textbf{o}}'_j\odot {\textbf{x}}'_{cls})}}, \end{aligned}$$
(4)

where \(\odot \) is the element-wise multiplication of the two vectors \({\textbf{o}}'_k\) and \({\textbf{x}}'_{cls}\) \(\in {\mathbb {R}}^d\), and the two-layer non-linear MLP block maps the result to a scalar \(\in {\mathbb {R}}\). Following Tan et al. (2019) and Chen et al. (2021b), we train the framework end-to-end with a mixture of Imitation Learning and Reinforcement Learning (A2C, Mnih et al., 2016) losses,

$$\begin{aligned} {\mathcal {L}}={-}\alpha \sum _{t=1}^{T^*}\log P(a_t{=}a_t^*)-\sum _{t=1}^T\log P({\tilde{a}}_t)(r_t{-}v_t), \end{aligned}$$
(5)

where \(T^*\) and T are the lengths of the annotated route and the predicted path, respectively. \({\tilde{a}}_t\) is the sampled action, \(r_t\) is the discounted reward, and \(v_t\) is the state value given by a two-layer MLP critic network.
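For completeness, a condensed sketch of the action head (Eq. 4) and the mixed objective (Eq. 5) is given below; the names, shapes, and the detached advantage are our assumptions, while the overall structure follows the equations above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionPredictor(nn.Module):
    """Eq. (4): score each candidate by MLP(o'_k * x'_cls) and normalize with softmax."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, o_cross: torch.Tensor, x_cls: torch.Tensor) -> torch.Tensor:
        # o_cross: [K+1, d] cross-modal candidate features; x_cls: [d]
        return self.mlp(o_cross * x_cls.unsqueeze(0)).squeeze(-1)   # logits, [K+1]

def mixed_loss(il_logits, expert_actions, rl_logits, sampled_actions,
               rewards, values, alpha: float = 0.2):
    """Eq. (5): imitation learning (teacher forcing) plus an A2C policy-gradient term."""
    il = F.cross_entropy(il_logits, expert_actions, reduction="sum")  # -sum log P(a_t = a_t*)
    log_p = F.log_softmax(rl_logits, dim=-1)
    log_p_sampled = log_p.gather(1, sampled_actions.unsqueeze(1)).squeeze(1)
    advantage = (rewards - values).detach()       # r_t - v_t, critic trained separately
    rl = -(log_p_sampled * advantage).sum()
    return alpha * il + rl
```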

4 Experiments

4.1 Experimental Setup

4.1.1 Datasets and Metrics

We conduct experiments on the following three VLN tasks for evaluation.

  1. Short-horizon with fine-grained instructions. R2R (Anderson et al., 2018b) is built on Matterport3D (Chang et al., 2017) and contains 7,189 direct-to-goal trajectories with an average length of 10 m. Each path is associated with three instructions of 29 words on average. The train, val seen, val unseen, and test unseen splits include 61, 56, 11, and 18 houses, respectively.

  2. Long-horizon with fine-grained instructions. R4R (Jain et al., 2019) is generated by joining existing trajectories in R2R with others that start near where they end. Compared to R2R, it has longer paths and instructions and a reduced shortest-path bias. The train, val seen, and val unseen splits have 233,613, 1,035, and 45,162 samples, respectively.

  3. Vision-dialog navigation. CVDN (Thomason et al., 2020) requires an agent to navigate given a target object and a dialog history. It has 7k trajectories and 2,050 navigation dialogs, where the paths and language contexts are longer than those in R2R. The train, val seen, val unseen, and test splits contain 4,742, 382, 907, and 1,384 instances, respectively.

Following standard criteria (Chen et al., 2021b; Anderson et al., 2018b, a), we evaluate the R2R dataset with Trajectory Length (TL), Navigation Error (NE), Success Rate (SR), and Success weighted by Path Length (SPL). TL is the average length of an agent’s navigation route in meters, NE is the mean shortest-path distance between the agent’s stop location and the target, and SR measures the ratio of navigations that stop within three meters of the goal. SPL normalizes SR by the ratio between the ground-truth path length and the navigated path length, which balances accuracy and efficiency and is therefore the key metric for the R2R dataset. We adopt three additional metrics, Coverage weighted by Length Score (CLS), normalized Dynamic Time Warping (nDTW), and Success weighted by nDTW (SDTW), to assess path fidelity on the R4R dataset. For vision-dialog navigation on CVDN, the primary evaluation metric is Goal Progress (GP) in meters.
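As a reference for the key R2R metric, SPL can be computed as below; this reflects the standard definition from Anderson et al. (2018a) rather than any dataset-specific code.

```python
def spl(successes, shortest_lengths, path_lengths):
    """SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where S_i is 1 for a successful
    episode, l_i is the shortest-path length, and p_i is the length actually navigated."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += float(s) * l / max(p, l)
    return total / len(successes)

# e.g., spl([1, 0, 1], [10.0, 8.0, 12.0], [12.0, 8.0, 12.0]) ~= 0.61
```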

4.1.2 Implementation Details

We adopt the encoders from Chen et al. (2021b) in comparison by default, where the text, history, and cross-modal encoders have nine, two, and four transformer layers, respectively. Features of single views are extracted offline using finetuned ViT-B/16 released by Chen et al. (2021b). For a fair comparison, we set the feature dimension \(d{=}768\), the ratio of imitation learning loss \(\alpha {=}0.2\), and train the ESceme framework for 100K iterations on each dataset with a batch size of 8 and a learning rate of 1e-5. All the experiments run on a single NVIDIA V100 GPU. We adopt max pooling and single-run by default in comparison with other methods, and provide the results of mean pooling and inferring twice in ablation studies and supplementary material, with qualitative examples and failure cases included.

For Reinforcement Learning, the action space is restricted to the navigable locations (loosely, viewpoints) reachable from each node, which is implemented by first predicting the log-probability distribution over all K viewpoints plus a STOP token and then setting the non-navigable ones to -inf. The policy is given by \(\pi (a_t|\{o'_i\}_1^s,\{h'_i\}_1^{t-1},\{x'_i\}_{cls}^L)\), and sampling is conducted according to the restricted log probabilities. For Imitation Learning, the shortest-path planner provides expert demonstrations, which are directly available from the simulator.
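The action-space restriction described above amounts to masking non-navigable candidates before sampling; a minimal sketch with assumed tensor shapes:

```python
import torch

def sample_action(logits: torch.Tensor, navigable_mask: torch.Tensor) -> torch.Tensor:
    """logits: [K+1] scores over the K views plus STOP; navigable_mask: [K+1] booleans.
    Non-navigable candidates receive -inf so their sampling probability is exactly zero."""
    masked = logits.masked_fill(~navigable_mask, float("-inf"))
    probs = torch.softmax(masked, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # sampled action index for A2C
```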

Table 1 Comparison with state-of-the-art methods on R2R dataset

4.1.3 Fair Comparison

We consider deploying an agent in new environments to execute a series of language instructions. Admittedly, this definition differs slightly from that of existing methods, but it 1) preserves the original setup of unseen environments, i.e., the agent never sees the environment before deployment, and 2) is more practical in real scenarios, e.g., for housework robots. Meanwhile, the proposed episodic memory changes only the initialization: the agent conducts the first episode with empty memory and the following episodes with its own estimates. The comparisons in this paper aim to verify the superiority of the proposed episodic memory rather than merely showing that one instantiation of ESceme surpasses its counterparts. Since this is a novel memory mechanism, we inevitably compare it with existing path-level memory. Our inter-episode memory requires no extra time or computation while maintaining partial episodic memory via initialization, which is worth further exploration.

The directly available location ID, which we use to retrieve enhanced features for the current node, is universally adopted by 1) implicit path-level memory methods (e.g., HAMT) to retrieve accessible candidates, and 2) explicit path-level memory methods (e.g., EGP, SSM) to extend the action space. Our comparisons introduce only one additional signal, i.e., the scan ID as an environment index, so that the superiority of episodic memory over path-level memory can be shown fairly, where unseen scenes refer to those never appearing in training/validation. In discrete environments, the use of location IDs inevitably means that the agent knows when its current location is exactly one it has seen before. The proposed ESceme can easily be extended to continuous settings by combining it with waypoint prediction methods (e.g., CWP, Hong et al., 2022), which surpass most semantic-map-based approaches. Moreover, the proposed episodic memory mechanism can transfer to continuous scenes by maintaining a global map via widely studied visual SLAM.

Table 2 Comparison on the val unseen split of R4R dataset
Table 3 Results of goal progress (GP) in meters on CVDN dataset

4.2 Comparison to State-of-the-Art

4.2.1 Results on R2R Dataset

Table 1 compares the proposed ESceme with existing methods on the R2R dataset. We can see that the pretraining-finetuning paradigm (e.g., RecBERT (Hong et al., 2021), HAMT (Chen et al., 2021b), ADAPT (Lin et al., 2022a), CSAP (Wu et al., 2022), TDSTP (Zhao et al., 2022)) largely improves the performance of VLN in unseen environments. ESceme achieves the highest SPL on the unseen splits. It surpasses the baseline model HAMT (Chen et al., 2021b) by about 5% SPL on the validation and test unseen environments and even outperforms TDSTP (Zhao et al., 2022), which involves auxiliary training tasks. Besides, ESceme brings a relative decrease of 6.4% and 4.1% in NE on the validation and test unseen splits, respectively. The results demonstrate the efficacy of episodic scene memory for generalization to unseen scenarios with short instructions.

We also compare with the most recent works, including VLN-SIG (Li & Bansal, 2023), KERM (Li et al., 2023), and GridMM (Wang et al., 2023). The proxy pre-training task involved in VLN-SIG shows no advantage in unseen environments. KERM surpasses all the methods on the validation-seen split but degrades much more heavily on the unseen splits. GridMM achieves the highest SR and slightly lower SPL than ours in unseen scenarios, yet it produces much longer trajectories.

Table 4 Ablation studies of ESceme construction on R2R dataset to compare the effect of different pooling functions

4.2.2 Results on R4R Dataset

We evaluate the proposed ESceme on the R4R dataset to examine if the generalization promotion is maintained in long-horizon navigation tasks. The results are listed in Table 2. Our ESceme outperforms existing state-of-the-art by a large margin, i.e., a relative improvement of 6.4% in SPL, 7.0% in CLS, 7.3% in nDTW, and 9.1% in SDTW. It indicates that ESceme improves not only navigation success but also path fidelity. Although good at carrying out short instructions, TDSTP (Zhao et al., 2022) suffers a heavy drop in long-horizon navigation regarding path fidelity compared with its baseline model HAMT (Chen et al., 2021b). It reveals that goal-related auxiliary tasks such as target prediction benefit reaching the target location but undermine the ability to follow instructions. Equipped with ESceme, an agent has a promoted ability to travel the expected route in long-horizon navigation. Besides, a consistent advantage of pretraining-based methods can be observed on this dataset.

4.2.3 Results on CVDN Dataset

Table 3 compares ESceme with state-of-the-art methods on the vision-and-dialog navigation task. CVDN provides longer instructions and trajectories than R2R and more complicated instructions than R4R. The proposed ESceme achieves the best goal progress in both seen and unseen scenarios and wins first place on the leaderboard. HAMT (Chen et al., 2021b) shows an obvious advantage over other pretraining-based methods such as PREVALENT (Hao et al., 2020), and even surpasses those counterparts specially designed for vision-and-dialog navigation, e.g., CMN (Zhu et al., 2020b), VISITRON (Shrivastava et al., 2022), and SCoA (Zhu et al., 2021). Our ESceme brings a relative improvement of 20.7%, 5.7%, and 7.3% over the baseline HAMT (Chen et al., 2021b) in the val seen, val unseen, and test unseen environments, respectively.

Table 5 Ablation studies of navigation architectures and inferring strategies on R2R dataset

4.3 Ablation Studies & Analysis

4.3.1 Different ESceme Constructions

We evaluate the effect of different pooling functions in Table 4. Candidate Enhancing with mean pooling brings a relative improvement of 2.3% in SPL for unseen navigation and behaves similarly in seen environments. Integrated with max pooling, Candidate Enhancing further boosts the performance in unseen environments, which produces a 3.8% relative increase compared to the HAMT (Chen et al., 2021b) baseline. The results demonstrate the efficacy of the proposed Candidate Enhancing, which improves observation representations via direct injection and fusion, and max pooling, which preserves more distinguishable features of each view. Appendix A discusses a different implementation of the proposed episodic scene memory by Graph Encoding.

4.3.2 Different Navigation Architectures & Inferring Strategies

The proposed ESceme is devised to be model-agnostic and should be compatible with any navigation network that takes observations as input. To validate this property, we build ESceme upon TDSTP (Zhao et al., 2022), which achieves the highest SR on the R2R dataset, and list the results in Table 5. ESceme improves navigation in both seen and unseen environments by 4.9% and 1.4% in SPL, respectively.

As introduced in Sect. 3.3, the agent starts with an empty episodic scene memory during inference, and the memory keeps updating. If we instead let the agent first renew its memory thoroughly by going through all the episodes and then evaluate its navigation performance, it has a much more complete episodic memory. We present the results of this variant, ESceme*, in Table 5. The nearly complete memory further boosts the performance in unseen environments by 1.3% and 2.1% in SPL for ESceme upon HAMT (Chen et al., 2021b) and TDSTP (Zhao et al., 2022), respectively. More results of ESceme* are in the supplementary material, with smaller improvements observed for longer-horizon navigation. The results demonstrate that an agent learns to assist navigation with partial and persistently updated episodic memory.

The observation that ESceme* performs only slightly better than ESceme has two sides. On the one hand, it indicates that the agent has learned to use the dynamically accumulated episodic memory rather than waiting for the complete memory. On the other hand, the slight gain of ESceme* points to possible bottlenecks in the encoder/cross-encoder architecture, the frozen vision encoder, and the scale of the datasets.

More effects of the proposed episodic scene memory are presented in Appendices B and F. A comparison with pre-exploration methods shows that ESceme* is more robust to unseen scenarios. An ablation on graph re-initialization verifies that episodic scene memory contributes to decision-making in both seen and unseen environments. The observation in the IVLN benchmark is consistent with our discussion in Sect. 2 and our experimental results in Sect. 4.2, and validates the superiority of the proposed ESceme.

Fig. 4

Navigation quality w.r.t. inference progress. The x-axis indicates the ratio of samples tested, and the y-axis is the smoothed average of SPL or CLS. We use the default order for all the methods. Navigation with ESceme improves over time

Fig. 5

Panoramic views and top-down overviews of navigation. Mistakes during navigation are marked with red boxes for panorama and red arrows for top-down trajectories. The star indicates the target location. Our ESceme strictly follows the instruction “walk down to the end of hall” and waits at the door of the bedroom (Color figure online)

Fig. 6

Failure case in R2R val unseen split. The instruction is “Leave sitting room and head towards the kitchen, turn right at living room and enter. Walk through living room to dining room and enter. Turn left and head to front door. Exit the house and stop on porch.” After correctly predicting the first three actions, ESceme failed to enter the dining room and got lost

4.3.3 Computational Efficiency

We present the model size, GPU usage, and time cost during inference on the R2R dataset in Table 5. Whether built upon HAMT (Chen et al., 2021b) or TDSTP (Zhao et al., 2022), the proposed ESceme brings about 1.0% extra parameters and GPU memory occupation. In the single-run setting, ESceme slightly increases the computational time by 4.8% when built on top of HAMT. Compared with HAMT, the TDSTP baseline costs 59.5% more time and 23.5% more GPU memory; on top of it, our ESceme raises the time cost by only 3.8% with almost no extra GPU consumption. With a more complete memory, ESceme* further boosts navigation performance in new environments at the expense of doubling the time. ESceme thus achieves a good trade-off between efficiency and efficacy in a single run. The proposed episodic memory mechanism consumes marginal (\(\le 0.1\%\)) computation and parameters. For D-dim features, K nodes per scene, and N scenes, the increased costs of space and parameters are about \(3.81e^{-6}{\times }DKN\) and \(1.14e^{-5}{\times }D^2\), respectively.
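One plausible reading of these constants (our assumption, not stated in the text) is float32 storage measured in MB: \(4/2^{20}\approx 3.81\times 10^{-6}\) MB per stored feature entry, and roughly \(3D^2\) weights for the two-layer \(2d\rightarrow d\rightarrow d\) fusion MLP, giving \(3\times 4/2^{20}\approx 1.14\times 10^{-5}\). Under this reading, with \(D=768\), an illustrative \(K=100\) nodes per scene, and \(N=11\) scenes,

$$\begin{aligned} 3.81\times 10^{-6}\times 768\times 100\times 11\approx 3.2~\text {MB},\qquad 1.14\times 10^{-5}\times 768^2\approx 6.7~\text {MB}. \end{aligned}$$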

4.3.4 Order of Executing Instructions

Since ESceme learns with dynamically updated episodic memory while conducting instructions, the order of execution has little impact on the overall performance. Table 6 lists the navigation performance with shuffled episodes on the val unseen split of all the datasets, indicating the stability of ESceme.

Table 6 \({\bar{x}}\pm \sigma \) scores of shuffled episodes with five random seeds on the val unseen split of the datasets

4.3.5 Success Variation During Inference

Figure 4 compares the SPL and CLS curves of different methods to visualize how navigation quality varies over the course of inference. On the short-horizon navigation dataset R2R, HAMT (Chen et al., 2021b) oscillates around 62 and drops in the last fifth of the progress. The decrease could result from tougher samples at the end. TDSTP (Zhao et al., 2022) presents a more stable oscillation around 62, owing to a global action space and an auxiliary goal-related task. Starting from moderate navigation ability, an agent with ESceme benefits greatly from memory updates and maintains a high success rate once its memory becomes more complete.

On the long-horizon VLN dataset R4R, TDSTP (Zhao et al., 2022) shares a similar oscillation around 41 with HAMT (Chen et al., 2021b) in SPL. TDSTP preserves a relatively more stable success rate at the cost of much lower CLS, which reveals that the goal-related auxiliary task undermines the ability of instruction following. Our ESceme shows a sharp increase within the first 4/5 of navigation and remains stable thereafter. We attribute the large improvement on R4R to two reasons: 1) long-horizon navigation involves more action steps, so a slight increase in navigation ability results in a big difference in final performance; and 2) the sample density per scene in R4R is much higher than in the R2R dataset.

4.4 Qualitative Analysis

To intuitively demonstrate the benefit of the proposed episodic scene memory, we provide a visualization example in Fig. 5. It shows the panoramic views and top-down overviews of navigation. The last step of HAMT and TDSTP navigates to a visible corner of the bedroom. Instead, ESceme understands the instruction better. It takes a step to walk down to the end of the hall and then turns left to the bedroom.

A failure case of ESceme is shown in Fig. 6, where the instruction is “Leave sitting room and head towards the kitchen, turn right at living room and enter. Walk through living room to dining room and enter. Turn left and...” After correctly predicting the first three actions, ESceme failed to enter the dining room and got lost. It indicates that the representations for the viewpoints are not distinguishable enough to capture some fine-grained difference between the dining room and the living room.

5 Conclusion

In this paper, we devise the first VLN mechanism with episodic scene memory (ESceme) and propose a simple yet effective implementation via candidate enhancing. We show that an agent with ESceme improves navigation ability in short-horizon, long-horizon, and vision-and-dialog navigation. Our method outperforms the existing state-of-the-art and wins first place on the CVDN leaderboard while bringing only a marginal increase in memory, parameters, and inference time. We hope this work can inspire further explorations of episodic memory in VLN and related fields, e.g., building the memory in continuous environments and with more advanced techniques such as neural SLAM.

5.1 Limitations

Although we have shown the effectiveness of the proposed episodic scene memory, several limitations remain. First, the agent requires knowledge of environment identity to build episodic memory for each scene. This is unavoidable but matches practical demands, where an agent conducts multiple instructions in one scenario. Second, the “location ID” information is directly available from the simulator and the dataset, and is thus accurate and free of noise. When the location ID is unknown in advance, the episodic scene memory can be built by adding a discrete mapping process analogous to SLAM: no specific location ID is required, and the rough global position of each node can be dynamically estimated using the angle of each navigable viewpoint. Third, the architecture of a navigation agent and the training data limit the efficacy of a complete scene memory. We hope the proposed episodic scene memory can be explored in more advanced and diverse architectures.

Table 7 Ablation studies of ESceme construction on R2R dataset