1 Introduction

Visual navigation has attracted much attention and is used in many robotic application scenarios [13, 26, 29, 35, 37]. Navigation is difficult for an embodied agent that can rely only on a camera sensor to reach a given target by carrying out a sequence of actions.

To navigate to the target effectively, the agent needs to extract environmental features to localize itself and to use the spatial relationships of the environment at different time steps to move toward the target location. The scene graph mimics the human ability to reason about the relevance of objects in an unknown environment and aids navigation. Two key points are how to establish a scene prior that adapts to scene changes and how to encode the scene graph.

Fig. 1 Scene graph in indoor navigation. Unlike most existing one-stage methods, which consider only co-occurrence relationships, our two-stage method considers entity proposals, positional embeddings, and directional relationships

First, the navigation model enables the agent to perceive the high-dimensional spatial distribution of semantic concepts at scale, represented as scene graphs. These mainly include scene graphs [13, 26, 30, 37], object instance-wise graphs [15], and zone-wise graphs [34]. The above approaches build an undirected graph with only co-occurrence relationships, which provides some indication for navigation when the target is invisible. However, since object relationships and spatial layout are usually inconsistent across environments, the generalization ability of these methods remains limited. In addition, these methods enlarge the state space with useless information, such as scene priors that, as fixed features, cannot adapt to all scenarios. In contrast to prior work that constructs a co-occurrence relationship graph offline as the scene graph component [30], we study whether heterogeneous relation zone modeling (i.e., modeling the joint distribution of object positions, directions between objects, and the relationships between objects and relations) can serve as a partial replacement for the state in conventional RL algorithms.

Second, after constructing the scene graph, we revisit the powerful representations of Transformers for visual navigation. In prior scene graph work, relevant latent information for locating the possible position of the target, such as the "countertop" zone or "table" zone, is missing. Besides, the edge information between objects is important for the agent to reason about directions. For example, when the target remote control is invisible but the TV is easy to find, looking down is likely to reveal the remote control quickly, since common sense tells us a remote control usually lies below the TV. The use of such hierarchical information can implicitly reduce object and zone bias. Moreover, prior works rarely explore relational reasoning in this graph and treat all edges equally. They lack interaction between nodes (objects) and edges (directions between objects), and they use a fixed graph structure, which cannot adapt to changes across environments and may introduce errors in new scenes.

To solve the above two issues, we design RTNet to extract scene graph information, where N2N serves as the encoder and E2N serves as the decoder. It encodes the target-related zone and suppresses other useless information, giving guidance on relations and navigation actions so that the agent can locate the target area more accurately. RTNet considers E2N interaction relations and adapts to different structures, thus performing relational reasoning and guiding the navigation policy. We classify the nodes by explicitly inferring the connections between visible object nodes. Moreover, we iteratively refine the bridge between the scene and commonsense graphs to update these connections.

Experiments show that our navigation framework achieves SOTA results in the AI2THOR environment. Our main contributions are:

  1. We define the VN task as feature learning on graphs and use a transformer framework to exploit object-to-object interactions. We obtain object-to-object interaction attention via node-to-node (N2N) self-attention in the transformer encoder.

  2. We apply rich node representations to capture edge-to-node (E2N) cross-attention in the object-to-relationship interaction decoder. A novel edge position embedding method for the transformer decoder is proposed to accumulate global scene context while preserving local context.

  3. We use an efficient predictive module for directional relations, accumulating the learned nodes and the directed relations required for representation and classification from the two sides of the transformer. It effectively avoids collisions and deadlocks and improves generalization across scenarios.

The remainder of the paper is organized as follows: Sect. 2 summarizes recent advances in visual navigation. Section 3 presents our navigation framework. Experimental details and analysis are given in Sect. 4. Section 5 concludes and summarizes possible future improvements.

2 Related work

2.1 Visual navigation

More recent works [31, 33,34,35,36] focus on spatial relationships in indoor scenes, describing semantic graphs as priors for navigation. One way is to build a topological graph using conditional random fields, constraining a sequence of robotic actions into a specific graph [20]. Although these methods use pre-established maps to improve the localization ability of the robot, they need to build corresponding maps for different scenes, which is costly. Other studies use representations such as RelVec [9], contextualized graphs [28], or caption graphs to structure robotic actions. However, their graph embedding methods rarely consider the direction of relations between objects. They use a predefined global semantic map, which carries much redundant node information and cannot accurately locate the target. In our model, the target images and contextual semantics are fused to dynamically update the on-scene knowledge graph using RTNet, which gives the HZG cross-scene generalization ability.

2.2 Transformer in visual navigation

The Transformer [25], originating in NLP [1], has become one of the most popular approaches for various vision [34] and vision-language tasks. In vision-language pretraining, owing to its ability to process both sequential and non-sequential data, it improved upon prior results in almost all cases. [23] designs a map representation to encode traversable paths, unexplored regions, RGB, depth, and semantic segmentation masks, and introduces multi-layer Transformer networks to better utilize scene semantics. However, it pays little attention to the correlation between the relations of scene objects and spatial information. [11] introduces context information and presents an Object Memory Transformer (OMT) network, which utilizes a long-term memory of object semantics without prior knowledge, but may incur higher complexity. Moghaddam et al. [21] propose a graph transformer network-based value estimation, which reduces value estimation error and obtains an optimal policy; however, they also rarely consider spatial direction information, which leaves the agent without direct guidance for finding the object. Zachary et al. [24] propose a map transformer method to extract multimodal environmental information, combining an attention schema and auxiliary rewards to better utilize scene semantics. Tommaso [3] designs a scene memory transformer model that, starting from RGB images, binds objects to specific rooms. These methods still use transformer networks for storing RL training experience or as a visual perception module within the navigation architecture. However, they lack mutual interaction between objects or zones and relations, and thus cannot give direct information about the possible location of objects. Our framework designs RTNet to capture object-relation interactions and help avoid deadlocks.

Table 1 Table of mathematical symbols and meanings
Fig. 2 Overview of our proposed navigation framework. The scene graph knowledge HZG is constructed from the DETR [4] object detector to obtain object positional relationships. AI2THOR depth images are used to derive 3D bounding boxes to predict remote relationships. Visual features are obtained from ResNet18, the target Glove embedding (TGE) provides the target word embedding extracted from Glove, and the HZG represents spatial features. A graph attention network reasons over the primary references (objects) to perform knowledge reasoning that guides RL action sampling. Meanwhile, meta-learning supervises the RL training process and provides adaptation in HZG

3 RTNet for visual navigation

We detail the proposed visual navigation framework. The mathematical symbols used in the manuscript are listed in Table 1. The whole system includes three parts, as shown in Fig. 2: heterogeneous zone graph generation, RTNet for graph embedding, and multimodal fusion for DRL. It works as follows: (1) The visual perception module encodes the multimodal environment information: global image features consisting of RGB information, local object image features, and the target word embedding. It considers useful navigation trajectories, history, and rich spatial information. (2) The detection module performs object detection and locates the object instances of interest with the DETR detector [4]. DETR transforms N encoded \(d\)-dimensional features \({\mathbb {R}}^{N\times {d}}\) from the same layer into N detection results, and RTNet encodes the node information and decodes the edge information. Thus, RTNet allows the agent to make inferences about the structured environment. (3) The DRL module estimates the state-action function \(\pi (a_{t}|s_{t};\theta ^{'})\) with a shared hidden vector \(h_{t-k}\), where \(\theta\) denotes the training parameters and k is the number of LSTM layers.

3.1 Task definition setup

We apply the A3C algorithm as our navigation planner, which can be defined over a task set \(T=\{t_{1},\ldots ,t_N\}\) in a given room. Our agent perceives surrounding locations and scenes through its observations. The task can be modeled as a partially observable Markov decision process (POMDP) [10] over a sequence of actions, represented as a tuple \((S,A,R,\upsilon )\): S is the state space, consisting of observation frames; A is the action space; R is the reward; and \(\upsilon\) is the discount factor. An action \(a\in A=\{MoveAhead, RotateLeft, RotateRight, LookUp, LookDown, Done\}\). MoveAhead moves the agent forward 0.25 m, RotateLeft and RotateRight turn it 45 degrees, and LookUp and LookDown tilt the camera 30 degrees. Once the agent can see the target within 1.5 m, the "success" signal occurs. The Done action is sampled when the episode fails. The evaluation indicators are success rate (SR) and success weighted by path length (SPL). SR is the average success rate over all episodes in which the target is found within a given number of steps. SPL takes into account the navigation quality relative to the optimal trajectory, defined as \(\frac{1}{N}\sum ^{N}_{i=1}P_{i}\frac{OL_{i}}{\max (OL_{i},L_{i})}\), where \(P_i\) is a binary success indicator, \(OL_i\) is the optimal length of the i-th trajectory, and \(L_i\) is the length of the agent's actual trajectory.
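
For concreteness, the SR and SPL metrics above can be computed as in the following minimal Python sketch; the episode records and field names are hypothetical:

```python
# Minimal sketch of the SR and SPL metrics defined above. Each episode
# record is assumed to hold a binary success flag P_i, the optimal path
# length OL_i, and the agent's actual path length L_i.

def success_rate(episodes):
    """SR: average success over all episodes."""
    return sum(e["success"] for e in episodes) / len(episodes)

def spl(episodes):
    """SPL: success weighted by normalized inverse path length."""
    total = 0.0
    for e in episodes:
        total += e["success"] * e["optimal_len"] / max(e["optimal_len"], e["actual_len"])
    return total / len(episodes)

episodes = [
    {"success": 1, "optimal_len": 4.0, "actual_len": 6.0},
    {"success": 0, "optimal_len": 3.0, "actual_len": 10.0},
]
print(success_rate(episodes), spl(episodes))  # 0.5, 0.333...
```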

3.2 Heterogeneous zone graph generation

Different from their KG (see Fig. 3), we build an HZG with 7 types of edges, where each edge denotes the direction of one object relative to another. With this setting, the agent can walk toward the target using precise guidance: for example, if the mouse is to the right of the laptop in the HZG, and the laptop is invisible while the mouse is in the current observation, rotating right is more likely to reveal the target laptop. We define our HZG as an adjacency matrix \(A\in {\mathbb {R}}^{n\times n\times C}\) consisting of 4-dimensional vectors, where each channel C encodes a different scene type. This separation allows the agent to use different scene types without confusion, so scene-specific HZGs can be encoded: the agent can infer a kitchen prior distinct from a bathroom prior.

Fig. 3 Heterogeneous zone graph generation. We construct a cleaner knowledge graph in our framework. It contains spatial locations of objects extracted as 7 types of relations. The heterogeneous zone graph is stored as a \(92\times 92\) adjacency matrix. HZG contains precise spatial location relationships rather than the ambiguous co-occurrence relationships in commonly used VG knowledge graphs

Different from a homogeneous graph, the HZG provides the agent with direct guidance and fuzzy search capability. In a specific room m, we first let the agent randomly explore the room and observe a set of visual tuple features (f, l), where \(f\in {\mathbb {R}}^{N\times {1}}\) is the object bag-of-words vector obtained by DETR, representing the objects that appear in the current view with binary (0/1) category indicators. If the current view contains several objects of the same category, the category is recorded only once in the object tuple. N is the number of object categories, and \(l=\{x,z,\theta _{yaw}, \theta _{pitch}, ROI\}\) is the observation pose, where x and z are the horizontal coordinates and \(\theta _{yaw}\), \(\theta _{pitch}\) are the yaw and pitch angles of the agent. ROI is the fused region-of-interest feature over all n objects' ROIs in the zone, \(ROI=Concat(roi_{n})\). Then k-means clustering on the features f yields K regions, forming the zone-level HZG \(Z_m(V_m,E_m)\).

In addition, directional features are extracted to provide directional guidance that supervises the action selection of reinforcement learning. The motivation is to utilize commonsense scene layout: for example, the TV zone is generally in front of the sofa zone, so when the sofa is in front of the agent, looking back tends to find the TV faster. Each zone \(z_{m}\) has coordinates comprising the ROI center point (x, y) and the width and height (w, h). Direction relationships between zones are counted as term frequencies using the center point locations; a relationship is kept only if it appears at least 3 times between two zones. The frequency ordering is then represented using max-min normalization. Consider a scenario where the set of zone-level graphs is \(Z=\{z_1 (V_1, E_1),\ldots ,z_n (V_n,E_n)\}\). Since the zone number K is fixed, the HZG of each room has the same structure for matching and merging.
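
The zone direction-counting step described above can be sketched as follows; the relation names, the coarse direction rule, and the helper functions are illustrative assumptions, while the at-least-3 threshold and max-min normalization follow the text:

```python
from collections import Counter

# Sketch of counting directional relations between zone center points.
# The direction() rule below is a simplifying assumption covering four
# of the seven relation types; the >=3 frequency threshold and the
# max-min normalization follow the text.

def direction(ci, cj):
    """Coarse direction of zone j relative to zone i from center points."""
    dx, dz = cj[0] - ci[0], cj[1] - ci[1]
    if abs(dx) >= abs(dz):
        return "right" if dx > 0 else "left"
    return "front" if dz > 0 else "behind"

def build_edges(observations):
    """observations: list of dicts {zone_id: (x, z)} from random exploration."""
    counts = Counter()
    for obs in observations:
        for i in obs:
            for j in obs:
                if i != j:
                    counts[(i, j, direction(obs[i], obs[j]))] += 1
    # Keep relations seen at least 3 times, then max-min normalize frequencies.
    kept = {k: v for k, v in counts.items() if v >= 3}
    if not kept:
        return {}
    lo, hi = min(kept.values()), max(kept.values())
    return {k: (v - lo) / (hi - lo + 1e-8) for k, v in kept.items()}
```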

3.3 RTNet for graph embedding

In this section, we detail the RTNet designed to embed the graph generated in the previous section.

Fig. 4 The structure of the relation-wise transformer network

In a transformer, at each layer, the input \(X\in {\mathbb {R}}^{N\times D}\), which has N entries of D dimensions, is transformed into queries \(Q=XW_{Q}\) with \(W_{Q}\in {\mathbb {R}}^{D\times {D_{q}}}\), keys \(K=XW_{K}\) with \(W_{K}\in {\mathbb {R}}^{D\times D_{k}}\), and values \(V=XW_{V}\) with \(W_{V}\in {\mathbb {R}}^{D\times D_{v}}\). In implementations, \(D_{q}\), \(D_{k}\), and \(D_{v}\) are normally equal. The attention over Q, K, V is defined as:

$$\begin{aligned} Att(Q,K,V)=softmax(\frac{QK^{T}}{\sqrt{D_{k}}})V \end{aligned}$$
(1)

and multi-head attention (MHA) is formulated as:

$$\begin{aligned} \begin{aligned} MHA(Q,K,V)=Concat(h_{1},\ldots ,h_{n})W_{o},\\ h_{i}=Att(XW_{Q_{i}},XW_{K_{i}},XW_{V_{i}}) \end{aligned} \end{aligned}$$
(2)

\(Att(\cdot )\) is the self-attention layer, which is followed by normalization, residual connections, and a feed-forward layer. In this work, we propose a Relation-wise Transformer based on \(Att(\cdot )\) to explore the spatial context, encoding the spatial structure between relations and detected objects.
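
Equations (1) and (2) correspond to the standard computation below, given here as a minimal PyTorch sketch; the dimensions (92 nodes, D=512, 8 heads) are illustrative choices, not values mandated by the equations:

```python
import torch
import torch.nn.functional as F

def att(Q, K, V):
    """Scaled dot-product attention, Eq. (1)."""
    d_k = K.size(-1)
    return F.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1) @ V

class MHA(torch.nn.Module):
    """Multi-head attention, Eq. (2): split D into n heads, attend, concat, project."""
    def __init__(self, D, n_heads):
        super().__init__()
        self.n, self.d = n_heads, D // n_heads
        self.W_q = torch.nn.Linear(D, D)
        self.W_k = torch.nn.Linear(D, D)
        self.W_v = torch.nn.Linear(D, D)
        self.W_o = torch.nn.Linear(D, D)

    def forward(self, X):                      # X: (N, D)
        N, D = X.shape
        def split(W):                          # (N, D) -> (n_heads, N, d)
            return W(X).view(N, self.n, self.d).transpose(0, 1)
        heads = att(split(self.W_q), split(self.W_k), split(self.W_v))
        return self.W_o(heads.transpose(0, 1).reshape(N, D))

X = torch.randn(92, 512)                       # e.g., 92 nodes, as in our HZG
print(MHA(512, 8)(X).shape)                    # torch.Size([92, 512])
```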

3.3.1 Heterogeneous relation-wise embedding

This module represents the direction relationships in the scene graph. A relationship is a directed property, i.e., \(Subject\rightarrow {Object}\) is fixed and cannot be exchanged. After obtaining the context-rich node and edge embeddings, together with global average pooling of the image features from the object detector, an initial directed relational embedding \(rel_{i\rightarrow {j}}^{in}\in {\mathbb {R}}^{512}\) is obtained. Next, \(rel_{i\rightarrow {j}}^{in}\) is passed through a sequential block of neural networks, which we call the heterogeneous relation-wise embedding (HRE) module. The HRE architecture (Fig. 4) adds edge relations to the original transformer architecture. The permutation equivariance of the transformer is an ideal way to handle node embeddings in a graph, because as long as the edges are preserved, the arrangement of nodes does not matter. We call the residual channels of the original transformer architecture the node channels. These channels translate a set of input node embeddings \(\{n_{1}^{0},n_{2}^{0},\ldots ,n_{N}^{0}\}\) into a set of output node embeddings \(n_{i}^{N}\), where \(n_{i}^{l}\in {\mathbb {R}}^{d_{e}}\) and \(d_{e}\) is the embedding dimensionality. The input edge embedding consists of the graph structure matrix and edge features. The edge embeddings of each layer are updated by HRE, and finally a set of output edge embeddings \(e^{L}_{ij}\) is generated, from which structural predictions such as edge labeling can be performed.

3.3.2 Relationship representation

We apply DETR [4] as our backbone. For each frame \(I_{t}\) at timestep t in the observation, the detector provides visual features \(\{v_{t}^{1},\ldots ,v_{t}^{N{(t)}}\}\in {\mathbb {R}}^{2048}\), bounding boxes \(\{b_{t}^{1},\ldots ,b_{t}^{N(t)}\}\), and object category distributions \(\{c^{1}_{t},\ldots ,c^{N(t)}_{t}\}\) for the object proposals, where N(t) is the number of object proposals. Among the N(t) object proposals, there is a set of relationships \(R_{t}=\{r_{t}^{1},r_{t}^{2},\ldots ,r_{t}^{K(t)}\}\). The representation vector \(x_{t}^{k}\) of the relation \(r_{t}^{k}\) between the \(i\)-th and \(j\)-th object proposals consists of the current observation image, the target word embedding, and spatial information, and is defined as:

$$\begin{aligned} \begin{aligned} x_{t}^{k}={\mathcal {C}}(W_{s}v_{t}^{i},W_{o}v_{t}^{j},W_{u}\theta (u_{t}^{ij}\oplus f_{bbox}(b_{t}^{i},b_{t}^{j})),s_{t}^{i},s_{t}^{j}) \end{aligned} \end{aligned}$$
(3)

where \({\mathcal {C}}\) is the concatenation operation, \(\oplus\) is element-wise addition, and \(\theta\) is the flattening operation. \(W_{s},W_{o}\in {\mathbb {R}}^{2048\times 512}\) and \(W_{u}\in {\mathbb {R}}^{512\times 512}\) denote the linear matrices for dimension compression. \(u_{t}^{ij}\in {\mathbb {R}}^{256\times 7\times 7}\) is the union box computed by ROIAlign [29], and \(f_{bbox}\) maps the subject-object box pair to a feature with the same shape as \(u_{t}^{ij}\). The semantic embedding vectors \(s_{t}^{i},s_{t}^{j}\in {\mathbb {R}}^{200}\) are determined by the object categories. The relationship representations exchange spatial and sequential information in the Relation-wise Transformer Network.
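
Equation (3) can be read as the following sketch; `f_bbox` is simplified here to a single learned projection of the two boxes (an assumption), while the dimensions of \(W_s\), \(W_o\), \(W_u\), \(u_t^{ij}\), and the semantic vectors follow the text:

```python
import torch

# Sketch of the relation representation x_t^k from Eq. (3). W_s, W_o
# compress the 2048-d subject/object features; W_u compresses the
# flattened union-box feature; f_bbox is a simplified stand-in mapping
# the two 4-d boxes to u's shape so they can be added element-wise.
W_s = torch.nn.Linear(2048, 512)
W_o = torch.nn.Linear(2048, 512)
W_u = torch.nn.Linear(256 * 7 * 7, 512)
f_bbox = torch.nn.Linear(8, 256 * 7 * 7)       # assumption: box pair -> u's shape

def relation_vec(v_i, v_j, u_ij, b_i, b_j, s_i, s_j):
    # Add in flattened form: equivalent to theta(u_ij (+) f_bbox(b_i, b_j)).
    u = u_ij.flatten() + f_bbox(torch.cat([b_i, b_j]))
    return torch.cat([W_s(v_i), W_o(v_j), W_u(u), s_i, s_j])

x = relation_vec(torch.randn(2048), torch.randn(2048),
                 torch.randn(256, 7, 7), torch.randn(4), torch.randn(4),
                 torch.randn(200), torch.randn(200))
print(x.shape)  # torch.Size([1936]) = 512*3 + 200*2
```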

3.3.3 Relation propagation via transformer

The core concept of the Relation-wise Transformer Network is efficient attention-based relation propagation across all nodes and edges, designed as an encoder-decoder architecture implemented with transformers [25]. The architecture uses self-attention mechanisms to retain global image cues. The attention is defined as the matrix:

$$\begin{aligned} att(x_{t}^{k},r_{t}^{k})=softmax(\frac{x_{t}^{k}{({r_{t}^{k}})}^{T}r_{t}^{k}}{\sqrt{d}}) \end{aligned}$$
(4)

In the transformer architecture, two different attention mechanisms are designed depending on the observation. A self-attention module is added to the encoder of the transformer as N2N attention. Furthermore, to model the optimal context from all nodes to edges, E2N attention is used as cross-attention in the transformer’s decoder. To facilitate global and local context propagation in E2N attention, proper changes are introduced to the positional encoding of the decoder.

3.3.4 Encoder N2N attention

Context is established by exploring the object's surroundings using the object detector, which encodes more discriminative features for relation classification. To this end, we arrange the nodes in an arbitrary order (the result is permutation-invariant) and pass this sequence of nodes to the transformer encoder. The initial feature vector \(f^{in}_{i}\) of the \(i^{th}\) node is obtained through the linear projection \(W_{node}\):

$$\begin{aligned} \begin{aligned} f_{i}^{in}=W_{node}([v_{i};s_{i};b_{i}]) \end{aligned} \end{aligned}$$
(5)

In addition, for the \(i^{th}\) node, we add a positional feature vector \(pos(n_{i})\) to its initial features \(f_{i}^{in}\). It takes the categorical position of the \(i^{th}\) node in a linear ordering of all nodes and converts it to a continuous sinusoidal vector as described in [25].

$$\begin{aligned} f_{i}^{final}=encoder(f_{i}^{in}+pos(n_{i})) \end{aligned}$$
(6)
$$\begin{aligned} o_{i}^{final}=argmax(W_{cate}(f_{i}^{final})) \end{aligned}$$
(7)

where encoder is a stack of multi-head attention layers as shown in Fig. 2. After the nodes are contextualized by the encoder, the final node features \(f_{i}^{final}\) are obtained. Designing this semantically rich node function serves two purposes. First, it yields the final object class probability \(o_{i}^{final}\in C\) via a linear object classifier \(W_{cate}\), as described in Eq. 7; second, \(f_{i}^{final}\) is passed to the decoder cross-attention for edge context propagation.
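
Equations (5)-(7) amount to the following sketch; the layer count, head count, and exact node-feature layout are illustrative assumptions:

```python
import torch

# Sketch of the N2N encoder path, Eqs. (5)-(7): project [visual; semantic;
# box] node features, add sinusoidal position encodings, contextualize with
# a transformer encoder, then classify each node.
D, N, C = 512, 92, 92                          # feature dim, max nodes, classes

W_node = torch.nn.Linear(2048 + 200 + 4, D)    # [v_i; s_i; b_i] -> f_i^in
encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=6)                              # assumption: 6 layers
W_cate = torch.nn.Linear(D, C)

def sinusoidal(n, d):
    """Standard sinusoidal positional encoding as in [25]."""
    pos = torch.arange(n).unsqueeze(1)
    div = torch.exp(torch.arange(0, d, 2) * (-torch.log(torch.tensor(10000.0)) / d))
    pe = torch.zeros(n, d)
    pe[:, 0::2], pe[:, 1::2] = torch.sin(pos * div), torch.cos(pos * div)
    return pe

nodes = torch.randn(1, N, 2048 + 200 + 4)      # one frame's node features
f_in = W_node(nodes) + sinusoidal(N, D)        # Eq. (5) plus pos(n_i)
f_final = encoder(f_in)                        # Eq. (6)
o_final = W_cate(f_final).argmax(-1)           # Eq. (7)
```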

3.3.5 Decoder edge positional encoding

We feed the transformer decoder the edges of the directed object interaction graph along with their position embeddings, which we call EdgeQueries. Since there is no natural ordering among edges, we propose a new edge embedding method based on node positions. The new position encoding vector \(pos_{e_{ij}}\in {\mathbb {R}}^{2048}\) for edge \(e_{ij}\) encodes the positions of its two endpoint nodes in an interleaved fashion, where one node plays the role of the subject and the other the object. Since our edges are directed, the proposed edge position embedding is designed to distinguish the source node from all other nodes. The goal is to accumulate the necessary global context (all the different object instances) without losing focus on the local context. We define the PE as:

$$\begin{aligned} PE_{e_{ij}}(k,k+1)=\left[ \sin {\left( \frac{p_{i}}{m^{\frac{2k}{d}}}\right) },\cos {\left( \frac{p_{i}}{m^{\frac{2k}{d}}}\right) }\right] \end{aligned}$$
(8)

Equation 8 describes the edge positional coding, where \(p_{i}\) and \(p_{j}\) are the positions of nodes \(n_{i}\) and \(n_{j}\), m is the maximum number of nodes in the sequence, \(d=2048\), and k indexes the \(k^{th}\) feature dimension of the positional coding vector.
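
A hedged reading of Eq. (8) as code: we assume the sin/cos pairs alternate between \(p_i\) and \(p_j\) to realize the interleaving described above, since the exact interleaving scheme is not spelled out:

```python
import math
import torch

def edge_pos_encoding(p_i, p_j, d=2048, m=92):
    """Sketch of Eq. (8): sinusoidal encodings of the two node positions
    p_i, p_j along the d-dimensional edge embedding. Assumption: even
    sin/cos pairs use p_i, odd pairs use p_j (the interleaving)."""
    pe = torch.zeros(d)
    for k in range(0, d, 2):
        p = p_i if (k // 2) % 2 == 0 else p_j  # alternate source/target node
        freq = m ** (2 * k / d)
        pe[k], pe[k + 1] = math.sin(p / freq), math.cos(p / freq)
    return pe

print(edge_pos_encoding(3, 7).shape)  # torch.Size([2048])
```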

3.3.6 Decoder E2N attention

For the edge query \(e_{ij}\) (between nodes \(n_{i}\) and \(n_{j}\)), its bounding box position \(b_{ij}\in {\mathbb {R}}^{4}\) and initial visual feature \(v_{ij}\in {\mathbb {R}}^{4096}\) come from the union of the bounding boxes of the two nodes, as shown in Fig. 2. The GloVe vector embeddings of its two node labels, \(s_{ij}\), are concatenated with the previously obtained boxes and visual features to enrich the semantics of edges. Then a linear projection layer \(W_{edge}\) produces the initial edge feature vector \(f_{ij}^{in}\) of the edge queries:

$$\begin{aligned} f_{ij}^{in}=W_{edge}([v_{ij};s_{ij};b_{ij}]). \end{aligned}$$
(9)

Complex global scene representations require well-contextualized edges, which can only be achieved by exploiting a larger scene context. In conventional transformer decoders, masked attention restricts edge attention to part of the sequence. Accumulating global edge context therefore requires a mechanism that preserves local dependencies while exploring the global context.

First, E2N cross-attention is applied from each edge to all nodes. We then obtain the contextualized edge feature \(f_{ij}^{final}\in {\mathbb {R}}^{2048}\) as

$$\begin{aligned} \begin{aligned} f_{ij}^{final}=decoder(f_{ij}^{in}+PE_{e_{ij}},f_{i=1..N}^{final}). \end{aligned} \end{aligned}$$
(10)

where decoder is a multi-head attention stack containing our proposed E2N attention with the edge positional encoding described above.
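
In PyTorch terms, Eqs. (9)-(10) can be sketched as cross-attention from edge queries to the encoder's node memory; the edge count, layer count, and feature layout are illustrative assumptions:

```python
import torch

# Sketch of the E2N decoder, Eqs. (9)-(10): edge queries (with the edge
# positional encoding) cross-attend to the contextualized node features
# f_i^final produced by the N2N encoder.
D, N, E = 2048, 92, 30                         # dims, nodes, edges (illustrative)

W_edge = torch.nn.Linear(4096 + 400 + 4, D)    # [v_ij; s_ij; b_ij] -> f_ij^in
decoder = torch.nn.TransformerDecoder(
    torch.nn.TransformerDecoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=6)                              # assumption: 6 layers

edge_feats = torch.randn(1, E, 4096 + 400 + 4) # s_ij: two concatenated GloVe labels
edge_pe = torch.randn(1, E, D)                 # from the edge positional encoding above
node_memory = torch.randn(1, N, D)             # f_{i=1..N}^final from the encoder

f_ij_in = W_edge(edge_feats)                   # Eq. (9)
f_ij_final = decoder(f_ij_in + edge_pe, node_memory)  # Eq. (10)
print(f_ij_final.shape)                        # torch.Size([1, 30, 2048])
```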

3.4 Multimodal fusion for reinforcement learning

We fuse the multimodal information of image embedding, graph embedding, previous-action embedding, and memory embedding as the RL state representation. A3C maintains estimates of both the state-value function \(V(s_t)\) and the state-action policy \(\pi (a_{t},s_{t})\). The agent uses the value function (critic) to update the policy (actor) by training multiple parallel threads that stabilize each other. The policy function and value function can be learned by a single neural network with multiple heads [16]. To compute the update, it calculates \(\bigtriangledown _{\theta _{\pi }}\log \pi (a_{t}|s_{t})A(s_{t},a_{t})\), where A is the advantage term. The advantage quantifies how much better an action turns out to be than expected, estimating the difference between the accumulated discounted rewards and the value of the current state. The whole update can therefore be written as:

$$\begin{aligned} \begin{aligned} \bigtriangledown _{\theta _{\pi }}\log \pi (a_{t}|s_{t};\theta _{\pi })(R_{t}-V(s_{t};\theta _{\upsilon }))\\ +\eta \bigtriangledown _{\theta _{\pi }}H(\pi (s_{t};\theta _{\pi })) \end{aligned} \end{aligned}$$
(11)

where H is an entropy term that regulates the agent's exploration through the hyper-parameter \(\eta\). Generally, the policy and value functions share network parameters except for the last layer. However, we use the RTNet knowledge graph features extracted from our proposed HZG to expand the critic sub-network. These features lead to a more accurate value estimation, thereby reducing the variance of the policy update process. We define our advantage function as \(A(o_{t})=r(a_{t}|o_{t})+V(o_{t+1})-V(o_{t})\) at each state and update the gradients using the policy loss

$$\begin{aligned} L_{\pi }=-(\log (\pi (a_{t}|o_{t}))\times A(o_{t})+\beta \times H_{t}(\pi )) \end{aligned}$$
(12)

where \(\beta\) is a hyper-parameter encouraging exploration and \(H_{t}\) is the entropy [27]. In this way, a more accurate value estimation keeps the policy gradient stable [21], which we assume promotes an optimal learned policy.
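
A minimal sketch of the resulting update of Eqs. (11)-(12); the loss weighting and the toy inputs are placeholders:

```python
import torch

def a3c_losses(logits, value, next_value, action, reward, beta=0.01):
    """Sketch of the policy loss of Eq. (12) with the advantage
    A(o_t) = r + V(o_{t+1}) - V(o_t) and an entropy bonus H_t(pi)."""
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum()                 # H_t(pi)
    advantage = reward + next_value - value              # A(o_t)
    policy_loss = -(log_probs[action] * advantage.detach() + beta * entropy)
    value_loss = advantage.pow(2)                        # critic regression term
    return policy_loss, value_loss

# Toy example: 6 actions; in training, `value` comes from the critic head
# and carries gradients of its own.
logits = torch.randn(6, requires_grad=True)
p_loss, v_loss = a3c_losses(logits, torch.tensor(0.3), torch.tensor(0.5),
                            action=2, reward=torch.tensor(-0.01))
(p_loss + 0.5 * v_loss).backward()
```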

4 Experiments

4.1 Experimental setup

The House Of inteRactions (AI2THOR) is a realistic interactable framework for AI agents. AI2THOR consists of indoor scenes in which AI agents can navigate and interact with objects to perform tasks. We use the AI2THOR simulator as our experimental framework; it contains four room types: bathroom, living room, bedroom, and kitchen. Each type contains 30 rooms, 120 rooms in total. All baselines and our experimental settings follow SAVN [26], including the target settings and the number of training frames. For each room type, scenes (\(1-20\)) are used for training, scenes (\(21-25\)) as the test set, and the last scenes (\(26-30\)) as the validation set. The goal of each episode is to maximize the expected cumulative reward \(\Sigma _{t=0}^{T}\gamma ^{t}r_{t}\). The reward function of each trajectory is defined in Eq. 13, which combines the rewards proposed in [8] and [17].

$$\begin{aligned} \begin{aligned} r=\left\{ \begin{aligned}&5.0 \quad if \quad success \\&S_{bbox} \quad if \quad S_{bbox}\quad \text {is the highest in the episode} \\&-0.01 \quad otherwise, \end{aligned} \right. \end{aligned} \end{aligned}$$
(13)

where \(S_{bbox}\) is the bounding box area. We train all methods to convergence with a maximum of 200 million frames. Our HZG includes all 92 available objects. We implement our model in PyTorch and use RMSprop as the optimizer for adaptation and SharedRMSprop otherwise. The training and test targets are listed in Table 2.

Table 2 Target setting

4.2 Implementation details

To process the perceived images, we use a pre-trained ResNet18 to extract observation features at each time step. For RTNet, the node features in the encoder are converted by linear layers from \(7\times 7\times 512\) to \(7\times 7\times 256\), then \(7\times 7\times 128\), and finally \(7\times 7\times 64\). The edge features of the decoder are converted from \(7\times 7\times 92\) to \(7\times 7\times 256\) through a linear layer, and the encoder features are concatenated into a feature vector of \(7\times 7\times 320\). The algorithm uses Glove [22] to generate 300-dimensional semantic embeddings of the target and graph objects, 92 objects in total. Our actor-critic network consists of an LSTM with 512 hidden states and two fc layers representing the actor and the critic. Its input concatenates the target object as a 300-dimensional vector, the observation features as a 1024-dimensional feature vector, and the HZG with 92 nodes fed through RTNet to produce 92-dimensional vectors. Meanwhile, RTNet also performs knowledge inference to produce a single value that is appended to our critic. The actor outputs a 6-dimensional action distribution \(\pi (a_{t}|x_{t})\), and the critic estimates a single value using softmax. Notably, our zero-shot agent dynamically updates the knowledge graph in unseen scenes and corrects wrong priors in the policy network. The input to the graph is a 1024-dimensional vector of node features, concatenating 512 observation features with Glove embeddings of the objects mapped from 300 to 512 dimensions by linear layers. Each layer contains 92 nodes in the adjacency matrix, with 5 layers in total: 4 layers are edges between objects in the four scene types, and the remaining self-connection layer is used for regularization.
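
To make the input layout concrete, the fusion described above can be sketched as follows; the dimensions come from the text, but the exact wiring and module names are our simplification:

```python
import torch

# Sketch of the multimodal state fed to the A3C policy: target GloVe
# embedding (300-d), observation features (1024-d), and the 92-d HZG
# graph embedding from RTNet, concatenated and passed through an LSTM
# with actor/critic heads. Wiring details are assumptions.
class NavPolicy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = torch.nn.LSTM(300 + 1024 + 92, 512, batch_first=True)
        self.actor = torch.nn.Linear(512, 6)    # 6-action distribution
        self.critic = torch.nn.Linear(512, 1)   # single value estimate

    def forward(self, target_emb, obs_feat, graph_emb, hidden=None):
        state = torch.cat([target_emb, obs_feat, graph_emb], dim=-1)
        out, hidden = self.lstm(state.unsqueeze(1), hidden)
        out = out.squeeze(1)
        return torch.softmax(self.actor(out), -1), self.critic(out), hidden

pi, v, h = NavPolicy()(torch.randn(1, 300), torch.randn(1, 1024), torch.randn(1, 92))
print(pi.shape, v.shape)  # torch.Size([1, 6]) torch.Size([1, 1])
```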

Table 3 Evaluation results (\(\%\)) on a per-room basis without the "Done" signal in the seen-objects and unseen-scenes setting

4.3 Baseline comparison

To demonstrate our main contributions and the necessity of each component, we compare against different baselines:

  • SP [30], which uses a fixed full graph encoding 83 objects from the Visual Genome dataset. The input to the policy network is a concatenation of the target object vector embedding and observation features.

  • A3C [37] agent, which removes the knowledge graph and adaptation framework. We also use HZG in the SAVN agent for comparison. For a fair comparison, all other experimental settings of the baselines are the same.

  • SAVN [26], which uses meta-learning to learn the training gradient and adapt to novel scenes.

  • Bayes [27], which uses a variational Bayesian model to predict optimal actions.

  • SpAtt [19], which uses attention to encode three types of information: previous actions, hidden states, and target word embeddings.

  • Context [8], which uses a YOLO detector to transform visual information into a context grid representing the similarity of objects and targets.

  • GTN [18], which considers the large bias between real-world VG knowledge and the simulated environment and uses a graph transformer network (GTN) [32] to update and adapt the knowledge.

  • TPN [10], which uses an object detector to build an object-relation graph (ORG) composed of confidence scores, labels, and bounding boxes.

  • GVE [21], which proposes graph-based value estimation (GVE) to improve the navigation policy, also using a graph neural network and meta-learning as its main methods.

  • HOZ [34], which proposes a hierarchical object-to-zone graph to guide the agent in a coarse-to-fine manner. HOZ consists of scene nodes, zone nodes, and object nodes. However, it is much like the scene-specific layer proposed in A3C [37], which also divides rooms into types to maintain accurate scene information. It is therefore not universal across room kinds, especially in unseen environments with large layout changes.

  • OMT [12], which designs an Object Memory Transformer navigation architecture that stores long-term scene and object semantics and attends to salient objects in this memory, allowing for efficient navigation without prior knowledge.

4.4 Results

Quantitative Results. We compare the baseline methods with ours, and additionally replace the Visual Genome dataset with HZG in SAVN, denoted SAVN-HZG. The results are shown in Table 4. Our approach improves on all baselines, exceeding the previous SOTA by about \(23.4\%\) in success rate and \(13.1\%\) in SPL. Meanwhile, each component of our method improves navigation performance.

Table 4 Total comparison results. We use the same experimental settings without the 'Done' signal (the agent learns to stop by itself). L>5 means the optimal path is longer than 5 steps

Since the baseline (SP) uses a category relation graph extracted from the real-world Visual Genome database, it can produce non-discriminative representations and semantic ambiguity: the baseline may fail to recognize the target or merely find similar context objects rather than targets. Experiments show that our HZG provides useful spatial visual features for agents, significantly promoting the efficiency and effectiveness of navigation. Besides, the graph attention network can produce a sensible auxiliary value for the RL policy with softmax, and ML adapts the knowledge graph in new scenes to guide the RL training process. Thus our model achieves better performance than the baselines.

Although SAVN applies meta-RL to promote navigation performance, it uses Visual Genome objects as word embeddings to represent targets in AI2THOR, and thus suffers from ambiguity when objects appear together, such as a tomato and an apple. Our HZG contains precise object relations with ground-truth knowledge extracted from AI2THOR. For evaluation, we replace the Visual Genome knowledge graph of SAVN with HZG, which we name SAVN-HZG. Our model, using HZG and RTNet, significantly improves navigation performance.

Fig. 5 Our method compared with the baseline. Our model combines HZG prior knowledge: when looking for a target, our agent reasons that the target may lie in certain directions relative to other objects and takes actions using this knowledge, whereas the baseline is locked in a deadlock. The trajectory of the agent is indicated by green arrows. The red point is the start position; the key frame is the reasonable position at which the target is invisible but some objects allow inference

Table 3 shows the performance of each comparison model in terms of success rate (SR) and SPL. Our model outperforms the current SOTA significantly. This supports our hypothesis that incorporating the HZG embedded with RTNet into the RL policy works better than simply concatenating ResNet and Glove features. It indicates that the spatial information in our HZG, extracted from the ground-truth scene, is more reliable, thereby making the visual navigation problem easier. Note also that SP [30], without our HZG, performs worse than even SAVN [26] and SAVN-HZG.

Fig. 6 Training curves for our model (Ours(full)) and baselines. With a larger HZG proportion (25\(\%\), 75\(\%\)), convergence is faster than the other baselines after 10 M training frames

We compare the total episode length after training for 10 M frames with 21 targets in 4 room types, 20 rooms in total. As shown by the training curves in Fig. 6, our model using HZG converges faster than those using GCN and the Visual Genome graph. Moreover, the larger the HZG proportion, the faster the training process. Our model also outperforms the other baselines in both SR and SPL. The experimental results demonstrate that our model makes full use of the failure experience stored in the LSTM via the learned loss.

Table 5 Impacts of different components on navigation performance

4.5 Ablation study

As seen in Table 5, our HZG can adapt to changes in scene layout and provide knowledge inference. Note that the significant improvements in SPL indicate that the HZG improves the efficiency of the navigation system. Furthermore, to study the influence of HZG on generalization, we train our model with 25\(\%\), 50\(\%\), and 75\(\%\) of the HZG, with the corresponding objects and Glove embeddings randomly selected. Clearly, the larger the HZG proportion, the better the success rate and SPL. We also visualize the training process over 10 M frames in Fig. 6. After training, the weight matrix is embedded by RTNet during navigation. Our HZG is more robust and obtains the main features faster. The main reason for our good performance is that we build a precise HZG, whereas other works [8, 10, 34] rely only on object detectors, which are easily affected by the appearance gap in real-to-sim transfer. This is also shown in Fig. 5. The HZG contains specific direction relationships for every pair of objects as scene priors, thus improving generalization ability.

Fig. 7 t-SNE embedding of the indoor objects appearing in the four room types. We show node features extracted by the RTNet layer projected into 2D space. The results show that objects with similar appearance or intimate relationships have similar distributions in the HZG, and that our model has learned the relative spatial layout after exploring numerous rooms

Our method has four major components, i.e., DETR, HZG, RTNet, and ZSL adaptation. We dissect their impacts as follows.

Impact of DETR Object detection acquires zone information that is integrated into the visual inputs. We study the impact of these cues (graph encoded by RTNet, observation, action, and memory), as seen in Table 6. First, we remove the attention mechanism to test its effect. Then, we train the components separately, i.e., without the graph or without memory. We also compare navigation performance using two detection methods, DETR and Faster-RCNN. Unlike Faster-RCNN, DETR infers the relations between object instances and the global image context via its transformer to output the final predictions. Both use the same class-label Glove word embeddings and the output feature of visual perception as node features in the HZG. Although DETR and Faster-RCNN achieve similar detection performance, the features extracted by DETR are more informative and robust in the HZG than those of Faster-RCNN.

Impact of HZG As seen in Table 5, our HZG can adapt to changes in scene layout and provide knowledge inference. The significant improvements in SPL indicate that the graph improves the efficiency of the navigation system. Furthermore, to study the influence of HZG on generalization, we train our model with 25\(\%\), 50\(\%\), and 75\(\%\) of the HZG, with the corresponding objects and Glove embeddings randomly selected. Clearly, the larger the HZG proportion, the better the success rate and SPL. Our HZG is more robust and captures more scene graph structure: the larger the proportion of the graph, the shorter the episode length the agent achieves.

Impact of RTNet RTNet improves the navigation policy compared to the model without RTNet, as indicated in Table 5. RTNet is an efficient graph neural network for representing heterogeneous graphs; it pays more attention to key nodes and finds novel meta-paths through multi-hop connections, which improves generalization in new scenes. When we remove the HZG from our framework, both the effectiveness and efficiency of navigation decrease. This indicates that the HZG helps RTNet exploit the spatial information of observation regions. The HZG provides scene priors that endow the agent with the foresight to find invisible objects.

Table 6 The performance of RTNet (SPL/SR in \(\%\)); 6-6-8 and 3-3-4 denote the numbers of encoder layers, decoder layers, and MHA heads, respectively

Impact of Zero-shot Learning As indicated in Table 5, ZSL improves navigation in terms of both success rate and SPL. This shows that ZSL learns the different loss distributions of similar scenes using the same stored training parameters. When navigating in new scenes, it can adapt to changes in scene layout.

Visualization analysis. To figure out why HZG provides good results, we analyze the changes in graph structure.

We examine the node feature vectors learned by our RTNet layers. Figure 7 visualizes the node feature vectors obtained from three RTNet layers using t-SNE. We observe that the feature vectors of the HZG nodes are generally consistent with their two-dimensional t-SNE projections. For example, potatoes and lettuce are usually placed on the countertop, and the figure shows that their feature spaces are very close. This means our model has learned to project objects into the feature space while preserving the spatial configuration. The reason for this clustering of semantically close objects within a zone is that we construct the HZG instead of relying on object detection like other works [5, 15, 21], which is susceptible to appearance errors in the real-to-sim transfer. This can also be seen in Fig. 5. The HZG contains a specific direction relationship for every pair of objects and acts as a scene prior, and this object aggregation and new knowledge become more refined as training deepens.

4.6 Case study

As shown in Fig. 8, 15 kinds of targets were selected from the validation set to verify the agent's generalization ability in an unknown environment. We find that the navigation process of the RTNet agent is affected by weakly textured areas, strong light, and corner layouts of the room. The fewer turns it takes to find the target, the more efficient the navigation. The experimental results show that "large" targets and common targets are easier to find and require fewer turn steps, which indicates that our framework can extract useful environmental information and improve navigation efficiency.

Fig. 8 Case study of RTNet

In addition, RTNet for embodied agents encountered some failure cases during the experiments. As shown in Fig. 8, four cases are enumerated: (1) the target is too small to be found within the specified step budget; (2) the agent is trapped in a corner and cannot leave the area in a short time; (3) the agent lacks guidance in a weakly textured environment and performs ineffective actions; (4) under strong environmental light, the agent cannot predict accurate actions. We analyze these problems of the multimodal visual perception model as follows: (1) is caused by insufficient recognition accuracy of the object detection algorithm; (2) may be caused by the "timid" navigation strategy inherent in reinforcement learning, in which the model does not learn to explore the environment strategically; (3) reflects poor performance in, and a lack of experience with, weakly textured environments; (4) arises because the agent relies heavily on visual observation, so large changes in environmental conditions significantly affect its performance.

Therefore, in future work, we will address these problems by improving the resolution of the observed image and integrating more efficient state representation techniques. In addition, more robust exploration policies or domain randomization techniques that render varied targets are needed to improve the agent's exploration capability.

4.7 Discussion

One benefit of virtual simulation platforms is that shortest paths can be generated as supervision signals to train task-specific visual navigation models; real-world uncertainties, however, are often not taken into account. One important factor is the large physical control error of real robots. Generating shortest paths in real-world settings is very expensive, and even generating a small number of samples for fine-tuning a navigation model may not be affordable. In simulated environments constructed from real images, the environment is also static and far from reality, which in the real world is subject to changes in lighting, object layout, and so on. Visual navigation solutions developed in static simulation environments thus still have significant shortcomings, making the exploration of visual navigation in dynamic environments challenging. Transferring a navigation model trained in a virtual environment to the real environment is important because training in the virtual environment is cheap. Narrowing the gap between real and simulated environments is one of the challenges of navigation with model-based reinforcement learning engines, and improving the generalization ability of the model in real-world environments can help solve real-world problems.

The development of our method is significant because the real-to-sim gap is one of the challenges in navigation using model-based RL engines. Navigation generalization performance depends heavily on the information in the knowledge graph. We construct the HZG and the probabilities inferred by the GNN, which brings interpretability to deep reinforcement learning for indoor navigation.

Different from traditional knowledge graphs in navigation, in the HZG, RTNet serves as a semantic map with specific direction information about the objects and local attention areas in scenes, thus providing a global view for this partially observed task. Although TPN [10] uses an object detector to maintain local image features and obtains promising results, our comparison and analysis experiments confirm the better adaptive capacity of the HZG model. Furthermore, a GNN is applied for probability inference and graph embedding, which yields accurate distance predictions in the experiments and the visualized results in Fig. 5.

Test with real-world AVD dataset. To further demonstrate the effect of our model, we conduct another experiment on the Active Vision Dataset (AVD) [2], a real-world navigation environment. AVD contains 11 relatively complex real-world houses, of which 8 are for training and 3 for testing. Each house is built on a grid of robot locations. We choose 15 different views as targets, each containing a target object such as a laptop or a table. As in most navigation systems, a collision detection module is included. Once a collision is detected, the action with the largest softmax probability is output.

Fig. 9 3D t-SNE visualization of a home in AVD. The color of each data point denotes the visited objects. Some data points are marked with the index of the current observation. Green points correspond to finding objects in the bathroom and yellow points to finding objects in the kitchen. The distribution of visited points shows that our model leads to more accurate action prediction

Table 7 Navigation performance (SR and SPL, in %) in unseen environments with seen targets from AVD with the "Done" signal. Rand denotes random walk

Except for the random walk, all alternatives are trained with supervision. We also follow the work of Wu et al. [27]: the policy layer is updated using a shaped reward based on the geodesic distance to the goal. All methods take RGB images as the input observation. The results on the two metrics (SR and SPL) are reported in Table 7. Our model with default RGB image input outperforms the state-of-the-art alternatives by more than \(9.6\%\) in average success rate and more than \(9.5\%\) in average SPL. The results demonstrate that our model is effective in both simulated and real-world environments, because the adaptive HZG model can cope with changes in different scenarios: it distills knowledge and encodes the spatial position relationships of general targets. Meta-learning takes effect by storing the gradient update schema across scenes.

To investigate how well our graph model optimizes the navigation policy based on the observation and the target view, we show in Fig. 9 a 3D t-SNE visualization of the trajectories of our model and Bayes [27]. Our agent associates prior knowledge and distills it into an unseen environment, outperforming Bayes in finding the target. This is mainly because Bayes only uses a variational model for action estimation, whereas our RTNet models these variational distributions in a more robust way. Meta-learning can also learn this action prediction across distributions in a sample-efficient manner. We see great potential for RTNet's graph encoding in the challenging AI2THOR indoor environment. However, this comes at the cost of longer training (up to 100 M frames). Besides, the sim2real gap remains a long-term issue for visual navigation based on reinforcement learning.

Test with real-world Gibson dataset. We also test sim2real transfer in the real-world Gibson environment and compare models evaluated there. The experimental settings of our navigation framework follow VGM [14], which uses Faster-RCNN to build a visual memory graph and a GCN to extract spatial information. We replace PPO with A3C in their framework and add RTNet and ZSL to their backbones. We train all models on the Habitat simulator with 72 scenes and evaluate on 14 unseen scenes, under different difficulty settings. The difficulty level depends on the geodesic distance to the target location (easy: \(1.5m \sim 3m\), medium: \(3m \sim 5m\), hard: \(5\,m \sim 10\,m\)). We tested 1007 sample scenes for each difficulty level. The maximum number of time steps per episode is 500.

Table 8 SR comparison of baselines and ours without "Done" signal

The results can be seen in Table 8: VGM [14], NTS [7], and our framework achieve better results on SR, and our framework leads all models. We attribute this to the HZG structure, as the agent tends to find objects with high semantic similarity and can adapt to similar scene layouts. Overall, our framework improves SR by \(16\%\) compared to A3C. SLAM-based models (SemExp and NTS) show a relatively lower SR. Graph-based models (NTS, SemExp, VGM, and ours) can robustly navigate in noisy environments, but cannot be self-driven in new scenes without a given graph; these models rely heavily on exploration policies and graph estimators. However, the exploration policy is only effective for unexplored zones, not zones with similar semantics. In contrast, our framework can adapt the HZG to different scenes and perform relation reasoning via RTNet, guiding the agent to find the target more easily.

5 Conclusion

In this paper, we present a relation-wise transformer network to exploit the relationships in the observations. Extensive experiments and ablation studies demonstrate the efficiency of our approach, which benefits from our HZG when necessary and thus adapts to new environments. We show for the first time that knowledge distillation from RTNet and HZG can improve the performance of navigation agents. Having demonstrated the effectiveness of our method in practice, as future work we plan to extend it to other RL algorithms and to combine it with other training methods. Incorporating varied visual semantic perception and scene semantic prior representations will also help. Furthermore, we find that weights learned from auxiliary tasks such as semantic segmentation, object detection, and depth estimation are effective in navigation and can be used in RL training.