Context vector-based visual mapless navigation in indoor using hierarchical semantic information and meta-learning

Visual mapless navigation (VMN), which models a direct mapping between sensory inputs and agent actions, aims to navigate from a random starting location to a prescribed goal in an unseen scene. A fundamental yet challenging issue in visual mapless navigation is generalizing to new scenes. A further concern is designing a method that makes policy learning effective. To address these issues, we introduce a novel visual mapless navigation model that integrates hierarchical semantic information, represented by a context vector, with meta-learning to narrow the generalization gap between known and unknown environments. Extensive experimental results on the AI2-THOR benchmark dataset demonstrate that our model significantly outperforms the state-of-the-art model by 15.79% for the SPL and by 23.83% for the success rate. In addition, the exploration rate experiment shows that our model effectively reduces the agent's invalid exploration behavior and accelerates the convergence of the model. Our implementation code and data can be viewed at https://github.com/zhiyu-tech/WHU-CVVMN.


Introduction
Visual mapless navigation, which aims to move from a randomly generated starting point to a destination (known or unknown) specified by an image or instruction in an unseen scene, has lately been a prominent focus of embodied vision navigation. Many visual mapless navigation models [1][2][3] have been presented, and their capacity to perceive the surroundings and make decisions has shown encouraging results. Visual mapless navigation has shown, to some extent, the potential of end-to-end deep neural models for processing, comprehending, and categorizing image and language information.
Despite their success, existing visual mapless navigation models disregard some prior intuitions that human beings rely on during navigation. A case in point is navigating to a target object such as an alarm clock. How do we do that? An intuitive approach is to start by looking around the bed in the bedroom, owing to the functional and semantic structure of the world. In recent years, researchers have attempted to address this challenge with graph neural networks and knowledge graphs. Yang et al. [4] first proposed using graph convolutional networks (GCN) [5] to encode prior knowledge in the task of semantic navigation, and their experimental results show that the method improves generalization to novel scenes. However, the prior knowledge of their model cannot be updated at test time. To address this issue, Wu et al. [6] proposed a Bayesian relational memory (BRM) model that can encode prior knowledge in the training environment and update it during testing. Moghaddam et al. [7] presented a graph-based value estimation (GVE) system, which uses externally learned prior knowledge and integrates it into the navigation model through a neural graph. However, their parameters are frozen during inference, which creates a hard boundary between the training and testing stages.
To address this issue, Wortsman et al. [8] proposed a self-adaptive visual navigation (SAVN) approach that enables an agent to adapt to unseen scenes. Their model adopts a meta-reinforcement learning solution that learns a self-supervised interaction objective to encourage effective navigation. This method eliminates the hard boundary between the training and testing phases. Further, Li et al. [9] proposed learning transferable meta-skills from unannotated scenes with an unsupervised reinforcement learning approach, so that the agent can adapt quickly to visual navigation tasks. Li et al. [10] proposed a multi-goal and multi-scene navigation model, which utilizes the scene priors of the environment and showed promising results. However, these methods locate the target according to the current observations alone, which ignores a large amount of rich information contained in the current environment. The work of Torralba et al. [11] and Mottaghi et al. [12] demonstrates the effectiveness of context vectors for visual object detection and position detection. More recently, Elayaperumal et al. [13] proposed a spatial variation and multi-feature fusion method based on context information, which can accurately locate the target position. Therefore, we intend to develop a navigation model that combines the strengths of both hierarchical relation information and meta-learning.
In this paper, we propose a visual mapless navigation model that integrates hierarchical semantic information with meta-learning to improve generalization performance. As illustrated in Fig. 1, we endow our model with the capability of encoding semantic information via a context vector, which associates current observations with directional navigation signals. We then concatenate three key properties: the context vector, the word embedding of the target and the manually purified knowledge graph. Next, we feed this representation to a standard LSTM. Finally, we employ a meta-learning scheme to associate the visual representations with navigation actions, and thus facilitate navigation policy learning.
Extensive empirical results in the AI2-THOR environment demonstrate that our model achieves significant and consistent improvement compared to all baseline models. In particular, we show that our approach outperforms the state-of-the-art model in terms of both SPL (24.2 vs 20.9) and success rate (61.92 vs 50.00). In addition, we perform extensive analyses of the ratio of exploration of our model, and the experimental results show that our method effectively reduces the agent's invalid exploration behavior.

Related work
Visual navigation has a long history and has recently become prominent. Over decades of research, the typical pipeline for traditional navigation [14,15] has consisted of building a map, performing planning and designing an optimal path over the map. Meanwhile, the SLAM community has made tremendous progress, allowing for large-scale real-world applications and a continuous movement of this technology into industry [16][17][18][19]. However, conventional methods rely heavily on the existence of a predefined map.

Reinforcement learning for visual navigation
Recently, with advances in deep learning and representation learning [20][21][22][23], researchers have attempted to solve the visual mapless navigation problem via reinforcement learning (RL). In general, RL takes pixels as inputs and predicts actions directly. Zhu et al. [1] developed a Siamese actor-critic target-driven navigation network, which takes the visual task objective as an input to avoid re-training for every novel target. Several works [2][24][25][26] formulated the navigation problem as a reinforcement learning problem. However, these models ignore the implicit hierarchical relationships between objects. More recently, [27,28] introduced a transformer architecture to learn a spatial-enhanced local descriptor and a positional global descriptor correlated with directional signals. Wu et al. [29] proposed the NeoNav navigation model, which predicts the next action based on an imagination of the next expected observation. Tang et al. [30] presented an Auto-Navigation method to address the problem of training with weak supervision in RL. Our method differs in that we incorporate hierarchical relationships into a meta-learning framework.

Prior knowledge for navigation
Recently, knowledge graph techniques have received great attention as a form of prior knowledge that allows an agent to draw on human experience. Researchers have attempted to use this approach to improve cross-domain navigation performance. For example, [4,31] presented methods that incorporate semantic priors into the task of visual navigation to improve navigation performance. Zeng et al. [31] employed background knowledge about common spatial relations between landmark and target objects for robot navigation. Chaplot et al. [32] constructed an episodic semantic map for goal-oriented semantic exploration. Wu et al. [6] investigated a Bayesian relational memory (BRM) method to capture design prior knowledge from unseen scenes and showed promising results. Our approach differs as we consider the inherent hierarchical relationship between target categories. Druon et al. [33] proposed transforming visual information into an intermediate representation, called a context grid, which encodes the target object and other objects together. Qiu et al. [34] proposed memory-utilized joint hierarchical object learning for navigation in indoor rooms (MJOLNIR). However, these methods do not update the relationships during the training phase, which leads the model to generalize poorly across domains.
Fig. 1 Motivation of our method. We use the ground-truth labels of objects to build the context vector for all objects in our scene. An agent first detects objects of interest from its observation. The context vector then guides the agent to find the more salient object (box), which corresponds to the target (vase). Then, the agent chooses Move Ahead from the six actions in its action space to reach the target

Meta-learning
Recently, meta-learning has demonstrated a strong ability to learn from a small amount of training data. Nagabandi et al. [35] used meta-learning to train a dynamics model prior, which can be rapidly adapted to the local context. Eshratifar et al. [36] combined meta-learning and transfer learning to improve generalization performance on unknown tasks. Li et al. [9] introduced a novel unsupervised reinforcement learning method to learn transferable meta-skills. In contrast, we focus on using meta-learning to acquire good initial parameters. Yan et al. [37] proposed a vision-voice indoor navigation (MVV-IN) model, which aggregates multimodal information to enhance the agent's understanding of the environment. Wang et al. [38] provided a meta-learning approach to tackle classification problems. Our work instead emphasizes the visual navigation domain and encourages effective navigation. Zou et al. [39] proposed a novel meta-learning framework that automatically learns reward shaping for use on new tasks. In our approach, however, the goal is to integrate hierarchical semantic information with meta-learning to improve generalization performance.

Preliminaries
Problem formulation. For visual mapless navigation, we aim to find an instance of a target object category, e.g., television, using only the egocentric RGB perception of the agent.
In addition, we regard a path of a navigation task as an episode. At the beginning of every episode, the agent is spawned at a random starting point in a scene, and predicts its actions based on the current view and previous states (the agent's past motions). Formally, we formulate visual mapless navigation as a reinforcement learning problem. We are given a set of scenes $S = \{S_1, \ldots, S_n\}$ and a set of goal objects $G = \{g_1, \ldots, g_m\}$.
A navigation task $\tau \in T$ consists of a scene $S$, a target object class $g \in G$, and an initial location $P$. Hence, we denote each task as a tuple $\tau = (S, g, P)$. The agent receives a state $s_t$ (i.e., an egocentric RGB image from the current position and orientation) and samples an action $a$ from the action space $A$. In the reinforcement learning navigation process, given the state $s_t$ at time $t$ and the word embedding of the target category, the network outputs the agent's policy $\pi_\theta(s_t)$ and the value of the current state $v_\theta(s_t)$. Finally, the agent selects an action $a$ from the action space $A$ according to the probability $\pi_\theta^a(s_t)$. We approximate the policy by a deep policy network $\pi(\cdot; \theta)$:
$$a \sim \pi(\phi(s_t; v), \psi(g; u); \theta),$$
where $v$, $u$ and $\theta$ are the parameters of the navigation network. Since the current state $s_t$ and the target word embedding are two different feature modalities, we design two subnetworks $\phi(s_t; v)$ and $\psi(g; u)$ to map the two inputs into one feature space.
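The action-selection step described above can be illustrated with a minimal sketch. It assumes the network has already produced one raw score (logit) per action; the action names are hypothetical placeholders, since the text only states that there are six actions, and the logit values are made up for illustration.

```python
import math
import random

# Hypothetical action names: the text only states that there are six actions.
ACTIONS = ["MoveAhead", "RotateLeft", "RotateRight", "LookUp", "LookDown", "Done"]


def softmax(logits):
    """Turn raw scores into a probability distribution over actions."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng):
    """Sample one action according to the policy probabilities."""
    probs = softmax(logits)
    action = rng.choices(ACTIONS, weights=probs, k=1)[0]
    return action, probs


# Illustrative logits only; in the model they would come from the policy head.
rng = random.Random(0)
action, probs = sample_action([2.0, 0.1, 0.1, -1.0, -1.0, 0.5], rng)
```

Sampling, rather than always taking the most probable action, is the usual choice during actor-critic training; whether the authors switch to greedy selection at test time is not stated.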

Proposed context vector-based navigation model
In this section, we describe our approach in detail. The goal of our method is to learn an intelligent agent that can generalize to unseen environments and unseen objects. We first formalize our basic model structure and then explain the details of the context vector and meta-learning in visual mapless navigation.

Base model architecture
As shown in Fig. 2, our network architecture includes two parts: (i) learning a joint representation embedding; (ii) learning a navigation policy from the joint representation embedding and previous states.
In the first part, we introduce a novel context vector to encode the spatial and semantic relationships of all objects. We further design two descriptors, one for the semantic cues (e.g., the word embedding) and one for the knowledge graph (e.g., GCN), which allow us to extract visual information effectively. Our model then fuses these three types of descriptors with a pointwise convolution to produce the final visual representations. Moreover, our model enforces the visual representations to be highly correlated with navigation signals via our context vector, thus facilitating navigation policy learning. In the second part, the final visual representations are fed as input into an LSTM [40]. We employ an extra linear layer to obtain the policy and value in the remainder of the network. Note that the ReLU activations we use throughout are not shown.

Context vector
The context vector aims to encode object-object relationships to enhance generalization across domains. We use the ground-truth labels of the object categories in the AI2-THOR environment to construct the context vector. Each object category is denoted $g_i \in G$, and we obtain $|G| = 101$ categories based on their visibility at random initialization of the environment. We let $C_j$ denote the context vector of object $o_j$, so that the state of $o_j$ in the current frame is described by the five elements of $C_j$:
$$C_j = [B, x_c, y_c, A_{bbox}, CS]^{T}.$$
The first element $B$ is a binary indicator of whether $o_j$ appears in the current frame. The next elements, $(x_c, y_c)$ and $A_{bbox}$, are the center coordinates of the object $o_j$ and its bounding-box coverage area, normalized according to the image size. $CS$ denotes the cosine similarity between the word embeddings of the object and the target, which is expressed as
$$CS(g_{o_j}, g_t) = \frac{g_{o_j} \cdot g_t}{\|g_{o_j}\| \, \|g_t\|},$$
where $g_{o_j}$ and $g_t$ denote the word embeddings of the object and the target in the form of GloVe embeddings [41].
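As a minimal sketch of how such a five-element context vector could be assembled for one object, consider the following. The function names, the pixel bounding-box format, the choice to zero the spatial entries when the object is not visible, and the toy three-dimensional embeddings are all assumptions made for illustration; they are not taken from the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def context_vector(bbox, image_size, obj_emb, target_emb):
    """Build [B, x_c, y_c, A_bbox, CS] for one object.

    bbox: (x_min, y_min, x_max, y_max) in pixels, or None if the object is
    not visible in the current frame (assumed convention).
    image_size: (width, height) in pixels.
    """
    cs = cosine_similarity(obj_emb, target_emb)
    if bbox is None:
        # Assumption: spatial entries are zero when the object is not visible.
        return [0.0, 0.0, 0.0, 0.0, cs]
    width, height = image_size
    x_min, y_min, x_max, y_max = bbox
    x_c = (x_min + x_max) / 2.0 / width    # normalized center x
    y_c = (y_min + y_max) / 2.0 / height   # normalized center y
    area = (x_max - x_min) * (y_max - y_min) / (width * height)
    return [1.0, x_c, y_c, area, cs]


# Toy 3-d vectors stand in for GloVe embeddings.
vec = context_vector((100, 50, 200, 150), (300, 300), [1.0, 0.0, 0.0], [1.0, 1.0, 0.0])
```

Stacking one such vector per object category would give a matrix with one row per category, which is one plausible way to feed the whole scene state to the network.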

Hierarchical semantic information
As in [34], we construct a new set of object categories. We let $P = \{P_1, \ldots, P_m\}$ denote the set of "parent objects". We take the larger objects in an AI2-THOR room as parent objects, and require them to be semantically or spatially related to the target. The parent objects are manually selected for each room type based on their strong semantic or spatial correspondence with the target objects in $G$. For instance, Drawer is a parent object in the LivingRoom and Bathroom room types. The aim is for the navigation agent to quickly find the target $g_i \in G$ based on semantic information, thus reducing the time spent exploring the environment.

Meta-learning objective for navigation
The meta-learning objective is used in visual mapless navigation to allow an agent to quickly acquire a policy for effective navigation with only a few gradient updates. The foundation of our approach lies in the work of Finn et al. [42], a gradient-based meta-learning algorithm (MAML). The highlight of this algorithm is that a network trained with MAML can be quickly adapted to a new task when the distributions of the training set and the test set are similar. Therefore, we rely on the MAML algorithm to optimize for fast adaptation to new scene types in visual mapless navigation. For clarity, we introduce our approach in a general manner, including brief examples. We consider a navigation task, denoted $\tau$, that maps the current state $s_t$ to an output $a \in A$. Our approach assumes that during training we have access to a set of tasks $T_{train}$, where each task $\tau \in T_{train}$ has a small meta-training dataset $D^{tr}_{\tau}$ and a meta-validation set $D^{val}_{\tau}$. For example, in our mapless navigation task, we decompose the agent's trajectory into an interaction phase and a navigation phase. We let $D^{int}_{\tau}$ denote the first $k$ steps of the agent's trajectory, which contains the internal state descriptors, observations and actions. Additionally, we let $D^{nav}_{\tau}$ denote the same information for the rest of the trajectory. The goal is then to correctly reach an instance of the target object category for each room type in unseen scenes. Our meta-learning objective takes the form:
$$\min_{\theta} \sum_{\tau \in T_{train}} L_{nav}\left(\theta - \alpha \nabla_{\theta} L_{int}\left(\theta, D^{int}_{\tau}\right), D^{nav}_{\tau}\right),$$
where the loss $L_{int}$ is written as a function of the network parameters $\theta$ and $D^{int}_{\tau}$. In addition, $\alpha$ is the learning rate hyper-parameter, and $\nabla$ denotes the gradient operator. Our core idea is to learn parameters $\theta$ such that they provide a good initialization for fast adaptation to novel tasks. Formally, our training objective uses the adapted parameters $\theta - \alpha \nabla_{\theta} L_{int}(\theta, D^{int}_{\tau})$, instead of the parameters $\theta$, to optimize for performance on $D^{nav}_{\tau}$.
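The structure of this objective (an inner interaction update followed by an outer update evaluated on the adapted parameters) can be seen on a toy problem. The sketch below is not the navigation model: it assumes a single scalar parameter and two hypothetical quadratic losses, so that the gradient through the inner update can be written by hand. The names `int_center`, `nav_center` and the outer step size `beta` are illustrative.

```python
def grad_quadratic(theta, center):
    """Gradient of the toy loss 0.5 * (theta - center) ** 2."""
    return theta - center


def meta_step(theta, int_center, nav_center, alpha, beta):
    """One meta-update on a scalar parameter.

    Inner step: adapt theta with the interaction-loss gradient.
    Outer step: differentiate the navigation loss at the adapted parameter
    with respect to the original theta (chain rule through the inner step).
    """
    theta_adapted = theta - alpha * grad_quadratic(theta, int_center)
    # d(theta_adapted)/d(theta) = 1 - alpha for this quadratic interaction loss.
    meta_grad = grad_quadratic(theta_adapted, nav_center) * (1.0 - alpha)
    return theta - beta * meta_grad


theta = 0.0
for _ in range(200):
    theta = meta_step(theta, int_center=1.0, nav_center=2.0, alpha=0.1, beta=0.5)

# After training, one inner step from theta lands on the navigation optimum.
adapted = theta - 0.1 * grad_quadratic(theta, 1.0)
```

Note that the learned initialization is not itself the minimizer of the navigation loss; it is the point from which one interaction step reaches it, which is the sense in which the parameters provide a good initialization.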
The intuition for our objective is as follows: our agent first interacts with the scene using the parameters $\theta$, and then, after $k$ steps, we obtain the adapted parameters through an SGD update with respect to the self-supervised loss. Following [30], we apply a first-order Taylor expansion, under which our meta-training objective, Equation (3), is approximated by
$$\min_{\theta} \sum_{\tau \in T_{train}} L_{nav}\left(\theta, D^{nav}_{\tau}\right) - \alpha \left\langle \nabla_{\theta} L_{int}\left(\theta, D^{int}_{\tau}\right), \nabla_{\theta} L_{nav}\left(\theta, D^{nav}_{\tau}\right) \right\rangle,$$
where $\langle \cdot, \cdot \rangle$ denotes an inner product. The first-order Taylor expansion shows that we learn to minimize the navigation loss while increasing the similarity between the gradients of the self-supervised interaction loss and the supervised navigation loss. To understand $\langle \cdot, \cdot \rangle$, recall that the dot product measures the similarity of two vectors: $a \cdot b = \|a\|_2 \|b\|_2 \cos(\zeta)$, where $\zeta$ is the angle between vectors $a$ and $b$. For the remainder of this work, we call $\nabla_{\theta} L_{int}$ the interaction-gradient and $\nabla_{\theta} L_{nav}$ the navigation-gradient. If these two gradients are similar, then we can keep "training" during inference even though we do not have access to $L_{nav}$. However, in practice, it is difficult to hand-craft an $L_{int}$ that approximates $L_{nav}$, which motivates us to learn the self-supervised interaction loss.
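The role of the inner product can be checked numerically. The sketch below uses made-up two-dimensional gradient vectors and a made-up navigation loss value; it only illustrates that aligned gradients lower the first-order objective and opposed gradients raise it.

```python
import math


def inner_product(a, b):
    """Inner product of two gradient vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def cosine_of_angle(a, b):
    """cos(zeta) recovered from a . b = ||a||_2 ||b||_2 cos(zeta)."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


def first_order_objective(nav_loss, g_int, g_nav, alpha):
    """First-order approximation: L_nav - alpha * <g_int, g_nav>."""
    return nav_loss - alpha * inner_product(g_int, g_nav)


g_nav = [1.0, 1.0]  # illustrative navigation-gradient
aligned = first_order_objective(2.0, [1.0, 0.0], g_nav, alpha=0.1)
opposed = first_order_objective(2.0, [-1.0, 0.0], g_nav, alpha=0.1)
cos_zeta = cosine_of_angle([1.0, 0.0], g_nav)
```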
To solve this problem, we propose to learn a self-supervised interaction objective that is tailored to visual mapless navigation. Our core goal is for the agent to improve navigation performance by minimizing the self-supervised loss. During training, we learn not only this objective but also how to employ it. Formally, we describe $L_{int}$ as a neural network parameterized by $\phi$, which we denote $L^{\phi}_{int}$. Our training objective then takes the form:
$$\min_{\theta, \phi} \sum_{\tau \in T_{train}} L_{nav}\left(\theta - \alpha \nabla_{\theta} L^{\phi}_{int}\left(\theta, D^{int}_{\tau}\right), D^{nav}_{\tau}\right).$$
In the inference phase, the parameter $\phi$ is frozen. If the gradients of the two losses are similar, minimizing the learned loss should allow the agent to navigate effectively. In this sense, the self-supervised loss is trained to mimic the supervised loss $L_{nav}$. As illustrated in Fig. 2, the input to the learned loss is the agent's previous $k$ internal state representations concatenated with the agent's policy. We use two layers of one-dimensional temporal convolutions as the architecture of the learned loss: the first layer employs 10 × 1 filters and the second 1 × 1 filters.
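A rough sketch of such a learned loss follows, written in plain Python so that it runs without a deep learning framework. Several points are assumptions rather than details from the text: "10 × 1" is read as 10 filters of temporal width 1 and "1 × 1" as a single filter of width 1, a ReLU is placed between the two layers, the scalar loss is taken as the L2 norm of the per-step outputs, and the sizes `k` and `feature_dim` are hypothetical.

```python
import math
import random


def temporal_conv_width1(sequence, weights, bias):
    """Width-1 temporal convolution: the same linear map applied at every step."""
    return [
        [sum(w * x for w, x in zip(row, step)) + b for row, b in zip(weights, bias)]
        for step in sequence
    ]


def relu(sequence):
    """Element-wise ReLU over a sequence of feature vectors."""
    return [[max(0.0, x) for x in step] for step in sequence]


def init_params(in_dim, rng, hidden_channels=10):
    """Random weights for the two layers; biases start at zero."""
    return {
        "w1": [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_channels)],
        "b1": [0.0] * hidden_channels,
        "w2": [[rng.uniform(-0.1, 0.1) for _ in range(hidden_channels)]],
        "b2": [0.0],
    }


def learned_loss(sequence, params):
    """Map k concatenated (hidden state, policy) vectors to a scalar loss."""
    hidden = relu(temporal_conv_width1(sequence, params["w1"], params["b1"]))
    out = temporal_conv_width1(hidden, params["w2"], params["b2"])
    # Assumed reduction: L2 norm over the k per-step outputs.
    return math.sqrt(sum(step[0] ** 2 for step in out))


rng = random.Random(0)
k, feature_dim = 6, 8  # hypothetical sizes
params = init_params(feature_dim, rng)
sequence = [[rng.uniform(-1.0, 1.0) for _ in range(feature_dim)] for _ in range(k)]
loss_value = learned_loss(sequence, params)
```

In a real implementation the same structure would be expressed with differentiable layers so that the gradient with respect to $\theta$ can flow through the inputs; this sketch only shows the forward computation.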

Experiments
We conduct our experiments on the AI2-THOR dataset [43], and address the following three questions: (1) whether our model generalizes better than other baseline models; (2) whether our model can achieve promising results when trained on one scene type and tested on different scene types; and (3) how and why our model can find the goal with fewer steps.
In the remainder of this paper, "Experimental setup" first introduces how the dataset is split and the evaluation metrics. Implementation details are given in "Implementation details", and "Baselines" presents the baselines and the SOTA comparison. Then, we answer the first two questions quantitatively in "Quantitative analysis" and "Generalization across scene types", respectively. Qualitative results on the ratio of exploration (i.e., fewer steps and a faster rate of convergence) are illustrated in "Ratio of exploration". We perform an ablation study on the hierarchical relationships in our model in "Ablation study on hierarchical relationships".

Datasets
We perform our experiments on AI2-THOR, which affords near-photorealistic 3D indoor scenes. It consists of four types of scenes: kitchen, living room, bedroom and bathroom. Each scene contains real-world objects that the agent can observe, allowing algorithms learned here to be easily transferred to real-world settings. For each room type, there are 30 different rooms with various layouts and items. Following SAVN [8], our experiments use a training set of 20 rooms and a test set of 10 rooms for each scene type. As a result, the complete training set has 80 rooms, the test set contains 40 rooms, and $D_{train} \cap D_{test} = \emptyset$.

Success criteria
For each episode, the agent is considered to have finished a navigation task once the target is visible (i.e., the target is in the field of view and within 1 m). To ensure that the agent itself recognizes when it has reached the target, we consider the navigation task successful only if the Done action is issued when an instance of the target object is within the field of view. Note that if the agent ever issues the Done action when it has not reached a goal, we consider the task a failure. This makes learning more challenging.
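The success rule can be written as a small predicate. This is a sketch under the stated criteria only; the function and argument names are hypothetical, and how visibility and distance are obtained from the simulator is left out.

```python
def episode_success(issued_done, target_in_view, distance_m, max_distance_m=1.0):
    """Return True only if Done was issued while the target was visible.

    "Visible" follows the criterion in the text: the target is in the field
    of view and within max_distance_m (1 m) of the agent.
    """
    target_visible = target_in_view and distance_m <= max_distance_m
    return bool(issued_done and target_visible)


# Done issued next to a visible target: success.
ok = episode_success(True, True, 0.8)
# Target visible but the agent never issues Done: failure.
no_done = episode_success(False, True, 0.8)
# Done issued too far away: failure.
too_far = episode_success(True, True, 1.5)
```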

Metrics
For fair comparison with other state-of-the-art algorithms, we adopt the evaluation metrics recommended by [44] and adopted by other visual mapless navigation models [4,8]. We evaluate navigation performance (i.e., generalization) on unseen scenes with two metrics: Success Rate (SR) and Success weighted by Path Length (SPL). The SR is defined as $\frac{1}{N}\sum_{n=1}^{N} S_n$, i.e., the ratio of the number of episodes in which the agent successfully navigates to the goal object to the total number of episodes. The SPL is given by $\frac{1}{N}\sum_{n=1}^{N} S_n \frac{Len_{shortest}}{\max(Len_n, Len_{shortest})}$. Here, we conduct $N$ episodes, and $S_n$ is the success indicator of the $n$-th episode. $Len_n$ denotes the path length taken by the agent, and $Len_{shortest}$ is the shortest path distance (provided by the environment for evaluation) in episode $n$.
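Both metrics are simple to compute from per-episode records. The following sketch assumes the standard SPL definition, with the shortest path length in the numerator; the episode values are made up for illustration.

```python
def success_rate(successes):
    """SR: fraction of episodes that ended in success."""
    return sum(successes) / len(successes)


def spl(successes, path_lengths, shortest_lengths):
    """SPL: success weighted by (shortest path / max(taken path, shortest path))."""
    total = 0.0
    for s, taken, shortest in zip(successes, path_lengths, shortest_lengths):
        denom = max(taken, shortest)
        if s and denom > 0:
            total += shortest / denom
    return total / len(successes)


# Four illustrative episodes: success flags, path lengths taken, shortest paths.
successes = [1, 0, 1, 1]
taken = [10, 30, 5, 8]
shortest = [5, 6, 5, 4]

sr_value = success_rate(successes)
spl_value = spl(successes, taken, shortest)
```

A failed episode contributes zero to SPL regardless of its length, and a successful episode contributes at most one, reached only when the agent follows a shortest path.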

Implementation details
As indicated in the "Experimental setup" section, we train our model and all baselines on the 80 training rooms until the success rate saturates. We implement our experiments in PyTorch and train with 14 asynchronous workers on two Nvidia GeForce RTX 2080Ti GPUs. In each episode, we randomly sample a target from the floorplan together with a random initial agent position. We use the Adam optimizer [45] to update the policy network with the navigation gradients, with a learning rate of $10^{-4}$ and an inner learning rate of 0.001.
For evaluation, we used 1000 different episodes for 4 room types (250 for each scene type) with four asynchronous workers. All models are tested with the same set.

Baselines
For comparison, we select several public models as baselines:
Random policy. The agent randomly draws one of six actions from a uniform distribution.
SP [4]. A graph convolutional network navigation model, which integrates semantic priors with a reinforcement learning architecture.
SAVN [8]. A meta-reinforcement learning approach in which the agent learns a self-adaptive visual navigation model that keeps learning during inference.
GVE [7]. A graph-based value estimation (GVE) module that provides a more optimal policy for navigation by reducing the value estimation error in the A3C RL algorithm.
ApAtt [46]. An attention probability model that encodes semantic information and spatial information about an object's location, allowing the agent to navigate towards the target effectively.
MJOLNIR_O/R [34]. Models that utilize implicit memorization of the relationships between different objects and show promising results for efficient navigation.

Quantitative analysis
In this part, we report the navigation performance and provide further analysis. We offer quantitative results for all targets ("ALL") as well as for the subset of targets with optimal trajectory lengths of at least 5 (L ≥ 5). As shown in Table 1, our model significantly outperforms SP [4] and SAVN [8], since SP and SAVN use prior knowledge or meta-learning separately, whereas our model combines the hierarchical relationships with meta-learning. Compared to the GVE model [7], our model improves by 57.14% and 83.74% on the SPL and SR metrics, respectively. Compared with the SpAtt model [46], our model improves by 52.2% and 89.94% on the SPL and SR metrics, respectively. Our approach achieves strong generalizability and fast policy learning for navigation. Most notably, we observe about 115.75% relative improvement in success rate and 73.98% in SPL. In addition, benefiting from our context vector representations, our model outperforms the state-of-the-art method MJOLNIR_O [34] by 23.83% for success rate and by 15.79% for SPL. We thus achieve better navigation performance with our context vector-based navigation model. This also confirms our assertion that incorporating the hierarchical relationships into meta-learning encourages agents to generalize better to unknown environments. We note that our model's SPL is 7.5% lower than that of MJOLNIR_O in the case of L ≥ 1. We speculate that this is because the semantic information does not cover all categories.
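The percentages quoted against MJOLNIR_O read as relative improvements, which can be checked against the headline scores given in the Introduction (SPL 24.2 vs 20.9, success rate 61.92 vs 50.00). The sketch below assumes those four values are the relevant table entries; the helper name is illustrative.

```python
def relative_improvement(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# Scores quoted in the Introduction (ours vs. state of the art).
spl_gain = relative_improvement(24.2, 20.9)
sr_gain = relative_improvement(61.92, 50.00)

print(f"SPL gain: {spl_gain:.2f}%")
print(f"SR gain:  {sr_gain:.2f}%")
```

With these rounded inputs the SPL gain comes out at about 15.79% and the success-rate gain at about 23.84%, which agrees with the reported 23.83% up to rounding of the underlying scores.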

Generalization across scene types
We also test the generalization ability of our model across scene types. The idea is to train an independent model on each of the four scene types, and then test it on different scene types. We refer to these models as K-single, L-single, Be-single and Ba-single, respectively. K-single, for example, is trained on kitchen rooms and evaluated on all scene types. Our results are shown in Table 2.
From the results, we can see that our cross-scene model outperforms the comparison methods on both the SR and SPL measures. In particular, our approach using hierarchical semantic relation input surpasses the state-of-the-art method [34] for SPL by 36.01%.
Fig. 3 Ratio of exploration. Our method has a lower ratio of exploration, which means our agent adopts more effective actions

Ratio of exploration
To show that our method learns a policy faster, we propose a ratio of exploration. We define the ratio of exploration as the number of explored states divided by the current number of steps. For example, if an agent explores only two states in the first five steps, then the exploration rate over the first five steps is only 40%. As shown in Fig. 3, our method has a lower exploration rate, because our agent can find the target faster and avoids repeating invalid actions. This supports the claim that our model provides good initialization parameters.
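The definition above translates directly into code. In this sketch a state is any hashable value recorded at each step; the string labels are placeholders, and how the authors identify a state (for example by position and orientation) is not specified in the text.

```python
def exploration_ratio(visited_states):
    """Number of distinct explored states divided by the number of steps."""
    return len(set(visited_states)) / len(visited_states)


def running_exploration_ratio(visited_states):
    """Exploration ratio after each step, e.g. for plotting a curve over time."""
    return [
        exploration_ratio(visited_states[:t])
        for t in range(1, len(visited_states) + 1)
    ]


# The example from the text: two distinct states in the first five steps.
states = ["A", "A", "B", "B", "B"]
ratio = exploration_ratio(states)
curve = running_exploration_ratio(states)
```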

Ablation study on hierarchical relationships
We evaluate how performance is affected by changing the hierarchical relationships. First, we remove the hierarchical relationships of one room type at a time. Second, to compare the results intuitively and fairly, we adopt the same experimental settings as in the "Quantitative analysis" section to re-train the model on the four room types and then test it in the room type whose hierarchical relationships were removed. For example, "Kitchen_no_relationships" refers to removing the hierarchical relationships of the kitchen room type from our system.
As shown in Table 3, the navigation performance degrades significantly compared to the results in Table 1. Moreover, the performance in the other three room types is also much lower than the results in Table 1 in both the SPL and SR metrics. This validates the influence of hierarchical relationships on navigation performance. We consider that our hierarchical relationships provide contextual guidance to the agent.

Case study
We provide a qualitative comparison of our model with previous SOTA methods, e.g., SAVN [8] and MJOLNIR_O [34]. As illustrated in Fig. 4, the agent navigates towards the target "Laptop" in a "living room" scene. SAVN and MJOLNIR_O both issue the Done action after navigating a few steps (15 and 8 steps, respectively). Since our navigation model provides clear directional signals, it uses the fewest steps (4 steps) to find the target.
For failure cases, we conducted further experiments. We chose "RemoteControl" as the target in "FloorPlan205" of the AI2-THOR environment. As shown in Fig. 5, the agent misses the target several times while looking for it. Our agent passed by the "RemoteControl" several times but did not issue a "Done" command, and the episode finally ended because the maximum number of steps was exceeded.
By repeatedly visualizing the agent's behavior, we analyzed the failure cases further. We observed the agent going back and forth beside the desk. This may be because the agent expects the remote control to be close to the table, and is thus misled by prior knowledge. As shown in the top view in Fig. 5, the agent walked directly towards the table and then stopped for a while.

Conclusion
In this paper, we propose a context vector-based architecture for visual mapless navigation, which fully exploits the hierarchical relationships between objects and targets. Further, we draw on the strength of meta-learning to encourage more effective policy learning. Experiments show that our model outperforms a state-of-the-art visual mapless navigation model considerably and consistently. Furthermore, our approach has a high degree of generalizability across different scene types. In addition, our approach has a lower ratio of exploration, which means our agent adopts more effective actions. In the future, we plan to explore background knowledge about the composition of the real world, including common sense knowledge of human navigation and factual knowledge, which can provide strong generalizability in unknown environments.