Context vector-based visual mapless navigation in indoor using hierarchical semantic information and meta-learning

Li, Fei; Guo, Chi; Zhang, Huyin; Luo, Binhan

doi:10.1007/s40747-022-00902-7

Context vector-based visual mapless navigation in indoor using hierarchical semantic information and meta-learning

Original Article
Open access
Published: 04 November 2022

Volume 9, pages 2031–2041, (2023)
Cite this article

Download PDF

You have full access to this open access article

Complex & Intelligent Systems Aims and scope Submit manuscript

Context vector-based visual mapless navigation in indoor using hierarchical semantic information and meta-learning

Download PDF

Fei Li¹,
Chi Guo^2,3,4,
Huyin Zhang¹ &
…
Binhan Luo²

1665 Accesses
1 Citation
Explore all metrics

Abstract

Visual mapless navigation (VMN), modeling a direct mapping between sensory inputs and agent actions, aims to navigate from a stochastic origin location to a prescribed goal in an unseen scene. A fundamental yet challenging issue in visual mapless navigation is generalizing to a new scene. Furthermore, it is of pivotal concern to design a method to make effective policy learning. To address these issues, we introduce a novel visual mapless navigation model, which integrates hierarchical semantic information represented by context vector with meta-learning to improve the generalization performance gap between known and unknown environments. Extensive experimental results on AI2-THOR benchmark dataset demonstrate that our model significantly outperforms the state-of-the-art model by $15.79\%$ for the SPL and by $23.83\%$ for the success rate. In addition, the exploration rate experiment shows that our model can effectively improve the invalid exploration behavior of the agent and accelerate the convergence speed of the model. Our implementation code and data can be viewed on https://github.com/zhiyu-tech/WHU-CVVMN.

Improving indoor visual navigation generalization with scene priors and Markov relational reasoning

Article 04 April 2022

Occupancy Anticipation for Efficient Exploration and Navigation

Learning Object Relation Graph and Tentative Policy for Visual Navigation

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

The purpose of visual mapless navigation is to move from a randomly generated starting point to a destination objective (known or unknown) supplied by an image or instruction in an unseen scene, has lately been a prominent emphasis of embodied vision navigation. Many visual mapless navigation models [1,2,3] have been presented, and their capacity to perceive the surroundings and make decisions has shown encouraging results. Visual mapless navigation has shown the potential of end-to-end deep neural models for processing, comprehending, and categorizing image and language information to some level.

Despite their success, existing visual mapless navigation models disregard some prior intuitions of human beings during navigation. A case in point is to navigate to a target object such as an alarm clock. How do we do that? An intuitional thing to do is to start by looking around the bed in the bedroom due to the functional and semantic structure of the world. In recent years, researchers attempt to solve this challenge with graph neural network and knowledge graph. Yang et al. [4] first proposed to use graph convolutional networks (GCN) [5] to encode prior knowledge in the task of semantic navigation, and experimental results show that their method improves generalization to novel scenes. However, the prior knowledge of their model cannot be updated at test time. To address this issue, Wu et al. [6] proposed a Bayesian relational memory (MBR) model that can encode prior knowledge in the training environment and update it during testing. Moghaddam et al. [7] presented graph-based value estimation (GVE) system which uses externally learned prior knowledge and then integrate it into navigation model by a neural graph. However, their parameters are frozen during inference, bringing about a hard limit between the training and the testing stages.

To address this issue, Wortsman et al. [8] innovatively proposed a novel self-adaptive visual navigation (SAVN) approach that enables an agent to adapt to unseen scenes. Their model adopts a meta-reinforcement learning solution to learn a self-supervised interaction objective to encourage effective navigation. This method eliminates the problem of hard boundary in training and testing phases. Further, Li et al. [9] proposed to learn transferable meta-skills by unsupervised reinforcement learning approach from un-annotated scenes, and, therefore, the agent can fast adapt to visual navigation tasks. Li et al. [10] proposed multi-goals and multi-scenes navigation model, which utilize the scene priors of environment and showed promising result. However, these methods locate the target according to the current observations, which will ignore a large amount of rich information contained in current environment. The research work of Torralba et al. [11] and Mottaghi et al. [12] proves the effectiveness of context vector for visual object detection and position detection. More recently, Elayaperumal et al. [13] proposed a spatial variation and multi-feature fusion method based on context information, which can accurately locate the target position. Therefore, we intend to develop a navigation model that incorporates the superiority of both hierarchical relation information and meta-learning.

In this paper, we provide a unique visual mapless navigation model, which integrates hierarchical semantic information with meta-learning to improve the generalization performance. As illustrated in Fig. 1, we endow our model with the capability of encoding semantic information via context vector, which associates current observations and directional navigation signals. After that, we concatenate three key properties: the context vector, the word embedding of target and the knowledge graph purified manually. Then, we map this representation to a standard LSTM. Finally, we employ a meta-learning scheme to associate the visual representations with navigation actions, and thus facilitate navigation policy learning.

Extensive empirical results in the AI2-THOR environment demonstrate that our model accomplish significant and consistent improvement as compared to all baseline models. In particular, we prove that our approach outperforms state-of-the-art model in terms of both SPL (24.2 vs 20.9) and success rate (61.92 vs 50.00). In addition, we perform extensive analyses on the ratio of exploration regarding our model, and the final experimental results show that our method can effectively improve the invalid exploration behavior of the agent.

Related work

Visual navigation has a long history and is becoming prominent recently. For decades of research, a typically pipeline for traditional navigation [14, 15] consists of building a map, performing planning and designing an optimal path over the map. Meanwhile, the SLAM community has made tremendous progress, allowing for large-scale real-world applications and a continuous movement of this technology into industry [16,17,18,19]. However, conventional methods rely heavily on the existence of a predefined map.

Reinforcement learning for visual navigation

Recently, with the development of deep learning and representation learning comprehension capacity [20,21,22,23], researchers attempt to solve visual mapless navigation problem via reinforcement learning (RL). In general, RL takes pixels as inputs and predicts actions directly. Zhu et al. [1] developed a Siamese actor-critic target-driven navigation network, which takes visual task objective as an input to avoid re-training for every novel target. Several works [2, 24,25,26] formulated the navigation problem as reinforcement learning problem. However, these model ignore the implicit hierarchical relationships between objects. More recently, [27, 28] introduced a transformer architecture to learn a spatial-enhanced local descriptor and a positional global descriptor correlated with directional signals. Wu et al. [29] proposed a NeoNav navigation model, which predicts the next action based on the imagination of the next expected observation. Tang et al. [30] presented an Auto-Navigation method to address the problem of training with weak supervision in RL. We differ in our method as we incorporate hierarchical relationships into meta-learning framework.

Prior knowledge for navigation

Recently, knowledge graph technique has received great attention as the prior knowledge allowing an agent to take human experience. Researchers attempt to utilize the approach to improve the performance of navigate cross domain. Hence, [4, 31] presented a method which incorporate the semantic priors in the task of visual navigation to improve navigation performance. Zeng et al. [31] employed background knowledge about common spatial relations between landmark and target objects for robot navigation. Chaplot eta al. [32] constructed an episodic semantic map for goal oriented semantic exploration. Wu et al. [6] investigated a Bayesian relational memory (BRM) method to capture the design prior knowledge from unseen scenes and showed the promising results. Our approach differs as we consider the inherent hierarchical relationship between target categories. Druon et al. [33] proposed to transform visual information into an intermediate representation, called context grid, which encode the target object and other objects together. Qiu et al. [34] proposed a memory-utilized joint hierarchical object learning for navigation in indoor rooms (MJOLNIR). However, these methods do not update relationship during the training phase, then leading the model generalize poorly across domains.

Meta-learning

Recently, meta-learning proved its strong ability of learning from a small amount of training data. Nagabandi et al. [35] used meta-learning to train dynamics model prior, which can be rapidly adapted to the local context. Eshratifar et al. [36] combined both meta-learning and transfer-learning to improve generalization performance on unknown tasks. Li et al. [9] introduced a novel unsupervised reinforcement learning method to learn transferable meta-skills. On the contrary, our goal is different since we focus on using meta-learning to acquire a good initial parameter. Yan et al. [37] proposed vision-voice indoor navigation (MVV-IN) model, which aggregated multimodal information to enhance agent’s environment understanding. Wang et al. [38] provided a meta-learning approach to tackle classification problem. Our work instead emphasizes the visual navigation domain and encourages effective navigation. Zou et al. [39] proposed a novel meta-learning framework to automatically learn reward shaping to employ on new tasks. In our approach, however, our goal is to integrate hierarchical semantic information with meta-learning to improve the generalization performance.

Preliminaries

Problem formulation. For visual mapless navigation, we aim to find an instance of target object category, e.g., television, using only egocentric RGB perception of the agent. In addition, we regard a path of a navigation task as an episode. Towards the beginning of every episode, the agent is spawned from a random starting point in a scene, and predicts its actions based on the current view and previous states (agent’s past motions). Formally, we formulate the visual mapless navigation as reinforcement problem. Given a set of scenes $S=\{S_{1},\ldots ,S_{n}\} $ and goal object $G=\{g_{1},\ldots ,g_{m}\}$. The navigation task $\tau \in T$ consists of a scene S, target object class $g \in G$, and initial location P. Hence, we denote each task $g \in G$ as a tuple $T=(S,g,P)$. The agent receives a state $s_{t}$ (i.e., a self-centered RGB image from the current position and orientation) and samples an action a from the action space A at each time step t. $A = \{MoveAhead, RotateLeft, RotateRight, LookDown, LookUp, Done\} $. The horizontal left and right rotation angles are $45^\circ $, while the vertical up and down rotation angles are plus or minus $30^\circ $.

In a reinforcement learning navigation process, given the state $s_{t}$ at time t and the word embedding of the target category. The network outputs the agent’s policy $\pi _{\theta }(s_{t})$ and the value of the current state $v_{\theta }(s_{t})$. Finally, the agent selects an action a from the action space A according to probability $\pi _{\theta }^{a}(s_{t})$. We approximate the policy by a deep policy network $\pi (\cdot ;\theta )$:

$$\begin{aligned} a_{t} \thicksim \pi (\phi (s_{t};v), \psi (g;u); \theta ) \end{aligned}$$

(1)

where v, u and $\theta $ are the parameters for the navigation network. Since the current state $s_{t}$ and target word embeddings are two different feature modalities, we design two sub-networks $\phi (s_{t};v)$ and $\psi (g;u)$to map the two inputs into one feature space.

Proposed context vector-based navigation model

In this section, we will describe our approach exhaustively. The goal of our method is to learn an intelligent agent that can generalize to unseen environments and unseen objects. We first formalize our basic model structure and then explain the details of contextual vectors and meta-learning in visual mapless navigation.

Base model architecture

As is shown in Fig. 2, our network architecture includes two parts: (i) learning a joint representation embedding; (2) learning a navigation policy from the joint representation embedding and previous states.

In the first part, we introduce a novel context vector to encode all objects spatial and semantic relationship. In our model, we further design two descriptors both the semantic cues (e.g., the word embedding) and knowledge graph (e.g., GCN), to allow us to extract visual information effectively. Then, our model fuses these three types of descriptors with pointwise convolutional to produce final visual representations. Moreover, our model enforces visual representations to be highly correlated to navigation signals via our developed context vector, thus facilitating navigation policy learning. In the second part, the final visual representations are referred to as input and fed into an LSTM [40]. We employ an extra linear layer to obtain the policy and value for the remainder of this network. Note that the ReLU activation we utilize throughout is not shown.

Context vector in visual mapless navigation

Context vector

The context vector aims to encode the object-object relationships to enhance generalization ability across domains. We utilize the real label of the object category in AI2-THOR environment to construct the context vector. For each object, $g_{i} \in G$. Therefore, we have $|G|=101$ categories based on their visibility at random initialization of the environment. We let $C_{j}$ denote the context vector, and $C_{j}=[B,x_{c},y_{c},A_\textrm{bbox},CS]^{T}$. The state of $o_{i}$ in the current frame can be described as the five elements of the $C_{j}$. The first element B is a binary indicator that indicates whether $o_i$ appears in the current frame. The following two elements, $(x_{c},y_{c})$ and $A_\textrm{bbox}$, correspond to the center coordinates of the object $o_i$ and the coverage area is normalized according to the image size. We let CS denote the cosine similarity between the target object and word embedding of $g_\textrm{i}$, which is expressed as

$$\begin{aligned} CS(g_{o j},g_{t})=\frac{g_{o j},g_{t}}{||g_{o j}||\cdot ||g_{t}||} \end{aligned}$$

(2)

where the $g_{oj}$ and $g_{t}$ denote the word embedding of object and target in the form of GloVe embedding [41].

Hierarchical semantic information

As in [34], we construct a new set of object categories. We let P denote the “parent objects”, and $P=\{P_{1},\ldots ,P_{m}\}$. We take the larger objects in the AI2-THOR room as the “parent object”, and require them to be semantically/spatially related to the target. The new set of P are manually selected from each room based on their strong semantically/spatially correspondences with the target item G. For instance, Drawer is a parent object in the Livingroom and Bathroom room types. The aim is that the navigation agent will quickly find the target $g_{i} \in G$ based on semantic information, thus reducing the time spent exploring the environment.

Meta-learning objective for navigation

The meta-learning object is used in visual mapless navigation to allow an agent to quickly acquire a policy that allows for effective navigation with only a few gradient updates. The foundation of our approach lies in recent work proposed by Finn [42], which is a gradient based meta-learning algorithm. The highlight of this algorithm is that the network trained by MAML algorithm can be quickly adapted to the new task, when the distribution of training set and test set is similar. Therefore, we rely on the MAML algorithm to optimizes for fast adaptation to new scene types for our visual mapless navigation.

For clarity, we introduce our approach in a general manner, including brief examples. We consider a navigation task, denoted T, that maps the current state $S_{t}$ to output $a \in A $. Our approach assumes that during training we have access to a set of tasks $T_\textrm{train}$ where each task $T_\textrm{train}$ has a small meta-training dataset $D_\mathrm{\tau }^\textrm{tr}$ and meta-validation set $D_\mathrm{\tau }^\textrm{val}$. For example, in our mapless navigation task, we decompose the agent’s trajectory into an interaction and navigation phase. We let $D_\mathrm{\tau }^\textrm{int}$ denote the first k steps of the agent’s trajectory, which contains the internal state descriptor, observations and actions. Additionally, let $D_\mathrm{\tau }^\textrm{nav}$ denote the same information for the rest of the trajectory. The goal is then to correctly reach one of the object categories to each room types in unseen scenes. Our meta-learning objective takes the form:

$$\begin{aligned} \min \limits _{\theta }\sum _{\tau \in \tau _\textrm{train}}L_\textrm{nav}\left( \theta -\alpha \nabla _{\theta }L_\textrm{int}\left( \theta , D_{\tau }^\textrm{int}\right) , D_{\tau }^\textrm{nav}\right) \end{aligned}$$

(3)

where the loss $L_\textrm{int}$ is written as function of the network parameters $\theta $ and $D_\mathrm{\tau }^\textrm{int}$. In addition, $\alpha $ is the learning rate hyper-parameter, and $ \triangledown $ denotes the differential operator gradient. Our core idea is to learn parameters $\theta $ such that they provide a good initialization for fast adaptation to novel tasks. Formally, our training objective uses the adapted parameters $\theta -\alpha \triangledown L \left( \theta ,D \right) $ to optimize for performance on $ D_\mathrm{\tau }^\textrm{nav}$ instead parameters $ \theta $.

The intuition for our objective is as follows: as first our agent interact with the scene using the parameters $\theta $, and then, after k steps, we obtain the adapted parameters by utilizing an SGD update with respect to the self-supervised loss. Therefore, we execute a first-order Taylor expansion as [30]. Our meta-training objective. Equation (3) takes the form

$$\begin{aligned}{} & {} \min \limits _{\theta }\sum _{\tau \in \tau _\textrm{train}}L_\textrm{nav}\left( \theta , D_{\tau }^\textrm{nav}\right) \nonumber \\{} & {} \quad -\alpha \left\langle \nabla _{\theta }L_\textrm{int}\left( \theta , D_{\tau }^\textrm{int} \right) ,\nabla _{\theta }L_\textrm{nav}\left( \theta , D_{\tau }^\textrm{nav} \right) \right\rangle \end{aligned}$$

(4)

where $\lessdot ,\gtrdot $ denotes an inner product. The goal of our first-order Taylor expansion is to learn to minimize the navigation loss while increasing the similarity between the self-supervised interaction loss and the supervised navigation loss gradients. To understand $\lessdot ,\gtrdot $, recall the dot operation computes the similarity of two vectors: $a \cdot b = ||a||_{2}||b||_{2}\cos (\zeta )$, where $\zeta $ is the angle between vectors a and b. For the remainder of this work, we call the $L_\textrm{int}$ as interaction-gradient and the $L_\textrm{nav}$ as navigation-gradient. If these two gradients are similar, then we can keep “training” during inference though we do not have access to $L_\textrm{nav}$. However, in practice, it is difficult to draw an $L_\textrm{int}$ approximation for $L_\textrm{nav}$, which directly drives us to utilize learning approaches to acquire self-supervised interaction loss.

To solve this problem, we propose a novel solution to learn a self-supervised interaction objective that is notably tailored to our visual mapless navigation. Our core goal is for agent to enhance navigation performance by minimizing the self-supervised loss. During training, we not only learn this objective but also learn how to employ this objective. Formally, we describe the $L_\textrm{int}$ as a neural network parameterized by $\phi $, which we denote $L_\textrm{int}^{\phi }$. Our training objective then takes the form:

$$\begin{aligned} \min \limits _{\theta , \phi }\sum _{\tau \in \tau _\textrm{train}}L_\textrm{nav}\left( \theta -\alpha \nabla _{\theta }L_\textrm{int}^\phi \left( \theta , D_{\tau }^\textrm{int}\right) , D_{\tau }^\textrm{nav}\right) \end{aligned}$$

(5)

In inference phase, the parameter $\phi $ is frozen. If the gradients from both losses are similar, we believe that minimizing this loss allows the agent to navigate effectively. In this scene, the self-supervised loss is being trained to mimic the supervised $L_\textrm{nav}$ loss. As is illustrated in Fig. 2, we use the agent’s previous k internal state representations concatenated with the agent’s policy. We utilize two layers of one-dimensional temporal convolutions for the architecture of our learning loss. In the first layer, we employ 10*1 filters and the next with 1*1.

Experiments

We conduct our experiment on AI2-THOR dataset [43], and continue to answer the accompanying three questions: (1) determine whether our model is generalizing better than other baseline models. (2) determine whether our model can achieve promising results that training in one scene and testing on different scene types, and (3) gain insight into how and why our model can find the goal with fewer steps.

In the remainder of this paper, “Experimental setup” first introduces how to separate the dataset and evaluation metrics. Implementation details are shown in “Implementation details” and “Baselines” shows the baselines and SOTA comparison. Then, we answer the first two questions quantitatively in “Quantitative analysis” and “Generalization across scene types”, respectively. Qualitative results of the ratio of exploration (i.e., fewer steps, and faster rate of convergence) are illustrated in “Ratio of exploration” . We perform an ablation study on the hierarchical relationships in our model in “Ablation study on hierarchical relationships”.

Experimental setup

Datasets

We perform our experiments on AI2-THOR, which afford near-photorealistic 3D indoor scenes. It consist of four different types of scenes, i.e., kitchen, living room, bedroom and bathroom. Each scene contains real-word objects that the agent can observe, allowing algorithms learned here to be easily transferred to real-world settings. For each room type, there are 30 different rooms with various layout and items. Following SAVN [8], our experiment used a training set of 20 rooms for each scenario type and a test set of 10 rooms. As a result, the complete training set has 80 rooms, whereas the test set contains 40 rooms, and $D_\textrm{train}\cap D_\textrm{test}=\emptyset $.

Success criteria

For each episode, the agent is considered to finish a navigation task until the target is visible (i.e., the distance to the target is in the field of view and within 1 m). To avoid the problem that the agent does not have any idea if it has reached the target or not, we consider that the navigation task is successful if Done action is issued when an object instance is within the field of view. Note that if the agent ever issues the Done action when it has not reached a goal then we consider the task a failure. This makes the learning more challenging.

Metrics

For fair comparison with other state-of-the-art algorithms, we adopt the same evaluation metrics recommended by [44], and approved by other visual mapless navigation model [4, 8]. We evaluated our navigation performances (i.e., generalization) on unseen scenes based on two metrics: Success Rate (SR) and the Success weighted by Path Length (SPL). The SR is defined as $\frac{1}{N} \sum _{n}^{N}S_\textrm{n} $, which means the ratio of the number of times the agent successfully navigates to the goal object and the total number of episodes. The SPL is given by $\frac{1}{N} \sum _{n}^{N}S_\textrm{n} \frac{Len_\textrm{n}}{\max (Len_\textrm{n}, Len_\textrm{shortest})} $. Here, we conduct N episodes, and let $S_\textrm{n}$ as a success indicator of the n-th episode. $Len_\textrm{n}$ denotes path length and $Len_\textrm{shortest}$ is the shortest path distance (provide by the environment for evaluation) in episode i.

Implementation details

As indicated in “Experimental setup” section, we train our model and all baselines until the success rate saturates on the 80 room types. We run our experiment in PyTorch and train 14 asynchronous workers on two Nvidia GeForce RTX 2080Ti GPUs. In each episode, we randomly sample a target from floorplan with a random initial agent position. We use Adam optimizer [45] to update the policy network for navigation gradients, with a learning rate of $10^{-4}$ and an inner learning rate of 0.001.

For evaluation, we used 1000 different episodes for 4 room types (250 for each scene type) with four asynchronous workers. All models are tested with the same set.

Baselines

For comparison, we select several public models as baselines including:

Random policy. The agent randomly draws one out of six actions using a uniform distribution.

SP [4]. A graph convolutional network navigation model, which integrate semantic priors with reinforcement learning architecture.

SAVN. [8] exploits a meta reinforcement learning approach for an agent to learn a self-adaptive visual navigation model to encourage the agent to keep learning during inference time.

GVE. [7] proposed a graph-based value estimation (GVE) to provide a more optimal policy for navigation. The GVE module reduces the value estimation error in A3C RL algorithm.

ApAtt. [46] encoding semantic information and spatial information about object’s place using an attention probability model, which allows the agent to navigate towards the target effectively.

MJOLNIR_O/R. [34] utilize implicit memorization of the relationships between different objects and show promising results for navigation efficiently.

Quantitative analysis

In this part, we show the navigation performance and provide some further analysis. We offer quantitative results for all “ALL” targets as well as a subset of targets ($L\geqslant 5$) with optimal trajectory lengths greater than 5.

As shown in Table 1, we observe that our model significantly outperforms SP [4] and SAVN [8]. Since SP and SAVN use prior knowledge or meta-learning separately while our model uses the hierarchical relationships and combines with the meta-learning. Compared to the GVE model [7], our model outperforms by $57.14\%$ and $83.74\%$ on the SPL metric and SR metric, respectively. Compared with the SpAtt model [46], our model outperforms by $52.2\%$ and $89.94\%$ on the SPL metric and SR metric, respectively. Our approach achieves expressive generalizability and fast policy learning for navigation. Most notably, we observe about $115.75\% $ absolute improvement in success and $73.98\% $ in SPL. In contrast, benefiting from our context vector representations, our model outperforms the state-of-the-art method MJOLNIR_O [34] by $23.83\%$ for success rate and by $15.79\%$ for SPL. We achieve better navigation performance utilizing our context vector-based navigation model. It also confirms our assertion that incorporating the hierarchical relationships into meta-learning can encourage agents to better generalize to unknown environments. We note that our model’s SPL metric is 7.5% lower than MJOLNIR_O in the case of $L\geqslant 1$. We speculate that this is because semantic information does not cover all categories.

Table 1 Quantitative results

Full size table

Table 2 Quantitative results across scene types

Full size table

Generalization across scene types

We test the generalization ability of our model in cross-scenarios as well. The idea is that we train an independent model for four scene type, and then test it on a different scene type. We refer to them as K-single, L-single, Be-single and Ba-single, respectively. The K-single, for example, is trained on kitchen room types, and evaluated on the all scenario types. Our results are shown in Table 2.

Table 3 Ablation results

Full size table

From the results, we can see that our cross-scene model outperforms the comparison methods in both SR and SPL assessment measures. In particular, our approach using hierarchical semantic relation input surpass the state-of-the-art method [34] for SPL by $36.01\%$.

Ratio of exploration

In order to prove that our method has faster strategy learning. We newly proposed a ratio of exploration to illustrate the effect. We defined the ratio of exploration as number of explored states divided by the current number of steps. For example, if an agent only explores two states in the first five steps, then the exploration rate in the first five steps is only 40%.

As shown in Fig. 3, our method has a lower exploration rate, thanks to our intelligent agent can find the target faster and avoid repeating an invalid action all the time. This support that our model provide good initialization parameters.

Ablation study on hierarchical relationships

We conduct evaluations on how the performance is affected by changing the hierarchical relationships. First, we only sequentially remove the hierarchical relationships of a room type. Second, to compare the results intuitively and reasonably, we adopt the same experimental settings as “Quantitative analysis” section to re-train the model for four room scene types and then test in the room type that removes the hierarchical relationships. For example, “Kitchen_no_relationships” refers to removing the hierarchical relationships in kitchen room type from our system.

As shown in Table 3, we observe that the navigation performance degrades significantly compared to the results of Table 1. Moreover, the performance of the other three rooms was much lower than the results in Table 1 in both SPL and SR metrics as well. This validates the influence of hierarchical relationships for navigation performance. We consider that our hierarchical relationships provide contextual guidance to an agent.

Case study

We provide qualitative comparison of the performance of our model compared to previous SOTA, e.g., SAVN [8] and MJOLNIR_O [34]. As illustrated in Fig. 4, agent is navigating towards a target of “Laptop” in scenario “living room”. SAVN and MJOLNIR_O both issue the Done action after navigating a few steps (15 and 8 steps, respectively). Since our navigation model provides clear directional signals, our model uses the least steps (4 steps) to find the target.

For failure cases, we conducted further experiments. We chose the “RemoteControl” as target in “FloorPlan205” of AI2-THOR environment. As shown in Fig. 5, we observe that the agent misses several times when looking for the target. Our agent passed by the “RemoteControl” several times, but did not issue an “Done” command, and finally ended the episode because the maximum step was exceeded.

By repeatedly visualizing the agent’s behavior, we did further analysis of the failure cases. We observed the agent going back and forth beside the desk. This may be because the agent thinks that the remote control is sometimes closer to the table, and the agent is misled by prior knowledge. As shown in the top view in Fig. 5, the agent walked directly towards the table and then stopped for a while.

Conclusion

In this paper, we propose a context vector-based architecture to achieve powerful visual mapless navigation, which fully explored the hierarchical relationships between objects and targets. Further, we absorb the strength of meta-learning to encourage more effective policy learning. Experiments reveal that our model outperforms a state-of-the-art visual mapless model considerably and consistently. Furthermore, our approach has a high degree of generalizability across different scenario types. In addition, our approach has a lower ratio of exploration, which means our agent adopts more effective actions. In the future, we plan to explore background knowledge about the composition of the real world, including common sense knowledge of human navigation and factual knowledge, which can provide strong generalizability in unknown environment.

References

Zhu Y, Mottaghi R, Kolve E, Lim JJ, Gupta A, Fei-Fei L, Farhadi A (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In: 2017 IEEE international conference on robotics and automation (ICRA). IEEE, pp 3357–3364
Mirowski P, Grimes M, Malinowski M, Hermann KM, Anderson K, Teplyashin D, Simonyan K, Zisserman A, Hadsell R, et al (2018) Learning to navigate in cities without a map. In: Advances in neural information processing systems, pp 2419–2430
Chaplot DS, Sathyendra KM, Pasumarthi RK, Rajagopal D, Salakhutdinov R (2018) Gated-attention architectures for task-oriented language grounding. In: Thirty-second AAAI conference on artificial intelligence
Yang W, Wang X, Farhadi A, Gupta A, Mottaghi R (2018) Visual semantic navigation using scene priors. arXiv preprint arXiv:1810.06543
Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907
Wu Y, Wu Y, Tamar A, Russell S, Gkioxari G, Tian Y (2019) Bayesian relational memory for semantic visual navigation. In: Proceedings of the IEEE international conference on computer vision, pp 2769–2779
Moghaddam MK, Wu Q, Abbasnejad E, Shi J (2021) Optimistic agent: accurate graph-based value estimation for more successful visual navigation. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 3733–3742
Wortsman M, Ehsani K, Rastegari M, Farhadi A, Mottaghi R (2019) Learning to learn how to learn: Self-adaptive visual navigation using meta-learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6750–6759
Li J, Wang X, Tang S, Shi H, Wu F, Zhuang Y, Wang WY (2020) Unsupervised reinforcement learning of transferable meta-skills for embodied navigation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12123–12132
Li F, Guo C, Luo B, Zhang H (2021) Multi goals and multi scenes visual mapless navigation in indoor using meta-learning and scene priors. Neurocomputing 449:368–377
Article Google Scholar
Torralba A, Murphy KP, Freeman WT, Rubin MA (2003) Context-based vision system for place and object recognition. In: IEEE international conference on IEEE computer society computer vision, vol 2, pp 273–273
Mottaghi R, Chen X, Liu X, Cho NG, Lee SW, Fidler S, Urtasun R, Yuille A (2014) The role of context for object detection and semantic segmentation in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 891–898
Elayaperumal D, Joo YH (2021) Robust visual object tracking using context-based spatial variation via multi-feature fusion. Inf Sci 577:467–482
Article MathSciNet Google Scholar
Thrun S (1998) Learning metric-topological maps for indoor mobile robot navigation. Artif Intell 99(1):21–71
Article MATH Google Scholar
Kidono K, Miura J, Shirai Y (2002) Autonomous visual navigation of a mobile robot using a human-guided experience. Robot Auton Syst 40(2–3):121–130
Article Google Scholar
Durrant-Whyte H, Bailey T (2006) Simultaneous localization and mapping: part I. IEEE Robot Autom Mag 13(2):99–110
Article Google Scholar
Garcia-Fidalgo E, Ortiz A (2015) Vision-based topological mapping and localization methods: a survey. Robot Auton Syst 64:1–20
Article Google Scholar
Cadena C, Carlone L, Carrillo H, Latif Y, Scaramuzza D, Neira J, Reid I, Leonard JJ (2016) Past, present, and future of simultaneous localization and mapping: toward the robust-perception age. IEEE Trans Rob 32(6):1309–1332
Article Google Scholar
Zhang P, Ouyang W, Zhang P, Xue J, Zheng N (2019) Sr-lstm: state refinement for lstm towards pedestrian trajectory prediction. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 12085–12094
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
Article Google Scholar
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M et al (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252
Article MathSciNet Google Scholar
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, pp 1928–1937
Mirowski P, Pascanu R, Viola F, Soyer H, Ballard AJ, Banino A, Denil M, Goroshin R, Sifre L, Kavukcuoglu K et al (2016) Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673
Mousavian A, Toshev A, Fišer M, Košecká J, Wahid A, Davidson J (2019) Visual representations for semantic target driven navigation. In: 2019 international conference on robotics and automation (ICRA). IEEE, pp 8846–8852
Shi H, Shi L, Xu M, Hwang KS (2019) End-to-end navigation strategy with deep reinforcement learning for mobile robots. IEEE Trans Ind Inf 16(4):2393–2402
Article Google Scholar
Fang K, Toshev A, Fei-Fei L, Savarese S (2019) Scene memory transformer for embodied agents in long-horizon tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 538–547
Du H, Yu X, Zheng L (2021) Vtnet: Visual transformer network for object goal navigation. arXiv preprint arXiv:2105.09447
Wu Q, Manocha D, Wang J, Xu K (2020) Neonav: improving the generalization of visual navigation via generating next expected observations. Proc AAAI Conf Artif Intell 34:10001–10008
Google Scholar
Tang T, Yu X, Dong X, Yang Y (2021) Auto-navigator: decoupled neural architecture search for visual navigation. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 3743–3752
Zeng Z, Röfer A, Jenkins OC (2020) Semantic linking maps for active visual object search. In: 2020 IEEE international conference on robotics and automation (ICRA). IEEE, pp 1984–1990
Chaplot DS, Gandhi DP, Gupta A, Salakhutdinov RR (2020) Object goal navigation using goal-oriented semantic exploration. Adv Neural Inf Process Syst 33
Druon R, Yoshiyasu Y, Kanezaki A, Watt A (2020) Visual object search by learning spatial context. IEEE Robot Autom Lett 5(2):1279–1286
Article Google Scholar
Qiu Y, Pal A, Christensen HI (2020) Learning hierarchical relationships for object-goal navigation. arXiv preprint arXiv:2003.06749
Nagabandi A, Clavera I, Liu S, Fearing RS, Abbeel P, Levine S, Finn C (2018) Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. arXiv preprint arXiv:1803.11347
Eshratifar AE, Abrishami MS, Eigen D, Pedram M (2019) A meta-learning approach for custom model training. Proc AAAI Conf Artif Intell 33:9937–9938
Google Scholar
Yan L, Liu D, Song Y, Yu C (2020) Multimodal aggregation approach for memory vision-voice indoor navigation with meta-learning. In: 2020 IEEE/rsj international conference on intelligent robots and systems (IROS). IEEE, pp 5847–5854
Wang C, Qiu M, Huang J, He X (2020) Keml: A knowledge-enriched meta-learning framework for lexical relation classification. arXiv preprint arXiv:2002.10903
Zou H, Ren T, Yan D, Su H, Zhu J (2021) Learning task-distribution reward shaping with meta-learning. Proc AAAI Conf Artif Intell 35:11210–11218
Google Scholar
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Article Google Scholar
Pennington J, Socher R, Manning CD (2014) Glove: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543
Finn C, Abbeel P, Levine S (2017) Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400
Kolve E, Mottaghi R, Gordon D, Zhu Y, Gupta A, Farhadi A (2017) Ai2-thor: an interactive 3d environment for visual AI. arXiv preprint arXiv:1712.05474
Anderson P, Chang A, Chaplot DS, Dosovitskiy A, Gupta S, Koltun V, Kosecka J, Malik J, Mottaghi R, Savva M et al (2018) On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
Mayo B, Hazan T, Tal A (2021) Visual navigation with spatial attention. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 16898–16907

Download references

Acknowledgements

We express our gratitude to Allen Institute for AI for offering a reliable experiment tool, which is important to our work. This work was supported by a grant from the Wuhan Science and Technology Planning Application Foundation Frontier Project, No.2019010701011413, the National Key Research and Development Program of China (2018YFB1305001), and the Project Supported by the Open Fund of Hubei Luojia Laboratory.

Author information

Authors and Affiliations

School of Computer Science, Wuhan University, 430072, Wuhan, Hubei, China
Fei Li & Huyin Zhang
GNSS Research Center, Wuhan University, 430072, Wuhan, Hubei, China
Chi Guo & Binhan Luo
Artificial Intelligence Institute, Wuhan University, 430072, Wuhan, Hubei, China
Chi Guo
Luojia Laboratory, Wuhan University, 430072, Wuhan, China
Chi Guo

Authors

Fei Li
View author publications
You can also search for this author in PubMed Google Scholar
Chi Guo
View author publications
You can also search for this author in PubMed Google Scholar
Huyin Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Binhan Luo
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Chi Guo or Huyin Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest to report regarding the present study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Li, F., Guo, C., Zhang, H. et al. Context vector-based visual mapless navigation in indoor using hierarchical semantic information and meta-learning. Complex Intell. Syst. 9, 2031–2041 (2023). https://doi.org/10.1007/s40747-022-00902-7

Download citation

Received: 26 January 2022
Accepted: 17 October 2022
Published: 04 November 2022
Issue Date: April 2023
DOI: https://doi.org/10.1007/s40747-022-00902-7

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Context vector-based visual mapless navigation in indoor using hierarchical semantic information and meta-learning

Abstract

Similar content being viewed by others

Improving indoor visual navigation generalization with scene priors and Markov relational reasoning

Occupancy Anticipation for Efficient Exploration and Navigation

Learning Object Relation Graph and Tentative Policy for Visual Navigation

Introduction

Related work

Reinforcement learning for visual navigation

Prior knowledge for navigation

Meta-learning

Preliminaries

Proposed context vector-based navigation model

Base model architecture

Context vector in visual mapless navigation

Context vector

Hierarchical semantic information

Meta-learning objective for navigation

Experiments

Experimental setup

Datasets

Success criteria

Metrics

Implementation details

Baselines

Quantitative analysis

Generalization across scene types

Ratio of exploration

Ablation study on hierarchical relationships

Case study

Conclusion

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding authors

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation