1 Introduction

The word robot was introduced and popularized in the Czech play “Rossum’s Universal Robots”, also known as R.U.R. In this seminal piece of theatre, robots understand and carry out a variety of verbal human instructions. Roboticists and AI researchers have long been striving to create machines with such an ability to turn natural language instructions into physical actions in the real world (Jang et al., 2022; Stepputtis et al., 2020; Lynch and Sermanet, 2021; Ahn et al., 2022; Shridhar et al., 2021). However, this task requires robots to interpret instructions in the current situational and behavioral context in order to accurately reflect the intentions of their human partner. Achieving such inference and decision-making capabilities demands a deep integration of multiple data modalities—specifically, the intersection of vision, language, and motion. Language-conditioned imitation learning (Lynch and Sermanet, 2021; Stepputtis et al., 2020) addresses these challenges by jointly learning perception, language understanding, and control in an end-to-end fashion.

Fig. 1 Our proposed method demonstrates high performance on a variety of tasks. It transfers to new robots in a data-efficient manner while maintaining high execution performance, and it supports adding new behaviors to an already trained policy. We also demonstrate the ability to learn relational tasks, in which two objects are referenced in the same sentence

However, a significant drawback of this approach is that, once trained, these language-conditioned policies are only applicable to the specific robot they were trained on. This is because end-to-end policies are monolithic in nature: robot-specific aspects of the task, such as kinematic structure or visual appearance, cannot be individually targeted and adjusted. While it is possible to retrain the policy on a new robot, this comes with the risk of catastrophic forgetting and substantial computational overhead. Similarly, adding new aspects, behaviors, or elements to the task may also require complete retraining.

This paper tackles the problem of creating modular language-conditioned robot policies that can be re-structured, extended and selectively retrained. Figure 1 depicts the set of scenarios we want to address. For example, we envision an approach that allows for the efficient repurposing and transfer of a policy to a new robot. We also envision situations in which a new behavior is added to an existing policy, e.g., incorporating obstacle avoidance into an existing motion primitive. Similarly, we envision situations in which the type of behavior is changed by incorporating additional modules into a policy, e.g., following human instructions that define a relationship between multiple objects, such as “Put the apple left of the orange!”.

Fig. 2 Using generative models to automatically synthesize an unlimited set of 3D models

However, such modularity is at odds with the monolithic nature of end-to-end deep learning. To overcome this challenge, this paper proposes an attention-based methodology for learning reusable building blocks, or modules, that realize specialized sub-tasks. In particular, we introduce supervised attention, which allows the user to guide the training process by focusing the attention of a sub-network (or module) on certain input–output variables. By imposing a specific locus of attention, individual sub-modules can be guided to realize an intended target functionality. Our second contribution, called hierarchical modularity, is a training regime inspired by curriculum learning that decomposes the overall learning process into individual subtasks. This approach enables neural networks to be trained in a structured fashion while maintaining a degree of modularity and compositionality.

Our contributions, which summarize and extend our prior work (Zhou et al., 2022), are as follows: (1) we propose a sample-efficient approach for training language-conditioned manipulation policies that allows for rapid transfer across different types of robots; (2) we introduce a novel method, based on two components called hierarchical modularity and supervised attention, that bridges the divide between modular and end-to-end learning and enables the reuse of functional building blocks; (3) we demonstrate that our method outperforms the current state-of-the-art methods [BC-Z (Jang et al., 2022) and LP (Stepputtis et al., 2020)]; (4) we extend the methodology by creating more complex tasks that incorporate obstacle avoidance and relational instruction following. Finally, we also perform an extensive number of experiments that shed light on the generalization properties of our methodology from different angles, e.g., dealing with occlusions, synonyms, variable objects, etc. (Fig. 1).

2 Preamble: how generative AI helped write the paper

This paper largely centers around the training of generative models at the intersection of vision, language and robot control. Besides being the topic of this paper, generative models have also been instrumental in writing it. In particular, we incorporated such techniques into both (a) the text editing process when writing the manuscript, and (b) the process of generating 3D models and textures of manipulated objects.

For text editing, we utilized GPT-4 (OpenAI, 2023) to iteratively revise and refine our initial drafts, ensuring improved readability and clarity of the concepts discussed. We achieved this by conducting prompt engineering and formulating a specific prompt as follows:

“Now you are a professor at a top university, studying computer science, robotics and artificial intelligence. Could you please help me rewrite the following text so that it is of high quality, clear, easy to read, well written and can be published in a top level journal? Some of the paragraphs might lack critical information. If you notice that, could you please let me know? Let’s do back and forth discussions on the writing and refine the writing.”

We initiate each conversation with this impersonation prompt, followed by our draft text. GPT-4 then returns a revised version of the text, ensuring the semantics remained unaltered while updating the literary style to incorporate professional terminology and wording, as well as a clear logical flow. This prompt also encourages GPT-4 to solicit feedback on the revised text, thus facilitating back-and-forth conversations. We manually determine when a piece of writing has been fine-tuned to a satisfactory degree and bring the conversation to a close.

With regard to the generation of 3D models and assets, we created a new pipeline for the automated synthesis of complete polygonal meshes. Figure 2 (top row) depicts the individual steps of this process. First, we synthesize an image of the intended asset using latent diffusion models (Rombach et al., 2022), providing as input a textual description of the asset, e.g., “A front image of an apple and a white background.”. The resulting image is fed into a monocular depth-estimation algorithm (Ranftl et al., 2022) to generate the corresponding depth map. At this stage, each pixel in the image has both (1) a corresponding depth value and (2) an associated RGB texture value. To generate a 3D object, we take a flat mesh grid of the same resolution as the synthesized RGB image and perform displacement mapping (Zirr and Ritschel, 2019) based on the values in the depth image. Within this process, each point of the originally flat grid is elevated or depressed according to its depth value. The result is a 3D model representing the front half of the target object. For the sake of this paper, we assume plane symmetry—a feature that is common among a large number of household objects. Accordingly, we can mirror the displacement map in order to yield the occluded part of the object. Finally, we apply a Laplacian smoothing operation (Sorkine et al., 2004) to the final object; texturing information is retrieved from the source image. This automated 3D synthesis process allows us to rapidly generate a potentially infinite number of variants of an object, which is particularly useful when studying the generalization capabilities of a model. It also completely removes any 3D modeling or texturing burden. At the moment, the pipeline is limited to symmetric objects.
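To make this pipeline concrete, the sketch below strings the steps together in Python. It is a minimal illustration under explicit assumptions: the diffusion checkpoint, the depth network (a MiDaS variant loaded via torch.hub), the relief scaling, and the mirroring plane are stand-ins rather than the exact components of our implementation.

```python
import numpy as np
import torch
import trimesh
from diffusers import StableDiffusionPipeline

# 1) Synthesize a front view of the asset with a latent diffusion model.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = np.asarray(pipe("A front image of an apple and a white background.").images[0])

# 2) Estimate a per-pixel depth map with a monocular depth network.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform
with torch.no_grad():
    pred = midas(transform(image))
    depth = torch.nn.functional.interpolate(
        pred.unsqueeze(1), size=image.shape[:2], mode="bicubic"
    ).squeeze().cpu().numpy()
depth = depth / depth.max() * 0.2 * image.shape[1]  # illustrative relief scale

# 3) Displacement mapping: lift a flat grid by its depth values.
h, w = depth.shape
ys, xs = np.mgrid[0:h, 0:w]
front = np.stack([xs, ys, depth], axis=-1).reshape(-1, 3).astype(float)

# Triangulate the regular grid (two triangles per cell).
idx = np.arange(h * w).reshape(h, w)
quads = np.stack(
    [idx[:-1, :-1], idx[1:, :-1], idx[1:, 1:], idx[:-1, 1:]], axis=-1
).reshape(-1, 4)
faces = np.concatenate([quads[:, [0, 1, 2]], quads[:, [0, 2, 3]]])

# 4) Mirror the displacement to approximate the occluded back half
#    (plane-symmetry assumption), flipping face orientation for the copy.
back = front.copy()
back[:, 2] = 2 * front[:, 2].min() - back[:, 2]
vertices = np.concatenate([front, back])
faces = np.concatenate([faces, faces[:, ::-1] + h * w])

# 5) Laplacian smoothing; per-vertex colors are taken from the source image.
colors = np.tile(image.reshape(-1, 3), (2, 1))
mesh = trimesh.Trimesh(vertices=vertices, faces=faces, vertex_colors=colors)
trimesh.smoothing.filter_laplacian(mesh, iterations=5)
mesh.export("apple.obj")
```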

3 Related work

Imitation learning offers a straightforward and efficient method for learning agent actions based on expert demonstrations (Dillmann and Friedrich, 1996; Schaal, 1999; Argall et al., 2009). This approach has proven effective in diverse tasks including helicopter flight (Coates et al., 2009), robot control (Maeda et al., 2014), and collaborative assembly. Recent advancements in deep learning have enabled learning from high-dimensional inputs, such as vision and language data (Duan et al., 2017a; Zhang et al., 2018a; Xie et al., 2020)—partially stemming from improvements in image and video understanding (Lu et al., 2019; Kamath et al., 2021; Chen et al., 2020; Tan and Bansal, 2019; Radford et al., 2021; Dosovitskiy et al., 2021), but also in language comprehension (Wang et al., 2022; Ouyang et al., 2022). Specifically, the work presented in Radford et al. (2021) paved the way for multimodal language and vision alignment. The generalizability of such large multimodal models (Singh et al., 2022; Alayrac et al., 2022; Ouyang et al., 2022; Zhu et al., 2023) enables a variety of downstream tasks, including image captioning (Laina et al., 2019; Vinyals et al., 2015; Xu et al., 2015), visual question answering (VQA) (Antol et al., 2015; Johnson et al., 2017), and multimodal dialog systems (Kottur et al., 2018; Das et al., 2017). Most importantly for our purposes, these models have shown their utility when learning language-conditioned robot policies (Shridhar et al., 2021; Nair et al., 2022) that conduct a variety of manipulation tasks (Lynch and Sermanet, 2021; Stepputtis et al., 2020; Jang et al., 2022). Utilizing multimodal inputs for task specification and robot control (Anderson et al., 2019; Kuo et al., 2020; Rahmatizadeh et al., 2018; Duan et al., 2017b; Zhang et al., 2018b; Abolghasemi et al., 2019; Mees et al., 2022) plays a crucial role, as the environment and the verbal instruction need to be grounded across modalities. Most notably, BC-Z (Jang et al., 2022) proposes a large multimodal dataset and trains a policy via imitation learning to complete a variety of diverse household tasks. Similar in spirit, LanguagePolicies (LP) (Stepputtis et al., 2020) learns a language-conditioned policy that comprehends commands describing what, where, and how to do a task, but describes the outputs of the policy in terms of a dynamic motor primitive (DMP) (Schaal, 2006). Going beyond single-instruction following, SayCan (Ahn et al., 2022) focuses on planning longer-horizon tasks and incorporates prompt engineering. Most recently, large language models have achieved impressive performance in controlling embodied agents (Vemprala et al., 2023), with a push toward generally capable agents that can play Atari, caption images, chat, and stack blocks with a real robot arm (Reed et al., 2022).

While these models achieve impressive performance, they usually require large quantities of data and are mostly “black box” approaches that do not lend themselves well to human interpretation when the policy is not performing as desired. A potential solution to this problem that retains the end-to-end training benefits of deep learning is a modularized approach, allowing the creation of entire policies from a set of modules that can afford additional insights into the inference process of the neural network. Such modularization can be achieved by introducing auxiliary tasks, which have been shown to improve policy performance (Huang et al., 2022). Recent works on modularity investigate the question of whether “modules implementing specific functionality emerge” in neural networks automatically (Csordás et al., 2021; Filan et al., 2020). In contrast to these emergent modularity approaches, our prior work (Zhou et al., 2022) introduced supervised attention, together with a hierarchical learning regime akin to curriculum learning. Originating in machine translation (Liu et al., 2016), supervised attention, combined with hierarchical modularity, allows such functional modules to be implemented in a top-down manner. In this work, we delve deeper into the benefits of this approach by investigating how it can be extended to more complex tasks, including obstacle avoidance and instructions utilizing referential expressions, across tasks that involve a large quantity of automatically generated scene objects.

4 Methodology

In this section, we present our approach for modularity in language-conditioned robot policies. The main objective of the approach is to build neural networks out of composable building blocks which can be reused, retrained and repurposed whenever changes to the underlying task occur. A distinguishing feature of our approach is its modular training, while maintaining end-to-end learning benefits. In particular, the shift from training individual components to training the complete network occurs progressively, yet modules can be trained quickly without requiring gradient propagation throughout the entire network. Owing to its modularity, \(\pi _{\varvec{\theta }}\) can be transferred to a new robot in a sample-efficient manner. The modular nature of the resulting neural networks also enables easy introspection into the intermediate computation steps at runtime.

The introduced methodology builds upon two essential components, namely supervised attention and hierarchical modularity—two ingredients that are used in conjunction to crystallize individual modules within an end-to-end deep learning system. In the following, we first introduce the problem statement underlying language-conditioned imitation learning and then provide a detailed description of the training process. Initially, we focus on the efficient training of language-conditioned policies that can be transferred across a variety of robots. Thereafter, we shift our focus to the question of how new modules can be incorporated and how multiple modules can be interrelated.

Fig. 3 Overview: different input modalities, i.e., vision, joint angles and language, are fed into a language-conditioned neural network to produce robot control values. The network is set up and trained in a modular fashion—individual modules address sub-aspects of the task. The neural network can efficiently be trained and transferred onto other robots and environments (e.g., Sim2Real)

4.1 Problem statement

In Language-Conditioned Imitation Learning (Lynch and Sermanet, 2021; Stepputtis et al., 2020), the goal is to learn a policy \(\pi _{\varvec{\theta }}( {\varvec{a}} \mid {\varvec{s}}, {\varvec{I}})\) that executes a human instruction while taking into account situational and environmental conditions. The result of the learning process is a deep neural network parameterized by weight vector \(\varvec{\theta }\). Input \({\varvec{s}}\) is a verbal task instruction provided by a human, whereas \({\varvec{I}}\) is an image captured by an RGB camera mounted on the robot. Throughout this paper, policy \(\pi _{\varvec{\theta }}\) is trained to generate an action \({\varvec{a}} \in {\mathbb {R}}^7\) containing the Cartesian position \((x, y, z)\) and orientation \((r, p, y)\) of the robot end-effector, as well as a binary label \(g \in \{open, closed\}\) indicating the gripper state. Policies are trained following the imitation learning paradigm from a dataset \({\mathcal {D}} = \{{\varvec{d}}_0,\ldots , {\varvec{d}}_N\}\) of N expert demonstrations and corresponding verbal commands. In this dataset, each demonstration \({\varvec{d}}_n\) represents a sequence with T steps \(\left( ({\varvec{a}}_0, {\varvec{s}}_0, {\varvec{I}}_0), \ldots , ({\varvec{a}}_T, {\varvec{s}}_T, {\varvec{I}}_T) \right) \), where each step is a tuple \(({\varvec{a}}_t, {\varvec{s}}_t, {\varvec{I}}_t)\) containing the action, language command, and image at time step t. Upon completion of training, the policy is expected to execute novel configurations of the task.
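For illustration, the resulting data layout can be summarized by the following hypothetical type sketch; the field and type names are ours and do not correspond to our actual implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Step:
    action: np.ndarray   # a_t in R^7: (x, y, z, r, p, y, gripper)
    command: str         # verbal instruction s_t, e.g., "Pick up the red cube!"
    image: np.ndarray    # RGB camera image I_t of shape (H, W, 3)

Demonstration = list[Step]     # d_n = ((a_0, s_0, I_0), ..., (a_T, s_T, I_T))
Dataset = list[Demonstration]  # D = {d_0, ..., d_N}
```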

4.2 Training modular language-conditioned policies

Our overall method is illustrated in Fig. 3. First, camera image \({\varvec{I}}\) is processed together with a natural language instruction \({\varvec{s}}\) and the robot’s proprioceptive data (i.e., joint angles) through modality-specific encoders to generate their respective embeddings. The resulting embeddings are subsequently supplied as input tokens to a transformer-style (Vaswani et al., 2017) neural network consisting of multiple attention layers. This neural network is responsible for implementing the overall policy \(\pi _{\varvec{\theta }}\) and produces the final robot control signals.

Fig. 4 Different sub-aspects of the task are implemented as modules (via supervised attention). LANG identifies the target object. EE2D locates the robot end-effector

The encoding process ensures that distinct input modalities, e.g., language, vision and motion, can effectively be integrated within a single model. To that end, Vision Encodings \({\varvec{e}}_{{\varvec{I}}} = f_{{{\mathcal {V}}}}({\varvec{I}})\) are generated using an input image \({\varvec{I}} \in {\mathbb {R}}^{H \times W \times 3}\). Taking inspiration from (Carion et al., 2020; Locatello et al., 2020), we maintain the original spatial structure while encoding the image into a sequence of lower-resolution image tokens. The resolution is reduced via a convolutional neural network while increasing the number of channels, yielding \({\varvec{e}}_{{\varvec{I}}} \in {\mathbb {R}}^{({H}/{s}) \times ({W}/{s}) \times d}\), with s representing a scaling factor and d denoting the embedding size. Consequently, the low-resolution pixel tokens are transformed into a sequence of tokens \({\varvec{e}}_{{\varvec{I}}} \in {\mathbb {R}}^{Z \times d}\), where \(Z = (H\times W)/{s^2}\), through a flattening operation.

By contrast, Language Encodings \({\varvec{e}}_{{\varvec{s}}} = f_{{{\mathcal {L}}}}({\varvec{s}}) \in {\mathbb {R}}^{1 \times d}\) are produced via a pre-trained and fine-tuned CLIP (Radford et al., 2021) model. In particular, each instruction \({\varvec{s}}\) is represented as a sequence of words \(\left[ w_0, w_1,\ldots , w_n \right] \) in which each word \(w_i \in {\mathcal {W}}\) is a member of vocabulary \({\mathcal {W}}\). During training, we employ automatically generated, well-formed sentences; after training, however, we allow any free-form verbal instruction to be presented to the model, including sentences affected by typos or bad grammar. Finally, Joint Encodings \({\varvec{e}}_{{\varvec{j}}} = f_{{{\mathcal {J}}}}({\varvec{a}}) \in {\mathbb {R}}^{1 \times d}\) are created by transforming the current robot state \({\varvec{a}}\) into a latent representation using a simple multi-layer perceptron. The main purpose of this step is to transform the joint representation into a compatible shape that aligns with the other input embeddings.
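A minimal PyTorch sketch of the three encoders is given below. The layer counts, the resulting scaling factor \(s = 8\), and the assumption that CLIP text features arrive already projected to dimension d are illustrative choices, not the exact architecture.

```python
import torch
import torch.nn as nn

class ModalityEncoders(nn.Module):
    """Sketch: encode image, joints, and (pre-computed) CLIP text features."""

    def __init__(self, d=128, joint_dim=7):
        super().__init__()
        # Vision: three stride-2 convolutions reduce H x W by s = 8
        # while increasing the channel count to the embedding size d.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, d, 4, stride=2, padding=1),
        )
        # Joints: a simple MLP maps the robot state to a compatible shape.
        self.joints = nn.Sequential(nn.Linear(joint_dim, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, image, joints, clip_text):
        # image: (B, 3, H, W) -> tokens e_I: (B, Z, d) with Z = (H * W) / s^2
        e_I = self.vision(image).flatten(2).transpose(1, 2)
        e_j = self.joints(joints).unsqueeze(1)  # (B, 1, d)
        e_s = clip_text.unsqueeze(1)            # (B, 1, d), from fine-tuned CLIP
        return torch.cat([e_s, e_I, e_j], dim=1)  # full input token sequence
```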

4.2.1 Supervised attention

After encoding, the inputs are processed within a single neural network in order to produce robot control actions. However, a unique element of our approach is the formation of semantically meaningful sub-modules during the learning process. These modules may solve a specific sub-task, e.g., detecting the robot end-effector or calculating the distance between the robot and the target object. To achieve this effect, we build upon modern attention mechanisms (Vaswani et al., 2017) in order to manage the flow of information within attention layers, thereby explicitly guiding the network to concentrate on essential inputs.

More specifically, we adopt a supervised attention mechanism in order to enable user-defined information routing and the formation of modules within an end-to-end neural network. The main idea underlying this mechanism is that information about optimal token pairings may be available to the user. In other words, if we know which key tokens are important for the queries to look at, we can treat their similarity score as a maximization goal. Figure 4 shows the information routing for three modules. The first module, LANG, is supposed to identify the target object within the sentence. Hence, the corresponding attention layer is trained to focus only on the language input; the attention for the robot joint values and vision input is trained to be zero. In order to provide the output of this module to the attention layer in the next level, we use so-called register slots. Register slots store the output of a module so that it can be accessed by subsequent modules in the hierarchy; accordingly, each module within our method has corresponding register slot tokens. Coming back to Fig. 4, the second module, EE2D, locates the robot end-effector in the image. Accordingly, the attention for this module is trained such that the focus is on the vision and language inputs only. In turn, the result is written into the corresponding register slot. The final module in Fig. 4, DISP, calculates the displacement between the end-effector (EE) and the target object (O). Since this module is higher up in the hierarchy, it accesses the register slots of lower-level modules as inputs in order to calculate the displacement.

Registers serve multiple purposes: they can either be used as inputs to a module, in which case they serve as a learnable latent embedding, or be used to store the output of a particular module. An output register of a module is calculated using the standard transformer architecture. In particular, we define a transformer-based attention module over queries (\({\varvec{Q}}\)), keys (\({\varvec{K}}\)), and values (\({\varvec{V}}\)), which are processed as follows:

$$\begin{aligned} {\varvec{r}}_{\text {out}} = \text {Attn.}({\varvec{Q}}, {\varvec{K}}, {\varvec{V}}) = \textrm{softmax}\left( \frac{{{\varvec{Q}}{\varvec{K}}}^T}{\sqrt{d_k}}\right) {\varvec{V}} \end{aligned}$$
(1)

where \(d_k\) is the dimensionality of the keys. In our use case, the queries are initialized with either learnable, previously unused register slots, or with registers that have been set by modules operating in prior layers, thus encoding their respective results. Our keys are equivalent to the values and are initialized with all formal inputs (language, vision, and joint embeddings) as well as all previously set registers from prior layers. In contrast to common practice, we control the information flow when learning each module via our proposed supervised attention, which is a specific optimization target for attention layers.
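The following sketch shows how register slots plug into Eq. 1. Consistent with the description above, keys equal values and no projection matrices are applied; all shapes and names are illustrative.

```python
import torch
import torch.nn as nn

class RegisterAttention(nn.Module):
    """One module-level attention step (Eq. 1) with register-slot queries."""

    def __init__(self, d=128, n_new=1):
        super().__init__()
        # Fresh register slots are learnable latent embeddings used as queries.
        self.new_registers = nn.Parameter(torch.randn(n_new, d))

    def forward(self, tokens, prior_registers=None):
        # tokens: (B, Z, d) formal inputs (language, vision, joints) plus
        # all registers already set by modules in prior layers.
        B, _, d = tokens.shape
        q = self.new_registers.expand(B, -1, -1)
        if prior_registers is not None:  # queries may also be earlier outputs
            q = torch.cat([prior_registers, q], dim=1)
        attn = torch.softmax(q @ tokens.transpose(1, 2) / d ** 0.5, dim=-1)
        return attn @ tokens, attn       # r_out and the attention map M
```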

Fig. 5 Supervised attention example for the second layer of processing information throughout our overall policy

As an illustrative example, consider a query identifying the location of the end-effector, as demonstrated in Fig. 5 (first key and query combination in the top left) or finding the target object (key and query combination near the center). For simplicity, we omit the other formal inputs and only focus on the visual input. However, the Tar Reg. would also depend on the language register from the prior LANG module. Following common practice, the keys and values derive from the input image, with each image embedding vector corresponding to an image patch (Fig. 5 left). In this particular example, the EE uses a trainable, previously unused register as query, while the Tar register utilizes the output register of the language module to find the correct object (Fig. 5 top). The EE register is supervised to focus on the robot’s gripper image patch, thereby creating a sub-module for detecting the robot end-effector. Similarly, the target register attends to the target object’s image patch, forming a sub-module responsible for identifying the target object. When these queries accurately attend to their respective patches, these patches will primarily contribute to the output register’s embedding vector, which can then be used as subsequent module inputs.

Table 1 Explanation of the various modules utilized in our hierarchical attention module

More formally, we maximize the similarity between query \({\varvec{q}}_i\) and key \({\varvec{k}}_j\) if a connection should exist, thus optimizing \(\mathop {\mathrm {arg\,max}}\limits _{\varvec{\theta }}~ {{\varvec{q}}_i{\varvec{k}}^T_j}\). This process is equivalent to maximizing the corresponding attention map element \({\varvec{M}}_{ij}\), where \({\varvec{M}}_i = \textrm{softmax}(\frac{{\varvec{Q}}_i{\varvec{K}}^T}{\sqrt{d_k}})\). Since each element \({\varvec{M}}_{ij} < 1\), we minimize the distance between \({\varvec{M}}_{ij}\) and 1 according to Eq. 2. We assume that N supervision pairs are provided in a set \({{{\mathcal {S}}}}\), indicating the query and key tokens that should pay attention to each other. Each pair \((i, j) \in {{{\mathcal {S}}}}\) contains the indices defining which queries \({\varvec{q}}_{i}\) should attend to which corresponding keys \({\varvec{k}}_{j}\). Individual supervision pairs in this set can be addressed by \(\mathcal{S}(p) = (i_p, j_p)\). We then define the cost function for supervised attention as follows:

$$\begin{aligned} {{{\mathcal {L}}}}({{{\mathcal {S}}}}) = \sum _{n=0}^{N} \left( \textrm{softmax}\left( \frac{{\varvec{q}}_{r}{\varvec{k}}_{s}^T}{\sqrt{d_k}}\right) - 1\right) ^2 \end{aligned}$$
(2)

where \((r, s)\) correspond to the indices held by the n-th supervision pair \((r, s) = {{{\mathcal {S}}}}(n)\). While Eq. 2 defines the loss as a minimization problem with a mean squared error loss, other cost functions such as the cross-entropy (de Boer et al., 2004) can also be applied, but have empirically resulted in lower performance.
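In code, Eq. 2 amounts to a squared error that drives the supervised entries of the attention map toward 1. A minimal sketch, with tensor shapes assumed for illustration:

```python
import torch

def supervised_attention_loss(q, k, pairs):
    """q: (num_queries, d_k); k: (num_keys, d_k);
    pairs: supervision set S of (query_idx, key_idx) tuples."""
    d_k = q.shape[-1]
    attn = torch.softmax(q @ k.T / d_k ** 0.5, dim=-1)   # attention map M
    rows = torch.tensor([r for r, _ in pairs])
    cols = torch.tensor([s for _, s in pairs])
    return ((attn[rows, cols] - 1.0) ** 2).sum()         # Eq. 2
```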

4.2.2 Hierarchical modularity

In this section, we describe hierarchical modularity—an algorithm for training hierarchies of modules which is inspired by curriculum learning (Bengio et al., 2009). The previously introduced supervised attention mechanism enables the training of modules or building blocks relevant to the task. However, such modules also have to be stacked and cascaded together in order to realize the overall goal of the policy. In that sense, one module’s output becomes the subsequent module’s input. This can be represented as a directed graph, as shown in Fig. 6 (top), in which a cascade of specialized modules implements the overall control policy. Here, each module is represented by a node, while edges represent the information flow between nodes.

Fig. 6 Hierarchy of the modules used in our method

Table 1 formally defines each of the modules (nodes) by introducing their functionality, queries, keys, and supervised attention mask. Broadly speaking, each module follows Eq. 1, with the keys being the set of original sensor modalities as well as the registers. In the first layer of Fig. 6, the LANG module identifies the target object, as referred to in the verbal command, and stores the result in the \({\varvec{r}}_{\text {LANG}}\) register. Subsequently, in the second layer, the \(f_{\text {TAR2D}}\) module utilizes the \({\varvec{r}}_{\text {LANG}}\) register as a query while the \(f_{\text {EE2D}}\) module utilizes a new, previously unused register as a query. This chain continues until the final control output of the robot is generated in the CTRL module.

Recall that sub-modules address intermediate tasks in the overarching control problem, making the output register \({\varvec{r}}\) suitable for human interpretation and allowing for supervised training of the resulting embedding. To achieve this, we employ small multi-layer perceptron (MLP) decoders to convert the module outputs into their respective numeric outputs. For example, we train a small MLP on top of the \({\varvec{r}}_{\text {EE2D}}\) register that predicts the end-effector location \((ee_x, ee_y)\) via a single linear transformation. This approach enables our policy to predict intermediate module outputs, enhancing training accuracy and allowing monitoring and debugging during inference, which is particularly valuable when transferring the policy to different robots or scenarios.
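As a sketch (with an assumed embedding size), such a decoder is a single linear readout on top of a register embedding:

```python
import torch.nn as nn

# Sketch: decode the EE2D register embedding into pixel coordinates (ee_x, ee_y).
# The embedding size (128) and the single-layer design are assumptions.
ee2d_decoder = nn.Linear(128, 2)
# pred = ee2d_decoder(r_ee2d)  # supervises r_EE2D during training and
#                              # provides introspection at inference time
```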

Training Cascaded Modules

Intuitively, the cascaded modules can be trained in a manner inspired by curriculum learning, wherein each component is trained before further layers of the hierarchy are added to the training objective. This ensures that each module is trained until convergence before being employed for more sophisticated decision-making processes, ultimately leading to the prediction of robot control parameters.

Algorithm 1 Hierarchical Modularity: training algorithm returns network weights \(\theta \).

Algorithm 1 outlines the training procedure for our hierarchical approach in further detail. The algorithm trains each module of the hierarchy one after another, until the currently trained module has converged according to its respective loss function. After that, we progressively incorporate additional modules in a manner reminiscent of curriculum learning. Each module k is trained with an attention loss \({{{\mathcal {L}}}}_k\) given the supervision signal S of our proposed supervised attention approach, as well as a task-specific loss function \({\Psi }_k\) which trains the MLP decoder for every module. Thus, each module is optimized with regard to two targets. Note that the policy loss for the robot controller CTRL is also implemented as an MLP decoder, which represents the overall prediction target of our training process. Notably, in our scenario, this decoder predicts the next ten goal positions at each timestep instead of predicting only the next action. This choice is inspired by Jang et al. (2022) and also allows for a fair comparison in the subsequent evaluation sections.
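A schematic version of this training loop is sketched below; the interface of policy, returning the attention loss and decoder loss for a given module, is a placeholder rather than our actual API.

```python
def train_hierarchy(policy, modules, batches, optimizer, tol=1e-4):
    """Curriculum-style training (Algorithm 1): add modules one at a time,
    keeping the losses of all previously trained modules active.
    `policy(batch, m)` is assumed to return (attention_loss, decoder_loss)."""
    active = []
    for name in modules:                 # e.g., LANG, EE2D, TAR2D, ..., CTRL
        active.append(name)
        prev = float("inf")
        for batch in batches:
            # Sum attention loss L_k and decoder loss Psi_k over active modules.
            loss = sum(sum(policy(batch, m)) for m in active)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if abs(prev - loss.item()) < tol:  # crude convergence criterion
                break
            prev = loss.item()
```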

While the modular approach requires manually defining loss terms for each module, it is essential to note that all modules form a single overarching neural network implementing the robot policy, inherently learning necessary features in an end-to-end manner. Modularization arises solely from training the network with various supervised attention targets and a cost function that successively integrates more sub-tasks.

Fig. 7 Sequence of real-time outputs of the network modules: the object name (white) and visual attention (yellow region), the length of the displacement (white text), the object position (blue), and the end-effector position (red). All values are generated from a single network that also produces robot controls (Color figure online)

4.3 Use-cases and extensions of hierarchy

We present our model as a cascade of sub-modules, trained hierarchically, enabling seamless integration of additional modules. In this section, we discuss the incorporation of obstacle avoidance, i.e., tracking a predefined obstacle, and describe how this approach generalizes to arbitrary “referential objects”, letting users issue commands that reference any other object. These enhancements are implemented by introducing new modules, as depicted in Fig. 8.

4.3.1 Runtime introspection

All sub-modules retain their functionality, even after training. Consequently, they can be used at runtime to query individual outputs (e.g., LANG, TAR2D, EE3D). This feature allows users to monitor the intermediate computations of the end-to-end network to identify potential deviations and misclassifications. Figure 7 visually depicts the outputs of each module during execution on a real robot. A textual description (upper left corner) shows the currently identified object name, as well as the displacement (in cm) between the end-effector and the target object. The current attention map is visualized in yellow, whereas the end-effector position and the target position are highlighted by red and blue points. Computing these intermediate outputs incurs negligible computational overhead. In our specific system, we implemented a real-time visualization tool that can be used at all times to monitor the above features. Such tools for introspection can help in debugging and troubleshooting the language-conditioned policy. For example, they can be used to detect when individual modules need to be retrained, or where in the hierarchy a problem is manifesting. In addition, such outputs can be used with formal runtime monitoring and verification systems, e.g., Yamaguchi and Fainekos (2021) and Pettersson (2005), to improve the safety of the neural network policy.

4.3.2 Adding new behaviors

An important benefit of the modular architecture of our approach is the ability to add new modules into a neural network, even after successful training. To demonstrate this functionality, we add an obstacle avoidance behavior into the system, i.e., the robot is expected to detect an obstacle and generate controls to avoid any collisions.

In our specific scenario, we introduce an obstacle in the form of an orange basketball that must be avoided when approaching the target object. To incorporate this ability into the existing system, we add new modules into the previous hierarchy. This can be seen in Fig. 8 (top).

Fig. 8 Extensions of the hierarchy. a The hierarchy used for obstacle avoidance. Three new modules, OBST2D, OBST3D and DISP2, are plugged in post-training for detecting the obstacle and avoiding collisions with it. b The hierarchy for the relational tasks. These tasks involve two objects in a sentence, e.g., “Put the apple right of the orange”, where “orange” is the reference object. We add LANG2D, REF2D, REF3D and DISP2 for detecting the reference object and generating the corresponding trajectory (Color figure online)

In particular, we add OBST2D and OBST3D, which identify the obstacle’s position, and DISP2, which computes the displacement between the end-effector and the obstacle. Similar to target object detection, the obstacle is identified from object embeddings and ultimately results in a displacement value. The controller module then incorporates this second displacement as an additional input. In general, new modules can be added or existing modules removed according to the needs of the task.

4.3.3 Creating new types of behaviors by interconnecting modules

The modular approach also enables new types of behaviors to be incorporated; in particular, behaviors that interconnect multiple existing modules. For example, we may want to learn a robot policy that allows for relational queries, e.g., “Put the coke can in front of the pepsi can.”. Such a feature would require the dynamic identification of a secondary object and its desired relationship to the target object. In the previous example, the model must infer the target object (“coke can”), the reference object (“pepsi can”), and their relation (“in front of”).

Figure 8 (bottom) shows the new hierarchy for this use case. Similar to the obstacle avoidance case, we can incorporate additional modules REF2D, REF3D, and DISP2 for this purpose. However, in contrast to the obstacle avoidance case, an additional module LANG2 is added to extract the object reference from the user’s instruction; its output subsequently informs the REF2D module for further processing. This process of adding and removing modules allows for extensible language-conditioned policies whose complexity can be increased or reduced according to the necessities of the task. In the evaluation section, we will see that such an incremental approach has advantages over a complete retraining of the entire policy.

5 Evaluation

In this section, we present a set of experiments designed to evaluate various aspects of our approach. We first elaborate on the data collection process in Sect. 5.1. In Sect. 5.2, we investigate basic performance metrics of our approach and compare them to other state-of-the-art methods. To this end, we carry out ablation studies in order to probe the impact of our hierarchical modularity and supervised attention components, as well as the structure of the hierarchy itself. Thereafter, we study the robustness of our approach when exposed to occlusions (Sect. 5.2.2) and linguistic variability (Sect. 5.2.3). In Sect. 5.3, we focus on the ability of our approach to transfer existing policies between different robots in simulation, but also demonstrate the transfer to real-world robots in a sample-efficient manner. Section 5.4 examines the policy’s ability to generalize to novel objects. Lastly, we explore the possibility of incorporating new modules into an existing hierarchy for the purposes of obstacle avoidance and relational instructions (Sect. 5.5).

We evaluate our method on a tabletop manipulation task involving six Robosuite (Zhu et al., 2020) objects and up to 100 automatically generated objects across five different tasks. Our tasks include three basic objectives, namely picking objects, pushing them across the table, and rotating them. Further, we have two obstacle avoidance tasks, focusing on either a single object or all non-target objects simultaneously. In addition, we also investigate a more complex placing task in which objects need to be placed in relation to other objects in the environment, thus requiring the understanding and correct interpretation of relational instructions. Tasks are performed on three different robots in simulation and one robot in the real world. Our simulated robots include a Franka Emika, a Kinova Jaco, and a Universal Robots UR5 compliant robot arm. In the real-world scenario, we utilize a UR5 robot. The following sections first provide the details of our experimental setup and data collection strategy and then discuss evaluation results.

Training Resources To train our method from scratch, a single Quadro P5000 GPU takes approximately 48 h until convergence. In conjunction with this paper, we will release our final code base (and dataset), which is capable of leveraging multi-GPU setups, thereby resulting in further speed-ups with regard to the absolute training time.

5.1 Data generation

We perform a series of simulated experiments in MuJoCo  (Todorov et al., 2012), employing three distinct robotic platforms (Kinova, UR5, and Franka) that closely resemble our real-world experimental setup with a UR5 robot.

Figure 9 illustrates all four configurations, along with the six Robosuite objects utilized in our investigations: a red cube, a Coke can, a Pepsi can, a milk carton, a green bottle, and a loaf of bread. Further, our comprehensive set of 100 procedurally generated objects is depicted in Fig. 3. Demonstrations are collected using a heuristic motion planner that orchestrates fundamental motion planning techniques to control each target robot. By contrast, real-world demonstrations are collected via kinesthetic teaching utilizing a gravity-compensated robot arm. Beyond the robots’ motion, we store each action’s respective command (e.g., “Pick up the green bottle!”) and the corresponding RGB video stream captured by an overhead camera from the same angle as shown in Fig. 9, with a resolution of \(224\times 224\) pixels.

Fig. 9 A human instruction is turned into robot actions via a learned language-conditioned policy. The neural network is then successfully transferred to different robots in simulation and the real world

As a simple data augmentation technique, we utilize a templating system that generates syntactically correct sentences during the collection of training, validation, and testing data. These templates are derived from two human annotators who, after watching pre-recorded robot behavior videos, were asked to provide instructions describing what the robot was executing in each video. This small dataset served as the foundation for extracting command templates, as well as a collection of the used nouns, verbs, and adjectives. This collection was then extended with commonly available synonyms, allowing the creation of an automated system for command generation during data collection. The template first selects a random verb phrase in accordance with Table 10. Subsequently, a noun phrase is determined through random selections from Adj and Noun, as outlined in Table 9.
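A toy version of this templating scheme is shown below; the word lists are small illustrative excerpts, whereas the full collections are provided in Tables 9 and 10.

```python
import random

VERB_PHRASES = ["pick up", "lift", "push", "shove", "rotate", "turn"]
ADJECTIVES = ["red", "green", "small"]
NOUNS = ["cube", "bottle", "coke can", "milk carton"]

def generate_command(rng=random.Random(0)):
    """Pick a random verb phrase, then a random noun phrase (Adj + Noun)."""
    verb = rng.choice(VERB_PHRASES)
    noun_phrase = f"{rng.choice(ADJECTIVES)} {rng.choice(NOUNS)}"
    return f"{verb.capitalize()} the {noun_phrase}!"

print(generate_command())  # e.g., "Push the red cube!"
```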

Table 2 Datasets utilized in our experiments
Table 3 Comparison with the state-of-the-art baselines as well as ablations in MuJoCo

Table 2 presents the datasets utilized in our experimental setup. Each sample is collected at 125 Hz, resulting in trajectories containing 100–500 steps, depending on the distance between the robot’s initial position and the target object, as well as the task being executed. The smaller datasets in rows three to five are for transferring a previously trained policy from one robot or task to another. The transfer learning datasets are purposefully over-provisioned, as we assess the minimal size required to achieve performance comparable to a policy trained from scratch in Sect. 5.3. Finally, the datasets in rows 4, 6, 7, 8, and 9 undergo evaluation in an interactive, live setting in which a user engages with a deployed policy, either within a simulation or in the real world; thus, these datasets do not have a formal test split.

5.2 Model performance and baseline comparison

In this section, we evaluate our model on the three basic actions across the six Robosuite objects, utilizing the \({\mathcal {D}}^{\text {UR5}}\) dataset. We also compare our method to two state-of-the-art baselines, specifically BC-Z (Jang et al., 2022) and LP (Stepputtis et al., 2020). As our third baseline, we investigate vanilla, unsupervised attention. In this scenario, the same network as before is trained, but without supervision of the attention process as introduced in this paper.

Table 3 summarizes these results in which each training and testing procedure was executed three times to provide a better understanding of the stability of the compared methods. We evaluate not only the overall success rates but also the performance of each individual module within our language-conditioned policy. Specifically, we employ the following metrics: (1) Success Rate describes the percentage of successfully executed trials among the test set, (2) Target Object Position Error (TAR3D) measures the Euclidean 3D distance between the predicted target object position and the ground truth, (3) End Effector Position Error (EE3D) quantifies the Euclidean 3D distance between the predicted end effector position and the ground truth, (4) Displacement Error (DISP) calculates the 3D distance between the predicted 3D displacement vector and corresponding ground truth vector.

Our method (line 4) outperformed BC-Z (line 3) on all basic tasks with an average success rate of \(82.4\%\), as compared to \(73.1\%\) for BC-Z. Furthermore, we separately assessed the prediction error of the proposed network’s components, namely EE3D, TAR3D, and DISP. We note that the end-effector pose prediction accuracy (approximately 0.5 cm) surpasses the target object’s accuracy, which could be attributed to the presence of the robot’s joint state information. The target object’s position estimation deviates by around 2–3 cm, possibly due to the absence of depth information in our input dataset (solely consisting of RGB).

By contrast, the LP model (line 1) is not able to successfully complete any of the tasks. We hypothesize that this low performance is due to the training dataset’s significantly smaller size compared to the LP’s usual training data size, as indicated by Stepputtis et al. (2020). Finally, the vanilla, unsupervised attention approach (line 2) achieves a success rate of \(13.3\%\). Qualitatively, we observe in this scenario that the vanilla attention model is not able to recognize the correct object. Similarly to the LP approach, we hypothesize that the issue could potentially be resolved with a larger dataset. However, for the sake of a fair comparison within this paper, we utilize the same dataset \({\mathcal {D}}^{\text {UR5}}\) across all methods.

5.2.1 Ablations

In order to evaluate the impact of our two main contributions—supervised attention and hierarchical modularity—we conduct an ablation study investigating the effect of each contribution on training performance. In addition, we also ablate the structure of the hierarchy itself in order to investigate its resilience to structural changes.

Results of the ablation experiments can be found in Table 3. Our model (line 3) has an overall success rate of \(82.4\%\) across three seeds. When ablating the usage of hierarchical modularity, performance drops to \(36.4\%\) (line 5). Utilizing our runtime introspection approach to investigate potential issues in the modules (Sect. 4.3.1), we find that the target and displacement errors increased to over 20 cm, which is likely the cause for the reduced performance. When removing the supervision signal (line 4) for the attention inside our modules (and instead relying on end-to-end training), we see a drop of \(\approx 2.5\%\) in performance to about \(80\%\).

When ablating the hierarchy itself, we merged the TAR2D and TAR3D modules (line 7) into a single module instead of maintaining two. The underlying rationale is that the separation of target detection into 2D and 3D stages is not strictly necessary, and thus a single target module may be sufficient. The resulting success rate in this case is \(80.9\%\), which is only slightly below the original rate of \(82.4\%\). Next, we removed the displacement module DISP (line 9) altogether, which results in a performance of about \(61.3\%\) (a loss of around \(20\%\)). Finally, we added spurious modules that are not necessary for the policy’s success in these tasks (line 10). In particular, we added a specific module that only detects the “Coke” can. In this case, we achieved a success rate of \(86.3\%\), which is slightly higher than the original result.

As a general observation, the approach appears robust to superfluous modules, combined modules, or other variations of the hierarchy. However, the absence of certain critical modules, e.g., the DISP or TAR modules (lines 9 and 8, respectively), can have a more drastic effect on performance. In the above case of removing the DISP module (line 9), performance drops to about \(61.3\%\), which is below the corresponding value for BC-Z (\(73.1\%\)).

5.2.2 Occlusion

Next, we evaluate the robustness of our approach to partial occlusions of the target objects during task execution. To this end, occlusions are introduced by removing image patches in the camera feed of the simulated experiments. This step is performed by covering approximately \(20\%\), \(42\%\), \(68\%\) and \(80\%\) of the target object’s total area, as calculated via a pixel-based segmentation of the input image provided by the simulator. All experiments are conducted on all six Robosuite objects across all three basic tasks. The results are shown in Fig. 10. We observe that our method is robust to occlusions of up to \(20\%\) of the target object, while our baseline model, BC-Z, already experiences a significant drop in accuracy: our model only loses about \(1.1\%\) in performance, whereas BC-Z drops by \(9.35\%\). For occlusions greater than \(40\%\), however, our method performs on par with BC-Z. We argue that robustness to \(20\%\) occlusions is significant, since small partial occlusions are the most likely to occur during tabletop manipulation tasks.
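A sketch of this occlusion protocol is given below, assuming a boolean segmentation mask of the target object from the simulator; the patch size and gray fill value are our own assumptions.

```python
import numpy as np

def occlude_target(image, mask, fraction, patch=16, seed=0):
    """Gray out square patches until ~`fraction` of the target's mask
    pixels are covered. `mask` is a boolean (H, W) segmentation array."""
    rng = np.random.default_rng(seed)
    covered = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)
    while covered[mask].mean() < fraction:
        i = rng.integers(len(ys))
        y, x = ys[i], xs[i]
        image[y:y + patch, x:x + patch] = 128   # remove an image patch
        covered[y:y + patch, x:x + patch] = True
    return image
```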

Fig. 10 Success rate when part of the target object is occluded

5.2.3 Synonyms

Our final robustness experiment concerns the variability of free-form spoken language. While our system is trained with sentences from a template-based generator, we evaluate its performance when exposed to a set of additional synonyms, as well as free-form spoken language from a small set of human subjects. When replacing synonyms, as shown in Table 8, in the single-word and short-phrase case, our model achieves an \(82.5\%\) success rate on the pushing task. When using BC-Z on the same task with the same synonyms, performance drops to \(28.57\%\), indicating the robustness of our method to variations in the language inputs. Finally, we also evaluate the performance on 30 examples of free-form natural language instructions that were collected from human annotators and report a success rate of \(73.3\%\). The sentences used by the annotators can be found in Table 11 and show that our model can work with unconstrained language commands.

5.3 Transfer to different robots and real-world

In this section, we evaluate the ability of our approach to efficiently transfer policies between robots that may have different morphologies. Rather than retraining our model from scratch to accommodate the altered dynamics of a different robot, we posit that our modular approach enables the transfer of substantial portions of the prior policy. This necessitates only minimal fine-tuning, consequently resulting in a reduced demand for data collection on the new robot. In particular, we evaluate fine-tuning of the entire policy, as well as fine-tuning of only the modules affected by changes in visual appearance or robot morphology.

Fig. 11 Results of transferring policies from the Kinova (K) robot to the UR5 (U) and Franka (F) robots. “Ours-f” refers to freezing parts of our model during transfer. Experiments are performed in the MuJoCo simulator

5.3.1 Transfer in simulation

Our initial policy is trained from scratch on the \({\mathcal {D}}^{\text {Kinova}}\) dataset, while the transfer of the trained policy to the Franka and UR5 robots is realized with the \({\mathcal {D}}^{\text {Franka}}_{\text {TF}}\) and \({\mathcal {D}}^{\text {UR5}}_{\text {TF}}\) datasets, respectively.

As noted earlier, the \({\mathcal {D}}_{\text {TF}}\) datasets are intentionally over-provisioned to allow an evaluation of how much data is required for the transferred policy to match the performance of a policy trained from scratch on the same robot. In order to shed some light on this, we sub-sampled the transfer datasets to a total size of 80, 160, 240 and 320 demonstrations and conducted the training. Figure 11 shows the results of this analysis (reported as “Ours”) given the varying dataset sizes when fine-tuning the entire policy initialized with the Kinova weights. With 160 demonstrations, our model achieves a success rate of \(80\%\), which is only slightly below the policy’s performance when trained on the full 1600 demonstrations from scratch. Further, given the full 320 demonstrations of the transfer dataset, the policy reaches a performance that is on par with one trained from scratch. When fine-tuning BC-Z with the same dataset splits, we observe that our model consistently outperforms BC-Z. Interestingly, we also observe that our model performs similarly when transferring to the Franka and UR5 robots across the dataset splits, while BC-Z initially performs worse when transferring to the Franka robot. Note that the Franka is a seven degree-of-freedom (DoF) robot, while the Kinova robot on which the source policy was trained has only six. This discrepancy likely alters the robot dynamics, thereby affecting the transfer process.

Further, we conducted experiments in which we froze parts of our model during transfer of a pre-trained policy from the Kinova to the UR5 and Franka robots. In particular, the TAR3D, EE3D, and DISP prediction modules are unaffected by the change in visual appearance and morphology of the new robot and, thus, do not need to be retrained. Note, however, that we retrain TAR2D, since partial occlusions by the new robot could lead to false positives for target objects. We have conducted further experiments with the same fine-tuning datasets and report their results in Fig. 11 (reported as “Ours-f”). In this setting, with a dataset of only 80 demonstrations, the partially frozen model produces a result of \(60\%\) and \(72.5\%\) when transferring to the Franka and UR5, respectively. This represents a substantial performance improvement of up to \(18\%\) in the case of transfer to the UR5 robot, while utilizing less data than fine-tuning the entire model. This result further underlines the gains in data-efficiency that can be achieved through hierarchical modularity.
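In code, this partial freezing (“Ours-f”) can be as simple as the following sketch; the sub-module attribute names are hypothetical.

```python
import torch.nn as nn

def freeze_for_transfer(policy: nn.Module, frozen=("TAR3D", "EE3D", "DISP")):
    """Exclude modules unaffected by the new robot's appearance/morphology
    from fine-tuning; TAR2D stays trainable due to possible new occlusions."""
    for name, module in policy.named_children():
        if name in frozen:
            for p in module.parameters():
                p.requires_grad = False  # no gradient updates during transfer
```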

5.3.2 Real-world transfer

Having demonstrated the ability of our approach to efficiently transfer policies between robots in simulation, we now demonstrate that a policy can also be transferred to the real world (Sim2Real transfer) in a sample-efficient way. To this end, we first trained a policy for the UR5 robot in simulation utilizing the \({\mathcal {D}}^{\text {UR5}}\) dataset and subsequently transferred it with a substantially smaller real-world dataset \({\mathcal {D}}^{\text {UR5}}_{\text {RW}}\). More specifically, 260 demonstrations on the real robot are collected for transfer—this corresponds to about one-sixth of the size of the original training set. The overall robot setup can be seen in Fig. 12. The scene is observed via an external RGB camera, and robot actions are calculated in a closed-loop fashion by providing the current camera image and language instruction to the policy.

To investigate the contributions of our proposed methods, we conduct experiments under three different settings: directly applying the simulated policy on the real robot, fine-tuning the simulated policy using the real-world dataset \({\mathcal {D}}^{\text {UR5}}_{\text {RW}}\), and transferring the simulated policy to the real world using our proposed method. Image sequences of real-robot executions can be seen in Figs. 7 and 9. As expected, the policy trained in simulation is unable to complete any task when directly applied to the real robot, despite coordinate systems and basic dynamics being matched between the simulation and the real world. This failure is due to the substantial variation in visual appearance of the robot and objects. When using a naive fine-tuning approach that does not use our core contributions, the resulting success rate is \(56.7\%\) over 30 trials, thus demonstrating partial success. However, we observe that the noise in the attention maps is unusually high, which we attribute to the intricacies of real-world vision and dynamics. Finally, when training the system with our approach, including supervised attention and hierarchical modularity, the approach achieves a success rate of \(80\%\) in the real world when prompted with 30 commands issued by a human operator.

Fig. 12 Experimental setup of the real-robot experiments. Objects are seen through an external camera and actions are generated in a closed-loop fashion

5.4 Generalization to novel objects

To investigate the importance of modularity for generalization, we extend the experimental setup to include a more challenging scenario. In particular, we incorporate a total of 100 objects, which are automatically generated following the approach outlined in Sect. 2. They comprise 10 unique classes, each with 10 objects. We utilize 3 objects from each class for training, while the remaining 7 objects, previously unobserved by the model, are reserved for testing. For this experiment, we utilize the \({\mathcal {D}}^{\text {UR5}}_{\text {NO}}\) dataset and perform an evaluation with 100 trials (Fig. 13).

Fig. 13 Robot performing tasks with automatically generated objects. The top row shows the robot picking up an object, while the second row shows the robot pushing an object

Fig. 14 Test results of simulated experiments with 70 unseen objects from 10 classes, generated automatically using the data pipeline proposed in Sect. 2

As before, we compare our model’s performance on the 100-object generalization task to BC-Z, as well as to ablated versions of our model without supervised attention or hierarchical modularity; the results are shown in Fig. 14. During this study, we removed either one or both of our components from the model during training to examine their individual and combined contributions. The most basic version of the model, identified as “Base” and using neither supervised attention nor hierarchical modularity, demonstrates poor performance with a score of \(47\%\). Models with only supervised attention (“w/ Sup. Attn”) and those with only hierarchical modularity (“w/ Hier. Mod.”) each perform significantly better than the base model. The best model is our proposed full model (“Ours”), which combines both supervised attention and hierarchical modularity, resulting in a success rate of \(87\%\). For comparison, the baseline model BC-Z attains a \(76\%\) success rate, which our full model surpasses by 11 percentage points.

5.5 Hierarchy extension

In this section, we explore two extensions to our hierarchy by introducing new modules that allow the policy to conduct new tasks. As shown in prior sections, our modular approach allows for easy transfer between different robots; however, it can also be used to introduce novel tasks to the policy. The following sections first introduce an obstacle avoidance task, in which an object placed in the path between the robot and the described target object has to be detected and avoided. In a subsequent experiment, we further extend the hierarchy by not only focusing on a fixed “obstacle object”, but also allowing the user to specify a secondary reference object, ultimately affording a novel placement task in which objects can be placed in relation to other objects in the environment.

Fig. 15

Robot trained to avoid all obstacles in the scene. On the way to the Coke can, the robot first avoids a basketball and then the green bottle. We move the bottle in front of the robot to generate an instantaneous response

5.5.1 Obstacle avoidance

In this experiment, we demonstrate a seamless way to integrate new modules into an existing trained hierarchy by introducing an obstacle avoidance task. First, we discuss a setup wherein a single, specific object needs to be avoided, before extending the approach to avoid any obstacle in the scene (Figs. 15, 16).

In our first setting, a basketball is placed between the end-effector and the objects that are to be manipulated, serving as an obstacle. The robot must first identify the obstacle and subsequently formulate a trajectory to navigate around it effectively. For this task, the new modules OBST2D and OBST3D are added to the hierarchy and trained to generate the location of the obstacle in image space and world space, respectively. More specifically, OBST2D identifies the image patches that belong to the obstacle. In turn, these patches are fed into OBST3D to generate a 3D world coordinate. We relate the obstacle’s position to the robot by calculating a second displacement, DISP2, which utilizes EE3D and OBST3D. Figure 8 shows the updated hierarchy. The output of DISP2 feeds into the calculation of the control value, where it is combined with the output of DISP (the displacement of the end-effector to the target object).
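
The following sketch illustrates how the new modules slot into the existing dataflow. Module names follow the text, but the function signatures, the CTRL name, and the tensor interfaces are illustrative assumptions rather than the actual implementation:

```python
def forward_with_obstacle(image, sentence, modules):
    """Illustrative forward pass of the extended hierarchy (shapes are schematic)."""
    tar2d = modules["TAR2D"](image, sentence)   # image patches of the target object
    tar3d = modules["TAR3D"](tar2d)             # target position in world space
    ee3d = modules["EE3D"](image)               # end-effector position in world space
    obst2d = modules["OBST2D"](image)           # image patches of the obstacle
    obst3d = modules["OBST3D"](obst2d)          # obstacle position in world space
    disp = tar3d - ee3d                         # DISP: displacement to the target
    disp2 = obst3d - ee3d                       # DISP2: displacement to the obstacle
    return modules["CTRL"](disp, disp2)         # control combining both displacements
```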

Fig. 16

Robot performing a task while avoiding a basketball. The top row shows a pick action and the bottom row shows a push action. In both cases, the robot changes its course to avoid collision with the obstacle

The expert trajectories that avoid the obstacle are generated using a potential field approach (Khatib, 1986). More specifically, the basketball acts as a repulsor that pushes the end-effector away from it. Using this approach, 200 training demonstrations are collected, forming the dataset \({\mathcal {D}}^{\text {UR5}}_{\text {OBST}}\). The policy for this task is trained from an existing UR5 policy using the above dataset, which introduces the novel task. For evaluation, we define a successful trial as the absence of any collision between the robot and the obstacle. After training, our method achieves a success rate of \(88\%\); failure cases mostly revolve around premature contact with the target object.
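
A minimal sketch of such a potential-field generator is given below; the gains, influence radius, and step size are illustrative assumptions, not the values used to collect \({\mathcal {D}}^{\text {UR5}}_{\text {OBST}}\):

```python
import numpy as np

def potential_field_step(ee, target, obstacle,
                         k_att=1.0, k_rep=0.05, rho0=0.15, step=0.01):
    """One Euler step of a potential field (Khatib, 1986): the target
    attracts the end-effector, the obstacle repels it within radius rho0."""
    f_att = k_att * (target - ee)        # attractive force toward the target
    diff = ee - obstacle
    rho = np.linalg.norm(diff)
    f_rep = np.zeros_like(ee)
    if rho < rho0:                       # repulsion active only near the obstacle
        f_rep = k_rep * (1.0 / rho - 1.0 / rho0) / rho**2 * (diff / rho)
    force = f_att + f_rep
    return ee + step * force / (np.linalg.norm(force) + 1e-8)
```

Iterating this step from the start pose until the end-effector reaches the target yields a demonstration trajectory that bends around the basketball.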

We further extend the capabilities of the proposed hierarchical approach to avoid any object in the environment. To this end, we utilize a single module that is trained to focus on all objects with the exception of the target object. This module can be viewed as the inverse of the target detection module, i.e., all but the target object are highlighted. In this multiple-obstacle case, the trained network achieves a success rate of \(83\%\). An image sequence of the resulting behavior can be seen in Fig. 15. Notice the robot’s response after a second obstacle (the green bottle) is moved in front of it. The image sequence also highlights the closed-loop control underlying our approach: robot actions are constantly recalculated based on the current environmental conditions.
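
Conceptually, the attention pattern this module must learn is the complement of the target module’s. The hypothetical snippet below makes this explicit; in practice the module is trained end-to-end rather than computed this way:

```python
def obstacle_attention(target_attn, objectness):
    """target_attn, objectness: per-patch weights in [0, 1].
    Highlights every object patch except those belonging to the target."""
    return objectness * (1.0 - target_attn)
```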

Fig. 17

Robot performing relational tasks with two objects involved in one command. The top row shows the robot putting an avocado to the left of a hamburger, while the second row shows the robot putting a donut to the right of a hamburger

5.5.2 Relational reference

While the obstacle avoidance tasks showed the basic pipeline of adding a secondary object and defining a desired behavior for it, the approach can be extended to allow the user to verbally specify this secondary object. For this purpose, we introduce a relational placing task in which a user specifies a reference object, requiring the system to identify the two task-related objects and generate a control signal accordingly. Relational tasks involve instructions that not only specify a target object for manipulation (e.g., an “apple”) but also mention an additional referential object (e.g., an “orange”), as in “Place the apple to the right of the orange.” In these scenarios, the robot must identify both objects and understand the intention behind the given language. We aim to demonstrate that our model can effectively handle such tasks even under generalization constraints. For this purpose, we again utilize our 100 automatically generated objects and train a policy on 30 of them (3 objects per class), while evaluating its generalization capabilities on the remaining 70 objects. For this experiment, we utilize the \({\mathcal {D}}^{\text {UR5}}_{\text {ET}}\) dataset (Fig. 17).

For this task, we make several modifications to the original hierarchy. First, we introduce a LANG2 module to determine the referential object based on the language input. In addition, we add TAR2D2 and TAR3D2 modules to identify the image patch corresponding to the second object and to generate its 3D world coordinate, respectively. We also include a DISP2 module to calculate the displacement between the end-effector and the second object.
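
A schematic of the modified forward pass is shown below; as before, module names follow the text, while the signatures, the LANG name, and the CTRL interface are illustrative assumptions:

```python
def forward_relational(image, sentence, modules):
    """Illustrative forward pass for the relational placing task."""
    lang = modules["LANG"](sentence)     # grounding of the target object
    lang2 = modules["LANG2"](sentence)   # grounding of the referential object
    tar3d = modules["TAR3D"](modules["TAR2D"](image, lang))      # target position
    tar3d2 = modules["TAR3D2"](modules["TAR2D2"](image, lang2))  # reference position
    ee3d = modules["EE3D"](image)        # end-effector position
    disp = tar3d - ee3d                  # DISP: displacement to the target
    disp2 = tar3d2 - ee3d                # DISP2: displacement to the reference
    return modules["CTRL"](disp, disp2, lang)  # e.g., place left/right of reference
```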

In this scenario, the robot is directed by a verbal sentence to identify the first object, pick it up, recognize the second object, and then place the first object either to the left or right of the second object according to the command given.

The entire process is carried out in the MuJoCo environment and evaluated on 100 test trials. For comparison, we also train and evaluate the BC-Z model. Our model achieves a success rate of \(76\%\), a 7-point improvement over BC-Z. Considering the increased complexity of this task compared to previous ones, due to the need to identify two objects from both the sentence and the image and the longer sequence of manipulation steps required, a \(76\%\) success rate and a 7-point margin over the baseline are commendable results.

6 Discussion and limitations

The above experiments demonstrate a variety of benefits of the introduced modular approach. On the one hand, it allows new components and behaviors to be incorporated into an existing policy. This property is particularly appealing in robotics, since many popular robot control architectures are based on the concept of modular building blocks, e.g., behavior-based robotics (Arkin, 1998) and the subsumption architecture (Brooks, 1986). On the other hand, modularity enables the user to employ modern verification and runtime monitoring tools to better understand and debug the decision-making of the system. At the same time, the overall system remains end-to-end differentiable and was shown in the above experiments to yield practical improvements in sample efficiency, robustness, and extensibility.

However, a major assumption of our approach is that a human expert correctly identifies the logical flow of components and subtasks into which a task can be divided, and organizes these subtasks into a hierarchical cascade. Early results indicate that an inadequate decomposition can hamper, rather than improve, learning. Furthermore, the approach does not incorporate memory and therefore cannot perform sequential actions; in a few cases we observed a failure to stop after finishing a manipulation, with the robot continuing to perform random actions. Another open question is the scalability of the approach. In our investigations, we looked at behaviors with a small number of sub-tasks. Is it possible to scale the approach to hierarchies with hundreds or thousands of nodes? The prospect is appealing, since this would bridge the divide between the expressiveness and plasticity of neural networks and the ability to create larger robot control systems that require the interplay of many subsystems.

For future work, we are particularly interested in using unsupervised and supervised attention side-by-side, i.e., several modules may be supervised by the human expert whereas other modules are adjusted in an unsupervised fashion. This would combine the best of both worlds, namely the ability to provide human structure and knowledge while at the same time maximally profiting from the network’s plasticity. This is a particularly promising direction, since the ablation experiments indicate that having superfluous modules does not drastically alter the network performance. Further, we would like to investigate the potential of inferring a suitable hierarchy in a data-driven manner.

7 Conclusions

In this paper, we present a data-efficient approach for language-conditioned policies in robot manipulation tasks. We introduce a novel method called hierarchical modularity, and adopt supervised attention, to train a set of reusable sub-modules. This approach retains the advantages of end-to-end learning while promoting the reusability of the learned sub-modules. As a result, we are able to customize the hierarchy according to specific task demands, or to integrate new modules into an existing hierarchy for new tasks. Our method demonstrates high performance in a comprehensive set of experiments, including training manipulation policies with limited data, transferring between multiple robots, and extending module hierarchies. We also develop an automated data generation pipeline for creating simulated objects to manipulate, and show our model’s generalization capability on unseen objects generated by this pipeline. Furthermore, we demonstrate that the learned hierarchy of sub-modules can be employed for introspection and visualization of the robot’s decision-making processes.