1 Introduction

Autonomous robots will soon be deployed in large numbers performing a wide variety of tasks. They will operate for long periods of time, often far from their users, making real-time supervision of their activities impractical. They will be faced with challenging situations and have to make decisions and perform actions on their own, all without the aid or immediate knowledge of their operators. Robot autonomy, then, presents an important challenge: the need to inform operators of what their robots have done.

Upon returning home after tasking a robot with cleaning the house, making dinner, and taking care of the dog, a robot user would like to know what happened during the day: how the dog fared, what parts of the house could not be cleaned and why, and what the robot made for dinner, for example. A farmer using robots to harvest crops would want to be able to ask how much has been harvested, what the conditions in the field were, and whether any evidence of disease or drought was seen.

It might be thought that a robot could simply keep a log of all it has done and thus give a full report of every turn, move, and decision it made during its operation. However, there are two problems with such a scenario. First, such reporting may not be possible, particularly if an agent’s actions are not the result of interpretable internal planning using discrete primitives but are instead the result of following a reinforcement learning policy implemented as a neural network which simply outputs, for example, rotations and joint movements to a robot’s wheels, arms, etc. Second, even if such a complete record of action existed and was human-readable, it would not be useful; it would be far too long and detailed to read and make sense of in a reasonable time in any realistic situation. Instead, it will be necessary for agents to summarize their activities. And so that humans can comprehend such a summary quickly and accurately, it would ideally be given in natural language.

Summaries, rather than complete records, will be particularly useful as action sequences become longer. They will also be challenging to produce because it will be necessary to identify the most important actions and, very often, to describe those actions using higher level abstract terms. Summaries may not fully address everything that a user wants to know about a robot’s actions, so a user may also want to ask questions about what a robot did or saw during a particular action sequence.

Roboticists have long recognized the usefulness of being able to give natural language instructions to robots. Summarizing and answering questions about past robotic actions can be seen as a complement to instruction following. A user who gave such instructions might naturally be expected to want a short natural language summary of what was done in response to those instructions. Yet despite the volume of work that has been done on instruction following, its complement has gone largely unaddressed. Fortunately, existing and future datasets designed for instruction following tasks can be repurposed and augmented to serve as a training ground for robot action summarization and question answering. We make use of and augment the popular alfred dataset (Shridhar et al., 2020) which provides ego-centric video frames of episodes of robot action sequences in a virtual environment along with multiple levels of description in natural and structured language. Using a model that incorporates a large language model (llm), we present the first work directly addressing, performing, and evaluating robot action summarization and question answering.

Fig. 1 Visual presentation of model and method for producing zero-shot summaries involving novel objects. Step 1 illustrates the full model: input (at the left) includes video frames as well as episode metadata describing the environment as the agent saw it. The components of the model in black (clip Resnet and the word embeddings) are pretrained and remain frozen during our training process, while the light blue module (the vision-to-T5 bridge network) is trained from scratch. The dark blue module, a pretrained T5, which outputs the final question answer or summary, is fine-tuned during training. Step 2 demonstrates zero-shot summarization using a previously trained model which was not trained to summarize episodes with some of the objects in the newly presented episode

Our main contributions are:

Summarization of actions. We demonstrate summarization of robotic actions in both short and long summaries from video frames in a multimodal model that incorporates vision and fine-tunes a pretrained T5 llm (Raffel et al., 2020).

Answering questions about actions. The same model is jointly trained to answer questions about robotic actions, including questions about actions performed, objects seen, and the order in which actions were performed.

Zero-shot transfer from question answering to summarization. We show that an llm-based system trained to answer questions about held-out objects can faithfully produce summaries about those objects in a zero-shot manner, even though the objects are not in the summarization task training set. This demonstrates the transfer of representational knowledge from the question answering tasks to the summarization tasks. We further demonstrate that this transfer occurs for some question types but not others.

Automatic generation of questions and answers. We develop a method to automatically generate questions and answers using an existing dataset and its associated virtual environment and release a dataset of such questions and answers.

2 Method

Our objective is to generate a summary or question response in natural language \(r \in \mathcal {L}\) of a long horizon robotic task, given the history of observations \(o \in \mathcal {O}\) that the robot experienced during the task and a question or summarization prompt \(q\). We define the robot’s experience/trajectory as the sequence of observations \(\tau = (o_{0}, o_{1}, \ldots, o_{T})\). We seek to learn a function \(\mathcal {F}_{\theta }\) such that \(r = \mathcal {F}_{\theta } (\tau , q)\).
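Viewed as an interface, the learned mapping takes an episode’s observations plus a prompt string and returns a response string. The sketch below is purely illustrative; the names Trajectory, respond, and model.generate are hypothetical and not part of this work’s released code.

```python
# Illustrative-only signature for the learned mapping r = F_theta(tau, q).
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Trajectory:
    observations: List[Any]   # ego-centric frames o_0, ..., o_T

def respond(model, tau: Trajectory, q: str) -> str:
    """Return a natural language answer or summary r for prompt q."""
    return model.generate(tau.observations, q)   # model implements F_theta
```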

2.1 Data requirements

The general problem of robot action summarization and question answering could be addressed in a variety of ways depending on the data available, the environment the robot operates in, and details of how the robot operates. A few types of data would be most helpful in training and operating an autonomous, mobile, general purpose robot to summarize its past:

(1) Ego-centric video of the robot performing tasks serves as the primary input to summarization and question answering. It can be captured by many robots and would facilitate the transfer of knowledge to new circumstances and environments.

(2) Natural and/or structured language summaries of the actions performed in the video. These summaries could be of varying lengths, depending on the needs of the end user. The presence of both short and long summaries would provide the most flexibility and choice for a user.

(3) Ground truth information about the objects and places in the ego-centric video for training question answering tasks. The objects and places present in the training dataset will determine what kinds of questions a user could subsequently ask.

2.2 Repurposed dataset

Our approach requires egocentric video or video frames, a description of an agent’s actions during an episode, and information about the environment the agent operates in, particularly the locations of objects it encounters. For the purposes of the current investigation we use episodes from the alfred dataset. Each episode’s robot state-action trajectory in the original dataset has four different kinds of representation, which we use either as-is or after transforming them. The following list of dataset elements lays out how each is used in this work and notes its original purpose and description in the alfred dataset:

(1) Short summaries: Human-generated natural language one sentence summaries of the whole action sequence (called “goal descriptions” in the original dataset).

(2) Long summaries: High level narratives of the robotic agent’s actions, provided in the original dataset in the form of action plans in the structured Planning Domain Description Language (pddl) (McDermott et al., 1998). We convert the terms used in pddl to natural language: for example, “GotoLocation” becomes simply “go to” and some object names become two English words instead of one word (e.g. “coffeemachine” becomes “coffee machine”). We also break these long summaries up to form questions, as described in the next subsection.

(3) Natural language action description sentences: Natural language step by step descriptions of the actions taken in each episode, written by humans, which were used as instructions in the original dataset. These are used here to form some of the questions, as described in the next subsection. We do not use these to generate summaries because they are too detailed and contain somewhat idiosyncratic descriptions provided by human annotators. These characteristics, which make them inappropriate to serve as ground truth summaries, nevertheless make them good training examples of naturalistic, human-generated questions, which is why we use them to form the basis of some questions.

(4) Video, images, and visual features: Raw video of a task episode as well as a selection of still frames from the video chosen by the creators of the alfred dataset in such a way as to guarantee at least one still frame per low level action as defined in the original dataset. We use the pre-selected subset of still frames in the dataset, leaving the question of frame selection to future work.

Robot actions in the alfred dataset consist of discrete navigation and manipulation actions labeled ’low level’ actions (see Shridhar et al. (2020) for details); episodes have an average of 50 such actions. Because summarization and question answering involve higher level semantic descriptions, action descriptions in this work derive from two sources: pddl converted to natural language and human annotator descriptions which include actions. The former are a restricted set (go to, pick up, put, cool, heat, clean, toggle) while the latter are unrestricted and express actions in diverse ways.
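As an illustration of the pddl-to-English conversion described above, the following sketch shows the kind of mapping tables involved; the “GotoLocation” and “coffeemachine” entries come from the text, while the remaining operator and object names are assumptions rather than the exact identifiers in the dataset.

```python
# Hedged sketch of converting PDDL plan steps to natural language.
PDDL_ACTION_TO_ENGLISH = {
    "GotoLocation": "go to",        # example given in the text
    "PickupObject": "pick up",      # remaining operator names are assumed
    "PutObject": "put",
    "CoolObject": "cool",
    "HeatObject": "heat",
    "CleanObject": "clean",
    "ToggleObject": "toggle",
}

OBJECT_TO_ENGLISH = {
    "coffeemachine": "coffee machine",   # example given in the text
    "diningtable": "dining table",       # assumed additional entry
}

def pddl_step_to_english(operator: str, obj: str) -> str:
    """E.g. ('GotoLocation', 'coffeemachine') -> 'go to the coffee machine'."""
    verb = PDDL_ACTION_TO_ENGLISH.get(operator, operator.lower())
    noun = OBJECT_TO_ENGLISH.get(obj, obj)
    return f"{verb} the {noun}"
```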

Fig. 2 Sample partial selection of input frames from an episode in a seen environment originally from the alfred dataset (at the top), generated questions (on the left, in blue) and expected answers (on the right, in green), broken up into question type, along with the prompts for long and short summaries, at the bottom

2.3 Automatic generation of questions and answers

We develop a Q&A generation algorithm that produces questions and answers about episodes of robots interacting with an environment. After initial pre-processing, the algorithm can be used in a partly online fashion during training or as a one-time off-line dataset generation step which produces a set of static questions and answers. We train models in an online fashion and provide performance metrics from the static validation sets of questions and answers we release with this work.

In addition to the elements already present in the original dataset enumerated in the previous subsection, we use the ai2thor environment (Kolve et al., 2017) to rerun the agent trajectories for each episode in the dataset and capture information present while the agent is in the environment. At each time step after executing an action, the environment returns a ‘metadata’ Python dictionary with information about the last action taken, the agent’s current position and pose, and the objects present in the environment. Information about objects includes whether they are visible and within a specified distance of the agent (we use the default of 1.5 m). We use these two pieces of information to construct questions about whether objects were present in the environment. Though here we use one particular existing dataset and environment, our approach is general and can be applied to other datasets and environments with action descriptions in natural or structured language and available information about the environment.
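As a rough illustration, the sketch below shows how the set of object types the agent saw can be collected from the AI2-THOR metadata while replaying a trajectory; the metadata keys follow the AI2-THOR event format, but the controller setup and action replay are simplified assumptions rather than the exact pre-processing used here.

```python
# Hedged sketch: collect object types that were visible and nearby while replaying actions.
from ai2thor.controller import Controller

VISIBILITY_DISTANCE = 1.5   # meters; the environment's default distance used in this work

def seen_object_types(controller: Controller, actions):
    seen = set()
    for action in actions:                       # replay the episode's recorded actions
        event = controller.step(action=action["action"], **action.get("args", {}))
        for obj in event.metadata["objects"]:    # per-object metadata after each step
            if obj["visible"] and obj["distance"] <= VISIBILITY_DISTANCE:
                seen.add(obj["objectType"])
    return seen
```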

The algorithm produces nine types of questions in three broad categories (see Fig. 2 for examples of each type from the valid seen set and Appendix F for additional examples from the valid unseen set):

(1) Object questions about the presence of objects in the environment, both those the agent interacted with and those it only saw. There are two kinds of object question: “object yes/no” questions of the form, “was there an <object>?", which require only “yes” or “no” answers and “object either/or” questions of the form, “was there an <object A> or <object B>?” which require the model to output the name of the object present. Our algorithm uses the metadata of all objects visible in the environment to ensure that only one of the objects in an either/or question will have been seen during an episode. The algorithm samples objects with negative answers in proportion to their appearance in the dataset so that the model cannot, for example, learn to always answer, “no”, for seldom-seen objects. Questions with “yes” and “no” answers are presented with equal frequency.

(2) Action questions, which ask about actions the agent performed. The two types of question—“action yes/no” and “action either/or”—follow the structure of the respective object questions explained above. There are two subtypes of the “action yes/no” questions: “simple action yes/no” uses the relatively simple language converted from pddl for both the questions and answers. “Complex action yes/no” uses the raw human-generated description of each action step to pose the “yes/no” question. “Action either/or” questions present an either/or choice between two actions described in the simpler language of the converted pddl plans.

(3) Temporal questions about the order in which actions were performed, of two primary kinds. The first kind—“just before” questions—asks what action was performed immediately before a named action (“what did you do just before <action description>?") while the second—“just after” questions—asks what action was performed immediately following the named action (“what did you do just after <action description>?"). If an action occurs more than once in an episode it will not appear in a temporal question to avoid ambiguity.

Each of these types of temporal questions has two subtypes. The first is asked using the simpler description of actions from converted pddl while the second uses a human-generated action description sentence to formulate the question. Human-generated descriptions are longer, contain more diverse word choice, and sometimes mention irrelevant details. The answers to both question subtypes are in the simpler action description format. We suggest that this asymmetry, in which the model must understand both simply and more complexly worded questions but answers only in simpler language, is desirable: a robot agent should be able to understand questions phrased in a variety of ways, but it should not produce similarly varied answers, instead generating only simple, consistent language. A simplified sketch of how these question types are generated is shown below.
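The sketch below gives a simplified picture of how one question of each broad category can be generated; helper functions such as seen_objects, sample_unseen_object, actions_in_order, and sample_unperformed_action are hypothetical stand-ins, and the released code additionally handles the sampling ratios and edge cases omitted here.

```python
# Hedged sketch of question generation: one object question, one action question,
# and one temporal question. Each returns a (question, answer) pair, or None when
# no valid question can be formed.
import random

def object_either_or(ep):
    present = random.choice(sorted(seen_objects(ep)))    # an object the agent saw
    absent = sample_unseen_object(ep)                    # guaranteed absent from this episode
    first, second = random.sample([present, absent], 2)  # randomize the order of the choices
    return f"was there a {first} or a {second}?", present

def simple_action_yes_no(ep):
    if random.random() < 0.5:                            # "yes" and "no" equally frequent
        return f"did you {random.choice(actions_in_order(ep))}?", "yes"
    return f"did you {sample_unperformed_action(ep)}?", "no"

def just_before(ep):
    actions = actions_in_order(ep)
    i = random.randrange(1, len(actions))
    if actions.count(actions[i]) > 1:                    # skip repeated actions to avoid ambiguity
        return None
    return f"what did you do just before {actions[i]}?", actions[i - 1]
```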

In addition to these questions and answers, we also prompt the model to produce two kinds of summaries:

(1) Short summaries are the short one sentence descriptions of the action sequences written by human annotators as provided in the original dataset. We train the model to output a summary of a given episode with the text prompt, “summarize what you did."

(2) Long summaries, which are the longer narratives of actions converted from pddl to natural English. Although these are meaningfully longer than the one sentence summaries, they are significantly shorter than a step by step account of every low level action the virtual robot performed (e.g. move ahead, turn, look up, etc.). The model is trained to output a long summary of an episode with the prompt, “narrate what you did.”

2.3.1 Dataset of questions and answers

We will release both the code to generate the questions and answers as well as a static set of premade questions and answers aligned to episodes in the alfred dataset. The static dataset was generated to produce up to ten question instances per question type for each episode; in some cases there are fewer because a question type cannot always yield ten distinct instances for a given episode.

The entire static question and answer set contains 486,704 questions paired to episodes in the alfred dataset’s training set, 18,891 questions paired to its seen environments validation set, and 19,097 in its unseen environments validation set.

Table 1 Accuracy and precision scores for question and summary outputs by output type, including standard deviation.

2.4 Joint summarization and question answering model

We present a learned algorithm that takes as input ego-centric video frames of a virtual mobile robot along with a natural language question or summarization prompt and produces an answer or summary in response.

Our full neural network model (see the breakdown on the left in Fig. 1) combines several components. Video frames from each episode are fed as individual images collected into a batch into a frozen Resnet network (He et al., 2016) pretrained as part of the clip model (Radford et al., 2021). We extract the output of the last convolutional layer and feed it into a three layer convolutional network trained from scratch, which acts as a bridge network between the Resnet and the next step in the pipeline, a pretrained T5 transformer llm (Raffel et al., 2020) (“t5-base” in the Hugging Face library (Wolf et al., 2020)). The bridge network outputs one vector for each input image; these vectors are concatenated with the embedding of the tokenized question or summary prompt (produced using the T5 model’s pretrained embeddings) to form the input to the T5 model. The bridge network serves to translate the input from the clip latent space into one which can be processed by the T5. We find that fine tuning the entire T5 – rather than leaving either or both of the encoder or decoder frozen – leads to better results. While the T5 model was pretrained only on language data, we use it for simultaneous language and visual input, following other work which has shown the ability of language model transformers to process multimodal data (Lu et al., 2022; Tsimpoukelli et al., 2021). The latent space of inputs which the T5 expects is likely also modified during this fine tuning, so the adaptation of the T5 to process multimodal input can be seen as a result of both the bridge network and the fine tuning process.
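A rough sketch of this pipeline is given below, assuming PyTorch and the Hugging Face transformers library; the bridge layer widths, kernel sizes, and pooling choice are illustrative assumptions rather than the exact configuration, and the frozen clip Resnet feature extraction is treated as an upstream step.

```python
# Hedged architectural sketch: frozen CLIP-ResNet features -> 3-layer conv bridge
# (trained from scratch) -> one vector per frame -> concatenated with embedded
# prompt tokens -> fine-tuned T5.
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration

class VisionToT5Bridge(nn.Module):
    def __init__(self, in_channels=2048, d_model=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 1024, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(1024, d_model, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d(1),              # collapse spatial dims: one vector per frame
        )

    def forward(self, frame_feats):               # (num_frames, C, H, W) CLIP-ResNet features
        return self.net(frame_feats).flatten(1)   # (num_frames, d_model)

class SummarizerQA(nn.Module):
    def __init__(self, t5_name="t5-base"):
        super().__init__()
        self.t5 = T5ForConditionalGeneration.from_pretrained(t5_name)   # fine-tuned end to end
        self.bridge = VisionToT5Bridge(d_model=self.t5.config.d_model)  # trained from scratch

    def forward(self, frame_feats, prompt_ids, labels=None):
        vis = self.bridge(frame_feats).unsqueeze(0)      # (1, num_frames, d_model)
        txt = self.t5.shared(prompt_ids)                 # pretrained T5 token embeddings
        inputs = torch.cat([vis, txt], dim=1)            # visual tokens followed by prompt tokens
        return self.t5(inputs_embeds=inputs, labels=labels)
```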

As the T5 is an encoder-decoder model it is able to generate encoded representations of the images conditioned on the given question or prompt. We train a single model to answer all questions and produce long and short summaries so that it must learn to generate representations useful for all of these tasks. During an epoch of training we iterate through each episode in random order. For each episode, the model must produce long and short summaries and answer one question of each of the nine question types (when such a question exists for that episode).
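A minimal sketch of one training epoch under this regime follows; sample_question and model_step are hypothetical helpers, and the loss and optimizer handling are simplified for illustration.

```python
# Hedged sketch of one epoch: episodes in random order; per episode, both summary
# prompts plus one question of each of the nine types when one exists.
import random

QUESTION_TYPES = [
    "object_yes_no", "object_either_or",
    "simple_action_yes_no", "complex_action_yes_no", "action_either_or",
    "just_before_simple", "just_before_complex",
    "just_after_simple", "just_after_complex",
]

def train_epoch(model, episodes, optimizer):
    random.shuffle(episodes)
    for ep in episodes:
        targets = [("summarize what you did.", ep.short_summary),
                   ("narrate what you did.", ep.long_summary)]
        for qtype in QUESTION_TYPES:
            qa = sample_question(ep, qtype)        # None if this type has no valid question here
            if qa is not None:
                targets.append(qa)
        for prompt, target in targets:
            loss = model_step(model, ep.frames, prompt, target)   # forward pass + loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```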

2.5 Zero-shot summarization after question answering

We are interested in the possible interaction between question answering and summarization abilities within the model, in particular if representations of objects transfer between these tasks. We therefore alter the training regime to leave some objects out of the summarization training set and measure whether the model is still able to produce accurate summaries about interactions with the objects. In these experiments, we first randomly select a set of five objects from among the most common thirty objects in the dataset (excluding the top ten). We then identify all episodes whose long summaries contain those objects (i.e. any episode in which the virtual robot interacts with those objects) and set them aside as a ‘held-out’ set. The model is then trained on questions and answers involving all episodes, including the held-out episodes, but is not trained to produce either long or short summaries of the held-out episodes. We then test its ability to summarize these held out episodes.
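The following sketch illustrates the held-out split used in these experiments: five objects drawn from the 11th through 30th most common objects, with every episode whose long summary mentions any of them set aside. The episode fields and function names are assumptions made for illustration, not the released code.

```python
# Hedged sketch of the held-out protocol for the zero-shot experiments.
import random
from collections import Counter

def make_heldout_split(episodes, num_held_out=5, seed=0):
    counts = Counter(obj for ep in episodes for obj in ep.summary_objects)
    common = [obj for obj, _ in counts.most_common(30)][10:]   # exclude the top ten
    random.Random(seed).shuffle(common)
    held_out_objects = set(common[:num_held_out])

    held_out_eps, summary_train_eps = [], []
    for ep in episodes:
        # Q&A training still uses every episode; only summarization training skips these.
        if held_out_objects & set(ep.summary_objects):
            held_out_eps.append(ep)
        else:
            summary_train_eps.append(ep)
    return held_out_objects, summary_train_eps, held_out_eps
```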

3 Results

3.1 Summarization and question answering

We find that our model performs very well on both short and long summarization tasks and on the questions from our Q&A generation algorithm. Table 1 presents results for all question and summarization types. An answer is considered accurate if it completely matches the target answer. bleu (Papineni et al., 2002) and rouge (Lin, 2004) scores are also given for the two summary types. The bleu score is a measure of how well the generated text matches the ground truth text, penalizing words and phrases which are not present in the ground truth, while rouge measures how much of the ground truth text is present in the generated text, penalizing words and phrases which are missing from the generated text. Unigram precision scores measure the percentage of generated words which are in the ground truth text and are given for question answering tasks which require more than one word as an answer. As the short summaries are more lexically diverse, binary accuracy measures are less appropriate, so only bleu and rouge scores are given for the short summaries.
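For concreteness, a minimal sketch of the two simplest measures follows: exact-match accuracy and unigram precision. The normalization (whitespace and case handling) is an assumption; bleu and rouge are computed with standard implementations and are not reproduced here.

```python
# Hedged sketch of exact-match accuracy and unigram precision.
def exact_match(pred: str, target: str) -> bool:
    """All-or-nothing accuracy: the generated answer must match the target completely."""
    return pred.strip().lower() == target.strip().lower()

def unigram_precision(pred: str, target: str) -> float:
    """Fraction of generated words that appear in the ground truth text."""
    pred_words = pred.lower().split()
    target_words = set(target.lower().split())
    if not pred_words:
        return 0.0
    return sum(w in target_words for w in pred_words) / len(pred_words)
```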

Table 2 Overlap of missing objects between questions and long summaries by question type, averaged over three models tested on the static held out valid unseen set.

A few patterns in the results can be seen. First, performance generally varies with how much text must be generated in an answer. Longer answers provide more opportunities for errors, so performance as measured by the strict metric of complete accuracy tends to be worse. This is particularly true for the prompt which asks for a long summary of the agent’s actions, which has the worst results according to the all-or-nothing accuracy metric.

Second, “either/or” questions have better accuracy than their corresponding “yes/no” questions. This could be because asking if, for example, an action was performed is made easier when it is a choice between two actions so that any uncertainty the model has about one of the actions may be offset by its certainty about the other option. It is also possible that the model has a harder time connecting the meaning of the “yes/no” answers back to the input, particularly since most of the questions require outputting an object or action name, not just a “yes/no”.

Third, it might be expected that questions about the order in which actions took place would be significantly more difficult for the model to interpret than those about the mere occurrence of those actions. Surprisingly, then, we find that in most cases the model’s performance on temporal questions is very similar to that on the other questions.

The model tends to make two kinds of errors when generating anything other than “yes/no” answers. It sometimes misidentifies objects, especially small ones, and particularly in the unseen environments. It also sometimes uses a different description for a location than the ground truth annotation, in some cases doing so in a way that is nevertheless consistent with the action as seen in the episode. For example, the ground truth annotation may read, “go to the apple” while the model outputs, “go to the counter” when the apple is on the counter. See Fig. 3 for examples of errors in short and long summaries generated by the model.

The errors made by the model display some consistency between the different questions asked and between the questions and summaries. For example, in one episode of the validation seen set which involves moving a book, it consistently mistakes the book for a pen, answering a “just before” question with, “put the pen on the desk," producing a short summary, “put two pens on the right side of the desk,” and beginning the long summary with, “go to the side table, pick up the pen...” There is a marked difference in the consistency of these errors depending on question type, however, as we show in Table 2. We measure this consistency by computing, for each question type, the fraction of objects omitted from the model’s answers that are also missing from the corresponding long summaries of the same episodes. This fraction is compared across question types. We find that questions which require generating both an action and an object together have the highest degree of overlap in which objects they fail to identify and which are also missing in the long summaries; the temporal “just before / just after” answers in particular show high consistency with the long summaries. We hypothesize that the representations which the model uses for summarization align better with those it uses for the question types where there is higher overlap of missing words.
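The sketch below illustrates this consistency measure; the per-episode fields (gold_objects, answer_objects, long_summary_objects) are hypothetical structures introduced only for illustration.

```python
# Hedged sketch of the overlap measure: of the objects missing from the answers to one
# question type, what fraction is also missing from that episode's long summary?
def missing_object_overlap(episodes, question_type):
    overlaps = []
    for ep in episodes:
        gold = set(ep.gold_objects[question_type])            # objects that should be named
        missing_in_answers = gold - set(ep.answer_objects[question_type])
        if not missing_in_answers:
            continue                                          # no omission errors to compare
        missing_in_summary = gold - set(ep.long_summary_objects)
        shared = missing_in_answers & missing_in_summary
        overlaps.append(len(shared) / len(missing_in_answers))
    return sum(overlaps) / len(overlaps) if overlaps else 0.0
```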

3.2 Zero-shot summarization via question answering

Can question answering improve the ability to summarize? We find that when the model is trained to answer questions about episodes involving all objects, it is then able to go on to summarize episodes with objects which it has not been trained to include in summaries. Table 3 displays a breakdown of zero-shot performance on long summaries. For comparison, results when nothing is held out—the standard case detailed in Table 1—and for a model not trained to answer questions on the held out set are included. These comparisons show that while zero-shot summarization is not as accurate as fully supervised summarization, training on the auxiliary question-answering task is significantly better than not doing so. A model not trained to answer questions on episodes with held out objects is unable to correctly summarize episodes involving those held out objects. It is simply not able to output any of the held out objects’ names without having at least seen them during question answering. Training the model to learn to answer questions about the objects through an auxiliary question-answering task leads to clear improvement on the summarization task.

This result suggests that the model is learning representations of objects, or actions involving objects, while learning to answer questions which it can then use when producing summaries. There must be at least some transfer of representational knowledge between the question answering and the summarization tasks within the model.

Table 3 Accuracy of zero-shot long summarization when transferring representations learned from question answering to producing long summaries, broken down by question type used to learn the objects held out from summarization training.
Fig. 3 Example errors in generated long and short summaries. Errors in the long summaries are indicated with strikethrough text (with the correct text following in italics and parentheses). Generated short summaries appear to the left of the correct summaries, which are in italics

Clear improvement with transfer compared to without transfer is also demonstrated in bleu and rouge scores of both short and long summaries in seen and unseen environments (in only one case is there not improvement); see Table 5 in Appendix C for details.

3.2.1 Impact of question type on zero-shot transfer to summarization

We have seen that transfer from question answering to summarization occurs. But which questions are most important or useful for transfer? In order to further investigate the sharing of representations between question answering and summarization, we rerun the experiments using the same held out protocol, but using focused sets of particular question types. Testing each question type separately allows us to measure whether all questions are equally useful for promoting transfer to summarization.

Interestingly, we find that not all questions are equally useful: only the temporal “just before” and “just after” questions—which ask what action was performed just before or after a given action—exhibit transfer between tasks (see Table 3 for accuracy metrics on temporal and non-temporal questions). This is true of both subtypes of these questions, i.e. both the simple and complex language versions. On their own, the “yes/no” and “either/or” questions about objects or actions do not lead to the same zero-shot summarization ability. It is worth recalling here that the answers to the temporal questions were also found to be especially consistent with the long summaries in the missing object errors they contained, which would also suggest a particularly aligned representational space between these tasks (see Table 2).

We also tested the transfer ability of a model trained in a similar manner but which excluded episodes based on the action verbs they contained rather than the objects. For these experiments, only one action verb at a time and the episodes which contained it were identified as held out items. In none of these cases was the model able to transfer the use of the verb to summaries of the held out episodes. This could be because the dataset contains far fewer distinct action verbs than objects.

4 Related Work

RoboNLP Taniguchi et al. (2019) and Tellex et al. (2020) offer thorough reviews of language use in the context of robotics. Detailed descriptions of actions such as robots playing soccer (Mooney, 2008) or automated driving (Barrett et al., 2015, 2017) have been generated. These have not involved learning how to report and condense a series of actions into anything like a summary, however. DeChant and Bauer (2021) propose robot action summarization as a research direction, suggesting a set of tasks to pursue.

Instruction Following Our proposal is closely related to learning to follow natural language instructions, which has long generated a great deal of interest at the intersection of robotics and natural language processing (Winograd, 1972; Dzifcak et al., 2009). Shridhar et al. (2021a) train a robotic arm in a virtual environment to perform a range of tasks following natural language instructions and transfer the learned model to a real world robot. Mees et al. (2021) introduce a benchmark for long horizon robotic manipulation tasks following natural language instructions.

Rich simulated environments for language-guided navigation tasks have been introduced in recent years. Anderson et al. (2018) introduced the Room to Room vision and language navigation dataset, which became the basis for much work in this area. Some of that work has involved learning to generate natural language descriptions of navigation trajectories as a training signal or tool: Nguyen et al. (2021) provide feedback to an agent in the Room to Room environment by describing in natural language the paths the agent actually takes so it can learn to compare them to the path it should have taken; Fried et al. (2018) learn to generate instructions to augment training data and then, at test time, to evaluate the similarity of routes the agent might take to the description of the desired route.

The alfred dataset (Shridhar et al., 2020) we repurpose has inspired a great deal of work on its natural language instruction following challenge. Shridhar et al. (2021b) improve an agent’s ability to perform tasks in the virtual environment by first training it to act in the interactive, text-only TextWorld environment (Côté et al., 2018) in similar situations which are described there only in text. Pashevich et al. (2021) leverage the presence of the high level pddl plans to produce better representations of the natural language instructions by also training those representations to generate pddl plans from the instructions.

Q&A in robotics Learning to ask questions has also been studied as a way for a robotic agent to ask for help or clarification while performing a task (Tellex et al., 2014; Thomason et al., 2019). Yoshino et al. (2021) use natural language questions to clarify aspects of how a simple action was performed in response to a question. Datta et al. (2022) introduce a form of question answering where the questions are in natural language but the answers take the form of visual highlights of a map to indicate locations. Carta et al. (2022) propose filling in the blanks within structured language instructions as an auxiliary task for reinforcement learning agents in a 2-D grid world. Gao et al. (2021) introduce a similar Q&A task in a virtual environment, though without summarization; a slightly different embodied Q&A task, requiring an agent to seek out answers to questions, is proposed by Gordon et al. (2018).

Summarization There is an extensive body of work on natural language summarization, providing examples and resources for the new but related task of robot action summarization (see Nenkova and McKeown (2012) and Gambhir and Gupta (2017) for reviews). There are two main kinds of summarization. In extractive summarization, the summaries are selected from text already present in a source document. In abstractive summarization, by contrast, new text is generated as the summary, allowing for a higher level of description. Recurrent sequence to sequence models (Rush et al., 2015; Gupta & Gupta, 2019) as well as Transformer (Vaswani et al., 2017) models have been used to perform abstractive summarization (Lewis et al., 2019; Raffel et al., 2020).

Video understanding Work on understanding video is relevant to our work since we are interested in using video or selected images from video as an input to summarizing a robot’s action in natural language. The task of ‘video summarization’ in the computer vision community refers to selecting important frames of a video that can, together, serve as a visual summary of the whole video; see Apostolidis et al. (2021) for a review of such techniques. Some work has been done on multimodal summarization from video and text transcripts to natural language summaries; Palaskar et al. (2019) is one example, going from video and text in the How2 video dataset (Sanabria et al., 2018) to summaries. Bärmann and Waibel (2022) assemble a large question answering dataset for real world video of humans performing actions, requiring significant effort to annotate.

Natural language question answering is also used for video understanding. Originally stemming from similar work in visual question answering (vqa) of natural language questions on still images (Antol et al., 2015), many video Q&A works address factual questions about the presence of objects or particular actions in video clips (Fan, 2019; Castro et al., 2022). These questions are similar to the object and action questions in our work. More recently, video question answering work has focused on more complex questions, including questions about the order of actions which are similar to our temporal questions (Xiao et al., 2021; Grunde-McLaughlin et al., 2021). Work has also been done to answer causal and related questions (e.g. "why did X happen?") which we do not address here and leave for future work (Wu et al., 2021; Li et al., 2022). Video question answering has also been done with multimodal input which incorporates both video and at least one other modality such as text captions or an audio track (Choi et al., 2021; Yang et al., 2022). While our work does not incorporate such multimodal sources, future robot action summarization could do so, particularly for robots that have natural language interaction with humans in the course of their operation. Some video question answering datasets contain questions which are automatically formed from natural language descriptions of video sequences (Zeng et al., 2017; Zhao et al., 2017). Our automatic question generation method is similar but also incorporates ground truth information about the environment, which is accessible because the episodes take place in simulation. Pretrained language models have been incorporated in models used to address video question answering (Zellers et al., 2021). See a recent survey by Zhong et al. (2022) for additional background on question answering for video understanding.

Grounding language It has been recognized for some time that grounding language to the real world is essential for creating ai systems that actually understand the language they process (Harnad, 1990). Recent proposals on the need to situate natural language processing in a grounded or embodied context have brought renewed attention to this issue (Bisk et al., 2020; Chandu et al., 2021; McClelland et al., 2020; Lake & Murphy, 2021). Though these did not discuss robots summarizing their actions, our work is a contribution to this direction of research.

5 Conclusion

We develop a model that can be jointly trained to summarize and answer questions about a virtual robot’s past actions. We find that the model learns a representation space which is shared across at least some of the question types and summaries, leading to zero-shot summarization abilities.

This work helps begin a line of research on robot action summarization and question answering. It is important that robots operating in the real world be well supervised by humans and that their actions be understandable. We suggest that establishing a basic narrative of what an agent does is in some ways a prerequisite to further understanding why it does something. Once answering questions about and summarizing robot actions can be performed reliably, we expect these capabilities to be useful in a variety of ways, including in training robots. Learning representations for these tasks can serve as a form of pretraining for downstream robotic applications. New techniques for lifelong learning might enable robots to receive and learn from feedback to the summaries they generate. Our approach of making use of an existing instruction following dataset naturally allows for this application and is something we will pursue in future work on this and other datasets.

Though this work took place in simulation, the summarization and question answering tasks are not specific to aspects of this or any simulated environment. Future work will explore the application of these tasks to real world robots.