Learning to summarize and answer questions about a virtual robot’s past actions

When robots perform long action sequences, users will want to easily and reliably find out what they have done. We therefore demonstrate the task of learning to summarize and answer questions about a robot agent’s past actions using natural language alone. A single system with a large language model at its core is trained to both summarize and answer questions about action sequences given ego-centric video frames of a virtual robot and a question prompt. To enable training of question answering, we develop a method to automatically generate English-language questions and answers about objects, actions, and the temporal order in which actions occurred during episodes of robot action in the virtual environment. Training one model to both summarize and answer questions enables zero-shot transfer of representations of objects learned through question answering to improved action summarization.


Introduction
As robots become more capable and are entrusted with more tasks, it will be increasingly important to reliably keep track of what they do. However, robots will routinely perform roles that make direct supervision of them difficult or impossible. A robot may, for example, be used to move many loads of construction material from place to place or perform household chores. In both cases, real time human oversight would be impractical. It will therefore be necessary to develop methods to monitor and record the actions of such agents and provide that information at a later time to a human. One way to do that is to develop the capability for robots to summarize and answer questions about their actions in natural language.
Summaries, rather than complete records, will be particularly useful as action sequences become longer. They will also be challenging to produce because it will be necessary to identify the most important actions and, very often, to describe those actions using higher level abstract terms. Summaries may not fully address everything that a user wants to know about a robot's actions, so a user may want to ask questions about what a robot did or saw during a particular action sequence. Providing summaries and answering questions are therefore complementary skills which we would want robotic agents to possess.
Fig. 1 Visual presentation of model and method for producing zero-shot summaries involving novel objects.
Step 1 illustrates the full model: input (at the left) includes video frames as well as episode metadata describing the environment as the agent saw it. The components of the model in black (CLIP ResNet and the word embeddings) are pretrained and remain frozen during our training process, while the light blue module (the vision-to-T5 bridge network) is trained from scratch. The dark blue module, a pretrained T5, which outputs the final question answer or summary, is fine-tuned during training.
Step 2 demonstrates zero-shot summarization using a previously trained model which was not trained to summarize episodes with some of the objects in the newly presented episode.
We demonstrate a model that learns such summarization and question answering skills by making use of the capabilities of a large language model (LLM). Although much work has gone into training robot agents to follow natural language instructions, little work has addressed reporting back what a robot has done, which might be seen as the flip side of instruction following. Fortunately, existing datasets designed for instruction following tasks can be repurposed and augmented to serve as a training ground for robot action summarization and question answering. We make use of and augment the popular ALFRED dataset (Shridhar et al, 2020), which provides ego-centric video frames of episodes of robot action sequences in a virtual environment.
Our main contributions are:
1) Summarization of actions. We demonstrate summarization of robotic actions in both short and long summaries from video frames in a multimodal model that incorporates vision and fine-tunes a pretrained T5 LLM (Raffel et al, 2020).
2) Answering questions about actions. The same model is jointly trained to answer questions about robotic actions, including questions about actions performed, objects seen, and the order in which actions were performed.
3) Zero-shot transfer from question answering to summarization. We show that an LLM-based system trained to answer questions about held-out objects can faithfully produce summaries about those objects in a zero-shot manner, even though the objects are not in the summarization task training set. This demonstrates the transfer of representational knowledge from the question answering tasks to the summarization tasks.
4) Automatic generation of questions and answers. We develop a method to automatically generate questions and answers using an existing dataset and its associated virtual environment and release a dataset of such questions and answers.

Method
Our objective is to generate a summary or question answer in natural language $a \in L$ of a long horizon robotic task, given the history of observations $o \in O$ that the robot experienced during the task and a question or summarization prompt $q$. We define the robot experience/trajectory as $\tau = \{(o_0, \dots)\}$. We seek to learn a function $F_\theta$ such that $a = F_\theta(\tau, q)$.

Repurposed dataset
Our approach requires egocentric video or video frames, a description of an agent's actions during an episode, and information about the environment the agent operates in, particularly the locations of objects it encounters. For the purposes of the current investigation we use episodes from the ALFRED dataset. An episode of robot state-action trajectory in the original dataset has four different kinds of representation which we make use of, either as-is or after transforming them. The following list of dataset elements lays out how they are used in this work and notes their original purpose and description in the ALFRED dataset:
1) Short summaries: Human-generated natural language one sentence summaries of the whole action sequence (called "goal descriptions" in the original dataset).
2) Long summaries: High level narratives of the robotic agent's actions, provided in the original dataset in the form of action plans in the structured Planning Domain Description Language (PDDL; McDermott et al, 1998). We convert the terms used in PDDL to natural language: for example, "GotoLocation" becomes simply "go to" and some object names become two English words instead of one word (e.g. "coffeemachine" becomes "coffee machine"). We also break these long summaries up to form questions, as described in the next subsection.
3) Natural language action description sentences: Natural language step by step descriptions of the actions taken in each episode, written by humans, which were used as instructions in the original dataset. These are used here to form some of the questions, as described in the next section.
4) Video, images, and visual features: Raw video of a task episode as well as still frames from the video. We use a pre-selected subset of frames in the original dataset, leaving the question of frame selection to future work.
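The conversion of PDDL terms described in item 2 can be illustrated with a small lookup-based sketch. This is not the conversion code used in this work: only the "GotoLocation" and "coffeemachine" mappings come from the text, while the other table entries, the function names, the "the" article, and the comma-joined output format are assumptions made for illustration.

```python
import re

# Hypothetical lookup tables. Only "GotoLocation" -> "go to" and
# "coffeemachine" -> "coffee machine" are taken from the text; the
# remaining entries are illustrative assumptions.
ACTION_PHRASES = {"GotoLocation": "go to", "PickupObject": "pick up"}
OBJECT_PHRASES = {"coffeemachine": "coffee machine", "sidetable": "side table"}


def split_camel_case(term: str) -> str:
    """Fallback for unlisted terms: 'PutObject' -> 'put object'."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", term).lower()


def step_to_text(action: str, obj: str) -> str:
    """Render one (action, object) plan step as a short English phrase."""
    verb = ACTION_PHRASES.get(action, split_camel_case(action))
    noun = OBJECT_PHRASES.get(obj.lower(), split_camel_case(obj))
    return f"{verb} the {noun}"


def plan_to_long_summary(plan) -> str:
    """Join the rendered steps of a plan into one long summary string."""
    return ", ".join(step_to_text(action, obj) for action, obj in plan)


if __name__ == "__main__":
    plan = [("GotoLocation", "sidetable"), ("PickupObject", "Pen")]
    print(plan_to_long_summary(plan))
```

A table-driven mapping keeps the vocabulary of the long summaries small and consistent, which matters later because exact-match accuracy is computed against these converted strings.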

Automatic generation of questions and answers
We develop a Q&A generation algorithm that produces questions and answers about episodes of robots interacting with an environment. After initial pre-processing, the algorithm can be used in a partly online fashion during training or as a one-time off-line dataset generation step which produces a set of static questions and answers. We train models in an online fashion and provide performance metrics from the static validation sets of questions and answers we release with this work.
In addition to the elements already present in the original dataset enumerated in the previous subsection, we use the AI2-THOR environment (Kolve et al, 2017) to rerun the agent trajectories for each episode in the dataset and capture metadata present while the agent is in the environment. This metadata is used to generate questions and includes information about objects encountered in the virtual environment and the order in which the robot sees and interacts with them. Though here we use one particular existing dataset and environment, our approach is general and can be used in other cases where similar data can be captured.
The algorithm produces nine types of questions in three broad categories (see Figure 2 for examples of each type):
(1) Object questions about the presence of objects in the environment, both those the agent interacted with and those it only saw. There are two kinds of object question: "object yes/no" questions of the form, "was there an <object>?", which require only "yes" or "no" answers, and "object either/or" questions of the form, "was there an <object A> or <object B>?", which require the model to output the name of the object present. Our algorithm uses the metadata of all objects visible in the environment to ensure that only one of the objects in an either/or question will have been seen during an episode. The algorithm samples objects with negative answers in proportion to their appearance in the dataset so that the model cannot, for example, learn to always answer, "no", for seldom-seen objects. Questions with "yes" and "no" answers are presented with equal frequency.
(2) Action questions, which ask about actions the agent performed. The two types of question, "action yes/no" and "action either/or", follow the structure of the respective object questions explained above. There are two subtypes of the "action yes/no" questions: "simple action yes/no" uses the relatively simple language converted from PDDL for both the questions and answers. "Complex action yes/no" uses the raw human-generated description of each action step to pose the "yes/no" question. "Action either/or" questions present an either/or choice between two actions described in the simpler language of the converted PDDL plans.
Fig. 2 Sample questions (on the left, in blue) and expected answers (on the right, in green), broken up into question type, along with the prompts for long and short summaries, at the bottom.
(3) Temporal questions about the order in which actions were performed, of two primary kinds. The first kind, "just before" questions, asks what action was performed immediately before a named action ("what did you do just before <action description>?"), while the second, "just after" questions, asks what action was performed immediately following the named action ("what did you do just after <action description>?"). If an action occurs more than once in an episode it will not appear in a temporal question, to avoid ambiguity.
Each of these types of temporal questions has two subtypes. The first is asked using the simpler description of actions from converted PDDL (e.g. the first temporal question in Figure 2), while the second uses a human-generated action description sentence to formulate the question (e.g. the second temporal question in Figure 2). These descriptions are longer, contain more diverse word choice, and sometimes mention irrelevant details. The answers to both question subtypes are in the simpler action description format. We suggest that this asymmetry, in which the model answers both simple and more complexly worded questions but always responds in simpler language, is desirable: while a robot agent should be able to understand questions phrased in a variety of ways, for the sake of clarity such an agent should not produce similarly varied answers, but instead generate only simple, consistent language.
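The object and temporal question rules described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the released generator: the function names, the data layout (a set of seen objects, a dataset-wide count per object, and an ordered list of action phrases), and the article handling are hypothetical. The temporal rule here only excludes repeated actions as the named action; whether they are also excluded as answers is not specified in the text.

```python
import random
from collections import Counter


def with_article(noun):
    """Prefix a noun with 'a' or 'an' (simple vowel heuristic)."""
    return ("an " if noun[0] in "aeiou" else "a ") + noun


def object_yes_no(seen, object_counts, rng):
    """One 'object yes/no' question; 'yes' and 'no' are equally likely.

    Assumes at least one seen object and one unseen object exist.
    """
    if rng.random() < 0.5:
        obj, answer = rng.choice(sorted(seen)), "yes"
    else:
        # Negative objects are sampled in proportion to how often they
        # appear in the dataset, so rare objects are not always "no".
        absent = sorted(o for o in object_counts if o not in seen)
        weights = [object_counts[o] for o in absent]
        obj, answer = rng.choices(absent, weights=weights, k=1)[0], "no"
    return f"was there {with_article(obj)}?", answer


def object_either_or(seen, all_objects, rng):
    """One 'object either/or' question: exactly one option was seen."""
    present = rng.choice(sorted(seen))
    absent = rng.choice(sorted(set(all_objects) - set(seen)))
    first, second = rng.sample([present, absent], 2)  # random order
    question = f"was there {with_article(first)} or {with_article(second)}?"
    return question, present


def temporal_questions(actions):
    """'Just before' / 'just after' questions for an ordered action list."""
    counts = Counter(actions)
    pairs = []
    for i, action in enumerate(actions):
        if counts[action] > 1:
            continue  # repeated actions are ambiguous, so never name them
        if i > 0:
            pairs.append((f"what did you do just before {action}?", actions[i - 1]))
        if i + 1 < len(actions):
            pairs.append((f"what did you do just after {action}?", actions[i + 1]))
    return pairs


if __name__ == "__main__":
    rng = random.Random(0)
    counts = {"apple": 5, "mug": 3, "pen": 2, "book": 1}
    print(object_yes_no({"apple", "mug"}, counts, rng))
    print(temporal_questions(["go to the desk", "pick up the pen", "go to the desk"]))
```

Passing an explicit `random.Random` instance keeps generation reproducible, which is convenient when the same generator is used both online during training and to produce a static validation set.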
In addition to these questions and answers, we also prompt the model to produce two kinds of summaries:
(1) Short summaries are the short one sentence descriptions of the action sequences written by human annotators as provided in the original dataset. We train the model to output a summary of a given episode with the text prompt, "summarize what you did."
(2) Long summaries are the longer narratives of actions converted from PDDL to natural English. Although these are meaningfully longer than the one sentence summaries, they are significantly shorter than a step by step account of every low level action the virtual robot performed (e.g. move ahead, turn, look up, etc.). The model is trained to output a long summary of an episode with the prompt, "narrate what you did."

Dataset of questions and answers
We will release both the code to generate the questions and answers and a static set of premade questions and answers aligned to episodes in the ALFRED dataset. The static dataset was generated to produce up to ten question tokens per question type for each episode; in some cases there are fewer than ten such question tokens per episode because not all question types can produce ten question tokens for a given episode.
The entire static question and answer set contains 486,704 questions paired to episodes in the ALFRED dataset's training set, 18,891 questions paired to its seen environments validation set, and 19,097 in its unseen environments validation set.

Joint summarization and question answering model
We present a learned algorithm that takes as input ego-centric video frames of a virtual mobile robot along with a natural language question or summarization prompt and produces an answer or summary in response.
Our full neural network model (see the breakdown on the left in Figure 1) combines several components. Video frames are fed into a frozen ResNet network (He et al, 2016) pretrained as part of the CLIP model (Radford et al, 2021). We extract the output of the last convolutional layer and feed it into a three layer convolutional network trained from scratch, which acts as a bridge network between the ResNet and the next step in the pipeline: a pretrained T5 transformer LLM (Raffel et al, 2020) ("t5-base" in the Hugging Face library (Wolf et al, 2020)), which we fine-tune. While the T5 model was pretrained only on language data, we use it for simultaneous language and visual input, following other work which has shown the ability of language model transformers to process multimodal data (Lu et al, 2022; Tsimpoukelli et al, 2021).
The tokens of the natural language questions and summarization prompts are embedded using the T5 model's pretrained embeddings. We concatenate these text embeddings with the image vector representations yielded by the bridge network. As T5 is an encoder-decoder model, it is able to generate encoded representations of the images conditioned on the given question or prompt. We train a single model to answer all questions and produce long and short summaries so that it must learn to generate representations useful for all of these tasks. During an epoch of training we iterate through each episode in random order. For each episode, the model must produce long and short summaries and answer one question of each of the nine question types (when such a question exists for that episode).
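The per-epoch schedule described above (every episode in random order, two summaries plus one question per available question type) can be sketched as below. The episode dictionary layout and the function names are assumptions for illustration; the two prompt strings are those given earlier in the text, and the model update itself is left out.

```python
import random

# Prompts quoted from the text above.
SHORT_PROMPT = "summarize what you did."
LONG_PROMPT = "narrate what you did."


def episode_tasks(episode, rng):
    """Return the (prompt, target) pairs trained on for one episode.

    `episode` is a hypothetical dict with keys "short_summary",
    "long_summary", and "questions" (question type -> list of
    (question, answer) pairs, possibly empty).
    """
    tasks = [
        (SHORT_PROMPT, episode["short_summary"]),
        (LONG_PROMPT, episode["long_summary"]),
    ]
    for question_type in sorted(episode["questions"]):
        pairs = episode["questions"][question_type]
        if pairs:  # skip question types with no question for this episode
            tasks.append(rng.choice(pairs))
    return tasks


def epoch_schedule(episodes, seed=0):
    """Yield (episode_id, prompt, target) over one epoch in random order."""
    rng = random.Random(seed)
    order = list(episodes)
    rng.shuffle(order)
    for episode in order:
        for prompt, target in episode_tasks(episode, rng):
            yield episode["id"], prompt, target


if __name__ == "__main__":
    demo = {
        "id": "ep0",
        "short_summary": "put a pen on the desk",
        "long_summary": "go to the side table, pick up the pen",
        "questions": {"object yes/no": [("was there a pen?", "yes")], "just before": []},
    }
    for item in epoch_schedule([demo]):
        print(item)
```

Because every task for an episode shares the same visual input, the single model is pushed toward visual representations that serve both question answering and summarization.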

Zero-shot summarization after question answering
We are interested in the possible interaction between question answering and summarization abilities within the model, in particular whether representations of objects transfer between these tasks. We therefore alter the training regime to leave some objects out of the summarization training set and measure whether the model is still able to produce accurate summaries about interactions with the objects. In these experiments, we first randomly select a set of five objects from among the most common thirty objects in the dataset (excluding the top ten). We then identify all episodes whose long summaries contain those objects (i.e. any episode in which the virtual robot interacts with those objects) and set them aside as a 'held-out' set. The model is then trained on questions and answers involving all episodes, including the held-out episodes, but is not trained to produce either long or short summaries of the held-out episodes.
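The held-out protocol above amounts to a per-episode training mask, sketched below. The input format (a mapping from episode id to the set of objects named in its long summary) and the function names are assumptions; frequency is counted here as the number of episodes whose long summary mentions an object, which is one possible reading of "most common".

```python
import random
from collections import Counter


def choose_held_out_objects(summary_objects, rng, k=5, skip_top=10, pool_size=30):
    """Pick k objects from frequency ranks skip_top+1 .. pool_size."""
    counts = Counter(obj for objs in summary_objects.values() for obj in objs)
    ranked = [obj for obj, _ in counts.most_common(pool_size)]
    candidates = ranked[skip_top:]  # drop the ten most common objects
    return set(rng.sample(candidates, k))


def training_mask(summary_objects, held_out):
    """Decide, per episode, which tasks the model is trained on.

    Questions are kept for every episode; summaries are dropped for any
    episode whose long summary mentions a held-out object.
    """
    mask = {}
    for episode_id, objs in summary_objects.items():
        is_held_out = bool(set(objs) & held_out)
        mask[episode_id] = {"questions": True, "summaries": not is_held_out}
    return mask


if __name__ == "__main__":
    # Toy data: object i appears in (40 - i) single-object episodes.
    episodes = {f"e{i}_{j}": {f"obj{i}"} for i in range(40) for j in range(40 - i)}
    held_out = choose_held_out_objects(episodes, random.Random(0))
    mask = training_mask(episodes, held_out)
    print(sorted(held_out), sum(not m["summaries"] for m in mask.values()))
```

At evaluation time, the zero-shot condition then consists of asking for summaries of validation episodes that involve the held-out objects.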

Summarization and question answering
We find that our model performs very well on both short and long summarization tasks and on the questions from our Q&A generation algorithm.
Table 2 Overlap of missing objects between questions and long summaries by question type, averaged over three models tested on the static held out valid unseen set. Overlap here is the number of missing word errors per question type for which the long summaries are also missing the same word in the same episode, as a percentage of all missing word errors per question type.
As the short summaries are more lexically diverse, binary accuracy measures are less appropriate so precision scores are given for the short summaries.
A few patterns in the results can be seen. First, the performance generally varies depending on how much generated text must be produced in an answer. Longer answers provide more opportunities for errors, so performance when measured by the strict metric of complete accuracy tends to be worse. This is particularly true for the question which asks for a long summary of the agent's action, which has the worst results according to the all-or-nothing accuracy metric.
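To make the all-or-nothing accuracy metric concrete, the sketch below scores an output as accurate only if it matches the target completely. The text does not define how precision is computed, so the clipped token-level precision shown here is an assumption used purely for illustration, as are the function names and whitespace tokenization.

```python
from collections import Counter


def exact_match(output, target):
    """All-or-nothing accuracy for one output: complete match only."""
    return output.strip() == target.strip()


def accuracy(outputs, targets):
    """Fraction of outputs that exactly match their targets."""
    matches = sum(exact_match(o, t) for o, t in zip(outputs, targets))
    return matches / len(targets)


def token_precision(output, target):
    """Assumed definition: share of output tokens also found in the target.

    Counts are clipped so a repeated output token is only credited as
    many times as it occurs in the target.
    """
    out_tokens = output.split()
    if not out_tokens:
        return 0.0
    remaining = Counter(target.split())
    matched = 0
    for token in out_tokens:
        if remaining[token] > 0:
            matched += 1
            remaining[token] -= 1
    return matched / len(out_tokens)


if __name__ == "__main__":
    # One wrong word makes the whole answer inaccurate under exact match,
    # while token precision still credits the three correct tokens.
    print(exact_match("go to the counter", "go to the apple"))
    print(token_precision("go to the counter", "go to the apple"))
```

The contrast between the two scores shows why longer outputs fare worse under exact match: a single wrong token anywhere in a long summary zeroes its score.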
Second, "either/or" questions have better accuracy than their corresponding "yes/no" questions. This could be because asking if, for example, an action was performed is made easier when it is a choice between two actions, so that any uncertainty the model has about one of the actions may be offset by its certainty about the other option. It is also possible that the model has a harder time connecting the meaning of the "yes/no" answers back to the input, particularly since most of the questions require outputting an object or action name, not just a "yes/no".
Third, it might be expected that questions about the order that actions took place would be significantly more difficult for the model to interpret than those about the mere occurrence of those actions. Surprisingly, we find that in most cases the model's performance on temporal questions is very similar to that on the other questions.
The model tends to make two kinds of errors when generating anything other than "yes/no" answers. It sometimes misidentifies objects, especially small ones, and particularly in the unseen environments. It also sometimes uses a different description for a location than the ground truth annotation, in some cases doing so in a way that is nevertheless consistent with the action as seen in the episode. For example, the ground truth annotation may read, "go to the apple" while the model outputs, "go to the counter" when the apple is on the counter. See Figure 3 for examples of errors in short and long summaries generated by the model.
The errors made by the model display some consistency between different questions asked and between the questions and summaries. For example, in one episode of the validation seen set which involves moving a book, it consistently mistakes the book for a pen, answering a "just before" question with, "put the pen on the desk," producing a short summary, "put two pens on the right side of the desk," and beginning the long summary with, "go to the side table, pick up the pen...". There is a marked difference in the consistency of these errors depending on question type, however, as we show in Table 2. We measure this consistency by counting what fraction of particular objects omitted from the model's answers to a given question type is also missing from the corresponding long summaries about that episode. This fraction is compared for different question types. We find that questions which require generating both an action and an object together have the highest degree of overlap in which objects they fail to identify and which are also missing in the long summaries; the temporal "just before / just after" answers in particular show high consistency with the long summaries. We hypothesize that the representations which the model uses for summarization align better with those it uses for the question types where there is higher overlap of missing words.
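The consistency measure described above can be written out as a short computation. This is a sketch under assumptions, not the evaluation code behind Table 2: the input format (a list of missing-word errors for one question type and a per-episode set of words missing from the generated long summary), the whitespace tokenization, and the function names are hypothetical.

```python
def missing_words(target, output):
    """Words of the target answer that do not appear in the model output."""
    produced = set(output.split())
    return {word for word in target.split() if word not in produced}


def overlap_percentage(question_errors, summary_missing):
    """Share of a question type's missing-word errors that the long
    summary of the same episode also misses, as a percentage.

    question_errors: list of (episode_id, missing_word) for one question type.
    summary_missing: dict episode_id -> set of words missing from that
        episode's generated long summary.
    Returns None when the question type has no missing-word errors.
    """
    if not question_errors:
        return None
    shared = sum(
        1
        for episode_id, word in question_errors
        if word in summary_missing.get(episode_id, set())
    )
    return 100.0 * shared / len(question_errors)


if __name__ == "__main__":
    errors = [("e1", "book"), ("e1", "desk"), ("e2", "mug"), ("e3", "apple")]
    summaries = {"e1": {"book"}, "e2": {"mug", "sink"}}
    print(overlap_percentage(errors, summaries))
```

A high percentage for a question type indicates that when the model fails to name an object in an answer, it tends to fail on the same object in the long summary of that episode.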

Zero-shot summarization via question answering
Can question answering improve the ability to summarize? We find that when the model is trained to answer questions about episodes involving all objects, it is then able to go on to summarize episodes with objects which it has not been trained to include in summaries. Table 3 displays a breakdown of zero-shot performance on long summaries. For comparison, we include results for the standard case in which nothing is held out (detailed in Table 1) and for a model not trained to answer questions on the held out set. These comparisons show that while zero-shot summarization is not as accurate as fully supervised summarization, training on the auxiliary question-answering task yields clearly better summaries than training without it. A model not trained to answer questions on episodes with held out objects is unable to correctly summarize episodes involving those held out objects. It is simply not able to output any of the held out objects' names without having at least seen them during question answering. Training the model to answer questions about the objects through an auxiliary question-answering task leads to clear improvement on the summarization task. This result suggests that the model is learning representations of objects, or actions involving objects, while learning to answer questions, which it can then use when producing summaries. There must be at least some transfer of representational knowledge between the question answering and the summarization tasks within the model.
Clear improvement with transfer compared to without transfer is also demonstrated in BLEU and ROUGE scores of both short and long summaries in seen and unseen environments (in only one case is there not improvement); see Table 5 in Appendix C for details.

Impact of question type on zero-shot transfer to summarization
We have seen that transfer from question answering to summarization occurs. But which questions are most important? In order to further investigate the sharing of representations between question answering and summarization, we rerun the experiments using the same held out protocol, but using focused sets of particular question types.
Testing each question type separately allows us to measure whether all questions are equally useful for promoting transfer to summarization. Interestingly, we find that not all questions are equally useful: only the temporal "just before" and "just after" questions, which ask what action was performed just before or after a given action, exhibit the transfer between tasks (see Table 3 for accuracy metrics on temporal and non-temporal questions). This is true of both subtypes of these questions, i.e. both the simple and complex language versions. On their own, the "yes/no" and "either/or" questions about objects or actions do not lead to the same zero-shot summarization ability. It is worth recalling here that the answers to the temporal questions were also found to be especially consistent with the long summaries in the missing object errors they contained, which would also suggest a particularly aligned representational space between those tasks (see Table 2).
We also tested the transfer ability of a model trained in a similar manner but which excluded episodes based on the action verbs they contained rather than the objects. For these experiments, only one action verb at a time and the episodes which contained it were identified as held out items. In none of these cases was the model able to transfer the use of the verb to summaries of the held out episodes. This could be because the dataset contains fewer distinct actions than objects.

Related Work
RoboNLP Taniguchi et al (2019) and Tellex et al (2020) offer thorough reviews of language use in the context of robotics. Detailed descriptions of actions such as robots playing soccer (Mooney, 2008) or automated driving (Barrett et al, 2015, 2017) have been generated. These have not involved learning how to report and condense a series of actions into anything like a summary, however. DeChant and Bauer (2021) proposed robot action summarization as a research direction, suggesting a set of tasks to pursue.
Instruction Following Our proposal is closely related to learning to follow natural language instructions, which has long generated a great deal of interest at the intersection of robotics and natural language processing (Winograd, 1972; Dzifcak et al, 2009). Shridhar et al (2021) train a robotic arm in a virtual environment to perform a range of tasks following natural language instructions and transfer the learned model to a real world robot. Mees et al (2021) introduce a benchmark for so-called long horizon (many step) robotic manipulation tasks following natural language instructions.
Rich simulated environments for language-guided navigation tasks have been introduced in recent years. Anderson et al (2018) introduced the Room to Room vision and language navigation dataset, which became the basis for much work in this area. Some of that work has involved learning to generate natural language descriptions of navigation trajectories as a training signal or tool: Nguyen et al (2021) provide feedback to an agent in the Room to Room environment by describing in natural language the paths the agent actually takes so it can learn to compare that to the path it should have taken; Fried et al (2018) learn to generate instructions to augment training data and then, at test time, to evaluate the similarity of routes it might take with the description of the desired route.
Q&A in robotics Learning to ask questions has also been worked on as a way for a robotic agent to ask for help or clarification while performing a task (Tellex et al, 2014; Thomason et al, 2019). Yoshino et al (2021) use natural language questions to clarify aspects of how a simple action was performed in response to a question. Datta et al (2022) introduce a form of question answering where the questions are in natural language but the answers take the form of visual highlights of a map to indicate locations. Carta et al (2022) propose filling in the blanks within structured language instructions as an auxiliary task for reinforcement learning agents in a 2-D grid world. Gao et al (2021) introduce a similar Q&A task in a virtual environment, though without summarization; a slightly different embodied Q&A task, requiring an agent to seek out answers to questions, is proposed by Gordon et al (2018).
Summarization There is an extensive body of work on natural language summarization, providing examples and resources for the new task of robot action summarization (see Nenkova and McKeown (2012) and Gambhir and Gupta (2017) for reviews).
Video understanding Work on understanding video is relevant to our work since we are interested in using video or selected images from video as one of the inputs to summarizing a robot's action in natural language. The task of 'video summarization' in the computer vision community refers to selecting important frames of a video that can, together, serve as a visual summary of the whole video; see Apostolidis et al (2021) for a review of such techniques. Some work has been done on multimodal summarization from video and text transcripts to natural language summaries; Palaskar et al (2019) is one example, going from video and text in the How2 video dataset (Sanabria et al, 2018) to summaries. Bärmann and Waibel (2022) assemble a large question answering dataset for real world video of humans performing actions, requiring significant effort to annotate.

Conclusion
We develop a model that can be jointly trained to summarize and answer questions about a virtual robot's past actions. We find that the model learns a representation space which is shared across at least some of the question types and summaries, leading to zero-shot summarization abilities.
This work helps begin a line of research on robot action summarization and question answering. It is important that robots operating in the real world be well supervised by humans and that their actions be understandable. We suggest that establishing a basic narrative of what an agent does is in some ways a prerequisite to further understanding why it does something. Once answering questions about and summarizing robot actions can be performed reliably, we also expect these capabilities to be useful in a variety of other ways, including in training robots and in lifelong learning settings in which robots might receive feedback on the summaries they generate.
It has been recognized for some time that grounding the use of language in machines to the real world is essential for creating AI systems that can actually understand the language they process (Harnad, 1990; Bisk et al, 2020; Chandu et al, 2021; McClelland et al, 2020). Summarization and question answering about past robot actions provide new avenues for tackling the grounding problem.
Though this work took place in simulation, the summarization and question answering tasks are not specific to aspects of this or any simulated environment. Future work will explore the application of these tasks to real world robots.
Acknowledgements C.D. was supported by a grant from the Long Term Future Fund.
Table A1 presents the results of the ablation study on the seen and unseen validation set environments. Some of the binary questions have accuracies very close to 50%; these include the "simple action yes/no", "complex action yes/no", and "action either/or" questions. The "object yes/no" and "object either/or" questions, on the other hand, have slightly higher accuracies, approximately 63%, suggesting that the model has learned some patterns in the distribution of objects in the dataset. Similarly, the temporal "just before" and "just after" questions have higher accuracies than a uniformly random choice among possible actions would achieve. Our actual model achieves much higher accuracy on all of these tasks, however, demonstrating that it has learned much more than merely the regularities in the dataset.
We initially tested an additional form of question, a question that asked if <Action A> happened before <Action B> in a given episode (note that this type of question differs from the temporal questions included in this work, which ask what action happened immediately before or after a given action, not whether an action happened at any point before a given action). We had to exclude this form of question, however, because the model was able to achieve over 80% accuracy on validation set episodes under the ablated visual input regime. This question was apparently simply too easy given the regularities in the dataset.

Appendix C Additional comparison metrics for zero-shot transfer
In Table B2 we provide additional metrics to understand the performance of zero-shot transfer from question answering to summarization. Short summaries tend to have lower BLEU and ROUGE scores than the long summaries because the long summaries use a standardized set of words to describe actions and objects while the short summaries use a more diverse set of words and provide varying levels of detail.
These results, though on a small test set, suggest that the model has learned a bias toward answering, "yes", only when there is evidence in the input that a question should be answered affirmatively. This is, of course, what a user would want. Further investigation of the circumstances under which it correctly answers out of distribution questions is warranted, as are ways to improve performance on out of distribution questions, especially unusual ones.

Fig. 3 Example errors in generated long and short summaries. Errors in the long summaries are indicated with strikethrough text (with the correct text following in italics and parentheses). Generated short summaries appear to the left of the correct summaries, which are in italics.
Dharur S, Cartillier V, et al (2022) Episodic memory question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19119-19128
DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track
Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163-4168
Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318-3329
Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1-66
Gao D, Wang R, Bai Z, et al (2021) Envqa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675-1685
Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089-4098
Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335-346
He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770-778

Table 1
Accuracy and precision scores for question and summary outputs by output type, including standard deviation. ROUGE and BLEU scores are also given for summaries. Results shown are from two validation sets: those based on episodes in virtual environments seen during training are on the left, those from unseen environments on the right. None of the actual episodes themselves, of either type, are found in the training set. Precision scores are not shown for "yes/no" answers, where such scores must equal the accuracy scores. Results are averaged from three models with different random seeds, all tested on the set of static held-out questions.
for all question and summarization types. An answer is considered accurate if it completely matches the target answer. BLEU (Papineni et al, 2002) and ROUGE (Lin, 2004) scores are also given for the two summary types. The BLEU score is a measure of
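
The exact-match criterion described above, together with the averaging over three random seeds reported in Table 1, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `exact_match_accuracy` and `mean_and_std` are hypothetical, stripping surrounding whitespace before comparison is an assumption, and the use of the sample standard deviation is also an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def exact_match_accuracy(predictions, targets):
    """Fraction of predictions that completely match their target answer.

    Assumption: surrounding whitespace is ignored; everything else
    (case, punctuation, word order) must match exactly.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one example is required")
    matches = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return matches / len(targets)


def mean_and_std(scores):
    """Mean and sample standard deviation of per-seed scores.

    Assumption: sample (n - 1) standard deviation; needs at least two scores.
    """
    return mean(scores), stdev(scores)


if __name__ == "__main__":
    # Hypothetical outputs from one model on three prompts.
    predictions = ["yes", "no", "the robot picked up a mug"]
    targets = ["yes", "yes", "the robot picked up a mug"]
    print(exact_match_accuracy(predictions, targets))

    # Hypothetical accuracies from three models with different seeds.
    print(mean_and_std([0.5, 0.7, 0.6]))
```

Under this criterion a summary that differs from the target by a single word counts as wrong, which is why overlap-based scores such as BLEU and ROUGE are a useful complement for the summary outputs.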

Table 3
Accuracy of zero-shot long summarization when transferring representations learned from question answering to producing long summaries, broken down by the question type used to learn the objects held out from summarization training. Results are shown for episodes containing held-out objects in the validation sets in seen and unseen environments. The bottom two rows show a baseline with no question answering training on the held-out objects (and therefore no transfer) and a comparison to the fully trained model with nothing held out.
Contributions
C.D. led the project, performed experiments, and wrote the initial draft of the manuscript. I.A. and D.B. provided supervision, research guidance, and edited and wrote portions of the manuscript.

Table A1
Ablation of video frames baseline: results for a model trained to answer questions and produce summaries when trained with questions and answers as usual but with each question and answer pair and summarization task paired to identical visual input (i.e. each episode's observations are replaced by a single, static set of observations that do not vary from episode to episode), thereby completely depriving the model of any useful visual information with which to answer the question.

that the model trained with this ablation of meaningful visual inputs could learn anything about visual inputs. The output on validation set summarization and question answering prompts would therefore only be a reflection of what the model has learned about the regularities in the text portion of the dataset, e.g. what actions are more likely to follow from other actions, regardless of the episode visual data. Table