Rescaling Egocentric Vision

This paper introduces a pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M frames and 90K actions in 700 variable-length videos, capturing long-term unscripted activities in 45 environments, using head-mounted cameras. Compared to its previous version, EPIC-KITCHENS-100 has been annotated using a novel pipeline that allows denser (54% more actions per minute) and more complete annotations of fine-grained actions (+128% more action segments). This collection enables new challenges such as action detection and evaluating the "test of time", i.e. whether models trained on data collected in 2018 can generalise to new footage collected two years later. The dataset is aligned with 6 challenges: action recognition (full and weak supervision), action detection, action anticipation, cross-modal retrieval (from captions), as well as unsupervised domain adaptation for action recognition. For each challenge, we define the task, and provide baselines and evaluation metrics.

Alternatively, one dataset can be enriched with multiple annotations and tasks, aimed towards learning intermediate representations through downstream and multi-task learning on the same input. This has been recently achieved for autonomous driving [18,19,20,21,22,23,24,25] and scene understanding [26,27]. For example, Taskonomy [26] contains 26 tasks ranging from edge detection to vanishing point estimation and scene classification.
In comparison, the number of tasks proposed for action and activity understanding datasets [1,5,28,29,30,31] remains modest. Often, this is limited by the source of videos in these datasets. YouTube [28,30] and movies [5,29] typically contain curated videos, with edited shots. However, attempts to define multiple challenges for these datasets have been exemplary. ActivityNet [28] is the most popular video challenge.
[Fig. 1 caption: Frames from the original videos [1] and newly collected videos, with selected frames showcasing the same action. Note object location differences in 'returning' kitchens (e.g. microwave relocated). We show the same action performed in 'changing' kitchens (e.g. same participant preparing pizza or filtered coffee in a new kitchen).]
Several leading egocentric datasets [35,36,37,38,39] showcased the unique perspective and potential of first-person views for action recognition, particularly hand-object interactions. In 2018, the introduction of the largest-scale dataset EPIC-KITCHENS [1] transformed egocentric vision, not only due to its size, but also the unscripted nature of its collection and the scalability of its collection pipeline. In this paper, we present EPIC-KITCHENS-100, a substantial extension which brings the total footage to 100 hours, capturing diverse unscripted and unedited object interactions in people's kitchens². As shown in Fig. 1, the actions capture hand-object interactions with everyday objects in participants' kitchens. The unscripted nature of the dataset results in naturally unbalanced data, with novel compositions of actions in new environments. While challenging, the dataset is domain-specific (i.e. kitchen-based activities), offering opportunities to leverage domain knowledge. We offer two-level annotations for nouns and verbs in interactions (e.g. "carrot/courgette → vegetable", "put/throw/drop → leave") to utilise such priors.
Importantly, we propose a refined annotation pipeline that results in denser and more complete annotations of actions in untrimmed videos. This pipeline enables various tasks on the same dataset; we demonstrate six in Section 4, with baselines and evaluation metrics that focus on understanding fine-grained actions, and offer benchmarks which can support research into better modelling of video data.
² We will refer to the previous edition as EPIC-KITCHENS-55, in reference to the number of hours collected and annotated.

Data Collection and Scalable Pipeline
In this section, we detail our collection and annotation effort.

Data Collection. We collect additional footage as follows: we contacted participants from EPIC-KITCHENS-55 to record further footage. Of the 32 participants in [1], 16 subjects expressed interest in participating. Interestingly, half of these (8 subjects) had moved homes over the past two years. We also recruited 5 additional subjects, increasing the total number of subjects and kitchen environments to 37 and 45 respectively. All participants were asked to collect 2-4 days of their typical kitchen activities, as in [1]. We collect footage using a head-mounted GoPro Hero7 Black. This is two generations newer than the camera used in EPIC-KITCHENS-55, with a built-in feature for HyperSmooth video stabilisation. Sample frames are shown in Fig. 1, with selected frames of the same action in returning and changing kitchens.

Annotation Pipeline. An overview of the pipeline can be seen in Fig. 2.

(a) Narrator. Previously, for EPIC-KITCHENS-55, we used a non-stop narration approach, where each participant narrated their previous action while watching the future actions in the running video. We found this resulted in increased mental load and some actions being missed or misspoken. To improve upon this approach, we take inspiration from [40], where objects in images are annotated by pointing and speaking, and propose temporal 'pointing', which we refer to as 'pause-and-talk'. By allowing participants to pause the video to speak, as well as take breaks, we aim to increase the accuracy and density of actions, whilst still allowing for a scalable narration approach. We built an interface to facilitate collecting such narrations from the participants (Fig. 2a), which includes a video player, synced with audio recordings³. Participants watch the video and press a key to pause while they narrate the action in their native language.
As previously observed in [1], using the native language ensures the narrations use the correct vocabulary in describing the actions. The video restarts on key release. Note that the narrator still watches the video only once, maintaining the targeted scalability of the annotation pipeline, but without the mental overload of narrating past actions while watching future actions. This allows short and overlapping actions to be captured, in addition to enabling error correction, as participants can listen to, delete or re-record a narration. Fig. 2 shows an ongoing narration, demonstrating density (ticks on the slider).
³ Our tool is available at https://github.com/epic-kitchens/epic-kitchens-100-narrator
(b) Transcriber. We perform transcription of the audio narrations, followed by translation (if applicable): first, we transcribe all narrations, and then translate the unique transcriptions into English using a hired translator, for correctness and consistency. The approach we used to transcribe narrations in [1] had issues where workers failed to understand some audio narrations due to the lack of any visual information. To mitigate this, we build a new transcriber interface containing three images sampled around the timestamp (Fig. 2b). We find that the images increase worker agreement and alleviate issues with homonyms (e.g. 'flower' and 'flour'). Each narration is transcribed into a caption by 3 Amazon Mechanical Turk (AMT) workers, requiring a consensus of 2 or more workers. Transcriptions were automatically rejected if the cosine similarity between the Word2Vec [41] embeddings of the transcriptions was lower than an empirical threshold of 0.9. When AMT workers failed to agree, the correct transcription was selected manually. Captions were then spell-checked, and substitutions were applied from a curated list of problematic words (e.g. 'hob' and 'hop'), further reducing errors.
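The consensus step can be sketched as follows. This is an illustrative sketch, not the released pipeline: the real system embeds captions with Word2Vec [41] and uses the empirical 0.9 threshold, whereas here `wv` is a hypothetical toy word-to-vector lookup, and a caption is embedded as the mean of its word vectors.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def embed(caption, word_vectors):
    """Mean word vector of a caption (Word2Vec in the real pipeline;
    `word_vectors` is a hypothetical word -> vector lookup)."""
    dim = len(next(iter(word_vectors.values())))
    vecs = [word_vectors[w] for w in caption.lower().split() if w in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def consensus(captions, word_vectors, threshold=0.9):
    """Accept when any two of the three AMT transcriptions agree
    above the similarity threshold; otherwise defer to manual review."""
    embs = [embed(c, word_vectors) for c in captions]
    for i, j in combinations(range(len(captions)), 2):
        if cosine(embs[i], embs[j]) >= threshold:
            return True, captions[i]
    return False, None  # fall back to manual selection

# Toy 2-d "word vectors", for illustration only.
wv = {"open": [1.0, 0.0], "the": [0.2, 0.2], "fridge": [0.0, 1.0],
      "door": [0.1, 0.9], "cut": [1.0, 1.0]}
ok, chosen = consensus(["open the fridge", "open the fridge door", "cut"], wv)
# the first two captions agree (cosine ~ 0.97 with these toy vectors)
```

In the actual pipeline a rejected triple is routed to manual selection, and accepted captions are further spell-checked against the curated substitution list.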
(c) Parser. We use spaCy [42] to parse the transcribed captions into verbs and nouns (Fig. 2c) and manually group these into minimally overlapping classes, as we did in our previous work. We reworked this step to improve the parsing of compound nouns and missing verbs/nouns. Additionally, all annotations (including those we collected previously for EPIC-KITCHENS-55) were re-parsed using the updated pipeline. To cluster the verbs and nouns, we adjust the previous clusters to reduce ambiguities between classes. For example, we group 'brush' and 'sweep' into one verb class, and introduce noun classes that did not exist before, such as 'lentils'.

(d) Temporal Annotator. We built an AMT interface for labelling the start/end times of action segments (Fig. 2d). Annotators completed a quick tutorial on annotating temporal bounds before labelling 10 consecutive actions. To create the bounds of an action segment, we use the same approach as previously, but increased the number of workers from 4 to 5 to improve quality. Note that in the untrimmed videos there might be consecutive instances of the same action, indicated by repeated narrations. We thus request that annotators mark the temporal bounds of each instance, prompted by the timestamp. This avoids merging instances of the same action.

Quality Improvements.
Our EPIC-KITCHENS-100 scalable pipeline focuses on denser and more accurate annotations. We compare the different parts of the pipeline to our previous one in Appendix B. Here, we show the improved quality of annotations both numerically and through an example. Fig. 3 (left) compares the narration method we used in [1] to the new pipeline over several metrics. Our 'pause-and-talk' narrator produces more densely annotated videos: there are fewer gaps and more labelled frames, and actions are shorter and exhibit higher overlap. The narration timestamps are also closer to the relevant action, with a higher percentage contained within the action, and a smaller distance to the action for the timestamps that fall outside it. Fig. 3 (right) shows two video sections of equal length, annotated by the same participant, one using non-stop narrations and the other using 'pause-and-talk'. The number of annotated actions increased from 20 to 56, with short actions (such as 'turn on tap') often missed by the previous pipeline. We demonstrate this through two examples: the first is a missed action of picking up a dropped bag off the floor, and the second is a missed 'close cupboard' action. In the 'pause-and-talk' sequence, all actions, including closing the cupboard, were successfully narrated. By narrating more actions, the start/end times also become more accurate, as it is more obvious to the AMT annotators what each narration refers to.

Statistics, Scalability and the Test of Time
EPIC-KITCHENS-100 contains 89,977 segments of fine-grained actions annotated from 700 long videos. Footage length amounts to 100 hours. Table 1 lists the general statistics, separating those of the previously collected videos from the newly collected ones. Note that all previous narrations have been re-parsed using the new pipeline (Fig. 2b-d). EPIC-KITCHENS-100 rescales our previous dataset to almost double its length. In Fig. 4 we show the frequency of verb (97) and noun (300) classes in the dataset. These are grouped into categories (13 verb and 21 noun categories), sorted by size. For example, the verbs 'wash', 'dry', 'scrape', 'scrub', 'rub', 'soak' and 'brush' are grouped into a clean verb category. The plots show a clear long-tail distribution.
The contribution of each class from source videos [1] and extension are also shown. New verb classes (e.g. 'lock', 'bend') and noun classes (e.g. 'orange' and 'hoover') are only present in the newly-collected videos.
We enrich our dataset with automatic spatial annotations using two models. The first is Mask R-CNN [44] trained on MSCOCO [4]. The second detects hand-object interactions [43], trained on 100K images from YouTube along with 42K images from three egocentric datasets [1,45,39], of which 18K are from our videos [1]. It detects interacted static and portable objects as an offset to hand detections. Example annotations are shown in Fig. 5, and the number of annotations is given in Table 1. While we do not use these annotations to report results, we believe these 66M masks, 31M hand and 38M object bounding boxes could facilitate future models for spatial (or spatio-temporal) attention⁴.

Splits. We split the videos into Train/Val/Test with a ratio of roughly 75/10/15. Each video, with all its action segments, is present in exactly one of the splits, and the Test split contains only newly-collected videos. We use re-parsed videos from the original EPIC-KITCHENS test sets⁵ as the new validation set. Our Val/Test splits contain two interesting subsets, which we report on separately:
- Unseen Participants: Our Val and Test splits contain participants not present in Train: 2 participants in Val, and another 3 in Test, contributing 1,065 and 4,110 action segments respectively. This subset helps evaluate the generalisability of models across the various benchmarks.
- Tail Classes: We define these (for verbs and nouns) to be the set of smallest classes whose instances account for 20% of the total number of instances in training. A tail action class contains either a tail verb class or a tail noun class. These comprise 86/228/3,729 verb/noun/action classes.
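The Tail Classes subset above can be derived purely from training-set class counts. A minimal sketch (the class names and counts below are hypothetical, and the behaviour at exactly the 20% boundary depends on the tie-breaking one chooses):

```python
from collections import Counter

def tail_classes(instance_labels, fraction=0.2):
    """Return the set of smallest classes whose instances jointly
    account for at most `fraction` of all training instances."""
    counts = Counter(instance_labels)
    total = sum(counts.values())
    tail, covered = set(), 0
    # Walk classes from smallest to largest, accumulating instances.
    for cls, n in sorted(counts.items(), key=lambda kv: kv[1]):
        if covered + n > fraction * total:
            break
        tail.add(cls)
        covered += n
    return tail

# Hypothetical verb counts: 'wash' dominates; 'lock' and 'bend' are rare.
labels = ["wash"] * 70 + ["dry"] * 20 + ["lock"] * 6 + ["bend"] * 4
print(tail_classes(labels))  # 'bend' (4) + 'lock' (6) = 10% <= 20%
```

A tail action class is then any action whose verb or noun class falls in the corresponding tail set.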
Scalability and the Test of Time. As we rescale EPIC-KITCHENS with additional videos, we carry out two investigations: (a) how models trained on videos from [1] perform on videos collected two years later, and (b) how models' performance scales with additional annotated data. We call these the test of time and the scalability tests respectively.
⁴ The correctness of bounding boxes for hands and objects has been evaluated by Shan et al. [43] - see acknowledgements.
The performance of the R-CNN masks has not been quantitatively evaluated, and these are error-prone.
⁵ We no longer split the test set into seen and unseen kitchens, but instead report relevant evaluation metrics for each challenge.
Fig. 6 includes results for both investigations, evaluated on the task of action recognition (definition and models from Section 4.1). We separate overall results (left) from unseen participants (right). For all models, comparing the first two bars demonstrates that models trained solely on videos from [1] do not withstand the test of time: for the same model, performance drops significantly when new data is evaluated. This highlights a potential domain gap, which we discuss next. We assess scalability by gradually adding new data in training. Results demonstrate a significant improvement, albeit saturating when 50% of the new data is added, particularly for unseen participants. This highlights the need for better models and more diverse data, rather than merely more data. This is particularly evident as the unseen participants subset benefits even less from scaling. We tackle the gap to new environments and participants next.
Unravelling the Domain Gap. As defined in early work on speech recognition [46], "A domain D is a (often infinite) set of samples such that each sample satisfies a property P_D". A domain gap is present when at least one property differs between the samples of two domains. Domain gaps have been a frequent source of frustration for a wide range of learning tasks, where models trained on samples from one domain under-perform when deployed in a different domain. This is also known as sample-selection bias [47]. Sampling bias is a common cause of a domain gap between datasets, and cannot easily be removed during dataset collection, as noted in [48]. The most obvious domain gaps stem from changes in locations [49], viewpoints [50], labels [51] and participants [52]. However, there are often more subtle causes, such as differences in capture methodology [53], or changes in objects, environments and actions over time.
The concept of a compound domain gap has recently been introduced in [54], where the target domain is a compound of multiple domains without domain labels. As stated by Liu et al. [54], this is a more realistic scenario resulting from unconstrained data collection. In EPIC-KITCHENS-100, each video in the extension offers a compound domain gap due to changes in one or more of the following properties:
- Hardware and capturing, as in [53,55]. Extended footage uses a newer camera model with onboard video stabilisation.
- Locations, as in [49]. As indicated in Section 2, eight subjects have moved home, resulting in changed surroundings but keeping the appearance of many objects and tools. Additionally, unseen participants capture footage in new environments where the appearance of objects and surroundings differ.
- Participants, as in [52]. Hand appearances and individual behaviours exist in the extension which are not in the original footage.
- Short-term temporal offsets, as in [56], where time-of-day can affect scene lighting, and some background objects change position (e.g. on the counter for one video, put away in a cupboard for a later video).
- Long-term temporal offsets, as in [57,58]. EPIC-KITCHENS-100 is filmed 2 years after EPIC-KITCHENS-55. In the same environment, changes such as wear and tear, new objects and different object positions are observed (see Fig. 1, right). Participant behaviour can also change over time.
While we have domain labels for some of these properties (e.g. recording camera, location, time-of-day and participant ID), other property changes can vary between samples, without associated labels. It is particularly difficult to associate labels with changes in behaviour or object appearances, for example. We publish these properties with the dataset when present. Importantly, we explore this compound domain gap, without using property labels, using a new challenge on unsupervised adaptation for action recognition (Section 4.5).

Challenges and Baselines
In this section, we define 6 challenges on our dataset: two modified from [1], namely action recognition (Section 4.1) and anticipation (Section 4.4), and four new challenges: weakly-supervised action recognition (Section 4.2), action detection (Section 4.3), unsupervised domain adaptation for action recognition (Section 4.5) and action retrieval (Section 4.6). While many works have addressed one or more of these challenges, they are typically explored on different datasets. Our annotation pipeline (from captions and single timestamps to segments and classes - Fig. 2) can be used to define multiple challenges, potentially jointly. In this section, we only scratch the surface by reporting on each challenge independently. For readability, we include all implementation details in Appendix C, and we publish all our baseline models and evaluation scripts.

Action Recognition
Definition. As in [1], we consider a video segment with start and end times (t_s, t_e) in a video. We aim to predict (v̂, n̂, â), the verb/noun/action classes of the action in this segment. We consider overlapping segments independently.

Related Datasets. Several datasets have been collected with a focus on action recognition, from [71,72] to recent large-scale ones [5,59,61,62,64,65], all offering a challenge with a held-out test set. In Table 2, we compare EPIC-KITCHENS-100 to these non-egocentric datasets across a range of facets. Ours is the only dataset of unscripted activities of comparable size to those collected from scripted or curated (YouTube) videos.

Evaluation Metrics. We report Top-1/5 Accuracy on the Val and Test sets.

Baselines and Results. In Table 3, we report results of five state-of-the-art recognition models [66,67,68,69,70], in addition to a random chance baseline. We use the Train set to report on Val, optimising hyper-parameters. We then fix these, and train on both the Train and Val sets in order to report on the Test set.
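Top-k accuracy, as reported above, counts a segment as correct when its ground-truth class appears among the k highest-scoring classes. A minimal sketch (the scores below are illustrative):

```python
def topk_accuracy(scores, labels, k=5):
    """Fraction of segments whose ground-truth class is within the
    k highest-scoring classes. `scores` holds one list of class
    scores per segment; `labels` the ground-truth index per segment."""
    hits = 0
    for s, y in zip(scores, labels):
        ranked = sorted(range(len(s)), key=lambda c: s[c], reverse=True)
        if y in ranked[:k]:
            hits += 1
    return hits / len(labels)

# Two segments, four classes; illustrative verb scores.
scores = [[0.1, 0.6, 0.2, 0.1],   # ground truth 1 -> top-1 hit
          [0.4, 0.3, 0.2, 0.1]]   # ground truth 1 -> top-2 hit only
labels = [1, 1]
print(topk_accuracy(scores, labels, k=1))  # 0.5
print(topk_accuracy(scores, labels, k=2))  # 1.0
```

Action accuracy additionally requires both the verb and the noun of a segment to be correct, which is stricter than either score alone.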

Weakly-supervised Action Recognition
Definition. As in Section 4.1, the goal is to recognise the action, i.e. predict (v̂, n̂, â), in a trimmed action segment during testing. Distinctly, we use single timestamps instead of temporal boundaries during training.
Let {A_i}_{i=1}^{N} be the action instances contained in an untrimmed training video, where each A_i = (t, v, n, a) is labelled with only one timestamp t, roughly located around the action, along with its verb/noun classes. We utilise the narration timestamps from our collection pipeline as t.

Related Datasets and Types of Supervision.
Previous weakly-supervised approaches utilised video-level or transcript supervision, where the set [83,84,85,86,87,88] or sequence [89,90,91,92,93,94] of actions in the video is used in training, without temporal bounds. Table 4 compares EPIC-KITCHENS-100 to datasets used for weak supervision. When considering the number of classes (and instances) per video, EPIC-KITCHENS-100 offers a significant challenge. For example, ActivityNet [28] videos contain 1 class and 1.5 action instances on average, whereas EPIC-KITCHENS-100 videos contain 53.2 classes on average. Moreover, video-level supervision is typically used when few classes are present per video [28,75], while transcript supervision [79,80] expects no overlap between actions. Both types of weak supervision are insufficient in our case.
Alternatively, single-timestamp supervision is gaining popularity due to the scalability and performance balance [82,95,96,97]. We follow this trend as it fits naturally with our narration timestamps collected using 'pause-and-talk'.
Evaluation Metrics. We follow the same metrics as in Section 4.1.
Baselines and Results. We consider two baselines. The first, "Fixed segment", uses a segment of fixed length centred on the timestamp. The second is our previous work [82], where sampling distributions, used to select training frames from the untrimmed videos, are initialised from the single timestamps and then refined based on the classifier's response⁶. Both are trained end-to-end using a TSN backbone [66]; results can be seen in Table 5. [82] improves over the fixed-segment baseline by 1-3% top-1 accuracy across Val and Test. The fully-supervised upper bound is TSN, reported in Table 3. Comparatively, weak supervision performs 11% worse than strong supervision on top-1 action accuracy on Val and Test. Using roughly-aligned single timestamps is challenging when actions are short and overlapping. EPIC-KITCHENS-100, with its dense actions, provides an interesting benchmark to develop new models for weak supervision.
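The 'Fixed segment' baseline can be sketched as follows: given only the narration timestamp t, training frames are drawn from a fixed-length window centred on t. The window length and frame rate below are illustrative choices, not the values used by the baseline:

```python
def fixed_segment(t, length=4.0, fps=50.0, video_dur=None):
    """Frame indices of a fixed-length window centred on timestamp t
    (seconds). All frames in this window are treated as belonging to
    the narrated action class during training."""
    start = max(0.0, t - length / 2)
    end = t + length / 2
    if video_dur is not None:
        end = min(end, video_dur)
    return range(int(start * fps), int(end * fps))

frames = fixed_segment(t=10.0, length=4.0, fps=50.0)
print(frames.start, frames.stop)  # 400 600
```

The refinement in [82] replaces this hard window with a per-instance sampling distribution that is updated from the classifier's response, so frames far from the true extent gradually stop being sampled.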

Action Detection
Definition. All other challenges in Section 4 consider a trimmed segment (t_s, t_e) from the test video as input.
This assumption is limiting, as labelled start/end times of actions are unlikely to be present for new test videos.
In this challenge, we aim to detect and recognise all action instances within an untrimmed video, as in [28]. Given a video, we predict the set of all action instances Â = {Â_j}, where each Â_j = (t̂_s, t̂_e, v̂, n̂, â) is an action detection tuple including the predicted start and end times (t̂_s, t̂_e) and the predicted classes (v̂, n̂, â). During training, we use the set of ground-truth action annotations A. Note that the ground-truth A and predicted Â sets can be of different sizes. This definition is closely related to temporal segmentation [98], but segmentation assumes non-overlapping segments and is thus unsuitable for EPIC-KITCHENS-100.

Related Datasets. Table 4 compares EPIC-KITCHENS-100 to popular datasets for temporal action detection and segmentation. EPIC-KITCHENS-100 presents the largest challenge when considering the combined metrics of average video length, average instances per video and overlapping instances, and, compared to datasets with overlapping segments, it is larger in scale.

Baselines and Results. We generate action proposals using [99], which are then classified using SlowFast [70] (model trained as in Section 4.1). Results in Table 6 highlight that action detection is particularly challenging on this dataset, especially at higher IoU thresholds. The qualitative example in Fig. 8 shows that videos in EPIC-KITCHENS-100 contain actions of varying lengths, which adds further challenges.

Fig. 8 Qualitative results of action detection. Predictions with confidence > 0.5 are shown with colour-coded class labels (see legend). Since the baseline predicts overlapping segments, the predictions are displayed over four rows for ease of viewing.
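Detection metrics such as average precision at temporal IoU thresholds rest on a segment-overlap measure and a matching step between predictions and ground truth. A minimal sketch, using a simple greedy matching (the full evaluation protocol follows [28]; the segments below are illustrative):

```python
def temporal_iou(seg_a, seg_b):
    """Intersection-over-union of two (t_s, t_e) segments, in seconds."""
    (s1, e1), (s2, e2) = seg_a, seg_b
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union if union > 0 else 0.0

def match_detections(preds, gts, thresh=0.5):
    """Greedily match predictions (highest confidence first) to
    unmatched ground-truth segments of the same action class;
    returns the number of true positives at this IoU threshold."""
    preds = sorted(preds, key=lambda p: -p["score"])
    used, tp = set(), 0
    for p in preds:
        for i, g in enumerate(gts):
            if i in used or g["cls"] != p["cls"]:
                continue
            if temporal_iou(p["seg"], g["seg"]) >= thresh:
                used.add(i)
                tp += 1
                break
    return tp

gts = [{"seg": (2.0, 6.0), "cls": "wash"}]
preds = [{"seg": (2.5, 6.5), "cls": "wash", "score": 0.9},
         {"seg": (0.0, 1.0), "cls": "wash", "score": 0.8}]
print(match_detections(preds, gts))  # 1
```

Because ground-truth segments in this dataset overlap, an evaluation that assumed non-overlapping segments (as in temporal segmentation) would be ill-posed here, which is why detection-style matching is used instead.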

Action Anticipation
Definition.
We aim to predict (v̂, n̂, â), the verb/noun/action classes of an action, by observing a video segment of arbitrary duration τ_o seconds (observation time) ending τ_a seconds (anticipation time) before the action's start, t_s. We set τ_a = 1. We expect models addressing this task to reason over observed sequences of actions, the current state of the world (e.g., what objects are visible) and the possible goal of the camera wearer.

Related Datasets. Table 4 also compares EPIC-KITCHENS-100 with other datasets used for action anticipation [31,73,74,75,80,81,39]. Our dataset is the largest in hours and classes, and is unscripted, which is critical for meaningful anticipation models and for in-the-wild testing.
Evaluation Metrics. We report results using class-mean top-5 recall [101]. The top-k criterion accounts for the uncertainty in future predictions, as in previous anticipation efforts [102,103,104]. Class-mean averaging balances the long-tail distribution.
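Class-mean top-5 recall first computes recall per class and then averages over classes, so rare (tail) classes weigh as much as frequent ones. A minimal sketch (the predictions below are illustrative):

```python
from collections import defaultdict

def class_mean_topk_recall(topk_preds, labels, k=5):
    """Average, over classes, of the fraction of that class's
    instances whose ground truth appears in the top-k predictions."""
    per_class = defaultdict(lambda: [0, 0])  # class -> [hits, total]
    for preds, y in zip(topk_preds, labels):
        per_class[y][1] += 1
        if y in preds[:k]:
            per_class[y][0] += 1
    recalls = [hits / total for hits, total in per_class.values()]
    return sum(recalls) / len(recalls)

# Three future-action instances, top-5 verb predictions each.
top5 = [["cut", "wash", "peel", "open", "take"],   # gt "cut": hit
        ["wash", "dry", "open", "take", "put"],    # gt "cut": miss
        ["open", "close", "take", "put", "wash"]]  # gt "open": hit
labels = ["cut", "cut", "open"]
print(class_mean_topk_recall(top5, labels))  # (0.5 + 1.0) / 2 = 0.75
```

Contrast this with plain top-5 accuracy, which would score the same predictions as 2/3: averaging per class rather than per instance is what protects the tail.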
Baselines and Results. We use our prior work RU-LSTM [100] as a baseline. In Table 7, RU-LSTM performs better on nouns than on verbs; the results also show that tail classes are particularly challenging for anticipation. Fig. 9 demonstrates that the baseline struggles where the next active noun/verb is ambiguous.

Unsupervised Domain Adaptation for Action Recognition
Definition. Unsupervised Domain Adaptation (UDA) utilises a labelled source domain and learns to adapt to an unlabelled target domain. We use videos recorded in 2018 as the labelled source, and newly collected videos as the unlabelled target (i.e. without any of the accompanying annotations). The action recognition task itself follows the definition in Section 4.1. The difficulty of this challenge stems from the fact that the source and target domains come from distinct distributions, due to the videos being collected two years later. Changes in location, hardware and long-term temporal offsets are the main sources of the domain shift (see Section 3). A method able to perform this task well provides a number of practical benefits, most notably eliminating the labelling time and expense when collecting new videos in the future.

Related Datasets. UDA datasets have traditionally used images [105,107,108,109], with recent attempts to use video [111,112,113] adapting across public datasets (e.g. UCF to Olympics). EPIC-KITCHENS-100 is the first to propose a within-dataset domain adaptation challenge in video. Video-based UDA raises additional challenges, such as aligning temporal information across domains [111], attending to relevant transferable frames [112], and avoiding non-informative background frames [114]. Table 8 shows that EPIC-KITCHENS-100 provides several advantages over other video-based datasets: the largest number of instances, classes and subdomains, and multimodality [115]. Additionally, it has a compound domain gap resulting from the test of time (i.e. recording data two years later).
Splits. This challenge assesses models' ability to adapt to additional footage without labels. We thus define the following splits. Source: labelled training data from 16 participants (collected in 2018), and Target: unlabelled footage from the same 16 participants collected in 2020. This ensures the gap between the domains relates to the capturing of the data 'two years later'. We further split the target videos into Target Train and Target Test. The first are unlabelled videos used during domain adaptation, while the second are videos used for evaluation, as in [108]. The number of action instances per split is reported in Table 8. Evaluation. We use the same evaluation metrics as Section 4.1 on Target Test.
Baselines and Results. We present lower and upper bounds: "Source-Only", where labelled source data is used for training and no adaptation to target data is attempted, and two upper bounds: "Target-Only", where labelled target data is used, and "Source+Target", where all training data is used with associated labels. None of these are UDA methods, but they offer an insight into the domain gap. Table 9 reports the results for these baselines, which use features extracted from TBN [68] trained on source. We use the code of Temporal Attentive Alignment (TA3N) [112], modified to consider multi-modal features (RGB, Flow and Audio), to report adaptation results. These show a significant performance improvement when using multi-modal data compared to the single-modality models of RGB, Flow and Audio. The domain gap is evident when comparing the lower and upper bounds. TA3N is able to partially close this gap, providing a 2.5% improvement in verb accuracy and 2.4% in nouns when using multiple modalities. Recent work [117] showed that RGB and Audio exhibit different levels of robustness to the domain gap in EPIC-KITCHENS-100. The best performing submissions for this challenge in 2021 exploited multiple modalities for domain adaptation [118,119]. Fig. 10 visualises the multi-modal features.

Multi-Instance Action Retrieval
Definition. Given a query action segment, the aim of video-to-text retrieval is to rank the captions in a gallery set, C, such that those with a higher rank are more semantically relevant to the action in the video. Conversely, text-to-video retrieval uses a query caption c_i ∈ C to rank videos. Differently from the other challenges in Section 4, here we use the English-translated free-form captions from the narrations (Fig. 2b).

Splits. We use the Train split from Table 1. As access to the captions is required for both video-to-text and text-to-video retrieval, the Val set is used to evaluate this challenge, allowing the held-out Test set for all other challenges to remain intact. We consider all the videos in Val, and all unique captions, removing repeats.

Related Datasets. In datasets commonly used for retrieval [7,29,30,120], captions are considered relevant if they were collected for the same video, and irrelevant otherwise. This common approach ignores the semantic overlap between captions of different videos that contain identical or similar actions; these datasets thus assume videos to be distinct from one another. In instructional video datasets [30,122], only the corresponding YouTube subtitle is considered relevant, again ignoring semantic overlap or similarities to other actions. Note that the large-scale HowTo100M [117] dataset has only been used for pre-training, being webly supervised and thus noisy; it does not include a val/test set.
In this challenge, we use the class knowledge from Section 3 to define caption relevancy. This allows us to consider captions "put glass" and "place cup" as semantically relevant-an opportunity not available in other retrieval datasets.
Evaluation Metrics. To evaluate this challenge, relevancy of a retrieved caption (or video) to the query item needs to be assessed. We consider the case where a query video contains the action of someone cutting a pizza using a cutter. We want captions: a) "cutting a pizza using a cutter", b) "cutting a pizza slice", c) "slicing a pizza" to all be more relevant than d) "cutting a pizza using a knife" which in turn is more relevant than both e) "cutting a vegetable" or f) "picking up a pizza slice". Critically, g) "opening a fridge" should be considered irrelevant.
Mean Average Precision (mAP) has been used in other retrieval works [121,123,124,125], yet it treats the relevance between items as binary. Because of this, (a-c) would be considered (equally) relevant captions. However, we would also like to consider non-binary relevance, where (d) is more relevant than (e), which in turn is more relevant than (g). We thus also report results using normalised Discounted Cumulative Gain (nDCG) [126]. This metric allows for non-binary relevance between captions. We define the relevance, R, as the mean IoU of the verb and noun classes, giving a value between 0 and 1, where 0 is irrelevant (no overlap in verb/noun classes) and 1 is extremely relevant. For the example above, 1 = R(a,a) ≥ R(a,b) = R(a,c) ≥ R(a,d) ≥ R(a,e) = R(a,f) ≥ R(a,g) = 0. We then use R to calculate nDCG as in [126] (see Appendix C.6 for the full definition).
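A minimal sketch of the relevance R and of nDCG under one common DCG formulation (linear gain with logarithmic discount; the exact definition used in the benchmark follows [126] and Appendix C.6). Captions are represented here as hypothetical (verb-class set, noun-class set) pairs:

```python
import math

def iou(a, b):
    """Set IoU; empty-vs-empty counts as full overlap."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def relevance(q, c):
    """R: mean IoU of the verb and noun classes of query q and
    caption c, each given as ({verb classes}, {noun classes})."""
    return 0.5 * (iou(q[0], c[0]) + iou(q[1], c[1]))

def ndcg(q, ranked):
    """Normalised DCG of a ranked caption list against query q."""
    def dcg(items):
        return sum(relevance(q, c) / math.log2(i + 2)
                   for i, c in enumerate(items))
    ideal = sorted(ranked, key=lambda c: -relevance(q, c))
    return dcg(ranked) / dcg(ideal)

q = ({"cut"}, {"pizza"})
caps = [({"cut"}, {"pizza"}),      # e.g. "slicing a pizza":     R = 1.0
        ({"cut"}, {"vegetable"}),  # e.g. "cutting a vegetable": R = 0.5
        ({"open"}, {"fridge"})]    # e.g. "opening a fridge":    R = 0.0
print([relevance(q, c) for c in caps])  # [1.0, 0.5, 0.0]
print(ndcg(q, caps))  # 1.0: list is already in ideal order
```

Under this definition, "put glass" and "place cup" map to the same verb class and related noun classes, so they earn a high R; binary mAP would score such a pair as entirely irrelevant.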
Baselines and Results. As in Section 4.5, we use TBN [68] features trained on the Train split. Table 11 provides results for two baselines and the chance lower bound. Multi-Layer Perceptron (MLP) uses a 2-layer perceptron to project both modalities into a shared action space with a triplet loss. Our previous work JPoSE [121] disentangles captions into verb, noun and action spaces learned with a triplet loss. JPoSE sees a significant boost in performance over MLP. Fig. 11 shows qualitative retrieval results on four examples using both MLP and JPoSE for text-to-video retrieval. JPoSE is able to retrieve more correct videos than MLP, but both methods still struggle on longer captions. Importantly, this dataset offers the first opportunity for action retrieval that considers semantic similarity.

Conclusion and Future Work
We presented our large-scale egocentric dataset EPIC-KITCHENS-100, collected through an annotation pipeline that is scalable and of higher quality than previous approaches. We defined six challenges, providing leaderboard baselines. The dataset and leaderboards are available at http://epic-kitchens.github.io.
These six challenges have been chosen to facilitate progress in open topics within video understanding. They also highlight interesting parts of our collection and annotation pipeline. For example, retrieval uses our free-form captions, while unsupervised domain adaptation for action recognition builds on collecting footage two years later. Our dense annotations of overlapping actions make detection in long untrimmed videos particularly challenging. While this paper addresses each challenge independently, successful methods that address one challenge (e.g. detection) are likely to prove advantageous for better performance in another (e.g. anticipation). Combining all challenges with unsupervised domain adaptation would enable future deployment in new environments without additional labels.
In publishing this manuscript, we hope that researchers will not only utilise this large-scale dataset in their ongoing research, but also build on the novel pipeline used to collect it. The proposed 'pause-and-talk' narrator, which is publicly available, as well as our visually-supported transcription interfaces, can prove advantageous for other large-scale collection efforts.
Data Release Statement: Dataset sequences, extracted frames and optical flow are available under the Non-Commercial Government Licence for public sector information at the University of Bristol data repository: http://dx.doi.org/10.5523/bris.2g1n6qdydwa9u22shpxqzp0t8m Annotations, models, evaluation scripts, challenge leaderboards and updates are available at: http://epic-kitchens.github.io

Appendices A Video Demonstration
We provide a video demonstration of our annotation pipeline and six challenges. Our video utilises a single sequence, showcasing the annotation pipeline first, as the sequence progresses. We demonstrate the 'pause-and-talk' narrator, transcription and translation steps, then parsing and class mapping. We then showcase the two automatic annotations provided with our dataset.
The video also demonstrates predictions for our six challenges. This showcases baseline results, but on a training sequence exhibiting 'near perfect' performance, as opposed to current baseline performance. This aims to highlight the potential of EPIC-KITCHENS-100 and the links between these challenges. Our video demonstration is available at: https://youtu.be/MUlyXDDzbZU

B Further Collection Details
In this section we provide further details of how EPIC-KITCHENS-100 was collected, including a comparison to the annotation pipeline from our previous work [1].
Camera Settings for Collection. A head-mounted GoPro Hero 7 was used for data collection, filming at 50fps with video stabilisation. Our choice of 50fps avoids the overhead light flickering visible in [1], which occurs due to the difference between the frame rate and the national grid frequency.
Narration: 'pause-and-talk' interface. Fig. 12 offers a more detailed look at our proposed 'pause-and-talk' narrator. Annotators had a number of options to help with the recording, including whether or not to hear the audio from the captured video while narrating, and the ability to change the speed of the video. They could also play, redo or delete recordings they had already made.
As mentioned in Section 2, this led to denser and more correct annotations, as annotators were able to pause the video while narrating, avoiding missed annotations of critical actions.
Transcription. Thanks to our 'pause-and-talk' narrator, each audio clip contains a single action narration, whereas previously speech chunks were combined into 30-second clips. In [1], Amazon Mechanical Turk (AMT) workers had to translate and transcribe the audio narration in a single step. To ensure correctness and consistency, we split the transcription step from the translation step. Each non-English transcription was first agreed by multiple annotators, and the full set was then translated in one pass by a hired translator.
Additionally, during the transcription step we provided images centred around the timestamp collected by the 'pause-and-talk' narrator, at {−0.25s, 0s, +0.25s}, to improve context (see Fig. 1b).
Temporal Annotator. Previously, initial start/end times were obtained by automatically aligning captions using the YouTube automatic subtitling API. This is problematic, as it assumes the action length matches the narration length. Here we adopt a different approach, starting from the accurate single timestamps produced by our proposed 'pause-and-talk' narrator. We developed a temporal segment annotation interface (see Fig. 1d), where annotators start from this rough timestamp and annotate the start/end times. We also increased the number of annotators per segment to 5, compared to 4 in [1]. This resulted in higher agreement between annotators.

C Challenges' Implementation Details
In this section we include the implementation and training details for all of the baselines, to enable replication of our results. Additionally, for some challenges, further details are provided, such as the definitions of evaluation metrics.

C.1 Action Recognition
Implementation and Training Details.
We use our publicly available PyTorch [127] model definitions of TSN [66], TRN [67] and TSM [69]. We use ResNet-50 backbones for all models with publicly available initialisations: ImageNet weights for TSN and TRN, and Kinetics weights for TSM. We train two instances of each model: one with 8 RGB frames as input, and the other with 8 stacks of 5 (u, v) flow fields computed using TV-L1 [128]. We use a two-way output in the last layer, one branch to predict verbs and the other to predict nouns, trained with an averaged verb/noun loss. Actions are predicted as the most likely verb-noun combinations, computed by combining the softmaxed verb/noun scores. We train each model for 80 epochs using SGD with momentum 0.9 and a learning rate of 0.01, decayed at epochs 20 and 40 by a factor of 10. TSN and TRN models are trained on 8 GPUs with a batch size of 128, whereas TSM uses a batch size of 64 on 4 GPUs. We apply a weight decay of 0.0005 to all weights in the models, dropout with p = 0.7, and clip gradients above 20. We use center-crop evaluation. The RGB and optical flow models are trained individually, and their predictions are averaged pre-softmax during inference.
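The two-way output and averaged verb/noun loss can be sketched as follows. This is a minimal PyTorch illustration, not the released model definitions: the backbone is stubbed as a feature vector, and the class counts (97 verbs, 300 nouns) and feature dimension are assumed defaults:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoWayHead(nn.Module):
    """Two-way output: one branch predicts verbs, the other nouns.
    The real backbone (TSN/TRN/TSM ResNet-50) is stubbed here."""
    def __init__(self, feat_dim=2048, n_verbs=97, n_nouns=300):
        super().__init__()
        self.verb_fc = nn.Linear(feat_dim, n_verbs)
        self.noun_fc = nn.Linear(feat_dim, n_nouns)

    def forward(self, feats):
        return self.verb_fc(feats), self.noun_fc(feats)

def verb_noun_loss(verb_logits, noun_logits, verb_gt, noun_gt):
    # Average of the verb and noun cross-entropy losses.
    return 0.5 * (F.cross_entropy(verb_logits, verb_gt) +
                  F.cross_entropy(noun_logits, noun_gt))

def action_scores(verb_logits, noun_logits):
    # Score every verb-noun combination as the product of the
    # softmaxed verb and noun probabilities: shape (B, V, N).
    pv = verb_logits.softmax(-1)
    pn = noun_logits.softmax(-1)
    return pv.unsqueeze(2) * pn.unsqueeze(1)
```

The most likely action for a clip is then the argmax over the (verb, noun) grid returned by `action_scores`.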
For TBN, we use the publicly available PyTorch [127] model from [68]. We train using a batch size of 64, 6 segments, and drop the learning rate at epoch 40 and 60. All unspecified hyperparameters remain unchanged.
For SlowFast [70], we use the publicly available PyTorch [127] model. We modify the model to have a two-way output for verbs and nouns, and train it with the averaged verb/noun loss. We use the SlowFast 8x8 ResNet-50 backbone, initialised from the Kinetics pretrained weights provided by [70]. During training, a 1-second clip randomly sampled from the video is used as input to the model. We train for 30 epochs using SGD with momentum 0.9 and a learning rate of 0.01, decayed at epochs 20 and 25 by a factor of 10. The model is trained on 8 GPUs with a batch size of 32, applying a weight decay of 0.0001 to all weights in the model and dropout with p = 0.5. We freeze all batch-normalisation layers' parameters and statistics during training. During testing, we uniformly sample 10 clips (1s each) from each video with a single center crop per clip, and average their predictions.

C.2 Weakly-Supervised Action Recognition
Implementation and Training Details.
We use our publicly available PyTorch [127] code from [82] for both baselines. This uses TSN [66] with a batch-normalised Inception backbone [129], pre-trained on Kinetics-400 [3]. Predictions employ the standard late-fused two-stream approach at test time (the RGB and Flow models are trained independently), using 25 RGB frames (or optical flow stacks) for testing.
Fig. 12: Components of our 'pause-and-talk' annotation tool.
We set a length of 5 seconds for the fixed-length segment baseline. For this baseline, frames are sampled randomly from equally sized segments (as proposed in [66]). For the baseline from [82], training frames are selected using sampling distributions that are iteratively updated. For both baselines we sample 5 frames for training. The ADAM [130] optimiser is used with an initial learning rate of 0.0001, halved twice during training, and we report results after 80 epochs. We changed the parameters from [82] as follows: w = 2.5 seconds and s = 0.75, updating the distributions every 5 epochs with (λ_c, λ_w, λ_s) = (0.5, 0.25, 0.25). We set CL_h = 1 and CL_z = 0.25. Update proposals are generated with τ ∈ {0.5, 0.85}, discarding proposals with a length of less than 10 frames.

C.3 Action Detection
Implementation and Training Details.
We train the Boundary Matching Network (BMN) [99] using the publicly available implementation (https://github.com/JJBOY/BMN-Boundary-Matching-Network) to produce temporal action proposals. BMN is trained using TSN-based features, as in action recognition. As proposed in [99], we rescale the feature sequence of each video to the length of the observation window l_ω. Since the proposed dataset contains videos of different lengths, we choose a large observation window l_ω = 400 and set the maximum action length to D = 400. To limit the amount of memory required at training time, we set the number of sample points to N = 4. We train one model on the Train set for 9 epochs, which maximises performance on Val, and use this model to report on both Val and Test. We apply Soft Non-Maximum Suppression with the parameters suggested in [99] to reduce the number of overlapping proposals, and retain the top-scoring 1,000 instances per video.
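The Soft-NMS step over 1-D temporal proposals can be sketched as follows. This is an illustrative Gaussian-decay variant, not the exact implementation from [99]; the decay parameter `sigma` is an assumption:

```python
import numpy as np

def temporal_iou(a, b):
    """IoU between two 1-D segments a=(start, end), b=(start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(proposals, scores, sigma=0.5, top_k=1000):
    """Gaussian Soft-NMS over temporal proposals (sketch).
    Rather than discarding overlapping proposals, their scores are
    decayed by their overlap with already-selected proposals."""
    remaining = list(zip(proposals, list(scores)))
    kept = []
    while remaining and len(kept) < top_k:
        i = max(range(len(remaining)), key=lambda k: remaining[k][1])
        seg, sc = remaining.pop(i)
        kept.append((seg, sc))
        remaining = [(p, s * np.exp(-temporal_iou(seg, p) ** 2 / sigma))
                     for p, s in remaining]
    return kept
```

Highly overlapping proposals survive with reduced scores, which suits the dense, overlapping actions in our untrimmed videos better than hard suppression.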
Each proposal is then classified using the SlowFast Network with implementation details as in Section C.1. Note that we classify proposals on the validation set using the SlowFast model trained only on the training set, whereas we classify proposals on the test set using the model trained on the union of the training and validation sets.

C.4 Action Anticipation
Implementation and Training Details. We follow our prior work [100], training a TSN model to extract RGB and Flow features using the same hyperparameters recommended in [100]. The RGB model was trained for 95 epochs, while the optical flow branch was trained for 132 epochs; these maximise performance on Val. Object-based features are extracted by running the object detector from [100], trained on manually-annotated object bounding boxes from our previous edition [1]. The RU-LSTM model is trained using the provided implementation with SGD and a fixed learning rate of 0.01. The single-modality RGB, optical flow and object branches are pre-trained with sequence completion for 88, 95 and 98 epochs respectively, then fine-tuned for the anticipation task for 86, 81 and 7 epochs respectively. The full architecture with modality attention is trained for 29 epochs. These maximise performance on Val. All other parameters are kept at their default values in the public code from [100]. The same model is used to report on both Val and Test.
Impact of current action on anticipation. Predicting a future action given the currently observed one provides a strong prior. To assess this, we created three co-occurrence matrices for verbs, nouns and actions. Each matrix M is constructed such that M[i, j] reports the number of times class j is observed after class i in the training set, considering an anticipation time of τ_a = 1. At test time, we rely on the last observed action i to predict the 5 most frequent actions following i (corresponding to the 5 largest values of the i-th row of M). Note that this calculation requires knowledge of the observed action from the ground truth; it thus cannot be considered a baseline, as it cannot be replicated in inference. We found that this oracle knowledge of the current action obtains 20.84%, 25.00% and 8.92% for Top-5 verb, noun and action labels respectively on the validation set.
These numbers are significantly higher than the chance baseline (6.39%, 2.00%, 0.20%) from Table 7, but still lower than those of the RU-LSTM baseline (27.76%, 30.76%, 14.04%). These results suggest that, while the prior is indeed a strong one, as one would hope for meaningful sequences of actions, the considered baseline goes beyond recognising the current action and applying an action sequence prior.
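The co-occurrence oracle described above can be sketched in a few lines. The class IDs and training sequences below are toy stand-ins for the actual annotations:

```python
import numpy as np

def cooccurrence_matrix(sequences, n_classes):
    """M[i, j] counts how often class j is observed after class i
    across the training action sequences."""
    M = np.zeros((n_classes, n_classes), dtype=np.int64)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            M[prev, nxt] += 1
    return M

def top5_next(M, i):
    """The 5 most frequent classes following class i
    (the 5 largest values of the i-th row of M)."""
    return list(np.argsort(-M[i])[:5])

# Toy example: three short action sequences over 4 classes.
M = cooccurrence_matrix([[0, 1, 2], [0, 1, 3], [0, 1, 1]], n_classes=4)
```

At test time, the oracle simply returns `top5_next(M, i)` for the ground-truth current action i, which is why it cannot serve as a legitimate baseline.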

C.5 Unsupervised Domain Adaptation (UDA) for Action Recognition
Validation Splits for Hyper-parameter Tuning.
As the target domain is unlabelled, no labelled data is available for hyper-parameter tuning. Therefore, we split the training data to define Source Val and Target Val splits using data collected by 4 of the 16 participants. Of these, 2 participants are from returning kitchens and 2 from changing kitchens. Source Train and Target Train are thus composed of the 12 remaining participants.
For hyper-parameter tuning, models are trained on labelled data from Source Val and unlabelled data from Target Val. The performance on Target Val can then be used to assess the impact of different hyper-parameters.
To obtain results for the leaderboard and accompanying challenge, a new model is trained on Source Train and unlabelled Target Train, using the hyper-parameters optimised on the validation split. This model is evaluated on Target Test.
Note on zero-shot actions. Due to the unscripted nature of the data collection, a negligible number of verb and noun classes in the target domain are not present in the source domain: 0.2% and 2.3% respectively. We have not removed these, to maintain the same splits used in other challenges. Additionally, 9.46% of target actions (exact verb-noun combinations) do not exist in the source domain; these are referred to as zero-shot actions. Note that it is still possible to predict these actions, as both their verbs and nouns were present in the source domain.
Implementation and Training Details.
We train the TBN feature extractor on the union of Source Train and Source Val. We make these features publicly available. We use the available code from [112] to train and evaluate the 'Source-Only' and 'TA3N' baselines. We modify the code to consider multi-modal input by concatenating the features from all modalities; this automatically increases the number of parameters in the first fully connected layer.
We improve the performance of TA3N by initialising the domain discriminators before the gradients are reversed and back-propagated. In our implementation, the domain discriminators' hyper-parameters are annealed similarly to [131]: η = 2/(1 + exp(−10p)) − 1, where p is the training progress, which linearly increases from 0 to 1. The domain discriminator hyper-parameters are annealed up to the values specified in TA3N, i.e. λ_s = 0.75η, λ_r = 0.5η and λ_t = 0.75η. The weighting of the categorical entropy on the target domain is set to γ = 0.003. Models are trained for 30 epochs with a learning rate of 3×10^−3, reduced by a factor of 10 at epochs 10 and 20.
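The annealing schedule can be sketched as below, assuming the standard sigmoid ramp η = 2/(1 + exp(−10p)) − 1 from [131]; the helper combining it with the TA3N weights is illustrative:

```python
import math

def anneal(p, gamma=10.0):
    """Annealing factor η: rises smoothly from 0 to 1 as the training
    progress p goes from 0 to 1 (gamma=10 as in [131])."""
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0

def discriminator_weights(p):
    """Illustrative helper: the three TA3N discriminator weights
    scaled by the current annealing factor."""
    eta = anneal(p)
    return {"lambda_s": 0.75 * eta,
            "lambda_r": 0.5 * eta,
            "lambda_t": 0.75 * eta}
```

Early in training the discriminator losses contribute almost nothing, so the feature extractor stabilises before adversarial alignment kicks in.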

C.6 Multi-Instance Action Retrieval
Evaluation Metrics. We define the relevance R between a video, x_i, and a caption, c_j, as the averaged Intersection-over-Union of their verb and noun classes:

R(x_i, c_j) = 1/2 ( |x_i^V ∩ c_j^V| / |x_i^V ∪ c_j^V| + |x_i^N ∩ c_j^N| / |x_i^N ∪ c_j^N| )

where x_i^V is the set of verb classes in the video and c_j^N is the set of noun classes in the caption.
The nDCG can be calculated for a query video, x_i, and a ranked list of gallery captions, C_r, as the Discounted Cumulative Gain (DCG) over the Ideal Discounted Cumulative Gain (IDCG):

nDCG(x_i, C_r) = DCG(x_i, C_r) / IDCG(x_i, C_r)

with the DCG given by:

DCG(x_i, C_r) = Σ_{j=1}^{|C_r|} R(x_i, c_j) / log_2(j + 1)

To calculate IDCG(x_i, C_r), we need the ground-truth ranking between video x_i and the captions C_r. To do this, we first find the relevance between video x_i and every caption in C_r: {R(x_i, c_j); ∀c_j ∈ C_r}. We then construct Ĉ_r, the ground-truth ranking of captions, by sorting these in descending order of relevance. Note that if R(x_i, c_j) = R(x_i, c_k), then c_j and c_k are ordered by their unique ID due to the stable sort used, and similarly for the method being evaluated. Finally, the IDCG is calculated as IDCG(x_i, C_r) = DCG(x_i, Ĉ_r). nDCG is similarly defined for a query caption, c_i, and a gallery set of videos, X_r.
Implementation and Training Details. For video features we use 25 RGB, Flow and Audio features extracted uniformly with TBN [68]. We make these features publicly available. Features from each modality are temporally averaged and then concatenated to provide the final feature vector for each video, of size 3072. Text features come from word2vec [41] trained on the Wikipedia corpus, with an embedding size of 100.
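The nDCG computation over a ranked list of relevance values can be sketched as follows (a minimal illustration operating directly on precomputed relevances, rather than on videos and captions):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain over a ranked list of relevances:
    sum of R at rank j discounted by log2(j + 1)."""
    return sum(r / math.log2(j + 1) for j, r in enumerate(relevances, start=1))

def ndcg(ranked_relevances):
    """nDCG = DCG of the predicted ranking over the DCG of the ideal
    (descending-relevance) ranking of the same items."""
    ideal = sorted(ranked_relevances, reverse=True)
    idcg = dcg(ideal)
    return dcg(ranked_relevances) / idcg if idcg > 0 else 0.0
```

A ranking that is already sorted in descending relevance scores exactly 1.0; any inversion lowers the score, which is what makes the metric sensitive to non-binary relevance.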
The MLP baseline uses a 2-layer perceptron which projects both the visual and textual features into the same embedding space. We set the final embedding size to 512; the hidden layers have 1280 and 78 units for the visual and textual streams respectively (halfway between the initial feature size and the output space size). MLP is trained for 100 epochs with a batch size of 64 and a learning rate of 0.01. Triplets are sampled randomly using the semantic relevance used when calculating mAP/nDCG (i.e. identical verb and noun classes), with triplets being re-sampled every 10 iterations. The triplet loss terms for all four pairs of modalities are weighted 1.0, apart from the text-to-visual term, which is assigned a weight of 2.0.
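The cross-modal triplet loss underlying both baselines can be sketched as below. This is a generic triplet margin formulation over already-embedded features, with the margin value assumed; it is not the released training code:

```python
import torch
import torch.nn.functional as F

def cross_modal_triplet(anchor, positive, negative, margin=1.0):
    """Triplet margin loss between an anchor embedding from one modality
    (e.g. video) and positive/negative embeddings from the other
    (e.g. captions). Margin value is an assumption."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```

In training, this loss is applied to all four modality pairings (video-to-text, text-to-video, and the two within-modality pairs), with positives drawn from items sharing the same verb and noun classes.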
We use our public code for JPoSE [121]. Each Part-of-Speech embedding is modelled on the MLP baseline, but uses the part-of-speech relevancies defined in [121] (e.g. for the verb embedding, the verb classes of the two captions must be the same). The final embeddings are concatenated and fed into a final fully connected layer with shared weights for the action embedding. The verb and noun embedding spaces have an output size of 256, with the resulting action embedding space having an output size of 512. Triplets are independently re-sampled (randomly) every 10 epochs. A batch size of 64 is used with a learning rate of 0.01, and the model is trained for 100 epochs.