1 Introduction and Related Datasets

Since the dawn of machine learning for computer vision, datasets have been curated to train models for single tasks, from classification (Deng et al. 2009; Carreira and Zisserman 2017) to detection (Lin et al. 2014; Gu et al. 2018), captioning (Karpathy and Fei-Fei 2015; Xu et al. 2016) and segmentation (Zhou et al. 2017; Perazzi et al. 2016). Increasingly, datasets have also been used for novel tasks, through pre-training (He et al. 2019; Mettes et al. 2016), self-supervision (Noroozi and Favaro 2016; Vondrick et al. 2018) or additional annotations (Gupta and Malik 2016; Heilbron et al. 2018). However, task adaptation demonstrates that models overfit to the data and annotations they were trained on (Zhai et al. 2019; Moltisanti et al. 2017).

Alternatively, one dataset can be enriched with multiple annotations and tasks, aimed towards learning intermediate representations through downstream and multi-task learning on the same input. This has been recently achieved for autonomous driving (Zhou et al. 2019; Geiger et al. 2012; Cordts et al. 2016; Neuhold et al. 2017; Yu et al. 2018; Huang et al. 2018; Caesar et al. 2019; Yogamani et al. 2019) and scene understanding (Zamir et al. 2018; Silberman et al. 2012). For example, the dataset of Zamir et al. (2018) contains 26 tasks ranging from edge detection to vanishing point estimation and scene classification.

In comparison, the number of tasks proposed for action and activity understanding datasets (Damen et al. 2018; Gu et al. 2018; Heilbron et al. 2015; Rohrbach et al. 2015; Zhou et al. 2017; Rohrbach et al. 2012) remains modest. Often, this is limited by the source of videos in these datasets. YouTube (Heilbron et al. 2015; Zhou et al. 2017) and movies (Gu et al. 2018; Rohrbach et al. 2015) typically contain curated videos, with edited shots. However, attempts to define multiple challenges for these datasets have been exemplary. ActivityNet (Heilbron et al. 2015) is the most popular video challenge, evaluated for localisation, dense captioning (Krishna et al. 2017) and object detection (Zhou et al. 2019). Similarly, AVA (Gu et al. 2018) has challenges on action localisation and active speaker detection (Roth et al. 2019).

Fig. 1

Left: Frames from EPIC-KITCHENS-100 showcasing returning participants with returning or changing kitchens (top) as well as new participants (bottom). Right: Comparisons between recordings from [1] and newly collected videos, with selected frames showcasing the same action. Note object location differences in ‘returning’ kitchens (e.g. microwave relocated). We show the same action performed in ‘changing’ kitchens (e.g. same participant preparing pizza or filtered coffee in a new kitchen)

Several leading egocentric datasets (Pirsiavash and Ramanan 2012; Damen et al. 2014; Fathi et al. 2012; De La Torre et al. 2008; Li et al. 2015) showcased the unique perspective and potential of first-person views for action recognition, particularly hand-object interactions. In 2018, the introduction of the largest-scale dataset EPIC-KITCHENS (Damen et al. 2018) transformed egocentric vision, not only due to its size, but also due to the unscripted nature of its collection and the scalability of the collection pipeline. In this paper, we present EPIC-KITCHENS-100, a substantial extension which brings the total footage to 100 hours, capturing diverse unscripted and unedited object interactions in people's kitchens. As shown in Fig. 1, the actions capture hand-object interactions with everyday objects in participants' kitchens. The unscripted nature of the dataset results in naturally unbalanced data, with novel compositions of actions in new environments. While challenging, the dataset is domain-specific (i.e. kitchen-based activities), offering opportunities for exploiting domain knowledge. We offer two-level annotations for nouns and verbs in interactions (e.g. “carrot/courgette \(\rightarrow \) vegetable”, “put/throw/drop \(\rightarrow \) leave”) to utilise such priors.

Importantly, we propose a refined annotation pipeline that results in denser and more complete annotations of actions in untrimmed videos. This pipeline enables various tasks on the same dataset; we demonstrate six in Sect. 4, with baselines and evaluation metrics that focus on understanding fine-grained actions and offer benchmarks which can support research into better modelling of video data.

Fig. 2

Annotation pipeline: a narrator, b transcriber, c temporal segment annotator and d dependency parser. Red arrows show AMT crowdsourcing of annotations (Color figure online)

2 Data Collection and Scalable Pipeline

In this section, we detail our collection and annotation effort.

Data Collection We collect additional footage as follows: we contacted participants from EPIC-KITCHENS-55 to record further footage. Of the 32 participants in Damen et al. (2018), 16 subjects expressed interest in participating. Interestingly, half of these (8 subjects) had moved home over the past two years. We also recruited 5 additional subjects, increasing the total number of subjects and kitchen environments to 37 and 45 respectively. All participants were asked to collect 2–4 days of their typical kitchen activities, as in Damen et al. (2018). We collect footage using a head-mounted GoPro Hero7 Black. This is two generations newer than the camera used in EPIC-KITCHENS-55, with a built-in HyperSmooth video stabilisation feature. Sample frames are shown in Fig. 1, with selected frames of the same action in returning and changing kitchens.

Annotation Pipeline An overview of the pipeline can be seen in Fig. 2.

(a) Narrator Previously, for EPIC-KITCHENS-55, we used a non-stop narration approach, where each participant narrated their previous action while watching future actions in the running video. We found this resulted in increased mental load and some actions being missed or misspoken. To improve upon this approach, we take inspiration from Gygli and Ferrari (2019), where objects in images are annotated by pointing and speaking, and propose temporal ‘pointing’, which we refer to as ‘pause-and-talk’. By allowing participants to pause the video to speak, as well as take breaks, we aim to increase the accuracy and density of the annotated actions whilst still allowing for a scalable narration approach. We built an interface to facilitate collecting such narrations from the participants (Fig. 2a), which includes a video player synced with audio recordings. Participants watch the video and press a key to pause while they narrate the action in their native language. As previously observed in Damen et al. (2018), using the native language ensures the narrations use the correct vocabulary in describing the actions. The video restarts on key release. Note that the narrator still watches the video only once, maintaining the targeted scalability of the annotation pipeline, but without the mental overload of narrating past actions while watching future actions. This allows short and overlapping actions to be captured, in addition to enabling error correction, as participants can listen to, delete or re-record a narration. Figure 2a shows an ongoing narration, demonstrating density (ticks on the slider).

(b) Transcriber We perform transcription of the audio narrations, followed by translation (if applicable): first, we transcribe all narrations and then translate the unique transcriptions into English using a hired translator, for correctness and consistency. The approach we used to transcribe narrations in Damen et al. (2018) had issues where workers failed to understand some audio narrations due to the lack of any visual information. To mitigate this, we build a new transcriber interface containing three images sampled around the narration timestamp (Fig. 2b). We find that the images increase worker agreement and alleviate issues with homonyms (e.g. ‘flower’ and ‘flour’). Each narration is transcribed into a caption by 3 Amazon Mechanical Turk (AMT) workers, using a consensus of 2 or more workers. Transcriptions were automatically rejected if the cosine similarity between the Word2Vec (Mikolov et al. 2013) embeddings of the workers’ captions was lower than an empirical threshold of 0.9. When AMT workers failed to agree, the correct transcription was selected manually. Captions were then spell-checked, and substitutions were applied from a curated list of problematic words (e.g. ‘hob’ and ‘hop’), further reducing errors.
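The agreement check can be illustrated with a short sketch. The snippet below is a minimal illustration, assuming a pre-loaded word-to-vector lookup (`word_vec`, a hypothetical name) and pairwise comparison of worker captions; the exact aggregation used in our pipeline may differ.

```python
# Minimal sketch of the automatic agreement check, assuming `word_vec` maps
# words to numpy vectors (e.g. loaded from pre-trained Word2Vec embeddings).
import itertools
import numpy as np

THRESHOLD = 0.9  # empirical threshold reported in the paper

def caption_embedding(caption, word_vec, dim=300):
    """Average the word vectors of a caption (zero vector if no word is known)."""
    vecs = [word_vec[w] for w in caption.lower().split() if w in word_vec]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def workers_agree(captions, word_vec):
    """True if at least one pair of worker transcriptions is similar enough."""
    embs = [caption_embedding(c, word_vec) for c in captions]
    return any(cosine(a, b) >= THRESHOLD
               for a, b in itertools.combinations(embs, 2))
```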

(c) Parser We use spaCy (Honnibal and Montani 2017) to parse the transcribed captions into verbs and nouns (Fig. 2c) and manually group these into minimally overlapping classes as we did in our previous work. We reworked this to improve parsing of compound nouns and missing verbs/nouns. Additionally, all annotations (including those we collected previously from EPIC-KITCHENS-55) were re-parsed using the updated pipeline. To cluster the verbs and nouns, we adjust previous clusters to reduce ambiguities between classes. For example, we group ‘brush’ and ‘sweep’ into one verb class, and introduce noun classes that did not exist before such as ‘lentils’.
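As a rough illustration (not the exact reworked parser, which additionally handles compound nouns, missing verbs/nouns and the manual grouping into classes), a caption can be parsed with spaCy as follows; the model name below is the standard small English model and is an assumption.

```python
# Illustrative sketch of extracting verbs and nouns from a transcribed caption.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; any English pipeline works

def parse_caption(caption: str):
    doc = nlp(caption)
    verbs = [tok.lemma_ for tok in doc if tok.pos_ == "VERB"]
    nouns = [tok.lemma_ for tok in doc if tok.pos_ in ("NOUN", "PROPN")]
    return verbs, nouns

# parse_caption("wash the cup") is expected to give (['wash'], ['cup']);
# compounds such as "frying pan" need the extra handling described above.
```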

(d) Temporal Annotator We built an AMT interface for labelling the start/end times of action segments (Fig. 2d). Annotators completed a quick tutorial on annotating temporal bounds before labelling 10 consecutive actions. To create the bounds of each action segment, we use the same approach as previously, but increase the number of workers from 4 to 5 to improve quality. Note that untrimmed videos may contain consecutive instances of the same action, indicated by repeated narrations. We therefore ask annotators to mark the temporal bounds of each instance, prompted by its timestamp, which avoids merging instances of the same action.

Fig. 3

Comparing non-stop narrations (blue) to ‘pause-and-talk’ narrations (red). Right: timestamps (dots) and segments (bars) for two sample sequences. ‘pause-and-talk’ captures all actions including short ones. Black frames depict missed actions (Color figure online)

Quality Improvements Our EPIC-KITCHENS-100 scalable pipeline focuses on denser and more accurate annotations. We compare different parts of the pipeline to our previous one in Appendix B. Here, we show improved quality of annotations both numerically and through an example.

Figure 3 (left) compares the narration method we used in Damen et al. (2018) to the new pipeline over several metrics. Our ‘pause-and-talk’ narrator produces more densely annotated videos, with fewer gaps and more labelled frames; actions are shorter and exhibit higher overlap. The narration timestamps are also closer to the relevant action, with a higher percentage contained within the action and a smaller distance for the remaining timestamps that fall outside it.

Fig. 4

Frequency of verbs (top) and nouns (bottom), grouped by category. Each bar is linearly split: solid represents instances from newly-collected videos and washed-out from original videos

Figure 3 (right) shows two video sections, of equal length, annotated by the same participant, one using non-stop narrations and the other using ‘pause-and-talk’. The number of annotated actions increased from 20 to 56, with short actions (such as ‘turn on tap’) often missed in the previous pipeline. We demonstrate this through two examples: the first shows a missed action of picking up a dropped bag off the floor, and the second a missed closing-cupboard action. In the ‘pause-and-talk’ sequence, all actions, including closing the cupboard, were successfully narrated. By narrating more actions, the start/end times also become more accurate, as it is more obvious to the AMT annotators what each narration refers to.

3 Statistics, Scalability and the Test of Time

EPIC-KITCHENS-100 contains 89,977 segments of fine-grained actions annotated from 700 long videos. Footage length amounts to 100 hours. Table 1 lists the general statistics, separating the previously collected videos from the newly collected ones. Note that all previous narrations have been re-parsed using the new pipeline (Fig. 2b–d). EPIC-KITCHENS-100 rescales our previous dataset to almost double its size, with 1.8x the hours and 2.3x the action segments. Comparisons to other datasets are presented under the relevant benchmarks in Sect. 4.

In Fig. 4 we show the frequency of verb (97) and noun (300) classes in the dataset. These are grouped into categories (13 verb and 21 noun categories), sorted by size. For example, the verbs ‘wash’, ‘dry’, ‘scrape’, ‘scrub’, ‘rub’, ‘soak’ and ‘brush’ are grouped into a ‘clean’ verb category. The plots show a clear long-tail distribution. The contribution of each class from the source videos (Damen et al. 2018) and the extension is also shown. New verb classes (e.g. ‘lock’, ‘bend’) and noun classes (e.g. ‘orange’ and ‘hoover’) are only present in the newly-collected videos.

Fig. 5

Top: Sample Mask R-CNN detections of large objects (col1: oven), hands (labelled person), smaller objects (col2: knife, carrot, banana, col3: clock, toaster, col4: bottle, bowl), incorrect labels of visually ambiguous objects (col3: apple vs onion) and other incorrect labels (col3: mouse, col4: chair). Bottom: Sample hand-object detections from Shan et al. (2020). L/R = Left/Right, P = interaction with portable object, O = object. Multiple object interactions are detected (col2: pan and lid, col4: tap and kettle)

Fig. 6

Test of time and scalability test results

We enrich our dataset with automatic spatial annotations using two models. The first is Mask R-CNN (He et al. 2017) trained on MSCOCO (Lin et al. 2014). The second is the hand-object interaction model from Shan et al. (2020), trained on 100K images from YouTube along with 42K images from three egocentric datasets (Damen et al. 2018; Sigurdsson et al. 2018; Li et al. 2015), of which 18K are from our videos (Damen et al. 2018). It detects interacted static and portable objects as an offset to hand detections. Example annotations are shown in Fig. 5, and the number of annotations is given in Table 1. While we do not use these annotations to report results, we believe these 66M masks, 31M hand and 38M object bounding boxes could facilitate future models for spatial (or spatio-temporal) attention.
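As a rough sketch of how such per-frame annotations could be reproduced, the snippet below runs a COCO-pretrained Mask R-CNN from torchvision on a single frame; this approximates the setup of He et al. (2017), and the score threshold and model variant are assumptions rather than the exact configuration used to generate the released annotations.

```python
# Sketch: COCO-pretrained Mask R-CNN on a single frame (torchvision >= 0.13).
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def annotate_frame(path, score_thresh=0.5):
    img = to_tensor(Image.open(path).convert("RGB"))
    out = model([img])[0]                      # boxes, labels, scores, masks
    keep = out["scores"] >= score_thresh       # assumed confidence threshold
    return {k: v[keep] for k, v in out.items()}
```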

Splits We split the videos into Train/Val/Test with a ratio of roughly 75/10/15. Each video, with all its action segments, is present in exactly one of the splits, and the Test split contains only newly-collected videos. We use re-parsed videos from the original EPIC-KITCHENS test sets as the new validation set. Our Val/Test splits contain two interesting subsets, which we report on separately:

  • Unseen Participants Our Val and Test splits contain participants not present in Train: 2 participants in Val, and another 3 participants in Test. These contain 1,065 and 4,110 action segments respectively. This subset helps evaluate the generalisability of the models across the various benchmarks.

  • Tail Classes We define these (for verbs and nouns) to be the set of smallest classes whose instances account for 20% of the total number of instances in training. A tail action class contains either a tail verb class or a tail noun class. These are 86/228/3,729 verb/noun/action classes.
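For concreteness, the tail-class selection can be computed directly from training-label counts, as in the following sketch (tie-breaking between equally sized classes is an assumption here).

```python
# Minimal sketch of the tail-class definition: sort classes by training
# frequency and take the smallest ones until they cover 20% of all instances.
from collections import Counter

def tail_classes(train_labels, fraction=0.2):
    """train_labels: iterable of class ids (verbs or nouns) over training segments."""
    counts = Counter(train_labels)
    budget = fraction * sum(counts.values())
    tail, covered = set(), 0
    for cls, n in sorted(counts.items(), key=lambda kv: kv[1]):  # smallest first
        if covered >= budget:
            break
        tail.add(cls)
        covered += n
    return tail
```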

Scalability and the Test of Time As we rescale EPIC-KITCHENS with additional videos, we carry out two investigations: (a) how models trained on videos from Damen et al. (2018) perform on videos collected two years later, and (b) how models’ performance scales with additional annotated data. We call these the test of time and the scalability tests respectively.

Figure 6 includes results for both investigations, evaluated on the task of action recognition (definition and models from Sect. 4.1). We separate overall results (left) from unseen participants (right). For all models, comparing the first two bars demonstrates that models trained solely on videos from Damen et al. (2018) do not withstand the test of time: for the same model, performance drops significantly when evaluated on new data. This highlights a potential domain gap, which we discuss next. We assess scalability by gradually adding new data in training. Results demonstrate a significant improvement, albeit saturating once 50% of the new data is added, particularly for unseen participants. This highlights the need for better models and more diverse data rather than merely more data, and is particularly evident for unseen participants, whose data benefits even less from scaling. We tackle the gap to new environments and participants next.

Table 1 Statistics of EPIC-KITCHENS-100 and its Train/Val/Test splits

Unravelling the Domain Gap As defined in the early work on speech recognition (Ueberla 1997), “A domain D is a (often infinite) set of samples such that each sample satisfies a property \(P_D\)”. A domain gap is present when at least one property differs between the samples of two domains. Domain gaps have been a frequent source of frustration for a wide range of learning tasks, where models are trained on samples from one domain and thus under-perform when deployed in a different domain. This is also known as sample-selection bias (Heckman 1979). Sampling bias is a common cause of a domain gap between datasets, and it cannot easily be removed during dataset collection, as noted by Torralba and Efros (2011). The most obvious domain gaps stem from changes in locations (Oberdiek et al. 2020), viewpoints (Zhai et al. 2017), labels (Hsu et al. 2020) and participants (Stein and McKenna 2013). However, there are often more subtle causes, such as differences in capture methodology (Saenko et al. 2010) or changes in objects, environments and actions over time.

The concept of a compound domain gap has recently been introduced in Liu et al. (2020), where the target domain is a compound of multiple domains without domain labels. As stated by Liu et al. (2020), this is a more realistic scenario resulting from unconstrained data collection. In EPIC-KITCHENS-100, each video in the extension offers a compound domain gap due to changes in one or more of the following properties:

  • Hardware and capturing as in Saenko et al. (2010); Gong et al. (2012). Extended footage uses a newer camera model with onboard video stabilisation.

  • Locations as in Oberdiek et al. (2020). As indicated in Sect. 2, eight subjects have moved home, resulting in changed surroundings while many objects and tools keep their appearance. Additionally, unseen participants capture footage in new environments where the appearance of both objects and surroundings differs.

  • Participants as in Stein and McKenna (2013). The extension contains hand appearances and individual behaviours that are not present in the original footage.

  • Short-term temporal offsets as in Wulfmeier et al. (2018), where time-of-day can affect scene lighting, and some background objects change position (e.g. on the counter for one video, put away in a cupboard for a later video).

  • Long-term temporal offsets as in Carlevaris-Bianco et al. (2016); Maddern et al. (2017). EPIC-KITCHENS-100 is filmed 2 years after EPIC-KITCHENS-55. In the same environment, changes such as wear and tear, new objects and different object positions are observed (see Fig. 1 right). Participant behaviour can also change over time.

While we have domain labels for some of these properties (e.g. recording camera, location, time-of-day and participant ID), other property changes can vary between samples, without associated labels. It is particularly difficult to associate labels with changes in behaviour or object appearances, for example. We publish these properties with the dataset when present. Importantly, we explore this compound domain gap, without using property labels, using a new challenge on unsupervised adaptation for action recognition (Sect. 4.5).

4 Challenges and Baselines

In this section, we define 6 challenges on our dataset, two modified from Damen et al. (2018), namely action recognition (Sect. 4.1) and anticipation (Sect. 4.4). We introduce four new challenges: weakly-supervised action recognition (Sect. 4.2), action detection (Sect. 4.3), unsupervised domain adaptation for action recognition (Sect. 4.5) and action retrieval (Sect. 4.6). While many works have addressed one or more of these challenges, they are typically explored using different datasets. Our annotation pipeline (from captions and single timestamps to segments and classes, Fig. 2) can be used to define multiple challenges, potentially jointly. In this section, we only scratch the surface by reporting on each challenge independently. For readability, we include all implementation details in Appendix C, and we publish all our baseline models and evaluation scripts.

4.1 Action Recognition

Definition

As in Damen et al. (2018), we consider a video segment defined by its start and end frames \((t_s, t_e)\) in a video. We aim to predict (\({\hat{v}}, {\hat{n}}, {\hat{a}}\)) as the verb/noun/action classes of the action in this segment. We consider overlapping segments independently.

Related Datasets Several datasets have been collected to focus on action recognition, from Soomro et al. (2012), Kuehne et al. (2011) to recent large-scale ones (Gu et al. 2018; Kay et al. 2017; Monfort et al. 2020; Goyal et al. 2017; Zhao et al. 2019; Sigurdsson et al. 2016), all offering a challenge with a held-out test set. In Table 2, we compare EPIC-KITCHENS-100 to these non-egocentric datasets across a range of facets. Ours is the only dataset of unscripted activities, of comparable size to those collected from scripted or curated (YouTube) videos.

Evaluation Metrics We report Top-1/5 Accuracy on Val and Test sets.
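Both metrics are straightforward to compute from class scores; a minimal sketch follows (the array shapes are assumptions, not tied to any particular baseline).

```python
# Top-1/Top-5 accuracy from `scores` of shape (num_segments, num_classes)
# and integer ground-truth labels of shape (num_segments,).
import numpy as np

def topk_accuracy(scores, labels, k=5):
    topk = np.argsort(scores, axis=1)[:, -k:]             # k highest-scoring classes
    correct = (topk == np.asarray(labels)[:, None]).any(axis=1)
    return float(correct.mean())

# top1 = topk_accuracy(scores, labels, k=1)
# top5 = topk_accuracy(scores, labels, k=5)
```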

Baselines and Results In Table 3, we report results of five state-of-the-art recognition models (Wang et al. 2016; Zhou et al. 2018; Kazakos et al. 2019; Lin et al. 2019; Feichtenhofer et al. 2019) in addition to a random chance baseline. We use the Train set to report on Val, optimising hyper-parameters. We then fix these, and train on both the Train and Val sets in order to report on the Test set. Figure 7 shows success and failure examples, using examples from the Val set.

Table 2 A comparison of EPIC-KITCHENS-100 against popular action recognition datasets. a = Action, v = Verb, n = Noun, c = caption, ML-a = Multi-Label Action
Table 3 Action recognition results on Val (using Train) and Test (using Train+Val)
Table 4 Characteristics of popular datasets related to our challenges: weakly-supervised action recognition (WS), anticipation (Ant.) and detection (Det.)

4.2 Weakly-Supervised Action Recognition

Definition

As in Sect. 4.1, the goal is to recognise the action, i.e. predict \(({\hat{v}}, {\hat{n}}, {\hat{a}})\), in a trimmed action segment during testing. Distinctly, we use single timestamps instead of temporal boundaries during training. Let \({\mathcal {A}}=(A_i)_{i=1}^N\) be the action instances contained in an untrimmed training video, where each \({A_i=(t, v, n, a)}\) is labelled with a single timestamp t roughly located around the action, along with its verb/noun/action classes. We utilise the narration timestamps from our collection pipeline as t.

Related Datasets and Types of Supervision Previous weakly-supervised approaches utilised video-level or transcript supervision, where the set (Wang et al. 2017; Singh and Lee 2017; Nguyen et al. 2018; Liu et al. 2019; Nguyen et al. 2019; Narayan et al. 2019) or sequence (Bojanowski et al. 2014; Huang et al. 2016; Ding and Xu 2018; Richard et al. 2018; Chang et al. 2019; Li et al. 2019) of actions in the video is used in training, without temporal bounds. Table 4 compares EPIC-KITCHENS-100 to datasets used with weak supervision. When considering the number of classes (and instances) per video, EPIC-KITCHENS-100 offers a significant challenge. For example, ActivityNet (Heilbron et al. 2015) videos contain 1 class and 1.5 action instances on average, whereas EPIC-KITCHENS-100 videos contain 53.2 classes and 128.5 instances. Video-level supervision is only sufficient for datasets with a few classes per video (Heilbron et al. 2015; Jiang et al. 2014), while transcript supervision (Marszalek et al. 2009; Kuehne et al. 2014) expects no overlap between actions. Both types of weak supervision are thus insufficient in our case.

Alternatively, single-timestamp supervision is gaining popularity due to its balance of scalability and performance (Moltisanti et al. 2019; Bearman et al. 2016; Mettes et al. 2016; Chéron et al. 2018). We follow this trend as it fits naturally with our narration timestamps collected using ‘pause-and-talk’.

Evaluation Metrics We follow the same metrics as in Sect. 4.1.

Baselines and Results We consider two baselines. The first, “Fixed segment”, uses a segment of fixed length centred on the timestamp. The second is our previous work (Moltisanti et al. 2019), in which sampling distributions used to select training frames from the untrimmed videos are initialised from the single timestamps and refined based on the classifier’s response. Both are trained end-to-end using a TSN backbone (Wang et al. 2016); results are shown in Table 5. Moltisanti et al. (2019) improves over the fixed-segment baseline by 1–3% top-1 accuracy across Val and Test. The fully supervised upper bound is TSN, reported in Table 3. Comparatively, weak supervision performs 11% worse than strong supervision on top-1 action accuracy in both Val and Test. Using roughly aligned single timestamps is challenging when actions are short and overlapping. EPIC-KITCHENS-100, with its dense actions, provides an interesting benchmark to develop new models for weak supervision.
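A minimal sketch of the “Fixed segment” baseline is given below; the segment length and the clipping to video bounds are assumptions, not the exact values used in our experiments.

```python
# Each narration timestamp t becomes a training segment of fixed length
# centred on t (length in seconds is an assumed illustrative value).
def fixed_segment(t, video_duration, length=2.0):
    half = length / 2.0
    start = max(0.0, t - half)
    end = min(video_duration, t + half)
    return start, end
```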

Table 5 Weakly-supervised action recognition results
Table 6 Temporal action detection results in mAP (%)

4.3 Action Detection

Definition All other challenges in Sect. 4 consider a trimmed segment \((t_s, t_e)\) from the test video as input. This assumption is limiting, as labelled start/end times of actions are unlikely to be present for new test videos. In this challenge, we aim to detect and recognise all action instances within an untrimmed video, as in Heilbron et al. (2015). Given a video, we predict the set of all action instances \(\hat{{\mathcal {A}}}=\{{{\hat{A}}}_i\}_{i=1}^M\), where \({{\hat{A}}}_i = ({{\hat{t}}}_s, {{\hat{t}}}_e, {{\hat{v}}}, {{\hat{n}}}, {\hat{a}})\) is an action detection tuple including the predicted start and end times \(({{\hat{t}}}_s, {{\hat{t}}}_e)\) and the predicted classes \(({{\hat{v}}}, {{\hat{n}}}, {\hat{a}})\). During training, we use the set of ground-truth action annotations \({\mathcal {A}}\). Note that the ground-truth \({\mathcal {A}}\) and predicted \(\hat{{\mathcal {A}}}\) sets can be of different sizes. This definition is closely related to temporal segmentation (Lea et al. 2017), but segmentation assumes non-overlapping segments and is thus unsuitable for EPIC-KITCHENS-100.

Related Datasets Table 4 compares EPIC-KITCHENS-100 to popular datasets for temporal action detection and segmentation. EPIC-KITCHENS-100 presents the largest challenge when considering the combined metrics of average video length, average instances per video and overlapping instances. Compared to datasets with overlapping segments, it has a larger number of instances per video and is also longer (in hours) than all datasets with higher average instances per video.

Evaluation Metrics In line with Heilbron et al. (2015), we use mean Average Precision (mAP), computed as the average of the per-class AP values. A predicted segment matches a ground-truth segment if their Intersection over Union (IoU) is greater than or equal to a given threshold; we report mAP at thresholds ranging from 0.1 to 0.5.
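The matching criterion reduces to a temporal IoU between 1D segments, sketched below; a full mAP implementation additionally ranks predictions by confidence and matches each ground-truth segment at most once, as in the ActivityNet evaluation code.

```python
# Temporal IoU between a predicted and a ground-truth segment (start, end).
def segment_iou(pred, gt):
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

print(segment_iou((1.0, 4.0), (2.0, 6.0)))   # 2.0 / 5.0 = 0.4
```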

Baselines and Results We consider a two-stage baseline. Action proposals are first obtained using Boundary Matching Networks (BMN) (Lin et al. 2019), which are then classified using SlowFast (Feichtenhofer et al. 2019) (model trained as in Sect. 4.1). Results in Table 6 highlight that action detection is particularly challenging on this dataset, especially at higher IoU thresholds. The qualitative example in Fig. 8 shows that EPIC-KITCHENS-100 videos contain actions of widely varying lengths, which adds a further challenge.

Fig. 7

Qualitative action recognition results for various baselines

Fig. 8

Qualitative results of action detection. Predictions with confidence \(> 0.5\) are shown with colour-coded class labels (see legend). Since the baseline predicts overlapping segments, the predictions are displayed over four rows for ease of viewing

Table 7 Action anticipation results reported in class-mean top-5 recall (%)
Fig. 9

Qualitative action anticipation results

4.4 Action Anticipation

Definition We aim to predict (\({\hat{v}},{\hat{n}},{\hat{a}}\)) as the verb/noun/action classes of the action, by observing a video segment of arbitrary duration \(\tau _o\) seconds (observation time) ending \(\tau _a\) seconds (anticipation time) before the action’s start, \(t_s\). We set \(\tau _a=1\). We expect models addressing this task to reason on observed sequences of actions, the current state of the world (e.g., what objects are visible) and the possible goal of the camera wearer.
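The anticipation protocol can be summarised in a few lines; a minimal sketch of computing the observed window from an action’s start time follows (the clipping at the start of the video is an assumption).

```python
# The model only sees an observed window of tau_o seconds ending tau_a = 1s
# before the action starts at t_start.
def observation_window(t_start, tau_o, tau_a=1.0):
    obs_end = t_start - tau_a
    obs_start = max(0.0, obs_end - tau_o)
    return obs_start, obs_end
```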

Related Datasets Table 4 also compares EPIC-KITCHENS-100 with other datasets used for action anticipation (Rohrbach et al. 2012; Patron-Perez et al. 2010; De Geest et al. 2016; Jiang et al. 2014; Kuehne et al. 2014; Stein and McKenna 2013; Li et al. 2015). Our dataset is the largest in hours and classes, and is unscripted, which is critical for meaningful anticipation models and for in-the-wild testing.

Evaluation Metrics We report results using class-mean top-5 recall (Furnari et al. 2018). The top-k criterion accounts for uncertainty in future predictions, as with previous anticipation efforts (Koppula and Saxena 2016; Lee et al. 2017; Bhattacharyya et al. 2019). Class-mean allows for balancing the long-tail distribution.
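A minimal sketch of the class-mean top-5 recall metric follows, assuming class scores of shape (num_segments, num_classes) and averaging over the classes present in the evaluated set.

```python
# Class-mean top-k recall: per-class top-k recall averaged over classes,
# which balances the long-tail distribution.
import numpy as np

def class_mean_topk_recall(scores, labels, k=5):
    labels = np.asarray(labels)
    topk = np.argsort(scores, axis=1)[:, -k:]
    hits = (topk == labels[:, None]).any(axis=1)
    recalls = [hits[labels == c].mean() for c in np.unique(labels)]
    return float(np.mean(recalls))
```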

Baselines and Results We use our prior work RU-LSTM (Furnari and Farinella 2020) as a baseline. In Table 7, RU-LSTM performs better for nouns than for verbs, and the results show that tail classes are particularly challenging for anticipation. Figure 9 demonstrates that the baseline struggles when the next active noun/verb is ambiguous.

4.5 Unsupervised Domain Adaptation for Action Recognition

Definition Unsupervised Domain Adaptation (UDA) utilises a labelled source domain and learns to adapt to an unlabelled target domain. We use videos recorded in 2018 as the labelled source, and newly collected videos as the unlabelled target (i.e. without any of the accompanying annotations). The action recognition task itself follows the definition in Sect. 4.1. The difficulty of this challenge stems from the fact that the source and target domains come from distinct distributions, as the target videos were collected two years later. Changes in location, hardware and long-term temporal offsets are the main sources of the domain shift (see Sect. 3). A method which performs this task well provides a number of practical benefits, most notably eliminating labelling time and expense when collecting new videos in the future.

Related Datasets UDA datasets have traditionally used images (Saenko et al. 2010; Venkateswara et al. 2017; Peng et al. 2017, 2019), with recent attempts to use video (Jamal et al. 2018; Chen et al. 2019; Qi et al. 2018) adapting across public datasets (e.g. UCF to Olympics). EPIC-KITCHENS-100 is the first to propose a within-dataset domain adaptation challenge in video. Video-based UDA raises additional challenges, such as aligning temporal information across domains (Jamal et al. 2018), attending to relevant transferable frames (Chen et al. 2019), and avoiding non-informative background frames (Pan et al. 2020).

Table 8 Comparison of domain adaptation classification datasets

Table 8 shows that EPIC-KITCHENS-100 provides several advantages over other video-based datasets: the largest number of instances, classes and subdomains, and it is multi-modal (Munro and Damen 2020). Additionally, it has a compound domain gap resulting from the test of time (i.e. recording data two years later).

Splits This challenge assesses models’ ability to adapt to additional footage without labels. We thus define the following splits: Source, labelled training data from 16 participants (collected in 2018), and Target, unlabelled footage from the same 16 participants collected in 2020. This ensures the gap between the domains is related to the capturing of the data ‘two years later’. We further split the target videos into Target Train and Target Test. The first are unlabelled videos used during domain adaptation, while the second are videos used for evaluation, as in Peng et al. (2017). The number of action instances per split is reported in Table 8.

Evaluation We use the same evaluation metrics as Sect. 4.1 on Target Test.

Baselines and Results We present a lower bound, “Source-Only”, where labelled source data is used for training and no adaptation to target data is attempted, and two upper bounds: “Target-Only”, where labelled target data is used, and “Source+Target”, where all training data is used with associated labels. None of these are UDA methods, but they offer insight into the domain gap.

Table 9 reports the results for the baselines. These use features extracted from TBN (Kazakos et al. 2019) trained on source. We use the code of Temporal Attentive Alignment (TA3N) (Chen et al. 2019), modified to consider multi-modal features (RGB, Flow and Audio), to report results. These show a significant performance improvement when using multi-modal data compared to the single-modality models of RGB, Flow and Audio. The domain gap is evident when comparing the lower and upper bounds. TA3N is able to partially decrease this gap, providing a 2.5% improvement in verb accuracy and 2.4% in noun accuracy when using multiple modalities. Recent work (Planamente et al. 2021) showed that RGB and Audio exhibit different levels of robustness to the domain gap in EPIC-KITCHENS-100. The best performing submissions for this challenge in 2021 exploited multiple modalities for domain adaptation (Yang et al. 2021; Plizzari et al. 2021). Figure 10 visualises the multi-modal feature space, showing limited overlap between source and target; TA3N aligns the features, demonstrating the capability of UDA.

4.6 Multi-Instance Action Retrieval

Definition Given a query action segment, the aim of video-to-text retrieval is to rank captions in a gallery set C, such that captions with a higher rank are more semantically relevant to the action in the video. Conversely, text-to-video retrieval uses a query caption \(c_i \in C\) to rank videos. Differently from the other challenges in Sect. 4, here we use the English-translated free-form captions from the narrations (Fig. 2b).

Fig. 10

UMAP (McInnes et al. 2018) projections of the feature spaces show better alignment through the UDA baseline

Table 9 Unsupervised domain adaptation results with lower (source-only) and the upper bounds of target-only and source+target
Table 10 Multi-Instance retrieval datasets

Splits We use the Train split from Table 1. As access to the captions is required for both video-to-text and text-to-video retrieval, the Val set is used for evaluating this challenge, allowing the held-out Test set for all other challenges to remain intact. We consider all the videos in Val, and all unique captions, removing repeats.

Related Datasets In datasets that are commonly used for retrieval (Xu et al. 2016; Rohrbach et al. 2015; Zhou et al. 2017; Chen and Dolan 2011), captions are considered relevant if they were collected for the same video, and irrelevant otherwise. This common approach ignores the semantic overlap between captions of different videos that contain identical or similar actions; these datasets thus assume videos to be distinct from one another. In instructional video datasets (Zhou et al. 2017; Miech et al. 2019), only the corresponding YouTube subtitle is considered relevant, again ignoring semantic overlap or similarities to other actions. Note that the large-scale HowTo100M dataset (Miech et al. 2019) has only been used for pre-training, as it is webly supervised and thus noisy, and it does not include a val/test set.

In this challenge, we use the class knowledge from Sect. 3 to define caption relevancy. This allows us to consider captions “put glass” and “place cup” as semantically relevant—an opportunity not available in other retrieval datasets.

Evaluation Metrics To evaluate this challenge, the relevancy of a retrieved caption (or video) to the query item needs to be assessed. Consider the case where a query video contains the action of someone cutting a pizza using a cutter. We want the captions (a) “cutting a pizza using a cutter”, (b) “cutting a pizza slice” and (c) “slicing a pizza” to all be more relevant than (d) “cutting a pizza using a knife”, which in turn is more relevant than both (e) “cutting a vegetable” and (f) “picking up a pizza slice”. Critically, (g) “opening a fridge” should be considered irrelevant.

Mean Average Precision (mAP) has been used in other retrieval works (Wray et al. 2019; Rasiwasia et al. 2014; Kang et al. 2015; Cao et al. 2017), yet it only considers relevance between items to be binary. Because of this, (a–c) would be considered (equally) relevant captions. However, we would also like to consider non-binary relevance where (d) is more relevant than (e) which in turn is more relevant than (g). We thus also report results using normalised Discounted Cumulative Gain (nDCG) (Järvelin and Kekäläinen 2002). This metric allows for non-binary relevance between captions. We define the relevance, \({\mathcal {R}}\), as the mean IoU of the verb and noun classes, giving a value between 0 and 1, where 0 is irrelevant (no overlap in verb/noun classes) and 1 is extremely relevant. From the example above, \(1 = {\mathcal {R}}(a,a) \ge {\mathcal {R}}(a,b) = {\mathcal {R}}(a,c) \ge {\mathcal {R}}(a,d) \ge {\mathcal {R}}(a,e) = {\mathcal {R}}(a,f) \ge {\mathcal {R}}(a,g) = 0\). We then use \({\mathcal {R}}\) to calculate nDCG as in Järvelin and Kekäläinen (2002) (see Appendix C.6 for the full definition).
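To make the metric concrete, the sketch below computes \({\mathcal {R}}\) from verb/noun class sets and the resulting nDCG for a ranked list; the exact gain and discount variant used in Appendix C.6 may differ slightly from this standard formulation.

```python
# Relevance as the mean IoU of verb and noun class sets, and nDCG following
# Jarvelin and Kekalainen (2002) with the standard log2 discount.
import numpy as np

def relevance(query, item):
    """query/item: (verb_classes, noun_classes) as Python sets."""
    ious = [len(q & i) / len(q | i) if q | i else 0.0
            for q, i in zip(query, item)]
    return sum(ious) / 2.0

def ndcg(rels):
    """rels: relevance values of retrieved items, in ranked order."""
    discounts = 1.0 / np.log2(np.arange(2, len(rels) + 2))
    dcg = float(np.sum(np.asarray(rels) * discounts))
    idcg = float(np.sum(np.sort(rels)[::-1] * discounts))
    return dcg / idcg if idcg > 0 else 0.0
```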

Baselines and Results As in Sect. 4.5, we use TBN (Kazakos et al. 2019) features trained on the Train split. Table 11 provides results for two baselines and the chance lower bound. Multi-Layer Perceptron (MLP) uses a 2-layer perceptron to project both modalities into a shared action space with a triplet loss. Our previous work JPoSE (Wray et al. 2019) disentangles captions into verb, noun and action spaces learned with a triplet loss. JPoSE sees a significant boost in performance over MLP. Figure 11 shows qualitative retrieval results on four examples using both MLP and JPoSE for text-to-video retrieval. JPoSE is able to retrieve more correct videos than MLP, but both methods still struggle on longer captions. Importantly, this dataset offers the first opportunity for action retrieval that considers semantic similarity.
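As a rough sketch of the MLP baseline (dimensions, architecture details and margin are assumptions; JPoSE additionally learns separate verb, noun and action spaces), the projection heads and triplet loss could look as follows.

```python
# Two small projection heads map video (e.g. TBN) features and caption
# features into a shared space trained with a triplet margin loss.
import torch
import torch.nn as nn

class SharedSpaceMLP(nn.Module):
    def __init__(self, video_dim=3072, text_dim=300, embed_dim=256):
        super().__init__()
        self.video_proj = nn.Sequential(nn.Linear(video_dim, 512), nn.ReLU(),
                                        nn.Linear(512, embed_dim))
        self.text_proj = nn.Sequential(nn.Linear(text_dim, 512), nn.ReLU(),
                                       nn.Linear(512, embed_dim))

    def forward(self, video_feat, text_feat):
        return self.video_proj(video_feat), self.text_proj(text_feat)

triplet = nn.TripletMarginLoss(margin=0.2)  # margin is an assumed value
# loss = triplet(anchor_video_emb, positive_text_emb, negative_text_emb)
```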

Table 11 Multi-Instance retrieval results
Fig. 11

Qualitative results for text-to-video action retrieval. Top 3 retrieved videos and the semantic relevancy \({\mathcal {R}}\) of the top 50 retrievals (red: irrelevant, green: relevant) (Color figure online)

5 Conclusion and Future Work

We presented our large-scale egocentric dataset EPIC-KITCHENS-100, collected and annotated through a pipeline that is scalable and of higher quality than previous approaches. We defined six challenges, providing baselines on public leaderboards. Dataset and leaderboards are available at http://epic-kitchens.github.io.

These six challenges have been chosen to facilitate progress in open topics within video understanding. They also highlight interesting parts of our collection and annotation pipeline. For example, retrieval uses our free-form captions, while unsupervised domain adaptation for action recognition builds on collecting footage two years later. Our dense annotations of overlapping actions make detection in long untrimmed videos particularly challenging. While this paper addresses each challenge independently, successful methods that address one challenge (e.g. detection) are likely to prove advantageous for better performance in another (e.g. anticipation). Combining all challenges with unsupervised domain adaptation would enable future deployment in new environments without additional labels.

In publishing this manuscript, we hope that researchers will not only utilise this large-scale dataset in their ongoing work, but also build on the novel pipeline used to collect it. The proposed ‘pause-and-talk’ narrator, which is publicly available, as well as our visually-supported transcription interface, can prove advantageous for other large-scale collection efforts.

Data Release Statement Dataset sequences, extracted frames and optical flow are available under Non-Commercial Government Licence for public sector information at the University of Bristol data repository: http://dx.doi.org/10.5523/bris.2g1n6qdydwa9u22shpxqzp0t8m

Annotations, models, evaluation scripts, challenge leaderboards and updates are available at: http://epic-kitchens.github.io