
1 Introduction

In recent years, we have seen significant progress in many domains such as image classification [19], object detection [37], captioning [26] and visual question answering [3]. This success has in large part been due to advances in deep learning [27] as well as the availability of large-scale image benchmarks [9, 11, 30, 55]. Although video understanding has been gaining attention, progress has been slower, mainly due to the lack of annotated datasets. This has been changing recently with the release of action classification benchmarks such as [1, 14, 18, 38, 46, 54]. With the exception of [46], most of these datasets contain videos that are very short in duration, i.e., only a few seconds long, focusing on a single action. Charades [42] makes a step towards activity recognition by collecting 10K videos of humans performing various tasks in their home. While this dataset is a nice attempt to collect daily actions, the videos have been recorded in a scripted way, by asking AMT workers to act out a script in front of the camera. This often makes the videos look less natural, and they also lack the progression and multi-tasking of actions that occur in real life.

Fig. 1. From top: Frames from the 32 environments; Narrations by participants used to annotate action segments; Active object bounding box annotations

Here we focus on first-person vision, which offers a unique viewpoint on people’s daily activities. This data is rich as it reflects our goals and motivation, ability to multi-task, and the many different ways to perform a variety of important, but mundane, everyday tasks (such as cleaning the dishes). Egocentric data has also recently been proven valuable for human-to-robot imitation learning [34, 53], and has a direct impact on HCI applications. However, datasets to evaluate first-person vision algorithms [6, 8, 13, 16, 36, 41] have been significantly smaller in size than their third-person counterparts, often captured in a single environment [6, 8, 13, 16]. Daily interactions from wearable cameras are also scarcely available online, making this a largely unavailable source of information.

In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric dataset. Our data was collected by 32 participants, belonging to 10 nationalities, in their native kitchens (Fig. 1). The participants were asked to capture all their daily kitchen activities, and to record sequences regardless of their duration. The recordings, which include both video and sound, not only feature the typical interactions with one’s own kitchenware and appliances, but importantly show the natural multi-tasking that one performs, like washing a few dishes amidst cooking. Such parallel-goal interactions have not been captured in existing datasets, making this both a more realistic and a more challenging set of recordings. A video introduction to the recordings is available at: http://youtu.be/Dj6Y3H0ubDw.

Altogether, EPIC-KITCHENS comprises 55 h of recording, densely annotated with start/end times for each action/interaction, as well as bounding boxes around objects subject to interaction. We describe our object, action and anticipation challenges, and report baselines in two scenarios, i.e., seen and unseen kitchens. The dataset and the leaderboards that track the community’s progress on all challenges, with held-out test ground truth, are available at: http://epic-kitchens.github.io.

Table 1. Comparative overview of relevant datasets (\(^*\): action classes with >50 samples)

2 Related Datasets

We compare EPIC-KITCHENS to four commonly used [6, 8, 13, 36] and two recent [16, 41] egocentric datasets in Table 1, as well as to six third-person activity-recognition datasets [14, 28, 39, 42, 44, 56] that focus on object-interaction activities. We exclude egocentric datasets that focus on inter-person interactions [2, 12, 40], as these target a different research question.

A few datasets aim at capturing activities in native environments, most of which are recorded in third-person [14, 18, 28, 41, 42]. [28] focuses on cooking dishes based on a list of breakfast recipes. In [14], short segments linked to interactions with 30 daily objects are collected by querying YouTube, while [18, 41, 42] are scripted – subjects are requested to enact a crowd-sourced storyline [41, 42] or a given action [18], which often results in less natural-looking actions. All egocentric datasets similarly use scripted activities, i.e. people are told what actions to perform. When following instructions, participants perform steps in a sequential order, as opposed to the more natural real-life scenarios addressed in our work, which involve multi-tasking, searching for an item, thinking what to do next, changing one’s mind or even unexpected surprises. EPIC-KITCHENS is most closely related to the ADL dataset [36], which also provides egocentric recordings in native environments. However, our dataset is substantially larger: it has 11.5M frames vs 1M in ADL, 90x more annotated action segments, and 4x more object bounding boxes, making it the largest first-person dataset to date.

3 The Dataset

In this section, we describe our data collection and annotation pipeline. We also present various statistics, showcasing different aspects of our collected data.

Fig. 2. Instructions used to collect video narrations from our participants

3.1 Data Collection

The dataset was recorded by 32 individuals in 4 cities in different countries (in North America and Europe): 15 in Bristol/UK, 8 in Toronto/Canada, 8 in Catania/Italy and 1 in Seattle/USA, between May and Nov 2017. Participants were asked to capture all kitchen visits for three consecutive days, starting the recording immediately before entering the kitchen and stopping it only before leaving. They recorded the dataset voluntarily and were not financially rewarded. The participants were asked to be alone in the kitchen for all recordings, thus capturing only one-person activities. We also asked them to remove any items that would disclose their identity, such as portraits or mirrors. Data was captured using a head-mounted GoPro with an adjustable mounting to control the viewpoint for different environments and participants’ heights. Before each recording, the participants checked the battery life and viewpoint using the GoPro Capture app, so that their stretched hands were located approximately at the middle of the camera frame. The camera was set to a linear field of view, 59.94 fps and Full HD resolution of \(1920 \times 1080\); however, since the participants recorded multiple sequences in their homes and switched the device off and on over several days, some made minor changes such as a wide or ultra-wide field of view or a different resolution. Specifically, 1% of the videos were recorded at \(1280 \times 720\) and 0.5% at \(1920 \times 1440\); also, 1% at 30 fps, 1% at 48 fps and 0.2% at 90 fps.

The recording lengths varied depending on the participant’s kitchen engagement. On average, people recorded for 1.7 h, with the maximum being 4.6 h. Cooking a single meal can span multiple sequences, depending on whether one stays in the kitchen, or leaves and returns later. On average, each participant recorded 13.6 sequences. Figure 3 presents statistics on time of day using the local time of the recording, high-level goals and sequence durations.

Since crowd-sourcing annotations for such long videos is very challenging, we had our original participants do a coarse first annotation. Each participant was asked to watch their videos, after completing all recordings, and narrate the actions carried out, using a hand-held recording device. We opted for a sound recording rather than written captions as this is arguably much faster for the participants, who were thus more willing to provide these annotations. These are analogous to a live commentary of the video. The general instructions for narrations are listed in Fig. 2. The participant narrated in English if sufficiently fluent or in their native language. In total, 5 languages were used: 17 narrated in English, 7 in Italian, 6 in Spanish, 1 in Greek and 1 in Chinese. Figure 3 shows wordles of the most frequent words in each language.

Fig. 3. Top (left to right): time of day of the recording, pie chart of high-level goals, histogram of sequence durations and dataset logo; Bottom: Wordles of narrations in native languages (English, Italian, Spanish, Greek and Chinese)

Table 2. Extracts from 6 transcription files in .sbv format

We collected narrations from the participants themselves because, having performed the actions, they are better qualified than an independent observer to label the activity. We opted for a post-recording narration so that the participants could perform their daily activities undisturbed, without being concerned about labelling.

We tested several automatic audio-to-text APIs [5, 17, 23], which failed to produce accurate transcriptions, as these expect a relevant corpus and complete sentences for context. We thus collected manual transcriptions via Amazon Mechanical Turk (AMT), and used YouTube’s automatic closed-caption alignment tool to produce accurate timings. For non-English narrations, we also asked AMT workers to translate the sentences. To make the job more suitable for AMT, narration audio files are split by removing silence below a pre-specified decibel threshold (after compression and normalisation). Speech chunks are then combined into HITs with a duration of around 30 s each. To ensure consistency, we submit the same HIT three times and select the transcriptions with an edit distance of 0 to at least one other submission. We manually corrected cases where there was no agreement. Examples of transcribed and timed narrations are provided in Table 2. The participants were also asked to provide one sentence per sequence describing the overall goal or activity that took place.
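The following Python sketch mirrors this consensus rule, assuming plain strings per worker; the helper names are illustrative and not part of the released pipeline.

```python
# A sketch (not the actual pipeline code) of the consensus rule above: each
# ~30 s audio chunk is transcribed by three AMT workers, and a transcription is
# kept only if its edit distance to at least one other submission is 0
# (i.e. an exact match); otherwise it goes to manual correction.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (single-row variant)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n]

def select_transcription(submissions):
    """Return an accepted transcription, or None if manual correction is needed."""
    for i, s in enumerate(submissions):
        if any(edit_distance(s, t) == 0 for j, t in enumerate(submissions) if j != i):
            return s
    return None

print(select_transcription(["open fridge", "open fridge", "open the fridge"]))
```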

In total, we collected 39,596 action narrations, corresponding to a narration every 4.9 s of video. The average phrase length is 2.8 words. These narrations give us an initial labelling of all actions with rough temporal alignment, obtained from the timestamp of the audio narration with respect to the video. However, narrations are not a perfect source of ground truth:

  • The narrations can be incomplete, i.e., the participants were selective in which actions they chose to narrate. We noticed that they labelled the ‘open’ actions more often than their counter-action ‘close’, as the narrator’s attention had already moved to the next goal. We account for this phenomenon in our evaluation by only evaluating actions that have been narrated.

  • Temporally, the narrations are delayed, occurring after the action takes place. This is adjusted using ground-truth action segments (see Sect. 3.2).

  • Participants use their own vocabulary and free language. While this is a challenging issue, we believe it is important to push the community to go beyond the pre-selected list of labels (also argued in [55]). We here resolve this issue by grouping verbs and nouns into minimally overlapping classes (see Sect. 3.4).

3.2 Action Segment Annotations

For each narrated sentence, we adjust the start and end times of the action using AMT. To ensure the annotators are trained to perform temporal localisation, we use a clip from our previous work [33] that explains the temporal bounds of actions. Each HIT is composed of a maximum of 10 consecutive narrated phrases \(p_i\), where annotators label \(A_{i} = [t_{s_i}, t_{e_i}]\) as the start and end times of the \(i^{th}\) action. Two constraints were added to reduce noisy annotations: (1) an action has to be at least 0.5 s in length; (2) an action cannot start before the preceding action’s start time. Note that consecutive actions are allowed to overlap. Moreover, the annotators could indicate that the action does not appear in the video, which handles cases that are occluded, impossible to distinguish or out of bounds.

To ensure consistency, we ask \(\mathcal {K}_a=4\) annotators to annotate each HIT. Given one annotation \(A_i(j)\) (i indexes the action and j the annotator), we calculate the agreement as follows: \({\alpha _i(j) = \frac{1}{\mathcal {K}_a} \sum _{k=1}^{\mathcal {K}_a} \text {IoU} (A_i(j), A_i(k))}\). We first find the annotator with the maximum agreement \(\hat{j} = \arg \max _j \alpha _i(j)\), and then find \(\hat{k} = \arg \max _{k \ne \hat{j}} \text {IoU}(A_i(\hat{j}), A_i(k))\). The ground-truth action segment \(A_i\) is then defined as:

$$\begin{aligned} A_i ={\left\{ \begin{array}{ll} \text {Union}(A_i(\hat{j}), A_i(\hat{k})), &{} \text {if IoU} (A_i(\hat{j}), A_i(\hat{k}))>0.5\\ A_i(\hat{j}), &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(1)

We thus combine two annotations when they have strong agreement, since in some cases the single (best) annotation results in too tight a segment. Figure 4 shows examples of combining annotations.
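A minimal Python sketch of this consensus procedure (the agreement \(\alpha \) and Eq. 1) follows, assuming each annotation is a (start, end) pair in seconds; it illustrates the rule rather than reproducing the annotation pipeline’s code.

```python
# A sketch of the segment-consensus rule (agreement alpha and Eq. 1), assuming
# each annotation is a (start, end) pair in seconds. The exclusion of j-hat when
# searching for k-hat is implicit in the text; everything else is illustrative.

def iou(a, b):
    """Temporal IoU between two segments a = (start, end) and b = (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def ground_truth_segment(annotations):
    """annotations: list of (start, end) pairs from the K_a = 4 annotators."""
    k_a = len(annotations)
    # agreement of each annotator with all K_a annotations (self included)
    alpha = [sum(iou(a, b) for b in annotations) / k_a for a in annotations]
    j_hat = max(range(k_a), key=lambda j: alpha[j])
    k_hat = max((k for k in range(k_a) if k != j_hat),
                key=lambda k: iou(annotations[j_hat], annotations[k]))
    a_j, a_k = annotations[j_hat], annotations[k_hat]
    if iou(a_j, a_k) > 0.5:                        # strong agreement: take the union
        return (min(a_j[0], a_k[0]), max(a_j[1], a_k[1]))
    return a_j                                     # otherwise keep the best annotation

print(ground_truth_segment([(10.0, 14.0), (10.5, 13.5), (9.8, 14.2), (20.0, 21.0)]))
```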

Fig. 4. An example of annotated action segments for 2 consecutive actions

Fig. 5. Object annotation from three AMT workers (orange, blue and green). The green worker’s annotations are selected as the final annotations (Color figure online)

In total, we collected such labels for 39,564 action segments (lengths: \(\mu = 3.7\) s, \(\sigma = 5.6\) s). These represent 99.9% of narrated segments. The missed annotations were those labelled as “not visible” by the annotators, though mentioned in narrations.

3.3 Active Object Bounding Box Annotations

The narrated nouns correspond to objects relevant to the action [6, 29]. Assume \(\mathcal {O}_i\) is the set of one or more nouns in the phrase \(p_i\) associated with the action segment \(A_i = [t_{s_i}, t_{e_i}]\). We consider each frame f within \([t_{s_i}-2s,t_{e_i}+2s]\) as a potential frame to annotate the bounding box(es) for each object in \(\mathcal {O}_i\). We build on the interface from [49] for annotating bounding boxes on AMT. Each HIT aims to get an annotation for one object, for a maximum duration of 25 s, which corresponds to 50 consecutive frames at 2 fps. The annotator can also note that the object does not exist in f. We specifically ask the same annotator to annotate consecutive frames, to avoid subjective decisions on the extents of objects. We also assess annotator quality by requiring an \({\text {IoU} \ge 0.7}\) against two golden annotations at the start of every HIT. We request \(\mathcal {K}_o = 3\) workers per HIT, and select the one with maximum agreement \(\beta \):

$$\begin{aligned} \beta (q) = \sum _f \max \limits _{j \ne q}^{\mathcal {K}_o}\, \max _{k,l}\, \text {IoU}(\text {BB}(j, f, k), \text {BB}(q, f, l)) \end{aligned}$$
(2)

where \(\text {BB}(q,f, k)\) is the \(k^{th}\) bounding box annotation by annotator q in frame f. Ties are broken by selecting the worker who provides the tighter bounding boxes. Figure 5 shows multiple annotations for four keyframes in a sequence.
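A minimal sketch of the worker-selection rule of Eq. (2) follows; bounding boxes are assumed to be (x1, y1, x2, y2) tuples, and the nested-dictionary layout is illustrative rather than the dataset’s released format.

```python
# A sketch of the worker-selection rule of Eq. (2). Boxes are (x1, y1, x2, y2)
# tuples and annotations[worker][frame] is a list of boxes; this layout is
# assumed for illustration only.

def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def worker_agreement(q, annotations):
    """beta(q): per frame, best IoU between any box of q and any box of another worker."""
    beta = 0.0
    for f, boxes_q in annotations[q].items():
        best = 0.0
        for j, frames_j in annotations.items():
            if j == q or f not in frames_j:
                continue
            for bq in boxes_q:
                for bj in frames_j[f]:
                    best = max(best, box_iou(bq, bj))
        beta += best
    return beta

def select_worker(annotations):
    """Keep the worker whose boxes agree most with the other workers."""
    return max(annotations, key=lambda q: worker_agreement(q, annotations))

annotations = {
    "w1": {0: [(10, 10, 50, 50)]},
    "w2": {0: [(12, 11, 49, 52)]},
    "w3": {0: [(100, 100, 140, 140)]},
}
print(select_worker(annotations))   # w1 or w2 (they overlap strongly), never w3
```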

Overall, 77% of requested annotations resulted in at least one bounding box. In total, we collected 454,255 bounding boxes (\(\mu = 1.64\) boxes/frame, \({\sigma = 0.92}\)). Sample action segments and object bounding boxes are shown in Fig. 6.

3.4 Verb and Noun Classes

Since our participants annotated using free text in multiple languages, a variety of verbs and nouns was collected. We group these into classes with minimal semantic overlap, to accommodate the more typical approaches to multi-class detection and recognition, where each example is assumed to belong to one class only. We estimate part-of-speech (POS) tags using spaCy’s English core web model. We select the first verb in the sentence, and find all nouns in the sentence excluding any that match the chosen verb. When a noun is absent or replaced by a pronoun (e.g. ‘it’), we use the noun from the directly preceding narration (e.g. \(p_i\): ‘rinse cup’, \(p_{i+1}\): ‘place it to dry’).
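A minimal sketch of this extraction step is shown below, using spaCy; the specific model (en_core_web_sm) is an assumption, and POS tags on short imperative phrases may occasionally differ from the manually verified annotations.

```python
# A sketch of the verb/noun extraction: take the first verb, keep the remaining
# nouns, and fall back to the preceding narration's noun when only a pronoun is
# present. The en_core_web_sm model is an assumption for illustration.
import spacy

nlp = spacy.load("en_core_web_sm")

def parse_narration(phrase, previous_nouns=None):
    """Return (first verb, nouns), borrowing the preceding narration's noun
    when the current phrase only contains a pronoun such as 'it'."""
    doc = nlp(phrase)
    verbs = [t.lemma_ for t in doc if t.pos_ == "VERB"]
    verb = verbs[0] if verbs else None
    nouns = [t.lemma_ for t in doc if t.pos_ == "NOUN" and t.lemma_ != verb]
    if not nouns and any(t.pos_ == "PRON" for t in doc):
        nouns = list(previous_nouns or [])
    return verb, nouns

v1, n1 = parse_narration("rinse cup")
v2, n2 = parse_narration("place it to dry", previous_nouns=n1)
print(v1, n1, v2, n2)   # expected: rinse ['cup'] place ['cup']
```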

We refer to the set of minimally-overlapping verb classes as \(C_V\), and similarly \(C_N\) for nouns. We attempted to automate the clustering of verbs and nouns using combinations of WordNet [32], Word2Vec [31] and the Lesk algorithm [4]; however, due to limited context, these produced too many meaningless clusters. We thus elected to manually cluster the verbs and semi-automatically cluster the nouns. We preprocessed compound nouns, e.g. ‘pizza cutter’, as a subset of the second noun, e.g. ‘cutter’. We then manually adjusted the clustering, merging the variety of names used for the same object, e.g. ‘cup’ and ‘mug’, as well as splitting some base nouns, e.g. ‘washing machine’ vs ‘coffee machine’.
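The compound-noun preprocessing and subsequent manual adjustment can be illustrated as follows; the cluster-key format and the example merges are illustrative, not the released class list.

```python
# A sketch of the compound-noun preprocessing: a multi-word noun such as
# 'pizza cutter' is filed under its second (head) noun 'cutter', after which
# clusters are manually merged or split.
from collections import defaultdict

def cluster_key(noun_phrase: str) -> str:
    return noun_phrase.split()[-1]          # head noun, e.g. 'cutter'

clusters = defaultdict(set)
for noun in ["pizza cutter", "cutter", "cup", "mug",
             "coffee machine", "washing machine"]:
    clusters[cluster_key(noun)].add(noun)

# Manual adjustment then merges synonyms (e.g. 'cup' and 'mug') and splits
# clusters whose members are different objects ('washing machine' vs 'coffee machine').
clusters["cup"] |= clusters.pop("mug")
print(dict(clusters))
```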

Fig. 6. Sample consecutive action segments with keyframe object annotations

Fig. 7. Top: Frequency of verb classes in action segments; Bottom: Frequency of noun clusters in bounding box annotations, by category

In total, we have 125 \(C_V\) classes and 331 \(C_N\) classes. Table 3 shows a sample of verbs and nouns grouped into classes. These classes are used in all three defined challenges. In Fig. 7, we show \(C_V\) ordered by frequency of occurrence in action segments, as well as \(C_N\) ordered by number of annotated bounding boxes. The noun classes are grouped into 19 super-categories, of which 9 cover food and drinks, with the rest containing kitchen essentials from appliances to cutlery. Co-occurring classes are presented in Fig. 8.

Fig. 8. Left: Frequently co-occurring verb/nouns in action segments [e.g. (open/close, cupboard/drawer/fridge), (peel, carrot/onion/potato/peach), (adjust, heat)]; Middle: Next-action excluding repetitive instances of the same action [e.g. peel \(\rightarrow \) cut, turn-on \(\rightarrow \) wash, pour \(\rightarrow \) mix]; Right: Co-occurring bounding boxes in one frame [e.g. (pot, coffee), (knife, chopping board), (tap, sponge)]

3.5 Annotation Quality Assurance

To analyse the quality of annotations, we chose 300 random samples and manually assessed their correctness. We report:

  • Action Segment Boundaries (\(A_i\)): We check that the start/end times fully enclose the action boundaries, with any additional frames not part of other actions – error: 5.7%.

  • Object Bounding Boxes (\(\mathcal {O}_i\)): We check that the bounding box encapsulates the object or its parts, with minimal overlap with other objects, and that all instances of the class in the frame have been labelled – error: 6.3%.

  • Verb classes (\(C_V\)): We check that the verb class is correct – error: 3.3%.

  • Noun classes (\(C_N\)): We check that the noun class is correct – error: 6.0%.

These error rates are comparable to recently published datasets [54].

4 Benchmarks and Baseline Results

EPIC-KITCHENS offers a variety of potential challenges, from routine understanding to activity recognition and object detection. As a start, we define three challenges for which we provide baseline results and online leaderboards. For the evaluation protocols, we hold out ground-truth annotations for 27% of the data (Table 4). We particularly aim to assess generalisability to novel environments, and we thus structured our test set as a collection of seen and previously unseen kitchens:

Seen Kitchens (S1): In this split, each kitchen is seen in both training and testing, where roughly 80% of sequences are in training and 20% in testing. We do not split sequences, thus each sequence is in either training or testing.

Unseen Kitchens (S2): This divides the participants/kitchens so all sequences of the same kitchen are either in training or testing. We hold out the complete sequences for 4 participants for this testing protocol. The test set of S2 is only 7% of the dataset in terms of frame count, but the challenges remain considerable.
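A minimal sketch of the two protocols is given below, assuming each sequence record carries a participant identifier; the ~80/20 proportion for S1 and the four held-out participants for S2 follow the text, but the random selection here is illustrative rather than the released split.

```python
# A sketch of the two test protocols: whole kitchens are held out for S2, while
# seen kitchens contribute ~20% of their sequences (never split) to S1.
import random

def make_splits(sequences, unseen_participants, seen_test_ratio=0.2, seed=0):
    """sequences: list of dicts with keys 'id' and 'participant'."""
    rng = random.Random(seed)
    train, test_s1, test_s2 = [], [], []
    for seq in sequences:
        if seq["participant"] in unseen_participants:
            test_s2.append(seq)                 # S2: whole kitchens held out
        elif rng.random() < seen_test_ratio:
            test_s1.append(seq)                 # S1: ~20% of each seen kitchen
        else:
            train.append(seq)                   # sequences are never split
    return train, test_s1, test_s2

print(make_splits([{"id": "seq1", "participant": "A"},
                   {"id": "seq2", "participant": "B"}],
                  unseen_participants={"B"}))
```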

Table 3. Sample verb and noun classes
Table 4. Statistics of test splits: seen (S1) and unseen (S2) kitchens

We now evaluate several existing methods on our benchmarks, to gain an understanding of how challenging our dataset is.

4.1 Object Detection Benchmark

Challenge: This challenge focuses on object detection for all of our \(C_N\) classes. Note that our annotations only capture the ‘active’ objects pre-, during- and post-interaction. We thus restrict the images evaluated per class to those where the object has been annotated. We particularly aim to break the performance down into many-shot and few-shot class groups, so as to analyse the capability of the approaches to quickly learn novel objects (with only a few examples). Our challenge leaderboard reflects the methods’ abilities on both sets of classes.

Method: We evaluate object detection using Faster R-CNN [37] due to its state-of-the-art performance. Faster R-CNN uses a region proposal network (RPN) to first generate class agnostic object proposals, and then classifies these and outputs refined bounding box predictions. We use the implementation from [21, 22] with a base architecture of ResNet-101 [19] pre-trained on MS-COCO [30].

Implementation Details: The learning rate is initialised to 0.0003, decayed by a factor of 10 after 90K iterations, and training is stopped after 120K iterations. We use a mini-batch size of 4 on 8 Nvidia P100 GPUs on a single compute node (Nvidia DGX-1) with distributed training and parameter synchronisation, i.e. an overall mini-batch size of 32. As in [37], images are rescaled such that their shortest side is 600 pixels and the aspect ratio is maintained. We use a stride of 16 on the last convolution layer for feature extraction, and for anchors we use 4 scales (0.25, 0.5, 1.0 and 2.0) and 3 aspect ratios (1:1, 1:2 and 2:1). To reduce redundancy, NMS is applied with an IoU threshold of 0.7. In both training and testing we use 300 RPN proposals.
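For illustration, the anchor grid implied by these settings (stride 16, 4 scales, 3 aspect ratios) can be generated as below; the base anchor size of 256 px is an assumption (the text does not state it), and this is plain NumPy rather than the detection framework used for the baseline.

```python
# A sketch of the RPN anchor grid: a stride-16 feature map with 4 scales and
# 3 aspect ratios, i.e. 12 anchors per location. The 256 px base size is assumed.
import numpy as np

def anchor_grid(feat_h, feat_w, stride=16, base=256,
                scales=(0.25, 0.5, 1.0, 2.0), ratios=(1.0, 0.5, 2.0)):
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride   # anchor centre in pixels
            for s in scales:
                for r in ratios:                              # r = width / height
                    w = base * s * np.sqrt(r)
                    h = base * s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

# a 600-pixel shortest side gives roughly a 38x50 feature map at stride 16
print(anchor_grid(38, 50).shape)    # (22800, 4) = 38 * 50 * 12 anchors
```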

Evaluation Metrics: For each class, we only report results on \(I^{c_n \in C_N}\), i.e. all images where class \(c_n\) has been annotated. We use the mean average precision (mAP) metric from PASCAL VOC [11], at IoU thresholds of 0.05, 0.5 and 0.75, similar to [30].
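The per-class restriction can be illustrated with the following sketch, which evaluates a single class only on images where it was annotated and computes AP from a greedy IoU matching; it mirrors the PASCAL VOC metric in spirit but is not the released evaluation code.

```python
# A sketch of the evaluation for one class c_n: only images where c_n is
# annotated are used, detections are greedily matched to ground truth at a
# given IoU threshold, and AP is the area under the precision-recall curve.
import numpy as np

def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def class_ap(detections, ground_truth, iou_thr=0.5):
    """detections: image_id -> list of (box, score); ground_truth: image_id ->
    list of boxes. Only images present in ground_truth (i.e. where c_n was
    annotated) contribute to the score."""
    records, num_gt = [], sum(len(v) for v in ground_truth.values())
    for image_id, gt_boxes in ground_truth.items():
        matched = [False] * len(gt_boxes)
        for box, score in sorted(detections.get(image_id, []), key=lambda d: -d[1]):
            ious = [box_iou(box, g) for g in gt_boxes]
            best = int(np.argmax(ious)) if ious else -1
            if best >= 0 and ious[best] >= iou_thr and not matched[best]:
                matched[best] = True
                records.append((score, 1.0))     # true positive
            else:
                records.append((score, 0.0))     # false positive
    records.sort(key=lambda r: -r[0])
    tp = np.cumsum([r[1] for r in records])
    fp = np.cumsum([1.0 - r[1] for r in records])
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-12)
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):          # area under the PR curve
        ap, prev_r = ap + p * (r - prev_r), r
    return ap
```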

Table 5. Baseline results for the Object Detection challenge
Fig. 9. Qualitative results for the object detection challenge

Results: We report results in Table 5 for many-shot classes (those with \({\ge 100}\) bounding boxes in training) and few-shot classes (those with \({\ge 10}\) and \({< 100}\) bounding boxes in training), alongside AP for the 15 most frequent classes. There are 202 many-shot classes and 88 few-shot classes in total. Our objects are generally harder to detect than in most existing datasets, with performance at the standard IoU threshold of 0.5 below \(40\%\). Even at a very small IoU threshold, the performance is relatively low. The more challenging classes are “meat”, “knife” and “spoon”, despite being among the most frequent ones. Notice that performance in the few-shot regime is substantially lower than in the many-shot regime, which points to interesting challenges for the future. However, performance on the Seen and Unseen splits is comparable for object detection, showing generalisation capability across environments.

Figure 9 shows qualitative results with detections shown in colour and ground truth shown in black. The examples in the right-hand column are failure cases.

4.2 Action Recognition Benchmark

Challenge: Given an action segment \(A_i = [t_{s_i}, t_{e_i}]\), we aim to classify the segment into its action class, where classes are defined as \({C_a = \{(c_v \in C_V, c_n \in C_N)\}}\), and \(c_n\) is the first noun in the narration when multiple nouns are present. Note that our dataset supports more complex action-level challenges, such as action localisation in full-length videos. We decided to focus on the classification challenge first (i.e. the segment is provided), since most existing works tackle this setting.

Table 6. Baseline results for the action recognition challenge
Table 7. Sample baseline action recognition per-class metrics (using TSN fusion)

Network Architecture: We train the Temporal Segment Network (TSN) [48], a state-of-the-art architecture in action recognition, but adjust the output layer to predict both verb and noun classes jointly, with independent losses, as in [25]. We use the PyTorch implementation [51] with the Inception architecture [45] and batch normalisation [24], pre-trained on ImageNet [9].
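The joint verb/noun output can be sketched as below: a shared feature feeds two independent linear heads whose cross-entropy losses are summed. The 1024-d feature size and head names are assumptions for illustration; this stands in for the modified TSN output layer rather than reproducing it.

```python
# A sketch of a joint verb/noun head with independent losses.
import torch
import torch.nn as nn

class VerbNounHead(nn.Module):
    def __init__(self, feat_dim=1024, num_verbs=125, num_nouns=331):
        super().__init__()
        self.verb_fc = nn.Linear(feat_dim, num_verbs)
        self.noun_fc = nn.Linear(feat_dim, num_nouns)

    def forward(self, features):
        return self.verb_fc(features), self.noun_fc(features)

head, criterion = VerbNounHead(), nn.CrossEntropyLoss()
features = torch.randn(8, 1024)                       # a batch of segment features
verb_labels = torch.randint(0, 125, (8,))
noun_labels = torch.randint(0, 331, (8,))
verb_logits, noun_logits = head(features)
loss = criterion(verb_logits, verb_labels) + criterion(noun_logits, noun_labels)
loss.backward()
```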

Implementation Details: We train both spatial and temporal streams, the latter on dense optical flow at 30 fps, extracted between RGB frames using the \(\mathrm{TV}\text{-}L_1\) algorithm [52] with the formulation \(\mathrm{TV}\text{-}L_1(I_{2t}, I_{2t+3})\) to eliminate optical flicker; we release the computed flow as part of the dataset. We do not perform stratification or weighted sampling, allowing the dataset’s class imbalance to propagate into the mini-batch. We train each model on 8 Nvidia P100 GPUs on a single compute node (Nvidia DGX-1) for 80 epochs with a mini-batch size of 512. We set the learning rate to 0.01 for the spatial stream and 0.001 for the temporal stream, decreasing both by a factor of 10 after epochs 20 and 40. After averaging the 25 samples within the action segment, each with 10 spatial croppings as in [48], we fuse both streams by averaging class predictions with equal weights. All unspecified parameters use the same values as [48].
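The late fusion step can be sketched as follows: per-sample class scores of each stream are averaged within the segment and the two streams are then combined with the chosen weights. Tensor shapes and names are illustrative.

```python
# A sketch of two-stream late fusion: average the 25 x 10 sampled crops per
# stream, then combine with the given weights (equal here, 0.6/0.4 in Sect. 4.3).
import torch

def fuse_streams(spatial_scores, temporal_scores, w_spatial=0.5, w_temporal=0.5):
    """Each input: (num_samples, num_classes) scores for one action segment."""
    spatial = spatial_scores.softmax(dim=-1).mean(dim=0)
    temporal = temporal_scores.softmax(dim=-1).mean(dim=0)
    return w_spatial * spatial + w_temporal * temporal

verb_scores = fuse_streams(torch.randn(250, 125), torch.randn(250, 125))
print(verb_scores.argmax().item())      # predicted verb class for the segment
```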

Evaluation Metrics: We report two sets of metrics: aggregate and per-class, which are equivalent to the class-agnostic and class-aware metrics in [54]. For aggregate metrics, we compute top-1 and top-5 accuracy for correct predictions of \(c_v\), \(c_n\) and their combination \((c_v, c_n)\) – we refer to these as ‘verb’, ‘noun’ and ‘action’. Accuracy is reported on the full test set. For per-class metrics, we compute precision and recall, for classes with more than 100 samples in training, then average the metrics across classes - these are 26 verb classes, 71 noun classes, and 819 action classes. Per-class metrics for smaller classes are \(\approx 0\) as TSN is better suited for classes with sufficient training data.
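These metrics can be sketched as below, assuming integer class labels and a (num_examples, num_classes) score matrix per task; this illustrates the definitions and is not the official evaluation script.

```python
# A sketch of the aggregate (top-k) and per-class (precision/recall) metrics.
import numpy as np

def topk_accuracy(scores, labels, k=1):
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

def action_top1_accuracy(verb_preds, noun_preds, verb_labels, noun_labels):
    """An action is correct only when both its verb and its noun are correct."""
    return float(np.mean((verb_preds == verb_labels) & (noun_preds == noun_labels)))

def mean_per_class_precision_recall(preds, labels, eval_classes):
    """Average precision/recall over the classes with >100 training samples."""
    precisions, recalls = [], []
    for c in eval_classes:
        tp = np.sum((preds == c) & (labels == c))
        fp = np.sum((preds == c) & (labels != c))
        fn = np.sum((preds != c) & (labels == c))
        precisions.append(tp / (tp + fp) if tp + fp > 0 else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn > 0 else 0.0)
    return float(np.mean(precisions)), float(np.mean(recalls))
```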

Results: We report results in Table 6 for aggregate and per-class metrics. We compare TSN (3 segments) to 2SCNN [43] (1 segment), as well as to chance and largest-class baselines. Fused results perform best or are comparable to the best single stream (spatial/temporal). The challenge of getting both verb and noun labels correct remains significant for both seen (top-1 accuracy 20.5%) and unseen (top-1 accuracy 10.9%) environments. This implies that for many examples we only get one of the two labels (verb/noun) right. Results also show that generalising to unseen environments is a harder challenge for actions than it is for objects. We give a breakdown of per-class metrics for the 15 largest verb classes in Table 7.

Fig. 10. Qualitative results for the action recognition and anticipation challenges

Figure 10 reports qualitative results, with success highlighted in green, and failures in red. In the first column both the verb and the noun are correctly predicted, in the second column one of them is correctly predicted, while in the third column both are incorrect. Challenging cases like distinguishing ‘adjust heat’ from turning it on, or pouring soy sauce vs oil are shown.

4.3 Action Anticipation Benchmark

Challenge: Anticipating the next action is a skill humans master well, and automating it has direct implications in assistive living. Given any of the upcoming wearable systems (e.g. Microsoft HoloLens or Google Glass), anticipating the wearer’s next action from a first-person view could trigger smart home appliances, enabling seamless achievement of the wearer’s goals. Previous works have investigated different anticipation tasks from an egocentric perspective, e.g. predicting future localisation [35] or the next-active object [15]. We here consider the task of forecasting an action before it happens. Let \(\tau _a\) be the ‘anticipation time’, i.e. how far in advance to recognise the action, and \(\tau _o\) be the ‘observation time’, i.e. the length of the observed video segment preceding the action. Given an action segment \(A_i=[t_{s_i},t_{e_i}]\), we predict the action class \(C_a\) by observing the video segment preceding the action start time \(t_{s_i}\) by \(\tau _a\), that is \([t_{s_i}-(\tau _a+\tau _o),t_{s_i}-\tau _a]\).
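The observed window can be computed as in the following sketch; the frame rate is a parameter and the names are illustrative.

```python
# A sketch of the observed window: for an action starting at t_s (seconds), the
# model sees only [t_s - (tau_a + tau_o), t_s - tau_a].
def observed_segment(t_start, tau_a=1.0, tau_o=1.0, fps=60.0):
    seg_start = max(0.0, t_start - (tau_a + tau_o))
    seg_end = max(0.0, t_start - tau_a)
    return int(round(seg_start * fps)), int(round(seg_end * fps))

print(observed_segment(t_start=12.4))   # frames spanning [10.4 s, 11.4 s]
```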

Network Architecture: As in Sect. 4.2, we train TSN [48] to provide baseline action anticipation results and compare with 2SCNN [43]. We feed the model with the video segments preceding annotated actions and train it to predict verb and noun classes jointly as in [25]. Similarly to [47], we set \(\tau _a=1\) s. We report results with \(\tau _o=1\) s, and note that performance drops with longer segments.

Implementation Details: Models for both the spatial and temporal modalities are trained using a single Nvidia Titan X with a batch size of 64, for 80 epochs, setting the initial learning rate to 0.001 and dropping it by a factor of 10 after 30 and 60 epochs. Fusion weights the spatial and temporal streams by 0.6 and 0.4, respectively. All other parameters use the values specified in [48].

Evaluation Metrics: We use the same evaluation metrics as in Sect. 4.2.

Results: Table 8 reports baseline results for the action anticipation challenge. As expected, this is a harder challenge than action recognition, and we thus note a drop in performance throughout. Unlike action recognition, the flow stream and fusion do not generally improve performance. TSN often offers small but consistent improvements over 2SCNN.

Figure 10 reports qualitative results, with success examples highlighted in green and failure cases in red. As the figure shows, the method over-predicts ‘put’ as the next action: once an object is picked up, the learned model tends to believe it will be put down next. Methods that focus on long-term understanding of the goal, as well as on multi-scale history, would be needed to circumvent such a tendency.

Table 8. Baseline results for the action anticipation challenge

Discussion: The three defined challenges form the basis for higher-level understanding of the wearer’s goals. We have shown that existing methods are still far from tackling these tasks with high precision, pointing to exciting future directions. Our dataset lends itself naturally to a variety of less explored tasks. We are planning to provide a wider set of challenges, including action localisation [50], video parsing [42], visual dialogue [7], goal completion [20] and skill determination [10] (e.g. how good are you at making your eggs for breakfast?). Since real-time performance is crucial in this domain, our leaderboards will reflect this, pressing the community to come up with efficient and effective solutions.

5 Conclusion and Future Work

We present EPIC-KITCHENS, the largest and most varied dataset in egocentric vision to date, captured in participants’ native environments. We collect 55 hours of video data recorded on a head-mounted GoPro, and annotate it with narrations, action segments and object annotations, using a pipeline that starts with live commentary of the recorded videos by the participants themselves. Baseline results on the object detection, action recognition and anticipation challenges show the great potential of the dataset for pushing approaches that target fine-grained video understanding to new frontiers. The dataset and online leaderboards for the three challenges are available at http://epic-kitchens.github.io.