1 Introduction

There has been considerable research interest in action recognition in video over the past two decades [1, 2, 4, 5, 7, 8, 11, 13–18, 20, 22, 27–34, 37, 40–42, 44, 45, 47, 49, 50, 52–54, 57–59, 61, 62, 64, 67, 68, 71–79, 81, 83–85, 87]. To support such research, numerous video datasets have been gathered. Liu et al. [39] summarize the available datasets as of 2011. These include KTH (6 classes, [58]), Weizmann (10 classes, [4]), CMU Soccer (7 classes, [9]), CMU Crowded (5 classes, [27]), UCF Sports (9 classes, [53]), UR ADL (10 classes, [44]), UM Gesture (14 classes, [38]), UCF YouTube (11 classes, [41]), Hollywood-1 (8 classes, [35]), Hollywood-2 (12 classes, [43]), MultiKTH (6 classes, [70]), MSR (3 classes, [86]), and TRECVID (10 classes, [63]). These datasets contain short clips, each depicting one of a small number of classes (3–14). Several more recent datasets also contain short clips, each depicting a single action, but with a larger number of action classes: UCF50 (50 classes, [52]), HMDB51 (51 classes, [33]), and UCF101 (101 classes, [65]). The VIRAT dataset [48] has 12 classes and longer streaming video.

Here, we introduce a new dataset called the large continuous action dataset (LCA). This dataset contains depictions of 24 action classes. The video for this dataset was filmed and annotated as part of the DARPA Mind’s Eye program. A novel characteristic of this dataset is that rather than consisting of short clips, each depicting a single action class, it contains much longer streaming video segments, each containing numerous instances of a variety of action classes that often overlap in time and may occur in different portions of the field of view. The annotation that accompanies this dataset delineates not only which actions occur but also their temporal extent.

Many of the prior datasets were culled from video downloaded from the Internet. In contrast, the LCA dataset contains video that was filmed specifically to construct the dataset. While the video was filmed with people hired to act out the specified actions according to a general script, the fact that it contains long streaming segments tends to mitigate any artificial aspects and renders the action depictions quite natural. Moreover, the fact that all of the video was filmed in a relatively small number of distinct backgrounds makes the dataset challenging; the background gives little clue as to the action class.

A further distinguishing characteristic of the LCA dataset is the degree of ambiguity. Most prior action recognition datasets, in fact most prior datasets for all computer vision tasks, make a tacit assumption that the labeling is unambiguous, and thus, there is a ‘ground truth.’ We had a team of five human annotators each annotate the entire LCA dataset. This allowed us to measure the degree of intercoder agreement. Surprisingly, there is a significant level of disagreement between humans as to the temporal extent of most action instances. We believe that such inherent ambiguity is a more accurate reflection of the underlying action recognition task and hope that the multiplicity of divergent annotations will help spur novel research with this more realistic dataset.

Another distinguishing characteristic of the LCA dataset is that some action occurrences were filmed simultaneously with multiple cameras with partially overlapping fields of view. While the cameras were neither spatially calibrated nor temporally synchronized, the fact that we have multiple annotations of the temporal extent of action occurrences may support future efforts to perform temporal synchronization after the fact. Furthermore, while most of the video was filmed from ground-level cameras with a horizontal view, some of the video was filmed with aerial cameras with a bird’s-eye view, in some cases simultaneously with ground cameras. This may support future efforts to conduct scene reconstruction.

Some datasets are provided with official tasks and evaluation metrics. We refrain from doing so for this dataset. Instead, we leave it to the community to make use of this dataset in a creative fashion for as many different tasks as it is suited to. Nonetheless, the LCA dataset contains sufficient information for users to compare their methods precisely with the results of the baseline experiments reported here.

2 Collection

The video for this dataset was filmed by DARPA in conjunction with Mitre and several performers from the Mind’s Eye program. See the appendix included in the electronic supplementary material for a precise explanation of the origin of the video used in the LCA dataset and the distinction between it and the video used as part of the Mind’s Eye program.

The LCA dataset was filmed at several different locations, all of which were either military training facilities or facilities used to film Hollywood movies. The videos were filmed in a variety of backgrounds: several simulated country roads, several simulated safe houses, and several simulated middle-eastern urban environments. This manuscript reports a systematic annotation effort for this video, which comprises 190 files as delineated in Table 1 in the appendix included in the electronic supplementary material.

Table 1 Verbs used as labels in the LCA dataset

The LCA dataset constitutes 2,302,144 frames and a total of 12 h, 51 min, and 16 s of video. For comparison, UCF50 has 1,330,936 frames and 13.81 h, HMDB51 has 632,635 frames and 5.85 h, UCF101 has 27 h, Hollywood-2 has 20.1 h, and VIRAT has 8.5 h. Several frames from this dataset illustrating several of the backgrounds are shown in Fig. 1.

Fig. 1 Several frames from the LCA dataset illustrating several of the backgrounds in which they were filmed

Fig. 2 Sample frame pairs from the LCA dataset illustrating the 24 action classes

3 Annotation

The LCA dataset contains annotations for 24 verbs, as delineated in Table 1. Figure 2 contains sample frame pairs for each of the 24 verbs. Of these, 17 verbs were used as part of the stage directions given to the actors to guide the actions that they performed. The remainder were not used as part of the stage directions but occurred incidentally. Nothing, however, precluded the actors from performing actions that could be described by other verbs. Thus, the video depicts many actions other than those annotated, including but not limited to riding bicycles, pushing carts, singing, pointing guns, arguing, and kicking balls. The restriction to these 24 verbs is, in principle, only that these were the only actions annotated. Identifying the presence of specific verbs in the context of many such confounding actions should present additional challenges.

We annotated all occurrences of the 24 verbs from Table 1 in the videos in Table 1 in the appendix included in the electronic supplementary material. Each such occurrence consists of a temporal interval labeled with a verb. The judgment of whether an action described by a particular verb occurred is subjective; different annotators will arrive at different judgments as to occurrence as well as the temporal extent thereof. To help guide annotators, we gave them the specification of the intended meaning of each of the 24 verbs as provided by DARPA. Annotators performed the annotation at workstations with dual monitors. One monitor displayed the annotation tool while the other monitor displayed the documentation of intended verb meaning. The documentation of intended verb meaning is included in the LCA distribution in the electronic supplementary material.

We also asked annotators to annotate intervals where certain object classes were present in the field of view. These include bandannas, bicycles, people, vehicles, and weapons (the bandannas were worn by people around their heads or arms). For these, a count of the number of instances of each class that were visible in the field of view was maintained. It was incremented each time a new instance became visible and decremented each time an instance became invisible. We instructed annotators that there was no need to be precise when an instance was partially visible. We further instructed annotators that vehicles denoted motor vehicles, not push carts, and that weapons denoted guns, not other things like clubs or rocks that could be used as weapons.

We provided annotators with a tool that allowed them to view the videos at ordinary frame rate, stop and start the videos at will, navigate to arbitrary points in the videos, view individual frames of the videos, add, delete, and move starting and ending points of intervals, and label intervals with verbs. The tool also contained buttons to increment and decrement the counts for each of the object classes and apprised the annotator of the running counts for the object classes in each frame as the video was played or navigated.

Because of the large quantity of video to be annotated, and the fact that nothing happens during large portions of the video, we preprocessed the video to reduce the amount requiring manual annotation. We first downsampled the video to 5 fps just for the purpose of annotation; the annotation was converted back at the end to the original frame rate. Then, segments of this downsampled video where no motion occurred were removed. To do this, we computed dense optical flow on each pixel of each frame of the downsampled video. We then computed the average of the magnitude of the flow vectors in each frame and determined which frames were above a threshold. Stretches of contiguous above-threshold frames that were separated by short stretches of contiguous below-threshold frames were merged into single temporal segments. Such temporal segments that were shorter than a specified temporal length were then discarded. Annotators were only given the remaining temporal segments to annotate. We performed a postprocessing step whereby the authors manually viewed all discarded frames to make sure that no actions started, ended, or spanned the omitted temporal segments. As part of this postprocessing step, the authors manually checked that none of the specified object classes entered or left the field of view during the omitted temporal segments.
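To make this preprocessing concrete, the following is a minimal sketch of the motion-based pre-filter, assuming OpenCV’s Farneback dense flow over a list of BGR frames; the flow threshold, merge gap, and minimum segment length are illustrative parameters, not the values used for the LCA annotation.

```python
import cv2
import numpy as np

def active_segments(frames, flow_thresh=0.5, max_gap=5, min_len=10):
    """Return (start, end) frame-index pairs of segments with above-threshold motion."""
    active = [False]  # no flow defined for the first frame
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for f in frames[1:]:
        gray = cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag = np.linalg.norm(flow, axis=2).mean()   # mean flow magnitude per frame
        active.append(mag > flow_thresh)
        prev = gray
    # collect runs of contiguous above-threshold frames
    segs, start = [], None
    for i, a in enumerate(active + [False]):        # sentinel closes the final run
        if a and start is None:
            start = i
        elif not a and start is not None:
            segs.append([start, i - 1])
            start = None
    # merge runs separated by short below-threshold gaps, then drop short segments
    merged = []
    for s in segs:
        if merged and s[0] - merged[-1][1] - 1 <= max_gap:
            merged[-1][1] = s[1]
        else:
            merged.append(s)
    return [(s, e) for s, e in merged if e - s + 1 >= min_len]
```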

We had five annotators each independently annotate the entire LCA dataset. Annotators were given initial instructions. During the annotation, annotators were encouraged to discuss their annotation judgments with the authors. The authors would then arbitrate the judgment, often specifying principles to guide the annotation. These principles were then circulated among the other annotators. The annotator instructions and principles developed through arbitration are included in the LCA distribution.

We performed a consistency check during the annotation process. Whenever an annotator completed annotation of a temporal segment, if that annotator did not annotate any intervals during that segment but other annotators did, we asked that annotator to review their annotation.

The LCA dataset contains five verb-annotation files for each of the video files in Table 1 in the appendix included in the electronic supplementary material. These have the same name as their corresponding video, but with the extension txt, and are located in directories named with each of the annotator codes bmedikon, cbushman, kim861, nielder, and nzabikh. Each line in each of these files contains a single temporal interval as a text string specifying a verb and two zero-origin nonnegative integers specifying the starting and ending frames of the interval inclusive. The LCA dataset also contains five object class annotation files for each of the video files in Table 1 in appendix included in the electronic supplementary material. These also share the filename with the corresponding video, but with the addition of the suffix -enter-exits.txt, and are located in the same directories named with each of the above annotator codes. Each line in each of these files contains a text string specifying an object class, an integer specifying the number of objects of that class entering or exiting the field of view (positive for entering and negative for exiting), and a single zero-origin nonnegative integer specifying the video frame.
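As an illustration of the file layout just described, the following minimal sketch reads one annotator’s files for one video; the whitespace-separated field parsing and the example video name are assumptions for illustration.

```python
import os

def read_verb_intervals(path):
    """Yield (verb, start_frame, end_frame) triples from a per-video annotation file."""
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            verb, start, end = line.split()
            yield verb, int(start), int(end)

def read_object_events(path):
    """Yield (object_class, delta, frame) from a -enter-exits.txt file,
    where delta is positive for objects entering and negative for exiting."""
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            cls, delta, frame = line.split()
            yield cls, int(delta), int(frame)

# Example: one annotator's intervals for one (hypothetical) video name.
annotator, video = "cbushman", "some_video"
intervals = list(read_verb_intervals(os.path.join(annotator, video + ".txt")))
events = list(read_object_events(os.path.join(annotator, video + "-enter-exits.txt")))
```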

Fig. 3 Intercoder agreement on the annotations of the LCA dataset. F1 score for each pair of annotators as the overlap criterion is varied. Overlap of two intervals is measured as the length of their intersection divided by the length of their union

4 Analysis

We analyzed the degree of agreement between the different annotators. To do this, we compared pairs of annotators, taking the judgments of one as ‘ground truth’ and computing the F1 score of the other. An interval in the annotation being scored was taken as a true positive if it overlapped some interval with the same label in the ‘ground truth.’ An interval in the annotation being scored was taken as a false positive if it did not overlap any interval with the same label in the ‘ground truth.’ An interval in the ‘ground truth’ was taken as a false negative if it did not overlap any interval with the same label in the annotation being scored. From these counts, an F1 score could be computed.

We employed the following overlap criterion. For a pair of intervals, we computed a one-dimensional variant of the ‘intersection over union’ criterion employed within the Pascal VOC challenge to determine overlap of two axis-aligned rectangles [10], namely the length of the intersection divided by the length of the union. We considered two intervals to overlap when the above exceeded some specified threshold. We then computed the F1 score as this threshold was varied and plotted the results for all pairs of annotators (Fig. 3).
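The following minimal sketch implements the agreement computation described in this and the preceding paragraph: a one-dimensional intersection-over-union test on inclusive frame intervals and an F1 score for a pair of annotations at a given overlap threshold.

```python
def iou_1d(a, b):
    """Intersection over union of two inclusive frame intervals (start, end)."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    union = max(a[1], b[1]) - min(a[0], b[0]) + 1
    return max(inter, 0) / union

def f1_between(scored, ground_truth, thresh):
    """scored, ground_truth: lists of (verb, start, end) intervals."""
    def overlaps(a, g):
        return a[0] == g[0] and iou_1d((a[1], a[2]), (g[1], g[2])) > thresh
    tp = sum(any(overlaps(a, g) for g in ground_truth) for a in scored)
    fp = len(scored) - tp
    fn = sum(not any(overlaps(a, g) for a in scored) for g in ground_truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Sweeping thresh from 0 to 1 for every pair of the five annotators yields
# curves like those in Fig. 3.
```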

Note that there is a surprisingly low level of agreement between annotators. Annotators rarely if ever agree on the precise temporal extent of an action as indicated by the fact that all agreement curves go to zero as the overlap threshold goes to one. At an overlap threshold of 0.5, the F1 score varies between about 0.3 and about 0.6. At an overlap threshold of 0.1, the threshold employed by VIRAT to score machines against humans, the F1 score varies between about 0.38 and about 0.67. This would put an upper bound on machine performance with this dataset using the VIRAT threshold. Even if the overlap threshold is reduced to zero, the F1 score varies between about 0.43 and about 0.7. This indicates that this dataset should be challenging for computer action recognition.

5 Baseline experiments

We performed two experiments to present and compare the performance of several state-of-the-art action recognition systems on the LCA dataset. The first experiment evaluated the performance of several baseline methods on trimmed videos extracted from the LCA dataset. This involved training and testing a classifier on a 1-out-of-24 forced-choice classification task, where each trimmed video clip nominally depicted a single action occurrence. The second experiment evaluated the performance of several baseline methods on the untrimmed streaming videos that comprise the entire LCA dataset. For this task, the entire LCA dataset was partitioned into five sets of videos to perform leave-one-set-out cross-validation. Models were trained on the videos in four sets and then applied to the videos in the fifth set. The task was to produce temporal intervals that delineated occurrences of the 24 action classes, each such interval labeled with the class of the action that occurred. We describe the two baseline experiments below.

The baseline experiments were performed on a collection of methods that attempt to approximate recent state-of-the-art methods for action recognition. These include running the actual released code for C2 [24], Action Bank [57], Stacked ISA [36], and VHTK [44]. We also obtained the code for Cao’s method [6], Cao’s reimplementation [6] of Ryoo’s method [56], and Retinotopic [3] from the authors. We also employ a number of other recent methods, including Dense Trajectories [72, 73], Improved Trajectories [74], C3D [69], and the methods of Simonyan and Zisserman [60], Ng et al. [46], and Xu et al. [80].

The authors of Dense Trajectories [72, 73] make their feature extractor available, but not their full pipeline [73]. Similarly, the authors of Improved Trajectories [74] also make their feature extractor available, but not their full pipeline. We reimplemented one version of a pipeline based on the original Dense Trajectories and two versions of a pipeline based on the Improved Trajectories, one that employs an SVM classifier and one that employs a neural network classifier.

This latter method, Improved Trajectories+NN, was implemented as follows. We compute Improved Trajectories for a video and then use PCA to reduce the number of dimensions of the Traj, HoG, HoF, MBHx, and MBHy features to 20, 48, 54, 48, and 48, respectively. We then train Fisher vectors [51] with a Gaussian mixture model with 32 components for each of the five features. Finally, the five Fisher vectors are concatenated to form a fixed-length feature vector of 6976 dimensions for each video. To classify these feature vectors, we construct a four-layer feed-forward neural network (4-mlp), whose structure will be described below.
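The following is a minimal sketch of this encoding step, assuming scikit-learn’s PCA and a diagonal-covariance GaussianMixture, and using first-order (mean-gradient) Fisher statistics, which would account for the stated \(32 \times (20+48+54+48+48) = 6976\) dimensions; the power and L2 normalization are common practice and an assumption here, and the trajectory feature extraction itself is not shown.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

CHANNEL_DIMS = {"Traj": 20, "HoG": 48, "HoF": 54, "MBHx": 48, "MBHy": 48}

def fit_channel(train_descriptors, out_dim, n_components=32):
    """Fit PCA and a diagonal-covariance GMM for one feature channel."""
    pca = PCA(n_components=out_dim).fit(train_descriptors)
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(pca.transform(train_descriptors))
    return pca, gmm

def fisher_vector(descriptors, pca, gmm):
    """First-order Fisher vector (K*D dimensions) for one video's descriptors."""
    x = pca.transform(descriptors)                                      # (N, D)
    gamma = gmm.predict_proba(x)                                        # (N, K) soft assignments
    diff = (x[:, None, :] - gmm.means_[None]) / np.sqrt(gmm.covariances_)  # (N, K, D)
    fv = (gamma[:, :, None] * diff).sum(0)                              # (K, D)
    fv /= x.shape[0] * np.sqrt(gmm.weights_)[:, None]
    fv = fv.ravel()
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                              # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                            # L2 normalization

def encode_video(channels, models):
    """Concatenate per-channel Fisher vectors into one fixed-length vector."""
    return np.concatenate([fisher_vector(channels[c], *models[c])
                           for c in CHANNEL_DIMS])
```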

We constructed a classifier based on C3D [69] by using the published pretrained models with a neural network classifier. We compute the C3D fc6 features using the model trained on the Sports-1M dataset [26]. A fixed-length 4096-dimensional feature vector is computed for every temporal interval of 16 frames in a video sample. Thus, a video sample of length T will have a sequence of \(\lfloor \frac{T}{16}\rfloor \) feature vectors. All such feature vectors produced on a video sample are averaged to obtain a single feature vector for that sample. These are also classified with a 4-mlp network.
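A minimal sketch of this pooling, assuming a hypothetical `c3d_fc6` wrapper around the pretrained model that maps a 16-frame clip to its 4096-dimensional fc6 vector:

```python
import numpy as np

def video_feature(frames, c3d_fc6, clip_len=16):
    """Average C3D fc6 features over non-overlapping 16-frame windows."""
    feats = [c3d_fc6(frames[i:i + clip_len])
             for i in range(0, len(frames) - clip_len + 1, clip_len)]
    return np.mean(feats, axis=0)   # single 4096-D descriptor per video sample
```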

We have three additional baseline methods. The first, VGG(RGB) \(+\) PCA \(+\) VLAD, attempts to simulate Xu et al. [80]. The second, VGG(RGB) \(+\) LSTM, attempts to simulate Ng et al. [46]. Finally, both Simonyan and Zisserman [60] and Ng et al. [46] employ two data streams, one modeling image appearance features and one modeling motion through optical flow features. We simulate the latter data stream with a third baseline method, VGG(Flow) \(+\) LSTM.

The VGG(RGB) \(+\) PCA \(+\) VLAD method pools video descriptors produced by a convolutional neural network (CNN). We compute the VGG-16 fc7-relu features using the model pretrained on the ImageNet dataset [55]. A fixed-length 4096-dimensional feature vector is computed for every RGB video frame, after which the dimension is reduced to 128 with principal component analysis (PCA). We employ the vector of locally aggregated descriptors (VLAD) method [23] with 32 K-means centers to pool the sequence of 128-dimensional feature vectors into a single 4096-dimensional feature vector per video. Again, these are classified with a 4-mlp network.
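A minimal sketch of the VLAD pooling step, assuming scikit-learn’s KMeans; the power and L2 normalization details vary across implementations and are an assumption here.

```python
import numpy as np
from sklearn.cluster import KMeans

def vlad(descriptors, kmeans):
    """Pool (N, D) frame descriptors into a single K*D VLAD vector."""
    K, D = kmeans.cluster_centers_.shape
    assign = kmeans.predict(descriptors)                      # nearest center per frame
    v = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):
            v[k] = (members - kmeans.cluster_centers_[k]).sum(0)  # residuals
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))                       # power normalization
    return v / (np.linalg.norm(v) + 1e-12)                    # L2 normalization

# e.g. 32 centers over 128-D PCA-reduced VGG fc7 features -> 32 * 128 = 4096-D per video
# kmeans = KMeans(n_clusters=32).fit(training_descriptors)
```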

The VGG(RGB) \(+\) LSTM method computes a sequence of VGG-16 fc7-relu feature vectors for a video, one per RGB frame. This sequence is then classified with a five-layer neural network (5-lstm) built around a Long Short-term Memory (LSTM) layer [19].

The VGG(Flow) \(+\) LSTM method is similar to the VGG(RGB) \(+\) LSTM method except that the VGG features are computed on dense optical flow fields [12], sampled at frame rate, instead of RGB frames. The same VGG model, pretrained on the ImageNet dataset, is applied to the flow fields. The same sequence classifier, 5-lstm, is used for classifying the resulting feature vector sequence, but it is retrained on the flow-based features.

The 4-mlp classifiers used by Improved Trajectories \(+\) NN, C3D, and VGG(RGB) \(+\) PCA \(+\) VLAD employ the same structure. An \(\alpha \)-dimensional input feature vector is processed by an input layer with \(\alpha \) nodes, a first hidden layer with \(\frac{\alpha }{2}\) nodes, and a second hidden layer with \(\frac{\alpha }{4}\) nodes to produce an output layer with \(\beta \) nodes, where \(\beta \) is the number of classes. Similarly, the 5-lstm classifiers used by VGG(RGB) \(+\) LSTM and VGG(Flow) \(+\) LSTM also employ the same structure. An \(\alpha \)-dimensional input feature vector is processed by an input layer with \(\alpha \) nodes, a first hidden layer with \(\frac{\alpha }{16}\) nodes, an LSTM layer with \(\frac{\alpha }{16}\) nodes, and a second hidden layer with 256 nodes, to produce an output layer with \(\beta \) nodes, where \(\beta \) is the number of classes. The last instance of the output sequence of the LSTM layer is fed into the second hidden layer. All other layers are fully connected linear units. The 4-mlp and 5-lstm networks both employ a dropout layer [66] with a drop rate of 0.3 before the input layer and a softmax layer after the output layer to compute the class probability. All networks employ the hyperbolic tangent (\(\tanh \)) as the activation function. Distinct instances of the associated network topology are trained for the different methods using stochastic gradient descent (SGD) with a batch size between 10 and 20 and a learning rate between \(10^{-3}\) and \(10^{-2}\).
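The following is a minimal PyTorch sketch of the two topologies; the layer widths, \(\tanh \) activations, dropout rate, and softmax output follow the description above, while treating the input layer as a linear layer of width \(\alpha \) is an interpretation of that description, not the authors’ released code.

```python
import torch
import torch.nn as nn

class MLP4(nn.Module):
    """4-mlp: input, two hidden layers, and output, all fully connected."""
    def __init__(self, alpha, beta):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(0.3),                              # dropout before the input layer
            nn.Linear(alpha, alpha), nn.Tanh(),           # input layer
            nn.Linear(alpha, alpha // 2), nn.Tanh(),      # first hidden layer
            nn.Linear(alpha // 2, alpha // 4), nn.Tanh(), # second hidden layer
            nn.Linear(alpha // 4, beta))                  # output layer
    def forward(self, x):                                 # x: (batch, alpha)
        return torch.softmax(self.net(x), dim=-1)

class LSTM5(nn.Module):
    """5-lstm: input, hidden, LSTM, hidden, and output layers over a sequence."""
    def __init__(self, alpha, beta):
        super().__init__()
        self.drop = nn.Dropout(0.3)
        self.inp = nn.Linear(alpha, alpha)                # input layer
        self.h1 = nn.Linear(alpha, alpha // 16)           # first hidden layer
        self.lstm = nn.LSTM(alpha // 16, alpha // 16, batch_first=True)
        self.h2 = nn.Linear(alpha // 16, 256)             # second hidden layer
        self.out = nn.Linear(256, beta)                   # output layer
    def forward(self, x):                                 # x: (batch, time, alpha)
        h = torch.tanh(self.h1(torch.tanh(self.inp(self.drop(x)))))
        seq, _ = self.lstm(h)
        last = seq[:, -1]                                 # last LSTM output fed onward
        return torch.softmax(self.out(torch.tanh(self.h2(last))), dim=-1)
```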

5.1 Baseline experiment on trimmed videos

The dataset of trimmed videos was constructed from the full LCA dataset as follows. First, we took the human-annotated action intervals produced by one of the annotators, cbushman. This annotator was chosen to maximize the number of available action intervals. Next, a maximum of 100 intervals were selected for each action class. For those action classes for which more than 100 intervals were annotated, a random subset of 100 intervals was selected. For those action classes with 100 or fewer annotated intervals, all annotated intervals were used. A 2-s clip was extracted from the original videos centered in time on the middle of each selected annotation interval. These clips were temporally downsampled to 20 fps and spatially downsampled to a width of 320 pixels, maintaining the aspect ratio. This process resulted in a total of 1858 clips used for the baseline experiment on trimmed videos.

The class label of each clip was considered to be the action class corresponding to the human-annotated interval from which the clip was derived. The clips for each class were randomly split into a training set with 70 % of the clips and a test set with 30 % of the clips, under the constraint that sets of clips extracted from the same video should fall completely into either the training or the test set. This was done to avoid having clips from the same action (e.g., two clips of the same person digging in the same location) appear in both the training and test sets. This resulted in a training set of 1318 clips and a test set of 540 clips. Each method was trained on the training set and used to produce labels on the test set. All methods were run with default or recommended parameters. These labels were compared to the intended class labels to measure the accuracy of each method. The results of this experiment are summarized in Table 2.
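A minimal sketch of such a grouped split, assuming scikit-learn’s GroupShuffleSplit and hypothetical per-clip arrays `clips`, `labels`, and `source_video`; unlike the per-class split described above, this single grouped split is not stratified by class, so it only approximates the procedure.

```python
from sklearn.model_selection import GroupShuffleSplit

# Clips cut from the same source video share a group and therefore land
# entirely in the training set or entirely in the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(clips, labels, groups=source_video))
```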

Table 2 Comparison of accuracy for state-of-the-art action recognition systems on a subset of the LCA dataset with trimmed videos

There are several things of note in these results. First, all the accuracies are quite low, indicating the difficulty of the LCA dataset. The highest performing method, Improved Trajectories \(+\) NN, is correct only 18.148 % of the time. The four lowest performing methods have accuracies approaching that of the blind baseline (5.555 %). Additionally, the newer methods do not necessarily outperform the older methods. We suspect that this shift in relative performance, compared with other datasets, results from the lack of correlation between background and action class, a correlation that is often present in other datasets, as well as from the presence of multiple people in the field of view. That the performance is so low, and that the highest scoring methods on other datasets are not necessarily the highest scoring here, shows that this dataset presents new and difficult challenges not present in other datasets.

5.2 Baseline experiment on untrimmed streaming videos

For this experiment, we employed fivefold leave-one-set-of-videos-out cross-validation. For each fold, binary classifiers using each method were trained for each action class to perform a presence/absence distinction for activity of that class. These were trained with a collection of short 2-s samples of each action class. Each presence/absence classifier was trained in a discriminative fashion with all samples from the target class in the training set as positive samples and all samples from all the other classes in the training set as negative samples. The collection of short 2-s samples of each action class was constrained to have a maximum of 100 samples for each action class. These were cropped from the streaming training videos using the temporal annotations obtained from one of the annotators, cbushman. For those action classes for which the training set contained fewer than 100 instances, all instances were used. For those action classes for which the training set contained more than 100 instances, 100 instances were chosen randomly.

These binary presence/absence classifiers were used to produce labeled intervals for event occurrences in the test sets as follows. Each streaming test video was divided into a sequence of 2-s clips, each overlapping the previous by 1 s. The trained binary presence/absence classifiers were used to generate confidence scores for each action class on each clip by removing the final quantization. This yields a sequence of confidence scores for each action class over time on the streaming test video. This sequence of scores was smoothed with an FIR filter by convolving it with the finite signal \([0.1, 0.2, 0.3, 0.4, 0.5, 0.4, 0.3, 0.2, 0.1]\). Nonmaximum suppression was then performed on each smoothed score sequence to generate intervals for each peak in the smoothed score sequence. Each interval was given the score corresponding to the peak that produced that interval. The temporal extent of each interval was found by progressively extending the interval forward and backward around the peak in 1-s increments until a local minimum in the smoothed score sequence was reached. Finally, for each action class, the top 50 % of such intervals were selected based on their confidence score.
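The following minimal sketch illustrates this post-processing for one action class on one streaming test video; the peak-finding details are an interpretation of the text rather than released code.

```python
import numpy as np

FIR = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.4, 0.3, 0.2, 0.1])

def intervals_from_scores(scores):
    """scores: 1-D array of per-clip classifier confidences for one class, 1 s apart."""
    s = np.convolve(scores, FIR, mode="same")              # smooth the score sequence
    peaks = [i for i in range(1, len(s) - 1)
             if s[i] >= s[i - 1] and s[i] >= s[i + 1]]     # local maxima
    out = []
    for p in peaks:
        lo = p
        while lo > 0 and s[lo - 1] <= s[lo]:               # extend backward to local minimum
            lo -= 1
        hi = p
        while hi < len(s) - 1 and s[hi + 1] <= s[hi]:      # extend forward to local minimum
            hi += 1
        out.append((lo, hi, float(s[p])))                  # (start_s, end_s, confidence)
    return out

# Per class, the top 50 % of intervals by confidence would then be kept.
```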

Fig. 4 Machine–human intercoder agreement on the LCA dataset. F1 score for each pair of machine methods and human annotators as the overlap criterion is varied. Overlap of two intervals is measured as the length of their intersection divided by the length of their union

Each method produced a set of labeled intervals for the test set in each cross-validation fold. Since the test sets for the cross-validation folds constitute a disjoint cover of the entire dataset, these were all pooled to yield a single set of intervals produced by each method for the entire dataset. Such a set of intervals produced by a method is conceptually similar to a set of intervals produced by a human annotator, since each human annotator also annotated the entire dataset. Thus, we computed machine–human intercoder agreement on sets of intervals covering the entire LCA dataset comparing each method to each human annotator using the same method for computing human–human intercoder agreement described in Sect. 4. Just as Fig. 3 characterized human–human intercoder agreement by plotting F1 score for a pair of human annotators as a function of the overlap threshold, Fig. 4 characterizes machine–human intercoder agreement by plotting F1 score between a pair of a machine method and a human annotator as a function of overlap threshold. This is done for all pairs of machine methods and human annotators. To allow comparison between machine–human and human–human intercoder agreement, Fig. 5 overlays the results of machine–human and human–human intercoder agreement. To increase legibility, machine–human and human–human intercoder agreement is shown only for pairs that include cbushman. Note that machine–human intercoder agreement is considerably lower than human–human intercoder agreement. The gap between machine–human and human–human intercoder agreement suggests that the LCA dataset is challenging and can serve as fodder for new action recognition research.

Fig. 5 Comparison between machine–human and human–human intercoder agreement on the LCA dataset, comparing against a single human annotator: cbushman. F1 score for each pair of annotators as the overlap criterion is varied. Overlap of two intervals is measured as the length of their intersection divided by the length of their union

We similarly compared both machine–human and human–human intercoder agreement using the evaluation metric from THUMOS [25]. For each action class and every pair of interval annotations for the entire LCA dataset that included the human annotation produced by cbushman, we computed the average precision (AP) on the ranked set of intervals produced using five distinct overlap thresholds: 0.1, 0.2, 0.3, 0.4, and 0.5. For human annotators, we randomized the rank order as humans did not annotate intervals with confidence scores. For each overlap threshold and each pair of machine and/or human annotators, a mean average precision (mAP) was computed as an unweighted average over all action classes. Figure 6 shows the machine–human and human–human intercoder agreement using the evaluation metric from THUMOS for each overlap threshold. Again note that machine–human intercoder agreement is considerably lower than human–human intercoder agreement. The gap between machine–human and human–human intercoder agreement computed using the THUMOS evaluation criterion further supports the hypothesis that the LCA dataset is challenging and can serve as fodder for new action recognition research.
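A minimal sketch of the per-class evaluation in the spirit of the THUMOS protocol, reusing `iou_1d` from the earlier sketch; the greedy matching and non-interpolated AP computation are assumptions, not the THUMOS reference code.

```python
import numpy as np

def average_precision(detections, ground_truth, thresh):
    """detections: list of (start, end, score); ground_truth: list of (start, end)."""
    detections = sorted(detections, key=lambda d: -d[2])   # rank by confidence
    matched = [False] * len(ground_truth)
    tp = np.zeros(len(detections))
    for i, (s, e, _) in enumerate(detections):
        ious = [0.0 if matched[j] else iou_1d((s, e), g)
                for j, g in enumerate(ground_truth)]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] > thresh:                    # match to best unmatched interval
            tp[i], matched[j] = 1, True
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(detections)) + 1)
    # non-interpolated AP: mean precision at each true-positive rank
    return float(np.sum(precision * tp) / max(len(ground_truth), 1))

# mAP at a given overlap threshold = unweighted mean of average_precision
# over all 24 action classes.
```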

Fig. 6 Comparison between machine–human and human–human intercoder agreement on the LCA dataset, using the evaluation metric from the THUMOS Challenge [25], comparing against a single human annotator: cbushman

6 Related work

The THUMOS Challenge [25] also evaluates action recognition on untrimmed videos. The dataset used for the THUMOS Challenge differs from the LCA dataset in several ways. First, the THUMOS Challenge uses trimmed videos from UCF101 [65] as the training set; only the validation and test sets involve untrimmed videos. The LCA dataset is partitioned into five sets of videos for training and test; each set consists of untrimmed videos. Second, the validation and test sets for the THUMOS Challenge are ‘untrimmed’ but not ‘streaming.’ The videos in the LCA dataset are not just ‘untrimmed’ but also ‘streaming.’ That is, the videos in the LCA dataset are all very long, much longer than those in the THUMOS Challenge validation and test sets. Each video in the LCA dataset typically includes numerous occurrences of many different kinds of actions at different points in time and different positions in the field of view. Multiple action occurrences, of different action types, often overlap in space and/or time. Part of the intent of the LCA dataset is to evaluate the ability of methods to handle such overlap. Third, the set of action classes used in the THUMOS Challenge is different from that used in the LCA dataset. The THUMOS Challenge uses a subset of 20 of the 101 action classes from UCF101, all of which are sporting activities. Many of these are described by nouns rather than verbs. In contrast, the LCA dataset contains annotations for 24 action classes and five object classes. The action classes are all verbs that describe everyday activities. The action classes of the THUMOS Challenge and the LCA dataset are disjoint and cover fundamentally different kinds of activity. Fourth, the action classes in the THUMOS Challenge are often correlated with the background: Diving typically happens in swimming pools and BasketballDunk typically happens on basketball courts. The background can help distinguish the activity class. In contrast, there is little to no correlation between action class and background in the LCA dataset. This forces activity recognition to rely solely on the motion characteristics of the action being performed. Thus, the THUMOS Challenge and the LCA dataset support two different kinds of research into activity recognition: methods that utilize background as part of activity recognition and methods that do not. Fifth, the THUMOS Challenge comes with a single annotation for the validation and test sets. As such, there is no way of knowing whether or not there is human agreement as to the annotation. In contrast, the LCA dataset was annotated by five different independent annotators. As such, it models the inherent ambiguity present in many natural activities. This can serve to facilitate future computer vision research that is aware of and can model such ambiguity. Finally, the action classes in the THUMOS Challenge are themselves largely unambiguous. For example, any given action occurrence is unlikely to be both PoleVault and Shotput. In contrast, the action classes in the LCA dataset overlap semantically. For example, a given action occurrence might legitimately be both an approach and a walk. That is the nature of verbs in natural language; they describe overlapping semantic classes. Moreover, there may be natural inferential structure between such semantically overlapping classes. This inferential structure may be bidirectional. For example, it may be the case that whenever there is chase there is also flee and vice versa.
The fact that chase and flee cooccur does not imply that they have the same meaning; their meanings differ in the thematic relationship between the participants. The LCA dataset can facilitate future research that studies such relationships [82]. This inferential structure may also be unidirectional. For example, it may be the case that whenever there is carry there is also walk or run but not vice versa. The LCA dataset can facilitate future research that attempts to learn the inferential structure of language from visual data [21].

7 Conclusion

We make available to the community a new dataset to support action recognition research. This dataset has more hours of video than HMDB51, roughly the same amount of video as UCF50, and about half as much video as UCF101 and Hollywood-2, but unlike these it contains streaming video; it also has more video and twice as many classes as VIRAT, the largest prior dataset of streaming video. A distinguishing characteristic of this dataset is that the video is streaming; long video segments contain many actions that start and stop at arbitrary times, often overlapping in space and/or time. A further distinguishing characteristic is that while the actions were filmed in a variety of backgrounds, every action occurs in every background, so the background gives little information as to action class. The above characteristics suggest that this is a challenging dataset. This is confirmed by the low performance of recent methods in the baseline experiments, which also show that the methods that perform best on other datasets do not necessarily outperform other methods on this dataset. The new difficulties posed by this dataset should spur significant advances in action recognition research.