Collecting and annotating the large continuous action dataset

We make available to the community a new dataset to support action recognition research. This dataset is different from prior datasets in several key ways. It is significantly larger. It contains streaming video with long segments containing multiple action occurrences that often overlap in space and/or time. All actions were filmed in the same collection of backgrounds, so that background gives little clue as to action class. We had five humans independently annotate the temporal extent of action occurrences, labeled with their classes, and measured a surprisingly low level of intercoder agreement. Baseline experiments show that recent state-of-the-art methods perform poorly on this dataset. This suggests that this will be a challenging dataset to foster advances in action recognition research. This manuscript serves to describe the novel content and characteristics of the LCA dataset, present the design decisions made when filming the dataset, document the novel methods employed to annotate the dataset, and present the results of our baseline experiments.

Here, we introduce a new dataset called the Large Continuous Action Dataset (LCA). This dataset contains depictions of 24 action classes. The video for this dataset was filmed and annotated as part of the DARPA Mind's Eye program. A novel characteristic of this dataset is that rather than consisting of short clips, each of which depicts a single action class, this dataset contains much longer streaming video segments that each contain numerous instances of a variety of action classes that often overlap in time and may occur in different portions of the field of view. The annotation that accompanies this dataset delineates not only which actions occur but also their temporal extent.
Many of the prior datasets were culled from video downloaded from the internet. In contrast, the LCA dataset contains video that was filmed specifically to construct the dataset.
While the video was filmed with people hired to act out the specified actions according to a general script, the fact that the video contains long streaming segments tends to mitigate any artificial aspects of the video and renders the action depictions quite natural. Moreover, the fact that all of the video was filmed in a relatively small number of distinct backgrounds makes the dataset challenging; the background gives little clue as to the action class.
A further distinguishing characteristic of the LCA dataset is the degree of ambiguity. Most prior action-recognition datasets, in fact most prior datasets for all computer-vision tasks, make a tacit assumption that the labeling is unambiguous and thus that there is a 'ground truth.' We had a team of five human annotators each annotate the entire LCA dataset. This allowed us to measure the degree of intercoder agreement. Surprisingly, there is a significant level of disagreement between humans as to the temporal extent of most action instances. We believe that such inherent ambiguity is a more accurate reflection of the underlying action-recognition task and hope that the multiplicity of divergent annotations will help spur novel research with this more realistic dataset.
Another distinguishing characteristic of the LCA dataset is that some action occurrences were filmed simultaneously with multiple cameras with partially overlapping fields of view. While the cameras were neither spatially calibrated nor temporally synchronized, the fact that we have multiple annotations of the temporal extent of action occurrences may support future efforts to perform temporal synchronization after the fact. Furthermore, while most of the video was filmed from ground-level cameras with a horizontal view, some of the video was filmed with aerial cameras with a bird's-eye view. Some of this video was filmed simultaneously with ground cameras. This may support future efforts to conduct scene reconstruction. Some datasets are provided with specific tasks and evaluation metrics. We refrain from doing so for this dataset. Inter alia, we do not provide officially sanctioned splits into training, validation, and test sets. Instead, we leave it up to the community to make use of this dataset in a creative fashion for as many different tasks as it is suited to.
In particular, the evaluations conducted by Mind's Eye included a specific set of tasks, namely Recognition (REC), Description (DES), Gap Filling (GAP), and Anomaly Detection (ANM), with specific evaluation metrics. Such tasks and metrics are expressly not part of LCA. The evaluations conducted under Mind's Eye make use of material that is not included in LCA and metrics that are not public. Likewise, LCA contains material that was not available for use during the Mind's Eye evaluations. Thus, those evaluations could not be replicated outside of the context of Mind's Eye. Any potential future evaluations conducted with LCA would thus be incomparable to the results obtained under Mind's Eye.
The entire LCA dataset, including the video and the annotations, has been cleared for release by DARPA. The remaining material gathered by DARPA for the Mind's Eye Year 2 evaluation that is not included in LCA may not have been cleared for release. As part of the release process, some video was edited to remove certain portions. Furthermore, the annotation process was performed with the particular versions of the videos included in LCA as provided by DARPA. These may have been transcoded from the original as filmed by the camera. Thus, the time alignment of the annotations can only be guaranteed with the versions of the videos included in LCA. The time alignment may not be correct for any other versions of these videos that may be residual from the DARPA Mind's Eye program.

Collection
The video for this dataset was filmed by DARPA in conjunction with Mitre and several performers from the Mind's Eye program. The bulk of the video was filmed as part of the Mind's Eye Year 2 evaluation. Within the Mind's Eye program, that video was referred to as C-D2a, C-D2b, C-D2c, and the Y2 Evaluation dataset. This video is disjoint from that gathered by Janus Research Group as part of the Mind's Eye Year 1 evaluation. Within the Mind's Eye program, that video was referred to as C-D1a, C-D1b, C-D1, and C-E1. As the LCA dataset contains only a subset of that material, we refrain from using all such terminology in reference to LCA. Also note that the LCA dataset contains some video that was not included in the data used as part of the Mind's Eye evaluations.
The LCA dataset was filmed at three different locations over four periods. Of the 24 annotated verbs (Table 1), 17 were used as part of the stage directions given to the actors to guide the actions that they performed. The remainder were not used as part of the stage directions but occurred incidentally. Nothing, however, precluded the actors from performing actions that could be described by other verbs. Thus the video depicts many actions other than those annotated, including but not limited to riding bicycles, pushing carts, singing, pointing guns, arguing, and kicking balls. The only restriction, in principle, to these 24 verbs is that these were the only actions that were annotated. Identifying the presence of specific verbs in the context of many such confounding actions should present additional challenges.

Annotation
We annotated all occurrences of the 24 verbs from Table 1 in the videos in Table 2. Each such occurrence consisted of a temporal interval labeled with a verb. The judgment of whether an action described by a particular verb occurred is subjective; different annotators will arrive at different judgments as to occurrence as well as the temporal extent thereof. To help guide annotators, we gave them the specification of the intended meaning of each of the 24 verbs as provided by DARPA. Annotators performed the annotation at workstations with dual monitors. One monitor displayed the annotation tool while the other displayed the documentation of intended verb meaning. The documentation of intended verb meaning is included in the LCA distribution.
We also asked annotators to annotate intervals where certain object classes were present in the field of view. These include bandannas, bicycles, people, vehicles, and weapons. (The bandannas were worn by people around their heads or arms.) For these, a count of the number of instances of each class that were visible in the field of view was maintained. It was incremented each time a new instance became visible and decremented each time an instance became invisible. We instructed annotators that there was no need to be precise when an instance was partially visible. We further instructed annotators that vehicles denoted motor vehicles, not push carts, and weapons denoted guns, not other things like clubs or rocks that could be used as weapons.
We provided annotators with a tool that allowed them to view the videos at ordinary frame rate, stop and start the videos at will, navigate to arbitrary points in the videos, view individual frames of the videos, add, delete, and move starting and ending points of intervals, and label intervals with verbs. The tool also contained buttons to increment and decrement the counts for each of the object classes and apprised the annotator of the running counts for the object classes in each frame as the video was played or navigated.
Because of the large quantity of video to be annotated, and the fact that nothing happens during large portions of the video, we preprocessed the video to reduce the amount requiring manual annotation. We first downsampled the video to 5 fps just for the purpose of annotation; the annotation was converted back at the end to the original frame rate. Then segments of this downsampled video where no motion occurred were removed. To do this, we computed dense optical flow on each pixel of each frame of the downsampled video. We then computed the average of the magnitude of the flow vectors in each frame and determined which frames were above a threshold. Stretches of contiguous frames that were above threshold and separated by short stretches of contiguous frames that were below threshold were merged into single temporal segments. Then such single temporal segments that were shorter than a specified temporal length were discarded. Annotators were only given the remaining temporal segments to annotate. We performed a postprocessing step whereby the authors manually viewed all discarded frames to make sure that no actions started, ended, or spanned the omitted temporal segments. As part of this postprocessing step, the authors manually checked that none of the specified object classes entered or left the field of view during the omitted temporal segments.
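The segment-extraction logic just described can be sketched as follows. This is a minimal illustration, not the actual preprocessing code; the function name and the threshold, gap, and length parameters are our own for exposition, and the per-frame average flow magnitudes are assumed to have been computed already (e.g., from dense optical flow).

```python
def motion_segments(avg_mags, thresh, max_gap, min_len):
    # avg_mags: per-frame average optical-flow magnitude
    # frames whose mean flow magnitude exceeds the threshold
    active = [m > thresh for m in avg_mags]
    # collect maximal runs of above-threshold frames as (start, end), inclusive
    runs, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(active) - 1))
    # merge runs separated by short below-threshold stretches
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    # discard merged segments shorter than the minimum length
    return [(s, e) for s, e in merged if e - s + 1 >= min_len]
```

For example, with a gap tolerance of 2 frames and a minimum length of 3 frames, two nearby bursts of motion merge into one segment while an isolated single-frame blip is discarded.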
We had five annotators each independently annotate the entire LCA dataset. Annotators were given initial instructions. During the annotation, annotators were encouraged to discuss their annotation judgments with the authors. The authors would then arbitrate the judgment, often specifying principles to guide the annotation. These principles were then circulated among the other annotators. The annotator instructions and the principles developed through arbitration are included in the LCA distribution.
We performed a consistency check during the annotation process. Whenever an annotator completed annotation of a temporal segment, if that annotator did not annotate any intervals during that segment but other annotators did, we asked that annotator to review their annotation.
The LCA dataset contains five verb-annotation files for each of the video files in Table 2. These have the same name as their corresponding video, but with the extension txt, and are located in directories named with each of the annotator codes bmedikon, cbushman, kim861, nielder, and nzabikh. Each line in each of these files contains a single temporal interval as a text string specifying a verb and two zero-origin nonnegative integers specifying the starting and ending frames of the interval, inclusive. The LCA dataset also contains five object-class annotation files for each of the video files in Table 2. These also share the filename with the corresponding video, but with the addition of the suffix -enter-exits.txt, and are located in the same directories named with each of the above annotator codes. Each line in each of these files contains a text string specifying an object class, an integer specifying the number of objects of that class entering or exiting the field of view (positive for entering and negative for exiting), and a single zero-origin nonnegative integer specifying the video frame.
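A reader for these two line formats might look as follows. This is a hypothetical sketch based on the field layout described above, not code shipped with the dataset; the function names are our own. Note that some verbs (e.g., pick up) contain a space, so fields are split from the right.

```python
def parse_verb_line(line):
    # "<verb> <start> <end>": inclusive, zero-origin frame indices;
    # split from the right because verbs such as "pick up" contain spaces
    verb, start, end = line.rsplit(None, 2)
    return verb, int(start), int(end)

def counts_at_frame(event_lines, frame):
    # event lines: "<class> <delta> <frame>", delta positive when an
    # instance enters the field of view and negative when one exits;
    # accumulate all events up to and including the queried frame
    counts = {}
    for line in event_lines:
        cls, delta, f = line.rsplit(None, 2)
        if int(f) <= frame:
            counts[cls] = counts.get(cls, 0) + int(delta)
    return counts
```

For example, parsing "pick up 120 180" yields the verb "pick up" with the interval [120, 180], and replaying an -enter-exits file up to a given frame reconstructs the running object counts maintained during annotation.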

Analysis
We analyzed the degree of agreement between the different annotators. To do this, we compared pairs of annotators, taking the judgments of one as 'ground truth' and computing the F1 score of the other. An interval in the annotation being scored was taken as a true positive if it overlapped some interval with the same label in the 'ground truth.' An interval in the annotation being scored was taken as a false positive if it did not overlap any interval with the same label in the 'ground truth.' An interval in the 'ground truth' was taken as a false negative if it did not overlap any interval with the same label in the annotation being scored. From these counts, an F1 score could be computed.
We employed the following overlap criterion. For a pair of intervals, we computed a one-dimensional variant of the 'intersection over union' criterion employed within the Pascal VOC Challenge to determine overlap of two axis-aligned rectangles [10], namely the length of the intersection divided by the length of the union. We considered two intervals to overlap when this quantity exceeded some specified threshold. We then computed the F1 score as this threshold was varied and plotted the results for all pairs of annotators (Fig. 2).
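The pairwise scoring can be sketched as follows, assuming intervals are inclusive frame ranges paired with verb labels. This is an illustrative reimplementation of the criterion described above, not the exact scoring code used to produce Fig. 2.

```python
def interval_iou(a, b):
    # one-dimensional intersection over union of two inclusive frame intervals
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

def f1_score(truth, scored, thresh):
    # truth, scored: lists of (verb, start, end); one annotator's judgments
    # are taken as 'ground truth' and the other annotator is scored
    def matches(x, ys):
        return any(x[0] == y[0] and interval_iou(x[1:], y[1:]) > thresh
                   for y in ys)
    tp = sum(1 for s in scored if matches(s, truth))   # overlaps some truth
    fp = len(scored) - tp                              # overlaps no truth
    fn = sum(1 for t in truth if not matches(t, scored))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Sweeping `thresh` from 0 to 1 for each ordered pair of annotators yields curves of the kind plotted in Fig. 2.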
Note that there is a surprisingly low level of agreement between annotators. Annotators rarely if ever agree on the precise temporal extent of an action, as indicated by the fact that all agreement curves go to zero as the overlap threshold goes to one. At an overlap threshold of 0.5, the F1 score varies between about 0.3 and about 0.6. At an overlap threshold of 0.1, the threshold employed by VIRAT to score machines against humans, the F1 score varies between about 0.38 and about 0.67. This puts an upper bound on machine performance with this dataset using the VIRAT threshold. Even if the overlap threshold is reduced to zero, the F1 score varies between about 0.43 and about 0.7. This indicates that this dataset should be challenging for computer action recognition.
This difficulty is corroborated by a recent paper [1]. That paper employs a different subset of video from the DARPA Mind's Eye Year 2 evaluation that is extremely similar to that in the LCA dataset. That dataset was annotated with the same procedures that were used to annotate the LCA dataset. The six verbs with the highest intercoder agreement were selected: carry, dig, hold, pick up, put down, and walk. For each of these, between 23 and 30 clips of 2.5s duration with the highest level of intercoder agreement were selected, yielding 169 distinct clips. Seven different state-of-the-art computer-vision action-recognition methods (C2 [20], Action Bank [48], Stacked ISA [30], VHTK [38], Cao's implementation [6] of Ryoo's method [47], Cao's method [6], and our own implementation of the classifier described by Wang et al. [61] on top of the Dense Trajectories [60][61][62] feature extractor) were employed on this dataset, performing one-out-of-six classification in an eight-fold cross-validation. Note that for this task, each 2.5s clip was labeled with precisely one of the six verbs as ground truth. All seven methods performed extremely poorly on this dataset (C2 47.4%, Action Bank 44.2%, Stacked ISA 46.8%, VHTK 32.5%, Cao's implementation of Ryoo's method 31.2%, Cao's method 33.3%, and Dense Trajectories 52.3%), despite the task having only six classes and chance performance of 16.6%.

Baseline Experiment
We performed an experiment similar to that of Barbu et al. [1] to present and compare the performance of several state-of-the-art action-recognition systems on the LCA dataset. We evaluated all known action-recognition systems for which the code for the end-to-end system is available, as well as implementations of several for which the code is unavailable. We used the same set of seven state-of-the-art action-recognition systems compared in Barbu et al. [1] (C2 [20], Action Bank [48], Stacked ISA [30], VHTK [38], Cao's implementation [6] of Ryoo's method [47], Cao's method [6], and our own implementation of the classifier described by Wang et al. [61] on top of the Dense Trajectories [60,61] feature extractor). We also compared against our own implementation of the classifier described by Wang et al. [61] on top of the more recent Improved Trajectories method [62].
As these methods are designed for classification of video clips, rather than for streaming video, this experiment was performed on a subset of the LCA dataset. This subset was designed to be similar in character to other action-recognition datasets and comprised short video clips. It was created as follows. First, we took the human-annotated action intervals produced by one of the annotators, cbushman. This annotator was chosen to maximize the number of available action intervals. Next, a maximum of 100 intervals were selected for each action class. For those action classes for which more than 100 intervals were annotated, a random subset of 100 intervals was selected. For those action classes with 100 or fewer annotated intervals, all annotated intervals were used. A 2s clip was extracted from the original videos, centered in time on the middle of each selected annotation interval. These clips were temporally downsampled to 20 fps and spatially downsampled to a width of 320 pixels, maintaining the aspect ratio. This process resulted in a total of 1858 clips used for the baseline experiment.
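The geometry of the clip extraction can be illustrated as follows. The function name, the rounding, and the clamping at the start of the video are our own assumptions; the exact extraction code is not part of the dataset.

```python
def clip_frame_range(start, end, fps, clip_sec=2.0):
    # center a clip_sec-second window on the middle of an annotated
    # interval (inclusive, zero-origin frames), clamped at frame 0
    half = int(round(clip_sec * fps / 2))
    mid = (start + end) // 2
    lo = max(mid - half, 0)
    return lo, lo + 2 * half - 1
```

For instance, an interval spanning frames 100-199 of a 30 fps video yields a 60-frame clip centered on frame 149.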
The class label of each clip was considered to be the action class corresponding to the human-annotated interval from which the clip was derived. The clips for each class were randomly split into a training set with 70% of the clips and a test set with 30% of the clips, under the constraint that sets of clips extracted from the same video should fall completely into either the training or the test set. This was done to avoid having clips from the same action (e.g., two clips from the same person digging in the same location) appear in both the training and test sets. This resulted in 1318 training clips and 540 test clips. Each method was trained on the training set and used to produce labels on the test set. All methods were run with default or recommended parameters. These labels were compared to the intended class labels to measure the accuracy of each method. The results of this experiment are summarized in Table 3.

  Method                                     accuracy (%)
  Action Bank [48]                                 16.667
  Improved Trajectories [62]                       15.556
  Dense Trajectories [60,61]                       14.074
  C2 [20]                                           9.259
  Cao [6]                                           7.592
  Cao's [6] implementation of Ryoo [47]             6.667
  Stacked ISA [30]                                  6.667
  VHTK [38]                                         6.296
  chance (30/540)                                   5.555
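One way to realize the constraint that all clips from a given source video land on the same side of the split is sketched below. This is a hypothetical greedy implementation; the authors' actual procedure splits per class and may assign groups differently.

```python
import random

def split_by_video(clips, train_frac=0.7, seed=0):
    # clips: list of (clip_id, source_video); all clips from one source
    # video are assigned to the same side of the split
    videos = sorted({v for _, v in clips})
    random.Random(seed).shuffle(videos)
    # greedily add whole videos to the training side until the
    # target fraction of clips is reached
    train_videos, assigned = set(), 0
    for v in videos:
        if assigned < train_frac * len(clips):
            train_videos.add(v)
            assigned += sum(1 for _, vv in clips if vv == v)
    train = [c for c, v in clips if v in train_videos]
    test = [c for c, v in clips if v not in train_videos]
    return train, test
```

Because whole videos are assigned at once, the realized split ratio only approximates `train_frac` when videos contribute unequal numbers of clips.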
There are several things of note in these results. First, all the accuracies are quite low, indicating the difficulty of the LCA dataset. The highest performing method, Action Bank, is correct only 16.667% of the time. The four lowest performing methods have accuracies approaching chance performance (5.555%). Additionally, the newer methods do not necessarily outperform the older methods. C2 significantly outperforms four more recently published methods, while Action Bank is the best, outperforming even Improved Trajectories, which has the highest performance on several well-known datasets including HMDB (57.2% vs Action Bank's 26.9%) and UCF50 (91.2%). We suspect that this difference in relative performance compared to other datasets is the result of the lack of correlation between background and action class that is often present in other datasets, as well as the presence of multiple people in the field of view and the small relative size of the people in the field of view. That the performance is so low, and that the highest scoring methods on other datasets are not necessarily the highest scoring here, shows that this dataset presents new and difficult challenges not present in other datasets.

Conclusion
Upon acceptance of this manuscript we will make available to the community a new dataset to support action-recognition research. This dataset has more hours of video than HMDB51, roughly the same amount of video as UCF50, and about half as much video as UCF101 and Hollywood-2, but unlike these it has streaming video; it also has about twice as much video and twice as many classes as VIRAT, the largest prior dataset of streaming video. A distinguishing characteristic of this dataset is that the video is streaming; long video segments contain many actions that start and stop at arbitrary times, often overlapping in space and/or time. A further distinguishing characteristic is that while all actions were filmed in a variety of backgrounds, every action occurs in every background, so that background gives little information as to action class. We employed novel techniques to annotate the temporal extent of action occurrences. A multiplicity of human annotations allows measuring intercoder agreement. The above characteristics, together with the surprisingly low level of intercoder agreement, suggest that this will be a challenging dataset. This is confirmed by the low performance of recent methods in a baseline experiment, which also shows that the methods which perform best on other datasets do not necessarily outperform other methods on this dataset. The new difficulties posed by this dataset should spur significant advances in action-recognition research.

Fig. 1
Fig. 1 Several frame sequences from the LCA dataset illustrating several of the backgrounds in which they were filmed.

Fig. 2
Fig. 2 Intercoder agreement on the annotations of the LCA dataset. F1 score for each pair of annotators as the overlap criterion is varied. Overlap of two intervals is measured as the length of their intersection divided by the length of their union.

Table 1
Verbs used as labels in the LCA dataset. The starred verbs were used as part of the stage directions to the actors. The remaining verbs were not used as part of the stage directions but may have occurred incidentally.

As part of the Mind's Eye Year 2 evaluation, we undertook a systematic annotation effort for a portion of the above video. That annotation forms the basis of LCA. LCA contains all and only the portion of the above video that was annotated as part of this process. This video comprises 190 files as delineated in Table 2.
Eight files are in MOV format, 46 in MP4 format, and 136 in AVI format. The MOV files all use the MP4V codec and are 640×384 at 60 fps. The MP4 files all use the H264 codec and are 640×360 at 30 fps. The AVI files all use the XVID codec and are either 640×384 at 60 fps or 1440×1080 at 30 fps. This constitutes 2,302,144 frames and a total of 12 hours, 51 minutes, and 16 seconds of video. For comparison, UCF50 has 1,330,936 frames and 13.81 hours, HMDB51 has 632,635 frames and 5.85 hours, UCF101 has 27 hours, Hollywood-2 has 20.1 hours, and VIRAT has 8.5 hours. Several frame sequences from this dataset illustrating several of the backgrounds are shown in Fig. 1. The Mind's Eye program specified a set of 48 verbs of interest. Of these, the LCA dataset uses only 24 verbs as annotation labels, as delineated in Table 1.

Table 2
The original names of the files provided by DARPA. Filenames containing GPTC were filmed at GPJTC. Filenames containing STOPS were filmed at STOPS. Filenames consisting solely of a number were filmed at FITG. Numbers of the form YYYYMMDD indicate filming date. CR indicates country road. SH indicates safe house. Indices on CR, SH, and VT indicate variant backgrounds of the given class. CP1, CP2, C1, and C3 indicate camera. Text indicates the staging directions used to guide filming. The remaining numbers serve to uniquely identify the video. These videos were renamed to video-XXX for consistency in the release. The release also contains a file, video-mapping.txt, which gives the mapping between the original filenames and those in the LCA release.

Table 3
Comparison of accuracy for state-of-the-art actionrecognition systems on a subset of the LCA dataset.