1 Introduction

Video arguably constitutes the largest portion of internet content. With more than \(74\%\) of total internet traffic being video [15], the need to automatically understand and index such massive amounts of data has become pressing. The computer vision community has embraced this problem, and during the last decade, several approaches for video analysis have been proposed [8, 26, 31, 39, 41, 48, 52, 58, 76]. One of the most challenging tasks in this field, which has recently gained much attention, is to understand and temporally localize human actions in untrimmed videos. Such a task, widely known as temporal action localization, aims to produce the temporal bounds within a video during which human actions occur.

Fig. 1. Active Learning for Action Localization. We compare three different active learners for temporal action localization. We plot the localization performance (mAP) of each learner at different active learning steps. Each learner aims to use as few training samples as possible, which are obtained sequentially by annotating samples from an unlabeled set. The proposed method resembles Learner C, which minimizes the number of active learning steps needed to reach a target performance. Using our active learner, we construct Kinetics-Localization, a novel and large-scale dataset for temporal action localization.

Datasets such as Thumos14 [35], ActivityNet [8], and Charades [58] have enabled the development of innovative approaches addressing the temporal action localization problem [50, 56, 71, 75, 77]. These approaches have been successful in increasing localization performance while maintaining a low computational footprint [5, 71]. For instance, current state-of-the-art approaches [44, 77] have more than tripled the first reported performance on datasets like Thumos14 and ActivityNet. However, despite these achievements, a crucial limitation persists: the dependence of these models on large-scale annotated data for training. This limitation often prevents the deployment of action localization methods at scale, due to the large costs associated with video labeling (e.g. the Charades authors [58] spent $1 per video).

Additionally, given that datasets for temporal action localization are relatively small, it is unclear whether existing methods will reach performances comparable to those obtained in other vision tasks such as object detection [54]. To overcome some of these issues, Wang et al. [68] propose a new model that uses video-level annotations combined with an attention mechanism to pinpoint actions' temporal bounds. Although their method does not require temporal ground-truth, its performance is significantly lower than that achieved by fully-supervised approaches, restricting its use to applications that do not require accurate detection.

In this paper, we propose an active learning method that aims to ease the large-scale data dependence of current temporal localization methods. As in every active learning setting [55], our goal is to develop a learner that selects samples (videos in this case) from unlabeled sets to be annotated by an oracle. In contrast to traditional active learners [27, 42], where heuristics such as uncertainty sampling are used to perform the sample selection, we explore novel selection functions [25, 40] that reuse knowledge from a previously existing dataset. For instance, we study a learnable selection function that learns a mapping from a model-sample state pair to an expected improvement in performance. In doing so, such a function learns to score the unlabelled samples based on the expected performance gain they are likely to produce if they are annotated and used to update the current version of the localization model being trained.

Figure 1 depicts the learning process of three different action localization strategies. To evaluate each learner, we measure the performance improvements, assessed on a labeled set, at different training dataset sizes (or learning stages). We associate traditional action localization approaches [5, 71, 77] with Learner A (passive learning), which randomly picks samples to be annotated for future training iterations; Learner A makes no attempt to smartly select the samples that augment its training set. Learner B is an active learner that uses uncertainty sampling [42] to select samples (the learner chooses the instances whose labels are most uncertain). Learner C is a learning-based active learner. Because it incorporates historical knowledge from previous dataset selections, Learner C enables a better learning process. In this paper, we introduce an active learning framework that minimizes the number of active learning steps required to reach a desired performance.

Contributions. The core idea of this paper is to develop an active learning framework for temporal action localization. Specifically, our contributions are threefold. (1) We introduce a new active learner for action localization (see Sect. 3). To develop our approach, we thoroughly study different sampling functions, including those that exploit previously labeled data to learn or bootstrap a selection function that chooses the unlabelled samples expected to improve the localization model the most. (2) We conduct extensive experiments in Sect. 4 demonstrating the capabilities of the proposed framework. Compared to traditional learning (random sampling), our approach learns to detect actions significantly faster. Additionally, we show that our active learner can be employed in batch-mode and is robust to noisy ground-truth annotations. (3) We employ our active learner to construct a novel dataset for temporal action localization (see Sect. 5). Using videos from the Kinetics [39] dataset, we apply our learner to request temporal annotations from Amazon Mechanical Turk workers. We name this collected dataset Kinetics-Localization; it comprises more than 15K YouTube videos.

2 Related Work

This section briefly discusses the most relevant work to ours, namely those related to active learning and temporal action localization.

Active Learning tackles the problem of selecting samples from unlabeled sets to be annotated by an oracle. In the last decade, several active learning strategies have been proposed [27, 42, 63] and applied to several research fields, including speech recognition [32], natural language processing [62], and chemistry [18], to name a few. We refer the reader to the survey of Settles [55] for an extensive review of active learning methods. Active learning has also been used in traditional computer vision tasks, such as image classification [4, 22, 25, 36, 37, 53] and object detection [64], and to construct large-scale image and video datasets [16, 66, 72]. Very recently, active learning approaches have emerged in more contemporary vision tasks, including human pose estimation [46] and visual question answering [45]. Most of the active learning approaches in computer vision have used the simple but effective uncertainty sampling query strategy [42, 43], where unlabelled samples are selected based on the entropy of the scores generated by the current discriminative model (least-confidence and margin-based selections are other popular query strategies). However, the main limitation of this strategy is its inability to handle complex scenarios where factors such as label noise, outliers, or shifts in data distribution arise in the active learning setting [40]. Inspired by very recent ideas in active learning [1, 25, 40, 70, 74], our proposed active learning framework learns (or bootstraps) a function that selects samples for annotation based on knowledge extracted from a previous dataset. One variant of our approach estimates the effect of labeling a particular instance on the performance of the current discriminative model. As such, this learnable function is able to overcome the shortcomings of heuristic active learners, such as uncertainty sampling (see Sect. 4).

Temporal Action Localization. Many techniques have been developed over the years to recognize [11, 12, 49, 59, 67] and localize human activities, either in images [28, 47, 73] or videos [29, 34, 69]. Our work focuses on the temporal action localization problem in video, whose goal is to provide the starting and ending times of actions occurring within an untrimmed video. Researchers have explored innovative ideas to efficiently and accurately address this problem. Earlier methods rely on applying action classifiers in a sliding-window fashion [19, 23, 50]. To reduce the computational burden of sliding windows, a new line of work studies the use of action proposals, which quickly scan a video to reduce the search space [6, 7, 10, 20, 24, 56]. More recently, end-to-end approaches have surpassed stage-wise methods, demonstrating the importance of jointly optimizing classifiers and feature extractors [13, 71, 75, 77].

Despite the large body of work on action localization, most methods focus on either improving performance [77] or boosting speed [5], while very few investigate the use of active learning to mitigate the data dependency problem. To the best of our knowledge, only the work of Bandla and Grauman [2] has incorporated active learning to train an action detection model. However, their method relies on hand-crafted active selection functions such as uncertainty sampling [42], which work well in controlled scenarios where statistical properties of the dataset can be inferred, but fail when more complex shifts in data distribution are present. In contrast, and inspired by recent works [25, 40], our approach avoids predefined heuristics and instead learns or bootstraps the active selection function from existing data. We show that learning such a function not only improves the learning process of an action localization model on a given dataset, but also adapts well when annotating new data.

Fig. 2. Active Learner for Temporal Action Localization. Firstly, we train an action localization model with a labeled set of videos. Then, using the trained model, we generate video predictions on an unlabeled set and select one of the videos that is expected to improve the learner the most. Finally, an oracle temporally annotates the selected video, which is then added to the labeled set.

3 Active Learner for Temporal Action Localization

We propose an active learning framework for temporal action localization. Our goal is to train accurate detection models using a reduced amount of labeled data. At every learning step t, a set of labeled samples \(\mathcal {L}_t\) is first used to train a model \(f_t\). Then, from an unlabeled pool \(\mathcal {U}_t\), a video instance \(v^{*}\) is chosen by a selection function g. Afterwards, an oracle provides temporal ground-truth for the selected instance, and the labeled set \(\mathcal {L}_t\) is augmented with this new annotation. This process repeats until the desired performance is reached or the set \(\mathcal {U}_t\) is empty. As emphasized in previous work [37, 46], the key challenge in active learning is to design a proper selection function, which seeks to minimize the number of times an oracle is queried before a target performance is reached. Accordingly, we empower our proposed framework with state-of-the-art selection functions that exploit previously labeled datasets for bootstrapping.
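To make this loop concrete, the sketch below spells out one possible implementation under the notation above. It is a minimal outline, not our exact code: `train_model`, `select`, `oracle`, and `evaluate` are placeholders for the localization model training step (Sect. 3.1), the selection function g (Sect. 3.2), the annotation step (Sect. 3.3), and an evaluation on a held-out labeled set.

```python
def active_learning_loop(labeled, unlabeled, train_model, select, oracle,
                         evaluate, target_map):
    """Generic active learning loop: train, select, annotate, augment."""
    model = train_model(labeled)                   # f_t trained on L_t
    while unlabeled and evaluate(model) < target_map:
        video = select(model, unlabeled)           # v* chosen by g from U_t
        annotation = oracle(video)                 # temporal ground-truth for v*
        unlabeled.remove(video)
        labeled.append((video, annotation))        # augment L_t with (v*, y*)
        model = train_model(labeled)               # re-train f_{t+1}
    return model
```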

This section provides a complete walk-through of our approach (see Fig. 2). We describe our model for temporal action localization, elaborate on our proposed active selection function, and explain in detail the oracle’s task.

3.1 Localization Model Training Step

Much progress has been made in designing accurate action detection models [5, 24, 71, 77], so ideally any of these detectors could be used here. These detectors can be grouped into two categories: stage-wise and end-to-end models. Models trained end-to-end have shown superior detection rates; however, such methods cannot decompose the localization problem into simpler tasks. We argue that decomposing the action localization task is key, especially for active learning methods that use previous knowledge to bootstrap the selection function learning process. As such, we opt for designing a stage-wise action localization model.

Our model takes as input a video v described by a set of n temporal segments, denoted by \(\mathbf {S}=\{\mathbf {s}_1,\cdots ,\mathbf {s}_{n}\}\), where \(\mathbf {s}_i = [t^{start}, t^{end}]\) is a 2D vector containing the starting and ending times of a segment. In this paper, these temporal segments are action proposals generated by DAPs [20]. Our localization model’s goal is to select k temporal segments \(\mathbf {S}^{k}\) from the initial set \(\mathbf {S}\) and produce a vector of confidence scores \(\mathbf {z}_c \in \mathbb {R}^{k}\) for each action class c in the dataset. In short, our model maps an input video described by a large set of candidate segments into a small set of temporal predictions: \(f_t(v, \mathbf {S}) \rightarrow \left\{ \mathbf {S}^{k}, \{\mathbf {z}_c\}_{c\in \mathcal {C}} \right\} \), where \(\mathcal {C}\) is the set of action classes.

To that end, we organize our model into three modules: a video encoder whose goal is to describe temporal segments \(\mathbf S \) in terms of a feature vector \(\mathbf {o}\), an attention module which picks k segments based on a binary action classifier \(h_t\), and an action classifier \(\phi (\mathbf {S}^{k})\) which generates the confidence scores \(\mathbf {z}_c\) for each class in \(\mathcal {C}\). Below, we provide design details for each component.

Video Encoder. Given a set of temporal segments \(\mathbf {S}\), our aim is to encode each individual segment \(\mathbf {s}_i\) with a compact representation. We first extract frame-level features using a CNN and then aggregate these representations into a single feature vector \(\mathbf {o}_i\). In our experiments, we train an Inception V3 network [61] on the Kinetics dataset [39] and extract features from the pool3 layer (a feature vector with 2048 dimensions). To aggregate features along the temporal dimension, we opt for average pooling, which beyond its simplicity has demonstrated competitive performance in various tasks [38, 60]. Thus, our video encoder generates a matrix of visual observations, \(\mathbf {O}=[\mathbf {o}_1\cdots \mathbf {o}_{n}] \in \mathbb {R}^{2048\times n}\).
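As a rough sketch (assuming the frame-level pool3 features have already been extracted at a known frame rate; the function name and interface are ours), segment encoding by average pooling amounts to:

```python
import numpy as np

def encode_segments(frame_features, segments, fps):
    """Average-pool frame-level CNN features into one vector per segment.

    frame_features: (num_frames, 2048) array of pool3 features.
    segments: list of (t_start, t_end) proposals in seconds (e.g. from DAPs).
    Returns O of shape (2048, n), one column o_i per segment s_i.
    """
    columns = []
    for t_start, t_end in segments:
        lo = int(t_start * fps)
        hi = max(int(t_end * fps), lo + 1)   # keep at least one frame
        columns.append(frame_features[lo:hi].mean(axis=0))
    return np.stack(columns, axis=1)
```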

Attention Module. This module receives a visual observation matrix \(\mathbf {O}\) and picks the k temporal segments \(\mathbf {S}^{k}\) that are most likely to contain an action. We adopt a linear Support Vector Machine (SVM) [17, 21] to learn a binary classifier that discriminates between actions and background. We employ Platt scaling [51] to obtain probabilistic scores from the SVM outputs. Finally, to select the output segments, we perform hard attention pooling and pick the top-k segments with the highest confidence scores. We set \(k=10\) in our experiments. Accordingly, our attention module \(h_t\) outputs a small set of segments \(\mathbf {S}^{k}\), encoded by their corresponding visual representations in \(\mathbf {O}\).
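A minimal sketch of this module with scikit-learn is shown below; `SVC(kernel="linear", probability=True)` fits a linear SVM and calibrates its outputs with Platt scaling, and the class and variable names are ours.

```python
import numpy as np
from sklearn.svm import SVC

class AttentionModule:
    """Binary action-vs-background scorer with hard top-k attention pooling."""

    def __init__(self, k=10):
        self.k = k
        self.svm = SVC(kernel="linear", probability=True)  # Platt scaling

    def fit(self, segment_features, labels):
        # segment_features: (num_segments, 2048); labels: 1 = action, 0 = background.
        self.svm.fit(segment_features, labels)
        return self

    def top_k(self, O):
        # O: (2048, n) segment encodings; return indices and probabilistic
        # scores of the k segments most likely to contain an action.
        scores = self.svm.predict_proba(O.T)[:, 1]
        order = np.argsort(scores)[::-1][:self.k]
        return order, scores[order]
```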

Action Classifier. Taking as input the reduced set of temporal segments \(\mathbf {S}^{k}\), the action classifier aims to generate a set of confidence scores \(\mathbf {z}_c\) for each action category in \(\mathcal {C}\). Concretely, we build a model composed of a fully-connected layer and a soft-max classifier. Thus, our action classifier \(\phi \) generates the final detection results \(\left\{ \mathbf {S}^{k}, \{\mathbf {z}_c\}_{c\in \mathcal {C}} \right\} \).
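A possible PyTorch sketch of this classifier (the single linear layer over the 2048-d segment encodings is our reading of the design above):

```python
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    """Fully-connected layer followed by a softmax over the action classes."""

    def __init__(self, num_classes, feature_dim=2048):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_classes)

    def forward(self, segment_features):
        # segment_features: (k, 2048) encodings of the top-k segments S^k.
        # Returns (k, |C|) confidence scores z_c, one row per segment.
        return torch.softmax(self.fc(segment_features), dim=-1)
```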

Training. We define the labeled set at learning step t, of size \(p_t\), as \(\mathcal {L}_t = \left\{ (v^{train}_1, \mathbf {y}_1), (v^{train}_2, \mathbf {y}_2), \cdots , (v^{train}_{p_t}, \mathbf {y}_{p_t}) \right\} \), where \(\mathbf {Y}=[\mathbf {y}_1|\cdots |\mathbf {y}_{p_t}] \in \mathbb {R}^{2 \times p_t}\) contains the temporal annotations of all action instances. We also define the set of temporal segments of size m as \( \mathbf {S}^{(t)}_i = \{\mathbf {s}^{train}_1,\cdots ,\mathbf {s}^{train}_{m}\} \), where \(i \in \left\{ 1,2,...,p_t \right\} \) indexes each video. We train our attention and action classifier modules separately. To train the attention module, we label instances in \(\mathbf {S}^{(t)}_i\) as positives if their temporal Intersection over Union (tIoU) with any ground-truth instance is greater than 0.7. Similarly, for training the action classifier, we use temporal instances with tIoU greater than 0.7, but consider only the top-k segments chosen by our attention module.
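For reference, the tIoU computation and the 0.7-threshold positive assignment used above reduce to the following sketch (helper names are ours):

```python
def tiou(segment, gt):
    """Temporal IoU between two (t_start, t_end) intervals."""
    inter = max(0.0, min(segment[1], gt[1]) - max(segment[0], gt[0]))
    union = (segment[1] - segment[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def assign_binary_labels(segments, ground_truth, threshold=0.7):
    """A segment is a positive if it overlaps any ground-truth instance
    with tIoU greater than the threshold (0.7 in our setup)."""
    return [int(any(tiou(s, gt) > threshold for gt in ground_truth))
            for s in segments]
```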

3.2 Active Selection Step

Our aim is to design a selection function g that picks an instance \(v^{*}\) from the unlabeled set \(\mathcal {U}_t\). Our primary challenge is to develop this function such that it selects the samples that are expected to improve the localization model the most. Additionally, we want the selection function to generalize to unseen action categories. Purposefully, instead of sampling directly from the \(f_t\) predictions, we cast the selection problem as a meta-learning task: pick the samples that improve the attention module \(h_t\) the most. We focus the learner on the attention module rather than the action classifier, since the former deals with a more complex task (temporal boundary generation) and its output directly impacts the latter. Formally, our learnable selector g takes as input the confidence scores produced by the binary classifier \(h_t\) when applied to the unlabeled set \(\mathcal {U}_t\): \(\mathbf {X} = [\mathbf {x}_1, \mathbf {x}_2, \cdots , \mathbf {x}_{q_t}] \), where \(\mathbf {X} \in \mathbb {R}^{l \times {q_t}}\), with l being the number of temporal segments and \(q_t\) the number of unlabeled videos. In this section, we introduce three different sampling functions, which are studied and diagnosed in Sect. 4.

Learning Active Learning (LAL). Here, we follow [40] and formulate the learning of the selection function as a regression problem that predicts the improvement in performance of our attention module for each sample in \(\mathcal {U}_t\). We construct a feature matrix \(\mathbf {F}\) from pairs of model state and sample description. We choose the model state to be the SVM weights defining \(h_t\) and the sample description to be the histogram of confidence scores in \(\mathbf {X}\). The regression target \(\mathbf {\eta }\) corresponds to the improvement \(\delta \) in localization performance (in practice, mean Average Precision) obtained after \(h_t\) is re-trained with each sample of a set of previously labeled examples \(\mathcal {K}_t\) individually. In our experiments, we refer to \(\mathcal {K}_t\) as the Knowledge-Source Set. To generate a matrix \(\mathbf {F}\) that explores enough pairs of model and sample states, we follow the Monte-Carlo procedure used in [40]. Once the matrix \(\mathbf {F}\) and targets \(\mathbf {\eta }\) are constructed, we learn g using Support Vector Regression (SVR). Once trained, we apply g to the unlabelled set to select the sample with the highest predicted performance improvement: \(g(\mathcal {U}_t) \rightarrow v^{*}\).
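A simplified sketch of this selector follows; it assumes the model state is the weight vector of the linear SVM in \(h_t\), the sample description is a fixed-size histogram of its confidence scores, and the Monte-Carlo construction of \(\mathbf {F}\) and \(\mathbf {\eta }\) on the Knowledge-Source set is abbreviated to pre-computed arrays.

```python
import numpy as np
from sklearn.svm import SVR

def lal_features(svm_weights, segment_scores, bins=10):
    """Model-sample state: SVM weights of h_t concatenated with a histogram
    of the confidence scores assigned to one unlabeled video."""
    hist, _ = np.histogram(segment_scores, bins=bins, range=(0.0, 1.0),
                           density=True)
    return np.concatenate([np.ravel(svm_weights), hist])

def fit_lal(F, eta):
    """Regress the observed mAP improvement eta from the features F collected
    on the Knowledge-Source set (Monte-Carlo procedure of [40])."""
    return SVR().fit(F, eta)

def lal_select(regressor, svm_weights, unlabeled_scores):
    """Pick the unlabeled video with the highest predicted mAP improvement."""
    F_u = np.stack([lal_features(svm_weights, x) for x in unlabeled_scores])
    return int(np.argmax(regressor.predict(F_u)))
```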

Maximum Conflict Label Equality (MCLE). This method leverages knowledge from previously existing datasets. We closely follow [25] and devise a method that uses zero-shot learning as a warm initialization for active learning. We opt for simplicity and implement a Video Search zero-shot learning approach, which uses the top results from a YouTube search as positive samples [14]. Our implementation of this approach is based on the code provided by [25].

Uncertainty Sampling (US). This baseline samples the videos with the most uncertain predictions. Following standard uncertainty sampling approaches [42], we compute the entropy of the video predictions (i.e. the histogram of confidence scores in the columns of \(\mathbf {X}\)) and select the video with the highest entropy. This baseline is popular in computer vision applications such as image classification [53] and human pose estimation [46].
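For comparison, this baseline reduces to a few lines: build a histogram of scores per video (one column of \(\mathbf {X}\)), compute its entropy, and return the most uncertain video (sketch; the 10-bin histogram is our choice).

```python
import numpy as np

def uncertainty_select(X, bins=10, eps=1e-12):
    """Return the index of the column (video) of X whose confidence-score
    histogram has the highest entropy. X has shape (l, q_t)."""
    entropies = []
    for scores in X.T:                       # one column per unlabeled video
        hist, _ = np.histogram(scores, bins=bins, range=(0.0, 1.0))
        p = hist / max(hist.sum(), 1)        # normalize to a distribution
        entropies.append(float(-np.sum(p * np.log(p + eps))))
    return int(np.argmax(entropies))
```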

3.3 Annotation Step

The oracle’s task is to annotate the videos chosen by the selection function g. Specifically, the oracle is asked to provide the temporal bounds of all instances of an intended action. Towards this goal, several researchers have proposed efficient strategies to collect such annotations [9, 57]. Most of them have focused on exploiting crowd-sourcing throughput and have used Amazon Mechanical Turk to annotate their large-scale video datasets. In this work, we experiment with two types of oracles: (i) simulated ones, which we emulate by using the ground-truth from existing and completely annotated datasets, and (ii) real human annotators, namely Amazon Mechanical Turk workers. We observe that the proposed framework performs equally well in both cases.

4 Diagnostic Experiments

To evaluate our framework, we analyze its performance, including all its variants of selection functions, when oracles are simulated, i.e. we emulate an oracle’s outcome by using the ground-truth from existing datasets that have already been completely annotated.

4.1 Experimental Settings

Dataset. We choose ActivityNet [8], the largest available dataset for temporal action localization, to conduct the diagnostic experiments in this section. Specifically, we use the training and validation sets of ActivityNet 1.3, which include 14950 videos from 200 activity classes.

Metrics. We use the mean Average Precision (mAP) metric to assess the performance of an action localization model. Following the standard evaluation of ActivityNet, we report mAP averaged over a range of tIoU thresholds, i.e. from 0.5 to 0.95 with an increment of 0.05. To quantify the merits of a sampling function, we are particularly interested in the rate at which mAP increases with the training set size (i.e. the percentage of the dataset used to train the localization model).
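Concretely, the reported number is a plain average of mAP over these thresholds (sketch; `map_at_tiou` stands in for any standard detection mAP routine and is not part of our codebase):

```python
import numpy as np

def average_map(predictions, ground_truth, map_at_tiou):
    """Average mAP over tIoU thresholds 0.5, 0.55, ..., 0.95 (ActivityNet style)."""
    thresholds = np.linspace(0.5, 0.95, 10)
    return float(np.mean([map_at_tiou(predictions, ground_truth, t)
                          for t in thresholds]))
```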

Setup. LAL and MCLE approaches (introduced in Sect. 3.2) leverage knowledge extracted from previous datasets to bootstrap the selection function learning process. To exploit each of these methods to their full extent, we extract two category-disjoint subsets from ActivityNet. The first subset, dubbed Knowledge-Source, contains 2790 videos from 50 action categories. This subset is used to bootstrap the LAL and MCLE sampling functions. The second subset, dubbed ActivityNet-Selection, consists of 11160 videos with 150 action categories, which do not overlap with the ones in Knowledge-Source. We mainly conduct the active learning experiments on ActivityNet-Selection. Additionally, to measure the performance of the localization model, we define a Testing Set, which contains 3724 unseen videos from the same 150 categories as ActivityNet-Selection. The Testing Set videos do not overlap with ActivityNet-Selection nor Knowledge-Source videos.

We use the following protocol in our diagnostic experiments. We bootstrap LAL and MCLE using the labeled data in Knowledge-Source by following the method described in Sect. 3.2. Note that US does not need previous knowledge to operate. Once the selection function is available, we randomly select \(10\%\) from ActivityNet-Selection as a training set to build an initial action localization model (refer to Sect. 3.1). Then, we evaluate the model’s mAP performance on the Testing Set, and we apply our active learner onto the remaining videos of ActivityNet-Selection to select one or more of them, which will be annotated in the next step. Subsequently, we probe the oracle, which is simulated in this case by using the ground-truth directly provided by ActivityNet-Selection, to obtain temporal annotations for the selected videos. Finally, we augment the training set with the newly annotated samples, which in turn are used to re-train the localization model. This sequential process repeats until we have used \(100\%\) of the videos in ActivityNet-Selection for training.

4.2 Selection Function Ablation Study

Comparison under Controlled Settings. Figure 3 (Left) compares mAP performance between the three selection functions introduced in Sect. 3.2 on the Testing Set. We also report the performance of a Random Sampling baseline for reference. We report how the mAP of the localization model increases with the increase in training data, which is iteratively sampled according to the three active learning methods. These results help us investigate the effectiveness of each method in terms of how much improvement is obtained by adding a certain amount of training data. It is clear that LAL and MCLE significantly outperform US and the random sampling baseline. For example, to achieve \(80\%\) of the final mAP (i.e. when all of ActivityNet-Selection is used for training), LAL and MCLE require only \(35\%\) and \(38\%\) of the training data to be labelled respectively, while Uncertainty and Random Selection need \(42\%\) and \(65\%\) respectively to achieve the same performance. We attribute the superiority of LAL and MCLE to the fact that both approaches reuse information from labeled classes in the Knowledge-Source Set. Additionally, LAL directly exploits the current state of the localization model to make its selection at every training step. As such, it has inherently broader knowledge about the dataset it is annotating as compared to the simple heuristics used by Uncertainty Selection.

Effect of Sampling Batch Size. Re-training a model whenever a single new sample becomes available is prohibitively expensive. To alleviate this problem, researchers often consider active learning in batch-mode [3]. In batch-mode, our active learner selects groups of samples instead of a single one. For LAL, we simply rank all the unlabelled samples and pick the top-scoring ones based on LAL’s predictions (i.e. the performance gain they are expected to produce when they are individually added to the training set). For MCLE and Uncertainty Sampling, we select one unlabeled instance at a time until we completely fill the batch that will be annotated by the oracle. Figure 3 (Center) shows the Area Under the Learning Curve (AULC) for different sampling batch sizes. The AULC value summarizes the performance of an active learner by computing the area under the “percentage of full mAP vs. ratio of labeled videos” curve. For reference, we include the performance when using a single selection (i.e. a batch size of 1). Uncertainty Sampling performs poorly once the sampling batch size is increased to 32. Interestingly, MCLE performance is strongly degraded at larger sampling batch sizes: the AULC score drops from 0.75 to 0.65 when the batch size is set to 64. On the other hand, we observe that LAL is relatively robust to larger sampling batch sizes; for a batch of size 64, its AULC drops by only 0.05. We attribute the robustness of LAL to the fact that it estimates the selection score of each sample independently. Motivated by the trade-off between computational footprint and performance, we fix the selection batch size to 64 for the remaining experiments.
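For clarity, the sketch below shows the two pieces of machinery this paragraph relies on: batch-mode selection for LAL (rank by predicted gain, take the top B) and the AULC summary metric, computed here with the trapezoidal rule (the exact normalization is our assumption).

```python
import numpy as np

def lal_batch_select(regressor, F_unlabeled, batch_size=64):
    """Batch-mode LAL: rank unlabeled videos by predicted mAP gain and
    return the indices of the top `batch_size` candidates."""
    gains = regressor.predict(F_unlabeled)
    return np.argsort(gains)[::-1][:batch_size]

def aulc(labeled_ratios, pct_of_full_map):
    """Area under the 'percentage of full mAP vs. ratio of labeled videos'
    learning curve, normalized by the covered ratio range."""
    x = np.asarray(labeled_ratios, dtype=float)            # e.g. 0.1, ..., 1.0
    y = np.asarray(pct_of_full_map, dtype=float) / 100.0   # fraction of full mAP
    return float(np.trapz(y, x) / (x[-1] - x[0]))
```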

Fig. 3. Selection Function Ablation Study. Left: We show the % of full mAP (full training) achieved at different ratios of labeled videos and report the Area Under the Learning Curve (AULC) for each sampling function. LAL and MCLE present steeper increases in mAP. Center: We report the AULC at different batch sizes. LAL is robust to large batch sizes. Right: We compute the AULC at different levels of noise in the oracle annotations. All methods tolerate small levels of noise.

Effect of Noisy Annotations. Here, we analyze the performance of the selection functions when exposed to noisy oracles. To evaluate robustness against noisy annotations, we measure the performance of our active learner when different levels of noise are injected into the oracle responses. We quantify the noise in terms of how much an oracle response differs, in tIoU, from the original ground-truth. For example, at a \(5\%\) noise level, the oracle returns temporal instances whose tIoU with the original ground-truth is sampled from a Gaussian distribution with mean \(95\%\).
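One simple way to simulate such a noisy oracle is sketched below. The Gaussian mean follows the definition above; the standard deviation and the choice to keep the instance center fixed while stretching its length (so that the sampled tIoU is attained exactly) are our assumptions.

```python
import numpy as np

def noisy_oracle(gt_segment, noise_level, rng=None):
    """Perturb a ground-truth (t_start, t_end) annotation so that its tIoU
    with the original is ~ Gaussian(mean=1 - noise_level, std=0.02)."""
    rng = np.random.default_rng() if rng is None else rng
    target_tiou = float(np.clip(rng.normal(1.0 - noise_level, 0.02), 0.05, 1.0))
    t_start, t_end = gt_segment
    center, length = (t_start + t_end) / 2.0, t_end - t_start
    new_length = length / target_tiou     # same center, longer segment, so
    return (center - new_length / 2.0,    # tIoU = length / new_length = target
            center + new_length / 2.0)
```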

Similar to previous analysis, Fig. 3 (Right) reports the AULC at different noise levels. We observe that all sampling functions tolerate high levels of noise and in some cases (LAL) their performance can even improve when small (\(5\%\)) noise levels are added. We conjecture that this improvement is due to the fact that such small levels of noise can be seen as adversarial examples, which previous works have demonstrated to be beneficial for training [30].

5 Online Experiments: Collecting Kinetics-Localization

In this section, we perform live experiments, where we employ our active learner to build a new dataset. Instead of collecting the dataset from scratch, we exploit Kinetics [39] videos (and its video-level labels) and enrich them with temporally localized annotations for actions. We call our novel dataset Kinetics-Localization. First, we run our active learner to collect temporal annotations from Kinetics videos. Then, we present statistics of the collected data. Finally, we evaluate the performance of models trained with the collected data.

5.1 Active Annotation Pipeline

The Kinetics dataset [39] is one of the largest available datasets for action recognition. To construct the dataset, the authors used Amazon Mechanical Turk (AMT) to decide whether a 10-second clip contains a target action. To gather the pool of clips to be annotated, a large set of videos is first obtained by matching YouTube titles with action names. Then, a classifier trained with images returned by Google Image Search decides where the 10-second clip to be annotated is extracted from. As a result, Kinetics provides more than 300K videos across 400 different action classes, with only one annotated action clip per video. The scale of the dataset has enabled the development of novel neural network architectures for video [12]. Unfortunately, despite the tremendous effort invested in building Kinetics, the dataset is not designed for the task of temporal action localization. Thus, we commit our active learner to collecting temporal annotations for a portion of Kinetics.

We employ our active learner to gather temporal annotations for Kinetics videos from 75 action classes. The learner selects the samples that are annotated online by real human oracles. Following standard procedure for temporal video annotation, we design a user interface that allows annotators to determine the temporal bounds of actions in videos [9, 57, 65]. We rely on Amazon Mechanical Turk workers (turkers) to annotate the videos. Snapshots of the user interface and details about the annotation process are available in the supplementary material.

5.2 Kinetics-Localization at a Glance

As a result of our annotation campaign, we compile a temporal action localization dataset comprising 15000 videos from 75 different action categories, resulting in more than 30000 temporal annotations. Figure 4 summarizes the properties of Kinetics-Localization. Figure 4 (Top) shows the number of videos and instances per class in the current version of the dataset. The distribution of the number of videos/instances is close to uniform. Also notice that the average number of instances per video is 2.2.

Fig. 4. Kinetics-Localization at a Glance. We introduce Kinetics-Localization, a novel dataset for temporal action localization. Top: Distribution of the number of videos and instances per class. Middle: Kinetics-Localization attributes. We show the distribution of ground-truth instances for different attributes, including Coverage, Length, and Number of Instances per video. Bottom: We analyze the distribution of ground-truth instances for pairwise interactions of attributes. Each bin reports the percentage of ground-truth instances that belong to it.

Figure 4 (Middle) shows the ground-truth distribution for three different inherent attributes of the dataset. (i) Coverage, which we measure as the ratio of an instance’s length to the duration of the video it belongs to. We group instance coverage into five bins: Extra Small (XS: (0, 0.2]); Small (S: (0.2, 0.4]); Medium (M: (0.4, 0.6]); Large (L: (0.6, 0.8]); Extra Large (XL: (0.8, 1.0]). (ii) Length, measured as the duration, in seconds, of an instance. We define five bins to plot the distribution of this attribute: Extra Small (XS: (0, 30]); Small (S: (30, 60]); Medium (M: (60, 120]); Large (L: (120, 180]); Extra Large (XL: \({>}180\)). (iii) Number of instances in a video (# instances), which we also cluster into five bins: Extra Small (XS: [0, 1]); Small (S: (1, 4]); Medium (M: (4, 8]); Large (L: (8, 16]); Extra Large (XL: \({>}16\)). In terms of coverage, extra small and extra large instances account for a large portion of the ground-truth. Also note that more than half of the instances have at most small coverage (\({<}0.4\)). Overall, \(55.1\%\) of the instances in the dataset are relatively small. We hypothesize that such small instances will introduce new challenges, as is the case in other fields such as face detection [33].
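The coverage and length bins above translate directly into a small helper for tagging ground-truth instances (sketch; the bin edges are copied from the definitions above and the function name is ours).

```python
import numpy as np

BIN_NAMES = ["XS", "S", "M", "L", "XL"]
COVERAGE_EDGES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]           # fraction of video
LENGTH_EDGES = [0.0, 30.0, 60.0, 120.0, 180.0, np.inf]    # seconds

def instance_attributes(t_start, t_end, video_duration):
    """Return the coverage and length bins of one ground-truth instance."""
    length = t_end - t_start
    coverage = length / video_duration
    cov_bin = BIN_NAMES[max(np.searchsorted(COVERAGE_EDGES, coverage) - 1, 0)]
    len_bin = BIN_NAMES[max(np.searchsorted(LENGTH_EDGES, length) - 1, 0)]
    return cov_bin, len_bin
```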

We also study the distribution between pairs of instance attributes (see Fig. 4 (Bottom)). We observe the following trends in the ground-truth distribution: (i) as expected, instances with high coverage tend to have no neighbours (a single instance per video); (ii) \(34.9\%\) of instances have extra small coverage and extra small length, which we argue may be the hardest type of sample for current detectors. In summary, we find that the dataset exhibits challenging types of ground-truth instances, which span a wide range of difficulty.

5.3 Kinetics-Localization Benchmark

We evaluate two different temporal action localization models: (i) our temporal localization model (Stage-Wise), introduced in Sect. 3.1; and (ii) the Structured Segment Network (SSN) introduced by Zhao et al. [77], which we refer to as End-to-End. Although we could have employed other action detectors such as [5, 71], we choose SSN because it registers state-of-the-art performance. We train each model using either Kinetics-Localization or the original Kinetics dataset; Table 1 summarizes the results. When training on the original Kinetics, we use the provided 10-second clips and assume that all remaining content in a video is background. Even though the background might also contain valid action instances, we argue there is no systematic way to add those for training without fully annotating them.

To properly quantify performance, we fully annotate a portion of the Kinetics validation subset with temporal annotations, which we refer to from now on as the Kinetics-Localization Validation Set. Table 1 shows the temporal localization performance of both approaches at different tIoU thresholds on this validation set. We observe that the performance at lower tIoU thresholds (e.g. 0.1) for both models is close to the performance achieved by previous work on the trimmed classification task [12]. However, when the tIoU threshold is increased to 0.2, performance drops drastically. For example, the mAP of the End-to-End SSN model (trained on the original Kinetics) decreases from \(59.4\%\) to \(40.1\%\). Moreover, at the tIoU thresholds typically used for localization (0.5 to 0.9), both approaches perform poorly. We attribute this behavior to the fact that Kinetics does not include accurate temporal action bounds, which prevents the localization models from reasoning about the temporal configuration of actions. When the Stage-Wise approach is instead trained with the newly collected Kinetics-Localization data, it improves by \(13.1\%\) mAP on the validation set. This validates the need for accurate temporal annotations to train localization models, as well as the need for cost-effective frameworks to collect these annotations (like the active learning method we propose in this paper).

Table 1. Kinetics-Localization benchmark. We report the mAP of the Stage-Wise and End-to-End models at different tIoU thresholds, as well as the mAP averaged over tIoU thresholds from 0.5 to 0.95 with an increment of 0.05 (Avg. mAP). Notably, training with the Kinetics-Localization dataset offers significant gains in performance compared to using the original Kinetics dataset.

6 Conclusion

We introduced a novel active learning framework for temporal action localization. Towards this goal, we explored several state-of-the-art active selection functions and systematically analyzed their performance. We showed that our framework outperforms baseline approaches when the evaluation is conducted with simulated oracles. We also observed interesting properties of our framework when equipped with its LAL variant: (1) it exhibits good performance in batch-mode, and (2) it is robust to noisy oracles. After validating the contributions of our active learner, we employed it to gather a novel dataset for temporal localization, which we call Kinetics-Localization. We presented statistics of the dataset as well as a new benchmark for temporal action localization. We hope that the collected Kinetics-Localization dataset encourages the design of novel methods for action localization.