Pointly-Supervised Action Localization

This paper strives for spatio-temporal localization of human actions in videos. In the literature, the consensus is to achieve localization by training on bounding box annotations provided for each frame of each training video. As annotating boxes in video is expensive, cumbersome and error-prone, we propose to bypass box-supervision. Instead, we introduce action localization based on point-supervision. We start from unsupervised spatio-temporal proposals, which provide a set of candidate regions in videos. While normally used exclusively for inference, we show spatio-temporal proposals can also be leveraged during training when guided by a sparse set of point annotations. We introduce an overlap measure between points and spatio-temporal proposals and incorporate them all into a new objective of a multiple instance learning optimization. During inference, we introduce pseudo-points, visual cues from videos, that automatically guide the selection of spatio-temporal proposals. We outline five spatial and one temporal pseudo-point, as well as a measure to best leverage pseudo-points at test time. Experimental evaluation on three action localization datasets shows our pointly-supervised approach (1) is as effective as traditional box-supervision at a fraction of the annotation cost, (2) is robust to sparse and noisy point annotations, (3) benefits from pseudo-points during inference, and (4) outperforms recent weakly-supervised alternatives. This leads us to conclude that points provide a viable alternative to boxes for action localization.


Introduction
This paper aims to recognize and localize actions such as skiing, running, and getting out of a vehicle in videos. Action recognition has been a vibrant topic in vision for several decades, resulting in approaches based on local spatio-temporal features (Dollár et al, 2005;Laptev, 2005;Wang et al, 2009), dense trajectories, two-stream neural networks (Simonyan and Zisserman, 2014;Feichtenhofer et al, 2016), 3D convolutions (Ji et al, 2013;Tran et al, 2015), and recurrent networks (Donahue et al, 2015;Li et al, 2018;Srivastava et al, 2015). We aim to not only recognize which actions occur in videos, but also discover when and where the actions are present.
Action localization in videos corresponds to finding tubes of consecutive bounding boxes in frames for each action. Initial work aimed at localizing actions by finding local discriminative parts and generating tubes through linking or sliding windows (Lan et al, 2011;Tian et al, 2013a;Wang et al, 2014). State-of-the-art localizers classify boxes per frame (or few frames) before linking them into tubes (Gkioxari and Malik, 2015;Weinzaepfel et al, 2015;Hou et al, 2017;Kalogeiton et al, 2017a). Regardless of the approach, all these works require box-supervision per frame of each training video. As annotating boxes in videos is an expensive, cumbersome and error-prone endeavor, we prefer to perform action localization without the need for box-supervision.
The first contribution of this paper is to localize actions in videos with the aid of point-supervision. For pointly-supervised action localization, we start from (unsupervised) spatio-temporal proposals. Spatio-temporal proposals reduce the search space of actions in videos to a few hundred to a few thousand tubes, where at least one tube matches well with the ground truth action location (Jain et al, 2014;Jain et al, 2017;Oneata et al, 2014). This is typically achieved by clustering local representations such as supervoxels or dense trajectories. In the literature, the use of spatio-temporal proposals is restricted to the inference stage; training of the action localizer that selects the best proposal still depends on box-supervision. While the spatio-temporal proposals may be unsupervised, they do not relax the need for box-supervision during the training stage of action localizers. We propose to bypass bounding box annotations by training action localizers on spatio-temporal proposals from training videos. We show that training on spatio-temporal proposals guided by point annotations yields action localization performance similar to the box-supervised alternative at a fraction of the annotation time.
Fig. 1: Pointly-supervised action localization using spatio-temporal proposals and pseudo-points. During training, we start from point-supervision for each video. Our overlap measure computes the match between each proposal and the point annotations. We iteratively refine the proposal selection by extending the max-margin Multiple Instance Learning formulation. During inference, we compute pseudo-points for all video frames and use them in conjunction with the learned action model to determine the top proposals per action over all test videos.
As our second contribution, we propose an overlap measure that matches the centers of spatio-temporal proposals with point annotations. To identify the best proposal to train on, we adopt a Multiple Instance Learning perspective (Andrews et al, 2002), with the spatio-temporal proposals defining the instances and videos the bags. We employ the max-margin Multiple Instance Learning formulation and extend it to incorporate information from the proposed overlap measure. This results in action localization using video labels and point annotations as the sole action supervision. Our first two contributions were previously presented in the conference version of this paper.
For our third contribution, we are inspired by prior work that trains action localizers with spatio-temporal proposals selected by automatic visual cues. Rather than employing the cues at training time, we exploit the cues during inference and call them pseudo-points. The pseudo-points are used as an unsupervised generalization of point-supervision during the testing stage. The pseudo-points cover cues from training point statistics, person detection (Yu and Yuan, 2015), independent motion (Jain et al, 2014), spatio-temporal proposals, center bias (Tseng et al, 2009), and temporal information. To link the point-supervision in training videos to pseudo-points in test videos, we propose a function that both weights and selects pseudo-points based on how well they match with points annotated during training. We use the weighting function to determine which pseudo-points are most effective and how much they should contribute to the selection of spatio-temporal proposals in test videos. A complete overview of our proposed approach is shown in Figure 1.
The rest of the paper is organized as follows. In Section 2, we describe related work. Section 3 details our algorithm for point-supervision during training. Section 4 presents pseudo-points and explains how to leverage them during inference. We detail our experimental setup on UCF Sports (Rodriguez et al, 2008), UCF-101 (Soomro et al, 2012) and Hollywood2Tubes in Section 5. Ablation studies, error diagnosis and comparisons are discussed in Section 6. We conclude our work in Section 7.

Action localization with box-supervision
The problem of action localization is commonly tackled by supervision with video-level action class labels and frame-level box annotations during training. Initial approaches do so through figure-centric structures (Lan et al, 2011) and part-based models (Tian et al, 2013a;Wang et al, 2014). Inspired by the success of object proposals in images, several works have investigated spatio-temporal proposals for action localization in videos. Such spatio-temporal proposals are typically generated by grouping supervoxels (Oneata et al, 2014;Soomro et al, 2015) or dense trajectories (Marian Puscas et al, 2015). Spatio-temporal proposals reduce the search space to a few hundred or thousand locations per video. In the literature, the use of spatio-temporal proposals is limited to the testing stage. Training is still performed on features derived from bounding box annotations per frame. In this paper, we extend the use of action proposals to the training stage. We show that proposals provide high quality training examples when leveraging our Multiple Instance Learning variant, guided by point annotations, completely alleviating the need for box annotations.
Recently, a number of works have achieved success in action localization by separating spatial detection from temporal linking (Gkioxari and Malik, 2015;Weinzaepfel et al, 2015). Such approaches have been further improved with better representations (Peng and Schmid, 2016;Saha et al, 2016;Yang et al, 2017), joint linking algorithms, and by classifying a few consecutive frames before linking (Hou et al, 2017;Kalogeiton et al, 2017a;Saha et al, 2017). While effective, these approaches have an inherent requirement for box annotations to detect and regress the boxes in video frames. We focus on the use of unsupervised spatio-temporal proposals (Jain et al, 2014;Jain et al, 2017;Oneata et al, 2014), and we show how to utilize them during training to bypass the need for box-supervision.

Action localization without box-supervision
Given the annotation burden for box-supervision in action localization, several works have investigated action localization from weaker supervision signals. Most works focus on localization from video labels only. Siva and Xiang (2011) employ spatio-temporal proposals and optimize for an action localization model through Multiple Instance Learning (Andrews et al, 2002), where the videos are the bags and the proposals are the instances. We show that Multiple Instance Learning yields suboptimal results for action localization; extending Multiple Instance Learning with point-supervision alleviates this problem. Chen and Corso (2015) also employ spatio-temporal proposals and video labels, but skip the Multiple Instance Learning step. Instead, they train on the most dominant proposal per training video, without knowing whether the proposal fits the action location well. Recent work by Li et al (2018) achieves action localization from video labels through attention. The action location is determined by a box around the center of attention in each frame, followed by a linking procedure. These approaches provide action localization without box annotations. However, using only the video label restricts the localization performance. We show that point annotations have a direct impact on the performance at the expense of a small additional annotation cost, outperforming approaches using video labels only.
Several recent works have investigated action localization in a zero-shot setting, where no video training examples are provided for test actions. This is typically achieved through semantic word embeddings (Mikolov et al, 2013) between actions and objects as found in text corpora. Initial work by Jain et al (2015) employed spatio-temporal proposals and assigned object classifier scores to each proposal. The object scores are combined with the word embedding scores given an action and the highest scoring proposal is selected for each test video. Other work performs zero-shot action localization by linking boxes that are scored based on a spatial-aware embedding between actors and objects. Kalogeiton et al (2017b) perform zero-shot localization through joint localization of actions and objects. Soomro and Shah (2017) aim for unsupervised action localization through discriminative clustering on videos and spatio-temporal action proposal generation with 0-1 Knapsack. Such works are promising but do not perform on the level of (weakly) supervised alternatives, as detailed in our final experiment.

Speeding-up box annotations
Easing the annotation burden of bounding box annotations in videos has been investigated by Vondrick et al (2013). They investigate different strategies to annotate boxes in videos, e.g., with expert annotators and tracking. Furthermore, several works have attempted faster ways to annotate boxes, e.g., through human verification (Russakovsky et al, 2015;Papadopoulos et al, 2016) or by clicking the extremes of objects (Papadopoulos et al, 2017). While such investigations and approaches provide faster alternatives to the costly ImageNet standard for box annotation (Su et al, 2012), annotating boxes remains a slow and manually expensive endeavor. In this work, we avoid box annotations and show that action localization can be done efficiently through simple point annotations.
Several recent works have investigated the merit of point annotations in other visual recognition challenges. Bearman et al (2016) investigate point-supervision for semantic segmentation in images, which constitutes a fraction of the annotation cost compared to pixel-wise segmentation. In the video domain, Jain and Grauman (2016) investigate object segmentation based on point clicks. Similar in spirit to our work, Manen et al (2017) show the spatio-temporal tracks from consecutive point annotations provide a rich supervision for multiple object tracking in videos. In this work, we investigate the potential of point-supervision for action localization in videos, showing we can reach comparable performance to full box-supervision approaches based on action proposals.

Point-supervision for training
For pointly-supervised action localization with spatio-temporal proposals, we start from the hypothesis that the proposals themselves, normally used for testing, can substitute the ground truth box annotations during training without a significant loss in performance. Our main goal is to mine the best proposal from a set of action proposals during training while minimizing the annotation effort. The first level of supervision constitutes the action class label for the whole video. Given such global labels, we follow the traditional approach of mining the best proposals through Multiple Instance Learning, as introduced for object detection by Cinbis et al (2017). In the context of action localization, each video is interpreted as a bag and the proposals in each video are interpreted as its instances. The goal of Multiple Instance Learning is to train a classifier that selects the top proposals and separates proposals from different actions.
Next to the global action class label we leverage easy to obtain annotations within each video: we simply point at the action. Point-supervision allows us to easily exclude those proposals that have no overlap with any annotated point. Nevertheless, there are still many proposals that intersect with at least one point, as points do not uniquely identify each proposal. Therefore, we introduce an overlap measure to associate proposals with points. We also extend the objective of Multiple Instance Learning to include the proposed overlap measure for proposal mining.

Overlap between proposals and points
Let us first introduce the following notation. For a video $V$ of $F_V$ frames, an action tube proposal $A = \{BB_i\}_{i=f}^{m}$ consists of connected bounding boxes through video frames $(f, ..., m)$, where $1 \le f < m \le F_V$. Let $\overline{BB_i}$ denote the center of bounding box $i$. The point-supervision for the video is denoted by $C$. We propose an overlap measure that provides a continuous bounded score based on the match between a proposal and the point annotations.
Our overlap measure, inspired by a mild center-bias in annotators (Tseng et al, 2009), consists of two terms. The first term $M(\cdot)$ states how close the center of a bounding box from a proposal is to an annotated point, relative to the bounding box size, within the same frame. This center-bias term normalizes the distance of a point annotation to the center of a bounding box by the distance between the center and closest edge of the bounding box. For a point annotation and a bounding box $BB_{K_i}$ in the same frame, the score is 1 if the box center $\overline{BB_{K_i}}$ coincides with the point annotation. The score decreases linearly as the distance between the point annotation and the box center grows, and becomes 0 if the point annotation is not contained in $BB_{K_i}$ (Equation 1). In Equation 1, $(u, v)$ denotes the center point of each of the four edges of box $BB_{K_i}$, given by the function $e(BB_{K_i})$.
The second term $S(\cdot)$ serves as a form of regularization on the overall size of a proposal. Large proposals are more likely to contain points, and their box centers are by default closer to the center of the video frames. Since actions are more likely to be in the center of videos (Tseng et al, 2009), the first term $M(\cdot)$ tends to be biased towards large proposals. The size regularization term $S(\cdot)$ addresses this bias by penalizing proposals with large bounding boxes $|BB_i| \in A$, compared to the size of a video frame $|F_i| \in V$ (Equation 2).
With $M(\cdot)$ regularized by $S(\cdot)$, we define our overlap measure $O(\cdot)$ (Equation 3). Recall that $A$ denotes a proposal, $C$ captures the point-supervision, and $V$ denotes the video. The overlap measure $O(\cdot)$ provides an estimate of the quality of the proposals during training, and we use it in an iterative proposal mining algorithm over all training videos in search of the best proposals.
In Figure 2, we provide three visual examples of spatio-temporal proposals ranked based on our overlap measure.
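The two terms of the overlap measure can be illustrated with a minimal Python sketch. It follows the verbal description above, but several details are assumptions made for illustration only: boxes are `(x1, y1, x2, y2)` tuples, the center-bias term is averaged over the annotated frames, the size term is taken as the squared ratio of total box area to total frame area, and the two terms are combined by subtraction. All function names are hypothetical.

```python
import math


def center(box):
    """Center (x, y) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def area(box):
    """Area of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def center_bias(tube, points):
    """M: average center match over the annotated frames.

    tube:   dict mapping frame index -> box
    points: dict mapping frame index -> annotated point (x, y)
    Assumption: the distance is normalized by the distance from the box
    center to the closest edge midpoint; a frame scores 0 when the tube
    is absent or the point lies outside the box.
    """
    if not points:
        return 0.0
    total = 0.0
    for frame, (px, py) in points.items():
        box = tube.get(frame)
        if box is None:
            continue
        x1, y1, x2, y2 = box
        if not (x1 <= px <= x2 and y1 <= py <= y2):
            continue
        cx, cy = center(box)
        edge_midpoints = [(x1, cy), (x2, cy), (cx, y1), (cx, y2)]
        norm = min(math.hypot(u - cx, v - cy) for u, v in edge_midpoints)
        if norm == 0:
            continue
        dist = math.hypot(px - cx, py - cy)
        total += max(0.0, 1.0 - dist / norm)
    return total / len(points)


def size_penalty(tube, frame_area, num_frames):
    """S: assumed squared ratio of total box area to total frame area."""
    ratio = sum(area(b) for b in tube.values()) / (frame_area * num_frames)
    return ratio ** 2


def overlap(tube, points, frame_area, num_frames):
    """O: assumed here to be the center-bias term minus the size penalty."""
    return center_bias(tube, points) - size_penalty(tube, frame_area, num_frames)
```

Under these assumptions, a small tube centered exactly on the annotated points scores close to 1, while a tube that covers the whole frame in every frame is penalized by the size term even when it contains all points.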

Mining for proposals with points
To mine spatio-temporal proposals, we are given a training set of $N$ videos, where $x_i$ is the $D$-dimensional feature representation of each proposal in video $i$. Annotations consist of the action class label $y_i$ and the points $C_i$.
The proposal mining combines the overlap measure $O(\cdot)$ of Equation 3 with a Multiple Instance Learning optimization. The optimization aims to train a classification model that can separate good and bad proposals for a given action. We start from a standard MIL-SVM (Andrews et al, 2002;Cinbis et al, 2017) and adapt its objective to include a mining score $P(\cdot)$ for each proposal, which incorporates our function $O(\cdot)$ (Equation 4). In Equation 4, $(w, b)$ denote the classifier parameters, $\xi_i$ denotes the slack variable, and $\lambda$ denotes the regularization parameter. Variable $z \in x_i$ denotes the representation of a single proposal in the set of all proposals $x_i$ for training video $i$. Variable $A_i^{(z)}$ denotes the tube corresponding to proposal representation $z$. The proposal with the highest mining score per video is used to train the classifier.
Different from standard MIL-SVM, the proposals are not only conditioned on the classifier parameters, but also on the overlap scores from the point annotations. In other words, the standard maximum likelihood optimization of MIL is adapted to include point overlap scores that serve as a prior on the individual proposals. The objective of Equation 4 is non-convex due to the joint minimization over the classifier parameters $(w, b)$ and the maximization over the mined proposals $P(\cdot)$. Therefore, we perform iterative block coordinate descent by alternating between clamping one and optimizing the other. Given a fixed selection of proposals, the optimization becomes a standard SVM optimization over the features of the selected proposals (Cortes and Vapnik, 1995). For fixed model parameters, the maximization over the proposals is determined by scoring each proposal (Equation 5). In Equation 5, the score of a proposal is the sum of two components, namely the score of the current model and the overlap with the point annotations in the corresponding training video. The mining and classifier optimizations are alternated for a fixed number of iterations. After the iterative optimization, a final SVM is trained on the best mined proposals. Identical to approaches using box-supervision, our model selects the best proposals from test videos, but without requiring any box annotations during training.
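The alternation described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the original implementation: the data layout, the function names, the zero-model initialization, and the pluggable `fit_classifier` callable (which stands in for the SVM optimization) are all assumptions.

```python
def mining_score(w, b, z, overlap):
    """Score of one proposal: current model score plus point overlap."""
    return sum(wi * zi for wi, zi in zip(w, z)) + b + overlap


def mine_best_proposals(videos, w, b):
    """Return, per video, the index of the proposal with the top mining score.

    videos: list of dicts with keys
        'features': list of feature vectors, one per proposal
        'overlaps': list of overlap scores O, one per proposal
    """
    selected = []
    for video in videos:
        scores = [mining_score(w, b, z, o)
                  for z, o in zip(video["features"], video["overlaps"])]
        selected.append(max(range(len(scores)), key=scores.__getitem__))
    return selected


def alternate(videos, fit_classifier, dim, iterations=5):
    """Block coordinate descent: alternate proposal mining and model fitting.

    fit_classifier is a hypothetical callable (videos, selected) -> (w, b).
    Assumption: the model starts at zero, so the first mining round ranks
    proposals by their point overlap alone.
    """
    w, b = [0.0] * dim, 0.0
    selected = mine_best_proposals(videos, w, b)
    for _ in range(iterations):
        w, b = fit_classifier(videos, selected)       # proposals clamped
        selected = mine_best_proposals(videos, w, b)  # model clamped
    return w, b, selected
```

In practice `fit_classifier` would train a max-margin classifier on the selected proposals as positives against sampled negatives; it is left abstract here so that the sketch stays independent of any particular SVM library.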

Pseudo-pointing for inference
Inference is typically achieved through a maximum likelihood over all proposals in a test video. However, relying on a maximum likelihood estimate of the model is rather limited, as it only relies on the features of the proposals. We show that visual cues within the test videos help to guide the selection of proposals during inference, similar to how point annotations provide guidance during training. We dub these automatic cues pseudo-points and investigate five of them. The pseudo-points rely on training point annotations, self-supervision, person detection, independent motion, and center bias. We show how to exploit and combine these pseudo-points to improve the action localization during inference. Lastly, we also provide two forms of regularization to further boost the localization results.
Fig. 3: Pseudo-points, with panels for person detection, independent motion, self-supervision, training point statistics, and center bias. The pseudo-points derived from person detection, independent motion, and self-supervision focus on the primary action in the video. The pseudo-points derived from training points and center bias provide data-independent prior statistics to steer better proposal selection during inference.

The pseudo-points
In Figure 3, we provide a visual overview of the visual cues for multiple video frames. Next, we outline each pseudo-point individually.

Training point statistics
The first pseudo-point focuses on the point annotations provided during training. Intuitively, actions do not occur at random locations in video frames. Recall that we are given $N$ training videos, where $y_i$ and $C_i$ denote respectively the video label and point annotations of training video $i$. We exploit this observation by defining a pseudo-point for an action class $Y$ (Equation 6). Equation 6 states that for an action $Y$, the pseudo-point in a test video is determined as the average point annotation location given the training videos of the same action. The reasoning behind this pseudo-point is that specific actions tend to re-occur in similar locations across videos. Note that the pseudo-point is independent of the frame $F$ itself and only depends on the training point statistics.
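As a minimal illustration of this pseudo-point, the following Python sketch averages the annotated point locations over all training videos of one action. The function name and data layout are hypothetical, and two details are assumptions: coordinates are taken to be normalized to [0, 1] so that videos of different resolutions can be averaged, and all points of an action are pooled before averaging.

```python
def training_point_pseudo_point(labels, points, action):
    """Average annotated point location over training videos of one action.

    labels: list of action labels y_i, one per training video
    points: list of point annotations C_i, each a list of (x, y) tuples
            with coordinates normalized to [0, 1] (assumption)
    action: the action class Y for which the pseudo-point is computed
    """
    xs, ys = [], []
    for label, annotation in zip(labels, points):
        if label != action:
            continue
        for px, py in annotation:
            xs.append(px)
            ys.append(py)
    if not xs:
        raise ValueError("no point annotations for action %r" % (action,))
    return (sum(xs) / len(xs), sum(ys) / len(ys))
```

The returned coordinate is the same for every frame of every test video, which matches the observation that this pseudo-point depends only on training statistics.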

Self-supervision from proposals
The second pseudo-point we investigate does not require external information; the pseudo-point relies on the spatio-temporal proposals themselves. The main idea behind this pseudo-point is that the distribution over all proposals in a single frame provides information about its action uncertainty. It relies on the following assumption: the more the proposals are on the same spatial location, the higher the likelihood that the action occurs in that location. The pseudo-point can be seen as a form of self-supervision (Doersch et al, 2015;Fernando et al, 2017), since it provides an automatic annotation from proposals to guide the selection of the very same proposals. More formally, for test video $t$, let $A_t$ denote the spatio-temporal proposals. Furthermore, let $F$ denote a video frame and let $C_{A_t}(u, v, F)$ denote the number of proposals that contain pixel $(u, v)$ in $F$. We place an automatic pseudo-point at the center of mass over these pixel counts (Equation 7). The function of Equation 7 outputs a 2D coordinate in frame $F$, representing the center of mass over all pixels in $F$, with the mass of each pixel $(u, v)$ given by $C_{A_t}(u, v, F)$. The 2D output coordinate will serve as the pseudo-point in frame $F$.
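The center-of-mass computation can be sketched as follows. The sketch is illustrative only: it assumes integer pixel boxes `(x1, y1, x2, y2)` with exclusive upper bounds, and the function names are hypothetical. The same `center_of_mass` helper applies to any per-pixel mass map, for example an independent motion map.

```python
def proposal_count_map(boxes, width, height):
    """Count, per pixel, how many proposal boxes cover it in one frame.

    boxes: list of (x1, y1, x2, y2) in integer pixel coordinates, where
           x2 and y2 are exclusive (assumption).
    Returns a height x width nested list of counts.
    """
    counts = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        for v in range(max(0, y1), min(height, y2)):
            row = counts[v]
            for u in range(max(0, x1), min(width, x2)):
                row[u] += 1
    return counts


def center_of_mass(mass):
    """2D center of mass (u, v) of a per-pixel mass map.

    Returns None when the total mass is zero, e.g. when no proposal is
    present in the frame.
    """
    total = sum_u = sum_v = 0.0
    for v, row in enumerate(mass):
        for u, m in enumerate(row):
            total += m
            sum_u += m * u
            sum_v += m * v
    if total == 0:
        return None
    return (sum_u / total, sum_v / total)
```

Pixels covered by many proposals pull the pseudo-point towards them, so the point lands where the proposals agree most.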

Person detection
The third pseudo-point follows earlier work on action localization by incorporating knowledge about person detections (Siva and Xiang, 2011;Yu and Yuan, 2015). Actions are typically person-oriented, so the presence or absence of a person in a proposal provides valuable information. Here, we employ a Faster R-CNN network (Ren et al, 2015), pre-trained on MS-COCO (Lin et al, 2014), and use the person class for the detections. This results in roughly 50 box detections per frame after non-maximum suppression. We select the box in each frame with the maximum confidence score as the automatic pseudo-point.

Independent motion
The independent motion of a pixel $(u, v)$ in frame $F$ provides information as to where foreground actions are occurring. More precisely, independent motion states the deviation from the global motion in a frame. Let $C_I(u, v, F) \in [0, 1]$ denote the inverse of the residual in the global motion estimation at pixel $(u, v)$ in frame $F$. The higher $C_I(u, v, F)$, the less likely it is that the pixel contributes to the global motion. Akin to the second pseudo-point, we place an automatic pseudo-point at the center of mass, now over the independent motion estimates (Equation 8). Equation 8 outputs a 2D coordinate, but now using the independent motion as mass for each pixel in $F$.

Direct center bias
Lastly, we again focus on an observation made during training: actions and annotators have a bias towards the center of the video (Tseng et al, 2009). We exploit this bias directly in our fifth pseudo-point by simply placing a point on the center of each frame, at $(F_W / 2, F_H / 2)$, where $F_W$ and $F_H$ denote the width and height of frame $F$ respectively (Equation 9). Figure 4 provides the spatio-temporal evolution and focus area of the pseudo-points for four example videos.
Fig. 4: Pseudo-point visualization on four example videos for training points, center bias, self-supervision, independent motion, and person detection (depicted as points for visualization). In general, the pseudo-points are present around the action or even follow the action. When actions are not in the frame, however, as shown in the right example, pseudo-points may place automatic annotations in phantom positions.

Rescoring test proposals
Given a test video $t$ with a set of proposals $\{A^t_i\}_{i=1}^{|A_t|}$ and a model $(w, b)$, the standard approach in action localization with spatio-temporal proposals selects the proposal with the maximum likelihood under the model (Equation 10). We exploit the pseudo-points to adjust this likelihood estimate (Equation 11), where $P$ denotes the pseudo-point of interest and $t_V$ denotes the test video itself. Equation 11 is similar in spirit to Equation 5, but now automatic cues are employed, rather than manual point annotations. Adjusting the proposal selection using pseudo-points during testing can be seen as a form of regularization. Rather than a single-point maximum likelihood given a trained model, we add continuous restrictions on the proposals based on their match with automatic pseudo-points, which steer the selection towards proposals with a high overlap to the ground truth action location.
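A minimal Python sketch of the rescoring step is given below. It assumes an additive form, in which the model score of each proposal is increased by a weighted overlap between the proposal and the pseudo-points, mirroring the sum used during training. The function names, the argument layout, and the single scalar `weight` are illustrative assumptions.

```python
def rescore_proposals(w, b, features, pseudo_overlaps, weight):
    """Adjust model scores of test proposals with a pseudo-point term.

    features:        list of feature vectors, one per proposal
    pseudo_overlaps: overlap of each proposal with the pseudo-points
    weight:          weight of the pseudo-point term (assumed scalar)
    """
    scores = []
    for z, o in zip(features, pseudo_overlaps):
        model_score = sum(wi * zi for wi, zi in zip(w, z)) + b
        scores.append(model_score + weight * o)
    return scores


def select_top_proposal(scores):
    """Index of the highest-scoring proposal in a test video."""
    return max(range(len(scores)), key=scores.__getitem__)
```

With `weight` set to 0 the selection reduces to the standard maximum over model scores, so the pseudo-point term can be read as an optional correction on top of the learned model.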

Weighting and selecting pseudo-points
Intuitively, not all pseudo-points are equally effective, which motivates the weights in Equation 11. However, proper values for $\lambda_P$ cannot be set directly through standard (cross-)validation, as this requires box-supervision. To overcome both problems, we provide a score function to estimate the quality of each pseudo-point. This score is used both to determine which pseudo-point is most favourable to select and to serve directly as the weighting value in Equation 11. The score function for the person detection pseudo-point (the only pseudo-point that outputs boxes) is identical to the overlap function in Equation 1. This entails that if the center of the top person detection in each frame of a training video matches with the point annotations of the same video, a high score is achieved. We compute the average match over all training videos as the weight ($\lambda_P$) for person detection. For the other pseudo-points, we are only given points. In these cases, we use the distance to the nearest image border to normalize the distance between the manual point annotation and the automatic pseudo-point annotation. The overall score function is computed in identical fashion as for the person detection.
By matching automatic pseudo-points in training videos with the manual point annotations, we arrive at an automatic quality measure for pseudo-points, which can be used to weight and select pseudo-points.
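For pseudo-points that output points rather than boxes, the quality score can be sketched as below. The sketch assumes that the distance between annotation and pseudo-point is normalized by the distance from the pseudo-point to its nearest image border, that matches are clipped at 0, and that scores are averaged first per video and then over videos. These choices, the data layout, and the function names are illustrative assumptions.

```python
import math


def point_match(annotation, pseudo, width, height):
    """Match in [0, 1] between a manual point and an automatic pseudo-point."""
    px, py = pseudo
    norm = min(px, width - px, py, height - py)  # nearest image border
    if norm <= 0:
        return 0.0
    dist = math.hypot(annotation[0] - px, annotation[1] - py)
    return max(0.0, 1.0 - dist / norm)


def pseudo_point_weight(videos):
    """Average match over training videos, usable as a pseudo-point weight.

    videos: list of dicts with keys
        'annotations': dict frame -> manual point (x, y)
        'pseudo':      dict frame -> pseudo-point (x, y)
        'size':        (width, height) of the video frames
    """
    per_video = []
    for video in videos:
        width, height = video["size"]
        matches = [point_match(point, video["pseudo"][frame], width, height)
                   for frame, point in video["annotations"].items()
                   if frame in video["pseudo"]]
        if matches:
            per_video.append(sum(matches) / len(matches))
    return sum(per_video) / len(per_video) if per_video else 0.0
```

Because the score only needs training point annotations, it can rank pseudo-points and set their weights without any box annotations.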

Temporal pseudo-points
Besides knowing where specific actions occur spatially over a complete dataset, the temporal extent of actions is also helpful for proposal selection. Here, we provide a temporal pseudo-point, again relying on training point statistics. For action $Y$, we retrieve its temporal extent by comparing the temporal span of point annotations to the temporal extent of the videos in which the actions occur. The fraction of the annotation span relative to the video length is computed for each action instance and averaged over a complete dataset. Let $F_Y$ denote the average temporal length of action $Y$ and let $F^t_{jk}$ denote the temporal length of proposal $k$ in test video $j$. We then compute the temporal pseudo-point from these two quantities (Equation 12). In Equation 12, the match between the temporal pseudo-point of an action and the temporal extent of a proposal also acts as a regularization. The better the match, the lower the penalty in the likelihood, resulting in a better selection of proposals.
With pseudo-points, we are able to guide the selection of the top proposal per action per video for action localization, akin to how point-supervision is used during training. Having defined the complete pointly-supervised regime for training and inference, we are now ready for the experiments.
UCF Sports. The first dataset is UCF Sports (Rodriguez et al, 2008). We employ the train/test split suggested in (Lan et al, 2011). Example frames from the dataset are shown in Figure 5a.
UCF-101. The UCF-101 dataset consists of 13,320 videos from 101 action categories, such as Skiing, Basketball dunk, and Surfing (Soomro et al, 2012). For a subset of 3,204 videos and 24 categories, spatio-temporal annotations are provided (Soomro et al, 2012). We will use this subset in the experiments and use the first train/test split suggested in (Soomro et al, 2012). In Figure 5b, we show dataset example frames.
Hollywood2Tubes. The Hollywood2Tubes dataset consists of 1,707 videos from 12 action categories, such as getting out of a car, sitting down, and eating. The dataset is derived from the Hollywood2 dataset (Marszałek et al, 2009), with point annotations for training and box annotations for evaluation. Different from current action localization datasets, Hollywood2Tubes is multi-label and actions can be multi-shot, i.e., can span multiple non-continuous shots, adding new challenges for action localization. We show an example frame with box annotations for each of the 12 actions in the dataset in Figure 5c. Annotations are available at http://tinyurl.com/hollywood2tubes.
Across both datasets and all overlap thresholds, point-supervision is as effective as box-supervision, while they both outperform video-label supervision. We conclude that spatial annotations are vital and that points provide sufficient support for effective localization.

Implementation details
Proposals. Our proposal mining algorithm is agnostic to the underlying spatio-temporal proposal algorithm. Throughout this work, we employ the unsupervised APT proposals, since the algorithm provides high action recall, is fast to execute, and the code is publicly available. For each proposal, we extract Improved Dense Trajectories and compute HOG, HOF, Traj, and MBH features. The combined features are concatenated and aggregated into a fixed-size representation using Fisher Vectors (Sánchez et al, 2013). We construct a codebook of 128 clusters, resulting in a 54,656-dimensional representation per proposal. The same proposals and representations are also used in prior work, allowing for a fair comparison.
Training. The proposal mining is performed for 5 iterations; more iterations have little effect on performance. Following the suggestions of Cinbis et al (2017), the training videos are randomly split into 3 splits to train and select the proposals. For training a classifier for one action, 100 proposals of each video are randomly sampled from the other actions as negatives. The regularization parameter $\lambda$ in the max-margin optimization is fixed to 10 throughout the experiments.
Evaluation. For an action, we select the top scoring proposal for each test video given the trained model. To evaluate the action localization performance, we compute the Intersection-over-Union (IoU) between proposal $p_1$ and the box annotations of the corresponding test example $p_2$ as $\text{iou}(p_1, p_2) = \frac{1}{|\Gamma|} \sum_{f \in \Gamma} \text{IoU}_{p_1,p_2}(f)$, where $\Gamma$ is the set of frames where at least one of $(p_1, p_2)$ is present. The function IoU states the box overlap within a specified frame. For IoU threshold $\tau$, a top selected proposal is deemed a positive detection if $\text{iou}(p_1, p_2) \geq \tau$. After aggregating the top tubes from all videos, we compute either the Average Precision score or AUC using the proposal scores and positive/negative detection labels.
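The tube-level overlap used for evaluation can be written down directly from its definition. The Python sketch below is illustrative: tubes are assumed to be dictionaries that map frame indices to `(x1, y1, x2, y2)` boxes, and the function names are hypothetical. Frames in which only one of the two tubes is present contribute an overlap of 0, since they belong to the set of frames over which the average is taken.

```python
def box_iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tube_iou(p1, p2):
    """Average per-frame IoU over frames where at least one tube is present.

    p1, p2: dicts mapping frame index -> box
    """
    frames = set(p1) | set(p2)
    if not frames:
        return 0.0
    total = sum(box_iou(p1[f], p2[f]) for f in frames if f in p1 and f in p2)
    return total / len(frames)


def is_positive(p1, p2, tau):
    """A selected proposal counts as a detection if its tube IoU reaches tau."""
    return tube_iou(p1, p2) >= tau
```

Averaging over the union of frames penalizes proposals that are spatially accurate but too long or too short in time, so the measure is spatio-temporal rather than purely spatial.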

Action localization with point-supervision
Setup. In the first experiment, we evaluate our main notion of localizing actions using point-supervision. We perform this evaluation on UCF Sports and UCF-101. We compare our approach to three baselines: box-supervision, the best possible spatio-temporal proposal, and video-label supervision. The box-supervision baseline follows the training protocol of van Gemert et al (2015), where for each action, a classifier is trained using the features from ground truth boxes. Additionally, spatio-temporal proposals with an overlap higher than 0.6 and lower than 0.1 are added as positives and negatives, respectively.

We first observe that traditional box-supervision yields results identical to using the best possible spatio-temporal proposal. This result validates our starting hypothesis that spatio-temporal proposals provide viable training examples. Second, we observe that across both datasets and all overlap thresholds, point-supervision performs similarly to both the box-supervision and best proposal approaches. This result highlights the effectiveness of point-supervision for action localization. With pointly-supervised action localization, we no longer require expensive box annotations. As results using video-labels only are limited compared to points, we conclude that points provide vital information about the spatial location of actions.
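The overlap rule used by the box-supervision baseline (proposals above 0.6 become positives, proposals below 0.1 become negatives) can be sketched as follows. This is an illustrative helper under the assumption that the overlap of each proposal with the ground truth has already been computed; the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def label_proposals(
    overlaps: Sequence[float],
    pos_thresh: float = 0.6,
    neg_thresh: float = 0.1,
) -> Tuple[List[int], List[int]]:
    """Return indices of proposals used as extra positives and negatives."""
    positives = [i for i, o in enumerate(overlaps) if o > pos_thresh]
    negatives = [i for i, o in enumerate(overlaps) if o < neg_thresh]
    # Proposals with an overlap in between are left out of training.
    return positives, negatives


pos_idx, neg_idx = label_proposals([0.75, 0.05, 0.30, 0.60, 0.00])
print(pos_idx, neg_idx)
```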
Error analysis. To gain insight into why point supervision is effective for action localization, we perform an error diagnosis and corresponding qualitative analysis. We perform the diagnosis on the approaches using box-supervision and point-supervision. Akin to error diagnosis for object detection (Hoiem et al, 2012), we quantify the types of errors made by each localization approach. We take the top R detections for each action, where R is equal to the number of ground truth instances in the test set. We categorize each detection into five classes relevant for action localization: (1) correct detection, (2) localization error, (3) confusion with other action, (4) background from video containing the action, and (5) background from video not containing the action. The categorization definition is provided in Appendix A.
The error diagnosis, averaged over all actions, is shown in Figure 7 for UCF Sports and UCF-101. We observe that overall, the types of errors made by both approaches are similar. The predominant error type is localization error, which means that proposals from positive videos with a low match to the ground truth are the main errors. Background proposals from both positive and negative videos are rarely ranked high. Overall, using boxes and points results in similar errors, which matches their similarity in localization performance. A common limitation is the quality of the spatio-temporal proposals themselves; only a few proposals have a high overlap with the ground truth, making the localization a needle-in-a-haystack problem regardless of the model. On UCF-101, a large part of the errors also comes from confusion with the background from other videos. This is because the UCF-101 dataset can have more than one instance of the same action in each video. If such additional instances are missed, non-distinct regions of negative videos are automatically ranked higher.
Qualitative analysis. In Figure 8, we provide qualitative results on UCF Sports and UCF-101. The results show where point- and box-supervision yield similar and dissimilar action tubes. For simple actions like walking and soccer juggling, both approaches yield (near-)identical results. For actions such as skateboarding and walking with a dog, we observe that point-supervision tends to focus on the invariant object (here: skateboard, horse, and dog), since these are distinctive elements for the action. This is because the spatial extent of actions is no longer known with points, which means that the extent is learned from examples, rather than from manual annotations. We also note that limitations in the model can lead to different results, as shown in the leftmost example of Figure 8b.

From the results, error diagnosis, and qualitative analysis, we make the following conclusions: (i) point-supervision yields results comparable to full box-supervision for action localization, (ii) averaged over all actions, the approaches using box and point annotations have approximately similar error type distributions, and (iii) models learned with point-supervision learn the spatial extent of actions discriminatively from examples.

Influence of spatio-temporal proposal quality
In the second experiment, we evaluate the influence of the spatio-temporal proposals upon which our approach is built. Spatio-temporal proposals optimize recall, i.e., for a video, at least one proposal should have a high overlap with the ground truth action localization. An inconvenient side-effect of this requirement is that each video outputs many proposals that have a low overlap, making the selection of the best proposal during testing a needle-in-a-haystack problem. This problem was observed in the error diagnosis of the first experiment.
Here, we investigate the influence of the high ratio of proposals with a low overlap during training and testing. During both training and testing, we add the oracle ground truth tube to the proposals. We furthermore add a parameter that controls the fraction of low quality proposals (overlap below 0.5) that are removed. We train several models with varying values of this parameter. We evaluate this oracle experiment on UCF Sports.
Fig. 9: Influence of spatio-temporal proposal quality on UCF-Sports. The baseline corresponds to the result from the first experiment. For the others, the ground truth location is added as one of the proposals during testing. The parameter states the fraction of low quality (overlap ≤ 0.5) proposals that are removed. Action localization performance increases when large amounts of low quality proposals are removed. We conclude that better quality action proposals will have a positive impact on pointly-supervised action localization.

Results. The localization performance for several values of this parameter is shown in Figure 9. The baseline is the result achieved in the first experiment. From the figure, it is evident that removing low quality proposals positively affects the localization performance. However, a large portion of the low quality proposals needs to be removed to achieve better results. This is because of the large amount of low quality proposals. On UCF Sports, only 7% of the proposals have an overlap of at least 0.5 (!). This means that when 50% of the low quality proposals are removed, the ratio of low to high quality proposals is still more than 6 to 1. When removing 50% of the low quality proposals, the result increases from 0.23 to 0.32. This further increases to 0.49 when removing 95% of the low quality proposals. When only using the ground truth tube and high overlapping proposals (i.e., when all low quality proposals are removed), we achieve a performance of 0.90, indicating the large gap between current performance and the upper bound given the current set of features. We conclude from this experiment that for pointly-supervised action localization with spatio-temporal proposals, a limiting factor is the quality of the proposals themselves. With better action proposals, point-supervision can achieve even better results.
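The ratio argument above follows directly from the 7% figure. The short sketch below works it out for a few removal fractions; it ignores the single oracle ground truth tube that is added to the proposals, which is an assumption made to keep the arithmetic simple.

```python
def low_to_high_ratio(high_fraction: float, removed_fraction: float) -> float:
    """Ratio of remaining low quality proposals to high quality proposals."""
    low_fraction = 1.0 - high_fraction
    remaining_low = low_fraction * (1.0 - removed_fraction)
    return remaining_low / high_fraction


HIGH_FRACTION = 0.07  # share of proposals with an overlap of at least 0.5 (from the text)

for removed in (0.0, 0.5, 0.95):
    ratio = low_to_high_ratio(HIGH_FRACTION, removed)
    print(f"removed {removed:.0%} of low quality proposals -> {ratio:.2f} to 1")
```

Only when almost all low quality proposals are removed do the high quality proposals outnumber them, which is consistent with the observation that a large portion needs to be removed before results improve substantially.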

Sparse point annotations
In the third experiment, we evaluate: (i) how much faster is point-supervision compared to box-supervision and (ii) how many point annotations are sufficient for effective localization. Intuitively, point-supervision is not required for every frame, since the amount of change between consecutive frames is small. We evaluate the influence of the annotation stride and we also estimate how much faster the annotation process becomes compared to dense box-supervision. We perform this experiment on UCF-101, since the videos in this dataset are the longest for action localization, allowing for a wide range of annotation strides.
Annotation times. To obtain an estimate of the annotation times for box- and point-supervision, we have re-annotated several hundred videos while keeping track of the annotation times. We found that annotating a video with an action label takes roughly 5 seconds. Furthermore, annotating a box in a frame takes roughly 15 seconds. This estimate is in between the estimate for image annotation of Su et al (2012) (roughly 30 seconds) and the estimate of Russakovsky et al (2015) (10 to 12 seconds). Annotating a point takes roughly 1.5 seconds, making points ten times faster than boxes to annotate. This estimate is in line with point annotations in images (0.9-2.4 seconds (Bearman et al, 2016)).
Results. In Table 1, we provide the localization performance for two overlap thresholds and the annotation speed-up for point-supervision at seven annotation strides. The table shows that when annotating fewer frames, performance is retained. Only when annotating fewer than 5% of the frames (i.e., for a stride larger than 20) does the performance drop marginally. This result shows that our approach is robust to sparse annotation; a point at every frame is not required. The bottom row of the table shows the corresponding speed-up in annotation time compared to box-supervision. An almost 50-fold speed-up can be achieved while maintaining comparable localization performance. A 300- to 500-fold speed-up can be attained with a marginal drop in localization performance. We conclude that point-supervision is robust to sparse annotations, opening up the possibility for further reductions in annotation cost for action localization.
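To give a feel for how the measured annotation times translate into a speed-up at a given stride, here is a minimal sketch. The accounting is an assumption: it charges the video label once in both settings, a box at every frame for box-supervision, and a point at every stride-th frame for point-supervision. The exact computation behind Table 1 is not given in this section, so the numbers need not match the table.

```python
import math

VIDEO_LABEL_SEC = 5.0  # time to label a video with an action (from the text)
BOX_SEC = 15.0         # time to annotate one box (from the text)
POINT_SEC = 1.5        # time to annotate one point (from the text)


def annotation_speedup(num_frames: int, stride: int) -> float:
    """Estimated speed-up of strided point annotation over dense box annotation."""
    box_time = VIDEO_LABEL_SEC + BOX_SEC * num_frames
    annotated_frames = math.ceil(num_frames / stride)
    point_time = VIDEO_LABEL_SEC + POINT_SEC * annotated_frames
    return box_time / point_time


# Hypothetical action spanning 180 frames.
for stride in (1, 2, 5, 10, 20):
    print(stride, round(annotation_speedup(180, stride), 1))
```

Under this accounting the fixed cost of the video label caps the attainable speed-up for short videos, so larger gains at large strides require longer actions.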

Noisy point annotations
Human annotators, while center biased (Tseng et al, 2009), do not always precisely pinpoint center locations while annotating (Bearman et al, 2016). In the fourth experiment, we evaluate how robust the action localization performance is with respect to noise in the point-supervision. We start from the original point annotations and add zero-mean Gaussian noise with varying levels of isotropic variance. This experiment is performed on the UCF-101 dataset.
Results. The localization performance for six levels of annotation noise is shown in Figure 10. The performance for σ = 0 corresponds to the performance of point-supervision in the first experiment. We observe that across all overlap thresholds, the localization performance is unaffected for noise variations up to a σ of 5. For σ = 10, the results are only affected for thresholds of 0.3 and 0.4, highlighting the robustness of our approach to annotation noise. For large noise variations (σ = 50 or 100), we observe a modest drop in performance for the overlap thresholds 0.1 to 0.4. We conclude that points do not need to be annotated precisely at the center of actions. Annotating points in the vicinity of the action is sufficient for action localization.
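The noise protocol can be sketched as follows. This is a minimal illustration, assuming points are (x, y) pixel coordinates and that σ is the standard deviation of the isotropic Gaussian in pixels; the function name and the seeding are hypothetical choices made for reproducibility.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]


def add_point_noise(points: List[Point], sigma: float, seed: int = 0) -> List[Point]:
    """Perturb each point with zero-mean isotropic Gaussian noise."""
    rng = random.Random(seed)
    noisy = []
    for x, y in points:
        # The same sigma is used for both axes (isotropic noise).
        noisy.append((x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)))
    return noisy


annotations = [(120.0, 80.0), (124.0, 82.0), (130.0, 85.0)]
for sigma in (0.0, 5.0, 50.0):
    print(sigma, add_point_noise(annotations, sigma))
```

With σ = 0 the annotations are returned unchanged, which corresponds to the noise-free setting of the first experiment.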

Exploiting pseudo-points
In the fifth experiment, we investigate the effect of each of the pseudo-points on the action localization performance during inference. We perform this experiment on both UCF Sports and UCF-101.
The weights computed based on the match with point-supervision in training videos provide the degree to which each pseudo-point should contribute to the selection of spatio-temporal proposals in test videos, and they also provide a measure to select the best pseudo-point.
Results. In Figure 11, we show the localization performance for overlap thresholds of 0.2 and 0.5. On UCF Sports, we observe that for an overlap of 0.2, the performance improves for training points, self-supervision, and person detection. For center bias and independent motion, there is a minimal drop in performance. For an overlap of 0.5, the results diverge more clearly. Independent motion (+6.7%), self-supervision (+6.8%), and person detection (+20.0%) benefit directly from inclusion. This does not hold for center bias and training points. While pseudo-points can have a positive impact on the performance, not all types of pseudo-points are effective. Discovering which pseudo-points are most effective is a necessity. On UCF-101, we observe similar trends. Person detection and self-supervision yield increased localization performance, while the data-independent center bias and training points have a negative effect. Video-specific visual cues, such as persons and motion, are effective; generic statistics less so.

We observe that the order of the weights correlates with the localization performance. This indicates the effectiveness of the proposed weighting function, as it provides insight into the quality of the pseudo-points without having to evaluate their performance at test time. Person detection is the most effective pseudo-point. The center bias and training points score lower, which is also visible in their localization performance. Only self-supervision receives a low weight while having a positive effect on the localization. We conclude that the proposed pseudo-point weighting is a reliable way to determine the effectiveness of pseudo-points with point-supervision.

Fig. 11 caption: All results are provided relative to the performance without pseudo-points. Data-dependent pseudo-points such as person detection, self-supervision, and independent motion have a positive effect on the localization performance. Data-independent pseudo-points such as center bias and training points are not effective for action localization. Incorporating the temporal extent of actions as a pseudo-point can further boost performance. We conclude that pseudo-points, when chosen correctly, aid action localization performance.
On both datasets, we also investigate the effect of temporal pseudo-points on the localization performance. On UCF Sports, we observe an increase in performance for both overlap thresholds (+7.3% at 0.2, +2.4% at 0.5). On UCF-101, we also observe a positive effect, albeit with smaller improvements (+0.8% at 0.2, +0.6% at 0.5). We conclude that regularizing spatio-temporal proposals using the temporal extent of actions, which is provided by point-supervision, aids action localization in videos.
Based on the weights of the pseudo-points, we recommend using person detection during inference. We will use this setup for the state-of-the-art comparison.
Qualitative results. To gain insight into which types of videos benefit from pseudo-points, we provide a qualitative analysis on UCF Sports. In Figure 12a, we show three test videos where the localization improved due to the effect of pseudo-points. We show the effect for independent motion, self-supervision, and person detection, respectively. In all three cases, the inclusion of pseudo-points resulted in a better fit on the action by enlarging its scope. For self-supervision and independent motion, the wider motion evidence resulted in a better fitting localization. For person detection, the evidence of the whole person had a positive effect. These examples show the potential of pseudo-points to guide the selection of spatio-temporal proposals during inference.

Fig. 12 caption: The second row shows failure cases. For data-independent pseudo-points such as center bias and training point statistics, the action can deviate from its true location. Person detection can furthermore be problematic when many people are present in the scene. The qualitative analysis shows that pseudo-points can alleviate problems in pointly-supervised action localization. To aid the performance, data-dependent pseudo-points are informative, while data-independent pseudo-points appear less effective.
In Figure 12b, we show three test videos where the inclusion of pseudo-points resulted in a worse localization. We show this effect for the less successful center bias and training points, as well as for the most successful pseudo-point, person detection. For center bias, this resulted in a shift from a precise fit on the action (red) to a large generic location (blue). This is because the center bias is data-independent and might undo correct localizations. This also holds for the training point statistics, which are identical for each test video. Lastly, the person detection can yield diverging localizations when many people are present in the scene. We conclude that motion-based and person-based pseudo-points can aid action localization, while data-independent pseudo-points are less suited.

Comparison to others
In our final experiment, we compare pointly-supervised action localization with alternatives using either box-supervision or weaker forms of supervision. We perform this experiment on UCF Sports, UCF-101, and Hollywood2Tubes. To compare with as many methods as possible, we evaluate with AUC on UCF Sports and with mAP on UCF-101 and Hollywood2Tubes. We evaluate action localization for the standard overlap threshold of 0.2.

Results. We present results on all three datasets in Table 2. On UCF Sports, we observe that our approach outperforms the state-of-the-art in weakly-supervised action localization, as well as the point-supervision in our previous work, which lacks the pseudo-points during inference. Naturally, our pointly-supervised approach also outperforms the state-of-the-art in zero-shot and unsupervised action localization, emphasizing the effectiveness of points as supervision. Lastly, we perform competitively with, or even better than, the state-of-the-art using box-supervision.

On UCF-101, we also outperform all existing weakly-supervised alternatives. Our approach reaches an mAP of 0.418, compared to 0.369 of Li et al (2018) and 0.351 of , the state-of-the-art in weakly-supervised action localization. In comparison to box-supervision, we outperform the approach of van Gemert et al (2015), which employs identical spatio-temporal proposals and representations. On UCF-101, the state-of-the-art approaches in action localization from box-supervision perform better (Kalogeiton et al, 2017a;Yang et al, 2017). These approaches score and link 1 to 10 consecutive boxes into tubes, rather than opting for spatio-temporal proposals. Based on our second experiment, we posit that better spatio-temporal proposals can narrow this gap in performance.

Table 2: Comparative evaluation of pointly-supervised action localization to the state-of-the-art using box-supervision as well as weakly-supervised alternatives. All results are shown for an overlap threshold of 0.2. On all datasets our approach compares favorably to all weakly-supervised localization approaches, indicating the effectiveness of point-supervision. On UCF Sports, we perform comparably to or better than approaches that require box-supervision. On UCF-101, we outperform the approach based on box-supervision with the same proposals and features, but we are outperformed by approaches that score and link individual boxes (Kalogeiton et al, 2017a;Saha et al, 2017;Yang et al, 2017). We expect that higher quality spatio-temporal proposals can narrow this gap (see Figure 9). On Hollywood2Tubes, which only provides point annotations for training, we set a new state-of-the-art.
Lastly, we provide results with our approach on Hollywood2Tubes. We first observe that overall, the performance on this dataset is lower than on UCF Sports and UCF-101 in terms of mAP scores. This indicates the challenging nature of the dataset. The combination of temporally untrimmed videos, multi-shot actions, and actions of a complex semantic nature makes for a difficult action localization task. Our approach provides a new state-of-the-art result on this dataset with an mAP of 0.178, compared to 0.143 of  and 0.172 of .

Conclusions
This paper introduces point-supervision for action localization in videos. We start from spatio-temporal proposals, normally used during inference to determine the action location. We propose to bypass the need for box-supervision by learning directly from spatio-temporal proposals in training videos, guided by point-supervision. Experimental evaluation on three action localization datasets shows that our approach yields similar results to box-supervision. Moreover, our approach can handle sparse and noisy point annotations, resulting in a 20 to 150 times speed-up for action supervision. To help guide the selection of spatio-temporal proposals during inference, we propose pseudo-points, automatic visual cues in videos that hallucinate points in test videos. When weighted and selected properly with our quality measure, pseudo-points can have a positive impact on the action localization performance. We conclude that points provide a fast and viable alternative to boxes for spatio-temporal action localization.

A Types of localization errors
For the error diagnosis, we consider five types of detections, parameterized by an overlap threshold τ. The first detection type is a correct detection (detection from a positive video with an overlap of at least τ). The second type is a localization error (detection from a positive video with an overlap less than τ, but greater than 0.1). The third type is confusion with another action (detection from a negative video with an overlap of at least 0.1). The fourth type is background detection from own action (detection from a positive video with an overlap less than 0.1). The fifth and final type is background detection from another action (detection from a negative video with an overlap less than 0.1). These five types cover all possible types of detections. The types are similar to those of Hoiem et al (2012). Different from Hoiem et al (2012), we do not split actions into similar and dissimilar (since no such subdivision exists). Instead, we split background detections into detections from own and other actions.
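The five detection types can be written as a small decision function. This is a sketch under two assumptions that the definitions above leave open: τ is larger than 0.1, and a detection from a positive video with an overlap of exactly 0.1 is counted as background. The function name and the returned labels are hypothetical.

```python
def detection_type(from_positive_video: bool, overlap: float, tau: float) -> str:
    """Assign a detection to one of the five error-diagnosis types."""
    if from_positive_video:
        if overlap >= tau:
            return "correct detection"
        if overlap > 0.1:
            return "localization error"
        # Overlap of exactly 0.1 falls here by assumption.
        return "background from own action"
    if overlap >= 0.1:
        return "confusion with other action"
    return "background from other action"


examples = [(True, 0.6), (True, 0.3), (True, 0.05), (False, 0.4), (False, 0.0)]
for positive, overlap in examples:
    print(positive, overlap, detection_type(positive, overlap, tau=0.5))
```

Counting these labels over the top R detections of each action gives the error distribution used in the diagnosis.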