Tubelets: Unsupervised action proposals from spatiotemporal super-voxels

This paper considers the problem of localizing actions in videos as sequences of bounding boxes. The objective is to generate action proposals that are likely to include the action of interest, ideally achieving high recall with few proposals. Our contributions are threefold. First, inspired by selective search for object proposals, we introduce an approach to generate action proposals from spatiotemporal super-voxels in an unsupervised manner; we call them Tubelets. Second, along with the static features from individual frames, our approach exploits motion. We introduce independent motion evidence as a feature to characterize how the action deviates from the background and explicitly incorporate such motion information in various stages of the proposal generation. Finally, we introduce spatiotemporal refinement of Tubelets, for more precise localization of actions, and pruning to keep the number of Tubelets limited. We demonstrate the suitability of our approach by extensive experiments for action proposal quality and action localization on three public datasets: UCF Sports, MSR-II and UCF101. For action proposal quality, our unsupervised proposals beat all other existing approaches on the three datasets. For action localization, we show top performance on both the trimmed videos of UCF Sports and UCF101 as well as the untrimmed videos of MSR-II.


Introduction
The goal of this paper is to localize and recognize actions such as 'kicking', 'hand waving' and 'salsa spin' in video content. The recognition of actions has witnessed tremendous progress in recent years thanks to advanced video representations based on motion and appearance, e.g. (Laptev, 2005; Dollar et al., 2005; Wang et al., 2013, 2015a; Simonyan and Zisserman, 2014). However, determining the spatiotemporal extent of an action has proven considerably more challenging. Early success came from an exhaustive evaluation of possible action locations, e.g. (Ke et al., 2005; Lan et al., 2011; Tian et al., 2013). Such a sliding cuboid is tempting, but owing to the large number of possible locations it demands a relatively simple video representation, e.g. (Dalal and Triggs, 2005; Kläser et al., 2008). Moreover, the rigid cuboid shape does not necessarily capture the versatile nature of actions well. We propose an approach for action localization enabling flexible spatiotemporal subvolumes, while still allowing for modern video representations.
Tran and Yuan pioneered the prediction of flexible spatiotemporal boxes around actions (Tran and Yuan, 2011, 2012). They first obtain for each individual frame the most likely spatial locations containing the action, before determining the best temporal path or action proposal through the box search space (Tran and Yuan, 2011, 2012). Surprisingly, the initial spatial classification is frame-based and ignores motion characteristics for action recognition. More recently, both Gkioxari and Malik (2015) and Weinzaepfel et al. (2015) overcome this limitation by relying on a two-stream convolutional neural network based on appearance and two-frame motion flow. While proven effective, these works need to determine the locations in each frame with supervision, and for each action class separately, making them less suited for action localization challenges requiring hundreds of actions. Rather than separating the spatial from the temporal analysis and relying on region-level class-specific supervision, we prefer to analyze both spatial and temporal dimensions jointly to obtain action proposals in an unsupervised manner and avoid supervision until classification. Such an approach is easier to scale to hundreds of classes. Moreover, the same set of proposals can be used for applications requiring different encodings or classification schemes.
We are inspired by a method for object detection in static images called selective search (Uijlings et al., 2013). The algorithm generates box proposals for possible object locations by hierarchically merging adjacent super-pixels from (Felzenszwalb and Huttenlocher, 2004), based on similarity criteria for color, texture, size and fill. The approach does not require any supervision, making it suited to evaluate many object classes with the same set of proposals. The small set of object proposals is known to result in both high recall and overlap with the ground-truth (Hosang et al., 2016). Moreover, by separating the localization from the recognition, selective search facilitates modern encodings, such as Fisher vectors of (Sánchez et al., 2013) in (van de Sande et al., 2014) and convolutional neural network features in (Girshick et al., 2016). Following the example set by selective search for object detection, we introduce unsupervised spatiotemporal proposals for action localization by relying on video-specific appearance and motion properties derived from super-voxels.
Brox and Malik (2010) realized earlier that temporally consistent segmentations of moving objects in a video can be obtained without supervision. They propose to cluster long term point trajectories and show that these lead to better segmentations than two-frame motion fields. Both Chen and Corso (2015) and van Gemert et al. (2015) build on the work of Brox and Malik (2010) and propose action proposals by cleverly clustering the improved dense trajectories of Wang and Schmid (2013). Their approaches are known to be very effective for untrimmed videos where temporal localization is essential. We adopt the use of long term trajectories for temporal refinement and pruning of our action proposals, but we do not restrict ourselves exclusively to improved dense trajectories as representation for action classification.

Fig. 1 (panels: iteration 0, iteration 9, iteration 25) Overview of unsupervised action proposal from super-voxels: An initial super-voxel segmentation of a video example is shown as a frame sequence in the bottom layer. The proposed grouping (only shown for one frame) iteratively merges the super-voxels that are on the action. One of the better super-voxels after grouping is shown in blue, enclosed by a green box. We refer to the sequence of such bounding boxes over the frames as a Tubelet.
Our first of three contributions is to generalize the selective search strategy for unsupervised action proposals in videos. We adopt the general principle designed for static images and repurpose it for video. We consider super-voxels instead of super-pixels to produce spatiotemporal shapes. This directly gives us 2D+t sequences of bounding boxes, without the need to address the problem of linking boxes from one frame to another, as required in other approaches (Tran and Yuan, 2011, 2012; Gkioxari and Malik, 2015; Weinzaepfel et al., 2015). We refer to our action proposals as Tubelets in this paper, and summarize their generation in Figure 1.
Our second contribution is explicitly incorporating motion information in various stages of the analysis. We introduce independent motion evidence as a feature to characterize how the action motion deviates from the background motion. By analogy to image descriptors such as the Fisher vector (Sánchez et al., 2013), we encode the singularity of the motion in a feature vector associated with each super-voxel. We use the motion as an independent cue to produce super-voxels segmenting the video. In addition, motion is used as a merging criterion in the agglomerative grouping of super-voxels, leading to better Tubelets.
A preliminary version of this article appeared as Jain et al. (2014). The current version adds, as a third contribution, the spatiotemporal refinement and pruning of Tubelets. The spatiotemporal refinement includes temporal sampling and smoothing of the irregularly shaped Tubelets. This post-processing considerably improves the performance while keeping the number of proposals manageable. Where Chen and Corso (2015) and van Gemert et al. (2015) derive their proposals directly and exclusively from the improved dense trajectories, we use the trajectories to refine our unsupervised action proposals from super-voxels. In addition to this technical novelty, the current paper adds: i) a detailed experimental evaluation of motion-based segmentation for better proposals, leading to large gains in both proposal quality and action localization, ii) apart from UCF Sports and MSR-II, the much larger UCF101 dataset, iii) revised experiments for all three datasets considering both the quality of the proposals and their suitability for action localization using modern video representations (Sánchez et al., 2013; Szegedy et al., 2015), and iv) a new related work section, which is discussed next.

Related work
We discuss action recognition and action localization. In Table 1 we link action recognition representations with action localization methods and use it to structure our discussion of related work.

Action recognition
Part-based. Action recognition by parts typically exploits the human actor. Correctly recognizing the human pose improves performance (Jhuang et al., 2013). A detailed pose model can make fine-grained distinctions between nearly similar actions (Cheron et al., 2015). Pose can be modeled with poselets (Maji et al., 2011) or as a flexible constellation of parts in a CRF (Wang and Mori, 2011). For action recognition in still images, where motion is not available, the human pose can play a role (Delaitre et al., 2010) as modeled in a part-based latent SVM (Felzenszwalb et al., 2010). In our work we make no explicit assumptions on the pose, and use generic local video features.
Cube. Local video features are typically represented by a 3D cube. The seminal work of Laptev (2005) on Spatio-Temporal Interest Points (STIPs) detects points that are salient in appearance and motion and then uses a cube of Gaussian derivative filter responses to represent the interest points. An alternative representation is HOG3D (Kläser et al., 2008), which extends the 2D Histogram of Oriented Gradients (HOG) of Dalal and Triggs (2005) to 3D. Instead of using sparse salient points, the work of Dollar et al. (2005) shows that using denser sampling improves results. Replacing dense points with dense trajectories (Wang et al., 2015a) and flexible track-aligned feature cubes with motion boundary features yields excellent performance. The improved trajectories take into account the camera motion compensation, which is shown to be critical in action recognition (Jain et al., 2016; Piriou et al., 2006; Wang and Schmid, 2013). In our work we build on these dense trajectories as well.
Bag of Words. To arrive at a global representation over all local descriptors, BoW represents a cube descriptor by a prototype. The frequency of the prototypes aggregated in a histogram is a global video representation. The BoW representation is simple and offers good results (Everts et al., 2014; Wang et al., 2011). We consider BoW as one of our representations for action localization as well.
Fisher Vector. Where BoW records prototype frequency counts, the Fisher vector (Sánchez et al., 2013) and the VLAD (Jégou et al., 2012) model the relation between local descriptors and prototypes in the feature space of the descriptor. This more sophisticated variant of BoW outperforms BoW (Jain et al., 2013; Oneata et al., 2013, 2014b). Because of the good performance we also consider the Fisher vector as a representation.
CNNs. Deep learning on visual data with CNNs (Convolutional Neural Networks) has revolutionized static image recognition (Krizhevsky et al., 2012). For action recognition in videos, the work of Simonyan and Zisserman (2014) separates video into two channels: a network on static RGB and a network on hand-crafted optical flow. In Wang et al. (2015b) CNN features are used as a local feature in dense trajectories using a Fisher vector. Long term motion can be modeled by recurrent networks (Ng et al., 2015). The distinction between motion and static objects is analyzed in Jain et al. (2015b) and extended by Jain et al. (2015a) for action recognition without using any video training data. Instead of separating static and motion, 3D convolutional networks combine both (Tran et al., 2015). Due to their excellent performance we also adopt CNN features as a representation for action localization.

Action localization
2D Human detector. Spatiotemporal action localization can be realized by running a human detector on each frame and tracking the detections. In Kläser et al. (2012) a sliding window upper-body HOG detector per frame is tracked by optical flow feature points for spatial localization. Temporal localization is achieved with a sliding window on track-aligned HOG3D features. HOG3D features are also used in Lan et al. (2011), albeit in BoW, where the 2D person detector is treated as a latent variable and an undirected relational graph inspired by a latent SVM is used for classification. Similarly, the human pose is used by Wang et al. (2014) in a relational dynamic poselet model using cuboids to model a mixture of parts. In Ma et al. (2013) dynamic action parts are extended by incorporating static parts using 2D segments. Segments are grouped to tracks and represented in a hierarchical variant of BoW. In our work we do not make the assumption that an action has to be performed by a human. Our method is equally applicable to actions by groups, animals, or vehicles.
2D generic detector. By replacing the human detector with a generic detector, the types of actions can be extended beyond a human actor. This can be done by finding the best path through fixed positions in a frame using HOG/HOF directly (Tran and Yuan, 2012) or through BoW (Tran and Yuan, 2011). Instead of fixed positions, Gkioxari and Malik (2015) classify object proposals with a two-stream CNN and track overlapping proposals with a high classification score. The work of Weinzaepfel et al. (2015) uses a similar two-stream CNN approach, adding a HOG/HOF/MBH-like cube descriptor at the track level, and adds temporal localization with a sliding window. The need for strong supervision is removed by Puscas et al. (2015), where generic CNN features are linked through dense trajectory tracks to yield action proposals that could be used for action localization. Similarly, our work requires no supervision for obtaining action proposals, and we experimentally show that these proposals give good results. In addition, we do not first treat a video as a collection of static frames where temporal relations are added as a separate second step. Instead, we respect the 3D spatiotemporal nature of video from the very beginning.
3D Trajectory. The strength of 3D dense trajectories (Wang et al., 2015a) for action recognition spilled over to action localization. In Raptis et al. (2012) mid-level clusters of trajectories are grouped and matched with a graphical model. The work of Mosabbeb et al. (2014) groups trajectories to parts which are used in a BoW in an unsupervised manner using low-rank matrix completion and subspace clustering. Similarly, BoW on space-time graph clusters is used by Chen and Corso (2015) and a Fisher vector on trajectories is used on hierarchical clusters in van Gemert et al. (2015) for action localization. These methods specifically target the strength of dense trajectories. Instead, our approach does not commit itself to a single representation.
3D Cuboid. The 3D nature of video is respected by building on space-time cuboids for action localization. Such cuboids are a natural extension of 2D patches to 3D. Ke et al. (2005) offer a 3D extension of the seminal face detector of Viola and Jones (2004) using 3D cuboids with optical flow features. The work of Yuan et al. (2009) and Cao et al. (2010) exploits the efficient branch and bound method (Lampert et al., 2008) in 3D. In Tian et al. (2013) the deformable part-based model (Felzenszwalb et al., 2010) is generalized to 3D, an efficient sliding window approach in 3D is proposed by Derpanis et al. (2013), and ordinal regression (Kim and Pavlovic, 2010) is extended by Chen et al. (2014). Instead of using cuboids, which are rigid in time and space, we choose a more delicate approach using 3D voxels.
3D Voxels. As a 3D generalization of 2D image segmentation, the voxels from video segmentation methods (Xu and Corso, 2012) offer flexible and fine-grained tools for action proposals. In extension of Manen et al. (2013), the work of Oneata et al. (2014a) groups voxels together for action proposals using minimal training. Such action proposals could be used for action localization. This is done by Soomro et al. (2015), who use a supervised CRF to model foreground-background relationships for proposals and action localization. Instead, our proposal method is unsupervised and thus class agnostic. This is beneficial as it makes our algorithm independent of the number of action classes. This paper is an extension of Jain et al. (2014), where 3D voxels are grouped to proposals based on features such as color, texture and motion. The proposals have successfully been used for action localization using objects (Jain et al., 2015b) and in a zero-shot setting (Jain et al., 2015a). We will discuss the mechanics of our unsupervised action proposals next.

Unsupervised action proposals: Tubelets
In this section we present our approach to obtain action proposals from video in an unsupervised manner; we call the spatiotemporal proposals Tubelets. The three stages of the Tubelet generation process are shown in Figure 2. We first introduce in Subsection 3.1 our motion model based on evidence of independent motion. This motion cue is used in the first two stages of the process. In Subsection 3.2, we discuss the first stage, super-voxel segmentation, to generate an initial set of super-voxels from video. For this we rely on an off-the-shelf video segmentation as well as our proposed independent motion evidence. In Subsection 3.3 we detail the second stage of super-voxel grouping, where we iteratively group the two most similar super-voxels into a new one. The similarity score is computed using multiple grouping functions, each leading to a set of super-voxels. A super-voxel is tightly bounded by a rectangle in each frame in which it appears. The temporal sequence of bounding boxes forms our action proposal, a Tubelet. In Subsection 3.4, we introduce spatiotemporal refinement and pruning of Tubelets. This enhances the proposal quality, especially for temporal localization, while at the same time keeping the number of proposals small enough to allow computationally expensive features and memory-demanding encodings for action localization.

Fig. 2 Tubelet generation: In the first stage a video is segmented into super-voxels. In addition to segmenting video frames, we also segment their iMotion maps to include motion information in the super-voxel segmentation stage. In the second stage of super-voxel grouping, super-voxels are iteratively merged using several grouping functions, each of them leading to a set of action proposals. These sets are again grouped by union into a set of Tubelets. The final stage is post-processing that includes pruning and spatiotemporal refinement of action proposals.

Evidence of independent motion
Since we are concerned with action localization, we need to aggregate super-voxels corresponding to the action of interest. Most of the points in such super-voxels deviate from the background motion, which is caused by the moving camera and is usually assumed to be the dominant motion. In other words, the regions corresponding to independently moving objects usually do not conform to the dominant motion in the frame. The dominant frame motion can be represented by a 2D parametric motion model. Typically, an affine motion model of parameters $\theta = (a_i),\ i = 1 \dots 6$, or a quadratic (perspective) model with 8 parameters can be used, depending on the type of camera motion and the scene layout likely to occur:
$$w_\theta(p) = \begin{pmatrix} a_1 + a_2 x + a_3 y \\ a_4 + a_5 x + a_6 y \end{pmatrix} \quad \text{or} \quad w_\theta(p) = \begin{pmatrix} a_1 + a_2 x + a_3 y + a_7 x^2 + a_8 xy \\ a_4 + a_5 x + a_6 y + a_7 xy + a_8 y^2 \end{pmatrix},$$
where $w_\theta(p)$ is the velocity vector supplied by the motion model at point $p = (x, y)$ in the image domain $\Omega$. In this paper, we use the affine motion model for all the experiments.
We formulate the evidence that a point $p \in \Omega$ undergoes an independent motion (i.e., an action-related motion) at time step $t$. Let us introduce the displaced frame difference at point $p$ and at time step $t$ for the motion model of parameter $\theta_t$: $r_{\theta_t}(p,t) = I(p + w_{\theta_t}(p), t+1) - I(p,t)$. Here, $r_{\theta_t}(p,t)$ will be close to 0 if point $p$ only undergoes the background motion due to camera motion. At every time step $t$, the global parametric motion model can be estimated with a robust penalty function as
$$\hat{\theta}_t = \arg\min_{\theta_t} \sum_{p \in \Omega} \rho\big(r_{\theta_t}(p,t)\big), \qquad (1)$$
where $\rho$ is the robust function. To solve (1), we use the publicly available Motion2D software by (Odobez and Bouthemy, 1995), where $\rho(\cdot)$ is defined as the Tukey function. $\rho(r_{\theta_t})$ produces a maximum likelihood type estimate: the so-called M-estimate (Huber, 1981). Indeed, if we write $\rho(r_{\theta_t}) = -\log f(r_{\theta_t})$ for a given function $f$, $\rho(r_{\theta_t})$ supplies the usual maximum likelihood estimate. Since we are looking for action-related moving points in the image, we want to measure the deviation from the global (background) motion. This is in the spirit of the Fisher vectors by (Perronnin and Dance, 2007), where the deviation of local descriptors from a background Gaussian mixture model is encoded to produce an image representation.
Let us consider the derivative of the robust function $\rho(\cdot)$. It is usually denoted $\psi(\cdot)$ and corresponds to the influence function (Huber, 1981). More precisely, the ratio $\psi(r_{\theta_t})/r_{\theta_t}$ accounts for the influence of the residual $r_{\theta_t}$ in the robust estimation of the model parameters. The higher the influence, the more likely the point conforms to the global motion. Conversely, the lower the influence, the less likely the point conforms to the global motion. This leads us to define the independent motion evidence as:
$$\xi(p,t) = 1 - \varpi(p,t),$$
where $\varpi(p,t)$ is the ratio $\psi\big(r_{\hat{\theta}_t}(p,t)\big) / r_{\hat{\theta}_t}(p,t)$, normalized within $[0, 1]$.
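The motion cue of this subsection can be illustrated with a minimal sketch. It assumes the affine parameter ordering $(a_1, \dots, a_6)$ with $a_1$ and $a_4$ as the translation terms, a Tukey biweight whose cut-off `c` is a hypothetical value chosen by the caller (the actual estimation is delegated to Motion2D and is not reproduced here), and that the evidence is one minus the influence ratio $\psi(r)/r$ scaled to $[0, 1]$. The function names are illustrative only.

```python
def affine_flow(theta, x, y):
    """Velocity (u, v) of a 6-parameter affine motion model at pixel (x, y).

    Assumed parameter ordering: u = a1 + a2*x + a3*y, v = a4 + a5*x + a6*y.
    """
    a1, a2, a3, a4, a5, a6 = theta
    return (a1 + a2 * x + a3 * y, a4 + a5 * x + a6 * y)


def tukey_weight(r, c):
    """Ratio psi(r)/r of Tukey's biweight, which already lies in [0, 1].

    For |r| < c the ratio is (1 - (r/c)^2)^2; beyond the cut-off c the
    residual has no influence on the estimate.
    """
    if abs(r) >= c:
        return 0.0
    return (1.0 - (r / c) ** 2) ** 2


def independent_motion_evidence(residuals, c):
    """Map a 2D grid of displaced frame differences to evidence values.

    A residual near zero (point follows the dominant motion) gives evidence
    near 0; a residual beyond the cut-off gives evidence 1.
    """
    return [[1.0 - tukey_weight(r, c) for r in row] for row in residuals]
```

With a cut-off of 4, a residual of 0 yields evidence 0 and a residual of 10 yields evidence 1, so pixels that do not follow the estimated camera motion stand out in the resulting map.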

Super-voxel segmentation
To generate an initial set of super-voxels, we rely on a third-party graph-based video segmentation by (Xu and Corso, 2012). We choose their graph-based segmentation over other methods in (Xu and Corso, 2012) because it is more efficient w.r.t. time and memory. The graph-based segmentation is about 13 times faster than the slightly more accurate hierarchical version (Xu and Corso, 2012).
Independent motion. As an alternative to the off-the-shelf video segmentations, each video frame is represented with the corresponding map, $\xi(t)$, of independent motion of pixels. This encodes motion information in the segmentation. We show video frames and their $\xi(t)$ maps in Figure 3(a) and 3(b). We post-process the independent motion or $\xi(t)$ maps by applying a morphological closing operation (dilation followed by erosion) to obtain denoised maps, which we refer to as iMotion maps, displayed in Figure 3(c). Applying the graph-based video segmentation of (Xu and Corso, 2012) on sequences of these denoised maps partitions the video into super-voxels with independent motion. Three examples of results obtained this way are shown in Figure 3(d). The first column shows a frame from action 'Swing-Bench', where the action of interest is highlighted by the iMotion map itself and then clearly delineated by segmenting the maps. The second column shows an example from action 'Running'. Here the segmentation does not give an ideal set of initial super-voxels, but the iMotion map has useful information to be exploited by our motion-feature-based merging criterion (described in Subsection 3.3). An example of 'Hand Waving' is shown in the last column. The resulting super-voxels are more adapted and aligned to the action sequences. This alternative for initial segmentation is also more efficient: it is about 4 times faster than graph-based segmentation on the original video and produces 8 times fewer super-voxels. Unlike graph-based video segmentation on original frames, this alternate set of initial super-voxels exploits motion information. The two are complementary and together lead to much better proposal quality, as shown later in our experiments.

Super-voxel grouping
Having defined our ways to segment a video sequence into super-voxels, we are now ready to present our method for grouping super-voxels into Tubelets. The grouping is done in two steps. In the first step, initial super-voxels are grouped iteratively to create new super-voxels. A grouping function computes the similarity between any two super-voxels, and the successive groupings of the most similar pairs lead to a new set of super-voxels. Each grouping function leads to a set of super-voxels. In the second step, the super-voxel sets produced by multiple grouping functions are again grouped by union. This united set of super-voxels is then enclosed by boxes in each frame to yield the Tubelets.
Iterative grouping. We iteratively group super-voxels in an agglomerative manner. Starting from the initial set of super-voxels, we hierarchically group them until the video becomes a single super-voxel. At each iteration, a new super-voxel is produced from two super-voxels, which are then not considered any more in subsequent iterations. This iterative merging algorithm is inspired by the selective search method proposed for localization in images by (Uijlings et al., 2013).
Formally, we produce a hierarchy of super-voxels that is represented as a tree: the leaves correspond to the n initial super-voxels while the internal nodes are produced by the merge operations. The root node is the whole video and the corresponding super-voxel is produced in the last iteration. Since this hierarchy of super-voxels is organized as a binary tree, it is straightforward to show that n − 1 additional super-voxels are produced by the algorithm. Out of these n − 1 super-voxels, those which are very small or contain no motion at all are discarded at this point. This usually leaves far fewer super-voxels, depending upon the grouping function used.
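The iterative grouping described above can be sketched as a greedy loop over adjacent pairs. This is an illustration under stated assumptions rather than the implementation used in the experiments: super-voxels are reduced to integer ids with a size and a neighbor set, the `similarity` callback stands in for any grouping function, and the function name `group_supervoxels` is hypothetical.

```python
def group_supervoxels(sizes, neighbors, similarity):
    """Greedy agglomerative grouping of adjacent super-voxels.

    sizes:      dict id -> number of voxels
    neighbors:  dict id -> set of adjacent ids (symmetric)
    similarity: function (id_a, id_b, sizes) -> float, higher merges first
    Returns the list of merges as (new_id, child_a, child_b), i.e. the
    internal nodes of the binary tree in creation order.
    """
    sizes = dict(sizes)
    neighbors = {k: set(v) for k, v in neighbors.items()}
    next_id = max(sizes) + 1
    merges = []
    while len(neighbors) > 1:
        # Only neighboring super-voxels that are still active are candidates.
        pairs = [(a, b) for a in neighbors for b in neighbors[a] if a < b]
        if not pairs:
            break  # remaining super-voxels are not connected
        a, b = max(pairs, key=lambda p: similarity(p[0], p[1], sizes))
        new = next_id
        next_id += 1
        sizes[new] = sizes[a] + sizes[b]
        # The merged super-voxel inherits the neighbors of both children.
        adjacent = (neighbors.pop(a) | neighbors.pop(b)) - {a, b}
        for n in adjacent:
            neighbors[n] -= {a, b}
            neighbors[n].add(new)
        neighbors[new] = adjacent
        merges.append((new, a, b))
    return merges
```

For a connected adjacency graph with n initial super-voxels the loop performs n − 1 merges, matching the count of internal nodes stated above. Discarding very small or motionless super-voxels would be a separate filtering pass over `merges`.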
Grouping function. For selection of the two super-voxels to be grouped, we rely on similarities computed between all the neighboring super-voxels that are still active. We employ five complementary similarity measures in our grouping functions to compare super-voxels, in order to decide which should be merged. They are fast to compute. Four of these measures are adapted from selective search in images, where the measures based on Color, Texture, Size and Fill were computed for super-pixels (Uijlings et al., 2013). We revise them for super-voxels. As our objective is not to segment the objects but to delineate the actions or actors, we additionally employ a motion-based similarity measure based on our independent motion evidence to characterize a super-voxel. The grouping function is defined as any one of the similarity measures or a sum of several of them. Next we present the five similarity measures for super-voxels: motion, color, texture, size and fill.
Similarity by motion ($s_M$): We define a motion representation of super-voxels from iMotion maps capturing the relevant motion information. This motion representation is also efficient to compute. We consider the binarized version of iMotion maps obtained by setting all non-zero values to 1. At every pixel $p$, we count the number of pixels $q$ (including $p$) in its 3D neighborhood that are set to 1 (i.e. pixels likely to be related to actions). In a subvolume of 5 × 5 × 3 pixels, this count value ranges from 0 to 75. A motion histogram of these values, denoted by $h^M_i$, is computed over the super-voxel $r_i$. Intuitively, this histogram captures both the density and the compactness of a given region with respect to the number of points belonging to independently moving objects. Two super-voxels, $r_i$ and $r_j$, represented by motion histograms are compared as follows. The motion histograms are first $\ell_1$-normalized and then compared with histogram intersection, $s_M = \delta_1(h^M_i, h^M_j)$. The histograms are efficiently propagated through the hierarchy of super-voxels. Denoting with $r_t = r_i \cup r_j$ the super-voxel obtained by merging the super-voxels $r_i$ and $r_j$, we have:
$$h^M_t = \frac{\Gamma(r_i)\, h^M_i + \Gamma(r_j)\, h^M_j}{\Gamma(r_i) + \Gamma(r_j)},$$
where $\Gamma(r)$ denotes the number of pixels in super-voxel $r$. The size of the new super-voxel $r_t$ is $\Gamma(r_t) = \Gamma(r_i) + \Gamma(r_j)$.
Similarity by color ($s_C$) and texture ($s_T$). In addition to motion, we also consider similarity based on color and texture. Both $h^C$ and $h^T$ are identical to the histograms considered for selective search in images (Uijlings et al., 2013), except that we compute them on super-voxels rather than super-pixels. The histograms are computed from color and intensity gradient for each given super-voxel:
- The color histogram $h^C$ captures the HSV components of the pixels included in a super-voxel;
- $h^T$ encodes the texture or gradient information of a given super-voxel.
The method of similarity computation and the process of merging for color and texture is the same as for motion: Describe each super-voxel with a histogram and compare the two by histogram intersection.
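The histogram-based comparison and propagation can be sketched as follows. The propagation rule is assumed here to be the size-weighted average used by selective search for images, and the function names are illustrative; histograms are plain lists of bin values.

```python
def l1_normalize(hist):
    """Scale a histogram so that its bins sum to one."""
    total = float(sum(hist))
    if total == 0:
        return [0.0 for _ in hist]
    return [h / total for h in hist]


def histogram_intersection(h_i, h_j):
    """Similarity of two L1-normalized histograms, in [0, 1]."""
    return sum(min(a, b) for a, b in zip(h_i, h_j))


def merge_histograms(h_i, size_i, h_j, size_j):
    """Histogram of the merged super-voxel as a size-weighted average.

    Assumed rule: weights are the voxel counts of the two children, so the
    merged histogram stays L1-normalized without revisiting any voxel.
    """
    total = size_i + size_j
    return [(size_i * a + size_j * b) / total for a, b in zip(h_i, h_j)]
```

Because the merged histogram is derived from the two child histograms and their sizes alone, each merge costs time proportional to the number of bins, independent of the number of voxels involved.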
Similarity by size ($s_\Gamma$) and fill ($s_F$). The similarity $s_\Gamma(r_i, r_j)$ aims at merging smaller super-voxels first:
$$s_\Gamma(r_i, r_j) = 1 - \frac{\Gamma(r_i) + \Gamma(r_j)}{\Gamma(\text{video})},$$
where $\Gamma(\text{video})$ is the size of the video (in pixels). This tends to produce super-voxels, and therefore Tubelets, of varying sizes in all parts of the video (recall that we only merge contiguous super-voxels).
The last similarity measure $s_F$ measures how well super-voxels $r_i$ and $r_j$ fit into each other. We define $B_{i,j}$ to be the tight bounding cuboid enveloping $r_i$ and $r_j$. The similarity is given by:
$$s_F(r_i, r_j) = \frac{\Gamma(r_i) + \Gamma(r_j)}{\Gamma(B_{i,j})}.$$
After each merge, we compute the new similarities between the resulting super-voxel and its neighbors, as illustrated in the following two figures.
Figure 4 illustrates the method on a sample video. Each color represents a super-voxel, and after every iteration a new super-voxel is added and two are removed. After 1,000 iterations, observe that two Tubelets (blue and dark green) emerge around the action of interest in the beginning and the end of the video, respectively. At iteration 1,720, the two corresponding super-voxels are merged. The novel Tubelet (dark green) resembles the yellow ground-truth sequence of bounding boxes. This exhibits the ability of our method to group super-voxels both spatially and temporally. Also importantly, it shows the capability to sample an action proposal with boxes having very different aspect ratios. Sliding subvolumes, or even approaches based on efficient sub-window search, are unlikely to cope with this. Figure 5 depicts another example, with a single frame considered at different stages of the algorithm.
Here the initial super-voxels (second image in first row) are spatially more decomposed because the background is cluttered both in appearance and in motion (spectators cheering).Even in such a challenging case our method is able to group the super-voxels related to the action of interest.
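The size and fill measures used during grouping can be sketched as below. The exact formulas are assumptions made for this illustration: size similarity is taken as one minus the joint size relative to the video, and fill as the joint size relative to the tight bounding cuboid. Cuboids are hypothetical tuples `(x0, y0, t0, x1, y1, t1)` with inclusive integer bounds.

```python
def size_similarity(size_i, size_j, video_size):
    """High when both super-voxels are small relative to the video."""
    return 1.0 - (size_i + size_j) / video_size


def bounding_cuboid(c_i, c_j):
    """Tight cuboid enclosing two cuboids (x0, y0, t0, x1, y1, t1)."""
    low = tuple(min(a, b) for a, b in zip(c_i[:3], c_j[:3]))
    high = tuple(max(a, b) for a, b in zip(c_i[3:], c_j[3:]))
    return low + high


def cuboid_volume(c):
    """Number of voxels in a cuboid with inclusive integer bounds."""
    return (c[3] - c[0] + 1) * (c[4] - c[1] + 1) * (c[5] - c[2] + 1)


def fill_similarity(size_i, cuboid_i, size_j, cuboid_j):
    """Fraction of the joint bounding cuboid covered by the two super-voxels.

    Assumed form: values near 1 mean the pair fits together with few gaps.
    """
    joint = bounding_cuboid(cuboid_i, cuboid_j)
    return (size_i + size_j) / cuboid_volume(joint)
```

Two abutting 2 × 2 × 2 blocks fill their joint cuboid completely and score 1, whereas two single voxels three positions apart along one axis score 0.5, so the measure favors merges that do not leave holes.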

Pruning and spatiotemporal refinement of Tubelets
Pruning proposals. We apply two types of pruning to reduce the number of proposals, leading to a more compact set of Tubelet action proposals with minimal impact on the recall.
Motion pruning: The first type of pruning is based on the amount of motion. Long videos that have much background clutter due to unrelated actors/objects usually result in many irrelevant Tubelet proposals. We filter them based on their motion content, which we quantify by the number of motion trajectories (Wang and Schmid, 2013). For each video, we rank the Tubelet proposals based on the number of trajectories, and keep the top P proposals and the top ten percent of the rest. This is to ensure that at least a minimal number of proposals are retained from each video.
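The ranking rule for motion pruning can be sketched in a few lines. The sketch assumes each proposal is summarized by its trajectory count, that "ten percent of the rest" is rounded up, and that the function name `motion_prune` and its parameters are illustrative.

```python
def motion_prune(trajectory_counts, top_p, rest_percent=10):
    """Return indices of proposals kept by motion pruning.

    trajectory_counts[k] is the number of trajectories in proposal k.
    Keeps the top_p proposals by count, plus the best rest_percent percent
    of the remaining ones (rounded up, which is an assumption).
    """
    order = sorted(
        range(len(trajectory_counts)),
        key=lambda k: trajectory_counts[k],
        reverse=True,
    )
    head, rest = order[:top_p], order[top_p:]
    n_rest = (len(rest) * rest_percent + 99) // 100  # integer ceiling
    return head + rest[:n_rest]
```

With twelve proposals and `top_p=2`, the two proposals with the most trajectories are kept together with one of the remaining ten.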
Overlap pruning: The second type of pruning is based on mutual overlaps of the action proposals. Many proposals have very high alignment or overlap between them, all practically representing the same part of the video. To eliminate such redundant proposals we keep only one in a set of many highly overlapping ones. It is particularly useful when there is a large number of action proposals per video.
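Overlap pruning can be sketched as a greedy filter. Several details are assumptions made for illustration: a Tubelet is a dict mapping frame index to a box `(x0, y0, x1, y1)`, the overlap of two Tubelets is the per-frame intersection-over-union averaged over the union of their frames, and the threshold of 0.8 is a hypothetical value.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes (x0, y0, x1, y1)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def tubelet_overlap(t_a, t_b):
    """Mean per-frame IoU over all frames covered by either tubelet.

    Frames present in only one tubelet contribute an IoU of zero.
    """
    frames = set(t_a) | set(t_b)
    if not frames:
        return 0.0
    shared = set(t_a) & set(t_b)
    return sum(box_iou(t_a[f], t_b[f]) for f in shared) / len(frames)


def overlap_prune(tubelets, threshold=0.8):
    """Keep a tubelet only if it does not overlap a kept one too strongly."""
    kept = []
    for t in tubelets:
        if all(tubelet_overlap(t, k) < threshold for k in kept):
            kept.append(t)
    return kept
```

The filter keeps the first proposal of every highly overlapping group, so the input order decides which representative survives; sorting by motion content beforehand would be one possible choice.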
Spatiotemporal refinement. A super-voxel, and therefore a Tubelet, capturing an actor or object can continue to extend even after the action is completed, as shown in the top row of Figure 6. Moreover, Tubelets are generated from super-voxels that generally follow an object or an actor and hence can be spatially irregular in shape, sometimes leading to sudden changes in the size of consecutive bounding boxes. We propose to handle these two problems of weak temporal localization and non-smooth spatial localization by temporal and spatial refinement.
Temporal refinement: In order to deal with overly long Tubelets we propose to temporally sample or segment them. For this we devise a method that segments each proposal into smaller sub-sequences with tighter temporal boundaries, without increasing the total number of proposals too much. This temporal refinement is applied to one proposal at a time. Consider an action proposal of B boxes (i.e., extending over B frames), where the i-th box has nrTraj(i) trajectories passing through it (i = 1 ... B). We represent each box by two values: (a) relative location = i/B and (b) relative motion content = nrTraj(i)/nrTraj_max. Here, nrTraj_max is the maximum number of trajectories passing through any of the B boxes. The boxes that have similar relative location and relative motion content are grouped together by clustering, such that the initial proposal is segmented into about fifteen sub-sequences. Then, very short proposals with a temporal length of less than thirty are filtered out. In practice, this increases the number of proposals by a factor of ten. Therefore, we precede and follow temporal sampling by overlap pruning, to restrict the total number of proposals. The impact of temporal refinement is shown in the second row of Figure 6.

Spatial refinement: We apply spatial refinement of proposals to steer the super-voxels closer to the shape of the action rather than that of the objects or actor, and also to avoid sudden changes in the size of bounding boxes and thus obtain a smoother sequence of boxes. First, to align the boxes closer to the action we modify them such that they are not void of motion trajectories at the boundaries. In each box, the minimum and maximum of the x and y coordinates of the intersecting trajectories are computed and the box is restricted to [x_min − N, y_min − N, x_max + N, y_max + N]. Second, we apply weighted linear regression on the width, the height, and the x and y coordinates of the top-left corner of the boxes. This is done over a local span of a few frames, typically a fifth of the proposal length. The impact of spatial refinement after temporal refinement is shown in the last row of Figure 6.
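Two of the steps above are simple enough to sketch directly: the two-value description of each box used for temporal clustering, and the first spatial refinement step that restricts a box to the extent of its trajectories plus a margin N. The code below is a hedged illustration under stated assumptions. The function names are hypothetical, "restricted to" is interpreted as clipping the existing box to the padded trajectory extent, and a box without any trajectory point is left unchanged; the clustering and the weighted linear regression are not shown.

```python
def box_descriptors(nr_traj):
    """Return (relative location, relative motion content) for each of the B boxes.

    nr_traj[i - 1] is the number of trajectories through the i-th box.
    """
    num_boxes = len(nr_traj)
    max_traj = max(nr_traj) if nr_traj else 0
    return [((i + 1) / num_boxes,
             n / max_traj if max_traj > 0 else 0.0)  # guard is an assumption
            for i, n in enumerate(nr_traj)]


def tighten_box(box, points, margin):
    """Clip box (x1, y1, x2, y2) to the extent of the trajectory points inside it."""
    x1, y1, x2, y2 = box
    inside = [(x, y) for x, y in points if x1 <= x <= x2 and y1 <= y <= y2]
    if not inside:
        return box  # assumption: nothing to align to, keep the box as is
    xs = [x for x, _ in inside]
    ys = [y for _, y in inside]
    return (max(x1, min(xs) - margin), max(y1, min(ys) - margin),
            min(x2, max(xs) + margin), min(y2, max(ys) + margin))
```

The descriptors could then be passed to any standard clustering routine; which clustering algorithm is used is not specified in the text.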

Datasets
UCF Sports. This dataset consists of 150 videos of actions extracted from sports broadcasts, with realistic actions captured in dynamic and cluttered environments (Rodriguez et al., 2008). The dataset is challenging due to many actions with large displacement and intra-class variability. Ten action categories are represented, for instance 'diving', 'swinging bench', 'horse riding', etc. We use the disjoint train-test split of videos (103 for training and 47 for testing) suggested by Lan et al. (2011). The ground truth is provided as sequences of bounding boxes enclosing the actors. The area under the ROC curve (AUC) is the standard evaluation measure used, and we follow this convention.
MSR-II and KTH. The MSR-II dataset consists of 54 videos recorded in a crowded environment with many people moving in the background. Each video contains multiple actions of three types: 'boxing', 'hand clapping' and 'hand waving'. An actor appears, performs one of these actions, and walks away. A single video has multiple actions (5-10) of different types, making the temporal localization challenging. Bounding subvolumes or cuboids are provided as the ground truth. Since the actors do not change their location, this is equivalent to a sequence of bounding boxes. The localization criterion is subvolume-based, so we follow Cao et al. (2010) and use the tight subvolume or cuboid enveloping the Tubelet. Precision-recall curves and average precision (AP) are used for evaluation (Cao et al., 2010). As is standard practice, this dataset is used for cross-dataset experiments with KTH (Schüldt et al., 2004) as the training set.
UCF101. The UCF101 dataset by Soomro et al. (2012) is a large action recognition dataset containing 101 action categories, of which 24 are provided with localization annotations, corresponding to 3,204 videos. Each video contains one or more instances of the same action class. It has large variations (camera motion, appearance, scale, etc.) and exhibits much diversity in terms of actions. Three train/test splits are provided with the dataset; we perform all evaluations on the first split, with 2,290 videos for training and 914 videos for testing. Mean average precision is used for evaluation.
Example frames of some of the action classes are shown in Figure 7 for each dataset.

Evaluation criteria for action proposals
To evaluate the quality of action proposals, we compute the upper bound on the localization accuracy, as previously done to evaluate the quality of object proposals (Uijlings et al., 2013), by the Mean Average Best Overlap (MABO) and the maximum possible recall. In this subsection, we extend these measures from objects in images to actions in videos. This requires measuring the overlap between two sequences of boxes instead of two boxes.
Overlap or localization score. In a given video V of F frames comprising m instances of different actions, the i-th ground-truth sequence of bounding boxes is given by gt_i = (B^i_1, B^i_2, ..., B^i_F). If there is no action of the i-th instance in frame f, then B^i_f = ∅. From the action proposals, the j-th proposal formed by a sequence of bounding boxes is denoted as dt_j = (D^j_1, D^j_2, ..., D^j_F). Let OV_{i,j}(f) be the overlap between the two bounding boxes in frame f, which is computed as intersection-over-union. The localization score between ground-truth Tubelet gt_i and a Tubelet dt_j is given by:

$$S(gt_i, dt_j) = \frac{1}{|\Gamma|} \sum_{f \in \Gamma} OV_{i,j}(f),$$

where Γ is the set of frames where at least one of B^i_f, D^j_f is not empty. This criterion generalizes the one proposed by Lan et al. (2011) by taking into account the temporal axis.
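The localization score can be illustrated with a short Python sketch. It assumes the score is the per-frame intersection-over-union averaged over Γ, with frames where only one of the two boxes exists contributing zero overlap. The dictionary-based representation (frame index mapped to a box, absent frames meaning an empty box) and the function names are hypothetical choices made for the example.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def localization_score(gt, dt):
    """Average per-frame overlap over all frames where gt or dt has a box.

    gt and dt map a frame index to a box; a missing frame is an empty box.
    """
    frames = set(gt) | set(dt)  # the set Gamma
    if not frames:
        return 0.0
    # Frames covered by only one of the two sequences add zero overlap.
    total = sum(box_iou(gt[f], dt[f]) for f in frames if f in gt and f in dt)
    return total / len(frames)
```

A proposal that matches the ground truth perfectly in space but runs three times as long in time therefore scores only about one third, which is the temporal penalty the criterion is designed to impose.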
Mean Average Best Overlap (MABO). The Average Best Overlap (ABO) for a given class c is obtained by computing, for each ground-truth annotation gt_i ∈ G^c, the best localization from the set of action proposals T = {dt_j | j = 1 ... m}:

$$ABO(c) = \frac{1}{|G^c|} \sum_{gt_i \in G^c} \max_{dt_j \in T} S(gt_i, dt_j).$$

The mean ABO (MABO) summarizes the performance over all the classes.
Maximum possible recall (Recall). Another measure for the quality of proposals is the maximum possible recall. It is computed as the fraction of ground-truth actions whose best overlap is greater than the overlap threshold σ, averaged over action classes. We compute it with a very stringent localization threshold of σ = 0.5. Note that adding more proposals can only increase the MABO and Recall (scores are maintained if the added proposals are not better). So, both MABO and Recall must be considered jointly with the number of proposals.
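Both measures follow directly from the table of localization scores between ground-truth instances and proposals. The sketch below is an illustration under assumptions: the function name `mabo_and_recall` and the input layout (one list of rows per class, one row of scores per ground-truth instance) are hypothetical, and every class is assumed to have at least one ground-truth instance and one proposal.

```python
def mabo_and_recall(scores_by_class, sigma=0.5):
    """Return (MABO, Recall) from per-class localization scores.

    scores_by_class[c] holds one row per ground-truth instance of class c;
    each row lists the score of that instance against every proposal.
    """
    abo, recall = [], []
    for rows in scores_by_class.values():
        # Best overlap achievable for each ground-truth instance.
        best = [max(row) for row in rows]
        abo.append(sum(best) / len(best))
        # Fraction of instances whose best overlap exceeds the threshold.
        recall.append(sum(b > sigma for b in best) / len(best))
    return sum(abo) / len(abo), sum(recall) / len(recall)
```

Appending a proposal adds a column to every row, so each row maximum can only stay the same or grow, which is why both measures must be read together with the number of proposals.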
Action localization. An instance of an action, gt_i, is considered to be correctly localized by an action proposal, dt_j, if the action is correctly predicted by the classifier and the overlap or localization score is greater than the overlap threshold, i.e., S(gt_i, dt_j) > σ.

Table 2 Quality of initial super-voxels by applying the graph-based segmentation by Xu and Corso (2012) on RGB video frames and on a sequence of iMotion maps for the UCF Sports train set. We report MABO, Recall (at σ = 0.5), number of initial super-voxels, and execution time in seconds. Note the competitive performance of super-voxel segmentation on iMotion maps.

Experiments: Quality of Tubelets
In this section, we first analyze and evaluate the three stages of Tubelet extraction on the training set of the UCF Sports dataset. The initial step, super-voxel segmentation, is discussed in Subsection 5.1. Then, we evaluate different grouping functions over the initial set of super-voxels in Subsection 5.2 and also show that segmenting iMotion maps is complementary to segmenting input video frames. In Subsection 5.3, we evaluate the impact of spatiotemporal refinement and pruning on all three datasets. Finally, in Subsection 5.4 we compare Tubelets with the state-of-the-art. We evaluate Tubelets with modern representations for action localization in Section 6.

Super-voxel segmentation
Here, we evaluate the graph-based segmentation of video and the graph-based segmentation of iMotion maps. We set the parameters as follows: σ = 0.5, merging threshold of two nodes c = 200, and minimum segment size s_min = 500. Larger values of c and s_min would mean larger (and hence fewer) segments.
In Table 2, we compare the segmentation methods based on MABO, Recall, number of super-voxels and computation time. Segmentation of iMotion maps leads to better results in all respects, with higher MABO and Recall, fewer initial super-voxels and lower computation time. However, initial super-voxels from video segmentation are also important, as we will see in the next experiment.

Super-voxel grouping
We evaluate super-voxel groupings in Table 3 and Table 4 for video and iMotion segmentations, respectively. Nine grouping functions are considered that use one or more of the five similarity measures defined in Section 3.3: Motion, Color, Texture, Size and Fill. Five of these use only one similarity measure, while the other four use multiple similarities. Here, All-but-motion is Color+Texture+Size+Fill and All is Motion+Color+Texture+Size+Fill; the rest are self-explanatory. We first evaluate these nine grouping functions in both tables. In Table 3, the best performing groupings are the ones that involve the iMotion similarity measure: Motion, Motion+Size+Fill and All. Motion needs only 299 proposals per video to achieve a MABO of 56.2% and a Recall of 64.3%. Note that this is much lower than the number of initial super-voxels (862) produced by the graph-based video segmentation. This is because iMotion brings most of the motion content into fewer super-voxels, and the majority of the resulting super-voxels are too small or have zero motion, and hence are discarded.
After trying several combinations on the training set of UCF Sports, we select the five best grouping functions: Motion, Fill, Motion+Size+Fill, All-but-motion and All. Grouping the super-voxels from the five selected functions into a union set, Φ, significantly increases the MABO and Recall to 62.0% and 74.7%, respectively. Considering that a common localization score threshold (σ) used in the literature is 0.2 (Lan et al., 2011; Tian et al., 2013), these MABO values and the Recall at σ = 0.5 are very promising. The set of Tubelets thus obtained with input video segmentation and the union set Φ is from now on referred to as T_vid.

Super-voxel groupings with segmentation of iMotion maps are evaluated in Table 4. Here, the grouping functions containing the iMotion similarity measure again prove to be the most successful, though not as much as in the case of video segmentation. This is because segmenting iMotion maps already utilizes motion information to some extent. Fill also leads to good MABO and Recall with just 155 proposals. The union set, Φ, achieves a good MABO of 56.8% and a Recall of 77.0%, which even outperforms the Recall obtained with video segmentation by 2.3%. Although the best MABO with segmentation of iMotion maps is lower than that for video segmentation, the number of proposals required is only 624 on average, which is much lower than the 3,254 proposals from video segmentation. This is a considerable reduction, which is particularly useful for long videos where the number of proposals can be high. Moreover, segmenting iMotion maps is faster, which is again of interest when operating on longer videos. This set of Tubelets, obtained by segmenting iMotion maps and the union set Φ, is from here on referred to as T_iMotion.
After analyzing segmentations from input video and iMotion maps separately, we now combine the Tubelets from both, with the resulting proposal set denoted by T_iMotion ∪ T_vid. As reported in Table 5, the MABO increases to 69.5% and the Recall reaches 93.6%. This is an improvement of ∼7% in MABO and ∼16% in Recall over the individual best of video and iMotion segmentations. The experiments up to this point are conducted on the training set of UCF Sports. They validate the set of grouping functions, Φ, and show that the two Tubelet sets T_iMotion and T_vid complement each other for localizing actions. We fix this setting for the experiments to follow.

Pruning and spatiotemporal refinement
In this section, we evaluate the impact of pruning and spatiotemporal refinement on the quality of action proposals for UCF Sports, MSR-II and UCF101. The validation of grouping functions and segmentation is already done on the training set of UCF Sports. Now, we report results when considering all the videos of these three datasets, to be comparable with the numbers reported by other methods. Before moving to the results, we provide the implementation details of pruning and spatiotemporal refinement.
Implementation details. For motion pruning we set P = 50, so that at least fifty proposals are retained from each video. Also, motion pruning is only applied to T_vid, since proposals from T_iMotion are expected to have enough motion content.
Overlap pruning is similar to non-maximum suppression, but it is applied without classification scores and can therefore affect the recall. To minimize its impact on Recall, we set a high overlap threshold of 0.8 for overlap-based pruning. For spatial refinement, we set N equal to 5% of the frame width.
UCF Sports. In Table 6, we evaluate the impact of pruning and spatial refinement on MABO, Recall and the average number of proposals per video for the UCF Sports dataset. The results for T_vid ∪ T_iMotion on all 150 videos of UCF Sports are similar to those on its train set. With motion pruning there is no loss of MABO and Recall while only ∼26% of the original proposals are used. With overlap pruning the number of proposals goes down further to ∼8% of the original number, with a small loss in MABO and Recall. Finally, with spatial refinement of Tubelets there is a small improvement in Recall. Altogether, with pruning and spatial refinement we are able to decrease the number of action proposals by a factor of 12 with only a modest loss in MABO and Recall.

MSR-II. The MSR-II dataset has untrimmed videos with multiple instances of different types of actions in the same video. This poses additional challenges for temporal localization, which is experimentally illustrated in Table 7. The table reports MABO and Recall for Tubelet set T_vid after motion pruning, for spatiotemporal localization and also for spatial-only localization. The overlap score for the spatiotemporal case is computed according to Equation 6, as done in all other results. For spatial localization, we compute the overlap only for the frames where the ground truth is present, i.e., we do not penalize the overlap score for temporal misalignment. MABO doubles and the Recall shoots from 2.2% to 81.3% for spatial-only localization, which means that our Tubelets locate the actions very well spatially but extend to frames where there is no action of interest. This is due to the tendency of super-voxels to continue to cover the actor even when the action is completed. We overcome this limitation by temporal refinement.
In Table 8, in addition to pruning and spatial refinement, we also report the effect of temporal refinement, which improves temporal localization. First, motion pruning maintains the MABO and Recall while reducing the number of proposals to only a quarter of the initial number. This pruning needs to precede temporal refinement to limit the number of proposals. Second, temporal refinement leads to a massive improvement of 30.1% in Recall and 9.3% in MABO. Note that temporal refinement also includes overlap pruning to filter out newly added, very similar proposals. Also, to limit the number of proposals, temporal refinement is exclusively applied to 'T_vid + Motion pruning', which means only overlap pruning is applied to 'T_iMotion + Motion pruning'. Finally, with spatial refinement another large improvement of ∼12% is achieved in Recall, along with a ∼3% improvement in MABO.
Overall, we achieve an improvement of 12% in MABO and 42.3% in Recall while decreasing the number of proposals by about 72% compared to the initial set, T_vid ∪ T_iMotion. The gain due to temporal refinement is easy to understand for this dataset of untrimmed videos. However, we also get an impressive boost from spatial refinement, much larger than the one we get for the other two datasets. We attribute this to the exploitation of information from motion trajectories, which is paramount for MSR-II, as noted before by van Gemert et al. (2015) and Chen and Corso (2015).
UCF101. In Table 9, we report the impact of pruning and spatial refinement on MABO, Recall and the average number of proposals per video for the UCF101 dataset. Motion pruning also works well on the 3,204 videos of UCF101, compressing the number of proposals by a factor of four while maintaining MABO and Recall. With overlap pruning the number of proposals goes down further to ∼9% of the original number, with a small loss in MABO and Recall. Thanks to spatial refinement, the final set of Tubelets eventually achieves the same performance as T_vid ∪ T_iMotion, but with about 10 times fewer proposals.

Comparison with state-of-the-art methods
In Table 10, we compare our Tubelets with alternative unsupervised action proposals from the literature. With a relatively small set of 289 proposals we outperform all the other approaches on UCF Sports. On MSR-II, we outperform the previous best approach of van Gemert et al. (2015). It is interesting to note the improvement in MABO and Recall over the initial version of our approach (Jain et al., 2014), indicating the value of spatiotemporal refinement and pruning. On UCF101, we achieve MABO and Recall comparable to the method of van Gemert et al. (2015), while needing five times fewer proposals. Overall, Tubelets provide state-of-the-art quality while balancing the number of proposals. Next we evaluate the action localization abilities of Tubelets when combined with modern representations.

Experiments: Action localization
In this section we evaluate our approach for action localization on UCF Sports, MSR-II and UCF101. For positive training examples, we use the ground truth and our Tubelets that have a localization score greater than 0.7 with the ground truth. Negative samples are randomly selected from Tubelets whose overlap with the ground truth is less than 0.15. This scheme is followed for UCF Sports and UCF101. For MSR-II, cross-dataset evaluation is employed: the training samples consist of the clips from the KTH dataset, while testing is performed on the Tubelets from the videos of MSR-II. We apply power normalization followed by ℓ2 normalization before training with a linear SVM. One round of retraining on "hard negatives" was enough, as additional rounds did not improve performance further. Again, there is no retraining in the case of MSR-II.

We first give details of the representations used to encode each Tubelet and show their impact on the UCF Sports dataset. Then, we compare our action localization results with the state-of-the-art methods on each of the three datasets.

Tubelet representations
We capture motion information by the four local descriptors computed along the improved trajectories (Wang and Schmid, 2013). To represent the local descriptors, we use bag-of-words or Fisher vectors. A Tubelet is assigned the trajectories that have more than half of their points inside the Tubelet. For the third representation we use features from a Convolutional Neural Network layer and average pool them over the frames. Below we explain these three representations.
Bag of words (BoW). The local descriptors are vector quantized and pooled into a bag-of-words histogram. We set the vocabulary size to K = 500. This is the least expensive (and least expressive) of the three representations.
Fisher vectors (FV). We first apply PCA on the local descriptors and reduce the dimensionality by a factor of two. Then 256,000 descriptors are selected at random from a training set to estimate a Gaussian Mixture Model with K (= 128) Gaussians. Each video is then represented by a 2DK-dimensional Fisher vector, where D is the dimension of the descriptor after PCA. Finally, we apply power and ℓ2 normalization to the Fisher vector, as suggested by Perronnin et al. (2010). The feature computation is reasonably efficient, but the memory requirement would be a bottleneck if the number of proposals is high (e.g. > 5000). Fisher vectors have been used for temporal action localization by Oneata et al. (2014b) and for spatiotemporal action localization by van Gemert et al. (2015).
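The normalization step and the resulting dimensionality can be made concrete with a small Python sketch. It is an illustration under assumptions: the power normalization is taken to be the signed power with exponent α = 0.5 (the common "signed square root"; the exponent is not stated in the text), the function names are hypothetical, and the value D = 48 in the usage note is purely an example.

```python
import math


def power_l2_normalize(vector, alpha=0.5):
    """Apply signed power normalization followed by L2 normalization."""
    # Signed power: sign(x) * |x|^alpha (alpha = 0.5 is an assumption).
    powered = [math.copysign(abs(x) ** alpha, x) for x in vector]
    norm = math.sqrt(sum(x * x for x in powered))
    # Leave an all-zero vector unchanged to avoid dividing by zero.
    return [x / norm for x in powered] if norm > 0 else powered


def fisher_vector_dim(descriptor_dim, num_gaussians=128):
    """Dimension 2DK of a Fisher vector with first- and second-order terms."""
    return 2 * descriptor_dim * num_gaussians
```

For a hypothetical descriptor of D = 48 dimensions after PCA, `fisher_vector_dim(48)` gives 12,288 values per descriptor type and per proposal, which indicates why memory becomes the bottleneck when thousands of proposals are encoded.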
Convolutional neural network (CNN). We use an in-house implementation of GoogLeNet (Szegedy et al., 2015), trained on ImageNet over 15k object categories (Jain et al., 2015b) without fine-tuning. The features are extracted from the fully-connected layer (before softmax2) of the network, which yields a 1024-dimensional vector to represent a bounding box in a frame. Since a Tubelet is a sequence of bounding boxes, its final representation is obtained by averaging the feature vectors of the sampled frames (2 frames per second). Here, the memory requirement is limited and feature computation is the costly operation, motivating the need for a compact set of action proposals.
Comparing representations. We now analyze the impact of the above three Tubelet representations on the UCF Sports dataset, following the process described in Section 4.2. As is common for this dataset, we use the area under the ROC curve (AUC) as the evaluation measure. Figure 8 compares the performance of the various Tubelet representations for a varying overlap threshold. We observe a clear improvement when moving from BoW to FV, to CNN and eventually to the combination of FV and CNN, especially for higher thresholds (σ ≥ 0.4).

Comparison with state-of-the-art methods
We now compare our approach with state-of-the-art methods on the three datasets.
UCF Sports. In Figure 9, we compare the performance of our method with the best reported results in the literature.
In (Jain et al., 2015b) [...] maps as well. Tubelets represented with FV+CNN are competitive with the methods of Gkioxari and Malik (2015) and Weinzaepfel et al. (2015) and outperform all other approaches. Since van Gemert et al. (2015) use only the FV representation, for fair comparison we also include Tubelets with an FV representation, which do better for most of the thresholds.

Fig. 9 Comparison with state-of-the-art methods on UCF Sports. Performance is measured by AUC for σ from 0.1 to 0.6.

Figure 11 shows some examples of action localizations from UCF Sports.
MSR-II. This dataset is designed for cross-dataset evaluation. Following standard practice, we train on the KTH dataset and test on MSR-II. While training for one class, the videos from the other classes are used as the negative set. We use the FV representation to be more comparable with the competitive work of van Gemert et al. (2015), which, like Tubelets, generates action proposals in an unsupervised manner.
In Table 11, we compare with several state-of-the-art methods; mean average precision (mAP) along with the APs for the three classes are reported. Following the usual practice on this dataset, we report results for an overlap threshold of 0.125. Apart from Chen and Corso (2015), our approach outperforms all other methods by 5% of mAP or more. Chen and Corso (2015) make very good use of information from motion trajectories and sample action proposals by clustering over a space-time trajectory graph. Motion-trajectory-based approaches are particularly well suited for the MSR-II dataset, as observed with our spatiotemporal refinement of Tubelets and also by van Gemert et al. (2015). Accordingly, the approach of Chen and Corso (2015), which is mainly focused on trajectories, leads to excellent performance on MSR-II, but its performance on UCF Sports is modest (Figure 9). Finally, compared to the Tubelets in Jain et al. (2014), we improve mAP by 24.5%. This again underlines the importance of using both input video frames and iMotion maps for segmentation, and of spatiotemporal refinement of Tubelets. Figure 12 shows some examples of localizations for MSR-II.
UCF101. UCF101 is much larger than the other two datasets, with 24 action classes, and is currently the most challenging dataset for classification of proposals. Again, we represent Tubelets with FV following van Gemert et al. (2015). In Figure 10, we report mAPs for different overlap thresholds and compare Tubelets with three other approaches that report results on the UCF101 dataset. Despite the use of human detection, the approach by Yu and Yuan (2015) is about 10% behind our method for an overlap threshold of 0.125. Weinzaepfel et al. (2015) use bounding-box level action class supervision while generating proposals. Despite their additional supervision and use of two-stream CNN features, we achieve better mAP for 3 out of 4 overlap thresholds. The only other approach that uses proposals generated in an unsupervised manner, as we do, is APT by van Gemert et al. (2015). Tubelets outperform their approach while requiring only about a fifth of the proposals (see Table 10).
Figure 13 displays some examples of action localizations from UCF101. With 24 classes this dataset offers a larger variety in types of actions. Poor localization (shown in red) mainly happens in the case of multiple actors, when one of the actors gets occluded during the action (see 'Salsa Spin'). In that case, Tubelets typically encapsulate both actors together. However, the varying aspect ratios, diverse locations in the video frames, speed of action and multiple actors are well captured by our action proposal method.

Conclusions
We presented an unsupervised approach to generate proposals from super-voxels for action localization in videos. This is done by iterative grouping of super-voxels driven by both static features and motion features, motion being the key ingredient. We introduced independent motion evidence to characterize how the action-related motion deviates from the background. The generated iMotion maps provide a more efficient alternative for segmentation. Moreover, iMotion-based features allow for effective and efficient grouping of super-voxels. Our action proposals, Tubelets, are action class independent and implicitly cover variable aspect ratios and temporal lengths. We showed, for the first time, the effectiveness of Tubelets for action localization in Jain et al. (2014). In this paper, iMotion maps are presented with further insights, and segmenting iMotion maps is shown to be complementary to segmenting input video frames. Additionally, we introduced spatiotemporal refinement and pruning of Tubelets. Spatiotemporal refinement overcomes the tendency of super-voxels to sometimes follow the actor even after the action is completed. This led to improved MABO and Recall scores, especially on the untrimmed videos of MSR-II (Table 8), while pruning kept the number of Tubelets limited. The impact of these and the other components of Tubelet generation is extensively evaluated in our experiments. We evaluate our method for both action proposal quality and action localization. For action proposal quality, Tubelets beat all other existing approaches on the three datasets with far fewer proposals (Table 10). For action localization, our method leads to the best performance on UCF101 and the second best on UCF Sports and MSR-II. The method of Chen and Corso (2015) gets the best mAP for MSR-II, but its performance on UCF Sports is rather modest. Similarly, the method of Weinzaepfel et al. (2015) does well on UCF Sports and UCF101 but, being supervised in generating proposals, is not easy to apply on MSR-II. Ours is the only method that delivers excellent performance on both the trimmed videos of UCF Sports and UCF101 as well as the untrimmed videos of MSR-II.
Fig. 3 iMotion maps for segmentation: The top two rows show the original frames and their independent motion. The iMotion maps obtained after applying morphological operations are shown in the third row. The bottom row shows the result of applying graph-based video segmentation on iMotion maps. The process is illustrated for three example video clips for the actions 'Swing-Bench', 'Running' and 'Hand Waving', respectively. In spite of clutter and illumination variations the iMotion map successfully highlights the action.

Fig. 4 Illustration of hierarchical grouping of super-voxels into Tubelets. Left column: A sampled sequence of frames (1st, 15th, 25th, 35th, 50th) associated with the action 'Diving'. The yellow bounding boxes represent the ground-truth sequence. Column 2: the initial video segmentation used as input to our method. The last two columns show two junctures of the iterative grouping algorithm. A Tubelet close to the action is also represented by bounding boxes in these two columns. Observe how close it is to the ground truth in the last column despite the varying aspect ratios in different frames.

Fig. 5 Example for the action 'Running': The first two images depict a video frame and the initial super-voxel segmentation used as input of our approach. The next three images represent the segmentation after a varying number of merge operations.

Fig. 6 Impact of spatiotemporal refinement of Tubelets: The first row shows an untrimmed video of about 900 frames. The ground-truth action is an instance of 'Boxing' from frame 108 to frame 151, as bounded by the yellow boxes. The green boxes in the top row show one of the best Tubelet action proposals obtained for this video. While it aligns well with the ground truth spatially, it fails temporally as it continues beyond 200 frames. With temporal refinement in the second row, we are able to sample a sub-sequence that also localizes the action well temporally. The third row shows further improvement by spatial refinement.

Fig. 7 Example video frames showing action classes from the UCF Sports, MSR-II and UCF101 datasets.
Fig. 8 Comparing representations: Bag-of-words, Fisher vector and CNN features on UCF Sports. Performance is measured by AUC for σ from 0.1 to 0.6, following Tian et al. (2013). The best AUC is obtained when both Fisher vector and CNN features are combined for the Tubelet representation.

Fig. 10 Comparison with state-of-the-art methods on UCF101: Tubelets are obtained using the selected five grouping functions and represented with FV. Performance is measured by mAP for σ from 0.1 to 0.6.

Fig. 11 Localization results shown as a sequence of bounding boxes (UCF Sports): Ground truth is shown in yellow, correctly localized detections in green and poorly localized ones in red. The caption below each sequence reports the class detected.

Fig. 12 Localization results shown as a sequence of bounding boxes (MSR-II): Ground truth is shown in yellow, correctly localized detections in green and poorly localized ones in red. Two instances of 'Boxing' being correctly localized are shown in the first column. The middle two columns show successful results for 'Clapping' and 'Waving' actions. The last column shows a failure case of poor localization of an instance of 'Boxing', while the second instance in the video is localized well.

Table 1
Related work linking the action representation with approaches in action localization. Our work does not treat a video as a collection of 2D frames. Instead, we take a holistic spatiotemporal approach by aggregating 3D voxels. From these voxels we build Tubelets on which we evaluate several state-of-the-art action representations.

Table 2
Quality of initial super-voxels by applying the graph-based segmentation by

Table 3
, the best performing groupings are the ones that involve the iMotion similarity measure: Motion, Mo-tion+Size+Fill and All.Motion needs only 299 proposals per

Table 3
Evaluation of super-voxel groupings with video segmentation on the training set of UCF Sports. Among the similarity measures, the ones based on iMotion (Motion, Motion+Size+Fill and All) perform the best while generating a reasonable number of proposals. The union of the five selected grouping functions, Φ, further increases the MABO and Recall.

Table 4
Evaluation of super-voxel groupings with segmentation of iMotion maps on the training set of UCF Sports. The grouping functions containing the iMotion similarity measure again prove to be the most successful, though not as much as in Table 3. The union set, Φ, achieves a high MABO and Recall with only 624 proposals per video.

Table 5
Combination of Tubelets from video segmentation and iMotion segmentation, $T_{\mathrm{vid}} \cup T_{\mathrm{iMotion}}$. Numbers are reported for the five selected grouping functions as well as their union set, Φ. The combination leads to a significant improvement in MABO and Recall, showing that the two sets of Tubelets from the two segmentations complement each other.

Table 6
Impact of pruning and spatial refinement of Tubelets on UCF Sports: Even after motion pruning the MABO and Recall are maintained with only ∼26% of proposals. With overlap pruning the number of proposals goes down further to ∼8% of the original number, with a small loss in MABO and Recall scores. The loss is compensated by spatial refinement of Tubelets.

Table 8
Impact of pruning and spatial refinement of Tubelets on MSR-II: Pruning by motion maintains the MABO and Recall while reducing the proposals to only a quarter of the initial set. Temporal refinement has a positive impact on proposal quality with Recall increased by 30%. Finally, with spatial refinement another improvement of ∼12% is achieved. Spatiotemporal refinement is important for this dataset.
MSR-II. The MSR-II dataset has untrimmed videos with multiple instances of different types of actions in the same video. This poses additional challenges for temporal localization, which is experimentally illustrated in Table 7. The table

Table 9
Impact of pruning and spatial refinement of Tubelets on UCF101: Motion pruning leads to ∼1% loss in MABO and Recall while filtering out 75% of the proposals. With overlap pruning the number of proposals goes down further to ∼9% of the original number with a small loss in MABO and Recall. This loss is compensated by spatial refinement, leading to the same performance with ten times fewer proposals.

Table 10
Comparison of action proposals against the state of the art. Our Tubelets outperform all other approaches on these three datasets with a modest number of proposals. Our Recall on UCF101 is slightly behind the approach of van Gemert et al. (2015), although they use five times more proposals.
Table 11 Comparison with state-of-the-art methods on MSR-II: Average precision (AP) and mean AP are reported.