1 Introduction

Temporal action localization (TAL) in untrimmed videos has attracted increasing attention in recent years, and many methods [9, 11, 14, 23, 28, 38, 41, 43] that greatly enhance performance have been developed. Because labeling action boundaries in untrimmed videos is expensive, some researchers [25, 26, 30, 33, 38] proposed to use video-level action annotations to produce snippet-level action localization results, which greatly reduces the demand for human labor while yielding comparable performance. These studies combine Multiple Instance Learning (MIL) [7] and attention mechanisms [25, 26, 38] with Deep Convolutional Neural Networks (DCNNs) to produce clip representations. An action detection criterion then maps the clip representations to a Class Activation Sequence (CAS), which determines which snippets contain actions.

However, these weakly-supervised methods share two convenient assumptions that might be too optimistic in the real world. The first assumption is that empirical thresholds for determining temporal action boundaries can be obtained in a trivial manner. This implicit assumption can be far from reality, given the diversity of datasets and applications. The second assumption is that straightforward fusion strategies are adequate in weakly-supervised TAL because of the prevailing two-stream networks [3, 32, 39]: the CAS is either generated separately per stream and fused by weighted average [25, 30, 38], or generated from concatenated features [26] or by regression methods [30]. With the two-stream network, each stream is trained independently via backpropagation and no interaction happens between streams. These two strategies are straightforward to implement, but we argue that there could be a better alternative.

To address these challenges, we design a general-purpose action detection criterion and an alternative stream fusion strategy. Specifically, the action detection criterion is built on an attention mechanism with a momentum-inspired threshold generated during training. An analysis of stream combination options leads to the proposed Action Sensitive Extractor (ASE), shown in Fig. 1. Inspired by recent literature on spatial and temporal interaction [10, 35], the ASE prudently selects action-sensitive features from the two streams and produces activations. In the ASE, we handle the spatial and temporal streams asymmetrically according to their different sensitivities to actions. With our action detection criterion and the ASE, we build the Action Sensitive Network (ASN) for weakly-supervised TAL.

Fig. 1. Illustration of different strategies to combine two-stream features. (a) Lateral fusion of two-stream features. (b) Concatenating two-stream features for processing. (c) Our Action Sensitive Extractor.

The main contributions of this paper are (1) a comparative analysis of stream fusion strategies together with the proposed Action Sensitive Extractor (ASE), and (2) a new, flexible action localization criterion that generates high-quality CAS. The performance gains of the proposed ASN are verified on two challenging public datasets.

2 Related Work

Video action analysis has been widely studied in recent years. Most studies focus on action recognition in trimmed videos. Many novel structures for video [3, 8, 15, 19] have been proposed based on deep convolutional neural networks [16, 17, 37]. The two-stream network [32] was one such design, employing RGB images and optical flow with lateral fusion. Building on the two-stream network, the temporal segment network (TSN) [39] was proposed to analyze long-term temporal data. TSN has been used as a backbone in different tasks [38, 43] with good performance. To further leverage optical flow, [35] proposed a novel structure dedicated to it. The recently proposed SlowFast network [10] uses two pathways to process videos, similar to the two-stream network: a fast pathway handles wide temporal motion and a slow pathway handles rich local details.

Action localization has been greatly improved on top of video action analysis. Many neural architectures and methods [9, 13, 21, 24] have been developed for supervised learning. However, those studies rely heavily on annotated action sequences, which are expensive to acquire. To incorporate more data in training, Sun et al. [34] proposed to use web images and video-level annotations for TAL. Hide-and-Seek [33] explored how to force the network to focus on the most discriminative parts. UntrimmedNet [38] designed a novel structure that trains a high-quality network on untrimmed videos and proposed a method that efficiently selects action segments. UntrimmedNet not only provides a good solution for localization but is also a good baseline model that generates local representations. Based on extracted feature representations, AutoLoc [30] introduced an anchor generation and selection scheme over feature sequences. W-TALC [26] and Nguyen et al. [25] explored feature-based networks with different auxiliary loss functions and attention mechanisms.

3 Action Sensitive Network

In this section, we introduce our proposed Action Sensitive Network. Section 3.1 describes the proposed ASE. Section 3.2 describes our momentum-inspired action detection criterion. The last section introduces the details of the ASN.

3.1 Action Sensitive Extractor

In this section, we propose models to extract action-sensitive features. Our goal is to train a network that maximally leverages the action-sensitive features in the two streams. Because actions are depicted by moving images in videos, the spatial stream, which perceives only one frame, is unlikely to recognize actions directly, while the temporal stream has a wider temporal perception and is inherently sensitive to motion boundaries [27]. A detailed analysis can be found in our experiment section. Inspired by the SlowFast network [10], where the spatial (slow) and temporal (fast) inputs are fed into different network architectures with different channels and different temporal perception, we propose models that treat the temporal and spatial streams asymmetrically.

In general, we use the learned action-sensitive knowledge (inherited from the temporal stream) as the main stream and explore different structures to extract beneficial features that reinforce it. Following the strategy of DenseNet [17], we concatenate the main stream and the reinforce stream for classification and attention calculation. We call this extraction model the Action Sensitive Extractor (ASE); different ASE settings are shown in Fig. 2. For simplicity, we still use a single fully-connected layer for the classification and attention branches. An ASE with classification and attention branches is referred to as an ASE model.
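As a point of reference for the variants below, the following PyTorch-style sketch shows the shared ASE interface we assume (module and variable names are ours, not from the paper): the reinforce (spatial) stream is transformed, concatenated with the main (temporal) stream, and fed to single fully-connected classification and attention branches.

```python
import torch
import torch.nn as nn

class ASEHead(nn.Module):
    """Sketch of a generic ASE head: a transformed reinforce (spatial) stream is
    concatenated with the main (temporal) stream before the two branches.
    `reinforce_dim` must match the output size of the chosen transform."""
    def __init__(self, temporal_dim, reinforce_dim, num_classes):
        super().__init__()
        # `reinforce` stands for any of the transforms in Fig. 2
        # (zero-initialized linear map, bottleneck, bilinear bottleneck, ...).
        self.reinforce = nn.Identity()
        fused_dim = temporal_dim + reinforce_dim
        self.cls_fc = nn.Linear(fused_dim, num_classes)  # classification branch
        self.att_fc = nn.Linear(fused_dim, 1)            # attention branch

    def forward(self, x_temporal, x_spatial):
        fused = torch.cat([x_temporal, self.reinforce(x_spatial)], dim=-1)
        return self.cls_fc(fused), self.att_fc(fused)
```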

Fig. 2. Data flow of different ASE settings. We set the temporal stream as the main stream and the spatial stream as the reinforce stream. The reinforced stream is then fed into the classification and attention branches. (a) Fusion model. (b) Bottleneck model. (c) Bilinear bottleneck model.

Fusion with Temporal Knowledge. To leverage the temporal features that are related to actions, we build a network that is initialized with temporal features and extended with spatial features. To achieve this, we adopt the method of [4]: our network on the fused (concatenated) features is initialized with pretrained temporal weights and zero spatial weights. For example, Eq. 1 shows the classification branch on fused features. To inherit the knowledge of the temporal classifier, we set \(\mathbf {W}^{t}\) and \(\mathbf {b}^{t}\) to the pretrained temporal weights, while \(\mathbf {W}^{s}\) and \(\mathbf {b}^{s}\) are set to 0. The same method is applied to the attention branch.

$$\begin{aligned} \begin{aligned} \mathbf {c}&= \mathbf {W}^{f} \cdot \mathbf {x}^{f} + \mathbf {b}^{f} \\ {}&= \begin{bmatrix}\mathbf {W}^{t}, \mathbf {W}^{s}\end{bmatrix} \cdot \begin{bmatrix} \mathbf {x}^{t} \\ \mathbf {x}^{s} \end{bmatrix} + \mathbf {b}^{t} + \mathbf {b}^{s} \end{aligned} \end{aligned}$$
(1)
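A small sketch of this inherit-style initialization, assuming PyTorch `nn.Linear` layers and the concatenation order \([\mathbf{x}^t; \mathbf{x}^s]\): the temporal block of the fused weight matrix is copied from the pretrained temporal classifier and the spatial block is zeroed, so the fused branch starts out equivalent to the temporal one.

```python
import torch
import torch.nn as nn

def init_fused_from_temporal(fused_fc: nn.Linear, temporal_fc: nn.Linear, d_t: int):
    """Initialize a branch on concatenated [temporal, spatial] features (Eq. 1) so
    that it initially reproduces the pretrained temporal branch: W^t is copied,
    W^s and b^s are zero. d_t is the temporal feature dimension."""
    with torch.no_grad():
        fused_fc.weight.zero_()
        fused_fc.bias.zero_()
        fused_fc.weight[:, :d_t].copy_(temporal_fc.weight)  # temporal block W^t
        fused_fc.bias.copy_(temporal_fc.bias)                # b^t (b^s stays 0)
```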

Bottleneck Model. To limit overfitting on spatial features, we further study limiting and distilling them. Unlike earlier studies that enforce an extra loss [26], we simply use a specially designed network architecture. As a first attempt, we use a bottleneck layer to extract knowledge from the spatial features. The bottleneck layer consists of dropout, a fully-connected layer, and a ReLU activation. The features extracted by the bottleneck layer are concatenated with the temporal features and fed to the classification and attention branches. The bottleneck layer extracts the most expressive spatial features that help identify actions.
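A minimal sketch of the bottleneck transform described above (the dimensions and the 0.5 dropout rate are illustrative, chosen to match settings used elsewhere in the paper):

```python
import torch
import torch.nn as nn

class SpatialBottleneck(nn.Module):
    """Bottleneck on the spatial (reinforce) stream: dropout -> FC -> ReLU,
    then concatenation with the temporal (main) stream."""
    def __init__(self, spatial_dim=1024, bottleneck_dim=64, p_drop=0.5):
        super().__init__()
        self.block = nn.Sequential(
            nn.Dropout(p_drop),
            nn.Linear(spatial_dim, bottleneck_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x_temporal, x_spatial):
        # Distilled spatial features reinforce the temporal stream
        return torch.cat([x_temporal, self.block(x_spatial)], dim=-1)
```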

Bilinear Bottleneck Model. The bottleneck model removes unnecessary spatial features but cannot introduce interactions between streams. In recent works [5, 40], bilinear layers have been proposed to aggregate spatio-temporal features. To exploit the connection between streams, we use a bilinear block to aggregate features. Specifically, we use two fully-connected bottleneck layers to aggregate the features of each stream and a bilinear layer to combine the temporal and spatial features. A dropout of 0.5 is applied before the bottleneck layers and the bilinear layer, and a ReLU activation follows the fully-connected layers and the bilinear layer. The aggregated features are concatenated with the temporal features as in the bottleneck model. The hidden sizes of the bottleneck layers and the bilinear layer are set equal for simplicity.
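A sketch of this variant, assuming PyTorch's `nn.Bilinear` for the pairwise interaction (input dimensions are illustrative):

```python
import torch
import torch.nn as nn

class BilinearBottleneck(nn.Module):
    """Each stream is reduced by its own bottleneck FC, a bilinear layer
    aggregates the pair, and the result reinforces the temporal stream.
    Hidden sizes are kept equal, as in the text."""
    def __init__(self, temporal_dim=1024, spatial_dim=1024, hidden=64, p_drop=0.5):
        super().__init__()
        self.drop = nn.Dropout(p_drop)
        self.fc_t = nn.Linear(temporal_dim, hidden)
        self.fc_s = nn.Linear(spatial_dim, hidden)
        self.bilinear = nn.Bilinear(hidden, hidden, hidden)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x_temporal, x_spatial):
        h_t = self.relu(self.fc_t(self.drop(x_temporal)))
        h_s = self.relu(self.fc_s(self.drop(x_spatial)))
        agg = self.relu(self.bilinear(self.drop(h_t), self.drop(h_s)))
        return torch.cat([x_temporal, agg], dim=-1)
```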

3.2 Action Detection Criterion

Here we present our action detection criterion. We propose to trim a fixed proportion of clips as background, since the proportion of background frames is relatively stable within each dataset. We set the threshold to a quantile of the attention values collected during training; similar to batch normalization [18], where the mean and standard deviation of each batch are recorded and reused, this running estimate deals with fluctuation.

The quantile level describes the desired proportion of clips, i.e., how much of the distribution falls below the quantile: a quantile at 30% means around 30% of the clips in each batch have lower attention than the quantile value. For each training batch, we sort the attentions of the clips in the batch and take the attention value at the desired level. The current quantile is then updated with a momentum factor according to Eq. 2, and is kept fixed during testing.

$$\begin{aligned} q^{t+1} = \alpha q^{t} + (1 - \alpha ) q \end{aligned}$$
(2)
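A minimal sketch of this running quantile, assuming per-clip attention scores in a tensor (names are ours; `torch.quantile` stands in for the sort-and-sample step described above):

```python
import torch

class RunningQuantile:
    """Momentum-updated attention quantile (Eq. 2), analogous to the running
    statistics of batch normalization. The value is frozen at test time."""
    def __init__(self, level=0.5, momentum=0.9):
        self.level = level      # e.g. 0.5: ~50% of clips fall below the threshold
        self.alpha = momentum   # momentum factor in Eq. 2
        self.value = None

    def update(self, attentions: torch.Tensor):
        # Quantile of the per-clip attention values in the current batch
        q_batch = torch.quantile(attentions.detach().flatten(), self.level)
        self.value = q_batch if self.value is None else \
            self.alpha * self.value + (1 - self.alpha) * q_batch
        return self.value
```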

Our method is simple and modality-agnostic, so it is easy to apply this action detection criterion to any attention-based localization problem across different settings. Note that the quantile may differ across datasets, since the proportion of background frames may differ.

Fig. 3. Our full network for action recognition and detection. We use our ASE model to produce frame activations and attentions. Video-level classification activations are optimized with video-level annotations. The CAS is generated by the action detection criterion, and action segments are selected based on the CAS.

3.3 Network Details

Having explained the key components, we now introduce the details of our ASN, shown in Fig. 3. To efficiently process long videos, we break videos into different levels. At the bottom level, each frame is represented separately; we use features from our two-stream pretrained DCNN model as the representation. The middle level is the clip level: we average the features sampled within a short temporal window as the clip representation, since nearby frames in a video are correlated. To distill key knowledge and trim noise, the Action Sensitive Extractor extracts features that are fed to the classification and attention branches. The highest level is the video level, aggregated by the attention mechanism; this level matches the granularity of the annotations.

In our study, we build on the features extracted by UntrimmedNet [38]. Following UntrimmedNet [38], we randomly sample 7 clips for untrimmed videos and 1 clip for trimmed video clips. For each clip, 3 frames are sparsely sampled as in TSN [39] and averaged as the clip representation. Two fully-connected layers produce the classification and attention scores, respectively. A dropout of 0.5 is used only before the classification layer. To fuse clip-level activations, we apply a softmax over the attentions x of clips 1 to t. The normalized attentions \(\bar{x_i^a} = \frac{exp(x_i^a)}{\sum _{j=1}^t{exp(x_j^a)}}\) are used to fuse clip-level classifications into the video-level prediction \(\mathbf {x}^c\), where \(\mathbf {x}^c = \sum _{i=1}^t {\bar{x_i^a} \mathbf {x}_i^c}\). Next, we apply a softmax over the class dimension of the prediction and optimize with the multi-label cross-entropy loss in Eq. 3, where \(C\) denotes the number of classes.

$$\begin{aligned} l(\mathbf {x}^c, \mathbf {y}) = -\sum _{i=1}^{C}{y_i log(\frac{exp(x_i^c)}{\sum _{j=1}^{C}{exp(x_j^c)}} )} \end{aligned}$$
(3)
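The attention-weighted fusion and the loss above can be sketched as follows (a hedged PyTorch reading of the fusion step and Eq. 3; the tensor shapes and the normalization of the multi-hot label are our assumptions):

```python
import torch
import torch.nn.functional as F

def video_prediction_and_loss(clip_cls, clip_att, video_labels):
    """clip_cls: (T, C) clip classification scores, clip_att: (T,) attention
    scores, video_labels: (C,) multi-hot video-level annotation.
    Returns the attention-fused video prediction x^c and the loss of Eq. 3."""
    att = F.softmax(clip_att, dim=0)                      # normalized attentions
    video_cls = (att.unsqueeze(1) * clip_cls).sum(dim=0)  # x^c, shape (C,)
    log_prob = F.log_softmax(video_cls, dim=0)            # softmax over classes
    target = video_labels / video_labels.sum().clamp(min=1)  # assumed normalization
    loss = -(target * log_prob).sum()                     # multi-label cross-entropy
    return video_cls, loss
```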

During testing, we use a strategy similar to [38] and [30]. Clips are aggregated every 15 frames, and the ASE model produces classification and attention activations for each clip. For video recognition, we first soften the attentions by a factor (set to 3); clips are then fused into a video representation according to their attentions, as in training. For video detection, we generate a CAS of size \(clip\_number \times class\_number\) and feed it into the selection method. First, we apply a softmax over the clip classification activations. Second, we apply a threshold on the video-level prediction: clip activations of classes unrelated to the video are set to 0 in the CAS. Third, we apply the attention-level threshold: clips with attentions lower than the threshold are set to 0 in the CAS. Finally, we feed the CAS into the selection method to generate action segments.
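A minimal sketch of these CAS-generation steps (the class-score threshold value is illustrative and not taken from the paper):

```python
import torch
import torch.nn.functional as F

def build_cas(clip_cls, clip_att, video_pred, att_threshold, cls_threshold=0.1):
    """clip_cls: (T, C) clip activations, clip_att: (T,) attentions,
    video_pred: (C,) video-level prediction. Returns the thresholded CAS."""
    cas = F.softmax(clip_cls, dim=1)            # step 1: softmax over classes
    cas[:, video_pred < cls_threshold] = 0      # step 2: drop video-unrelated classes
    cas[clip_att < att_threshold, :] = 0        # step 3: attention quantile threshold
    return cas                                   # fed to the selection method
```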

4 Experiments

4.1 Dataset

THUMOS14 [20] has 101 classes for recognition, 20 of which are used for action detection. THUMOS14 includes a training set, a validation set, and a testing set. The training set contains trimmed action clips, while the validation and testing sets contain untrimmed videos. In THUMOS14, a video contains on average 15 action instances covering 29% of its length [28]. We train our model on the training and validation sets and test on the testing set.

ActivityNet1.2 [2] has 100 classes for both detection and recognition. It is divided into a training set, a validation set, and a test set. In ActivityNet, a video contains on average 1.5 action instances covering 64% of its length [28]. We train our model on the training set and test on the validation set.

4.2 Implementation Details

We train our ASN on features extracted by the pretrained UntrimmedNet model, which was trained on the same dataset and subsets as UntrimmedNet. We train with Nesterov momentum [36] of 0.9 and weight decay of 0.0005. The batch size is set to 512 for the THUMOS14 validation set, 8192 for the THUMOS14 training set, and 512 for ActivityNet1.2. On THUMOS14 [20], we train 80 epochs jointly on the training and validation sets, with the learning rate set to 0.1 and decayed by a factor of 10 at the 40th and 60th epochs. On ActivityNet1.2 [2], we train 160 epochs on the training set, with the learning rate set to 0.1 and decayed at the 80th and 120th epochs.

4.3 Ablation Study

In this section, we explore our action detection criterion under different quantile levels and model settings. For simplicity and efficiency, we use the naive approach of UntrimmedNet [38] as the selection method in the ablation study on THUMOS14. This method simply selects runs of consecutive activated clips in the CAS. For a selected snippet spanning clips \(k\) to \(k+n\) with label v, the confidence score s is computed from the video-level activation \(c_{v}\) and the average clip activation as shown in Eq. 4, where we use \(\lambda = 0.2\) in our experiments.

$$\begin{aligned} s = \frac{1}{n+1} \sum _{i=k}^{k+n} c^i_v + \lambda c_{v} \end{aligned}$$
(4)
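A sketch of this naive selection step, combining the run extraction with the scoring of Eq. 4 (the array handling is ours):

```python
import numpy as np

def select_segments(cas_v, video_score_v, lam=0.2):
    """cas_v: (T,) CAS column for class v, video_score_v: video-level activation
    c_v. Returns (start, end, score) for each run of consecutive activated clips,
    scored as in Eq. 4."""
    segments, start = [], None
    active = np.append(cas_v > 0, False)        # sentinel closes the last run
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            end = i - 1                         # inclusive clip index
            score = cas_v[start:end + 1].mean() + lam * video_score_v
            segments.append((start, end, score))
            start = None
    return segments
```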

Evaluation of Action Detection Criterion. To demonstrate the effectiveness of our action detection criterion, we train our network 10 times and record the performance on the testing set under different quantiles, with levels from 10% to 90% in steps of 10%. As baselines, the spatial (RGB) and temporal (Flow) models are treated separately. To compare with earlier studies, we also apply our method to the pretrained weights provided by [38]; the quantiles of the pretrained models are recorded by running them on the THUMOS14 validation set. The CAS of the two-stream model for localization is generated in two steps: first, the clip-level classifications after softmax are averaged; second, the attention scores of each stream are normalized by their respective thresholds and averaged. Video-level recognition results are the average of the two streams.

Fig. 4. Localization mAP of the Flow and RGB models under different quantiles on THUMOS14. mAP is recorded at an IoU threshold of 0.5.

Table 1. Comparison of different settings on THUMOS14. We compare localization mAP under common IoU thresholds and recognition accuracy. UntrimmedNet uses a slightly different recognition strategy.

Results of the spatial and temporal models at an IoU threshold of 0.5 are shown in Fig. 4, and the performance of the pretrained models under different quantiles is shown in Fig. 5. For all models, performance peaks near the 50% quantile. During training, we find that the attention quantiles fluctuate but performance is generally stable. Notably, spatial performance is worse and less stable than temporal. We also compare our method with the original UntrimmedNet [38]; the performance of our best models under different settings is shown in Table 1. Our action detection criterion can achieve high performance with only the temporal stream.

Evaluation of Streams. We evaluate different combinations of streams, as shown in Table 1: spatial (RGB), temporal (Flow), two-stream, and fusion stream (concatenated RGB and Flow features). We also examine the attention quality of each stream.

Surprisingly, the temporal stream yields the best localization performance, while streams involving spatial features perform poorly. The poor behavior of spatial-related streams may be because trivial details in the spatial features cause overfitting. In addition, we analyze the attention of each stream: for the two-stream model, we fix the CAS and apply only the temporal or only the spatial attention to our criterion. We find that two-stream with temporal attention yields high performance similar to the temporal stream, and two-stream with spatial attention yields low performance similar to the fusion stream.

Our experiments show a difference in action sensitivity between the two streams. Combining temporal and spatial information usually yields higher performance in action recognition but lower performance in localization. We also find that the commonly used two-stream and fusion strategies are inefficient in the weakly-supervised localization task, performing worse than the single temporal stream.

Evaluation of ASE. We evaluate different ASE model settings. For the inherit strategy, we use our best Flow model as the initial weights. For the fusion, bottleneck, and bilinear bottleneck models, we compare training from scratch against the inherit strategy with a feature size of 64. We also compare the inherit strategy with feature sizes of 64 and 128 for the bottleneck and bilinear bottleneck models. The results are shown in Table 2.

Fig. 5. Localization mAP of the pretrained models under different quantiles on THUMOS14. mAP is recorded at an IoU threshold of 0.5.

Table 2. Comparison of different ASE model settings on THUMOS14.

Compared with training from scratch, the inherit strategy greatly improves recognition and localization except for the bottleneck model, where only localization improves slightly. This may indicate that the bottleneck model already restrains overfitting. For the bottleneck and bilinear bottleneck models, increasing the feature size from 64 to 128 slightly improves performance.

In recognition, the fusion model performs best because it can access the full information, which also confirms that our bottleneck structure does restrict information. For localization, the bottleneck and bilinear bottleneck models perform much better than the fusion model, and the bilinear bottleneck models perform slightly better than the bottleneck models, indicating that the bilinear layer does improve interaction. The high performance of the proposed ASE models shows their ability to extract action-sensitive features.

4.4 Experiments on AutoLoc

To evaluate the final Action Sensitive Network, we use AutoLoc [30] as the selection method and compare with state-of-the-art results. AutoLoc incorporates the Outer-Inner-Contrastive (OIC) loss, which evaluates action snippets accurately. To further tune performance, we increase the weight of the outer boundary in the OIC loss as follows:

$$\begin{aligned} L_{OIC} = \lambda A_o(\phi ) - A_i(\phi ) \end{aligned}$$
(5)
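A sketch of this re-weighted OIC objective on a single candidate segment (the segment and boundary handling and the array conventions are ours; \(\lambda = 2\) and inflation 0.35 match the THUMOS14 settings reported below):

```python
import numpy as np

def weighted_oic_loss(cas_v, start, end, inflation=0.35, lam=2.0):
    """cas_v: (T,) CAS column for class v; [start, end] is the inner (candidate)
    area; the outer area is the inflated boundary around it. Implements Eq. 5
    with the outer-area average up-weighted by lambda."""
    length = end - start + 1
    pad = max(int(round(inflation * length)), 1)
    lo, hi = max(start - pad, 0), min(end + pad, len(cas_v) - 1)
    inner = cas_v[start:end + 1]
    outer = np.concatenate([cas_v[lo:start], cas_v[end + 1:hi + 1]])
    a_i = inner.mean()
    a_o = outer.mean() if outer.size else 0.0
    return lam * a_o - a_i
```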
Table 3. Comparison with state-of-the-art methods on ActivityNet1.2 in terms of action localization mAP under different IoU thresholds. We only list weakly-supervised methods. All results in this table are based on UntrimmedNet features. The selection method used is given in brackets.
Table 4. Comparison with state-of-the-art methods on THUMOS14 in terms of action localization mAP under different IoU thresholds. All weakly-supervised results are based on UntrimmedNet features. The selection method used is given in brackets.

On THUMOS14, we set \(\lambda \) to 2 and increase the boundary inflation rate to 0.35. These settings help AutoLoc select the most distinguishable action snippets. We add more offset anchors to AutoLoc and use it only as a selection method over the CAS. The performance of our bilinear bottleneck model with feature size 128 and the inherit strategy is shown in Table 4. For ActivityNet1.2 [2], we set \(\lambda \) to 5, the boundary inflation to 0.7, and the quantile to 10%; results are shown in Table 3. Compared with other weakly-supervised TAL methods, our method has an advantage especially under higher IoU thresholds and reaches state-of-the-art level on both datasets.

5 Conclusion

We propose a general action detection criterion which generates high-quality CAS and can be applied to different modalities. Based on this thresholding method, we analyze the performance of different stream combinations. According to our experiments, the spatial and temporal streams contain different information and have different sensitivities to actions. To combine the two streams properly, we propose the Action Sensitive Network, in which two-stream features are treated asymmetrically to produce accurate representations without losing sensitivity to actions. The ASE model produces clip features and a CAS that can be used with different selection methods. Our network yields state-of-the-art performance with AutoLoc as the selection method. In future work, we plan to investigate higher-level relationships between streams and apply our method to more modalities.