Abstract
Temporal action proposal generation in untrimmed videos is very challenging, and comprehensive context exploration is critical for generating accurate candidate action instances. This paper proposes a Temporal-aware Attention Network (TAN) that localizes context-rich proposals by enhancing the temporal representations of boundaries and proposals. First, we point out that obtaining precise location information for action instances requires long-distance temporal context. To this end, we propose a Global-Aware Attention (GAA) module for boundary-level interaction. Specifically, we introduce two novel gating mechanisms into the top-down interaction structure to incorporate multi-level semantics into video features effectively. Second, we design an efficient task-specific Adaptive Temporal Interaction (ATI) module to learn proposal associations. TAN enhances proposal-level contextual representations over a wide range by utilizing multi-scale interaction modules. Extensive experiments on ActivityNet-1.3 and THUMOS-14 demonstrate the effectiveness of the proposed method: TAN achieves 73.43% AR@1000 on THUMOS-14 and 69.01% AUC on ActivityNet-1.3. Moreover, TAN significantly improves temporal action detection performance when combined with existing action classification frameworks.
Introduction
Understanding human behavior and intent in videos is crucial across several domains, including but not limited to human–computer interaction, robotics, video retrieval analysis, and intelligent security. As a result, video content analysis methods [15, 21, 33, 37, 46] have attracted increased interest from academics and industry. Temporal Action Proposal Generation (TAPG) is one of the most trending topics in video understanding. It is intended to detect temporal intervals likely to contain an action instance in untrimmed videos, telling each action's start and end frames.
Anchor-based works [4, 34, 35] generate proposals based on dense predefined anchor boxes that are either regularly distributed [36] or manually defined [12]. However, one main drawback of anchor-based methods is that fixed-size anchors can hardly cover all these ground-truth instances with different lengths, from a second to minutes. On the contrary, boundary-based methods [22, 23] locate action boundaries based on the snippet-level probabilities of boundaries. These methods first evaluate the start and end probabilities of each snippet consisting of sequential frames, and then combine high-scoring snippets to form candidate proposals. Previous boundary-based methods have been demonstrated to be more effective and provide insightful ways of TAPG. However, due to various action durations and the cluttered environment surrounding boundaries, boundary-based methods have two difficulties: (1) how to use more precise boundaries to represent action proposals and (2) how to exploit the semantic relations among those proposals effectively.
Regarding the first one, several works [11, 23, 25, 26] applied 1-dimensional (1D) temporal convolutions on snippets before pooling to encode the snippets relations, which are beneficial for increasing the recall of boundary detection. However, the above methods neglect the crucial fact that the duration of action instances can vary dramatically across categories and videos, and even in the same video, there may be multiple action instances with significantly different durations. This will result in a lack of global information that may not be conducive to long action instances. Secondly, different scales of snippet-level contexts are not equally informative, and some of them may affect distant boundaries or may not be helpful for certain action instances. For example, long-range information or global contexts with fewer local details may not be conducive to detecting short actions. Anchor-based methods [24, 27] used feature pyramids to encode multi-scale contexts to address the issue. However, boundary-based methods without anchors have yet to explore this issue entirely. Thus, selecting effective contexts based on the video content is necessary to avoid invalid boundary-level matches.
The second challenge is related to proposal relations, which provide more internal hints to enhance representations of action proposals. Existing methods [47, 55] usually consider only overlapping proposals that represent different stages of an action instance. Early works [36, 51, 58] mainly dealt with proposals individually. BSN++ [37] proposes a self-attention module to explore proposal-level contextual information. However, this may ignore the impact of negative proposals and can be computationally expensive. Indeed, distant proposals containing similar semantics are significant, as they may give indicative hints. G-TAD [52] exploits proposal-proposal relations using graph convolutional networks with fixed edge weights between nodes. Thus, these methods fail to effectively utilize the dynamic relationships between proposals in the temporal dimension.
To remedy these problems, we propose a novel network, TAN, to enhance boundary prediction performance and effectively leverage proposal-level features. First, to obtain action boundary probabilities with high precision and recall, TAN introduces a global-aware attention (GAA) module. In addition to applying Cross Attention in multiple layers to select effective contexts based on the video content in the temporal dimension, it also enhances the global snippet-level contexts for boundary classification with Fusion Attention. Second, TAN introduces an adaptive temporal interaction (ATI) module, which constructs proposal-level contexts in the temporal dimension to address the limitation that proposal-level features are otherwise treated in isolation. It integrates our temporal context interaction (TCI) block to assign dynamic convolution weights to proposal sets with the same start point. Specifically, considering the flexible duration of action instances, it uses temporal-scale modeling convolution (TMConv) with varied dilation rates to enhance the modeling capability for distant proposals.
In a nutshell, our contributions are as follows:
- To model long-range relationships between video units, we present a novel global-aware attention module with a well-designed cross-scale gating mechanism and multi-input fusion attention to aggregate multi-level snippet-level context representations.
- We introduce an adaptive temporal interaction module with multi-scale dynamic temporal convolution, which can accurately capture the relationship between multi-scale proposals by assigning different weights to temporal contexts.
- Based on the above two modules, we propose a temporal-aware attention network. It aims to enhance boundary-level predictions and proposal-level representations for generating context-rich proposals.
- We validate TAN on two challenging benchmarks: THUMOS-14 and ActivityNet-1.3. Experimental results show that TAN achieves competitive performance and delivers more accurate proposals.
The remainder of this paper is organized as follows. We review the related works in Sect. "Related works" and then introduce the details of TAN in Sect. "Methodology". In Sect. "Experiments and results", we conduct experiments to evaluate our TAN. Finally, we conclude the work in Sect. "Conclusion".
Related works
Temporal action proposal generation (TAPG)
There are two main categories of temporal action proposal generation pipelines. One line of research, referred to as anchor-based methods [6, 11, 27, 29, 58], generates action proposals based on dense sliding windows or predefined anchors. For example, PBRNet [27] uses feature pyramids to refine predefined anchors progressively. TALNet [35] uses dilated convolution to exploit global contexts among frames and obtain a larger receptive field. Although multi-scale anchors [35] and pyramid architectures [11, 27] are used to increase the diversity of anchors, proposals generated by these methods are still not flexible enough to cover actions of various durations. Another line of research is known as boundary-based methods [25, 32, 35]. They first predict the start and end probabilities for each snippet, then match frames with high start and end probabilities. For example, BMN [25] and BSN++ [37] apply the boundary-matching mechanism to generate candidate proposals. MGG [29] and TCANet [32] combine the advantages of anchor-based and boundary-based approaches, generating proposals with more flexible durations and precise boundaries. Other methods, including AFSD [24] and TRA [59], take an anchor-free approach to detect actions efficiently. In our work, we propose GAA, which fully uses snippet-level semantics to predict boundaries with high precision and recall. We also propose ATI, which enhances the proposal-level representation by mining correlations in the temporal dimension among proposal sets to achieve a more accurate proposal evaluation.
Action recognition
Action recognition models can be used to extract frame-level or snippet-level visual features in untrimmed videos, which have been utilized by most TAPG methods. Before the rise of deep learning, early algorithms in the field of action recognition, such as iDT, basically used hand-extracted features, including Histograms of Oriented Optical Flow (HOF), Histogram of Oriented Gradient (HOG), Motion Boundary Histograms (MBH). In recent years, convolution neural networks have been introduced to learn deep features of videos, such as 3D CNN [41], which is proposed to directly capture the spatial–temporal features between frames from the original video sequence. However, 3D CNNs have tremendous parameter amounts and computational costs. To handle the huge computation of 3D CNNs and provide intuitive motion information of actions, two-stream networks [9, 48] decode RGB images and optical flows and combine them to describe the temporal relationships, thus boosting the accuracy and flexibility of action recognition. In this work, we use a pre-trained TSN model to encode video clips for better comparison with state-of-the-art methods.
Attention mechanism for long-range contextual dependencies
The attention mechanism was first proposed in natural language processing. It is broadly leveraged in different research areas, such as video understanding [1, 3] and object detection [34, 54]. When it comes to video contextual information modeling, the self-attention mechanism focuses on important parts of video scenes and can capture long-term dependencies more effectively than RNNs. For example, Non-Local [49] embedded attention structure into the action recognition to analyze videos. Action Transformer [14] utilizes transformer to aggregate features from the spatiotemporal context for recognizing human actions. Following [43], many transformer-based models [8, 11] are proposed and show great potential to tackle TAPG tasks. RTD-Net [40] uses a transformer decoder to model relations between snippets. RapNet [11] proposes a frame-relation aware module to exploit long-range dependencies, distilling and adaptively recalibrating frame-level features. However, these methods cannot take advantage of global context that contains higher-level semantics. In contrast, we design novel attention modules to exploit multi-scale information and fuse effective snippet-level contexts based on video content.
Temporal modeling
Temporal modeling is an important cue for understanding video. The key distinction between video understanding and image processing is modeling along the temporal dimension, as in 3D convolution [41] and (2 + 1)D convolution [33, 42], both of which build on 2D spatial convolution and add a temporal convolution. However, researchers found that the critical information in a video can be explored more comprehensively when the convolutional weights in the temporal dimension are no longer strictly shared. For example, many recent works on dynamic convolution [20, 22, 30, 53] proposed convolution kernel weights that adapt to the content to achieve diverse modeling of video content. The weights of this type of convolution are mainly derived from spatial context or global information. Moreover, the temporal-adaptive convolution proposed in TadaConv [17] gives spatial convolution the ability of temporal modeling directly on top of 2D convolution, obtaining adaptive convolutional weights for each frame along the temporal dimension. However, modeling along the temporal dimension for TAPG has not been well explored, especially in complicated, noisy scenarios.
Methodology
We denote an untrimmed video sequence as \(V={\left\{{v}_{l}\right\}}_{l=1}^{L}\) with \(L\) frames, where \({v}_{l}\) represents the \(l\)-th frame in the video. The annotation of a video with \(N\) action instances is \(\Psi ={\left\{{\psi }_{n}|\left({t}_{s,n},{t}_{e,n}\right)\right\}}_{n=1}^{N}\), where \({\psi }_{n}\) represents the \(n\)-th action instance, and \({t}_{s,n}\) and \({t}_{e,n}\) are the start and end frames of that instance, respectively. The purpose of the TAPG task is to generate a set of proposals \(\Phi ={\left\{{\phi }_{m}=\left({t}_{s,m},{t}_{e,m},{p}_{m}\right)\right\}}_{m=1}^{M}\) that may contain action instances in video \(V\), where \({p}_{m}\) indicates the confidence of the \(m\)-th proposal and \(M\) is the total number of proposals.
As illustrated in Fig. 1, TAN takes snippet-level features (Sect. "Video feature encoding") as input and generates reliable action proposals. Specifically, GAA (Sect. "Global-aware attention module with snippet-level context") exploits the global information around boundaries to adaptively predict the start and end probabilities of each temporal location. ATI (Sect. "Adaptive temporal interaction module with proposal-level context") utilizes temporal adaptive convolution to adjust the receptive field and explore temporal-aware relationships between proposals. Finally, with the predicted boundary probabilities and the proposals' completeness confidence, we apply a post-processing algorithm to select high-quality proposals.
Video feature encoding
We encode the raw video sequence into a set of feature sequences by a two-stream network [10, 30]. It consists of two parts: the spatial network extracts appearance information from a single RGB frame, and the temporal network extracts motion features from stacked optical flow field. According to the previous method [13, 38, 39], given an untrimmed video \(V\) that contains \(L\) frames, we process video with regular frame intervals \(\delta \) to \(T=\lceil L/\delta \rceil\) video snippets to reduce the computational cost. The feature vector of the whole video is represented as \(F=\left\{{F}_{rgb},{F}_{flow}\right\}\in {R}^{C\times T}\) containing \(C\)-dimension, which is used as the input of the following modules.
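The snippet sampling described above can be sketched as follows. This is only an illustration of the relation \(T=\lceil L/\delta \rceil\); the function names `snippet_count` and `snippet_start_frames` are hypothetical, and the choice of the first frame of each interval as the snippet anchor is an assumption rather than something the text specifies.

```python
def snippet_count(num_frames: int, delta: int) -> int:
    """Number of snippets T = ceil(L / delta) for L frames and interval delta."""
    if num_frames <= 0 or delta <= 0:
        raise ValueError("num_frames and delta must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-num_frames // delta)


def snippet_start_frames(num_frames: int, delta: int) -> list:
    """Start frame index of every snippet (assumed: one snippet per interval)."""
    return list(range(0, num_frames, delta))
```

For instance, a 1000-frame video sampled with \(\delta =16\) gives 63 snippets, the last of which covers fewer than 16 frames.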
Global-aware attention module with snippet-level context
The GAA module takes video features \({F}_{g}\) as input. \({F}_{g}\) is from the initial features \(F\) processed by the base module that includes two temporal convolutions with kernel size of 3, stride of 1. The GAA captures global temporal contextual information, which aims to rule out erroneous boundary predictions to obtain more accurate probability sequences. Considering action instances with different scales that require corresponding receptive fields, GAA designs a top-down structure composed of Cross Attention and Fusion Attention to model multi-scale feature interaction, as shown in Fig. 2a.
The encoder pathway in GAA uses temporal convolutions with stride of 2 for down-sample, while the decoder pathway uses temporal deconvolution layers with a factor of 2 for up-sample. To leverage the complementarity of the encoder and decoder, GAA fuses the encoder feature with more location information and the decoder feature with more semantic information through the well-designed attention module layer-by-layer.
Cross attention
For each action instance in videos, the features of different scale contexts captured by the decoder are not equivalent. Through empirical studies, we find that directly adding all contexts with different scales together may lead to semantic inconsistency and even blur the important local details for boundary prediction. To be compatible with local details and highlight the informative context, we propose Cross Attention in each skip connection, as shown in Fig. 2a.
To be specific, different from the traditional gating module [33], Cross Attention first applies temporal global average pooling to the combined features of different levels. Then, the global vector is passed through a shared multi-layer perceptron (MLP) and a sigmoid layer to compute a cross attention vector that serves as a feature gate on the low-level features. Consequently, the low-level features are calibrated with both important context information and local details. Finally, the weighted low-level information is added to the high-level features.
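The gating steps just described can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that the two levels are combined by element-wise addition, that both feature maps already share the same \(C\times T\) shape, and that the MLP has one hidden ReLU layer. The function name `cross_attention_gate` and the weight arguments `w1` and `w2` are hypothetical.

```python
import math


def cross_attention_gate(low, high, w1, w2):
    """Gate low-level features with a channel vector, then add them to high-level ones.

    low, high: C x T nested lists (assumed to have identical shapes).
    w1: H x C weights of the first MLP layer; w2: C x H weights of the second.
    Returns (fused C x T features, gate vector of length C).
    """
    channels, steps = len(low), len(low[0])
    # 1) Combine the two levels (assumed: element-wise sum) and average over time.
    pooled = [
        sum(low[c][t] + high[c][t] for t in range(steps)) / steps
        for c in range(channels)
    ]
    # 2) Shared MLP: linear -> ReLU -> linear (biases omitted).
    hidden = [max(0.0, sum(w * x for w, x in zip(row, pooled))) for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    # 3) Sigmoid turns the logits into a per-channel gate in (0, 1).
    gate = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # 4) Weighted low-level features are added to the high-level features.
    fused = [
        [gate[c] * low[c][t] + high[c][t] for t in range(steps)]
        for c in range(channels)
    ]
    return fused, gate
```

Because the gate is computed from pooled statistics of both levels, a channel of the low-level feature is suppressed or passed through as a whole, which is what lets the skip connection keep local detail only where the context supports it.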
Fusion attention
Let \({\left\{{F}_{i}\right\}}_{i=1}^{S}\) be the generated feature maps with \(S\) temporal scales. We introduce Fusion Attention to strengthen the semantic relation between features of different levels by capturing long-range dependencies, as shown in Fig. 2b. Different scales of contexts are not equally informative. Fusion Attention computes multi-head attention between the \(i\)-th layer and the \((i+1)\)-th layer. First, the high-level feature \({F}_{i+1,t}\) is projected by \({\lambda }_{q}(\cdot )\). The low-level feature \({{F}_{i}}{\prime}\), which is obtained from \({F}_{i}\in {\mathbb{R}}^{C\times T}\) by bilinear up-sampling, is projected by \({\theta }_{k}(\cdot )\) and \({\gamma }_{v}(\cdot )\). \({\lambda }_{q}(\cdot )\) is used to extract temporal information to form representative vectors, as are \({\theta }_{k}(\cdot )\) and \({\gamma }_{v}(\cdot )\). As shown in Fig. 2b, \({{F}_{i+1,t}}{\prime}\) is obtained from \({F}_{i+1,t}\) through a sequence and extraction block that provides channel attention and improves the feature quality. The elements (i.e., action snippets) surrounding the central element \({F}_{i,t}\) at time \(t\in [1,T]\) in \({{F}_{i}}{\prime}\) are selected to form a representation \({{F}_{i,t}}{\prime}{\in {\mathbb{R}}}^{C\times K}\). The formulas are as follows:
where \({W}_{\phi }\), \({W}_{\theta }\), \({W}_{\gamma }{\in {\mathbb{R}}}^{{C}^{*}\times C}\) are learnable weighting parameters; we omit the bias term for simplicity. The output attention explores the relationship between the discriminative information in \({\lambda }_{q}\) and \({\theta }_{k}\), and aggregates it with another linear embedding \({\gamma }_{v}\): \({G}_{s}=\mathrm{softmax}({\lambda }_{q}\cdot {\theta }_{k}^{T}/\sqrt{d})\cdot {\gamma }_{v}\), where \(d=C/M\) is the dimension of \({\lambda }_{q}\) and \({\theta }_{k}\). This step calculates the correlation between the central element and the surrounding elements across the time domain. By fusing basic information and complex information, we obtain relatively accurate and robust features as the output. The output of the attention operation for the \(t\)-th timestep is shown below:
where \({W}_{o}{\in {\mathbb{R}}}^{{C}^{*}\times C}\). The output of the combination of the \(i\)-th and \((i+1)\)-th layers is formed by concatenating the representations of all timesteps in the video sequence:
And then after a \(1\times 1\) convolution, two probability sequences \({P}_{start}={\left\{{p}_{tn}^{s}\right\}}_{n=1}^{T}\) and \({P}_{end}={\left\{{p}_{tn}^{e}\right\}}_{n=1}^{T}\) are generated:
where \(\sigma \) denotes ReLU activation and \(\varepsilon \) denotes batch normalization.
The Fusion Attention fully considers long-distance dependencies, which means it fuses the context-rich features and location-rich features together to eliminate redundant information and capture the dependencies between them.
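The attention step \({G}_{s}=\mathrm{softmax}({\lambda }_{q}\cdot {\theta }_{k}^{T}/\sqrt{d})\cdot {\gamma }_{v}\) for one timestep can be sketched as below. The sketch covers a single head and takes the already projected query, keys and values as inputs; the projections, the multi-head split and the output mapping \({W}_{o}\) are left out. The names `softmax` and `window_attention` are hypothetical helpers, not part of the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def window_attention(query, keys, values):
    """Scaled dot-product attention of one central element over K neighbours.

    query: length-d vector (projected high-level feature at time t).
    keys, values: K vectors of length d (projected low-level window around t).
    Returns (attended vector of length d, attention weights of length K).
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys
    ]
    weights = softmax(scores)
    attended = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return attended, weights
```

Running this for every \(t\in [1,T]\) and concatenating the results along time yields one fused feature map per pair of adjacent levels.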
Adaptive temporal interaction module with proposal-level context
The goal of ATI is to generate confidence scores for all candidate proposals. Following the previous method BMN [25], we introduce the proposal sampling module to generate the proposal features \({F}_{P}\in {\mathbb{R}}^{{C}{\prime}\times D\times T}\) from the temporal feature \({F}_{g}\) and then use \({F}_{P}\) to obtain classification and regression confidence maps \({M}_{cls}\), \({M}_{reg}\in {\mathbb{R}}^{D\times T}\), where \(D\) represents the pre-defined maximum proposal duration. The proposal sampling module selects \(N\) sample points for each proposal with a shared sampling matrix to represent the corresponding features \({F}_{P}\) of the \(D\times T\) proposals. A point \((i,j)\) in the maps represents the confidence score of the proposal \({\phi }_{i,j}\) with duration \(j\) that starts at the \(i\)-th temporal location.
For action instances with different durations in a video, if the receptive field is too small, the information of longer actions may be truncated, while if the receptive field is too large, short actions will be mixed with redundant noise. The ATI module therefore aims to capture multi-scale temporal context interaction. It provides the temporal context interaction (TCI) block, which includes temporal-scale modeling convolution (TMConv) to produce adaptive weights between proposals and their adjacent units. The aggregated features are used to predict the classification and regression confidence maps \({M}_{cls}\) and \({M}_{reg}\), as shown in Fig. 1.
Temporal-scale modeling convolution
In this part, for \(D\times T\) proposals, in addition to the primary information fusion between adjacent proposals, TMConv focuses on the connectivity between adjacent proposals to help promote the reasoning of their relationship. It can generate calibrated weights in the temporal dimension. The illustration of TMConv is shown in Fig. 3(a). In contrast to standard convolution, for the input feature \({F}_{P}\), adaptive dynamic convolution leverages separate kernel generation function \(f(\cdot )\) to generate the temporal filter to obtain the new high-dimensional features at each dense-proposal point. The result \({F}_{H}\in {\mathbb{R}}^{{C}{\prime}\times D\times T}\) can be written as:
where * indicates the convolution operation. To fully incorporate global temporal information for interaction between the different duration proposal sets, TMConv adopts the TAKG module, which has a larger temporal field, as shown in Fig. 3(b). TAKG first applies global average pooling \({GAP}_{d}\) at \(t\)-th point on the description vectors along duration-dimension: \({V}_{t}\) = \({GAP}_{d}\left({F}_{P,t}\right)\in {\mathbb{R}}^{{C}{\prime}\times T}\). At the same time, TAKG operates the vector \({V}_{t}\) via the combination of \({GAP}_{dt}\) and \({GMP}_{dt}\) on temporal dimensions to effectively remove redundant information and highlight important proposal temporal information. Then TAKG generates the dynamic convolution kernels by stacked 1D convolutions and reducing the dimension by ratio \(\tau \):
where \(FC(\cdot )\) is a linear mapping function and \(FN(\cdot )\) is Filter Normalization [60] for stable training. \(\sigma \) and \(\varepsilon \) denote the ReLU and Batch Normalization.
Temporal context interaction block
Based on TMConv, we design the TCI block, which employs dilated convolutions with different dilation rates, as shown in Fig. 4. The TCI block increases the receptive field of the kernel and alleviates the 'grid problem', which effectively balances the conflicting requirements of long and short actions. The TCI block includes a \(1\times 1\) convolution, a global adaptive pooling operation, and multiple stacked TMConvs with different dilation rates:
When \(n=5\), the \(k\)-th element of the proposal-level feature map is calculated as follows:
where \({\Theta }_{bi}(\cdot )\) denotes bilinear up-sampling, \({F}_{P}=\left[{F}_{P,1},{F}_{P,2}, \dots ,{F}_{P,{C}{\prime}}\right]\) and \({f}_{5}=[{g}_{1}, {g}_{2}, \dots , {g}_{{C}{\prime}}]\). Finally, all the branches are averaged to obtain the feature maps: \({F}_{p}{\prime}={\sum }_{n=1}^{5}{f}_{n}/5\). Next, \({F}_{p}{\prime}\) is fed into a series of 2D convolution layers and the sigmoid activation function to predict the score maps \({M}_{cls}\) for completeness classification and \({M}_{reg}\) for completeness regression.
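To make the effect of the dilation rate and of the branch averaging \({F}_{p}{\prime}={\sum }_{n}{f}_{n}/n\) concrete, here is a deliberately simplified sketch on a 1D sequence. It uses a fixed, shared kernel, so it is not TMConv (whose kernels are generated dynamically) and it omits the \(1\times 1\) convolution and pooling branches; `dilated_conv1d` and `average_branches` are hypothetical names, and zero padding with an odd kernel length is an assumption.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded 'same' 1D dilated convolution (cross-correlation form).

    Assumes an odd kernel length so that the output is centred on each step.
    """
    half = len(kernel) // 2
    steps = len(signal)
    output = []
    for t in range(steps):
        acc = 0.0
        for i, weight in enumerate(kernel):
            idx = t + (i - half) * dilation  # taps spread out by the dilation rate
            if 0 <= idx < steps:
                acc += weight * signal[idx]
        output.append(acc)
    return output


def average_branches(signal, kernel, dilations):
    """Run one branch per dilation rate and average the branch outputs."""
    branches = [dilated_conv1d(signal, kernel, d) for d in dilations]
    return [sum(vals) / len(branches) for vals in zip(*branches)]
```

With a 3-tap kernel, dilation 1 sees 3 consecutive positions while dilation 2 spans 5, so averaging branches mixes short-range and long-range context at every position.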
Training
Ground truth description
In order to predict the boundary probability sequence, TAN needs to generate the corresponding label sequence \({G}_{s}={\{{g}_{t}^{s}\}}_{t=1}^{T}\), \({G}_{e}={\{{g}_{t}^{e}\}}_{t=1}^{T}\) in GAA module. For each action instance \(\varphi =\left[{t}_{s},{t}_{e}\right]\) in the action instance set Ψ \(={\left\{{\psi }_{n}|\left({t}_{s,n},{t}_{e,n}\right)\right\}}_{n=1}^{N}\) with the label, we represent its start and end regions as \({r}_{g}^{s}=\left[{t}_{s}-{d}_{\varphi }/\rho ,{t}_{s}+{d}_{\varphi }/\rho \right]\) and \({r}_{g}^{e}=\left[{t}_{e}-{d}_{\varphi }/\rho ,{t}_{e}+{d}_{\varphi }/\rho \right]\), where \(\rho \) is the preset constant, and \({d}_{\varphi }={t}_{e}-{t}_{s}\). For each timestamp, the corresponding label \({g}_{t}^{s}\) or \({g}_{t}^{e}\) will be set to 1 if it is in the start or end region of any ground truth.
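The boundary label rule above can be sketched as follows. The function name `boundary_labels` is hypothetical, timestamps are taken to be integer snippet indices \(0,\dots ,T-1\), and `rho` is passed in explicitly because the text only says it is a preset constant.

```python
def boundary_labels(instances, num_steps, rho):
    """Binary start/end label sequences of length T.

    instances: list of (t_s, t_e) ground-truth intervals in snippet units.
    A timestamp is labelled 1 if it lies in [t - d/rho, t + d/rho] around the
    start (or end) of any instance, with d = t_e - t_s.
    """
    g_start = [0] * num_steps
    g_end = [0] * num_steps
    for t_s, t_e in instances:
        radius = (t_e - t_s) / rho
        for t in range(num_steps):
            if t_s - radius <= t <= t_s + radius:
                g_start[t] = 1
            if t_e - radius <= t <= t_e + radius:
                g_end[t] = 1
    return g_start, g_end
```

Since the region width scales with the instance duration, long actions get wider positive regions than short ones.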
As for ATI, for any one of the proposals \({\varphi }_{u,v}\), the start point is \(u\), and the end point is \(u+v\). We calculate the temporal Intersection over Union (tIoU) with all \(\varphi \) in Ψ and determine the maximum value \({g}_{u,v}^{c}\), then get the tIoU label map \({G}_{M}={\left\{{\left\{{g}_{u,v}^{c}\right\}}_{v=1}^{T}\right\}}_{u=1}^{D}\).
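A minimal sketch of the tIoU label map follows. The storage layout `label_map[v - 1][u]` (duration first, start second), the zero label for proposals that run past the end of the sequence, and the names `tiou` and `tiou_label_map` are all assumptions made for illustration.

```python
def tiou(a_start, a_end, b_start, b_end):
    """Temporal Intersection over Union of two intervals."""
    inter = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - inter
    return inter / union if union > 0 else 0.0


def tiou_label_map(instances, num_steps, max_duration):
    """Maximum tIoU with any ground truth for every (start u, duration v) proposal.

    Returns a max_duration x num_steps nested list indexed as [v - 1][u].
    Proposals ending beyond num_steps are left at 0.0 (assumed invalid).
    """
    label_map = [[0.0] * num_steps for _ in range(max_duration)]
    for v in range(1, max_duration + 1):
        for u in range(num_steps):
            if u + v > num_steps:
                continue
            label_map[v - 1][u] = max(
                (tiou(u, u + v, t_s, t_e) for t_s, t_e in instances),
                default=0.0,
            )
    return label_map
```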
Loss function
We use the binary logistic regression loss function \({\mathcal{L}}_{b}\) for procedural supervision of the prediction boundaries in the GAA module.
where \({\alpha }^{+}=T/\sum \left({g}_{t}\right)\) and \({\alpha }^{-}=T/\sum (1-{g}_{t})\) are balance factors, and \({p}_{t}\) and \({g}_{t}\) denote the prediction and the ground truth for the \(t\)-th snippet in the boundary probability sequence \(P={\left\{{p}_{t}\right\}}_{t=1}^{T}\) and the label sequence \(G={\left\{{g}_{t}\right\}}_{t=1}^{T}\), respectively. In addition, for the confidence maps \({M}_{cls}\) and \({M}_{reg}\) generated by the ATI module, with the ground truth label map \({G}_{M}\), we use the SI-loss and the L2 loss, denoted as \({\mathcal{L}}_{CLS}\) and \({\mathcal{L}}_{REG}\), as the classification loss and regression loss:
We train TAN with a multi-task loss. The overall loss function contains the GAA loss, the ATI loss, and a regularization term, where \({\lambda }_{1}\) and \({\lambda }_{2}\) are set to 1 and 1e-4 to balance the contributions of the different losses:
The training process of TAN model is summarized in Algorithm 1.
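The boundary loss \({\mathcal{L}}_{b}\) with the balance factors \({\alpha }^{+}\) and \({\alpha }^{-}\) defined above can be sketched as a class-weighted binary cross-entropy. The exact form below, including the \(1/T\) normalization, follows the weighted binary logistic regression loss commonly used by boundary-based methods and is an assumption, as are the `eps` clamp and the guard against empty classes; the function name `weighted_boundary_loss` is hypothetical.

```python
import math


def weighted_boundary_loss(pred, target, eps=1e-6):
    """Class-balanced binary logistic regression loss over T snippets.

    pred: predicted probabilities p_t in (0, 1); target: binary labels g_t.
    alpha_pos = T / sum(g_t) and alpha_neg = T / sum(1 - g_t) re-weight the
    rare positive (boundary) snippets against the many negatives.
    """
    steps = len(target)
    num_pos = sum(target)
    num_neg = steps - num_pos
    # Guard against division by zero when one class is absent (assumption).
    alpha_pos = steps / max(num_pos, 1)
    alpha_neg = steps / max(num_neg, 1)
    total = 0.0
    for p_t, g_t in zip(pred, target):
        total += alpha_pos * g_t * math.log(p_t + eps)
        total += alpha_neg * (1 - g_t) * math.log(1.0 - p_t + eps)
    return -total / steps
```

With one positive among four snippets, the positive term is weighted by 4 and each negative term by 4/3, so both classes contribute equally to the loss in total.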
Inference and post process
In the inference phase, the proposal set \(\Phi ={\left\{{\phi }_{m}=\left({t}_{s,m},{t}_{e,m},{p}_{m}\right)\right\}}_{m=1}^{M}\) is generated, where \({p}_{m}\) is the final score of \({\phi }_{m}\), which combines the boundary probability scores (\({P}_{s}\) and \({P}_{e}\)) and the confidence maps (\({M}_{cls}\) and \({M}_{reg}\)). The final score \({p}_{m}\) is computed as:
Finally, Soft-NMS is used to remove redundant proposals so as to retrieve high-quality proposals more efficiently.
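For readers unfamiliar with Soft-NMS, the following is a minimal sketch of its Gaussian variant applied to temporal proposals. Instead of discarding proposals that overlap a higher-scoring one, it decays their scores by \(\exp (-{\text{tIoU}}^{2}/\sigma )\). The parameter values (`sigma`, `score_threshold`, `top_k`) and the function names are illustrative assumptions; the paper does not state which Soft-NMS variant or settings it uses.

```python
import math


def tiou(a_start, a_end, b_start, b_end):
    """Temporal Intersection over Union of two intervals."""
    inter = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(proposals, sigma=0.5, score_threshold=0.001, top_k=100):
    """Gaussian Soft-NMS over (start, end, score) tuples.

    Repeatedly keeps the highest-scoring proposal and decays the scores of the
    remaining ones according to their overlap with it.
    """
    remaining = list(proposals)
    kept = []
    while remaining and len(kept) < top_k:
        best_idx = max(range(len(remaining)), key=lambda i: remaining[i][2])
        best = remaining.pop(best_idx)
        if best[2] < score_threshold:
            break
        kept.append(best)
        remaining = [
            (s, e, score * math.exp(-(tiou(best[0], best[1], s, e) ** 2) / sigma))
            for s, e, score in remaining
        ]
    return kept
```

A duplicate of an already kept proposal is therefore pushed down the ranking rather than removed, which preserves recall at large proposal counts.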
Experiments and results
Datasets and experimental settings
Datasets
ActivityNet-1.3 [5]. It is a large-scale dataset for action detection and TAPG tasks. This dataset contains 19,993 untrimmed videos, a total of 200 kinds of actions are labeled, and each video contains an average of 1.5 action instances. These videos are divided into the training set, test set, and validation set according to the ratio of 2:1:1. We evaluate TAN on the validation set at the end.
THUMOS-14 [18]. This dataset consists of 1010 validation videos and 1574 test videos, all collected from YouTube. Only 413 of them are temporally annotated untrimmed videos, covering 20 sports action categories, and these are the ones used for TAPG. The validation set contains 200 of these videos and the test set contains 213, each with an average of 15 action instances. We use the validation set to train the network and evaluate our model on the test set.
Evaluation metrics
Generating high-performance proposals means covering ground truth with high recall and temporal overlap. In the TAPG task, the prediction is judged to be correct when the overlap between the proposal and the ground truth is above threshold. According to the confidence ranking, we select the Average Recall (AR) of Average Number (AN) of proposals, denoted as AR@AN. The ActivityNet-1.3 and THUMOS-14 use threshold ranges of [0.5:0.05:0.95] and [0.5:0.05:1.0] to compute AR@AN, respectively. We use the mean Average Precision (mAP) under different tIoU thresholds as the primary evaluation metric to further examine the action localization performance for generating proposals. On ActivityNet-1.3, the value of tIoU is set to {0.5, 0.75, 0.95}, while the threshold set {0.3, 0.4, 0.5, 0.6, 0.7} is used on THUMOS-14. In addition, the area under the AR and AN curve (AUC) is also used as an evaluation metric on ActivityNet-1.3, where AN varies from 0 to 100.
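The AR@AN metric can be sketched as below. This is a simplified illustration, not the official evaluation code: it keeps a fixed number of top-ranked proposals per video instead of scaling to an average number across the dataset, and the names `tiou` and `average_recall_at_an` are hypothetical.

```python
def tiou(a_start, a_end, b_start, b_end):
    """Temporal Intersection over Union of two intervals."""
    inter = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - inter
    return inter / union if union > 0 else 0.0


def average_recall_at_an(proposals_per_video, gts_per_video, an, thresholds):
    """Recall of the top-`an` proposals per video, averaged over tIoU thresholds.

    proposals_per_video: per video, (start, end) pairs sorted by confidence.
    gts_per_video: per video, ground-truth (start, end) pairs.
    """
    total_gt = sum(len(gts) for gts in gts_per_video)
    if total_gt == 0 or not thresholds:
        return 0.0
    recalls = []
    for thr in thresholds:
        hits = 0
        for props, gts in zip(proposals_per_video, gts_per_video):
            top = props[:an]
            for g_s, g_e in gts:
                # A ground truth is recalled if any kept proposal overlaps enough.
                if any(tiou(p_s, p_e, g_s, g_e) >= thr for p_s, p_e in top):
                    hits += 1
        recalls.append(hits / total_gt)
    return sum(recalls) / len(recalls)
```

Sweeping `an` and integrating the resulting AR values gives the AUC used on ActivityNet-1.3.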
Implementation details
Following previous works, we adopt a pre-trained TSN model [46] to extract video features, which uses ResNet [16] and BN-Inception [19] as the spatial and temporal networks, respectively. For ActivityNet-1.3, the sampling interval is \(\delta =16\), each feature sequence is rescaled to a constant length \(T=100\) by linear interpolation, and the maximum duration \(D\) is also set to 100. For THUMOS-14, \(\delta \) is set to 5, the length of the sliding window is \(T=128\), and \(D=64\). For both datasets, Adam is used for optimization, the batch size is set to 16, and training runs for 8 epochs in total: the learning rate is 1e-3 for the first 7 epochs and is decayed to 1e-4 afterwards.
Our model was trained and validated on a platform with an NVIDIA GeForce RTX 3080 GPU and an Intel Xeon W-2295 3.00 GHz CPU. All code is based on PyTorch 1.10.0 and Python 3.8.
Finally, in the subsequent model comparison, the experimental results of TSI were reproduced under local environment, using the same training parameters as TAN on each dataset. The results of other models are cited, e.g., BMN [25], RTD-Net [40], etc.
Temporal action proposal generation
This part compares our method with other state-of-the-art (SOTA) methods on ActivityNet-1.3 and THUMOS-14. As shown in Table 1, TAN outperforms other methods; in particular, when AN equals 100, TAN achieves an AR of 77.08%. The results show that TAN generates proposals with high recall while maintaining diversity. Our approach is different in spirit from others: (1) We observe that it is more effective to exploit features at different scales for global interactions, and thus the GAA module incorporates gating mechanisms and a top-down structure to help remove mispredictions. (2) The ATI module focuses on discriminative information in proposal-level features along the temporal dimension. Thus, the ATI module gains contextual associations with more efficient attention for wide-range proposals. The better performance of TAN indicates that we have built a stronger boundary-based pipeline, which suggests that TAN can bring larger improvements to subsequent action detection tasks.
Additionally, Table 2 reports the AR@AN results of our proposed TAN and other SOTA techniques on the test set of THUMOS-14, where C3D and two-stream features are adopted for fairness, with AN ranging from 50 to 1000 per video. Experimental results show that TAN outperforms other methods with both C3D features and two-stream features. This shows that our proposed network can find more action instances in a video with the same number of proposals, mainly because the GAA module can remove useless points, thereby suppressing low-quality proposals. At the same time, we also observed that the AR of TAN is lower than that of DCAN [7] when AN is 50. We checked the TAN proposals and found that in some videos where action instances were too sparse, TAN generated fewer than 50 proposals. A possible reason is that the multi-scale TCI module encodes multiple action instances of diverse categories well but does not exhibit its merits when there are only a few action instances. The AR-AN curves of different methods on THUMOS-14 are shown in Fig. 5a. To further demonstrate the high overlap between the proposals produced by our approach and the ground truth actions, we calculate recall for multiple tIoU thresholds with 100 proposals per video. As shown in Fig. 5b, TAN achieves significantly higher recall than other SOTA methods when the threshold is between 0.5 and 0.8.
In addition, the improvements on THUMOS-14 are more significant than those on ActivityNet-1.3. This is because each video in THUMOS-14 contains 15 actions on average, which further validates that our model is better than other models at regularizing action boundaries in videos containing multiple actions.
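For readers less familiar with the AR@AN metric used in this comparison, the following is a minimal Python sketch for a single video. It is not the official evaluation code: the function names (tiou, recall_at, average_recall) are hypothetical, proposals are assumed to be already sorted by descending score, and the tIoU thresholds are passed in by the caller rather than fixed to any benchmark protocol.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) of a segment, e.g. in seconds


def tiou(a: Segment, b: Segment) -> float:
    """Temporal intersection-over-union of two segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at(proposals: Sequence[Segment], gts: Sequence[Segment],
              an: int, threshold: float) -> float:
    """Fraction of ground-truth segments covered by any of the top-`an` proposals."""
    top = proposals[:an]  # assumption: proposals are sorted by descending score
    hits = sum(1 for g in gts if any(tiou(p, g) >= threshold for p in top))
    return hits / len(gts) if gts else 0.0


def average_recall(proposals: Sequence[Segment], gts: Sequence[Segment],
                   an: int, thresholds: Sequence[float]) -> float:
    """Recall at a fixed proposal budget, averaged over the tIoU thresholds."""
    return sum(recall_at(proposals, gts, an, t) for t in thresholds) / len(thresholds)


# Toy example with two ground-truth actions and three ranked proposals.
example_gts = [(0.0, 10.0), (20.0, 30.0)]
example_proposals = [(1.0, 10.0), (50.0, 60.0), (20.0, 26.0)]
ar_at_1 = average_recall(example_proposals, example_gts, 1, [0.5, 0.75])
ar_at_3 = average_recall(example_proposals, example_gts, 3, [0.5, 0.75])
```

In this toy case a larger proposal budget raises AR because the third proposal covers the second action at the looser threshold, which mirrors why recall is reported as a function of AN.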
Table 3 reports the efficiency of TAN and several closely related alternatives in terms of the number of parameters (M), floating-point operations (G), and inference time on a 3-min video with a single NVIDIA GeForce RTX 3080. The table shows that our method achieves satisfactory performance with far fewer FLOPs and an acceptable increase in the number of parameters compared to the others.
To further demonstrate that TAN converges well, we plot the training loss curves of TAN and the baseline model [28]. As depicted in Fig. 6, TAN shows good convergence. Moreover, owing to the combination of several attention modules and the optimization of proposal-level temporal modeling, the loss converges more stably during training. In addition, the average training time per epoch stays at about 380 s, so training remains reasonably fast.
Temporal action detection
To further verify the quality of the proposals generated by TAN, we apply them to the temporal action detection task and compare the results with previous methods. We combine the generated proposals with the action classifier UntrimmedNet [45] and the implementation proposed by Zhao et al. [50]. The resulting labeled proposals are evaluated with the mAP metric described above.
The evaluation results on the test set of ActivityNet-1.3 are shown in Table 4. TAN achieves the best detection performance when tIoU ranges from 0.5 to 0.95, which verifies the high quality of the proposals generated by TAN. In particular, when tIoU is 0.95, the mAP we obtain is 10.26%, indicating that TAN proposals are more precise. The experimental results on THUMOS-14 in Table 5 again show the superior performance of TAN compared to other SOTA methods. In particular, when tIoU is 0.6, mAP improves from 38.7% to 42.1%. We therefore conclude that TAN provides more efficient and reliable temporal action proposals for the action detection task.
Ablation study
In this section, we further investigate the performance of each component and suitable settings to better understand TAN. All experiments were conducted on THUMOS-14 and ActivityNet-1.3. In addition, all video feature sequences are extracted with TSN as the backbone network.
Effectiveness of different components in TAN
TAN contains two main modules: the GAA module and the ATI module. To confirm the effectiveness of TAN, the impact of removing each component is evaluated in Table 6. Each component contributes to the final performance. When only GAA is used, the AUC on ActivityNet-1.3 and THUMOS-14 reaches 68.26% and 63.25%, respectively, owing to more accurate boundary localization. We also observe that AUC improves once ATI is added. ATI brings a considerable further gain because it fully models the relationships between proposals while considering the different scales and temporal relationships of segments.
When GAA and ATI work together, TAN reaches 69.01% and 64.15% in AUC on the two datasets, which illustrates the importance and effectiveness of global boundary prediction and proposal-level context modeling.
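As a reminder of how an AUC score can be derived from an AR-AN curve, the sketch below integrates AR over AN with the trapezoidal rule. The function name auc_of_ar_an is hypothetical, and normalising by the AN range is an assumption made for illustration; official benchmark toolkits may sample AN differently or normalise in another way, so this will not necessarily reproduce the reported numbers.

```python
from typing import Sequence


def auc_of_ar_an(an_values: Sequence[float], ar_values: Sequence[float]) -> float:
    """Trapezoidal area under the AR-vs-AN curve, divided by the AN range.

    Assumption: `an_values` is strictly increasing and has the same length
    as `ar_values`; the normalisation by the AN range is illustrative only.
    """
    if len(an_values) != len(ar_values) or len(an_values) < 2:
        raise ValueError("need at least two matching (AN, AR) points")
    area = 0.0
    for i in range(1, len(an_values)):
        width = an_values[i] - an_values[i - 1]
        area += 0.5 * (ar_values[i] + ar_values[i - 1]) * width
    return area / (an_values[-1] - an_values[0])


# A flat curve at AR = 0.5 yields an AUC of 0.5 under this normalisation.
flat_auc = auc_of_ar_an([1, 50, 100], [0.5, 0.5, 0.5])
```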
Analysis of feature layer numbers for GAA module
We construct the top-down structure in GAA on top of the original feature maps. We evaluate the importance of video feature scales for boundary prediction by varying the number of feature layers, as shown in Fig. 7. When the GAA module uses only a single feature layer, the performance is much lower than with multi-scale features, because a single feature layer cannot capture the contextual interactions between actions of different lengths. In addition, the prediction becomes more accurate as the number of multi-scale feature layers increases. However, excessive down-sampling and repeated restoration of context information introduce inaccurate information into our global boundary interaction structure. Thus, our experimental setup employs four layers of encoders and decoders.
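To make the idea of a multi-level top-down structure concrete, the following is a simplified Python sketch on a single-channel temporal sequence. It is not the GAA module: the helper names (downsample, upsample, top_down_fuse) are hypothetical, average pooling, nearest-neighbour upsampling and additive fusion are assumptions chosen for brevity, and the gating and attention blocks are omitted entirely.

```python
from typing import List, Tuple


def downsample(x: List[float]) -> List[float]:
    """Average-pool a 1D sequence with window 2 and stride 2."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]


def upsample(x: List[float], length: int) -> List[float]:
    """Nearest-neighbour upsampling of a coarse sequence to `length` steps."""
    return [x[min(i // 2, len(x) - 1)] for i in range(length)]


def top_down_fuse(features: List[float],
                  num_levels: int = 4) -> Tuple[List[List[float]], List[float]]:
    """Build a temporal pyramid, then merge it from coarse to fine.

    `num_levels=4` only loosely mirrors the four-layer setting; the fusion
    here is plain addition, a stand-in for the learned interaction.
    """
    pyramid = [list(features)]
    for _ in range(num_levels - 1):
        pyramid.append(downsample(pyramid[-1]))  # bottom-up: coarser each level
    fused = pyramid[-1]
    for level in reversed(pyramid[:-1]):         # top-down: coarse to fine
        up = upsample(fused, len(level))
        fused = [a + b for a, b in zip(level, up)]
    return pyramid, fused


pyramid, fused = top_down_fuse([1.0] * 16)
level_lengths = [len(level) for level in pyramid]  # temporal length per level
```

Each additional level halves the temporal resolution, which illustrates why too many levels leave very few positions at the coarsest scale and make the restored context less reliable.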
Effectiveness of encoder component in GAA module
There are two kinds of attention blocks in our GAA module: Cross Attention and Fusion Attention. Cross Attention, applied in each layer, simultaneously highlights low-level features containing local details and informative high-level features. Fusion Attention enhances temporal global feature extraction by fusing multiple levels of features. To validate both, we test different combinations of Cross Attention and Fusion Attention.
As shown in Table 7, when we use both, TAN gains 4.15% in AR@50. When Cross Attention and Fusion Attention are added separately, we obtain gains of 2.08% and 2.96% in AR@500, respectively. We conclude that global contextual information aggregation provides essential guidance for the effective fusion of boundary information. We also visualize the queries of Fusion Attention in the decoder layers. As shown in Fig. 8, the query exhibits a clear attention area compared to the initial input features.
Analysis of temporal interaction between proposals
To evaluate the effectiveness of proposal-level contexts from the TCI block, we carry out several ablation experiments. First, we replace the entire TMConv with traditional 2D convolutions, which is denoted by "w/o TMConv". Second, we introduce TMConv without channel attention (CA) to aggregate proposal-level representations, which is denoted by "w/o CA". Third, we introduce TMConv without temporal attention (TA), denoted by "w/o TA", to verify the necessity of temporal contexts. As shown in Fig. 9a, both temporal attention and channel attention improve performance. These results indicate that (1) assigning different weights to each proposal along the temporal dimension is effective for boosting performance, and (2) temporal interaction is most helpful for weight calibration in the time dimension.
In the ATI module, we stack multiple successive TCI blocks to exploit proposal-level contexts. Here, we explore how the number of TCI blocks (i.e., K) influences the performance. The results are shown in Fig. 9b. \(K=0\) indicates that no TCI block is used. As seen from the figure, the TCI block substantially improves performance, which confirms the necessity of proposal-level contextual information. In particular, the largest improvement is achieved when \(K = 3\), which is the TCI block's default setting.
Effectiveness of the pooling method in TMConv
In addition, we explore the choice of pooling method in our TMConv. (1) GAP: the mean operation is applied to all sampled features. (2) GMP: the max operation is applied to all sampled features. (3) GAP & GMP: the features output by the mean and max operations are concatenated, and a fully connected layer is applied to reduce the channel dimension. As shown in Table 8, during convolution weight generation, the global weight obtained by combining GAP and GMP not only attends to general information but also emphasizes the vital information, which is more conducive to dynamic convolution and information modeling. AR improves significantly when AN ranges from 50 to 500 on THUMOS-14.
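The three pooling options can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (gap, gmp, gap_gmp) are hypothetical, the sampled features are represented as a list of per-sample channel vectors, and the fully connected layer uses hand-picked weights, whereas in a real model they would be learned.

```python
from typing import List

Matrix = List[List[float]]


def gap(samples: Matrix) -> List[float]:
    """Global average pooling: mean over all sampled features, per channel."""
    n = len(samples)
    return [sum(channel) / n for channel in zip(*samples)]


def gmp(samples: Matrix) -> List[float]:
    """Global max pooling: max over all sampled features, per channel."""
    return [max(channel) for channel in zip(*samples)]


def gap_gmp(samples: Matrix, weight: Matrix, bias: List[float]) -> List[float]:
    """Concatenate GAP and GMP (2C values), then map back to C channels.

    `weight` has shape C x 2C and `bias` has length C; both are hypothetical
    stand-ins for a learned fully connected layer.
    """
    pooled = gap(samples) + gmp(samples)  # list concatenation, length 2C
    return [sum(w * v for w, v in zip(row, pooled)) + b
            for row, b in zip(weight, bias)]


# Two sampled features with C = 2 channels each.
samples = [[1.0, 4.0], [3.0, 0.0]]
weight = [[0.5, 0.0, 0.5, 0.0],   # channel 0: average of its GAP and GMP values
          [0.0, 0.5, 0.0, 0.5]]   # channel 1: average of its GAP and GMP values
combined = gap_gmp(samples, weight, [0.0, 0.0])
```

With these toy weights the combined descriptor sits between the mean and the max of each channel, which is one simple way to see how the mixture keeps general information while still reacting to the strongest response.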
Effectiveness of different scale sizes in TCI block
Table 9 shows the experimental results for different dilation rates of the dynamic convolution layers in the TCI block. To ensure that sufficient feature information is captured while expanding the receptive field, we set up three groups of dilation rates: (3, 5, 7), (3, 6, 8), and (3, 6, 12). In addition, we also test (1, 1, 1) to verify the necessity of the multi-scale block for proposal-level context. According to the results, multi-scale contexts effectively boost performance, and the recall of the generated proposals is higher when the dilation rates are larger. For example, (3, 6, 8) performs better than (3, 5, 7), which indicates that convolutions with larger dilation rates better relieve the conflict between longer and shorter actions. However, when the dilation rate is too large, the convolution cannot capture the useful adjacent features of the current proposal and instead extracts features from overly distant positions, which introduces redundancy and does not help feature extraction. Thus, we choose (3, 6, 8) as our default setting.
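The effect of the dilation rate on how far a single convolution reaches can be checked with the standard span formula (k - 1) * d + 1 for kernel size k and dilation d. The sketch below applies it to the four settings discussed here. The kernel size of 3 is an assumption, since it is not stated in this section, and the helper names are hypothetical.

```python
from typing import List, Tuple


def dilated_span(kernel_size: int, dilation: int) -> int:
    """Number of positions spanned by one dilated 1D convolution kernel."""
    return (kernel_size - 1) * dilation + 1


def branch_spans(rates: Tuple[int, ...], kernel_size: int = 3) -> List[int]:
    """Span of each branch, assuming one branch per dilation rate."""
    return [dilated_span(kernel_size, d) for d in rates]


settings = [(1, 1, 1), (3, 5, 7), (3, 6, 8), (3, 6, 12)]
for rates in settings:
    print(rates, "->", branch_spans(rates))
# (1, 1, 1) -> [3, 3, 3]
# (3, 5, 7) -> [7, 11, 15]
# (3, 6, 8) -> [7, 13, 17]
# (3, 6, 12) -> [7, 13, 25]
```

Under this assumption, the largest branch of (3, 6, 12) spans 25 positions compared with 17 for (3, 6, 8), which gives a rough sense of how quickly the sampled positions move away from the current proposal.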
Qualitative results
To intuitively understand the behavior of the GAA and ATI modules, we visualize the start and end probability sequences predicted by TAN and BMN in Fig. 10. As the predictions in Fig. 10 show, BMN recognizes some background positions (yellow boxes) as boundaries, which demonstrates that it is difficult to evaluate temporal boundaries using only the local context. The probability maps of TAN are more distinguishable than those of BMN, and the probabilities at background positions are significantly lower than those at boundaries. This indicates that global context aggregation on the boundary level can improve the model’s ability to suppress false boundary positions.
We also provide some visualization examples of TAN in Fig. 11, which presents five randomly selected videos from THUMOS-14 and ActivityNet-1.3. In each video, the \(k\) highest-scoring proposals are visualized, where \(k\) is the number of ground-truth instances. For example, in the first and the last videos, background and action instances have similar scenes, yet our proposals still align well with the ground-truth actions. The third video has multiple ground-truth action instances, and our top-3 proposals cover them accurately, suggesting the high quality of our generated proposals. These correct predictions show that TAN generates reliable proposals by modeling context interactions between proposal-level and boundary-wise feature maps.
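The selection rule used for these visualizations (keep the \(k\) highest-scoring proposals, with \(k\) equal to the number of ground-truth instances) can be written as a short Python sketch. The function name top_k_proposals and the (start, end, score) tuple layout are hypothetical choices for illustration.

```python
from typing import List, Tuple

Proposal = Tuple[float, float, float]  # (start, end, confidence score)


def top_k_proposals(proposals: List[Proposal], k: int) -> List[Proposal]:
    """Return the k proposals with the highest confidence scores."""
    return sorted(proposals, key=lambda p: p[2], reverse=True)[:k]


# Toy video with two ground-truth instances, so k = 2.
ground_truth = [(10.0, 20.0), (30.0, 40.0)]
candidates = [(0.0, 5.0, 0.2), (10.0, 20.0, 0.9), (30.0, 40.0, 0.7)]
selected = top_k_proposals(candidates, k=len(ground_truth))
```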
Conclusion
In this paper, we present a novel TAPG method named temporal-aware attention network (TAN), which aims to generate high-quality temporal action proposals. The key idea of TAN is to exploit temporal-aware action contexts in videos by temporally modeling boundary-level information and proposal-level features separately. First, we argue that long-distance temporal context helps obtain precise location information of action instances. Thus, the GAA module employs novel Cross Attention and Fusion Attention to learn rich temporal contextual information for boundary prediction. Second, we consider it necessary to exploit the temporal correlation among a wide range of proposals. Therefore, the ATI module utilizes a novel TMConv with varying dilation rates to enhance proposal-level feature dependencies and enrich the context. Extensive experiments on ActivityNet-1.3 and THUMOS-14 demonstrate the effectiveness of our framework. We believe that our research will facilitate practical applications of the TAPG task. At the same time, it is worth noting that TAN requires extensive manual frame-level annotations during training, which are laborious to obtain and hinder its widespread application. Another limitation is that TAN requires a fixed feature encoder (e.g., TSN, I3D) for feature extraction. An interesting direction for future work is studying TAPG with weak supervision. It would also be intriguing to jointly learn from non-visual modalities to improve action detection performance, e.g., by synchronizing image and audio.
Data availability
Data available on request from the authors.
References
Arnab, A., et al., ViViT: A Video Vision Transformer, in IEEE/CVF International Conference on Computer Vision. 2021. p. 6836–6846.
Bai, Y., et al., Boundary content graph neural network for temporal action proposal generation, in European Conference on Computer Vision. 2020, Springer. p. 121–137.
Bertasius, G., H. Wang, and L. Torresani, Is Space-Time Attention All You Need for Video Understanding?, in International Conference on Machine Learning. 2021, PMLR. p. 813–824.
Bochkovskiy, A., C.-Y. Wang, and H.-Y.M. Liao, Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
Caba Heilbron, F., et al., Activitynet: A large-scale video benchmark for human activity understanding, in IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2015. p. 961–970.
Chao, Y.-W., et al., Rethinking the Faster R-CNN architecture for temporal action localization, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
Chen, G., et al., DCAN: Improving temporal action detection via dual context aggregation, in AAAI Conference on Artificial Intelligence. 2022. p. 248–257.
Chen, P., et al., Relation attention for temporal action localization. IEEE Trans Multimedia, 2019. 22(10): p. 2723–2733.
Feichtenhofer, C., A. Pinz, and A. Zisserman, Convolutional two-stream network fusion for video action recognition, in IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2016. p. 1933–1941.
Gao, J., K. Chen, and R. Nevatia, Ctap: Complementary temporal action proposal generation, in European conference on computer vision. 2018. p. 68–83.
Gao, J., et al., Accurate temporal action proposal generation with relation-aware pyramid network, in AAAI Conference on Artificial Intelligence. 2020. p. 10810–10817.
Gao, J., et al., Turn tap: Temporal unit regression network for temporal action proposals, in IEEE/CVF International Conference on Computer Vision. 2017. p. 3628–3636.
Gao, J., Z. Yang, and R. Nevatia, Cascaded boundary regression for temporal action detection. arXiv preprint arXiv:1705.01180, 2017.
Girdhar, R., et al., Video action transformer network, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
Han, T., et al., TVENet: Temporal variance embedding network for fine-grained action representation. Pattern Recogn, 2020. 103: p. 107267.
He, K., et al., Deep residual learning for image recognition, in IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2016. p. 770–778.
Huang, Z., et al., TAda! Temporally-Adaptive Convolutions for Video Understanding. arXiv preprint arXiv:2110.06178, 2021.
Idrees, H., et al., The THUMOS challenge on action recognition for videos “in the wild”. Comput Vis Image Underst, 2017. 155: p. 1–23.
Ioffe, S. and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in International Conference on Machine Learning. 2015, PMLR. p. 448–456.
Jia, X., et al., Dynamic filter networks. Advances in neural information processing systems, 2016. 29.
Li, P., J. Cao, and X. Ye, Prototype contrastive learning for point-supervised temporal action detection. Expert Syst Appl, 2023. 213: p. 118965.
Li, Y., et al., Revisiting dynamic convolution via matrix decomposition. arXiv preprint arXiv:2103.08756, 2021.
Lin, C., et al., Fast learning of temporal action proposal via dense boundary generator, in AAAI Conference on Artificial Intelligence. 2020. p. 11499–11506.
Lin, C., et al., Learning Salient Boundary Feature for Anchor-free Temporal Action Localization, in IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. p. 3319–3328.
Lin, T., et al., BMN: Boundary-Matching Network for Temporal Action Proposal Generation, in IEEE/CVF International Conference on Computer Vision. 2019. p. 3888–3897.
Lin, T., et al., BSN: Boundary Sensitive Network for Temporal Action Proposal Generation, in European Conference on Computer Vision. 2018, Springer. p. 3–21.
Liu, Q. and Z. Wang, Progressive boundary refinement network for temporal action detection, in Proceedings of the AAAI Conference on Artificial Intelligence. 2020.
Liu, S., et al., TSI: Temporal Scale Invariant Network for Action Proposal Generation, in Asian Conference on Computer Vision. 2020, Springer. p. 530–546.
Liu, Y., et al., Multi-Granularity Generator for Temporal Action Proposal, in IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. p. 3604–3613.
Liu, Z., et al., Tam: Temporal adaptive module for video recognition, in IEEE/CVF International Conference on Computer Vision. 2021. p. 13708–13718.
Long, F., et al., Gaussian temporal awareness networks for action localization, in IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. p. 344–353.
Qing, Z., et al., Temporal context aggregation network for temporal action proposal refinement, in IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. p. 485–494.
Qiu, Z., T. Yao, and T. Mei, Learning spatio-temporal representation with pseudo-3d residual networks, in IEEE/CVF International Conference on Computer Vision. 2017. p. 5533–5541.
Redmon, J. and A. Farhadi, Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
Ren, S., et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans Pattern Anal Mach Intell, 2016. 39(6): p. 1137–1149.
Shou, Z., D. Wang, and S.-F. Chang, Temporal action localization in untrimmed videos via multi-stage cnns, in IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2016. p. 1049–1058.
Su, H., et al., Bsn++: Complementary boundary regressor with scale-balanced relation modeling for temporal action proposal generation, in AAAI Conference on Artificial Intelligence. 2021. p. 2602–2610.
Su, H., X. Zhao, and T. Lin, Cascaded pyramid mining network for weakly supervised temporal action localization, in Asian Conference on Computer Vision. 2018, Springer. p. 558–574.
Su, H., et al., Transferable knowledge-based multi-granularity fusion network for weakly supervised temporal action detection. IEEE Trans Multimedia, 2020. 23: p. 1503–1515.
Tan, J., et al., Relaxed Transformer Decoders for Direct Action Proposal Generation, in IEEE/CVF International Conference on Computer Vision. 2021. p. 13506–13515.
Tran, D., et al., Learning spatiotemporal features with 3d convolutional networks, in IEEE/CVF International Conference on Computer Vision. 2015. p. 4489–4497.
Tran, D., et al., A Closer Look at Spatiotemporal Convolutions for Action Recognition, in IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018. p. 6450–6459.
Vaswani, A., et al., Attention is all you need. Advances in neural information processing systems, 2017. 30.
Vo, K., et al., ABN: agent-aware boundary networks for temporal action proposal generation. IEEE Access, 2021. 9: p. 126431–126445.
Wang, L., et al., Untrimmednets for weakly supervised action recognition and detection, in IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2017. p. 4325–4334.
Wang, L., et al., Temporal segment networks: Towards good practices for deep action recognition, in European Conference on Computer Vision. 2016. p. 20–36.
Wang, L., et al., MIFNet: Multiple instances focused temporal action proposal generation. Neurocomputing, 2023. 538: p. 126025.
Wang, X., et al., Skeleton-based action recognition via adaptive cross-form learning, in Proceedings of the 30th ACM International Conference on Multimedia. 2022.
Wang, X., et al., Non-local Neural Networks, in IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018. p. 7794–7803.
Xiong, Y., et al., Cuhk & ethz & siat submission to activitynet challenge 2016. arXiv preprint arXiv:1608.00797, 2016.
Xu, H., A. Das, and K. Saenko, R-C3D: Region Convolutional 3D Network for Temporal Activity Detection, in IEEE International Conference on Computer Vision (ICCV). 2017. p. 5794–5803.
Xu, M., et al., G-tad: Sub-graph localization for temporal action detection, in IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. p. 10156–10165.
Yang, B., et al., Condconv: Conditionally parameterized convolutions for efficient inference, in Advances in Neural Information Processing Systems. 2019.
Yang, Y., et al., Exploiting semantic-level affinities with a mask-guided network for temporal action proposal in videos. Applied Intelligence, 2022: p. 1–21.
Zeng, R., et al., Graph Convolutional Networks for Temporal Action Localization, in IEEE/CVF International Conference on Computer Vision. 2019.
Zhang, H., et al., MTSCANet: Multi temporal resolution temporal semantic context aggregation network. IET Computer Vision, 2023.
Zhao, P., et al., Bottom-up temporal action localization with mutual regularization, in European Conference on Computer Vision. 2020, Springer. p. 539–555.
Zhao, Y., et al., Temporal Action Detection with Structured Segment Networks. Int J Comput Vision, 2020. 128(1): p. 74–96.
Zhao, Y., et al., A temporal-aware relation and attention network for temporal action localization. IEEE Trans Image Process, 2022. 31: p. 4746–4760.
Zhou, J., et al., Decoupled dynamic filter networks, in IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. p. 6647–6656.
Acknowledgements
The authors thank the editors and reviewers for their work on this manuscript. This work is supported by Important Research Project of Hebei Province (Grant No. 22370301D), Scientific Research Foundation of Hebei University for Distinguished Young Scholars (Grant No. 521100221081), Scientific Research Foundation of Colleges and Universities in Hebei Province (Grant No. QN2022107). This work is supported by the High-Performance Computing Center of Hebei University.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Jiao, Y., Yang, W., Xing, W. et al. TAN: a temporal-aware attention network with context-rich representation for boosting proposal generation. Complex Intell. Syst. 10, 3691–3708 (2024). https://doi.org/10.1007/s40747-024-01343-0