Introduction

Understanding human behavior and intent in videos is crucial across many domains, including human–computer interaction, robotics, video retrieval, and intelligent security. As a result, video content analysis methods [15, 21, 33, 37, 46] have attracted increasing interest from both academia and industry. Temporal Action Proposal Generation (TAPG) is one of the most active topics in video understanding: it aims to detect temporal intervals likely to contain an action instance in untrimmed videos, i.e., to identify each action's start and end frames.

Anchor-based works [4, 34, 35] generate proposals from dense predefined anchor boxes that are either regularly distributed [36] or manually defined [12]. However, a main drawback of anchor-based methods is that fixed-size anchors can hardly cover ground-truth instances whose lengths vary from seconds to minutes. In contrast, boundary-based methods [22, 23] locate action boundaries based on snippet-level boundary probabilities. These methods first evaluate the start and end probabilities of each snippet, which consists of several consecutive frames, and then combine high-scoring snippets into candidate proposals. Boundary-based methods have been shown to be more effective and provide an insightful way to approach TAPG. However, due to varying action durations and the cluttered environment around boundaries, boundary-based methods face two difficulties: (1) how to represent action proposals with more precise boundaries and (2) how to effectively exploit the semantic relations among those proposals.

Regarding the first difficulty, several works [11, 23, 25, 26] apply 1-dimensional (1D) temporal convolutions on snippets before pooling to encode relations among snippets, which helps increase the recall of boundary detection. However, these methods neglect the crucial fact that the duration of action instances can vary dramatically across categories and videos; even within the same video, there may be multiple action instances with significantly different durations. The resulting lack of global information is detrimental to long action instances. In addition, different scales of snippet-level context are not equally informative: some may perturb distant boundaries or be unhelpful for certain action instances. For example, long-range information or global contexts with fewer local details may hinder the detection of short actions. Anchor-based methods [24, 27] use feature pyramids to encode multi-scale contexts to address this issue, but boundary-based methods without anchors have not yet fully explored it. Thus, it is necessary to select effective contexts according to the video content to avoid invalid boundary-level matches.

The second challenge concerns proposal relations, which provide internal hints for enhancing the representations of action proposals. Existing methods [47, 55] usually consider only overlapping proposals that represent different stages of an action instance, while early works [36, 51, 58] mainly deal with proposals individually. BSN++ [37] proposes a self-attention module to explore proposal-level contextual information; however, it may ignore the impact of negative proposals and can be computationally expensive. Indeed, distant proposals containing similar semantics matter, as they may provide indicative hints. G-TAD [52] exploits proposal-proposal relations using graph convolutional networks with fixed edge weights between nodes. Thus, these methods fail to effectively exploit the dynamic temporal relationships between proposals along the temporal dimension.

To remedy these problems, we propose a novel network, TAN, which improves boundary prediction and effectively leverages proposal-level features. First, to obtain action boundary probabilities with high precision and recall, TAN introduces a global-aware attention (GAA) module. In addition to applying Cross Attention in multiple layers to select effective contexts from the video content along the temporal dimension, it also enhances the global snippet-level contexts for boundary classification with Fusion Attention. Second, TAN introduces an adaptive temporal interaction (ATI) module to address the limitation that individual proposal-level features carry too little context, constructing proposal-level contexts along the temporal dimension. It integrates our temporal context interaction (TCI) block to assign dynamic convolution weights to proposal sets sharing the same start point. Specifically, considering the flexible duration of action instances, it uses temporal-scale modeling convolution (TMConv) with varied dilation rates to enhance the modeling of distant proposals.

In a nutshell, our contributions are as follows:

    • To model long-range relationships between video units, we present a novel global-aware attention module with a well-designed cross-scale gating mechanism and multi-input fusion attention to aggregate multi-level snippet-level context representations.

    • We introduce an adaptive temporal interaction module with multi-scale dynamic temporal convolution, which can accurately capture the relationship between multi-scale proposals by assigning different weights to temporal contexts.

    • Based on the above two modules, we propose a temporal-aware attention network. It aims to enhance boundary-level predictions and proposal-level representations for generating context-rich proposals.

    • We validate TAN on two challenging benchmarks: THUMOS-14 and ActivityNet-1.3. Experimental results show that TAN achieves considerable improvements and delivers more accurate proposals.

The remainder of this paper is organized as follows. We review the related works in Sect. "Related works" and then introduce the details of TAN in Sect. "Methodology". In Sect. "Experiments and results", we conduct experiments to evaluate our TAN. Finally, we conclude the work in Sect. "Conclusion".

Related works

Temporal action proposal generation (TAPG)

There are two main categories of temporal action proposal generation pipelines. One line of research, referred to as anchor-based methods [6, 11, 27, 29, 58], generates action proposals from dense sliding windows or predefined anchors. For example, PBRNet [27] uses feature pyramids to progressively refine predefined anchors. TALNet [35] uses dilated convolution to exploit global contexts among frames and obtain a larger receptive field. Although multi-scale anchors [35] and pyramid architectures [11, 27] increase the diversity of anchors, the proposals generated by these methods are still not flexible enough to cover actions of various durations. Another line of research is known as boundary-based methods [25, 32, 35]. They first predict the start and end probabilities of each snippet and then match frames with high start and end probabilities. For example, BMN [25] and BSN++ [37] apply the boundary-matching mechanism to generate candidate proposals. MGG [29] and TCANet [32] combine the advantages of anchor-based and boundary-based approaches, generating proposals with more flexible durations and more precise boundaries. Other methods, such as AFSD [24] and TRA [59], propose anchor-free schemes to detect actions efficiently. In our work, we propose GAA, which fully uses snippet-level semantics to obtain boundaries with high precision and recall, and ATI, which enhances proposal-level representations by mining temporal correlations among proposal sets for more accurate proposal evaluation.

Action recognition

Action recognition models are used by most TAPG methods to extract frame-level or snippet-level visual features from untrimmed videos. Before the rise of deep learning, early action recognition algorithms such as iDT relied on hand-crafted features, including Histograms of Oriented Optical Flow (HOF), Histogram of Oriented Gradients (HOG), and Motion Boundary Histograms (MBH). In recent years, convolutional neural networks have been introduced to learn deep video features; for example, 3D CNNs [41] directly capture the spatial–temporal features between frames from the raw video sequence. However, 3D CNNs have a tremendous number of parameters and high computational costs. To reduce this cost and provide intuitive motion information, two-stream networks [9, 48] decode RGB images and optical flow and combine them to describe temporal relationships, boosting the accuracy and flexibility of action recognition. In this work, we use a pre-trained TSN model to encode video clips for a fair comparison with state-of-the-art methods.

Attention mechanism for long-range contextual dependencies

The attention mechanism was first proposed in natural language processing and has been broadly leveraged in other research areas, such as video understanding [1, 3] and object detection [34, 54]. For video contextual modeling, the self-attention mechanism focuses on important parts of video scenes and captures long-term dependencies more effectively than RNNs. For example, Non-Local [49] embeds an attention structure into action recognition to analyze videos. Action Transformer [14] utilizes a transformer to aggregate features from the spatiotemporal context for recognizing human actions. Following [43], many transformer-based models [8, 11] have been proposed and show great potential for TAPG. RTD-Net [40] uses a transformer decoder to model relations between snippets. RapNet [11] proposes a frame-relation-aware module to exploit long-range dependencies, distilling and adaptively recalibrating frame-level features. However, these methods cannot take advantage of global context that contains higher-level semantics. In contrast, we design novel attention modules to exploit multi-scale information and fuse effective snippet-level contexts based on the video content.

Temporal modeling

Temporal modeling is an important cue for understanding video. The significant distinction between video understanding and image processing is modeling along the temporal dimension, as exemplified by 3D convolution [41] and (2 + 1)D convolution [33, 42], both of which extend 2D spatial convolution with temporal convolution. However, researchers have found that the critical information in a video can be explored more comprehensively when the convolutional weights along the temporal dimension are no longer strictly shared. For example, many recent works on dynamic convolution [20, 22, 30, 53] propose convolution kernel weights that adapt to the content to achieve diverse modeling of video content; the weights of such convolutions are mainly determined by spatial context or global information. Moreover, the temporal-adaptive convolution proposed in TadaConv [17] equips 2D spatial convolution with temporal modeling capability by generating adaptive convolutional weights for each frame along the temporal dimension. However, modeling along the temporal dimension has not been well explored for TAPG, especially in complicated noisy scenarios.

Methodology

We denote an untrimmed video as \(V={\left\{{v}_{l}\right\}}_{l=1}^{L}\) with \(L\) frames, where \({v}_{l}\) represents the \(l\)-th frame. The annotation of action instances is \(\Psi ={\left\{{\psi }_{n}|\left({t}_{s,n},{t}_{e,n}\right)\right\}}_{n=1}^{N}\) for a video containing \(N\) instances, where \({\psi }_{n}\) denotes the \(n\)-th action instance and \({t}_{s,n}\) and \({t}_{e,n}\) are its start and end frames, respectively. The goal of TAPG is to generate a set of proposals \(\Phi ={\left\{{\phi }_{m}=\left({t}_{s,m},{t}_{e,m},{p}_{m}\right)\right\}}_{m=1}^{M}\) that may contain action instances in video \(V\), where \({p}_{m}\) indicates the confidence of the \(m\)-th proposal and \(M\) is the total number of proposals.

As illustrated in Fig. 1, taking snippet-level features (in Sect. "Video feature encoding") as input, TAN generates reliable action proposals. Specifically, GAA (in Sect. "Global-aware attention module with snippet-level context") exploits global information around boundaries to adaptively predict the start and end probabilities at each temporal location. ATI (in Sect. "Adaptive temporal interaction module with proposal-level context") utilizes temporal-adaptive convolution to adjust the receptive field and explore temporal-aware relationships between proposals. Finally, with the predicted boundary probabilities and proposal completeness confidences, we apply a post-processing algorithm to select high-quality proposals.

Fig. 1
figure 1

The framework of our proposed TAN. First, TAN applies a two-stream feature extractor to encode video frames. Second, GAA takes the video feature sequence as input and outputs the boundary probability sequences. Then, ATI generates two confidence maps for all candidate proposals. Finally, TAN constructs proposals based on the boundary probability sequences and obtains the corresponding confidence scores from the confidence maps

Video feature encoding

We encode the raw video sequence into a feature sequence with a two-stream network [10, 30]. It consists of two parts: a spatial network that extracts appearance information from single RGB frames and a temporal network that extracts motion features from stacked optical flow fields. Following previous methods [13, 38, 39], given an untrimmed video \(V\) that contains \(L\) frames, we divide the video into \(T=\lceil L/\delta \rceil\) snippets at a regular frame interval \(\delta \) to reduce the computational cost. The feature of the whole video is represented as \(F=\left\{{F}_{rgb},{F}_{flow}\right\}\in {\mathbb{R}}^{C\times T}\) with \(C\) channels, which serves as the input of the following modules.
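To make the encoding step concrete, the following is a minimal sketch of how a \(C\times T\) snippet-level feature sequence could be assembled from per-frame two-stream features; the tensor layout, the averaging inside each snippet, and the function name are illustrative assumptions rather than the exact TSN pipeline.

```python
import math
import torch

def encode_video_features(rgb_feats: torch.Tensor,
                          flow_feats: torch.Tensor,
                          delta: int = 16) -> torch.Tensor:
    """Build a (C, T) snippet-level feature sequence from per-frame features.

    rgb_feats / flow_feats: (L, C_rgb) and (L, C_flow) per-frame features from a
    pre-trained two-stream backbone; both tensors and `delta` are illustrative.
    """
    L = rgb_feats.shape[0]
    T = math.ceil(L / delta)                      # number of snippets
    snippets = []
    for t in range(T):
        s, e = t * delta, min((t + 1) * delta, L)
        rgb = rgb_feats[s:e].mean(dim=0)          # average frames inside one snippet
        flow = flow_feats[s:e].mean(dim=0)
        snippets.append(torch.cat([rgb, flow], dim=0))
    return torch.stack(snippets, dim=1)           # (C, T) with C = C_rgb + C_flow
```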

Global-aware attention module with snippet-level context

The GAA module takes video features \({F}_{g}\) as input, where \({F}_{g}\) is obtained from the initial features \(F\) by a base module consisting of two temporal convolutions with a kernel size of 3 and a stride of 1. GAA captures global temporal contextual information, aiming to rule out erroneous boundary predictions and obtain more accurate probability sequences. Considering that action instances of different scales require corresponding receptive fields, GAA adopts a top-down structure composed of Cross Attention and Fusion Attention to model multi-scale feature interaction, as shown in Fig. 2a.

Fig. 2
figure 2

The detailed structure of the GAA module and Fusion Attention. The features from different stages are fused to obtain the boundary probability sequence with high recall via Cross Attention and Fusion Attention

The encoder pathway in GAA uses temporal convolutions with a stride of 2 for down-sampling, while the decoder pathway uses temporal deconvolution layers with a factor of 2 for up-sampling. To leverage the complementarity of the encoder and decoder, GAA fuses the encoder features, which carry more location information, with the decoder features, which carry more semantic information, through the attention modules layer by layer.
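A minimal skeleton of this encoder-decoder pathway is sketched below, assuming an input of shape (batch, C, T); the channel width, the number of layers, the kernel sizes, and the plain residual fusion that stands in for the Cross/Fusion Attention are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class GAABackbone(nn.Module):
    """Skeleton of the GAA encoder-decoder pathway (hyper-parameters are guesses)."""

    def __init__(self, channels: int = 256, num_layers: int = 4):
        super().__init__()
        # base module: two temporal convolutions, kernel 3, stride 1
        self.base = nn.Sequential(
            nn.Conv1d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv1d(channels, channels, 3, padding=1), nn.ReLU())
        # encoder: stride-2 temporal convolutions for down-sampling
        self.encoders = nn.ModuleList(
            [nn.Conv1d(channels, channels, 3, stride=2, padding=1) for _ in range(num_layers)])
        # decoder: temporal deconvolutions with an up-sampling factor of 2
        self.decoders = nn.ModuleList(
            [nn.ConvTranspose1d(channels, channels, 4, stride=2, padding=1) for _ in range(num_layers)])

    def forward(self, f_g: torch.Tensor) -> torch.Tensor:
        x = self.base(f_g)                              # (B, C, T)
        skips = []
        for enc in self.encoders:
            skips.append(x)                             # keep encoder features for fusion
            x = torch.relu(enc(x))
        for dec, skip in zip(self.decoders, reversed(skips)):
            x = torch.relu(dec(x))
            x = x[..., :skip.shape[-1]] + skip          # placeholder for Cross/Fusion Attention
        return x
```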

Cross attention

For each action instance in a video, the contexts at different scales captured by the decoder are not equally useful. Through empirical studies, we find that directly adding all contexts of different scales together may lead to semantic inconsistency and even blur the local details that matter for boundary prediction. To remain compatible with local details while highlighting informative context, we apply Cross Attention in each skip connection, as shown in Fig. 2a.

Specifically, different from the traditional gating module [33], Cross Attention first applies temporal global average pooling to the combined features of different levels. The resulting global vector is then passed through a shared multi-layer perceptron (MLP) and a sigmoid layer to compute a cross-attention vector that serves as a feature gate focusing on the low-level features. Consequently, the low-level features are calibrated with both important context information and local details. Finally, the weighted low-level information is added to the high-level features.
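A minimal sketch of this gating computation is given below; the MLP width, the reduction ratio, and the simple addition used to combine the two levels before pooling are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Sketch of the gating-style Cross Attention (layer sizes are assumptions)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # low / high: (B, C, T) encoder / decoder features at the same temporal scale
        g = (low + high).mean(dim=-1)                    # temporal global average pooling -> (B, C)
        gate = torch.sigmoid(self.mlp(g)).unsqueeze(-1)  # channel-wise gate, (B, C, 1)
        return high + gate * low                         # calibrated low-level features added to high level
```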

Fusion attention

Let \({\left\{{F}_{i}\right\}}_{i=1}^{S}\) be the generated feature maps at \(S\) temporal scales. We introduce Fusion Attention to strengthen the semantic relation between features of different levels by capturing long-range dependencies, as shown in Fig. 2b. Since contexts at different scales are not equally informative, Fusion Attention computes multi-head attention between the \(i\)-th layer and the \((i+1)\)-th layer. First, the high-level feature \({F}_{i+1,t}\) is projected by \({\lambda }_{q}(\cdot )\). The low-level feature \({F}_{i}^{\prime}\), obtained from \({F}_{i}\in {\mathbb{R}}^{C\times T}\) by bilinear up-sampling, is projected by \({\theta }_{k}(\cdot )\) and \({\gamma }_{v}(\cdot )\). All three projections extract temporal information to form representative vectors. As shown in Fig. 2b, \({F}_{i+1,t}^{\prime}\) is derived from \({F}_{i+1,t}\) through a squeeze-and-excitation block to obtain channel attention and improve feature quality. The elements (i.e., action snippets) surrounding the central element \({F}_{i,t}\) at time \(t\in [1,T]\) in \({F}_{i}^{\prime}\) are selected to form a representation \({F}_{i,t}^{\prime}\in {\mathbb{R}}^{C\times K}\). The formulas are as follows:

$${\lambda }_{q}\left({F}_{i+1,t}\right)={W}_{\lambda }{F}_{i+1,t}^{\prime}; \quad {\theta }_{k}\left({F}_{i,t}^{\prime}\right)={W}_{\theta }{F}_{i,t}^{\prime}; \quad {\gamma }_{v}\left({F}_{i,t}^{\prime}\right)={W}_{\gamma }{F}_{i,t}^{\prime},$$
(1)

where \({W}_{\lambda }\), \({W}_{\theta }\), \({W}_{\gamma }\in {\mathbb{R}}^{{C}^{*}\times C}\) are learnable weight matrices; bias terms are omitted for simplicity. The attention relates the discriminative projections \({\lambda }_{q}\) and \({\theta }_{k}\) and aggregates the result with the third linear embedding \({\gamma }_{v}\): \({G}_{s}=softmax({\lambda }_{q}\cdot {\theta }_{k}^{T}/\sqrt{d})\cdot {\gamma }_{v}\), where \(d = C/M\) is the dimension of \({\lambda }_{q}\) and \({\theta }_{k}\). This step computes the correlation between the central element and its surrounding elements across the time domain; by fusing basic and complex information, we obtain relatively accurate and robust features. The output of the attention operation for the \(t\)-th timestep is:

$$A({F}_{i+1,t})={W}_{o}\left({{G}_{s}}^{T}\right)+{F}_{i+1,t},$$
(2)

where \({W}_{o}\in {\mathbb{R}}^{{C}^{*}\times C}\). The output for the combination of the \(i\)-th and \((i+1)\)-th layers is formed by concatenating the representations of all timesteps in the video sequence:

$$FA\left({F}_{i,i+1}\right)=\left[A\left({F}_{i+\mathrm{1,1}}\right),A\left({F}_{i+\mathrm{1,2}}\right),\dots ,A\left({F}_{i+1,T}\right)\right].$$
(3)

Then, after a \(1\times 1\) convolution, two probability sequences \({P}_{start}={\left\{{p}_{{t}_{n}}^{s}\right\}}_{n=1}^{T}\) and \({P}_{end}={\left\{{p}_{{t}_{n}}^{e}\right\}}_{n=1}^{T}\) are generated:

$${P}_{start}=\sigma \left(\varepsilon \left({Conv}_{S}\left(FA\left({F}_{i,i+1}\right)\right)\right)\right),{P}_{end}=\sigma \left(\varepsilon \left({Conv}_{E}\left(FA\left({F}_{i,i+1}\right)\right)\right)\right),$$
(4)

where \(\sigma \) denotes ReLU activation and \(\varepsilon \) denotes batch normalization.

Fusion Attention fully considers long-distance dependencies: it fuses the context-rich and location-rich features to eliminate redundant information and capture the dependencies between them.
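As a rough sketch of this fusion, the module below lets the high-level feature query the up-sampled low-level feature; for brevity it attends over all timesteps instead of the \(K\)-neighbourhood around each snippet and omits the channel-attention enhancement of the query, and the channel and head counts are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionAttention(nn.Module):
    """Simplified sketch of Fusion Attention between adjacent levels."""

    def __init__(self, channels: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.proj_out = nn.Linear(channels, channels)

    def forward(self, f_high: torch.Tensor, f_low: torch.Tensor) -> torch.Tensor:
        # f_high: (B, C, T) from level i+1; f_low: (B, C, T_low) from level i (coarser)
        f_low = F.interpolate(f_low, size=f_high.shape[-1], mode='linear',
                              align_corners=False)       # up-sample to the query length
        q = f_high.transpose(1, 2)                        # (B, T, C) queries
        kv = f_low.transpose(1, 2)                        # (B, T, C) keys and values
        ctx, _ = self.attn(q, kv, kv)                     # softmax(q k^T / sqrt(d)) v
        out = self.proj_out(ctx) + q                      # residual connection as in Eq. (2)
        return out.transpose(1, 2)                        # back to (B, C, T)
```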

Adaptive temporal interaction module with proposal-level context

The goal of ATI is to generate confidence scores for all candidate proposals. Following BMN [25], we introduce a proposal sampling module to generate the proposal features \({F}_{P}\in {\mathbb{R}}^{{C}^{\prime}\times D\times T}\) from the temporal feature \({F}_{g}\), and then use \({F}_{P}\) to obtain the classification and regression confidence maps \({M}_{cls}, {M}_{reg}\in {\mathbb{R}}^{D\times T}\), where \(D\) denotes the pre-defined maximum proposal duration. The proposal sampling module selects \(N\) sample points for each proposal with a shared sampling matrix to represent the features \({F}_{P}\) of the \(D\times T\) proposals. A point \((i,j)\) in the maps represents the confidence score of the proposal \({\phi }_{i,j}\) that starts at the \(i\)-th temporal location and has duration \(j\).
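A minimal sketch of the sampling step is shown below; it replaces the shared sampling matrix and weighted interpolation of BMN with nearest-snippet indexing and mean aggregation, and the number of sample points is an assumption.

```python
import torch

def sample_proposal_features(f_g: torch.Tensor, D: int, num_samples: int = 32) -> torch.Tensor:
    """Build proposal features of shape (B, C, D, T) from temporal features (B, C, T).

    For each proposal that starts at snippet t and lasts d snippets, `num_samples`
    points are sampled inside [t, t + d] and averaged (a simplification of the
    shared sampling matrix used in BMN).
    """
    B, C, T = f_g.shape
    out = f_g.new_zeros(B, C, D, T)
    for d in range(1, D + 1):                     # proposal duration
        for t in range(T):                        # proposal start
            end = min(t + d, T - 1)
            idx = torch.linspace(t, end, num_samples).round().long()
            out[:, :, d - 1, t] = f_g[:, :, idx].mean(dim=-1)
    return out
```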

For action instances with different durations in a video, if the receptive field is too small, information about long actions may be lost, while if the receptive field is too large, short actions will be contaminated by redundant noise. The ATI module therefore aims to capture multi-scale temporal context interaction. It provides the temporal context interaction (TCI) block, which includes temporal-scale modeling convolution (TMConv) to assign adaptive weights between proposals and their adjacent units. The aggregated features are used to predict the classification and regression confidence maps \({M}_{cls}\) and \({M}_{reg}\), as shown in Fig. 1.

Temporal-scale modeling convolution

In this part, for the \(D\times T\) proposals, beyond the basic information fusion between adjacent proposals, TMConv focuses on the connectivity among adjacent proposals to support reasoning about their relationships, generating calibrated weights along the temporal dimension. TMConv is illustrated in Fig. 3a. In contrast to standard convolution, for the input feature \({F}_{P}\), this adaptive dynamic convolution leverages a separate kernel generation function \(f(\cdot )\) to produce a temporal filter and obtain new high-dimensional features at each dense proposal point. The result \({F}_{H}\in {\mathbb{R}}^{{C}^{\prime}\times D\times T}\) can be written as:

$${F}_{H}=f\left({F}_{P}\right)*{F}_{P},$$
(5)

where \(*\) indicates the convolution operation. To fully incorporate global temporal information for interaction between proposal sets of different durations, TMConv adopts the TAKG module, which has a larger temporal field, as shown in Fig. 3b. TAKG first applies global average pooling \({GAP}_{d}\) along the duration dimension at the \(t\)-th point to obtain description vectors: \({V}_{t}={GAP}_{d}\left({F}_{P,t}\right)\in {\mathbb{R}}^{{C}^{\prime}\times T}\). At the same time, TAKG processes the vector \({V}_{t}\) with a combination of \({GAP}_{dt}\) and \({GMP}_{dt}\) along the temporal dimension to remove redundant information and highlight important temporal cues of the proposals. Then TAKG generates the dynamic convolution kernels with stacked 1D convolutions, reducing the dimension by a ratio \(\tau \):

Fig. 3
figure 3

a Is an illustration of TMConv. b Explains the Temporal Attention Kernel Generation (TAKG) module equipped by the TMConv

$${V}_{t}^{\prime}={V}_{t}+{FC}^{{C}^{\prime}\to {C}^{\prime}}\left({GAP}_{dt}\left({V}_{t}\right)+{GMP}_{dt}\left({V}_{t}\right)\right),$$
(6)
$$f\left({F}_{P}\right)=FN\left({Conv1D}^{{C}^{\prime}/\tau \to {C}^{\prime}}\left(\sigma \left(\varepsilon \left({FC}^{{C}^{\prime}\to {C}^{\prime}/\tau }\left({V}_{t}^{\prime}\right)\right)\right)\right)\right),$$
(7)

where \(FC(\cdot )\) is a linear mapping function and \(FN(\cdot )\) is Filter Normalization [60] for stable training. \(\sigma \) and \(\varepsilon \) denote ReLU and batch normalization, respectively.
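The sketch below illustrates the kernel generation of Eqs. (6)-(7) and the application of the generated kernels along the temporal axis (Eq. 5); the kernel size, the reduction ratio \(\tau \), the use of \(1\times 1\) convolutions for the linear maps, and the per-kernel standardisation that stands in for Filter Normalization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TAKG(nn.Module):
    """Sketch of Temporal Attention Kernel Generation (Eqs. 6-7); sizes are assumptions."""

    def __init__(self, channels: int, kernel_size: int = 3, tau: int = 4):
        super().__init__()
        self.fc_ctx = nn.Linear(channels, channels)
        self.reduce = nn.Sequential(
            nn.Conv1d(channels, channels // tau, 1),
            nn.BatchNorm1d(channels // tau), nn.ReLU())
        self.expand = nn.Conv1d(channels // tau, channels * kernel_size, 1)
        self.kernel_size = kernel_size

    def forward(self, f_p: torch.Tensor) -> torch.Tensor:
        B, C, D, T = f_p.shape                            # proposal features
        v = f_p.mean(dim=2)                               # GAP over duration -> (B, C, T)
        ctx = v.mean(dim=-1) + v.amax(dim=-1)             # GAP_t + GMP_t -> (B, C)
        v = v + self.fc_ctx(ctx).unsqueeze(-1)            # Eq. (6)
        k = self.expand(self.reduce(v))                   # (B, C*K, T), Eq. (7)
        k = k.view(B, C, self.kernel_size, T)
        # standardise each generated kernel (stand-in for Filter Normalization)
        return (k - k.mean(dim=2, keepdim=True)) / (k.std(dim=2, keepdim=True) + 1e-5)

def tmconv_apply(f_p: torch.Tensor, kernels: torch.Tensor) -> torch.Tensor:
    """Apply the per-timestep kernels along the temporal axis (Eq. 5)."""
    B, C, D, T = f_p.shape
    K = kernels.shape[2]
    x = F.pad(f_p, (K // 2, K // 2))                      # pad the temporal dimension
    windows = x.unfold(dimension=3, size=K, step=1)       # (B, C, D, T, K)
    w = kernels.permute(0, 1, 3, 2).unsqueeze(2)          # (B, C, 1, T, K)
    return (windows * w).sum(dim=-1)                      # (B, C, D, T)
```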

Temporal context interaction block

Based on TMConv, we design the TCI block, which employs dilated convolutions with different dilation rates, as shown in Fig. 4. The TCI block enlarges the receptive field of the kernel and alleviates the gridding problem, effectively balancing the trade-off between long and short actions. It consists of a \(1\times 1\) convolution, a global adaptive pooling branch, and stacked TMConvs with different dilation rates:

Fig. 4
figure 4

Illustration of TCI block. TCI block consists of a pooling branch, static convolution branch, and stacked dilated dynamic convolution branches

$${f}_{n}=\left\{\begin{array}{ll}\sigma \left(\varepsilon \left({Conv2D}^{{C}^{\prime}\to {C}^{\prime}}\left({F}_{P}\right)\right)\right), & n=1\\ {TMConv2D}^{{C}^{\prime}\to {C}^{\prime}}\left({F}_{P}\right), & n=2,3,4\end{array}\right..$$
(8)

When \(n=5\), the \(k\)-th element of the proposal-level feature map is calculated as follows:

$${F}_{P,k}=\frac{1}{D\times T}\sum_{i=1}^{D}\sum_{j=1}^{T}{F}_{P,k}\left(i,j\right), {g}_{k}={\Theta }_{bi}\left({F}_{P,k}\right),$$
(9)

where \({\Theta }_{bi}(\cdot )\) denotes bilinear up-sampling, \({F}_{P}=\left[{F}_{P,1},{F}_{P,2}, \dots ,{F}_{P,{C}^{\prime}}\right]\), and \({f}_{5}=[{g}_{1}, {g}_{2},\dots , {g}_{{C}^{\prime}}]\). Finally, all branches are fused by averaging to obtain the feature map \({F}_{P}^{\prime}={\sum }_{n=1}^{5}{f}_{n}/5\). Next, \({F}_{P}^{\prime}\) is fed into a series of 2D convolution layers and a sigmoid activation function to predict the score maps \({M}_{cls}\) for completeness classification and \({M}_{reg}\) for completeness regression.
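A self-contained sketch of this five-branch fusion is given below; plain dilated 2D convolutions stand in for TMConv2D so that the snippet runs on its own, and the channel width is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCIBlock(nn.Module):
    """Sketch of the TCI block of Eqs. (8)-(9): static, dilated, and pooling branches averaged."""

    def __init__(self, channels: int = 128, dilations=(3, 6, 8)):
        super().__init__()
        self.static = nn.Sequential(                      # n = 1: 1x1 static convolution
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels), nn.ReLU())
        self.dilated = nn.ModuleList(                     # n = 2, 3, 4: stand-ins for TMConv2D
            [nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in dilations])

    def forward(self, f_p: torch.Tensor) -> torch.Tensor:
        # f_p: (B, C', D, T) proposal-level feature map
        branches = [self.static(f_p)]
        branches += [conv(f_p) for conv in self.dilated]
        g = f_p.mean(dim=(2, 3), keepdim=True)            # Eq. (9): per-channel global average
        branches.append(F.interpolate(g, size=f_p.shape[2:], mode='bilinear',
                                      align_corners=False))  # bilinear up-sampling, n = 5
        return sum(branches) / len(branches)              # averaged fusion
```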

Training

Ground truth description

In order to predict the boundary probability sequences, TAN needs the corresponding label sequences \({G}_{s}={\{{g}_{t}^{s}\}}_{t=1}^{T}\) and \({G}_{e}={\{{g}_{t}^{e}\}}_{t=1}^{T}\) for the GAA module. For each action instance \(\varphi =\left[{t}_{s},{t}_{e}\right]\) in the annotation set \(\Psi ={\left\{{\psi }_{n}|\left({t}_{s,n},{t}_{e,n}\right)\right\}}_{n=1}^{N}\), we define its start and end regions as \({r}_{g}^{s}=\left[{t}_{s}-{d}_{\varphi }/\rho ,{t}_{s}+{d}_{\varphi }/\rho \right]\) and \({r}_{g}^{e}=\left[{t}_{e}-{d}_{\varphi }/\rho ,{t}_{e}+{d}_{\varphi }/\rho \right]\), where \(\rho \) is a preset constant and \({d}_{\varphi }={t}_{e}-{t}_{s}\). For each timestamp, the corresponding label \({g}_{t}^{s}\) or \({g}_{t}^{e}\) is set to 1 if it falls in the start or end region of any ground-truth instance.

As for ATI, for any proposal \({\varphi }_{u,v}\) with start point \(u\) and end point \(u+v\), we calculate the temporal Intersection over Union (tIoU) with every \(\varphi \) in \(\Psi \), take the maximum value \({g}_{u,v}^{c}\), and thus obtain the tIoU label map \({G}_{M}={\left\{{\left\{{g}_{u,v}^{c}\right\}}_{v=1}^{T}\right\}}_{u=1}^{D}\).
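The label construction can be sketched as follows; the value \(\rho =10\) and the [duration, start] indexing of the tIoU map (chosen to match its \(D\times T\) shape) are assumptions.

```python
import torch

def make_labels(gt: torch.Tensor, T: int, D: int, rho: float = 10.0):
    """Build start/end label sequences G_s, G_e (length T) and the tIoU map G_M (D, T).

    gt: (N, 2) tensor of ground-truth [t_s, t_e] in snippet units.
    """
    g_s, g_e = torch.zeros(T), torch.zeros(T)
    g_m = torch.zeros(D, T)
    t_axis = torch.arange(T, dtype=torch.float)
    for t_s, t_e in gt:
        d = t_e - t_s
        # snippets falling inside the start/end regions become positives
        g_s[(t_axis >= t_s - d / rho) & (t_axis <= t_s + d / rho)] = 1.0
        g_e[(t_axis >= t_e - d / rho) & (t_axis <= t_e + d / rho)] = 1.0
    for u in range(T):                        # proposal start
        for v in range(1, D + 1):             # proposal duration
            p_s, p_e = float(u), float(u + v)
            best = 0.0
            for t_s, t_e in gt:
                inter = max(0.0, min(p_e, float(t_e)) - max(p_s, float(t_s)))
                union = max(p_e, float(t_e)) - min(p_s, float(t_s))
                best = max(best, inter / union if union > 0 else 0.0)
            g_m[v - 1, u] = best              # maximum tIoU against all ground truths
    return g_s, g_e, g_m
```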

Loss function

We use the binary logistic regression loss function \({\mathcal{L}}_{b}\) to supervise the predicted boundary probabilities in the GAA module:

$$\begin{aligned}{\mathcal{L}}_{b}(P, G)= & \sum_{t=1}^{T}\left(\alpha^{+} \cdot g_{t} \cdot \log \left(p_{t}\right)+\alpha^{-} \cdot \left(1-g_{t}\right)\right. \\ & \left. \cdot \log \left(1-p_{t}\right)\right),\end{aligned}$$
(10)
$${\mathcal{L}}_{GAA}={\mathcal{L}}_{b}\left({P}_{s},{G}_{s}\right)+{\mathcal{L}}_{b}\left({P}_{e},{G}_{e}\right),$$
(11)

where \({\alpha }^{+}=T/\sum \left({g}_{t}\right)\) and \({\alpha }^{-}=T/\sum ({1-g}_{t})\) are balance factors, and \({p}_{t}\) and \({g}_{t}\) denote the prediction in the boundary probability sequence \(P={\left\{{p}_{t}\right\}}_{t=1}^{T}\) and the ground truth in \(G={\left\{{g}_{t}\right\}}_{t=1}^{T}\) for the \(t\)-th snippet, respectively. In addition, for the predicted confidence maps \({M}_{cls}\) and \({M}_{reg}\) with the ground-truth label \({G}_{M}\) in the ATI module, we use the SI-loss and the L2 loss, denoted as \({\mathcal{L}}_{CLS}\) and \({\mathcal{L}}_{REG}\), to calculate the classification loss and regression loss:

$${\mathcal{L}}_{ATI}={\mathcal{L}}_{CLS}\left({M}_{cls},{G}_{M}\right)+{\mathcal{L}}_{REG}\left({M}_{reg},{G}_{M}\right).$$
(12)

We train TAN with a multi-task loss. The overall objective contains the GAA loss, the ATI loss, and a regularization term, where \({\lambda }_{1}\) and \({\lambda }_{2}\) are set to 1 and 1e-4 to balance the contributions of the different terms:

$${\mathcal{L}}_{total}={\mathcal{L}}_{GAA}+{{\lambda }_{1}\cdot \mathcal{L}}_{ATI}+{\lambda }_{2}\cdot {L}_{2}\left(\theta \right).$$
(13)
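A sketch of the weighted boundary loss of Eq. (10) and the overall objective of Eq. (13) is given below; the explicit negation and the averaging over \(T\) are conventions we add so that the loss is minimised, and the SI-loss and L2 regression terms of Eq. (12) are assumed to be computed elsewhere and passed in as l_ati.

```python
import torch

def boundary_loss(p: torch.Tensor, g: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Weighted binary logistic regression loss of Eq. (10); p, g are (T,) tensors."""
    T = g.numel()
    alpha_pos = T / g.sum().clamp(min=1.0)           # up-weights the sparse positive snippets
    alpha_neg = T / (T - g.sum()).clamp(min=1.0)
    ll = alpha_pos * g * torch.log(p + eps) + alpha_neg * (1 - g) * torch.log(1 - p + eps)
    return -ll.mean()                                 # negate so that lower is better

def total_loss(l_gaa: torch.Tensor, l_ati: torch.Tensor, l2_reg: torch.Tensor,
               lam1: float = 1.0, lam2: float = 1e-4) -> torch.Tensor:
    """Overall objective of Eq. (13) with lambda_1 = 1 and lambda_2 = 1e-4 as in the paper."""
    return l_gaa + lam1 * l_ati + lam2 * l2_reg
```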

The training process of the TAN model is summarized in Algorithm 1.

Algorithm 1:
figure a

TAN model Training

Inference and post process

In the inference phase, the proposal set \(\Phi ={\left\{{\phi }_{m}=\left({t}_{s,m},{t}_{e,m},{p}_{m}\right)\right\}}_{m=1}^{M}\) is generated, where \({p}_{m}\) is the final score of \({\phi }_{m}\), combining the boundary probability scores (\({P}_{s}\) and \({P}_{e}\)) and the confidence maps (\({M}_{cls}\) and \({M}_{reg}\)). The final score \({p}_{m}\) is computed as:

$${p}_{m}={P}_{s}^{{t}_{s,m}}\cdot {P}_{e}^{{t}_{e,m}}\cdot {M}_{cls}^{{t}_{s,m},{t}_{e,m}-{t}_{s,m}}\cdot {M}_{reg}^{{t}_{s,m},{t}_{e,m}-{t}_{s,m}}$$
(14)

Finally, Soft-NMS is used to remove redundant proposals so that high-quality proposals are retained more efficiently.
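A minimal sketch of the score fusion of Eq. (14), keeping only the top-scoring candidates before Soft-NMS; the [duration − 1, start] indexing of the confidence maps and the top-\(k\) truncation are assumptions.

```python
import torch

def score_proposals(p_start: torch.Tensor, p_end: torch.Tensor,
                    m_cls: torch.Tensor, m_reg: torch.Tensor, top_k: int = 100):
    """Fuse boundary probabilities (T,) and confidence maps (D, T) into scored proposals."""
    D, T = m_cls.shape
    proposals = []
    for t_s in range(T):
        for d in range(1, D + 1):
            t_e = t_s + d
            if t_e >= T:
                break
            score = p_start[t_s] * p_end[t_e] * m_cls[d - 1, t_s] * m_reg[d - 1, t_s]
            proposals.append((t_s, t_e, float(score)))
    proposals.sort(key=lambda x: x[2], reverse=True)
    return proposals[:top_k]          # Soft-NMS would then decay the scores of overlapping proposals
```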

Experiments and results

Datasets and experimental settings

Datasets

ActivityNet-1.3 [5]. This is a large-scale dataset for action detection and TAPG. It contains 19,993 untrimmed videos labeled with 200 action classes, and each video contains 1.5 action instances on average. The videos are divided into training, test, and validation sets at a ratio of 2:1:1. We evaluate TAN on the validation set.

THUMOS-14 [18]. This dataset consists of 1010 validation videos and 1574 test videos, all collected from YouTube; only 413 temporally annotated untrimmed videos covering 20 sports action categories are used for TAPG. Of these, the validation set contains 200 videos and the test set contains 213 videos, each with an average of 15 action instances. We train the network on the validation set and evaluate our model on the test set.

Evaluation metrics

Generating high-quality proposals means covering the ground truth with high recall and high temporal overlap. In the TAPG task, a prediction is judged correct when its overlap with the ground truth exceeds a given tIoU threshold. Ranking proposals by confidence, we report the Average Recall (AR) at different Average Numbers (AN) of proposals, denoted AR@AN. ActivityNet-1.3 and THUMOS-14 use tIoU threshold ranges of [0.5:0.05:0.95] and [0.5:0.05:1.0], respectively, to compute AR@AN. To further examine action localization performance with the generated proposals, we use the mean Average Precision (mAP) under different tIoU thresholds as the primary evaluation metric. On ActivityNet-1.3, tIoU is set to {0.5, 0.75, 0.95}, while the threshold set {0.3, 0.4, 0.5, 0.6, 0.7} is used on THUMOS-14. In addition, the area under the AR-AN curve (AUC) is also used as an evaluation metric on ActivityNet-1.3, where AN varies from 0 to 100.
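For clarity, AR@AN can be computed roughly as in the sketch below; the data structures and the brute-force matching are illustrative, not the official evaluation code.

```python
import numpy as np

def average_recall(proposals, ground_truths, tiou_thresholds, an: int = 100) -> float:
    """AR@AN: recall averaged over tIoU thresholds using the top `an` proposals per video.

    proposals: per-video lists of (t_s, t_e, score); ground_truths: per-video lists of (t_s, t_e).
    """
    recalls = []
    for thr in tiou_thresholds:
        hit, total = 0, 0
        for props, gts in zip(proposals, ground_truths):
            top = sorted(props, key=lambda p: p[2], reverse=True)[:an]
            for g_s, g_e in gts:
                total += 1
                # a ground truth is recalled if any kept proposal overlaps it enough
                for p_s, p_e, _ in top:
                    inter = max(0.0, min(p_e, g_e) - max(p_s, g_s))
                    union = max(p_e, g_e) - min(p_s, g_s)
                    if union > 0 and inter / union >= thr:
                        hit += 1
                        break
        recalls.append(hit / max(total, 1))
    return float(np.mean(recalls))

# e.g. AR@100 on ActivityNet-1.3 uses tIoU thresholds [0.5:0.05:0.95]
# ar_100 = average_recall(all_props, all_gts, np.arange(0.5, 1.0, 0.05), an=100)
```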

Implementation details

Following previous works, we adopt a pre-trained TSN model [46] to extract video features, with ResNet [16] and BN-Inception [19] as the spatial and temporal networks, respectively. For ActivityNet-1.3, the sampling interval is \(\delta =16\), each feature sequence is rescaled to a fixed length \(T=100\) by linear interpolation, and the maximum duration \(D\) is also set to 100. For THUMOS-14, \(\delta \) is set to 5, the length of the sliding window is \(T=128\), and \(D=64\). On both datasets, we train with Adam, a batch size of 16, and 8 epochs in total; the learning rate is 1e-3 for the first 7 epochs and decays to 1e-4 afterwards.
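The fixed-length rescaling used for ActivityNet-1.3 can be done with a short helper like the following sketch; the function name and call pattern are illustrative.

```python
import torch
import torch.nn.functional as F

def rescale_features(feats: torch.Tensor, target_len: int = 100) -> torch.Tensor:
    """Rescale a (C, T_orig) feature sequence to a fixed length by linear interpolation."""
    x = feats.unsqueeze(0)                                            # (1, C, T_orig)
    x = F.interpolate(x, size=target_len, mode='linear', align_corners=False)
    return x.squeeze(0)                                               # (C, target_len)
```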

Our experiments were trained and validated on an NVIDIA GeForce RTX 3080 GPU and an Intel Xeon W-2295 3.00 GHz platform. All code is based on PyTorch 1.10.0 and Python 3.8.

Finally, in the subsequent model comparisons, the results of TSI were reproduced in our local environment using the same training parameters as TAN on each dataset; the results of other models, e.g., BMN [25] and RTD-Net [40], are cited from their original papers.

Temporal action proposal generation

This part compares our method with other state-of-the-art (SOTA) methods on ActivityNet-1.3 and THUMOS-14. As shown in Table 1, TAN outperforms the other methods; in particular, when AN equals 100, TAN reaches 77.08% in terms of AR. The results show that TAN generates proposals with high recall while pursuing diversity. Our approach differs in spirit from others: (1) we pinpoint that exploiting features at different scales for global interaction is more effective, so the GAA module incorporates gating mechanisms and a top-down structure to help remove mispredictions; (2) the ATI module focuses on discriminative information in proposal-level features along the temporal dimension, gaining contextual associations with more efficient attention for a wide range of proposals. The better performance indicates that we build a stronger boundary-based pipeline, which should bring further gains in downstream action detection tasks.

Table 1 Comparison with other SOTA methods in terms of AR@AN (%) and AUC (%) on validation set of ActivityNet-1.3

Additionally, Table 2 reports the AR@AN results of our proposed TAN and other SOTA techniques on the test set of THUMOS-14, where C3D and two-stream features are adopted for fairness and AN ranges from 50 to 1000 per video. TAN outperforms the other methods with both C3D and two-stream features, showing that our network finds more action instances with the same number of proposals, mainly because the GAA module removes spurious boundary locations and thereby suppresses low-quality proposals. At the same time, we also observe that AR is lower than DCAN [7] when AN is 50. Inspecting the TAN proposals, we found that in some videos with very sparse action instances, TAN generated fewer than 50 proposals. A possible reason is that the multi-scale TCI block is designed to encode multiple action instances of diverse categories and does not exhibit its merits when there are only a few action instances. The AR-AN curves of different methods on THUMOS-14 are shown in Fig. 5a. To further demonstrate the high overlap between our proposals and the ground-truth actions, we calculate recall at multiple tIoU thresholds with 100 proposals per video. As shown in Fig. 5b, TAN achieves significantly higher recall with fewer proposals than other SOTA methods when the threshold is between 0.5 and 0.8.

Table 2 Comparison of TAN with other SOTA methods on the test set of THUMOS-14 in terms of AR@AN (%)
Fig. 5
figure 5

Comparison of our proposal generation method with other SOTA methods on the THUMOS-14 dataset in terms of AR–AN (a) and Recall@100–tIoU (b). The horizontal coordinate in (a) is shown on a logarithmic scale so that the data are fully displayed

In addition, compared with ActivityNet-1.3, the improvements on THUMOS-14 are more significant. This is because each video in THUMOS-14 contains 15 actions on average, which further validates that our model localizes action boundaries better than other models for videos containing multiple actions.

Table 3 reports the efficiency of TAN and several closely related alternatives in terms of the number of parameters (M), floating-point operations (G), and inference time on a 3-min video with a single NVIDIA GeForce RTX 3080. The table shows that our method achieves satisfying performance with far fewer FLOPs and an acceptable increase in the number of parameters compared to the others.

Table 3 Network efficiency of TAN and several previous works

To further demonstrate that TAN converges well, we present the training loss curves of TAN and the baseline model [28]. As depicted in Fig. 6, TAN exhibits good convergence; owing to the combination of the attention modules and the optimization of proposal-level temporal modeling, the training loss converges more stably. In addition, the average training time per epoch remains at about 380 s, so the training speed is acceptable.

Fig. 6
figure 6

The loss curves of TAN and the baseline model on ActivityNet-1.3

Temporal action detection

To further verify the quality of the proposals generated by TAN, we apply them to the temporal action detection task and compare with previous methods. We combine the generated proposals with the action classifier UntrimmedNet [45] and the video-level classification results of Zhao et al. [50]. The resulting detections are evaluated with the mAP metric described above.

The evaluation results on the test set of ActivityNet-1.3 are shown in Table 4. TAN achieves the best detection performance for tIoU from 0.5 to 0.95, verifying the high quality of the proposals generated by TAN. In particular, at tIoU 0.95, we obtain an mAP of 10.26%, indicating that TAN proposals are more precise. The results on THUMOS-14 in Table 5 re-emphasize the superior performance of TAN compared with other SOTA methods; notably, at tIoU 0.6, mAP improves from 38.7% to 42.1%. We conclude that TAN provides more efficient and reliable temporal action proposals for the action detection task.

Table 4 Comparison between TAN and other methods on ActivityNet-1.3 regarding temporal action detection. The results are measured by mAP (%) at different tIoU thresholds and average mAP (%). We combined our proposals with video-level classification results from Zhao et al. [50]
Table 5 Comparison between TAN and other methods on THUMOS-14 regarding temporal action detection. The results are measured by mAP (%) at different tIoU thresholds. Proposals from all methods are combined with the video-level classifier UntrimmedNet [45]

Ablation study

In this section, we further investigate each component's contribution and its suitable settings to better understand TAN. All experiments are conducted on THUMOS-14 and ActivityNet-1.3, and the input video feature sequences are all extracted with TSN as the backbone network.

Effectiveness of different components in TAN

TAN contains two main modules: the GAA module and the ATI module. To confirm the effectiveness and superiority of TAN, we evaluate the impact of removing each component in Table 6. Each component contributes to the final performance. When only GAA is used, the AUC on ActivityNet-1.3 and THUMOS-14 reaches 68.26% and 63.25%, respectively, owing to more accurate boundaries. We also observe that AUC improves once ATI is added: ATI brings a considerable further gain by fully modeling the relationships between proposals while accounting for the different scales and temporal relations of segments.

Table 6 Effectiveness of different components in TAN on ActivityNet-1.3 and THUMOS-14

When GAA and ATI work together, TAN reaches 69.01% and 64.15% AUC on the two datasets, which illustrates the importance and effectiveness of global boundary prediction and proposal-level context modeling.

Analysis of feature layer numbers for GAA module

We construct the top-down structure in GAA on the original feature maps and evaluate the importance of video feature scales for boundary prediction by varying the number of feature layers, as shown in Fig. 7. When the GAA module uses only a single feature layer, the performance is much lower than with multi-scale features, because a single layer cannot capture the contextual interactions between actions of different lengths. The prediction becomes more accurate as the number of multi-scale feature layers increases. However, excessive down-sampling and repeated restoration of context information introduce inaccuracies into our global boundary interaction structure. Thus, our experimental setup employs four layers of encoders and decoders.

Fig. 7
figure 7

Analysis of feature layer numbers for the GAA module on THUMOS-14 in terms of AR@AN (%)

Effectiveness of encoder component in GAA module

There are two kinds of attention blocks in our GAA module: Cross Attention and Fusion Attention. Cross Attention, applied in each layer, simultaneously highlights low-level features containing local details and informative high-level features. Fusion Attention enhances temporal global feature extraction by fusing features from multiple levels. To validate both, we test different combinations of Cross Attention and Fusion Attention.

As shown in Table 7, when both are used, TAN gains a large improvement of 4.15% in AR@50. When Cross Attention and Fusion Attention are added separately, we obtain gains of 2.08% and 2.96% in AR@500, respectively. We conclude that global contextual aggregation provides essential guidance for the effective fusion of boundary information. We also visualize the queries in Fusion Attention in the decoder layers; as shown in Fig. 8, the query has a clear attention area compared with the initial input features.

Table 7 Effectiveness of encoder component of GAA on THUMOS-14. We show experimental results in terms of AR@AN (%). CrossAttn and FusionAttn denote the Cross Attention and the Fusion Attention within each layer. The symbol √ denotes existence, and × denotes nonexistence
Fig. 8
figure 8

Visualization of boundary query features. a Is the original query. b is the query feature after Fusion Attention

Analysis of temporal interaction between proposals

To evaluate the effectiveness of the proposal-level contexts from the TCI block, we carry out several ablation experiments. First, we replace TMConv entirely with traditional 2D convolutions ('w/o TMConv'). Second, we use TMConv without channel attention (CA) to aggregate proposal-level representations ('w/o CA'). Third, we use TMConv without temporal attention (TA) ('w/o TA') to verify the necessity of temporal contexts. As shown in Fig. 9a, both temporal attention and channel attention improve performance. These results show that (1) assigning different weights to each proposal along the temporal dimension is effective for boosting performance, and (2) temporal interaction is most helpful for weight calibration in the time dimension.

Fig. 9
figure 9

Analysis of temporal interaction between proposals. a Is for comparing the different variants of the TMConv. b Is for the choices about the number of TCI blocks

In the ATI module, we stack multiple successive TCI blocks to exploit proposal-level contexts. Here, we explore how the number of TCI blocks (i.e., K) influences performance; the results are shown in Fig. 9b, where \(K=0\) means no TCI block is used. The TCI block substantially improves performance, confirming the necessity of proposal-level contextual information. In particular, the largest improvement is achieved at \(K=3\), which is the default setting of the TCI block.

Effectiveness of the pooling method in TMConv

In addition, we explore the choice of pooling method in TMConv: (1) GAP: the mean operation is applied to all sampled features; (2) GMP: the max operation is applied to all sampled features; (3) GAP & GMP: the outputs of the mean and max operations are concatenated, and a fully connected layer shrinks the channels. As shown in Table 8, during convolution weight generation, the global weights obtained by combining GAP and GMP not only attend to the general information but also emphasize the vital information, which is more conducive to dynamic convolution and information modeling. AR is significantly improved when AN ranges from 50 to 500 on THUMOS-14.

Table 8 Ablation results of TMConv with different pooling methods on THUMOS-14

Effectiveness of different scale sizes in TCI block

Table 9 shows the results of different dilation settings for the dynamic convolution layers in the TCI block. To capture sufficient feature information while expanding the receptive field, we test three settings: (3, 5, 7), (3, 6, 8), and (3, 6, 12). We also test (1, 1, 1) to verify the necessity of multi-scale blocks for proposal-level context. The results show that multi-scale contexts effectively boost performance and that larger dilation rates yield higher proposal recall: for example, (3, 6, 8) performs better than (3, 5, 7), indicating that larger dilation rates better relieve the tension between long and short actions. However, when the dilation rate is too large, the convolution no longer captures the effective neighboring features of the current proposal but instead extracts features that are too far away, causing redundancy that is useless for feature extraction. Thus, we choose (3, 6, 8) as our default setting.

Table 9 The effect of different dilation rate settings of TMConv on ActivityNet-1.3 and THUMOS-14 in terms of AR@AN (%) and AUC (%)

Qualitative results

To intuitively understand the behavior of the GAA and ATI modules, we visualize the start and end probability sequences predicted by TAN and BMN in Fig. 10. For some boundary positions, BMN recognizes background positions (yellow boxes) as boundaries, which demonstrates that local context alone is insufficient for evaluating temporal boundaries. The probability maps of TAN are more discriminative than those of BMN, and the probabilities at background positions are significantly lower than those at boundaries. This indicates that global context aggregation at the boundary level improves the model's ability to suppress false boundary positions.

Fig. 10
figure 10

Visualization of the start and end probability prediction sequences of TAN and BMN

We also provide visualization examples of TAN in Fig. 11, which shows five randomly selected videos from THUMOS-14 and ActivityNet-1.3. In each video, the proposals with the top \(k\) scores are visualized, where \(k\) is the number of ground-truth instances. For example, in the first and last videos, background and action instances share similar scenes, yet our proposals still align with the ground-truth actions. The third video contains multiple ground-truth instances, and our top-3 proposals cover them accurately, suggesting the high quality of the generated proposals. These correct predictions show that TAN generates reliable proposals by modeling context interactions in both the proposal-level and boundary-level feature maps.

Fig. 11
figure 11

Visualization examples of proposals generated by our method on THUMOS-14 (first, second, and third rows) and ActivityNet-1.3 (fourth and fifth rows)

Conclusion

In this paper, we present a novel TAPG method named temporal-aware attention network (TAN), which aims to generate high-quality temporal action proposals. The key idea of TAN is to exploit temporal-aware action contexts in videos by temporally modeling boundary-level information and proposal-level features separately. First, we argue that long-distance temporal context helps obtain precise location information for action instances; thus, the GAA module employs novel Cross Attention and Fusion Attention to learn rich temporal contextual information for boundary prediction. Second, we consider it necessary to exploit the temporal correlation among a wide range of proposals; hence, the ATI module utilizes the novel TMConv with varying dilation rates and enhances proposal-level feature dependencies to enrich the context. Extensive experiments on ActivityNet-1.3 and THUMOS-14 demonstrate the effectiveness of our framework. We believe this research will facilitate practical applications of the TAPG task. At the same time, it is worth noting that the requirement for extensive manual frame-level annotations during training is laborious and hinders the widespread application of TAN. Another limitation is that TAN relies on a fixed feature encoder (e.g., TSN, I3D) for feature extraction. Interesting future directions include studying TAPG with weak supervision and jointly learning non-visual modalities, e.g., synchronizing image and audio, to improve action detection performance.