1 Introduction

High-quality temporal action proposals are crucial for a successful two-stage action localization pipeline on long-term video sequences. Deep learning models achieve remarkable performance in the temporal action proposal generation task using either boundary-based [1, 2] or proposal-based [3] approaches within a fully-supervised setting. The complementary characteristics of these two techniques motivate joint models with improved performance [3,4,5]. Despite advances in deep learning architectures for temporal action proposals, performance usually relies on human-annotated data, as it scales with the amount of labeled data. However, only relatively limited labeled data is available in the video domain compared to image datasets.

Fig. 1 Overview of Temporal Teacher with Masked Transformers (TTMT). The teacher-student framework consists of two stages: the Burn-in Stage and the Mutual Learning Stage. The backbone model is based on a multiscale transformer architecture.

Semi-supervised learning (SSL) algorithms aim to learn prediction functions jointly from labeled and unlabeled observations. Inspired by the advances in rapidly developed semi-supervised image classification models [6, 7], recent studies [8,9,10,11] show promising results on semi-supervised object detection, approaching the performance of fully-supervised versions with limited labeled data. To our knowledge, few recent approaches have been adapted and applied to semi-supervised action detection and proposal generation tasks on untrimmed videos. Available action models [12, 13] are designed on top of anchor-based models, where the former [12] investigates the SSL approach using the Boundary Sensitive Network (BSN) model [2] and the latter [13] applies SSL on the Boundary Matching Network (BMN) model [4]. However, anchor-free models have been receiving more attention in the fully-supervised setting, with a few recent studies proposed for action detection [14, 15] that promise competitive accuracy and computational efficiency. Anchor-free semi-supervised action models therefore appear to be a promising direction for the field as well.

In particular, semi-supervised techniques are not well explored on anchor-free models for temporal action proposal generation. Focusing on a two-stage detection pipeline as in [12, 13], we aim to propose an anchor-free model within a semi-supervised training methodology. Following the teacher-student framework [16] as a semi-supervised technique, we introduce an anchor-free temporal proposal generation model that achieves performance comparable to fully-supervised anchor-based [3,4,5] and semi-supervised anchor-based models [13]. Recently, Unbiased Teacher v2 [11] evaluated a new pseudo-labeling semi-supervised approach on anchor-free object detectors with extensive analysis. Following the observations on object detectors from that study, we investigate the performance of our semi-supervised, anchor-free approach for action proposal generation. An SSL-based anchor-free model [17] has recently been introduced for temporal action detection, but a direct comparison is not reasonable, as that work is proposed within a one-stage detection pipeline. One-stage pipelines are often powered by a refinement stage, since they directly target the action detection task, e.g. [17]. Instead, our work neither requires an extra refinement stage nor uses an existing one; it focuses on a two-stage pipeline that places more emphasis on the proposal generation task. As one advantage, two-stage pipelines with good proposal candidates can be flexibly integrated to strengthen various downstream tasks at different granularities, e.g. action recognition on a coarse scale or human-object interaction detection on a fine scale. As another advantage, they are less dependent on action categories than one-stage pipelines and can be easily fine-tuned to new action categories, with greater potential for class-incremental scenarios. In particular, we perform action recognition in the second stage of the pipeline as a downstream task by adding a simple pre-trained action classifier, i.e., UntrimmedNet [18], to rescore extracted proposals for classification as in [3,4,5]. With a two-stage pipeline, our proposal generation model can be integrated with proposal refinement techniques such as P-GCN [19] for action detection, as was previously shown effective in [20].

Our semi-supervised approach, shown in Fig. 1, is a teacher-student framework that follows the training methodology of Liu et al. [21]. Our model observes a set of labeled videos and a set of unlabeled ones in a two-step training pipeline. The first step, i.e., the burn-in phase, draws on labeled data during the first iterations of training using our anchor-free backbone model as the Student, while the second step, i.e., the mutual learning phase, makes use of both labeled and unlabeled videos for the rest of the training, with the competing Teacher and Student built on the same anchor-free backbone. Our teacher-student framework combines snippet-based classification and regression objectives on supervised and unsupervised data with pseudo-labeling, integrating tightly with the snippet-based anchor-free backbone predictions.

The backbone model, i.e., the Masked Transformer model, is based on detecting multiple snippet-level local clues in video sequences via an encoder-only Transformer architecture designed in our recent study [22]. The traditional Transformer model [23] supports the detection of entities along with their pairwise relationships. Because it explores local snippet-based features at multiple levels of detail, a multiscale Transformer is a good video processing technique for anchor-free models. Several recent models target multiscale image classification, and our strategy is to extend one of them, namely Improved Multiscale Vision Transformers (MViTv2) [24], to the video action proposal generation task. Our model improves the pooling attention [24] by using bi-directional masks to better model temporal ordering [25]. Using the proposed Masked Transformer model, we primarily aim to demonstrate that our anchor-free model can be integrated into a semi-supervised teacher-student framework with performance comparable to that of both fully- and semi-supervised anchor-based models [3,4,5, 12, 13]. Next, we aim to demonstrate how our teacher-student framework can be applied to temporal sequences through both snippet-based classification and regression, using consistency regularization and pseudo-labeling while taking into account the localization uncertainty in boundary estimations. Mean teacher models are mostly examined for classification scenarios. Instead, our semi-supervised model is based on a teacher-student framework with multiple snippet-based classification and regression functions formulated specifically for our snippet-based anchor-free design.

We demonstrate that our end-to-end trainable anchor-free Transformer-based generator network, called Temporal Teacher with Masked Transformers (TTMT), achieves promising performance compared to state-of-the-art models on action proposal generation. We validate our model with experiments on the THUMOS14 [26] and ActivityNet-1.3 [27] datasets. Experiments reveal that our anchor-free Transformer-based model is a good candidate for video processing, as it performs as well as proposal-based models. The generated proposals overlap strongly with the ground truth and have accurate boundary localization. The main contributions of our study are (i) a new teacher-student model with an encoder-only Transformer for anchor-free temporal action proposal generation, (ii) a Masked Transformer model with a temporal extension of the pooling attention unit [24] via bi-directional masks for temporal encoding, and (iii) an improved anchor-free model with uncertainty-aware boundary estimations.

2 Background

Our work targets a semi-supervised two-stage action localization pipeline on untrimmed video sequences. Although the literature is rich in robust one-stage and two-stage action detection models, we focus here on recent two-stage detection models. For a robust two-stage pipeline, high-quality proposal generation means better capturing the ground-truth segments with highly confident foreground action regions and accurate boundaries [2, 4, 28]. Most existing studies focus on fully-supervised models for action localization, while few recent ones aim for semi-supervised models.

2.1 Fully-supervised models

Existing proposal generation models in fully-supervised settings can be categorized into anchor-based and anchor-free approaches. Anchor-based approaches further divide into top-down and bottom-up methods. While the former group relies on sliding-window or Faster R-CNN [29] strategies to extract proposal-level regions as candidate segments [30], the latter detects boundary-level features to extract candidates [2, 28]. Temporal Unit Regression Network (TURN) [31] generates proposals via decomposition into short units and employs regression to adjust boundaries from the sliding windows. Temporal Action Grouping (TAG) [28] connects high-scoring regions by a watershed algorithm. Boundary Sensitive Network (BSN) [2] detects local boundaries and evaluates proposal confidence scores within a region. On the other hand, Complementary Temporal Action Proposal (CTAP) [1] jointly uses sliding windows and grouping-based methods for high-quality proposals. Another approach, Snippet Relatedness-based Generator (SRG) [32], represents long-range dependencies among snippets by a score map.

Both proposal-level and boundary-level features are critical for obtaining high-quality proposals with precise boundaries [1, 5]. Complementary characteristics of these features are the key motivations for many joint models integrating proposal- and boundary-level features, e.g. BMN [4] and MGG [5]. One recent model, Boundary Content Graph Neural Network (BC-GNN) [33], uses a graph neural network for the interactions of boundaries and content of proposals. Another model, Relaxed Transformer Decoder (RTD-Net) [34], proposes a transformer-based architecture for temporal proposal generation inspired by a recent transformer object detection framework DETR [35].

In addition to anchor-based models, more recent studies focus on anchor-free approaches. Anchor-Free Saliency-based Detector (AFSD) [14] proposes a saliency-based refinement module that gathers boundary features, and ActionFormer [15] uses multiscale Transformers. In contrast to these studies, which aim at single-stage action detection, we target anchor-free models for proposal generation within two-stage action detectors and devise an SSL-based model.

2.2 Semi-supervised models

A powerful technique for training models on both labeled and unlabeled data is semi-supervised learning (SSL). A popular class of SSL methods produces artificial labels for unlabeled data and trains a model to predict the artificial label when unlabeled data is given as input. The majority of recent SSL methods combine pseudo-labeling and consistency regularization. Pseudo-labeling [36] uses the model itself to obtain predictions for unlabeled data. Consistency regularization [37], in turn, leverages the idea of obtaining similar predictions when the model is fed perturbed versions of the data. Early approaches apply an exponential moving average (EMA) of model parameters [16] or self-ensembling [38] when producing artificial labels.

SSL for image classification has been rapidly developed with promising results in recent years. Existing SSL image classification works [7, 39] apply input augmentations and consistency regularization on unlabeled images. Inspired by these works, several semi-supervised object detection works have been proposed to exploit similar ideas to train object detectors in a semi-supervised manner [8, 40]. Despite the significant improvement, two issues remain: (i) there are few studies on SSL-based proposal generation and action detection models, and (ii) prior works are mainly focused on anchor-based models [12, 13]. Both models [12, 13] adopt the Mean Teacher framework for the semi-supervised temporal action proposal task. We devise an alternative teacher-student framework based on our anchor-free masked Transformer network with a lightweight uncertainty-aware proposal refinement component.

One recent SSL-based study [17] introduces an anchor-free one-stage approach to action detection. The study integrates a two-stream model based on a standard Transformer backbone into a semi-supervised model via pseudo-labeling applied to both classification and mask predictions. Similarly, we offer a teacher-student framework, but unlike [17], our proposed SSL-based framework relies on a new anchor-free masked Transformer network and our framework integrates pseudo-labeling not only for the classification of various snippet-based features but also for boundary regression. Our framework leverages the relative uncertainties between the Teacher and Student to select the boundary-level pseudo-labels [11]. Moreover, a direct comparison is not reasonable since we are proposing a two-stage pipeline contrary to Nag et al. [17], which is a one-stage model.

3 Masked transformer pyramid model

Core models replicated under the proposed teacher-student framework are Transformer-based. The Transformer was first introduced for language modeling on text sequences [23], supporting long-range dependencies via the self-attention mechanism. Following its success in NLP [41], attention mechanisms later became an integral part of many vision tasks, including image recognition, object detection, video understanding, text-image synthesis and visual question answering [42, 43]. In particular, we use a multiscale encoder-only Transformer network introduced in our previous study [22], designed for directional temporal dependency modeling on long-range video snippet sequences.

Based on the Masked Transformer network, which reveals local clues at multiple scales in addition to interactions among snippets, we aim to extract proposal candidates within a pyramid structure. In this section, we first describe the encoder-only Transformer architecture with a bi-directional multi-head attention unit and then give the details of the pyramid architecture.

Fig. 2 Bi-directional pooling attention unit, proposed as an extension of the attention unit from [24] with integrated directional masks

3.1 Multiscale encoder-only transformers

Our core model is based on a multiscale transformer architecture. For the multiscale purpose, we exploit the pooling attention units devised as self-attention blocks by Improved Multiscale Vision Transformers (MViTv2) [24]. In MViTv2, pooling attention was originally proposed as part of a Vision Transformer model for image classification, object detection and video recognition tasks. In this work, we integrate it to process 1D sequences of snippet embeddings extracted using a pre-trained CNN model, and we leverage it for the temporal proposal generation task.

The multiscale Transformer architecture is organized into stages. Each stage consists of multiple transformer blocks with a specific time resolution and channel dimension. Reducing the sequence length from the input to the output of the network stages, the architecture gradually expands the channel width via pooling attention units. For an input sequence, \(F\in \mathbb {R}^{T \times D}\), a Transformer block packs it into query, key and value matrices, Q, K, V, with a pooling attention unit as

$$\begin{aligned} Q = P_Q(FW_Q), K = P_K(FW_K), V = P_V(FW_V) , \end{aligned}$$
(1)

where \(W_Q\), \(W_K\) and \(W_V \in \mathbb {R}^{D \times D}\). The pooling attention unit first projects input F using \(W_Q\), \(W_K\) and \(W_V\) and then applies pooling operators (P) that are \(1\times 3\) convolution layers. The pooling operator can reduce the time resolution, i.e., the sequence length, using a convolutional stride.

Following the pooling operators, the standard (i.e., unmasked) version of the multi-head attention block is applied as

$$\begin{aligned} Z'&={{QK^\top } \over {\sqrt{D}}} + E^{r} \;, \nonumber \\ Attn(Q,K,V)&= softmax(Z')V \;, \end{aligned}$$
(2)

where \(E^{r}\) is the relative position embedding along the temporal axis. Finally, we apply the residual pooling connection, adding the pooled query tensor to the output sequence, \(Z=Attn(Q,K,V)+Q\).
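
To make the pooling attention concrete, the following PyTorch sketch implements Eqs. (1)–(2) for a single head; the depthwise \(1\times 3\) pooling convolutions and the layer names are our illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn


class PoolingAttention1D(nn.Module):
    """Minimal single-head sketch of the pooling attention unit of
    Eqs. (1)-(2), in the spirit of MViTv2 [24] adapted to 1D snippet
    sequences. Depthwise 1x3 pooling convolutions are assumptions."""

    def __init__(self, dim, stride_q=1, stride_kv=1):
        super().__init__()
        self.dim = dim
        self.w_q = nn.Linear(dim, dim)
        self.w_k = nn.Linear(dim, dim)
        self.w_v = nn.Linear(dim, dim)
        # Pooling operators P_Q, P_K, P_V: 1x3 convolutions; a stride > 1
        # reduces the temporal resolution of the output sequence.
        self.p_q = nn.Conv1d(dim, dim, 3, stride=stride_q, padding=1, groups=dim)
        self.p_k = nn.Conv1d(dim, dim, 3, stride=stride_kv, padding=1, groups=dim)
        self.p_v = nn.Conv1d(dim, dim, 3, stride=stride_kv, padding=1, groups=dim)

    def forward(self, f, rel_pos=None):
        # f: (B, T, D) snippet embeddings.
        def project_pool(x, pool):  # Eq. (1): project, then pool over time
            return pool(x.transpose(1, 2)).transpose(1, 2)

        q = project_pool(self.w_q(f), self.p_q)
        k = project_pool(self.w_k(f), self.p_k)
        v = project_pool(self.w_v(f), self.p_v)
        z = q @ k.transpose(1, 2) / self.dim ** 0.5  # scaled dot product
        if rel_pos is not None:
            z = z + rel_pos                          # E^r in Eq. (2)
        attn = torch.softmax(z, dim=-1) @ v
        return attn + q                              # residual pooling connection
```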

3.2 Bi-directional multi-head attention

In this work, we integrate a directional mask into the pooling attention unit and introduce a bi-directional version of the attention unit to model temporal ordering in the attention output [25]. Given a mask \(M\in \mathbb {R}^{T \times T}\), we first apply dot-product attention between Q and K with a scaling factor as in Eq. (2) and then add the mask component as

$$\begin{aligned} Z'_{ij} = \sum _{d=1}^{D} (Q_{id}K_{dj})/ \sqrt{D}+E_{ij}^{r} +M_{ij} , \end{aligned}$$
(3)

where i and j are snippet indices. If \(M_{ij}=-\infty \), then \(Z'_{ij}=-\infty \). This implies that \(Attn_{ij}\) becomes zero in Eq. (2), since the softmax output is 0 at that position.

For the bi-directional version, we use two masks—one for modeling forward ordering and the other for modeling backward ordering, \(M^{f}\) and \(M^{b}\), respectively, as

$$\begin{aligned}{} & {} \displaystyle M^{f}_{ij}={\left\{ \begin{array}{ll} 0&{}i < j,\\ -\infty &{}\displaystyle \text {otherwise}\end{array}\right. } ,\end{aligned}$$
(4)
$$\begin{aligned}{} & {} \displaystyle M^{b}_{ij}={\left\{ \begin{array}{ll} 0&{}i > j,\\ -\infty &{}\displaystyle \text {otherwise}\end{array}\right. } . \end{aligned}$$
(5)

We apply forward and backward masks as in Eq. (3) to compute \(Z'^{f}\) and \(Z'^{b}\) outputs respectively, and multiply by V. The final Attn matrix is then merged with a simple addition operation as

$$\begin{aligned} Attn(Q,K,V)&=softmax(Z'^{f})V \nonumber \\&\quad +\,softmax(Z'^{b})V\;. \end{aligned}$$
(6)

Figure 2 illustrates the details of the pooling attention unit with bi-directional mask extension. Note that the proposed attention model can be generalized to various other mask structures.
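
The following sketch illustrates Eqs. (3)–(6) under the assumption that Q and K share the same temporal length; rows that end up fully masked in one direction (the first and last snippets) are zeroed after the softmax.

```python
import torch


def directional_masks(t: int):
    """Build the global forward/backward masks of Eqs. (4)-(5): M^f lets
    snippet i attend only to later snippets (i < j), M^b only to earlier
    ones (i > j); all other entries are -inf."""
    i = torch.arange(t).unsqueeze(1)
    j = torch.arange(t).unsqueeze(0)
    m_f = torch.where(i < j, 0.0, float('-inf'))
    m_b = torch.where(i > j, 0.0, float('-inf'))
    return m_f, m_b


def bidirectional_attention(q, k, v, rel_pos=0.0):
    """Sketch of Eqs. (3) and (6): mask the scaled scores per direction,
    softmax each branch, multiply by V and merge by addition. Fully
    masked rows (softmax over all -inf yields NaN) are zeroed."""
    d = q.size(-1)
    z = q @ k.transpose(-2, -1) / d ** 0.5 + rel_pos   # Eq. (3) without M
    m_f, m_b = directional_masks(q.size(-2))
    attn_f = torch.softmax(z + m_f, dim=-1).nan_to_num(0.0) @ v
    attn_b = torch.softmax(z + m_b, dim=-1).nan_to_num(0.0) @ v
    return attn_f + attn_b                             # Eq. (6)
```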

3.3 Transformer-based pyramid architecture

In the proposed multiscale transformer architecture, the bottom stages perform fine-scale evaluation while the higher stages perform coarse-scale evaluation on video sequences. The architecture is converted into a simple pyramid structure by attaching lateral connections. In this structure, the bottom-up pathway consists of multiple stages, each with a varying number of blocks. The last block of each stage doubles the channel width \(D_i\) while reducing the sequence length \(T_i\) by a factor of two using the bi-directional pooling attention unit (see Sect. 3.2). The last block output of a stage corresponds to a level sequence map. The top-down pathway integrates sequence maps iteratively via lateral connections to form a pyramid network [44]. In each iteration, a coarse-scale sequence map is upsampled by a factor of two using nearest-neighbor interpolation and added to the corresponding bottom-up map, which is filtered using a 1\(\times \)1 convolutional layer. The merged map is smoothed using a 1\(\times \)3 convolutional filter into \(P_i\in \mathbb {R}^{D_i \times T_i}\) (we fix the number of channels \(D_i\) to 1024 in this paper). This process continues until the finest-resolution map is constructed.
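
As a rough illustration of this top-down pathway, the sketch below builds the 1D pyramid with lateral \(1\times 1\) convolutions, nearest-neighbor upsampling and \(1\times 3\) smoothing; the two stage widths follow the channel doubling described above and are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F


class TemporalFPN(nn.Module):
    """Sketch of the Sect. 3.3 top-down pathway for 1D sequence maps:
    lateral 1x1 convolutions project each bottom-up stage to a shared
    width, coarse maps are upsampled 2x by nearest neighbor and added,
    and merged maps are smoothed by a 1x3 convolution into P_i."""

    def __init__(self, in_dims=(1024, 2048), out_dim=1024):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv1d(d, out_dim, 1) for d in in_dims)
        self.smooth = nn.ModuleList(
            nn.Conv1d(out_dim, out_dim, 3, padding=1) for _ in in_dims)

    def forward(self, feats):
        # feats: list of (B, D_i, T_i) stage outputs, fine to coarse,
        # with T_i halving at each stage.
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        merged = [laterals[-1]]  # the coarsest map starts the pass
        for lvl in range(len(laterals) - 2, -1, -1):
            up = F.interpolate(merged[0], scale_factor=2, mode='nearest')
            merged.insert(0, laterals[lvl] + up)
        return [s(m) for s, m in zip(self.smooth, merged)]  # P_i maps
```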

All levels of the pyramid use shared network heads, including classifiers and regressors, as in a traditional image pyramid. Our network heads consist of (i) snippet-based prediction branches including {actionness, centerness, start-boundary, end-boundary}, (ii) a boundary regression branch, and (iii) a localization uncertainty branch. Given a feature map \(P_i\), the actionness head predicts the actionness score, \(p^n_a\), the centerness head measures the centerness score, \(p_c^n\), and the start- and end-boundary classifier heads estimate the scores of being a start and an end position, \(p^n_s\) and \(p^n_e\), for the snippet n, respectively. The prediction heads are designed using two linear layers. In addition, a boundary regression branch with two linear layers returns a pair of relative distance estimations, \(v^n=(l^n,r^n)\), from a snippet n to the start and end boundaries. Finally, our network has a localization uncertainty branch that estimates uncertainties [45] for the predicted relative distances, \(\sigma ^n =(\sigma _l^n\),\(\sigma _r^n)\), with a linear layer attached to the first linear layer of the boundary regression branch.
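
A minimal sketch of these shared heads might look as follows; the hidden width and the softplus keeping the uncertainty outputs positive are our assumptions.

```python
import torch.nn as nn


class ProposalHeads(nn.Module):
    """Sketch of the shared pyramid heads of Sect. 3.3: four snippet-level
    classifiers, a two-layer boundary regressor, and an uncertainty head
    attached to the regressor's first layer."""

    def __init__(self, dim=1024, hidden=256):
        super().__init__()
        def cls_head():  # two linear layers per prediction head
            return nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
        self.actionness, self.centerness = cls_head(), cls_head()
        self.start, self.end = cls_head(), cls_head()
        self.reg_fc1 = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.reg_out = nn.Linear(hidden, 2)  # relative distances (l, r)
        self.unc_out = nn.Sequential(nn.Linear(hidden, 2), nn.Softplus())

    def forward(self, p):  # p: (B, T, D) level features
        h = self.reg_fc1(p)
        return dict(a=self.actionness(p), c=self.centerness(p),
                    s=self.start(p), e=self.end(p),
                    v=self.reg_out(h), sigma=self.unc_out(h))
```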

Given a ground-truth segment on an interval \([s^*, e^*]\), the snippets within this interval are defined as positive for the actionness, i.e., snippet n within a ground-truth segment has an actionness value of \(p^{n*}_a=1\). Adapting the centerness formulation for temporal segments from FCOS [46], snippet n at location t within a ground-truth segment has a centerness value on the same interval of \(p^{n*}_c=\sqrt{\min (l^{n*}, r^{n*}) \over \max (l^{n*}, r^{n*})}\), where \(l^{n*}\) and \(r^{n*}\) are the distances of snippet n to the start and end boundaries, \(l^{n*}=t - s^{*}\) and \(r^{n*}=e^{*}-t\) (otherwise the centerness value is zero). Corresponding start and end boundary labels are defined as positive within the intervals \([s^{*} -\tau ^{*}, s^{*}+\tau ^{*}]\) and \([e^{*}-\tau ^{*}, e^*+\tau ^{*}]\), respectively, with an extra offset \(\tau ^*=(e^*-s^*)/10\). Following FCOS [46], positive snippets, i.e., those that lie within a ground-truth segment, participate in boundary regression and uncertainty prediction using \(v^{n*}=(l^{n*},r^{n*})\).
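
This label assignment can be sketched as follows for a single ground-truth segment; `t` holds the snippet locations, and the regression targets are only meaningful at positive snippets.

```python
import torch


def snippet_targets(t, s_star, e_star):
    """Sketch of the Sect. 3.3 label assignment for one ground-truth
    segment [s*, e*]; t is a float tensor of snippet locations."""
    inside = (t >= s_star) & (t <= e_star)
    l = (t - s_star).clamp(min=0.0)             # distance to the start
    r = (e_star - t).clamp(min=0.0)             # distance to the end
    actionness = inside.float()
    centerness = torch.where(                   # FCOS-style [46] centerness
        inside,
        torch.sqrt(torch.minimum(l, r) / torch.maximum(l, r).clamp(min=1e-6)),
        torch.zeros_like(t))
    tau = (e_star - s_star) / 10.0              # boundary offset tau*
    start_lbl = ((t - s_star).abs() <= tau).float()
    end_lbl = ((t - e_star).abs() <= tau).float()
    return actionness, centerness, start_lbl, end_lbl, torch.stack([l, r], -1)
```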

4 Teacher-student framework

The teacher-student framework is adopted by many deep neural network models for semi-supervised learning [16] to reduce over-fitting with a large number of learning parameters and to train robust models with more abstract invariances. The framework jointly trains a student and a teacher model in a mutually beneficial way: the student model learns and updates the teacher model via an exponential moving average (EMA) [16], while the teacher model generates targets to train the student model. In this section, we describe the stages in the training process of the proposed teacher-student framework: the burn-in and mutual learning stages, respectively.

4.1 Burn-in stage

In a teacher-student framework, good initialization is important, since the teacher generates the targets used by the student for learning. We utilize the Burn-in training strategy [21] to optimize the student model weights \(\theta \) using supervised data and a supervised loss.

Let \(P_i\in \mathbb {R}^{D_i \times T_i}\) be the feature map at layer i of pyramid network with feature dimension \(D_i \) and length \(T_i\). Once we have ground truth labels at each location t on the feature map, we train our student model on supervised data with a fixed number of epochs using the following supervised loss

$$\begin{aligned} {{\mathcal {L}}^{snip}_{sup}}&={1 \over N_s} \biggl ( \sum _n \ell _{a} (p^{n}_{a}, p^{n*}_{a}) + \sum _n \ell _{c} (p^{n}_{c}, p^{n*}_{c}) \nonumber \\&\quad + \,\sum _n \ell _{s} (p^{n}_{s},p^{n*}_{s}) + \sum _n \ell _{e} (p^{n}_{e}, p^{n*}_{e})\biggr ) \nonumber \\&\quad +\, {1 \over N_{ps}} \biggl (\sum _n \pmb 1^{n} \ell _{diou} (v^{n}, v^{n*}) \nonumber \\&\quad +\,\sum _n \pmb 1^{n} \ell _{unc} (v^{n}, \sigma ^n , v^{n*})\biggr ), \end{aligned}$$
(7)

where we predict the actionness score, the centerness score, and the start-end boundary scores and regress the target segment intervals assuming each snippet location as an anchor point. \(\pmb 1^{n}\) indicates that the n-th snippet is a positive instance within a ground-truth segment interval, and \(p^{n}_a\), \(p^{n}_c\), \(p^{n}_s\), \(p^{n}_e\), \(v^{n}\) and \(\sigma ^n\) show the prediction outputs of corresponding network heads. \(N_s\) and \(N_{ps}\) are the numbers of all locations and positive locations in a batch, respectively. \(\ell _{a}\) is defined as a cross-entropy loss, while \(\ell _{c}\), \(\ell _{s}\) and \(\ell _{e}\) are binary cross-entropy losses with logits. \(\ell _{diou}\) is a temporal Intersection-over-Union (tIoU) based loss that is computed using predicted boundary distances \(v^{n}=(l^{n},r^{n})\) and ground-truth boundaries \(v^{n*}=(l^{n*},r^{n*})\), where we adapt the Distance-IoU loss [47] for temporal segments as

$$\begin{aligned} \ell _{diou} =&1-tIoU+ {d(v, v^{*}) \over {a}^2} \; , \nonumber \\ tIoU =&{\min (l,l^{*})+\min (r,r^{*}) \over {\max (l,l^{*})+\max (r,r^{*})}} \; , \nonumber \\ d(v, v^{*}) =&|r-l-r^{*}+l^{*}| /2 \; , \nonumber \\ a=&{\max (l,l^{*})+\max (r,r^{*})}\; , \end{aligned}$$
(8)

where \(d(\cdot ,\cdot )\) is the Euclidean distance between the centers of the predicted and the ground truth segments and a is the length of the shortest enclosing segment covering the two segments.
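
A direct translation of Eq. (8) into code might read as follows, with segments expressed as (l, r) distances from an anchor snippet.

```python
import torch


def temporal_diou_loss(v, v_star, eps=1e-6):
    """Sketch of the temporal Distance-IoU loss of Eq. (8), adapting
    DIoU [47] to 1D segments; inputs have shape (..., 2)."""
    l, r = v[..., 0], v[..., 1]
    ls, rs = v_star[..., 0], v_star[..., 1]
    inter = torch.minimum(l, ls) + torch.minimum(r, rs)
    union = torch.maximum(l, ls) + torch.maximum(r, rs)
    tiou = inter / union.clamp(min=eps)
    d = (r - l - rs + ls).abs() / 2           # distance between centers
    a = union                                 # shortest enclosing length
    return 1 - tiou + d / a.clamp(min=eps) ** 2
```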

The localization uncertainty branch is jointly trained with the boundary regression branch using \(\ell _{unc}\) that is the negative power log-likelihood loss (NPLL) [45] as

$$\begin{aligned} {\ell _{unc}}=\eta \cdot \bigg [\Big ({\sum \limits _{k{\in \{l,r\}}}}({{(k^*-k)^2} \over {2\sigma _k{^2}}}+{{\log \sigma _k^2}\over {2}})\Big )+ 2\log 2\pi \bigg ], \end{aligned}$$
(9)

where \(\eta \) is either 1 or tIoU score between the predicted and the ground-truth boundaries \(v=(l,r)\) and \(v^*=(l^*,r^*)\), respectively. \(k \in \{l,r\}\) and \(\sigma _k\) is the predicted uncertainty for left or right direction.
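
A sketch of Eq. (9), assuming \(\sigma \) is predicted as a positive scale:

```python
import math

import torch


def uncertainty_loss(v, sigma, v_star, eta=1.0, eps=1e-6):
    """Sketch of the NPLL loss of Eq. (9) [45]: a Gaussian negative
    log-likelihood over the left/right boundary distances, weighted by
    eta (1 or the tIoU); v, sigma, v_star have shape (..., 2)."""
    var = sigma.clamp(min=eps) ** 2
    nll = ((v_star - v) ** 2 / (2 * var) + torch.log(var) / 2).sum(-1)
    return eta * (nll + 2 * math.log(2 * math.pi))
```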

After 15 epochs in the Burn-in stage, we copy the trained weights \(\theta \) to both the teacher and the student models, \((\theta _t \leftarrow \theta , \theta _s \leftarrow \theta )\).

4.2 Mutual learning stage

In the mutual learning stage, the student and teacher models are jointly trained using the EMA [16] strategy. Consistency learning, a self-training technique, constrains model outputs to agree on randomly transformed copies of the data. The technique is therefore widely adopted in SSL to reduce the dependency on limited labeled data. We apply consistency regularization on both the supervised and unsupervised data splits.
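
The EMA update itself is a one-liner per parameter; the decay value below is a common choice and an assumption on our part, since it is not stated here.

```python
import torch


@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    """Sketch of the EMA teacher update [16]: teacher weights track a
    smoothed copy of the student weights; alpha is an assumed decay."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1 - alpha)
```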

On supervised data, while the student model continues to learn using \({{\mathcal {L}}^{snip}_{sup}}\), the teacher model generates targets for the student model on augmented copies of the data. Alongside \(\mathcal L^{snip}_{sup}\), a regularization loss is used with two components, \({\mathcal {L}}^{simcls}_{sup}\) and \({\mathcal {L}}^{simreg}_{sup}\), respectively, given as

$$\begin{aligned} {{\mathcal {L}}^{simcls}_{sup}}=&{1 \over N_{s}} \Big (\sum _n \ell _{con} (p^{nt}_{a}, p^{ns}_{a}) + \sum _n \ell _{con} (p^{nt}_{c}, p^{ns}_{c}) \nonumber \\&+\sum _n \ell _{con} (p^{nt}_{s},p^{ns}_{s}) + \sum _n \ell _{con} (p^{nt}_{e}, p^{ns}_{e}) \Big ) \; , \end{aligned}$$
(10)

where \(\ell _{con}\) is the mean square error loss comparing the softmax activations over the actionness predictions and the sigmoid activations over the centerness and start- and end-boundary predictions of the student and teacher models, \(p^{ns}_{\centerdot }\) and \(p^{nt}_{\centerdot }\), respectively. \(N_s\) is the number of augmented snippet copies in the batch. In addition, there is a regression part with \(\ell ^{simreg}_{sup}\) given as

$$\begin{aligned} \displaystyle {\ell ^{simreg}_{sup}}=&{\left\{ \begin{array}{ll} \ell _{diou} (v^{nt}, v^{ns}) &{} \text {if}\; \sigma ^{nt}+\delta \le \sigma ^{ns} \\ 0 &{}\displaystyle \text {otherwise}\end{array}\right. } \;, \nonumber \\ {\mathcal {L}}^{simreg}_{sup} =&{1 \over N_{s}} \sum _n \ell ^{simreg}_{sup} (v^{ns} , v^{nt}, \sigma ^{ns}, \sigma ^{nt}) \;, \end{aligned}$$
(11)

where \(\delta \ge 0\) is a small margin between the localization uncertainties of the teacher and student models, which we set to 0.01. Following Liu et al. [11], we first remove the boundaries where the student model already has low localization uncertainty, e.g. \(\sigma ^{ns} \le 0.5\). Then, the boundary predictions of the student and the teacher models are compared using \(\ell _{diou}\) given in Eq. (8) if the teacher's certainty is higher than the student's.
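
The gating of Eq. (11) together with this certainty filter can be sketched as follows; it assumes the `temporal_diou_loss` from the Eq. (8) sketch is in scope, and reducing the per-boundary uncertainties to a scalar by their mean is our assumption.

```python
import torch


def regression_consistency(v_s, v_t, sigma_s, sigma_t, delta=0.01, thr=0.5):
    """Sketch of Eq. (11) with the filtering of [11]: skip snippets whose
    student uncertainty is already small (sigma <= thr), then apply the
    temporal DIoU loss only where the teacher is more certain than the
    student by the margin delta."""
    s_unc = sigma_s.mean(-1)
    t_unc = sigma_t.mean(-1)
    keep = (s_unc > thr) & (t_unc + delta <= s_unc)
    if not keep.any():
        return v_s.new_zeros(())
    return temporal_diou_loss(v_s[keep], v_t[keep]).mean()
```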

When unsupervised data is also available, we follow a similar methodology with consistency regularization, but each batch contains both supervised and unsupervised data. Consistency regularization is also applied to the unsupervised-data predictions of the teacher and student models using Eqs. (10) and (11). Finally, the objective function is extended as follows

$$\begin{aligned} \displaystyle {{\mathcal {L}}}&= {{\mathcal {L}}^{snip}_{sup}} + \text {w}^{cls} {{\mathcal {L}}^{simcls}_{sup}} + \text {w}^{reg}{{\mathcal {L}}^{simreg}_{sup}} \nonumber \\&\quad +\, \text {w}^{usup} (\text {w}^{cls}{{\mathcal {L}}^{simcls}_{usup}}+ \text {w}^{reg}{{\mathcal {L}}^{simreg}_{usup}}) \;, \end{aligned}$$
(12)

where we have used three weights, \(\text {w}^{cls}\), \(\text {w}^{reg}\) and \(\text {w}^{usup}\), respectively.

Our model leverages two kinds of augmentations, weak and strong, on both supervised and unsupervised data. The student model is trained on strongly augmented data, while the teacher model receives weakly augmented data. In all of our experiments, weak augmentation is a snippet-dropping strategy with a probability of 5% on the input videos of the teacher model, i.e., 5% of the feature channels are dropped. For strong augmentation, we apply both (i) the snippet-dropping strategy with a probability of 20% and (ii) temporal shifting operations on \(\mu \) randomly chosen feature channels [13, 48] of the input videos of the student model.
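
The two augmentations can be sketched as follows on a (T, D) feature sequence; any mechanics beyond what is stated above (e.g., rolling for the shift) are assumptions.

```python
import torch


def weak_augment(x, drop_p=0.05):
    """Sketch of the weak augmentation (Sect. 4.2): randomly zero a
    fraction of feature channels. x: (T, D) snippet features."""
    mask = torch.rand(x.size(1)) >= drop_p
    return x * mask


def strong_augment(x, drop_p=0.20, mu=64, shift=1):
    """Sketch of the strong augmentation: channel dropping at a higher
    rate plus temporal shifting [13, 48] of mu randomly chosen channels,
    half shifted forward and half backward by `shift` snippets."""
    x = weak_augment(x, drop_p)
    idx = torch.randperm(x.size(1))[:mu]
    fwd, bwd = idx[: mu // 2], idx[mu // 2:]
    x[:, fwd] = torch.roll(x[:, fwd], shifts=shift, dims=0)
    x[:, bwd] = torch.roll(x[:, bwd], shifts=-shift, dims=0)
    return x
```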

5 Proposal inference in multiple scales

During inference, we follow similar inference steps and use the scoring function from our previous study [22]. We generate lists of proposals from the feature maps and merge the lists. For a feature map \(P_i\), we first extract the candidate proposal locations and then score these candidates with a scoring function. Finally, we prune proposals via non-maximum suppression (NMS) and select the top M candidates.

To extract candidate locations, i.e., start and end boundaries, we compute two vectors for each boundary type: \(g_s\) and \(g_e\), the boundary estimates from the boundary regression and uncertainty branches, and \(g'_s\) and \(g'_e\), the boundary estimates from the snippet-based start- and end-boundary prediction branches. Given a snippet n at location t in a video test instance, the boundary regression and uncertainty branches return the estimate \(v^n=(l^n,r^n)\) with uncertainty scores \(\sigma ^n=(\sigma ^n_l, \sigma ^n_r)\), respectively. These values are translated into start boundary scores using the probability density function of \(\mathcal {N}(s^{n}, {\sigma ^n_l})\) within a neighborhood \([s^{n}-{\tau '},s^{n} +{\tau '}]\), where \(s^n=t-l^n\) and \(\tau '\) is a small margin. We similarly generate end boundary scores within a neighborhood \([e^{n}-{\tau '},e^{n} +{\tau '}]\), where \(e^n=t+r^n\). Translating start and end scores for all snippets, the final start and end score vectors, \(g_s\) and \(g_e\), are built by taking the maximum of all start and end scores at each location, respectively.

Concurrently, the start- and end-boundary heads return vectors of predictions with scores \(p^n_s\) and \(p^n_e\) for a snippet n. We prune these vectors by setting scores to zero at locations that are not peaks or whose scores are lower than a threshold value. We thus obtain two vectors of boundary estimates, \(g'_s\) and \(g'_e\), for start and end, respectively. A value \(p^n_s\) is a peak if \(p^n_s > p^{n-1}_s\), \(p^n_s > p^{n+1}_s\) and \(p^n_s>thr\) (similarly for \(p^n_e\)). Lastly, a snippet belongs to the start set, S, if \(g_s(n)+g'_s(n)>0\), and to the end set, E, if \(g_e(n)+g'_e(n)>0\). Having the boundary start locations S and end locations E, we generate \(|S| \times |E|\) candidate proposal locations.
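
The two pathways for one boundary type can be sketched as follows; the neighborhood width and threshold values are assumptions.

```python
import torch
from torch.distributions import Normal


def regressed_start_scores(t, l, sigma_l, tau=2):
    """Sketch of building g_s (Sect. 5): each snippet n proposes a start
    s^n = t_n - l^n and spreads the N(s^n, sigma_l^n) density over a
    small neighborhood; g_s keeps the maximum score per location."""
    T = t.numel()
    g_s = torch.zeros(T)
    for n in range(T):
        s = t[n] - l[n]
        dist = Normal(s, sigma_l[n].clamp(min=1e-3))
        lo, hi = max(0, int(s) - tau), min(T, int(s) + tau + 1)
        for m in range(lo, hi):
            score = dist.log_prob(torch.tensor(float(m))).exp()
            g_s[m] = torch.maximum(g_s[m], score)
    return g_s


def peak_pruned_scores(p, thr=0.5):
    """Sketch of g'_s / g'_e: keep snippet-head scores only at local
    peaks above the threshold, zero elsewhere."""
    g = torch.zeros_like(p)
    for n in range(1, p.numel() - 1):
        if p[n] > p[n - 1] and p[n] > p[n + 1] and p[n] > thr:
            g[n] = p[n]
    return g
```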

Given a proposal candidate m with snippets \(\mathcal {X}^m=\{x_1,\ldots ,x_c,\ldots ,x_n\}\), where \(x_1\), \(x_c\), \(x_n\) are the start, center and end points of the proposal candidate, \(x_1\in S\) and \(x_n\in E\), we devise a scoring function with three components. \(sc_{action}^m\) is the average of the actionness scores of all proposal snippets. Next, \(sc_{center}^m\) is computed over the centerness scores of the start, center and end snippets, where the score is high for a proposal with low centerness scores at the boundaries and a high centerness score in the middle. Finally, \(sc_{bound}^m\) is computed over the start and end scores of the start, center and end snippets, where the score is high for a proposal with low start-end boundary scores at the center and high boundary scores at the edges. Then, we combine the scores with equal weights as \({sc^m}=sc_{action}^m + sc_{center}^m+sc_{bound}^m\) and the components are given as

(13)
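
As a hypothetical instantiation consistent with the component descriptions above, the scoring could look like the following sketch; the concrete product terms are our assumptions, not the paper's exact Eq. (13).

```python
def proposal_score(p_a, p_c, p_s, p_e, start, end):
    """Hypothetical scoring sketch for Sect. 5 (not the exact Eq. (13)):
    average actionness, a centerness term rewarding a high center and
    low boundaries, and a boundary term rewarding high start/end scores
    at the edges and a low one at the center. Inputs are 1D tensors."""
    center = (start + end) // 2
    sc_action = p_a[start:end + 1].mean()                        # sc_action^m
    sc_center = p_c[center] * (1 - p_c[start]) * (1 - p_c[end])  # sc_center^m
    sc_bound = (p_s[start] * p_e[end]
                * (1 - p_s[center]) * (1 - p_e[center]))         # sc_bound^m
    return sc_action + sc_center + sc_bound
```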

After generating candidate proposals, we prune redundant ones using non-maximum suppression (NMS) or soft-NMS to achieve higher recall rates [2, 49].

6 Experimental evaluation

Our goal is to demonstrate the robustness and performance of our method, i.e., TTMT, on the task of generating accurate action proposals on two benchmark datasets, the THUMOS14 [26] and ActivityNet-1.3 [27]. Detailed ablation comparisons on the THUMOS14 dataset are also presented to analyze the model.

6.1 Datasets

THUMOS14 [26]. The dataset includes 1010 and 1574 videos of 20 action categories in the validation and test splits, respectively. Among these videos, 200 validation videos and 212 test videos have temporal annotations of actions. Following the previous studies [1, 2], we conduct our training on the validation set and performance evaluation on the test set.

ActivityNet-1.3 [27]. The dataset consists of 19,994 long-term untrimmed video sequences in 200 action categories. The dataset is split into training, validation and testing subsets with 10,024, 4,926 and 5,044 video samples, respectively. Each video sequence contains one or more actions with annotated segment intervals. We train our model on the training set and evaluate on the validation set.

Table 1 Evaluation of the model TTMT@100% in various Transformer settings on the THUMOS14 dataset

6.2 Visual encodings and training settings

An untrimmed video is represented as a sequence of \(T'\) snippets encoded using pre-trained CNN models, \(F'\in \mathbb {R}^{T' \times D'}\). For THUMOS14, we use the feature encoding precomputed by [20] based on a TSN model pre-trained on Kinetics [50]. During inference, we split each video sequence into overlapping windows of size 128 and stride 64. For ActivityNet, we use the SlowFast features precomputed by [13]. We scale the feature length to \(T'=128\) for all videos.

Following Ji et al. [12] and Wang et al. [13], we split the training data with available labels into labeled and unlabeled subsets. We use three data settings, denoted TTMT@M% where M \(\in \{100,90,60\}\), in which M% of the training data is reserved as labeled data for supervised learning within the temporal teacher pipeline; e.g. TTMT@60% means that our model is trained following the proposed teacher-student framework using 60% of the available data as labeled for supervised training and 40% as unlabeled for unsupervised training. We obtain predictions from the student and teacher models concurrently. Since we have observed that the teacher model outperforms the student model, we report the teacher results throughout the experiments. The predictions are from the best student and teacher models, i.e., those with the lowest validation loss. For both datasets, a learning rate of \(10^{-4}\) is used with a weight decay of \(10^{-9}\). We use the Adam optimizer during training.

For the THUMOS experiments, we use the weight combination \(\text {w}^{cls}=6\), \(\text {w}^{reg}=0.005\) in the TTMT@100% training setting (where \(\text {w}^{usup}=0\)), and \(\text {w}^{usup}=1\) in the TTMT@60% and @90% training settings. For the ActivityNet experiments, we use the weight combination \(\text {w}^{cls}=6\), \(\text {w}^{reg}=0.05\) in the TTMT@100% training setting (where \(\text {w}^{usup}=0\)), and \(\text {w}^{usup}=1\) in the TTMT@60% and @90% training settings. For temporal augmentation, we experimented with various snippet-dropping percentages and temporal shift parameters. Based on our empirical observations, we report results for \(\mu =64\) randomly chosen feature channels, where half of the channels move forward and the other half move backward by a shift amount of 1.

6.3 Proposal generation

Following previous works [1, 2, 4, 5], the proposal generation task is evaluated by means of the Average Recall (AR) and Area Under Curve (AUC) metrics. AR is evaluated under various tIoU thresholds in the range [0.5, 0.95] for ActivityNet and in the range [0.5, 1.0] for THUMOS14, with a step of 0.05. The AUC is calculated using AR under various Average Numbers of Proposals (AN), denoted AR@AN, where AN varies from 0 to 100 for ActivityNet and from 0 to 1000 for THUMOS14.

6.3.1 Proposal generation on THUMOS dataset

We first examine the pyramid setting of the core transformer architecture to see its effect on the performance of the model TTMT@100% in the proposal generation task. Following an incremental strategy, we experiment with up to three pyramid stages with various block numbers and report two-stage results, since we have not observed further improvement with more stages.

Table 1 shows the AR@AN performance with AN varying from 50 to 1000 on the test set with NMS pruning (threshold set to 0.83). Using a single-stage Transformer network with a number of blocks B in the range \([1,\ldots ,11]\), an initial channel depth of 1024 and 8 heads, we observe that while the model TTMT@100% with B[8] shows the highest performance of 45.97 AR@50, the model TTMT@100% with B[7] performs better at higher AN values. We build the pyramid iteratively and extend the models B[7] and B[8] with a second stage. Adding a second stage to the pyramid, the model TTMT@100% with B[7+2] improves over B[7] and the model TTMT@100% with B[8+2] improves over B[8] in AR@50, and TTMT@100% with B[8+2] outperforms all other single-stage models we tested. Evaluations on the THUMOS dataset show that a second resolution stage can help improve AR performance, as we expect from a pyramid structure.

Fig. 3 Evaluation of the model TTMT@100% with B[8+2] in various weight combinations \(\{\text {w}^{cls},\text {w}^{reg}\}\)

Table 2 Comparison of semi-supervised baseline models with our proposal generation models TTMT@60% and TTMT@90%, which use fewer labeled samples, on THUMOS14
Table 3 Comparison of fully-supervised baseline models with our proposal generation model TTMT on the THUMOS14

We also conduct a set of experiments to search for the weight combinations \(\text {w}^{cls}\) and \(\text {w}^{reg}\) given in Eq. (12) over the model TTMT@100% with B[8+2]. The plot in Fig. 3 shows that increasing \(\text {w}^{cls}\) in consistency regularization has a significant effect on performance, while variation in \(\text {w}^{reg}\) has a minor effect.

Comparisons with semi-supervised models Selecting the best transformer settings from TTMT@100%, we examine the performance of the models TTMT@60% and TTMT@90% with B[8] and B[8+2], which use less supervision owing to fewer labeled data than supervised models. As reported in Table 2, we outperform [12, 13] in all metrics except AR@1000, where [13] is better. Performance improvements at low AN values are more important, and we particularly outperform others in AR@50 and AR@100. For instance, we surpass [13] by +1.87 at @60% and +4.91 at @90% in AR@50. Moreover, model B[8+2] outperforms model B[8] in AR@50 due to its multiscale nature at the 90% and 60% settings as well. We observed slightly lower performance than [13] only in AR@1000, with \(-\)1.94 at @60% and \(-\)1.35 at @90%. Both Ji et al. [12] and SSTAP [13] are teacher-student frameworks, but they rest on anchor-based models as the core architecture: the former is built on the BSN [2] proposal generation model, the latter on the BMN [4] model. Using an anchor-free model, we achieve better performance in all AR metrics except AR@1000, where improving performance at low AN values matters most.

Comparisons with fully-supervised models. Similarly, we compare our model TTMT@100% with related fully-supervised approaches. TTMT@100% is trained using all the available labeled data and only the supervised loss components of Eq. (12), i.e., \(\text {w}^{usup} = 0.0\). Table 3 reports our results in comparison to other studies. We observe better results than the other fully-supervised methods except in AR@1000, where the performance of the model TTMT@100% with B[8+2] is lower than [13] by \(-0.64\) and [52] by \(-2.03\), respectively.

Table 4 Comparison with some state-of-the-art fully-supervised and semi-supervised anchor-based models on the ActivityNet

6.3.2 Proposal generation on ActivityNet dataset

On the ActivityNet dataset, we apply the same iterative strategy to explore the pyramid settings as for the THUMOS dataset. Examining the model TTMT@100% with B in the range \([1,\ldots ,11]\), an initial channel depth of 1024 and 8 heads, we observe that B[1] and B[4] show comparable and the best performance. Increasing the number of pyramid stages brought no improvement, so we do not report those results here. Performance in the AR and AUC metrics is reported in Table 4 in comparison with some state-of-the-art studies. Most state-of-the-art results, except the models of Ji et al. [12] and SSTAP [13], are obtained in a fully-supervised setting without a teacher-student framework. Moreover, all the models reported in Table 4, including Ji et al. [12] and SSTAP [13], are anchor-based. Our model TTMT@100% performs better than CTAP [1], BSN [2], MGG [5] and BMN [4], while both TTMT@100% and TTMT@60% show competitive performance with the anchor-based semi-supervised models of Ji et al. [12] and SSTAP [13]. Both Ji et al. [12] and SSTAP [13] are teacher-student frameworks, and they rely on the anchor-based BSN [2] and BMN [4] architectures, respectively.

We report the ActivityNet results for two settings in which we modify \(\eta \) in \(\ell _{unc}\) [see Eq. (9)], setting \(\eta =tIoU\) or \(\eta =1\). The results are similar, with minor variations.

Table 5 Evaluation of the core model on the THUMOS14 dataset in comparison with the model integrated into the teacher-student framework
Table 6 Evaluation of different mask integrations within the model TTMT@100% with B\([8+2]\) on the THUMOS14 dataset

Computational analysis With our current implementation, we analyzed the average network inference time using an Nvidia Tesla P100 graphics card on a sample of 600 videos from the ActivityNet dataset. Following Lin et al. [2, 4] and Tan et al. [34], we exclude the computation of the backbone feature extractor, since its features are pre-computed. As mentioned in Sect. 3.3, the channel depth \(D_i\) is set to 1024, and network inference took an average of 0.0018 s for the model TTMT@100% B[1]. In this study, we inherited the MViTv2 implementation. The standard attention unit has quadratic complexity in computation and memory [23]. Some recent works aim to reduce this quadratic time complexity to make transformers more efficient, with linear time [53, 54]. Although our proposed bi-directional mask strategy is slightly slower due to the forward and backward computations, doubling the attention branches maintains the same asymptotic computational complexity.

The number of parameters is another measure of model complexity. Since our architecture includes pooling layers within the pooling attention unit, the total number of learnable parameters depends on \(D_i\). If \(D_i\) is set to 256, the total number of learnable parameters for the model TTMT@100% B[1] is 4.8M; if \(D_i\) is set to 1024, the total for the same model is 25.7M. As stated in Sect. 3.3, we use \(D_i=1024\) in all our experiments on both benchmarks. The BMN [4] model integrated into SSTAP [13] has 5.7M learnable parameters when the channel depth is set to 256. In addition, the Transformer-based model RTD-Net contains a total of 32.1M learnable parameters. However, we refrain from a direct comparison, because the various architectures rely on different submodel structures at different network depths or on different hyperparameters, and the tuning of these parameters affects performance differently in each model.

6.4 Ablation studies on THUMOS dataset

A set of ablation studies are conducted to further investigate the proposed teacher-student transformer network. We examine (i) the impact of teacher-student training over traditional fully-supervised training, (ii) the impact of uni-directional and bi-directional masks on the proposal generation task, (iii) the impact of each component in the scoring function, and (iv) the impact of two pathways for extracting candidate boundaries.

6.4.1 Impact of teacher-student framework

Under the same evaluation settings, we examine the performance of the core encoder-only Masked Transformer model introduced in Sect. 3 without integration into the teacher-student framework. We conducted the experiments for B[8] and B[8+2], keeping the setting used in the model TTMT@100% (i.e., the student model in the burn-in stage has a setting equivalent to that of the core model). As reported in Table 5, we obtain significant improvement with the TTMT framework over the core Transformer model in all AR metrics, e.g., the model TTMT@100% with B\([8+2]\) improves by 5.38 in AR@50 over B\([8+2]\). In TTMT, we have two competing models where the teacher model (EMA model) is trained smoothly over the student model weights, and we apply pseudo-labeling and consistency regularization. This shows that integration into the teacher-student framework helps improve the performance of the core model.

Table 7 Evaluation of the proposed scoring function with actionness \(sc_{action}\), centerness \(sc_{center}\) and boundary \(sc_{bound}\) components with the model TTMT@100% with B\([8+2]\) on the THUMOS14 dataset (see Sect. 5)
Table 8 Evaluation of g and \(g'\) on the proposed model TTMT@100% (see Sect. 5)

6.4.2 Impact of bi-directional masks

To see the impact of the masking strategy introduced in Sect. 3.2, we examine the performance of the model TTMT@100% with B[8+2] using different mask structures. We explore the TTMT model using the None, Bidirectional-GL, Bidirectional-G, Bidirectional-L, Backward-G and Forward-G mask structures. None is equivalent to using the original pooling attention unit [24] without any mask. Bidirectional-G and Bidirectional-L use a bi-directional pooling attention unit with two global (G) and two local (L) masks, respectively, in the forward and backward directions. While the masks in Bidirectional-G cover the whole video, the masks in Bidirectional-L cover at most T/2 of the neighborhood of the entities and disable interactions with the rest of the snippets (i.e., \(M^{f}_{ij}\) is also \(-\infty \) if \(i<(j-T/2)\), and \(M^{b}_{ij}\) is also \(-\infty \) if \(i>(j+T/2)\)). We also experiment with Bidirectional-GL, which includes 4 branches in the pooling attention unit with four masks: two local (L) and two global (G) masks in the forward and backward directions.

As given in Table 6, Bidirectional-G outperforms the other cases in all AR metrics, both Bidirectional-G and Bidirectional-L are better than the None case, and G masks result in better performance than L masks. Investigating the uni-directional versions of the pooling attention unit, the results show that both Bidirectional-G and Bidirectional-L perform better than the uni-directional Forward-G and Backward-G versions (the uni-directional versions contain a global mask in a specific direction). This suggests using bi-directional masks over uni-directional ones for video evaluation when an offline evaluation setting is possible. Moreover, the results verify the benefits of masked Transformer models for temporal video evaluation.

Table 9 Comparison of the detection result with mAP@tIoU in various tIoU values [20]

6.4.3 Impact of scoring function

To see the impact of each component of the scoring function given in Eq. (13), we examine the individual components as well as their combinations. Table 7 presents the AR performance, and we see that the combined score brings a significant improvement over the other combinations. A weighting strategy could also be applied at this stage of inference to improve performance, but here we simply add the computed scores and obtain convincing results.

As can be seen from the combined results, each component of the scoring function contributes effectively to the overall score. This emphasizes that our snippet-based structure requires a well-designed scoring function with strong components to perform at its best. The function we introduce here gives good results on our snippet-based prediction structure; removing some parts causes a decrease in performance. Moreover, a better-designed scoring function can further improve performance, while a poorly designed one can degrade it.

6.4.4 Impact of branches on boundaries

As discussed in Sect. 5, we compute the boundaries of candidate proposals via two pathways, g and \(g'\). We conduct a set of experiments to see the impact of each pathway on the boundary predictions and report the inference performance in Table 8. We observe that boundary estimation via \(g'\) has benefits over g in AR@50 and AR@100, while g results in better inference at higher AR@AN metrics.

6.5 Temporal action localization

To examine the performance for action detection, the Mean Average Precision (mAP) is calculated with tIoU threshold values in the range [0.3, 0.7] with a step of 0.1 for the THUMOS14 dataset. Following other two-stage detection pipelines [2, 4], we first create the top 200 proposals using our model on THUMOS14, and we then use the UntrimmedNet (UNet) model [18] to obtain video-level classification results, keeping the top-2 classes for each video. Finally, we compute a detection score for each proposal using the proposal score from our TTMT network and the classification score from UNet. In particular, the final scores for the top-2 action categories of each proposal are calculated by simply multiplying the proposal scores [see Eq. (13)] and the UNet scores. Comparative results are shown in Table 9. Using the same classifier, i.e., UNet, our TTMT@100% model outperforms many state-of-the-art anchor-based architectures (e.g. [2, 5, 13]) in high tIoU settings.

7 Conclusion

In this paper, we incorporate a new anchor-free proposal generation model into a teacher-student framework for a semi-supervised two-stage detection pipeline. We apply pseudo-labeling techniques for classification and regression to improve the generated proposals, and we integrate relative teacher-student uncertainties to select effective pseudo-labels in the proposed anchor-free model. We further provide a detailed evaluation of the Masked Transformer network within the teacher-student framework. The proposed Transformer-based model is designed for modeling temporal ordering with a lighter structure than anchor-based alternatives, and the architecture can be extended with many local predictors by simply integrating them into the pyramid network branch.

We show that our transformer-based anchor-free SSL method can achieve performance comparable to state-of-the-art anchor-based methods, in addition to its many architectural benefits. We find that our model benefits from uncertainty estimations and that a good scoring function for merging local estimates is necessary for good performance.