1 Introduction

In recent years, the widespread use of digital devices has led to an exponential growth in video data, creating a need for efficient methods to automatically analyze and understand actions in these videos. Conventional video analysis methods, while effective, often lack real-time responsiveness. Online Action Detection (OAD) [1] aims to classify ongoing actions in real-time streaming videos and plays a pivotal role in real-time video analysis for applications such as autonomous driving [2] and video surveillance [3].

The OAD task poses a significant challenge, primarily due to the real-time constraint. The crux of the challenge lies in striking a delicate balance between achieving optimal performance and maintaining computational efficiency.

A typical OAD model first extracts features at the snippet level and then predicts the action class over the feature sequence. Early works [4,5,6,7,8,9] in online action detection utilized Recurrent Neural Networks (RNNs) to model historical information. However, RNN-based models suffer from non-parallel training and vanishing gradients [10, 11], resulting in training difficulties and sub-optimal performance.

Recent works [12,13,14,15] have introduced the Transformer architecture [16] to address these issues. Despite its benefits, the real-time limitation in OAD restricts the Transformer’s ability to handle long context lengths, potentially missing critical long temporal dependencies in streaming video. Additionally, previous works often assign the label of a specific time point to the entire snippet, which is arbitrary and leaves room for improvement.

To address these issues, we propose an online action detection model that leverages the emerging long-sequence model RWKV [17] to enhance both performance and efficiency. RWKV stands out as an innovative and promising model with both Transformer-like and RNN-style formulations. This unique characteristic enables the model to be trained like a Transformer while performing inference like an RNN, thus achieving an optimal balance between performance and efficiency. In adapting RWKV to the OAD task, we employ the Laplace activation to tailor it specifically to OAD requirements. Additionally, we introduce a temporal label smoothing technique for online action detection.

In summary, our primary contribution involves the adaptation of the RWKV architecture for the OAD task. Additionally, we introduce the temporal label smoothing technique to further enhance robustness.

We conducted experiments on two widely used datasets: THUMOS’14 [18] and TVSeries [1]. Our model achieves state-of-the-art performance of 71.8% mAP on THUMOS’14 and 89.7% mcAP on TVSeries. In terms of efficiency, our model alone can run at 600+ FPS, making it applicable to real-time online action prediction. Notably, when using only RGB features, the overall system can run at 200+ FPS even on a CPU while still retaining 59.9% mAP on THUMOS’14, which makes it possible to deploy on edge computing devices.

2 Related Works

Online Action Detection In the field of online action detection, several notable works have contributed significant advancements. RED [6] introduces a reinforcement loss function to promote early action detection. TRN [7] performs online action detection and anticipation simultaneously, utilizing the predicted future information to improve performance. IDN [8] incorporates an information discrimination unit to selectively accumulate information relevant to the present action. PKD [9] utilizes curriculum knowledge distillation to transfer knowledge from offline models. Colar [19] employs an exemplar-consultation mechanism to compare similarities and aggregate relevant information. LSTR [13] proposes a long- and short-term memory mechanism based on the Transformer to efficiently model video sequences. GateHub [14] designs a gated history unit to enhance relevant history information.

Transformer-based models [12,13,14,15] have demonstrated promising performance in online action detection. However, the quadratic computational complexity of self-attention brings significant computational cost, especially when dealing with long video streams. OadTR [12] keeps only a few recent frames to reduce computational demands. LSTR improves efficiency by compressing long-term memories with Perceiver [20], and TesTra [15] further improves efficiency by replacing the first layer of the LSTR encoder with linear attention [21]. However, LSTR and TesTra still have a fixed context length, and excessive token reduction can result in the loss of critical information.

Online Detection of Action Start Online Detection of Action Start (ODAS) is a task closely linked to online action detection, focusing on identifying the precise starting point of an action instance and minimizing the time gap between the actual start time and the prediction. Few works specifically focus on this task. Shou et al. [22] first introduced the task and proposed three methods for training an ODAS model. StartNet [23] decomposes ODAS into two stages, action classification and start point localization, to address the challenges of subtle appearance changes near action starts and limited training data. While WOAD [24] and SCOAD [25] also conduct experiments on this task, they were originally designed for weakly supervised online action detection.

Efficient Long Sequence Modeling While the Transformer model has demonstrated remarkable capability in handling long-distance dependencies, it is hindered by the quadratic computational complexity of self-attention. To address this challenge, a variety of approaches have been proposed. Some focus on optimizing the attention mechanism, employing techniques such as sparse self-attention [26], kernelization [21], low-rank approximations [27], and other methods [28, 29]. Other researchers explore alternative modules to replace attention. MLP-Mixer [30] replaces attention with Multilayer Perceptrons (MLPs), while the Attention Free Transformer (AFT) [31] introduces a computationally efficient alternative to the traditional dot-product self-attention mechanism. Inspired by AFT, RWKV [17] simplifies the interaction weights to enable an RNN-style implementation for inference. For a comprehensive overview of efficient Transformer variants, see the survey by Tay et al. [32]. Additionally, some approaches modify recurrent neural networks (RNNs) to increase context length, such as the Recurrent Memory Transformer [33], the Linear Recurrent Unit [34], and state space models (SSMs) [35,36,37,38]. These techniques offer alternative strategies to enhance the context modeling capabilities of sequence models.

3 Approach

Fig. 1 Overview of our model. Raw video frames are initially processed using a pretrained feature extractor to obtain a feature sequence. The feature sequence is then fed into the RWKV model to capture temporal dependencies. Finally, a fully connected layer (classifier) generates the action class prediction

OAD aims to recognize actions in a video stream with only current and historical information. Mathematically, an OAD model provides an action class probability vector \(\varvec{\hat{y}}_t\in \mathbb {R}^k\) for the current frame \(\varvec{f}_t \in \mathbb {R}^s\) at time t, based on the sequence of current and historical frames \(\{\varvec{f}_1,\varvec{f}_2,\dots ,\varvec{f}_t\}\). Here, k represents the number of action classes, including the background class, and s denotes the size of the input frame.

3.1 Base Model

The overall architecture of our model is illustrated in Fig. 1. To complete the OAD task, we employ a pretrained feature extractor [39] to process each video frame \(\varvec{f}_t\) into a feature vector \(\varvec{x}_t\in \mathbb {R}^d\) of d dimensions. These feature vectors are then fed into our model.

We utilize the RWKV model to capture the temporal dependencies, which offers a balance between performance and efficiency. An RWKV model comprises multiple RWKV layers, each containing a time-mixing block and a channel-mixing block, as illustrated in Fig. 1. In the time-mixing block, we employ the original RWKV formulas:

$$\begin{aligned} r_t&= W_r \cdot (\mu _r x_t + (1 - \mu _r) x_{t-1} ), \end{aligned}$$
(1)
$$\begin{aligned} k_t&= W_k \cdot (\mu _k x_t + (1 - \mu _k) x_{t-1} ), \end{aligned}$$
(2)
$$\begin{aligned} v_t&= W_v \cdot (\mu _v x_t + (1 - \mu _v) x_{t-1} ), \end{aligned}$$
(3)
$$\begin{aligned} wkv_t&= \frac{ \sum _{i=1}^{t-1} e^{-(t-1-i)w+k_i} v_i + e^{u+k_t} v_t }{\sum _{i=1}^{t-1} e^{-(t-1-i)w+k_i} + e^{u+k_t}}, \end{aligned}$$
(4)
$$\begin{aligned} o_t&= W_o \cdot (\sigma (r_t) \odot wkv_t), \end{aligned}$$
(5)

where equations (1)-(3) represent the token shift operation, which aids the model in propagating information. Equation (4) plays the role of cross-attention in standard Transformers. In this formulation, the model can be trained in a parallel manner, similar to Transformers. Equation (4) can also be easily rewritten in a recurrent form with linear computational cost, potentially enabling the model to handle unbounded context lengths and effectively capture long-term dependencies in streaming video.
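
To make the recurrence concrete, the following is a minimal PyTorch sketch of the RNN-style evaluation of Eq. (4), assuming a per-channel decay \(w\) and bonus \(u\) as in [17]; variable names are illustrative, and the released RWKV kernels additionally maintain a running maximum state for numerical stability, which this sketch omits.

```python
import torch

def wkv_recurrent(k, v, w, u):
    """Recurrent evaluation of Eq. (4).
    k, v: (T, d) per-step keys/values; w, u: (d,) learned decay and bonus."""
    T, d = k.shape
    num = torch.zeros(d)   # running numerator:   sum_i e^{-(t-1-i)w + k_i} v_i
    den = torch.zeros(d)   # running denominator: sum_i e^{-(t-1-i)w + k_i}
    out = torch.empty(T, d)
    for t in range(T):
        e_kt = torch.exp(u + k[t])               # bonus term for the current token
        out[t] = (num + e_kt * v[t]) / (den + e_kt)
        # fold the current token into the history with per-step decay e^{-w}
        num = torch.exp(-w) * num + torch.exp(k[t]) * v[t]
        den = torch.exp(-w) * den + torch.exp(k[t])
    return out
```

Because only the pair (num, den) needs to be stored between steps, the per-frame cost and memory at inference time remain constant no matter how long the stream runs.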

The channel-mixing block serves the role of the feed-forward network (FFN) in standard Transformers and is given by

$$\begin{aligned} r_t&= W_r \cdot (\mu _r x_t + (1 - \mu _r) x_{t-1} ), \end{aligned}$$
(6)
$$\begin{aligned} k_t&= W_k \cdot (\mu _k x_t + (1 - \mu _k) x_{t-1} ), \end{aligned}$$
(7)
$$\begin{aligned} o_t&= \sigma (r_t) \odot (W_v \cdot f_{\textrm{laplace}}(k_t)), \end{aligned}$$
(8)

where we replace the original squared ReLU activation [40] with the Laplace activation [28]:

$$\begin{aligned} f_{\textrm{laplace}}(x)=0.5\times \left[ 1+\textrm{erf}\left( \frac{x-\mu }{\sigma \sqrt{2}}\right) \right] , \quad \mu =1/\sqrt{2}, \ \sigma =1/\sqrt{4\pi }. \end{aligned}$$
(9)

The Laplace activation is an approximation of the squared ReLU activation with both a bounded range and a bounded gradient, which addresses the gradient explosion issue [28]. We find that this modification increases the stability of the model. After the RWKV layers, a fully connected layer (classifier) generates the action class prediction for the current time.
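
For illustration, a minimal PyTorch sketch of the channel-mixing block (Eqs. 6-8) with the Laplace activation of Eq. (9) is given below; the module name, parameter initializations, and hidden size are our own assumptions rather than the authors' implementation.

```python
import math
import torch
import torch.nn as nn

def laplace(x, mu=math.sqrt(0.5), sigma=math.sqrt(1.0 / (4.0 * math.pi))):
    # Eq. (9): a bounded, smooth approximation of the squared ReLU
    return 0.5 * (1.0 + torch.erf((x - mu) / (sigma * math.sqrt(2.0))))

class ChannelMix(nn.Module):
    def __init__(self, d, hidden):
        super().__init__()
        # token-shift mixing coefficients; the initialization here is illustrative
        self.mu_r = nn.Parameter(torch.full((d,), 0.5))
        self.mu_k = nn.Parameter(torch.full((d,), 0.5))
        self.W_r = nn.Linear(d, d, bias=False)
        self.W_k = nn.Linear(d, hidden, bias=False)
        self.W_v = nn.Linear(hidden, d, bias=False)

    def forward(self, x, x_prev):
        # Eqs. (6)-(7): token shift mixes the current and previous features
        r = self.W_r(self.mu_r * x + (1 - self.mu_r) * x_prev)
        k = self.W_k(self.mu_k * x + (1 - self.mu_k) * x_prev)
        # Eq. (8): sigmoid gate applied to the Laplace-activated projection
        return torch.sigmoid(r) * self.W_v(laplace(k))
```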

3.2 Temporal Label Smoothing

Action boundaries are less explicit than object boundaries, and features near action boundaries are highly similar, which may confuse the model. We therefore propose a temporal smoothing technique to refine the ground-truth labels. Given a ground-truth label G(t), a general temporal smoothing operation is defined as

$$\begin{aligned} G^*(t) = \int ^{+\infty }_{-\infty } \phi (\tau )\cdot G(t-\tau )\textrm{d}\tau . \end{aligned}$$
(10)

Here, we choose the Gaussian function as the kernel \(\phi (\tau )\)

$$\begin{aligned} \phi (\tau )= \frac{1}{\sigma \sqrt{2\pi }} \exp \left[ -\frac{(\tau - \mu )^2}{2\sigma ^2} \right] . \end{aligned}$$
(11)

This kernel assumes that the errors in the ground-truth labels follow a normal distribution. By applying Gaussian smoothing, nearby time points contribute to the refined label \(G^*(t)\) in a continuous and gradual manner, effectively reducing the impact of label uncertainty and ambiguity near action boundaries. For a video snippet, the calibrated ground-truth label is obtained by integrating over time. We use \(G^*(t)\) to train our model, which makes it more robust to ambiguous boundaries and similar features near action boundaries (Fig. 2).
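
In practice, the integral in Eq. (10) reduces to a discrete 1-D convolution of the per-frame labels with a truncated Gaussian kernel. The sketch below shows one possible implementation, assuming a centered kernel (\(\mu =0\)); the kernel width \(\sigma \) (in frames) is a hypothetical choice rather than the paper's setting.

```python
import torch
import torch.nn.functional as F

def temporal_label_smoothing(labels, sigma=2.0):
    """labels: (T, K) one-hot per-frame labels; returns smoothed (T, K) targets."""
    radius = int(3 * sigma)                              # truncate the kernel at 3 sigma
    taus = torch.arange(-radius, radius + 1, dtype=torch.float32)
    kernel = torch.exp(-taus ** 2 / (2 * sigma ** 2))    # Eq. (11) with mu = 0
    kernel = kernel / kernel.sum()                       # keep each row summing to ~1
    # convolve every class channel over time with the shared 1-D kernel
    x = labels.t().unsqueeze(1).float()                  # (K, 1, T)
    smoothed = F.conv1d(x, kernel.view(1, 1, -1), padding=radius)
    return smoothed.squeeze(1).t()                       # back to (T, K)
```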

Fig. 2 Illustration of Temporal Label Smoothing (TLS). This diagram illustrates the application of a Gaussian smoothing kernel to refine ground-truth action labels over time. The Gaussian kernel transforms the ground truth from a binary (0-1) function into a gradual and continuous representation

3.3 Training Model

For data augmentation, we utilize a technique similar to temporal VideoMix [41] and randomly stack an average of 3 video clips as one input sample during training. To train our model, we employ the cross-entropy loss between the predicted probability vector \(\varvec{\hat{y}}_t\) and the ground-truth label \(\varvec{y}_t \in \mathbb {R}^k\) at each frame. The per-frame loss \(\mathcal {L}_t\) is calculated as follows:

$$\begin{aligned} \mathcal {L}_t = -\sum _{i=1}^{k} y_t^i \log \hat{y}_t^i, \end{aligned}$$
(12)

where \(y_t^i\) and \(\hat{y}_t^i\) represent the i-th element of the ground-truth and predicted vectors, respectively, at time t. The total loss is obtained by summing over all frames:

$$\begin{aligned} \mathcal {L} = \sum _{t=1}^{T} \mathcal {L}_t, \end{aligned}$$
(13)

where T represents the total number of frames in the video.
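
A compact PyTorch sketch of the training objective in Eqs. (12)-(13), written for soft (temporally smoothed) targets; tensor names are illustrative.

```python
import torch

def oad_loss(logits, soft_labels):
    """logits: (T, K) classifier outputs; soft_labels: (T, K) smoothed targets."""
    log_probs = torch.log_softmax(logits, dim=-1)
    per_frame = -(soft_labels * log_probs).sum(dim=-1)   # Eq. (12)
    return per_frame.sum()                               # Eq. (13): sum over frames
```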

4 Experiments

4.1 Datasets and Evaluation Metrics

THUMOS’14 The THUMOS’14 dataset includes 413 untrimmed videos with 20 action classes for online action detection. The dataset is split into 200 training videos and 213 testing videos according to the public data split. In line with previous studies [12,13,14,15, 23,24,25], we evaluate OAD using per-frame mean Average Precision (mAP) and evaluate ODAS using point-level mean Average Precision (pAP) [22].

Most existing works follow the two-stream paradigm [42], utilizing both optical flow and RGB features to achieve optimal performance. However, this leads to high computational costs due to optical flow computation. Hence, we also evaluate the performance of OAD using mAP solely with RGB features. This evaluation allows us to specifically focus on scenarios where computational efficiency is a critical factor to consider.

TVSeries The TVSeries dataset includes 27 untrimmed videos (16 hours in total) with 30 everyday action classes, collected from 6 TV series. This dataset was originally designed for OAD and introduces the calibrated Average Precision (cAP) [1] as the evaluation metric. Therefore, we report the mean cAP (mcAP) over all classes on the TVSeries dataset.

4.2 Implementation Details

We implemented our model using PyTorch [43] and conducted all experiments on a system equipped with an Intel i9-12900 CPU and an NVIDIA RTX 3090 graphics card. We followed the approach outlined in [13,14,15] to preprocess the videos, converting all videos to 24 frames per second (FPS) and then extracting features at 4 FPS. For feature extraction, we utilized the TSN models pretrained on the Kinetics-400 dataset [44], as implemented in the MMAction2 framework [45]. The TSN model is a two-stream network, where we employed ResNet-50 [46] for RGB features and BN-Inception [47] for optical flow features.

Regarding the RWKV model, we set the hidden dimension size to 512 and used 4 RWKV layers. Instead of the initialization proposed in RWKV, we employed Kaiming Initialization [48] for our model.

During the training process, we used a batch size of 1 since multiple video clips were already stacked together. The model was trained for 30 epochs using the Adam optimizer [49] with a weight decay of 9e-8 and a base learning rate of 6e-4. The learning rate was linearly increased from zero to 6e-4 during the first 10 epochs and then reduced to zero following a cosine function.
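
For reference, the warmup-plus-cosine schedule described above can be sketched with PyTorch's LambdaLR as follows; the optimizer line mirrors the hyperparameters stated above, while the function and variable names are our own.

```python
import math
import torch

def make_scheduler(optimizer, warmup_epochs=10, total_epochs=30):
    def lr_lambda(epoch):
        if epoch < warmup_epochs:                    # linear warmup from zero
            return epoch / warmup_epochs
        progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay to zero
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Example usage (model construction omitted):
# optimizer = torch.optim.Adam(model.parameters(), lr=6e-4, weight_decay=9e-8)
# scheduler = make_scheduler(optimizer)   # call scheduler.step() once per epoch
```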

4.3 Main Results

Online Action Detection We present a comparison of our method with recent works on the OAD task in Table 1. All the compared models utilize TSN features pretrained on the Kinetics-400 dataset as input. Notably, our model achieves 71.8% mAP on THUMOS’14 and 89.7% mcAP on TVSeries, surpassing the performance of all previous methods. Moreover, in the RGB-only setup, our model achieves a remarkable mAP of 59.9%, effectively narrowing the performance gap between the two-stream method and the RGB-only method. These results highlight the efficacy and competitiveness of our proposed approach in online action detection.

Table 1 Online action detection results on THUMOS’14 and TVSeries using TSN features pretrained on Kinetics-400

Online Detection of Action Start We present a comparison of our method with recent works on the ODAS task in Table 2. Both StartNet and our model utilize the TSN feature extractor, while WOAD and SCOAD employ the I3D [50] feature extractor. It is worth noting that both feature extractors, TSN and I3D, are pretrained on the Kinetics-400 dataset and demonstrate similar performance characteristics, ensuring a fair comparison.

Our model consistently achieves the best results across various time thresholds, demonstrating superior performance in most cases. Even in scenarios where our model does not secure the top position, it still achieves the second-best results. These outcomes highlight the effectiveness and competitiveness of our proposed approach compared to state-of-the-art ODAS methods.

Table 2 Online detection of action start results on THUMOS’14 using feature extractor pretrained on Kinetics-400

4.4 Ablation Study

We conducted ablation experiments on the THUMOS’14 dataset to analyze the performance of our model. We evaluated four different setups, as follows:

Baseline This setup represents the model with raw RWKV layers and serves as the baseline for comparison.

Baseline+LA In this setup, we replaced the original squared ReLU activation in RWKV with the Laplace activation to examine the impact of this modification.

Baseline+TLS We used the proposed temporal smoothed ground-truth labels to train the baseline model to assess the effectiveness of temporal label smoothing.

Baseline+LA+TLS This setup represents our final proposed method, where we combined the Laplace activation and temporal label smoothing.

Table 3 presents the performance comparison of these different methods on the THUMOS’14 dataset, demonstrating the effectiveness of each component in improving the model’s performance.

Table 3 Ablation experiments on THUMOS’14 evaluating the impact of Laplace Activation (LA) and Temporal Label Smoothing (TLS)

4.5 Efficiency Analysis

We compare the efficiency of our method with TRN, LSTR and TesTra in terms of online inference. To ensure a fair comparison, we re-ran all models on our platform under identical conditions. The results are presented in Table 4. Our model outperforms all other models in terms of speed; specifically, it runs more than twice as fast as the previous state-of-the-art method, TesTra, with a slight 0.6% improvement in performance. Additionally, we observe that our method experiences a minimal drop in speed when running on a pure CPU platform, achieving an impressive 206.2 FPS for end-to-end online inference with RGB features only. This suggests that our model is suitable for deployment on low-end platforms, such as edge computing devices.

Table 4 Running speed in frames per second (FPS) comparison

5 Conclusion and Future Work

In this study, we proposed a novel model for online action detection, which combines the RWKV architecture with temporal label smoothing. The RWKV architecture seamlessly integrates the advantages of the Transformer’s long context length and RNN’s efficient inference, creating a model with an optimal balance between performance and efficiency. This characteristic is particularly well-suited for OAD tasks, where the demand for long context length aligns with the necessity for bounded inference time. Moreover, the incorporation of Laplace activation and temporal label smoothing further enhances the model’s robustness.

Our model achieved significant improvements in both performance and efficiency. Experimental results on the THUMOS’14 and TVSeries datasets showcased the superiority of our approach, surpassing state-of-the-art methods in terms of mAP and mcAP. Moreover, our model demonstrated impressive inference speed, running more than twice as fast as TesTra while maintaining competitive accuracy. Notably, our model exhibited remarkable efficiency even on CPU platforms, enabling deployment on resource-constrained devices.

In future work, it would be valuable to explore alternatives to optical flow computation or ways to reduce its reliance, as it is the running speed bottleneck for the overall system.