1 Introduction

Human action recognition (HAR) is the computer vision task of identifying human actions from a series of observations. Owing to its wide range of applications in intelligent video surveillance [1, 2], robotics [3], video storage and retrieval, smart home monitoring, entertainment and autonomous driving vehicles, HAR has gained significant popularity in the field of video analytics. HAR relies on computational algorithms to identify and understand human actions [4].

With advances in computational technology, deep learning has replaced traditional machine learning in many computer vision tasks, employing multiple layers of artificial neural networks to achieve state-of-the-art (SOTA) accuracy in tasks such as facial recognition and object detection.

Despite the extensive research conducted in the field of HAR, numerous challenges remain unaddressed. HAR from raw videos poses a significant challenge because the model must identify actions from a series of observations. Accurate prediction requires both spatial and temporal information, resulting in a higher computational demand than other computer vision tasks [5, 6] that require only spatial information. Consequently, HAR models tend to be complex. In the past, researchers relied on designing hand-crafted feature extractors to encode the features needed to obtain precise motion representations from video sequences, aiming to enhance the accuracy of HAR models [7,8,9]. Nevertheless, hand-crafted feature extraction methods have limitations: they rely heavily on human insight and cannot automatically adapt to new data. Consequently, their applicability in real-world scenarios, which are often dynamic and ever-changing, is very limited.

Convolutional Neural Networks (CNNs) play a crucial role in deep learning and are used extensively in HAR models, as they can learn human action features directly from video data without any hand-crafted feature pre-processing [10]. Currently, the most popular HAR methods include two-stream networks, 3D CNNs, and Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks. These methods achieve commendable performance, but their computational requirements are high, especially when dealing with long untrimmed videos. Consequently, researchers have shifted their focus towards developing efficient HAR models based on 2D CNNs.

This paper expands our initial work [11] to showcase the comprehensive performance of our proposed 2D CNN-based model, the Context-Aware Memory Attention Network (CAMA-Net), which is specifically designed for HAR. CAMA-Net eliminates the need for optical flow computation and 3D convolution. We conduct additional extensive experiments on different public datasets, namely ActivityNet [12], Diving48 [13], HMDB-51 [14] and UCF-101 [15], to show that our model is robust across datasets with many different activities. On all the datasets, the proposed model outperforms the SOTA baselines. In addition, we perform further ablation studies to showcase the contributions of the various components of CAMA-Net and provide insight into the inference speed gap between 2D CNN, 3D CNN and two-stream based HAR models. We also provide a detailed survey of the related work.

The contributions of our paper can be summarized as follows:

  • We introduce a novel HAR model, named Context Aware Memory Attention Network (CAMA-Net), which does not rely on computationally intensive optical flow computation or 3D convolution.

  • The Context Aware Memory Attention (CAMA) module in CAMA-Net computes relevance scores between the key and value pairs obtained from the backbone output, enabling the proposed model to learn a more discriminative spatio-temporal representation for action recognition.

  • We comprehensively evaluate the performance and robustness of CAMA-Net across four widely-used datasets: ActivityNet [12], Diving48 [13], HMDB-51 [14] and UCF-101 [15]. These datasets have different video lengths and different action classes.

  • The experimental results validate the competitive performance of CAMA-Net compared to state-of-the-art methods in the field of HAR and demonstrate its robustness across various datasets.

2 Related works

2.1 Deep learning based action recognition

Over the past few years, deep learning models have emerged as the preferred approach for action recognition tasks. This is primarily due to their ability to extract high-level features from input data, which is in stark contrast to the comparatively rigid and less adaptable nature of hand-crafted feature methods.

At present, the predominant approaches in HAR utilize two-stream networks [16,17,18]. In these networks, one stream takes RGB frames as input, extracting appearance information, while the other stream employs optical flow as input, capturing motion information. Optical flow, which recovers pixel-level motion from variations in brightness patterns within spatial-temporal images [19,20,21], is used to effectively track the movement of objects.

Motion representation is thus one of the most important components of the action recognition task. The methods in [16,17,18] use optical flow to represent short-term motion, and many works use it as an additional input source, yielding significant improvements in action recognition performance compared to using only the raw data. Popular optical flow computation approaches [22,23,24] pre-compute the optical flow out-of-band and store it, which is inefficient. To address this inefficiency, some recent works accelerate optical flow estimation through the judicious construction of CNN models, such as the FlowNet family [25, 26], PWC-Net [27] and SpyNet [28]. Nonetheless, these models focus on improving the accuracy of optical flow estimation, which is not directly related to the deep learning models used for HAR. Other works [18, 29] propose an encoder-decoder network, where the encoder network aims to regenerate the optical flow and the decoder network is the action recognition network. However, the encoder-decoder architecture also entails high computational cost. Hence, obtaining a motion representation that is both efficient and effective for HAR remains challenging [30, 31]. To this end, we drop optical flow entirely in favour of fast HAR.

Another frequently proposed category of HAR approaches is 3D CNNs, owing to their well-defined architectures for temporal modelling [32,33,34]. 3D convolutional operators combine information from both the spatial and temporal dimensions within local receptive fields [35, 36]. 3D convolutions and 3D pooling are used in 3D CNNs to propagate temporal information across all the layers of the network, so it can learn features that encode temporal information efficiently. The C3D model [33] is first pre-trained on a large-scale public video dataset to learn spatio-temporal features, which are then used as the input to a linear Support Vector Machine (SVM) classifier for action class prediction. I3D [37] uses a deep Inflated 3D CNN model, expanding the popular Inception model [38] to 3D so that it can learn spatio-temporal features in videos for HAR. T3D [39] proposes a temporal 3D CNN model by extending the original idea of DenseNet [40], while DTPP [41] extends the temporal pyramid pooling function, which originally operates only on the spatial dimensions, to three space-time dimensions and uses this 3D structure in a two-stream CNN in lieu of the common two-stream 2D CNN.

However, these 3D convolution-based models are typically trained on short video snippets rather than entire videos. As a result, they struggle to accurately capture actions that extend beyond their limited temporal context. To address this limitation, SlowFast networks [42] incorporate two pathways operating at different frame rates. The slow and fast pathways capture spatial semantics and fine-resolution temporal motion, respectively, with lateral connections employed to integrate information from both pathways. It is worth noting that, as with other deep learning models, the performance of HAR significantly improves when 3D CNN models are trained on large-scale video datasets. However, the computational cost associated with 3D CNN-based methods increases considerably due to the extensive number of parameters involved in stacked 3D convolutions.

Recurrent Neural Networks (RNN) or Long Short-Term Memory (LSTM) [43,44,45], originally popular in natural language processing, have also found application in HAR. RNNs are deep learning models that possess a memory state, denoted as “h”, which summarizes past information to predict future outcomes. Through backpropagation, the RNN learns to capture the history or memory vector. In HAR, RNNs utilize the input (e.g., frames) and memory state (h) to predict the subsequent action. The incorporation of RNNs in HAR offers the advantage of preserving temporal information throughout the entire training process, thereby enhancing the accuracy of action recognition.
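As a hedged illustration (not the architecture of any specific work cited here), a recurrent HAR head can be sketched in PyTorch as an LSTM that consumes per-frame CNN features and predicts the action class from its final memory state; all names and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecurrentHARHead(nn.Module):
    """Illustrative LSTM head: per-frame CNN features -> action logits."""
    def __init__(self, feat_dim=2048, hidden_dim=512, num_classes=101):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_feats):          # frame_feats: (B, T, feat_dim)
        _, (h, _) = self.lstm(frame_feats)   # h: (num_layers, B, hidden_dim)
        return self.fc(h[-1])                # prediction from the final memory state
```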

In general, HAR is a video understanding problem and can be treated as sequence modeling. LRCN [46] connects an LSTM directly to SOTA CNN models to learn both spatial information and temporal dynamics. It can thus be perceived as a direct extension of the encoder-decoder architecture applied to video representations. One notable advantage of LRCN is its capability to handle sequences of varying lengths effectively. To further enhance the processing of video data, a novel approach called DB-LSTM [47] has been introduced, which combines a CNN with deep bidirectional LSTM networks [48]. These LSTM networks are stacked with multiple layers in both the forward and backward passes, increasing the network's depth and enabling it to recognize actions in long videos, which has been a challenge for most common sequence models.

In contrast to these approaches, our proposed model, CAMA-Net, does not rely on pre-computed optical flow. Instead, it directly takes raw RGB video frames as input for action recognition. This is accomplished with 2D CNN-based methods combined with temporal modelling. In Section 3, we detail how we integrate a 2D ResNet with a memory attention network to find correspondences between video frames.

2.2 Attention mechanism

Recently, attention models have become very popular because they can focus on regions of interest in the target videos [49,50,51,52]. The attention mechanism was first applied to sequence-to-sequence learning in machine translation [53].

The two common types of visual attention [54] are hard and soft attention. Hard attention makes binary choices to select spatial regions. Several works, such as [55, 56], use hard attention in object recognition to extract the most important features in images. In soft attention mechanisms, on the other hand, the spatial region of interest is chosen via weighted averages. [57] designs a teacher-student learning-based model utilizing an activation-based attention map and a gradient-based attention map. These attention maps are propagated from a strong network to a weaker CNN to improve image recognition. Non-local networks [58] learn long-range temporal relationships using a self-attention mechanism.

Wang et al. [59] introduce a channel attention block that employs 1D convolution to evaluate channel interactions while preserving dimensionality. Misra et al. [60] propose triplet attention to determine attention weights via a three-branch structure, enabling the capture of cross-dimension interactions. Wang et al. [61] design a self-attention mechanism that dynamically incorporates long-term temporal connections across the video sequence by capturing the relationship between the current frame and adjacent frames. Stand-alone Inter-Frame Attention [62] is an attention mechanism that operates across multiple frames, computing local self-attention for every spatial position. Hao et al. [63] propose an effective attention-in-attention technique for enhancing element-wise features, exploring the integration of channel context into the spatio-temporal attention learning module. The Visual Attention Network [64] uses large-kernel attention to establish the self-adaptive and long-range correlations of self-attention.

Motivated by these advances in applying attention mechanisms to different computer vision tasks, we propose a novel approach that incorporates self-attention modules into a CNN-based method in a different way. Our integration simply uses the attention mechanism to find correspondences between selected features, without passing the entire set of features through the CNN model, thereby reducing the number of learnable parameters compared to a pure CNN model. This integration aims to reduce the computational complexity of action recognition while maintaining competitive performance.

2.3 2D CNN-based methods for action recognition

As previously mentioned, the well-defined architectures of 3D CNNs make them popular in the field of HAR for temporal modeling. While these networks can achieve impressive performance, their widespread adoption is hindered by high computational requirements and significant GPU memory usage. To address these concerns and develop efficient HAR algorithms, researchers have turned their attention to 2D CNN-based methods. However, these methods have their limitations. 2D convolutional operators operate within individual image frames, limiting their ability to capture temporal information across adjacent frames. If a 2D CNN model is used directly, it has only a partial observation, compromising the accuracy of action prediction, particularly for longer-duration actions. Therefore, to overcome this challenge and improve the performance of 2D CNN-based action recognition algorithms, it is crucial to incorporate temporal modeling techniques.

To address the limitation of sequence length during training, the Temporal Segment Network (TSN) [65] introduces a temporal sampling approach for video clips. TSN aggregates the features to generate video-level representations using an average pooling consensus function. Building upon TSN, the Temporal Relation Networks (TRN) [66] further enhances the temporal modeling capability by leveraging the relationships among video frames in the temporal domain.

In recent times, there has been a rise in the popularity of feature-level inter-frame difference methods for encoding short-term motion information between neighboring frames. For instance, the STM (Spatio-Temporal Motion) approach [67] models the motion representation of spatio-temporal features by utilizing the feature difference between adjacent frames. Another method called Temporal Shift Module (TSM) [68] employs a temporal shift operation to efficiently exchange temporal information among features through the channel dimension, thereby enhancing the performance of 2D CNN techniques. TANet (Temporal Adaptive Network) [69] improves the efficiency of action recognition tasks by stacking multiple Temporal Adaptive Modules (TAM) that encompass both global and local branches, enabling the learning of long-range temporal information. Furthermore, the Temporal Pyramid Network (TPN) [70] introduces feature hierarchy modules to aggregate diverse visual information from different feature levels.

3 Methods

3.1 Model architecture

To achieve faster action recognition, our proposed model, CAMA-Net, eliminates the need for pre-computed optical flow and solely relies on raw RGB video frames as input.

Fig. 1 Overview of CAMA-Net architecture. The video input is divided into L video clips, and from each video clip, a random short snippet is selected. Each snippet consists of a set of RGB frames. The snippet is then fed into the ConvNet backbone, followed by adaptive average pooling. The resulting outputs are passed through two separate channels: the Sequence ConvNet and the Segmental Consensus. The Sequence ConvNet produces a pair of memory features, while the Segmental Consensus generates a pair of query features. These memory and query features are input into the CAMA module to compute the relevance scores between them. Thereafter, the outputs from the CAMA module and the Segmental Consensus are concatenated. This concatenated output provides the action class scores for the different snippets

Figure 1 shows the CAMA-Net architecture. The video input is divided into L video clips, and from each video clip, a random short snippet is selected. Each snippet consists of a set of RGB frames. The snippet is then fed into the CNN backbone, followed by adaptive average pooling. The resulting outputs are passed through two separate channels: the Sequence CNN and the Segmental Consensus. The former produces a pair of memory features of dimension (B, C, H, W, T), while the latter generates a pair of query features of dimension (B, C, H, W). Here B is the batch size, C is the channel size, H and W are the height and width respectively, and T is the sequence length. The Sequence CNN channel is a stack of \(1\times 1\) convolution modules, while the Segmental Consensus channel is an average-pooling aggregation function over the temporal dimension.

The memory and query features are designed to serve different purposes. The memory features are analogous to source or base features, encompassing the majority of the content. The query features, on the other hand, can be considered summarized or filtered features, capturing the most important aspects. Both play a crucial role as inputs to the CAMA module, where the relevance scores between them are computed. These relevance scores allow the proposed model to learn a more discriminative spatio-temporal representation for action recognition. Subsequently, the outputs of the CAMA module and the Segmental Consensus are concatenated, boosting the prediction performance. This concatenated output provides the action class scores for the different snippets.
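As a hedged illustration of this two-channel design, the sketch below shows how memory features of shape (B, C, H, W, T) and query features of shape (B, C, H, W) could be produced from the backbone output; the pooled size, layer counts and channel widths are assumptions chosen for illustration, not values taken from the paper.

```python
import torch
import torch.nn as nn

class MemoryQueryFeatures(nn.Module):
    """Hedged sketch of the two channels that follow the 2D backbone.

    Assumes backbone features of shape (B, T, C, H, W); the pooled size,
    number of 1x1 layers and channel widths are illustrative only.
    """
    def __init__(self, in_ch=2048, out_ch=2048, pooled=7):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pooled)        # adaptive average pooling
        self.seq_conv = nn.Sequential(                  # "Sequence CNN": stack of 1x1 convolutions
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=1),
        )

    def forward(self, feats):                           # feats: (B, T, C, H, W)
        B, T, C, H, W = feats.shape
        x = self.pool(feats.flatten(0, 1))              # (B*T, C, P, P)
        # Memory features keep the temporal axis: (B, C', P, P, T)
        mem = self.seq_conv(x).view(B, T, -1, *x.shape[-2:]).permute(0, 2, 3, 4, 1)
        # Segmental Consensus: average over time -> query features (B, C, P, P)
        qry = x.view(B, T, C, *x.shape[-2:]).mean(dim=1)
        return mem, qry
```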

3.2 CAMA module

Figure 2 shows the details of the CAMA module. Both the memory and query features possess distinct key features (\(M_{k}\), \(Q_{k}\)) and value features (\(M_{v}\), \(Q_{v}\)), as shown in Fig. 2. The primary role of the CAMA module is to determine the relevance between the memory key features (\(M_{k}\)) and the query key features (\(Q_{k}\)) using three distinct relevance functions. The resulting relevance scores are combined with the memory value and then concatenated with the query value. This information is subsequently fed into a fully connected network to predict the action class.

Fig. 2 CAMA module. The vital role of CAMA module is to determine the relevance between the memory key features (\(M_{k}\)) and the query key features (\(Q_{k}\)) through the utilization of three relevance functions. These computed relevance scores are then summed with the memory value features (\(M_{v}\)) and concatenated with the query value features (\(Q_{v}\)). The resulting output from the CAMA module is subsequently concatenated with the output of the Segmental Consensus module. This concatenated output provides the action class scores for the different snippets

Fig. 3 First relevance score computation. Before any relevance score computation, the memory key (\(M_{k}\)) and query key (\(Q_{k}\)) are reorganized without changing their contents. The first relevance function takes a direct approach to determining the relevance between the memory key (\(M_{k}\)) and query key (\(Q_{k}\)) by comparing their affinity. The multiplication is performed given that the batch size of the two features is the same. We use batch matrix multiplication between the memory key (\(M_{k}\)) and query key (\(Q_{k}\)), which does not involve any learnable parameters

Our proposed relevance functions differ from others, such as that proposed in [71], which calculates relevance scores between the current features (query key) and all the features together (memory key). Our design comprises three functions. The first relevance function takes a direct approach to determining the relevance between the memory key (\(M_{k}\)) and query key (\(Q_{k}\)) by comparing their affinity. We use batch matrix multiplication between the memory key (\(M_{k}\)) and query key (\(Q_{k}\)), which does not involve any learnable parameters. Before the relevance score computation, we reorganize the features without changing their contents, as shown in Fig. 3. The first relevance function \(R(M_{k},Q_{k})\) is given by:

$$\begin{aligned} R(M_{k},Q_{k}) = M_{k} Q_{k} \end{aligned}$$
(1)

To improve the accuracy of the relevance score calculation, we introduce a second relevance function based on a bi-linear form, since the first relevance function alone is insufficient. The bi-linear form uses a learnable matrix \(W \in R^{c_{k}\times c_{v}}\), which facilitates the computation of the relevance:

$$\begin{aligned} R(M_{k},Q_{k}) = M_{k} W Q_{k} \end{aligned}$$
(2)

We define a third relevance function, which incorporates trainable relevance scores \(r\) between \(M_{k}\) and \(Q_{k}\), allowing the network to learn these scores explicitly. The outputs of the three relevance functions are passed through the softmax function, which turns a vector into one whose values sum to 1; here we apply the softmax over the last dimension of each output. The relevance scores are then summed to generate a context term, denoted \(c_{k}\). This context term is added to the memory value, \(M_{v}\), and concatenated with the query value, \(Q_{v}\). Finally, this combined information is fed into the fully connected network to predict the action class. The context term is given by:

$$\begin{aligned} c_{k} = \sum R(M_{k},Q_{k}) = softmax(M_{k} Q_{k}) + softmax(M_{k} W Q_{k}) + softmax(r) \end{aligned}$$
(3)
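To make Eqs. (1)–(3) concrete, the following is a minimal sketch of the CAMA relevance computation. The tensor shapes are assumptions chosen for internal consistency (reorganized \(M_{k}\): (B, N, D), \(Q_{k}\): (B, D, \(C_{v}\)), and \(M_{v}\), \(Q_{v}\): (B, N, \(C_{v}\)), so that each relevance score matches the shape of the memory value); the actual reorganization and the shape of W follow Figs. 2 and 3, and this is not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAMARelevance(nn.Module):
    """Hedged sketch of the three relevance functions and the context term."""
    def __init__(self, key_dim, val_dim, num_pos, num_classes):
        super().__init__()
        self.W = nn.Parameter(torch.randn(key_dim, key_dim) * 0.01)  # bilinear form, Eq. (2); shape assumed
        self.r = nn.Parameter(torch.zeros(num_pos, val_dim))         # trainable scores, Eq. (3); shape assumed
        self.fc = nn.Linear(2 * val_dim * num_pos, num_classes)      # fully connected classifier

    def forward(self, m_k, q_k, m_v, q_v):
        r1 = torch.bmm(m_k, q_k)                    # Eq. (1): direct affinity, no learnable parameters
        r2 = torch.bmm(m_k @ self.W, q_k)           # Eq. (2): bilinear relevance
        c_k = (F.softmax(r1, dim=-1)                # Eq. (3): softmax over the last dimension, then sum
               + F.softmax(r2, dim=-1)
               + F.softmax(self.r, dim=-1))
        out = torch.cat([c_k + m_v, q_v], dim=-1)   # add to memory value, concatenate query value
        return self.fc(out.flatten(1))              # action class scores
```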

4 Experiments

4.1 Datasets

We evaluate the performance of the proposed CAMA-Net on popular benchmark datasets, ActivityNet, Diving48, HMDB-51 and UCF-101. ActivityNet [12] contains 200 different types of activities and Version 1.3 contains around 20,000 untrimmed videos. Diving48 [13] is a fine-grained video dataset on competitive diving, consisting of around 18,000 trimmed video clips of 48 unambiguous dive sequences. HMDB51 [14] contains about 7000 videos comprising 51 categories. UCF-101 [15] contains 101 action classes with around 13,000 videos.

4.2 Implementation details

We implement the CAMA-Net framework using ResNet50 and ResNet101 as the backbones. The number of sampled video frames, denoted T, is set to 24. The shorter side of the input video frames is resized to 256, and common data augmentation techniques such as random horizontal flipping and multi-scale cropping are applied before training [65]. The optimal settings for model training are as follows: the batch size is set to 6 and the initial learning rate to 0.00008. The total number of training epochs is 80 for HMDB51 and Diving48, and 120 for UCF101 and ActivityNet. The weight decay is set to 0.0002.
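For reference, these settings can be gathered into a single configuration as sketched below; the optimizer type and momentum are assumptions, since they are not specified above.

```python
import torch

# Hedged training configuration mirroring the reported hyperparameters.
cfg = dict(
    backbone="ResNet50",       # or ResNet101
    seq_len=24,                # number of sampled frames T
    short_side=256,            # resize shorter side before cropping
    batch_size=6,
    lr=8e-5,                   # initial learning rate
    weight_decay=2e-4,
    epochs={"HMDB51": 80, "Diving48": 80, "UCF101": 120, "ActivityNet": 120},
)

def build_optimizer(model):
    # SGD with momentum is a common choice for TSN-style training; assumed here, not stated in the paper.
    return torch.optim.SGD(model.parameters(), lr=cfg["lr"],
                           momentum=0.9, weight_decay=cfg["weight_decay"])
```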

During testing, the video input from the test dataset is resized to 256 on the shorter side to maintain consistency with the training process. We initialize the model with a pre-trained ImageNet model when training on all the datasets. Both the model training and testing are conducted on two NVIDIA Tesla V100 Tensor Core GPUs.

4.3 Performance comparison

The performance of the proposed CAMA-Net is compared with state-of-the-art (SOTA) baselines on the four well-known action recognition datasets, namely ActivityNet, Diving48, HMDB51 and UCF101. The performance metrics used are top-1 and top-5 accuracy. The results are shown in Tables 1, 2, 3 and 4 for the respective datasets. Note that all the models included in the comparison rely solely on the pre-trained ImageNet model for initialization and do not undergo any additional pre-training on other large-scale video datasets.

The SOTA baselines compared are 2D CNN-based action recognition methods with late fusion of temporal information, such as TRN [66] and TSN [65], and 2D CNN methods with built-in temporal modules, such as TANet [69], TPN [70] and TSM [68]. The tables show that the proposed CAMA-Net outperforms all the SOTA baselines on all four datasets, testifying to the effectiveness of its temporal information learning for action recognition.

2D CNN-based action recognition methods have the advantage of faster model inference compared to 3D CNN-based methods and two-stream methods (optical flow and RGB frame fusion). To provide some insight into this speed difference, we record the inference speed of CAMA-Net on the UCF-101 dataset as the number of video frames processed per second; the result is shown in Table 5. The inference speeds of a seminal 3D CNN-based method, C3D [33], which is the first 3D CNN model for action recognition, and of a two-stream network [16] are also shown. As can be seen from Table 5, CAMA-Net is more than twice as fast as C3D and ten times as fast as the two-stream network during inference. For practical deployment, especially on edge devices, a lightweight model with fast inference speed and only a small tradeoff in recognition accuracy is most desirable and feasible.
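As an aside, inference throughput in frames per second can be estimated with a simple harness such as the hedged sketch below; it only illustrates the metric and is not the measurement protocol behind Table 5.

```python
import time
import torch

@torch.no_grad()
def frames_per_second(model, clip, n_runs=50):
    """Rough throughput estimate; clip is a (1, T, 3, H, W) tensor."""
    model.eval()
    if torch.cuda.is_available():
        model, clip = model.cuda(), clip.cuda()
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        model(clip)                     # forward pass only
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return n_runs * clip.shape[1] / (time.time() - start)
```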

Table 1 Performance comparison against SOTA baselines on ActivityNet dataset. Higher values are better
Table 2 Performance comparison against SOTA baselines on Diving48 dataset. Higher values are better
Table 3 Performance comparison against SOTA baselines on HMDB-51 dataset. Higher values are better
Table 4 Performance comparison against SOTA baselines on UCF-101 dataset. Higher values are better
Table 5 Performance (model inference speed) comparison between 2D CNN based method vs 3D CNN based method vs two-stream (optical flow+RGB) method on UCF-101 dataset. Higher values are better

4.4 Ablation study

Similar to the performance comparison in the previous section, the performance metrics used in the ablation studies are top-1 and top-5 accuracies. The experiments are carried out on the UCF101 dataset with ResNet50 as the backbone.

Table 6 Study on the relevance functions used in CAMA module. Performance of the different combinations of relevance functions on the UCF101 dataset is shown. Higher values are better

Relevance functions used for CAMA module Three different relevance functions have been designed for the CAMA module to calculate the relevance score between the memory and query features: the batch matrix multiplication, the bi-linear function and the trainable function. To gain insight into the effectiveness of these relevance functions in improving action recognition performance, an experiment is conducted using different combinations of the relevance functions; the results are shown in Table 6. The combination of all three relevance functions yields the best performance and is thus adopted in the CAMA module.

Table 7 Study on adaptive average pooling used in CAMA-Net. Performance with and without adaptive average pooling for CAMA-Net on the UCF101 dataset is shown. Higher values are better

Adaptive average pooling An experiment is also carried out to validate that adaptive pooling plays a significant role in improving the effectiveness of CAMA-Net. As shown in Fig. 1, raw RGB video frames are passed through the ResNet50 backbone to extract the encoded features. Adaptive average pooling is then applied to these encoded features before they are passed to the two separate channels. Adaptive average pooling applies a 2D average pooling over an input signal composed of several input planes, producing an output of a specified spatial size regardless of the input size. Table 7 shows the performance comparison with and without adaptive average pooling after the ResNet backbone. The result shows that CAMA-Net achieves the best performance with adaptive average pooling.
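Concretely, 2D adaptive average pooling fixes the output spatial size regardless of the input resolution; the 7×7 output size and the 8×8 input feature map below are assumed values for illustration.

```python
import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d((7, 7))     # output is always 7x7, whatever the input spatial size
feats = torch.randn(6, 2048, 8, 8)      # e.g. ResNet50 features for a 256-pixel input (illustrative)
print(pool(feats).shape)                # torch.Size([6, 2048, 7, 7])
```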

Table 8 Study on concatenation of outputs of segmental consensus and CAMA module. Performance with and without output concatenation of the two modules in CAMA-Net on the UCF101 dataset is shown. Higher values are better

Concatenation of output of Segmental Consensus and output of CAMA module A study is also carried out to show the effectiveness of concatenating the outputs of the Segmental Consensus and the CAMA module. The concatenation of both outputs can reduce bias that results in poor action recognition performance and can be considered a type of regularization for CAMA-Net. The performance with and without concatenation of these two outputs is shown in Table 8. The result shows that CAMA-Net achieves the best performance with the concatenation of these two outputs.

Table 9 Study on batch normalization used in CAMA-Net. Performance with and without batch normalization for CAMA-Net on the UCF101 dataset is shown. Higher values are better

Batch normalization An experiment is also carried out to verify that batch normalization improves the effectiveness of CAMA-Net. Batch normalization is a common method that standardizes the inputs to a layer of a deep learning model for each mini-batch during training. In theory, it stabilizes the learning process and reduces the number of epochs required to train the deep learning model. Table 9 shows the performance comparison between CAMA-Net with and without batch normalization. The result shows that CAMA-Net achieves the best performance with batch normalization.

4.5 Other experiments

Channel sizes of input features for CAMA module To recap, the inputs to the CAMA module are the memory key (\(M_{k}\)), memory value (\(M_{v}\)), query key (\(Q_{k}\)) and query value (\(Q_{v}\)). These key-value pairs of the memory and query features are obtained by passing the features through CNN modules with a filter size fixed at \(1\times 1\). Since all the above features jointly contribute to the performance of our proposed model, an experiment is conducted to vary the channel size of each feature to find the optimum. The key channel size is used for memory and query key generation, while the value channel size is used for memory and query value generation. Table 10 shows the performance when varying the channel sizes of the key and value pairs. The result shows that CAMA-Net achieves the best performance with a key channel size of 512 and a value channel size of 2048.
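For illustration, these projections can be written as \(1\times 1\) convolutions with the best-performing channel sizes; the 2048 input channels assume the ResNet50 backbone, and the layer names are hypothetical.

```python
import torch.nn as nn

key_channels, value_channels = 512, 2048                    # best setting found in Table 10
to_key = nn.Conv2d(2048, key_channels, kernel_size=1)       # memory/query key generation
to_value = nn.Conv2d(2048, value_channels, kernel_size=1)   # memory/query value generation
```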

Table 10 Study on varying the channel sizes of input features for the CAMA module. Performance when varying the channel sizes of the input features for the CAMA module on the UCF-101 dataset is shown. Higher values are better
Table 11 Study on batch size used in CAMA-Net for model training and testing. Performance of the different batch sizes for CAMA-Net on the UCF101 dataset is shown. Higher values are better
Table 12 Study on varying sequence length of input videos in CAMA-Net model training and testing. Performance of the different sequence lengths of video input in CAMA-Net on the UCF101 dataset is shown. Higher values are better
Table 13 Study on initial learning rate used in CAMA-Net model training and testing. Performance of the different initial learning rates in CAMA-Net on the UCF101 dataset is shown. Higher values are better

Batch size, sequence length and learning rate Hyperparameter tuning is very important for a model to achieve its best performance. Therefore, extensive experiments have been carried out to explore the possible range of values and narrow down to the optimal ones. In action recognition, batch size, sequence length and learning rate are important hyperparameters for achieving good recognition accuracy. Batch size denotes the number of videos propagated through the network in each iteration, while sequence length denotes the number of frames in each video snippet sequence. The learning rate controls how much the model weights change in response to the estimated error at each update. The performance when varying the batch size, sequence length and initial learning rate is shown in Tables 11, 12 and 13, respectively. The results show that CAMA-Net achieves the best performance with a batch size of 6, a sequence length of 24 and an initial learning rate of 0.00008.

5 Conclusion

In this paper, we introduce the Context Aware Memory Attention Network (CAMA-Net) for video action recognition, eliminating the requirement for optical flow extraction. CAMA-Net offers enhanced efficiency by avoiding computationally intensive 3D convolution. Instead, we design a Context Aware Memory Attention (CAMA) module, an attention mechanism used to compute the relevance scores between key-value pairs derived from the backbone network outputs. Through extensive experiments conducted on four widely-used benchmark datasets, our proposed model demonstrates remarkable performance improvements while maintaining competitive efficiency compared to SOTA 2D CNN-based models. Our model maintains its performance across many different action classes, regardless of video length.

Vision Transformers (ViTs) [72] are actively being explored in the research community as replacements for Convolutional Neural Networks in computer vision tasks, including human action recognition. Recently, several lightweight ViT-based models, such as MobileViT [73] and EfficientFormer [74], have been able to mitigate the computational cost of such models. For future work, we will focus on exploring and improving lightweight ViT-based models for human action recognition.