1 Introduction

The importance of Human Action Recognition (HAR) in computer vision research stems from its usefulness in a variety of applications, including content-based video analysis [1,2,3], health monitoring [4], sophisticated video surveillance [5], and smart robot systems [6]. HAR aims to identify and classify human actions into categories using different input modalities, such as 3D skeleton data [7,8,9,10], still images [11,12,13,14], or videos [15,16,17,18], by extracting and classifying the input data's spatiotemporal features. The spatial features convey appearance, while the temporal features represent motion. HAR pipelines are typically split into two stages: feature extraction and classification. Features are extracted with handcrafted or deep learning approaches [10], while machine learning and deep learning approaches perform feature learning, action recognition, and classification [19].

HAR using deep learning techniques has received increasing attention due to its progress and promising results [20]. However, no deep learning model has yet demonstrated learning spatiotemporal features from videos or successive frames with fully adequate results. Possible explanations include the nature of the input data (images or videos), the type of spatiotemporal features retrieved, and the models used to extract those features [15]. Deep learning approaches for HAR are built on a variety of architectures: two-stream 2D convolutional neural networks (2D-CNNs), recurrent neural networks such as LSTM, 3D convolutional neural networks (3D-CNNs), and transformers. These models differ in the input data they use (image, video, or skeleton), their network architecture, and their technique for extracting spatiotemporal features to recognize actions [21,22,23].

Typically, the action in a video is identified using features extracted from its image sequence. Most image-sequence models rely on a two-stream 2D-CNN architecture, in which one stream extracts spatial information (appearance) from RGB images, while the other extracts short-term temporal features (motion) from optical flow images [12, 24, 25]. These models are effective in improving action recognition accuracy; however, the processing cost of computing optical flow is considerable. Furthermore, a single RGB image carries only spatial information, so it is insufficient on its own for recognizing actions.

Given the expanding volume of video data, the inherently spatial nature of single RGB images, and the high processing cost of computing optical flow, alternative approaches are needed. Action recognition approaches therefore use 3D-CNNs to extract features directly from RGB videos. Depending on the network design and the duration of the video clip, a 3D-CNN extracts various spatiotemporal features from video clips. Although 3D-CNNs increase HAR accuracy [15, 22, 26], their performance is influenced by the dimensions and duration of the video clip. Most video action recognition approaches use short clip lengths (8 or 16 frames per clip) to decrease computation time and cost, which causes the network to learn only short-term temporal information of limited use for recognizing actions [15, 16, 27, 28]. Using long clips (64 frames per clip) teaches the models the long-term temporal information that characterizes long-range movements, but requires considerably more computing effort and cost than using short clips.

Compared to CNN models for HAR, the LSTM model addresses several of their core limitations. LSTM extracts long-term temporal features that supply more information about motion in a series of images [29]. CNN architectures are often combined with LSTM to improve HAR performance: the CNN first extracts spatial or temporal features from images or short videos, and the LSTM is then trained on the resulting feature maps to extract long-term temporal characteristics for action recognition and classification [22, 30, 31]. CNN-LSTM is a straightforward model for extracting spatiotemporal information from RGB images or videos. Its behavior depends strongly on the chosen CNN backbone and the input data, since the CNN architecture determines the extracted features and affects the model's computational cost and running time [32,33,34,35].

The 2D-CNN-LSTM model's efficacy may be degraded by redundant data captured by video devices, which can obscure important information. Keyframe selection can therefore affect the performance of a HAR model, as a small number of keyframes can carry the most expressive information for local frame representations. Keyframes are the frames in a video clip that contain its most important visual content. Depending on the clip's content complexity and the keyframe extraction method, one or more keyframes can be extracted from a single clip [36]. Commonly used keyframe extraction techniques include (1) shot (lens)-based methods [37], (2) clustering-based methods [38], (3) motion feature-based methods [39], and (4) content analysis-based methods [40, 41]. Keyframe extraction reduces the temporal complexity of action recognition and increases the recognition model's performance, and several HAR models based on keyframe extraction are already available [42,43,44].

To address the challenges discussed above and extract richer information for improving action recognition, we investigate a variety of video features and fusion techniques. Most visual action recognition approaches are based on color cameras that capture human body movement. In this setting, RGB image features capture action scene information, while RGB video features capture depth and motion cues, and combining them leads to more discriminative performance. Action recognition in videos can perform better when the intrinsic semantic relationship between keyframe images and video clips is exploited to combine scene features from the images with action characteristics from the video. We therefore propose two frameworks that explore distinct fusion methodologies for the action recognition models studied in this paper; these models extend the widely used residual network (ResNet-101) models in [15, 45] and LSTM to handle RGB videos. Each framework consists of two streams: the first extracts temporal features from a keyframe image sequence using the R2D-LSTM network, while the second extracts temporal features from video clips using a 3D-CNN, and several fusion techniques are investigated for action recognition. Overall, this paper's main contributions are:

  1. Enhance the performance of 2D ResNet-101 (R2D) by merging it with LSTM to extract temporal information from only the keyframe images of videos for HAR (R2D-LSTM).

  2. Reduce computing cost and time by training the R2D-LSTM model with keyframes rather than entire video frames.

  3. Examine an early fusion method for the R2D-LSTM and video-based (R3D) HAR models to investigate various temporal representations of actions in video.

  4. Examine late fusion techniques for the two streams with different inputs for human action recognition.

  5. Extensive experiments demonstrate that the proposed models and frameworks significantly improve recognition performance; on the UCF-101 and HMDB-51 datasets, our frameworks outperform recent models that use only RGB input.

The remainder of this paper is organized as follows. Related action recognition work is reviewed in Sect. 2. The proposed models are described in Sect. 3. Experimental results are presented in Sect. 4, and a comparison with the state of the art in Sect. 5. Section 6 concludes the paper.

2 Related work

Various types of input data can be collected in the HAR domain, including images, videos, skeletons, and, more recently, depth maps and infrared images. Here we discuss related HAR models that rely only on RGB image and video data.

2.1 Image-based models

Due to the rapid development of deep learning algorithms, considerable progress has been made in image-based action recognition. Karpathy et al. [14] examine several methods for fusing information over the temporal dimension with a 2D-CNN network. They found that the slow fusion model, which mixes early and late fusion, captures the spatiotemporal features of video clips; however, although it captures spatiotemporal features in the first three layers, it loses the temporal information afterwards. Simonyan and Zisserman [24] propose two-stream networks based on the 2D-CNN model. The model uses stacked optical flow vectors of motion features to address the difficulty deep networks have in learning motion features. Combining the two streams increases the accuracy of action identification: the spatial network gathers visual information from video frames, and the temporal network gathers motion information by applying optical flow to neighboring frames. Since then, many approaches based on two-stream CNNs have been proposed to enhance action recognition performance [12, 25, 46]. Such models increase action recognition accuracy, but they encounter certain issues, such as the high processing cost of calculating optical flow, and, given the nature of single images, they can only extract short-term temporal information, which is not expressive enough for video action recognition. Lin et al. [47] offer a fresh perspective for gathering temporal information: the Temporal Shift Module (TSM) efficiently collects temporal information using only 2D convolutions. To capture the differences between the features of different frames, Wu et al. [48] present a multi-scale temporal shift module based on the TSM. Jiang et al. [49] replace the original ResNet blocks with a proposed STM block that learns and encodes spatiotemporal and motion information in a 2D framework for superior activity recognition results, at the cost of increased computation. TRM [50] attempts to reposition spatial elements along the temporal dimension to provide spatially aware long-term temporal modeling. Wu et al. [51] concentrate on transferring knowledge for video classification problems. They provide a strong semantic target using a well-trained language model for effective transfer learning, and they improve the transferability of vision-language pre-training models for downstream classification tasks using textual encoders. In numerous studies, LSTMs have been used to extract long-term temporal information from a series of spatial feature maps. Donahue et al. [28] propose LRCN, which extracts features with a 2D-CNN and then uses two-layer LSTMs to learn spatiotemporal features; LRCN predicts the action class by averaging the individual predictions of each frame of the video sequence. Gammulle et al. [30] develop a model employing two-stream LSTMs in combination with convolutional and fully connected activations. They suggest a deep fusion framework that takes advantage of both temporal features from LSTM models and spatial features from CNNs. FC-RNN [33] combines deep neural networks across many layers and modalities, capturing various static and dynamic stimuli at various temporal scales, in addition to combining different semantic levels into each network.

2.2 Video-based models

Unlike the 2D-CNNs in the two-stream approaches above, 3D-CNNs outperform 2D-CNNs on large-scale datasets, since a 3D-CNN can extract spatiotemporal information directly from video clips. The first 3D ConvNet was proposed in 2013 [18], but the model suffered from expensive training cost and time. Tran et al. [27] propose the C3D framework, based on the VGG network, to learn spatiotemporal features on large-scale datasets; C3D excels in many video analysis tasks. In 2017, I3D [26] proposed a Two-Stream Inflated 3D ConvNet based on Inception-V1, trained on Kinetics-400 [22] using realistic, challenging YouTube videos. In terms of performance, I3D outperforms other video action recognition models, but it faces some problems, such as the high computational cost and time of optical flow calculation and the use of long clips (64 frames per clip) to learn the long-term temporal information that represents long-range movements well. Wang et al. [34] propose I3D-LSTM based on the I3D model, which uses a pre-trained I3D to extract low-level spatiotemporal information and then an LSTM to model high-level spatiotemporal features. Roberta et al. [52] propose a 3D-CNN for video analysis to identify human activity; this architecture can categorize video sequences containing several types of human activity, and the authors argue that categorizing typical human behaviors is a step toward categorizing deviant behaviors, improving public space security or assisting individuals. Using diverse video datasets, Hara et al. [15] examine the designs of different 3D-CNNs built on the ResNet family, ranging from shallow to very deep. The accuracy of these ResNets (R3D) varies with model depth and is still lower than that of two-stream I3D, but R3D surpasses C3D in recognition accuracy and performs better in terms of compactness, model size, and run-time speed. Based on the 3D-CNN improvements in R3D, many approaches were introduced, such as DenseNet [53], WRN [54], ResNeXt [55], and X3D [56]. A bi-directional LSTM (BiLSTM) was introduced in [57] that preferentially concentrates on useful features in the input frames to recognize the various human actions in videos. Zan and Zhao [58] combine CNN and LSTM networks to propose a TS-CNN-LSTM framework; the approach offers a remedy for real-time interactive applications that demand prompt results from human action recognition. Several further strategies build on 3D-CNNs with LSTM enhancements, including [59,60,61,62]. There are additional difficulties where precise categorization depends not only on the latent semantic representations of the action attributes and their temporal connections but also on the visual aspects. Transformer models have recently been applied to vision challenges, where the transformer captures global dependencies among tokens by learning semantics through self-attention. Pure transformer designs were not popular in computer vision until Vision Transformers (ViT) [63] achieved remarkable success in image classification, which led to the integration of transformers in video classification. ViViT [64] and TimeSformer [65] are the first two efforts that effectively utilize a pure transformer architecture for video classification. Zhan et al. [66] propose video masked autoencoders (VideoMAE), data-efficient learners trained with self-supervised video pre-training (SSVP). Zhen et al. [67] introduce the SVFormer model under the self-supervised learning setting for action recognition. Rui et al. [68] propose masked video distillation (MVD) together with a simple co-teaching strategy that enjoys the synergy of image and video teachers. There is much research on transformers for video classification [69,70,71,72], and on incorporating transformer blocks [70, 73] as additional layers into CNNs to better model long-range interactions among spatiotemporal characteristics. Despite the Transformer's modeling efficiency in capturing latent semantics and global dependencies, CNNs and LSTMs can still capture high-level spatiotemporal features effectively.

3 The Proposed models

This section describes the HAR models and frameworks proposed in this paper in detail: the image-based action recognition model in Sect. 3.1, the video-based action recognition model in Sect. 3.2, and the early and late fusion frameworks for HAR in Sect. 3.3.

3.1 The Image-based action recognition model

We develop an image-based action recognition model that exploits the benefits of integrating a 2D-CNN based on the ResNet-101 network (R2D) [45] with LSTMs [74] to express the temporal features of video keyframes. As depicted in Fig. 1, the model comprises an R2D network fine-tuned on the keyframe images of the target dataset to obtain spatial features, two layers of LSTM to obtain temporal features from the spatial feature maps, and two fully connected layers with a linear classifier for action recognition.

Fig. 1
figure 1

The R2D-LSTM model for image-based action recognition

We develop a keyframe extraction method based on shot-based analysis using the histogram comparison technique [40] to extract the video clip's keyframes, as shown in Fig. 2. Each video is split into K separate clips with N frames each. For every video clip, we extract the keyframe by calculating the color histogram of each frame as in [75] and comparing the dissimilarity of the histograms of each pair of consecutive frames using the correlation method in Eq. (1); the frame with the highest dissimilarity among all N frames is chosen as the keyframe for that clip. For each video with K clips, we thus extract K keyframe images.

$$d\left(Hist_{i},Hist_{i+1}\right)=\frac{\sum_{I}\left(Hist_{i}(I)-\overline{Hist_{i}}\right)\left(Hist_{i+1}(I)-\overline{Hist_{i+1}}\right)}{\sqrt{\sum_{I}\left(Hist_{i}(I)-\overline{Hist_{i}}\right)^{2}\,\sum_{I}\left(Hist_{i+1}(I)-\overline{Hist_{i+1}}\right)^{2}}}$$
(1)

where

Fig. 2
figure 2

Extracting keyframe images from each consecutive N frame of the video clip using histogram difference

$$\overline{Hist_{i}} = \frac{1}{M}\sum_{J} Hist_{i}\left(J\right)$$
(2)

\(\overline{Hist_{i}}\) is the mean value of the histogram of frame i, M is the total number of histogram bins, and \(d\left(Hist_{i},Hist_{i+1}\right)\) expresses how well the two histograms match.
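As a minimal sketch of this selection step, the code below picks one keyframe per N-frame clip, assuming that OpenCV's cv2.compareHist with the HISTCMP_CORREL method realizes the correlation of Eq. (1); the grayscale histogram and the 64-bin count are illustrative simplifications of the color histogram of [75].

```python
import cv2
import numpy as np

def frame_histogram(frame, bins=64):
    """Normalized grayscale histogram of a single frame (M = bins)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([gray], [0], None, [bins], [0, 256])
    return cv2.normalize(hist, hist).flatten()

def select_keyframe(clip_frames):
    """Return the frame of an N-frame clip whose histogram is most dissimilar
    (lowest correlation d of Eq. (1)) to that of its preceding frame."""
    if len(clip_frames) < 2:
        return clip_frames[0]
    hists = [frame_histogram(f) for f in clip_frames]
    best_idx, best_dissim = 1, -np.inf
    for i in range(len(hists) - 1):
        corr = cv2.compareHist(hists[i], hists[i + 1], cv2.HISTCMP_CORREL)
        dissim = 1.0 - corr                      # high dissimilarity = low correlation
        if dissim > best_dissim:
            best_dissim, best_idx = dissim, i + 1
    return clip_frames[best_idx]
```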

The R2D network is fine-tuned on the keyframes to conduct deep feature learning. We use a 2D ResNet-101 (R2D) pre-trained on the ImageNet dataset [76] to overcome the issue of insufficient training data, and then fine-tune all layers using the keyframe images of the UCF-101 [77] and HMDB-51 [78] datasets separately to adapt the network to our task. As shown in Fig. 3, the R2D architecture consists of an input convolution layer with max pooling, four residual blocks, and an average pooling layer followed by an output classification layer (fc). We extract the spatial information from Conv5_x, the last convolutional block.

Fig. 3
figure 3

2D ResNet-101 (R2D) network architecture; spatial features are extracted from the Conv5_x layer [45]

To capture temporal ordering and extract temporal features, a two-layer LSTM is trained on the spatial feature vectors that R2D derives from the keyframe images. The LSTM recurrent neural network is effective at processing sequential data; it resolves many of the fundamental issues of traditional recurrent networks by using an appropriate gradient-based learning algorithm. Finally, the image-based model uses R2D to extract spatial features, which are fed to the LSTM to extract temporal features for action recognition. The final output of R2D-LSTM is obtained from the SoftMax classifier.
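A minimal PyTorch sketch of the R2D-LSTM stream is given below; the framework choice, the 512-unit hidden size, and the single linear classification head are our assumptions, while the 2048-dimensional Conv5_x output and the two-layer LSTM follow the description above.

```python
import torch
import torch.nn as nn
from torchvision import models

class R2DLSTM(nn.Module):
    def __init__(self, num_classes, hidden_size=512):
        super().__init__()
        backbone = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
        # Keep everything up to (and including) the global average pool,
        # i.e. the Conv5_x features of Fig. 3; drop the ImageNet fc layer.
        self.r2d = nn.Sequential(*list(backbone.children())[:-1])
        self.lstm = nn.LSTM(input_size=2048, hidden_size=hidden_size,
                            num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, keyframes):                      # (B, K, 3, 224, 224)
        b, k = keyframes.shape[:2]
        feats = self.r2d(keyframes.flatten(0, 1))      # (B*K, 2048, 1, 1)
        feats = feats.flatten(1).view(b, k, 2048)      # sequence of spatial features
        out, _ = self.lstm(feats)                      # temporal modeling over keyframes
        return self.classifier(out[:, -1])             # class logits (SoftMax in the loss)
```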

3.2 The video-based action recognition model

The second component of our models is video-based action recognition utilizing the 3D ResNet-101 (R3D) network. The Residual Network (ResNet) is among the most effective networks for image interpretation. In [15], the authors extend the ResNet design to 3D-CNNs to improve the performance of video action recognition. Compared to state-of-the-art methods, R3D pre-trained on the large Kinetics video dataset [22], with depths from shallow up to 152 layers, produces considerable progress and improves action recognition performance.

Kensho et al. [15] examine different networks, including ResNets (basic and bottleneck blocks), DenseNet [53], WRN [54], and ResNeXt [55]. Figure 4 summarizes the architecture of the 3D ResNet-101: an input convolution layer with max pooling, four bottleneck blocks, an average pooling layer, and a fully connected layer. Each bottleneck block contains three convolutional layers, each followed by batch normalization and ReLU. As Fig. 5 shows, the kernel size of the second convolutional layer is 3 × 3 × 3, whereas the kernel sizes of the first and third convolutional layers are 1 × 1 × 1.

Fig. 4
figure 4

The 3D ResNet-101 Network Architecture [15]

Fig. 5
figure 5

The Bottleneck Architecture [15]

The 3D ResNet-101 network was pretrained on the Kinetics dataset and then fine-tuned on the UCF-101 and HMDB-51 datasets independently. The model is employed to extract spatiotemporal features from short-term video clips (N frames), as shown in Fig. 6. The final fully connected layer uses the SoftMax classifier to identify actions in videos.
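The sketch below illustrates how such a Kinetics-pretrained 3D ResNet can be adapted to a target dataset by replacing its classification head; torchvision's r3d_18 with Kinetics weights is used only as a readily available stand-in for the deeper 3D ResNet-101 of [15], and freezing the backbone reflects the fc-only fine-tuning described later in Sect. 4.4.

```python
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

def build_r3d(num_classes, freeze_backbone=True):
    """Adapt a Kinetics-pretrained 3D ResNet to a new action-label set."""
    model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)   # stand-in for R3D-101
    if freeze_backbone:                                      # fine-tune only the new head
        for p in model.parameters():
            p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # trainable classifier
    return model
```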

Fig. 6
figure 6

The 3D ResNet-101(R3D) network for video-based action recognition

3.3 The fusion frameworks

We investigate two fusion strategies and create two frameworks that integrate the image-based and video-based models for action recognition. Two types of fusion are considered: early fusion, which combines features before classification, and late fusion, which combines classification results for decision-making [79].

The Early-Fusion (EF) framework combines the unprocessed output features of each model into a multi-feature map for action recognition, i.e. fusion at the feature level. In this kind of fusion, the features extracted from each input are fused to produce a new feature representation that is more expressive than the original representations from which it originates. As shown in Fig. 7, the feature fusion model captures the impact of combining the spatiotemporal features of the video-based model (R3D) with the spatiotemporal features of the image-based model (R2D-LSTM). We construct a three-layer fully connected structure in which the first two layers use a ReLU activation function, and a SoftMax classifier is placed on top of the fusion model to classify the action.
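A minimal sketch of this fusion head is shown below (PyTorch assumed); the hidden widths and the exact placement of the ReLU activations are illustrative.

```python
import torch
import torch.nn as nn

class EarlyFusionHead(nn.Module):
    """Concatenate the R2D-LSTM and R3D feature vectors and classify them."""
    def __init__(self, dim_image, dim_video, num_classes, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim_image + dim_video, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, num_classes),   # logits; SoftMax applied in the loss
        )

    def forward(self, f_image, f_video):           # each (B, dim)
        return self.mlp(torch.cat([f_image, f_video], dim=1))
```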

Fig. 7
figure 7

The overall design of the Early-Fusion framework

In the Late-Fusion (LF), or "decision-level fusion", framework, the final decision is computed after each model has produced its own decision. Late fusion is a merging approach applied after each model's classification: it combines the outputs of the individual classifiers to obtain a new output that may be more accurate and consistent. We build a late-fusion model with a single fully connected layer and use a SoftMax classifier to map the model scores to the final classification, as illustrated in Fig. 8.
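A minimal sketch of this decision-level head follows; concatenating the two class-score vectors before the single fully connected layer is our assumption about how the scores are combined.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Fuse the class scores of the two trained classifiers with one fc layer."""
    def __init__(self, num_classes):
        super().__init__()
        self.fc = nn.Linear(2 * num_classes, num_classes)

    def forward(self, scores_image, scores_video):   # each (B, num_classes)
        return self.fc(torch.cat([scores_image, scores_video], dim=1))
```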

Fig. 8
figure 8

The general design of the Late-Fusion model

4 Experimental Results

The experimental results of our proposed models and frameworks are presented in this section. The most widely used datasets for video action recognition, HMDB-51 [78] and UCF-101 [77], are used to evaluate the proposed models. The evaluation datasets are described in Sect. 4.1, and Sect. 4.2 presents the evaluation metrics. Our proposed models and frameworks are compared with state-of-the-art methods in Sects. 4.3, 4.4, and 4.5. Finally, to better assess our frameworks, we present visualization results. All training and testing experiments are conducted on four NVIDIA RTX-3090 GPUs with 24 GB of GPU RAM each and 32 CPU cores with 129 GB of RAM, running on the CUDA 11.3 platform.

4.1 Datasets

We carry out extensive experiments on the HMDB-51 [78] and UCF-101 [77] datasets to assess the performance of the proposed frameworks for action recognition in videos. The dataset details are shown in Table 1. Most videos in Brown University's HMDB-51 dataset come from movies, public databases, and Internet video repositories such as YouTube. The dataset contains 6,766 videos covering 51 human action classes and offers three train/test splits. The UCF-101 dataset contains 13,320 videos covering 101 human activity classes; the majority of its videos were taken from YouTube, and it provides the same kind of train/test splits as HMDB-51. We divide each dataset into training, validation, and testing samples at 60, 20, and 20%, respectively. The two datasets present significant challenges, including large variations in camera viewpoint and movement, cluttered backgrounds, and variations in human location, size, and appearance. Sample frames for a few action classes from UCF-101 and HMDB-51 are shown in Fig. 9a and b, respectively.

Table 1 Details of UCF-101 and HMDB-51 datasets
Fig. 9
figure 9

Sampled frames from UCF-101 and HMDB-51 datasets respectively with action class names

4.2 Evaluation metrics

We analyze the proposed models using a variety of evaluation metrics. A confusion matrix is used during model testing to reveal detailed information about the model's performance. The confusion matrix is a C × C matrix for assessing a classification model, where C denotes the number of target classes. Confusion matrices are widely utilized because they provide a more detailed view of a model's performance than classification accuracy alone: the model's predicted values are compared with the actual target values in the matrix.

We use five measures to assess the proposed models' performance: loss (categorical cross-entropy), accuracy (overall performance), precision (positive prediction accuracy), recall (sensitivity, or actual positive sample coverage), and F1-score.

Accuracy indicates overall model performance and is one of the most important evaluation criteria for deep learning classification tasks; it is defined in Eq. (3).

$${\text{Accuracy}}=\frac{Number\;of\;correct\;{\text{predictions}}}{Total\;number\;of\;predictions}$$
(3)

Categorical cross-entropy is a classification loss function that measures the performance of a model whose output is a probability in [0,1]. We use the cross-entropy loss because it is the standard choice for multi-class classification and drives the accuracy of our models; it is defined in Eq. (4).

$${\text{Loss}}= -\sum_{j}^{C}{t}_{j}{\text{log}}\left({\text{SoftMax}}\left({p}_{j}\right)\right)$$
(4)

where \({t}_{j}\) is the ground truth for input class j out of C classes and \({p}_{j}\) is the model's score for class j.

Precision assesses the proportion of correctly identified instances among those predicted as positive. In a multi-class classification problem, precision (positive prediction accuracy) is the sum of true positives across all classes divided by the sum of true positives and false positives across all classes, as in Eq. (5).

$$\mathrm{Precision }\;({\text{P}})= \frac{\sum_{j}^{C}{TP}_{j}}{\sum_{j}^{C}({TP}_{j}+{FP}_{j})}$$
(5)

where \({TP}_{j}\) is the number of true positives and \({FP}_{j}\) the number of false positives for class j out of C classes.

Recall (also known as sensitivity, or actual positive sample coverage) quantifies the proportion of actual positives that are correctly predicted. Recall is the sum of true positives across all classes divided by the sum of true positives and false negatives across all classes, as in Eq. (6).

$${\text{Recall}}\;({\text{R}})= \frac{\sum_{j}^{C}{TP}_{j}}{\sum_{j}^{C}({TP}_{j}+{FN}_{j})}$$
(6)

where \({TP}_{j}\) is the number of true positives and \({FN}_{j}\) the number of false negatives for class j out of C classes.

The F1-score combines precision and recall into a single measure that captures both properties, as shown in Eq. (7).

$$ {\text{F1-score}} = \left( 2 \times {\text{Precision}} \times {\text{Recall}} \right) / \left( {\text{Precision}} + {\text{Recall}} \right) $$
(7)
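A small sketch that computes these three metrics from a confusion matrix, following Eqs. (5)–(7) literally (summing true positives, false positives, and false negatives over all classes), is given below.

```python
import numpy as np

def metrics_from_confusion(cm):
    """cm[i, j] counts samples of true class i predicted as class j."""
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp                              # false positives per predicted class
    fn = cm.sum(axis=1) - tp                              # false negatives per true class
    precision = tp.sum() / (tp.sum() + fp.sum())          # Eq. (5)
    recall = tp.sum() / (tp.sum() + fn.sum())             # Eq. (6)
    f1 = 2 * precision * recall / (precision + recall)    # Eq. (7)
    return precision, recall, f1
```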

4.3 The Image-based model

For the image-based model, we employ R2D pretrained on the ImageNet dataset and fine-tune it on the HMDB-51 and UCF-101 datasets. To reduce computational overhead and time consumption, keyframes are used to fine-tune the model instead of every frame in the video. The video clips in the UCF-101 and HMDB-51 datasets have a consistent frame rate of 25 frames per second. To reduce the size of the dataset and use only the most representative frames, each video is segmented into clips and one keyframe image is retrieved per clip.

The keyframe is the frame whose histogram differs most from that of its preceding frame within a set of consecutive frames. The chosen number of consecutive frames (shot length) affects keyframe extraction accuracy [80, 81]: if the shot contains only a few frames, the variation between frame histograms is too small to identify a distinctive keyframe, while if it contains many frames, there may be more than one keyframe in the shot and we extract only one, neglecting the others. We explore the keyframe extraction method for two clip lengths: 5 frames per clip and 16 frames per clip. Table 2 shows performance measurements for the R2D and R2D-LSTM models using the keyframes extracted with the two clip lengths.

Table 2 Performance analysis for the extracted keyframes from different clip lengths: 5 frames per clip, 16 frames per clip on the R2D and R2D-LSTM models

Table 2 compares the accuracy of the R2D and R2D-LSTM models for the different keyframe sets. For UCF-101 and HMDB-51, keyframes extracted from 16-frame clips yield 83.0 and 60.0% accuracy, respectively, on R2D, and 92.0 and 65.0% on R2D-LSTM, outperforming the accuracy obtained with keyframes extracted from 5-frame clips. Therefore, we use 16 frames per clip and extract one keyframe per clip for all video clips. Figure 10a and b show samples of keyframes extracted from 16-frame clips for the UCF-101 and HMDB-51 datasets, respectively, with action class names. The keyframe images are then divided into training, validation, and testing samples at 60, 20, and 20%, respectively, as shown in Table 3.

Fig. 10
figure 10

Sampled key frames from UCF-101 and HMDB-51 datasets respectively with action class names

Table 3 The number of keyframes for each dataset and the corresponding train/validation/test splits

Using the keyframe images, we fine-tune the R2D on UCF-101 and HMDB-51, respectively. The R2D is a 2D ResNet-101 model pretrained on the ImageNet dataset [45]. For fine-tuning R2D, we generate training samples from the extracted keyframes: the training samples are resized to 224 × 224 pixels, horizontally flipped, and mean subtraction using the ImageNet means is performed.

For training: We employ cross-entropy loss and backpropagate its gradients. We use 60% of each dataset for training and 20% for validation. For the fine-tuning procedure we use stochastic gradient descent with a learning rate of 0.1, momentum of 0.9, a batch size of 64, and 25 epochs.
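A minimal sketch of this recipe (PyTorch assumed) is given below; model and train_loader are placeholders for the R2D network and a data loader over the 224 × 224 keyframe crops.

```python
import torch
import torch.nn as nn

def finetune(model, train_loader, device="cuda", epochs=25):
    """Fine-tuning loop with the hyperparameters stated above."""
    model = model.to(device).train()
    criterion = nn.CrossEntropyLoss()                             # categorical cross-entropy
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    for _ in range(epochs):
        for images, labels in train_loader:                       # batch size 64 set in the loader
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```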

Figure 11a and b show the training and validation accuracies of the fine-tuned R2D on UCF-101 and HMDB-51, respectively. For UCF-101, the validation accuracy is slightly higher than the training accuracy, with the best validation accuracy around 85.0%. For HMDB-51, the validation accuracy is lower than the training accuracy due to the difficulty of the dataset's image appearance and contrast, reaching almost 65.0%; HMDB-51 is a more challenging dataset than UCF-101 in this respect. Figure 12a and b show the training and validation losses of R2D for UCF-101 and HMDB-51, respectively. For UCF-101 the training and validation losses are nearly the same, while for HMDB-51 the training loss is marginally lower than the validation loss.

Fig. 11
figure 11

Fine-tuning ResNet-101 (R2D) training and validation accuracies

Fig. 12
figure 12

Fine-tuning ResNet-101 (R2D) training and validation losses

We employ the fine-tuned R2D as a feature extractor to produce 2048-dimensional feature sequences from the Conv5_x layer. A two-layer LSTM with a SoftMax classifier is trained on the extracted features: the LSTM extracts temporal features from the sequence of features produced by R2D, and the SoftMax classifier recognizes the video action. For training, we use stochastic gradient descent with a learning rate of 0.1, momentum of 0.9, a batch size of 64, and 25 epochs with cross-entropy loss; the training and validation data are again split 60 and 20%, respectively.

Figure 13a and b display the training and validation accuracy of R2D-LSTM on UCF-101 and HMDB-51, respectively. The validation accuracy reaches 93.0% on UCF-101 and 65.0% on HMDB-51. Although we use only the keyframes of each dataset rather than all video frames, the LSTM allows the model to learn additional temporal features from the keyframe spatial feature sequences produced by R2D, and its use improves the performance of R2D by exploiting more temporal features for each action. Figure 14a and b show the training and validation losses of R2D-LSTM on UCF-101 and HMDB-51, respectively. The validation losses are marginally higher than the corresponding training losses for HMDB-51, while for UCF-101 the validation loss is similar to the training loss. We can conclude that no overfitting occurs when training the R2D-LSTM model. From Figs. 11 and 13 and Table 2, we find that R2D-LSTM outperforms the R2D model.

Fig. 13
figure 13

R2D-LSTM training and validation Accuracies

Fig. 14
figure 14

R2D-LSTM training and validation losses

For testing: We evaluate the R2D-LSTM model on the 20% testing set. The testing samples are resized to 224 × 224 pixels, flipped horizontally, and mean subtraction is performed by subtracting the per-channel mean values. The confusion matrix for the R2D-LSTM model reveals that the recognition accuracy on the UCF-101 dataset is greater than 93.0% for most classes, as shown in Fig. 15a, while the recognition accuracy on the HMDB-51 dataset is near 65.0% for most classes, as shown in Fig. 15b.

Fig. 15
figure 15

Confusion Matrix for R2D-LSTM on target datasets

4.4 The video-based model

We use the pretrained R3D (ResNet-101) model for the video-based model [15]; the model is pretrained on the Kinetics dataset, and in our experiments we fine-tune it on the UCF-101 and HMDB-51 datasets. We choose a 16-frame clip from a temporal position in each video; if the video is shorter than 16 frames, we loop it as many times as needed. We apply spatial and temporal transformations to the clip frames: each sample has three channels and sixteen frames, is 112 × 112 pixels in size, and is horizontally flipped with 50% probability. We also perform mean subtraction, subtracting the per-channel mean values of the Kinetics dataset [22] from the samples. All generated samples keep the same class labels as the original videos.
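A minimal sketch of this clip preparation is shown below; the per-channel mean values are placeholders rather than the actual Kinetics statistics of [22].

```python
import random
import cv2
import numpy as np

KINETICS_MEAN = np.array([110.0, 104.0, 97.0])   # placeholder per-channel means, not the real statistics

def sample_clip(frames, clip_len=16, size=112):
    """frames: list of HxWx3 uint8 arrays for one video."""
    while len(frames) < clip_len:                # loop videos shorter than 16 frames
        frames = frames + frames
    start = random.randint(0, len(frames) - clip_len)
    flip = random.random() < 0.5                 # one horizontal-flip decision per clip
    out = []
    for f in frames[start:start + clip_len]:
        f = cv2.resize(f, (size, size)).astype(np.float32)
        if flip:
            f = f[:, ::-1]                       # horizontal flip
        out.append(f - KINETICS_MEAN)            # mean subtraction
    return np.stack(out)                         # (16, 112, 112, 3)
```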

For training: To train the SoftMax classifier layer, we use cross-entropy loss and backpropagate its gradients. The training parameters include a learning rate of 0.1 and a momentum of 0.9.

We use the pretrained R3D model and fine-tune it on the UCF-101 and HMDB-51 datasets. First, fine-tuning is applied to the conv5_x bottleneck and the fully connected layer. Figure 16a and b show the training and validation accuracy for R3D on the UCF-101 and HMDB-51 datasets, respectively. We found a large gap between the training and validation accuracies, i.e. overfitting occurs when fine-tuning the conv5_x bottleneck and the fully connected layers. For this reason, we refine the model by training only the fully connected layer on the corresponding target video datasets (UCF-101 and HMDB-51).

Fig. 16
figure 16

Training and validation accuracies for R3D by fine-tuning conv5_x bottleneck and the fully connected layer

By fine-tuning only the fully connected layer of R3D, the overfitting issue is resolved. Figure 17a and b display the R3D training and validation accuracy on the UCF-101 and HMDB-51 datasets, respectively. The UCF-101 training accuracy is 80.0%, while the validation accuracy is almost 77.0%, quite close to the training accuracy. Although the HMDB-51 dataset presents challenges in terms of background, camera orientation, illumination, and other factors that influence the performance of action models, the training accuracy of the R3D model is 60.0% and the HMDB-51 validation accuracy is 50.0%.

Fig. 17
figure 17

Training and validation accuracies for R3D by fine-tuning only the fully connected layer

Figure 18a and b display the training and validation losses for R3D on UCF-101 and HMDB-51, respectively. For UCF-101, the validation loss is quite close to the training loss, whereas on HMDB-51 the validation loss is greater than the training loss. Based on these results, we find that R3D pretrained on Kinetics and fine-tuned on UCF-101 and HMDB-51 enhances performance without the additional computational cost and time of training the model from scratch on those datasets.

Fig. 18
figure 18

Training and validation losses for R3D by fine-tuning only the fully connected layer

For testing: We evaluate R3D on the testing set, using the R3D model that performs best on the validation set. We split each video in the testing data into 16-frame clips, and the class label for the video is the one with the highest average score across all clips. Figure 19a and b present the confusion matrices of R3D on the UCF-101 and HMDB-51 testing data, respectively. The testing accuracy of R3D reaches 77.0% on UCF-101 and up to 50.0% on HMDB-51.
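The clip-level testing protocol can be sketched as follows; preprocess_clip is a hypothetical stand-in for the spatial–temporal transform described above.

```python
import torch

@torch.no_grad()
def predict_video(model, frames, clip_len=16):
    """Split a video into non-overlapping 16-frame clips, average the clip
    scores, and return the video-level class index."""
    scores = []
    for start in range(0, len(frames) - clip_len + 1, clip_len):
        clip = preprocess_clip(frames[start:start + clip_len])   # hypothetical -> (3, 16, 112, 112)
        logits = model(clip.unsqueeze(0))                        # add the batch dimension
        scores.append(torch.softmax(logits, dim=1))
    return torch.cat(scores).mean(dim=0).argmax().item()
```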

Fig. 19
figure 19

Confusion Matrix for R3D target datasets

4.5 The evaluation of the fusion framework’s performance

The fusion frameworks use the two models discussed in Sects. 3.1 and 3.2: image-based (R2D-LSTM) and video-based (R3D). Initially, we choose a 16-frame clip from a temporal position in the video, looping it as needed when the video is shorter than 16 frames, and then extract a keyframe for each clip. Our frameworks have two streams. The first stream is image-based and uses the extracted keyframe image: we apply a spatial transformation to the keyframe by flipping each sample horizontally with 50% probability, resizing it to 224 × 224 pixels, and performing mean subtraction with the ImageNet mean values [76]. The second stream is video-based and uses the 16-frame video clip: we apply a spatial–temporal transformation that resizes the samples to 112 × 112 pixels, flips each sample horizontally with 50% probability, and performs mean subtraction with the per-channel mean values of the Kinetics dataset [22]. All generated samples and keyframe images use the same class labels as the original videos. Finally, we apply the two fusion methods discussed in Sect. 3.3.

For training: To train the SoftMax classifier layer, we employ cross-entropy loss and backpropagate its gradients. We use stochastic gradient descent with a learning rate of 0.1, momentum of 0.9, a batch size of 64, and 25 epochs. The training and validation data are split 60 and 20%, respectively.

4.5.1 The early-fusion framework

Early fusion combines the extracted features of each model to produce a new data representation, and the classifier is trained on this new feature map. The fused features are more expressive than the separate representations, which enables us to benefit concurrently from the advantages of both. Compared to employing a single representation alone, this can produce more effective discrimination, as shown in Fig. 20, which presents the training and validation accuracy of the Early-Fusion model using split-1 of the UCF-101 and HMDB-51 datasets, respectively.

Fig. 20
figure 20

The Early-Fusion training and validation accuracies

As demonstrated in Fig. 20, the Early-Fusion model improves on the two pre-trained models' accuracy when we combine the features obtained from the individual models into multi-modal feature maps that are more expressive than the individual representations from which they derive. On UCF-101 (Fig. 20a) and HMDB-51 (Fig. 20b), the Early-Fusion model's validation accuracy is 93.0 and 71.0%, respectively. Figure 21a and b depict the training and validation losses of the Early-Fusion model for UCF-101 and HMDB-51, respectively.

Fig. 21
figure 21

The Early Fusion training and validation losses

For testing: For the Early-Fusion model, we divide the video into non-overlapping 16-frame clips, extract the keyframe from each clip, and employ the R3D and R2D-LSTM models that perform well on the test set. We use the proposed models to extract the different temporal features from the keyframe and the video clip, and the classifier identifies the video action based on the combination of these features. The class with the highest average score across all the video clips is assigned to the video. The Early-Fusion confusion matrices on the UCF-101 and HMDB-51 testing data are shown in Fig. 22a and b, respectively; the Early-Fusion model's testing accuracy is 95.5 and 70.1%, respectively.

Fig. 22
figure 22

Confusion Matrix for Early-Fusion Framework

4.5.2 The late-fusion framework

Late fusion is a merging technique that combines the classification results of the individual models. To produce new decisions that are more accurate and dependable, it integrates the results of each classifier and applies a deep fusion classifier model.

Figure 23a and b present the accuracy of late fusion on UCF-101 and HMDB-51, respectively. The Late-Fusion model increases the accuracy of the two pretrained models; its validation accuracy on UCF-101 and HMDB-51 for split-1 reaches 97.0 and 77.0%, respectively. Figure 24a and b show that the validation loss of the Late-Fusion model is quite similar to the training loss for UCF-101 and HMDB-51, respectively.

Fig. 23
figure 23

The Late-Fusion training and validation Accuracies

Fig. 24
figure 24

The Late-Fusion training and validation losses

For testing: For the Late-Fusion model, after dividing the video into non-overlapping 16-frame clips and extracting the keyframe of each, we employ the R2D-LSTM and R3D models that perform well on the test set. The trained models produce class scores for the keyframe and the video clip, and the fusion classifier identifies the video action after combining these scores. The class with the highest average score across all video clips is assigned to the video. The Late-Fusion confusion matrices for the UCF-101 and HMDB-51 datasets are shown in Fig. 25a and b, respectively, revealing that the Late-Fusion model's testing accuracy is 97.5 and 77.7%, respectively.

Fig. 25
figure 25

Confusion Matrix for Late-Fusion Framework

We assess the F1-score, precision, and recall of the proposed image-based (R2D-LSTM), video-based (R3D), Early-Fusion, and Late-Fusion models on the UCF-101 and HMDB-51 datasets in Figs. 26 and 27, respectively. The Late-Fusion framework reaches 96.22, 98.12, and 97.16% precision, recall, and F1-score, respectively, on the UCF-101 dataset, and 75.25, 77.50, and 76.36% on the HMDB-51 dataset. The results confirm that the proposed Late-Fusion framework classifies human activities better than the other frameworks.

Fig. 26
figure 26

The evaluation criterion percentage for the proposed models on the UCF-101 dataset

Fig. 27
figure 27

The evaluation criterion percentage for the proposed models on the HMDB-51 dataset

5 Comparison with the state of the art

This section compares the proposed models' results with current best-performing methods that use only RGB input data. Table 4 compares the performance of the proposed models and earlier state-of-the-art techniques on the UCF-101 and HMDB-51 benchmarks. Certain models, such as the TSM of Lin et al. [47], aggregate temporal information using simple 2D convolutions in conjunction with a manually designed temporal shift module. Video masked autoencoders (VideoMAE) [66] propose data-efficient learners via self-supervised video pre-training (SSVP). Masked video distillation (MVD), with a straightforward co-teaching approach that benefits from combining video and image teachers, was proposed by Rui et al. [68]. Roberta [52] utilizes a 3D-CNN architecture to classify unusual behavior of people in public places from video input. STM [49], proposed by Jiang et al., replaces ResNet blocks with STM blocks to represent motion and spatiotemporal properties in a 2D framework. I3D-LSTM benefits from a pre-trained I3D CNN model to extract low-level spatial–temporal features and improves the performance of an LSTM introduced to model high-level temporal features [34]. MSTSM [44] is a 2D-CNN action recognition framework utilizing temporal feature difference extraction and a multi-scale temporal shift module, with the baseline ResNet-101 family from [15]. I3D [26] is a Two-Stream Inflated 3D-CNN model for action recognition using RGB and optical flow streams.

Table 4 Accuracy comparison for proposed models and state-of-the-art methods on UCF-101 and HMDB-51 datasets

As shown in Table 4, the accuracy of our video-based R3D differs from that reported for the R3D model in [15] on the UCF-101 and HMDB-51 datasets, owing to the different training parameters and data augmentation used. The CNN-LSTM in [22], ST-D LSTM in [31], and Bi-LSTM in [82] were all trained from the ImageNet dataset and used various image data augmentations, which affects the performance they report. For the transformer models in [68], the authors proposed a small transformer (Teacher-B) and a large one (Teacher-L); we achieve higher accuracy than Teacher-B, but Teacher-L obtains marginally better results than our frameworks, albeit at the expense of additional computation time. While all of these studies demonstrate improvements on the UCF-101 and HMDB-51 datasets, the method of Carreira and Zisserman [26] is currently the most effective, achieving 98.0 and 80.2%, respectively, by applying the I3D models to RGB and optical flow streams. Using RGB input only, our proposed frameworks outperform competitors that rely on the same modality, reaching 97.5 and 77.7%, respectively. Our late-fusion framework thus trails the I3D model [26] by margins of approximately 1.0% on UCF-101 and 2.0% on HMDB-51; however, I3D obtains its accuracy with two streams (RGB and optical flow), whereas we use only the RGB modality, giving our approach an advantage in time and cost.

6 Conclusion

This paper has introduced two effective human action recognition frameworks based on two models built from the ResNet family. We study the improvement and enhancement of 2D and 3D ResNet models using multi-model fusion techniques. The proposed frameworks consist of two streams that concentrate on RGB input data from images and video clips, and the two streams are fused using different fusion strategies. The first stream is image-based, built from R2D and LSTM models (R2D-LSTM), and captures long-term spatial–temporal features from keyframe images extracted from RGB video clips. The second stream is video-based and employs R3D to extract short-term spatial–temporal features from video clips. Two frameworks are proposed to describe the effect of different fusion architectures on action recognition performance: we explore early and late fusion techniques for video action recognition. The early-fusion framework studies the effect of fusing the features of the two streams before decision-making and action recognition, while the late-fusion framework studies fusing the decisions of the two models for action recognition. The different fusion techniques show how much the spatial and temporal features influence the recognition model's accuracy, and we investigate the fusion approach that yields the best outcome for combining spatial and temporal information. Both the early-fusion and late-fusion frameworks perform well in experiments. We evaluated the proposed models on two popular video datasets (UCF-101 and HMDB-51). Early fusion achieved 95.5% on UCF-101 and 70.1% on HMDB-51, and late fusion achieved 97.5% on UCF-101 and 77.7% on HMDB-51, which is comparable with state-of-the-art methods on those datasets. In future work, we will extend this framework to video domain adaptation for action recognition to optimize and enhance computational cost and performance.