1 Introduction

The importance of Human Action Recognition (HAR) in computer vision research stems from its usefulness in a variety of applications, including content-based video analysis [1,2,3], health monitoring [4], sophisticated video surveillance [5], and smart robot systems [6]. HAR aims to identify and classify human actions into categories using different input modalities, such as 3D skeleton data [7,8,9,10], still images [11,12,13,14], or videos [15,16,17,18], by extracting and classifying the input data's spatiotemporal features. The spatial features convey appearance, while the temporal features represent motion. HAR pipelines are typically split into two stages: feature extraction and classification. Features are extracted with handcrafted or deep learning approaches [10], while machine learning and deep learning approaches perform feature learning, action recognition, and classification [19].

HAR using deep learning techniques has received increasing attention due to its progress and promising results [20]. However, no deep learning model has yet demonstrated learning spatiotemporal features from videos or successive frames with fully adequate results. Possible explanations include the nature of the input data (images or videos), the type of spatiotemporal features retrieved, and the models used to extract those features [15]. Deep learning approaches for HAR are built on a variety of architectures: two-stream 2D convolutional neural networks (2D-CNNs), recurrent neural networks such as LSTM, 3D convolutional neural networks (3D-CNNs), and transformers. These models differ in the input data they use (image, video, or skeleton), their network architecture, and their technique for extracting spatiotemporal features to recognize actions [21,22,23].

Typically, the action in a video is identified using features extracted from its image sequence. Most image-sequence models rely on a two-stream 2D-CNN architecture, in which one stream extracts spatial information (appearance) from RGB images, while the other extracts short-term temporal features (motion) from optical flow images [12, 24, 25]. These models are effective in improving action recognition accuracy; however, the processing cost of computing optical flow is considerable. Furthermore, a single RGB image carries only spatial information, so it is insufficient on its own for recognizing actions.

Given the expanding volume of video data, the inherently spatial nature of single RGB images, and the high processing cost of computing optical flow, alternative approaches are needed. Action recognition approaches therefore use 3D-CNNs to extract features directly from RGB videos. Depending on the network design and the duration of the video clip, a 3D-CNN extracts various spatiotemporal features from video clips. Although 3D-CNNs increase HAR accuracy [15, 22, 26], their performance is influenced by the dimensions and duration of the video clip. Most video action recognition approaches use short clip lengths (8 or 16 frames per clip) to decrease computation time and cost, which causes the network to learn only short-term temporal information of limited use for recognizing actions [15, 16, 27, 28]. Using long clips (64 frames per clip) teaches the models the long-term temporal information that characterizes long-range movements, but requires considerably more computing effort and cost than using short clips.

Compared to CNN models for HAR, the LSTM model addresses several of their core limitations. LSTM extracts long-term temporal features that supply more information about motion in a series of images [29]. CNN architectures are often combined with LSTM to improve HAR performance: the CNN first extracts spatial or temporal features from images or short videos, and the LSTM is then trained on the resulting feature maps to extract long-term temporal characteristics for action recognition and classification [22, 30, 31]. CNN-LSTM is a straightforward model for extracting spatiotemporal information from RGB images or videos. Its behavior depends strongly on the chosen CNN backbone and the input data, since the CNN architecture determines the extracted features and affects the model's computational cost and running time [32,33,34,35].

The 2D-CNN-LSTM model's efficacy may be degraded by redundant data captured by video devices, which can obscure important information. Keyframe selection can therefore affect the performance of a HAR model, as a small number of keyframes can carry the most expressive information for local frame representations. Keyframes are the frames in a video clip that contain its most important visual content. Depending on the clip's content complexity and the keyframe extraction method, one or more keyframes can be extracted from a single clip [36]. Commonly used keyframe extraction techniques include (1) shot (lens)-based methods [37], (2) clustering-based methods [38], (3) motion feature-based methods [39], and (4) content analysis-based methods [40, 41]. Keyframe extraction reduces the temporal complexity of action recognition and increases the recognition model's performance, and several HAR models based on keyframe extraction are already available [42,43,44].

To address the challenges discussed above and extract richer information for improving action recognition, we investigate a variety of video features and fusion techniques. Most visual action recognition approaches are based on color cameras that capture human body movement. In this setting, RGB image features capture action scene information, while RGB video features capture depth and motion cues, and combining them leads to more discriminative performance. Action recognition in videos can perform better when the intrinsic semantic relationship between keyframe images and video clips is exploited to combine scene features from the images with action characteristics from the video. We therefore propose two frameworks that explore distinct fusion methodologies for the action recognition models studied in this paper; these models extend the widely used residual network (ResNet-101) models in [15, 45] and LSTM to handle RGB videos. Each framework consists of two streams: the first extracts temporal features from a keyframe image sequence using the R2D-LSTM network, while the second extracts temporal features from video clips using a 3D-CNN, and several fusion techniques are investigated for action recognition. Overall, this paper's main contributions are:

  1. Enhance the performance of 2D ResNet-101 (R2D) by merging it with LSTM to extract temporal information from only the keyframe images of videos for HAR (R2D-LSTM).

  2. Reduce computing cost and time by training the R2D-LSTM model with keyframes rather than entire video frames.

  3. Examine an early fusion method for the R2D-LSTM and video-based (R3D) HAR models to investigate various temporal representations of actions in video.

  4. Examine late fusion techniques for the two streams with different inputs for human action recognition.

  5. Extensive experiments demonstrate that the proposed models and frameworks significantly improve recognition performance; on the UCF-101 and HMDB-51 datasets, our frameworks outperform recent models that use only RGB input.

The remainder of this paper is organized as follows. Related action recognition work is reviewed in Sect. 2. The proposed models are described in Sect. 3. Experimental results are presented in Sect. 4, and a comparison with the state of the art in Sect. 5. Section 6 concludes the paper.

2 Related work

Various types of input data can be collected in the HAR domain, including images, videos, skeletons, and, more recently, depth maps and infrared images. Here we discuss related HAR models that rely only on RGB image and video data.

2.1 Image-based models

Due to the rapid development of deep learning algorithms, considerable progress has been made in image-based action recognition. Karpathy et al. [14] examine several methods for fusing information over the temporal dimension with a 2D-CNN network. They found that the slow fusion model, which mixes early and late fusion, captures the spatiotemporal features of video clips; however, although it captures spatiotemporal features in the first three layers, it loses the temporal information afterwards. Simonyan and Zisserman [24] propose two-stream networks based on the 2D-CNN model. The model uses stacked optical flow vectors of motion features to address the difficulty deep networks have in learning motion features. Combining the two streams increases the accuracy of action identification: the spatial network gathers visual information from video frames, and the temporal network gathers motion information by applying optical flow to neighboring frames. Since then, many approaches based on two-stream CNNs have been proposed to enhance action recognition performance [12, 25, 46]. Such models increase action recognition accuracy, but they encounter certain issues, such as the high processing cost of calculating optical flow, and, given the nature of single images, they can only extract short-term temporal information, which is not expressive enough for video action recognition. Lin et al. [47] offer a fresh perspective for gathering temporal information: the Temporal Shift Module (TSM) efficiently collects temporal information using only 2D convolutions. To capture the differences between the features of different frames, Wu et al. [48] present a multi-scale temporal shift module based on the TSM. Jiang et al. [49] replace the original ResNet blocks with a proposed STM block that learns and encodes spatiotemporal and motion information in a 2D framework for superior activity recognition results, at the cost of increased computation. TRM [50] attempts to reposition spatial elements along the temporal dimension to provide spatially aware long-term temporal modeling. Wu et al. [51] concentrate on transferring knowledge for video classification problems. They provide a strong semantic target using a well-trained language model for effective transfer learning, and they improve the transferability of vision-language pre-training models for downstream classification tasks using textual encoders. In numerous studies, LSTMs have been used to extract long-term temporal information from a series of spatial feature maps. Donahue et al. [28] propose LRCN, which extracts features with a 2D-CNN and then uses two-layer LSTMs to learn spatiotemporal features; LRCN predicts the action class by averaging the individual predictions of each frame of the video sequence. Gammulle et al. [30] develop a model employing two-stream LSTMs in combination with convolutional and fully connected activations. They suggest a deep fusion framework that takes advantage of both temporal features from LSTM models and spatial features from CNNs. FC-RNN [33] combines deep neural networks across many layers and modalities, capturing various static and dynamic stimuli at various temporal scales, in addition to combining different semantic levels into each network.

2.2 Video-based models

Unlike the 2D-CNNs in the two-stream approaches above, 3D-CNNs outperform 2D-CNNs on large-scale datasets, since a 3D-CNN can extract spatiotemporal information directly from video clips. The first 3D ConvNet was proposed in 2013 [18], but the model suffered from expensive training cost and time. Tran et al. [27] propose the C3D framework, based on the VGG network, to learn spatiotemporal features on large-scale datasets; C3D excels in many video analysis tasks. In 2017, I3D [26] proposed a Two-Stream Inflated 3D ConvNet based on Inception-V1, trained on Kinetics-400 [22] using realistic, challenging YouTube videos. In terms of performance, I3D outperforms other video action recognition models, but it faces some problems, such as the high computational cost and time of optical flow calculation and the use of long clips (64 frames per clip) to learn the long-term temporal information that represents long-range movements well. Wang et al. [34] propose I3D-LSTM based on the I3D model, which uses a pre-trained I3D to extract low-level spatiotemporal information and then an LSTM to model high-level spatiotemporal features. Roberta et al. [52] propose a 3D-CNN for video analysis to identify human activity; this architecture can categorize video sequences containing several types of human activity, and the authors argue that categorizing typical human behaviors is a step toward categorizing deviant behaviors, improving public space security or assisting individuals. Using diverse video datasets, Hara et al. [15] examine the designs of different 3D-CNNs built on the ResNet family, ranging from shallow to very deep. The accuracy of these ResNets (R3D) varies with model depth and is still lower than that of two-stream I3D, but R3D surpasses C3D in recognition accuracy and performs better in terms of compactness, model size, and run-time speed. Based on the 3D-CNN improvements in R3D, many approaches were introduced, such as DenseNet [53], WRN [54], ResNeXt [55], and X3D [56]. A bi-directional LSTM (BiLSTM) was introduced in [57] that preferentially concentrates on useful features in the input frames to recognize the various human actions in videos. Zan and Zhao [58] combine CNN and LSTM networks to propose a TS-CNN-LSTM framework; the approach offers a remedy for real-time interactive applications that demand prompt results from human action recognition. Several further strategies build on 3D-CNNs with LSTM enhancements, including [59,60,61,62]. There are additional difficulties where precise categorization depends not only on the latent semantic representations of the action attributes and their temporal connections but also on the visual aspects. Transformer models have recently been applied to vision challenges, where the transformer captures global dependencies among tokens by learning semantics through self-attention. Pure transformer designs were not popular in computer vision until Vision Transformers (ViT) [63] achieved remarkable success in image classification, which led to the integration of transformers in video classification. ViViT [64] and TimeSformer [65] are the first two efforts that effectively utilize a pure transformer architecture for video classification. Zhan et al. [66] propose video masked autoencoders (VideoMAE), data-efficient learners trained with self-supervised video pre-training (SSVP). Zhen et al. [67] introduce the SVFormer model under the self-supervised learning setting for action recognition. Rui et al. [68] propose masked video distillation (MVD) together with a simple co-teaching strategy that enjoys the synergy of image and video teachers. There is much research on transformers for video classification [69,70,71,72], and on incorporating transformer blocks [70, 73] as additional layers into CNNs to better model long-range interactions among spatiotemporal characteristics. Despite the Transformer's modeling efficiency in capturing latent semantics and global dependencies, CNNs and LSTMs can still capture high-level spatiotemporal features effectively.

3 The Proposed models

This section describes the HAR models and frameworks proposed in this paper in detail: the image-based action recognition model in Sect. 3.1, the video-based action recognition model in Sect. 3.2, and the early and late fusion frameworks for HAR in Sect. 3.3.

3.1 The Image-based action recognition model

We develop an image-based action recognition model that exploits the benefits of integrating a 2D-CNN based on the ResNet-101 network (R2D) [45] with LSTMs [74] to express the temporal features of video keyframes. As depicted in Fig. 1, the model comprises an R2D network fine-tuned on the keyframe images of the target dataset to obtain spatial features, two layers of LSTM to obtain temporal features from the spatial feature maps, and two fully connected layers with a linear classifier for action recognition.

Fig. 1
figure 1

The R2D-LSTM model for image-based action recognition

We develop a keyframe extraction method based on shot-based analysis using the histogram comparison technique [40] to extract the video clip's keyframes, as shown in Fig. 2. Each video is split into K separate clips with N frames each. For every video clip, we extract the keyframe by calculating the color histogram of each frame as in [75] and comparing the dissimilarity of the histograms of each pair of consecutive frames using the correlation method in Eq. (1); the frame with the highest dissimilarity among all N frames is chosen as the keyframe for that clip. For each video with K clips, we thus extract K keyframe images.

$$d\left(Hist_{i},Hist_{i+1}\right)=\frac{\sum_{I}\left(Hist_{i}(I)-\overline{Hist_{i}}\right)\left(Hist_{i+1}(I)-\overline{Hist_{i+1}}\right)}{\sqrt{\sum_{I}\left(Hist_{i}(I)-\overline{Hist_{i}}\right)^{2}\,\sum_{I}\left(Hist_{i+1}(I)-\overline{Hist_{i+1}}\right)^{2}}}$$
(1)

where

Fig. 2
figure 2

Extracting keyframe images from each consecutive N frame of the video clip using histogram difference

$$\overline{Hist_{i}} = \frac{1}{M}\sum_{J} Hist_{i}\left(J\right)$$
(2)

\(\overline{Hist_{i}}\) is the mean value of the histogram of frame i, M is the total number of histogram bins, and \(d\left(Hist_{i},Hist_{i+1}\right)\) expresses how well the two histograms match.
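As a minimal sketch of this selection step, the code below picks one keyframe per N-frame clip, assuming that OpenCV's cv2.compareHist with the HISTCMP_CORREL method realizes the correlation of Eq. (1); the grayscale histogram and the 64-bin count are illustrative simplifications of the color histogram of [75].

```python
import cv2
import numpy as np

def frame_histogram(frame, bins=64):
    """Normalized grayscale histogram of a single frame (M = bins)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([gray], [0], None, [bins], [0, 256])
    return cv2.normalize(hist, hist).flatten()

def select_keyframe(clip_frames):
    """Return the frame of an N-frame clip whose histogram is most dissimilar
    (lowest correlation d of Eq. (1)) to that of its preceding frame."""
    if len(clip_frames) < 2:
        return clip_frames[0]
    hists = [frame_histogram(f) for f in clip_frames]
    best_idx, best_dissim = 1, -np.inf
    for i in range(len(hists) - 1):
        corr = cv2.compareHist(hists[i], hists[i + 1], cv2.HISTCMP_CORREL)
        dissim = 1.0 - corr                      # high dissimilarity = low correlation
        if dissim > best_dissim:
            best_dissim, best_idx = dissim, i + 1
    return clip_frames[best_idx]
```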

The R2D network is fine-tuned on the keyframes to conduct deep feature learning. We use a 2D ResNet-101 (R2D) pre-trained on the ImageNet dataset [76] to overcome the issue of insufficient training data, and then fine-tune all layers using the keyframe images of the UCF-101 [77] and HMDB-51 [78] datasets separately to adapt the network to our task. As shown in Fig. 3, the R2D architecture consists of an input convolution layer with max pooling, four residual blocks, and an average pooling layer followed by an output classification layer (fc). We extract the spatial information from Conv5_x, the last convolutional block.

Fig. 3
figure 3

2D ResNet-101 (R2D) network architecture; spatial features are extracted from the Conv5_x layer [45]

To capture temporal ordering and extract temporal features, a two-layer LSTM is trained on the spatial feature vectors that R2D derives from the keyframe images. The LSTM recurrent neural network is effective at processing sequential data; it resolves many of the fundamental issues of traditional recurrent networks by using an appropriate gradient-based learning algorithm. Finally, the image-based model uses R2D to extract spatial features, which are fed to the LSTM to extract temporal features for action recognition. The final output of R2D-LSTM is obtained from the SoftMax classifier.
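A minimal PyTorch sketch of the R2D-LSTM stream is given below; the framework choice, the 512-unit hidden size, and the single linear classification head are our assumptions, while the 2048-dimensional Conv5_x output and the two-layer LSTM follow the description above.

```python
import torch
import torch.nn as nn
from torchvision import models

class R2DLSTM(nn.Module):
    def __init__(self, num_classes, hidden_size=512):
        super().__init__()
        backbone = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
        # Keep everything up to (and including) the global average pool,
        # i.e. the Conv5_x features of Fig. 3; drop the ImageNet fc layer.
        self.r2d = nn.Sequential(*list(backbone.children())[:-1])
        self.lstm = nn.LSTM(input_size=2048, hidden_size=hidden_size,
                            num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, keyframes):                      # (B, K, 3, 224, 224)
        b, k = keyframes.shape[:2]
        feats = self.r2d(keyframes.flatten(0, 1))      # (B*K, 2048, 1, 1)
        feats = feats.flatten(1).view(b, k, 2048)      # sequence of spatial features
        out, _ = self.lstm(feats)                      # temporal modeling over keyframes
        return self.classifier(out[:, -1])             # class logits (SoftMax in the loss)
```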

3.2 The video-based action recognition model

The second component of our models is video-based action recognition utilizing the 3D ResNet-101 (R3D) network. The Residual Network (ResNet) is among the most effective networks for image interpretation. In [15], the authors extend the ResNet design to 3D-CNNs to improve the performance of video action recognition. Compared to state-of-the-art methods, R3D pre-trained on the large Kinetics video dataset [22], with depths from shallow up to 152 layers, produces considerable progress and improves action recognition performance.

Kensho et al. [15] examine different networks, including ResNets (basic and bottleneck blocks), DenseNet [53], WRN [54], and ResNeXt [55]. Figure 4 summarizes the architecture of the 3D ResNet-101: an input convolution layer with max pooling, four bottleneck blocks, an average pooling layer, and a fully connected layer. Each bottleneck block contains three convolutional layers, each followed by batch normalization and ReLU. As Fig. 5 shows, the kernel size of the second convolutional layer is 3 × 3 × 3, whereas the kernel sizes of the first and third convolutional layers are 1 × 1 × 1.

Fig. 4
figure 4

The 3D ResNet-101 Network Architecture [15]

Fig. 5
figure 5

The Bottleneck Architecture [15]

The 3D ResNet-101 network was pretrained on the Kinetics dataset and then fine-tuned on the UCF-101 and HMDB-51 datasets independently. The model is employed to extract spatiotemporal features from short-term video clips (N frames), as shown in Fig. 6. The final fully connected layer uses the SoftMax classifier to identify actions in videos.
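The sketch below illustrates how such a Kinetics-pretrained 3D ResNet can be adapted to a target dataset by replacing its classification head; torchvision's r3d_18 with Kinetics weights is used only as a readily available stand-in for the deeper 3D ResNet-101 of [15], and freezing the backbone reflects the fc-only fine-tuning described later in Sect. 4.4.

```python
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

def build_r3d(num_classes, freeze_backbone=True):
    """Adapt a Kinetics-pretrained 3D ResNet to a new action-label set."""
    model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)   # stand-in for R3D-101
    if freeze_backbone:                                      # fine-tune only the new head
        for p in model.parameters():
            p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # trainable classifier
    return model
```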

Fig. 6
figure 6

The 3D ResNet-101(R3D) network for video-based action recognition

3.3 The fusion frameworks

We investigate two fusion strategies and create two frameworks that integrate the image-based and video-based models for action recognition. Two types of fusion are considered: early fusion, which combines features before classification, and late fusion, which combines classification results for decision-making [79].

The Early-Fusion (EF) framework combines the unprocessed output features of each model into a multi-feature map for action recognition, i.e. fusion at the feature level. In this kind of fusion, the features extracted from each input are fused to produce a new feature representation that is more expressive than the original representations from which it originates. As shown in Fig. 7, the feature fusion model captures the impact of combining the spatiotemporal features of the video-based model (R3D) with the spatiotemporal features of the image-based model (R2D-LSTM). We construct a three-layer fully connected structure in which the first two layers use a ReLU activation function, and a SoftMax classifier is placed on top of the fusion model to classify the action.
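A minimal sketch of this fusion head is shown below (PyTorch assumed); the hidden widths and the exact placement of the ReLU activations are illustrative.

```python
import torch
import torch.nn as nn

class EarlyFusionHead(nn.Module):
    """Concatenate the R2D-LSTM and R3D feature vectors and classify them."""
    def __init__(self, dim_image, dim_video, num_classes, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim_image + dim_video, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, num_classes),   # logits; SoftMax applied in the loss
        )

    def forward(self, f_image, f_video):           # each (B, dim)
        return self.mlp(torch.cat([f_image, f_video], dim=1))
```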

Fig. 7
figure 7

The overall design of the Early-Fusion framework

In the Late-Fusion (LF), or "decision-level fusion", framework, the final decision is computed after each model has produced its own decision. Late fusion is a merging approach applied after each model's classification: it combines the outputs of the individual classifiers to obtain a new output that may be more accurate and consistent. We build a late-fusion model with a single fully connected layer and use a SoftMax classifier to map the model scores to the final classification, as illustrated in Fig. 8.
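A minimal sketch of this decision-level head follows; concatenating the two class-score vectors before the single fully connected layer is our assumption about how the scores are combined.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Fuse the class scores of the two trained classifiers with one fc layer."""
    def __init__(self, num_classes):
        super().__init__()
        self.fc = nn.Linear(2 * num_classes, num_classes)

    def forward(self, scores_image, scores_video):   # each (B, num_classes)
        return self.fc(torch.cat([scores_image, scores_video], dim=1))
```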

Fig. 8
figure 8

The general design of the Late-Fusion model

4 Experimental Results

The experimental results of our proposed models and frameworks are presented in this section. The most widely used datasets for video action recognition, HMDB-51 [78] and UCF-101 [77], are used to evaluate the proposed models. The evaluation datasets are described in Sect. 4.1, and Sect. 4.2 presents the evaluation metrics. Our proposed models and frameworks are compared with state-of-the-art methods in Sects. 4.3, 4.4, and 4.5. Finally, to better assess our frameworks, we present visualization results. All training and testing experiments are conducted on four NVIDIA RTX-3090 GPUs with 24 GB of GPU RAM each and 32 CPU cores with 129 GB of RAM, running on the CUDA 11.3 platform.

4.1 Datasets

We carry out extensive experiments on the HMDB-51 [78] and UCF-101 [77] datasets to assess the performance of the proposed frameworks for action recognition in videos. The dataset details are shown in Table 1. Most videos in Brown University's HMDB-51 dataset come from movies, public databases, and Internet video repositories such as YouTube. The dataset contains 6,766 videos covering 51 human action classes and offers three train/test splits. The UCF-101 dataset contains 13,320 videos covering 101 human activity classes; the majority of its videos were taken from YouTube, and it provides the same kind of train/test splits as HMDB-51. We divide each dataset into training, validation, and testing samples at 60, 20, and 20%, respectively. The two datasets present significant challenges, including large variations in camera viewpoint and movement, cluttered backgrounds, and variations in human location, size, and appearance. Sample frames for a few action classes from UCF-101 and HMDB-51 are shown in Fig. 9a and b, respectively.

Table 1 Details of UCF-101 and HMDB-51 datasets
Fig. 9
figure 9

Sampled frames from UCF-101 and HMDB-51 datasets respectively with action class names

4.2 Evaluation metrics

We analyze the proposed models using a variety of evaluation metrics. A confusion matrix is used during model testing to reveal detailed information about the model's performance. The confusion matrix is a C × C matrix for assessing a classification model, where C denotes the number of target classes. Confusion matrices are widely utilized because they provide a more detailed view of a model's performance than classification accuracy alone: the model's predicted values are compared with the actual target values in the matrix.

We use five measures to assess the proposed models' performance: loss (categorical cross-entropy), accuracy (overall performance), precision (positive prediction accuracy), recall (sensitivity, or actual positive sample coverage), and F1-score.

Accuracy indicates overall model performance and is one of the most important evaluation criteria for deep learning classification tasks; it is defined in Eq. (3).

$${\text{Accuracy}}=\frac{Number\;of\;correct\;{\text{predictions}}}{Total\;number\;of\;predictions}$$
(3)

Categorical cross-entropy is a classification loss function that measures the performance of a model whose output is a probability in [0,1]. We use the cross-entropy loss because it is the standard choice for multi-class classification and drives the accuracy of our models; it is defined in Eq. (4).

$${\text{Loss}}= -\sum_{j}^{C}{t}_{j}{\text{log}}\left({\text{SoftMax}}\left({p}_{j}\right)\right)$$
(4)

where \({t}_{j}\) is the ground truth for input class j out of C classes and \({p}_{j}\) is the model's score for class j.

Precision assesses the proportion of correctly identified instances among those predicted as positive. In a multi-class classification problem, precision (positive prediction accuracy) is the sum of true positives across all classes divided by the sum of true positives and false positives across all classes, as in Eq. (5).

$$\mathrm{Precision }\;({\text{P}})= \frac{\sum_{j}^{C}{TP}_{j}}{\sum_{j}^{C}({TP}_{j}+{FP}_{j})}$$
(5)

where \({TP}_{j}\) is the number of true positives and \({FP}_{j}\) the number of false positives for class j out of C classes.

Recall (also known as sensitivity, or actual positive sample coverage) quantifies the proportion of actual positives that are correctly predicted. Recall is the sum of true positives across all classes divided by the sum of true positives and false negatives across all classes, as in Eq. (6).

$${\text{Recall}}\;({\text{R}})= \frac{\sum_{j}^{C}{TP}_{j}}{\sum_{j}^{C}({TP}_{j}+{FN}_{j})}$$
(6)

where \({TP}_{j}\) is the number of true positives and \({FN}_{j}\) the number of false negatives for class j out of C classes.

The F1-score combines precision and recall into a single measure that captures both properties, as shown in Eq. (7).

$$ {\text{F1-score}} = \left( 2 \times {\text{Precision}} \times {\text{Recall}} \right) / \left( {\text{Precision}} + {\text{Recall}} \right) $$
(7)
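A small sketch that computes these three metrics from a confusion matrix, following Eqs. (5)–(7) literally (summing true positives, false positives, and false negatives over all classes), is given below.

```python
import numpy as np

def metrics_from_confusion(cm):
    """cm[i, j] counts samples of true class i predicted as class j."""
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp                              # false positives per predicted class
    fn = cm.sum(axis=1) - tp                              # false negatives per true class
    precision = tp.sum() / (tp.sum() + fp.sum())          # Eq. (5)
    recall = tp.sum() / (tp.sum() + fn.sum())             # Eq. (6)
    f1 = 2 * precision * recall / (precision + recall)    # Eq. (7)
    return precision, recall, f1
```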

4.3 The Image-based model

For the image-based model, we employ R2D pretrained on the ImageNet dataset and fine-tune it on the HMDB-51 and UCF-101 datasets. To reduce computational overhead and time consumption, keyframes are used to fine-tune the model instead of every frame in the video. The video clips in the UCF-101 and HMDB-51 datasets have a consistent frame rate of 25 frames per second. To reduce the size of the dataset and use only the most representative frames, each video is segmented into clips and one keyframe image is retrieved per clip.

The keyframe is the frame whose histogram differs most from that of its preceding frame within a set of consecutive frames. The chosen number of consecutive frames (shot length) affects keyframe extraction accuracy [80, 81]: if the shot contains only a few frames, the variation between frame histograms is too small to identify a distinctive keyframe, while if it contains many frames, there may be more than one keyframe in the shot and we extract only one, neglecting the others. We explore the keyframe extraction method for two clip lengths: 5 frames per clip and 16 frames per clip. Table 2 shows performance measurements for the R2D and R2D-LSTM models using the keyframes extracted with the two clip lengths.

Table 2 Performance analysis for the extracted keyframes from different clip lengths: 5 frames per clip, 16 frames per clip on the R2D and R2D-LSTM models

Table 2 compares the accuracy of the R2D and R2D-LSTM models for the different keyframe sets. For UCF-101 and HMDB-51, keyframes extracted from 16-frame clips yield 83.0 and 60.0% accuracy, respectively, on R2D, and 92.0 and 65.0% on R2D-LSTM, outperforming the accuracy obtained with keyframes extracted from 5-frame clips. Therefore, we use 16 frames per clip and extract one keyframe per clip for all video clips. Figure 10a and b show samples of keyframes extracted from 16-frame clips for the UCF-101 and HMDB-51 datasets, respectively, with action class names. The keyframe images are then divided into training, validation, and testing samples at 60, 20, and 20%, respectively, as shown in Table 3.

Fig. 10
figure 10

Sampled key frames from UCF-101 and HMDB-51 datasets respectively with action class names

Table 3 The number of keyframes for each dataset and the corresponding train/validation/test splits

Using the keyframe images, we fine-tune the R2D on UCF-101 and HMDB-51, respectively. The R2D is a 2D ResNet-101 model pretrained on the ImageNet dataset [45]. For fine-tuning R2D, we generate training samples from the extracted keyframes: the training samples are resized to 224 × 224 pixels, horizontally flipped, and mean subtraction using the ImageNet means is performed.

For training: We employ cross-entropy loss and backpropagate its gradients. We use 60% of each dataset for training and 20% for validation. For the fine-tuning procedure we use stochastic gradient descent with a learning rate of 0.1, momentum of 0.9, a batch size of 64, and 25 epochs.
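A minimal sketch of this recipe (PyTorch assumed) is given below; model and train_loader are placeholders for the R2D network and a data loader over the 224 × 224 keyframe crops.

```python
import torch
import torch.nn as nn

def finetune(model, train_loader, device="cuda", epochs=25):
    """Fine-tuning loop with the hyperparameters stated above."""
    model = model.to(device).train()
    criterion = nn.CrossEntropyLoss()                             # categorical cross-entropy
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    for _ in range(epochs):
        for images, labels in train_loader:                       # batch size 64 set in the loader
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```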

Figure 11a and b show the training and validation accuracies of the fine-tuned R2D on UCF-101 and HMDB-51, respectively. For UCF-101, the validation accuracy is slightly higher than the training accuracy, with the best validation accuracy around 85.0%. For HMDB-51, the validation accuracy is lower than the training accuracy due to the difficulty of the dataset's image appearance and contrast, reaching almost 65.0%; HMDB-51 is a more challenging dataset than UCF-101 in this respect. Figure 12a and b show the training and validation losses of R2D for UCF-101 and HMDB-51, respectively. For UCF-101 the training and validation losses are nearly the same, while for HMDB-51 the training loss is marginally lower than the validation loss.

Fig. 11
figure 11

Fine-tuning ResNet-101 (R2D) training and validation accuracies

Fig. 12
figure 12

Fine-tuning ResNet-101 (R2D) training and validation losses

We employ the fine-tuned R2D as a feature extractor to produce 2048-dimensional feature sequences from the Conv5_x layer. A two-layer LSTM with a SoftMax classifier is trained on the extracted features: the LSTM extracts temporal features from the sequence of features produced by R2D, and the SoftMax classifier recognizes the video action. For training, we use stochastic gradient descent with a learning rate of 0.1, momentum of 0.9, a batch size of 64, and 25 epochs with cross-entropy loss; the training and validation data are again split 60 and 20%, respectively.

Figure 13a and b display the training and validation accuracy of R2D-LSTM on UCF-101 and HMDB-51, respectively. The validation accuracy reaches 93.0% on UCF-101 and 65.0% on HMDB-51. Although we use only the keyframes of each dataset rather than all video frames, the LSTM allows the model to learn additional temporal features from the keyframe spatial feature sequences produced by R2D, and its use improves the performance of R2D by exploiting more temporal features for each action. Figure 14a and b show the training and validation losses of R2D-LSTM on UCF-101 and HMDB-51, respectively. The validation losses are marginally higher than the corresponding training losses for HMDB-51, while for UCF-101 the validation loss is similar to the training loss. We can conclude that no overfitting occurs when training the R2D-LSTM model. From Figs. 11 and 13 and Table 2, we find that R2D-LSTM outperforms the R2D model.

Fig. 13
figure 13

R2D-LSTM training and validation Accuracies

Fig. 14
figure 14

R2D-LSTM training and validation losses

For testing: We evaluate the R2D-LSTM model on the 20% testing set. The testing samples are resized to 224 × 224 pixels, flipped horizontally, and mean subtraction is performed by subtracting the per-channel mean values. The confusion matrix for the R2D-LSTM model reveals that the recognition accuracy on the UCF-101 dataset is greater than 93.0% for most classes, as shown in Fig. 15a, while the recognition accuracy on the HMDB-51 dataset is near 65.0% for most classes, as shown in Fig. 15b.

Fig. 15
figure 15

Confusion Matrix for R2D-LSTM on target datasets

4.4 The video-based model

We use the pretrained R3D (ResNet-101) model for the video-based model [15]; the model is pretrained on the Kinetics dataset, and in our experiments we fine-tune it on the UCF-101 and HMDB-51 datasets. We choose a 16-frame clip from a temporal position in each video; if the video is shorter than 16 frames, we loop it as many times as needed. We apply spatial and temporal transformations to the clip frames: each sample has three channels and sixteen frames, is 112 × 112 pixels in size, and is horizontally flipped with 50% probability. We also perform mean subtraction, subtracting the per-channel mean values of the Kinetics dataset [22] from the samples. All generated samples keep the same class labels as the original videos.
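A minimal sketch of this clip preparation is shown below; the per-channel mean values are placeholders rather than the actual Kinetics statistics of [22].

```python
import random
import cv2
import numpy as np

KINETICS_MEAN = np.array([110.0, 104.0, 97.0])   # placeholder per-channel means, not the real statistics

def sample_clip(frames, clip_len=16, size=112):
    """frames: list of HxWx3 uint8 arrays for one video."""
    while len(frames) < clip_len:                # loop videos shorter than 16 frames
        frames = frames + frames
    start = random.randint(0, len(frames) - clip_len)
    flip = random.random() < 0.5                 # one horizontal-flip decision per clip
    out = []
    for f in frames[start:start + clip_len]:
        f = cv2.resize(f, (size, size)).astype(np.float32)
        if flip:
            f = f[:, ::-1]                       # horizontal flip
        out.append(f - KINETICS_MEAN)            # mean subtraction
    return np.stack(out)                         # (16, 112, 112, 3)
```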

For training: To train the SoftMax classifier layer, we use cross-entropy loss and backpropagate its gradients. The training parameters include a learning rate of 0.1 and a momentum of 0.9.

We use the pretrained R3D model and fine-tune it on the UCF-101 and HMDB-51 datasets. First, fine-tuning is applied to the conv5_x bottleneck and the fully connected layer. Figure 16a and b show the training and validation accuracy for R3D on the UCF-101 and HMDB-51 datasets, respectively. We found a large gap between the training and validation accuracies, i.e. overfitting occurs when fine-tuning the conv5_x bottleneck and the fully connected layers. For this reason, we refine the model by training only the fully connected layer on the corresponding target video datasets (UCF-101 and HMDB-51).

Fig. 16
figure 16

Training and validation accuracies for R3D by fine-tuning conv5_x bottleneck and the fully connected layer

By fine-tuning only the fully connected layer of R3D, the overfitting issue is resolved. Figure 17a and b display the R3D training and validation accuracy on the UCF-101 and HMDB-51 datasets, respectively. The UCF-101 training accuracy is 80.0%, while the validation accuracy is almost 77.0%, quite close to the training accuracy. Although the HMDB-51 dataset presents challenges in terms of background, camera orientation, illumination, and other factors that influence the performance of action models, the training accuracy of the R3D model is 60.0% and the HMDB-51 validation accuracy is 50.0%.

Fig. 17
figure 17

Training and validation accuracies for R3D by fine-tuning only the fully connected layer

Figure 18a and b display the training and validation losses for R3D on UCF-101 and HMDB-51, respectively. For UCF-101, the validation loss is quite close to the training loss, whereas on HMDB-51 the validation loss is greater than the training loss. Based on these results, we find that R3D pretrained on Kinetics and fine-tuned on UCF-101 and HMDB-51 enhances performance without the additional computational cost and time of training the model from scratch on those datasets.

Fig. 18
figure 18

Training and validation losses for R3D by fine-tuning only the fully connected layer

For testing: We evaluate R3D on the testing set, using the R3D model that performs best on the validation set. We split each video in the testing data into 16-frame clips, and the class label for the video is the one with the highest average score across all clips. Figure 19a and b present the confusion matrices of R3D on the UCF-101 and HMDB-51 testing data, respectively. The testing accuracy of R3D reaches 77.0% on UCF-101 and up to 50.0% on HMDB-51.
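The clip-level testing protocol can be sketched as follows; preprocess_clip is a hypothetical stand-in for the spatial–temporal transform described above.

```python
import torch

@torch.no_grad()
def predict_video(model, frames, clip_len=16):
    """Split a video into non-overlapping 16-frame clips, average the clip
    scores, and return the video-level class index."""
    scores = []
    for start in range(0, len(frames) - clip_len + 1, clip_len):
        clip = preprocess_clip(frames[start:start + clip_len])   # hypothetical -> (3, 16, 112, 112)
        logits = model(clip.unsqueeze(0))                        # add the batch dimension
        scores.append(torch.softmax(logits, dim=1))
    return torch.cat(scores).mean(dim=0).argmax().item()
```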

Fig. 19
figure 19

Confusion Matrix for R3D target datasets

4.5 The evaluation of the fusion framework’s performance

The fusion frameworks use the two models discussed in Sects. 3.1 and 3.2: image-based (R2D-LSTM) and video-based (R3D). Initially, we choose a 16-frame clip from a temporal position in the video, looping it as needed when the video is shorter than 16 frames, and then extract a keyframe for each clip. Our frameworks have two streams. The first stream is image-based and uses the extracted keyframe image: we apply a spatial transformation to the keyframe by flipping each sample horizontally with 50% probability, resizing it to 224 × 224 pixels, and performing mean subtraction with the ImageNet mean values [76]. The second stream is video-based and uses the 16-frame video clip: we apply a spatial–temporal transformation that resizes the samples to 112 × 112 pixels, flips each sample horizontally with 50% probability, and performs mean subtraction with the per-channel mean values of the Kinetics dataset [22]. All generated samples and keyframe images use the same class labels as the original videos. Finally, we apply the two fusion methods discussed in Sect. 3.3.

For training: To train the SoftMax classifier layer, we employ cross-entropy loss and backpropagate its gradients. We use stochastic gradient descent with a learning rate of 0.1, momentum of 0.9, a batch size of 64, and 25 epochs. The training and validation data are split 60 and 20%, respectively.

4.5.1 The early-fusion framework

Early fusion combines the extracted features of each model to produce a new data representation, and the classifier is trained on this new feature map. The fused features are more expressive than the separate representations, which enables us to benefit concurrently from the advantages of both. Compared to employing a single representation alone, this can produce more effective discrimination, as shown in Fig. 20, which presents the training and validation accuracy of the Early-Fusion model using split-1 of the UCF-101 and HMDB-51 datasets, respectively.

Fig. 20
figure 20

The Early-Fusion training and validation accuracies

As demonstrated in Fig. 20, the Early-Fusion model improves on the two pre-trained models' accuracy when we combine the features obtained from the individual models into multi-modal feature maps that are more expressive than the individual representations from which they derive. On UCF-101 (Fig. 20a) and HMDB-51 (Fig. 20b), the Early-Fusion model's validation accuracy is 93.0 and 71.0%, respectively. Figure 21a and b depict the training and validation losses of the Early-Fusion model for UCF-101 and HMDB-51, respectively.

Fig. 21
figure 21

The Early Fusion training and validation losses

For testing: For the Early-Fusion model, we divide the video into non-overlapping 16-frame clips, extract the keyframe from each clip, and employ the R3D and R2D-LSTM models that perform well on the test set. We use the proposed models to extract the different temporal features from the keyframe and the video clip, and the classifier identifies the video action based on the combination of these features. The class with the highest average score across all the video clips is assigned to the video. The Early-Fusion confusion matrices on the UCF-101 and HMDB-51 testing data are shown in Fig. 22a and b, respectively; the Early-Fusion model's testing accuracy is 95.5 and 70.1%, respectively.

Fig. 22
figure 22

Confusion Matrix for Early-Fusion Framework

4.5.2 The late-fusion framework

Late fusion is a merging technique that combines the classification results of the individual models. To produce new decisions that are more accurate and dependable, it integrates the results of each classifier and applies a deep fusion classifier model.

Figure 23a and b present the accuracy of late fusion on UCF-101 and HMDB-51, respectively. The Late-Fusion model increases the accuracy of the two pretrained models; its validation accuracy on UCF-101 and HMDB-51 for split-1 reaches 97.0 and 77.0%, respectively. Figure 24a and b show that the validation loss of the Late-Fusion model is quite similar to the training loss for UCF-101 and HMDB-51, respectively.

Fig. 23
figure 23

The Late-Fusion training and validation Accuracies

Fig. 24
figure 24

The Late-Fusion training and validation losses

For testing: For the Late-Fusion model, after dividing the video into non-overlapping 16-frame clips and extracting the keyframe of each, we employ the R2D-LSTM and R3D models that perform well on the test set. The trained models produce class scores for the keyframe and the video clip, and the fusion classifier identifies the video action after combining these scores. The class with the highest average score across all video clips is assigned to the video. The Late-Fusion confusion matrices for the UCF-101 and HMDB-51 datasets are shown in Fig. 25a and b, respectively, revealing that the Late-Fusion model's testing accuracy is 97.5 and 77.7%, respectively.

Fig. 25
figure 25

Confusion Matrix for Late-Fusion Framework

We assess the F1-score, precision, and recall of the proposed image-based (R2D-LSTM), video-based (R3D), Early-Fusion, and Late-Fusion models on the UCF-101 and HMDB-51 datasets in Figs. 26 and 27, respectively. The Late-Fusion framework reaches 96.22, 98.12, and 97.16% precision, recall, and F1-score, respectively, on the UCF-101 dataset, and 75.25, 77.50, and 76.36% on the HMDB-51 dataset. The results confirm that the proposed Late-Fusion framework classifies human activities better than the other frameworks.

Fig. 26
figure 26

The evaluation criterion percentage for the proposed models on the UCF-101 dataset

Fig. 27
figure 27

The evaluation criterion percentage for the proposed models on the HMDB-51 dataset

5 Comparison with the state of the art

This section compares the proposed models' results with current best-performing methods that use only RGB input data. Table 4 compares the performance of the proposed models and earlier state-of-the-art techniques on the UCF-101 and HMDB-51 benchmarks. Certain models, such as the TSM of Lin et al. [47], aggregate temporal information using simple 2D convolutions in conjunction with a manually designed temporal shift module. Video masked autoencoders (VideoMAE) [66] propose data-efficient learners via self-supervised video pre-training (SSVP). Masked video distillation (MVD), with a straightforward co-teaching approach that benefits from combining video and image teachers, was proposed by Rui et al. [68]. Roberta [52] utilizes a 3D-CNN architecture to classify unusual behavior of people in public places from video input. STM [49], proposed by Jiang et al., replaces ResNet blocks with STM blocks to represent motion and spatiotemporal properties in a 2D framework. I3D-LSTM benefits from a pre-trained I3D CNN model to extract low-level spatial–temporal features and improves the performance of an LSTM introduced to model high-level temporal features [34]. MSTSM [44] is a 2D-CNN action recognition framework utilizing temporal feature difference extraction and a multi-scale temporal shift module, with the baseline ResNet-101 family from [15]. I3D [26] is a Two-Stream Inflated 3D-CNN model for action recognition using RGB and optical flow streams.

Table 4 Accuracy comparison for proposed models and state-of-the-art methods on UCF-101 and HMDB-51 datasets

As shown in Table 4, the accuracy of our video-based R3D differs from that reported for the R3D model in [15] on the UCF-101 and HMDB-51 datasets, owing to the different training parameters and data augmentation used. The CNN-LSTM in [22], ST-D LSTM in [31], and Bi-LSTM in [82] were all trained from the ImageNet dataset and used various image data augmentations, which affects the performance they report. For the transformer models in [68], the authors proposed a small transformer (Teacher-B) and a large one (Teacher-L); we achieve higher accuracy than Teacher-B, but Teacher-L obtains marginally better results than our frameworks, albeit at the expense of additional computation time. While all of these studies demonstrate improvements on the UCF-101 and HMDB-51 datasets, the method of Carreira and Zisserman [26] is currently the most effective, achieving 98.0 and 80.2%, respectively, by applying the I3D models to RGB and optical flow streams. Using RGB input only, our proposed frameworks outperform competitors that rely on the same modality, reaching 97.5 and 77.7%, respectively. Our late-fusion framework thus trails the I3D model [26] by margins of approximately 1.0% on UCF-101 and 2.0% on HMDB-51; however, I3D obtains its accuracy with two streams (RGB and optical flow), whereas we use only the RGB modality, giving our approach an advantage in time and cost.

6 Conclusion

This paper has introduced two effective human action recognition frameworks based on two models built from the ResNet family. We study the improvement and enhancement of 2D and 3D ResNet models using multi-model fusion techniques. The proposed frameworks consist of two streams that concentrate on RGB input data from images and video clips, and the two streams are fused using different fusion strategies. The first stream is image-based, built from R2D and LSTM models (R2D-LSTM), and captures long-term spatial–temporal features from keyframe images extracted from RGB video clips. The second stream is video-based and employs R3D to extract short-term spatial–temporal features from video clips. Two frameworks are proposed to describe the effect of different fusion architectures on action recognition performance: we explore early and late fusion techniques for video action recognition. The early-fusion framework studies the effect of fusing the features of the two streams before decision-making and action recognition, while the late-fusion framework studies fusing the decisions of the two models for action recognition. The different fusion techniques show how much the spatial and temporal features influence the recognition model's accuracy, and we investigate the fusion approach that yields the best outcome for combining spatial and temporal information. Both the early-fusion and late-fusion frameworks perform well in experiments. We evaluated the proposed models on two popular video datasets (UCF-101 and HMDB-51). Early fusion achieved 95.5% on UCF-101 and 70.1% on HMDB-51, and late fusion achieved 97.5% on UCF-101 and 77.7% on HMDB-51, which is comparable with state-of-the-art methods on those datasets. In future work, we will extend this framework to video domain adaptation for action recognition to optimize and enhance computational cost and performance.