1 Introduction

Human action recognition is a hot topic in computer vision with a wide range of applications such as video security and surveillance systems (Kwak and Song 2013), human-computer interaction (Choi et al. 2008), smart homes (Amiri et al. 2014), robotics (Akkaladevi and Heindl 2015; Van Amsterdam et al. 2022), content-based video retrieval (Jones and Shao 2013), entertainment (Shotton et al. 2011), sports events (Soomro and Zamir 2015), sign language recognition (Azar and Seyedarabi 2020), etc. A generic framework for action recognition typically consists of some components, including data acquisition, data preprocessing, feature extraction, temporal modeling, classification, and evaluation.

Vision-based human action recognition is the task of assigning a label representing the recognized action of a human (performing a simple gesture/action or a human-object interaction) or a group of people (performing human-human interactions or group activities) in a trimmed (segmented) or untrimmed (continuous) video clip or in a still image. The inter-class and intra-class variations of different actions (due to different viewpoints, appearances, lighting conditions, occlusion, cluttered backgrounds, and varying speeds) make human activity recognition a challenging task.

The previous decade began with remarkable news from the computer vision community: the success of deep neural networks (DNNs), and especially convolutional neural networks (CNNs), in challenging computer vision tasks such as image classification (He et al. 2016), object detection (Girshick et al. 2014), object recognition (Liang and Hu 2015), image segmentation (Bi et al. 2018), and action recognition (Simonyan and Zisserman 2014a). These approaches, together with advances in computing technology, expanded research across the computer vision field. Although this new approach reached human action recognition with a short delay, owing to the difficulties of applying DNNs to video analysis, today almost all state-of-the-art methods for HAR are based on deep learning.

Distinct modalities offer complementary information for robust action recognition and provide compensatory information when some modalities are missing. Activity recognition with multiple visual modalities is regarded as a promising approach that offers an in-depth understanding of different actions (Yadav et al. 2021; Sun et al. 2022; Majumder and Kehtarnavaz 2020). However, action recognition with multimodal data is challenging due to the heterogeneity of different data sources, the large amounts of data involved, the variety of fusion strategies, and the need to transfer knowledge.

Although various review papers on deep-based HAR exist in the literature, few have reviewed multimodal vision-based HAR using deep approaches. Given the large body of work on HAR, this review compares and classifies existing approaches from different points of view. It focuses on multimodal visual approaches and categorizes them into four levels (see Fig. 1). This review helps readers better comprehend HAR approaches and provides a means for comparing frameworks in various aspects. In addition, vision-based HAR benchmark datasets are studied, and the best results on popular and the newest ones are reported.

Fig. 1 Four-level classification of deep-based HAR approaches using multiple visual modalities

Our four-level categorization is based on the components of the generic HAR framework mentioned above, except data acquisition. In the first level, methods are classified based on the modalities used, i.e., RGB & depth, RGB & skeleton, depth & skeleton, RGB & depth & skeleton, and infrared & other visual modalities.

The second level of our categorization distinguishes two different approaches to using multimodal data. The first approach considers more than one modality to exploit the different and complementary information of distinct modalities. Methods in this approach usually work with multiple streams of data (dependent or independent) and try to fuse features from different modalities. Here, all modalities are present at both training and test time. However, real-life applications often miss one or more modalities at test time due to cost, noise, sensor failure, privacy, etc. The second approach targets missing-modality scenarios by using multiple modalities during training to compensate for the missing modality at test time through knowledge transfer or co-learning. Methods in the second level are therefore categorized into two branches: complete modality and missing modality.

At the third level, methods in the second level are grouped based on the framework architecture. Complete modality approaches are grouped into independent streams, dependent streams, and single-stream. Missing modality approaches are grouped into hallucination networks and ensemble methods. Finally, in the fourth level, similar frameworks in the third level are grouped according to the DNN architecture, classifier, fusion, and preprocessing methodology.

All methods are also summarized in Appendix A based on our four-level categorization, for easy reference when comparing network architectures and gaining insight into framework design.

Besides, this paper reviews almost all available and related benchmark datasets, categorizing and comparing them with each other. Datasets are grouped based on whether they provide trimmed or untrimmed videos, the number of viewpoints (single-view or multi-view), and the visual data modalities (RGB+depth, RGB+skeleton, depth+skeleton, and RGB+depth+skeleton). Furthermore, the results of state-of-the-art methods are reported on popular and the newest benchmark datasets. The main contributions of this review of supervised deep-based HAR techniques using multiple visual data modalities are threefold:

  1. A new categorization is proposed, for the first time, that classifies methods into four levels: the modalities used; complete versus missing modalities; architecture (based on the number and dependency of network streams and the learning methodology); and framework similarities.

  2. A novel categorization and comparison of available multimodal vision-based HAR benchmark datasets are proposed.

  3. Different methods are discussed along with challenges, open issues, and new trends to provide insightful guidance on future directions for research.

Also, methods with the best results on popular and the newest benchmark datasets are highlighted.

The rest of the paper is organized as follows. In Sect. 2, a brief review of relevant surveys is presented. Methodology is stated in Sect. 3. A generic framework for HAR is presented in Sect. 4. Section 5 provides a brief review of unimodal vision-based HAR using deep learning. In Sect. 6, various multimodal vision-based HAR methods are studied and analyzed in detail. In Sect. 7, multimodal visual HAR datasets are categorized. We discuss the studied methods and some future research directions in Sects. 8 and 9, respectively. Finally, the paper concludes in Sect. 10.

2 Relevant surveys

As mentioned before, considerable research has been devoted to human activity recognition during the last decades. Besides, many surveys were published based on different characteristics of deep-based HAR methods (see Table 1). Some focused on both traditional and deep-based approaches (Yuanyuan et al. 2021; Pareek and Thakkar 2021; Khan and Ghani 2021; Rangasamy et al. 2020; Jegham et al. 2020b; Zhang et al. 2019; Dhiman and Vishwakarma 2019; Estevam et al. 2021; Özyer et al. 2021).

Others concentrated only on deep-based methodologies (Shabaninia 2022; Ulhaq et al. 2022; Ahmad et al. 2021; Islam et al. 2022; Zhu et al. 2020; Yao et al. 2019; Sreenu and Durai 2019). A group of reviews studied a specific data modality, such as visual or sensor-based methods (Chen et al. 2021; Nguyen et al. 2021; Hussain et al. 2020; Dang et al. 2020; Beddiar et al. 2020; Al-Faris et al. 2020; Wang et al. 2019, 2018). Some others dealt with multiple data modalities in HAR (Yadav et al. 2021; Majumder and Kehtarnavaz 2021; Sun et al. 2022; Majumder and Kehtarnavaz 2020; Li et al. 2020; Roitberg et al. 2019; Liu et al. 2019). Also, a branch of surveys focused on HAR applications (Prati et al. 2019; Mar et al. 2019). Further, some surveys reviewed benchmark multimodal visual or RGB-D datasets in HAR (Singh and Vishwakarma 2019b, a; Zhang et al. 2016; Cai et al. 2017).

Although numerous surveys have concentrated on deep-based HAR, few works focus on combining different visual data modalities, and some cover only a few combinations. For example, in (Majumder and Kehtarnavaz 2021; Roitberg et al. 2019), only the fusion of RGB & depth modalities is investigated. The fusion of depth & skeleton is studied in (Liu et al. 2019). As outlined above, Sun et al. (2022) investigated combinations of visual data modalities; however, only a limited number of papers were surveyed and categorized. Compared to (Sun et al. 2022), this review categorizes a larger set of methods into four levels, providing a more in-depth analysis of the issues.

Table 1 Recent surveys on HAR

3 Methodology

The guidelines followed in this paper are taken from (Harris et al. 2014; Wright et al. 2007). The literature review process comprises four steps: formulating research objectives, selecting eligibility criteria, identifying a search strategy, and conducting data extraction (Adewopo et al. 2022).

3.1 Research objectives

This study aims to address the following research questions:

  • RQ1: What are the main deep-based HAR techniques that use multimodal visual data modalities?

  • RQ2: What are the primary datasets and metrics used in this scope?

  • RQ3: What are the best results and future directions in this field of study?

3.2 Eligibility criteria

This review includes papers related to action recognition, covering topics in deep-based HAR, multimodal action recognition, activity recognition using visual data modalities, multimodal vision-based HAR, gesture recognition, and group activity recognition published in journals and conferences between 2016 and 2023. The earliest relevant works appeared in 2016, and research in this area is ongoing. Only papers published in English were used. To be included, publications had to meet the following criteria:

  • Action recognition or gesture recognition tasks,

  • Deep-based approaches,

  • Multiple visual data modalities,

  • Trimmed (segmented) datasets,

  • Published between 2016 and 2023.

The following exclusions were implemented:

  • Employing untrimmed datasets.

  • Not providing clear findings and analysis of results.

  • Written in languages other than English.

3.3 Information sources

The selection of papers for this review was conducted through a comprehensive search of electronic databases that specifically include articles published in English. The databases IEEE Xplore, Wiley Online Library, Springer Link, Science Direct, ACM, and arXiv were used as the primary sources for identifying relevant articles on action recognition tasks. These databases encompass a wide range of full-text journal and conference papers.

3.4 Search strategy

The following keywords were combined with the conjunction “AND” and the disjunction “OR” in our search. The most common search terms were:

  • Action recognition.

  • Activity recognition.

  • Gesture recognition.

  • Recognizing action.

  • Motion recognition.

The above terms were combined with:

  • Deep network.

  • Multimodal.

  • Visual data.

  • Vision-based modalities.

The abstracts, titles, keywords, and employed datasets from selected articles were reviewed to assess their relevance according to the inclusion and exclusion criteria. Articles that did not meet the eligibility criteria or were not pertinent to the research questions were excluded from the study.

3.5 Data extraction

A full-text reading of the selected articles was conducted to retrieve the relevant data needed to answer the research questions, categorize the studies, and identify future research prospects. The following data are extracted from the selected studies:

  1. Document title, authors’ names, publication year, and journal/conference name,

  2. Used modalities,

  3. Fusion techniques,

  4. Framework architecture,

  5. Datasets and corresponding results.

Appendix A summarizes the extracted data from selected studies.

4 A generic framework for HAR

A generic unimodal or multimodal HAR system typically consists of several components, as illustrated in Fig. 2. These components include data acquisition, data preprocessing, spatial feature extraction and temporal modeling, classification, and evaluation.

Initially, the system needs to capture data that contains human actions. This data can be obtained from either an egocentric or a third-person view. Egocentric videos are captured from a first-person perspective, where the camera is mounted on the head or body of the person recording the video. In contrast, third-person videos are captured by a camera positioned away from the person being recorded.

Preprocessing techniques are utilized to enhance the quality of inputs before they are fed into the next stage. These techniques can be customized and combined to meet the requirements of a specific computer vision task. Common data preprocessing techniques include resizing (changing input dimensions), normalization (scaling pixel values), cropping regions of interest (eliminating irrelevant parts of the input), background subtraction (removing distracting objects in the background), and data augmentation (creating new data by applying random transformations to existing data).
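
As a concrete illustration of how such steps are typically chained, the sketch below uses torchvision-style transforms; the resize/crop sizes and normalization statistics are placeholder choices, not values prescribed by any particular HAR framework.

```python
import torch
from torchvision import transforms

# Hypothetical per-frame preprocessing pipeline for an RGB stream.
rgb_preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize(256),              # resize the shorter side
    transforms.CenterCrop(224),          # crop the region of interest
    transforms.RandomHorizontalFlip(),   # simple data augmentation
    transforms.ToTensor(),               # CHW uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics,
                         std=[0.229, 0.224, 0.225]),   # used here as placeholders
])

# Example: a dummy 480x640 RGB frame (e.g., decoded from a video clip).
frame = torch.randint(0, 256, (3, 480, 640), dtype=torch.uint8)
tensor = rgb_preprocess(frame)           # -> torch.Size([3, 224, 224])
print(tensor.shape)
```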

The next stage involves spatial feature extraction and temporal modeling, which is the most crucial phase in the system. Various techniques are utilized for this purpose, and DNNs are the focus of this paper as they are commonly used and provide state-of-the-art results.

CNNs are mainly used to extract local features that are meaningful and shared throughout the data. They operate on fixed-size vectors with a fixed number of computational steps (Alom et al. 2019). 3D CNNs have been developed using 3D convolutions (Ji et al. 2012). Compared to 2D CNNs, 3D CNNs can better extract dependencies between adjacent frames (Ji et al. 2012). However, they require substantial computational resources during training, and they are rigid in capturing action sequences with fine-grained visual patterns (Köpüklü et al. 2022). Graph neural networks (GNNs) extend deep learning techniques to non-Euclidean, or graph, data (Ahmad et al. 2021). CNNs have also been extended to non-Euclidean data via graph convolutional networks (GCNs). Compared to CNNs, GCNs can handle unordered and variable-sized structures.
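
A minimal sketch of a 3D convolutional block (not any specific published architecture) illustrates how 3D convolutions operate jointly over the temporal and spatial dimensions of a clip; the layer sizes below are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class Tiny3DCNN(nn.Module):
    """Toy 3D CNN: input is a clip of shape (batch, channels, frames, H, W)."""
    def __init__(self, in_channels=3, num_classes=60):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),  # spatio-temporal kernel
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),                   # pool only spatially
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),                               # global spatio-temporal pooling
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clip):
        feat = self.features(clip).flatten(1)
        return self.classifier(feat)

clip = torch.randn(2, 3, 16, 112, 112)   # 2 clips of 16 RGB frames each
logits = Tiny3DCNN()(clip)               # -> torch.Size([2, 60])
```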

Recurrent neural networks (RNNs) (Alahi et al. 2016), long short-term memory (LSTM) networks (Hochreiter and Schmidhuber 1997), and Transformers (Vaswani et al. 2017) are applied for temporal sequence modeling in video analysis. However, RNNs suffer from the vanishing gradient problem. LSTMs overcome this issue, but their number of learnable parameters is high (Salehinejad et al. 2017). Transformers are capable of modeling long-range dependencies between the elements of an input sequence, and, unlike RNNs and CNNs, they support parallel sequence processing. Additionally, their design requires minimal inductive biases (Khan et al. 2022).
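
The difference between recurrent and attention-based temporal modeling can be sketched over per-frame feature vectors as follows (a hedged toy example; the feature dimensions and the single encoder layer are illustrative assumptions only).

```python
import torch
import torch.nn as nn

frames = torch.randn(8, 30, 512)    # (batch, time, per-frame feature dim)

# Recurrent modeling: frames are processed sequentially.
lstm = nn.LSTM(input_size=512, hidden_size=256, batch_first=True)
lstm_out, _ = lstm(frames)           # (8, 30, 256); the last step is often used as the clip feature

# Transformer modeling: all frames attend to each other in parallel.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)
attn_out = encoder(frames)           # (8, 30, 512)
clip_feature = attn_out.mean(dim=1)  # temporal pooling yields a clip-level feature
```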

After extracting features, the actions are labeled in the classification stage. Fully connected networks or SVMs are typically used as classifiers. Different protocols are used to evaluate a HAR system. These include cross-subject (people with different appearances, cultures, genders, and ages), cross-view (different camera views, e.g., front or side), and cross-setup (different distances, heights, etc.) evaluation.
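
A cross-subject protocol, for instance, can be expressed simply as a split over subject identifiers (a schematic sketch; the subject IDs and labels below are made up and do not correspond to any particular benchmark).

```python
# Hypothetical sample list: (clip_path, subject_id, label)
samples = [
    ("clip_0001.avi", 1, "wave"),
    ("clip_0002.avi", 2, "sit down"),
    ("clip_0003.avi", 3, "wave"),
    ("clip_0004.avi", 4, "hand shake"),
]

# Cross-subject evaluation: train and test subjects never overlap.
train_subjects = {1, 2}
train_set = [s for s in samples if s[1] in train_subjects]
test_set = [s for s in samples if s[1] not in train_subjects]
assert not {s[1] for s in train_set} & {s[1] for s in test_set}
```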

The design of an action recognition system is a complex task that requires careful consideration of various factors in order to achieve high accuracy.

Fig. 2 A generic framework for HAR

5 Unimodal vision-based HAR

RGB-D sensors usually provide RGB (three-channel data encoding color information), depth (containing information about the distance of surfaces from a viewpoint), and skeleton (encoding the 3D locations of joints) data. Each modality has its own characteristics in HAR. Table 2 lists the pros and cons of using different visual modalities.
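
The three modalities differ mainly in shape and semantics; a schematic container for one multimodal sample might look like the following (a sketch only; the frame sizes and the 25-joint skeleton are typical of Kinect-style sensors but are assumptions here).

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RGBDSample:
    rgb: np.ndarray       # (T, H, W, 3)  color frames
    depth: np.ndarray     # (T, H, W)     per-pixel distance, e.g., in millimetres
    skeleton: np.ndarray  # (T, J, 3)     3D joint coordinates
    label: int

sample = RGBDSample(
    rgb=np.zeros((32, 480, 640, 3), dtype=np.uint8),
    depth=np.zeros((32, 424, 512), dtype=np.uint16),
    skeleton=np.zeros((32, 25, 3), dtype=np.float32),
    label=0,
)
```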

Table 2 Different visual modalities used in HAR with an example (Hands Shaking) from NTU RGB+D dataset (Shahroudy et al. 2016)

RGB-based HAR is very common in computer vision tasks. However, depth information is lost in traditional RGB images. With the emergence of low-cost depth sensors such as Microsoft Kinect (Tölgyessy et al. 2021), Intel RealSense (Keselman et al. 2017), and ASUS Xtion (Gonzalez-Jorge et al. 2013), the use of depth data for HAR has increased considerably. Depth sensors use different technologies, such as structured light (measuring the distortion of a pattern projected on an object caused by the object’s surface) or time of flight (ToF) (measuring the phase delay of reflected infrared light) (Zanuttigh et al. 2016). Due to the sensitivity of IR cameras to sunlight, cameras that rely on structured-light technology are not usable outdoors. However, they are employed in HAR tasks that do not require very high depth resolution (Kazmi et al. 2014).

Deep-based activity recognition with visual data modalities is classified into unimodal and multimodal approaches (see Fig. 3). Most unimodal approaches use a single stream in their framework, while some with multi-stream frameworks utilize several streams derived from different representations of one modality (Özyer et al. 2021; Fu et al. 2020). These unimodal multi-stream approaches are categorized into multi-resolution (Karpathy et al. 2014), multi-rate (Feichtenhofer et al. 2019), and multi-feature (Simonyan and Zisserman 2014a) structures. The multi-resolution structure employs different resolutions of one data modality in each stream (Fig. 4a). In the multi-rate structure, different frame rates of one input are fed to the multiple streams of the network (Fig. 4b). Lastly, the multi-feature structure uses various features (such as optical flow) extracted from one data modality in the streams (Fig. 4c). Although all these approaches use multiple data streams, they are not considered multimodal, which is the focus of this review.
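
As an illustration of the multi-rate idea (a toy sketch, not the implementation of Feichtenhofer et al. 2019), the same clip can simply be sampled at two temporal strides and fed to two streams of a network.

```python
import numpy as np

video = np.random.rand(64, 224, 224, 3)   # 64 frames of one RGB clip

slow_stream = video[::8]   # low frame rate: 8 frames, emphasizes spatial semantics
fast_stream = video[::2]   # high frame rate: 32 frames, emphasizes motion
print(slow_stream.shape, fast_stream.shape)
```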

Although many studies concentrate on unimodal approaches, the interest in multimodal ones is also increasing. This paper focuses on HAR methods using multiple visual data modalities.

Fig. 3 Classification of unimodal visual HAR methods using deep approaches

Fig. 4 Examples of unimodal multi-stream DNNs: a multi-resolution (Karpathy et al. 2014), b multi-rate (Feichtenhofer et al. 2019), and c multi-feature (Simonyan and Zisserman 2014a) structures

6 Multimodal vision-based HAR

Research on multimodal vision-based activity recognition shows that using multiple modalities can achieve higher accuracy than unimodal approaches because they benefit from different and complementary sources of information (Sun et al. 2022; Roitberg et al. 2019; Liu et al. 2019).

Action recognition with multimodal data is challenging due to several factors. First, the heterogeneity of different data sources poses a challenge, as modalities can have different data types, formats, and noise levels; handling such heterogeneous data requires careful consideration. Second, collecting large amounts of multimodal action data takes time and effort. Compared to single-modal datasets, large-scale multimodal datasets are relatively scarce and small, which restricts the availability of training data. Third, effectively fusing information from different modalities is a complex task. Various modalities may have varying levels of importance for different actions, and finding the proper fusion strategy is crucial. Moreover, simultaneously processing multiple modalities increases the computational complexity of action recognition frameworks, which can pose challenges for real-time performance, especially when dealing with high-dimensional data. Additionally, different modalities may require specific feature extraction techniques for effective fusion, so adapting feature extraction methods to suit each modality is necessary. Lastly, aligning different modalities in time is crucial for accurate action recognition. However, each modality may have a different sampling rate or temporal resolution, making synchronization difficult, and temporal misalignment can lead to inaccurate recognition results.
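
For the temporal-alignment issue in particular, a common workaround is to resample one modality onto the timestamps of another, e.g., by nearest-neighbour matching; the sketch below is a hedged example with made-up frame rates.

```python
import numpy as np

# Assume RGB at 30 fps and skeleton at 15 fps over the same 2-second action.
rgb_times = np.arange(0, 2.0, 1 / 30)        # 60 RGB timestamps (seconds)
skel_times = np.arange(0, 2.0, 1 / 15)       # 30 skeleton timestamps
skel_frames = np.random.rand(len(skel_times), 25, 3)

# For each RGB frame, pick the skeleton frame with the closest timestamp.
idx = np.abs(skel_times[None, :] - rgb_times[:, None]).argmin(axis=1)
skel_aligned = skel_frames[idx]              # (60, 25, 3), aligned to the RGB stream
```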

This paper reviews studies that employ two or more data modalities as inputs. For example, in (Qin et al. 2020; Luvizon et al. 2018; Liu and Yuan 2018; Zolfaghari et al. 2017), depth or pose is estimated from RGB and used for action classification. Others (Davoodikakhki and Yin 2020) use pose only in the preprocessing stage. As mentioned before, approaches that construct optical flow from RGB (Liu et al. 2018) or similar features from a single modality lie in the multi-feature subgroup of the multi-stream unimodal category, as shown in Fig. 3. Such papers are beyond the scope of this review.

This paper proposes a novel four-level categorization of multimodal approaches (see Fig. 1). First, multimodal vision-based HAR methods are categorized into five major categories: RGB & depth, RGB & skeleton, depth & skeleton, RGB & depth & skeleton, and infrared & other visual modalities. Since few works employ infrared along with other visual modalities, the existing papers are analyzed only in a single subsection.

Second, methods of the five major categories are grouped into complete and missing modalities. As previously stated, all data modalities available in the training phase may not be available during testing for different reasons. For complete modalities, three framework architectures are popular in the third level of categorization, based on the number of streams used and their dependencies: independent streams, dependent streams, and single-stream. Two different approaches in the literature try to solve the missing modality problem in the third level: hallucination networks and ensemble methods (see Fig. 1). Lastly, methods of the third level with similar approaches in DNN architecture, classifier, fusion, and preprocessing techniques are placed in the fourth level.

Following the categorization of multimodal machine learning challenges by Baltrušaitis et al. (2018), multimodal HAR approaches lie in the fusion, translation, and co-learning groups. Fusion joins information from all modalities at training and test time. In contrast, co-learning can handle missing modalities or assist learning in a modality with fewer samples. In translation, one modality is derived from another before the training process, and then all modalities are used in the training and test stages. That is why the framework architectures of fusion-based and translation-based HAR methods are analogous.

Fusion is used in multimodal learning algorithms to benefit from various predictive powers (Baltrušaitis et al. 2018). There are different taxonomies for the fusion approaches in the literature. Fusion approaches in (Ramachandram and Taylor 2017) are categorized into early, late, and intermediate (as Fig. 5 shows), while authors in (Jain et al. 2005) grouped them into feature-level, score-level, and decision-level (as illustrated in Fig. 6).

Fig. 5 Various fusion models (Ramachandram and Taylor 2017): (a) early or data-level fusion, (b) late or decision-level fusion, and (c) intermediate fusion

Fig. 6 Various fusion models according to (Jain et al. 2005): (a) feature-level fusion, (b) score-level fusion, and (c) decision-level fusion

Early or data-level fusion involves the integration of multiple raw or preprocessed data modalities into a feature vector before using it in the learning stage. Late or decision-level fusion refers to collecting decisions from multiple classifiers, each trained on distinct modalities. Various rules like maximum or average scores can be used in the late fusion. Intermediate deep fusion is adopted where a shared representation layer is constructed by merging units from multiple paths coming into this layer (Ramachandram and Taylor 2017).

In the other taxonomy, feature-level fusion refers to merging different feature vectors extracted by different algorithms at any layer before the fully connected, softmax, and classification layers. A weighted average is used for homogeneous feature vectors and concatenation for non-homogeneous ones (Jain et al. 2005). Decision-level fusion is accomplished after the network prediction. Score-level fusion can be performed after or between the fully connected and softmax layers (Lai and Yanushkevich 2018).
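
The fusion levels from both taxonomies can be contrasted in a few lines (a schematic sketch with randomly initialized toy networks; it is not tied to any specific method surveyed here).

```python
import torch
import torch.nn as nn

rgb_feat, depth_feat = torch.randn(4, 128), torch.randn(4, 128)   # per-clip features
num_classes = 10

# Feature-level fusion: concatenate features before a shared classifier.
feat_classifier = nn.Linear(256, num_classes)
logits_feat = feat_classifier(torch.cat([rgb_feat, depth_feat], dim=1))

# Score-level fusion: average the softmax scores of per-modality classifiers.
rgb_head, depth_head = nn.Linear(128, num_classes), nn.Linear(128, num_classes)
scores = (rgb_head(rgb_feat).softmax(1) + depth_head(depth_feat).softmax(1)) / 2

# Decision-level fusion: combine the final per-modality decisions, e.g., by voting.
decisions = torch.stack([rgb_head(rgb_feat).argmax(1), depth_head(depth_feat).argmax(1)])
fused_decision = decisions.mode(dim=0).values

# Early (data-level) fusion, by contrast, stacks raw modalities before any network:
rgb_clip, depth_clip = torch.randn(4, 3, 16, 112, 112), torch.randn(4, 1, 16, 112, 112)
fused_input = torch.cat([rgb_clip, depth_clip], dim=1)   # a 4-channel RGB-D input
```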

In independent streams, a distinct stream is usually used for feature extraction from each modality, and the streams have no connection or interaction while extracting features. The extracted features are then fused late via different strategies at the feature level (often concatenation), score level, or decision level.

In frameworks with dependent streams, two or more modalities participate in extracting the features of each stream, i.e., mid-level features of one stream are used in the feature extraction of other streams via an intermediate fusion strategy. These frameworks fuse the final features via feature-level, score-level, or decision-level fusion strategies.

Frameworks with only one main stream lie in the third group. Using an early (data-level) fusion strategy, the various data modalities are fed to the network as an N-D entity (for example, 4D for RGB & depth). Other studies use one modality as auxiliary data for weighting the features of the primary modality before the classifier layer via late fusion strategies (feature-level fusion).

Further, frameworks that employ one modality as auxiliary data or attention for other, primary modalities may have dependent streams or a single stream. In attention-based approaches with dependent streams, the auxiliary data is trained in a separate stream and used as attention for the other modalities during the training phase, and the features of all streams are employed in the classification process. In contrast, the features of the auxiliary data in single-stream attention-based approaches are employed only in the training of the other modalities and do not participate directly in the classification stage. The difference between attention-based approaches with dependent streams and with a single stream is shown in Figs. 7 and 8.
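
The role of an auxiliary modality as attention can be sketched as follows (a toy example in which depth- or skeleton-derived features produce spatial weights that modulate RGB features; the shapes and layers are assumptions, not a reproduction of the cited frameworks).

```python
import torch
import torch.nn as nn

rgb_feat = torch.randn(2, 64, 14, 14)     # RGB feature map (batch, channels, H, W)
aux_feat = torch.randn(2, 32, 14, 14)     # auxiliary (e.g., depth/pose) feature map

# The auxiliary stream predicts a spatial attention map in [0, 1] ...
attention_head = nn.Sequential(nn.Conv2d(32, 1, kernel_size=1), nn.Sigmoid())
attn = attention_head(aux_feat)           # (2, 1, 14, 14)

# ... which re-weights the primary RGB features before classification.
attended = rgb_feat * attn
clip_feature = attended.mean(dim=(2, 3))  # global average pooling -> (2, 64)
logits = nn.Linear(64, 60)(clip_feature)
```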

Real-life applications deal with missing modalities at test time. Some approaches use multiple modalities during training to profit from diverse information that compensates for the missing modalities. Methods of this category mainly employ co-learning approaches. In co-learning, “the knowledge transfers between modalities, their representations, and their predictive models” (Baltrušaitis et al. 2018). Co-learning investigates ways of transferring knowledge learned from one modality to a model trained on different modalities. It is an appropriate approach for missing or noisy modality issues or when one modality has limited labeled data or samples (Rahate et al. 2022). Therefore, it is a promising approach for HAR with missing modalities.

Fig. 7 Attention-based approach with dependent streams (Li et al. 2020)

Fig. 8 Attention-based approach with single-stream architecture (Das et al. 2020)

Knowledge distillation and transfer learning are co-learning-based techniques usually used to handle missing modalities (Rahate et al. 2022). Transfer learning aims to leverage knowledge from a source domain in a target domain (Zhuang et al. 2020; Tan et al. 2018), while in knowledge distillation, the generalization of a complex model (the teacher) is transferred to a simpler model (the student) (Wang and Yoon 2021; Gou et al. 2021). Thereby, both techniques train a prototype network, which can generate a new, so-called hallucination network for the missing modality (Rahate et al. 2022).

A hallucination network learns privileged information for a missing (or noisy) modality at test time via co-learning approaches. This was first proposed by Hoffman et al. (2016), who presented a convolutional hallucination architecture for training an RGB object detection model that incorporates depth information at training time. The hallucination network is trained to mimic the mid-level features of the missing modality and learns a new representation of the available ones, as shown in Fig. 9a. At test time, images are processed jointly through the available-modality and hallucination networks to increase detection performance, as shown in Fig. 9b.
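
Training such a hallucination stream reduces, in essence, to adding a feature-mimicking loss between the hallucination branch (fed with the available modality) and the branch of the modality that will be missing at test time. The sketch below illustrates this idea under simple assumptions (toy encoders, an L2 hallucination loss, and cross-entropy for classification); it is not the exact objective of Hoffman et al. (2016).

```python
import torch
import torch.nn as nn

def encoder():   # toy per-modality feature extractor
    return nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU())

rgb_net, depth_net, halluc_net = encoder(), encoder(), encoder()
classifier = nn.Linear(64 * 2, 10)

rgb, depth = torch.randn(8, 1, 28, 28), torch.randn(8, 1, 28, 28)
labels = torch.randint(0, 10, (8,))

# Training: the hallucination branch sees RGB but must mimic depth features.
f_rgb, f_depth, f_hall = rgb_net(rgb), depth_net(depth), halluc_net(rgb)
hallucination_loss = ((f_hall - f_depth.detach()) ** 2).mean()
logits = classifier(torch.cat([f_rgb, f_hall], dim=1))
loss = nn.functional.cross_entropy(logits, labels) + hallucination_loss

# Test (depth missing): only RGB and the hallucination branch are used.
with torch.no_grad():
    test_logits = classifier(torch.cat([rgb_net(rgb), halluc_net(rgb)], dim=1))
```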

Fig. 9 Hallucination network (Hoffman et al. 2016) in a training and b test stages

Since hallucination networks use the whole data of all modalities indiscriminately and require a pre-trained network, some works suggest ensemble approaches, which do not need a pre-training step or extra networks at test time (Garcia et al. 2019).

At the fourth and last level, methods of the third level are categorized firstly based on DNN architecture, including CNNs, 3D CNNs, RNNs, GNNs, and Transformers. However, multiple DNN architectures are also used in some approaches. Next, methods in each category are grouped based on fusion methodologies, classifiers, and preprocessing techniques. In addition, a group of methods concentrates on extracting specific features such as spatio-temporal, common-specific, or pose features. This level captures the similarities and differences between methods by considering general factors. With the introduction of new approaches over time, it is possible to have a more precise categorization for each of the general factors mentioned here.

In each of the following subsections, a combination of modalities is described (first level of our categorization). Complete and missing modalities are explained for each combination in two subsubsections (second level). Complete modality approaches contain independent streams, dependent streams, and single-stream, while missing modality approaches contain hallucination networks and ensemble methods (third level). Similar frameworks are explained in each paragraph, along with similarities specified at the beginning (fourth level).

6.1 RGB and depth

The main branch of multimodal visual action recognition methods focuses on RGB and depth. As previously mentioned, RGB data represents the appearance information of the scene and objects, while depth data encodes body silhouette, 3D shape, and scene structure. Depth data is robust against illumination, color, and texture variations. These complementary aspects of RGB and depth encourage many studies to make use of these two modalities. Results from (Das et al. 2017) show that skeleton extraction is improved via the fusion of a depth-based approach using Kinect and an RGB-based framework using CNNs.

6.1.1 Complete modality

6.1.1.1 Independent streams

Most multimodal methods use multiple distinct streams to learn features separately. They either fuse each stream’s recognition scores or concatenate the extracted features before the classifier layer. In this architecture, the streams have no connection and do not intrude into one another while extracting features of the individual modalities. As a result, they cannot learn from the mid-level complementary information of different heterogeneous modalities.

In Twinanda et al. (2016), Ijjina and Chalavadi (2017), Asadi-Aghbolaghi et al. (2017), Mukherjee et al. (2020), and Sun et al. (2023), multi-stream CNNs are used for extracting features, which are then concatenated before the classifier. Twinanda et al. (2016) propose a four-stream network based on pre-trained AlexNet (Krizhevsky et al. 2017) with RGB, depth, and their motions as inputs for surgical recognition tasks. The framework presented in Ijjina and Chalavadi (2017) emphasizes motion in different temporal regions using key poses and shows that combining multimodal information with the noise-tolerance property of convnet features can improve results. Asadi-Aghbolaghi et al. (2017) consider the fusion of hand-crafted features and deep strategies for RGB-D-based action recognition; for this purpose, dense multimodal trajectories (MMDT) and multimodal 2D CNN approaches are proposed using the RGB, depth, scene flow, and optical flow modalities. A two-stream CNN based on pre-trained VGG16 (Simonyan and Zisserman 2014b) or ResNet-101 (He et al. 2016a) is proposed in (Mukherjee et al. 2020) using dynamic images as network inputs. Dynamic images, first introduced in (Bilen et al. 2016), are based on rank pooling and summarize the motion and action information of a video in a single image. Dynamic images are built separately from the RGB and depth videos and fed into the network; the extracted features are concatenated and passed through a fully connected layer to predict the action class. Sun et al. (2023) introduce multi-level feature fusion in a two-stream CNN.
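
The idea of a dynamic image can be illustrated with a simplified weighting scheme (a hedged sketch: frames are combined with weights that increase over time, a crude stand-in for the approximate rank-pooling coefficients derived in Bilen et al. 2016).

```python
import numpy as np

def dynamic_image(frames: np.ndarray) -> np.ndarray:
    """Collapse a (T, H, W, C) clip into a single image with time-dependent weights."""
    T = frames.shape[0]
    # Linearly increasing weights centred at zero: later frames count positively,
    # earlier frames negatively, so the result emphasizes the direction of motion.
    alphas = 2.0 * np.arange(1, T + 1) - T - 1
    return np.tensordot(alphas, frames.astype(np.float32), axes=(0, 0))

clip = np.random.rand(16, 112, 112, 3)
di = dynamic_image(clip)          # (112, 112, 3) summary of the whole clip
```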

In Singh et al. (2020), Rajput et al. (2020), and Imran and Kumar (2016), a four-stream CNN is suggested. It takes RGB and depth maps from three different views (i.e., top, front, and side) as inputs. The scores of the different streams are late fused at the end of the network to classify the action. Singh et al. (2020) use dynamic images constructed from RGB videos together with depth motion maps (DMMs), fed to a pre-trained VGG-F model (Zhou et al. 2017). A depth motion map accumulates the differences of consecutive depth frames projected onto the XY, YZ, and XZ planes (corresponding to the front, side, and top views). A weighted product model is used to classify the action. In Rajput et al. (2020), a motion history image (MHI) constructed from the RGB video and three DMMs are the inputs of a pre-trained MobileNet network; scores are late fused using the product rule on the posterior probabilities generated in each stream. Imran and Kumar (2016) propose a framework similar to (Rajput et al. 2020) with a pre-trained VGG16 network structure.
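
The DMM construction described above can be sketched as follows (a simplified version: each depth frame is projected onto the three orthogonal planes as a binary occupancy map, and absolute frame-to-frame differences are accumulated; the thresholds and normalization used in the cited works are omitted).

```python
import numpy as np

def project_views(frame: np.ndarray, depth_bins: int = 64) -> dict:
    """Project one (H, W) depth frame onto the front (XY), side (YZ), and top (XZ) planes."""
    h, w = frame.shape
    z = np.clip((frame / frame.max() * (depth_bins - 1)).astype(int), 0, depth_bins - 1)
    side = np.zeros((h, depth_bins), dtype=np.float32)   # YZ plane occupancy
    top = np.zeros((depth_bins, w), dtype=np.float32)    # XZ plane occupancy
    ys, xs = np.indices(frame.shape)
    side[ys.ravel(), z.ravel()] = 1.0
    top[z.ravel(), xs.ravel()] = 1.0
    return {"front": frame.astype(np.float32), "side": side, "top": top}

def depth_motion_maps(depth_clip: np.ndarray) -> dict:
    """Accumulate |difference| of consecutive projected frames for each view."""
    views = [project_views(f) for f in depth_clip]
    return {name: sum(np.abs(views[t + 1][name] - views[t][name])
                      for t in range(len(views) - 1))
            for name in ("front", "side", "top")}

dmms = depth_motion_maps(np.random.rand(16, 120, 160) * 4000)   # synthetic depth values
print({k: v.shape for k, v in dmms.items()})   # front (120,160), side (120,64), top (64,160)
```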

3D CNNs are employed in (Zhu et al. 2016; Li et al. 2016b; Miao et al. 2017; Duan et al. 2016; Zhang et al. 2018a; Bini et al. 2022; Chen et al. 2022). To learn gestures from the whole video, Zhu et al. (2016) use pyramid input and pyramid fusion with multiscale contextual information using 3D CNNs. Li et al. (2016b) employ a pre-trained C3D (Tran et al. 2015) network for RGB and depth to recognize gestures; the extracted features are concatenated or averaged, and the gestures are finally classified with a linear SVM. In (Miao et al. 2017), different features (RGB, flow, and depth) are extracted by the ResC3D network (Miao et al. 2017) and fused with canonical correlation analysis, and the final recognition results are derived with a linear SVM classifier. Chen et al. (2022) propose a local attention- and dual attention-based multimodal 3D convolutional network: a base I3D model extracts features from RGB data, while an I3D model with dual spatio-temporal attention extracts features from depth data, and the extracted features are multiplied element-wise to obtain the final classification result. Bini et al. (2022) concentrate on gesture recognition in real time and on embedded platforms with limited resources. They suggest a four-stream network of 2D CNNs and 3D CNNs (with MobileNet and ResNeXt as backbones) using RGB, depth, optical flow, and MHI; the streams are finally fused at the decision level. Duan et al. (2016) suggest a four-stream network for continuous gesture recognition. This approach uses a two-stream convolutional consensus voting network (2SCVN) to model short- and long-term video sequences. Furthermore, a two-stream 3D depth-saliency ConvNet (3DDSN) is used to learn subtle motions and remove background clutter. 3D CNNs suffer from computational inefficiency since they comprise numerous parameters. To resolve this deficiency, Zhang et al. (2018a) proposed a series of lightweight 3D structures for action recognition based on RGB-D data. The proposed lightweight 3D CNNs have significantly fewer parameters and lower computation costs, resulting in desirable recognition performance compared to conventional 3D CNNs (Zhang et al. 2018a).

A branch of studies (Chai et al. 2016; Zhu et al. 2017; Zhang et al. 2017a) employs RNNs and LSTMs in their frameworks. In Chai et al. (2016), a two-stream recurrent neural network (2S-RNN) is presented using RGB-D data, which models the contextual information of temporal sequences. The frameworks in (Zhu et al. 2017; Zhang et al. 2017a; Elboushaki et al. 2020) extract spatio-temporal features, which are more robust to background clutter. In Zhu et al. (2017), a two-stream network based on 3D CNNs, convolutional LSTM (ConvLSTM), spatial pyramid pooling, and an FC layer is proposed to enhance long-term spatio-temporal learning; the 3D CNNs extract the spatio-temporal features of the RGB-D data. In Zhang et al. (2017a), a three-stream network based on 3D CNNs, ConvLSTM, 2D CNNs, temporal pooling, and an FC layer with softmax is used to extract spatio-temporal features from RGB, depth, and optical flow. In Elboushaki et al. (2020), a deep-based framework called MultiD-CNN is proposed for gesture recognition that learns spatio-temporal features from RGB-D videos. This method incorporates spatial and temporal information through two different recognition models: a 3D color-depth convolutional network (3D-CDCN) and a 2D motion representation convolutional network (2D-MRCN). 3D-CDCN adds the temporal dimension and makes use of 3D ResNets and ConvLSTM to learn spatio-temporal features simultaneously. On the other hand, 2D-MRCN accumulates the motion across the video sequences into a motion representation and uses 2D ResNets to learn high-level gesture representations. Obaid et al. (2020) utilize CNNs and RNNs in hand gesture recognition to extract temporal features. Dhiman et al. (2020) suggest motion and shape temporal dynamics (STD) as action cues, proposing a framework with RGB dynamic images in the motion stream and depth silhouettes in the STD stream for recognizing actions from an unknown view.

Various studies (Shahroudy et al. 2017; Qin et al. 2018; Tang et al. 2018; Qin et al. 2019) concentrate on extracting common-specific RGB-D features. The combination of the shared and specific components in input features can be complex and highly nonlinear (Shahroudy et al. 2017). To disentangle specific features from common ones, Shahroudy et al. (2017) stack layers of nonlinear autoencoder-based component factorization to form a deep shared-specific analysis network. Even though RGB and depth images are inherently different in appearance, there is a certain high-level consistency between them (Qin et al. 2018). Qin et al. (2018) propose a novel two-stream model to extract common-specific features via a similarity constraint at the high level. In (Tang et al. 2018), a method based on multi-stream deep neural networks is proposed for egocentric action recognition. This method exploits the complementary aspects of RGB and depth by learning the nonlinear structure of the heterogeneous representations. It tries to preserve the distinctive properties of each modality while simultaneously exploring their sharable information in a unified architecture. In addition, it deploys a Cauchy estimator (Mizera and Müller 2002) to maximize the correlations of the sharable components and imposes orthogonality constraints on the individual components to guarantee their independence. Qin et al. (2019) employ a novel end-to-end trainable framework called TSN-3DCSF (two-stream network with 3D common-specific features) that uses 3D CNNs to extract common-specific features.

In Ren et al. (2021), Wang et al. (2018c), Ren et al. (2021b), and Wang et al. (2020), segmented or bidirectional rank pooling is used in the frameworks. Dynamic images are created from the RGB-D sequences as inputs to ConvNets to extract spatio-temporal information (Ren et al. 2021); then, a segmented cooperative ConvNet is utilized to learn the complementary features of the RGB-D modalities. Wang et al. (2018c) use two separate cooperative convolutional networks (c-ConvNets), which extract information from dynamic images constructed from both the visual RGB (VDIs) and depth (DDIs) modalities. By applying bidirectional rank pooling, VDIs and DDIs are each represented by two dynamic images, i.e., forward (f) and backward (b), namely VDIf & VDIb and DDIf & DDIb, respectively. The c-ConvNet consists of one feature extraction network and two branches, one for a ranking loss and another for a softmax loss. Ren et al. (2021b) employ segmented bidirectional rank pooling to acquire spatio-temporal information, as shown in Fig. 10. Moreover, a multimodality hierarchical fusion scheme gets the most out of the complementarity of the different modalities; this hierarchical scheme includes VDIs-f, VDIs-b, DDIs-f, DDIs-b, and optical flow fields (flow X and flow Y) trained on ConvNets. In Wang et al. (2020), the network is built upon weighted dynamic images, bidirectional rank pooling, CNNs, and 3D ConvLSTM to extract complementary information from the depth and RGB video sequences; canonical correlation analysis is adopted for feature-level fusion, and a linear SVM is used for classification.

Fig. 10 An example of independent streams (Ren et al. 2021b)

Several methods use RGB-predicted pose in action recognition (Al-Faris et al. 2020; Wu et al. 2021). In Al-Faris et al. (2020), a framework is proposed for hierarchical region-adaptive multi-time-resolution DMM (RAMDMM) and multi-time-resolution RGB action recognition. The proposed method introduces a feature representation technique for RGB-D data that enables multi-view and multi-temporal action recognition, employing original and synthesized viewpoints for multi-view HAR. To be invariant to variations in an action’s speed, it also integrates temporal motion information into the depth sequences. Appearance information, in the form of multi-temporal RGB data, helps the system focus on appearance cues that would otherwise be lost with depth data alone and provides sensitivity to interactions with small objects. Wu et al. (2021) utilize 3D CNNs with multimodal inputs to enhance spatio-temporal features. This approach proposes two different video representations: the depth residual dynamic image sequence (DRDIS), which reflects the spatial motion changes of an action over time, and the pose estimation map sequence (PEMS), which is constructed by pose (skeleton) estimation from an RGB video. DRDIS is robust to lighting, texture, and color changes, while PEMS eliminates background clutter.

6.1.1.2 Dependent streams

In some other methods, there are dependent streams, i.e., each stream uses extracted features of other streams and shares its features with others during feature extraction. The following approaches employ dependent streams in their architectures.

In Li et al. (2021), Transformers are used with inter-frame and cross-modality mutual attention for egocentric action recognition (see Fig. 11). Transformers are more robust than RNNs or LSTMs in modeling long-term sequences (Khan et al. 2022). Li et al. (2021) apply position encoding to frames in order to emphasize frame order. A two-stream network with Transformer encoders is applied to the inputs, and the extracted features are then passed to a mutual-attentional fusion block to exchange cross-modality information. Zhao et al. (2021) use a capsule network, a Kalman filter, and Transformers to resolve CNNs’ over-sensitivity to rotation and scaling. Body parts are extracted using the capsule network, and their attributes are determined via the Kalman filter.

In Li et al. (2019), a spatio-temporal attention mechanism is proposed to select the most representative regions and frames in a video. Different features (RGB, flow, and depth) are extracted by the ResC3D network (Miao et al. 2017) and fused with canonical correlation analysis, and the final recognition results are derived with a linear SVM classifier.

Li et al. (2023) propose a hierarchical gesture prototype framework to handle two problems in gesture recognition, including redundancy in the gesture-relevant features of different modalities and exploiting the complementarity of modalities. The framework highlights gesture-relevant features such as poses and motions using a sample-level prototype and a modal-level prototype. The sample-level gesture prototype uses a memory bank to extract the essential features of a specific gesture class with different phenotypes. Then, the modal-level prototype is obtained via a GAN-based subnetwork, in which the modal-invariant features are extracted and pulled together.

Fig. 11 An example of dependent streams using Transformers (Li et al. 2021)

New fusion approaches are proposed in a group of studies (Cheng et al. 2021; Zhou et al. 2021; Tian et al. 2020; Wang et al. 2019a; Hampiholi et al. 2023; Lee et al. 2023; Cheng et al. 2022). In Cheng et al. (2021), a cross-modality compensation block (CMCB) is developed to learn cross-modality complementary features from the RGB and depth modalities. The CMCB first gathers features from the two isolated information flows, then sends and intensifies them along the RGB-D paths using convolution layers; the CMCB is incorporated into two typical network architectures, ResNet and VGG, to improve action recognition performance. In Zhou et al. (2021), an Adaptive Cross-modal Weighting (ACmW) scheme is employed to extract complementary features from RGB-D data. ACmW evaluates the relationship between the complementary features from different streams and fuses them in the spatial and temporal dimensions; a pair of CNNs is used to exploit the features of the RGB and depth images, and the extracted features are then evaluated at different levels. Wang et al. (2019a) address the multi-view and missing-view problems in action recognition: a generative adversarial network is deployed to generate one view conditioned on the other, fully exploring the latent intra-view and cross-view connections. Hampiholi et al. (2023) introduce Convolutional Transformer Fusion Blocks (CTFBs) for multimodal gesture recognition using the RGB and depth modalities. A CTFB consists of a Convolutional Self-Attention (ECSA) mechanism, a fusion operation, and an MLP module; 3D convolution layers are used in the ECSAs to capture local key spatio-temporal features from each modality, the output feature maps of the two ECSA modules are fused using element-wise addition, and an MLP performs the final classification. In Lee et al. (2023), multimodal data are fused with recurrent units: the authors propose the Modality Mixer (MMixer) network, whose key component is a recurrent unit called the Multimodal Contextualization Unit (MCU), which extracts complementary information across modalities along with the temporal information of the action. Cheng et al. (2022) suggest a multimodal interactive network (MMINet) for RGB-D-based action recognition using two proposed modules in a two-stream CNN: a spatial-temporal information aggregation module (STIAM), which extracts richer spatial-temporal features with limited extra memory and computational cost, and a cross-modality interactive module (CMIM), which fully fuses the complementary multimodal information. The final recognition is based on the score fusion of the two stream outputs.

6.1.1.3 Single-stream

Another architecture contains a single stream, in which the different modalities are combined into an N-dimensional (N-D) entity or one modality is used as an attention guide for the other modalities.

In Adhikari et al. (2017), Pigou et al. (2018), Wang et al. (2017a), and Zhou et al. (2021a), RGB and depth are combined as four-channel data or a 4D entity. Adhikari et al. (2017) use CNNs with RGB-D input; this approach uses human postures to detect fall and non-fall events. Pigou et al. (2018) propose a novel end-to-end trainable network using temporal convolutions and bidirectional recurrence. In this approach, the RNNs operate on high-level spatial features and do not need to consider the temporal aspect in the lower layers of the network; in addition, the RNNs estimate the beginning and end frames of gestures. In Wang et al. (2017a), extracting scene flow from RGB-D videos for action recognition is considered, and a scene flow to action map (SFAM) is presented to summarize RGB-D videos. In Zhou et al. (2021a), a regional attention with architecture-rebuilt 3D network (RAAR3DNet) is proposed for gesture recognition. Fixed Inception modules are replaced with an automatically rebuilt structure via neural architecture search (NAS) to obtain different representations of features in the early, middle, and late stages of the network. In addition, a stackable regional attention module called dynamic-static attention is designed to emphasize the hand/arm regions and the motion information.

Another class of methods uses an auxiliary modality as attention for the other modalities. In Jegham et al. (2020a), a depth-based spatial attention network is suggested, which focuses on the driver’s silhouette and motion in a scene. At each time step, a new weighted RGB frame is fed to the network with the relevant depth frame as attention. Soft spatial attention enhances the CNN’s recognition by selectively highlighting relevant frame regions.

6.1.2 Missing modality

Multimodal action recognition can draw useful information from a variety of sensory modalities. However, it is often the case that not all of the modalities are accessible in real-life scenarios due to restrictions such as noise or missing modalities (Fig. 12). The main challenge with missing modalities is training the model in a form that can still be utilized at test time.

Fig. 12 Missing modality at test time [images are from the NTU RGB+D dataset (Shahroudy et al. 2016)]

6.1.2.1 Hallucination Network

Some studies utilize knowledge distillation or transfer learning to retain the complementary information of all modalities in the missing-modality scenario. In these approaches, hallucination networks are considered in the context of learning with privileged information to address the challenge of a missing (or noisy) modality at test time.

The teacher-student framework is introduced to distill knowledge and deals with missing or noisy data samples (Rahate et al. 2022). The student model is typically faster than the teacher model (Rahate et al. 2022). In Garcia et al. (2018, 2019), the teacher-student framework is used for missing modalities in HAR. In Garcia et al. (2018), a hallucination (student) network is trained to simulate the depth stream. The authors use RGB and depth frames as inputs during training, but only RGB at test time. In this approach, a technique based on inter-stream connections is implemented to enhance the learning process of the hallucination network, as shown in Fig. 13. Also, a general loss function designed in (Lopez-Paz et al. 2015) unifies distillation and privileged-information learning theories. The scheme proposed in (Garcia et al. 2018) is revisited in (Garcia et al. 2019), where the hallucination network is trained via discriminative adversarial learning; this method does not need to balance the different losses used in other methods (Garcia et al. 2018, 2019).

Park et al. (2023) propose Cross-Modal Alignment and Translation (CMAT) for action recognition. The framework first aligns the representations of multiple modalities from the same video sample through contrastive learning with an R(2+1)D-18 architecture (Tran et al. 2018). Then, CMAT learns to translate the representations of one modality into those of another modality using CNNs. This allows the representations of the missing modalities to be generated from the remaining modalities during testing.

Fig. 13 Hallucination network proposed in (Garcia et al. 2018)

6.1.2.2 Ensemble methods

An ensemble method is proposed in (Garcia et al. 2021), where the complementary information of multiple modalities is leveraged for the benefit of both the ensemble and each individual network, without the need for a pre-training step or another network at test time (as shown in Fig. 14). The introduced distillation multiple-choice learning framework is trained from scratch, and the individual modality networks are strengthened.

Fig. 14 Ensemble method proposed in (Garcia et al. 2021)

6.2 RGB and skeleton

Skeletal data, used as high-level information, is robust against different views, backgrounds, and motion speeds (Shabaninia et al. 2019). However, the sparse information of 3D joints in skeletal data is insufficient to fully model human actions, especially human-object interactions. Some papers therefore suggest combining skeletal information with RGB data to gain the complementary features of both modalities in an accurate recognition framework.

6.2.1 Complete modality

6.2.1.1 Independent streams

Some studies (Jang et al. 2020; Das et al. 2019c; Tomas and Biswas 2017) employ CNNs in their architectures. Jang et al. (2020) propose a four-stream adaptive CNN (FSA-CNN) framework that is robust to spatio-temporal variations; its activation function adapts without using multiple activation layers. The streams consist of the raw data, short-term temporal differential, long-term temporal differential, and spatial differential sequences of actions. Using 2D skeletal data (created from RGB sequences) together with 3D skeletons (captured by Kinect sensors) improves the accuracy of action recognition: although 2D and 3D skeletons appear similar, their sources differ, and extra information is obtained in the 3D case. Das et al. (2019c) propose a framework with an action-pair memory module to disambiguate similar actions. Moreover, a two-level fusion mechanism employs information from three modalities: RGB, 3D skeletons, and 2D skeletons & RGB. Tomas and Biswas (2017) employ appearance and motion information from RGB and skeletal joints, respectively, to capture subtle motions. Motion representations are learned via a CNN from MHIs created from the RGB images. Besides, stacked autoencoders (SAE) capture discriminative movements of the human skeletal joints by taking the distance of each joint from the mean joint at each frame.

A variety of methods exploit spatio-temporal features using both CNNs and RNNs. Debnath et al. (2021) offer a two-stream attention-based framework that learns the 3D positional relationships of body joints during action sequences. 3D poses are the inputs of the two streams; one learns spatial features, and the other learns temporal features. A multi-head attention mechanism then fuses the pose streams. Along with the 3D pose streams, an RGB stream extracts appearance information using a pre-trained Inception-ResNet-V2 model, a multi-head attention block, and a bidirectional LSTM. These streams are fused by concatenation and global average pooling, and an FC layer performs the final classification. Verma et al. (2020) propose a two-stream framework using MHI and motion energy images (MEI) as RGB descriptors. The skeleton modality is used after developing intensity images in three views: top, side, and front. Feature-level fusion is applied in each stream, and the scores of the two streams are then fused via the weighted product rule. The multimodal network is trained only once using the cyclic learning rate concept. Liu et al. (2018) offer a multimodality multi-task RNN for online action detection. The framework contains classification and regression subnetworks for temporal modeling; after dynamic features are extracted, the classification and regression subnetworks share identical structures with different weights among the different modalities. Zhao et al. (2017) suggest 3D CNNs for processing RGB videos and RNNs to extract features from 3D skeleton data, with an SVM as the classifier.

Cai et al. (2021) utilize a two-stream GCN, namely JOLO-GCN, which takes the human pose skeleton and joint-centered lightweight information as inputs. The local motion of each joint is captured as pivotal joint-centered visual information via joint-aligned optical flow patches (JFP). The proposed scheme is accurate while keeping computational and memory overheads low. Duan et al. (2022) propose PoseC3D, which uses a stack of 3D heatmaps for the skeletal data representation instead of a graph sequence. The suggested framework can handle multiple-person scenarios in HAR.

6.2.1.2 Dependent streams

Several methods employ skeletal data as an attention guide for the RGB stream while training the streams separately. However, the high computational cost limits the joint utilization of these two modalities. A novel pose-driven attention mechanism on 3D ConvNets is suggested in (Das et al. 2019b) to address the challenges of activities of daily living (ADL) recognition. A time-series representation of pose dynamics extracts the spatial and temporal saliency of human activities. Song et al. (2018) offer an end-to-end trainable three-stream skeletal attention-based framework from RGB and optical flow videos. The framework is based on a ConvNet with LSTM. Visual features around critical joints are extracted automatically using a skeleton-indexed transform layer, and a part-aggregated pooling uniformly regulates the visual features from different body parts and actors. Baradel et al. (2018; 2017b; a) propose a two-stream LSTM framework from articulated pose and RGB. A specific joint ordering is processed by the pose stream. The RGB stream, which gives essential cues on hand motion and objects, is handled by a spatio-temporal soft-attention mechanism conditioned on features from the pose network. Lastly, a temporal attention scheme learns to fuse features over time. Liu et al. (2019b) propose an attention-based two-stream framework that uses a 3D skeleton sequence and just a single middle frame from an RGB video as network inputs. Spatial features are extracted from the RGB stream using self-attention and skeleton-attention modules. In parallel, temporal features are obtained from the skeleton sequence by a BI-LSTM network. Since it processes a single image instead of a video, this network has a lightweight architecture with lower computational cost. Li et al. (2020) focus on action problems like “throw up hat,” whose related objects are extremely far from the actors. The framework is pre-trained on R(2+1)D (Tran et al. 2018) using the Kinetics dataset (Kay et al. 2017). RGB and skeleton data are fused at the feature level via a skeleton-guided multimodal network (SGM-Net) in the proposed framework. Skeleton features guide the attention toward the object related to the action and reduce the interference of unnecessary background information, e.g., for the action “throw up hat,” the object information of the arms and hat is enhanced. In the guided block, two schemes of correlation operation are explored, including feature learning correlation (FLC) and compact bilinear correlation (CBC). Weiyao et al. (2021) propose a bilinear pooling and attention network (BPAN) to fuse multimodal data and capture the deep semantic relationship between multimodal features. A two-stream adaptive graph convolution network (2S-AGCN) and R(2+1)D are used for feature extraction from the skeleton and RGB modalities, respectively. Bruce et al. (2022) have proposed a model-based multimodal network (MMNet) for HAR in RGB-D videos using a model-based multimodal data fusion mechanism. This method borrows attention features extracted from the skeleton modality using GCNs and injects them into the CNN-based RGB stream to improve the overall performance.
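A common thread in these dependent-stream designs is a gate on RGB features driven by pose information. The sketch below is a deliberately minimal version of that pattern (layer sizes and the gating form are our assumptions, not a reimplementation of any specific paper):

```python
import torch
import torch.nn as nn

class SkeletonGuidedGate(nn.Module):
    """Channel-wise attention on RGB features driven by a pose embedding.

    A simplified illustration of the skeleton-as-attention idea; the cited
    papers use richer mechanisms and different feature extractors.
    """
    def __init__(self, pose_dim, rgb_channels):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(pose_dim, rgb_channels), nn.Sigmoid())

    def forward(self, rgb_feat, pose_feat):
        # rgb_feat: (B, C, T, H, W) from a 3D CNN; pose_feat: (B, pose_dim)
        gate = self.fc(pose_feat)                  # (B, C) gates in [0, 1]
        gate = gate.view(*gate.shape, 1, 1, 1)     # broadcast over T, H, W
        return rgb_feat * gate

rgb = torch.randn(2, 64, 8, 14, 14)
pose = torch.randn(2, 128)
out = SkeletonGuidedGate(128, 64)(rgb, pose)
print(out.shape)  # torch.Size([2, 64, 8, 14, 14])
```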

Joze et al. (2020) offer a multimodal transfer module (MMTM) to fuse knowledge from multiple modalities in CNNs. MMTM uses squeeze operations to build a global feature descriptor for each modality. Both tensors are mapped into a joint representation using concatenation and a fully connected layer. Excitation signals are then produced from the joint representation to gate the channel-wise features in each modality. This module can be placed at different positions in the networks and fuses modality features from convolution layers with different spatial dimensions. It can also be inserted into unimodal stream networks with minimal changes to their architectures, allowing initialization with pre-trained weights.
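The following condensed sketch illustrates this squeeze-and-excitation style of multimodal gating (the dimensions, the joint-representation size, and the gating nonlinearity are simplified assumptions rather than the exact MMTM design):

```python
import torch
import torch.nn as nn

class SimpleMMTM(nn.Module):
    """Condensed MMTM-style fusion: squeeze both modality feature maps, build a
    joint representation, and emit channel-wise excitation gates per stream."""
    def __init__(self, c_a, c_b, joint_dim=128):
        super().__init__()
        self.joint = nn.Linear(c_a + c_b, joint_dim)
        self.gate_a = nn.Linear(joint_dim, c_a)
        self.gate_b = nn.Linear(joint_dim, c_b)

    def forward(self, feat_a, feat_b):
        # feat_a: (B, C_a, ...), feat_b: (B, C_b, ...) with arbitrary spatial dims
        sq_a = feat_a.flatten(2).mean(dim=2)           # squeeze: (B, C_a)
        sq_b = feat_b.flatten(2).mean(dim=2)           # squeeze: (B, C_b)
        z = torch.relu(self.joint(torch.cat([sq_a, sq_b], dim=1)))
        e_a = torch.sigmoid(self.gate_a(z))            # excitation for stream A
        e_b = torch.sigmoid(self.gate_b(z))            # excitation for stream B
        shape_a = (*e_a.shape, *([1] * (feat_a.dim() - 2)))
        shape_b = (*e_b.shape, *([1] * (feat_b.dim() - 2)))
        return feat_a * e_a.view(shape_a), feat_b * e_b.view(shape_b)

rgb_feat = torch.randn(2, 64, 8, 14, 14)    # e.g. 3D CNN feature map
skel_feat = torch.randn(2, 128, 25)         # e.g. skeleton-stream feature map
out_rgb, out_skel = SimpleMMTM(64, 128)(rgb_feat, skel_feat)
print(out_rgb.shape, out_skel.shape)
```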

In Ahn et al. (2023), a spatio-temporal cross-attention transformer (STAR++) is suggested using deformable 3D (D3D) token selection and interval attention (IAttn) modules, which create tokens with spatio-temporal cross-attention. The proposed architecture employs 3D CNNs and Transformers.

6.2.1.3 Single-stream

Various papers introduce the skeleton as attention for the RGB stream; however, the skeletal data is not trained separately. In Mahasseni and Todorovic (2016), using 3D skeleton sequences as attention, large-scale video action recognition improves by providing an auxiliary modality in the training data to compensate for poor or missing features of human actions. The framework consists of an LSTM and a deep CNN for recognition. The LSTM is regularized based on the output of another LSTM and 3D human-skeletal data. For regularization, standard backpropagation through time (BPTT) is modified to address problems with gradient descent in constrained optimization. Das et al. (2019a) offer a 3D convolution network with soft RNN attention. Articulated poses specify the best body part for modeling the action class. The framework consists of three branches to extract features from human body parts: left hand, right hand, and entire body. An RNN attention subnetwork allocates different levels of importance to the body parts. Das et al. (2020; 2021) propose pose-driven attention strategies called video-pose networks (VPN and VPN++) for the recognition of ADL with the ability to distinguish between similar activities with fine-grained details. VPN requires both RGB and 3D poses to classify actions. In contrast, VPN++ requires only RGB images to predict action labels. Further, VPN++ provides high speed and high resiliency to noisy poses. At training time, RGB video is fed to the network with the corresponding 3D poses, obtained from Kinect sensors or estimated from images using pose estimation methods. Features of the inputs are extracted via two distinct video and pose backbones. The video backbone consists of 3D CNNs to extract spatio-temporal features, and the pose backbone contains a spatio-temporal GCN.

Do et al. (2022) propose a Multimodal Transformer (MMT) that uses RGB and skeleton data from only eight input frames. With its transformer-based structure, MMT can capture the correlation between non-local joints in the skeleton modality. The output of the local patch encoder is concatenated with a linear projection of the skeletons. The Transformer is trained with the global patch encoder, and the final classification is performed by MLPs.
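A toy version of this token-concatenation idea is sketched below (the dimensions, depth, and mean-pooling head are placeholders; the actual MMT architecture is considerably richer):

```python
import torch
import torch.nn as nn

class TinyMultimodalTransformer(nn.Module):
    """Toy illustration of concatenating visual patch tokens with linearly
    projected skeleton tokens before a Transformer encoder."""
    def __init__(self, patch_dim, skel_dim, d_model=256, num_classes=60):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, d_model)
        self.skel_proj = nn.Linear(skel_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, patch_tokens, skel_tokens):
        # patch_tokens: (B, Np, patch_dim); skel_tokens: (B, Ns, skel_dim)
        tokens = torch.cat([self.patch_proj(patch_tokens),
                            self.skel_proj(skel_tokens)], dim=1)
        encoded = self.encoder(tokens)
        return self.head(encoded.mean(dim=1))   # mean-pool tokens, then classify

logits = TinyMultimodalTransformer(768, 75)(torch.randn(2, 64, 768),
                                            torch.randn(2, 8, 75))
print(logits.shape)  # torch.Size([2, 60])
```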

6.2.2 Missing modality

6.2.2.1 Hallucination network

In Xu et al. (2021), a two-stage teacher-student framework is suggested for multi-view and missing-modality action recognition. The teacher network leverages multi-view pose and appearance information during training, while the student network uses only RGB sequences at test time. 3D CNNs are used for the teacher and student frameworks. A cross-modality aggregated transfer (CAT) network transfers multi-view cross-modality aggregated features from the teacher network to the student network. Further, a viewpoint-aware attention (VAA) module that captures discriminative information across different views is designed to fuse multi-view features. Then, a multi-view feature strengthening (MFS) network and the VAA module boost the global view-invariant features of the student network. Both CAT and MFS are also trained in an online distillation procedure by jointly training the teacher and student networks. Another teacher-student framework is employed in (Thoker and Gall 2019) for cross-modal action recognition, which nearly achieves the accuracy of a student network trained with full supervision. STGCNs and CNNs are employed as the student and teacher architectures, respectively. The student network is trained on sequences of 3D human poses, using a teacher network trained on RGB sequences as supervision. Both RGB videos and human pose sequences are used to train the student network. The student network also exploits unlabeled data that did not participate in the training of the teacher network. The knowledge of the trained teacher network for the source modality is transferred to a small ensemble of student networks for the target modality.
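The transfer step in such teacher-student schemes can be expressed as a distillation objective. The sketch below gives a generic formulation (the temperature, weighting, and optional supervised term are assumptions, not the exact losses of the cited works):

```python
import torch
import torch.nn.functional as F

def cross_modal_distillation_loss(student_logits, teacher_logits, labels=None,
                                  temperature=4.0, alpha=0.7):
    """Generic cross-modal distillation objective: the student (e.g., a pose
    network) mimics the soft predictions of a teacher trained on another
    modality (e.g., RGB); an optional supervised term is added when labels exist."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_student = F.log_softmax(student_logits / temperature, dim=1)
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2
    if labels is None:
        return kd                              # unlabeled data: pure knowledge transfer
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

loss = cross_modal_distillation_loss(torch.randn(8, 60), torch.randn(8, 60),
                                     labels=torch.randint(0, 60, (8,)))
print(loss.item())
```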

6.2.2.2 Ensemble methods

Song et al. (2020) suggest a modality compensation network (MCN) to leverage complementary information. The framework consists of a CNN and an LSTM. RGB and optical flow are the source modalities, and skeletal data is the auxiliary modality. The main goal is to implicitly compensate for source modality features via the auxiliary one. A modality adaptation block links the source modality to the auxiliary modality to compensate for the loss of skeletal data at test time and even during training.

6.3 Depth and skeleton

Since depth maps are invariant to intra-class variations (such as the appearance of objects), foreground/background segmentation obtains appropriate results using depth data (Camplani and Salgado 2014). However, HAR using depth maps performs poorly with noisy data and in human-object interactions. Although the computation of skeletal data demands less hardware complexity, the skeleton alone is insufficient to distinguish actions that involve human-object interactions. The fusion of depth and skeleton provides partly discriminative features.

6.3.1 Complete modality

6.3.1.1 Independent streams

A class of methods (Rani et al. 2021; Wang et al. 2017b; Rahmani and Bennamoun 2017; De Smedt et al. 2017) use CNNs in their frameworks. Rani et al. (2021) propose three descriptors, including the difference depth MHI (D2MHI) descriptor, the spherical joint descriptor (SJD), and the kinematic joint descriptor (KJD). The difference depth motion map (D2MM) and the modified MHI (M2HI) are fused early to form the D2MHI descriptor. To be less sensitive to joint movements, SJD is presented to make the model more robust for actions with similar movements. The motivation of KJD is to model the spatial and temporal changes in actions. The descriptors are fed to three CNNs, and the scores obtained from the softmax layers are late-fused to get the final action label. Wang et al. (2017b) apply the bidirectional rank pooling method to three hierarchical spatial levels of depth maps guided by skeletons, i.e., body, part, and joint. Each level contains some components and possesses a specific number of joint locations. A spatially structured dynamic depth image (S2DDI), which preserves the coordination and synchronization of body parts during the action, is suggested to learn spatio-temporal and structural information at all levels. The framework is implemented with three weight-shared ConvNets and score fusion for classification. Rahmani et al. (2017) use a CNN-based framework to model human-object interactions and intra-class variations under different viewpoints. First, the relative geometry between every body part and the others is assessed to transfer the depth maps of body parts to a shared view-invariant space. Afterward, the view-invariant body parts of the depth and skeletal modalities are combined to learn body part movements during actions. Then, FC, temporal pooling, and softmax layers recognize the action class. Smedt et al. (2017) use CNNs fed with depth keyframes for action recognition.

Lai et al. (2018) suggest a combination of CNNs and RNNs using depth and skeleton data for hand gesture recognition. Various fusion techniques are analyzed for improving performance, including feature-level fusion and score-level fusion. In Mahmud et al. (2023), quantized depth images are employed as an alternative input modality to raw depth images to create sharp relative contrasts between key parts of the hand. The architecture comprises multimodal-fusion convolutional recurrent neural networks (CRNNs).

Others (Liu et al. 2016; Zhao et al. 2019) employ 3D CNNs in their frameworks. Liu et al. (2016) propose a framework including a 3D-based deep CNN (3D2CNN) to learn depth and skeleton features (called JointVector), along with a fusion of SVM decisions for classification. In Zhao et al. (2019), a fusion-based action recognition framework is proposed, consisting of three parts: a 3D CNN, a human skeleton manifold representation, and classifier fusion.

6.3.1.2 Dependent streams

Authors in (Mahmud et al. 2021; Kamel et al. 2018) use CNNs and score fusion techniques in their frameworks. Mahmud et al. (2021) suggest dynamic hand gesture recognition using depth quantized images and skeleton joints. A fusion of CNN and LSTM is used in this framework to extract depth features, while skeleton features are extracted via an LSTM following distinct MLPs. Besides, depth and skeleton data are concatenated and fed to another MLP. The scores of the MLPs are fused for prediction. Kamel et al. (2018) focus on a deep CNN framework that is fed with three descriptors, called the depth motion image (DMI), the moving joint descriptor (MJD), and the fusion of DMI with MJD. DMI represents the body changes of depth maps in an image, and MJD shows body joint position changes and directions around a fixed point.
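As a generic illustration of how a depth sequence can be summarized into a single motion image of this kind (a simplification for exposition, not the exact DMI formulation of the cited work):

```python
import numpy as np

def depth_motion_image(depth_seq):
    """Summarize a depth sequence (T, H, W) into one motion image by
    accumulating absolute frame-to-frame differences."""
    diffs = np.abs(np.diff(depth_seq.astype(np.float32), axis=0))
    dmi = diffs.sum(axis=0)
    # normalize to [0, 255] so the result can be fed to an image CNN
    dmi = 255 * (dmi - dmi.min()) / (dmi.max() - dmi.min() + 1e-6)
    return dmi.astype(np.uint8)

depth_clip = np.random.randint(0, 4000, size=(32, 240, 320))  # toy depth frames (mm)
dmi = depth_motion_image(depth_clip)
print(dmi.shape, dmi.dtype)  # (240, 320) uint8
```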

6.3.2 Missing modality

6.3.2.1 Ensemble methods

Shi et al. (2017) suggest a privileged information-based RNN (PRNN). The privileged information (PI) is available only during training, not at test time. This model considers skeleton joints as PI in a three-phase training process, including pre-training, learning, and refining. The suggested network is end-to-end trainable and learns CNN and RNN parameters cooperatively. The final network enhances the latent PI iteratively in an EM process.

6.4 RGB and depth and skeleton

Some research uses all three modalities of RGB, depth, and skeleton. To compare 18 baseline/state-of-the-art frameworks using RGB-D and pose data, Garcia-Hernando et al. (2018) analyze approaches for egocentric hand actions interacting with 3D objects. Different aspects, such as appearance, pose, and their combination, are assessed. The results demonstrate the impact of the hand pose as a guide in action recognition with RGB-D data.

6.4.1 Complete modality

6.4.1.1 Independent streams

Khaire et al. (2018) use MHI and depth dynamic motion in the top, side, and front view (3-DDM) as RGB and depth descriptors. Their framework offers a new strategy to construct a skeleton image from skeleton joint coordinates. Afterward, five CNNs are trained with constructed descriptors separately. Scores of streams are fused by applying a weighted product model to predict the final action class.

A class of studies (Singh and Vishwakarma 2021; Cardenas and Chavez 2020; Khaire et al. 2018a; Elmadany et al. 2018; Cardenas and Chavez 2018) employ CNNs and SVMs in their frameworks. Singh et al. (2021) present a modality fusion mechanism for RGB, depth, and skeleton, called the deep bottleneck multimodal feature fusion (D-BMFF) framework. 3D joints are converted into a single RGB skeleton MHI (RGB-SklMHI). Every ten RGB and depth frames are fed to a DNN to extract spatial features, and a single SklMHI image for each activity captures temporal features. The extracted features of different modalities from three distinct streams are fused via multiset discriminant correlation analysis (M-DCA). Then, the action is recognized using a linear multiclass SVM. Cardenas et al. (2020) propose a dynamic hand gesture recognition framework that fuses spatio-temporal features obtained from RGB-D and skeleton joints. Hand poses are detected using skeletons in RGB and depth images. Besides, a descriptor called the histogram of cumulative magnitudes (HCM), used to extract features from depth, represents the topology of the hand and body to discern similar poses. Two distinct CNN streams are trained from the RGB and depth modalities. The features obtained from the CNNs and HCM are integrated and passed to an SVM for classification. In addition, a method is presented to extract a constant number of keyframes to decrease the computational processing time. Khaire et al. (2018a) improve activity recognition based on a five-stream CNN network. Skeleton images, MHI, and three DMMs from the side, top, and front views are fed to the network. Three fusion approaches are presented to improve the overall recognition accuracy. The fusion of a CNN trained on skeleton images as the fifth CNN stream achieves the best result compared to the other two approaches. Elmadany et al. (2018) suggest two fusion methods, called biset globality locality preserving canonical correlation analysis (BGLPCCA) and multiset globality locality preserving canonical correlation analysis (MGLPCCA), for learning a common subspace from two sets and from more than two sets, respectively. These methods represent global and local data features with a low-dimensional common subspace. Besides, a bag of angles (BoA) is proposed as a descriptor for the skeleton and HPDMM-CNN for depth. Finally, a framework is used for action recognition using the proposed fusion methods and descriptors. The multimodal information recorded by a Kinect sensor (RGB-D and skeleton) is also exploited in (Cardenas and Chavez 2018). Various rank pooling and skeleton optical spectra methods are tested to generate dynamic images summarizing an action sequence into single flow images. Dynamic images are categorized into five groups: a dynamic color group (DC), a dynamic depth group (DD), and three dynamic skeleton groups (DXY, DYZ, DXZ). Different dynamic images with the main postures of each group are generated to model different action postures. Then, a pre-trained flow-CNN extracting spatio-temporal features is applied with a max-mean aggregation.

Romaissa et al. (2021) suggest a four-step framework for action recognition, including creating dynamic image sets from RGB, depth, and skeleton joints, feature extraction, feature fusion, and classification using LSTM. After constructing three different dynamic image sets, features are extracted from the image sets via pre-trained CNN-based models using transfer learning. Canonical correlation analysis fuses the extracted features. Eventually, a bidirectional LSTM is trained to recognize action labels.
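A minimal sketch of CCA-based feature fusion with scikit-learn is given below (the feature dimensions, sample counts, projected dimensionality, and the concatenation step are placeholders, not the cited pipeline):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Hypothetical per-video features from two pre-trained CNN streams,
# e.g. RGB dynamic images (128-D) and depth dynamic images (64-D).
rgb_feats = np.random.randn(200, 128)
depth_feats = np.random.randn(200, 64)

cca = CCA(n_components=32)          # projected dimensionality is a placeholder
cca.fit(rgb_feats, depth_feats)
rgb_c, depth_c = cca.transform(rgb_feats, depth_feats)

# One common fusion choice: concatenate (or sum) the correlated projections
fused = np.concatenate([rgb_c, depth_c], axis=1)   # (200, 64)
print(fused.shape)
```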

6.4.1.2 Dependent streams

Hu et al. (2018) propose a method to learn modality-temporal mutual information from tensors, called the deep bilinear framework. The bilinear block contains modality pooling and temporal layers, learning the time-varying dynamics and multimodal information. The deep bilinear model is established via accumulating bilinear blocks and other layers to extract video modality-temporal features. Further, a novel descriptor, called modality-temporal cube, characterizing actions from a general schema, is proposed as deep bilinear learning input.

Gan et al. (2023) propose a focal channel knowledge distillation for action recognition to transfer channel semantic correlations and distributions of teacher modalities to the RGB student. The 3D CNN backbone networks extract spatio-temporal features, and an average pooling operation is performed on the teacher features to generate the channel attention map. The channels with large weights are considered as the focal channels. The correlation matrices of these focal channels are measured by inner product, where high relevance represents the homology of channel semantics, and low relevance represents diversity. By minimizing the MSE distance of the focal channel correlation matrices, the student can learn sufficient intrinsic relationships and diversity properties of key semantics. In addition, the teacher’s crucial semantic distribution knowledge is transferred to the student by minimizing the weighted sum of KL divergence of channel distribution differences, thus focusing on the salient region of channel features.
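A rough sketch of these two distillation terms is given below (the channel selection, normalization, and weighting are simplifications and assumptions, not the exact formulation of the cited work):

```python
import torch
import torch.nn.functional as F

def focal_channel_losses(student_feat, teacher_feat, k=16):
    """Sketch of focal-channel distillation terms: select the teacher channels
    with the largest average-pooled activations, match the inner-product
    correlation matrices of those channels (MSE) and the overall channel
    distributions (KL). Inputs: (B, C, T, H, W) feature maps of equal shape."""
    t = teacher_feat.flatten(2)                   # (B, C, THW)
    s = student_feat.flatten(2)
    t_attn = t.mean(dim=2)                        # channel attention map (B, C)
    s_attn = s.mean(dim=2)
    idx = t_attn.topk(k, dim=1).indices           # focal channels per sample
    idx = idx.unsqueeze(-1).expand(-1, -1, t.shape[2])
    t_focal = F.normalize(torch.gather(t, 1, idx), dim=2)   # (B, k, THW)
    s_focal = F.normalize(torch.gather(s, 1, idx), dim=2)
    corr_loss = F.mse_loss(torch.bmm(s_focal, s_focal.transpose(1, 2)),
                           torch.bmm(t_focal, t_focal.transpose(1, 2)))
    kl_loss = F.kl_div(F.log_softmax(s_attn, dim=1),
                       F.softmax(t_attn, dim=1), reduction="batchmean")
    return corr_loss, kl_loss

corr, kl = focal_channel_losses(torch.randn(2, 64, 4, 7, 7),
                                torch.randn(2, 64, 4, 7, 7))
print(corr.item(), kl.item())
```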

Li et al. (2022) suggest a GCN for first-person hand action recognition. They define geometric relationships between two neighboring bones in a hand skeleton as the third-order node feature. Pretrained networks are employed to extract features from RGB images and depth maps. Customized LSTM units fuse RGB-D features. Finally, the skeleton modality and the RGB-D modality are combined by late fusion of classification scores.

6.4.1.3 Single-stream

Wu et al. (2016) describe a deep hierarchical dynamic neural network for multimodal gesture recognition. This framework consists of a Gaussian-Bernoulli deep belief network (DBN) to extract dynamic skeletal features and a 3D CNN to represent features from RGB and depth images, as shown in Fig. 15. Further, intermediate and late fusion methods are used to fuse RGB and depth with skeleton data. Emission probability learning of an HMM is applied to estimate the gesture class.

Fig. 15

Single-stream architecture with an N-D entity as input (Wu et al. 2016)

6.4.2 Missing modality

6.4.2.1 Ensemble methods

Luo et al. (2018) suggest a framework for action detection and classification with only limited training data and partially observed modalities. Their method, called graph distillation, merges rich information from the large-scale multimodal dataset in the source domain and increases learning performance in the target domain, which has scarce training data and modalities, as shown in Fig. 16. A graph distillation layer is suggested to distill knowledge between multiple modalities and can be attached to available models. A model trained for action classification is used as a pre-trained model in the action detection task. For action recognition, a short video clip is encoded into a feature vector via a visual encoder; this step is repeated over a sequence of clips to construct the final feature vectors for action detection. The feature vectors are fed into task-specific linear and softmax layers to obtain the probability distribution for each clip.

In Li et al. (2023), a deep fusion network (DFN) is proposed to fuse features of different modalities even when modalities are missing. The DFN comprises MLPs and CNNs.

Fig. 16

Graph distillation network in (Luo et al. 2018)

6.5 Infrared and other visual modalities

Infrared data is another modality used for HAR in dark environments. It is favored over RGB because it is less affected by illumination conditions. However, limited studies have concentrated on fusing infrared with other visual modalities, since infrared information is more restricted than that of other modalities and offers less complementary information. This section introduces HAR algorithms that combine infrared with other vision-based modalities.

6.5.1 Complete modality

6.5.1.1 Independent streams

There are 2D or 3D CNN-based methods for infrared combined with other modalities. Molchanov et al. (2016) train separate 3D CNNs along with an RNN for the RGB, optical flow, depth, IR, and IR disparity modalities. Class-conditional probability vectors from all modalities are averaged and fused to detect and classify hand gestures. Boissiere et al. (2020) use infrared and skeletal data for HAR. A pre-trained CNN extracts features from the skeleton data, which are also used to crop the region around the subjects. A pre-trained 3D CNN is designed to extract visual features from the infrared videos. The extracted feature vectors are fused and exploited jointly. The main focus in (Rückert et al. 2021) is using RGB, depth, and infrared in HAR to acquire and transfer manual assembly workspaces into a digital environment. A framework based on CNNs and RNNs is suggested to differentiate the assembly operations that constitute a complex assembly process.

6.5.2 Missing modality

6.5.2.1 Hallucination network

The scenario of full-modal learning from partial modalities often arises in practice. For example, RGB surveillance cameras face restrictions due to privacy concerns. In such cases, cross-modal data hallucination is a practical solution (Pahde et al. 2019). Wang et al. (2018a) propose partial-modal generative adversarial networks (PM-GANs) to learn a full-modal model from partial modalities and perform tasks related to data hallucination, as shown in Fig. 17. The complete model is attained via a generated representation that substitutes for the missing data channel. In this regard, GANs have shown favorable results for cross-modal sample generation (Pahde et al. 2019). GANs are deep generative models mainly applied to unsupervised tasks and have demonstrated significant advances in image generation, image-to-image translation, and facial attribute manipulation (Wang et al. 2021; Pan et al. 2019). Woo et al. (2023) suggest an autoencoder for reconstructing missing modalities. They use CNNs and Transformers in the proposed architecture.
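A minimal feature-level sketch of the hallucination idea is shown below (our own simplification with placeholder dimensions; it is not the PM-GAN architecture): a generator maps available-modality features to pseudo features of the missing modality, and a discriminator judges real versus hallucinated features.

```python
import torch
import torch.nn as nn

feat_dim = 512  # placeholder feature dimensionality
G = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, feat_dim))
D = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()

rgb_feat = torch.randn(8, feat_dim)   # available modality (toy features)
ir_feat = torch.randn(8, feat_dim)    # paired modality seen only during training

fake_ir = G(rgb_feat)                 # hallucinated infrared features
# discriminator step: real infrared features vs. hallucinated ones
d_loss = bce(D(ir_feat), torch.ones(8, 1)) + bce(D(fake_ir.detach()), torch.zeros(8, 1))
# generator step: fool the discriminator and stay close to the real features
g_loss = bce(D(fake_ir), torch.ones(8, 1)) + nn.functional.l1_loss(fake_ir, ir_feat)
print(d_loss.item(), g_loss.item())
```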

Fig. 17

Full-modal modeling of partial modalities (Wang et al. 2018a)

7 Datasets

With growing attention to deep-based action recognition, there is a need for large datasets describing actions under various conditions, such as different subject appearances (with diverse ages, genders, heights, and cultural backgrounds), views (e.g., first-person/egocentric or third-person), illuminations, and environments (Liu et al. 2019). Many different datasets have been established in this field for analyzing the efficiency of HAR algorithms. Some datasets contain ADL (Das et al. 2019b; Jang et al. 2020; Liu et al. 2019); others provide data in specific application domains like therapeutic applications (Negin et al. 2013), sport (Moencks et al. 2019), gaming (Li et al. 2010; Bloom et al. 2015), human-robot interaction (Jang et al. 2020), and health (Liu et al. 2019). While there are many unimodal datasets in the community, we only concentrate on multimodal visual datasets in this section due to the scope of the paper.

Multimodal visual datasets offer more than one modality for each action, which is helpful in multimodal HAR algorithms. These datasets usually provide RGB, depth, and skeleton. Nevertheless, IR is also provided in some limited datasets.

We comprehensively review available multimodal vision-based gesture, action, and activity datasets. All of these are referred to as action datasets in the rest of the paper. A three-level categorization of the multimodal HAR benchmark datasets is proposed, as shown in Fig. 18. In the first level, datasets are grouped by whether they provide trimmed (segmented) or untrimmed (continuous) videos. While only one action is performed in a trimmed video, untrimmed videos contain more than one action each. In the second level, datasets are categorized by the number of viewpoints, i.e., single-view and multi-view. Multi-view datasets (often created with more than one camera) provide different views of the action in the scene, while the others are single-view (frequently front-view). In multi-view datasets, videos are captured in two different ways. First, several cameras are mounted at different positions and angles, and the action is captured synchronously by these cameras. Second, the same action is repeated from different viewpoints with only a single camera. In the last level, datasets are grouped based on the provided data modalities. As described above, multimodal visual HAR datasets usually provide two or three modalities, including RGB and depth (RGB+D), RGB and skeleton (RGB+S), depth and skeleton (D+S), and RGB, depth, and skeleton (RGB+D+S), which are respectively shown with red, pink, green, and blue colors in this paper (Figs. 18, 19, 20, 21, 22). Since few datasets provide infrared data, IR is not considered in categorizing datasets at the third level.

Fig. 18

Proposed taxonomy of multimodal vision-based HAR datasets

Therefore, available multimodal vision-based HAR datasets are grouped into four main categories: trimmed/single-view, trimmed/multi-view, untrimmed/single-view, and untrimmed/multi-view (see Table 3). Accordingly, datasets are presented in four diagrams (Figs. 19, 20, 21, 22), which allow a better comparison of the datasets within each group. This will help the community choose suitable datasets for their tasks or produce new ones that eliminate existing restrictions.

Table 3 Multimodal vision-based action datasets

Figures 19, 20, 21, 22 show the publishing year and the average number of videos per class for different datasets. Further, the circle size for each dataset reflects the number of action classes in that dataset, and the circle color shows the modalities provided in the dataset (identical to Fig. 18). The vertical axis (the average number of videos per class) is plotted on a logarithmic scale to better show the differences between datasets. The vertical axis thus indicates the intra-class variation caused by different subjects, views, and environments, whereas the circle size indicates the inter-class variation. Hence, a circle on the upper side of a graph means more intra-class diversity in the dataset, i.e., a higher average number of videos per class. Bigger circles indicate more action classes in the dataset, which implies more inter-class diversity. Therefore, the most appropriate datasets for HAR are the bigger circles on the upper side of the diagrams.
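Such a diagram can be reproduced with a few lines of plotting code; the sketch below uses purely illustrative placeholder statistics rather than the values behind Figs. 19, 20, 21, 22:

```python
import matplotlib.pyplot as plt

# Placeholder dataset statistics: (name, year, #videos, #classes, modality color)
datasets = [
    ("Dataset A", 2016, 50000, 60, "blue"),   # illustrative numbers only
    ("Dataset B", 2019, 14000, 20, "red"),
    ("Dataset C", 2020, 110000, 55, "pink"),
]

for name, year, n_videos, n_classes, color in datasets:
    avg_per_class = n_videos / n_classes
    plt.scatter(year, avg_per_class, s=n_classes * 5, c=color, alpha=0.6)
    plt.annotate(name, (year, avg_per_class))

plt.yscale("log")                       # log scale highlights intra-class diversity
plt.xlabel("Publishing year")
plt.ylabel("Average number of videos per class")
plt.title("Multimodal HAR datasets (circle size = number of classes)")
plt.show()
```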

Fig. 19

Trimmed/single-view multimodal vision-based HAR datasets. Circle size is related to the number of action classes compared to other datasets, and circle color demonstrates provided modalities in the dataset; red: RGB+D, pink: RGB+S, green: D+S, and blue: RGB+D+S

Fig. 20

Trimmed/multi-view multimodal vision-based HAR datasets. Circle size is related to the number of action classes compared to other datasets, and circle color demonstrates provided modalities in the dataset; red: RGB+D, pink: RGB+S, and blue: RGB+D+S

Fig. 21

Untrimmed single-view multimodal vision-based HAR datasets. Circle size is related to the number of action classes compared to other datasets, and circle color demonstrates provided modalities in the dataset; red: RGB+D, green: D+S, and blue: RGB+D+S

Fig. 22

Untrimmed/multi-view multimodal vision-based HAR datasets. Circle size is related to the number of action classes compared to other datasets, and circle color demonstrates provided modalities in the dataset; red: RGB+D, pink: RGB+S, and blue: RGB+D+S

To compare benchmark HAR datasets in a meaningful way, the average number of videos per class is considered instead of the total number of videos in a dataset. For example, ChaLearn2014, which contains 13858 videos, provides more intra-class diversity than ConGD, with 22535 videos, because ConGD has more action classes, so its average number of videos per class is lower, which restricts intra-class variation.

According to Fig. 19, among trimmed/single-view datasets, LboroLdnHAR (Moencks et al. 2019) provides prominently more videos per class; however, it contains few classes. IsoGD (Wan et al. 2016) provides fewer videos per class than LboroLdnHAR (Moencks et al. 2019), while EgoGesture (Zhang et al. 2018) contains the largest number of action classes among the datasets in this group.

As depicted in Fig. 20, ETRI-Activity3D (Jang et al. 2020) comprises more videos per class than the others and ranks third in the number of action classes. Although the intra-class diversity is almost the same in NTU RGB+D 120 (Liu et al. 2019) and NTU RGB+D (Shahroudy et al. 2016), NTU RGB+D 120 provides more action classes among trimmed/multi-view datasets.

As shown in Fig. 21, ChaLearn2014 (Escalera et al. 2015) has the most videos per class, and ConGD (Wan et al. 2016) provides the most action classes in the untrimmed/single-view group. Among untrimmed/multi-view datasets, Toyota-Smarthome (Dai et al. 2022) ranks first in the number of videos per class and second in the number of action classes, while PKU-MMD (Liu et al. 2017a) provides the most action classes.

The most common datasets in HAR along with the newest ones are listed in Table 4. This table also demonstrates the method with the best accuracy for each dataset, employed modalities, framework architecture, and the number of studies reviewed in this paper that use the mentioned dataset as benchmarks in their experiments.

We have studied 66 publicly available multimodal visual HAR datasets, of which 44 contain trimmed videos and 25 involve untrimmed videos. Three datasets (HumanEva, Toyota-Smarthome, and EgoGesture) offer both trimmed and untrimmed videos. Among the trimmed datasets, 26 offer actions from a single view, and 18 provide multi-view action videos. Among the untrimmed datasets, 15 contain single-view action videos, and 10 offer multi-view videos. In total, there are 40 single-view and 26 multi-view datasets. The study shows that most multimodal action datasets contain single-view trimmed videos with three different modalities: RGB, depth, and skeleton.

8 Discussion

According to the studied papers, the combination of RGB and depth is the most frequent in HAR algorithms, as these modalities provide complementary information about the appearance and 3D structure of the scene. The combination of RGB and skeleton ranks second. In contrast, the combination of infrared with other modalities is employed less than the others, since infrared data is appropriate only in restricted applications.

The majority of studies have concentrated on HAR scenarios with complete modalities, while HAR dealing with missing modalities is a newer direction. The independent streams architecture is the most frequent framework in the complete modality group, employing score fusion or feature fusion techniques. As shown in Table 4, the state-of-the-art methods mainly use the independent streams architecture, perhaps because there is no need to handle heterogeneous data from different modalities in this architecture. However, independent streams cannot learn from mid-level complementary information of different modalities. Although the dependent streams architecture can exploit mid-level complementary information of multiple modalities, it must handle heterogeneous data from different modalities. Additionally, new fusion strategies can also be used with dependent streams (Cheng et al. 2021; Zhou et al. 2021; Tian et al. 2020; Wang et al. 2019a; Joze et al. 2020).

The single-stream architecture does not need to fuse data or scores in the framework; however, the data from different modalities should be handled before being fed to the network. Skeletal data is mostly used as attention combined with RGB in single-stream architectures, while RGB and depth form a 4D input in single-stream architectures.

A hallucination network provides new representations of modalities and is popular in the case of missing modalities. This idea is used in different methods, such as teacher-student frameworks and GANs. However, only a modest number of studies have investigated multimodal visual HAR with missing modalities.

According to Table 4, the most commonly used dataset in multimodal vision-based HAR is NTU RGB+D (Shahroudy et al. 2016). State-of-the-art methods on this dataset obtain accuracies above 97% for the cross-subject and cross-view protocols using the RGB and skeleton modalities. In contrast, the newest datasets, MDAD (Jegham et al. 2019) and ETRI-Activity3D (Jang et al. 2020), are used less frequently, and lower accuracies are obtained on them, which means more accurate methods are required to learn the diversity of actions. Further, the combination of RGB & depth is more frequent in the state-of-the-art methods. Almost all of the most accurate methods in Table 4 employ RGB in their frameworks. Only the MSR-Action3D (Li et al. 2010) and SYSU 3D HOI (Hu et al. 2015) datasets obtain reasonable results without using the RGB modality.

Table 4 Methods with the best accuracy on common and the newest multimodal vision-based human action and gesture datasets

9 Future directions

The following directions are pointed out for future research.

9.1 Transformers

Transformers, first applied in NLP, have recently entered computer vision tasks such as HAR. As Transformers capture long-term dependencies and are capable of parallel processing, they have attracted interest for video action recognition (Girdhar et al. 2019; Gavrilyuk et al. 2020; Chen and Mo 2023; Yang et al. 2022). Lightweight Transformers are also used in specific applications (EK et al. 2022). Nowadays, Transformers are used in multimodal deep-based HAR (Li et al. 2021). Since Perceiver (Jaegle et al. 2021b) and Perceiver IO (Jaegle et al. 2021a) can be applied in several domains (Han et al. 2022), these two Transformer models are also promising for multimodal vision-based HAR. However, opportunities remain to effectively model similar, long-term, and complex activities using Transformers and multimodal visual data. Besides, Transformers appear to be powerful for predicting future actions.

9.2 Large language models

Some convolutional networks have demonstrated promising results in HAR using a single modality (Wang et al. 2023), and these can be utilized with multiple modalities as well. The rapid development and high capabilities of large language models (LLMs) present remarkable potential for the future. Initially developed for NLP tasks, LLMs have expanded their application to various vision tasks, such as image captioning (Zhu et al. 2023), visual question answering (Salaberria et al. 2023), OCR (Ye et al. 2023), image generation, and style transfer (Fu et al. 2022). Pretrained LLMs have been employed as knowledge engines to generate text descriptions of the body movements of actions, with training performed using a text encoder along with a visual encoder (Xiang et al. 2023). Consequently, there is growing interest in exploring the use of LLMs for action classification tasks with multiple data modalities.

9.3 Missing modality

Real-life applications usually deal with partial modalities due to different restrictions, such as noise or sensor failure. Co-learning approaches try to transfer or distill knowledge from auxiliary modalities and assist in learning the model from them. Methods that employ hallucination networks in their frameworks (such as teacher-student schemes or GANs) benefit from the complementary information of all modalities, as suggested in (Garcia et al. 2018, 2019; Xu et al. 2021; Thoker and Gall 2019; Wang et al. 2018a). Besides, Transformers have been used in teacher-student frameworks for NLP applications (Mukherjee and Awadallah 2020; Mirzadeh et al. 2020) and in GANs (Jiang et al. 2021), and these ideas can also be applied to HAR. Other methods employ autoencoders to reconstruct missing data modalities (Woo et al. 2023). New and accurate methods of transferring knowledge or knowledge distillation can improve recognition accuracy.

9.4 Few-shot and zero-shot learning

Collecting adequate data for all action classes is a big challenge, which few-shot (or one-shot) and zero-shot learning approaches address. Few-shot learning is the problem of making predictions based on a limited number of samples, whereas zero-shot learning tries to predict without any training samples. Co-learning-based approaches like transfer learning, knowledge distillation, and GANs are tools to hallucinate diverse and discriminative features from a few data samples. In (Wang et al. 2021), GANs are used with significant advances in generating new data for a modality with few samples by using other modalities with rich samples. Others employ language models in the context of few-shot learning (Brown et al. 2020; Alayrac et al. 2022). Although there are methods for zero-shot action recognition using RGB data (Estevam et al. 2021), one-shot learning with the fusion of vision-based and sensor-based modalities (Memmesheimer et al. 2021), and Recurrent Transformers (Schatz et al. 2020) to synthesize human actions from novel views, there has been no considerable attempt at deep-based HAR with multiple visual data modalities using few-shot or zero-shot learning.

9.5 Fusion methods

Section 6 reviews general and commonly used fusion methods. Although some studies have introduced new fusion approaches (Cheng et al. 2021; Zhou et al. 2021; Joze et al. 2020; Singh and Vishwakarma 2021; Elmadany et al. 2018), novel fusion methodologies are anticipated, particularly in the context of Transformers (Hampiholi et al. 2023), to benefit effectively from mid-level and heterogeneous information.

9.6 Unsupervised, semi-supervised, and self-supervised learning

Supervised learning requires a fully labeled dataset, while labels are scarce or absent in real scenarios. Semi-supervised and unsupervised learning try to solve tasks with limited or no data labels. Since deep approaches need huge datasets and dataset labeling is a labor-intensive task, semi-supervised (Singh et al. 2021) and unsupervised (Lin et al. 2022) learning are of outstanding importance. Multimodal data can be employed as additional information in semi-supervised and unsupervised learning as well (Patwary et al. 2022). Further, a model can be trained to learn one part of the input from another part in self-supervised learning (Guo et al. 2022), which also appears well suited to prediction tasks.

9.7 Datasets

Datasets with diverse and enormous samples play a significant role in developing deep-based algorithms. As mentioned in Sect. 7, many datasets have been collected for video-based HAR with multiple visual modalities, such as ETRI-Activity3D (Jang et al. 2020), Toyota-Smarthome (Das et al. 2019b), and NTU RGB+D 120 (Liu et al. 2019). Due to the diverse intra-class and inter-class variations of actions in video, large datasets with multiple modalities are still lacking, especially for particular applications [such as surgery (Twinanda et al. 2016) or industrial assembly processes (Rückert et al. 2021)], complex actions (Li et al. 2010), prediction of actions in real scenarios (Dai et al. 2022), and uncontrolled (Sung et al. 2011), cluttered, and crowded environments (You and Jiang 2019), which need more investigation. Huge datasets for data-hungry DNNs are still in demand for HAR.

10 Conclusion

HAR is an important task in computer vision that has attracted researchers’ interest, and DNNs have made HAR algorithms more accurate. This paper presents a comprehensive review of deep-based HAR methods using multiple visual modalities. Methods are reviewed based on a novel four-level categorization, which considers framework modalities, modality availability, framework architecture, and framework similarities. This four-level categorization helps researchers comprehend and compare methods in detail. Common properties between methods and their differences are highlighted. The review indicates that new approaches are clearly required to achieve higher accuracies in HAR.

Further, available benchmark HAR datasets providing multiple vision-based modalities are categorized into four groups based on providing trimmed or untrimmed videos, single-view or multi-view, and data modalities. Datasets in each group are compared graphically by plotting their characteristics.

Besides, the pros and cons of different architectures in the proposed four-level categorization are discussed. Then, the most accurate methods on more popular as well as the newest datasets are listed and commented on. Finally, some potential research directions are discussed.