1 Introduction

The problem of human action recognition (HAR) is an important task in the field of computer vision with a wide range of practical applications. For users interacting with robotic platforms, HAR enables applications involving user positioning, identification and surveillance, healthcare and remote care, exercise monitoring, and various smart home or smart city scenarios. Several categories of approaches have been proposed for this problem: based on video sequences, on depth maps, on the coordinates of connected joints (in the form of a skeleton), or on combinations of several modalities. In general, practical problems that require a module for recognizing human actions demand not only high accuracy but also real-time performance.

Although several solutions based on deep learning have been proposed for this problem, many of them are difficult to integrate directly into a robotic platform. They typically employ deep neural networks with a huge number of parameters, which poses a challenge given the often limited computational resources available on robotic platforms. Another disadvantage is that most of the proposed solutions are strongly dependent on the data used in the training process.

Several solutions based on skeletal data have been proposed for the HAR problem. This type of representation allows the implementation of robust deep learning algorithms that can be successfully applied in real-time scenarios. Using a skeletal representation has several advantages, including the fact that the recognition process is independent of the environment in which the person acts. This allows for easier redeployment of the system into novel types of environments. It is also easy to perform data normalization to account for user-to-user variability; for example, the translation operation required to normalize the length of each bone can be applied to each joint. Another advantage is the fact that the inference time is usually low, due to the rather small dimensionality of the input data. Moreover, HAR based on skeletal data can overcome issues related to the General Data Protection Regulation (GDPR), as compared to solutions based on RGB images, since skeletons do not disclose the identity of the users while still preserving information about their activities.

The solution we propose in this work is based on architectures that combine temporal convolution and graph convolution layers and take as input a sequence of skeletal joint representations enhanced with spatial and temporal information through features representing joint velocities, segment lengths and angles between segments.

The contributions of this paper can be summarized as follows:

  • We propose a method to determine geometric features that help the neural model achieve better performance. This method also includes a data normalization step.

  • By introducing geometric formulas, we derived features that add robustness to our neural network for the human action recognition problem. These features are designed to be invariant to the physical attributes of the individual executing the action and to remain invariant to the speed at which the action is performed. In essence, these geometric features serve as a foundation for the neural network, providing it with a reliable and adaptable representation for recognizing human actions.

  • We present a spatio-temporal neural architecture, which combines temporal convolutional and graph convolutional layers, and we report the results of the proposed model on NTU RGB+D [1, 2] benchmark, for all the test protocols.

  • We propose a non-black-box approach in which the model outputs a feature tensor together with the results, thus being able to explain predictions.

  • We introduce a solution capable of achieving satisfactory performance on data independent of those used in the training process. This aspect validates the good generalization capacity of the model, which is supported by the formulas proposed for the spatio-temporal features.

In Sect. 2, we summarize the methods considered most relevant for our work. Section 3 describes our proposed method, specifying how we preprocessed the data and the main components of the neural model. Section 4 includes the results of the proposed solution for the NTU RGB+D dataset. In Sect. 5, we present a way in which an explanation for the recognized action can be extracted from our model. Section 6 presents the conclusions we reached after developing and evaluating our solution, including advantages and limitations, along with ways in which we could improve the recognition performance.

2 Related work

Many solutions [3,4,5,6,7] proposed for skeleton-based human action recognition use neural models that contain graph convolutional network (GCN) layers. The main reason why such approaches obtain good results is that the human skeleton is naturally modelled as a graph. Thus, topology becomes an important and informative aspect for these approaches. GCN layers are able to extract spatially relevant features. Specifically, these methods manage to detect spatial patterns of joint interaction for each type of action analysed. For extracting temporal features, on the other hand, there are two major approaches: one based on recurrent neural networks (RNNs) and the other based on temporal convolutional networks (TCNs). Spatial–temporal graph convolutional network (ST-GCN) combines layers specialized in extracting temporal features with layers specialized in extracting spatial features. The performance obtained by such approaches demonstrates that treating both the spatial and temporal dimensions simultaneously is important for this problem.

2.1 GCN-based approaches

The human skeleton can be represented as a graph \(G = (V, E)\), where V is the set of the 25 joints and E is the set of the 24 segments connecting the joints. Based on such a model, an action can be represented as a set of T graphs, each containing N nodes, where T is the total number of frames and N is the number of joints. This representation has the advantage that the same adjacency matrix can be used for each of the T graphs, which allows processing operations to be applied to these graphs in parallel.
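
To illustrate this shared-topology property, the sketch below builds a normalized adjacency matrix for a 25-joint skeleton and applies a single graph convolution to every frame of a batch at once (PyTorch). The edge list and layer sizes are illustrative assumptions, not the exact ones used in our implementation.

```python
# Hypothetical sketch: one shared 25x25 adjacency matrix applied to all T frames in parallel.
import torch

# Edges of a 25-joint skeleton (indices are illustrative, not the official Kinect ordering).
edges = [(0, 1), (1, 20), (20, 2), (2, 3),          # trunk / head
         (20, 4), (4, 5), (5, 6), (6, 7),           # left arm
         (20, 8), (8, 9), (9, 10), (10, 11),        # right arm
         (0, 12), (12, 13), (13, 14), (14, 15),     # left leg
         (0, 16), (16, 17), (17, 18), (18, 19),     # right leg
         (7, 21), (7, 22), (11, 23), (11, 24)]      # hand tips / thumbs

N = 25
A = torch.zeros(N, N)
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
A = A + torch.eye(N)                                 # add self-loops
D_inv_sqrt = torch.diag(A.sum(1).pow(-0.5))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt                  # symmetrically normalized adjacency

# A batch of actions: (batch, channels, T frames, N joints).
x = torch.randn(8, 3, 300, N)
W = torch.nn.Conv2d(3, 64, kernel_size=1)            # per-joint feature transform
# The same A_hat is reused for every frame, so the graph convolution is a single einsum.
out = torch.einsum('bctv,vw->bctw', W(x), A_hat)     # (8, 64, 300, 25)
```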

Cheng et al. [8] proposed a way to address one of the main disadvantages of GCN-based solutions, namely that they tend to be large in terms of number of parameters. The authors proposed two variants optimized for spatial skeleton graph modelling, called local and non-local shift graph operations. They obtained a version optimized in terms of number of parameters that still achieves good results for the HAR problem. They also proposed modifications for temporal shift graph operations that obtain better scores than the classic ones with lower computational effort.

Chen et al. [9] proposed a method called channel-wise topology refinement graph convolution (CTR-GC) that can dynamically learn different graph topologies and work with multiple categories of features specific to graph nodes. They specifically designed this approach for the problem of recognizing human actions using skeletal data. Their model differs from the usual manually defined adjacency matrix for the joint graph since in their approach the model automatically infers the graph topology. This highlights the relationship between the joints from the perspective of the features provided for each joint. Finally, an aggregation of the features from each channel is performed to obtain the final prediction.

Ding et al. [10] address the problem that arises in modelling skeletal data from the perspective of the temporal dimension. Usually, actions do not have a fixed duration and the number of frames differs from one action to another or from one sample to another for the same action. Therefore, the proposed approaches generally use sequence padding which causes the neural model to perform unnecessary calculations. To eliminate this limitation the authors propose an approach called temporal segment graph convolutional networks (TS-GCN). In their approach they divide the sequence into several stages. The proposed model independently analyses each stage. To obtain the final prediction, they apply ensemble methods which integrate the predicted outputs of the model for each stage.

To improve the ability of the neural model to capture local features among adjacent vertices, Xie et al. [11] proposed a new GCN-based model called cross-channel graph convolutional networks. Their idea came from the observation that other existing approaches cannot effectively extract and use local features of adjacent vertices in the graph. Performance is further improved by adding a channel attention mechanism, which filters out the features that are unrelated to the analysed action. They proposed a neural architecture composed of a batch normalization layer followed by nine residual units, a global average pooling layer and a softmax layer.

Approaches solely relying on GCN prove inadequate for HAR due to their inability to effectively analyse features from a temporal relationship perspective. Since GCNs primarily focus on static graph structures, they struggle to capture the dynamic and temporal nuances inherent in human actions. Consequently, a more comprehensive solution incorporating temporal considerations is essential for accurate and robust human action recognition systems.

2.2 TCN-based approaches

When a problem requires the analysis of a temporal sequence of data, we want the neural network model to be able to store long-term information. For a convolutional neural network (CNN), the receptive field represents the region in the input space that produces a given feature. When operating on sequential data, a larger receptive field allows correlations between information more distant in time to be found; the dilation operation makes it possible to increase the receptive field. Since convolution does not take direction into account, applying it to sequential data means the convolution kernel analyses both past and future features. If only past characteristics should be taken into account, causal convolution can be used: this type of convolution determines the output for step t taking into account characteristics up to step \(t-1\).
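
The following minimal sketch shows how a causal, dilated convolution of this kind can be implemented (PyTorch); the class name and hyper-parameters are illustrative.

```python
# Minimal sketch of a causal, dilated 1D convolution; names and sizes are illustrative.
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """Left-pads the sequence so the output at step t only sees past inputs."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation       # pad only on the left (the past)
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                              # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))        # no look-ahead into the future
        return self.conv(x)

# Stacking layers with dilations 1, 2, 4, ... grows the receptive field exponentially.
x = torch.randn(4, 9, 300)
tcn = nn.Sequential(CausalConv1d(9, 64, dilation=1),
                    CausalConv1d(64, 64, dilation=2),
                    CausalConv1d(64, 64, dilation=4))
print(tcn(x).shape)                                   # torch.Size([4, 64, 300])
```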

Lea et al. [12] first introduced the concept of TCN as a means to analyse long-term dependencies in patterns. In our previous work [13] we demonstrated the benefits of using a neural model based on TCNs relative to similar models based on recurrent layers.

Solely relying on TCN layers proves inadequate as they fall short in identifying complex spatial relationships. Since TCNs are primarily designed to capture temporal dependencies, their effectiveness in discerning intricate spatial connections is limited.

2.3 RNN-based approaches

For many tasks in which the temporal dimension represents a key aspect, approaches based on RNNs have achieved good performance. For a long time, long short-term memory (LSTM) and gated recurrent unit (GRU) layers represented the fundamental methods for determining relevant patterns in temporal sequences. The main advantage of these types of neural layers is their invariance to the sequence length.

Yadav et al. [14] proposed a method for combining recurrent neural networks with convolutional networks. The method, called deep convolutional long short-term memory (ConvLSTM), is a sequential application of the following types of layers: convolutional layers, LSTM cells and fully connected layers. They included convolution-based layers in order to find spatially relevant contextual dependencies and LSTM cells to compute temporally relevant features. The experimental results demonstrate that such an approach is more suitable than those based solely on CNNs or LSTMs.

LSTM-based approaches achieve good performance for the human action recognition problem using skeletal data, but they have limitations in capturing the structural dependencies between the joints of the same skeleton. To reduce this limitation, we proposed in [15] a neural architecture that combines the properties of graph convolutional network layers with those of LSTM cells. In the designed architecture, we added a module based on GCN layers for extracting spatial features at the frame level, after which we passed them through a module containing 3 LSTM cells. The advantage of this network is the reduced number of parameters, but the experimental results show that the LSTM cells could not fully preserve the spatial information.

Huang et al. [16] tried to eliminate this limitation of LSTM cells by proposing a structural modification that would allow extracting information using the graph structure. They named the proposed recurrent cell Long-Short Graph Memory Network (LSGM). Their variant can learn relevant features at the temporal and spatial level, taking into account the fact that the skeletal data respects the structure of a specific graph. The pipeline proposed by their method separates the skeletal data according to the 3 coordinates X, Y and Z. They used a module based on a bidirectional LSGM unit and a temporal attention unit for each axis. Finally, in the proposed pipeline, they apply a multiplication operation for the features from these units. The outputs of the 3 branches are concatenated and passed through a spatial calibration layer. The experimental results demonstrate that such an approach is more suitable than those in which recurrent networks and GCN are applied sequentially.

2.4 Spatio-temporal-based approaches

Since not all frames in a real-time scenario have a complete and correctly identified skeleton, Xing et al. [17] proposed a method to recognize human actions by analysing sequences that can contain noise. The method is called improved spatial–temporal graph convolutional network (IST-GCN). Their neural model is based on three modules: multi-dimension adaptive graph convolutional network (Md-AGCN), enhanced attention mechanism (EAM) and multi-scale temporal convolutional network (MS-TCN). The first module adjusts the graph structure from the perspective of the connections that exist between joints at the spatial, temporal or channel level. The second module analyses the importance of each category of features, while the last one ensures that the receptive field is increased when additional information must be extracted from latent temporal dependencies.

These existing approaches represented a source of inspiration for the method proposed in this article. Starting from the advantages of each, we designed a solution capable of achieving comparable performance with lower computational effort. Moreover, we aimed to develop a method able to support its predictions with explanations deduced by analysing the computed features. In contrast to the previous studies, we include a detailed analysis of the performance obtained by the proposed model from the perspective of the best and worst recognized classes. We also present some qualitative results which highlight how the model allows an explanation to be deduced.

3 Proposed method

In this section, we present the proposed method, which consists of a data preprocessing phase followed by a spatio-temporal neural model that recognizes the action. The input is the skeletal data, starting from the 3D coordinates predicted by the Kinect sensor for each of the 25 joints highlighted in Fig. 1. Our data preprocessing method aims to remove noise and to compute relevant geometric features from the perspective of two points of interest: one focused on the spatial dimension and one focused on the temporal component. The neural model introduced in this section has multiple input branches based on TCN and GCN layers. In contrast to existing neural architectures, our model has a reduced inference time, obtains results comparable with state-of-the-art methods, and offers the possibility of determining an activation map that can be useful in the explainability process.

Fig. 1

The human skeleton in the format generated by the Kinect sensor which contains 25 joints that build 5 important parts of the human body: left arm (highlighted in blue), left leg (highlighted in magenta), trunk (highlighted in green), right arm (highlighted in orange) and right leg (highlighted in violet) (color figure online)

3.1 Data processing

An action is presented in the form of a tensor of size \(M \times T \times 25 \times 3\), where M represents the number of skeletons and T is the number of frames. In our approach, for a single-person action we use a second skeleton with all coordinates equal to 0. The considered value for T is 300, and samples containing fewer frames are zero-padded.

In this preprocessing stage, we started from the raw data that we turned into smoothed and normalized coordinates. Smoothed coordinates are obtained by applying a Gaussian filter with a \(5 \times 1\) kernel and \(\sigma = 1\).
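
A minimal sketch of this step, assuming the sample layout described above (M skeletons, T frames, 25 joints, 3 coordinates): zero-pad to \(T = 300\) frames and smooth the coordinates along the temporal axis with a Gaussian filter of \(\sigma = 1\) over a 5-frame window.

```python
# Hedged sketch of the padding and smoothing described above.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def pad_and_smooth(sample, T=300):
    """sample: (M skeletons, t frames, 25 joints, 3 coords) -> (M, T, 25, 3)."""
    M, t, V, C = sample.shape
    padded = np.zeros((M, T, V, C), dtype=sample.dtype)
    padded[:, :t] = sample
    # Smooth each coordinate over the temporal axis only; truncate=2 with sigma=1 gives a ~5-frame window.
    padded[:, :t] = gaussian_filter1d(padded[:, :t], sigma=1, axis=1, truncate=2)
    return padded

raw = np.random.randn(2, 120, 25, 3)        # a two-skeleton sample with 120 frames
processed = pad_and_smooth(raw)
print(processed.shape)                       # (2, 300, 25, 3)
```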

We obtained a reference point for each skeleton by averaging the coordinates of all the joints in the action sequence. This point serves as a centroid, offering insights into both the spatial arrangement of joints and their temporal distribution. The formula used to determine this point is as follows:

$$ref\_point = \left( \frac{1}{25 \cdot T} \sum _{j=1}^{25} \sum _{t=1}^{T} x_{j}^{t},\ \frac{1}{25 \cdot T} \sum _{j=1}^{25} \sum _{t=1}^{T} y_{j}^{t},\ \frac{1}{25 \cdot T} \sum _{j=1}^{25} \sum _{t=1}^{T} z_{j}^{t} \right)$$

where T represents the number of frames in the sequence describing the action and \((x_j^{t}, y_j^{t}, z_j^{t})\) represent the coordinates of the joint with index j from frame t. We used this reference point to determine relevant features based on geometric formulas for each branch. These new coordinates have the advantage of combining the spatial and temporal domains.
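
A short sketch of this computation, assuming a single-skeleton sequence stored as a (T, 25, 3) array:

```python
# Sketch of the reference point: the mean over all joints and all frames.
import numpy as np

def reference_point(skeleton):
    """skeleton: (T frames, 25 joints, 3 coords) -> (3,) centroid over space and time."""
    return skeleton.reshape(-1, 3).mean(axis=0)

seq = np.random.randn(300, 25, 3)
ref = reference_point(seq)                   # (x_ref, y_ref, z_ref)
```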

We considered the method proposed by Song et al. [7] which contains three branches (joint-branch, velocity-branch and bone-branch), and we added new features for each branch. In the version proposed by Song et al. [7], each branch contains 6 channels for features, and in the version proposed by us, each branch contains 9 channels for features. We determined the first six features for each branch using the formulas provided by Song et al. [7], and our contribution consists of the last three geometrical features with temporal importance.

3.1.1 Joint-branch

The features for this branch describe the coordinates of the 25 joints, containing for each joint the following data:

$$joint\_f_{j_i} = (x_{j_i}, y_{j_i}, z_{j_i},\ x_{j_i} - x_{j_c},\ y_{j_i} - y_{j_c},\ z_{j_i} - z_{j_c},\ x_{j_i} - x_{r_p},\ y_{j_i} - y_{r_p},\ z_{j_i} - z_{r_p})$$
(1)

\(\forall i \in \{1, 2, \dots , 25\}\), where \(j_c\) is the index of the central joint, considered the centre of gravity of the body, and \(r_p\) denotes the reference point. To capture joint features, we opted to start from the three coordinates provided by the Kinect sensor. We achieved spatial normalization by subtracting the coordinates of the point we consider the centre of the skeleton. Additionally, we performed normalization with respect to the reference point. Normalizing data around a central point, like the centroid, is crucial for mitigating the challenges posed by disparate magnitudes of variables. This practice aids in achieving faster convergence during training, enhances the stability of the learning process, and reduces the risk of issues like gradient saturation or explosion.
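
A hedged sketch of the nine joint-branch channels of Eq. (1); the index chosen for the central joint is an assumption about the joint ordering:

```python
# Sketch of the joint-branch features; CENTER = 1 (spine) is a hypothetical joint index.
import numpy as np

CENTER = 1

def joint_branch(skeleton, ref_point):
    """skeleton: (T, 25, 3), ref_point: (3,) -> (T, 25, 9)."""
    center = skeleton[:, CENTER:CENTER + 1, :]            # (T, 1, 3), broadcast over joints
    return np.concatenate([skeleton,                       # raw coordinates
                           skeleton - center,              # relative to the body centre
                           skeleton - ref_point],          # relative to the reference point
                          axis=-1)
```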

3.1.2 Velocity-branch

This branch contains data determined based on the difference between the coordinates from step t and those from steps \(t + 1\) and \(t + 2\), as shown in Eq. 2. Equation 3 starts from the observation of Zanfir et al. [18] according to which the second-order derivatives are estimated numerically by using a temporal window of 5 frames centred at the current one.

$$velocity\_f_{j_i}^{t} = ( x_{j_i}^{t+1} - x_{j_i}^{t},\ y_{j_i}^{t+1} - y_{j_i}^{t},\ z_{j_i}^{t+1} - z_{j_i}^{t},\ x_{j_i}^{t+2} - x_{j_i}^{t},\ y_{j_i}^{t+2} - y_{j_i}^{t},\ z_{j_i}^{t+2} - z_{j_i}^{t},$$
(2)
$$\quad x_{j_i}^{t+2} + x_{j_i}^{t-2} - 2 \cdot x_{j_i}^{t},\ y_{j_i}^{t+2} + y_{j_i}^{t-2} - 2 \cdot y_{j_i}^{t},\ z_{j_i}^{t+2} + z_{j_i}^{t-2} - 2 \cdot z_{j_i}^{t} )$$
(3)

\(\forall i \in \{1, 2, \dots , 25\}, t \in \{1, 2, \dots , T-2\}\).

Incorporating information about speed and acceleration provides a more nuanced understanding of how actions unfold over time. By considering the temporal dynamics, the neural network acquires the ability to discern the nature of the action itself and the intensity with which humans perform this action. This richer representation enhances the network’s capacity to distinguish between subtle variations and contributes to a more robust and accurate recognition of human activities.
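
A sketch of the velocity-branch channels of Eqs. (2)–(3), computed as first-order differences at offsets 1 and 2 plus the 5-frame second-order difference; leaving boundary frames at zero is an assumption of this illustration:

```python
# Hedged sketch of the velocity-branch features.
import numpy as np

def velocity_branch(skeleton):
    """skeleton: (T, 25, 3) -> (T, 25, 9); frames without enough neighbours stay zero."""
    T = skeleton.shape[0]
    feat = np.zeros((T, 25, 9), dtype=skeleton.dtype)
    feat[:-1, :, 0:3] = skeleton[1:] - skeleton[:-1]                        # x^{t+1} - x^t
    feat[:-2, :, 3:6] = skeleton[2:] - skeleton[:-2]                        # x^{t+2} - x^t
    feat[2:-2, :, 6:9] = skeleton[4:] + skeleton[:-4] - 2 * skeleton[2:-2]  # x^{t+2} + x^{t-2} - 2x^t
    return feat
```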

3.1.3 Bone-branch

The bone-branch includes features that describe the bone lengths and the angle values with respect to the X, Y and Z axes.

$$bone\_f_{(j_u, j_v)} = ( x_{j_u} - x_{j_v},\ y_{j_u} - y_{j_v},\ z_{j_u} - z_{j_v},\ a_{(j_u, j_v), x},\ a_{(j_u, j_v), y},\ a_{(j_u, j_v), z},\ bone\_length,\ bone\_length_{1},\ bone\_length_{2} )$$

where joints \(j_u\) and \(j_v\) are adjacent, \(l_{(j_u, j_v), x} = x_{j_u} - x_{j_v}, l_{(j_u, j_v), y} = y_{j_u} - y_{j_v}, l_{(j_u, j_v), z} = z_{j_u} - z_{j_v}\) and

$$a_{(j_u, j_v), x} = \arccos \left( \frac{l_{(j_u, j_v), x}}{\sqrt{l_{(j_u, j_v), x}^{2} + l_{(j_u, j_v), y}^{2} + l_{(j_u, j_v), z}^{2}}} \right), \quad a_{(j_u, j_v), y} = \arccos \left( \frac{l_{(j_u, j_v), y}}{\sqrt{l_{(j_u, j_v), x}^{2} + l_{(j_u, j_v), y}^{2} + l_{(j_u, j_v), z}^{2}}} \right)$$
(4)
$$a_{(j_u, j_v), z} = \arccos \left( \frac{l_{(j_u, j_v), z}}{\sqrt{l_{(j_u, j_v), x}^{2} + l_{(j_u, j_v), y}^{2} + l_{(j_u, j_v), z}^{2}}} \right)$$
(5)
$$bone\_length = \sqrt{l_{(j_u, j_v), x}^{2} + l_{(j_u, j_v), y}^{2} + l_{(j_u, j_v), z}^{2}}$$
(6)
$$dist_{j_u,x} = x_{j_u} - x_{r_p}, \quad dist_{j_u,y} = y_{j_u} - y_{r_p}, \quad dist_{j_u,z} = z_{j_u} - z_{r_p}$$
(7)
$$bone\_length_1 = \sqrt{dist_{j_u,x}^{2} + dist_{j_u,y}^{2} + dist_{j_u,z}^{2}}$$
(8)
$$dist_{j_v,x} = x_{j_v} - x_{r_p}, \quad dist_{j_v,y} = y_{j_v} - y_{r_p}, \quad dist_{j_v,z} = z_{j_v} - z_{r_p}$$
(9)
$$bone\_length_2 = \sqrt{dist_{j_v,x}^{2} + dist_{j_v,y}^{2} + dist_{j_v,z}^{2}}$$
(10)

\(bone\_length_1\) and \(bone\_length_2\) represent the distances between the two ends of a bone and the previously introduced reference point. Incorporating information about the length of bone segments and the angles they form with the axes is motivated by the desire to encapsulate the structural aspects of human motion within the neural network. These features offer valuable insights into the anatomical proportions and joint articulation during various actions. By considering bone lengths, the network understands the scale and relative proportions of body segments, enabling it to discern between actions performed with different body configurations. Simultaneously, incorporating the angles formed by bones with the axes provides crucial information about joint flexion, extension, and overall posture. This anatomical perspective enhances the network’s ability to recognize actions based on the skeletal structure, contributing to a more comprehensive and nuanced analysis of human activities.
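
A sketch of the bone-branch computation for a list of (\(j_u\), \(j_v\)) bone pairs; the bone list itself and the small epsilon added for numerical stability are assumptions of this illustration:

```python
# Hedged sketch of the bone-branch features of Eqs. (4)-(10).
import numpy as np

def bone_branch(skeleton, ref_point, bones):
    """skeleton: (T, 25, 3), bones: list of (j_u, j_v) pairs -> (T, len(bones), 9)."""
    u = skeleton[:, [b[0] for b in bones]]                          # (T, B, 3)
    v = skeleton[:, [b[1] for b in bones]]
    l = u - v                                                       # per-axis coordinate differences
    bone_length = np.linalg.norm(l, axis=-1, keepdims=True) + 1e-8  # Eq. (6)
    angles = np.arccos(np.clip(l / bone_length, -1.0, 1.0))         # Eqs. (4)-(5): angle with each axis
    bone_length_1 = np.linalg.norm(u - ref_point, axis=-1, keepdims=True)  # Eq. (8)
    bone_length_2 = np.linalg.norm(v - ref_point, axis=-1, keepdims=True)  # Eq. (10)
    return np.concatenate([l, angles, bone_length, bone_length_1, bone_length_2], axis=-1)
```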

3.2 Spatio-temporal model

The general architecture for the proposed approach is presented in Fig. 2. We preprocessed the 3D coordinates predicted by the Kinect sensor for joints, based on operations described in Sect. 3.1. To normalize and extend the features resulting from the preprocessing process, we used 4 ResGCN layers for each branch. We concatenate the data resulting from the application of these layers and use the resulting tensor as input for a spatio-temporal module. The proposed architecture returns two types of results: the features obtained after the application of the spatio-temporal module and the final prediction. The features generated by the neural model serve as critical descriptors of the input data, capturing relevant patterns and representations. These features are subsequently employed in the computation of activation maps, which highlight the areas of significance within the input space. Activation maps play a pivotal role in rendering interpretability to the neural network, as they provide insights into the regions that contribute most substantially to the model’s decision-making process. This interpretability aspect is crucial for enhancing transparency and trust in the model, allowing users and practitioners to comprehend the neural network’s reasoning behind specific outcomes and facilitating the identification of influential input features.

Fig. 2

The structure proposed for the neural model used to solve the human action recognition problem. In the preprocessing stage, we normalized the data and applied the proposed formulas to determine the spatio-temporal features for each branch. We independently applied a series of 4 ResGCN modules for each branch and concatenated the obtained data. We used the features obtained after passing the data through the spatio-temporal modules to determine the importance of each joint

The neural model calculates its final prediction by applying the following operations on the feature tensor returned by the spatio-temporal module:

  1. In the first step, we apply a pooling operation that reduces the temporal domain, represented by the number of frames, and the spatial domain, represented by the number of joints.

  2. We apply the mean operation to fuse the features calculated for the two skeletons. Regardless of the type of action, we represent each sample as a sequence with two skeletons; if the action contains a single person, the values for the second skeleton are 0. We chose this operation because we wanted one with the commutative property, which is important because the Kinect sensor does not maintain the order of the predicted skeletons from one frame to another.

  3. Finally, we added two Linear layers that perform the actual classification.

Figure 3 shows the structure of the spatio-temporal module. We used multiple GCN–TCN units for this module, and Fig. 4 describes the structure of such a unit. The spatio-temporal module receives a batch of features, extracted from preprocessed samples, of size \(96 \times 300 \times 25\), where 96 represents the number of channels, 300 the total number of frames, and 25 the total number of joints in each frame. This module analyses these features from a temporal and spatial perspective to determine relationships that can help distinguish specific actions. After applying the layers of this module, the result is a tensor of size \(256 \times 38 \times 25\). We chose to return this tensor as one of the outputs of the proposed neural model, because we use it to explain the final prediction. For this reason, we designed the module to preserve the spatial dimension; specifically, we ensure that the number of joints remains the same in the result. The layers that have a stride equal to 2 reduce the temporal dimension.
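
A minimal sketch (PyTorch) of the prediction head described above: pooling over frames and joints, commutative fusion of the two skeletons, and two linear layers. The channel count and the 38-frame, 25-joint shape follow the text; the hidden size and class count are illustrative assumptions.

```python
# Hedged sketch of the classification head applied to the spatio-temporal features.
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, channels=256, hidden=128, num_classes=120):
        super().__init__()
        self.fc1 = nn.Linear(channels, hidden)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, feat):
        # feat: (batch, M=2 skeletons, 256 channels, 38 frames, 25 joints), the tensor
        # returned by the spatio-temporal module and also exposed for the activation maps.
        pooled = feat.mean(dim=(3, 4))                  # 1. pool over time and joints -> (batch, 2, 256)
        fused = pooled.mean(dim=1)                      # 2. commutative fusion of the two skeletons
        return self.fc2(torch.relu(self.fc1(fused)))    # 3. two linear layers -> class scores

feat = torch.randn(4, 2, 256, 38, 25)
logits = ClassificationHead()(feat)                     # (4, 120)
```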

Fig. 3

The structure of the spatio-temporal module included in the general architecture presented in Fig. 2. Layers highlighted in blue use strides with the value equal to one and preserve both dimensions (spatial and temporal). For each layer we specified the number of input channels and the number of output channels (color figure online)

For the design of the spatio-temporal module, we used GCN–TCN type units. The architecture of such a block is shown in Fig. 4. These blocks were proposed by Chen et al. in [9].

Fig. 4

Architecture used for GCN–TCN type units. \({\textbf{A}}\) represents the matrix that describes the graph. The residual layer is applied only if \(in\_channel \ne out\_channel\). This architecture was proposed by Chen et al. in [9]

4 Experimental results and discussion

4.1 Results on NTU RGB+D dataset

To analyse the performance of the proposed approach, we used NTU RGB+D [1, 2], a comprehensive dataset with actions performed by one or two people. This dataset has two versions: the first version contains 60 classes [1] and the second one extends the first with another 60 classes [2]. The dataset provides samples in several modalities, but we used only the skeletal representation generated by the Kinect sensor. The authors recorded all samples indoors, and 106 volunteers aged between 10 and 57 helped to collect this dataset. The actions included in this dataset can be divided into 3 main categories: daily actions, mutual actions, and health-related actions. The NTU RGB+D dataset was collected using 3 cameras placed at different horizontal angles: \(-\,45^{\circ }\), \(0^{\circ }\) and \(+\,45^{\circ }\). Each volunteer performed each action twice: once towards the left camera and once towards the right camera.

This dataset has become one of the relevant benchmarks for the HAR problem based on skeletal data. For model evaluation, the NTU RGB+D dataset provides four test protocols: two for the 60-class version and two for the 120-class version. In this paper, we evaluated the proposed model on all four protocols. For the cross-subject protocol, the recorded videos are divided into two subsets: one used for training and one used for testing. For the cross-view protocol, the authors propose to divide the samples according to the camera used for collection. In the extended version with 120 classes, the authors kept an updated version of the cross-subject protocol, but replaced the cross-view protocol with a cross-setup protocol. This protocol splits the 32 setups used in collecting the entire dataset into training and testing subsets, keeping half for each part.

Regarding the implementation, for each experiment we trained the proposed model for 50 epochs. We employed the SGD optimizer with Nesterov momentum to estimate the weights of the neural network, with the following parameter values: \(learning\_rate = 0.1\), \(momentum = 0.9\), \(weight\_decay = 0.0002\). For large datasets, processing the entire dataset in each iteration can be computationally expensive and memory-intensive; SGD addresses this by using only a mini-batch of samples for each iteration, making it computationally more efficient. The stochastic nature of SGD, with its random sampling of mini-batches, introduces a certain level of noise in the parameter updates. This noise can act as a form of regularization, helping to prevent overfitting and improving the generalization performance of our model. To update the learning rate, we utilized a cosine scheduler with \(max\_epoch = 50\) and \(warm\_up = 5\). We used a batch of 32 samples for training and a batch of 64 samples for testing. We performed all the reported experiments using two Tesla P100 PCIe GPUs. We initialized the weights of all convolutional layers in the proposed model according to the Kaiming normal distribution [19].
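
The reported configuration roughly corresponds to the following sketch (PyTorch); the stand-in model, the omitted 5-epoch warm-up and the training-loop body are placeholders, and only the hyper-parameter values come from the text.

```python
# Hedged sketch of the training configuration described above.
import torch

model = torch.nn.Conv2d(9, 64, kernel_size=1)      # stand-in for the proposed network

for m in model.modules():                          # Kaiming-normal initialization for conv layers
    if isinstance(m, (torch.nn.Conv1d, torch.nn.Conv2d)):
        torch.nn.init.kaiming_normal_(m.weight)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            weight_decay=0.0002, nesterov=True)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

for epoch in range(50):
    # ... one pass over the training set with batches of 32 samples ...
    optimizer.step()
    scheduler.step()
```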

Table 1 Comparative results from the perspective of accuracy for the 4 test protocols, but also performance related to efficiency (processing speed and number of parameters)

Table 1 shows the results obtained by our method, compared to those of other existing approaches, in terms of accuracy for all four protocols and in terms of inference speed. The inference speed is represented as the number of sequences processed per (second × GPU). For all four test protocols, our proposed approach achieves results similar to existing ones. The best results are obtained for the cross-view protocol, for which the data used for training and for testing are rather similar. For approaches that do not have public implementations provided by the authors, we could not determine some values; these entries are marked with '–' in Table 1. For our approach, the inference speed was computed by averaging the speeds obtained for all four protocols. The metric employed is the one proposed by Song et al. [7], and for the remaining methods, the reported inference speeds are sourced from [7].

In terms of inference speed, there are methods that achieve better performance: SGN [32], ResGCN-TCN [13] and ResGCN-LSTM [13]. SGN is a model with a small number of parameters, which brings a limitation in terms of accuracy for a dataset with double the number of classes. Thus, the results achieved by SGN for NTU v2 are weaker than those of our method for both test protocols. ResGCN-LSTM has a better inference speed because recurrent networks can use pack and unpack operations that avoid unnecessary calculations for batches containing samples of different sizes; in our case, we apply a padding operation to align the tensors in a batch because the actions have variable duration. We highlighted in previous work [13] that models based on TCNs obtain better accuracy than those based on LSTMs for the skeleton-based variant of the HAR problem. This claim remains valid here as well, since the performance of the proposed spatio-temporal neural network remains better. The approach based on ResGCN-TCN is faster than the current method because it uses only 6 channels for each branch instead of 9.

In the literature, there are approaches with better results than our solution, such as EfficientGCN-B4 [37] and MS-G3D Net [36]. However, our method tries to reduce the disadvantages that these methods present. For example, our model has a higher inference speed than MS-G3D Net [36], which makes it more suitable for integration in a real-time scenario. We also wanted our solution to be non-black-box and to have a good generalization capacity for new data. We highlighted these aspects by presenting some qualitative results from the explainability point of view and by evaluating the trained model on another dataset.

For a better interpretation of the results obtained by our method for the NTU RGB+D dataset, we chose to analyse them separately for each protocol in terms of the accuracy obtained for every class. Thus, we analysed the results of our model from the perspective of the following metrics: Top-1, Top-2, Top-3, Top-5 and Top-10.

These metrics can be defined as follows:

  • Top-1—we check if the class for which the model predicted the best score is the correct class or not.

  • Top-X—we determine the first X action classes from the perspective of the predicted score of the model and check if any of them is the correct class or not (\(\forall X, X \in \{2, 3, 5, 10\}\)).
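
The Top-X metrics defined above can be computed from a matrix of class scores as in the following sketch:

```python
# Hedged sketch of the Top-X accuracy computation; the random data is only a placeholder.
import numpy as np

def top_k_accuracy(scores, labels, k):
    """scores: (num_samples, num_classes), labels: (num_samples,) -> fraction of hits."""
    top_k = np.argsort(-scores, axis=1)[:, :k]            # k best-scoring classes per sample
    return np.mean([labels[i] in top_k[i] for i in range(len(labels))])

scores = np.random.rand(100, 60)
labels = np.random.randint(0, 60, size=100)
for k in (1, 2, 3, 5, 10):
    print(k, top_k_accuracy(scores, labels, k))
```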

Table 2 demonstrates a considerable difference between the performances of the model obtained from the perspective of the Top 1 and Top 2 metrics. This discrepancy appears for all the test protocols analysed. These results certify that even if the model cannot predict the highest score for the correct class, it provides a similar score for the first 2 classes, of which one is correct. Such situations often appear in the case of very similar actions, and we highlighted this through an extensive analysis in Sect. 4.2.

Table 2 The performances obtained by the proposed model for the NTU RGB+D dataset from the perspective of Top-1, Top-2, Top-3, Top-5 and Top-10 accuracy

4.2 Discussion

In this section, we extend our analysis of action recognition accuracy for both versions of NTU RGB+D dataset (60 and 120 classes), for different cases of actions and recognition results, and for the different protocols. Thus, for each protocol, we performed an analysis of the best-predicted classes and of those for which the proposed model obtains the lowest performance. From this analysis, it appears that our model encounters difficulties in correctly identifying actions only when they are very similar from the perspective of skeletal movement.

The proposed approach, in addition to the advantages outlined in this article, comes with a set of limitations. Subsequent analysis reveals that a significant constraint of this method is its tendency to misclassify similar actions, a limitation closely tied to the absence of information about objects with which individuals interact or other contextual aspects in our approach. It is crucial to note that this limitation is not unique to our methodology but is a recurrent challenge in various solutions proposed in the literature. In our approach, this limitation represents a trade-off, as we aimed to achieve a solution with a reduced number of parameters and high inference speed. Introducing additional data, such as information about objects or context, was not feasible, as it would have required processing the entire video stream, compromising the desired low-parameter count and high inference speed. Another limitation is that, in its current form, the approach can only be applied to actions performed by a maximum of two individuals. We separately analyse the skeletal data for each person and then compute an average of the obtained features for each individual. This average is subsequently used to determine the final prediction.

4.2.1 Cross-subject—NTU RGB+D (60 classes)

First, we determined the actions for which the model manages to correctly identify the test samples. Figure 5 includes the top 30 actions, sorted in descending order by accuracy. For the actions jump up and walking towards each other, the model correctly predicts the class for all test samples. Actions performed by two people have an index in the interval [50, 60]. Eight of these actions appear in the ranking included in Fig. 5, having an accuracy of over \(93\%\). The other 3 actions performed by 2 people are: point finger at the other person, pat on back of other person and punching/slapping other person. Even for these actions, for which the skeleton movement is extremely similar, the model obtains an accuracy of over \(91\%\). These results demonstrate that the proposed model correctly detects the actions performed by two people. Also, our model achieves good performance for the actions for which the order of the frames is relevant: standing up/sitting down and take off a hat-cap/put on a hat-cap.

Fig. 5

Plot showing the actions for which our model correctly predicts most samples in the case of cross-subject protocol

Similarly, in Fig. 6, we present the actions for which the model correctly identifies the fewest samples. This plot includes 30 actions sorted in ascending order by the number of samples correctly identified by the model. As seen from the plot in Fig. 6, the proposed approach got the weakest results for actions that are very easy to confuse: writing vs reading, playing with a phone vs typing on a keyboard.

In Fig. 7, we present the confusion matrix calculated for the cross-subject v1 protocol. The black colour corresponds to the case where the network has not classified any samples for that particular class. The lightest colour corresponds to the maximum value in the matrix. This visualization facilitates the easier identification of patterns in incorrectly classified examples, aiding in a nuanced interpretation of the model’s performance evaluation. It enables a more comprehensive understanding of how the model handles different classes, revealing insights into potential areas of improvement. This confusion matrix shows that the proposed model confuses the reading action (11) with writing (12) and vice versa. A similar problem also arises for playing with a phone (29) and typing on a keyboard (30). Another pair of actions, for which we notice confusion, comprises pointing to something with a finger (31) and taking a selfie (32). For the last pair, it is relevant to note that the number of samples correctly classified for action 32 is higher. We can also see that, for each class, the number of correctly classified samples is the majority.

Fig. 6

Plot showing the actions for which the proposed model obtains the lowest performance in the case of cross-subject protocol

Fig. 7

The confusion matrix obtained for the evaluation of the proposed model for the Cross-Subject test protocol (v1)

4.2.2 Cross-view—NTU RGB+D (60 classes)

We made the same plots for the cross-view protocol in Figs. 8, 9, and 10. The plot in Fig. 8, with the actions best recognized by the proposed model, shows that the differences between them are much smaller for this protocol; the difference between the 1st-placed action and the 30th-placed one is only 11 samples. The plot in Fig. 9, which shows the actions for which we obtained the weakest performance, proves that, this time, there is no action for which the number of correctly classified samples is below 200. Just like in the previous protocol, we obtained weak scores for the actions writing, reading, typing on a keyboard and playing with a phone. From the confusion matrix shown in Fig. 10, we can see that the model confuses the following pairs of actions: reading (11) and writing (12), wear a shoe (16) and take off a shoe (17), and playing with a phone (29) and typing on a keyboard (30).

Fig. 8

Plot showing the actions for which the proposed model correctly predicts most samples in the case of cross-view test protocol (v1)

Fig. 9

Plot showing the actions for which the proposed model obtains the lowest performance in the case of cross-view test protocol (v1)

Fig. 10

The confusion matrix obtained for the evaluation of the proposed model for the cross-view test protocol (v1)

Even if the model obtained its best score for this protocol, the results show that it still confuses samples belonging to very similar classes. It is important to note that, for the dataset variant with 60 classes, the model encounters difficulties for actions performed by a single person and not for those performed by two people. Even for this protocol, the proposed model achieves an accuracy of over \(92\%\) for each action performed by two people.

4.2.3 Cross-subject—NTU RGB+D (120 classes)

Liu et al. [2] proposed this protocol for the extended version of the dataset containing 120 classes, for which they collected samples using 106 different human subjects. They divided the samples into two subsets, considering the subjects who perform the action. We use samples from 53 subjects for training and samples from the other 53 subjects for testing. We present the results obtained by the proposed model using this protocol in Figs. 11, 12, and 13. Because the number of samples for each action in the first variant of the dataset is no longer equal to the number in the second variant, the plots in Figs. 11 and 12 present the accuracy for each action instead of the number of correctly classified samples. Thus, we can see that there are two actions for which the accuracy is less than \(50\%\): staple book and make ok sign. The confusion matrix in Fig. 13 highlights that, for the extended version of the dataset, there are no clear pairs of actions that are confused.

Fig. 11

Plot showing the actions for which the proposed model correctly predicts most samples in the case of cross-subject test protocol (v2)

Fig. 12

Plot showing the actions for which the proposed model obtains the lowest performance in the case of cross-subject test protocol (v2)

Fig. 13

The confusion matrix obtained for the evaluation of the proposed model for the cross-subject test protocol (v2)

4.2.4 Cross-setup—NTU RGB+D (120 classes)

For the cross-setup protocol, the authors propose to divide the data into two subsets according to the setup ID. Thus, we used the data from 16 setup IDs for training and the data from the remaining setups for testing. As in the case of the previous test protocol, we made the plots in Figs. 14 and 15 from the perspective of the accuracy obtained for each class. We note that we again obtained the weakest results for the writing action, which is also the only action for which the model obtains an accuracy of less than \(50\%\) for this protocol.

The confusion matrix obtained for the model trained according to this protocol is shown in Fig. 16. This time, there are no clear pairs of confused actions. Most problems occur with the classes introduced in the second half of the dataset, which can also be seen in the plot from Fig. 15: of the first 10 actions with poor performance, only 2 (writing and reading) belong to the first half of the dataset.

Fig. 14

Plot showing the actions for which the proposed model correctly predicts most samples in the case of Cross-Setup test protocol (v2)

For this protocol, the proposed model achieves an accuracy of over \(94\%\) for the best predicted 30 actions. The structure designed for the neural network allows the correct identification of similar actions both from a spatial perspective (arm swings, \(98\%\), and arm circles, \(97\%\)) and a temporal perspective (standing up, \(99\%\), and sitting down, \(97\%\)).

Fig. 15

Plot showing the actions for which the proposed model obtains the lowest performance in the case of cross-setup test protocol (v2)

Fig. 16

The confusion matrix obtained for the evaluation of the proposed model for the cross-setup test protocol (v2)

The results presented in this section highlight the capabilities of the proposed model and justify its limitations. The quantitative results obtained by our neural network demonstrate that it has performances comparable to the rest of the state-of-the-art. Furthermore, these results are obtained in a shorter inference time and require less computational effort for the training process. Since it only processes skeletal data, our approach presents limitations from the perspective of similar actions for which the context is relevant. The results achieved for all test protocols validate that our model can capture both temporal and spatial dependencies. Moreover, the model achieves high performance for actions with two people. This can be motivated by the fact that we designed the neural model to analyse the data for each skeleton separately and only in the last stage the obtained features are merged to achieve the final classification.

In Figs. 6, 9, 12, and 15, we include the actions for which our model correctly identified the fewest samples. We note that there are pairs of similar actions such as "reading" and "writing", "make victory sign" and "make ok sign", "playing with a phone" and "typing on a keyboard". In a previous work [38], we showed that it is also complicated for a human to correctly distinguish these actions when only the data describing the skeleton is available. The results included in Table 2 validate this claim: for all protocols, there is a considerable difference between Top-1 and Top-2 accuracy.

4.3 Results on PKU-MMD dataset

PKU-MMD [39] is a challenging dataset for the HAR problem focused on long continuous sequences of actions and contains multi-modal representations. The dataset was collected using the Kinect v2 sensor and contains two phases: the first phase targets the large-margin action detection task, and the second phase targets action detection tasks with increasing difficulty. To highlight the effectiveness of the proposed method, we started from the first phase of this dataset, which we processed to evaluate our approach.

Phase 1 consists of 1076 video sequences collected from three different camera angles. 66 volunteers participated in the collection of these sequences, and the samples include 51 different action categories. We performed a mapping between the existing actions in the PKU-MMD dataset and those in the NTU RGB+D and obtained a set of 48 actions. Also, we split each sequence into several subsequences, one for each class. After this processing we obtained a dataset with 19,662 samples and 48 classes similar in format to NTU RGB+D.

We started from the models trained on the NTU RGB+D dataset and evaluated them on the 19,662 samples of the obtained dataset. To demonstrate the usefulness of the proposed geometric formulas, we considered two scenarios: one in which we trained a model using only the first 6 features for each branch, and a second one in which we used all 9 features. Table 3 contains the results we obtained for these two scenarios. These results demonstrate that the proposed spatio-temporal features help the neural model reach a higher degree of generalization, achieving better performance on new data. There are no appreciable differences between the results obtained on the NTU RGB+D dataset, but the geometric formulas proposed in this approach help the model achieve better accuracy for data collected completely independently of those used for training. For instance, when utilizing only six features, excluding those related to the considered reference point and the second-order derivative approximation, the model trained for 50 epochs achieves a score of 89.01 for the cross-subject protocol. However, by incorporating the features calculated using our introduced formulas, the score increases to 89.53. Although the difference between the two scores may not be substantial, the second version proves significantly more robust, demonstrating superior generalization capabilities. This is underscored by the notably improved score of 54.76 for the PKU-MMD dataset, compared to 43.21 in the first scenario.

Table 3 The performances obtained by the proposed model for the NTU RGB+D dataset and PKU-MMD dataset

5 Explaining the network prediction

Most state-of-the-art solutions currently proposed for the HAR problem are black-box, based on deep neural networks that predict a result (a recognized action) from millions of parameters that are quite incomprehensible to a human. Therefore, it is very difficult to understand why the network recognized a certain action and, most importantly, why it failed to predict the correct result. As shown in Sect. 4.2, actions that are very similar resist correct classification, not only in our approach but in all approaches existing in the literature. An explanation as to why this happens can bring valuable information both for those who develop applications based on HAR and for future advances in solving this problem. As opposed to methods that explain network predictions for images, there are quite few proposals trying to explain HAR, and most are limited to 3DCNN models. Only a recent paper [7] proposes a visualization of skeletons, which tries to discover the most essential body parts over a whole action sequence, in an attempt to obtain more explainable representations for different action sequences. We consider that explainability of HAR based on a skeleton model is a very promising and sound path to understanding network predictions, and our proposed model was developed starting also from this premise.

To obtain an explanation for the recognized action, we used the features resulting from the application of the spatio-temporal module and the weights of the last two linear layers. Starting from these, we determined, for each frame, the joints considered most important by the network in terms of activations and pictured them in an activation map. We also considered the importance that the network attaches to each frame. Similarly, when two skeletons appear, we checked the importance of each one.
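
A hedged sketch of how such a joint-level activation map can be derived from the returned feature tensor and the weights of the two linear layers, in the spirit of class activation mapping; this is one possible CAM-style projection, not necessarily the exact computation used in our implementation (for instance, the intermediate non-linearity is ignored here).

```python
# Hypothetical CAM-style joint-importance computation from the exposed feature tensor.
import torch

def joint_importance(feat, fc1, fc2, target_class):
    """feat: (256 channels, 38 frames, 25 joints) for one skeleton -> (38, 25) importance map."""
    # Project the class weights back through the two linear layers onto the feature channels.
    channel_weights = fc2.weight[target_class] @ fc1.weight         # (256,)
    cam = torch.einsum('c,ctv->tv', channel_weights, feat)          # weight channels per frame/joint
    cam = torch.relu(cam)                                           # keep positive evidence only
    return cam / (cam.max() + 1e-8)                                 # normalize to [0, 1]

fc1 = torch.nn.Linear(256, 128)
fc2 = torch.nn.Linear(128, 120)
feat = torch.relu(torch.randn(256, 38, 25))
heatmap = joint_importance(feat, fc1, fc2, target_class=10)         # per-frame joint importance
```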

Fig. 17

Sample from the test subset for the brushing teeth action. The network correctly predicts this action with a \(100\%\) probability

In order to highlight some qualitative results obtained for samples from the test set, using the protocol for which the network obtained the lowest score (Cross-Subject v2), we performed the testing on 3 actions: brushing teeth (Fig. 17), drink water (Fig. 18) and eat meal or snack (Fig. 19).

We present in Fig. 17 some frames from a sample of the brushing teeth action. We selected these frames using a step of 10 frames. For these frames, we highlight the human skeleton and choose the colour of each joint according to the importance generated by the neural model. We coloured the joints considered unimportant by the model in blue and, for the rest, we used shades of red according to the intensity generated by the model. The network correctly identifies the key moment of the action; the joint intensities are highlighted in images 4, 5, 6, 7 and 8.

Fig. 18

Sample from the test subset for the drink water action. The network correctly predicts this action with a \(100\%\) probability. We sampled the frames with a step of 5 frames. In these samples, a fake skeleton incorrectly predicted by the Kinect sensor appears. The proposed model completely ignores this skeleton

We include in Fig. 18 an example which highlights an error generated by the Kinect sensor. For each frame, two skeletons appear, even if a single person performs the action. The sensor confused the chair in the image with a skeleton, for which it predicts some distorted data. In our approach, we provide the data from the two skeletons as input for the network. The network correctly identifies the skeleton of interest, completely ignoring the false one. This example highlights the robustness of our model and motivates the correctness of the prediction. In all the examples, we can distinguish a difference between the colour intensity of the joints of the hand that performs the action and that of the other joints. This is especially visible in images 6 and 7. We also notice that, in the last frames, the importance associated with the joints decreases because the coordinates do not change considerably. It is worth noting that, for this sample, the model makes a correct prediction with a confidence of \(100\%\).

Fig. 19

Sample from the test subset for the eat meal or snack action. The network correctly predicts this action with a \(99.7\%\) probability. We sampled the frames with a step of 5 frames

In the last selected qualitative example, we present a repetitive action. As seen from the frames included in Fig. 19, the network captures this aspect. These images show the action eat meal or snack. The model correctly classifies this action with very high confidence (\(99.7\%\)) and identifies the frames where the person starts eating. This time, being a repetitive action, the temporal dimension is the one to which the model pays more attention. Therefore, a discrepancy between the importance associated with different joints within the same frame can only be distinguished in the last images.

Fig. 20

Sample from the test subset for the pickup action. The network correctly predicts this action with a \(99.96\%\) probability. We sampled the frames with a step of 5 frames

When a person bends down to pick up an object from the ground, the captured posture involves overlapping segments, making it challenging for the Kinect sensor to precisely predict a skeleton. Nevertheless, as depicted in Fig. 20, our proposed approach still manages to accurately identify the action.

Fig. 21

Sample from the test subset for the sitting down action. The network correctly predicts this action with a \(99.96\%\) probability. We sampled the frames with a step of 5 frames

Fig. 22

Sample from the test subset for the standing up action. The network correctly predicts this action with a \(99.92\%\) probability. We sampled the frames with a step of 5 frames

Distinguishing between the actions of sitting down and standing up can be challenging for a neural network due to the inherent similarity in the initial and final postures of both actions. Predicting an accurate skeleton is complicated by the subtle transitional movements and overlapping poses during these actions. Therefore, it is difficult to identify specific patterns for each of these two types of actions. As can be seen in Figs. 21 and  22, the neural model proposed by us manages to discern nuanced variations in body motion for accurate classification. The results obtained from the perspective of the activation map demonstrate that for both samples, the neural network successfully identifies the key moment of the action.

6 Conclusions

In this paper, we proposed a methodology for human action recognition that consists of a preprocessing stage, in which geometric features and data normalization are used to achieve better performance, followed by a spatio-temporal neural network architecture that combines TCN and GCN layers to capture both the spatial and the temporal dimension of the action. We showed that our proposed model is able to obtain accuracy results similar to state-of-the-art ones and has a lower inference time, a robust behaviour in the case of skeletons incorrectly identified by the sensors, and the capacity to explain the recognized action (or the incorrectly identified one) by highlighting the most important joints considered by the network in terms of activations and the importance that the network attaches to each frame.

We performed a thorough analysis of the network behaviour on the 2 versions of NTU RGB+D (60 and 120 actions) for all the 4 protocols proposed in the literature, covering both correctly and incorrectly recognized actions, and we linked this analysis with the explanation capability of the model. We also highlighted the fact that the classification errors are generated by very similar actions (some not even discernible by a human).

Based on the features provided by the model and the weights from the last two linear type layers, we can generate statistics that present the most important joints considered by the model and the most relevant frames in the performed action. Thus, we performed the analysis and visualization of the reasons behind the predictions for actions performed by one or two people, for single or repetitive actions, showing how the network gives an importance to the spatial dimension and/or the temporal dimension.

We have two directions of research for future work. The first consists of improving the final part of the neural model that deals with classification by proposing an improved architectural model, possibly based on a multi-stage architecture. The second direction is an attempt to improve the understanding of the neural model by introducing additional data that provide more context about the environment in which the subject performs the action. We consider that adding context information to the process can improve both the performance of the architecture and the explanation capabilities of HAR models. Many HAR approaches based on skeleton data encounter difficulties in discerning actions that share kinematic similarities, particularly when contextual cues or information about interacted objects are not explicitly incorporated. Addressing this limitation in future research endeavours may involve exploring additional modalities or refining feature representations to enhance the discriminative power of the models, aligning with the broader efforts within the field to overcome this common obstacle.