Introduction

Gestures are a form of non-verbal communication prominently used in day-to-day interaction. Therefore, they can play a fundamental part in human-robot interaction (HRI). Gestures are categorized in the literature as static and dynamic [1]. Static gestures portray meaning through hand postures. They can substitute words or be used together with words in the form of signs or emblems. Such gestures can be recognized by precisely interpreting the emphasized hand shape and spelled-out finger arrangements [2]. In contrast, a dynamic gesture has a temporal aspect articulated through hand movement. Therefore, recognizing it requires different techniques, e.g., segmenting and tracking the moving body limb (we refer to a good overview of gesture recognition techniques by Anwar et al. [3]).

However, such categorization of gesture types might be oversimplified. More specifically, static information is essential for recognizing dynamic gestures with similar movement paths. For example, the gesture commands “stop” and “go forward” have an identical motion with the arm extending forward. Understanding these two commands requires observing their unique hand shape and finger arrangements (open palm vs. extended finger). Furthermore, a precise interpretation of the distinctive characteristics of each hand gesture is desired for a smooth HRI experience. This is also vital in robot applications with safety concerns, e.g., medical or industrial applications. Confusion between gestures in such environments might have severe safety consequences.

This precise interpretation is challenging for approaches that rely solely on RGB data. However, RGB-based methods are beneficial [4] because of their convenience and potential compatibility with low-resource systems, such as robots [5]. Also, they facilitate the reproduction of results, especially as reproducibility issues related to deep learning are getting more attention from the scientific community [6]. Despite the recent development triggered by the deep learning trend using networks like 3DCNN, ResNet, and Inception V3, dynamic gesture recognition is still a challenging task.

Among the factors that challenge vision-based approaches are indistinctive and subtle movements [7]. Subtle movements refer to slight movements of the hand and fingers at the gesture's peak with no clear arm movement. Indistinctive movements mean that multiple gestures follow a very similar path of motion. One limitation prominent in various approaches is the reliance on the motion path only (the interested reader can find a good overview in [8]). Consequently, some techniques lack the consideration of hand details, which leads to misclassifications between gestures with similar motion properties. Therefore, it is worth inspecting whether integrating hand details into the classification system would refine the performance of such models.

In this study, we propose a modular RGB-based approach called Snapture. Our architecture is an extension of the CNNLSTM [9] network, which is robust at learning motion patterns but limited at capturing hand details. By integrating hand information in a hybrid architecture, our model aims to improve the performance of the CNNLSTM. Since each dataset imposes different challenges due to the unique gesture vocabulary, we evaluate our approach on multiple domains: robot commands and co-speech gestures. This study is organized as follows: we present our literature review on recent gesture recognition systems. Then, we describe the datasets and our proposed Snapture framework, including its various components. Next, we discuss the experiments carried out in this study. After comparing the performance of our model to a CNNLSTM baseline, we discuss our results and conclude with potential directions for future research.

Related Work

Recent work in dynamic gesture recognition uses various pre-processing techniques for motion representation, which tend to lose the hand details. One such technique, called star RGB, was proposed by dos Santos et al. [10]. Each gesture sequence was divided into three parts corresponding to the pre-stroke, stroke, and post-stroke stages as defined by Kendon [11]. The algorithm generated a motion representation for each part, which were then merged using the frame’s color channels. The data was fed into a feature extraction model using pre-trained ResNet50 and ResNet101 networks. The features were weighted using a soft-attention mechanism, while the final classification was accomplished using a two-layered feedforward network. Similar to our work, the authors evaluated their approach in both the robot command and co-speech gesture domains. The system achieved an accuracy of \(\sim\)0.98 and \(\sim\)0.95 on the GRIT [12] and Montalbano [13] datasets, respectively. However, the architecture is overly complex, especially considering that the GRIT dataset is limited to 543 samples. Despite its complexity, the system encountered confusion between multiple gestures with the same motion since it did not consider the hand shape. Therefore, the results confirm that the problems of indistinctive and subtle gestures are not trivial. The authors hypothesized that these issues could be addressed by integrating hand information into the system.

The stated hypothesis is supported by the work of Wu et al. [14]. The authors demonstrated increased performance concerning gestures with a similar motion by fusing RGB, depth, and skeleton data. The proposed approach, called Deep Dynamic Neural Network (DDNN), consisted of three networks corresponding to each modality. RGB and depth information was fed into a 3DCNN, while skeleton data was passed through a Deep Belief Network (DBN). A Hidden Markov Model (HMM) was responsible for the temporal modeling of gestures using a set of defined states. Each observation was classified by calculating the most probable path using the Viterbi algorithm. The authors reported a score of 0.816 on the Montalbano [13] dataset using the Jaccard index. When considering RGB and depth modalities, the authors showcased less confusion regarding implicit Montalbano movements. Thus, the results hint at the importance of integrating RGB data to preserve the hand pose information. However, the model was computationally intensive and required long training times of 5 days. Thus, its robotic applications might be limited.

Mazhar et al. [15] utilized RGB data when integrating static and dynamic recognition in their system. The authors proposed a framework called StaDNet consisting of two Inception V3 CNNs, each extracting the spatial features of one of the two hands. The authors stated that by cropping the CNN input to the hand and removing the background, the framework could learn subtle movements. The temporal learning was carried out using an LSTM network. The framework scored an accuracy of 0.8675 and 0.989 on the Chalearn 2016 [16] and OpenSign [17] datasets, respectively. However, the method still relied on other modalities besides RGB for training, such as a 2D body skeleton and Kinect depth estimators. Another requirement of this model is two training datasets: one static and one dynamic. This requirement implies that the architecture learns each gesture vocabulary separately and thus does not seamlessly classify a gesture based on its dynamic and static characteristics. Therefore, achieving such integration between static and dynamic recognition remains an open question.

A recent study proposed a system using a transformer-based architecture [18]. The study employed a self-attention mechanism for the sequence modeling of data streams from multiple sensors. A ResNet18 network was responsible for frame-level feature extraction. Temporal modeling was implemented using a transformer module of six encoders, while the sequences were classified using a fully connected neural network. The system achieved an accuracy of 0.876 and 0.962 on the NVGestures [19] and Briareo [20] datasets, respectively. However, the system did not perform as well with RGB data alone: the scores dropped to 0.765 and 0.906 when only the RGB modality was used. The authors experimented with different combinations of modalities using a late fusion technique. The best results were reported using a fusion of depth, surface normals, infrared, and RGB. Thus, the system’s applications are limited to constrained environments with sensors placed close to the operator’s hand. Furthermore, the system encountered confusion between symmetric gestures, which illustrates the challenge of modeling dynamic gesture sequences.

Aditya et al. [21] used attention for sequence modeling in a multi-feature setup to classify continuous sign language. The architecture contained spatial and temporal components. The spatial module performed feature extraction on two channels: RGB and key points encoding the body and hand poses. The temporal part consisted mainly of convolution and pooling layers. Sequence modeling was supported by a self-attention mechanism, helping the model avoid blurry frames in the input. However, a subsequent bidirectional LSTM with connectionist temporal classification (CTC) was needed to interpret the signs and align predicted words into sentences. Trying a variety of attention configurations, the authors found the best performance when attention followed temporal pooling. The system achieved a word error rate (WER) of 0.7 and 21.5 on the CSL [22,23,24] and RWTH-PHOENIX [25] datasets, respectively. However, the error rate increased to 31.2 on the RWTH-PHOENIX dataset when pooling was dropped. Thus, the approach only performed well in capturing short-term dependencies and fell short of modeling longer sequences.

Another approach for dealing with noisy frames was suggested by Cao et al. [26]. The architecture used a self-attention and transformer encoder module for input representation, and a feedforward network was used for classification. However, they implemented an additional temporal sampling method to identify the meaningful frames. Using a sliding window technique, they performed a gesture detection step with a single-shot detector to identify whether the hand was present. Then, frames were sampled using step size and length values that were found empirically. On the NVGestures [19] and EgoGesture [27] datasets, the approach achieved accuracies of 0.807 and 0.926, respectively. The sampling technique improved the performance to 0.832 and 0.938 on the NVGestures and EgoGesture datasets, respectively. However, these datasets contain a mixture of emblems and control commands. Such sampling based on hand detection would be limited for co-speech gestures, for which the hand would need to be continuously present in all frames.

Similarly, Chen et al. [28] proposed a dynamic gesture recognition system based on an attention mechanism. Two networks, R2plus1D and ConvLSTM, extracted the long-term and short-term temporal features of gesture sequences. The relevance of the extracted features for a given sequence was learned by an RPCNet network. The approach accommodated contextual information across channels by weighting the contribution of its temporal channels. Multiple experiments were conducted: using a channel fusion component, sequential channel and spatial components, and parallel channel and spatial components. The channel fusion as the first module produced the best accuracy (0.9353) on the EgoGesture [27] dataset. Using average and max pooling of the channel and spatial components further boosted the score to 0.9393. However, the accuracy boost was not significant compared to a base R2plus1D model (0.9276) that did not use attention. The authors also reported an accuracy of 0.997 and 0.693 on the SKIG [29] and IsoGD [16] datasets, respectively. However, they did not provide a comparison to the R2plus1D model on these two datasets. Therefore, the influence of attention on the overall architecture is not fully revealed.

Tsironi et al. [9] proposed a recurrent network for the motion learning of robot commands. A step of hand segmentation was done using a pre-processing algorithm called the differential image. The processed data passed through a CNNLSTM architecture, which performed the hand’s implicit feature extraction and motion tracking. Gestures with similar movements were challenging to the system due to the loss of hand details. In particular, the system confused gestures like “hello” and “no.” Since the motion similarity was limited to a small subset of gesture classes, it did not overly influence the model’s performance. The framework still achieved an accuracy of \(\sim\)0.92 using the GRIT [12] dataset. However, this dataset is small-scale (543 samples), as previously mentioned. It also consists of only robot commands designed with distinct arm movements. Thus, the performance of such an approach remains unknown on natural gestures with less intense arm movement and more focus on the hand, such as co-speech gestures.

Proposed Model

As mentioned in our literature review, using multiple benchmarks can provide an insightful assessment of the system’s performance. Therefore, we evaluate our architecture on multiple gesture domains: robot commands and co-speech gestures. In this section, we present the two datasets used to evaluate our framework. We also introduce our proposed Snapture architecture for hybrid gesture recognition and show our method for motion profile analysis of gesture sequences.

The GRIT Robot Commands Dataset

In the context of robot commands, the “Gesture commands for Robot InTeraction” (GRIT) dataset [12] is one of the few publicly available dynamic gesture datasets. The corpus contains 543 isolated gestures distributed over nine gesture classes and recorded with six subjects. Each gesture has a distinct arm movement, with the exception of the classes “hello” and “no,” whose movements are indistinctive. The dataset was collected under lab-controlled settings with a plain white background and no surrounding noise.

The Montalbano V1 Co-Speech Dataset

The Montalbano dataset contains co-speech gestures and is publicly available. It was collected with about 50 participants as part of the ChaLearn Looking at People challenge [13]. It contains around 14,000 Italian gestures spread over 20 gesture classes. Each recorded video contains a subject in various indoor environments with noisy backgrounds. The Montalbano dataset provides multiple sensory modalities; however, we only use RGB data due to the advantages of vision-based systems, such as reproducibility and portability. Since the gestures are continuous with little to no pause, we convert them into isolated gestures by identifying the start and end of each movement. We make the annotations created for isolating the sequences and the source code of the experiments presented in our work publicly available.

Snapture—Hybrid Gesture Recognition

Our architecture consists of two main components: a dynamic channel for capturing the gesture’s movement and a static channel for the hand pose. A classifier is trained using the combined output of the channels. We analyze the motion profiles of the gesture sequences, which are utilized when extracting the hand pose. Our approach can be extended with an optional component for controlling the static channel to address the issue of blur in the frames. We present the details of the mentioned components in the following subsections. A simplified overview of our so-called SNAPshot capTURE (Snapture) architecture is shown in Fig. 1.

Fig. 1 An overview of the Snapture framework. The architecture consists of dynamic and static channels, fused into a final classifier. Thus, it performs a hybrid hand gesture recognition task

Motion Profile Analysis

Our approach for tackling the issues of indistinctive and subtle movements relies on fusing the hand motion and pose. We extract motion features by exploiting the temporal information across consecutive frames. The hand pose is most informative during the stroke phase of co-speech gestures, as described by Kendon [11]. However, the temporal relationship between frames and the stroke phase in the studied datasets is unclear. Therefore, we analyze the gesture sequences in terms of motion and pause. Due to the lack of established approaches for such gesture analysis, we utilize the structural similarity index measure (SSIM) [30] as a metric for the similarity between consecutive frames. The SSIM calculation is shown in Eq. (1).

$$\begin{aligned} \begin{gathered} SSIM(x, y) = \frac{(2\mu _{x}\mu _{y} + C_{1})(2\sigma _{xy} + C_{2})}{(\mu _{x}^2 + \mu _{y}^2 + C_{1})(\sigma _{x}^2 + \sigma _{y}^2 + C_{2})}\,, \\ \sigma _{xy} = \frac{1}{N-1}\sum _{i=1}^{N}(x_{i} - \mu _{x})(y_{i} - \mu _{y})\,, \\ C_{1} = (K_{1}L)^2\,, \\ C_{2} = (K_{2}L)^2\,, \end{gathered} \end{aligned}$$
(1)

where x and y are spatially local windows of the input frames. \(\mu _{x}\) and \(\mu _{y}\) represent the mean intensity of x and y, respectively. Similarly, \(\sigma _{x}\) and \(\sigma _{y}\) denote the standard deviation. N is the number of pixels, while \(x_{i}\) and \(y_{i}\) represent the pixel values at index i of x and y, respectively. \(C_{1}\) and \(C_{2}\) are stability constants to avoid a division by zero. \(K_{1}\) and \(K_{2}\) are positive values much smaller than 1, while L is the pixel range (255 for 8-bit grayscale frames).

Using the first frame as a reference, we can quantify the amount of motion and pause across the gesture time span. By inverting the equation, we can express change across frames. We express that in Eq. (2) and refer to it as the Inverted SSIM (ISSIM).

$$\begin{aligned} ISSIM = 1 - SSIM (I_{i}, I_{0})\,, \end{aligned}$$
(2)

where \(I_{i}\) and \(I_{0}\) denote the grayscale frames at time steps i and 0, respectively.
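For illustration, the following Python sketch computes the ISSIM motion profile of a gesture sequence using the SSIM implementation of scikit-image. It is a minimal sketch assuming grayscale frames of equal size; the function and variable names are illustrative, not part of our released code.

```python
# ISSIM motion profile (Eq. 2): per-frame change relative to the first frame.
import numpy as np
from skimage.metrics import structural_similarity


def issim_profile(frames):
    """Return the ISSIM value of every grayscale frame w.r.t. the first frame."""
    reference = frames[0]
    profile = []
    for frame in frames:
        ssim = structural_similarity(reference, frame, data_range=255)
        profile.append(1.0 - ssim)  # invert similarity into amount of change
    return np.asarray(profile)
```

Plotting this profile over the frame index yields motion curves of the kind shown in Figs. 2 and 3.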

We observe two variations of movements in the GRIT dataset based on our analysis. Paused gestures include a pronounced period of pause around the gesture peak. For example, “turn left” (cf. Fig. 2a) lacks motion around the peak since participants hold their hand briefly still. In contrast, in gestures such as “turn” (cf. Fig. 2b), subjects continuously repeat a circular pattern. We refer to these movements as repeating pattern gestures. These unique characteristics of motion and pause influence the design of our approach, as will be discussed later. In contrast to the GRIT dataset, the Montalbano gestures follow Kendon’s model [11] of gesticulation and concurrent speech: the intensity of the movement starts and ends gradually, with a clear peak in between, e.g., the gesture “ok” (cf. Fig. 3).

Fig. 2 The motion profile of the GRIT gestures “turn left” (a) and “turn” (b). “Turn left” is paused at the peak, while “turn” shows a repeating pattern due to the continuous motion intensity across its time span

Fig. 3 The motion profile of the Montalbano gesture “ok.” It starts and ends with low intensity and has a clear peak around the midpoint of the timeline

Dynamic Channel

The dynamic channel of Snapture is a CNNLSTM [9] implementation using PyTorch. The network consists of a two-layer stacked convolutional neural network (CNN) followed by a long short-term memory (LSTM) network (cf. Fig. 4). The input to the network represents segmented gestures. The segmentation is done using the differential image algorithm. In Eq. (3), we show the algorithm’s calculation as described by Tsironi et al. [9]. This algorithm operates on three consecutive frames, subtracting each pair of adjacent frames. The moving hand is then extracted by applying a bitwise AND operator to the two subtraction results.

$$\begin{aligned} \Delta _{i} = (I_{i} - I_{i-1}) \wedge (I_{i+1} - I_{i})\,, \end{aligned}$$
(3)

where \(\Delta _{i}\) is the segmented gesture input frame at time step i. \(I_{i-1}\), \(I_{i}\), and \(I_{i+1}\) denote the grayscale frames at time steps \(i-1\), i, and \(i+1\), respectively. \(\wedge\) is the bitwise AND operator.
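A minimal OpenCV sketch of this segmentation step is given below. It uses absolute differences followed by binarization as a common variant of Eq. (3); the binarization threshold (25) is an assumption, and the names are illustrative rather than the original implementation.

```python
# Differential image (Eq. 3): segment the moving hand from three consecutive frames.
import cv2


def differential_image(prev_frame, cur_frame, next_frame):
    """prev_frame, cur_frame, next_frame: consecutive grayscale (uint8) frames."""
    diff_a = cv2.absdiff(cur_frame, prev_frame)            # I_i - I_{i-1}
    diff_b = cv2.absdiff(next_frame, cur_frame)            # I_{i+1} - I_i
    _, mask_a = cv2.threshold(diff_a, 25, 255, cv2.THRESH_BINARY)  # assumed threshold
    _, mask_b = cv2.threshold(diff_b, 25, 255, cv2.THRESH_BINARY)
    return cv2.bitwise_and(mask_a, mask_b)                 # bitwise AND of both differences
```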

Fig. 4 The dynamic channel of Snapture is a CNNLSTM network consisting of two layers of CNN followed by an LSTM and a feedforward network. The input is pre-segmented using the differential image algorithm. For clarity, we show only five frames and increase the contrast of the differential images

The stacked convolution layers have five and ten kernels of size 11\(\times\)11 and 6\(\times\)6, respectively. These are the same kernel sizes and numbers of filters as in the original CNNLSTM model [9]. Each layer has a 1\(\times\)1 stride, zero-padded input, and a hyperbolic tangent (Tanh) activation function. A max-pooling layer of size 2\(\times\)2 follows each convolution layer, and batch normalization is used after each convolution to reduce internal covariate shift [31] and speed up training. We initialize the CNN’s weights with values from a uniform distribution [32]. The output of the last convolution layer is flattened and propagated through the LSTM.

Since the input represents isolated gestures, each mini-batch has all the information needed for the network to produce a classification. Therefore, we opt to use a stateless LSTM. We configure the LSTM’s cell state to produce a sequence-level classification for each gesture. Thus, our model requires no additional post-processing steps and fits the concept of capturing a snapshot more intuitively. The numbers of LSTM layers and neurons are selected using grid search. The optimal number of layers is 2 (out of 1, 2, 4, and 8). The optimal number of neurons differs between the datasets: we choose 64 neurons for GRIT and 512 for Montalbano. We initialize the LSTM with weights from a uniform distribution and zero bias. After passing through dropout [33], the output of the LSTM is fused through concatenation with the static channel’s output (explained in the next subsection). The combined outputs are propagated into a two-layered feedforward network followed by softmax, producing a probability distribution over the gesture classes.
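As a concrete illustration, the following PyTorch sketch outlines the dynamic channel and the fusion classifier described above. It is a minimal reconstruction rather than our released implementation: the padding values, the ordering of pooling and batch normalization, the dropout rate, the feedforward hidden size and activation, and the dimensionality of the static feature vector are assumptions, and the class names are illustrative.

```python
import torch
import torch.nn as nn


class DynamicChannel(nn.Module):
    """CNNLSTM over differential images of size 48x64 (height x width)."""

    def __init__(self, lstm_hidden=64, lstm_layers=2, dropout=0.5):
        super().__init__()
        # Two stacked convolution blocks: 5 and 10 kernels of size 11x11 and 6x6
        self.features = nn.Sequential(
            nn.Conv2d(1, 5, kernel_size=11, stride=1, padding=5), nn.Tanh(),
            nn.MaxPool2d(2), nn.BatchNorm2d(5),
            nn.Conv2d(5, 10, kernel_size=6, stride=1, padding=3), nn.Tanh(),
            nn.MaxPool2d(2), nn.BatchNorm2d(10),
        )
        # Flattened feature size for 48x64 inputs with the assumed padding: 10*12*16
        self.lstm = nn.LSTM(input_size=10 * 12 * 16, hidden_size=lstm_hidden,
                            num_layers=lstm_layers, batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, clip):                       # clip: (batch, time, 1, 48, 64)
        b, t = clip.shape[:2]
        feats = self.features(clip.flatten(0, 1))  # fold time into the batch axis
        feats = feats.flatten(1).view(b, t, -1)
        _, (h_n, _) = self.lstm(feats)
        return self.dropout(h_n[-1])               # sequence-level representation


class SnaptureClassifier(nn.Module):
    """Two-layered feedforward head over the concatenated channel outputs."""

    def __init__(self, static_dim, motion_dim=64, hidden=128, n_classes=9):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(motion_dim + static_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, motion_feat, static_feat):
        fused = torch.cat([motion_feat, static_feat], dim=1)  # channel fusion
        return torch.softmax(self.head(fused), dim=1)         # class probabilities
```

For training, the probabilities would typically be converted to log-probabilities and paired with a negative log-likelihood objective.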

Static Channel

This channel is responsible for capturing the specific hand shape and finger arrangements through a so-called snapshot at the gesture’s peak. We detect and extract the gesture at the peak corresponding to the stroke phase. This provides hand pose information, which we fuse with the dynamic channel. As a result, our method integrates the characteristics of static and dynamic recognition systems.

Gesture Peak Detection

According to Kendon’s model [11] of the relationship between gestures and concurrent speech, human gestures are described by five phases (cf. Fig. 5). Gestures start with a rest phase, representing a neutral position of the arms. In the pre-stroke or preparation phase, a gradual increase in the motion intensity of one or both arms starts to unfold. Next is the stroke phase, in which the static gesture properties, i.e., hand shape and finger configurations, completely unfold. These characteristics start to fade away in the post-stroke or retraction phase as the intensity of motion gradually decreases. The gesture ends again with a rest phase. The Montalbano gestures have a clear peak in the frames around the midpoint of the gesture sequence. In paused gestures, similar time steps are occupied by a pronounced pause (cf. “Motion Profile Analysis” section). Therefore, we define the peak as the frame in the middle of the gesture sequence.

Fig. 5 The five gesture phases according to Kendon [11]. Each gesture starts with a rest phase. In the pre-stroke phase, the limb moves from the rest position towards the stroke. The stroke phase contains the most expressive information. In the post-stroke phase, the limb moves away from the stroke back into the rest phase

Gesture Peak Extraction

We follow a skin detection technique to extract the hand from the frame. Our implementation uses Python and OpenCV. First, the face is detected and removed because the input contains a full-body image, and skin detection treats all visible skin equally. Next, the hand is segmented by converting the frame into the orthogonal color space YCbCr [34]: Y represents the luminance, while Cb and Cr encode the chromaticity. This is done to avoid the high correlation between luminance, hue, and saturation in RGB [35]. Since various lighting conditions highly influence skin tones, we apply the threshold on chrominance only. We use the thresholds Cb=[80, 120] and Cr=[133, 173] proposed by Basilio et al. [36]. According to the authors, these threshold values are independent of skin tone. An additional step of background removal is applied to the Montalbano data using simple background subtraction due to its complex surroundings, which GRIT does not have. Next, we apply connected component analysis, which describes the YCbCr mask in terms of blobs. These objects are then sorted by size and position. Due to the noisy background in the Montalbano dataset, we filter out objects that do not belong to the foreground identified in the background removal step.

We pick the highest object in the frame to avoid assumptions about the subject’s dominant hand. As we observe in the data, the hand performing the gesture is always in an upper position, while the other hand is usually at rest or only slightly raised. For gestures requiring two hands, both hands always make the same pose; therefore, our algorithm is free to pick either hand in this case. A step of hand smoothing is applied using erosion and dilation morphological transformations; however, omitting this step does not influence the algorithm’s output. Finally, an area around the detected hand is extracted from the original frame and resized to 64\(\times\)48 pixels, matching the CNN input. The configuration of the CNN in the static channel is similar to that of the dynamic channel (cf. “Dynamic Channel” section). The hand features learned through this network are flattened and concatenated with the dynamic channel’s output before feeding into the two-layered feedforward network. The gesture peak extraction module is depicted in Fig. 6.

Fig. 6 The gesture peak extraction module of the Snapture approach. Using a skin detection technique, the hand shape and finger configurations are extracted from a target frame at the gesture’s peak. Background removal is only applied to the Montalbano gestures (dotted line). The extracted hand is passed through a CNN and a feedforward network
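The following OpenCV sketch illustrates this extraction pipeline for a single sequence. It is a simplified sketch under several assumptions: the Haar cascade face detector, the blob area threshold, and the crop margin are illustrative choices not specified above, and the additional background subtraction step used for Montalbano is omitted.

```python
# Snapshot extraction at the gesture peak via skin detection (thresholds from [36]).
import cv2
import numpy as np

FACE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")


def extract_hand_snapshot(frames, margin=20):
    """Crop the gesturing hand from the middle (peak) frame of a BGR sequence."""
    peak = frames[len(frames) // 2].copy()          # peak = frame in the middle

    # 1) Remove the face so that skin detection only responds to the hands
    gray = cv2.cvtColor(peak, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in FACE_CASCADE.detectMultiScale(gray, 1.1, 5):
        peak[y:y + h, x:x + w] = 0

    # 2) Skin mask on chrominance only: Cr in [133, 173], Cb in [80, 120]
    ycrcb = cv2.cvtColor(peak, cv2.COLOR_BGR2YCrCb)
    mask = cv2.inRange(ycrcb, (0, 133, 80), (255, 173, 120))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))

    # 3) Connected components: keep the topmost sufficiently large blob
    n, _, stats, centroids = cv2.connectedComponentsWithStats(mask)
    blobs = [(centroids[i][1], i) for i in range(1, n)
             if stats[i, cv2.CC_STAT_AREA] > 100]   # assumed minimum blob area
    if not blobs:
        return None
    _, idx = min(blobs)                             # smallest y = highest in frame
    x, y = stats[idx, cv2.CC_STAT_LEFT], stats[idx, cv2.CC_STAT_TOP]
    w, h = stats[idx, cv2.CC_STAT_WIDTH], stats[idx, cv2.CC_STAT_HEIGHT]

    # 4) Crop an area around the hand and resize to the CNN input size (64x48)
    crop = peak[max(0, y - margin):y + h + margin, max(0, x - margin):x + w + margin]
    return cv2.resize(cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY), (64, 48))
```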

Static Channel Control

One challenge when using RGB data is capturing hand details during rapid hand movements. This is caused by factors such as lower camera resolution and exposure time, and it leads to a blurry hand in the frame (cf. Fig. 7). This is most pronounced for repeating pattern robot commands due to their intense movement. This phenomenon is a challenge to any vision-based approach because of the missing information in the input frames. However, we address it by regulating the static channel based on the amount of motion contained in a gesture. More precisely, we integrate the extracted static information only if the amount of motion lies below a threshold, i.e., if the stroke phase contains a pause sufficient for the snapshot extraction.

Fig. 7 Repeating pattern gestures, e.g., “circle” (a), contain a blur at the peak compared to paused movements, e.g., “stop” (b). The blurry hand at the gesture’s peak for highly dynamic movements is challenging for RGB-based approaches. We bypass this issue by regulating the static channel of our approach

We use the SSIM-based method (cf. “Motion Profile Analysis” section) as a quantitative metric for the amount of motion and pause. We split each gesture into three parts: (1) the first part represents all the frames in the rest and pre-stroke phases, (2) the second part contains the frames between the pre-stroke and post-stroke phases, i.e., the stroke, and (3) the third part consists of all the frames from post-stroke to the final rest phase. We assume the three parts to be of equal length for simplicity. The three parts and our defined threshold are visualized for the GRIT (cf. Fig. 8) and Montalbano (cf. Fig. 9) datasets. The average amount of motion in part 2 is less than in parts 1 and 3, which supports our choice of Kendon’s stroke phase [11] as the gesture’s peak. It is also noticeable that most samples of paused gestures, such as “stop,” “turn left,” and “turn right,” lie well below the threshold due to their pronounced period of pause. In contrast, the intensity of motion is high for repeating pattern gestures, e.g., “circle” (cf. Fig. 8), which is consistent with our definition of repeating pattern gestures (cf. “Motion Profile Analysis” section). On the other hand, most Montalbano gesture classes contain a pause that facilitates capturing a snapshot.

Fig. 8 Our motion analysis of the GRIT dataset after splitting gestures into three parts: rest to pre-stroke phases, pre-stroke to post-stroke phases, and post-stroke to rest phases. The second part contains more pause and facilitates capturing a snapshot. The black line denotes our defined threshold for regulating the static channel

Fig. 9 Our motion analysis of the Montalbano dataset after splitting gestures into three parts: rest to pre-stroke phases, pre-stroke to post-stroke phases, and post-stroke to rest phases. Similar to GRIT, the second part contains more pause that facilitates capturing a snapshot. Our defined threshold for regulating the static channel is denoted by the black line

Experimental Procedure

In this section, we present the experiments carried out in this study. In each experiment, we evaluate and compare the following: (1) a CNNLSTM acting as a baseline for comparison, (2) our Snapture architecture, which predicts a class by integrating the hand shape and motion, and (3) Snapture with the threshold-controlled mechanism for regulating the static channel based on the sufficiency of pause to capture a snapshot. We will refer to this model as Snapturethold. The purpose is to evaluate the influence of subtle and indistinctive gestures on the performance of each of the models in two gesture domains, as motivated earlier.

Experimental Settings

The training parameters of each experiment are selected using grid search and are listed in the following subsections. We run each of the models under similar conditions. The hardware specifications used for training and testing are as follows: (1) Ubuntu 18.04.5 LTS operating system; (2) Intel Core i7-4930K 3.40 GHz with six cores; (3) 8 GB of RAM; (4) NVIDIA GeForce GTX 1080 graphics card with 8 GB of memory. The performance of each model is evaluated using accuracy, F1-score, and training time metrics. We report the average performance of each model over five trials. In each trial, we repeat the steps of training and testing. We analyze the classification behavior using the confusion matrices.
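For reference, the sketch below shows how such an evaluation over repeated trials could be assembled with scikit-learn; the macro averaging of the F1-score and the averaging of confusion matrices are assumptions about the exact protocol.

```python
# Aggregate accuracy, F1-score, and confusion matrices over repeated trials.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score


def evaluate_trials(trials):
    """trials: list of (y_true, y_pred) label arrays, one pair per trial."""
    accs = [accuracy_score(y, p) for y, p in trials]
    f1s = [f1_score(y, p, average="macro") for y, p in trials]
    cms = [confusion_matrix(y, p) for y, p in trials]
    return {
        "accuracy": (np.mean(accs), np.std(accs)),   # mean and spread over trials
        "f1": (np.mean(f1s), np.std(f1s)),
        "confusion_avg": np.mean(cms, axis=0),       # "average case" confusion matrix
    }
```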

GRIT Experiment

The search space and optimal hyperparameters for the experiment on the GRIT dataset are listed in Table 1. We choose the parameters using grid search and cross-validation while leaving out 30% of the data for testing. In this experiment, the optimal values are identical across the three models, which we explain by their similarity in architecture and training procedure. We use the same data split ratio for each trained model to conduct a fair comparison. To avoid class imbalances between the splits, we use stratified sampling with respect to class labels.

Table 1 The search space and optimal hyperparameter values (in bold) of each model in the GRIT experiment

Montalbano Experiment

Similar to the previous experiment, we report the hyperparameters in Table 2. Due to the considerable number of class labels, the search space is extended compared to the GRIT experiment. We choose the parameters using grid search and cross-validation while leaving out unseen data for testing. The dataset is part of the ChaLearn Looking at People challenge and is already split into training and test sets, each containing unique subjects. However, since we focus on comparing the classification behavior of the different models rather than taking part in the challenge, we implement our own split with data from all participants, which avoids confounding the comparison with subject variability. Our split consists of 70% and 30% of randomly selected data for training and testing, respectively. Stratified sampling is utilized for an approximately uniform distribution of class labels across the sets.

Table 2 The search space and optimal hyperparameter values (in bold) of each model in the Montalbano experiment
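A minimal sketch of this stratified 70/30 split, applicable to both experiments, is given below; the random seed and the function name are illustrative.

```python
# Stratified train/test split with an approximately uniform class distribution.
from sklearn.model_selection import train_test_split


def stratified_split(samples, labels, test_size=0.3, seed=0):
    """Hold out `test_size` of the data while preserving the class proportions."""
    return train_test_split(samples, labels, test_size=test_size,
                            stratify=labels, random_state=seed)
```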

Results

In this section, we present the results of the experiments acquired using the GRIT and Montalbano datasets and under the experimental settings described earlier.

Results of the GRIT Experiment

The experiment results are summarized in Table 3. Our Snapture approach achieves slightly superior results compared to the CNNLSTM in terms of accuracy and F1-score. The scores across the Snapture and Snapturethold variations are similar, and all three models show only slight deviation across the five trials. We explain the marginal accuracy boost by three factors. First, GRIT robot commands have unique movement paths; therefore, the CNNLSTM model is already sufficient due to its motion-learning capabilities. Second, due to the repeating pattern gestures, most GRIT movements do not have a sufficient pause for capturing a snapshot based on our threshold definition. Combined with the small dataset size, our model may not have seen enough training data to learn the unique characteristics of hand shapes. Third, only approximately 44% of GRIT samples include a motion at the peak beneath the defined threshold. Therefore, Snapturethold acts like a plain CNNLSTM in the remaining 56% of the cases and cannot contribute a noticeable accuracy increase.

Table 3 The results of the GRIT experiment under the described settings. The reported metrics represent the mean of five trials, while the values in parentheses correspond to the standard deviation. The superior accuracy and F1-score values are in bold

However, we analyze the results further through the confusion matrix of the average case (cf. Fig. 10), which is calculated using the average of predicted labels over all trials. The most confusion in the CNNLSTM model occurs between “hello” and “no,” “hello” and “stop,” “no” and “stop,” and “stop” and “abort.” These movements have a similar motion profile but differ in hand shape. Thus, this supports that indistinctive movements negatively influence the performance of the CNNLSTM. On the other hand, the confusion between these classes is less pronounced in Snapture (cf. Fig. 11) due to the additional hand pose information. However, the misclassification of “hello” samples as “no” still negatively impacts the performance of Snapture. We observe that some participants perform “hello” and “no” rapidly, resulting in a blur effect and noisy input to the network. Therefore, Snapturethold improves the situation (cf. Fig. 12) by excluding the snapshot when the frame is too blurry for interpreting the hand details. On the other hand, repeating pattern movements, e.g., “circle,” yield comparable F1-score values across the three architectures (cf. Fig. 13) due to their distinctive movement. However, the number of false positives and false negatives associated with “circle” drops noticeably in Snapturethold, further emphasizing that the static channel is indeed counterproductive for such movements.

On a different note, the confusion between the classes “no” and “stop” is less pronounced in Snapture and Snapturethold compared to the CNNLSTM. Despite the dissimilarity between the two classes, some subjects tend to perform “no” with a slight left-and-right hand movement around the wrist, making it very similar to “stop” in terms of arm movement (raised and directed towards the camera). The CNNLSTM struggles with this sort of implicit hand movement due to the loss of hand details. Therefore, our approach improves performance through the static channel.

Fig. 10 The confusion matrix of the average case for the CNNLSTM on the GRIT dataset. The confusion is pronounced between the classes “hello” and “no,” “hello” and “stop,” “no” and “stop”

Fig. 11 The confusion matrix of the average case for Snapture on the GRIT dataset. The confusion is less pronounced between the classes “hello” and “no,” “hello” and “stop,” “no” and “stop.” However, the performance is still negatively influenced by the false classification of some “hello” samples as “no”

Fig. 12 The confusion matrix of the average case for Snapturethold on the GRIT dataset. Less confusion can be observed concerning class “circle,” which confirms that the static channel should be disabled for such repeating pattern movements

Fig. 13 A comparison of per-class F1-score values between the different approaches on the GRIT dataset. Snapture increases the score for classes “hello” and “no,” while the performance across the remaining classes is comparable

Results of the Montalbano Experiment

Our Snapture approach scores superior accuracy and F1-score compared to the CNNLSTM, and Snapturethold improves the results even further (cf. Table 4). However, we observe a noticeable increase in training time for Snapturethold, which we explain by the additional check needed for each sample against the defined threshold. Approximately 70% of the Montalbano data contain a sufficient pause for a snapshot; thus, this experiment gives more insight into the performance of the Snapturethold approach. Looking at per-class performance, Snapture achieves superior per-class F1-scores compared to the CNNLSTM except for “basta” (both models achieve an identical score). Nonetheless, we report a boost in F1-score on all classes with Snapturethold.

Table 4 The results of the Montalbano experiment under the described settings. The reported metrics represent the mean of five trials, while the values in parentheses correspond to the standard deviation. The superior accuracy and F1-score values are in bold

Indistinctive Movements

In the CNNLSTM, multiple observations of class “vattene” are misclassified as “vieniqui,” “perfetto,” or “tantotempo” (cf. Fig. 14). We explain that by the similarity in hand motion. Compared to the CNNLSTM, an additional \(\sim\)19 and \(\sim\)32 samples on average are correctly classified by Snapture and Snapturethold, respectively (cf. Figs. 15 and 16). Consequently, we observe F1-score improvements in the respective classes (cf. Fig. 17). For classes “vieniqui,” “freganiente,” “ok,” “noncenepiu,” and “buonissimo,” the CNNLSTM achieves poor F1-score values (below 0.6). Most of the confusion of class “ok” is tied to false positives/negatives with one of the said classes, which we explain by the similarity in their motion. However, the total number of misclassified “ok” samples drops in Snapture and Snapturethold by approximately 30; therefore, we observe an increase in the F1-score. Snapture and Snapturethold also enhance the F1-score of class “seipazzo”: an additional average of \(\sim\)23 and \(\sim\)25 samples are correctly classified due to less confusion with “buonissimo.”

Fig. 14 The confusion matrix of the average case for CNNLSTM on the Montalbano dataset. The confusion between gesture classes with indistinctive movement is pronounced, e.g., “vattene,” “vieniqui,” “perfetto,” and “tantotempo”

Fig. 15 The confusion matrix of the average case for Snapture on the Montalbano dataset. The confusion concerning gesture classes with indistinctive and implicit movements, e.g., “vattene,” “noncenepiu,” and “ok,” is less pronounced than for the CNNLSTM

Fig. 16 The confusion matrix of the average case for Snapturethold on the Montalbano dataset. The confusion concerning gesture classes “vattene,” “furbo,” and “buonissimo” is less pronounced than for the CNNLSTM

Fig. 17 A comparison of per-class F1-score values between the different approaches on the Montalbano dataset. Snapture improves the score for all classes except “basta.” The performance of explicit arm movements, e.g., “basta” and “cheduepalle,” is comparable across the three models

On a different note, classes that share motion and hand shape are challenging for our approach. For example, classes “vattene,” “vieniqui,” and “tantotempo” use a similar open palm at the peak (cf. Fig. 18). Therefore, the confusion between such classes is still noticeable in Snapture and Snapturethold despite the hand shape information.

Fig. 18 Our snapshot extraction uses a single frame at the peak. Thus, gestures that share a similar hand pose during the stroke phase present a challenging scenario for our approach

Implicit Movements

Besides the motion similarity, some gestures include a delicate hand movement at the peak. For example, “sonostufo” includes a subtle hand movement against the chest. Similarly, “noncenepiu” and “buonissimo” include a rotational motion of the extended index finger and thumb around the wrist. Due to the pre-processing, these hand details are lost and consequently not picked up by the CNNLSTM. However, the confusion related to these classes is noticeably less in Snapture and Snapturethold (cf. Figs. 15 and 16). On the other hand, the performance regarding class “buonissimo” is only slightly improved in Snapture and Snapturethold. We explain that by observing that “buonissimo” and “furbo” are similar in motion and hand shape, i.e., an extended index finger; the difference lies in the position where the finger touches the face (under the eyes vs. on the cheek). Efficiently recognizing these gestures requires additional modalities, which our study does not consider; we return to this point in the discussion. Moreover, since the snapshot is captured using one frame at the gesture’s peak, it is subject to influence by the corresponding hand orientation and light reflection. Thus, it becomes more challenging to distinguish between an open palm and an extended index finger, especially since the input is in grayscale (cf. Fig. 19).

Fig. 19 Challenges concerning class “buonissimo”: (a) similarity in hand motion and pose with “furbo,” which would require another modality not considered by our approach; (b) similarity in hand orientation and light reflection causes misclassifications with “freganiente” and “cosatifarei.” It becomes challenging to interpret the open palm under these conditions

Explicit Movements

Five Montalbano gestures are two-handed, and we observe two types of movements in this category based on how the arms are extended. “Chevuoi” and “combinato” are performed with symmetric hand movements in which both arms move from rest to make a distinct shape at chest level. Due to the motion similarity, the CNNLSTM falls short in F1-score, most noticeably for “chevuoi,” while Snapture and Snapturethold present a noticeable F1-score boost for these classes (cf. Fig. 17). On the other hand, the gestures “cheduepalle” and “basta” are also symmetric but are made with a movement of both arms to the body side. Both gestures are used when a person is acting decisively and implying “enough”; therefore, the movement of the arms is quite firm, making it unique within the rest of the gesture vocabulary. Consequently, the CNNLSTM is effective at picking up these movements. Snapture and Snapturethold only slightly improve over the CNNLSTM concerning these explicit movements since the hand shape and finger arrangement play a minimal role in their recognition. In Fig. 20, we display a comparison between an implicit and an explicit movement and their corresponding pre-processing steps.

Fig. 20 A comparison between implicit and explicit hand movements. Hand details are missing for “sonostufo” (b), whereas the explicit arm movement of “basta” is conserved (d). (a) and (c) depict the original sequences; the contrast of (b) and (d) is increased for clarity

Discussion

We proposed a hybrid gesture recognition architecture called Snapture. It integrated the hand pose alongside movement through modular static and dynamic channels. Our work was motivated by the limitation of RGB techniques, such as the CNNLSTM network, across different gesture domains. Therefore, we evaluated our approach in the context of robot commands and co-speech gestures. In our experiments, we compared the performance of Snapture to a CNNLSTM baseline using the GRIT [12] and Montalbano [13] datasets. Our analysis demonstrated that Snapture improved the classification of indistinctive and subtle movements. We believe the unique characteristics of our approach make it potentially beneficial in the following domains: (1) emblematic hand gestures, which substitute words to convey a particular meaning, and (2) co-speech gestures, which accompany words as means of verbal communication. Furthermore, our system is compatible with mobile systems and robot applications due to the few required data streams. We use only RGB frames to extract the motion and hand pose. The participants stand freely in front of the camera without needing environments with constrained setup conditions [18]. We also avoid issues that result from using skeleton data to extract the hand, such as the occasional loss of joint information [15]. Thus, our approach is one of the few pure RGB-based models that operate on the Montalbano dataset.

Recent RGB-based approaches are influenced by similarities in hand movements [9] and the loss of delicate small-scale motions at the peak [10]. The effects of this phenomenon are limited in the GRIT dataset because it includes robot commands with intense arm movements. On the other hand, these effects are more pronounced in the Montalbano Italian gestures, which are part of human communication. These movements are more natural than robot control, have a simple motion path, and involve particular hand and finger configurations. Thus, capturing the hand pose in addition to the movement becomes more critical. However, this is challenging for recent approaches, which require intensive training of networks, such as 3DCNN [14], ResNet [10], and Inception V3 [15]. In contrast, our system captures the hand details by merely incorporating an additional static channel. We integrate the hand information at the gesture’s peak on top of the CNNLSTM model. Therefore, our architecture is easier to train yet effectively capable of addressing the issues of indistinctive and subtle movements. The simplicity of our approach facilitates a robot application due to its lightweight architecture.

Furthermore, the scheme of fusing the static and dynamic features influences the system. Our approach operates on a single frame in the static channel, which has several advantages. First, it matches Kendon [11] model of gesticulation and concurrent speech. The literature shows that the stroke phase plays an essential role in recognition. Following this model also helps simplify our approach since it does not require dedicated networks for learning the short-term and long-term dependencies [21, 28]. Second, the spatial and temporal traits are treated with equal importance. Thus, we avoid the issues of fusing features at each time step and the dominance of particular modalities in the learning process [14]. Therefore, we are better able to analyze the influence of each stream on the final outcome. Our experimental design provides evidence that classification performance concerning indistinctive and subtle movements can be boosted by learning hand details.

Another consideration for RGB-based methods is the issue of blurry frames caused by rapid hand movement and leading to noise in the input data. Due to a lack of literature concerning the analysis of gesture sequences, we employ an SSIM-based algorithm for analyzing motion profiles. This technique is used in a threshold-based fashion and further benefits our approach. The performance is improved by regulating the static channel and bypassing the blurriness issue. Thus, our method is comparable to others that use adaptive sampling of the input frames [26]. In both architectures, the performance is improved by reducing noise in the input frames. However, we control the noise by focusing on the hand’s motion rather than just the presence of the hand, so that our approach works with co-speech gestures. However, one disadvantage of such empirical methods is that they do not guarantee generalization to new samples. Therefore, introducing robustness by learning the cut-off values of the threshold is desired. We hypothesize that approaches using attention-based methods, such as [21], might provide an alternative solution. However, this issue still negatively influences vision-based systems, and our results show that the blur phenomenon is challenging. We hope our work raises more attention to the quality of collected RGB gesture datasets and encourages more research in producing affordable, higher-quality cameras compatible with robots.

Our observations on the GRIT and Montalbano datasets show high variability in hand preference. Besides hand dominance, fatigue and injuries are among the most common factors that drive the interchangeable use of both hands. Therefore, a robust system that works with subjects regardless of the dominant hand is beneficial. Our system accomplishes that by extracting the pose of the hand actively used while making the gesture. Thus, it does not require mirroring videos of left-handed subjects [14] or a dedicated network for each hand [15]. Consequently, our approach facilitates higher flexibility, which should assist in less restrictive and guided HRI scenarios. However, this is a broader research domain, and more work could be done in this area. Another interesting finding is that our architecture is prone to confusion between classes with a similar hand pose at the stroke phase. Gestures such as “furbo” and “buonissimo” are almost identical at the peak with minor distinction. Thus, precisely recognizing these classes requires a more detailed interpretation of facial or speech information. Our modular architectural design facilitates that by incorporating additional channels. We hypothesize that such faulty system behavior can be avoided by including the body pose information. However, it remains an open question whether this is achievable in an architecture based on RGB data only.

Finally, our training scheme could be further improved. In our evaluation, we use a stratified sampler for the data splits to avoid class label imbalances. Having a representative test set is important due to the small size of the GRIT dataset and to make sure that challenging gestures, such as “hello” and “no,” are represented fairly. Although our data split guarantees mutually exclusive sample sets, some of those samples might be recorded from the same participants. Since many participants were recorded in different surroundings and appearances, i.e., different outfits, each combination of environment and appearance presents a different challenge to our RGB approach. However, a different training strategy, e.g., a subject-wise split, would better demonstrate the model’s robustness to subject variability. Although this is not part of this study, we highlight it as a potential future improvement.

Conclusion

Our study presents a novel architecture called Snapture, which integrates static and dynamic information. Our use of RGB data only and our lightweight architecture allow compatibility with any system equipped with a camera, including robots. We also suggest an algorithm for analyzing gesture motion profiles, which is essential for revealing the unique characteristics of gestures. Our results show that incorporating the hand pose at the gesture’s peak together with motion information offers a solution to the issues of indistinctive and subtle movements. The results also demonstrate that these challenges are more prominent in the context of co-speech gestures than robot commands, which hints at the importance of evaluating frameworks across multiple gesture domains. Additionally, our Snapturethold extension highlights the influence of RGB data quality on system performance and provides a means for optimization based on a snapshot of a gesture. Overall, our work contributes to bridging the gap between static and dynamic gestures, allowing gesture applications that foster immersive and less controlled HRI experiences.