Visual-semantic network: a visual and semantic enhanced model for gesture recognition

Gesture recognition has attracted considerable attention and made encouraging progress in recent years due to its great potential in applications. However, the spatial and temporal modeling in gesture recognition is still a problem to be solved. Specifically, existing works lack efficient temporal modeling and effective spatial attention capacity. To efficiently model temporal information, we first propose a long- and short-term temporal shift module (LS-TSM) that models the long-term and short-term temporal information simultaneously. Then, we propose a spatial attention module (SAM) that focuses on where the change primarily occurs to obtain effective spatial attention capacity. In addition, the semantic relationship among gestures is helpful in gesture recognition. However, this is usually neglected by previous works. Therefore, we propose a label relation module (LRM) that takes full advantage of the relationship among classes based on their labels’ semantic information. To explore the best form of LRM, we design four different semantic reconstruction methods to incorporate the semantic relationship information into the class label’s semantic space. We perform extensive ablation studies to analyze the best settings of each module. The best form of LRM is utilized to build our visual-semantic network (VS Network), which achieves the state-of-the-art performance on two gesture datasets, i.e., EgoGesture and NVGesture.


Introduction
As an important branch of computer vision, gesture recognition has drawn an increasing amount of attention due to its wide application in real life, such as human-computer interaction, smart homes, and automatic driving. However, there exist three main challenges in current research on gesture recognition.
First, gestures can hardly be recognized accurately because existing works lack efficient temporal modeling ability. In general, gestures can be divided into two categories, i.e., static gestures and dynamic gestures, depending on whether or not their appearance changes over time. At present, static gestures can be well recognized and classified, while recognizing dynamic gestures remains a challenge due to their high dependence on temporal information. Therefore, how to handle both the temporal and the spatial information is significant for classifying dynamic gestures. Previous works usually use two-dimensional convolutional neural networks (2D CNNs) or three-dimensional CNNs (3D CNNs) to tackle this issue. 3D CNNs [1-3] capture spatial and temporal information simultaneously; however, they are usually computationally intensive and time-consuming. To involve temporal information efficiently, other works use 2D CNNs with extended modules [4-6] to recognize gestures. Among them, the temporal shift module (TSM) [5] is the most lightweight, exchanging information between adjacent frames to capture short-term temporal information. However, it lacks the ability to model long-term temporal information. To overcome this shortcoming, we propose a long- and short-term temporal shift module (LS-TSM), which expands the step size of the temporal shift operation so that we can model not only short-term but also long-term temporal information intuitively and effectively. Meanwhile, we believe that in gesture recognition the importance of long-term and short-term temporal information differs. Therefore, we assign different weights to the long-term shift part and the short-term shift part.
Second, existing works lack effective spatial attention capacity. When classifying gestures, we should focus on the area where a gesture is performed, because it usually contains the key information for recognizing the gesture. For example, when distinguishing the gestures "scroll hand towards right" and "scroll hand towards left", it is important to capture the direction of the hand movement. The red-green-blue (RGB) pixel difference among frames exactly fits this requirement. Some previous works utilized the RGB pixel difference between adjacent frames to assign more attention to boundary displacement [7] or took it as an additional input to achieve better recognition results [8]. It should be noted that the pixel difference between adjacent frames is usually too small to capture and carries only trivial information, while the pixel difference between long-distance frames can robustly indicate where the change mainly occurs. Therefore, we propose a spatial attention module (SAM) that obtains a spatial attention map to assign different weights to different positions for accurate gesture recognition. Compared with other works that operate on the channel dimension to obtain a mask, our work provides a more intuitive method that focuses on the spatial dimension where the change primarily occurs.
Third, existing works on gesture recognition focus on visual information but usually ignore semantic information. In fact, the semantic relation among classes has been proven helpful in other computer vision tasks. For example, in few-shot object detection, some works [9,10] used semantic relations derived from the word embeddings of class labels, which reflect the semantic relationship among classes, to assist detection. Inspired by these works, we propose a label relation module (LRM) [11] that first applies semantic reasoning to gesture recognition based on the embeddings of class labels. Specifically, we model the relationship among class labels and incorporate the related information into the semantic space to obtain a new semantic space. Then, we incorporate the visual information into the new semantic space to make the final predictions. However, how to model the semantic relationship and incorporate the related information into the semantic space remains a challenging task. In our previous work [11], we introduced a semantic reconstruction method to build our LRM. In this work, we aim to explore more approaches to model the semantic relationship and determine the best one. In recent years, the Transformer [12] has garnered great attention due to its outstanding performance. It computes an attention matrix that captures the relationship among query-key pairs. Thus, based on the Transformer architecture, we propose three semantic reconstruction methods built on self-attention, multi-head self-attention, and the Transformer encoder, respectively. Meanwhile, we introduce another method that uses the MLPMixer to model semantic relations. For these semantic reconstruction methods, we conduct sufficient ablation studies and determine the best method. After extensive analyses, we find that the self-attention-based method achieves the best performance. To the best of our knowledge, we are the first to incorporate semantic information into gesture recognition, and our work offers a new perspective on this task.
Given all that, we build our visual-semantic network (VS Network for short) based on the standard ResNet50 [13]. The proposed LS-TSM and SAM are inserted into every stage of ResNet50, and the best form of LRM is inserted before the last fully-connected layer. We add these modules one by one to the backbone and conduct sufficient ablation studies on them. LS-TSM contains several hyperparameters; to explore its best setting, we perform sufficient ablation studies to determine the effects of different hyperparameters on the model's performance and select the best configuration. Our SAM provides a more intuitive way to obtain spatial attention maps. To prove its effectiveness, we visualize the class activation maps of many examples under TSM and under our model, and perform sufficient analysis on SAM based on these examples. Additionally, to better illustrate the effect of the LRM, we visualize the confusion matrices and the t-SNE diagrams of our models with and without the LRM, based on which we analyze how the LRM works. The overview of the VS Network is illustrated in Fig. 1. Note that this work is an extension of our previous conference publication [11], and the contributions are summarized as follows.
1) We propose a long- and short-term temporal shift module (LS-TSM) and a spatial attention module (SAM) to tackle the temporal and spatial modeling problems in gesture recognition.
2) We introduce a label relation module (LRM) that models the semantic relation among gesture labels, reconstructs the semantic space, and incorporates the semantic information into the network to enhance the visual features.
3) In the LRM, we present four forms of semantic reconstruction methods that model the semantic relation among gestures and use it to reconstruct the semantic space of gestures.
4) We perform sufficient ablation studies to obtain the best form of LRM. Extensive experiments demonstrate the effectiveness of the three proposed modules.
5) The three modules can be easily integrated into ResNet50 to construct our VS Network, which achieves state-of-the-art performance on two commonly used datasets, EgoGesture [14] and NVGesture [15].

Figure 1 Overview of the network. The sparse sampling strategy [16] is adopted to sample T frames from videos. ResNet50 [13] is utilized as the backbone, and the LS-TSM and SAM modules are inserted into each block. The LRM is inserted before the fully connected layer to enhance semantic reasoning in the model. FC represents the fully-connected layer

Temporal information modeling
Most works on gesture recognition are based on RGB videos of gestures, which are easy to obtain. However, dynamic gestures contain much more temporal information, which is difficult to model. Existing works on RGB-based gesture recognition often utilize 2D CNNs or 3D CNNs to recognize gestures. Pure 2D CNNs [13,17,18] can only model spatial information, while 3D CNNs can model spatial and temporal information simultaneously but are usually computationally intensive and time-consuming. Thus, recent works attempt to add extra modules to 2D CNN backbones to capture more temporal information. Lin et al. [5] proposed the temporal shift module (TSM), which shifts part of the feature channels along the temporal dimension to exchange information between adjacent frames. Li et al. [4] introduced a temporal excitation and aggregation (TEA) block that contains a motion excitation module to model short-range motion and a multiple temporal aggregation (MTA) module to model long-range motion. Wu et al. [6] established MVFNet, which introduces a novel multi-view fusion module with separable convolution to exploit spatio-temporal dynamics. Unlike TSM, the above works introduce additional parameters that result in extra latency. Therefore, in this work, we extend the idea of TSM to improve the temporal modeling capacity.

Attention mechanism
Existing works in gesture recognition usually utilize attention mechanisms for more accurate recognition. Some works [19,20] assign different weights to different channels, while others [21,22] prefer spatial attention mechanisms. Woo et al. [21] proposed a convolutional block attention module that infers attention maps along two separate dimensions. Jiang et al. [22] built a spatial-wise attention module and a temporal-wise attention module that separately model the spatial and temporal relations among contexts. In fact, when classifying gestures, we should focus on where the motion changes mainly occur, and the RGB pixel difference among frames fits this requirement. Therefore, we utilize it to obtain the spatial attention map.

Semantic relation model
In visual recognition tasks, different classes are not independent of each other. In fact, there exists related or mutually exclusive information among different classes, which is helpful for visual recognition. In addition, the language descriptions of class labels can reflect the relationships among classes. In object detection, many works [9,10] have applied the semantic embeddings of class labels to model the relationships among classes and incorporate the related information into the network architecture. Zhu et al. [9] used the word embeddings of class labels to construct a semantic knowledge graph that learns the semantic relations among different classes. Nie et al. [10] utilized the word embeddings as nodes to construct a semantic relational graph that models the semantic relationships. To better model the semantic relations among classes, we propose four different ways of capturing the semantic relationship and conduct extensive experiments to find the best one.

Method
In this section, we first introduce the long- and short-term temporal shift module (LS-TSM) and the spatial attention module (SAM). Then, we propose the label relation module (LRM) and introduce its different structures. Finally, we present how to combine the above modules with the backbone to construct the VS Network.

Long- and short-term temporal shift module
TSM [5] is proposed to model short-term temporal information. Specifically, it shifts part of the feature channels to exchange information between adjacent frames. However, for dynamic gestures, long-term temporal information is also important. Extending the idea of TSM, we propose the LS-TSM to model long- and short-term temporal information simultaneously, as illustrated in Fig. 2. We perform both short-term and long-term temporal shift operations in the LS-TSM. Given an input feature $X \in \mathbb{R}^{T \times C \times H \times W}$, we divide it into three parts along the channel dimension, i.e., the short-term temporal shift part $X_s \in \mathbb{R}^{T \times C/8 \times H \times W}$, the long-term temporal shift part $X_l \in \mathbb{R}^{T \times C/8 \times H \times W}$, and the remaining channels $X_o \in \mathbb{R}^{T \times 3C/4 \times H \times W}$.

Figure 2 The implementation of LS-TSM. X denotes the input feature map of LS-TSM. The shape of the input spatiotemporal feature X is [T, C, H, W], where T and C denote the temporal dimension and feature channels, respectively; H and W correspond to the spatial shape

Short-term temporal shift
We select the first 1/8 of the channels of the input features as $X_s$ to perform the short-term shift operation. Similar to TSM [5], we shift channels between adjacent frames: 1/2 of the channels of $X_s$ are shifted forward with step size 1 and the other 1/2 are shifted backward with step size 1.

Long-term temporal shift
We select the second 1/8 of the channels of the input features as $X_l$ to perform the long-term shift operation, in which channels are shifted among long-distance frames. We set the step size of the temporal shift operation to a larger value, which is T/2 in this work. Following this setting, we shift 1/2 of the channels of $X_l$ forward and the other 1/2 backward.
Other channels

For the other channels $X_o$, we keep them the same as in the original input X. Finally, we combine the above three parts by a weighted summation:

$$\tilde{X} = \alpha \hat{X}_l + \beta \hat{X}_s + \hat{X}_o, \quad (1)$$

where $\hat{X}_l$, $\hat{X}_s$, and $\hat{X}_o$ denote the long-term shifted part, the short-term shifted part, and the unchanged part placed back in their original channel positions, and α and β indicate the importance of long- and short-term temporal information, respectively. Through empirical validation, we find that α = 1 and β = 1.5 work well in practice. $\tilde{X}$ is the output of LS-TSM.
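To make the shift operation concrete, the following is a minimal PyTorch sketch of the channel partitioning, the two shift steps, and the weighted recombination described above. The module name, the zero-padding at the temporal boundaries, and the [N, T, C, H, W] layout are our own illustrative assumptions rather than details of the released implementation.

```python
import torch
import torch.nn as nn


class LSTSM(nn.Module):
    """Long- and short-term temporal shift (illustrative sketch)."""

    def __init__(self, alpha: float = 1.0, beta: float = 1.5):
        super().__init__()
        self.alpha = alpha  # weight of the long-term shifted part
        self.beta = beta    # weight of the short-term shifted part

    @staticmethod
    def _shift(x: torch.Tensor, step: int) -> torch.Tensor:
        # x: [N, T, C', H, W]; shift half of C' forward and half backward
        # along the temporal axis, padding the vacated frames with zeros.
        out = torch.zeros_like(x)
        c = x.size(2) // 2
        out[:, step:, :c] = x[:, :-step, :c]   # shift forward by `step`
        out[:, :-step, c:] = x[:, step:, c:]   # shift backward by `step`
        return out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [N, T, C, H, W]
        n, t, c, h, w = x.shape
        cs = c // 8
        x_s, x_l, x_o = x[:, :, :cs], x[:, :, cs:2 * cs], x[:, :, 2 * cs:]
        x_s = self.beta * self._shift(x_s, step=1)         # short-term shift
        x_l = self.alpha * self._shift(x_l, step=t // 2)   # long-term shift
        # Recombine the three channel groups (cf. Eq. (1)).
        return torch.cat([x_s, x_l, x_o], dim=2)
```

In TSM-style implementations the temporal dimension is usually folded into the batch dimension and restored inside the module; the sketch keeps it explicit for readability.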

Spatial attention module
We introduce our spatial attention module in this section, which is presented in Fig. 3. Given an input spatiotemporal feature $X = \{X(1), \ldots, X(T)\} \in \mathbb{R}^{T \times C \times H \times W}$, we calculate the RGB pixel difference of frame pairs with time steps $t$ and $t + T/2$, i.e., the difference between $X(t)$ and $X(t + T/2)$:

$$D(t) = X(t + T/2) - X(t), \quad (2)$$

where $D(t) \in \mathbb{R}^{1 \times C \times H \times W}$ is the RGB pixel difference between time steps $t$ and $t + T/2$, indicating the motion change during this period. Then, we concatenate all $D(t)$ to construct the final motion change matrix $D$:

$$D = \mathrm{Concat}\big(D(1), D(2), \ldots, D(T/2)\big). \quad (3)$$

We utilize a temporal average pooling to obtain the motion change within the whole video:

$$D_{\mathrm{avg}} = \mathrm{AvgPool}_{T}(D). \quad (4)$$

Next, a 3 × 3 convolutional layer is utilized to perform a channel-wise transformation on $D_{\mathrm{avg}}$, and a sigmoid function is applied to obtain the mask $M$:

$$M = \sigma\big(\mathrm{Conv}_{3\times 3}(D_{\mathrm{avg}})\big). \quad (5)$$

Finally, we conduct an element-wise multiplication between the input feature $X$ and the mask $M$, as shown in Eq. (6):

$$X_o = X \odot M, \quad (6)$$

where $X_o$ is the output of SAM and $\odot$ indicates element-wise multiplication. Using the proposed SAM, we can focus on where the motion changes primarily occur.

Figure 3 The implementation of the spatial attention module (SAM). X denotes the input feature map of SAM
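The SAM pipeline in Eqs. (2)-(6) can be sketched in a few lines of PyTorch. The padding of the 3 × 3 convolution, the unchanged channel count of the mask, and the assumption that T is even are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn as nn


class SAM(nn.Module):
    """Spatial attention from long-distance frame differences (sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        # Channel-wise transformation of the pooled difference map (Eq. (5)).
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [N, T, C, H, W]; assumes T is even.
        n, t, c, h, w = x.shape
        half = t // 2
        # Eq. (2): difference between frames t and t + T/2.
        diff = x[:, half:] - x[:, :half]           # [N, T/2, C, H, W]
        # Eq. (4): temporal average pooling over the difference maps.
        d_avg = diff.mean(dim=1)                   # [N, C, H, W]
        # Eq. (5): 3x3 convolution followed by a sigmoid to form the mask.
        mask = torch.sigmoid(self.conv(d_avg))     # [N, C, H, W]
        # Eq. (6): element-wise re-weighting of every frame.
        return x * mask.unsqueeze(1)
```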

Label relation module
Different gestures may contain related or mutually exclusive information, which can be reflected in their class labels. Therefore, we propose a label relation module (named LRM) to model the semantic relationship among gestures. The LRM is illustrated in Fig. 4. We first encode the class labels of gestures to obtain the original semantic space $W_e \in \mathbb{R}^{N \times d_e}$, where N is the number of gesture classes and $d_e$ denotes the channel dimension of the semantic embeddings. Then, using a semantic reconstruction method, we extract the semantic relations among them and construct new phrase embeddings $G_e \in \mathbb{R}^{N \times d_e}$. Finally, we combine the visual information $v$ and the reconstructed phrase embeddings $G_e$ through a layer we call GLinear to obtain more accurate predictions. The GLinear can be formulated as

$$p = \mathrm{softmax}(G_e P v + b), \quad (7)$$

where $P \in \mathbb{R}^{d_e \times d_v}$ is a learnable variable that aims to fuse the visual and semantic information.
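As an illustration of Eq. (7), the GLinear fusion can be written as a small PyTorch module; the batched formulation and the parameter initialization below are our assumptions.

```python
import torch
import torch.nn as nn


class GLinear(nn.Module):
    """Fuses semantic embeddings G_e with a visual feature v (cf. Eq. (7))."""

    def __init__(self, d_e: int, d_v: int, num_classes: int):
        super().__init__()
        self.P = nn.Parameter(torch.empty(d_e, d_v))      # learnable fusion matrix
        self.b = nn.Parameter(torch.zeros(num_classes))   # per-class bias
        nn.init.xavier_uniform_(self.P)

    def forward(self, g_e: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # g_e: [num_classes, d_e] reconstructed phrase embeddings
        # v:   [batch, d_v] pooled visual feature per clip
        logits = v @ (g_e @ self.P).t() + self.b          # [batch, num_classes]
        return logits.softmax(dim=-1)
```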

Phrase encoding
The class labels of gestures are usually phrases, which are more complex than single words. Therefore, we use a phrase-specific encoding model, Phrase-BERT [23], to encode the gesture labels. Phrase-BERT [23] uses a contrastive fine-tuning objective to enable BERT [24] to produce more powerful phrase embeddings, and it achieves impressive performance across a variety of phrase-level similarity tasks. Thus, Phrase-BERT [23] exactly meets our needs.
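Building the original semantic space then amounts to encoding each class label phrase once and stacking the results into an N × d_e matrix. The sketch below assumes the released Phrase-BERT checkpoint can be loaded through the sentence-transformers interface; the model identifier is an assumption and should be verified against the Phrase-BERT release.

```python
import torch
from sentence_transformers import SentenceTransformer

# Assumed checkpoint name for the released Phrase-BERT weights.
encoder = SentenceTransformer("whaleloops/phrase-bert")

class_labels = ["scroll hand towards right", "scroll hand towards left",
                "rotate fists counterclockwise"]  # ... one phrase per class

# W_e: [N, d_e] matrix of phrase embeddings, kept fixed during training.
W_e = torch.tensor(encoder.encode(class_labels))
```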

Semantic reconstruction method
As explained above, given the class labels of a gesture dataset, we can construct a semantic space by encoding the class labels into a series of phrase embeddings using Phrase-BERT [23]. We represent the semantic space by a set of $d_e$-dimensional embeddings $W_e \in \mathbb{R}^{N \times d_e}$, where N denotes the number of classes.
Based on the semantic space discussed above, we propose four semantic reconstruction methods to construct new phrase embeddings that encode the relations among gestures. These methods are based on the self-attention mechanism, multi-head attention (MHA), the Transformer, and the MLPMixer, respectively, as shown in Fig. 4.

Method 1 (Self-attention based method)
We propose the first semantic reconstruction method based on the self-attention mechanism. We first transform the semantic space into Q, K, and V by three linear layers $f$, $g$, and $h$, where $f, g, h: \mathbb{R}^{N \times d_e} \rightarrow \mathbb{R}^{N \times d_k}$:

$$Q = f(W_e), \quad K = g(W_e), \quad V = h(W_e). \quad (8)$$

We then calculate the self-attention matrix using Q and K, and the self-attention matrix is multiplied by V:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \quad (9)$$

where $d_k$ is a scaling factor equal to the channel dimension of the key K. We utilize a linear layer $l$ to align the dimension, and a residual connection is introduced to maintain the original semantic information:

$$G_e = l\big(\mathrm{Attention}(Q, K, V)\big) + W_e. \quad (10)$$

Through the above self-attention mechanism, we obtain a new semantic space $G_e$ that contains class relations.
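A compact PyTorch sketch of Method 1 is shown below; the choice of d_k and the default initialization of the linear layers f, g, h, and l are assumptions, while the computation follows Eqs. (8)-(10).

```python
import torch
import torch.nn as nn


class SelfAttentionLRM(nn.Module):
    """Method 1: self-attention based semantic reconstruction (sketch)."""

    def __init__(self, d_e: int, d_k: int):
        super().__init__()
        self.f = nn.Linear(d_e, d_k)   # query projection
        self.g = nn.Linear(d_e, d_k)   # key projection
        self.h = nn.Linear(d_e, d_k)   # value projection
        self.l = nn.Linear(d_k, d_e)   # aligns the dimension back to d_e

    def forward(self, w_e: torch.Tensor) -> torch.Tensor:
        # w_e: [N, d_e] phrase embeddings of the N class labels
        q, k, v = self.f(w_e), self.g(w_e), self.h(w_e)
        attn = torch.softmax(q @ k.t() / k.size(-1) ** 0.5, dim=-1)  # Eq. (9)
        # Residual connection keeps the original semantic information (Eq. (10)).
        return w_e + self.l(attn @ v)
```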

Method 2 (Multi-head attention based method)
The second semantic reconstruction method is based on the multi-head attention (MHA) mechanism, which contains $h$ different heads. Specifically, given the original semantic embedding, each head transforms it into Q, K, and V and performs the scaled dot-product attention operation as Eq. (9) shows. Then, we concatenate the outputs of all heads and pass them through a fully connected layer:

$$\mathrm{head}_i = \mathrm{Attention}\big(W_e W_i^{Q}, W_e W_i^{K}, W_e W_i^{V}\big), \qquad \mathrm{MHA}(W_e) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O},$$

where $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are all parameters of fully connected layers. Finally, we add a residual connection to generate the final aggregated representation:

$$G_e = \mathrm{MHA}(W_e) + W_e.$$

Method 3 (Transformer based method)

The third semantic reconstruction method is based on the Transformer. Specifically, it contains two sub-layers: a multi-head self-attention layer and a feed-forward layer. We can formulate the two sub-layers as follows:

$$Z = \mathrm{LayerNorm}\big(\mathrm{MHA}(W_e) + W_e\big), \qquad G_e = \mathrm{LayerNorm}\big(\mathrm{FFN}(Z) + Z\big).$$

Method 4 (MLPMixer based method)

The final method is based on the MLPMixer layer. One MLPMixer layer contains two MLP blocks, i.e., the token-mixing MLP and the channel-mixing MLP, which act on the columns and rows of the input features, respectively. Given the original semantic space $W_e$, we can formulate this process as follows:

$$U = W_e + \big(W_2\,\sigma(W_1\,\mathrm{LayerNorm}(W_e)^{\top})\big)^{\top}, \qquad G_e = U + W_4\,\sigma\big(W_3\,\mathrm{LayerNorm}(U)\big),$$

where $W_1$, $W_2$, $W_3$, and $W_4$ are linear layers and σ denotes the activation function.
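For reference, Methods 2 and 3 can be assembled from standard PyTorch building blocks, and Method 4 needs only two small MLPs. The sketch below is illustrative: the head count, hidden width, and activation are our assumptions, and the attention modules are created inline only for brevity (in practice they would be registered submodules).

```python
import torch
import torch.nn as nn


def mha_reconstruction(w_e: torch.Tensor, heads: int = 4) -> torch.Tensor:
    """Method 2: multi-head attention over the label embeddings + residual."""
    d_e = w_e.size(-1)  # d_e must be divisible by `heads`
    mha = nn.MultiheadAttention(d_e, heads, batch_first=True)
    out, _ = mha(w_e.unsqueeze(0), w_e.unsqueeze(0), w_e.unsqueeze(0))
    return w_e + out.squeeze(0)


def transformer_reconstruction(w_e: torch.Tensor, heads: int = 4) -> torch.Tensor:
    """Method 3: one Transformer encoder layer (MHA + feed-forward)."""
    layer = nn.TransformerEncoderLayer(w_e.size(-1), heads, batch_first=True)
    return layer(w_e.unsqueeze(0)).squeeze(0)


class MixerReconstruction(nn.Module):
    """Method 4: one MLPMixer layer with token- and channel-mixing MLPs."""

    def __init__(self, n_classes: int, d_e: int, hidden: int = 256):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_e), nn.LayerNorm(d_e)
        self.token_mlp = nn.Sequential(nn.Linear(n_classes, hidden), nn.GELU(),
                                       nn.Linear(hidden, n_classes))
        self.channel_mlp = nn.Sequential(nn.Linear(d_e, hidden), nn.GELU(),
                                         nn.Linear(hidden, d_e))

    def forward(self, w_e: torch.Tensor) -> torch.Tensor:
        # Token mixing acts across classes (columns), channel mixing across d_e.
        u = w_e + self.token_mlp(self.norm1(w_e).t()).t()
        return u + self.channel_mlp(self.norm2(u))
```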
To explore the best way to model semantic information, we perform ablation studies on the above four methods.

Overview of the network
Figure 1 presents our method. We follow the sparse sampling scheme proposed in the temporal segment network (TSN) [16]. Specifically, given a video clip V, we first divide it into T segments and then randomly sample one frame per segment. The sampled frames are then fed into the proposed network. We build our visual-semantic network (VS Network) based on ResNet50 [13]. The proposed long- and short-term temporal shift module and spatial attention module are inserted into each residual block, as shown in Fig. 1. Meanwhile, the LRM is inserted before the last fully-connected layer of the network to aid classification by enhancing the visual feature with semantic relation information.
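The TSN-style sparse sampling divides a video into T equal segments and draws one random frame from each; a minimal illustrative helper (our own, not the released code) is shown below.

```python
import random


def sparse_sample(num_frames: int, num_segments: int) -> list[int]:
    """Pick one random frame index from each of `num_segments` equal segments."""
    seg_len = num_frames / num_segments
    return [int(seg_len * i + random.random() * seg_len)
            for i in range(num_segments)]


# Example: sample T = 8 frame indices from a 120-frame gesture clip.
indices = sparse_sample(num_frames=120, num_segments=8)
```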
Datasets

EgoGesture [14] is a large-scale dataset with 83 classes of static and dynamic gestures. It contains 24,161 gesture samples and 2,953,224 frames collected from 50 subjects. It is designed for egocentric hand gesture recognition, which focuses on interaction with wearable devices.
NVGesture [15] is a dataset with 25 types of gestures. It contains 1,532 dynamic gesture samples performed by 20 subjects. It is also designed for human-computer interaction, and all gestures are recorded by multiple sensors from different perspectives.

LD-ConGR [31] is a gesture dataset developed for long-distance gesture recognition. It contains 10 gesture classes, including 7 static gesture classes and 3 dynamic gesture classes. There are 542 videos and 44,887 gesture instances in LD-ConGR.

We focus on RGB-based gesture recognition, and thus we only use RGB videos as input. Meanwhile, for a fair comparison, we only consider the reported results that use RGB images as input when comparing with other works.

Implementation details
Training We select ResNet50 [13] as our backbone and adopt the sampling strategy of TSN [16]. Our progressive training strategy is as follows: we first train a network with the proposed LS-TSM and SAM, and then add the LRM and fine-tune the network based on the previous model. The training settings during the two phases are the same. For EgoGesture [14], we train for 100 epochs with an initial learning rate of 0.001 (decayed by 0.1 at epochs 35, 60, and 80), weight decay of 5 × 10^-4, batch size of 16, and dropout of 0.8. For NVGesture [15], we train for 50 epochs with an initial learning rate of 0.001 (decayed by 0.1 at epochs 20 and 40), weight decay of 5 × 10^-4, batch size of 32, and dropout of 0.8. Compared with EgoGesture, NVGesture is much smaller and easily overfits. Therefore, during the first training phase on NVGesture, we fine-tune based on the model trained on EgoGesture.
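For reference, the step-decay schedule above maps directly onto PyTorch's MultiStepLR; the optimizer choice in the sketch below (SGD with momentum) is an assumption, since only the learning rate, weight decay, and decay epochs are specified.

```python
import torch
from torch import optim

model = torch.nn.Linear(2048, 83)  # placeholder for the VS Network

# Assumed optimizer; the text specifies lr, weight decay, and decay epochs only.
optimizer = optim.SGD(model.parameters(), lr=0.001,
                      momentum=0.9, weight_decay=5e-4)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer,
                                           milestones=[35, 60, 80], gamma=0.1)

for epoch in range(100):
    # ... one training epoch over EgoGesture with batch size 16 ...
    scheduler.step()
```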
Testing During testing, following the common setting used in other works [4,5], we sample multiple clips per video (3 for EgoGesture and 10 for NVGesture). Meanwhile, we scale the short side of the frames to 256 and take 5 crops of 224 × 224 from each frame. Finally, we average the softmax scores of all clips.
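Test-time prediction then reduces to averaging the softmax scores over all sampled clips and crops; a minimal sketch, assuming the scores are stacked into a [num_clips × num_crops, num_classes] tensor, is shown below.

```python
import torch


def aggregate_predictions(clip_scores: torch.Tensor) -> int:
    """Average softmax scores over all clips/crops and return the class index."""
    probs = clip_scores.softmax(dim=-1).mean(dim=0)   # [num_classes]
    return int(probs.argmax())


# e.g., EgoGesture: 3 clips x 5 crops = 15 score vectors over 83 classes
scores = torch.randn(15, 83)
predicted_class = aggregate_predictions(scores)
```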

Comparisons with the state-of-the-art methods
We first compare our VS Network with previous works on EgoGesture [14] and NVGesture [15]. The results are summarized in Table 1, where the first compartment contains the methods based on 3D CNNs, the second compartment contains the methods based on 2D CNNs, the third compartment contains methods that go beyond them (discussed below), and the last compartment contains the results of our methods. Obviously, in the simplest setting, i.e., ResNet50 as the backbone and 8/16 frames as input, our VS Network already outperforms the 3D CNN-based and 2D CNN-based methods in the first two compartments.

The methods in the third compartment can achieve performance comparable to our VS Network in the simplest setting. Yu et al. [28] proposed a multi-rate single-modal neural architecture search method, "NAS1". It contains three branches, each of which is a 3D CNN that takes video frames sampled at a different frequency as input. Jain et al. [29] introduced a method that first converts a gesture-depth video into a single image called an encoded motion image (EMI) and then uses the EMI to perform classification; however, the results of the EMI + modified VGG-19 are not superior. Cao et al. [30] introduced a Transformer-based network for gesture recognition, a pure self-attentional architecture that predicts gestures by attending to the whole video sequence. The NAS1 method and the Transformer-based method both achieve performance comparable to ours on EgoGesture, and on NVGesture they are slightly better than our VS Network. However, both methods involve many more parameters, i.e., 127.1M for NAS1 and 121.296M for the Transformer-based method, compared with 25.7M for our VS Network. We also use more frames as input, i.e., 32 frames for NVGesture, to improve performance; under this setting, our VS Network achieves an accuracy of 83.28% on NVGesture. To further improve performance, we adopt a post-fusion strategy, i.e., fusing the results of the 8-frame and 16-frame VS Networks, with which our model reaches a new state-of-the-art performance of 94.15% on EgoGesture and 83.85% on NVGesture.
Meanwhile, we compare our VS Network with other 2D CNN-based methods on LD-ConGR [31]. The results are presented in Table 2. The LD-ConGR dataset focuses on recognizing gestures collected at a long distance, which causes many irrelevant elements to appear in the image. Our proposed SAM makes the model focus precisely on where the change primarily occurs, so the model pays less attention to irrelevant elements. Meanwhile, our LRM can mine the relationships among different gesture classes. We observe that when only LS-TSM and SAM are added to the backbone, the accuracy of our method reaches 87.63%, higher than that of the other 2D CNN-based methods. When the LRM is further added, the performance of our VS Network improves to 88.363%, which is much higher than that of the other 2D CNN-based methods. It should be noted that our model is trained without specific pre-training on any video dataset.

Ablation study
In this section, we first analyze the best settings of LS-TSM and LRM, i.e., the choice of α and β in LS-TSM and the choice of the semantic reconstruction method in LRM. Then, we perform extensive ablation studies to demonstrate the effectiveness of each module in our VS Network.
Hyper-parameter settings in LS-TSM There are three hyper-parameters to be set in LS-TSM, i.e., the shift proportion of the channels, the step size of the long-term temporal shift, and the values of α and β. We first conduct experiments on the shift proportion of the channels.
The results are illustrated in Fig. 5. Note that the temporal shift proportion in Fig. 5 is the total proportion of channels involved in the temporal shift operation; for example, a proportion of c/8 means that half of these channels (c/16) perform the short-term temporal shift and the other half (c/16) perform the long-term temporal shift. From Fig. 5, we can observe that as the proportion of the temporal shift increases, the accuracy of our model first increases and then decreases. When the temporal shift proportion is c/4, i.e., c/8 for the short-term shift and c/8 for the long-term shift, the accuracy of the model reaches its peak. Hence, we select c/4 of the channels to perform the temporal shift operation in our method. We then conduct an ablation study on the step size of the long-term temporal shift operation. The results are presented in Table 3, where the step size is set to T/4, T/3, T/2, and 3T/4, respectively. Obviously, as the step size increases, the accuracy of our model with LS-TSM first increases and then decreases, reaching a peak when the step size is T/2. Therefore, setting the step size of the long-term temporal shift operation to T/2 is more suitable in our case.
Finally, we determine the values of α and β in LS-TSM. For a more concise comparison, we fix α to 1 and vary the value of β; the results are shown in Fig. 6. According to these results, we set α = 1 and β = 1.5.

Semantic reconstruction method choice
We build four kinds of LRM with the four different semantic reconstruction methods. The VS Network with LS-TSM and SAM is first used as the baseline, and then the four different LRMs are added to the baseline to construct four kinds of VS Networks. The results are displayed in Table 4. Compared with the baseline, all semantic reconstruction methods except Method 4 achieve obvious improvements, i.e., 0.37%, 0.19%, and 0.27%, respectively, which indicates the effectiveness of incorporating semantic relationship information into the network. Among these methods, the self-attention based method (Method 1) achieves the best performance, while the MLPMixer based method (Method 4) only achieves results comparable to the baseline. Evidently, the self-attention mechanism can better model the semantic information. This may be because self-attention was originally proposed for language processing, and thus it is more suitable for handling semantic information.

The effectiveness of each module
To demonstrate the effectiveness of each module, we add the three modules, i.e., LS-TSM, SAM, and LRM, to the backbone and evaluate their performance. The results are summarized in Table 5. Obviously, every module brings an improvement to the model. When LS-TSM is used, our model outperforms TSM by 0.49% on EgoGesture and 0.63% on NVGesture. The LS-TSM is proposed to model temporal information, which is helpful for recognizing dynamic gestures. To verify this, we split the data in EgoGesture into static gestures and dynamic gestures and evaluate the performance of our model on each. The results are displayed in Table 6. Obviously, the improvement on dynamic gestures is larger than that on static gestures, which indicates the effectiveness of LS-TSM in modeling different ranges of temporal information. Some qualitative experiments with LS-TSM are also performed, with the results illustrated in Fig. 7. We observe that TSM only focuses on part of the hand movement, which can be attributed to the fact that TSM only captures short-term temporal information, while our method with LS-TSM pays attention to the whole process of the hand movement and thus makes accurate classifications. With LS-TSM, the VS Network can pay attention to both long-term and short-term temporal information.
When only SAM is added, we observe 4.09% and 4.87% improvements on EgoGesture and NVGesture, respectively, which demonstrates the effectiveness of the proposed SAM. When SAM and LS-TSM are added simultaneously, we also observe 0.61% and 0.62% improvements on EgoGesture and NVGesture, respectively. To visualize the effectiveness of SAM, we show some examples of the class activation maps of TSM and of our model with SAM in Figs. 8 and 9. Generally, when SAM is added, the model pays more attention to the hands, i.e., where the gesture is performed. In Fig. 9, we visualize an example of the gesture "Rotate fists counterclockwise". In this example, due to the center crop in data pre-processing, the models can only observe part of the hands. The hand movement is therefore relatively small, which makes it more difficult for TSM to determine which part of the image to focus on; it even pays more attention to irrelevant parts of the image, i.e., the chairs in the background. Thus, TSM misclassifies the sample as the gesture "Sweep diagonal", a totally irrelevant action. Thanks to SAM, our method focuses exactly on the movement of the hands and is therefore capable of determining the right class.
To prove the effectiveness of our LRM, we add it alone to ResNet50 and evaluate the model's performance. Adding only the LRM brings 1.37% and 2.25% improvements on EgoGesture and NVGesture, respectively. When adding the LRM to the model with LS-TSM and SAM, we also observe 0.37% and 1.44% improvements on EgoGesture and NVGesture, respectively. The improvement on NVGesture is more obvious than that on EgoGesture, probably because there are more gesture types in EgoGesture, which makes the relations between gestures more complex to model. To further verify the effectiveness of LRM, we visualize the confusion matrices of the models with and without LRM on NVGesture, as shown in Fig. 10. With LRM, the confusion matrix is better concentrated along the diagonal. We enlarge the confusion matrix of the first four classes; obviously, the values along the diagonal are larger when LRM is used, which indicates the gestures are better classified.
Meanwhile, to further illustrate the effectiveness of LRM, we plot the t-distributed stochastic neighbor embedding (t-SNE) diagrams of our method with and without LRM in Fig. 11. Obviously, the distribution of feature instances of the same class is more compact with LRM. In particular, as shown in the green boxes in the two t-SNE diagrams, before LRM is used, the inter-class distances among classes 17 (sweep circle), 18 (sweep cross), and 19 (sweep checkmark) are relatively large. By comparison, large intra-class distances are also observed, which results in many misclassifications among these three types of gestures as well as many instances of other categories (such as classes 15, 16, and 33). After LRM is used, the intra-class distances of the three categories (17, 18, 19) become more compact. Meanwhile, because LRM incorporates the relationships among different gesture classes, the inter-class distances among these gestures with similar semantics also become closer, so there are fewer misclassified instances from other categories, such as classes 15, 16, and 33. Additionally, with LRM, the classification boundaries of the three gestures are much clearer. Similarly, we enlarge the yellow boxes in the two t-SNE diagrams. Before LRM is adopted, the classification in this region is confused, with classes 40, 57, 56, and 20 mixed together. After LRM is adopted, the inter-class distances of these classes increase and the intra-class distances decrease, which is very conducive to classification.

Conclusion
In this paper, we propose two modules, i.e., LS-TSM and SAM, to address the spatio-temporal modeling problems in gesture recognition. LS-TSM is presented to model different ranges of temporal information of gestures, and SAM is proposed to focus on where the gesture primarily occurs. In addition, existing works in gesture recognition take little account of gesture relations; we therefore propose a label relation module (LRM) to model the semantic relations among gestures and utilize these relations to assist gesture recognition. To obtain a better semantic modeling effect, we introduce four types of semantic reconstruction methods to form four types of LRM and evaluate them to find the best choice. Compared with existing works, our method models temporal information more efficiently, obtains a spatial attention map more intuitively, and models the relations among gestures more creatively. Thus, we achieve state-of-the-art performance on both EgoGesture and NVGesture. Extensive experiments are conducted to fully demonstrate the effectiveness of the proposed modules.

Figure 4
Figure 4 The label relation module and four semantic reconstruction methods. (a) is the implementation of the label relation module; (b), (c), (d), and (e) are the four different forms of semantic reconstruction methods

Figure 5
Figure 5 Ablation study of the shift proportion of temporal shift operation

Figure 6
Figure 6 Ablation study of the value setting of α and β. The value of α is fixed to 1. The horizontal axis is the value of β

Figure 7 Figure 8 Figure 9
Figure 7 Visual example of the long- and short-term temporal shift module (LS-TSM). The first row shows the raw frames sampled from a video. The second row shows the class activation maps of TSM. The third row shows the class activation maps of our model with LS-TSM

Figure 11
Figure 11 The t-SNE diagrams of our model with and without LRM

Table 1
Comparison with the state-of-the-art methods on the EgoGesture dataset and the NVGesture dataset. "Frame" in the fourth column indicates the number of frames used for EgoGesture and NVGesture, respectively. For instance, 8/16 frames means 8 frames for EgoGesture and 16 frames for NVGesture

Table 2
Comparison with the state-of-the-art methods on the LD-ConGR dataset

Table 3
Experiments on the step size of long-term shift operation in LS-TSM

Table 4
Semantic reconstruction method choice in LRM

Table 5
The effectiveness of LS-TSM, SAM and LRM

Table 6
Performance comparison of the models with and without LS-TSM on dynamic and static gestures