Spatial–temporal transformer for end-to-end sign language recognition

Continuous sign language recognition (CSLR) is an essential task for communication between hearing-impaired people and hearing people, and aims to align low-density video sequences with high-density text sequences. Current CSLR methods are mainly based on convolutional neural networks. However, these methods balance spatial and temporal features poorly during visual feature extraction, making it difficult to improve recognition accuracy. To address this issue, we designed an end-to-end CSLR network: the Spatial-Temporal Transformer Network (STTN). The model encodes and decodes the sign language video as a predicted sequence that is aligned with a given text sequence. First, since the image sequences are too long for the model to handle directly, we chunk the sign language video frames, i.e., "image to patch", which reduces the computational complexity. Second, global features of the sign language video are modeled at the beginning of the model, and the spatial action features of the current video frame and the semantic features of consecutive frames in the temporal dimension are extracted separately, so that visual features are fully extracted. Finally, the model uses a simple cross-entropy loss to align video and text. We extensively evaluated the proposed network on two publicly available datasets, CSL and RWTH-PHOENIX-Weather multi-signer 2014 (PHOENIX-2014), and the results demonstrate the superior performance of our work on the CSLR task compared with state-of-the-art methods.


Introduction
Sign language is the primary language of the hearing-impaired community and consists of various gestural movements, facial expressions, and head movements. According to the World Health Organization (WHO) [1], 466 million people worldwide suffer from hearing loss, accounting for more than 5% of the world's population, and nearly 2.5 billion people are expected to suffer from hearing impairment by 2050. Therefore, the development of sign language recognition (SLR) technology is of great importance for daily communication between hearing-impaired and hearing people as well as for social development. Traditional SLR methods are limited to static gestures and isolated words [2][3][4]. In contrast, continuous sign language recognition (CSLR) better meets the communication needs of the hearing-impaired. Compared with SLR, CSLR methods process sign language video that contains rich semantic movements [5,6], whose magnitudes are more localized and detailed.
Video sequences for CSLR are longer and more complex, and require feature and semantic learning in sequential frame sequences [6,7]. It is challenging to map low-density sign language video sequences to the corresponding high-density natural language sequences. In real-world scenarios, sign language videos contain complex life scenes [8], and thus, there are long-term semantic dependencies in the videos. Each video frame is correlated not only with adjacent video frames but also with distant video frames. Typically, CSLR requires the detection of key frames in sign language videos. Spatial feature sequences are extracted from key frames using convolutional neural networks (CNN), and then, temporal features are fused by recurrent neural networks (RNN).
To achieve high recognition accuracy, feature extraction from sign language sequences is especially critical. However, existing methods [7][8][9][10] have difficulty capturing detailed temporal dynamics over long intervals due to insufficient feature extraction. Therefore, adequately capturing visual features in sign language videos, especially long-term semantic dependencies, and extracting the corresponding video contextual features are key issues in CSLR. In addition, all values in the output coding vector C of the RNN encoder [11] contribute equally, which leads to information loss for long sequences. The inability of such models to run in parallel is another major problem. In contrast, the Transformer [12] has strong semantic feature extraction and long-range feature capture capabilities, focusing not only on local information but also on global information from low-level features, thereby constructing global connections between key points.
To solve the long-term semantic dependence of sign language video, we propose a temporal and spatial feature extraction method based on Transformer. The proposed model can capture the spatial feature information of video frames while focusing on the contextual semantic information of consecutive frames. The model can extract the rich sign language features more efficiently and thus improve the recognition accuracy. The work in this paper is based on the traditional Transformer model and combines the characteristics of sign language video sequences for network design. Specifically, we perform a patch chunking operation on sign language video frames to facilitate model learning and training and propose a Spatial-Temporal Transformer model for CSLR (shown in Fig. 1).
The main contributions of this paper are as follows:
1. We propose a convolution-free sign language recognition network that contains a spatial-temporal (ST) feature encoder and a dynamic decoder. The ST feature encoder distinguishes temporal and spatial features: part of the attention module focuses only on contextual features in the temporal dimension, while the other part extracts the spatial dynamic features of the video frames. With this design, the extraction of sign language video features is enhanced by aggregating the attention results from different heads.
2. For the long frame sequences of sign language videos, a patch operation is designed to map them into easy-to-process sequences. This operation reduces the computational complexity and facilitates the processing of sign language videos.
3. We designed a progressive learning strategy to explore the effect of frame size and patch size on recognition results. We conducted experimental evaluations on two widely used datasets, CSL and PHOENIX-2014, and our method obtained competitive performance compared with several recently proposed methods.
The remainder of this paper is organized as follows: after reviewing related work in Section II, we present the implemented architecture in Section III. In Section IV, we present the experimental results. Finally, in Section V, we draw conclusions and discuss future work.

Related work
Sign language recognition can be classified into isolated word recognition [3][4][5] and continuous sign language recognition [6,7], depending on whether the signing is continuous. Early SLR relied on manual feature extraction, including handshape, appearance, and motion trajectory [8]. The sign language video is first converted into a high-dimensional feature vector by a visual encoder, and then a decoder learns the mapping from this feature vector to semantic text. Initially, CSLR was also based on the recognition of individual isolated words. Such isolated-word-based CSLR involves temporal segmentation algorithms [9], which is a complex process with a high misclassification rate caused by segmentation errors. With the development of deep learning, recent CSLR methods have turned to the automatic extraction of sign language features using deep neural networks.
According to the input modality, recognition methods can be divided into single-modality and mixed-modality methods, where single modality refers to RGB video as the only input and mixed modality adds skeleton, depth, optical flow, and other information to the RGB video [8]. Current recognition methods mainly focus on a single modality. To extract the visual features of sign language videos, most research uses convolutional networks to extract feature sequences from videos, which generally means extracting spatial features with two-dimensional (2DCNN) or three-dimensional convolutional networks (3DCNN) and then modeling temporal dependence with RNNs.
Oscar Koller et al. [10] embedded a model combining a CNN and Long Short-Term Memory (LSTM) in each Hidden Markov Model (HMM) stream, relying on the sequence constraints of independent HMM streams to learn sign language, mouth-shape, and hand-shape classifiers in parallel, which reduced the single-stream HMM Word Error Rate (WER) to 26.5% and the dual-stream WER to 24.1% on the RWTH-PHOENIX-Weather multi-signer 2014T dataset (PHOENIX-2014-T, an extended version of PHOENIX-2014 mainly used for sign language translation tasks). Cihan Camgoz et al. [7] proposed a deep, end-to-end CSLR framework using the SubUNets approach to improve intermediate representation learning. Cui et al. [13] developed a CSLR framework combining a CNN and a Bi-directional Long Short-Term Memory (Bi-LSTM) model with an iterative optimization strategy to obtain representative features from the CNN; experiments on the PHOENIX-2014 database and the SIGNUM signer-dependent set decreased the WERs to 24.43% and 3.58%, respectively. The VAC network proposed by Min et al. [14] uses a 2DCNN to extract frame features and then uses one-dimensional convolutional networks (1DCNN) for local feature extraction, with two auxiliary modules added for alignment supervision; on the PHOENIX-2014 dataset, the WER was reduced to 21.2%.
Fig. 1 Our proposed end-to-end Spatial-Temporal Transformer Network for CSLR
Since sign language sequences require strong temporal correlation between frames, 3D convolution has been adopted for temporal feature extraction. Pu et al. [11] proposed a CNN-based model for continuous dynamic CSLR from RGB video input. They generated pseudo-labels for video clips from a sequence learning model with Connectionist Temporal Classification (CTC) and fine-tuned a 3D-ResNet under the supervision of the pseudo-labels for a better feature representation. Their method was evaluated on the PHOENIX-2014 dataset and reduced the WER to 38.0%. Huang et al. [15] proposed a video-based CSLR method without temporal segmentation, based on a 3DCNN network and a hierarchical attention network for recognition. Yang et al. [16] proposed a shallow hybrid CNN that uses both 2D and 3D convolutions, coupled with two LSTM networks for gloss-level and sentence-level sequence modeling, respectively.
Although CNNs have strong feature extraction capability, they are limited to feature extraction from single-frame images. The limited receptive field of 3D convolution leads to insufficient extraction of long-term time-dependent features. A convolutional network that generates a single feature vector representing the whole video with an average pooling layer completely ignores the sequential relationship of video frames, losing the temporal and contextual information of the sign language video. Moreover, convolutional networks must stack multiple layers to capture global information, which can lead to problems such as a low learning rate and difficulty in transmitting information over long distances.
With the explosive application of the Transformer [17] in machine translation, its strength in modeling long-range sequences has been widely adopted in the vision field, which can alleviate some of the feature extraction problems in CSLR. From 2020, the transformer started to make a splash in the Computer Vision (CV) field: ViT [18] for image classification, DETR [19] for object detection, semantic segmentation (SETR [20], MedT [21]), image generation (GANsformer [22]), and video understanding (TimeSformer [23]), among others. M. Rosso et al. [24] employed ViT for the first time in the road tunnel assessment field, where the vision transformer provides overwhelming results for automatic road tunnel defect classification. L. Tanzi et al. [25] applied a ViT architecture to femur fracture classification, outperforming state-of-the-art CNN-based approaches. Camgoz et al. [26] introduced the Transformer architecture for joint end-to-end CSLR and translation, with superior translation performance on the PHOENIX-2014-T dataset.
The length of sign language sequences makes the transformer's computation highly complex, so it is impractical to flatten a sign language video sequence directly as the Transformer input. To this end, we improved the transformer structure. We propose a patch operation for video frames, which reduces the input dimension of video frame sequences, alleviating the computational burden in the first place while facilitating feature extraction. Since the standard transformer cannot distinguish between the temporal and spatial features of sign language videos, we designed an ST dual-channel feature extraction network to extract contextual features and dynamic features separately, making visual feature extraction more adequate.

Spatial-temporal transformer networks for sign language recognition
We propose to consider CSLR as a vector mapping from a low-density video sequence to a high-density sign language text sequence. The mapping is presented in Eq. (1):

F : X ∈ R^(s×m) → Y ∈ R^(t×n), (1)

where X and Y represent the video sequence and text sequence, s and t represent the dimensions, and m and n represent the lengths.
In this work, we propose a new end-to-end Spatial-Temporal fusion Transformer Network (STTN) for CSLR. Its architecture is shown in Fig. 1, which was already introduced in the introduction. The model consists of three main parts: sign language video sequence vectorization, ST feature extraction, and feature decoding. First, video frames are sampled uniformly, and the sampled frames are patched and position encoded. Second, the patch sequences with position information are input to the encoder part of the model. Third, temporal and spatial features are extracted and fused by the ST encoder. Finally, the fused features are fed to the decoder for decoding and prediction.

Patch embedding
The standard transformer input is a one-dimensional sequence of token embeddings. The input in our method is a vector sequence of dimension f ∈ R^(B×T×C×H×W) (where B is the batch size, T is the number of frames, C is the number of channels, and H and W represent the height and width of each sign language frame, respectively). We reshape each frame z_i ∈ R^(C×H×W) of the T-frame sign language video into a sequence of 2D blocks of size P × P, where h × w (h = H/P, w = W/P) is the number of blocks per frame, which directly affects the length of the input sequence. A constant hidden vector size d_model is used in all layers, and a linear projection maps each flattened block to the size d_model = D, where D represents the specific dimensional value. The output of this projection is the patch embedding. At this point, the size of the feature map is B × T × N × D, where N is the product of h and w. The output of the patch embedding is noted as f_0^(p,t), where p indexes the patches and t indexes the frames.
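The patch embedding described above can be sketched in PyTorch as follows; the module name, the parameter values (P = 32, D = 512), and the use of a strided convolution as the flatten-and-project step are our own illustrative assumptions, not the paper's released code:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split each frame into non-overlapping P x P patches and project to d_model."""
    def __init__(self, patch_size=32, in_channels=3, d_model=512):
        super().__init__()
        # A P x P convolution with stride P is equivalent to flattening
        # each patch and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, d_model,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                        # x: (B, T, C, H, W)
        B, T, C, H, W = x.shape
        x = self.proj(x.flatten(0, 1))           # (B*T, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B*T, N, D), N = (H/P)*(W/P)
        return x.reshape(B, T, -1, x.size(-1))   # (B, T, N, D)

frames = torch.randn(2, 48, 3, 224, 224)         # toy batch of 48-frame clips
emb = PatchEmbedding(patch_size=32, d_model=512)(frames)
print(emb.shape)  # torch.Size([2, 48, 49, 512])
```

With 224 × 224 frames and P = 32, each frame yields N = 7 × 7 = 49 tokens, matching the B × T × N × D feature map described above.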

Positional embedding
To prevent the loss of position-related information in the network, the feature map of dimension f_0 ∈ R^(B×T×N×D) needs to be position encoded before entering the encoder for feature extraction. Position encoding requires that each position has unique position information and that the relationship between two positions can be modeled by affine transformations between them. It has been verified experimentally that the sinusoidal functions in Eqs. (2) and (3) satisfy these requirements:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), (2)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). (3)

The positional encoding and the patch embedding have the same dimensionality d_model, so the two can be summed directly. The vector after positional embedding is noted as Z_i^(p,t), as specified in Eq. (4):

Z_i^(p,t) = X E + e_pos^(p,t), (4)

where X denotes the vector corresponding to each patch, E is a learnable matrix, and the result of the multiplication is added to the position code e_pos^(p,t).
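The sinusoidal encoding of Eqs. (2) and (3) can be computed directly; this NumPy sketch illustrates the standard formula, not the authors' implementation:

```python
import numpy as np

def positional_encoding(length, d_model):
    """Sinusoidal encoding: even dims use sin (Eq. 2), odd dims use cos (Eq. 3)."""
    pos = np.arange(length)[:, None]                   # (length, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # (length, d_model/2)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(48, 512)
print(pe.shape)   # (48, 512)
print(pe[0, :4])  # position 0 -> [0. 1. 0. 1.]
```

Each row is unique, and because sin/cos of shifted positions are affine combinations of the original row, relative positions remain recoverable, which is the property required above.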

Encoder
Like most seq2seq models [32], the original Transformer consists of an encoder and a decoder. The encoder and decoder each consist of N = 6 identical layers, each layer containing two sub_layers: a multi-head self-attention (MHA) mechanism and a fully connected feedforward network (FFN). Each sub_layer is wrapped with a residual connection and layer normalization, expressed as Eq. (5):

sub_layer_output = LayerNormalization(x + sub_layer(x)). (5)

The transformer's attention is a linear combination Σ_i a_i v_i of all word vectors v_i in the encoded sentence, based on the learned attention weight matrix a_i, used to perform decoded prediction with attention. Multi-head self-attention, in turn, projects the "query" Q, "key" K, and "value" V through different linear transformations for each of the heads ("heads" is the number of attention heads). The projection process is shown in Fig. 3.
The results of the different attention heads are concatenated as shown in Eq. (6) and schematically visualized in Fig. 4:

MHA(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V). (6)

In addition, the attention calculation mostly uses scaled dot-product attention (the calculation process is shown in Fig. 5), given by Eq. (7):

Attention(Q, K, V) = softmax(Q K^T / √d_k) V. (7)

In general, since CSLR is highly ST dependent, it is difficult to capture temporal features while also capturing spatial features. In this paper, to remedy this problem, we propose an ST encoder structure for the dynamic spatial correlation and long-term temporal correlation of sign language videos. As presented in Fig. 6, the proposed ST encoder is composed of a spatial-attention block and a temporal-attention block. The incoming sign language video vector is divided into two channels for processing temporal and spatial attention, and the extracted features are then concatenated. By using dynamic directed spatial correlation and long-term temporal correlation, the model's ability to extract and encode the features of sign language video frames is enhanced.
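Eqs. (6) and (7) can be sketched as follows; the shapes and head count are illustrative assumptions, and the output projection W^O is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Eq. (7): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (..., L, L)
    return F.softmax(scores, dim=-1) @ v            # (..., L, d_k)

# Eq. (6): run h heads in parallel, then concatenate the per-head results.
B, h, L, d_k = 2, 8, 49, 64
q = k = v = torch.randn(B, h, L, d_k)               # toy projected tensors
heads = scaled_dot_product_attention(q, k, v)       # (B, h, L, d_k)
mha = heads.transpose(1, 2).reshape(B, L, h * d_k)  # concat -> (B, L, 512)
print(mha.shape)  # torch.Size([2, 49, 512])
```

In a full layer, q, k, and v would come from learned linear projections W_i^Q, W_i^K, W_i^V, and the concatenated result would pass through W^O.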
The spatial self-attention block performs the MHA calculation only over the different tokens of the same frame; the attention value of each patch (p, t) in the spatial dimension is computed over the N patches of frame t, where N is the number of patches. The temporal self-attention block calculates MHA only over the tokens at the same position in different frames, computed over the M frames at patch position p, where M is the number of frames. The computed temporal attention and spatial attention are then concatenated.
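A minimal sketch of this divided spatial/temporal attention, assuming single-head attention without learned projections for brevity; the factorization (fold frames into the batch for spatial attention, fold patch positions into the batch for temporal attention, then concatenate the two channels) is the point being illustrated, not the paper's exact code:

```python
import torch

def attn(x):
    """Plain single-head self-attention over the middle dim; x: (batch, L, D)."""
    s = torch.softmax(x @ x.transpose(-2, -1) / x.size(-1) ** 0.5, dim=-1)
    return s @ x

B, T, N, D = 2, 48, 49, 64
tokens = torch.randn(B, T, N, D)

# Spatial block: attend among the N patches of the SAME frame.
spatial = attn(tokens.reshape(B * T, N, D)).reshape(B, T, N, D)

# Temporal block: attend among the T frames at the SAME patch position.
temporal = tokens.permute(0, 2, 1, 3).reshape(B * N, T, D)
temporal = attn(temporal).reshape(B, N, T, D).permute(0, 2, 1, 3)

fused = torch.cat([spatial, temporal], dim=-1)  # concatenate the two channels
print(fused.shape)  # torch.Size([2, 48, 49, 128])
```

Each spatial attention matrix is only N × N and each temporal one T × T, instead of the (T·N) × (T·N) cost of joint attention over all tokens.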

Decoder
The role of the decoder is to predict the next possible 'value' based on the results of the encoder and the previous predictions. As shown in Fig. 7, each decoder layer consists of three sub_layers: the first includes a multi-head self-attention layer, a normalization layer, and a residual connection; the second includes a multi-head cross-attention layer, a normalization layer, and a residual connection; the third contains an FFN, a normalization layer, and a residual connection. There is only one output on the encoder side; it is passed into every decoder layer, where it acts as the K and V of the multi-head attention mechanism in the second sub_layer (cross_attn). To decode the encoded feature vectors, we employ three operations on them: self-attention, cross-attention, and linear mapping. In addition, for the subsequent alignment operation, we perform positional encoding of the sign language text to ensure the normal text language order, as shown in Eqs. (13) and (14):

PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), (13)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). (14)
The position encoding in this section is the same as the position encoding module used for video frames in the encoder section. To ensure that each token may only use its predecessors when extracting contextual information, we apply a mask operation to the attention computation. To facilitate the probability calculation, we linearly map the vector generated by the decoder stack to a larger vector, the 'logits' vector. We then apply the softmax function and select the word corresponding to the highest-probability cell as the output of the current time step, as shown in Fig. 8.
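The mask operation can be illustrated with a standard causal (look-ahead) mask; this is a generic sketch of the technique, not the paper's code:

```python
import torch

def causal_mask(L):
    """Upper-triangular mask: position i may attend only to positions <= i."""
    return torch.triu(torch.full((L, L), float('-inf')), diagonal=1)

mask = causal_mask(4)
scores = torch.zeros(4, 4) + mask        # add mask to the attention scores
weights = torch.softmax(scores, dim=-1)  # -inf entries become zero weight
print(weights[1])  # row 1 attends only to positions 0 and 1 -> [0.5, 0.5, 0, 0]
```

At inference time the decoder then takes the argmax of the softmaxed logits at each step (greedy decoding) and feeds the chosen token back in as the next input.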

Dataset
As previously stated, to validate the proposed method, this study conducts experiments on two publicly available datasets: PHOENIX-2014 [6] and CSL [9]. The data composition of the two public datasets and the division of training and test samples are shown in Table 1. The Chinese Sign Language dataset (CSL) contains video instances from 50 signers, each repeated 5 times, containing 25 K labeled videos totaling over 100 hours. The dataset is divided into isolated words and continuous utterances, containing RGB, depth, and skeleton-node data, with 500 classes of words, each containing 250 samples, and sequences of 21 skeleton-node coordinates. There are 100 sentences and a total of 25,000 videos; each sentence contains an average of 4 to 8 words. Each video example is labeled by a professional CSL teacher.

Evaluation metric
For CSLR, substitution, deletion, or insertion of certain words is necessary to make the recognized word sequence consistent with the standard word sequence. The WER is a metric that measures the performance of a CSLR system. It compares the model output at the current parameter values with the actual correct sign language sentence and is defined as WER = (S + D + I) / N, where S is the number of substitutions, D the number of deletions, I the number of insertions, and N the number of words in the reference; an example is shown in Fig. 9.
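WER = (S + D + I) / N can be computed with a word-level Levenshtein distance; a minimal sketch (the example sentences are invented):

```python
def wer(reference, hypothesis):
    """WER = (S + D + I) / N via edit distance over word sequences."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = min edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# 1 deletion, 1 substitution, 1 insertion over a 5-word reference -> 3/5
print(wer("he buys a red apple", "he buys red pear today"))  # 0.6
```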

Implementation detail
Our model is built on the PyTorch platform [33], and the experimental environment is a 12 GB NVIDIA RTX 3060 GPU. On the CSL dataset, we extract 60 frames for each sign language video by uniform frame sampling, then randomly discard 12 frames and use the remaining 48 frames as valid inputs. The size of each frame is first adjusted to 256 × 256. The Adam optimizer is used, with the learning rate and weight decay set to 10^-4 and 10^-5, respectively. Table 2 lists the default parameters for our experiments (where D_model is the dimensionality of the patch embeddings). Dropout is set to 0.5 to mitigate overfitting. We apply the cross-entropy classification loss on the predictions p with ground-truth targets t to train the model, as in Eq. (16):

L_CE = −Σ_i t_i log(p_i), (16)

where t is the true label value and p is the predicted probability value. It characterizes the difference between the true sample label and the predicted probability.
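Eq. (16) can be illustrated numerically; the toy probabilities and one-hot targets below are invented for the example:

```python
import numpy as np

def cross_entropy(p, t):
    """Eq. (16): L = -sum_i t_i * log(p_i), averaged over the batch.
    t: one-hot ground-truth targets, p: predicted probabilities."""
    return -np.sum(t * np.log(p + 1e-12), axis=-1).mean()  # eps avoids log(0)

# Toy vocabulary of 4 glosses, batch of 2 predictions.
p = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.25, 0.25, 0.25, 0.25]])
t = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]])
print(round(float(cross_entropy(p, t)), 4))  # (-ln 0.7 - ln 0.25) / 2 = 0.8715
```

A confident correct prediction (row 1) contributes a small loss, while a uniform guess (row 2) contributes ln 4 ≈ 1.386, so minimizing the loss pushes probability mass onto the true gloss.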

Impact of video size
Different sizes of the extracted video frames affect the extracted features, which in turn affects the final results of the model. To explore the behavior of the model under different cropping ratios, we compared three sizes: 224 × 224, 112 × 112, and 256 × 256. The comparison covered three aspects: model size, WER, and experiment time. According to the results in Table 3, the model size does not change across the three scales. Compared with 224 × 224, 112 × 112 takes up the least video memory (40.5% lower) and the shortest time per 200 iterations (47.11% lower), but its accuracy is worse than at 224 × 224 (34.2% lower). The 256 × 256 setting could not be run successfully because it takes up too much memory. According to the experimental results, the best results were achieved with the 224 × 224 setting, where the WER was reduced to 19.94%.

Impact of patch size
Sign language videos often contain long sequences, which can cause computational difficulties when fed directly into the network. To solve this problem, we divide the sign language video frames into small blocks, i.e., image to patch. To investigate the effect of different patch sizes P (the side length of the blocks into which each video frame is divided) on the experimental results, we set P to 8, 16, and 32 for comparison. Experiments are carried out from two perspectives: accuracy and time. The extracted video frames are uniformly rescaled to 224 × 224.
Fig. 9 An example of sentence recognition results from the Chinese Sign Language dataset (CSL). For each Chinese word, we have also provided the corresponding English translation in brackets. It shows five predicted sentences with decreasing error rates from prediction 1 to 5. The first row represents the input frame sequence. The boxes with a red background indicate wrong predictions, and the red symbols indicate the wrongly predicted words. "D", "S", and "I" represent deletion, substitution, and insertion, respectively
According to Table 4, the patch size directly affects the length of the sequence but has a smaller effect on the parameters of the model. When the patch size is 32, the number of patches is the smallest and the computation occupies the least memory. Although the computation time is slightly increased, the model can be computed faster by increasing the batch size. Through experimental comparison, we conclude that the best result is achieved when the patch size is set to 32.

Impact of the proposed modules
In this section, we further verify the effectiveness of the ST module in the STTN architecture. In this part of the experiment, we set the video size to 224 × 224 and the patch size to 32. As shown in Table 5, the first row represents the original transformer network, whose best WER is 25.11%; the second row represents the architecture after adding our ST encoder, where the WER drops to 19.94%, a clear relative improvement. From Fig. 10, we can see that the curve after adding the ST module is significantly better than before, not only achieving a lower WER but also converging faster. These results suggest the effectiveness of our method.

Comparison with state-of-the-art
We compared the proposed algorithm with state-of-the-art CSLR methods (Min et al. [29] 2021; Pu et al. [9] 2020; Hao et al. [30] 2021; Camgöz et al. [24] 2020) using the most common metric, WER; the results on PHOENIX-2014 are shown in Table 6 and the results on CSL in Table 7. Pu et al. enhanced features by editing real video-text pairs to generate corresponding pseudo video-text pairs, which, although achieving a good result (WER dropped to 21.3%), did not take full advantage of the visual properties of the sign language videos themselves. Min et al. proposed a visual alignment constraint (VAC) method based on the ResNet18 network to enhance feature extraction with additional alignment supervision, and Hao et al. proposed a Self-Mutual Knowledge Distillation (SMKD) method that enforces the visual and contextual modules to focus on short-term and long-term information, enhancing the discriminative power of both modules simultaneously. VAC reduces the WER to 21.2% and SMKD reduces it to 20.8%, which shows that their methods are effective and also demonstrates the need to pay more attention to the visual features of the sign language video itself. Camgöz et al. proposed an "SLT" model for joint end-to-end sign language recognition and translation, but this method could not distinguish and extract the temporal and spatial features of sign language videos well, even though these features are crucial. In Tables 6 and 7, the best results are marked in bold, and entries denoted by "*" used extra cues (such as keypoints and tracked face regions).
From Tables 6 and 7, it can be seen that our proposed method (STTN) achieves good performance, with WERs falling to 19.94% and 1.2% on the PHOENIX-2014 and CSL datasets, respectively. This demonstrates that jointly modeling the long-term temporal dependence and dynamic spatial dependence of sign language videos allows the visual properties of sign language videos to be learned better.

Results' visualization
To better understand the learning process, we selected a sentence from the CSL dataset for sequence visualization, shown in Fig. 9 and already mentioned in previous sections, where different prediction sequences correspond to different WER values. As can be seen from Fig. 11, after the 12th epoch the WER fluctuates only slightly, which indicates that we achieve better results than the previous model. We also selected a random sample from the PHOENIX-2014 dataset, shown in Fig. 12, in which the movements of the signer are displayed in each frame. In addition, we visualized the training effect (the WER curves during training and validation) in Fig. 13: the WER decreases quickly during training and reaches a low point of 14.26 at the 29th epoch. The validation results show that the decline of the WER slows down and stops after the 30th epoch, reaching 19.94%.

Conclusions
Inadequate feature extraction is one of the major problems in current CSLR tasks and directly leads to poor recognition of sign language. In this study, we propose to enhance the feature extraction capability of the network model in both the temporal and spatial dimensions, and to patch the sign language video frames to reduce the computational effort while enhancing the generalization capability of the model, thus allowing the CSLR network to be trained end-to-end. The proposed method does not require a text-related inductive bias module and aligns video and text using a simple cross-entropy loss. The experiments show that our proposed method achieves state-of-the-art performance on the CSL and PHOENIX-2014 datasets, offering new perspectives on vision and natural language processing.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.