1 Introduction

The report on the situation of people with disabilities in Thailand as of September 30, 2023 (published online on November 9, 2023) [17] reveals that the three most common types of disability are as follows: 51.57% (1,155,339 individuals) have mobility/physical disabilities, 18.57% (415,999 individuals) have hearing/speech disabilities, and 8.24% (184,542 individuals) have visual impairments. Notably, individuals with hearing impairments, who rely on sign language, constitute the second-largest of these groups. Effective communication with deaf or hearing-impaired individuals has long been a significant challenge, especially for those unfamiliar with sign language. Thus, research in the field of Sign Language Translation (SLT) has become popular [21] and highly valuable.

Sign language can be categorized into two main types. First is gesture language, which utilizes hand motions and facial expressions to convey specific messages. Second is finger-spelling, employing hand signs to represent letters or numerals for word spelling. Thai sign language has been influenced by international sign languages prevalent in Canada, the United States of America, West Africa, and Southeast Asia. Each country has its own unique sign language, and even within the same country, signs may vary based on the user’s age and the region where the sign language is employed. With the advent of the Covid-19 pandemic, online meetings have become a crucial tool and are expected to become the new norm. However, individuals with hearing impairments may encounter difficulties participating in such meetings. Our research will concentrate on gesture language, which is more suitable for simple conversations, as opposed to finger-spelling, which is predominantly used for name spelling or educational purposes.

In recent times, numerous research projects have emerged focusing on sign language translation, proposing various methods and techniques. These include Machine Learning (ML) models, Neural Networks such as Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and hierarchical-LSTM (HLSTM). Many studies aim to develop models capable of interpreting fundamental gestures in sign language captured in real-time by cameras and translating them into text or audio. However, the most significant challenge faced by sign language translation models is their effectiveness in real-world scenarios. Specifically, factors such as environmental conditions during input capture (e.g., lighting, background, and camera position), occlusion (e.g., hand gestures moving out of the frame), and sign boundary detection can impact the accuracy of the model. In this research, we are seeking a model that excels in real-time translation, demonstrating high accuracy in continuous conversation scenarios.

2 Related works

There are numerous studies on techniques related to sign language translation, and alphabet detection in sign language is a notable area of focus. Unlike methods involving gloves or sensors, the approach presented in [1] leverages image processing techniques for letter recognition in American sign language. Initially, the images undergo conversion into a binary format to isolate hands from the background, followed by contour detection to identify the hand’s outline. Points along the contour are extracted, and the Convex Hull technique is employed to distinguish concave and convex areas, thereby identifying fingers and the spaces between them. Different groups of letters necessitate distinct identification techniques; some may rely solely on contour area, while others may require a combination of parameters. For example, the identification of letters such as D, J, H, I, and U involves considering parameters such as solidity, aspect ratio, and angles.

Before deep learning gained popularity, image classification and feature extraction were achieved through traditional machine learning. Although traditional machine learning consumes significantly fewer resources than deep learning techniques, it typically involves numerous manual thresholds and hand-crafted steps, so nowadays its application is often limited to demonstration purposes. The process of manual feature extraction starts with converting images into feature vectors, and the vector representation is then used as input for further analysis. The work of [2] introduced sign language translation using multidimensional Hidden Markov Models (HMM), utilizing data from a sensory glove to identify hand shape and motion tracking. The HMM extracted the constituent signs implicitly in multiple dimensions through a stochastic process, allowing for interactive learning and recognition. In [3], a Support Vector Machine (SVM) was combined with Hidden Conditional Random Fields to enhance sign recognition without the aid of gloves or wearable tracking devices. Feature extraction was organized in two layers: the static gesture recognition layer was a trained classifier based on SVM, while the dynamic gesture recognition layer, covering discrete and dynamic gesture patterns, applied Hidden Conditional Random Fields to model gesture sequences through emission probabilities. This two-layer design proves useful for capturing linguistic information as a discriminative model, improving system performance without overfitting.

Authors in [4] developed a sign language translation model using image processing and traditional machine learning techniques to translate Indian Sign Language to English. The proposed system uses hand gestures as input and generates audio as output. Hand gestures are captured by a video camera, and captured videos are divided into frames. Image processing techniques, including hand segmentation using the YCbCr color space and histograms of oriented gradients (HOG) for feature extraction, are employed. Classification and recognition stages are achieved through SVM. Finally, the system returns output text, and Google Text to Speech transforms the output text into audio. In the work of [5], real-time Myanmar sign language recognition is studied to translate 30 sign gestures from videos. Hand detection is processed by converting frames into the YCbCr color space for skin detection through thresholding, followed by close and open morphological operations. Background removal is performed using hole-filling operations. The hand frame is then cropped and converted into grayscale. Feature extraction is processed using Principal Component Analysis (PCA), and SVM is applied to classify sign language. The SVM algorithm is trained using 3,000 hand signal images, achieving an accuracy rate for each sign gesture ranging from 80% to 100%. The work of [6] introduces discrete square-shaped Haar wavelet transform, including Haar local binary pattern (LBP) features for decomposition. A 2D point cloud is developed to represent posture as feature extraction. Global and local features are calculated to represent the signer’s hand in video sequences. The multi-class multi-label Adaboost algorithm is applied for pattern recognition and queries the signs in the video.

After the evolution of computer and computation technology, deep neural networks have become increasingly powerful. Numerous researchers in SLT have explored neural network techniques to enhance accuracy. In the work of [7], SLT by CNN is proposed, incorporating hand detection and classification. This paper integrates Single Shot MultiBox Detection (SSD), Inception v3, and SVM on fragments of American sign language (ASL) finger-spelling. The simulation results demonstrate a high accuracy rate. However, the study relies on an isolated image dataset and focuses solely on ASL finger-spelling alphabet translation. To enhance real-world applicability in communication, there is a need for real-time SLT. The work of [8] introduces real-time ASL recognition with CNN, utilizing transfer learning on the pre-trained GoogLeNet architecture and testing with real-time users. This research exhibits strong performance in real-time finger-spelling alphabet translation. Many researchers have primarily focused on isolated finger-spelling alphabets, which may not be suitable for real-time communication. In [9], experiments on SLT to words and sentences are conducted using CNN and LSTM. The study reveals that CNN performs well for isolated sign language recognition, while LSTM excels in continuous word recognition. However, the study indicates a 72% accuracy for LSTM, leaving room for improvement in model accuracy. The work of [10] introduces a Hierarchical Long Short-Term Memory (HLSTM) framework designed to capture continuous sign language and translation in sequential order. Through experiments with various models, it was observed that both HLSTM and HLSTM-attention achieved an accuracy above 90% in continuous conversation scenarios.

Recent research published in 2023 [18,19,20] reveals that there is currently no standard scheme or foundational models for interpreting continuous sign language. Consequently, our study will utilize RNN/Bidirectional RNN, LSTM/Bidirectional LSTM, and FNN-LSTM for real-time translation of Thai sign language to Thai language text during continuous conversation. Instead of relying on conventional image processing methods for object detection or deep learning-based object detection [23], this paper will demonstrate the use of the MediaPipe framework to gather key points for extracting left hand, right hand, face, and posture landmarks. Subsequently, we will classify and train Sign Language Translation (SLT) using a traditional neural network model, particularly within the RNN family, applied to a Thai sign language dataset. Our aim is to simplify the complexity of existing methodologies by leveraging MediaPipe and the LSTM model to achieve a high-performance model suitable for real-time continuous conversation.

3 Proposed methods

In our process design (see Fig. 1), the webcam is activated to capture every sign language gesture. Each frame generates keypoints using the MediaPipe Holistic model [11]. Once the keypoints are collected, they are sequentially fed into the deep learning model. The trained model then predicts the probability of each word. Finally, the result of sign language translation is determined by selecting the word with the highest probability exceeding a specified threshold.

Fig. 1

The overall workflow of this study. The MediaPipe framework is used to extract keypoints from sign gestures, and the resulting dataset is fed to the five models to predict the probability of hand gestures. (Human and model images were collected from www.google.com)

3.1 Data acquisition and pre-processing

In this paper, MediaPipe Holistic is utilized to collect keypoints from the hands, arms, body, and face. MediaPipe provides landmark detection and segmentation on every frame [12]. The acquired keypoints are saved per frame, and a sequence of frames represents a particular sign. The MediaPipe Holistic pipeline integrates models for pose, face, and hand detection, namely MediaPipe Pose, MediaPipe Face Mesh, and MediaPipe Hands. Each model requires different input specifications; for example, the pose estimation model requires a lower image resolution than the face and hand models. MediaPipe Holistic uses these models to generate a total of 543 landmarks (33 pose landmarks, 468 face landmarks, and 21 hand landmarks per hand).

To collect data for the training and testing sets, the process begins by setting up video capture to extract frames using OpenCV. Next, MediaPipe Holistic is instantiated to obtain the holistic model used for detection, and the MediaPipe drawing utilities are employed to draw keypoints on the hands, pose, and face. Creating a detection with MediaPipe involves converting the captured image from BGR to RGB, setting it as unwritable, performing the detection, setting it back to writable, and converting it from RGB back to BGR. This is necessary because the feeds from OpenCV are in BGR format, whereas MediaPipe performs detection in RGB format. The MediaPipe Holistic model operates by making an initial detection and then tracking the keypoints. The MediaPipe draw-landmark function visualizes the landmarks in real time, illustrating the connections between the hand, pose, and head. The keypoint values are then extracted and saved in a NumPy array consisting of a sequence of 30 arrays, each containing the 1662 landmark values of a single frame. Subsequently, folders are established to collect these arrays as part of our sign language dataset. Each action comprises 30 distinct frames (30 sets of keypoints), and 30 videos were gathered for each action. For this study, the dataset was split into a 90%-10% train-test ratio. The dataset employed in this project is outlined in Table 1. The words and sentences in the table were selected for their ease of performance in sign language, avoiding excessive complexity and length.
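To make this pipeline concrete, the following is a minimal Python sketch of the keypoint collection step described above, using OpenCV and MediaPipe Holistic. The helper names (`mediapipe_detection`, `extract_keypoints`) and the output file name are illustrative rather than part of a released implementation:

```python
import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def mediapipe_detection(image, model):
    """Run MediaPipe Holistic on a single BGR frame captured by OpenCV."""
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # MediaPipe expects RGB
    image.flags.writeable = False                   # mark read-only during inference
    results = model.process(image)                  # initial detection + tracking
    image.flags.writeable = True
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)  # back to BGR for OpenCV display
    return image, results

def extract_keypoints(results):
    """Flatten pose, face, and both hands into one 1662-value vector per frame."""
    pose = (np.array([[lm.x, lm.y, lm.z, lm.visibility]
                      for lm in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    face = (np.array([[lm.x, lm.y, lm.z]
                      for lm in results.face_landmarks.landmark]).flatten()
            if results.face_landmarks else np.zeros(468 * 3))
    lh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))
    rh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([pose, face, lh, rh])     # 132 + 1404 + 63 + 63 = 1662

# Example: collect one 30-frame sequence for a single sign
cap = cv2.VideoCapture(0)
sequence = []
with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    while len(sequence) < 30 and cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        _, results = mediapipe_detection(frame, holistic)
        sequence.append(extract_keypoints(results))
cap.release()
np.save("sign_sequence.npy", np.array(sequence))    # shape (30, 1662)
```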

Table 1 Our sign language dataset

3.2 Experimental models

In this work, we aim to identify the most suitable model for training on hand, face, and pose gesture keypoints extracted by MediaPipe Holistic. We are experimenting with five models, namely RNN/Bidirectional-RNN, LSTM/Bidirectional-LSTM, and FNN-LSTM. All hyperparameters presented in this section are the best outcomes achieved through our trial-and-error experiments.

3.2.1 Recurrent neural network (RNN)

Fig. 2

The internal mechanism of RNN. (Source: [14])

The objective of this study is to predict the meaning of hand gestures from a sequence of frames. We propose implementing Recurrent Neural Networks (RNN), as they are well-suited for handling sequential tasks. As mentioned in [13], the RNN architecture is good at handling temporal tasks by referencing inputs from previous stages and calculating them using the same function. Equations regarding the RNN in Fig. 2 are:

$$s_t = f(s_{t-1} W + x_t U + b_h) \qquad (1)$$
$$y_t = s_t V + b_y \qquad (2)$$

\(s_t\) is the vector produced by the hidden-layer function f. It is determined by combining the output vector of the previous state, \(s_{t-1}\), multiplied by the internal weight W, with the current input, \(x_t\), multiplied by the weight U, plus the hidden-layer bias \(b_h\). \(s_t\) is then multiplied by the weight V and combined with the bias \(b_y\) to give the final output of the RNN, denoted \(y_t\). \(s_t\) is also passed on to the next state at \(t+1\), while the weights U, W, and V are shared across time steps.

In the data preparation process, the data is represented as a NumPy array with dimensions of 30 × 1662. The multi-layer RNN architecture depicted in Fig. 3 therefore comprises five hidden layers, with the first layer accepting the 30 × 1662 input shape. The first three layers are recurrent layers with 64, 128, and 64 units, respectively, using the ReLU activation function. The subsequent two dense layers consist of 64 and 32 nodes, also with ReLU activation. Finally, Softmax is applied to the output of the final layer. The model is compiled with Adam as the optimizer and trained for 300 epochs.

Fig. 3

The architecture employed in this study involves RNN and Bidirectional RNN. The three RNN layers are connected to two dense layers to calculate the probability of hand gesture meanings using the Softmax function
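For reference, the multi-layer RNN described above can be expressed as a minimal Keras sketch. The layer sizes, Adam optimizer, and 300 epochs follow the text; the use of `SimpleRNN` layers, the categorical cross-entropy loss, and the five output classes are our assumptions:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

num_classes = 5  # five sign classes used in this study (see Sect. 4)

rnn_model = Sequential([
    # Three recurrent layers over the 30-frame sequence of 1662 keypoint values
    SimpleRNN(64, return_sequences=True, activation='relu',
              input_shape=(30, 1662)),
    SimpleRNN(128, return_sequences=True, activation='relu'),
    SimpleRNN(64, return_sequences=False, activation='relu'),
    # Two dense layers, then Softmax over the sign classes
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(num_classes, activation='softmax'),
])
rnn_model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['categorical_accuracy'])
# rnn_model.fit(X_train, y_train, epochs=300)
```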

3.2.2 Bidirectional RNN (Bi-RNN)

The Bi-RNN architecture depicted in Fig. 4 supports data processing in both the forward and backward directions with separate hidden layers. Ultimately, the outputs from both directions are fed into the same output layer. Since the data is sequential, Bi-RNN can effectively handle such sequential data. The equations for the Bi-RNN are as follows:

$$\overrightarrow{h}_t = H(W_{x\overrightarrow{h}}\, x_t + W_{\overrightarrow{h}\overrightarrow{h}}\, \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}}) \qquad (3)$$
$$\overleftarrow{h}_t = H(W_{x\overleftarrow{h}}\, x_t + W_{\overleftarrow{h}\overleftarrow{h}}\, \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}}) \qquad (4)$$
$$y_t = W_{\overrightarrow{h}y}\, \overrightarrow{h}_t + W_{\overleftarrow{h}y}\, \overleftarrow{h}_t + b_y \qquad (5)$$
Fig. 4

The internal mechanism of Bidirectional RNN. (Source: [15])

Based on [13], the architecture of Bi-RNN is similar to that of the RNN shown in Fig. 3. The weights and input data of the current and previous states are multiplied and summed with the bias \(b_{h}\). After the forward and backward hidden layers are computed, their outputs are fed to the same output layer, multiplied by the weights, and added to the bias \(b_y\) to yield the final result \(y_t\). In our study, the structure of the Bi-RNN mirrors that of the RNN so that the efficiency of the two models can be compared directly. The Bi-RNN therefore consists of five hidden layers. The first layer accepts the input shape of 30 × 1662. The first three layers comprise 64, 128, and 64 nodes, respectively, with ReLU as the activation function. The subsequent two dense layers consist of 64 and 32 nodes, also with ReLU activation. Finally, Softmax is applied to the output of the final layer. The model is compiled with Adam as the optimizer and trained for 300 epochs.
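A corresponding Bi-RNN sketch, under the same assumptions as the RNN example, simply wraps each recurrent layer with Keras's `Bidirectional` wrapper:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, SimpleRNN, Dense

bi_rnn_model = Sequential([
    # Each forward/backward pair replaces one SimpleRNN layer of the RNN model
    Bidirectional(SimpleRNN(64, return_sequences=True, activation='relu'),
                  input_shape=(30, 1662)),
    Bidirectional(SimpleRNN(128, return_sequences=True, activation='relu')),
    Bidirectional(SimpleRNN(64, activation='relu')),
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(5, activation='softmax'),  # five sign classes, as in the RNN sketch
])
bi_rnn_model.compile(optimizer='adam',
                     loss='categorical_crossentropy',
                     metrics=['categorical_accuracy'])
```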

3.2.3 Long short term memory (LSTM) and bidirectional LSTM (Bi-LSTM)

Moreover, to enhance the capability of the recurrent model, this study also conducted experiments with the LSTM model to identify the best fit for continuous sign language, which is fed 30 frames sequentially for each translation prediction. The equations for the LSTM, as illustrated in Fig. 5, are given below, where U and W are trainable weights, b is a bias, i is the input gate, f is the forget gate, o is the output gate, g is the candidate hidden state, \(c_t\) is the internal cell state, \(s_t\) is the output, and \(\sigma\) is the sigmoid function:

$$i = \sigma(U^{i} x_t + W^{i} s_{t-1} + b_i) \qquad (6)$$
$$f = \sigma(U^{f} x_t + W^{f} s_{t-1} + b_f) \qquad (7)$$
$$o = \sigma(U^{o} x_t + W^{o} s_{t-1} + b_o) \qquad (8)$$
$$g = \tanh(U^{g} x_t + W^{g} s_{t-1} + b_g) \qquad (9)$$
$$c_t = c_{t-1} \cdot f + g \cdot i \qquad (10)$$
$$s_t = \tanh(c_t) \cdot o \qquad (11)$$
Fig. 5

The internal mechanism of LSTM. (Source: [16])

The equations for the Bi-LSTM depicted in Fig. 6 are given in Eqs. 12–14, where Eq. 12 represents the forward route and Eq. 13 the backward route; \(y_i\) is the output of the Bi-LSTM obtained by combining the results of both directional routes. The architecture of the LSTM/Bi-LSTM in this study is similar to the RNN structure shown in Fig. 3; the only difference is that the first three layers are changed from RNN to LSTM layers, while the rest remain the same.

Fig. 6

The internal mechanism of Bidirectional LSTM. (Source: [16])

$$h^{1}_i = f(U^{1} \cdot x_i + W^{1} \cdot h^{1}_{i-1}) \qquad (12)$$
$$h^{2}_i = f(U^{2} \cdot x_i + W^{2} \cdot h^{2}_{i+1}) \qquad (13)$$
$$y_i = \mathrm{softmax}(V \cdot [h^{1}_i; h^{2}_i]) \qquad (14)$$
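Since the LSTM and Bi-LSTM models reuse the Fig. 3 layout with only the recurrent layer type changed, a single sketch can cover both variants (a builder under the same assumptions as the earlier Keras examples):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, Dense

def build_lstm(bidirectional=False, num_classes=5):
    """Fig. 3 layout with the three recurrent layers changed to (Bi-)LSTM."""
    def rec(units, return_sequences):
        layer = LSTM(units, return_sequences=return_sequences, activation='relu')
        return Bidirectional(layer) if bidirectional else layer

    model = Sequential()
    model.add(rec(64, True))
    model.add(rec(128, True))
    model.add(rec(64, False))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(num_classes, activation='softmax'))
    model.build(input_shape=(None, 30, 1662))   # 30 frames of 1662 keypoint values
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['categorical_accuracy'])
    return model

lstm_model = build_lstm(bidirectional=False)
bi_lstm_model = build_lstm(bidirectional=True)
```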

3.2.4 Feed forward neural network with LSTM (FNN-LSTM)

In the architecture depicted in Fig. 7, a Feed-Forward Neural Network (FNN) is employed for feature extraction after receiving the keypoint features from MediaPipe, a 1662-dimensional vector for each frame (a 30 × 1662 sequence per sample). This is done to recognize local patterns in the sequence for pattern learning [16]. In this study, two fully connected FNN layers are used, with 128 and 64 nodes, respectively. These layers classify and recognize local patterns in each frame of the sequence before feeding the weighted output to the subsequent LSTM network for sequential recognition. For sequence recognition, this study applies an RNN with LSTM units: two layers, the first with 128 LSTM units and the second with 64 LSTM units, followed by a fully connected dense layer that downsamples to the categorical output with Softmax activation. The Adam algorithm is employed as a stochastic optimizer to minimize the categorical cross-entropy loss function.

Fig. 7

The architecture of FNN-LSTM used in this study. The two FNN layers are connected to the two LSTM layers to calculate the probability of hand gesture meanings using the Softmax function
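A minimal sketch of the FNN-LSTM architecture in Fig. 7, assuming the two fully connected layers are applied frame-wise (via `TimeDistributed`) before the LSTM stack; the layer sizes and optimizer follow the text:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, TimeDistributed

fnn_lstm_model = Sequential([
    # Frame-wise feature extraction: two fully connected layers per frame
    TimeDistributed(Dense(128, activation='relu'), input_shape=(30, 1662)),
    TimeDistributed(Dense(64, activation='relu')),
    # Sequence recognition: two LSTM layers
    LSTM(128, return_sequences=True),
    LSTM(64),
    # Dense layer downsamples to the categorical output
    Dense(5, activation='softmax'),  # five sign classes
])
fnn_lstm_model.compile(optimizer='adam',                 # Adam stochastic optimizer
                       loss='categorical_crossentropy',
                       metrics=['categorical_accuracy'])
```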

4 Experimental results and discussion

In this work, MediaPipe Holistic is chosen for feature extraction, and three main model families are employed: RNN/Bidirectional RNN, LSTM/Bidirectional LSTM, and FNN-LSTM, owing to their capacity to capture sequential actions in continuous sign language. The number of layers and nodes was selected based on the experimental results of each model. Five classes were chosen from thirteen Thai words commonly used in meetings during the Covid-19 situation: “meeting agenda,” “today,” “Covid-19,” “work from home,” and the null action used to conclude a statement. These words represent common sentences and uncomplicated postures for data collection. During the experiment, recognized continuous sign postures are displayed as text when the calculated probability exceeds the threshold of 0.95, while unrecognized signs are not translated into any text. The threshold was determined through experimentation; lowering it resulted in too many spurious translations.
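The decision rule above can be sketched as a sliding-window loop: the last 30 keypoint frames are fed to the trained model, and a word is emitted only when its probability exceeds the 0.95 threshold. The `mediapipe_detection` and `extract_keypoints` helpers from the Sect. 3.1 sketch and a trained `model` are assumed:

```python
from collections import deque
import numpy as np
import cv2
import mediapipe as mp

THRESHOLD = 0.95
actions = ['meeting agenda', 'today', 'Covid-19', 'work from home', 'null']

window = deque(maxlen=30)            # sliding window of the last 30 frames
cap = cv2.VideoCapture(0)
with mp.solutions.holistic.Holistic() as holistic:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        _, results = mediapipe_detection(frame, holistic)   # from the Sect. 3.1 sketch
        window.append(extract_keypoints(results))
        if len(window) == 30:
            probs = model.predict(np.expand_dims(np.array(window), axis=0),
                                  verbose=0)[0]
            best = int(np.argmax(probs))
            # Emit a translation only when the model is confident enough
            if probs[best] > THRESHOLD:
                print(actions[best])   # the 'null' class marks the end of a statement
cap.release()
```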

The detailed experimental results are presented in Fig. 8 and Table 2, where the “Model Accuracy” column represents the prediction accuracy on the test dataset, and the “Real-time Test Accuracy” column indicates the prediction accuracy during real-time testing. According to the table, LSTM yielded the best results on the test dataset with a prediction accuracy of 100%. However, when tested in real-time, the accuracy of LSTM dropped to 86%. RNN also achieved 100% accuracy on the test set but showed a lower performance during real-time testing, yielding a result of 64%. On the other hand, Bi-LSTM proved unsuitable for continuous Thai sign language due to the complexity of sequential recognition.

Fig. 8

Training accuracy of all experimental models

Table 2 Model evaluation

In continuous sign language, specific words are conveyed not only through hand signs but also through facial expressions and body posture movements. In such cases, MediaPipe Holistic proves to be a suitable tool for the task. Utilizing MediaPipe Holistic instead of a traditional CNN for feature extraction reduces programming complexity while still providing rich information for deeper analysis through deep learning. Nevertheless, using a single signer for data collection in our prototype system introduces the inevitable risk of overfitting due to the limited dataset. Additionally, issues may arise when the model is used by others, who must replicate the same postures as the signer who produced the training data.

In the context of real-time sign language interpretation, the ability of a system to interpret continuous sign language promptly is crucial. However, the current prototype system, which relies on a single signer, has not undergone explicit testing for real-time performance. While MediaPipe is recognized for its rapid and real-time processing capabilities, it is important to acknowledge that deep learning models, despite their efficiency, are associated with substantial computational resource consumption. This characteristic may potentially introduce delays in achieving real-time interpretation. Consequently, future iterations of the system should include comprehensive testing under varying network conditions to assess its performance in scenarios where data transfer rates are low due to network issues. This will provide a more comprehensive understanding of the system’s real-time interpretative capabilities, especially in adverse network conditions.

5 Conclusion and future works

In this paper, which involves only basic neural networks trained from scratch, the LSTM model stands out as the best performer. However, it tends to overfit because the dataset was collected from a single subject, leaving our self-collected dataset with insufficient data. Consequently, during real-world testing with different signers, the accuracy drops significantly. Nevertheless, there is room for improvement in future studies, particularly in data collection. To address this, we recommend improving both the quantity and quality of the dataset by gathering more diverse data from various signers and applying data augmentation techniques to generate fresh and varied instances. This approach can substantially improve the model's performance and accuracy once the dataset is rich and comprehensive. Additionally, evaluation metrics beyond accuracy should be employed to ensure a comprehensive evaluation of these black-box neural networks, particularly regarding consistency across a whole interpretation session [24]. Furthermore, incorporating matrix operation techniques commonly used in deep learning to reposition the coordinates of keypoints can help boost accuracy for different signers, distances, and camera angles. Attention mechanisms, fine-tuning of model hyperparameters, and alternative temporal-based model architectures [22, 25] can also contribute to refining model accuracy.