1 Introduction

For a human being to express their feelings and thoughts, collaborate with others, and contribute to the overall growth of society, communication is a necessary activity [1]. Around the world, communication between members of deaf communities and the general population is difficult [2, 3], and there are few methods for making online material accessible to those who have hearing loss [4]. As a result, they face substantial obstacles to accessing education, work, and healthcare, and they require effective translation at a reasonable price in order to do so [5,6,7]. The COVID-19 pandemic exacerbated these communication-related disparities, which can have a negative impact on health by increasing emergency room visits, raising the risk of depression and food insecurity, and decreasing awareness of medical conditions and their risks, such as cancer, cardiovascular disease, HIV, and the human papillomavirus (HPV). Moreover, compared to the general population, deaf persons have higher rates of depression, anxiety, insomnia, and emotional distress, and a poorer quality of life. In deaf children, psychological problems, including anxiety and low self-esteem, are four times more common than in hearing children [8].

According to the World Health Organization, five percent of people worldwide have hearing impairment [9]. Out of these, 34 million are children. By 2020, this number had risen to 466 million [10], and it is anticipated that more than 900 million individuals will have hearing loss by the year 2050 [10, 11]. According to the World Federation of the Deaf (WFD), a total of 70 million people are deaf and mute, while over 360 million people have hearing loss [12, 13]. About 17% of Americans are deaf or hard of hearing, making them an underserved and underrecognized population in terms of medical care [14]. The Welfare Department of Malaysia recorded that 44,500 people (or 0.14 percent of the country's 32 million citizens) have hearing impairment [8], and over 1 million people in Europe are deaf [15]. Out of the 200.81 million inhabitants of Pakistan, 10 million individuals are deaf [9]. According to the Central Agency for Public Mobilization and Statistics, there were approximately 2 million deaf people in Egypt, a number that rose to nearly 4 million in 2012. They therefore require a simple and effective means of contact with other people [16].

Since hearing people often have difficulty understanding the meaning of a particular sign used by the signing community, a communication gap still exists between these two communities [1, 2, 8, 16,17,18,19,20,21]. Many people are surprised to learn that language, not sound, is the main obstacle to communication for the deaf [7]. To bridge this communication gap, Sign Language (SL) knowledge emerged and gained broad acceptance among the general public [2, 3].

SL is a non-verbal language used by hearing-impaired (deaf/mute) individuals to communicate. Its meaning depends on the movements of the hands and fingers, which form the communication bridge with other people. A sentence in SL is made up of glosses, which are morphemes [21,22,23,24], and there are more than 7,000 modern SLs, each with unique variations in movement position, hand shape, and body part placement [25]. Hand gestures are used to represent the letters, numerals, words, and phrases of the lexicon, while the other gestures serve to emphasize their meanings. Reports on the use of SLs date back to the fifteenth century [9, 12, 26]. SL is also useful for people suffering from autism spectrum disorder [2]. Due to the discrepancy in the methods of communication, the disparities between a signed and a spoken language are greater than those between any two spoken languages, and they vary from one country to another [7, 10]. Around 300 SLs are employed globally, according to the WFD [10]. Unfortunately, there are no written forms available to deaf individuals, and there are very few electronic tools [27]. As in other Arabic-speaking nations, Arabic Sign Language (ArSL) lacks resources such as corpora and standardized dictionaries [28]. Consequently, creating a translator between an SL and a spoken language is at least as difficult as translating between spoken languages [7].

Although SL has been around since the fifth century BC, it is not standardized globally, and each nation often has its own native SL. The Arabic language is extensively used across the Middle East, and it has its own SL variants [24]. Imo and WhatsApp, two popular communication platforms that have become a vital part of modern life, can be utilized to improve communication between the hearing majority and the deaf community [20]. To make communication between the hearing and deaf communities easier, there is an increasing interest in using technology to develop applications for deaf people [3, 9, 13, 17]. A number of technologies have been used in recent years to help people with hearing or speaking difficulties and those without them communicate more effectively [29]. Natural language processing (NLP) includes the crucial field of speech recognition (SR). With the advancement of intelligent gadgets and automatic SR, audio data is transformed into equivalent text that is then processed through human–computer interaction, such as hand talking and multilingual translation [30]. The field of computer science known as machine translation (MT) focuses on using software to translate speech and text from one language to another. It can help remove language barriers and make information more easily accessible. A hybrid neural machine translation (NMT) model was proposed that uses a deep stacked gated recurrent unit (GRU)-based NMT model for the translation task and deep generative models to generate sign gesture videos automatically [31].

Devlin et al. [32] proposed the bidirectional encoder representations from transformers (BERT) family of models, which leverages the transformer encoder architecture to interpret each token of input text in the context of all tokens that came before and after it.

Artificial intelligence (AI) and deep learning (DL) are being used by researchers in this field to replace all device-based procedures with vision-based ones in order to eliminate any barriers to communication with the deaf. The primary benefit of DL is that the system automatically learns features from the data without having to define them explicitly; this is accomplished by designing the appropriate architecture. Semantic role labeling and named entity recognition are two examples of NLP tasks that can be tackled with artificial neural network learning models [33]. A recurrent neural network (RNN) was used in a neural responding machine, a neural network (NN)-based short-text conversation generator [34]. Convolutional neural networks are used for the majority of classification-related tasks in NLP. These networks improve picture classification accuracy by tapping into the latent semantic power of class labels [35]. Utilizing attention, transformers process the entire input at once, and the attention mechanism provides context for any position in the input sequence. Moreover, the transformer has several extensions and variants, each of which solves a problem present in a previous version or provides extra functionality, such as Transformer-XL [36], the Compressive Transformer [37], and the Reformer [38]. Transformer-XL improved evaluation speed by adding a recurrence mechanism and relative position encoding, extending the vanilla Transformer model, which suffered from context fragmentation and restricted context dependency, to word-level language modeling. Long-term dependency is improved as a result: Transformer-XL can learn dependencies 80% longer than RNNs and 450% longer than vanilla transformers, it performs better on both long and short sequences, and it is up to 1,800+ times faster than the vanilla transformer at evaluation.

The Compressive Transformer (CT), in contrast, selects information from the past and converts it into a compressed memory. The compression is performed by an NN trained with a loss function to retain important information. CT learns to query both its short-term granular memory and its longer-term coarse memory by employing the same attention mechanism over its set of memories and compressed memories. As a result, the modeling of uncommon words is enhanced.

As for the Reformer, it replaces the dot-product attention with locality-sensitive hashing, reducing the model's complexity from \(O(L^2)\) to \(O(L \log L)\), and uses a reversible variant of the residual layers in place of the conventional residual layers. These changes made the model competitive with state-of-the-art transformer models while also making it faster and lower in computational cost.

DL-based SL production processes have so far produced only a concatenation of isolated signs concentrating mostly on the manual elements, resulting in robotic and unemotional output [39]. Signing avatars make it possible for deaf people to access information in their preferred language [3]. Acoustic models now include deep neural networks that provide a list of auditory properties. End-to-end frameworks for SR have demonstrated exceptional performance in high-resource languages as a result of the rise of DL. However, it is challenging to achieve good results for SR on low-resource datasets [30]. A mid-level sign gloss representation (successfully recognizing the individual signs) has been demonstrated for SL translation, providing hope for greater interaction with the deaf. Gloss-level tokenization is needed as part of the pre-processing stage of translation in order to increase translation quality. If supervised data is available, tokens can be learned from sign videos. Annotated data is, however, hard to come by and expensive to annotate at the gloss level [40, 41].

While previous techniques may translate SL sentences into a series of glosses or written language, they fall short on issues such as word order that are not related to vocabulary. This is because, although SL phrases contain many glosses, these models view an SL sentence as one indivisible sequence [22].

In [42], an Arabic sentence is subjected to a morphological, syntactic, and semantic analysis before being translated into an ArSL sentence with appropriate grammar; the authors developed a 600-sentence corpus related to the health domain.

In order to translate Arabic text into ArSL with the suggested translator system, an architecture is needed. The architectures of translation systems described for other languages cannot be adopted for Arabic because of the nature of the Arabic Language (ArL) as the translator system's input language and the issues inherent in ArSL as the translator system's output language. The absence of equivalent signs for several ArL words is among the most significant problems with ArSL. An architecture suited to ArL and ArSL was therefore presented to address these issues. Using the proposed architecture, the suggested system accepts Arabic text from the user as input in the form of a word or a phrase and, after initial processing, converts the text into ArSL. Finally, an avatar is used to represent the word, phrase, or sentence in SL form.

The proposed model assists deaf–mute people in their daily lives, as it can help break the linguistic barriers between the deaf and hearing communities. It acts as MT software that can take the place of a human translator, which is an expensive option and not always available. Deaf and hearing-impaired people will have greater access to information and services thanks to automated SL translation systems, allowing them to contribute and participate equally in society, and such systems give a hearing person a way to convey information to a deaf person without needing to learn SL.

The contribution of this work is as follows:

  1. We build a neural model based on MT between ArL and ArSL. The proposed system converts Arabic text and speech into ArSL.

  2. We create a parallel corpus of 12,187 Arabic sentences paired with their Arabic sign glosses, which can be used for MT purposes.

  3. We provide a visual representation of the signs using a 2D avatar.

This paper is structured as follows: Sect. 2 presents related work. Section 3 shows the components of the proposed solution. A further description of the results obtained from the model is presented in Sect. 4. In addition, the administrative implications of the proposed framework are explained in Sect. 5. Eventually, the conclusion and future work are presented in Sect. 6.

2 Related work

This section presents a summary of the latest relevant work on SL studies utilizing various datasets and DL algorithms, particularly in the field of MT based on the transformer technique.

Patel et al. [43] proposed a reliable technique that translates English speech into animations in Indian Sign Language (ISL). It uses a preexisting SL database, the Google Cloud Speech Recognizer API, and the semantics of NLP, with an average accuracy of 77% and a processing time of 0.85 s.

Saunders et al. [39] presented a transformer architecture that translates spoken sentences into continuous 3D multi-channel sign pose sequences in an end-to-end manner, tracking the production progress over time and predicting the end of the sequence, while using an adversarial training regime and a mixture density network on the PHOENIX14T dataset.

Shaikh et al. [44] presented an animated avatar with inverse kinematics solver tools that translated a sentence into ArSL gloss, with a web UI that displayed a video of the avatar conveying the information using hand signs.

Nayan et al. [4] suggested a system that translates online videos and links into ISL captioning using a 3D cartoonish avatar, intended to reinforce classroom concepts during the critical period through NLP algorithms. The results showed that students taught with sign-captioned videos performed better than students taught with English-captioned videos by 37% and 70%, respectively, and that learning vocabulary with the help of sign-aided videos improved by 73.08%.

Das et al. [21] developed a 3D avatar-based SL learning system that converts the input speech or text into corresponding signed movements for the ISL.

Liang et al. [30] introduced an end-to-end framework for SR with multilingual datasets (Chinese, English, and code-switched) based on a hybrid of CTC and an attention model, implemented in PyTorch. The model achieved better performance than the HMM-DNN model in single-language and code-switching environments. The character error rate of the proposed model on the Chinese dataset outperformed the traditional model, reaching 10.22%.

Sobhan et al. [13] developed an Android app based on a multimodal approach that can convert speech to visual contexts and vibrations and, similarly, convert visual contexts and vibrations to speech. The study reports an accuracy between 83% and 100% and an average time between 21 and 65 s.

Nguyen et al. [15] proposed a prototype for a 3D German SL avatar on AR glasses given 2D videos of human signers, and the results showed a high acceptance rate of the presented solution, even though the comprehensibility of the avatar’s signing was rather low.

Aliwy et al. [28] constructed an ArL-to-ArSL dictionary as part of a translation system using the eSign editor software. The sign was then converted to the sign gesture markup language and then to the animated sign via a 3D avatar. The quality of the generated signs was rated from 1 to 5 for 100 randomly selected signs, with an average rating of 4.3.

Sanaullah et al. [9] built Sign4PSL, a reusable application for web and mobile platforms that translates the sentences to Pakistani SL. The system was tested, and it was shown that the deaf students were able to understand the story appropriately.

Andrabi et al. [45] used an NN-based DL technique for English-to-Urdu translation. The model is trained and tested using a 70:30 split, and the output is compared with Google Translator's output, with an average BLEU score of 45.83.

Mckellar et al. [46] built the Autshumato MT evaluation set for translation between any of the 11 official South African languages.

Benkov et al. [47] built an NMT structure based on an encoder–decoder framework that transforms a source language sentence into a continuous space representation through an RNN. The model offered an improvement in translation output but still has to be evaluated in future work.

Saija et al. [48] assisted the deaf in communicating with others by constructing an end-to-end system that converts English voice to ISL gloss and vice versa.

Because ArL is a low-resource language, we had difficulty finding an Arabic sign gloss dataset, so we built our own, which is presented in detail in Sect. 4.1.

3 Proposed model

This section presents a detailed framework for lowering the barrier between the hearing and deaf communities. The conversion of continuously spoken words into sign movements requires great care and poses significant development risks. The system consists of three modules. The first is data pre-processing, which prepares the raw data in a format the network can accept (if the input is speech, it is first converted into text); this is explained in Sect. 3.1. Then, the text is translated into the corresponding gloss using the transformer technique, known as MT, which is further explained in Sect. 3.2. Finally, a 2D animation of the avatar is created based on the gloss, as described in Sect. 3.3.

3.1 Data pre-processing

Data pre-processing is necessary to clean the data and prepare it for a DL model, which also improves the model's accuracy and effectiveness. In our model, this step consists of stripping all diacritics from the words entering the model, which reduces the vocabulary size and allows the model to learn the meaning of a word from its context. This increases the efficiency of the model: instead of learning 100 words with the same basic spelling, it only has to learn one. In addition to converting most words to their roots, we also want to minimize the size of our vocabulary. Then, BERT is used as the embedding layer of the transformer model; it generates natural language vector-space representations appropriate for DL algorithms and tokenizes the Arabic, SL, and English vocabulary from the dataset, as illustrated in Fig. 1.
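As an illustration of this pre-processing step, the following minimal Python sketch strips the diacritics with the Pyarabic library (listed in our environment) and tokenizes the result; the HuggingFace tokenizer API and the multilingual BERT checkpoint shown here are assumptions for illustration, not the exact embedding setup.

```python
# Minimal pre-processing sketch: strip diacritics, then tokenize with a BERT
# tokenizer. The HuggingFace API and checkpoint name are illustrative assumptions.
from pyarabic import araby
from transformers import AutoTokenizer

def preprocess(sentence: str, tokenizer) -> list:
    bare = araby.strip_tashkeel(sentence)                     # remove all diacritics (tashkeel)
    return tokenizer.encode(bare, add_special_tokens=True)    # sub-word token ids

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
token_ids = preprocess("كَيفَ حَالُك؟", tokenizer)
```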

Fig. 1 Data pre-processing: stripping diacritics and then tokenizing the data using BERT

3.2 Speech/text to gloss

Gloss is a textual “translation” of a sign composed of words from the spoken language. In order to translate from spoken language to glosses, automatic SR and NMT techniques can be applied. We will demonstrate the speech-to-gloss translation in two steps:

3.2.1 Arabic speech to Arabic text

We start with Google's pretrained acoustic model proposed by Chiu et al. [49], used on top of the transformer. The model translates Arabic speech into Arabic text by setting the language code to ar-EG and then using encoder–decoder network architectures based on the transformer.
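The following sketch illustrates this step with the Google Cloud Speech-to-Text client and the ar-EG language code; the client library, audio encoding, and sample rate shown here are illustrative assumptions rather than the exact integration used in our pipeline.

```python
# Hedged sketch of Arabic speech-to-text with the Google Cloud client; the
# encoding and sample rate are illustrative assumptions.
from google.cloud import speech

def arabic_speech_to_text(wav_bytes: bytes) -> str:
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="ar-EG",  # Egyptian Arabic, as stated above
    )
    audio = speech.RecognitionAudio(content=wav_bytes)
    response = client.recognize(config=config, audio=audio)
    return " ".join(r.alternatives[0].transcript for r in response.results)
```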

3.2.2 Arabic text to gloss

The transformer model used is the one proposed by Vaswani et al. [50], with only the input and target languages changed to ArL and Arabic glosses, respectively. The translation of Arabic text to gloss is illustrated in Fig. 2. The transformer model is an NN architecture that uses attention mechanisms to encode and decode sequences of data. It can be used for various natural language processing tasks, such as machine translation, text summarization, and speech recognition. In the context of Sign Language Translation (SLT), a transformer model can take a sequence of spoken language tokens (words or subwords) as input and generate a sequence of SL tokens (sign glosses or video frames) as output. The transformer architecture is based on the concept of multi-head attention and self-attention, which allows the model to weigh the importance of different parts of an input sequence when producing an output sequence.

The transformer architecture consists of an encoder and a decoder. The encoder takes an input sequence and produces a sequence of hidden states, while the decoder takes this sequence of hidden states and produces an output sequence. Both the encoder and decoder consist of a stack of identical layers.

Each layer in the transformer architecture has two sub-layers: a self-attention mechanism and a feedforward NN. The multi-head self-attention mechanism allows the model to weigh the importance of different parts of the input sequence, while the feedforward NN provides a nonlinear transformation of the hidden states.

The multi-head self-attention mechanism operates on a set of key–value pairs, where the keys, values, and queries are all derived from the input sequence. The mechanism computes a weighted sum of the values, where the weights are determined by the similarity between the queries and the keys. The weights are computed using a softmax function and are used to weight the values in the weighted sum.
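A minimal TensorFlow sketch of this scaled dot-product attention computation is shown below; it follows the standard formulation of Vaswani et al. [50] and is illustrative rather than a copy of our implementation.

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    """Compute softmax(q.k^T / sqrt(d_k)) . v and return the output and weights."""
    scores = tf.matmul(q, k, transpose_b=True)            # similarity of queries and keys
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = scores / tf.math.sqrt(d_k)                    # scale by sqrt(d_k)
    if mask is not None:
        scores += mask * -1e9                              # block masked positions
    weights = tf.nn.softmax(scores, axis=-1)               # attention weights
    return tf.matmul(weights, v), weights                  # weighted sum of the values
```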

In addition to the self-attention mechanism, the transformer architecture also uses positional encodings to provide information about the position of each token in the input sequence. This information is added to the input embeddings and allows the model to differentiate between tokens that have the same embedding.
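The sinusoidal positional encoding can be computed as in the following sketch, which follows the formulation of Vaswani et al. [50]; the maximum length and model dimension used in our experiments are configuration details omitted here.

```python
import numpy as np
import tensorflow as tf

def positional_encoding(max_len: int, d_model: int) -> tf.Tensor:
    """Return a (1, max_len, d_model) tensor added to the token embeddings."""
    positions = np.arange(max_len)[:, np.newaxis]                     # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                          # (1, d_model)
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles[:, 0::2] = np.sin(angles[:, 0::2])                         # even dimensions
    angles[:, 1::2] = np.cos(angles[:, 1::2])                         # odd dimensions
    return tf.cast(angles[np.newaxis, ...], tf.float32)
```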

Overall, the transformer architecture has been highly successful for natural language processing tasks, such as machine translation, achieving state-of-the-art performance on several benchmarks. Its ability to model long-range dependencies and its parallelizability make it well suited for processing long input sequences, such as those found in machine translation.

One of the advantages of using a transformer model for SLT is that it can handle variable-length input and output sequences without relying on recurrent or convolutional operations, which can be computationally expensive and prone to vanishing gradients. Another advantage is that it can leverage pre-trained language models, such as BERT (as we did), to initialize its parameters and improve its generalization performance on small datasets. A pre-trained language model is an NN that has been trained on large amounts of text data to learn the patterns and structures of natural languages. By transferring the knowledge from these models to the SLT task, the transformer model can benefit from the rich linguistic information encoded in them.

Finally, our task is not considered a classification task but a generation task, in which the model understands the whole Arabic sentence and then generates the translated sign gloss sentence.

The model was trained for 100 epochs with a batch size of 64, achieving a training accuracy of 94.71% and a testing accuracy of 87.04% for Arabic-to-Arabic sign gloss translation.
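A hedged sketch of this training configuration is given below; the optimizer, loss, and learning rate are assumptions following common Keras practice for sequence-to-sequence transformers, while the epoch count and batch size match the values reported above.

```python
import tensorflow as tf

def train_transformer(model: tf.keras.Model,
                      train_ds: tf.data.Dataset,
                      val_ds: tf.data.Dataset) -> tf.keras.callbacks.History:
    """Datasets are assumed to yield ((encoder_tokens, decoder_tokens), target_tokens)."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),        # assumed optimizer
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    return model.fit(
        train_ds.batch(64),                 # batch size used in our experiments
        validation_data=val_ds.batch(64),
        epochs=100,                         # epochs used in our experiments
    )
```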

For our model, the tokenizer model saved during the pre-processing step is first loaded as the embedding layer for the transformer. The distribution of tokens per example in the dataset is illustrated in Fig. 3. The positional encoding vector is then added to the embedding vector. Embeddings represent tokens in a d-dimensional space in which tokens with similar meanings are located closer to one another; however, the embeddings do not encode the relative positions of the tokens in a sentence. After the positional encoding is added, tokens are closer to one another in the d-dimensional space based on both the similarity of their meanings and their positions in the sentence.

Fig. 2 Execution flow of the proposed model

Fig. 3 Tokens per example

3.3 Gloss to animation

Vector-based animation allows the motion to be managed by vectors as opposed to pixels. Images in common file types such as JPG, GIF, and BMP are made of pixels; these images cannot be enlarged or reduced without losing visual quality. Resolution is unimportant for vector drawings. Vectors are characterized by paths with numerous start and end points and by the lines linking them that form the graphic. Characters or other types of images can be formed using shapes. For smooth motion, vector-based animation resizes images using mathematical values. The animator does not have to keep drawing the same characters repeatedly, because these creations can be reused, and the vectors can be moved around and animated in this way.

Our process comprises two steps: gloss to skeletons and skeletons to animation. The process aims to generate an animated human pose as key points linked with the corresponding sign gloss. This process is visualized in Fig. 4.

Fig. 4 Animation generating process

3.3.1 Gloss to skeleton

In this step, each sign had a video with the gloss as its label. The video was then used to generate points that translated the movement into the X and Y dimensions, as the avatar was a 2D avatar. The main obstacle in completing this task is extracting the motion coordinates and transferring them back to the model due to three factors:

  1. The location of the person in the video frame.

  2. The zoom ratio in the video frames.

  3. The size of the person compared to the avatar.

As a solution, the average of the points in the first frame of each video was calculated. In addition, instead of positioning each point relative to the X- or Y-axis, we positioned it relative to a fixed point on the person's body and then calculated all the points relative to it for each sign. The shoulder was chosen as the fixed, stable point relative to which all movement points were calculated, allowing the movement to be smooth.
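The following sketch illustrates this normalization; the shoulder index and the use of the first-frame point spread as the scale factor are illustrative assumptions reflecting the procedure described above.

```python
import numpy as np

def normalize_keypoints(frames: np.ndarray, shoulder_idx: int = 0) -> np.ndarray:
    """frames: (num_frames, num_points, 2) array of (x, y) keypoints."""
    first = frames[0]
    scale = np.linalg.norm(first - first.mean(axis=0), axis=1).mean()  # first-frame spread
    shoulder = frames[:, shoulder_idx:shoulder_idx + 1, :]             # fixed reference point
    return (frames - shoulder) / max(scale, 1e-6)                      # shoulder-relative, scale-free
```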

3.3.2 Skeletons to animation

In this step, the skeleton's points were simply mapped to generate the points that make the avatar move, presenting the desired sign using vector-based animation. In addition, by using inverse kinematics, we were able to generate the animation of the avatar and make it move given only the end point.
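As an illustration of the inverse-kinematics idea, the two-bone 2D solver sketched below recovers the shoulder and elbow angles from a target wrist point using the law of cosines; the bone lengths and joint layout are assumptions, since the production avatar is driven through the animation tool's own IK solver.

```python
import math

def two_bone_ik(shoulder, wrist, upper_len, fore_len):
    """Return (shoulder_angle, elbow_bend) in radians for a 2D two-bone arm."""
    dx, dy = wrist[0] - shoulder[0], wrist[1] - shoulder[1]
    dist = max(1e-6, min(math.hypot(dx, dy), upper_len + fore_len))     # clamp reach
    cos_elbow = (upper_len**2 + fore_len**2 - dist**2) / (2 * upper_len * fore_len)
    elbow_bend = math.pi - math.acos(max(-1.0, min(1.0, cos_elbow)))    # bend at the elbow
    cos_shoulder = (upper_len**2 + dist**2 - fore_len**2) / (2 * upper_len * dist)
    shoulder_angle = math.atan2(dy, dx) - math.acos(max(-1.0, min(1.0, cos_shoulder)))
    return shoulder_angle, elbow_bend
```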

In summary, the system accepts two input types: a voice recording, which is converted to text, or text entered directly; in either case, the model itself accepts only text input. The model then predicts the corresponding sign gloss tokens. The predicted gloss is mapped to the animation points list, and finally the avatar is moved using those points, representing the desired SL signs as the output of the system. The system faced several challenges and limitations during its development and implementation, and even in its data collection phase. One challenge was the procurement of a suitable sign gloss dataset for training the model. Another significant hurdle was finding a modestly sized animation tool, so as to prevent any degradation in the operational efficiency of the integrated system. A further challenge was adjusting the model's hyper-parameters to fit our relatively small dataset, since the papers in which the models were first introduced trained them on large datasets, unlike ours. Although we agree that it is good practice to use training, validation, and test sets, the dataset was relatively small, and we could not afford to split it into three sets without losing too much information. Therefore, we split it into training and testing sets, where the testing set is unseen by the model and is used only to monitor the model's progress without any further optimization, and we also used different evaluation metrics to measure the performance of our model from different perspectives.

Algorithm 1 presents the pseudo-code of the suggested system's whole process.


4 Experimental results

4.1 Arabic–Arabic Sign dataset

As mentioned earlier, ArL is a low-resource language, and it was difficult to find a sign gloss dataset to train the model, so we constructed our own dataset.Footnote 1 The dataset consists of 12,187 Arabic–Arabic sign gloss pairs. An Arabic-to-English datasetFootnote 2 was initially used to obtain the Arabic sentences. Then, the Farasa Part-of-Speech (POS) Tagging ModuleFootnote 3 was used to decompose the Arabic sentences into their structural parts, or POS tags, marking each word as a verb, subject, root, preposition, adverb, adjective, etc.

Next, by researching the structure of ArSL, we extracted and elicited rules and then implemented those rules in code to convert the analyzed sentence structure into the ArSL sentence structure. For example, in question sentences, ArSL moves the question word, such as كيف، متى، أين، لماذا (How, When, Where, Why), to the end of the sentence. Another example concerns the connected pronouns, such as ـه، ـني، ـها، ـنا، ـك (an attached pronoun appears as a suffix and becomes possessive when connected to a noun; possessive pronouns indicate ownership, such as mine, yours, his, hers, ours, and theirs), which are separated into a root word and the corresponding standalone pronoun, as in أنا، أنت، هو، هي، نحن, etc. (I, you, he, she, we). Additionally, الـ (the) was stripped from the words, in addition to other rules.
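A toy sketch of how two of these rules can be applied to a tokenized Arabic sentence is shown below; the rule set here is illustrative and far smaller than the one used to build the corpus.

```python
QUESTION_WORDS = {"كيف", "متى", "أين", "لماذا"}  # question words are moved to the end

def apply_arsl_rules(tokens):
    questions = [t for t in tokens if t in QUESTION_WORDS]
    rest = [t for t in tokens if t not in QUESTION_WORDS]
    # Strip the definite article "ال" (the) from the remaining tokens.
    rest = [t[2:] if t.startswith("ال") and len(t) > 3 else t for t in rest]
    return rest + questions

print(apply_arsl_rules(["كيف", "حالك"]))   # -> ['حالك', 'كيف']
```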

As a next step, a manual review was done, since the conversion method is not totally accurate due to the Arabic language's huge diversity of words, which the code could not fully cover.

Finally, code was applied to map the Arabic letters in the dataset to a mixture of Latin and French letters, since the tokenizer becomes confused when the source and target use the same script, as with Arabic written sentences and Arabic written glosses. We used a mixture of Latin and French letters because English lacks some vowels that exist in Arabic and sometimes needs two letters to represent a single Arabic vowel, which also confuses the tokenizer, so some French letters were used to represent those vowels.
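A partial sketch of this character mapping is shown below, using only the correspondences documented in the note under Table 1; the full mapping table is larger.

```python
TRANSLIT = {"و": "i", "ح": "Ç", "ع": "o", "ي": "z", "ض": "x"}  # partial mapping (see Table 1 note)

def transliterate_gloss(gloss: str) -> str:
    """Replace mapped Arabic letters with their Latin/French counterparts."""
    return "".join(TRANSLIT.get(ch, ch) for ch in gloss)
```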

In addition, names are represented using a method called الهجاء الاصبعي (finger-spelling). It is a way of spelling a word letter by letter; for example, انا اسمي توم (my name is Tom) would be translated to gloss as "ana asm t i m," where "t i m" represents the finger-spelling of the name. A few examples from the dataset are illustrated in Table 1.

Table 1 Examples from the dataset

Note: (i) represents (و), (Ç) represents (ح), (o) represents (ع), (z) represents (ي), (x) represents (ض).

The tokenized dataset consists of 1511 different Arabic vocabulary classes and 1154 different SL vocabulary classes. The frequency of instances per class for the 20 most frequent classes is shown in Fig. 5a for Arabic and Fig. 6a for SL, while the frequency of instances per class for the 20 least frequent classes is shown in Fig. 5b for Arabic and Fig. 6b for SL.

Fig. 5 Arabic vocabulary: top 20 most and least frequent classes

Fig. 6 Sign language vocabulary: top 20 most and least frequent classes

Dataset imbalance is a common issue encountered in many domains of machine learning and data analysis. It refers to the situation where the distribution of classes or target variables within a dataset is highly skewed, with one or a few classes having significantly more or fewer instances than others. Our dataset is clearly imbalanced. Natural language datasets can be imbalanced for various reasons. One of the main reasons is the frequency distribution of words and phrases in natural language: in any language, some words and phrases are more common than others. For example, in English, words such as "the," "and," and "of" are used much more frequently than less common words. This can lead to data imbalance in text-based datasets, where certain words or phrases are more prevalent than others. In our dataset, the words ("أنا", "ماذا", "هذا", "من"), meaning (I, What, this, from), are more frequent, and for the sign glosses the words "ant, hi, ana, hz" (you, he, I, she) are more frequent due to the separation we applied to the connected pronouns.

4.2 Experimental settings

The suggested transformer is trained using online resources in an offline manner. The following defines the experimental setup:

  • The dataset was split into 80% for training and 20% for testing, sequentially, due to the dataset being relatively small compared to other MT datasets.

  • The dataset was edited (adding the sign gloss) using Python and the Farasa library to automate the dataset generation.

  • For the development environment, the following was used: Python = 3.9.12, Keras = 2.6.0, Tensorflow = 2.10.0, Tensorflow-text = 2.10.0, Pyarabic = 0.6.15, Numpy = 1.19.5, OpenCV-Python = 4.5.5.64, Matplotlib =3.5.2.

  • For the device environment, we used an Nvidia GTX 1050 Ti 4GB GPU.

4.3 Performance measures

The following key performance measures were selected in this work to verify the quality of the proposed model: Accuracy is used to quantify how frequently the model properly classifies a data point; it is known as the proportion of accurately predicted data points among all the data points and is defined as follows:

$$\begin{aligned} Accuracy = \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(1)

where TP, TN, FP, and FN represent the true positive, true negative, false positive, and false negative, respectively.

The second measure is the BLEU score, on which sacreBLEU (a popular open-source Python library that we used to calculate the score) is built. BLEU measures the similarity between machine-generated translations and one or more human reference translations. The BLEU score is based on the principle that better translations tend to have higher levels of lexical and syntactic overlap with the reference translations. It operates by comparing n-grams (contiguous sequences of words) in the candidate translation with the n-grams in the reference translations. The score can take any value between 0.0 and 100.0 and is defined as follows:

$$\begin{aligned} \log \ BLEU = \min \left( 1-\frac{r}{c},0\right) +\sum _{n=1}^{4} \frac{\log \ p_n}{4} \end{aligned}$$
(2)

where "c" is the predicted length (the number of words in the predicted sentence), "r" is the target length (the number of words in the target sentence), and "\(p_n\)" is the n-gram precision.
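In practice, we computed the score with sacreBLEU rather than by hand; a minimal usage sketch is shown below, with the example sentences taken from Fig. 9 purely for illustration.

```python
import sacrebleu

hypotheses = ["Çal ant kzf"]        # model outputs, one string per test sentence
references = [["Çal ant kzf"]]      # one list of reference glosses per reference set
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)                    # corpus-level BLEU in the range 0.0-100.0
```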

4.4 Actual performance results

Here is a detailed analysis of the proposed model:

$$\begin{aligned} Training \ Accuracy= & {} \frac{TP+TN}{TP+TN+FP+FN} = 0.9471 \end{aligned}$$
(3)
$$\begin{aligned} Testing \ Accuracy= & {} \frac{TP+TN}{TP+TN+FP+FN} = 0.8704 \end{aligned}$$
(4)
$$\begin{aligned} BLEU= & {} \exp \left( \min \left( 1-\frac{r}{c},0\right) +\sum _{n=1}^{4} \frac{\log \ p_n}{4}\right) = 20.41 \end{aligned}$$
(5)

Thus, a training accuracy of 94.71%, a testing accuracy of 87.04%, and a BLEU score of 20.41 were achieved after training the model for 100 epochs with a batch size of 64. Figures 7 and 8 show the accuracy and loss values over the training epochs, respectively.

Fig. 7 Training accuracy per epoch for training data and testing data (val)

Fig. 8 Loss per epoch for training data and testing data (val)

A confusion matrix is a valuable tool for evaluating the performance of classification models, where the goal is to assign instances to predefined classes. However, it may not be directly applicable or useful for evaluating machine translation models, due to the following reasons:

  • Multiple possible outputs: Machine translation generates multiple valid translations, making it difficult to define a fixed set of classes for a confusion matrix.

  • Variable length output: Translations in machine translation can vary in length, making it challenging to align predicted and reference sentences in a matrix format.

  • Semantic equivalence: Evaluating translation based solely on word-level alignments may overlook the overall semantic equivalence between predicted and reference translations.

Instead, specialized evaluation metrics such as BLEU, METEOR, TER, ROUGE, and CIDEr have been developed to assess translation quality by considering factors such as n-gram overlap, semantic equivalence, fluency, and precision. These metrics provide a more accurate evaluation of machine translation models.

Text translation is a text generation task, so a calibration plot might not be suitable for evaluating the model since there is no clear way to define the true probability of a generated text.

Instead of both of those measures, we used another metric that captures the quality and diversity of the generated translations, namely BLEU.

Finally, extra plots were generated after training the model and saving the results. Figure 9 presents a visualization of the attention map for the sentence كيف حالك؟ (How are you?). Figure 10 presents some of the heads in the multi-head attention map between the Arabic sentence and the corresponding gloss, with the weight of each token, for the same sentence. In addition, Fig. 11 presents some of the multi-head attention heads for the Arabic sentence نحن في المنزل (We're at home) and the corresponding gloss, with the weight of each token.

Fig. 9 Attention plot for source sentence "كيف حالك؟" (How are you?), target sign gloss "Çal ant kzf"

Fig. 10 A few heads from the multi-head attention for source sentence "كيف حالك؟" (How are you?), target sign gloss "Çal ant kzf"

Fig. 11 A few heads from the multi-head attention for source sentence "نحن في المنزل" (We're at home), target sign gloss "nÇn fz bzt"

5 The administrative implications of the proposed model

The proposed model was integrated into a cross-platform application; a video demo is provided at this link.Footnote 4 The application offers the user a voice recorder or a text editor as options for entering the input sentence. The input is then forwarded to a back-end server that connects to the transformer model and predicts the corresponding sign gloss for the entered sentence. The predicted value is then mapped to the animation points list, which contains all the key points for the sign movements. The points are then sent as a response to the application, which moves the avatar accordingly. The application also provides a section containing ArSL dictionaries to offer a learning opportunity to anyone who wants to learn the language. The application is relatively small in size compared to other applications that similarly combine an AI model and animation; this is achieved by placing the AI model on the server and storing animations as points rather than saving rendered animations. Further, the application benefits significantly from the transformer model by understanding a word from its context, by minimizing the number of words needed per sign (some signs are expressed using compound words), and by combining similar signs into the same animation and token. Figure 12 provides some screenshots from the application: (a) displays the services provided by the application, such as the translation service and the SL learning service, and (b) displays the actual translation service screen, where the recording and text entry options are provided and the input is animated by the displayed avatar.
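A hedged sketch of the back-end flow is given below as a Django view; the helper translate_to_gloss and the ANIMATION_POINTS dictionary are hypothetical names standing in for the transformer inference call and the stored key-point lists.

```python
import json
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

@csrf_exempt
def translate_view(request):
    sentence = json.loads(request.body)["sentence"]
    gloss = translate_to_gloss(sentence)                   # hypothetical transformer inference call
    points = [ANIMATION_POINTS[token] for token in gloss.split()
              if token in ANIMATION_POINTS]                # hypothetical gloss -> key-point lookup
    return JsonResponse({"gloss": gloss, "points": points})
```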

Fig. 12 Some screenshots from the application

The application was developed using the following:

  • As a platform, we used Dart = 2.16.2.

  • As frameworks, we used Flutter = 2.10.4 and Django = 4.0.4.

  • Finally, we used the following packages: firebase_auth: \(^\wedge\)3.3.7, cloud_firestore: \(^\wedge\)3.1.14, firebase_storage: \(^\wedge\)10.2.8, firebase_analytics: \(^\wedge\)9.1.0, firebase_core: \(^\wedge\)1.16.0, flutter_svg: \(^\wedge\)1.0.3, provider: \(^\wedge\)6.0.2, image_picker: \(^\wedge\)0.8.4+11, mic_stream: \(^\wedge\)0.6.0, soundpool: \(^\wedge\)2.3.0, permission_handler: \(^\wedge\)9.2.0, path_provider: \(^\wedge\)2.0.9, google_sign_in: \(^\wedge\)5.2.4, cached_network_image: \(^\wedge\)3.2.0, sign_in_with_apple: \(^\wedge\)3.3.0, microphone: \(^\wedge\)0.1.0, internet_file: \(^\wedge\)1.0.0+2, pdfx: \(^\wedge\)2.0.1+2, shared_preferences: \(^\wedge\)2.0.15.

6 Conclusion and future work

To translate spoken phrases into sign glosses, an NMT system is introduced in this paper, in addition to visualizing the sign glosses with an animated avatar. The proposed model has been evaluated on a self-created ArSL dataset. It was difficult to find other similar Arabic datasets against which to compare the proposed model's performance, due to the lack of Arabic resources (Arabic is a low-resource language). Another limitation of our work was finding a suitable animation tool that generates the animation while keeping the app size as small as possible.

The proposed NMT model obtained an accuracy value of 94.71% on training data and 87.04% on testing data translating from Arabic to Arabic gloss.

We are still working on expanding and enhancing the dataset using more precise ArSL rules. We are also working on supporting 3D graphics and may redesign the model to animate the avatar directly. These improvements will speed up the interpretation process and make it possible to build a lite version of the mobile application that works offline. We may also aim to support real-time interpretation and translation into more languages.