1 Introduction

Sign language is the primary means of communication among deaf people, but a problem arises when a deaf person tries to communicate with a hearing person who has no prior experience with sign language [39]. This gap limits the interactions and social life of deaf people, as it requires a sign language expert to mediate the communication [71]. Scientists and researchers have tried to shed light on this problem and to replace the intermediate human expert with an automated interpreter that converts hand kinematics and facial expressions into words or phrases [65]. Despite these great efforts and the state-of-the-art developments in artificial intelligence and deep learning techniques [13], no optimal interpreter exists up to now due to the different challenges and difficulties that face such systems [58].

A sign language interpreter system (SLIS) accepts a human visual sign as a set of frames from a capturing medium such as a camera and outputs the corresponding meaning of that sign [70]. The output can be represented as text or sound, as shown in Fig. 1. Training an SLIS requires a unique sign to represent each letter and number; hence, a massive amount of data and signs must be processed, especially for every existing language. Even minimal differences in the signs can affect the interpreter’s performance, for example, the identity of the performer: the interpreter can be designed to be user-dependent or user-independent. In the first case, the interpreter depends on the specific person, while in the latter, the identity of the user no longer matters.

Fig. 1

A Sign Language Interpreter System (SLIS) Overview

Different challenges in sign language recognition must be solved in any SLIS, starting from collecting the dataset to deploying the overall system [26, 56]. Some of these challenges can be summarized as follows:

  • Viewpoint Variance: The same sign can be captured from different viewpoints and performed with different poses and hand kinematics by different people.

  • Environment: The background, lighting, landmarks, and other elements can vary in the captured sign.

  • Complex Gestures: A sign can be complex for a person to perform, especially if the word is rarely used.

  • Facial Expressions: A sign can include a facial expression. The face may also include glasses, earrings, etc., and this may confuse the system.

  • Non-Symmetric Signs: A word can be expressed with different poses or styles across languages, and there are signs unique to each language.

At a glance, over the years scientists have tried to find solutions for each of these challenges and to implement such interpreters using different methods and approaches. These approaches can mainly be classified into two groups: sensor-based and vision-based. In the sensor-based approach, the user wears sensors such as colored or special gloves, and a motion capture system records the sign, but this approach has several drawbacks [21]. For example, it is impractical in daily life, as it obliges the user to wear sensors that depend on a continuous power supply, wires, and other requirements. This reason was enough for the authors to set this approach aside and work on the second one. In the vision-based approach, the system relies on image processing and computer vision to process images and videos, in addition to machine learning and deep learning techniques to classify the processed data [42], such as the Hidden Markov Model (HMM) [15], Artificial Neural Networks (ANN) [76], Convolutional Neural Networks (CNN) [5, 53], and Recurrent Neural Networks (RNN) [50]. The advantage of the second approach is its low-cost hardware: the capturing medium can be a smartphone camera.

In the current work, the authors adopted the vision-based approach, and the contributions can be summarized as follows:

  • Reviewing the literature related to previously built systems and frameworks.

  • Proposing a new Arabic sign language dataset.

  • Suggesting a deep learning framework using both CNNs and RNNs for Arabic sign language interpretation.

  • Focusing on the working mechanism of the user-independent approach and its application.

  • Performing different experiments and comparing the current work results with other published state-of-the-art results.

The rest of the paper is organized as follows: in Section 2, the related work and studies are discussed. In Section 3, the available Arabic sign language datasets are presented and the proposed dataset is discussed in detail. In Section 4, the pre-processing stages performed on the suggested dataset are discussed. In Section 5, the suggested deep learning architecture is presented and discussed in detail. In Section 6, the experiments and their corresponding results are reported. Finally, in Section 7, the presented work is concluded and future work is outlined.

2 Related work

Sign language recognition has been studied by different researchers using different approaches since the 1990s [1, 22, 62, 63]. In this section, the related literature, covering several works and methods throughout the years, is discussed. Tamura et al. [67] assumed that a sign word is composed of a time sequence of units called cheremes, which consist of the handshape, movement, and location of the hand. They expressed the 3D features of these factors, converted them into 2D image features, and classified the motion images of sign language with the 2D features.

Keskin et al. [40] created realistic 3D hand models that represented the hand with 21 different parts and trained Random Decision Forests (RDFs). They used the RDFs to perform per-pixel classification, assigning each pixel to a hand part, and then fed the result into a local mode-finding algorithm to estimate the joint locations of the hand skeleton. They also described a support vector machine (SVM) model to recognize American Sign Language (ASL) digits based on this method and achieved a high recognition rate on live depth images in real time. Nandy et al. [52] created a video database for various signs of the Indian sign language. They used the direction histogram, which is appealing for its illumination and orientation invariance, as the classification feature, and applied two different recognition approaches based on the Euclidean distance and K-nearest neighbor metrics.

Mehdi et al. [51] used a 7-sensor glove from the 5DT Company to capture the hands’ movements and an artificial neural network (ANN) as the classifier to recognize the gestures, achieving an accuracy of 88%. López-Noriega et al. [47] followed the same approach and also offered a graphical user interface built with “.NET”. A Hidden Markov Model (HMM) based approach was used effectively in continuous and real-time sign language recognition by Starner et al. [61]. They used color gloves to capture hand shape, orientation, and trajectory, fed the glove images into HMM-based systems for recognizing sentence-level ASL, and obtained high word accuracy results.

Hienz et al. [35] used colored cotton gloves to simplify feature extraction. They converted the video sequences into feature vectors and then fed them to an HMM for classification, achieving accuracy values from 92% to 94%. Grobel et al. [31] and Parcheta et al. [54] followed the same approach. In brief, these approaches achieved high accuracy values, but they cannot be used in real daily life as they require wearing gloves and are limited to a fixed, unnatural environment. Moreover, many of them were user-dependent, meaning they must be retrained for each user, which is impractical. For these reasons, Youssif et al. [77] aimed to generalize and proposed an HMM-based model that neither depended on specific users nor required gloves. On the other hand, their model suffered from low accuracy, reaching only 82%.

CNNs are widely used in the field of image recognition and classification, and researchers have conducted many studies applying them to sign language recognition (SLR). Masood et al. [49] proposed a CNN model for ASL character recognition and achieved an overall accuracy of 96% on a dataset of 2,524 ASL gesture images. Wadhawan et al. [72], Bheda et al. [16], and Tao et al. [68] offered CNN architectures to classify the alphabet signs of different languages with accuracies of 99%, 82.5%, and 100%, respectively.

A CNN processes videos in a frame-by-frame manner. Coupling a CNN with an RNN makes it possible to keep information over time, especially in videos, so dynamic signs can be recognized more accurately. Yang et al. [75] proposed an effective continuous sign language recognition method based on the combination of a CNN and long short-term memory (LSTM) and achieved remarkable accuracies in experiments on their self-built dataset. Models based on 3D Convolutional Neural Networks (3D-CNN), unlike 2D-CNN models, do not require an additional RNN phase to keep information over time: a 3D-CNN takes multiple frames of a video at once, which helps it learn the sequence between frames without the need for an RNN. Huang et al. [37] and Al-Hammadi et al. [3] proposed models based on that approach. The approach proposed in the current study follows the CNN-RNN route; specifically, the authors use two CNNs as feature extractors and, for the RNN, Bi-directional long short-term memory (BiLSTM) layers. The BiLSTM layers identify the complex sequences in videos to overcome conflicts between different classes.

3 Arabic sign language datasets

Many of the available Arabic sign language datasets that focus on letters or words are restricted by specific conditions, such as (i) the user must wear gloves or (ii) many images refer to static words [3, 40, 49]. These restrictions conflict with the major goal of the current work, which is independence from unnecessary features related to specific users or the surrounding environment. This section starts by presenting the available sign language datasets, and after that, the proposed dataset is discussed in detail.

3.1 Available sign language datasets

Latif et al. [43] presented an Arabic Alphabets Sign Language Dataset named “ArASL”. It consists of 54,049 images compiled by more than 40 volunteers for the 32 standard Arabic signs and alphabets. They mentioned that the number of images per class is not the same; it differs from one class to another. They also created a Comma-Separated Values (CSV) file that contains the label of each image. It is available online at https://data.mendeley.com/datasets/y7pckrw6z2/1.

The Sign Language Digits Dataset was prepared by “Turkey Ankara Ayrancı Anadolu High School Students” [82]. Each image is (100 × 100) pixels in the Red-Green-Blue (RGB) color space. It consists of 10 classes (digits from 0 to 9) with a total of 2,062 images, collected from 218 students with 10 samples per student. It is available online at https://www.kaggle.com/ardamavi/sign-language-digits-dataset and https://github.com/ardamavi/Sign-Language-Digits-Dataset.

Another dataset covers the alphabet of American Sign Language [83]; it is available online at https://www.kaggle.com/grassknoted/asl-alphabet and https://github.com/SouravJain01/ASL_SIGN_PREDICTOR. The training set contains 87,000 images, each of size (200 × 200) pixels, in 29 classes (26 for the letters “A” to “Z” and 3 for “SPACE”, “DELETE”, and “NOTHING”). The test set contains a mere 29 images.

UCF-101 [60] is an action recognition dataset containing 13,320 realistic action videos collected from YouTube and spanning 101 categories. The categories can be divided into five types: (i) Human-Object Interaction, (ii) Body-Motion Only, (iii) Human-Human Interaction, (iv) Playing Musical Instruments, and (v) Sports. Apply Eye Makeup, Apply Lipstick, Archery, Baby Crawling, Balance Beam, and Band Marching are examples of these categories. It is available online at https://www.crcv.ucf.edu/data/UCF101.php.

Shohieb et al. [59] developed a dataset of Arabic sign language manual and non-manual signs named the SignsWorld Atlas. The postures, gestures, and motions were captured under different lighting and background conditions. The dataset contains 500 elements and includes (1) Arabic alphabets, (2) numbers from 0 to 9, (3) hand shapes, (4) signs in isolation, (5) movement in continuous sentences, (6) lip movement for a set of Arabic sentences, and (7) facial expressions. Table 1 summarizes the discussed existing datasets.

Table 1 Summary of the Existing Datasets

3.2 The proposed dataset

Creating a dataset that fits natural circumstances and environments is one of the main objectives of the current study. According to statistics from 2020 [80, 81], almost everyone owns a smartphone with a camera. Following this concept, the dataset was created using smartphone videos. Videos were recorded natively using the authors’ mobile phones without any hardware or software stabilization tool, with different resolutions and in different locations, places, and backgrounds. In total, 8,467 videos were recorded for 20 signs from 72 volunteers. The recording criterion was that each volunteer performs each sign at least 5 times (i.e., around 100 videos per volunteer). The volunteers were males and females aged between 20 and 24. Table 2 shows each sign with the corresponding number of videos. Figure 2 summarizes the statistics of each word in the suggested dataset; the average (i.e., mean) number of videos per sign is 423.35 and the standard deviation is 18.58. Figure 3 shows sample frames of each word in the proposed dataset.

Table 2 Signs with the Corresponding Video Counts
Fig. 2

Graphical Statistics of the Dataset Words

Fig. 3

Sample Frames from each Word in the Proposed Dataset

4 Dataset pre-processing

In this section, the pre-processing stages applied to the raw data are presented. As mentioned in the previous section, the proposed dataset videos were captured with mobile cameras, not with a professional or even a fixed camera; hence, the videos are affected by a noticeable amount of noise. Following the rules of feature selection [4, 41, 45], a suitable way should be found to extract only the necessary movement from each frame so that the model can generalize to any signer under any circumstances [46]. Each raw video passes through three stages before it can be used with the proposed model (discussed in Section 5).

First Stage:

The first stage reduces each frame’s dimensions and converts the frames to grayscale. The benefits of this stage are (1) reduced processing time and (2) lower overall complexity.

Second Stage:

The output of the first stage is then passed to a difference function, as shown in Fig. 4. The difference function subtracts every two consecutive frames to find the motion, as shown in Equation (1). If the resultant frame is totally white or black, it is discarded; otherwise, an adaptive threshold [38] is applied to it. This approach keeps the most important information in the frames. Applying it to all frames of a video retrieves (n − 1) frames, where n is the number of video frames. The output frame can optionally be resized using a resizing function. Figure 5 shows a sample preview of the second pre-processing stage applied to a sample video; a code sketch of this stage is given after Fig. 5.

$$ frame_{diff} = frame_{i} - frame_{(i-1)} $$
(1)
Fig. 4

A Sample Preview on the Second Pre-Processing Stage

Fig. 5

Sample Preview after the Second Stage on a Sample Video
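The following is a minimal OpenCV/Python sketch of the first two pre-processing stages. The target frame size of 128 × 128, the Gaussian adaptive thresholding with a block size of 11 and a constant of 2, and the use of the absolute difference are illustrative choices, not parameters fixed by the text.

```python
import cv2

def stage_one(frame, size=(128, 128)):
    """Stage 1: reduce the frame dimensions and convert it to grayscale."""
    return cv2.cvtColor(cv2.resize(frame, size), cv2.COLOR_BGR2GRAY)

def stage_two(video_path, size=(128, 128)):
    """Stage 2: difference consecutive frames, threshold, and drop empty results."""
    cap = cv2.VideoCapture(video_path)
    motion_frames, prev = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = stage_one(frame, size)
        if prev is not None:
            diff = cv2.absdiff(gray, prev)              # Equation (1)
            thr = cv2.adaptiveThreshold(diff, 255,
                                        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                        cv2.THRESH_BINARY, 11, 2)
            if 0 < thr.mean() < 255:                    # discard all-white / all-black frames
                motion_frames.append(thr)
        prev = gray
    cap.release()
    return motion_frames                                # up to (n - 1) motion frames
```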

Third Stage:

The third and last stage unifies each class’s features and adds a unique factor to each class’s videos. The output is only 30 frames out of the (n − 1) frames, where each unified frame combines (3 × 3) frames, as shown in Fig. 6. These frames are not selected randomly; instead, the selection is related to the index of the currently formed frame. The main purpose of this stage is to reduce redundancy without dropping any frame, keeping the information of all frames within the 30 unified frames. This reduces conflicts between signs that share similar movement positions but differ in their operation sequences, as these frames track the hands’ positions over time. A code sketch of this stage follows Algorithm 1 below.

Fig. 6

Stage 3 Pre-Processing: Sample Preview

Algorithm 1 summarizes the three dataset pre-processing stages with their inner steps.
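A minimal sketch of the third stage is given below. The exact rule that ties the nine tiled frames to the index of the currently formed frame is not spelled out in the text, so the uniform-stride indexing used here is an assumption; only the tiling into (3 × 3) mosaics and the fixed output of 30 frames follow the description.

```python
import numpy as np

def stage_three(motion_frames, n_out=30, grid=3):
    """Stage 3: build n_out unified frames, each a (grid x grid) mosaic of motion frames."""
    n = len(motion_frames)
    unified = []
    for k in range(n_out):
        # Indices derived from the index k of the currently formed frame
        # (uniform stride over all (n - 1) motion frames; illustrative only).
        idx = [(k + j * max(n // (grid * grid), 1)) % n for j in range(grid * grid)]
        tiles = [motion_frames[i] for i in idx]
        rows = [np.hstack(tiles[r * grid:(r + 1) * grid]) for r in range(grid)]
        unified.append(np.vstack(rows))
    return unified                                      # 30 unified frames per video
```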


5 The proposed architecture

This paper contributes an architecture for recognizing and classifying videos, specifically for sign language recognition. The main idea behind the proposed model (i.e., architecture) is to train two different CNNs independently, using the same architecture but different portions of the data. Their input is the frames produced in the pre-processing phase. The outputs of the two CNNs are concatenated into one single vector of size (1 × 512) per frame, and the resulting sequence is passed to an RNN, which has a great ability to identify sequences in videos: the RNN can learn from the changes over time in each sequence and generalize across classes. The authors set the RNN sequence size to (30 × 512). This approach helps the network identify different features for the same input and improves its overall confidence and accuracy. Figure 7 shows an overview of the suggested model; a feature-extraction sketch is given after Fig. 7.

Fig. 7

Overview on the Proposed Architecture
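The sketch below illustrates how one pre-processed video is turned into the (30 × 512) sequence fed to the RNN, assuming the two CNNs are Keras models truncated at their GAP layers so that each yields a 256-dimensional vector per frame (which matches the last block depth described in Section 5.1).

```python
import numpy as np

def video_to_sequence(frames, cnn_a, cnn_b):
    """Map 30 pre-processed frames to one (30 x 512) feature sequence.

    cnn_a and cnn_b are the two independently trained CNNs of Section 5.1,
    truncated at the GAP layer so each returns a 256-d vector per frame.
    """
    batch = np.stack(frames)                             # (30, 128, 128, 3)
    feats_a = cnn_a.predict(batch, verbose=0)            # (30, 256)
    feats_b = cnn_b.predict(batch, verbose=0)            # (30, 256)
    return np.concatenate([feats_a, feats_b], axis=-1)   # (30, 512)
```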

5.1 Convolutional Neural Network (CNN)

The CNN is used to extract spatial features in the proposed architecture. The convolutional layers [10, 57] extract the features and detect different patterns in multiple sub-regions (i.e., the receptive fields covered by the kernels). The pooling layers [13, 64] keep the most important features and progressively reduce the spatial size of the input, which reduces the number of parameters and the computational cost of the architecture and hence helps control overfitting [9, 33]. There are different types of pooling layers, such as max-, min-, and average (i.e., mean) pooling [7]. The max-pooling and min-pooling layers take the maximum and minimum values from the previous layer, respectively, while the average-pooling layer takes the average. Max-pooling is the most commonly used type [8]; a toy comparison is sketched below.
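As a concrete illustration of the difference, the toy example below applies (2 × 2) max- and average-pooling to a small feature map (NumPy only; it is not part of the proposed architecture).

```python
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 2],
                 [7, 2, 9, 4],
                 [3, 1, 5, 8]], dtype=float)

def pool_2x2(x, op):
    """Apply op (e.g., np.max or np.mean) to non-overlapping 2x2 windows."""
    return np.array([[op(x[i:i + 2, j:j + 2]) for j in range(0, x.shape[1], 2)]
                     for i in range(0, x.shape[0], 2)])

print(pool_2x2(fmap, np.max))   # [[6. 2.] [7. 9.]]
print(pool_2x2(fmap, np.mean))  # [[3.5  1.25] [3.25 6.5 ]]
```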

Figure 8 shows the building blocks of the used CNN architecture. The input layer accepts frames sized (128 × 128 × 3). It is followed by four “Conv-Pool-Drop” blocks, a global average pooling (GAP) layer [36], and a prediction network. Each “Conv-Pool-Drop” block has two convolutional layers and one max-pooling layer, followed by a dropout layer [6, 14] with a ratio of 0.5. The dropout layer reduces overfitting and increases the network’s ability to generalize. The four blocks share almost the same dimensions except for their depths, which are 128, 256, 512, and 256, respectively, from left to right. The global average pooling layer reduces the spatial dimensions; GAP layers apply an extreme dimensionality reduction in which the input is reduced to dimensions of (1 × 1 × d), collapsing each feature map to a single value by averaging all of its values [12, 29].

Fig. 8

The Building Blocks of the CNN Architecture

The prediction network is composed of two “Dense-Drop” blocks and one fully-connected (FC) layer [79]. It takes the output of the CNN, flattens it (i.e., converts it from multiple dimensions to a one-dimensional vector [8]), and uses it to classify the input into its class. Each “Dense-Drop” block contains a dense layer with 1,024 neurons and a dropout layer with a ratio of 0.2. The Rectified Linear Unit (ReLU) [2, 11] is used as the activation function in the hidden layers. ReLU is one of the common activation functions: it returns 0 for negative inputs and the value itself for positive inputs, and it helps the network model interaction effects and non-linearities. Equation (2) shows the ReLU function [32].

$$ \text{ReLU}(input) = \max{(0, input)} $$
(2)

The last FC layer contains 20 neurons with a SoftMax activation function [24]. The batch size used for the CNN network is 64. Table 3 shows the internal layers in detail, and Fig. 9 shows the internal structure of the “Conv-Pool-Drop” and “Dense-Drop” blocks; a Keras-style code sketch of this CNN is given after Fig. 9.

Table 3 The Internal In-Detail Blocks of the CNN Architecture
Fig. 9

The Internal Structure of the “Conv-Pool-Drop” and “Dense-Drop” Blocks
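A minimal Keras sketch consistent with Fig. 8, Fig. 9, and Table 3 is shown below. The kernel sizes, the padding, and the (2 × 2) pooling windows are not stated explicitly in the text and are assumptions.

```python
from tensorflow.keras import layers, models

def conv_pool_drop(x, filters):
    """'Conv-Pool-Drop' block: two convolutions, max-pooling, dropout of 0.5."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    return layers.Dropout(0.5)(x)

def build_cnn(num_classes=20):
    inputs = layers.Input(shape=(128, 128, 3))
    x = inputs
    for filters in (128, 256, 512, 256):       # block depths, left to right
        x = conv_pool_drop(x, filters)
    x = layers.GlobalAveragePooling2D()(x)     # 256-d features, later reused for the RNN
    for _ in range(2):                         # 'Dense-Drop' prediction blocks
        x = layers.Dense(1024, activation="relu")(x)
        x = layers.Dropout(0.2)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```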

5.2 Recurrent Neural Network (RNN)

RNNs make use of the information in a sequence for recognition tasks. Traditional RNNs suffer from vanishing gradients, which prevents them from learning long dependencies [25]. Long Short-Term Memory (LSTM) is a variant of the RNN designed to efficiently solve the vanishing and exploding gradient problems [30]. Bi-directional LSTMs (BiLSTMs) are an extension of traditional LSTMs that improves model performance on sequence classification problems [69]. When all time steps of the input sequence are available, BiLSTMs train two LSTMs instead of one: one on the sequence as-is and one on a reversed copy of it. This provides additional context to the network and results in faster and fuller learning of the task at hand.

In the suggested RNN model, the combined output of the two CNNs is fed to five cascaded layers of 512 BiLSTM units each. Every one of these layers is followed by a dropout layer with a rate of 0.9 to avoid overfitting. These layers are followed by an FC layer with a SoftMax activation function that predicts the output. Reducing the number of BiLSTM layers while keeping the same number of units was also experimented with, as was using only 3 BiLSTM layers with 2,048, 1,024, and 2,048 units, respectively. Different configurations were tested (by trial and error) on the suggested dataset, and 5 BiLSTM layers with 512 hidden units performed the best. Figure 10 shows the building blocks of the used RNN architecture; a code sketch follows Fig. 10. The activation function used in the hidden layers is ReLU [2], and the batch size for the RNN network is 64.

Fig. 10

The Building Blocks of the RNN Architecture
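A Keras-style sketch of this RNN is given below, assuming 512 units per LSTM direction and that only the last time step of the final BiLSTM feeds the FC layer; neither detail is fixed by the text.

```python
from tensorflow.keras import layers, models

def build_rnn(num_classes=20, seq_len=30, feat_dim=512):
    """Five cascaded BiLSTM layers (dropout 0.9 each) followed by a SoftMax FC layer."""
    inputs = layers.Input(shape=(seq_len, feat_dim))   # the (30 x 512) CNN feature sequence
    x = inputs
    for i in range(5):
        return_seq = i < 4                             # last BiLSTM keeps only its final state
        x = layers.Bidirectional(layers.LSTM(512, return_sequences=return_seq))(x)
        x = layers.Dropout(0.9)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```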

To train the model, the Adaptive Moment Estimation (Adam) optimizer is used [17]. It is an optimization algorithm used to update the network’s weights (i.e., parameters) iteratively during training, instead of the classical stochastic gradient descent procedure [18]. Adam combines the heuristics of both Momentum and RMSProp, and hence it can handle sparse gradients on noisy problems [78]. Its update rule is shown in Equation (3).

$$ w_{t+1} = w_{t} - \eta \times \frac{v_{t}}{\sqrt{s_{t} + \epsilon}} $$
(3)

where η is the initial learning rate (10^-4 in the current study), v_t is the exponential moving average of the gradients g_t at time t, s_t is the exponential moving average of the squared gradients, and 𝜖 is a very small value that avoids division by zero (e.g., 10^-10). Adam is used with a decay rate of 10^-6; a configuration sketch is given below.
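The corresponding optimizer configuration, as a sketch, is shown below. Note that the `decay` argument is available in older tf.keras releases, while newer ones express the same behavior through a learning-rate schedule.

```python
from tensorflow.keras.optimizers import Adam

# Learning rate 1e-4 and decay 1e-6 as stated above; the remaining Adam
# hyper-parameters (beta_1, beta_2, epsilon) are left at their defaults.
optimizer = Adam(learning_rate=1e-4, decay=1e-6)
```

The RNN from the previous sketch can then be compiled with this optimizer (e.g., with a categorical cross-entropy loss, which is not stated explicitly in the text).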

6 Experimental results and discussion

Two types of experiments are performed: the first on the suggested dataset and the second on the UCF-101 dataset. The common configurations of these experiments are summarized in Table 4.

Table 4 Experiments Common Configurations Summarization

6.1 The suggested dataset experiments

The results on the proposed dataset are shown in Table 5 and presented graphically in Fig. 11. The top-1 accuracies are reported for each class on the validation and test sets. From the results, it can be noticed that four classes, “Good”, “Hear”, “Thanks”, and “Thinking”, have very low accuracies relative to the other classes. The reason is that these signs are very similar in their kinematic movement and performance, which leads to conflicts between them and, in turn, to the low accuracies of these classes.

Table 5 Top-1 Accuracies of the Proposed Dataset
Fig. 11

Graphical Summarization of the Results of the Suggested Dataset

The conflicts are also clear in the confusion matrix on the test data shown in Fig. 12. As mentioned, the conflict between these few classes is partly due to the performers’ lack of experience. In particular, there is a large conflict between the “Thanks” and “Thinking” signs, as they are almost identical and need to be performed accurately to be recognized correctly.

Fig. 12

The Confusion Matrix on the Test Data of the Suggested Dataset

6.2 The UCF-101 dataset experiments

To check how well the model behaves on other datasets and how it generalizes, the pre-processing stages were applied and the model was trained on the UCF-101 dataset. Table 6 shows a comparison between different models and the proposed model. The presented results are reported after performing 3-fold cross-validation [19]; a sketch of this protocol is given below.
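A sketch of the 3-fold protocol is shown below. It assumes the pre-processed UCF-101 sequences and labels are already stored in hypothetical `features.npy` and `labels.npy` files and reuses the `build_rnn` helper from the Section 5.2 sketch; the fold generation shown here is a generic scikit-learn split, not necessarily the official UCF-101 splits.

```python
import numpy as np
from sklearn.model_selection import KFold

features = np.load("features.npy")   # hypothetical (num_videos, 30, 512) array
labels = np.load("labels.npy")       # hypothetical integer class labels (0..100)

accuracies = []
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(features):
    model = build_rnn(num_classes=101)                  # helper from the Section 5.2 sketch
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(features[train_idx], labels[train_idx], batch_size=64, epochs=10, verbose=0)
    _, acc = model.evaluate(features[test_idx], labels[test_idx], verbose=0)
    accuracies.append(acc)

print("mean top-1 accuracy over 3 folds:", np.mean(accuracies))
```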

Table 6 Top-1 and Top-5 Accuracies of the UCF-101 Dataset

As the table shows, the proposed model achieved state-of-the-art results on the UCF-101 dataset, reporting accuracies better than those of 10 previous studies. These results confirm that the proposed model can be used on different action recognition datasets, not only sign language datasets such as the suggested one.

7 Conclusions and future work

In this paper, we proposed an Arabic sign language dataset of 8,467 videos covering 20 signs performed by different volunteers. Capturing the videos did not require any tools other than a mobile phone. We also suggested a new approach (i.e., architecture) for video classification and recognition using a combination of CNNs and an RNN, together with the pre-processing applied to the captured videos. We used two CNNs as feature extractors on the videos’ frames and concatenated these features into a sequence. The RNN was used to identify the relationships within the sequences and produce the overall prediction. With this approach, we reached state-of-the-art results, achieving 98% and 92% accuracy on the validation and testing subsets of the suggested dataset, respectively. The suggested approach also achieved very promising accuracies on the UCF-101 dataset: 93.40% top-1 and 98.80% top-5.

In the future, we plan to enlarge the suggested dataset with new signs and more users and to shed light on phrases, not just words. We can also modify the proposed model to adapt to new videos with the ability to produce grammatically correct phrases. Different architectures, approaches, networks, and methods can also be explored, and more experiments can be conducted on other Arabic datasets.