1 Introduction

According to the World Health Organization [1], around 430 million people have hearing loss of moderate or higher severity, while the total number of people experiencing some degree of hearing loss rises to over 1.5 billion globally. As a result, almost 5% of the world population can be considered deaf, a figure that grows to almost 20% if hard-of-hearing people are also considered.

Sign Languages (SL) are the main medium of communication for the deaf community, with about 300 different Sign Languages worldwide [2], which only 1% of the population (almost exclusively deaf people themselves and their families) understand. As a result, this group faces daily challenges in communicating with hearing individuals, which makes it even harder for them to access education, healthcare, employment and entertainment, and to engage in social interactions where effective communication is crucial [3].

Therefore, providing a system capable of translating spoken languages into SL and vice versa, to facilitate the exchange of information between deaf people and hearing people who do not know Sign Language, remains an open challenge that technology can help address. SL are visual languages that include non-manual features (facial and body expressions) beyond the manual gesture itself to convey additional information. Sign Languages have their own grammatical rules and developed independently of spoken languages [4], which is why they have such a low comprehension rate among those who do not know the language, with no one-to-one gloss-to-word correspondence between Sign Languages and their related spoken languages.

To address the challenge of improving communication between the deaf and hearing communities, several techniques have already been explored in the literature: Sign Language Recognition (SLR) [5], the process of identifying signs, comprising manual and non-manual gestures, and translating them into one or more glosses (the written representation of a sign); Sign Language Translation (SLT) [6, 7], whereby a spoken/written language sentence is extracted from a video in which signs are performed continuously; and Sign Language Production (SLP) [8, 9], whose goal is the generation of videos or sequences of static images from text or spoken language. SLR in turn includes two main categories: Isolated Sign Language Recognition (ISLR, sometimes referred to in the literature as word-level recognition), which aims to recognize isolated glosses and is the focus of this work, and Continuous Sign Language Recognition (CSLR), whose goal is to recognize each gloss that makes up an SL sentence.

The study of these techniques requires large datasets with sufficient vocabulary and variability, which are challenging to produce because of the need for professional interpreters to validate them. For Spanish Sign Language (LSE) in particular, there are few datasets, and they contain a limited number of signs, as can be seen in Table 1. In addition, privacy concerns arising from the visual nature of the data need to be addressed [10]. Another problem is the variation of the same SL between regions, where completely different signs are used for the same gloss without a specific dictionary to clarify the different signs in use.

Given these considerations, with a specific focus on Isolated Sign Language Recognition (ISLR), the main contributions of this paper are:

  • We provide a new dataset for Spanish Isolated Sign Language Recognition: CALSE-1000 with the largest number of LSE videos for ISLR to date;

  • We propose a new technique based on face swapping and affine transformations to increase the size of ISLR datasets without increasing recording time, while ensuring anonymity;

  • We improve the accuracy of the I3D recognition model [11] using our proposal by up to 11.7 points in top-1, up to 8.8 points in top-5 and up to 9 points in top-10 accuracy.

The rest of this paper is organized as follows: Sect. 2 provides context on the scarcity of Sign Language datasets and the methods used to increase their size, and Sect. 3 describes our method, the created dataset, and the techniques applied to it. Section 4 details the experimental setting and the dataset creation process, and Sect. 5 presents the experiments performed and the results obtained. Finally, we conclude the paper in Sect. 6 by discussing our findings and outlining possible future work.

2 Related work

Sign Language Recognition is an arduous and complex task due to the scarcity of adequate and consistent datasets [10]. Creating an SLR dataset in a controlled environment under optimal conditions is a time-consuming process and requires validation and supervision by professional interpreters.

2.1 Data scarcity

Sign Language Recognition is a computer vision task that has been explored for years. Although several techniques, such as Hidden Markov Models (HMM) [12], Neural Network based methods [13] and Deep Learning methods [14], have been used to study it, having a sufficient number of available, coherent datasets is the most fundamental prerequisite for working on this problem. Providing a large, public dataset for Sign Language Recognition is often difficult: since these datasets consist of videos, it is important to create them in a controlled environment under optimal conditions (no occlusions, consistent illumination, different viewpoints, etc.) in order to provide realistic samples.

This makes the creation of an SL dataset for recognition tasks time-consuming and, as sign language is unfamiliar to many people, it requires the validation and supervision of professional interpreters to represent the information correctly, so few datasets are ready for SLR. In addition, a dataset must include a sufficient variety of interpreters to provide enough variability for trained models to generalise once put into production. The situation is also influenced by the number of sign languages in existence, around 300 [2]; in practice, the largest research datasets are for American Sign Language. On the one hand, existing datasets may be incomplete in many cases, providing only gloss information or showing only one possible view [15]; on the other hand, the annotation format of the information tends to be inconsistent among the available sets, as there is no specific convention or protocol for organising the content. Consequently, the type and format of the information included do not always match among the different corpora and, in some cases, may not even be compatible [16]. Nevertheless, some datasets are designed to be complete and, at the same time, challenging, also providing additional information such as different views and depth data.

Another aspect to consider is the quality and realism of the samples, as discussed by Sincan et al. [17], who presented the AUTSL dataset as an alternative to less realistic datasets such as PHOENIX-2014-T [6] or WLASL [18], in which signers have similar body shapes, clothes and even backgrounds. The AUTSL dataset considers different environments, positions, and body types in response to this problem. This idea of increasing dataset complexity has been an inspiration for the work carried out in Sect. 4. Table 1 presents a summary of the available published datasets that include isolated signs.

Table 1 Overview of Isolated SLR (ISLR) datasets with their main characteristics. In the Signers section, the total number of signers participating in the dataset is listed first, followed by the average per video

As can be seen, two new datasets are included: one with 1000 glosses and 6 signers per gloss, and a smaller one of 100 glosses that can be used for faster testing.

2.2 Data augmentation

Data augmentation comprises several methods that improve the quality and size of the training dataset, reducing overfitting and thus helping the model extract more information from the original dataset. The main objective is to add “new” data derived from modifications of the original data.

Many data augmentation techniques are used for different tasks [34, 35]. For image-based tasks (classification, object detection, etc.), one can apply augmentations based on data warping, such as colour and geometric transformations (reflecting, cropping, rotation, flipping, etc.), random erasing, kernel filters, or adversarial training. Oversampling augmentations are also used, creating synthetic instances that are added to the training set, such as image blending, feature space augmentation, or Generative Adversarial Networks (GAN), which have proven to be very efficient in augmenting datasets [36]. For video-based tasks (action recognition, object detection and segmentation, among others), the temporal dimension is exploited in addition to the image augmentation techniques described above. In both the video and image cases, using masks to modify the background or the placement of the main objects across the frame(s) has also proven to be quite an effective transformation, as seen in [37].
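As a brief illustration of the image-level transformations mentioned above, the following sketch composes a few of them with torchvision; the parameter values are arbitrary examples and do not correspond to the configuration used in this work.

```python
# Illustrative image-level augmentation pipeline (arbitrary example parameters).
from torchvision import transforms

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # colour transformation
    transforms.RandomRotation(degrees=10),                                  # geometric transformation
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),               # cropping
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),                                        # random erasing
])

# augmented = augment(pil_image)
# For video, the same sampled parameters should be reused on every frame of a
# clip so that the augmentation is temporally consistent.
```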

Data augmentation has previously been used in other Sign Language Recognition studies, mostly to prevent overfitting [38] and to increase dataset size [39, 40], as well as for Sign Language Translation (SLT) tasks in the semantic part of the problem, attempting to improve gloss-to-text translation through synonym replacement [41] or by using syntactic rules and word order modifications to create synthetic gloss data [42].

2.3 Face swapping

Deepfake algorithms combine techniques to manipulate and create fake images and videos by transferring important features from a source image (or video) to a target image (or video) such that humans cannot distinguish them from real ones [43, 44]. Deepfakes can be created with traditional visual effects or computer graphics approaches, although the most recently applied techniques are Deep Learning models, such as GANs and autoencoders, which are widely used in computer vision [45, 46].

Mirsky et al. [47] categorize the media content generated by deepfakes into four types: reenactment [48], where the source is used to drive the expression, mouth, gaze, pose, or body of the desired target; replacement [49], which includes face swapping, replacing the target’s content with that of the input and thus preserving the input’s identity; editing [50], which involves the removal, addition, or alteration of attributes of the target (changes in clothing, ethnicity, age, etc.); and finally synthesis [51], in which the deepfake is created without any target as a base.

Due to privacy concerns regarding the visual nature of Sign Language datasets, anonymization has been used to increase participants’ willingness to take part in the creation of SL datasets [10]. However, techniques such as pixelation [52], blackening [53] and greyscale filtering [10] are not applicable to SLR because, during sign execution, facial and body information not only play an important role in the correct meaning of the sign but are strictly necessary to convey its real meaning; without them, only gestures devoid of meaning remain. The generation of realistic face-swapped synthetic data is therefore an alternative to the privacy problem, since all the desired characteristics of the original dataset are retained, but without sensitive content, making it impossible to identify individuals [54, 55]; it is also a solution in applications such as enlarging unbalanced or insufficient datasets [56].

3 Methodology

Fig. 1 Pipeline explaining the methodology employed

Our approach starts with the collection of the dataset, which is explained in Sect. 4, and then tests the influence of data augmentation and face swapping by combining them with different variants of the original dataset. Once these variants have been obtained, the I3D model [11] is applied for ISLR on each of the generated datasets in order to observe the improvement with respect to applying the model to the original data. Figure 1 shows a pipeline explaining the methodology employed. For the experiments performed, the cross-dataset technique [57] has been applied, separating one of the sets as a test set and leaving the rest of the data for training.

In addition, the influence of facial expression on the model is also tested by applying face omission to the test set. The results obtained are reported in Sect. 5.

All experiments were performed on an NVIDIA GeForce RTX 3090 and an NVIDIA A100.

To increase the dataset and to avoid identity-related problems through anonymization of the data, the FaceSwap [58] tool has been used, which employs Deep Learning techniques to recognize and swap the faces of the signers that make up the CALSE-100 dataset.

The face swapping used to generate the deepfakes is composed of three stages (a minimal scripting sketch is given after the list):

  1. Extraction. In this first stage, faces are extracted from the target video for the later training: facial landmarks are detected, the images are cropped, and the resulting faces are saved for training. In this step it is important to have a large set of images containing the face of the subject to be trained, and to take into account the data quality and the variety of angles and expressions.

  2. Training. In this step, the ‘Phaze-A’ model [59] (the latest FaceSwap model) is trained for 35,000–40,000 iterations (depending on the models used for the face swapping), with a batch size of 10.

  3. Conversion. In this last stage, face extraction is performed again (in this case, on the source video) and the face swap is then applied to obtain the final set of videos with face swapping applied.
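The three stages can be scripted for each identity; the sketch below drives them with Python’s subprocess module. The subcommand names and flags shown are assumptions based on common FaceSwap usage and may differ between versions of the tool, so they should be checked against its documentation.

```python
# Illustrative driver for the three FaceSwap stages of one identity.
# NOTE: subcommands and flags are assumptions; verify against the FaceSwap docs.
import subprocess
from pathlib import Path

FACESWAP = "faceswap.py"              # entry point of the FaceSwap tool [58]
WORK = Path("swap_model_1")           # hypothetical working directory

def run(args):
    """Run one FaceSwap subcommand and stop if it fails."""
    subprocess.run(["python", FACESWAP, *args], check=True)

# 1) Extraction: detect, align and crop faces for both identities.
run(["extract", "-i", "videos/target_signer", "-o", str(WORK / "faces_a")])
run(["extract", "-i", "videos/source_signer", "-o", str(WORK / "faces_b")])

# 2) Training: fit the Phaze-A model (batch size 10; in our setting training
#    is stopped after roughly 35,000-40,000 iterations).
run(["train", "-A", str(WORK / "faces_a"), "-B", str(WORK / "faces_b"),
     "-m", str(WORK / "model"), "-t", "phaze_a", "-bs", "10"])

# 3) Conversion: re-extract faces from the source video and swap them in.
run(["convert", "-i", "videos/source_signer", "-o", str(WORK / "swapped"),
     "-m", str(WORK / "model")])
```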

Fig. 2 Scenarios A, B and C, respectively, in which the collaborating interpreter performs each sign

Fig. 3 Results of applying face swapping with 3 different models

The face swapping technique is applied to the original dataset shown in Fig. 2, obtaining the results presented in Fig. 3. It is important to highlight that this technique has been used only on the videos of our interpreter. This is because the extraction, training and conversion steps of the tool must be carried out independently for each identity, which means that obtaining deepfakes for the entire public set of DILSE, SACU and Spread the Sign, which contains a wide range of signers, would be a very costly process.

Fig. 4 Different frames obtained by applying face swapping

Moreover, Fig. 4 shows different frames of the results after applying face swapping in more detail. Comparing them, the results can be quite accurate when there are no hand occlusions or much facial expressiveness (frames A and E), while in frames with stronger expressions (frames C and D) or occlusions caused by the hand passing in front of the face (frame B), details of the original image are lost. We consider that, although some frames in the videos may not be maximally accurate for these reasons, the results obtained are good enough to be used during training.

In addition, a new, publicly available libraryFootnote 1 has been used to increase the dataset size by applying augmentations to the videos.

Although several transformation functions are implemented, we only consider affine transformations. An affine transformation is the combination of a linear transformation and a translation: it preserves points, straight lines and planes, and therefore parallelism, but not necessarily Euclidean distances and angles. It thus includes the classical transformations, i.e. translations, reflections, scalings and rotations. The function used applies a random affine transformation, which may therefore result in any of the above transformations. The library also offers the possibility to apply these transformations individually, in order to provide a complete tool.
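As an illustration (not the API of the library cited in Footnote 1), the following sketch samples one random affine transformation per video and applies it identically to every frame with torchvision, so that the spatial deformation stays temporally consistent; the parameter ranges are arbitrary.

```python
# Apply one randomly sampled affine transformation to all frames of a clip.
import random
import torch
import torchvision.transforms.functional as TF

def random_affine_clip(clip, max_deg=10.0, max_shift=0.1, scale_range=(0.9, 1.1)):
    """clip: float tensor [T, C, H, W]. Returns the transformed clip."""
    _, _, h, w = clip.shape
    angle = random.uniform(-max_deg, max_deg)                     # rotation (degrees)
    translate = [int(random.uniform(-max_shift, max_shift) * w),  # shift in pixels
                 int(random.uniform(-max_shift, max_shift) * h)]
    scale = random.uniform(*scale_range)                          # isotropic scaling
    shear = [random.uniform(-5.0, 5.0)]                           # shear (degrees)
    return torch.stack([TF.affine(frame, angle=angle, translate=translate,
                                  scale=scale, shear=shear)
                        for frame in clip])
```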

Previous works have utilized anonymization techniques such as pixelation or blackening to preserve privacy. However, since facial expression plays a crucial role in interpreting signs correctly, the test set underwent face omission. This helped evaluate the importance of facial expressions for the model to recognize signs accurately. The results of this evaluation will determine whether incorporating face swapping techniques into the data is worthwhile.

4 Experimental setting

4.1 Dataset

In this section, we detail how we collected the dataset used to achieve our objectives. The dataset, named CALSE (“Conjunto Aislado de Lengua de Signos Española”), has been formed by obtaining videos from 3 different, publicly available data sources: the Dictionary of Spanish Sign Language (DILSE) [60] and Spread the Sign (STS) [61] dictionaries, as well as the dataset from the University Community Assistance Service (SACU) of the University of Seville [62], a new tool created to meet the needs of hearing-impaired students who use SL as their means of communication. Examples of these 3 sources that compose the dataset can be seen in Fig. 5.

The CALSE-1000 set is composed of 1000 glosses, with at least two video samples of each gloss extracted from the sources described above (because not all vocabulary is present in the 3 public data sources). In addition, this set has also been signed by a professional Spanish Sign Language interpreter, adding further examples for each word in the set. Thanks to this collaboration, three more samples of each word have been incorporated, giving a total of 5 examples of each word.

Fig. 5 Capture of the different public data sources: SACU, DILSE and STS, from left to right

It is important to ensure that the model can adapt to situations and scenarios not covered by the training data, so that once the model is trained it remains accurate on new input data that differs from the training data. To ensure this, it is crucial to incorporate variations in the appearance and style of the signs during training while always faithfully preserving the meaning of the sign. For this purpose, we designed several scenarios for recording the signs performed by the interpreter. These scenarios cover changes in perspective (such as front or side view), intensity when performing the signs, and variations in clothing, among other aspects. This approach introduces diversity among the different samples of the same signed words, ultimately resulting in a more complete, enriched dataset and a model that is more robust across different situations.

Figure 2 shows the 3 different scenarios in which the collaborating interpreter performed every sign, each with different clothing. In scenario A, the recording was performed facing forward, with the hair up and each sign emphasized; in scenario B, each sign was performed from the front, with normal emphasis and the hair down; finally, scenario C shows a profile perspective of the signs without emphasis.

This subset can be accessed and downloaded through the OneDrive folder.Footnote 2

Due to the limited availability of training data for signs, it has been decided to omit the validation set. With only 600 videos available, distributed over 6 videos per sign, any additional data separation for a validation set would significantly reduce the size of the training set. In this context, the priority is to maximize the amount of training data to allow our models to effectively learn the distinctive features of each sign. By not using a validation set, we can take full advantage of our limited resources and train more robust and generalized models that better fit the available data. While we understand the importance of evaluating model performance on unseen data, we believe that in this particular case, the quality and quantity of training data are crucial to the success of the research. Therefore, we split the samples into training and test in a 5:1 ratio for the different experiments. The training process ends when 60 epochs are reached, since this is the point at which the loss metric stabilizes and stops decreasing.

This dataset is available for use and download through our project’s GitHub repositoryFootnote 3.

4.2 Implementation details

For the isolated recognition we have used the I3D [11] network architecture implemented in PyTorch, the same as that used in [18]. All experiments used identical configurations, employing the Adam optimizer [63], a batch size of 10, a learning rate of \(10^{-3}\) and a total of 60 epochs. Due to the limited number of samples available for each sign, the dataset was divided into training and test sets only, applying the cross-dataset approach [57], so that one of the data sources is held out as the test set (equivalent to one example per sign) and the rest is used as the training set.
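For illustration, a minimal training loop matching the configuration above is sketched below; the I3D module and the dataset object are placeholders for the implementation of [18] and our own data loader, and the code is not taken from that repository.

```python
# Sketch of the training configuration (Adam, lr 1e-3, batch size 10, 60 epochs).
import torch
from torch import nn, optim
from torch.utils.data import DataLoader

def train_i3d(model: nn.Module, train_set, device="cuda",
              epochs=60, batch_size=10, lr=1e-3):
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True,
                        num_workers=4, pin_memory=True)
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        model.train()
        total_loss = 0.0
        for clips, labels in loader:            # clips: [B, C, T, H, W]
            clips, labels = clips.to(device), labels.to(device)
            optimizer.zero_grad()
            logits = model(clips)               # assumed shape: [B, num_classes]
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item() * clips.size(0)
        print(f"epoch {epoch + 1:02d}: loss = {total_loss / len(train_set):.4f}")
```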

To assess the performance of the models, we calculate the average scores for top-K classification accuracy with K = 1, 5, 10. This evaluation is conducted across all sign instances.
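The top-K accuracy can be computed as in the sketch below (the function is illustrative, not taken from an existing evaluation script).

```python
# Top-K classification accuracy over all test instances.
import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, ks=(1, 5, 10)):
    """logits: [N, num_classes]; labels: [N]. Returns {K: accuracy in %}."""
    topk = logits.topk(max(ks), dim=1).indices          # [N, max(ks)]
    hits = topk.eq(labels.unsqueeze(1))                 # True where the label appears
    return {k: hits[:, :k].any(dim=1).float().mean().item() * 100 for k in ks}
```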

5 Results

5.1 Experimental set

A first experiment was performed with the complete CALSE-100 set without data augmentation during training, setting aside a total of 100 original videos for the test set. The resulting model was also tested on the same 100 videos after applying face parsing to omit the entire face of each signer.

Fig. 6 Result of applying face omission on a DILSE set signer

Figure 6 shows the output obtained after applying face omission to the DILSE test set. This process was performed on all the videos in the set through face analysis and segmentation; the source code was obtained from the face-parsing [64] repository, which provides PyTorch implementations of common models and algorithms for this kind of task.
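A minimal sketch of the face-omission step is shown below; it assumes that the face-parsing model [64] has already produced a per-pixel mask marking facial regions, and the variable and function names are illustrative.

```python
# Black out the facial region of each frame using a face-parsing mask.
import numpy as np

def omit_face(frame: np.ndarray, face_mask: np.ndarray) -> np.ndarray:
    """frame: H x W x 3 uint8 image; face_mask: H x W array, non-zero where
    the parser labelled face classes (skin, eyes, nose, mouth, ...)."""
    out = frame.copy()
    out[face_mask > 0] = 0           # replace facial pixels with black
    return out

# Usage: run the face-parsing network on every frame of a test video to obtain
# face_mask, call omit_face, and re-encode the processed frames into a video.
```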

Fig. 7 Training results with original data applied to the original DILSE test set and with face omission

Figure 7 illustrates the training results obtained from testing with original and face omission videos. The omission of facial expression information leads to a significant decrease in results. This observation underscores the critical role of facial expressions in the model’s performance. Additionally, it emphasizes the rationale behind utilizing face swapping instead of alternative techniques such as pixelation or blackening, which may anonymize the data but also remove valuable information in the process.

To assess the effect of data augmentation on CALSE-100 training, a test battery was created that merged original data from the training set with data generated through face swapping and other augmentations. For face swapping, three models (two male and one female) were used, with 200 new videos generated for each model. Not all of the face swapping videos created were used in every experiment; for instance, the FS1 experiment adds 200 videos in which the applied face swap corresponds to model 1. Additionally, affine transformations were randomly applied to each video in the dataset, with AF1 and AF2 differing in the subset to which they were applied: AF1 corresponds to the affine transformation applied to our interpreter's videos, while AF2 applies to videos from the public datasets.

Table 2 Preliminary experiments. Column AF1 denotes the affine transformation applied to our interpreter videos, while AF2 corresponds to videos from the public datasets

To evaluate the influence of the data augmentation techniques on model performance, the test battery described in Table 2 was applied to different subsets using the cross-dataset approach, in which one public repository is selected as the test set and the remaining videos in the dataset are used for training. The experimental series was therefore repeated three times: once with SACU as the test set, once with STS, and finally leaving DILSE out of the training set. By comparing the results obtained with these different subsets, we can determine the effectiveness of the data augmentation techniques in improving model performance.
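The cross-dataset protocol can be summarised with the following sketch, in which each public source is held out once as the test set while the remaining videos (plus the chosen augmentations) form the training set; the data structures and the run_experiment call are illustrative.

```python
# Hold out each public source in turn (cross-dataset evaluation).
SOURCES = ["SACU", "STS", "DILSE"]

def cross_dataset_splits(videos):
    """videos: list of dicts with at least the keys 'source' and 'path'."""
    for held_out in SOURCES:
        test = [v for v in videos if v["source"] == held_out]
        train = [v for v in videos if v["source"] != held_out]
        yield held_out, train, test

# for name, train_set, test_set in cross_dataset_splits(all_videos):
#     run_experiment(train_set, test_set, tag=f"test={name}")   # hypothetical
```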

The first experiment performed can be regarded as a baseline, since no data were added through face swapping or affine transformations during training. As there are no previous studies on the newly created dataset, it is not possible to compare the results obtained with those of other work. For reference, the results on the 100-gloss subset of the WLASL dataset, also trained with the I3D model, are 65.89, 84.11 and 89.92 for top-1, top-5 and top-10 accuracy, respectively. It is worth noting that the WLASL100 subset is significantly larger than our dataset of 600 videos: it includes 2038 videos, with over 20 video samples per sign, making it almost four times larger than ours.

5.2 Results analysis

Table 3 Experiment results executed using STS as a test set

Tables 3, 4 and 5 indicate that, with some exceptions, using data augmentation during training improves the results. When training with the DILSE set held out for testing, as can be seen in Table 4, an improvement always occurs regardless of the type of augmentation used, with an increase of up to 11.7 points in top-1 accuracy over the baseline experiment without augmentations, even though this is not the largest training set used.

Table 4 Experiment results executed using DILSE as a test set. Face omission row denotes results when testing with the DILSE set after applying the face omission

On the other hand, Table 5 shows that, consistently across top-1, top-5 and top-10, the best performing augmentation for that training set is the use of affine transformations on the videos of our collaborating interpreter.

Table 5 Experiment results executed using SACU as a test set
Table 6 Results obtained in all the experiments performed

Table 6 shows a summary of the results obtained in all the experiments performed, by dataset. As can be seen, using DILSE as the test set generally yields the best results in top-1, top-5 and top-10 accuracy, with these results being up to 15.3 points higher than those obtained with SACU.

Fig. 8 Top-1 percentage improvement per experiment in each test subset

Figure 8 displays the percentage improvement in the top-1 accuracy metric achieved by including various augmentations during training, relative to the baseline experiment, for each test set evaluated separately.

As can be seen, the use of data augmentation during training can produce an improvement of up to 32.59% in top-1 accuracy with respect to its baseline.
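Note that the percentage improvements in Fig. 8 are relative to each baseline, unlike the absolute point gains quoted in Sect. 1; the relation is

\[
\text{improvement}\,(\%) = \frac{\text{acc}_{\text{aug}} - \text{acc}_{\text{base}}}{\text{acc}_{\text{base}}} \times 100,
\]

so that, purely for illustration, a baseline top-1 accuracy of 35.9 raised to 47.6 would correspond to an absolute gain of 11.7 points and a relative improvement of about 32.6%.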

Furthermore, it is evident that while the combination of face swapping and affine transformations in the same training process may improve results, it is when they are used separately that the highest performance is achieved. This isolated usage of face swapping and affine transformations is also beneficial in terms of reduced training set size, leading to shorter execution times.

Worse results are observed when face swapping videos from all three models are added to our signer's dataset, as well as when face swapping videos are combined with affine transformations on our collaborating interpreter's videos. This degradation may be attributed to the fact that, during training, an excessive number of augmented videos of our interpreter are included, which can lead to a loss of generalization.

Fig. 9 Best result for each isolated set

Finally, Fig. 9 presents the best results achieved in each subset when different augmentations were applied during training. As already shown in Table 6, DILSE as the test set achieves the highest top-1, top-5, and top-10 accuracy, with almost 50% accuracy in the most probable prediction.

6 Conclusions and future work

This study introduces CALSE-100, a novel dataset for Spanish Sign Language comprising 100 words, on which the I3D architecture was employed to assess the accuracy of Sign Language Recognition. The dataset gathers 600 videos recorded in different scenarios, using diverse perspectives during sign execution and featuring more than 15 signers.

Since Deep Learning techniques require a large amount of data, various data augmentation approaches such as affine transformations and face swapping were also suggested to enhance accuracy. The cross-dataset approach was employed to conduct the same experiments on different training and test sets. Our proposal showed that augmentations during training generally improved the accuracy in the top-1, top-5, and top-10 metrics compared to the baseline experiment. The improvements ranged up to 32.59% in the top-1 metric.

The importance of facial expressions for the model was confirmed by testing with a face-omission set; consequently, the incorporation of face swapping videos during training not only improved accuracy but also ensured user anonymity while preserving facial information during sign execution. On the other hand, it is necessary to consider when it is worthwhile to use face swapping in this type of study. Creating realistic face swapping videos is an open problem that presents its own difficulties, such as the amount of data needed to train a usable model, how laborious and impractical it can be to train multiple identities, and the high processing times required by this type of technique.

In future work, our focus will be on exploring alternative architectures for Sign Language Recognition that will allow us to improve the accuracy of the results.