Machine Translation from Signed to Spoken Languages: State of the Art and Challenges

Automatic translation from signed to spoken languages is an interdisciplinary research domain, lying at the intersection of computer vision, machine translation and linguistics. Nevertheless, research in this domain is performed mostly by computer scientists in isolation. As the domain is becoming increasingly popular - the majority of scientific papers on the topic of sign language translation have been published in the past three years - we provide an overview of the state of the art as well as some required background in the different related disciplines. We give a high-level introduction to sign language linguistics and machine translation to illustrate the requirements of automatic sign language translation. We present a systematic literature review to illustrate the state of the art in the domain and then, harking back to the requirements, lay out several challenges for future research. We find that significant advances have been made on the shoulders of spoken language machine translation research. However, current approaches are often not linguistically motivated or are not adapted to the different input modality of sign languages. We explore challenges related to the representation of sign language data, the collection of datasets, the need for interdisciplinary research and requirements for moving beyond research, towards applications. Based on our findings, we advocate for interdisciplinary research and for grounding future research in linguistic analysis of sign languages. Furthermore, the inclusion of deaf and hearing end users of sign language translation applications in use case identification, data collection and evaluation is of the utmost importance in the creation of useful sign language translation models. We recommend iterative, human-in-the-loop design and development of sign language translation models.


1 Introduction

1.1 Scope of this article
The rapid progress in deep learning has enabled a host of new applications related to sign language recognition, translation, and synthesis, which can be grouped under the umbrella term "sign language processing". Sign language recognition can be likened to information extraction from sign language data, for example fingerspelling recognition and sign classification. Sign Language Translation (SLT) maps this extracted information to meaning and translates it into another (signed or spoken) language; the opposite direction, from text to sign language, is also possible. Sign Language Synthesis (SLS) aims to generate sign language from some representation of meaning, for example through virtual avatars or by stitching together prerecorded videos, each of which is associated with a specific sign or sign sequence. In this article, we zoom in on translation from signed languages to spoken languages.
In particular, we focus on translating videos containing sign language conversations to text, i.e., the written form of spoken language. We only discuss SLT models that support video data as input, as opposed to models that require wearable bracelets, gloves, or 3D cameras. This choice is motivated by the fact that any system designed to be used in an everyday setting cannot be expected to be intrusive. Systems that use smart gloves, wristbands or other wearables are intrusive and are unable to capture all information present in signing, such as nonmanual actions. They are neither usable nor accepted by sign language communities (SLCs) [1]. Humans can understand sign language through visual observation and many people have access to a camera at any time through their smartphone. An SLT system built to work with RGB videos therefore seems to be technologically and economically feasible, and potentially user-friendly.

Figure 1: Sign language translation lies at the intersection of computer vision, machine translation, and linguistics. Each of these domains tackles different related challenges and interdisciplinary research is required to solve sign language translation.

An interdisciplinary challenge
Before we delve into the domains of sign language recognition and translation, we need to address some misconceptions about what a sign language translation model is. Several previously published scientific papers oversimplify this domain, likening sign language recognition to gesture recognition, or even presenting a fingerspelling system as an SLT solution. Such classifications are overly simplified and incorrect, and they may lead to a misunderstanding of the technical challenges that must be solved. Fig. 1 positions SLT at the intersection of computer vision, machine translation, and linguistics. Experts from each domain must come together to truly address sign language translation.
A crucial part of sign language research in a computational context is understanding the difference between sign language recognition on the one hand and sign language translation on the other. The line between them is sometimes blurred in scientific papers. Sign language recognition is typically divided into isolated and continuous recognition. In isolated sign recognition, individual signs are classified, e.g., mapping a video input to a single sign. Continuous sign language recognition instead considers sequences of two or more signs. Sign language recognition can for example be applied to transcribe sign language videos into sign language glosses. Both the input (videos) and the output (glosses) belong to the same language. In sign language translation, however, sequences of signs are translated into a different language.

Research questions
This article aims to provide a comprehensive overview of the state of the art (SOTA) of signed to spoken language translation. To do this, we perform a systematic literature review and discuss the state of the domain. We aim to find answers to the following research questions.
RQ1. How should we represent sign language data for Machine Translation (MT) purposes?

RQ2. Which algorithms are currently the SOTA for SLT?

RQ3. Which datasets are used, for which languages, and what are the properties of these datasets?

RQ4. How are current SLT models evaluated and is this sufficient?

Furthermore, our goal is to list several challenges in SLT that are often overlooked. These challenges are of a technical and linguistic nature. We propose research directions to tackle these challenges.
Therefore, this article is not only an introduction for researchers taking their first steps in the domain of SLT, but also a compass to guide future research.

Structure of this article
We discuss the inclusion criteria and search strategy for our systematic literature search in Section 2. Then, we provide a high level overview of some required background information on sign languages and machine translation in Section 3 and Section 4, respectively. We objectively compare the results of the considered papers on SLT in Section 5; this includes a section focusing on a specific benchmark dataset in Section 5.7. The findings of the literature overview are summarized and discussed in Section 6. Building on this discussion, we present several outstanding challenges in SLT in Section 7. The conclusion and takeaway messages are given in Section 8.
2 Literature review methodology

Inclusion criteria and search strategy
To provide an overview of sound SLT research, we adhere to the following principles in our literature search. We consider only peer reviewed publications. We include journal articles as well as conference papers: the latter are especially important in computer science research. Of course, any paper that is included must be on the topic of sign language machine translation and must not misrepresent the natural language status of sign languages. Therefore, we omit any papers that present classification of individual signs or fingerspelling recognition as sign language translation models. As we focus on nonintrusive translation from sign languages to text, we exclude papers that use gloves or other wearable devices. Finally, we emphasize the importance of (ethically) correct language use. We do not consider papers that make use of the terms "deaf/dumb" or "deaf/mute"¹, or that imply that sign language users are not "normal", or that they have problems communicating or need help living their daily lives.
These principles are summarized into the following inclusion criteria. Any paper that is included in our literature overview must:

• be written in English,
• be peer reviewed,
• propose, implement and evaluate a sign language machine translation system from a sign language to a spoken language,
• present a nonintrusive system based only on RGB camera inputs,
• not use terms that can be construed as offensive to members of sign language communities.

Three scientific databases were queried: Google Scholar, Web of Science and IEEE Xplore². Four queries were used to obtain initial results: "sign language translation", "sign language machine translation", "gloss translation" and "gloss machine translation". These key phrases were chosen for the following reasons. We want to obtain scientific research papers on the topic of MT from signed to spoken languages: therefore we search for "sign language machine translation". Several papers perform translation between sign language glosses and spoken language text (as we will discuss), hence "gloss machine translation". As many papers omit the word "machine" in "machine translation", we also include the key phrases "sign language translation" and "gloss translation".

¹ Both of these terms can be found to be offensive, as explained by the United States National Association of the Deaf (https://www.nad.org/resources/american-sign-language/community-and-culture-frequently-asked-questions/).
The initial search, executed on August 19, 2021, yielded 716 results. 239 duplicate entries were removed, leaving 477 papers. In addition, 10 non-English papers were removed.
Title screening was applied next to the remaining 467 papers. First we checked the intrusiveness criterion based on titles: any paper addressing SLT with gloves or other wearable sensors was removed. This was done by listing all paper titles containing the words "glove", "armband" or "wearable" and manually removing those papers proposing an intrusive system. After this filtering step, 453 papers remained.
Subsequently, we screened the titles to exclude papers that do not address automatic translation in the direction of signed to spoken languages. We removed papers on human translation and interpretation, papers on translation from spoken to signed languages and papers not related to sign languages at all³. As a consequence, 256 papers were removed.
For the remaining 197 papers, title and abstract were analyzed more thoroughly to identify remaining mismatches with our inclusion criteria. Papers were deemed irrelevant for several possible reasons. They could for example discuss the topics of fingerspelling recognition (39 papers), isolated sign recognition (31 papers) or continuous sign language recognition (6 papers). Other papers were removed because they considered intrusive methods but passed the earlier title based screening (20 papers). 10 papers did not propose a machine translation system and were also removed. 36 papers were removed because they only considered translation from spoken languages to signed languages (these papers had passed the title based screening).
16 papers used terms such as "deaf/dumb" or "abnormal" to refer to sign language users and were therefore excluded from our overview. 5 more papers were removed for various other reasons. Note that some papers consider both fingerspelling recognition and isolated sign recognition, hence they are counted twice.
After this title and abstract based screening, 55 papers remained. These papers were read in full to determine their relevance. We excluded papers that do not describe a translation model or that only formulate a proposal, i.e., papers that do not present any methodology or results.
After all exclusion steps, the remaining 32 papers, 4.5% of the original 716 search results, are discussed in this work. These are peer reviewed papers in English that propose, implement and evaluate a sign language machine translation system from a sign language to a spoken language using an RGB camera and do not contain descriptions that are explicitly offensive to members of the sign language communities.
The list of all 716 search results, as well as the list of the 32 selected papers, is provided as supplementary material (Resource 1).

3 Signed languages
It is a common misconception that there exists a single, universal sign language. Just like spoken languages, sign languages evolve naturally through time and space. Several countries have national sign languages, but often there are also regional differences and local dialects. Furthermore, signs in a sign language do not have a one-to-one mapping to words in any spoken language: translation is not as simple as recognizing individual signs and replacing them with the corresponding words in a spoken language. In summary: sign languages have distinct vocabularies and grammars and they are not tied to any spoken language. Even in two regions with a shared spoken language, the regional sign languages used can differ greatly. In the Netherlands and in Flanders (Belgium), for example, the majority spoken language is Dutch. However, Flemish Sign Language (VGT) and the Sign Language of the Netherlands (NGT) are quite different. Meanwhile, VGT is linguistically and historically much closer to French Belgian Sign Language (LSFB) [2], the sign language used primarily in the French-speaking part of Belgium, because both originate from a common Belgian Sign Language, from which they diverged in the 1990s [3]. In a similar vein, American Sign Language (ASL) and British Sign Language (BSL) are completely different even though the two countries share a spoken language, i.e., English.

A high level overview of components of sign languages
We now provide a very high level overview of sign languages from a linguistic point of view. It is by no means comprehensive, but it illustrates why SLT is much broader than gesture recognition and even sign recognition. Remember that there is no single universal sign language; therefore, some of the notions that we discuss here may not apply to all sign languages. Sign languages are visual; they make use of a large space around the signer. Signs are not composed solely of manual gestures. In fact, there are many more components to a sign. Stokoe stated in 1960 that signs are composed of hand shape, movement and place of articulation parameters [4]. Battison later added orientation, both of the palm and of the fingers [5]. There are also nonmanual components such as mouth patterns. Mouth patterns can be divided into mouthings, where the pattern refers to (part of) a spoken language word, and mouth gestures, e.g., pouting one's lips. Nonmanual components play an important role in sign language lexicons and grammars [6]. They can for example separate minimal pairs: signs which share all articulation parameters apart from one. When hand shape, orientation, movement and place of articulation are identical, mouth patterns can for example be used to differentiate two signs. Nonmanual actions are not only important at the lexical level, as just illustrated, but also at the grammatical level. A clear example of this can be found in eyebrow movements: furrowing or raising the eyebrows can signal that a question is being asked, as well as indicate the type of question (open or closed).
Sign languages exhibit simultaneity on several levels. There is simultaneity on the component level: as explained above, manual actions can be combined with nonmanual actions simultaneously. We also observe simultaneity at the utterance level. It is, for example, possible to turn a positive utterance into a negative utterance by shaking one's head while performing the manual actions. Another example is the use of eyebrow movements to transform a statement into a question.
The space around the signer can also be utilized to indicate for instance the location or moment in time of the conversational topic. A signer can point behind their back to specify that an event occurred in the past and likewise, point in front of them to indicate a future event. An imaginary timeline can also be constructed in front of the signer, with time passing from left to right. Space is also used to position referents [2,7]. For example, a person can be discussing a conversation with their mother and father. Both referents get assigned a location (locus) in the signing space and further references to these persons are made by pointing to, looking at, or signing towards these loci. For example, "mom gives something to dad" can be signed by moving the sign for "to give" from the locus associated with the mother to the one associated with the father. Modeling space, detecting positions in space, and remembering these positions is important for SLT models.
Another important aspect of sign languages is the use of classifiers. Zwitserlood describes them as "morphemes with a nonspecific meaning, which are expressed by particular configurations of the manual articulator (or: hands) and which represent entities by denoting salient characteristics" [8]. There are many more intricacies to classifiers than can be listed here, so we give a limited set of examples instead. Several types of classifiers exist. They can for example represent nouns or adjectives according to their shape or size. Whole entity classifiers can be used to represent objects, e.g., a flat hand can represent a car; handling classifiers can be used to indicate that an object is being handled, e.g., a pencil is picked up from a table. In a whole entity classifier, the articulator is the object, whereas in a handling classifier it operates on the object.
The vocabularies of sign languages are not fixed. Oftentimes new signs are constructed by sign language users. On the one hand, sign languages can borrow signs from other sign languages, similar to loanwords in spoken languages. In this case, these signs are part of the established lexicon. On the other hand, there is the productive lexicon: one can create an ad hoc sign. Vermeerbergen gives the example of "a man walking on long legs" in VGT: rather than expressing this clause by signing "man", "walk", "long" and "legs", the hands are used (as classifiers) to imitate the man walking [9]. Both the established and productive lexicons are integral parts of sign languages.
Fingerspelling can be used to convey concepts for which a sign does not (yet) exist, or to introduce a person who has not yet been assigned a name sign. It is based on the alphabet of a spoken language, where every letter in that alphabet has a corresponding (static or dynamic) sign. Fingerspelling is also not shared between sign languages. For example, in ASL, fingerspelling is one handed, but in BSL two hands are used.
We have now discussed seven important aspects of signing: manual actions, nonmanual actions, signing space, classifiers, the productive lexicon, simultaneity and fingerspelling. Models for SLT require the ability to deal with all of these aspects in some way, either explicitly or implicitly.
These aspects cannot be trivially extracted from sign language videos as they are. The videos first need to be processed into some representation of sign language. This representation can be written, graphical or computational. No matter which kind is used, it needs to contain information on the aforementioned aspects to allow for translation to different languages.

Notation systems for sign languages
Unlike many spoken languages, sign languages do not have a standardized written form. Several notation systems do exist, but none of them are generally accepted as a standard [10]. The earliest notation system was proposed in the 1960s by Stokoe: the Stokoe notation [4]. It was designed for ASL and comprises a set of symbols to notate the different components of signs. The position, movement and orientation of the hands are encoded in iconic symbols, and for hand shapes, letters from the Latin alphabet corresponding to the most similar fingerspelling hand shape are used [4]. Later, in the 1970s, Sutton introduced SignWriting⁴: a notation system for sign languages based on a dance choreography notation system [11]. The SignWriting notation for a sign is composed of iconic symbols for the hands, face and body. The signing location and movements are also encoded in symbols, in order to capture the dynamic nature of signing. SignWriting is designed as a system for writing signed utterances for everyday communication. In 1989, the Hamburg Notation System (HamNoSys) was introduced [12]. Unlike SignWriting, it is designed mainly for linguistic analysis of sign languages. It encodes hand shapes, hand orientation, movements and nonmanual components in the form of symbols.
Stokoe notation, SignWriting and HamNoSys represent the visual nature of signs in a compact format. They are notation systems that operate on the phonological level. These systems, however, do not capture the meaning of signs. In linguistic analysis of sign languages, glosses are typically used to represent meaning. A sign language gloss is a written representation of a sign in one or more words of a spoken language, commonly the majority language of the region. Glosses can be composed of single words in the spoken language, but also of combinations of words. Examples of glosses are: "CAR", "BRIDGE", but also "car-crosses-bridge". Glosses do not accurately represent the meaning of signs in all cases and glossing has several limitations and problems [10]. They are inherently sequential, whereas signs often exhibit simultaneity [13]⁵. Furthermore, as glosses are based on spoken languages, there may be an implicit influence of the spoken language projected onto the sign language [9,10]. Finally, there is no universal standard on how glosses should be constructed: this leads to differences between corpora of different sign languages, or even between several sign language annotators working on the same corpus.
Sign A is a recently developed framework aiming to define an architecture that is sufficiently robust to model sign languages both at the phonological level and at the level of meaning (when combined with a role and reference grammar (RRG)) [14]. Sign A with RRG encodes not only the meaning of sign language utterances, but also parameters pertaining to manual and nonmanual actions. De Sisto et al. propose investigating the application of Sign A to data-driven SLT systems [15].
The above notation systems for sign languages range from graphical to written and computational representations of signs and signed utterances. None of these notation systems were originally designed for the purpose of automatic translation from signed to spoken languages and, in fact, only glosses are currently used for SLT. One reason is that sign glosses are similar on several levels to spoken language words, facilitating translation using spoken language MT techniques.

4 Machine translation

4.1 Spoken language MT
Machine translation is a sequence to sequence task. That is, given an input sequence of tokens that constitute a sentence in a source language, an MT system generates a new sequence of tokens that represent a sentence in a target language. In fact, as MT is a probabilistic task, the generated sequence is the most likely translation of the input sequence⁶. A token refers to a sentence construction unit: a word, a number, a symbol, a character or a subword unit.
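The notion of subword tokens can be made concrete with a small sketch. The greedy longest-match procedure and the toy vocabulary below are invented for illustration; production systems learn their subword vocabularies with algorithms such as byte pair encoding rather than using a hand-picked list.

```python
def tokenize(sentence, vocab):
    """Greedy longest-match subword tokenization: split each word into the
    longest vocabulary pieces available, falling back to single characters."""
    tokens = []
    for word in sentence.split():
        while word:
            for end in range(len(word), 0, -1):
                piece = word[:end]
                if piece in vocab or end == 1:
                    tokens.append(piece)
                    word = word[end:]
                    break
    return tokens

# A toy subword vocabulary, invented for this example.
vocab = {"trans", "lation", "sign", "language"}
tokenize("sign language translation", vocab)
# "translation" is not in the vocabulary, so it is split into "trans" + "lation".
```

This illustrates why subword units are attractive for MT: unseen words can still be represented as sequences of known pieces instead of a single unknown-word token.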
Current SOTA models for spoken language MT are based on a neural encoder-decoder architecture: (i) an encoder network encodes an input sequence in the source language into a multidimensional representation; (ii) this representation is then fed into a decoder network which generates a hypothesis translation conditioned on it. The original encoder-decoder was based on Recurrent Neural Networks (RNNs) [16]. To deal with long sequences, Long Short-Term Memory networks (LSTMs) [17] and Gated Recurrent Units (GRUs) [18] were used. To further improve the performance of RNN-based MT, an attention mechanism was introduced by Bahdanau et al. [19]. In recent years the transformer architecture [20], based primarily on the idea of attention (in combination with positional encoding), has pushed the SOTA even further.
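To make the attention mechanism concrete, the sketch below implements scaled dot-product attention, the core operation of the transformer [20], in plain NumPy; the toy dimensions and random inputs are arbitrary, and real models add learned projections and multiple attention heads on top of this operation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query attends to all keys; the softmax weights then mix the
    value vectors into one output vector per query."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n_queries, n_keys)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ V, weights

# Toy example: 2 query tokens attending over 3 key/value tokens of dimension 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(Q, K, V)           # out: (2, 4), w: (2, 3)
```

Each row of `w` sums to one: the output for a query is a convex combination of the value vectors, which is what lets the decoder focus on the most relevant source positions.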
Figure 2: Neural machine translation models for spoken (a) and sign (b) language translation are similar; the main difference is the input modality.In this case, the sign language representation is illustrated as human pose estimation keypoints.The pose illustration is adapted from [27].
As noted above, a sentence is broken down into tokens and each token is fed into the Neural Machine Translation (NMT) model. RNN-based models process a sequence one token at a time, whereas transformer based models operate on multiple tokens in parallel. Regardless of the architecture type, NMT converts each token into a multidimensional representation before that token representation is used in the encoder or decoder to construct a sentence level representation. These token representations, typically referred to as word embeddings, encode the meaning of a token based on its context and can be learned along with the training of an NMT model, or independently and used to bootstrap the NMT training. Learning word embeddings is a monolingual task, since they are associated with tokens in a particular language. Given that for a large number of languages and use cases monolingual data is abundant, it is relatively easy to build word embedding models of high quality and coverage. Building such word embedding models is typically performed using unsupervised algorithms such as GloVe [21], BERT [22] and BART [23]. These algorithms encode words into vectors in such a way that the vectors of related words are similar⁷.
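The similarity of embedding vectors is usually measured with cosine similarity. The sketch below uses tiny hand-made vectors purely for illustration; real GloVe or BERT embeddings have hundreds of dimensions and are learned from large corpora rather than written by hand.

```python
import numpy as np

# Toy 4-dimensional "embeddings", invented for this example.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.7, 0.2, 0.3]),
    "car":   np.array([0.1, 0.2, 0.9, 0.8]),
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, ~0 for unrelated ones."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Related words end up with similar vectors, so their cosine similarity is high.
cosine(emb["king"], emb["queen"])   # high
cosine(emb["king"], emb["car"])     # low
```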
The domain of spoken language MT is extensive and the current SOTA of NMT builds upon years of research. To provide a complete overview of spoken language MT is out of scope for our article. For a more in depth overview of the domain, we refer readers to the work of Stahlberg [26].

Sign language MT
Conceptually, sign language MT and spoken language MT are similar. The main difference is the input modality. Spoken language MT operates on two streams of discrete tokens (text to text). As sign languages have no standardized notation system, a generic SLT model must translate from a continuous stream to a discrete stream (video to text). To reduce the complexity of this problem, sign language videos are discretized into the sequence of still frames that make up the video. SLT can now be framed as a sequence-to-sequence, frame-to-token task. As they are, these individual frames do not convey meaning in the way that the word embeddings in a spoken language translation model do. While it is possible to train SLT models using frame based representations as inputs, the extraction of salient sign language representations is required to facilitate the modeling of meaning in sign language encoders.
Following the encoder-decoder NMT architecture, the first step in SLT would be the encoding of sign language sentences captured in videos, broken down into frames. Encoding a sign language sentence is a challenge on its own. Consider, for example, that each video captures not only the signer but also the background, other signers and other noise in general. Focusing on the sign language information in the video is essential for the translation task. Therefore, before encoding, such a sentence should be processed and the sign language information isolated. This is the extraction of the sign language representation mentioned above. This process is called "sign language recognition" and it is performed before translation. Whereas spoken language texts can be divided into words, subwords, characters or other discrete tokens, which can be mapped to real-valued vectors, i.e., word embeddings, this is more difficult to do for sign language videos. There is no explicit segmentation between individual signs, for example. In deep learning, Convolutional Neural Networks (CNNs) are typically used to extract information from videos. These videos can be processed either frame by frame or multiple frames at a time. In the former case, only spatial information is captured, with temporal information implicitly encoded in the sequence.
In the latter case, both spatial and temporal information is captured. Fig. 2 shows a spoken language NMT and a sign language NMT model side by side. The main difference between the two is the input modality. For a spoken language NMT model, both the inputs and outputs are text. For a sign language NMT model, the inputs are some representation of sign language (in the case of this illustration, per-frame human pose keypoints extracted with OpenPose [28]).
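The pose-based input in Fig. 2 can be sketched as follows: per-frame keypoints are flattened into one vector per frame, the sign language analogue of a word embedding sequence. The clip length and the 25-keypoint skeleton are only illustrative assumptions here, and a real pipeline would obtain the coordinates from a pose estimator such as OpenPose rather than from zeros.

```python
import numpy as np

def keypoints_to_tokens(video_keypoints):
    """Flatten per-frame pose keypoints into one input vector per frame,
    so the frame sequence can be fed to an NMT-style encoder."""
    n_frames, n_keypoints, n_coords = video_keypoints.shape
    return video_keypoints.reshape(n_frames, n_keypoints * n_coords)

# Hypothetical clip: 114 frames, 25 body keypoints with (x, y) coordinates.
clip = np.zeros((114, 25, 2))
tokens = keypoints_to_tokens(clip)   # one 50-dimensional "token" per frame
```

Unlike word embeddings, these vectors carry no semantics by themselves; the encoder must learn to extract meaning from the sequence of poses.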

Sign language representations
For the encoder of the translation model to capture the meaning of the sign language utterance, a salient representation for sign language videos is required. We can differentiate between: (i) representations that are linked to the source modality, namely videos, and (ii) linguistically motivated representations.
As will be discussed in Section 5.2, the former type of representation is often frame based, i.e., every frame in the video is assigned a vector, or clip based, i.e., clips of arbitrary length are assigned a vector. These types of representations are rather simple to derive, e.g., by extracting information directly from a CNN. However, they suffer from two main drawbacks. First, such representations are fairly long. For example, the RWTH-PHOENIX-Weather 2014T dataset [29] contains samples of on average 114 frames (in German Sign Language (DGS)), whereas the average sentence length (in German) is 13.7 words in that dataset. As a result, frame based representations for sign languages negatively impact the computational performance of SLT models. Second, such representations do not originate from domain knowledge. That is, they capture neither the syntax nor the semantics of sign language. If semantic information is not encoded in the sign language representation, the translation model is forced to model the semantics and perform translation at the same time. If, in contrast, semantic information is encoded in the representation, only the translation needs to be learned, which is an easier task.
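One common way to mitigate the sequence length problem is to pool frame features over short temporal windows before translation. The sketch below shows non-overlapping average pooling; the window size, feature dimension and random features are illustrative assumptions, not values from any specific SLT system.

```python
import numpy as np

def temporal_pool(features, window=8):
    """Average-pool frame features over non-overlapping windows, shortening
    the sequence the translation encoder must process; trailing frames that
    do not fill a window are dropped in this simple sketch."""
    n_frames, dim = features.shape
    n_windows = n_frames // window
    trimmed = features[: n_windows * window]
    return trimmed.reshape(n_windows, window, dim).mean(axis=1)

# 114 frames (the average PHOENIX sample length) of 512-d CNN features.
frames = np.random.default_rng(0).normal(size=(114, 512))
pooled = temporal_pool(frames)   # 14 pooled vectors instead of 114 frames
```

This brings the input length closer to the 13.7-word target sentences, at the cost of temporal resolution.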
The second category includes a range of linguistically motivated representations, from semantic representations to individual sign representations. In Section 3.3 we presented an overview of some notation systems for sign languages: Stokoe notation, SignWriting, HamNoSys, glosses, and Sign A. These notation systems can be used as representations in an SLT model. Of the aforementioned notation systems, only glosses are used in current SOTA models. Therefore, in the remainder of this article, our discussion on the application of preexisting notation systems in SLT will be limited to glosses.

Figure 3: We find four distinct translation tasks in the considered scientific literature on SLT. Each task has different inputs and/or outputs and implications on the required labels and resulting scores on translation metrics. CSLR: Continuous sign language recognition.

Tasks
The reviewed papers cover four distinct translation tasks that can be classified based on whether, and how, glosses are used. To denote these tasks, we borrow the naming conventions from Camgoz et al. [30]. First, translation from sign language glosses to spoken language text is considered (Gloss2Text); such a system assumes the existence of a perfect sign language recognizer to perform the transformation from sign language videos into sign language glosses. The three other tasks consider video inputs. Sign to text translation translates directly from sign language videos to text (Sign2Text). Sign to gloss to text translation first converts the sign language video into glosses through sign language recognition methods and then translates these glosses into text (Sign2Gloss2Text). Finally, the task of jointly predicting glosses and translating into spoken language text is called Sign2(Gloss+Text). We provide diagrams of these tasks in Fig. 3.
Gloss2Text Gloss2Text models provide a reference for the performance that can be achieved using a salient representation. Therefore, they can serve as a compass for the design of sign language representations and the corresponding sign language recognition systems. Note once more that glosses do not capture all linguistic properties of signs; therefore, the performance of a Gloss2Text model is not an upper bound on the performance of an SLT model.
Sign2Gloss2Text A Sign2Gloss2Text translation system includes the sign language recognition system as the first step. Consequently, errors made by the recognition system are propagated to the translation system. Camgoz et al., for example, report a drop in translation accuracy when comparing a Sign2Gloss2Text system to a Gloss2Text system [29]. However, such a drop in performance can be avoided by using salient sign language representations [31]. Note that a Sign2Gloss2Text translation system contains an information bottleneck in the form of the gloss prediction. Furthermore, Sign2Gloss2Text models require a recognizer from signs to glosses at inference time.
Sign2(Gloss+Text) Glosses can provide a supervision signal to a translation system without being an information bottleneck, if the model is trained to jointly predict both glosses and text [30]. Such a model must be able to predict glosses and text from a single sign language representation. The gloss labels provide additional information to the encoder, facilitating the training process. In a Sign2Gloss2Text model, the translation model receives glosses as inputs: any information that is not present in glosses cannot be used to translate into spoken language text. In Sign2(Gloss+Text) models, however, the translation model input is the sign language representation. In a Sign2Gloss2Text system, the sign language representation is exactly as salient as glosses. In a Sign2(Gloss+Text) system, the representation can contain additional information not present in glosses. Another benefit of Sign2(Gloss+Text) models is that, after training (i.e., during inference), no gloss information is required: the model can be directly applied to translate from the sign language representation into the spoken language text.

Sign2Text Finally, Sign2Text performs both recognition and translation in an end-to-end setup. It avoids the information bottleneck presented by glosses as well as the need for gloss level annotations, but requires a powerful sign language recognition system. The creation of such recognition systems is currently heavily constrained by the limited amount of labeled data that is available. Furthermore, research into sign language representations and sign language recognition is still ongoing. Therefore, Sign2Text systems are often outperformed by Gloss2Text, Sign2Gloss2Text and Sign2(Gloss+Text) systems: this is discussed in Section 5.7.
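The joint supervision behind Sign2(Gloss+Text) can be sketched as a weighted sum of a gloss loss and a translation loss computed from a shared encoder output. This is a minimal illustration only: the per-token cross-entropy stand-in and the weight `lam` are assumptions, whereas actual systems such as Camgoz et al. [30] use a CTC loss for the gloss stream and an autoregressive decoder for the text stream.

```python
import numpy as np

def cross_entropy(logits: np.ndarray, target: int) -> float:
    """Negative log-likelihood of the target class under softmax(logits)."""
    shifted = logits - logits.max()                      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[target])

def joint_loss(gloss_logits, gloss_targets, text_logits, text_targets, lam=1.0):
    """Weighted sum of gloss supervision and translation loss (illustrative)."""
    gloss_loss = np.mean([cross_entropy(l, t) for l, t in zip(gloss_logits, gloss_targets)])
    text_loss = np.mean([cross_entropy(l, t) for l, t in zip(text_logits, text_targets)])
    return lam * gloss_loss + text_loss

# A model that predicts both streams correctly incurs a low joint loss;
# a model that gets both wrong incurs a high one.
good = joint_loss([np.array([5.0, 0.0])], [0], [np.array([5.0, 0.0])], [0])
bad = joint_loss([np.array([0.0, 5.0])], [0], [np.array([0.0, 5.0])], [0])
print(good < bad)  # True
```

Because the gloss head only contributes a training-time loss term, it can be discarded at inference time, which is exactly the property that distinguishes Sign2(Gloss+Text) from Sign2Gloss2Text.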

Requirements for sign language MT
Given this background on sign language linguistics and MT techniques, we can now sketch the requirements for sign language MT.

Video processing and sign language representation
We need to be able to process sign language videos and convert them into an internal representation (sign language recognition). This representation must be rich enough to cover several aspects of sign languages (including manual and nonmanual actions, signing space, classifiers, the productive lexicon, simultaneity and fingerspelling). Ideally, this representation would be sign language agnostic, such that the sign language recognizer can be reused across languages. However, this is not a requirement for models designed for individual language pairs. In our literature overview, we look for an answer to RQ1 on how we should represent sign language data.

Translating between sign and spoken representations
We need to be able to translate from such a representation into a spoken language representation, which can be reused from existing spoken language MT systems. We need to adapt NMT systems to be able to work with the sign language representation, which will possibly contain simultaneous elements. By comparing different methods for SLT, we evaluate whether current algorithms are sufficient and which of them perform best in the current SOTA (RQ2).

Data requirements
To perform these operations using current SOTA machine learning methods, we need large datasets. The collection of such datasets is expensive and should therefore be tailored to the intended use cases. To determine these use cases, members of SLCs must be involved. We answer RQ3 by providing an overview of existing datasets for SLT.
5 Literature overview

Sign language MT
Following our methodology on paper selection, laid out in Section 2, we obtain 32 papers published in the period from 2004 to August 2021. In our analysis, papers are classified based on tasks, datasets, methods and evaluation techniques.
Until the rise of deep neural models for MT in 2017, most of the work on MT from signed to spoken languages was based entirely on statistical methods [32][33][34][35][36][37][38][39][40]. In 2018, the domain moved away from SMT and towards NMT. This trend is clearly visible in Fig. 4. This drastic shift was motivated not only by the successful applications of NMT techniques in spoken language MT, but also by the publication of the RWTH-PHOENIX-Weather 2014T dataset and the promising results obtained on that dataset using NMT methods [29]. A single outlier is found in the paper by Luqman et al., who use Rule-based Machine Translation (RBMT) in 2020 to translate from Arabic sign language into Arabic [41].
Between 2004 and 2018, research into translation from signed to spoken languages was sporadic (9 papers were published over 14 years). Since 2018, with the move towards NMT, the domain has become more popular, with 23 papers in our subset published over the span of 3 years.

Sign language representations
The sign language representations used in the current scientific literature range from glosses to representations extracted from videos. In 2007 and 2008, before the widespread use of deep CNNs and while the majority of research was still being performed on Gloss2Text models, two papers tackled Sign2Gloss2Text translation [35,38]. They both used appearance based features in the form of 32 by 32 pixel grayscale images as well as motion based features obtained by performing hand tracking: the hand position, velocity and trajectory were used. Schmidt et al. perform Gloss2Text translation with an additional channel of visual information [40]. They track the position of the face and mouth using an active appearance model. Then, features are extracted from a landmark model of the face, such as eyebrow movements and mouth movements. All other models from this period tackle Gloss2Text and consequently do not perform sign language feature extraction. All models since 2018 that include feature extraction use neural networks to do so.
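The motion based features mentioned above can be illustrated with a few lines of code: given tracked 2D hand positions per frame, velocity can be approximated by first-order differences and the trajectory summarized by the cumulative path length. The tracker itself and the exact feature sets of [35,38] are not reproduced here; this only sketches the principle on invented coordinates.

```python
import numpy as np

def motion_features(hand_positions: np.ndarray):
    """hand_positions: (n_frames, 2) array of tracked (x, y) coordinates."""
    velocity = np.diff(hand_positions, axis=0)     # per-frame displacement
    step_lengths = np.linalg.norm(velocity, axis=1)
    trajectory_length = float(step_lengths.sum())  # total path travelled
    return velocity, trajectory_length

# Illustrative track: the hand moves once, then holds still.
positions = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]])
velocity, path = motion_features(positions)
print(velocity[0], path)  # [3. 4.] 5.0
```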
The least popular deep feature extraction method is the 3D CNN, with only three papers (16%) using these networks [44,50,52]. Coincidentally, they were also found to produce the least salient representations in a benchmarking study by Orbay et al. [44], who compare different feature extraction methods on the RWTH-PHOENIX-Weather 2014T dataset. They show that the best Sign2Text translation performance (in terms of BLEU-4 score) is achieved using 2D CNNs pretrained on hand shape classification, followed by pose estimation, 2D CNNs and finally 3D CNNs.
The representations discussed here are extracted from sign language videos. Many papers focus on manual actions or simply consider the whole video frames as inputs. Of the 19 papers that perform Sign2Gloss2Text, Sign2(Gloss+Text) or Sign2Text translation, four (21%) explicitly include nonmanual features such as mouth patterns or features extracted after cropping the face from the image [31,43,47,53]. The other papers focus on hand appearance features or use full frame images.
To the best of our knowledge, there has been no systematic comparison of RNNs and transformers across multiple tasks and datasets for SLT. Within the reviewed papers, we nonetheless find some comparisons between architectures, with varying results. The first comparison is performed by Ko et al. [49] on the KETI dataset (Korean Sign Language (KSL)). Whereas RNNs with Luong attention obtain the highest ROUGE score, transformer based networks perform better in terms of METEOR, BLEU and CIDEr scores. RNNs without attention or with Bahdanau attention are outperformed by the other variants on all reported metrics. Another comparison is performed by Orbay et al. [44] on the RWTH-PHOENIX-Weather 2014T dataset. They obtain different results: an RNN with Bahdanau attention outperforms both an RNN with Luong attention and a transformer in terms of ROUGE and BLEU scores. Yin et al. compare an RNN with Bahdanau attention, an RNN with Luong attention and a transformer, also on the RWTH-PHOENIX-Weather 2014T dataset. They find that the transformer outperforms the RNNs, and that an RNN with Luong attention outperforms one with Bahdanau attention [31]. Finally, Camgoz et al. also compare RNNs and transformers [30]. They report a large increase in BLEU scores when using transformers, compared to a previous paper using RNNs [29]. However, the comparison is between models with different feature extractors: the impact of the architecture versus that of the feature extractors is unclear. It is likely that replacing a 2D CNN pretrained on ImageNet [62] image classification with one pretrained on Continuous Sign Language Recognition (CSLR) will result in a significant increase in performance, especially when the CSLR model was trained on data from the same source (i.e., RWTH-PHOENIX-Weather 2014), as is the case here.

Figure 5: Gloss-based models are used throughout the entire considered time period (2004-2021), but since 2018 models which translate from video to text are gaining traction. As one paper may discuss several tasks, the total count is higher than the number of papers.
We can conclude that the choice of architecture depends on the dataset, sign language representation and translation task. We analyze the performance differences between transformers and RNNs on the RWTH-PHOENIX-Weather 2014T dataset in Section 5.7.
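The two attention variants compared throughout these studies differ only in their scoring function. The sketch below contrasts Bahdanau (additive) scoring, a small feed-forward layer over decoder and encoder states, with Luong (multiplicative, dot-product) scoring. The hidden size, random weight matrices, and sequence length are illustrative stand-ins, not trained values from any of the cited systems.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                         # hidden size (assumed)
decoder_state = rng.standard_normal(d)        # current decoder hidden state
encoder_states = rng.standard_normal((5, d))  # 5 encoder time steps

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Bahdanau (additive): score(s, h) = v^T tanh(W_s s + W_h h)
W_s = rng.standard_normal((d, d))
W_h = rng.standard_normal((d, d))
v = rng.standard_normal(d)
bahdanau = softmax(np.tanh(encoder_states @ W_h.T + decoder_state @ W_s.T) @ v)

# Luong (multiplicative, dot variant): score(s, h) = s^T h
luong = softmax(encoder_states @ decoder_state)

# Both produce a distribution over encoder positions, used to mix encoder
# states into a context vector for the next decoding step.
print(bahdanau.shape, luong.shape)  # (5,) (5,)
```

The additive form introduces extra parameters (W_s, W_h, v), which is one reason the two variants can rank differently across datasets and representations, as the mixed results above suggest.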
We can distinguish two distinct eras in the SLT research domain: the era of SMT systems until 2015, and the NMT era starting in 2018. In the SMT era, Gloss2Text models were the most popular, accounting for 7 of the 9 proposed models (78%); the other two (22%) were Sign2Gloss2Text models. In the NMT era, there is a shift in the distribution. Due to the availability of larger datasets, deep neural feature extractors and neural Sign Language Recognition (SLR) models, Sign2Gloss2Text (17% of 30), Sign2(Gloss+Text) (10%) and Sign2Text (40%) gain in popularity. This gradual evolution from gloss based models towards end-to-end models is visible in Fig. 5. At the same time, it is clear that the domain is still reliant on glosses: 60% of the 30 models proposed since 2018 use gloss information in some form, and 33% of the models proposed since 2018 are Gloss2Text models.

Datasets
Several datasets are used in SLT research. Some are reused often, whereas others are only used once. The distribution is shown in Fig. 6. It is clear that the most used dataset is RWTH-PHOENIX-Weather 2014T [29] (used in 37% of the reported papers). This is because it is the first dataset large enough for neural SLT and because it is readily available for research purposes. This dataset is an extension, for NMT, of earlier versions, RWTH-PHOENIX-Weather [63] and RWTH-PHOENIX-Weather 2014 [34]. It contains videos in DGS, gloss level annotations, and text in German. Precisely because of the popularity of this dataset, we can compare several approaches to SLT: see Section 5.7. Other datasets are also reused several times. The CSL dataset [52] contains videos in Chinese Sign Language (CSL) and text in Chinese. The KETI dataset [49] contains KSL videos, gloss level annotations, and Korean text. RWTH-Boston-104 [35] is a dataset for ASL to English translation containing ASL videos, gloss level annotations, and English text. The ASLG-PC12 dataset [64] contains ASL glosses and English text.
An overview of dataset sizes as well as vocabulary sizes is given in Table 1. Note that the largest dataset in terms of number of parallel sentences, ASLG-PC12, contains 827 thousand training sentences. For MT between spoken languages, datasets typically contain several million sentences, for example the Paracrawl corpus [65]. Furthermore, the largest dataset with video data (RWTH-PHOENIX-Weather 2014T) contains only 7,096 training sentences. It is clear that, compared to spoken language datasets, sign language datasets lack labeled data. In other words, SLT is a low-resource MT task.

Evaluation
The majority of the evaluation of SLT model quality is based on quantitative metrics. Nine different metrics are used for quantitative evaluation across the 32 papers: BLEU, ROUGE, WER, TER, PER, CIDEr, METEOR, COMET and NIST. The BLEU metric is used most often, by 29 papers. To the best of our knowledge, none of the papers discussed in this overview contain evaluations by members of SLCs. One paper does perform human evaluation, but only by hearing people: Luqman et al. [41] let native Arabic speakers rate the model's output translations as "not understandable", "somehow understandable" or "understandable". They define a translation as acceptable if it is "somehow understandable" or "understandable". 80% of the reported translations are rated as understandable, 12% as somehow understandable and 8% as not understandable: therefore, 92% of the translations are acceptable according to the authors. They also report the BLEU and TER metrics. They remark that these metrics cannot handle different word orders, whereas certain word orders are interchangeable in Arabic. This is a drawback of using n-gram based metrics. These metrics depend on the presence of several reference translations per example in the dataset to account for the fact that there can be multiple correct word orders and to account for synonyms. However, often only a single reference translation is provided, for example in the RWTH-PHOENIX-Weather 2014T dataset.
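The word-order sensitivity of n-gram based metrics is easy to demonstrate. The toy function below computes modified n-gram precision, the core ingredient of BLEU, and shows that a reordered but equally valid translation is penalized when only a single reference is available. The sentences are invented examples, not data from any of the reviewed papers.

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    """Clipped n-gram precision of a candidate against one reference."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped counts
    return overlap / max(sum(cand_ngrams.values()), 1)

reference = "tomorrow it will rain in the north"
reordered = "in the north it will rain tomorrow"  # same meaning, new order

print(ngram_precision(reference, reference))  # 1.0
print(ngram_precision(reordered, reference))  # 0.666..., despite identical meaning
```

With multiple references covering the valid word orders, the reordered sentence would score higher; with the single reference common in SLT datasets, it is simply marked wrong.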

The RWTH-PHOENIX-Weather 2014T benchmark
The popularity of the RWTH-PHOENIX-Weather 2014T dataset facilitates the comparison of different SLT models on this dataset. As the BLEU metric is the most common in these papers, we compare models based on their BLEU-4 score.
An overview of Gloss2Text models is shown in Table 2. For Sign2Gloss2Text, we refer to Table 3. For Sign2(Gloss+Text) and Sign2Text, we list the results in Tables 4 and 5, respectively.

Sign language representations
We observe a wide variety of feature extraction methods across the different papers. They range from conceptually simple, frame based feature extractors to linguistically motivated systems.
Several papers use features extracted using a 2D CNN by first training a CSLR model on RWTH-PHOENIX-Weather 2014 [29,43,48]. These papers use the full frame as input to the feature extractor.
Other papers combine multiple input channels. Yin et al. [31] use Spatio-Temporal Multi-Cue (STMC) features, extracting images of the face, hands and full frames as well as including estimated poses of the body. These features are processed by a network which performs temporal processing, both at the intra-cue and the inter-cue level. Their models are the SOTA of Sign2Gloss2Text translation. A similar approach is taken by Zhou et al. [47], whose model obtains a BLEU-4 score of 23.65 on Sign2(Gloss+Text) translation (SOTA). Camgoz et al. use mouth pattern cues, pose information and hand shape information; by using this multi-cue representation, they are able to remove glosses from their model and achieve competitive performance compared to models that do use glosses [43]. The SOTA is defined by the use of multi-cue features for Sign2Gloss2Text [31] and Sign2(Gloss+Text) [47] translation.
Frame based feature representations result in long input sequences to the translation model. The length of these sequences can be reduced by considering short clips instead of frames. We find this approach in the scientific literature in two forms: (i) using a pretrained 3D CNN or (ii) reducing the sequence length using temporal convolutions or RNNs that are trained jointly with the translation model. The benchmarking study of Orbay et al. [44] suggests that 2D CNNs provide more informative features than 3D CNNs, especially if the 2D CNN is pretrained on a related task such as human pose estimation or hand shape classification. The fact that 3D CNNs are outperformed by 2D CNNs in that study does not mean that spatial feature extractors always outperform spatio-temporal feature extractors. Zhou et al. [46] use 2D CNN features extracted from full frames, which are then further processed using temporal convolutions, reducing the temporal feature size by a factor of 4. They call this approach Temporal Inception Networks (TIN). They achieve near-SOTA performance on Sign2Gloss2Text translation and SOTA performance on Sign2Text translation (24.32 BLEU-4). Zheng et al. [42] propose using an unsupervised algorithm called Frame Stream Density Compression (FSDC) to remove temporally redundant frames by comparing frames at the pixel level. The resulting features are then processed using a combination of temporal convolutions and RNNs. Zheng et al. compare the different settings and their combination and find that these techniques can be used to reduce the input size of the sign language features as well as to improve the translation performance to 10.66 BLEU-4. They do not achieve performance on par with the SOTA due to their choice of an underpowered feature extractor: AlexNet [66] pretrained on ImageNet [62]. This feature extractor was also used by Camgoz et al. [29], whose model achieved a BLEU-4 score of 9.58. It has been superseded by the more advanced ones mentioned above. Clearly, temporal processing of the sign language representation before translation can improve the performance of SLT models, if the temporal processing module is trained jointly with the translation model so that it can exploit information on sign language translation (e.g., [31,42,46]). Otherwise, the temporal processing results in a loss of information (e.g., [44]).
In conclusion, SLT models can be improved by applying the two following procedures. Firstly, features can be extracted from multiple channels linked to sign language parameters (e.g., hand and face crops). Secondly, temporal processing methods that are trained jointly with the translation model can reduce the sequence length and improve performance.
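The second procedure, temporal downsampling, can be sketched as a strided 1D convolution over the time axis that shortens a frame-level feature sequence by its stride. The kernel weights below are random stand-ins; in the cited systems [31,42,46] such layers are trained jointly with the translation model, and the kernel size, stride and feature dimension here are assumptions chosen to mirror the factor-4 reduction mentioned above.

```python
import numpy as np

def temporal_conv(features: np.ndarray, kernel_size: int = 4, stride: int = 4) -> np.ndarray:
    """Strided temporal convolution: (T, d) frame features -> shorter clip features."""
    T, d = features.shape
    rng = np.random.default_rng(0)
    # Random stand-in for a learned kernel mapping (kernel_size, d) windows to d outputs.
    kernel = rng.standard_normal((kernel_size, d, d)) / np.sqrt(kernel_size * d)
    out = []
    for start in range(0, T - kernel_size + 1, stride):
        window = features[start:start + kernel_size]          # (kernel_size, d)
        out.append(np.einsum("kd,kde->e", window, kernel))    # mix window into one vector
    return np.stack(out)

frames = np.random.randn(116, 64)
reduced = temporal_conv(frames)
print(frames.shape, reduced.shape)  # (116, 64) (29, 64)
```

Because the output at each step mixes a whole window of frames, the reduction discards no time steps outright; whether the retained information is useful for translation depends on the weights, hence the need to train this layer jointly with the translator.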

Neural architectures
We now determine whether recurrent models or transformers perform best on this dataset. As this may depend on the sign language representation and translation model used, we perform a separate analysis for Gloss2Text, Sign2Gloss2Text, Sign2(Gloss+Text) and Sign2Text. Because all Gloss2Text models use the same sign language representation, i.e., glosses, we can directly compare the performance of different encoder-decoder architectures. Here, we see that transformer based models perform better (23.67 ± 0.998) than recurrent models (17.64 ± 0.955).
We cannot as easily compare the performance between recurrent models and transformer based models for Sign2Gloss2Text translation, because different papers use different feature extractors. There is, however, one work that compares both architectures. Yin et al. [31] achieve better performance with transformers (24.7 ± 0.699) than with recurrent models (21.65 ± 0.105).
No such comparison is available for Sign2(Gloss+Text) translation. We can only note that the best performing model is that of Zhou et al. [47]. It is an LSTM encoder-decoder using spatio-temporal multi-cue features. These features are similar to the ones used by Yin et al. [31], who obtained better results using transformers than using RNNs for Sign2Gloss2Text translation. Because of the lack of comparative papers, we are unable to draw any conclusions with regard to the difference in performance between transformers and RNNs for Sign2(Gloss+Text) translation.
The Sign2Text translation models exhibit higher variance in their scores than models for the other tasks. This is due to the lack of an additional supervision signal in the form of glosses: the choice of sign language representation has a larger impact on the translation score. The difference in BLEU-4 score between transformers (19.478 ± 3.294) and RNNs (10.007 ± 1.905) is larger than in other tasks. This is possibly because transformers are better able to handle long term dependencies through the attention mechanism: without glosses, the dependencies in the source language sentences can be quite long.
We provide a graphical overview of the performance of RNNs and transformers across tasks in Fig. 7 and conclude that transformers outperform RNNs in most cases on RWTH-PHOENIX-Weather 2014T.

6 Discussion of the current state of the art

By analyzing the existing scientific literature on SLT, we set out to discover answers to the following questions. Which kinds of sign language representations are most informative (RQ1)? Which types of models should be used for SLT (RQ2)? Which datasets are currently used (RQ3)? How can we evaluate SLT models (RQ4)? Because of the diversity in applied methods and datasets, answering these questions definitively is challenging. However, the popularity of the RWTH-PHOENIX-Weather 2014T benchmark has enabled some comparisons between different models and provides us with several insights.

Sign language representations
RQ1 asks, "Which kinds of sign language representations are most informative?" The best performing SLT models all use either (i) deep learning-based sign language representations or (ii) glosses. Regarding (i), most commonly these representations are extracted using CNNs that are pretrained on related tasks (e.g., hand shape classification or CSLR).
Several feature extraction techniques are used in the literature. They range from 2D CNNs to human pose estimation and 3D CNNs, and combinations thereof. Translation models benefit from sign language representations extracted from multiple cues (e.g., hand crops, face crops and pose estimation). Human pose estimation is primarily useful as such a cue rather than as the only feature extractor [31,43]: current techniques often fail in the presence of motion blur and manual-facial interactions, which are common in sign language data [27,67]. After the sign language representation (i.e., a sequence of such features) has been extracted, its length can be reduced to further improve the translation performance. This step must be trained end-to-end with the translation model [31,42,46]. Otherwise, if the sequence is simply reduced at arbitrary steps (as, for example, a frozen pretrained 3D CNN would do), the translation model's performance suffers [44].
Earlier papers using SMT methods often include linguistic properties of sign languages in their models as a way to reduce the problem complexity, for example by computing features that represent hand trajectories. In contrast, many recent NMT-based papers prefer to use end-to-end deep learning. Given the low-resource nature of current SLT datasets (often only several thousands of parallel sentences), the incorporation of domain knowledge (in this case, linguistic knowledge) proves beneficial to the quality of the translations.
Glosses (ii), as written sign language representations which aim to capture the meaning of signs, are easily adopted in sequence-to-sequence models, as both written text and signs can be mapped to glosses. Out of the 32 papers we reviewed, 23 use glosses in some way. Unfortunately, using glosses has several drawbacks. Firstly, annotating sign language corpora on the gloss level is a time-consuming process typically performed by domain experts. Secondly, any translation model that uses glosses as a sign language representation (i.e., Gloss2Text and Sign2Gloss2Text models) requires a sign language recognition system of sufficient quality, also at inference time. Finally, glosses do not accurately capture the entire meaning of signs. Some research attempts to remove the gloss dependency, instead proposing so-called Sign2Text systems which translate directly from sign language videos into spoken language texts. Sign2Text systems are slowly becoming competitive with systems dependent on glosses in terms of translation scores such as BLEU-4. A large part of this is due to the incorporation of domain knowledge in the design of the sign language representations, for example including mouth patterns as an additional information channel [31,43,47].

Translation model architectures
Once a proper sign language representation is determined, a translation model must be designed. RQ2 asks whether there is one superior algorithm for SLT. Despite the generally small size of the datasets used for SLT, we see that neural MT models achieve the highest translation scores. In particular, transformers appear to outperform RNNs in several cases, but not consistently: it depends on the task and the used sign language representation. The fact that pretrained language models are readily available for many transformer based architectures (for example via the HuggingFace Transformers library [68]) may give transformers an edge over RNNs. De Coster et al. [48] have shown that integrating pretrained spoken language models in an SLT model can increase the BLEU-4 score compared to training transformers from scratch on SLT. Furthermore, the attention mechanism in transformers can be used to inspect the model's decision making [20]. This was, for example, previously performed for isolated SLR, showing that transformers focus on distinguishing frames in clips [69]. To the best of our knowledge, no such analysis has been published yet for transformers in SLT. It may prove useful for error analysis of current translation models.

Datasets
Neural translation models need to be trained with sufficiently large datasets. The third question we set out to answer, RQ3, is, "Which datasets are used and what are their properties?" The most used dataset is RWTH-PHOENIX-Weather 2014T [29] for translation from DGS to German. It contains 8,257 parallel utterances from several different interpreters. The domain is weather broadcasts. The fact that it is used so often allows for comparisons between different methods and for incremental progress in SLT.
Several papers use custom datasets, and often only once. Custom datasets can be useful to validate existing approaches on different languages, but the majority of progress is made on public benchmark datasets such as RWTH-PHOENIX-Weather 2014T, because such datasets enable the comparison of different approaches. In contrast, using private datasets limits the reproducibility of the presented models. Preferably, a model is trained and validated on a benchmark; then, the same model can be evaluated on private datasets, perhaps in a local sign language.
Current datasets have several limitations. They are typically restricted to limited domains of discourse (weather broadcasts) and have little variability in terms of visual conditions (TV studios). Furthermore, some sign language translation datasets contain recordings of nonnative signers. In some cases, the signing is interpreted (under time pressure) from spoken language. This means that the used signing may not be representative of the sign language and may in fact be influenced by the grammar of a spoken language. Training a translation model on these kinds of data has implications for the quality and accuracy of the resulting translations.

Tasks
A training scheme is needed to train these translation models on the proposed data. In the literature, we have found four such schemes: Gloss2Text, Sign2Gloss2Text, Sign2(Gloss+Text) and Sign2Text. Two of these require gloss annotations for all examples (Gloss2Text and Sign2Gloss2Text). Sign2(Gloss+Text) requires gloss annotations for some or all examples (and does not require glosses at inference time), and Sign2Text does not use glosses at all. Fig. 8 shows an overview of the BLEU-4 scores on the RWTH-PHOENIX-Weather 2014T dataset from its release in 2018 until August 2021. We clearly see increases in scores for all four tasks, though the increase has not yet continued into 2021 for Gloss2Text and Sign2Gloss2Text. We see a rising trend for Sign2(Gloss+Text) and Sign2Text, as there is no gloss bottleneck and improvements in feature extraction techniques as well as architectures enable better translation scores. In most cases, transformers outperform RNN-based methods on RWTH-PHOENIX-Weather 2014T.

Evaluation
In terms of evaluation, we see many papers reporting several translation related metrics, such as BLEU, ROUGE, WER and METEOR. These are standard metrics in MT. Several papers also provide example translations to allow the reader to gauge the translation quality for themselves. Whereas the above metrics often correlate quite well with human evaluation, this is not always the case [70]. Including human evaluation is especially important for spoken to signed language translation, where avatars must sign in a natural way. However, it is also paramount for signed to spoken language translation, to assess the fluency and correctness of translations. Only one of the 32 reviewed papers incorporates human evaluators in the loop [41]. In none of the reviewed papers is evaluation in collaboration with SLC members mentioned. As the translation models proposed in these papers are designed first and foremost for communication between nonsigners and sign language users, both groups should be involved in the evaluation of those models.
7 Challenges and proposals

Sign language representations
SLT models require salient representations of sign languages as input, as noted in Section 6.1. The SLT models we investigated are primarily adapted spoken language MT models. Therefore, they expect sign language representations to contain information on the meaning of signs, similar to the function of word embeddings in spoken language MT. There is currently no representation that contains all of the information required for proper sign language translation and that takes into account both the established and the productive lexicons. In fact, it is doubtful whether a pure end-to-end NMT approach is capable of tackling productive signs. To recognize and understand productive signs, we need models that have the ability to link abstract visual information to the properties of objects. This relates to caption generation models and to models trained to classify, e.g., highly abstracted drawings. Incorporating the productive lexicon in translation systems is a significant challenge, one for which, to the best of our knowledge, labeled data are currently not available.
The majority of SLT models, especially neural models, overlook several linguistic elements of signing. Especially in recent years, SLT has been tackled using end-to-end deep learning techniques. The representations used are typically phonological, focusing on detecting hand shapes and movements. These representations form a good basis for a translation model, but they do not contain meaning. They also do not model fingerspelling, signing space, or classifiers explicitly. Learning these aspects in the translation model with an end-to-end approach is challenging. This is made even more difficult by the lack of annotated data.
The design of a salient sign language representation relates to the problem of sign language segmentation. De Sisto et al. [15] describe why segmentation into individual signs is difficult, citing among others coarticulation and simultaneity. The definition and extraction of so-called "meaningful units" for SLT is an open research question that needs to be answered in order to move beyond frame based and clip based phonological representations. Collaboration between computer scientists and (computational) linguists can aid in this effort.

Exploiting the vast amounts of unlabeled data
Current SOTA SLT models use sign language representations created through supervised end-to-end deep learning, i.e., on labeled datasets. The collection and annotation process of these datasets is expensive and time intensive. However, vast amounts of unannotated sign language videos already exist. This raises the question of how unsupervised machine learning techniques can be exploited in the domain of SLT.
In the domain of natural language processing, we already observe tremendous advances thanks to unsupervised language models such as BERT [22]. Whereas such (pretrained) models have been exploited in SLT papers [45,48], they have not yet been trained specifically for the purpose of processing sign language videos.
In computer vision, self-supervised techniques are applied to pretrain powerful feature extractors which can then be applied to downstream tasks such as image classification or object detection. Algorithms such as SimCLR [71], BYOL [72] and DINO [73] are used to train 2D CNNs without labels, reaching performance that is almost on par with models trained with supervised techniques.
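To make the contrastive idea behind algorithms like SimCLR concrete, the following is a minimal sketch of an NT-Xent-style loss in pure Python. The function names and the use of plain lists of floats as "embeddings" are our illustrative simplifications, not part of any of the cited implementations; in practice the embeddings would come from a feature extractor applied to two augmented views of the same unlabeled video clip.

```python
import math

def cosine(u, v):
    """Cosine similarity between two (nonzero) embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nt_xent_loss(views_a, views_b, temperature=0.5):
    """Contrastive loss over a batch: views_a[i] and views_b[i] are
    embeddings of two augmentations of the same unlabeled clip; every
    other pair in the batch acts as a negative. For brevity, only the
    anchors from views_a are used (SimCLR symmetrizes over both views)."""
    n = len(views_a)
    embeddings = views_a + views_b
    loss = 0.0
    for i in range(n):
        # denominator: similarities to all other items in the batch
        denom = sum(
            math.exp(cosine(embeddings[i], e) / temperature)
            for j, e in enumerate(embeddings) if j != i
        )
        # numerator: similarity to the positive pair (i, i + n)
        pos = math.exp(cosine(embeddings[i], embeddings[i + n]) / temperature)
        loss += -math.log(pos / denom)
    return loss / n
```

Minimizing this loss pulls the two views of the same clip together in embedding space while pushing apart views of different clips, which is precisely what lets such extractors learn without labels.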
Up to a certain level, sign languages share some common elements, for example the fact that they all use the human body to convey information. Movements used in signing are composed of motion primitives, and the configuration of the hand (shape and orientation) is important in all sign languages. The recognition of these low-level components does not require language-specific datasets and could be performed on multilingual datasets, containing videos recorded around the world from people of various ages, genders, and ethnicities. A parallel can be drawn to speech recognition: Wav2Vec 2.0, for example, learns discrete speech units in a self-supervised manner [74]; a similar approach could be beneficial for pretraining on unlabeled sign language videos.
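A toy illustration of the discrete-unit idea: once a codebook of prototype feature vectors has been learned (e.g., by clustering pose features, analogous to the quantization module in Wav2Vec 2.0), each continuous frame can be mapped to the index of its nearest codebook entry. The function name and two-dimensional "features" below are hypothetical simplifications for illustration only.

```python
def assign_units(frames, codebook):
    """Map each continuous feature frame to the index of its nearest
    codebook entry (squared Euclidean distance), yielding a sequence
    of discrete units."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [
        min(range(len(codebook)), key=lambda k: sqdist(frame, codebook[k]))
        for frame in frames
    ]
```

A video would thus be reduced to a sequence of discrete unit indices, a representation that downstream models (or linguists) could inspect and reuse across sign languages.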
Deep neural networks trained in a self-supervised way could be applied to learn such common elements of different sign languages. This would not only facilitate automatic SLT, but could also lead to the development of new tools supporting linguistic analysis of sign languages. However, there has been very little investigation into these matters in the scientific literature. Given that annotated data is so scarce in this domain, we advocate for the investigation of unsupervised machine learning techniques.

Dataset collection
Currently, SLT from sign language videos to spoken language text is a low-resource MT task, with the largest public dataset containing just 7,096 training examples [29]. Furthermore, far from all sign languages have corresponding translation datasets. Additional datasets need to be collected and existing ones need to be extended. Such collection efforts are expensive, and there are several caveats.
De Meulder [75] raises concerns with current dataset collection efforts. Existing datasets and those currently being collected suffer from several biases. If interpreted data are used, influence from spoken languages will be present in the dataset. If only native signer data are used, then the majority of signers will have the same ethnicity. Both statistical and neural MT exacerbate bias [76,77]. Therefore, when our training datasets are biased and small in volume, we cannot expect (data driven) MT systems to achieve high quality and generalizability.
We remark on the need for two kinds of datasets. On the one hand, sign language recognition, for the purpose of extracting sign language representations, requires large and varied datasets. These can be collected as part of new efforts, or collated from existing ones. If unsupervised techniques are applied (as mentioned above), then these datasets do not necessarily need to be labeled entirely. They also do not need to consist entirely of native signing. On the other hand, sign language translation requires high-quality labeled data, the collection of which is challenging. Bragg et al.'s first and second calls to action, "Involve Deaf team members throughout" and "Focus on real-world applications" [78], can be a guide during the dataset collection process. By involving SLC members, the dataset collection effort can be guided towards use cases that would benefit SLCs. Additionally, by collecting datasets with a limited domain of discourse targeted at specific use cases, the SLT problem is effectively simplified. As a result, any applications would be limited in scope, but more useful in practice.
Current research uses datasets in which the videos have fixed viewpoints and similar backgrounds; sometimes, the signers even wear similar clothing for maximum contrast with the background. In real-world applications, dynamic viewpoints and lighting conditions will be a common occurrence. Dataset collection efforts need to take this into account.
Sincan et al.'s AUTSL dataset addresses these concerns, but for sign language recognition [79]. They focused on gathering a large dataset with variability in participants and their knowledge of sign language (including native and nonnative signers). Videos were recorded from different viewpoints and in different locations. Similar dataset collection efforts can aid the domain of SLT, reducing the discussed forms of bias and improving the translation quality for specific use cases.

Computational performance
Within the reviewed papers, there is little to no discussion of the computational performance of the proposed methods. If the end goal is real-time SLT, researchers need to be concerned with the types of approaches used. Of course, achieving acceptable translation quality takes priority over computational performance in the current stage of SLT research.
Research into efficient deep learning is a hot topic, as deep neural networks are now deployed on mobile and embedded devices. Engineering techniques such as network quantization can reduce the computational cost of deep neural networks [80]. At the same time, theoretical properties of neural networks are being investigated, for example to reduce the computational complexity of transformers [81] and 2D CNNs [82]. SLT models will be able to benefit from these optimizations in the future.
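To illustrate the quantization idea in its simplest form, the sketch below performs symmetric post-training quantization of a weight vector to int8: a single scale factor maps floats to 8-bit integers, shrinking storage roughly fourfold relative to float32 at the cost of a bounded rounding error. This is a deliberately minimal illustration; production libraries use per-channel scales, zero points and calibration data.

```python
def quantize_int8(weights):
    """Symmetric post-training quantization: represent float weights
    as int8 values plus one float scale factor."""
    scale = max(abs(w) for w in weights) / 127.0  # assumes nonzero weights
    quantized = [int(round(w / scale)) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 representation."""
    return [q * scale for q in quantized]
```

The reconstruction error per weight is at most half the scale, which is why quantization often costs little accuracy while substantially reducing memory traffic and compute.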
Therefore, rather than optimizing translation models at this stage in research, we propose to simply keep computational efficiency in mind but focus first on translation quality.When translation quality reaches levels that are acceptable for end users, then the advances in deep learning can be exploited to make these existing models more efficient.
In the context of computational performance and resource consumption, we cannot ignore the environmental impact of deep neural models. Whereas Strubell et al. [83] and Schwartz et al. [84] advocate for greener NMT, ever larger models are still being trained. Perhaps, when deciding on the architecture of an SLT model, we should pay more attention to the fact that we are dealing with a low-resource problem and design our systems accordingly; consider, for example, the paper by Sennrich and Zhang [85]. More compact models, trained faster on smaller but use-case-specific datasets, can be a computationally more efficient, more ecological and (for those use cases) more effective option.

Evaluation
Current research uses mostly quantitative metrics to evaluate SLT models, on relatively simple datasets. Models should also be evaluated on real-world data from real-world settings. Furthermore, human evaluation by signers and nonsigners is required to truly assess the translation quality.
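As a concrete example of such a quantitative metric, the sketch below computes a simplified sentence-level BLEU-4 score, the metric most widely reported in the reviewed papers. It is a hedged illustration, not the reference implementation: standard BLEU is computed at corpus level and libraries such as sacreBLEU differ in tokenization and smoothing (here we use simple add-one smoothing so a single missing n-gram order does not zero out the score).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(hypothesis, reference):
    """Simplified sentence-level BLEU-4 with add-one smoothing and a
    brevity penalty, for a single whitespace-tokenized sentence pair."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, 5):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())       # clipped n-gram matches
        total = max(sum(h.values()), 1)
        precisions.append((overlap + 1) / (total + 1))
    # brevity penalty discourages overly short hypotheses
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)
```

Even this toy version makes the metric's limitation for SLT evaluation visible: it rewards surface n-gram overlap with a single reference translation, which is exactly why human evaluation by signers and nonsigners remains indispensable.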
Human-in-the-loop development can not only yield better end results, but also alleviate some of the concerns within SLCs about the application of MT techniques to sign languages, in particular regarding the appropriation and alteration of those languages. Instead of performing research in isolation and then presenting a result to sign language users (the waterfall technique), an agile approach should be used. This could prove especially beneficial in the domain of SLT specifically: many researchers are hearing and do not use signing as their primary means of communication. By frequently evaluating models together with sign language users, we as a research community can avoid drifting off course towards unusable models. Wolfe et al. state with regard to sign language avatars: "the quality of the ultimate signed language display must be given highest priority in a spoken to signed translation system" [86]. We argue that the same holds for translation from signed to spoken languages.
Finally, we see that in-depth error analysis is missing from many SLT papers. By (visually and/or numerically) analyzing which types of errors our models make, we can make them more robust and focus our improvement efforts where they are most needed. This will benefit the applicability of the researched techniques in real-world settings.

The need for interdisciplinary research
Sign language MT cannot be tackled by computer scientists or linguists alone. There is a need for collaboration between technical and linguistic profiles, driven by the input and augmented by the feedback of stakeholders, namely the members of SLCs. Without linguists, computer scientists are prone to missing crucial aspects of sign language grammars. Without the watchful eye of SLCs, researchers may lose track of which types of applications are desired by end users. Computer scientists, finally, are needed because SLT is a highly challenging problem from a technical point of view, both in terms of machine learning and in terms of developing real-time applications. Only through the consistent and close-knit collaboration of these groups can an SLT system that matches the needs of SLC members be developed.
Within the European Union, two international research projects have been launched as recently as January 2021: SignON [87] and EASIER [88]. The consortia of both projects are composed of computer scientists, linguists and representatives of SLCs. The interdisciplinary collaborations within these projects have the potential to accelerate sign language translation research in the direction of applications that are useful for the end users.

Conclusion
In this article, we review 32 papers on machine translation from a signed to a spoken language, selected based on predefined criteria and indicative of sound SLT research. The selected papers are written in English and peer reviewed. They propose, implement and evaluate a sign language machine translation system from a sign language to a spoken language using an RGB camera, and do not contain descriptions that are explicitly offensive to members of sign language communities. Based on this systematic overview, we discuss the SOTA of SLT and explore several challenges and opportunities for future research.
In recent years, neural machine translation has become dominant in the growing domain of SLT. The most powerful sign language representations are those that combine information from multiple channels (manual actions, body movements and mouth patterns) and those that are reduced in length by temporal processing modules trained jointly with the translation model. These translation models are typically RNNs or transformers; transformers outperform RNNs in many cases. SLT datasets are small: we are dealing with a low-resource machine translation problem. The datasets consider limited domains of discourse and generally contain recordings of nonnative signers. This has implications for the quality and accuracy of translations generated by models trained on these datasets, which must be taken into account when evaluating SLT models. This evaluation is mostly performed using quantitative metrics that can be computed automatically, given a corpus. There are currently no papers that perform evaluation in collaboration with sign language users.
Progressing beyond the current SOTA of SLT requires efforts in data collection, the design of sign language representations, machine translation, and evaluation. Future research may improve sign language representations by incorporating domain knowledge into their design and by leveraging abundant, but as yet unexploited, unlabeled data. Research should be conducted in an interdisciplinary manner, with computer scientists, sign language linguists, and experts on sign language cultures working together. Finally, sign language translation models should be evaluated in collaboration with end users: native signers as well as hearing people who do not know any sign language.

Figure 4: The earlier papers on SLT all propose SMT models; since 2018, NMT has become the dominant variant. The considered papers were published between 2004 and 2021. SMT: Statistical Machine Translation, NMT: Neural Machine Translation, RBMT: Rule-based Machine Translation.

Figure 6: The RWTH-PHOENIX-Weather 2014T dataset is used the most throughout the literature. Other datasets are referenced at most three times in the 32 discussed papers. Any dataset that occurs only once is listed under "Other (singleton)".

Figure 7: Transformers tend to outperform RNNs on different SLT tasks in terms of BLEU-4 score on the RWTH-PHOENIX-Weather 2014T dataset.

Figure 8: We can observe a general trend towards higher BLEU-4 scores on the RWTH-PHOENIX-Weather 2014T dataset for all tasks. As recently as 2021, there is little difference between the top scores of different tasks.

Table 1: Statistics of the datasets that are used in more than one paper.

Table 2: Performance of different models on RWTH-PHOENIX-Weather 2014T Gloss2Text translation.
(91%) papers. ROUGE is used by fifteen (47%) papers. The WER is reported on by seven (22%) papers, TER by five (16%) and PER by six (19%). CIDEr and METEOR are used four (13%) and seven (22%) times respectively, and COMET and NIST are reported on by one paper each. It is clear that BLEU is the most popular metric, followed by ROUGE, WER and METEOR.

Table 3: Performance of different models on RWTH-PHOENIX-Weather 2014T Sign2Gloss2Text translation.

Table 5: Performance of different models on RWTH-PHOENIX-Weather 2014T Sign2Text translation.