Text2Sign: Towards Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks

We present a novel approach to automatic Sign Language Production using recent developments in Neural Machine Translation (NMT), Generative Adversarial Networks, and motion generation. Our system is capable of producing sign videos from spoken language sentences. Contrary to current approaches that are dependent on heavily annotated data, our approach requires minimal gloss and skeletal level annotations for training. We achieve this by breaking down the task into dedicated sub-processes. We first translate spoken language sentences into sign pose sequences by combining an NMT network with a Motion Graph. The resulting pose information is then used to condition a generative model that produces photo realistic sign language video sequences. This is the first approach to continuous sign video generation that does not use a classical graphical avatar. We evaluate the translation abilities of our approach on the PHOENIX14T Sign Language Translation dataset. We set a baseline for text-to-gloss translation, reporting a BLEU-4 score of 16.34/15.26 on dev/test sets. We further demonstrate the video generation capabilities of our approach for both multi-signer and high-definition settings qualitatively and quantitatively using broadcast quality assessment metrics.

Like spoken languages, sign languages have their own grammatical rules and linguistic structures. This makes the task of translating between spoken and signed languages a complex problem, as it is not simply an exercise of mapping text to gestures word-by-word (see Fig. 1 which demonstrates that both the tokenization of the languages and their ordering is different). It requires machine translation methods to find a mapping between a spoken and signed language, that takes into account both their language models.
To facilitate easy and clear communication between the hearing and the Deaf, it is vital to build robust systems that can translate spoken languages into sign languages and vice versa. This two way process can be facilitated using Sign Language Recognition (SLR) and Sign Language Production (SLP), (see Fig. 2).
Commercial applications for sign language primarily focus on SLR, mapping sign to spoken language and typically providing a text transcription of the sequence of signs, such as Elwazer (2018) and Robotka (2018). This is due to the misconception that deaf people are comfortable with reading spoken language and therefore do not require translation into sign language. However, there is no guarantee that someone whose first language is, for example, British Sign Language is familiar with written English, as the two are completely separate languages. Furthermore, generating sign language from spoken language is a complicated task that cannot be accomplished with a simple one-to-one mapping. Unlike spoken languages, sign languages employ multiple asynchronous channels (referred to as articulators in linguistics) to convey information. These channels include both the manual (i.e. upper body motion, hand shape and trajectory) and non-manual (i.e. facial expressions, mouthings, body posture) features.
Fig. 1 Translating from spoken language text into sign language video. Glosses are used as an intermediate representation. There is often no direct mapping between spoken language and sign language sentences
Fig. 2 Sign language recognition versus production
The problem of SLP is generally tackled using animated avatars, such as Cox et al. (2002), Glauert et al. (2006) and McDonald et al. (2016). When driven using motion capture data, avatars can produce life-like signing, however this approach is limited to pre-recorded phrases, and the production of motion capture data is costly. Another method relies on translating the spoken language into sign glosses (lexical entities that represent individual signs) and connecting each entity to a parametric representation, such as the hand shape and motion needed to animate the avatar. However, there are several problems with this method. Translating a spoken sentence into sign glosses is a non-trivial task, as the ordering and number of glosses does not match the words of the spoken language sentence (see Fig. 1). Additionally, by treating sign language as a concatenation of isolated glosses, any context and meaning conveyed by non-manual features is lost. This results in at best crude, and at worst incorrect translations, and results in the indicative 'robotic' motion seen in many avatar based approaches.
To advance the field of SLP, we propose a new approach, harnessing methods from NMT, computer graphics, and neural network based image/video generation. The proposed method is capable of generating a sign language video, given a written or spoken language sentence. An encoder-decoder network provides a sequence of gloss probabilities from spoken language text input, that is used to condition a Motion Graph (MG) to find a pose sequence representing the input. Finally, this sequence is used to condition a GAN to produce a video containing sign translations of the input sentence (see Fig. 4). The contributions of this paper can be summarised as:
-An NMT-based network combined with a motion graph that allows for continuous-text-to-pose translation.
-A generative network conditioned on pose and appearance.
-To our knowledge the first spoken language to sign language video translation system without the need for costly motion capture or an avatar.
A preliminary version of this work was presented in Stoll et al. (2018). This extended manuscript contains an improved pipeline and additional formulation. We introduce an MG into the process, that combined with the NMT network is capable of text-to-pose (text2pose) translations. Furthermore, we demonstrate the generation of multiple signers of varying appearance. We also investigate high-definition (HD) sign generation. Extensive new quantitative as well as qualitative evaluation is provided, exploring the capabilities of our approach. Figure 3 gives a comparison of the output of our approach (right) to other avatar based approaches (left and middle).
The rest of this paper is organised as follows: Sect. 2 gives an overview of recent developments in NMT as well as traditional SLP using avatars. We explain the concept of motion graphs, before describing recent advancements in generative image models. Section 3 introduces all parts of our approach. In Sect. 4 we evaluate our system both quantitatively and qualitatively, before concluding in Sect. 5.

Related Work
We treat Sign Language Production (SLP), as a translation problem from spoken into signed language. We therefore first review recent developments in the field of Neural Machine Translation (NMT). However, SLP is different from traditional translation tasks, in that it inherently requires visual content generation. Normally this is performed by animating a 3D avatar. We will therefore give an overview of past and current sign avatar technology. Finally, we cover the concept of Motion Graphs (MGs), a technique used in computer graphics to dynamically animate characters, and the field of conditional image generation.

Neural Machine Translation
NMT utilises Recurrent Neural Network (RNN) based sequence-to-sequence (seq2seq) architectures which learn a statistical model to translate between different languages. Seq2seq (Sutskever et al. 2014; Cho et al. 2014) has seen success in translating between spoken languages. It consists of two RNNs, an encoder and a decoder, that learn to translate a source sequence to a target sequence. To tackle longer sequences Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) or Gated Recurrent Units (GRU) (Chung et al. 2014) are used as RNN cells. Both architectures have mechanisms that allow each cell to pass only the relevant information to the next time step, hence improving translation performance over longer-term dependencies.
To further improve the translation of long sequences Bahdanau et al. (2014) introduced the attention mechanism. It provides additional information to the decoder by allowing it to observe the encoder's hidden states. This mechanism was later improved by Luong et al. (2015). Camgoz et al. (2018) combine a standard seq2seq framework with a Convolutional Neural Network (CNN) to translate sign language videos to spoken language sentences. They first extract features from video using the CNN before translating to text. Similarly, Guo et al. (2018) combine a CNN and an LSTM-based encoder-decoder. However, they employ a 3D CNN to better learn spatio-temporal relationships and identify key clips. This guides the model to focus on the information-rich content. Both approaches can be seen as the inverse of our problem of translating text to pose.
More recently, non-RNN based NMT methods have been explored. ByteNet performs translation using dilated convolutions, and Vaswani et al. (2017) introduced the transformer, which is a purely attention-based translation method. Focusing specifically on sign language, multi-modal fusion networks have been proposed (Guo et al. 2017; Wang et al. 2018a).
Using NMT methods to translate text to pose is a relatively unexplored and open problem. Ahn et al. (2018) use an RNN-based encoder-decoder model to produce upper body pose sequences of human actions from text and map them onto a Baxter robot. However, their results are purely qualitative and rely on human interpretation. For our work we first translate from text to gloss using a seq2seq architecture with Luong attention (Luong et al. 2015) and GRUs (Chung et al. 2014), similar to Camgoz et al. (2018). However, as we are translating text to pose we do not use a CNN as an initial step. In contrast, we use the probabilities produced by the decoder at each time step to solve a Motion Graph (MG) of sign language pose data, to obtain the text to pose translation.

Avatar Approaches for Sign Language Production
Sign avatars can either be driven directly from motion capture data, or rely on a sequence of parametrised glosses. Since the early 2000s there have been several research projects exploring avatars animated from parametrised glosses, e.g. VisiCast (Bangham et al. 2000), eSign (Zwitserlood et al. 2004), Tessa (Cox et al. 2002), dicta-sign (Efthimiou 2012), and JASigning (Virtual Humans Group 2017). All of these approaches rely on sign video data to be annotated using a transcription language, such as HamNoSys (Prillwitz 1989) or SigML (Kennaway 2013). Whilst these avatars are capable of producing sign sequences, they are not popular with the Deaf community. This is due to under-articulated and unnatural movements, but mostly due to missing non-manuals, such as eye gaze and facial expressions (see Fig. 3). Important meaning and context is lost this way, making the avatars difficult to understand. Furthermore, the robotic motion of the aforementioned avatars can make viewers uncomfortable, due to the uncanny valley 2 (Mori et al. 2012). Recent work has begun to integrate non-manuals into the annotation and animation process (Ebling and Glauert 2013; Ebling and Huenerfauth 2015). However, the correct alignment and articulation of these features poses an unsolved problem, which limits recent avatars such as McDonald et al. (2016) and Kipp et al. (2011).
To make avatars both easier to understand, and increase viewer acceptance, recent sign avatars rely on data collected from motion capture. One example of a motion capture driven avatar is the Sign3D project by MocapLab (Gibet et al. 2016). Given the richness of motion capture data, this approach provides highly realistic results, but is limited to a very small set of phrases, as collecting and annotating data is expensive, time consuming and requires expert knowledge. Although these avatars are better received by the Deaf community, they do not provide a scalable solution. The uncanny valley also still remains a large hurdle. To make synthetic signing more realistic, scalable and avoid the aforementioned problems of 3D avatars, we propose to directly generate sign video from weakly annotated data using the latest developments in machine translation, generative image models and Motion Graphs (MGs).

Motion Graphs
Motion Graphs (MGs) are used in computer graphics to dynamically animate characters, and can be formulated as a directed graph that is constructed from motion capture data. An MG allows new lifelike sequences to be generated that satisfy specific goals at runtime. MGs were independently introduced by Kovar et al. (2002), Arikan and Forsyth (2002), and Lee et al. (2002). Kovar et al. (2002) define the distance between two frames by calculating the distance between two point clouds. For creating the transitions themselves, the motions are aligned and positions are linearly interpolated between joint rotations. As a search strategy, branch and bound is used. Arikan and Forsyth (2002) use the difference between joint positions and velocities and the difference between torso velocities and accelerations, to define how close or distant two frames are. A smoothing function is applied to the discontinuity between two clips. The graph is searched by first summarizing it and then performing a random search over the summaries. Lee et al. (2002) chose a two layer approach to represent motion data. In the lower layer all data is modelled as a first-order Markov process, where the Markov process is represented by a matrix holding the transition probabilities between frames. The probabilities are derived from measuring the distances of weighted joint angles and velocities. Transitions of low probability are pruned. For blending transitions a hierarchical motion fitting algorithm is used (Lee and Shin 1999). The higher layer generalises the motion preserved in the lower layer by performing cluster analysis, to make it easier to search. Each cluster represents similar motion, but to capture connections between frames a cluster tree is formed at each motion frame. The whole higher layer is called a cluster forest.
We build an MG for sign language pose data, by splitting continuous sign sequences into individual glosses, and grouping all motion sequences by gloss. These motion sequences populate the nodes of our MG. We then use the probabilities provided by our NMT decoder at each time step to transition between nodes.

Conditional Image Generation
With the advancements in deep learning, the field of image generation has seen various approaches utilising neural-network based architectures. Chen and Koltun (2017) used CNN based cascaded refinement networks to produce photographic images given semantic label maps. Similarly, van den Oord et al. (2016) developed PixelCNN, which produces images conditioned on a vector, that can be image tags or feature embeddings provided by another network. Gregor et al. (2015) and van den Oord et al. (2016) also explored the use of RNNs for image generation and completion. All these approaches rely on rich semantic or spatial information as input, such as semantic label maps, or they suffer from being blurry and spatially incoherent.
Since the advent of GANs (Goodfellow et al. 2014), they have been used extensively for the task of image generation. Soon after their emergence, Mirza and Osindero (2014) developed a conditional GAN model, by feeding the conditional information to both the Generator and Discriminator. Radford et al. (2015) proposed Deep Convolutional GAN (DCGAN) which combines the general architecture of a conditional GAN with a set of architectural constraints, such as replacing deterministic spatial pooling with strided convolutions. These changes made the system more stable to train and well-suited for the task of generating realistic and spatially coherent images. Many conditional image generation models have been built by extending the DCGAN model. Notably Reed et al. (2016) built a system to generate images of birds that are conditioned on positional information and text description, using text embedding and binary pose heat maps.
An alternative to GAN-based image generation models is provided by Variational Auto-Encoders (VAEs) (Kingma and Welling 2013). Similar to classical auto-encoders, VAEs consist of two networks, an encoder and a decoder. However, VAEs constrain the encoding network to follow a unit Gaussian distribution. Yan et al. (2016) developed a conditional VAE that is capable of generating spatially coherent, but blurry images, a tendency of most VAE-based approaches.
Recent work has looked at combining GANs and VAEs to create robust and versatile image generation models. Makhzani et al. (2016) introduced Adversarial Auto-encoders and applied them to problems in supervised, semi-supervised and unsupervised learning. Perarnau et al. (2016) proposed GANs that use an encoder to learn a latent representation of an input image and a feature vector to change the attributes of human faces.
VAE/GAN hybrid models have proven particularly popular for generating images conditioned on human pose, as done by Ma et al. (2017) and Siarohin et al. (2018). Ma et al. synthesize images of people in arbitrary poses in a two-stage process by fusing an input image of a person for appearance and a heat map providing pose information into a new image in one network, before refining it in a second network. Siarohin et al. use a similar method, but additionally use affine transformations to help change the position of body parts.
Fig. 4 Full system overview. A spoken language sentence is translated into a representative skeletal pose sequence. This sequence is fed into the generative network frame by frame, in order to generate the input sentence's sign language translation
In the sub-field of image-to-image translation, Isola et al. (2017) introduced pix2pix, a conditional GAN which, given its information-rich input and avoidance of fully connected layers, was also among the first contenders for generating high definition image content. Building on the success of pix2pix and the architecture proposed by Johnson et al. (2016), Wang et al. (2018b) recently presented pix2pixHD, a network capable of producing 2048 × 1024 images from semantic label maps using a generator and a multi-scale discriminator architecture. The global generator consists of a convolutional encoder, a set of residual blocks and a convolutional decoder. In addition, a local enhancer network of similar architecture provides high resolution images from semantic label maps. Three discriminators are used at different scales to differentiate real from generated images.
For our work, we follow two strands of conditional image generation techniques: We build a multi-person sign generation network conditioned on human appearance and pose, similar to the works of Ma et al. (2017) and Siarohin et al. (2018). In addition we also investigate single-signer HD sign generation by building on the work of Wang et al. (2018b).

Text to Sign Language Translation
Our text-to-sign-language (text2sign) translation system consists of two stages. First, we train an NMT network to obtain a sequence of gloss probabilities that is used to solve a Motion Graph (MG), which generates human pose sequences (text2pose in Fig. 4). Then a pose-conditioned sign generation network with an encoder-decoder-discriminator architecture produces the output sign video (see pose2video in Fig. 4). We will now discuss each part of our system in detail.

Text to Pose Translation
We employ recent RNN based machine translation methods, namely attention based NMT approaches, to realize spoken language sentence to sign language gloss sequence translation. We use an encoder-decoder architecture (Sutskever et al. 2014) with Luong attention (Luong et al. 2015) (see Fig. 5). Given a spoken language sentence, $S_N = \{w_1, w_2, \ldots, w_N\}$, with $N$ words, our encoder maps the sequence into a latent representation as in:
$$o_{1:N}, h^e_N = \mathrm{Encoder}(S_N) \tag{1}$$
where $o_{1:N}$ is the output of the encoder for each word $w$, and $h^e_N$ is the hidden representation of the encoded sentence. In Fig. 5, the encoder is depicted in blue. This hidden representation and the encoder outputs are then passed to the decoder, which utilises an attention mechanism and generates a probability distribution over glosses:
$$p(g_t) = \mathrm{Decoder}(g_{t-1}, \alpha(o_{1:N}), h^d_{t-1}) \tag{2}$$
where $\alpha(\cdot)$ is the attention function, $g_t$ is the gloss produced at time step $t$ and $h^d_{t-1}$ is the hidden state of the decoder passed from the previous time step. At the beginning of the decoding, i.e. $t = 1$, $h^d_{t-1}$ is set as the encoded representation of the input sentence, i.e. $h^d_0 = h^e_N$.
The reason we utilize an attention based approach instead of a vanilla sequence-to-sequence based architecture is to tackle the long term dependency issues by providing additional information to the decoder. To train our NMT network, we use cross entropy loss over the gloss probabilities at each time step.
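A minimal sketch of a single attention step may help make this concrete. It assumes the dot-product score variant of Luong et al. (2015); the function names and toy vectors are illustrative only and are not taken from the system described here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))

def luong_dot_attention(decoder_state, encoder_outputs):
    """Return (weights, context) for one decoder time step.

    decoder_state: decoder hidden state (list of floats).
    encoder_outputs: one output vector per source word.
    """
    scores = [dot(decoder_state, o) for o in encoder_outputs]
    weights = softmax(scores)
    dim = len(decoder_state)
    # Context vector: attention-weighted sum of the encoder outputs.
    context = [sum(w * o[d] for w, o in zip(weights, encoder_outputs))
               for d in range(dim)]
    return weights, context

# Toy example: three source words, 2-dimensional hidden states.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = luong_dot_attention(decoder_state, encoder_outputs)
```

In a full decoder the context vector would be combined with the decoder state before predicting the gloss distribution; that step is omitted here.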
We build a Motion Graph (MG) that allows a sequence of 2D skeletal poses to be generated for a given gloss sequence. An MG is a Markov process that can be used to generate new motion sequences that are representative of real motion but fulfil the objectives of the animator, e.g. getting from A to B using a specific style of motion. A standard formalisation of an MG is as a finite directed graph of motion primitives (Min and Chai 2012): $MG = (V, E)$, where node $v_i \in V$ in the graph corresponds to one or more sequences of motion (motion primitives) and a prior distribution function $p(x_i)$ over those motion primitives ($x_i$). Each motion primitive for a node is an example of the style of motion the node represents. It is therefore possible to have a variable number of motion primitives in a node, the minimum being one. An edge $e_{i,j} \in E$, which represents an allowable transition from node $v_i$ to $v_j$, stores a morphable function $Y_{i,j} = M(x_i, x_j)$ that enables blending between motions, and a probability distribution $p(x_j \mid x_i)$ over the motion primitives $x_j$ at node $v_j$, given a chosen motion primitive $x_i$ at node $v_i$. See Fig. 6 for a visualisation.
Fig. 6 The graph nodes $v_i$ and $v_j$ contain one or more motion primitives (depicted by skeletal pose maps) $x_i$ and $x_j$, and a prior distribution $p(x_i)$ and $p(x_j)$. We define the transition probability between nodes $v_i$ and $v_j$ as the probability of motion primitive $x_j$ given $x_i$. $Y_{i,j}$ smooths between motion primitives
The motion primitives need to be extracted from a larger set of motion capture data. This can be done by identifying key frames in the motion data that are at the transition points between motions, e.g. the left foot impacting the floor for walking sequences. These key-frames are then used to cut the data up into a larger set of motion primitives $x_i$, where each motion primitive is a continuous motion between two key-frames. For more complex datasets of motion, a typical approach is to define a distance metric between skeletal poses which can be used to identify possible transition points as those that fall below a given threshold. The threshold is set small enough that interpolation between two poses will not cause visual disturbance in the fluidity of motion. For our application, we use the gloss boundaries to automatically cut the pose sequences into individual signs, so $|V|$ is equal to the gloss vocabulary size and $x_g$ contains examples of sign gloss $g$.
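The grouping of gloss-segmented clips into graph nodes can be sketched as follows. The annotation format (gloss, start, end), the toy gloss labels and the uniform prior are assumptions made for this example; frames are represented by stand-in integers rather than 2D skeletons.

```python
from collections import defaultdict

def build_nodes(sequences):
    """Group gloss-segmented pose clips into motion graph nodes.

    sequences: iterable of (poses, boundaries) pairs, where poses is a list
    of per-frame poses and boundaries is a list of (gloss, start, end)
    triples with end exclusive (hypothetical annotation format).
    Returns a dict mapping each gloss to its list of motion primitives.
    """
    nodes = defaultdict(list)
    for poses, boundaries in sequences:
        for gloss, start, end in boundaries:
            clip = poses[start:end]
            if clip:  # skip empty segments caused by faulty boundaries
                nodes[gloss].append(clip)
    return dict(nodes)

def uniform_prior(primitives):
    """Prior over the primitives of one node (uniform, as an assumption)."""
    return [1.0 / len(primitives)] * len(primitives)

# Toy data: two continuous sequences with hypothetical gloss boundaries.
sequences = [
    (list(range(10)), [("REGEN", 0, 4), ("MORGEN", 4, 10)]),
    (list(range(100, 106)), [("REGEN", 0, 6)]),
]
nodes = build_nodes(sequences)
prior = uniform_prior(nodes["REGEN"])
```

With this layout the number of nodes equals the number of distinct glosses, and each node holds one or more example clips.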
In a graphics context $E$ is learned directly from the data by looking at the transitions between nodes in the graph present in the original data. However, in our case, $E$ is generated at each time step by the decoder network, given the previously generated glosses and encoded sequence, as in:
$$p(x_t \mid x_{t-1}) = \mathrm{Decoder}(g_{t-1}, \alpha(o_{1:N}), h^d_{t-1}) \tag{3}$$
The purpose of $Y_{i,j}$ is to allow smooth transition between different motion primitives. In our case it is constant for all nodes in the graph. We use a Savitzky-Golay filter (Savitzky and Golay 1964) to create smooth transitions. This is done dynamically as the graph is searched. The Savitzky-Golay filter smooths between motion primitives by fitting a low-order polynomial to adjacent data points. We use a window size of five and a polynomial order of two to smooth between the last five frames of the current motion primitive and the first five frames of the upcoming primitive. This allows us to preserve the articulation of each motion primitive, but avoid discontinuities and artefacts at transition points.
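The transition smoothing can be illustrated with a short sketch for a single joint coordinate. It uses the standard Savitzky-Golay convolution coefficients for a five-point window and a second-order polynomial; the function name and the handling of frames without a full window are assumptions of this example.

```python
# Savitzky-Golay smoothing coefficients for window 5, polynomial order 2.
SG_COEFFS = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]

def smooth_transition(current, upcoming, span=5):
    """Join two 1D joint trajectories and smooth around the junction.

    Only the last `span` frames of `current` and the first `span` frames of
    `upcoming` are filtered. Frames without a full 5-frame window are left
    unchanged (an assumption of this sketch).
    """
    joined = list(current) + list(upcoming)
    out = list(joined)
    cut = len(current)
    lo = max(cut - span, 2)
    hi = min(cut + span, len(joined) - 2)
    for i in range(lo, hi):
        window = joined[i - 2:i + 3]
        out[i] = sum(c * v for c, v in zip(SG_COEFFS, window))
    return out

# A step discontinuity between two clips is turned into a gradual ramp.
step = smooth_transition([0.0] * 8, [1.0] * 8)

# A quadratic trajectory is reproduced exactly, since the filter fits
# second-order polynomials.
quad = [float(i * i) for i in range(16)]
quad_smoothed = smooth_transition(quad[:8], quad[8:])
```

Because frames far from the junction are untouched, the articulation inside each clip is preserved while the jump at the boundary is reduced.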
To find the most probable motion sequence given a spoken language sentence, we employ beam search over our motion graph. We start generating our sequence from the special $x_0 = \langle bos \rangle$ (beginning of sequence) node. At each motion step, we consider a list of hypotheses $H_b$, and we expand each hypothesis with a new motion as in:
$$H^t_b = \{H^{t-1}_b, x^*_t\} \tag{4}$$
where $H^t_b$ denotes the set of motions in $H_b$ at step $t$. We choose $x^*_t$ by:
$$x^*_t = \arg\max_{x_t} p(x_t \mid x_{t-1}) \tag{5}$$
We continue expanding our hypotheses until all of them reach the special $\langle eos \rangle$ (end of sequence) node. We then choose the most probable motion sequence $H^*$ by:
$$H^* = \arg\max_{H_b} \prod_{t} p(x_t \mid x_{t-1}) \tag{6}$$
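The search procedure can be sketched as a generic beam search. Here `next_probs` is a hypothetical stand-in for the decoder output at each step, and the beam width, step limit and toy transition table are assumptions chosen for illustration.

```python
import math

def beam_search(next_probs, beam_width=3, max_steps=20,
                bos="<bos>", eos="<eos>"):
    """Return the most probable node sequence under `next_probs`.

    next_probs(history) -> dict mapping each candidate next node to its
    probability, given the nodes chosen so far.
    """
    beams = [([bos], 0.0)]  # (history, accumulated log-probability)
    for _ in range(max_steps):
        if all(history[-1] == eos for history, _ in beams):
            break
        candidates = []
        for history, score in beams:
            if history[-1] == eos:
                candidates.append((history, score))  # keep finished beams
                continue
            for node, p in next_probs(history).items():
                if p > 0.0:
                    candidates.append((history + [node],
                                       score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return max(beams, key=lambda c: c[1])[0]

# Toy transition table keyed on the last node (illustrative values only).
TABLE = {
    "<bos>": {"MORGEN": 0.6, "REGEN": 0.4},
    "MORGEN": {"REGEN": 0.7, "<eos>": 0.3},
    "REGEN": {"<eos>": 0.9, "MORGEN": 0.1},
}
best = beam_search(lambda history: TABLE[history[-1]])
```

Scores are accumulated as log-probabilities so that long sequences do not underflow; the sequence with the highest total is returned once every hypothesis has reached the end node.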

Pose to Video Translation
The pose-to-video (pose2video) network combines a convolutional image encoder and a Generative Adversarial Network (GAN), see Fig. 7 for an overview. A GAN consists of two models that are trained in conjunction: A generator G that creates new data instances, and a discriminator D that evaluates whether these belong to the same data distribution as the training data. During training, G aims to maximise the likelihood of D falsely predicting a sample generated by G to be part of the training data, while D tries to correctly identify samples to be either fake or real. Using this min-max game setup, the generator learns to produce more and more realistic samples, ideally to the point where D cannot separate them from the ground truth. G is an encoder-decoder, conditioned on human pose and appearance. The latent space can either be a fixed-size one-dimensional vector, or a variable-size residual block. A fixed size 1D vector latent space using a fully connected layer allows generation of images with both large appearance and spatial change and is employed for multi-signer (MS) output. However, the ability to generate spatial change, and the requirement for fully connected layers increases memory consumption, and limits the output size of the generated images.
Fig. 7 Our sign generator G has an encoder-decoder structure. It can be conditioned on human pose and appearance. For this we use a human pose label map $P_t$ and a frame of the signer $I_a$. The latent space Res for the HD generator is a block of residual layers, whereas the latent space FC for the MS generator is encoded in a 1-dimensional vector using a fully connected layer. We employ two losses: an adversarial loss using the discriminator D, and an L1 loss. For the MS case we take a pixel-based L1 loss, whereas for the HD case we match extracted features from multiple layers of D and calculate the L1 distance between features
In contrast, a fully convolutional latent space, such as a number of residual layers, allows for changes in appearance, like changing from a pose label map to an image of a human being in that pose, but does not allow for large spatial changes. This enables the network to transfer style similar to pix2pixHD by Wang et al. (2018b) or Chan et al. (2018). However, due to the avoidance of fully connected layers and with the use of an additional enhancing network, it is capable of producing sharp high definition outputs. We investigate this second formulation for generating high-definition (HD) sign video.

Image Generator
As input to the generator we concatenate $P_t$ and $I_a$ as separate channels, where $P_t$ is a human pose label map. For MS generation $I_a$ is an image of an arbitrary human subject in a resting position (base pose). The HD sign generator cannot be conditioned on a base pose, as it does not allow for large spatial changes. Instead it is conditioned on the generated image from the previous time step. On top of helping with appearance this enforces temporal consistency.
The input to the generator is pushed through the convolutional encoder part of the generator and encoded into the latent space. The decoder part of the generator uses upconvolution and resize-convolution to decode from the latent space back into an image using the embedded skeletal information provided by the label map $P_t$. This produces an image $G(P_t, I_a)$ of the signer in the pose indicated by $P_t$ (see Fig. 7).
In the HD sign variant, an enhancer network En is used to upscale and refine the output images produced by the generator G. Its architecture is very similar to G, consisting of a convolutional encoder, a residual block and an up-convolutional decoder. G is first trained individually, followed by En, before training both networks in conjunction.

Discriminator
The discriminator D receives as input either the generated synthetic image $G(P_t, I_a)$ or the ground truth $I_t$, together with the pose label map $P_t$. In the MS case, D is also provided with $I_a$ (see Fig. 7). D decides on the image's authenticity. In the MS case, given that the system is trained on multiple signers, $I_a$ is used to establish whether the generated image resembles the desired signer. The skeletal information provided by $P_t$ is used to assess if the generated image has the desired joint configuration. For the HD sign case, like Wang et al. (2018b) we use a multi-scale discriminator with three scales (in our case 1080 × 720, 540 × 360, and 270 × 180).
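The three discriminator scales differ by a factor of two in each dimension. A minimal sketch of building such an image pyramid is given below; it assumes downsampling by 2×2 average pooling on a single-channel image stored as nested lists, and both function names are illustrative.

```python
def downsample2x(image):
    """Halve height and width by averaging non-overlapping 2x2 blocks."""
    h, w = len(image), len(image[0])
    return [[(image[y][x] + image[y][x + 1] +
              image[y + 1][x] + image[y + 1][x + 1]) / 4.0
             for x in range(0, w - 1, 2)]
            for y in range(0, h - 1, 2)]

def pyramid(image, scales=3):
    """Return the image at `scales` resolutions, each half the previous."""
    levels = [image]
    for _ in range(scales - 1):
        levels.append(downsample2x(levels[-1]))
    return levels

# Toy single-channel image with 8 rows and 12 columns.
image = [[float(x + y) for x in range(12)] for y in range(8)]
levels = pyramid(image)
sizes = [(len(level), len(level[0])) for level in levels]
```

Each level would be passed to its own discriminator, so that coarse levels judge global structure and fine levels judge detail.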

Loss
We use the GAN's adversarial loss, as well as an L1 loss between generated and ground truth images, to train our networks. See Fig. 7 for a visualisation. The overall loss is therefore defined as:
$$L = L_{GAN} + \delta L_1 \tag{7}$$
where $\delta$ weighs the influence of $L_1$. For MS generation we give $I_a$ to the generator and the discriminator to distinguish between signers. The adversarial loss is thus defined as:
$$L_{GAN}(G, D) = \mathbb{E}[\log D(P_t, I_t, I_a)] + \mathbb{E}[\log(1 - D(P_t, G(P_t, I_a), I_a))] \tag{8}$$
The MS L1 loss is defined as the sum of absolute pixel difference between ground truth and generated image:
$$L_1(G) = \sum |I_t - G(P_t, I_a)| \tag{9}$$
For HD generation the adversarial loss is defined as:
$$L_{GAN}(G, D_k) = \mathbb{E}[\log D_k(P_t, I_t)] + \mathbb{E}[\log(1 - D_k(P_t, G(P_t, I_a)))] \tag{10}$$
where $k$ indexes the discriminator scales. To combine the adversarial losses of all $D_k$, we sum:
$$L_{GAN}(G, D) = \sum_{k} L_{GAN}(G, D_k) \tag{11}$$
For HD generation the L1 loss is based on the feature matching loss presented in Wang et al. (2018b). Features extracted from multiple stages of the discriminator are matched, rather than pixels:
$$L_1(G, D_k) = \mathbb{E} \sum_{i=1}^{T} \frac{1}{N_i} \left\| D^{(i)}_k(P_t, I_t) - D^{(i)}_k(P_t, G(P_t, I_a)) \right\|_1 \tag{12}$$
where $T$ is the total number of layers in $D_k$, $i$ is the current layer of $D_k$, $N_i$ is the total number of elements per layer, and $D^{(i)}_k$ is the $i$th layer feature extractor of $D_k$. Again we sum the L1 losses of all $D_k$ to obtain the overall L1 loss:
$$L_1(G, D) = \sum_{k} L_1(G, D_k) \tag{13}$$
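The loss terms can be illustrated numerically with a small sketch. It operates on plain lists of discriminator outputs, flattened images and per-layer feature vectors; all function names, the epsilon guard and the toy values are assumptions for this example and do not reflect the training code.

```python
import math

def adversarial_loss(d_real, d_fake, eps=1e-12):
    """Batch mean of log D(real) + log(1 - D(fake)).

    d_real, d_fake: discriminator outputs in [0, 1], one per sample.
    """
    n = len(d_real)
    return sum(math.log(r + eps) + math.log(1.0 - f + eps)
               for r, f in zip(d_real, d_fake)) / n

def l1_pixel_loss(target, generated):
    """Sum of absolute pixel differences between two flattened images."""
    return sum(abs(t - g) for t, g in zip(target, generated))

def feature_matching_loss(real_feats, fake_feats):
    """Sum over layers of the L1 feature distance divided by layer size."""
    total = 0.0
    for real, fake in zip(real_feats, fake_feats):
        total += sum(abs(a - b) for a, b in zip(real, fake)) / len(real)
    return total

def total_loss(adversarial, l1, delta):
    """Adversarial term plus the L1 term weighted by delta."""
    return adversarial + delta * l1

# Toy values: one image of three pixels, two feature layers.
pixel_l1 = l1_pixel_loss([0.0, 1.0, 1.0], [0.5, 1.0, 0.0])
feature_l1 = feature_matching_loss([[1.0, 2.0], [0.0, 0.0, 0.0, 3.0]],
                                   [[1.0, 0.0], [0.0, 0.0, 0.0, 1.0]])
adv = adversarial_loss([0.9, 0.8], [0.2, 0.1])
combined = total_loss(adv, pixel_l1, delta=10.0)
```

For the multi-scale case the adversarial and feature matching terms would be evaluated once per discriminator and summed.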

Experiments
We first introduce the datasets used and any necessary pre-processing steps, before evaluating all sub-parts of our system both quantitatively and qualitatively. We show results for translating spoken language text to gloss sequences and pose sequences, and for generating multi-signer (MS) and high-definition (HD) sign video, using broadcast quality assessment metrics. A set of qualitative examples showcases the current state of the full preliminary translation pipeline.

Datasets
In order to realise spoken language to sign video generation, we require a large scale dataset, which provides sign language sequences and their spoken language translations. Although there are vast quantities of broadcast data available and many linguistically annotated datasets, they lack spoken language sentence to sign sequence (i.e. topic-comment) alignment. However, recently Camgoz et al. (2018) released RWTH-PHOENIX-Weather 2014T (PHOENIX14T), which is the extended version of the continuous sign language recognition benchmark dataset PHOENIX-2014 (Forster et al. 2014). PHOENIX14T consists of German Sign Language (DGS) interpretations of weather broadcasts. It contains 8257 sequences being performed by 9 signers. It has a sign gloss and spoken language vocabulary of 1066 and 2887, respectively. Each sequence is annotated with both the sign glosses and spoken language translations.
We trained our spoken language to sign pose network using PHOENIX14T. However, due to the limited number of signers in the dataset, we utilised another large scale dataset to train the multi-signer (MS) generation network, namely the SMILE Sign Language Assessment Dataset (Ebling et al. 2018). The SMILE dataset contains 42 signers performing 100 isolated signs for three repetitions in Swiss German Sign Language (DSGS). Although the SMILE dataset is multiview, we only used the Kinect colour stream, without any depth information or the Kinect's built-in pose estimations.
We trained the HD sign generation network on 1280×720 HD dissemination material acquired by the Learning to Recognise Dynamic Visual Content from Broadcast Footage (Dynavis) project (Bowden et al. 2016). It consists of multiple videos featuring the same subject performing continuous British Sign Language (BSL) sequences. There is no alignment between spoken language sentences and sign sequences.
Using multiple datasets is motivated by the fact that there is no single dataset that provides text-to-sign translations, a broad range of signers of different appearance, and high definition signing content. Using datasets from different subject domains and languages demonstrates the robustness and flexibility of our method, as it allows us to transfer knowledge between specialised datasets. This makes the approach suitable for translating between different spoken and signed languages, as well as other problems, such as text-conditioned image and video generation.

Data Pre-Processing
In order to perform translation from spoken language to sign pose, we need to find pose sequences that represent the appropriate glosses. We split the continuous samples of the PHOENIX14T dataset by gloss using a forced alignment approach. Then, for each gloss we perform a normalisation over all example sequences containing that gloss. First, we have to relate the different body shapes and sizes of all signers to that of a selected target subject. Additionally we have to time-align all example sequences, before we can find an average representation for each frame of the sequence. To align different signers' skeletons to that of a target subject, we use OpenPose (Cao et al. 2017) to extract upper body key points for each frame in the sequence and for a reference frame of the target subject. We align the skeletons at the neck joint and scale by the shoulder width. We use dynamic time warping to time align sequences, before taking the mean of each joint per frame over all example sequences to generate a representative mean sequence. These mean sequences form the nodes of our MG. We decided to use mean sequences rather than raw example sequences, as they provide a more stable performance. We found that corruptions in the gloss boundary information obtained by forced alignment produced an immense variability in quality and correctness for the samples per node in the graph. The supplementary material contains example comparisons between a motion graph built using mean sequences versus the raw data.
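The normalisation steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the joint indices assume the OpenPose COCO/BODY_25 ordering, one scale factor per sequence is assumed, and using the first example as the reference timeline for dynamic time warping is an assumption, as are all function names.

```python
import numpy as np

# Joint indices follow the OpenPose COCO/BODY_25 ordering
# (1 = neck, 2 = right shoulder, 5 = left shoulder).
NECK, R_SHOULDER, L_SHOULDER = 1, 2, 5


def normalise_skeleton(frames, ref_frame):
    """Align a (T, J, 2) key point sequence to a (J, 2) target-subject frame."""
    frames = np.asarray(frames, dtype=float)
    ref_frame = np.asarray(ref_frame, dtype=float)
    ref_width = np.linalg.norm(ref_frame[R_SHOULDER] - ref_frame[L_SHOULDER])
    # Assumption: one scale per sequence, taken from the mean shoulder width.
    widths = np.linalg.norm(frames[:, R_SHOULDER] - frames[:, L_SHOULDER], axis=1)
    scale = ref_width / widths.mean()
    # Move each neck to the origin, rescale, then move to the reference neck.
    centred = frames - frames[:, NECK:NECK + 1]
    return centred * scale + ref_frame[NECK]


def dtw_path(a, b):
    """Classic dynamic time warping; returns aligned (i, j) frame index pairs."""
    ta, tb = len(a), len(b)
    cost = np.full((ta + 1, tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            dist = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = dist + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack from the end of both sequences to the start.
    path, i, j = [], ta, tb
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def warp_to_reference(example, reference):
    """Resample `example` onto the timeline of `reference` using the DTW path."""
    warped = np.zeros_like(reference, dtype=float)
    counts = np.zeros(len(reference))
    for i, j in dtw_path(reference, example):
        warped[i] += example[j]
        counts[i] += 1
    return warped / counts[:, None, None]


def mean_sequence(examples, ref_frame):
    """Mean pose sequence for one gloss from several example sequences."""
    normed = [normalise_skeleton(e, ref_frame) for e in examples]
    reference = normed[0]  # assumption: the first example sets the timeline
    warped = [warp_to_reference(e, reference) for e in normed]
    return np.mean(warped, axis=0)
```

Calling `mean_sequence(examples, ref_frame)` with a list of `(T, J, 2)` arrays for one gloss returns a single `(T, J, 2)` array on the timeline of the first example, which would then serve as one node of the MG.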

German to Pose Translation
We provide results for translating German sentences into their intermediate gloss representation, and show how this, combined with an MG, can be used to generate human pose sequences from spoken language sentences.
As described in Sect. 3.1, we utilised an encoder-decoder NMT architecture for spoken language to sign gloss translation. Both our encoder and decoder networks have 4 layers with 1000 Gated Recurrent Units (GRUs) each. As an attention mechanism we use Luong et al.'s approach as it utilises both encoder and decoder outputs during context vector calculation. We trained our network using Adam optimisation with a learning rate of $10^{-5}$ for 30 epochs. We also employed dropout with 0.2 probability on GRUs to regularise training. During inference the width B of the beam decoder is set to three, meaning the three top hypotheses are kept per time step. We found this number to be a good trade-off between translation quality and computational complexity. For text2pose generation we report an average time of 0.79 s per translated gloss using an Intel® Core™ i7-6700 CPU (3.40 GHz, 8MB cache), where the majority of time is taken up by generating the pose maps (0.77 s/gloss).
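To illustrate what a beam width of three means in practice, the following is a generic beam search sketch. It is not the decoder used here: `step_log_probs` is a hypothetical stand-in for the trained NMT decoder, returning log-probabilities for the next gloss given the glosses decoded so far, and the start and end tokens are placeholders.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[List[str], float]  # (tokens so far, summed log-probability)


def beam_search(
    step_log_probs: Callable[[List[str]], Dict[str, float]],
    bos: str = "<s>",
    eos: str = "</s>",
    beam_width: int = 3,
    max_len: int = 30,
) -> List[str]:
    """Keep the `beam_width` best partial hypotheses at every time step."""
    beams: List[Hypothesis] = [([bos], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Expand every open hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for tokens, score in beams:
            for token, log_prob in step_log_probs(tokens).items():
                candidates.append((tokens + [token], score + log_prob))
        candidates.sort(key=lambda c: c[1], reverse=True)
        # Prune to the beam width; hypotheses ending in `eos` are complete.
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off by max_len
    if not finished:
        return [bos]
    return max(finished, key=lambda c: c[1])[0]


def toy_decoder(tokens: List[str]) -> Dict[str, float]:
    """Hypothetical next-gloss distribution used only for illustration."""
    table = {
        "<s>": {"A": math.log(0.6), "B": math.log(0.4)},
        "A": {"C": math.log(0.5), "D": math.log(0.5)},
        "B": {"E": math.log(0.9), "F": math.log(0.1)},
    }
    return table.get(tokens[-1], {"</s>": 0.0})
```

With the toy decoder, a width of one commits to the locally best first token `A`, while a width of three keeps `B` alive long enough to find the globally more probable sequence. Wider beams trade extra computation for this kind of recovery.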

Translating German to Gloss
To measure the translation performance of our approach we used BLEU and ROUGE (ROUGE-L F1) scores as well as Word Error Rate (WER), which are amongst the most popular metrics in the machine translation domain. We measure the BLEU scores on different n-gram granularities, namely BLEU-1, 2, 3 and 4, to give readers a better perspective of the translation performance.
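For readers unfamiliar with these metrics, the sketch below shows WER and the clipped n-gram precision that BLEU is built from. It is illustrative only: the function names are hypothetical, whitespace tokenisation is assumed, and full BLEU additionally combines the precisions for n = 1 to 4 with a brevity penalty, which is omitted here.

```python
from collections import Counter


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    # Single-row dynamic programme for the Levenshtein distance.
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + (ref_tok != hyp_tok)  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def ngram_precision(reference: str, hypothesis: str, n: int) -> float:
    """Clipped n-gram precision, the building block of BLEU-n."""
    ref, hyp = reference.split(), hypothesis.split()
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    hyp_counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs
    # in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total
```

For example, comparing the gloss sequence `NORD WIND MAESSIG` with the prediction `NORD MAESSIG` gives a WER of one third (one deletion over three reference glosses).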
We compare our Text2Gloss performance against the Gloss2Text network of Camgoz et al. (2018), which addresses the opposite task of translating sign glosses to spoken language sequences. We do this because, to our knowledge, there is no other text-to-gloss translation approach for a direct comparison. We aim to give the reader context, rather than claiming to supersede the Gloss2Text approach. Our results, as seen in Table 1, show that Text2Gloss performs comparably with the Gloss2Text network. While Gloss2Text achieves a higher BLEU-4 score, our Text2Gloss surpasses its performance on BLEU scores with smaller n-grams and on ROUGE scores. We believe this is due to the shorter length of sign gloss sequences and their smaller vocabulary. The challenge is further exacerbated by the fact that sign languages employ a spatio-temporal grammar which is challenging to represent in text.
We also provide qualitative results by examining sample Text2Gloss translations (see Table 2). Our experiments indicate that the network is able to produce gloss sequences from text that are close to the gloss ground truth. Even when the predicted gloss sequence does not exactly match the ground truth, the network chooses glosses that are close in meaning.
After reporting these promising intermediate results, we will now show how this approach can be extended to generate human pose maps that encode the motion of signs.

Translating German to Pose
We give a qualitative evaluation of translating German sentences into human pose sequences by solving an MG using the NMT's beam search. Figure 8 shows two examples. In both cases we show key frames that are indicative of the translated glosses. It is interesting to note that both sequences contain the gloss WIND, twice in the top sequence and once in the bottom sequence. The relevant key frames for each occurrence (key frames 2 and 6 for the top sequence, key frame 4 for the bottom sequence) are very similar, showing the conditioning of poses on a specific gloss.
The poses are encoded as 128 × 128 × 10 binary label maps, where each joint inhabits one of the 10 depth channels. This type of map is used to generate sign language video in Sects. 4.3 and 4.4.
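A minimal sketch of such an encoding is given below. The function name and the disc radius used to mark each joint are assumptions, since the text does not state how large a joint is drawn; joints that were not detected are assumed to be passed as NaN and leave their channel empty.

```python
import numpy as np


def pose_to_label_map(joints_xy, size=128, radius=2):
    """Encode joints as a (size, size, n_joints) binary label map.

    joints_xy: sequence of (x, y) pixel coordinates, one per joint.
    Each joint is drawn as a small disc in its own channel.
    """
    joints_xy = np.asarray(joints_xy, dtype=float)
    label_map = np.zeros((size, size, len(joints_xy)), dtype=np.uint8)
    ys, xs = np.mgrid[0:size, 0:size]  # row (y) and column (x) index grids
    for channel, (x, y) in enumerate(joints_xy):
        if np.isnan(x) or np.isnan(y):
            continue  # undetected joint: keep this channel all zero
        disc = (xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2
        label_map[..., channel] = disc
    return label_map
```

Passing ten joints yields the 128 × 128 × 10 map described above; the same function would produce more channels if more joints were supplied.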

Multi-Signer Generation of Isolated Signs
This section presents results using the generated label maps to condition a GAN that generates sign video for multi-signer (MS) video generation. We test using isolated signs from the SMILE dataset. When testing on a GeForce GTX TITAN X we report an average time of 1.71 s per generated image. We generate synthetic sign video from previously unseen label data. To evaluate the quality of the generated output, we use the Structural Similarity Index Measurement (SSIM) (Wang et al. 2004), Peak Signal-to-Noise Ratio (PSNR), and Mean Squared Error (MSE), three well-known metrics for assessing image quality.
SSIM is a metric used to assess the perceptual degradation of images and video in broadcast, by comparing a corrupted image to its original. We adapt this approach to compare the generated synthetic image $G(P_t, I_a)$ to its ground truth image $I_t$.
For ease of notation we define $\hat{I}_t = G(P_t, I_a)$. SSIM is then given by
$$\mathrm{SSIM}(\hat{I}_t, I_t) = \left[l(\hat{I}_t, I_t)\right]^{\alpha} \cdot \left[c(\hat{I}_t, I_t)\right]^{\beta} \cdot \left[s(\hat{I}_t, I_t)\right]^{\gamma}$$
where $l(\hat{I}_t, I_t)$ is a luminance term:
$$l(\hat{I}_t, I_t) = \frac{2\mu_{\hat{I}_t}\mu_{I_t} + C_1}{\mu_{\hat{I}_t}^2 + \mu_{I_t}^2 + C_1}$$
$c(\hat{I}_t, I_t)$ is a contrast term:
$$c(\hat{I}_t, I_t) = \frac{2\sigma_{\hat{I}_t}\sigma_{I_t} + C_2}{\sigma_{\hat{I}_t}^2 + \sigma_{I_t}^2 + C_2}$$
and $s(\hat{I}_t, I_t)$ is a structural term:
$$s(\hat{I}_t, I_t) = \frac{\sigma_{\hat{I}_t I_t} + C_3}{\sigma_{\hat{I}_t}\sigma_{I_t} + C_3}$$
with $\mu_{\hat{I}_t}$ and $\mu_{I_t}$ being the means, $\sigma_{\hat{I}_t}$ and $\sigma_{I_t}$ the standard deviations, and $\sigma_{\hat{I}_t I_t}$ the cross-covariance for images $\hat{I}_t$ and $I_t$. $C_1 = (k_1 L)^2$ and $C_2 = (k_2 L)^2$, where $L$ is the dynamic range of pixel values, and $k_1 = 0.01$ and $k_2 = 0.03$. $C_3$ is set to equal $C_2/2$.
With default values of $\alpha = \beta = \gamma = 1$ the expression for SSIM simplifies to:
$$\mathrm{SSIM}(\hat{I}_t, I_t) = \frac{(2\mu_{\hat{I}_t}\mu_{I_t} + C_1)(2\sigma_{\hat{I}_t I_t} + C_2)}{(\mu_{\hat{I}_t}^2 + \mu_{I_t}^2 + C_1)(\sigma_{\hat{I}_t}^2 + \sigma_{I_t}^2 + C_2)}$$
The calculated SSIM ranges from −1 to 1, with 1 indicating the images are identical. PSNR and MSE are metrics used to assess the quality of compressed images compared to their original. We use MSE to calculate the average squared error between a synthetic image $\hat{I}_t$ and its ground truth image $I_t$, by:
$$\mathrm{MSE} = \frac{1}{NM} \sum_{n=1}^{N} \sum_{m=1}^{M} \left[\hat{I}_t(n, m) - I_t(n, m)\right]^2$$
where $N$ and $M$ are the number of columns and rows respectively. In contrast PSNR measures the peak error in dB, using the MSE:
$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{R^2}{\mathrm{MSE}}\right)$$
where $R$ is the maximum possible value of the input data, in this case 255 for 8-bit unsigned integers. The lower the MSE between two images, the more alike they are, whereas we want to maximise the PSNR between two images.
The MS generation network was trained on 40 different signers from the SMILE dataset over 90,000 iterations. Out of these signers, several signers were chosen, and the network fine-tuned for another 10,000 iterations on the appearance of those signers. The pose label maps were generated from running OpenPose on the full-size SMILE ground truth footage of 1920 × 1080 pixels, and then downsampled to 128 × 128 pixels. The original SMILE footage was then also downsampled to 128 × 128 pixels to function as input to our network.
We test for three different signers, over 1000 frames each. We report the mean SSIM, PSNR, and MSE (see Table 3).
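The three metrics, averaged over a set of frames, can be sketched as below. Two simplifications are assumed: SSIM is computed from global statistics of each whole image (standard implementations average SSIM over local windows), and images are single-channel arrays with values in 0 to 255. The function names are hypothetical.

```python
import numpy as np


def mse(generated, ground_truth):
    """Mean squared error between two images of equal shape."""
    # Cast to float first so that uint8 subtraction cannot wrap around.
    gen = np.asarray(generated, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    return float(np.mean((gen - gt) ** 2))


def psnr(generated, ground_truth, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(generated, ground_truth)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / err))


def ssim_global(generated, ground_truth, dynamic_range=255.0, k1=0.01, k2=0.03):
    """Simplified SSIM (alpha = beta = gamma = 1) using whole-image statistics."""
    gen = np.asarray(generated, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    mu_gen, mu_gt = gen.mean(), gt.mean()
    var_gen, var_gt = gen.var(), gt.var()
    cov = np.mean((gen - mu_gen) * (gt - mu_gt))
    numerator = (2 * mu_gen * mu_gt + c1) * (2 * cov + c2)
    denominator = (mu_gen ** 2 + mu_gt ** 2 + c1) * (var_gen + var_gt + c2)
    return float(numerator / denominator)


def mean_metrics(generated_frames, ground_truth_frames):
    """Average SSIM, PSNR and MSE over paired frames."""
    pairs = list(zip(generated_frames, ground_truth_frames))
    return {
        "ssim": float(np.mean([ssim_global(g, t) for g, t in pairs])),
        "psnr": float(np.mean([psnr(g, t) for g, t in pairs])),
        "mse": float(np.mean([mse(g, t) for g, t in pairs])),
    }
```

Note that a single pair of identical frames makes the mean PSNR infinite, so in practice such frames may need to be reported separately.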
The results indicate that the images produced of all three signers are very close to their ground truth, with SSIM values close to 1. Signer 1 has slightly worse scores than signers 2 and 3, which is due to a corrupted sequence in the gathered data.
Qualitative results in Figs. 9 and 10 show that the synthetic sequences generated by our network stay close to their ground truth in terms of both motion and appearance. Details for hands and faces are largely preserved; however, the network can struggle to form both arms and hands fully, especially when held in front of the chest and face. This is likely due to the similarity in colour, which also could have led to errors in the key point extraction process.
The results also highlight the power of our data-driven approach to capture natural variations in sign. Signer 2 is left-handed, whereas signers 1 and 3 are right-handed. There are also noticeable discrepancies in speed and size of motion amongst the signers. Linguistically, these are very important factors that can have a significant impact on the meaning of a sign. They convey additional information such as emotion and intent, for example haste, anger, or uncertainty.
Overall our experiments show that our MS generation network is capable of synthesizing sign language videos that are highly realistic and variable in terms of motion and appearance for multiple signers. The limiting factor to this approach is the low resolution of 128 × 128 pixels. We therefore investigate a different variant of our network to produce HD sign videos in Sect. 4.5.

Fig. 9
Synthetic productions of the gloss ANTWORT (ANSWER), for signer 1 (top), signer 2 (middle), and signer 3 (bottom). Every 5th frame is shown. We can see that all three generated sequences are very close to their ground truth. It is also interesting to note that with our data-driven approach it is easy to account for natural variations, such as left-handed versus right-handed signing

Spoken Language to Sign Language Translation
In this section we test the full translation pipeline: going from spoken language sentences to sign language video translations. We translate from German to German Sign Language (DGS). Our test data is taken from the PHOENIX14T test dataset. Our Motion Graph (MG) is built by extracting OpenPose skeletal information from the PHOENIX14T training set. The obtained OpenPose extraction was prone to errors due to the low resolution of the PHOENIX14T data (260 × 210 pixels). It did not scale to the 1280 × 720 resolution required for conditioning the HD generator. We therefore only test our full pipeline using the MS generator, as it is better aligned in scale with the PHOENIX14T data.
We depict results for four translations. For all cases the input for our translation is a German sentence. The resulting gloss and sign video translations are given in Figs. 11, 12, 13 and 14. The beam search over the MG provides the motion sequences that incorporate the translation from spoken language text. These pose sequences then condition the sign generation network. Transitions between sequences are added dynamically. We give representative frames for the generated sequences, indicating which glosses they belong to.
For sequence 1 in Fig. 11 the NMT network correctly translates to a German gloss sequence which corresponds to the ground truth. The overall motion of the arms and hands is consistent with the video ground truth. The signers' appearances are clearly distinguishable from one another. Signer 1 stays closest to the ground truth, having the most developed arms and hands. Signer 2 struggles to fully form the right arm at times; this might be due to the fact that this signer was a left-handed signer in the original dataset and therefore less right-handed motion was observed during training. Signer 3 has under-developed hands, something that is consistent across frames and sequences, indicating a failure in conditioning.
Sequence 2 (see Fig. 12) also correctly translates the original input sentence. On top of the observations made for sequence 1, we notice a failure case for the gloss WECHSELHAFT (CHANGING). The sign for this gloss in DGS is a repeated left to right motion of both arms in front of the body (see the last three frames of the ground truth in Fig. 12). In the generated sequence, the arms are in front of the body, but remain in the centre. We assume that this is due to a failure in the time alignment, whereby averaging key points of motion to the left and to the right results in hands positioned in the centre.

Fig. 10
Synthetic productions of the gloss ERKLAEREN (EXPLAIN), for signer 1 (top), signer 2 (middle), and signer 3 (bottom). Every 10th frame is shown. Generated sequences stay close to their ground truth throughout, however for signer 2 the network fails to form the right arm in frame 20. Additionally, this example shows two more forms of natural variation in sign language: speed and size of movement. These factors can have a significant impact, as they convey additional information such as emotion or intent

Fig. 11
Translation results for "In der Nacht an der See noch stuermische Boeen". (In the night still storms near the sea). The ground truth gloss and video are given in the top row. Below we see the gloss translation and synthetic video generated

Translation results for "Im Norden maessiger Wind an den Kuesten weht er teilweise frisch". (Mild winds in the north, at the coast it blows fresh in parts). The ground truth gloss and video are given in the top row. Below we see the gloss translation and synthetic video generated

Results for sequence 3 in Fig. 13 are in accordance with the first sequence. The sequence of glosses predicted matches the ground truth. Again signer 1 stays closest to the original motion sequence, even encapsulating the subtle difference in right hand position (not hand shape) between glosses GUT (GOOD) and LIEB (DEAR).
Sequence 4 is longer than previous examples, and contains one translation error (see Fig. 14). The positions of arms and hands are consistent with the ground truth for the first four glosses, before encountering the error in gloss prediction. The motion for the last gloss WIND (WIND) is slightly under-articulated in contrast to the ground truth.
Overall, the movement of signers is smooth and consistent with the glosses they represent, but not as expressive as the ground truth. We suspect that the limited motion stems from the averaging of all example sequences for a gloss to generate one mean sequence. To our knowledge the timing information for all glosses was automatically extracted from the PHOENIX14T data by the creators of the dataset using a Forced Alignment approach. It is therefore reasonable to assume that the provided timings contain errors, which negatively affect the mean sequence. Additionally, for most signs more than one variation exists, but this is not annotated in the dataset, neither is the use of left or right as the dominant hand. This further diminishes the motion of the mean sequences.
For future experiments an averaging and data cleaning process needs to be developed that pays consideration to variability in speed, expression, and left versus right-handed signing. To improve the quality of extracted pose information, and add additional conditioning for hands and faces, we need datasets of high image resolution. For translation we require sign language datasets that have topic-comment alignment. If both are combined, it would be possible to avoid the heavy cost of manually annotating details in sign motion such as facial expression and still get rich, natural translations.

High Definition Continuous Sign Generation
To improve the resolution and sharpness of our sign generation we generate HD continuous sign language video using the HD signing network. The network is conditioned with semantic label maps encoding human poses. We evaluate two configurations: a network conditioned only on 15 upper body joints (as was used in the MS network), and a network conditioned on the same 15 joints, plus 21 key points for each hand, and 68 key points for the face. For details see Fig. 15. We trained for 16 epochs over 19,850 frames and corresponding label maps. For both models we report an average time of 0.42 s per image generated during inference using a GeForce GTX TITAN X.
Each key point is assigned to a separate class using a different pixel value. The hand key points are grouped into two classes representing left and right hand, the facial key points are grouped as contour, left eye, right eye, nose, and mouth. Colour channels are inverted for visualisation purposes.
Quantitative as well as qualitative results are provided. As with MS generation, we report the mean SSIM, PSNR, and MSE, this time over a test set of 500 frames, for just the pose input (HDSp) and pose, hands, and face input (HDSpfh) in Table 4. The results indicate that more detailed conditioning with pose, hands, and face key points produces synthetic images that are closer to the ground truth. However, the difference in scores is not as significant as might be expected.
Looking at example frames we can see that both HDSp and HDSpfh create synthetic images that closely resemble their ground truth in both overall structure and detail such as clothing, overall facial expression and hand shape (see Fig. 16). However, HDSpfh surpasses HDSp clearly for details of the generated hands and facial features. Whereas both networks learn to generate realistic hands and faces, HDSp can generate the wrong hand shape (see middle column in Figs. 16 and 17), as it does not receive the positional information for all the finger joints, but merely an overall position of the hand.
Overall our results indicate that it is possible to generate highly realistic and detailed synthetic sign language videos, given sufficient positional information. A compromise can be found that keeps the annotation effort minimal (like using an automatic pose detector), whilst maintaining realism and expressiveness in the synthetic sign video.

Conclusions
In this paper, we presented the first spoken language-to-sign language video translation system. While other approaches rely on motion capture data and/or the complex animation of avatars, our deep learning approach combines an NMT network with a Motion Graph (MG) to produce human pose sequences. This conditions a sign generation network capable of producing sign video frames.
The NMT network's predictions can successfully be used to solve the MG, resulting in consistent text2pose translations. We show this by analysing example text2pose sequences, and by providing qualitative and quantitative results for an intermediate text2gloss representation. With our multi-signer (MS) generator we are able to produce multiple signers of different appearance. We show this for isolated signs, and as part of our text2sign translation approach.
Additionally, we investigated the generation of HD continuous sign language video. Our results indicate that it is possible to produce photo-realistic video representations of sign language, by conditioning on key points extracted from training data. The accuracy and fidelity of key points seem to play a vital role, reinforcing the need for datasets of sufficient resolution.
Currently our text2sign translation system cannot compete with existing avatar approaches. Due to the low resolution of our translation training data, our results do not have the output resolution and expressiveness obtained by motion capture and avatar-based approaches. However, we have outlined that continuous, realistic sign language synthesis is possible, using minimal annotation. For training we only require text and gloss-level annotations, as skeletal pose information can be extracted from video automatically using an off-the-shelf solution such as OpenPose (Cao et al. 2017). In contrast, avatar-based approaches require detailed annotations using task-specific transcription languages, which can only be carried out by expert linguists. Animating the avatar itself often involves a considerable amount of hand-engineering, and the results thus far remain robotic and under-articulated. Motion capture-based approaches require high-fidelity data, which needs to be captured, cleaned, and stored at considerable cost, limiting the amount of data available, hence making this approach unscalable. We believe that in time our approach will enable highly-realistic, and cost-effective translation of spoken languages to sign languages, improving equal access for the Deaf and Hard of Hearing.
For future work, our goal is to combine the MS and HD sign generation capabilities to synthesize highly detailed sign video, with signers of arbitrary appearance. The MS generator's ability to account for spatial and appearance changes, in combination with the high resolution of the HD generator, would enable us to synthesize highly realistic and expressive sign language video. Additionally, we plan to improve our current MG by developing a data-processing strategy that pays attention to the intricate features of sign language data, such as size of motion, and speed. This means replacing the current use of mean sequences with a more thoughtful approach that takes into account the likelihood of an example sequence being correct, the skeletal composition of different signers, and their dominant hand. We further plan to train our text2sign system end-to-end, and develop a performance metric to further quantitatively analyse the performance of our SLP system. Going forward, as progress is made in sign video-to-spoken-language translation, this could be used as a quantitative evaluation in itself or possibly as part of a cycle GAN approach.

Local SSIM values comparing HDSp to the ground truth (top) and HDSpfh to the ground truth (bottom). Both seem capable of generating faces close to the ground truth, but HDSpfh seems to outperform HDSp for generating correct hand shapes
to thank NVIDIA Corporation for their GPU grant, and Oscar Koller at Microsoft.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.