Text2Sign: Towards Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks

Stoll, Stephanie; Camgoz, Necati Cihan; Hadfield, Simon; Bowden, Richard

doi:10.1007/s11263-019-01281-2

Open access
Published: 02 January 2020

Volume 128, pages 891–908, (2020)

International Journal of Computer Vision

Stephanie Stoll ORCID: orcid.org/0000-0002-3582-3969¹,
Necati Cihan Camgoz¹,
Simon Hadfield¹ &
…
Richard Bowden¹

Abstract

We present a novel approach to automatic Sign Language Production using recent developments in Neural Machine Translation (NMT), Generative Adversarial Networks, and motion generation. Our system is capable of producing sign videos from spoken language sentences. Contrary to current approaches that are dependent on heavily annotated data, our approach requires minimal gloss and skeletal level annotations for training. We achieve this by breaking down the task into dedicated sub-processes. We first translate spoken language sentences into sign pose sequences by combining an NMT network with a Motion Graph. The resulting pose information is then used to condition a generative model that produces photo realistic sign language video sequences. This is the first approach to continuous sign video generation that does not use a classical graphical avatar. We evaluate the translation abilities of our approach on the PHOENIX14T Sign Language Translation dataset. We set a baseline for text-to-gloss translation, reporting a BLEU-4 score of 16.34/15.26 on dev/test sets. We further demonstrate the video generation capabilities of our approach for both multi-signer and high-definition settings qualitatively and quantitatively using broadcast quality assessment metrics.

1 Introduction

According to the World Health Organization there are around 466 million people in the world that are deaf or suffer from disabling hearing loss (WHO: World Health Organization 2018). Whilst not all these people rely on sign languages as their primary form of communication, they are widely used, with an estimated 151,000 users of British Sign Language (BDA: British Deaf Association 2019) in the United Kingdom, and approximately 500,000 people primarily communicating in sign languages across the European Union (EU: European Parliament 2018).

Like spoken languages, sign languages have their own grammatical rules and linguistic structures. This makes the task of translating between spoken and signed languages a complex problem, as it is not simply an exercise of mapping text to gestures word-by-word (see Fig. 1 which demonstrates that both the tokenization of the languages and their ordering is different). It requires machine translation methods to find a mapping between a spoken and signed language, that takes into account both their language models.

To facilitate easy and clear communication between the hearing and the Deaf, it is vital to build robust systems that can translate spoken languages into sign languages and vice versa. This two-way process can be facilitated using Sign Language Recognition (SLR) and Sign Language Production (SLP) (see Fig. 2).

Commercial applications for sign language primarily focus on SLR, by mapping sign to spoken language, typically providing a text transcription of the sequence of signs, such as Elwazer (2018), and Robotka (2018). This is due to the misconception that deaf people are comfortable with reading spoken language and therefore do not require translation into sign language. However, there is no guarantee that someone whose first language is, for example, British Sign Language, is familiar with written English, as the two are completely separate languages. Furthermore, generating sign language from spoken language is a complicated task that cannot be accomplished with a simple one-to-one mapping. Unlike spoken languages, sign languages employ multiple asynchronous channels (referred to as articulators in linguistics) to convey information. These channels include both the manual (i.e. upper body motion, hand shape and trajectory) and non-manual (i.e. facial expressions, mouthings, body posture) features.

The problem of SLP is generally tackled using animated avatars, such as Cox et al. (2002), Glauert et al. (2006) and McDonald et al. (2016). When driven using motion capture data, avatars can produce life-like signing; however, this approach is limited to pre-recorded phrases, and the production of motion capture data is costly. Another method relies on translating the spoken language into sign glosses,^{Footnote 1} and connecting each entity to a parametric representation, such as the hand shape and motion needed to animate the avatar. However, there are several problems with this method. Translating a spoken sentence into sign glosses is a non-trivial task, as the ordering and number of glosses does not match the words of the spoken language sentence (see Fig. 1). Additionally, by treating sign language as a concatenation of isolated glosses, any context and meaning conveyed by non-manual features is lost. This leads to translations that are at best crude and at worst incorrect, and produces the indicative ‘robotic’ motion seen in many avatar based approaches.

To advance the field of SLP, we propose a new approach, harnessing methods from NMT, computer graphics, and neural network based image/video generation. The proposed method is capable of generating a sign language video, given a written or spoken language sentence. An encoder-decoder network provides a sequence of gloss probabilities from spoken language text input, that is used to condition a Motion Graph (MG) to find a pose sequence representing the input. Finally, this sequence is used to condition a GAN to produce a video containing sign translations of the input sentence (see Fig. 4). The contributions of this paper can be summarised as:

An NMT-based network combined with a motion graph that allows for continuous-text-to-pose translation.
A generative network conditioned on pose and appearance.
To our knowledge the first spoken language to sign language video translation system without the need for costly motion capture or an avatar.

A preliminary version of this work was presented in Stoll et al. (2018). This extended manuscript contains an improved pipeline and additional formulation. We introduce an MG into the process, that combined with the NMT network is capable of text-to-pose (text2pose) translations. Furthermore, we demonstrate the generation of multiple signers of varying appearance. We also investigate high-definition (HD) sign generation. Extensive new quantitative as well as qualitative evaluation is provided, exploring the capabilities of our approach. Figure 3 gives a comparison of the output of our approach (right) to other avatar based approaches (left and middle).

The rest of this paper is organised as follows: Sect. 2 gives an overview of recent developments in NMT as well as traditional SLP using avatars. We explain the concept of motion graphs, before describing recent advancements in generative image models. Section 3 introduces all parts of our approach. In Sect. 4 we evaluate our system both quantitatively and qualitatively, before concluding in Sect. 5.

2 Related Work

We treat Sign Language Production (SLP), as a translation problem from spoken into signed language. We therefore first review recent developments in the field of Neural Machine Translation (NMT). However, SLP is different from traditional translation tasks, in that it inherently requires visual content generation. Normally this is performed by animating a 3D avatar. We will therefore give an overview of past and current sign avatar technology. Finally, we cover the concept of Motion Graphs (MGs), a technique used in computer graphics to dynamically animate characters, and the field of conditional image generation.

2.1 Neural Machine Translation

NMT utilises Recurrent Neural Network (RNN) based sequence-to-sequence (seq2seq) architectures which learn a statistical model to translate between different languages. Seq2seq (Sutskever et al. 2014; Cho et al. 2014) has seen success in translating between spoken languages. It consists of two RNNs, an encoder and a decoder, that learn to translate a source sequence to a target sequence. To tackle longer sequences Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) or Gated Recurrent Units (GRU) (Chung et al. 2014) are used as RNN cells. Both architectures have mechanisms that allow each cell to pass only the relevant information to the next time step, hence improving translation performance over longer-term dependencies.

To further improve the translation of long sequences Bahdanau et al. (2014) introduced the attention mechanism. It provides additional information to the decoder by allowing it to observe the encoder’s hidden states. This mechanism was later improved by Luong et al. (2015).

Camgoz et al. (2018) combine a standard seq2seq framework with a Convolutional Neural Network (CNN) to translate sign language videos to spoken language sentences. They first extract features from video using the CNN before translating to text. Similarly, Guo et al. (2018) combine a CNN and an LSTM-based encoder-decoder. However, they employ a 3D CNN to better learn spatio-temporal relationships and identify key clips. This guides the model to focus on the information-rich content. Both approaches can be seen as the inverse of our problem of translating text to pose.

More recently, non-RNN based NMT methods have been explored. ByteNet (Kalchbrenner et al. 2016) performs translation using dilated convolutions, and Vaswani et al. (2017) introduced the transformer, which is a purely attention-based translation method. Focusing specifically on sign language, multi-modal fusion networks have been proposed (Guo et al. 2017; Wang et al. 2018a).

Using NMT methods to translate text to pose is a relatively unexplored and open problem. Ahn et al. (2018) use an RNN-based encoder-decoder model to produce upper body pose sequences of human actions from text and map them onto a Baxter robot. However, their results are purely qualitative and rely on human interpretation. For our work we first translate from text to gloss using a seq2seq architecture with Luong attention (Luong et al. 2015) and GRUs (Chung et al. 2014), similar to Camgoz et al. (2018). However, as we are translating text to pose we do not use a CNN as an initial step. In contrast, we use the probabilities produced by the decoder at each time step to solve a Motion Graph (MG) of sign language pose data, to obtain the text to pose translation.

2.2 Avatar Approaches for Sign Language Production

Sign avatars can either be driven directly from motion capture data, or rely on a sequence of parametrised glosses. Since the early 2000s there have been several research projects exploring avatars animated from parametrised glosses, e.g. VisiCast (Bangham et al. 2000), eSign (Zwitserlood et al. 2004), Tessa (Cox et al. 2002), dicta-sign (Efthimiou 2012), and JASigning (Virtual Humans Group 2017). All of these approaches rely on sign video data to be annotated using a transcription language, such as HamNoSys (Prillwitz 1989) or SigML (Kennaway 2013). Whilst these avatars are capable of producing sign sequences, they are not popular with the Deaf community. This is due to under-articulated and unnatural movements, but mostly due to missing non-manuals, such as eye gaze and facial expressions (see Fig. 3). Important meaning and context is lost this way, making the avatars difficult to understand. Furthermore, the robotic motion of the aforementioned avatars can make viewers uncomfortable, due to the uncanny valley^{Footnote 2} (Mori et al. 2012). Recent work has begun to integrate non-manuals into the annotation and animation process (Ebling and Glauert 2013; Ebling and Huenerfauth 2015). However, the correct alignment and articulation of these features remains an unsolved problem that limits recent avatars such as McDonald et al. (2016) and Kipp et al. (2011).

To make avatars both easier to understand, and increase viewer acceptance, recent sign avatars rely on data collected from motion capture. One example of a motion capture driven avatar is the Sign3D project by MocapLab (Gibet et al. 2016). Given the richness of motion capture data, this approach provides highly realistic results, but is limited to a very small set of phrases, as collecting and annotating data is expensive, time consuming and requires expert knowledge. Although these avatars are better received by the Deaf community, they do not provide a scalable solution. The uncanny valley also still remains a large hurdle. To make synthetic signing more realistic, scalable and avoid the aforementioned problems of 3D avatars, we propose to directly generate sign video from weakly annotated data using the latest developments in machine translation, generative image models and Motion Graphs (MGs).

2.3 Motion Graphs

Motion Graphs (MGs) are used in computer graphics to dynamically animate characters, and can be formulated as a directed graph that is constructed from motion capture data. An MG allows new lifelike sequences to be generated that satisfy specific goals at runtime. MGs were independently introduced by Kovar et al. (2002), Arikan and Forsyth (2002), and Lee et al. (2002). Kovar et al. (2002) define the distance between two frames by calculating the distance between two point clouds. For creating the transitions themselves, the motions are aligned and positions are linearly interpolated between joint rotations. As a search strategy, branch and bound is used. Arikan and Forsyth (2002) use the difference between joint positions and velocities and the difference between torso velocities and accelerations, to define how close or distant two frames are. A smoothing function is applied to the discontinuity between two clips. The graph is searched by first summarizing it and then performing a random search over the summaries. Lee et al. (2002) chose a two layer approach to represent motion data. In the lower layer all data is modelled as a first-order Markov process, where the Markov process is represented by a matrix holding the transition probabilities between frames. The probabilities are derived from measuring the distances of weighted joint angles and velocities. Transitions of low probability are pruned. For blending transitions a hierarchical motion fitting algorithm is used (Lee and Shin 1999). The higher layer generalises the motion preserved in the lower layer by performing cluster analysis, to make it easier to search. Each cluster represents similar motion, but to capture connections between frames a cluster tree is formed at each motion frame. The whole higher layer is called a cluster forest.

We build an MG for sign language pose data, by splitting continuous sign sequences into individual glosses, and grouping all motion sequences by gloss. These motion sequences populate the nodes of our MG. We then use the probabilities provided by our NMT decoder at each time step to transition between nodes.

2.4 Conditional Image Generation

With the advancements in deep learning, the field of image generation has seen various approaches utilising neural-network based architectures. Chen and Koltun (2017) used CNN based cascaded refinement networks to produce photographic images given semantic label maps. Similarly, van den Oord et al. (2016) developed PixelCNN, which produces images conditioned on a vector, that can be image tags or feature embeddings provided by another network. Gregor et al. (2015) and Oord et al. (2016) also explored the use of RNNs for image generation and completion. All these approaches rely on rich semantic or spatial information as input, such as semantic label maps, or they suffer from being blurry and spatially incoherent.

Since the advent of GANs (Goodfellow et al. 2014), they have been used extensively for the task of image generation. Soon after their emergence, Mirza and Osindero (2014) developed a conditional GAN model, by feeding the conditional information to both the Generator and Discriminator. Radford et al. (2015) proposed Deep Convolutional GAN (DCGAN) which combines the general architecture of a conditional GAN with a set of architectural constraints, such as replacing deterministic spatial pooling with strided convolutions. These changes made the system more stable to train and well-suited for the task of generating realistic and spatially coherent images. Many conditional image generation models have been built by extending the DCGAN model. Notably Reed et al. (2016) built a system to generate images of birds that are conditioned on positional information and text description, using text embedding and binary pose heat maps.

An alternative to GAN-based image generation models is provided by Variational Auto-Encoders (VAEs) (Kingma and Welling 2013). Similar to classical auto-encoders, VAEs consist of two networks, an encoder and a decoder. However, VAEs constrain the encoding network to follow a unit Gaussian distribution. Yan et al. (2016) developed a conditional VAE that is capable of generating spatially coherent, but blurry images, a tendency of most VAE-based approaches.

Recent work has looked at combining GANs and VAEs to create robust and versatile image generation models. Makhzani et al. (2016) introduced Adversarial Auto-encoders and applied them to problems in supervised, semi-supervised and unsupervised learning. Larsen et al. (2016) combined VAEs and GANs into a model that can encode, generate and compare samples in an unsupervised fashion. Perarnau et al. (2016) developed Invertible Conditional GANs that use an encoder to learn a latent representation of an input image and a feature vector to change the attributes of human faces.

VAE/GAN hybrid models have proven particularly popular for generating images conditioned on human pose, as done by Ma et al. (2017) and Siarohin et al. (2018). Ma et al. synthesize images of people in arbitrary poses in a two-stage process by fusing an input image of a person for appearance and a heat map providing pose information into a new image in one network, before refining it in a second network. Siarohin et al. use a similar method, but additionally use affine transformations to help change the position of body parts.

In the sub-field of image-to-image translation, Isola et al. (2017) introduced pix2pix, a conditional GAN which, given its information-rich input and avoidance of fully connected layers, was also among the first contenders for generating high definition image content. Building on the success of pix2pix and the architecture proposed by Johnson et al. (2016), Wang et al. recently presented pix2pixHD (Wang et al. 2018b), a network capable of producing $2048 \times 1024$ images from semantic label maps, using a generator and multi-scale discriminator architecture. A global generator consists of a convolutional encoder, a set of residual blocks and a convolutional decoder. In addition, a local enhancer network, of similar architecture, provides high resolution images from semantic label maps. Three discriminators are used at different scales to differentiate real from generated images.

For our work, we follow two strands of conditional image generation techniques: We build a multi-person sign generation network conditioned on human appearance and pose, similar to the works of Ma et al. (2017) and Siarohin et al. (2018). In addition we also investigate single-signer HD sign generation by building on the work of Wang et al. (2018b).

3 Text to Sign Language Translation

Our text-to-sign-language (text2sign) translation system consists of two stages: We train an NMT network to obtain a sequence of gloss probabilities that is used to solve a Motion Graph (MG) which generates human pose sequences (text2pose in Fig. 4). Then a pose-conditioned sign generation network with an encoder-decoder-discriminator architecture produces the output sign video (see pose2video in Fig. 4). We will now discuss each part of our system in detail.

3.1 Text to Pose Translation

We employ recent RNN based machine translation methods, namely attention based NMT approaches, to realize spoken language sentence to sign language gloss sequence translation. We use an encoder-decoder architecture (Sutskever et al. 2014) with Luong attention (Luong et al. 2015) (see Fig. 5).

Given a spoken language sentence, $S^N = \{w_{1}, w_{2}, \ldots , w_{N}\}$, consisting of N words, our encoder maps the sequence into a latent representation as in:

$$\begin{aligned} o_{1:N}, h^{e}_{N} = \text {Encoder}(S^N) \end{aligned}$$

(1)

where $o_{1:N}$ is the output of the encoder for each word w, and $h^{e}_{N}$ is the hidden representation of the encoded sentence. In Fig. 5, the encoder is depicted in blue. This hidden representation and the encoder outputs are then passed to the decoder, which utilises an attention mechanism and generates a probability distribution over glosses:

$$\begin{aligned} p(g_{t}) = \text {Decoder} (g_{t-1}, h^{d}_{t-1}, \alpha (o_{1:N})) \end{aligned}$$

(2)

where $\alpha (\cdot )$ is the attention function, $g_{t}$ is the gloss produced at the time step t and $h^{d}_{t-1}$ is the hidden state of the decoder passed from the previous time step. At the beginning of the decoding, i.e. $t=1$, $h^{d}_{t-1}$ is set as the encoded representation of the input sentence, i.e. $h^{d}_{0} = h^{e}_{N}$. See Fig. 5 for a visualisation.
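To make the role of the attention function $\alpha (\cdot )$ in Eq. 2 more concrete, the following is a minimal sketch of a single attention step. It assumes the dot-product score, which is one of the score functions described by Luong et al. (2015); the text does not state which variant is used, and the function names here are illustrative only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_context(decoder_state, encoder_outputs):
    """Weight the encoder outputs o_1..o_N by their similarity to the decoder state.

    Assumption: dot-product scoring. Returns the attention weights and the
    context vector (the weighted sum of the encoder outputs).
    """
    weights = softmax([dot(decoder_state, o) for o in encoder_outputs])
    dim = len(encoder_outputs[0])
    context = [sum(w * o[d] for w, o in zip(weights, encoder_outputs))
               for d in range(dim)]
    return weights, context
```

In a full decoder the context vector would be combined with $g_{t-1}$ and $h^{d}_{t-1}$ before the output layer produces $p(g_{t})$; that part is omitted here.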

The reason we utilize an attention based approach instead of a vanilla sequence-to-sequence based architecture is to tackle the long term dependency issues by providing additional information to the decoder. To train our NMT network, we use cross entropy loss over the gloss probabilities at each time step.
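As an illustration of the training objective, the sketch below computes a cross entropy loss over per-step gloss distributions. Averaging over time steps is an assumption made here for readability; the text only states that cross entropy is applied at each time step.

```python
import math

def sequence_cross_entropy(step_probs, target_ids):
    """Mean negative log-likelihood of the target glosses.

    step_probs: one probability distribution (list of floats) per time step.
    target_ids: index of the ground-truth gloss at each time step.
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("one target gloss is required per time step")
    nll = [-math.log(probs[g]) for probs, g in zip(step_probs, target_ids)]
    return sum(nll) / len(nll)
```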

We build a Motion Graph (MG) that allows a sequence of 2D skeletal poses to be generated for a given gloss sequence. An MG is a Markov process that can be used to generate new motion sequences that are representative of real motion but fulfil the objectives of the animator e.g. getting from A to B using a specific style of motion. A standard formalisation of an MG is as a finite directed graph of motion primitives (Min and Chai 2012): $MG=(V,E)$, where node $v_i \in V$ in the graph corresponds to one or more sequences of motion (motion primitives) and a prior distribution function $p(x_i)$ over those motion primitives $(x_i)$. Each motion primitive for a node is an example of the style of motion the node represents. It is therefore possible to have a variable number of motion primitives in a node, the minimum being one. An edge $e_{i,j} \in E$, which represents an allowable transition from node $v_i$ to $v_j$, stores a morphable function $\mathbf {Y}_{\mathbf {i},\mathbf {j}}=M(x_i,x_j)$ that enables blending between motions, and a probability distribution $p(x_j|x_i)$ over the motion primitives $x_j$ at node $v_j$, given a chosen motion primitive $x_i$ at node $v_i$. See Fig. 6 for a visualisation.

The motion primitives need to be extracted from a larger set of motion capture data. This can be done by identifying key frames in the motion data that are at the transition points between motions e.g. the left foot impacting the floor for walking sequences. These key-frames are then used to cut the data up into a larger set of motion primitives $x_i$, where each motion primitive is a continuous motion between two key-frames. For more complex datasets of motion, a typical approach is to define a distance metric between skeletal poses which can be used to identify possible transition points as those that fall below a given threshold. The threshold is set small enough that interpolation between two poses will not cause visual disturbance in the fluidity of motion. For our application, we use the gloss boundaries to automatically cut the pose sequences into individual signs so $\vert V \vert $ is equal to the gloss vocabulary size and $x_g$ contains examples of sign gloss g.
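The node construction described above can be sketched as follows. The data layout (a list of pose frames plus `(gloss, start, end)` segments with an exclusive end index) is a hypothetical one chosen for illustration, not the format of the dataset used in this work.

```python
from collections import defaultdict

def build_nodes(sequences):
    """Group motion primitives by gloss.

    sequences: iterable of (poses, segments) pairs, where poses is a list of
    per-frame skeletal poses and segments is a list of (gloss, start, end)
    tuples marking gloss boundaries (end exclusive). Hypothetical layout.
    Returns a dict mapping each gloss to its list of motion primitives.
    """
    nodes = defaultdict(list)
    for poses, segments in sequences:
        for gloss, start, end in segments:
            nodes[gloss].append(poses[start:end])
    return dict(nodes)
```

With this layout the number of keys in the returned dictionary corresponds to $\vert V \vert $, and each node can hold a variable number of primitives.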

In a graphics context E is learned directly from the data by looking at the transitions between nodes in the graph present in the original data. However, in our case, E is generated at each time step by the decoder network, given the previously generated glosses and encoded sequence, as in:

$$\begin{aligned} e_{t-1, t} &= p(x_{t} | x_{t-1}) \\ &= p(g_{t} | g_{1:t-1}, S^{N}) \\ &= \text{Decoder}(g_{t-1}, h^{d}_{t-1}, \alpha (o_{1:N})). \end{aligned}$$

(3)

The purpose of $\mathbf {Y}_{\mathbf {i},\mathbf {j}}$ is to allow smooth transition between different motion primitives. In our case it is constant for all nodes in the graph. We use a Savitzky–Golay filter (Savitzky and Golay 1964) to create smooth transitions. This is done dynamically as the graph is searched. The Savitzky–Golay filter smooths between motion primitives by fitting a low-order polynomial to adjacent data points. We use a window size of five and a polynomial order of two to smooth between the last five frames of the current motion primitive and the first five frames of the upcoming primitive. This allows us to preserve the articulation of each motion primitive, but avoid discontinuities and artefacts at transition points.
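A minimal sketch of this smoothing step is given below for a single joint coordinate over time. It uses the standard 5-point quadratic Savitzky–Golay smoothing coefficients $(-3, 12, 17, 12, -3)/35$. How the ends of the ten-frame window are treated is not specified in the text; here the two outermost frames on each side are left unchanged, which is an assumption. In practice the same operation would be applied to every coordinate of every joint.

```python
# 5-point, polynomial order 2 Savitzky-Golay smoothing kernel
SG_COEFFS = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)

def savgol_5_2(values):
    """Smooth a 1D track; the two samples at each end are copied unchanged."""
    out = list(values)
    for i in range(2, len(values) - 2):
        window = values[i - 2:i + 3]
        out[i] = sum(c * v for c, v in zip(SG_COEFFS, window))
    return out

def smooth_transition(current, upcoming, span=5):
    """Smooth across the junction of two motion primitives.

    current, upcoming: lists of one scalar coordinate per frame.
    Returns the two primitives with the last `span` frames of the first and
    the first `span` frames of the second replaced by smoothed values.
    """
    k = min(span, len(current))
    joined = current[-span:] + upcoming[:span]
    smoothed = savgol_5_2(joined)
    return current[:-k] + smoothed[:k], smoothed[k:] + upcoming[span:]
```

Because the kernel fits a quadratic, tracks that are already smooth are left essentially unchanged, while a jump at the junction is spread over the neighbouring frames.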

To find the most probable motion sequence given a spoken language sentence, we employ beam search over our motion graph. We start generating our sequence from the special $x_0=< \hbox {bos}>$ (beginning of sequence) node. At each motion step, we consider a list of hypotheses, $\mathcal {H}^B = \{H_{1}, \ldots , H_{b}, \ldots , H_{B}\}$ where B denotes our beam width.

At each step we expand our hypotheses with a new motion as in:

$$\begin{aligned} H_{b}^{t} = \{H_{b}^{t-1}, x^{*}_{t}\}, \end{aligned}$$

(4)

where $H_{b}^{t}$ denotes the set of motions in $H_{b}$ at step t. We choose $x^{*}_{t}$ by:

$$\begin{aligned} x^{*}_{t} = \operatorname*{argmax}_{x} \, p(x | x_{t-1}), \end{aligned}$$

(5)

where $x_{t-1} \in H_{b}^{t-1}$. We continue expanding our hypotheses until all of them reach the special $x_{\_}=< \hbox {eos}>$ (end of sequence) node. We then choose the most probable motion sequence $\mathcal {H}^{*}$ by:

$$\begin{aligned} \mathcal {H}^{*} = \operatorname*{argmax}_{H_{b}} \prod _{i=1}^{|H_{b}|} p(x_{i} | x_{i-1}). \end{aligned}$$

(6)
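The search in Eqs. 4 to 6 can be illustrated with a generic beam search. This sketch follows the common formulation in which every hypothesis is expanded with all candidate transitions and the best B are kept; `step_fn` is a hypothetical stand-in for the decoder that returns transition probabilities given the sequence so far. Log-probabilities are summed instead of multiplying probabilities, which gives the same ranking as the product in Eq. 6.

```python
import math

BOS, EOS = "<bos>", "<eos>"

def beam_search(step_fn, beam_width=3, max_steps=20):
    """Return (sequence, log_probability) of the best finished hypothesis.

    step_fn(sequence) -> dict mapping the next node to p(next | sequence).
    """
    beams = [([BOS], 0.0)]
    finished = []
    for _ in range(max_steps):
        candidates = []
        for seq, logp in beams:
            for node, p in step_fn(seq).items():
                if p > 0.0:
                    candidates.append((seq + [node], logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, logp in candidates[:beam_width]:
            # Hypotheses that reached <eos> are set aside, others keep growing
            (finished if seq[-1] == EOS else beams).append((seq, logp))
        if not beams:
            break
    pool = finished or beams
    if not pool:
        raise ValueError("step_fn produced no valid transitions")
    return max(pool, key=lambda c: c[1])
```

The `max_steps` cap is an added safeguard against hypotheses that never reach the end-of-sequence node; it is not part of the formulation above.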

3.2 Pose to Video Translation

The pose-to-video (pose2video) network combines a convolutional image encoder and a Generative Adversarial Network (GAN), see Fig. 7 for an overview. A GAN consists of two models that are trained in conjunction: A generator G that creates new data instances, and a discriminator D that evaluates whether these belong to the same data distribution as the training data. During training, G aims to maximise the likelihood of D falsely predicting a sample generated by G to be part of the training data, while D tries to correctly identify samples to be either fake or real. Using this minimax game setup, the generator learns to produce more and more realistic samples, ideally to the point where D cannot separate them from the ground truth.

G is an encoder-decoder, conditioned on human pose and appearance. The latent space can either be a fixed-size one-dimensional vector, or a variable-size residual block. A fixed size 1D vector latent space using a fully connected layer allows generation of images with both large appearance and spatial change and is employed for multi-signer (MS) output. However, the ability to generate spatial change, and the requirement for fully connected layers increases memory consumption, and limits the output size of the generated images. In contrast, a fully convolutional latent space, such as a number of residual layers, allows for changes in appearance, like changing from a pose label map to an image of a human being in that pose, but does not allow for large spatial changes. This enables the network to transfer style similar to pix2pixHD by Wang et al. (2018b) or Chan et al. (2018). However, due to the avoidance of fully connected layers and with the use of an additional enhancing network, it is capable of producing sharp high definition outputs. We investigate this second formulation for generating high-definition (HD) sign video.

3.2.1 Image Generator

As input to the generator we concatenate $P_{t}$ and $I_{a}$ as separate channels, where $P_{t}$ is a human pose label map. For MS generation $I_{a}$ is an image of an arbitrary human subject in a resting position (base pose). The HD sign generator cannot be conditioned on a base pose, as it does not allow for large spatial changes. Instead it is conditioned on the generated image from the previous time step. On top of helping with appearance this enforces temporal consistency.
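The difference between the two conditioning schemes is easiest to see in the generation loop for the HD case, sketched below. Here `generator` is a hypothetical callable taking a pose label map and the previously generated frame; how the very first frame is obtained is not described at this point, so `first_frame` is simply passed in as an assumption.

```python
def generate_hd_sequence(generator, pose_maps, first_frame):
    """Generate frames autoregressively, conditioning each on the previous one.

    generator(pose_map, previous_frame) -> new frame (hypothetical interface).
    """
    frames = []
    previous = first_frame
    for pose_map in pose_maps:
        previous = generator(pose_map, previous)
        frames.append(previous)
    return frames
```

For MS generation the loop would instead pass the same base pose image $I_{a}$ at every step, so frames are generated independently of each other.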

The input to the generator is pushed through the convolutional encoder part of the generator and encoded into the latent space. The decoder part of the generator uses up-convolution and resize-convolution to decode from the latent space back into an image using the embedded skeletal information provided by the label map $P_{t}$. This produces an image $G(P_{t}, I_{a})$ of the signer in the pose indicated by $P_{t}$ (see Fig. 7).

In the HD sign variant, an enhancer network En is used to upscale and refine the output images produced by the generator G. Its architecture is very similar to G, consisting of a convolutional encoder, a residual block and an up-convolutional decoder. G is first trained individually, followed by En, before training both networks in conjunction.

3.2.2 Discriminator

The discriminator D receives as input a tuple of either the generated synthetic image $G(P_{t}, I_{a})$ or the ground truth $I_{t}$, together with the pose label map $P_{t}$. In the MS case, D is also provided with $I_{a}$ (see Fig. 7). D decides on the image’s authenticity. In the MS case, given that the system is trained on multiple signers, $I_{a}$ is used to establish whether the generated image resembles the desired signer. The skeletal information provided by $P_{t}$ is used to assess if the generated image has the desired joint configuration. For the HD sign case, like Wang et al. (2018b) we use a multi-scale discriminator with three scales (in our case $1080\times 720$, $540\times 360$, and $270\times 180$).
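The three scales each halve the resolution of the previous one. A minimal sketch of building such an image pyramid is shown below for a single-channel image stored as a list of rows. The use of 2×2 average pooling is an assumption made for illustration; the text does not state how the lower-resolution inputs are produced.

```python
def downsample2x(image):
    """Average-pool a 2D image (list of rows of floats) by a factor of two."""
    height = len(image) // 2 * 2
    width = len(image[0]) // 2 * 2
    return [
        [(image[y][x] + image[y][x + 1]
          + image[y + 1][x] + image[y + 1][x + 1]) / 4.0
         for x in range(0, width, 2)]
        for y in range(0, height, 2)
    ]

def pyramid(image, scales=3):
    """Return the image at `scales` resolutions, full resolution first."""
    levels = [image]
    for _ in range(scales - 1):
        levels.append(downsample2x(levels[-1]))
    return levels
```

Each level would be passed, together with a correspondingly resized pose label map, to its own discriminator $D_k$.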

3.2.3 Loss

We use the GAN’s adversarial loss, as well as an L1 loss between generated and ground truth images to train our networks. See Fig. 7 for a visualisation. The overall loss is therefore defined as:

$$\begin{aligned} \mathcal {L} = L_{GAN} + \delta L_{1}, \end{aligned}$$

(7)

where $\delta $ weighs the influence of $L_{1}$.

For MS generation we give $I_{a}$ to the generator and the discriminator to distinguish between signers. The adversarial loss is thus defined as:

$$\begin{aligned} L_{GAN_{ms}}(G,D) &= \mathop {\mathbb {E}}\limits _{(P_{t},I_{t}, I_{a})}[\log D(P_{t},I_{t}, I_{a})] \\ &\quad + \mathop {\mathbb {E}}\limits _{(P_{t}, I_{a})}[\log (1-D(P_{t},G(P_{t}, I_{a}), I_{a}))]. \end{aligned}$$

(8)

The MS L1 loss is defined as the sum of absolute pixel differences between the ground truth and the generated image:

$$\begin{aligned} L_{1_{ms}}(G) = \sum {|(I_{t} - G(P_{t}, I_{a}))|}. \end{aligned}$$

(9)
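A minimal numerical sketch of Eqs. (7) to (9) follows, assuming the discriminator outputs are probabilities in (0, 1) and the expectations are estimated by batch means. The function names, the clipping constant `eps`, and the value of `delta` used in the example are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def l1_pixel_loss(target, generated):
    """Eq. (9): sum of absolute pixel differences between I_t and G(P_t, I_a)."""
    target = np.asarray(target, dtype=np.float64)
    generated = np.asarray(generated, dtype=np.float64)
    return float(np.abs(target - generated).sum())

def gan_loss(d_real, d_fake, eps=1e-12):
    """Eq. (8), with each expectation replaced by a mean over the batch.

    d_real: discriminator outputs for (P_t, I_t, I_a) tuples.
    d_fake: discriminator outputs for (P_t, G(P_t, I_a), I_a) tuples.
    """
    # Clip to avoid log(0); eps is an assumption of this sketch.
    d_real = np.clip(np.asarray(d_real, dtype=np.float64), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=np.float64), eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

def total_loss(d_real, d_fake, target, generated, delta):
    """Eq. (7): L = L_GAN + delta * L_1."""
    return gan_loss(d_real, d_fake) + delta * l1_pixel_loss(target, generated)

# Toy example with a hypothetical weight delta = 0.5.
example = total_loss([0.5], [0.5], [[1, 2], [3, 4]], [[1, 1], [1, 1]], delta=0.5)
```

In adversarial training the discriminator seeks to increase the GAN term while the generator seeks to decrease it; the L1 term depends on the generator only.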

For HD generation the adversarial loss is defined as:

$$\begin{aligned} L_{GAN_{hd}}(G,D_{k})= & {} \mathop {\mathbb {E}}_{(P_{t},I_{t})}[\log D_{k}(P_{t},I_{t})] \nonumber \\&+\, \mathop {\mathbb {E}}_{(P_{t})}[\log (1-D_{k}(P_{t},G(P_{t}, I_{a})))],\nonumber \\ \end{aligned}$$

(10)

where k indexes the discriminator scales. To combine the adversarial losses of all $D_k$, we sum:

$$\begin{aligned} L_{GAN_{hd}}(G,D) = \sum _{k=1,2,3}L_{GAN_{hd}}(G,D_{k}). \end{aligned}$$

(11)

For HD generation the L1 loss is based on the feature matching loss presented in Wang et al. (2018b). Features extracted from multiple stages of the discriminator are matched, rather than pixels:

$$\begin{aligned} L_{1_{hd}}(G,D_k) {=} \mathop {\mathbb {E}}_{(P_t, I_t)} \sum _{i=1}^{T}{\frac{1}{N_i}\Big [ \sum {|D_{k}^{(i)}(P_{t},I_{t}) {-} D_{k}^{(i)}(P_{t},G(P_{t}, I_{a}))|} \Big ]}\nonumber \\ \end{aligned}$$

(12)

where T is the total number of layers in $D_k$, i is the current layer of $D_k$, $N_i$ is the total number of elements per layer, and $D_{k}^{(i)}$ is the ith layer feature extractor of $D_k$. Again we sum the L1 losses of all $D_k$ to obtain the overall L1 loss:

$$\begin{aligned} L_{1_{hd}}(G,D) = \sum _{k=1,2,3}L_{1_{hd}}(G,D_k). \end{aligned}$$

(13)
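The feature matching loss of Eqs. (12) and (13) can be sketched for a single sample as below. The sketch assumes that the per-layer discriminator features have already been extracted and are passed in as lists of arrays; the function names and the list-of-arrays interface are hypothetical.

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats):
    """Eq. (12) for one discriminator D_k and one sample.

    real_feats[i] holds the layer-i features for (P_t, I_t) and
    fake_feats[i] holds the layer-i features for (P_t, G(P_t, I_a)).
    """
    loss = 0.0
    for f_real, f_fake in zip(real_feats, fake_feats):
        f_real = np.asarray(f_real, dtype=np.float64)
        f_fake = np.asarray(f_fake, dtype=np.float64)
        # Absolute differences summed over the layer, normalised by N_i.
        loss += np.abs(f_real - f_fake).sum() / f_real.size
    return float(loss)

def multi_scale_feature_matching(real_per_scale, fake_per_scale):
    """Eq. (13): sum the per-scale losses over all discriminators D_k."""
    return sum(
        feature_matching_loss(r, f)
        for r, f in zip(real_per_scale, fake_per_scale)
    )

# Toy features for one scale with two layers.
real = [np.ones((2, 2)), np.zeros(4)]
fake = [np.zeros((2, 2)), np.zeros(4)]
one_scale = feature_matching_loss(real, fake)
three_scales = multi_scale_feature_matching([real] * 3, [fake] * 3)
```

Because each layer is normalised by its own element count $N_i$, large early layers do not dominate the small late layers in the sum.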

4 Experiments

We first introduce the datasets used and any necessary pre-processing steps, before evaluating all sub-parts of our system both quantitatively and qualitatively. We show results for translating spoken language text to gloss sequences and pose sequences, and for generating multi-signer (MS) and high-definition (HD) sign video, using broadcast quality assessment metrics. A set of qualitative examples showcases the current state of the full preliminary translation pipeline.

4.1 Datasets

In order to realise spoken language to sign video generation, we require a large scale dataset, which provides sign language sequences and their spoken language translations.

Although there are vast quantities of broadcast data available and many linguistically annotated datasets, they lack spoken language sentence to sign sequence (i.e. topic-comment) alignment. However, recently Camgoz et al. (2018) released RWTH-PHOENIX-Weather 2014T (PHOENIX14T), which is the extended version of the continuous sign language recognition benchmark dataset PHOENIX-2014 (Forster et al. 2014). PHOENIX14T consists of German Sign Language (DGS) interpretations of weather broadcasts. It contains 8257 sequences performed by 9 signers. It has a sign gloss vocabulary of 1066 and a spoken language vocabulary of 2887. Each sequence is annotated with both the sign glosses and spoken language translations.

We trained our spoken language to sign pose network using PHOENIX14T. However, due to the limited number of signers in the dataset, we utilised another large scale dataset to train the multi-signer (MS) generation network, namely the SMILE Sign Language Assessment Dataset (Ebling et al. 2018). The SMILE dataset contains 42 signers performing 100 isolated signs for three repetitions in Swiss German Sign Language (DSGS). Although the SMILE dataset is multi-view, we only used the Kinect colour stream, without any depth information or the Kinect’s built-in pose estimations.

We trained the HD sign generation network on $1280\times 720$ HD dissemination material acquired by the Learning to Recognise Dynamic Visual Content from Broadcast Footage (Dynavis) project (Bowden et al. 2016). It consists of multiple videos featuring the same subject performing continuous British Sign Language (BSL) sequences. There is no alignment between spoken language sentences and sign sequences.

Using multiple datasets is motivated by the fact that there is no single dataset that provides text-to-sign translations, a broad range of signers of different appearance, and high definition signing content. Using datasets from different subject domains and languages demonstrates the robustness and flexibility of our method, as it allows us to transfer knowledge between specialised datasets. This makes the approach suitable for translating between different spoken and signed languages, as well as other problems, such as text-conditioned image and video generation.

4.1.1 Data Pre-Processing

In order to perform translation from spoken language to sign pose, we need to find pose sequences that represent the appropriate glosses. We split the continuous samples of the PHOENIX14T dataset by gloss using a forced alignment approach. Then, for each gloss we perform a normalisation over all example sequences containing that gloss. First, we have to relate the different body shapes and sizes of all signers to that of a selected target subject. Additionally we have to time-align all example sequences, before we can find an average representation for each frame of the sequence. To align different signers’ skeletons to that of a target subject, we use OpenPose (Cao et al. 2017) to extract upper body key points for each frame in the sequence and for a reference frame of the target subject. We align the skeletons at the neck joint and scale by the shoulder width. We use dynamic time warping to time align sequences, before taking the mean of each joint per frame over all example sequences to generate a representative mean sequence. These mean sequences form the nodes of our MG. We decided to use mean sequences rather than raw example sequences, as they provide a more stable performance. We found that corruptions in the gloss boundary information obtained by forced alignment produced an immense variability in quality and correctness for the samples per node in the graph. The supplementary material contains example comparisons between a motion graph built using mean sequences versus the raw data.
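The normalisation steps described above can be sketched as follows: align each skeleton at the neck, scale by shoulder width, time-align with dynamic time warping, then average per frame. The joint indices, the function names, and the choice of warping every example onto one reference timeline are assumptions of this sketch; the text does not specify these details.

```python
import numpy as np

# Hypothetical joint indices; the real ordering depends on the pose estimator.
NECK, R_SHOULDER, L_SHOULDER = 0, 1, 2

def normalise_skeleton(joints, ref_joints):
    """Move the neck onto the reference neck and match the shoulder width."""
    joints = np.asarray(joints, dtype=np.float64)
    ref_joints = np.asarray(ref_joints, dtype=np.float64)
    width = np.linalg.norm(joints[R_SHOULDER] - joints[L_SHOULDER])
    ref_width = np.linalg.norm(ref_joints[R_SHOULDER] - ref_joints[L_SHOULDER])
    return (joints - joints[NECK]) * (ref_width / width) + ref_joints[NECK]

def dtw_path(seq_a, seq_b):
    """Dynamic time warping; returns the matched (i, j) frame index pairs."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            cost[i, j] = dist + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack from the end of both sequences to the start.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def mean_sequence(sequences, reference):
    """Warp every (frames, joints, 2) sequence onto `reference`, then average."""
    reference = np.asarray(reference, dtype=np.float64)
    warped_all = []
    for seq in sequences:
        seq = np.asarray(seq, dtype=np.float64)
        warped = np.zeros_like(reference)
        counts = np.zeros(len(reference))
        for i, j in dtw_path(reference, seq):
            warped[i] += seq[j]
            counts[i] += 1
        # Average all frames matched to the same reference frame.
        warped_all.append(warped / counts[:, None, None])
    return np.mean(warped_all, axis=0)
```

Averaging after warping removes differences in signing speed, but it also smooths out motion whenever the examples for a gloss disagree.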

4.2 German to Pose Translation

We provide results for translating German sentences into their intermediate gloss representation, and show how this, combined with an MG, can be used to generate human pose sequences from spoken language sentences.

As described in Sect. 3.1, we utilised an encoder-decoder NMT architecture for spoken language to sign gloss translation. Both our encoder and decoder networks have 4 layers with 1000 Gated Recurrent Units (GRUs) each. As an attention mechanism we use Luong et al.’s approach as it utilises both encoder and decoder outputs during context vector calculation. We trained our network using Adam optimisation with a learning rate of $10^{-5}$ for 30 epochs. We also employed dropout with 0.2 probability on GRUs to regularise training. During inference the width B of the beam decoder is set to three, meaning the three top hypotheses are kept per time step. We found this number to be a good trade-off between translation quality and computational complexity. For text2pose generation we report an average time of 0.79 s per translated gloss using an Intel® Core™ i7-6700 CPU (3.40 GHz, 8MB cache), where the majority of time is taken up by generating the pose maps (0.77 s/gloss).
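To make the role of the beam width B concrete, the following sketch runs a generic beam search over a toy next-token distribution. It is not the NMT decoder or the Motion Graph search used in this work; `toy_step`, its probability table, and the token names are invented for illustration.

```python
import math

def beam_search(step_fn, start, end, beam_width=3, max_len=10):
    """Keep the `beam_width` highest-scoring partial hypotheses per time step.

    step_fn(tokens) must return a dict mapping each next token to its
    probability. Scores are summed log-probabilities.
    """
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, prob in step_fn(tokens).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            # Hypotheses that emitted the end token leave the beam.
            (finished if tokens[-1] == end else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    return max(finished, key=lambda c: c[1])

def toy_step(tokens):
    """Hypothetical next-token distribution, conditioned on the last token."""
    table = {
        "<s>": {"A": 0.6, "B": 0.4},
        "A": {"X": 0.5, "Y": 0.5},
        "B": {"Z": 0.9, "X": 0.1},
    }
    return table.get(tokens[-1], {"</s>": 1.0})

best_tokens, best_score = beam_search(toy_step, "<s>", "</s>", beam_width=3)
greedy_tokens, greedy_score = beam_search(toy_step, "<s>", "</s>", beam_width=1)
```

In this toy case a width of one commits to the locally best first token and ends with a lower-probability sequence than a width of three, which illustrates the quality versus cost trade-off mentioned above.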

4.2.1 Translating German to Gloss

To measure the translation performance of our approach we used BLEU and ROUGE (ROUGE-L F1) score as well as Word Error Rate (WER), which are amongst the most popular metrics in the machine translation domain. We measure the BLEU scores on different n-gram granularities, namely BLEU 1, 2, 3 and 4, to give readers a better perspective of the translation performance.
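Of the metrics above, WER is the simplest to state in code: the word-level edit distance between hypothesis and reference, divided by the reference length. The sketch below is a plain dynamic-programming implementation and not the evaluation script used for Table 1; the gloss strings in the example are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

# Hypothetical gloss sequences: one substitution and one deletion.
wer = word_error_rate("MORGEN REGEN NORD WIND", "MORGEN SONNE NORD")
```

Unlike BLEU and ROUGE, lower WER is better, and it can exceed 1 when the hypothesis is much longer than the reference.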

We compare our Text2Gloss performance against the Gloss2Text network of Camgoz et al. (2018), which is the opposite task of translating sign glosses to spoken language sequences. We do this as, to our knowledge, there is no other text-to-gloss translation approach for a direct comparison. We aim to give the reader context, rather than claiming to supersede the Gloss2Text approach (Camgoz et al. 2018). Our results, as seen in Table 1, show that Text2Gloss performs comparably with the Gloss2Text network. While Gloss2Text achieves a higher BLEU-4 score, our Text2Gloss surpasses its performance on BLEU scores with smaller n-grams and on ROUGE scores. We believe this is due to the shorter length of sign gloss sequences and their smaller vocabulary. The challenge is further exacerbated by the fact that sign languages employ a spatio-temporal grammar which is challenging to represent in text.

Table 1 BLEU and ROUGE scores, as well as WER for PHOENIX-2014T dev and test data

We also provide qualitative results by examining sample Text2Gloss translations (see Table 2). Our experiments indicate that the network is able to produce gloss sequences from text that are close to the gloss ground truth. Even when the predicted gloss sequence does not exactly match the ground truth, the network chooses glosses that are close in meaning.

Table 2 Translations from our NMT network (GT: Ground Truth)

After reporting these promising intermediate results, we will now show how this approach can be extended to generate human pose maps that encode the motion of signs.

4.2.2 Translating German to Pose

We give a qualitative evaluation of translating German sentences into human pose sequences by solving an MG using the NMT’s beam search. Figure 8 shows two examples. In both cases we show key frames that are indicative of the translated glosses. It is interesting to note that both sequences contain the gloss WIND, twice in the top sequence and once in the bottom sequence. The relevant key frames for each occurrence (key frames 2 and 6 for the top sequence, key frame 4 for the bottom sequence) are very similar, showing the conditioning of poses on a specific gloss.

The poses are encoded as $128\times 128\times 10$ binary label maps, where each joint inhabits one of the 10 depth channels. This type of map is used to generate sign language video in Sects. 4.3 and 4.4.
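One possible way to build such a label map is sketched below, assuming each joint is marked by a single pixel in its own channel. Whether the original maps mark single pixels or larger regions is not stated here, so the marking scheme, the function name, and the (x, y) input layout are assumptions.

```python
import numpy as np

def pose_to_label_map(joints_xy, size=128):
    """Encode joints as a size x size x num_joints binary label map.

    joints_xy: sequence of (x, y) pixel coordinates, one row per joint.
    Each joint sets one pixel to 1 in its own channel.
    """
    joints_xy = np.asarray(joints_xy, dtype=np.float64)
    label_map = np.zeros((size, size, len(joints_xy)), dtype=np.uint8)
    for channel, (x, y) in enumerate(joints_xy):
        col, row = int(round(x)), int(round(y))
        # Joints that fall outside the frame are left as empty channels.
        if 0 <= row < size and 0 <= col < size:
            label_map[row, col, channel] = 1
    return label_map

# Ten hypothetical joints placed along a diagonal.
joints = [(10 + 5 * k, 20 + 5 * k) for k in range(10)]
label_map = pose_to_label_map(joints)
```

Keeping one channel per joint lets the generator tell joints apart even when two of them project to the same pixel.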

4.3 Multi-Signer Generation of Isolated Signs

This section presents results using the generated label maps to condition a GAN that generates sign video for multi-signer (MS) video generation. We test using isolated signs from the SMILE dataset. When testing on a GeForce GTX TITAN X we report an average time of 1.71 s per generated image.

We generate synthetic sign video from previously unseen label data. To evaluate the quality of the generated output, we use the Structural Similarity Index Measurement (SSIM) (Wang et al. 2004), Peak Signal-to-Noise Ratio (PSNR), and Mean Squared Error (MSE), three well-known metrics for assessing image quality.

SSIM is a metric used to assess the perceptual degradation of images and video in broadcast, by comparing a corrupted image to its original. We adapt this approach to compare the generated synthetic image $G(P_{t}, I_{a})$ to its ground truth image $I_{t}$.

For ease of notation we define:

$$\begin{aligned}&\hat{I_{t}} = G(P_{t}, I_{a}).\nonumber \\&{ SSIM}(\hat{I_{t}},I_{t}) = [l(\hat{I_{t}},I_{t})]^\alpha \cdot [c(\hat{I_{t}},I_{t})]^\beta \cdot [s(\hat{I_{t}},I_{t})]^\gamma , \end{aligned}$$

(14)

where $l(\hat{I_{t}},I_{t})$ is a luminance term:

$$\begin{aligned} l(\hat{I_{t}},I_{t}) = \frac{2\mu _{\hat{I_{t}}}\mu _{I_{t}} + C_{1}}{\mu ^2_{\hat{I_{t}}} + \mu ^2_{I_{t}} + C_{1}}, \end{aligned}$$

(15)

$c(\hat{I_{t}},I_{t})$ is a contrast term:

$$\begin{aligned} c(\hat{I_{t}},I_{t}) = \frac{2\sigma _{\hat{I_{t}}}\sigma _{I_{t}} + C_{2}}{\sigma ^2_{\hat{I_{t}}} + \sigma ^2_{I_{t}} + C_{2}}, \end{aligned}$$

(16)

and $s(\hat{I_{t}},I_{t})$ is a structural term:

$$\begin{aligned} s(\hat{I_{t}},I_{t}) = \frac{\sigma _{\hat{I_{t}}I_{t}} + C_{3}}{\sigma _{\hat{I_{t}}}\sigma _{I_{t}} + C_{3}}, \end{aligned}$$

(17)

with $\mu _{\hat{I_{t}}}$, and $\mu _{I_{t}}$ being the means, $\sigma _{\hat{I_{t}}}$, and $\sigma _{I_{t}}$ the standard deviations and $\sigma _{\hat{I_{t}}I_{t}}$ the cross-covariance for images $\hat{I_{t}}$ and $I_{t}$. $C_{1}=(k_1L)^2$ and $C_{2}=(k_{2}L)^2$, where L is the dynamic range of pixel values, and $k_{1}=0.01$ and $k_{2}=0.03$. $C_{3}$ is set to equal $C_{2}/2$.

With default values of $\alpha = \beta = \gamma = 1$ the expression for SSIM simplifies to:

$$\begin{aligned} SSIM(\hat{I_{t}},I_{t}) = \frac{(2\mu _{\hat{I_{t}}}\mu _{I_{t}}+C_{1})(2\sigma _{\hat{I_{t}}I_{t}}+C_{2})}{(\mu ^2_{\hat{I_{t}}}+\mu ^2_{I_{t}}+C_{1})(\sigma ^2_{\hat{I_{t}}}+\sigma ^2_{I_{t}}+C_{2})}. \end{aligned}$$

(18)
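Eq. (18) can be evaluated directly from image statistics, as in the sketch below. Note the simplification: this version uses one set of statistics for the whole image, whereas SSIM as defined by Wang et al. (2004) is computed over local windows and then averaged. The function name is hypothetical.

```python
import numpy as np

def global_ssim(generated, target, dynamic_range=255.0, k1=0.01, k2=0.03):
    """Eq. (18) computed from whole-image statistics (single window)."""
    x = np.asarray(generated, dtype=np.float64)
    y = np.asarray(target, dtype=np.float64)
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()  # cross-covariance
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)

# A synthetic gradient image and its intensity-inverted copy.
image = np.arange(64, dtype=np.float64).reshape(8, 8) * 4.0
same = global_ssim(image, image)
inverted = global_ssim(image, 255.0 - image)
```

An image compared with itself scores 1, while the inverted copy has negative cross-covariance and therefore a negative score.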

The calculated SSIM ranges from $-1$ to 1, with 1 indicating the images are identical.

PSNR and MSE are metrics used to assess the quality of compressed images compared to their original. We use MSE to calculate the average squared error between a synthetic image $\hat{I_{t}}$ and its ground truth image $I_{t}$, by:

$$\begin{aligned} MSE = \frac{1}{MN}\sum _{m=1}^{M}\sum _{n=1}^{N}[I_{t}(m,n) - \hat{I_{t}}(m,n)]^2 \end{aligned}$$

(19)

where N and M are the number of columns and rows respectively.

In contrast PSNR measures the peak error in dB, using the MSE:

$$\begin{aligned} PSNR = 10\log _{10}\bigg (\frac{R^2}{MSE}\bigg ), \end{aligned}$$

(20)

where R is the maximum possible value of the input data, in this case 255 for 8-bit unsigned integers.
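Eqs. (19) and (20) translate directly into code. The sketch below assumes 8-bit images and converts them to floating point before subtracting, so that unsigned integer wrap-around cannot occur; returning infinity for identical images is a convention chosen for this example.

```python
import math
import numpy as np

def mse(target, generated):
    """Eq. (19): mean squared error over all pixels."""
    target = np.asarray(target, dtype=np.float64)
    generated = np.asarray(generated, dtype=np.float64)
    return float(np.mean((target - generated) ** 2))

def psnr(target, generated, max_value=255.0):
    """Eq. (20): peak signal-to-noise ratio in dB, with R = max_value."""
    error = mse(target, generated)
    if error == 0.0:
        return math.inf  # identical images: no noise to measure
    return 10.0 * math.log10(max_value ** 2 / error)

# Toy 8-bit images that differ by one grey level everywhere.
ground_truth = np.full((4, 4), 100, dtype=np.uint8)
synthetic = np.full((4, 4), 101, dtype=np.uint8)
error = mse(ground_truth, synthetic)
quality = psnr(ground_truth, synthetic)
```

Lower MSE and higher PSNR both indicate a synthetic image closer to its ground truth.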

The MS generation network was trained on 40 different signers from the SMILE dataset over 90,000 iterations. Out of these signers, several signers were chosen, and the network fine-tuned for another 10,000 iterations on the appearance of those signers. The pose label maps were generated from running OpenPose on the full-size SMILE ground truth footage of $1920\times 1080$ pixels, and then downsampled to $128\times 128$ pixels. The original SMILE footage was then also downsampled to $128\times 128$ pixels to function as input to our network.

We test on three different signers, over 1000 frames each. We report the mean SSIM, PSNR, and MSE (see Table 3). The results indicate that the images produced for all three signers are very close to their ground truth, with SSIM values close to 1. Signer 1 has slightly worse scores than signers 2 and 3, which is due to a corrupted sequence in the gathered data.

Table 3 Mean SSIM, PSNR, and MSE values over the test set, comparing synthetic images to their ground truth

Qualitative results in Figs. 9 and 10 show that the synthetic sequences generated by our network stay close to their ground truth in terms of both motion and appearance. Details for hands and faces are largely preserved, however the network can struggle to form both arms and hands fully, especially when held in front of the chest and face. This is likely due to the similarity in colour, which also could have led to errors in the key point extraction process.

The results also highlight the power of our data-driven approach to capture natural variations in sign. Signer 2 is left-handed, whereas signers 1 and 3 are right-handed. There are also noticeable discrepancies in speed and size of motion amongst the signers. Linguistically, these are very important factors that can have a significant impact on the meaning of a sign. They convey additional information such as emotion and intent, for example haste, anger, or uncertainty.

Overall our experiments show that our MS generation network is capable of synthesizing sign language videos that are highly realistic and variable in terms of motion and appearance for multiple signers. The limiting factor of this approach is the low resolution of $128\times 128$ pixels. We therefore investigate a different variant of our network to produce HD sign videos in Sect. 4.5.

4.4 Spoken Language to Sign Language Translation

In this section we test the full translation pipeline: going from spoken language sentences to sign language video translations. We translate from German to German Sign Language (DGS). Our test data is taken from the PHOENIX14T test dataset. Our Motion Graph (MG) is built from extracting OpenPose skeletal information from the PHOENIX14T training set. The obtained OpenPose extraction was prone to errors due to the small resolution of the PHOENIX14T data ($260\times 210$ pixels). It did not scale to the $1080\times 720$ resolution required for conditioning the HD generator. We therefore only test our full pipeline using the MS generator, as it is better aligned in scale with the PHOENIX14T data.

We depict results for four translations. For all cases the input for our translation is a German sentence. The resulting gloss and sign video translations are given in Figs. 11, 12, 13 and 14. The beam search over the MG provides the motion sequences that incorporate the translation from spoken language text. These pose sequences then condition the sign generation network. Transitions between sequences are added dynamically. We give representative frames for the generated sequences, indicating which glosses they belong to.

For sequence 1 in Fig. 11 the NMT network correctly translates to a German gloss sequence which corresponds to the ground truth. The overall motion of the arms and hands is consistent with the video ground truth. The signers’ appearances are clearly distinguishable from one another. Signer 1 stays closest to the ground truth, having the most developed arms and hands. Signer 2 struggles to fully form the right arm at times. This might be because this signer was left-handed in the original dataset, so less right-handed motion was observed during training. Signer 3 has under-developed hands, something that is consistent across frames and sequences, indicating a failure in conditioning.

Sequence 2 (see Fig. 12) also correctly translates the original input sentence. On top of the observations made for sequence 1, we notice a failure case for the gloss WECHSELHAFT (CHANGING). The sign for this gloss in DGS is a repeated left to right motion of both arms in front of the body (see the last three frames of the ground truth in Fig. 12). In the generated sequence, the arms are in front of the body, but remain in the centre. We assume that this is due to a failure in the time alignment: averaging key points of motion to the left with those of motion to the right results in hands positioned in the centre.

Results for sequence 3 in Fig. 13 are in accordance with the first sequence. The sequence of glosses predicted matches the ground truth. Again signer 1 stays closest to the original motion sequence, even encapsulating the subtle difference in right hand position (not hand shape) between glosses GUT (GOOD) and LIEB (DEAR).

Sequence 4 is longer than previous examples, and contains one translation error (see Fig. 14). The positions of arms and hands are consistent with the ground truth for the first four glosses, before encountering the error in gloss prediction. The motion for the last gloss WIND (WIND) is slightly under-articulated in contrast to the ground truth.

Overall, the movement of signers is smooth and consistent with the glosses they represent, but not as expressive as the ground truth. We suspect that the limited motion stems from the averaging of all example sequences for a gloss to generate one mean sequence. To our knowledge the timing information for all glosses was automatically extracted from the PHOENIX14T data by the creators of the dataset using a Forced Alignment approach. It is therefore reasonable to assume that the provided timings contain errors, which negatively affect the mean sequence. Additionally, for most signs more than one variation exists, but this is not annotated in the dataset, nor is the use of left or right as the dominant hand. This further diminishes the motion of the mean sequences.

For future experiments an averaging and data cleaning process needs to be developed that takes into account variability in speed, expression, and left versus right-handed signing. To improve the quality of extracted pose information, and to add additional conditioning for hands and faces, we need datasets of high image resolution. For translation we require sign language datasets that have topic-comment alignment. If both are combined, it would be possible to avoid the heavy cost of manually annotating details in sign motion such as facial expression and still get rich, natural translations.

4.5 High Definition Continuous Sign Generation

To improve the resolution and sharpness of our sign generation we generate HD continuous sign language video using the HD signing network. The network is conditioned with semantic label maps encoding human poses. We evaluate two configurations: a network conditioned only on 15 upper body joints (as was used in the MS network), and a network conditioned on the same 15 joints, plus 21 key points for each hand, and 68 key points for the face. For details see Fig. 15. We trained for 16 epochs over 19,850 frames and corresponding label maps. For both models we report an average time of 0.42 s per image generated during inference using a GeForce GTX TITAN X.

Quantitative as well as qualitative results are provided. As with MS generation, we report the mean SSIM, PSNR, and MSE, this time over a test set of 500 frames, for just the pose input (HDSp) and pose, hands, and face input (HDSphf) in Table 4. The results indicate that more detailed conditioning with pose, hands, and face key points produces synthetic images that are closer to the ground truth. However, the difference in scores is not as significant as might be expected.

Table 4 Mean SSIM, PSNR, and MSE values over the test set, comparing synthetic images to their ground truth

Looking at example frames we can see that both HDSp and HDSphf create synthetic images that closely resemble their ground truth in both overall structure and detail such as clothing, overall facial expression and hand shape (see Fig. 16). However, HDSphf clearly surpasses HDSp for details of the generated hands and facial features. Whereas both networks learn to generate realistic hands and faces, HDSp can generate the wrong hand shape (see middle column in Figs. 16 and 17), as it does not receive the positional information for all the finger joints, but merely an overall position of the hand.

Overall our results indicate that it is possible to generate highly realistic and detailed synthetic sign language videos, given sufficient positional information. A compromise can be found that keeps the annotation effort minimal (like using an automatic pose detector), whilst maintaining realism and expressiveness in the synthetic sign video.

5 Conclusions

In this paper, we presented the first spoken language-to-sign language video translation system. While other approaches rely on motion capture data and/or the complex animation of avatars, our deep learning approach combines an NMT network with a Motion Graph (MG) to produce human pose sequences. This conditions a sign generation network capable of producing sign video frames.

The NMT network’s predictions can successfully be used to solve the MG, resulting in consistent text2pose translations. We show this by analysing example text2pose sequences, and by providing qualitative and quantitative results for an intermediate text2gloss representation. With our multi-signer (MS) generator we are able to produce multiple signers of different appearance. We show this for isolated signs, and as part of our text2sign translation approach.

Additionally we investigated the generation of HD continuous sign language video. Our results indicate that it is possible to produce photo-realistic video representations of sign language, by conditioning on key points extracted from training data. The accuracy and fidelity of key points seems to play a vital role, reinforcing the need for datasets of sufficient resolution.

Currently our text2sign translation system cannot compete with existing avatar approaches. Due to the low resolution of our translation training data, our results do not have the output resolution and expressiveness obtained by motion capture and avatar-based approaches. However, we have outlined that continuous, realistic sign language synthesis is possible, using minimal annotation. For training we only require text and gloss-level annotations, as skeletal pose information can be extracted from video automatically using an off-the-shelf solution such as OpenPose (Cao et al. 2017). In contrast, avatar-based approaches require detailed annotations using task-specific transcription languages, which can only be carried out by expert linguists. Animating the avatar itself often involves a considerable amount of hand-engineering, and the results thus far remain robotic and under-articulated. Motion capture-based approaches require high-fidelity data, which needs to be captured, cleaned, and stored at considerable cost, limiting the amount of data available, hence making this approach unscalable. We believe that in time our approach will enable highly-realistic, and cost-effective translation of spoken languages to sign languages, improving equal access for the Deaf and Hard of Hearing.

For future work, our goal is to combine the MS and HD sign generation capabilities to synthesize highly detailed sign video, with signers of arbitrary appearance. The MS generator’s ability to account for spatial and appearance changes, in combination with the high resolution of the HD generator, would enable us to synthesize highly realistic and expressive sign language video. Additionally, we plan to improve our current MG by developing a data-processing strategy that pays attention to the intricate features of sign language data, such as size of motion and speed. This means replacing the current use of mean sequences with a more thoughtful approach that takes into account the likelihood of an example sequence being correct, the skeletal composition of different signers, and their dominant hand. We further plan to train our text2sign system end-to-end, and develop a performance metric to further quantitatively analyse the performance of our SLP system. Going forward, as progress is made in sign video-to-spoken-language translation, this could be used as a quantitative evaluation in itself or possibly as part of a cycle GAN approach.

Notes

Glosses are lexical entities that represent individual signs.
The uncanny valley is a concept aimed at explaining the sense of unease people often experience when confronted with simulations that closely resemble humans, but are not quite convincing enough.

References

Ahn, H., Ha, T., Choi, Y., Yoo, H., & Oh, S. (2018). Text2action: Generative adversarial synthesis from language to action. In IEEE international conference on robotics and automation (ICRA).
Arikan, O., & Forsyth, D. A. (2002). Interactive motion generation from examples. In Proceedings of the 29th annual conference on computer graphics and interactive techniques, SIGGRAPH ’02 (pp. 483–490). ACM, New York, NY.
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Bangham, J. A., Cox, S. J., Elliott, R., Glauert, J. R. W., Marshall, I., Rankov, S., & Wells, M. (2000). Virtual signing: Capture, animation, storage and transmission-an overview of the visicast project. In IEE Seminar on speech and language processing for disabled and elderly people (Ref. No. 2000/025) (pp. 6/1–6/7).
BDA: British Deaf Association (2019). BSL statistics. https://bda.org.uk/help-resources/#statistics. Accessed 16 Nov 2019.
Bowden, R., Zisserman, A., Hogg, D., & Magee, D. (2016). Learning to recognise dynamic visual content from broadcast footage. https://cvssp.org/projects/dynavis/index.html. Accessed 1 Nov 2018.
Camgoz, N. C., Hadfield, S., Koller, O., Ney, H., & Bowden, R. (2018). Neural sign language translation. In IEEE Conference on computer vision and pattern recognition (CVPR).
Cao, Z., Simon, T., Wei, S., & Sheikh, Y. (2017). Realtime multi-person 2d pose estimation using part affinity fields. In 2017 IEEE Conference on computer vision and pattern recognition (CVPR) (Vol. 00, pp. 1302–1310).
Chan, C., Ginosar, S., Zhou, T., & Efros, A. A. (2018). Everybody dance now. CoRR arXiv:1808.07371.
Chen, Q., & Koltun, V. (2017). Photographic image synthesis with cascaded refinement networks. In ICCV (pp. 1520–1529). IEEE Computer Society.
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1724–1734). Association for Computational Linguistics.
Chung, J., Gülçehre, Ç., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR arXiv:1412.3555.
Cox, S., Lincoln, M., Tryggvason, J., Nakisa, M., Wells, M., Tutt, M., & Abbott, S. (2002). Tessa, a system to aid communication with deaf people. In Proceedings of the 5th international ACM conference on assistive technologies (pp. 205–212). ACM.
Ebling, S., Camgoz, N.C., Braem, P., Tissi, K., Sidler-Miserez, S., Stoll, S., Hadfield, S., Haug, T., Bowden, R., Tornay, S., Razavi, M., & Magimai-Doss, M. (2018). Smile Swiss German sign language dataset. In 11th Edition of the language resources and evaluation conference (LREC).
Ebling, S., & Glauert, J. (2013). Exploiting the full potential of JASigning to build an avatar signing train announcements. In 3rd International symposium on sign language translation and avatar technology.
Ebling, S., & Huenerfauth, M. (2015). Bridging the gap between sign language machine translation and sign language animation using sequence classification. In SLPAT@Interspeech.
Efthimiou, E. (2012). The dicta-sign wiki: Enabling web communication for the deaf. In K. Miesenberger, A. Karshmer, P. Penaz, & W. Zagler (Eds.) Computers helping people with special needs. ICCHP 2012. Lecture notes in computer science (Vol. 7383). Springer, Berlin, Heidelberg.
Elwazer, M. (2018). Kintrans. http://www.kintrans.com/. Accessed 12 Nov 2018.
EU: European Parliament (2018). Sign languages in the EU. http://www.europarl.europa.eu/RegData/etudes/ATAG/2018/625196/EPRS_ATA(2018)625196_EN.pdf. Accessed 16 Nov 2019.
Forster, J., Schmidt, C., Koller, O., Bellgardt, M., & Ney, H. (2014). Extensions of the sign language recognition and translation corpus RWTH-PHOENIX-Weather. In Language resources and evaluation (pp. 1911–1916). Reykjavik.
Gibet, S., Lefebvre-Albaret, F., Hamon, L., Brun, R., & Turki, A. (2016). Interactive editing in french sign language dedicated to virtual signers: Requirements and challenges. Universal Access in the Information Society, 15(4), 525–539.
Glauert, J., Elliott, R., Cox, S., Tryggvason, J., & Sheard, M. (2006). VANESSA-A system for communication between Deaf and hearing people. Technology and Disability, 18(4), 207–216.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. C., & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems 27: Annual conference on neural information processing systems 2014, December 8-13 2014 (pp. 2672–2680). Montreal, Quebec.
Gregor, K., Danihelka, I., Graves, A., Rezende, D., & Wierstra, D. (2015). Draw: A recurrent neural network for image generation. In F. Bach, & D. Blei (Eds.) Proceedings of the 32nd international conference on machine learning, Proceedings of Machine Learning Research (Vol. 37, pp. 1462–1471). PMLR, Lille.
Guo, D., Zhou, W., Li, H., & Wang, M. (2017). Online early-late fusion based on adaptive hmm for sign language recognition. ACM Transactions on Multimedia Computing, Communications, and Applications, 14(1), 8:1–8:18. https://doi.org/10.1145/3152121.
Guo, D., Zhou, W., Li, H., & Wang, M. (2018). Hierarchical LSTM for sign language translation. In AAAI.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735.
Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In 2017 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5967–5976).
Johnson, J., Alahi, A., & Fei-Fei, L. (2016). Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision.
Kalchbrenner, N., Espeholt, L., Simonyan, K., van den Oord, A., Graves, A., & Kavukcuoglu, K. (2016). Neural machine translation in linear time. CoRR arXiv:1610.10099.
Kennaway, R. (2013). Avatar-independent scripting for real-time gesture animation. CoRR arXiv:1502.02961.
Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. CoRR arXiv:1312.6114.
Kipp, M., Héloir, A., & Nguyen, Q. (2011). Sign language avatars: Animation and comprehensibility. In IVA.
Kovar, L., Gleicher, M., & Pighin, F. (2002). Motion graphs. In Proceedings of the 29th annual conference on computer graphics and interactive techniques, SIGGRAPH ’02 (pp. 473–482). ACM, New York, NY.
Larsen, A. B. L., Sønderby, S. K., Larochelle, H., & Winther, O. (2016). Autoencoding beyond pixels using a learned similarity metric. In Proceedings of the 33rd international conference on machine learning—Volume 48, ICML’16 (pp. 1558–1566). JMLR.org.
Lee, J., Chai, J., Reitsma, P. S. A., Hodgins, J. K., & Pollard, N. S. (2002). Interactive control of avatars animated with human motion data. In Proceedings of the 29th annual conference on computer graphics and interactive techniques, SIGGRAPH ’02 (pp. 491–500). ACM, New York, NY.
Lee, J., & Shin, S. Y. (1999). A hierarchical approach to interactive motion editing for human-like figures. In SIGGRAPH.
Luong, T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. In Conference on empirical methods in natural language processing (EMNLP).
Ma, L., Jia, X., Sun, Q., Schiele, B., Tuytelaars, T., & Van Gool, L. (2017). Pose guided person image generation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.) Advances in neural information processing systems (Vol. 30, pp. 406–416). Curran Associates, Inc.
Makhzani, A., Shlens, J., Jaitly, N., & Goodfellow, I. (2016). Adversarial autoencoders. In International conference on learning representations.
McDonald, J., Wolfe, R., Schnepp, J., Hochgesang, J., Jamrozik, D. G., Stumbo, M., et al. (2016). An automated technique for real-time production of lifelike animations of American Sign Language. Universal Access in the Information Society, 15(4), 551–566.
Min, J., & Chai, J. (2012). Motion graphs++: A compact generative model for semantic motion analysis and synthesis. ACM Transactions on Graphics, 31(6), 153:1–153:12.
Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. CoRR arXiv:1411.1784.
Mori, M., MacDorman, K., & Kageki, N. (2012). The uncanny valley [from the field]. IEEE Robotics & Automation Magazine, 19, 98–100.
Oord, A. V., Kalchbrenner, N., & Kavukcuoglu, K. (2016). Pixel recurrent neural networks. In M. F. Balcan, & K. Q. Weinberger (Eds.) Proceedings of the 33rd international conference on machine learning, proceedings of machine learning research (Vol. 48, pp. 1747–1756). PMLR, New York, New York.
Perarnau, G., van de Weijer, J., Raducanu, B., & Álvarez, J. M. (2016). Invertible conditional GANs for image editing. CoRR arXiv:1611.06355.
Prillwitz, S. (1989). HamNoSys. Version 2.0. Hamburg notation system for sign languages. An introductory guide. Hamburg: Signum Press.
Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR arXiv:1511.06434.
Reed, S. E., Akata, Z., Mohan, S., Tenka, S., Schiele, B., & Lee, H. (2016). Learning what and where to draw. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.) Advances in neural information processing systems (Vol. 29, pp. 217–225). Curran Associates, Inc.
Robotka, Z. (2018). Signall. http://www.signall.us/. Accessed 12 Nov 2018.
Savitzky, A., & Golay, M. J. E. (1964). Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry, 36(8), 1627–1639.
Siarohin, A., Sangineto, E., Lathuilière, S., & Sebe, N. (2018). Deformable GANs for pose-based human image generation. In IEEE Conference on computer vision and pattern recognition (pp. 3408–3416). Salt Lake City, United States.
Stoll, S., Camgoz, N. C., Hadfield, S., & Bowden, R. (2018). Sign language production using neural machine translation and generative adversarial networks. In British machine vision conference (BMVC).
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems.
van den Oord, A., Kalchbrenner, N., Espeholt, L., Kavukcuoglu, K., Vinyals, O., & Graves, A. (2016). Conditional image generation with PixelCNN decoders. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.) Advances in neural information processing systems (Vol. 29, pp. 4790–4798). Curran Associates, Inc.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. CoRR arXiv:1706.03762.
Virtual Humans Group. (2017). Virtual humans research for sign language animation. http://vh.cmp.uea.ac.uk/index.php/Main_Page.
Wang, S., Guo, D., Zhou, W. G., Zha, Z. J., & Wang, M. (2018a). Connectionist temporal fusion for sign language translation. In Proceedings of the 26th ACM international conference on multimedia, MM ’18 (pp. 1483–1491). ACM, New York. https://doi.org/10.1145/3240508.3240671.
Wang, T. C., Liu, M. Y., Zhu, J. Y., Tao, A., Kautz, J., & Catanzaro, B. (2018b). High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4), 600–612.
WHO: World Health Organization (2018). Deafness and hearing loss. http://www.who.int/mediacentre/factsheets/fs300/en/. Accessed 21 Nov 2018.
Yan, X., Yang, J., Sohn, K., & Lee, H. (2016). Attribute2image: Conditional image generation from visual attributes. In ECCV (4). Lecture Notes in Computer Science (Vol. 9908, pp. 776–791). Springer.
Zwitserlood, I., Verlinden, M., Ros, J., & Schoot, S. V. D. (2005). Synthetic signing for the deaf: eSIGN. http://www.visicast.cmp.uea.ac.uk/Papers/Synthetic%20signing%20for%20the%20Deaf,%20eSIGN.pdf.

Acknowledgements

This work was funded by the SNSF Sinergia project “Scalable Multimodal Sign Language Technology for Sign Language Learning and Assessment” (SMILE) grant Agreement No. CRSII2 160811, the European Union’s Horizon 2020 research and innovation programme under grant Agreement No. 762021 (Content4All) and the EPSRC Project ExTOL (EP/R03298X/1). We would also like to thank NVIDIA Corporation for their GPU grant, and Oscar Koller at Microsoft.

Author information

Authors and Affiliations

Centre for Vision, Speech and Signal Processing, Guildford, UK
Stephanie Stoll, Necati Cihan Camgoz, Simon Hadfield & Richard Bowden

Corresponding author

Correspondence to Stephanie Stoll.

Additional information

Communicated by Ling Shao, Hubert P. H. Shum, Timothy Hospedales.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (mp4 17224 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Stoll, S., Camgoz, N.C., Hadfield, S. et al. Text2Sign: Towards Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks. Int J Comput Vis 128, 891–908 (2020). https://doi.org/10.1007/s11263-019-01281-2

Received: 17 December 2018
Accepted: 10 December 2019
Published: 02 January 2020
Issue Date: April 2020
DOI: https://doi.org/10.1007/s11263-019-01281-2
