1 Introduction

Sign language technology (SLT) has become a prominent research area for the computer vision and natural language processing (NLP) communities over the last 30 years [55, 86, 95]. Initial progress has been made on technologies that can aid communication between hearing and deaf communities. However, common mistakes have held the field back. As interest in this research area expands, we believe that best practices must be established to enable effective, continued, and long-lasting progress.

In this paper, we detail the most prominent issues that regularly arise in the current SLT research landscape. Often researchers do not fully appreciate the complexity of sign languages and the importance of the deaf community (Sect. 2). There has been a lack of deaf involvement in SLT projects (Sect. 3). SLT research has focused on ‘problems’ identified by hearing non-signers that are not actually problems at all, whilst some have proposed tools/advancements that have been enormously over-hyped by the media (Sect. 4). The data available for use in SLT have also been limited (Sect. 5), with an as yet unmet requirement for continuous, diverse sign language datasets. Finally, the complexity of sign language translation has not been fully recognised, as multiple intermediary tasks must be tackled before this can be automated (Sect. 6).

To meet each of these challenges, this paper suggests practical steps, laying out best practice recommendations for SLT research. We hope this work can help establish effective guidelines for both new researchers and incumbents in the field, enabling meaningful progress. The main body of this paper describes the five points of consideration in more detail (Sects. 2–6) with conclusions in Sect. 7.

Before we begin, we wish to provide some context. We are a team of deaf (NF) and hearing (BW, KC) sign language researchers. We are part of and/or have worked closely with British deaf communities for many years, and we are all fluent signers. Because we work primarily on British Sign Language (BSL), most of our observations here relate to BSL, but most hold true for SLT relating to other sign languages as well.

2 Learn about sign languages and deaf people

Sign languages are the languages developed in and used by deaf communities [81]. There are many different sign languages in the world, each with its own grammar and lexicon. The differences in the communicative channels used by spoken and sign languages result in differences in their linguistic structures. For example, spoken languages have access to only a single set of primary articulators (mouth, tongue, lips, teeth), while sign languages have two independent primary articulators (the two hands) and are thus able to make much greater use of simultaneous, rather than linear, grammar [91]. Additionally, in sign languages, communication is necessarily expressed both manually (hands) and non-manually (face and body poses) [84]. Fingerspelling (manual alphabets) is used within sign languages to represent the letters of the ambient written language for specific purposes, such as rendering proper names [65]. Fingerspelled words are distinct from the sign language lexicon, which is itself independent of the lexicon of the surrounding spoken/written language [85]. Understanding these complex linguistic features of sign languages is essential in conducting effective SLT research.

In sign languages studied to date, lexical signs are the most frequent form of sign—these are signs with fairly conventional form and meaning, which can be expressed via one or more ‘translation equivalent’ words in another language [50] (although, just as with translation between any two languages, there is often no one-to-one correspondence between signs and words). But even lexical signs are produced less than 75% of the time in signed discourse [33]. Much of signed discourse involves pointing and/or depiction. Both pointing and depiction are context-dependent and involve some degree of improvisation. Pointings and depictions rarely look the same or mean the same thing more than once in any signed discourse, which makes them difficult to deal with in a machine learning context [48]. Their unconventionality means that in SLT they are treated as single sign tokens (see Sect. 5 on single tokens).

In addition to learning about how sign languages work, gaining basic deaf awareness is a minimal requirement for researchers in the field [4]. Some assume wrongly that deaf people have the same challenges as people with various disabilities, while others assume that deaf people have the same cultural norms as hearing people. Learning about deaf communities and the different ways in which deaf people view the world is fundamental to producing valid sign language research. Researchers also need to learn how deaf people do and do not refer to themselves, in order to avoid offensive terminology [12]. For example, terms such as ‘deaf and dumb’ and ‘deaf-mute’ are completely unacceptable and their use in sign language technology research has led to retractions by publishers (see, e.g. [39, 58]). In addition, referring to sign languages as ‘gestures’, ‘mimicry’ or ‘communication tools’, or being ‘specifically developed for [deaf] people’ (as in, e.g. [6, 8, 45, 87]; and many others) are inaccurate and offensive ways to talk about natural human languages. Börstell [9] has shown that this problem of ableist language use when referring to sign languages and deaf communities is far more prevalent in the field of technology than other fields like linguistics, education, and health—reflecting low levels of deaf awareness and deaf involvement in SLT research.

3 Involve deaf people in research

The ultimate aim of SLT research is to develop technology for the deaf community, to aid communication and accessibility. It follows that deaf people must be involved in the research itself [10]. Deaf perspectives bring the community engagement that a successful project should seek to include. Ideally, deaf people should be involved at every level, including in the planning stages before any work begins [25, 64], yet few projects and publications reflect this level of deaf involvement. Exceptions include work by Vogler and Metaxas [92, 93], Padden and Gunsauls [65], Cormier, Fox, et al. [23], Glasser et al. [40], and EU projects such as EASIER (https://www.project-easier.eu/) and SignOn (https://signon-project.eu/), which have involved deaf organisations at every stage and deaf lay audiences in user testing. One weakness of many projects to date is that engagement has happened too late, after the main development work has taken place, so the perspective becomes one of reporting back to the community rather than ascertaining whether the community considers the project worthwhile in the first place [34].

One danger of involving deaf people in SLT research only minimally is tokenism; this can be avoided by aiming for allyship instead [46]. To be an ally is to work towards improving deaf representation in the research in various ways: not just as participants, but also as researchers, advisors, and investigators. In areas where deaf people are underrepresented in these roles, hearing allies should recruit and train them so that they can be leaders in the future. It should be an aim of the SLT research community to provide not only equal training opportunities for deaf researchers, but also additional training and fast-track possibilities where funding allows, to enable professional development, including via non-traditional routes. Such opportunities apply not just to the day-to-day running of research projects but also to presenting the research, e.g. in publications and conference participation. In these contexts, visibility is key, and hearing allies can play a role in shaping this.

For example, hearing researchers invited to contribute to a publication, conference, or keynote focused on SLT should encourage the inclusion of deaf colleagues, for instance by making interpreting the default provision at sign language conferences (see, e.g. [38]) and by giving deaf researchers space and time to showcase their work. Additionally, any workshops and conferences covering sign languages that have no deaf invited speakers or deaf authors in their proceedings should be viewed as not deaf-inclusive.

4 Consider the reasons for carrying out the research

When conducting SLT research, ask yourself this: ‘What problem am I trying to solve? Is it actually a problem?’. Technology is never going to solve the problem of deaf signers and hearing non-signers understanding each other [41], but it can be used to develop tools to help towards this end [10]. If you are developing a tool, who exactly would use it, and for what purpose [59]? By engaging early with deaf people and deaf communities [60, 80], research can better meet their needs and preferences. Some topics that could genuinely benefit deaf people have received insufficient attention from the research community, while some technologies, such as ‘data gloves for deaf people’, at best have no practical purpose at all [30, 47] and at worst ‘perpetuate cultural appropriation and audism’ [35].

Another problem is that many school and college projects touted as technology that will help deaf people communicate with the hearing community are initiated almost exclusively by hearing people. The tools developed from these projects are clearly only prototypes, often dealing with limited aspects of communication among deaf people (e.g. recognition of fingerspelled handshapes [51, 77] or of signs in isolation [57]), and receive no further development. More importantly, they often serve no useful purpose to deaf people at all. Despite this, because they appear innovative to hearing non-signers, such projects attract publicity and funding.

In addition to attracting funding, technological projects of this type are often picked up by the media and presented as technology that will remove barriers to communication between deaf and hearing people [11, 20, 21]. Media hype nearly always ends up alienating the deaf community because it comes from a mainly hearing perspective. Just as researchers need deaf perspectives, so do the media. This too would be improved with more deaf people involved in the research from the beginning. These responsibilities should also be shared with the funding bodies and their vetting processes. If funding bodies were obliged to ensure that their resources are appropriately allocated, deaf participation would increase and deaf perspectives would be more realistically reflected. The focus would thus shift from research as a self-perpetuating enterprise to one that aims to provide benefit to the community.

Despite the criticisms outlined above, there are some good candidates for useful SLT: for example, the deaf community might well welcome increased access to smart assistants/home control systems such as Siri and Alexa [31], or the ability to search signed videos [29], or a signed wiki [40]. Unfortunately, without awareness in private sector R&D departments that the needs of deaf people may be fundamentally different to those of hearing people, progress is unlikely.

5 Consider the type of source data needed

Sign language corpora exist for a growing number of sign languages around the world [26, 27, 44, 75, 100]. The sources and uses of these corpora are varied: continuous, natural, studio-recorded datasets originally designed for linguistic use [44, 63, 75], project-specific isolated studio-recorded datasets [16, 26], or sign-interpreted broadcast footage [1, 3, 13, 15, 18, 22, 36], to name a few.

The suitability of each type of dataset for SLT research on recognition and output must be considered before use. Important factors include [10]: the diversity of signers represented in the data; variability across signers in age, proficiency, and age of acquisition; the size of the vocabulary; whether the data are isolated or continuous; whether the data come from laboratory recordings, the internet, or broadcast footage; and what types of annotation have been undertaken. Yin et al. [98] provide a detailed breakdown of further properties to consider when selecting an appropriate sign language dataset.

In addition, datasets can vary between a spoken language source (interpreted into a signed language, e.g. with a picture-in-picture interpreter) and a sign language source (interpreted via voice-over into a spoken language). The most widely used machine learning datasets for sign language have consisted of broadcast interpretations, most notably television weather reports [36, 53]. Although these have proved useful, there are concerns about whether it is appropriate to use datasets from such restricted domains of discourse with limited vocabulary size [16, 26], rather than the large domains found in spontaneous, natural signing [18, 44]. One disadvantage of very large domains such as spontaneous conversation, however, is that many signs are represented with only a handful of instances, which poses difficulties for data-hungry machine learning algorithms.
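To make this long-tail problem concrete, the minimal sketch below counts gloss frequencies in an annotation file; the file name and the ‘gloss’ column are hypothetical placeholders, but on conversational corpora such a count typically reveals that a large proportion of distinct glosses occur only once or twice.

```python
from collections import Counter
import csv

def gloss_frequencies(annotation_csv: str) -> Counter:
    """Count occurrences of each gloss label in a CSV with one row per annotated sign."""
    counts = Counter()
    with open(annotation_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[row["gloss"]] += 1  # 'gloss' column name is an assumption
    return counts

if __name__ == "__main__":
    freqs = gloss_frequencies("corpus_annotations.csv")  # hypothetical export
    singletons = sum(1 for c in freqs.values() if c == 1)
    print(f"{len(freqs)} distinct glosses; {singletons} appear only once")
```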

A critical issue in research of this sort is ensuring that the data used as source material represent the actual target of the analysis. Where material interpreted from English into BSL is used as the source data, the question arises of whether BSL produced by hearing and deaf interpreters is appropriate source material for developing automated translation from BSL to English. Additional questions arise in relation to automated translation from English to BSL. As with spoken languages, deaf fluent signers can and do make grammaticality and acceptability judgements, assessing whether other signers are fluent or not, whether they are native signers or not, and whether they use the language in everyday contexts. In this respect, there are three important questions to be addressed: (1) to what extent does scripted language (whether produced by hearing or deaf people) differ from the spontaneously produced BSL of deaf people?; (2) are there any differences between interpreted and spontaneously produced BSL?; (3) are there any differences between the interpreted BSL produced by hearing interpreters and that produced by deaf interpreters? This final question is of particular relevance to automatic translation of sign language facilitated by recognition of the mouthing patterns used by signers, since there is some evidence that hearing and deaf signers differ in their use of mouthing, both in amount and in form [66].

In the general interpreting/translation literature, there is recognition that translated or interpreted language differs from the source (whether spontaneous or scripted) not only in target language but also on a number of dimensions (so-called translation, or interpreting, ‘universals’ [5, 24, 37, 78, 79]). These differences include features such as a general tendency towards simplification; because they recur across different source and output texts in different languages, they have been termed ‘interpretese’.

Shlesinger and Ordan [79] compared three types of text: interpreted texts, manually transcribed from the spoken outputs of four professional interpreters working in conference settings; translated written texts in (approximately) the same domains, rendered by professional translators; and original semi-scripted speech in (approximately) the same domains by conference presenters. They found that interpreted texts exhibited far more similarities to original speech than to written translation, suggesting that interpretese is closer to spontaneous speech than to translated text. On the other hand, features characteristic of translation, such as simplification and a lower type-token ratio, were more salient in interpreted output than in spontaneous language.
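For readers unfamiliar with the type-token ratio mentioned above, the toy sketch below illustrates the measure (the example sentences are invented and not drawn from any corpus): a lower ratio indicates more lexical repetition, one of the simplification features associated with translated and interpreted output.

```python
def type_token_ratio(tokens: list[str]) -> float:
    """Number of distinct word types divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# A lexically varied utterance versus a repetitive one (toy data).
varied = "the interpreter rendered the speech into fluent signing".split()
repetitive = "the interpreter signed and signed and signed the speech".split()
print(type_token_ratio(varied))      # 0.875 (7 types / 8 tokens)
print(type_token_ratio(repetitive))  # ~0.556 (5 types / 9 tokens)
```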

There is little literature in the field addressing these questions in relation to sign language interpreting, although there is evidence that there are differences between hearing and deaf interpreters [83, 94]. Stone [83] addresses differences between hearing and deaf interpreters in the process of preparing a sign language interpretation from an English script, by examining prosodic features of these interpretations, for example in the use of non-manual features such as mouthing (with hearing interpreters more likely to produce multisyllabic mouthings). Additionally, signed translations can either be done from written scripts via autocue in real time or interpreted from a spoken language in real time. For broadcast television material, there are differences between deaf and hearing interpreters, although both produce their final version ‘live’. Hearing interpreters, although they have access to a written script to prepare, undertake limited preparation from these written forms and rely to a greater extent on hearing the spoken version to interpret in real time. In contrast, deaf interpreters prepare extensively from the written text, enabling them to create a translation rather than an interpretation, using the autocue to support the final version in real time [83].

Silent mouthing of words from the spoken language is another possible source of difference between deaf and hearing interpreters. It is known that mouthing differs along sociolinguistic parameters such as region, gender, age, nativeness, and level of education, even among deaf signers [7, 68]. No studies have explicitly explored mouthing differences between interpreters on the basis of hearing status, but this is a topic worthy of research.

Another issue in the choice of data source, regardless of whether the data are interpreted or not, is annotation. In order to computationally process sign language datasets, time-aligned, machine-readable annotation is necessary. For machine learning SLT research, these annotations should be accurate and exhaustive, with detailed segmentation and ideally a gloss label for each sign. However, this process is highly labour-intensive and requires fluent signers.
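As an illustration of what ‘time-aligned, machine-readable’ means in practice, the sketch below defines a minimal annotation record; the field names are hypothetical rather than a standard schema, but they mirror the kind of tiered, time-stamped output produced by annotation tools such as ELAN.

```python
from dataclasses import dataclass

@dataclass
class GlossAnnotation:
    """One time-aligned sign annotation (illustrative field names, not a standard)."""
    start_ms: int        # segment start, milliseconds from video start
    end_ms: int          # segment end
    gloss: str           # conventional gloss label, e.g. "BOOK"
    signer_id: str       # supports signer-diversity analyses
    tier: str = "right-hand"

# Two consecutive (slightly overlapping) annotations from one signer.
annotations = [
    GlossAnnotation(start_ms=1040, end_ms=1380, gloss="BOOK", signer_id="S01"),
    GlossAnnotation(start_ms=1300, end_ms=1720, gloss="READ", signer_id="S01"),
]
```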

It is also important to consider the extraction of data for computational processing, most commonly pose keypoints [19, 99]. Computer models are able to estimate 2D body pose accurately [19], but hand pose estimation, especially with two hands, is still very challenging [43]. Recent work by Moryossef et al. [62] has shown that human body pose estimation quality is potentially a limiting factor when used for SLT and requires further research. To optimise pose estimation results, datasets of higher quality and resolution must be adopted [18].
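As a concrete illustration of keypoint extraction, the sketch below uses MediaPipe Holistic, one openly available estimator rather than the specific systems cited above; the video file name is a placeholder, and per-frame results can be None when detection fails, which is one way the quality issues noted above surface in practice.

```python
import cv2
import mediapipe as mp

# Minimal sketch: per-frame body and hand keypoints with MediaPipe Holistic.
holistic = mp.solutions.holistic.Holistic(static_image_mode=False)

cap = cv2.VideoCapture("signing_clip.mp4")  # hypothetical input video
keypoints_per_frame = []
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    keypoints_per_frame.append({
        "body": results.pose_landmarks,              # 33 body landmarks, or None
        "left_hand": results.left_hand_landmarks,    # 21 landmarks, or None
        "right_hand": results.right_hand_landmarks,  # 21 landmarks, or None
    })
cap.release()
holistic.close()
```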

6 Recognise the challenges of automatic sign language analysis

Automatic translation between signed and spoken languages is the ultimate aim of many SLT projects [18, 23, 28], yet this task is incredibly complex [89]. The computer science community often underestimates the linguistic complexity of sign languages and treats automatic translation as a standard video-to-text/text-to-video problem or as similar to a simple gesture recognition/production problem [56, 90]. This oversimplifies translation models, leading to inaccurate end results and ultimately poor access for deaf people [52, 97].

There are substantial differences between an automated sign-to-spoken language translation process and automated translation between two spoken languages. Spoken language translation can involve speech-to-text as a first stage, followed by translation from source language text to target language text, followed by text-to-speech. Sign languages lack a written form [84] and must be represented in a continuous format for computation [71], in contrast to the discrete representation of written language. Bespoke architectures specific to sign language are therefore required.
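To make this contrast concrete, the sketch below shows the bare shape of a model that consumes continuous video features and emits discrete text tokens. It is a minimal, generic encoder-decoder illustration, not any specific published architecture, and the feature dimension, vocabulary size, and input features are assumptions for the example.

```python
import torch
import torch.nn as nn

class VideoToTextTranslator(nn.Module):
    """Continuous video features in, discrete spoken-language tokens out (sketch)."""
    def __init__(self, feat_dim=512, d_model=256, vocab_size=8000):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, d_model)       # continuous frame features
        self.tok_embed = nn.Embedding(vocab_size, d_model)  # discrete target tokens
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, target_tokens):
        # video_feats: (batch, frames, feat_dim); target_tokens: (batch, length)
        src = self.feat_proj(video_feats)
        tgt = self.tok_embed(target_tokens)
        causal = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=causal)
        return self.out(hidden)  # (batch, length, vocab_size)

model = VideoToTextTranslator()
feats = torch.randn(2, 100, 512)          # e.g. 100 frames of pre-extracted features
tokens = torch.randint(0, 8000, (2, 12))  # target sentence token ids
logits = model(feats, tokens)
```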

In addition, when tackling sign language translation, the natural variability in human translation must be captured: there is more than one way to translate an utterance between a spoken and a signed language, just as between spoken languages. Many translations are equally valid, though some may be judged better or more accurate than others. It is important that the evaluation of computational models incorporates such judgements of accuracy so that this natural variability is taken into account (a minimal illustration follows below). Currently, the most common SLT research areas are sign recognition: the recognition of isolated lexical signs from a video [42, 57]; sign language translation: the translation of sign language videos to continuous spoken language [17, 18]; and sign language production: the generation of sign language content from spoken language [73, 82, 96].
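As a small illustration of accommodating multiple valid translations during evaluation, the sketch below scores a system output against two reference translations per sentence using the sacrebleu library; the sentences are invented, and automatic metrics of this kind complement rather than replace human judgements of accuracy.

```python
import sacrebleu

# Hypothetical system outputs and two equally valid references per sentence.
system_output = [
    "the weather will be sunny tomorrow",
    "she asked where the meeting is",
]
references = [
    ["the weather will be sunny tomorrow", "it will be sunny tomorrow"],
    ["she asked where the meeting is", "she wanted to know where the meeting is"],
]
# sacrebleu expects one reference stream per reference set, aligned by sentence.
ref_streams = [list(stream) for stream in zip(*references)]
bleu = sacrebleu.corpus_bleu(system_output, ref_streams)
print(bleu.score)
```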

Although recognition is a logical first step when tackling a full automated translation task, any application of isolated sign recognition has limited use for deaf people. Isolated sign recognition is useful for some tasks, such as searching for individual signs in videos and dictionaries, but not for recognising sign language discourse. The continued focus on isolated recognition is indicative of a lack of progress in the field [10, 53] and a lack of understanding of sign language and of deaf needs. Although continuous sign translation [15] and production [71] are much harder tasks, they are considerably more helpful as tools. SLT research must turn towards continuous translation and production to progress.

However, before unconstrained sign language translation can be achieved, there are multiple additional intermediary tasks in sign language processing that must be tackled. Current intermediate problems include, but are not limited to, active signer detection [2], subtitle alignment [14], sign segmentation [69], visual anonymisation [70, 72], visual representation learning [3], continuous recognition [54], sign animation [74], sign spotting [61, 89], fingerspelling detection [67, 76], detailed 3D human shape estimation [32, 49], facial expressions, head pose and body movements [88], and multi-signer scenarios [64].

7 Conclusions

In this paper, we have outlined the current state of sign language technology (SLT) research, arguing that progress has been hindered by five prominent issues. To tackle this, we have proposed best practices every researcher should consider when conducting SLT research. We hope the insights provided here will enhance progress and value in the field for both hearing and deaf people.