1 Introduction

As a general notion, prosody includes relative prominence, rhythm, and timing of articulations. Prosodic markers indicate which units are grouped together and serve as cues for parsing the signal for comprehension. Speech has a Prosodic Hierarchy [1]: (smallest to largest) Syllable < Prosodic Word < Prosodic Phrase < Intonational Phrase. The variables used to cue these groupings for parsing are pitch (fundamental frequency), intensity (amplitude), and duration. For example, phrase-final words are longer than non-final words (Phrase Final Lengthening). How phrases are constructed depends on syntax, information status (old vs. new), stress (affected by information status), speaking rate, situation formality (articulation distinctness), etc. [2, 3].

Over time, ASL has accommodated to the production and perception requirements of the manual/visual modality, developing a prosodic system similar in function to spoken languages but different in means of expression. Thus, the signed signal variables are displacement, time, and velocity (v = d/t), and derivatives thereof. In addition, there is considerable simultaneous transmission of information in the signal, which makes traditional notions from parsing speech less useful to us [2]. The hands are not the sole articulators: multiple meaningful articulations are possible from parts of the face and positions of the head and body, collectively known as nonmanuals (NMs). ASL carefully coordinates hands and NMs so that NMs can perform intonational functions, pausing and sign duration changes can indicate phrasing and rhythmic structure, and phrasal prominence (stress) can be signaled by higher peak velocity than unstressed signs [5].

Despite differences, the role of prosody is as critical in signing as it is in speech, and there are many similarities between them. Like speech, ASL has Phrase Final Lengthening. In ASL there is also good correspondence between syntactic breaks and prosodic breaks [2]. Like speech, ASL syntax does not predict all prosodic domains [3, 7, 8], with information structure and signing rate [5] strong influences. Finally, the Prosodic Hierarchy holds for ASL [4, 9]. These functional similarities entail that the absence of prosodic cues in animated signing will be as unacceptable, and potentially as difficult to understand, as robotic speech lacking cues to phrasing, stress, and intonation. We describe ASL-pro, a newly developed capability for adding prosody to animated signing, along with the testing conducted, in progress, and planned. Animation algorithms must be sensitive to relative values of prosodic elements (pauses, sign lengthening), not just absolute or local values.

2 ASL Prosody

Only a few of the articulators involved in speech production are visible, and except for speechreading, what is visible is generally not relevant. In contrast, the entire body is visible during signing, so the signal must cue the viewer that linguistic information is being transmitted and how to parse it. The complexity of the signal is reflected by the fact that there are 14 potential NM articulators with multiple positions which can occur simultaneously: body (leans, shifts), head (turn, nod, tilt, shake), brows (up/down, neutral), eyelids (open, squint, closed), gaze (up/down, left/right), nose (crinkle), cheeks (puff, suck), lips (round, flat, other), lip corners (up/down, stretched), tongue (out, touch teeth, in cheek, flap, wiggle), teeth (touch lip/tongue, clench), chin (thrust) [2]. The non-signing hand may also serve as a phrasal prosodic marker. The NM system has evolved to avoid interference among articulations in production or perception. For example, facial articulations can be subdivided into upper face/head articulations [2, 6], which occur with larger clausal constituents, and lower face articulations, which occur with smaller phrasal constituents (nouns, verbs) to provide adverbial/adjectival information. These articulations are under tight control: ASL negative headshakes begin and end precisely with the negated content (within 15–19 ms), showing that they are grammatical, not affective, in contrast to speakers, who start and stop headshakes without concern for English syntactic constituency.

Unfortunately, animators do not have the ASL linguistics knowledge needed to add prosody. Worse, even if they did, they would have to manually modify the code for each articulator and movement for each sign. Thus, our interface represents a leap forward by combining findings on ASL prosody with algorithms for easily adding predictable prosody to animated signing. Figure 1 shows a comparison between a real signer, an animated avatar signing with prosody, and an animated avatar signing without prosody.

Fig. 1.

Video, animation with prosody, and animation without prosody of the same sentence. Row A gives the gloss encoding, B the phrasing, C the English translation, D, E, and F the prosodic elements rendered with the brows, head, and mouth, respectively, G the ASL predictive structure, and H the ASL-pro notation mark and its effects. The animation with prosody was generated by an expert animator in several days using a professional-grade entertainment-industry computer-animation system. The goal of our research is to enable the generation of ASL animation with prosody automatically.

2.1 ASL Prosodic Elements We Predict with Certainty - Linguistic Rules on Which the Algorithms are Based

Linguistic work has provided some guidelines for predicting prosodic elements. Readers are referred to [5] for an overview.

Prosodic Constituency and Phrasing: For spoken English, longer pauses (above 445 ms) occur at sentence boundaries; shorter pauses (down to about 245 ms) occur between conjoined sentences and between noun or verb phrases; and the shortest pauses (under 245 ms) occur within phrasal constituents. For ASL, pauses between sentences average 229 ms, between conjoined sentences 134 ms, between NP and VP 106 ms, and within the VP 11 ms [2]. Pause duration and other prosodic markers also depend on signing rate [5]. Ideally, algorithms can make adjustments across phrases within sentences, and possibly across sentences within narratives, without requiring users to enter markers other than commas and periods in the annotation [10]. Prosodic groupings show NM changes. Larger groups are marked with changes in head and body position, upper face changes, and periodic blinks [3]. Units at the lowest relevant prosodic level (the Prosodic Word) are separated from each other by changes in lower face tension [7].
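For illustration only, the sketch below (not the published implementation) encodes the average ASL pause durations reported in [2] as a lookup keyed by boundary type and scales them by a hypothetical signing-rate factor, reflecting the observation that pause duration depends on rate [5].

```python
# Illustrative sketch: average ASL pause durations from [2], keyed by
# boundary type, with an assumed linear scaling for signing rate [5].
ASL_PAUSE_MS = {
    "sentence": 229,      # between sentences
    "conjoined": 134,     # between conjoined sentences
    "np_vp": 106,         # between NP and VP
    "within_vp": 11,      # within the VP
}

def pause_duration_ms(boundary: str, rate_factor: float = 1.0) -> float:
    """Return a pause length in milliseconds for a prosodic boundary.

    rate_factor > 1.0 models slower signing (longer pauses) and
    rate_factor < 1.0 faster signing; the linear scaling is an assumption.
    """
    return ASL_PAUSE_MS[boundary] * rate_factor
```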

Stress: ASL uses peak speed and higher location in the signing space to mark stress. Generally, ASL has sentence stress in clause-final position and on the first syllable in multisyllabic words that are not compounds [2, 9].

Intonation for Sign Languages: Sandler [8] argues that intonation is represented by the NMs. However, there may well be a better as-yet-unidentified analogue to intonation (pitch), and the NMs may be showing morphemic layering [2]. For example, in addition to, or instead of, a negative sign (NOT, NEVER), a negative headshake starts at the beginning of the content that is negated and continues until the end of the negated phrase. Thus, our algorithm need only find a negative sign in the input to automatically generate the negative headshake at the right time. Similarly, brow lowering is a marker of questions with wh-words (who, what, when, where, why, how) [11]. Brow raising occurs with many structures: topics, yes/no questions, conditionals (if), relative clauses, among others. Unlike negative headshakes and brow lowering, there can be more than one grammatical brow raise in a sentence; for this reason, the input to our computational algorithms includes commas to trigger brow raise (as well as phrase final lengthening), except for yes/no questions, which trigger brow raise with a question mark.

Weast [12] measured brow height differences to differentiate syntactic uses of brow height from emotional uses (anger, surprise). She reported that brow height shows clear declination across statements, and declines somewhat before sentence-final position in questions, parallel to pitch declination in spoken intonation. Emotional uses set the height range (anger lower, surprise higher), into which syntactic uses of brow raising and lowering must be integrated.

A fourth example is the use of eyeblinks. Of the three blink types (startle reflex, periodic wetting, and deliberate), both periodic and deliberate blinks serve linguistic functions in ASL. Periodic blinks (short, quick) mark the ends of higher prosodic phrases [6]. Deliberate blinks (long, slow) occur with signs for semantic/pragmatic emphasis. Our algorithms prevent over-generation of periodic blinks by requiring a minimum number of signs, or a minimum elapsed duration, between blinks at boundaries.
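A minimal sketch of this gating idea, with hypothetical thresholds (the exact minimum sign count and duration are not specified above), might look as follows:

```python
# Illustrative blink gating: insert a periodic blink at a prosodic boundary
# only if enough signs or enough time has elapsed since the previous blink.
# Both thresholds below are hypothetical values for the sketch.
MIN_SIGNS_BETWEEN_BLINKS = 3
MIN_MS_BETWEEN_BLINKS = 1500

def should_blink(signs_since_last: int, ms_since_last: float) -> bool:
    return (signs_since_last >= MIN_SIGNS_BETWEEN_BLINKS
            or ms_since_last >= MIN_MS_BETWEEN_BLINKS)
```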

To recap, we can predict where pausing and sign lengthening occur, along with changes in brows, head, body, blinks, gaze, cheeks, and mouth. Now we can see the brilliance of the ASL prosodic system: hand movement marks syllables; lower face changes mark Prosodic Words; and upper face, head, and body positions mark the largest units, Intonational Phrases. Emotions affect the range of articulator movement. Everything is visible and tightly coordinated.

3 Prior Work on ASL Animation

Research findings support the value of ASL computer animation. Vcom3D [13] developed two commercial products: Signing Avatar and Sign Smith Studio. Their ASL animation is based on translating high-level external commands into character gestures and a limited set of facial expressions, which can be composed in real time to form sequences of signs. Their animation can approximate sentences produced by ASL signers, but individual handshapes and sign rhythm are often unnatural, and facial expressions are not all coded because ASL prosody cannot be represented effectively. Vcom3D also developed a system for creating animated stories using more realistic ASL (more facial expressions, improved fluidity of body motions, and some ASL prosodic elements) [14]. However, the animation was rendered off-line, derived from motion-capture technology, and took a substantial amount of time to complete.

In 2005, TERC [15] collaborated with Vcom3D and the National Technical Institute for the Deaf on the SigningAvatar® accessibility software. TERC also developed a Signing Science Dictionary [16]. Both projects benefited young deaf learners, but did not advance the state of the art in ASL animation as they used existing technology.

The Purdue University Animated Sign Language Research Group, led by Adamo-Villani and Wilbur, with the Indiana School for the Deaf (ISD), focuses on the development and evaluation of innovative 3D animated interactive tools, e.g. Mathsigner, SMILE, and the ASL system [17, 18]. The animation of ASL in Mathsigner and SMILE, although far from truly life-like, improved over existing examples. Signing adults and children who compared SigningAvatar with a prototype of Mathsigner rated Mathsigner significantly better on readability, fluidity, and timing, and equally good on realism.

In the U.S., English to ASL translation research systems include those by Zhao et al. [19] and continued by Huenerfauth [20] and by Grieve-Smith [21].

3.1 Prior Work Specifically Targeted at Animated ASL Prosody

Several studies have advanced animated signing beyond straightforward concatenation of hand/arm motions from individual signs. Huenerfauth has investigated the importance of ASL animation speed and timing (pausing) [22, 23], based on earlier ASL psycholinguistics experiments. The model is encoded into two algorithms: one modulates sign duration based on sign frequency and syntactic context (e.g. subsequent occurrences of repeated verbs are shortened by 12 %, and signs at a sentence or clause boundary are lengthened by 8 % and 12 %, respectively); the other inserts pauses at inter-sign boundaries in greedy fashion, repeatedly selecting the longest span of signs not yet broken by a pause and, within that span, the boundary with the highest product of a syntactic complexity index and the boundary's relative proximity to the span mid-point. These algorithms created animations with various speeds and pausing, which were shown to native ASL signers to check comprehension and recall; viewers were also asked to rate naturalness on a Likert scale. Animations produced with the speed and timing algorithms scored significantly better. The study demonstrated that signed prosodic elements can be added algorithmically.
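To make the greedy strategy concrete, the sketch below is our reading of the description above; it is not Huenerfauth's code, and the boundary indexing and complexity indices are simplified assumptions.

```python
def insert_pauses(num_signs: int, complexity: list[float], num_pauses: int) -> list[int]:
    """Choose inter-sign boundaries (1..num_signs-1) at which to insert pauses.

    complexity[b] is a syntactic complexity index for the boundary between
    sign b-1 and sign b (index 0 is unused).
    """
    chosen: list[int] = []
    for _ in range(num_pauses):
        cuts = sorted(chosen)
        spans = list(zip([0] + cuts, cuts + [num_signs]))
        spans = [s for s in spans if s[1] - s[0] >= 2]       # spans with interior boundaries
        if not spans:
            break
        start, end = max(spans, key=lambda s: s[1] - s[0])   # longest unbroken span
        mid = (start + end) / 2.0
        half = (end - start) / 2.0

        def score(b: int) -> float:
            proximity = 1.0 - abs(b - mid) / half            # 1 at mid-point, ~0 at edges
            return complexity[b] * proximity

        chosen.append(max(range(start + 1, end), key=score))
    return sorted(chosen)
```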

Zhao et al. developed a system for automatic English to ASL translation that renders ASL inflectional and derivational variations (i.e. temporal aspect, manner, degree) [19] using the EMOTE animation system [24]. EMOTE is procedural: animation is achieved from rules and algorithms, not user input; it is general purpose, i.e., not developed just for ASL, and it allows conveying Effort and Shape, two of the four components of the Laban Movement Analysis system. This shows the feasibility of applying general animation principles to ASL.

4 The ASL-Pro Algorithms

Why automatic generation of prosodic marking in animated signing? As mentioned, ASL prosodic elements are used to clarify syntactic structures in discourse. Research has identified over ten complex prosodic markers and has measured frequencies of up to seven prosodic markers in a two-second span. Adding such a number and variety of prosodic markers by hand through a graphical user interface (GUI) is prohibitively slow and requires animation expertise. Scalability to all age groups and disciplines can only be achieved if digital educational content can be easily annotated with quality ASL animation by individuals without technical background and computer animation expertise.

Our novel algorithms automate enhancing ASL animation with prosody. An overview of the pipeline is given in Fig. 2.

The pipeline input consists of:

1. Humanoid 3-D character rigged for animation. We use a character with 22 joints/hand, 4 joints/limb, 18 joints for the body (e.g. hips and spine), and 30 facial expression controllers.

2. Database of signs. The algorithm relies on a sign database. The animation of a sign is partitioned into 3 subsequences: in, middle, and out, which allows for smooth interpolation (see the data-structure sketch after this list).

3. Sentence encoded in ASL-pro notation. Our project does not target automatic translation of English into ASL. The user provides the translation in textual form using ASL-pro notation, which is an enhanced form of ASL gloss.
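The following data-structure sketch illustrates one possible representation of a sign database entry with in/middle/out subsequences; the field names and keyframe format are assumptions for illustration, not the actual schema.

```python
# Hypothetical schema for a sign database entry: each sign's animation is
# stored as three keyframe subsequences (in, middle, out) so that transitions
# between adjacent signs can be interpolated smoothly.
from dataclasses import dataclass

@dataclass
class Keyframe:
    time_ms: float
    joint_rotations: dict[str, tuple[float, float, float]]  # joint name -> Euler angles
    face_controllers: dict[str, float]                       # controller name -> weight

@dataclass
class SignEntry:
    gloss: str                  # e.g. "APPLE" or "SELF-1"
    in_seq: list[Keyframe]      # transition into the sign
    middle_seq: list[Keyframe]  # the core articulation
    out_seq: list[Keyframe]     # transition out of the sign
```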

The pipeline output is ASL animation data for the character, with prosodic elements, corresponding to the input sentence. The output animation data consists of animation elements – units of animation such as “signing the number 3” or “tilting head forward” – and animation modifiers, which are associated with animation elements to set their parameters, e.g. “translate joint on a Bezier arc between starting and ending positions” or “hold position for 0.7 seconds”. The functionality of the three stages of the pipeline is as follows.

Fig. 2.

Flowchart illustrating the ASL-pro pipeline

4.1 ASL-Pro Notation Interpretation

The ASL-pro notation of the input sentence is interpreted to identify the signs and prosodic markers and modifiers needed to animate the sequence. Some prosodic markers are included explicitly in the input sentence using ASL-pro syntax (a), and others are derived automatically using ASL prosody rules (b).

(a) Prosodic markers specified by the user are those that are not predictable from linguistic analysis of the sentence, such as affective markers and markers with hard-to-predict semantic/pragmatic functions (e.g., emphasis). Emotional aspect is known to play a decisive role in learning for deaf students [25] and Weast [12] has shown that emotion constrains grammatical markers, so affective markers are indicated by hand: Sad-Neutral-Happy, Angry-Neutral, Shy-Neutral-Curious, Scared-Neutral, Impatient-Neutral, Embarrassed-Neutral, and Startled. Each marker is encoded with a distinctive text tag (e.g. SadNHappy) followed by a numerical value ranging from -100 to 100 with 0 corresponding to neutral. The ASL-pro notation allows for modifying range and speed of motion to convey emotions. The modifiers are persistent, meaning that their scope extends until a marker for return to neutral (e.g. SadNHappy 0). The exception is the brief startled affect.
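To illustrate how a persistent affect marker might be tracked during notation interpretation, here is a hedged sketch; only the SadNHappy tag and the -100 to 100 range are given above, and the state representation and parsing details are our assumptions.

```python
# Sketch of persistent affect handling: a marker such as "SadNHappy 60" sets
# a value that stays in force until an explicit return to neutral
# ("SadNHappy 0"); "Startled" is treated as a brief, non-persistent event.
def apply_affect_marker(state: dict[str, int], tag: str, value: int) -> dict[str, int]:
    """Update the persistent affect state for a parsed marker."""
    if tag == "Startled":
        return state                                  # brief affect: not stored as state
    new_state = dict(state)
    new_state[tag] = max(-100, min(100, value))       # clamp to documented range; 0 = neutral
    return new_state

# Example: the marker's scope persists across signs until reset to neutral.
state: dict[str, int] = {}
state = apply_affect_marker(state, "SadNHappy", 60)   # affect in force for following signs
state = apply_affect_marker(state, "SadNHappy", 0)    # explicit return to neutral
```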

(b) Prosodic markers are derived automatically from prosody rules as follows:

Pausing. Pauses between and within sentences are added automatically according to research [2]. The current algorithm works with boundaries marked by the user explicitly in the input (comma, semicolon, period). We are investigating deriving boundaries from automatic syntactic analysis.

Phrase Final Lengthening. Signs at the end of a phrase are held and lengthened [5].

Marking Intonational Phrases (IPs). IPs are derived from syntactic constituents and marked by prosodic elements (changes in the head, body position, facial expression, and periodic blinks [3, 6]).

Stress. Syllables to be stressed are identified automatically based on research [2, 9]. Stress is shown by increasing peak velocity and raising the place of articulation in signing space [2]. At the word level, stress on two-syllable signs is equal, except for compounds, which are stressed on the second syllable; signs with multiple syllables (repetitions) are stressed on the first. At the sentence level, one syllable, usually on the last sign, carries sentence stress.

Negation. Negative headshakes are added when a negative is identified from a negative word or [neg] marker in the notation (e.g. not, never) and go from the negative to the clausal end.

Content questions. These are identified by finding wh-words (e.g. who, what, when, where, why, how) not followed by commas; they generate prosodic markers for brow lowering that extend until a period. If followed by commas (wh-clefts, not true questions), such words generate brow raise.

Topic phrases, yes/no questions, conditional clauses, relative clauses. Such cases are identified from keywords (e.g. if, [cond]) and user-provided or automatic syntactic analysis. The prosodic marker generated is brow raise until a comma or, for yes/no questions, a period is reached.
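The sketch below shows how rules like these could derive non-manual marker spans from a simplified ASL-pro token stream; it is a toy approximation of the rules just described, not the production algorithm, and clause boundaries are approximated by punctuation tokens.

```python
# Toy derivation of non-manual marker spans from a token list: negatives
# trigger a headshake to the clause end, wh-words not followed by a comma
# trigger brow lowering until a period, and wh-clefts trigger brow raise.
NEGATIVES = {"NOT", "NEVER", "[neg]"}
WH_WORDS = {"WHO", "WHAT", "WHEN", "WHERE", "WHY", "HOW"}

def derive_markers(tokens: list[str]) -> list[tuple[str, int, int]]:
    """Return (marker, start_index, end_index) spans over the token list."""
    markers = []
    for i, tok in enumerate(tokens):
        if tok in NEGATIVES:
            end = next((j for j in range(i, len(tokens)) if tokens[j] in {".", ","}),
                       len(tokens))
            markers.append(("neg_headshake", i, end))      # negative sign to clause end
        elif tok in WH_WORDS:
            followed_by_comma = i + 1 < len(tokens) and tokens[i + 1] == ","
            end = next((j for j in range(i, len(tokens)) if tokens[j] == "."),
                       len(tokens))
            if followed_by_comma:
                markers.append(("brow_raise", i, i + 1))    # wh-cleft, not a true question
            else:
                markers.append(("brow_lower", i, end))      # content question
    return markers
```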

4.2 Animation Assembly

Animation elements are retrieved from the sign database for all the signs needed by the input sentence. The prosodic markers, explicit and derived, are translated into animation elements and modifiers using specific sub-algorithms. For example, the body-lean prosodic marker sub-algorithm takes a leaning amplitude parameter specified as a percentage of a maximum lean and generates the corresponding animation element. Similarly, the hand-clasp sub-algorithm generates the corresponding animation element using a combination of forward kinematics (e.g. for finger joints) and inverse kinematics (e.g. for elbow joints). The displacement prosodic modifier sub-algorithm takes as input a given fractional displacement amplitude and produces an animation modifier which can be applied to any element to achieve the desired effect. A multi-track animation timeline is populated with the resulting elements and modifiers. Most prosody markers are layered on top of the elements of a sign. Some prosody markers, such as the hand clasp, are inserted between signs in phrases/sentences.
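A hypothetical sketch of the multi-track timeline and modifier layering described above (the class names, fields, and example values are illustrative only):

```python
# Illustrative multi-track timeline: prosodic elements occupy their own
# tracks layered over the hands track, and modifiers attach to elements to
# adjust their parameters (here, only a duration scale is modeled).
from dataclasses import dataclass, field

@dataclass
class AnimationElement:
    name: str                      # e.g. "sign:POWER" or "head:tilt-forward"
    start_ms: float
    duration_ms: float

@dataclass
class AnimationModifier:
    description: str               # e.g. "hold position for 0.7 seconds"
    duration_scale: float = 1.0    # >1 lengthens the element (e.g. phrase final lengthening)

@dataclass
class Timeline:
    tracks: dict[str, list[AnimationElement]] = field(default_factory=dict)

    def add(self, track: str, element: AnimationElement,
            modifiers: tuple[AnimationModifier, ...] = ()) -> None:
        for m in modifiers:
            element.duration_ms *= m.duration_scale          # apply duration modifiers
        self.tracks.setdefault(track, []).append(element)

# Example: a prosodic head tilt layered on a separate track over the hands track.
tl = Timeline()
tl.add("hands", AnimationElement("sign:NATURE", start_ms=0, duration_ms=600),
       modifiers=(AnimationModifier("phrase final lengthening", 1.12),))
tl.add("head", AnimationElement("head:tilt-forward", start_ms=0, duration_ms=600))
```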

4.3 Animation Compositing

The animation layers, defined by the multiple animation tracks, are collapsed into the final animation. This stage has the important role of arbitrating between the various animation elements and modifiers and reconciling conflicts. Physical constraint violations (collisions and excessive joint rotation amplitudes) are detected and remedied by computing alternate motion trajectories.
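As one small example of such a constraint check, the sketch below clamps a joint rotation to an allowed range; the joint names and limits are hypothetical, and the actual system also recomputes trajectories to resolve collisions rather than simply clamping.

```python
# Illustrative compositing check: clamp a joint rotation to its allowed range
# before the final animation is emitted. Limits below are hypothetical.
JOINT_LIMITS_DEG = {"elbow_flex": (0.0, 150.0), "wrist_flex": (-70.0, 80.0)}

def clamp_rotation(joint: str, angle_deg: float) -> float:
    lo, hi = JOINT_LIMITS_DEG.get(joint, (-180.0, 180.0))
    return max(lo, min(hi, angle_deg))
```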

5 The ASL-Pro User Interface

The key components of the interface are the underlying algorithms (described in Sect. 4), the ASL-pro notation, and the editor. The interface allows users to input notation to generate animations with prosodic elements. Because there is no generally accepted writing system for ASL, linguists write ASL signs and sentences using glosses in CAPITAL LETTERS, e.g. IX-1 LIKE APPLE 'I like apples'. Gestures that are not signs are written in lower-case letters, e.g. go there. Proper names, technical concepts, and other items without an obvious ASL translation may be fingerspelled, written either as fsMAGNETIC or m-a-g-n-e-t-i-c. NMs are shown above the signs with which they co-occur, with a line indicating the start and end of the articulation. For instance,

[glossed example: a wh-q line drawn above the signs with which it co-occurs]

where wh-q indicates a facial articulation with lowered eyebrows.

Fig. 3.

ASL-pro notation differs from ASL gloss: there are no lines above sign names; the comma after NATURE triggers brow raise, a blink, and within-sentence phrase final lengthening; POWER triggers a lean forward for primary stress and sentence focus; a period triggers a blink and phrase final lengthening; the name of the sign SELF-1 calls a different sign from the lexicon than SELF. Still to be determined and evaluated for clarity and ease of use is how best to write emphasis on WOW and primary focus on POWER in ASL-pro. Here they are in brackets, but, for example, [emph] could be replaced by !.

To animate ASL sentences, the user types the ASL-pro notation into the interface editor. The notation is interpreted and automatically converted to the correct sign animations with clearly identifiable prosodic elements. The ASL-pro notation is similar to, but not exactly the same as, ASL gloss; Fig. 3 shows differences and similarities. Our goal is to allow any user with knowledge of ASL to create animation with prosody. Future work will develop and assess a tutorial, notation examples, and a user-friendly editor with syntax highlighting to help users learn the ASL-pro system.

To evaluate the quality of ASL-pro animations produced with the interface, we integrated the interface into our existing ASL system, which generates 3D animations of ASL signs and sentences (for details, see [26]). A video of the system can be accessed at: http://hpcg.purdue.edu/idealab/asl/ASLvideo.mov

6 Initial Evaluation of the ASL-Pro Animations

We conducted a small-scale evaluation to determine the accuracy of the algorithms. Algorithms are considered accurate if, given a specific ASL-pro notation, they generate (1) prosodic elements perceived by users as accurate with correct intensity and timing with respect to the hands; (2) ASL animation perceived by users as intelligible and close to real signing; and (3) character motions and facial articulations perceived by users as fluid and life-like.

Judges: Validation of the algorithms relied on formative feedback from 8 ASL users (2 Deaf, 2 CODAs (Children of Deaf Adults), and 4 ASL students with advanced ASL skills).

Stimuli: The stimulus was ASL animation with prosodic elements, generated algorithmically from ASL-pro notation based on the following text: “Nature can be unbelievably powerful. Everyone knows about hurricanes, snow storms, forest fires, floods, and even thunderstorms. But wait! Nature also has many different powers that are overlooked and people don’t know about them. These can only be described as FREAKY.” (From National Geographic KIDS, August 2008). An ASL user translated the English into ASL-pro notation and entered it into the notation editor. The animation algorithms were applied to generate the ASL animation. A segment of the algorithmically generated ASL-pro animation can be viewed at: http://hpcg.purdue.edu/idealab/web/ASL-pro.htm.

Procedure: Judges viewed the video and completed an online survey rating the three areas of desired feedback on a 5-point Likert scale (1 = highest rating; 5 = lowest rating): accuracy of prosodic markers, intelligibility and closeness to real signing, and fluidity of motions and facial expressions.

Initial Findings: Results indicate that judges found the prosodic elements accurate and correctly timed (1.7/5), the signing readable and fairly close to real signing (1.9/5), and the character motions and facial expressions fairly fluid and realistic (2.3/5). Comments suggested that a more realistic-looking avatar could better convey prosodic elements (especially facial expressions) but that, despite the stylized look of the avatar, the animated signing was very close to real signing.

7 Conclusion and Future Work

In this paper we have presented a new sign language animation capability: ASL-pro, that is, ASL with pro(sody). We described the development of algorithms for inserting basic prosodic markers in animated ASL, and the design of a new user interface that allows users to input English sentences in ASL-pro notation to automatically generate the corresponding animations with prosodic elements. Findings of an initial study with 8 ASL users are promising, as they show that the algorithmically generated ASL-pro animations are accurate, close to real signing and fluid.

In future work we plan to conduct additional studies with larger pools of subjects to further validate the algorithms. We will also conduct an evaluation of the ASL-pro interface to examine its usability and functionality and to identify weaknesses, with the overarching goal of revising and improving its effectiveness and efficiency. Specifically, the usability evaluation will assess educators’ perceptions of the ease of using the ASL-pro interface, and the perceived benefits and challenges of using it to develop units/modules for their classes.

Our research addresses the need to enhance the accessibility of educational materials for deaf individuals by allowing ASL users with no animation and programming expertise to add accurate and realistic sign language animations to digital learning materials. While our initial objective has been to target the educational audience, we believe that facilitating creation of ASL translation is important beyond the education domain. For instance, ASL-pro animations could be used in entertainment and social networking to remove communication barriers between hearing and non-hearing members of our society.