Virtual Reality, Volume 14, Issue 4, pp 221–228

Piavca: a framework for heterogeneous interactions with virtual characters

Authors

  • Marco Gillies, Department of Computing, Goldsmiths College, University of London
  • Xueni Pan, Department of Computer Science, University College London
  • Mel Slater, ICREA-Universitat de Barcelona
Original Article

DOI: 10.1007/s10055-010-0167-5

Cite this article as:
Gillies, M., Pan, X. & Slater, M. Virtual Reality (2010) 14: 221. doi:10.1007/s10055-010-0167-5

Abstract

This paper presents a virtual character animation system for real-time multimodal interaction in an immersive virtual reality setting. Human-to-human interaction is highly multimodal, involving features such as verbal language, tone of voice, facial expression, gestures and gaze. This multimodality means that, in order to simulate social interaction, our characters must be able to handle many different types of interaction, and many different types of animation, simultaneously. Our system is based on a model of animation that represents different types of animation as instantiations of an abstract function representation. This makes it easy to combine different types of animation. It also encourages the creation of behavior out of basic building blocks, making it easy to create and configure new behaviors for novel situations. The model has been implemented in Piavca, an open source character animation system.

1 Introduction

Animated virtual humans are a vital part of many virtual environments today. Of particular interest are virtual humans that we can interact with in some approximation of social interaction. However, creating characters that we can interact with believably is an extremely complex problem. Part of this difficulty is that human interaction is highly multimodal. The most obvious modality is speech, but even this can be divided into the verbal content and the non-verbal aspects of speech, such as tone of voice, whose function can be very subtle and complex. When we take into account bodily modalities, we also have to deal with facial expression, gaze, gesture (both accompanying speech and giving feedback while listening), posture, body movements and touch. While non-immersive environments can ignore some of these modalities of interaction, the full power of an immersive virtual reality interaction with a character can only be achieved by modeling all (or most) of the modalities.

This multimodality implies that there will be a great variety of ways of interacting with a virtual character. A human participant can interact with a character in a number of different ways:
  • Verbal interaction, characters that can engage in conversation using a dialogue engine.

  • Non-verbal aspects of speech, picking up features of the participants’ voice.

  • What Slater and Usoh (1994) call “Body Centered Interaction”. The participants’ movements are tracked, and their normal movements are used to interact with a character. For example, the character may track the participants’ movements with its gaze and maintain a normal conversational distance from them.

  • Control, the character might be an avatar that is being controlled by a participant, either through one of the above modalities or a more conventional user interface such as a mouse and keyboard or a joystick.

  • Watching, not all of a character’s behavior will be interactive; some will simply play back while the participant observes it.

The true complexity of the behavior of an interactive character comes from the fact that most of these types of interaction are likely to be happening simultaneously. For example, a participant might be engaging in conversation with a character that is controlled by a chat-bot. The character nods in rhythm with variations in the participant’s speech amplitude and its gaze follows the participant’s position, while it walks around following the joystick movements made by another participant. Finally, its posture shifts occasionally, following a non-interactive algorithm. (Admittedly, this example is contrived: if the character is an avatar under the control of a participant, then its speech will almost always be controlled by that participant. However, even given this constraint, many modalities of interaction are likely to appear simultaneously.) This paper presents a method of creating characters that combine these very diverse forms of interaction.

The diverse styles of interaction also imply diverse methods of generating behavior. This paper is mostly restricted to animation, but even here there are many different styles. Some animation can be played back from pre-existing data, whether from motion capture or hand animation. Some types of animation, such as gaze or lip synchronization, are best generated on the fly, algorithmically, a process called procedural animation. Finally, most interactive animation is generated by transforming and combining clips of pre-existing animation data to produce new animations.

Another source of diversity is the different time scales of interactivity. A character’s gaze has to respond instantly to changes in the human’s position, while other body centered interactions have more variable time constraints. Speech interaction tends to be turn based, with long periods of non-interaction interrupted by a change of speaker. Other modalities have no time constraints for interaction or are not interactive at all. This flexibility in the level of interactivity is also important because of the great difference in quality that exists between behavior that is generated on the fly and pre-recorded behavior. This is particularly true of speech, where it can be difficult to produce coherent speech for long periods with all but the best existing dialogue systems, and where synthesized speech falls far short of recorded speech quality. However, it is also true of animation: motion-captured or hand-animated motion is generally better than what can be produced in real time, even by transforming motion capture data. However, completely pre-recorded behavior reduces interactivity. The sense that we are truly interacting with a character is likely to be a strong contributor to presence and is thus vital. However, the illusion of interactivity can be maintained even if the behavior is not entirely interactive, as long as some elements are. For this reason, we believe it is important to balance the quality of pre-recorded elements against the interactivity of behavior generated on the fly. Behavior that consists of some pre-recorded clips, of audio and animation, should be used to ensure the quality of the output. However, the sense of interactivity can be maintained to a degree if other aspects of behavior, such as gaze and feedback, remain highly interactive. In turn, this sense of interactivity can maintain a high level of presence.

This flexibility also implies that the characters will need to be used in different ways and that different styles of interaction will need to be combined differently in different situations. Different types of behavior need to be interacted with in different ways. Sometimes facial expressions need to be under the control of a human, at other times they must respond to the participant’s speech, and at other times they can be random or scripted. The same considerations apply to all the behavior types. Therefore, it needs to be simple to create new interactive behaviors and to combine them with others in a variety of different ways.

This paper describes how to create a character capable of very heterogeneous forms of interaction. We first describe related work in this area. We then describe a functional model of animation that is designed for handling very diverse styles of animation, and its implementation with the Piavca open source character animation system. Finally, we describe how it is used to create our heterogeneous character.

2 Related work

This work builds on a long tradition of research on expressive virtual characters (Vinayagamoorthy et al. 2006). This work has aimed at building animated characters that can autonomously produce the type of expressive non-verbal communication that humans naturally use in day-to-day interaction. This generally entails both an animation system and a higher-level model for determining what behavior to produce in response to stimuli. Numerous general purpose systems have been produced, notably the “JACK” system by Badler et al. (1993); the “GRETA” system by Pelachaud et al. (2005); and the work of Cassell et al. (1999) and Guye-Vuilléme et al. (1999). Much of this work has concerned the expression of emotions [for example the work of Gratch and Marsella (2001)], but our work is closer to research that models the use of non-verbal communication in face-to-face conversation, for example the work of Vilhjálmsson (2005). Each individual modality of expression is highly complex and many researchers have worked to create models of single modalities. Gaze is often linked to turn taking in conversation, for example in the work of Lee et al. (2002), further developed by Vinayagamoorthy et al. (2004). Facial expression has been extensively studied in the context of emotional expression, for example in the work of Kshirsagar and Magnenat-Thalmann (2002) or Bui et al. (2004). Gestures are very closely related to speech (Cassell and Stone 1999) and have been modeled in a number of ways, from the highly data-driven methods of Stone et al. (2004) to the totally procedural methods of Kopp et al. (Kopp and Wachsmuth 2004; Kopp et al. 2005). Finally, posture is also an important modality of expression that has been used as a means of producing believable idling behavior (Egges et al. 2004) or of expressing interpersonal relationships (Gillies and Ballin 2003).

One of the first systems that allowed people to interact multimodally with a character using voice and gesture tracking was Thórisson’s Gandalf system (Thórisson 1998). This work was later developed by Cassell’s group into an interaction system with a full-bodied character capable of complex non-verbal communication (Cassell et al. 1999). The Max system focuses primarily on voice and gesture interaction (Kopp et al. 2003). Work by Maatman et al. (2005) focused on the listening behavior of a character and, like our work, uses head tracking and voice input. These systems have demonstrated the power of multimodal face-to-face interaction with virtual characters. This paper shows how it is possible to rapidly create and customize such systems from basic building blocks.

Our work also uses research in the area of data-driven animation and motion editing, in particular the Motion Warping formulation of motion editing (Witkin and Popović 1995). Our functional model is an excellent way of combining motion editing techniques and we have implemented several within our framework. Examples include interpolation-based animation such as the work of Rose et al. (1998), principal component analysis-based animation (Alexa and Müller 2000) and the Motion Graph data structure (Arikan and Forsyth 2002; Kovar et al. 2002; Lee et al. 2002).

The work we present is a functional abstraction of character animation and behavior. In this sense, it is similar to functional abstraction frameworks used in other domains. In particular, there are a number of interesting abstraction frameworks used in virtual reality and graphics for example Figueroa et al.’s InTML system (Figueroa et al. 2002) or Elliott et al.’s TBAG system (Elliott et al. 1994).

3 A functional model of animation

Handling and combining many diverse methods of animation on a single character requires a single representation for all of them. At its most abstract, an animation can be viewed as a function from time to the state of a character:
$$ {\bf x} = f(t) $$
(1)
where x is some representation of a character’s state. The most common representation in body animation would be a skeletal one, in which the root of the character has a vector position \(p_0\) and quaternion orientation \(q_0\), and all of the joints have quaternion orientations \(q_i\): \(\{p_0, q_0, q_1, \ldots, q_n\}\). However, we do not restrict animations to this representation; for example, joint orientations can also be represented using Grassia’s exponential map representation (Grassia 1998) if it is more convenient for certain calculations. For facial animation, the state can be represented as a set of weight values for morph targets or as positions or rotations of facial bones. The state can also have more abstract representations; for example, the result of one animation function can be used as a parameter of another, as we shall see below.
Most animations will in fact take other parameters, making the general form of an animation function:
$$ {\bf x} = f(\varvec{\theta}, t) $$
(2)
We will now show how different types of animation can easily be represented in this framework.
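
To make the abstraction concrete, here is a minimal sketch in Python (illustrative only, not Piavca's actual API): a character state is a simple record of root transform, joint quaternions and morph weights, and an animation is any callable from time to such a state, with its remaining parameters pre-applied.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple
import math

# A minimal character state: root position, root orientation and joint
# orientations (quaternions as (w, x, y, z) tuples), plus morph weights.
Quat = Tuple[float, float, float, float]

@dataclass
class CharacterState:
    root_position: Tuple[float, float, float] = (0.0, 0.0, 0.0)
    root_orientation: Quat = (1.0, 0.0, 0.0, 0.0)
    joint_orientations: Dict[str, Quat] = field(default_factory=dict)
    morph_weights: Dict[str, float] = field(default_factory=dict)

# An animation is just a function of time to a state: x = f(t).
Animation = Callable[[float], CharacterState]

def head_bob(amplitude: float, frequency: float) -> Animation:
    """Example of form (2): the extra parameters are pre-applied ("curried"),
    leaving a plain function of t, i.e. form (1)."""
    def f(t: float) -> CharacterState:
        y = amplitude * math.sin(2.0 * math.pi * frequency * t)
        return CharacterState(root_position=(0.0, y, 0.0))
    return f

anim = head_bob(amplitude=0.02, frequency=1.0)   # form (2) reduced to form (1)
print(anim(0.25).root_position)
```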

Keyframe or motion captured animation

Both hand-animated data and motion-captured animation are represented as a list of evenly or unevenly spaced keyframes, and can be written as the following function:
$$ f({\bf k}, t) = {\it interpolate}(k_{\tau(t)}, k_{\tau(t)+1}, t) $$
where k is the keyframe data and τ(t) is the index of the keyframe immediately prior to t. interpolate can be any suitable interpolation function; we use cubic spline interpolation.
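
A minimal sketch of this keyframe function, assuming a single scalar channel stored as (time, value) pairs and using linear interpolation rather than the cubic splines used in the real system:

```python
from bisect import bisect_right
from typing import Callable, List, Tuple

Keyframe = Tuple[float, float]          # (time, value) for one channel

def keyframe_animation(keys: List[Keyframe]) -> Callable[[float], float]:
    """f(k, t): find the keyframe interval containing t and interpolate.
    The real system interpolates full poses with cubic splines; a single
    linearly interpolated channel is shown here for clarity."""
    times = [k[0] for k in keys]

    def f(t: float) -> float:
        if t <= times[0]:
            return keys[0][1]
        if t >= times[-1]:
            return keys[-1][1]
        i = bisect_right(times, t) - 1        # tau(t): keyframe prior to t
        (t0, v0), (t1, v1) = keys[i], keys[i + 1]
        u = (t - t0) / (t1 - t0)
        return (1.0 - u) * v0 + u * v1
    return f

nod = keyframe_animation([(0.0, 0.0), (0.2, -0.3), (0.4, 0.0)])
print(nod(0.1))   # -0.15
```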

Procedural animation

Procedural animation is the most general instantiation of the functional model, and can be represented as any function of the form (2). The parameters \(\varvec{\theta}\) will depend heavily on the type of motion; for example, a gaze motion will have parameters that include the gaze target and the duration of the gaze.
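
As an illustration (not the gaze model actually used in the system), here is a hypothetical procedural gaze function whose parameters are a target position and a gaze duration, easing the head yaw and pitch toward the target:

```python
import math
from typing import Callable, Tuple

Vec3 = Tuple[float, float, float]

def gaze(target: Vec3, duration: float,
         head_position: Vec3 = (0.0, 1.7, 0.0)) -> Callable[[float], Tuple[float, float]]:
    """f(theta, t) with theta = (target, duration). Returns (yaw, pitch) in
    radians, easing from the rest pose toward the target over `duration`."""
    dx = target[0] - head_position[0]
    dy = target[1] - head_position[1]
    dz = target[2] - head_position[2]
    yaw_goal = math.atan2(dx, dz)
    pitch_goal = math.atan2(dy, math.hypot(dx, dz))

    def f(t: float) -> Tuple[float, float]:
        u = min(max(t / duration, 0.0), 1.0)
        ease = u * u * (3.0 - 2.0 * u)        # smoothstep ease-in/out
        return (ease * yaw_goal, ease * pitch_goal)
    return f

look_at_participant = gaze(target=(0.5, 1.6, 2.0), duration=0.4)
print(look_at_participant(0.2))
```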

Motion transforms

An animation function can be a transformation of another motion function g:
$$ {\bf x} = f(g,\varvec{\theta}, t) $$
The most general form of transformation is a general motion warp (Witkin and Popović 1995), which consists of a timewarp α, transforming t, and a space warp β, transforming the output of g:
$$ f(g, \varvec{\theta}_{\alpha}, \varvec{\theta}_{\beta}, t) = \beta(\varvec{\theta}_{\beta}, g(\alpha(\varvec{\theta}_{\alpha}, t))) $$
Simple examples include spatial scaling (see footnote 1): \(f(g, s, t) = s\,g(t)\); temporal scaling: \(f(g, s, t) = g(st)\); and looping: \(f(g,t)=g(t \bmod |g|)\), where |g| is the length of g (in practice a more complex looping function is used to ensure smooth transitions). A particularly useful transform is a mask, which is used to select the joints or morph targets to which an animation is applied, based on a mask m:
$$ f(g, {\bf m}, t)_{i} = \left\{ \begin{array}{ll} g(t)_i & \hbox {if} \; {\bf m}_{\rm i}=1,\\ 0 & \hbox{otherwise} \end{array}\right. $$
As mentioned above, the parameters of a motion transform can themselves be other motion functions. For example, a general motion warp can be created using a motion function h as the warping parameter: \(f(g, h, t) = g(h(t))\).
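
These transforms can be sketched as higher-order functions that take an animation function and return a new one; here, for brevity, an animation maps time to a dict of named scalar channels rather than a full character state:

```python
import math
from typing import Callable, Dict, Set

# For brevity an animation here maps time to a dict of named scalar channels.
Anim = Callable[[float], Dict[str, float]]

def scale(g: Anim, s: float) -> Anim:
    """Spatial scaling: f(g, s, t) = s * g(t)."""
    return lambda t: {name: s * v for name, v in g(t).items()}

def timescale(g: Anim, s: float) -> Anim:
    """Temporal scaling: f(g, s, t) = g(s * t)."""
    return lambda t: g(s * t)

def loop(g: Anim, length: float) -> Anim:
    """Looping: f(g, t) = g(t mod |g|) (no smoothing, unlike the real system)."""
    return lambda t: g(t % length)

def mask(g: Anim, channels: Set[str]) -> Anim:
    """Keep only the selected channels; zero the rest."""
    return lambda t: {name: (v if name in channels else 0.0)
                      for name, v in g(t).items()}

def warp(g: Anim, h: Callable[[float], float]) -> Anim:
    """General motion warp with another function h as the timewarp: g(h(t))."""
    return lambda t: g(h(t))

# Example: a slowed, looped wave restricted to the right-arm channel.
wave: Anim = lambda t: {"r_shoulder": 0.5 * math.sin(6.0 * t), "head": 0.1 * t}
arm_only = mask(loop(timescale(wave, 0.5), length=2.0), {"r_shoulder"})
print(arm_only(3.0))
```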

Combining animations

Animations can also be combined together by functions of two other animations, which have the general form:
$$ f(g_{1}, g_{2}, \varvec{\theta}_{\beta}, \varvec{\theta}_{\alpha 1}, \varvec{\theta}_{\alpha 2}, t) = \beta(\varvec{\theta}_{\beta}, g_{1}(\alpha_{1}(\varvec{\theta}_{\alpha 1}, t)), g_{2}(\alpha_{2}(\varvec{\theta}_{\alpha 2}, t))) $$
Examples include addition (see footnote 2): \(f(g_1, g_2, t) = g_1(t) + g_2(t)\); blending between animations: \(f(g_1, g_2, \lambda, t) = \lambda g_1(t) + (1 - \lambda)g_2(t)\); and sequencing motions:
$$ f(g_{1}, g_{2},t)=\left\{ \begin{array}{ll} g_1(t) & \hbox {if} \; t < |g_1|,\\ g_2(t-|g_1|) & \hbox {otherwise} \end{array} \right.$$
As with looped motions, in practice we would use a smoothed version of this function. There are also functions for combining multiple animations, for example blending between several animations; finite-state machines that choose different animations based on their state; animations based on a principal component analysis of other animations (Alexa and Müller 2000), and the motion graph data structure mentioned above.
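
The two-animation combiners have the same shape; a sketch in the same style (over dicts of named channels; a real implementation would crossfade at the join rather than cutting):

```python
from typing import Callable, Dict

Anim = Callable[[float], Dict[str, float]]

def add(g1: Anim, g2: Anim) -> Anim:
    """Addition: f(g1, g2, t) = g1(t) + g2(t)."""
    def f(t: float) -> Dict[str, float]:
        a, b = g1(t), g2(t)
        return {k: a.get(k, 0.0) + b.get(k, 0.0) for k in set(a) | set(b)}
    return f

def blend(g1: Anim, g2: Anim, lam: float) -> Anim:
    """Blending: f(g1, g2, lambda, t) = lambda*g1(t) + (1-lambda)*g2(t)."""
    def f(t: float) -> Dict[str, float]:
        a, b = g1(t), g2(t)
        return {k: lam * a.get(k, 0.0) + (1.0 - lam) * b.get(k, 0.0)
                for k in set(a) | set(b)}
    return f

def sequence(g1: Anim, len1: float, g2: Anim) -> Anim:
    """Sequencing: play g1 for its length, then g2 (hard cut; the real
    system smooths the transition)."""
    return lambda t: g1(t) if t < len1 else g2(t - len1)

nod   = lambda t: {"head_pitch": -0.3 * t}
shake = lambda t: {"head_yaw": 0.3 * t}
both  = sequence(nod, 1.0, shake)
print(both(0.5), both(1.5))
```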

The power of the functional model of animation comes from the fact that all types of animation have the same form. The animation functions can be curried, pre-applying all parameters apart from t to get a new function of the form (1). Once in this form, the user does not need to know anything about the type of animation. In particular, the functions for transforming and combining animations have the same form as their inputs, making it possible to compose them. As we shall see, arbitrary composition of functions for transforming and combining animations is a powerful tool for creating complex transformations. This leads to a methodology of decomposing transformations into their most basic elements, which can then be reused by composing them in a number of different ways. This in turn makes it easy to author new behaviors and variants of existing behaviors.
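
Currying can be illustrated with functools.partial: every parameter except t is pre-applied, so animations, transforms and combinations all end up with the same x = f(t) signature and compose freely (the functions below are toy examples, not Piavca's):

```python
from functools import partial
import math

# Fully parameterised functions of the form f(theta..., t).
def sine_channel(amplitude, frequency, t):
    return amplitude * math.sin(2.0 * math.pi * frequency * t)

def timescale(g, s, t):          # transform: f(g, s, t) = g(s*t)
    return g(s * t)

def blend(g1, g2, lam, t):       # combiner: lambda*g1(t) + (1-lambda)*g2(t)
    return lam * g1(t) + (1.0 - lam) * g2(t)

# Curry everything except t; the results all share the form x = f(t).
slow_nod  = partial(sine_channel, 0.2, 0.5)
fast_nod  = partial(sine_channel, 0.2, 2.0)
slowed    = partial(timescale, fast_nod, 0.25)      # transform of an animation
combined  = partial(blend, slow_nod, slowed, 0.7)   # combination of animations

print(combined(1.0))
```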

3.1 Implementation

Before discussing how the functional model is used, a quick note on how it is implemented. The model has been implemented as part of Piavca (the Platform Independent Architecture for Virtual Characters and Avatars), an open source software framework for creating virtual characters (available at http://piavca.sourceforge.net/). Piavca is a generic API built in C++ that can be combined with different graphics engines. The current implementation uses Cal3d (http://home.gna.org/cal3d/) as a low-level animation system providing functionality such as smooth skinning and morph targets, while Piavca overrides the motion-blending features. The renderer is based on Cal3d’s OpenGL renderer. This system can be used with numerous graphics and virtual reality systems. Currently, it has been integrated with Dive (http://www.sics.se/dive/) and XVR (http://www.vrmedia.it/Xvr.htm), and we are working on an integration with OpenSG (http://www.opensg.org/).

Our implementation (in the C++ and Python languages) is object-oriented. All of the functions are in fact function objects and so can contain state. They all inherit from a single base class, motion, which acts as an abstract representation of an animation function. Currying the functions is achieved by providing all parameters except t at initialization time, and making the function objects callable with t as a parameter.
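
A sketch of that function-object pattern (the class and parameter names are illustrative, not Piavca's actual C++ interface): parameters are supplied to the constructor, and the object is then callable with t alone.

```python
from abc import ABC, abstractmethod

class Motion(ABC):
    """Abstract animation function: callable with time only; all other
    parameters are supplied at construction (currying)."""
    @abstractmethod
    def __call__(self, t: float):
        ...

class Loop(Motion):
    """Transform wrapping another Motion: g(t mod length)."""
    def __init__(self, inner: Motion, length: float):
        self.inner = inner
        self.length = length

    def __call__(self, t: float):
        return self.inner(t % self.length)

class Blink(Motion):
    """Toy procedural motion: returns an eyelid-closure weight in [0, 1]."""
    def __init__(self, duration: float = 0.15):
        self.duration = duration

    def __call__(self, t: float):
        return 1.0 if 0.0 <= t < self.duration else 0.0

blinking = Loop(Blink(), length=4.0)   # one blink every four seconds
print(blinking(4.05), blinking(5.0))
```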

Users can configure and combine motion functions using a Python-based scripting interface to Piavca, for maximum flexibility. Alternatively, there is an XML-based behavior definition language that allows users to create character behavior models without requiring programming skills. The behavior definition language exactly mirrors the functional model, with tags for each possible function.
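
To illustrate how a behavior definition language can mirror the functional model one-to-one (the tag names below are hypothetical, not Piavca's actual schema), a nested element tree maps directly onto nested animation functions:

```python
import xml.etree.ElementTree as ET

# Hypothetical behavior definition: each tag mirrors an animation function.
XML = """
<loop length="4.0">
  <sequence>
    <blink duration="0.15"/>
    <zero length="3.85"/>
  </sequence>
</loop>
"""

def build(node):
    """Recursively map elements onto animation functions of time.
    Every builder returns (function, length) so containers can nest."""
    if node.tag == "blink":
        d = float(node.get("duration"))
        return (lambda t: 1.0 if 0.0 <= t < d else 0.0), d
    if node.tag == "zero":
        length = float(node.get("length"))
        return (lambda t: 0.0), length
    if node.tag == "sequence":
        (f1, l1), (f2, l2) = [build(c) for c in node]
        return (lambda t: f1(t) if t < l1 else f2(t - l1)), l1 + l2
    if node.tag == "loop":
        f, inner_len = build(node[0])
        length = float(node.get("length", inner_len))
        return (lambda t: f(t % length)), length
    raise ValueError(f"unknown tag {node.tag}")

anim, _ = build(ET.fromstring(XML))
print(anim(0.1), anim(4.05))   # blinking in both cases
```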

4 A heterogeneous character system

Figure 1 shows the type of heterogeneous interaction that is possible with our characters. The human participant’s behavior is input with typical sensors for an immersive VR system: a microphone and a head tracker.
Fig. 1 Heterogeneous interaction

However, these two inputs are used in a variety of different ways by different behaviors. The head tracker is used to obtain the position of the participant in order to maintain an appropriate conversational distance (maintaining distance in conversation is called proxemics in the non-verbal communication literature). It is also used to detect when the participant shifts posture. The character’s posture shifts are then synchronized with those of the participant, which is known to build rapport (Kendon 1970). The position of the participant is also used by a gaze behavior to ensure the character is looking in the right place. The audio from the microphone is used to detect when the participant is talking. This is used by the gaze behavior so the character can look at the participant more when listening to him or her. The character also gives head nods and other feedback signals when the participant is speaking. All these behaviors happen automatically in response to the sensor input, using real-time algorithms.

The microphone audio is also used for speech interaction. Speech interaction is either controlled by a human controller, if the character is an avatar, or by a dialogue engine. When the character speaks, a number of other behaviors are triggered. The character’s lip movements will be synchronized to the speech and the character will gesture. The gaze behavior is also altered to take account of the fact that the character is speaking (Argyle and Cook 1976). As well as triggering speech, the character’s controller can also trigger certain scripted actions and gestures. Apart from lip synchronization and gaze, the character’s facial expression is independent of the participant’s behavior, consisting of occasional smiling and blinking. The character therefore has a wide range of styles of interaction, all happening simultaneously. These contain many different animation processes, both facial and bodily, that must be combined to create a single coherent animation.

Figure 2 shows how this type of character can be implemented using our functional model. The example we give is of a human-controlled character that is used in Wizard of Oz style experiments. A human operator controls certain aspects of the behavior, while others are automatic. The character’s speech is controlled by a human being (the controller) selecting speech sequences from a library of possible utterances, while the character interacts with another person (the participant). This system has three inputs: the position of the participant, an audio signal of the voice of the participant and input from the person controlling the character, specifying speech utterances. The position input is a 3-vector whose value is obtained every frame from a head tracker on the participant. The voice signal is obtained from an ordinary microphone. For this application, we simply threshold the audio value to detect whether the participant is speaking. The controller has a user interface with a number of buttons used to trigger speech utterances.
Fig. 2 Implementing a heterogeneous interaction
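
A sketch of how the two sensor inputs might be reduced to the events that the behaviors consume (the thresholds and interfaces here are illustrative assumptions, not the values used in the actual system):

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]

def is_speaking(audio_frame: Sequence[float], threshold: float = 0.02) -> bool:
    """Threshold the microphone signal: RMS amplitude above a small
    constant is treated as 'the participant is speaking'."""
    rms = math.sqrt(sum(s * s for s in audio_frame) / max(len(audio_frame), 1))
    return rms > threshold

def posture_shift(prev: Vec3, curr: Vec3, threshold: float = 0.15) -> bool:
    """Detect a posture shift as a large change in tracked head position
    between updates (threshold in metres, chosen arbitrarily here)."""
    return math.dist(prev, curr) > threshold

print(is_speaking([0.0, 0.05, -0.04, 0.06]))
print(posture_shift((0.0, 1.7, 0.0), (0.2, 1.68, 0.05)))
```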

Some behaviors of the character are not influenced by any of the inputs. For example, the character has a simple blinking behavior. This is a loop containing a blinking animation sequenced with a zero animation. A zero animation is simply an animation function that returns zero for all joint or facial expression values, and in this case it is used to model the inter-blink period. The length of the zero motion is varied every time around the loop to ensure the timings are not too repetitive. A facial animation loop, in which the character smiles occasionally, is implemented similarly.

The head tracker input is used in a number of ways. The first is for posture shifts. During conversation, people tend to synchronize their movements, particularly movements such as posture shifts (Kendon 1970). This synchronization is a strong sign of rapport between individuals. In order to simulate this, we detect posture shifts by finding large changes in position and then trigger a posture shift in the character. The character’s posture is modeled as a finite state machine animation in which each state is a different possible posture. On a posture shift, a new state is chosen at random. The finite state machine animation performs a smooth transition between the animations associated with each state, ensuring smooth posture shifts.

The head tracker is also used by the proxemics behavior. Proxemics is the use of space in social interaction; for our characters, this means maintaining a comfortable distance from, and orientation to, the participant. The relative distance and angle of the participant to the character are calculated from the tracker position. If they are too large or small, the character turns to face the participant or takes a step forward or backward. Again, this behavior is modeled as a finite state machine, with the default state being the zero motion and a state for each movement direction. The final use for the position input is to control the gaze behavior: the position gives a target to look at.

The audio input is used to detect when the participant is speaking and to give feedback behavior. In this implementation, the feedback consists of occasional nodding to give encouragement. This is implemented as a loop in the same way as blinking.
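
A sketch of how the blinking loop and the posture finite state machine described above could be assembled from the functional building blocks (the concrete classes and timing constants are illustrative, not Piavca's own):

```python
import random
from typing import Callable, List

Anim = Callable[[float], float]   # one scalar channel for brevity

def blink_clip(duration: float = 0.15) -> Anim:
    """Eyelid-closure weight: 1 while blinking, 0 otherwise."""
    return lambda t: 1.0 if 0.0 <= t < duration else 0.0

def blinking_behavior() -> Anim:
    """Loop of (blink clip followed by a zero animation); the zero animation's
    length is re-randomised each cycle so the timing is not too repetitive.
    Assumes t is queried with monotonically increasing values."""
    state = {"cycle_start": 0.0, "cycle_end": 0.0, "clip": blink_clip()}

    def f(t: float) -> float:
        if t >= state["cycle_end"]:                    # start a new cycle
            state["cycle_start"] = t
            state["cycle_end"] = t + 0.15 + random.uniform(2.0, 6.0)
        return state["clip"](t - state["cycle_start"])
    return f

class PostureFSM:
    """Finite state machine over posture clips; a posture shift picks a
    new state at random (transition smoothing omitted)."""
    def __init__(self, postures: List[Anim]):
        self.postures = postures
        self.current = 0
        self.entered = 0.0

    def shift(self, t: float) -> None:
        self.current = random.randrange(len(self.postures))
        self.entered = t

    def __call__(self, t: float) -> float:
        return self.postures[self.current](t - self.entered)

blinks = blinking_behavior()
posture = PostureFSM([lambda t: 0.0, lambda t: 0.1, lambda t: -0.1])
posture.shift(t=3.2)                  # triggered by a detected posture shift
print(blinks(0.05), posture(4.0))
```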

The other major input is from the controller, who can issue commands to control the character’s speech. The behavior consists of a number of multi-modal utterances that can be triggered using a graphical user interface. Multi-modal utterances are short scripted behaviors that combine speech (in this case audio files) with animation elements. For example, the audio is accompanied by facial animation for lip synchronization and also appropriate gestures. The scripts give the creators of the character very tight control of the character’s behavior, and potentially high-quality behavior can be created. This comes at the possible cost of some interactivity; however, we believe that our methodology of combining more scripted elements with real-time interaction can combine the benefits of both.
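
One possible way to represent such a multi-modal utterance under the functional model (the fields and the combination by addition are an illustrative assumption; the actual utterances are authored as scripts in the behavior definition language):

```python
from dataclasses import dataclass
from typing import Callable, Dict

Anim = Callable[[float], Dict[str, float]]

@dataclass
class Utterance:
    """A scripted multi-modal utterance: an audio file plus animation
    elements that are started together when the controller triggers it."""
    audio_file: str
    lip_sync: Anim          # facial animation matching the audio
    gesture: Anim           # accompanying body gesture
    duration: float

    def animation(self) -> Anim:
        """Combine the animation elements by addition, as in Sect. 3."""
        def f(t: float) -> Dict[str, float]:
            a, b = self.lip_sync(t), self.gesture(t)
            return {k: a.get(k, 0.0) + b.get(k, 0.0) for k in set(a) | set(b)}
        return f

greeting = Utterance(
    audio_file="hello.wav",
    lip_sync=lambda t: {"jaw_open": 0.3 if t < 1.0 else 0.0},
    gesture=lambda t: {"r_arm_raise": min(t, 0.5)},
    duration=1.2,
)
print(greeting.animation()(0.2))
```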

Figure 3 shows some still frames from an interaction with our virtual character; the accompanying video shows the actual interaction. The type of character set-up we have described is only one way of interacting. For example, the character could be totally autonomous, with utterances triggered from an AI “chat-bot” system, or the character’s speech could be taken directly from the controller’s own voice. In the second case, many elements such as gestures and lip synchronization would have to be generated automatically to suit the speech. Our framework makes it easy to build new styles of interaction from existing components.
Fig. 3 A real and virtual human interacting in an immersive virtual environment

5 Conclusion

This paper has presented a software framework for creating interactive virtual characters. The many different types of behavior involved in human social interaction imply a range of different styles of animation and interaction with a character. Our framework allows us to unify and combine these diverse methods using a single abstract function representation. It makes it easy to create new character systems by combining different behavior modules in different ways. The framework has been released as part of the open source project Piavca (http://piavca.sourceforge.net/); we encourage readers to try out the functionality.

This framework has been used for virtual reality experiments at University College London. These have demonstrated that people respond to the characters in some ways as if they were human. For example, Pan and Slater (2007) conducted an experiment in which socially phobic male participants interacted with a virtual woman who engaged them in conversation that became increasingly intimate. The experimenters measured skin conductance level, which demonstrated that the intimate conversation resulted in greater arousal. Interestingly, proxemic behavior played an important role in this. At one point, the character’s proxemic distance was decreased, resulting in her moving closer to the participant. This produced the highest skin conductance levels in the experiment. These quantitative results were supported by the participants’ subjective reports, with many reporting strong emotions such as anxiety or even guilt at “cheating” on their partner with a virtual woman. An analysis of the body movements of the participants during the scenario (Pan et al. 2008) showed that they used more social non-verbal cues, such as nodding or cocking their heads, during the conversation than before it.

Future work on the project will involve increasing the range of functionality and of possible applications. In a modular system such as ours, it is easy to add functionality by either adding new animation functions or combining existing ones in new ways. As the system is applied to different situations and styles of interaction, new requirements will naturally emerge and drive the development of new functionality. We are currently applying the system to more graphically realistic characters and making greater use of motion capture, raising the level of realism that is possible. This greater realism will itself bring new requirements to our animation framework. The most important change we are currently planning is the addition of a graphical user interface for combining behavior functions. This will supplement the existing scripting interface and definition language, and provide a more accessible method of creating characters.

Footnotes
1. For quaternion animations, multiplication is replaced by scaling the rotation angle by s.

2. For quaternions, quaternion multiplication is used instead of addition.

Acknowledgments

We would like to thank the funders of this work: BT plc, the European Union FET project PRESENCIA (contract number 27731) and the Empathic Avatars project funded by the UK Engineering and Physical Sciences Research Council. We also would like to thank the members of the University College London Department of Computer Science Virtual Environments and Graphics Group.

Copyright information

© Springer-Verlag London Limited 2010