Introduction

Much working time is spent in meetings and, as a consequence, meetings have become the subject of multidisciplinary research. The introduction of technology into meetings offers new perspectives on, among other things, communication and language, human perception and social interaction. In this paper we describe how virtual meeting rooms (VMRs) can be used to improve remote meeting participation, to visualize multimodal data and to serve as an instrument for research into social interaction in meetings.

The research reported in this paper has been carried out in the context of AMI, a European research project that aims at developing new technologies for supporting meeting activities, such as meeting browsers, and technology that makes remote meeting participation easier, more effective and more natural. The Human Media Interaction (HMI) group of the University of Twente is one of the AMI partners (Nijholt et al. 2004). The HMI group has a tradition of research into interaction with embodied conversational agents, computer graphics for virtual environments and machine learning techniques for recognizing higher-level features (e.g., dialogue acts, gestures and emotions) from lower-level features (e.g., words, hand movements and facial features).

The paper is organized as follows. First, we summarize how advances in technology allow for new opportunities in supporting meetings through the use of virtual reality. Next, we present a schematic overview of the process from observation to simulation that underlies our concept of the VMR. We discuss several possible uses of VMRs in Sect. 4. As an illustration of this scheme, we then focus on an experiment we conducted with an implementation of a VMR.

Meetings in virtual reality

In a general sense, a meeting is any coming together, willingly or unwillingly, of two or more people at such close distance to each other that they are aware of each other's presence and, willingly or unwillingly, react to it. The concept of distance, and with it the concept of being in the same meeting room, has been strongly shaped, and is still being renewed, by the development of technology over the last few centuries, in particular by developments in communication and information technology. This is a process of conceptual development, in which the concept of sharing the same space evolves from physically sharing the same space to mentally sharing the same space, such as the mentally “shared space” of a chat system or the visually shared space of an immersive meeting room. Throughout this development we can identify a number of central themes: the struggle for individual privacy, respecting each other's private space, the need to be respected by others, and the will to express oneself and one's ideas and to realize individual goals.

In a more restricted sense, a meeting is an organized process of people coming together to focus on a common topic or task. Meeting, in this sense, is one of the characteristics of the modern way we organize our work in all kinds of organizations. However professionalized and organized a meeting may be, it is still a gathering of people. All the themes that play a role in the more general sense of meeting can be identified in these meetings as well, albeit often in more organized, conventional forms, mediated by rules of good conduct: turn-taking behavior, addressing behavior, politeness rules and dominance relations.

The impact of technology on meetings cannot be described adequately in terms of quantitative, measurable effects on the processes that occur in meetings in their existing forms. Technology develops the very idea of a meeting itself, and it affects how people realize that idea. Moreover, and this is essential for meetings, technology offers new perspectives on, among other things, communication and language, human perception and social interaction. These new perspectives may help to gain more insight into the essential qualities of these aspects of social reality. A discussion of the relation between meetings and technology in general, from the viewpoint of the three meeting concepts of resources, processes and roles, is presented by Rienks et al. (2006b).

The state of the art in computer graphics and embodied conversational agents allows the creation of VMRs: 3D visualizations of a meeting room in virtual reality (see Fig. 2). VMRs are useful for various purposes, which can be grouped into the following three categories:

1. As an (immersive) virtual environment: a communication means for real-time remote meeting participation. Conducting remote meetings in a virtual environment allows enhanced visualization of features in order to stress certain elements of the communicative behavior of the participants, such as direction and level of attention, or agreement and disagreement, which are often not very clear in video-based remote meetings.

2. Presentation of multimedia information about meetings. Information can be obtained directly from recordings of behavior in real meetings (e.g., tracking of head or body movements, voice), from annotations or from machine learning models that induce higher-level features from recordings. 3D virtual replay of meetings allows us to have, for example, a restructured and coherent summarization of a topic, even when it was discussed in a disjointed and fragmentary manner in the original meeting, while still capturing many salient (non-verbal) details.

3. Research into social interaction, and into the recognition and interpretation of visualized information. Virtual environments allow for tight stimulus control of various independent factors (such as voice, gaze, distance, gestures and facial expressions) and can be used to study how these influence features of social interaction and social behavior. Conversely, the effect of social interaction on these factors can be studied in virtual environments as well.

From observation to simulation in a VMR

Figure 1 shows an abstract view of the “observation to simulation” process underlying our concept of the VMR. The left-hand side depicts observation and interpretation: human interactions in meetings are recorded on video and audio, observation of these recordings leads to descriptions of observable events (body movements, facial expressions, speech, etc.), and these observations can be interpreted on progressively more complex levels (see also Reidsma et al. 2005). The right-hand side depicts the simulation process: at a certain point, the information from the annotations is used for (re)generation of the communicative behavior, sometimes recreating the lower-level information from models of human interaction.

Fig. 1 Schematic overview of the various steps from observations and recordings, via annotations, to simulations, mediated by various models expressing the relations between aspects of the verbal and non-verbal conversational behavior of participants in the meeting

Annotations of behavior in meetings

The first step in realizing the process described in the previous section is the annotation of recordings containing human-human interactions. Within the AMI project a huge effort is spent on meeting data collection, meeting data annotation and the dissemination of these data for various multidisciplinary research purposes inside and outside the project. About 100 h of meetings have been recorded, of which about 60% are scenario-based meetings in which four people meet four times. These meetings are part of a design project in which the participants have to work on a prescribed task: developing a remote TV control unit. Participants have various roles in this scenario and, in order to approximate reality as closely as possible, external events and information are brought in that may influence the decision-making process as well as the outcome of the meetings (Post et al. 2004).

The recordings have been annotated at varying levels of detail along different dimensions (Carletta et al. 2006). There are several reasons for creating manual annotations of corpus material. In the first place, ground-truth knowledge is needed in order to evaluate (new) techniques for the automatic recognition of those same aspects from lower-level information. In the second place, as long as the quality of automatic recognition results is not high enough, only manual annotations provide the quality of information needed for research on higher levels of interpretation, such as human-human interaction patterns (see also Sect. 4.3). A few examples of annotated layers are hand and head gestures, speech transcription, communicative acts, argument structures, topics and summaries.

The annotations can be organized in layers of increasing complexity. The lowest layers describe mostly the form of the interactions, or the observable events. The higher layers describe interpretations of these observable events, giving the function of the interactions. Consider for example the situation where a participant raises a hand. The form of this gesture can be observed and annotated as “raised hand”. On an interpretation layer, this event may be annotated with the function of this gesture, such as “request for a dialogue turn” or “vote in a voting situation”.
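
To make this layered organization concrete, the sketch below shows one way such annotations could be represented in code. The `Annotation` class and the layer names are our own illustration; the actual AMI annotations are stored in a dedicated corpus format (Carletta et al. 2006).

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """One annotated event on a single layer (illustrative, not the AMI format)."""
    layer: str          # e.g., "gesture-form" or "gesture-function"
    start: float        # start time in seconds
    end: float          # end time in seconds
    label: str
    participant: str

# Form layer: the observable event.
raised_hand = Annotation("gesture-form", 132.4, 135.1, "raised hand", "P2")

# Function layer: possible interpretations of the same observable event.
interpretations = [
    Annotation("gesture-function", 132.4, 135.1, "request-turn", "P2"),
    Annotation("gesture-function", 132.4, 135.1, "vote", "P2"),
]
```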

Once the annotations have been produced, they can be analyzed. One of the results of such analyses consists of models of human interaction at varying levels of abstraction. Lower-level models might describe how people generally realize certain communicative goals, e.g., how the addressee of an utterance is expressed (Jovanovic and Op den Akker 2004) or how agreement and disagreement are shown. Higher-level models might describe aspects such as which interaction patterns characterize efficient meetings.
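
As a toy illustration of how a lower-level model might be induced from such annotations, the following sketch estimates how often a speaker's annotated gaze target coincides with the annotated addressee. The counting scheme is deliberately simplistic and is not the method of Jovanovic and Op den Akker (2004).

```python
from collections import Counter

def gaze_addressee_agreement(utterances):
    """utterances: iterable of (gaze_target, addressee) label pairs, one per
    annotated utterance. Returns the fraction where the two coincide."""
    counts = Counter(gaze == addressee for gaze, addressee in utterances)
    total = counts[True] + counts[False]
    return counts[True] / total if total else 0.0

# Hypothetical annotated fragment (made-up labels):
fragment = [("P1", "P1"), ("P1", "P3"), ("P3", "P3"), ("group", "group")]
print(gaze_addressee_agreement(fragment))  # 0.75
```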

Regeneration of behavior in meetings

The annotations, together with selected models derived from these annotations, can be used to replay (parts of) a meeting in a VMR. Figure 2 shows an image of the real meeting room together with two different views of the VMR. The annotations described in the previous section can be replayed in the meeting room in different ways. Replay can show all available annotated information (rightmost picture, a shot that shows head orientation, recognized body pose, current speaker and addressees of utterances) or only a selection (middle picture, showing only head orientation).

Fig. 2 The AMI meeting recording room and two different visualizations of the Virtual Meeting Room

Furthermore, the replay can either be a direct replay of observed behavior, or an interpreted replay that starts from a high-level interpretation of what happened during the meeting. In the latter case, appropriate behavior is generated that expresses the right content, but in a potentially different form. The rules for generating this communicative behavior are derived from domain knowledge (models and theories of human interaction) collected through the analysis of large amounts of data from real-world examples. Examples are models for choosing modalities, realizing gestures or speech, formulating sentences, deciding on communicative goals given beliefs and intentions, choosing communicative actions based on goals, etc. Interpreted replay in its most complete form allows for a restructured and coherent summarization of a topic, even when it was discussed in a disjointed and fragmentary manner in the original meeting, while still capturing many salient (non-verbal) details.
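
In outline, such a replay facility can be thought of as a filtered merge of time-stamped annotations, optionally passed through a generation step. The sketch below is a minimal illustration of that view; `render` stands in for whatever animation back end a VMR actually uses, and the layer names are hypothetical.

```python
def render(behavior):
    """Stub for the VMR animation back end (hypothetical)."""
    print("render:", behavior)

def replay(annotations, layers, t_start, t_end, generate=None):
    """Replay annotations of the selected layers in time order.

    generate: optional function mapping a high-level annotation to newly
    generated behavior (interpreted replay); if None, the recorded form
    is shown directly (direct replay).
    """
    selected = [a for a in annotations
                if a.layer in layers and a.start < t_end and a.end > t_start]
    for a in sorted(selected, key=lambda a: a.start):
        render(generate(a) if generate else a)

# Direct replay of head orientation only (cf. the middle picture of Fig. 2):
# replay(corpus, layers={"head-orientation"}, t_start=0.0, t_end=60.0)
```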

Uses of the VMR

In Sect. 2 it was already mentioned that this paper focuses on three categories of VMR applications: an environment for teleconferencing that provides a sense of immersion and presence; visualization of multimedia information from meetings for several purposes; and an instrument for the elicitation and validation of models of social interaction.

Remote participation and enhancement of meetings

A VMR can be used as an environment for teleconferencing, as described by Greenhalgh and Benford (1995). In addition to the usual advantages of remote meeting participation, it offers control over some features that are problematic in traditional video-based conferencing (e.g., natural visualization of gaze direction cues). But there are more opportunities for influencing the remote interaction during a teleconference in a VMR.

In the first place, different meeting participants need not necessarily have the same view of the virtual environment. This simple fact introduces many possibilities worth investigating. Participants can adapt the virtual environment in which the meeting takes place to their own preferences and comfort without disturbing the other people. Each person can be given his or her own perception of the seating arrangement. Since it is known that some positions are more advantageous in terms of discussion impact than others, it might be sensible to give each participant such a view of the seating that he or she never perceives himself or herself to be in the most disadvantageous position, so that all participants feel more comfortable during the meeting. Another way of adapting the meeting to one's own preferences involves Transformed Social Interaction, which allows a participant to influence the way he or she is presented remotely (Bailenson et al. 2004).
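
A minimal sketch of such personalized seating, under our own simplifying assumption that seats can be ranked once by presumed discussion impact: in each participant's private view, that participant occupies the most advantageous seat.

```python
def private_seating(viewer, participants, seats_by_impact):
    """Seat assignment for one viewer's private view of the VMR.

    seats_by_impact: seat ids ordered from most to least advantageous
    (a hypothetical ranking). The viewer always gets the best seat in
    his or her own view; the others fill the remaining seats.
    """
    order = [viewer] + [p for p in participants if p != viewer]
    return dict(zip(order, seats_by_impact))

participants = ["P1", "P2", "P3", "P4"]
seats = ["head", "right", "left", "far-end"]   # invented seat labels
for p in participants:
    print(p, "sees:", private_seating(p, participants, seats))
```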

A virtual teleconferencing environment also offers the possibility of introducing autonomous agents that have the same communicative channels at their disposal as the human participants (embodied conversational agents, or ECAs). This creates opportunities for designing experiments to discover regularities in human social interaction, as will be described in Sect. 4.3. It also facilitates the introduction of helper agents or pro-active meeting assistants into an actual meeting (Rienks et al. 2006a). Work by Slater and Steed (2001) has already shown that people can be influenced in their behavior, as well as in their assessment of a situation, by the presence and behavior of autonomous ECAs, even if they know that the agents do not represent a real human. Given the emergence of advanced recognition technology for human interaction, partly developed from extensively annotated corpora, ECAs can therefore be used to influence the course of a meeting. A simple example would be the introduction of a virtual chairman with a regulating task. Based on an analysis of what is going on in the meeting, the virtual chairman can influence the progress of the meeting (request a vote, encourage silent people to speak, mention gaps in the argumentation). An enhanced version of this chairman becomes possible if recognition technology advances to the point where potentially tense situations can be detected automatically: the virtual chairman could then try to defuse such situations by making a joke or changing the subject of discussion. This topic has been investigated in more detail by Rienks et al. (2006a).

Clearly, all these kinds of support build on knowledge of which types of events and behaviors in the real meeting must be presented in the virtual meeting in order to maximize the quality of those impressions that the user requires, given his or her task and role in the meeting, such as the feeling of presence and the possibility of mutual gaze.

Re-visualization of meetings

With a general implementation of a VMR it is also possible to re-visualize the contents of a previously recorded meeting. This can be done literally, by staying as close to the original recordings as possible, or more conceptually, by aiming for a visualization that conveys an impression of the most important contents of the meeting (rather than its actual form). The re-visualization process traces a path through Fig. 1 that starts at the bottom left corner (real world/video recordings) and first goes upwards through various stages of observation and interpretation. At a certain point the transition to the right part of the model is made (in a sense “copying” the information present at one level on the left-hand side to the same level on the right-hand side), after which the generation flow is followed down to produce a replay of the meeting in the VMR (bottom right).

Transitions at the lowest levels are already interesting. For example, replaying recognized 3D joint angles in the VMR in parallel with the original video offered a kind of quick validation of the pose recognition process, which helped spot recognition errors. If the recognition is good enough to use as input for a gesture-labeling algorithm, but not good enough to give convincing replay results, the transition can be made at a higher level: after interpreting the movements as labeled gestures, the replay can be created from these gesture types rather than directly from the body poses, leading to an animation that is not an exact copy of the original video but that expresses the meaning of the movements, possibly even more clearly.
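
The choice of transition level can be viewed as a simple dispatch on recognition quality. The following sketch makes that decision explicit; the confidence threshold and the two helper functions are invented for illustration and are not part of any AMI tool.

```python
def choose_replay(pose_stream, pose_confidence, label_gestures, animate_gesture,
                  threshold=0.9):
    """Pick the level at which recorded data crosses over into simulation.

    pose_confidence: estimated quality of the 3D joint-angle recognition (0..1)
    label_gestures:  maps a pose stream to a list of gesture labels
    animate_gesture: generates an animation from one gesture label
    """
    if pose_confidence >= threshold:
        # Low-level transition: replay the recognized joint angles directly.
        return pose_stream
    # High-level transition: interpret first, then regenerate the movement.
    return [animate_gesture(g) for g in label_gestures(pose_stream)]
```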

Another possible level at which the transition can be made is that of communicative actions, such as contributions to or judgments about the current topic of discussion. A simulation at this level might use different realizations of the same communicative actions. This can be useful for applying appropriate, culturally determined gestures, or for highlighting aspects of the contributions in relation to social conventions. These possibilities also apply to the use of the VMR as a remote meeting facility.

The final and most complex possibility discussed here is the summarized replay of a meeting or a set of meetings. If a discussion about a certain issue is spread over fragments of several meetings, the main structure of the arguments can be found at a certain level of interpretation. By making the transition at this level, selective replay enables a new, cohesive and interpreted replay of the discussions. If the models for simulating the individual participants are accurate, the main points of the original meeting will stay intact (who proposed what, who was for or against, who used or supported which arguments, etc.), without the redundant information that was conveyed in reality. This form of simulation deviates considerably from the original recordings, but the relevant content (the function) remains the same.

Validation of models of social interaction

If autonomous agents are to display believable social behavior, many communicative aspects have to be taken care of, and for these aspects models are needed. Which communicative actions are desirable, and in which circumstances? How does a person show whom (s)he is addressing? Does it depend on status differences? What is acceptable behavior for an ECA to show that it is listening to the speaker and is interested in what the speaker says? How do people exhibit and perceive signals related to relative status? Such models are also needed for the effective automatic analysis of meetings for other purposes, such as retrieval or meeting support. The VMR provides ways to both elicit and validate such models. The following paragraphs give a few examples. A few other experiments that use virtual environments for the elicitation and/or validation of models of (social) interaction are described by Bailenson et al. (2001) and Pertaub et al. (2002).

VMR Turing test

The VMR Turing test (adapted from Bailenson et al. 2004) allows one to validate a complex set of models by testing whether they result in convincing, natural social interaction by ECAs. It works as follows: a human subject is shown a VMR containing ECAs as well as avatars controlled by other humans. From the human avatars, all communication channels that the ECA does not have (for example, facial expressions) are removed. The subject is asked to judge which participants are ECAs and which are avatars of humans. For example, one can validate models of listening behavior by having the subject talk to two humanoids, one of which is an ECA while the other is operated by a human; the aim is to find out whether the subject can tell which is which when neither is allowed to talk back.
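
The channel-removal step of this test can be made explicit in a few lines. The channel names below are invented examples; the essential point is that the human-driven avatar is stripped of every channel the ECA cannot produce, so that the judgment rests on the modeled behavior alone.

```python
def equalize_channels(human_avatar, eca_channels):
    """Keep only those channels of a human-driven avatar that the ECA
    can also produce (e.g., drop facial expressions)."""
    return {ch: signal for ch, signal in human_avatar.items() if ch in eca_channels}

# Hypothetical trial: the ECA models gaze, head movement and posture,
# but not facial expression, so the human avatar loses that channel too.
eca_channels = {"gaze", "head", "posture"}
human_avatar = {"gaze": "...", "head": "...", "posture": "...", "face": "..."}
print(equalize_channels(human_avatar, eca_channels))
```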

Validating models of conversational behavior

Besides the fact that models of conversational behavior should lead to natural-looking behavior, as described in the previous section, it is also important that the behavior transmits the intended conversational cues. This, too, can be evaluated in a VMR. For example, a possible way to validate models of addressing behavior is to have an ECA simulate a fragment of conversation, expressing the addressee of utterances in one of the many ways allowed by the model (using vocatives, gaze, etc.). A human participant, immersed in the VMR, is then asked to assess who is the addressee of each utterance. This experiment can validate whether a model of addressing behavior is good enough to be used in an ECA, in the sense that a human will understand its addressing cues. The same type of setup can be used to assess the suitability of many more models of conversational behavior.

Eye contact and intention to interact

Gaze and mutual gaze are powerful elements of human–human interaction. They play a role in many aspects of communication and communication regulation, such as turn taking, backchannelling and determining salience and information status. One of the communicative functions where gaze is an important mechanism is signaling and detecting intention to interact [see for example the work of Cary (1978) and Kendon (1990)]. This has been taken up in the work on BodyChat by Vilhjálmsson et al. (1998), where intention-to-interact is signaled using gaze in a graphical chat environment, and the work of Peters (2005), in which agents calculate the perceived level of interest from potential conversation partners based on gaze behavior, among other cues.

We intend to use the VMR to experiment with models that simulate “intention to interact” in interaction and coordination with user behavior, and to test whether these models are adequate for evoking appropriate reactions from human users. Such models can then be used to enhance the visualization of participants in a remote meeting setting in order to facilitate smooth interaction processes.

Experiment in the VMR: perception of head orientation

As an example of perception research in the VMR, we summarize an experiment that we performed to assess human observers' accuracy in judging head orientation. There is an obvious relation between head orientation and gaze or focus of attention. The perception of gaze has been well studied. One of the first experiments is due to Gibson (1963), who measured the accuracy of observing gaze direction in dyadic situations, in which a human observer has to assess where the sender is looking relative to the observer himself. Triadic situations are different, since the observer has to report where a sender is looking, not relative to the observer himself; this was found to be a more difficult task due to the more unfavorable position of the observer (Krüger and Hückstedt 1969). Our interest is to determine how factors such as distance and viewing angle play a role in observation accuracy. The experiment described here is a preliminary investigation to obtain an estimate of accuracy, to be used in further experiments.

Compared to using recorded video, the use of a virtual environment differs in that our avatar representation is an abstraction of the real person. The presented avatar might be too simplistic for its head orientation to be determined reliably. However, Sagiv and Bentin (2001) found that schematic faces produce effects similar to real faces. This finding is supported by Wilson et al. (2000), who found that the perception of head orientation remained accurate even for low-resolution images.

Method

An avatar was positioned in the VMR (see Fig. 3) and observed from a fixed viewpoint. Numbered balls were placed at a distance of 1.5 m from the avatar, at eye level. The balls were placed in a range between 60° left and 90° right of the avatar. We used three values for the angular distance between the balls: 15°, 22.5° and 30°. For the given angular range of 90°, this amounts to 7, 5 and 4 balls, respectively. The eyes of the avatar were fixed and pointed straight ahead. Observers, seated in front of a 19 in. TFT screen showing the VMR, were asked to complete a session of three parts, each part corresponding to a different angular ball distance. The six session types corresponded to the six different orders in which the parts can be presented. Within a session part, each ball position was presented exactly once, and the order within each part was randomized. The observers were asked to predict at which ball the avatar was looking. Each observer scored 16 samples. Observers could view their progress in the experiment but did not receive any feedback on their judgments.
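
Taking the 90° range and the three spacings at face value, the stimulus set can be reconstructed as follows; this session-building code is our own sketch of the design, not the software used in the experiment.

```python
import random
from itertools import permutations

SPACINGS = (15.0, 22.5, 30.0)   # angular ball distances in degrees
ANGULAR_RANGE = 90.0            # range covered by the balls

def make_session(part_order):
    """One session: three parts, one per spacing, in the given order.
    Within a part, every ball position appears exactly once, randomized."""
    session = []
    for spacing in part_order:
        n_balls = int(ANGULAR_RANGE / spacing) + 1   # 7, 5 and 4 balls
        positions = list(range(n_balls))
        random.shuffle(positions)
        session.append((spacing, positions))
    return session

# The six session types are the six possible orders of the three parts;
# each observer judges 7 + 5 + 4 = 16 samples in total.
session_types = list(permutations(SPACINGS))
print(make_session(session_types[0]))
```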

Fig. 3 The graphical setup of the experiment

Results

A total of 36 persons participated in the experiment. Each of the six session types was completed six times, which resulted in a total of 576 judged samples. The performance scores for each of the conditions are summarized in Table 1.

Table 1 Performance scores for ball identification with different angular ball distances

The results indicate that decreasing the angular distance between the balls increases the judgment error. One quarter of the stimuli were judged incorrectly when the angular distance was only 15°. With an angular distance of 30°, our results indicate that discrimination is possible with an accuracy of 97.92%. Analysis of the scores for individual balls revealed differences between balls; due to the limited amount of space available, we do not discuss these results here. The interested reader is referred to Poppe et al. (“Accuracy of head direction perception in triadic situations: experiments in a virtual environment”, in preparation).
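
As a consistency check on these percentages (our own arithmetic, not reported in Table 1): with 36 observers, the 30° part contributes 4 samples each and the 15° part 7 samples each, so

```latex
36 \times 4 = 144 \text{ samples at } 30^{\circ}, \quad
0.9792 \times 144 \approx 141 \text{ correct (3 errors)}; \qquad
36 \times 7 = 252 \text{ samples at } 15^{\circ}, \quad
0.25 \times 252 = 63 \text{ errors}.
```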

Conclusions and further research

The VMR may add value to the technological means people already have to meet and communicate. The various modalities, such as speech, gaze, distance, gestures and facial expressions, can be controlled, which allows VMRs to be used to improve remote meeting participation, to visualize multimodal data and to serve as an instrument for research into social interaction in meetings. We described the process from observation through annotation to simulation, and a model that describes the relations between the annotated features of verbal and non-verbal conversational behavior. This model can be used to relate various research tasks in the field of meeting research. An experiment was conducted in the VMR in which we assessed human observers' accuracy in perceiving head orientation. The use of the VMR allowed for good stimulus control, and we demonstrated that we could use this virtual environment instead of video recordings. Regarding our experiment, ongoing work focuses on determining which factors play a role in the assessment of head orientation. Furthermore, we will pursue our work on meeting modeling and investigate how real meetings can be presented effectively by means of a virtual representation that shows the most informative view of the meeting.

Much research remains to be done on how people perceive and interpret meeting situations, and how they react to them, in a VMR. The results of such research are necessary to determine which information channels and modalities are important for effectively performing the various tasks in a meeting. This concerns not only the transfer of task-based information, but also issues such as maintaining a good sense of social presence by representing the appropriate communicative cues.