
Multimodal Human Machine Interactions in Virtual and Augmented Reality

  • Conference paper
Multimodal Signals: Cognitive and Algorithmic Issues

Abstract

Virtual worlds are developing rapidly over the Internet. They are visited by avatars and staffed with Embodied Conversational Agents (ECAs). An avatar is a representation of a physical person; each person controls one or several avatars and usually receives feedback from the virtual world through an audio-visual display. Ideally, all senses should be engaged so that the user feels fully embedded in the virtual world; in practice, sound, vision, and sometimes touch are the available modalities. This paper reviews the technological developments that enable audio-visual interactions in virtual and augmented reality worlds. Emphasis is placed on speech and gesture interfaces, including talking-face analysis and synthesis.





Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chollet, G. et al. (2009). Multimodal Human Machine Interactions in Virtual and Augmented Reality. In: Esposito, A., Hussain, A., Marinaro, M., Martone, R. (eds) Multimodal Signals: Cognitive and Algorithmic Issues. Lecture Notes in Computer Science, vol 5398. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00525-1_1



  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-00524-4

  • Online ISBN: 978-3-642-00525-1

