
Multimodal Human Machine Interactions in Virtual and Augmented Reality

  • Conference paper
Multimodal Signals: Cognitive and Algorithmic Issues

Abstract

Virtual worlds are developing rapidly over the Internet. They are visited by avatars and staffed with Embodied Conversational Agents (ECAs). An avatar is a representation of a physical person; each person controls one or several avatars and usually receives feedback from the virtual world through an audio-visual display. Ideally, all senses should be engaged so that the user feels fully embedded in the virtual world; in practice, sound, vision, and sometimes touch are the available modalities. This paper reviews the technological developments that enable audio-visual interactions in virtual and augmented reality worlds. Emphasis is placed on speech and gesture interfaces, including talking-face analysis and synthesis.





Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chollet, G. et al. (2009). Multimodal Human Machine Interactions in Virtual and Augmented Reality. In: Esposito, A., Hussain, A., Marinaro, M., Martone, R. (eds) Multimodal Signals: Cognitive and Algorithmic Issues. Lecture Notes in Computer Science, vol 5398. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00525-1_1



  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-00524-4

  • Online ISBN: 978-3-642-00525-1

