Multimedia Tools and Applications

Volume 75, Issue 9, pp 5125–5146

Emotional head motion predicting from prosodic and linguistic features

  • Minghao Yang
  • Jinlin Jiang
  • Jianhua Tao
  • Kaihui Mu
  • Hao Li


Emotional head motion plays an important role in human-computer interaction (HCI) and is one of the important factors in improving users’ experience. However, it is still not clear how head motions are influenced by speech features in different emotional states. In this study, we aim to construct a bimodal mapping model from speech to head motions and to discover which prosodic and linguistic features have the most significant influence on emotional head motions. A two-layer clustering scheme is introduced to obtain reliable clusters from head motion parameters. With these clusters, an emotion-related speech-to-head-gesture mapping model is constructed using a Classification and Regression Tree (CART). Based on the statistical results of CART, a systematic statistical map of the relationship between speech features (both prosodic and linguistic) and head gestures is presented. The map reveals the features that have the most significant influence on head motions in long or short utterances. We also analyze how linguistic features contribute to different emotional expressions. The discussion in this work provides an important reference for realistic animation of speech-driven talking heads or avatars.
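To make the CART step more concrete, the following is a minimal sketch of how a CART-style tree scores a candidate feature: it picks the binary split that most reduces Gini impurity, and features with larger reductions are treated as more influential. This is an illustration only, not the authors' implementation. The feature names (`pitch_range`, `is_stressed`), the cluster labels, and the toy data are hypothetical, and only the root split is evaluated rather than a full tree.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, impurity_decrease) of the best binary split on one feature."""
    parent = gini(labels)
    n = len(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - weighted
        if gain > best[1]:
            best = (t, gain)
    return best


def rank_features(rows, labels, names):
    """Rank features by the impurity decrease of their best root split."""
    scored = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        threshold, gain = best_split(column, labels)
        scored.append((name, threshold, gain))
    return sorted(scored, key=lambda s: s[2], reverse=True)


# Hypothetical toy data: [pitch_range, is_stressed] per speech unit,
# labelled with a made-up head-gesture cluster id (not from the paper).
FEATURE_NAMES = ["pitch_range", "is_stressed"]
ROWS = [[0.2, 0], [0.3, 1], [0.8, 0], [0.9, 1], [0.25, 1], [0.85, 0]]
CLUSTERS = ["still", "still", "nod", "nod", "still", "nod"]

ranking = rank_features(ROWS, CLUSTERS, FEATURE_NAMES)
for name, threshold, gain in ranking:
    print(f"{name}: threshold={threshold}, impurity decrease={gain:.3f}")
```

In this toy data the hypothetical `pitch_range` feature separates the two clusters perfectly, so it ranks above `is_stressed`. A full CART would apply the same scoring recursively to each child node.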


Keywords: Visual prosody; Head gesture; Prosody clustering



This work is supported by the National High-Tech Research and Development Program of China (863 Program) (No. 2015AA016305) and the National Natural Science Foundation of China (NSFC) (No. 61332017, No. 61375027, No. 61203258, No. 61273288, No. 61233009, No. 61425017).



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Minghao Yang (1)
  • Jinlin Jiang (2)
  • Jianhua Tao (1)
  • Kaihui Mu (1)
  • Hao Li (1)

  1. The National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
  2. School of International Studies, University of International Business and Economics, Beijing, China
