Multimedia Tools and Applications

Volume 74, Issue 22, pp 10025–10051

User behavior fusion in dialog management with multi-modal history cues

  • Minghao Yang (corresponding author)
  • Jianhua Tao
  • Linlin Chao
  • Hao Li
  • Dawei Zhang
  • Hao Che
  • Tingli Gao
  • Bin Liu


Making a talking avatar sensitive to user behaviors enhances the user experience in human-computer interaction (HCI). In this study, we combine a user's multi-modal behaviors with their historical information in dialog management (DM) to make the avatar sensitive not only to the user's explicit behavior (speech commands) but also to supporting expressions (emotion, gesture, etc.). According to the different contributions that facial expression, gesture and head motion make to speech comprehension, the dialog manager divides the user's multi-modal behaviors into three categories: complementation, conflict and independence. The behavior category is first obtained automatically from a short-term and time-dynamic (STTD) fusion model with audio-visual input, and each category leads to a different avatar response in later dialog turns. Conflict behavior usually reflects an ambiguous user intention (for example, the user says "no" while smiling); in this case, a trial-and-error scheme is adopted to resolve the conversational ambiguity. For the subsequent dialog process, we divide the avatar's dialog states into four types: "Ask", "Answer", "Chat" and "Forget". When complementation or independence behaviors are detected, the user's supporting expressions as well as his or her explicit behavior serve as triggers for topic maintenance or transfer among the four dialog states. In the first part of the experiments, we examine the reliability of the STTD model for user behavior classification. Based on the proposed dialog management and the STTD model, we then construct a driving-route information query system by connecting the behavior-sensitive dialog management (BSDM) to a 3D talking avatar.
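The three-way behavior categorization described above can be sketched as a simple fusion rule. The function below is an illustrative assumption, not the paper's actual STTD model: it assumes per-modality polarity scores and a visual-confidence threshold, names chosen purely for exposition.

```python
def categorize_behavior(speech_polarity, visual_polarity,
                        visual_confidence, threshold=0.3):
    """Classify a multi-modal behavior as 'complementation', 'conflict'
    or 'independence' (the three categories used in the paper).

    speech_polarity / visual_polarity: -1.0 (negative) .. +1.0 (positive),
    e.g. saying "no" -> -1.0, a detected smile -> +0.8.
    visual_confidence: reliability of the visual cue detection, 0..1.
    All names and thresholds here are illustrative assumptions.
    """
    if visual_confidence < threshold:
        # Visual channel carries no usable signal: treat modalities as independent.
        return "independence"
    if speech_polarity * visual_polarity >= 0:
        # Channels agree in sign: the visual cue complements the speech command.
        return "complementation"
    # Channels disagree, e.g. saying "no" while smiling: ambiguous intention.
    return "conflict"

print(categorize_behavior(-1.0, +0.8, 0.9))  # prints "conflict"
```

A "conflict" result would then trigger the trial-and-error clarification turn described in the abstract.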
Conversation records of the avatar with different users show that the BSDM makes the avatar able to understand and respond to users' facial expressions, emotional voice and gestures, which improves the user experience in multi-modal human-computer conversation.
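The four dialog states and their trigger-driven topic maintenance or transfer can be sketched as a small state machine. The transition table below is a hypothetical illustration: the abstract names the states and triggers but does not specify the exact transition rules.

```python
# Hypothetical sketch of the four-state dialog manager ("Ask", "Answer",
# "Chat", "Forget"); the transition table is an illustrative assumption.
TRANSITIONS = {
    # (current_state, behavior_category) -> next_state
    ("Ask", "complementation"): "Answer",   # clear intent: answer the query
    ("Ask", "independence"): "Ask",         # explicit command only: keep asking
    ("Ask", "conflict"): "Ask",             # ambiguous: re-ask (trial and error)
    ("Answer", "complementation"): "Chat",  # topic satisfied: move to casual chat
    ("Answer", "independence"): "Ask",      # new explicit request: new topic
    ("Answer", "conflict"): "Ask",          # ambiguity: clarify by re-asking
    ("Chat", "complementation"): "Chat",    # maintain the current topic
    ("Chat", "independence"): "Ask",        # transfer to a new topic
    ("Chat", "conflict"): "Forget",         # drop the ambiguous topic
    ("Forget", "complementation"): "Ask",   # re-engage with a fresh question
    ("Forget", "independence"): "Chat",
    ("Forget", "conflict"): "Forget",
}

def next_state(state, behavior_category):
    """Return the avatar's next dialog state for a detected behavior category."""
    return TRANSITIONS[(state, behavior_category)]

state = "Ask"
for trigger in ["complementation", "complementation", "conflict"]:
    state = next_state(state, trigger)
print(state)  # Ask -> Answer -> Chat -> Forget; prints "Forget"
```

A table-driven design like this keeps the topic-maintenance/transfer policy in one place, so the triggers produced by the behavior classifier can be remapped without touching control flow.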


Keywords: Dialog management (DM) · Multi-modal data fusion · Human computer interaction (HCI) · Emotion detection



This work is supported by the National Natural Science Foundation of China (NSFC) (No. 61273288, No. 61233009, No. 61203258, No. 61530503, No. 61332017, No. 61375027), and the Major Program for the National Social Science Fund of China (13&ZD189).



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Minghao Yang (corresponding author), Jianhua Tao, Linlin Chao, Hao Li, Dawei Zhang, Hao Che, Tingli Gao, Bin Liu
  • National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
