Date: 10 May 2009
A probabilistic multimodal approach for predicting listener backchannels
During face-to-face interactions, listeners use backchannel feedback such as head nods to signal to the speaker that the communication is working and that the speaker should continue. Predicting these backchannel opportunities is an important milestone for building engaging and natural virtual humans. In this paper we show how sequential probabilistic models (e.g., Hidden Markov Models or Conditional Random Fields) can automatically learn from a database of human-to-human interactions to predict listener backchannels using the speaker's multimodal output features (e.g., prosody, spoken words and eye gaze). The main challenges addressed in this paper are automatic selection of the relevant features and optimal feature representation for probabilistic models. For prediction of visual backchannel cues (i.e., head nods), our prediction model shows a statistically significant improvement over a previously published approach based on hand-crafted rules.
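As a rough illustration of how a sequential probabilistic model can map a stream of speaker cues to backchannel opportunities, the sketch below runs Viterbi decoding on a toy two-state Hidden Markov Model. It is not the authors' implementation: the state names, the three discrete cues ("speech", "pause", "gaze"), and every probability are hypothetical values set by hand, whereas the paper learns its models from recorded human-to-human interactions.

```python
import math

# Hypothetical hidden states: "none" = no backchannel opportunity, "bc" = opportunity.
STATES = ("none", "bc")

# All probabilities are made-up illustrative values, not numbers from the paper.
START = {"none": 0.9, "bc": 0.1}
TRANS = {
    "none": {"none": 0.8, "bc": 0.2},
    "bc": {"none": 0.6, "bc": 0.4},
}
EMIT = {
    "none": {"speech": 0.7, "pause": 0.2, "gaze": 0.1},
    "bc": {"speech": 0.1, "pause": 0.5, "gaze": 0.4},
}


def viterbi(observations):
    """Return the most likely hidden-state sequence for a list of discrete cues."""
    if not observations:
        return []
    # Work in log space so long sequences do not underflow.
    score = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]]) for s in STATES
    }
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s at this step.
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][obs])
            )
            pointers[s] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the highest-scoring final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi(["speech", "pause", "gaze"]))
```

A Conditional Random Field would replace the hand-set transition and emission tables with feature weights learned discriminatively, but the decoding step has the same dynamic-programming shape.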
Autonomous Agents and Multi-Agent Systems
Volume 20, Issue 1, pp 70-84
- Springer US
- Listener backchannel feedback
- Nonverbal behavior prediction
- Sequential probabilistic model
- Conditional random field
- Head nod
- Author Affiliations
- 1. Institute for Creative Technologies, University of Southern California, 13274 Fiji Way, Marina del Rey, CA, 90292, USA
- 2. Human Media Interaction Group, University of Twente, P.O. Box 217, 7500AE, Enschede, The Netherlands