Language Resources and Evaluation, Volume 41, Issue 3–4, pp. 305–323

Corpus-based generation of head and eyebrow motion for an embodied conversational agent

  • Mary Ellen Foster
  • Jon Oberlander


Humans are known to use a wide range of non-verbal behaviour while speaking. Generating naturalistic embodied speech for an artificial agent is therefore an application where techniques that draw directly on recorded human motions can be helpful. We present a system that uses corpus-based selection strategies to specify the head and eyebrow motion of an animated talking head. We first describe how a domain-specific corpus of facial displays was recorded and annotated, and outline the regularities that were found in the data. We then present two different methods of selecting motions for the talking head based on the corpus data: one that chooses the majority option in all cases, and one that makes a weighted choice among all of the options. We compare these methods to each other in two ways: through cross-validation against the corpus, and by asking human judges to rate the output. The results of the two evaluation studies differ: the cross-validation study favoured the majority strategy, while the human judges preferred schedules generated using weighted choice. The judges in the second study also showed a preference for the original corpus data over the output of either of the generation strategies.
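The two corpus-based selection strategies compared in the abstract can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the corpus counts, context key, and display labels below are hypothetical, and the paper's actual features and motion schedules are richer than this.

```python
import random
from collections import Counter

# Hypothetical corpus counts: how often each facial display (head/eyebrow
# motion) was observed in the recorded data for one linguistic context.
corpus_counts = Counter({"nod": 7, "brow_raise": 2, "none": 1})

def majority_choice(counts):
    """Majority strategy: always pick the most frequent display."""
    return counts.most_common(1)[0][0]

def weighted_choice(counts, rng=random):
    """Weighted strategy: sample a display with probability
    proportional to its corpus frequency."""
    displays = list(counts)
    weights = [counts[d] for d in displays]
    return rng.choices(displays, weights=weights, k=1)[0]

print(majority_choice(corpus_counts))  # prints "nod"
print(weighted_choice(corpus_counts))  # "nod" about 70% of the time
```

The trade-off the abstract reports follows from this design: the majority strategy maximises per-decision agreement with held-out corpus data, while the weighted strategy reproduces the corpus's overall variability, which the human judges preferred.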


Keywords: Data-driven generation · Embodied conversational agents · Evaluation of generated output · Multimodal corpora



This work was supported by the EU projects COMIC (IST-2001-32311) and JAST (FP6-003747-IP). An initial version of this study was published as Foster and Oberlander (2006).


References

  1. Artstein, R., & Poesio, M. (2005). Kappa3 = alpha (or beta). Technical Report CSM-437, University of Essex Department of Computer Science.
  2. Bangalore, S., Rambow, O., & Whittaker, S. (2000). Evaluation metrics for generation. In Proceedings of INLG 2000.
  3. Belz, A., Gatt, A., Reiter, E., & Viethen, J. (2007). First NLG shared task and evaluation challenge on attribute selection for referring expression generation.
  4. Belz, A., & Reiter, E. (2006). Comparing automatic and human evaluation of NLG systems. In Proceedings of EACL 2006 (pp. 313–320).
  5. Belz, A., & Varges, S. (Eds.) (2005). Corpus Linguistics 2005 workshop on using corpora for natural language generation.
  6. Cassell, J., Bickmore, T., Vilhjálmsson, H., & Yan, H. (2001a). More than just a pretty face: Conversational protocols and the affordances of embodiment. Knowledge-Based Systems, 14(1–2), 55–64.
  7. Cassell, J., Nakano, Y., Bickmore, T. W., Sidner, C. L., & Rich, C. (2001b). Non-verbal cues for discourse structure. In Proceedings of ACL 2001.
  8. Cassell, J., Sullivan, J., Prevost, S., & Churchill, E. (2000). Embodied conversational agents. MIT Press.
  9. Clark, R. A. J., Richmond, K., & King, S. (2004). Festival 2 – Build your own general purpose unit selection speech synthesiser. In Proceedings of the 5th ISCA Workshop on Speech Synthesis.
  10. de Carolis, B., Carofiglio, V., & Pelachaud, C. (2002). From discourse plans to believable behavior generation. In Proceedings of INLG 2002.
  11. DeCarlo, D., Stone, M., Revilla, C., & Venditti, J. (2004). Specifying and animating facial signals for discourse in embodied conversational agents. Computer Animation and Virtual Worlds, 15(1), 27–38.
  12. Ekman, P. (1979). About brows: Emotional and conversational signals. In M. von Cranach, K. Foppa, W. Lepenies, & D. Ploog (Eds.), Human ethology: Claims and limits of a new discipline. Cambridge University Press.
  13. Foster, M. E. (2007). Evaluating the impact of variation in automatically generated embodied object descriptions. Ph.D. thesis, School of Informatics, University of Edinburgh.
  14. Foster, M. E., & Oberlander, J. (2006). Data-driven generation of emphatic facial displays. In Proceedings of EACL 2006 (pp. 353–360).
  15. Foster, M. E., White, M., Setzer, A., & Catizone, R. (2005). Multimodal generation in the COMIC dialogue system. In Proceedings of the ACL 2005 Demo Session.
  16. Fox, J. (2002). An R and S-Plus companion to applied regression. Sage Publications.
  17. Graf, H., Cosatto, E., Strom, V., & Huang, F. (2002). Visual prosody: Facial movements accompanying speech. In Proceedings of FG 2002 (pp. 397–401).
  18. Kipp, M. (2004). Gesture generation by imitation – From human behavior to computer character animation.
  19. Krahmer, E., & Swerts, M. (2005). How children and adults produce and perceive uncertainty in audiovisual speech. Language and Speech, 48(1), 29–53.
  20. Langkilde, I., & Knight, K. (1998). Generation that exploits corpus-based statistical knowledge. In Proceedings of COLING-ACL 1998.
  21. Langkilde-Geary, I. (2002). An empirical verification of coverage and correctness for a general-purpose sentence generator. In Proceedings of INLG 2002.
  22. Mana, N., & Pianesi, F. (2006). HMM-based synthesis of emotional facial expressions during speech in synthetic talking heads. In Proceedings of ICMI 2006.
  23. Martin, J.-C., Kühnlein, P., Paggio, P., Stiefelhagen, R., & Pianesi, F. (Eds.) (2006). LREC 2006 workshop on multimodal corpora: From multimodal behaviour theories to usable models.
  24. McNeill, D. (Ed.) (2000). Language and gesture: Window into thought and action. Cambridge University Press.
  25. Passonneau, R. J. (2004). Computing reliability for coreference annotation. In Proceedings of LREC 2004 (Vol. 4, pp. 1503–1506). Lisbon.
  26. Rehm, M., & André, E. (2005). Catch me if you can – Exploring lying agents in social settings. In Proceedings of AAMAS 2005 (pp. 937–944).
  27. Steedman, M. (2000). Information structure and the syntax-phonology interface. Linguistic Inquiry, 31(4), 649–689.
  28. Stone, M., DeCarlo, D., Oh, I., Rodriguez, C., Lees, A., Stere, A., & Bregler, C. (2004). Speaking with hands: Creating animated conversational characters from recordings of human performance. ACM Transactions on Graphics, 23(3), 506–513.
  29. White, M. (2006). Efficient realization of coordinate structures in combinatory categorial grammar. Research on Language and Computation, 4(1), 39–75.

Copyright information

© Springer Science+Business Media B.V. 2008

Authors and Affiliations

  1. Informatik VI: Robotics and Embedded Systems, Technische Universität München, Garching, Germany
  2. School of Informatics, University of Edinburgh, Edinburgh, UK
