KRISTINA: A Knowledge-Based Virtual Conversation Agent

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10349)


We present an intelligent embodied conversation agent with linguistic, social and emotional competence. Unlike the vast majority of state-of-the-art conversation agents, the proposed agent is built around an ontology-based knowledge model that enables flexible, reasoning-driven dialogue planning rather than relying on predefined dialogue scripts. It is complemented by multimodal communication analysis and generation modules and by a search engine that retrieves from the web the multimedia background content needed to conduct a conversation on a given topic. The evaluation of the 1st prototype shows a high degree of user acceptance of the agent with respect to qualities such as trustworthiness and naturalness. The individual technologies are being further improved in the 2nd prototype.
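The contrast drawn above between scripted dialogue and reasoning-driven planning over a knowledge model can be illustrated with a minimal sketch. The ontology, concept names, and planning rules below are hypothetical toy examples, not KRISTINA's actual knowledge model or code: the point is only that the next dialogue move is derived by reasoning over the knowledge model and the current dialogue state, instead of being read off a fixed script.

```python
# Toy "ontology": concepts with subclass links and a required slot.
# Entirely illustrative; a real system would use OWL and a reasoner.
ONTOLOGY = {
    "Medication":  {"subclass_of": "HealthTopic", "requires": "dosage_info"},
    "Diet":        {"subclass_of": "HealthTopic", "requires": "meal_plan"},
    "HealthTopic": {"subclass_of": "Topic",       "requires": None},
}

def ancestors(concept):
    """Collect all superclasses by walking subclass_of links."""
    chain = []
    while concept in ONTOLOGY:
        concept = ONTOLOGY[concept]["subclass_of"]
        chain.append(concept)
    return chain

def plan_next_move(topic, known_facts):
    """Derive the next dialogue move by reasoning over the ontology:
    elicit whatever required information is still missing; otherwise
    inform on topics the ontology classifies as health topics."""
    required = ONTOLOGY.get(topic, {}).get("requires")
    if required and required not in known_facts:
        return ("request", required)   # ask for the missing knowledge
    if "HealthTopic" in ancestors(topic):
        return ("inform", topic)       # enough knowledge to inform
    return ("clarify", topic)          # unknown topic: ask back

print(plan_next_move("Medication", set()))            # ('request', 'dosage_info')
print(plan_next_move("Medication", {"dosage_info"}))  # ('inform', 'Medication')
```

Because the planner inspects the knowledge model at run time, adding a new topic means adding an ontology entry, not authoring a new branch of a dialogue script.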


Keywords: Conversation agent · Multimodal interaction · Ontologies · Dialogue management



The presented work is funded by the European Commission as part of the H2020 Programme, under the contract number 645012–RIA. Many thanks to our colleagues from the University of Tübingen, German Red Cross and semFYC for the definition of the use cases, constant feedback, and evaluation!



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. ICREA, Barcelona, Spain
  2. Universitat Pompeu Fabra, Barcelona, Spain
  3. Universität Augsburg, Augsburg, Germany
  4. Vocapia Research, Orsay, France
  5. CERTH, Thessaloniki, Greece
  6. Universität Ulm, Ulm, Germany
  7. Almende, Rotterdam, The Netherlands
