Journal on Multimodal User Interfaces, Volume 8, Issue 1, pp 87–96

Facial expression-based affective speech translation

  • Éva Székely
  • Ingmar Steiner
  • Zeeshan Ahmed
  • Julie Carson-Berndsen
Original Paper


One of the challenges of speech-to-speech translation is to accurately preserve the paralinguistic information in the speaker’s message. Information about the affect and emotional intent of a speaker is often carried in more than one modality. For this reason, the possibility of multimodal interaction with the system and the conversation partner may greatly increase the likelihood of a successful and gratifying communication process. In this work, we explore the use of automatic facial expression analysis as an input annotation modality to transfer paralinguistic information at a symbolic level from input to output in speech-to-speech translation. To evaluate the feasibility of this approach, a prototype system, FEAST (facial expression-based affective speech translation), has been developed. FEAST classifies the emotional state of the user and uses it to render the translated output in an appropriate voice style, using expressive speech synthesis.
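The symbolic transfer described above can be illustrated with a minimal sketch. It assumes that a facial expression classifier yields one emotion label per video frame and that the synthesizer accepts a named voice style. The label set, the style names, the majority-vote rule, and every identifier below are hypothetical choices made for illustration; they are not taken from the FEAST implementation.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical mapping from classifier labels to voice styles.
# Both the labels and the style names are assumptions for this sketch.
STYLE_FOR_EMOTION = {
    "happy": "cheerful",
    "sad": "depressed",
    "angry": "aggressive",
    "neutral": "neutral",
}


@dataclass
class SynthesisRequest:
    """Translated text plus the symbolic style tag passed to the synthesizer."""
    text: str
    style: str


def dominant_emotion(frame_labels):
    """Return the most frequent per-frame label.

    Falls back to "neutral" when there are no frames or when the two
    most frequent labels are tied (an assumed, conservative policy).
    """
    if not frame_labels:
        return "neutral"
    counts = Counter(frame_labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "neutral"
    return counts[0][0]


def annotate_translation(translated_text, frame_labels):
    """Attach a voice style to the translated text based on the facial labels."""
    emotion = dominant_emotion(frame_labels)
    # Labels without a mapped style are rendered in the neutral style.
    style = STYLE_FOR_EMOTION.get(emotion, "neutral")
    return SynthesisRequest(text=translated_text, style=style)
```

In this sketch, a call such as `annotate_translation("Guten Morgen", ["happy", "happy", "neutral"])` would produce a request tagged with the `cheerful` style. The emotion travels only as a symbolic label alongside the translated text, so the translation step itself is unaffected by it.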


Expressive speech synthesis; Speech-to-speech translation; Gesture-driven multimodal interface; Affective computing



This research is supported by the Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for Next Generation Localisation at University College Dublin (UCD) and Trinity College Dublin (TCD). The opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of Science Foundation Ireland. Portions of the research in this paper use the SEMAINE Database collected for the SEMAINE project [12].

Supplementary material

Supplementary material 1 (mpg 57676 KB)


  1. Agüero PD, Adell J, Bonafonte A (2006) Prosody generation for speech-to-speech translation. In: IEEE international conference on acoustics, speech, and signal processing, pp I-557–I-560. doi: 10.1109/ICASSP.2006.1660081
  2. Ahmed Z, Steiner I, Székely É, Carson-Berndsen J (2013) A system for facial expression-based affective speech translation. In: ACM international conference on intelligent user interfaces companion, pp 57–58. doi: 10.1145/2451176.2451197
  3. Batliner A, Huber R, Niemann H, Nöth E, Spilker J, Fischer K (2000) The recognition of emotion. In: Wahlster W (ed) Verbmobil: foundations of speech-to-speech translations. Springer, Berlin, pp 122–130
  4. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth, Belmont
  5. Chang CC, Lin CJ (2011) LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol 2(3):27:1–27:27. doi: 10.1145/1961189.1961199
  6. Cowie R, Douglas-Cowie E, Apolloni B, Taylor JG, Romano A, Fellenz W (1999) What a neural net needs to know about emotion words. In: World multiconference on circuits, systems, communications and computer, pp 109–114
  7. Cowie R, Douglas-Cowie E, Tsapatsoulis N, Votsis G, Kollias S, Fellenz W, Taylor JG (2001) Emotion recognition in human–computer interaction. IEEE Signal Process Mag 18(1):32–80. doi: 10.1109/79.911197
  8. Ekman P, Keltner D (1997) Universal facial expressions of emotion: an old controversy and new findings. In: Segerstråle U, Molnár P (eds) Nonverbal communication: where nature meets culture. Lawrence Erlbaum, New Jersey, pp 27–46
  9. Kano T, Sakti S, Takamichi S, Neubig G, Toda T, Nakamura S (2012) A method for translation of paralinguistic information. In: International workshop on spoken language translation
  10. Küblbeck C, Ernst A (2006) Face detection and tracking in video sequences using the modified census transformation. Image Vis Comput 24(6):564–572. doi: 10.1016/j.imavis.2005.08.005
  11. Machado AF, Queiroz M (2010) Techniques for crosslingual voice conversion. In: IEEE international symposium on multimedia, pp 365–370. doi: 10.1109/ISM.2010.62
  12. McKeown G, Valstar MF, Cowie R, Pantic M (2010) The SEMAINE corpus of emotionally coloured character interactions. In: IEEE international conference on multimedia and expo, pp 1079–1084. doi: 10.1109/ICME.2010.5583006
  13. Och FJ (2003) Minimum error rate training in statistical machine translation. In: Annual meeting of the association for computational linguistics, pp 160–167. doi: 10.3115/1075096.1075117
  14. Schröder M, Baggia P, Burkhardt F, Pelachaud C, Peter C, Zovato E (2011) EmotionML—an upcoming standard for representing emotions and related states. In: D’Mello S, Graesser A, Schuller B, Martin JC (eds) Affective computing and intelligent interaction. Springer, Berlin, pp 316–325. doi: 10.1007/978-3-642-24600-5_35
  15. Schröder M, Trouvain J (2003) The German text-to-speech synthesis system MARY: a tool for research, development and teaching. Int J Speech Technol 6(4):365–377. doi: 10.1023/A:1025708916924
  16. Shin J, Georgiou PG, Narayanan S (2013) Enabling effective design of multimodal interfaces for speech-to-speech translation system: an empirical study of longitudinal user behaviors over time and user strategies for coping with errors. Comput Speech Lang 27(2):554–571. doi: 10.1016/j.csl.2012.02.001
  17. Steiner I, Schröder M, Charfuelan M, Klepp A (2010) Symbolic vs. acoustics-based style control for expressive unit selection. In: ISCA workshop on speech synthesis, pp 114–119
  18. Székely É, Ahmed Z, Cabral JP, Carson-Berndsen J (2012) WinkTalk: a demonstration of a multimodal speech synthesis platform linking facial expressions to expressive synthetic voices. In: Workshop on speech and language processing for assistive technologies, pp 5–8
  19. Tomás J, Canovas A, Lloret J, García M (2010) Speech translation statistical system using multimodal sources of knowledge. In: International multi-conference on computing in the global information technology, pp 5–9. doi: 10.1109/ICCGI.2010.26
  20. Vauquois B (2003) Automatic translation—a survey of different approaches. In: Nirenburg S, Somers HL, Wilks Y (eds) Readings in machine translation, chap. 28. MIT Press, Cambridge, pp 333–338
  21. Wöllmer M, Metallinou A, Eyben F, Schuller B, Narayanan S (2010) Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional LSTM modeling. In: Interspeech, pp 2362–2365

Copyright information

© OpenInterface Association 2013

Authors and Affiliations

  • Éva Székely (1)
  • Ingmar Steiner (2)
  • Zeeshan Ahmed (1)
  • Julie Carson-Berndsen (1)

  1. Centre for Next Generation Localisation, School of Computer Science and Informatics, University College Dublin, Dublin 4, Ireland
  2. Multimodal Computing and Interaction, Saarland University and Language Technology Lab, DFKI GmbH, Saarbrücken, Germany
