A Survey of Using Vocal Prosody to Convey Emotion in Robot Speech

  • Survey
  • Published in: International Journal of Social Robotics

Abstract

The use of speech for robots to communicate with their human users has been facilitated by improvements in speech synthesis technology. Now that the intelligibility of synthetic speech has advanced to the point that speech synthesizers are a widely accepted and used technology, what are other aspects of speech synthesis that can be used to improve the quality of human–robot interaction? The communication of emotions through changes in vocal prosody is one way to make synthesized speech sound more natural. This article reviews the use of vocal prosody to convey emotions between humans, the use of vocal prosody by agents and avatars to convey emotions to their human users, and previous work within the human–robot interaction (HRI) community addressing the use of vocal prosody in robot speech. The goals of this article are (1) to highlight the ability and importance of using vocal prosody to convey emotions within robot speech and (2) to identify experimental design issues when using emotional robot speech in user studies.
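
As a concrete illustration of the kind of prosody manipulation this survey discusses, the sketch below builds SSML <prosody> markup whose pitch, rate, and volume are shifted toward settings the vocal-prosody literature loosely associates with a few basic emotions. This is a minimal sketch, not the method of any system reviewed here: the emotion-to-parameter table and the emotional_ssml helper are illustrative assumptions, and a real robot would need to tune such values for its synthesizer and validate them with listeners.

```python
# Minimal sketch (illustrative values, not taken from the survey): map a few
# basic emotions to SSML prosody settings and wrap an utterance accordingly.
from xml.sax.saxutils import escape

# Assumed mapping; rough directions only (e.g., higher pitch and faster rate
# for joy, lower pitch and slower rate for sadness).
EMOTION_PROSODY = {
    "joy":     {"pitch": "+15%", "rate": "115%", "volume": "+2dB"},
    "sadness": {"pitch": "-15%", "rate": "85%",  "volume": "-3dB"},
    "anger":   {"pitch": "+5%",  "rate": "120%", "volume": "+4dB"},
    "neutral": {"pitch": "+0%",  "rate": "100%", "volume": "+0dB"},
}

def emotional_ssml(text: str, emotion: str = "neutral") -> str:
    """Return an SSML string that speaks `text` with the prosody settings
    associated with `emotion` (hypothetical helper)."""
    p = EMOTION_PROSODY.get(emotion, EMOTION_PROSODY["neutral"])
    return (
        '<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis">'
        f'<prosody pitch="{p["pitch"]}" rate="{p["rate"]}" volume="{p["volume"]}">'
        f'{escape(text)}'
        '</prosody></speak>'
    )

if __name__ == "__main__":
    # The resulting markup can be passed to any SSML-capable synthesizer.
    print(emotional_ssml("I have finished cleaning the room.", "joy"))
```

Whether listeners actually perceive the intended emotion from such parameter changes is exactly the kind of question the user studies surveyed in this article examine.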

Author information

Corresponding author

Correspondence to Joe Crumpton.


Cite this article

Crumpton, J., Bethel, C.L. A Survey of Using Vocal Prosody to Convey Emotion in Robot Speech. Int J of Soc Robotics 8, 271–285 (2016). https://doi.org/10.1007/s12369-015-0329-4
