International Journal of Social Robotics, Volume 8, Issue 2, pp 271–285

A Survey of Using Vocal Prosody to Convey Emotion in Robot Speech

  • Joe Crumpton
  • Cindy L. Bethel
Survey

Abstract

The use of speech for robots to communicate with their human users has been facilitated by improvements in speech synthesis technology. Now that synthetic speech is intelligible enough for speech synthesizers to be a widely accepted and used technology, what other aspects of speech synthesis can be used to improve the quality of human–robot interaction? The communication of emotions through changes in vocal prosody is one way to make synthesized speech sound more natural. This article reviews the use of vocal prosody to convey emotions between humans, the use of vocal prosody by agents and avatars to convey emotions to their human users, and previous work within the human–robot interaction (HRI) community addressing the use of vocal prosody in robot speech. The goals of this article are (1) to highlight the ability and importance of using vocal prosody to convey emotions within robot speech and (2) to identify experimental design issues that arise when using emotional robot speech in user studies.
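
The abstract's central idea, that emotion can be conveyed by modifying vocal prosody (pitch, speaking rate, loudness), can be sketched with the W3C SSML `<prosody>` element, whose `pitch`, `rate`, and `volume` attributes are defined in the SSML 1.1 specification. The emotion-to-parameter mapping below is a hypothetical illustration, not a mapping taken from the surveyed studies:

```python
# Minimal sketch: expressing emotion-linked prosody settings as SSML markup.
# The <prosody> element and its pitch/rate/volume attributes come from the
# W3C SSML 1.1 standard; the specific emotion-to-parameter values here are
# hypothetical, chosen only to illustrate the general technique.

EMOTION_PROSODY = {
    # emotion: (pitch shift, speaking rate, volume) -- illustrative values only
    "happy": ("+15%", "fast", "loud"),
    "sad": ("-10%", "slow", "soft"),
    "neutral": ("medium", "medium", "medium"),
}

def to_ssml(text: str, emotion: str) -> str:
    """Wrap text in an SSML <prosody> element reflecting the given emotion."""
    pitch, rate, volume = EMOTION_PROSODY[emotion]
    return (
        '<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis">'
        f'<prosody pitch="{pitch}" rate="{rate}" volume="{volume}">{text}</prosody>'
        "</speak>"
    )

print(to_ssml("I found the exit.", "happy"))
```

The resulting string can be passed to any SSML-capable synthesizer; note, however, that synthesizers vary in which prosody attributes they honor, which is one of the experimental design issues the survey addresses.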

Keywords

Synthesized speech · Emotional robot speech · Human–robot interaction · Vocal prosody


Copyright information

© Springer Science+Business Media Dordrecht 2015

Authors and Affiliations

  1. Distributed Analytics and Security Institute, Mississippi State University, Starkville, USA
  2. Social, Therapeutic and Robotic Systems Lab, Department of Computer Science and Engineering, Mississippi State University, Starkville, USA