Abstract
This chapter explores the role of prosody in expressive speech synthesis and goes beyond present technology to consider the interrelated multimodal aspects of interactive spoken dialogue systems for human–machine or human–human interaction. The chapter argues that the social aspects of spoken dialogue are now ripe to be considered in the design of interactive systems, and shows how three modalities can be combined—utterance content, speech expressivity, and facial or bodily gestures—to express social factors and manage the interaction. Linguistic prosody has been well described in the literature, but the social aspects of managing a spoken dialogue remain a next step for speech synthesis research. This chapter shows how voice quality functions socially as well as linguistically, and describes an application of speech synthesis in a robot dialogue system that makes complementary use of visual information and speaking-style variation.
Acknowledgement
The authors would like to acknowledge the contribution of SFI (through the FastNet (09/IN.1/I2631) and CNGL (12/CE/I2267) projects, and the Stokes Professorship (07/SK/I1218)), as well as joint work with NAIST in Japan (Kaken-hi 24500256 & 23242023) and with Nanjing Normal University in China (parts of this work are supported by the Major Program for the National Social Science Fund of China (13&ZD189)). The principal author further wishes to thank the Chinese Academy of Sciences for the kind loan of the second author to our lab. We also thank Emer Gilmartin for her perceptive contributions to the development of Herme’s script.
Appendix
Full text of the Herme utterances as used in the Science Gallery experiment:
Each line represents a new utterance to be synthesised separately, with a variable-length pause between utterances; some groups of lines (each starting with a dash) form a mini sub-dialogue section. We experimented by varying the timing of the pauses, i.e. the utterance onsets, according to the nature of each response from the visitors to the exhibition—Herme’s chat partners. Where possible, the content of any response was ignored: the wizard’s task was simply to trigger the next utterance, and the automatic version had no content-processing module.
- hello? hi
hello
hi
- my name is hur-mi.
h e r m e
hur-mi
what’s your name?
- how old are you?
really
I’m nearly seven weeks old
- do you have an i d number
i need an i d number to talk to you
i d numbers are on your right
thank you
- are you from dublin?
really
I’m from the Speech Communication Lab here in Tee See Dee
- tell me about you
really
owe
- tell me something else
owe
really
- why are you here today?
really
why
- do you like the exhibition
really
why?
i like your hair
- do you know any good jokes?
tell me a funny joke
ha ha haha ha
- tell me a knock knock joke
who’s there
who?
who
ha ha haha ha
- I know a joke
what’s yellow and goes through walls
a ghost banana
ha ha hehe he.
ho hoho ho ho
- thanks for your help
goodbye, see you later
goodbye
Note how these utterance chunks typically group into triads (sets of three) and how they maintain the initiative of the conversation throughout. Note also some spelling hacks (e.g., using ‘owe’ to ensure that the synthesiser correctly pronounced ‘oh!’). The repeated use of ‘oh’ and ‘really’ (sometimes with ‘why’) with various punctuation served to keep the conversation interactive and was key to Herme’s supposed conversational abilities.
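The control logic described above—fixed utterances grouped into dash-delimited sub-dialogues, each spoken when the wizard (or a timer) triggers the next one, with the visitor's reply never inspected—can be sketched as follows. This is a hypothetical illustration, not the actual Herme implementation; the names `parse_subdialogues`, `run_wizard`, `trigger`, and `speak` are all invented for the example, and only the first few script lines are shown.

```python
# Hypothetical sketch of the script-driven dialogue control described above.
# A leading "- " opens a new mini sub-dialogue; every other line continues
# the current one. The wizard simply triggers the next utterance in order;
# the content of the visitor's response is ignored.

SCRIPT = """\
- hello? hi
hello
hi
- my name is hur-mi.
h e r m e
hur-mi
what's your name?
"""

def parse_subdialogues(script: str) -> list[list[str]]:
    """Group script lines into sub-dialogues; '- ' starts a new group."""
    groups: list[list[str]] = []
    for line in script.splitlines():
        if not line.strip():
            continue                        # skip blank lines
        if line.startswith("- "):
            groups.append([line[2:]])       # dash marks a new sub-dialogue
        else:
            groups[-1].append(line)         # continue the current group
    return groups

def run_wizard(groups, trigger, speak):
    """Speak each utterance when triggered; no content processing at all."""
    for group in groups:
        for utterance in group:
            trigger()                       # wizard keypress or timed pause
            speak(utterance)                # hand the text to the synthesiser

groups = parse_subdialogues(SCRIPT)
```

The variable-length pauses mentioned above would live inside `trigger`, which in the wizard-of-Oz version blocks on a keypress and in the automatic version simply sleeps for a chosen interval.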
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this chapter
Campbell, N., Li, Y. (2015). Expressivity in Interactive Speech Synthesis; Some Paralinguistic and Nonlinguistic Issues of Speech Prosody for Conversational Dialogue Systems. In: Hirose, K., Tao, J. (eds) Speech Prosody in Speech Synthesis: Modeling and generation of prosody for high quality and flexible speech synthesis. Prosody, Phonology and Phonetics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-45258-5_7
Print ISBN: 978-3-662-45257-8
Online ISBN: 978-3-662-45258-5