
Expressivity in Interactive Speech Synthesis; Some Paralinguistic and Nonlinguistic Issues of Speech Prosody for Conversational Dialogue Systems

Chapter in Speech Prosody in Speech Synthesis: Modeling and Generation of Prosody for High Quality and Flexible Speech Synthesis, part of the book series Prosody, Phonology and Phonetics (PRPHPH).

Abstract

This chapter explores the role of prosody in expressive speech synthesis and goes beyond present technology to consider the interrelated multimodal aspects of interactive spoken dialogue systems for human–machine or human–human interaction. The chapter stresses that social aspects of spoken dialogue are now ripe to be considered in the design of interactive systems and shows how three modalities can be combined—utterance content, speech expressivity, and facial or bodily gestures—to express social factors and manage the interaction. Linguistic prosody has been well described in the literature, but the social aspects of managing a spoken dialogue remain the next step for speech synthesis research. This chapter shows how voice quality functions socially as well as linguistically, and describes an application of speech synthesis in a robot dialogue system that makes complementary use of visual information and speaking-style variation.



Acknowledgement

The authors would like to acknowledge the contribution of SFI (through the FastNet (09/IN.1/I2631) and CNGL (12/CE/I2267) projects, and the Stokes Professorship (07/SK/I1218)), as well as joint work with NAIST in Japan (Kaken-hi 24500256 & 23242023) and with Nanjing Normal University in China (parts of this work are supported by the Major Program for the National Social Science Fund of China (13&ZD189)). The principal author further wishes to thank the Chinese Academy of Sciences for the kind loan of the second author to our lab. We also thank Emer Gilmartin for her perceptive contributions to the development of Herme’s script.

Author information

Correspondence to Nick Campbell.

Appendix

Full text of the Herme utterances as used in the Science Gallery experiment:

Each line represents a new utterance to be synthesised separately, with a variable-length pause between utterances; some groups of lines (each starting with a dash) form a mini sub-dialogue section. We experimented by varying the timing of the pauses, i.e. utterance onsets, according to the nature of each response from the visitors to the exhibition, Herme's chat partners. Where possible, the content of any response was ignored: the wizard's task was simply to trigger the next utterance, and the automatic version had no content-processing module.

- hello? hi

hello

hi

- my name is hur-mi.

h e r m e

hur-mi

what’s your name?

- how old are you?

really

I’m nearly seven weeks old

- do you have an i d number

i need an i d number to talk to you

i d numbers are on your right

thank you

- are you from dublin?

really

I’m from the Speech Communication Lab here in Tee See Dee

- tell me about you

really

owe

- tell me something else

owe

really

- why are you here today?

really

why

- do you like the exhibition

really

why?

i like your hair

- do you know any good jokes?

tell me a funny joke

ha ha haha ha

- tell me a knock knock joke

who’s there

who?

who

ha ha haha ha

- I know a joke

what’s yellow and goes through walls

a ghost banana

ha ha hehe he.

ho hoho ho ho

- thanks for your help

goodbye, see you later

goodbye

Note how these utterance chunks typically group into triads (sets of three) and how they maintain the initiative of the conversation throughout. Note also some spelling hacks (e.g., using ‘owe’ to ensure that the synthesiser correctly pronounced ‘oh!’). The repeated use of ‘oh’ and ‘really’ (sometimes with ‘why’) with various punctuation served to keep the conversation interactive and was key to Herme’s supposed conversational abilities.
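The control strategy described above—stepping through pre-scripted utterance groups while ignoring response content and varying only the pause before each onset—can be sketched as a short dialogue loop. This is an illustrative reconstruction, not the original Herme code; all function and variable names (`SCRIPT`, `next_pause`, `run_subdialogue`) are assumptions, and the pause range is arbitrary.

```python
import random

# Illustrative sketch of a Herme-style dialogue loop: the controller
# (wizard or automatic version) simply steps through scripted utterance
# groups. Response content is never inspected; only the pause before
# each utterance onset varies.

# A few mini sub-dialogues from the appendix script (names and grouping
# are illustrative).
SCRIPT = [
    ["hello? hi", "hello", "hi"],
    ["my name is hur-mi.", "h e r m e", "hur-mi", "what's your name?"],
    ["how old are you?", "really", "I'm nearly seven weeks old"],
]

def next_pause(min_s=0.5, max_s=2.0, rng=random):
    """Variable-length pause (in seconds) before the next utterance onset.
    The 0.5-2.0 s range is an assumed placeholder, not a reported value."""
    return rng.uniform(min_s, max_s)

def run_subdialogue(group, synthesise, pause_fn=next_pause):
    """Play one mini sub-dialogue: schedule a pause, then synthesise each
    utterance in turn. Returns the (pause, utterance) timeline so the
    timing strategy can be inspected or logged."""
    timeline = []
    for utterance in group:
        pause = pause_fn()
        timeline.append((pause, utterance))
        synthesise(utterance)  # stand-in for the actual TTS call
    return timeline

if __name__ == "__main__":
    spoken = []
    for group in SCRIPT:
        run_subdialogue(group, spoken.append)
    print(spoken)
```

Because the loop never branches on what the visitor says, replacing the wizard with a timer-driven trigger (the "automatic version") changes only `pause_fn`, which matches the chapter's point that Herme's apparent conversational ability came from timing and script design rather than content processing.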


Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Campbell, N., Li, Y. (2015). Expressivity in Interactive Speech Synthesis; Some Paralinguistic and Nonlinguistic Issues of Speech Prosody for Conversational Dialogue Systems. In: Hirose, K., Tao, J. (eds) Speech Prosody in Speech Synthesis: Modeling and generation of prosody for high quality and flexible speech synthesis. Prosody, Phonology and Phonetics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-45258-5_7
