Abstract
This chapter explores the role of prosody in expressive speech synthesis and goes beyond present technology to consider the interrelated multimodal aspects of interactive spoken dialogue systems for human–machine or human–human interaction. The chapter argues that the social aspects of spoken dialogue are now ripe to be considered in the design of interactive systems, and shows how three modalities can be combined—utterance content, speech expressivity, and facial or bodily gestures—to express social factors and manage the interaction. Linguistic prosody has been well described in the literature, but the social aspects of managing a spoken dialogue remain a next step for speech synthesis research. This chapter shows how voice quality functions socially as well as linguistically, and describes an application of speech synthesis in a robot dialogue system that makes complementary use of visual information and speaking-style variation.
Acknowledgement
The authors would like to acknowledge the contribution of SFI (through the FastNet (09/IN.1/I2631) and CNGL (12/CE/I2267) projects, and the Stokes Professorship (07/SK/I1218)), as well as joint work with NAIST in Japan (Kaken-hi 24500256 & 23242023) and with Nanjing Normal University in China (parts of this work are supported by the Major Program for the National Social Science Fund of China (13&ZD189)). The principal author further wishes to thank the Chinese Academy of Sciences for the kind loan of the second author to our lab. We also thank Emer Gilmartin for her perceptive contributions to the development of Herme’s script.
Appendix
Full text of the Herme utterances as used in the Science Gallery experiment:
Each line represents a new utterance to be synthesised separately, with a variable-length pause between utterances; some groups of lines (each starting with a dash) form a mini sub-dialogue section. We experimented by varying the timing of the pauses, i.e. the utterance onsets, according to the nature of each response from the visitors to the exhibition—Herme’s chat partners. Where possible, the content of any response was ignored: the wizard’s task was simply to trigger the next utterance, and the automatic version had no content-processing module.
- hello? hi
hello
hi
- my name is hur-mi.
h e r m e
hur-mi
what’s your name?
- how old are you?
really
I’m nearly seven weeks old
- do you have an i d number
i need an i d number to talk to you
i d numbers are on your right
thank you
- are you from dublin?
really
I’m from the Speech Communication Lab here in Tee See Dee
- tell me about you
really
owe
- tell me something else
owe
really
- why are you here today?
really
why
- do you like the exhibition
really
why?
i like your hair
- do you know any good jokes?
tell me a funny joke
ha ha haha ha
- tell me a knock knock joke
who’s there
who?
who
ha ha haha ha
- I know a joke
what’s yellow and goes through walls
a ghost banana
ha ha hehe he.
ho hoho ho ho
- thanks for your help
goodbye, see you later
goodbye
Note how these utterance chunks typically group into triads (sets of three) and how they maintain the initiative of the conversation throughout. Note also some spelling hacks (e.g., using ‘owe’ to ensure that the synthesiser correctly pronounced ‘oh!’). The repeated use of ‘oh’ and ‘really’ (sometimes with ‘why’) with various punctuation served to keep the conversation interactive and was key to Herme’s supposed conversational abilities.
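The control logic described above—fixed utterances grouped into dash-delimited sub-dialogues, each spoken when the wizard (or a timer) triggers the next one, with the visitor's reply never inspected—can be sketched as follows. This is a hypothetical illustration, not the actual Herme implementation; the names `parse_subdialogues`, `run_wizard`, `trigger`, and `speak` are all invented for the example, and only the first few script lines are shown.

```python
# Hypothetical sketch of the script-driven dialogue control described above.
# A leading "- " opens a new mini sub-dialogue; every other line continues
# the current one. The wizard simply triggers the next utterance in order;
# the content of the visitor's response is ignored.

SCRIPT = """\
- hello? hi
hello
hi
- my name is hur-mi.
h e r m e
hur-mi
what's your name?
"""

def parse_subdialogues(script: str) -> list[list[str]]:
    """Group script lines into sub-dialogues; '- ' starts a new group."""
    groups: list[list[str]] = []
    for line in script.splitlines():
        if not line.strip():
            continue                        # skip blank lines
        if line.startswith("- "):
            groups.append([line[2:]])       # dash marks a new sub-dialogue
        else:
            groups[-1].append(line)         # continue the current group
    return groups

def run_wizard(groups, trigger, speak):
    """Speak each utterance when triggered; no content processing at all."""
    for group in groups:
        for utterance in group:
            trigger()                       # wizard keypress or timed pause
            speak(utterance)                # hand the text to the synthesiser

groups = parse_subdialogues(SCRIPT)
```

The variable-length pauses mentioned above would live inside `trigger`, which in the wizard-of-Oz version blocks on a keypress and in the automatic version simply sleeps for a chosen interval.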
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this chapter
Campbell, N., Li, Y. (2015). Expressivity in Interactive Speech Synthesis; Some Paralinguistic and Nonlinguistic Issues of Speech Prosody for Conversational Dialogue Systems. In: Hirose, K., Tao, J. (eds) Speech Prosody in Speech Synthesis: Modeling and generation of prosody for high quality and flexible speech synthesis. Prosody, Phonology and Phonetics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-45258-5_7
Print ISBN: 978-3-662-45257-8
Online ISBN: 978-3-662-45258-5