
Perception of Expressivity in TTS: Linguistics, Phonetics or Prosody?

  • Conference paper
  • In: Statistical Language and Speech Processing (SLSP 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10583)

Abstract

Currently, much work on expressive speech focuses on acoustic models and prosody variations. However, in expressive Text-to-Speech (TTS) systems, prosody generation strongly relies on the sequence of phonemes to be expressed and on the words underlying these phonemes. Consequently, linguistic and phonetic cues play a significant role in the perception of expressivity. In previous work, we proposed a statistical corpus-specific framework which adapts phonemes derived from an automatic phonetizer to the phonemes as labelled in the TTS speech corpus. This framework makes it possible to synthesize good-quality but neutral speech samples. The present study goes further in the generation of expressive speech by predicting not only corpus-specific but also expressive pronunciation. It also investigates the combined impacts of linguistics, phonetics and prosody, these impacts being evaluated on French neutral and expressive speech collected with different speaking styles and linguistic content and expressed under diverse emotional states. Perception tests show that expressivity is more easily perceived when linguistics, phonetics and prosody are consistent. Linguistics seems to be the strongest cue in the perception of expressivity, but phonetics greatly improves expressiveness when combined with an adequate prosody.
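The corpus-specific pronunciation adaptation described above can be pictured as rewriting the canonical phoneme sequence produced by an automatic phonetizer into the phoneme labels actually observed in the TTS corpus. The sketch below is a minimal illustration only: it uses a hypothetical hand-written substitution table where the paper's framework learns a statistical model, and the rules and phoneme symbols are invented for the example, not taken from the paper.

```python
# Hypothetical sketch of corpus-specific pronunciation adaptation.
# A context-free substitution table stands in for the learned
# statistical model that maps canonical (phonetizer) phonemes to
# realized (corpus-labelled) phonemes.

# Toy rules (illustrative only): delete a schwa, open a vowel.
ADAPTATION_RULES = {
    "@": [],     # schwa realized as nothing (deletion)
    "o": ["O"],  # vowel labelled as open in the corpus
}

def adapt(canonical):
    """Rewrite a canonical phoneme sequence into a corpus-specific one."""
    realized = []
    for ph in canonical:
        # Unlisted phonemes pass through unchanged.
        realized.extend(ADAPTATION_RULES.get(ph, [ph]))
    return realized

print(adapt(["b", "o", "n", "@"]))  # ['b', 'O', 'n']
```

In the actual framework the mapping is context-dependent and trained on aligned canonical/realized phoneme pairs from the speech corpus; this table-lookup version only conveys the input/output shape of the task.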



Acknowledgments

This study was carried out under the ANR (French National Research Agency) project SynPaFlex ANR-15-CE23-0015.

Author information

Corresponding author

Correspondence to Marie Tahon.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Tahon, M., Lecorvé, G., Lolive, D., Qader, R. (2017). Perception of Expressivity in TTS: Linguistics, Phonetics or Prosody? In: Camelin, N., Estève, Y., Martín-Vide, C. (eds) Statistical Language and Speech Processing. SLSP 2017. Lecture Notes in Computer Science, vol 10583. Springer, Cham. https://doi.org/10.1007/978-3-319-68456-7_22


  • DOI: https://doi.org/10.1007/978-3-319-68456-7_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-68455-0

  • Online ISBN: 978-3-319-68456-7

  • eBook Packages: Computer Science (R0)
