
Perception of Expressivity in TTS: Linguistics, Phonetics or Prosody?

  • Conference paper
  • In: Statistical Language and Speech Processing (SLSP 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10583)

Abstract

Currently, much work on expressive speech focuses on acoustic models and prosody variations. However, in expressive Text-to-Speech (TTS) systems, prosody generation strongly relies on the sequence of phonemes to be expressed and on the words underlying these phonemes. Consequently, linguistic and phonetic cues play a significant role in the perception of expressivity. In previous work, we proposed a statistical corpus-specific framework which adapts phonemes derived from an automatic phonetizer to the phonemes as labelled in the TTS speech corpus. This framework makes it possible to synthesize good-quality but neutral speech samples. The present study goes further in the generation of expressive speech by predicting not only corpus-specific but also expressive pronunciation. It also investigates the combined impacts of linguistics, phonetics and prosody, these impacts being evaluated on French neutral and expressive speech collected with different speaking styles and linguistic content and expressed under diverse emotional states. Perception tests show that expressivity is more easily perceived when linguistics, phonetics and prosody are consistent. Linguistics seems to be the strongest cue in the perception of expressivity, but phonetics greatly improves expressiveness when combined with an adequate prosody.
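The corpus-specific pronunciation adaptation described above can be pictured as rewriting the canonical phoneme sequence produced by an automatic phonetizer into the phoneme labels actually observed in the TTS corpus. The sketch below is a minimal illustration only: it uses a hypothetical hand-written substitution table where the paper's framework learns a statistical model, and the rules and phoneme symbols are invented for the example, not taken from the paper.

```python
# Hypothetical sketch of corpus-specific pronunciation adaptation.
# A context-free substitution table stands in for the learned
# statistical model that maps canonical (phonetizer) phonemes to
# realized (corpus-labelled) phonemes.

# Toy rules (illustrative only): delete a schwa, open a vowel.
ADAPTATION_RULES = {
    "@": [],     # schwa realized as nothing (deletion)
    "o": ["O"],  # vowel labelled as open in the corpus
}

def adapt(canonical):
    """Rewrite a canonical phoneme sequence into a corpus-specific one."""
    realized = []
    for ph in canonical:
        # Unlisted phonemes pass through unchanged.
        realized.extend(ADAPTATION_RULES.get(ph, [ph]))
    return realized

print(adapt(["b", "o", "n", "@"]))  # ['b', 'O', 'n']
```

In the actual framework the mapping is context-dependent and trained on aligned canonical/realized phoneme pairs from the speech corpus; this table-lookup version only conveys the input/output shape of the task.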



Acknowledgments

This study was carried out under the ANR (French National Research Agency) project SynPaFlex ANR-15-CE23-0015.

Author information

Corresponding author

Correspondence to Marie Tahon.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Tahon, M., Lecorvé, G., Lolive, D., Qader, R. (2017). Perception of Expressivity in TTS: Linguistics, Phonetics or Prosody? In: Camelin, N., Estève, Y., Martín-Vide, C. (eds) Statistical Language and Speech Processing. SLSP 2017. Lecture Notes in Computer Science, vol 10583. Springer, Cham. https://doi.org/10.1007/978-3-319-68456-7_22


  • DOI: https://doi.org/10.1007/978-3-319-68456-7_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-68455-0

  • Online ISBN: 978-3-319-68456-7

  • eBook Packages: Computer Science (R0)
