Perception of Expressivity in TTS: Linguistics, Phonetics or Prosody?

  • Marie Tahon
  • Gwénolé Lecorvé
  • Damien Lolive
  • Raheel Qader
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10583)

Abstract

Much current work on expressive speech focuses on acoustic models and prosodic variation. However, in expressive Text-to-Speech (TTS) systems, prosody generation relies strongly on the sequence of phonemes to be uttered and on the words underlying these phonemes. Consequently, linguistic and phonetic cues play a significant role in the perception of expressivity. In previous work, we proposed a statistical corpus-specific framework that adapts the phonemes produced by an automatic phonetizer to the phonemes as labelled in the TTS speech corpus. This framework makes it possible to synthesize good-quality but neutral speech samples. The present study goes further in the generation of expressive speech by predicting pronunciations that are not only corpus-specific but also expressive. It also investigates the joint impact of linguistics, phonetics and prosody, evaluated on French neutral and expressive speech data collected with different speaking styles and linguistic content and expressed under diverse emotional states. Perception tests show that expressivity is more easily perceived when linguistics, phonetics and prosody are consistent. Linguistics seems to be the strongest cue in the perception of expressivity, but phonetics greatly improves expressiveness when combined with an adequate prosody.

Keywords

Expressive speech synthesis, Perception, Linguistics, Phonetics, Prosody, Pronunciation adaptation

Notes

Acknowledgments

This study was carried out under the ANR (French National Research Agency) project SynPaFlex (ANR-15-CE23-0015).

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Marie Tahon (1)
  • Gwénolé Lecorvé (1)
  • Damien Lolive (1)
  • Raheel Qader (1)
  1. IRISA/University of Rennes 1, Lannion, France
