Optimal Feature Set and Minimal Training Size for Pronunciation Adaptation in TTS

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9918)

Abstract

Text-to-Speech (TTS) systems rely on a grapheme-to-phoneme converter that is built to produce canonical, or statically stylized, pronunciations. Hence, TTS quality drops when the phoneme sequences generated by this converter are inconsistent with those labeled in the speech corpus on which the TTS system is built, or when a given expressivity is desired. To solve this problem, the present work aims to automatically adapt generated pronunciations to a given style by training a phoneme-to-phoneme conditional random field (CRF). Specifically, it investigates (i) the choice of optimal features among acoustic, articulatory, phonological and linguistic ones, and (ii) the selection of a minimal data size to train the CRF. As a case study, adaptation to a TTS-dedicated speech corpus is performed. Cross-validation experiments show that small training corpora can be used without much degradation in performance. Apart from improving TTS quality, these results open interesting perspectives for more complex adaptation scenarios towards expressive speech synthesis.

Notes

  1. http://www-expression.irisa.fr/demos/: Corpus-specific adaptation.

Acknowledgments

This study was carried out under the ANR (French National Research Agency) project SynPaFlex ANR-15-CE23-0015.

Author information

Corresponding author

Correspondence to Marie Tahon.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Tahon, M., Qader, R., Lecorvé, G., Lolive, D. (2016). Optimal Feature Set and Minimal Training Size for Pronunciation Adaptation in TTS. In: Král, P., Martín-Vide, C. (eds) Statistical Language and Speech Processing. SLSP 2016. Lecture Notes in Computer Science, vol 9918. Springer, Cham. https://doi.org/10.1007/978-3-319-45925-7_9

  • DOI: https://doi.org/10.1007/978-3-319-45925-7_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45924-0

  • Online ISBN: 978-3-319-45925-7

  • eBook Packages: Computer Science, Computer Science (R0)
