Probabilistic Speaker Pronunciation Adaptation for Spontaneous Speech Synthesis Using Linguistic Features

  • Conference paper
Statistical Language and Speech Processing (SLSP 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9449)

Abstract

Pronunciation adaptation consists of predicting pronunciation variants of words and utterances from their standard pronunciation and a target style. This is a key issue in text-to-speech because these variants bring expressiveness to synthetic speech, especially for a spontaneous style. This paper presents a new pronunciation adaptation method that adapts standard pronunciations to the style of individual speakers in the context of spontaneous speech. Its originality and strength lie in relying solely on linguistic features and in using a probabilistic machine learning framework, namely conditional random fields, to produce the adapted pronunciations. Features are first selected in a series of experiments, then combined to produce the final adaptation method. Backend experiments on the Buckeye conversational English speech corpus show that adapted pronunciations reflect spontaneous speech significantly better than standard ones, and that further gains could be achieved by considering alternative predictions.
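
To make the idea of feeding linguistic features to a sequence labeller such as a conditional random field more concrete, the following is a minimal sketch of how one feature set per canonical phoneme might be built from a symmetric context window. It is not the authors' feature set or toolkit: the function name `window_features`, the `<pad>` symbol, the feature keys, and the toy word-level attribute `pos` are all assumptions made for illustration.

```python
from typing import Dict, List

PAD = "<pad>"  # hypothetical padding symbol for positions outside the sequence


def window_features(
    phonemes: List[str],
    word_feats: List[Dict[str, str]],
    size: int = 2,
) -> List[Dict[str, str]]:
    """Build one feature dict per canonical phoneme.

    Each dict holds the phonemes found in a symmetric window of `size`
    positions on each side, plus the (assumed) linguistic attributes of
    the word that the phoneme belongs to.
    """
    if len(phonemes) != len(word_feats):
        raise ValueError("phonemes and word_feats must have the same length")
    rows: List[Dict[str, str]] = []
    for i in range(len(phonemes)):
        feats: Dict[str, str] = {}
        for off in range(-size, size + 1):
            j = i + off
            inside = 0 <= j < len(phonemes)
            # Key looks like "ph[-1]", "ph[+0]", "ph[+1]".
            feats[f"ph[{off:+d}]"] = phonemes[j] if inside else PAD
        feats.update(word_feats[i])  # word-level linguistic attributes
        rows.append(feats)
    return rows


# Toy input (made up): canonical phonemes of one word, each tagged with
# the same hypothetical part-of-speech attribute.
canonical = ["ae", "n", "d"]
attributes = [{"pos": "CC"}] * len(canonical)
for row in window_features(canonical, attributes, size=1):
    print(row)
```

A sequence labeller would then be trained to map each such feature dict to the realized phoneme (or to a deletion symbol) at that position; the choice of window size and of attributes is exactly what a feature selection experiment would vary.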

Notes

  1. Asymmetric windows were also tested but they led to worse results.

  2. The p-values are 0.01037 and 0.008844 using a paired t-test and a paired Wilcoxon test, respectively, at a significance level \(\alpha = 0.05\).
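
As an illustration of the two paired tests named in Note 2, the sketch below computes the paired t statistic and the Wilcoxon signed-rank statistic with the Python standard library only. The scores are made up and are not the paper's results, and the function names are assumptions. Turning the statistics into p-values requires the corresponding reference distributions, which `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` provide.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic of the paired differences a[i] - b[i] (df = n - 1)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


def wilcoxon_signed_rank(a: Sequence[float], b: Sequence[float]) -> float:
    """Wilcoxon signed-rank statistic min(W+, W-), ties get average ranks."""
    if len(a) != len(b):
        raise ValueError("samples must have equal length")
    # Zero differences carry no sign information and are dropped.
    ordered = sorted((x - y for x, y in zip(a, b) if x != y), key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, ordered) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, ordered) if d < 0)
    return min(w_plus, w_minus)


# Made-up paired scores for two systems evaluated on the same items.
system_a = [3.0, 1.0, 5.0, 4.0]
system_b = [1.0, 2.0, 1.0, 2.0]
print(paired_t_statistic(system_a, system_b))
print(wilcoxon_signed_rank(system_a, system_b))
```

Both tests operate on per-item differences, which is why each score of one system must be paired with the score of the other system on the same item.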

Author information

Corresponding author

Correspondence to Raheel Qader.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Qader, R., Lecorvé, G., Lolive, D., Sébillot, P. (2015). Probabilistic Speaker Pronunciation Adaptation for Spontaneous Speech Synthesis Using Linguistic Features. In: Dediu, AH., Martín-Vide, C., Vicsi, K. (eds) Statistical Language and Speech Processing. SLSP 2015. Lecture Notes in Computer Science, vol 9449. Springer, Cham. https://doi.org/10.1007/978-3-319-25789-1_22

  • DOI: https://doi.org/10.1007/978-3-319-25789-1_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25788-4

  • Online ISBN: 978-3-319-25789-1

  • eBook Packages: Computer Science (R0)
