Probabilistic Speaker Pronunciation Adaptation for Spontaneous Speech Synthesis Using Linguistic Features

Qader, Raheel; Lecorvé, Gwénolé; Lolive, Damien; Sébillot, Pascale

doi:10.1007/978-3-319-25789-1_22


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9449)

Included in the following conference series:

International Conference on Statistical Language and Speech Processing


Abstract

Pronunciation adaptation consists of predicting pronunciation variants of words and utterances from their standard pronunciation and a target style. This is a key issue in text-to-speech, as these variants bring expressiveness to synthetic speech, especially in a spontaneous style. This paper presents a new pronunciation adaptation method that adapts standard pronunciations to the style of individual speakers in the context of spontaneous speech. Its originality and strength lie in relying solely on linguistic features and in using a probabilistic machine learning framework, namely conditional random fields, to produce the adapted pronunciations. Features are first selected in a series of experiments, then combined to produce the final adaptation method. Backend experiments on the Buckeye conversational English speech corpus show that adapted pronunciations reflect spontaneous speech significantly better than standard ones, and that even better results could be achieved by considering alternative predictions.
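The abstract frames adaptation as predicting, for each phoneme of the standard pronunciation, the phoneme a speaker actually realises, with a conditional random field doing the prediction. The Python below is a minimal sketch of the decoding step of a linear-chain CRF under assumptions that are not taken from the paper: standard and realised phonemes are aligned one-to-one, the hypothetical label `_` marks a deleted phoneme, phoneme-identity features from a symmetric window stand in for the paper's linguistic feature set, and the weights are toy hand-set values rather than trained parameters. The names `window_features`, `viterbi`, `realise`, `EMIT_W` and `TRANS_W` are illustrative only.

```python
from typing import Dict, List, Sequence, Tuple

Weights = Dict[Tuple[str, str], float]
DELETED = "_"  # hypothetical label: the standard phoneme is not realised


def window_features(phones: Sequence[str], i: int, size: int = 1) -> List[str]:
    """Phoneme-identity features from a symmetric window of +/- `size` positions."""
    feats = []
    for offset in range(-size, size + 1):
        j = i + offset
        ph = phones[j] if 0 <= j < len(phones) else "<pad>"
        feats.append(f"ph[{offset}]={ph}")
    return feats


def viterbi(phones: Sequence[str], labels: Sequence[str],
            emit_w: Weights, trans_w: Weights, size: int = 1) -> List[str]:
    """Highest-scoring label sequence under a linear-chain CRF (missing weights = 0).

    score(y) = sum_i sum_f emit_w[f, y_i] + sum_i trans_w[y_{i-1}, y_i].
    The normaliser Z(x) does not change the argmax, so it is not computed.
    """
    if not phones:
        return []
    feats = [window_features(phones, i, size) for i in range(len(phones))]

    def emit(i: int, y: str) -> float:
        return sum(emit_w.get((f, y), 0.0) for f in feats[i])

    best = [{y: trans_w.get(("<s>", y), 0.0) + emit(0, y) for y in labels}]
    back: List[Dict[str, str]] = [{}]
    for i in range(1, len(phones)):
        scores, pointers = {}, {}
        for y in labels:
            # Best previous label for y, given the scores at position i - 1.
            prev = max(labels, key=lambda p: best[-1][p] + trans_w.get((p, y), 0.0))
            scores[y] = best[-1][prev] + trans_w.get((prev, y), 0.0) + emit(i, y)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)

    # Trace back from the best final label.
    y = max(labels, key=lambda l: best[-1][l])
    path = [y]
    for i in range(len(phones) - 1, 0, -1):
        y = back[i][y]
        path.append(y)
    return path[::-1]


def realise(labels: Sequence[str]) -> List[str]:
    """Drop deletion markers to obtain the adapted pronunciation."""
    return [p for p in labels if p != DELETED]


# Toy, hand-set weights (hypothetical, not trained): copy each phoneme by default,
# but favour deleting an utterance-final phoneme that follows /n/ (e.g. /d/ in "and").
PHONES = ["ae", "n", "d", "ih", "t"]
LABELS = PHONES + [DELETED]
EMIT_W: Weights = {(f"ph[0]={p}", p): 1.0 for p in PHONES}
EMIT_W.update({("ph[-1]=n", DELETED): 0.8, ("ph[1]=<pad>", DELETED): 0.5})
TRANS_W: Weights = {(DELETED, DELETED): -1.0}  # discourage runs of deletions

print(realise(viterbi(["ae", "n", "d"], LABELS, EMIT_W, TRANS_W)))              # ['ae', 'n']
print(realise(viterbi(["ae", "n", "d", "ih", "t"], LABELS, EMIT_W, TRANS_W)))  # ['ae', 'n', 'd', 'ih', 't']
```

The same /d/ is deleted before the end of the utterance but kept before a vowel, which is the kind of context dependence a window of features lets the model capture. Training the weights and keeping several best paths per position (one way to obtain alternative predictions) are not shown.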


Notes

1. Asymmetric windows were also tested, but they led to worse results.
2. The p-values are 0.01037 and 0.008844 using a paired t-test and a paired Wilcoxon signed-rank test, respectively, with a significance level \(\alpha = 0.05\).
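The note reports two paired tests. The Python below is a sketch of what each one computes on paired per-item scores, using only the standard library and made-up per-utterance error rates (not the paper's data). The paired t-test part returns only the statistic, because converting it into a p-value needs the Student t distribution with n - 1 degrees of freedom. The Wilcoxon part computes an exact p-value by enumeration, which assumes no zero or tied differences and is only feasible for small n. The function names are hypothetical.

```python
import math
from itertools import product
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Paired t statistic: mean(d) / (stdev(d) / sqrt(n)) with d = a - b."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))


def wilcoxon_exact_p(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided Wilcoxon signed-rank p-value over all 2**n sign patterns.

    Assumes no zero differences and no ties among |d|; only practical for small n.
    """
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    ranks = [0] * n
    for r, k in enumerate(sorted(range(n), key=lambda j: abs(d[j])), start=1):
        ranks[k] = r
    w_obs = sum(r for r, x in zip(ranks, d) if x > 0)  # W+ = sum of ranks of positive d
    # Under H0 every rank is equally likely to carry a + or - sign.
    null = [sum(r * s for r, s in zip(range(1, n + 1), signs))
            for signs in product((0, 1), repeat=n)]
    upper = sum(w >= w_obs for w in null) / len(null)
    lower = sum(w <= w_obs for w in null) / len(null)
    return min(1.0, 2 * min(upper, lower))


# Made-up per-utterance error rates (%), for illustration only.
STANDARD = [30, 25, 40, 35, 20]
ADAPTED = [28, 22, 36, 30, 14]
ALPHA = 0.05

t = paired_t_statistic(STANDARD, ADAPTED)
p = wilcoxon_exact_p(STANDARD, ADAPTED)
print(round(t, 3))   # 5.657
print(p, p < ALPHA)  # 0.0625 False
```

With only five pairs, the smallest attainable two-sided exact p-value is 2/32 = 0.0625, so the signed-rank test cannot reach α = 0.05 even when every pair improves. For realistic sample sizes, `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` return both statistics and p-values directly.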


Author information

Authors and Affiliations

IRISA/Université de Rennes 1, Lannion, France
Raheel Qader, Gwénolé Lecorvé & Damien Lolive
IRISA/INSA de Rennes, Rennes, France
Pascale Sébillot


Corresponding author

Correspondence to Raheel Qader.

Editor information

Editors and Affiliations

Research Group on Mathematical Linguistic, Rovira i Virgili University, Tarragona, Spain
Adrian-Horia Dediu
Research Group on Mathematical Linguistic, Rovira i Virgili University, Tarragona, Spain
Carlos Martín-Vide
Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary
Klára Vicsi


About this paper

Cite this paper

Qader, R., Lecorvé, G., Lolive, D., Sébillot, P. (2015). Probabilistic Speaker Pronunciation Adaptation for Spontaneous Speech Synthesis Using Linguistic Features. In: Dediu, AH., Martín-Vide, C., Vicsi, K. (eds) Statistical Language and Speech Processing. SLSP 2015. Lecture Notes in Computer Science, vol 9449. Springer, Cham. https://doi.org/10.1007/978-3-319-25789-1_22


DOI: https://doi.org/10.1007/978-3-319-25789-1_22
Published: 17 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25788-4
Online ISBN: 978-3-319-25789-1
eBook Packages: Computer Science, Computer Science (R0)
