Exploiting Alternatives for Text-To-Speech Synthesis: From Machine to Human

Obin, Nicolas; Veaux, Christophe; Lanchantin, Pierre

doi:10.1007/978-3-662-45258-5_13

Part of the book series: Prosody, Phonology and Phonetics (PRPHPH)

Abstract

The absence of alternatives/variants is a dramatic limitation of text-to-speech (TTS) synthesis compared to the variety of human speech. This chapter introduces the use of speech alternatives/variants to improve TTS synthesis systems. Speech alternatives denote the variety of ways in which a speaker can pronounce a sentence, depending on linguistic constraints, speaker-specific strategies, speaking style, and pragmatic constraints. During training, the symbolic and acoustic characteristics of a unit-selection speech synthesis system are statistically modelled with context-dependent parametric models (Gaussian mixture models (GMMs)/hidden Markov models (HMMs)). During synthesis, symbolic and acoustic alternatives are exploited using a Generalized Viterbi Algorithm (GVA) to determine the sequence of speech units used for synthesis. Objective and subjective evaluations provide evidence that the use of speech alternatives significantly improves speech synthesis over conventional speech synthesis systems. Moreover, speech alternatives can also be used to vary the speech synthesis for a given text. The proposed method can easily be extended to HMM-based speech synthesis.
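
To make the idea of searching over alternatives concrete, the following is a minimal sketch of a list-type Viterbi search, written under stated assumptions rather than taken from the chapter. It keeps several surviving partial paths per candidate unit instead of one, so that the search returns ranked alternative unit sequences. The function name `list_viterbi`, the parameters `target_costs`, `join_cost` and `n_survivors`, and the toy costs are all hypothetical; in a real system the costs would typically be negative log-likelihoods from the trained models.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

# (accumulated cost, chosen candidate index at each position)
Path = Tuple[float, List[int]]


def list_viterbi(
    target_costs: Sequence[Sequence[float]],
    join_cost: Callable[[int, int, int], float],
    n_survivors: int = 2,
) -> List[Path]:
    """Return up to n_survivors lowest-cost candidate sequences, best first.

    target_costs[t][j] is the (hypothetical) cost of candidate j at position t.
    join_cost(t, i, j) is the cost of joining candidate i at position t - 1
    to candidate j at position t.
    """
    # survivors[j] holds the best partial paths that end in candidate j.
    survivors = [[(cost, [j])] for j, cost in enumerate(target_costs[0])]
    for t in range(1, len(target_costs)):
        new_survivors = []
        for j, tc in enumerate(target_costs[t]):
            # Extend every surviving path from every predecessor candidate.
            extended = [
                (cost + join_cost(t, i, j) + tc, path + [j])
                for i, paths in enumerate(survivors)
                for cost, path in paths
            ]
            # Keep a short list per candidate instead of a single survivor.
            new_survivors.append(heapq.nsmallest(n_survivors, extended))
        survivors = new_survivors
    finals = [p for paths in survivors for p in paths]
    return heapq.nsmallest(n_survivors, finals)


# Toy example: 3 positions, 2 candidate units per position.
costs = [[1.0, 2.5], [1.0, 0.5], [0.0, 3.0]]


def join(t: int, i: int, j: int) -> float:
    # Assumed join cost: free when the candidate index is unchanged.
    return 0.0 if i == j else 1.0


alternatives = list_viterbi(costs, join, n_survivors=2)
```

With `n_survivors=1` the sketch reduces to the standard Viterbi search; larger values return lower-ranked sequences that could be used to vary the synthesis for a given text.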

Author information

Authors and Affiliations

IRCAM, UMR STMS IRCAM-CNRS-UPMC, Paris, France
Nicolas Obin
Centre for Speech Technology Research, The University of Edinburgh, Edinburgh, UK
Christophe Veaux
Department of Engineering, Cambridge University, Cambridge, UK
Pierre Lanchantin

Corresponding author

Correspondence to Nicolas Obin.

Editor information

Editors and Affiliations

University of Tokyo, Tokyo, Japan
Keikichi Hirose
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Jianhua Tao

About this chapter

Cite this chapter

Obin, N., Veaux, C., Lanchantin, P. (2015). Exploiting Alternatives for Text-To-Speech Synthesis: From Machine to Human. In: Hirose, K., Tao, J. (eds) Speech Prosody in Speech Synthesis: Modeling and generation of prosody for high quality and flexible speech synthesis. Prosody, Phonology and Phonetics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-45258-5_13

DOI: https://doi.org/10.1007/978-3-662-45258-5_13
Published: 26 February 2015
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-45257-8
Online ISBN: 978-3-662-45258-5
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)
