Abstract
End-to-end text-to-speech synthesis systems have achieved remarkable success in recent years, with improved naturalness and intelligibility. However, end-to-end models, which rely primarily on attention-based alignment, offer no explicit provision to modify or incorporate the desired prosody during synthesis. Moreover, state-of-the-art end-to-end systems use autoregressive models, making prediction sequential; as a result, inference time and computational complexity are quite high. This paper proposes Prosody-TTS, a data-efficient end-to-end speech synthesis model that combines the advantages of statistical parametric models and end-to-end neural network models. It also provides a means to modify or incorporate the desired prosody at a finer level by controlling the fundamental frequency (\(f_0\)) and the phone duration. Generating utterances with appropriate prosody and rhythm improves the naturalness of the synthesized speech. We explicitly model the phoneme duration and \(f_0\) to allow finer-level control over them during synthesis. The model is trained end-to-end to generate the speech waveform directly from the input text, relying on the auxiliary subtasks of predicting the phoneme duration, \(f_0\), and Mel spectrogram. Experiments on the Telugu language data of the IndicTTS database show that the proposed Prosody-TTS model achieves state-of-the-art performance with a mean opinion score of 4.08 and very low inference time, using just 4 hours of training data.
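The abstract describes training with auxiliary subtasks (phoneme duration, \(f_0\), and Mel-spectrogram prediction) alongside waveform generation. A common way to realize such multi-task training is a weighted sum of per-task losses. The sketch below illustrates that idea in plain Python; the function names, dictionary keys, and loss weights are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a multi-task TTS objective combining the three
# auxiliary predictions mentioned in the abstract: Mel spectrogram,
# phoneme duration, and f0.  Weights w_dur and w_f0 are assumed values.

def l1_loss(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def multitask_loss(pred, target, w_dur=0.1, w_f0=0.1):
    """Weighted sum of Mel, duration, and f0 losses.

    `pred` and `target` are dicts with keys 'mel' (flattened Mel
    spectrogram), 'dur' (per-phone durations in frames), and 'f0'
    (frame-level pitch values in Hz).
    """
    return (l1_loss(pred['mel'], target['mel'])
            + w_dur * l1_loss(pred['dur'], target['dur'])
            + w_f0 * l1_loss(pred['f0'], target['f0']))

# Toy example: perfect Mel prediction, small duration and f0 errors.
pred   = {'mel': [0.5, 0.5], 'dur': [3.0, 5.0], 'f0': [120.0, 125.0]}
target = {'mel': [0.5, 0.5], 'dur': [3.0, 4.0], 'f0': [120.0, 130.0]}
loss = multitask_loss(pred, target)  # 0.1*0.5 + 0.1*2.5 = 0.3
```

Because duration and \(f_0\) are explicit model outputs rather than latent attention states, they can be overridden at synthesis time, which is what enables the finer-level prosody control the paper claims.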
Notes
Synthesized speech utterances are available at: https://siplabiith.github.io/prosody-tts.html.
Acknowledgements
The authors would like to thank the Ministry of Electronics and Information Technology (MeitY) for supporting this work under the project “Speech to Speech Translation for Tribal Languages using Deep Learning Framework”.
Cite this article
Pamisetty, G., Sri Rama Murty, K. Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control. Circuits Syst Signal Process 42, 361–384 (2023). https://doi.org/10.1007/s00034-022-02126-z