
Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control


Abstract

End-to-end text-to-speech synthesis systems have achieved immense success in recent times, with improved naturalness and intelligibility. However, end-to-end models, which primarily depend on attention-based alignment, do not offer an explicit provision to modify or incorporate the desired prosody while synthesizing speech. Moreover, state-of-the-art end-to-end systems use autoregressive models for synthesis, making the prediction sequential, so the inference time and computational complexity are quite high. This paper proposes Prosody-TTS, a data-efficient end-to-end speech synthesis model that combines the advantages of statistical parametric models and end-to-end neural network models. It also provides a provision to modify or incorporate the desired prosody at a finer level by controlling the fundamental frequency (\(f_0\)) and the phone duration. Generating speech utterances with appropriate prosody and rhythm helps improve the naturalness of the synthesized speech. We explicitly model the phoneme duration and \(f_0\) to have finer control over them during synthesis. The model is trained in an end-to-end fashion to generate the speech waveform directly from the input text, relying on the auxiliary subtasks of predicting the phoneme duration, \(f_0\), and Mel spectrogram. Experiments on the Telugu language data of the IndicTTS database show that the proposed Prosody-TTS model achieves state-of-the-art performance with a mean opinion score of 4.08 and a very low inference time, using just 4 hours of training data.
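To make the controllability concrete, below is a minimal sketch of the idea the abstract describes: the model exposes its explicit phone-duration and \(f_0\) predictions, so a caller can rescale them before spectrogram and waveform generation. This is an illustrative PyTorch sketch written under stated assumptions, not the authors' implementation (see the project repository in the Notes); all module and parameter names here (ProsodyControllableTTS, duration_scale, f0_scale, etc.) are hypothetical.

```python
import torch
import torch.nn as nn

class ProsodyControllableTTS(nn.Module):
    """Toy model with explicit duration and f0 prediction (hypothetical names)."""

    def __init__(self, n_phones: int, d_model: int = 256, n_mels: int = 80):
        super().__init__()
        self.embed = nn.Embedding(n_phones, d_model)
        # Auxiliary subtasks: per-phone duration (in frames) and per-frame f0.
        self.duration_predictor = nn.Linear(d_model, 1)
        self.f0_predictor = nn.Linear(d_model, 1)
        self.mel_decoder = nn.Linear(d_model + 1, n_mels)  # conditioned on f0

    def forward(self, phone_ids, duration_scale: float = 1.0, f0_scale: float = 1.0):
        h = self.embed(phone_ids)                          # (T_phones, d_model)
        # Predict phone durations and rescale them: >1 slows the speech down.
        dur = self.duration_predictor(h).squeeze(-1).exp() * duration_scale
        frames = dur.round().clamp(min=1).long()
        # Length regulation replaces attention-based alignment: each phone
        # encoding is repeated for its (scaled) number of frames.
        h_frames = h.repeat_interleave(frames, dim=0)      # (T_frames, d_model)
        # Predict frame-level f0 and rescale it: >1 raises the pitch.
        f0 = self.f0_predictor(h_frames) * f0_scale
        mel = self.mel_decoder(torch.cat([h_frames, f0], dim=-1))
        return mel, dur, f0

# Usage: synthesize 20% slower with 10% higher pitch.
model = ProsodyControllableTTS(n_phones=64)
phones = torch.tensor([3, 17, 42, 5])                      # hypothetical phone ids
mel, dur, f0 = model(phones, duration_scale=1.2, f0_scale=1.1)
```

In the full system the Mel spectrogram would still be passed through a waveform generator; the point of the sketch is only that, because alignment comes from predicted durations rather than attention, rescaling the duration or \(f_0\) predictions changes rhythm and intonation without breaking synthesis.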


Notes

  1. Synthesized speech utterances are available at: https://siplabiith.github.io/prosody-tts.html.

  2. https://github.com/siplabiith/Prosody-TTS.

  3. https://github.com/r9y9/wavenet_vocoder.

  4. https://github.com/Kyubyong/tacotron.

  5. https://github.com/TensorSpeech/TensorFlowTTS.

  6. https://github.com/syang1993/gst-tacotron.


Acknowledgements

The authors would like to thank the Ministry of Electronics and Information Technology (MeitY) for supporting this work under the project “Speech to Speech Translation for Tribal Languages using Deep Learning Framework”.

Author information

Correspondence to Giridhar Pamisetty.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Pamisetty, G., Sri Rama Murty, K. Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control. Circuits Syst Signal Process 42, 361–384 (2023). https://doi.org/10.1007/s00034-022-02126-z
