Abstract
End-to-end text-to-speech synthesis systems have achieved remarkable success in recent years, with improved naturalness and intelligibility. However, end-to-end models, which rely primarily on attention-based alignment, offer no explicit provision to modify or incorporate the desired prosody during synthesis. Moreover, state-of-the-art end-to-end systems use autoregressive models, making prediction sequential; as a result, inference time and computational complexity are quite high. This paper proposes Prosody-TTS, a data-efficient end-to-end speech synthesis model that combines the advantages of statistical parametric models and end-to-end neural network models. It also provides a means to modify or incorporate the desired prosody at a finer level by controlling the fundamental frequency (\(f_0\)) and the phone duration. Generating utterances with appropriate prosody and rhythm improves the naturalness of the synthesized speech. We explicitly model the phoneme duration and \(f_0\) to allow finer-level control over them during synthesis. The model is trained end-to-end to generate the speech waveform directly from the input text, relying on the auxiliary subtasks of predicting the phoneme duration, \(f_0\), and Mel spectrogram. Experiments on the Telugu language data of the IndicTTS database show that the proposed Prosody-TTS model achieves state-of-the-art performance with a mean opinion score of 4.08 and very low inference time, using just 4 hours of training data.
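The abstract describes training with auxiliary subtasks (phoneme duration, \(f_0\), and Mel-spectrogram prediction) alongside waveform generation. A common way to realize such multi-task training is a weighted sum of per-task losses. The sketch below illustrates that idea in plain Python; the function names, dictionary keys, and loss weights are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a multi-task TTS objective combining the three
# auxiliary predictions mentioned in the abstract: Mel spectrogram,
# phoneme duration, and f0.  Weights w_dur and w_f0 are assumed values.

def l1_loss(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def multitask_loss(pred, target, w_dur=0.1, w_f0=0.1):
    """Weighted sum of Mel, duration, and f0 losses.

    `pred` and `target` are dicts with keys 'mel' (flattened Mel
    spectrogram), 'dur' (per-phone durations in frames), and 'f0'
    (frame-level pitch values in Hz).
    """
    return (l1_loss(pred['mel'], target['mel'])
            + w_dur * l1_loss(pred['dur'], target['dur'])
            + w_f0 * l1_loss(pred['f0'], target['f0']))

# Toy example: perfect Mel prediction, small duration and f0 errors.
pred   = {'mel': [0.5, 0.5], 'dur': [3.0, 5.0], 'f0': [120.0, 125.0]}
target = {'mel': [0.5, 0.5], 'dur': [3.0, 4.0], 'f0': [120.0, 130.0]}
loss = multitask_loss(pred, target)  # 0.1*0.5 + 0.1*2.5 = 0.3
```

Because duration and \(f_0\) are explicit model outputs rather than latent attention states, they can be overridden at synthesis time, which is what enables the finer-level prosody control the paper claims.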
Notes
Synthesized speech utterances are available at: https://siplabiith.github.io/prosody-tts.html.
Acknowledgements
The authors would like to thank the Ministry of Electronics and Information Technology (MeitY) for supporting this work under the project “Speech to Speech Translation for Tribal Languages using Deep Learning Framework”.
Cite this article
Pamisetty, G., Sri Rama Murty, K. Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control. Circuits Syst Signal Process 42, 361–384 (2023). https://doi.org/10.1007/s00034-022-02126-z