Abstract
End-to-end speech synthesis methods have achieved nearly natural, human-like speech, yet they remain prone to synthesis errors such as missing or repeated words and incomplete synthesis. We argue that this is mainly due to the local information preference between the text input and the learned acoustic features of a conditional autoregressive (CAR) model. The local information preference prevents the model from depending on the text input when predicting acoustic features, which contributes to synthesis errors at inference time. In this work, we compare two modified architectures based on Tacotron2 for generating Arabic speech. The first architecture replaces the WaveNet vocoder with WaveGlow, a flow-based implementation. The second architecture, influenced by InfoGAN, maximizes the mutual information between the text input and the predicted acoustic features (mel-spectrogram) to eliminate the local information preference. The training objective is also extended with a CTC loss term, which can be regarded as a metric of the local information preference between the text input and the predicted acoustic features. We carried out the experiments on Nawar Halabi's dataset (http://en.arabicspeechcorpus.com/), which contains about 2.41 h of Arabic speech. Our experiments show that maximizing the mutual information between the predicted acoustic features and the conditioning text input, together with the modified training objective, enhances the subjective quality of the generated speech and reduces the utterance error rate.
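To make the added CTC loss term concrete: CTC scores how likely the target label sequence is under per-frame label probabilities, so it ties predicted frames back to the text. Below is a minimal, unbatched NumPy sketch of the CTC forward algorithm (negative log-likelihood); it is an illustration of what such a term computes, not the authors' implementation, and the blank index, vocabulary, and probabilities are assumptions for the example.

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC negative log-likelihood of `target` given frame-wise
    log-probabilities `log_probs` of shape (T, vocab); index `blank`
    is the CTC blank symbol."""
    # Extend the target with blanks: [b, y1, b, y2, b, ...]
    ext = [blank]
    for y in target:
        ext += [y, blank]
    S, T = len(ext), log_probs.shape[0]
    NEG = -np.inf

    def lse(a, b):  # log(exp(a) + exp(b)), safe for -inf
        if a == NEG:
            return b
        if b == NEG:
            return a
        m = max(a, b)
        return m + np.log(np.exp(a - m) + np.exp(b - m))

    # Forward variables over the extended label sequence
    alpha = np.full((T, S), NEG)
    alpha[0, 0] = log_probs[0, blank]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a = lse(a, alpha[t - 1, s - 1])
            # Skip transition allowed between distinct non-blank labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = lse(a, alpha[t - 1, s - 2])
            alpha[t, s] = a + log_probs[t, ext[s]]
    # Valid alignments end on the last label or the trailing blank
    ll = alpha[T - 1, S - 1]
    if S > 1:
        ll = lse(ll, alpha[T - 1, S - 2])
    return -ll
```

For instance, with two frames emitting blank with probability 0.6 and label 1 with probability 0.4, the alignments (1,1), (1,blank), (blank,1) all decode to "1", giving a total probability of 0.64.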
Notes
A spectrogram can be viewed as a stack of fast Fourier transformed (FFT) windowed signals; the amplitude of the signal is converted to decibels, while the y-axis (frequency) is mapped to the mel scale.
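The pipeline sketched in this note can be written out in plain NumPy: frame and window the signal, take the FFT magnitude of each frame, project onto a triangular mel filterbank, and convert amplitudes to decibels. The parameters below (16 kHz sample rate, 1024-point FFT, hop of 256, 80 mel bands) are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(signal, sr=16000, n_fft=1024, hop=256, n_mels=80):
    # 1. Frame the signal and apply a Hann window
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # 2. Magnitude spectrum of each windowed frame (FFT)
    mag = np.abs(np.fft.rfft(frames, axis=1))        # (n_frames, n_fft//2 + 1)
    # 3. Triangular mel filterbank: maps linear frequency bins to mel bands
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fb[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[m - 1, k] = (hi - k) / max(hi - c, 1)
    mel = fb @ mag.T                                  # (n_mels, n_frames)
    # 4. Amplitude -> decibels (floored to avoid log(0))
    return 20.0 * np.log10(np.maximum(mel, 1e-10))
```

One second of 16 kHz audio yields 59 frames here, so the output is an 80 × 59 mel-spectrogram.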
See Example 2 in the following survey: https://cutt.ly/5kt9xP9
The quality of being real and human-like.
The quality of being possible to understand.
https://shorturl.at/huGOW
References
Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In Y. Bengio & Y. LeCun (Eds.), 3rd International conference on learning representations (ICLR 2015), Conference Track Proceedings, San Diego, CA, USA 7–9 May 2015. arXiv:1409.0473
Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016a). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. NIPS. arXiv:1606.03657
Chen, X., Kingma, D. P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., & Abbeel, P. (2016b). Variational lossy autoencoder. ICLR. arXiv:1611.02731
Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., & Bengio, Y. (2015). Attention-based models for speech recognition. CoRR. arXiv:1506.07503
Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using Real NVP. CoRR. arXiv:1605.08803
Fahmy, F. K., Khalil, M. I., & Abbas, H. M. (2020). A transfer learning end-to-end Arabic text-to-speech (TTS) deep architecture. In F. P. Schilling & T. Stadelmann (Eds.), Artificial neural networks in pattern recognition (pp. 266–277). Cham: Springer.
Guo, H., Soong, F. K., He, L., & Xie, L. (2019). A new GAN-based end-to-end TTS training algorithm. CoRR. arXiv:1904.04775
Halabi, N., & Wald, M. (2016). Phonetic inventory for an Arabic speech corpus. In Proceedings of the tenth international conference on language resources and evaluation (LREC’16) (pp. 734–738).
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (ICML). arXiv:1502.03167
Jin, Z., Finkelstein, A., Mysore, G. J., & Lu, J. (2018). FFTNet: A real-time speaker-dependent neural vocoder. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 2251–2255).
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In Y. Bengio & Y. LeCun (Eds.), 3rd International conference on learning representations (ICLR 2015), Conference Track Proceedings, San Diego, CA, USA 7–9 May 2015. arXiv:1412.6980
Kingma, D. P., & Dhariwal, P. (2018). Glow: Generative flow with invertible 1 × 1 convolutions. CoRR. arXiv:1807.03039
Krueger, D., Maharaj, T., Kramár, J., Pezeshki, M., Ballas, N., Ke, N. R., Goyal, A., Bengio, Y., Larochelle, H., Courville, A. C., & Pal, C. (2016). Zoneout: Regularizing RNNs by randomly preserving hidden activations. In International conference on learning representations (ICLR). arXiv:1606.01305
Li, Y., & Yuan, Y. (2017). Convergence analysis of two-layer neural networks with ReLU activation. CoRR. arXiv:1705.09886
Liu, P., Wu, X., Kang, S., Li, G., Su, D., & Yu, D. (2019). Maximizing mutual information for Tacotron. arXiv:1909.01145
Pascanu, R., Mikolov, T., & Bengio, Y. (2012). Understanding the exploding gradient problem. ICML. arXiv:1211.5063
Ping, W., Peng, K., & Chen, J. (2018). ClariNet: Parallel wave generation in end-to-end text-to-speech. CoRR. arXiv:1807.07281
Prenger, R., Valle, R., & Catanzaro, B. (2018). WaveGlow: A flow-based generative network for speech synthesis. CoRR. arXiv:1811.00002
Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681.
Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R. J., Saurous, R., Agiomyrgiannakis, Y., & Wu, Y. (2017). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. CoRR. arXiv:1712.05884
Skerry-Ryan, R. J., Battenberg, E., Xiao, Y., Wang, Y., Stanton, D., Shor, J., Weiss, R. J., Clark, R., & Saurous, R. A. (2018). Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron. CoRR. arXiv:1803.09047
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems. arXiv:1409.3215
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. CoRR. arXiv:1609.03499
van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., van den Driessche, G., Lockhart, E., Cobo, L. C., Stimberg, F., Casagrande, N., Grewe, D., Noury, S., Dieleman, S., Elsen, E., Kalchbrenner, N., Zen, H., Graves, A., King, H., Walters, T., Belov, D., & Hassabis, D. (2017). Parallel WaveNet: Fast high-fidelity speech synthesis. CoRR. arXiv:1711.10433
Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S, Le, Q. V., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. A. (2017). Tacotron: A fully end-to-end text-to-speech synthesis model. CoRR. arXiv:1703.10135
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2), 270–280.
Yassa, F. (1987). Optimality in the choice of the convergence factor for gradient-based adaptive algorithms. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(1), 48–59.
Zhang, Z., & Sabuncu, M. (2018). Generalized cross entropy loss for training deep neural networks with noisy labels. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 31, pp. 8778–8788). Curran Associates.
Cite this article
Fahmy, F.K., Abbas, H.M. & Khalil, M.I. Boosting subjective quality of Arabic text-to-speech (TTS) using end-to-end deep architecture. Int J Speech Technol 25, 79–88 (2022). https://doi.org/10.1007/s10772-022-09961-0