
Boosting subjective quality of Arabic text-to-speech (TTS) using end-to-end deep architecture

Published in: International Journal of Speech Technology

Abstract

End-to-end speech synthesis methods have achieved nearly natural, human-like speech, yet they remain prone to synthesis errors such as missing or repeated words and incomplete synthesis. We argue this is mainly due to the local information preference between the text input and the learned acoustic features of a conditional autoregressive (CAR) model: the local information preference prevents the model from depending on the text input when predicting acoustic features, and it contributes to synthesis errors at inference time. In this work, we compare two modified architectures based on Tacotron2 for generating Arabic speech. The first replaces the WaveNet vocoder with the flow-based WaveGlow. The second, influenced by InfoGAN, maximizes the mutual information between the text input and the predicted acoustic features (mel-spectrogram) to eliminate the local information preference. The training objective has also been changed by adding a CTC loss term, which can be regarded as a metric of the local information preference between the text input and the predicted acoustic features. We carried out our experiments on Nawar Halabi’s dataset (http://en.arabicspeechcorpus.com/), which contains about 2.41 h of Arabic speech. Our experiments show that maximizing the mutual information between the predicted acoustic features and the conditional text input, together with the modified training objective, enhances the subjective quality of the generated speech and reduces the utterance error rate.
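As a rough illustration of the modified training objective, the sketch below implements the CTC forward (alpha) recursion in plain NumPy and adds it, weighted, to a mel-reconstruction term. This is a toy sketch, not the paper's implementation: the function names, the MSE reconstruction term, and the weight `lam` are assumptions made here for illustration, and in the actual architecture the per-frame text-label probabilities would come from a recognizer applied to the predicted mel-spectrogram.

```python
import numpy as np

def ctc_loss(probs, target, blank=0):
    """CTC negative log-likelihood of `target`, via the standard
    forward (alpha) recursion in probability space.
    probs: (T, C) per-frame class probabilities; target: label sequence."""
    T = probs.shape[0]
    ext = [blank]
    for y in target:
        ext += [y, blank]            # interleave blanks: [b, y1, b, y2, b, ...]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]   # start with a blank ...
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]  # ... or with the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                      # stay on same symbol
            if s > 0:
                a += alpha[t - 1, s - 1]             # advance by one
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]             # skip a blank
            alpha[t, s] = a * probs[t, ext[s]]
    # valid endings: last label or trailing blank
    return -np.log(alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0))

def combined_loss(mel_pred, mel_true, frame_probs, text_labels, lam=0.1):
    """Hypothetical Tacotron2-style objective: mel reconstruction (MSE)
    plus a CTC term tying predicted frames back to the input text."""
    recon = np.mean((mel_pred - mel_true) ** 2)
    return recon + lam * ctc_loss(frame_probs, text_labels)
```

Intuitively, the CTC term is low only when the text labels are recoverable from the predicted frames, which penalizes exactly the failure mode (ignoring the text input) that the local information preference induces.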

Notes

  1. A spectrogram can be viewed as a stack of windowed, fast-Fourier-transformed (FFT) signal frames. The signal amplitude is converted to decibels, while the frequency (y) axis is mapped to the mel scale.

  2. http://en.arabicspeechcorpus.com/

  3. https://hpc.bibalex.org/.

  4. https://github.com/nawarhalabi/Arabic-Phonetiser.

  5. https://cutt.ly/xbkdcrr.

  6. See Example 2 in the following survey: https://cutt.ly/5kt9xP9

  7. https://www.mturk.com/.

  8. https://cutt.ly/5kt9xP9.

  9. The quality of being real and human-like.

  10. The quality of being possible to understand.

  11. https://shorturl.at/huGOW
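The spectrogram construction described in note 1 can be sketched in a few lines of NumPy. This is an illustrative from-scratch implementation, not the paper's feature extractor; the parameter values (22.05 kHz sample rate, 1024-point FFT, hop of 256, 80 mel bands) are common Tacotron2-style defaults assumed here for illustration, and a library such as librosa would normally be used instead.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(signal, sr=22050, n_fft=1024, hop=256, n_mels=80):
    # 1. Stack windowed FFT magnitudes (the "stack of FFT frames" of note 1).
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        spectrum = np.fft.rfft(signal[start:start + n_fft] * window)
        frames.append(np.abs(spectrum))
    spec = np.array(frames).T                       # (n_fft//2 + 1, n_frames)
    # 2. Triangular mel filterbank: filters equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)   # rising edge
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)  # falling edge
    mel = fbank @ spec
    # 3. Convert amplitude to decibels (floored to avoid log of zero).
    return 20.0 * np.log10(np.maximum(mel, 1e-10))
```

The output has shape `(n_mels, n_frames)`, matching the mel-spectrogram targets that the acoustic model predicts.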

References

  • Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In Y. Bengio & Y. LeCun (Eds.), 3rd International conference on learning representations (ICLR 2015), Conference Track Proceedings, San Diego, CA, USA 7–9 May 2015. arXiv:1409.0473

  • Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016a). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. NIPS. arXiv:1606.03657

  • Chen, X., Kingma, D. P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., & Abbeel, P. (2016b). Variational lossy autoencoder. ICLR. arXiv:1611.02731

  • Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., & Bengio, Y. (2015). Attention-based models for speech recognition. CoRR. arXiv:1506.07503

  • Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using real NVP. CoRR. arXiv:1605.08803

  • Fahmy, F. K., Khalil, M. I., & Abbas, H. M. (2020). A transfer learning end-to-end Arabic text-to-speech (TTS) deep architecture. In F. P. Schilling & T. Stadelmann (Eds.), Artificial neural networks in pattern recognition (pp. 266–277). Cham: Springer.

  • Guo, H., Soong, F. K., He, L., & Xie, L. (2019). A new GAN-based end-to-end TTS training algorithm. CoRR. arXiv:1904.04775

  • Halabi, N., & Wald, M. (2016). Phonetic inventory for an Arabic speech corpus. In Proceedings of the tenth international conference on language resources and evaluation (LREC’16) (pp. 734–738).

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (ICML). arXiv:1502.03167

  • Jin, Z., Finkelstein, A., Mysore, G. J., & Lu, J. (2018). FFTNet: A real-time speaker-dependent neural vocoder. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 2251–2255).

  • Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In Y. Bengio & Y. LeCun (Eds.), 3rd International conference on learning representations (ICLR 2015), Conference Track Proceedings, San Diego, CA, USA 7–9 May 2015. arXiv:1412.6980

  • Kingma, D. P., & Dhariwal, P. (2018). Glow: Generative flow with invertible \(1 \times 1\) convolutions. CoRR. arXiv:1807.03039

  • Krueger, D., Maharaj, T., Kramár, J., Pezeshki, M., Ballas, N., Ke, N. R., Goyal, A., Bengio, Y., Larochelle, H., Courville, A. C., & Pal, C. (2016). Zoneout: Regularizing RNNs by randomly preserving hidden activations. In International conference on learning representations (ICLR). arXiv:1606.01305

  • Li, Y., & Yuan, Y. (2017). Convergence analysis of two-layer neural networks with ReLU activation. CoRR. arXiv:1705.09886

  • Liu, P., Wu, X., Kang, S., Li, G., Su, D., & Yu, D. (2019). Maximizing mutual information for Tacotron. arXiv:1909.01145

  • Pascanu, R., Mikolov, T., & Bengio, Y. (2012). Understanding the exploding gradient problem. ICML. arXiv:1211.5063

  • Ping, W., Peng, K., & Chen, J. (2018). ClariNet: Parallel wave generation in end-to-end text-to-speech. CoRR. arXiv:1807.07281

  • Prenger, R., Valle, R., & Catanzaro, B. (2018). WaveGlow: A flow-based generative network for speech synthesis. CoRR. arXiv:1811.00002

  • Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681.

  • Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R. J., Saurous, R., Agiomyrgiannakis, Y., & Wu, Y. (2017). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. CoRR. arXiv:1712.05884

  • Skerry-Ryan, R. J., Battenberg, E., Xiao, Y., Wang, Y., Stanton, D., Shor, J., Weiss, R. J., Clark, R., & Saurous, R. A. (2018). Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron. CoRR. arXiv:1803.09047

  • Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.

  • Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems. arXiv:1409.3215

  • van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. CoRR. arXiv:1609.03499

  • van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., van den Driessche, G., Lockhart, E., Cobo, L. C., Stimberg, F., Casagrande, N., Grewe, D., Noury, S., Dieleman, S., Elsen, E., Kalchbrenner, N., Zen, H., Graves, A., King, H., Walters, T., Belov, D., & Hassabis, D. (2017). Parallel WaveNet: Fast high-fidelity speech synthesis. CoRR. arXiv:1711.10433

  • Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S, Le, Q. V., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. A. (2017). Tacotron: A fully end-to-end text-to-speech synthesis model. CoRR. arXiv:1703.10135

  • Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2), 270–280.

  • Yassa, F. (1987). Optimality in the choice of the convergence factor for gradient-based adaptive algorithms. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(1), 48–59.

  • Zhang, Z., & Sabuncu, M. (2018). Generalized cross entropy loss for training deep neural networks with noisy labels. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 31, pp. 8778–8788). Curran Associates.

About this article

Cite this article

Fahmy, F.K., Abbas, H.M. & Khalil, M.I. Boosting subjective quality of Arabic text-to-speech (TTS) using end-to-end deep architecture. Int J Speech Technol 25, 79–88 (2022). https://doi.org/10.1007/s10772-022-09961-0
