Abstract
End-to-end speech synthesis methods have achieved nearly natural, human-like speech, yet they remain prone to synthesis errors such as missing or repeated words and incomplete synthesis. We argue that this is mainly due to the local information preference between the text input and the learned acoustic features of a conditional autoregressive (CAR) model. The local information preference prevents the model from depending on the text input when predicting acoustic features, which contributes to synthesis errors at inference time. In this work, we compare two modified architectures based on Tacotron2 for generating Arabic speech. The first architecture replaces the WaveNet vocoder with WaveGlow, a flow-based implementation. The second architecture, influenced by InfoGAN, maximizes the mutual information between the text input and the predicted acoustic features (mel-spectrogram) to eliminate the local information preference. The training objective is also extended with a CTC loss term, which can be regarded as a metric of the local information preference between the text input and the predicted acoustic features. We carried out the experiments on Nawar Halabi's dataset (http://en.arabicspeechcorpus.com/), which contains about 2.41 h of Arabic speech. Our experiments show that maximizing the mutual information between the predicted acoustic features and the conditioning text input, together with the modified training objective, enhances the subjective quality of the generated speech and reduces the utterance error rate.
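To make the added CTC loss term concrete: CTC scores how likely the target label sequence is under per-frame label probabilities, so it ties predicted frames back to the text. Below is a minimal, unbatched NumPy sketch of the CTC forward algorithm (negative log-likelihood); it is an illustration of what such a term computes, not the authors' implementation, and the blank index, vocabulary, and probabilities are assumptions for the example.

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC negative log-likelihood of `target` given frame-wise
    log-probabilities `log_probs` of shape (T, vocab); index `blank`
    is the CTC blank symbol."""
    # Extend the target with blanks: [b, y1, b, y2, b, ...]
    ext = [blank]
    for y in target:
        ext += [y, blank]
    S, T = len(ext), log_probs.shape[0]
    NEG = -np.inf

    def lse(a, b):  # log(exp(a) + exp(b)), safe for -inf
        if a == NEG:
            return b
        if b == NEG:
            return a
        m = max(a, b)
        return m + np.log(np.exp(a - m) + np.exp(b - m))

    # Forward variables over the extended label sequence
    alpha = np.full((T, S), NEG)
    alpha[0, 0] = log_probs[0, blank]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a = lse(a, alpha[t - 1, s - 1])
            # Skip transition allowed between distinct non-blank labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = lse(a, alpha[t - 1, s - 2])
            alpha[t, s] = a + log_probs[t, ext[s]]
    # Valid alignments end on the last label or the trailing blank
    ll = alpha[T - 1, S - 1]
    if S > 1:
        ll = lse(ll, alpha[T - 1, S - 2])
    return -ll
```

For instance, with two frames emitting blank with probability 0.6 and label 1 with probability 0.4, the alignments (1,1), (1,blank), (blank,1) all decode to "1", giving a total probability of 0.64.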
Notes
A spectrogram can be viewed as a stack of fast Fourier transformed (FFT) windowed signals; the amplitude of the signal is converted to decibels, while the y-axis (frequency) is mapped to the mel scale.
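The pipeline sketched in this note can be written out in plain NumPy: frame and window the signal, take the FFT magnitude of each frame, project onto a triangular mel filterbank, and convert amplitudes to decibels. The parameters below (16 kHz sample rate, 1024-point FFT, hop of 256, 80 mel bands) are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(signal, sr=16000, n_fft=1024, hop=256, n_mels=80):
    # 1. Frame the signal and apply a Hann window
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # 2. Magnitude spectrum of each windowed frame (FFT)
    mag = np.abs(np.fft.rfft(frames, axis=1))        # (n_frames, n_fft//2 + 1)
    # 3. Triangular mel filterbank: maps linear frequency bins to mel bands
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fb[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[m - 1, k] = (hi - k) / max(hi - c, 1)
    mel = fb @ mag.T                                  # (n_mels, n_frames)
    # 4. Amplitude -> decibels (floored to avoid log(0))
    return 20.0 * np.log10(np.maximum(mel, 1e-10))
```

One second of 16 kHz audio yields 59 frames here, so the output is an 80 × 59 mel-spectrogram.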
See Example 2 in the following survey: https://cutt.ly/5kt9xP9
The quality of being real and human-like.
The quality of being possible to understand.
https://shorturl.at/huGOW
References
Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In Y. Bengio & Y. LeCun (Eds.), 3rd International conference on learning representations (ICLR 2015), Conference Track Proceedings, San Diego, CA, USA 7–9 May 2015. arXiv:1409.0473
Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016a). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. NIPS. arXiv:1606.03657
Chen, X., Kingma, D. P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., & Abbeel, P. (2016b). Variational lossy autoencoder. ICLR. arXiv:1611.02731
Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., & Bengio, Y. (2015). Attention-based models for speech recognition. CoRR. arXiv:1506.07503
Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using Real NVP. CoRR. arXiv:1605.08803
Fahmy, F. K., Khalil, M. I., & Abbas, H. M. (2020). A transfer learning end-to-end Arabic text-to-speech (TTS) deep architecture. In F. P. Schilling & T. Stadelmann (Eds.), Artificial neural networks in pattern recognition (pp. 266–277). Cham: Springer.
Guo, H., Soong, F. K., He, L., & Xie, L. (2019). A new GAN-based end-to-end TTS training algorithm. CoRR. arXiv:1904.04775
Halabi, N., & Wald, M. (2016). Phonetic inventory for an Arabic speech corpus. In Proceedings of the tenth international conference on language resources and evaluation (LREC’16) (pp. 734–738).
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (ICML). arXiv:1502.03167
Jin, Z., Finkelstein, A., Mysore, G. J., & Lu, J. (2018). FFTNet: A real-time speaker-dependent neural vocoder. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 2251–2255).
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In Y. Bengio & Y. LeCun (Eds.), 3rd International conference on learning representations (ICLR 2015), Conference Track Proceedings, San Diego, CA, USA 7–9 May 2015. arXiv:1412.6980
Kingma, D. P., & Dhariwal, P. (2018). Glow: Generative flow with invertible 1 × 1 convolutions. CoRR. arXiv:1807.03039
Krueger, D., Maharaj, T., Kramár, J., Pezeshki, M., Ballas, N., Ke, N. R., Goyal, A., Bengio, Y., Larochelle, H., Courville, A. C., & Pal, C. (2016). Zoneout: Regularizing RNNs by randomly preserving hidden activations. In International conference on learning representations (ICLR). arXiv:1606.01305
Li, Y., & Yuan, Y. (2017). Convergence analysis of two-layer neural networks with ReLU activation. CoRR. arXiv:1705.09886
Liu, P., Wu, X., Kang, S., Li, G., Su, D., & Yu, D. (2019). Maximizing mutual information for Tacotron. arXiv:1909.01145
Pascanu, R., Mikolov, T., & Bengio, Y. (2012). Understanding the exploding gradient problem. ICML. arXiv:1211.5063
Ping, W., Peng, K., & Chen, J. (2018). ClariNet: Parallel wave generation in end-to-end text-to-speech. CoRR. arXiv:1807.07281
Prenger, R., Valle, R., & Catanzaro, B. (2018). WaveGlow: A flow-based generative network for speech synthesis. CoRR. arXiv:1811.00002
Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681.
Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R. J., Saurous, R., Agiomyrgiannakis, Y., & Wu, Y. (2017). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. CoRR. arXiv:1712.05884
Skerry-Ryan, R. J., Battenberg, E., Xiao, Y., Wang, Y., Stanton, D., Shor, J., Weiss, R. J., Clark, R., & Saurous, R. A. (2018). Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron. CoRR. arXiv:1803.09047
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems. arXiv:1409.3215
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. CoRR. arXiv:1609.03499
van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., van den Driessche, G., Lockhart, E., Cobo, L. C., Stimberg, F., Casagrande, N., Grewe, D., Noury, S., Dieleman, S., Elsen, E., Kalchbrenner, N., Zen, H., Graves, A., King, H., Walters, T., Belov, D., & Hassabis, D. (2017). Parallel WaveNet: Fast high-fidelity speech synthesis. CoRR. arXiv:1711.10433
Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S, Le, Q. V., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. A. (2017). Tacotron: A fully end-to-end text-to-speech synthesis model. CoRR. arXiv:1703.10135
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2), 270–280.
Yassa, F. (1987). Optimality in the choice of the convergence factor for gradient-based adaptive algorithms. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(1), 48–59.
Zhang, Z., & Sabuncu, M. (2018). Generalized cross entropy loss for training deep neural networks with noisy labels. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 31, pp. 8778–8788). Curran Associates.
Cite this article
Fahmy, F.K., Abbas, H.M. & Khalil, M.I. Boosting subjective quality of Arabic text-to-speech (TTS) using end-to-end deep architecture. Int J Speech Technol 25, 79–88 (2022). https://doi.org/10.1007/s10772-022-09961-0