Persian speech synthesis using enhanced tacotron based on multi-resolution convolution layers and a convex optimization method

Naderi, Navid; Nasersharif, Babak; Nikoofard, Amirhossein

doi:10.1007/s11042-021-11719-w

Persian speech synthesis using enhanced tacotron based on multi-resolution convolution layers and a convex optimization method

Published: 16 November 2021

Volume 81, pages 3629–3645, (2022)
Cite this article

Multimedia Tools and Applications Aims and scope Submit manuscript

258 Accesses
3 Citations
1 Altmetric
Explore all metrics

Abstract

An end-to-end text-to-speech system generates acoustic features directly from input text to synthesize speech from it. The challenges of using these models for Persian language are lack of a proper data, and also detection of exceptions and Ezafe between words inherently (without grapheme-to-phoneme). In this paper, we propose to use an special end-to-end tts system named Tacotron2, and suggest solutions for the mentioned problems. For the lack of data problem, we collect a dataset proper for end-to-end text-to-speech including 21 hours of Persian speech and corresponding text. We use multi-resolution convolution and part of speech embedding layers in the encoder part of Tacotron2, to overcome the exceptions and Ezafe detection problem. In addition, in the case of Tacotron2, Mel-spectrogram generation process is unstable due to high dropout rate at inference time. To handle this problem, we propose to use a convex optimization method, named Net-Trim. Experimental results show that our proposed method increases Tacotron2 mean opinion score from 3.01 to 3.97. Furthermore, the proposed method decreases Mel cepstral distortion in comparison with Tacotron2.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Speech Synthesis Method Based on Tacotron + WaveNet

A Review of End-to-End Chinese – Mandarin Speech Synthesis Techniques

Conventional and contemporary approaches used in text to speech synthesis: a review

Article 13 November 2022

Notes

References

Aghasi A, Abdi A, Nguyen N, Romberg J (2017) Net-Trim: Convex pruning of deep neural networks with performance guarantee. 31st Conference on neural information processing systems (NIPS). Long Beach, CA, USA, pp 3177–3186
Google Scholar
Arik SÖ, Chrzanowski M, Coates A, Diamos G, Gibiansky A, Kang Y, Li X, Miller J, Ng, A, Raiman J, Others (2017) Deep voice: Real-time neural text-to-speech. In: Proceedings of the 34th international conference on machine learning, vol 70, pp 195–204
Battenberg E, Mariooryad S, Stanton D, Skerry-Ryan RJ, Shannon M, Kao D, Bagby T (2019a) Effective use of variational embedding capacity in expressive end-to-end speech synthesis. In: International conference on learning representations (ICLR), Addis Ababa, Ethiopia
Battenberg E, Skerry-Ryan RJ, Mariooryad S, Stanton D, Kao D, Shannon M, Bagby T (2019b) Location-relative attention mechanisms for robust long-form speech synthesis. In: 45th International conference on acoustics, speech, and signal processing (ICASSP), pp 6189–6193
Bollepalli B, Juvela L, Alku P (2019) Lombard speech synthesis using transfer learning in a tacotron text-to-speech system. In: Proceedings of interspeech, pp 2833–2837
Chen M, Chen M, Liang S, Ma J, Chen L, Wang S, Xiao J (2019) Cross-lingual, multi-speaker text-to-speech synthesis using neural speaker embedding. In: Interspeech, pp 2105–2109
Chung Y-A, Wang Y, Hsu W-N, Zhang Y, Skerry-Ryan RJ (2019) Semi-supervised training for improving data efficiency in end-to-end speech synthesis. In: 2019 IEEE International conference on acoustics, speech and signal processing (ICASSP), pp 6940–6944
Gibiansky A, Arik S, Diamos G, Miller J, Peng K, Ping W, Raiman J, Zhou Y (2017) Deep voice 2: Multi-speaker neural text-to-speech. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems (NIPS) 30, Curran Associates, Inc, pp 2962–2970
Haig G (2011) Linker, relativizer, nominalizer, tense-particle: On the Ezafe in West Iranian. John Benjamins Publishing Company, Amsterdam
Book Google Scholar
Halabi N (2016) Modern standard persian phonetics for speech synthesis. PhD thesis, University of Southampton
Hashemian M, Heidari Soureshjani K (2013) An analysis of pronunciation errors of iranian efl learners. Research in English Language Pedagogy 1(1):5–12
Google Scholar
Hendessi F, Ghayoori A, Gulliver TA (2005) A speech synthesizer for persian text using a neural network with a smooth ergodic hmm. ACM Transactions on Asian Language Information Processing 4(1):38–52
Article Google Scholar
Ito, K (2018) https://github.com/keithito/tacotron
King SJ, Karaiskos V (2011) The blizzard challenge 2011. In: Proceedings of blizzard challenge workshop
Kubichek R (1993) Mel-cepstral distance measure for objective speech quality assessment. In: Proceedings of IEEE pacific rim conference on communications computers and signal processing, vol 1, pp 125–128
Łańcucki A (2021) Fastpitch: Parallel text-to-speech with pitch prediction. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 6588–6592
Lee Y, Rabiee A, Lee S-Y (2017) Emotional end-to-end neural speech synthesizer. In: 31st Conference on neural information processing systems (NIPS)
Luo R, Tan X, Wang R, Qin T, Li J, Zhao S, Chen E, Liu T-Y (2021) Lightspeech: Lightweight and fast text to speech with neural architecture search. In: IEEE International conference on acoustics, speech and signal processing (ICASSP), pp 5699–5703
Mama R (2018) https://github.com/Rayhane-mamah/Tacotron-2
Miao C, Shuang L, Liu Z, Minchuan C, Ma J, Wang S, Xiao J (2021) Efficienttts: An efficient and high-quality text-to-speech architecture. In: Meila M, Zhang T (eds) Proceedings of the 38th International conference on machine learning, volume 139 of Proceedings of machine learning research, PMLR, pp 7700–7709
Naderi N, Nasersharif B (2017) Multiresolution convolutional neural network for robust speech recognition. In: 2017 Iranian Conference on Electrical Engineering (ICEE), pp 1459–1464
Ning Y, He S, Wu Z, Xing C, Zhang L-J (2019) A review of deep learning based speech synthesis. Applied Sciences 9(19):1–16
Article Google Scholar
Park K, Mulc T (2019) CSS10: A collection of single speaker speech datasets for 10 languages. In: Interspeech, pp 1566–1570
Ping W, Peng K, Gibiansky AO, Arık, S, Kannan A, Narang S (2018) Deep Voice 3: Scaling Text-to-speech with convolutional sequence learning. In: International conference on learning representations (ICLR)
Ratnaparkhi A (1997) A simple introduction to maximum entropy models for natural language processing. IRCS Technical Reports Series, University of Pennsylvania, Philadelphia, PA, USA
Rosenberg A, Ramabhadran B (2017) Bias and statistical significance in evaluating speech synthesis with mean opinion scores. In: Proceedings of interspeech, pp 3976–3980
Seraji M (2013) PrePer : A Pre-processor for persian. In: fifth International conference on iranian linguistics (ICIL5)
Shamsfard M (2011) Challenges and open problems in persian text processing. Proceedings of LTC 11:65–69
Google Scholar
Shen J, Pang R, Weiss RJ, Schuster M, Jaitly N, Yang Z, Chen Z, Zhang Y, Wang Y, Skerry-Ryan RJ, Saurous RA, Agiomyrgiannakis Y, Wu Y (2018) Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions. In: 2018 IEEE International conference on acoustics, speech and signal processing (ICASSP). Calgary, AB, Canada, pp 4779–4783
Skerry-Ryan RJ, Battenberg E, Xiao Y, Wang Y, Stanton D, Shor J, Weiss RJ, Clark R, Saurous RA (2018) Towards end-to-end prosody transfer for expressive speech synthesis with tacotron. ICML, pp 4700–4709
Soozandehfar SMA, Souzandehfar M (2011) How to improve pronunciation? an in-depth contrastive investigation of sound-spelling systems in english and persian. Journal of Language Teaching and Research 2(5):1086–1098
Article Google Scholar
Sotelo J, Mehri S, Kumar K, Santos JF, Kastner K, Courville A, Bengio Y (2017) Char2wav: End-to-end speech synthesis. In: International conference on learning representations (ICLR), workshop track
Toghyani Khorasgani A, Khorasgani A, Aray N (2017) A survey on several potentially problematic areas of pronunciation for iranian efl learners. Indonesian EFL Journal 1(2):189–198
Article Google Scholar
Tokuda K, Yoshimura T, Masuko T, Kobayashi T, Kitamura T (2000) Speech parameter generation algorithms for hmm-based speech synthesis. In: 2000 IEEE International conference on acoustics, speech, and signal processing. proceedings (Cat. No.00CH37100), vol 3, pp 1315–1318
Valle R, Li J, Prenger R, Catanzaro B (2020) Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens. In: IEEE International conference on acoustics, speech and signal processing (ICASSP), pp 6189–6193
van den Oord A, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior A, Kavukcuoglu K (2016) Wavenet: A generative model for raw audio. arXiv:1609.03499
Wang Y, Skerry-Ryan RJ, Stanton D, Wu Y, Weiss R, Jaitly N, Yang Z, Xiao Y, Chen Z, Bengio S, Le Q, Agiomyrgiannakis Y, Clark R, Saurous R (2017) Tacotron: Towards end-to-end speech synthesis. In: Interspeech, pp 4006–4010
Wang Y, Stanton D, Zhang Y, Skerry-Ryan RJ, Battenberg E, Shor J, Xiao Y, Ren F, Jia Y, Saurous RA (2018) Style Tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In: ICML, pp 5180–5189
Weiss RJ, Skerry-Ryan R, Battenberg E, Mariooryad S, Kingma DP (2021) Wave-tacotron: Spectrogram-free end-to-end text-to-speech synthesis. In: IEEE International conference on acoustics, speech and signal processing (ICASSP), pp 5679–5683
Yang G, Yang S, Liu K, Fang P, Chen W, Xie L (2021) Multi-band melgan: Faster waveform generation for high-quality text-to-speech. In: IEEE Spoken language technology workshop (SLT), pp 492–498
Yasuda Y, Wang X, Takaki S, Yamagishi J (2019) Investigation of enhanced tacotron text-to-speech synthesis systems with self-attention for pitch accent language. In: 2019 IEEE International conference on acoustics, speech and signal processing (ICASSP), pp 6905–6909
Zen H, Clark R, Weiss RJ, Dang V, Jia Y, Wu Y, Zhang Y, Chen Z (2019) LibriTTS: A corpus derived from librispeech for text-to-speech. In: Proceedings of interspeech, pp 1526–1530
Zhang J-X, Ling Z-H., Dai L-R (2018) Forward attention in sequence-to-sequence acoustic modeling for speech synthesis. In: 2018 IEEE International conference on acoustics, speech and signal processing (ICASSP), pp 4789–4793
Zhang M, Wang X, Fang F, Li H, Yamagishi J (2019) Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet. In: INTERSPEECH. Graz, Austria, pp 1298–1302
Zivkovic M, Bacanin N, Venkatachalam K, Nayyar A, Djordjevic A, Strumberger I, Al-Turjman F (2021) Covid-19 cases prediction by using hybrid machine learning and beetle antennae search approach. Sustainable Cities and Society 66:102669
Article Google Scholar

Download references

Acknowledgements

The authors are grateful to AmerAndish company for supporting this project and their helpful feedback. We acknowledge the raw data, the tools needed for collecting and cleaning data, the computing resources, and the users for rating all are provided by AmerAndish company.

Author information

Authors and Affiliations

Department of Computer Engineering, K. N. Toosi University of Technology, Tehran, Iran
Navid Naderi & Babak Nasersharif
Department of Systems and Control, Faculty of Electrical Engineering, K. N. Toosi University of Technology, Tehran, Iran
Amirhossein Nikoofard

Authors

Navid Naderi
View author publications
You can also search for this author in PubMed Google Scholar
Babak Nasersharif
View author publications
You can also search for this author in PubMed Google Scholar
Amirhossein Nikoofard
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Babak Nasersharif.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Naderi, N., Nasersharif, B. & Nikoofard, A. Persian speech synthesis using enhanced tacotron based on multi-resolution convolution layers and a convex optimization method. Multimed Tools Appl 81, 3629–3645 (2022). https://doi.org/10.1007/s11042-021-11719-w

Download citation

Received: 17 April 2021
Revised: 24 August 2021
Accepted: 01 November 2021
Published: 16 November 2021
Issue Date: January 2022
DOI: https://doi.org/10.1007/s11042-021-11719-w

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Persian speech synthesis using enhanced tacotron based on multi-resolution convolution layers and a convex optimization method

Abstract

Access this article

Similar content being viewed by others

Speech Synthesis Method Based on Tacotron + WaveNet

A Review of End-to-End Chinese – Mandarin Speech Synthesis Techniques

Conventional and contemporary approaches used in text to speech synthesis: a review

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Persian speech synthesis using enhanced tacotron based on multi-resolution convolution layers and a convex optimization method

Abstract

Access this article

Similar content being viewed by others

Speech Synthesis Method Based on Tacotron + WaveNet

A Review of End-to-End Chinese – Mandarin Speech Synthesis Techniques

Conventional and contemporary approaches used in text to speech synthesis: a review

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation