Text to Speech Synthesis Using Deep Learning

A chapter in Intelligent Multimedia Signal Processing for Smart Ecosystems

Abstract

An audio deepfake is speech generated to sound like a specific person, produced by various methods from natural-language text utterances. Speech synthesis remains challenging because no general model exists that can generate speech for all languages. In this chapter, we therefore analyze existing Text-To-Speech (TTS) synthesis techniques with respect to their architectures. Moreover, we propose a novel TTS synthesizer based on a deep learning pipeline, comprising an encoder, a decoder, and a synthesizer, combined with a reward block based on Reinforcement Learning (RL). To the best of our knowledge, the proposed synthesizer is the first TTS system that can generate speech in three natural languages: English, Chinese, and French. Furthermore, an evaluation based on the Mean Opinion Score (MOS) shows that the proposed model is an effective technique for TTS synthesis.
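
As a rough illustration of the pipeline the abstract describes, the sketch below wires a text encoder and a mel-spectrogram decoder to a learned reward block whose negated score is folded into the training loss. This is a minimal sketch, not the authors' implementation: PyTorch is assumed, all class names and dimensions are hypothetical, and emitting one mel frame per input character is a deliberate simplification (practical TTS decoders use attention or explicit duration modeling).

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Embeds character IDs and encodes them with a bidirectional GRU."""
    def __init__(self, vocab_size=256, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                  # (B, T_text)
        out, _ = self.rnn(self.embed(char_ids))   # (B, T_text, 2*hidden)
        return out

class MelDecoder(nn.Module):
    """Maps encoder states to mel-spectrogram frames, one frame per step."""
    def __init__(self, enc_dim=512, hidden=256, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(enc_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, enc_states):                # (B, T, enc_dim)
        out, _ = self.rnn(enc_states)
        return self.proj(out)                     # (B, T, n_mels)

class RewardBlock(nn.Module):
    """Scores generated mels; its negated output is added to the loss,
    loosely standing in for the RL reward signal (an assumption here)."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(n_mels, 64), nn.ReLU(),
                                   nn.Linear(64, 1))

    def forward(self, mel):                       # (B, T, n_mels)
        return self.score(mel).mean()             # scalar reward estimate

if __name__ == "__main__":
    enc, dec, reward = TextEncoder(), MelDecoder(), RewardBlock()
    chars = torch.randint(0, 256, (2, 40))        # toy batch of character IDs
    target = torch.randn(2, 40, 80)               # toy reference mel frames
    mel = dec(enc(chars))
    loss = nn.functional.l1_loss(mel, target) - reward(mel)
    loss.backward()
    print(mel.shape, float(loss))
```

For the evaluation the abstract mentions, the Mean Opinion Score is simply the arithmetic mean of listeners' quality ratings of the synthesized speech on a 1-to-5 scale.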

Acknowledgement

The authors thank the University of Engineering and Technology Taxila for providing the research environment.

Author information

Correspondence to Rabbia Mahum.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Mahum, R., Irtaza, A., Javed, A. (2023). Text to Speech Synthesis Using Deep Learning. In: Parah, S.A., Hurrah, N.N., Khan, E. (eds) Intelligent Multimedia Signal Processing for Smart Ecosystems. Springer, Cham. https://doi.org/10.1007/978-3-031-34873-0_12

  • DOI: https://doi.org/10.1007/978-3-031-34873-0_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-34872-3

  • Online ISBN: 978-3-031-34873-0

  • eBook Packages: Computer Science, Computer Science (R0)
