Abstract
An audio deepfake is speech generated from natural-language text so that it resembles the voice of a specific person. Speech synthesis remains challenging because no general model can generate speech for all existing languages. Therefore, in this chapter, we analyze the existing techniques for Text-To-Speech (TTS) synthesis with respect to their architecture. Moreover, we propose a novel TTS synthesizer based on a deep learning model consisting of an encoder, a decoder, and a synthesizer, combined with a reward block based on Reinforcement Learning (RL). To the best of our knowledge, our proposed synthesizer is the first TTS synthesizer that can generate speech in three natural languages: English, Chinese, and French. Furthermore, we performed an experiment based on the Mean Opinion Score (MOS), which showed that our proposed model is an efficient technique for TTS synthesis.
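As an illustration of the MOS evaluation mentioned above, the following is a minimal sketch of how a MOS could be computed from listener ratings. It assumes the common five-point absolute category rating scale (1 = bad, 5 = excellent) and a normal-approximation 95% confidence interval; the function name `mean_opinion_score` and all ratings are hypothetical and are not taken from the chapter's experiment.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Normal approximation: 1.96 times the standard error of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical listener ratings per language (illustrative only,
# NOT data from the chapter).
ratings_by_language = {
    "English": [4, 5, 4, 4, 3, 5],
    "Chinese": [4, 4, 3, 4, 4, 5],
    "French": [3, 4, 4, 5, 4, 4],
}

for language, ratings in ratings_by_language.items():
    mos, ci = mean_opinion_score(ratings)
    print(f"{language}: MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean is a design choice of this sketch: with few listeners, two systems whose MOS values differ slightly may not be distinguishable.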
Acknowledgement
The researchers thank the University of Engineering and Technology, Taxila, for providing the research environment.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Mahum, R., Irtaza, A., Javed, A. (2023). Text to Speech Synthesis Using Deep Learning. In: Parah, S.A., Hurrah, N.N., Khan, E. (eds) Intelligent Multimedia Signal Processing for Smart Ecosystems. Springer, Cham. https://doi.org/10.1007/978-3-031-34873-0_12
DOI: https://doi.org/10.1007/978-3-031-34873-0_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-34872-3
Online ISBN: 978-3-031-34873-0
eBook Packages: Computer Science (R0)