Text to Speech Synthesis Using Deep Learning

A chapter in Intelligent Multimedia Signal Processing for Smart Ecosystems

Abstract

An audio deepfake is speech generated to sound like a specific person, produced by various methods from natural-language text utterances. Speech synthesis remains challenging because no general model exists that can generate speech for all languages. In this chapter, we therefore analyze existing Text-To-Speech (TTS) synthesis techniques with respect to their architectures. Moreover, we propose a novel TTS synthesizer based on a deep learning pipeline, comprising an encoder, a decoder, and a synthesizer, combined with a reward block based on Reinforcement Learning (RL). To the best of our knowledge, the proposed synthesizer is the first TTS system that can generate speech in three natural languages: English, Chinese, and French. Furthermore, an evaluation based on the Mean Opinion Score (MOS) shows that the proposed model is an effective technique for TTS synthesis.
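
As a rough illustration of the pipeline the abstract describes, the sketch below wires a text encoder and a mel-spectrogram decoder to a learned reward block whose negated score is folded into the training loss. This is a minimal sketch, not the authors' implementation: PyTorch is assumed, all class names and dimensions are hypothetical, and emitting one mel frame per input character is a deliberate simplification (practical TTS decoders use attention or explicit duration modeling).

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Embeds character IDs and encodes them with a bidirectional GRU."""
    def __init__(self, vocab_size=256, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                  # (B, T_text)
        out, _ = self.rnn(self.embed(char_ids))   # (B, T_text, 2*hidden)
        return out

class MelDecoder(nn.Module):
    """Maps encoder states to mel-spectrogram frames, one frame per step."""
    def __init__(self, enc_dim=512, hidden=256, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(enc_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, enc_states):                # (B, T, enc_dim)
        out, _ = self.rnn(enc_states)
        return self.proj(out)                     # (B, T, n_mels)

class RewardBlock(nn.Module):
    """Scores generated mels; its negated output is added to the loss,
    loosely standing in for the RL reward signal (an assumption here)."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(n_mels, 64), nn.ReLU(),
                                   nn.Linear(64, 1))

    def forward(self, mel):                       # (B, T, n_mels)
        return self.score(mel).mean()             # scalar reward estimate

if __name__ == "__main__":
    enc, dec, reward = TextEncoder(), MelDecoder(), RewardBlock()
    chars = torch.randint(0, 256, (2, 40))        # toy batch of character IDs
    target = torch.randn(2, 40, 80)               # toy reference mel frames
    mel = dec(enc(chars))
    loss = nn.functional.l1_loss(mel, target) - reward(mel)
    loss.backward()
    print(mel.shape, float(loss))
```

For the evaluation the abstract mentions, the Mean Opinion Score is simply the arithmetic mean of listeners' quality ratings of the synthesized speech on a 1-to-5 scale.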

Acknowledgement

The authors thank the University of Engineering and Technology Taxila for providing the research environment.

Author information

Correspondence to Rabbia Mahum.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Mahum, R., Irtaza, A., Javed, A. (2023). Text to Speech Synthesis Using Deep Learning. In: Parah, S.A., Hurrah, N.N., Khan, E. (eds) Intelligent Multimedia Signal Processing for Smart Ecosystems. Springer, Cham. https://doi.org/10.1007/978-3-031-34873-0_12

  • DOI: https://doi.org/10.1007/978-3-031-34873-0_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-34872-3

  • Online ISBN: 978-3-031-34873-0

  • eBook Packages: Computer Science, Computer Science (R0)
