Synthetic Speech Data Generation Using Generative Adversarial Networks

  • Conference paper
International Conference on Cloud Computing and Computer Networks (CCCN 2023)

Part of the book series: Signals and Communication Technology (SCT)

Abstract

The capabilities of artificial intelligence (AI) and deep learning are increasing rapidly with growing computing power and specialized microprocessors. One particularly interesting architecture, the generative adversarial network (GAN), is at the forefront of this innovation. GANs are used, for example, for text-to-image translation, image editing and manipulation, image super-resolution, and the creation of three-dimensional objects. For audio, Google's WaveNet, Parallel WaveNet, and the successor frameworks Tacotron and Tacotron 2 are the tools of choice for creating synthetic speech. When there is not enough training data, data can be generated synthetically for further research and training; with this methodology, high-quality samples can be produced for any language. This paper showcases data generation for the Afrikaans language. We used a trained network to create Afrikaans speech clips from text. When the same sentence is generated multiple times, the resulting clips exhibit different emotional states. These clips are then verified, categorized, and used to train another network.
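
As a concrete illustration of the generation step, the sketch below runs text-to-speech inference with NVIDIA's publicly released Tacotron 2 and WaveGlow entry points on PyTorch Hub, in the spirit of the open-source NVIDIA code credited in the acknowledgements. The paper's exact Afrikaans checkpoint and pipeline are not reproduced here, so the pretrained (English) entry points, the sample sentence, and the output file names are illustrative assumptions only.

    # Minimal TTS inference sketch (assumptions: NVIDIA's public Tacotron 2
    # and WaveGlow torch.hub entry points; an Afrikaans system would need a
    # checkpoint fine-tuned on Afrikaans data; a CUDA GPU is available).
    import torch
    from scipy.io.wavfile import write

    # Text-to-mel-spectrogram model (Tacotron 2).
    tacotron2 = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub",
                               "nvidia_tacotron2", model_math="fp32")
    tacotron2 = tacotron2.to("cuda").eval()

    # Neural vocoder (WaveGlow): mel spectrogram -> waveform.
    waveglow = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub",
                              "nvidia_waveglow", model_math="fp32")
    waveglow = waveglow.remove_weightnorm(waveglow).to("cuda").eval()

    # Text-preprocessing helpers shipped with the same repository.
    utils = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub",
                           "nvidia_tts_utils")

    text = "Goeie môre, hoe gaan dit vandag?"  # hypothetical Afrikaans sentence
    sequences, lengths = utils.prepare_input_sequence([text])

    # Tacotron 2 keeps prenet dropout active at inference time, so repeated
    # runs on the same sentence produce slightly different prosody, matching
    # the varied emotional states described above.
    for i in range(3):
        with torch.no_grad():
            mel, _, _ = tacotron2.infer(sequences, lengths)
            audio = waveglow.infer(mel)
        write(f"clip_{i}.wav", 22050, audio[0].cpu().numpy())  # 22.05 kHz WAV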

References

  1. R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in Proc. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6199–6203, 2020.

  2. Q. Tian, X. Wan, and S. Liu, "Generative adversarial network based speaker adaptation for high fidelity WaveNet vocoder," arXiv preprint, 2018.

  3. B. H. Story, "History of speech synthesis," in The Routledge Handbook of Phonetics, pp. 9–33, 2019.

  4. J. Shen and R. Pang, "Tacotron 2: Generating human-like speech from text," 19 Dec. 2017. [Online]. Available: https://ai.googleblog.com/2017/12/tacotron-2-generating-human-like-speech.html

  5. J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. ICASSP 2018, 2018.

  6. P. Salza, E. Foti, L. Nebbia, and M. Oreglia, "MOS and pair comparison combined methods for quality evaluation of text-to-speech systems," Acta Acustica united with Acustica, vol. 82, pp. 650–656, July 1996.

  7. R. Nielek, M. Ciastek, and W. Kopeć, "Emotions make cities live," in Proceedings of the International Conference on Web Intelligence, 2017.

  8. NVIDIA, "NVIDIA NGC Catalog," [Online]. Available: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tlt-jarvis/models/speechsynthesis_english_tacotron2. [Accessed 19 Jan. 2023].

  9. F. Ma, Y. Li, S. Ni, S.-L. Huang, and L. Zhang, "Data augmentation for audio-visual emotion recognition with an efficient multimodal conditional GAN," Applied Sciences, vol. 12, no. 1, p. 527, 2022.

  10. J. Liu, C. Zhang, Z. Xie, and G. Shi, "A novel method for Mandarin speech synthesis by inserting prosodic structure prediction into Tacotron2," International Journal of Machine Learning and Cybernetics, vol. 12, no. 10, pp. 2809–2823, 2021.

  11. Y. Kumar, A. Koul, and C. Singh, "A deep learning approaches in text-to-speech system: A systematic review and recent research perspective," Multimedia Tools and Applications, 12 Sep. 2022.

  12. K. Kuligowska, P. Kisielewicz, and A. Włodarz, "Speech synthesis systems: Disadvantages and limitations," International Journal of Engineering & Technology, vol. 7, no. 2, p. 234, 2018.

  13. A. A. Karim and S. M. Saleh, "Text to speech using Mel-Spectrogram with deep learning algorithms," Periodicals of Engineering and Natural Sciences, vol. 10, no. 3, pp. 380–386, June 2022.

  14. C. van Heerden, E. Barnard, J. Badenhorst, M. Davel, and A. de Waal, "NCHLT Afrikaans speech corpus: Audio recordings smartphone-collected in a non-studio environment," 2014. [Online]. Available: https://repo.sadilar.org/handle/20.500.12185/280. [Accessed 19 Jan. 2023].

  15. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks," in Advances in Neural Information Processing Systems, 2014.

  16. D. Ferris, "Techniques and challenges in speech synthesis," 2017.

Acknowledgements

Thanks to NVIDIA for supplying the open-source GitHub code with which the model was trained and the speech samples were created. This research is partially supported by the South African National Research Foundation (Grant Nos. 132797 and 137951), the South African National Research Foundation incentive grant (No. 114911), and the South African Eskom Tertiary Education Support Programme.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Norval, M., Wang, Z., Sun, Y. (2024). Synthetic Speech Data Generation Using Generative Adversarial Networks. In: Meng, L. (eds) International Conference on Cloud Computing and Computer Networks. CCCN 2023. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-031-47100-1_11

  • DOI: https://doi.org/10.1007/978-3-031-47100-1_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-47099-8

  • Online ISBN: 978-3-031-47100-1

  • eBook Packages: Engineering, Engineering (R0)
