Synthetic Speech Data Generation Using Generative Adversarial Networks

  • Conference paper
International Conference on Cloud Computing and Computer Networks (CCCN 2023)

Part of the book series: Signals and Communication Technology (SCT)

Abstract

The capabilities of artificial intelligence (AI) and deep learning are increasing rapidly with growing computing power and specialized microprocessors. One particularly interesting architecture, the generative adversarial network (GAN), is at the forefront of this innovation. GANs are used, for example, for text-to-image translation, image editing and manipulation, image super-resolution, and the creation of three-dimensional objects. For audio, Google's WaveNet, Parallel WaveNet, and the successor frameworks Tacotron and Tacotron 2 are the tools of choice for creating synthetic speech. When there is not enough training data, data can be generated synthetically for further research and training; with this methodology, high-quality samples can be produced for any language. This paper showcases data generation for the Afrikaans language. We used a trained network to create Afrikaans speech clips from text. When the same sentence is generated multiple times, the resulting clips exhibit different emotional states. These clips are then verified, categorized, and used to train another network.
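
As a concrete illustration of the generation step, the sketch below runs text-to-speech inference with NVIDIA's publicly released Tacotron 2 and WaveGlow entry points on PyTorch Hub, in the spirit of the open-source NVIDIA code credited in the acknowledgements. The paper's exact Afrikaans checkpoint and pipeline are not reproduced here, so the pretrained (English) entry points, the sample sentence, and the output file names are illustrative assumptions only.

    # Minimal TTS inference sketch (assumptions: NVIDIA's public Tacotron 2
    # and WaveGlow torch.hub entry points; an Afrikaans system would need a
    # checkpoint fine-tuned on Afrikaans data; a CUDA GPU is available).
    import torch
    from scipy.io.wavfile import write

    # Text-to-mel-spectrogram model (Tacotron 2).
    tacotron2 = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub",
                               "nvidia_tacotron2", model_math="fp32")
    tacotron2 = tacotron2.to("cuda").eval()

    # Neural vocoder (WaveGlow): mel spectrogram -> waveform.
    waveglow = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub",
                              "nvidia_waveglow", model_math="fp32")
    waveglow = waveglow.remove_weightnorm(waveglow).to("cuda").eval()

    # Text-preprocessing helpers shipped with the same repository.
    utils = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub",
                           "nvidia_tts_utils")

    text = "Goeie môre, hoe gaan dit vandag?"  # hypothetical Afrikaans sentence
    sequences, lengths = utils.prepare_input_sequence([text])

    # Tacotron 2 keeps prenet dropout active at inference time, so repeated
    # runs on the same sentence produce slightly different prosody, matching
    # the varied emotional states described above.
    for i in range(3):
        with torch.no_grad():
            mel, _, _ = tacotron2.infer(sequences, lengths)
            audio = waveglow.infer(mel)
        write(f"clip_{i}.wav", 22050, audio[0].cpu().numpy())  # 22.05 kHz WAV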

References

  1. R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in Proc. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6199–6203, 2020.

  2. Q. Tian, X. Wan, and S. Liu, "Generative adversarial network based speaker adaptation for high fidelity WaveNet vocoder," arXiv preprint, 2018.

  3. B. H. Story, "History of speech synthesis," in The Routledge Handbook of Phonetics, pp. 9–33, 2019.

  4. J. Shen and R. Pang, "Tacotron 2: Generating human-like speech from text," 19 Dec. 2017. [Online]. Available: https://ai.googleblog.com/2017/12/tacotron-2-generating-human-like-speech.html

  5. J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. ICASSP 2018, 2018.

  6. P. Salza, E. Foti, L. Nebbia, and M. Oreglia, "MOS and pair comparison combined methods for quality evaluation of text-to-speech systems," Acta Acustica united with Acustica, vol. 82, pp. 650–656, July 1996.

  7. R. Nielek, M. Ciastek, and W. Kopeć, "Emotions make cities live," in Proceedings of the International Conference on Web Intelligence, 2017.

  8. NVIDIA, "NVIDIA NGC Catalog," [Online]. Available: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tlt-jarvis/models/speechsynthesis_english_tacotron2. [Accessed 19 Jan. 2023].

  9. F. Ma, Y. Li, S. Ni, S.-L. Huang, and L. Zhang, "Data augmentation for audio-visual emotion recognition with an efficient multimodal conditional GAN," Applied Sciences, vol. 12, no. 1, p. 527, 2022.

  10. J. Liu, C. Zhang, Z. Xie, and G. Shi, "A novel method for Mandarin speech synthesis by inserting prosodic structure prediction into Tacotron2," International Journal of Machine Learning and Cybernetics, vol. 12, no. 10, pp. 2809–2823, 2021.

  11. Y. Kumar, A. Koul, and C. Singh, "A deep learning approaches in text-to-speech system: A systematic review and recent research perspective," Multimedia Tools and Applications, 12 Sep. 2022.

  12. K. Kuligowska, P. Kisielewicz, and A. Włodarz, "Speech synthesis systems: Disadvantages and limitations," International Journal of Engineering & Technology, vol. 7, no. 2, p. 234, 2018.

  13. A. A. Karim and S. M. Saleh, "Text to speech using Mel-Spectrogram with deep learning algorithms," Periodicals of Engineering and Natural Sciences, vol. 10, no. 3, pp. 380–386, June 2022.

  14. C. van Heerden, E. Barnard, J. Badenhorst, M. Davel, and A. de Waal, "NCHLT Afrikaans speech corpus: Audio recordings smartphone-collected in a non-studio environment," 2014. [Online]. Available: https://repo.sadilar.org/handle/20.500.12185/280. [Accessed 19 Jan. 2023].

  15. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks," in Advances in Neural Information Processing Systems, 2014.

  16. D. Ferris, "Techniques and challenges in speech synthesis," 2017.

Acknowledgements

Thanks to NVIDIA for supplying the open-source GitHub code with which the model was trained and the speech samples were created. This research is partially supported by the South African National Research Foundation (Grant Nos. 132797 and 137951), the South African National Research Foundation incentive grant (No. 114911), and the South African Eskom Tertiary Education Support Programme.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Norval, M., Wang, Z., Sun, Y. (2024). Synthetic Speech Data Generation Using Generative Adversarial Networks. In: Meng, L. (eds) International Conference on Cloud Computing and Computer Networks. CCCN 2023. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-031-47100-1_11

  • DOI: https://doi.org/10.1007/978-3-031-47100-1_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-47099-8

  • Online ISBN: 978-3-031-47100-1

  • eBook Packages: Engineering, Engineering (R0)
