Abstract
The capabilities of artificial intelligence (AI) and deep learning are increasing rapidly with growing computing power and specialized microprocessors. One architecture at the forefront of this innovation is the generative adversarial network (GAN). GANs are used for tasks such as text-to-image translation, image editing and manipulation, image super-resolution, and the creation of three-dimensional objects. For audio, Google's WaveNet and Parallel WaveNet, together with Tacotron 1 and 2, are frameworks of choice for creating synthetic speech. Where there is not enough training data, data can be generated synthetically for further research and training; with this methodology, quality speech samples can be produced for any language. This paper showcases data generation for the Afrikaans language: a trained network creates Afrikaans speech clips from text. When the same sentence is generated multiple times, the clips exhibit different emotional states. These clips are then verified, categorized, and used to train another network.
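Although the abstract names GANs only at a high level, the objective they optimize can be illustrated numerically. The sketch below is a toy illustration, not the paper's actual models: it assumes a hypothetical 1-D linear generator and a logistic discriminator, and evaluates the standard GAN value function V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], which the discriminator ascends and the generator descends.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" data: 1-D feature values standing in for audio frames.
real = rng.normal(loc=4.0, scale=1.25, size=512)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generator(z, a, b):
    # Hypothetical linear generator mapping noise z to fake samples.
    return a * z + b

def discriminator(x, w, c):
    # Hypothetical logistic discriminator: probability that x is real.
    return sigmoid(w * x + c)

def gan_value(a, b, w, c, n=512):
    """V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    z = rng.normal(size=n)
    fake = generator(z, a, b)
    eps = 1e-9  # numerical guard so log never sees exactly 0
    return (np.mean(np.log(discriminator(real, w, c) + eps))
            + np.mean(np.log(1.0 - discriminator(fake, w, c) + eps)))

v = gan_value(a=1.0, b=0.0, w=0.5, c=-1.0)
print(float(v))
```

Because the discriminator outputs strict probabilities in (0, 1), both expectations are negative, so V is always below zero; training alternates between maximizing V over the discriminator's parameters and minimizing it over the generator's.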
Acknowledgements
Thanks to Nvidia for providing the open-source code on GitHub that allowed the model to be trained and the speech samples to be created. This research is partially supported by the South African National Research Foundation (Grant Nos. 132797 and 137951), the South African National Research Foundation incentive grant (No. 114911), and the South African Eskom Tertiary Education Support Programme.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Norval, M., Wang, Z., Sun, Y. (2024). Synthetic Speech Data Generation Using Generative Adversarial Networks. In: Meng, L. (eds) International Conference on Cloud Computing and Computer Networks. CCCN 2023. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-031-47100-1_11
DOI: https://doi.org/10.1007/978-3-031-47100-1_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-47099-8
Online ISBN: 978-3-031-47100-1
eBook Packages: Engineering, Engineering (R0)