Abstract
With the development of deep learning, neural network (NN) based text-to-speech (TTS), which adopts deep neural networks as the model backbone for speech synthesis, has become the mainstream technology for TTS. Compared with previous TTS systems based on concatenative synthesis and statistical parametric synthesis, NN based speech synthesis shows conspicuous advantages: it requires less human pre-processing and feature engineering, and produces high-quality voices in terms of both intelligibility and naturalness. However, a robust NN based speech synthesis model typically requires a sizable set of high-quality data for training, which is expensive to collect, especially in low-resource scenarios. It is therefore worth investigating how to exploit low-quality material, such as automatic speech recognition (ASR) data, which is much easier to obtain than high-quality TTS material. In this paper, we propose a pre-training framework to improve the performance of low-resource speech synthesis. The idea is to extend the training material of the TTS model through an ASR based data augmentation method. Specifically, we first build a frame-wise phoneme classification network on the ASR dataset and extract semi-supervised <linguistic features, audio> paired data from large-scale speech corpora. We then pre-train the NN based TTS acoustic model using these semi-supervised <linguistic features, audio> pairs. Finally, we fine-tune the model with a small amount of available paired data. Experimental results show that our proposed framework enables the TTS model to generate more intelligible and natural speech with the same amount of paired training data.
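The three-stage pipeline described above (frame-wise phoneme pseudo-labeling, pre-training on the augmented pairs, fine-tuning on the small paired set) can be sketched in miniature as follows. This is a hedged illustration only: the toy "acoustic model" is a linear map from one-hot phoneme features to mel-like frames, standing in for the paper's actual NN acoustic model, and all array shapes, names (`extract_pseudo_labels`, `one_hot`), and hyperparameters are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N_PHONES, N_MEL, N_FRAMES = 8, 4, 50  # illustrative sizes, not from the paper

def extract_pseudo_labels(logits):
    """Stage 1: a frame-wise phoneme classifier's output reduced to one
    phoneme label per audio frame (the 'linguistic features' side of the
    semi-supervised <linguistic features, audio> pair)."""
    return logits.argmax(axis=-1)

def one_hot(labels, n):
    """Encode integer phoneme labels as one-hot linguistic feature vectors."""
    return np.eye(n)[labels]

# --- Stage 1: build pseudo-pairs from an (unlabeled-for-TTS) ASR corpus ---
asr_audio = rng.normal(size=(N_FRAMES, N_MEL))       # stand-in acoustic frames
asr_logits = rng.normal(size=(N_FRAMES, N_PHONES))   # stand-in classifier logits
pseudo_phones = extract_pseudo_labels(asr_logits)
X_pre = one_hot(pseudo_phones, N_PHONES)             # (N_FRAMES, N_PHONES)

# --- Stage 2: pre-train the toy acoustic model on the pseudo-pairs ---
# Least squares plays the role of pre-training the phones -> mel mapping.
W, *_ = np.linalg.lstsq(X_pre, asr_audio, rcond=None)

# --- Stage 3: fine-tune on a small amount of genuine paired TTS data ---
tts_phones = rng.integers(0, N_PHONES, size=10)
X_ft = one_hot(tts_phones, N_PHONES)
y_ft = rng.normal(size=(10, N_MEL))
lr = 0.1
for _ in range(20):
    grad = X_ft.T @ (X_ft @ W - y_ft) / len(X_ft)    # gradient of MSE loss
    W -= lr * grad

pred = X_ft @ W                                      # fine-tuned predictions
print(W.shape, pred.shape)
```

The point of the sketch is the data flow, not the model: the expensive paired data only enters at the last stage, while the bulk of training signal comes from pseudo-pairs mined from ASR material.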
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Liu, Y., Xue, S., Tang, J. (2023). Pre-training Techniques for Improving Text-to-Speech Synthesis by Automatic Speech Recognition Based Data Enhancement. In: Zhenhua, L., Jianqing, G., Kai, Y., Jia, J. (eds) Man-Machine Speech Communication. NCMMSC 2022. Communications in Computer and Information Science, vol 1765. Springer, Singapore. https://doi.org/10.1007/978-981-99-2401-1_15