Abstract
Disentangling a speaker’s timbre from their speaking style is essential for style transfer in multi-speaker, multi-style text-to-speech (TTS) scenarios. With timbre and style disentangled, a TTS system can synthesize expressive speech for a given speaker in any style seen in the training corpus. However, current research on timbre and style disentanglement still has shortcomings: existing methods either require single-speaker multi-style recordings, which are difficult and expensive to collect, or rely on complex networks and complicated training procedures, which are hard to reproduce and make the style transfer behavior difficult to control. To improve the disentanglement of timbre and style, and to remove the reliance on single-speaker multi-style corpora, this paper proposes a simple but effective disentanglement method. FastSpeech2 is employed as the backbone network, with explicit duration, pitch, and energy trajectories representing the style. Each speaker’s data is treated as a separate, isolated style, and a speaker embedding and a style embedding are added to the FastSpeech2 network to learn disentangled representations. Utterance-level pitch and energy normalization is used to further improve the decoupling. Experimental results demonstrate that the proposed model can synthesize speech in any style seen during training with high style similarity while maintaining very high speaker similarity.
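The abstract does not spell out how the utterance-level normalization is computed, but the usual choice for decoupling speaker range from style contour is a per-utterance z-score over voiced frames. The sketch below illustrates that idea only; the function name and the voiced-frame convention (unvoiced frames marked as zero) are assumptions, not the paper's actual implementation.

```python
import numpy as np

def normalize_utterance(values: np.ndarray) -> np.ndarray:
    """Z-score normalize a pitch or energy trajectory within one utterance.

    Removing the per-utterance mean and scale strips the speaker-dependent
    range (correlated with timbre) while preserving the relative contour
    (correlated with style). Zero-valued frames are treated as unvoiced
    and left untouched.
    """
    voiced = values[values > 0]
    mean, std = voiced.mean(), voiced.std()
    out = values.copy()
    out[values > 0] = (voiced - mean) / max(std, 1e-8)
    return out

# Two speakers with different pitch ranges but the same rising contour
# map to the same normalized trajectory:
low_voice = np.array([0.0, 100.0, 110.0, 120.0, 130.0, 0.0])   # Hz
high_voice = np.array([0.0, 200.0, 220.0, 240.0, 260.0, 0.0])  # Hz
print(np.allclose(normalize_utterance(low_voice),
                  normalize_utterance(high_voice)))  # → True
```

After this normalization, the pitch and energy inputs carry the style's shape but not the speaker's absolute range, which is what lets the speaker embedding, rather than the prosody features, account for timbre.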
This work was supported by the National Key R&D Program of China under Grant No. 2020AAA0108600.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Song, W., Yue, Y., Zhang, Y.-J., Zhang, Z., Wu, Y., He, X. (2023). Multi-speaker Multi-style Speech Synthesis with Timbre and Style Disentanglement. In: Zhenhua, L., Jianqing, G., Kai, Y., Jia, J. (eds) Man-Machine Speech Communication. NCMMSC 2022. Communications in Computer and Information Science, vol 1765. Springer, Singapore. https://doi.org/10.1007/978-981-99-2401-1_12
DOI: https://doi.org/10.1007/978-981-99-2401-1_12
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-2400-4
Online ISBN: 978-981-99-2401-1
eBook Packages: Computer Science, Computer Science (R0)