
Multi-speaker Multi-style Speech Synthesis with Timbre and Style Disentanglement

  • Conference paper
Man-Machine Speech Communication (NCMMSC 2022)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 1765))


Abstract

Disentanglement of a speaker’s timbre and style is very important for style transfer in multi-speaker multi-style text-to-speech (TTS) scenarios. With timbre and style disentangled, a TTS system can synthesize expressive speech for a given speaker in any style seen in the training corpus. However, current research on timbre and style disentanglement still has shortcomings: existing methods either require single-speaker multi-style recordings, which are difficult and expensive to collect, or rely on complex networks and complicated training procedures, which are hard to reproduce and make the style transfer behavior difficult to control. To improve the disentanglement of timbre and style, and to remove the reliance on a single-speaker multi-style corpus, this paper proposes a simple but effective timbre and style disentanglement method. FastSpeech2 is employed as the backbone network, with explicit duration, pitch, and energy trajectories representing the style. Each speaker’s data is treated as a separate, isolated style, and a speaker embedding and a style embedding are added to the FastSpeech2 network to learn disentangled representations. Utterance-level pitch and energy normalization is used to improve the decoupling effect. Experimental results demonstrate that the proposed model can synthesize speech in any style seen during training with high style similarity while maintaining very high speaker similarity.
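The conditioning scheme sketched in the abstract (separate speaker and style embeddings added to a FastSpeech2-style encoder output, plus utterance-level pitch and energy normalization) can be illustrated with a short example. This is a minimal sketch, not the authors' implementation: the class and function names (SpeakerStyleConditioner, utterance_normalize) and the dimensions are hypothetical placeholders.

```python
# Minimal sketch (assumed, not the authors' code) of conditioning a
# FastSpeech2-style encoder output on separate speaker and style embeddings,
# and of utterance-level pitch/energy normalization.
import torch
import torch.nn as nn


def utterance_normalize(contour: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Zero-mean, unit-variance normalization of a pitch or energy contour
    within one utterance, so the contour carries style (shape) rather than
    speaker-dependent absolute level."""
    return (contour - contour.mean()) / (contour.std() + eps)


class SpeakerStyleConditioner(nn.Module):
    """Adds a speaker embedding (timbre) and a style embedding to the
    phoneme-level encoder output of a FastSpeech2-like model."""

    def __init__(self, n_speakers: int, n_styles: int, d_model: int = 256):
        super().__init__()
        self.speaker_emb = nn.Embedding(n_speakers, d_model)
        self.style_emb = nn.Embedding(n_styles, d_model)

    def forward(self, encoder_out: torch.Tensor,
                speaker_id: torch.Tensor,
                style_id: torch.Tensor) -> torch.Tensor:
        # encoder_out: (batch, time, d_model)
        spk = self.speaker_emb(speaker_id).unsqueeze(1)  # (batch, 1, d_model)
        sty = self.style_emb(style_id).unsqueeze(1)      # (batch, 1, d_model)
        return encoder_out + spk + sty                   # broadcast over time


if __name__ == "__main__":
    cond = SpeakerStyleConditioner(n_speakers=10, n_styles=10)
    enc = torch.randn(2, 50, 256)
    out = cond(enc, torch.tensor([0, 3]), torch.tensor([1, 1]))
    print(out.shape)                                 # torch.Size([2, 50, 256])
    print(utterance_normalize(torch.rand(120)).mean())  # close to 0
```

At inference time, picking the speaker embedding of the target speaker while selecting the style embedding learned from another speaker's data is what enables cross-speaker style transfer in this setup.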

This work was supported by the National Key R&D Program of China under Grant No. 2020AAA0108600.


Notes

  1. https://www.data-baker.com/open_source.html.

  2. http://challenge.ai.iqiyi.com/detail?raceId=5fb2688224954e0b48431fe0.

  3. https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder.

  4. https://weixsong.github.io/demos/MultiSpeakerMultiStyle/.


Author information

Correspondence to Wei Song.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Song, W., Yue, Y., Zhang, Yj., Zhang, Z., Wu, Y., He, X. (2023). Multi-speaker Multi-style Speech Synthesis with Timbre and Style Disentanglement. In: Zhenhua, L., Jianqing, G., Kai, Y., Jia, J. (eds) Man-Machine Speech Communication. NCMMSC 2022. Communications in Computer and Information Science, vol 1765. Springer, Singapore. https://doi.org/10.1007/978-981-99-2401-1_12


  • DOI: https://doi.org/10.1007/978-981-99-2401-1_12


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-2400-4

  • Online ISBN: 978-981-99-2401-1

  • eBook Packages: Computer Science, Computer Science (R0)
