
PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-Based Prosody Modeling

  • Conference paper
Pattern Recognition (ACPR 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14406)


Abstract

Although text-to-speech (TTS) systems have improved significantly, most still have limitations in synthesizing speech with appropriate phrasing. For natural speech synthesis, it is important to synthesize speech with a phrasing structure that groups words into phrases based on semantic information. In this paper, we propose PauseSpeech, a speech synthesis system with a pre-trained language model and pause-based prosody modeling. First, we introduce a phrasing structure encoder that utilizes the context representation from the pre-trained language model. In the phrasing structure encoder, we extract a speaker-dependent syntactic representation from the context representation and then predict a pause sequence that separates the input text into phrases. Furthermore, we introduce a pause-based word encoder to model word-level prosody based on the pause sequence. Experimental results show that PauseSpeech outperforms previous models in terms of naturalness. Moreover, objective evaluations show that our proposed methods reduce the distance between ground-truth and synthesized speech. Audio samples are available at https://jisang93.github.io/pausespeech-demo/.
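
As a rough illustration of the pipeline described above, the sketch below extracts word-level context representations from a pre-trained BERT (see the notes below for the checkpoint) and feeds them to a small classifier that predicts a pause label after each word, which in turn splits the text into phrases. The PausePredictor head and the three-way pause labels are hypothetical stand-ins for illustration, not the authors' architecture.

    # Minimal sketch (not the paper's implementation): predict pauses between
    # words from BERT context representations, then group words into phrases.
    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")

    class PausePredictor(nn.Module):
        """Hypothetical head: label each word boundary as no-pause/short/long."""
        def __init__(self, hidden: int = 768, n_classes: int = 3):
            super().__init__()
            self.proj = nn.Linear(hidden, n_classes)

        def forward(self, word_reprs):                  # (n_words, hidden)
            return self.proj(word_reprs).argmax(-1)     # pause label per word

    text = "the quick brown fox jumps over the lazy dog"
    words = text.split()
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]       # (n_subwords, 768)

    # Average subword vectors into one context representation per word.
    word_ids = enc.word_ids()
    word_reprs = torch.stack([
        hidden[[i for i, w in enumerate(word_ids) if w == k]].mean(0)
        for k in range(len(words))
    ])

    pauses = PausePredictor()(word_reprs)               # untrained; shapes only
    # Split the text into phrases wherever a non-zero pause label is predicted.
    phrases, current = [], []
    for word, label in zip(words, pauses.tolist()):
        current.append(word)
        if label != 0:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    print(phrases)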

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00079, Artificial Intelligence Graduate School Program (Korea University) and No. 2021-0-02068, Artificial Intelligence Innovation Hub).

Notes

  1. https://datashare.ed.ac.uk/handle/10283/3443.

  2. https://github.com/Kyubyong/g2p.

  3. https://huggingface.co/bert-base-uncased.

  4. https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self.
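
The notes above point to the resources the paper builds on: the VCTK corpus (note 1), the g2p grapheme-to-phoneme converter (note 2), and two pre-trained checkpoints on the Hugging Face Hub (notes 3 and 4). A minimal sketch of loading the software resources, assuming the transformers and g2p_en packages are installed:

    # Minimal sketch: load the pre-trained resources referenced in the notes.
    from g2p_en import G2p                                    # note 2
    from transformers import (AutoModel, AutoTokenizer,
                              Wav2Vec2ForCTC, Wav2Vec2Processor)

    # BERT used for context representations (note 3).
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")

    # wav2vec 2.0 checkpoint, e.g. for deriving speech representations (note 4).
    processor = Wav2Vec2Processor.from_pretrained(
        "facebook/wav2vec2-large-960h-lv60-self")
    wav2vec2 = Wav2Vec2ForCTC.from_pretrained(
        "facebook/wav2vec2-large-960h-lv60-self")

    g2p = G2p()
    print(g2p("PauseSpeech"))  # phoneme sequence for the input text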

Author information

Corresponding author

Correspondence to Seong-Whan Lee.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Hwang, JS., Lee, SH., Lee, SW. (2023). PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-Based Prosody Modeling. In: Lu, H., Blumenstein, M., Cho, SB., Liu, CL., Yagi, Y., Kamiya, T. (eds) Pattern Recognition. ACPR 2023. Lecture Notes in Computer Science, vol 14406. Springer, Cham. https://doi.org/10.1007/978-3-031-47634-1_31

  • DOI: https://doi.org/10.1007/978-3-031-47634-1_31

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-47633-4

  • Online ISBN: 978-3-031-47634-1

  • eBook Packages: Computer Science (R0)
