Abstract
Although text-to-speech (TTS) systems have improved significantly, most still have limitations in synthesizing speech with appropriate phrasing. For natural speech synthesis, it is important to produce speech with a phrasing structure that groups words into phrases based on semantic information. In this paper, we propose PauseSpeech, a speech synthesis system with a pre-trained language model and pause-based prosody modeling. First, we introduce a phrasing structure encoder that utilizes a context representation from the pre-trained language model. In the phrasing structure encoder, we extract a speaker-dependent syntactic representation from the context representation and then predict a pause sequence that separates the input text into phrases. Furthermore, we introduce a pause-based word encoder to model word-level prosody based on the pause sequence. Experimental results show that PauseSpeech outperforms previous models in terms of naturalness. Moreover, in objective evaluations, our proposed methods help the model decrease the distance between ground-truth and synthesized speech. Audio samples are available at https://jisang93.github.io/pausespeech-demo/.
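The phrasing structure encoder predicts a pause sequence that separates the input text into phrases. As a toy illustration only (not the authors' implementation, which operates on pre-trained language-model representations), the sketch below shows how a predicted per-word pause sequence induces a phrase grouping over a word sequence; the function name and binary pause labels are assumptions for illustration:

```python
def group_into_phrases(words, pause_after):
    """Group words into phrases, splitting wherever a pause is
    predicted after a word (1 = pause, 0 = no pause).

    This is a hypothetical helper illustrating the idea of
    pause-based phrasing, not code from the PauseSpeech paper.
    """
    phrases, current = [], []
    for word, pause in zip(words, pause_after):
        current.append(word)
        if pause:  # predicted pause ends the current phrase
            phrases.append(current)
            current = []
    if current:  # flush any trailing words into a final phrase
        phrases.append(current)
    return phrases


# Example: pauses predicted after "morning" and "coffee"
words = ["good", "morning", "I", "drink", "coffee", "daily"]
pauses = [0, 1, 0, 0, 1, 0]
print(group_into_phrases(words, pauses))
# → [['good', 'morning'], ['I', 'drink', 'coffee'], ['daily']]
```

In the actual system, word-level prosody is then modeled conditioned on this pause/phrase structure rather than on the raw word sequence alone.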
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00079, Artificial Intelligence Graduate School Program (Korea University) and No. 2021-0-02068, Artificial Intelligence Innovation Hub).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Hwang, JS., Lee, SH., Lee, SW. (2023). PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-Based Prosody Modeling. In: Lu, H., Blumenstein, M., Cho, SB., Liu, CL., Yagi, Y., Kamiya, T. (eds) Pattern Recognition. ACPR 2023. Lecture Notes in Computer Science, vol 14406. Springer, Cham. https://doi.org/10.1007/978-3-031-47634-1_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-47633-4
Online ISBN: 978-3-031-47634-1