Abstract
Although text-to-speech (TTS) systems have improved significantly, most still have limitations in synthesizing speech with appropriate phrasing. For natural speech synthesis, it is important to produce speech with a phrasing structure that groups words into phrases based on semantic information. In this paper, we propose PauseSpeech, a speech synthesis system with a pre-trained language model and pause-based prosody modeling. First, we introduce a phrasing structure encoder that utilizes a context representation from the pre-trained language model. In the phrasing structure encoder, we extract a speaker-dependent syntactic representation from the context representation and then predict a pause sequence that separates the input text into phrases. Furthermore, we introduce a pause-based word encoder to model word-level prosody based on the pause sequence. Experimental results show that PauseSpeech outperforms previous models in terms of naturalness. Moreover, in objective evaluations, our proposed methods help the model decrease the distance between ground-truth and synthesized speech. Audio samples are available at https://jisang93.github.io/pausespeech-demo/.
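The phrasing structure encoder predicts a pause sequence that separates the input text into phrases. As a toy illustration only (not the authors' implementation, which operates on pre-trained language-model representations), the sketch below shows how a predicted per-word pause sequence induces a phrase grouping over a word sequence; the function name and binary pause labels are assumptions for illustration:

```python
def group_into_phrases(words, pause_after):
    """Group words into phrases, splitting wherever a pause is
    predicted after a word (1 = pause, 0 = no pause).

    This is a hypothetical helper illustrating the idea of
    pause-based phrasing, not code from the PauseSpeech paper.
    """
    phrases, current = [], []
    for word, pause in zip(words, pause_after):
        current.append(word)
        if pause:  # predicted pause ends the current phrase
            phrases.append(current)
            current = []
    if current:  # flush any trailing words into a final phrase
        phrases.append(current)
    return phrases


# Example: pauses predicted after "morning" and "coffee"
words = ["good", "morning", "I", "drink", "coffee", "daily"]
pauses = [0, 1, 0, 0, 1, 0]
print(group_into_phrases(words, pauses))
# → [['good', 'morning'], ['I', 'drink', 'coffee'], ['daily']]
```

In the actual system, word-level prosody is then modeled conditioned on this pause/phrase structure rather than on the raw word sequence alone.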
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00079, Artificial Intelligence Graduate School Program (Korea University) and No. 2021-0-02068, Artificial Intelligence Innovation Hub).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Hwang, JS., Lee, SH., Lee, SW. (2023). PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-Based Prosody Modeling. In: Lu, H., Blumenstein, M., Cho, SB., Liu, CL., Yagi, Y., Kamiya, T. (eds) Pattern Recognition. ACPR 2023. Lecture Notes in Computer Science, vol 14406. Springer, Cham. https://doi.org/10.1007/978-3-031-47634-1_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-47633-4
Online ISBN: 978-3-031-47634-1