Abstract
Phrase boundary detection is an important part of text-to-speech systems because it makes synthesized speech sound more natural. However, phrasing is ambiguous: it varies from speaker to speaker and from one speaking style to another. This paper therefore focuses on speaker-dependent phrasing for speech synthesis, using a neural network model with a speaker code. We also describe the results of a listening test focused on incorrectly detected breaks, since it turned out that some of these apparent mistakes are in fact acceptable rather than wrong.
Acknowledgements
This research was supported by the Czech Science Foundation (GACR), project No. GA19-19324S, and by the grant of the University of West Bohemia, project No. SGS-2019-027.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Jůzová, M., Tihelka, D. (2020). Speaker-Dependent BiLSTM-Based Phrasing. In: Sojka, P., Kopeček, I., Pala, K., Horák, A. (eds) Text, Speech, and Dialogue. TSD 2020. Lecture Notes in Computer Science, vol 12284. Springer, Cham. https://doi.org/10.1007/978-3-030-58323-1_37
DOI: https://doi.org/10.1007/978-3-030-58323-1_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58322-4
Online ISBN: 978-3-030-58323-1
eBook Packages: Computer Science, Computer Science (R0)