Speaker-Dependent BiLSTM-Based Phrasing

Jůzová, Markéta; Tihelka, Daniel

doi:10.1007/978-3-030-58323-1_37

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 12284))

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

1352 Accesses
2 Citations

Abstract

Phrase boundary detection is an important part of text-to-speech systems since it ensures more natural speech synthesis outputs. However, the problem of phrasing is ambiguous, especially per speaker and per style. This is the reason why this paper focuses on speaker-dependent phrasing for the purposes of speech synthesis, using a neural network model with a speaker code. We also describe results of a listening test focused on incorrectly detected breaks because it turned out that some mistakes could be actually fine, not wrong.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 79.99; Price excludes VAT (USA)

Softcover Book: USD 99.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media Inc., Newton (2009)
MATH Google Scholar
Fernandez, R., Rendel, A., Ramabhadran, B., Hoory, R.: Prosody contour prediction with long short-term memory, bi-directional, deep recurrent neural networks. In: INTERSPEECH 2014, pp. 2268–2272. ISCA (2014)
Google Scholar
Hanzlíček, Z., Vít, J., Tihelka, D.: LSTM-based speech segmentation for TTS synthesis. In: Ekštein, K. (ed.) TSD 2019. LNCS (LNAI), vol. 11697, pp. 361–372. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-27947-9_31
Chapter Google Scholar
Hirschberg, J., Prieto, P.: Training intonational phrasing rules automatically for English and Spanish text-to-speech. Speech Commun. 18(3), 281–290 (1996)
Article Google Scholar
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural comput. 9(8), 1735–1780 (1997)
Article Google Scholar
Hojo, N., Ijima, Y., Mizuno, H.: DNN-based speech synthesis using speaker codes. IEICE Trans. Inf. Syst. 101(2), 462–472 (2018)
Article Google Scholar
Jůzová, M.: Prosodic phrase boundary classification based on Czech speech corpora. In: Ekštein, K., Matoušek, V. (eds.) TSD 2017. LNCS (LNAI), vol. 10415, pp. 165–173. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-64206-2_19
Chapter Google Scholar
Jůzová, M.: On the comparison of different phrase boundary detection approaches trained on Czech TTS speech corpora. In: Karpov, A., Jokisch, O., Potapova, R. (eds.) SPECOM 2018. LNCS (LNAI), vol. 11096, pp. 255–263. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99579-3_27
Chapter Google Scholar
Louw, J.A., Moodley, A.: Speaker specific phrase break modeling with conditional random fields for text-to-speech. In: 2016 Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference, pp. 1–6 (2016)
Google Scholar
Matoušek, J., Romportl, J.: Automatic pitch-synchronous phonetic segmentation. In: INTERSPEECH 2008, pp. 1626–1629. ISCA, Brisbane, Australia (2008)
Google Scholar
Matoušek, J., Tihelka, D., Psutka, J.: Experiments with automatic segmentation for Czech speech synthesis. In: Matoušek, V., Mautner, P. (eds.) TSD 2003. LNCS (LNAI), vol. 2807, pp. 287–294. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-39398-6_41
Chapter Google Scholar
Mishra, T., Kim, Y.J., Bangalore, S.: Intonational phrase break prediction for text-to-speech synthesis using dependency relations. In: ICASSP 2015, pp. 4919–4923 (2015)
Google Scholar
Prahallad, K., Raghavendra, E.V., Black, A.W.: Learning speaker-specific phrase breaks for text-to-speech systems. In: SSW (2010)
Google Scholar
Read, I., Cox, S.: Stochastic and syntactic techniques for predicting phrase breaks. Comput. Speech Lang. 21, 3233–3236 (2005)
Google Scholar
Taylor, P.: Text-to-Speech Synthesis, 1st edn. Cambridge University Press, New York (2009)
Book Google Scholar
Taylor, P., Black, A.: Assigning phrase breaks from part-of-speech sequences. Comput. Speech Lang. 12, 99–117 (1998)
Article Google Scholar
Tihelka, D., Hanzlíček, Z., Jůzová, M., Vít, J., Matoušek, J., Grůber, M.: Current state of text-to-speech system ARTIC: a decade of research on the field of speech technologies. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2018. LNCS (LNAI), vol. 11107, pp. 369–378. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00794-2_40
Chapter Google Scholar
Vít, J., Hanzlíček, Z., Matoušek, J.: Czech speech synthesis with generative neural vocoder. In: Ekštein, K. (ed.) TSD 2019. LNCS (LNAI), vol. 11697, pp. 307–315. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-27947-9_26
Chapter Google Scholar

Download references

Acknowledgements

This research was supported by the Czech Science Foundation (GACR), project No. GA19-19324S, and by the grant of the University of West Bohemia, project No. SGS-2019-027.

Author information

Authors and Affiliations

Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
Markéta Jůzová
New Technology for Information Society, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
Markéta Jůzová & Daniel Tihelka

Authors

Markéta Jůzová
View author publications
You can also search for this author in PubMed Google Scholar
Daniel Tihelka
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Markéta Jůzová .

Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Ivan Kopeček
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Karel Pala
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Aleš Horák

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Jůzová, M., Tihelka, D. (2020). Speaker-Dependent BiLSTM-Based Phrasing. In: Sojka, P., Kopeček, I., Pala, K., Horák, A. (eds) Text, Speech, and Dialogue. TSD 2020. Lecture Notes in Computer Science(), vol 12284. Springer, Cham. https://doi.org/10.1007/978-3-030-58323-1_37

Download citation

DOI: https://doi.org/10.1007/978-3-030-58323-1_37
Published: 01 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58322-4
Online ISBN: 978-3-030-58323-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics