On the Comparison of Different Phrase Boundary Detection Approaches Trained on Czech TTS Speech Corpora

  • Markéta Jůzová
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11096)


Phrasing is an important issue in speech synthesis, since appropriate phrasing makes synthesized sentences more natural and intelligible. There are many approaches to phrase boundary detection, including simple classification-based, HMM-based, and CRF-based methods, and various types of neural networks are used for this task as well. This paper compares representative methods for phrasing Czech sentences, using large-scale TTS speech corpora as training data and considering only speaker-dependent phrasing.


Phrase boundary, Speech corpus, Classification, Conditional random fields, Neural networks



The work has been supported by a grant of the University of West Bohemia, project No. SGS-2016-039, and by the Ministry of Education, Youth and Sports of the Czech Republic, project No. LO1506. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Cybernetics and New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
