Prosodic Break Prediction with RNNs

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10077)

Abstract

Predicting prosodic breaks from text is a fundamental task for achieving naturalness in text-to-speech applications. In this work we build a data-driven break predictor from linguistic features such as Part of Speech (POS) tags and the forward and backward word distance to punctuation marks. We use a basic Recurrent Neural Network (RNN) model to exploit the sequential dependency between decisions. In the experiments we evaluate a logistic regression model and the recurrent one. The results show that logistic regression outperforms the baseline (CART) by \(9.5\,\%\) in F-score, and that adding the recurrent layer improves further on the baseline, by \(11\,\%\).
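As an illustration of the forward and backward word distance to punctuation marks mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. The punctuation inventory, the function name `punctuation_distances`, the choice to count from zero, and the treatment of sentence boundaries as punctuation are all assumptions made for this example.

```python
from typing import List, Tuple

# Assumed punctuation inventory; the paper's actual set is not given here.
PUNCTUATION = {".", ",", ";", ":", "?", "!"}


def punctuation_distances(tokens: List[str]) -> List[Tuple[str, int, int]]:
    """Return (word, backward, forward) for every non-punctuation token.

    backward: number of words between this word and the previous
              punctuation mark (or the start of the sequence).
    forward:  number of words between this word and the next
              punctuation mark (or the end of the sequence).
    """
    words: List[str] = []
    backward: List[int] = []
    count = 0
    # Left-to-right pass: reset the counter at each punctuation mark.
    for tok in tokens:
        if tok in PUNCTUATION:
            count = 0
        else:
            words.append(tok)
            backward.append(count)
            count += 1

    forward: List[int] = []
    count = 0
    # Right-to-left pass: same idea, mirrored.
    for tok in reversed(tokens):
        if tok in PUNCTUATION:
            count = 0
        else:
            forward.append(count)
            count += 1
    forward.reverse()

    return list(zip(words, backward, forward))


if __name__ == "__main__":
    sentence = ["After", "the", "rain", ",", "we", "left", "."]
    for word, back, fwd in punctuation_distances(sentence):
        print(f"{word:>6}  backward={back}  forward={fwd}")
```

In a setup like the one the abstract describes, each word's two distances would be combined with its POS tag to form the input vector for the classifier; how the features are encoded and scaled is not specified on this page.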

Acknowledgments

This work was supported by the Spanish Ministerio de Economía y Competitividad and European Regional Development Fund, contract TEC2015-69266-P (MINECO/FEDER, UE).

Author information

Corresponding author

Correspondence to Santiago Pascual.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Pascual, S., Bonafonte, A. (2016). Prosodic Break Prediction with RNNs. In: Abad, A., et al. Advances in Speech and Language Technologies for Iberian Languages. IberSPEECH 2016. Lecture Notes in Computer Science, vol 10077. Springer, Cham. https://doi.org/10.1007/978-3-319-49169-1_7

  • DOI: https://doi.org/10.1007/978-3-319-49169-1_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-49168-4

  • Online ISBN: 978-3-319-49169-1

  • eBook Packages: Computer Science, Computer Science (R0)
