Prosodic Break Prediction with RNNs

Pascual, Santiago; Bonafonte, Antonio

doi:10.1007/978-3-319-49169-1_7

Conference paper
First Online: 04 November 2016

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10077)

Abstract

Predicting prosodic breaks from text is a fundamental task for achieving naturalness in text-to-speech applications. In this work we build a data-driven break predictor from linguistic features such as Part of Speech (POS) tags and the forward and backward word distances to punctuation marks, using a basic Recurrent Neural Network (RNN) model to exploit the sequential dependency between decisions. In the experiments we evaluate the performance of a logistic regression model and of the recurrent one. The results show that logistic regression outperforms the baseline (CART) by \(9.5\,\%\) in F-score, and that adding the recurrent layer to the model improves the predictions further, outperforming the baseline by \(11\,\%\).
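
As an illustration of the forward and backward word distances to punctuation marks mentioned above, the following is a minimal sketch, not the authors' implementation. The function name `punctuation_distances`, the punctuation set, and the convention that a word adjacent to a punctuation mark (or to the start or end of the sequence) has distance 0 are all assumptions made for this example; the abstract does not specify them.

```python
from typing import List, Tuple

# Assumed punctuation inventory; the paper's actual set is not given in the abstract.
PUNCTUATION = {".", ",", ";", ":", "?", "!"}


def punctuation_distances(tokens: List[str]) -> List[Tuple[str, int, int]]:
    """Return (word, backward, forward) for every non-punctuation token.

    backward: number of words between this word and the previous
              punctuation mark (or the start of the sequence).
    forward:  number of words between this word and the next
              punctuation mark (or the end of the sequence).
    These definitions are hypothetical, chosen for illustration only.
    """
    n = len(tokens)

    # Left-to-right pass: count words seen since the last punctuation mark.
    backward = [0] * n
    count = 0
    for i, tok in enumerate(tokens):
        if tok in PUNCTUATION:
            count = 0
        else:
            backward[i] = count
            count += 1

    # Right-to-left pass: count words remaining before the next punctuation mark.
    forward = [0] * n
    count = 0
    for i in range(n - 1, -1, -1):
        if tokens[i] in PUNCTUATION:
            count = 0
        else:
            forward[i] = count
            count += 1

    return [
        (tok, backward[i], forward[i])
        for i, tok in enumerate(tokens)
        if tok not in PUNCTUATION
    ]


example = ["the", "cat", ",", "which", "was", "black", ",", "slept", "."]
features = punctuation_distances(example)
```

In a predictor like the one described, each word's two distances would be combined with its POS tag to form the input vector at that position; how the features are encoded and normalized is left open here.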

Acknowledgments

This work was supported by the Spanish Ministerio de Economía y Competitividad and European Regional Development Fund, contract TEC2015-69266-P (MINECO/FEDER, UE).

Author information

Authors and Affiliations

Universitat Politècnica de Catalunya, Barcelona, Spain
Santiago Pascual & Antonio Bonafonte

Corresponding author

Correspondence to Santiago Pascual.

Editor information

Editors and Affiliations

INESC-ID/IST, Universidade de Lisboa, Lisbon, Portugal
Alberto Abad
I3A/University of Zaragoza, Zaragoza, Spain
Alfonso Ortega
DETI/IEETA, University of Aveiro, Aveiro, Portugal
António Teixeira
AtlantTIC Research Center, Universidad de Vigo, Vigo, Spain
Carmen García Mateo
Universitat Politècnica de València, Valencia, Spain
Carlos D. Martínez Hinarejos
University of Coimbra, Coimbra, Portugal
Fernando Perdigão
INESC-ID/ISCTE-IUL, Lisbon, Portugal
Fernando Batista
INESC-ID/IST, Universidade de Lisboa, Lisbon, Portugal
Nuno Mamede

About this paper

Cite this paper

Pascual, S., Bonafonte, A. (2016). Prosodic Break Prediction with RNNs. In: Abad, A., et al. Advances in Speech and Language Technologies for Iberian Languages. IberSPEECH 2016. Lecture Notes in Computer Science, vol 10077. Springer, Cham. https://doi.org/10.1007/978-3-319-49169-1_7

DOI: https://doi.org/10.1007/978-3-319-49169-1_7
Published: 04 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49168-4
Online ISBN: 978-3-319-49169-1
eBook Packages: Computer Science, Computer Science (R0)
