Improvements to Prosodic Variation in Long Short-Term Memory Based Intonation Models Using Random Forest

  • Bálint Pál Tóth
  • Balázs Szórádi
  • Géza Németh
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9811)


Statistical parametric speech synthesis has surpassed unit selection methods in many respects, including flexibility and variability. However, the intonation of these systems is quite monotonic, especially in the case of longer sentences: owing to statistical averaging, the variation of fundamental frequency (F0) trajectories decreases. In this research, a random forest (RF) classifier was trained on radio conversations labeled by a human annotator according to perceived variation. This classifier was then used to extend the labels of a phonetically balanced, studio-quality speech corpus. With the extended labels, a Long Short-Term Memory (LSTM) network was trained to model F0. Objective and subjective evaluations were carried out. The results show that the variation of the generated F0 trajectories can be fine-tuned with an additional input of the LSTM network.
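The kind of per-sentence "variation" label described in the abstract — a score of perceived F0 variation that the RF classifier predicts and the LSTM consumes as an extra input — can be sketched in a few lines. The spread measure, bin edges, and class count below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def f0_variation_class(f0_hz, n_classes=3):
    """Quantize the spread of log-F0 over voiced frames into a small
    number of variation classes. Hypothetical feature: it illustrates
    the sort of per-sentence variation label an RF classifier could
    predict and an LSTM-based F0 model could take as conditioning input.
    """
    voiced = f0_hz[f0_hz > 0]            # unvoiced frames assumed marked as 0 Hz
    if voiced.size == 0:
        return 0                          # no voiced frames: lowest class
    spread = np.std(np.log(voiced))       # spread of the log-F0 contour
    edges = [0.05, 0.15]                  # assumed bin edges, not from the paper
    return int(np.digitize(spread, edges))

# A flat contour falls in the lowest class; a strongly varying one in the highest.
flat = np.full(100, 120.0)
varied = 120.0 * np.exp(np.linspace(0.0, 0.6, 100))
print(f0_variation_class(flat), f0_variation_class(varied))   # 0 2
```

At synthesis time such a class value could simply be appended to the frame-level input features of the F0 network, which is one plausible reading of the "additional input" the abstract mentions.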


Keywords: Text-To-Speech (TTS) · Deep learning · Deep neural networks · LSTM · Random forest · Fundamental frequency · Prosodic variability



We would like to thank Mátyás Bartalis for his help in creating the subjective listening test, and the listeners for participating in it. Bálint Pál Tóth gratefully acknowledges the support of NVIDIA Corporation with the donation of an NVIDIA Titan X GPU used for this research. This research is partially supported by the Swiss National Science Foundation via the joint research project (SCOPES scheme) SP2: SCOPES project on speech prosody (SNSF n° IZ73Z0_152495-1).



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

Bálint Pál Tóth, Balázs Szórádi, Géza Németh

Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary
