Ensemble Deep Neural Network Based Waveform-Driven Stress Model for Speech Synthesis

  • Bálint Pál Tóth
  • Kornél István Kis
  • György Szaszák
  • Géza Németh
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9811)


Stress annotations in the training corpus of speech synthesis systems are usually obtained by applying language rules to the transcripts. However, the actual stress patterns seen in the waveform are not guaranteed to be canonical; they can deviate from the locations defined by language rules, mostly because of speaker-dependent factors. Therefore, stress models based on these corpora can be far from perfect. This paper proposes a waveform-based stress annotation technique. According to the stress classes, four feedforward deep neural networks (DNNs) were trained to model the fundamental frequency (F0) of speech. During synthesis, stress labels are generated from the textual input and an ensemble of the four DNNs predicts the F0 trajectories. Objective and subjective evaluations were carried out. The results show that the proposed method surpasses the quality of vanilla DNN-based F0 models.
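The ensemble idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the four networks are combined, so the hard routing of each frame to the network of its stress class is an assumption, as are the class name `TinyFeedforwardNet`, the function `predict_f0_trajectory`, the three-dimensional input features, and the untrained random weights (training is omitted entirely).

```python
import math
import random

N_STRESS_CLASSES = 4  # the abstract mentions four DNNs, tied to the stress classes


class TinyFeedforwardNet:
    """Hypothetical one-hidden-layer regressor standing in for one F0 DNN."""

    def __init__(self, n_inputs, n_hidden, seed):
        rng = random.Random(seed)  # fixed seed so the sketch is reproducible
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def predict(self, features):
        # Hidden layer with tanh activation, followed by a linear output unit.
        hidden = [math.tanh(sum(w * x for w, x in zip(row, features)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return sum(w * h for w, h in zip(self.w2, hidden)) + self.b2


def predict_f0_trajectory(models, frame_features, stress_labels):
    """Route each frame to the network of its stress class.

    Hard routing by label is an assumption made for this sketch.
    """
    if len(frame_features) != len(stress_labels):
        raise ValueError("one stress label is required per frame")
    return [models[label].predict(feats)
            for feats, label in zip(frame_features, stress_labels)]


# Four untrained networks, one per assumed stress class.
models = [TinyFeedforwardNet(n_inputs=3, n_hidden=8, seed=k)
          for k in range(N_STRESS_CLASSES)]

# Made-up frame features and stress labels, for illustration only.
frames = [[0.1, 0.2, 0.3], [0.0, 1.0, 0.5], [0.9, 0.1, 0.4]]
labels = [0, 2, 3]
trajectory = predict_f0_trajectory(models, frames, labels)
```

With trained networks, `trajectory` would hold one F0 value per frame. Here the values are meaningless because the weights are random; the sketch only shows how stress labels generated from text could select among class-specific models.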


Text-to-speech, TTS, Deep learning, Deep neural networks, F0, Ensemble learning, Stress annotation



We would like to thank Mátyás Bartalis for his help in creating the subjective listening test and the listeners for participating in it. Bálint Pál Tóth gratefully acknowledges the support of NVIDIA Corporation with the donation of an NVIDIA Titan X GPU used for his research. This research is partially supported by the Swiss National Science Foundation via the joint research project (SCOPES scheme) SP2: SCOPES project on speech prosody (SNSF n° IZ73Z0_152495-1).



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Bálint Pál Tóth
    • 1
  • Kornél István Kis
    • 1
  • György Szaszák
    • 1
  • Géza Németh
    • 1
  1. Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary
