Residual-Based Excitation with Continuous F0 Modeling in HMM-Based Speech Synthesis

  • Tamás Gábor Csapó
  • Géza Németh
  • Milos Cernak
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9449)


In statistical parametric speech synthesis, creaky voice can cause disturbing artifacts. The reason is that standard pitch tracking algorithms tend to measure F0 erroneously in regions of creaky voice, and this pattern is learned during the training of hidden Markov models (HMMs). In the synthesis phase, false voiced/unvoiced decisions caused by creaky voice result in audible quality degradation. To eliminate this phenomenon, we use a simple continuous F0 tracker that does not apply a strict voiced/unvoiced decision. In the proposed residual-based vocoder, Maximum Voiced Frequency is used to mix voiced and unvoiced excitation. Because all parameters of the vocoder are continuous, Multi-Space Distribution is not necessary when training the HMMs, which has been shown to be advantageous. Artifacts caused by creaky voice are eliminated with this speech synthesis system. A subjective listening test of English utterances has shown improvement over the traditional excitation.
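The idea of mixing voiced and unvoiced excitation at the Maximum Voiced Frequency (MVF) can be illustrated with a minimal sketch. It is not the vocoder described in the paper: a plain impulse train stands in for the residual-based voiced excitation, the band split is an ideal FFT mask, and the function name `mixed_excitation`, its parameters, and the 16 kHz sampling rate are assumptions made for illustration only.

```python
import numpy as np


def mixed_excitation(f0_hz, mvf_hz, n_samples, fs=16000, seed=0):
    """Return one frame of mixed excitation: pulses below the MVF, noise above it.

    Hypothetical helper for illustration; an impulse train replaces the
    residual-based voiced excitation used in the paper.
    """
    rng = np.random.default_rng(seed)

    # Voiced component: unit impulses spaced one pitch period apart.
    # F0 is assumed to be continuous, i.e. defined for every frame.
    period = max(1, int(round(fs / f0_hz)))
    voiced = np.zeros(n_samples)
    voiced[::period] = 1.0

    # Unvoiced component: white Gaussian noise.
    noise = rng.standard_normal(n_samples)

    # Split the spectrum at the MVF with an ideal (brick-wall) mask:
    # the voiced part keeps bins up to the MVF, the noise keeps the rest.
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    low = freqs <= mvf_hz
    voiced_spec = np.fft.rfft(voiced) * low
    noise_spec = np.fft.rfft(noise) * ~low
    return np.fft.irfft(voiced_spec + noise_spec, n=n_samples)


# Example: a 1024-sample frame with F0 = 100 Hz and MVF = 4 kHz.
frame = mixed_excitation(100.0, 4000.0, 1024)
```

In this sketch no binary voiced/unvoiced flag is needed: a low MVF yields a mostly noisy frame, while an MVF at the Nyquist frequency yields a purely periodic one.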


Speech synthesis · HMM · Creaky voice · Vocoder · Pitch tracking



We would like to thank the listeners for participating in the subjective test. We thank Philip N. Garner for making the continuous pitch tracker available as open source and Bálint Pál Tóth for useful comments on this manuscript. This research is partially supported by the Swiss National Science Foundation via the joint research project (SCOPES scheme) SP2: SCOPES project on speech prosody (SNSF no. IZ73Z0_152495-1) and by the EITKIC project (EITKIC_12-1-2012-001).



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Tamás Gábor Csapó (1)
  • Géza Németh (1)
  • Milos Cernak (2)

  1. Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary
  2. Idiap Research Institute, Martigny, Switzerland
