Skip to main content

Residual-Based Excitation with Continuous F0 Modeling in HMM-Based Speech Synthesis

  • Conference paper
  • First Online:
Statistical Language and Speech Processing (SLSP 2015)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 9449))

Included in the following conference series:

Abstract

In statistical parametric speech synthesis, creaky voice can cause disturbing artifacts. The reason is that standard pitch tracking algorithms tend to erroneously measure F0 in regions of creaky voice. This pattern is learned during training of hidden Markov-models (HMMs). In the synthesis phase, false voiced/unvoiced decision caused by creaky voice results in audible quality degradation. In order to eliminate this phenomena, we use a simple continuous F0 tracker which does not apply a strict voiced/unvoiced decision. In the proposed residual-based vocoder, Maximum Voiced Frequency is used for mixed voiced and unvoiced excitation. As all parameters of the vocoder are continuous, Multi-Space Distribution is not necessary during training the HMMs, which has been shown to be advantageous. Artifacts caused by creaky voice are eliminated with this speech synthesis system. A subjective listening test of English utterances has shown improvement over the traditional excitation.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. The Festival Speech Synthesis System [Computer program], Version 2.1 (2010). http://www.cstr.ed.ac.uk/projects/festival/

  2. The Snack Sound Toolkit [Computer program], Version 2.2.10 (2012). http://www.speech.kth.se/snack/

  3. Speech Signal Processing - a small collection of routines in Python to do signal processing [Computer program] (2015). https://github.com/idiap/ssp

  4. Bohm, T., Audibert, N., Shattuck-Hufnagel, S., Németh, G., Aubergé, V.: Transforming modal voice into irregular voice by amplitude scaling of individual glottal cycles. In: Acoustics 2008, Paris, France, pp. 6141–6146 (2008)

    Google Scholar 

  5. Blomgren, M., Chen, Y., Ng, M.L., Gilbert, H.R.: Acoustic, aerodynamic, physiologic, and perceptual properties of modal and vocal fry registers. J. Acoust. Soc. Am. 103(5), 2649–2658 (1998)

    Article  Google Scholar 

  6. Csapó, T.G., Németh, G.: A novel codebook-based excitation model for use in speech synthesis. In: IEEE CogInfoCom, Kosice, Slovakia, pp. 661–665, December 2012

    Google Scholar 

  7. Csapó, T.G., Németh, G.: A novel irregular voice model for HMM-based speech synthesis. In: Proceedings of the ISCA SSW8, Barcelona, Spain, pp. 229–234 (2013)

    Google Scholar 

  8. Csapó, T.G., Németh, G.: Modeling irregular voice in statistical parametric speech synthesis with residual codebook based excitation. IEEE J. Sel. Top. Sig. Proc. 8(2), 209–220 (2014)

    Article  Google Scholar 

  9. Csapó, T.G., Németh, G.: Statistical parametric speech synthesis with a novel codebook-based excitation model. Intell. Decis. Technol. 8(4), 289–299 (2014)

    Article  Google Scholar 

  10. Csapó, T.G., Németh, G.: Automatic transformation of irregular to regular voice by residual analysis and synthesis. In: Proceedings of the Interspeech (2015). (accepted)

    Google Scholar 

  11. Drugman, T., Dutoit, T.: The deterministic plus stochastic model of the residual signal and its applications. IEEE Trans. Audio Speech Lang. Proc. 20(3), 968–981 (2012)

    Article  Google Scholar 

  12. Drugman, T., Kane, J., Gobl, C.: Data-driven detection and analysis of the patterns of creaky voice. Comput. Speech Lang. 28(5), 1233–1253 (2014)

    Article  Google Scholar 

  13. Drugman, T., Raitio, T.: Excitation modeling for HMM-based speech synthesis: breaking down the impact of periodic and aperiodic components. In: Proceedings of the ICASSP, Florence, Italy, pp. 260–264 (2014)

    Google Scholar 

  14. Drugman, T., Stylianou, Y.: Maximum voiced frequency estimation : exploiting amplitude and phase spectra. IEEE Sig. Proc. Lett. 21(10), 1230–1234 (2014)

    Article  Google Scholar 

  15. Drugman, T., Thomas, M.: Detection of glottal closure instants from speech signals: a quantitative review. IEEE Trans. Audio Speech Lang. Process. 20(3), 994–1006 (2012)

    Article  Google Scholar 

  16. Drugman, T., Wilfart, G., Dutoit, T.: Eigenresiduals for improved parametric speech synthesis. In: EUSIPCO09 (2009)

    Google Scholar 

  17. Garner, P.N., Cernak, M., Motlicek, P.: A simple continuous pitch estimation algorithm. IEEE Sig. Process. Lett. 20(1), 102–105 (2013)

    Article  Google Scholar 

  18. Hu, Q., Richmond, K., Yamagishi, J., Latorre, J.: An experimental comparison of multiple vocoder types. In: Proceedings of the ISCA SSW8, pp. 155–160 (2013)

    Google Scholar 

  19. Imai, S., Sumita, K., Furuichi, C.: Mel Log Spectrum Approximation (MLSA) filter for speech synthesis. Electron. Commun. Jpn. (Part I: Communications) 66(2), 10–18 (1983)

    Article  Google Scholar 

  20. Kawahara, H., Masuda-Katsuse, I., de Cheveigné, A.: Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Commun 27(3), 187–207 (1999)

    Article  Google Scholar 

  21. Kominek, J., Black, A.W.: CMU ARCTIC databases for speech synthesis. Tech. rep, Language Technologies Institute (2003)

    Google Scholar 

  22. Latorre, J., Gales, M.J.F., Buchholz, S., Knil, K., Tamura, M., Ohtani, Y., Akamine, M.: Continuous F0 in the source-excitation generation for HMM-based TTS: Do we need voiced/unvoiced classification? In: Proceedings of the ICASSP, Prague, Czech Republic, pp. 4724–4727 (2011)

    Google Scholar 

  23. Talkin, D.: A robust algorithm for pitch tracking (RAPT). In: Kleijn, W.B., Paliwal, K.K. (eds.) Speech Coding and Synthesis, pp. 495–518. Elsevier, Amsterdam (1995)

    Google Scholar 

  24. Tokuda, K., Kobayashi, T., Masuko, T., Imai, S.: Mel-generalized cepstral analysis - a unified approach to speech spectral estimation. In: Proceedings of the ICSLP, Yokohama, Japan, pp. 1043–1046 (1994)

    Google Scholar 

  25. Tokuda, K., Mausko, T., Miyazaki, N., Kobayashi, T.: Multi-space probability distribution HMM. IEICE Trans. Inf. Syst. E85–D(3), 455–464 (2002)

    Google Scholar 

  26. Yu, K., Thomson, B., Young, S., Street, T.: From discontinuous to continuous F0 modelling in HMM-based speech synthesis. In: Proceedings of the ISCA SSW7, Kyoto, Japan, pp. 94–99 (2010)

    Google Scholar 

  27. Yu, K., Young, S.: Continuous F0 modeling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process. 19(5), 1071–1079 (2011)

    Article  Google Scholar 

  28. Yu, K., Young, S.: Joint modelling of voicing label and continuous F0 for HMM based speech synthesis. In: Proceedings of the ICASSP, Prague, Czech Republic, pp. 4572–4575 (2011)

    Google Scholar 

  29. Zen, H., Nose, T., Yamagishi, J., Sako, S., Masuko, T., Black, A.: The HMM-based speech synthesis system version 2.0. In: Proceedings of the ISCA SSW6, Bonn, Germany, pp. 294–299 (2007)

    Google Scholar 

  30. Zen, H., Toda, T., Nakamura, M., Tokuda, K.: Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005. IEICE Trans. Inf. Syst. E90–D(1), 325–333 (2007)

    Article  Google Scholar 

  31. Zen, H., Tokuda, K., Black, A.W.: Statistical parametric speech synthesis. Speech Commun. 51(11), 1039–1064 (2009)

    Article  Google Scholar 

Download references

Acknowledgments

We would like to thank the listeners for participating in the subjective test. We thank Philip N. Garner for providing the continuous pitch tracker open source and Bálint Pál Tóth for useful comments on this manuscript. This research is partially supported by the Swiss National Science Foundation via the joint research project (SCOPES scheme) SP2: SCOPES project on speech prosody (SNSF no IZ73Z0_152495-1) and by the EITKIC project (EITKIC_12-1-2012-001).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Tamás Gábor Csapó .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Csapó, T.G., Németh, G., Cernak, M. (2015). Residual-Based Excitation with Continuous F0 Modeling in HMM-Based Speech Synthesis. In: Dediu, AH., Martín-Vide, C., Vicsi, K. (eds) Statistical Language and Speech Processing. SLSP 2015. Lecture Notes in Computer Science(), vol 9449. Springer, Cham. https://doi.org/10.1007/978-3-319-25789-1_4

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-25789-1_4

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25788-4

  • Online ISBN: 978-3-319-25789-1

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics