Residual-Based Excitation with Continuous F0 Modeling in HMM-Based Speech Synthesis

Csapó, Tamás Gábor; Németh, Géza; Cernak, Milos

doi:10.1007/978-3-319-25789-1_4

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9449)

Included in the following conference series:

International Conference on Statistical Language and Speech Processing

Abstract

In statistical parametric speech synthesis, creaky voice can cause disturbing artifacts. The reason is that standard pitch tracking algorithms tend to measure F0 erroneously in regions of creaky voice, and this pattern is learned during the training of hidden Markov models (HMMs). In the synthesis phase, false voiced/unvoiced decisions caused by creaky voice result in audible quality degradation. To eliminate this phenomenon, we use a simple continuous F0 tracker that does not apply a strict voiced/unvoiced decision. In the proposed residual-based vocoder, Maximum Voiced Frequency is used for mixed voiced and unvoiced excitation. As all parameters of the vocoder are continuous, Multi-Space Distribution is not necessary when training the HMMs, which has been shown to be advantageous. Artifacts caused by creaky voice are eliminated with this speech synthesis system. A subjective listening test on English utterances has shown an improvement over the traditional excitation.
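
The idea of mixing voiced and unvoiced excitation at the Maximum Voiced Frequency (MVF) can be illustrated with a minimal Python sketch. It is not the vocoder described in the chapter: an impulse train stands in for the residual-based voiced component, the band split is an ideal FFT mask, and no gain normalization is applied. The function name `mixed_excitation_frame` and all of its parameters (`fs`, `n`, `seed`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def mixed_excitation_frame(f0_hz, mvf_hz, fs=16000, n=400, seed=0):
    """Sketch of one excitation frame: periodic part below the MVF, noise above it.

    Assumptions (not from the chapter): impulse-train voiced source,
    brick-wall band split, unnormalized gains.
    """
    rng = np.random.default_rng(seed)

    # Voiced component: impulse train at the continuous F0 value.
    # A continuous tracker always yields f0_hz > 0, so no V/UV branch is needed.
    period = max(1, int(round(fs / f0_hz)))
    pulses = np.zeros(n)
    pulses[::period] = 1.0

    # Unvoiced component: white Gaussian noise.
    noise = rng.standard_normal(n)

    # Split the spectrum at the MVF: keep pulses below it, noise above it.
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    low = freqs <= mvf_hz
    voiced = np.fft.irfft(np.fft.rfft(pulses) * low, n)
    unvoiced = np.fft.irfft(np.fft.rfft(noise) * ~low, n)
    return voiced + unvoiced

# Example: a strongly voiced frame (high MVF) and a mostly noisy frame (low MVF).
voiced_like = mixed_excitation_frame(f0_hz=120.0, mvf_hz=6000.0)
noisy_like = mixed_excitation_frame(f0_hz=120.0, mvf_hz=500.0)
```

Because both F0 and MVF are defined for every frame, the degree of voicing is expressed by where the MVF lies rather than by a binary voiced/unvoiced flag, which is why a sketch like this needs no switching logic.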

Acknowledgments

We would like to thank the listeners for participating in the subjective test. We thank Philip N. Garner for providing the continuous pitch tracker as open source and Bálint Pál Tóth for useful comments on this manuscript. This research is partially supported by the Swiss National Science Foundation via the joint research project (SCOPES scheme) SP2: SCOPES project on speech prosody (SNSF no. IZ73Z0_152495-1) and by the EITKIC project (EITKIC_12-1-2012-001).

Author information

Authors and Affiliations

Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Magyar Tudósok körútja 2, Budapest, Hungary
Tamás Gábor Csapó & Géza Németh
Idiap Research Institute, Rue Marconi 19, Martigny, Switzerland
Milos Cernak

Corresponding author

Correspondence to Tamás Gábor Csapó.

Editor information

Editors and Affiliations

Research Group on Mathematical Linguistics, Rovira i Virgili University, Tarragona, Spain
Adrian-Horia Dediu
Research Group on Mathematical Linguistics, Rovira i Virgili University, Tarragona, Spain
Carlos Martín-Vide
Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary
Klára Vicsi

About this paper

Cite this paper

Csapó, T.G., Németh, G., Cernak, M. (2015). Residual-Based Excitation with Continuous F0 Modeling in HMM-Based Speech Synthesis. In: Dediu, AH., Martín-Vide, C., Vicsi, K. (eds) Statistical Language and Speech Processing. SLSP 2015. Lecture Notes in Computer Science, vol 9449. Springer, Cham. https://doi.org/10.1007/978-3-319-25789-1_4

DOI: https://doi.org/10.1007/978-3-319-25789-1_4
Published: 17 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25788-4
Online ISBN: 978-3-319-25789-1
eBook Packages: Computer Science, Computer Science (R0)
