Abstract
In statistical parametric speech synthesis, creaky voice can cause disturbing artifacts. The reason is that standard pitch tracking algorithms tend to measure F0 erroneously in regions of creaky voice. This pattern is learned during training of hidden Markov models (HMMs). In the synthesis phase, false voiced/unvoiced decisions caused by creaky voice result in audible quality degradation. To eliminate this phenomenon, we use a simple continuous F0 tracker which does not apply a strict voiced/unvoiced decision. In the proposed residual-based vocoder, Maximum Voiced Frequency is used for mixed voiced and unvoiced excitation. As all parameters of the vocoder are continuous, Multi-Space Distribution is not necessary when training the HMMs, which has been shown to be advantageous. Artifacts caused by creaky voice are eliminated with this speech synthesis system. A subjective listening test of English utterances has shown improvement over the traditional excitation.
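The idea of mixed excitation driven only by continuous parameters can be illustrated with a minimal sketch. It is not the residual-based vocoder described in the abstract: it assumes a plain harmonic model below the Maximum Voiced Frequency (MVF) and random-phase sinusoids above it, and the function name `mixed_excitation`, the 16 kHz sampling rate, the frame length, and the 50 Hz noise grid are all hypothetical choices made for illustration.

```python
import math
import random


def mixed_excitation(f0_hz, mvf_hz, fs_hz=16000, n_samples=320, seed=0):
    """Build one excitation frame from two continuous parameters (illustrative only).

    At or below mvf_hz: harmonics of f0_hz (voiced component).
    Above mvf_hz: random-phase sinusoids on a 50 Hz grid (noise-like component).
    No binary voiced/unvoiced flag is used anywhere.
    """
    if f0_hz <= 0:
        raise ValueError("a continuous F0 track is assumed to be positive in every frame")

    rng = random.Random(seed)
    nyquist = fs_hz / 2.0

    # Voiced component: every harmonic of F0 that lies at or below the MVF.
    harmonics = []
    k = 1
    while k * f0_hz <= min(mvf_hz, nyquist):
        harmonics.append(k * f0_hz)
        k += 1

    # Noise-like component: grid frequencies strictly above the MVF (assumed 50 Hz grid).
    grid_hz = 50.0
    noise = []
    f = grid_hz
    while f < nyquist:
        if f > mvf_hz:
            noise.append((f, rng.uniform(0.0, 2.0 * math.pi)))
        f += grid_hz

    # Sum both components sample by sample.
    frame = []
    for t in range(n_samples):
        phase = 2.0 * math.pi * t / fs_hz
        voiced = sum(math.cos(phase * h) for h in harmonics)
        unvoiced = sum(math.cos(phase * fr + p) for fr, p in noise)
        frame.append(voiced + unvoiced)
    return frame, harmonics, [fr for fr, _ in noise]
```

In this sketch, a frame with an MVF of 0 Hz contains only the noise-like component, so the degree of voicing is carried by a continuous value rather than by a hard decision on F0.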
Acknowledgments
We would like to thank the listeners for participating in the subjective test. We thank Philip N. Garner for providing the continuous pitch tracker as open source and Bálint Pál Tóth for useful comments on this manuscript. This research is partially supported by the Swiss National Science Foundation via the joint research project (SCOPES scheme) SP2: SCOPES project on speech prosody (SNSF no. IZ73Z0_152495-1) and by the EITKIC project (EITKIC_12-1-2012-001).
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Csapó, T.G., Németh, G., Cernak, M. (2015). Residual-Based Excitation with Continuous F0 Modeling in HMM-Based Speech Synthesis. In: Dediu, AH., Martín-Vide, C., Vicsi, K. (eds) Statistical Language and Speech Processing. SLSP 2015. Lecture Notes in Computer Science, vol 9449. Springer, Cham. https://doi.org/10.1007/978-3-319-25789-1_4
DOI: https://doi.org/10.1007/978-3-319-25789-1_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25788-4
Online ISBN: 978-3-319-25789-1
eBook Packages: Computer Science, Computer Science (R0)