Abstract
This paper proposes a robust voicing detection and \(F_{0}\) estimation method for hidden Markov model (HMM)-based speech synthesis systems. The impulse-like excitation present in voiced speech is used to extract the fundamental frequency, and a zero-frequency filter (ZFF) is used to locate the instants of impulse excitation. The main contribution of this paper is to exploit the size of the window used in ZFF for accurate voicing detection and \(F_{0}\) estimation. When an appropriate window size is chosen adaptively, the strength of excitation is significantly higher for voiced speech than for unvoiced speech, so a suitable threshold on the strength of excitation gives accurate voicing detection. A smooth and accurate \(F_{0}\) contour is then extracted by frame-wise zero-frequency filtering of the speech with an appropriate window size. The performance of the proposed method is compared with that of existing voicing detection and \(F_{0}\) estimation methods, and the method is implemented in an HMM-based speech synthesis system. Both objective and subjective evaluation results show that the proposed method produces good-quality speech relative to HMM-based speech synthesis systems built with voicing detection and \(F_{0}\) estimation methods based on the robust algorithm for pitch tracking (RAPT) and Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum (STRAIGHT).
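The processing chain summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a single fixed trend-removal window instead of the adaptive window selection that the paper contributes, and the function names, the 15 ms default window, the three trend-removal passes, and the relative threshold are all assumptions chosen for illustration. It follows the commonly described form of zero-frequency filtering: difference the signal, pass it through two cascaded zero-frequency resonators, remove the trend by local mean subtraction, then take positive-going zero crossings as excitation instants and the slope at each crossing as the strength of excitation.

```python
import numpy as np


def zero_frequency_filter(speech, fs, window_ms=15.0, passes=3):
    """Return a zero-frequency filtered (ZFF) signal for a 1-D array.

    window_ms and passes are illustrative defaults, not values from the paper.
    """
    s = np.asarray(speech, dtype=np.float64)
    half = max(1, int(round(window_ms * 1e-3 * fs / 2)))
    if s.size <= 2 * passes * half + 1:
        raise ValueError("signal is too short for the chosen window")
    # Difference the signal to remove any constant bias.
    x = np.diff(s, prepend=0.0)
    # Two cascaded resonators y[n] = 2*y[n-1] - y[n-2] + x[n];
    # each one is equivalent to a double cumulative sum.
    y = x
    for _ in range(4):
        y = np.cumsum(y)
    # Remove the polynomial trend by repeatedly subtracting the local mean.
    kernel = np.ones(2 * half + 1) / (2 * half + 1)
    for i in range(1, passes + 1):
        y = y - np.convolve(y, kernel, mode="same")
        # Samples near the edges see a truncated window, so zero them out.
        edge = i * half
        y[:edge] = 0.0
        y[-edge:] = 0.0
    return y


def epochs_and_strength(zff):
    """Positive-going zero crossings and the ZFF slope at each crossing."""
    idx = np.flatnonzero((zff[:-1] < 0) & (zff[1:] >= 0))
    strength = zff[idx + 1] - zff[idx]
    return idx, strength


def f0_from_epochs(epochs, fs):
    """Instantaneous F0 (Hz) from the intervals between successive epochs."""
    return fs / np.diff(epochs)


# Synthetic check: an impulse train with an 80-sample period at 8 kHz (100 Hz).
fs = 8000
speech = np.zeros(fs // 2)
speech[::80] = 1.0
zff = zero_frequency_filter(speech, fs)
epochs, strength = epochs_and_strength(zff)
f0 = f0_from_epochs(epochs, fs)
# Hypothetical voicing rule: keep epochs whose strength exceeds a relative threshold.
voiced = strength > 0.1 * strength.max()
print(np.median(f0), int(voiced.sum()))
```

For this synthetic input the median of `f0` should land close to 100 Hz. On real speech the window length matters much more, which is the sensitivity the paper exploits: a window that is poorly matched to the local pitch period weakens the contrast in excitation strength between voiced and unvoiced regions.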
Acknowledgments
The authors would like to thank Dr. Nagarajan T and his students at the speech laboratory of SSN College of Engineering for organizing a workshop on HMM-based speech synthesis systems. They are grateful to Prof. Hema A Murthy and Dr. K. Samudravijaya for their valuable help in understanding HMM-based speech synthesis systems. The authors would also like to thank the Department of Information Technology, Government of India, for sponsoring the workshop.
Narendra, N.P., Rao, K.S. Robust Voicing Detection and \(F_{0}\) Estimation for HMM-Based Speech Synthesis. Circuits Syst Signal Process 34, 2597–2619 (2015). https://doi.org/10.1007/s00034-015-9977-8
DOI: https://doi.org/10.1007/s00034-015-9977-8