Robust Voicing Detection and $F_{0}$ Estimation for HMM-Based Speech Synthesis

Narendra, N. P.; Rao, K. Sreenivasa

doi:10.1007/s00034-015-9977-8

Published: 30 January 2015

Volume 34, pages 2597–2619, (2015)
Circuits, Systems, and Signal Processing

Abstract

This paper proposes a robust voicing detection and $F_{0}$ estimation method for hidden Markov model (HMM)-based speech synthesis systems. The impulse-like excitation present in voiced speech is used to extract the fundamental frequency, and a zero-frequency filter (ZFF) is used to derive the locations of the impulse excitation. The main contribution of the paper is to exploit the size of the window used in ZFF for accurate voicing detection and $F_{0}$ estimation. When the window size is chosen adaptively, the strength of excitation for voiced speech is significantly higher than for unvoiced speech, so a suitable threshold on the strength of excitation gives accurate voicing detection. A smooth and accurate $F_{0}$ contour is extracted by frame-wise zero-frequency filtering of the speech with an appropriate window size. The performance of the proposed method is compared with that of existing voicing detection and $F_{0}$ estimation methods. The proposed method is also implemented in an HMM-based speech synthesis system. Both objective and subjective evaluation results show that it generates higher-quality speech than HMM-based speech synthesis systems built with voicing detection and $F_{0}$ estimation methods based on the Robust Algorithm for Pitch Tracking (RAPT) and Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum (STRAIGHT).
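The abstract does not give the algorithmic details, so the following is only a minimal sketch of zero-frequency filtering as it is commonly described in the literature, not the authors' method. It assumes the usual steps: difference the signal, pass it through a cascade of two ideal zero-frequency resonators (equivalent to four cumulative sums), and remove the resulting trend by repeated local-mean subtraction. The names `zero_frequency_filter`, `epochs_and_strength`, and `f0_from_epochs`, the fixed `window_ms`, and the number of `passes` are all illustrative assumptions. In particular, the adaptive, frame-wise choice of window size that the paper proposes is not reproduced here.

```python
import numpy as np


def remove_trend(y, half_win):
    """Subtract the local mean over a (2 * half_win + 1)-sample window."""
    length = 2 * half_win + 1
    kernel = np.ones(length) / length
    return y - np.convolve(y, kernel, mode="same")


def zero_frequency_filter(speech, fs, window_ms=10.0, passes=3):
    """Return the zero-frequency filtered signal (fixed window, assumed values)."""
    speech = np.asarray(speech, dtype=np.float64)
    # Differencing removes any constant offset in the recording.
    y = np.diff(speech, prepend=speech[0])
    # Two cascaded zero-frequency resonators amount to four integrations.
    for _ in range(4):
        y = np.cumsum(y)
    # Trend-removal window, typically about one to two pitch periods long.
    half_win = max(1, int(round(window_ms * 1e-3 * fs / 2)))
    for _ in range(passes):
        y = remove_trend(y, half_win)
    return y


def epochs_and_strength(zff):
    """Locate negative-to-positive zero crossings and the slope at each one."""
    idx = np.flatnonzero((zff[:-1] < 0) & (zff[1:] >= 0))
    strength = zff[idx + 1] - zff[idx]  # slope used as strength of excitation
    return idx, strength


def f0_from_epochs(epochs, fs):
    """Instantaneous F0 in Hz from the spacing of successive epochs."""
    return fs / np.diff(epochs)


if __name__ == "__main__":
    fs = 8000
    speech = np.zeros(4000)
    speech[40::80] = 1.0  # synthetic impulse train with an 80-sample period
    zff = zero_frequency_filter(speech, fs)
    epochs, strength = epochs_and_strength(zff)
    print(np.median(f0_from_epochs(epochs, fs)))
```

In this sketch a voicing decision would be a comparison of `strength` against a threshold, and the value of that threshold is something the abstract leaves unspecified. The samples within a few window lengths of either end of the signal are affected by zero padding in the trend removal, so estimates there should be treated with caution.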

Acknowledgments

The authors would like to thank Dr. Nagarajan T and his students at the speech laboratory of SSN College of Engineering for organizing a workshop on HMM-based speech synthesis systems. They are grateful to Prof. Hema A Murthy and Dr. K. Samudravijaya for their valuable help in understanding HMM-based speech synthesis systems. The authors would also like to thank the Department of Information Technology, Government of India, for sponsoring the workshop.

Author information

Authors and Affiliations

School of Information Technology, Indian Institute of Technology Kharagpur, Kharagpur, 721302, West Bengal, India
N. P. Narendra & K. Sreenivasa Rao

Corresponding author

Correspondence to N. P. Narendra.

About this article

Cite this article

Narendra, N.P., Rao, K.S. Robust Voicing Detection and $F_{0}$ Estimation for HMM-Based Speech Synthesis. Circuits Syst Signal Process 34, 2597–2619 (2015). https://doi.org/10.1007/s00034-015-9977-8

Received: 09 August 2014
Revised: 13 January 2015
Accepted: 14 January 2015
Published: 30 January 2015
Issue Date: August 2015
DOI: https://doi.org/10.1007/s00034-015-9977-8
