Robust Voicing Detection and \(F_{0}\) Estimation for HMM-Based Speech Synthesis

Circuits, Systems, and Signal Processing

Abstract

This paper proposes a robust voicing detection and \(F_{0}\) estimation method for hidden Markov model (HMM)-based speech synthesis systems. The impulse-like excitation present in voiced speech is used to extract the fundamental frequency, and a zero-frequency filter (ZFF) is used to derive the locations of the impulse excitation. The main contribution of this paper is the exploitation of the window size used in ZFF for accurate voicing detection and \(F_{0}\) estimation. When an appropriate window size is chosen adaptively, the strength of excitation for voiced speech is significantly higher than that for unvoiced speech, so a suitable threshold on the strength of excitation yields accurate voicing detection. A smooth and accurate \(F_{0}\) contour is extracted by frame-wise zero-frequency filtering of speech with an appropriate window size. The performance of the proposed method is compared with that of existing voicing detection and \(F_{0}\) estimation methods. The proposed method is also implemented in an HMM-based speech synthesis system. Both objective and subjective evaluation results show that the proposed method is capable of generating good-quality speech compared with HMM-based speech synthesis systems developed using voicing detection and \(F_{0}\) estimation methods based on the Robust Algorithm for Pitch Tracking (RAPT) and Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum (STRAIGHT).
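To make the zero-frequency filtering idea in the abstract concrete, the following is a minimal sketch of a basic, fixed-window ZFF epoch extractor in the style of Murty and Yegnanarayana (2008). It is not the adaptive, frame-wise method proposed in the paper. The function names, the window length of about 1.5 average pitch periods, the three trend-removal passes, the choice of positive-going zero crossings as epochs, and the synthetic impulse-train input are all assumptions made for illustration.

```python
import numpy as np


def remove_trend(y, half_win, passes=3):
    """Subtract a centred moving average of length 2*half_win + 1, `passes` times."""
    kernel = np.ones(2 * half_win + 1) / (2 * half_win + 1)
    for _ in range(passes):
        y = y - np.convolve(y, kernel, mode="same")
    return y


def zff_epochs(speech, fs, avg_pitch_period_s=0.010, passes=3):
    """Basic fixed-window ZFF sketch (illustrative, not the paper's adaptive method).

    Returns epoch sample indices, the slope of the filtered signal at each
    epoch (used here as the strength of excitation), and the F0 values in Hz
    derived from the intervals between successive epochs.
    """
    # 1. Difference the signal to remove any slowly varying bias.
    x = np.diff(np.asarray(speech, dtype=float), prepend=0.0)

    # 2. Cascade of two ideal zero-frequency resonators, each of which is a
    #    double integrator, so the cascade equals four cumulative sums.
    y = x
    for _ in range(4):
        y = np.cumsum(y)

    # 3. Remove the polynomial trend with a window of about 1.5 pitch periods
    #    (assumed value; the paper adapts this window size).
    half_win = max(1, int(round(0.75 * avg_pitch_period_s * fs)))
    y = remove_trend(y, half_win, passes)

    # 4. Ignore the edges, where the zero-padded moving average is unreliable.
    edge = passes * half_win
    core = y[edge:len(y) - edge]

    # 5. Positive-going zero crossings are taken as epoch candidates.
    idx = np.flatnonzero((core[:-1] < 0) & (core[1:] >= 0))
    strength = core[idx + 1] - core[idx]
    epochs = idx + edge

    # 6. Instantaneous F0 from the spacing of successive epochs.
    f0 = fs / np.diff(epochs)
    return epochs, strength, f0


# Synthetic stand-in for voiced excitation: a 100 Hz impulse train at 8 kHz.
fs = 8000
speech = np.zeros(fs // 2)
speech[::80] = 1.0

epochs, strength, f0 = zff_epochs(speech, fs)
median_f0 = float(np.median(f0))
print(f"epochs found: {len(epochs)}, median F0: {median_f0:.1f} Hz")
```

In this sketch, a voicing decision would amount to thresholding `strength`. The abstract indicates that the paper makes this separation reliable by choosing the window size adaptively, which the fixed `avg_pitch_period_s` parameter here does not attempt.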

Acknowledgments

The authors would like to thank Dr. Nagarajan T and his students at the speech laboratory of SSN College of Engineering for organizing a workshop on the HMM-based speech synthesis system. They are grateful to Prof. Hema A Murthy and Dr. K. Samudravijaya for their valuable help in understanding the HMM-based speech synthesis system. The authors would also like to thank the Department of Information Technology, Government of India, for sponsoring the workshop.

Author information

Corresponding author

Correspondence to N. P. Narendra.

About this article

Cite this article

Narendra, N.P., Rao, K.S. Robust Voicing Detection and \(F_{0}\) Estimation for HMM-Based Speech Synthesis. Circuits Syst Signal Process 34, 2597–2619 (2015). https://doi.org/10.1007/s00034-015-9977-8

  • DOI: https://doi.org/10.1007/s00034-015-9977-8