Background and Literature Review

Rao, K. Sreenivasa; Narendra, N. P.

doi:10.1007/978-3-030-02759-9_2

Part of the book series: SpringerBriefs in Speech Technology (BRIEFSSPEECHTECH)

Abstract

This chapter provides a brief overview of HMM-based speech synthesis. Existing work on voicing detection and F₀ estimation is briefly discussed, and previous source modeling approaches are presented. Studies on the modeling and generation of creaky voice are also briefly reviewed.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India
K. Sreenivasa Rao
Aalto University, Espoo, Finland
N. P. Narendra

About this chapter

Cite this chapter

Rao, K.S., Narendra, N.P. (2019). Background and Literature Review. In: Source Modeling Techniques for Quality Enhancement in Statistical Parametric Speech Synthesis. SpringerBriefs in Speech Technology. Springer, Cham. https://doi.org/10.1007/978-3-030-02759-9_2

DOI: https://doi.org/10.1007/978-3-030-02759-9_2
Published: 14 December 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02758-2
Online ISBN: 978-3-030-02759-9
eBook Packages: Engineering, Engineering (R0)
