Abstract
This paper presents a hidden Markov model (HMM)-based method for Mandarin-Tibetan cross-lingual emotional speech synthesis that uses an emotional Mandarin speech corpus with speaker adaptation. We first train a set of average acoustic models by speaker adaptive training on a one-speaker neutral Tibetan corpus and a multi-speaker neutral Mandarin corpus. We then obtain a set of speaker-dependent acoustic models of the target emotion by speaker adaptation with the target emotional Mandarin corpus; these models are used to synthesize emotional Tibetan or Mandarin speech. Subjective evaluations and objective tests show that the method can synthesize both emotional Mandarin speech and emotional Tibetan speech with high naturalness and emotional similarity. The method can therefore be used to realize emotional speech synthesis from an existing emotional training corpus for languages that lack emotional speech resources.
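One way to see why adaptation data in one language can affect speech in another is that adaptation transforms are typically shared across many model states. The toy sketch below is not the authors' algorithm: it is a simplified, one-dimensional least-squares stand-in for a shared affine mean transform, and every name and number in it (the state labels, the mean values, `fit_shared_affine`) is a hypothetical chosen for illustration. It fits a single transform on the states that appear in the adaptation data and then applies it to every state of the average model, including one that the adaptation data never covers.

```python
from statistics import mean


def fit_shared_affine(avg_means, target_means):
    """Least-squares fit of target ~ a * avg + b, shared by all states."""
    x_bar = mean(avg_means)
    y_bar = mean(target_means)
    sxx = sum((x - x_bar) ** 2 for x in avg_means)
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(avg_means, target_means))
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b


# Hypothetical 1-D state means of an average voice model (made-up values).
average_model = {"s1": 5.0, "s2": 5.2, "s3": 5.4, "s4": 5.6}

# Hypothetical per-state means measured on emotional adaptation data.
# State "s4" does not occur in the adaptation data at all.
observed = {"s1": 5.3, "s2": 5.54, "s3": 5.78}

# Estimate one transform from the states that were observed.
seen = sorted(observed)
a, b = fit_shared_affine(
    [average_model[s] for s in seen],
    [observed[s] for s in seen],
)

# Apply the shared transform to every state, seen or unseen.
adapted = {s: a * m + b for s, m in average_model.items()}

for state in sorted(adapted):
    print(state, round(adapted[state], 3))
```

Because the transform is estimated once and applied everywhere, the unseen state `s4` is shifted along with the others. Real HMM adaptation uses multivariate transforms tied over groups of states and also adapts variances, so this sketch only conveys the sharing idea.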
Acknowledgments
The research leading to these results was partly funded by the National Natural Science Foundation of China (Grant Nos. 11664036 and 61263036) and the Natural Science Foundation of Gansu (Grant No. 1506RJYA126).
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wu, P., Yang, H., Gan, Z. (2018). Using Mandarin Training Corpus to Realize a Mandarin-Tibetan Cross-Lingual Emotional Speech Synthesis. In: Tao, J., Zheng, T., Bao, C., Wang, D., Li, Y. (eds) Man-Machine Speech Communication. NCMMSC 2017. Communications in Computer and Information Science, vol 807. Springer, Singapore. https://doi.org/10.1007/978-981-10-8111-8_11
DOI: https://doi.org/10.1007/978-981-10-8111-8_11
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8110-1
Online ISBN: 978-981-10-8111-8
eBook Packages: Computer Science, Computer Science (R0)