Using Mandarin Training Corpus to Realize a Mandarin-Tibetan Cross-Lingual Emotional Speech Synthesis

  • Peiwen Wu
  • Hongwu Yang
  • Zhenye Gan
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 807)


This paper presents a hidden Markov model (HMM)-based method for Mandarin-Tibetan cross-lingual emotional speech synthesis that uses an emotional Mandarin speech corpus with speaker adaptation. We first train a set of average acoustic models by speaker adaptive training on a one-speaker neutral Tibetan corpus and a multi-speaker neutral Mandarin corpus. We then obtain a set of speaker-dependent acoustic models of the target emotion by speaker adaptation with the target emotional Mandarin corpus; these models are used to synthesize emotional Tibetan or Mandarin speech. Subjective evaluations and objective tests show that the method can synthesize both emotional Mandarin speech and emotional Tibetan speech with high naturalness and emotional similarity. The method can therefore be used to realize emotional speech synthesis from an existing emotional training corpus for languages that lack emotional speech resources.
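The adaptation step can be pictured with a toy sketch. The abstract does not say which adaptation algorithm is used, so the code below substitutes a deliberately simplified stand-in: a single global affine transform of scalar Gaussian means, fitted by least squares. The state names (`zh_a`, `zh_i`, `bo_ka`), the numbers, and both function names are hypothetical and are not taken from the paper.

```python
# Toy illustration only: scalar "acoustic" means and one global affine transform.
# All names and values are hypothetical assumptions, not the paper's setup.

def estimate_affine_transform(average_means, adaptation_data):
    """Least-squares fit of x ~= a * mu + b over all (state mean, observation) pairs.

    average_means: dict mapping state -> mean in the average-voice model
    adaptation_data: dict mapping state -> list of observations from the
        target emotional speaker (only states seen in the adaptation corpus)
    """
    pairs = [(average_means[s], x)
             for s, xs in adaptation_data.items() for x in xs]
    n = len(pairs)
    mean_mu = sum(mu for mu, _ in pairs) / n
    mean_x = sum(x for _, x in pairs) / n
    var_mu = sum((mu - mean_mu) ** 2 for mu, _ in pairs)
    if var_mu == 0:
        raise ValueError("need observations from at least two distinct state means")
    cov = sum((mu - mean_mu) * (x - mean_x) for mu, x in pairs)
    a = cov / var_mu
    b = mean_x - a * mean_mu
    return a, b


def adapt_means(average_means, a, b):
    """Apply the shared transform to every state, seen or unseen."""
    return {s: a * mu + b for s, mu in average_means.items()}


# Average-voice model covering Mandarin ("zh_*") and Tibetan ("bo_*") states.
average_means = {"zh_a": 1.0, "zh_i": 2.0, "bo_ka": 3.0}

# Emotional adaptation data exists for Mandarin states only.
adaptation_data = {"zh_a": [2.9, 3.1], "zh_i": [4.9, 5.1]}

a, b = estimate_affine_transform(average_means, adaptation_data)
adapted = adapt_means(average_means, a, b)
print(a, b, adapted)
```

Because the transform is shared across all states, the Tibetan-only state `bo_ka` is shifted as well, even though no emotional Tibetan data was supplied. This is the intuition behind adapting with a Mandarin emotional corpus and still synthesizing emotional Tibetan speech; a real system would operate on multi-dimensional spectral, F0, and duration parameters with a more elaborate transform structure.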


Mandarin-Tibetan cross-lingual emotional speech synthesis; Hidden Markov model (HMM); Speaker adaptive training; Mandarin-Tibetan cross-lingual speech synthesis; Emotional speech synthesis



The research leading to these results was partly funded by the National Natural Science Foundation of China (Grant Nos. 11664036 and 61263036) and the Natural Science Foundation of Gansu (Grant No. 1506RJYA126).



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. College of Physics and Electronic Engineering, Northwest Normal University, Lanzhou, China
