Language-Independent Acoustic Cloning of HTS Voices: An Objective Evaluation

  • Carmen Magariños
  • Daniel Erro
  • Paula Lopez-Otero
  • Eduardo R. Banga
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10077)


In previous work, we presented a method for combining the acoustic characteristics of one speech synthesis model with the linguistic characteristics of another. This paper presents a more extensive evaluation of the method when applied to cross-lingual adaptation: a large number of voices from a Spanish database are adapted to Basque, Catalan, English and Galician. Using a state-of-the-art speaker identification system, we show that the proposed method captures the identity of the target speakers almost as well as standard intra-lingual adaptation techniques.
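The objective evaluation described above scores cloned voices with an i-vector speaker identification system: each utterance is mapped to a fixed-length i-vector, and a cloned voice is judged by how close its i-vector lies to that of the target speaker. As a minimal illustration of this scoring idea only (the paper's actual system uses a full i-vector extractor; the toy vectors, dimensionality, and speaker names below are invented for the sketch), identification can be done by cosine similarity against enrolled speakers:

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two i-vectors (higher = more similar)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(test_ivec, enrolled):
    """Return the enrolled speaker whose i-vector best matches test_ivec."""
    return max(enrolled, key=lambda spk: cosine_score(test_ivec, enrolled[spk]))

# Toy 4-dimensional "i-vectors"; real i-vectors typically have 400-600 dims.
enrolled = {
    "target_speaker": np.array([0.9, 0.1, 0.3, 0.2]),
    "other_speaker":  np.array([-0.2, 0.8, 0.1, 0.5]),
}
cloned_voice_ivec = np.array([0.85, 0.15, 0.25, 0.3])

# A good clone should be identified as the target speaker.
print(identify(cloned_voice_ivec, enrolled))
```

In an evaluation like the one reported, the fraction of cloned utterances identified as the intended target speaker serves as an objective measure of how well speaker identity is preserved.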


HMM-based speech synthesis · Cross-lingual speaker adaptation · Polyglot synthesis · Multilingual synthesis · Voice cloning · I-vectors · Speaker identification



This research was funded by the Spanish Government (projects TEC2015-65345-P and BES-2013-063708), the Galician Government through the research contract GRC2014/024 (Modalidade: Grupos de Referencia Competitiva 2014) and ‘AtlantTIC’ CN2012/160, the European Regional Development Fund (ERDF), and the COST Action IC1206.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Carmen Magariños (1)
  • Daniel Erro (2)
  • Paula Lopez-Otero (1)
  • Eduardo R. Banga (1)
  1. Multimedia Technology Group (GTM), AtlantTIC, University of Vigo, Vigo, Spain
  2. IKERBASQUE – Aholab, University of the Basque Country, Bilbao, Spain
