Study of the Effect of Reducing Training Data in Speech Synthesis Adaptation Based on Frequency Warping

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10077)


Speaker adaptation techniques use a small amount of data to modify Hidden Markov Model (HMM) based speech synthesis systems so that they mimic a target voice. These techniques can provide personalized systems to people with speech impairments, allowing them to communicate in a more natural way. Although adaptation techniques do not require a large amount of data, the recording process can be tedious if the user has difficulty speaking. An important factor in improving the acceptance of these systems is therefore the ability to obtain acceptable results with a minimal number of recordings. In this work, we explore how the performance of an adaptation method based on frequency warping, which uses only vocalic segments, varies with the amount of available training data.


Keywords: Speech adaptation, Statistical speech synthesis, Frequency warping, Dysarthric voice



This work has been partially supported by MINECO/FEDER, UE (SpeechTech4All project, TEC2012-38939-C03-03 and RESTORE project, TEC2015-67163-C2-1-R), and the Basque Government (ELKAROLA project, KK-2015/00098).



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. AHOLAB, University of the Basque Country (UPV/EHU), Bilbao, Spain
  2. Basque Foundation for Science (IKERBASQUE), Bilbao, Spain
