Towards Physically Interpretable Parametric Voice Conversion Functions

  • Daniel Erro
  • Agustín Alonso
  • Luis Serrano
  • Eva Navas
  • Inma Hernáez
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7911)


Typical voice conversion functions based on Gaussian mixture models are opaque, in the sense that it is not straightforward to link the conversion parameters to their physical implications. Following recent work, in this paper we study how physically meaningful constraints can be imposed on a system operating in the cepstral domain in order to obtain more informative conversion functions. The resulting method can be used to study the differences between source and target voices in terms of formant location in frequency, spectral tilt, and amplitude in specific bands.
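As a rough illustration of what a physically constrained conversion in the cepstral domain can look like, the sketch below builds a linear map A for a frequency warping and adds a bias vector standing in for amplitude scaling or spectral tilt. This is not the authors' implementation: the bilinear warping curve, the least-squares construction of A, and the names `bilinear_warp`, `cosine_basis`, `warping_matrix`, `convert`, `alpha`, and `bias` are all assumptions made for the example.

```python
import numpy as np

def bilinear_warp(omega, alpha):
    """Bilinear (all-pass) warping of frequencies omega in [0, pi], |alpha| < 1."""
    return omega + 2.0 * np.arctan2(alpha * np.sin(omega),
                                    1.0 - alpha * np.cos(omega))

def cosine_basis(omega, order):
    """Matrix B such that B @ c is the log-spectrum c0 + 2*sum_k c_k*cos(k*omega)."""
    k = np.arange(order + 1)
    B = 2.0 * np.cos(np.outer(omega, k))
    B[:, 0] = 1.0  # the c0 term is not doubled
    return B

def warping_matrix(alpha, order, n_bins=512):
    """Least-squares matrix A so that A @ c approximates the cepstrum of the
    log-spectrum of c evaluated on the warped frequency axis (assumed construction)."""
    omega = np.linspace(0.0, np.pi, n_bins)
    B = cosine_basis(omega, order)
    B_warped = cosine_basis(bilinear_warp(omega, alpha), order)
    # Solve B @ A = B_warped in the least-squares sense.
    A, *_ = np.linalg.lstsq(B, B_warped, rcond=None)
    return A

def convert(c, alpha, bias):
    """Hypothetical conversion: frequency warping (matrix) plus an additive
    cepstral term, which acts as a multiplicative gain on the spectrum."""
    c = np.asarray(c, dtype=float)
    return warping_matrix(alpha, len(c) - 1) @ c + np.asarray(bias, dtype=float)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c = rng.normal(size=11)
    y = convert(c, alpha=0.1, bias=np.zeros(11))
    print(y.shape)
```

In this form each parameter has a direct reading: `alpha` moves formants along the frequency axis, while `bias` changes the spectral envelope level or slope without moving them.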


Keywords: voice conversion, Gaussian mixture models, frequency warping, amplitude scaling, spectral tilt





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Daniel Erro (1, 2)
  • Agustín Alonso (1)
  • Luis Serrano (1)
  • Eva Navas (1)
  • Inma Hernáez (1)

  1. AHOLAB, University of the Basque Country (UPV/EHU), Bilbao, Spain
  2. Basque Foundation for Science (IKERBASQUE), Bilbao, Spain
