
High quality voice conversion using prosodic and high-resolution spectral features


Abstract

Voice conversion methods have advanced rapidly over the last decade. Studies have shown that speaker characteristics are captured by the spectral feature as well as by various prosodic features. Most existing conversion methods focus on the spectral feature, as it directly represents timbre, while some methods have focused only on the prosodic feature represented by the fundamental frequency (F0). In this paper, we propose a comprehensive framework that uses deep neural networks (DNNs) to convert both timbre and prosodic features. The timbre feature is represented by a high-resolution spectral feature, and the prosodic features comprise F0, intensity and duration. DNNs are well suited to modeling high-dimensional features, and we show that initializing a DNN with our proposed autoencoder pretraining yields a good-quality conversion model. This pretraining is tailored to voice conversion: it leverages an autoencoder to capture the generic spectral shape of source speech. Additionally, our framework uses segmental DNN models to capture the evolution of the prosodic features over time. To reconstruct the converted speech, the spectral feature produced by the DNN model is combined with the three prosodic features produced by the segmental DNN models. Experimental results show that applying both prosodic and high-resolution spectral features leads to high-quality converted speech, as measured by objective evaluation and subjective listening tests.
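To make the two-stage training concrete, below is a minimal PyTorch sketch (not the authors' code) of the idea described in the abstract: an autoencoder is first trained to reconstruct source spectral frames, and its encoder then initializes the conversion DNN, which is fine-tuned on time-aligned source/target frame pairs. Layer sizes, activations, the optimizer, and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

SPEC_DIM = 513   # assumed dimension of the high-resolution spectral feature
HIDDEN = 1024    # assumed hidden-layer width

class Autoencoder(nn.Module):
    """Stage 1 model: learns the generic spectral shape of source speech."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(SPEC_DIM, HIDDEN), nn.Sigmoid())
        self.decoder = nn.Linear(HIDDEN, SPEC_DIM)

    def forward(self, x):
        return self.decoder(self.encoder(x))

class ConversionDNN(nn.Module):
    """Stage 2 model: maps source spectra to target spectra frame by frame."""
    def __init__(self, pretrained_encoder):
        super().__init__()
        self.net = nn.Sequential(
            pretrained_encoder,                      # first layer initialized by the autoencoder
            nn.Linear(HIDDEN, HIDDEN), nn.Sigmoid(),
            nn.Linear(HIDDEN, SPEC_DIM),
        )

    def forward(self, x):
        return self.net(x)

def pretrain_autoencoder(src_frames, epochs=10, lr=1e-4):
    """Train the autoencoder to reconstruct source spectral frames (unsupervised)."""
    ae = Autoencoder()
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(ae(src_frames), src_frames)
        loss.backward()
        opt.step()
    return ae

def train_conversion(src_frames, tgt_frames, ae, epochs=10, lr=1e-4):
    """Fine-tune the conversion DNN on time-aligned source/target frame pairs."""
    model = ConversionDNN(ae.encoder)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(src_frames), tgt_frames)
        loss.backward()
        opt.step()
    return model

# Example with random stand-in data; in practice the frames would come from
# a high-resolution spectral analysis (e.g., STRAIGHT) with frame alignment.
src = torch.randn(1000, SPEC_DIM)
tgt = torch.randn(1000, SPEC_DIM)
model = train_conversion(src, tgt, pretrain_autoencoder(src))
converted = model(src)
```

The segmental prosodic models for F0, intensity and duration would follow the same train-then-predict pattern, operating on segment-level feature vectors rather than individual frames.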



Acknowledgments

This research is supported by the National Research Foundation, Prime Minister's Office, Singapore, under its IDM Futures Funding Initiative, and administered by the Interactive and Digital Media Programme Office.

Author information


Corresponding author

Correspondence to Hy Quy Nguyen.


About this article


Cite this article

Nguyen, H.Q., Lee, S.W., Tian, X. et al. High quality voice conversion using prosodic and high-resolution spectral features. Multimed Tools Appl 75, 5265–5285 (2016). https://doi.org/10.1007/s11042-015-3039-x

