
High quality voice conversion using prosodic and high-resolution spectral features


Abstract

Voice conversion methods have advanced rapidly over the last decade. Studies have shown that speaker characteristics are captured by the spectral feature as well as by various prosodic features. Most existing conversion methods focus on the spectral feature, as it directly represents timbre, while some methods have focused only on the prosodic feature represented by the fundamental frequency (F0). In this paper, we propose a comprehensive framework that uses deep neural networks (DNNs) to convert both timbre and prosodic features. The timbre feature is represented by a high-resolution spectral feature, and the prosodic features comprise F0, intensity and duration. DNNs are well suited to modeling high-dimensional features, and we show that initializing a DNN with our proposed autoencoder pretraining yields a good-quality conversion model. This pretraining is tailored to voice conversion: it leverages an autoencoder to capture the generic spectral shape of source speech. Additionally, our framework uses segmental DNN models to capture the evolution of the prosodic features over time. To reconstruct the converted speech, the spectral feature produced by the DNN model is combined with the three prosodic features produced by the segmental DNN models. Experimental results show that applying both prosodic and high-resolution spectral features leads to high-quality converted speech, as measured by objective evaluation and subjective listening tests.
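To make the two-stage training concrete, below is a minimal PyTorch sketch (not the authors' code) of the idea described in the abstract: an autoencoder is first trained to reconstruct source spectral frames, and its encoder then initializes the conversion DNN, which is fine-tuned on time-aligned source/target frame pairs. Layer sizes, activations, the optimizer, and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

SPEC_DIM = 513   # assumed dimension of the high-resolution spectral feature
HIDDEN = 1024    # assumed hidden-layer width

class Autoencoder(nn.Module):
    """Stage 1 model: learns the generic spectral shape of source speech."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(SPEC_DIM, HIDDEN), nn.Sigmoid())
        self.decoder = nn.Linear(HIDDEN, SPEC_DIM)

    def forward(self, x):
        return self.decoder(self.encoder(x))

class ConversionDNN(nn.Module):
    """Stage 2 model: maps source spectra to target spectra frame by frame."""
    def __init__(self, pretrained_encoder):
        super().__init__()
        self.net = nn.Sequential(
            pretrained_encoder,                      # first layer initialized by the autoencoder
            nn.Linear(HIDDEN, HIDDEN), nn.Sigmoid(),
            nn.Linear(HIDDEN, SPEC_DIM),
        )

    def forward(self, x):
        return self.net(x)

def pretrain_autoencoder(src_frames, epochs=10, lr=1e-4):
    """Train the autoencoder to reconstruct source spectral frames (unsupervised)."""
    ae = Autoencoder()
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(ae(src_frames), src_frames)
        loss.backward()
        opt.step()
    return ae

def train_conversion(src_frames, tgt_frames, ae, epochs=10, lr=1e-4):
    """Fine-tune the conversion DNN on time-aligned source/target frame pairs."""
    model = ConversionDNN(ae.encoder)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(src_frames), tgt_frames)
        loss.backward()
        opt.step()
    return model

# Example with random stand-in data; in practice the frames would come from
# a high-resolution spectral analysis (e.g., STRAIGHT) with frame alignment.
src = torch.randn(1000, SPEC_DIM)
tgt = torch.randn(1000, SPEC_DIM)
model = train_conversion(src, tgt, pretrain_autoencoder(src))
converted = model(src)
```

The segmental prosodic models for F0, intensity and duration would follow the same train-then-predict pattern, operating on segment-level feature vectors rather than individual frames.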



Acknowledgments

This research is supported by the National Research Foundation, Prime Minister's Office, Singapore, under its IDM Futures Funding Initiative, and administered by the Interactive and Digital Media Programme Office.

Author information


Corresponding author

Correspondence to Hy Quy Nguyen.


About this article


Cite this article

Nguyen, H.Q., Lee, S.W., Tian, X. et al. High quality voice conversion using prosodic and high-resolution spectral features. Multimed Tools Appl 75, 5265–5285 (2016). https://doi.org/10.1007/s11042-015-3039-x

