Voice Conversion for TTS Systems with Tuning on the Target Speaker Based on GMM

Zahariev, Vadim; Azarov, Elias; Petrovsky, Alexander

doi:10.1007/978-3-319-66429-3_79

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10458)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

The paper is devoted to improving voice conversion (VC) methods for text-to-speech synthesis systems that can be tuned to a target speaker. It presents such a system, with a VC module in the acoustic processor and a parametric representation of the speech database for concatenative synthesis based on instantaneous harmonic representation. Voice conversion is based on a multiple regression mapping function and a Gaussian mixture model (GMM); text-independent learning is based on hidden Markov models and a modified Viterbi algorithm. An experimental evaluation of the proposed solutions in terms of naturalness and similarity is presented as well.
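
To make the GMM-based mapping mentioned in the abstract more concrete, here is a minimal sketch of the standard joint-density GMM regression commonly used in voice conversion. It is not the authors' multiple regression function, whose exact form is not given on this page. The function names (`gaussian_pdf`, `convert_frame`) and the parameter layout are hypothetical, and the GMM parameters are assumed to have been trained already on time-aligned source and target feature vectors.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density at x (full covariance)."""
    d = x.shape[0]
    diff = x - mean
    maha = diff @ np.linalg.solve(cov, diff)   # squared Mahalanobis distance
    log_det = np.linalg.slogdet(cov)[1]        # log |cov|
    return np.exp(-0.5 * (maha + log_det + d * np.log(2.0 * np.pi)))

def convert_frame(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """Map one source feature vector x to the target speaker's space.

    Assumed (hypothetical) parameter layout, one entry per mixture component:
      weights : mixture weights
      mu_x    : source means          mu_y   : target means
      cov_xx  : source covariances    cov_yx : target-source cross-covariances
    """
    # Posterior probability of each component given the source frame.
    likes = np.array([w * gaussian_pdf(x, m, c)
                      for w, m, c in zip(weights, mu_x, cov_xx)])
    post = likes / likes.sum()

    # Posterior-weighted sum of per-component linear regressions.
    y = np.zeros(mu_y[0].shape[0])
    for p, mx, my, cxx, cyx in zip(post, mu_x, mu_y, cov_xx, cov_yx):
        y += p * (my + cyx @ np.linalg.solve(cxx, x - mx))
    return y

# Toy usage with a single 2-D component (illustrative values only).
weights = np.array([1.0])
mu_x = np.array([[0.0, 0.0]])
mu_y = np.array([[1.0, 1.0]])
cov_xx = np.array([np.eye(2)])
cov_yx = np.array([2.0 * np.eye(2)])
converted = convert_frame(np.array([1.0, 2.0]), weights, mu_x, mu_y, cov_xx, cov_yx)
```

In this formulation each mixture component contributes its own linear transform, and the posterior weights blend the transforms smoothly from frame to frame. A practical system would compute the posteriors in the log domain to avoid numerical underflow with high-dimensional features.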

Acknowledgment

This work was supported by IT4YOU company (Moscow, Russian Federation).

Author information

Authors and Affiliations

Belarusian State University of Informatics and Radioelectronics, Minsk, Belarus
Vadim Zahariev, Elias Azarov & Alexander Petrovsky

Corresponding author

Correspondence to Vadim Zahariev.

Editor information

Editors and Affiliations

SPIIRAS, Saint Petersburg, Russia
Alexey Karpov
Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova
University of Hertfordshire, Hatfield, United Kingdom
Iosif Mporas

About this paper

Cite this paper

Zahariev, V., Azarov, E., Petrovsky, A. (2017). Voice Conversion for TTS Systems with Tuning on the Target Speaker Based on GMM. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science, vol 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_79

DOI: https://doi.org/10.1007/978-3-319-66429-3_79
Published: 13 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66428-6
Online ISBN: 978-3-319-66429-3
eBook Packages: Computer Science, Computer Science (R0)
