
Cloning and Conversion of an Arbitrary Voice Using Generative Flows

  • Thematic Issue
  • Published in Automation and Remote Control, 2022

Abstract

To improve the quality of generated speech, this paper proposes a method that takes time-varying speaker information into account. With this technique, the system synthesizes more natural speech whose voice closely matches the given target voice in both the voice cloning and voice conversion problems.
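A minimal sketch of the general idea, not the paper's implementation: instead of a single utterance-level speaker embedding, a sequence of per-window embeddings is extracted from the reference audio and upsampled to the frame rate to condition the synthesizer. All names, window sizes, and dimensions below (WindowedSpeakerEncoder, win, hop, d_spk) are illustrative assumptions.

import torch
import torch.nn as nn

class WindowedSpeakerEncoder(nn.Module):
    """Maps overlapping mel-spectrogram windows to a sequence of speaker vectors
    (hypothetical encoder; the paper's actual model may differ)."""
    def __init__(self, n_mels: int = 80, d_spk: int = 256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, d_spk, batch_first=True)

    def forward(self, mels: torch.Tensor, win: int = 100, hop: int = 50) -> torch.Tensor:
        # mels: (batch, frames, n_mels) -> (batch, n_windows, d_spk)
        embs = []
        for start in range(0, mels.size(1) - win + 1, hop):
            _, h = self.rnn(mels[:, start:start + win])
            embs.append(h[-1])                 # last hidden state of each window
        return torch.stack(embs, dim=1)

# Usage: the embedding sequence is upsampled to the decoder's frame rate and
# concatenated with its per-frame inputs, so the speaker condition can vary in time.
encoder = WindowedSpeakerEncoder()
ref_mels = torch.randn(1, 400, 80)                         # reference utterance, 400 frames
spk_seq = encoder(ref_mels)                                # (1, 7, 256)
frame_cond = torch.repeat_interleave(spk_seq, 50, dim=1)   # one vector per hop of frames
print(spk_seq.shape, frame_cond.shape)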



Author information

Correspondence to D. S. Obukhov.

Additional information

Translated by V. Potapchouck


About this article


Cite this article

Obukhov, D.S. Cloning and Conversion of an Arbitrary Voice Using Generative Flows. Autom Remote Control 83, 1555–1566 (2022). https://doi.org/10.1134/S00051179220100083
