
Boosting StarGANs for Voice Conversion with Contrastive Discriminator

  • Conference paper
  • In: Neural Information Processing (ICONIP 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13624)

Abstract

Nonparallel multi-domain voice conversion methods such as StarGAN-VC have been widely applied in many scenarios. However, training these models is often challenging due to their complicated adversarial network architectures. To address this, we leverage state-of-the-art contrastive learning techniques and incorporate an efficient Siamese network structure into the StarGAN discriminator. The resulting method, SimSiam-StarGAN-VC, improves training stability and effectively prevents the discriminator from overfitting during training. We conduct experiments on the Voice Conversion Challenge (VCC 2018) dataset, together with a user study, to validate the performance of our framework. Experimental results show that SimSiam-StarGAN-VC significantly outperforms existing StarGAN-VC methods in terms of both objective and subjective metrics.
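At a high level, the described discriminator carries two heads over a shared encoder: the usual real/fake head for the adversarial loss, and a SimSiam-style projector/predictor pair that compares two augmented views of the same input using a negative cosine similarity with stop-gradient. The sketch below illustrates this structure in PyTorch under our own assumptions; the class and helper names (SimSiamDiscriminator, simsiam_loss, augment) and all layer sizes are hypothetical, not taken from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimSiamDiscriminator(nn.Module):
    """GAN discriminator with an auxiliary SimSiam contrastive branch (illustrative)."""

    def __init__(self, feat_dim=256, proj_dim=128):
        super().__init__()
        # Shared convolutional encoder over 1-channel mel-spectrograms.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, feat_dim, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Standard real/fake head for the adversarial loss.
        self.adv_head = nn.Linear(feat_dim, 1)
        # SimSiam projector and predictor heads (Chen & He, 2021).
        self.projector = nn.Sequential(
            nn.Linear(feat_dim, proj_dim),
            nn.BatchNorm1d(proj_dim),
            nn.ReLU(),
            nn.Linear(proj_dim, proj_dim),
        )
        self.predictor = nn.Sequential(
            nn.Linear(proj_dim, proj_dim // 2),
            nn.ReLU(),
            nn.Linear(proj_dim // 2, proj_dim),
        )

    def forward(self, x):
        h = self.encoder(x)
        return self.adv_head(h), self.projector(h)


def simsiam_loss(disc, view1, view2):
    """Symmetric negative cosine similarity between the two views."""
    _, z1 = disc(view1)
    _, z2 = disc(view2)
    p1, p2 = disc.predictor(z1), disc.predictor(z2)
    # Stop-gradient (detach) on the target branch prevents representational collapse.
    return -0.5 * (F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
                   + F.cosine_similarity(p2, z1.detach(), dim=-1).mean())

# Example usage: two augmented views of the same mel-spectrogram
# (e.g., SpecAugment-style time/frequency masking); augment() is hypothetical.
# disc = SimSiamDiscriminator()
# loss_ctr = simsiam_loss(disc, augment(mel), augment(mel))
```

In a setup like this, the contrastive term would simply be added to the standard adversarial discriminator objective with a weighting coefficient, regularizing the shared encoder and discouraging the discriminator from memorizing the training data.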



Acknowledgment

This work was supported by the Key Research and Development Program of Guangdong Province under grant No. 2021B0101400003.

Author information


Corresponding author

Correspondence to Jianzong Wang.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Si, S., Wang, J., Zhang, X., Qu, X., Cheng, N., Xiao, J. (2023). Boosting StarGANs for Voice Conversion with Contrastive Discriminator. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Lecture Notes in Computer Science, vol 13624. Springer, Cham. https://doi.org/10.1007/978-3-031-30108-7_30


  • DOI: https://doi.org/10.1007/978-3-031-30108-7_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-30107-0

  • Online ISBN: 978-3-031-30108-7

  • eBook Packages: Computer Science, Computer Science (R0)
