Abstract
Nonparallel multi-domain voice conversion methods such as StarGAN-VC have been widely applied in many scenarios. However, training these models is often challenging because of their complicated adversarial network architectures. To address this, we leverage state-of-the-art contrastive learning techniques and incorporate an efficient Siamese network structure into the StarGAN discriminator. Our method, called SimSiam-StarGAN-VC, improves training stability and effectively prevents discriminator overfitting during training. We conduct experiments on the Voice Conversion Challenge 2018 (VCC 2018) dataset, together with a user study, to validate the performance of our framework. Experimental results show that SimSiam-StarGAN-VC significantly outperforms existing StarGAN-VC methods on both objective and subjective metrics.
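The abstract builds on the SimSiam objective of Chen and He (2021): two augmented views of an input are encoded, and a predictor head on one branch is trained to match the (stop-gradient) projection of the other via negative cosine similarity. The sketch below shows only this symmetrized loss as a forward-pass NumPy computation; the function names are illustrative, and the paper's actual discriminator integration (and the stop-gradient, which requires an autograd framework) is not reproduced here.

```python
import numpy as np

def neg_cosine(p, z):
    # D(p, z) = -<p/|p|, z/|z|>, averaged over the batch.
    # In SimSiam, z is treated as a constant (stop-gradient);
    # this forward-only version just computes the loss value.
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return -(p * z).sum(axis=1).mean()

def simsiam_loss(p1, z1, p2, z2):
    # Symmetrized objective from Chen & He (2021):
    #   L = D(p1, sg(z2)) / 2 + D(p2, sg(z1)) / 2
    # where (p_i, z_i) are predictor outputs and projections of
    # two augmented views of the same utterance.
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)
```

When both views collapse to identical representations the loss reaches its minimum of -1, which is why SimSiam relies on the stop-gradient (and, in this paper's setting, the adversarial signal) to avoid degenerate solutions.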
Acknowledgment
This work is supported by the Key Research and Development Program of Guangdong Province under grant No. 2021B0101400003.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Si, S., Wang, J., Zhang, X., Qu, X., Cheng, N., Xiao, J. (2023). Boosting StarGANs for Voice Conversion with Contrastive Discriminator. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Lecture Notes in Computer Science, vol 13624. Springer, Cham. https://doi.org/10.1007/978-3-031-30108-7_30
Print ISBN: 978-3-031-30107-0
Online ISBN: 978-3-031-30108-7