Abstract
Nonparallel multi-domain voice conversion methods such as StarGAN-VC have been widely applied in many scenarios. However, training these models is often challenging because of their complicated adversarial network architectures. To address this, we leverage state-of-the-art contrastive learning techniques and incorporate an efficient Siamese network structure into the StarGAN discriminator. Our method, called SimSiam-StarGAN-VC, improves training stability and effectively prevents discriminator overfitting during training. We conduct experiments on the Voice Conversion Challenge 2018 (VCC 2018) dataset, together with a user study, to validate the performance of our framework. Experimental results show that SimSiam-StarGAN-VC significantly outperforms existing StarGAN-VC methods on both objective and subjective metrics.
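The abstract builds on the SimSiam objective of Chen and He (2021): two augmented views of an input are encoded, and a predictor head on one branch is trained to match the (stop-gradient) projection of the other via negative cosine similarity. The sketch below shows only this symmetrized loss as a forward-pass NumPy computation; the function names are illustrative, and the paper's actual discriminator integration (and the stop-gradient, which requires an autograd framework) is not reproduced here.

```python
import numpy as np

def neg_cosine(p, z):
    # D(p, z) = -<p/|p|, z/|z|>, averaged over the batch.
    # In SimSiam, z is treated as a constant (stop-gradient);
    # this forward-only version just computes the loss value.
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return -(p * z).sum(axis=1).mean()

def simsiam_loss(p1, z1, p2, z2):
    # Symmetrized objective from Chen & He (2021):
    #   L = D(p1, sg(z2)) / 2 + D(p2, sg(z1)) / 2
    # where (p_i, z_i) are predictor outputs and projections of
    # two augmented views of the same utterance.
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)
```

When both views collapse to identical representations the loss reaches its minimum of -1, which is why SimSiam relies on the stop-gradient (and, in this paper's setting, the adversarial signal) to avoid degenerate solutions.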
Acknowledgment
This work is supported by the Key Research and Development Program of Guangdong Province under grant No. 2021B0101400003.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Si, S., Wang, J., Zhang, X., Qu, X., Cheng, N., Xiao, J. (2023). Boosting StarGANs for Voice Conversion with Contrastive Discriminator. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Lecture Notes in Computer Science, vol 13624. Springer, Cham. https://doi.org/10.1007/978-3-031-30108-7_30
Print ISBN: 978-3-031-30107-0
Online ISBN: 978-3-031-30108-7