Abstract
In this paper, we propose adversarial training based on meta-learning (AML) for automatic speaker verification (ASV). Existing ASV systems often perform poorly when applied to unseen data whose domain differs from that of the training data, for example in scene noise or speaking style. To address this, the proposed model consists of a backbone and an additional domain attention module, both optimized via meta-learning to improve the generalization of the speaker embedding space. We adopt domain-level adversarial training so that the generated embeddings carry less domain-discriminative information. We also propose an improved episode-level balanced sampling strategy that simulates real-world domain shift, which is an essential factor in the improvement our model achieves. The domain attention module uses multi-layer convolution with bi-linear attention. We evaluate the proposed method on CNCeleb and VoxCeleb, and the results show that combining adversarial training with meta-learning effectively improves performance in unseen domains.
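The abstract does not spell out how the episode-level balanced sampling works, so the following is only a minimal sketch of one plausible reading: in each episode the available domains are split into disjoint meta-train and meta-test groups, and every domain contributes the same number of utterances. The function name `sample_episode`, its parameters, and the `(utt_id, speaker, domain)` tuple layout are all hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def sample_episode(utterances, n_meta_test_domains=1, per_domain=4, rng=None):
    """Draw one domain-balanced episode (illustrative assumption, not the paper's code).

    utterances: list of (utt_id, speaker, domain) tuples.
    Returns (meta_train, meta_test), two lists of utterance tuples whose
    domains are disjoint, mimicking a shift to an unseen domain.
    """
    rng = rng or random.Random()

    # Group utterances by their domain label.
    by_domain = defaultdict(list)
    for utt in utterances:
        by_domain[utt[2]].append(utt)

    domains = sorted(by_domain)
    if len(domains) <= n_meta_test_domains:
        raise ValueError("need more domains than meta-test domains")

    # Randomly hold out some domains as the simulated unseen (meta-test) domains.
    rng.shuffle(domains)
    test_domains = domains[:n_meta_test_domains]
    train_domains = domains[n_meta_test_domains:]

    def draw(domain_list):
        batch = []
        for d in domain_list:
            pool = by_domain[d]
            if len(pool) >= per_domain:
                # Enough data: sample without replacement.
                batch.extend(rng.sample(pool, per_domain))
            else:
                # Small domain: sample with replacement so it still
                # contributes exactly per_domain utterances.
                batch.extend(rng.choices(pool, k=per_domain))
        return batch

    return draw(train_domains), draw(test_domains)
```

In a meta-learning loop, the meta-train batch would drive the inner update and the meta-test batch the outer update; balancing per domain keeps large domains from dominating either step. Whether the paper balances by utterance, by speaker, or by both is not stated in the abstract.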
Acknowledgements
This work was supported by the Leading Plan of CAS (XDC08030200).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhang, JT., Fang, X., Li, J., Song, Y., Dai, LR. (2023). Adversarial Training Based on Meta-Learning in Unseen Domains for Speaker Verification. In: Zhenhua, L., Jianqing, G., Kai, Y., Jia, J. (eds) Man-Machine Speech Communication. NCMMSC 2022. Communications in Computer and Information Science, vol 1765. Springer, Singapore. https://doi.org/10.1007/978-981-99-2401-1_11
DOI: https://doi.org/10.1007/978-981-99-2401-1_11
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-2400-4
Online ISBN: 978-981-99-2401-1
eBook Packages: Computer Science, Computer Science (R0)