Self-supervised Speaker Verification Employing Augmentation Mix and Self-augmented Training-Based Clustering

Fathan, Abderrahim; Alam, Jahangir

doi:10.1007/978-3-031-48312-7_44

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14339)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

Using clustering algorithms to generate pseudo-labels for optimizing speaker embedding networks is a widely used practice for training self-supervised speaker verification systems. Although pseudo-label-based self-supervised training schemes have shown outstanding performance, they depend on high-quality pseudo-labels, and recent studies have shown that label noise can substantially degrade downstream performance. In this paper, we propose a general-purpose clustering algorithm called CAMSAT that outperforms all other baselines used to cluster speaker embeddings. Moreover, using the generated pseudo-labels to train our speaker embedding systems further improves speaker verification performance. CAMSAT is based on two principles: (1) Augmentation Mix (AM), which mixes predictions of augmented samples to provide a complementary supervisory signal for clustering and to enforce symmetry within augmentations, and (2) Self-Augmented Training (SAT), which enforces representation invariance and maximizes the information-theoretic dependency between samples and their predicted pseudo-labels. We provide a thorough comparative analysis of our clustering method against all baselines using a variety of clustering metrics, and we perform an ablation study to analyze the contribution of each component of our system.
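The two principles named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' formulation, which the abstract does not spell out. It assumes an information-maximization style of objective: a mutual-information estimate between samples and soft cluster assignments, a KL consistency term between clean and augmented predictions for SAT, and a KL term against the average of several augmented predictions for AM. All function names and the weights `lam` and `mu` are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax turning cluster logits into soft assignments."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) along the last axis."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def mutual_information(probs):
    """Batch estimate of I(X; Y) = H(Y) - H(Y | X) for soft assignments."""
    marginal = probs.mean(axis=0)            # empirical p(y) over the batch
    return entropy(marginal) - entropy(probs).mean()

def kl(p, q, eps=1e-12):
    """Row-wise KL(p || q)."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)

def sat_loss(p_clean, p_aug):
    """Assumed SAT term: predictions should not change under augmentation."""
    return kl(p_clean, p_aug).mean()

def augmentation_mix_loss(p_clean, p_aug_list):
    """Assumed AM term: match the mixture (mean) of augmented predictions."""
    p_mix = np.mean(p_aug_list, axis=0)
    return kl(p_clean, p_mix).mean()

def total_loss(p_clean, p_aug_list, lam=0.1, mu=1.0):
    """Hypothetical combination: consistency terms minus weighted MI."""
    sat = np.mean([sat_loss(p_clean, p) for p in p_aug_list])
    am = augmentation_mix_loss(p_clean, p_aug_list)
    return sat + mu * am - lam * mutual_information(p_clean)

# Toy usage: 8 utterances, 4 clusters, two noisy "augmented" views.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 4))
p_clean = softmax(logits)
p_augs = [softmax(logits + 0.1 * rng.normal(size=logits.shape)) for _ in range(2)]
print(total_loss(p_clean, p_augs))
```

Under these assumptions, the mutual-information term is largest when assignments are confident and spread evenly across clusters, while the two KL terms penalize predictions that change under augmentation. In practice the probabilities would come from a trainable network and the loss would be minimized by gradient descent.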

Acknowledgments

The authors wish to acknowledge the funding from the Government of Canada’s New Frontiers in Research Fund (NFRF) through grant NFRFR-2021-00338.

Author information

Authors and Affiliations

Computer Research Institute of Montreal, Montreal, QC, H3N 1M3, Canada
Abderrahim Fathan & Jahangir Alam

Corresponding author

Correspondence to Abderrahim Fathan.

Editor information

Editors and Affiliations

St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia
Alexey Karpov
Koneru Lakshmaiah Education Foundation, Vaddeswaram, India
K. Samudravijaya
Indian Institute of Information Technology Dharwad, Dharwad, India
K. T. Deepak
Indian Institute of Technology Dharwad, Dharwad, India
Rajesh M. Hegde
KIIT Group of Colleges, Gurugram, India
Shyam S. Agrawal
Indian Institute of Technology Dharwad, Dharwad, India
S. R. Mahadeva Prasanna

About this paper

Cite this paper

Fathan, A., Alam, J. (2023). Self-supervised Speaker Verification Employing Augmentation Mix and Self-augmented Training-Based Clustering. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science, vol 14339. Springer, Cham. https://doi.org/10.1007/978-3-031-48312-7_44

DOI: https://doi.org/10.1007/978-3-031-48312-7_44
Published: 22 November 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-48311-0
Online ISBN: 978-3-031-48312-7
eBook Packages: Computer Science, Computer Science (R0)