Self-supervised Speaker Verification Employing Augmentation Mix and Self-augmented Training-Based Clustering

  • Conference paper
Speech and Computer (SPECOM 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14339)

Abstract

Optimizing speaker embedding networks with pseudo-labels produced by clustering algorithms is a widely used way to train self-supervised speaker verification systems. Although pseudo-label-based self-supervised training has shown outstanding performance, it depends on high-quality pseudo-labels, and recent studies have shown that label noise can substantially degrade downstream performance. In this paper, we propose a general-purpose clustering algorithm called CAMSAT that outperforms all other baselines used to cluster speaker embeddings. Moreover, training our speaker embedding systems on the generated pseudo-labels further improves speaker verification performance. CAMSAT is based on two principles: (1) Augmentation Mix (AM), which mixes the predictions of augmented samples to provide a complementary supervisory signal for clustering and to enforce symmetry across augmentations, and (2) Self-Augmented Training (SAT), which enforces representation invariance and maximizes the information-theoretic dependency between samples and their predicted pseudo-labels. We provide a thorough comparison of our clustering method against all baselines using a variety of clustering metrics, and we perform an ablation study to analyze the contribution of each component of our system.
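To make the two principles concrete, the following is a minimal sketch of the kinds of quantities the abstract names; it is not the authors' implementation. It assumes the "information-theoretic dependency" is estimated as the mutual information I(X; Y) = H(Y) - H(Y|X) computed from cluster-probability predictions, that the invariance term is a KL divergence between predictions for a clean sample and its augmented version, and that predictions of several augmentations are mixed by a convex combination. All function names and the mixing weights are hypothetical.

```python
import math


def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of one probability vector."""
    return -sum(pi * math.log(pi + eps) for pi in p)


def mutual_information(probs):
    """Estimate I(X; Y) = H(Y) - H(Y|X) from per-sample cluster probabilities.

    probs is a list of probability vectors, one per sample (assumed to be
    softmax outputs of a clustering head).
    """
    n, k = len(probs), len(probs[0])
    # Marginal cluster distribution, averaged over the batch.
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]
    # Average per-sample entropy approximates the conditional entropy H(Y|X).
    conditional = sum(entropy(row) for row in probs) / n
    return entropy(marginal) - conditional


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two probability vectors."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def sat_penalty(clean, augmented):
    """Mean KL between clean and augmented predictions (assumed invariance term)."""
    return sum(kl_divergence(p, q) for p, q in zip(clean, augmented)) / len(clean)


def mix_predictions(views, weights):
    """Convex combination of predictions from several augmented views.

    The weights are assumed to be non-negative and to sum to 1.
    """
    k = len(views[0])
    return [sum(w * v[j] for w, v in zip(weights, views)) for j in range(k)]


if __name__ == "__main__":
    confident = [[1.0, 0.0], [0.0, 1.0]]   # balanced, confident assignments
    uncertain = [[0.5, 0.5], [0.5, 0.5]]   # uninformative assignments
    print(mutual_information(confident))    # close to log(2)
    print(mutual_information(uncertain))    # close to 0
    print(sat_penalty(confident, confident))
    print(mix_predictions(confident, [0.5, 0.5]))
```

Under these assumptions, the mutual information is highest when each sample is assigned confidently and the clusters are used evenly, while the penalty is zero when predictions do not change under augmentation.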

Acknowledgments

The authors wish to acknowledge the funding from the Government of Canada's New Frontiers in Research Fund (NFRF) through grant NFRFR-2021-00338.

Author information

Corresponding author

Correspondence to Abderrahim Fathan.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Fathan, A., Alam, J. (2023). Self-supervised Speaker Verification Employing Augmentation Mix and Self-augmented Training-Based Clustering. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science, vol 14339. Springer, Cham. https://doi.org/10.1007/978-3-031-48312-7_44

  • DOI: https://doi.org/10.1007/978-3-031-48312-7_44

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-48311-0

  • Online ISBN: 978-3-031-48312-7

  • eBook Packages: Computer Science, Computer Science (R0)
