Improving speech embedding using crossmodal transfer learning with audio-visual data

  • Nam Le
  • Jean-Marc Odobez


Learning a discriminative voice embedding allows speaker turns to be compared directly and efficiently, which is crucial for tasks such as diarization and verification. This paper investigates several transfer learning approaches to improve a voice embedding using knowledge transferred from a face representation. The main idea of our crossmodal approaches is to constrain the target voice embedding space to share latent attributes with the source face embedding space. The shared latent attributes can be formalized as geometric properties or distribution characteristics between these embedding spaces. We propose four transfer learning approaches belonging to two categories. The first category relies on the structure of the source face embedding space to regularize the speaker turn embedding space at different granularities. The second category, a domain adaptation approach, improves the embedding space of speaker turns by applying a maximum mean discrepancy loss to minimize the disparity between the distributions of the embedded features. Experiments are conducted on TV news datasets, REPERE and ETAPE, to evaluate our methods. Quantitative results in verification and clustering tasks show promising improvement, especially in cases where speaker turns are short or the training data size is limited. The analysis also gives insights into the embedding spaces and shows their potential applications.
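
To make the maximum mean discrepancy (MMD) idea concrete, the following is a minimal sketch of a biased squared-MMD estimate between two sets of embedding vectors. It is not the implementation used in the paper: the Gaussian kernel, the bandwidth `sigma`, the helper names `gaussian_kernel`, `mmd_squared` and `sample`, and the random toy data standing in for face and voice embeddings are all assumptions made for illustration. It uses only the Python standard library.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)). Kernel choice is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(source, target, sigma=1.0):
    """Biased estimate of squared MMD between two lists of equal-dimension vectors."""

    def mean_kernel(xs, ys):
        # Average kernel value over all pairs (x, y), including x == y pairs.
        total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2.0 * mean_kernel(source, target)
    )


# Toy data (hypothetical): random vectors standing in for embedded features.
random.seed(0)
dim, n = 4, 50


def sample(mean):
    """Draw n vectors of size dim from an isotropic Gaussian centred at `mean`."""
    return [[random.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


faces = sample(0.0)        # stand-in for source (face) embeddings
voices_near = sample(0.0)  # target embeddings drawn from the same distribution
voices_far = sample(3.0)   # target embeddings drawn from a shifted distribution

print("MMD^2, matched distributions:", mmd_squared(faces, voices_near))
print("MMD^2, shifted distributions:", mmd_squared(faces, voices_far))
```

In this toy setting the estimate is small when both sets come from the same distribution and grows when the target set is shifted. Used as a training loss, minimizing such a quantity would pull the distribution of the target embeddings toward that of the source embeddings.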


Speaker diarization, Multimodal identification, Metric learning, Transfer learning, Deep learning



This work was supported by the EU projects EUMSSI - Event Understanding through Multimodal Social Stream Interpretation (FP7-611057) and MuMMER - MultiModal Mall Entertainment Robot (H2020-688147).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Idiap Research Institute, Martigny, Switzerland
  2. École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
