Learnable PINs: Cross-modal Embeddings for Person Identity

  • Arsha Nagrani
  • Samuel Albanie
  • Andrew Zisserman
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11217)


Abstract

We propose and investigate an identity-sensitive joint embedding of face and voice. Such an embedding enables cross-modal retrieval from voice to face and from face to voice.
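Once face and voice embeddings live in a shared space, cross-modal retrieval reduces to nearest-neighbour search. The following is a minimal sketch (not the paper's implementation; the function name and toy vectors are hypothetical) of ranking a gallery of face embeddings against a voice query by cosine similarity:

```python
import numpy as np

def cross_modal_retrieval(query_emb, gallery_embs):
    """Rank gallery items (e.g. face embeddings) by cosine similarity
    to a query embedding from the other modality (e.g. a voice)."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q                   # cosine similarity per gallery item
    return np.argsort(-sims)       # gallery indices, best match first

# Toy example: three gallery faces; the voice query lies closest to face 1.
gallery = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
voice = np.array([0.5, 0.9])
ranking = cross_modal_retrieval(voice, gallery)  # -> [1, 2, 0]
```

Because both modalities are mapped into the same space, the same function serves face-to-voice retrieval by swapping query and gallery roles.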

We make the following four contributions: first, we show that the embedding can be learnt from videos of talking faces, without requiring any identity labels, using a form of cross-modal self-supervision; second, we develop a curriculum learning schedule for hard negative mining targeted to this task that is essential for learning to proceed successfully; third, we demonstrate and evaluate cross-modal retrieval for identities unseen and unheard during training over a number of scenarios, and establish a benchmark for this novel task; finally, we show an application of the joint embedding to automatically retrieving and labelling characters in TV dramas.
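The first two contributions can be illustrated together. Faces and voices extracted from the same video clip form positive pairs for free (cross-modal self-supervision, no identity labels), and a curriculum gradually narrows negative sampling from random in-batch negatives towards the hardest ones. The exact loss and schedule are in the paper; this is a rough, hypothetical NumPy sketch in which `stage` (assumed here, not the authors' parameterisation) moves from 0 (easy) to 1 (hardest-only):

```python
import numpy as np

def curriculum_triplet_loss(face, voice, stage, margin=0.2):
    """Triplet-style loss over a batch of (face, voice) pairs that
    co-occur in the same clip, so matched pairs need no identity labels.
    `stage` in [0, 1] controls the curriculum: early on, negatives are
    drawn among all in-batch candidates; as stage -> 1, only each
    anchor's hardest (most similar) negative is used."""
    n = face.shape[0]
    f = face / np.linalg.norm(face, axis=1, keepdims=True)
    v = voice / np.linalg.norm(voice, axis=1, keepdims=True)
    sim = f @ v.T                        # pairwise face-voice similarity
    pos = np.diag(sim)                   # matched pairs sit on the diagonal
    neg_sim = sim.copy()
    np.fill_diagonal(neg_sim, -np.inf)   # exclude positives as negatives
    # Curriculum: sample each anchor's negative from its k hardest,
    # where k shrinks from n-1 (near-random) down to 1 (hardest only).
    k = max(1, int(round((1.0 - stage) * (n - 1))))
    hardest = np.argsort(-neg_sim, axis=1)[:, :k]
    rng = np.random.default_rng(0)
    choice = hardest[np.arange(n), rng.integers(0, k, size=n)]
    neg = neg_sim[np.arange(n), choice]
    return np.maximum(0.0, margin - pos + neg).mean()
```

With well-separated embeddings (e.g. orthogonal per identity), positive similarity dominates and the loss vanishes; early in training, sampling from a wide pool of easy negatives avoids the collapse that immediate hardest-negative mining can cause.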


Keywords: Joint embedding · Cross-modal · Multi-modal · Self-supervised · Face recognition · Speaker identification · Metric learning



The authors gratefully acknowledge the support of EPSRC CDT AIMS grant EP/L015897/1 and the Programme Grant Seebibyte EP/M013774/1. The authors would also like to thank Judith Albanie for helpful suggestions.

Supplementary material

Supplementary material 1 (PDF, 205 KB)



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. VGG, Department of Engineering Science, University of Oxford, Oxford, UK
