International Journal of Computer Vision, Volume 96, Issue 1, pp 64–82

Face Recognition from Caption-Based Supervision

  • Matthieu Guillaumin
  • Thomas Mensink
  • Jakob Verbeek
  • Cordelia Schmid


In this paper, we present methods for face recognition using a collection of images with captions. We consider two tasks: retrieving all faces of a particular person in a data set, and establishing the correct association between the names in the captions and the faces in the images. This is challenging because of the very large appearance variation in the images, as well as the potential mismatch between images and their captions.

For both tasks, we compare generative and discriminative probabilistic models, as well as methods that maximize subgraph densities in similarity graphs. We extend them by considering different metric learning techniques to obtain appropriate face representations that reduce intra-person variability and increase inter-person separation. For the retrieval task, we also study the benefit of query expansion.

To evaluate performance, we use a new fully labeled data set of 31147 faces which extends the recent Labeled Faces in the Wild data set. We present extensive experimental results which show that metric learning significantly improves the performance of all approaches on both tasks.


Keywords: Face recognition; Metric learning; Weakly supervised learning; Face retrieval; Constrained clustering





Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Matthieu Guillaumin (1)
  • Thomas Mensink (1)
  • Jakob Verbeek (1)
  • Cordelia Schmid (1)

  1. INRIA Rhône-Alpes, Montbonnot, France
