Machine Learning, Volume 81, Issue 1, pp 21–35

Large scale image annotation: learning to rank with joint word-image embeddings

  • Jason Weston
  • Samy Bengio
  • Nicolas Usunier


Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations. Our method both outperforms several baseline methods and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where annotations with alternate spellings or even languages are close in the embedding space. Hence, even when our model does not predict the exact annotation given by a human labeler, it often predicts similar annotations, a fact that we try to quantify by measuring the newly introduced “sibling” precision metric, where our method also obtains excellent results.
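The two ingredients named in the abstract, a joint embedding space and precision at k over a ranked list of annotations, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the linear projection, the dot-product scoring, the helper names (`embed_image`, `rank_annotations`, `precision_at_k`) and the toy numbers do not come from the paper, and no training procedure is shown.

```python
from typing import List, Sequence, Set


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def embed_image(features: Sequence[float],
                projection: Sequence[Sequence[float]]) -> List[float]:
    """Map image features into the joint space.

    Assumption: a linear map (one projection row per embedding dimension).
    """
    return [dot(row, features) for row in projection]


def rank_annotations(image_vec: Sequence[float],
                     annotation_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return annotation indices sorted by decreasing similarity to the image.

    Assumption: similarity is the dot product in the joint space.
    """
    scores = [dot(image_vec, vec) for vec in annotation_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def precision_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of the top-k ranked annotations that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for idx in ranked[:k] if idx in relevant)
    return hits / k


# Toy example: 3-dimensional image features, 2-dimensional joint space.
projection = [[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]]
image_vec = embed_image([1.0, 0.0, 2.0], projection)   # [1.0, 2.0]

# Four hypothetical annotation embeddings in the same 2-dimensional space.
annotation_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
ranked = rank_annotations(image_vec, annotation_vecs)  # [2, 1, 0, 3]

# Suppose annotations 0 and 2 are the ones a human labeler gave.
print(precision_at_k(ranked, {0, 2}, k=2))  # 0.5
```

Because images and annotations live in the same low-dimensional space, ranking an image against every annotation costs one small dot product per annotation, which is one plausible reading of why such a model can be cheap in time and memory.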


Keywords: Large scale, Image annotation, Learning to rank, Embedding



Copyright information

© The Author(s) 2010

Authors and Affiliations

  1. Google, New York, USA
  2. Google, Mountain View, USA
  3. Université Paris 6, LIP6, Paris, France
