Abstract
Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations. Our method both outperforms several baseline methods and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where annotations with alternate spellings or even languages are close in the embedding space. Hence, even when our model does not predict the exact annotation given by a human labeler, it often predicts similar annotations, a fact that we try to quantify by measuring the newly introduced “sibling” precision metric, where our method also obtains excellent results.
Article PDF
Similar content being viewed by others
References
Ando, R. K., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR, 6, 1817–1953.
Bai, B., Weston, J., Grangier, D., Collobert, R., Sadamasa, K., Qi, Y., Cortes, C., & Mohri, M. (2009). Polynomial semantic indexing. In Advances in neural information processing systems (NIPS 2009).
Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., & Singer, Y. (2006). Online passive-aggressive algorithms. Journal of Machine Learning Research, 7, 551–585.
Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: a large-scale hierarchical image database. In IEEE conference on computer vision and pattern recognition.
Everingham, M., Gool, L. V., Williams, C. K. I., Winn, J., & Zisserman, A. (2007). The PASCAL Visual Object Classes Challenge 2007 (VOC2007).
Fellbaum, C. (Ed.) (1998). WordNet: An electronic lexical database. Cambridge: MIT Press.
Fergus, R., Weiss, Y., & Torralba, A. (2009). Semi-supervised learning in gigantic image collections. In Advances in neural information processing systems, 2009.
Grangier, D., & Bengio, S. (2008). A discriminative kernel-based model to rank images from text queries. Transactions on Pattern Analysis and Machine Intelligence, 30(8), 1371–1384.
Grauman, K., & Darrell, T. (2007). The pyramid match kernel: Efficient learning with sets of features. Journal of Machine Learning Research, 8(725–760), 7–8.
Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 object category dataset (Technical Report 7694). California Institute of Technology.
Guillaumin, M., Mensink, T., Verbeek, J., Schmid, C., Lear, I., & Kuntzmann, L. (2009). Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation. In ICCV.
Loeff, N., Farhadi, A., Endres, I., & Forsyth, D. (2009). Unlabeled data improves word prediction. In ICCV’09.
Makadia, A., Pavlovic, V., & Kumar, S. (2008). A new baseline for image annotation. In European conference on computer vision (ECCV).
Monay, F., & Gatica-Perez, D. (2004). PLSA-based image auto-annotation: constraining the latent space. In Proceedings of the 12th annual ACM international conference on multimedia (pp. 348–351). New York: ACM.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400–407.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1), 267–288.
Torralba, A., Fergus, R., & Freeman, W. T. (2008a). 80 million tiny images: a large dataset for non-parametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30, 1958–1970.
Torralba, A. B., Fergus, R., & Weiss, Y. (2008b). Small codes and large image databases for recognition. In CVPR. Los Alamitos: IEEE Comput. Soc.
Usunier, N., Buffoni, D., & Gallinari, P. (2009). Ranking with ordered weighted pairwise classification. In L. Bottou, & M. Littman (Eds.), Proceedings of the 26th international conference on machine learning, Montreal, Omnipress, June 2009 (pp. 1057–1064).
Wang, J., Li, J., & Wiederholdy, G. (2000). SIMPLIcity: Semantics-sensitive integrated matching for picture libraries. Advances in Visual Information Systems (pp. 171–193).
Wolpert, D. (1992). Stacked generalization. Neural Networks, 5, 241–259.
Xia, F., Liu, T. Y., Wang, J., Zhang, W., & Li, H. (2008). Listwise approach to learning to rank: theory and algorithm. In Proceedings of the 25th international conference on machine learning.
Yue, Y., Finley, T., Radlinski, F., & Joachims, T. (2007). A support vector method for optimizing average precision. In Proceedings of the 30th international ACM SIGIR conference on research and development in information retrieval (pp. 271–278).
Zhou, Z., Zhan, D., & Yang, Q. (1999/2007). Semi-supervised learning with very few labeled training examples. In Proceedings of the national conference on artificial intelligence (Vol. 22, p. 675). Menlo Park/Cambridge: AAAI Press/MIT Press.
Author information
Authors and Affiliations
Corresponding author
Additional information
Editors: José L. Balcázar, Francesco Bonchi, Aristides Gionis, and Michèle Sebag.
Rights and permissions
About this article
Cite this article
Weston, J., Bengio, S. & Usunier, N. Large scale image annotation: learning to rank with joint word-image embeddings. Mach Learn 81, 21–35 (2010). https://doi.org/10.1007/s10994-010-5198-3
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10994-010-5198-3