
Learning multi-task local metrics for image annotation


The goal of image annotation is to automatically assign a set of textual labels to an image that describe its visual content. Recently, with the rapid increase in the number of web images, nearest neighbor (NN) based methods have become more attractive and have shown exciting results for image annotation. A key challenge for these methods is defining an appropriate similarity measure between images for neighbor selection. Several distance metric learning (DML) algorithms derived from traditional image classification problems have been applied to annotation tasks. However, a fundamental limitation of applying DML to image annotation is that it learns a single global distance metric over the entire image collection and measures the distance between image pairs at the image level. For multi-label annotation problems, it may be more reasonable to measure the similarity of image pairs at the label level. In this paper, we develop a novel label prediction scheme that utilizes multiple label-specific local metrics for label-level similarity measurement, and we propose two local metric learning methods within a multi-task learning (MTL) framework. Extensive experimental results on two challenging annotation datasets demonstrate that 1) utilizing multiple local distance metrics to learn label-level distances is superior to using a single global metric for label prediction, and 2) by learning multiple local metrics simultaneously, the proposed MTL-based methods can model the commonalities among labels, thereby improving label prediction and achieving state-of-the-art annotation performance.
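To make the label-level idea concrete, here is an illustrative sketch (not the authors' code) of a label-specific local distance built from a shared weight vector plus a per-label one. The per-dimension squared-difference form of the pair vector and all names below are assumptions for illustration only.

```python
import numpy as np

# Illustrative sketch only: one plausible form of a label-level distance,
# combining a shared weight vector (w_star) with a label-specific one (w_m).
# The squared-difference pair vector and all names here are assumptions.

def pairwise_diff(x_q, x_p):
    """Per-dimension squared difference between two image feature vectors."""
    return (x_q - x_p) ** 2

def label_distance(x_q, x_p, w_star, w_m):
    """Distance between images q and p under the local metric of label m."""
    return float((w_star + w_m) @ pairwise_diff(x_q, x_p))
```

Under this form, each label measures the same image pair differently, which is exactly what a single global metric cannot do.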




  1. For more details see:

  2. The F-ex metric is an example-based evaluation: the F1 score (\(F1 = 2\frac {Precision \times Recall}{Precision + Recall}\)) averaged over all images. Note that a higher F-ex score implies better performance.

  3. Here the time of neighborhood selection is not included in the recorded running time. In our experiment, exhaustive neighborhood selection for all training and test images takes around 24 and 18 hours, respectively.
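The example-based F-ex measure from footnote 2 can be sketched as follows; the helper name and the handling of images with no correct predictions (scored as 0) are assumptions, not the paper's exact protocol.

```python
# Minimal sketch of the example-based F-ex measure: per-image F1 between
# predicted and ground-truth label sets, averaged over all images.
def f_ex(predicted, truth):
    scores = []
    for pred, true in zip(predicted, truth):
        tp = len(set(pred) & set(true))
        if tp == 0:
            scores.append(0.0)  # F1 is 0 when no predicted label is correct
            continue
        precision = tp / len(pred)
        recall = tp / len(true)
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores)
```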



Author information

Corresponding author

Correspondence to Xing Xu.


Appendix A: Optimization in MTLM-LDML

For the logistic regression model in (7), the bias parameter \(b_{m}\) can be treated as an additional entry of the weight vector \(\boldsymbol{w}_{m}\), with \(b_{m} = w_{0}\) and \(d_{0}(\cdot) = -1\). Thus, the probability of a similar pair in (7) can be concisely expressed as

$$ p_{qp}(s_{qp} = 1|y_{m}; D_{y_{m}}(I_{q}, I_{p})) = \sigma(-D_{y_{m}}(I_{q}, I_{p})). $$

To simplify the notation, the cost function in (9) can be rewritten as

$$ E(\boldsymbol{w}) = {\gamma_{\ast} \| \boldsymbol{w}_{\ast} \|_{2}^{2}} + \sum\limits_{m=1}^{M}\left[\gamma_{m}{\| \boldsymbol{w}_{m} \|_{2}^{2}} + {\underset{(I_{q}, I_{p}) \in \mathcal{P}_{y_{m}}}\sum}{\log(1 + \exp(s_{qp} D_{y_{m}}(I_{q}, I_{p})))} \right]. $$

Based on the definition of \(D_{y_{m}}\) and the sigmoid function \(\sigma (z) = \frac {1}{1 + \exp (-z)}\), we first isolate the component of the cost that depends on each label-specific weight vector \(\boldsymbol{w}_{m}\):

$$ E(\boldsymbol{w}_{m}) = \gamma_{m}{\| \boldsymbol{w}_{m} \|_{2}^{2}} + {\underset{(I_{q}, I_{p}) \in \mathcal{P}_{y_{m}}}\sum}{\log(1 + \exp(s_{qp} D_{y_{m}}(I_{q}, I_{p})))}. $$

Hence, the gradient with respect to \(\boldsymbol{w}_{m}\) is given by

$$ \boldsymbol{G}(\boldsymbol{w}_{m}) = \frac{\partial{E(\boldsymbol{w})}}{\partial{\boldsymbol{w}_{m}}} = 2\gamma_{m} \boldsymbol{w}_{m} + {\underset{(I_{q}, I_{p}) \in \mathcal{P}_{y_{m}}}\sum}{s_{qp} \sigma(s_{qp}D_{y_{m}}(I_{q}, I_{p}))\boldsymbol{d}(I_{q}, I_{p})}. $$

Then, we consider the gradient with respect to the weight vector \(\boldsymbol{w}_{\ast}\) of the shared metric:

$$ \boldsymbol{G}(\boldsymbol{w}_{\ast}) = \frac{\partial{E(\boldsymbol{w})}}{\partial{\boldsymbol{w}_{\ast}}} = 2\gamma_{\ast} \boldsymbol{w}_{\ast} + \sum\limits_{m=1}^{M} {\underset{(I_{q}, I_{p}) \in \mathcal{P}_{y_{m}}}\sum}{s_{qp} \sigma(s_{qp}D_{y_{m}}(I_{q}, I_{p}))\boldsymbol{d}(I_{q}, I_{p})}. $$

Note that the gradients of \(\{\boldsymbol {w}_{m}\}_{m=1}^{M}\) in (16) can be calculated separately according to the pairwise constraints of labels \(\{{y_{m}}\}_{m=1}^{M}\), while the gradient of \(\boldsymbol{w}_{\ast}\) in (17) depends on all the gradients of \(\{\boldsymbol {w}_{m}\}_{m=1}^{M}\) and must therefore be calculated after them.
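The gradient computations in (16) and (17) can be sketched as follows, under the same shared-plus-specific assumption \(D_{y_{m}} = (\boldsymbol{w}_{\ast} + \boldsymbol{w}_{m})^{\top}\boldsymbol{d}\); all names and containers are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sketch of gradients (16)-(17). Each per-label gradient uses only that
# label's pairs; the shared gradient accumulates the same data terms over
# all labels, so it is computed after them. pairs[m] lists (d_vector, s).
def gradients(w_star, w_list, pairs, gamma_star, gammas):
    g_star = 2 * gamma_star * w_star
    g_list = []
    for m, w_m in enumerate(w_list):
        g_m = 2 * gammas[m] * w_m
        for d, s in pairs[m]:
            term = s * sigmoid(s * (w_star + w_m) @ d) * d
            g_m = g_m + term
            g_star = g_star + term  # shared gradient reuses the same term
        g_list.append(g_m)
    return g_star, g_list
```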

Finally, the iterative update rules for the weight vectors \(\{\boldsymbol {w}_{m}\}_{m=1}^{M}\) and \(\boldsymbol{w}_{\ast}\) can be formulated as

$$ \boldsymbol{w}_{m}^{t+1} = [\boldsymbol{w}_{m}^{t} - \eta_{t} \boldsymbol{G}(\boldsymbol{w}_{m}^{t}) ]_{+}, $$
$$ \boldsymbol{w}_{\ast}^{t+1} = [\boldsymbol{w}_{\ast}^{t} - \eta_{t} \boldsymbol{G}(\boldsymbol{w}_{\ast}^{t}) ]_{+}, $$

where \(\eta_{t}\) is the step size at iteration \(t\), and \([f]_{+} = \max (0, f)\) truncates negative entries to zero, enforcing the non-negativity constraints on the weight vectors. Algorithm 1 summarizes the learning process of the proposed MTLM-LDML method.
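The projected update in (18) and (19) is a one-liner in practice; a sketch with a fixed step size (the step-size schedule is not specified here):

```python
import numpy as np

# Projected gradient step [w - eta * G]_+, truncating negative entries to
# zero so the weight vector stays non-negative.
def project_step(w, g, eta):
    return np.maximum(0.0, w - eta * g)
```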


Appendix B: Optimization in MTLM-LMNN

Similar to the gradient computations in Appendix A, from the cost function in (12) we can derive the gradients of the weight vectors \(\{\boldsymbol {w}_{m}\}_{m=1}^{M}\) and \(\boldsymbol{w}_{\ast}\) as

$$ \boldsymbol{G}(\boldsymbol{w}_{m}) = 2\gamma_{m} \boldsymbol{w}_{m} + {\underset{(I_{q}, I_{p}) \in \mathcal{P}_{y_{m}}^{+}}\sum}{\boldsymbol{d}(I_{q}, I_{p})} + \mu {\underset{(I_{q}, I_{p}, I_{n}) \in \mathcal{P}_{y_{m}}^{-}}\sum}\left(\boldsymbol{d}(I_{q}, I_{p}) - \boldsymbol{d}(I_{q}, I_{n}) \right), $$
$$ \boldsymbol{G}(\boldsymbol{w}_{\ast}) = 2\gamma_{\ast} \boldsymbol{w}_{\ast} + \sum\limits_{m=1}^{M} \left[ {\underset{(I_{q}, I_{p}) \in \mathcal{P}_{y_{m}}^{+}}\sum}{\boldsymbol{d}(I_{q}, I_{p})} + \mu {\underset{(I_{q}, I_{p}, I_{n}) \in \mathcal{P}_{y_{m}}^{-}}\sum}\left(\boldsymbol{d}(I_{q}, I_{p}) - \boldsymbol{d}(I_{q}, I_{n}) \right) \right]. $$

Here we make similar observations: the gradient of each \(\boldsymbol{w}_{m}\) in (20) can be calculated separately according to the pairwise constraints \(\mathcal {P}_{y_{m}}\) of each label \(y_{m}\), while the gradient of \(\boldsymbol{w}_{\ast}\) in (21) depends on all the gradients of \(\{\boldsymbol {w}_{m}\}_{m=1}^{M}\).
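Gradients (20) and (21) can be sketched analogously: a pull term over similar pairs and a push term over active triplets, weighted by \(\mu\). The containers and names below are illustrative assumptions.

```python
import numpy as np

# Sketch of the MTLM-LMNN gradients (20)-(21). pos[m] holds d(q, p)
# vectors for similar pairs of label m; neg[m] holds (d(q, p), d(q, n))
# tuples for active triplets; mu weights the push term.
def lmnn_gradients(w_star, w_list, pos, neg, gamma_star, gammas, mu):
    g_star = 2 * gamma_star * w_star
    g_list = []
    for m, w_m in enumerate(w_list):
        data = sum(pos[m]) + mu * sum(d_qp - d_qn for d_qp, d_qn in neg[m])
        g_m = 2 * gammas[m] * w_m + data
        g_list.append(g_m)
        g_star = g_star + data  # shared gradient sums all labels' data terms
    return g_star, g_list
```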

Since naive computation of the gradients in (20) and (21) would be extremely expensive, we consider only the “active” triplets in \(\mathcal {P}_{y_{m}}\) that change from iteration t to iteration t + 1 based on the following update rules:

$$\begin{array}{@{}rcl@{}} \boldsymbol{G}(\boldsymbol{w}_{m}^{t+1})&=&\boldsymbol{G}(\boldsymbol{w}_{m}^{t}) - \eta_{t} {\underset{(I_{q}, I_{p}, I_{n}) \in \mathcal{P}_{y_{m}}^{t} - \mathcal{P}_{y_{m}}^{t+1}}\sum}(\boldsymbol{d}(I_{q}, I_{p}) - \boldsymbol{d}(I_{q}, I_{n}))\\&& {} + \eta_{t} {\underset{(I_{q}, I_{p}, I_{n}) \in \mathcal{P}_{y_{m}}^{t+1} - \mathcal{P}_{y_{m}}^{t}}\sum}(\boldsymbol{d}(I_{q}, I_{p}) - \boldsymbol{d}(I_{q}, I_{n})), \end{array} $$
$$\begin{array}{@{}rcl@{}} \boldsymbol{G}(\boldsymbol{w}_{\ast}^{t+1}) &=& \boldsymbol{G}(\boldsymbol{w}_{\ast}^{t}) - \sum\limits_{m=1}^{M} \eta_{t} {\underset{(I_{q}, I_{p}, I_{n}) \in \mathcal{P}_{y_{m}}^{t} - \mathcal{P}_{y_{m}}^{t+1}}\sum}(\boldsymbol{d}(I_{q}, I_{p}) - \boldsymbol{d}(I_{q}, I_{n}))\\ && {} + \sum\limits_{m=1}^{M} \eta_{t} {\underset{(I_{q}, I_{p}, I_{n}) \in \mathcal{P}_{y_{m}}^{t+1} - \mathcal{P}_{y_{m}}^{t}}\sum}(\boldsymbol{d}(I_{q}, I_{p}) - \boldsymbol{d}(I_{q}, I_{n})), \end{array} $$

where \(\eta_{t}\) is the step size in the \(t\)-th iteration, \(\mathcal {P}_{y_{m}}^{t+1} - \mathcal {P}_{y_{m}}^{t}\) denotes the new triplets appearing in \(\mathcal {P}_{y_{m}}^{t+1}\), and \(\mathcal {P}_{y_{m}}^{t} - \mathcal {P}_{y_{m}}^{t+1}\) denotes the triplets that have disappeared from \(\mathcal {P}_{y_{m}}^{t+1}\). For a small step size, the set \(\mathcal {P}_{y_{m}}^{t}\) changes little between iterations; in this case, computing the gradients via (22) and (23) is much faster.
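The incremental bookkeeping in (22) can be sketched with set differences over triplet identifiers. This is a hypothetical representation: `contrib` maps each triplet id to its precomputed vector \(\boldsymbol{d}(I_{q}, I_{p}) - \boldsymbol{d}(I_{q}, I_{n})\).

```python
import numpy as np

# Sketch of the incremental gradient update (22): only triplets that leave
# or enter the active set between iterations contribute, instead of
# recomputing the full sums over all active triplets.
def incremental_gradient(g_prev, old_active, new_active, contrib, eta):
    g = g_prev.copy()
    for t in old_active - new_active:   # triplets that disappeared
        g = g - eta * contrib[t]
    for t in new_active - old_active:   # newly active triplets
        g = g + eta * contrib[t]
    return g
```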

Finally, we utilize a gradient descent based learning process similar to that of Algorithm 1 to learn the weight vectors \(\{\boldsymbol {w}_{m}\}_{m=1}^{M}\) and \(\boldsymbol{w}_{\ast}\) for the proposed MTLM-LMNN model.


Cite this article

Xu, X., Shimada, A., Nagahara, H. et al. Learning multi-task local metrics for image annotation. Multimed Tools Appl 75, 2203–2231 (2016).



  • Image annotation
  • Label prediction
  • Metric learning
  • Local metric
  • Multi-task learning