
Learning multi-task local metrics for image annotation


The goal of image annotation is to automatically assign a set of textual labels to an image that describe its visual content. Recently, with the rapid increase in the number of web images, nearest neighbor (NN) based methods have become more attractive and have shown exciting results for image annotation. A key challenge for these methods is defining an appropriate similarity measure between images for neighbor selection. Several distance metric learning (DML) algorithms derived from traditional image classification problems have been applied to annotation tasks. However, a fundamental limitation of applying DML to image annotation is that it learns a single global distance metric over the entire image collection and measures the distance between image pairs at the image level. For multi-label annotation problems, it may be more reasonable to measure the similarity of image pairs at the label level. In this paper, we develop a novel label prediction scheme that utilizes multiple label-specific local metrics for label-level similarity measurement, and we propose two local metric learning methods within a multi-task learning (MTL) framework. Extensive experimental results on two challenging annotation datasets demonstrate that 1) utilizing multiple local distance metrics to learn label-level distances is superior to using a single global metric for label prediction, and 2) by learning multiple local metrics simultaneously, the proposed MTL-based methods can model the commonalities among labels, thereby improving label prediction and achieving state-of-the-art annotation performance.
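To make the label-level idea concrete, here is an illustrative sketch (not the authors' code) of a label-specific local distance built from a shared weight vector plus a per-label one. The per-dimension squared-difference form of the pair vector and all names below are assumptions for illustration only.

```python
import numpy as np

# Illustrative sketch only: one plausible form of a label-level distance,
# combining a shared weight vector (w_star) with a label-specific one (w_m).
# The squared-difference pair vector and all names here are assumptions.

def pairwise_diff(x_q, x_p):
    """Per-dimension squared difference between two image feature vectors."""
    return (x_q - x_p) ** 2

def label_distance(x_q, x_p, w_star, w_m):
    """Distance between images q and p under the local metric of label m."""
    return float((w_star + w_m) @ pairwise_diff(x_q, x_p))
```

Under this form, each label measures the same image pair differently, which is exactly what a single global metric cannot do.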




  1. For more details see:

  2. The F-ex metric is an example-based evaluation: the F1 score (\(F1 = 2\frac {Precision \times Recall}{Precision + Recall}\)) averaged over all images. Note that a higher F-ex score implies better performance.

  3. Here the time of neighborhood selection is not included in the recorded running time. In our experiment, exhaustive neighborhood selection for all training and test images takes around 24 and 18 hours, respectively.
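The example-based F-ex measure from footnote 2 can be sketched as follows; the helper name and the handling of images with no correct predictions (scored as 0) are assumptions, not the paper's exact protocol.

```python
# Minimal sketch of the example-based F-ex measure: per-image F1 between
# predicted and ground-truth label sets, averaged over all images.
def f_ex(predicted, truth):
    scores = []
    for pred, true in zip(predicted, truth):
        tp = len(set(pred) & set(true))
        if tp == 0:
            scores.append(0.0)  # F1 is 0 when no predicted label is correct
            continue
        precision = tp / len(pred)
        recall = tp / len(true)
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores)
```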



Author information

Corresponding author

Correspondence to Xing Xu.


Appendix A: Optimization in MTLM-LDML

For the logistic regression model in (7), the bias parameter \(b_{m}\) can be treated as an additional entry of the weight vector \(\boldsymbol{w}_{m}\), with \(b_{m} = w_{0}\) and \(d_{0}(\cdot) = -1\). Thus, the probability of a similar pair in (7) can be concisely expressed as

$$ p_{qp}(s_{qp} = 1|y_{m}; D_{y_{m}}(I_{q}, I_{p})) = \sigma(-D_{y_{m}}(I_{q}, I_{p})). $$

To simplify the notation, the cost function in (9) can be rewritten as

$$ E(\boldsymbol{w}) = {\gamma_{\ast} \| \boldsymbol{w}_{\ast} \|_{2}^{2}} + \sum\limits_{m=1}^{M}\left[\gamma_{m}{\| \boldsymbol{w}_{m} \|_{2}^{2}} + {\underset{(I_{q}, I_{p}) \in \mathcal{P}_{y_{m}}}\sum}{\log(1 + \exp(s_{qp} D_{y_{m}}(I_{q}, I_{p})))} \right]. $$

Based on the definition of \(D_{y_{m}}\) and the sigmoid function \(\sigma (z) = \frac {1}{1 + \exp (-z)}\), we first isolate the component of the cost that depends on each label-specific weight vector \(\boldsymbol{w}_{m}\):

$$ E(\boldsymbol{w}_{m}) = \gamma_{m}{\| \boldsymbol{w}_{m} \|_{2}^{2}} + {\underset{(I_{q}, I_{p}) \in \mathcal{P}_{y_{m}}}\sum}{\log(1 + \exp(s_{qp} D_{y_{m}}(I_{q}, I_{p})))}. $$

Hence, the gradient with respect to \(\boldsymbol{w}_{m}\) is given by

$$ \boldsymbol{G}(\boldsymbol{w}_{m}) = \frac{\partial{E(\boldsymbol{w})}}{\partial{\boldsymbol{w}_{m}}} = 2\gamma_{m} \boldsymbol{w}_{m} + {\underset{(I_{q}, I_{p}) \in \mathcal{P}_{y_{m}}}\sum}{s_{qp} \sigma(s_{qp}D_{y_{m}}(I_{q}, I_{p}))\boldsymbol{d}(I_{q}, I_{p})}. $$

Then, we consider the gradient with respect to the weight vector \(\boldsymbol{w}_{\ast}\) of the shared metric:

$$ \boldsymbol{G}(\boldsymbol{w}_{\ast}) = \frac{\partial{E(\boldsymbol{w})}}{\partial{\boldsymbol{w}_{\ast}}} = 2\gamma_{\ast} \boldsymbol{w}_{\ast} + \sum\limits_{m=1}^{M} {\underset{(I_{q}, I_{p}) \in \mathcal{P}_{y_{m}}}\sum}{s_{qp} \sigma(s_{qp}D_{y_{m}}(I_{q}, I_{p}))\boldsymbol{d}(I_{q}, I_{p})}. $$

Note that the gradients of \(\{\boldsymbol {w}_{m}\}_{m=1}^{M}\) in (16) can be calculated separately according to the pairwise constraints of labels \(\{{y_{m}}\}_{m=1}^{M}\), while the gradient of \(\boldsymbol{w}_{\ast}\) in (17) depends on all the gradients of \(\{\boldsymbol {w}_{m}\}_{m=1}^{M}\) and must therefore be calculated after them.
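The gradient computations in (16) and (17) can be sketched as follows, under the same shared-plus-specific assumption \(D_{y_{m}} = (\boldsymbol{w}_{\ast} + \boldsymbol{w}_{m})^{\top}\boldsymbol{d}\); all names and containers are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sketch of gradients (16)-(17). Each per-label gradient uses only that
# label's pairs; the shared gradient accumulates the same data terms over
# all labels, so it is computed after them. pairs[m] lists (d_vector, s).
def gradients(w_star, w_list, pairs, gamma_star, gammas):
    g_star = 2 * gamma_star * w_star
    g_list = []
    for m, w_m in enumerate(w_list):
        g_m = 2 * gammas[m] * w_m
        for d, s in pairs[m]:
            term = s * sigmoid(s * (w_star + w_m) @ d) * d
            g_m = g_m + term
            g_star = g_star + term  # shared gradient reuses the same term
        g_list.append(g_m)
    return g_star, g_list
```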

Finally, the iterative update rules for the weight vectors \(\{\boldsymbol {w}_{m}\}_{m=1}^{M}\) and \(\boldsymbol{w}_{\ast}\) can be formulated as

$$ \boldsymbol{w}_{m}^{t+1} = [\boldsymbol{w}_{m}^{t} - \eta_{t} \boldsymbol{G}(\boldsymbol{w}_{m}^{t}) ]_{+}, $$
$$ \boldsymbol{w}_{\ast}^{t+1} = [\boldsymbol{w}_{\ast}^{t} - \eta_{t} \boldsymbol{G}(\boldsymbol{w}_{\ast}^{t}) ]_{+}, $$

where \(\eta_{t}\) is the step size at iteration \(t\), and \([f]_{+} = \max (0, f)\) truncates negative entries to zero, enforcing the non-negativity constraints on the weight vectors. Algorithm 1 summarizes the learning process of the proposed MTLM-LDML method.
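The projected update in (18) and (19) is a one-liner in practice; a sketch with a fixed step size (the step-size schedule is not specified here):

```python
import numpy as np

# Projected gradient step [w - eta * G]_+, truncating negative entries to
# zero so the weight vector stays non-negative.
def project_step(w, g, eta):
    return np.maximum(0.0, w - eta * g)
```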


Appendix B: Optimization in MTLM-LMNN

Similar to the gradient computations in Appendix A, from the cost function in (12) we can derive the gradients of the weight vectors \(\{\boldsymbol {w}_{m}\}_{m=1}^{M}\) and \(\boldsymbol{w}_{\ast}\) as

$$ \boldsymbol{G}(\boldsymbol{w}_{m}) = 2\gamma_{m} \boldsymbol{w}_{m} + {\underset{(I_{q}, I_{p}) \in \mathcal{P}_{y_{m}}^{+}}\sum}{\boldsymbol{d}(I_{q}, I_{p})} + \mu {\underset{(I_{q}, I_{p}, I_{n}) \in \mathcal{P}_{y_{m}}^{-}}\sum}\left(\boldsymbol{d}(I_{q}, I_{p}) - \boldsymbol{d}(I_{q}, I_{n}) \right), $$
$$ \boldsymbol{G}(\boldsymbol{w}_{\ast}) = 2\gamma_{\ast} \boldsymbol{w}_{\ast} + \sum\limits_{m=1}^{M} \left[ {\underset{(I_{q}, I_{p}) \in \mathcal{P}_{y_{m}}^{+}}\sum}{\boldsymbol{d}(I_{q}, I_{p})} + \mu {\underset{(I_{q}, I_{p}, I_{n}) \in \mathcal{P}_{y_{m}}^{-}}\sum}\left(\boldsymbol{d}(I_{q}, I_{p}) - \boldsymbol{d}(I_{q}, I_{n}) \right) \right]. $$

Here we make similar observations: the gradient of each \(\boldsymbol{w}_{m}\) in (20) can be calculated separately according to the pairwise constraints \(\mathcal {P}_{y_{m}}\) of each label \(y_{m}\), while the gradient of \(\boldsymbol{w}_{\ast}\) in (21) depends on all the gradients of \(\{\boldsymbol {w}_{m}\}_{m=1}^{M}\).
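Gradients (20) and (21) can be sketched analogously: a pull term over similar pairs and a push term over active triplets, weighted by \(\mu\). The containers and names below are illustrative assumptions.

```python
import numpy as np

# Sketch of the MTLM-LMNN gradients (20)-(21). pos[m] holds d(q, p)
# vectors for similar pairs of label m; neg[m] holds (d(q, p), d(q, n))
# tuples for active triplets; mu weights the push term.
def lmnn_gradients(w_star, w_list, pos, neg, gamma_star, gammas, mu):
    g_star = 2 * gamma_star * w_star
    g_list = []
    for m, w_m in enumerate(w_list):
        data = sum(pos[m]) + mu * sum(d_qp - d_qn for d_qp, d_qn in neg[m])
        g_m = 2 * gammas[m] * w_m + data
        g_list.append(g_m)
        g_star = g_star + data  # shared gradient sums all labels' data terms
    return g_star, g_list
```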

Since naive computation of the gradients in (20) and (21) would be extremely expensive, we consider only the “active” triplets in \(\mathcal {P}_{y_{m}}\) that change from iteration t to iteration t + 1 based on the following update rules:

$$\begin{array}{@{}rcl@{}} \boldsymbol{G}(\boldsymbol{w}_{m}^{t+1})&=&\boldsymbol{G}(\boldsymbol{w}_{m}^{t}) - \eta_{t} {\underset{(I_{q}, I_{p}, I_{n}) \in \mathcal{P}_{y_{m}}^{t} - \mathcal{P}_{y_{m}}^{t+1}}\sum}(\boldsymbol{d}(I_{q}, I_{p}) - \boldsymbol{d}(I_{q}, I_{n}))\\&& {} + \eta_{t} {\underset{(I_{q}, I_{p}, I_{n}) \in \mathcal{P}_{y_{m}}^{t+1} - \mathcal{P}_{y_{m}}^{t}}\sum}(\boldsymbol{d}(I_{q}, I_{p}) - \boldsymbol{d}(I_{q}, I_{n})), \end{array} $$
$$\begin{array}{@{}rcl@{}} \boldsymbol{G}(\boldsymbol{w}_{\ast}^{t+1}) &=& \boldsymbol{G}(\boldsymbol{w}_{\ast}^{t}) - \sum\limits_{m=1}^{M} \eta_{t} {\underset{(I_{q}, I_{p}, I_{n}) \in \mathcal{P}_{y_{m}}^{t} - \mathcal{P}_{y_{m}}^{t+1}}\sum}(\boldsymbol{d}(I_{q}, I_{p}) - \boldsymbol{d}(I_{q}, I_{n}))\\ && {} + \sum\limits_{m=1}^{M} \eta_{t} {\underset{(I_{q}, I_{p}, I_{n}) \in \mathcal{P}_{y_{m}}^{t+1} - \mathcal{P}_{y_{m}}^{t}}\sum}(\boldsymbol{d}(I_{q}, I_{p}) - \boldsymbol{d}(I_{q}, I_{n})), \end{array} $$

where \(\eta_{t}\) is the step size in the \(t\)-th iteration, \(\mathcal {P}_{y_{m}}^{t+1} - \mathcal {P}_{y_{m}}^{t}\) denotes the new triplets appearing in \(\mathcal {P}_{y_{m}}^{t+1}\), and \(\mathcal {P}_{y_{m}}^{t} - \mathcal {P}_{y_{m}}^{t+1}\) denotes the triplets that have disappeared from \(\mathcal {P}_{y_{m}}^{t+1}\). For a small step size, the set \(\mathcal {P}_{y_{m}}^{t}\) changes little between iterations; in this case, computing the gradients via (22) and (23) is much faster.
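The incremental bookkeeping in (22) can be sketched with set differences over triplet identifiers. This is a hypothetical representation: `contrib` maps each triplet id to its precomputed vector \(\boldsymbol{d}(I_{q}, I_{p}) - \boldsymbol{d}(I_{q}, I_{n})\).

```python
import numpy as np

# Sketch of the incremental gradient update (22): only triplets that leave
# or enter the active set between iterations contribute, instead of
# recomputing the full sums over all active triplets.
def incremental_gradient(g_prev, old_active, new_active, contrib, eta):
    g = g_prev.copy()
    for t in old_active - new_active:   # triplets that disappeared
        g = g - eta * contrib[t]
    for t in new_active - old_active:   # newly active triplets
        g = g + eta * contrib[t]
    return g
```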

Finally, we utilize a gradient descent based learning process similar to that of Algorithm 1 to learn the weight vectors \(\{\boldsymbol {w}_{m}\}_{m=1}^{M}\) and \(\boldsymbol{w}_{\ast}\) for the proposed MTLM-LMNN model.


Cite this article

Xu, X., Shimada, A., Nagahara, H. et al. Learning multi-task local metrics for image annotation. Multimed Tools Appl 75, 2203–2231 (2016).



  • Image annotation
  • Label prediction
  • Metric learning
  • Local metric
  • Multi-task learning