## Abstract

The goal of image annotation is to automatically assign a set of textual labels to an image to describe its visual contents. Recently, with the rapid increase in the number of web images, nearest neighbor (NN) based methods have become more attractive and have shown exciting results for image annotation. One of the key challenges of these methods is to define an appropriate similarity measure between images for neighbor selection. Several distance metric learning (DML) algorithms derived from traditional image classification problems have been applied to annotation tasks. However, a fundamental limitation of applying DML to image annotation is that it learns a single global distance metric over the entire image collection and measures the distance between image pairs at the image level. For multi-label annotation problems, it may be more reasonable to measure the similarity of image pairs at the label level. In this paper, we develop a novel label prediction scheme utilizing multiple label-specific local metrics for label-level similarity measurement, and propose two different local metric learning methods in a multi-task learning (MTL) framework. Extensive experimental results on two challenging annotation datasets demonstrate that 1) utilizing multiple local distance metrics to learn label-level distances is superior to using a single global metric for label prediction, and 2) the proposed methods, which use the MTL framework to learn multiple local metrics simultaneously, can model the commonalities among labels and thereby achieve state-of-the-art annotation performance.


## Notes

For more details see: http://www.imageclef.org/2011/Photo.

The F-ex metric is an example-based evaluation: the F1 score (\(F1 = 2\frac {Precision \cdot Recall}{Precision + Recall}\)) averaged over all images. Note that a higher F-ex score implies better performance.
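The example-based F1 described above can be sketched in a few lines. This is an illustrative implementation, not the paper's evaluation code; the function name and the set-based label representation are assumptions:

```python
def f_ex(pred_labels, true_labels):
    """Example-based F1 (F-ex): the per-image F1 score averaged over all images.

    pred_labels, true_labels: lists of label sets, one set per image.
    """
    scores = []
    for pred, true in zip(pred_labels, true_labels):
        tp = len(pred & true)  # labels both predicted and in the ground truth
        if tp == 0:
            scores.append(0.0)  # precision or recall is zero, so F1 is zero
            continue
        precision = tp / len(pred)
        recall = tp / len(true)
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores)

# Two images: the first has F1 = 0.8 (precision 1, recall 2/3), the second 0
pred = [{"sky", "sea"}, {"dog"}]
true = [{"sky", "sea", "beach"}, {"cat"}]
print(f_ex(pred, true))  # 0.4
```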

Here the time of neighborhood selection is not included in the reported running time. In our experiments, exhaustive neighborhood selection for all the training and test images takes around 24 and 18 hours, respectively.



## Appendices

### Appendix A: Optimization in MTLM-LDML

For the logistic regression model in (7), the bias parameter \(b_{m}\) can be considered an additional entry in the weight vector \(\boldsymbol{w}_{m}\), with \(b_{m} = w_{0}\) and \(d^{0}(\cdot) = -1\). Thus, the probability of a similar pair in (7) can be concisely expressed as

To simplify the notation, the cost function in (9) can be rewritten as

Based on the definitions of \(D_{y_{m}}\) and the sigmoid function \(\sigma (z) = \frac {1}{1 + \exp (-z)}\), we first consider the gradients with respect to weight vectors \(\{\boldsymbol {w}_{m}\}_{m=1}^{M}\) of the label-specific metrics as
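The sigmoid-based pair probability can be illustrated numerically. A minimal sketch, assuming the common LDML-style formulation in which the probability that a pair is similar is \(\sigma(b - d)\) for pair distance \(d\) and bias \(b\); the function names are illustrative, not from the paper:

```python
import math

def sigmoid(z):
    """The sigmoid function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def pair_probability(d, b):
    """Assumed LDML-style probability that an image pair is similar:
    sigma(b - d), where d is the (label-specific) distance and b a bias."""
    return sigmoid(b - d)

# Close pairs (small distance) get probability near 1, far pairs near 0
print(pair_probability(d=0.5, b=2.0))  # ~0.82
print(pair_probability(d=5.0, b=2.0))  # ~0.05
```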

Hence, the gradient of \(\boldsymbol{w}_{m}\) is given by

Then, we consider the gradient with respect to the weight vector *w*_{∗} of the shared metric as

Note that the gradients of \(\{\boldsymbol {w}_{m}\}_{m=1}^{M}\) in (16) can be calculated separately according to the pairwise constraints of labels \(\{{y_{m}}\}_{m=1}^{M}\), while the gradient of *w*_{∗} in (17) depends on all gradients of \(\{\boldsymbol {w}_{m}\}_{m=1}^{M}\) and must be calculated thereafter.

Finally, the iterative update solutions for weight vectors \(\{\boldsymbol {w}_{m}\}_{m=1}^{M}\) and *w*_{∗} can be formulated as

where \(\eta_{t}\) is the step size at iteration *t*, and \([f]_{+} = \max (0, f)\) truncates any negative entries in \(\boldsymbol{w}\) and sets them to zero to satisfy the non-negativity constraints on \(\boldsymbol{w}\). Algorithm 1 summarizes the learning process in the proposed MTLM-LDML method.
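The truncated update above is a projected gradient step onto the non-negative orthant. A minimal numerical sketch, assuming nothing beyond the \([f]_{+} = \max(0, f)\) truncation; the function name and values are illustrative:

```python
import numpy as np

def projected_gradient_step(w, grad, eta):
    """One update w <- [w - eta * grad]_+, where [f]_+ = max(0, f)
    sets negative entries to zero (non-negativity constraint on w)."""
    return np.maximum(0.0, w - eta * grad)

w = np.array([0.3, 0.1, 0.8])
grad = np.array([-0.5, 2.0, 0.1])
# Second entry would become -0.1 and is truncated to 0
print(projected_gradient_step(w, grad, eta=0.1))  # ~[0.35, 0.0, 0.79]
```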

### Appendix B: Optimization in MTLM-LMNN

Similar to the computation of gradients in Appendix A, from the cost function in (12) we can derive the gradients of weight vectors \(\{\boldsymbol {w}_{m}\}_{m=1}^{M}\) and *w*_{∗} as

Here we make similar observations: the gradients of \(\{\boldsymbol {w}_{m}\}_{m=1}^{M}\) in (20) can be calculated separately according to the pairwise constraints \(\mathcal {P}_{y_{m}}\) of each label \(y_{m}\), while the gradient of *w*_{∗} in (21) depends on all the gradients of \(\{\boldsymbol {w}_{m}\}_{m=1}^{M}\).

Since naive computation of the gradients in (20) and (21) would be extremely expensive, we consider only the “*active*” triplets in \(\mathcal {P}_{y_{m}}\) that change from iteration *t* to iteration *t* + 1 based on the following update rules:

where \(\mu_{t}\) is the step size in the *t*-th iteration, \(\mathcal {P}_{y_{m}}^{t+1} - \mathcal {P}_{y_{m}}^{t}\) represents the new triplets appearing in \(\mathcal {P}_{y_{m}}^{t+1}\), and \(\mathcal {P}_{y_{m}}^{t} - \mathcal {P}_{y_{m}}^{t+1}\) represents the old triplets that have disappeared from \(\mathcal {P}_{y_{m}}^{t+1}\). For a small step size, the set \(\mathcal {P}_{y_{m}}^{t}\) changes little in each iteration. In this case, computing the gradients in (22) and (23) is much faster.

Finally, we utilize a gradient descent based learning process similar to that depicted in Algorithm 1 to learn weight vectors \(\{\boldsymbol {w}_{m}\}_{m=1}^{M}\) and *w*_{∗} for the proposed MTLM-LMNN model.


## About this article

### Cite this article

Xu, X., Shimada, A., Nagahara, H. *et al.* Learning multi-task local metrics for image annotation.
*Multimed Tools Appl* **75**, 2203–2231 (2016). https://doi.org/10.1007/s11042-014-2402-7


### Keywords

- Image annotation
- Label prediction
- Metric learning
- Local metric
- Multi-task learning