Learning multi-task local metrics for image annotation

Xu, Xing; Shimada, Atsushi; Nagahara, Hajime; Taniguchi, Rin-ichiro

doi:10.1007/s11042-014-2402-7

Learning multi-task local metrics for image annotation

Published: 14 December 2014

Volume 75, pages 2203–2231, (2016)
Cite this article

Multimedia Tools and Applications Aims and scope Submit manuscript

Xing Xu¹,
Atsushi Shimada¹,
Hajime Nagahara¹ &
…
Rin-ichiro Taniguchi¹

334 Accesses
7 Citations
Explore all metrics

Abstract

The goal of image annotation is to automatically assign a set of textual labels to an image to describe the visual contents thereof. Recently, with the rapid increase in the number of web images, nearest neighbor (NN) based methods have become more attractive and have shown exciting results for image annotation. One of the key challenges of these methods is to define an appropriate similarity measure between images for neighbor selection. Several distance metric learning (DML) algorithms derived from traditional image classification problems have been applied to annotation tasks. However, a fundamental limitation of applying DML to image annotation is that it learns a single global distance metric over the entire image collection and measures the distance between image pairs in the image-level. For multi-label annotation problems, it may be more reasonable to measure similarity of image pairs in the label-level. In this paper, we develop a novel label prediction scheme utilizing multiple label-specific local metrics for label-level similarity measure, and propose two different local metric learning methods in a multi-task learning (MTL) framework. Extensive experimental results on two challenging annotation datasets demonstrate that 1) utilizing multiple local distance metrics to learn label-level distances is superior to using a single global metric in label prediction, and 2) the proposed methods using the MTL framework to learn multiple local metrics simultaneously can model the commonalities of labels, thereby facilitating label prediction results to achieve state-of-the-art annotation performance.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Image Matching from Handcrafted to Deep Features: A Survey

Article Open access 04 August 2020

Learning to Prompt for Vision-Language Models

Article 31 July 2022

The Curious Layperson: Fine-Grained Image Recognition Without Expert Labels

Article Open access 13 September 2023

Notes

For more details see: http://www.imageclef.org/2011/Photo.
The F-ex metric is an example-based evaluation, which is the averaged F1 score ($F1 = 2\frac {Precision * Recall}{Precision + Recall}$) of all images. Note that higher F-ex score implies better performance
Here the time of neighborhood selection is not included in recording the running time. In our experiment, exhaustive neighborhood selection for all the training and test images takes around 24 and 18 hours respectively

References

Ando RK, Zhang T (2005) A framework for learning predictive structures from multiple tasks and unlabeled data. J Mach Learn Res 6:1817–1853
MathSciNet MATH Google Scholar
Argyriou A, Evgeniou T, Pontil M (2008) Convex multi-task feature learning. Mach Learn 73(3):243–272
Article Google Scholar
Binder A, Samek W, Müller KR, Kawanabe M (2013) Enhanced representation and multi-task learning for image annotation. Comp Vision Image Underst 117(5):466–478
Article Google Scholar
Blei DM, Jordan MI (2003) Modeling annotated data. In: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, SIGIR ’03, pp 127–134
Carneiro G, Chan A, Moreno P, Vasconcelos N (2007) Supervised learning of semantic classes for image annotation and retrieval. IEEE Trans Pattern Anal Mach Intell 29(3):394–410
Article Google Scholar
Caruana R (1997) Multitask learning. Mach Learn 28(1):41–75
Article MathSciNet Google Scholar
Chang C C, Lin C J (2011) Libsvm: A library for support vector machines. ACM Trans Intell Syst Technol 2(3):27:1–27:27
Article Google Scholar
Chen X, Mu Y, Yan S, Chua TS (2010) Efficient large-scale image annotation by probabilistic collaborative multi-label propagation. In: Proceedings of the international conference on multimedia, MM ’10, pp 35–44
Chen X, Yuan X, Yan S, Tang J, Rui Y, Chua TS (2011) Towards multi-semantic image annotation with graph regularized exclusive group lasso. In: Proceedings of the 19th ACM international conference on multimedia, MM ’11, pp 263–272
Chua TS, Tang J, Hong R, Li H, Luo Z, Zheng Y (2009) Nus-wide: a real-world web image database from national university of singapore. In: Proceedings of the ACM international conference on image and video retrieval, CIVR ’09, pp 48:1–48:9
Evgeniou T, Micchelli CA, Pontil M (2005) Learning multiple tasks with kernel methods. J Mach Learn Res 6:615–637
MathSciNet MATH Google Scholar
Feng S, Manmatha R, Lavrenko V (2004) Multiple bernoulli relevance models for image and video annotation. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, vol 2, pp 1002–1009
Feng Z, Jin R, Jain A (2013) Large-scale image annotation by efficient and robust kernel metric learning. In: IEEE international conference on computer vision (ICCV), pp 3490–3497
Guillaumin M, Mensink T, Verbeek J, Schmid C (2009a) Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation. In: IEEE 12th international conference on computer vision (ICCV), pp 309–316
Guillaumin M, Verbeek J, Schmid C (2009b) Is that you? metric learning approaches for face identification. In: International conference on computer vision, pp 498–505
Jegou H, Douze M, Schmid C (2011) Product quantization for nearest neighbor search. IEEE Trans Pattern Anal Mach Intell 33(1):117–128
Article Google Scholar
Koen EA, van de Sande TG, Snoek CGM (2010) Evaluating color descriptors for object and scene recognition. IEEE Trans Pattern Anal Mach Intell 32(9):1582–1596
Article Google Scholar
Li X, Snoek CGM, Worring M (2009) Annotating images by harnessing worldwide user-tagged photos. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 3717–3720
Lin Z, Ding G, Hu M (2013) Multi-source image auto-annotation. In: IEEE international conference on image processing (ICIP), pp 2567–2571
Lin Z, Ding G, Hu M (2014) Image auto-annotation via tag-dependent random search over range-constrained visual neighbours. Multimedia Tools Appl:1–26
Liu Y, Jin R (2009) Distance metric learning: a comprehensive survey. Research report, Michigan State University
Makadia A, Pavlovic V, Kumar S (2008) A new baseline for image annotation. In: Proceedings of the 10th European Conference on Computer Vision, ECCV ’08, pp 316–329
Mensink T, Verbeek J, Perronnin F, Csurka G (2012) Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In: European conference on computer vision, ECCV’12, pp 488–501
Nagel K, Nowak S, Kühhirt U, Wolter K (2011) The fraunhofer idmt at imageclef 2011 photo annotation task. In: CLEF (Notebook Papers/Labs/Workshop)
Oliva A, Torralba A (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. Int J Comput Vis 42(3):145–175
Article MATH Google Scholar
Parameswaran S, Weinberger K (2010) Large margin multi-task metric learning. In: Advances in Neural Information Processing Systems, pp 1867–1875
Putthividhya D, Attias HT, Nagarajan SS (2010) Topic regression multi-modal latent dirichlet allocation for image annotation. In: International conference on computer vision and pattern recognition, pp 3408–3415
Su Y, Jurie F (2011) Semantic contexts and fisher vectors for the imageclef 2011 photo annotation task. In: CLEF (Notebook Papers/Labs/Workshop)
Ushiku Y, Muraoka H, Inaba S, Fujisawa T, Yasumoto K, Gunji N, Higuchi T, Hara Y, Harada T, Kuniyoshi Y (2012) Isi at imageclef 2012: Scalable system for image annotation. In: CLEF (Online Working Notes/Labs/Workshop)
van de Sande KEA, Snoek CGM (2011) The university of amsterdam’s concept detection system at imageclef 2011. In: CLEF (Notebook Papers/Labs/Workshop)
Verbeek J, Guillaumin M, Mensink T, Schmid C (2010) Image annotation with tagprop on the mirflickr set. In: Proceedings of the international conference on multimedia information retrieval, MIR ’10, pp 537–546
Verma Y, Jawahar C (2012) Image annotation using metric learning in semantic neighbourhoods. Eur Conf Comput Vis 7574:836–849
Google Scholar
Verma Y, Jawahar CV (2013) Exploring svm for image annotation in presence of confusing labels. In: BMVC, pp 25.1–25.11
Weinberger K, Saul L (2009) Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res 10:207–244
MATH Google Scholar
Wu L, Hoi SC, Jin R, Zhu J, Yu N (2011) Distance metric learning from uncertain side information for automated photo tagging. ACM Trans Intell Syst Technol 2(2):1–28
Article Google Scholar
Xiang Y, Zhou X, Chua TS, Ngo CW (2009) A revisit of generative model for automatic image annotation using markov random fields. In: IEEE Conference on computer vision and pattern recognition (CVPR), pp 1153–1160
Xu X, Shimada A, Ri T (2013) Image annotation by learning label-specific distance metrics. In: Internation conference on image analysis and processing, vol 8156, pp 101–110
Yang Y, Ma Z, Hauptmann A, Sebe N (2013) Feature selection for multimedia analysis by sharing information among multiple tasks. IEEE Trans Multimed 15(3):661–669
Article Google Scholar
Yuan XT, Liu X, Yan S (2012) Visual classification with multitask joint sparse representation. IEEE Trans Image Process 21(10):4349–4360
Article MathSciNet Google Scholar
Zhang S, Huang J, Huang Y, Yu Y, Li H, Metaxas D (2010) Automatic image annotation using group sparsity. In: IEEE Conference on computer vision and pattern recognition (CVPR), pp 3312–3319

Download references

Author information

Authors and Affiliations

Department of Advanced Information and Technology, Kyushu University, Fukuoka, Japan
Xing Xu, Atsushi Shimada, Hajime Nagahara & Rin-ichiro Taniguchi

Authors

Xing Xu
View author publications
You can also search for this author in PubMed Google Scholar
Atsushi Shimada
View author publications
You can also search for this author in PubMed Google Scholar
Hajime Nagahara
View author publications
You can also search for this author in PubMed Google Scholar
Rin-ichiro Taniguchi
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Xing Xu.

Appendices

Appendix A: Optimization in MTLM-LDML

For the logistic regression model in (7), the bias parameter b _m can be considered to be an additional entry in the weight vector w _m, as b _m = w ₀ and d ⁰(⋅) = −1. Thus, the probability of a similar pair in (7) can be concisely expressed as

$$ p_{qp}(s_{qp} = 1|y_{m}; D_{y_{m}}(I_{q}, I_{p}), b) = \sigma(-D_{y_{m}}(I_{q}, I_{p})). $$

(13)

Intuitively, to simplify the denotation, the cost function in (9) can be rewritten as

$$ E(\boldsymbol{w}) = {\gamma_{\ast} \| \boldsymbol{w}_{\ast} \|_{2}^{2}} + \sum\limits_{m=1}^{M}\left[\gamma_{m}{\| \boldsymbol{w}_{m} \|_{2}^{2}} + {\underset{(I_{q}, I_{p}) \in \mathcal{P}_{y_{m}}}\sum}{\log(1 + \exp(s_{qp} D_{y_{m}}(I_{q}, I_{p})))} \right]. $$

(14)

Based on the definitions of $D_{y_{m}}$ and the sigmoid function $\sigma (z) = \frac {1}{1 + \exp (-z)}$, we first consider the gradients with respect to weight vectors $\{\boldsymbol {w}_{m}\}_{m=1}^{M}$ of the label-specific metrics as

$$ E(\boldsymbol{w}_{m}) = \gamma_{m}{\| \boldsymbol{w}_{m} \|_{2}^{2}} + {\underset{(I_{q}, I_{p}) \in \mathcal{P}_{y_{m}}}\sum}{\log(1 + \exp(s_{qp} D_{y_{m}}(I_{q}, I_{p})))}. $$

(15)

Hence, the gradient of w _m is given by

$$ \boldsymbol{G}(\boldsymbol{w}_{m}) = \frac{\partial{E(\boldsymbol{w})}}{\partial{\boldsymbol{w}_{m}}} = 2\gamma_{m} \boldsymbol{w}_{m} + {\underset{(I_{q}, I_{p}) \in \mathcal{P}_{y_{m}}}\sum}{s_{qp} \sigma(s_{qp}D_{y_{m}}(I_{q}, I_{p}))\boldsymbol{d}(I_{q}, I_{p})}. $$

(16)

Then, we consider gradients with respect to weight vector w _∗ of the shared metric as

$$ \boldsymbol{G}(\boldsymbol{w}_{\ast}) = \frac{\partial{E(\boldsymbol{w})}}{\partial{\boldsymbol{w}_{\ast}}} = 2\gamma_{\ast} \boldsymbol{w}_{\ast} + \sum\limits_{m=1}^{M} ~~~~~~{\underset{(I_{q}, I_{p}) \in \mathcal{P}_{y_{m}}}\sum}{s_{qp} \sigma(s_{qp}D_{y_{m}}(I_{q}, I_{p}))\boldsymbol{d}(I_{q}, I_{p})}. $$

(17)

Note that the gradients of $\{\boldsymbol {w}_{m}\}_{m=1}^{M}$ in (16) can be calculated separately according to the pairwise constraints of labels $\{{y_{m}}\}_{m=1}^{M}$, while the gradient of w _∗ in (17) depends on all gradients of $\{\boldsymbol {w}_{m}\}_{m=1}^{M}$ and must be calculated thereafter.

Finally, the iterative update solutions for weight vectors $\{\boldsymbol {w}_{m}\}_{m=1}^{M}$ and w _∗ can be formulated as

$$ \boldsymbol{w}_{m}^{t+1} = [\boldsymbol{w}_{m}^{t} - \eta_{t} \boldsymbol{G}(\boldsymbol{w}_{m}^{t}) ]_{+}, $$

(18)

$$ \boldsymbol{w}_{\ast}^{t+1} = [\boldsymbol{w}_{\ast}^{t} - \eta_{t} \boldsymbol{G}(\boldsymbol{w}_{\ast}^{t}) ]_{+}, $$

(19)

where η _t is the step size of iteration t, and $[f]_{+} = \max (0, f)$ truncates any negative entries in w and sets them to zero for the non-negative constraints of w. Algorithm 1 summarizes the learning process in the proposed MTLM-LDML method.

Appendix B: Optimization in MTLM-LMNN

Similar to the computation of gradients in Appendix A, from the cost function in (12) we can derive the gradients of weight vectors $\{\boldsymbol {w}_{m}\}_{m=1}^{M}$ and w _∗ as

$$ \boldsymbol{G}(\boldsymbol{w}_{m}) = 2\gamma_{m} \boldsymbol{w}_{m} + {\underset{(I_{q}, I_{n}) \in \mathcal{P}_{y_{m}}^{+}}\sum}{\boldsymbol{d}(I_{q}, I_{p})} + \mu {\underset{(I_{q}, I_{n}) \in \mathcal{P}_{y_{m}}^{-}}\sum}\left(\boldsymbol{d}(I_{q}, I_{p}) - \boldsymbol{d}(I_{q}, I_{n}) \right), $$

(20)

$$ \boldsymbol{G}(\boldsymbol{w}_{\ast}) = 2\gamma_{\ast} \boldsymbol{w}_{\ast} + \sum\limits_{m=1}^{M} \left[~~~~~ {\underset{(I_{q}, I_{n}) \in \mathcal{P}_{y_{m}}^{+}}\sum}{\boldsymbol{d}(I_{q}, I_{p})} + \mu {\underset{(I_{q}, I_{n}) \in \mathcal{P}_{y_{m}}^{-}}\sum}\left(\boldsymbol{d}(I_{q}, I_{p}) - \boldsymbol{d}(I_{q}, I_{n}) \right) \right]. $$

(21)

Here we make the similar observations that the gradient of $\{\boldsymbol {w}_{m}\}_{m=1}^{M}$ in (20) can be calculated separately according to the pairwise constraints $\mathcal {P}_{y_{m}}$ of each label y _m, while the gradient of w _∗ in (21) depends on all the gradients of $\{\boldsymbol {w}_{m}\}_{m=1}^{M}$.

Since naive computation of the gradients in (20) and (21) would be extremely expensive, we consider only the “active” triplets in $\mathcal {P}_{y_{m}}$ that change from iteration t to iteration t + 1 based on the following update rules:

$$\begin{array}{@{}rcl@{}} \boldsymbol{G}(\boldsymbol{w}_{m}^{t+1})&=&\boldsymbol{G}(\boldsymbol{w}_{m}^{t}) - \eta_{t} {\underset{(I_{q}, I_{p}, I_{n}) \in \mathcal{P}_{y_{m}}^{t} - P_{y_{m}}^{t+1}}\sum}(\boldsymbol{d}(I_{q}, I_{p}) - \boldsymbol{d}(I_{q}, I_{n})) + \eta_{t}\\&&{\kern4pc} {\underset{(I_{q}, I_{p}, I_{n}) \in \mathcal{P}_{y_{m}}^{t+1} - P_{y_{m}}^{t}}\sum}(\boldsymbol{d}(I_{q}, I_{p}) - \boldsymbol{d}(I_{q}, I_{n})), \end{array} $$

(22)

$$\begin{array}{@{}rcl@{}} \boldsymbol{G}(\boldsymbol{w}_{\ast}^{t+1}) = \boldsymbol{G}(\boldsymbol{w}_{\ast}^{t}) - \sum\limits_{m=1}^{M} \left[ \eta_{t} {\underset{(I_{q}, I_{p}, I_{n}) \in \mathcal{P}_{y_{m}}^{t} - \mathcal{P}_{y_{m}}^{t+1}}\sum}(\boldsymbol{d}(I_{q}, I_{p}) - \boldsymbol{d}(I_{q}, I_{n})) + \eta_{t}\right.\\ \left.{\underset{(I_{q}, I_{p}, I_{n}) \in \mathcal{P}_{y_{m}}^{t+1} - \mathcal{P}_{y_{m}}^{t}}\sum}(\boldsymbol{d}(I_{q}, I_{p}) - \boldsymbol{d}(I_{q}, I_{n})) \right], \end{array} $$

(23)

where μ _t is the step size in the t-th iteration, $\mathcal {P}_{y_{m}}^{t+1} - \mathcal {P}_{y_{m}}^{t}$ represents the new triplets appearing in $\mathcal {P}_{y_{m}}^{t+1}$, and $\mathcal {P}_{y_{m}}^{t} - \mathcal {P}_{y_{m}}^{t+1}$ represents the old triplets that have disappeared in $\mathcal {P}_{y_{m}}^{t+1}$. For a small step size, the set $\mathcal {P}_{y_{m}}^{t}$ changes little in each iteration. In this case, computing the gradients in (22) and (23) is much faster.

Finally, we utilize a gradient descent based learning process similar to that depicted in Algorithm 1 to learn weight vectors $\{\boldsymbol {w}_{m}\}_{m=1}^{M}$ and w _∗ for the proposed MTLM-LMNN model.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Xu, X., Shimada, A., Nagahara, H. et al. Learning multi-task local metrics for image annotation. Multimed Tools Appl 75, 2203–2231 (2016). https://doi.org/10.1007/s11042-014-2402-7

Download citation

Received: 24 February 2014
Revised: 26 September 2014
Accepted: 24 November 2014
Published: 14 December 2014
Issue Date: February 2016
DOI: https://doi.org/10.1007/s11042-014-2402-7

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Learning multi-task local metrics for image annotation

Abstract

Access this article

Similar content being viewed by others

Image Matching from Handcrafted to Deep Features: A Survey

Learning to Prompt for Vision-Language Models

The Curious Layperson: Fine-Grained Image Recognition Without Expert Labels

Notes

References

Author information

Authors and Affiliations

Corresponding author

Appendices

Appendix A: Optimization in MTLM-LDML

Appendix B: Optimization in MTLM-LMNN

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Learning multi-task local metrics for image annotation

Abstract

Access this article

Similar content being viewed by others

Image Matching from Handcrafted to Deep Features: A Survey

Learning to Prompt for Vision-Language Models

The Curious Layperson: Fine-Grained Image Recognition Without Expert Labels

Notes

References

Author information

Authors and Affiliations

Corresponding author

Appendices

Appendix A: Optimization in MTLM-LDML

Appendix B: Optimization in MTLM-LMNN

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation