Enhancing Intra-modal Similarity in a Cross-Modal Triplet Loss

Mallea, Mario; Nanculef, Ricardo; Araya, Mauricio

doi:10.1007/978-3-031-45275-8_17

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14276)

Included in the following conference series:

International Conference on Discovery Science

Abstract

Cross-modal retrieval requires building a common latent space that captures and correlates information from different data modalities, usually images and texts. Cross-modal training based on the triplet loss with hard negative mining is a state-of-the-art technique for this problem. This paper shows that this approach is not always effective in handling intra-modal similarities. Specifically, we found that it can lead to inconsistent similarity orderings in the latent space, where intra-modal pairs with unknown ground-truth similarity are ranked higher than cross-modal pairs representing the same concept. To address this problem, we propose two novel loss functions that leverage intra-modal similarity constraints that are available in a training triplet but not used by the original formulation. Additionally, this paper explores the application of this framework to unsupervised image retrieval problems, where cross-modal training can provide the supervisory signals that are otherwise missing in the absence of category labels. To the best of our knowledge, we are the first to evaluate cross-modal training for intra-modal retrieval without labels.
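As a point of reference for the baseline discussed above, the following is a minimal NumPy sketch of a cross-modal triplet loss with hardest-negative mining, in the style of VSE++ (Faghri et al.). It is not the loss proposed in this paper. The function name, the margin value of 0.2, and the averaging over the batch are assumptions made for illustration only.

```python
import numpy as np


def hardest_negative_triplet_loss(img, txt, margin=0.2):
    """Illustrative cross-modal triplet loss with hardest in-batch negatives.

    img, txt: arrays of shape (n, d); row i of img is assumed to match row i of txt.
    The margin default and the batch averaging are assumptions, not values from the paper.
    """
    # L2-normalise rows so that the dot product equals cosine similarity.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)

    sim = img @ txt.T                 # sim[i, j] = s(image_i, text_j)
    pos = np.diag(sim)                # similarity of each matched pair

    # Mask out the positives so they cannot be selected as negatives.
    neg = np.where(np.eye(len(sim), dtype=bool), -np.inf, sim)
    hardest_txt = neg.max(axis=1)     # hardest caption for each image
    hardest_img = neg.max(axis=0)     # hardest image for each caption

    # Hinge terms: image-to-text and text-to-image directions.
    loss_i2t = np.maximum(0.0, margin + hardest_txt - pos)
    loss_t2i = np.maximum(0.0, margin + hardest_img - pos)
    return float(np.mean(loss_i2t + loss_t2i))
```

Every term in this sketch compares an image with a caption. No image-image or text-text similarity enters the objective, which is the gap in the original formulation that the abstract points to.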

We present comprehensive experiments on MS-COCO and Flickr30K, demonstrating the advantages and limitations of the proposed methods in cross-modal and intra-modal retrieval tasks in terms of performance and novelty measures. Our code is publicly available on GitHub at https://github.com/MariodotR/FullHN.git.

This research was partially funded by the National Agency for Research and Development (ANID, Chile), grant numbers FONDEF IT21I0019, ANID PIA/APOYO AFB180002 and ANID-Basal Project FB0008.

Notes

1. For MS-COCO, we report results for 5k images.
2. With results of \([351.3 - 350.3,\ 351.0 - 349.3,\ 350.4 - 351.4]\), respectively.
3. With results of [367.0, 368.1, 375.9, 377.5, 370.4], respectively.
4. The best value is underlined and the best without considering TERAN is highlighted in bold.

Author information

Authors and Affiliations

Universidad Federico Santa María, Valparaíso, Chile
Mario Mallea, Ricardo Nanculef & Mauricio Araya

Authors

Mario Mallea
Ricardo Nanculef
Mauricio Araya

Corresponding author

Correspondence to Mario Mallea.

Editor information

Editors and Affiliations

Waikato University, Hamilton, New Zealand
Albert Bifet
Aeronautics Institute of Technology, São José dos Campos, Brazil
Ana Carolina Lorena
University of Porto, Porto, Portugal
Rita P. Ribeiro
University of Porto, Porto, Portugal
João Gama
University of Coimbra, Coimbra, Portugal
Pedro H. Abreu

About this paper

Cite this paper

Mallea, M., Nanculef, R., Araya, M. (2023). Enhancing Intra-modal Similarity in a Cross-Modal Triplet Loss. In: Bifet, A., Lorena, A.C., Ribeiro, R.P., Gama, J., Abreu, P.H. (eds) Discovery Science. DS 2023. Lecture Notes in Computer Science, vol 14276. Springer, Cham. https://doi.org/10.1007/978-3-031-45275-8_17

DOI: https://doi.org/10.1007/978-3-031-45275-8_17
Published: 08 October 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45274-1
Online ISBN: 978-3-031-45275-8
eBook Packages: Computer Science, Computer Science (R0)
