Evaluating Image Similarity Using Contextual Information of Images with Pre-trained Models

Kim, Juyeon; Park, Sungwon; Park, Byunghoon; Shin, B. Sooyeon

doi:10.1007/978-3-031-52426-4_13

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14482)

Included in the following conference series:

International Conference on Mobile, Secure, and Programmable Networking

Abstract

This study proposes an integrated approach to image similarity measurement that extends traditional methods, which concentrate on local features, to incorporate global information. Global information, including background, colors, spatial representation, and object relations, makes it possible to judge similarity from the overall context of an image using natural language processing techniques. We employ the Video-LLaMA model to extract textual descriptions of images through question prompts, and apply BERTScore, a metric based on cosine similarity, to quantify image similarity. We conduct experiments on images of the same and different topics using various pre-trained language model configurations. To validate the coherence of the generated text descriptions with the actual theme of each image, we generate images using DALL-E 2 and evaluate them using human judgement. Key findings demonstrate that pre-trained language models are effective at distinguishing between images depicting similar and different topics, with a clear gap in similarity.
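The scoring step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each image has already been turned into a short text description (here hard-coded strings stand in for Video-LLaMA output) and it replaces the contextual embeddings of a pre-trained language model with a tiny hand-made table, TOY_EMBEDDINGS, which is purely hypothetical. The greedy cosine matching follows the general BERTScore idea of precision, recall, and F1 over token embeddings, without the optional idf weighting or baseline rescaling.

```python
import math

# Hypothetical 3-dimensional "embeddings"; a real setup would take contextual
# token embeddings from a pre-trained language model instead.
TOY_EMBEDDINGS = {
    "dog": [1.0, 0.0, 0.0],
    "puppy": [0.9, 0.1, 0.0],
    "grass": [0.0, 1.0, 0.0],
    "lawn": [0.0, 0.9, 0.1],
    "car": [0.0, 0.0, 1.0],
    "street": [0.1, 0.0, 0.9],
}


def embed(description):
    """Map a description to a list of token vectors, skipping unknown words."""
    tokens = description.lower().split()
    return [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_match_score(candidate, reference):
    """BERTScore-style (precision, recall, F1) via greedy cosine matching."""
    if not candidate or not reference:
        return 0.0, 0.0, 0.0
    # Precision: each candidate token is matched to its closest reference token.
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    # Recall: each reference token is matched to its closest candidate token.
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    if precision + recall <= 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Stand-ins for descriptions of two same-topic images and one different-topic image.
desc_a = "dog grass"
desc_b = "puppy lawn"
desc_c = "car street"

same_topic_f1 = greedy_match_score(embed(desc_a), embed(desc_b))[2]
diff_topic_f1 = greedy_match_score(embed(desc_a), embed(desc_c))[2]
print("same topic F1:", round(same_topic_f1, 3))
print("different topic F1:", round(diff_topic_f1, 3))
```

With these toy vectors, the same-topic pair scores higher than the different-topic pair, which mirrors the kind of similarity gap the abstract reports. The size of the gap here is an artifact of the invented vectors and says nothing about the paper's actual results.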

Author information

Authors and Affiliations

Sorbonne University, Paris, France
Juyeon Kim
Korea University, Sejong, South Korea
Sungwon Park
T3Q Co., Ltd., Seoul, South Korea
Sungwon Park & Byunghoon Park
Center for Creative Convergence Education, Hanyang University, Seoul, South Korea
B. Sooyeon Shin

Corresponding author

Correspondence to B. Sooyeon Shin.

Editor information

Editors and Affiliations

Cedric Lab, Cnam, Paris, France
Samia Bouzefrane
Trasna Solutions, Rousset, France
Soumya Banerjee
University Paris-Est Créteil, Créteil, France
Fabrice Mourlin
Cedric Lab, Cnam, Paris, France
Selma Boumerdassi
LIGM, ESIEE, Noisy-le-Grand, France
Éric Renault

About this paper

Cite this paper

Kim, J., Park, S., Park, B., Shin, B.S. (2024). Evaluating Image Similarity Using Contextual Information of Images with Pre-trained Models. In: Bouzefrane, S., Banerjee, S., Mourlin, F., Boumerdassi, S., Renault, É. (eds) Mobile, Secure, and Programmable Networking. MSPN 2023. Lecture Notes in Computer Science, vol 14482. Springer, Cham. https://doi.org/10.1007/978-3-031-52426-4_13

DOI: https://doi.org/10.1007/978-3-031-52426-4_13
Published: 25 January 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-52425-7
Online ISBN: 978-3-031-52426-4
eBook Packages: Computer Science, Computer Science (R0)
