Abstract
This study proposes an integrated approach to image similarity measurement that extends traditional methods, which concentrate on local features, to incorporate global information. Global information, including background, colors, spatial representation, and object relations, makes it possible to judge similarity from the overall context of an image using natural language processing techniques. We employ the Video-LLaMA model to extract textual descriptions of images through question prompts, and apply a cosine-similarity-based metric, BERTScore, to quantify image similarities. We conduct experiments on images of the same and different topics using various pre-trained language model configurations. To validate that the generated text descriptions are coherent with the actual theme of each image, we generate images using DALL-E 2 and evaluate them by human judgement. Key findings demonstrate that pre-trained language models effectively distinguish between images depicting similar topics and those depicting different topics, with a clear gap in similarity scores.
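The similarity step described in the abstract can be illustrated with a minimal sketch of BERTScore-style greedy matching: each token vector in one description is paired with its most similar token vector in the other by cosine similarity, and the averages give precision, recall, and F1. This is not the authors' implementation. The function names and the three-dimensional toy vectors are assumptions made for illustration; in practice the vectors would be contextual embeddings from a pre-trained language model applied to the generated image descriptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(cand, ref):
    """BERTScore-style precision, recall, and F1 for two lists of token vectors.

    Precision averages, over candidate tokens, the best cosine match in the
    reference; recall does the same from the reference side.
    """
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Hypothetical toy "token embeddings" standing in for three image descriptions.
desc_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
desc_b = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # same content as desc_a
desc_c = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # shares only one token with desc_a

same_topic = greedy_match_scores(desc_a, desc_b)
different_topic = greedy_match_scores(desc_a, desc_c)
print("same topic F1:", same_topic[2])
print("different topic F1:", different_topic[2])
```

With these toy vectors the identical pair scores higher than the partially overlapping pair, which mirrors the kind of similarity gap between same-topic and different-topic images that the abstract reports.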
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kim, J., Park, S., Park, B., Shin, B.S. (2024). Evaluating Image Similarity Using Contextual Information of Images with Pre-trained Models. In: Bouzefrane, S., Banerjee, S., Mourlin, F., Boumerdassi, S., Renault, É. (eds) Mobile, Secure, and Programmable Networking. MSPN 2023. Lecture Notes in Computer Science, vol 14482. Springer, Cham. https://doi.org/10.1007/978-3-031-52426-4_13
DOI: https://doi.org/10.1007/978-3-031-52426-4_13
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-52425-7
Online ISBN: 978-3-031-52426-4
eBook Packages: Computer Science (R0)