Abstract
The concept of object-to-object similarity plays a crucial role in interactive content-based video retrieval tools. Similarity (or distance) models are core components of several retrieval concepts, e.g., Query by Example or relevance feedback. In these scenarios, the common approach is to apply a feature extractor that transforms the object into a vector of features, i.e., positions it in an induced latent space. The similarity is then based on some distance metric in this space.
Historically, feature extractors were mostly based on color histograms or hand-crafted descriptors such as SIFT, but nowadays state-of-the-art tools mostly rely on deep learning (DL) approaches. However, there has so far been no systematic study of how suitable individual feature extractors are in the video retrieval domain, or, in other words, to what extent human-perceived and model-based similarities are concordant. To fill this gap, we conducted a user study with over 4000 similarity judgements comparing over 20 variants of feature extractors. The results corroborate the dominance of deep learning approaches, but surprisingly favor smaller and simpler DL models over larger ones.
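The pipeline described above (extract a feature vector, then rank by distance in the latent space) can be sketched as follows. This is a minimal, stdlib-only illustration: the coarse RGB histogram stands in for one of the classical baselines the study compares, and the item names are hypothetical, not taken from the paper's implementation.

```python
import math

# Hypothetical toy "feature extractor": a coarse RGB histogram. Real systems
# would use a deep model's embedding instead; this is only an illustrative sketch.
def rgb_histogram(pixels, bins=4):
    """Map an image (list of (r, g, b) tuples, 0-255) to a bins**3 feature vector."""
    step = 256 // bins
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]  # normalise to unit mass

def cosine_distance(u, v):
    """A common distance metric in the induced latent space: 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Query by Example: rank database items by distance to the query's feature vector.
query = rgb_histogram([(200, 30, 30)] * 10)          # a reddish image
db = {
    "red_frame":  rgb_histogram([(210, 40, 25)] * 10),
    "blue_frame": rgb_histogram([(20, 30, 220)] * 10),
}
ranking = sorted(db, key=lambda k: cosine_distance(query, db[k]))
```

Relevance feedback fits the same sketch: the query vector would be updated from the user's positive and negative examples before re-ranking.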
Notes
1.
2.
3.
4. The exact prompt was “Which image is more similar to the one on the top?”.
5. All mentioned differences were statistically significant with \(p<0.05\) w.r.t. the Fisher exact test.
6. The first group included RGB Histogram 256, LAB Positional 8x8, ImageGPT medium, EfficientNetB7, ViT large and ResNetV2 152. The second group included RGB Histogram 64, LAB Positional 2x2, ImageGPT small, EfficientNetB0, ViT base and ResNetV2 50.
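The significance testing mentioned in note 5 can be reproduced with a two-sided Fisher exact test on a 2x2 contingency table. Below is a stdlib-only sketch using the hypergeometric distribution; the counts are hypothetical placeholders, not the study's actual data.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher exact test for a 2x2 contingency table [[a, b], [c, d]]:
    sums the probabilities of all tables with the same margins that are no more
    likely than the observed one."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def p_of(x):
        # Hypergeometric probability of the table whose top-left cell is x.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = p_of(a)
    lo = max(0, col1 - row2)   # smallest feasible top-left cell
    hi = min(col1, row1)       # largest feasible top-left cell
    eps = 1e-12                # tolerance for floating-point ties
    return sum(p_of(x) for x in range(lo, hi + 1) if p_of(x) <= p_obs + eps)

# Hypothetical counts: judgements preferring extractor A vs. B in two conditions.
p = fisher_exact_two_sided([[30, 10], [15, 25]])
```

With the placeholder table above, the test rejects the null hypothesis at the paper's \(p<0.05\) threshold.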
References
Berns, F., Rossetto, L., Schoeffmann, K., Beecks, C., Awad, G.: V3C1 dataset: an evaluation of content characteristics. In: ICMR 2019, pp. 334–338. ACM (2019)
Chen, M., et al.: Generative pretraining from pixels. In: ICML 2020. PMLR (2020)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR 2009, pp. 248–255. IEEE (2009)
Dosovitskiy, A., et al.: An image is worth \(16\times 16\) words: transformers for image recognition at scale. arXiv (2020)
He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 630–645. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_38
Hebart, M.N., Zheng, C.Y., Pereira, F., Baker, C.I.: Revealing the multidimensional mental representations of natural objects underlying human similarity judgements. Nat. Hum. Behav. 4(11), 1173–1185 (2020)
Heller, S., Gsteiger, V., Bailer, W., et al.: Interactive video retrieval evaluation at a distance: comparing sixteen interactive video search systems in a remote setting at the 10th video browser showdown. Int. J. Multimed. Inf. Retr. 11(1), 1–18 (2022). https://doi.org/10.1007/s13735-021-00225-2
Hezel, N., Schall, K., Jung, K., Barthel, K.U.: Efficient search and browsing of large-scale video collections with vibro. In: Þór Jónsson, B., et al. (eds.) MMM 2022. LNCS, vol. 13142, pp. 487–492. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-98355-0_43
Hofmann, K., Schuth, A., Bellogín, A., de Rijke, M.: Effects of position bias on click-based recommender evaluation. In: de Rijke, M., et al. (eds.) ECIR 2014. LNCS, vol. 8416, pp. 624–630. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-06028-6_67
Huang, P., Dai, S.: Image retrieval by texture similarity. Pattern Recogn. 36(3), 665–679 (2003)
Jégou, H., Douze, M., Schmid, C., Pérez, P.: Aggregating local descriptors into a compact image representation. In: CVPR 2010, pp. 3304–3311. IEEE (2010)
Kratochvíl, M., Veselý, P., Mejzlík, F., Lokoč, J.: SOM-hunter: video browsing with relevance-to-SOM feedback loop. In: Ro, Y.M., et al. (eds.) MMM 2020. LNCS, vol. 11962, pp. 790–795. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-37734-2_71
Křenková, M., Mic, V., Zezula, P.: Similarity search with the distance density model. In: Skopal, T., Falchi, F., Lokoč, J., Sapino, M.L., Bartolini, I., Patella, M. (eds.) SISAP 2022. LNCS, vol. 13590, pp. 118–132. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-17849-8_10
Li, X., Xu, C., Yang, G., Chen, Z., Dong, J.: W2VV++ fully deep learning for ad-hoc video search. In: ACM MM 2019, pp. 1786–1794 (2019)
Li, Y., et al.: TGIF: a new dataset and benchmark on animated GIF description. In: CVPR 2016, pp. 4641–4650 (2016)
Lokoč, J., Mejzlík, F., Souček, T., Dokoupil, P., Peška, L.: Video search with context-aware ranker and relevance feedback. In: Þór Jónsson, B., et al. (eds.) MMM 2022. LNCS, vol. 13142, pp. 505–510. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-98355-0_46
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004). https://doi.org/10.1023/B:VISI.0000029664.99615.94
Lu, T.C., Chang, C.C.: Color image retrieval technique based on color features and image bitmap. Inf. Process. Manag. 43(2), 461–472 (2007)
McLaren, K.: The development of the CIE 1976 (L*a*b*) uniform colour-space and colour-difference formula. J. Soc. Dyers Colour. 92, 338–341 (1976)
Peterson, J.C., Abbott, J.T., Griffiths, T.L.: Evaluating (and improving) the correspondence between deep neural networks and human representations. Cogn. Sci. 42(8), 2648–2669 (2018)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML 2021, pp. 8748–8763. PMLR (2021)
Roads, B.D., Love, B.C.: Enriching ImageNet with human similarity judgments and psychological embeddings. In: CVPR 2021, pp. 3547–3557. IEEE/CVF (2021)
Skopal, T.: On visualizations in the role of universal data representation. In: ICMR 2020, pp. 362–367. ACM (2020)
Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., Liu, C.: A survey on deep transfer learning. In: Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., Maglogiannis, I. (eds.) ICANN 2018. LNCS, vol. 11141, pp. 270–279. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01424-7_27
Tan, M., Le, Q.: EfficientNet: rethinking model scaling for convolutional neural networks. In: ICML 2019, pp. 6105–6114. PMLR (2019)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Xu, J., Mei, T., Yao, T., Rui, Y.: MSR-VTT: a large video description dataset for bridging video and language. In: CVPR 2016, pp. 5288–5296 (2016)
Acknowledgments
This paper has been supported by the Czech Science Foundation (GAČR) project 22-21696S and Charles University grant SVV-260588. Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA CZ LM2018140), supported by the Ministry of Education, Youth and Sports of the Czech Republic. Source codes and raw data are available from https://github.com/Anophel/image_similarity_study.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Veselý, P., Peška, L. (2023). Less Is More: Similarity Models for Content-Based Video Retrieval. In: Dang-Nguyen, DT., et al. MultiMedia Modeling. MMM 2023. Lecture Notes in Computer Science, vol 13834. Springer, Cham. https://doi.org/10.1007/978-3-031-27818-1_5
Print ISBN: 978-3-031-27817-4
Online ISBN: 978-3-031-27818-1