Abstract
Fashion multi-modal retrieval has recently been addressed with vision-and-language transformers. However, these models do not scale in training time or memory because of their quadratic attention mechanism. Moreover, they cast retrieval as a classification task, assigning a similarity score to each input pair of text and image. Every query must therefore be paired, at runtime, with every text or image in the entire dataset, which precludes scaling to large datasets. We propose a novel approach for efficient multi-modal retrieval in the fashion domain that combines self-supervised pretraining with linear attention and deep metric learning to create a latent space where spatial proximity among instances translates into a semantic similarity score. Unlike existing contributions, our approach embeds text and images separately, decoupling them and allowing us, after training, to place and search in the space even new images with missing text and vice versa. Experiments show that with a single 12 GB GPU our solution outperforms existing state-of-the-art contributions on the FashionGen dataset, in both efficacy and efficiency. Our architecture also enables the adoption of multidimensional indices, with which retrieval scales in logarithmic time up to millions, and potentially billions, of texts and images.
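The decoupled retrieval scheme the abstract describes can be sketched in a few lines of numpy. This is an illustrative toy only: `embed` is a hypothetical stand-in for either transformer branch, and the brute-force dot product here would, in practice, be replaced by a multidimensional index (e.g. a ball tree or product quantization) for sublinear search. The key point is that text and images are embedded independently, so a query is resolved by nearest-neighbour search rather than by re-scoring every pair.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(n, d=64):
    """Hypothetical stand-in for either encoder: returns n L2-normalised
    d-dimensional vectors in the shared metric space."""
    v = rng.normal(size=(n, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

# Image embeddings, computed once offline and indexed.
gallery = embed(1000)

# A text query whose embedding lands near gallery item 42.
query = gallery[42] + 0.01 * rng.normal(size=64)
query /= np.linalg.norm(query)

# On the unit sphere, cosine similarity is a dot product; the argmax
# is the retrieved item. An index would find it in logarithmic time.
scores = gallery @ query
best = int(np.argmax(scores))
print(best)  # → 42
```

Because the two encoders never attend to each other, a new image with no caption (or a caption with no image) can still be embedded and inserted into the same index after training.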
Notes
1. The reader can refer to [2] for further technical details and a rigorous proof of why this approximation works.
2. The total input length L is therefore the sum of the text tokens and image patches.
3. In some of our preliminary experiments we also tested \(m=0.1\), but we found no substantial differences (the results were slightly worse in that case).
4. Pretraining with our linear attention model took \(\sim \)72 h to complete, and metric learning required \(\sim \)15 h. Using a quadratic Transformer, pretraining would have ended in \(\sim \)100 h and metric learning in \(\sim \)24 h.
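The margin hyperparameter \(m\) discussed in note 3 belongs to a triplet-style metric learning objective. The following is a minimal numpy sketch of such a loss, not the paper's exact formulation: the distance measure and the default margin value (0.2 here) are illustrative assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, m=0.2):
    """Triplet margin loss: pull a matching text/image pair together and
    push a non-matching one at least m farther away (Euclidean distance).
    The margin m is the hyperparameter varied in note 3."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + m)

a = np.array([0.0, 0.0])   # anchor embedding
p = np.array([0.1, 0.0])   # matching item, close to the anchor
n = np.array([1.0, 0.0])   # non-matching item, far away
print(triplet_loss(a, p, n))  # 0.0: pair already separated by more than m
```

Once the negative is closer than the positive plus the margin, the hinge activates and the loss grows linearly, which is what shapes the latent space so that proximity encodes semantic similarity.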
References
Cerroni, W., Moro, G., Pirini, T., Ramilli, M.: Peer-to-peer data mining classifiers for decentralized detection of network attacks, vol. 137, pp. 101–107, January 2013
Choromanski, K.M., et al.: Rethinking attention with performers. In: ICLR. OpenReview.net (2021)
Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: SCG, pp. 253–262. ACM (2004)
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT (1), pp. 4171–4186. Association for Computational Linguistics (2019)
Domeniconi, G., Masseroli, M., Moro, G., Pinoli, P.: Discovering new gene functionalities from random perturbations of known gene ontological annotations. In: KDIR, pp. 107–116. SciTePress (2014)
Domeniconi, G., Masseroli, M., Moro, G., Pinoli, P.: Cross-organism learning method to discover new gene functionalities. Comput. Methods Programs Biomed. 126, 20–34 (2016)
Domeniconi, G., Moro, G., Pasolini, R., Sartori, C.: Cross-domain text classification through iterative refining of target categories representations. In: KDIR, pp. 31–42. SciTePress (2014)
Domeniconi, G., Semertzidis, K., López, V., Daly, E.M., Kotoulas, S., Moro, G.: A novel method for unsupervised and supervised conversational message thread detection. In: DATA, pp. 43–54. SciTePress (2016)
Endo, M., Krishnan, R., Krishna, V., Ng, A.Y., Rajpurkar, P.: Retrieval-based chest X-ray report generation using a pre-trained contrastive language-image model. In: ML4H@NeurIPS. Proceedings of Machine Learning Research, vol. 158, pp. 209–219. PMLR (2021)
Fabbri, M., Moro, G.: Dow jones trading with deep learning: the unreasonable effectiveness of recurrent neural networks. In: DATA, pp. 142–153. SciTePress (2018)
Faghri, F., Fleet, D.J., Kiros, J.R., Fidler, S.: VSE++: improving visual-semantic embeddings with hard negatives. In: BMVC, p. 12. BMVA Press (2018)
Frisoni, G., Moro, G., Carlassare, G., Carbonaro, A.: Unsupervised event graph representation and similarity learning on biomedical literature. Sensors 22(1), 3 (2022)
Frome, A., et al.: Devise: a deep visual-semantic embedding model. In: NIPS, pp. 2121–2129 (2013)
Gao, D., et al.: FashionBERT: text and image matching with adaptive loss for cross-modal retrieval. In: SIGIR, pp. 2251–2260. ACM (2020)
Goenka, S., et al.: FashionVLP: vision language transformer for fashion retrieval with feedback. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14105–14115 (2022)
Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: CVPR (2), pp. 1735–1742. IEEE Computer Society (2006)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778. IEEE Computer Society (2016)
Hoffer, E., Ailon, N.: Deep metric learning using triplet network. In: Feragen, A., Pelillo, M., Loog, M. (eds.) SIMBAD 2015. LNCS, vol. 9370, pp. 84–92. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24261-3_7
Jégou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 33(1), 117–128 (2011)
Kitaev, N., Kaiser, L., Levskaya, A.: Reformer: the efficient transformer. In: ICLR. OpenReview.net (2020)
Laenen, K., Zoghbi, S., Moens, M.F.: Cross-modal search for fashion attributes. In: KDD 2017 (2017)
Lee, K.-H., Chen, X., Hua, G., Hu, H., He, X.: Stacked cross attention for image-text matching. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11208, pp. 212–228. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01225-0_13
Li, G., Duan, N., Fang, Y., Gong, M., Jiang, D.: Unicoder-VL: a universal encoder for vision and language by cross-modal pre-training. In: AAAI, pp. 11336–11344. AAAI Press (2020)
Li, X., et al.: Oscar: object-semantics aligned pre-training for vision-language tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 121–137. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_8
Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: NeurIPS, pp. 13–23 (2019)
Miech, A., Alayrac, J., Laptev, I., Sivic, J., Zisserman, A.: Thinking fast and slow: efficient text-to-visual retrieval with transformers. In: CVPR, pp. 9826–9836. Computer Vision Foundation/IEEE (2021)
Moro, G., Pagliarani, A., Pasolini, R., Sartori, C.: Cross-domain & in-domain sentiment analysis with memory-based deep neural networks. In: KDIR, pp. 125–136. SciTePress (2018)
Moro, G., Valgimigli, L.: Efficient self-supervised metric information retrieval: a bibliography based method applied to COVID literature. Sensors 21(19), 6430 (2021)
Muja, M., Lowe, D.G.: Fast approximate nearest neighbors with automatic algorithm configuration. In: VISAPP (1), pp. 331–340. INSTICC Press (2009)
Omohundro, S.M.: Five Balltree Construction Algorithms. International Computer Science Institute, Berkeley (1989)
Qi, D., Su, L., Song, J., Cui, E., Bharti, T., Sacheti, A.: ImageBERT: cross-modal pre-training with large-scale weak-supervised image-text data. CoRR abs/2001.07966 (2020)
Rostamzadeh, N., et al.: Fashion-Gen: the generative fashion dataset and challenge. CoRR abs/1806.08317 (2018)
Sadegharmaki, S., Kastner, M.A., Satoh, S.: FashionGraph: understanding fashion data using scene graph generation. In: 2020 25th International Conference On Pattern Recognition (ICPR), pp. 7923–7929. IEEE (2021)
Su, W., et al.: VL-BERT: pre-training of generic visual-linguistic representations. In: ICLR. OpenReview.net (2020)
Tay, Y., et al.: Long range arena: a benchmark for efficient transformers. CoRR abs/2011.04006 (2020)
Tay, Y., Dehghani, M., Bahri, D., Metzler, D.: Efficient transformers: a survey. CoRR abs/2009.06732 (2020)
Vaswani, A., et al.: Attention is all you need. In: NIPS, pp. 5998–6008 (2017)
Wang, J., et al.: GIT: a generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100 (2022)
Wang, L., Li, Y., Lazebnik, S.: Learning two-branch neural networks for image-text matching tasks. CoRR abs/1704.03470 (2017)
Wang, Y., et al.: Position focused attention network for image-text matching. In: IJCAI, pp. 3792–3798. ijcai.org (2019)
Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45 (2020)
Zaheer, M., et al.: Big bird: transformers for longer sequences. In: NeurIPS (2020)
Zhuge, M., et al.: Kaleido-BERT: vision-language pre-training on fashion domain. In: CVPR, pp. 12647–12657. Computer Vision Foundation/IEEE (2021)
Zoghbi, S., Heyman, G., Gomez, J.C., Moens, M.F.: Fashion meets computer vision and NLP at e-commerce search. Int. J. Comput. Electr. Eng. (IJCEE) 8, 31–43 (2016). https://doi.org/10.17706/IJCEE.2016.8.1.31-43
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Moro, G., Salvatori, S. (2022). Deep Vision-Language Model for Efficient Multi-modal Similarity Search in Fashion Retrieval. In: Skopal, T., Falchi, F., Lokoč, J., Sapino, M.L., Bartolini, I., Patella, M. (eds) Similarity Search and Applications. SISAP 2022. Lecture Notes in Computer Science, vol 13590. Springer, Cham. https://doi.org/10.1007/978-3-031-17849-8_4
Print ISBN: 978-3-031-17848-1
Online ISBN: 978-3-031-17849-8