Abstract
Fashion multi-modal retrieval has recently been addressed with vision-and-language transformers. However, these models do not scale in training time or memory because of their quadratic attention mechanism. Moreover, they cast retrieval as a classification task, assigning a similarity score to each input pair of text and image. Every query must therefore be paired, at runtime, with every text or image in the entire dataset, which precludes scaling to large datasets. We propose a novel approach for efficient multi-modal retrieval in the fashion domain that combines self-supervised pretraining with linear attention and deep metric learning to create a latent space where spatial proximity among instances translates into a semantic similarity score. Unlike existing contributions, our approach embeds text and images separately, decoupling them and allowing us, after training, to place and search in the space even new images with missing text and vice versa. Experiments show that with a single 12 GB GPU our solution outperforms existing state-of-the-art contributions on the FashionGen dataset, in both efficacy and efficiency. Our architecture also enables the adoption of multidimensional indices, with which retrieval scales in logarithmic time up to millions, and potentially billions, of texts and images.
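The decoupled retrieval scheme the abstract describes can be sketched in a few lines of numpy. This is an illustrative toy only: `embed` is a hypothetical stand-in for either transformer branch, and the brute-force dot product here would, in practice, be replaced by a multidimensional index (e.g. a ball tree or product quantization) for sublinear search. The key point is that text and images are embedded independently, so a query is resolved by nearest-neighbour search rather than by re-scoring every pair.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(n, d=64):
    """Hypothetical stand-in for either encoder: returns n L2-normalised
    d-dimensional vectors in the shared metric space."""
    v = rng.normal(size=(n, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

# Image embeddings, computed once offline and indexed.
gallery = embed(1000)

# A text query whose embedding lands near gallery item 42.
query = gallery[42] + 0.01 * rng.normal(size=64)
query /= np.linalg.norm(query)

# On the unit sphere, cosine similarity is a dot product; the argmax
# is the retrieved item. An index would find it in logarithmic time.
scores = gallery @ query
best = int(np.argmax(scores))
print(best)  # → 42
```

Because the two encoders never attend to each other, a new image with no caption (or a caption with no image) can still be embedded and inserted into the same index after training.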
Notes
1. The reader can refer to [2] for further technical details and a rigorous proof of why this approximation works.
2. The total input length L is therefore the sum of the text tokens and image patches.
3. In some of our preliminary experiments we also tested \(m=0.1\), but we found no substantial differences (the results were slightly worse in that case).
4. Pretraining with our linear attention model took \(\sim \)72 h to complete, and metric learning required \(\sim \)15 h. Using a quadratic Transformer, pretraining would have ended in \(\sim \)100 h and metric learning in \(\sim \)24 h.
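The margin hyperparameter \(m\) discussed in note 3 belongs to a triplet-style metric learning objective. The following is a minimal numpy sketch of such a loss, not the paper's exact formulation: the distance measure and the default margin value (0.2 here) are illustrative assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, m=0.2):
    """Triplet margin loss: pull a matching text/image pair together and
    push a non-matching one at least m farther away (Euclidean distance).
    The margin m is the hyperparameter varied in note 3."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + m)

a = np.array([0.0, 0.0])   # anchor embedding
p = np.array([0.1, 0.0])   # matching item, close to the anchor
n = np.array([1.0, 0.0])   # non-matching item, far away
print(triplet_loss(a, p, n))  # 0.0: pair already separated by more than m
```

Once the negative is closer than the positive plus the margin, the hinge activates and the loss grows linearly, which is what shapes the latent space so that proximity encodes semantic similarity.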
References
Cerroni, W., Moro, G., Pirini, T., Ramilli, M.: Peer-to-peer data mining classifiers for decentralized detection of network attacks, vol. 137, pp. 101–107, January 2013
Choromanski, K.M., et al.: Rethinking attention with performers. In: ICLR. OpenReview.net (2021)
Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: SCG, pp. 253–262. ACM (2004)
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT (1), pp. 4171–4186. Association for Computational Linguistics (2019)
Domeniconi, G., Masseroli, M., Moro, G., Pinoli, P.: Discovering new gene functionalities from random perturbations of known gene ontological annotations. In: KDIR, pp. 107–116. SciTePress (2014)
Domeniconi, G., Masseroli, M., Moro, G., Pinoli, P.: Cross-organism learning method to discover new gene functionalities. Comput. Methods Programs Biomed. 126, 20–34 (2016)
Domeniconi, G., Moro, G., Pasolini, R., Sartori, C.: Cross-domain text classification through iterative refining of target categories representations. In: KDIR, pp. 31–42. SciTePress (2014)
Domeniconi, G., Semertzidis, K., López, V., Daly, E.M., Kotoulas, S., Moro, G.: A novel method for unsupervised and supervised conversational message thread detection. In: DATA, pp. 43–54. SciTePress (2016)
Endo, M., Krishnan, R., Krishna, V., Ng, A.Y., Rajpurkar, P.: Retrieval-based chest X-ray report generation using a pre-trained contrastive language-image model. In: ML4H@NeurIPS. Proceedings of Machine Learning Research, vol. 158, pp. 209–219. PMLR (2021)
Fabbri, M., Moro, G.: Dow jones trading with deep learning: the unreasonable effectiveness of recurrent neural networks. In: DATA, pp. 142–153. SciTePress (2018)
Faghri, F., Fleet, D.J., Kiros, J.R., Fidler, S.: VSE++: improving visual-semantic embeddings with hard negatives. In: BMVC, p. 12. BMVA Press (2018)
Frisoni, G., Moro, G., Carlassare, G., Carbonaro, A.: Unsupervised event graph representation and similarity learning on biomedical literature. Sensors 22(1), 3 (2022)
Frome, A., et al.: Devise: a deep visual-semantic embedding model. In: NIPS, pp. 2121–2129 (2013)
Gao, D., et al.: FashionBERT: text and image matching with adaptive loss for cross-modal retrieval. In: SIGIR, pp. 2251–2260. ACM (2020)
Goenka, S., et al.: FashionVLP: vision language transformer for fashion retrieval with feedback. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14105–14115 (2022)
Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: CVPR (2), pp. 1735–1742. IEEE Computer Society (2006)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778. IEEE Computer Society (2016)
Hoffer, E., Ailon, N.: Deep metric learning using triplet network. In: Feragen, A., Pelillo, M., Loog, M. (eds.) SIMBAD 2015. LNCS, vol. 9370, pp. 84–92. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24261-3_7
Jégou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 33(1), 117–128 (2011)
Kitaev, N., Kaiser, L., Levskaya, A.: Reformer: the efficient transformer. In: ICLR. OpenReview.net (2020)
Laenen, K., Zoghbi, S., Moens, M.F.: Cross-modal search for fashion attributes. In: KDD 2017 (2017)
Lee, K.-H., Chen, X., Hua, G., Hu, H., He, X.: Stacked cross attention for image-text matching. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11208, pp. 212–228. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01225-0_13
Li, G., Duan, N., Fang, Y., Gong, M., Jiang, D.: Unicoder-VL: a universal encoder for vision and language by cross-modal pre-training. In: AAAI, pp. 11336–11344. AAAI Press (2020)
Li, X., et al.: Oscar: object-semantics aligned pre-training for vision-language tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 121–137. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_8
Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: NeurIPS, pp. 13–23 (2019)
Miech, A., Alayrac, J., Laptev, I., Sivic, J., Zisserman, A.: Thinking fast and slow: efficient text-to-visual retrieval with transformers. In: CVPR, pp. 9826–9836. Computer Vision Foundation/IEEE (2021)
Moro, G., Pagliarani, A., Pasolini, R., Sartori, C.: Cross-domain & in-domain sentiment analysis with memory-based deep neural networks. In: KDIR, pp. 125–136. SciTePress (2018)
Moro, G., Valgimigli, L.: Efficient self-supervised metric information retrieval: a bibliography based method applied to COVID literature. Sensors 21(19), 6430 (2021)
Muja, M., Lowe, D.G.: Fast approximate nearest neighbors with automatic algorithm configuration. In: VISAPP (1), pp. 331–340. INSTICC Press (2009)
Omohundro, S.M.: Five Balltree Construction Algorithms. International Computer Science Institute, Berkeley (1989)
Qi, D., Su, L., Song, J., Cui, E., Bharti, T., Sacheti, A.: ImageBERT: cross-modal pre-training with large-scale weak-supervised image-text data. CoRR abs/2001.07966 (2020)
Rostamzadeh, N., et al.: Fashion-Gen: the generative fashion dataset and challenge. CoRR abs/1806.08317 (2018)
Sadegharmaki, S., Kastner, M.A., Satoh, S.: FashionGraph: understanding fashion data using scene graph generation. In: 2020 25th International Conference On Pattern Recognition (ICPR), pp. 7923–7929. IEEE (2021)
Su, W., et al.: VL-BERT: pre-training of generic visual-linguistic representations. In: ICLR. OpenReview.net (2020)
Tay, Y., et al.: Long range arena: a benchmark for efficient transformers. CoRR abs/2011.04006 (2020)
Tay, Y., Dehghani, M., Bahri, D., Metzler, D.: Efficient transformers: a survey. CoRR abs/2009.06732 (2020)
Vaswani, A., et al.: Attention is all you need. In: NIPS, pp. 5998–6008 (2017)
Wang, J., et al.: GIT: a generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100 (2022)
Wang, L., Li, Y., Lazebnik, S.: Learning two-branch neural networks for image-text matching tasks. CoRR abs/1704.03470 (2017)
Wang, Y., et al.: Position focused attention network for image-text matching. In: IJCAI, pp. 3792–3798. ijcai.org (2019)
Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45 (2020)
Zaheer, M., et al.: Big bird: transformers for longer sequences. In: NeurIPS (2020)
Zhuge, M., et al.: Kaleido-BERT: vision-language pre-training on fashion domain. In: CVPR, pp. 12647–12657. Computer Vision Foundation/IEEE (2021)
Zoghbi, S., Heyman, G., Gomez, J.C., Moens, M.F.: Fashion meets computer vision and NLP at e-commerce search. Int. J. Comput. Electr. Eng. (IJCEE) 8, 31–43 (2016). https://doi.org/10.17706/IJCEE.2016.8.1.31-43
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Moro, G., Salvatori, S. (2022). Deep Vision-Language Model for Efficient Multi-modal Similarity Search in Fashion Retrieval. In: Skopal, T., Falchi, F., Lokoč, J., Sapino, M.L., Bartolini, I., Patella, M. (eds) Similarity Search and Applications. SISAP 2022. Lecture Notes in Computer Science, vol 13590. Springer, Cham. https://doi.org/10.1007/978-3-031-17849-8_4
Print ISBN: 978-3-031-17848-1
Online ISBN: 978-3-031-17849-8