
Deep Vision-Language Model for Efficient Multi-modal Similarity Search in Fashion Retrieval

  • Conference paper

Similarity Search and Applications (SISAP 2022)

Abstract

Fashion multi-modal retrieval has recently been addressed with vision-and-language transformers. However, these models scale poorly in training time and memory because of their quadratic attention mechanism. Moreover, they cast retrieval as a classification task, assigning a similarity score to each input pair of text and image: every query must therefore be resolved inefficiently at runtime by pairing it with each text or image in the entire dataset, which precludes scalability to large collections. We propose a novel approach to efficient multi-modal retrieval in the fashion domain that combines self-supervised pretraining with linear attention and deep metric learning to create a latent space in which spatial proximity between instances translates into a semantic similarity score. Unlike existing contributions, our approach embeds texts and images separately, decoupling the two modalities and allowing even new images with missing text, and vice versa, to be placed and searched in the space after training. Experiments show that, with a single 12 GB GPU, our solution outperforms existing state-of-the-art contributions on the FashionGen dataset in both efficacy and efficiency. Our architecture also enables the adoption of multidimensional indices, with which retrieval scales in logarithmic time to millions, and potentially billions, of texts and images.
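To make the decoupled retrieval scheme concrete, here is a minimal sketch that pairs independent text and image encoders with a ball tree index [30]. The encoders, embedding size, and collection size are hypothetical stand-ins (random vectors keep the snippet self-contained and runnable), not the paper's actual models:

```python
import numpy as np
from sklearn.neighbors import BallTree  # metric index, cf. [30]

rng = np.random.default_rng(0)
d = 256  # assumed embedding size, illustrative only

def l2_normalize(v):
    return v / np.linalg.norm(v, axis=1, keepdims=True)

# Hypothetical stand-ins for the trained encoders: after metric learning,
# both modalities land in one d-dimensional space where proximity encodes
# semantic similarity.
def embed_images(n):
    return l2_normalize(rng.normal(size=(n, d)))

def embed_texts(n):
    return l2_normalize(rng.normal(size=(n, d)))

# Index all image embeddings once, offline; images with missing text can
# still be indexed because the encoders are decoupled.
tree = BallTree(embed_images(10_000))

# A text query is resolved with one nearest-neighbour search instead of
# being scored against every image in the collection.
dist, idx = tree.query(embed_texts(1), k=5)
print(idx[0])  # positions of the 5 most similar images
```

Because each modality is embedded on its own, the image index never needs rebuilding when new text queries arrive, and vice versa.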


Notes

  1. The reader can refer to [2] for further technical details and a rigorous proof of why this approximation works.
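For intuition, a minimal NumPy sketch of the positive-random-feature approximation behind linear attention [2] follows; the single-head setup, feature count, and scaling details are simplifications, not the paper's actual implementation:

```python
import numpy as np

def linear_attention(Q, K, V, n_features=64, seed=0):
    """Approximate softmax(Q K^T / sqrt(d)) V in O(L) rather than O(L^2):
    the L x L attention matrix is never materialized."""
    L, d = Q.shape
    Q, K = Q / d**0.25, K / d**0.25   # fold in the 1/sqrt(d) scaling
    W = np.random.default_rng(seed).normal(size=(d, n_features))

    def phi(X):  # positive random features: E[phi(q) . phi(k)] = exp(q . k)
        return np.exp(X @ W - (X**2).sum(-1, keepdims=True) / 2) / n_features**0.5

    qf, kf = phi(Q), phi(K)           # (L, m) feature maps
    out = qf @ (kf.T @ V)             # O(L * m * d), linear in L
    norm = qf @ kf.sum(axis=0)        # row-wise softmax normalizer
    return out / norm[:, None]

rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(512, 64)) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # (512, 64)
```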

  2. The total input length \(L\) is therefore the sum of both text tokens and image patches.
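For instance, under assumed sizes (not the paper's configuration):

```python
# 64 text tokens plus a 224x224 image split into 16x16 patches, as in
# common vision transformers; all numbers are illustrative.
n_text_tokens = 64
n_patches = (224 // 16) ** 2     # 14 * 14 = 196 image patches
L = n_text_tokens + n_patches    # total input length: 260
```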

  3. In some preliminary experiments we also tested \(m=0.1\), but found no substantial differences (the results were slightly worse in that case).
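For context, the margin \(m\) enters the metric-learning objective as in the standard triplet formulation [18]; a minimal PyTorch sketch follows, with a placeholder default margin rather than the paper's setting:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, m=0.5):
    # Push each anchor at least m closer to its positive than to its
    # negative; m=0.5 is a placeholder, not the paper's chosen value.
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return torch.clamp(d_pos - d_neg + m, min=0).mean()

a, p, n = (torch.randn(8, 256) for _ in range(3))
print(triplet_loss(a, p, n).item())
```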

  4. Pretraining with our linear attention model took \(\sim \)72 h to complete, and metric learning required \(\sim \)15 h. With a quadratic Transformer, pretraining would have taken \(\sim \)100 h and metric learning \(\sim \)24 h.

References

  1. Cerroni, W., Moro, G., Pirini, T., Ramilli, M.: Peer-to-peer data mining classifiers for decentralized detection of network attacks, vol. 137, pp. 101–107, January 2013

  2. Choromanski, K.M., et al.: Rethinking attention with performers. In: ICLR. OpenReview.net (2021)

  3. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: SCG, pp. 253–262. ACM (2004)

  4. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT (1), pp. 4171–4186. Association for Computational Linguistics (2019)

  5. Domeniconi, G., Masseroli, M., Moro, G., Pinoli, P.: Discovering new gene functionalities from random perturbations of known gene ontological annotations. In: KDIR, pp. 107–116. SciTePress (2014)

  6. Domeniconi, G., Masseroli, M., Moro, G., Pinoli, P.: Cross-organism learning method to discover new gene functionalities. Comput. Methods Programs Biomed. 126, 20–34 (2016)

  7. Domeniconi, G., Moro, G., Pasolini, R., Sartori, C.: Cross-domain text classification through iterative refining of target categories representations. In: KDIR, pp. 31–42. SciTePress (2014)

  8. Domeniconi, G., Semertzidis, K., López, V., Daly, E.M., Kotoulas, S., Moro, G.: A novel method for unsupervised and supervised conversational message thread detection. In: DATA, pp. 43–54. SciTePress (2016)

  9. Endo, M., Krishnan, R., Krishna, V., Ng, A.Y., Rajpurkar, P.: Retrieval-based chest X-ray report generation using a pre-trained contrastive language-image model. In: ML4H@NeurIPS. Proceedings of Machine Learning Research, vol. 158, pp. 209–219. PMLR (2021)

  10. Fabbri, M., Moro, G.: Dow Jones trading with deep learning: the unreasonable effectiveness of recurrent neural networks. In: DATA, pp. 142–153. SciTePress (2018)

  11. Faghri, F., Fleet, D.J., Kiros, J.R., Fidler, S.: VSE++: improving visual-semantic embeddings with hard negatives. In: BMVC, p. 12. BMVA Press (2018)

  12. Frisoni, G., Moro, G., Carlassare, G., Carbonaro, A.: Unsupervised event graph representation and similarity learning on biomedical literature. Sensors 22(1), 3 (2022)

  13. Frome, A., et al.: DeViSE: a deep visual-semantic embedding model. In: NIPS, pp. 2121–2129 (2013)

  14. Gao, D., et al.: FashionBERT: text and image matching with adaptive loss for cross-modal retrieval. In: SIGIR, pp. 2251–2260. ACM (2020)

  15. Goenka, S., et al.: FashionVLP: vision language transformer for fashion retrieval with feedback. In: CVPR, pp. 14105–14115 (2022)

  16. Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: CVPR (2), pp. 1735–1742. IEEE Computer Society (2006)

  17. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778. IEEE Computer Society (2016)

  18. Hoffer, E., Ailon, N.: Deep metric learning using triplet network. In: Feragen, A., Pelillo, M., Loog, M. (eds.) SIMBAD 2015. LNCS, vol. 9370, pp. 84–92. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24261-3_7

  19. Jégou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 33(1), 117–128 (2011)

  20. Kitaev, N., Kaiser, L., Levskaya, A.: Reformer: the efficient transformer. In: ICLR. OpenReview.net (2020)

  21. Laenen, K., Zoghbi, S., Moens, M.F.: Cross-modal search for fashion attributes. In: KDD 2017 (2017)

  22. Lee, K.-H., Chen, X., Hua, G., Hu, H., He, X.: Stacked cross attention for image-text matching. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11208, pp. 212–228. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01225-0_13

  23. Li, G., Duan, N., Fang, Y., Gong, M., Jiang, D.: Unicoder-VL: a universal encoder for vision and language by cross-modal pre-training. In: AAAI, pp. 11336–11344. AAAI Press (2020)

  24. Li, X., et al.: Oscar: object-semantics aligned pre-training for vision-language tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 121–137. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_8

  25. Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: NeurIPS, pp. 13–23 (2019)

  26. Miech, A., Alayrac, J., Laptev, I., Sivic, J., Zisserman, A.: Thinking fast and slow: efficient text-to-visual retrieval with transformers. In: CVPR, pp. 9826–9836. Computer Vision Foundation/IEEE (2021)

  27. Moro, G., Pagliarani, A., Pasolini, R., Sartori, C.: Cross-domain & in-domain sentiment analysis with memory-based deep neural networks. In: KDIR, pp. 125–136. SciTePress (2018)

  28. Moro, G., Valgimigli, L.: Efficient self-supervised metric information retrieval: a bibliography based method applied to COVID literature. Sensors 21(19), 6430 (2021)

  29. Muja, M., Lowe, D.G.: Fast approximate nearest neighbors with automatic algorithm configuration. In: VISAPP (1), pp. 331–340. INSTICC Press (2009)

  30. Omohundro, S.M.: Five Balltree Construction Algorithms. International Computer Science Institute, Berkeley (1989)

  31. Qi, D., Su, L., Song, J., Cui, E., Bharti, T., Sacheti, A.: ImageBERT: cross-modal pre-training with large-scale weak-supervised image-text data. CoRR abs/2001.07966 (2020)

  32. Rostamzadeh, N., et al.: Fashion-Gen: the generative fashion dataset and challenge. CoRR abs/1806.08317 (2018)

  33. Sadegharmaki, S., Kastner, M.A., Satoh, S.: FashionGraph: understanding fashion data using scene graph generation. In: ICPR, pp. 7923–7929. IEEE (2021)

  34. Su, W., et al.: VL-BERT: pre-training of generic visual-linguistic representations. In: ICLR. OpenReview.net (2020)

  35. Tay, Y., et al.: Long range arena: a benchmark for efficient transformers. CoRR abs/2011.04006 (2020)

  36. Tay, Y., Dehghani, M., Bahri, D., Metzler, D.: Efficient transformers: a survey. CoRR abs/2009.06732 (2020)

  37. Vaswani, A., et al.: Attention is all you need. In: NIPS, pp. 5998–6008 (2017)

  38. Wang, J., et al.: GIT: a generative image-to-text transformer for vision and language. CoRR abs/2205.14100 (2022)

  39. Wang, L., Li, Y., Lazebnik, S.: Learning two-branch neural networks for image-text matching tasks. CoRR abs/1704.03470 (2017)

  40. Wang, Y., et al.: Position focused attention network for image-text matching. In: IJCAI, pp. 3792–3798. ijcai.org (2019)

  41. Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: EMNLP (System Demonstrations), pp. 38–45. Association for Computational Linguistics (2020)

  42. Zaheer, M., et al.: Big Bird: transformers for longer sequences. In: NeurIPS (2020)

  43. Zhuge, M., et al.: Kaleido-BERT: vision-language pre-training on fashion domain. In: CVPR, pp. 12647–12657. Computer Vision Foundation/IEEE (2021)

  44. Zoghbi, S., Heyman, G., Gomez, J.C., Moens, M.F.: Fashion meets computer vision and NLP at e-commerce search. Int. J. Comput. Electr. Eng. 8(1), 31–43 (2016). https://doi.org/10.17706/IJCEE.2016.8.1.31-43


Author information

Correspondence to Stefano Salvatori.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Moro, G., Salvatori, S. (2022). Deep Vision-Language Model for Efficient Multi-modal Similarity Search in Fashion Retrieval. In: Skopal, T., Falchi, F., Lokoč, J., Sapino, M.L., Bartolini, I., Patella, M. (eds) Similarity Search and Applications. SISAP 2022. Lecture Notes in Computer Science, vol 13590. Springer, Cham. https://doi.org/10.1007/978-3-031-17849-8_4


  • DOI: https://doi.org/10.1007/978-3-031-17849-8_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-17848-1

  • Online ISBN: 978-3-031-17849-8

  • eBook Packages: Computer Science, Computer Science (R0)
