
FashionViL: Fashion-Focused Vision-and-Language Representation Learning

  • Conference paper
  • Published in: Computer Vision – ECCV 2022 (ECCV 2022)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13695)
  • Included in the conference series: European Conference on Computer Vision (ECCV)

Abstract

Large-scale Vision-and-Language (V+L) pre-training for representation learning has proven effective in boosting various downstream V+L tasks. However, when it comes to the fashion domain, existing V+L methods are inadequate as they overlook the unique characteristics of both fashion V+L data and fashion downstream tasks. In this work, we propose a novel fashion-focused V+L representation learning framework, dubbed FashionViL. It contains two novel fashion-specific pre-training tasks designed to exploit two intrinsic attributes of fashion V+L data. First, in contrast to other domains where a V+L datum contains only a single image-text pair, a fashion item is often associated with multiple images. We thus propose a Multi-View Contrastive Learning task that pulls the visual representation of one image closer to the compositional multimodal representation of another image combined with the text. Second, fashion text (e.g., a product description) often contains rich fine-grained concepts (attributes/noun phrases). To capitalize on this, a Pseudo-Attributes Classification task is introduced to encourage the learned unimodal (visual/textual) representations of the same concept to lie close together. Further, fashion V+L tasks uniquely include ones that do not conform to the common one-stream or two-stream architectures (e.g., text-guided image retrieval). We thus propose a flexible, versatile V+L model architecture built on a modality-agnostic Transformer, so that it can be readily adapted to any downstream task. Extensive experiments show that our FashionViL achieves a new state of the art across five downstream tasks. Code is available at https://github.com/BrandonHanx/mmf.
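The Multi-View Contrastive Learning task described above can be pictured as a symmetric InfoNCE objective between the unimodal embedding of one product image and the fused image+text embedding of another image of the same product. The sketch below is an illustration of the idea only, not the authors' released code; the encoder names, embedding dimension and temperature are assumptions.

import torch
import torch.nn.functional as F

def mvc_loss(img_emb, fused_emb, temperature=0.07):
    """InfoNCE-style loss that pulls the visual embedding of one product
    image (img_emb) towards the fused embedding of another image of the
    same product combined with its description (fused_emb).
    Both tensors have shape (batch, dim); matching rows are positives."""
    v = F.normalize(img_emb, dim=-1)
    m = F.normalize(fused_emb, dim=-1)
    logits = v @ m.t() / temperature                 # (batch, batch) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    # symmetric loss over both retrieval directions
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Usage with hypothetical encoders (shapes only):
# v1 = image_encoder(images_view1)                 # (B, D) unimodal branch
# m2 = fusion_encoder(images_view2, descriptions)  # (B, D) multimodal branch
# loss = mvc_loss(v1, m2)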


Notes

  1. A single-stream model can also be applied, at the cost of traversing every query-gallery pair, which results in unacceptable retrieval speed in large-scale applications.

  2. We randomly drop out words in \(\textbf{w}\) and patches in \(\textbf{d}\) with a probability of 15% to make the learning process more robust.

  3. Following UNITER, we use conditional masking for MLM/MPFC, i.e., we mask only one modality at a time while keeping the other intact.

  4. Following BERT and UNITER, we decompose this 15% into 10% random words, 10% unchanged, and 80% [MASK] (see the masking sketch after this list).

  5. Of the 101 images, one is positively paired with the text and the other 100 are randomly paired negatives that share the same sub-category as the positive, which increases the difficulty.

  6. Because the authors did not release their 1K retrieval set, we report the average recall over 5 experiments with 5 randomly selected 1K retrieval sets.

  7. Details on the reproduction of previous methods are in the supplementary file.

  8. We have no access to the data splits of CSA-Net, so we constructed the Polyvore Outfits splits [58] and reproduced CSA-Net ourselves according to the original papers [25, 38].
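Notes 3 and 4 describe UNITER/BERT-style masked modelling. The following sketch illustrates the 15% token selection and its 80%/10%/10% decomposition; masking only the text stream here mimics conditional masking. The function and its arguments (vocab_size, mask_token_id) are hypothetical and not the paper's implementation.

import torch

def mask_tokens(token_ids, vocab_size, mask_token_id, mask_prob=0.15):
    """Corrupt a (batch, seq_len) tensor of token ids for MLM.
    Returns (corrupted_ids, labels); labels are -100 where no loss applies."""
    ids = token_ids.clone()
    labels = token_ids.clone()

    selected = torch.rand(ids.shape, device=ids.device) < mask_prob
    labels[~selected] = -100                              # ignored by the MLM loss

    roll = torch.rand(ids.shape, device=ids.device)
    use_mask = selected & (roll < 0.8)                    # 80% of selected -> [MASK]
    use_rand = selected & (roll >= 0.8) & (roll < 0.9)    # 10% -> random word
    # the remaining 10% of selected tokens are left unchanged

    ids[use_mask] = mask_token_id
    random_ids = torch.randint(vocab_size, ids.shape, device=ids.device)
    ids[use_rand] = random_ids[use_rand]
    return ids, labels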

References

  1. Akbari, H., et al.: VATT: transformers for multimodal self-supervised learning from raw video, audio and text. In: NeurIPS (2021)

  2. Antol, S., et al.: VQA: visual question answering. In: ICCV (2015)

  3. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)

  4. Bao, H., Dong, L., Piao, S., Wei, F.: BEiT: BERT pre-training of image transformers. In: ICLR (2022)

  5. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit (2009). https://www.nltk.org

  6. Bugliarello, E., Cotterell, R., Okazaki, N., Elliott, D.: Multimodal pretraining unmasked: a meta-analysis and a unified framework of vision-and-language BERTs. TACL (2021)

  7. Chen, Y., Gong, S., Bazzani, L.: Image search with text feedback by visiolinguistic attention learning. In: CVPR (2020)

  8. Chen, Y.C., et al.: UNITER: universal image-text representation learning. In: ECCV (2020)

  9. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: EMNLP (2014)

  10. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT (2019)

  11. Dong, X., et al.: M5Product: a multi-modal pretraining benchmark for e-commercial product downstream tasks. arXiv preprint arXiv:2109.04275 (2021)

  12. Dong, X., et al.: PeCo: perceptual codebook for BERT pre-training of vision transformers. arXiv preprint arXiv:2111.12710 (2021)

  13. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: ICLR (2020)

  14. Dou, Z.Y., et al.: An empirical study of training end-to-end vision-and-language transformers. arXiv preprint arXiv:2111.02387 (2021)

  15. Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: CVPR (2021)

  16. Faghri, F., Fleet, D.J., Kiros, J.R., Fidler, S.: VSE++: improving visual-semantic embeddings with hard negatives. In: BMVC (2018)

  17. Fei, N., et al.: WenLan 2.0: make AI imagine via a multimodal foundation model. arXiv preprint arXiv:2110.14378 (2021)

  18. Gao, D., et al.: FashionBERT: text and image matching with adaptive loss for cross-modal retrieval. In: SIGIR (2020)

  19. Geigle, G., Pfeiffer, J., Reimers, N., Vulić, I., Gurevych, I.: Retrieve fast, rerank smart: cooperative and joint approaches for improved cross-modal retrieval. arXiv preprint arXiv:2103.11920 (2021)

  20. Guo, X., Wu, H., Cheng, Y., Rennie, S., Tesauro, G., Feris, R.S.: Dialog-based interactive image retrieval. In: NeurIPS (2018)

  21. Han, X., He, S., Zhang, L., Song, Y.Z., Xiang, T.: UIGR: unified interactive garment retrieval. In: CVPR Workshops (2022)

  22. Han, X., et al.: Automatic spatially-aware fashion concept discovery. In: ICCV (2017)

  23. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)

  24. Hoe, J.T., Ng, K.W., Zhang, T., Chan, C.S., Song, Y.Z., Xiang, T.: One loss for all: deep hashing with a single cosine similarity based learning objective. In: NeurIPS (2021)

  25. Hou, Y., Vig, E., Donoser, M., Bazzani, L.: Learning attribute-driven disentangled representations for interactive fashion retrieval. In: ICCV (2021)

  26. Hu, X., et al.: VIVO: visual vocabulary pre-training for novel object captioning. In: AAAI (2021)

  27. Huang, Z., Zeng, Z., Liu, B., Fu, D., Fu, J.: Pixel-BERT: aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849 (2020)

  28. Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: ICML (2021)

  29. Kim, W., Son, B., Kim, I.: ViLT: vision-and-language transformer without convolution or region supervision. In: ICML (2021)

  30. Lee, S., Kim, D., Han, B.: CoSMo: content-style modulation for image retrieval with text feedback. In: CVPR (2021)

  31. Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: vision and language representation learning with momentum distillation. In: NeurIPS (2021)

  32. Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: VisualBERT: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)

  33. Li, L.H., You, H., Wang, Z., Zareian, A., Chang, S.F., Chang, K.W.: Unsupervised vision-and-language pre-training without parallel images and captions. In: NAACL-HLT (2021)

  34. Li, W., et al.: UNIMO: towards unified-modal understanding and generation via cross-modal contrastive learning. In: ACL-IJCNLP (2021)

  35. Li, X., et al.: Oscar: object-semantics aligned pre-training for vision-language tasks. In: ECCV (2020)

  36. Liao, L., He, X., Zhao, B., Ngo, C.W., Chua, T.S.: Interpretable multimodal retrieval for fashion products. In: ACM MM (2018)

  37. Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: ECCV (2014)

  38. Lin, Y.L., Tran, S., Davis, L.S.: Fashion outfit complementary item retrieval. In: CVPR (2020)

  39. Liu, H., Yu, T., Li, P.: Inflate and shrink: enriching and reducing interactions for fast text-image retrieval. In: EMNLP (2021)

  40. Liu, Z., Rodriguez-Opazo, C., Teney, D., Gould, S.: Image retrieval on real-life images with pre-trained vision-and-language models. In: ICCV (2021)

  41. Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: NeurIPS (2019)

  42. Ma, Y., Jia, J., Zhou, S., Fu, J., Liu, Y., Tong, Z.: Towards better understanding the clothing fashion styles: a multimodal deep learning approach. In: AAAI (2017)

  43. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. JMLR (2008)

  44. van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)

  45. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS (2019)

  46. Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k Entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: ICCV (2015)

  47. Qi, D., Su, L., Song, J., Cui, E., Bharti, T., Sacheti, A.: ImageBERT: cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966 (2020)

  48. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)

  49. Ramesh, A., et al.: Zero-shot text-to-image generation. In: ICML (2021)

  50. Rostamzadeh, N., et al.: Fashion-Gen: the generative fashion dataset and challenge. arXiv preprint arXiv:1806.08317 (2018)

  51. Shin, M., Cho, Y., Ko, B., Gu, G.: RTIC: residual learning for text and image composition using graph convolutional network. arXiv preprint arXiv:2104.03015 (2021)

  52. Singh, A., et al.: MMF: a multimodal framework for vision and language research (2020). https://github.com/facebookresearch/mmf

  53. Su, W., et al.: VL-BERT: pre-training of generic visual-linguistic representations. In: ICLR (2020)

  54. Sun, S., Chen, Y.C., Li, L., Wang, S., Fang, Y., Liu, J.: LightningDOT: pre-training visual-semantic embeddings for real-time image-text retrieval. In: NAACL-HLT (2021)

  55. Tan, H., Bansal, M.: LXMERT: learning cross-modality encoder representations from transformers. In: EMNLP-IJCNLP (2019)

  56. Tan, R., Vasileva, M.I., Saenko, K., Plummer, B.A.: Learning similarity conditions without explicit supervision. In: ICCV (2019)

  57. van den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. In: NeurIPS (2017)

  58. Vasileva, M.I., Plummer, B.A., Dusad, K., Rajpal, S., Kumar, R., Forsyth, D.: Learning type-aware embeddings for fashion compatibility. In: ECCV (2018)

  59. Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)

  60. Vo, N., Jiang, L., Sun, C., Murphy, K., Li, L.J., Fei-Fei, L., Hays, J.: Composing text and image for image retrieval - an empirical odyssey. In: CVPR (2019)

  61. Wang, J., et al.: UFO: a unified transformer for vision-language representation learning. arXiv preprint arXiv:2111.10023 (2021)

  62. Wang, W., Bao, H., Dong, L., Wei, F.: VLMo: unified vision-language pre-training with mixture-of-modality-experts. arXiv preprint arXiv:2111.02358 (2021)

  63. Wang, Z., Wang, W., Zhu, H., Liu, M., Qin, B., Wei, F.: Distilled dual-encoder model for vision-language understanding. arXiv preprint arXiv:2112.08723 (2021)

  64. Wang, Z., Yu, J., Yu, A.W., Dai, Z., Tsvetkov, Y., Cao, Y.: SimVLM: simple visual language model pretraining with weak supervision. In: ICLR (2021)

  65. Wu, H., et al.: Fashion IQ: a new dataset towards retrieving images by natural language feedback. In: CVPR (2021)

  66. Wu, Y., et al.: Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)

  67. Xu, H., et al.: E2E-VLP: end-to-end vision-language pre-training enhanced by visual learning. In: ACL-IJCNLP (2021)

  68. Yang, X., et al.: Fashion captioning: towards generating accurate descriptions with semantic rewards. In: ECCV (2020)

  69. You, H., et al.: MA-CLIP: towards modality-agnostic contrastive language-image pre-training. OpenReview (2021)

  70. Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: ECCV (2016)

  71. Zellers, R., Bisk, Y., Farhadi, A., Choi, Y.: From recognition to cognition: visual commonsense reasoning. In: CVPR (2019)

  72. Zhang, L., et al.: VLDeformer: learning visual-semantic embeddings by vision-language transformer decomposing. arXiv preprint arXiv:2110.11338 (2021)

  73. Zhang, P., et al.: VinVL: revisiting visual representations in vision-language models. In: CVPR (2021)

  74. Zhang, Z., et al.: UFC-BERT: unifying multi-modal controls for conditional image synthesis. In: NeurIPS (2021)

  75. Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J., Gao, J.: Unified vision-language pre-training for image captioning and VQA. In: AAAI (2020)

  76. Zhu, Y., et al.: Knowledge perceived multi-modal pretraining in e-commerce. In: ACM MM (2021)

  77. Zhuge, M., et al.: Kaleido-BERT: vision-language pre-training on fashion domain. In: CVPR (2021)


Author information

Correspondence to Xiao Han.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 5901 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Han, X., Yu, L., Zhu, X., Zhang, L., Song, YZ., Xiang, T. (2022). FashionViL: Fashion-Focused Vision-and-Language Representation Learning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13695. Springer, Cham. https://doi.org/10.1007/978-3-031-19833-5_37

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-19833-5_37


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19832-8

  • Online ISBN: 978-3-031-19833-5

  • eBook Packages: Computer Science, Computer Science (R0)
